Quickstart for dbt Cloud and Databricks

In this quickstart guide, you'll learn how to use dbt Cloud with Databricks. It will show you how to:

  • Create a Databricks workspace.
  • Load sample data into your Databricks account.
  • Connect dbt Cloud to Databricks.
  • Take a sample query and turn it into a model in your dbt project. A model in dbt is a select statement.
  • Add tests to your models.
  • Document your models.
  • Schedule a job to run.
Videos for you

You can check out dbt Fundamentals for free if you're interested in course learning with videos.

Prerequisites

  • You have a dbt Cloud account.
  • You have an account with a cloud service provider (such as AWS, GCP, or Azure) and permission to create an S3 bucket in that account. For demonstration purposes, this guide uses AWS as the cloud service provider.

Create a Databricks workspace

  1. Use your existing account or sign up for a Databricks account at Try Databricks. Complete the form with your user information.

    [Image: Sign up for Databricks]
  2. For this tutorial, select AWS as your cloud provider. If your organization uses Azure or GCP, choose that provider instead. The setup process is similar.

  3. Check your email to complete the verification process.

  4. After setting up your password, you will be guided to choose a subscription plan. Select the Premium or Enterprise plan to access the SQL Compute functionality required for using the SQL warehouse for dbt. We have chosen Premium for this tutorial. Click Continue after selecting your plan.

    [Image: Choose Databricks Plan]
  5. Click Get Started when the next page appears, then click Confirm after you verify that you have everything you need.

  6. Now it's time to create your first workspace. A Databricks workspace is an environment for accessing all of your Databricks assets. It organizes objects such as notebooks, SQL warehouses, and clusters in one place. Enter a name for your workspace, choose the appropriate AWS region, and click Start Quickstart. You might see a checkbox labeled I have data in S3 that I want to query with Databricks. You do not need to select it for this tutorial.

    [Image: Setup First Workspace]
  7. After you click Start Quickstart, you are redirected to AWS and asked to log in if you haven’t already. After logging in, you should see a page similar to this.

    [Image: Create AWS resources]
tip

If you get a session error and don’t get redirected to this page, you can go back to the Databricks UI and create a workspace from the interface. All you have to do is click create workspaces, choose the quickstart, fill out the form and click Start Quickstart.

  8. You don't need to change any of the prefilled fields under Parameters. Add your Databricks password under Databricks Account Credentials, select the acknowledgement checkbox, and click Create stack.

    [Image: Parameters]
    [Image: Capabilities]
  9. Go back to the Databricks tab. You should see that your workspace is ready to use.

    [Image: A Databricks Workspace]
  10. Now let’s jump into the workspace. Click Open and log in to the workspace with the same credentials you used to log in to the account.

Load data

  1. Download the three Jaffle Shop sample CSV files (customers, orders, and Stripe payments) that you will need for this guide.

  2. First, you need a SQL warehouse. Find the drop-down menu and switch to the SQL space.

    [Image: SQL space]
  3. Select SQL Warehouses from the left-hand menu. You will see that a default SQL warehouse already exists.

  4. Click Start on the Starter Warehouse. This will take a few minutes to get the necessary resources spun up.

  5. Once the SQL Warehouse is up, click New and then File upload on the dropdown menu.

    [Image: New File Upload Using Databricks SQL]
  6. Let's load the Jaffle Shop Customers data first. Drop the jaffle_shop_customers.csv file into the UI.

    [Image: Databricks Table Loader]
  7. Update the Table Attributes at the top:

    • data_catalog = hive_metastore
    • database = default
    • table = jaffle_shop_customers
    • Make sure that the column data types are correct. To check a column's type, hover over the data type icon next to the column name.
      • ID = bigint
      • FIRST_NAME = string
      • LAST_NAME = string
    [Image: Load jaffle shop customers]
  8. Click Create on the bottom once you’re done.

  9. Now let’s do the same for Jaffle Shop Orders and Stripe Payments.

    [Image: Load jaffle shop orders]
    [Image: Load stripe payments]
  10. Once that's done, make sure you can query the training data. Navigate to the SQL Editor through the left hand menu. This will bring you to a query editor.

  11. Ensure that you can run a select * from each of the tables with the following code snippets.

    select * from default.jaffle_shop_customers
    select * from default.jaffle_shop_orders
    select * from default.stripe_payments
    [Image: Query Check]
  12. To ensure that any users who might be working on your dbt project have access to your objects, run this command.

    grant all privileges on schema default to users;
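
As an optional extra check alongside step 11, a single query can confirm that all three uploads produced rows. This is a sketch rather than part of the original steps. It assumes the tables were created in the default schema with the names used above; the column aliases table_name and row_count are illustrative.

```sql
-- Count the rows loaded into each table.
-- Every row_count should be greater than zero if the uploads succeeded.
select 'jaffle_shop_customers' as table_name, count(*) as row_count
from default.jaffle_shop_customers
union all
select 'jaffle_shop_orders' as table_name, count(*) as row_count
from default.jaffle_shop_orders
union all
select 'stripe_payments' as table_name, count(*) as row_count
from default.stripe_payments;
```

If any table is missing or empty, repeat the file upload for that table before connecting dbt Cloud.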

Connect dbt Cloud to Databricks

There are two ways to connect dbt Cloud to Databricks. The first option is Partner Connect, which provides a streamlined setup to create your dbt Cloud account from within your new Databricks trial account. The second option is to create your dbt Cloud account separately and build the Databricks connection yourself (connect manually). If you want to get started quickly, dbt Labs recommends using Partner Connect. If you want to customize your setup from the very beginning and gain familiarity with the dbt Cloud setup flow, dbt Labs recommends connecting manually.

If you want to use Partner Connect, refer to Connect to dbt Cloud using Partner Connect in the Databricks docs for instructions.

If you want to connect manually, refer to Connect to dbt Cloud manually in the Databricks docs for instructions.

Set up a dbt Cloud managed repository

If you used Partner Connect, you can skip to initializing your dbt project, because Partner Connect provides you with a managed repository. Otherwise, you will need to create your repository connection.

When you develop in dbt Cloud, you can leverage Git to version control your code.

To connect to a repository, you can either set up a dbt Cloud-hosted managed repository or directly connect to a supported git provider. Managed repositories are a great way to trial dbt without needing to create a new repository. In the long run, it's better to connect to a supported git provider to use features like automation and continuous integration.

To set up a managed repository:

  1. Under "Setup a repository", select Managed.
  2. Type a name for your repo, such as bbaggins-dbt-quickstart.
  3. Click Create. It will take a few seconds for your repository to be created and imported.
  4. Once you see the "Successfully imported repository," click Continue.

Initialize your dbt project and start developing

Now that you have a repository configured, you can initialize your project and start development in dbt Cloud:

  1. Click Develop from the upper left. It might take a few minutes for your project to spin up for the first time as it establishes your git connection, clones your repo, and tests the connection to the warehouse.
  2. Above the file tree to the left, click Initialize dbt project. This builds out your folder structure with example models.
  3. Make your initial commit by clicking Commit & Sync. Use the commit message initial commit and click Commit. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code.
  4. You can now directly query data from your warehouse and execute dbt run. You can try this out now:
    • Click + Create new file, add this query to the new file, and click Save as to save the new file:
      select * from default.jaffle_shop_customers
    • In the command line bar at the bottom, enter dbt run and press Enter. You should see a dbt run succeeded message.

Build your first model

  1. Click Develop from the upper left of dbt Cloud. You need to create a new branch since the main branch is set to read-only mode.
  2. Click Create branch. You can name it add-customers-model.
  3. Click the ... next to the Models directory, then select Create file.
  4. Name the file models/customers.sql, then click Create.
  5. Copy the following query into the file and click Save.
    with customers as (

        select
            id as customer_id,
            first_name,
            last_name

        from jaffle_shop_customers

    ),

    orders as (

        select
            id as order_id,
            user_id as customer_id,
            order_date,
            status

        from jaffle_shop_orders

    ),

    customer_orders as (

        select
            customer_id,

            min(order_date) as first_order_date,
            max(order_date) as most_recent_order_date,
            count(order_id) as number_of_orders

        from orders

        group by 1

    ),

    final as (

        select
            customers.customer_id,
            customers.first_name,
            customers.last_name,
            customer_orders.first_order_date,
            customer_orders.most_recent_order_date,
            coalesce(customer_orders.number_of_orders, 0) as number_of_orders

        from customers

        left join customer_orders using (customer_id)

    )

    select * from final
  6. Enter dbt run in the command prompt at the bottom of the screen. You should get a successful run and see the three models.

Later, you can connect your business intelligence (BI) tools to these views and tables so that they read only cleaned-up data rather than raw data.
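
To inspect what the run produced, you can query the new customers relation from the Databricks SQL Editor. The query below is a minimal sketch: the schema name dbt_bbaggins is hypothetical, so replace it with the schema configured in your dbt Cloud development credentials.

```sql
-- dbt_bbaggins is a hypothetical development schema; use your own.
-- Lists the ten customers with the most orders from the customers model.
select
    customer_id,
    first_name,
    last_name,
    first_order_date,
    most_recent_order_date,
    number_of_orders
from dbt_bbaggins.customers
order by number_of_orders desc
limit 10;
```

Customers who have never ordered should appear with number_of_orders equal to 0 and empty order dates, because the model uses a left join and coalesce.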

Change the way your model is materialized

One of the most powerful features of dbt is that you can change the way a model is materialized in your warehouse, simply by changing a configuration value. You can change things between tables and views by changing a keyword rather than writing the data definition language (DDL) to do this behind the scenes.

By default, dbt creates every model as a view. You can override that at the directory level: the configuration in the following steps materializes everything in jaffle_shop as a table, while everything in the example directory is still materialized as a view.

  1. Edit your dbt_project.yml file.

    • Update your project name to:

      dbt_project.yml
      name: 'jaffle_shop'
    • Update your models config block to:

      dbt_project.yml
      models:
        jaffle_shop:
          +materialized: table
          example:
            +materialized: view
    • Click Save.

  2. Enter the dbt run command. Your customers model should now be built as a table!

    info

    To do this, dbt had to first run a drop view statement (or API call on BigQuery), then a create table as statement.

  3. Edit models/customers.sql to override the dbt_project.yml for the customers model only by adding the following snippet to the top, and click Save:

    models/customers.sql
    {{
      config(
        materialized='view'
      )
    }}

    with customers as (

        select
            id as customer_id
            ...

    )

  4. Enter the dbt run command. Your customers model should now build as a view.

  5. Enter the dbt run --full-refresh command for this to take effect in your warehouse.
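
To make the difference concrete, the statements below sketch the kind of DDL that a materialization setting saves you from writing by hand. This is an illustration only: the schema name dbt_bbaggins is hypothetical, the select is abbreviated, and the SQL that dbt actually generates for Databricks may differ in detail.

```sql
-- materialized='view': the model's select is wrapped in a view definition.
create or replace view dbt_bbaggins.customers as
select id as customer_id, first_name, last_name
from default.jaffle_shop_customers;

-- materialized='table': the same select is stored as a physical table.
-- A view and a table cannot share a name, so the view is dropped first.
drop view if exists dbt_bbaggins.customers;

create or replace table dbt_bbaggins.customers as
select id as customer_id, first_name, last_name
from default.jaffle_shop_customers;
```

With dbt, switching between the two is a one-word change to the materialized setting, and dbt handles the drop and create steps for you.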


Delete the example models

You can now delete the files that dbt created when you initialized the project:

  1. Delete the models/example/ directory.

  2. Delete the example: key from your dbt_project.yml file, and any configurations that are listed under it.

    dbt_project.yml
    # before
    models:
      jaffle_shop:
        +materialized: table
        example:
          +materialized: view
    dbt_project.yml
    # after
    models:
      jaffle_shop:
        +materialized: table
  3. Save your changes.


Build models on top of other models

As a best practice in SQL, you should separate logic that cleans up your data from logic that transforms your data. You have already started doing this in the existing query by using common table expressions (CTEs).

Now you can experiment by separating the logic out into separate models and using the ref function to build models on top of other models:

[Image: The DAG we want for our dbt project]
  1. Create a new SQL file, models/stg_customers.sql, with the SQL from the customers CTE in our original query.

  2. Create a second new SQL file, models/stg_orders.sql, with the SQL from the orders CTE in our original query.

    models/stg_customers.sql
    select
        id as customer_id,
        first_name,
        last_name

    from jaffle_shop_customers
    models/stg_orders.sql
    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status

    from jaffle_shop_orders
  3. Edit the SQL in your models/customers.sql file as follows:

    models/customers.sql
    with customers as (

        select * from {{ ref('stg_customers') }}

    ),

    orders as (

        select * from {{ ref('stg_orders') }}

    ),

    customer_orders as (

        select
            customer_id,

            min(order_date) as first_order_date,
            max(order_date) as most_recent_order_date,
            count(order_id) as number_of_orders

        from orders

        group by 1

    ),

    final as (

        select
            customers.customer_id,
            customers.first_name,
            customers.last_name,
            customer_orders.first_order_date,
            customer_orders.most_recent_order_date,
            coalesce(customer_orders.number_of_orders, 0) as number_of_orders

        from customers

        left join customer_orders using (customer_id)

    )

    select * from final

  4. Execute dbt run.

This time, when you performed a dbt run, separate views/tables were created for stg_customers, stg_orders and customers. dbt inferred the order to run these models. Because customers depends on stg_customers and stg_orders, dbt builds customers last. You do not need to explicitly define these dependencies.
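
As a rough illustration of what ref does, dbt replaces each ref call with the full name of the relation it built for that model, and it records the dependency at the same time. The snippet below is a sketch under assumptions: the schema name dbt_bbaggins is hypothetical, and the compiled SQL that dbt produces may quote or qualify the name differently.

```sql
-- What you write in models/customers.sql:
--   select * from {{ ref('stg_customers') }}
--
-- Roughly what dbt sends to the warehouse after compiling
-- (dbt_bbaggins is a hypothetical development schema):
select * from dbt_bbaggins.stg_customers;
```

Because the schema is filled in at compile time, the same model code can build into a development schema and a production schema without changes.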


Add tests to your models

Adding tests to a project helps validate that your models are working correctly.

To add tests to your project:

  1. Create a new YAML file in the models directory, named models/schema.yml.

  2. Add the following contents to the file:

    models/schema.yml
    version: 2

    models:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null

      - name: stg_customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null

      - name: stg_orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
          - name: customer_id
            tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id

  3. Run dbt test, and confirm that all your tests passed.

When you run dbt test, dbt iterates through your YAML files, and constructs a query for each test. Each query will return the number of records that fail the test. If this number is 0, then the test is successful.
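
The queries below sketch the kind of SQL that sits behind three of these tests. They are simplified illustrations, not the exact queries dbt generates, and the schema name dbt_bbaggins is hypothetical. Each one counts failing records, so a result of 0 means the test passes.

```sql
-- unique on customers.customer_id:
-- count the customer_id values that appear more than once.
select count(*) as failures
from (
    select customer_id
    from dbt_bbaggins.customers
    where customer_id is not null
    group by customer_id
    having count(*) > 1
) as duplicated_ids;

-- not_null on customers.customer_id:
-- count the rows where customer_id is missing.
select count(*) as failures
from dbt_bbaggins.customers
where customer_id is null;

-- relationships from stg_orders.customer_id to stg_customers.customer_id:
-- count the orders whose customer_id has no matching customer.
select count(*) as failures
from dbt_bbaggins.stg_orders as orders
left join dbt_bbaggins.stg_customers as customers
    on orders.customer_id = customers.customer_id
where orders.customer_id is not null
  and customers.customer_id is null;
```

Running queries like these by hand in the SQL Editor can help you investigate a failing test, because you can swap count(*) for the offending columns to see the records themselves.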


Document your models

Adding documentation to your project allows you to describe your models in rich detail, and share that information with your team. Here, we're going to add some basic documentation to our project.

  1. Update your models/schema.yml file to include some descriptions, such as those below.

    models/schema.yml
    version: 2

    models:
      - name: customers
        description: One record per customer
        columns:
          - name: customer_id
            description: Primary key
            tests:
              - unique
              - not_null
          - name: first_order_date
            description: NULL when a customer has not yet placed an order.

      - name: stg_customers
        description: This model cleans up customer data
        columns:
          - name: customer_id
            description: Primary key
            tests:
              - unique
              - not_null

      - name: stg_orders
        description: This model cleans up order data
        columns:
          - name: order_id
            description: Primary key
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

  2. Run dbt docs generate to generate the documentation for your project. dbt introspects your project and your warehouse to generate a JSON file with rich documentation about your project.

  3. Click the book icon in the Develop interface to launch documentation in a new tab.


Commit your changes

Now that you've built your customers model, you need to commit the changes you made to the project so that the repository has your latest code.

  1. Click Commit and add a message. For example, "Add customers model, tests, docs."
  2. Click Merge to main to add these changes to the main branch on your repo.

Create a deployment environment

  1. In the upper left, select Deploy, then click Environments.
  2. Click Create Environment.
  3. Name your deployment environment. For example, "Production."
  4. Add a target dataset, for example, "Analytics." dbt will build into this dataset. For some warehouses this will be named "schema."
  5. Click Save.

Create and run a job

Jobs are a set of dbt commands that you want to run on a schedule. For example, dbt run and dbt test.

As the jaffle_shop business gains more customers, and those customers create more orders, you will see more records added to your source data. Because you materialized the customers model as a table, you'll need to periodically rebuild your table to ensure that the data stays up-to-date. This update will happen when you run a job.

  1. After creating your deployment environment, you should be directed to the page for the new environment. If not, select Deploy in the upper left, then click Jobs.
  2. Click Create one and provide a name, for example "Production run", and link to the Environment you just created.
  3. Scroll down to "Execution Settings" and select Generate docs on run.
  4. Under "Commands," add these commands as part of your job if you don't see them:
    • dbt run
    • dbt test
  5. For this exercise, do not set a schedule for your project to run. While your organization's project should run regularly, there's no need to run this example project on a schedule. Scheduling a job is sometimes referred to as deploying a project.
  6. Select Save, then click Run now to run your job.
  7. Click the run and watch its progress under "Run history."
  8. Once the run is complete, click View Documentation to see the docs for your project.
tip

Congratulations 🎉! You've just deployed your first dbt project!
