For the Databricks project in the getting started guide, you'll learn how to set up Databricks and connect it to dbt Cloud.
This project will walk you through:
- Setting up a Databricks account
- Loading training data into your Databricks account
- Configuring a SQL endpoint in Databricks
- Connecting dbt Cloud and Databricks
Before starting this tutorial, you will need the following:
- Existing Cloud Provider account (AWS, GCP, Azure).
- Permissions to create an S3 bucket in said account.
Use your existing account or sign up for a Databricks account at Try Databricks. Complete the form with your user information.
For the purpose of this tutorial, you will be selecting AWS as our cloud provider but if you use Azure or GCP internally, please choose one of them. The setup process will be similar.
Check your email to complete the verification process.
After setting up your password, you will be guided to choose a subscription plan. You will need to select either the
Enterpriseplan to access the SQL Compute functionality, required for using the SQL Endpoint for dbt. We have chosen
Premiumfor this tutorial. Click
Continueafter selecting your plan.
Get Startedwhen you come to this below page and then
Confirmafter you validate that you have everything needed.
Now it's time to create your first workspace. A Databricks workspace is an environment for accessing all of your Databricks assets. The workspace organizes objects like notebooks, SQL Endpoints, clusters, etc into one place. Provide the name of your workspace and choose the appropriate AWS region and click Start Quickstart. You might get the checkbox of
I have data in S3 that I want to query with Databricks. You do not need to check this off for the purpose of this tutorial.
By clicking on
Start Quickstart, you will be redirected to AWS and asked to log in if you haven’t already. After logging in, you should see a page similar to this.
If you get a session error and don’t get redirected to this page, do not worry, go back to the Databricks UI and create a workspace from the interface. All you have to do is click create workspaces, choose the quickstart, fill out the form and click Start Quickstart.
There is no need to change any of the pre-filled out fields in the Parameters. Just add in your Databricks password under Databricks Account Credentials. Check off the Acknowledgement and click Create stack.
Afterwards, you should land on the CloudFormation > Stacks page. Once the status becomes
CREATE_COMPLETE, you will be ready to start. This process can take about 5 minutes so feel free to click refresh to refresh the status updates.
Go back to the Databricks tab. You should see that your workspace is ready to use.
Now let’s jump into the workspace. Click on
Openand log into the workspace using the same login as you used to log into the account.
Congrats! You are now ready to start working in the workspace.
Our next step is to load some data to transform. Luckily for us, Databricks makes it really easy for us to upload data.
First we need a SQL endpoint. Find the drop down menu and toggle into the SQL space.
We will be setting up a SQL endpoint now. Select
SQL Endpointsfrom the left hand side console. You will see that a default SQL Endpoint exists.
Click Start on the Starter Endpoint. This will take a few minutes to get the necessary resources spun up.
While you're waiting, download the three CSV files locally that you will need for this tutorial. You can find them here:
Once the SQL Endpoint is up, click on the
Tableon the drop down menu.
Let's load the Jaffle Shop Customers data first. Drop in the
jaffle_shop_customers.csvfile into the UI.
Update the Table Attributes at the top:
- data_catalog = hive_metastore
- database = default
- table = jaffle_shop_customers
- Make sure that the column data types are correct. The way you can do this is by hovering over the datatype icon next to the column name.
- ID = bigint
- FIRST_NAME = string
- LAST_NAME = string
Createon the bottom once you’re done.
Now let’s do the same for
Jaffle Shop Ordersand
Once that's done, make sure you can query the training data. Navigate to the
SQL Editorthrough the left hand menu. This will bring you to a query editor.
Ensure that you can run a
select *from each of the tables with the following code snippets.
select * from default.jaffle_shop_customers
select * from default.jaffle_shop_orders
select * from default.stripe_payments
To ensure any users who might be working on your dbt project has access to your object, run this command.
grant all privileges on schema default to users;
Congratulations! At this point, you have created a Databricks account, loaded training data, and successfully set up a SQL end point to query the data.
Connecting to dbt Cloud
There are two ways to connect dbt Cloud and Databricks. The first option is Partner Connect, which provides a streamlined setup to create your dbt Cloud account from within your new Databricks trial account. The second option is to create your dbt Cloud account separately and build the Databricks connection yourself. If you are looking to get started quickly, we recommend option 1. If you are looking to customize your setup from the very beginning and gain familiarity with the dbt Cloud setup flow, we recommend option 2.
Option 1: Connect dbt Cloud and Databricks with partner connect
In the Databricks workspace, on the left-side console: click on
Select the dbt tile under
Data preparation and transformation.
Nextwhen prompted to
Connect to partner. This action will create a service principal, PAT token for that service principle, and SQL Endpoint for the dbt Cloud account to use. This does mean that you will have two SQL Endpoints at your disposal from the previous step and from using Partner Connect.
Connect to dbt Cloud.
After the new tab loads, you will see a form. If you already created a dbt Cloud account, you will be asked to provide an account name. If you haven't created account, you will be asked to provide an account name and password.
After you have filled out the form and clicked on
Complete Registration, you will be logged into dbt Cloud automatically.
Option 2: Connect dbt Cloud and Databricks manually
Get endpoint and token information
To manually setup dbt Cloud, you will need the SQL Endpoint connection information and to generate a user token. You can find your SQL endpoint connection information by going to the
Databricks UI > SQL > SQL Endpoints > Starter Endpoint > Connection details. Save this information because you will need it later.
To generate a user token for your development credentials in dbt Cloud, click on
Settings on the left side console (while still in the SQL part of the workspace). Click on
Personal Access Token and provide a comment like
dbt Cloud development. Save the token information somewhere because you will need it for the next part.
Create a dbt Cloud account
Let's start this section by creating a dbt Cloud account if you haven't already.
- Navigate to dbt Cloud.
- If you don't have a dbt Cloud account, create a new one, and verify your account via email.
- If you already have a dbt Cloud account, you can create a new project from your existing account:
- Click the gear icon in the top-right, then click Projects.
- Click + New Project.
- You've arrived at the "Setup a New Project" page.
- Type "Analytics" in the dbt Project Name field. You will be able to rename this project later.
- Click Continue.
Connect dbt Cloud to Databricks
Choose Databricks to setup your connection.
For the name, write
Databricksor another simple title.
For Databricks settings, reference your SQL endpoint connection details from step 6 of the previous section for each of the following fields:
- Method will be ODBC
- Hostname comes from Server hostname
- Endpoint comes from the last part of HTTP path after
For your Development Credentials, type:
tokenthat you saved in a previous step.
- You’ll notice that the schema name has been auto created for you. By convention, this is
dbt_<first-initial><last-name>. This is the schema connected directly to your development environment, and it's where your models will be built when running dbt within the Cloud IDE.
Click Test Connection at the bottom. This verifies that dbt Cloud can access your Databricks workspace.
If the connection test succeeds, click Next. If it fails, you may need to check your Databricks settings and credentials.
Initialize your repository and start development
If you used Partner Connect, you can skip to initializing your dbt project as the Partner Connect provides you with a managed repository. Otherwise, you will need to create your managed repository connection.
Setting up a managed repository
dbt Cloud uses Git for version control, but using a managed repository makes this easier. To set up a managed repository:
- Under "Add repository from", select Managed.
- Type a name for your repo such as
- Click Create. It will take a few seconds for your repository to be created and imported.
- Once you see the "Successfully imported repository," click Continue.
Initialize your dbt project
Now that you have a repository configured, you can initialize your project and start development in dbt Cloud:
- Click Develop from the upper left. It might take a few minutes for your project to spin up for the first time as it establishes your git connection, clones your repo, and tests the connection to the warehouse.
- Above the file tree to the left, click Initialize your project. This builds out your folder structure with example models.
- Make your initial commit by clicking Commit. Use the commit message
initial commit. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code.
- Now you should be able to directly query data from your warehouse and execute dbt run. You can try this out now:
- In "Scratchpad 1", delete all text and paste your warehouse-specific code into Scratchpad 1:
select * from `dbt-tutorial.jaffle_shop.customers`
select * from default.jaffle_shop_customers
select * from jaffle_shop.customers
select * from raw.jaffle_shop.customers
- In the command line bar at the bottom, type
dbt runand click Enter. We will explore what happens in the next section of the tutorial.
Congratulations! You have successfully completed the following:
- Set up a Databricks account
- Loaded training data into your Databricks account
- Configured a SQL endpoint in Databricks
- Connected dbt Cloud and Databricks
You can Continue the tutorial by building your first project. You will learn and apply the fundamentals of dbt with models, tests, docs, and deploying your first job!