Set up and connect Databricks
Introduction
For the Databricks project in the getting started guide, you'll learn how to set up Databricks and connect it to dbt Cloud.
This project will walk you through:
- Setting up a Databricks account
- Loading training data into your Databricks account
- Configuring a SQL warehouse (formerly SQL endpoint) in Databricks
- Connecting dbt Cloud and Databricks
Databricks SQL endpoints have been renamed and are now called SQL warehouses.
Prerequisites
Before starting this tutorial, you will need the following:
- An existing cloud provider account (AWS, GCP, or Azure).
- Permissions to create an S3 bucket in that account.
Setting up
Use your existing account or sign up for a Databricks account at Try Databricks. Complete the form with your user information.
For this tutorial, we'll select AWS as the cloud provider, but if you use Azure or GCP internally, choose one of those instead; the setup process will be similar.
Check your email to complete the verification process.
After setting up your password, you will be guided to choose a subscription plan. Select the Premium or Enterprise plan to access the SQL compute functionality required for using the SQL warehouse with dbt. We have chosen Premium for this tutorial. Click Continue after selecting your plan.

Click Get Started when you reach the next page, and then Confirm after you validate that you have everything needed.

Now it's time to create your first workspace. A Databricks workspace is an environment for accessing all of your Databricks assets. The workspace organizes objects like notebooks, SQL warehouses, and clusters into one place. Provide a name for your workspace, choose the appropriate AWS region, and click Start Quickstart. You might see a checkbox labeled "I have data in S3 that I want to query with Databricks"; you do not need to check it for the purposes of this tutorial.

After clicking Start Quickstart, you will be redirected to AWS and asked to log in if you haven't already. After logging in, you should land on the AWS CloudFormation quickstart page.
If you get a session error and aren't redirected to this page, don't worry: go back to the Databricks UI and create a workspace from the interface. Click Create workspaces, choose the quickstart, fill out the form, and click Start Quickstart.
There is no need to change any of the pre-filled fields in the Parameters. Just add your Databricks password under Databricks Account Credentials, check the acknowledgement box, and click Create stack.
Afterwards, you should land on the CloudFormation > Stacks page. Once the status becomes CREATE_COMPLETE, you will be ready to start. This process can take about 5 minutes, so feel free to click refresh for status updates.

Go back to the Databricks tab. You should see that your workspace is ready to use.

Now let's jump into the workspace. Click Open and log into the workspace using the same credentials you used for the account.
Congrats! You are now ready to start working in the workspace.
Loading data
Our next step is to load some data to transform. Luckily, Databricks makes it easy to upload data.

First we need a SQL warehouse. Find the drop-down menu and toggle into the SQL space.
We will set up a SQL warehouse now. Select SQL Warehouses from the left-hand console. You will see that a default SQL warehouse exists. Click Start on the Starter Warehouse; it will take a few minutes to spin up the necessary resources.

While you're waiting, download the three CSV files you will need for this tutorial: `jaffle_shop_customers.csv`, `jaffle_shop_orders.csv`, and `stripe_payments.csv`.
Once the SQL warehouse is up, click Create and then Table from the drop-down menu.

Let's load the Jaffle Shop Customers data first. Drop the `jaffle_shop_customers.csv` file into the UI.

Update the Table Attributes at the top:
- data_catalog = hive_metastore
- database = default
- table = jaffle_shop_customers
- Make sure that the column data types are correct. You can check this by hovering over the data type icon next to the column name.
- ID = bigint
- FIRST_NAME = string
- LAST_NAME = string
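Once the table is created in the next step, you can optionally double-check the inferred schema from the SQL editor. This is a quick sanity check, assuming the default database set in the Table Attributes above:

```sql
-- Inspect column names and data types of the newly created table.
-- Assumes the default database from the Table Attributes above.
describe table default.jaffle_shop_customers;
```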
Click Create at the bottom once you're done.

Now let's do the same for Jaffle Shop Orders and Stripe Payments.

Once that's done, make sure you can query the training data. Navigate to the SQL Editor through the left-hand menu. This will bring you to a query editor. Ensure that you can run a `select *` from each of the tables with the following code snippets:

```sql
select * from default.jaffle_shop_customers
select * from default.jaffle_shop_orders
select * from default.stripe_payments
```

To ensure that any users who might work on your dbt project have access to your objects, run this command:
```sql
grant all privileges on schema default to users;
```
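If you want to confirm that the grant took effect, you can list the privileges on the schema. This is a hedged sketch: the exact output varies by Databricks version, but the statement below follows the SHOW GRANTS ON form:

```sql
-- List the privileges currently granted on the default schema,
-- which should now include the users group.
show grants on schema default;
```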
Congratulations! At this point, you have created a Databricks account, loaded training data, and successfully set up a SQL warehouse to query the data.
Connecting to dbt Cloud
There are two ways to connect dbt Cloud and Databricks. The first option is Partner Connect, which provides a streamlined setup to create your dbt Cloud account from within your new Databricks trial account. The second option is to create your dbt Cloud account separately and build the Databricks connection yourself. If you are looking to get started quickly, we recommend option 1. If you are looking to customize your setup from the very beginning and gain familiarity with the dbt Cloud setup flow, we recommend option 2.
Option 1: Connect dbt Cloud and Databricks with partner connect
In the Databricks workspace, on the left-side console, click Partner Connect.

Select the dbt tile under Data preparation and transformation.

Click Next when prompted to Connect to partner. This action creates a service principal, a PAT token for that service principal, and a SQL warehouse for the dbt Cloud account to use. This means you will have two SQL warehouses at your disposal: one from the previous step and one from Partner Connect.
Click Connect to dbt Cloud.

After the new tab loads, you will see a form. If you already created a dbt Cloud account, you will be asked to provide an account name. If you haven't created an account, you will be asked to provide an account name and password.

After you have filled out the form and clicked Complete Registration, you will be logged into dbt Cloud automatically.
Option 2: Connect dbt Cloud and Databricks manually
Get endpoint and token information
To set up dbt Cloud manually, you will need your SQL warehouse connection information and a user token. You can find the SQL warehouse connection information in the Databricks UI under SQL > SQL Warehouses > Starter Endpoint > Connection details. Save this information because you will need it later.

To generate a user token for your development credentials in dbt Cloud, click Settings on the left-side console (while still in the SQL part of the workspace). Click Personal Access Token and provide a comment like `dbt Cloud development`. Save the token somewhere safe because you will need it for the next part.
Create a dbt Cloud account
Let's start this section by creating a dbt Cloud account if you haven't already.
- Navigate to dbt Cloud.
- If you don't have a dbt Cloud account, create a new one, and verify your account via email.
- If you already have a dbt Cloud account, you can create a new project from your existing account:
- Click the gear icon in the top-right, then click Projects.
- Click + New Project.
- You've arrived at the "Setup a New Project" page.
- Type "Analytics" in the dbt Project Name field. You will be able to rename this project later.
- Click Continue.
Connect dbt Cloud to Databricks
Choose Databricks to set up your connection.
For the name, write Databricks or another simple title.

For the Databricks settings, reference the SQL warehouse connection details you saved in the previous section for each of the following fields:
- Method will be ODBC
- Hostname comes from Server hostname
- Endpoint comes from the last part of the HTTP path after `/endpoints`
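For illustration (with hypothetical placeholder values): if the Server hostname is `dbc-a1b2c3d4-e5f6.cloud.databricks.com` and the HTTP path is `/sql/1.0/endpoints/1a2b3c4d5e6f7a8b`, then Hostname is the full `dbc-a1b2c3d4-e5f6.cloud.databricks.com` value and Endpoint is just `1a2b3c4d5e6f7a8b`.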
For your Development Credentials, enter:

- The User and token that you saved in a previous step.
- You'll notice that the schema name has been auto-created for you. By convention, this is `dbt_<first-initial><last-name>`. This is the schema connected directly to your development environment, and it's where your models will be built when running dbt within the Cloud IDE.
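Once you start building models, this is where they will land. As a hedged example, assuming a user named B. Baggins and the example model that dbt's starter project generates, you could later query a built model like this:

```sql
-- Hypothetical: after an initial dbt run, user B. Baggins would find the
-- starter project's example model built into their dbt_bbaggins dev schema.
select * from dbt_bbaggins.my_first_dbt_model
```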
Click Test Connection at the bottom. This verifies that dbt Cloud can access your Databricks workspace.
If the connection test succeeds, click Next. If it fails, you may need to check your Databricks settings and credentials.
Initialize your repository and start development
When you develop in dbt Cloud, you can leverage Git to version control your code.
To connect to a repository, you can either set up a dbt Cloud-hosted managed repository or directly connect to a supported git provider. Managed repositories are a great way to trial dbt without needing to create a new repository. In the long run, it's better to connect to a supported git provider to use features like automation and continuous integration.
If you used Partner Connect, you can skip to initializing your dbt project as the Partner Connect provides you with a managed repository. Otherwise, you will need to create your repository connection.
Setting up a managed repository
To set up a managed repository:
- Under "Add repository from", select Managed.
- Type a name for your repo such as `bbaggins-dbt-quickstart`.
- Click Create. It will take a few seconds for your repository to be created and imported.
- Once you see the "Successfully imported repository" message, click Continue.
Initialize your dbt project
Now that you have a repository configured, you can initialize your project and start development in dbt Cloud:
- Click Develop from the upper left. It might take a few minutes for your project to spin up for the first time as it establishes your git connection, clones your repo, and tests the connection to the warehouse.
- Above the file tree to the left, click Initialize your project. This builds out your folder structure with example models.
- Make your initial commit by clicking Commit. Use the commit message `initial commit`. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code.
- Now you should be able to directly query data from your warehouse and execute `dbt run`. Paste the following warehouse-specific code into the IDE (use the Databricks query, since that's the warehouse for this guide):
```sql
-- BigQuery
select * from `dbt-tutorial.jaffle_shop.customers`

-- Databricks
select * from default.jaffle_shop_customers

-- Redshift
select * from jaffle_shop.customers

-- Snowflake
select * from raw.jaffle_shop.customers
```
- In the command line bar at the bottom, type `dbt run` and click Enter. We will explore what happens in the next section of the tutorial.
Congratulations! You have successfully completed the following:
- Set up a Databricks account
- Loaded training data into your Databricks account
- Configured a SQL warehouse in Databricks
- Connected dbt Cloud and Databricks
Next steps
You can continue the tutorial by building your first project. You will learn and apply the fundamentals of dbt with models, tests, docs, and deploying your first job!