Skip to main content

Set up and connect Databricks

Introduction

For the Databricks project in the getting started guide, you'll learn how to set up Databricks and connect it to dbt Cloud.

This project will walk you through:

  • Setting up a Databricks account
  • Loading training data into your Databricks account
  • Configuring a SQL endpoint in Databricks
  • Connecting dbt Cloud and Databricks

Prerequisites

Before starting this tutorial, you will need the following:

  • Existing Cloud Provider account (AWS, GCP, Azure).
  • Permissions to create an S3 bucket in said account.

Setting up

  1. Use your existing account or sign up for a Databricks account at Try Databricks. Complete the form with your user information.

    Sign up for Databricks

    Sign up for Databricks

  2. For the purpose of this tutorial, you will be selecting AWS as our cloud provider but if you use Azure or GCP internally, please choose one of them. The setup process will be similar.

  3. Check your email to complete the verification process.

  4. After setting up your password, you will be guided to choose a subscription plan. You will need to select either the Premium or Enterprise plan to access the SQL Compute functionality, required for using the SQL Endpoint for dbt. We have chosen Premium for this tutorial. Click Continue after selecting your plan.

    Choose Databricks Plan

    Choose Databricks Plan

  5. Click on Get Started when you come to this below page and then Confirm after you validate that you have everything needed.

  6. Now it's time to create your first workspace. A Databricks workspace is an environment for accessing all of your Databricks assets. The workspace organizes objects like notebooks, SQL Endpoints, clusters, etc into one place. Provide the name of your workspace and choose the appropriate AWS region and click Start Quickstart. You might get the checkbox of I have data in S3 that I want to query with Databricks. You do not need to check this off for the purpose of this tutorial.

    Setup First Workspace

    Setup First Workspace

  7. By clicking on Start Quickstart, you will be redirected to AWS and asked to log in if you haven’t already. After logging in, you should see a page similar to this.

    Create AWS resources

    Create AWS resources

tip

If you get a session error and don’t get redirected to this page, do not worry, go back to the Databricks UI and create a workspace from the interface. All you have to do is click create workspaces, choose the quickstart, fill out the form and click Start Quickstart.

  1. There is no need to change any of the pre-filled out fields in the Parameters. Just add in your Databricks password under Databricks Account Credentials. Check off the Acknowledgement and click Create stack.

    Parameters

    Parameters

    Capabilities

    Capabilities

  2. Afterwards, you should land on the CloudFormation > Stacks page. Once the status becomes CREATE_COMPLETE, you will be ready to start. This process can take about 5 minutes so feel free to click refresh to refresh the status updates.

    Confirm Status Completion

    Confirm Status Completion

  3. Go back to the Databricks tab. You should see that your workspace is ready to use.

    A Databricks Workspace

    A Databricks Workspace

  4. Now let’s jump into the workspace. Click on Open and log into the workspace using the same login as you used to log into the account.

Congrats! You are now ready to start working in the workspace.

Welcome to the Databricks Workspace

Welcome to the Databricks Workspace

Loading data

Our next step is to load some data to transform. Luckily for us, Databricks makes it really easy for us to upload data.

  1. First we need a SQL endpoint. Find the drop down menu and toggle into the SQL space.

    SQL space

    SQL space

  2. We will be setting up a SQL endpoint now. Select SQL Endpoints from the left hand side console. You will see that a default SQL Endpoint exists.

    SQL Endpoints

    SQL Endpoints

  3. Click Start on the Starter Endpoint. This will take a few minutes to get the necessary resources spun up.

  4. While you're waiting, download the three CSV files locally that you will need for this tutorial. You can find them here:

  5. Once the SQL Endpoint is up, click on the Create and then Table on the drop down menu.

    Create Table Using Databricks SQL

    Create Table Using Databricks SQL

  6. Let's load the Jaffle Shop Customers data first. Drop in the jaffle_shop_customers.csv file into the UI.

    Databricks Table Loader

    Databricks Table Loader

  7. Update the Table Attributes at the top:

    • data_catalog = hive_metastore
    • database = default
    • table = jaffle_shop_customers
    • Make sure that the column data types are correct. The way you can do this is by hovering over the datatype icon next to the column name.
      • ID = bigint
      • FIRST_NAME = string
      • LAST_NAME = string
    Load jaffle shop customers

    Load jaffle shop customers

  8. Click Create on the bottom once you’re done.

  9. Now let’s do the same for Jaffle Shop Orders and Stripe Payments.

    Load jaffle shop orders

    Load jaffle shop orders

    Load stripe payments

    Load stripe payments

  10. Once that's done, make sure you can query the training data. Navigate to the SQL Editor through the left hand menu. This will bring you to a query editor.

  11. Ensure that you can run a select * from each of the tables with the following code snippets.

    select * from default.jaffle_shop_customers
    select * from default.jaffle_shop_orders
    select * from default.stripe.payments
    Query Check

    Query Check

  12. To ensure any users who might be working on your dbt project has access to your object, run this command.

    grant all privileges on schema default to users;

Congratulations! At this point, you have created a Databricks account, loaded training data, and successfully set up a SQL end point to query the data.

Connecting to dbt Cloud

There are two ways to connect dbt Cloud and Databricks. The first option is Partner Connect, which provides a streamlined setup to create your dbt Cloud account from within your new Databricks trial account. The second option is to create your dbt Cloud account separately and build the Databricks connection yourself. If you are looking to get started quickly, we recommend option 1. If you are looking to customize your setup from the very beginning and gain familiarity with the dbt Cloud setup flow, we recommend option 2.

Option 1: Connect dbt Cloud and Databricks with partner connect

  1. In the Databricks workspace, on the left-side console: click on Partner Connect.

    Databricks Partner Connect

    Databricks Partner Connect

  2. Select the dbt tile under Data preparation and transformation.

  3. Click on Next when prompted to Connect to partner. This action will create a service principal, PAT token for that service principle, and SQL Endpoint for the dbt Cloud account to use. This does mean that you will have two SQL Endpoints at your disposal from the previous step and from using Partner Connect.

    Databricks Partner Connect Connect to dbt Cloud

    Databricks Partner Connect Connect to dbt Cloud

  1. Click on Connect to dbt Cloud.

    Databricks Partner Connect Connect to dbt Cloud

    Databricks Partner Connect Connect to dbt Cloud

  2. After the new tab loads, you will see a form. If you already created a dbt Cloud account, you will be asked to provide an account name. If you haven't created account, you will be asked to provide an account name and password.

    Databricks Partner Connect Connect to dbt Cloud

    Databricks Partner Connect Connect to dbt Cloud

  3. After you have filled out the form and clicked on Complete Registration, you will be logged into dbt Cloud automatically.

Option 2: Connect dbt Cloud and Databricks manually

  1. To manually setup dbt Cloud, you will need the SQL Endpoint connection information and to generate a user token. You can find your SQL endpoint connection information by going to the Databricks UI > SQL > SQL Endpoints > Starter Endpoint > Connection details. Save this information because you will need it later.

    Databrick SQL Endpoint Connection Information

    Databrick SQL Endpoint Connection Information

  2. To generate a user token for your development credentials in dbt Cloud, click on Settings on the left side console (while still in the SQL part of the workspace). Click on Personal Access Token and provide a comment like dbt Cloud development. Save the token information somewhere because you will need it for the next part.

    Generate User Token

    Generate User Token

Let's start this section by creating a dbt Cloud account if you haven't already.

  1. Navigate to dbt Cloud.
  2. If you don't have a dbt account, create a new one and verify your account via email.
  3. If you have a dbt Cloud account, you can create a new project from your existing account:
    1. Click hamburger icon, then click Account Settings.
    2. Click New project, then click Begin.
  4. You can now use the Set Up a New Project workflow to set up your first dbt Cloud project!
  5. Type "Analytics" in the dbt Project Name field. You will be able to rename this project later.
  6. Click Continue.
  1. Choose Databricks to setup your connection.

    Databrick Connection

    Databrick Connection

  2. For the name, write Databricks or another simple title.

  3. For Databricks settings, reference your SQL endpoint connection details from step 6 of the previous section for each of the following fields:

    • Method will be ODBC
    • Hostname comes from Server hostname
    • Endpoint comes from the last part of HTTP path after /endpoints
  4. For your Development Credentials, type:

    • User and token that you saved in a previous step.
    • For the schema field, choose a development schema (this will be your default development database to build objects into). We recommend something in the form of dbt_{{ first initial, last name}} like dbt_achen.
  5. Then scroll to the top of the page to test your connection. Once successful, click Continue.

Initialize your repository and start development

If you used Partner Connect, you can skip to initializing your dbt project as the Partner Connect provides you with a managed repository. Otherwise, you will need to create your managed repository connection.

Setting up a managed repository

dbt Cloud uses Git for version control, but using a managed repository makes this easier. To set up a managed repository:

  1. Under "Add repository from", select Managed.
  2. Type a name for your repo such as bbaggins-dbt-quickstart
  3. Click Create. It will take a few seconds for your repository to be created and imported.
  4. Once you see the "Successfully imported repository," click Continue.

Initialize your dbt project

Now that you have a repository configured, you can initialize your project and start development in the dbt Cloud IDE:

  1. Click Start Developing. You can also navigate to the Develop page from hamburger menu in the top left. It might take a few minutes for your project to spin up for the first time, as it established your git connection, clones your repo, and tests the connection to the warehouse.
  2. Above the file tree to the left, click Initialize your project. This builds out your folder structure with example models.
  3. Make your initial commit by clicking Commit. Use the commit message initial commit. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code.
  4. Now you should be able to directly query data from your warehouse and execute dbt run. You can try this out now:
    • In "Scratchpad 1", delete all text and paste your warehouse-specific code into Scratchpad 1:
select * from `dbt-tutorial.jaffle_shop.customers`
  • In the command line bar at the bottom, type dbt run and click Enter. We will explore what happens in the next section of the tutorial.

Congratulations! You have successfully completed the following:

  • Set up a Databricks account
  • Loaded training data into your Databricks account
  • Configured a SQL endpoint in Databricks
  • Connected dbt Cloud and Databricks

Next steps

You can Continue the tutorial by building your first project. You will learn and apply the fundamentals of dbt with models, tests, docs, and deploying your first job!