
Set up and connect Databricks

Introduction

For the Databricks project in the getting started guide, you'll learn how to set up Databricks and connect it to dbt Cloud.

This project will walk you through:

  • Setting up a Databricks account
  • Loading training data into your Databricks account
  • Configuring a SQL warehouse (formerly SQL endpoint) in Databricks
  • Connecting dbt Cloud and Databricks
Note 📌

Databricks SQL endpoints have been renamed and are now called SQL warehouses.

Prerequisites

Before starting this tutorial, you will need the following:

  • An existing cloud provider account (AWS, GCP, or Azure).
  • Permissions to create an S3 bucket in that account.

Setting up

  1. Use your existing account or sign up for a Databricks account at Try Databricks. Complete the form with your user information.

    Sign up for Databricks
  2. This tutorial uses AWS as the cloud provider, but if you use Azure or GCP internally, choose that instead. The setup process will be similar.

  3. Check your email to complete the verification process.

  4. After setting up your password, you will be guided to choose a subscription plan. Select the Premium or Enterprise plan to access the SQL Compute functionality required to use SQL warehouses with dbt. This tutorial uses Premium. Click Continue after selecting your plan.

    Choose Databricks Plan
  5. Click Get Started when you reach that page, then click Confirm after you validate that you have everything you need.

  6. Now it's time to create your first workspace. A Databricks workspace is an environment for accessing all of your Databricks assets; it organizes objects like notebooks, SQL warehouses, and clusters in one place. Provide a name for your workspace, choose the appropriate AWS region, and click Start Quickstart. You might see a checkbox labeled I have data in S3 that I want to query with Databricks. You do not need to check it for this tutorial.

    Setup First Workspace
  7. After you click Start Quickstart, you will be redirected to AWS and asked to log in if you haven’t already. After logging in, you should see a page similar to this.

    Create AWS resources
tip

If you get a session error and are not redirected to this page, don’t worry: go back to the Databricks UI and create a workspace from the interface. Click Create workspace, choose the Quickstart, fill out the form, and click Start Quickstart.

  8. There is no need to change any of the pre-filled fields under Parameters. Just add your Databricks password under Databricks Account Credentials, check the acknowledgement, and click Create stack.

    Parameters
    Capabilities
  9. You should then land on the CloudFormation > Stacks page. Once the status becomes CREATE_COMPLETE, you are ready to start. This process can take about five minutes, so feel free to click the refresh button for status updates.

    Confirm Status Completion
  10. Go back to the Databricks tab. You should see that your workspace is ready to use.

    A Databricks Workspace
  11. Now let’s jump into the workspace. Click Open and log into the workspace with the same credentials you used for the account.

Congrats! You are now ready to start working in the workspace.

Welcome to the Databricks Workspace

Loading data

Our next step is to load some data to transform. Luckily, Databricks makes it easy to upload data.

  1. First, we need a SQL warehouse. Find the drop-down menu and toggle into the SQL space.

    SQL space
  2. We will now set up a SQL warehouse. Select SQL Warehouses from the left-hand console. You will see that a default SQL warehouse exists.

  3. Click Start on the Starter Warehouse. It will take a few minutes for the necessary resources to spin up.

  4. While you're waiting, download the three CSV files you will need for this tutorial: jaffle_shop_customers.csv, jaffle_shop_orders.csv, and stripe_payments.csv.

  5. Once the SQL warehouse is up, click Create and then Table in the drop-down menu.

    Create Table Using Databricks SQL
  6. Let's load the Jaffle Shop Customers data first. Drop the jaffle_shop_customers.csv file into the UI.

    Databricks Table Loader
  7. Update the Table Attributes at the top:

    • data_catalog = hive_metastore
    • database = default
    • table = jaffle_shop_customers
    • Make sure that the column data types are correct. You can check them by hovering over the data type icon next to each column name.
      • ID = bigint
      • FIRST_NAME = string
      • LAST_NAME = string
    Load jaffle shop customers
  8. Click Create at the bottom once you’re done.

  9. Now let’s do the same for Jaffle Shop Orders and Stripe Payments.

    Load jaffle shop orders
    Load stripe payments
  10. Once that's done, make sure you can query the training data. Navigate to the SQL Editor through the left-hand menu. This will bring you to a query editor.

  11. Ensure that you can run a select * on each of the tables with the following snippets.

    select * from default.jaffle_shop_customers
    select * from default.jaffle_shop_orders
    select * from default.stripe_payments
    Query Check
  12. To ensure that anyone working on your dbt project has access to these objects, run the command below. (A few optional sanity checks follow this list.)

    grant all privileges on schema default to users;
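
If you want to double-check the loads and the grant before moving on, here is a minimal sketch of a few optional sanity checks. It assumes the schema and table names used above and that SHOW GRANTS is available in your workspace:

    -- Confirm the column types the loader inferred (ID should be bigint)
    describe table default.jaffle_shop_customers;

    -- Each table should return a non-zero row count
    select count(*) from default.jaffle_shop_customers;
    select count(*) from default.jaffle_shop_orders;
    select count(*) from default.stripe_payments;

    -- Confirm the privileges granted to the users group
    show grants on schema default;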

Congratulations! At this point, you have created a Databricks account, loaded training data, and successfully set up a SQL warehouse to query the data.

Connecting to dbt Cloud

There are two ways to connect dbt Cloud and Databricks. The first option is Partner Connect, which provides a streamlined setup to create your dbt Cloud account from within your new Databricks trial account. The second option is to create your dbt Cloud account separately and build the Databricks connection yourself. If you are looking to get started quickly, we recommend option 1. If you are looking to customize your setup from the very beginning and gain familiarity with the dbt Cloud setup flow, we recommend option 2.

Option 1: Connect dbt Cloud and Databricks with Partner Connect

  1. In the Databricks workspace, on the left-side console, click Partner Connect.

    Databricks Partner Connect
  2. Select the dbt tile under Data preparation and transformation.

  3. Click Next when prompted to Connect to partner. This action creates a service principal, a PAT token for that service principal, and a SQL warehouse for the dbt Cloud account to use. This means you will have two SQL warehouses at your disposal: one from the previous step and one from Partner Connect.

    Databricks Partner Connect Connect to dbt Cloud
  4. Click Connect to dbt Cloud.

    Databricks Partner Connect Connect to dbt Cloud
  5. After the new tab loads, you will see a form. If you already created a dbt Cloud account, you will be asked to provide an account name. If you haven't created an account yet, you will be asked to provide an account name and password.

    Databricks Partner Connect Connect to dbt Cloud
  6. After you have filled out the form and clicked Complete Registration, you will be logged into dbt Cloud automatically.

Option 2: Connect dbt Cloud and Databricks manually

Get endpoint and token information

To set up dbt Cloud manually, you will need your SQL warehouse connection information and a user token. You can find the SQL warehouse connection information by going to the Databricks UI > SQL > SQL warehouses > Starter Warehouse > Connection details. Save this information; you will need it later.

Databricks SQL Endpoint Connection Information

To generate a user token for your development credentials in dbt Cloud, click Settings in the left-side console (while still in the SQL part of the workspace). Click Personal Access Token and provide a comment like dbt Cloud development. Save the token somewhere safe because you will need it for the next part.

Generate User Token

Create a dbt Cloud account

Let's start this section by creating a dbt Cloud account if you haven't already.

  1. Navigate to dbt Cloud.
  2. If you don't have a dbt Cloud account, create a new one, and verify your account via email.
  3. If you already have a dbt Cloud account, you can create a new project from your existing account:
    1. Click the gear icon in the top-right, then click Projects.
    2. Click + New Project.
  4. You've arrived at the "Setup a New Project" page.
  5. Type "Analytics" in the dbt Project Name field. You will be able to rename this project later.
  6. Click Continue.

Connect dbt Cloud to Databricks

  1. Choose Databricks to set up your connection.

    dbt Cloud - Choose Databricks Connection
  2. For the name, write Databricks or another simple title.

  3. For Databricks settings, reference the SQL warehouse connection details you saved earlier for each of the following fields:

    • Method will be ODBC
    • Hostname comes from Server hostname
    • Endpoint comes from the last part of the HTTP path, after /endpoints (for example, with a hypothetical HTTP path of /sql/1.0/endpoints/1234567890abcdef, the Endpoint would be 1234567890abcdef)
      dbt Cloud - Databricks Workspace Settings
  4. For your Development Credentials, type:

    • The user and token that you saved in a previous step.
    • You’ll notice that the schema name has been auto-created for you. By convention, this is dbt_<first-initial><last-name> (for example, dbt_bbaggins). This is the schema connected directly to your development environment, and it's where your models will be built when running dbt within the Cloud IDE.
  5. Click Test Connection at the bottom. This verifies that dbt Cloud can access your Databricks workspace.

  6. If the connection test succeeds, click Next. If it fails, you may need to check your Databricks settings and credentials.
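
Once the connection test passes, dbt Cloud will build your development models into the schema from step 4. If you later want to confirm this from the Databricks side, a quick check in the SQL editor looks like the sketch below (dbt_bbaggins is only an illustration of the dbt_<first-initial><last-name> convention):

    -- After your first dbt run, your development schema
    -- (for example, dbt_bbaggins) should appear in this list
    show schemas;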

Initialize your repository and start development

When you develop in dbt Cloud, you can leverage Git to version control your code.

To connect to a repository, you can either set up a dbt Cloud-hosted managed repository or directly connect to a supported git provider. Managed repositories are a great way to trial dbt without needing to create a new repository. In the long run, it's better to connect to a supported git provider to use features like automation and continuous integration.

If you used Partner Connect, you can skip to initializing your dbt project because Partner Connect provides you with a managed repository. Otherwise, you will need to create your repository connection.

Setting up a managed repository

To set up a managed repository:

  1. Under "Add repository from", select Managed.
  2. Type a name for your repo, such as bbaggins-dbt-quickstart.
  3. Click Create. It will take a few seconds for your repository to be created and imported.
  4. Once you see the "Successfully imported repository" message, click Continue.

Initialize your dbt project

Now that you have a repository configured, you can initialize your project and start development in dbt Cloud:

  1. Click Develop from the upper left. It might take a few minutes for your project to spin up for the first time as it establishes your git connection, clones your repo, and tests the connection to the warehouse.
  2. Above the file tree to the left, click Initialize your project. This builds out your folder structure with example models (sketched after this list).
  3. Make your initial commit by clicking Commit. Use the commit message initial commit. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code.
  4. Now you should be able to directly query data from your warehouse and execute dbt run. Paste the following warehouse-specific code into the IDE:
select * from default.jaffle_shop_customers
  • In the command line bar at the bottom, type dbt run and press Enter. We will explore what happens in the next section of the tutorial.
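
If you're curious what those example models contain before moving on, the starter project's my_first_dbt_model.sql looks roughly like the sketch below (the exact contents can vary by dbt version):

    -- models/example/my_first_dbt_model.sql (sketch)
    {{ config(materialized='table') }}

    with source_data as (
        select 1 as id
        union all
        select null as id
    )

    select * from source_data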

Congratulations! You have successfully completed the following:

  • Set up a Databricks account
  • Loaded training data into your Databricks account
  • Configured a SQL warehouse in Databricks
  • Connected dbt Cloud and Databricks

Next steps

You can continue the tutorial by building your first project. You will learn and apply the fundamentals of dbt with models, tests, docs, and deploying your first job!
