Use Databricks workflows to run dbt jobs

Introduction

Using Databricks workflows to call the dbt job API is useful for several reasons:

Integration with other ETL processes: If you already run other ETL processes in Databricks, you can use a Databricks workflow to trigger a dbt job after those processes finish.
Use of dbt job features: dbt lets you monitor job progress, manage historical logs and documentation, optimize model timing, and more.
Separation of concerns: Detailed logs for dbt jobs stay in the dbt environment, which keeps the pipeline modular and makes debugging more efficient. You can isolate bugs quickly in dbt while still seeing the overall status in Databricks.
Custom job triggering: Use a Databricks workflow to trigger dbt jobs based on custom conditions or logic that dbt's scheduling feature does not natively support. This gives you more flexibility over when and how your dbt jobs run.
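
As a rough illustration of custom triggering logic, the sketch below decides whether to call the dbt job API based on the results of upstream tasks. The `UpstreamResult` class, the `should_trigger_dbt_job` function, and the row-count threshold are all hypothetical names chosen for this example; they are not part of dbt or Databricks.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class UpstreamResult:
    """Hypothetical summary of one upstream ETL task."""
    task_name: str
    succeeded: bool
    rows_loaded: int


def should_trigger_dbt_job(results: Sequence[UpstreamResult], min_rows: int = 1) -> bool:
    """Return True only if every upstream task succeeded and enough rows arrived."""
    if not results:
        # Nothing ran upstream, so there is nothing new for dbt to transform.
        return False
    all_succeeded = all(r.succeeded for r in results)
    total_rows = sum(r.rows_loaded for r in results)
    return all_succeeded and total_rows >= min_rows


results = [
    UpstreamResult("ingest_orders", succeeded=True, rows_loaded=1200),
    UpstreamResult("ingest_customers", succeeded=True, rows_loaded=0),
]
print(should_trigger_dbt_job(results))
```

In a workflow, a check like this could run before the notebook that calls the dbt job API, so that the dbt job is skipped when the condition is not met.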

Prerequisites

An active Enterprise or Enterprise+ dbt account
An existing, configured dbt deploy job
An active Databricks account with access to the Data Science and Engineering workspace and permission to manage secrets
The Databricks CLI
- Note: You only need to set up authentication. Once you have configured your host and token and can run databricks workspace ls /Users/<someone@example.com>, you can proceed with the rest of this guide.

Set up a Databricks secret scope

Retrieve a **personal access token** or **service account token** from dbt.
Set up a Databricks secret scope, which is used to securely store your dbt API key.
Enter the following commands in your terminal:

# For example, you could name the secret scope "dbt-cloud" and the secret key "api-key".
databricks secrets create-scope --scope <YOUR_SECRET_SCOPE>
databricks secrets put --scope <YOUR_SECRET_SCOPE> --key <YOUR_SECRET_KEY> --string-value "<YOUR_DBT_CLOUD_API_KEY>"

Replace <YOUR_SECRET_SCOPE> and <YOUR_SECRET_KEY> with your own unique identifiers. See the Databricks documentation for more information on secrets.
Replace <YOUR_DBT_CLOUD_API_KEY> with the actual API key value that you copied from dbt in step 1.

Create a Databricks Python notebook

Create a Databricks Python notebook, which executes a Python script that calls the dbt job API.
Write a Python script that uses the requests library to make an HTTP POST request to the dbt job API endpoint with the required parameters. Here's an example script:

import enum
import time
import requests

dbutils.widgets.text("job_id", "Enter the Job ID")
job_id = dbutils.widgets.get("job_id")

account_id = <YOUR_ACCOUNT_ID>
base_url =  "<YOUR_BASE_URL>"
api_key =  dbutils.secrets.get(scope = "<YOUR_SECRET_SCOPE>", key = "<YOUR_SECRET_KEY>")

# These status codes are documented in the dbt API docs
class DbtJobRunStatus(enum.IntEnum):
    QUEUED = 1
    STARTING = 2
    RUNNING = 3
    SUCCESS = 10
    ERROR = 20
    CANCELLED = 30

def _trigger_job() -> int:
    res = requests.post(
        url=f"https://{base_url}/api/v2/accounts/{account_id}/jobs/{job_id}/run/",
        headers={'Authorization': f"Token {api_key}"},
        json={
            # Optionally pass a description that can be viewed within the dbt API.
            # See the API docs for additional parameters that can be passed in,
            # including `schema_override`.
            'cause': "Triggered by Databricks Workflows.",
        }
    )

    try:
        res.raise_for_status()
    except requests.HTTPError:
        # Print only the last four characters so the full token never reaches the logs.
        print(f"API token (last four): ...{api_key[-4:]}")
        raise

    response_payload = res.json()
    return response_payload['data']['id']

def _get_job_run_status(job_run_id):
    res = requests.get(
        url=f"https://{base_url}/api/v2/accounts/{account_id}/runs/{job_run_id}/",
        headers={'Authorization': f"Token {api_key}"},
    )

    res.raise_for_status()
    response_payload = res.json()
    return response_payload['data']['status']

def run():
    job_run_id = _trigger_job()
    print(f"job_run_id = {job_run_id}")   
    while True:
        time.sleep(5)
        status = _get_job_run_status(job_run_id)
        print(DbtJobRunStatus(status))
        if status == DbtJobRunStatus.SUCCESS:
            break
        elif status in (DbtJobRunStatus.ERROR, DbtJobRunStatus.CANCELLED):
            raise Exception(f"dbt job run {job_run_id} ended with status {DbtJobRunStatus(status).name}")

if __name__ == '__main__':
    run()

Replace <YOUR_SECRET_SCOPE> and <YOUR_SECRET_KEY> with the values you used previously.
Replace <YOUR_BASE_URL> and <YOUR_ACCOUNT_ID> with the values for your environment and the Access URL for your region and plan.
- To find these values, navigate to dbt and select Deploy -> Jobs. Select the job you want to run and copy the URL. For example, given the URL https://YOUR_ACCESS_URL/deploy/000000/projects/111111/jobs/222222, valid code would be:

# Your URL is structured https://<YOUR_BASE_URL>/deploy/<YOUR_ACCOUNT_ID>/projects/<YOUR_PROJECT_ID>/jobs/<YOUR_JOB_ID>
account_id = 000000
job_id = 222222
base_url = "cloud.getdbt.com"
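
If you prefer not to copy the values by hand, the URL structure described above can be split programmatically. The following is a minimal sketch that assumes the job URL always follows the `/deploy/<account>/projects/<project>/jobs/<job>` layout shown in this guide; `parse_dbt_job_url` and `build_trigger_url` are hypothetical helper names, not part of any dbt library. The IDs are kept as strings so that leading zeros in the example are preserved.

```python
from urllib.parse import urlparse


def parse_dbt_job_url(job_url: str) -> dict:
    """Split a dbt job URL into base_url, account_id, project_id, and job_id."""
    parsed = urlparse(job_url)
    parts = [p for p in parsed.path.split("/") if p]
    # Expected path layout: deploy/<account_id>/projects/<project_id>/jobs/<job_id>
    layout_ok = (
        len(parts) == 6
        and parts[0] == "deploy"
        and parts[2] == "projects"
        and parts[4] == "jobs"
        and all(parts[i].isdigit() for i in (1, 3, 5))
    )
    if not layout_ok:
        raise ValueError(f"Unexpected dbt job URL format: {job_url}")
    return {
        "base_url": parsed.netloc,
        "account_id": parts[1],
        "project_id": parts[3],
        "job_id": parts[5],
    }


def build_trigger_url(base_url: str, account_id: str, job_id: str) -> str:
    """Build the job trigger endpoint used by the script in this guide."""
    return f"https://{base_url}/api/v2/accounts/{account_id}/jobs/{job_id}/run/"


values = parse_dbt_job_url(
    "https://cloud.getdbt.com/deploy/000000/projects/111111/jobs/222222"
)
print(values)
print(build_trigger_url(values["base_url"], values["account_id"], values["job_id"]))
```

Raising a `ValueError` on an unexpected layout is a design choice for this sketch: it surfaces a mistyped URL before any API call is made.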

Run the notebook. This first run fails because the job_id widget still holds its default placeholder text instead of a job ID, but you should now see a job_id widget at the top of your notebook.
In the widget, enter your job_id from step 4.
Run the notebook again to trigger the dbt job. Your results should look similar to the following:

job_run_id = 123456
DbtJobRunStatus.QUEUED
DbtJobRunStatus.QUEUED
DbtJobRunStatus.QUEUED
DbtJobRunStatus.STARTING
DbtJobRunStatus.RUNNING
DbtJobRunStatus.RUNNING
DbtJobRunStatus.RUNNING
DbtJobRunStatus.RUNNING
DbtJobRunStatus.RUNNING
DbtJobRunStatus.RUNNING
DbtJobRunStatus.RUNNING
DbtJobRunStatus.RUNNING
DbtJobRunStatus.SUCCESS
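
The status lines above come from a polling loop that runs until the job succeeds or fails, with no upper time limit. As one possible variation, the sketch below adds a timeout so that a stuck run does not hold a cluster indefinitely. The function name `wait_for_run`, its parameters, and the default one-hour timeout are assumptions made for this example; the status codes are copied from the DbtJobRunStatus enum in the script.

```python
import time
from typing import Callable

# Status codes copied from the DbtJobRunStatus enum in the example script.
SUCCESS, ERROR, CANCELLED = 10, 20, 30


def wait_for_run(
    get_status: Callable[[], int],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll get_status() until the run succeeds, fails, or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == SUCCESS:
            return status
        if status in (ERROR, CANCELLED):
            raise RuntimeError(f"dbt job run ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"dbt job run still at status {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Simulated run: QUEUED -> STARTING -> RUNNING -> SUCCESS, without real waiting.
fake_statuses = iter([1, 2, 3, 10])
final_status = wait_for_run(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

In the notebook, the status callable could wrap the existing helper, for example `wait_for_run(lambda: _get_job_run_status(job_run_id))`.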

You can cancel the job from dbt if necessary.
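
The script sets the job_id widget's default value to the text "Enter the Job ID", so a run with that default sends an invalid job ID to the API. One way to fail earlier with a clearer message is to validate the widget value before making any request. This is a minimal sketch; `parse_job_id` is a hypothetical helper, and it assumes dbt job IDs are plain non-negative integers, as in the example URL in this guide.

```python
def parse_job_id(raw_value: str) -> int:
    """Convert the widget text to an integer job ID, rejecting anything non-numeric."""
    cleaned = raw_value.strip()
    # isascii() guards against Unicode digit characters that int() cannot parse.
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise ValueError(f"job_id must be a numeric dbt job ID, got {raw_value!r}")
    return int(cleaned)


print(parse_job_id(" 222222 "))

try:
    parse_job_id("Enter the Job ID")
except ValueError as err:
    print(err)
```

In the notebook, this could be applied as `job_id = parse_job_id(dbutils.widgets.get("job_id"))`.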

Configure the workflows to run the dbt jobs

You can set up workflows directly from the notebook or by adding this notebook to one of your existing workflows.

Create a workflow from an existing notebook

Click Schedule on the upper right side of the page.
Click Add a schedule.
Configure the Job name, Schedule, and Cluster.
Add a new parameter called job_id and fill in your job ID. Refer to step 4 in the previous section to find your job ID.
Click Create.
Click Run Now to test the job.

Add the notebook to an existing workflow

Open the existing workflow.
Click Tasks.
Click the "+" icon to add a new task.
Enter the following:

Field	Value
Task name	`<unique_task_name>`
Type	Notebook
Source	Workspace
Path	`</path/to/notebook>`
Cluster	`<your_compute_cluster>`
Parameters	`job_id`: `<your_dbt_job_id>`

Select Save Task.
Click Run Now to test the workflow.

Multiple workflow tasks can be set up using the same notebook by configuring the job_id parameter to point to different dbt jobs.

Using Databricks workflows to access the dbt job API can improve integration of your data pipeline processes and enable scheduling of more complex workflows.
