Databricks configurations

Configuring tables

When materializing a model as a table, you may include several optional configs that are specific to the dbt-databricks plugin, in addition to the standard model configs.

dbt-databricks v1.9 adds support for the table_format: iceberg config. Try it now on the dbt Latest release track. All other table configurations were also supported in 1.8.

Option	Description	Required?	Model support	Example
table_format	Whether or not to provision Iceberg compatibility for the materialization	Optional	SQL, Python	`iceberg`
file_format†	The file format to use when creating tables (`parquet`, `delta`, `hudi`, `csv`, `json`, `text`, `jdbc`, `orc`, `hive` or `libsvm`).	Optional	SQL, Python	`delta`
location_root	The created table uses the specified directory to store its data. The table alias is appended to it.	Optional	SQL, Python	`/mnt/root`
include_full_name_in_path	Whether to use the full table path to qualify the location root. If this is set, the database, schema, and table alias are all appended to the location root.	Optional	SQL, Python	`true`
partition_by	Partition the created table by the specified columns. A directory is created for each partition.	Optional	SQL, Python	`date_day`
liquid_clustered_by^{^}	Cluster the created table by the specified columns. Clustering method is based on Delta's Liquid Clustering feature. Available since dbt-databricks 1.6.2.	Optional	SQL, Python	`date_day`
auto_liquid_cluster+	The created table is automatically clustered by Databricks. Available since dbt-databricks 1.10.0	Optional	SQL, Python	`auto_liquid_cluster: true`
clustered_by	Each partition in the created table will be split into a fixed number of buckets by the specified columns.	Optional	SQL, Python	`country_code`
buckets	The number of buckets to create while clustering	Required if `clustered_by` is specified	SQL, Python	`8`
tblproperties	Tblproperties to be set on the created table	Optional	SQL, Python*	`{'this.is.my.key': 12}`
databricks_tags	Tags to be set on the created table	Optional	SQL‡, Python‡	`{'my_tag': 'my_value'}`
compression	Set the compression algorithm.	Optional	SQL, Python	`zstd`


* We do not yet have a PySpark API to set tblproperties at table creation, so this feature is primarily to allow users to annotate their Python-derived tables with tblproperties.

† When table_format is iceberg, file_format must be delta.

‡ databricks_tags are applied via ALTER statements. Tags cannot be removed via dbt-databricks once applied. To remove tags, use Databricks directly or a post-hook. Starting in dbt-databricks v1.12, databricks_tags set at multiple config hierarchy levels merge additively instead of the lower (more specific) level fully replacing the higher one.

^{^} When liquid_clustered_by is enabled, dbt-databricks issues an OPTIMIZE (Liquid Clustering) operation after each run. To disable this behavior, set the variable DATABRICKS_SKIP_OPTIMIZE=true, which can be passed into the dbt run command (dbt run --vars "{'databricks_skip_optimize': true}") or set as an environment variable. See issue #802.

+ Do not use liquid_clustered_by and auto_liquid_cluster on the same model.
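
The constraints stated in the table and its footnotes can be checked before a run. The following is a minimal sketch, not part of dbt-databricks: `validate_table_config` is a hypothetical helper, and it assumes that an unset `file_format` is acceptable alongside `table_format: iceberg`.

```python
from typing import Any


def validate_table_config(config: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a table config (hypothetical helper)."""
    problems: list[str] = []

    # Footnote †: Iceberg compatibility requires the Delta file format.
    # Assumption: an unset file_format is treated as acceptable here.
    if config.get("table_format") == "iceberg":
        file_format = config.get("file_format")
        if file_format is not None and file_format != "delta":
            problems.append("table_format 'iceberg' requires file_format 'delta'")

    # The table lists buckets as required whenever clustered_by is specified.
    if config.get("clustered_by") and config.get("buckets") is None:
        problems.append("buckets is required when clustered_by is specified")

    # Footnote +: the two liquid clustering options must not be combined.
    if config.get("liquid_clustered_by") and config.get("auto_liquid_cluster"):
        problems.append("liquid_clustered_by and auto_liquid_cluster cannot both be set")

    return problems
```

For example, `validate_table_config({"clustered_by": "country_code"})` reports the missing `buckets` value, while a config with `clustered_by` and `buckets: 8` returns an empty list.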

In dbt-databricks v1.10, there are several new model configuration options gated behind the use_materialization_v2 flag. For details, see the documentation of Databricks behavior flags.

Python submission methods

Available in versions 1.9 or higher

In dbt-databricks v1.9 (try it now in the dbt Latest release track), you can use these four options for submission_method:

all_purpose_cluster: Executes the Python model either directly using the Command API or by uploading a notebook and creating a one-off job run
job_cluster: Creates a new job cluster to execute an uploaded notebook as a one-off job run
serverless_cluster: Uses a serverless cluster to execute an uploaded notebook as a one-off job run
workflow_job: Creates/updates a reusable workflow and uploaded notebook, for execution on all-purpose, job, or serverless clusters.
caution
This approach gives you maximum flexibility, but will create persistent artifacts in Databricks (the workflow) that users could run outside of dbt.

We are currently in a transitional period where there is a disconnect between the old submission methods (which were grouped by compute) and the logically distinct submission methods (command, job run, workflow).

As such, the supported config matrix is somewhat complicated:

Config	Use	Default	`all_purpose_cluster`*	`job_cluster`	`serverless_cluster`	`workflow_job`
`create_notebook`	if false, use Command API, otherwise upload notebook and use job run	`false`	✅	❌	❌	❌
`timeout`	maximum time to wait for command/job to run	`0` (No timeout)	✅	✅	✅	✅
`job_cluster_config`	configures a new cluster for running the model	`{}`	❌	✅	❌	✅
`access_control_list`	directly configures access control for the job	`{}`	✅	✅	✅	✅
`packages`	list of packages to install on the executing cluster	`[]`	✅	✅	✅	✅
`index_url`	url to install `packages` from	`None` (uses pypi)	✅	✅	✅	✅
`additional_libs`	directly configures libraries	`[]`	✅	✅	✅	✅
`python_job_config`	additional configuration for jobs/workflows (see table below)	`{}`	✅	✅	✅	✅
`cluster_id`	id of existing all purpose cluster to execute against	`None`	✅	❌	❌	✅
`http_path`	path to existing all purpose cluster to execute against	`None`	✅	❌	❌	❌


* Only timeout and cluster_id/http_path are supported when create_notebook is false
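
Because the matrix is easy to misread, it can help to encode it as data and look up which configs a given submission method ignores. This sketch transcribes the table above; `unsupported_configs` and the constant names are hypothetical and are not part of the adapter.

```python
# Support matrix transcribed from the table above: config -> supporting submission methods.
ALL_METHODS = {"all_purpose_cluster", "job_cluster", "serverless_cluster", "workflow_job"}
SUPPORT = {
    "create_notebook": {"all_purpose_cluster"},
    "timeout": ALL_METHODS,
    "job_cluster_config": {"job_cluster", "workflow_job"},
    "access_control_list": ALL_METHODS,
    "packages": ALL_METHODS,
    "index_url": ALL_METHODS,
    "additional_libs": ALL_METHODS,
    "python_job_config": ALL_METHODS,
    "cluster_id": {"all_purpose_cluster", "workflow_job"},
    "http_path": {"all_purpose_cluster"},
}
# Footnote *: when create_notebook is false, only these configs are supported.
COMMAND_API_ONLY = {"timeout", "cluster_id", "http_path"}


def unsupported_configs(method: str, config: dict) -> list[str]:
    """List the config keys that the matrix marks as unsupported for this method."""
    create_notebook = bool(config.get("create_notebook", False))
    command_api = method == "all_purpose_cluster" and not create_notebook
    flagged = []
    for key in config:
        if key not in SUPPORT:
            continue  # not covered by the matrix
        if method not in SUPPORT[key]:
            flagged.append(key)
        elif command_api and key not in COMMAND_API_ONLY and key != "create_notebook":
            flagged.append(key)
    return sorted(flagged)
```

With this encoding, `unsupported_configs("job_cluster", {"cluster_id": "abc", "timeout": 60})` flags only `cluster_id`, and `packages` is flagged for `all_purpose_cluster` unless `create_notebook` is true.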

With the introduction of the workflow_job submission method, we chose to segregate further configuration of the python model submission under a top level configuration named python_job_config. This keeps configuration options for jobs and workflows namespaced in such a way that they do not interfere with other model config, allowing us to be much more flexible with what is supported for job execution.

The support matrix for this feature is divided into workflow_job and all others (assuming all_purpose_cluster with create_notebook==true). Each config option listed must be nested under python_job_config:

Config	Use	Default	`workflow_job`	All others
`name`	The name to give (or used to look up) the created workflow	`None`	✅	❌
`grants`	A simplified way to specify access control for the workflow	`{}`	✅	✅
`existing_job_id`	Id to use to look up the created workflow (in place of `name`)	`None`	✅	❌
`post_hook_tasks`	Tasks to include after the model notebook execution	`[]`	✅	❌
`additional_task_settings`	Additional task config to include in the model task	`{}`	✅	❌
Other job run settings	Config will be copied into the request, outside of the model task	`None`	❌	✅
Other workflow settings	Config will be copied into the request, outside of the model task	`None`	✅	❌


This example uses the new configuration options in the previous table:

schema.yml

models:
  - name: my_model
    config:
      submission_method: workflow_job

      # Define a job cluster to create for running this workflow
      # Alternatively, specify cluster_id to use an existing cluster, or provide neither to use a serverless cluster
      job_cluster_config:
        spark_version: "15.3.x-scala2.12"
        node_type_id: "rd-fleet.2xlarge"
        runtime_engine: "{{ var('job_cluster_defaults.runtime_engine') }}"
        data_security_mode: "{{ var('job_cluster_defaults.data_security_mode') }}"
        autoscale: { "min_workers": 1, "max_workers": 4 }

      python_job_config:
        # These settings are passed in, as is, to the request
        email_notifications: { on_failure: ["me@example.com"] }
        max_retries: 2

        name: my_workflow_name

        # Override settings for your model's dbt task. For instance, you can
        # change the task key
        additional_task_settings: { "task_key": "my_dbt_task" }

        # Define tasks to run before/after the model
        # This example assumes you have already uploaded a notebook to /my_notebook_path to perform optimize and vacuum
        post_hook_tasks:
          [
            {
              "depends_on": [{ "task_key": "my_dbt_task" }],
              "task_key": "OPTIMIZE_AND_VACUUM",
              "notebook_task":
                { "notebook_path": "/my_notebook_path", "source": "WORKSPACE" },
            },
          ]

        # Simplified structure, rather than having to specify permission separately for each user
        grants:
          view: [{ "group_name": "marketing-team" }]
          run: [{ "user_name": "other_user@example.com" }]
          manage: []

Configuring columns

Available in versions 1.10 or higher

When materializing models of various types, you may include several optional column-level configs that are specific to the dbt-databricks plugin, in addition to the standard column configs. Support for column tags and column masks was added in dbt-databricks v1.10.4.

Option	Description	Required?	Model support	Materialization support	Example
databricks_tags	Tags to be set on individual columns	Optional	SQL†, Python†	Table, Incremental, Materialized View, Streaming Table	`{'data_classification': 'pii'}`
column_mask	Column mask configuration for dynamic data masking. Accepts `function` and optional `using_columns` properties*	Optional	SQL, Python	Table, Incremental, Streaming Table	`{'function': 'my_catalog.my_schema.mask_email'}`


* using_columns supports all parameter types listed in Databricks column mask parameters.

† databricks_tags are applied via ALTER statements. Tags cannot be removed via dbt-databricks once applied. To remove tags, use Databricks directly or a post-hook. Starting in dbt-databricks v1.12, databricks_tags set at multiple config hierarchy levels merge additively instead of the lower (more specific) level fully replacing the higher one.
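
The difference between the two tag behaviors described in the footnote can be illustrated with plain dictionaries. This is a sketch of the semantics only, not adapter code: `resolve_tags` is a hypothetical name, an empty level is treated as unset, and on a key conflict the more specific level is assumed to win.

```python
def resolve_tags(levels: list[dict[str, str]], additive: bool) -> dict[str, str]:
    """Resolve tags from config levels ordered least specific to most specific."""
    if not additive:
        # Earlier behavior per the footnote: the most specific level that sets
        # tags fully replaces the higher levels.
        for level in reversed(levels):
            if level:
                return dict(level)
        return {}
    merged: dict[str, str] = {}
    for level in levels:
        # Assumption: more specific levels win when the same key appears twice.
        merged.update(level)
    return merged


project_tags = {"team": "data", "env": "prod"}
model_tags = {"env": "dev"}
replaced = resolve_tags([project_tags, model_tags], additive=False)
merged = resolve_tags([project_tags, model_tags], additive=True)
```

Here `replaced` keeps only the model-level `env` tag, whereas `merged` also retains the project-level `team` tag.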

This example uses the column-level configurations in the previous table:

schema.yml

models:
  - name: customers
    columns:
      - name: customer_id
        databricks_tags:
          data_classification: "public"
      - name: email
        databricks_tags:
          data_classification: "pii"
        column_mask:
          function: my_catalog.my_schema.mask_email
          using_columns: "customer_id, 'literal string'"

Setting row filters

Available in versions 1.12 or higher

You can set row_filter to apply a Unity Catalog row filter to a model, restricting which rows a query returns based on a SQL UDF. dbt applies the filter with a WITH ROW FILTER clause when it creates the relation, and emits ALTER ... SET ROW FILTER / ALTER ... DROP ROW FILTER to add, update, or remove the filter on subsequent runs.

row_filter is an optional model-level config. When you set it, both of the following properties are required:

Property	Description	Required?	Example
function	The row-filter UDF to apply. Provide either an unqualified name (dbt qualifies it with the model's catalog and schema) or a fully qualified `catalog.schema.function`. dbt rejects a two-part schema.function name as ambiguous.	Yes	`region_filter`
columns	The columns passed as arguments to the filter function. Can be a single string or a list. Required when `function` is set.	Yes	`[region]`


Row filters are supported on the table, incremental, materialized_view, and streaming_table materializations. They are not supported on regular views or on Hive Metastore relations. Configuring row_filter on either raises a compilation error.
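
The naming rules for `function` and the two accepted shapes of `columns` can be sketched as follows. Both helpers are hypothetical illustrations of the rules in the table above, and the sketch assumes identifiers do not contain quoted dots.

```python
from typing import Union


def qualify_row_filter_function(function: str, catalog: str, schema: str) -> str:
    """Return a three-part function name following the rules in the table above."""
    parts = function.split(".")  # assumption: no quoted identifiers containing dots
    if len(parts) == 1:
        # Unqualified: qualify with the model's catalog and schema.
        return f"{catalog}.{schema}.{function}"
    if len(parts) == 3:
        # Already fully qualified as catalog.schema.function.
        return function
    # A two-part schema.function name is rejected as ambiguous.
    raise ValueError(f"ambiguous row filter function name: {function}")


def normalize_columns(columns: Union[str, list[str]]) -> list[str]:
    """Accept a single column name or a list of names and return a list."""
    return [columns] if isinstance(columns, str) else list(columns)
```

For a model in catalog `main` and schema `sales` (illustrative names), `region_filter` resolves to `main.sales.region_filter`, while `my_schema.region_filter` raises an error.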

This example applies a row filter to a model:

schema.yml

models:
  - name: orders
    config:
      row_filter:
        function: my_catalog.my_schema.region_filter
        columns: [region]

Incremental models

Available in versions 1.9 or higher

Breaking change in v1.11.0

dbt-databricks v1.11.0 requires Databricks Runtime 12.2 LTS or higher for incremental models.

This version introduces a fix for column order mismatches in incremental models by using Databricks' INSERT BY NAME syntax (available since DBR 12.2). This prevents data corruption that could occur when column order changed in models using on_schema_change: sync_all_columns.

If you're using an older runtime:

Pin your dbt-databricks version to 1.10.x
Or upgrade to DBR 12.2 LTS or higher

This breaking change affects all incremental strategies: append, insert_overwrite, replace_where, delete+insert, and merge (via intermediate table creation).

For more details on v1.11.0 changes, see the dbt-databricks v1.11.0 changelog.

The dbt-databricks plugin leans heavily on the incremental_strategy config. This config tells the incremental materialization how to build models in runs beyond their first. It can be set to one of six values:

append: Insert new records without updating or overwriting any existing data.
insert_overwrite: If partition_by is specified, overwrite partitions in the table with new data. If no partition_by is specified, overwrite the entire table with new data.
merge (default; Delta and Hudi file format only): Match records based on a unique_key, updating old records, and inserting new ones. (If no unique_key is specified, all new data is inserted, similar to append.)
replace_where (Delta file format only): Match records based on incremental_predicates, replacing all records that match the predicates from the existing table with records matching the predicates from the new data. (If no incremental_predicates are specified, all new data is inserted, similar to append.)
delete+insert (Delta file format only, available in v1.11+): Match records based on a required unique_key, delete matching records, and insert new records. Optionally filter using incremental_predicates.
microbatch (Delta file format only): Implements the microbatch strategy using replace_where with predicates generated based on event_time.

Each of these strategies has its pros and cons, which we'll discuss below. As with any model config, incremental_strategy may be specified in dbt_project.yml or within a model file's config() block.
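
The file format and unique_key requirements in the list above can be captured in a small lookup. This is a sketch for illustration; `check_incremental_config` is a hypothetical helper, and a value of `None` for `file_formats` means only that the list states no restriction.

```python
from typing import Optional

# Requirements transcribed from the strategy list above.
# file_formats=None means the list states no file format restriction.
STRATEGIES = {
    "append": {"file_formats": None, "needs_unique_key": False},
    "insert_overwrite": {"file_formats": None, "needs_unique_key": False},
    "merge": {"file_formats": {"delta", "hudi"}, "needs_unique_key": False},
    "replace_where": {"file_formats": {"delta"}, "needs_unique_key": False},
    "delete+insert": {"file_formats": {"delta"}, "needs_unique_key": True},
    "microbatch": {"file_formats": {"delta"}, "needs_unique_key": False},
}


def check_incremental_config(
    file_format: str,
    strategy: str = "merge",  # merge is the documented default strategy
    unique_key: Optional[str] = None,
) -> list[str]:
    """Return the requirement violations for a strategy/file format combination."""
    rules = STRATEGIES.get(strategy)
    if rules is None:
        return [f"unknown incremental_strategy: {strategy}"]
    problems = []
    allowed = rules["file_formats"]
    if allowed is not None and file_format not in allowed:
        problems.append(f"{strategy} requires file_format in {sorted(allowed)}")
    if rules["needs_unique_key"] and not unique_key:
        problems.append(f"{strategy} requires unique_key")
    return problems
```

For instance, `check_incremental_config("parquet")` reports that the default merge strategy needs Delta or Hudi, and `check_incremental_config("delta", "delete+insert")` reports the missing unique_key.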

The `append` strategy

Following the append strategy, dbt will perform an insert into statement with all new data. The appeal of this strategy is that it is straightforward and functional across all platforms, file types, connection methods, and Apache Spark versions. However, this strategy cannot update, overwrite, or delete existing data, so it is likely to insert duplicate records for many data sources.

Source code
Run code

databricks_incremental.sql

{{ config(
    materialized='incremental',
    incremental_strategy='append',
) }}

--  All rows returned by this query will be appended to the existing table

select * from {{ ref('events') }}
{% if is_incremental() %}
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}

databricks_incremental.sql

create temporary view databricks_incremental__dbt_tmp as

    select * from analytics.events

    where event_ts > (select max(event_ts) from analytics.databricks_incremental)

;

insert into table analytics.databricks_incremental
    select `date_day`, `users` from databricks_incremental__dbt_tmp

The `insert_overwrite` strategy

The insert_overwrite strategy updates data in a table by replacing existing records instead of just adding new ones. This strategy is most effective when specified alongside a partition_by or liquid_clustered_by clause in your model config, which helps identify the specific partitions or clusters affected by your query. dbt will run an atomic insert into ... replace on statement that dynamically replaces all partitions/clusters included in your query, instead of rebuilding the entire table.

Important! Be sure to re-select all of the relevant data for a partition or cluster when using this incremental strategy.

When using liquid_clustered_by, the replace on keys used will be the same as the liquid_clustered_by keys (same as partition_by behavior).

When you set use_replace_on_for_insert_overwrite to true (in SQL warehouses or when using cluster computes), dbt dynamically overwrites partitions and replaces only the partitions or clusters returned by your model query. dbt runs a partitionOverwriteMode='dynamic' insert overwrite statement, which helps reduce unnecessary overwrites and improves performance.

When you set use_replace_on_for_insert_overwrite to false in SQL warehouses, dbt truncates (empties) the entire table before inserting new data. This replaces all rows in the table each time the model runs, which can increase run time and cost for large datasets.

If you don't specify partition_by or liquid_clustered_by, then the insert_overwrite strategy will atomically replace all contents of the table, overriding all existing data with only the new records. The column schema of the table remains the same, however. This can be desirable in some limited circumstances, since it minimizes downtime while the table contents are overwritten. The operation is comparable to running truncate and insert on other databases. For atomic replacement of Delta-formatted tables, use the table materialization (which runs create or replace) instead.
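
The effect of dynamic overwriting can be modeled on in-memory rows. This sketch only illustrates the semantics described above and does not reflect how the adapter issues SQL; `insert_overwrite` is a hypothetical function operating on lists of dictionaries.

```python
from typing import Optional


def insert_overwrite(
    existing: list[dict], new_rows: list[dict], partition_by: Optional[str] = None
) -> list[dict]:
    """Model insert_overwrite on in-memory rows."""
    if partition_by is None:
        # No partitioning: the whole table is replaced by the new rows.
        return list(new_rows)
    # Only partitions present in the new data are dropped and rewritten.
    touched = {row[partition_by] for row in new_rows}
    kept = [row for row in existing if row[partition_by] not in touched]
    return kept + list(new_rows)


existing = [
    {"date_day": "2024-01-01", "users": 5},
    {"date_day": "2024-01-02", "users": 7},
]
new_rows = [{"date_day": "2024-01-02", "users": 9}]
result = insert_overwrite(existing, new_rows, "date_day")
```

Every existing row in a touched partition is discarded, which is why the model query must re-select all relevant data for each partition it returns.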

Source code
Run code

databricks_incremental.sql

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by=['date_day'],
    file_format='parquet'
) }}

/*
  Every partition returned by this query will be overwritten
  when this model runs
*/

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    date_day,
    count(*) as users

from new_events
group by 1

databricks_incremental.sql

create temporary view databricks_incremental__dbt_tmp as

    with new_events as (

        select * from analytics.events

        where date_day >= date_add(current_date, -1)

    )

    select
        date_day,
        count(*) as users

    from new_events
    group by 1

;

insert overwrite table analytics.databricks_incremental
    partition (date_day)
    select `date_day`, `users` from databricks_incremental__dbt_tmp

The `merge` strategy

The merge incremental strategy requires:

file_format: delta or hudi
Databricks Runtime 5.1 and above for delta file format
Apache Spark for hudi file format

The Databricks adapter will run an atomic merge statement similar to the default merge behavior on Snowflake and BigQuery. If a unique_key is specified (recommended), dbt will update old records with values from new records that match on the key column. If a unique_key is not specified, dbt will forgo match criteria and simply insert all new records (similar to append strategy).

Specifying merge as the incremental strategy is optional since it's the default strategy used when none is specified.
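
The matching behavior described above can be modeled on in-memory rows. This is a simplified sketch of the semantics, not the statement dbt runs: `merge_rows` is a hypothetical function, and it assumes unique_key values are unique within the existing rows.

```python
from typing import Optional


def merge_rows(
    existing: list[dict], new_rows: list[dict], unique_key: Optional[str] = None
) -> list[dict]:
    """Model the default merge behavior on in-memory rows."""
    if unique_key is None:
        # Without a unique_key there is no match criterion: behave like append.
        return existing + new_rows
    # Assumption: unique_key values are unique within the existing rows.
    by_key = {row[unique_key]: row for row in existing}
    for row in new_rows:
        # Matched keys are updated; unmatched keys are inserted.
        by_key[row[unique_key]] = row
    return list(by_key.values())


existing = [
    {"user_id": 1, "last_seen": "2024-01-01"},
    {"user_id": 2, "last_seen": "2024-01-01"},
]
new_rows = [
    {"user_id": 2, "last_seen": "2024-01-03"},
    {"user_id": 3, "last_seen": "2024-01-03"},
]
merged = merge_rows(existing, new_rows, "user_id")
appended = merge_rows(existing, new_rows)
```

With a unique_key, user 2 is updated and user 3 is inserted, giving three rows. Without one, all four rows are kept, including the duplicate for user 2.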

Source code
Run code

merge_incremental.sql

{# file_format may be 'delta' or 'hudi' #}
{{ config(
    materialized='incremental',
    file_format='delta',
    unique_key='user_id',
    incremental_strategy='merge'
) }}

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    user_id,
    max(date_day) as last_seen

from new_events
group by 1

target/run/merge_incremental.sql

create temporary view merge_incremental__dbt_tmp as

    with new_events as (

        select * from analytics.events


        where date_day >= date_add(current_date, -1)


    )

    select
        user_id,
        max(date_day) as last_seen

    from new_events
    group by 1

;

merge into analytics.merge_incremental as DBT_INTERNAL_DEST
    using merge_incremental__dbt_tmp as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_SOURCE.user_id = DBT_INTERNAL_DEST.user_id
    when matched then update set *
    when not matched then insert *

Beginning with 1.9, merge behavior can be modified with the following additional configuration options:

target_alias, source_alias: Aliases for the target and source to allow you to describe your merge conditions more naturally. These default to DBT_INTERNAL_DEST and DBT_INTERNAL_SOURCE, respectively.
skip_matched_step: If set to true, the 'matched' clause of the merge statement will not be included.
skip_not_matched_step: If set to true, the 'not matched' clause will not be included.
matched_condition: Condition to apply to the WHEN MATCHED clause. You should use the target_alias and source_alias to write a conditional expression, such as DBT_INTERNAL_DEST.col1 = hash(DBT_INTERNAL_SOURCE.col2, DBT_INTERNAL_SOURCE.col3). This condition further restricts the matched set of rows.
not_matched_condition: Condition to apply to the WHEN NOT MATCHED [BY TARGET] clause. This condition further restricts the set of source rows that do not match any target row and will therefore be inserted.
not_matched_by_source_condition: Condition used to further filter the WHEN NOT MATCHED BY SOURCE clause. Only used in conjunction with not_matched_by_source_action.
not_matched_by_source_action: The action to apply when the condition is met. Configure as an expression. For example: not_matched_by_source_action: "update set t.attr1 = 'deleted', t.tech_change_ts = current_timestamp()".
merge_with_schema_evolution: If set to true, the merge statement includes the WITH SCHEMA EVOLUTION clause.

For more details on the meaning of each merge clause, please see the Databricks documentation.

The following is an example demonstrating the use of these new options:

Source code
Run code

merge_incremental_options.sql

{{ config(
    materialized = 'incremental',
    unique_key = 'id',
    incremental_strategy='merge',
    target_alias='t',
    source_alias='s',
    matched_condition='t.tech_change_ts < s.tech_change_ts',
    not_matched_condition='s.attr1 IS NOT NULL',
    not_matched_by_source_condition='t.tech_change_ts < current_timestamp()',
    not_matched_by_source_action='delete',
    merge_with_schema_evolution=true
) }}

select
    id,
    attr1,
    attr2,
    tech_change_ts
from
    {{ ref('source_table') }} as s

target/run/merge_incremental_options.sql

create temporary view merge_incremental__dbt_tmp as

    select
        id,
        attr1,
        attr2,
        tech_change_ts
    from upstream.source_table
;

merge 
    with schema evolution
into
    target_table as t
using (
    select
        id,
        attr1,
        attr2,
        tech_change_ts
    from
        source_table as s
)
on
    t.id <=> s.id
when matched
    and t.tech_change_ts < s.tech_change_ts
    then update set
        id = s.id,
        attr1 = s.attr1,
        attr2 = s.attr2,
        tech_change_ts = s.tech_change_ts

when not matched
    and s.attr1 IS NOT NULL
    then insert (
        id,
        attr1,
        attr2,
        tech_change_ts
    ) values (
        s.id,
        s.attr1,
        s.attr2,
        s.tech_change_ts
    )
    
when not matched by source
    and t.tech_change_ts < current_timestamp()
    then delete

The `replace_where` strategy

The replace_where incremental strategy requires:

file_format: delta
Databricks Runtime 12.0 and above

dbt will run an atomic replace where statement which selectively overwrites data matching one or more incremental_predicates specified as a string or array. Only rows matching the predicates will be inserted. If no incremental_predicates are specified, dbt will perform an atomic insert, as with append.

caution

replace_where inserts data into columns in the order provided, rather than by column name. If you reorder columns and the data is compatible with the existing schema, you may silently insert values into an unexpected column. If the incoming data is incompatible with the existing schema, you will instead receive an error.
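
The hazard in this caution can be reproduced with a toy model of positional versus by-name insertion. Both functions are hypothetical and operate on dictionaries; they only illustrate why a reordered select list can silently place values in the wrong column when the types are compatible.

```python
def insert_by_position(target_columns: list[str], source_columns: list[str], row: dict) -> dict:
    """Assign source values to target columns purely by position."""
    values = [row[name] for name in source_columns]
    return dict(zip(target_columns, values))


def insert_by_name(target_columns: list[str], row: dict) -> dict:
    """Assign source values to the target columns with the same name."""
    return {name: row[name] for name in target_columns}


target_columns = ["user_id", "last_seen"]
# The model's select list was reordered relative to the existing table.
source_columns = ["last_seen", "user_id"]
row = {"user_id": "u1", "last_seen": "2024-10-01"}

positional = insert_by_position(target_columns, source_columns, row)
by_name = insert_by_name(target_columns, row)
```

In `positional`, the date ends up under `user_id` and the id under `last_seen`; because both values are strings, nothing fails. `by_name` keeps each value in its intended column.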

Source code
Run code

replace_where_incremental.sql

{# The predicate below means users with ids < 10000 are never replaced #}
{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='replace_where',
    incremental_predicates='user_id >= 10000'
) }}

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    user_id,
    max(date_day) as last_seen

from new_events
group by 1

target/run/replace_where_incremental.sql

create temporary view replace_where__dbt_tmp as

    with new_events as (

        select * from analytics.events

        where date_day >= date_add(current_date, -1)

    )

    select
        user_id,
        max(date_day) as last_seen

    from new_events
    group by 1

;

insert into analytics.replace_where_incremental
    replace where user_id >= 10000
    table `replace_where__dbt_tmp`

The `delete+insert` strategy

Available in versions 1.11 or higher

The delete+insert incremental strategy requires:

file_format: delta
A required unique_key configuration
Databricks Runtime 12.2 LTS or higher

The delete+insert strategy is a simpler alternative to the merge strategy for cases where you want to replace matching records without the complexity of updating specific columns. This strategy works in two steps:

Delete: Remove all rows from the target table where the unique_key matches rows in the new data.
Insert: Insert all new rows from the staging data.

This strategy is particularly useful when:

You want to replace entire records rather than update specific columns
Your business logic requires a clean "remove and replace" approach
You need a simpler incremental strategy than merge for full record replacement

When using Databricks Runtime 17.1 or higher, dbt uses the efficient INSERT INTO ... REPLACE ON syntax to perform this operation atomically. For older runtime versions, dbt executes separate DELETE and INSERT statements.

You can optionally use incremental_predicates to further filter which records are processed, providing more control over which rows are deleted and inserted.
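
The choice between the single-statement and two-statement forms can be sketched as a function of the runtime version. This is an illustration that mirrors the shape of the SQL described in this section, not the adapter's actual macro; `delete_insert_statements` and its parameters are hypothetical, and it ignores incremental_predicates.

```python
def delete_insert_statements(
    target: str,
    staging: str,
    unique_key: str,
    columns: list[str],
    dbr_version: tuple[int, int],
) -> list[str]:
    """Return the SQL statements to run for delete+insert on a given runtime."""
    column_list = ", ".join(f"`{name}`" for name in columns)
    if dbr_version >= (17, 1):
        # DBR 17.1+: one atomic INSERT INTO ... REPLACE ON statement.
        return [
            f"insert into table {target} as target\n"
            f"replace on (target.{unique_key} <=> temp.{unique_key})\n"
            f"(select {column_list} from {staging}) as temp"
        ]
    # Older runtimes: separate DELETE and INSERT statements.
    return [
        f"delete from {target}\n"
        f"where {unique_key} in (select {unique_key} from {staging})",
        f"insert into {target} by name\n"
        f"select {column_list} from {staging}",
    ]


newer = delete_insert_statements(
    "analytics.delete_insert_incremental",
    "delete_insert_incremental__dbt_tmp",
    "user_id",
    ["user_id", "last_seen"],
    (17, 1),
)
older = delete_insert_statements(
    "analytics.delete_insert_incremental",
    "delete_insert_incremental__dbt_tmp",
    "user_id",
    ["user_id", "last_seen"],
    (16, 4),
)
```

`newer` holds a single statement, while `older` holds a delete followed by an insert, so only the newer form replaces the rows atomically.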

Source code
Run code (DBR 17.1+)
Run code (DBR < 17.1)

delete_insert_incremental.sql

{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='delete+insert',
    unique_key='user_id'
) }}

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    user_id,
    max(date_day) as last_seen

from new_events
group by 1

target/run/delete_insert_incremental.sql

create temporary view delete_insert_incremental__dbt_tmp as

    with new_events as (

        select * from analytics.events

        where date_day >= date_add(current_date, -1)

    )

    select
        user_id,
        max(date_day) as last_seen

    from new_events
    group by 1

;

insert into table analytics.delete_insert_incremental as target
replace on (target.user_id <=> temp.user_id)
(select `user_id`, `last_seen`
   from delete_insert_incremental__dbt_tmp where date_day >= date_add(current_date, -1)) as temp

target/run/delete_insert_incremental.sql

create temporary view delete_insert_incremental__dbt_tmp as

    with new_events as (

        select * from analytics.events

        where date_day >= date_add(current_date, -1)

    )

    select
        user_id,
        max(date_day) as last_seen

    from new_events
    group by 1

;

-- Step 1: Delete matching rows
delete from analytics.delete_insert_incremental
where analytics.delete_insert_incremental.user_id IN (SELECT user_id FROM delete_insert_incremental__dbt_tmp)
  and date_day >= date_add(current_date, -1);

-- Step 2: Insert new rows
insert into analytics.delete_insert_incremental by name
select `user_id`, `last_seen`
from delete_insert_incremental__dbt_tmp
where date_day >= date_add(current_date, -1)

The `microbatch` strategy

Available in versions 1.9 or higher

The Databricks adapter implements the microbatch strategy using replace_where. Note the requirements and caution statements for replace_where above. For more information about this strategy, see the microbatch reference page.

In the following example, the upstream table events has been annotated with an event_time column called ts in its schema file.

Source code
Run code

microbatch_incremental.sql

{# Use 'date' as the grain for this microbatch table #}
{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='microbatch',
    event_time='date'
) }}

with new_events as (

    select * from {{ ref('events') }}

)

select
    user_id,
    date,
    count(*) as visits

from new_events
group by 1, 2

target/run/microbatch_incremental.sql

create temporary view microbatch_incremental__dbt_tmp as

    with new_events as (

        select * from (select * from analytics.events where ts >= '2024-10-01' and ts < '2024-10-02')

    )

    select
        user_id,
        date,
        count(*) as visits
    from new_events
    group by 1, 2
;

insert into analytics.microbatch_incremental
    replace where CAST(date as TIMESTAMP) >= '2024-10-01' and CAST(date as TIMESTAMP) < '2024-10-02'
    table `microbatch_incremental__dbt_tmp`
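
The batch window that appears in the run code above can be generated as follows. This is a sketch assuming a daily batch size; `daily_batches` and `replace_where_predicate` are hypothetical helpers, and the predicate format simply copies the one shown in the run code.

```python
from datetime import datetime, timedelta
from typing import Iterator


def daily_batches(start: datetime, end: datetime) -> Iterator[tuple[datetime, datetime]]:
    """Yield half-open one-day windows, starting at `start`, until `end` is reached."""
    current = start
    while current < end:
        following = current + timedelta(days=1)
        yield current, following
        current = following


def replace_where_predicate(event_time: str, batch_start: datetime, batch_end: datetime) -> str:
    """Build the replace where predicate for a single batch window."""
    return (
        f"CAST({event_time} as TIMESTAMP) >= '{batch_start:%Y-%m-%d}' "
        f"and CAST({event_time} as TIMESTAMP) < '{batch_end:%Y-%m-%d}'"
    )


batches = list(daily_batches(datetime(2024, 10, 1), datetime(2024, 10, 3)))
first_predicate = replace_where_predicate("date", *batches[0])
```

Each window is half-open, so consecutive batches neither overlap nor leave gaps, and each batch replaces only the rows that fall inside its own window.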

Python model configuration

The Databricks adapter supports Python models. Databricks uses PySpark as the processing framework for these models.

Submission methods: Databricks supports a few different mechanisms to submit PySpark code, each with relative advantages. Some are better for supporting iterative development, while others are better for supporting lower-cost production deployments. The options are:

all_purpose_cluster (default): dbt will run your Python model using the cluster ID configured as cluster in your connection profile or for this specific model. These clusters are more expensive but also much more responsive. We recommend using an interactive all-purpose cluster for quicker iteration in development.
- create_notebook: True: dbt will upload your model's compiled PySpark code to a notebook in the namespace /Shared/dbt_python_model/{schema}, where {schema} is the configured schema for the model, and execute that notebook to run using the all-purpose cluster. The appeal of this approach is that you can easily open the notebook in the Databricks UI for debugging or fine-tuning right after running your model. Remember to copy any changes into your dbt .py model code before re-running.
- create_notebook: False (default): dbt will use the Command API, which is slightly faster.
job_cluster: dbt will upload your model's compiled PySpark code to a notebook in the namespace /Shared/dbt_python_model/{schema}, where {schema} is the configured schema for the model, and execute that notebook to run using a short-lived jobs cluster. For each Python model, Databricks will need to spin up the cluster, execute the model's PySpark transformation, and then spin down the cluster. As such, job clusters take longer before and after model execution, but they're also less expensive, so we recommend these for longer-running Python models in production. To use the job_cluster submission method, your model must be configured with job_cluster_config, which defines key-value properties for new_cluster, as defined in the JobRunsSubmit API.

You can configure each model's submission_method in all the standard ways you supply configuration:

def model(dbt, session):
    dbt.config(
        submission_method="all_purpose_cluster",
        create_notebook=True,
        cluster_id="abcd-1234-wxyz"
    )
    ...

models:
  - name: my_python_model
    config:
      submission_method: job_cluster
      job_cluster_config:
        spark_version: ...
        node_type_id: ...

# dbt_project.yml
models:
  project_name:
    subfolder:
      # set defaults for all .py models defined in this subfolder
      +submission_method: all_purpose_cluster
      +create_notebook: False
      +cluster_id: abcd-1234-wxyz

If not configured, dbt-spark will use the built-in defaults: the all-purpose cluster (based on cluster in your connection profile) without creating a notebook. The dbt-databricks adapter will default to the cluster configured in http_path. We encourage explicitly configuring the clusters for Python models in Databricks projects.

Installing packages: When using all-purpose clusters, we recommend installing packages which you will be using to run your Python models.

Selecting compute per model

Beginning in version 1.7.2, you can assign which compute resource to use on a per-model basis. For SQL models, you can select a SQL Warehouse (serverless or provisioned) or an all-purpose cluster. For details on how this feature interacts with Python models, see Specifying compute for Python models.

note

This is an optional setting. If you do not configure this as shown below, we will default to the compute specified by http_path in the top level of the output section in your profile. This is also the compute that will be used for tasks not associated with a particular model, such as gathering metadata for all tables in a schema.

To take advantage of this capability, you will need to add compute blocks to your profile:

profiles.yml

profile-name:
  target: target-name # this is the default target
  outputs:
    target-name:
      type: databricks
      catalog: optional catalog name if you are using Unity Catalog
      schema: schema name # Required        
      host: yourorg.databrickshost.com # Required

      ### This path is used as the default compute
      http_path: /sql/your/http/path # Required        
      
      ### New compute section
      compute:

        ### Name that you will use to refer to an alternate compute
        Compute1:
          http_path: '/sql/your/http/path' # Required of each alternate compute

        ### A third named compute, use whatever name you like
        Compute2:
          http_path: '/some/other/path' # Required of each alternate compute
      ...

    target-name: # additional targets
      ...
      ### For each target, you need to define the same compute,
      ### but you can specify different paths
      compute:

        ### Name that you will use to refer to an alternate compute
        Compute1:
          http_path: '/sql/your/http/path' # Required of each alternate compute

        ### A third named compute, use whatever name you like
        Compute2:
          http_path: '/some/other/path' # Required of each alternate compute
      ...

The new compute section is a map of user chosen names to objects with an http_path property. Each compute is keyed by a name which is used in the model definition/configuration to indicate which compute you wish to use for that model/selection of models. We recommend choosing a name that is easily recognized as the compute resources you're using, such as the name of the compute resource inside the Databricks UI.

note

You need to use the same set of names for compute across your outputs, though you may supply different http_paths, allowing you to use different computes in different deployment scenarios.
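
The rule in this note can be checked mechanically. The following is a minimal sketch, not part of dbt or dbt-databricks, that assumes your profile has already been loaded into a plain dictionary; the function names, target names, and paths are hypothetical.

```python
def compute_names_by_output(profile: dict) -> dict:
    """Map each output name to the set of compute names it defines."""
    return {
        output_name: set(output.get("compute") or {})
        for output_name, output in profile["outputs"].items()
    }


def inconsistent_outputs(profile: dict) -> dict:
    """Return outputs whose compute names differ from the first output's."""
    names = compute_names_by_output(profile)
    if not names:
        return {}
    expected = next(iter(names.values()))
    return {name: found for name, found in names.items() if found != expected}


# Hypothetical profile with two targets; only the http_path values differ.
profile = {
    "target": "dev",
    "outputs": {
        "dev": {"compute": {"Compute1": {"http_path": "/sql/dev/path"}}},
        "prod": {"compute": {"Compute1": {"http_path": "/sql/prod/path"}}},
    },
}

print(inconsistent_outputs(profile))  # prints {}
```

An empty result means every output defines the same compute names, which is what the note requires.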

To configure this inside of dbt, use the extended attributes feature on the desired environments:

compute:
  Compute1:
    http_path: /SOME/OTHER/PATH
  Compute2:
    http_path: /SOME/OTHER/PATH

Specifying the compute for models

As with many other configuration options, you can specify the compute for a model in multiple ways, using databricks_compute. In your dbt_project.yml, the selected compute can be specified for all the models in a given directory:

dbt_project.yml

...

models:
  +databricks_compute: "Compute1"     # use the `Compute1` warehouse/cluster for all models in the project...
  my_project:
    clickstream:
      +databricks_compute: "Compute2" # ...except for the models in the `clickstream` folder, which will use `Compute2`.

snapshots:
  +databricks_compute: "Compute1"     # all Snapshot models are configured to use `Compute1`.

For an individual model, the compute can be specified in the model config in your schema file.

schema.yml

models:
  - name: table_model
    config:
      databricks_compute: Compute1
    columns:
      - name: id
        data_type: int

Alternatively, the compute can be specified in the config block of a model's SQL file.

model.sql

{{
  config(
    materialized='table',
    databricks_compute='Compute1'
  )
}}
select * from {{ ref('seed') }}

To validate that the specified compute is being used, look for lines in your dbt.log like:

Databricks adapter ... using default compute resource.

Databricks adapter ... using compute resource <name of compute>.
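
If you check this often, the search can be scripted. The sketch below is only an illustration: it assumes the two message shapes quoted above, matches the elided middle part loosely, and uses hypothetical function names.

```python
import re

# Matches the two message shapes quoted above. The text between
# "Databricks adapter" and "using" is elided in the docs, so it is matched loosely.
COMPUTE_LINE = re.compile(
    r"Databricks adapter .*using "
    r"(?:default compute resource|compute resource (?P<name>.+))\.$"
)


def compute_used(line: str):
    """Return 'default', a compute name, or None if the line is not a compute message."""
    match = COMPUTE_LINE.search(line.strip())
    if match is None:
        return None
    return match.group("name") or "default"


def summarize(lines):
    """Collect the compute reported by each matching log line, in order."""
    return [used for used in map(compute_used, lines) if used is not None]
```

Pass the lines of your dbt.log to summarize to list which compute each logged message reported.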

Specifying compute for Python models

Materializing a Python model requires executing SQL as well as Python. Specifically, if your Python model is incremental, the current execution pattern runs Python to create a staging table, which is then merged into your target table using SQL. The Python code needs to run on an all-purpose cluster (or serverless cluster, see Python Submission Methods), while the SQL code can run on an all-purpose cluster or a SQL Warehouse.

When you specify databricks_compute for a Python model, you are currently only specifying which compute to use when running the model-specific SQL. To use a different compute for executing the Python itself, you must specify an alternate compute in the config for the model. For example:

model.py

def model(dbt, session):
    dbt.config(
        http_path="sql/protocolv1/..."
    )

If your default compute is a SQL Warehouse, you will need to specify an all-purpose cluster http_path in this way.

Persisting model descriptions

Relation-level docs persistence is supported. For more information on configuring docs persistence, see the docs.

When the persist_docs option is configured appropriately, you'll be able to see model descriptions in the Comment field of describe [table] extended or show table extended in [database] like '*'.

Query tags

Available in versions 1.11 or higher

Query tags are a Databricks feature that allows you to attach custom key-value metadata to SQL queries. This metadata appears in system tables and query history, making it useful for tracking query costs, debugging, and auditing.

Feature availability

Query tags may not yet be available in all Databricks workspaces. Check the Databricks documentation for the latest information on feature availability.

dbt-databricks supports setting query tags at both the connection level (in your profile) and the model level (in model configs). When you run dbt, it automatically includes default tags containing dbt metadata, such as the model name and dbt version.

Default query tags

dbt-databricks automatically adds the following tags to every query:

Tag key	Description
`@@dbt_model_name`	The name of the model being executed
`@@dbt_core_version`	The version of dbt-core being used
`@@dbt_databricks_version`	The version of dbt-databricks being used
`@@dbt_materialized`	The materialization type (table, view, incremental, and so on)

These reserved keys cannot be overridden by user-defined tags.

Configuring query tags

You can set query tags at the connection level in your profile or at the model level in your model config. Model-level tags take precedence over connection-level tags.

Connection-level query tags

To set query tags for all queries in a connection, add the query_tags parameter to your profiles.yml file as a JSON string:

~/.dbt/profiles.yml

your_profile_name:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: my_catalog
      schema: my_schema
      host: yourorg.databrickshost.com
      http_path: /sql/your/http/path
      token: dapiXXXXXXXXXXXXXXXXXXXXXXX
      query_tags: '{"team": "analytics", "project": "customer_360"}'

Model-level query tags

To set query tags for a specific model, use the query_tags config:

models/my_model.sql

{{ config(
    query_tags = {'cost_center': 'marketing', 'priority': 'high'}
) }}

select * from {{ ref('upstream_model') }}

You can also configure query tags in your dbt_project.yml for groups of models:

dbt_project.yml

models:
  my_project:
    marketing:
      +query_tags: {'department': 'marketing'}
    finance:
      +query_tags: {'department': 'finance'}

Tag precedence and merging

When query tags are defined at multiple levels, they are merged with the following precedence (highest to lowest):

- Model-level tags (from config() or schema.yml)
- Connection-level tags (from profiles.yml)
- Default dbt tags (automatically added)

If the same key appears at multiple levels, the higher-precedence value wins.
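
The precedence rules above can be pictured as a layered dictionary merge. This is an illustrative sketch, not the adapter's implementation: merge_query_tags is a hypothetical helper, and raising an error for reserved keys is this sketch's own choice.

```python
RESERVED_KEYS = {
    "@@dbt_model_name",
    "@@dbt_core_version",
    "@@dbt_databricks_version",
    "@@dbt_materialized",
}


def merge_query_tags(default_tags: dict, connection_tags: dict, model_tags: dict) -> dict:
    """Merge tags so that model-level beats connection-level beats defaults."""
    for source in (connection_tags, model_tags):
        reserved = RESERVED_KEYS.intersection(source)
        if reserved:
            # Assumption: the sketch rejects reserved keys outright.
            raise ValueError(f"reserved keys cannot be overridden: {sorted(reserved)}")
    # Later dictionaries win when a key repeats.
    return {**default_tags, **connection_tags, **model_tags}


merged = merge_query_tags(
    {"@@dbt_model_name": "my_model"},
    {"team": "analytics", "priority": "low"},
    {"priority": "high"},
)
print(merged)
# prints {'@@dbt_model_name': 'my_model', 'team': 'analytics', 'priority': 'high'}
```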

Why connection-level tags?

Due to how dbt merges configs, specifying query_tags at the model level in config() or schema.yml will replace any query_tags you defined in dbt_project.yml rather than merging them. This is standard dbt behavior for dictionary configs.

To work around this limitation, dbt-databricks accepts query_tags in your connection profile (profiles.yml). Connection-level tags are always merged with model-level tags, allowing you to define common tags once in your profile and selectively add or override specific keys at the model level.

Recommended pattern:

- Define shared tags (team, project, environment) in your profile's query_tags
- Use model-level query_tags when you need to add model-specific tags

Limitations

- Maximum 20 tags: The total number of query tags (including default tags) cannot exceed 20.
- Value length: Tag values must be at most 128 characters. Default tag values that exceed this limit are automatically truncated.
- Special characters: Backslash (\), comma (,), and colon (:) characters in tag values are automatically escaped. A warning is logged when escaping occurs.
- Reserved keys: The keys @@dbt_model_name, @@dbt_core_version, @@dbt_databricks_version, and @@dbt_materialized are reserved and cannot be used in user-defined tags.
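
To catch these limits before a run, a project could lint its tags with something like the following. This is a hedged sketch: check_query_tags and its messages are hypothetical, and it only reports findings rather than reproducing the adapter's escaping or truncation.

```python
MAX_TAGS = 20
MAX_VALUE_LENGTH = 128
ESCAPED_CHARACTERS = ("\\", ",", ":")


def check_query_tags(tags: dict) -> list:
    """Return human-readable findings for a merged tag dictionary."""
    findings = []
    if len(tags) > MAX_TAGS:
        findings.append(f"{len(tags)} tags exceed the maximum of {MAX_TAGS}")
    for key, value in tags.items():
        text = str(value)
        if len(text) > MAX_VALUE_LENGTH:
            findings.append(
                f"value for '{key}' is {len(text)} characters (limit {MAX_VALUE_LENGTH})"
            )
        if any(char in text for char in ESCAPED_CHARACTERS):
            findings.append(f"value for '{key}' contains characters that will be escaped")
    return findings
```

Remember to include the default dbt tags when counting toward the limit of 20, since they are part of the total.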

Viewing query tags

Query tags appear in Databricks system tables and query history. For information on how to query and analyze query tags, see the Databricks query tags documentation.

Default file format configurations

To use advanced features such as snapshots and the merge incremental strategy, use the Delta or Hudi file format as the default file format when materializing models as tables.

A convenient way to do this is to set a top-level configuration in your project file:

dbt_project.yml

models:
  +file_format: delta # or hudi
  
seeds:
  +file_format: delta # or hudi
  
snapshots:
  +file_format: delta # or hudi

Materialized views and streaming tables

Materialized views and streaming tables are alternatives to incremental tables that are powered by Delta Live Tables.

Refer to What are Delta Live Tables? for more information and use cases.

In order to adopt these materialization strategies, you will need a workspace that is enabled for Unity Catalog and serverless SQL Warehouses.

materialized_view.sql

{{ config(
   materialized = 'materialized_view'
 ) }}

streaming_table.sql

{{ config(
   materialized = 'streaming_table'
 ) }}

We support on_configuration_change for most available properties of these materializations. The following table summarizes our configuration support. Refer to Configuration details for more details on each config:

Databricks Concept	Config Name	MV/ST support	Version
PARTITIONED BY	`partition_by`	MV/ST	All
CLUSTER BY	`liquid_clustered_by`	MV/ST	v1.11+
COMMENT	`description`	MV/ST	All
TBLPROPERTIES	`tblproperties`	MV/ST	All
TAGS	`databricks_tags`	MV/ST	v1.11+
SCHEDULE CRON	`schedule: { 'cron': '<cron schedule>', 'time_zone_value': '<time zone value>' }`	MV/ST	All
SCHEDULE EVERY	`schedule: { 'every': '<n> <unit>' }`	MV/ST	v1.12+
TRIGGER ON UPDATE	`schedule: { 'on_update': true, 'at_most_every': '<n> <unit>' }`	MV/ST	v1.12+
WITH ROW FILTER	`row_filter`	MV/ST	v1.12+
query	defined by your model SQL	on_configuration_change for MV only	All

mv_example.sql

{{ config(
    materialized='materialized_view',
    partition_by='id',
    schedule = {
        'cron': '0 0 * * * ? *',
        'time_zone_value': 'Etc/UTC'
    },
    tblproperties={
        'key': 'value'
    },
) }}
select * from {{ ref('my_seed') }}

Configuration details

partition_by

partition_by works the same as for views and tables: it can be a single column or an array of columns to partition by.

liquid_clustered_by

Available in versions 1.11 or higher

liquid_clustered_by enables liquid clustering for materialized views and streaming tables. Liquid clustering optimizes query performance by co-locating similar data within the same files, particularly beneficial for queries with selective filters on the clustered columns.

Note: You cannot use both partition_by and liquid_clustered_by on the same materialization, as Databricks doesn't allow combining these features.

databricks_tags

Available in versions 1.11 or higher

databricks_tags allows you to apply Unity Catalog tags to your materialized views and streaming tables for data governance and organization. Tags are key-value pairs that can be used for data classification, access control policies, and metadata management.

{{ config(
    materialized='streaming_table',
    databricks_tags={'pii': 'contains_email', 'team': 'analytics'}
) }}

dbt-databricks v1.12+ adds support for key-only tags. To set a tag that has a key but no value, set the tag's value to an empty string '' or to None:

{{ config(
    materialized='streaming_table',
    databricks_tags={'sensitive': '', 'reviewed': None}
) }}

This applies to both table-level and column-level databricks_tags. Non-string values, such as numbers or booleans, are converted to strings.

Tags are applied via ALTER statements after the materialization is created. Once applied, tags cannot be removed through dbt-databricks configuration changes. To remove tags, you must use Databricks directly or a post-hook.

Behavior change in v1.12

Starting in dbt-databricks v1.12.0, databricks_tags configurations are merged additively across config hierarchy levels (for example, project-level and model-level), rather than having lower-level configs completely replace higher-level ones.

When the same tag key is defined at multiple levels, the lower-level value takes precedence. Tag keys defined only at higher levels are retained.

This behavior applies anywhere databricks_tags can be configured, including tables, columns, materialized views, and streaming tables.

For example, with the following project-level and model-level configs:

dbt_project.yml

models:
  my_project:
    +databricks_tags:
      a: "b"
      c: "project_value"

models/my_model.sql

{{ config(
    databricks_tags={'c': 'model_value', 'k': 'v'}
) }}

The resulting tags are:

- a: b (retained from the project level)
- c: model_value (the model-level value overrides the project-level c)
- k: v (added at the model level)
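
The additive merge in this example behaves like successive dictionary updates. The sketch below is only a model of that behavior, with merge_databricks_tags as a hypothetical helper rather than the adapter's code.

```python
def merge_databricks_tags(*levels: dict) -> dict:
    """Merge tag dictionaries ordered from highest level (project) to lowest (model)."""
    merged = {}
    for level in levels:
        merged.update(level)  # lower levels override matching keys
    return merged


project_tags = {"a": "b", "c": "project_value"}
model_tags = {"c": "model_value", "k": "v"}

print(merge_databricks_tags(project_tags, model_tags))
# prints {'a': 'b', 'c': 'model_value', 'k': 'v'}
```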

description

As with views and tables, adding a description to your configuration will lead to a table-level comment getting added to your materialization.

tblproperties

tblproperties works the same as for views and tables, with one important exception: the adapter maintains a list of keys that Databricks sets when creating a materialized view or streaming table, and these keys are ignored when determining configuration changes.

schedule

Set the refresh schedule for the model using one of three mutually exclusive modes:

Mode	Config	Format	Version
`cron`	`schedule: { 'cron': '...', 'time_zone_value': '...' }`	Cron string (Databricks format). `time_zone_value` is optional.	All
`every`	`schedule: { 'every': '<n> <unit>' }`	`'<n> <unit>'` where unit is `HOURS`, `DAYS`, or `WEEKS` — for example, `'2 HOURS'`	v1.12+
`on_update`	`schedule: { 'on_update': true, 'at_most_every': '<n> <unit>' }`	Set to `true` to refresh when upstream data changes. `at_most_every` is optional and rate-limits refreshes (minimum 60 seconds). For example, `'15 MINUTES'`	v1.12+

Refresh behavior by mode:

- cron: dbt requests a manual refresh on every run.
- every and on_update: Databricks auto-manages the refresh; dbt does not trigger a manual refresh on a no-op re-run. If a schedule exists in Databricks but your dbt project doesn't specify one, the schedule resets to manual on your next run (when on_configuration_change is set to apply).
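
Because the three modes are mutually exclusive, a schedule config can be sanity-checked before it reaches Databricks. This is a minimal sketch with hypothetical helper names; it reflects only the rules stated above, not the adapter's own validation.

```python
SCHEDULE_MODES = ("cron", "every", "on_update")


def schedule_mode(schedule: dict) -> str:
    """Return the single mode used by a schedule config, or raise if ambiguous."""
    present = [mode for mode in SCHEDULE_MODES if mode in schedule]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {SCHEDULE_MODES}, found {present}")
    return present[0]


def dbt_refreshes_on_every_run(schedule: dict) -> bool:
    """Only cron schedules get a manual refresh from dbt on every run."""
    return schedule_mode(schedule) == "cron"


print(schedule_mode({"every": "2 HOURS"}))  # prints every
```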

query

For materialized views, if the compiled query differs from what's in the database, dbt takes the configured on_configuration_change action. Query changes aren't currently detectable for streaming tables. Refer to on_configuration_change for details.

row_filter

Available in versions 1.12 or higher

row_filter applies a Unity Catalog row filter to a model. It is supported on table, incremental, materialized_view, and streaming_table materializations. Refer to Setting row filters for the full config reference and examples.

on_configuration_change

Materialization	Drop and recreate required?	Notes
Materialized views	Yes, for all changes except schedule updates	Databricks SQL API limitation
Streaming tables	Only when `partition_by` changes	All other supported changes use `CREATE OR REFRESH` plus an `ALTER` for schedule changes

Note on streaming table query changes: there's currently no way for the adapter to detect if a streaming table query has changed. Regardless of on_configuration_change behavior, dbt uses CREATE OR REFRESH, which applies the updated query to future rows only — previously processed rows aren't reprocessed.

To reprocess available source data with an updated query, run with --full-refresh.
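
The drop-and-recreate rules from the table above can be summarized as a small decision function. This sketch is an interpretation of that table under the assumption that changes are tracked by config name; requires_recreate is a hypothetical helper, not an adapter API.

```python
def requires_recreate(materialization: str, changed: set) -> bool:
    """Apply the drop-and-recreate rules from the table above."""
    if not changed:
        return False
    if materialization == "materialized_view":
        # Every change except a schedule-only update needs a rebuild.
        return changed != {"schedule"}
    if materialization == "streaming_table":
        # Only a partition_by change forces a rebuild.
        return "partition_by" in changed
    raise ValueError(f"unsupported materialization: {materialization}")
```

For example, a materialized view whose only change is its schedule is altered in place, while one that also changes tblproperties is rebuilt.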

Setting table properties

Table properties can be set with your configuration for tables or views using tblproperties:

with_table_properties.sql

{{ config(
    tblproperties={
      'delta.autoOptimize.optimizeWrite' : 'true',
      'delta.autoOptimize.autoCompact' : 'true'
    }
 ) }}

caution

These properties are sent directly to Databricks without validation in dbt. You'll need to do a full refresh of incremental materializations if you change their tblproperties.

One use case is making Delta tables compatible with Iceberg readers using the Universal Format:

{{ config(
    tblproperties={
      'delta.enableIcebergCompatV2': 'true',
      'delta.universalFormat.enabledFormats': 'iceberg'
    }
 ) }}

tblproperties can be specified for Python models, but they're applied via an ALTER statement after table creation due to a PySpark limitation.
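
As a sketch of what that looks like in a Python model, the properties from the SQL example above can be passed through dbt.config. The upstream_model name is hypothetical, and the model body is reduced to a single ref for brevity.

```python
def model(dbt, session):
    dbt.config(
        materialized="table",
        # Applied with an ALTER statement after the table is created.
        tblproperties={
            "delta.autoOptimize.optimizeWrite": "true",
            "delta.autoOptimize.autoCompact": "true",
        },
    )
    # "upstream_model" is a hypothetical model name used for illustration.
    return dbt.ref("upstream_model")
```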
