Skip to main content

Databricks configurations

Configuring tables

When materializing a model as table, you may include several optional configs that are specific to the dbt-databricks plugin, in addition to the standard model configs.

Incremental models

dbt-databricks plugin leans heavily on the incremental_strategy config. This config tells the incremental materialization how to build models in runs beyond their first. It can be set to one of four values:

  • append: Insert new records without updating or overwriting any existing data.
  • insert_overwrite: If partition_by is specified, overwrite partitions in the table with new data. If no partition_by is specified, overwrite the entire table with new data.
  • merge (default; Delta and Hudi file format only): Match records based on a unique_key, updating old records, and inserting new ones. (If no unique_key is specified, all new data is inserted, similar to append.)
  • replace_where (Delta file format only): Match records based on incremental_predicates, replacing all records that match the predicates from the existing table with records matching the predicates from the new data. (If no incremental_predicates are specified, all new data is inserted, similar to append.)

Each of these strategies has its pros and cons, which we'll discuss below. As with any model config, incremental_strategy may be specified in dbt_project.yml or within a model file's config() block.

The append strategy

Following the append strategy, dbt will perform an insert into statement with all new data. The appeal of this strategy is that it is straightforward and functional across all platforms, file types, connection methods, and Apache Spark versions. However, this strategy cannot update, overwrite, or delete existing data, so it is likely to insert duplicate records for many data sources.

databricks_incremental.sql
{{ config(
materialized='incremental',
incremental_strategy='append',
) }}

-- All rows returned by this query will be appended to the existing table

select * from {{ ref('events') }}
{% if is_incremental() %}
where event_ts > (select max(event_ts) from {{ this }})
{% endif %}

The insert_overwrite strategy

caution

This strategy is currently only compatible with All Purpose Clusters, not SQL Warehouses.

This strategy is most effective when specified alongside a partition_by clause in your model config. dbt will run an atomic insert overwrite statement that dynamically replaces all partitions included in your query. Be sure to re-select all of the relevant data for a partition when using this incremental strategy.

If no partition_by is specified, then the insert_overwrite strategy will atomically replace all contents of the table, overriding all existing data with only the new records. The column schema of the table remains the same, however. This can be desirable in some limited circumstances, since it minimizes downtime while the table contents are overwritten. The operation is comparable to running truncate + insert on other databases. For atomic replacement of Delta-formatted tables, use the table materialization (which runs create or replace) instead.

databricks_incremental.sql
{{ config(
materialized='incremental',
partition_by=['date_day'],
file_format='parquet'
) }}

/*
Every partition returned by this query will be overwritten
when this model runs
*/

with new_events as (

select * from {{ ref('events') }}

{% if is_incremental() %}
where date_day >= date_add(current_date, -1)
{% endif %}

)

select
date_day,
count(*) as users

from new_events
group by 1

The merge strategy

The merge incremental strategy requires:

  • file_format: delta or hudi
  • Databricks Runtime 5.1 and above for delta file format
  • Apache Spark for hudi file format

dbt will run an atomic merge statement which looks nearly identical to the default merge behavior on Snowflake and BigQuery. If a unique_key is specified (recommended), dbt will update old records with values from new records that match on the key column. If a unique_key is not specified, dbt will forgo match criteria and simply insert all new records (similar to append strategy).

Specifying merge as the incremental strategy is optional since it's the default strategy used when none is specified.

merge_incremental.sql
{{ config(
materialized='incremental',
file_format='delta', # or 'hudi'
unique_key='user_id',
incremental_strategy='merge'
) }}

with new_events as (

select * from {{ ref('events') }}

{% if is_incremental() %}
where date_day >= date_add(current_date, -1)
{% endif %}

)

select
user_id,
max(date_day) as last_seen

from events
group by 1

The replace_where strategy

The replace_where incremental strategy requires:

  • file_format: delta
  • Databricks Runtime 12.0 and above

dbt will run an atomic replace where statement which selectively overwrites data matching one or more incremental_predicates specified as a string or array. Only rows matching the predicates will be inserted. If no incremental_predicates are specified, dbt will perform an atomic insert, as with append.

caution

replace_where inserts data into columns in the order provided, rather than by column name. If you reorder columns and the data is compatible with the existing schema, you may silently insert values into an unexpected column. If the incoming data is incompatible with the existing schema, you will instead receive an error.

replace_where_incremental.sql
{{ config(
materialized='incremental',
file_format='delta',
incremental_strategy = 'replace_where'
incremental_predicates = 'user_id >= 10000' # Never replace users with ids < 10000
) }}

with new_events as (

select * from {{ ref('events') }}

{% if is_incremental() %}
where date_day >= date_add(current_date, -1)
{% endif %}

)

select
user_id,
max(date_day) as last_seen

from events
group by 1

Persisting model descriptions

Relation-level docs persistence is supported in dbt v0.17.0. For more information on configuring docs persistence, see the docs.

When the persist_docs option is configured appropriately, you'll be able to see model descriptions in the Comment field of describe [table] extended or show table extended in [database] like '*'.

Default file format configurations

To access advanced incremental strategies features, such as snapshots and the merge incremental strategy, you will want to use the Delta or Hudi file format as the default file format when materializing models as tables.

It's quite convenient to do this by setting a top-level configuration in your project file:

dbt_project.yml
models:
+file_format: delta # or hudi

seeds:
+file_format: delta # or hudi

snapshots:
+file_format: delta # or hudi

Setting table properties

Table properties can be set with your configuration for tables or views using tblproperties:

with_table_properties.sql
{{ config(
tblproperties={
'delta.autoOptimize.optimizeWrite' : 'true',
'delta.autoOptimize.autoCompact' : 'true'
}
) }}
caution

These properties are sent directly to Databricks without validation in dbt, so be thoughtful with how you use this feature. You will need to do a full refresh of incremental materializations if you change their tblproperties.

One application of this feature is making delta tables compatible with iceberg readers using the Universal Format.

0