Spark configurations

Heads up!

These docs are a work in progress.

Configuring Spark tables

When materializing a model as table, you may include several optional configs:

| Option | Description | Required? | Example |
| --- | --- | --- | --- |
| file_format | The file format to use when creating tables (parquet, delta, csv, json, text, jdbc, orc, hive, or libsvm). | Optional | parquet |
| location_root | The created table uses the specified directory to store its data. The table alias is appended to it. | Optional | /mnt/root |
| partition_by | Partition the created table by the specified columns. A directory is created for each partition. | Optional | date_day |
| clustered_by | Each partition in the created table will be split into a fixed number of buckets by the specified columns. | Optional | country_code |
| buckets | The number of buckets to create while clustering. | Required if clustered_by is specified | 8 |
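
For example, several of these options can be combined in a single model's config block. The sketch below is illustrative only: the source table (events) and the column names (date_day, country_code) are placeholders, not part of any real project.

    {{ config(
        materialized='table',
        file_format='parquet',
        location_root='/mnt/root',
        partition_by=['date_day'],
        clustered_by=['country_code'],
        buckets=8
    ) }}

    select * from {{ ref('events') }}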

Incremental Models

The incremental_strategy config controls how dbt builds incremental models, and it can be set to one of two values:

  • insert_overwrite (default)
  • merge (Delta Lake only)

The insert_overwrite strategy

Apache Spark does not natively support delete, update, or merge statements. As such, Spark's default incremental behavior differs from the standard approach used on databases that do support those statements (such as Snowflake and BigQuery).

To use incremental models, specify a partition_by clause in your model config. dbt will run an atomic insert overwrite statement that dynamically replaces all partitions included in your query. Be sure to re-select all of the relevant data for a partition when using this incremental strategy.

spark_incremental.sql
    {{ config(
        materialized='incremental',
        partition_by=['date_day'],
        file_format='parquet'
    ) }}

    /*
      Every partition returned by this query will be overwritten
      when this model runs
    */

    with new_events as (

        select * from {{ ref('events') }}

        {% if is_incremental() %}
        where date_day >= date_add(current_date, -1)
        {% endif %}

    )

    select
        date_day,
        count(*) as users

    from new_events
    group by 1
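
When this model runs, dbt stages the query's results and then issues an atomic insert overwrite limited to the partitions the query returned. The SQL below is a rough sketch of that behavior; the relation names (analytics.spark_incremental, spark_incremental__dbt_tmp) are illustrative, and the exact statements dbt generates may differ slightly.

    -- stage the incremental results
    create temporary view spark_incremental__dbt_tmp as

        with new_events as (

            select * from analytics.events

            where date_day >= date_add(current_date, -1)

        )

        select
            date_day,
            count(*) as users

        from new_events
        group by 1

    ;

    -- dynamically replace only the partitions returned above
    insert overwrite table analytics.spark_incremental
        partition (date_day)
        select `date_day`, `users` from spark_incremental__dbt_tmp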

The merge strategy

New in dbt-spark v0.15.3

This functionality is new in dbt-spark v0.15.3. See installation instructions

There are three prerequisites for the merge incremental strategy:

  • Creating the table in Delta file format
  • Using Databricks Runtime 5.1 and above
  • Specifying a unique_key

dbt will run an atomic merge statement which looks nearly identical to the default merge behavior on Snowflake and BigQuery.

delta_incremental.sql
    {{ config(
        materialized='incremental',
        file_format='delta',
        unique_key='user_id',
        incremental_strategy='merge'
    ) }}

    with new_events as (

        select * from {{ ref('events') }}

        {% if is_incremental() %}
        where date_day >= date_add(current_date, -1)
        {% endif %}

    )

    select
        user_id,
        max(date_day) as last_seen

    from new_events
    group by 1
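
As in the previous example, dbt first stages the incremental results in a temporary view and then merges them into the existing Delta table on the unique_key. A rough sketch of the resulting merge statement follows; the relation and alias names are illustrative, and the generated SQL may differ slightly.

    merge into analytics.delta_incremental as DBT_INTERNAL_DEST
        using delta_incremental__dbt_tmp as DBT_INTERNAL_SOURCE
        on DBT_INTERNAL_SOURCE.user_id = DBT_INTERNAL_DEST.user_id
        -- existing rows with a matching user_id are updated in place
        when matched then update set *
        -- new user_ids are inserted
        when not matched then insert *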

Persisting model descriptions

Relation-level docs persistence is supported in dbt v0.17.0. For more information on configuring docs persistence, see the docs.

When the persist_docs option is configured appropriately, you'll be able to see model descriptions in the Comment field of describe [table] extended or show table extended in [database] like '*'.
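
As a minimal sketch, relation-level persistence can be enabled in a model's config block as shown below. The model body and source table are placeholders, and this assumes relation-level persistence only, per the support described above.

    {{ config(
        materialized='table',
        persist_docs={'relation': true}
    ) }}

    -- after a run, the model's description should appear in the Comment field of:
    --   describe extended <schema>.<model_name>

    select * from {{ ref('events') }}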

Always schema, never database

New in dbt-spark v0.17.0

This is a breaking change in dbt-spark v0.17.0. See installation instructions

Apache Spark uses the terms "schema" and "database" interchangeably. dbt understands database to exist at a higher level than schema. As such, you should never use or set database as a node config or in the target profile when running dbt-spark.

If you want to control the schema/database in which dbt will materialize models, use the schema config and generate_schema_name macro only.
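
For example, a model can declare a custom schema in its config block; by default, dbt's generate_schema_name macro combines it with the target schema (roughly <target_schema>_<custom_schema>). The schema name below is illustrative.

    -- use `schema`, never `database`, in dbt-spark configs
    {{ config(
        schema='marketing'
    ) }}

    select * from {{ ref('events') }}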