dbt - Documentation

Performance Optimization

dbt supports several data warehouse features designed to improve the performance of your models.

Amazon Redshift

Using sortkey and distkey

Tables in Amazon Redshift have two powerful optimizations to improve query performance: distkeys and sortkeys. Supplying these values as model-level configurations apply the corresponding settings in the generated CREATE TABLE DDL. Note that these settings will have no effect for models set to view or ephemeral models.

  • dist can have a setting of all, even, or the name of a key.
  • sort accepts a list of sort keys, for example: ['timestamp', 'userid']. dbt will build the sort key in the same order the fields are supplied.
  • sort_type can have a setting of interleaved or compound. if no setting is specified, sort_type defaults to compound.

Sort and dist keys should be added to the {{ config(...) }} block in model .sql files, eg:

-- Example with one sort key
{{ config(materialized='table', sort='id', dist='received_at') }}

select ...
  
  
-- Example with multiple sort keys
{{ config(materialized='table', sort=['id', 'category'], dist='received_at') }}

select ...
  
  
-- Example with interleaved sort keys
{{ config(materialized='table',
          sort_type='interleaved'
          sort=['id', 'category'],
          dist='received_at')
}}

select ...

For more information on distkeys and sortkeys, view Amazon's docs:

Google BigQuery

Using table partitioning and clustering

Partition clause

BigQuery supports the use of a partition by clause to easily partition a table by a column or expression. This option can help decrease latency and cost when querying large tables. Note that the expression provided as the partition_by field must be of type Date, and that partition pruning only works when comparing using scalar/literal values (so selecting partitions using a subquery won't improve performance).

{{ config(materialized='table', partition_by='DATE(created_at)') }}

select
  user_id,
  event_name,
  created_at
  
from {{ ref('events') }}

Because BigQuery only prunes partitions when filters are provided literally, it may be useful to use a statement block to generate filter clauses:


{% call statement('get_date', fetch_result=True) %}

  SELECT
    min(created_at) as min_date
    
  FROM {{ ref('bigquery_table') }}

{% endcall %}

{% set min_date = load_result('get_date').table.columns['min_date'].values()[0] %}

select
  user_id,
  event_name,
  created_at

from {{ ref('bigquery_table') }}
where created_at >= DATE('{{ min_date }}')

Clustering Clause

BigQuery tables can be clustered to colocate related data. At present, BigQuery is only able to cluster partitioned tables, so be sure to use the cluster_by config in conjunction with the partition_by config.

Clustering on a single column:

{{
  config(
    materialized = "table",
    partition_by = "date(created_at)",
    cluster_by = "order_id",
  )
}}

select * from ...

Clustering on a multiple columns:

{{
  config(
    materialized = "table",
    partition_by = "date(created_at)",
    cluster_by = ["customer_id", "order_id"],
  )
}}

select * from ...