dbt - Documentation

Design Patterns

The DRY Principle and SQL

The DRY principle in software engineering refers to "Don't Repeat Yourself" - the main aim of which is to reduce repetition of code, increasing maintainability.

To see how this applies to SQL, consider a data model in which an order can have many payments. Each payment may have a payment_method of bank_transfer, credit_card or gift_card, and therefore each order can have multiple payment_methods

From an analytics perspective, it may be useful to know how much of each order was paid for with each payment_method. As a data analyst using dbt, you can create a model, named order_payment_method_amounts, with the following SQL:

select
order_id,
sum(case when payment_method = 'bank_transfer' then amount else 0 end) as bank_transfer_amount,
sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount,
sum(case when payment_method = 'gift_card' then amount else 0 end) as gift_card_amount,
sum(amount) as total_amount
from app_data.payments
group by 1

The above code is not DRY (sometimes referred to as "WET" code) as the code for the fields bank_transfer_amount, credit_card_amount and gift_card_amount is repeated.

This code can be difficult to maintain for a number of reasons:

  • If the logic or field name were to change, the code would need to be updated in three places.
  • Often this code is created by copying and pasting, which may lead to mistakes.
  • Other analysts that review the code are less likely to notice errors as it's common to only scan through repeated code.

Part of the dbt viewpoint is that analysts should use the same techniques and follow the same principles as software engineers. However, this common pattern in SQL conflicts with the DRY principle.

Keeping it DRY with Jinja

One of dbt's most powerful features is the ability to combine SQL with Jinja.

Jinja is a lightweight templating language, which, when combined with SQL, provides a way to use control structures (e.g. if statements and for loops) in your queries. When used correctly, Jinja will be compiled to runnable SQL that can be executed against your data warehouse.

Using Jinja enables analysts to write DRY SQL. For example, in the query above, the repeated code can be replaced with a for loop. The following will be compiled to the same query, but is significantly easier to maintain.

select
order_id,
{% for payment_method in ["bank_transfer", "credit_card", "gift_card"] %}
sum(case when payment_method = '{{payment_method}}' then amount else 0 end) as {{payment_method}}_amount,
{% endfor %}
sum(amount) as total_amount
from app_data.payments
group by 1

Jinja can be used in any SQL code in a dbt project, including models, macros, analyses, tests, and even hooks .

To compile Jinja into runnable sql, run dbt compile, then look for the compiled SQL file in the target/compiled/{project name}/ directory.

Using Macros

Macros in Jinja are pieces of code that can be called multiple times – they are analogous to a function in Python, and are extremely useful if you find yourself repeating code across multiple models.

The for loop from above could be rewritten as a macro. Macros are written in a SQL files in the macros directory.

{% macro get_payment_method_amounts(payment_methods) %}
{% for payment_method in payment_methods -%}
sum(case when payment_method = '{{payment_method}}' then amount else 0 end) as {{payment_method}}_amount
{%- if not loop.last -%} , {%- endif %}
{% endfor -%}
{% endmacro %}

Now the order_payment_method_amounts model becomes:

{% set payment_methods = ["bank_transfer", "credit_card", "gift_card"] %}
select
order_id,
{{get_payment_method_amounts(payment_methods)}},
sum(amount) as total_amount
from app_data.payments
group by 1

Writing this SQL snippet as a macro means that you can reuse your code in other parts of your project. For example, this macro could be used to sum the payments made with each payment_method for each customer.

select
orders.customer_id,
{{get_payment_method_amounts(["bank_transfer", "credit_card", "gift_card"])}},
sum(amount) as total_amount
from app_data.orders
left join app_data.payments on payments.order_id = orders.id
group by 1

Useful Jinja patterns for macros

There are a number of Jinja patterns used in the get_payment_method_amounts macro that are particularly useful when writing Jinja in a dbt project:

  • {% set … %} can be used to assign values to a variable, which can increase the readability of dbt code.
  • loop.last evaluates to True on the last iteration of a loop. This is useful for ensuring you do not have a final trailing comma in your SQL. Alternatively, if your SQL style favours leading commas, you can use loop.first to the same effect.
  • Macros that are formatted for readability may compile to hard-to-read SQL due to additional whitespace. Using a minus sign (-) at the start or end of a block strips whitespace before or after the block respectively.

Building Dynamic Models

In the above example, a list was passed to the get_payment_method_amounts macro. If a new payment_method was introduced, or one of the existing methods was renamed, the list would need to be updated.

However, at any given time you could know what payment_methods are used to make a payment by running the following query:

select distinct
payment_method
from app_data.payments
order by 1

Statements provide a way to run this query and return the results to your Jinja context. This means that the list of payment_methods can be set based on the data in your database rather than a hardcoded value.

The following statement would return payment_methods when executed.

{% call statement('payment_methods', fetch_result=True) %}

  select distinct
  payment_method
  from app_data.payments
  order by 1

{% endcall %}

The result of this statement is accessible using the load_result function, which can then be reshaped into a list.

{% set payment_methods = load_result('payment_methods').table.columns['payment_method'].values() -%}

-- Log to the console
{{log(payment_methods, info=True)}}

Use {{log(...)}} to debug JInja

Using the log function helps debug Jinja by printing objects to the console.

Now the object payment_methods resembles the original list, but has a new payment method (coupon) in it.

>>>
["bank_transfer", "coupon", "credit_card", "gift_card"]

Updating the get_payment_method_amounts macro to use a statement gives the following:

{% macro get_payment_method_amounts() %}

  {%- call statement('payment_methods', fetch_result=True) -%}

   select distinct
   payment_method
   from app_data.payments
   order by 1

  {%- endcall -%}

  {%- set payment_methods = load_result('payment_methods').table.columns['payment_method'].values() -%}

  {% for payment_method in payment_methods -%}

    sum(case when payment_method = '{{payment_method}}' then amount else 0 end) as {{payment_method}}_amount
    {%- if not loop.last -%} , {%- endif %}

  {% endfor -%}

{% endmacro %}

As payment_methods is being retrieved dynamically, get_payment_method_amounts no longer needs to be passed a list of strings. Note that it needs to be passed an empty argument list (()).

The order_payment_method_amounts model can now be written as:

select
order_id,
{{get_payment_method_amounts()}},
sum(amount) as total_amount
from app_data.payments
group by 1

Any future payment_method added to the underlying data model will automatically be handled by the dbt model.

Writing Modular Macros

You may wish to reuse the list of payment_methods elsewhere in your dbt project. To design for this, you can break your macro into two:

{% macro get_payment_methods() %}

  {%- call statement('payment_methods', fetch_result=True) -%}

    select distinct
    payment_method
    from app_data.payments
    order by 1

  {%- endcall -%}

  {%- set payment_methods = load_result('payment_methods').table.columns['payment_method'].values() -%}

  {{return(payment_methods)}}

{% endmacro %}


{% macro get_payment_method_amounts() %}

  {% for payment_method in get_payment_methods() -%}

    sum(case when payment_method = '{{payment_method}}' then amount else 0 end) as {{payment_method}}_amount
    {%- if not loop.last -%} , {%- endif %}

  {% endfor -%}

{% endmacro %}

Use {{return(...)}} to return data

Note that the get_payment_methods macro uses the dbt function return.
This function returns data in its original data type (e.g. list or dictionary). Without this function, a string would be returned by the get_payment_methods() macro.

Now that get_payment_methods is a separate macro, you can write other macros that use this list more easily. For example, you can easily write a macro that counts the payments from each payment method:

{% macro get_n_payment_method_payments() %}

  {% for payment_method in get_payment_methods() -%}

    count(case when payment_method = '{{payment_method}}' then id else null end) as n_{{payment_method}}_payments
    {%- if not loop.last -%} , {%- endif %}

  {% endfor -%}

{% endmacro %}

Macros let analysts bring software engineering principles to the SQL they write. One of the features of macros that makes them even more powerful is their ability to be shared across projects.

A number of useful dbt macros have already been written in the dbt-utils package. For example, the get_column_values macro could be used instead of the get_payment_method_amounts macro written above. dbt-utils can be installed as a package in your project, along with any other packages you write.