Snapshots

Overview

Commonly, analysts need to "look back in time" at some previous state of data in their mutable tables. While some source data systems are built in a way that makes accessing historical data possible, this is often not the case. dbt provides a mechanism, called Snapshots, that records changes to a mutable table over time.

What are Snapshots?

Snapshots implement type-2 Slowly Changing Dimensions over mutable source tables. These Slowly Changing Dimensions (or SCDs) identify how a row in a table changes over time. Imagine you have an Orders table where the status field can be overwritten as the order is processed.

| order_id | status | updated_at |
| --- | --- | --- |
| 1 | pending | 2019-01-01 |

Now, imagine that the order goes from "pending" to "shipped". That same record will now look like:

| order_id | status | updated_at |
| --- | --- | --- |
| 1 | shipped | 2019-01-02 |

This order is now in the "shipped" state, but we've lost the information about when the order was last in the "pending" state. This makes it difficult (or impossible) to analyze how long it took for an order to ship. dbt can "snapshot" these changes to help you understand how values in a row change over time. Here's an example of a snapshot table for the previous example:

| order_id | status | updated_at | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- |
| 1 | pending | 2019-01-01 | 2019-01-01 | 2019-01-02 |
| 1 | shipped | 2019-01-02 | 2019-01-02 | null |

dbt creates this Snapshot table by copying the structure of your source table, and then adding some helpful metadata fields. The dbt_valid_from and dbt_valid_to columns indicate the historical state for a given record. The current value for a row is represented with a null value for dbt_valid_to.

Snapshot meta-fields

Snapshot tables will be created as a clone of your source dataset, plus some additional meta-fields.

| Field | Meaning | Usage |
| --- | --- | --- |
| dbt_valid_from | The timestamp when this snapshot row was first inserted | This column can be used to order the different "versions" of a record. |
| dbt_valid_to | The timestamp when this row became invalidated | The most recent snapshot record will have dbt_valid_to set to null. |
| dbt_scd_id | A unique key generated for each snapshotted record | This is used internally by dbt. |
| dbt_updated_at | The updated_at timestamp of the source record when this snapshot row was inserted | This is used internally by dbt. |
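These meta-fields make point-in-time queries straightforward. As a sketch (assuming the analytics.snapshots.orders_snapshot table from the example above), you might query the current state of each order, or its state as of a given date, like so:

```sql
-- Current state of every order: the live version of a row
-- has dbt_valid_to set to null
select *
from analytics.snapshots.orders_snapshot
where dbt_valid_to is null;

-- State of every order as of 2019-01-01: the version that was
-- valid at that point in time
select *
from analytics.snapshots.orders_snapshot
where dbt_valid_from <= '2019-01-01'
  and (dbt_valid_to > '2019-01-01' or dbt_valid_to is null);
```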

How do I add snapshots to my dbt project?

Snapshots are defined in .sql files using snapshot blocks. dbt will look for snapshot blocks in the paths defined by the snapshot-paths configuration in your dbt_project.yml file. By default, this path is set to snapshots/. An example snapshot block might look like this:

snapshots/orders_snapshot.sql

```sql
/*
    This snapshot table will live in:
      analytics.snapshots.orders_snapshot
*/

{% snapshot orders_snapshot %}

    {{
        config(
          target_database='analytics',
          target_schema='snapshots',
          unique_key='id',

          strategy='timestamp',
          updated_at='updated_at',
        )
    }}

    -- Pro-Tip: Use sources in snapshots!
    select * from {{ source('ecom', 'orders') }}

{% endsnapshot %}
```

Snapshot blocks make it possible to snapshot a query. In general, it is most useful to snapshot a select statement over a single table. If you're thinking about snapshotting a join or a union, consider instead snapshotting each table separately, then re-combining them in a model.

The snapshot table generated by dbt will be created with the target_database, target_schema, and name defined in the snapshot block. The snapshot block shown above (orders_snapshot) will be rendered into a table called analytics.snapshots.orders_snapshot.
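As an illustration of the "snapshot each table separately" advice above, the re-combination happens in an ordinary downstream model. This is a minimal sketch, assuming hypothetical orders_snapshot and customers_snapshot snapshots and a customer_id foreign key (names are assumptions, not part of the example project above):

```sql
-- models/orders_with_customers.sql (hypothetical model)
-- Join the *current* version of each snapshotted table, rather than
-- snapshotting a join directly.
select
    orders.order_id,
    orders.status,
    customers.email
from {{ ref('orders_snapshot') }} as orders
join {{ ref('customers_snapshot') }} as customers
    on orders.customer_id = customers.id
where orders.dbt_valid_to is null
  and customers.dbt_valid_to is null
```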

How does dbt know which rows have changed?

Snapshot "strategies" define how dbt knows if a row has changed. There are two strategies built into dbt, but other strategies can be created as macros in dbt projects. While each strategy requires its own configuration, the unique_key config is required for all snapshot strategies.

Timestamp Strategy

The timestamp strategy uses an updated_at field to determine if a row has changed. If the configured updated_at column for a row is more recent than the last time the snapshot ran, then dbt will invalidate the old record and record the new one. If the timestamps are unchanged, then dbt will not take any action.

The timestamp strategy requires the following configurations:

| Config | Description | Example |
| --- | --- | --- |
| updated_at | A column or expression which represents when the source row was last updated | updated_at |

Example usage:

snapshots/timestamp_example.sql

```sql
{% snapshot orders_snapshot_timestamp %}

    {{
        config(
          target_schema='snapshots',
          strategy='timestamp',
          unique_key='id',
          updated_at='updated_at',
        )
    }}

    select * from {{ source('ecom', 'orders') }}

{% endsnapshot %}
```

Check Strategy

The check strategy is useful for tables which do not have a reliable updated_at column. This strategy works by comparing a list of columns between their current and historical values. If any of these columns have changed, then dbt will invalidate the old record and record the new one. If the column values are identical, then dbt will not take any action.

The check strategy requires the following configurations:

| Config | Description | Example |
| --- | --- | --- |
| check_cols | A list of columns to check for changes, or all to check all columns | ["name", "email"] |

Example Usage

snapshots/check_example.sql

```sql
{% snapshot orders_snapshot_check %}

    {{
        config(
          target_schema='snapshots',
          strategy='check',
          unique_key='id',
          check_cols=['status', 'is_cancelled'],
        )
    }}

    select * from {{ source('ecom', 'orders') }}

{% endsnapshot %}
```

Configuring snapshots

Configuration reference

The following snapshot configurations are supported:

| Config | Description | Required? | Example |
| --- | --- | --- | --- |
| target_database | The database that dbt should render the snapshot table into | No | analytics |
| target_schema | The schema that dbt should render the snapshot table into | Yes | snapshots |
| strategy | The snapshot strategy to use. One of timestamp or check | Yes | timestamp |
| unique_key | A primary key column or expression for the record | Yes | order_id |
| check_cols | If using the check strategy, the columns to check | Only if using the check strategy | ["status"] |
| updated_at | If using the timestamp strategy, the timestamp column to compare | Only if using the timestamp strategy | updated_at |

Configuring snapshots in dbt_project.yml

The snapshots in your dbt project can be configured using the snapshots: key in your dbt_project.yml file. This configuration is analogous to the models: config described in the Projects documentation.

Example usage:

dbt_project.yml

```yml
name: my_project
version: 1.0.0

...

snapshots:
  my_project:
    transient: false
    target_database: snapshots
    post-hook: "grant select on {{ this }} to reader"
```

Configuration best practices

Use the timestamp strategy where possible

The timestamp strategy handles column additions and deletions better than the check strategy.

Ensure your unique key is really unique

The unique key is used by dbt to match rows up, so it's extremely important to make sure this key is actually unique! If you're snapshotting a source, we'd recommend adding a uniqueness test to that source.
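As an illustration, a uniqueness test can be declared in a schema file alongside your source definitions. This is a sketch assuming the ecom.orders source from the examples above (the file path is an assumption):

```yml
# models/sources.yml (illustrative)
version: 2

sources:
  - name: ecom
    tables:
      - name: orders
        columns:
          - name: id
            tests:
              - unique
              - not_null
```

Running dbt test will then fail loudly if the id column ever contains duplicates, before bad data can corrupt your snapshot.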

Use a target_schema that is separate from your analytics schema

Snapshots cannot be rebuilt. As such, it's a good idea to put snapshots in a separate schema so end users know they are special. From there, you may want to set different privileges on your snapshots compared to your models, and even run them as a different user (or role, depending on your warehouse) to make it very difficult to drop a snapshot unless you really want to.

Writing snapshot queries

Snapshot query best practices

With regards to the specific query to write in your snapshot, we recommend that you:

  • Snapshot source data. Your models should then select from these snapshots, treating them like regular data sources. As much as possible, snapshot your source data in its raw form and use downstream models to clean up the data.
  • Ensure your unique key is really unique.
  • Use the source function in your query. This helps when understanding data lineage in your project.
  • Include as many columns as possible. In fact, go for select * if performance permits! Even if a column doesn't feel useful at the moment, it might be better to snapshot it in case it becomes useful – after all, you won't be able to recreate the column later.
  • Avoid joins in your snapshot query. Joins can make it difficult to build a reliable updated_at timestamp. Instead, snapshot the two tables separately, and join them in downstream models.
  • Limit the amount of transformation in your query. If you apply business logic in a snapshot query, and this logic changes in the future, it can be impossible (or, at least, very difficult) to apply the change in logic to your snapshots.

Basically – keep your query as simple as possible! Some reasonable exceptions to these recommendations include:

  • Selecting specific columns if the table is wide.
  • Doing light transformation to get data into a reasonable shape, for example, unpacking a JSON blob to flatten your source data into columns.
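As an illustration of the JSON-unpacking exception, a snapshot query might flatten a payload into columns before snapshotting it. This sketch assumes a hypothetical ecom.events source with a variant payload column, and uses Snowflake-style JSON syntax (other warehouses use different functions for this):

```sql
{% snapshot events_snapshot %}

    {{
        config(
          target_schema='snapshots',
          strategy='timestamp',
          unique_key='event_id',
          updated_at='updated_at',
        )
    }}

    -- Light transformation only: unpack the JSON payload into
    -- columns so the snapshot has a usable unique_key and updated_at
    select
        payload:event_id::integer     as event_id,
        payload:status::varchar       as status,
        payload:updated_at::timestamp as updated_at
    from {{ source('ecom', 'events') }}

{% endsnapshot %}
```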

Running snapshots

Snapshots can be run using the dbt snapshot command. This command will find and run all of the snapshots defined in your project.

To run a subset of snapshots, use the --select flag, supplying a selector (for example, dbt snapshot --select orders_snapshot). For more information on resource selectors, consult the documentation on node selection syntax.

Example usage

```shell
$ dbt snapshot
Running with dbt=0.14.0

15:07:36 | Concurrency: 8 threads (target='dev')
15:07:36 |
15:07:36 | 1 of 2 START snapshot snapshots.orders_snapshot...... [RUN]
15:07:36 | 2 of 2 START snapshot snapshots.users_snapshot....... [RUN]
15:07:36 | 1 of 2 OK snapshot snapshots.orders_snapshot......... [SELECT 3 in 1.82s]
15:07:36 | 2 of 2 OK snapshot snapshots.users_snapshot.......... [SELECT 3 in 3.47s]
15:07:36 |
15:07:36 | Finished running 2 snapshots in 0.68s.

Completed successfully

Done. PASS=2 ERROR=0 SKIP=0 TOTAL=2
```

Scheduling your snapshots

Snapshots are a batch-based approach to change data capture. The dbt snapshot command must be run on a schedule to ensure that changes to tables are actually recorded! While individual use-cases may vary, snapshots are intended to be run between hourly and daily. If you find yourself snapshotting more frequently than that, consider whether there is a more appropriate way to capture changes in your source data tables.

Schema changes

When the schema of your source query changes, dbt will attempt to reconcile the schema change in the destination snapshot table. dbt does this by:

1. Creating new columns from the source query in the destination table
2. Expanding the size of string types where necessary (e.g. varchars on Redshift)

dbt will not delete columns in the destination snapshot table if they are removed from the source query. It will also not change the type of a column beyond expanding the size of varchar columns. That is, if a string column is changed to a date column in the snapshot source query, dbt will not attempt to change the type of the column in the destination table.

Treat snapshots like sources

dbt is built around the idea that all of your data transformations should be idempotent and completely rebuildable from scratch. Snapshots break this paradigm due to the nature of the problem that they solve. Because snapshots capture changes in source tables, they need to be running constantly in order to record changes to mutable tables as they occur.

As such, it's typical to only have one snapshot table per data source for all dbt users, rather than one snapshot per user. In this way, snapshot tables are more similar to source tables than they are to proper dbt models.