dbt - Documentation

New in 0.13.0

Sources are new in dbt v0.13.0. If you're using a version older than 0.13.0, check out the upgrading instructions to use sources.

Overview

Sources make it possible to name and describe your source data tables. Sources are useful for

  1. selecting from source tables in your base models
  2. testing your assumptions about your source data
  3. calculating the freshness of your source data

Sources can be defined in schema.yml files alongside model definitions. For information about testing sources, check out the docs on testing in dbt.

Defining sources

Sources are defined in schema.yml files. If you're not already familiar with these files, be sure to check out the documentation on schema.yml files before proceeding. Use a sources block to define sources:


# This example defines two sources called `source_1` (containing 
# two tables) and `source_2` (containing one table).
# This is a minimal example of a source definition.

version: 2

sources:
  - name: source_1
    tables:
      - name: table_1
      - name: table_2

  - name: source_2
    tables:
      - name: table_1

Once a source has been defined, it can be referenced from a model using the source() function.

/*
    The source function accepts two arguments:
      1. The name of the source
      2. The name of the table in that source
*/

select * from {{ source('source_1', 'table_1') }}

/*
	This is compiled to:
  
    select * from "target_database"."source_1"."table_1"
    
*/

Configuring Sources

Overriding source database, schema, and identifier

By default, dbt will use the name of the specified source and table to construct a table name for the source table. The database, schema, and identifier for any given source can be overridden in the schema.yml file. This is useful if your source data table names are unduly long or confusing.


# This source entry describes the table:
#   "raw"."public"."Orders_"
#
# It can be referenced with:
#  {{ source('ecommerce', 'orders') }}

version: 2

sources:
  - name: ecommerce
    database: raw  # Tell dbt to look for the source in the "raw" database
    schema: public # You wouldn't put your source data in public, would you?
    tables:
      - name: orders
        identifier: Orders_ # To alias table names to account for strange casing or naming of tables

Overriding quoting

By default, dbt will not quote the database, schema, or identifier for the source tables that you've specified. To force dbt to quote one of these values, use the quoting: config. This config can be specified for all tables in a source, or for a specific source table. Quoting configs defined for a specific source table override the quoting configs specified for the top-level source.

For more about quoting see Configuring quoting.

version: 2

sources:
  - name: ecommerce
    # Set default quoting configs for the `ecommerce` source
    quoting:
      database: false
      schema: false
      identifier: false
      
    tables:
      - name: order_item
      - name: order
        # This overrides the `ecommerce` quoting config
        quoting:
          identifier: true

Complete example

version: 2

sources:
  - name: ecommerce
    database: raw
    schema: public
    loader: emr # informational only (free text)
    loaded_at_field: _loaded_at # configure for all sources
    
    quoting:
      database: false
      schema: false
      identifier: false
      
    tables:
      - name: orders
        identifier: Orders_
        loaded_at_field: updated_at # override source defaults
        columns:
          - name: id
            tests:
              - unique
              
          - name: price_in_usd
            tests:
              - not_null
              
      - name: Users
        quoting:
          identifier: true # override source defaults
        columns:
          - name: id
            tests:
              - unique

Selecting sources

Sources can be "selected" in dbt runs with the --models flag. While sources themselves cannot be run, it is frequently useful to run all of the models that depend on a source, eg. after that source has finished loading. Sources can be selected by using the source: selector coupled with:

  1. the name of the source
  2. the name of the source and a table, dot-separated
Run all of the models downstream of a source
$ dbt run --model source:ecommerce+

Run all of the models downstream of a specific source table
$ dbt run --model source:ecommerce.orders+

Snapshotting source data freshness

With a couple of extra configs, dbt can optionally snapshot the "freshness" of the data in your source tables. This is useful for understanding if your data pipelines are in a healthy state, and is a critical component of defining SLAs for your warehouse.

To configure sources to snapshot freshness information, add a freshness block to your source and loaded_at_field to your table declaration:


version: 2

sources:
  - name: snowplow
    database: raw
    loader: emr # optional, informational only
    
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    
    tables:
      - name: event
        loaded_at_field: collector_tstamp # required for freshness snapshotting

In the freshness block, one or both of warn_after and error_after can be provided. If neither is provided, then dbt will not calculate freshness snapshots for the tables in this source.

Additionally, the loaded_at_field is required to calculate freshness for a table. If a loaded_at_field is not provided, then dbt will not calculate freshness for the table.

These configs are applied hierarchically, so freshness and loaded_at field values specified for a source will flow through to all of the tables defined in that source. This is useful when all of the tables in a source have the same loaded_at_field, as the config can just be specified once in the top-level source definition.

Freshness declarations

The freshness: block can accept one or both of warn_after and error_after. Each of these fields requires the specification of a count and a period. The count can be any positive integer, and the period can be one of minute, hour, or day.

The following example shows a source schema configured with a freshness specification. The second table, web_page overrides freshness to null, indicating that it will not participate in freshness tests.


version: 2

sources:
  - name: snowplow
    database: raw
    loader: emr
    
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
      
    loaded_at_field: event_time
    
    tables:
      - name: event

      - name: web_page
        freshness: null

Snapshotting freshness

To snapshot freshness information for your sources, use the dbt source snapshot-freshness command:

$ dbt source snapshot-freshness [--select [source_1, ...]]

Use the --select flag to snapshot freshness for specific sources. Eg:

# Snapshot freshness for all Snowplow tables:
$ dbt source snapshot-freshness --select snowplow

# Snapshot freshness for a particular source table:
$ dbt source snapshot-freshness --select snowplow.event

See the sources reference for more information on the dbt source snapshot-freshness command.

Output

The snapshot-freshness command will output a pass/warning/error status for each table selected in the freshness snapshot. Additionally, dbt will write the freshness results to a file in the target/ directory called sources.json by default. To override this destination, use the -o flag to the snapshot-freshness command.

Sources


Suggested Edits are limited on API Reference Pages

You can only suggest edits to Markdown body content, but not to the API spec.