dbt - Documentation

Best Practices

This page contains the collective wisdom of early users of dbt on how to best use it in your analytics work. Observing these best practices will help your analytics team work as effectively as possible.

Limit dependencies on raw data

Your project will depend on raw data stored in your database. We recommend making what we call "base models" to minimize the dependencies on raw data tables. In this convention, base models can have the following responsibilities:

  • Select only the fields that are relevant for current analytics to limit complexity. More fields can always be added later.
  • Perform any needed type conversion.
  • Perform field aliasing to rationalize field names into a standard format used within the project.
  • Act as the sole access point to a given raw data table.

In this convention, all subsequent data models are built on top of base models rather than on top of raw data—only base models are allowed to select from raw data tables. This ensures both that all of the transformations within the base model will be applied to all uses of this data and that if the source data table moves (or is located in a different schema or table in a different environment) it can be renamed in a single place.

Base models should select from sources as shown here:

	See the docs on Using Sources for more information
  about select from "sources"

  fname as first_name,
  cast(ts as date) as date_day

from {{ source('app_data', 'users') }}

Start by building views

While you're building new dbt models, it's common to default to materialize these new models as views. Views deploy extremely quickly and they have few configuration options. Since model behavior is controlled by configuration, you can update the configuration for a specific model as your project needs evolve without modifying the model code.

In order to configure your project to default to views, you'll want to put the following into your dbt_project.yml:

    enabled: true
    materialized: view

Managing multiple environments

dbt supports multiple targets within a given project within ~/.dbt/profiles.yml. Users can configure a default target and can override this setting with the --target flag passed to dbt run. We recommend setting your default target to your development environment, and then switch to your production target specifically to deploy to production.

Using target to manage multiple environments gives you the flexibility to set up your environments how you choose. Commonly, environments are managed by schemas within the same database: all test models are deployed to a schema called dbt_[username] and production models are deployed to a schema called analytics. An ideal setup would have production and test databases completely separate. Either way, we highly recommend maintaining multiple environments and managing deployments with target.

Source control workflows

We believe that all dbt projects should be managed via source control. We use git for all of our source control, and use branching and pull requests to keep the master branch the sole source of organizational truth.

Using dbt interactively

When your project gets large enough, dbt run can take a while. dbt provides three primary ways to address this so that you can deploy changes to your database quickly:

  1. Use views instead of tables to the greatest extent possible in development. Views typically deploy much faster than tables, and in development it's often not critical that subsequent analytic queries run as fast as possible. It's easy to change this setting later and it will have no impact on your business logic.
  2. Use dbt_project.yml to disable portions of your project that you're not currently working on. If you have multiple modules within a given project, turn off the ones that you're not currently working on so that those models don't deploy with every dbt run.
  3. Pass the --models flag to dbt run. This flag causes dbt to only deploy the models you specify and their dependents. If you're working on a particular model, this can make a very significant difference in your deployment time in large projects.