Skip to main content

Developer Blog | dbt Developer Hub

Find tutorials, product updates, and developer insights in the dbt Developer Blog.

Start here

· 15 min read
Ian Fahey

The larger a data ecosystem gets, the more its users and stakeholders expect consistency. As the ratio of data models to team members (to say nothing of stakeholders to team members) skyrockets, an agreed-upon modeling pattern often acts as scaffolding around that growth.

The biggest tool in the toolbox today, dimensional modeling, offers enough consistency to make it the dominant approach in the space, but what might be possible if we shut that toolbox, took a break from our workbench, and instead strolled over to our bookshelf?

In other words, what if we told a story?

· 15 min read
Bennie Regenold
Barr Yaron

When running a job that has over 1,700 models, how do you know what a “good” runtime is? If the total process takes 3 hours, is that fantastic or terrible? While there are many possible answers depending on dataset size, complexity of modeling, and historical run times, the crux of the matter is normally “did you hit your SLAs”? However, in the cloud computing world where bills are based on usage, the question is really “did you hit your SLAs and stay within budget”?

Here at dbt Labs, we used the Model Timing tab in our internal analytics dbt project to help us identify inefficiencies in our incremental dbt Cloud job that eventually led to major financial savings, and a path forward for periodic improvement checks.

· 13 min read
Benoit Perigaud

Editor's note — since the creation of this post, the package pre-commit-dbt's ownership has moved to another team and it has been renamed to dbt-checkpoint. A redirect has been set up, meaning that the code example below will still work. It is also possible to replace repo: https://github.com/offbi/pre-commit-dbt with repo: https://github.com/dbt-checkpoint/dbt-checkpoint in your .pre-commit-config.yaml file.

At dbt Labs, we have best practices we like to follow for the development of dbt projects. One of them, for example, is that all models should have at least unique and not_null tests on their primary key. But how can we enforce rules like this?

That question becomes difficult to answer in large dbt projects. Developers might not follow the same conventions. They might not be aware of past decisions, and reviewing pull requests in git can become more complex. When dbt projects have hundreds of models, it's hard to know which models do not have any tests defined and aren't enforcing your conventions.

· 7 min read
Jeremy Cohen
Doug Beatty

If you’ve needed to grant access to a dbt model between 2019 and today, there’s a good chance you’ve come across the "The exact grant statements we use in a dbt project" post on Discourse. It explained options for covering two complementary abilities:

  1. querying relations via the "select" privilege
  2. using the schema those relations are within via the "usage" privilege

· 11 min read
Matt Winkler

Stored procedures are widely used throughout the data warehousing world. They’re great for encapsulating complex transformations into units that can be scheduled and respond to conditional logic via parameters. However, as teams continue building their transformation logic using the stored procedure approach, we see more data downtime, increased data warehouse costs, and incorrect / unavailable data in production. All of this leads to more stressed and unhappy developers, and consumers who have a hard time trusting their data.

If your team works heavily with stored procedures, and you ever find yourself with the following or related issues:

  • dashboards that aren’t refreshed on time
  • It feels too slow and risky to modify pipeline code based on requests from your data consumers
  • It’s hard to trace the origins of data in your production reporting

It’s worth considering if an alternative approach with dbt might help.

· 15 min read
Grace Goheen

There are many reasons you, as an analytics engineer, may want to capture the complete version history of data:

  • You’re in an industry with a very high standard for data governance
  • You need to track big OKRs over time to report back to your stakeholders
  • You want to build a window to view history with both forward and backward compatibility

These are often high-stakes situations! So accuracy in tracking changes in your data is key.

· 5 min read
Kira Furuichi

In general, data people prefer the more granular over the less granular. Timestamps > dates, daily data > weekly data, etc.; having data at a more granular level always allows you to zoom in. However, you’re likely looking at your data at a somewhat zoomed-out level—weekly, monthly, or even yearly. To do that, you’re going to need a handy dandy function that helps you round out date or time fields.

The DATE_TRUNC function will truncate a date or time to the first instance of a given date part. Wordy, wordy, wordy! What does this really mean? If you were to truncate 2021-12-13 out to its month, it would return 2021-12-01 (the first day of the month).

Using the DATE_TRUNC function, you can truncate to the weeks, months, years, or other date parts for a date or time field. This can make date/time fields easier to read, as well as help perform cleaner time-based analyses.

· 5 min read
Kira Furuichi

“How long has it been since this customer last ordered with us?”

“What is the average number of days to conversion?”

Business users will have these questions, data people will have to answer these questions, and the only way to solve them is by calculating the time between two different dates. Luckily, there’s a handy DATEDIFF function that can do that for you.

The DATEDIFF function will return the difference in specified units (ex. days, weeks, years) between a start date/time and an end date/time. It’s a simple and widely used function that you’ll find yourself using more often than you expect.

· 16 min read
Lauren Benezra

Let’s set the scene. You are an analytics engineer at your company. You have several relational datasets flowing through your warehouse, and, of course, you can easily access and transform these tables through dbt. You’ve joined together the tables appropriately and have near-real time reporting on the relationships for each entity_id as it currently exists.

But, at some point, your stakeholder wants to know how each entity is changing over time. Perhaps, it is important to understand the trend of a product throughout its lifetime. You need the history of each entity_id across all of your datasets, because each related table is updated on its own timeline.

What is your first thought? Well, you’re a seasoned analytics engineer and you know the good people of dbt Labs have a solution for you. And then it hits you — the answer is snapshots!

· 3 min read
Kira Furuichi

We’ve likely been here: Table A has 56 columns and we want to select all but one of them (column_56). So here we go, let’s get started…

select
column_1,
column_2,
column_3,
please_save_me…
from {{ ref('table_a') }}

At this point, you realize your will to continue typing out the next 52 columns has essentially dwindled down to nothing and you’re probably questioning the life choices that led you here.

But what if there was a way to make these 56+ lines of code come down to a handful? Well, that’s where a handy dbt macro comes into play.

· 16 min read
Christine Berger

If you're reading this article, it looks like you're wondering how you can better optimize your Redshift queries - and you're probably wondering how you can do that in conjunction with dbt.

In order to properly optimize, we need to understand why we might be seeing issues with our performance and how we can fix these with dbt sort and dist configurations.

· 13 min read
Pat Kearns

Analytics engineers (AEs) are constantly navigating through the names of the models in their project, so naming is important for maintainability in your project in the way you access it and work within it. By default, dbt will use your model file name as the view or table name in the database. But this means the name has a life outside of dbt and supports the many end users who will potentially never know about dbt and where this data came from, but still access the database objects in the database or business intelligence (BI) tool.

Model naming conventions are usually made by AEs, for AEs. While that’s useful for maintainability, it leaves out the people who model naming is supposed to primarily benefit: the end users. Good model naming conventions should be created with one thing in mind: Assume your end-user will have no other context than the model name. Folders, schema, and documentation can add additional context, but they may not always be present. Your model names will always be shown in the database.

· 4 min read
Kira Furuichi

There are so many different date functions in SQL—you have DATEDIFF, DATEADD, DATE_PART, and DATE_TRUNC to name a few. They all have their different use cases and understanding how and when they should be used is a SQL fundamental to get down. Are any of those as easy to use as the EXTRACT function? Well, that debate is for another time…

In this post, we’re going to give a deep dive into the EXTRACT function, how it works, and why we use it.

· 11 min read
Lauren Benezra

Hey data champion — so glad you’re here! Sometimes datasets need a team of engineers to tackle their deduplification (totz a real word), and that’s why we wrote this down. For you, friend, we wrote it down for you. You’re welcome!

Let’s get rid of these dupes and send you on your way to do the rest of the super-fun-analytics-engineering that you want to be doing, on top of super-sparkly-clean data. But first, let’s make sure we’re all on the same page.

· 4 min read
Kira Furuichi

We’ve all been there:

  • In a user signup form, user A typed in their name as Kira Furuichi, user B typed it in as john blust, and user C wrote DAvid KrevitT (what’s up with that, David??)
  • Your backend application engineers are adamant customer emails are in all caps
  • All of your event tracking names are lowercase

In the real world of human imperfection, opinions, and error, string values are likely to take inconsistent capitalization across different data sources (or even within the same data source). There’s always a little lack of rhyme or reason for why some values are passed as upper or lowercase, and it’s not worth the headache to unpack that.

So how do you create uniformity for string values that you collect across all your data sources? The LOWER function!

· 7 min read
Matt Winkler

❓ Who is this for: This is for advanced users of dbt Cloud that are interested in expanding their knowledge of the dbt API via an interactive Postman Collection. We only suggest diving into this once you have a strong knowledge of dbt + dbt Cloud. You have a couple of options to review the collection:

The dbt Cloud API has well-documented endpoints for creating, triggering and managing dbt Cloud jobs. But there are other endpoints that aren’t well documented yet, and they’re extremely useful for end-users. These endpoints exposed by the API enable organizations not only to orchestrate jobs, but to manage their dbt Cloud accounts programmatically. This creates some really interesting capabilities for organizations to scale their dbt Cloud implementations.

The main goal of this article is to spread awareness of these endpoints as the docs are being built & show you how to use them.

· 4 min read
Kira Furuichi

It’s inevitable in the field of analytics engineering: you’re going to encounter moments when there’s mysterious or unhelpful blank values in your data. Null values surely have their time and place, but when you need those null values filled with more meaningful data, COALESCE comes to the rescue.

COALESCE is an incredibly useful function that allows you to fill in unhelpful blank values that may show up in your data. In the words of analytics engineer Lauren Benezra, you will probably almost never see a data model that doesn’t use COALESCE somewhere.

· 11 min read
Simon Podhajsky

Continuous Integration (CI) sets the system up to test everyone’s pull request before merging. Continuous Deployment (CD) deploys each approved change to production. “Slim CI” refers to running/testing only the changed code, thereby saving compute. In summary, CI/CD automates dbt pipeline testing and deployment.

dbt Cloud, a much beloved method of dbt deployment, supports GitHub- and Gitlab-based CI/CD out of the box. It doesn’t support Bitbucket, AWS CodeCommit/CodeDeploy, or any number of other services, but you need not give up hope even if you are tethered to an unsupported platform.

Although this article uses Bitbucket Pipelines as the compute service and Bitbucket Downloads as the storage service, this article should serve as a blueprint for creating a dbt-based Slim CI/CD anywhere. The idea is always the same:

· 13 min read
Different from dbt Cloud CLI

This blog explains how to use the dbt-cloud-cli Python library to create a data catalog app with dbt Cloud artifacts. This is different from the dbt Cloud CLI, a tool that allows you to run dbt commands against your dbt Cloud development environment from your local command line.

dbt Cloud is a hosted service that many organizations use for their dbt deployments. Among other things, it provides an interface for creating and managing deployment jobs. When triggered (e.g., cron schedule, API trigger), the jobs generate various artifacts that contain valuable metadata related to the dbt project and the run results.

dbt Cloud provides a REST API for managing jobs, run artifacts and other dbt Cloud resources. Data/analytics engineers would often write custom scripts for issuing automated calls to the API using tools cURL or Python Requests. In some cases, the engineers would go on and copy/rewrite them between projects that need to interact with the API. Now, they have a bunch of scripts on their hands that they need to maintain and develop further if business requirements change. If only there was a dedicated tool for interacting with the dbt Cloud API that abstracts away the complexities of the API calls behind an easy-to-use interface… Oh wait, there is: the dbt-cloud-cli!

· 5 min read
Kira Furuichi

"I forgot to mention we dropped that column and created a new one for it!”

“Hmm, I’m actually not super sure why customer_id is passed as an int and not a string.”

“The primary key for that table is actually the order_id, not the id field.”

I think many analytics engineers, including myself, have been on the receiving end of some of these comments from their backend application developers.

Backend developers work incredibly hard. They create the database and tables that drive the heart of many businesses. In their efforts, they can sometimes overlook, forget, or not understand their impact on analytics work. However, when backend developers do understand and implement the technical and logistical requirements from data teams, they can spark joy.

So what makes strong collaboration possible between analytics engineers and backend application developers?