Skip to main content

dbt Developer Blog

Technical tutorials from the dbt Community.

Start here

· 11 min read

TLDR: The Semantic Layer is made up of a combination of open-source and SaaS offerings and is going to change how your team defines and consumes metrics.

At last year's Coalesce, Drew showed us the future1 - a vision of what metrics in dbt could look like. Since then, we've been getting the infrastructure in place to make that vision a reality. We wanted to share with you where we are today and how it fits into the broader picture of where we're going.

To those who haven't followed this saga with the intensity of someone watching their investments on the crypto market, we're rolling out this new resource to help you better understand the dbt Semantic Layer and provide clarification on the following things:

  1. What is the dbt Semantic Layer?
  2. How do I use it?
  3. What is publicly available now?
  4. What is still in development?

With that, lets get into it!

· 7 min read

If you’ve needed to grant access to a dbt model between 2019 and today, there’s a good chance you’ve come across the "The exact grant statements we use in a dbt project" post on Discourse. It explained options for covering two complementary abilities:

  1. querying relations via the "select" privilege
  2. using the schema those relations are within via the "usage" privilege

· 11 min read

Stored procedures are widely used throughout the data warehousing world. They’re great for encapsulating complex transformations into units that can be scheduled and respond to conditional logic via parameters. However, as teams continue building their transformation logic using the stored procedure approach, we see more data downtime, increased data warehouse costs, and incorrect / unavailable data in production. All of this leads to more stressed and unhappy developers, and consumers who have a hard time trusting their data.

If your team works heavily with stored procedures, and you ever find yourself with the following or related issues:

  • dashboards that aren’t refreshed on time
  • It feels too slow and risky to modify pipeline code based on requests from your data consumers
  • It’s hard to trace the origins of data in your production reporting

It’s worth considering if an alternative approach with dbt might help.

· 15 min read

There are many reasons you, as an analytics engineer, may want to capture the complete version history of data:

  • You’re in an industry with a very high standard for data governance
  • You need to track big OKRs over time to report back to your stakeholders
  • You want to build a window to view history with both forward and backward compatibility

These are often high-stakes situations! So accuracy in tracking changes in your data is key.

· 5 min read

In general, data people prefer the more granular over the less granular. Timestamps > dates, daily data > weekly data, etc.; having data at a more granular level always allows you to zoom in. However, you’re likely looking at your data at a somewhat zoomed-out level—weekly, monthly, or even yearly. To do that, you’re going to need a handy dandy function that helps you round out date or time fields.

The DATE_TRUNC function will truncate a date or time to the first instance of a given date part. Wordy, wordy, wordy! What does this really mean? If you were to truncate 2021-12-13 out to its month, it would return 2021-12-01 (the first day of the month).

Using the DATE_TRUNC function, you can truncate to the weeks, months, years, or other date parts for a date or time field. This can make date/time fields easier to read, as well as help perform cleaner time-based analyses.

· 5 min read

“How long has it been since this customer last ordered with us?”

“What is the average number of days to conversion?”

Business users will have these questions, data people will have to answer these questions, and the only way to solve them is by calculating the time between two different dates. Luckily, there’s a handy DATEDIFF function that can do that for you.

The DATEDIFF function will return the difference in specified units (ex. days, weeks, years) between a start date/time and an end date/time. It’s a simple and widely used function that you’ll find yourself using more often than you expect.

· 15 min read

Let’s set the scene. You are an analytics engineer at your company. You have several relational datasets flowing through your warehouse, and, of course, you can easily access and transform these tables through dbt. You’ve joined together the tables appropriately and have near-real time reporting on the relationships for each entity_id as it currently exists.

But, at some point, your stakeholder wants to know how each entity is changing over time. Perhaps, it is important to understand the trend of a product throughout its lifetime. You need the history of each entity_id across all of your datasets, because each related table is updated on its own timeline.

What is your first thought? Well, you’re a seasoned analytics engineer and you know the good people of dbt Labs have a solution for you. And then it hits you — the answer is snapshots!

· 3 min read

We’ve likely been here: Table A has 56 columns and we want to select all but one of them (column_56). So here we go, let’s get started…

select
column_1,
column_2,
column_3,
please_save_me…
from {{ ref('table_a') }}

At this point, you realize your will to continue typing out the next 52 columns has essentially dwindled down to nothing and you’re probably questioning the life choices that led you here.

But what if there was a way to make these 56+ lines of code come down to a handful? Well, that’s where a handy dbt macro comes into play.

· 13 min read

Analytics engineers (AEs) are constantly navigating through the names of the models in their project, so naming is important for maintainability in your project in the way you access it and work within it. By default, dbt will use your model file name as the view or table name in the database. But this means the name has a life outside of dbt and supports the many end users who will potentially never know about dbt and where this data came from, but still access the database objects in the database or business intelligence (BI) tool.

Model naming conventions are usually made by AEs, for AEs. While that’s useful for maintainability, it leaves out the people who model naming is supposed to primarily benefit: the end users. Good model naming conventions should be created with one thing in mind: Assume your end-user will have no other context than the model name. Folders, schema, and documentation can add additional context, but they may not always be present. Your model names will always be shown in the database.

· 4 min read

There are so many different date functions in SQL—you have DATEDIFF, DATEADD, DATE_PART, and DATE_TRUNC to name a few. They all have their different use cases and understanding how and when they should be used is a SQL fundamental to get down. Are any of those as easy to use as the EXTRACT function? Well, that debate is for another time…

In this post, we’re going to give a deep dive into the EXTRACT function, how it works, and why we use it.

· 11 min read

Hey data champion — so glad you’re here! Sometimes datasets need a team of engineers to tackle their deduplification (totz a real word), and that’s why we wrote this down. For you, friend, we wrote it down for you. You’re welcome!

Let’s get rid of these dupes and send you on your way to do the rest of the super-fun-analytics-engineering that you want to be doing, on top of super-sparkly-clean data. But first, let’s make sure we’re all on the same page.

· 4 min read

We’ve all been there:

  • In a user signup form, user A typed in their name as Kira Furuichi, user B typed it in as john blust, and user C wrote DAvid KrevitT (what’s up with that, David??)
  • Your backend application engineers are adamant customer emails are in all caps
  • All of your event tracking names are lowercase

In the real world of human imperfection, opinions, and error, string values are likely to take inconsistent capitalization across different data sources (or even within the same data source). There’s always a little lack of rhyme or reason for why some values are passed as upper or lowercase, and it’s not worth the headache to unpack that.

So how do you create uniformity for string values that you collect across all your data sources? The LOWER function!

· 7 min read

❓ Who is this for: This is for advanced users of dbt Cloud that are interested in expanding their knowledge of the dbt API via an interactive Postman Collection. We only suggest diving into this once you have a strong knowledge of dbt + dbt Cloud. You have a couple of options to review the collection:

The dbt Cloud API has well-documented endpoints for creating, triggering and managing dbt Cloud jobs. But there are other endpoints that aren’t well documented yet, and they’re extremely useful for end-users. These endpoints exposed by the API enable organizations not only to orchestrate jobs, but to manage their dbt Cloud accounts programmatically. This creates some really interesting capabilities for organizations to scale their dbt Cloud implementations.

The main goal of this article is to spread awareness of these endpoints as the docs are being built & show you how to use them.

· 4 min read

It’s inevitable in the field of analytics engineering: you’re going to encounter moments when there’s mysterious or unhelpful blank values in your data. Null values surely have their time and place, but when you need those null values filled with more meaningful data, COALESCE comes to the rescue.

COALESCE is an incredibly useful function that allows you to fill in unhelpful blank values that may show up in your data. In the words of analytics engineer Lauren Benezra, you will probably almost never see a data model that doesn’t use COALESCE somewhere.

· 11 min read

Continuous Integration (CI) sets the system up to test everyone’s pull request before merging. Continuous Deployment (CD) deploys each approved change to production. “Slim CI” refers to running/testing only the changed code, thereby saving compute. In summary, CI/CD automates dbt pipeline testing and deployment.

dbt Cloud, a much beloved method of dbt deployment, supports GitHub- and Gitlab-based CI/CD out of the box. It doesn’t support Bitbucket, AWS CodeCommit/CodeDeploy, or any number of other services, but you need not give up hope even if you are tethered to an unsupported platform.

Although this article uses Bitbucket Pipelines as the compute service and Bitbucket Downloads as the storage service, this article should serve as a blueprint for creating a dbt-based Slim CI/CD anywhere. The idea is always the same:

· 13 min read

dbt Cloud is a hosted service that many organizations use for their dbt deployments. Among other things, it provides an interface for creating and managing deployment jobs. When triggered (e.g., cron schedule, API trigger), the jobs generate various artifacts that contain valuable metadata related to the dbt project and the run results.

dbt Cloud provides a REST API for managing jobs, run artifacts and other dbt Cloud resources. Data/analytics engineers would often write custom scripts for issuing automated calls to the API using tools cURL or Python Requests. In some cases, the engineers would go on and copy/rewrite them between projects that need to interact with the API. Now, they have a bunch of scripts on their hands that they need to maintain and develop further if business requirements change. If only there was a dedicated tool for interacting with the dbt Cloud API that abstracts away the complexities of the API calls behind an easy-to-use interface… Oh wait, there is: the dbt-cloud-cli!

· 5 min read

"I forgot to mention we dropped that column and created a new one for it!”

“Hmm, I’m actually not super sure why customer_id is passed as an int and not a string.”

“The primary key for that table is actually the order_id, not the id field.”

I think many analytics engineers, including myself, have been on the receiving end of some of these comments from their backend application developers.

Backend developers work incredibly hard. They create the database and tables that drive the heart of many businesses. In their efforts, they can sometimes overlook, forget, or not understand their impact on analytics work. However, when backend developers do understand and implement the technical and logistical requirements from data teams, they can spark joy.

So what makes strong collaboration possible between analytics engineers and backend application developers?

· 12 min read

Special Thanks: Emilie Schario, Matt Winkler

dbt has done a great job of building an elegant, common interface between data engineers, analytics engineers, and any data-y role, by uniting our work on SQL. This unification of tools and workflows creates interoperability between what would normally be distinct teams within the data organization.

I like to call this interoperability a “baton pass.” Like in a relay race, there are clear handoff points & explicit ownership at all stages of the process. But there’s one baton pass that’s still relatively painful and undefined: the handoff between machine learning (ML) engineers and analytics engineers.

In my experience, the initial collaboration workflow between ML engineering & analytics engineering starts off strong but eventually becomes muddy during the maintenance phase. This eventually leads to projects becoming unusable and forgotten.

In this article, we’ll explore a real-life baton pass between ML engineering and analytics engineering and highlighting where things went wrong.

· 18 min read

Executive Summary:

If your company is struggling to leverage analytics, dealing with an overgrown ecosystem of dashboards/databases or simply want to avoid the mistakes of others, this story is for you. In this article, I will walk through forming the first analytics engineering team at Smartsheet including how momentum built around forming the team,  the challenges we faced, and the solutions we developed within the first year.

Introduction

Most writing about analytics engineering, or AE for short, assumes a team already exists. It’s about operating as an AE team or managing stakeholders or leveraging tools more effectively. But what about the prologue? What initial problems do AEs solve? How does an AE team even start? What do the early days look like?