Developer Blog | dbt Developer Hub

Migrating from Stored Procedures to dbt

July 20, 2022 · 11 min read

Senior Solutions Architect at dbt Labs

Stored procedures are widely used throughout the data warehousing world. They’re great for encapsulating complex transformations into units that can be scheduled and respond to conditional logic via parameters. However, as teams continue building their transformation logic using the stored procedure approach, we see more data downtime, increased data warehouse costs, and incorrect / unavailable data in production. All of this leads to more stressed and unhappy developers, and consumers who have a hard time trusting their data.

If your team works heavily with stored procedures, and you ever find yourself with the following or related issues:

dashboards that aren’t refreshed on time
It feels too slow and risky to modify pipeline code based on requests from your data consumers
It’s hard to trace the origins of data in your production reporting

It’s worth considering if an alternative approach with dbt might help.

Strategies for change data capture in dbt

July 14, 2022 · 15 min read

Grace Goheen

Analytics Engineer at dbt Labs

There are many reasons you, as an analytics engineer, may want to capture the complete version history of data:

You’re in an industry with a very high standard for data governance
You need to track big OKRs over time to report back to your stakeholders
You want to build a window to view history with both forward and backward compatibility

These are often high-stakes situations! So accuracy in tracking changes in your data is key.

DATE_TRUNC SQL function: Why we love it

July 13, 2022 · 5 min read

Kira Furuichi

Technical Writer at dbt Labs

In general, data people prefer the more granular over the less granular. Timestamps > dates, daily data > weekly data, etc.; having data at a more granular level always allows you to zoom in. However, you’re likely looking at your data at a somewhat zoomed-out level—weekly, monthly, or even yearly. To do that, you’re going to need a handy dandy function that helps you round out date or time fields.

The DATE_TRUNC function will truncate a date or time to the first instance of a given date part. Wordy, wordy, wordy! What does this really mean? If you were to truncate 2021-12-13 out to its month, it would return 2021-12-01 (the first day of the month).

Using the DATE_TRUNC function, you can truncate to the weeks, months, years, or other date parts for a date or time field. This can make date/time fields easier to read, as well as help perform cleaner time-based analyses.

DATEDIFF SQL function: Why we love it

July 13, 2022 · 5 min read

Kira Furuichi

Technical Writer at dbt Labs

“How long has it been since this customer last ordered with us?”

“What is the average number of days to conversion?”

Business users will have these questions, data people will have to answer these questions, and the only way to solve them is by calculating the time between two different dates. Luckily, there’s a handy DATEDIFF function that can do that for you.

The DATEDIFF function will return the difference in specified units (ex. days, weeks, years) between a start date/time and an end date/time. It’s a simple and widely used function that you’ll find yourself using more often than you expect.

Tackling the complexity of joining snapshots

May 26, 2022 · 16 min read

Lauren Benezra

Analytics Engineer at dbt Labs

Let’s set the scene. You are an analytics engineer at your company. You have several relational datasets flowing through your warehouse, and, of course, you can easily access and transform these tablesIn simplest terms, a table is the direct storage of data in rows and columns. Think excel sheet with raw values in each of the cells. through dbt. You’ve joined together the tables appropriately and have near-real time reporting on the relationships for each entity_id as it currently exists.

But, at some point, your stakeholder wants to know how each entity is changing over time. Perhaps, it is important to understand the trend of a product throughout its lifetime. You need the history of each entity_id across all of your datasets, because each related table is updated on its own timeline.

What is your first thought? Well, you’re a seasoned analytics engineer and you know the good people of dbt Labs have a solution for you. And then it hits you — the answer is snapshots!

A star (generator) is born

May 23, 2022 · 3 min read

Kira Furuichi

Technical Writer at dbt Labs

We’ve likely been here: Table A has 56 columns and we want to select all but one of them (column_56). So here we go, let’s get started…

select
	column_1,
	column_2,
	column_3,
	please_save_me…
from {{ ref('table_a') }}

At this point, you realize your will to continue typing out the next 52 columns has essentially dwindled down to nothing and you’re probably questioning the life choices that led you here.

But what if there was a way to make these 56+ lines of code come down to a handful? Well, that’s where a handy dbt macro comes into play.

Optimizing dbt Models with Redshift Configurations

May 19, 2022 · 16 min read

Christine Berger

Resident Architect at dbt Labs

If you're reading this article, it looks like you're wondering how you can better optimize your Redshift queries - and you're probably wondering how you can do that in conjunction with dbt.

In order to properly optimize, we need to understand why we might be seeing issues with our performance and how we can fix these with dbt sort and dist configurations.

Stakeholder-friendly model names: Model naming conventions that give context

May 17, 2022 · 13 min read

Pat Kearns

Senior Analytics Engineer at dbt Labs

Analytics engineers (AEs) are constantly navigating through the names of the models in their project, so naming is important for maintainability in your project in the way you access it and work within it. By default, dbt will use your model file name as the view or table name in the database. But this means the name has a life outside of dbt and supports the many end users who will potentially never know about dbt and where this data came from, but still access the database objects in the database or business intelligence (BI) tool.

Model naming conventions are usually made by AEs, for AEs. While that’s useful for maintainability, it leaves out the people who model naming is supposed to primarily benefit: the end users. Good model naming conventions should be created with one thing in mind: Assume your end-user will have no other context than the model name. Folders, schema, and documentation can add additional context, but they may not always be present. Your model names will always be shown in the database.

EXTRACT SQL function: Why we love it

May 15, 2022 · 4 min read

Kira Furuichi

Technical Writer at dbt Labs

There are so many different date functions in SQL—you have DATEDIFF, DATEADD, DATE_PART, and DATE_TRUNC to name a few. They all have their different use cases and understanding how and when they should be used is a SQL fundamental to get down. Are any of those as easy to use as the EXTRACT function? Well, that debate is for another time…

In this post, we’re going to give a deep dive into the EXTRACT function, how it works, and why we use it.

How we remove partial duplicates: Complex deduplication to refine your models' grain

May 12, 2022 · 11 min read

Lauren Benezra

Analytics Engineer at dbt Labs

Hey data champion — so glad you’re here! Sometimes datasets need a team of engineers to tackle their deduplification (totz a real word), and that’s why we wrote this down. For you, friend, we wrote it down for you. You’re welcome!

Let’s get rid of these dupes and send you on your way to do the rest of the super-fun-analytics-engineering that you want to be doing, on top of super-sparkly-clean data. But first, let’s make sure we’re all on the same page.

LOWER SQL function: Why we love it

May 11, 2022 · 4 min read

Kira Furuichi

Technical Writer at dbt Labs

We’ve all been there:

In a user signup form, user A typed in their name as Kira Furuichi, user B typed it in as john blust, and user C wrote DAvid KrevitT (what’s up with that, David??)
Your backend application engineers are adamant customer emails are in all caps
All of your event tracking names are lowercase

In the real world of human imperfection, opinions, and error, string values are likely to take inconsistent capitalization across different data sources (or even within the same data source). There’s always a little lack of rhyme or reason for why some values are passed as upper or lowercase, and it’s not worth the headache to unpack that.

So how do you create uniformity for string values that you collect across all your data sources? The LOWER function!

COALESCE SQL function: Why we love it

May 8, 2022 · 4 min read

Kira Furuichi

Technical Writer at dbt Labs

It’s inevitable in the field of analytics engineering: you’re going to encounter moments when there’s mysterious or unhelpful blank values in your data. Null values surely have their time and place, but when you need those null values filled with more meaningful data, COALESCE comes to the rescue.

COALESCE is an incredibly useful function that allows you to fill in unhelpful blank values that may show up in your data. In the words of analytics engineer Lauren Benezra, you will probably almost never see a data model that doesn’t use COALESCE somewhere.

Slim CI/CD with Bitbucket Pipelines for dbt Core

May 6, 2022 · 11 min read

Simon Podhajsky

Data Lead at iLife Technologies

Set up CI/CD with dbt Cloud

This blog is specifically tailored for dbt Core users. If you're using dbt Cloud and your Git provider doesn't have a native dbt Cloud integration (like BitBucket), follow the Customizing CI/CD with custom pipelines guide to set up CI/CD.

Continuous Integration (CI) sets the system up to test everyone’s pull request before merging. Continuous Deployment (CD) deploys each approved change to production. “Slim CI” refers to running/testing only the changed code, thereby saving compute. In summary, CI/CD automates dbt pipeline testing and deployment.

dbt Cloud, a much beloved method of dbt deployment, supports GitHub- and Gitlab-based CI/CD out of the box. It doesn’t support Bitbucket, AWS CodeCommit/CodeDeploy, or any number of other services, but you need not give up hope even if you are tethered to an unsupported platform.

Although this article uses Bitbucket Pipelines as the compute service and Bitbucket Downloads as the storage service, this article should serve as a blueprint for creating a dbt-based Slim CI/CD anywhere. The idea is always the same:

Making dbt Cloud API calls using dbt-cloud-cli

May 3, 2022 · 13 min read

Simo Tumelius

Freelance Data and Analytics Engineer

Different from dbt Cloud CLI

This blog explains how to use the dbt-cloud-cli Python library to create a data catalog app with dbt Cloud artifacts. This is different from the dbt Cloud CLI, a tool that allows you to run dbt commands against your dbt Cloud development environment from your local command line.

dbt Cloud is a hosted service that many organizations use for their dbt deployments. Among other things, it provides an interface for creating and managing deployment jobs. When triggered (e.g., cron schedule, API trigger), the jobs generate various artifacts that contain valuable metadata related to the dbt project and the run results.

dbt Cloud provides a REST API for managing jobs, run artifacts and other dbt Cloud resources. Data/analytics engineers would often write custom scripts for issuing automated calls to the API using tools cURL or Python Requests. In some cases, the engineers would go on and copy/rewrite them between projects that need to interact with the API. Now, they have a bunch of scripts on their hands that they need to maintain and develop further if business requirements change. If only there was a dedicated tool for interacting with the dbt Cloud API that abstracts away the complexities of the API calls behind an easy-to-use interface… Oh wait, there is: the dbt-cloud-cli!

From the Slack Archives: When Backend Devs Spark Joy for Data Folks

April 5, 2022 · 5 min read

Kira Furuichi

Technical Writer at dbt Labs

"I forgot to mention we dropped that column and created a new one for it!”

“Hmm, I’m actually not super sure why customer_id is passed as an int and not a string.”

“The primary keyA primary key is a non-null column in a database object that uniquely identifies each row. for that tableIn simplest terms, a table is the direct storage of data in rows and columns. Think excel sheet with raw values in each of the cells. is actually the order_id, not the id field.”

I think many analytics engineers, including myself, have been on the receiving end of some of these comments from their backend application developers.

Backend developers work incredibly hard. They create the database and tables that drive the heart of many businesses. In their efforts, they can sometimes overlook, forget, or not understand their impact on analytics work. However, when backend developers do understand and implement the technical and logistical requirements from data teams, they can spark joy.

So what makes strong collaboration possible between analytics engineers and backend application developers?

dbt + Machine Learning: What makes a great baton pass?

March 10, 2022 · 11 min read

Sung Won Chung

Solutions Architect at dbt Labs

Izzy Erekson

Solutions Architect at dbt Labs

Special Thanks: Emilie Schario, Matt Winkler

dbt has done a great job of building an elegant, common interface between data engineers, analytics engineers, and any data-y role, by uniting our work on SQL. This unification of tools and workflows creates interoperability between what would normally be distinct teams within the data organization.

I like to call this interoperability a “baton pass.” Like in a relay race, there are clear handoff points & explicit ownership at all stages of the process. But there’s one baton pass that’s still relatively painful and undefined: the handoff between machine learning (ML) engineers and analytics engineers.

In my experience, the initial collaboration workflow between ML engineering & analytics engineering starts off strong but eventually becomes muddy during the maintenance phase. This eventually leads to projects becoming unusable and forgotten.

In this article, we’ll explore a real-life baton pass between ML engineering and analytics engineering and highlighting where things went wrong.

Founding an Analytics Engineering Team

March 2, 2022 · 18 min read

Nate Sooter

Manager of BI Operations at Smartsheet

Executive Summary:

If your company is struggling to leverage analytics, dealing with an overgrown ecosystem of dashboards/databases or simply want to avoid the mistakes of others, this story is for you. In this article, I will walk through forming the first analytics engineering team at Smartsheet including how momentum built around forming the team, the challenges we faced, and the solutions we developed within the first year.

Introduction

Most writing about analytics engineering, or AE for short, assumes a team already exists. It’s about operating as an AE team or managing stakeholders or leveraging tools more effectively. But what about the prologue? What initial problems do AEs solve? How does an AE team even start? What do the early days look like?

The JaffleGaggle Story: Data Modeling for a Customer 360 View

February 8, 2022 · 16 min read

Donny Flynn

Customer Data Architect at Census

Editor's note: In this tutorial, Donny walks through the fictional story of a SaaS company called JaffleGaggle, who needs to group their freemium individual users into company accounts (aka a customer 360 view) in order to drive their product-led growth efforts.

You can follow along with Donny's data modeling technique for identity resolution in this dbt project repo. It includes a set of demo CSV files, which you can use as dbt seeds to test Donny's project for yourself.

How We Calculate Time on Task, the Business Hours Between Two Dates

February 3, 2022 · 10 min read

Dave Connors

Staff Developer Experience Advocate at dbt Labs

Measuring the number of business hours between two dates using SQL is one of those classic problems that sounds simple yet has plagued analysts since time immemorial.

This comes up in a couple places at dbt Labs:

Calculating the time it takes for a support ticket to be solved
Measuring team performance against response time SLAs

We internally refer to this at "Time on Task," and it can be a critical data point for customer or client facing teams. Thankfully our tools for calculating Time on Task have improved just a little bit since 2006.

Even still, you've got to do some pretty gnarly SQL or dbt gymnastics to get this right, including:

Figuring out how to exclude nights and weekends from your SQL calculations
Accounting for holidays using a custom holiday calendar
Accommodating for changes in business hour schedules

This piece will provide an overview of how and critically why to calculate Time on Task and how we use it here at dbt Labs.

How to Build a Mature dbt Project from Scratch

December 6, 2021 · 14 min read

Dave Connors

Staff Developer Experience Advocate at dbt Labs

[We would love to have] A maturity curve of an end-to-end dbt implementation for each version of dbt .... There are so many features in dbt now but it'd be great to understand, "what is the minimum set of dbt features/components that need to go into a base-level dbt implementation?...and then what are the things that are extra credit?" -Will Weld on dbt Community Slack

One question we hear time and time again is this - what does it look like to progress through the different stages of maturity on a dbt project?

When Will posed this question on Slack, it got me thinking about what it would take to create a framework for dbt project maturity.

Introduction​

Introduction