When you were in grade school, did you ever play the “Telephone Game”? The first person would whisper a word to the second person, who would then whisper a word to the third person, and so on and so on. At the end of the line, the final person would loudly announce the word that they heard, and alas! It would have morphed into a new word completely incomprehensible from the original word. That’s how life feels without an analytics engineer on your team.
So let’s say that you have a business question, you have the raw data in your data warehouse, and you’ve got dbt up and running. You’re in the perfect position to get this curated dataset completed quickly! Or are you?
The simplest branching strategy for making code changes to your dbt project repository is to have a single main branch with your production-level code. To update the main branch, a developer will:
Create a new feature branch directly from the main branch
Make changes on said feature branch
When ready, open a pull request to merge their changes back into the main branch
If you are just getting started in dbt and deciding which branching strategy to use, this approach–often referred to as “continuous deployment” or “direct promotion”–is the way to go. It provides many benefits including:
Fast promotion process to get new changes into production
Simple branching strategy to manage
The main risk, however, is that your main branch can become susceptible to bugs that slip through the pull request approval process. In order to have more intensive testing and QA before merging code changes into production, some organizations may decide to create one or more branches between the feature branches and main.
If you’ve ever heard of Marie Kondo, you’ll know she has an incredibly soothing and meditative method to tidying up physical spaces. Her KonMari Method is about categorizing, discarding unnecessary items, and building a sustainable system for keeping stuff.
As an analytics engineer at your company, doesn’t that last sentence describe your job perfectly?! I like to think of the practice of analytics engineering as applying the KonMari Method to data modeling. Our goal as Analytics Engineers is not only to organize and clean up data, but to design a sustainable and scalable transformation project that is easy to navigate, grow, and consume by downstream customers.
Let’s talk about how to apply the KonMari Method to a new migration project. Perhaps you’ve been tasked with unpacking the kitchen in your new house; AKA, you’re the engineer hired to move your legacy SQL queries into dbt and get everything working smoothly. That might mean you’re grabbing a query that is 1500 lines of SQL and reworking it into modular pieces. When you’re finished, you have a performant, scalable, easy-to-navigate data flow.
Analyzing financial data is rarely ever “fun.” In particular, generating and analyzing financial statement data can be extremely difficult and leaves little room for error. If you've ever had the misfortune of having to generate financial reports for multiple systems, then you will understand how incredibly frustrating it is to reinvent the wheel each time.
This process can include a number of variations, but usually involves spending hours, days, or weeks working with Finance to:
Understand what needs to go into the reports
Model said reports
Validate said reports
Make adjustments within your model
Question your existence
Validate said reports again
You can imagine how extremely time consuming this process can be. Thankfully, you can leverage core accounting principles and other tools to more easily and effectively generate actionable financial reports. This way, you can spend more time diving into deeper financial analyses.
Semantic layer, Python model support, the new dbt Cloud UI and IDE… there’s a lot our product team is excited to share with you at Coalesce in a few weeks.
But how these things fit together—because of where dbt Labs is headed—is what I’m most excited to discuss.
You’ll hear more in Tristan’s keynote, but this feels like a good time to remind you that Coalesce isn’t just for answering tough questions… it’s for surfacing them. For sharing challenges we’ve felt in silos, finding the people you want to solve them with, and spending the rest of the year chipping away at them. As Tristan says in his latest blog, that’s how this industry moves forward.
It’s important to be able to test any dbt Project, but it’s even more important to make sure you have robust testing if you are developing a dbt Package.
I love dbt Packages, because it makes it easy to extend dbt’s functionality and create reusable analytics resources. Even better, we can find and share dbt Packages which others developed, finding great packages in dbt hub. However, it is a bit difficult to develop complicated dbt macros, because dbt on top of Jinja2 is lacking some of the functionality you’d expect for software development - like unit testing.
In this article, I would like to share options for unit testing your dbt Package - first through discussing the commonly used pattern of integration testing and then by showing how we can implement unit tests as part of our testing arsenal.
Those who have been building data warehouses for a long time have undoubtedly encountered the challenge of building surrogate keys on their data models. Having a column that uniquely represents each entity helps ensure your data model is complete, does not contain duplicates, and able to join across different data models in your warehouse.
Sometimes, we are lucky enough to have data sources with these keys built right in — Shopify data synced via their API, for example, has easy-to-use keys on all the tables written to your warehouse. If this is not the case, or if you build a data model with a compound key (aka the data is unique across multiple dimensions), you will have to rely on some strategy for creating and maintaining these keys yourself. How can you do this with dbt? Let’s dive in.
The larger a data ecosystem gets, the more its users and stakeholders expect consistency. As the ratio of data models to team members (to say nothing of stakeholders to team members) skyrockets, an agreed-upon modeling pattern often acts as scaffolding around that growth.
The biggest tool in the toolbox today, dimensional modeling, offers enough consistency to make it the dominant approach in the space, but what might be possible if we shut that toolbox, took a break from our workbench, and instead strolled over to our bookshelf?
When running a job that has over 1,700 models, how do you know what a “good” runtime is? If the total process takes 3 hours, is that fantastic or terrible? While there are many possible answers depending on dataset size, complexity of modeling, and historical run times, the crux of the matter is normally “did you hit your SLAs”? However, in the cloud computing world where bills are based on usage, the question is really “did you hit your SLAs and stay within budget”?
Here at dbt Labs, we used the Model Timing tab in our internal analytics dbt project to help us identify inefficiencies in our incremental dbt Cloud job that eventually led to major financial savings, and a path forward for periodic improvement checks.
At dbt Labs, we have best practices we like to follow for the development of dbt projects. One of them, for example, is that all models should have at least unique and not_null tests on their primary key. But how can we enforce rules like this?
That question becomes difficult to answer in large dbt projects. Developers might not follow the same conventions. They might not be aware of past decisions, and reviewing pull requests in git can become more complex. When dbt projects have hundreds of models, it's hard to know which models do not have any tests defined and aren't enforcing your conventions.
TLDR: The Semantic Layer is made up of a combination of open-source and SaaS offerings and is going to change how your team defines and consumes metrics.
At last year's Coalesce, Drew showed us the future1 - a vision of what metrics in dbt could look like. Since then, we've been getting the infrastructure in place to make that vision a reality. We wanted to share with you where we are today and how it fits into the broader picture of where we're going.
Stored procedures are widely used throughout the data warehousing world. They’re great for encapsulating complex transformations into units that can be scheduled and respond to conditional logic via parameters. However, as teams continue building their transformation logic using the stored procedure approach, we see more data downtime, increased data warehouse costs, and incorrect / unavailable data in production. All of this leads to more stressed and unhappy developers, and consumers who have a hard time trusting their data.
If your team works heavily with stored procedures, and you ever find yourself with the following or related issues:
dashboards that aren’t refreshed on time
It feels too slow and risky to modify pipeline code based on requests from your data consumers
It’s hard to trace the origins of data in your production reporting
It’s worth considering if an alternative approach with dbt might help.
In general, data people prefer the more granular over the less granular. Timestamps > dates, daily data > weekly data, etc.; having data at a more granular level always allows you to zoom in. However, you’re likely looking at your data at a somewhat zoomed-out level—weekly, monthly, or even yearly. To do that, you’re going to need a handy dandy function that helps you round out date or time fields.
The DATE_TRUNC function will truncate a date or time to the first instance of a given date part. Wordy, wordy, wordy! What does this really mean? If you were to truncate 2021-12-13 out to its month, it would return 2021-12-01 (the first day of the month).
Using the DATE_TRUNC function, you can truncate to the weeks, months, years, or other date parts for a date or time field. This can make date/time fields easier to read, as well as help perform cleaner time-based analyses.
“How long has it been since this customer last ordered with us?”
“What is the average number of days to conversion?”
Business users will have these questions, data people will have to answer these questions, and the only way to solve them is by calculating the time between two different dates. Luckily, there’s a handy DATEDIFF function that can do that for you.
The DATEDIFF function will return the difference in specified units (ex. days, weeks, years) between a start date/time and an end date/time. It’s a simple and widely used function that you’ll find yourself using more often than you expect.
Let’s set the scene. You are an analytics engineer at your company. You have several relational datasets flowing through your warehouse, and, of course, you can easily access and transform these tables through dbt. You’ve joined together the tables appropriately and have near-real time reporting on the relationships for each entity_id as it currently exists.
But, at some point, your stakeholder wants to know how each entity is changing over time. Perhaps, it is important to understand the trend of a product throughout its lifetime. You need the history of each entity_id across all of your datasets, because each related table is updated on its own timeline.
What is your first thought? Well, you’re a seasoned analytics engineer and you know the good people of dbt Labs have a solution for you. And then it hits you — the answer is snapshots!