It is a thankless but necessary task. In SQL, often we’ll need to UNION ALL two or more tables vertically, to combine their values.
Say we need to combine 3 tables: web traffic, ad spend and sales data, to form a full picture of cost per acquisition (CPA).
Ultimately, we’d want to roll up data at a granularity of date, landing page URL, campaign and channel—so however we combine the 3 tables, we’ll want to wrap it in an outer query with a GROUP BY to reduce the grain.
To accomplish that, we could do a few things:
FULL OUTER JOIN: returns all values + joins where there is a date / landing page / campaign / channel match
UNION: combines all values vertically + dedupes to return unique values
UNION ALL: combines all values vertically, with no matching logic
Let’s assume that data has been deduped at a lower level (in a staging model for the raw source), so deduping here isn’t necessary.
In terms of performance, UNION ALL will always be more performant than FULL OUTER JOIN, as there’s no matching operation taking place, UNION ALL is just appending tables.
The chore of populating null column values in UNION ALLs
Often when running a UNION ALL, tables will have mismatched columns: the web traffic table won’t have revenue values, and the sales data won’t have session values.
To complete the union, we have to propagate those columns as null or 0 in our query, in order for the union to succeed:
null as orders,
null as customers,
null as revenue
null as sessions,
null as pageviews,
null as time_on_site,
Without those null values, we’ll get the dreaded error:
Queries in UNION ALL have mismatched column count; query 1 has 10 columns, query 2 has 12 columns
As you can imagine, this gets super painful when we want to add a new column within our UNIONs—we must add the column as a null value to every query statement within the UNION ALL stack.
Enter the union_relations dbt macro
dbt allows you to write macro functions in Jinja to automate away this chore.
With 3 lines of code, we can accomplish what previously required dozens:
relations=[ref('sessions'), ref('sales'), ref('ads')
When dbt runs (either locally on the command line, in dbt Cloud, or wherever you choose to deploy it), the union_relations macro generates the exact same SQL as we’d previously written by hand.
You can see in the macro’s source code, it finds every column in the tables you’re UNIONing, checks whether those columns exist in the other tables, and if not populates them with a
null as , line.
New to dbt? Check out dbt introduction for more background on dbt and the analytics engineering workflow that it facilitates.
TL;DR: dbt allows data practitioners to write code like software engineers, which in this case means not repeating yourself unnecessarily.
The magic to me of the
union_relations macro is clear—it allows me to save significant time doing the chore of propagating
null column values across a UNION ALL statement.
If you’re interested in going deeper on how dbt macros work, the
union_relations macro uses a couple programming control structures: for loops and if statements—thanks to dbt’s support for Jinja templating, we can now use these control structures to write SQL that writes itself.