Generating Surrogate Keys Across Warehouses

November 29, 2021 · 7 min read

Sanjana Sen

Analytics Engineer at dbt Labs

Jason Ganz

Developer Experience at dbt Labs

David Krevitt

Marketing at dbt Labs

Why primary keys are important

We all know one of the most fundamental rules in data is that every tableIn simplest terms, a table is the direct storage of data in rows and columns. Think excel sheet with raw values in each of the cells. should have a primary keyA primary key is a non-null column in a database object that uniquely identifies each row.. Primary keys are critical for many reasons:

They ensure that you don’t have duplicate rows in your table
They help establish relationships to other tables
They allow you to quickly identify the grainYour data's grain is the combination of columns at which records in a table are unique. Ideally, this is captured in a single column and a unique primary key. of the table (ex: the customers table with a PK of customer_id has one row per customer)
You can test them in dbt, to ensure that your data is complete and unique

If you are lucky, your data will come to you with a unique primary key already in place. Maybe it is an id generated by your internal product systems, or maybe you are pulling data in from a third party source that generates the id for you.

Sometimes, however, you find yourself in a situation where you don’t have a tidy primary key.

For example, maybe you have joined your users to the set of features that they use in your product, such that your data should be expected to be unique on the user_id + product_id basis.

The question is - in this situation, how are you supposed to set a primary key? Meet the surrogate key.

What’s a surrogate key?

A surrogate keyA surrogate key is a unique identifier derived from the data itself. It often takes the form of a hashed value of multiple columns that will create a uniqueness constraint for each row. is a primary key that, instead of existing in your underlying dataset, is derived in the analytics layer itself.

Learning when to use surrogate keys and bring them into your project is a critical skill for any analytics professional.

Knowing when to use a surrogate key is actually quite easy: you should have a surrogate key on any table that doesn’t already have a unique primary key.

Knowing how to create a surrogate key can prove to be much more challenging. The reason for this is that it isn’t always easy to know the best way to implement surrogate keys. Should you just be blocking off time on your calendar every day to individually name each of your rows?

Turns out this is a relatively well-solved problem. To create a surrogate key, you traditionally follow these two steps.

Concatenate together all of the fields required to make a unique row (for example, user_id and product_id)
Apply a function to create a cryptographic hash (usually using the md5 function) on top of these to generate a unique id per combination of unique values

While the process of creating a surrogate key is relatively well understood, you will be shocked (SHOCKED I SAY) to hear that SQL syntax can have subtle differences across dialects and databases.

Surrogate keys in BigQuery, Databricks, Redshift and Snowflake

BigQuery, Redshift and Snowflake’s concat functions returns null if any of the referenced columns for that row returns a null, so to create a proper surrogate key you’d need to wrap each column in a coalesce before hashing with an md5 function:

md5 ( concat ( coalesce(column1, '_this_used_to_be_null_'), coalesce(column2, '_this_used_to_be_null_') ) )

You'll notice that instead of coalescing to a blank string, there is a long replacement string. It doesn't matter what this string is, as long as it's consistent across your project and doesn't appear in any of your data. When you coalesce to a blank string, you can't differentiate between null and '' which sometimes has semantic meaning. Compare the results of these surrogate keys: A table with two columns, comparing surrogate keys when nulls and blank strings are treated the same and differently.

Databricks’ concat function docs don’t specifically reference returning null for the concat if one column is null, but I believe that’s what’s meant by The result type matches the argument types.

You could also separate your columns with pipes (||) rather than using the concat function, but I generally stay away from pipes (one comma > two pipes).

Surrogate keys in Postgres

Postgres’ concat function ignores nulls, which saves you from having to wrap each column in a coalesce function to default nulls to a different value (but this has the same drawback shown in the table above, where you can get the same key from different inputs).

If you used || instead of concat, one null column would cause the entire statement would return a null, breaking your concatenation.

So in plain old PostgreSQL, you’d use:

md5 ( concat (column1, column2) )

The null value problem in surrogate keys

The primary annoyance when creating surrogate keys comes when you try and concatenate a row that has a null value for one or more columns. If any value is null, then often the entire concatenated string is returned as null - no good!

with 

example_ids as (
  select
  123 as user_id,
  123 as product_id

  union all

  select
  123 as user_id,
  null as product_id

  union all

  select
  null as user_id,
  123 as product_id

)

select
  *,
  concat(user_id, product_id) as _surrogate_key
from example_ids

output:

USER_ID	PRODUCT_ID	_SURROGATE_KEY
123	123	123123
123	`null`	`null`
`null`	123	`null`

You can get around this by wrapping each of your columns in a coalesce function to default nulls to an alternate value, which is pretty tedious. You can also run into problems if the fields are different datatypes (string vs numeric), so sometimes you need to cast as well.

...
select
  *,
  concat(
    coalesce(cast(user_id as string), '_this_used_to_be_null_'),
    coalesce(cast(product_id as string), '_this_used_to_be_null_')
    ) as _surrogate_key
from example_ids

output:

USER_ID	PRODUCT_ID	_SURROGATE_KEY
123	123	123123
123	`null`	123_this_used_to_be_null_
`null`	123	_this_used_to_be_null_123

Much better! But let's add another row to our dataset:

with 

example_ids as (
  select
  123 as user_id,
  123 as product_id

  union all

  select
  123 as user_id,
  null as product_id

  union all

  select
  null as user_id,
  123 as product_id

  union all 
  
  select 1231 as user_id,
  23 as product_id

)

select
  *,
  concat(
    coalesce(cast(user_id as string), '_this_used_to_be_null_'),
    coalesce(cast(product_id as string), '_this_used_to_be_null_')
    ) as _surrogate_key
from example_ids

output:

USER_ID	PRODUCT_ID	_SURROGATE_KEY
123	123	123123
123	`null`	123_this_used_to_be_null_
`null`	123	_this_used_to_be_null_123
1231	23	123123

At first glance, this looks like it works, but in reality there are two identical keys: 123123.

To remedy this, you need to add a separator between fields you wish to concatenate.

...
select
  *,
  concat(
    coalesce(cast(user_id as string), ''),
    '|',
    coalesce(cast(product_id as string), '')
    ) as _surrogate_key
from example_ids

output:

USER_ID	PRODUCT_ID	_SURROGATE_KEY
123	123	123\|123
123	`null`	123\|_this_used_to_be_null_
`null`	123	_this_used_to_be_null_\|123
1231	23	1231\|23

Let’s take a look at how generating surrogate keys specifically looks in practice across data warehousesA data warehouse is a data management system used for data storage and computing that allows for analytics activities such as transforming and sharing data., and how you can use one simple dbt macro (dbt_utils.generate_surrogate_key) to abstract away the null value problem.

A surrogate_key macro to the rescue

Thanks to a handy function called generate_surrogate_key in the dbt_utils package, you can fire yourself from the business of wrapping your columns in coalesce every time you want to generate a surrogate key.

Forming your surrogate keys with this macro has the benefit of elegant + DRYDRY is a software development principle that stands for “Don’t Repeat Yourself.” Living by this principle means that your aim is to reduce repetitive patterns and duplicate code and logic in favor of modular and referenceable code. null handling.

Rather than wrapping your columns in a coalesce function when concatenating them, the macro loops through your columns and coalesces on your behalf, so that you can avoid repeating yourself.

When you call {{ dbt_utils.generate_surrogate_key(['field_a', 'field_b'[,...]]) }}, behind the scenes dbt compiles SQL on your behalf, looping through each field and generating the correct number of coalesce statements with type casting:

  coalesce(cast(" ~ field ~ " as " ~ dbt.type_string() ~ "), '_dbt_utils_surrogate_key_null_')

and with conditional logic, adding separator between fields:

  {%- if not loop.last %}
    {%- set _ = fields.append("'-'") -%}
  {%- endif -%}

What does this mean in practice?

Well, you simply don’t have to think about your surrogate keys all that much. On any data warehouse, nulls or no nulls, it just works. Because honestly, who wants to spend more time than they need to thinking about surrogate keys?

Generating Surrogate Keys Across Warehouses

Why primary keys are important

What’s a surrogate key?

Surrogate keys in BigQuery, Databricks, Redshift and Snowflake

Surrogate keys in Postgres

The null value problem in surrogate keys

A surrogate_key macro to the rescue

Comments

2025 dbt Launch Showcase

Why primary keys are important​

What’s a surrogate key?​

Surrogate keys in BigQuery, Databricks, Redshift and Snowflake​

Surrogate keys in Postgres​

The null value problem in surrogate keys​

A surrogate_key macro to the rescue​

Comments

2025 dbt Launch Showcase

Why primary keys are important

What’s a surrogate key?

Surrogate keys in BigQuery, Databricks, Redshift and Snowflake

Surrogate keys in Postgres

The null value problem in surrogate keys

A surrogate_key macro to the rescue