Blog Guide to surrogate keys

Guide to surrogate keys

Jul 01, 2024

A surrogate key is a unique identifier derived from the data itself. It often takes the form of a hashed value of multiple columns that will create a uniqueness constraint for each row. You will need to create a surrogate key for every table that doesn't have a natural primary key.

Why would you ever need to make a surrogate key? Shouldn’t all tables innately just have a field that uniquely identifies each row? Now that would be too easy…

Let’s say you have a table with all license plate numbers and the state of the plate. While license plate numbers are unique to their state, there could be duplicate license plate numbers across different states. So by default, there’s no natural key that can uniquely identify each row here. In order to uniquely identify each record in this table, you could create a surrogate key based on the unique combination of license plate number and its state.

Surrogate keys, natural keys, and primary keys oh my!

Primary keys can be established two ways: naturally or derived through the data in a surrogate key.

A natural key is a primary key that is innate to the data. Perhaps in some tables there’s a unique id field in each table that would act as the natural key. You can use documentation like entity relationship diagrams (ERDs) to help understand natural keys in APIs or backend application database tables.
A surrogate key is a hashed value of multiple fields in a dataset that create a uniqueness constraint on that dataset. You’ll essentially need to make a surrogate key in every table that lacks a natural key.

Note - You may also hear about primary keys being a form of a constraint on a database object. Column constraints are specified in the DDL to create or alter a database object. For data warehouses that support the enforcement of primary key constraints, this means that an error would be raised if a field's uniqueness or non-nullness was broken upon an INSERT or UPDATE statement. Most modern data warehouses don’t support and enforce primary key constraints, so it’s important to have automated testing in-place to ensure your primary keys are unique and not null.

How surrogate keys are created

In analytics engineering, you can generate surrogate keys using a hashing method of your choice. Remember, in order to truly create a uniqueness constraint on a database object, you’ll need to hash the fields together that make each row unique; when you generate a correct surrogate key for a dataset, you’re really establishing the true grain of that dataset.

Let’s take this to an example. Below, there is a table you pull from an ad platform that collects calendar_date, ad_id, and some performance columns.

calendar_date
ad_id
impressions
spend
clicks
conversions

2022-05-16
212
2022-05-16
214
2022-05-05
212
88744
4523.00
323
6.49
125600
117244.56
9432
166
4
0
17318
56

calendar_date	ad_id	impressions	spend	clicks	conversions
2022-05-16	212	88744	4523.00	9432	166
2022-05-16	214	323	6.49	4	0
2022-05-05	212	125600	117244.56	17318	56

In this state, this table has no natural key that can act as a primary key. You know the grain of this table: this is showing performance for each ad_id per calendar_date. Therefore, hashing those two fields will create a uniqueness constraint on this table.

To create a surrogate key for this table using the MD5 function, run the following:

select 
	md5(calendar_date || ad_id) as unique_id,
	*
from {{ source('ad_platform', 'custom_daily_report')}}

After executing this, the table would now have the unique_id field now uniquely identifying each row.

unique_id
calendar_date
ad_id
impressions
spend
clicks
conversions

62aef884fbe3470ce7d9a92140b09b17
2022-05-16
ea385f7a5e560ef4d8a78f7d913927e4
2022-05-16
53a33f257d1d4f2446469ac5adad1c0c
2022-05-05
212
88744
214
323
212
125600
4523.00
9432
6.49
4
117244.56
17318
166
0
56

unique_id	calendar_date	ad_id	impressions	spend	clicks	conversions
62aef884fbe3470ce7d9a92140b09b17	2022-05-16	212	88744	4523.00	9432	166
ea385f7a5e560ef4d8a78f7d913927e4	2022-05-16	214	323	6.49	4	0
53a33f257d1d4f2446469ac5adad1c0c	2022-05-05	212	125600	117244.56	17318	56

Testing surrogate keys

Amazing, you just made a surrogate key! You can just move on to the next data model, right? No!! It’s critically important to test your surrogate keys for uniqueness and non-null values to ensure that the correct fields were chosen to create the surrogate key.

In order to test for null and unique values you can utilize code-based data tests like dbt tests, that can check fields for nullness and uniqueness. You can additionally utilize simple SQL queries or unit tests to check if surrogate key count and non-nullness is correct.

A note on hashing algorithms

Depending on your data warehouse, there’s several cryptographic hashing options to create surrogate keys. The primary hashing methods include MD5 or other algorithms, like HASH or SHA. Choosing the appropriate hashing function is dependent on your dataset and what your warehouse supports.

Hashing algorithm
Bit length
Known collisions?

HASH
64 bits
MD5
128 bits
SHA256
256 bits
Yes, past ~4 billion elements
Yes, but incredibly unlikely
No

Hashing algorithm	Bit length	Known collisions?
HASH	64 bits	Yes, past ~4 billion elements
MD5	128 bits	Yes, but incredibly unlikely
SHA256	256 bits	No

Note - A collision occurs when two pieces of data that are different end up hashing to the same value. If a collision occurs, a different hashing method should be used.

Why we like surrogate keys

Let’s keep it brief: surrogate keys allow data folks to quickly understand the grain of the database object and are compatible across many different data warehouses.

Readability

Because surrogate keys are comprised of the fields that make a uniqueness constraint on the data, you can quickly identify the grain of the data. For example, if you see in your data model that the surrogate key field is created by hashing the ad_id and calendar_date fields, you can immediately know the true grain of the data. When you clearly understand the grain of a database object, this can make for an easier understanding of how entities join together and fan out.

Compatibility

Making a surrogate key involves a relatively straightforward usage of SQL: maybe some coalescing, concatenation, and a hashing method. Most, if not all, modern data warehouses support both the ability to concat, coalesce, and hash fields. They may not have the exact same syntax or hashing functions available, but their core functionality is the same.

Tip: dbt supports several macros to help data folks write DRY (don’t repeat yourself) code. The surrogate_key macro helps you create surrogate keys with the MD5 function without having to worry about coalescing potentially null field values.

Performance concerns for surrogate keys

In the past, you may have seen surrogate keys take the form of monotonically increasing integers (ex. 1, 2, 3, 4). These surrogate keys were often limited to 4-bit integers that could be indexed quickly. However, in the practice of analytics engineering, surrogate keys derived from the data often take the form of a hashed string value. Given this form, these surrogate keys are not necessarily optimized for performance for large table scans and complex joins. For large data models (millions, billions, trillions of rows) that have surrogate keys, you should materialize them as tables or incremental models to help make joining entities more efficient.

Surrogate Key FAQs

Conclusion

Surrogate keys are unique row identifiers that are created by using columns in a database object to create a uniqueness constraint on the data. To create a surrogate key, you will use a cryptographic algorithm usually in the form of the MD5 function to hash together fields that create a uniqueness constraint on the dataset. Ultimately, surrogate keys are a great way to create unique row identifiers for database objects that lack them naturally and allow folks to easily identify the grain of the data.

Great data professionals never work alone

Every industry leader understands one thing: you need the right network to grow. The dbt Community connects you with 100,000+ data professionals—people who share your challenges, insights, and ambitions.

If you’re looking for trusted advice, expert discussions, and real career growth, this is the place for you.

Join the dbt Community Learn more

Solve your toughest challenges

Join today and get real-world advice from experienced pros.

Expand your network

Foster connections with meetups, local groups, and like-minded peers.

Advance your career

The dbt community is full of learning opportunities and shared job postings.

Guide to surrogate keys

Surrogate keys, natural keys, and primary keys oh my!

How surrogate keys are created

calendar_date

ad_id

impressions

spend

clicks

conversions

unique_id

calendar_date

ad_id

impressions

spend

clicks

conversions

Testing surrogate keys

A note on hashing algorithms

Hashing algorithm

Bit length

Known collisions?

Why we like surrogate keys

Readability

Compatibility

Performance concerns for surrogate keys

Surrogate Key FAQs

What is a surrogate key and when do I need one?

How do I create a surrogate key in my data models?

What's the difference between surrogate keys and natural keys?

How should I test my surrogate keys?

Are there performance concerns with surrogate keys?

Conclusion

Further reading

Early Bird pricing is live for Coalesce 2025

Recent Posts

Great data professionals never work alone

Solve your toughest challenges

Expand your network

Advance your career