SQL ARRAY_AGG
In any typical programming language such as Python or Javascript, arrays are typically innate and bountiful; when you’re processing data in SQL, arrays are a little less common but are a handy way to provide more structure to your data.
To create an array of multiple data values in SQL, you’ll likely leverage the ARRAY_AGG function (short for array aggregation), which puts your input column values into an array.
How to use SQL ARRAY_AGG
The ARRAY_AGG function has the following syntax:
array_agg( [distinct] <field_name>) [within group (<order_by field>) over ([partition by <field>])
A few notes on the functionality of this function:
- Most of the example syntax from above is optional, meaning the ARRAY_AGG function can be as simple as
array_agg(<field_name>)
or used as a more complex as a window function - DISTINCT is an optional argument that can be passed in, so only distinct values are in the return array
- If input column is empty, the returning array will also be empty
- Since the ARRAY_AGG is an aggregate function (gasp!), you’ll need a GROUP BY statement at the end of your query if you’re grouping by certain field
- ARRAY_AGG and similar aggregate functions can become inefficient or costly to compute on large datasets, so use ARRAY_AGG wisely and truly understand your use cases for having arrays in your datasets
Let’s dive into a practical example using the ARRAY_AGG function.
SQL ARRAY_AGG example
select
date_trunc('month', order_date) as order_month,
array_agg(distinct status) as status_array
from {{ ref('orders') }}
group by 1
order by 1
This simple query using the sample dataset Jaffle Shop’s orders
table is returning a new column of distinct order statuses by order month:
order_month | status_array |
---|---|
2018-01-01 | [ "returned", "completed", "return_pending" ] |
2018-02-01 | [ "completed", "return_pending" ] |
2018-03-01 | [ "completed", "shipped", "placed" ] |
2018-04-01 | [ "placed" ] |
Looking at the query results—this makes sense! We’d expect newer orders to likely not have any returns, and older orders to have completed returns.
SQL ARRAY_AGG syntax in Snowflake, Databricks, BigQuery, and Redshift
Snowflake, Databricks, and BigQuery all support the ARRAY_AGG function. Redshift, however, supports an out-of-the-box LISTAGG function that can perform similar functionality to ARRAY_AGG. The primary difference is that LISTAGG allows you to explicitly choose a delimiter to separate a list whereas arrays are naturally delimited by commas.
ARRAY_AGG use cases
There are definitely too many use cases to list out for using the ARRAY_AGG function in your dbt models, but it’s very likely that ARRAY_AGG is used pretty downstream in your DAG since you likely don’t want your data so bundled up earlier in your DAG to improve modularity and dryness. A few downstream use cases for ARRAY_AGG:
- In
export_
models that are used to send data to platforms using a reverse ETL tool to pair down multiple rows into a single row. Some downstream platforms, for example, require certain values that we’d usually keep as separate rows to be one singular row per customer or user. ARRAY_AGG is handy to bring multiple column values together by a singular id, such as creating an array of all items a user has ever purchased and sending that array downstream to an email platform to create a custom email campaign. - Similar to export models, you may see ARRAY_AGG used in mart tables to create final aggregate arrays per a singular dimension; performance concerns of ARRAY_AGG in these likely larger tables can potentially be bypassed with use of incremental models in dbt.