SQL ARRAY_AGG

In a typical programming language such as Python or JavaScript, arrays are innate and abundant; when you’re processing data in SQL, arrays are a little less common, but they’re a handy way to add more structure to your data.

To create an array of multiple data values in SQL, you’ll likely leverage the ARRAY_AGG function (short for array aggregation), which puts your input column values into an array.

How to use SQL ARRAY_AGG

The ARRAY_AGG function has the following syntax:

array_agg( [distinct] <field_name> ) [ within group ( <order_by_field> ) ] [ over ( [partition by <field>] ) ]

A few notes on the functionality of this function:

  • Most of the syntax above is optional, meaning the ARRAY_AGG function can be as simple as array_agg(<field_name>) or used as a more complex window function (see the sketch after this list)
  • DISTINCT is an optional argument that can be passed in so that only distinct values appear in the returned array
  • If the input column is empty, the returned array will also be empty
  • Since ARRAY_AGG is an aggregate function (gasp!), you’ll need a GROUP BY statement at the end of your query if you’re grouping by certain fields
  • ARRAY_AGG and similar aggregate functions can become inefficient or costly to compute on large datasets, so use ARRAY_AGG wisely and truly understand your use cases for having arrays in your datasets
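
To make those options concrete, here’s a minimal sketch of both forms, assuming a Snowflake-style dialect and a hypothetical orders table with customer_id, order_id, order_date, and status columns:

-- simple aggregate form: one array of distinct statuses per customer
select
    customer_id,
    array_agg(distinct status) as status_array
from orders
group by 1

-- window function form: an ordered array of order ids,
-- repeated on every row within each customer's partition
select
    customer_id,
    order_id,
    array_agg(order_id) within group (order by order_date)
        over (partition by customer_id) as customer_order_ids
from orders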

Let’s dive into a practical example using the ARRAY_AGG function.

SQL ARRAY_AGG example

select
    date_trunc('month', order_date) as order_month,
    array_agg(distinct status) as status_array
from {{ ref('orders') }}
group by 1
order by 1

This simple query, run against the orders table from the Jaffle Shop sample dataset, returns a new column of distinct order statuses by order month:

order_month   status_array
2018-01-01    [ "returned", "completed", "return_pending" ]
2018-02-01    [ "completed", "return_pending" ]
2018-03-01    [ "completed", "shipped", "placed" ]
2018-04-01    [ "placed" ]

Looking at the query results, this makes sense! We’d expect newer orders to be less likely to have any returns yet, and older orders to have completed returns.

SQL ARRAY_AGG syntax in Snowflake, Databricks, BigQuery, and Redshift

Snowflake, Databricks, and BigQuery all support the ARRAY_AGG function. Redshift, however, supports an out-of-the-box LISTAGG function that offers similar functionality to ARRAY_AGG. The primary difference is that LISTAGG returns a string and lets you explicitly choose a delimiter to separate the list, whereas arrays are naturally delimited by commas.
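
As a rough sketch, here’s what a Redshift version of the earlier query might look like, assuming the same orders table; note that LISTAGG returns a delimited string rather than a true array:

-- Redshift: distinct statuses per month as a comma-separated string
select
    date_trunc('month', order_date) as order_month,
    listagg(distinct status, ', ') within group (order by status) as status_list
from orders
group by 1
order by 1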

ARRAY_AGG use cases

There are definitely too many use cases to list out for the ARRAY_AGG function in your dbt models, but it’s very likely that ARRAY_AGG is used fairly far downstream in your DAG (a directed acyclic graph: a graph whose nodes are directionally related to each other and don’t form a directional closed loop), since bundling your data up into arrays earlier in your DAG works against modularity and DRYness (DRY, or “Don’t Repeat Yourself,” is a software development principle aimed at reducing repetitive patterns and duplicate code and logic in favor of modular, referenceable code). A few downstream use cases for ARRAY_AGG:

  • In export_ models that are used to send data to platforms via a reverse ETL tool (reverse ETL is the process of getting transformed data from your data warehouse into end business platforms, such as sales CRMs and ad platforms), to pare down multiple rows into a single row. Some downstream platforms, for example, require values that we’d usually keep as separate rows to be a single row per customer or user. ARRAY_AGG is handy for bringing multiple column values together by a single id, such as creating an array of all items a user has ever purchased and sending that array downstream to an email platform to create a custom email campaign (see the sketch after this list).
  • Similar to export models, you may see ARRAY_AGG used in mart tables to create final aggregate arrays per a single dimension; performance concerns with ARRAY_AGG in these likely larger tables can potentially be bypassed by using incremental models in dbt.
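
As a minimal sketch of that export-model pattern, assuming a hypothetical order_items table with user_id and item_name columns:

-- one row per user, with an array of every item they've ever purchased
select
    user_id,
    array_agg(distinct item_name) as purchased_items
from order_items
group by 1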