AWS Glue Setup

Community plugin

Some core functionality may be limited. If you're interested in contributing, check out the source code for each repository listed below.

Overview of dbt-glue

  • Maintained by: Community
  • Authors: Benjamin Menuet, Moshir Mikael, Armando Segnini and Amine El Mallem
  • GitHub repo: aws-samples/dbt-glue
  • PyPI package: dbt-glue
  • Slack channel: #db-glue
  • Supported dbt Core version: v0.24.0 and newer
  • dbt Cloud support: Not Supported
  • Minimum data platform version: Glue 2.0

Installing dbt-glue

pip is the easiest way to install the adapter:

pip install dbt-glue

Installing dbt-glue will also install dbt-core and any other dependencies.

Configuring dbt-glue

For AWS Glue-specific configuration, please refer to AWS Glue Configuration.

For further info, refer to the GitHub repository: aws-samples/dbt-glue

For further (and more likely up-to-date) info, see the README

Connection Methods

Configuring your AWS profile for Glue Interactive Session

There are two IAM principals used with interactive sessions.

  • Client principal: The principal (either user or role) that calls the AWS APIs (Glue, Lake Formation, Interactive Sessions) from the local client. This is the principal configured in the AWS CLI and is likely the one you already use.
  • Service role: The IAM role that AWS Glue uses to execute your session. It works the same way as the role used for AWS Glue ETL.

Read this documentation to configure these principals.

Below is a least-privilege policy that covers all features of the dbt-glue adapter.

Replace the placeholders between <> with your own values. The table below explains each one:

Args | Description
region | The region where your Glue database is stored
AWS Account | The AWS account where you run your pipeline
dbt output database | The database updated by dbt (this is the database configured in the profile.yml of your dbt environment)
dbt source database | All databases used as source
dbt output bucket | The bucket name where the data will be generated by dbt (the location configured in the profile.yml of your dbt environment)
dbt source bucket | The bucket name of source databases (if they are not managed by Lake Formation)

sample_IAM_Policy.yml
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Read_and_write_databases",
            "Action": [
                "glue:SearchTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartitionIndex",
                "glue:DeleteDatabase",
                "glue:GetTableVersions",
                "glue:GetPartitions",
                "glue:DeleteTableVersion",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:DeletePartitionIndex",
                "glue:GetTableVersion",
                "glue:UpdateColumnStatisticsForTable",
                "glue:CreatePartition",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:GetTables",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetDatabase",
                "glue:GetPartition",
                "glue:UpdateColumnStatisticsForPartition",
                "glue:CreateDatabase",
                "glue:BatchDeleteTableVersion",
                "glue:BatchDeleteTable",
                "glue:DeletePartition",
                "lakeformation:ListResources",
                "lakeformation:BatchGrantPermissions",
                "lakeformation:ListPermissions"
            ],
            "Resource": [
                "arn:aws:glue:<region>:<AWS Account>:catalog",
                "arn:aws:glue:<region>:<AWS Account>:table/<dbt output database>/*",
                "arn:aws:glue:<region>:<AWS Account>:database/<dbt output database>"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "Read_only_databases",
            "Action": [
                "glue:SearchTables",
                "glue:GetTableVersions",
                "glue:GetPartitions",
                "glue:GetTableVersion",
                "glue:GetTables",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetDatabase",
                "glue:GetPartition",
                "lakeformation:ListResources",
                "lakeformation:ListPermissions"
            ],
            "Resource": [
                "arn:aws:glue:<region>:<AWS Account>:table/<dbt source database>/*",
                "arn:aws:glue:<region>:<AWS Account>:database/<dbt source database>",
                "arn:aws:glue:<region>:<AWS Account>:database/default",
                "arn:aws:glue:<region>:<AWS Account>:database/global_temp"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "Storage_all_buckets",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<dbt output bucket>",
                "arn:aws:s3:::<dbt source bucket>"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "Read_and_write_buckets",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<dbt output bucket>/*"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "Read_only_buckets",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<dbt source bucket>/*"
            ],
            "Effect": "Allow"
        }
    ]
}
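
One way to fill in the placeholders is a plain text substitution. The sketch below is only an illustration: the values, the file names policy_template_excerpt.json and policy_excerpt.json, and the one-line stand-in template are all assumptions, not part of dbt-glue. In practice you would run the same sed command against the full policy saved to a file.

```shell
# Hypothetical example values; replace them with your own.
REGION="us-east-1"
ACCOUNT_ID="123456789012"
OUTPUT_DB="dbt_demo"

# One resource line from the policy above, used as a stand-in template.
cat > policy_template_excerpt.json <<'EOF'
"arn:aws:glue:<region>:<AWS Account>:table/<dbt output database>/*"
EOF

# Replace each <placeholder> with its value ("|" is the sed delimiter
# so that "/" in ARNs needs no escaping).
sed -e "s|<region>|${REGION}|g" \
    -e "s|<AWS Account>|${ACCOUNT_ID}|g" \
    -e "s|<dbt output database>|${OUTPUT_DB}|g" \
    policy_template_excerpt.json > policy_excerpt.json

cat policy_excerpt.json
```

The remaining placeholders (source database, output bucket, source bucket) can be handled with additional -e expressions of the same form.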

Configuration of the local environment

dbt and the dbt-glue adapter are compatible with Python versions 3.7, 3.8, and 3.9, so check your Python version first:

$ python3 --version
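
If you want to script this check, the following sketch compares the interpreter's major.minor version against the range listed above. The helper name is_supported_python is hypothetical and exists only for this example.

```shell
# Return success if the given "major.minor" string is one of the
# versions listed above (3.7, 3.8, 3.9).
is_supported_python() {
  case "$1" in
    3.7|3.8|3.9) return 0 ;;
    *)           return 1 ;;
  esac
}

# Ask the interpreter itself for its major.minor version.
version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"

if is_supported_python "$version"; then
  echo "Python $version is in the supported range"
else
  echo "Python $version is outside the 3.7-3.9 range listed above"
fi
```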

Configure a Python virtual environment to isolate package version and code dependencies:

$ sudo yum install git
$ python3 -m venv dbt_venv
$ source dbt_venv/bin/activate
$ python3 -m pip install --upgrade pip

Install the latest version of the AWS CLI:

$ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
$ unzip awscliv2.zip
$ sudo ./aws/install

Install the aws-glue-sessions package:

$ sudo yum install gcc krb5-devel.x86_64 python3-devel.x86_64 -y
$ pip3 install --upgrade boto3
$ pip3 install --upgrade aws-glue-sessions

Example config

profiles.yml
type: glue
query-comment: This is a glue dbt example
role_arn: arn:aws:iam::1234567890:role/GlueInteractiveSessionRole
region: us-east-1
workers: 2
worker_type: G.1X
idle_timeout: 10
schema: "dbt_demo"
database: "dbt_demo"
session_provisioning_timeout_in_seconds: 120
location: "s3://dbt_demo_bucket/dbt_demo_data"
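
In a complete profiles.yml, dbt expects these keys to sit under a profile name and a target. The sketch below writes such a file using the values from the example above. The profile name dbt_glue_demo, the target name dev, and the demo directory are assumptions made for illustration; use the names from your own project.

```shell
# Hypothetical demo directory; by default dbt reads ~/.dbt/profiles.yml.
mkdir -p dbt_glue_demo

cat > dbt_glue_demo/profiles.yml <<'EOF'
dbt_glue_demo:        # profile name (assumed); must match the profile set in dbt_project.yml
  target: dev         # target name (assumed)
  outputs:
    dev:
      type: glue
      query-comment: This is a glue dbt example
      role_arn: arn:aws:iam::1234567890:role/GlueInteractiveSessionRole
      region: us-east-1
      workers: 2
      worker_type: G.1X
      idle_timeout: 10
      schema: "dbt_demo"
      database: "dbt_demo"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://dbt_demo_bucket/dbt_demo_data"
EOF

echo "wrote dbt_glue_demo/profiles.yml"
```

With a file laid out like this, running dbt debug with --profiles-dir pointing at the directory is one way to check that dbt can read the profile.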

The table below describes all the options.

Option | Description | Mandatory
project_name | The dbt project name. This must be the same as the one configured in the dbt project. | yes
type | The driver to use. | yes
query-comment | A string to inject as a comment in each query that dbt runs. | no
role_arn | The ARN of the interactive session role created as part of the CloudFormation template. | yes
region | The AWS Region where you run the data pipeline. | yes
workers | The number of workers of a defined workerType that are allocated when a job runs. | yes
worker_type | The type of predefined worker that is allocated when a job runs. Accepts a value of Standard, G.1X, or G.2X. | yes
schema | The schema used to organize data stored in Amazon S3. | yes
database | The database in Lake Formation. The database stores metadata tables in the Data Catalog. | yes
session_provisioning_timeout_in_seconds | The timeout in seconds for AWS Glue interactive session provisioning. | yes
location | The Amazon S3 location of your target data. | yes
idle_timeout | The AWS Glue session idle timeout in minutes. (The session stops after being idle for the specified amount of time.) | no
glue_version | The version of AWS Glue for this session to use. Currently, the only valid options are 2.0 and 3.0. The default value is 2.0. | no
security_configuration | The security configuration to use with this session. | no
connections | A comma-separated list of connections to use in the session. | no
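
As a quick sanity check before running dbt, the options marked mandatory above can be looked up in a profile file with grep. This is a rough sketch, not a YAML parser: the helper name check_required_keys and the file sample_target.yml are hypothetical, and the check only tests that each key appears at the start of some line.

```shell
# Print every mandatory option that does not appear in the given file.
# Returns 0 when all are present, 1 otherwise.
check_required_keys() {
  profile_file="$1"
  missing=0
  for key in type role_arn region workers worker_type schema database \
             session_provisioning_timeout_in_seconds location; do
    # Match the key at the start of a line, allowing leading indentation.
    if ! grep -Eq "^[[:space:]]*${key}:" "$profile_file"; then
      echo "missing mandatory option: ${key}"
      missing=1
    fi
  done
  return "$missing"
}

# Sample target settings copied from the example config above.
cat > sample_target.yml <<'EOF'
type: glue
role_arn: arn:aws:iam::1234567890:role/GlueInteractiveSessionRole
region: us-east-1
workers: 2
worker_type: G.1X
schema: "dbt_demo"
database: "dbt_demo"
session_provisioning_timeout_in_seconds: 120
location: "s3://dbt_demo_bucket/dbt_demo_data"
EOF

check_required_keys sample_target.yml && echo "all mandatory options present"
```

project_name is left out of the check because it is the top-level profile key rather than a setting inside the target.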

Caveats

Supported Functionality

Most dbt Core functionality is supported, but some features are only available with Apache Hudi.

Apache Hudi-only features:

  1. Incremental model updates by unique_key instead of partition_by (see merge strategy)
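
For illustration, an incremental model that merges on a unique key with Hudi might be configured as sketched below. The snippet writes a model file from the shell; the file name, the user_id column, and the events model it selects from are hypothetical, and the config keys shown should be checked against the dbt-glue README for your adapter version.

```shell
# Hypothetical model location inside a dbt project.
mkdir -p models

cat > models/hudi_merge_example.sql <<'EOF'
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
    file_format='hudi'
) }}

-- 'events' and 'user_id' are placeholder names for this sketch
select * from {{ ref('events') }}
EOF

echo "wrote models/hudi_merge_example.sql"
```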

Some dbt features, available on the core adapters, are not yet supported on Glue:

  1. Persisting column-level descriptions as database comments
  2. Snapshots