Quickstart for dbt Cloud and Starburst Galaxy
Introduction
In this quickstart guide, you'll learn how to use dbt Cloud with Starburst Galaxy. It will show you how to:
- Load data into the Amazon S3 bucket. This guide uses AWS as the cloud service provider for demonstrative purposes. Starburst Galaxy also supports other data sources such as Google Cloud, Microsoft Azure, and more.
- Connect Starburst Galaxy to the Amazon S3 bucket.
- Create tables with Starburst Galaxy.
- Connect dbt Cloud to Starburst Galaxy.
- Take a sample query and turn it into a model in your dbt project. A model in dbt is a select statement.
- Add tests to your models.
- Document your models.
- Schedule a job to run.
- Connect to multiple data sources in addition to your S3 bucket.
You can check out dbt Fundamentals for free if you're interested in course learning with videos.
You can also watch the Build Better Data Pipelines with dbt and Starburst YouTube video produced by Starburst Data, Inc.
Prerequisites
- You have a multi-tenant deployment in dbt Cloud. For more information, refer to Tenancy.
- You have a Starburst Galaxy account. If you don't, you can start a free trial. Refer to the getting started guide in the Starburst Galaxy docs for further setup details.
- You have an AWS account with permissions to upload data to an S3 bucket.
- For Amazon S3 authentication, you will need either an AWS access key and AWS secret key with access to the bucket, or you will need a cross account IAM role with access to the bucket. For details, refer to these Starburst Galaxy docs:
Related content
- dbt Learn courses
- dbt Cloud CI job
- Job notifications
- Source freshness
- SQL overview for Starburst Galaxy
Load data to an Amazon S3 bucket
Using Starburst Galaxy, you can create tables and also transform them with dbt. Start by loading the Jaffle Shop data (provided by dbt Labs) to your Amazon S3 bucket. Jaffle Shop is a fictional cafe selling food and beverages in several US cities.
-
Download these CSV files to your local machine:
-
Upload these files to S3. For details, refer to Upload objects in the Amazon S3 docs.
When uploading these files, you must create the following folder structure and upload the appropriate file to each folder:
<bucket/blob>
dbt-quickstart (folder)
jaffle-shop-customers (folder)
jaffle_shop_customers.csv (file)
jaffle-shop-orders (folder)
jaffle_shop_orders.csv (file)
stripe-payments (folder)
stripe-payments.csv (file)
Connect Starburst Galaxy to the Amazon S3 bucket
If your Starburst Galaxy instance is not already connected to your S3 bucket, you need to create a cluster, configure a catalog that allows Starburst Galaxy to connect to the S3 bucket, add the catalog to your new cluster, and configure privilege settings.
In addition to Amazon S3, Starburst Galaxy supports many other data sources. To learn more about them, you can refer to the Catalogs overview in the Starburst Galaxy docs.
-
Create a cluster. Click Clusters on the left sidebar of the Starburst Galaxy UI, then click Create cluster in the main body of the page.
-
In the Create a new cluster modal, you only need to set the following options. You can use the defaults for the other options.
- Cluster name — Type a name for your cluster.
- Cloud provider region — Select the AWS region.
When done, click Create cluster.
-
Create a catalog. Click Catalogs on the left sidebar of the Starburst Galaxy UI, then click Create catalog in the main body of the page.
-
On the Create a data source page, select the Amazon S3 tile.
-
In the Name and description section of the Amazon S3 page, fill out the fields.
-
In the Authentication to S3 section of the Amazon S3 page, select the AWS (S3) authentication mechanism you chose to connect with.
-
In the Metastore configuration section, set these options:
- Default S3 bucket name — Enter the name of your S3 bucket you want to access.
- Default directory name — Enter the folder name of where the Jaffle Shop data lives in the S3 bucket. This is the same folder name you used in Load data to an Amazon S3 bucket.
- Allow creating external tables — Enable this option.
- Allow writing to external tables — Enable this option.
The Amazon S3 page should look similar to this, except for the Authentication to S3 section which is dependant on your setup:
-
Click Test connection. This verifies that Starburst Galaxy can access your S3 bucket.
-
Click Connect catalog if the connection test passes.
-
On the Set permissions page, click Skip. You can add permissions later if you want.
-
On the Add to cluster page, choose the cluster you want to add the catalog to from the dropdown and click Add to cluster.
-
Add the location privilege for your S3 bucket to your role in Starburst Galaxy. Click Access control > Roles and privileges on the left sidebar of the Starburst Galaxy UI. Then, in the Roles table, click the role name accountadmin.
If you're using an existing Starburst Galaxy cluster and don't have access to the accountadmin role, then select a role that you do have access to.
To learn more about access control, refer to Access control in the Starburst Galaxy docs.
-
On the Roles page, click the Privileges tab and click Add privilege.
-
On the Add privilege page, set these options:
- What would you like to modify privileges for? — Choose Location.
- Enter a storage location provide — Enter the storage location of your S3 bucket and the folder of where the Jaffle Shop data lives. Make sure to include the
/*
at the end of the location. - Create SQL — Enable the option.
When done, click Add privileges.
Create tables with Starburst Galaxy
To query the Jaffle Shop data with Starburst Galaxy, you need to create tables using the Jaffle Shop data that you loaded to your S3 bucket. You can do this (and run any SQL statement) from the query editor.
-
Click Query > Query editor on the left sidebar of the Starburst Galaxy UI. The main body of the page is now the query editor.
-
Configure the query editor so it queries your S3 bucket. In the upper right corner of the query editor, select your cluster in the first gray box and select your catalog in the second gray box:
-
Copy and paste these queries into the query editor. Then Run each query individually.
Replace
YOUR_S3_BUCKET_NAME
with the name of your S3 bucket. These queries create a schema namedjaffle_shop
and also create thejaffle_shop_customers
,jaffle_shop_orders
, andstripe_payments
tables:CREATE SCHEMA jaffle_shop WITH (location='s3://YOUR_S3_BUCKET_NAME/dbt-quickstart/');
CREATE TABLE jaffle_shop.jaffle_shop_customers (
id VARCHAR,
first_name VARCHAR,
last_name VARCHAR
)
WITH (
external_location = 's3://YOUR_S3_BUCKET_NAME/dbt-quickstart/jaffle-shop-customers/',
format = 'csv',
type = 'hive',
skip_header_line_count=1
);
CREATE TABLE jaffle_shop.jaffle_shop_orders (
id VARCHAR,
user_id VARCHAR,
order_date VARCHAR,
status VARCHAR
)
WITH (
external_location = 's3://YOUR_S3_BUCKET_NAME/dbt-quickstart/jaffle-shop-orders/',
format = 'csv',
type = 'hive',
skip_header_line_count=1
);
CREATE TABLE jaffle_shop.stripe_payments (
id VARCHAR,
order_id VARCHAR,
paymentmethod VARCHAR,
status VARCHAR,
amount VARCHAR,
created VARCHAR
)
WITH (
external_location = 's3://YOUR_S3_BUCKET_NAME/dbt-quickstart/stripe-payments/',
format = 'csv',
type = 'hive',
skip_header_line_count=1
); -
When the queries are done, you can see the following hierarchy on the query editor's left sidebar:
-
Verify that the tables were created successfully. In the query editor, run the following queries:
select * from jaffle_shop.jaffle_shop_customers;
select * from jaffle_shop.jaffle_shop_orders;
select * from jaffle_shop.stripe_payments;
Connect dbt Cloud to Starburst Galaxy
-
Make sure you are still logged in to Starburst Galaxy.
-
If you haven’t already, set your account’s role to accountadmin. Click your email address in the upper right corner, choose Switch role and select accountadmin.
If this role is not listed for you, choose the role you selected in Connect Starburst Galaxy to the Amazon S3 bucket when you added location privilege for your S3 bucket.
-
Click Clusters on the left sidebar.
-
Find your cluster in the View clusters table and click Connection info. Choose dbt from the Select client dropdown. Keep the Connection information modal open. You will use details from that modal in dbt Cloud.
-
In another browser tab, log in to dbt Cloud.
-
Create a new project in dbt Cloud. Click on your account name in the left side menu, select Account settings, and click + New Project.
-
Enter a project name and click Continue.
-
Choose Starburst as your connection and click Next.
-
Enter the Settings for your new project:
- Host – The Host value from the Connection information modal in your Starburst Galaxy tab.
- Port – 443 (which is the default)
-
Enter the Development Credentials for your new project:
- User – The User value from the Connection information modal in your Starburst Galaxy tab. Make sure to use the entire string, including the account's role which is the
/
and all the characters that follow. If you don’t include it, your default role is used and that might not have the correct permissions for project development. - Password – The password you use to log in to your Starburst Galaxy account.
- Database – The Starburst catalog you want to save your data to (for example, when writing new tables). For future reference, database is synonymous to catalog between dbt Cloud and Starburst Galaxy.
- Leave the remaining options as is. You can use their default values.
- User – The User value from the Connection information modal in your Starburst Galaxy tab. Make sure to use the entire string, including the account's role which is the
-
Click Test Connection. This verifies that dbt Cloud can access your Starburst Galaxy cluster.
-
Click Next if the test succeeded. If it failed, you might need to check your Starburst Galaxy settings and credentials.
Set up a dbt Cloud managed repository
When you develop in dbt Cloud, you can leverage Git to version control your code.
To connect to a repository, you can either set up a dbt Cloud-hosted managed repository or directly connect to a supported git provider. Managed repositories are a great way to trial dbt without needing to create a new repository. In the long run, it's better to connect to a supported git provider to use features like automation and continuous integration.
To set up a managed repository:
- Under "Setup a repository", select Managed.
- Type a name for your repo such as
bbaggins-dbt-quickstart
- Click Create. It will take a few seconds for your repository to be created and imported.
- Once you see the "Successfully imported repository," click Continue.
Initialize your dbt project and start developing
Now that you have a repository configured, you can initialize your project and start development in dbt Cloud:
- Click Start developing in the IDE. It might take a few minutes for your project to spin up for the first time as it establishes your git connection, clones your repo, and tests the connection to the warehouse.
- Above the file tree to the left, click Initialize dbt project. This builds out your folder structure with example models.
- Make your initial commit by clicking Commit and sync. Use the commit message
initial commit
and click Commit. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code. - You can now directly query data from your warehouse and execute
dbt run
. You can try this out now:- Click + Create new file, add this query to the new file, and click Save as to save the new file:
select * from dbt_quickstart.jaffle_shop.jaffle_shop_customers
- In the command line bar at the bottom, enter
dbt run
and click Enter. You should see adbt run succeeded
message.
- Click + Create new file, add this query to the new file, and click Save as to save the new file:
Build your first model
You have two options for working with files in the dbt Cloud IDE:
- Create a new branch (recommended) — Create a new branch to edit and commit your changes. Navigate to Version Control on the left sidebar and click Create branch.
- Edit in the protected primary branch — If you prefer to edit, format, or lint files and execute dbt commands directly in your primary git branch. The dbt Cloud IDE prevents commits to the protected branch, so you will be prompted to commit your changes to a new branch.
Name the new branch add-customers-model
.
- Click the ... next to the
models
directory, then select Create file. - Name the file
customers.sql
, then click Create. - Copy the following query into the file and click Save.
with customers as (
select
id as customer_id,
first_name,
last_name
from dbt_quickstart.jaffle_shop.jaffle_shop_customers
),
orders as (
select
id as order_id,
user_id as customer_id,
order_date,
status
from dbt_quickstart.jaffle_shop.jaffle_shop_orders
),
customer_orders as (
select
customer_id,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date,
count(order_id) as number_of_orders
from orders
group by 1
),
final as (
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from customers
left join customer_orders on customers.customer_id = customer_orders.customer_id
)
select * from final
- Enter
dbt run
in the command prompt at the bottom of the screen. You should get a successful run and see the three models.
Later, you can connect your business intelligence (BI) tools to these views and tables so they only read cleaned up data rather than raw data in your BI tool.
FAQs
Change the way your model is materialized
One of the most powerful features of dbt is that you can change the way a model is materialized in your warehouse, simply by changing a configuration value. You can change things between tables and views by changing a keyword rather than writing the data definition language (DDL) to do this behind the scenes.
By default, everything gets created as a view. You can override that at the directory level so everything in that directory will materialize to a different materialization.
-
Edit your
dbt_project.yml
file.-
Update your project
name
to:dbt_project.ymlname: 'jaffle_shop'
-
Configure
jaffle_shop
so everything in it will be materialized as a table; and configureexample
so everything in it will be materialized as a view. Update yourmodels
config block to:dbt_project.ymlmodels:
jaffle_shop:
+materialized: table
example:
+materialized: view -
Click Save.
-
-
Enter the
dbt run
command. Yourcustomers
model should now be built as a table!infoTo do this, dbt had to first run a
drop view
statement (or API call on BigQuery), then acreate table as
statement. -
Edit
models/customers.sql
to override thedbt_project.yml
for thecustomers
model only by adding the following snippet to the top, and click Save:models/customers.sql{{
config(
materialized='view'
)
}}
with customers as (
select
id as customer_id
...
) -
Enter the
dbt run
command. Your model,customers
, should now build as a view.- BigQuery users need to run
dbt run --full-refresh
instead ofdbt run
to full apply materialization changes.
- BigQuery users need to run
-
Enter the
dbt run --full-refresh
command for this to take effect in your warehouse.
FAQs
Delete the example models
You can now delete the files that dbt created when you initialized the project:
-
Delete the
models/example/
directory. -
Delete the
example:
key from yourdbt_project.yml
file, and any configurations that are listed under it.dbt_project.yml# before
models:
jaffle_shop:
+materialized: table
example:
+materialized: viewdbt_project.yml# after
models:
jaffle_shop:
+materialized: table -
Save your changes.
FAQs
Build models on top of other models
As a best practice in SQL, you should separate logic that cleans up your data from logic that transforms your data. You have already started doing this in the existing query by using common table expressions (CTEs).
Now you can experiment by separating the logic out into separate models and using the ref function to build models on top of other models:
-
Create a new SQL file,
models/stg_customers.sql
, with the SQL from thecustomers
CTE in our original query. -
Create a second new SQL file,
models/stg_orders.sql
, with the SQL from theorders
CTE in our original query.models/stg_customers.sqlselect
id as customer_id,
first_name,
last_name
from dbt_quickstart.jaffle_shop.jaffle_shop_customersmodels/stg_orders.sqlselect
id as order_id,
user_id as customer_id,
order_date,
status
from dbt_quickstart.jaffle_shop.jaffle_shop_orders -
Edit the SQL in your
models/customers.sql
file as follows:models/customers.sqlwith customers as (
select * from {{ ref('stg_customers') }}
),
orders as (
select * from {{ ref('stg_orders') }}
),
customer_orders as (
select
customer_id,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date,
count(order_id) as number_of_orders
from orders
group by 1
),
final as (
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from customers
left join customer_orders on customers.customer_id = customer_orders.customer_id
)
select * from final -
Execute
dbt run
.This time, when you performed a
dbt run
, separate views/tables were created forstg_customers
,stg_orders
andcustomers
. dbt inferred the order to run these models. Becausecustomers
depends onstg_customers
andstg_orders
, dbt buildscustomers
last. You do not need to explicitly define these dependencies.
FAQs
Add tests to your models
Adding tests to a project helps validate that your models are working correctly.
To add tests to your project:
-
Create a new YAML file in the
models
directory, namedmodels/schema.yml
-
Add the following contents to the file:
models/schema.ymlversion: 2
models:
- name: customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: stg_customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: stg_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
- name: customer_id
tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id -
Run
dbt test
, and confirm that all your tests passed.
When you run dbt test
, dbt iterates through your YAML files, and constructs a query for each test. Each query will return the number of records that fail the test. If this number is 0, then the test is successful.
FAQs
Document your models
Adding documentation to your project allows you to describe your models in rich detail, and share that information with your team. Here, we're going to add some basic documentation to our project.
-
Update your
models/schema.yml
file to include some descriptions, such as those below.models/schema.ymlversion: 2
models:
- name: customers
description: One record per customer
columns:
- name: customer_id
description: Primary key
tests:
- unique
- not_null
- name: first_order_date
description: NULL when a customer has not yet placed an order.
- name: stg_customers
description: This model cleans up customer data
columns:
- name: customer_id
description: Primary key
tests:
- unique
- not_null
- name: stg_orders
description: This model cleans up order data
columns:
- name: order_id
description: Primary key
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
- name: customer_id
tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id -
Run
dbt docs generate
to generate the documentation for your project. dbt introspects your project and your warehouse to generate a JSON file with rich documentation about your project.
- Click the book icon in the Develop interface to launch documentation in a new tab.
FAQs
Commit your changes
Now that you've built your customer model, you need to commit the changes you made to the project so that the repository has your latest code.
If you edited directly in the protected primary branch:
- Click the Commit and sync git button. This action prepares your changes for commit.
- A modal titled Commit to a new branch will appear.
- In the modal window, name your new branch
add-customers-model
. This branches off from your primary branch with your new changes. - Add a commit message, such as "Add customers model, tests, docs" and and commit your changes.
- Click Merge this branch to main to add these changes to the main branch on your repo.
If you created a new branch before editing:
- Since you already branched out of the primary protected branch, go to Version Control on the left.
- Click Commit and sync to add a message.
- Add a commit message, such as "Add customers model, tests, docs."
- Click Merge this branch to main to add these changes to the main branch on your repo.
Deploy dbt
Use dbt Cloud's Scheduler to deploy your production jobs confidently and build observability into your processes. You'll learn to create a deployment environment and run a job in the following steps.
Create a deployment environment
- In the upper left, select Deploy, then click Environments.
- Click Create Environment.
- In the Name field, write the name of your deployment environment. For example, "Production."
- In the dbt Version field, select the latest version from the dropdown.
- Under Deployment connection, enter the name of the dataset you want to use as the target, such as "Analytics". This will allow dbt to build and work with that dataset. For some data warehouses, the target dataset may be referred to as a "schema".
- Click Save.
Create and run a job
Jobs are a set of dbt commands that you want to run on a schedule. For example, dbt build
.
As the jaffle_shop
business gains more customers, and those customers create more orders, you will see more records added to your source data. Because you materialized the customers
model as a table, you'll need to periodically rebuild your table to ensure that the data stays up-to-date. This update will happen when you run a job.
- After creating your deployment environment, you should be directed to the page for a new environment. If not, select Deploy in the upper left, then click Jobs.
- Click Create one and provide a name, for example, "Production run", and link to the Environment you just created.
- Scroll down to the Execution Settings section.
- Under Commands, add this command as part of your job if you don't see it:
dbt build
- Select the Generate docs on run checkbox to automatically generate updated project docs each time your job runs.
- For this exercise, do not set a schedule for your project to run — while your organization's project should run regularly, there's no need to run this example project on a schedule. Scheduling a job is sometimes referred to as deploying a project.
- Select Save, then click Run now to run your job.
- Click the run and watch its progress under "Run history."
- Once the run is complete, click View Documentation to see the docs for your project.
Congratulations 🎉! You've just deployed your first dbt project!
FAQs
Connect to multiple data sources
This quickstart focuses on using dbt Cloud to run models against a data lake (S3) by using Starburst Galaxy as the query engine. In most real world scenarios, the data that is needed for running models is actually spread across multiple data sources and is stored in a variety of formats. With Starburst Galaxy, Starburst Enterprise, and Trino, you can run your models on any of the data you need, no matter where it is stored.
If you want to try this out, you can refer to the Starburst Galaxy docs to add more data sources and load the Jaffle Shop data into the source you select. Then, extend your models to query the new data source and the data source you created in this quickstart.