Set up and connect Redshift

Introduction

For the Redshift project in the getting started guide, you'll learn how to set up Redshift and connect it to dbt Cloud.

This guide will walk you through:

  • Setting up a Redshift cluster
  • Loading training data into your Redshift account
  • Connecting dbt Cloud and Redshift

Prerequisites

Before beginning this tutorial, you will need access to an existing AWS account with permissions to execute a CloudFormation template that creates appropriate roles and a Redshift cluster. If you do not have an AWS account, head over to Sign up for AWS.

Setting up

Let’s get started by accessing your AWS account and setting up Redshift.

  1. Sign into your AWS account on the AWS sign in page as a root user or IAM user depending on your level of access.
  2. We will be using a CloudFormation template to quickly set up a Redshift instance. A CloudFormation template is a configuration file that automatically spins up the necessary resources in AWS. Use the link below to start this process. (source: CloudFormation JSON file)

Start CloudFormation Stack

  3. Choose Next for each page until you reach the Select acknowledgement checkbox. Check the box for "I acknowledge that AWS CloudFormation might create IAM resources with custom names" and click Create Stack. You should land on the stack page with a CREATE_IN_PROGRESS status.

    Cloud Formation in Progress

  4. When the stack status changes to “CREATE_COMPLETE”, click on the Outputs tab on the top to view information that you will use throughout the rest of this guide. Save those credentials for later by keeping this open in a tab.

  5. Type Redshift into the search bar at the top and click on Amazon Redshift.

    Click on Redshift

  6. Confirm that your new Redshift Cluster is listed under Cluster overview. Click on the cluster name.

Available Redshift Cluster

  7. Click on Query Data. You can choose the classic query editor or v2. We will use the v2 editor for this guide.

  8. You may be asked to Configure account. For this sandbox environment, we recommend selecting “Configure account”.

  9. Click on your cluster name in the list and fill in the credentials from the output of the stack:

  • Database: dbtworkshop
  • User Name: dbtadmin
  • Password: choose your own password and save it for later
Redshift Query Editor v2

Connect to Redshift Cluster

  10. Click on Create Connection.

Congrats! You have your Redshift cluster.

Loading data

Now we are going to load our sample data into the S3 bucket that our CloudFormation template created. S3 buckets are a cheap and simple way to store data outside of Redshift.

  1. The data used in this course is stored as CSVs in a public S3 bucket. Download the three files (jaffle_shop_customers.csv, jaffle_shop_orders.csv, and stripe_payments.csv) to your computer to use in the following steps.
  2. Now we are going to upload those files to the S3 bucket that you created via CloudFormation. Go to the search bar at the top, type in S3, and click on S3. There will be sample data in the bucket already; feel free to ignore it or use it for other modeling exploration.
Go to S3

  3. Click on the name of your S3 bucket. If you have multiple S3 buckets, this will be the bucket that was listed under “Workshopbucket” on the Outputs page. The bucket name will be prefixed with dbt-data-lake.
Go to your S3 Bucket

  4. Click on Upload, drag the three files into the UI, and click the Upload button.
Upload your CSVs

  5. Save the name of the S3 bucket. It should look like this: s3://dbt-data-lake-xxxx. You will need it for the next section.
  6. Now let’s go back to the Redshift query editor. Search for Redshift in the search bar, choose your cluster, and select Query data.
  7. In your query editor, execute the query below to create the schemas that we will place your raw data into. You can highlight each statement and then click on Run to run them individually. If you are on the Classic Query Editor, you might need to input them separately into the UI. You should see these schemas listed under dbtworkshop.
create schema if not exists jaffle_shop;
create schema if not exists stripe;
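To confirm both schemas were created, you can query the pg_namespace catalog table, which Redshift inherits from Postgres. This is an optional sanity check, not a required step:

-- Optional sanity check: both schemas should appear in the result.
select nspname
from pg_namespace
where nspname in ('jaffle_shop', 'stripe');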
  8. Now create the tables in your schemas using the statements below. These will be created as empty tables in the respective schemas and populated in the next step.
create table jaffle_shop.customers(
    id integer,
    first_name varchar(50),
    last_name varchar(50)
);

create table jaffle_shop.orders(
    id integer,
    user_id integer,
    order_date date,
    status varchar(50),
    _etl_loaded_at timestamp default current_timestamp
);

create table stripe.payment(
    id integer,
    orderid integer,
    paymentmethod varchar(50),
    status varchar(50),
    amount integer,
    created date,
    _batched_at timestamp default current_timestamp
);
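If you want to double-check that all three tables now exist, a query like the one below against information_schema.tables (which Redshift supports) should return three rows. This is optional:

-- Optional: confirm the three empty tables exist in their schemas.
select table_schema, table_name
from information_schema.tables
where table_schema in ('jaffle_shop', 'stripe')
order by table_schema, table_name;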
  9. Now we need to copy the data from S3. Be sure to update the S3 location, IAM role, and region. You can find the S3 location and IAM role in the outputs from the CloudFormation stack.
copy jaffle_shop.customers(id, first_name, last_name)
from 's3://dbt-data-lake-xxxx/jaffle_shop_customers.csv'
iam_role 'arn:aws:iam::XXXXXXXXXX:role/RoleName'
region 'us-east-1'
delimiter ','
ignoreheader 1
acceptinvchars;

copy jaffle_shop.orders(id, user_id, order_date, status)
from 's3://dbt-data-lake-xxxx/jaffle_shop_orders.csv'
iam_role 'arn:aws:iam::XXXXXXXXXX:role/RoleName'
region 'us-east-1'
delimiter ','
ignoreheader 1
acceptinvchars;

copy stripe.payment(id, orderid, paymentmethod, status, amount, created)
from 's3://dbt-data-lake-xxxx/stripe_payments.csv'
iam_role 'arn:aws:iam::XXXXXXXXXX:role/RoleName'
region 'us-east-1'
delimiter ','
ignoreheader 1
acceptinvchars;
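
If any of the copy statements fail, Redshift records the details in the stl_load_errors system table. A quick way to inspect the most recent load errors:

-- Troubleshooting: show the most recent COPY failures, newest first.
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;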

Ensure that you can run a select * from each of the tables with the following code snippets.

select * from jaffle_shop.customers;
select * from jaffle_shop.orders;
select * from stripe.payment;
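
As an additional check, you can compare row counts across the three tables in a single result set. A minimal sketch; the counts themselves depend on the CSVs you uploaded:

-- Optional: row counts for all three tables in one query.
select 'jaffle_shop.customers' as table_name, count(*) as row_count from jaffle_shop.customers
union all
select 'jaffle_shop.orders', count(*) from jaffle_shop.orders
union all
select 'stripe.payment', count(*) from stripe.payment;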

Congratulations! At this point, you have created a Redshift instance and loaded training data. In the next section, we will walk through the next steps to connect dbt Cloud and Redshift.

Connecting to dbt Cloud

Let's start this section by creating a dbt Cloud account if you haven't already.

  1. Navigate to dbt Cloud.
  2. If you don't have a dbt account, create a new one and verify your account via email.
  3. If you have a dbt Cloud account, you can create a new project from your existing account:
    1. Click the hamburger icon, then click Account Settings.
    2. Click New project, then click Begin.
  4. You can now use the Set Up a New Project workflow to set up your first dbt Cloud project!
  5. Type "Analytics" in the dbt Project Name field. You will be able to rename this project later.
  6. Click Continue.

Connect dbt Cloud to Redshift

Now let's set up the connection between dbt Cloud and Redshift.

  1. For Set up a Database Connection, click Redshift.
Setup Redshift Connection

  2. For the name, enter Redshift or another simple title.

  3. Enter your Redshift settings. Reference the credentials you saved from the CloudFormation template.

    • Your hostname is the entire hostname from the CloudFormation output. Make sure to drop the http:// at the beginning and any trailing slashes at the end.
    • The port is 5439.
    • The database is dbtworkshop.
  4. Set your development credentials. These credentials will be used by dbt Cloud to connect to Redshift. Those credentials (as provided in your CloudFormation output) will be:

    • Username: dbtadmin
    • Password: the password that you set earlier in the guide
    • Schema: your sandbox schema, where you will build all of your development objects. We generally use the dbt_<first_initial><lastname> naming convention, for example dbt_bbaggins (see the optional snippet after these steps).
  5. Click Test at the top. This verifies that dbt Cloud can access your Redshift cluster.

  6. If you see "Connection test Succeeded!" then click Continue. If it fails, you might need to go back and check your Redshift settings and development credentials.
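
Your development schema doesn't need to exist yet; dbt will create it for you on your first run. If you'd rather create it up front, you can run a statement like the one below in the Redshift query editor. dbt_bbaggins is a hypothetical schema name following the convention above:

-- Optional: pre-create your development schema. dbt would otherwise
-- create it automatically on the first run.
create schema if not exists dbt_bbaggins;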

Initialize your repository and start development

Setting up a managed repository

dbt Cloud uses Git for version control, but using a managed repository makes this easier. To set up a managed repository:

  1. Under "Add repository from", select Managed.
  2. Type a name for your repo, such as bbaggins-dbt-quickstart.
  3. Click Create. It will take a few seconds for your repository to be created and imported.
  4. Once you see the "Successfully imported repository" message, click Continue.

Initialize your dbt project

Now that you have a repository configured, you can initialize your project and start development in the dbt Cloud IDE:

  1. Click Start Developing. You can also navigate to the Develop page from the hamburger menu in the top left. It might take a few minutes for your project to spin up for the first time, as it establishes your git connection, clones your repo, and tests the connection to the warehouse.
  2. Above the file tree to the left, click Initialize your project. This builds out your folder structure with example models.
  3. Make your initial commit by clicking Commit. Use the commit message initial commit. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code.
  4. Now you should be able to directly query data from your warehouse and execute dbt run. You can try this out now:
    • In "Scratchpad 1", delete all text and paste your warehouse-specific code into Scratchpad 1:
select * from `dbt-tutorial.jaffle_shop.customers`
  • In the command line bar at the bottom, type dbt run and click Enter. We will explore what happens in the next section of the tutorial.

Congratulations! You have successfully completed the following:

  • Set up a Redshift cluster
  • Loaded training data into your Redshift account
  • Connected dbt Cloud and Redshift

Next steps

You can continue the tutorial by building your first project. You will learn and apply the fundamentals of dbt with models, tests, docs, and deploying your first job!