Skip to main content

Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake

The focus of this workshop will be to demonstrate how we can use both SQL and python together in the same workflow to run both analytics and machine learning models on dbt Cloud.

All code in today’s workshop can be found on GitHub.

What you'll use during the lab

What you'll learn

  • How to build scalable data transformation pipelines using dbt, and Snowflake using SQL and Python
  • How to leverage copying data into Snowflake from a public S3 bucket

What you need to know

  • Basic to intermediate SQL and python.
  • Basic understanding of dbt fundamentals. We recommend the dbt Fundamentals course if you're interested.
  • High level machine learning process (encoding, training, testing)
  • Simple ML algorithms we will use logistic regression to keep the focus on the workflow, not algorithms!

What you'll build

  • A set of data analytics and prediction pipelines using Formula 1 data leveraging dbt and Snowflake, making use of best practices like data quality tests and code promotion between environments
  • We will create insights for:
    1. Finding the lap time average and rolling average through the years (is it generally trending up or down)?
    2. Which constructor has the fastest pit stops in 2021?
    3. Predicting the position of each driver given using a decade of data (2010 - 2020)

As inputs, we are going to leverage Formula 1 datasets hosted on a dbt Labs public S3 bucket. We will create a Snowflake Stage for our CSV files then use Snowflake’s COPY INTO function to copy the data in from our CSV files into tables. The Formula 1 is available on Kaggle. The data is originally compiled from the Ergast Developer API.

Overall we are going to set up the environments, build scalable pipelines in dbt, establish data tests, and promote code to production.