SQL transformations
Now that we have all our sources and staging models done, it's time to move into where dbt shines — transformation!
We need to:
- Create some intermediate tables to join tables that aren’t hierarchical
- Create core tables for business intelligence (BI) tool ingestion
- Answer the two questions about:
- fastest pit stops
- lap time trends about our Formula 1 data by creating aggregate models using python!
Intermediate models
We need to join lots of reference tables to our results table to create a human readable dataframe. What does this mean? For example, we don’t only want to have the numeric status_id
in our table, we want to be able to read in a row of data that a driver could not finish a race due to engine failure (status_id=5
).
By now, we are pretty good at creating new files in the correct directories so we won’t cover this in detail. All intermediate models should be created in the path models/intermediate
.
Create a new file called
int_lap_times_years.sql
. In this model, we are joining our lap time and race information so we can look at lap times over years. In earlier Formula 1 eras, lap times were not recorded (only final results), so we filter out records where lap times are null.with lap_times as (
select * from {{ ref('stg_f1_lap_times') }}
),
races as (
select * from {{ ref('stg_f1_races') }}
),
expanded_lap_times_by_year as (
select
lap_times.race_id,
driver_id,
race_year,
lap,
lap_time_milliseconds
from lap_times
left join races
on lap_times.race_id = races.race_id
where lap_time_milliseconds is not null
)
select * from expanded_lap_times_by_yearCreate a file called
in_pit_stops.sql
. Pit stops are a many-to-one (M:1) relationship with our races. We are creating a feature calledtotal_pit_stops_per_race
by partitioning over ourrace_id
anddriver_id
, while preserving individual level pit stops for rolling average in our next section.with stg_f1__pit_stops as
(
select * from {{ ref('stg_f1_pit_stops') }}
),
pit_stops_per_race as (
select
race_id,
driver_id,
stop_number,
lap,
lap_time_formatted,
pit_stop_duration_seconds,
pit_stop_milliseconds,
max(stop_number) over (partition by race_id,driver_id) as total_pit_stops_per_race
from stg_f1__pit_stops
)
select * from pit_stops_per_raceCreate a file called
int_results.sql
. Here we are using 4 of our tables —races
,drivers
,constructors
, andstatus
— to give context to ourresults
table. We are now able to calculate a new featuredrivers_age_years
by bringing thedate_of_birth
andrace_year
into the same table. We are also creating a column to indicate if the driver did not finish (dnf) the race, based upon if theirposition
was null called,dnf_flag
.with results as (
select * from {{ ref('stg_f1_results') }}
),
races as (
select * from {{ ref('stg_f1_races') }}
),
drivers as (
select * from {{ ref('stg_f1_drivers') }}
),
constructors as (
select * from {{ ref('stg_f1_constructors') }}
),
status as (
select * from {{ ref('stg_f1_status') }}
),
int_results as (
select
result_id,
results.race_id,
race_year,
race_round,
circuit_id,
circuit_name,
race_date,
race_time,
results.driver_id,
results.driver_number,
forename ||' '|| surname as driver,
cast(datediff('year', date_of_birth, race_date) as int) as drivers_age_years,
driver_nationality,
results.constructor_id,
constructor_name,
constructor_nationality,
grid,
position,
position_text,
position_order,
points,
laps,
results_time_formatted,
results_milliseconds,
fastest_lap,
results_rank,
fastest_lap_time_formatted,
fastest_lap_speed,
results.status_id,
status,
case when position is null then 1 else 0 end as dnf_flag
from results
left join races
on results.race_id=races.race_id
left join drivers
on results.driver_id = drivers.driver_id
left join constructors
on results.constructor_id = constructors.constructor_id
left join status
on results.status_id = status.status_id
)
select * from int_resultsCreate a Markdown file
intermediate.md
that we will go over in depth during the Testing and Documentation sections.# the intent of this .md is to allow for multi-line long form explanations for our intermediate transformations
# below are descriptions
{% docs int_results %} In this query we want to join out other important information about the race results to have a human readable table about results, races, drivers, constructors, and status.
We will have 4 left joins onto our results table. {% enddocs %}
{% docs int_pit_stops %} There are many pit stops within one race, aka a M:1 relationship.
We want to aggregate this so we can properly join pit stop information without creating a fanout. {% enddocs %}
{% docs int_lap_times_years %} Lap times are done per lap. We need to join them out to the race year to understand yearly lap time trends. {% enddocs %}Create a YAML file
intermediate.yml
that we will go over in depth during the Testing and Documentation sections.version: 2
models:
- name: int_results
description: '{{ doc("int_results") }}'
- name: int_pit_stops
description: '{{ doc("int_pit_stops") }}'
- name: int_lap_times_years
description: '{{ doc("int_lap_times_years") }}'That wraps up the intermediate models we need to create our core models!
Core models
Create a file
fct_results.sql
. This is what I like to refer to as the “mega table” — a really large denormalized table with all our context added in at row level for human readability. Importantly, we have a tablecircuits
that is linked through the tableraces
. When we joinedraces
toresults
inint_results.sql
we allowed our tables to make the connection fromcircuits
toresults
infct_results.sql
. We are only taking information about pit stops at the result level so our join would not cause a fanout.with int_results as (
select * from {{ ref('int_results') }}
),
int_pit_stops as (
select
race_id,
driver_id,
max(total_pit_stops_per_race) as total_pit_stops_per_race
from {{ ref('int_pit_stops') }}
group by 1,2
),
circuits as (
select * from {{ ref('stg_f1_circuits') }}
),
base_results as (
select
result_id,
int_results.race_id,
race_year,
race_round,
int_results.circuit_id,
int_results.circuit_name,
circuit_ref,
location,
country,
latitude,
longitude,
altitude,
total_pit_stops_per_race,
race_date,
race_time,
int_results.driver_id,
driver,
driver_number,
drivers_age_years,
driver_nationality,
constructor_id,
constructor_name,
constructor_nationality,
grid,
position,
position_text,
position_order,
points,
laps,
results_time_formatted,
results_milliseconds,
fastest_lap,
results_rank,
fastest_lap_time_formatted,
fastest_lap_speed,
status_id,
status,
dnf_flag
from int_results
left join circuits
on int_results.circuit_id=circuits.circuit_id
left join int_pit_stops
on int_results.driver_id=int_pit_stops.driver_id and int_results.race_id=int_pit_stops.race_id
)
select * from base_resultsCreate the file
pit_stops_joined.sql
. Our results and pit stops are at different levels of dimensionality (also called grain). Simply put, we have multiple pit stops per a result. Since we are interested in understanding information at the pit stop level with information about race year and constructor, we will create a new tablepit_stops_joined.sql
where each row is per pit stop. Our new table tees up our aggregation in Python.with base_results as (
select * from {{ ref('fct_results') }}
),
pit_stops as (
select * from {{ ref('int_pit_stops') }}
),
pit_stops_joined as (
select
base_results.race_id,
race_year,
base_results.driver_id,
constructor_id,
constructor_name,
stop_number,
lap,
lap_time_formatted,
pit_stop_duration_seconds,
pit_stop_milliseconds
from base_results
left join pit_stops
on base_results.race_id=pit_stops.race_id and base_results.driver_id=pit_stops.driver_id
)
select * from pit_stops_joinedEnter in the command line and execute
dbt build
to build out our entire pipeline to up to this point. Don’t worry about “overriding” your previous models – dbt workflows are designed to be idempotent so we can run them again and expect the same results.Let’s talk about our lineage so far. It’s looking good 😎. We’ve shown how SQL can be used to make data type, column name changes, and handle hierarchical joins really well; all while building out our automated lineage!
The DAG
Time to Commit and push our changes and give your commit a message like
intermediate and fact models
before moving on.