Skip to main content

To defer or to clone, that is the question

· 6 min read
Kshitij Aranke
Doug Beatty

Hi all, I’m Kshitij, a senior software engineer on the Core team at dbt Labs. One of the coolest moments of my career here thus far has been shipping the new dbt clone command as part of the dbt-core v1.6 release.

However, one of the questions I’ve received most frequently is guidance around “when” to clone that goes beyond the documentation on “how” to clone. In this blog post, I’ll attempt to provide this guidance by answering these FAQs:

  1. What is dbt clone?
  2. How is it different from deferral?
  3. Should I defer or should I clone?

What is dbt clone?

dbt clone is a new command in dbt 1.6 that leverages native zero-copy clone functionality on supported warehouses to copy entire schemas for free, almost instantly.

How is this possible?

Well, the warehouse “cheats” by only copying metadata from the source schema to the target schema; the underlying data remains at rest during this operation. This metadata includes materialized objects like tables and views, which is why you see a clone of these objects in the target schema.

In computer science jargon, clone makes a copy of the pointer from the source schema to the underlying data; after the operation there are now two pointers (source and target schemas) that each point to the same underlying data.

How is cloning different from deferral?

On the surface, cloning and deferral seem similar – they’re both ways to save costs in the data warehouse. They do this by bypassing expensive model re-computations – clone by eagerly copying an entire schema into the target schema, and defer by lazily referencing pre-built models in the source schema.

Let’s unpack this sentence and explore its first-order effects:

deferclone
How do I use it?Implicit via the --defer flagExplicit via the dbt clone command
What are its outputs?Doesn't create any objects itself, but dbt might create objects in the target schema if they’ve changed from those in the source schema.Copies objects from source schema to target schema in the data warehouse, which are persisted after operation is finished.
How does it work?Compares manifests between source and target dbt runs and overrides ref to resolve models not built in the target run to point to objects built in the source run.Uses zero-copy cloning if available to copy objects from source to target schemas, else creates pointer views (select * from my_model)

These first-order effects lead to the following second-order effects that truly distinguish clone and defer from each other:

deferclone
Where can I use objects built in the target schema?Only within the context of dbtAny downstream tool (e.g. BI)
Can I safely modify objects built in the target schema?No, since this would modify production dataYes, cloning is a cheap way to create a sandbox of production data for experimentation
Will data in the target schema drift from data in the source schema?No, since deferral will always point to the latest version of the source schemaYes, since clone is a point-in-time operation
Can I use multiple source schemas at once?Yes, defer can dynamically switch between source schemas e.g. ref unchanged models from production and changed models from stagingNo, clone copies objects from one source schema to one target schema

Should I defer or should I clone?

Putting together all the points above, here’s a handy cheat sheet for when to defer and when to clone:

deferclone
Save time & cost by avoiding re-computation
Create database objects to be available in downstream tools (e.g. BI)
Safely modify objects in the target schema
Avoid creating new database objects
Avoid data drift
Support multiple dynamic sources

To absolutely drive this point home:

  1. If you send someone this cheatsheet by linking to this page, you are deferring to this page
  2. If you print out this page and write notes in the margins, you have cloned this page

Putting it in practice

Using the cheat sheet above, let’s explore a few common scenarios and explore whether we should use defer or clone for each:

  1. Testing staging datasets in BI

    In this scenario, we want to:

    1. Make a copy of our production dataset available in our downstream BI tool
    2. To safely iterate on this copy without breaking production datasets

    Therefore, we should use clone in this scenario.

  2. Slim CI

    In this scenario, we want to:

    1. Refer to production models wherever possible to speed up continuous integration (CI) runs
    2. Only run and test models in the CI staging environment that have changed from the production environment
    3. Reference models from different environments – prod for unchanged models, and staging for modified models

    Therefore, we should use defer in this scenario.

Use dbt clone in CI jobs to test incremental models

Learn how to use dbt clone in CI jobs to efficiently test modified incremental models, simulating post-merge behavior while avoiding full-refresh costs.

  1. Blue/Green Deployments

    In this scenario, we want to:

    1. Ensure that all tests are always passing on the production dataset, even if that dataset is slightly stale
    2. Atomically rollback a promotion to production if tests aren’t passing across the entire staging dataset

    In this scenario, we can use clone to implement a deployment strategy known as blue-green deployments where we build the entire staging dataset and then run tests against it, and only clone it over to production if all tests pass.

As a rule of thumb, deferral lends itself better to continuous integration (CI) use cases whereas cloning lends itself better to continuous deployment (CD) use cases.

Wrapping Up

In this post, we covered what dbt clone is, how it is different from deferral, and when to use each. Often, they can be used together within the same project in different parts of the deployment lifecycle.

Thanks for reading, and I look forward to seeing what you build with dbt clone.

Thanks to Jason Ganz and Gwen Windflower for reviewing drafts of this article

Comments

Loading

Be part of Coalesce 2024

Coalesce is the premiere analytics engineering conference! Sign up now for innovation, collaboration, and inspiration. Don't miss out!

Register now