What is parsing?
At the start of every dbt invocation, dbt reads all the files in your project, extracts information, and constructs a manifest containing every object (model, source, macro, etc.). Among other things, dbt uses the ref(), source(), and config() macro calls within models to set properties, infer dependencies, and construct your project's DAG.
Parsing can be slow, especially as projects grow to hundreds of models and thousands of files, which is frustrating in development. There are a handful of ways to optimize dbt performance today:
- LibYAML bindings for PyYAML
- Partial parsing, which avoids re-parsing unchanged files between invocations
- An experimental parser, which extracts information from simple models much more quickly
- RPC server, which keeps a manifest in memory, and re-parses the project at server startup/hangup
These optimizations can be used in combination to reduce parse time from minutes to seconds. At the same time, each has some known limitations, so they are disabled by default.
PyYAML + LibYAML
dbt uses PyYAML to read and validate YAML files in your project. PyYAML is written in pure Python, but it can use LibYAML (written in C and much faster) if it's available on your system. Whenever it parses your project, dbt first checks whether LibYAML is available.
You can test to see if LibYAML is installed by running this command in the environment where you've installed dbt:
python -c "from yaml import CLoader"
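If that command exits without an ImportError, the LibYAML bindings are available. The same check can be scripted, for example as part of an environment setup step. This is a minimal sketch and not part of dbt: the helper name libyaml_available is hypothetical, and it relies only on PyYAML exposing CLoader when its C extension was built.

```python
import importlib


def libyaml_available() -> bool:
    """Return True if PyYAML is installed with its LibYAML (C) bindings.

    Hypothetical helper for illustration; dbt does not ship this function.
    """
    try:
        yaml = importlib.import_module("yaml")
    except ImportError:
        # PyYAML itself is missing from this environment.
        return False
    # PyYAML only defines CLoader when the C extension imported successfully.
    return hasattr(yaml, "CLoader")


if __name__ == "__main__":
    print("LibYAML available:", libyaml_available())
```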
Partial parsing
After parsing your project, dbt stores an internal project manifest in a file called partial_parse.msgpack. When partial parsing is enabled, dbt will use that internal manifest to determine which files have been changed (if any) since it last parsed the project. Then, it will only parse the changed files, or files related to those changes.
Partial parsing is off by default, and it can be enabled via profile config or CLI flags. In development, partial parsing can significantly reduce the time spent waiting at the start of a run, which translates to faster dev cycles and iteration.
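To make the idea of "only parse the changed files" concrete, here is a simplified sketch of change detection by content checksum. It is an illustration under stated assumptions, not dbt's actual implementation: the function names file_checksums and changed_files are hypothetical, and the real manifest tracks far more than a hash per file.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Set


def file_checksums(project_dir: Path) -> Dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        str(path.relative_to(project_dir)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(project_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(previous: Dict[str, str], current: Dict[str, str]) -> Set[str]:
    """Return paths that were added, removed, or modified since the last run."""
    all_paths = previous.keys() | current.keys()
    return {p for p in all_paths if previous.get(p) != current.get(p)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        (project / "a.sql").write_text("select 1")
        (project / "b.sql").write_text("select 2")
        before = file_checksums(project)   # state saved after the first parse

        (project / "b.sql").write_text("select 3")
        after = file_checksums(project)    # state at the next invocation

        # Only the modified file would need to be re-parsed.
        print(changed_files(before, after))
```

A comparison like this looks only at file contents, so anything that lives outside the files cannot mark a file as changed.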
Use caution when enabling partial parsing in dbt, as there are known limitations today:
- A change in environment variables does not trigger a re-parse. Files which depend on env_var may be incorrect on subsequent parses.
- Changes to macros called within a model's config() block will not result in re-parsing that model.
- A file that depends on "volatile" Jinja variables, such as invocation_id, will quickly get stale. A file is not re-parsed in subsequent invocations if the file's contents have not changed.
- If certain inputs change between runs, dbt will trigger a full re-parse. Today those inputs are:
  - installed packages
  - dbt version
If you ever get into a bad state, you can disable partial parsing and trigger a full re-parse with the
--no-partial-parse CLI flag, or by deleting
Experimental parser
At parse time, dbt needs to extract the contents of ref(), source(), and config() from all models in the project. Traditionally, dbt has extracted those values by rendering the Jinja in every model file, which can be slow. In v0.20.0, we're trying out a new way to statically analyze model files, leveraging tree-sitter, which we're calling an "experimental parser". You can see the code for an initial Jinja2 grammar here.
dbt --use-experimental-parser parse
dbt --use-experimental-parser run
dbt --use-experimental-parser test
For now, the experimental parser only works with models, and only with models whose Jinja is limited to those three special macros (ref, source, config). The experimental parser is at least 3x faster than a full Jinja render. Based on testing with data from dbt Cloud, we believe the current grammar can statically parse 60% of models in the wild. So for the average project, we'd hope to see a 40% speedup in the model parser. You can check this by running dbt parse and dbt --use-experimental-parser parse, and comparing the target/perf_info.json produced by each.
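The 40% figure follows from the two numbers above under two simplifying assumptions: every model costs roughly the same to parse, and statically parseable models are exactly 3x faster (the text says "at least 3x", so this is a lower bound under those assumptions). The function name below is hypothetical and the calculation is a back-of-the-envelope sketch, not a measurement.

```python
def expected_parse_time_reduction(parseable_fraction: float, speedup_factor: float) -> float:
    """Estimate the fraction of model-parsing time saved.

    parseable_fraction: share of models the static parser can handle (0..1).
    speedup_factor: how many times faster those models parse.
    Assumes all models cost the same to parse with a full Jinja render.
    """
    if not 0.0 <= parseable_fraction <= 1.0:
        raise ValueError("parseable_fraction must be between 0 and 1")
    if speedup_factor <= 0:
        raise ValueError("speedup_factor must be positive")
    # Models that still need a full render keep their original cost;
    # the rest cost 1/speedup_factor of what they did before.
    remaining_time = (1.0 - parseable_fraction) + parseable_fraction / speedup_factor
    return 1.0 - remaining_time


if __name__ == "__main__":
    saved = expected_parse_time_reduction(parseable_fraction=0.6, speedup_factor=3.0)
    print(f"Estimated reduction in model parse time: {saved:.0%}")
```

With 60% of models parsed 3x faster, the remaining time is 0.4 + 0.6 / 3 = 0.6 of the original, which is the 40% reduction quoted above.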
The experimental parser is off by default. We believe it can offer some speedup to 95% of projects.
Do not use the experimental parser if you've overridden the config macro with a custom implementation.