Some core functionality may be limited. If you're interested in contributing, check out the source code for each repository listed below.
Overview of dbt-spark
- Maintained by: core dbt maintainers
- Author: Fishtown Analytics
- Core version: v0.13.0 and newer
- dbt Cloud: Preview
dbt-spark can connect to Spark clusters by three different methods:
- `odbc` is the preferred method when connecting to Databricks. It supports connecting to a SQL Endpoint or an all-purpose interactive cluster.
- `http` is a more generic method for connecting to a managed service that provides an HTTP endpoint. Currently, this includes connections to a Databricks interactive cluster.
- `thrift` connects directly to the lead node of a cluster, either locally hosted / on premise or in the cloud (e.g. Amazon EMR).
Use the `odbc` connection method if you are connecting to a Databricks SQL endpoint or interactive cluster via ODBC driver. (Download the latest version of the official driver here.)
```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: odbc
      driver: [path/to/driver]
      schema: [database/schema name]
      host: [yourorg.sparkhost.com]
      organization: [org id]  # Azure Databricks only
      token: [abc123]

      # one of:
      endpoint: [endpoint id]
      cluster: [cluster id]

      # optional
      port: [port]  # default 443
      user: [user]
```
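For illustration, a filled-in `odbc` profile targeting a Databricks SQL endpoint might look like the following; the driver path, host, token, and endpoint id are hypothetical placeholders, not real values:

```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: odbc
      driver: /opt/simba/spark/lib/64/libsparkodbc_sb64.so  # hypothetical path; depends on where the ODBC driver is installed
      schema: analytics                                      # hypothetical schema name
      host: dbc-12345678-abcd.cloud.databricks.com           # hypothetical workspace host
      token: dapiXXXXXXXXXXXXXXXX                             # hypothetical personal access token
      endpoint: 1234567890abcdef                              # hypothetical SQL endpoint id
      port: 443
```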
Use the `thrift` connection method if you are connecting to a Thrift server sitting in front of a Spark cluster, e.g. a cluster running locally or on Amazon EMR.
```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: [database/schema name]
      host: [hostname]

      # optional
      port: [port]  # default 10001
      user: [user]
      auth: [e.g. KERBEROS]
      kerberos_service_name: [e.g. hive]
```
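As a sketch, a `thrift` profile pointing at an on-premise Thrift server that requires Kerberos authentication might look like this (the hostname, schema, and user below are hypothetical):

```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: analytics              # hypothetical schema name
      host: spark-master.internal    # hypothetical on-premise Thrift server host
      port: 10001
      user: dbt_user                 # hypothetical user
      auth: KERBEROS                 # only if your Thrift server requires Kerberos
      kerberos_service_name: hive
```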
Use the `http` method if your Spark provider supports generic connections over HTTP (e.g. Databricks interactive cluster).
```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: http
      schema: [database/schema name]
      host: [yourorg.sparkhost.com]
      organization: [org id]  # Azure Databricks only
      token: [abc123]
      cluster: [cluster id]

      # optional
      port: [port]  # default: 443
      user: [user]
      connect_timeout: 60  # default 10
      connect_retries: 5   # default 0
```
Databricks interactive clusters can take several minutes to start up. You may include the optional profile configs `connect_timeout` and `connect_retries`, and dbt will periodically retry the connection.
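For instance, a hypothetical `http` profile for an Azure Databricks interactive cluster that waits up to 60 seconds per connection attempt and retries up to five times might look like this (the host, organization, token, and cluster id are all placeholder values):

```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: http
      schema: analytics
      host: adb-1234567890123456.7.azuredatabricks.net  # hypothetical workspace host
      organization: "1234567890123456"                  # hypothetical Azure Databricks org id
      token: dapiXXXXXXXXXXXXXXXX                        # hypothetical personal access token
      cluster: 0123-456789-abcde123                      # hypothetical cluster id
      port: 443
      connect_timeout: 60  # wait up to 60 seconds per connection attempt
      connect_retries: 5   # retry up to 5 times while the cluster starts
```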
Installation and Distribution
dbt's adapter for Apache Spark and Databricks is managed in its own repository, dbt-spark. To use it, you must install the dbt-spark plugin.

The commands below install the latest version of dbt-spark as well as the requisite version of dbt-core.
If connecting to Databricks via the ODBC driver, dbt-spark requires pyodbc. Depending on your system, you can install it separately or via pip. See the pyodbc wiki for OS-specific installation details.
If connecting to a Spark cluster via the generic thrift or http methods, it requires PyHive.
```shell
# odbc connections
$ pip install "dbt-spark[ODBC]"

# thrift or http connections
$ pip install "dbt-spark[PyHive]"
```
Usage with EMR
To connect to Apache Spark running on an Amazon EMR cluster, you will need to run
sudo /usr/lib/spark/sbin/start-thriftserver.sh on the master node of the cluster to start the Thrift server (see the docs for more information). You will also need to connect to port 10001, which will connect to the Spark backend Thrift server; port 10000 will instead connect to a Hive backend, which will not work correctly with dbt.
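Once the Thrift server is running, a profile for the EMR cluster can use the thrift method and point at the master node on port 10001. The hostname and schema below are hypothetical placeholders:

```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: analytics                                  # hypothetical schema name
      host: ec2-12-345-678-901.compute-1.amazonaws.com   # hypothetical EMR master node DNS
      port: 10001  # Spark Thrift server; port 10000 would hit the Hive backend instead
```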
Most dbt Core functionality is supported, but some features are only available on Delta Lake (Databricks).
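For example, the merge strategy for incremental models depends on Delta Lake. A minimal sketch of enabling it in dbt_project.yml, assuming a hypothetical project named my_project:

```yaml
# dbt_project.yml (illustrative snippet; my_project is a hypothetical project name)
models:
  my_project:
    +file_format: delta            # store models as Delta Lake tables
    +incremental_strategy: merge   # merge-based incrementals require the delta file format
```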
Some dbt features, available on the core adapters, are not yet supported on Spark:
- Persisting column-level descriptions as database comments