Apache Spark Profile

Overview of dbt-spark

  • Maintained by: dbt Labs
  • Authors: core dbt maintainers
  • GitHub repo: dbt-labs/dbt-spark
  • PyPI package: dbt-spark
  • Slack channel: db-databricks-and-spark
  • Supported dbt Core version: v0.15.0 and newer
  • dbt Cloud support: Supported
  • Minimum data platform version: n/a

Installing dbt-spark

pip is the easiest way to install the adapter:

pip install dbt-spark

Installing dbt-spark will also install dbt-core and any other dependencies.

Connecting to Databricks via the ODBC driver requires pyodbc. Depending on your system, you can install it separately or via pip. See the pyodbc wiki for OS-specific installation details.

Connecting to a Spark cluster via the generic thrift or http methods requires PyHive.

# odbc connections
$ pip install "dbt-spark[ODBC]"

# thrift or http connections
$ pip install "dbt-spark[PyHive]"

Configuring dbt-spark

For Spark-specific configuration, please refer to Spark Configuration.

For further info, refer to the GitHub repository: dbt-labs/dbt-spark

Connection Methods

dbt-spark can connect to Spark clusters by three different methods:

  • odbc is the preferred method when connecting to Databricks. It supports connecting to a SQL Endpoint or an all-purpose interactive cluster.
  • thrift connects directly to the lead node of a cluster, either locally hosted / on premise or in the cloud (e.g. Amazon EMR).
  • http is a more generic method for connecting to a managed service that provides an HTTP endpoint. Currently, this includes connections to a Databricks interactive cluster.

ODBC

Use the odbc connection method if you are connecting to a Databricks SQL endpoint or interactive cluster via ODBC driver. (Download the latest version of the official driver here.)

~/.dbt/profiles.yml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: odbc
      driver: [path/to/driver]
      schema: [database/schema name]
      host: [yourorg.sparkhost.com]
      organization: [org id] # Azure Databricks only
      token: [abc123]

      # one of:
      endpoint: [endpoint id]
      cluster: [cluster id]

      # optional
      port: [port] # default 443
      user: [user]
      server_side_parameters:
        # cluster configuration parameters, otherwise applied via `SET` statements
        # for example:
        # "spark.databricks.delta.schema.autoMerge.enabled": True
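
The `# one of:` comment in the profile above means an odbc target sets either endpoint or cluster, not both. As a rough illustration, the check below enforces that rule on a target that has already been loaded into a Python dict. `check_odbc_target` is a hypothetical helper written for this example, not part of dbt-spark, and the values are placeholders.

```python
def check_odbc_target(target: dict) -> None:
    """Raise ValueError unless exactly one of `endpoint` / `cluster` is set.

    Hypothetical helper for illustration; not part of dbt-spark.
    """
    if target.get("method") != "odbc":
        return  # the "one of" rule shown in the profile applies to odbc targets only
    has_endpoint = bool(target.get("endpoint"))
    has_cluster = bool(target.get("cluster"))
    if has_endpoint == has_cluster:
        # Either both are set or neither is set.
        raise ValueError("set exactly one of `endpoint` or `cluster` for an odbc target")


# Example target mirroring the profile layout (placeholder values).
dev = {
    "type": "spark",
    "method": "odbc",
    "host": "yourorg.sparkhost.com",
    "endpoint": "endpoint-id",
}
check_odbc_target(dev)  # passes silently because only `endpoint` is set
```

Running a check like this before `dbt run` can surface a misconfigured target earlier than a failed connection would.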

Thrift

Use the thrift connection method if you are connecting to a Thrift server sitting in front of a Spark cluster, e.g. a cluster running locally or on Amazon EMR.

~/.dbt/profiles.yml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: [database/schema name]
      host: [hostname]

      # optional
      port: [port] # default 10001
      user: [user]
      auth: [e.g. KERBEROS]
      kerberos_service_name: [e.g. hive]
      use_ssl: [true|false] # value of hive.server2.use.SSL, default false

HTTP

Use the http method if your Spark provider supports generic connections over HTTP (e.g. Databricks interactive cluster).

~/.dbt/profiles.yml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: http
      schema: [database/schema name]
      host: [yourorg.sparkhost.com]
      organization: [org id] # Azure Databricks only
      token: [abc123]
      cluster: [cluster id]

      # optional
      port: [port] # default: 443
      user: [user]
      connect_timeout: 60 # default 10
      connect_retries: 5 # default 0

Databricks interactive clusters can take several minutes to start up. You may include the optional profile configs connect_timeout and connect_retries, and dbt will periodically retry the connection.
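
To make the retry behavior concrete, here is a minimal Python sketch of a retry loop driven by these two settings. It is not dbt-spark's implementation: `connect_with_retries` and `open_connection` are hypothetical names, and treating `connect_timeout` as the number of seconds to wait between attempts is an assumption made for this example.

```python
import time


def connect_with_retries(open_connection, connect_retries=0, connect_timeout=10, sleep=time.sleep):
    """Call `open_connection` until it succeeds or the retries run out.

    Illustrative only: reading `connect_timeout` as the wait between
    attempts is an assumption, not confirmed dbt-spark behavior.
    """
    attempts = connect_retries + 1  # the first try plus the configured retries
    for attempt in range(1, attempts + 1):
        try:
            return open_connection()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(connect_timeout)  # give the cluster time to finish starting
```

With `connect_retries: 5` and `connect_timeout: 60` as in the profile above, a loop like this would make up to six attempts spread over roughly five minutes, which is in line with a cluster that takes several minutes to start.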

Caveats

Usage with EMR

To connect to Apache Spark running on an Amazon EMR cluster, run sudo /usr/lib/spark/sbin/start-thriftserver.sh on the master node of the cluster to start the Thrift server (see the docs for more information). Then connect to port 10001, which reaches the Spark backend Thrift server. Port 10000 reaches a Hive backend instead, which will not work correctly with dbt.
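
Before pointing dbt at the cluster, it can help to confirm that the Thrift port is reachable from your machine. The sketch below uses only Python's standard library. `port_is_open` is a hypothetical helper, and the host name in the usage note is a placeholder.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `port_is_open("emr-master.example.internal", 10001)` would check the Spark Thrift port on a placeholder host. An open port only shows that something is listening there. It does not confirm that the listener is the Spark Thrift server rather than Hive.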

Supported Functionality

Most dbt Core functionality is supported, but some features are only available on Delta Lake (Databricks).

Delta-only features:

  1. Incremental model updates by unique_key instead of partition_by (see merge strategy)
  2. Snapshots
  3. Persisting column-level descriptions as database comments