Apache Spark Profile
Overview of dbt-spark
- Maintained by: dbt Labs
- Authors: core dbt maintainers
- GitHub repo: dbt-labs/dbt-spark
- PyPI package: dbt-spark
- Slack channel: db-databricks-and-spark
- Supported dbt Core version: v0.15.0 and newer
- dbt Cloud support: Supported
- Minimum data platform version: n/a
Installing dbt-spark
pip is the easiest way to install the adapter:
pip install dbt-spark
Installing dbt-spark will also install dbt-core and any other dependencies.
If connecting to Databricks via the ODBC driver, the adapter requires pyodbc. Depending on your system, you can install it separately or via pip. See the pyodbc wiki for OS-specific installation details.
If connecting to a Spark cluster via the generic thrift or http methods, the adapter requires PyHive.
# odbc connections
$ pip install "dbt-spark[ODBC]"
# thrift or http connections
$ pip install "dbt-spark[PyHive]"
Configuring dbt-spark
For Spark-specific configuration, please refer to Spark Configuration.
For further info, refer to the GitHub repository: dbt-labs/dbt-spark
Connection Methods
dbt-spark can connect to Spark clusters using one of three methods:

- odbc is the preferred method when connecting to Databricks. It supports connecting to a SQL Endpoint or an all-purpose interactive cluster.
- thrift connects directly to the lead node of a cluster, either locally hosted / on-premises or in the cloud (e.g. Amazon EMR).
- http is a more generic method for connecting to a managed service that provides an HTTP endpoint. Currently, this includes connections to a Databricks interactive cluster.
ODBC
Use the odbc connection method if you are connecting to a Databricks SQL endpoint or interactive cluster via the ODBC driver. (Download the latest version of the official driver here.)
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: odbc
      driver: [path/to/driver]
      schema: [database/schema name]
      host: [yourorg.sparkhost.com]
      organization: [org id]    # Azure Databricks only
      token: [abc123]

      # one of:
      endpoint: [endpoint id]
      cluster: [cluster id]

      # optional
      port: [port]    # default 443
      user: [user]
      server_side_parameters:
        # cluster configuration parameters, otherwise applied via `SET` statements
        # for example:
        # "spark.databricks.delta.schema.autoMerge.enabled": True
Thrift
Use the thrift connection method if you are connecting to a Thrift server sitting in front of a Spark cluster, e.g. a cluster running locally or on Amazon EMR.
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: [database/schema name]
      host: [hostname]

      # optional
      port: [port]    # default 10001
      user: [user]
      auth: [e.g. KERBEROS]
      kerberos_service_name: [e.g. hive]
      use_ssl: [true|false]    # value of hive.server2.use.SSL, default false
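For instance, a minimal profile for a Thrift server running on your own machine might look like the following sketch. The profile name and schema are assumptions for illustration; note that a locally started Spark Thrift server typically listens on port 10000, while the adapter's profile default is 10001.

# Hypothetical profile for a locally running Spark Thrift server.
spark_thrift_example:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: analytics    # placeholder schema name
      host: localhost
      port: 10000          # a local Spark Thrift server commonly listens here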
HTTP
Use the http method if your Spark provider supports generic connections over HTTP (e.g. a Databricks interactive cluster).
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: http
      schema: [database/schema name]
      host: [yourorg.sparkhost.com]
      organization: [org id]    # Azure Databricks only
      token: [abc123]
      cluster: [cluster id]

      # optional
      port: [port]    # default: 443
      user: [user]
      connect_timeout: 60    # default 10
      connect_retries: 5     # default 0
Databricks interactive clusters can take several minutes to start up. You may include the optional profile configs connect_timeout and connect_retries, and dbt will periodically retry the connection.
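Putting those options together, a hypothetical http profile for a Databricks interactive cluster might look like this (host, token, and cluster ID are placeholders). With these values, dbt would retry the connection up to 5 times, waiting 60 seconds between attempts, while the cluster spins up.

# Hypothetical HTTP profile with retry settings for a slow-starting cluster.
spark_http_example:
  target: dev
  outputs:
    dev:
      type: spark
      method: http
      schema: analytics                      # placeholder schema name
      host: mycompany.cloud.databricks.com   # placeholder workspace host
      token: dapiXXXXXXXXXXXXXXXX            # placeholder access token
      cluster: 0123-456789-abcde123          # placeholder cluster ID
      connect_timeout: 60
      connect_retries: 5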
Caveats
Usage with EMR
To connect to Apache Spark running on an Amazon EMR cluster, you will need to run sudo /usr/lib/spark/sbin/start-thriftserver.sh
on the master node of the cluster to start the Thrift server (see the docs for more information). You will also need to connect to port 10001, which will connect to the Spark backend Thrift server; port 10000 will instead connect to a Hive backend, which will not work correctly with dbt.
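For example, once the Thrift server is running on the master node, a thrift profile pointing at the EMR cluster might look like the sketch below; the hostname is a placeholder for your cluster's master public DNS.

# Hypothetical profile for Spark on EMR via the Thrift server.
spark_emr_example:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: analytics    # placeholder schema name
      host: ec2-12-345-678-901.compute-1.amazonaws.com   # placeholder master node DNS
      port: 10001    # Spark backend; port 10000 is the Hive backend and won't work with dbt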
Supported Functionality
Most dbt Core functionality is supported, but some features are only available on Delta Lake (Databricks).
Delta-only features:

- Incremental model updates by unique_key instead of partition_by (see the merge strategy; a config sketch follows this list)
- Snapshots
- Persisting column-level descriptions as database comments
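As one illustration of the unique_key point above, an incremental model must use the Delta file format together with the merge strategy. Below is a sketch of the relevant configuration in dbt_project.yml for recent dbt versions; the project name, model name, and key column are hypothetical.

# Hypothetical dbt_project.yml excerpt enabling merge-based incremental updates on Delta.
models:
  my_project:                     # placeholder project name
    my_incremental_model:         # placeholder model name
      +materialized: incremental
      +file_format: delta         # merge requires Delta Lake
      +incremental_strategy: merge
      +unique_key: id             # placeholder unique key column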