IBM watsonx.data Spark setup
- Maintained by: IBM
- Authors: Bayan Albunayan, Reema Alzaid, Manjot Sidhu
- GitHub repo: IBM/dbt-watsonx-spark
- PyPI package: dbt-watsonx-spark
- Slack channel:
- Supported dbt Core version: v0.0.8 and newer
- dbt Cloud support: Not Supported
- Minimum data platform version: n/a
Installing dbt-watsonx-spark
Use `pip` to install the adapter. Before 1.8, installing the adapter would automatically install `dbt-core` and any additional dependencies. Beginning in 1.8, installing an adapter does not automatically install `dbt-core`, because adapters and dbt Core versions have been decoupled so that existing `dbt-core` installations are no longer overwritten.

Use the following command for installation:
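```shell
python -m pip install dbt-core dbt-watsonx-spark
```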
Configuring dbt-watsonx-spark
For IBM watsonx.data-specific configuration, please refer to IBM watsonx.data configs.
The `dbt-watsonx-spark` adapter allows you to use dbt to transform and manage data on IBM watsonx.data Spark, leveraging its distributed SQL query engine capabilities.
Before proceeding, ensure you have the following:
- An active IBM watsonx.data instance (IBM Cloud SaaS or Software)
- A Native Spark engine provisioned in watsonx.data (SaaS or Software)
- An active Spark query server in your Native Spark engine

Read the official documentation for using watsonx.data with `dbt-watsonx-spark`.
Connecting to IBM watsonx.data Spark
To connect dbt with watsonx.data Spark, configure a profile in your `profiles.yml` file, located in the `.dbt/` directory of your home folder. The following is an example configuration for connecting to IBM watsonx.data SaaS and Software instances:
```yaml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: [schema name]
      host: [hostname]
      uri: [uri]
      catalog: [catalog name]
      use_ssl: false
      auth:
        instance: [Watsonx.data Instance ID]
        user: [username]
        apikey: [apikey]
```
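After saving `profiles.yml`, you can verify that dbt can load the profile and reach the query server with dbt's built-in connection test:

```shell
# Run from the dbt project directory; validates the profile and connection
dbt debug
```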
Host parameters
To get the profile details for an IBM watsonx.data SaaS or Software instance, click 'View connect details' when the query server is in RUNNING status in watsonx.data. The connection details page opens with the profile configuration; copy and paste the connection details into the `profiles.yml` file located in the `.dbt/` directory of your home folder.

The following profile fields are required to configure watsonx.data Spark connections:
| Option | Required/Optional | Description | Example |
|---|---|---|---|
| `method` | Required | Specifies the connection method to the Spark query server. Use `http`. | `http` |
| `schema` | Required | An existing schema within the Spark engine, or a new schema to create. | `spark_schema` |
| `host` | Required | Hostname of the watsonx.data console. For more information, see Getting connection information. | `https://dataplatform.cloud.ibm.com` |
| `uri` | Required | URI of the query server running on watsonx.data. For more information, see Getting connection information. | `/lakehouse/api/v2/spark_engines/<sparkID>/query_servers/<queryID>/connect/cliservice` |
| `catalog` | Required | The catalog associated with the Spark engine. | `my_catalog` |
| `use_ssl` | Optional (default: `false`) | Specifies whether to use SSL. | `true` or `false` |
| `instance` | Required | For SaaS, the CRN of the watsonx.data instance; for Software, the watsonx.data instance ID. | `1726574045872688` |
| `user` | Required | Username for the watsonx.data instance. For SaaS, use your email address as the username. | `username` or `user@example.com` |
| `apikey` | Required | Your API key (SaaS or Software). | API key |
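For illustration, here is what a completed target might look like, built from the example values in the table above; the instance CRN and API key are placeholders, so substitute the values from your own 'View connect details' page:

```yaml
dev:
  type: watsonx_spark
  method: http
  schema: spark_schema
  host: https://dataplatform.cloud.ibm.com
  uri: /lakehouse/api/v2/spark_engines/<sparkID>/query_servers/<queryID>/connect/cliservice
  catalog: my_catalog
  use_ssl: true
  auth:
    instance: crn:v1:bluemix:public:lakehouse:us-south:a/abc123::   # placeholder CRN (SaaS)
    user: user@example.com
    apikey: MY_API_KEY   # placeholder; keep real keys out of version control
```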
Schemas and catalogs
When selecting the catalog, ensure the user has read and write access. This selection does not limit your ability to query the specified or created schema; it also serves as the default location for materialized `table`, `view`, and `incremental` models.
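Because the profile's catalog and schema act as the default write location, you can steer individual models or folders to a given materialization with standard dbt configuration in `dbt_project.yml` (the project and folder names below are hypothetical):

```yaml
# dbt_project.yml (excerpt) -- standard dbt materialization config;
# resulting tables, views, and incremental models land in the
# catalog/schema from profiles.yml unless overridden
models:
  project_name:
    +materialized: view        # project-wide default
    marts:                     # hypothetical folder of models
      +materialized: table
    events:                    # hypothetical folder of models
      +materialized: incremental
```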
SSL verification
- If the Spark instance uses an unsecured HTTP connection, set `use_ssl` to `false`.
- If the instance uses HTTPS, set it to `true`.
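For example, the HTTPS console endpoint from the profile above pairs with SSL enabled:

```yaml
# HTTPS endpoint, so SSL verification is enabled
host: https://dataplatform.cloud.ibm.com
use_ssl: true
```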
Additional parameters
The following profile fields are optional. You can use them to configure the connection session and dbt's behavior.
| Profile field | Description | Example |
|---|---|---|
| `threads` | How many threads dbt should use (default is `1`). | `8` |
| `retry_all` | Enables automatic retries for transient connection failures. | `true` |
| `connect_timeout` | Timeout for establishing a connection (in seconds). | `5` |
| `connect_retries` | Number of retry attempts for connection failures. | `3` |
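These optional fields sit alongside the required ones in the target block. A sketch of the `dev` target from earlier with illustrative tuning values added:

```yaml
dev:
  type: watsonx_spark
  method: http
  # ...required fields as shown above...
  threads: 8            # run up to 8 models in parallel
  retry_all: true       # retry transient connection failures
  connect_timeout: 5    # seconds to wait when opening a connection
  connect_retries: 3    # attempts before giving up
```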
Limitations and considerations
- Supports only HTTP: No support for ODBC, Thrift, or session-based connections.
- Limited dbt Cloud support: Not fully compatible with dbt Cloud.
- Metadata persistence: Some dbt features, such as column descriptions, may not persist in all table formats.