IBM watsonx.data Spark setup

  • Maintained by: IBM
  • Authors: Bayan Albunayan, Reema Alzaid, Manjot Sidhu
  • GitHub repo: IBM/dbt-watsonx-spark
  • PyPI package: dbt-watsonx-spark
  • Slack channel:
  • Supported dbt Core version: v0.0.8 and newer
  • dbt Cloud support: Not Supported
  • Minimum data platform version: n/a

The dbt-watsonx-spark adapter allows you to use dbt to transform and manage data on IBM watsonx.data Spark, leveraging its distributed SQL query engine capabilities.

Before proceeding, ensure you have the following:

Read the official documentation for using watsonx.data with dbt-watsonx-spark

Installing dbt-watsonx-spark

Note: From dbt v1.8, installing an adapter no longer installs dbt-core automatically. Adapter and dbt Core versions are now decoupled, so installing an adapter does not overwrite an existing dbt-core installation. Use the following command to install both:

python -m pip install dbt-core dbt-watsonx-spark
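
After installation, it can be useful to confirm that both packages are present in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical and is not part of dbt or the adapter.

```python
from importlib import metadata


def installed_versions(packages=("dbt-core", "dbt-watsonx-spark")):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

A value of `None` for either package suggests the pip command above was run in a different environment than the one being checked.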

Configuring dbt-watsonx-spark

For IBM watsonx.data-specific configuration, refer to IBM watsonx.data configs.

Connecting to IBM watsonx.data Spark

To connect dbt with watsonx.data Spark, configure a profile in your profiles.yml file located in the .dbt/ directory of your home folder. The following is an example configuration for connecting to IBM watsonx.data SaaS and Software instances:

~/.dbt/profiles.yml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: [schema name]
      host: [hostname]
      uri: [uri]
      catalog: [catalog name]
      use_ssl: false
      auth:
        instance: [Watsonx.data Instance ID]
        user: [username]
        apikey: [apikey]

Host parameters

For IBM watsonx.data SaaS or Software instances, you can get the profile details from watsonx.data itself: when the query server is in RUNNING status, click View connect details. The Connection details page opens with the profile configuration. Copy the connection details and paste them into the profiles.yml file located in the .dbt directory of your home directory.

The following profile fields are required to configure watsonx.data Spark connections:

| Option | Required/Optional | Description | Example |
|---|---|---|---|
| method | Required | Specifies the connection method to the Spark query server. Use http. | http |
| schema | Required | An existing schema within the Spark engine, or a new schema to create. | spark_schema |
| host | Required | Hostname of the watsonx.data console. For more information, see Getting connection information. | https://dataplatform.cloud.ibm.com |
| uri | Required | URI of your query server that is running on watsonx.data. For more information, see Getting connection information. | /lakehouse/api/v2/spark_engines/<sparkID>/query_servers/<queryID>/connect/cliservice |
| catalog | Required | The catalog that is associated with the Spark engine. | my_catalog |
| use_ssl | Optional (default: false) | Specifies whether to use SSL. | true or false |
| instance | Required | For SaaS, set it to the CRN of watsonx.data. For Software, set it to the instance ID of watsonx.data. | 1726574045872688 |
| user | Required | Username for the watsonx.data instance. For SaaS, use your email address as the username. | username or user@example.com |
| apikey | Required | Your API key. For more information, see the IBM documentation for SaaS or for Software. | API key |
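
The required fields above can be checked before running dbt. The following is a minimal sketch, assuming the target has already been loaded into a plain Python dictionary and that `instance`, `user`, and `apikey` are nested under `auth` as in the example profile. The function name `missing_profile_fields` is hypothetical, and the adapter performs its own validation independently of this.

```python
# Required keys at the target level and under "auth", per the example profile.
REQUIRED_TOP_LEVEL = ("type", "method", "schema", "host", "uri", "catalog")
REQUIRED_AUTH = ("instance", "user", "apikey")


def missing_profile_fields(target):
    """Return the required fields that are absent or empty, in declaration order."""
    missing = [key for key in REQUIRED_TOP_LEVEL if not target.get(key)]
    auth = target.get("auth") or {}
    missing += [f"auth.{key}" for key in REQUIRED_AUTH if not auth.get(key)]
    return missing


# Example values are taken from the table above; apikey is left out on purpose.
example_target = {
    "type": "watsonx_spark",
    "method": "http",
    "schema": "spark_schema",
    "host": "https://dataplatform.cloud.ibm.com",
    "uri": "/lakehouse/api/v2/spark_engines/<sparkID>/query_servers/<queryID>/connect/cliservice",
    "catalog": "my_catalog",
    "auth": {"instance": "1726574045872688", "user": "user@example.com"},
}

print(missing_profile_fields(example_target))  # ['auth.apikey']
```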

Schemas and catalogs

When selecting the catalog, ensure the user has read and write access to it. This selection does not limit your ability to query the schema you specify or create; it also serves as the default location for materialized tables, views, and incremental models.

SSL verification

  • If the Spark instance uses an unsecured HTTP connection, set use_ssl to false.
  • If the instance uses HTTPS, set use_ssl to true.
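
As an illustration of the rule above, the following sketch suggests a `use_ssl` value from the scheme of the `host` URL. It assumes that the scheme of `host` reflects how the instance is actually served, which is not confirmed by the adapter documentation; the helper name `suggested_use_ssl` is hypothetical.

```python
from urllib.parse import urlparse


def suggested_use_ssl(host):
    """Return True when the host URL uses the https scheme, False otherwise."""
    # Assumption: an https:// host means the instance expects SSL.
    return urlparse(host).scheme.lower() == "https"


print(suggested_use_ssl("https://dataplatform.cloud.ibm.com"))  # True
print(suggested_use_ssl("http://spark.example.internal"))       # False
```

A host given without any scheme yields False here, so the value should still be checked against the actual instance configuration.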

Additional parameters

The following profile fields are optional. You can configure the instance session and dbt for the connection.

| Profile field | Description | Example |
|---|---|---|
| threads | How many threads dbt should use (default is 1). | 8 |
| retry_all | Enables automatic retries for transient connection failures. | true |
| connect_timeout | Timeout for establishing a connection (in seconds). | 5 |
| connect_retries | Number of retry attempts for connection failures. | 3 |
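
The sketch below shows one way to apply these optional fields to a target that is held as a Python dictionary, with basic type checks. It assumes the optional fields sit at the same level as `method` and `schema` in the target, which this page does not state explicitly; the function name `with_optional_settings` is hypothetical.

```python
# Optional fields from the table above and the Python type each is expected to have.
OPTIONAL_SETTINGS = {
    "threads": int,
    "retry_all": bool,
    "connect_timeout": int,
    "connect_retries": int,
}


def with_optional_settings(target, **options):
    """Return a copy of target with validated optional settings applied."""
    merged = dict(target)
    for name, value in options.items():
        expected = OPTIONAL_SETTINGS.get(name)
        if expected is None:
            raise KeyError(f"unknown optional setting: {name}")
        # bool is a subclass of int, so compare exact types to keep them apart.
        if type(value) is not expected:
            raise TypeError(f"{name} must be of type {expected.__name__}")
        merged[name] = value
    return merged


base = {"type": "watsonx_spark", "method": "http"}
tuned = with_optional_settings(
    base, threads=8, retry_all=True, connect_timeout=5, connect_retries=3
)
print(tuned["threads"])  # 8
```

The original dictionary is left unchanged, so the same base target can be reused for several variants.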

Limitations and considerations

  • Supports only HTTP: No support for ODBC, Thrift, or session-based connections.
  • Limited dbt Cloud Support: Not fully compatible with dbt Cloud.
  • Metadata Persistence: Some dbt features, such as column descriptions, may not persist in all table formats.