IBM watsonx.data Spark configurations
Instance requirements
To use IBM watsonx.data Spark with the dbt-watsonx-spark adapter, ensure the instance has an attached catalog that supports creating, renaming, altering, and dropping objects such as tables and views. The user connecting to the instance via the dbt-watsonx-spark adapter must have the necessary permissions for the target catalog.
For detailed setup instructions, including setting up watsonx.data, adding the Spark engine, configuring storages, registering data sources, and managing permissions, refer to the official IBM documentation:
- watsonx.data Software Documentation: IBM watsonx.data Software Guide
- watsonx.data SaaS Documentation: IBM watsonx.data SaaS Guide
Session properties
With an IBM watsonx.data SaaS or Software instance, you can set session properties to modify the current configuration for your user session.
To temporarily adjust session properties for a specific dbt model or a group of models, use a dbt hook. For example:
{{
  config(
    pre_hook="set session query_max_run_time='10m'"
  )
}}
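Spark SQL also accepts session-scoped settings through plain SET statements. A minimal sketch, assuming the standard spark.sql.shuffle.partitions property is available on your engine:

{{
  config(
    pre_hook="SET spark.sql.shuffle.partitions=200"
  )
}}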
Connector properties
IBM watsonx.data SaaS/Software supports various Spark-specific connector properties to control data representation, execution performance, and storage format.
For more details on the connector properties supported for each data source, refer to the IBM watsonx.data documentation.
Additional configuration
The dbt-watsonx-spark adapter allows additional configurations to be set in the connection profile:
- catalog: Specifies the catalog to use for the Spark connection. The plugin automatically detects the file format (Iceberg, Hive, Delta, or Hudi) based on the catalog type.
- use_ssl: Enables SSL encryption for secure connections.
Example configuration:
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: [schema name]
      host: [hostname]
      uri: [uri]
      catalog: [catalog name]
      use_ssl: false
      auth:
        instance: [Watsonx.data Instance ID]
        user: [username]
        apikey: [apikey]
File format configuration
The supported file formats depend on the catalog type:
- Iceberg Catalog: Supports Iceberg tables.
- Hive Catalog: Supports Hive tables.
- Delta Lake Catalog: Supports Delta tables.
- Hudi Catalog: Supports Hudi tables.
The plugin automatically detects the file format type based on the catalog specified in the configuration.
You can also set the file format explicitly in your dbt models. For example:
{{
  config(
    materialized='table',
    file_format='iceberg'
  )
}}
Here, file_format can be set to 'iceberg', 'hive', 'delta', or 'hudi', matching the catalog type.
For more details, refer to the dbt documentation on file formats.
Seeds and prepared statements
You can configure column data types either in the dbt_project.yml file or in property files, as supported by dbt. For more details on seed configuration and best practices, refer to the dbt seed configuration documentation.
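As an illustration, column types for a hypothetical seed named country_codes could be pinned in dbt_project.yml (the seed and column names are placeholders):

seeds:
  project_name:
    country_codes:
      # pin types so the seed loads with an explicit schema
      +column_types:
        country_code: varchar(2)
        country_name: varchar(64)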
Materializations
The dbt-watsonx-spark adapter supports table, view, and incremental materializations, allowing you to manage how your data is stored and queried in watsonx.data Spark.
For further information on configuring materializations, refer to the dbt materializations documentation.
Table
The dbt-watsonx-spark adapter enables you to create and update tables through table materialization, making it easier to work with data in watsonx.data Spark.
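A minimal table model sketch; the referenced model stg_orders is a placeholder:

{{
  config(
    materialized='table',
    file_format='iceberg'
  )
}}

-- fully rebuilt on every dbt run
select * from {{ ref('stg_orders') }}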
View
The adapter automatically creates views by default if no materialization is explicitly specified.
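To request a view explicitly, set the materialization in the model config:

{{
  config(
    materialized='view'
  )
}}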
Incremental
Incremental materialization is supported but requires additional configuration for partitioning and performance tuning.
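A sketch of an incremental model, assuming the adapter accepts the usual dbt-spark options (incremental_strategy, unique_key, partition_by); model and column names are illustrative:

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    partition_by=['order_date'],
    file_format='iceberg'
  )
}}

select * from {{ ref('stg_orders') }}
{% if is_incremental() %}
  -- only process rows newer than what the target table already holds
  where order_date > (select max(order_date) from {{ this }})
{% endif %}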
Recommendations
- Check Permissions: Ensure that the necessary permissions for table creation are enabled in the catalog or schema.
- Check Connector Documentation: Review the watsonx.data Spark data-ingestion documentation to confirm that table creation and modification are supported for your data source.
Unsupported features
Despite its extensive capabilities, the dbt-watsonx-spark adapter has some limitations:
- Incremental Materialization: Supported but requires additional configuration for partitioning and performance tuning.
- Materialized Views: Not natively supported in Spark SQL within watsonx.data.
- Snapshots: Not supported due to Spark’s lack of built-in snapshot functionality.
- Performance Considerations:
- Large datasets may require tuning of Spark configurations such as shuffle partitions and memory allocation (see the sketch after this list).
- Some transformations may be expensive due to Spark’s in-memory processing model.
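Such settings can be scoped to a single heavy model through a pre-hook; the values below are illustrative starting points, not recommendations:

{# tune shuffle parallelism and enable adaptive execution for this model only #}
{{
  config(
    pre_hook=[
      "SET spark.sql.shuffle.partitions=400",
      "SET spark.sql.adaptive.enabled=true"
    ]
  )
}}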
By understanding these capabilities and constraints, users can maximize the effectiveness of dbt with watsonx.data Spark for scalable data transformations and analytics.