Connections
Connections are defined in the connections section under the root attribute application.
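For example, a minimal application file might look like the sketch below, where the connection name my_connection is purely illustrative. The optional connectionRef attribute, used in the Snowflake and Postgres examples further down, selects which of the declared connections is used by default:
application:
  connectionRef: "my_connection" # illustrative name: must match one of the connections declared below
  connections:
    my_connection:
      type: local # any of the connection types described below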
The following types of connections are supported:
Local File System
The local file system connection is used to read and write files to the local file system.
application:
  connections:
    local:
      type: local
Files will be stored in the area directories under the datasets directory. Default values for area and datasets can be set in the application section.
application:
  datasets: "{{root}}/datasets" # or set it through the SL_DATASETS environment variable.
  area:
    pending: "pending" # Location where files of pending loads are stored. May be overloaded by the ${SL_AREA_PENDING} environment variable.
    unresolved: "unresolved" # Location where files that do not match any pattern are moved. May be overloaded by the ${SL_AREA_UNRESOLVED} environment variable.
    archive: "archive" # Location where files are moved after they have been processed. May be overloaded by the ${SL_AREA_ARCHIVE} environment variable.
    ingesting: "ingesting" # Location where files are moved while they are being processed. May be overloaded by the ${SL_AREA_INGESTING} environment variable.
    accepted: "accepted" # Location where files are moved after they have been processed and accepted. May be overloaded by the ${SL_AREA_ACCEPTED} environment variable.
    rejected: "rejected" # Location where files are moved after they have been processed and rejected. May be overloaded by the ${SL_AREA_REJECTED} environment variable.
    business: "business" # Location where transform tasks store their results. May be overloaded by the ${SL_AREA_BUSINESS} environment variable.
    replay: "replay" # Location where rejected records are stored in their original format. May be overloaded by the ${SL_AREA_REPLAY} environment variable.
    hiveDatabase: "${domain}_${area}" # Hive database name. May be overloaded by the ${SL_AREA_HIVE_DATABASE} environment variable.
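For instance, the defaults above could be overridden as in the sketch below, where the absolute path and the folder name are purely illustrative:
application:
  datasets: "/data/starlake/datasets" # illustrative absolute path instead of {{root}}/datasets
  area:
    archive: "processed" # processed files are then archived under /data/starlake/datasets/processed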
Google BigQuery
Starlake supports both native and Spark/Dataproc BigQuery connections.
BigQuery (native)
application:
  connections:
    bigquery:
      type: "bigquery"
      options:
        location: "us-central1" # EU or US or ...
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform" # comma-separated list of scopes
        #authType: SERVICE_ACCOUNT_JSON_KEYFILE
        #jsonKeyfile: "/Users/me/.gcloud/keys/starlake-me.json"
        #authType: "ACCESS_TOKEN"
        #gcpAccessToken: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  accessPolicies: # Required when applying access policies to table columns (Column Level Security)
    apply: true
    location: EU
    taxonomy: RGPD
Spark BigQuery Direct
application:
  connections:
    bigquery:
      type: "bigquery"
      sparkFormat: "bigquery"
      options:
        writeMethod: "direct" # direct or indirect (indirect is required for certain features; see https://github.com/GoogleCloudDataproc/spark-bigquery-connector)
        location: "us-central1" # EU or US or ...
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform" # comma-separated list of scopes
        # authType: SERVICE_ACCOUNT_JSON_KEYFILE
        # jsonKeyfile: "/Users/me/.gcloud/keys/starlake-me.json"
        # authType: "ACCESS_TOKEN"
        # gcpAccessToken: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  spark:
    datasource:
      bigquery: # Setting properties here will apply them to all bigquery data sources (connection.type == bigquery)
        allowFieldAddition: "true" # Allow schema updates. To disable, set it to false
        allowFieldRelaxation: "true" # Allow schema updates. To disable, set it to false
Spark BigQuery Indirect
application:
  connections:
    bigquery:
      type: "bigquery"
      sparkFormat: "bigquery"
      options:
        writeMethod: "indirect" # direct or indirect (indirect is required for certain features; see https://github.com/GoogleCloudDataproc/spark-bigquery-connector)
        gcsBucket: "starlake-app" # Temporary GCS bucket where intermediary files will be stored. Required in indirect mode only
        location: "us-central1" # EU or US or ...
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform" # comma-separated list of scopes
        materializationDataset: "my_dataset" # when sparkFormat is defined, required by the spark-bigquery-connector (https://github.com/GoogleCloudDataproc/spark-bigquery-connector?tab=readme-ov-file#properties)
        #authType: SERVICE_ACCOUNT_JSON_KEYFILE
        #jsonKeyfile: "/Users/me/.gcloud/keys/starlake-me.json"
        #authType: "ACCESS_TOKEN"
        #gcpAccessToken: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  spark:
    datasource:
      bigquery:
        allowFieldAddition: "true" # Allow schema updates. To disable, set it to false
        allowFieldRelaxation: "true" # Allow schema updates. To disable, set it to false
Apache Spark / Databricks
Spark connections are used to read and write data using Spark.
Spark Parquet
application:
  connections:
    spark:
      type: "spark"
      options:
        # any spark configuration can be set here
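Since any Spark configuration can be set in the options block, a Parquet-oriented connection could be tuned as in the sketch below, assuming standard Spark property names are passed through as-is (the chosen properties are only illustrative):
application:
  connections:
    spark:
      type: "spark"
      options:
        spark.sql.parquet.compression.codec: "snappy" # standard Spark property controlling Parquet compression
        spark.sql.shuffle.partitions: "200" # standard Spark property controlling shuffle parallelism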
In addition to the connection defined below, download the following jars and place them in the bin/deps directory of your Starlake installation:
- delta-spark_2.12-VERSION.jar
- delta-storage_2.12-VERSION.jar
Spark Delta
application:
  connections:
    spark:
      type: "spark"
      options:
        # any spark configuration can be set here
  spark:
    sql:
      extensions: "io.delta.sql.DeltaSparkSessionExtension"
      catalog:
        spark_catalog: "org.apache.spark.sql.delta.catalog.DeltaCatalog"
Snowflake
Snowflake JDBC
application:
  connectionRef: {{connection}}
  connections:
    snowflake:
      type: jdbc
      options:
        url: "jdbc:snowflake://{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com"
        driver: "net.snowflake.client.jdbc.SnowflakeDriver"
        user: {{SNOWFLAKE_USER}}
        password: {{SNOWFLAKE_PASSWORD}}
        warehouse: {{SNOWFLAKE_WAREHOUSE}}
        db: {{SNOWFLAKE_DB}}
        keep_column_case: "off"
        preActions: "alter session set TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_LTZ';ALTER SESSION SET QUOTED_IDENTIFIERS_IGNORE_CASE = true"
Snowflake Spark
application:
  connectionRef: {{connection}}
  connections:
    snowflake:
      type: jdbc
      sparkFormat: snowflake
      options:
        sfUrl: "{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com" # make sure you do not prefix it with jdbc:snowflake://. This is done by the Snowflake driver
        #sfDriver: "net.snowflake.client.jdbc.SnowflakeDriver"
        sfUser: {{SNOWFLAKE_USER}}
        sfPassword: {{SNOWFLAKE_PASSWORD}}
        sfWarehouse: {{SNOWFLAKE_WAREHOUSE}}
        sfDatabase: {{SNOWFLAKE_DB}}
        keep_column_case: "off"
        autopushdown: on
        preActions: "alter session set TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_LTZ';ALTER SESSION SET QUOTED_IDENTIFIERS_IGNORE_CASE = true"
Amazon Redshift
Redshift JDBC
application:
  connections:
    redshift:
      type: jdbc
      options:
        url: "jdbc:redshift://account.region.redshift.amazonaws.com:5439/database"
        driver: com.amazon.redshift.Driver
        password: "{{REDSHIFT_PASSWORD}}"
        tempdir: "s3a://bucketName/data"
        tempdir_region: "eu-central-1" # required only if running from outside AWS (your laptop ...)
        aws_iam_role: "arn:aws:iam::aws_account_id:role/role_name"
Redshift Spark
application:
  connections:
    redshift:
      type: jdbc
      sparkFormat: "io.github.spark_redshift_community.spark.redshift" # if running on top of Spark, or "redshift" if running on top of Databricks
      options:
        url: "jdbc:redshift://account.region.redshift.amazonaws.com:5439/database"
        driver: com.amazon.redshift.Driver
        password: "{{REDSHIFT_PASSWORD}}"
        tempdir: "s3a://bucketName/data"
        tempdir_region: "eu-central-1" # required only if running from outside AWS (your laptop ...)
        aws_iam_role: "arn:aws:iam::aws_account_id:role/role_name"
Postgres
Postgres JDBC
application:
  connectionRef: "postgresql"
  connections:
    postgresql:
      type: jdbc
      options:
        url: "jdbc:postgresql://{{POSTGRES_HOST}}:{{POSTGRES_PORT}}/{{POSTGRES_DATABASE}}"
        driver: "org.postgresql.Driver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"
        quoteIdentifiers: false
Postgres Spark
application:
  connectionRef: "postgresql"
  connections:
    postgresql:
      type: jdbc
      sparkFormat: jdbc
      options:
        url: "jdbc:postgresql://{{POSTGRES_HOST}}:{{POSTGRES_PORT}}/{{POSTGRES_DATABASE}}"
        driver: "org.postgresql.Driver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"
        quoteIdentifiers: false