
Databricks on any cloud

Cluster Setup

When running on top of Databricks, you can abstract away the cloud provider entirely. You just need to set up a Databricks cluster that uses the compute and storage resources provided by the underlying cloud provider. Follow the steps below to run Starlake on top of Databricks:

  • Create a storage account
  • Create a Databricks cluster
  • Mount the Databricks File System
  • Create a Starlake job
  • Ingest your data
note

The screenshots below are taken from a Databricks cluster running on Google Cloud but are also valid for Azure and AWS.

Create a storage account

Create a storage account, name it for example starlakestorage, and assign it to a resource group. In this storage account, create a container that you can name starlake-app and set its public access level to Container. This container will serve the following purposes:

  • Store Starlake jars
  • Store Starlake metadata
  • Store parquet files after ingestion

Azure create container

You can also distribute these roles across several containers.

Create a Databricks Cluster

In a Databricks Workspace, create a cluster with the correct Databricks Runtime version: 9.1 LTS (Apache Spark 3.1.2, Scala 2.12).

In this section you don't have to set any service account variables; we will set them when mounting our container. Just copy your storage account access key, we will use it later in a notebook.

Azure Key

Environment Variables

In the Advanced Settings / Environment variables section, set the variables below (a quick notebook check is shown after the table):

Environment variable | Value | Description
--- | --- | ---
SL_METRICS_ACTIVE | true | Should we compute the metrics defined on individual table columns at ingestion time?
SL_ROOT | /mnt/starlake-app/tmp/userguide | A DBFS mounted directory (see below). It should reference the base directory where your Starlake metadata is located
SL_AUDIT_SINK_TYPE | BigQuerySink | Where to save audit logs. Here we decide to save them in BigQuery. To save them as a Hive table or as files on cloud storage, set it to FsSink
SL_FS | dbfs:// | The filesystem. Always set it to DBFS on Databricks
SL_HIVE | true | Should we store the resulting parquet files as Databricks tables?
TEMPORARY_GCS_BUCKET | starlake-app | Bucket name where the Google Cloud API stores temporary files when saving data to BigQuery. Don't add this one if you are on Azure
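Once the cluster is (re)started, you can check from a notebook that these variables are visible to the driver process. This is just a sanity check using the names listed above:

import os

# Cluster-level environment variables are exposed to the driver process
for name in ["SL_METRICS_ACTIVE", "SL_ROOT", "SL_AUDIT_SINK_TYPE", "SL_FS", "SL_HIVE"]:
    print(name, "=", os.environ.get(name))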

Mount DBFS

Databricks virtualizes the underlying filesystem through DBFS. We first need to enable it on the Admin Console / Workspace Settings page:

Enable DBFS

Inside a notebook, we now mount into DBFS the cloud storage container created above and referenced in the cluster environment variables:

storage_name = "starlakestorage"
container_name = "starlake-app"
storage_access_key = "Your access key"  # the storage account access key copied earlier
mount_name = "starlake-app"

# Mount the Azure Blob Storage container into DBFS
dbutils.fs.mount(
  source="wasbs://%s@%s.blob.core.windows.net" % (container_name, storage_name),
  mount_point="/mnt/%s" % mount_name,
  extra_configs={
    "fs.azure.account.key.%s.blob.core.windows.net" % storage_name: storage_access_key
  }
)

Your storage account is now accessible to Spark on Databricks as the folder dbfs:/mnt/starlake-app.
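You can verify the mount from the same notebook; the listing shows whatever has already been uploaded to the container:

# List the content of the mounted container
display(dbutils.fs.ls("/mnt/starlake-app"))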

Create a Starlake job

To create a Starlake job, first upload the Starlake uber jar and the Jackson YAML (de)serializer into the gs://starlake-app folder or the starlake-app container:

container details Azure

The version of jackson-dataformat-yaml must match the version of the other Jackson components shipped with the Databricks runtime. Download the correct version from Maven Central, then add the Starlake assembly jar that you can find here.
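If you are unsure which Jackson version your Databricks runtime ships, one way to check it from a Python notebook is to ask the jackson-core library already on the cluster classpath; this is a convenience sketch, not something Starlake requires:

# Read the version of the jackson-core library loaded on the cluster through the JVM gateway,
# then download the matching jackson-dataformat-yaml artifact from Maven Central
jackson_version = spark.sparkContext._jvm.com.fasterxml.jackson.core.json.PackageVersion.VERSION
print(jackson_version.toString())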

Note that only the import, watch/load/ingest, transform and metrics command lines are designed to be used in production environments.

You first need to write your metadata configuration files in a local environment and then upload your Starlake project to the SL_ROOT location, like this:

metadata container

If you are working with an external data source, you can mount your incoming data location in Databricks (a mount sketch follows the snippet below) and reference it in the env.sl.yml file:

env.sl.yml
env:
  root_path: "/mnt/starlake-app/tmp/userguide"
  incoming_path: "/mnt/sample-data"
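The incoming location can be mounted exactly like the starlake-app container above; the sketch below assumes a container named sample-data in the same storage account:

# Mount the container holding the incoming data files (container and mount names are assumptions)
dbutils.fs.mount(
  source="wasbs://sample-data@starlakestorage.blob.core.windows.net",
  mount_point="/mnt/sample-data",
  extra_configs={
    "fs.azure.account.key.starlakestorage.blob.core.windows.net": storage_access_key
  }
)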

Then reference your data source using the {{incoming_path}} variable in the domain schemas:

hr.sl.yml
load:
  name: "HR"
  metadata:
    mode: "FILE"
    format: "JSON"
    encoding: "UTF-8"
    multiline: false
    array: true
    separator: "["
    quote: "\""
    escape: "\\"
    write: "APPEND"
    directory: "{{incoming_path}}/HR"

Create tasks that reference the two jars you uploaded, which are now visible to Databricks through the dbfs:/mnt/starlake-app mount (a scripted alternative is sketched after the two tasks below):

  • The first task (import) will copy the files matching the expected patterns into the pending directory, ready for ingestion by Starlake

starlake import

  • The second task (watch) will run the Starlake ingestion job and store the result as parquet files in your dataset directory

starlake watch
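If you prefer to script these tasks instead of creating them in the UI, they can also be declared through the Databricks Jobs API. The sketch below creates the import task only; the main class ai.starlake.job.Main, the jar file names and the placeholders in angle brackets are assumptions to adapt to your installation (the watch task is identical with "watch" as parameter).

import requests

# Illustrative Jobs API 2.1 payload for the import task; adapt the placeholders to your workspace
host = "https://<your-databricks-workspace>"
token = "<your-personal-access-token>"

payload = {
    "name": "starlake-import",
    "tasks": [
        {
            "task_key": "import",
            "existing_cluster_id": "<your-cluster-id>",
            "spark_jar_task": {
                "main_class_name": "ai.starlake.job.Main",  # assumed Starlake entry point
                "parameters": ["import"],
            },
            "libraries": [
                {"jar": "dbfs:/mnt/starlake-app/starlake-assembly.jar"},        # assumed file name
                {"jar": "dbfs:/mnt/starlake-app/jackson-dataformat-yaml.jar"},  # assumed file name
            ],
        }
    ],
}

response = requests.post(
    host + "/api/2.1/jobs/create",
    headers={"Authorization": "Bearer " + token},
    json=payload,
)
print(response.json())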

Ingest your data

Start the import task first and then the watch task. The execution logs are available through the runs tab:

tasks runs

Since we set the SL_HIVE=true environment variable, the ingested data is also available as Databricks tables, which you can query from a notebook (see the example below):

starlake watch
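From a notebook you can then query the ingested tables with Spark SQL; the sketch below assumes that the Hive database created by Starlake carries the domain name HR:

# List the tables created for the HR domain (the database name is an assumption)
display(spark.sql("SHOW TABLES IN HR"))
# display(spark.sql("SELECT * FROM HR.<your_table> LIMIT 10"))  # replace <your_table> with one of the listed tables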

The audit log for the above tasks is stored in a BigQuery table since we set the SL_AUDIT_SINK_TYPE=BigQuerySink environment variable.
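If the Spark BigQuery connector is available on your cluster, you can also read that audit table back from the notebook; the project, dataset and table names below are assumptions to adjust to your audit sink configuration:

# Read the audit table written by the BigQuerySink (the fully qualified name is an assumption)
audit = (spark.read.format("bigquery")
         .option("table", "<your-gcp-project>.audit.audit")
         .load())
display(audit)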