# Tutorial
Now that our load and transform steps are working, we can run them on our orchestrator.
## Prerequisites
Make sure you have run the Transform step first so that the data is available in the database.
## Running the DAG
Using the `starlake dag-generate` command, we can generate the DAG files that will run our load and transform tasks:

```bash
starlake dag-generate --clean
```

This will generate your DAG files at the root of the `dags/generated` directory.
Just copy everything under the `dags/generated` directory to your orchestrator's DAGs directory and you are good to go, as shown below.
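For example, with a local Airflow installation, the copy could look like this (a sketch assuming the conventional `$AIRFLOW_HOME/dags` folder; adjust the destination to your deployment):

```bash
# Copy the generated DAG files to Airflow's DAGs folder.
# $AIRFLOW_HOME/dags is the conventional location; your deployment may differ.
cp -r dags/generated/* "$AIRFLOW_HOME/dags/"
```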
- Airflow: note how the load DAGs are tagged with the domain name.
- Dagster: TODO
## Configuration
DAG generation is driven by the configuration files located in the `metadata/dags` directory. Put the configuration files for the DAGs you want to generate there, then reference them either globally in the `metadata/application.sl.yml` file or per load or transform task through the `dagRef` attribute.
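For instance, a global reference in `metadata/application.sl.yml` might look like the following sketch, assuming the DAG configuration files in `metadata/dags` are named `airflow_all_tables.sl.yml` and `airflow_all_tasks.sl.yml` (check the schema for your Starlake version):

```yaml
# metadata/application.sl.yml -- a sketch; the referenced names are assumptions
application:
  dagRef:
    load: "airflow_all_tables"       # default DAG config for load tasks
    transform: "airflow_all_tasks"   # default DAG config for transform tasks
```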
- Airflow

```yaml
dag:
  comment: "default Airflow DAG configuration for load"
  template: "load/airflow_scheduled_table_bash.py.j2"
  filename: "airflow_all_tables.py"
  options:
    load_dependencies: "true"
```

```yaml
dag:
  comment: "default Airflow DAG configuration for transform"
  template: "transform/airflow_scheduled_task_bash.py.j2"
  filename: "airflow_all_tasks.py"
  options:
    load_dependencies: "true"
```

- Dagster

```yaml
dag:
  comment: "default Dagster pipeline configuration for load"
  template: "load/dagster_scheduled_table_shell.py.j2"
  filename: "dagster_all_load.py"
  options:
    load_dependencies: "true"
```
```yaml
dag:
  comment: "default Dagster pipeline configuration for transform"
  template: "transform/dagster_scheduled_task_shell.py.j2"
  filename: "dagster_all_tasks.py"
  options:
    load_dependencies: "true"
```
The `load_dependencies: "true"` option tells the DAG generator to include the dependent tables or tasks in the generated DAG file.
The `template` attribute is the template file used to generate the DAG file. It may reference:
- an absolute path on the filesystem
- a path relative to the `metadata/dags/templates/` directory
- a built-in template of the starlake library, located in the `load` and `transform` resource directories
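The following sketch illustrates the three ways a template may be referenced; the custom paths are illustrative, and only the built-in name is taken from the configurations above:

```yaml
dag:
  comment: "template resolution example"
  # 1. absolute path on the filesystem (illustrative path)
  template: "/opt/starlake/templates/my_load_dag.py.j2"
  # 2. relative to metadata/dags/templates/ (illustrative name)
  # template: "my_load_dag.py.j2"
  # 3. built-in template from the starlake library
  # template: "load/airflow_scheduled_table_bash.py.j2"
  filename: "my_load_dag.py"
```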
The load and transform tasks will run as bash commands on the orchestrator. To run them on a different executor, change the template to a different one or build your own.
Starlake comes with a few templates out of the box targeting Cloud Run or Dataproc.