# Tutorial
Now that our load and transform steps are working, we can run them on our orchestrator.
## Prerequisites
Make sure you have run the Transform step first so that the data is available in the database.
## Running the DAG
Using the `starlake dag-generate` command, we can generate the DAG files that will run our load and transform tasks:

```bash
starlake dag-generate --clean
```

This will generate your DAG files at the root of the `dags/generated` directory.
Just copy everything under the `dags/generated` directory to your orchestrator's DAGs directory and you are good to go, as shown below.
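For example, with a local Airflow installation, the copy could look like this (a sketch assuming the conventional `$AIRFLOW_HOME/dags` folder; adjust the destination to your deployment):

```bash
# Copy the generated DAG files to Airflow's DAGs folder.
# $AIRFLOW_HOME/dags is the conventional location; your deployment may differ.
cp -r dags/generated/* "$AIRFLOW_HOME/dags/"
```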
- Airflow: note how the load DAGs are tagged with the domain name.
- Dagster: TODO
## Configuration
DAG generation is driven by the configuration files located in the `metadata/dags` directory. Put the configuration files for the DAGs you want to generate there, then reference them either globally in the `metadata/application.sl.yml` file or per load or transform task through the `dagRef` attribute.
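For instance, a global reference in `metadata/application.sl.yml` might look like the following sketch, assuming the DAG configuration files in `metadata/dags` are named `airflow_all_tables.sl.yml` and `airflow_all_tasks.sl.yml` (check the schema for your Starlake version):

```yaml
# metadata/application.sl.yml -- a sketch; the referenced names are assumptions
application:
  dagRef:
    load: "airflow_all_tables"       # default DAG config for load tasks
    transform: "airflow_all_tasks"   # default DAG config for transform tasks
```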
- Airflow

```yaml
dag:
  comment: "default Airflow DAG configuration for load"
  template: "load/airflow_scheduled_table_bash.py.j2"
  filename: "airflow_all_tables.py"
  options:
    load_dependencies: "true"
```

```yaml
dag:
  comment: "default Airflow DAG configuration for transform"
  template: "transform/airflow_scheduled_task_bash.py.j2"
  filename: "airflow_all_tasks.py"
  options:
    load_dependencies: "true"
```

- Dagster

```yaml
dag:
  comment: "default Dagster pipeline configuration for load"
  template: "load/dagster_scheduled_table_shell.py.j2"
  filename: "dagster_all_load.py"
  options:
    load_dependencies: "true"
```
```yaml
dag:
  comment: "default Dagster pipeline configuration for transform"
  template: "transform/dagster_scheduled_task_shell.py.j2"
  filename: "dagster_all_tasks.py"
  options:
    load_dependencies: "true"
```
The `load_dependencies: "true"` option tells the DAG generator to include the dependent tables or tasks in the generated DAG file.
The `template` attribute is the template file used to generate the DAG file. It may reference:
- an absolute path on the filesystem
- a path relative to the `metadata/dags/templates/` directory
- a built-in template of the starlake library, located in the `load` and `transform` resource directories
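The following sketch illustrates the three ways a template may be referenced; the custom paths are illustrative, and only the built-in name is taken from the configurations above:

```yaml
dag:
  comment: "template resolution example"
  # 1. absolute path on the filesystem (illustrative path)
  template: "/opt/starlake/templates/my_load_dag.py.j2"
  # 2. relative to metadata/dags/templates/ (illustrative name)
  # template: "my_load_dag.py.j2"
  # 3. built-in template from the starlake library
  # template: "load/airflow_scheduled_table_bash.py.j2"
  filename: "my_load_dag.py"
```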
The load and transform tasks will run as bash commands on the orchestrator. To run them on a different executor, change the template to a different one or build your own.
Starlake comes with a few templates out of the box targeting Cloud Run or Dataproc.