
Autoload

note

Data warehouses are organized around schemas in which tables are grouped. Depending on the database engine, a schema may be called a schema, a catalog or a dataset. In Starlake, we use the term domain to designate a schema, catalog or dataset.

The conventions used by the autoload command are:

  • Files are stored under the incoming directory and named after the domain and table they are loaded into: $SL_ROOT/incoming/<domain>/<table><suffix>.<extension> (see the example after this list).
  • The target schema is named after the domain name.
  • The target table is named after the file name, without the extension and suffix.
  • The suffix is used to differentiate between files loaded in full overwrite mode and files loaded in append mode. For example, order_20240228.csv and order_line_20240228.csv are loaded into the tables order and order_line respectively.
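For instance, with the starbake domain shown below, dropping files into place could look like this sketch (the source paths under /tmp/exports are placeholders):

# copy incoming files to $SL_ROOT/incoming/<domain>/ following the naming convention
mkdir -p "$SL_ROOT/incoming/starbake"
cp /tmp/exports/product.xml             "$SL_ROOT/incoming/starbake/"
cp /tmp/exports/order_20240228.csv      "$SL_ROOT/incoming/starbake/"
cp /tmp/exports/order_line_20240228.csv "$SL_ROOT/incoming/starbake/"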
directory structure example before running autoload
incoming
└── starbake
    ├── product.xml
    ├── order_20240228.csv
    └── order_line_20240228.csv
metadata/load

After running the autoload command:

starlake autoload
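If you run the command from outside the project directory, you can typically point SL_ROOT at the project root first (the path below is a placeholder):

export SL_ROOT=/path/to/my/project   # placeholder: your Starlake project root
starlake autoload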

Since we did not specify a target data warehouse and used the local filesystem as the target, the directory structure is updated as follows.

In real life, the target will be an actual data warehouse and the tables will be created in the target database.

table schemas inferred from the directory structure
metadata/load
└── starbake
    ├── _config.sl.yml
    ├── product.sl.yml
    ├── order.sl.yml
    └── order_line.sl.yml
datasets/
├── archive      # files are moved to the archive directory after being loaded
│   └── starbake
├── ingesting    # files are moved to the ingesting directory before being loaded
│   └── starbake
├── replay       # invalid records that need to be replayed are moved to the replay directory
│   └── starbake
├── stage        # files are moved to the stage directory before being loaded
│   └── starbake
├── unresolved   # filenames that do not match a valid pattern are moved to the unresolved directory
└── incoming
    └── starbake
        ├── product.xml
        ├── order_20240228.csv
        └── order_line_20240228.csv


  • The _config.sl.yml file contains the domain configuration and describes how the domain will be created in the target database.

  • The <table>.sl.yml file contains the table configuration and describes how files are parsed, validated and loaded into the target database (see the sketch below).
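As an illustration only, the inferred table configuration for the order table might look roughly like the sketch below. Every key and every column shown here is an assumption: the exact structure, names and types depend on your Starlake version and on the actual file content.

# hypothetical sketch of metadata/load/starbake/order.sl.yml
table:
  pattern: "order.*.csv"        # files matching this pattern are loaded into the order table
  metadata:
    format: "DSV"               # delimiter-separated values
    separator: ","
    withHeader: true
  attributes:                   # column names and types inferred from the file content
    - name: "order_id"          # hypothetical columns, for illustration only
      type: "long"
    - name: "customer_id"
      type: "long"
    - name: "timestamp"
      type: "timestamp"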

How autoload detects the format of the files

  • It expects the file to be named after the table it will be loaded into.
  • It uses the file extension to detect the format of the file. The supported formats are csv, psv, dsv, json and xml.
  • It uses the first line of the file to detect the separator and the number of columns.
  • In the case of CSV files, it uses the first line of the file to detect the column names (see the example after this list).
  • It uses the whole file content to infer the column types.
  • In the case of JSON files, a file is considered a JSON_ARRAY file if its first character is [ and a JSON file if its first character is {.
  • In the case of XML files, it uses the first line of the file to detect the root element and the number of columns.
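For instance, a CSV file whose first two lines look like the following (hypothetical content) is detected as a comma-separated file with a header row and three columns; the column types are then inferred from the values found in the rest of the file:

order_id,customer_id,timestamp
1205,42,2024-02-28T10:45:00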

XML files are expected to have the following structure:


<?xml version="1.0"?> <!-- optional XML declaration -->
<myroot>
  <myrecord attr1="value1">
    <column1>value1</column1>
    <column2>value2</column2>
    ...
  </myrecord>
</myroot>

Spark specifics

When running against Spark locally, the tables are created in the datasets directory. When running against a real data warehouse, the tables are created in the target database.

Extra files when running autoload against Spark locally
datasets/
├── starbake.db   # files are loaded as tables into the starbake domain
│   ├── order
│   ├── order_line
│   └── product
├── audit.db      # audit tables are created in the audit schema
│   ├── audit
│   └── rejected