Skip to main content

Load DSV files

File load configuration

Most of the time, we won't need to define the table configuration, as the infer-schema command will be able to infer the table configuration from the file itself using the `infer-schema command.

Sometimes, we still need to update some properties of the load configuration for the table or add new properties not present in the source file (ACL, comments ...) or apply transformations.


Describing the file you load involves defining :

  • the file pattern that is mapped to this schema,
  • the parsing parameters and the materialization strategy (APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, SCD2 ...)
  • the file format as a list of attributes.

Infer schema

The very first step is to infer the schema of the file from a data file as described in the autoload section before you start customizing your configuration. This is done using the infer-schema command. This will bootstrap the configuration file for the table.


starlake infer-schema --input-path incoming/starbake

Parsing CSV

CSV Parsing options are defined in the metadata section of the configuration file.


metadata/load/<domain>/<table>.sl.yml - parsing section
table:
pattern: "order_line.*.csv"
metadata:
format: "DSV"
withHeader: true
separator: ";"
...
attributes:
- ...

AttributeDescription
encodingthe encoding of the file. Default is UTF-8
withHeaderif the file contains a header. Default is true
separatorthe separator used in the file. Default is ;
quotethe quote character used in the file. Default is "
escapethe escape character used in the file. Default is \
ignorea regex to filter lines to ignore. All lines matching the regex will be ignored. Default is None
ackload file only if a file with the same name and with this extension exist in the same directory.
optionsa map of options to pass to the parser. Available options are defined in the Apache Spark CSV documentation
emptyIsNullif empty column values should be considered as null. Default is false
fillWithDefaultValueif the null value should be filled with the default value of the column. Default is false

Attributes validation

metadata/load/<domain>/<table>.sl.yml - validation section
table:
pattern: "order_line.*.csv"
metadata:
- ...
attributes:
- name: "order_id"
type: "int"
- name: "product_id"
type: "int"
- name: "quantity"
type: "int"
- name: "sale_price"
type: "double"

The attributes properties are defined in the attributes section of the configuration file.

AttributeDescriptionDefault
namethe name of the column in the file header. If no header is present, this refers to the column index in the file. This is the only required property. this will also refer to the column name in the table except if the rename property is set.
typethe type of the column. Default is string. Available types are string, int, long, double, float, boolean, any flavor of date or timestamp as defined in the metadata/types/default.sl.yml file. Also note that you can add your own types in the metadata/types directory.
requiredif the column is required to have a non null value (see nullValue in the metadata.options section).false
privacya transformation to apply to the column.
transforma transformation to apply to the column.
commenta comment for the column.
renamethe name of the column in the database.name
defaultthe default value of the column.
trimthe trim strategy for the column. Available trim strategies are None, Left, Right, Both

To add / replace or ignore some attributes, check the Transform on load section.

Complete configuration

The name property of the column is the only required field. The type property is optional and will be set to string if not provided. The configuration file below describes how the order_line files should be loaded.


metadata/load/<domain>/<table>.sl.yml
table:
pattern: "order_line.*.csv"
metadata:
format: "DSV"
withHeader: true
separator: ";"
writeStrategy:
type: "APPEND"
attributes:
- name: "order_id"
type: "int"
- name: "product_id"
type: "int"
- name: "quantity"
type: "int"
- name: "sale_price"
type: "double"