Skip to main content

Load JSON Files

Loading a JSON file is similar to loading a CSV file except that JSON attributes may have nested attributes.

Infer schema

The very first step is to infer the schema of the file from a data file as described in the autoload section before you start customizing your configuration. This is done using the infer-schema command. This will bootstrap the configuration file for the table.


starlake infer-schema --input-path incoming/starbake

Parsing JSON

Like other file formats, JSON Parsing options are defined in the metadata section of the configuration file.

The formatattribute may be set to one of the values below:

FormatDescription
JSONThe JSON file is parsed as a list of objects, one object per line
JSON_FLATThe JSON objects do not contain any repeated or nested fields. Using this format instead of JSON makes file parsing faster
JSON_ARRAYThe JSON file is parsed as a single array of objects

metadata/load/<domain>/<table>.sl.yml - parsing section
table:
pattern: "order_line.*.csv"
metadata:
format: "JSON" # or "JSON_FLAT" since the JSON objects do not contain any repeated or nested fields
...
attributes:
- ...

Attributes validation

The attributes section of the configuration file is used to define the attributes of the JSON file. The attributes section is a list of objects, each object representing an attribute of the JSON file. In addition to the common attributes, the attributes section may contain the following attributes:

  • 'array' (optional): The attribute is an array of objects. The value of this attribute is a list of objects, each object representing an attribute of the array.
  • 'struct' (optional): The attribute is an object containing other attributes.

Given the Json record below:

{
"order_id": 1,
"order_date": "2021-01-01",
"order_lines": [
{
"line_id": 1,
"product_id": 1,
"quantity": 2
},
{
"line_id": 2,
"product_id": 2,
"quantity": 3
}
]
}

The attributes section of the configuration file would be:

metadata/load/<domain>/<table>.sl.yml - attributes section
table:
pattern: "order_line.*.csv"
metadata:
format: "JSON"
...
attributes:
- name: "order_id"
type: "integer"
- name: "order_date"
type: "date"
- name: "order_lines"
type: "struct"
array: true
attributes:
- name: "line_id"
type: "integer"
- name: "product_id"
type: "integer"
- name: "quantity"
type: "integer"