Type validation
starlake allows you to validate the types of the data you are loading.
This is done by specifying the schema of the data you are writing.
This schema, generally infered from sample data, is described using the YAML syntax in the attributes
section of
the metadata/load/<domain>/<table>.sl.yml
file.
The order_line
table may be described as follows:
table:
pattern: "order_line.*.csv"
metadata:
- ...
attributes:
- name: "order_id"
type: "int"
- name: "product_id"
type: "int"
- name: "quantity"
type: "int"
- name: "sale_price"
type: "double"
int
and double
are what we call the basic types and are mapped to the corresponding SQL types.
Basic types
The basic types below are supported by starlake:
types:
- name: "string"
primitiveType: "string"
pattern: ".+"
sample: "Hello World"
comment: "Any set of chars"
- name: "int"
pattern: "[-|\\+|0-9][0-9]*"
primitiveType: "long"
sample: "1234"
comment: "Int number"
- name: "integer"
pattern: "[-|\\+|0-9][0-9]*"
primitiveType: "long"
sample: "1234"
comment: "Int number"
- name: "byte"
primitiveType: "byte"
pattern: "."
sample: "x"
comment: "Any set of chars"
- name: "double"
primitiveType: "double"
pattern: "[-+]?\\d*\\.?\\d+[Ee]?[-+]?\\d*"
sample: "-45.78"
comment: "Any floating value"
- name: "long"
primitiveType: "long"
pattern: "[-|\\+|0-9][0-9]*"
sample: "-64564"
comment: "any positive or negative number"
- name: "short"
primitiveType: "short"
pattern: "-?\\d+"
sample: "564"
comment: "any positive or negative number"
- name: "boolean"
primitiveType: "boolean"
pattern: "(?i)true|yes|[y1]<-TF->(?i)false|no|[n0]"
sample: "TruE"
- name: "timestamp"
primitiveType: "timestamp"
pattern: "ISO_DATE_TIME"
sample: "2019-12-31 23:59:02"
comment: "date/time in epoch millis"
- name: "decimal"
primitiveType: "decimal"
pattern: "-?\\d*\\.{0,1}\\d+"
sample: "-45.787686786876"
comment: "Any floating value"
- name: "date"
primitiveType: "date"
pattern: "yyyy-MM-dd"
sample: "2018-07-21"
comment: "Date in the format yyyy-MM-dd"
- name: "basic_iso_date"
primitiveType: "timestamp"
pattern: "BASIC_ISO_DATE"
sample: "20111203"
comment: "Timestamp based on yyyMMdd pattern"
- name: "iso_local_date"
primitiveType: "timestamp"
pattern: "ISO_LOCAL_DATE"
sample: "2011-12-03"
comment: "Timestamp based on yyyy-MM-dd pattern"
- name: "iso_offset_date"
primitiveType: "timestamp"
pattern: "ISO_OFFSET_DATE"
sample: "2011-12-03+02:00"
comment: "Timestamp based on `ISO Date with offset` pattern"
- name: "iso_date"
primitiveType: "timestamp"
pattern: "ISO_DATE"
sample: "2011-12-03+02:00"
comment: "Timestamp based on `ISO Date with or without offset` pattern"
- name: "iso_local_date_time"
primitiveType: "timestamp"
pattern: "ISO_LOCAL_DATE_TIME"
sample: "2011-12-03T10:15:30"
comment: "Timestamp based on `ISO Local Date and Time` pattern"
- name: "iso_offset_date_time"
primitiveType: "timestamp"
pattern: "ISO_OFFSET_DATE_TIME"
sample: "2011-12-03T10:15:30+01:00"
comment: "Timestamp based on `ISO Local Date and Time` pattern"
- name: "iso_zoned_date_time"
primitiveType: "timestamp"
pattern: "ISO_ZONED_DATE_TIME"
sample: "2011-12-03T10:15:30+01:00[Europe/Paris]"
comment: "Timestamp based on `ISO Zoned Date Time` pattern"
- name: "iso_date_time"
primitiveType: "timestamp"
pattern: "ISO_DATE_TIME"
sample: "2011-12-03T10:15:30+01:00[Europe/Paris]"
comment: "Timestamp based on `ISO Date and time with ZoneId` pattern"
- name: "iso_ordinal_date"
primitiveType: "timestamp"
pattern: "ISO_ORDINAL_DATE"
sample: "2012-337"
comment: "Timestamp based on `year and day of year` pattern"
- name: "iso_week_date"
primitiveType: "timestamp"
pattern: "ISO_WEEK_DATE"
sample: "2012-W48-6"
comment: "Timestamp based on `Year and Week` pattern"
- name: "iso_instant"
primitiveType: "timestamp"
pattern: "ISO_INSTANT"
sample: "2011-12-03T10:15:30Z"
comment: "Timestamp based on `Date and Time of an Instant` pattern (UTC only)"
- name: "rfc_1123_date_time"
primitiveType: "timestamp"
pattern: "RFC_1123_DATE_TIME"
sample: "Tue, 3 Jun 2008 11:05:30 GMT"
comment: "Timestamp based on `RFC 1123 / RFC 822` patterns"
Custom types
You may redefine existing standard types or define your own types in the metadata/types
directory in a file named custom.sl.yml
.
This allows you to run your input data against advanced validation rules.
Below is an example of a custom type definition:
types:
- name: "email"
pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
primitiveType: "string"
sample: "
comment: "Email address"
- name: "phone"
pattern: "^[0-9]{10}$"
primitiveType: "string"
sample: "1234567890"
comment: "Phone number"
- name: "zip_code"
pattern: "^[0-9]{5}$"
primitiveType: "string"
sample: "12345"
comment: "Zip code"
Note that you have to map the custom type to a primitive type, which is the type that will be used in the SQL schema.
The following primitive types are supported:
- string
- long
- double
- byte
- short
- boolean
- timestamp
- decimal
- date