Load fixed width files

To load a fixed width file, you need to know the width of each column. The sample below shows how a fixed width file might represent orders.

  62024-02-05T21:19:15.454ZCancelled
  72024-02-05T21:19:15.454ZCancelled
  82024-02-05T21:19:15.454ZDelivered

In the file above, the columns are laid out as follows:

order_id: 5 characters
customer_id: 5 characters
timestamp: 24 characters
status: 10 characters
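
As a rough illustration of what these widths mean in practice, the Python sketch below cuts one record into fields by walking the widths in order. It is not part of Starlake; the `split_record` helper and the sample record (built from the values used later on this page, with the status padded to 10 characters) are assumptions made for the example.

```python
# Column widths taken from the list above, in file order.
WIDTHS = [("order_id", 5), ("customer_id", 5), ("timestamp", 24), ("status", 10)]


def split_record(line: str) -> dict[str, str]:
    """Cut one fixed width line into named fields, keeping the padding intact."""
    fields = {}
    start = 0
    for name, width in WIDTHS:
        fields[name] = line[start:start + width]
        start += width
    return fields


# Hypothetical record: order 00001, customer 6, status padded to 10 characters.
record = "00001" + "    6" + "2024-02-05T21:19:15.454Z" + "Cancelled "
parsed = split_record(record)
print(parsed["customer_id"].strip())  # prints 6
```

Padding is kept in the returned values on purpose, so the caller decides whether to strip it.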

Infer schema

Each column has a fixed width, and no delimiter separates the columns, so the column boundaries cannot be detected from the data itself. Instead, provide the infer-schema command with a file that contains a single record, with each field on its own line, prefixed by its name and a colon.

To infer the schema of the fixed width file above, submit the following file to the infer-schema command:

order_id:00001
customer_id:    6
timestamp:2024-02-05T21:19:15.454Z
status:Cancelled
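
To see how field positions can follow from such a file, here is a minimal Python sketch that turns `name:value` lines into zero-based, inclusive offsets. It only illustrates the idea and is not Starlake's actual implementation; the `positions_from_sample` function is hypothetical, and the status value is assumed to carry a trailing space so that it fills its 10 characters.

```python
def positions_from_sample(sample: str) -> list[dict]:
    """Turn 'name:value' lines into zero-based, inclusive first/last offsets."""
    attributes = []
    first = 0
    for line in sample.splitlines():
        if not line:
            continue
        # Split on the first colon only: the timestamp value contains colons too.
        name, value = line.split(":", 1)
        last = first + len(value) - 1
        attributes.append({"name": name, "first": first, "last": last})
        first = last + 1
    return attributes


sample = (
    "order_id:00001\n"
    "customer_id:    6\n"
    "timestamp:2024-02-05T21:19:15.454Z\n"
    "status:Cancelled \n"  # assumed trailing space pads the value to 10 characters
)
for attr in positions_from_sample(sample):
    print(attr)
```

The length of each value, including its padding, is what fixes the width of the column, which is why the padding in the sample file matters.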

Running the infer-schema command below infers the schema and prints it to the console.

$ starlake infer-schema -inputPath /my/path/fixed_width_file.txt --format FIXED --outputDir $SL_ROOT/metadata/load/starbake

The schema is saved in the directory given by the outputDir parameter. The resulting Starlake YAML schema looks like this:

load:
    metadata:
      format: FIXED
      writeStrategy:
        type: APPEND
    ...
    attributes:
      - name: order_id
        type: string
        position: # first 5 characters
          first: 0
          last: 4
      - name: customer_id
        type: string
        position: # next 5 characters
          first: 5
          last: 9
      - name: timestamp
        type: string
        position: # next 24 characters
          first: 10
          last: 33
      - name: status
        type: string
        position: # next 10 characters
          first: 34
          last: 43
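
The positions above read as zero-based and inclusive (first: 0 and last: 4 cover 5 characters). Under that assumption, the following Python sketch checks that a set of positions leaves no gaps or overlaps and shows how an inclusive range maps to a slice. The `check_contiguous` and `extract` helpers are hypothetical and not part of Starlake.

```python
# (name, first, last) triples copied from the YAML schema above.
ATTRIBUTES = [
    ("order_id", 0, 4),
    ("customer_id", 5, 9),
    ("timestamp", 10, 33),
    ("status", 34, 43),
]


def check_contiguous(attributes) -> int:
    """Return the record length, or raise if fields overlap or leave gaps."""
    expected_first = 0
    for name, first, last in attributes:
        if first != expected_first or last < first:
            raise ValueError(
                f"{name}: expected first={expected_first}, got {first}..{last}"
            )
        expected_first = last + 1
    return expected_first


def extract(line: str, first: int, last: int) -> str:
    # 'last' is inclusive, so the Python slice end is last + 1.
    return line[first:last + 1]


record_length = check_contiguous(ATTRIBUTES)
print(record_length)  # prints 44
```

A check like this can catch off-by-one mistakes when a schema is edited by hand, for example after widening a column.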

Load data

After inferring the schema, use the load command to load the data into the data warehouse.

$ starlake load
