Skip to main content

Load XML files

Loading XML files is similar to loading JSON files where attributes may have nested or repeated attributes. the difference is that in XML, child nodes and node attributes are used to represent the nested attributes.

Consider the following XML file:

<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
</data>

The parser needs to detect what tag identifies a row. In the example above, the tag country identifies a row.

Child nodes and node attributes are used to represent the nested attributes. In the example above, the country tag will have the following attributes in the database table:

  • _name
  • rank
  • year
  • gdppc
  • neighbor
    • _name
    • _direction

Note that the _ prefix is used to differentiate node attributes from child nodes, to avoid name collisions.


note

Also note that the target datawarehouse need to support nested and repeated attributes. For example, BigQuery and Databricks support nested and repeated attributes, while Redshift does not.

Infer schema

The very first step is to infer the schema of the file from a data file as described in the autoload section before you start customizing your configuration. This is done using the infer-schema command. This will bootstrap the configuration file for the table.


starlake infer-schema --input-path incoming/starbake

Parsing XML

You may need to customize the configuration file to parse the XML file. The main options to consider are:

  • rowTag: The tag that identifies a row in the XML file.
  • attributePrefix: The prefix used to identify node attributes (_ by default).
  • rowValidationXSDPath: The path to the XSD file used to validate the XML file.

Other options are available to customize the parsing of the XML file. You can find the full list of options in the Spark XML reference.