Load XML files
Loading XML files is similar to loading JSON files where attributes may have nested or repeated attributes. the difference is that in XML, child nodes and node attributes are used to represent the nested attributes.
Consider the following XML file:
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
</data>
The parser needs to detect what tag identifies a row. In the example above, the tag country
identifies a row.
Child nodes and node attributes are used to represent the nested attributes. In the example above, the country
tag will have the following attributes in the database table:
_name
rank
year
gdppc
neighbor
_name
_direction
Note that the _
prefix is used to differentiate node attributes from child nodes, to avoid name collisions.
Also note that the target datawarehouse need to support nested and repeated attributes. For example, BigQuery and Databricks support nested and repeated attributes, while Redshift does not.
Infer schema
The very first step is to infer the schema of the file from a data file as described in the autoload section before you start customizing your configuration. This is done using the infer-schema
command. This will bootstrap the configuration file for the table.
starlake infer-schema --input-path incoming/starbake
Parsing XML
You may need to customize the configuration file to parse the XML file. The main options to consider are:
rowTag
: The tag that identifies a row in the XML file.attributePrefix
: The prefix used to identify node attributes (_
by default).rowValidationXSDPath
: The path to the XSD file used to validate the XML file.
Other options are available to customize the parsing of the XML file. You can find the full list of options in the Spark XML reference.