Skip to main content

Clustering and Partitioning

Some datawarehouses are designed to store data in a way that makes it easy to perform clustering and partitioning. Clustering and partitioning are two techniques that can be used to improve the performance of queries on large datasets. Clustering involves storing similar rows of data together, while partitioning involves splitting a table into smaller chunks of data.

BigQueryand Databricks are two datawarehouses that support user-defined clustering and partitioning.

To specify clustering and partitioning in BigQuery, you define the following properties in the sink section of the table configuration file.

metadata/load/<domain>/<table>.sl.yml
table
...
sink:
clustering:
- <field1>
- <field2>
partition:
field: <field> # only one field is allowed
requirePartitionFilter: <true|false> # default is false
days: <Ta> # expiration in days. default is "never expire"
materializedView: <true|false> # Sink data as a materialized view. default is false
enableRefresh: <true|false> # only if materializedView is true. Default is false
refreshIntervalMs: <TabItem> # only if enable refresh is true.