Clustering and Partitioning
Some datawarehouses are designed to store data in a way that makes it easy to perform clustering and partitioning. Clustering and partitioning are two techniques that can be used to improve the performance of queries on large datasets. Clustering involves storing similar rows of data together, while partitioning involves splitting a table into smaller chunks of data.
BigQuery
and Databricks
are two datawarehouses that support user-defined clustering and partitioning.
To specify clustering and partitioning in BigQuery, you define the following properties in the sink
section of the table configuration file.
- BigQuery
- Spark
metadata/load/<domain>/<table>.sl.yml
table
...
sink:
clustering:
- <field1>
- <field2>
partition:
field: <field> # only one field is allowed
requirePartitionFilter: <true|false> # default is false
days: <Ta> # expiration in days. default is "never expire"
materializedView: <true|false> # Sink data as a materialized view. default is false
enableRefresh: <true|false> # only if materializedView is true. Default is false
refreshIntervalMs: <TabItem> # only if enable refresh is true.
metadata/load/<domain>/<table>.sl.yml
table:
...
sink:
format: <format> # For example, "delta" or "parquet".
clustering:
- <field1>
- <field2>
partition:
- <field1>
- <field2>
coalesce: <true|false> # default is false
options:
<key>: <value> # Options to pass to the Spark writer. For example, "compression" -> "snappy"