Clustering and Partitioning

Some datawarehouses are designed to store data in a way that makes it easy to perform clustering and partitioning. Clustering and partitioning are two techniques that can be used to improve the performance of queries on large datasets. Clustering involves storing similar rows of data together, while partitioning involves splitting a table into smaller chunks of data.

BigQueryand Databricks are two datawarehouses that support user-defined clustering and partitioning.

To specify clustering and partitioning in BigQuery, you define the following properties in the sink section of the table configuration file.

BigQuery
Spark

metadata/load/<domain>/<table>.sl.yml
table
    ...
    sink:
      clustering:
        - <field1>
        - <field2>
      partition:
        field: <field> # only one field is allowed
      requirePartitionFilter: <true|false> # default is false
      days: <Ta> # expiration in days. default is "never expire"
      materializedView: <true|false> # Sink data as a materialized view. default is false
      enableRefresh: <true|false> # only if materializedView is true. Default is false
      refreshIntervalMs: <TabItem> # only if enable refresh is true.

metadata/load/<domain>/<table>.sl.yml
table:
    ...
    sink:
        format: <format> # For example, "delta" or "parquet".
        clustering:
            - <field1>
            - <field2>
        partition:
            - <field1>
            - <field2>
        coalesce: <true|false> # default is false
        options:
            <key>: <value> # Options to pass to the Spark writer. For example, "compression" -> "snappy"