Skip to main content

About Metrics

During ingestion, Starlake may produce metrics for any attribute in the dataset. Currently, only top level attributes are supported. One of the two available metric type may be specified on an attribute: continuous and discrete. When the metric property is set to continuous, Starlake will compute for this attribute the following metrics:

  • minimum value
  • maximum value
  • sum of all values
  • mean: The arithmetic average
  • median: the value separating the higher half from the lower half, may be thought of as "the middle" value
  • variance: How far the values are spread out from their average value.
  • standard deviation: square root of the variance, the standard deviation measures how spread out numbers are in a data set
  • missing values
  • skewness: The measure of the asymmetry of the probability distribution. Negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right.
  • kurtosis: It tells us the extent to which the distribution is more or less outlier-prone (heavier or light-tailed) than the normal distribution. The greater the kurtosis, the less precise the standard deviation and variance become.
  • 25th percentile: Returns the approximate 25 percentile of this attribute which is the smallest value in the ordered attribute values (sorted from least to greatest) such that no more than 25% of attribute values is less than the value or equal to that value
  • 75 percentile: Returns the approximate 75 percentile of this attribute which is the smallest value in the ordered attribute values (sorted from least to greatest) such that no more than 75% of attribute values is less than the value or equal to that value
  • row count

When the metric property is set to discrete, Starlake will compute for this attribute the following metrics:

  • count distinct: The number of distinct values for this attribute
  • category frequency: The frequency (percentage) for each distinct value for this attribute
  • category count: The number of occurrences for each distinct value for this attribute
  • row count

Each metric is computed for each attribute only on the incoming dataset and stored in a table with the ingestion time allowing to compare metric values between loads.

Assuming we are ingesting a file with the following schema:

|-- business_id: string (nullable = false)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- stars: double (nullable = true)
|-- review_count: long (nullable = true)
|-- is_open: long (nullable = true)

with the attributes city is marked as discrete and review_count is marked as continuous

The following tables would be generated:

+-----------+-------------+---------------------+-----------+-------------------+------+--------+-----+-------------+----------+
|attribute |countDistinct|missingValuesDiscrete|cometMetric|jobId |domain|schema |count|timestamp |cometStage|
+-----------+-------------+---------------------+-----------+-------------------+------+--------+-----+-------------+----------+
|city |53 |0 |Discrete |local-1650471634299|yelp |business|200 |1650471642737|UNIT |
+-----------+-------------+---------------------+-----------+-------------------+------+--------+-----+-------------+----------+

+------------+---+-----+------+-------------+--------+-----------+------+--------+--------+------------+------+------------+-----------+-------------------+------+--------+-----+-------------+----------+
|attribute |min|max |mean |missingValues|variance|standardDev|sum |skewness|kurtosis|percentile25|median|percentile75|cometMetric|jobId |domain|schema |count|timestamp |cometStage|
+------------+---+-----+------+-------------+--------+-----------+------+--------+--------+------------+------+------------+-----------+-------------------+------+--------+-----+-------------+----------+
|review_count|3.0|664.0|38.675|0 |7974.944|89.303 |7735.0|4.359 |21.423 |5.0 |9.0 |25.0 |Continuous |local-1650471634299|yelp |business|200 |1650471642737|UNIT |
+------------+---+-----+------+-------------+--------+-----------+------+--------+--------+------------+------+------------+-----------+-------------------+------+--------+-----+-------------+----------+

+---------+---------------+-----+---------+-------------------+------+--------+-------------+----------+
|attribute|category |count|frequency|jobId |domain|schema |timestamp |cometStage|
+---------+---------------+-----+---------+-------------------+------+--------+-------------+----------+
|city |Tempe |200 |0.01 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |North Las Vegas|200 |0.01 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Phoenix |200 |0.085 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |West Mifflin |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Newmarket |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Wickliffe |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |McKeesport |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Scottsdale |200 |0.06 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Scarborough |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Wexford |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Willoughby |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Chandler |200 |0.02 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Surprise |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Cleveland |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Litchfield Park|200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Verona |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Richmond Hill |200 |0.01 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Hudson |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Etobicoke |200 |0.01 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |Cuyahoga Falls |200 |0.005 |local-1650471634299|yelp |business|1650471642737|UNIT |
|city |.............. |... |..... |local-1650471634299|yelp |business|1650471642737|UNIT |
+---------+---------------+-----+---------+-------------------+------+--------+-------------+----------+