Skip to main content

Tutorial

Extract tables in one shot or incrementally from a database as a set of files.

Prerequisites

  • Install duckdb

  • Download the sample duckdb.db database and store it in the datasets folder in your project directory $SL_ROOT/datasets

This will bring in the following database schema:

$ duckdb $SL_ROOT/datasets/duckdb.db
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D show;
┌──────────┬──────────┬────────────┬────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┬───────────┐
│ database │ schema │ name │ column_names │ column_types │ temporary │
│ varchar │ varchar │ varchar │ varchar[] │ varchar[] │ boolean │
├──────────┼──────────┼────────────┼────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼───────────┤
│ duckdb │ starbake │ order │ [customer_id, order_id, status, timestamp] │ [BIGINT, BIGINT, VARCHAR, TIMESTAMP] │ false │
│ duckdb │ starbake │ order_line │ [order_id, product_id, quantity, sale_price] │ [BIGINT, BIGINT, BIGINT, DOUBLE] │ false │
│ duckdb │ starbake │ product │ [category, cost, description, name, price, product_id] │ [VARCHAR, DOUBLE, VARCHAR, VARCHAR, DOUBLE, BIGINT] │ false │
└──────────┴──────────┴────────────┴────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┴───────────┘
D .quit
$

Extract Data

Extracting Data involves the following steps:

  • Configure the connection to the source database
  • Select the database schema to extract
  • Select the tables to extract
  • Select the columns to extract (optional)
  • Select the rows to extract (optional)

Configure the connection to the source database

In the project folder, under the metadata folder, edit the file application.sl.yml and set the connection parameters to the source database.

metadata/application.sl.yml
application:
...
connections:
duckdb:
type: "jdbc" # Connection to DuckDB
options:
url: "jdbc:duckdb:{{SL_ROOT}}/datasets/duckdb.db" # Location of the DuckDB database
driver: "org.duckdb.DuckDBDriver"

...

In the metadata/extract folder, create a new file my_extract_config.sl.yml and set the connectionRef to the connection defined in the application.sl.yml file.

metadata/extract/my_extract_config.sl.yml
extract:
connectionRef: "duckdb" # The database connection to use
jdbcSchemas:
- schema: "starbake"
tables:
- name: "*" # table names or "*" to extract all tables
tableTypes: # (optional) table types to extract
- "TABLE"
#- "VIEW"
#- "SYSTEM TABLE"
#- "GLOBAL TEMPORARY"
#- "LOCAL TEMPORARY"
#- "ALIAS"
#- "SYNONYM"

That's it! We are ready to extract the data from the database.

Extract the data

$ cd $SL_ROOT
$ starlake extract-data --config my_extract_config --outputDir $SL_ROOT/incoming/starbake

The $SL_ROOT/incoming/starbake folder will contain the extracted data in CSV format.

You are now ready to load the data into the datawarehouse of your choice using the starlake load command.

Extract Schema (Optional)

The command below generates a table schema description file in the metadata/load directory and allows you to load the data into the datawarehouse of your choice, using starlake load command instead of going through the infer-schema feature starlake provides.

$ cd $SL_ROOT
$ starlake.sh extract-schema --config my_extract_config.sl.yml # extract description