Tutorial
Extract tables in one shot or incrementally from a database as a set of files.
Prerequisites
-
Install duckdb
-
Download the sample duckdb.db database and store it in the
datasets
folder in your project directory$SL_ROOT/datasets
This will bring in the following database schema:
$ duckdb $SL_ROOT/datasets/duckdb.db
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D show;
┌──────────┬──────────┬────────────┬────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┬───────────┐
│ database │ schema │ name │ column_names │ column_types │ temporary │
│ varchar │ varchar │ varchar │ varchar[] │ varchar[] │ boolean │
├──────────┼──────────┼────────────┼────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼───────────┤
│ duckdb │ starbake │ order │ [customer_id, order_id, status, timestamp] │ [BIGINT, BIGINT, VARCHAR, TIMESTAMP] │ false │
│ duckdb │ starbake │ order_line │ [order_id, product_id, quantity, sale_price] │ [BIGINT, BIGINT, BIGINT, DOUBLE] │ false │
│ duckdb │ starbake │ product │ [category, cost, description, name, price, product_id] │ [VARCHAR, DOUBLE, VARCHAR, VARCHAR, DOUBLE, BIGINT] │ false │
└──────────┴──────────┴────────────┴────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┴───────────┘
D .quit
$
Extract Data
Extracting Data involves the following steps:
- Configure the connection to the source database
- Select the database schema to extract
- Select the tables to extract
- Select the columns to extract (optional)
- Select the rows to extract (optional)
Configure the connection to the source database
In the project folder, under the metadata
folder, edit the file application.sl.yml and set the connection parameters to the source database.
application:
...
connections:
duckdb:
type: "jdbc" # Connection to DuckDB
options:
url: "jdbc:duckdb:{{SL_ROOT}}/datasets/duckdb.db" # Location of the DuckDB database
driver: "org.duckdb.DuckDBDriver"
...
In the metadata/extract folder, create a new file my_extract_config.sl.yml
and set the connectionRef to the connection defined in the application.sl.yml file.
extract:
connectionRef: "duckdb" # The database connection to use
jdbcSchemas:
- schema: "starbake"
tables:
- name: "*" # table names or "*" to extract all tables
tableTypes: # (optional) table types to extract
- "TABLE"
#- "VIEW"
#- "SYSTEM TABLE"
#- "GLOBAL TEMPORARY"
#- "LOCAL TEMPORARY"
#- "ALIAS"
#- "SYNONYM"
That's it! We are ready to extract the data from the database.
Extract the data
$ cd $SL_ROOT
$ starlake extract-data --config my_extract_config --outputDir $SL_ROOT/incoming/starbake
The $SL_ROOT/incoming/starbake
folder will contain the extracted data in CSV format.
You are now ready to load the data into the datawarehouse of your choice using the starlake load command.
Extract Schema (Optional)
The command below generates a table schema description file in the metadata/load directory and allows you to load the data into the datawarehouse of your choice, using starlake load command instead of going through the infer-schema feature starlake provides.
$ cd $SL_ROOT
$ starlake.sh extract-schema --config my_extract_config.sl.yml # extract description