Quack on Demand and Starflow: open-source data infrastructure

Quack on Demand
Open-source DuckDB fleet control plane · Apache-2.0

Autoscale DuckDB fleets on demand.

Runs fleets of DuckDB Quack nodes on your own Kubernetes. Per-tenant isolation, table-level ACLs, federated queries. Query it from any ODBC/JDBC/ADBC client. Works with any ETL.

$ kubectl apply -k github.com/starlake-ai/quack-on-demand
# control plane up · TLS on · admin UI at /ui/
no telemetry · no upsell
Starflow
Open-source ELT · Apache-2.0

Declarative data pipelines as YAML.

Ingest, transform, and orchestrate with native SQL on DuckDB, BigQuery, Snowflake, and Redshift. CLI-first. Runs anywhere. Or let Starflow's AI generate the config from plain English.

$ starlake transform --name sales.revenue
$ starlake lineage --output dot
no lock-in · no glue code

Quack on Demand · the serving layer

A community implementation of multi-tenant FlightSQL over DuckDB.

DuckDB ships a minimal HTTP endpoint on localhost and recommends a reverse proxy in front of it. QoD is that proxy, built on the open Arrow Flight SQL standard, with the multi-tenancy, identity, and observability you need to expose it safely. No proprietary protocol extensions. Trivially removable.

What you get

Everything DuckDB Quack is missing in production.

DuckDB ships Quack as a minimal HTTP endpoint on localhost with a random token, and explicitly recommends putting a reverse proxy in front of it. Quack on Demand is that proxy, with the multi-tenancy, identity, and observability you need to actually expose it.

🛫

Arrow FlightSQL edge

Zero-copy result streaming over Apache Arrow Flight SQL. Results travel as columnar batches rather than row by row, which is typically much faster than JDBC for large analytical result sets. TLS is on by default, and a self-signed cert is generated on first boot.

🏢

Multi-tenant pools

Spin up tenants and pools of DuckDB Quack nodes on demand. Each node is READONLY, WRITEONLY, or DUAL - the router classifies every statement and picks a compatible target.
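
As a rough illustration of that routing rule, here is a minimal Python sketch. It is not QoD's actual router: the keyword list, the `Node` fields, and the least-in-flight tie-break are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical list of leading keywords treated as reads. A real classifier
# would parse the statement (for example, WITH ... INSERT is a write).
READ_KEYWORDS = {"SELECT", "WITH", "SHOW", "DESCRIBE", "EXPLAIN", "VALUES"}

# Which node modes may serve each kind of statement.
COMPATIBLE = {
    "read": {"READONLY", "DUAL"},
    "write": {"WRITEONLY", "DUAL"},
}


@dataclass
class Node:
    name: str
    mode: str          # "READONLY", "WRITEONLY", or "DUAL"
    in_flight: int = 0


def classify(sql: str) -> str:
    """Return "read" or "write" from the statement's first keyword."""
    # Drop leading whitespace, "-- line" comments, and "/* block */" comments.
    stripped = re.sub(r"^\s*(--[^\n]*\n\s*|/\*.*?\*/\s*)*", "", sql, flags=re.S)
    match = re.match(r"[A-Za-z]+", stripped)
    keyword = match.group(0).upper() if match else ""
    return "read" if keyword in READ_KEYWORDS else "write"


def pick_node(sql: str, nodes: list) -> Node:
    """Pick the compatible node with the fewest in-flight requests."""
    kind = classify(sql)
    candidates = [n for n in nodes if n.mode in COMPATIBLE[kind]]
    if not candidates:
        raise LookupError(f"no node can serve a {kind} statement")
    return min(candidates, key=lambda n: n.in_flight)
```

With one node of each mode, a SELECT can land on the READONLY or DUAL node, and an INSERT on the WRITEONLY or DUAL node.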

🔐

Pluggable authentication

Database (bcrypt-hashed JDBC), external JWT (HS256/RS256/PEM), and OIDC providers - Keycloak (with ROPC), Google, Azure AD, AWS Cognito. Mix and match per deployment.

🛡️

Postgres-relational ACL

Grants live in slkstate_acl_grant alongside DuckLake metadata. Principals expand to user / group / role at validation time so grants match whichever level of identity is stable.
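
To make the principal expansion concrete, here is a small Python sketch. The `user:` / `group:` / `role:` prefixes, the `Grant` fields, and the privilege names are hypothetical; the real schema of slkstate_acl_grant is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    principal: str   # hypothetical format: "user:alice", "group:analysts", "role:admin"
    table: str
    privilege: str   # hypothetical privilege name, e.g. "SELECT"


@dataclass
class Identity:
    user: str
    groups: tuple = ()
    roles: tuple = ()

    def principals(self) -> set:
        """Expand the identity into every principal a grant could name."""
        expanded = {f"user:{self.user}"}
        expanded.update(f"group:{g}" for g in self.groups)
        expanded.update(f"role:{r}" for r in self.roles)
        return expanded


def is_allowed(identity: Identity, table: str, privilege: str, grants: list) -> bool:
    """True if any grant for this table and privilege matches an expanded principal."""
    principals = identity.principals()
    return any(
        g.principal in principals and g.table == table and g.privilege == privilege
        for g in grants
    )
```

Because the check runs against the expanded set, a grant written for a group keeps working when individual users join or leave that group.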

📊

Live admin console

React dashboard at /ui/ - tenant + pool CRUD, per-tenant ACL editor, live node metrics (inFlight, totalServed, EWMA latency), admin-role gated.
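
For readers unfamiliar with the EWMA latency metric, the sketch below shows the standard exponentially weighted moving average update in Python. The smoothing factor `alpha` and the class name are assumptions; the value QoD uses is not stated here.

```python
class EwmaLatency:
    """Exponentially weighted moving average of request latency."""

    def __init__(self, alpha: float = 0.2):
        # alpha is an assumed smoothing factor: higher values react faster
        # to recent samples, lower values smooth out spikes.
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None

    def observe(self, latency_ms: float) -> float:
        """Fold one latency sample into the running average and return it."""
        if self.value is None:
            self.value = latency_ms   # first sample seeds the average
        else:
            self.value = self.alpha * latency_ms + (1 - self.alpha) * self.value
        return self.value
```

Only one number per node has to be stored, which is why this kind of average suits a live dashboard.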

🦺

Self-healing on restart

Dead Quack child processes are detected (PID + port probe) and respawned automatically before the edge accepts traffic. Manager restarts no longer strand the fleet.
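
A PID + port probe of this kind could look roughly like the Python sketch below. It assumes a POSIX host and a TCP connect as the port check; the function names are illustrative and not taken from the QoD codebase.

```python
import os
import socket


def pid_alive(pid: int) -> bool:
    """POSIX check: signal 0 tests for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def needs_respawn(pid: int, host: str, port: int) -> bool:
    """A node counts as healthy only if its process exists and its port answers."""
    return not (pid_alive(pid) and port_open(host, port))
```

Checking both signals catches two different failures: a process that died, and a process that is still running but no longer listening.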

Deployment

Single uber-jar

REST + React UI + FlightSQL edge in one process. State lives in Postgres next to DuckLake - no extra moving parts.

Configuration

Every key is overridable

Every scalar in application.conf accepts a matching SL_QUACK_* env-var. Build the image once, flip behavior per environment.
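
The override pattern can be sketched in Python as follows. The naming rule (uppercase the key, turn punctuation into underscores) and the keys `tls.enabled` and `port-range.start` are assumptions for illustration; check application.conf for the real key names.

```python
import os
import re


def env_name(key: str, prefix: str = "SL_QUACK_") -> str:
    """Hypothetical naming rule: "port-range.start" -> "SL_QUACK_PORT_RANGE_START"."""
    return prefix + re.sub(r"[^A-Za-z0-9]+", "_", key).upper()


def resolve(key: str, defaults: dict, env=None):
    """Return the env-var override for key if set, else the file default."""
    env = os.environ if env is None else env
    default = defaults[key]
    raw = env.get(env_name(key))
    if raw is None:
        return default
    # Coerce the string from the environment to the default's type.
    # bool is tested first because bool is a subclass of int in Python.
    if isinstance(default, bool):
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return type(default)(raw)
```

The same image can then run with TLS on in production and off in a local test, with no rebuild.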

Runtime

Local or Kubernetes

Local mode spawns Quack child processes on a port range. Kubernetes mode runs them as pods. Same control plane, same admin UI.

Query federation

One SQL surface over every source.

Quack on Demand turns DuckDB's federation into a governed, multi-tenant service. Point a query at your lake, your Iceberg warehouse, and your operational databases at once, and join them in place without copying the data first.

1

Query data where it lives

Object storage, Iceberg tables, and operational databases are read in place. No copies, no nightly ETL, no second engine.

2

Join across sources in one statement

A single SQL query spans Parquet on S3, an Iceberg warehouse, and a Postgres table - DuckDB pushes down filters and streams results back over Arrow.

3

Governed like everything else

Every federated table reference passes the same per-statement RBAC and table-ref policy checks - federation never bypasses your ACLs.

Parquet & CSV · Apache Iceberg · DuckLake catalogs · PostgreSQL · MySQL · S3 / GCS / Azure Blob
-- One statement, three sources, zero copies
SELECT o.region,
       c.segment,
       SUM(o.amount) AS revenue
FROM   read_parquet('s3://lake/orders/*.parquet')  o
JOIN   postgres_scan('dbname=crm','public','customers') c
         ON c.id = o.customer_id
JOIN   iceberg_scan('s3://warehouse/products')     p
         ON p.sku = o.sku
GROUP BY 1, 2;
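
To show what a per-statement check on federated references might involve, here is a simplified Python sketch. It is an assumption-laden illustration: it finds scan calls with a regular expression where a real policy engine would use a SQL parser, and the allow-list shape is invented for the example.

```python
import re

# Table functions treated as federated sources in this sketch.
SCAN_CALL = re.compile(
    r"\b(read_parquet|read_csv|postgres_scan|iceberg_scan)\s*\(([^)]*)\)",
    re.IGNORECASE,
)


def federated_refs(sql: str) -> list:
    """Return (function, string-literal arguments) for each scan call found."""
    refs = []
    for func, args in SCAN_CALL.findall(sql):
        literals = tuple(re.findall(r"'([^']*)'", args))
        refs.append((func.lower(), literals))
    return refs


def denied_refs(sql: str, allowed: set) -> list:
    """References in the statement that the caller's allow-list does not cover."""
    return [ref for ref in federated_refs(sql) if ref not in allowed]
```

A statement would be rejected whenever `denied_refs` returns a non-empty list, so one ungranted source blocks the whole join.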

Bring your own pipeline

Works with any ETL, however you built your data.

QoD is a serving layer. It sits downstream of whatever produced your data. It does not care which tool wrote it, and it never routes your users anywhere.

dbt

Materialize models to DuckDB/DuckLake, serve them with QoD.

QoD + dbt recipe →

Dagster

Land assets in DuckLake; QoD exposes them to clients.

QoD + Dagster recipe →

dlt

Pipe extracted data into DuckDB, query it through QoD.

QoD + dlt recipe →

SQLMesh

Build versioned models; serve the outputs over FlightSQL.

QoD + SQLMesh recipe →

Spark

Write Parquet/Iceberg; QoD federates and serves it.

QoD + Spark recipe →

Plain SQL

Point QoD at any DuckDB file or DuckLake catalog.

QoD + Plain SQL recipe →

They compose, optionally

Built with Starflow? Quack on Demand serves its DuckLake output.

Built with dbt, Dagster, dlt, Spark, or hand-written SQL? QoD serves that too.

Neither requires the other. Pick one, both, or neither.

Starflow · the engine

Declarative pipelines, not glue code.

Describe the outcome (the table, the transform, the schedule) in YAML and SQL. Starflow handles loading, write strategies (SCD2 / merge), schema enforcement, and dependency-aware orchestration across DuckDB, BigQuery, Snowflake, and Redshift.

Ingest

Declare a table, its load pattern, and a write strategy. Schema is enforced on the way in.

# customers.sl.yml
table:
  name: customers
  pattern: "customers-.*.csv"
  writeStrategy:
    type: UPSERT_BY_KEY
    key: [id]
  attributes:
    - { name: id, type: string, required: true }
    - { name: email, type: email }

Transform

Plain SQL SELECTs become governed, dependency-aware models. Native SQL on every engine.

-- sales/revenue.sql
SELECT
  o.region,
  SUM(o.amount) AS revenue
FROM orders o
GROUP BY 1

Orchestrate

Starflow infers model dependencies and generates Airflow or Dagster DAGs. No hand-wired graphs.

$ starlake dag-generate
$ starlake dag-deploy --target airflow
# DAG inferred from model lineage
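
The idea behind dependency inference can be sketched in a few lines of Python: read each model's SQL, find the tables it selects from, and topologically sort the result. This is a simplified illustration with a regex where a real tool would use a SQL parser, and the model names other than sales.revenue are hypothetical.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Naive reference finder: the identifier right after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def dependencies(models: dict) -> dict:
    """Map each model name to the set of other models its SQL reads from."""
    graph = {}
    for name, sql in models.items():
        refs = set(TABLE_REF.findall(sql))
        # Keep only references to other models; raw source tables are not nodes.
        graph[name] = {ref for ref in refs if ref in models and ref != name}
    return graph


def run_order(models: dict) -> list:
    """Order the models so that each one runs after everything it depends on."""
    return list(TopologicalSorter(dependencies(models)).static_order())


MODELS = {
    "sales.top_regions": "SELECT * FROM sales.revenue ORDER BY revenue DESC LIMIT 5",
    "sales.revenue": "SELECT region, SUM(amount) AS revenue FROM sales.orders_clean GROUP BY 1",
    "sales.orders_clean": "SELECT * FROM raw_orders WHERE amount > 0",
}
```

Here `run_order(MODELS)` places sales.orders_clean first, then sales.revenue, then sales.top_regions, which is the order a generated DAG would need to respect.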

Two ways to drive it

Write the YAML yourself, or let Starflow's AI generate it.

Everything above is plain config you can author by hand and check into git. When you'd rather describe intent, Starflow (49 CLI skills and five expert AI personas for Claude Code) generates correct YAML and SQL, runs an adversarial review, and tells you the next step.

$ /load CSV files from gs://acme-landing/customers

→ wrote customers.sl.yml
  table: customers
  writeStrategy: UPSERT_BY_KEY [id]
  schema inferred · 12 attributes

Five AI personas cover the data lifecycle:

  • Lea · Data Analyst
  • Winston · Data Architect
  • Amelia · Data Engineer
  • Quinn · Data Quality
  • Max · Platform Engineer

Runs on your stack: the orchestrators and warehouses Starflow targets.

Airflow · Dagster · BigQuery · Snowflake · Redshift · Databricks · DuckDB · Elastic · MySQL · PostgreSQL

Open governance

Safe to depend on. Safe to recommend.

Stays Apache-2.0

Both projects are Apache-2.0 and will stay that way. No relicensing to BSL or SSPL. A public, standing commitment.

No telemetry by default

Neither project phones home. No usage tracking, no contact capture, no visibility into your users.

Open contribution

DCO-signed pull requests, transparent review, issues triaged in public on GitHub.

Public roadmap

Direction is decided in the open. Built on neutral, vendor-agnostic standards (Arrow Flight SQL, DuckLake).

FAQ

Frequently asked questions

Still have questions? Email us at [email protected]