DuckLake delivers advanced data lake features without traditional lakehouse complexity by using Parquet files and your SQL database. It's an open, standalone format from the DuckDB team.
Deployment scenarios
DuckLake uses a database system to manage the catalog metadata. All you need to run your own data warehouse is a database system and storage for Parquet files. Choose a catalog database:
- PostgreSQL
- SQLite
- MySQL
- DuckDB
Clients
Multiple DuckLake clients can connect concurrently when PostgreSQL, MySQL, or SQLite serves as the catalog database. DuckDB also works as the catalog database, but in that case you are limited to a single client.
Catalog database
DuckLake can use any SQL system as its catalog database, provided that it supports ACID transactions and primary key constraints.
Storage
DuckLake stores your data as Parquet files on any object storage, such as AWS S3.
DuckLake’s key features
Data lake operations
DuckLake supports snapshots, time travel queries, schema evolution and partitioning.
Lightweight snapshots
You can have as many snapshots as you want without frequent compacting steps!
ACID transactions
DuckLake allows concurrent access with ACID transactional guarantees over multi-table operations.
Performance-oriented
DuckLake uses statistics for filter pushdown, enabling fast queries even on large datasets.
In Conversation: DuckDB Founders on DuckLake
Listen to Hannes Mühleisen and Mark Raasveldt walk through the history of data lakes and introduce DuckLake, a new lakehouse format.
Create your first DuckLake with DuckDB
DuckDB provides first-class support for DuckLake through its highly portable ducklake extension, which runs wherever DuckDB does. Pick a catalog database; a minimal getting-started example follows the list below.
- DuckDB
- SQLite
- PostgreSQL
- MySQL
Frequently asked questions
Answers to common questions to help you understand and make the most of DuckLake.
Why should I use DuckLake?
DuckLake provides a lightweight one-stop solution if you need a data lake and catalog.
You can use DuckLake for a “multiplayer DuckDB” setup with multiple DuckDB instances reading and writing the same dataset – a concurrency model not supported by vanilla DuckDB.
Even if you use DuckDB alone, for both your DuckLake entry point and your catalog database, you can still benefit from DuckLake: you can run time travel queries, exploit data partitioning, and store your data in multiple files instead of a single (potentially very large) database file.
What is DuckLake?
First of all, a catchy name for a DuckDB-originated technology for data lakes and lakehouses. More seriously, the term “DuckLake” can refer to three things:
- the specification of the DuckLake lakehouse format,
- the ducklake DuckDB extension, which supports reading/writing datasets in the DuckLake specification,
- a DuckLake, a dataset stored using the DuckLake lakehouse format.
What is the license of DuckLake?
The DuckLake specification and the DuckLake DuckDB extension are released under the MIT license.