Data Lakes come in a broad variety of flavors. AWS, Azure, Google Cloud, Snowflake, Databricks, etc. all have their specialties, strengths, and weaknesses. What they have in common is that most, if not all, of them use Object Storage services such as Amazon S3 or Cloudflare R2 for storing their vast amounts of data. Accessing those Data Lakes can involve pretty complicated IAM setups, OAuth, OIDC, and other technologies in this space.
But if we take a step back from all the complexity: What’s the most lightweight and flexible way to provide access to data in Data Lakes?
For example, if we’d like to share open public transportation data, with a focus on costs and ease of use? With an interface that has existed for more than 50 years?
Introducing: DuckDB databases as Data Lake access layer!
Jokes aside, I’m not the first one to ever suggest or use this, but I’ve hardly seen any written explanation of it yet, and I also got some questions from the community about it (looking at you, David!), so I thought it’d make sense to write a short blog post about this technique.
What are our goals?
We want to query the data with SQL
We want the data to be accessible via HTTP(S)
We want to download as little data as possible when querying it
We want (or already have) our data stored in Object Storage services like S3 or R2
We don’t need authentication and authorization to access the data (yet)
How does this work?
The basis of the approach is that we store our datasets on Object Storage. For example, you can store a table’s data as a single Parquet file, as multiple Parquet files under a prefix with a naming pattern, or even as Hive-partitioned data. DuckDB can read (and write) them all. The latter two work only via the S3-compatible API though, which requires some form of credentials, so to keep our example simple, we focus on a single Parquet file per table.
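To make this concrete, here is a small sketch of what the three layout options look like from DuckDB’s perspective (bucket and paths are hypothetical placeholders):

-- 1) A single Parquet file per table, readable via plain HTTP(S):
SELECT * FROM read_parquet('https://your-bucket.example.com/stops.parquet');

-- 2) Multiple Parquet files matching a naming pattern, via the S3-compatible API:
SELECT * FROM read_parquet('s3://your-bucket/stops/part-*.parquet');

-- 3) Hive-partitioned data, also via the S3-compatible API:
SELECT * FROM read_parquet('s3://your-bucket/stops/year=*/month=*/*.parquet', hive_partitioning = true);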
The other puzzle pieces are that DuckDB database files can contain view definitions that reference remote Parquet files, and that such database files can be shared and attached by remote clients via HTTP(S) or S3.
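In a nutshell, the technique consists of two building blocks, shown here as a sketch with the URLs we’ll build up to later in this post:

-- Publishing side: define views over remote Parquet files inside a
-- small DuckDB database file, and upload that file to Object Storage too.
CREATE VIEW agency AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/agency.parquet');

-- Consuming side: attach the remote database file and query its views.
ATTACH 'https://data.openrailway.dev/providers/gtfs-de/full/database.duckdb' AS gtfs (READ_ONLY);
SELECT * FROM gtfs.agency LIMIT 10;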
From a high-level perspective, our solution looks like this:
How can this be implemented?
The process to create such a "view database" consists of four main steps:
Exporting the data from a pre-made DuckDB database as Parquet files
Uploading the exported Parquet files to an Object Storage service
Creating the database file containing the views to the Object Storage files
Uploading the new database file to Object Storage as well, to be able to use the ATTACH statement of DuckDB
Example Use Case
I recently published another blog post about Handling GTFS data with DuckDB. The accompanying GitHub repository can be found at tobilg/duckdb-gtfs.
Preparation
As preparation, you’ll need to perform the steps outlined in the Usage chapter of the README. Following these instructions will generate a 1.4GB uncompressed DuckDB database file on your local machine, after downloading the example dataset’s raw data and loading it into the default GTFS table schemas.
Export single Parquet files
The database can be exported as one Parquet file per table with a simple command (also available as bash script):
duckdb exported-data/providers/gtfs-de/full/data.duckdb -c "EXPORT DATABASE 'exported-data/providers/gtfs-de/full' (FORMAT parquet, COMPRESSION zstd);"

The result looks like this:
$ tree exported-data
exported-data
└── providers
    └── gtfs-de
        └── full
            ├── agency.parquet
            ├── areas.parquet
            ├── attributions.parquet
            ├── booking_rules.parquet
            ├── calendar.parquet
            ├── calendar_dates.parquet
            ├── fare_attributes.parquet
            ├── fare_leg_join_rules.parquet
            ├── fare_leg_rules.parquet
            ├── fare_media.parquet
            ├── fare_products.parquet
            ├── fare_rules.parquet
            ├── fare_transfer_rules.parquet
            ├── feed_info.parquet
            ├── frequencies.parquet
            ├── levels.parquet
            ├── load.sql
            ├── location_group_stops.parquet
            ├── location_groups.parquet
            ├── networks.parquet
            ├── pathways.parquet
            ├── rider_categories.parquet
            ├── route_networks.parquet
            ├── routes.parquet
            ├── schema.sql
            ├── shapes.parquet
            ├── stop_areas.parquet
            ├── stop_times.parquet
            ├── stops.parquet
            ├── timeframes.parquet
            ├── transfers.parquet
            ├── translations.parquet
            └── trips.parquet

Not only have the Parquet files been produced, but also the load.sql and schema.sql scripts. We don’t need those for our use case though, so we can either ignore or delete them.
Upload Parquet files to Object Storage
The next step is to upload the newly created Parquet files to your Object Storage provider. In this example, I have already created a new R2 Bucket and assigned a custom domain to it (data.openrailway.dev), which is possible with a few clicks because the domain itself is hosted on Cloudflare as well.
💡
R2 offers free egress, whereas S3, for example, charges $0.09 per GB in us-east-1. For storage, S3 charges $0.023 / GB / month, whereas R2 charges $0.015 / GB / month. Egress can incur large costs if your data is used a lot: serving a 1 GB file 1,000 times would already cost around $90 in S3 egress alone, but nothing on R2 (reminder: our use case is providing public transportation datasets for analysis, so we don’t really know how many people will use them, and how). I’d recommend making an explicit decision here based on your use case.
If you want to enable browser-based clients as well, make sure that you also set up a CORS policy. With R2, this can be configured via API or via the Dashboard and looks like this:
[ { "AllowedOrigins": [ "http://localhost:3000", "http://localhost:5174", "https://sql-workbench.com" ], "AllowedMethods": [ "GET", "HEAD" ], "AllowedHeaders": [ "*" ], "ExposeHeaders": [ "ETag" ] } ]This enables local development, and also SQL Workbench as browser-based query interface (more of this later). Add your own domains appropriately.
Once you have prepared this, you can go on and upload the data to your Object Storage bucket, either with tools like rclone or directly via the Dashboard of R2 or the S3 Console.
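If you’d rather stay within DuckDB for the upload instead of using a separate tool, a sketch using the httpfs extension and DuckDB’s Secrets Manager could look like this (the key ID, secret, account ID, and bucket name are placeholders you’d need to replace; note that this rewrites the Parquet files rather than copying them byte-for-byte):

INSTALL httpfs;
LOAD httpfs;

-- R2 is S3-compatible; point an S3 secret at your account's R2 endpoint.
-- All credential values below are placeholders.
CREATE SECRET r2_upload (
    TYPE s3,
    KEY_ID 'YOUR_R2_ACCESS_KEY_ID',
    SECRET 'YOUR_R2_SECRET_ACCESS_KEY',
    ENDPOINT 'YOUR_ACCOUNT_ID.r2.cloudflarestorage.com',
    URL_STYLE 'path'
);

-- Copy one exported Parquet file into the bucket (repeat per table):
COPY (FROM read_parquet('exported-data/providers/gtfs-de/full/agency.parquet'))
TO 's3://your-bucket/providers/gtfs-de/full/agency.parquet'
(FORMAT parquet, COMPRESSION zstd);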
Creating the “view database”
As the data is now available via HTTP(S), we can create a new DuckDB database that contains the views referencing the data, namely the Parquet files on Object Storage. This leverages the read_parquet() function, with which you can load remote data via HTTP(S).
An example for our use case (try it with SQL Workbench):
SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/agency.parquet');

The output will look like this:
We now prepare a SQL script that will create a view for each remote Parquet file:
CREATE VIEW agency AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/agency.parquet');
CREATE VIEW areas AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/areas.parquet');
CREATE VIEW attributions AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/attributions.parquet');
CREATE VIEW booking_rules AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/booking_rules.parquet');
CREATE VIEW calendar AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/calendar.parquet');
CREATE VIEW calendar_dates AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/calendar_dates.parquet');
CREATE VIEW fare_leg_join_rules AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/fare_leg_join_rules.parquet');
CREATE VIEW fare_leg_rules AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/fare_leg_rules.parquet');
CREATE VIEW fare_media AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/fare_media.parquet');
CREATE VIEW fare_products AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/fare_products.parquet');
CREATE VIEW fare_transfer_rules AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/fare_transfer_rules.parquet');
CREATE VIEW feed_info AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/feed_info.parquet');
CREATE VIEW frequencies AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/frequencies.parquet');
CREATE VIEW levels AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/levels.parquet');
CREATE VIEW location_groups AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/location_groups.parquet');
CREATE VIEW networks AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/networks.parquet');
CREATE VIEW rider_categories AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/rider_categories.parquet');
CREATE VIEW shapes AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/shapes.parquet');
CREATE VIEW timeframes AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/timeframes.parquet');
CREATE VIEW translations AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/translations.parquet');
CREATE VIEW fare_attributes AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/fare_attributes.parquet');
CREATE VIEW routes AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/routes.parquet');
CREATE VIEW route_networks AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/route_networks.parquet');
CREATE VIEW stops AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/stops.parquet');
CREATE VIEW stop_areas AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/stop_areas.parquet');
CREATE VIEW trips AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/trips.parquet');
CREATE VIEW fare_rules AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/fare_rules.parquet');
CREATE VIEW location_group_stops AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/location_group_stops.parquet');
CREATE VIEW pathways AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/pathways.parquet');
CREATE VIEW stop_times AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/stop_times.parquet');
CREATE VIEW transfers AS SELECT * FROM read_parquet('https://data.openrailway.dev/providers/gtfs-de/full/transfers.parquet');

To create the actual “view database”, we can use the following command:
duckdb exported-data/providers/gtfs-de/full/database.duckdb < queries/providers/gtfs-de/full/create_view_database.sql

This will create a new database in the exported-data/providers/gtfs-de/full directory named database.duckdb.
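By the way, writing all 31 CREATE VIEW statements by hand isn’t necessary. A small sketch that generates them from the source database via DuckDB’s duckdb_tables() catalog function (redirecting the output into create_view_database.sql is left to you) could look like this:

-- Attach the source database and emit one CREATE VIEW statement per table.
ATTACH 'exported-data/providers/gtfs-de/full/data.duckdb' AS src (READ_ONLY);

SELECT 'CREATE VIEW ' || table_name ||
       ' AS SELECT * FROM read_parquet(''https://data.openrailway.dev/providers/gtfs-de/full/' ||
       table_name || '.parquet'');' AS ddl
FROM duckdb_tables()
WHERE database_name = 'src'
ORDER BY table_name;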
We can now inspect its size:
$ ls -la exported-data/providers/gtfs-de/full/*.duckdb
-rw-r--r--@ 1 tmueller staff 274432 18 Mai 14:32 exported-data/providers/gtfs-de/full/database.duckdb

This means the “view database” is ONLY 268 KiB in size! That makes sense, because it stores only the view definitions, not the data itself.
Testing the “view database”
We can now test if it works locally by doing a SHOW TABLES:
$ duckdb exported-data/providers/gtfs-de/full/database.duckdb
v1.2.2 7c039464e4
Enter ".help" for usage hints.
D show tables;
┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ agency               │
│ areas                │
│ attributions         │
│ booking_rules        │
│ calendar             │
│ calendar_dates       │
│ fare_attributes      │
│ fare_leg_join_rules  │
│ fare_leg_rules       │
│ fare_media           │
│ fare_products        │
│ fare_rules           │
│ fare_transfer_rules  │
│ feed_info            │
│ frequencies          │
│ levels               │
│ location_group_stops │
│ location_groups      │
│ networks             │
│ pathways             │
│ rider_categories     │
│ route_networks       │
│ routes               │
│ shapes               │
│ stop_areas           │
│ stop_times           │
│ stops                │
│ timeframes           │
│ transfers            │
│ translations         │
│ trips                │
├──────────────────────┤
│       31 rows        │
└──────────────────────┘

Also, we can run some test queries on the remote data, and show the time they take:
$ duckdb exported-data/providers/gtfs-de/full/database.duckdb
v1.2.2 7c039464e4
Enter ".help" for usage hints.
D .timer on
D SELECT count(*) FROM trips;
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    1630671     │
│ (1.63 million) │
└────────────────┘
Run Time (s): real 0.554 user 0.175864 sys 0.089233
D SELECT count(*) FROM stops;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    678388    │
└──────────────┘
Run Time (s): real 0.336 user 0.006173 sys 0.002544
D SELECT count(*) FROM stop_times;
┌─────────────────┐
│  count_star()   │
│      int64      │
├─────────────────┤
│    32228711     │
│ (32.23 million) │
└─────────────────┘
Run Time (s): real 0.367 user 0.051620 sys 0.009298
D SELECT * FROM stops LIMIT 10;
┌─────────┬───────────┬─────────────────────┬───────────────┬───────────┬───────────┬───┬──────────┬───────────────┬────────────────┬───────────────┬─────────────────────┬──────────┬───────────────┐
│ stop_id │ stop_code │      stop_name      │ tts_stop_name │ stop_desc │ stop_lat  │ … │ stop_url │ location_type │ parent_station │ stop_timezone │ wheelchair_boarding │ level_id │ platform_code │
│ varchar │  varchar  │       varchar       │    varchar    │  varchar  │  double   │   │ varchar  │     int32     │    varchar     │    varchar    │        int32        │ varchar  │    varchar    │
├─────────┼───────────┼─────────────────────┼───────────────┼───────────┼───────────┼───┼──────────┼───────────────┼────────────────┼───────────────┼─────────────────────┼──────────┼───────────────┤
│ 199269  │ NULL      │ 's-Heerenberg Goud… │ NULL          │ NULL      │ 51.87225  │ … │ NULL     │ 1             │ NULL           │ NULL          │ NULL                │ NULL     │ NULL          │
│ 358723  │ NULL      │ 's-Heerenberg Goud… │ NULL          │ NULL      │ 51.87228  │ … │ NULL     │ NULL          │ 199269         │ NULL          │ NULL                │ NULL     │ NULL          │
│ 536901  │ NULL      │ 's-Heerenberg Mole… │ NULL          │ NULL      │ 51.87632  │ … │ NULL     │ 1             │ NULL           │ NULL          │ NULL                │ NULL     │ NULL          │
│ 532065  │ NULL      │ 's-Heerenberg Mole… │ NULL          │ NULL      │ 51.87632  │ … │ NULL     │ NULL          │ 536901         │ NULL          │ NULL                │ NULL     │ NULL          │
│ 666544  │ NULL      │ 's-Heerenberg Muzi… │ NULL          │ NULL      │ 51.87479  │ … │ NULL     │ 1             │ NULL           │ NULL          │ NULL                │ NULL     │ NULL          │
│ 269106  │ NULL      │ 's-Heerenberg Muzi… │ NULL          │ NULL      │ 51.874844 │ … │ NULL     │ NULL          │ 666544         │ NULL          │ NULL                │ NULL     │ NULL          │
│ 540312  │ NULL      │ 's-Heerenberg Muzi… │ NULL          │ NULL      │ 51.874737 │ … │ NULL     │ NULL          │ 666544         │ NULL          │ NULL                │ NULL     │ NULL          │
│ 615572  │ NULL      │ 's-Hertogenbosch    │ NULL          │ NULL      │ 51.69054  │ … │ NULL     │ 1             │ NULL           │ NULL          │ NULL                │ NULL     │ NULL          │
│ 525421  │ NULL      │ 's-Hertogenbosch    │ NULL          │ NULL      │ 51.69054  │ … │ NULL     │ NULL          │ 615572         │ NULL          │ NULL                │ NULL     │ NULL          │
│ 198290  │ NULL      │ 12 Apostel          │ NULL          │ NULL      │ 52.052963 │ … │ NULL     │ NULL          │ 655876         │ NULL          │ NULL                │ NULL     │ NULL          │
├─────────┴───────────┴─────────────────────┴───────────────┴───────────┴───────────┴───┴──────────┴───────────────┴────────────────┴───────────────┴─────────────────────┴──────────┴───────────────┤
│ 10 rows                                                                                                                                                                      15 columns (13 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.909 user 0.214754 sys 0.023443

As you might have noticed, we counted the records of a remote file with more than 32 million rows in less than 0.4 seconds! This is possible because DuckDB can answer count(*) queries from the row-group metadata in the Parquet footer, instead of scanning the whole file.
We can also switch on the enable_http_logging flag to see the HTTP(S) requests performed by DuckDB when querying the remote files:
D SET enable_http_logging = true;
Run Time (s): real 0.001 user 0.000300 sys 0.000365
D SELECT * FROM stops LIMIT 10;
HTTP Request: HEAD /providers/gtfs-de/full/stops.parquet
 Accept: */*
 Host: data.openrailway.dev
 User-Agent: cpp-httplib/0.14.3
HTTP Response: 200 OK
 Accept-Ranges: bytes
 alt-svc: h3=":443"; ma=86400
 cf-cache-status: DYNAMIC
 CF-RAY: 9423a96b0dd4d38c-FRA
 Connection: keep-alive
 Content-Length: 14580912
 Date: Sun, 18 May 2025 12:40:15 GMT
 ETag: "5d654fe45888665fcaff7f2831180088"
 Last-Modified: Sun, 18 May 2025 10:42:40 GMT
 NEL: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
 Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=8J9y/6mVjzeKKD/RHRLvPc53+xxuxj+n8m6FugMHTbB0N8cD0l18HfCieD6GeWrGlFB1vCMf7PnzSxj8oCYJYgsQHvcJCxnw5OEIkgv9C4FWDrkv1jgTKj3N4IfoIwJGLMqAMAQ+LaEssRo5FN9DPSmWsg=="}],"group":"cf-nel","max_age":604800}
 Server: cloudflare
 server-timing: cfL4;desc="?proto=TCP&rtt=9270&min_rtt=8571&rtt_var=3023&sent=6&recv=7&lost=0&retrans=0&sent_bytes=2869&recv_bytes=556&delivery_rate=469577&cwnd=33&unsent_bytes=0&cid=6800b4b191d1dbe2&ts=165&x=0"
 Vary: Accept-Encoding
HTTP Request: GET /providers/gtfs-de/full/stops.parquet
 Accept: */*
 Host: data.openrailway.dev
 Range: bytes=14580904-14580911
 User-Agent: cpp-httplib/0.14.3
HTTP Response: 206 Partial Content
 Accept-Ranges: bytes
 alt-svc: h3=":443"; ma=86400
 cf-cache-status: DYNAMIC
 CF-RAY: 9423a96bbf92d38c-FRA
 Connection: keep-alive
 Content-Length: 8
 Content-Range: bytes 14580904-14580911/14580912
 Date: Sun, 18 May 2025 12:40:16 GMT
 ETag: "5d654fe45888665fcaff7f2831180088"
 Last-Modified: Sun, 18 May 2025 10:42:40 GMT
 NEL: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
 Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=095gOTFBeOfulP4iVVnDyvqk/FLW3exKsuA0e2t3/kNqb2QNnBtswCgTkKoGlL3F8qnNYZjdlljl2wTDb1+KWcrhAYrVAC5sUOHEnjkVzYixd1OQq/WFlHh3/u9HRTqo+UbBxBJKXIS+yxJHwlIHZDzFXQ=="}],"group":"cf-nel","max_age":604800}
 Server: cloudflare
 server-timing: cfL4;desc="?proto=TCP&rtt=9152&min_rtt=8571&rtt_var=1901&sent=9&recv=10&lost=0&retrans=0&sent_bytes=4253&recv_bytes=737&delivery_rate=469577&cwnd=35&unsent_bytes=0&cid=6800b4b191d1dbe2&ts=245&x=0"
 Vary: Accept-Encoding
// SNIP //
┌─────────┬───────────┬─────────────────────┬───────────────┬───────────┬───────────┬───┬──────────┬───────────────┬────────────────┬───────────────┬─────────────────────┬──────────┬───────────────┐
│ stop_id │ stop_code │      stop_name      │ tts_stop_name │ stop_desc │ stop_lat  │ … │ stop_url │ location_type │ parent_station │ stop_timezone │ wheelchair_boarding │ level_id │ platform_code │
│ varchar │  varchar  │       varchar       │    varchar    │  varchar  │  double   │   │ varchar  │     int32     │    varchar     │    varchar    │        int32        │ varchar  │    varchar    │
├─────────┼───────────┼─────────────────────┼───────────────┼───────────┼───────────┼───┼──────────┼───────────────┼────────────────┼───────────────┼─────────────────────┼──────────┼───────────────┤
│ 199269  │ NULL      │ 's-Heerenberg Goud… │ NULL          │ NULL      │ 51.87225  │ … │ NULL     │ 1             │ NULL           │ NULL          │ NULL                │ NULL     │ NULL          │
│ 358723  │ NULL      │ 's-Heerenberg Goud… │ NULL          │ NULL      │ 51.87228  │ … │ NULL     │ NULL          │ 199269         │ NULL          │ NULL                │ NULL     │ NULL          │
│ 536901  │ NULL      │ 's-Heerenberg Mole… │ NULL          │ NULL      │ 51.87632  │ … │ NULL     │ 1             │ NULL           │ NULL          │ NULL                │ NULL     │ NULL          │
│ 532065  │ NULL      │ 's-Heerenberg Mole… │ NULL          │ NULL      │ 51.87632  │ … │ NULL     │ NULL          │ 536901         │ NULL          │ NULL                │ NULL     │ NULL          │
│ 666544  │ NULL      │ 's-Heerenberg Muzi… │ NULL          │ NULL      │ 51.87479  │ … │ NULL     │ 1             │ NULL           │ NULL          │ NULL                │ NULL     │ NULL          │
│ 269106  │ NULL      │ 's-Heerenberg Muzi… │ NULL          │ NULL      │ 51.874844 │ … │ NULL     │ NULL          │ 666544         │ NULL          │ NULL                │ NULL     │ NULL          │
│ 540312  │ NULL      │ 's-Heerenberg Muzi… │ NULL          │ NULL      │ 51.874737 │ … │ NULL     │ NULL          │ 666544         │ NULL          │ NULL                │ NULL     │ NULL          │
│ 615572  │ NULL      │ 's-Hertogenbosch    │ NULL          │ NULL      │ 51.69054  │ … │ NULL     │ 1             │ NULL           │ NULL          │ NULL                │ NULL     │ NULL          │
│ 525421  │ NULL      │ 's-Hertogenbosch    │ NULL          │ NULL      │ 51.69054  │ … │ NULL     │ NULL          │ 615572         │ NULL          │ NULL                │ NULL     │ NULL          │
│ 198290  │ NULL      │ 12 Apostel          │ NULL          │ NULL      │ 52.052963 │ … │ NULL     │ NULL          │ 655876         │ NULL          │ NULL                │ NULL     │ NULL          │
├─────────┴───────────┴─────────────────────┴───────────────┴───────────┴───────────┴───┴──────────┴───────────────┴────────────────┴───────────────┴─────────────────────┴──────────┴───────────────┤
│ 10 rows                                                                                                                                                                      15 columns (13 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.578 user 0.026387 sys 0.018349

We can see the range requests in action: first a HEAD request is sent, followed by multiple GET range requests (logs are truncated for brevity). Note how the first GET fetches just the last 8 bytes of the file, which is where Parquet stores the length of its footer metadata.
To get even more insights, we can prefix the query with EXPLAIN ANALYZE and have a look at how the query is executed, along with some stats:
D EXPLAIN ANALYZE SELECT * FROM stops LIMIT 10;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT * FROM stops LIMIT 10;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         HTTPFS HTTP Stats         ││
││                                   ││
││            in: 2.5 MiB            ││
││           out: 0 bytes            ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.490s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      STREAMING_LIMIT      │
│    ────────────────────   │
│          10 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│        TABLE_SCAN         │
│    ────────────────────   │
│         Function:         │
│       READ_PARQUET        │
│                           │
│       Projections:        │
│          stop_id          │
│         stop_code         │
│         stop_name         │
│       tts_stop_name       │
│         stop_desc         │
│         stop_lat          │
│         stop_lon          │
│          zone_id          │
│         stop_url          │
│       location_type       │
│      parent_station       │
│       stop_timezone       │
│    wheelchair_boarding    │
│         level_id          │
│       platform_code       │
│                           │
│         4096 Rows         │
│          (0.22s)          │
└───────────────────────────┘
Run Time (s): real 0.491 user 0.194019 sys 0.022184

The HTTPFS stats at the top show that only 2.5 MiB were downloaded to answer this query, a small fraction of the roughly 14.6 MB stops.parquet file.

Uploading the “view database” to Object Storage
As the final step, we need to upload the database file containing the views to the remote Parquet files to Object Storage as well, so that clients can use DuckDB’s ATTACH statement on it.
For our example, we’ll upload it to the same directory as the Parquet files, meaning that it’s available at:
https://data.openrailway.dev/providers/gtfs-de/full/database.duckdb
Using the DuckDB database
DuckDB is able to ATTACH remote databases via HTTP(S) or the S3 API, meaning that you can use it both in the local DuckDB CLI and in other tools, for example the in-browser SQL Workbench.
$ duckdb
v1.2.2 7c039464e4
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D ATTACH 'https://data.openrailway.dev/providers/gtfs-de/full/database.duckdb' as gtfs_de_full (READ_ONLY);
D USE gtfs_de_full;
D SHOW TABLES;
┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ agency               │
│ areas                │
│ attributions         │
│ booking_rules        │
│ calendar             │
│ calendar_dates       │
│ fare_attributes      │
│ fare_leg_join_rules  │
│ fare_leg_rules       │
│ fare_media           │
│ fare_products        │
│ fare_rules           │
│ fare_transfer_rules  │
│ feed_info            │
│ frequencies          │
│ levels               │
│ location_group_stops │
│ location_groups      │
│ networks             │
│ pathways             │
│ rider_categories     │
│ route_networks       │
│ routes               │
│ shapes               │
│ stop_areas           │
│ stop_times           │
│ stops                │
│ timeframes           │
│ transfers            │
│ translations         │
│ trips                │
├──────────────────────┤
│       31 rows        │
└──────────────────────┘
D SELECT count(*)::INT AS cnt FROM gtfs_de_full.stop_times;
┌─────────────────┐
│       cnt       │
│      int32      │
├─────────────────┤
│    32228711     │
│ (32.23 million) │
└─────────────────┘

To query the data in your browser, just click on this link to SQL Workbench. The result will look like this:
Summary
In this blog post, we showed that DuckDB database files containing views on static Parquet files are a very lightweight solution for providing a single, highly available entry point to datasets that need to be queryable with SQL.
This post intentionally doesn’t cover authentication and authorization; those topics will be covered in a subsequent blog post soon.