Birds in the Cloud


I had never heard of birdwatching until the other day, when I stumbled upon the Cornell Lab of Ornithology's eBird project. It provides a dataset of bird observations worldwide: 1.5 billion records, updated monthly.

I immediately loaded this dataset into my visualization tool. The adsb.exposed website already hosts an air traffic dataset with over 130 billion records and a dataset of Foursquare places. There are certainly more birds than airplanes, but the bird observation dataset is a hundred times smaller, so it is easy to load and analyze.

The dataset is distributed as a 58 GB zip file, which contains a 440 GB CSV file.

wget 'https://hosted-datasets.gbif.org/eBird/2023-eBird-dwca-1.0.zip'

Analyzing local and external datasets is very convenient with clickhouse-local, a small command-line tool that provides a full ClickHouse engine.

curl https://clickhouse.com/ | sh
sudo ./clickhouse install # or run it as ./clickhouse without installation

ClickHouse can process files inside archives without decompressing them.

DESCRIBE file('2023-eBird-dwca-1.0.zip :: *.csv');
┌─name──────────────┬─type──────────────┐
 1. │ basisofrecord     │ Nullable(String)  │
 2. │ institutioncode   │ Nullable(String)  │
 3. │ collectioncode    │ Nullable(String)  │
 4. │ catalognumber     │ Nullable(String)  │
 5. │ occurrenceid      │ Nullable(String)  │
 6. │ recordedby        │ Nullable(String)  │
 7. │ year              │ Nullable(Int64)   │
 8. │ month             │ Nullable(Int64)   │
 9. │ day               │ Nullable(Int64)   │
10. │ publishingcountry │ Nullable(String)  │
11. │ country           │ Nullable(String)  │
12. │ stateprovince     │ Nullable(String)  │
13. │ county            │ Nullable(String)  │
14. │ decimallatitude   │ Nullable(Float64) │
15. │ decimallongitude  │ Nullable(Float64) │
16. │ locality          │ Nullable(String)  │
17. │ kingdom           │ Nullable(String)  │
18. │ phylum            │ Nullable(String)  │
19. │ class             │ Nullable(String)  │
20. │ order             │ Nullable(String)  │
21. │ family            │ Nullable(String)  │
22. │ genus             │ Nullable(String)  │
23. │ specificepithet   │ Nullable(String)  │
24. │ scientificname    │ Nullable(String)  │
25. │ vernacularname    │ Nullable(String)  │
26. │ taxonremarks      │ Nullable(String)  │
27. │ taxonconceptid    │ Nullable(String)  │
28. │ individualcount   │ Nullable(Int64)   │
└───────────────────┴───────────────────┘

The DESCRIBE query shows an automatically inferred schema for the data. The file table function reads files on the local filesystem, and the :: notation addresses files inside archives. All filenames support glob expansion with *, ?, **, {0..9}, and {abc,def}. For CSV files, ClickHouse automatically detects whether a header is present and uses it if so, and it also automatically infers the data types of columns. With all these conveniences, ClickHouse makes data processing seamless.
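As a quick illustration of these conveniences, the same file function accepts globs both inside and on the archive path itself. A small sketch (assuming the zip file sits in the working directory; adjust paths to your download location):

```sql
-- Count rows across every CSV matched by the glob inside the archive;
-- the ':: *.csv' part addresses files within the zip without extracting it.
SELECT count()
FROM file('2023-eBird-dwca-1.0.zip :: *.csv');

-- Globs also work on the archive name itself, e.g. across hypothetical yearly dumps:
-- SELECT count() FROM file('{2022,2023}-eBird-dwca-1.0.zip :: *.csv');
```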

SELECT * FROM file('2023-eBird-dwca-1.0.zip :: *.csv') LIMIT 1;
Row 1:
──────
basisofrecord:     HumanObservation
institutioncode:   CLO
collectioncode:    EBIRD_ARG
catalognumber:     OBS602415301
occurrenceid:      URN:catalog:CLO:EBIRD_ARG:OBS602415301
recordedby:        obsr904254
year:              1989
month:             4
day:               23
publishingcountry: AR
country:           Argentina
stateprovince:     Salta
county:            Anta
decimallatitude:   -24.7
decimallongitude:  -64.63333
locality:          PN El Rey
kingdom:           Animalia
phylum:            Chordata
class:             Aves
order:             Pelecaniformes
family:            Threskiornithidae
genus:             Theristicus
specificepithet:   caudatus
scientificname:    Theristicus caudatus
vernacularname:    Buff-necked Ibis
taxonremarks:      ᴺᵁᴸᴸ
taxonconceptid:    avibase-5E393799
individualcount:   ᴺᵁᴸᴸ

1 row in set. Elapsed: 0.008 sec.

And this is an ibis in Argentina!

You can analyze the data with clickhouse-local, but to build an interactive website, we will load it into clickhouse-server. By the way, there is almost no difference between clickhouse-local and clickhouse-server: they are the same binary executable. The main difference is that clickhouse-server listens for network connections.

I created the following table:

CREATE TABLE birds_mercator
(
    basisofrecord LowCardinality(String),
    institutioncode LowCardinality(String),
    collectioncode LowCardinality(String),
    catalognumber String,
    occurrenceid LowCardinality(String),
    recordedby String,
    year UInt16 EPHEMERAL,
    month UInt8 EPHEMERAL,
    day UInt8 EPHEMERAL,
    publishingcountry LowCardinality(String),
    country LowCardinality(String),
    stateprovince LowCardinality(String),
    county LowCardinality(String),
    decimallatitude Float32,
    decimallongitude Float32,
    locality LowCardinality(String),
    kingdom LowCardinality(String),
    phylum LowCardinality(String),
    class LowCardinality(String),
    order LowCardinality(String),
    family LowCardinality(String),
    genus LowCardinality(String),
    specificepithet LowCardinality(String),
    scientificname LowCardinality(String),
    vernacularname LowCardinality(String),
    taxonremarks LowCardinality(String),
    taxonconceptid LowCardinality(String),
    individualcount UInt32,
    date Date MATERIALIZED makeDate(year, month, day),
    mercator_x UInt32 MATERIALIZED 0xFFFFFFFF * ((decimallongitude + 180) / 360),
    mercator_y UInt32 MATERIALIZED 0xFFFFFFFF * ((1 / 2) - ((log(tan(((decimallatitude + 90) / 360) * pi())) / 2) / pi())),
    INDEX idx_x mercator_x TYPE minmax,
    INDEX idx_y mercator_y TYPE minmax
) ORDER BY mortonEncode(mercator_x, mercator_y)

Most of the structure is the same as inferred from the CSV file.

I removed Nullable because it is unneeded: I'd rather use empty strings instead of NULLs, which is generally good practice.

I've replaced year, month, day with EPHEMERAL columns and added a date column calculated from them: date Date MATERIALIZED makeDate(year, month, day). EPHEMERAL columns can be supplied in INSERT queries but are not stored in the table; they can be used in the expressions of other columns, applying transformations at insertion time. In contrast, MATERIALIZED columns cannot be specified in an INSERT query and are always calculated from their expressions.
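The interplay can be seen on a minimal toy table (a sketch, not part of the original schema): the EPHEMERAL columns exist only during the INSERT, while the MATERIALIZED column is computed from them and stored.

```sql
CREATE TABLE dates_demo
(
    year  UInt16 EPHEMERAL,
    month UInt8  EPHEMERAL,
    day   UInt8  EPHEMERAL,
    date  Date   MATERIALIZED makeDate(year, month, day)
) ENGINE = MergeTree ORDER BY date;

-- The INSERT supplies the ephemeral columns...
INSERT INTO dates_demo (year, month, day) VALUES (1989, 4, 23);

-- ...but only `date` is actually stored and queryable:
SELECT date FROM dates_demo; -- 1989-04-23
```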

I've analyzed the number of distinct values in various columns and replaced many of the String data types with LowCardinality(String) to apply dictionary encoding. For example, the country column contains only 253 unique values.
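Cardinalities can be checked directly before choosing the types. A sketch of such a query (uniq gives an approximate count; use uniqExact for a precise one):

```sql
SELECT
    uniq(country)        AS countries,  -- 253 in this dataset, per the analysis above
    uniq(vernacularname) AS species_names,
    uniq(locality)       AS localities
FROM file('2023-eBird-dwca-1.0.zip :: *.csv');
```

Columns whose distinct-value count stays in the low thousands are good candidates for LowCardinality(String).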

Additionally, I've created two materialized columns, mercator_x and mercator_y, that map the lat/lon coordinates to the Web Mercator projection. The projected coordinates are represented by two UInt32 numbers for easier segmentation of the map into tiles. I also ordered the table by a space-filling curve over these numbers and created two minmax indices for faster search. ClickHouse has everything we need for real-time mapping applications!
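The projection can be checked interactively. A sketch that applies the same expressions as the table definition to the ibis observation from earlier:

```sql
WITH
    -24.7     AS lat,
    -64.63333 AS lon,
    toUInt32(0xFFFFFFFF * ((lon + 180) / 360)) AS x,
    toUInt32(0xFFFFFFFF * (0.5 - log(tan(((lat + 90) / 360) * pi())) / (2 * pi()))) AS y
SELECT x, y, mortonEncode(x, y) AS z;
```

The Morton (Z-order) value interleaves the bits of x and y, so points that are close on the map tend to be close in the table's sort order, which is what makes tile queries fast.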

Then, I loaded the data with the following query:

ch --progress --query "
    SELECT * FROM file('2023-eBird-dwca-1.0.zip :: eod.csv')
    WHERE decimallatitude BETWEEN -89 AND 89
      AND decimallongitude BETWEEN -180 AND 180
    FORMAT Native" \
| clickhouse-client --host ... --query "INSERT INTO birds_mercator FORMAT Native"

Here, I use clickhouse-local (available under the ch alias after installation) to filter out out-of-bounds latitude and longitude values (required by the Mercator projection) and convert the result to the Native format, which is optimal for insertion. The result is piped into clickhouse-client and inserted into my ClickHouse server in ClickHouse Cloud.

The table takes only 16.8 GB in ClickHouse, much less than the 58 GB zip file, thanks to ClickHouse compression! That is only about 11 bytes on average for each of the 1.5 billion bird observations.

SELECT name, total_rows, total_bytes FROM system.tables WHERE name = 'birds_mercator'
┌─name───────────┬─total_rows─┬─total_bytes─┐
│ birds_mercator │ 1512208407 │ 16847994349 │
└────────────────┴────────────┴─────────────┘
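The system.columns table can break this total down per column, showing where dictionary encoding and compression help most. A sketch (exact sizes will vary):

```sql
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE table = 'birds_mercator'
ORDER BY data_compressed_bytes DESC
LIMIT 10;
```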

I've added the dataset with this configuration change, and we can instantly explore it!

For example, you can color the map by different orders of birds:

We can map only sea birds:

It is interesting to look at birds' paths in the South Atlantic near Patagonia:

And New Zealand:

Let's filter only penguins:
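On the map, such filters are ordinary SQL predicates. Penguins form the order Sphenisciformes, so the condition is a simple equality check (a sketch of the idea, not the exact query behind the visualization):

```sql
-- `order` is quoted with backticks to avoid clashing with the ORDER BY keyword.
SELECT vernacularname, count() AS observations
FROM birds_mercator
WHERE `order` = 'Sphenisciformes'
GROUP BY vernacularname
ORDER BY observations DESC;
```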

Let's show the different species of kiwi by adding AND family = 'Apterygidae' to the filter:

If you see this bird, you'll have no regrets!

This crow lives only in Asia:

eBird provides its own basic map visualization, but it is not as interactive, does not offer the same level of detail, and does not support direct SQL queries.

ClickHouse is a good option for analytics on large-scale geographical datasets. The eBird dataset has 1.5 billion records, while the ADS-B dataset has 130 billion records and counting. ClickHouse customers use the service with datasets over tens of trillions of records. ClickHouse makes large datasets fly!

Reference: eBird. 2021. eBird: An online database of bird distribution and abundance [web application]. eBird, Cornell Lab of Ornithology, Ithaca, New York. Available: https://www.ebird.org. (Accessed: June 1st, 2025).

Get started with ClickHouse Cloud today and receive $300 in credits. At the end of your 30-day trial, continue with a pay-as-you-go plan, or contact us to learn more about our volume-based discounts. Visit our pricing page for details.
