I had never heard about birdwatching until the last day when I stumbled upon the Cornell Lab of Ornithology's eBird project. It provides a dataset of bird observations worldwide, with 1.5 billion records updated monthly.
I immediately loaded this dataset into my visualization tool. The adsb.exposed website already contains a dataset of air traffic with over 130 billion records and a dataset of Foursquare places. There is certainly a larger number of birds than airplanes, but the dataset of bird observations is a hundred times smaller, so it is easy to load and analyze it.
The dataset is located here in a zip file of 58 GB. It contains a CSV file of 440 GB in size.
Analyzing local and external datasets is very convenient with clickhouse-local, a small command-line tool that provides a full ClickHouse engine.
ClickHouse allows the processing of files inside archives without decompressing.
The DESCRIBE query gives an automatically inferred schema for the data. The file table function allows the processing of files on the local filesystem and the :: notation allows the processing of files inside archives. All filenames support glob-expansions with *, ?, **, {0..9}, {abc,def}. For CSV files, ClickHouse automatically detects if there is a header and uses the header if it is present. It also automatically detects data types of columns. With all these conveniences, ClickHouse makes data processing seamless.
And this is an ibis in Argentina!
You can analyze the data with clickhouse-local, but to create an interactive website, we will load it to clickhouse-server. By the way, there is almost no difference between clickhouse-local and clickhouse-server - it is the same binary executable. The main difference is that clickhouse-server listens to connections.
I created the following table:
Most of the structure is the same as inferred from the CSV file.
I removed Nullable because it is unneeded - I'd better use empty strings instead of NULLs, and it is generally good practice.
I've replaced year, month, day with EPHEMERAL columns and added a date column, calculated from them: date Date MATERIALIZED makeDate(year, month, day). EPHEMERAL columns are used in INSERT queries but are not stored in a table - they can be used to calculate expressions for other columns to apply transformations on the insertion time. In contrast, MATERIALIZED columns are columns that cannot be used in the INSERT query but are always calculated from their expressions.
I've analyzed the number of distinct values in various columns and replaced many of the String data types with LowCardinality(String) to apply dictionary encoding. For example, the country column contains only 253 unique values.
Additionally, I've created two materialized columns, mercator_x and mercator_y that map the lat/lon coordinates to the Web Mercator projection. The coordinates on the Mercator projection are represented by two UInt32 numbers for easier segmentation of the map into tiles. Additionally, I set up the order of the table by a space-filling curve on top of these numbers, and I created two minmax indices for faster search. ClickHouse has everything we need for real-time mapping applications!
Then, I loaded the data with the following query:
Here, I use clickhouse-local (which is available under the ch alias after installation) to filter out of bound values of latitude and longitude (required by the Mercator projection) and convert the result to the Native format, which is optimal for insertion. The result is piped into clickhouse-client and inserted into my ClickHouse server in ClickHouse Cloud.
The table took only 16.8 GB in ClickHouse, which is much less than the 58 GB zip file, thanks to ClickHouse compression! This is only 11 bytes on average per each of the 1.5 billion bird observations.
I've added the dataset with this configuration change, and we can instantly explore it!
For example, you can color the map by different orders of birds:
We can map only sea birds:
Interesting to look birds' paths in the south Atlantic near Patagonia:
And New Zealand:
Let's filter only penguins:
Let's show different sorts of Kiwi by adding AND family = 'Apterygidae' to the filter:
If you see this bird, you'll have no regrets!
This crow lives only in Asia:
eBird provides its own basic visualization on a map, but it is not as interactive and provides no such detail into the data and no direct SQL queries.
ClickHouse is a good option for analytics on large-scale geographical datasets. The eBird dataset has 1.5 billion records, while the ADS-B dataset has 130 billion records and counting. ClickHouse customers use the service with datasets over tens of trillions of records. ClickHouse makes large datasets fly!
Reference: eBird. 2021. eBird: An online database of bird distribution and abundance [web application]. eBird, Cornell Lab of Ornithology, Ithaca, New York. Available: https://www.ebird.org. (Accessed: June 1st, 2025).
Get started with ClickHouse Cloud today and receive $300 in credits. At the end of your 30-day trial, continue with a pay-as-you-go plan, or contact us to learn more about our volume-based discounts. Visit our pricing page for details.
.png)

