Quack-Cluster is a high-performance, serverless distributed SQL query engine designed for large-scale data analysis. It allows you to run complex SQL queries directly on data in object storage (like AWS S3 or Google Cloud Storage) by leveraging the combined power of Python, the Ray distributed computing framework, and the hyper-fast DuckDB analytical database.
It's a lightweight alternative to complex big data systems for analytical workloads.
- Serverless & Distributed: Effortlessly run SQL queries on a scalable Ray cluster, with no complex server infrastructure to manage.
- High-Speed SQL Processing: Combines DuckDB's in-memory, columnar-vectorized query engine with the Apache Arrow data format for fast analytics.
- Query Data Where It Lives: Natively reads data files (Parquet, CSV, etc.) directly from object storage like AWS S3, Google Cloud Storage, and local filesystems. No ETL required.
- Python-Native Integration: Built with Python, Quack-Cluster integrates seamlessly into your existing data science, data engineering, and machine learning workflows.
- Open Source Stack: Built with a powerful, modern stack of open-source technologies, including FastAPI, Ray, and DuckDB.
The Quack-Cluster system is designed for simplicity and scale. It distributes SQL queries across a Ray cluster, where each worker node uses an embedded DuckDB instance to process a portion of the data in parallel.
1. A User sends a standard SQL query to the Coordinator's API endpoint.
2. The Coordinator (FastAPI + SQLGlot) parses the SQL, identifies the target files (e.g., using wildcards like s3://my-bucket/data/*.parquet), and generates a distributed execution plan.
3. The Ray Cluster orchestrates the execution by sending tasks to multiple Worker nodes.
4. Each Worker (a Ray Actor) runs an embedded DuckDB instance to execute its assigned query fragment on a subset of the data.
5. Partial results are efficiently aggregated by the Coordinator and returned to the user.
This architecture enables massive parallel processing (MPP) for your SQL queries, turning a collection of files into a powerful distributed database.
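The sketch below illustrates this pattern in miniature. It is a simplified illustration rather than Quack-Cluster's actual implementation: the file names, fragment SQL, and worker count are hypothetical, and it assumes the ray, duckdb, and sqlglot packages are installed.

```python
import duckdb
import ray
import sqlglot
from sqlglot import exp


@ray.remote
class Worker:
    """Ray actor holding an embedded, in-memory DuckDB instance."""

    def __init__(self):
        self.con = duckdb.connect()

    def run_fragment(self, sql: str):
        # Execute the assigned fragment and return an Arrow table,
        # which moves efficiently through Ray's object store.
        return self.con.execute(sql).arrow()


def target_files(sql: str) -> list[str]:
    # Coordinator step: parse the query and extract table references,
    # which may be glob patterns such as 'data_part_*.parquet'.
    tree = sqlglot.parse_one(sql, read="duckdb")
    return [t.name for t in tree.find_all(exp.Table)]


ray.init()
user_sql = "SELECT product, SUM(sales) AS total_sales FROM 'data_part_*.parquet' GROUP BY product"
print(target_files(user_sql))  # -> ['data_part_*.parquet']

# Dispatch one (hypothetical) pre-aggregating fragment per worker.
workers = [Worker.remote() for _ in range(2)]
fragments = [
    f"SELECT product, SUM(sales) AS sales FROM 'data_part_{i}.parquet' GROUP BY product"
    for i in range(2)
]
partials = ray.get([w.run_fragment.remote(f) for w, f in zip(workers, fragments)])

# Merge the partial aggregates back on the coordinator, again with DuckDB.
con = duckdb.connect()
con.register("p0", partials[0])
con.register("p1", partials[1])
print(con.execute(
    "SELECT product, SUM(sales) AS total_sales "
    "FROM (SELECT * FROM p0 UNION ALL SELECT * FROM p1) GROUP BY product"
).fetchall())
```

Pre-aggregating on the workers and merging the partials on the coordinator is the classic two-phase aggregation used by most MPP engines.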
You only need Docker and make to get a local Quack-Cluster running.
- Docker
- make (typically available on Linux and macOS; on Windows, use WSL).
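With those installed, bring up a local cluster using the make targets documented in the table below, for example:

```bash
# From the repository root: start a local cluster with 3 worker nodes
make up scale=3
```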
Your cluster is now running! You can monitor the Ray cluster status at the Ray Dashboard: http://localhost:8265.
Use any HTTP client like curl or Postman to send SQL queries to the API. The engine automatically handles file discovery with wildcards.
This query calculates the total sales for each product across all data_part_*.parquet files.
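For example, assuming the Coordinator exposes a /query endpoint that accepts a JSON body with a sql field (check the API or the Postman collection for the exact route and payload shape):

```bash
curl -X POST "http://127.0.0.1:8000/query" \
  -H "Content-Type: application/json" \
  -d "{\"sql\": \"SELECT product, SUM(sales) AS total_sales FROM 'data_part_*.parquet' GROUP BY product ORDER BY total_sales DESC\"}"
```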
Expected Output (illustrative only; the actual rows and response shape depend on your data and the API's response format):
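```json
[
  {"product": "widget", "total_sales": 1250.0},
  {"product": "gadget", "total_sales": 987.5}
]
```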
You can easily test all API features using the provided Postman collection.
1. Import the Collection and Environment:
   - In Postman, click Import and select the following files:
     - Collection: documentation/postman_collection/QuackCluster_API_Tests.json
     - Environment: documentation/postman_collection/QuackCluster_postman_environment.json
2. Activate the Environment:
   - In the top-right corner of Postman, select "Quack Cluster Environment" from the environment dropdown list.
3. Send a Request:
   - The environment pre-configures the baseUrl variable to http://127.0.0.1:8000. You can now run any of the pre-built requests in the collection.
Quack-Cluster supports a rich subset of the DuckDB SQL dialect, enabling complex analytical queries across multiple files and directories; an example combining several of these features follows the list below.
- Basic Queries: SELECT, FROM, WHERE, GROUP BY, ORDER BY, LIMIT.
- Aggregate Functions: COUNT(), SUM(), AVG(), MIN(), MAX().
- Distributed Joins: INNER JOIN, LEFT JOIN, FULL OUTER JOIN across different file sets.
- Advanced SQL:
- Subqueries (e.g., WHERE IN (...))
- Common Table Expressions (CTEs) using the WITH clause.
- Window Functions (e.g., SUM(...) OVER (PARTITION BY ...)).
- Advanced SELECT syntax like SELECT * EXCLUDE (...) and SELECT COLUMNS('<regex>').
- File System Functions: Query collections of Parquet or CSV files using glob patterns (e.g., "s3://my-data/2025/**/*.parquet").
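For instance, a single query can mix a CTE, a window function, and EXCLUDE over a glob of Parquet files. The sketch below is illustrative: the column names (region, sale_date, amount) and the bucket path are hypothetical.

```sql
-- CTE + window function + EXCLUDE over a glob of Parquet files
WITH daily AS (
    SELECT region, sale_date, SUM(amount) AS revenue
    FROM 's3://my-data/2025/**/*.parquet'
    GROUP BY region, sale_date
)
SELECT * EXCLUDE (revenue),
       SUM(revenue) OVER (PARTITION BY region ORDER BY sale_date) AS running_revenue
FROM daily
ORDER BY region, sale_date;
```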
Use these make commands to manage your development lifecycle.
| Command | Description |
| --- | --- |
| make up scale=N | Starts the cluster with N worker nodes. |
| make down | Stops and removes running containers safely. |
| make logs | Tails the logs from all services. |
| make build | Rebuilds the Docker images after a code change. |
| make test | Runs the pytest suite inside the ray-head container. |
| make clean | DANGER: Stops containers and deletes all data volumes. |
- [✔] Support for distributed JOIN operations.
- [✔] Support for Window Functions (OVER, PARTITION BY).
- [ ] Integration with a metadata catalog like Apache Iceberg or Delta Lake.
- [ ] A dedicated Python client (SDK) for a better developer experience.
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Please feel free to open an issue or submit a pull request.
This project leverages AI tools to accelerate development and improve documentation.
All core architectural decisions, debugging, and final testing are human-powered to ensure quality and correctness.
This project is licensed under the MIT License. See the LICENSE file for details.