A DuckDB extension that enables reading Excel, WPS, and OpenDocument spreadsheet files directly within SQL queries. This extension provides seamless integration for analyzing spreadsheet data using DuckDB's powerful SQL engine.
- High Performance: Optimized for large files with pure Rust implementation; e.g., reading a 1M-row XLSX file on MacBook M1 now takes under 13 seconds (down from 140+ seconds in v0.1.x).
- Multiple Format Support: Read Excel files (.xls, .xlsx, .xlsm, .xlsb, .xla, .xlam), WPS files (.et, .ett), and OpenDocument Spreadsheet files (.ods)
- Batch Processing: Analyze or read multiple files and worksheets with wildcard pattern matching
- Flexible Data Types: Support for boolean, integer, double, varchar, datetime, date, and time types
- Excel-Style Data Ranges: Specify data ranges using familiar Excel notation (e.g., "A1:C3")
- Automatic Column Type Detection: Column types are inferred automatically; override specific columns with the columns parameter
- Header Row Handling: Automatic detection and parsing of header rows
- Error Handling: Configurable behavior for parsing errors with precise cell location and file name reporting
- Type Safety: Built-in data type validation and conversion
- Advanced Data Filtering: Skip empty rows or stop at first empty row for efficient data processing
- Remote Storage Support: Read spreadsheets from remote URLs including HTTP, HTTPS, S3, Google Cloud Storage, and Hugging Face datasets
DuckDB 1.4.1 or later is required for community extension support.
The easiest way to install rusty_sheet is through DuckDB's community extension platform:
- Python 3
- Python 3-venv
- Make
- Git
- Rust toolchain
- Clone the repository:
- Configure the build environment:
- Build the extension:
- The built extension will be available in build/debug/extension/ or build/release/extension/
Start DuckDB with the unsigned flag to load local extensions:
Load the extension:
The read_sheets and analyze_sheets functions now feature enhanced matching logic:
- Multiple Pattern Support: Accepts lists of file patterns for flexible multi-format processing
- Smart Worksheet Discovery: When using wildcards, the system finds the first worksheet that matches both the file pattern and worksheet pattern, reducing "no matching worksheets" errors
- File-Specific Matching: Improved handling of file-specific worksheet patterns like *.xlsx=Sheet*
Analyzes the column structure of a single worksheet in a single file.
Parameters:
- file_path (required): Path to the spreadsheet file (no wildcard support). Supports local files and remote URLs (HTTP, HTTPS, S3, GS, HF)
- sheet (optional, default first sheet): Worksheet name (supports wildcards like Sheet*)
- range (optional): Data range in format [start_col][start_row]:[end_col][end_row]
- header (optional, default true): Whether the first row contains column headers
- analyze_rows (optional, default 10): Number of rows to analyze for type inference
- error_as_null (optional, default false): If true, convert parsing errors to NULL instead of failing
Examples:
Analyzes column structures of multiple worksheets across multiple files with wildcard pattern matching.
Parameters:
- file_pattern (required): File path pattern(s) with wildcard support (e.g., ['*.xlsx'], ['*.xls', '*.xlsx']). Also supports remote URLs (HTTP, HTTPS, S3, GS, HF)
- sheets (optional): List of worksheet names (supports wildcards and file-specific patterns like ['Sheet*'], ['*.xlsx=Sheet*'])
- range (optional): Data range in format [start_col][start_row]:[end_col][end_row]
- header (optional, default true): Whether the first row contains column headers
- analyze_rows (optional, default 10): Number of rows to analyze for type inference
- error_as_null (optional, default false): If true, convert parsing errors to NULL instead of failing
Examples:
Reads data from a single worksheet in a single file.
Parameters:
- file_path (required): Path to the spreadsheet file (no wildcard support). Supports local files and remote URLs (HTTP, HTTPS, S3, GS, HF)
- sheet (optional, default first sheet): Worksheet name (supports wildcards like Sheet*)
- range (optional): Data range in format [start_col][start_row]:[end_col][end_row]
- header (optional, default true): Whether the first row contains column headers
- analyze_rows (optional, default 10): Number of rows to analyze for type inference
- error_as_null (optional, default false): If true, convert parsing errors to NULL instead of failing
- skip_empty_rows (optional, default false): Skip rows where all columns contain empty values
- end_at_empty_row (optional, default false): Stop reading at the first completely empty row
Examples:
Reads data from multiple worksheets across multiple files with wildcard pattern matching.
Important Note: When using wildcard patterns, this function analyzes the column structure and data types from the first matching worksheet only. All subsequent worksheets with matching patterns will use the same column structure, even if their actual structure differs. For worksheets with varying structures, consider using analyze_sheets first to inspect individual worksheet structures.
Parameters:
- file_pattern (required): File path pattern(s) with wildcard support (e.g., ['*.xlsx'], ['*.xls', '*.xlsx']). Also supports remote URLs (HTTP, HTTPS, S3, GS, HF)
- sheets (optional): List of worksheet names (supports wildcards and file-specific patterns like ['Sheet*'], ['*.xlsx=Sheet*'])
- range (optional): Data range in format [start_col][start_row]:[end_col][end_row]
- header (optional, default true): Whether the first row contains column headers
- analyze_rows (optional, default 10): Number of rows to analyze for type inference
- error_as_null (optional, default false): If true, convert parsing errors to NULL instead of failing
- skip_empty_rows (optional, default false): Skip rows where all columns contain empty values
- end_at_empty_row (optional, default false): Stop reading at the first completely empty row
- file_name_column (optional): Column name to include file source information in results
- sheet_name_column (optional): Column name to include worksheet source information in results
- union_by_name (optional, default false): When false, union data by position; when true, union data by column name
Examples:
boolean | BOOLEAN | True/false values |
bigint | BIGINT | 64-bit signed integers |
double | DOUBLE | Double-precision floating point |
varchar | VARCHAR | Variable-length strings |
timestamp | TIMESTAMP | Date and time with microsecond precision (supports ISO 8601 format) |
date | DATE | Date without time component (supports ISO 8601 format) |
time | TIME | Time without date component (including ISO 8601 durations) |
The range parameter supports flexible Excel-style cell range notation with five optional components:
-
Start Column (optional): Excel column letter (e.g., A for column 1, B for column 2, etc.)
- If specified, reading starts from this column even if it contains no data
- Does not skip empty columns
-
Start Row (optional): Excel row number (e.g., 1 for row 1, 2 for row 2, etc.)
- If specified, reading starts from this row even if it contains no data
- Does not skip empty rows
-
Colon Separator: Required only when specifying end column or end row
-
End Column (optional): Excel column letter
- If specified, reading stops at this column even if data ends earlier
- Ignores data beyond this column
-
End Row (optional): Excel row number
- If specified, reading stops at this row even if data ends earlier
- Ignores data beyond this row
The extension supports Rust glob patterns for file and worksheet matching.
- ? - Matches any single character
- * - Matches any (possibly empty) sequence of characters
- ** - Matches the current directory and arbitrary subdirectories
- [...] - Matches any character inside the brackets
- [!...] - Negation of [...], matches characters not in the brackets
- ** must form a single path component (e.g., **/*.xlsx is valid, **a is invalid)
- Character ranges use Unicode ordering (e.g., [0-9] matches digits 0-9)
- Metacharacters ?, *, [, ] can be matched using brackets (e.g., [?])
- The - character in character sets must be at start or end (e.g., [abc-])
Run the test suite:
Test with different DuckDB versions:
This extension is built using the DuckDB Rust extension framework. The main components are:
- src/lib.rs: Main extension implementation
- Cargo.toml: Rust dependencies and build configuration
To contribute:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Run make test_debug to verify
- Submit a pull request
- On Windows with Python 3.11, you may encounter extension loading issues. Use Python 3.12 or later.
- Very large spreadsheets may require significant memory allocation.
- Complex Excel formulas are not evaluated; only the computed values are read.
Zhang, Zepeng Email: [email protected]
This project is licensed under the MIT License - see the LICENSE file for details.
- Built on the DuckDB Rust extension template
- Inspired by DuckDB's commitment to making data analysis more accessible
- Thanks to abelcha for helping implement remote file reading functionality