Show HN: Unsiloed Chunker – VLM powered semantic chunking for RAG

A simple way to extract text from documents for intelligent document processing, extraction, and chunking, with multi-threaded processing capabilities.

  • Supported File Types: PDF, DOCX, PPTX
  • Chunking Strategies:
    • Fixed Size: Splits text into chunks of specified size with optional overlap
    • Page-based: Splits PDF by pages (PDF only, falls back to paragraph for other file types)
    • Semantic: Uses a multimodal model to identify meaningful semantic chunks
    • Paragraph: Splits text by paragraphs
    • Heading: Splits text by identified headings
  • Uses OpenAI GPT-4o for semantic chunking
  • Handles authentication via API key from environment variables
  • Implements automatic retries and timeout handling
  • Provides structured JSON output for semantic chunks
  • Multi-threaded processing for improved performance
  • Parallel page extraction from PDFs
  • Distributes processing of large documents across multiple threads
  • Extracts text from PDF, DOCX, and PPTX files
  • Handles image encoding for vision-based models
  • Generates extraction prompts for structured data extraction

Environment variables:

  • OPENAI_API_KEY: Your OpenAI API key
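
As a rough illustration of the fixed-size and paragraph strategies in the feature list above, the sketch below splits text into overlapping fixed-size chunks and into paragraphs. It is not taken from the Unsiloed codebase: the function names `fixed_size_chunks` and `paragraph_chunks` are hypothetical, and the real implementation may treat boundaries differently.

```python
import re
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters. Hypothetical helper,
    not Unsiloed's actual implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


def paragraph_chunks(text: str) -> List[str]:
    """Split text on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


if __name__ == "__main__":
    print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
    # ['abcd', 'defg', 'ghij']
```

The overlap means the last character of one chunk is repeated at the start of the next, which helps keep sentences that straddle a boundary retrievable from either side.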

🛑 Constraints & Limitations

  • Temporary files are created during processing and deleted afterward
  • Files are processed in-memory where possible
  • Long text (>25,000 characters) is automatically split and processed in parallel for semantic chunking
  • Maximum token limit of 4000 for OpenAI responses
  • Request timeout set to 60 seconds
  • Maximum of 3 retries for OpenAI API calls
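
The limits above can be pictured with a minimal sketch: long text is cut into segments of at most 25,000 characters, each segment is handled on a worker thread, and a failing call is retried. The helper names are hypothetical, "3 retries" is assumed here to mean three further attempts after the first failure, and the 60-second timeout is left to the actual API call rather than modelled.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List

MAX_SEGMENT_CHARS = 25_000  # threshold quoted in the list above
MAX_RETRIES = 3             # retry limit quoted in the list above


def split_long_text(text: str, max_chars: int = MAX_SEGMENT_CHARS) -> List[str]:
    """Cut text into consecutive segments of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def call_with_retries(func: Callable[[str], Any], segment: str,
                      retries: int = MAX_RETRIES, delay: float = 0.0) -> Any:
    """Call func(segment), retrying up to `retries` times after a failure."""
    if retries < 0:
        raise ValueError("retries must be non-negative")
    last_error = None
    for attempt in range(retries + 1):
        try:
            return func(segment)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < retries:
                time.sleep(delay)
    raise last_error


def process_in_parallel(text: str, func: Callable[[str], Any],
                        max_workers: int = 4) -> List[Any]:
    """Run func over every segment of text on a thread pool."""
    segments = split_long_text(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input segments
        return list(pool.map(lambda seg: call_with_retries(func, seg), segments))
```

Here `func` stands in for whatever performs the semantic chunking request. Threads suit this workload because each call spends most of its time waiting on the network.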

Document Chunking Endpoint

  • document_file: The document file to process (PDF, DOCX, PPTX)
  • strategy: Chunking strategy to use (default: "semantic")
    • Options: "fixed", "page", "semantic", "paragraph", "heading"
  • chunk_size: Size of chunks for fixed strategy in characters (default: 1000)
  • overlap: Overlap size for fixed strategy in characters (default: 100)
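
To make the parameters above concrete, here is a small client-side sketch that validates them and returns the form fields to send alongside the uploaded `document_file`. The helper `build_chunking_form` is hypothetical, and rejecting an overlap that is not smaller than the chunk size is an assumption, not documented server behaviour. The endpoint path is not given in this text, so the sketch stops short of making a request.

```python
from typing import Dict

# Strategy names as listed in the parameter description above
VALID_STRATEGIES = ("fixed", "page", "semantic", "paragraph", "heading")


def build_chunking_form(strategy: str = "semantic",
                        chunk_size: int = 1000,
                        overlap: int = 100) -> Dict[str, str]:
    """Validate the documented parameters and return them as form fields.

    Hypothetical helper: the document itself would be attached separately
    as a multipart file field named document_file.
    """
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"strategy must be one of {VALID_STRATEGIES}")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if overlap < 0 or overlap >= chunk_size:
        # Assumption: an overlap this large would never make progress
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Multipart form values are transmitted as strings
    return {
        "strategy": strategy,
        "chunk_size": str(chunk_size),
        "overlap": str(overlap),
    }
```

Calling `build_chunking_form()` with no arguments reproduces the documented defaults: semantic strategy, 1000-character chunks, 100-character overlap.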

Unsiloed requires Python 3.8 or higher and has the following dependencies:

  • openai
  • PyPDF2
  • python-docx
  • python-pptx
  • fastapi
  • python-multipart

Before using Unsiloed, set up your OpenAI API key:

Using environment variables

# Linux/macOS
export OPENAI_API_KEY="your-api-key-here"

# Windows (Command Prompt)
set OPENAI_API_KEY=your-api-key-here

# Windows (PowerShell)
$env:OPENAI_API_KEY="your-api-key-here"

Alternatively, create a .env file in your project directory:

OPENAI_API_KEY=your-api-key-here

Then in your Python code:

from dotenv import load_dotenv

load_dotenv()  # This loads the variables from .env

import os
import Unsiloed

# Example 1: Semantic chunking (default)
result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "semantic",
    "chunkSize": 1000,
    "overlap": 100
})

# Print the result
print(result)

# Example 2: Fixed-size chunking
fixed_result = Unsiloed.process_sync({
    "filePath": "./test.pdf",  # path to your file
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "fixed",
    "chunkSize": 1500,
    "overlap": 150
})

# Example 3: Page-based chunking (PDF only)
page_result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "page"
})

# Example 4: Paragraph chunking
paragraph_result = Unsiloed.process_sync({
    "filePath": "./document.docx",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "paragraph"
})

# Example 5: Heading chunking
heading_result = Unsiloed.process_sync({
    "filePath": "./presentation.pptx",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "heading"
})
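
The examples above read the key with `os.environ.get("OPENAI_API_KEY")`, which returns `None` when the variable is unset. A small guard can surface that problem before any request is made. This is only a sketch, and `require_api_key` is a hypothetical helper rather than part of Unsiloed.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None,
                    name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to test without touching the real environment.
    """
    source = os.environ if env is None else env
    key = source.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it or add it to your .env file"
        )
    return key
```

The returned value could then be passed as the `apiKey` entry of the `credentials` dictionary in place of the bare `os.environ.get(...)` call.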

Prerequisites for local development:

  • Python 3.8 or higher
  • pip (Python package installer)
  • git

Setting Up Local Development Environment

  1. Clone the repository:
git clone https://github.com/Unsiloed-opensource/Unsiloed.git
cd Unsiloed
  2. Create a virtual environment:
# Using venv
python -m venv venv
# Activate the virtual environment
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up your environment variables:
# Create a .env file
echo "OPENAI_API_KEY=your-api-key-here" > .env
  5. Run the FastAPI server locally:
uvicorn Unsiloed.app:app --reload
  6. Access the API documentation: Open your browser and go to http://localhost:8000/docs

We welcome contributions to Unsiloed! Here's how you can help:

Setting Up Development Environment

  1. Fork the repository and clone your fork:
git clone https://github.com/YOUR_USERNAME/Unsiloed.git
cd Unsiloed
  2. Install development dependencies:
pip install -r requirements.txt
  3. Create a new branch for your feature:
git checkout -b feature/your-feature-name
  4. Make your changes and write tests if applicable
  5. Commit your changes:
git commit -m "Add your meaningful commit message here"
  6. Push to your fork:
git push origin feature/your-feature-name
  7. Create a Pull Request from your fork to the main repository

Code style:

  • We follow PEP 8 for Python code style
  • Use type hints where appropriate
  • Document functions and classes with docstrings
  • Write tests for new features

This project is licensed under the MIT License - see the LICENSE file for details.

  • GitHub Discussions: For questions, ideas, and discussions
  • Issues: For bug reports and feature requests
  • Pull Requests: For contributing to the codebase
  • Star the repository to show support
  • Watch the repository for notifications about new releases