A super simple way to extract text from documents for intelligent document processing, extraction, and chunking, with multi-threaded processing capabilities.
- Supported File Types: PDF, DOCX, PPTX
- Chunking Strategies:
  - Fixed Size: Splits text into chunks of a specified size with optional overlap
  - Page-based: Splits a PDF by page (PDF only; falls back to paragraph chunking for other file types)
  - Semantic: Uses a multimodal model to identify meaningful semantic chunks
  - Paragraph: Splits text by paragraphs
  - Heading: Splits text by identified headings
- Uses OpenAI GPT-4o for semantic chunking
- Handles authentication via API key from environment variables
- Implements automatic retries and timeout handling
- Provides structured JSON output for semantic chunks
- Multi-threaded processing for improved performance
- Parallel page extraction from PDFs
- Distributes processing of large documents across multiple threads
- Extracts text from PDF, DOCX, and PPTX files
- Handles image encoding for vision-based models
- Generates extraction prompts for structured data extraction
- OPENAI_API_KEY: Your OpenAI API key
- Temporary files are created during processing and deleted afterward
- Files are processed in-memory where possible
- Long text (>25,000 characters) is automatically split and processed in parallel for semantic chunking
- Maximum token limit of 4000 for OpenAI responses
- Request timeout set to 60 seconds
- Maximum of 3 retries for OpenAI API calls
- document_file: The document file to process (PDF, DOCX, PPTX)
- strategy: Chunking strategy to use (default: "semantic")
  - Options: "fixed", "page", "semantic", "paragraph", "heading"
- chunk_size: Size of chunks for fixed strategy in characters (default: 1000)
- overlap: Overlap size for fixed strategy in characters (default: 100)
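To make the chunk_size and overlap parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not Unsiloed's actual implementation: the function name `chunk_fixed` is hypothetical, and the defaults simply mirror the documented values of 1000 and 100 characters.

```python
from typing import List


def chunk_fixed(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of chunk_size characters.

    Each chunk repeats the last `overlap` characters of the previous one.
    Hypothetical helper for illustration; not part of the Unsiloed API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


# A 2,500-character sample yields three chunks with the default settings.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_fixed(sample)
print([len(c) for c in chunks])
```

With the defaults, each chunk starts 900 characters after the previous one, so neighbouring chunks share 100 characters of context.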
Unsiloed requires Python 3.8 or higher and has the following dependencies:
- openai
- PyPDF2
- python-docx
- python-pptx
- fastapi
- python-multipart
Before using Unsiloed, set up your OpenAI API key:
# Linux/macOS
export OPENAI_API_KEY="your-api-key-here"
# Windows (Command Prompt)
set OPENAI_API_KEY=your-api-key-here
# Windows (PowerShell)
$env:OPENAI_API_KEY="your-api-key-here"
Create a .env file in your project directory:
OPENAI_API_KEY=your-api-key-here
Then in your Python code:
from dotenv import load_dotenv  # provided by the python-dotenv package
load_dotenv()  # This loads the variables from .env
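Whichever way the key is set, it can help to fail early with a clear message when it is missing. The following is a small sketch using only the standard library; the helper name `require_api_key` is an assumption for illustration and is not part of Unsiloed.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the named environment variable, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of the Unsiloed API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to your .env file"
        )
    return value
```

Calling `require_api_key()` once at startup surfaces a missing key before any document is processed.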
import os
import Unsiloed
# Example 1: Semantic chunking (default)
result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "semantic",
    "chunkSize": 1000,
    "overlap": 100
})
# Print the result
print(result)
# Example 2: Fixed-size chunking
fixed_result = Unsiloed.process_sync({
    "filePath": "./test.pdf",  # path to your file
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "fixed",
    "chunkSize": 1500,
    "overlap": 150
})
# Example 3: Page-based chunking (PDF only)
page_result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "page"
})
# Example 4: Paragraph chunking
paragraph_result = Unsiloed.process_sync({
    "filePath": "./document.docx",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "paragraph"
})
# Example 5: Heading chunking
heading_result = Unsiloed.process_sync({
    "filePath": "./presentation.pptx",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "heading"
})
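The examples above differ only in the file path, the strategy, and the optional size settings, so the options dictionary can be built by a small helper. This is a sketch under stated assumptions: `build_options` is a hypothetical name, not part of Unsiloed, and the dictionary keys are copied from the examples above.

```python
import os
from typing import Any, Dict, Optional

# Strategy names as listed in this README.
ALLOWED_STRATEGIES = {"fixed", "page", "semantic", "paragraph", "heading"}


def build_options(
    file_path: str,
    strategy: str = "semantic",
    chunk_size: Optional[int] = None,
    overlap: Optional[int] = None,
    api_key: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an options dict shaped like the ones in the examples above.

    Hypothetical helper for illustration; not part of the Unsiloed API.
    """
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if api_key is None:
        api_key = os.environ.get("OPENAI_API_KEY")

    options: Dict[str, Any] = {
        "filePath": file_path,
        "credentials": {"apiKey": api_key},
        "strategy": strategy,
    }
    # Only include the size settings when the caller supplies them.
    if chunk_size is not None:
        options["chunkSize"] = chunk_size
    if overlap is not None:
        options["overlap"] = overlap
    return options


options = build_options(
    "./test.pdf",
    strategy="fixed",
    chunk_size=1500,
    overlap=150,
    api_key="your-api-key-here",
)
print(options)
# The dict can then be passed on as in the examples above:
# result = Unsiloed.process_sync(options)
```

Validating the strategy name up front catches typos such as "paragraphs" before any file is read.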
- Python 3.8 or higher
- pip (Python package installer)
- git
- Clone the repository:
git clone https://github.com/Unsiloed-opensource/Unsiloed.git
cd Unsiloed
- Create a virtual environment:
# Using venv
python -m venv venv
# Activate the virtual environment
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
- Set up your environment variables:
# Create a .env file
echo "OPENAI_API_KEY=your-api-key-here" > .env
- Run the FastAPI server locally:
uvicorn Unsiloed.app:app --reload
- Access the API documentation: Open your browser and go to http://localhost:8000/docs
We welcome contributions to Unsiloed! Here's how you can help:
- Fork the repository and clone your fork:
git clone https://github.com/YOUR_USERNAME/Unsiloed.git
cd Unsiloed
- Install development dependencies:
pip install -r requirements.txt
- Create a new branch for your feature:
git checkout -b feature/your-feature-name
- Make your changes and write tests if applicable
- Commit your changes:
git commit -m "Add your meaningful commit message here"
- Push to your fork:
git push origin feature/your-feature-name
- Create a Pull Request from your fork to the main repository
- We follow PEP 8 for Python code style
- Use type hints where appropriate
- Document functions and classes with docstrings
- Write tests for new features
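As an illustration of these guidelines, a contribution might look like the sketch below: a PEP 8 formatted function with type hints and a docstring, plus a small test. The function `split_paragraphs` is a made-up example and is not the project's actual paragraph strategy.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split text into paragraphs separated by blank lines.

    Args:
        text: The raw text to split.

    Returns:
        The non-empty paragraphs, stripped of surrounding whitespace.
    """
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


def test_split_paragraphs() -> None:
    """Check paragraph boundaries and the empty-input case."""
    text = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird."
    assert split_paragraphs(text) == [
        "First paragraph.",
        "Second paragraph,\nstill second.",
        "Third.",
    ]
    assert split_paragraphs("") == []


test_split_paragraphs()
```

Naming the test with a `test_` prefix lets a runner such as pytest discover it automatically.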
This project is licensed under the MIT License - see the LICENSE file for details.
- GitHub Discussions: For questions, ideas, and discussions
- Issues: For bug reports and feature requests
- Pull Requests: For contributing to the codebase
- Star the repository to show support
- Watch the repository for notifications about new releases