OntoCast is a powerful framework that automatically extracts semantic triples from documents using an agentic approach. It combines ontology management with natural language processing to create structured knowledge from unstructured text.
- Ontology-Guided Extraction: Uses ontologies to guide the extraction process and ensure semantic consistency
- Entity Disambiguation: Resolves entity and property references across chunks
- Multi-Format Support: Handles various input formats including text, JSON, PDF, and Markdown
- Semantic Chunking: Intelligent text chunking based on semantic similarity
- MCP Compatibility: Fully compatible with the Model Context Protocol (MCP) specification, providing standardized endpoints for health checks, info, and document processing
- RDF Output: Generates standardized RDF/Turtle output
- Document Processing
  - Supports PDF, Markdown, and plain-text documents
  - Automated text chunking and processing
- Automated Ontology Management
  - Intelligent ontology selection and construction
  - Multi-stage validation and critique system
  - Ontology sublimation and refinement
- Knowledge Graph Integration
  - RDF-based knowledge graph storage
  - Triple extraction for both ontologies and facts
  - Configurable workflow with visit limits
  - Chunk aggregation preserving fact lineage
uv add ontocast
# or
pip install ontocast
Create a .env file with your OpenAI API key:
uv run serve \
--ontology-directory ONTOLOGY_DIR \
--working-directory WORKING_DIR
The /process endpoint accepts:
- application/json: JSON data
- multipart/form-data: File uploads
And returns:
- application/json: Processing results including:
- Extracted facts in Turtle format
- Generated ontology in Turtle format
- Processing metadata
# Process a PDF file
curl -X POST http://url:port/process -F "file=@data/pdf/sample.pdf"
# Process a JSON file
curl -X POST http://url:port/process -F "file=@test2/sample.json"
# Process text content
curl -X POST http://localhost:8999/process \
-H "Content-Type: application/json" \
-d '{"text": "Your document text here"}'
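The text-processing call above can also be sketched in Python using only the standard library. This is a minimal sketch: the base URL uses the default port mentioned below, and the response is assumed to be JSON as described above; field names inside it are not guaranteed here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8999"  # assumed default port


def build_text_request(text: str) -> urllib.request.Request:
    """Build a POST request carrying raw document text as JSON."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/process",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def process_text(text: str) -> dict:
    """Send document text to /process and return the parsed JSON response."""
    req = build_text_request(text)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a running OntoCast server):
# result = process_text("Your document text here")
```

The request builder is separated from the network call so the payload shape can be inspected or tested without a live server.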
OntoCast implements the following MCP-compatible endpoints:
- GET /health: Health check endpoint
- GET /info: Service information endpoint
- POST /process: Document processing endpoint
uv run serve \
--ontology-directory ONTOLOGY_DIR \
--working-directory WORKING_DIR \
--input-path DOCUMENT_DIR
- JSON documents are expected to contain their text in the text field
- recursion_limit is calculated as max_visits * estimated_chunks; the estimated number of chunks defaults to 30 unless overridden in .env (via ESTIMATED_CHUNKS)
- 8999 is used as the default port
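The recursion-limit rule above can be sketched as a small helper; the function name is illustrative, not taken from the OntoCast source.

```python
import os

DEFAULT_ESTIMATED_CHUNKS = 30


def recursion_limit(max_visits: int) -> int:
    """Compute the workflow recursion limit as max_visits * estimated_chunks.

    The chunk estimate defaults to 30 and can be overridden through the
    ESTIMATED_CHUNKS environment variable (e.g. set in .env).
    """
    estimated_chunks = int(os.getenv("ESTIMATED_CHUNKS", DEFAULT_ESTIMATED_CHUNKS))
    return max_visits * estimated_chunks


# With the default estimate of 30 chunks:
# recursion_limit(3) -> 90
```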
To build the Docker image:
docker buildx build -t growgraph/ontocast:0.1.1 . 2>&1 | tee build.log
src/
├── agent.py # Main agent workflow implementation
├── onto.py # Ontology and RDF graph handling
├── nodes/ # Individual workflow nodes
├── tools/ # Tool implementations
└── prompts/ # LLM prompts
The extraction follows a multi-stage workflow:
1. Document Preparation
   - [Optional] Convert to Markdown
   - Text chunking
2. Ontology Processing
   - Ontology selection
   - Text to ontology triples
   - Ontology critique
3. Fact Extraction
   - Text to facts
   - Facts critique
   - Ontology sublimation
4. Chunk Normalization
   - Chunk KG aggregation
   - Entity/property disambiguation
5. Storage
   - Knowledge graph storage
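The stages above can be sketched as a linear pipeline over a shared state. This is purely illustrative: the stage functions and state fields are hypothetical and do not reflect OntoCast's actual langgraph-based implementation.

```python
# Hypothetical sketch of the multi-stage workflow as a linear pipeline.


def prepare_document(state: dict) -> dict:
    # Convert to Markdown (optional) and split into chunks.
    state["chunks"] = [state["text"]]  # trivial single-chunk split
    return state


def process_ontology(state: dict) -> dict:
    # Select an ontology, extract ontology triples, run the critique step.
    state["ontology"] = []
    return state


def extract_facts(state: dict) -> dict:
    # Extract fact triples per chunk, critique them, sublimate the ontology.
    state["facts"] = [("doc", "hasChunk", chunk) for chunk in state["chunks"]]
    return state


def normalize_chunks(state: dict) -> dict:
    # Aggregate chunk graphs and disambiguate entities/properties.
    state["facts"] = sorted(set(state["facts"]))
    return state


def store(state: dict) -> dict:
    # Persist the resulting knowledge graph.
    state["stored"] = True
    return state


STAGES = [prepare_document, process_ontology, extract_facts, normalize_chunks, store]


def run(text: str) -> dict:
    state = {"text": text}
    for stage in STAGES:
        state = stage(state)
    return state
```

Each stage takes and returns the whole state, so stages can be reordered or revisited (up to the configured visit limits) without changing their signatures.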
Full documentation is available at: growgraph.github.io/ontocast
- Add a triple store for serialization/ontology management
- Replace graph-to-text conversion with a symbolic graph interface (agent tools for working with triples)
Contributions are welcome! Please feel free to submit a Pull Request.
- Uses RDFlib for semantic triple management
- Uses docling for PDF/PPTX conversion
- Uses OpenAI language models, or open models served via Ollama, for fact extraction
- Uses langchain/langgraph