Show HN: OntoCast – Extract RDF triples using LLMs and co-evolving ontologies


An agentic, ontology-assisted framework for semantic triple extraction from documents


OntoCast is a powerful framework that automatically extracts semantic triples from documents using an agentic approach. It combines ontology management with natural language processing to create structured knowledge from unstructured text.

  • Ontology-Guided Extraction: Uses ontologies to guide the extraction process and ensure semantic consistency
  • Entity Disambiguation: Resolves entity and property references across chunks
  • Multi-Format Support: Handles various input formats including text, JSON, PDF, and Markdown
  • Semantic Chunking: Intelligent text chunking based on semantic similarity
  • MCP Compatibility: Fully compatible with the Model Context Protocol (MCP) specification, providing standardized endpoints for health checks, info, and document processing
  • RDF Output: Generates standardized RDF/Turtle output
  • Document Processing

    • Supports PDF, markdown, and text documents
    • Automated text chunking and processing
  • Automated Ontology Management

    • Intelligent ontology selection and construction
    • Multi-stage validation and critique system
    • Ontology sublimation and refinement
  • Knowledge Graph Integration

    • RDF-based knowledge graph storage
    • Triple extraction for both ontologies and facts
    • Configurable workflow with visit limits
    • Chunk aggregation preserving fact lineage
uv add ontocast
# or: pip install ontocast

Create a .env file with your OpenAI API key:
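
A minimal .env might look like the following; OPENAI_API_KEY is the standard OpenAI variable name, but check the project documentation for the exact keys OntoCast reads:

```shell
# .env
OPENAI_API_KEY=sk-...
```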

uv run serve \
  --ontology-directory ONTOLOGY_DIR \
  --working-directory WORKING_DIR

The /process endpoint accepts:

  • application/json: JSON data
  • multipart/form-data: File uploads

And returns:

  • application/json: Processing results including:
    • Extracted facts in Turtle format
    • Generated ontology in Turtle format
    • Processing metadata
# Process a PDF file
curl -X POST http://url:port/process -F "file=@data/pdf/sample.pdf"

# Process a JSON file
curl -X POST http://url:port/process -F "file=@test2/sample.json"

# Process text content
curl -X POST http://localhost:8999/process \
  -H "Content-Type: application/json" \
  -d '{"text": "Your document text here"}'

OntoCast implements the following MCP-compatible endpoints:

  • GET /health: Health check endpoint
  • GET /info: Service information endpoint
  • POST /process: Document processing endpoint
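
The same endpoints can be called from Python. A sketch using only the standard library; the request is built but only sent (via urlopen) once the server is actually running:

```python
import json
import urllib.request

BASE = "http://localhost:8999"  # default OntoCast port

# Build a /process request for raw text
payload = json.dumps({"text": "Acme Corp announced record Q3 revenue."}).encode()
req = urllib.request.Request(
    f"{BASE}/process",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.method, req.full_url)
# With the server up:
# resp = urllib.request.urlopen(req)
# result = json.load(resp)  # facts, ontology, and metadata per the description above
```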

Processing Filesystem Documents

uv run serve \
  --ontology-directory ONTOLOGY_DIR \
  --working-directory WORKING_DIR \
  --input-path DOCUMENT_DIR

  • JSON documents are expected to contain their text in a "text" field
  • recursion_limit is calculated as max_visits * estimated_chunks; the estimated number of chunks defaults to 30 or is read from .env (via ESTIMATED_CHUNKS)
  • Port 8999 is used by default
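
The recursion_limit calculation described above can be sketched as follows (a paraphrase of the documented behavior, not OntoCast's actual code):

```python
import os

DEFAULT_ESTIMATED_CHUNKS = 30

def recursion_limit(max_visits: int) -> int:
    # Estimated chunk count: 30 by default, overridable via ESTIMATED_CHUNKS in .env
    estimated_chunks = int(os.getenv("ESTIMATED_CHUNKS", DEFAULT_ESTIMATED_CHUNKS))
    return max_visits * estimated_chunks

print(recursion_limit(3))  # 90 with the default estimate
```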

To build the Docker image:

docker buildx build -t growgraph/ontocast:0.1.1 . 2>&1 | tee build.log
src/
├── agent.py     # Main agent workflow implementation
├── onto.py      # Ontology and RDF graph handling
├── nodes/       # Individual workflow nodes
├── tools/       # Tool implementations
└── prompts/     # LLM prompts

The extraction follows a multi-stage workflow:

Workflow diagram

  1. Document Preparation

    • [Optional] Convert to Markdown
    • Text chunking
  2. Ontology Processing

    • Ontology selection
    • Text to ontology triples
    • Ontology critique
  3. Fact Extraction

    • Text to facts
    • Facts critique
    • Ontology sublimation
  4. Chunk Normalization

    • Chunk KG aggregation
    • Entity/Property Disambiguation
  5. Storage

    • Knowledge graph storage
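
The five stages form a linear pipeline over document chunks; a schematic sketch where every function name and body is illustrative, not OntoCast's API:

```python
def prepare_document(raw: str) -> list[str]:
    # Stage 1: [optional] Markdown conversion + text chunking (naive split here)
    return [c.strip() for c in raw.split("\n\n") if c.strip()]

def ontology_stage(chunk: str) -> str:
    # Stage 2: ontology selection, text -> ontology triples, critique
    return f"ontology({chunk[:15]}...)"

def fact_stage(chunk: str) -> str:
    # Stage 3: text -> facts, facts critique, ontology sublimation
    return f"facts({chunk[:15]}...)"

def normalize(per_chunk: list[str]) -> list[str]:
    # Stage 4: chunk KG aggregation + entity/property disambiguation
    return sorted(set(per_chunk))

def store(kg: list[str]) -> int:
    # Stage 5: knowledge-graph storage; returns number of fragments stored
    return len(kg)

chunks = prepare_document("First paragraph.\n\nSecond paragraph.")
kg = normalize([fact_stage(c) for c in chunks] + [ontology_stage(c) for c in chunks])
print(store(kg))  # 4
```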

Full documentation is available at: growgraph.github.io/ontocast

  1. Add a triple store for serialization/ontology management
  2. Replace graph-to-text conversion with a symbolic graph interface (agent tools for working with triples)

Contributions are welcome! Please feel free to submit a Pull Request.

  • Uses RDFLib for semantic triple management
  • Uses Docling for PDF/PPTX conversion
  • Uses OpenAI language models, or open models served via Ollama, for fact extraction
  • Uses LangChain/LangGraph for agent orchestration