Show HN: Extrai – An open-source tool to fight LLM randomness in data extraction





With extrai, you can use LLMs to extract structured data from text documents; the results are mapped onto a given SQLModel and persisted to your database.

The core of the library is its Consensus Mechanism: we run the same extraction request multiple times, against the same or different providers, and keep only the values that meet an agreement threshold across the responses.
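To make the idea concrete, here is a heavily simplified sketch of field-level majority voting over several JSON responses. It is illustrative only: the majority_vote helper and the 0.5 threshold are assumptions for this sketch, not extrai's actual internals, which operate on nested JSON and feed the agreed result into SQLModel hydration.

    import json
    from collections import Counter

    def majority_vote(responses: list[dict], threshold: float = 0.5) -> dict:
        """Keep only field values that appear in at least `threshold` of the responses.

        Hypothetical helper for illustration; not part of extrai's API.
        """
        consensus = {}
        keys = {key for response in responses for key in response}
        for key in keys:
            # Serialize values so dicts/lists can be counted like scalars.
            counts = Counter(
                json.dumps(response[key], sort_keys=True)
                for response in responses
                if key in response
            )
            value, count = counts.most_common(1)[0]
            if count / len(responses) >= threshold:
                consensus[key] = json.loads(value)
        return consensus

    # Three hypothetical LLM revisions of the same extraction request:
    revisions = [
        {"name": "SuperWidget", "price": 99.99},
        {"name": "SuperWidget", "price": 99.99},
        {"name": "Super Widget", "price": 99.99},
    ]
    print(majority_vote(revisions))  # -> {'name': 'SuperWidget', 'price': 99.99} (key order may vary)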

extrai also offers other features, such as generating SQLModels from a prompt and example documents, and generating few-shot examples. For complex, nested data, the library offers Hierarchical Extraction, which breaks the extraction down into manageable, hierarchical steps. It also includes built-in analytics for monitoring performance and output quality.
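As an example of the kind of nested schema Hierarchical Extraction targets, a parent/child SQLModel pair might look like the following. Invoice and LineItem are hypothetical models made up for this sketch, not anything shipped with the library; the point is simply that the root record and its children can be extracted as separate, linked steps.

    from typing import List, Optional

    from sqlmodel import Field, Relationship, SQLModel

    class Invoice(SQLModel, table=True):
        id: Optional[int] = Field(default=None, primary_key=True)
        customer: str
        total: float
        # One invoice owns many line items.
        items: List["LineItem"] = Relationship(back_populates="invoice")

    class LineItem(SQLModel, table=True):
        id: Optional[int] = Field(default=None, primary_key=True)
        description: str
        quantity: int
        unit_price: float
        invoice_id: Optional[int] = Field(default=None, foreign_key="invoice.id")
        invoice: Optional[Invoice] = Relationship(back_populates="items")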

For a complete guide, please see the full documentation.

The library is built around a few key components that work together to manage the extraction workflow. The following diagram illustrates it at a high level (see Architecture Overview):

    graph TD
        %% Define styles for different stages for better colors
        classDef inputStyle fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px,color:#0c4a6e
        classDef processStyle fill:#eef2ff,stroke:#6366f1,stroke-width:2px,color:#3730a3
        classDef consensusStyle fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,color:#78350f
        classDef outputStyle fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,color:#14532d
        classDef modelGenStyle fill:#fdf4ff,stroke:#a855f7,stroke-width:2px,color:#581c87

        subgraph "Inputs (Static Mode)"
            A["📄<br/>Documents"]
            B["🏛️<br/>SQLAlchemy Models"]
            L1["🤖<br/>LLM"]
        end

        subgraph "Inputs (Dynamic Mode)"
            C["📋<br/>Task Description<br/>(User Prompt)"]
            D["📚<br/>Example Documents"]
            L2["🤖<br/>LLM"]
        end

        subgraph "Model Generation<br/>(Optional)"
            MG("🔧<br/>Generate SQLModels<br/>via LLM")
        end

        subgraph "Data Extraction"
            EG("📝<br/>Example Generation<br/>(Optional)")
            P("✍️<br/>Prompt Generation")
            subgraph "LLM Extraction Revisions"
                direction LR
                E1("🤖<br/>Revision 1")
                H1("💧<br/>SQLAlchemy Hydration 1")
                E2("🤖<br/>Revision 2")
                H2("💧<br/>SQLAlchemy Hydration 2")
                E3("🤖<br/>...")
                H3("💧<br/>...")
            end
            F("🤝<br/>JSON Consensus")
            H("💧<br/>SQLAlchemy Hydration")
        end

        subgraph Outputs
            SM["🏛️<br/>Generated SQLModels<br/>(Optional)"]
            O["✅<br/>Hydrated Objects"]
            DB("💾<br/>Database Persistence<br/>(Optional)")
        end

        %% Connections for Static Mode
        L1 --> P
        A --> P
        B --> EG
        EG --> P
        P --> E1
        P --> E2
        P --> E3
        E1 --> H1
        E2 --> H2
        E3 --> H3
        H1 --> F
        H2 --> F
        H3 --> F
        F --> H
        H --> O
        H --> DB

        %% Connections for Dynamic Mode
        L2 --> MG
        C --> MG
        D --> MG
        MG --> EG
        EG --> P
        MG --> SM

        %% Apply styles
        class A,B,C,D,L1,L2 inputStyle;
        class P,E1,E2,E3,H,EG processStyle;
        class F consensusStyle;
        class O,DB,SM outputStyle;
        class MG modelGenStyle;

Install the library from PyPI:

pip install extrai-workflow

For a more detailed guide, please see the Getting Started Tutorial.

Here is a minimal example:

    import asyncio
    from typing import Optional

    from sqlmodel import Field, SQLModel, Session, create_engine

    from extrai.core import WorkflowOrchestrator
    from extrai.llm_providers.huggingface_client import HuggingFaceClient

    # 1. Define your data model
    class Product(SQLModel, table=True):
        id: Optional[int] = Field(default=None, primary_key=True)
        name: str
        price: float

    # 2. Set up the orchestrator
    llm_client = HuggingFaceClient(api_key="YOUR_HF_API_KEY")
    engine = create_engine("sqlite:///:memory:")
    orchestrator = WorkflowOrchestrator(
        llm_client=llm_client,
        db_engine=engine,
        root_model=Product,
    )

    # 3. Run the extraction and verify
    text = "The new SuperWidget costs $99.99."
    with Session(engine) as session:
        asyncio.run(orchestrator.synthesize_and_save([text], db_session=session))
        product = session.query(Product).first()
        print(product)  # Expected: name='SuperWidget' price=99.99 id=1
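Continuing from the example above, the verification step could equally use SQLModel's select API instead of the legacy session.query style; this is plain SQLModel and does not depend on extrai:

    from sqlmodel import select

    with Session(engine) as session:
        # Read back the row that the orchestrator persisted.
        product = session.exec(select(Product).where(Product.name == "SuperWidget")).first()
        print(product)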

For more in-depth examples, see the /examples directory in the repository.

We welcome contributions! Please see the Contributing Guide for details on how to set up your development environment, run tests, and submit a pull request.

This project is licensed under the MIT License - see the LICENSE file for details.
