Show HN: Virtual Ontologies with Claude Code


Michael Fitzgerald

Palantir-lite with Claude Code and Virtual Ontologies — astonishingly good natural language to SQL to insights

“We f*cking won because of ontologies” — Alex Karp (in very loose paraphrase)

Disclaimer: Claude Code wrote large chunks of this after interviewing me, reading the code and reading the transcript of a recorded codewalk. I will still take responsibility for the poor writing.

Skip to the case study if you want to see it in action!


I think it was around Palantir’s Q4 earnings call when I binged a bunch of Karp interviews and first heard him mention ontologies. The quote above stuck in my mind, but it’s probably my own synthesis. Though if you’ve watched Karp, it’s plausibly his own.

I was familiar with ontology as a philosophical discipline, perhaps best described by Quine as “on what there is”, but put more plainly as the study of being, existence and the nature of reality. Woah. Heady stuff. But what did that have to do with data and AI? Clearly I should be using them if the eggheads at Palantir are (I say that lovingly, mad respect).

I did some cursory research, finding Palantir’s own documentation, as well as a significant history of ontology research in computer science. In that context, an ontology is at its core a formal description of real life that maps to data. The set of entities, their properties, and their relationships. It functions as a semantic layer on top of the data, providing meaning. It turns out that such a description is incredible context for an LLM to manipulate and analyze data.

Palantir has erected a military-grade ecosystem to implement this, and with great success. But I thought it would be fun to try and roll my own janky system and see what kind of mileage I could get.

My first approach was based on the Web Ontology Language (OWL), which required me to convert and store data as RDF triples, pass the OWL specification to an LLM to generate SPARQL queries, retrieve the data, and then pass it back to the LLM for analysis, invoking Python for large retrievals or more robust analysis.

It worked incredibly well. SPARQL is a small, strict language tied directly to the data representation, so you get efficient graph traversals without complex SQL joins, and you sidestep LLMs' historically poor text-to-SQL generation (this turned out to be less of a limiting factor with a virtual ontology as context).
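To make the traversal point concrete, here is a minimal sketch in pure Python (no triple store) of the upstream/downstream walk that is a single property-path expression in SPARQL but a recursive self-join in SQL. The equipment names and triples are illustrative, not from the original system.

```python
# Toy triple set standing in for an RDF store. In SPARQL the whole
# downstream walk is one property-path query:
#   SELECT ?d WHERE { :press :is_upstream_of+ ?d . }
# In SQL the same traversal needs a recursive CTE or repeated self-joins.
triples = {
    ("press",   "is_upstream_of", "welder"),
    ("welder",  "is_upstream_of", "painter"),
    ("painter", "is_upstream_of", "packer"),
}

def downstream_of(equipment: str) -> set[str]:
    """Transitive closure of is_upstream_of, starting at `equipment`."""
    found, frontier = set(), {equipment}
    while frontier:
        nxt = {o for (s, p, o) in triples
               if p == "is_upstream_of" and s in frontier}
        frontier = nxt - found
        found |= frontier
    return found

print(downstream_of("press"))  # every machine the press can starve
```

The traversal is trivial when the relationship is first-class in the data model; that is the mileage SPARQL gives you for free.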

But I had introduced major overhead: an intermediate representation in a triple store, and an exotic query language most teams have never heard of. A graph database introduces the same kinds of issues. This is painful when virtually all of an enterprise's data is readily available via SQL. What to do?

Why not virtualize the ontology layer?

Context is King

Instead of converting data into triples, why not leave it in SQL but give the LLM enough semantic context to reason about it ontologically? Let the LLM handle the semantic reasoning, then output SQL for the existing infrastructure.

Enter the enabling platform: Claude Code

Claude Code is essentially a natural-language REPL with excellent context management, tool orchestration, and session persistence that works out of the box. And it can be run in a loop: question generation, data retrieval, analysis, repeat.

The magic happens when you pair two things in its context:

1. An ontology specification (formal description of your business entities, relationships, and context)

2. A database schema (the actual structure of your SQL data)

When these are properly paired — using the same terminology, mapping business concepts to table structure — Claude Code becomes a semantic reasoning engine that outputs SQL.

Moreover, it can issue a query, retrieve the results and mine insights.
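As a rough sketch of what "pairing" means mechanically: the ontology spec and the schema are just two documents placed side by side in the model's context, followed by the question. Everything below is illustrative, the spec fragments, the template, and the helper are assumptions about the shape of the prompt, not the author's actual files; Claude Code handles this orchestration for you.

```python
# Minimal sketch: pair ontology + schema in one prompt context.
# The resulting string is what you would hand to the model, whose
# reply (a SQL query) then goes to the existing database.

ONTOLOGY_SPEC = """\
classes:
  Equipment: {description: "Physical assets on production line"}
  DowntimeEvent: {description: "Production stoppage with reason code"}
"""

DB_SCHEMA = """\
tables:
  mes_data:
    columns: [equipment_id, downtime_reason, timestamp]
"""

PROMPT_TEMPLATE = """\
You are a semantic reasoning engine over a SQL database.

## Ontology (business meaning)
{ontology}

## Schema (physical layout; terms match the ontology)
{schema}

Question: {question}
Answer with a single SQL query.
"""

def build_prompt(question: str) -> str:
    """Assemble the paired context for one question."""
    return PROMPT_TEMPLATE.format(
        ontology=ONTOLOGY_SPEC, schema=DB_SCHEMA, question=question)

prompt = build_prompt("Which equipment had the most downtime last week?")
print(prompt)
```

The whole trick is in the pairing: because the ontology and the schema use the same terms, the model can reason in business language and emit SQL in one pass.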

Case study: analysis of manufacturing execution system data

note: this uses synthetically generated data that is reflective of real MES systems

Implementation notes

Samples of the ontology spec and database schema

# Ontology Specification (T-Box/R-Box formalization) - truncated
ontology:
  name: "Manufacturing Execution System"
  domain: "Production Operations"

classes:
  Equipment:
    description: "Physical assets on production line"
    attributes:
      - efficiency
      - upstream_dependencies
      - maintenance_schedule

  DowntimeEvent:
    description: "Production stoppage with reason code"
    attributes:
      - reason_code
      - duration
      - cascade_impact

relationships:
  is_upstream_of:
    domain: Equipment
    range: Equipment
    properties:
      - cascade_delay: "typical seconds before downstream impact"
      - impact_correlation: "probability of cascade"

business_rules:
  material_starvation:
    when: "upstream equipment fails"
    then: "downstream shows UNP-MAT code"
    delay: "30-300 seconds typically"

# Database Schema - Paired with Ontology - truncated
tables:
  mes_data:
    columns:
      equipment_id:
        type: string
        ontology_class: Equipment
        description: "Maps to Equipment entity"
      downtime_reason:
        type: string
        ontology_class: DowntimeEvent.reason_code
        description: "Reason codes for stoppages"
      timestamp:
        type: datetime
        description: "Event time, 5-minute granularity"

The pairing is crucial. The LLM is naturally good at parsing both formal specs, interleaving them with natural language, and traversing the ontology to understand phenomena as they manifest in real-world entities, then translating all of that into a SQL query.
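To illustrate the payoff end to end, here is a hypothetical fragment: a few synthetic rows in a `mes_data`-shaped SQLite table, plus the kind of SQL a model could emit for "which stoppages look like material starvation cascades?" by applying the `material_starvation` rule above (a `UNP-MAT` code on downstream equipment within 30-300 seconds of an upstream failure). The data, the reason codes, the upstream pairing, and the SQL are all illustrative, not from the case study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mes_data (
    equipment_id    TEXT,
    downtime_reason TEXT,     -- maps to DowntimeEvent.reason_code
    ts              INTEGER   -- epoch seconds, for simplicity
);
-- upstream failure, then a downstream UNP-MAT 120 s later (a cascade),
-- and an unrelated UNP-MAT hours later (not a cascade)
INSERT INTO mes_data VALUES
    ('press',  'MECH-FAIL', 1000),
    ('welder', 'UNP-MAT',   1120),
    ('welder', 'UNP-MAT',   9000);
""")

# SQL of the kind the model emits after traversing is_upstream_of and
# applying material_starvation: press is_upstream_of welder, 30-300 s window.
cascades = conn.execute("""
    SELECT d.equipment_id, d.ts
    FROM mes_data u
    JOIN mes_data d
      ON u.equipment_id = 'press'        -- upstream of 'welder' per ontology
     AND d.equipment_id = 'welder'
     AND d.downtime_reason = 'UNP-MAT'
     AND d.ts - u.ts BETWEEN 30 AND 300  -- business-rule delay window
    WHERE u.downtime_reason = 'MECH-FAIL'
""").fetchall()
print(cascades)  # the 120 s event qualifies; the later one does not
```

The join conditions are exactly the business rule restated in schema terms, which is why the ontology/schema pairing makes the generation tractable.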
