Creating Knowledge Graphs from NLP Datasets


Watch the entire English language blossom from Wiktionary + Google Books N-grams, rendered as a living, breathing prefix galaxy.

Overview

What you’re seeing is a timelapse of English vocabulary growth from 1800 to 2019. Each node represents a letter prefix, and its size reflects how many words with that prefix have appeared up to that year. The layout is stable over time so your eye can track change.
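One way to keep a radial layout stable across frames is to derive each node's position purely from its prefix string, never from any one year's data. This is only a sketch of that idea, not necessarily what the project's layout code does: the angle comes from the prefix's fractional alphabetical rank, the radius from its depth, so a node never moves between frames.

```python
import math

def radial_position(prefix, ring_spacing=80.0):
    """Map a letter prefix to a fixed (x, y) position.

    Angle: fractional alphabetical rank of the prefix in [0, 1).
    Radius: proportional to prefix depth, so deeper prefixes sit on
    outer rings. Positions depend only on the prefix itself.
    """
    rank = 0.0
    for i, ch in enumerate(prefix):
        rank += (ord(ch) - ord("a")) / (26.0 ** (i + 1))
    angle = 2 * math.pi * rank
    radius = ring_spacing * len(prefix)
    return radius * math.cos(angle), radius * math.sin(angle)

# "a" has rank 0, so it sits at angle 0 on the innermost ring
x, y = radial_position("a")
```

Because the mapping is deterministic, frames rendered for different years can be laid out independently and still line up.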

We built this from two public datasets. Wiktionary provides a list of English lemmas. Google Books 1-grams provides yearly counts and volumes. We combine them to estimate a robust first year for each word, then accumulate by prefixes up to six letters.

A lemma is the canonical, or dictionary, form of a word: the base form that stands in for all of its inflected variants (for example, "run" is the lemma of "runs", "ran", and "running").
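The accumulation step can be sketched in a few lines: given (word, first year) pairs, tally every prefix up to six letters, then turn the per-year tallies into cumulative counts. Function and variable names here are illustrative, not the project's actual API.

```python
from collections import defaultdict

def accumulate_prefix_counts(first_years, max_prefix_len=6):
    """Count, per prefix and year, how many words with that prefix
    have first appeared by that year.

    first_years: iterable of (word, first_year) pairs.
    Returns {prefix: [(year, cumulative_count), ...]} sorted by year.
    """
    # Tally newly appearing words per (prefix, year)
    new_per_year = defaultdict(lambda: defaultdict(int))
    for word, year in first_years:
        for n in range(1, min(len(word), max_prefix_len) + 1):
            new_per_year[word[:n]][year] += 1

    # Convert per-year tallies into cumulative series
    cumulative = {}
    for prefix, years in new_per_year.items():
        total, series = 0, []
        for year in sorted(years):
            total += years[year]
            series.append((year, total))
        cumulative[prefix] = series
    return cumulative

counts = accumulate_prefix_counts(
    [("cat", 1800), ("car", 1850), ("dog", 1800)])
# counts["ca"] -> [(1800, 1), (1850, 2)]
```

Each node in the animation then just looks up its prefix's cumulative count at the current year to decide its size.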

English Lexicon Time Machine visualizes the evolution of the English language through time. By combining Wiktionary lemmas with Google Books N-gram data, we trace when words first appeared and render their growth as a radial prefix trie that expands across decades.

Key Features

  • Zero-config setup – ./setup.sh spins up the virtualenv, fetches every dataset, caches the heavy lifts, and ships final MP4/GIF output
  • Radial growth cinematics – the trie erupts from the core alphabet, framing decades of linguistic evolution as a neon fractal
  • Repeatable science – every artifact (lemmata, first-year inference, trie counts, layouts) checkpoints to disk and into a reusable tarball for instant re-renders
  • Battle-tested – streams 26 full 1-gram shards, handles 1.4GB Wiktionary dumps, and renders 220 frames in glorious 1080p

Quickstart

Run ./setup.sh. The script will:

  1. Create/upgrade venv/ with Python 3
  2. Download Wiktionary + Google Books 1-gram shards (a–z)
  3. Extract English lemmas, infer first-use years, aggregate prefix counts
  4. Render 220 radial frames (outputs/frames/frame-0000.png → frame-0219.png)
  5. Encode outputs/english_trie_timelapse.mp4 and a share-ready GIF

Rerun the script anytime—artifact caching means future passes jump straight to rendering.

Pipeline Architecture

| Stage | Script | Output |
| --- | --- | --- |
| Lemma extraction | src/ingest/wiktionary_extract.py | artifacts/lemmas/lemmas.tsv |
| First-year inference | src/ingest/ngram_first_year.py | artifacts/years/first_years.tsv |
| Prefix aggregation | src/build/build_prefix_trie.py | artifacts/trie/prefix_counts.jsonl |
| Layout generation | src/viz/layout.py | artifacts/layout/prefix_positions.json |
| Frame rendering | src/viz/render_frames.py | outputs/frames/ |
| Encoding | src/viz/encode.py | outputs/english_trie_timelapse.mp4 + .gif |
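The "robust first year" in the first-year inference stage can be sketched as follows: rather than taking the literal earliest n-gram year (often OCR noise), require a year to clear small match-count and volume-count thresholds. The thresholds and function name are assumptions for illustration, not necessarily what src/ingest/ngram_first_year.py does.

```python
def infer_first_year(yearly, min_match_count=5, min_volumes=2):
    """Pick the earliest year in which a word clears noise thresholds.

    yearly: (year, match_count, volume_count) rows from the Google
    Books 1-gram data for one word. Isolated single-volume hits in
    early years are usually OCR artifacts, so they are skipped.
    Returns None if the word never clears the thresholds.
    """
    for year, matches, volumes in sorted(yearly):
        if matches >= min_match_count and volumes >= min_volumes:
            return year
    return None

# A word with a spurious 1802 hit and real usage from 1861 onward
rows = [(1802, 1, 1), (1861, 12, 7), (1862, 30, 15)]
first = infer_first_year(rows)  # 1861, not 1802
```

Requiring multiple volumes as well as multiple matches guards against a single badly scanned book injecting a word decades early.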

Render Only (after initial run)

```shell
source venv/bin/activate
python -m src.viz.render_frames artifacts/trie/prefix_counts.jsonl outputs/frames
python -m src.viz.encode outputs/frames outputs/english_trie_timelapse.mp4 outputs/english_trie_timelapse.gif
```

Use flags such as --min-radius, --max-radius, --base-edge-alpha, or --start-progress to tune the visualization.

Neo4j Integration

Load artifacts/years/first_years.tsv to explore the word data in Neo4j (compatible with both Community and Enterprise editions):

```cypher
:param batch => $rows;
UNWIND $rows AS row
WITH row WHERE row.word IS NOT NULL AND row.word <> ""
MERGE (w:Word {text: row.word})
SET w.first_year = CASE WHEN row.first_year = "" THEN NULL
                        ELSE toInteger(row.first_year) END;
```
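Since the query takes its rows as a parameter, a loader only needs to stream the TSV in batches. A stdlib-only sketch of that batching (the two-column word/first_year format and the batch size are assumptions); with the official neo4j Python driver, each yielded batch would then be sent as session.run(query, rows=batch):

```python
import csv
import io

def batched_rows(tsv_file, batch_size=10_000):
    """Yield lists of {'word', 'first_year'} dicts, each sized for
    one UNWIND $rows call of the Cypher query above."""
    reader = csv.DictReader(tsv_file, delimiter="\t",
                            fieldnames=["word", "first_year"])
    batch = []
    for row in reader:
        if not row["word"]:
            continue  # the query's WHERE clause also skips these
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

sample = io.StringIO("aardvark\t1822\nabacus\t1387\n\t\nzebra\t1600\n")
batches = list(batched_rows(sample, batch_size=2))
# two batches: [aardvark, abacus] and [zebra]; the empty row is dropped
```

Batching keeps transactions small, which matters when the TSV holds hundreds of thousands of lemmas.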
