Creating Knowledge Graphs from NLP Datasets


Watch the entire English language blossom from Wiktionary + Google Books N-grams, rendered as a living, breathing prefix galaxy.

Overview

What you’re seeing is a timelapse of English vocabulary growth from 1800 to 2019. Each node represents a letter prefix, and its size reflects how many words with that prefix have appeared up to that year. The layout is stable over time so your eye can track change.
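One way to keep a radial layout stable across frames is to derive each node's position purely from its prefix string, never from any one year's data. This is only a sketch of that idea, not necessarily what the project's layout code does: the angle comes from the prefix's fractional alphabetical rank, the radius from its depth, so a node never moves between frames.

```python
import math

def radial_position(prefix, ring_spacing=80.0):
    """Map a letter prefix to a fixed (x, y) position.

    Angle: fractional alphabetical rank of the prefix in [0, 1).
    Radius: proportional to prefix depth, so deeper prefixes sit on
    outer rings. Positions depend only on the prefix itself.
    """
    rank = 0.0
    for i, ch in enumerate(prefix):
        rank += (ord(ch) - ord("a")) / (26.0 ** (i + 1))
    angle = 2 * math.pi * rank
    radius = ring_spacing * len(prefix)
    return radius * math.cos(angle), radius * math.sin(angle)

# "a" has rank 0, so it sits at angle 0 on the innermost ring
x, y = radial_position("a")
```

Because the mapping is deterministic, frames rendered for different years can be laid out independently and still line up.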

We built this from two public datasets. Wiktionary provides a list of English lemmas. Google Books 1-grams provides yearly counts and volumes. We combine them to estimate a robust first year for each word, then accumulate by prefixes up to six letters.

A lemma is the canonical, or dictionary, form of a word: the base form that stands in for all of its inflected variants (for example, "run" is the lemma of "runs", "ran", and "running").
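The accumulation step can be sketched in a few lines: given (word, first year) pairs, tally every prefix up to six letters, then turn the per-year tallies into cumulative counts. Function and variable names here are illustrative, not the project's actual API.

```python
from collections import defaultdict

def accumulate_prefix_counts(first_years, max_prefix_len=6):
    """Count, per prefix and year, how many words with that prefix
    have first appeared by that year.

    first_years: iterable of (word, first_year) pairs.
    Returns {prefix: [(year, cumulative_count), ...]} sorted by year.
    """
    # Tally newly appearing words per (prefix, year)
    new_per_year = defaultdict(lambda: defaultdict(int))
    for word, year in first_years:
        for n in range(1, min(len(word), max_prefix_len) + 1):
            new_per_year[word[:n]][year] += 1

    # Convert per-year tallies into cumulative series
    cumulative = {}
    for prefix, years in new_per_year.items():
        total, series = 0, []
        for year in sorted(years):
            total += years[year]
            series.append((year, total))
        cumulative[prefix] = series
    return cumulative

counts = accumulate_prefix_counts(
    [("cat", 1800), ("car", 1850), ("dog", 1800)])
# counts["ca"] -> [(1800, 1), (1850, 2)]
```

Each node in the animation then just looks up its prefix's cumulative count at the current year to decide its size.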

English Lexicon Time Machine visualizes the evolution of the English language through time. By combining Wiktionary lemmas with Google Books N-gram data, we trace when words first appeared and render their growth as a radial prefix trie that expands across decades.

Key Features

  • Zero-config setup – ./setup.sh spins up the virtualenv, fetches every dataset, caches the heavy lifts, and ships final MP4/GIF output
  • Radial growth cinematics – the trie erupts from the core alphabet, framing decades of linguistic evolution as a neon fractal
  • Repeatable science – every artifact (lemmata, first-year inference, trie counts, layouts) checkpoints to disk and into a reusable tarball for instant re-renders
  • Battle-tested – streams 26 full 1-gram shards, handles 1.4GB Wiktionary dumps, and renders 220 frames in glorious 1080p

Quickstart

Run ./setup.sh. The script will:

  1. Create/upgrade venv/ with Python 3
  2. Download Wiktionary + Google Books 1-gram shards (a–z)
  3. Extract English lemmas, infer first-use years, aggregate prefix counts
  4. Render 220 radial frames (outputs/frames/frame-0000.png → frame-0219.png)
  5. Encode outputs/english_trie_timelapse.mp4 and a share-ready GIF

Rerun the script anytime—artifact caching means future passes jump straight to rendering.

Pipeline Architecture

| Stage | Script | Output |
| --- | --- | --- |
| Lemma extraction | src/ingest/wiktionary_extract.py | artifacts/lemmas/lemmas.tsv |
| First-year inference | src/ingest/ngram_first_year.py | artifacts/years/first_years.tsv |
| Prefix aggregation | src/build/build_prefix_trie.py | artifacts/trie/prefix_counts.jsonl |
| Layout generation | src/viz/layout.py | artifacts/layout/prefix_positions.json |
| Frame rendering | src/viz/render_frames.py | outputs/frames/ |
| Encoding | src/viz/encode.py | outputs/english_trie_timelapse.mp4 + .gif |
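The "robust first year" in the first-year inference stage can be sketched as follows: rather than taking the literal earliest n-gram year (often OCR noise), require a year to clear small match-count and volume-count thresholds. The thresholds and function name are assumptions for illustration, not necessarily what src/ingest/ngram_first_year.py does.

```python
def infer_first_year(yearly, min_match_count=5, min_volumes=2):
    """Pick the earliest year in which a word clears noise thresholds.

    yearly: (year, match_count, volume_count) rows from the Google
    Books 1-gram data for one word. Isolated single-volume hits in
    early years are usually OCR artifacts, so they are skipped.
    Returns None if the word never clears the thresholds.
    """
    for year, matches, volumes in sorted(yearly):
        if matches >= min_match_count and volumes >= min_volumes:
            return year
    return None

# A word with a spurious 1802 hit and real usage from 1861 onward
rows = [(1802, 1, 1), (1861, 12, 7), (1862, 30, 15)]
first = infer_first_year(rows)  # 1861, not 1802
```

Requiring multiple volumes as well as multiple matches guards against a single badly scanned book injecting a word decades early.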

Render Only (after initial run)

```shell
source venv/bin/activate
python -m src.viz.render_frames artifacts/trie/prefix_counts.jsonl outputs/frames
python -m src.viz.encode outputs/frames outputs/english_trie_timelapse.mp4 outputs/english_trie_timelapse.gif
```

Use flags such as --min-radius, --max-radius, --base-edge-alpha, or --start-progress to tune the visualization.

Neo4j Integration

Load artifacts/years/first_years.tsv to explore the word data in Neo4j (compatible with both Community and Enterprise editions):

```cypher
:param batch => $rows;
UNWIND $rows AS row
WITH row WHERE row.word IS NOT NULL AND row.word <> ""
MERGE (w:Word {text: row.word})
SET w.first_year = CASE WHEN row.first_year = "" THEN NULL
                        ELSE toInteger(row.first_year) END;
```
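Since the query takes its rows as a parameter, a loader only needs to stream the TSV in batches. A stdlib-only sketch of that batching (the two-column word/first_year format and the batch size are assumptions); with the official neo4j Python driver, each yielded batch would then be sent as session.run(query, rows=batch):

```python
import csv
import io

def batched_rows(tsv_file, batch_size=10_000):
    """Yield lists of {'word', 'first_year'} dicts, each sized for
    one UNWIND $rows call of the Cypher query above."""
    reader = csv.DictReader(tsv_file, delimiter="\t",
                            fieldnames=["word", "first_year"])
    batch = []
    for row in reader:
        if not row["word"]:
            continue  # the query's WHERE clause also skips these
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

sample = io.StringIO("aardvark\t1822\nabacus\t1387\n\t\nzebra\t1600\n")
batches = list(batched_rows(sample, batch_size=2))
# two batches: [aardvark, abacus] and [zebra]; the empty row is dropped
```

Batching keeps transactions small, which matters when the TSV holds hundreds of thousands of lemmas.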
