2025 October 29
word2vec popularized the idea of representing words as vectors where
semantically similar words are positioned close to each other in the vector
space. Nowadays these vectors are usually called embeddings.
A neat consequence of the word2vec approach is that adding and subtracting
vectors produces semantically logical results. From Efficient Estimation of
Word Representations in Vector Space (the word2vec paper):
Using a word offset technique where simple algebraic operations are performed
on the word vectors, it was shown for example that vector("King") -
vector("Man") + vector("Woman") results in a vector that is closest to the
vector representation of the word Queen.
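To make the word offset technique concrete, here is a minimal sketch with hand-made 3-dimensional vectors. The numbers are invented purely for illustration (real word2vec vectors have hundreds of learned dimensions), and the `analogy` helper is hypothetical, not something from the paper.

```python
from math import sqrt

# Hand-made toy "embeddings", invented for illustration only.
# Rough meaning of each dimension: [royalty, maleness, femaleness].
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def analogy(start, minus, plus):
    """Return the word closest to vector(start) - vector(minus) + vector(plus)."""
    target = [s - m + p for s, m, p in zip(vectors[start], vectors[minus], vectors[plus])]
    # Skip the three input words so the answer is a new word.
    candidates = [word for word in vectors if word not in (start, minus, plus)]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


print(analogy("king", "man", "woman"))  # prints: queen
```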
Does word2vec-style vector arithmetic work in technical writing contexts?
Experiments§
word2vec was published in 2013. Embedding models have come a long way since
then. word2vec models could only operate on single words. A vector always
represented a single word. Modern embedding models can operate on arbitrary
text. A vector can now represent a word, paragraph, section, document, set of
documents, etc.
My experiments follow the same basic pattern of vector("King") -
vector("Man") + vector("Woman"), with one difference. The experiments
start out with a vector representing the full text of a document, not a
single-word vector.
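The arithmetic itself doesn't care whether a vector came from a word or a whole document, as long as every vector has the same number of dimensions (i.e. comes from the same model). A minimal sketch, using made-up 4-dimensional vectors and a hypothetical `offset_query` helper:

```python
def offset_query(doc, minus, plus):
    """Compute doc - minus + plus element-wise over equal-length vectors."""
    if not (len(doc) == len(minus) == len(plus)):
        raise ValueError("all vectors must have the same number of dimensions")
    return [d - m + p for d, m, p in zip(doc, minus, plus)]


# Made-up stand-ins for real embeddings.
doc = [0.5, 0.25, 0.75, 0.0]       # full text of a document
supabase = [0.5, 0.0, 0.0, 0.0]    # the single word to subtract
angular = [0.0, 0.0, 0.0, 0.5]     # the single word to add

print(offset_query(doc, supabase, angular))  # [0.0, 0.25, 0.75, 0.5]
```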
Same topic, different domain§
This is the first experiment. Starting with the vector for the full text of
Testing Your Database from the Supabase docs, subtract the vector for the
word supabase, and then add the vector for the word angular. The
resultant vector should be semantically close to the concept of “testing in
Angular”.
Different topic, same domain§
This is the second experiment. Starting with the vector for the full text of
Testing Your Database from the Supabase docs, subtract the vector for the
word testing, and then add the vector for the word vectors. The
resultant vector should be semantically close to the concept of “vectors in
Supabase”.
Task types§
From previous research I’ve learned that task types noticeably
affect Gemini Embedding’s outputs. EmbeddingGemma (the model I’ll be using in
the experiments) also supports task types. I’ll run both experiments twice:
once with default task types, and again with customized task types.
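With EmbeddingGemma via Sentence Transformers, a task type is applied by prepending a short prompt to the text before it gets embedded. The sketch below only builds the strings. The document template mirrors the one in my source code (see the appendix). The query template is an assumption based on my reading of the EmbeddingGemma model card, so double-check it before relying on it.

```python
# Assumption: this is the prefix behind the "Retrieval-query" prompt name.
QUERY_PROMPT = "task: search result | query: "
# Mirrors the document prompt used in experiments.py.
DOCUMENT_PROMPT = "title: {title} | text: "


def as_query(text):
    """Return the string that would be embedded for a query."""
    return QUERY_PROMPT + text


def as_document(text, title="none"):
    """Return the string that would be embedded for a document."""
    return DOCUMENT_PROMPT.format(title=title) + text


print(as_query("angular"))
print(as_document("Full text of the doc goes here.", title="Testing Your Database"))
```

With default task types, the text is embedded as-is, without either prefix.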
Verification§
There’s no way to directly verify that the resultant vectors are semantically
close to the expected concepts. What I can do instead is generate vectors from
the full texts of various docs, and then compare the resultant vectors from the
experiments against the vectors of these various docs using cosine
similarity.
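Cosine similarity only measures the angle between two vectors and ignores their lengths, which is convenient here because a vector produced by subtraction and addition is usually not unit length. A minimal sketch of the ranking step, with made-up 3-dimensional vectors in place of real doc embeddings (the names `cosine_similarity` and `rank` are hypothetical):

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up vectors standing in for real doc embeddings.
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
query = [2.0, 0.0, 0.0]  # deliberately not unit length

for name, score in rank(query, docs):
    print(f"{name} => {score:.3f}")
```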
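Here’s the full list of docs that I use in the experiments (the URLs are in
data.json in the appendix):
* Background Processing Using Web Workers (Angular)
* Refer To Locales By ID (Angular)
* Testing (Angular)
* Testing Services (Angular)
* LINESTRING (CockroachDB)
* Test Your Application Locally (CockroachDB)
* analysis_test (Skylib)
* bzl_library (Skylib)
* diff_test (Skylib)
* Actionability (Playwright)
* JUnit (Playwright)
* Writing Tests (Playwright)
* Branching (Supabase)
* Testing Your Database (Supabase)
* Testing Your Edge Functions (Supabase)
* Vector Columns (Supabase)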
For the Same topic, different domain experiment (Testing Your Database -
supabase + angular) I expect the resultant vector to be most similar to
Testing or Testing Services from the Angular docs. And for the
Different topic, same domain experiment (Testing Your Database - testing +
vectors) I expect the resultant vector to be most similar to Vector
Columns from the Supabase docs.
Note that I picked short docs because EmbeddingGemma only supports 2048 tokens of
input and I didn’t feel like dealing with chunking. Most of the docs revolve around
testing.
Results§
In the Same topic, different domain experiment (Testing Your Database -
supabase + angular) the resultant vector is most similar to Testing
and Testing Services from the Angular docs, as expected, when custom task
types are enabled:
[INFO] Running "same topic, different domain" experiment with customized task types
[INFO] Results:
[INFO] "Testing" (Angular) => 0.751456081867218
[INFO] "Testing Services" (Angular) => 0.6292878985404968
[INFO] "Background Processing Using Web Workers" (Angular) => 0.5090276598930359
[INFO] "Testing Your Database" (Supabase) => 0.5084458589553833
[INFO] "Refer To Locales By ID" (Angular) => 0.46428176760673523
[INFO] "Test Your Application Locally" (CockroachDB) => 0.4586600363254547
[INFO] "Writing Tests" (Playwright) => 0.4434031546115875
[INFO] "JUnit" (Playwright) => 0.4156876802444458
[INFO] "Actionability" (Playwright) => 0.396766722202301
[INFO] "analysis_test" (Skylib) => 0.3869394063949585
[INFO] "Testing Your Edge Functions" (Supabase) => 0.368389755487442
[INFO] "diff_test" (Skylib) => 0.3524951934814453
[INFO] "bzl_library" (Skylib) => 0.29295891523361206
[INFO] "LINESTRING" (CockroachDB) => 0.2778087854385376
[INFO] "Branching" (Supabase) => 0.26931506395339966
[INFO] "Vector Columns" (Supabase) => 0.23397961258888245
When using the default task types, the resultant vector is most similar to
Testing Your Database, i.e. the doc that the experiment started with:
[INFO] Running "same topic, different domain" experiment with default task types
[INFO] Results:
[INFO] "Testing Your Database" (Supabase) => 0.6590374708175659
[INFO] "Testing" (Angular) => 0.571465790271759
[INFO] "Testing Services" (Angular) => 0.46747612953186035
[INFO] "Test Your Application Locally" (CockroachDB) => 0.43749818205833435
[INFO] "Testing Your Edge Functions" (Supabase) => 0.4073418378829956
[INFO] "Writing Tests" (Playwright) => 0.3561333119869232
[INFO] "Background Processing Using Web Workers" (Angular) => 0.3353777527809143
[INFO] "Vector Columns" (Supabase) => 0.3085843324661255
[INFO] "LINESTRING" (CockroachDB) => 0.30450767278671265
[INFO] "Branching" (Supabase) => 0.29775649309158325
[INFO] "analysis_test" (Skylib) => 0.2946781814098358
[INFO] "Actionability" (Playwright) => 0.2879413962364197
[INFO] "JUnit" (Playwright) => 0.2845016121864319
[INFO] "Refer To Locales By ID" (Angular) => 0.2824022173881531
[INFO] "diff_test" (Skylib) => 0.26220911741256714
[INFO] "bzl_library" (Skylib) => 0.2447129189968109
In the Different topic, same domain experiment (Testing
Your Database - testing + vectors) the resultant vector is most similar to
Vector Columns, regardless of whether default or custom task types were used.
Custom task types:
[INFO] Running "different topic, same domain" experiment with customized task types
[INFO] Results:
[INFO] "Vector Columns" (Supabase) => 0.6380605697631836
[INFO] "Testing Your Database" (Supabase) => 0.44831225275993347
[INFO] "LINESTRING" (CockroachDB) => 0.32693782448768616
[INFO] "Background Processing Using Web Workers" (Angular) => 0.2737721800804138
[INFO] "Testing Your Edge Functions" (Supabase) => 0.25883781909942627
[INFO] "Branching" (Supabase) => 0.2509428560733795
[INFO] "Refer To Locales By ID" (Angular) => 0.2328835278749466
[INFO] "bzl_library" (Skylib) => 0.2133977860212326
[INFO] "Test Your Application Locally" (CockroachDB) => 0.20613139867782593
[INFO] "Testing" (Angular) => 0.16262517869472504
[INFO] "Actionability" (Playwright) => 0.14792931079864502
[INFO] "Writing Tests" (Playwright) => 0.14344163239002228
[INFO] "Testing Services" (Angular) => 0.13723336160182953
[INFO] "diff_test" (Skylib) => 0.12111848592758179
[INFO] "JUnit" (Playwright) => 0.11599748581647873
[INFO] "analysis_test" (Skylib) => 0.0979730486869812
Default task types:
[INFO] Running "different topic, same domain" experiment with default task types
[INFO] Results:
[INFO] "Vector Columns" (Supabase) => 0.6698287129402161
[INFO] "Testing Your Database" (Supabase) => 0.6086233854293823
[INFO] "Testing Your Edge Functions" (Supabase) => 0.36533844470977783
[INFO] "LINESTRING" (CockroachDB) => 0.34430524706840515
[INFO] "Branching" (Supabase) => 0.3141021430492401
[INFO] "Test Your Application Locally" (CockroachDB) => 0.29872700572013855
[INFO] "Background Processing Using Web Workers" (Angular) => 0.28414368629455566
[INFO] "bzl_library" (Skylib) => 0.26424312591552734
[INFO] "Refer To Locales By ID" (Angular) => 0.2537899315357208
[INFO] "Testing" (Angular) => 0.23542608320713043
[INFO] "Writing Tests" (Playwright) => 0.22030793130397797
[INFO] "Testing Services" (Angular) => 0.20675960183143616
[INFO] "Actionability" (Playwright) => 0.1959698647260666
[INFO] "diff_test" (Skylib) => 0.19095730781555176
[INFO] "JUnit" (Playwright) => 0.1832783967256546
[INFO] "analysis_test" (Skylib) => 0.15578024089336395
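So, yes, it seems like word2vec-style vector arithmetic can work in technical
writing contexts. Task types matter, though: in the Same topic, different
domain experiment the expected docs only ranked first with customized task
types. Make sure to set your task types correctly.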
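When comparing runs, it can help to look at how far ahead the top result is, not just which doc wins. Here is a rough sketch that parses log lines in the format shown above and reports the gap between first and second place. The helper names are hypothetical, and the sample lines are the top three from the default task types run of the first experiment.

```python
import re

# Matches result lines such as: [INFO] "Testing" (Angular) => 0.5714
LINE = re.compile(r'\[INFO\] "(?P<topic>.+)" \((?P<domain>.+)\) => (?P<score>[0-9.]+)')


def parse_results(log):
    """Extract (topic, domain, score) tuples from result lines."""
    rows = []
    for line in log.splitlines():
        match = LINE.search(line)
        if match:
            rows.append((match["topic"], match["domain"], float(match["score"])))
    return rows


def top_margin(rows):
    """Return the top-ranked topic and its lead over the runner-up."""
    ranked = sorted(rows, key=lambda row: row[2], reverse=True)
    return ranked[0][0], ranked[0][2] - ranked[1][2]


log = '''
[INFO] Results:
[INFO] "Testing Your Database" (Supabase) => 0.6590374708175659
[INFO] "Testing" (Angular) => 0.571465790271759
[INFO] "Testing Services" (Angular) => 0.46747612953186035
'''

topic, margin = top_margin(parse_results(log))
print(f"{topic} leads by {margin:.3f}")  # Testing Your Database leads by 0.088
```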
Discussion§
I still don’t really understand how it’s possible to semantically represent an
entire document as a single vector, let alone how adding and subtracting
single-word vectors from full-document vectors works.
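One partial intuition, offered as an assumption rather than a description of EmbeddingGemma’s internals: many text embedding models produce one vector per token and then pool them (for example by averaging) into a single fixed-size vector. The sketch below shows only that pooling step, with made-up 2-dimensional token vectors. Real models use context-aware token vectors and additional layers.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size vector."""
    count = len(token_vectors)
    dims = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dims)]


# Made-up token vectors: a 2-token text and a 4-token text.
short_text = [[1.0, 0.0], [0.0, 1.0]]
long_text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Both texts end up as a vector of the same size, regardless of length.
print(mean_pool(short_text))  # [0.5, 0.5]
print(mean_pool(long_text))   # [0.5, 0.5]
```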
How do we actually use this in technical writing workflows or documentation
experiences? I’m not sure. I was just curious to learn whether or not it would
work.
Appendix§
Source code§
experiments.py:
from json import load
from os import environ
from sys import exit

from requests import get
from sentence_transformers import SentenceTransformer


class Doc:
    def __init__(self, topic, domain, url, length, embedding):
        self.topic = topic
        self.domain = domain
        self.url = url
        self.length = length
        self.embedding = embedding
        self.similarity = None


def init_docs(model, task_types):
    # Download every doc listed in data.json and embed its full text.
    with open("data.json", "r") as f:
        data = load(f)
    tokenizer = model.tokenizer
    docs = []
    max_length = tokenizer.model_max_length
    for item in data:
        url = item["url"]
        response = get(url)
        text = response.text
        length = len(tokenizer.encode(text))
        topic = item["topic"]
        domain = item["domain"]
        if length > max_length:
            exit(f"[ERROR] Document is too large: {topic}, {domain}")
        # NOTE: This is a plain string, not an f-string, so the literal text
        # "{topic}" is sent to the model as the title. Prefixing the string
        # with f would interpolate the real topic, but the scores may then
        # differ from the ones reported above.
        prompt = "title: {topic} | text: "
        embedding = model.encode(text, prompt=prompt) if task_types else model.encode(text)
        doc = Doc(topic, domain, url, length, embedding)
        docs.append(doc)
    return docs


def create_domain_query(model, task_types):
    # Testing Your Database - supabase + angular
    url = "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/database/testing.mdx"
    response = get(url)
    text = response.text
    if task_types:
        doc = model.encode(text, prompt_name="Retrieval-query")
        supabase = model.encode("supabase", prompt_name="Retrieval-query")
        angular = model.encode("angular", prompt_name="Retrieval-query")
    else:
        doc = model.encode(text)
        supabase = model.encode("supabase")
        angular = model.encode("angular")
    return doc - supabase + angular


def create_topic_query(model, task_types):
    # Testing Your Database - testing + vectors
    url = "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/database/testing.mdx"
    response = get(url)
    text = response.text
    if task_types:
        doc = model.encode(text, prompt_name="Retrieval-query")
        testing = model.encode("testing", prompt_name="Retrieval-query")
        vectors = model.encode("vectors", prompt_name="Retrieval-query")
    else:
        doc = model.encode(text)
        testing = model.encode("testing")
        vectors = model.encode("vectors")
    return doc - testing + vectors


def run_experiments():
    environ["TOKENIZERS_PARALLELISM"] = "false"
    model = SentenceTransformer("google/embeddinggemma-300m")
    for task_types in [True, False]:
        print(f'[INFO] Running "same topic, different domain" experiment with {"customized" if task_types else "default"} task types')
        docs = init_docs(model, task_types)
        query = create_domain_query(model, task_types)
        for doc in docs:
            similarity = model.similarity(query, doc.embedding).item()
            doc.similarity = similarity
        docs.sort(key=lambda doc: doc.similarity, reverse=True)
        print('[INFO] Results:')
        for doc in docs:
            print(f'[INFO] "{doc.topic}" ({doc.domain}) => {doc.similarity}')
        print()
        print(f'[INFO] Running "different topic, same domain" experiment with {"customized" if task_types else "default"} task types')
        docs = init_docs(model, task_types)
        query = create_topic_query(model, task_types)
        for doc in docs:
            similarity = model.similarity(query, doc.embedding).item()
            doc.similarity = similarity
        docs.sort(key=lambda doc: doc.similarity, reverse=True)
        print('[INFO] Results:')
        for doc in docs:
            print(f'[INFO] "{doc.topic}" ({doc.domain}) => {doc.similarity}')
        print()
    # DEBUG: print the list of docs as reStructuredText links.
    for d in init_docs(model, True):
        print(f"* `{d.topic} <{d.url}>`_ ({d.domain})")


if __name__ == "__main__":
    run_experiments()
data.json:
[
  {
    "domain": "Angular",
    "topic": "Background Processing Using Web Workers",
    "url": "https://raw.githubusercontent.com/angular/angular/refs/heads/main/adev/src/content/ecosystem/web-workers.md"
  },
  {
    "domain": "Angular",
    "topic": "Refer To Locales By ID",
    "url": "https://raw.githubusercontent.com/angular/angular/refs/heads/main/adev/src/content/guide/i18n/locale-id.md"
  },
  {
    "domain": "Angular",
    "topic": "Testing",
    "url": "https://raw.githubusercontent.com/angular/angular/refs/heads/main/adev/src/content/guide/testing/overview.md"
  },
  {
    "domain": "Angular",
    "topic": "Testing Services",
    "url": "https://raw.githubusercontent.com/angular/angular/refs/heads/main/adev/src/content/guide/testing/services.md"
  },
  {
    "domain": "CockroachDB",
    "topic": "LINESTRING",
    "url": "https://raw.githubusercontent.com/cockroachdb/docs/refs/heads/main/src/current/v25.4/linestring.md"
  },
  {
    "domain": "CockroachDB",
    "topic": "Test Your Application Locally",
    "url": "https://raw.githubusercontent.com/cockroachdb/docs/refs/heads/main/src/current/v25.4/local-testing.md"
  },
  {
    "domain": "Skylib",
    "topic": "analysis_test",
    "url": "https://raw.githubusercontent.com/bazelbuild/bazel-skylib/refs/heads/main/docs/analysis_test_doc.md"
  },
  {
    "domain": "Skylib",
    "topic": "bzl_library",
    "url": "https://raw.githubusercontent.com/bazelbuild/bazel-skylib/refs/heads/main/docs/bzl_library.md"
  },
  {
    "domain": "Skylib",
    "topic": "diff_test",
    "url": "https://raw.githubusercontent.com/bazelbuild/bazel-skylib/refs/heads/main/docs/diff_test_doc.md"
  },
  {
    "domain": "Playwright",
    "topic": "Actionability",
    "url": "https://raw.githubusercontent.com/microsoft/playwright/refs/heads/main/docs/src/actionability.md"
  },
  {
    "domain": "Playwright",
    "topic": "JUnit",
    "url": "https://raw.githubusercontent.com/microsoft/playwright/refs/heads/main/docs/src/junit-java.md"
  },
  {
    "domain": "Playwright",
    "topic": "Writing Tests",
    "url": "https://raw.githubusercontent.com/microsoft/playwright/refs/heads/main/docs/src/writing-tests-java.md"
  },
  {
    "domain": "Supabase",
    "topic": "Branching",
    "url": "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/deployment/branching.mdx"
  },
  {
    "domain": "Supabase",
    "topic": "Testing Your Database",
    "url": "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/database/testing.mdx"
  },
  {
    "domain": "Supabase",
    "topic": "Testing Your Edge Functions",
    "url": "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/functions/unit-test.mdx"
  },
  {
    "domain": "Supabase",
    "topic": "Vector Columns",
    "url": "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/ai/vector-columns.mdx"
  }
]
Note that I forgot to pin the URLs to specific commits, i.e. I used the
HEAD version of each URL. I ran the experiments in October 2025. If you run
them a year or two later, your cosine similarity scores will probably be
different, because the underlying text of the documents will probably have
changed.
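For anyone who wants reproducible inputs, one option is to swap the refs/heads/<branch> part of each raw.githubusercontent.com URL for a commit SHA. A minimal sketch follows. The `pin_raw_url` helper is hypothetical, the SHA is a placeholder rather than a real commit, and the pattern assumes branch names without slashes (true for every URL in data.json).

```python
import re


def pin_raw_url(url, commit):
    """Replace the refs/heads/<branch> segment of a raw GitHub URL with a commit SHA."""
    pinned, count = re.subn(r"/refs/heads/[^/]+/", f"/{commit}/", url, count=1)
    if count == 0:
        raise ValueError(f"no refs/heads/<branch> segment in {url}")
    return pinned


# Placeholder SHA for illustration only, not a real commit.
commit = "0123456789abcdef0123456789abcdef01234567"
url = "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/database/testing.mdx"

print(pin_raw_url(url, commit))
```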