Can an LLM make "educated" guesses about name origins?


By Aaron J. Becker

July 28, 2025

Despite “ingesting” more than 200 reference sources in my quest to produce research-backed name meanings, there are still over 30k names in the Social Security Administration’s baby name database that don’t appear in any of them. Large language models, having effectively ingested every piece of text on the internet1, could have come across name explanations that I missed, but they’re also prone to fabricating plausible-sounding origins with no basis in reality. Perhaps, however, we can get an LLM to produce better name origin guesses by providing it with the same sort of context that a person might seek out when guessing a name’s origin: hard data on when and where a name has been used.

🔥 AI doesn't have to mean slop

This article aims to illustrate how large language models can be used to produce novel, informative content that's grounded in concrete facts. It's a deep dive into the process used to generate approximately 12 million words of content on NamePlay.

This interactive directory contains links to all 48k+ names with LLM-inferred origins. Please help by rating the quality of the inferred origins; these are the names whose inferred origins are still unrated. I built the feedback system at the same time as this article, so feedback is thin on the ground.

There's no shortage of name-related slop on the internet; even major (fruit-themed) sites are now using AI to write name meanings with no basis in reality. Perhaps naively, I believe that AI tools can surface genuine knowledge if you're willing to put in the work to use them responsibly.

This article falls short of a rigorous scientific experiment; time and budget constraints prevented me from establishing a true baseline, devising a proper evaluation metric, and conducting ablation studies. Contact me if you have questions or would like to collaborate on a more complete analysis.


knowing the unknowns

When I’m trying to guess the origin of a name in the SSA database and I have nothing else to go on, I start by looking at when and where births with the name have been reported. Most names missing from conventional name research sources fall into one of three broad categories: they’re either compound names, invented names, or names that belong to immigrant communities whose traditions haven’t yet made their way into mainstream American naming literature.

Compound names are usually self-evident: a name that appears in the SSA database as “Princemichael” is pretty clearly a compound of “Prince” and “Michael”. Distinguishing between invented names and recent transplants, however, is a more delicate operation.

timing the waves

Immigration in the United States clusters in both time and space— waves of immigrants from a given country tend to settle in particular cities and states, with family ties driving migration patterns. Knowing when and where a name was used can therefore provide clues about which immigrant communities it may belong to.

Names that arrived in the US as part of late-19th and early-20th century European immigration waves are often well-documented in mainstream sources, but names that arrived in the US as part of more recent waves are often not2. Even culturally sensitive and globally oriented sources like “Behind the Name” have stronger coverage of European name origins than they do of names from other parts of the world.

Earlier generations of immigrants were more likely to modify their names upon arrival in the US than more recent arrivals, which can make it more difficult to link these “Americanized” names to modern sources from their home countries. On the other hand, more recent immigrants may speak languages without universally accepted transliteration systems, leading to multiple possibilities when the name is spelled in English.

anchoring on place

Only about 30k of the 105k names in the SSA database have at least five births of the same sex within a single state in a single year. So information on geographic concentration is relatively rare, especially for names that aren’t common enough to be documented in reference sources: only about 3k names without reference sources have state-level birth data. Still, where it is available, information on the states where a name is most common can be a powerful anchor for guessing the origin of a name.

Many quickly-growing immigrant communities in the United States have concentrated settlement patterns— for instance, Arabic-speaking immigrants are proportionately more common in Michigan, whereas South Asian immigrants are proportionately more common in New Jersey. Knowing that a name is popular in one of these states provides strong hints that it could be associated with one of these communities.

On the other hand, invented names tend to be more common in parts of the country where mothers are younger— women who are older at the time of their first child’s birth are more likely to opt for traditional names than younger first-time mothers. In practice this means that invented names are more common in the South and Southwest.

Geographic concentration is a useful but imperfect demographic proxy. Some ethnic communities that have become well-established in the US, like Spanish speakers in the Southwest, have begun to adopt the singularly American practice of creatively naming their children based on attractive phonetic elements. And despite the urban stereotype, names with distinctive roots in African-American English are just as likely to be found in the Southern “Black Belt” as they are in the major cities of the Northeast and Midwest.

relying on token meanings

A name’s spelling and resemblance to other more familiar names are obvious clues that both a human and an LLM could use to guess the origin of a name. Humans, however, are still better at recognizing and looking past subtle spelling differences because language models don’t “see” spellings. Human experts can quickly recognize that “Aaliyah” and “Aliya” are spelling variations of the same name by reading the two names aloud, whereas an LLM would have to have seen the two spelling variations in similar contexts in order to recognize the commonalities.

LLMs struggle to recognize spelling variations because of the way these models break up text into “tokens,” the basic units that they use to process text. For an LLM like GPT 4o3, “Aaliyah” is three separate tokens: A | aliy | ah, whereas “Aliya” gets broken up into two tokens: Ali | ya. While all of the letters of “Aliya” are present in “Aaliyah”, there’s actually zero overlap in the way that GPT 4o receives the two names as input.

This same phenomenon occurs in names that share semantic elements, like “Advait” and “Advaitha” (Adv | ait versus Ad | va | itha). While human experts can see similarities in names’ basic building blocks, LLMs can only perceive names built out of disparate tokens as similar if their transformer layers have learned to map the distinct token sequences to similar representations during a process called pretraining.
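
If you want to see this fragmentation for yourself, the short sketch below uses the tiktoken library with the o200k_base encoding (the tokenizer family associated with GPT 4o). The exact splits you get may differ from the ones quoted above depending on the tokenizer version, but the point stands: visually similar names can map to completely different token sequences.

# A quick look at how a tokenizer splits similar-looking names.
# Requires the `tiktoken` package; exact token boundaries depend on the encoding version.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by the GPT-4o family

for name in ["Aaliyah", "Aliya", "Advait", "Advaitha"]:
    token_ids = enc.encode(name)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{name!r} -> {pieces}")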

Pretraining is basically an immense “guess the next word” game that the model plays on all of the text its creators have fed it. As basic as this sounds, all of the advancements in AI models you’ve heard about in the past few years are ultimately built upon the basic idea of learning patterns in text by getting better at guessing the next word. Language models build up meaning across multiple layers, activating patterns of artificial “neurons” that roughly map to concepts the model has found helpful when predicting the next word in text it has seen before.

For our purposes, this means that names that aren’t widespread on the internet— or in other textual sources— will have weaker and more fragile “meanings” associated with them in the model’s “mind”. Because “Aaliyah” and “Aliya” start out with entirely different representations based on their tokenization, the model’s ability to “understand” the names as similar depends on patterns memorized in the model’s transformer layers. Transformer layers are intricate, and because there’s always an element of randomness in the way that language models generate text, it’s entirely possible that the wrong “neural pathways” will activate when the model is asked to generate a meaning for an unfamiliar name4.

the case for context

To understand why providing context on a name’s usage patterns might help an LLM more accurately guess its origin, we need to think in terms of the training data an LLM is likely to have seen. Recent naming trends, at least through a model’s knowledge cutoff date, are probably well-represented via social media posts and online sources like blog posts that have explicit date markers. LLMs develop a sense of the pre-internet past by training on sources that mention when events occurred; historians tend to produce a great deal of text, and digitized archival sources are comprehensive, so we have reason to believe that years should prime an LLM with a rich array of relevant associations.

Asking an LLM to draw connections between the states where births with a name were recorded and that name’s ethnic origins demands more of an inferential leap. We can assume that there are some sources on the internet, such as Wikipedia articles, that directly discuss state-level demographic trends related to national origin. Sociology papers, newspaper stories, and magazine articles discuss ethnic communities in particular cities or states, whether as research subjects, human interest stories, or as part of coverage of broader trends. LLMs may even be able to draw on incidental co-occurrences like the relative prevalence of ethnic restaurants with addresses in certain states; while evidence like this is weak, LLMs have an incredibly large sample size to draw on.

mitigating LLM randomness

The numerical weights that an LLM uses for next word prediction are in a sense a compressed version of the internet. Like a search engine, therefore, LLMs can be used to “retrieve” information; unlike a search engine, however, the answer you get from a language model can fluctuate randomly.

Randomness is inescapable when working with language models, which is part of why so many people are hesitant to trust their outputs. But a little randomness isn’t always a bad thing. Probabilistic sampling is what keeps models from endlessly repeating the same words and phrases, most of the time. Randomness is also what makes it possible for an LLM to answer questions that were never asked in the training data.

A search engine can only regurgitate information from the pages it has “indexed.” Building new knowledge from search engine results requires human inference; entire professions arguably exist to do just that. Just in the baby name space, for example, blogs like Nancy’s Baby Names trace name origins back to Wikipedia articles and pop culture references. Journalists, historians, and financial analysts, among others, also produce a lot of text that draws inferences by connecting the dots between shreds of evidence.

Fairly or not, LLMs have been trained to mimic human knowledge workers by following reasoning and rhetorical patterns established in their work. LLMs are able to do this by guessing, word-by-word, what someone who has read the same sources as they have would write in response to a given prompt. Random sampling is what makes this guessing game possible, allowing LLMs to produce outputs that superficially resemble what a human expert would write.

The upshot, therefore, is that randomness is a double-edged sword. On the one hand, randomness is what allows language models to “mash up” new knowledge from the conceptual soup embedded in their parameters in a way that resembles human knowledge work. On the other hand, however, randomness is what drives LLMs to sometimes generate plausible-sounding but factually incorrect answers5. So how can we use randomness to our advantage, drawing novel connections, while avoiding hallucinatory pitfalls?

strategies for mitigating randomness

When working with LLMs, there are a few key strategies that can help to mitigate randomness and reduce hallucinations6, all of which have been applied in this project:

  1. Provide careful, well-crafted prompts. LLM responses echo the tone and style of the prompts they’re given. Use an emoji in your prompt and you’ll get a flood of them back. Write clearly and concisely. Don’t hesitate to boss models around—explicit and detailed instructions are the best way to get them to do what you want. Provide examples where possible. Try to cover any edge cases where the results you’re looking for differ markedly from what’s already on the internet. If you need help, chatbots like ChatGPT can be surprisingly good at fleshing out and polishing your prompts.
  2. Provide relevant and well-structured context. During instruction tuning, the LLM training step that follows pre-training, LLMs develop a bias towards extracting information from their “context window,” i.e. the text you provide them, over the information they’ve already seen in their pre-training. You can take advantage of this fact to provide a model with knowledge that’s more recent than the documents seen during pre-training. Even when context doesn’t contain a direct answer, it can still provide clues, like date ranges and geographic hints, that can help to guide the model’s inferences.
  3. Tweak model hyper-parameters. This is a technical point, but one that needs mentioning— if you’re using an LLM via an API, you can influence how random the answers you get are by adjusting parameters like temperature and top_p. Discussing these parameters is beyond the scope of this article; suffice it to say that they’re knobs you can fiddle with to nudge a model into behaving the way you want.
  4. Generate multiple results and integrate them. If there’s an element of luck involved in getting the right answer from an LLM, it only makes sense to roll the dice as many times as you can afford to. If you’re generating text for a large body of names, like the ~30k names in this experiment, then some of those responses are inevitably going to be low-probability hallucinations. If you generate enough responses that some of them begin to repeat, the answers that come up most often should be most likely to be correct, assuming the LLM’s parameters and/or context contain the right answer to begin with. This strategy is so effective that it’s a cornerstone of the latest-generation “thinking” models, which generate more accurate answers in part by exploring and reflecting upon multiple answers7. A minimal sketch of this sampling-and-voting pattern follows this list.
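
Here’s that sketch. The ask_llm function is a placeholder for whatever API call you’re making; temperature and top_p are the sampling knobs from point 3, and the values shown are arbitrary.

# Minimal sketch of "roll the dice several times and keep the most common answer".
# `ask_llm` is a placeholder for a real API call, not an actual library function.
from collections import Counter

def ask_llm(prompt: str, temperature: float = 0.7, top_p: float = 0.95) -> str:
    """Placeholder: send `prompt` to your model of choice and return its text answer."""
    raise NotImplementedError

def most_consistent_answer(prompt: str, n_samples: int = 8) -> str:
    # sample several independent answers, then keep the one that repeats most often
    answers = [ask_llm(prompt).strip().lower() for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    print(f"{count}/{n_samples} samples agreed on: {answer}")
    return answer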

💡 DIY model reasoning

You don't need to use a "thinking" model to get an LLM to reflect on multiple possible answers. Thinking models tend to be much more expensive per-token than non-thinking models, and you have to pay for "thought" tokens that most model providers don't even let you see. Chain-of-thought is a prompting strategy that predates reasoning models. If you break your task into multiple steps, thoughtfully structuring your prompts and context at each stage, it's possible to get more accurate answers without paying the premium for a "thinking" model.

🚨 Technical Deep Dive

A word of warning: this article is going to get significantly more technical for a while. If you're not interested in the details, or can't stand the sight of code, you can skip to the results or conclusions sections.

Since I had Google Cloud trial credits to burn, Gemini 2.0 Flash was my model of choice for this exercise. I did consider using Gemini’s Pro series, but given the number of names involved and the fact that I wanted to generate multiple candidates, cost considerations and free tier API limits put reasoning models out of reach. Smaller-scale experiments with Gemini Pro 2.5 and Gemini Flash 2.5 produced results of similar quality8.

The overall structure of the process I designed is nevertheless related to the internal dialog that chain-of-thought models use: I broke the task into stages, first exploring multiple candidates and then evaluating and summarizing a final response. Moreover, I prompted the model to generate post-hoc explanations of its self-assigned confidence scores by asking it to justify how the name usage data supported each inferred origin.

For the second stage of the process, I used OpenAI’s 4o-mini model, which occupies a similar price-performance sweet spot as Gemini 2.0 Flash but is better at writing prose. I had some expiring API credits that I needed to use up; the number of names that I generated inferred meanings for was ultimately limited by the amount of credit that I had left.

Both of the models I used are “small” by contemporary LLM standards. Gemini 2.5 Pro and GPT 4o are the “bigger siblings” of the models that I used. Each has more model parameters, which theoretically translates into broader world knowledge. But they also cost significantly more to generate the same amount of text; since this is a self-funded project, I had to balance performance against cost. In this context, “small” is a relative term: both of the models that I used are an order of magnitude more powerful than the GPT 3.5 model that powered ChatGPT at its launch.

All of the code that I used to generate inferred name origins is written in Python. This article contains all of the prompts that I used, but omits certain plumbing code like the functions used to submit API requests to the models and parse model responses. I used ChatGPT to edit some of the prompts and response schemas. I use Cursor, an AI-powered code editor, which auto-completes code as I type, but substantially all of the code here is hand-composed.

overview diagram

Before we dive into the code, here’s an overview diagram of the process, starting with data sources and ending with a final response:

[Overview Diagram]

stage 1: context-guided AI “brainstorming”

It may seem strange to include a brainstorming stage in what amounts to a research project; humans usually reserve brainstorming for creative pursuits. For an AI model, however, there’s no real distinction between research and creative writing— it’s all predicting the next word, after all. This stage is like brainstorming because we ask an AI model to generate at least 8 different theories on a name’s origin, based on the name’s popularity data, which can then be evaluated and refined in the next stage.

A few key challenges in this step:

  • Giving the model a persona that’s well-suited to the task. As you’ll see below, I instructed the model to write as though it were Laura Wattenberg, arguably the best-recognized expert on data-driven naming insights. This persona influences not only the writing style, but also the types of information included in the response.
  • Formatting the usage data and explaining it in a way that leads to correct interpretations. Providing usage data as context is only useful if the model knows what to do with it. AI models need some hand-holding, requiring an explanation of both the data’s format and its significance. AI models strongly prefer JSON data, a way of representing structured information that’s easy for them to parse, which has to be prepared from the raw data that the SSA provides.
  • Priming the model with examples of the types of demographic insights we expect. We can’t possibly explain every link between time, place, and demographics, but providing a few select examples can steer the model towards the types of inference we’re looking for.
  • Structuring the model’s responses for easy parsing. Providing an “output schema” instructs the model to generate responses in a specific format. This format guides what outputs the model generates and makes it easier to parse the results— essential when you want outputs that you can present in a structured user interface.

system prompt

A “System Prompt” assigns the model a role and lays the ground rules for the task. For this exercise, the system prompt mainly focuses on explaining the task at hand, the data that will be provided, and the limitations of the dataset that we’re working with.

The system prompt is a function and not a text block because I’m including, and repeating, the target name in the prompt. LLMs exhibit what’s known as a “primacy bias,” where they pay more attention to text that appears at the beginning of the prompt. The system prompt is what the model sees first, so repeating the target name multiple times emphasizes to the model that it is not being asked about name meanings generally, but should rather focus on a single specific name.

📝 re: imitation & flattery

Whenever I need to prompt an LLM to write about name trends, I always include some variation of "you are Laura Wattenberg" in the "system" prompt that assigns the model a role. Telling a model to pose as Laura Wattenberg improves the quality of the response, subjectively, which I chalk up to large language models having gotten used to seeing Laura Wattenberg's name in close proximity to high quality writing about names.

Obviously, no LLM can match LW for her wit and insight regarding trends in naming culture. Trying, however, does seem to make them better. Note that the language model doesn't claim to be LW in the text that it writes, it just tries to imitate her style.

def get_meaning_system_prompt(name: str) -> str:
    return f"""
You are Laura Wattenberg, renowned baby name expert and author of the book \"Baby Name Wizard\". Your writing style is engaging and accessible, blending analytical rigor with clear explanations that resonate with a broad audience.

You have been asked to draft a brief description of the meanings and origins of the name "{name}" following the provided JSON schema.

## Context Popularity Data

You will be provided with JSON data regarding births with the name "{name}" in the United States drawn from the Social Security Administration's baby name statistics, which cover the years 1880 to 2024. You must use this data to support inferences regarding the origins and meanings of the name "{name}".

Specifically, you will be provided with the following data:

- The `first_year` the name "{name}" appeared in the dataset for each gender.
- The `latest_year` the name "{name}" appeared in the dataset for each gender.
- The `total_births` with the name "{name}" in the United States for each gender.
- The `percent_male` gender distribution of births with the name "{name}" in the United States, which describes the overall percentage of births with the name "{name}" that were male.
- `most_popular_states`: Up to 5 US states, reported as 2-letter state codes, where the name "{name}" has been more *relatively popular* than it has been nationally, for each gender where data is available. States where the name "{name}" makes up the largest share of births are reported first. May be empty for uncommon names.

### SSA Dataset Limitations

- Names must occur ≥5 times per year to appear.
- State-level data is sparser due to same rule.
- Name inputs may be normalized: e.g., 'Marysue' for 'Mary Sue'.

## Apply Subject-Matter Expertise

Because you are Laura Wattenberg, an established and respected expert on baby names, you are deeply familiar with commonly used names, their meanings, origins, and cultural significance. You may use your domain knowledge to inform responses, turning to contextual clues from the popularity data for names that are less familiar.
""".strip()

task prompt

The “Task Prompt” comprises the bulk of the instructions that we’re sending to the AI model. Like the system prompt, it’s a template that gets filled in with the target name.

The task prompt has a few main goals:

  1. Provide examples of the type of inference we would like to see in the response. This is a form of “few-shot” prompting, which has been shown to improve LLM performance on a wide variety of tasks. Here the examples aren’t complete responses so much as fragments that illustrate a reasoning process. I did choose the examples to emphasize specific demographic associations that came up often in the names for which I was trying to infer origins.
  2. Frame the model’s treatment of ambiguity and limited data. We do want the model to make guesses when that’s necessary, but we need to set some guidelines around the guessing process to increase the likelihood that we get good guesses.
  3. Describe the format of the response. This is a bit repetitive in light of the response schema that follows, but describing the outputs we want to see in both narrative prose and a structured JSON schema emphasizes the importance of properly-formatted results.
def get_meaning_task_prompt(name):
    """"""
    return f"""
# Leverage Popularity Data to Infer Origins of the Name "{name}"

**Evaluate the provided popularity data alongside your knowledge of demographic history, immigration patterns, and naming culture to infer the likely national, cultural, or linguistic origins of the name "{name}".**

Example inferences:

- "Creative spellings" of names tend to be more popular among younger parents, and parents tend to be younger in regions like the South.
- Modern invented names are more popular since the 1990s.
- Names with Spanish origins are more popular in states with large Latino populations, like the US Southwest and West Coast.
- Names with Middle Eastern origins are more popular in states with large Arabic-speaking populations like Michigan, New York, New Jersey, and California.

*Do not limit yourself to these examples. Use your background knowledge of the United States and its history to make the best inferences you can.*

If data is sparse or ambiguous, you may hypothesize plausible origins based on phonetic similarity, naming patterns, or cultural associations. When doing so, lower the confidence score and include a note on speculation in the description.

## Output Format

Respond with one or more origins, root meanings, and brief etymologies for the name "{name}". Return your answer as a JSON array of objects. Each object must include:

- `origin_type`: Short label describing the type of origin (e.g. "Greek", "Modern Coinage").
- `description`: Narrative explanation of the origin and usage.
- `root_meanings`: List of semantic or phonetic meanings.
- `confidence`: "high", "medium", or "low"
- `popularity_notes`: (Optional) How popularity data supports this origin.

JSON data regarding historical popularity of the name "{name}" is provided in the following message.

Be concise but informative, writing in a tone consistent with your expertise.
""".strip()

output schema

The next thing the LLM will see is this JSON schema, which is a structured way of describing the shape of the answers we want from the language model. Like other prompt components, it’s a template. Unlike the description of the output format in the task prompt text, however, the schema is something that systems from an LLM provider can parse and strictly enforce.

Enforceability has two main implications for response reliability:

  • First, because you can check whether an LLM’s response follows the schema using code, it’s possible for LLM providers to build schema checking into their API infrastructure. I don’t know whether or not this happens in practice, but it would be theoretically possible for Google, in this case, to check the response from Gemini for compliance with the schema and force the model to re-run the task if it doesn’t.
  • Second, also because of the automatic verification, it’s relatively straightforward to train models to adhere to schema definitions. All you need to do is provide a ton of example schemas and corresponding prompts, reward responses that adhere to the provided schema, and use some form of reinforcement learning to update the model’s parameters such that it learns to strictly follow response schemas. This is more likely to be the way that schemas are used in practice.

Note that automatic verification can only be used to train a model to follow schemas superficially. Training a model to adhere to the “spirit” of a schema, as reflected in the descriptions associated with each field, is more subjective and hence more difficult. I’m not sure if or how that happens in deployed LLMs like Gemini.

def get_meaning_response_schema(name: str):
    return {
        "type": "array",
        "description": f"One or more origins and meanings for the name '{name}', inferred from expert knowledge and supported by popularity data. Each item represents a distinct cultural, linguistic, or creative tradition explaining how the name came into use.",
        "items": {
            "type": "object",
            "properties": {
                "origin_type": {
                    "type": "string",
                    "description": f"A short label describing the type of origin for the name '{name}'. This may include linguistic origins (e.g. 'Hebrew', 'Greek'), cultural naming patterns (e.g. 'African-American Vernacular'), or creative naming categories (e.g. 'Modern Coinage', 'Invented Spelling', 'Literary Name').",
                    "example": "Hebrew"
                },
                "description": {
                    "type": "string",
                    "description": "A brief narrative explaining how the name came into use in this origin context, referencing historical, linguistic, or cultural insights.",
                    "example": "Derived from the Hebrew name 'Yochanan', meaning 'God is gracious'."
                },
                "root_meanings": {
                    "type": "array",
                    "description": "List of core meanings associated with the name in this origin context. Use semantic meanings for traditional names, and phonetic or stylistic associations for creative names.",
                    "items": {"type": "string"},
                    "example": ["God is gracious"]
                },
                "confidence": {
                    "type": "string",
                    "description": "Qualitative confidence in this interpretation. Options are 'high', 'medium', or 'low'.",
                    "enum": ["high", "medium", "low"],
                    "example": "medium"
                },
                "popularity_notes": {
                    "type": "string",
                    "description": f"Explanation of how popularity data supports this origin or meaning of the name '{name}', including time period, gender distribution, or state-level trends.",
                    "example": "This origin is supported by higher-than-average usage in Southwestern states and a spike in popularity during the 1990s."
                },
            },
            "required": ["origin_type", "description", "confidence"]
        }
    }
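
This isn’t part of my pipeline, but nothing stops you from running the superficial check yourself once a response comes back. Here’s a sketch of client-side validation using the jsonschema package; parsed_response is a stand-in for whatever your API call actually returned.

# Sketch of validating a parsed model response against the schema locally.
# Not part of the pipeline described here; `parsed_response` is a stand-in for real output.
import jsonschema

parsed_response = [
    {
        "origin_type": "Arabic",
        "description": "Derived from an Arabic vocabulary word.",
        "confidence": "medium",
    }
]

try:
    jsonschema.validate(instance=parsed_response, schema=get_meaning_response_schema("Wateen"))
    print("response follows the schema")
except jsonschema.ValidationError as err:
    print(f"schema violation: {err.message}")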

formatting the usage data

I use polars to parse the CSV data that the SSA provides into manipulable dataframes. I’ll write a short article on how I do that in a future post. The state_level_popularity dataframe combines the results of the SSA’s state-level data with state-level total birth data from the CDC’s WONDER portal and various other sources that I’ll also describe in a future post.

The query_popularity_context function below takes a name and builds an output dictionary that can be serialized to JSON and passed to the model as a string. This example output is for the name Wateen:

{'first_year': {'female': 2012}, 'latest_year': {'female': 2024}, 'total_births': {'female': 266}, 'percent_male': 0, 'most_popular_states': {'female': ['MI', 'NY', 'IL', 'CA', 'TX']}}

And here’s example output for the name Kamauri, showing how data is presented for more unisex names:

{'first_year': {'male': 1999, 'female': 2001}, 'latest_year': {'male': 2024, 'female': 2024}, 'total_births': {'male': 781, 'female': 338}, 'percent_male': 69.79, 'most_popular_states': {'male': ['AL', 'SC', 'NC', 'GA', 'FL'], 'female': ['SC', 'GA', 'TX']}}

Here’s the function; it’s not really usable as-is because I don’t include the snippets that load the required dataframes, which, as mentioned above, will be fodder for future posts. For now this is just an example of how to convert polars dataframes into JSON-compatible dictionaries, which is one way to transform the data before passing it to the model9.

def query_popularity_context(name: str):
    gender_labels = {
        'm': 'male',
        'f': 'female'
    }

    # query national name data for first/last years and total births by gender for the target name
    data = yearly_name_data.filter(pl.col('name') == name).group_by('gender').agg(
        pl.col('year').min().alias('first_year'),
        pl.col('year').max().alias('latest_year'),
        pl.col('births').sum().alias('total_births')
    )
    data = data.rows(named=True)

    # re-arrange the result into a dictionary
    output = {}
    # track genders for use in calculating percent male
    genders = set()
    # each row represents first/last year and total births for a single gender
    for row in data:
        for field, value in row.items():
            if field == 'gender':
                genders.add(row['gender'])
                continue
            if field in output:
                output[field][gender_labels[row['gender']]] = value
            else:
                output[field] = {gender_labels[row['gender']]: value}

    if 'm' not in genders:
        output['percent_male'] = 0
    elif 'f' not in genders:
        output['percent_male'] = 100
    else:
        output['percent_male'] = round((output['total_births']['male'] / (output['total_births']['male'] + output['total_births']['female'])) * 100, 2)

    # query state-level popularity data for the target name
    states_df = state_level_popularity.filter(pl.col('name') == name, pl.col('state') != '_OTHER')
    states_data = {}
    # select states with highest share of births with target name for each gender.
    # gender is represented differently in the state-level dataframe (Male/Female vs m/f)
    if 'm' in genders:
        state_rows = states_df.filter(pl.col('gender') == 'Male').sort('bpm_pct_national', descending=True).head(5)
        if state_rows.shape[0]:
            states_data['male'] = state_rows.select('state').to_series().to_list()
    if 'f' in genders:
        state_rows = states_df.filter(pl.col('gender') == 'Female').sort('bpm_pct_national', descending=True).head(5)
        if state_rows.shape[0]:
            states_data['female'] = state_rows.select('state').to_series().to_list()

    output['most_popular_states'] = states_data
    return output

plumbing it all together

There’s already a lot of code here; in the interest of brevity I’m not going to reproduce all of the plumbing code that I used to combine the prompts, schema, and context data, set hyper-parameters, submit the request to the Gemini API, and store the results. If there’s interest (contact me via email) I can create a GitHub repo and post a link here.
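
That said, a stripped-down request might look something like the sketch below. To be clear, this isn’t my actual plumbing code: it assumes the google-genai Python SDK and an API key in the environment, and it glosses over error handling, rate limiting, and result storage.

# Stripped-down sketch of a stage 1 request; not the actual plumbing code.
# Assumes the `google-genai` SDK and an API key available in the environment.
import json
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

def infer_origin_candidates(name: str):
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            get_meaning_task_prompt(name),
            json.dumps(query_popularity_context(name)),  # usage data passed as JSON context
        ],
        config=types.GenerateContentConfig(
            system_instruction=get_meaning_system_prompt(name),
            response_mime_type="application/json",
            response_schema=get_meaning_response_schema(name),
            candidate_count=8,   # multiple "brainstormed" theories per request
            temperature=1.0,     # illustrative sampling values, not the pipeline's actual settings
            top_p=0.95,
        ),
    )
    return response.candidates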

stage 2: finalizing the response

The Gemini API returns a JSON response with 8 different “candidate” theories on the name’s origin and meaning in light of the provided prompts and context data. Most of the time, the responses are broadly similar to each other; consistency like this is usually a good sign that the model is confident in its answers. No matter how much of a name nerd you are, however, I doubt you’re going to want to sift through 8 subtly different versions of the same speculative name origins.

The next stage in producing useful outputs is to consolidate the responses into a single, (hopefully) coherent answer. If the first stage of this process resembles brainstorming, this one is more like editing.

extracting and cleaning stage 1 responses

Here’s a bit of plumbing code that I’m sharing here only because it showcases a Python package that I’ve found very useful: json_repair. JSON Repair uses a well-curated set of heuristics to fix minor syntax errors in JSON that would otherwise prevent LLM results from being usable. Even when provided with output schemas, it’s not uncommon for state-of-the-art models to occasionally return results that don’t parse as valid JSON— this library makes it possible to use them10.

import json
import json_repair

# INPUT_RESULTS_DIR (a pathlib.Path pointing at the stored stage 1 responses) is defined elsewhere

def get_meaning_no_context_candidates(name: str):
    # `no_context` here means no RAG context...
    file_path = INPUT_RESULTS_DIR / f"{name}_meanings_no_context.json"
    if file_path.exists():
        data_bundle = json.loads(file_path.read_text())
        try:
            candidates = data_bundle["response"]["candidates"]
            candidate_contents = []
            for candidate in candidates:
                try:
                    content = candidate["content"]["parts"][0]["text"]
                    content = json_repair.loads(content)
                    candidate_contents.append(content)
                except:
                    # ignore errors when parsing candidate content
                    continue
            return candidate_contents
        except KeyError:
            print(f"No candidates found for {name}")
            return None
    return None
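
To give a feel for what json_repair handles, here’s a toy example of my own (not from the pipeline). The exact repair behavior depends on the library version, but minor slips like a trailing comma or an unclosed bracket are its bread and butter.

# Toy demonstration of json_repair on malformed model output; behavior may vary by version.
import json
import json_repair

broken = '[{"origin_type": "Hebrew", "confidence": "high",}'  # trailing comma, unclosed bracket
repaired = json_repair.loads(broken)
print(type(repaired))        # expected: a Python list
print(json.dumps(repaired))  # expected: valid, parseable JSON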

stage 2 prompts

Rather than presenting this editing stage as another request in an ongoing conversation with an LLM “agent,” I decided to load results from the brainstorming stage and pass them as context to an entirely new task, with a new set of prompts and schemas. This differs from the workflow you’d have using a chatbot like ChatGPT, which would have access to your full conversation history, and even from the internal monolog that a reasoning model would produce.

My rationale for breaking this process into two discrete stages is that it allows the context to be more focused and task-specific. When working as an editor, the AI model doesn’t need to know the specifics of how Social Security Administration data was formatted, nor does it need to see the actual data. Moreover, including the few-shot inference examples from the first stage in the prompts used for the second stage might have biased the model towards over-weighting responses that more closely resemble those examples.

The schema for the consolidation stage is extremely similar to the schema for the brainstorming stage, so I’m not going to reproduce it here.

Here’s the task prompt. Notice the emphasis on integrating narratives and avoiding repetition:

def get_final_consolidation_prompt(name: str) -> str:
    return f"""
Your task is to synthesize multiple candidate interpretations of the name "{name}" into a single, coherent data structure that represents the best-supported understanding of the name’s origin and meaning.

Each candidate interpretation may contain partial overlaps, different labels, varied phrasing, or varying confidence levels. You must:

1. **Merge closely related interpretations** under a single origin type when appropriate, removing redundancy while preserving all relevant insight.
2. **Retain distinctions** between clearly different types of origin (e.g., "Modern Coinage" vs. "Germanic" vs. "Variant Spelling").
3. **Incorporate meaningful popularity trends**, including appearance dates, gender distribution, and state-specific clustering where relevant.
4. **Preserve linguistic and cultural inferences**, even when speculative, but label speculative interpretations clearly and assign appropriately lower confidence.
5. **Prioritize clarity, conciseness, and readability**, as your output will be published on a public baby names website.

Use your expert judgment to present the final result as a small set of distinct interpretations (typically 1–3), representing different cultural or stylistic origins for the name "{name}".

You must return a JSON array following the schema provided. Each item must:

- Use the clearest, most representative `origin_type`.
- Combine multiple related descriptions into a unified, readable paragraph.
- Merge or summarize `root_meanings` without duplicating terms.
- Assign a `confidence` rating that reflects both the grounding evidence and your domain knowledge.
- Provide a `popularity_notes` explanation that cites SSA data when applicable (e.g., first year, gender, regional usage, rarity).

### IMPORTANT

- Do not list multiple entries for the same `origin_type` unless they represent clearly different reasoning paths.
- Avoid exact repetition of phrases from the input candidates; rewrite in your own expert voice.
- Do not include empty arrays or empty strings; omit fields that are not applicable.
- Return only the final consolidated JSON array — no commentary, no preamble. The output must match the supplied JSON schema.
""".strip()

The system prompt is less detailed this time around; I included it as part of the function that combines the prompt, schema, and context data into a single structure that can be passed to the model API:

import json
from typing import Dict, List

def get_request_messages(name: str, candidates: List[List[Dict]]) -> List[Dict]:
    return [
        {"role": "system", "content": """
You are Laura Wattenberg, baby name expert and author of *The Baby Name Wizard*. Your writing is known for its accessible, engaging tone that blends linguistic insight with cultural context. You explain naming patterns with clarity and warmth, making even speculative interpretations feel grounded and thoughtful.

You approach names as reflections of both sound and society, using data, etymology, and style trends to tell their story. When combining multiple interpretations, you highlight clear distinctions, merge overlapping ideas, and present concise, readable explanations suitable for a general audience on a baby names website.
""".strip()},
        {"role": "user", "content": get_final_consolidation_prompt(name)},
        {"role": "user", "content": json.dumps(candidates)}
    ]
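
For completeness, here’s a sketch of how those messages might be submitted. Again, this isn’t my actual plumbing code; it assumes the official openai Python package with an API key in the environment, and it reuses json_repair to parse the reply.

# Sketch of submitting the consolidation request; not the actual plumbing code.
# Assumes the official `openai` package and an API key available in the environment.
from openai import OpenAI
import json_repair

client = OpenAI()  # picks up the API key from the environment

def consolidate_origins(name: str, candidates):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=get_request_messages(name, candidates),
        temperature=0.4,  # illustrative value, not the pipeline's actual setting
    )
    # the reply should be a JSON array; json_repair tolerates minor formatting slips
    return json_repair.loads(response.choices[0].message.content)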

I did use ChatGPT to flesh out the task prompt for the second stage of this workflow, which is why it includes some unmistakable GPT-isms. I would be curious to see whether there’s any research that examines biases introduced by using LLMs to write prompts. Earlier research has shown that LLMs are capable of producing prompts that result in higher performance than human-written prompts for certain NLP tasks, given the benefit of multiple attempts and a quantitative measure of performance11.

evaluating the results

I just built a directory of names with inferred origins and a user interface for submitting feedback on the results; they’re going live at the same time as this article. In other words, I’m only just getting started with the review process. It’s possible to view only names without feedback votes to get a sense of where reviewing is most needed, or sort by net feedback votes (across all inferred origins) to see what the results are so far.

Anecdotally, some of the results are pretty good. The inferred origin for Trampas, a name introduced by a TV Western, is a good example— the model linked years of usage to the TV show’s run, and noted that the name was popular in states like Texas where cowboy themes have greater appeal.

Unsurprisingly, the results aren’t always good. I might have instructed the model to be too generous in considering “modern coinage” as a possible origin, because it comes up very frequently. For instance, although the first inferred origin for the name Wateen accurately links the name to the Arabic word for the aorta, the model also inferred that Wateen “might represent a modern, creatively constructed name inspired by Arabic phonetics or other naming trends.” Given that we have a clearly identified meaning in the first inference, this hedging seems unnecessary.

In my prompts I repeatedly emphasize the existence of creative spellings and modern inventions, noting how parents can create names based on phonetic similarity, because I wanted to discourage the model from incorrectly ascribing cultural meanings to names that are trendy modern inventions. A quintessential example here is Zaidyn, which is best understood as “Aiden creatively spelled, with a trendy Z sound in front,” but which LLMs typically link to the Arabic name Zayd, meaning “growth.”

This type of spurious meaning attribution is commonplace in AI-generated baby name slop because it’s appealing to think of a name you like as having a deep meaning even when you’re mostly drawn to its sound. I may have nudged the model too far in the other direction, encouraging it to see modern coinages everywhere.

comparison to research-backed origins

I ran out of OpenAI API credits before I could generate inferred origins for every name in the SSA database. I did generate inferred origins for about 18,000 names which also have research-backed origins, which are origins created by a workflow rooted in source documents. I prioritized names with the shortest research-backed origins, under the assumption that LLM inferences might mention details or theories that were missing in the source documents.

Since the feedback system for research-backed origins, like the one for the inferred origins discussed here, also just went live, I don’t have data to compare the quality of the two name origin processes yet. Please help me out by submitting feedback on the research-backed origins you see on this site.

ideas for future research

measuring uncertainty using embedding similarity

One shortcoming of using a second set of prompts to summarize inferred name origins is that we’re reliant on an LLM’s “subjective” assessment of its own confidence, with an unclear mechanism for translating the confidence assessments in the drafts into the final consolidated answer. It’s not clear, for instance, whether the output from the second stage of prompts considers the relative frequency of origin types when assessing confidence in the final answer.

Applying a text embedding model to the various candidate responses from the first stage could provide a more quantitative measure of the variability contained in the draft origins that fed into the consolidation prompt. Embedding models map text to a vector space, the details of which are beyond the scope of this article, but the upshot is that they’re models trained to give you a way to measure similarity between two pieces of text, typically for information retrieval purposes.

A measure like the mean pairwise similarity between response embeddings, therefore, could provide some indication of how variable the draft origins were. There are all sorts of secondary considerations here— for instance, embedding models themselves are black boxes, so we don’t know what factors are relatively more important when determining similarity. Models trained for document retrieval might overemphasize the presence of non-English text, for instance. If we assume that embedding models do a good job of capturing semantic similarity— i.e., similarity between meanings, which is what they’re explicitly trained to do— then it would be interesting to see whether there are any correlations between response consistency and model confidence estimates.
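
Here’s a sketch of what that measurement could look like, assuming the sentence-transformers package and using the candidate descriptions as the texts to embed. The model name is just a common default, not a recommendation.

# Sketch of measuring agreement between stage 1 candidates via embedding similarity.
# Assumes the `sentence-transformers` package; the model name is an arbitrary default.
import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer

def mean_pairwise_similarity(candidate_descriptions: list[str]) -> float:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # unit-normalized embeddings make cosine similarity a plain dot product
    embeddings = model.encode(candidate_descriptions, normalize_embeddings=True)
    sims = [float(np.dot(a, b)) for a, b in combinations(embeddings, 2)]
    return float(np.mean(sims))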

similarity between inferred and research-backed origins

Building on the above, it might also be interesting to use embedding models to attempt to measure similarity between the inferred origins I built in this process and the “research-backed” origins that came out of the comprehensive RAG (retrieval-augmented generation) process (which I haven’t yet written about). I don’t think embedding similarity could truly replace human evaluation, but it could be useful for triage— helping identify names where source-attributed meanings and inferred origins are particularly divergent, which could be a good place to focus on when rating result quality.

baseline origin guesses without usage data

As mentioned in the introduction, this experiment isn’t truly scientific because I never asked the LLM to infer origins and meanings for names without providing any usage data as context. This was partially a cost consideration, but I also held off on this because I’ve seen far too many sites that already do this, and the results are usually pretty bad. AI slop answers already dominate Google search results for most uncommon names, and Google’s “AI overview” feature is all-too-happy to pull results from slop pages and spin them into more slop. Recall that the goal here was to create something with AI that could rise above the slop.

conclusions

It’s a bit premature for conclusions at this stage. I just created the infrastructure for rating the results from this workflow, so I don’t know how the results rate subjectively yet.

I need to create a system for mapping the origin types created in this process to a standardized, browseable taxonomy— for instance, so you could see a list of all names where one of the inferred origins was some variation on “Modern Coinage,” or how many names have Indian or Spanish roots. I’d like to be able to measure how often the results from this pipeline matched the results from research-backed origins based on source documents.
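
As a starting point, the mapping could be as simple as a lookup table that collapses the model’s free-form origin_type labels into canonical taxonomy keys. The handful of mappings below are purely illustrative, not a finished taxonomy.

# Hypothetical starting point for normalizing free-form origin_type labels.
# The mappings are illustrative placeholders, not a finished taxonomy.
CANONICAL_ORIGINS = {
    "modern coinage": "modern-coinage",
    "invented spelling": "modern-coinage",
    "creative spelling": "modern-coinage",
    "arabic": "arabic",
    "sanskrit": "indian",
    "hindi": "indian",
    "spanish": "spanish",
}

def normalize_origin_type(origin_type: str) -> str:
    # fall back to a slugified version of the raw label when there's no mapping yet
    key = origin_type.strip().lower()
    return CANONICAL_ORIGINS.get(key, key.replace(" ", "-"))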

Another thing I’d like to do is see how often LLM-inferred origins identify names with overlapping pronunciations as spelling variations, either explicitly or via both variations being associated with the same meaning. I have a separate process, also as yet unpublished, that attempts to link spelling variations on the basis of shared pronunciations; this is what powers the combined spelling name rankings that I publish on this site. I’m curious to see how many of the names that share pronunciations also have meanings in common.

I’d also like to be able to determine whether certain inferred origin types are more or less likely to be correct than others. In order for results to be meaningful I need to get a large enough sample of people to contribute feedback.

It’s already clear that LLMs, like human authors, can draw incorrect conclusions from the evidence they’re given. There are many instances where, on the basis of recent usage, an LLM will conclude that a name is a modern coinage when in fact it’s a name well-established in an ethnic community. Usage data, therefore, can mislead a model as well as guide it towards more accurate answers. Hopefully feedback from users can help identify situations where context leads AI models astray so we have a better understanding of their limitations.

getting in touch

I don’t have a comments section for this blog yet; I plan to add one, but I seem to be determined to build everything the hard way, so I can’t be sure when I’ll get one working. In the meantime, you can contact me via email, on Twitter/X, or send me a message on Reddit.
