70+ reference sources. 2.3M supercomputer hours. 130M LLM validation calls. Professional lexicography. Production-tested since 2011.
[Chart: Words by letter count (1 to 19 letters), most familiar vs. least familiar]
[Chart: Single words vs. words with spaces, by familiarity (counts from 50K to 400K)]
Linguabase is built on a foundation of human-generated data—professional lexicographic work, curated word lists, and structured linguistic resources accumulated over a decade. This foundation is then enhanced through focused LLM queries that validate, rank, and expand relationships. The result combines depth that automation can’t reach with scale that humans can’t sustain.
Over half the vocabulary consists of words with spaces—familiar multi-word expressions like “night sky” and “shake off” that expand coverage without dipping into obscure single words most players would never recognize.
The scale of the work: A skilled lexicographer might spend an hour per word building 50 quality associations. Multiply that by 400,000 words and you get 200 person-years of work—before accounting for consistency checking, sense separation, or quality control.
The Pipeline: Expand, Audit, Contract
Every data layer follows the same pattern:
| Phase | Goal | Methods |
| --- | --- | --- |
| Expand | Gather candidates from every plausible source | 70+ reference sources, computational linguistics, human curation, Library of Congress |
| Audit | Evaluate and score each candidate | LLM validation, false cognate detection, sense separation, consistency checks |
| Contract | Retain only production-quality results | Threshold by score, remove duplicates, apply content filters, rank by strength |
We start with millions of candidate relationships and ship ~40M high-quality connections.
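The Contract phase can be sketched as a score-threshold-and-dedupe pass. A minimal illustration in Python; the scores, the 0.7 cutoff, and the candidate pairs are made up for demonstration, not Linguabase's actual values:

```python
# Minimal sketch of the Contract phase: threshold by score,
# remove duplicates, rank by strength.

def contract(candidates, threshold=0.7):
    """candidates: list of (source_word, target_word, score) tuples."""
    seen = set()
    kept = []
    for src, tgt, score in sorted(candidates, key=lambda c: -c[2]):
        if score < threshold:
            continue  # below production quality
        if (src, tgt) in seen:
            continue  # duplicate proposed by another source
        seen.add((src, tgt))
        kept.append((src, tgt, score))
    return kept

candidates = [
    ("key", "reef", 0.92), ("key", "reef", 0.85),  # duplicate pair
    ("key", "lock", 0.97), ("key", "kale", 0.31),  # low-score noise
]
print(contract(candidates))  # [('key', 'lock', 0.97), ('key', 'reef', 0.92)]
```

The same shape scales from millions of candidates down to the shipped set: only the threshold and the dedupe key change.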
Development Timeline
| Period | Focus |
| --- | --- |
| 2011–2012 | Initial game development, early word lists and association data |
| 2013–2014 | NSF XSEDE grant: 2.3M supercomputer hours for LDA topic modeling and Word2Vec |
| 2015–2022 | Database expansion, 70+ reference source integration, professional lexicography |
| 2023–present | LLM-assisted validation at scale, false cognate auditing, production refinement |
Professional Lexicography
Orin Hargraves (professional lexicographer, contributor to major dictionaries) wrote 2,000+ custom definitions and 4,400+ sense-grouped word associations. His work focused on the words that need the most human judgment: interjections, prepositions, the highest-frequency words, and words with so many dictionary senses that their entries are virtually unreadable.
Linguistics grad students and post-docs created 5,000+ thematic word lists over several years:
Parts of a watch, types of gargoyles, painting supplies
Cross-domain connections and specialized vocabulary
We integrated 70+ linguistic resources—professional thesauri like the NASA Thesaurus, public domain lexicons like WordNet and Roget’s (we explain why these aren’t enough on their own), Library of Congress subject headings, and specialized vocabularies spanning scientific, governmental, artistic, and medical domains. Each source required its own parsing and extraction logic: what works for word games differs from what works for dictionary lookup.
Pre-LLM Computational Linguistics
Before LLMs existed, we built the initial graph using classical NLP:
Wikipedia extraction—Link analysis, disambiguation pages, definition summaries
Latent Dirichlet Allocation—2.3M supercomputer hours (NSF XSEDE grant, 2013–14) for topic clustering across massive corpora
Word2Vec models—Statistical word embeddings trained on billions of tokens
These methods produced millions of candidate relationships that human curation and later LLM validation refined into the shipped dataset.
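Word2Vec-style candidate generation boils down to cosine similarity between word vectors: a word's nearest neighbors in embedding space become proposed associations. A toy sketch with hand-made 3-dimensional vectors; real models use hundreds of dimensions trained on billions of tokens:

```python
import math

# Toy 3-d "embeddings"; real Word2Vec vectors have 100-300 dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    mag = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / mag

def nearest(word, k=1):
    """Propose the k nearest words as candidate associations."""
    others = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda x: -x[1])[:k]

print(nearest("king"))  # "queen" ranks far above "apple"
```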
Library of Congress Expansion
We methodically processed all 648,000 Library of Congress subject classifications—capturing the themes of millions of books written in English. Librarians organized these into topics like “orange horticulture” or “Indus Valley civilization.” By analyzing pools of words across these classifications, we discovered semantic clusters that no dictionary or thesaurus would surface. These serve as “idea seeds” for expanding our association graph into domains that traditional lexicography misses.
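The pooling idea can be sketched as an inverted index: each word points at the headings it appears in, and words that share a heading become candidate "idea seed" clusters. The headings below are illustrative stand-ins for the real 648,000:

```python
from collections import defaultdict

# Hypothetical subject headings standing in for the real LoC data.
headings = [
    "orange horticulture",
    "citrus fruit industry",
    "indus valley civilization",
    "bronze age indus script",
]

# Invert: word -> indices of the headings that contain it.
index = defaultdict(set)
for i, h in enumerate(headings):
    for w in h.split():
        index[w].add(i)

def cluster(word):
    """All words co-occurring with `word` in at least one heading."""
    mates = set()
    for i in index[word]:
        mates.update(headings[i].split())
    mates.discard(word)
    return sorted(mates)

print(cluster("indus"))  # pulls in valley, civilization, bronze, age, script
```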
LLM Validation
We use LLMs for validation, not generation. The difference matters—see where LLM generation plateaus. An LLM can confirm that “key → reef” is a valid association (it’ll say yes), but it won’t generate that association reliably on its own. Our pipeline proposes candidates from the sources above, then uses LLM scoring to rank, filter, and audit them.
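The validate-don't-generate split can be sketched as a yes/no probe over already-proposed pairs. Everything here is hypothetical: `ask_llm` stands in for whatever model call the pipeline actually makes, and the prompt wording is illustrative; a fake model is injected so the sketch runs standalone:

```python
def validate_association(src, tgt, ask_llm):
    """Score a proposed association with an LLM yes/no probe.
    `ask_llm` is any callable taking a prompt string and returning text."""
    prompt = f'Is "{src}" meaningfully associated with "{tgt}"? Answer yes or no.'
    answer = ask_llm(prompt).strip().lower()
    return answer.startswith("yes")

# A fake model for demonstration: it confirms one known-good pair.
def fake_llm(prompt):
    return "Yes" if '"key"' in prompt and '"reef"' in prompt else "No"

print(validate_association("key", "reef", fake_llm))  # True
print(validate_association("key", "kale", fake_llm))  # False
```

The key design point is that the candidate pair always comes from the upstream sources; the model only confirms or rejects, which is the task it performs reliably.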
False Cognate Removal
False cognates are words that look related but aren’t—they share spelling patterns by coincidence, not common origin. String similarity filters can’t detect them, and LLMs sometimes hallucinate a genuine association between them:
False Cognates Removed
dig → digress: Latin “dis + gradi” (go apart) ≠ English dig
ant → antebellum, anteroom: Latin “ante-” (before) ≠ insect
man → manage, mansion, manual: Latin “manus” (hand) ≠ person
291,062 false cognates removed. For example, this removed a data error that associated “grave” (burial) with “gravity” (from Latin gravis meaning “heavy”).
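A surface check alone can only flag candidates; deciding requires an etymology lookup. A toy sketch pairing the two, with a hypothetical hand-made root table (a real pipeline would consult etymological data at scale):

```python
# Toy etymology roots; illustrative, not Linguabase's actual data.
roots = {
    "grave":   "old_english_graef",  # burial
    "gravity": "latin_gravis",       # heavy
    "manage":  "latin_manus",
    "manual":  "latin_manus",
}

def is_false_cognate(a, b):
    """Surface-similar (shared 3-letter prefix) but different roots."""
    shares_prefix = b.startswith(a[:3]) or a.startswith(b[:3])
    different_root = roots.get(a) != roots.get(b)
    return shares_prefix and different_root

print(is_false_cognate("grave", "gravity"))  # True: looks related, isn't
print(is_false_cognate("manage", "manual"))  # False: both from manus
```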
Capitalization Intelligence
Since lowercase words are capitalized at the beginning of sentences, LLMs often treat text case-insensitively, conflating “polish” (verb) with “Polish” (nationality). For semantic graphs, this contaminates association lists.
| Word | Lowercase | Capitalized |
| --- | --- | --- |
| turkey / Turkey | ✓ the bird | ✓ the country |
| polish / Polish | ✓ to shine | ✓ nationality |
| boston / Boston | ✗ not a word | ✓ the city |
| swat / Swat / SWAT | ✓ to hit | ✓ Pakistan region / ✓ police unit |
We evaluated capitalization variants for ambiguous terms. Results: 3,509 words have two valid forms, 86 words have three. Each variant gets its own decontaminated association list.
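Decontamination implies storing each valid form as its own key and resolving lookups case-sensitively first. A minimal sketch with hypothetical association lists:

```python
# Separate association lists per capitalization variant (toy data;
# Linguabase stores 3,509 two-form and 86 three-form words).
associations = {
    "polish": ["shine", "buff", "wax"],
    "Polish": ["Poland", "Warsaw", "pierogi"],
}

def lookup(word):
    """Exact-case lookup first; fall back to the lowercase form only
    when the given casing has no entry of its own."""
    if word in associations:
        return associations[word]
    return associations.get(word.lower(), [])

print(lookup("Polish"))  # nationality senses, never "shine"
print(lookup("polish"))  # verb senses
```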
Common Word Coverage
Traditional thesauruses treat function words as “stopwords”—ignored entirely. We put extra effort into building associations for hundreds of the most common words that other sources systematically skip:
Function Word Coverage
“and”→so, plus, together with, nor, but, furthermore, ampersand, conjunction, additionally, as well as, moreover, including, copulative, union, link, connective, meanwhile, paired, combined, mutual, coupled, continuation, intertwined, likewise...
One source of richness is gestalt relations—experiential and sensory associations that taxonomic approaches miss entirely:
| Type | Example |
| --- | --- |
| Visual | elephant → gray, wrinkled |
| Sensory | crisis → siren, sweat, rubble |
| Cultural | wedding → white, rice, tears |
| Emotional | home → warmth, safety, belonging |
These are NOT synonyms. They’re how humans actually experience concepts. A thesaurus won’t tell you that “crisis” evokes “siren”—but your players know this instantly.
Production Refinement
Linguabase has powered “In Other Words” since 2011. Edge cases have emerged through real gameplay that automated testing alone wouldn’t catch:
Words that look related but aren’t (false cognates discovered through player confusion)
Cultural associations that shift over time
Difficulty calibration based on actual player success rates
Content filter refinements from user reports
Ongoing Curation
Beyond automated processing, our manual override layer includes 50,000+ hand-curated entries plus 300,000+ corrections. Each pipeline run incorporates feedback from the previous one.
Why 1.9M Internally, 400K Deployed?
Internally, we maintain 1.9 million words—including all of Wiktionary plus the top 200,000 words from Wikipedia. We don’t ship all of them: the bottom end is noisy, and there is no principled cutoff between what is and isn’t a word once you reach into endless technical terms and named entities. The 400K threshold captures every word players actually want, without the noise. The hard part isn’t building a huge list—it’s ranking and curating it.
Small World Property
Analysis of the Linguabase graph shows that 76% of English word pairs connect in 7 hops or fewer. Average path length is 6.43 steps.
| Hops | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8+ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| % of pairs | 0.01% | 0.15% | 2.1% | 10% | 21.6% | 24.2% | 18.3% | 23.6% |
Of the 1.9M headwords, about 870K (46%) are reachable through the top-40 associations of other words. The semantic space is more connected than it looks.
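Hop counts like these come from shortest-path analysis over the association graph. A breadth-first search sketch on a toy graph (the real graph has 1.9M nodes and ~40M edges):

```python
from collections import deque

# Toy association graph, treated as undirected for illustration.
graph = {
    "key":   ["lock", "reef"],
    "lock":  ["key", "door"],
    "reef":  ["key", "coral"],
    "door":  ["lock"],
    "coral": ["reef"],
}

def hops(src, dst):
    """Shortest hop count between two words via BFS; None if unreachable."""
    frontier, dist = deque([src]), {src: 0}
    while frontier:
        w = frontier.popleft()
        if w == dst:
            return dist[w]
        for nxt in graph.get(w, []):
            if nxt not in dist:
                dist[nxt] = dist[w] + 1
                frontier.append(nxt)
    return None

print(hops("door", "coral"))  # door -> lock -> key -> reef -> coral = 4
```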
Continuous Improvement
Linguabase is actively maintained, not a static dataset:
New vocabulary—Words enter mainstream usage constantly (“doomscrolling,” “finsta,” “situationship”). We add them as they stabilize.
Quality refinement—Each pipeline run improves grades based on new validation data
Sense updates—Meanings shift over time; we track semantic drift