70+ reference sources. 2.3M supercomputer hours. 130M LLM calls for generation, validation, and ranking. Professional lexicography. Production-tested since 2011.
Every data layer follows the same pattern:
We evaluate millions of candidate relationships across 400K words, producing over 100 million validated connections.
| Years | Milestones |
|---|---|
| 2011–2012 | Initial game development, early word lists and association data |
| 2013–2014 | NSF XSEDE grant: 2.3M supercomputer hours for LDA topic modeling and Word2Vec |
| 2015–2022 | Database expansion, 70+ reference source integration, professional lexicography |
| 2023–present | Large-scale LLM generation, validation, and ranking; false cognate auditing; production refinement |
Orin Hargraves (professional lexicographer, contributor to major dictionaries) wrote 2,000+ custom definitions and 4,400+ sense-grouped word associations. His work focused on the words that need the most human judgment: interjections, prepositions, the highest-frequency words, and words with so many dictionary senses that their entries are virtually unreadable.
Linguistics grad students and post-docs created 5,000+ thematic word lists over several years:
All contributors are credited on the about page.
LLMs generated the foundational data—semantic graphs, sense labels, word families, categories—but none of it was usable as-is. Every output got re-scored, amalgamated with other sources, and re-prompted for judgment calls. The key insight is that LLMs are better validators than generators: an LLM can confirm that “key → reef” is a valid association (it’ll say yes), but it won’t generate that association reliably on its own (see where LLM generation plateaus). So the pipeline generates broadly, then uses LLM scoring to rank, filter, and audit.
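The generate-broadly-then-validate pattern can be sketched in a few lines. This is an illustrative toy, not Linguabase's actual code: `llm_confirms` stands in for a real LLM validation call, and the candidate pool stands in for the many weak generators (corpora, embeddings, LLM prompts) the pipeline draws from.

```python
def llm_confirms(word: str, candidate: str) -> float:
    """Stand-in for an LLM validation call returning a 0-1 confidence.
    Here: a toy lookup table so the sketch is runnable."""
    known_valid = {("key", "reef"): 0.9, ("key", "lock"): 0.95}
    return known_valid.get((word, candidate), 0.2)

def rank_associations(word, candidates, threshold=0.5):
    """Score every pooled candidate with the validator, keep those
    above threshold, and return them best-first. The generator side
    can afford to be noisy because validation does the filtering."""
    scored = [(c, llm_confirms(word, c)) for c in set(candidates)]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return [c for c, _ in sorted(kept, key=lambda p: -p[1])]

# Candidates pooled from many weak generators; duplicates and noise
# are expected and handled downstream:
pool = ["lock", "reef", "zzyx", "reef"]
print(rank_associations("key", pool))  # ['lock', 'reef']
```

Note that "key → reef" survives here even though an LLM asked to free-associate on "key" would rarely produce it; the validator only has to recognize it.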
False cognates are words that look related but aren’t—they share spelling patterns by coincidence, not common origin. String similarity filters can’t detect them, and LLMs sometimes wrongly confirm they’re related:
Because every word is capitalized at the start of a sentence, LLMs often ignore case entirely, conflating “polish” (verb) with “Polish” (nationality). In a semantic network, that confusion contaminates association lists.
| Word | Lowercase | Capitalized |
|---|---|---|
| turkey / Turkey | ✓ the bird | ✓ the country |
| polish / Polish | ✓ to shine | ✓ nationality |
| boston / Boston | ✗ not a word | ✓ the city |
| swat / Swat / SWAT | ✓ to hit | ✓ district in Pakistan (Swat) / police unit (SWAT) |
We evaluated capitalization variants for ambiguous terms. Results: 3,509 words have two valid forms, 86 words have three. Each variant gets its own decontaminated association list.
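Keeping per-case association lists means lookups must treat a capitalized token as potentially ambiguous. A minimal sketch, with illustrative data and helper names (not Linguabase's actual API):

```python
# Each valid case form keys its own decontaminated association list.
ASSOCIATIONS = {
    "polish": ["shine", "buff", "wax"],         # verb
    "Polish": ["Poland", "Warsaw", "pierogi"],  # nationality
    "turkey": ["bird", "thanksgiving"],
    "Turkey": ["Ankara", "Istanbul"],
    "Boston": ["Massachusetts", "harbor"],      # no lowercase form
}

def case_variants(token: str):
    """Return the valid case forms a token could represent. A
    capitalized token (e.g. at sentence start) is ambiguous between
    its own form and the lowercase word, so both are checked."""
    forms = [f for f in (token, token.lower()) if f in ASSOCIATIONS]
    return list(dict.fromkeys(forms))  # dedupe, preserve order

print(case_variants("Polish"))  # ['Polish', 'polish'] — two valid forms
print(case_variants("Boston"))  # ['Boston'] — lowercase is not a word
```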
Traditional thesauruses treat function words as “stopwords”—ignored entirely. We put extra effort into building associations for hundreds of the most common words that other sources systematically skip:
One source of richness is gestalt associations—sensory, emotional, and cultural connections that taxonomic approaches miss entirely:
| Type | Examples |
|---|---|
| Visual | elephant → gray, wrinkled |
| Sensory | crisis → siren, sweat, rubble |
| Cultural | wedding → white, rice, tears |
| Emotional | home → warmth, safety, belonging |
These are not synonyms. They’re how humans actually experience concepts. A thesaurus won’t tell you that “crisis” evokes “siren”—but your players know this instantly.
Linguabase has powered “In Other Words” since 2011. Real gameplay has surfaced edge cases that automated testing alone wouldn’t catch:
Beyond automated processing, the pipeline incorporates ~80,000 lines of human-authored curation—sense-grouped associations by a professional lexicographer, thematic word lists built by linguistics grad students, hand-written definitions, antonym pairs, and content filters—plus ~670,000 lines of algorithmic corrections and reference data (corpus frequencies, automated morphology, pronunciation data). Each pipeline run incorporates feedback from the previous one.
Internally, we maintain a ranked corpus of 1.8 million terms with 267 million semantic connections. The 400K deployment threshold captures every word a well-read adult English speaker would recognize. Below that line, quality degrades gradually—there’s no clean cutoff, just a long tail where legitimate specialized terms mix with noise.
The bottom of the corpus includes real English that simply isn’t suitable for gameplay: Wadati-Benioff zones (seismology), monomethylarsonic acid (chemistry), q-Pochhammer symbol (mathematics), Struve functions (analysis), Zaydi jurisprudence (Islamic law). These are valid terms with real semantic connections, but no player would recognize them. Mixed in with these are legitimate proper nouns too obscure for games—Pavel Vilikovsky (Slovak novelist), Ghulam Ahmad Mahjoor (Kashmiri poet), Petru Cercel (Wallachian prince)—and sentence fragments, OCR artifacts, and outright gibberish: law of diminishing, same molecular formula, loogie hock, wkurxjk, fkrrvh.
The deeper data still does essential work behind the scenes.
When generating puzzles, every category must be mutually exclusive—no word should plausibly fit two groups. Clearing that conceptual space requires knowing how the full vocabulary interconnects, not just the game-facing slice. “Retinochrome” and “teratosaurid” will never appear in a puzzle—they rank around 1.5 millionth. But the photosensitive pigment connects to light, rhodopsin, retinal; the extinct reptile connects to archosaurs, fossils, carnivores. Those connections inform how we segment semantic space when building categories from the top 400K. The deeper corpus provides the resolution for differentiating concepts, even when most of it never faces the player.
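The mutual-exclusivity check can be sketched as follows. Association-overlap is an illustrative proxy for the real fit score, and the graph slice is toy data: “bass” sits between fish and music, so any puzzle pairing those two categories must exclude it.

```python
GRAPH = {  # toy slice of a semantic graph: word -> associations
    "bass":  {"fish", "guitar", "low"},
    "trout": {"fish", "river"},
    "cello": {"guitar", "band", "strings"},
}

def fits(word, category_assocs, min_overlap=1):
    """Does `word` plausibly belong to a category described by a set
    of seed associations? Here: any shared association counts."""
    return len(GRAPH.get(word, set()) & category_assocs) >= min_overlap

def mutually_exclusive(word, categories):
    """True only if the word plausibly fits at most one group."""
    return sum(fits(word, cat) for cat in categories) <= 1

fish_cat = {"fish", "river"}
music_cat = {"guitar", "band"}
print(mutually_exclusive("trout", [fish_cat, music_cat]))  # True
print(mutually_exclusive("bass", [fish_cat, music_cat]))   # False
```

The check only works if the graph records “bass → fish” in the first place, which is exactly why connections from deep in the corpus matter even when those words never appear in a puzzle.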
The same depth serves ranking. When a word correlates with thousands of other terms throughout the full corpus—appearing in their association lists, category memberships, sense graphs—that pattern is a familiarity signal. Words that show up everywhere are common; words that cluster in narrow technical domains are specialized. The 400K threshold works because 1.8 million terms were ranked to find it.
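One component of such a familiarity signal is simple to sketch: count how many other entries’ association lists mention a word (its in-degree over the corpus). Toy data below; the real signal presumably combines several corpus-wide counts.

```python
from collections import Counter

corpus = {  # word -> its association list (toy slice)
    "cat":       ["animal", "pet", "dog"],
    "dog":       ["animal", "pet", "cat"],
    "horse":     ["animal", "farm"],
    "rhodopsin": ["retinal", "pigment"],
}

def familiarity(corpus):
    """In-degree over association lists: words cited by many entries
    are common; words cited by few cluster in narrow domains."""
    return Counter(a for assocs in corpus.values() for a in assocs)

fam = familiarity(corpus)
print(fam["animal"])   # 3 — cited by many entries: common word
print(fam["retinal"])  # 1 — narrow technical term
```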
Analysis of the Linguabase network shows that 76% of English word pairs connect in 7 hops or fewer. Average path length is 6.43 steps.
Of the 2M headwords, about 870K (57%) are reachable through the top-40 associations of other words. The semantic space is more connected than it looks.
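Statistics like the average path length can be computed with breadth-first search over the association graph, averaging shortest paths over reachable pairs. A runnable sketch on a toy graph (the reported 6.43 comes from the full network):

```python
from collections import deque
from itertools import combinations

graph = {  # toy undirected association graph
    "key": {"lock", "reef"},
    "lock": {"key", "door"},
    "door": {"lock", "house"},
    "house": {"door", "home"},
    "home": {"house"},
    "reef": {"key", "coral"},
    "coral": {"reef"},
}

def shortest_path_len(graph, src, dst):
    """Breadth-first search; returns hop count, or None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

lengths = [shortest_path_len(graph, a, b) for a, b in combinations(graph, 2)]
reachable = [d for d in lengths if d is not None]
print(round(sum(reachable) / len(reachable), 2))  # 2.67 on this toy graph
```

At 1.8M nodes, all-pairs BFS is infeasible; in practice such figures are estimated by sampling source nodes, which is why the connectivity numbers above are best read as measured estimates.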
Linguabase is actively maintained, not a static dataset:
Fifteen years of work by lexicographers, linguists, data architects, and dozens of vocabulary contributors. See the full credits →