How Linguabase Was Built

70+ reference sources. 2.3M supercomputer hours. 130M LLM calls for generation, validation, and ranking. Professional lexicography. Production-tested since 2011.

Word lists · Lexicography · Reference sources
    ↓ LLM expansion & validation (130M inferences, iterative)
Raw network: 2M terms, 100M+ connections
    ↓ Ranked by familiarity
Deployed: 400K native, ~50K non-native

Linguabase is built on a foundation of human-generated data—professional lexicographic work, curated word lists, and structured linguistic resources accumulated over a decade. This foundation is then enhanced through focused LLM queries that validate, rank, and expand relationships. The result combines depth that automation can’t reach with scale that humans can’t sustain.

Over half the vocabulary consists of words with spaces—familiar multi-word expressions like “night sky” and “shake off” that expand coverage without dipping into obscure single words most players would never recognize.

The scale of the work: A skilled lexicographer might spend an hour per word building 50 quality associations. Multiply that by 400,000 words and you get 200 person-years of work—before accounting for consistency checking, sense separation, or quality control.

The Pipeline: Expand, Audit, Contract

Every data layer follows the same pattern:

Expand — Gather candidates from every plausible source. 70+ reference sources, computational linguistics, human curation, Library of Congress
Audit — Evaluate and score each candidate. LLM validation, false cognate detection, sense separation, consistency checks
Contract — Retain only production-quality results. Threshold by score, remove duplicates, apply content filters, rank by strength

We evaluate millions of candidate relationships across 400K words, producing over 100 million validated connections.
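As a minimal sketch of the expand/audit/contract pattern (function names, scoring, and the 0.7 threshold are illustrative, not the production pipeline):

```python
def expand(sources):
    """Expand: gather candidate (word, neighbor) pairs from every source."""
    candidates = set()
    for pairs in sources:
        candidates.update(pairs)
    return candidates

def audit(candidates, scorers):
    """Audit: score each candidate; a pair is only as strong as its weakest check."""
    return {pair: min(scorer(pair) for scorer in scorers) for pair in candidates}

def contract(scored, threshold=0.7):
    """Contract: keep only production-quality pairs, ranked by strength."""
    kept = [(pair, score) for pair, score in scored.items() if score >= threshold]
    return sorted(kept, key=lambda item: -item[1])
```

Each layer plugs different sources and scorers into the same three passes; only the inputs change.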

Development Timeline

2011–2012 Initial game development, early word lists and association data
2013–2014 NSF XSEDE grant: 2.3M supercomputer hours for LDA topic modeling and Word2Vec
2015–2022 Database expansion, 70+ reference source integration, professional lexicography
2023–present Large-scale LLM generation, validation, and ranking; false cognate auditing; production refinement

Professional Lexicography

Orin Hargraves (professional lexicographer, contributor to major dictionaries) wrote 2,000+ custom definitions and 4,400+ sense-grouped word associations. His work focused on the words that need the most human judgment: interjections, prepositions, the highest-frequency words, and words with so many dictionary senses that their entries are virtually unreadable.

Linguistics grad students and post-docs created 5,000+ thematic word lists over several years.

All contributors are credited on the about page.


Reference Sources


We integrated 70+ linguistic resources—professional thesauri like the NASA Thesaurus, public domain lexicons like WordNet and Roget’s (we explain why these aren’t enough on their own), Library of Congress subject headings, and specialized vocabularies spanning scientific, governmental, artistic, and medical domains. Each source required custom parsing and extraction logic; what works for word games differs from what works for dictionary lookup.

Pre-LLM Computational Linguistics


Before LLMs existed, we built the initial network using classical NLP, including LDA topic modeling and Word2Vec trained during our NSF XSEDE allocation.

These methods produced millions of candidate relationships that human curation and later LLM validation refined into what we ship today.

Library of Congress Expansion


We methodically processed all 648,000 Library of Congress subject classifications—capturing the themes of millions of books humans have written in English. Librarians organized these into topics like “orange horticulture” or “Indus Valley civilization.” By analyzing pools of words across these classifications, we discovered semantic clusters that no dictionary or thesaurus would surface. These serve as “idea seeds” for expanding our association network into domains that traditional lexicography misses.


LLM Validation

LLMs generated the foundational data—semantic graphs, sense labels, word families, categories—but none of it was usable as-is. Every output got re-scored, amalgamated with other sources, and re-prompted for judgment calls. The key insight is that LLMs are better validators than generators: an LLM can confirm that “key → reef” is a valid association (it’ll say yes), but it won’t generate that association reliably on its own (see where LLM generation plateaus). So the pipeline generates broadly, then uses LLM scoring to rank, filter, and audit.
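A sketch of validator-style use, assuming a generic `ask_llm(prompt) -> str` callable (hypothetical; any chat-completion API would fill this role):

```python
def validate_association(word, candidate, ask_llm):
    """Ask the model to confirm, not invent, a candidate link.
    Yes/no confirmation is far more reliable than open-ended generation."""
    prompt = (
        f'Is "{candidate}" a valid word association for "{word}"? '
        "Answer strictly YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```

The candidate itself always comes from the broader pipeline; the model only scores it.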

False Cognate Removal

False cognates are words that look related but aren’t—they share spelling patterns by coincidence, not common origin. String similarity filters can’t detect them, and LLMs sometimes wrongly confirm they’re related:

False Cognates Removed
dig → digress: Latin “dis + gradi” (go apart) ≠ English dig
pan → pandemic, panorama: Greek “pan-” (all) ≠ cooking vessel
ant → antebellum, anteroom: Latin “ante-” (before) ≠ insect
man → manage, mansion, manual: Latin “manus” (hand) ≠ person
291,062 false cognates removed. For example, this removed a data error that associated “grave” (burial) with “gravity” (from Latin gravis meaning “heavy”).
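The core check can be sketched as spelling overlap plus an etymology disagreement; the `ROOTS` table here is a tiny hypothetical stand-in for the reference sources the pipeline actually consults:

```python
import os

# Hypothetical etymology table; the real pipeline draws on reference sources.
ROOTS = {
    "dig": "old-english:dic",
    "digress": "latin:gradi",
    "grave": "old-english:graf",
    "gravity": "latin:gravis",
    "manage": "latin:manus",
    "manual": "latin:manus",
}

def is_false_cognate(a, b, roots=ROOTS, min_overlap=3):
    """Flag pairs that look related by spelling but whose roots disagree."""
    shared = os.path.commonprefix([a, b])
    looks_related = len(shared) >= min_overlap
    return looks_related and roots.get(a) != roots.get(b)
```

Pairs that share both spelling and a root (manage/manual) pass; coincidental lookalikes (grave/gravity) are flagged for removal.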

Capitalization Intelligence

Since lowercase words are capitalized at the beginning of sentences, LLMs often ignore the difference, mixing up “polish” (verb) with “Polish” (nationality). For semantic networks, this contaminates association lists.

Word (lowercase / capitalized)
turkey / Turkey: ✓ the bird / ✓ the country
polish / Polish: ✓ to shine / ✓ the nationality
boston / Boston: ✗ not a word / ✓ the city
swat / Swat / SWAT: ✓ to hit / ✓ the Pakistani district / ✓ the police unit

We evaluated capitalization variants for ambiguous terms. Results: 3,509 words have two valid forms, 86 words have three. Each variant gets its own decontaminated association list.
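Decontamination amounts to routing each association to the variant it belongs with. A minimal sketch, where `classify(variant, assoc)` stands in for the LLM judgment call and the `VARIANTS` table is illustrative:

```python
# Illustrative variant table; 3,509 words have two forms, 86 have three.
VARIANTS = {"turkey": ["turkey", "Turkey"], "swat": ["swat", "Swat", "SWAT"]}

def decontaminate(word, associations, classify):
    """Split one mixed association list into per-variant lists."""
    per_variant = {v: [] for v in VARIANTS.get(word.lower(), [word])}
    for assoc in associations:
        for variant in per_variant:
            if classify(variant, assoc):
                per_variant[variant].append(assoc)
    return per_variant
```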

Common Word Coverage

Traditional thesauruses treat function words as “stopwords”—ignored entirely. We put extra effort into building associations for hundreds of the most common words that other sources systematically skip:

Function Word Coverage
“and” so, plus, together with, nor, but, furthermore, ampersand, conjunction, additionally, as well as, moreover, including, copulative, union, link, connective, meanwhile, paired, combined, mutual, coupled, continuation, intertwined, likewise...
“while” although, whereas, yet, simultaneously, period, meanwhile, whilst, notwithstanding, interval, albeit, duration, concurrent, throughout, contrast, lingering, in tandem, temporary, passing, momentary, interlude, even as, span...

Gestalt Enrichment

One source of richness is gestalt associations—sensory, emotional, and cultural connections that taxonomic approaches miss entirely:

Visual: elephant → gray, wrinkled
Sensory: crisis → siren, sweat, rubble
Cultural: wedding → white, rice, tears
Emotional: home → warmth, safety, belonging

These are not synonyms. They’re how humans actually experience concepts. A thesaurus won’t tell you that “crisis” evokes “siren”—but your players know this instantly.


Production Refinement

Linguabase has powered “In Other Words” since 2011. Edge cases have emerged through real gameplay that automated testing alone wouldn’t catch.

Ongoing Curation

Beyond automated processing, the pipeline incorporates ~80,000 lines of human-authored curation—sense-grouped associations by a professional lexicographer, thematic word lists built by linguistics grad students, hand-written definitions, antonym pairs, and content filters—plus ~670,000 lines of algorithmic corrections and reference data (corpus frequencies, automated morphology, pronunciation data). Each pipeline run incorporates feedback from the previous one.

Why 2M Internally, 400K Deployed?

Internally, we maintain a ranked corpus of 1.8 million terms with 267 million semantic connections. The 400K deployment threshold captures every word a well-read adult English speaker would recognize. Below that line, quality degrades gradually—there’s no clean cutoff, just a long tail where legitimate specialized terms mix with noise.

The bottom of the corpus includes real English that simply isn’t suitable for gameplay: Wadati-Benioff zones (seismology), monomethylarsonic acid (chemistry), q-Pochhammer symbol (mathematics), Struve functions (analysis), Zaydi jurisprudence (Islamic law). These are valid terms with real semantic connections, but no player would recognize them. Mixed in with these are legitimate proper nouns too obscure for games—Pavel Vilikovsky (Slovak novelist), Ghulam Ahmad Mahjoor (Kashmiri poet), Petru Cercel (Wallachian prince)—and sentence fragments, OCR artifacts, and outright gibberish: law of diminishing, same molecular formula, loogie hock, wkurxjk, fkrrvh.

The deeper data still does essential work behind the scenes.

When generating puzzles, every category must be mutually exclusive—no word should plausibly fit two groups. Clearing that conceptual space requires knowing how the full vocabulary interconnects, not just the game-facing slice. “Retinochrome” and “teratosaurid” will never appear in a puzzle—they rank around 1.5 millionth. But the photosensitive pigment connects to light, rhodopsin, retinal; the extinct reptile connects to archosaurs, fossils, carnivores. Those connections inform how we segment semantic space when building categories from the top 400K. The deeper corpus provides the resolution for differentiating concepts, even when most of it never faces the player.
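The mutual-exclusivity test can be sketched as a check over the full association graph (a simplified illustration, not the production generator):

```python
def mutually_exclusive(groups, associations):
    """Reject a puzzle layout if any word in one group associates
    with a member of a different group."""
    for i, group in enumerate(groups):
        for word in group:
            neighbors = set(associations.get(word, ()))
            for j, other in enumerate(groups):
                if i != j and neighbors & set(other):
                    return False
    return True
```

The `associations` lookup is where the deep corpus earns its keep: the more complete the graph, the fewer accidental overlaps slip through.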

The same depth serves ranking. When a word correlates with thousands of other terms throughout the full corpus—appearing in their association lists, category memberships, sense graphs—that pattern is a familiarity signal. Words that show up everywhere are common; words that cluster in narrow technical domains are specialized. The 400K threshold works because 1.8 million terms were ranked to find it.
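In its simplest form, that familiarity signal is an inbound-reference count across the corpus (a sketch; the real ranking combines many signals):

```python
from collections import Counter

def familiarity_order(association_lists):
    """Words that appear in many other words' lists rank as more familiar."""
    counts = Counter()
    for _, neighbors in association_lists.items():
        counts.update(set(neighbors))
    return [word for word, _ in counts.most_common()]
```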


Small World Property

Analysis of the Linguabase network shows that 76% of English word pairs connect in 7 hops or fewer. Average path length is 6.43 steps.

1 hop: 0.01%
2 hops: 0.15%
3 hops: 2.1%
4 hops: 10%
5 hops: 21.6%
6 hops: 24.2%
7 hops: 18.3%
8+ hops: 23.6%

Of the 2M headwords, about 870K are reachable through the top-40 associations of other words. The semantic space is more connected than it looks.
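Hop distances of this kind come from standard breadth-first search over the association graph; a minimal sketch for a single source word:

```python
from collections import Counter, deque

def hop_counts(graph, source):
    """Breadth-first hop distances from one word. Sampling many sources
    and aggregating the distances yields a hop distribution."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return Counter(dist.values())
```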

Continuous Improvement

Linguabase is actively maintained, not a static dataset.

The People Behind It

Fifteen years of work by lexicographers, linguists, data architects, and dozens of vocabulary contributors. See the full credits →

Pricing →

Talk to us about your game.

linguabase@idea.org