The Missing Infrastructure for Word Games

Vocabulary, definitions, associations, word families, and 1.46M usage examples in one licensable package.

Linguabase was built to power word games. It’s been in development since 2011 and currently powers “In Other Words”, a live iOS game where players navigate semantic space. Linguabase provides the data layer for anyone who wants to explore this space.

Linguabase is built on a foundation of human-generated data—professional lexicographic work, curated word lists, and structured linguistic resources accumulated over a decade. This foundation is then enhanced through focused LLM queries that validate, rank, and expand relationships. The result is an amalgam that neither human curation nor LLM inference could produce alone.

Why this can’t be done manually: A skilled lexicographer might spend an hour per word building 50 quality associations. Multiply that by 400,000 words and you get 200 person-years of work—before accounting for consistency checking, sense separation, or quality control.

Three Things Every Word Game Needs

Words, links, meaning. License what you need:

Words
Vocabulary
400K words with difficulty scores—from everyday vocabulary to crossword-worthy rarities. Includes 200K multi-word expressions. (Why this matters)
Content Filters
Two word lists you control. Hard-block list of purely offensive words. Soft-block list of words carrying unwanted innuendo.
Links
Associations
~40 related words per entry, weighted by relationship strength. Each word decomposed into facets of meaning with related words for each.
Word Families
Morphological groupings: run → runs, running, ran, runner, runway, outrun.
Meaning
Definitions
400K readable paragraphs in flowing sentences. Useful as in-game clues or help text.
Usage Examples
1.46M quotations written by humans with intent. Common words from famous literature; uncommon words from Wikipedia and open-access sources.

Delivered as files you can embed (TSV, SQLite, JSON) or query via API. See delivery options →


Vocabulary

The vocabulary layer provides ranked word lists, letting you tune your game’s vocabulary to the difficulty and obscurity level it needs.

Why 400K Words?

In our experience, 400,000 words delivers an unabridged experience that is complete without being noisy. Larger word counts contaminate the player experience with obscure and borderline words. Smaller counts often miss legitimate words players expect to find. We can vary the delivered word count, but 400K is our recommendation.

For comparison:

NASPA (Scrabble US)
176K
Merriam-Webster Collegiate
225K
Longman LDOCE
230K
Collins Scrabble Words
267K
Linguabase deployed
~400K
Webster’s Third Unabridged
476K
Linguabase full
1.5M

Related
Words with Spaces — why “boiling water” and half a million other compound phrases aren’t in any dictionary.
But these numbers obscure a deeper question: what counts as “one word”? Every dictionary draws the line differently, and it’s nearly impossible to count how many words are in English—even with clear inclusion policies on British vs. American spelling, proper nouns, accent variants, and punctuation. (For reference, Roget’s original 1852 thesaurus covered the working vocabulary of English with just 15,000 words across 1,000 semantic categories—a fraction of modern dictionary counts, yet considered comprehensive for its purpose.)

Consider these cases where reasonable people disagree:

Category Examples Same word or different?
Diacritics naïve / naive, café / cafe, resumé / resume Style guides vary (New Yorker keeps diacritics)
Compounds ice cream / ice-cream / icecream, e-mail / email All three forms in active use
UK/US spelling colour / color, grey / gray, judgement / judgment Regional preference, both valid
Simplified doughnut / donut, whiskey / whisky, all right / alright Traditional vs modern, both current
Capitalization WiFi / Wi-Fi / wifi / wi-fi All four appear in published text

Which of the above count as words? All of them? Just some? And which variants count as separate words?

Linguabase includes all spelling variants for common words and primary variants for less common ones—so your game recognizes “naïve” and “naive,” “café” and “cafe,” without inflating word counts artificially or rejecting spellings players reasonably expect to work.
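As a sketch of what variant-aware lookup can look like, a simple variant-to-canonical map lets a game accept any listed spelling. The map, function name, and entries below are illustrative, not the Linguabase schema:

```python
# Sketch of variant-aware lookup, assuming a variant -> canonical map.
# Entries are illustrative; the shipped data provides the real mapping.
VARIANTS = {
    "naïve": "naive",
    "café": "cafe",
    "colour": "color",
    "e-mail": "email",
    "ice-cream": "ice cream",
}

def canonical(word: str) -> str:
    """Resolve a player-entered spelling to its canonical headword."""
    # Case folding shown for simplicity; case-sensitive entries
    # (e.g. Polish vs. polish) need separate handling.
    w = word.strip().lower()
    return VARIANTS.get(w, w)

assert canonical("Naïve") == "naive"
assert canonical("colour") == "color"
assert canonical("gray") == "gray"  # already canonical
```

The point is that players typing “naïve” or “naive” reach the same entry without the two spellings inflating the word count.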

Difficulty Rankings

Measuring usefulness for ranking is intuitive but difficult to quantify. Ranking by frequency in books or spoken-word corpora would put “the” and “of” at the top while missing that “red panda” is easily understood by most English speakers. PageRank-style algorithms surface superconnectors, but connectivity is not the same as understandability.

The Linguabase word ranking is an effective proxy for what words are known to a broad user base, including English language learners. The highest ranks align closely with vocabulary standards like the Oxford 3000 and CEFR levels (A1–B2)—but enhanced with common derivatives. Where Oxford 3000 lists “actor,” our ranking includes both “actor” and “actors.”

Here’s what you find at different ranks of the Linguabase ranking:

Rank Examples Assessment
~15K reminds, talking Common words everyone knows
~50K skinhead, conflagration Still widely recognized
~125K apperception, surliest Educated vocabulary
~300K endolysin, phytogeographic Technical/scientific terms
~400K disendorsed, tannicity Rare but real—our threshold

For “In Other Words,” we thresholded at ~400K. This includes every common word any player would want, plus enough depth for interesting discoveries. Different applications might threshold differently—a crossword game might want terms at rank 200K that a casual word game would skip.
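A minimal sketch of threshold-based filtering, assuming each entry carries a difficulty rank (the field names and sample ranks are illustrative, not the shipped schema):

```python
# Sketch: filtering a ranked vocabulary for different game profiles.
# Field names and ranks are assumptions for illustration.
vocab = [
    {"word": "talking", "rank": 15_000},
    {"word": "conflagration", "rank": 50_000},
    {"word": "apperception", "rank": 125_000},
    {"word": "endolysin", "rank": 300_000},
]

def playable(entries, max_rank):
    """Keep only words at or below the chosen difficulty threshold."""
    return [e["word"] for e in entries if e["rank"] <= max_rank]

casual = playable(vocab, 50_000)       # everyday vocabulary only
crossword = playable(vocab, 200_000)   # deeper, crossword-friendly cut
```

Different thresholds produce different games from the same data, which is the point of shipping ranks rather than a fixed list.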


Definitions

Linguabase definitions are readable paragraphs—2–3 sentence blurbs that cover all a word’s meanings naturally, not dictionary-style numbered fragments.

Why LLMs Can Do This (Sort Of)

This is the layer where LLMs come closest to being a substitute. You could brute-force 400K API calls to generate definitions in your style. But there are catches.


Content Filters

For applications with content policies, Linguabase provides two word lists you control:

Hard Block List

A few thousand words, including vulgar terms and creative permutations of sexual and racial slurs. These words exist in dictionaries but often have no legitimate use in puzzle-making or other public, shareable contexts.

Soft Block List

Many words are technically inoffensive but may be inappropriate for automated puzzle generation.

We developed the soft-block list to exclude these words from generated puzzles, though they should typically be allowed as valid user-entered answers. You might not want to generate a puzzle featuring “knockers” as a target word, but it could be a perfectly acceptable player-provided answer in a spelling game about doors.
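The two-list policy can be sketched as follows; the lists and function names here are illustrative placeholders, not the shipped data:

```python
# Sketch of the two-list policy: hard-blocked words are rejected
# everywhere; soft-blocked words are skipped during puzzle generation
# but still accepted as player answers. Lists are placeholders.
HARD_BLOCK = {"<slur>"}        # placeholder; the real list ships with the data
SOFT_BLOCK = {"knockers"}

def ok_for_generation(word: str) -> bool:
    """Target words for generated puzzles must pass both lists."""
    return word not in HARD_BLOCK and word not in SOFT_BLOCK

def ok_as_player_answer(word: str) -> bool:
    """Player-entered answers only need to pass the hard-block list."""
    return word not in HARD_BLOCK

assert not ok_for_generation("knockers")
assert ok_as_player_answer("knockers")
```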


Word Associations

Words relate to other words in different ways. Linguabase handles three distinct types of multi-meaning relationships:

Type Definition Example
Homographs Unrelated words that share spelling by coincidence pupil: eye part vs. student (different origins)
Polysemy Meanings that branched from a single origin mouth: body part → river mouth → “mouthing off”
Facets Different aspects of one core meaning elephant: anatomy, behavior, symbolism, habitat

Each type requires different handling. Homographs need complete separation—eye-pupil associations shouldn’t contaminate student-pupil associations. Polysemous words need their branching senses tracked. Facets can be blended or kept separate depending on your use case.
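A minimal sketch of how sense-separated data might be consumed, assuming a per-(word, sense) association map (illustrative, not the Linguabase schema). Homograph senses stay separated; facets of one core meaning can be blended on request:

```python
# Illustrative per-sense association store. Homographs (pupil) should
# never be blended; facets (elephant) can be merged when the game
# wants one combined pool.
entries = {
    ("pupil", "eye part"): ["iris", "retina", "dilate"],
    ("pupil", "student"): ["teacher", "classroom", "lesson"],
    ("elephant", "anatomy"): ["trunk", "tusks", "ears"],
    ("elephant", "symbolism"): ["memory", "wisdom", "Ganesh"],
}

def associations(word, sense=None, blend=False):
    """Return one sense's words, or a blended pool across senses."""
    if sense is not None:
        return entries[(word, sense)]
    if blend:  # merge facets of one core meaning into a single pool
        pool = []
        for (w, _), words in entries.items():
            if w == word:
                pool.extend(words)
        return pool
    raise ValueError("ambiguous word: pass a sense or set blend=True")
```

Whether to blend is a game-design choice; the data supports both paths.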

Core Associations

The core associations give you approximately 40 related words per entry, weighted by relationship strength. This represents an amalgam of different meanings, connotations, and facets of a word:

Core Associations
elephant tusk, pachyderm, trunk, hippopotamus, mammoth, ivory, savanna, giraffe, Ganesh, lion, proboscidean, poaching, Dumbo, Elephas, Hannibal, matriarch, rhinoceros, mahout, herd, ears, circus, thick-skinned, hide, zebra, herbivore, peanuts, mammal, massive, Africa, megafauna, intelligent, Republican, stomp, trumpeting, conservation, gray, jungle, watering hole, memory, majestic

Drilling Deeper into Meaning

Beyond the core associations, Linguabase provides pools of words that reflect narrow, particular meanings of any headword:

Sense Pools
elephant [anatomy]
trunk, tusks, ears, pachyderm, proboscis, wrinkled, gray, prehensile, hide, feet, tail, molars, musth, thick-skinned
elephant [behavior]
matriarchal, herd, social, intelligence, mourning, memory, communication, infrasound, bonding, empathy, tool use, play, bathing, dusting, migration
elephant [symbolism]
memory, wisdom, good luck, Ganesh, strength, power, loyalty, patience, Republican, GOP, white elephant, elephant in the room, Horton

Why Not Just Use an LLM?

You could prompt an LLM for word associations. Superficially, it looks fine. But problems lurk under the surface.

LLM inferences are a major input signal that we use, with multiple query types feeding into our ranking and validation pipeline. But in our experience, LLMs serve as better editors than authors. Our workflow pools data about a word to ask rich questions to an LLM—like “which of the following words are strongly related?”—instead of asking the LLM to provide all the answers. More on LLM limitations →

False Cognate Removal

False cognates are words that look related but aren’t—they share spelling patterns by coincidence, not common origin. String-similarity filters can’t detect them, and LLMs sometimes hallucinate associations between them:

False Cognates Removed
dig digress — Latin "dis+gradi" (go apart) ≠ English dig
pan pandemic, panorama — Greek "pan-" (all) ≠ cooking vessel
ant antebellum, anteroom — Latin "ante-" (before) ≠ insect
man manage, mansion, manual — Latin "manus" (hand) ≠ person

We used LLM-based auditing to identify false cognates across all headwords.

This audit removed 291,062 false cognates. For example, it caught a data error that associated “grave” (burial) with “gravity” (from Latin gravis, “heavy”).

Capitalization Intelligence

Since lowercase words are capitalized at the beginning of sentences, LLMs often treat text case-insensitively, conflating “polish” (verb) with “Polish” (nationality). For semantic graphs, this contaminates association lists—“Poland” shouldn’t appear in the associations for shoe polish.

Word Lowercase Capitalized
turkey / Turkey ✓ the bird ✓ the country
polish / Polish ✓ to shine ✓ nationality
boston / Boston ✗ not a word ✓ the city
swat / Swat / SWAT ✓ to hit ✓ Pakistan region / ✓ police unit

We evaluated capitalization variants for ambiguous terms, testing whether each form works in natural sentences.
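A case-sensitive lookup consistent with the table above might look like this sketch (structure and entries illustrative):

```python
# Sketch of case-sensitive headword lookup: try the exact surface form
# first, then fall back to lowercase only if that form exists.
# Entries are illustrative, following the table above.
CASE_FORMS = {
    "turkey": "the bird",
    "Turkey": "the country",
    "polish": "to shine",
    "Polish": "nationality",
    "Boston": "the city",   # no lowercase entry: "boston" is not a word
}

def lookup(surface: str):
    """Return the sense for this exact casing, or None if no form exists."""
    if surface in CASE_FORMS:
        return CASE_FORMS[surface]
    return CASE_FORMS.get(surface.lower())

assert lookup("Boston") == "the city"
assert lookup("boston") is None
assert lookup("POLISH") == "to shine"  # falls back to the lowercase form
```

Keeping casing in the key is what prevents “Poland” from leaking into the associations for shoe polish.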

Common Word Coverage

Traditional thesauruses treat function words as “stopwords”—ignored entirely. Linguabase does the opposite: we put extra effort into building associations for hundreds of the most common words that other sources systematically skip.

Function Word Coverage
"and" so, plus, together with, nor, but, furthermore, ampersand, conjunction, additionally, as well as, moreover, including, copulative, union, link, connective, meanwhile, paired, combined, mutual, coupled, continuation, intertwined, likewise...
"while" although, whereas, yet, simultaneously, period, meanwhile, whilst, notwithstanding, interval, albeit, duration, concurrent, throughout, contrast, lingering, in tandem, temporary, passing, momentary, interlude, even as, span...

These words are the glue of language. When your game needs to understand how “and” connects to other concepts, we have answers that a thesaurus typically won’t provide.

Gestalt Enrichment

One source of richness in Linguabase is gestalt relations—experiential and sensory associations that taxonomic approaches miss entirely. These are folded into the core associations, enriching what you get for each word.

Type Example
Visual elephant → gray, wrinkled
Sensory crisis → siren, sweat, rubble
Cultural wedding → white, rice, tears
Emotional home → warmth, safety, belonging

These are NOT synonyms. They’re how humans actually experience concepts. A thesaurus typically won’t tell you that “crisis” evokes “siren”—but humans know this instantly.


Word Families

Morphological and etymological groupings that connect related word forms:

Word Family
elephant elephants, elephantine, elephant's, elephant bird, elephant shrew, elephantiasis, elephant trap, elephantry, elephantoid, Elephas, elephantesque, elephantlike, elephant ear, elephant seal, elephanthood, elephant in the room, elephant grass

Variant vs Semantic Distinction

Word families serve two different purposes, depending on your game design needs:

Category Purpose Example Game behavior
Variants Substitutable forms run → runs, running, ran Silent swap for connectivity
Semantic Different meanings run → runway, runaway, outrun Real navigation move

Why this matters:

But not everything that looks splittable should be split. “Mushroom” contains “room” and “mush,” but it’s not a compound of those words—it comes from French mousseron. String patterns alone can’t distinguish real morphological relationships from coincidental letter sequences. This is why the etymological audit matters.
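The variant/semantic distinction can be sketched in game logic, with data and names purely illustrative:

```python
# Sketch: variants are silently interchangeable for connectivity;
# semantic relatives cost a real navigation move. Family data is
# illustrative, not the shipped format.
FAMILY = {
    "run": {
        "variants": {"runs", "running", "ran"},
        "semantic": {"runway", "runaway", "outrun"},
    }
}

def classify_move(headword: str, played: str) -> str:
    """Decide how a played word relates to the current headword."""
    fam = FAMILY[headword]
    if played in fam["variants"]:
        return "silent-swap"   # treat as the same node for connectivity
    if played in fam["semantic"]:
        return "navigation"    # a real move through semantic space
    return "invalid"

assert classify_move("run", "running") == "silent-swap"
assert classify_move("run", "runway") == "navigation"
```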


Usage Examples

1.46M illustrative quotations written by humans with intent. Examples for common words are drawn almost entirely from famous literature—Pulitzer Prize winners and other prestigious sources. Uncommon and technical words draw on Wikipedia and open-access science journals, still showing the word used in real context by real writers.


The Data Pipeline

Linguabase is built on a decade of work that began before LLMs existed. The foundation is human curation and pre-LLM computational linguistics; modern LLMs handle validation at scale.

Three Phases: Expand, Audit, Contract

Our pipeline follows a consistent pattern for each data layer:

Phase Goal Methods
Expand Gather candidates from every plausible source 70+ reference sources, computational linguistics, human curation, Library of Congress
Audit Evaluate and score each candidate LLM validation, false cognate detection, sense separation, consistency checks
Contract Retain only production-quality results Threshold by score, remove duplicates, apply content filters, rank by strength

The result: we evaluate far more candidate relationships than we keep, shipping ~40M high-quality connections.

Development Timeline

Period Focus
2011–2012 Initial game development, early word lists and association data
2013–2014 NSF XSEDE grant: 2.3M supercomputer hours for LDA topic modeling and Word2Vec
2015–2022 Database expansion, 70+ reference source integration, professional lexicography
2023–present LLM-assisted validation at scale, false cognate auditing, production refinement

Professional Lexicography

Orin Hargraves (professional lexicographer, contributor to major dictionaries) wrote 2,000+ custom definitions and 4,400+ sense-grouped word associations. His work focused on the words that need the most human judgment: interjections, prepositions, the highest-frequency words, and words with so many dictionary senses that their entries are virtually unreadable.

Linguistics grad students and post-docs created 5,000+ thematic word lists over several years.

Reference Sources


We integrated 70+ linguistic resources—professional thesauri like the NASA Thesaurus, public domain lexicons like WordNet and Roget’s (we explain why these aren’t enough on their own), Library of Congress subject headings, and specialized vocabularies spanning scientific, governmental, artistic, and medical domains. Each source required custom parsing and integration; the specific methods reflect years of experimentation about what works for word games versus dictionary lookup.

Pre-LLM Computational Linguistics


Before LLMs existed, we built the initial graph using classical NLP.

These methods produced the raw material—millions of candidate relationships—that human curation and later LLM validation refined into production-quality data.

Library of Congress Expansion


We methodically processed all 648,000 Library of Congress subject classifications—capturing the themes of millions of books humans wanted to write in English. Human librarian classifiers organized these into topics like “orange horticulture” or “Indus Valley civilization.” By analyzing pools of words across these classifications, we discovered semantic clusters that no dictionary or thesaurus would surface. These serve as “idea seeds” for expanding our association graph into domains that traditional lexicography misses.

LLM Validation

We use LLMs for validation, not generation. The difference matters—see why LLMs can’t generate this data reliably. An LLM can confirm that “key → reef” is a valid association (it’ll say yes), but it won’t reliably generate that association on its own. Our pipeline proposes candidates from the sources above, then uses LLM scoring to rank, filter, and audit them.

Production Refinement

Linguabase has powered “In Other Words” since 2011. Countless edge cases have emerged through real gameplay that no automated process would surface.

Ongoing Curation

Beyond automated processing, our manual override layer includes 50,000+ hand-curated entries (definitions, sense-grouped associations, thematic lists) plus hundreds of thousands of corrections for plurals, word families, and edge cases.

This is the work that separates production-quality data from raw output—and it’s ongoing. Each pipeline run incorporates feedback from the previous one.

Why 1.5M Internally, 400K Deployed?

Internally, we maintain 1.5 million words—including all of Wiktionary plus the top 200,000 words from Wikipedia. But we don’t ship all of these because noise gets in. We don’t include “Declaration of Independence” even though it’s in our word list; we don’t include obscure proteins that no player would recognize. The 400K threshold captures every word players actually want, without the noise. The hard part isn’t building a huge list—it’s ranking and curating it.


Game Mechanics Based on Word Relationships

Because the data is structured as a weighted graph, you can build game mechanics based on associations and linkages. The weighted connections enable real-time content customization—filtering by relationship strength, guiding players toward goals, or generating puzzles with tunable difficulty. (Or skip the engineering—we can generate puzzle data for your game mechanics directly.)
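As a sketch, filtering associations by edge weight is a direct way to tune difficulty. The weights and field layout below are illustrative, not the shipped schema:

```python
# Sketch: tuning puzzle difficulty by filtering weighted associations.
# Edge weights are illustrative; the shipped data provides real strengths.
edges = {
    "elephant": [("trunk", 0.95), ("ivory", 0.80), ("mahout", 0.35)],
}

def neighbors(word, min_strength=0.0):
    """Stronger-only edges yield easier puzzles; lower the floor for depth."""
    return [w for w, s in edges[word] if s >= min_strength]

easy = neighbors("elephant", min_strength=0.7)   # obvious connections
hard = neighbors("elephant")                     # include obscure links
```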

Pathfinding

Find how any two words connect through the graph:

Pathfinding Examples
sugar → sweet → pleasant → calm → peace
Batman → vigilante → watchful → circumspect → inspect

We generate pathfinding puzzles like this at scale—with validated paths, precalculated hints, and difficulty ratings.
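As a sketch of what finding such a chain involves, a breadth-first search over the association graph returns a shortest path. The toy graph below is illustrative, not Linguabase data:

```python
# Minimal breadth-first pathfinding over an association graph.
# The toy graph is illustrative.
from collections import deque

graph = {
    "sugar": ["sweet"], "sweet": ["pleasant"],
    "pleasant": ["calm"], "calm": ["peace"], "peace": [],
}

def find_path(start, goal):
    """Return a shortest chain of associations, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

assert find_path("sugar", "peace") == ["sugar", "sweet", "pleasant", "calm", "peace"]
```

At production scale the interesting work is in the weights, hints, and difficulty ratings layered on top of this basic search.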

Convergence

If you’re building a game that involves non-adjacent word associations (like a chain-association game), brute-force graph search branches exponentially and cannot run in real time. But using our data, you can pre-calculate convergence probabilities to targets.

Given a target word, which intermediate words get you closer? Consider “ellipse”—its associations organize into layers by distance:

Convergence Layers
"ellipse" Layer 1
oval, oblong, circular, spherical, curved, round, loop...
"ellipse" Layer 2
ellipsoid, geometer, hypotrochoid, toric, curve, orbit, arc...
"ellipse" Layer 3
spheroid, ovate, spheroidal, cycloid, orbiting...

Paths using Layer 1 terms are easier; paths requiring Layer 2–3 terms are progressively harder. This enables hint systems and adaptive difficulty. See Puzzle Licensing for how we package this into game-ready data.
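Precomputing these layers amounts to a breadth-first traversal from the target, grouping every reachable word by hop distance. The toy graph below is illustrative; in practice edge direction matters, so reverse edges may be needed:

```python
# Sketch: group words by hop distance from a target word, as in the
# convergence layers above. The toy graph is illustrative.
from collections import deque

graph = {
    "ellipse": ["oval", "curved"],
    "oval": ["ellipsoid"], "curved": ["curve"],
    "ellipsoid": ["spheroid"], "curve": [],
    "spheroid": [],
}

def layers(target):
    """Map each reachable word to its hop distance from the target."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        word = queue.popleft()
        for nxt in graph.get(word, []):
            if nxt not in dist:
                dist[nxt] = dist[word] + 1
                queue.append(nxt)
    return dist

d = layers("ellipse")
assert d["oval"] == 1 and d["ellipsoid"] == 2 and d["spheroid"] == 3
```

Once the distance map exists, hint systems and adaptive difficulty are lookups, not searches.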

Small World Property

Analysis of the Linguabase graph shows that 76% of English word pairs connect in 7 hops or fewer. Average path length is 6.43 steps.

The hop-distance distribution peaks at 5–6 hops:

Hops 1 2 3 4 5 6 7 8+
% of pairs 0.01% 0.15% 2.1% 10% 21.6% 24.2% 18.3% 23.6%

Of the 1.5M headwords, about 870K (57%) are reachable through the top-40 associations of other words. The remaining 43% are rare or isolated terms that don’t appear in other words’ association lists.

This is important because it’s a foundational property of the English language that you can exploit to create your own games. The semantic space is more connected than people realize.

Continuous Improvement

Linguabase is actively maintained, not a static dataset.

License terms include current data with update arrangements available.