How Linguabase Works

The Architecture

Core Associations
Primary weighted relationships per headword (~40 each, ranked by strength)
Sense Clouds
Associations grouped by word meaning (bank_financial vs bank_river)
Word Families
Morphological + etymological groupings (elephant → elephantine, elephantiasis)
Usage Examples
1.46M illustrative quotations from literature, journalism, and scholarly sources
Definitions
Narrative paragraphs covering all senses naturally
Content Filters
Hard block (offensive) + soft block (suggestive) lists

We maintain a full 1.5M-word backend and deploy a curated 400K-word subset for production — every word players actually want, without the noise.

Semantic closeness is impossible to precisely quantify — but the connection weights serve as an effective proxy. Lower weights indicate either lower confidence or more oblique/distant relationships. Higher weights indicate stronger, more immediate associations.
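The weight-as-proxy idea can be sketched with a toy data structure. The words, weights, and function names below are illustrative, not Linguabase's actual schema:

```python
# Hypothetical shape of a Linguabase-style entry: each headword maps to
# weighted associations; higher weight = stronger, more immediate link.
associations = {
    "elephant": [("tusk", 0.92), ("gray", 0.61), ("Hannibal", 0.34)],
}

def top_associations(word, k=2):
    """Return the k strongest associations, ranked by weight."""
    ranked = sorted(associations.get(word, []),
                    key=lambda pair: pair[1], reverse=True)
    return [assoc for assoc, _ in ranked[:k]]

print(top_associations("elephant"))  # -> ['tusk', 'gray']
```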

Why Not Just Use an LLM?

You could prompt ChatGPT for word associations. For a single query, it works fine. But at scale, you’ll hit problems:

Linguabase isn’t a replacement for LLMs — we use them extensively for ranking and validation. But LLMs are better editors than authors. Our data provides the diverse, sense-balanced, edge-case-handled foundation that makes LLM-based applications feel polished rather than generic.

Vocabulary Depth: Why 400K?

How many words are in English? It’s nearly impossible to count, even with clear policies on British vs. American spelling, proper nouns, accent variants, and punctuation. Every dictionary draws the line differently.

For comparison:

But what counts as “one word”? Consider these cases where reasonable people disagree:

| Category | Examples | Same word or different? |
| --- | --- | --- |
| Diacritics | naïve / naive, café / cafe, resumé / resume | Style guides vary (New Yorker keeps diacritics) |
| Compounds | ice cream / ice-cream / icecream, e-mail / email | All three forms in active use |
| UK/US spelling | colour / color, grey / gray, judgement / judgment | Regional preference, both valid |
| Simplified | doughnut / donut, whiskey / whisky, all right / alright | Traditional vs. modern, both current |
| Capitalization | WiFi / Wi-Fi / wifi / wi-fi | All four appear in published text |

Linguabase includes all common variants as separate entries. This is arguably more useful for a semantic graph — each form can have its own association cloud reflecting how it’s actually used — but it means our word counts are higher than dictionaries that merge variants under a single headword.

Why maintain 1.5M but deploy 400K? Two reasons. First, we’re continually improving — new words enter English constantly, and the pipeline keeps refining. Second, words form a network. Even terms at rank 1.2M contribute signal that helps rank everything else, the same way PageRank uses links from obscure pages to determine which pages matter.

But measuring “usefulness” for ranking is harder than it sounds. Objective measures like raw frequency would rank “the” and “of” highest, while missing that a low-frequency term like “red panda” is instantly understood by most English speakers. PageRank-style algorithms surface superconnectors rather than familiar terms.

The Linguabase word ranking is an effective proxy for what words are known to a broad user base, including English language learners. The highest ranks align closely with vocabulary standards like the Oxford 3000 and CEFR levels (A1–B2) — but enhanced with common derivatives. Where Oxford 3000 lists “actor,” our ranking includes both “actor” and “actors.”

Here’s what you find at different ranks:

| Rank | Examples | Assessment |
| --- | --- | --- |
| ~15K | reminds, talking | Common words everyone knows |
| ~50K | skinhead, conflagration | Still widely recognized |
| ~125K | apperception, surliest | Educated vocabulary |
| ~300K | endolysin, phytogeographic | Technical/scientific terms |
| ~400K | disendorsed, tannicity | Rare but real — our threshold |
| ~700K | mathematica, dribs and drabs | Proper nouns, phrases creeping in |
| ~1M | dromion, interpeptide | Highly specialized jargon |
| ~1.3M | vestibulolingually, animatophiles | Almost never useful |
| ~1.5M | mmmccclv, hague | Roman numerals, demoted proper nouns |

For “In Other Words,” we thresholded at ~400K. This includes every common word any player might want, plus enough depth for interesting discoveries. Different applications can threshold differently; a medical app might keep deep technical vocabulary that a word game would skip.
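Per-application thresholding is then a simple filter over the ranked lexicon. A minimal sketch, with made-up ranks loosely matching the table above:

```python
# Illustrative word -> rank mapping (lower rank = more familiar).
lexicon = {
    "reminds": 15_000,
    "conflagration": 50_000,
    "endolysin": 300_000,
    "dromion": 1_000_000,
}

def deployable(words, threshold=400_000):
    """Keep only words ranked at or better than the deployment cutoff."""
    return {w for w, rank in words.items() if rank <= threshold}

print(sorted(deployable(lexicon)))                     # word-game cut at 400K
print(sorted(deployable(lexicon, threshold=100_000)))  # much tighter cut
```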

Superconnector Demotion

Some terms appear too often. “Heritage” appears 110,000+ times across our sources. “Unique” appears 168,000 times. Without intervention, these “superconnector” words would dominate every association list.

Our solution applies inverse-frequency weighting — a cousin of TF-IDF:

| Tier | Percentile | Effect | Example terms |
| --- | --- | --- | --- |
| 8 | Top 0.1% | +12 positions | heritage, tradition, folklore |
| 7 | Top 0.5% | +10 positions | harmony, ancestry, lineage |
| 6 | Top 1% | +8 positions | innovation, solidarity |
| 0–2 | Bottom 50% | No penalty (most words) | |

If “heritage” ranks #3 in a word’s association list, Tier 8 demotion pushes it down 12 positions to #15, out of the top results. This keeps superconnectors from crowding out more distinctive associations.
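A minimal sketch of the demotion step, assuming the tier penalties in the table are literal position offsets (the tie-breaking rule here is my assumption, not a documented detail):

```python
# Penalty in list positions per superconnector tier; tiers 0-2 carry none.
TIER_PENALTY = {8: 12, 7: 10, 6: 8}

def demote(ranked_words, tiers):
    """Push superconnectors down by their tier penalty, stable for the rest."""
    adjusted = []
    for pos, word in enumerate(ranked_words, start=1):
        penalty = TIER_PENALTY.get(tiers.get(word, 0), 0)
        # Demoted words sort after undemoted words at the same adjusted position.
        adjusted.append((pos + penalty, penalty > 0, word))
    return [word for _, _, word in sorted(adjusted)]

words = [f"w{i}" for i in range(1, 21)]
words[2] = "heritage"                # "heritage" starts at position 3
result = demote(words, {"heritage": 8})
print(result.index("heritage") + 1)  # 3 + 12 = position 15
```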

The Data Pipeline

This isn’t “GPT said these words are related.” Linguabase is built on a decade of pre-LLM work, now enhanced with LLM validation. The pipeline runs 28 hours end-to-end.

Phase 1: Human Sources (Pre-LLM)

Years of aggregating professional and curated linguistic resources:

Total: 70+ reference sources cross-referenced to validate factual word relationships.

Phase 2: In-House Custom Lists (Pre-LLM)

5,000+ thematic word lists created in-house by lexicographer Orin Hargraves:

Phase 3: Machine-Processed Sources (Pre-LLM)

Computational linguistics before LLMs existed:

Phase 4: LLM-Era Enhancement (2023-present)

All the above was amalgamated, then enhanced with modern LLMs:

The key insight: LLMs are better at recognizing valid relationships than generating them. We provide candidates from a decade of diverse sources, then use LLMs to evaluate, rank, and audit.

Manual Curation Layer

Beyond automated processing, 750,000+ hand-curated override lines address what algorithms miss:

This is the work that separates production-quality data from raw output.

Common Word Coverage

Traditional thesauruses treat function words as “stopwords” — ignored entirely. Linguabase does the opposite: we put extra effort into building associations for hundreds of the most common words that other sources systematically skip.

"and" → so, plus, together with, nor, but, furthermore, ampersand, conjunction, additionally, as well as, moreover, including, copulative, union, link, connective, meanwhile, paired, combined, mutual, coupled, continuation, intertwined, likewise...

"while" → although, whereas, yet, simultaneously, period, meanwhile, whilst, notwithstanding, interval, albeit, duration, concurrent, throughout, contrast, lingering, in tandem, temporary, passing, momentary, interlude, even as, span...

These words are the glue of language. When your game or AI needs to understand how “and” connects to other concepts, we have answers that no traditional thesaurus provides.

Gestalt Enrichment

One source of richness in Linguabase is gestalt relations — experiential and sensory associations that taxonomic approaches miss entirely. These aren’t delivered as a separate data layer; they’re folded into the core associations, enriching what you get for each word.

| Type | Example |
| --- | --- |
| Visual | elephant → gray, wrinkled |
| Sensory | crisis → siren, sweat, rubble |
| Cultural | wedding → white, rice, tears |
| Emotional | home → warmth, safety, belonging |

These are NOT synonyms. They’re how humans actually experience concepts. A thesaurus won’t tell you that “crisis” evokes “siren” — but humans know this instantly. By incorporating these into the association pipeline, Linguabase captures connections that feel natural to users even when they’re invisible to purely lexical approaches.

False Cognate Removal

False cognates are words that look related but aren’t — they share spelling patterns by coincidence, not common origin. String similarity filters can’t detect them:

WRONG connections that simple filters miss:

- dig → digress (Latin "dis+gradi" go apart ≠ English dig)
- pan → pandemic, panorama (Greek "pan-" all ≠ cooking vessel)
- ant → antebellum, anteroom (Latin "ante-" before ≠ insect)
- man → manage, mansion, manual (Latin "manus" hand ≠ person)

Our solution: Opus 4.5 audited ALL 385K headwords for false cognates.

291,062 false-cognate connections removed, modifying 30.6% of entries. This catches subtle distinctions like “grave” (burial) vs. “gravity” (Latin gravis, “heavy”).

Capitalization Intelligence

Capitalization creates genuine meaning distinctions that most systems ignore:

| Word | Lowercase | Capitalized |
| --- | --- | --- |
| turkey / Turkey | ✓ the bird | ✓ the country |
| polish / Polish | ✓ to shine | ✓ nationality |
| boston / Boston | ✗ not a word | ✓ the city |
| swat / Swat / SWAT | ✓ to hit | ✓ Pakistan region / ✓ police unit |

LLMs often treat text case-insensitively, conflating “polish” (verb) with “Polish” (adjective/noun for nationality). For semantic graphs, this contaminates association lists — “Poland” shouldn’t appear in the associations for shoe polish.

Our solution: GPT-4.1 evaluated all 2n capitalization variants for 26K ambiguous terms, attempting natural sentences with each form. Results:

Variant vs Semantic Distinction

Word families serve two different purposes:

| Category | Purpose | Example | Game behavior |
| --- | --- | --- | --- |
| Variants | Substitutable forms | run → runs, running, ran | Silent swap for connectivity |
| Semantic | Different meanings | run → runway, runaway, outrun | Real navigation move |

Why this matters:

But not everything that looks splittable should be split. “Mushroom” contains “room” and “mush,” but it’s not a compound of those words — it comes from French mousseron. String patterns alone can’t distinguish real morphological relationships from coincidental letter sequences. This is why the etymological audit matters.

Results: 891K variant words, 2.75M semantic words — properly separated.
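The variant/semantic split in play could look like the following sketch. The family maps and the "silent-swap"/"navigation" labels are illustrative, not the shipped API:

```python
# Illustrative family data for a single headword.
VARIANTS = {"run": {"runs", "running", "ran"}}      # substitutable forms
SEMANTIC = {"run": {"runway", "runaway", "outrun"}}  # different meanings

def classify_move(current, entered):
    """Decide how a game should treat a player's entered word."""
    if entered in VARIANTS.get(current, ()):
        return "silent-swap"   # same word, different form: no move consumed
    if entered in SEMANTIC.get(current, ()):
        return "navigation"    # genuinely new meaning: a real move
    return "invalid"

print(classify_move("run", "running"))  # silent-swap
print(classify_move("run", "runway"))   # navigation
```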

Example: “elephant”

Core Associations

tusk, pachyderm, trunk, hippopotamus, mammoth, ivory, savanna, giraffe, Ganesh, lion, proboscidean, poaching, Dumbo, Elephas, Hannibal, matriarch, rhinoceros, mahout, herd, ears, circus, thick-skinned, hide, zebra, herbivore, peanuts, mammal, massive, Africa, megafauna, intelligent, Republican, stomp, trumpeting, conservation, gray, jungle, watering hole, memory, majestic, big five, calf

Sense Labels

anatomy | behavior | circus | heraldry | ivory | megafauna | size | symbolism | wildlife

Sense Clouds

anatomy: trunk, tusks, ears, pachyderm, proboscis, wrinkled, gray, prehensile, hide, feet, tail, molars, musth, thick-skinned

behavior: matriarchal, herd, social, intelligence, mourning, memory, communication, infrasound, bonding, empathy, tool use, play, bathing, dusting, migration

symbolism: memory, wisdom, good luck, Ganesh, strength, power, loyalty, patience, Republican, GOP, white elephant, elephant in the room, Horton

Word Family

elephants, elephantine, elephant's, elephant bird, elephant shrew, elephantiasis, elephant trap, elephantry, elephantoid, Elephas, elephantesque, elephantlike, elephant ear, elephant seal, elephanthood, elephant in the room, elephant grass

Graph Operations

Because the data is structured as a weighted graph, you can perform operations that flat synonym lists cannot support:

Pathfinding

Find how “sugar” connects to “peace”:

sugar → sweet → pleasant → calm → peace
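With the graph stored as adjacency lists, a chain like this falls out of an ordinary breadth-first search. A toy sketch with a hand-built graph mirroring the example (the edges are hypothetical):

```python
from collections import deque

# Tiny association graph; real Linguabase entries carry ~40 weighted edges.
graph = {
    "sugar": ["sweet", "cane"],
    "sweet": ["pleasant", "candy"],
    "pleasant": ["calm"],
    "calm": ["peace"],
}

def find_path(start, goal):
    """Breadth-first search for the shortest association chain."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(" → ".join(find_path("sugar", "peace")))
# sugar → sweet → pleasant → calm → peace
```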

Semantic Distance

How close are two concepts?

distance("cat", "dog")       = 0.15   // both pets, mammals
distance("cat", "democracy") = 0.92   // very distant
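Linguabase's actual distance formula isn't specified here. As one plausible stand-in, Jaccard distance over two words' association sets behaves the same way (near 0 for overlapping neighborhoods, near 1 for disjoint ones), though the exact values differ from the example above:

```python
# Jaccard distance over association sets: an illustrative stand-in,
# not Linguabase's actual metric. Association lists are made up.
def distance(a_assocs, b_assocs):
    a, b = set(a_assocs), set(b_assocs)
    return round(1 - len(a & b) / len(a | b), 2)

cat = ["pet", "mammal", "whiskers", "purr"]
dog = ["pet", "mammal", "bark", "loyal"]
democracy = ["vote", "government", "election", "freedom"]

print(distance(cat, dog))        # small-ish: shared neighbors
print(distance(cat, democracy))  # maximal: disjoint neighborhoods
```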

Convergence

Given a target word, which intermediate words get you closer?

target: "ocean"
from "fish": water (0.3), sea (0.2), marine (0.25)...
from "blue": water (0.4), deep (0.35), wave (0.3)...
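Convergence then reduces to sorting a word's neighbors by their distance to the target. A sketch with hypothetical distance values:

```python
# Hypothetical precomputed distances from each candidate word to "ocean".
distance_to_ocean = {"water": 0.3, "sea": 0.2, "marine": 0.25, "bark": 0.9}

def best_moves(neighbors, dist, k=3):
    """Neighbors that get you closest to the target, best first."""
    return sorted((n for n in neighbors if n in dist), key=dist.get)[:k]

print(best_moves(["water", "sea", "marine", "bark"], distance_to_ocean))
# ['sea', 'marine', 'water']
```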

Content Blocklists

For applications with content policies, Linguabase provides two levels of filtering:

Hard Block (Offensive Terms)

Sexual and racial slurs and their derivations, which most applications will want to exclude entirely, particularly for user-facing features like sharing, leaderboards, or public display. These words exist in dictionaries but have no legitimate use in puzzles or public contexts.

Soft Block (Suggestive Terms)

Words that are technically inoffensive but may be inappropriate for automated puzzle generation:

Developers can use soft-block lists to exclude these words from automated puzzle generation while still allowing them as valid user-entered answers. The distinction: a puzzle shouldn’t randomly present “cock” as a target word, but if a player types it as a valid answer for “rooster,” the game should accept it.
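In code, the two tiers translate to two validity checks. The list contents below are placeholders, not the shipped blocklists:

```python
# Placeholder blocklists; the real lists ship with Linguabase.
HARD_BLOCK = {"slur_example"}   # never shown, never accepted
SOFT_BLOCK = {"cock"}           # never auto-presented, but accepted if typed

def ok_as_puzzle_target(word):
    """Safe to present to the player in a generated puzzle?"""
    return word not in HARD_BLOCK and word not in SOFT_BLOCK

def ok_as_player_answer(word):
    """Acceptable when the player types it themselves?"""
    return word not in HARD_BLOCK

print(ok_as_puzzle_target("cock"))   # False: never auto-presented
print(ok_as_player_answer("cock"))   # True: accepted as a typed answer
```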

Continuous Improvement

Linguabase is actively maintained, not a static dataset:

License terms include current data with update arrangements available.

Small World Property

Analysis of the graph shows that 76% of English word pairs connect in 7 hops or fewer. Average path length is 6.43 steps.

The hop-distance distribution peaks at 5-6 hops:

| Hops | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8+ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| % of pairs | 0.01% | 0.15% | 2.1% | 10% | 21.6% | 24.2% | 18.3% | 23.6% |
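The reported 6.43-step average is consistent with this distribution: collapsing the open-ended 8+ bucket to exactly 8 hops gives a lower bound of about 6.17 on the mean, which implies the 8+ tail averages roughly 9 hops.

```python
# Hop distribution from the table above, as hop-count -> percent of pairs.
dist = {1: 0.01, 2: 0.15, 3: 2.1, 4: 10, 5: 21.6, 6: 24.2, 7: 18.3, 8: 23.6}

# Treating "8+" as exactly 8 yields a lower bound on the average path length.
lower_bound = sum(hops * pct / 100 for hops, pct in dist.items())
print(round(lower_bound, 2))  # ~6.17, below the reported 6.43 mean
```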

Of the 1.5M headwords, about 870K (57%) are reachable through the top-40 associations of other words. The remaining 43% are rare or isolated terms that don’t appear in other words’ association lists.

This is why word association games work. The semantic space is more connected than people realize.