We maintain a full 1.5M-word backend and deploy a curated 400K-word subset for production — every word players actually want, without the noise.
Semantic closeness is impossible to precisely quantify — but the connection weights serve as an effective proxy. Lower weights indicate either lower confidence or more oblique/distant relationships. Higher weights indicate stronger, more immediate associations.
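As a concrete sketch, an entry might look like this (a hypothetical shape; field names and weight values are illustrative assumptions, not the actual Linguabase schema):

```python
# Hypothetical record shape, for illustration only.
entry = {
    "headword": "elephant",
    "associations": [
        # (associated word, connection weight): higher = stronger, more immediate
        ("trunk", 0.93),   # illustrative value
        ("gray", 0.88),    # sensory association (see the gestalt section below)
        ("circus", 0.41),  # weaker: a more oblique, lower-confidence link
    ],
}
```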
You could prompt ChatGPT for word associations. For a single query, it works fine. But at scale, you’ll hit problems: generic associations, dominant senses crowding out rarer ones, and inconsistently handled edge cases.
Linguabase isn’t a replacement for LLMs — we use them extensively for ranking and validation. But LLMs are better editors than authors. Our data provides the diverse, sense-balanced, edge-case-handled foundation that makes LLM-based applications feel polished rather than generic.
How many words are in English? It’s nearly impossible to count, even with clear policies on British vs. American spelling, proper nouns, accent variants, and punctuation. Every dictionary draws the line differently.
But what counts as “one word”? Consider these cases where reasonable people disagree:
| Category | Examples | Same word or different? |
|---|---|---|
| Diacritics | naïve / naive, café / cafe, resumé / resume | Style guides vary (New Yorker keeps diacritics) |
| Compounds | ice cream / ice-cream / icecream, e-mail / email | All three forms in active use |
| UK/US spelling | colour / color, grey / gray, judgement / judgment | Regional preference, both valid |
| Simplified | doughnut / donut, whiskey / whisky, all right / alright | Traditional vs modern, both current |
| Capitalization | WiFi / Wi-Fi / wifi / wi-fi | All four appear in published text |
Linguabase includes all common variants as separate entries. This is arguably more useful for a semantic graph — each form can have its own association cloud reflecting how it’s actually used — but it means our word counts are higher than dictionaries that merge variants under a single headword.
Why maintain 1.5M but deploy 400K? Two reasons. First, we’re continually improving — new words enter English constantly, and the pipeline keeps refining. Second, words form a network. Even terms at rank 1.2M contribute signal that helps rank everything else, the same way PageRank uses links from obscure pages to determine which pages matter.
But measuring “usefulness” for ranking is harder than it sounds. Objective measures like pure frequency would rank “the” and “of” highest while missing that “red panda” is easily understood by most English speakers. PageRank-style algorithms surface superconnectors rather than familiar terms.
The Linguabase word ranking is an effective proxy for what words are known to a broad user base, including English language learners. The highest ranks align closely with vocabulary standards like the Oxford 3000 and CEFR levels (A1–B2) — but enhanced with common derivatives. Where Oxford 3000 lists “actor,” our ranking includes both “actor” and “actors.”
Here’s what you find at different ranks:
| Rank | Examples | Assessment |
|---|---|---|
| ~15K | reminds, talking | Common words everyone knows |
| ~50K | skinhead, conflagration | Still widely recognized |
| ~125K | apperception, surliest | Educated vocabulary |
| ~300K | endolysin, phytogeographic | Technical/scientific terms |
| ~400K | disendorsed, tannicity | Rare but real — our threshold |
| ~700K | mathematica, dribs and drabs | Proper nouns, phrases creeping in |
| ~1M | dromion, interpeptide | Highly specialized jargon |
| ~1.3M | vestibulolingually, animatophiles | Almost never useful |
| ~1.5M | mmmccclv, hague | Roman numerals, demoted proper nouns |
For “In Other Words,” we thresholded at ~400K. This includes every common word any player might want, plus enough depth for interesting discoveries. Different applications might threshold differently — a medical app might want terms at rank 300K that a word game would skip.
Some terms appear too often. “Heritage” appears 110,000+ times across our sources. “Unique” appears 168,000 times. Without intervention, these “superconnector” words would dominate every association list.
Our solution applies inverse-frequency weighting — a cousin of TF-IDF:
| Tier | Percentile | Effect | Example Terms |
|---|---|---|---|
| 8 | Top 0.1% | +12 positions | heritage, tradition, folklore |
| 7 | Top 0.5% | +10 positions | harmony, ancestry, lineage |
| 6 | Top 1% | +8 positions | innovation, solidarity |
| 0–2 | Bottom 50% | No penalty | (most words) |
If “heritage” ranks #3 in a word’s association list, after Tier 8 demotion it becomes #15 — pushed out of top results. Demotions make sure superconnectors don’t dominate.
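A minimal sketch of how such tier-based demotion could be applied, using the tier-to-offset mapping from the table above (the function and list shapes are illustrative, not the production pipeline):

```python
# Demotion offsets by superconnector tier (from the table above);
# tiers 0-2 (the bottom 50%) carry no penalty.
TIER_OFFSET = {8: 12, 7: 10, 6: 8}

def demote(ranked, tier_of):
    """Re-rank an association list, pushing superconnectors down."""
    def key(item):
        pos, word = item
        offset = TIER_OFFSET.get(tier_of.get(word, 0), 0)
        # Demoted words also lose ties, so a "+12 positions" demotion
        # lands exactly 12 slots lower.
        return (pos + offset, offset)
    indexed = list(enumerate(ranked, start=1))
    indexed.sort(key=key)
    return [word for _, word in indexed]

# "heritage" at #3 with a Tier 8 (+12) demotion lands at #15:
ranked = ["culture", "history", "heritage"] + [f"word{i}" for i in range(4, 21)]
print(demote(ranked, {"heritage": 8}).index("heritage") + 1)  # -> 15
```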
This isn’t “GPT said these words are related.” Linguabase is built on a decade of pre-LLM work, now enhanced with LLM validation. The pipeline runs 28 hours end-to-end.
It starts with years of aggregating professional and curated linguistic resources: 70+ reference sources in total, cross-referenced to validate factual word relationships.
On top of that sit 5,000+ thematic word lists created in-house by lexicographer Orin Hargraves, plus computational-linguistics work that predates LLMs entirely.
All of the above was amalgamated, then enhanced with modern LLMs.
The key insight: LLMs are better at recognizing valid relationships than generating them. We provide candidates from a decade of diverse sources, then use LLMs to evaluate, rank, and audit.
Beyond automated processing, 750,000+ hand-curated override lines address what algorithms miss.
This is the work that separates production-quality data from raw output.
Traditional thesauruses treat function words as “stopwords” — ignored entirely. Linguabase does the opposite: we put extra effort into building associations for hundreds of the most common words that other sources systematically skip.
These words are the glue of language. When your game or AI needs to understand how “and” connects to other concepts, we have answers that no traditional thesaurus provides.
One source of richness in Linguabase is gestalt relations — experiential and sensory associations that taxonomic approaches miss entirely. These aren’t delivered as a separate data layer; they’re folded into the core associations, enriching what you get for each word.
| Type | Example |
|---|---|
| Visual | elephant → gray, wrinkled |
| Sensory | crisis → siren, sweat, rubble |
| Cultural | wedding → white, rice, tears |
| Emotional | home → warmth, safety, belonging |
These are NOT synonyms. They’re how humans actually experience concepts. A thesaurus won’t tell you that “crisis” evokes “siren” — but humans know this instantly. By incorporating these into the association pipeline, Linguabase captures connections that feel natural to users even when they’re invisible to purely lexical approaches.
False cognates are words that look related but aren’t: they share spelling patterns by coincidence, not common origin. String similarity filters can’t detect them, as the toy example below shows.
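Consider “island” and “isle,” a textbook false-cognate pair I’m supplying for illustration (not necessarily one from the audit): “isle” descends from Latin insula, while “island” comes from Old English īegland, with the “s” added later by false analogy. The words are etymologically unrelated, yet a naive similarity filter scores them as close relatives:

```python
from difflib import SequenceMatcher

# "island" and "isle" overlap purely by historical accident,
# but a string-similarity filter scores them as related.
ratio = SequenceMatcher(None, "island", "isle").ratio()
print(f"{ratio:.2f}")  # 0.60: high enough to pass a naive "looks related" check
```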
Our solution: Opus 4.5 audited ALL 385K headwords for false cognates.
Capitalization creates genuine meaning distinctions that most systems ignore:
| Word | Lowercase | Capitalized |
|---|---|---|
| turkey / Turkey | ✓ the bird | ✓ the country |
| polish / Polish | ✓ to shine | ✓ nationality |
| boston / Boston | ✗ not a word | ✓ the city |
| swat / Swat / SWAT | ✓ to hit | ✓ Pakistan region / ✓ police unit |
LLMs often treat text case-insensitively, conflating “polish” (verb) with “Polish” (adjective/noun for nationality). For semantic graphs, this contaminates association lists — “Poland” shouldn’t appear in the associations for shoe polish.
Our solution: GPT-4.1 evaluated all 2ⁿ capitalization variants for 26K ambiguous terms, attempting natural sentences with each form.
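One plausible reading of that enumeration (my assumption, not stated in the source): for a term of n tokens, every per-token casing combination is generated, which yields exactly the four “wifi” forms in the earlier table. A sketch of that enumeration, not the production code:

```python
from itertools import product

def case_variants(term, sep="-"):
    """All 2^n per-token capitalization combinations of an n-token term,
    in both separated and fused spellings."""
    tokens = term.lower().split(sep)
    variants = set()
    for caps in product([False, True], repeat=len(tokens)):
        cased = [t.capitalize() if c else t for t, c in zip(tokens, caps)]
        variants.add(sep.join(cased))
        variants.add("".join(cased))  # fused form, e.g. "WiFi"
    return variants

print(sorted(case_variants("wi-fi")))
# ['Wi-Fi', 'Wi-fi', 'WiFi', 'Wifi', 'wi-Fi', 'wi-fi', 'wiFi', 'wifi']
```

Each generated form is then tested in a natural sentence to decide which casings carry real meaning.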
Word families serve two different purposes:
| Category | Purpose | Example | Game behavior |
|---|---|---|---|
| Variants | Substitutable forms | run → runs, running, ran | Silent swap for connectivity |
| Semantic | Different meanings | run → runway, runaway, outrun | Real navigation move |
Not everything that looks splittable should be split, though. “Mushroom” contains “room” and “mush,” but it’s not a compound of those words; it comes from French mousseron. String patterns alone can’t distinguish real morphological relationships from coincidental letter sequences. This is why the etymological audit matters.
Results: 891K variant words, 2.75M semantic words — properly separated.
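A minimal sketch of how the two families could drive the game behavior in the table above (the map names and data shapes are illustrative assumptions):

```python
# Hypothetical data shapes; the real Linguabase format may differ.
VARIANT_OF = {"runs": "run", "running": "run", "ran": "run"}
SEMANTIC = {"run": ["runway", "runaway", "outrun"]}

def canonical(word):
    """Variants swap silently: 'running' counts as 'run' for connectivity."""
    return VARIANT_OF.get(word, word)

def moves(word):
    """Semantic relatives are real navigation moves, not silent swaps."""
    return SEMANTIC.get(canonical(word), [])

print(moves("running"))  # ['runway', 'runaway', 'outrun']
```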
Because the data is structured as a weighted graph, you can perform operations that flat synonym lists cannot support: path finding (how does “sugar” connect to “peace”?), semantic distance (how close are two concepts?), and guidance (given a target word, which intermediate words get you closer?).
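A minimal sketch of all three operations over a toy graph (the graph contents and function names are my illustration; plain hop-counting BFS stands in for whatever weighted traversal a production system would use):

```python
from collections import deque

# Toy graph: word -> a few associations.
GRAPH = {
    "sugar": ["sweet", "cane"],
    "sweet": ["kind", "sugar"],
    "kind": ["peaceful", "sweet"],
    "peaceful": ["peace", "kind"],
    "peace": ["peaceful"],
    "cane": ["sugar"],
}

def shortest_path(start, goal):
    """Path finding: how does one word connect to another?"""
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in GRAPH.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

def distance(a, b):
    """Semantic closeness as hop count (lower = closer)."""
    path = shortest_path(a, b)
    return len(path) - 1 if path else None

def best_moves(current, target):
    """Guidance: which neighbors strictly reduce the distance to the target?"""
    d = distance(current, target)
    if d is None:
        return []
    return [n for n in GRAPH.get(current, [])
            if (nd := distance(n, target)) is not None and nd < d]

print(shortest_path("sugar", "peace"))  # ['sugar', 'sweet', 'kind', 'peaceful', 'peace']
print(distance("sugar", "peace"))       # 4
print(best_moves("sugar", "peace"))     # ['sweet']
```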
For applications with content policies, Linguabase provides two levels of filtering:
The hard-block list: colorful derivations of sexual and racial slurs that most applications will want to exclude entirely, particularly for user-facing features like sharing, leaderboards, or public display. These words exist in dictionaries but have no legitimate use in puzzles or public contexts.
The soft-block list: words that are technically inoffensive but may be inappropriate for automated puzzle generation.
Developers can use soft-block lists to exclude these words from automated puzzle generation while still allowing them as valid user-entered answers. The distinction: a puzzle shouldn’t randomly present “cock” as a target word, but if a player types it as a valid answer for “rooster,” the game should accept it.
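In code, the distinction might look like this (the list names and API shape are assumptions for illustration):

```python
# Hypothetical block lists; the shipped lists are far larger.
HARD_BLOCK = {"<slur>"}   # never generated, never accepted
SOFT_BLOCK = {"cock"}     # never generated, but accepted as an answer

def ok_as_puzzle_target(word):
    """Automated puzzle generation avoids both lists."""
    return word not in HARD_BLOCK and word not in SOFT_BLOCK

def ok_as_player_answer(word):
    """Player input only rejects the hard-block list."""
    return word not in HARD_BLOCK

print(ok_as_puzzle_target("cock"))  # False: don't present it as a target
print(ok_as_player_answer("cock"))  # True: accept it as an answer for "rooster"
```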
Linguabase is actively maintained, not a static dataset. License terms cover the current data, with update arrangements available.
Analysis of the graph shows that 76% of English word pairs connect in 7 hops or fewer. Average path length is 6.43 steps.
The hop-distance distribution peaks at 5–6 hops:
| Hops | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8+ |
|---|---|---|---|---|---|---|---|---|
| % of pairs | 0.01% | 0.15% | 2.1% | 10% | 21.6% | 24.2% | 18.3% | 23.6% |
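Distributions like this can be estimated by running BFS from a random sample of source words and counting hops to every reachable word. A sketch under that approach (function and variable names are mine, not a Linguabase API):

```python
import random
from collections import Counter, deque

def hop_histogram(graph, samples=1000):
    """Estimate the hop-distance distribution by BFS from random source words."""
    hist = Counter()
    for source in random.sample(list(graph), min(samples, len(graph))):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            word = queue.popleft()
            for nxt in graph.get(word, []):
                if nxt not in dist:
                    dist[nxt] = dist[word] + 1
                    queue.append(nxt)
        # Tally hops to every word reachable from this source.
        hist.update(d for w, d in dist.items() if w != source)
    return hist  # hist[k] = number of sampled pairs exactly k hops apart

toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(hop_histogram(toy, samples=3))  # Counter({1: 4, 2: 2})
```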
Of the 1.5M headwords, about 870K (57%) are reachable through the top-40 associations of other words. The remaining 43% are rare or isolated terms that don’t appear in other words’ association lists.
This is why word association games work. The semantic space is more connected than people realize.