Vocabulary, definitions, associations, word families, and 1.46M usage examples in one licensable package.
Linguabase was built to power word games. It’s been in development since 2011 and currently powers “In Other Words”, a live iOS game where players navigate semantic space. Linguabase provides the data layer for anyone who wants to explore this space.
Linguabase is built on a foundation of human-generated data—professional lexicographic work, curated word lists, and structured linguistic resources accumulated over a decade. This foundation is then enhanced through focused LLM queries that validate, rank, and expand relationships. The result is an amalgam that neither human curation nor LLM inference could produce alone.
Words, links, meaning. License what you need:
Delivered as files you can embed (TSV, SQLite, JSON) or query via API. See delivery options →
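To give a feel for the file deliverables, here is a minimal sketch of loading a rank-filtered vocabulary from a TSV file. The column names (`word`, `rank`) and the one-word-per-row layout are illustrative assumptions, not the shipped schema:

```python
import csv

def load_vocabulary(path, max_rank=400_000):
    """Read a word-per-row TSV, keeping entries at or under a rank threshold.

    Assumes hypothetical 'word' and 'rank' columns; the actual field
    names depend on the layers licensed.
    """
    words = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            rank = int(row["rank"])
            if rank <= max_rank:
                words[row["word"]] = rank
    return words
```

The SQLite and JSON deliveries expose the same data; TSV is shown here only because it needs no schema beyond a header row.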
The vocabulary layer provides word lists with difficulty rankings, enabling you to adjust your vocabulary for the difficulty and obscurity level your game needs.
In our experience, 400,000 words delivers an unabridged experience that is complete without being noisy. Larger word counts contaminate the player experience with obscure and borderline words. Smaller counts often miss legitimate words players expect to find. We can vary the delivered word count, but 400K is our recommendation.
For comparison:
Related: Words with Spaces — why “boiling water” and half a million other compound phrases aren’t in any dictionary.

But these numbers obscure a deeper question: what counts as “one word”? Every dictionary draws the line differently, and it’s nearly impossible to count how many words are in English—even with clear inclusion policies on British vs. American spelling, proper nouns, accent variants, and punctuation. (For reference, Roget’s original 1852 thesaurus covered the working vocabulary of English with just 15,000 words across 1,000 semantic categories—a fraction of modern dictionary counts, yet considered comprehensive for its purpose.)
Consider these cases where reasonable people disagree:
| Category | Examples | Same word or different? |
|---|---|---|
| Diacritics | naïve / naive, café / cafe, resumé / resume | Style guides vary (New Yorker keeps diacritics) |
| Compounds | ice cream / ice-cream / icecream, e-mail / email | All three forms in active use |
| UK/US spelling | colour / color, grey / gray, judgement / judgment | Regional preference, both valid |
| Simplified | doughnut / donut, whiskey / whisky, all right / alright | Traditional vs modern, both current |
| Capitalization | WiFi / Wi-Fi / wifi / wi-fi | All four appear in published text |
Which of the above count as words? All of them? Just some? And when two forms are both valid, do they count as one word or two?
Linguabase includes all spelling variants for common words and primary variants for less common ones—so your game recognizes “naïve” and “naive,” “café” and “cafe,” without inflating word counts artificially or rejecting spellings players reasonably expect to work.
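On the consuming side, a game can fold diacritics so that variant spellings share a lookup key. This is a minimal client-side sketch using Unicode NFD decomposition, not a description of how Linguabase stores variants internally:

```python
import unicodedata

def spelling_key(word):
    """Fold diacritics so 'naïve' and 'naive' share a lookup key.

    NFD decomposition splits 'ï' into 'i' plus a combining mark,
    which is then dropped (category Mn = nonspacing mark).
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
```

With keys like this, either spelling a player types resolves to the same vocabulary entry.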
Measuring usefulness for ranking is intuitive but difficult to quantify. Frequency of appearance in books or spoken word corpora would rank “the” and “of” highest while missing that “red panda” is easily understood by most English speakers. PageRank-style algorithms surface superconnectors, but that’s not the same as how understandable something is.
The Linguabase word ranking is an effective proxy for what words are known to a broad user base, including English language learners. The highest ranks align closely with vocabulary standards like the Oxford 3000 and CEFR levels (A1–B2)—but enhanced with common derivatives. Where Oxford 3000 lists “actor,” our ranking includes both “actor” and “actors.”
Here’s what you find at different ranks of the Linguabase ranking:
| Rank | Examples | Assessment |
|---|---|---|
| ~15K | reminds, talking | Common words everyone knows |
| ~50K | skinhead, conflagration | Still widely recognized |
| ~125K | apperception, surliest | Educated vocabulary |
| ~300K | endolysin, phytogeographic | Technical/scientific terms |
| ~400K | disendorsed, tannicity | Rare but real—our threshold |
For “In Other Words,” we thresholded at ~400K. This includes every common word any player would want, plus enough depth for interesting discoveries. Different applications might threshold differently—a crossword game might want terms at rank 200K that a casual word game would skip.
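The rank bands in the table above translate naturally into difficulty tiers. This sketch maps a rank to a label; the cutoffs mirror the table, but the labels and the idea of fixed tier boundaries are illustrative assumptions:

```python
# Tier cutoffs follow the rank bands in the table above;
# the labels are illustrative, not shipped metadata.
TIERS = [
    (15_000, "everyday"),
    (50_000, "widely recognized"),
    (125_000, "educated"),
    (300_000, "technical"),
    (400_000, "rare but real"),
]

def difficulty(rank):
    """Map a Linguabase-style rank to a coarse difficulty label."""
    for cutoff, label in TIERS:
        if rank <= cutoff:
            return label
    return "beyond threshold"
```

A crossword game might keep "beyond threshold" words that a casual game would drop, simply by raising the final cutoff.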
Linguabase definitions are readable paragraphs—2–3 sentence blurbs that cover all a word’s meanings naturally, not dictionary-style numbered fragments.
This is the layer where LLMs come closest to being a substitute. You could brute-force 400K API calls to generate definitions in your style. But there are catches:
For applications with content policies, Linguabase provides two word lists you control:
A few thousand words including vulgar terms and creative permutations of sexual and racial slurs. These words exist in dictionaries but often have no legitimate use in puzzle-making or public/sharing contexts.
Many words are technically inoffensive but may be inappropriate for automated puzzle generation:
We developed the soft-block list to exclude these words from automated puzzle generation, though they should typically still be allowed as valid user-entered answers. You might not want to generate a puzzle featuring “knockers” as a target word, but it could be a perfectly acceptable player-provided answer in a spelling game about doors.
Words relate to other words in different ways. Linguabase handles three distinct types of multi-meaning relationships:
| Type | Definition | Example |
|---|---|---|
| Homographs | Unrelated words that share spelling by coincidence | pupil: eye part vs. student (different origins) |
| Polysemy | Meanings that branched from a single origin | mouth: body part → river mouth → “mouthing off” |
| Facets | Different aspects of one core meaning | elephant: anatomy, behavior, symbolism, habitat |
Each type requires different handling. Homographs need complete separation—eye-pupil associations shouldn’t contaminate student-pupil associations. Polysemous words need their branching senses tracked. Facets can be blended or kept separate depending on your use case.
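One way to picture the separation requirement is to key associations by (word, sense) rather than by word alone. The structure and field values below are illustrative, not the shipped schema:

```python
# Hypothetical sense-separated store: homograph senses are kept
# fully apart, each with its own association list.
associations = {
    ("pupil", "eye part"): ["iris", "retina", "dilate"],
    ("pupil", "student"): ["teacher", "classroom", "lesson"],
}

def related(word, sense=None):
    """Return associations for one sense, or the union across all senses."""
    if sense is not None:
        return associations[(word, sense)]
    merged = []
    for (w, _), links in associations.items():
        if w == word:
            merged.extend(links)
    return merged
```

Blending (the `sense=None` path) suits facets; homographs demand the per-sense path so eye-pupil links never leak into student-pupil results.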
The core associations give you approximately 40 related words per entry, weighted by relationship strength. This represents an amalgam of different meanings, connotations, and facets of a word:
Beyond the core associations, Linguabase provides pools of words that reflect narrow, particular meanings of any headword:
You could prompt an LLM for word associations. Superficially, it looks fine. But problems lurk under the surface:
LLM inferences are a major input signal that we use, with multiple query types feeding into our ranking and validation pipeline. But in our experience, LLMs serve as better editors than authors. Our workflow pools data about a word to ask rich questions to an LLM—like “which of the following words are strongly related?”—instead of asking the LLM to provide all the answers. More on LLM limitations →
False cognates are words that look related but aren’t—they share spelling patterns by coincidence, not common origin. String similarity filters can’t detect them, and LLMs sometimes hallucinate connections between them:
We used LLM-based auditing to identify false cognates across all headwords.
Since lowercase words are capitalized at the beginning of sentences, LLMs often treat text case-insensitively, conflating “polish” (verb) with “Polish” (nationality). For semantic graphs, this contaminates association lists—“Poland” shouldn’t appear in the associations for shoe polish.
| Word | Lowercase | Capitalized |
|---|---|---|
| turkey / Turkey | ✓ the bird | ✓ the country |
| polish / Polish | ✓ to shine | ✓ nationality |
| boston / Boston | ✗ not a word | ✓ the city |
| swat / Swat / SWAT | ✓ to hit | ✓ Pakistan region / ✓ police unit |
We evaluated capitalization variants for ambiguous terms, testing whether each form works in natural sentences. Results:
Traditional thesauruses treat function words as “stopwords”—ignored entirely. Linguabase does the opposite: we put extra effort into building associations for hundreds of the most common words that other sources systematically skip.
These words are the glue of language. When your game needs to understand how “and” connects to other concepts, we have answers that a thesaurus typically won’t provide.
One source of richness in Linguabase is gestalt relations—experiential and sensory associations that taxonomic approaches miss entirely. These are folded into the core associations, enriching what you get for each word.
| Type | Example |
|---|---|
| Visual | elephant → gray, wrinkled |
| Sensory | crisis → siren, sweat, rubble |
| Cultural | wedding → white, rice, tears |
| Emotional | home → warmth, safety, belonging |
These are NOT synonyms. They’re how humans actually experience concepts. A thesaurus typically won’t tell you that “crisis” evokes “siren”—but humans know this instantly.
Morphological and etymological groupings that connect related word forms:
Word families serve two different purposes, depending on your game design needs:
| Category | Purpose | Example | Game behavior |
|---|---|---|---|
| Variants | Substitutable forms | run → runs, running, ran | Silent swap for connectivity |
| Semantic | Different meanings | run → runway, runaway, outrun | Real navigation move |
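The two family categories in the table call for different code paths: variants get silently swapped to a base form, while semantic relatives are presented as real moves. The toy family data below is illustrative:

```python
# Toy family data mirroring the table above -- not shipped data.
VARIANTS = {"runs": "run", "running": "run", "ran": "run"}
SEMANTIC = {"run": ["runway", "runaway", "outrun"]}

def canonical(word):
    """Silent swap: map an inflected form to its base for connectivity."""
    return VARIANTS.get(word, word)

def navigation_moves(word):
    """Real moves: semantically distinct family members a player can travel to."""
    return SEMANTIC.get(canonical(word), [])
```

So a player entering “running” is treated as having reached “run”, while “runway” remains a deliberate step in its own right.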
Why this matters:
But not everything that looks splittable should be split. “Mushroom” contains “room” and “mush,” but it’s not a compound of those words—it comes from French mousseron. String patterns alone can’t distinguish real morphological relationships from coincidental letter sequences. This is why the etymological audit matters.
1.46M illustrative quotations written by humans with intent. For almost all common words, quotations come from famous literature—Pulitzer Prize winners and other prestigious sources. Uncommon and technical words are sourced from Wikipedia or open-access science journals, still showing the word used in real context by real writers.
Linguabase is built on a decade of work that began before LLMs existed. The foundation is human curation and pre-LLM computational linguistics; modern LLMs handle validation at scale.
Our pipeline follows a consistent pattern for each data layer:
| Phase | Goal | Methods |
|---|---|---|
| Expand | Gather candidates from every plausible source | 70+ reference sources, computational linguistics, human curation, Library of Congress |
| Audit | Evaluate and score each candidate | LLM validation, false cognate detection, sense separation, consistency checks |
| Contract | Retain only production-quality results | Threshold by score, remove duplicates, apply content filters, rank by strength |
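The Expand/Audit/Contract pattern can be sketched in a few lines. Here `score` stands in for the LLM validation and consistency checks; the function names and the 0.7 threshold are assumptions for illustration:

```python
def run_layer(sources, score, threshold=0.7):
    """Expand -> Audit -> Contract, as described in the table above.

    sources: iterables of candidate items; score: callable returning
    a quality score in [0, 1] (a stand-in for LLM validation).
    """
    # Expand: gather candidates from every plausible source
    candidates = set()
    for source in sources:
        candidates.update(source)
    # Audit: evaluate and score each candidate
    scored = {c: score(c) for c in candidates}
    # Contract: keep production-quality results, ranked by strength
    kept = [c for c, s in scored.items() if s >= threshold]
    return sorted(kept, key=lambda c: scored[c], reverse=True)
```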
The result: we start with millions of candidate relationships and ship ~40M high-quality connections.
| Period | Focus |
|---|---|
| 2011–2012 | Initial game development, early word lists and association data |
| 2013–2014 | NSF XSEDE grant: 2.3M supercomputer hours for LDA topic modeling and Word2Vec |
| 2015–2022 | Database expansion, 70+ reference source integration, professional lexicography |
| 2023–present | LLM-assisted validation at scale, false cognate auditing, production refinement |
Orin Hargraves (professional lexicographer, contributor to major dictionaries) wrote 2,000+ custom definitions and 4,400+ sense-grouped word associations. His work focused on the words that need the most human judgment: interjections, prepositions, the highest-frequency words, and words with so many dictionary senses that their entries are virtually unreadable.
Linguistics grad students and post-docs created 5,000+ thematic word lists over several years:
We use LLMs for validation, not generation. The difference matters—see why LLMs can’t generate this data reliably. An LLM can confirm that “key → reef” is a valid association (it’ll say yes), but it won’t reliably generate that association on its own. Our pipeline proposes candidates from the sources above, then uses LLM scoring to rank, filter, and audit them.
Linguabase has powered “In Other Words” since 2011. Countless edge cases have emerged through real gameplay that no automated process would surface:
Beyond automated processing, our manual override layer includes 50,000+ hand-curated entries (definitions, sense-grouped associations, thematic lists) plus hundreds of thousands of corrections for plurals, word families, and edge cases.
This is the work that separates production-quality data from raw output—and it’s ongoing. Each pipeline run incorporates feedback from the previous one.
Internally, we maintain 1.5 million words—including all of Wiktionary plus the top 200,000 words from Wikipedia. But we don’t ship all of these because noise gets in. We don’t include “Declaration of Independence” even though it’s in our word list; we don’t include obscure proteins that no player would recognize. The 400K threshold captures every word players actually want, without the noise. The hard part isn’t building a huge list—it’s ranking and curating it.
Because the data is structured as a weighted graph, you can build game mechanics based on associations and linkages. The weighted connections enable real-time content customization—filtering by relationship strength, guiding players toward goals, or generating puzzles with tunable difficulty. (Or skip the engineering—we can generate puzzle data for your game mechanics directly.)
Find how any two words connect through the graph:
We generate pathfinding puzzles like this at scale—with validated paths, precalculated hints, and difficulty ratings.
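For flavor, a shortest association chain can be found with a plain breadth-first search over the graph. The toy graph below is illustrative; production pathfinding would also weigh association strength:

```python
from collections import deque

def find_path(graph, start, goal):
    """Breadth-first search: returns the shortest chain of associations,
    or None if the words don't connect."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

Because BFS explores hop by hop, the first path it returns is guaranteed to be the shortest in hop count.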
If you’re building a game that involves non-adjacent word associations (like a chain-association game), the brute-force search space expands exponentially with chain length, making real-time graph analysis infeasible on modern hardware. But using our data, you can pre-calculate convergence probabilities to targets.
Given a target word, which intermediate words get you closer? Consider “ellipse”—its associations organize into layers by distance:
Paths using Layer 1 terms are easier; paths requiring Layer 2–3 terms are progressively harder. This enables hint systems and adaptive difficulty. See Puzzle Licensing for how we package this into game-ready data.
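Those layers fall out of a single breadth-first pass from the target word. This is a minimal sketch with a toy graph; edges follow the association direction stored in the data:

```python
from collections import deque

def distance_layers(graph, target, max_depth=3):
    """BFS from the target: layer 1 = direct associations,
    layer 2 = two hops out, and so on up to max_depth."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        word = queue.popleft()
        if dist[word] == max_depth:
            continue
        for neighbor in graph.get(word, []):
            if neighbor not in dist:
                dist[neighbor] = dist[word] + 1
                queue.append(neighbor)
    layers = {}
    for word, d in dist.items():
        if d > 0:
            layers.setdefault(d, set()).add(word)
    return layers
```

Precomputing these layers per target is what makes layered hints and adaptive difficulty cheap at play time.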
Analysis of the Linguabase graph shows that 76% of English word pairs connect in 7 hops or fewer. Average path length is 6.43 steps.
The hop-distance distribution peaks at 5–6 hops:
| Hops | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8+ |
|---|---|---|---|---|---|---|---|---|
| % of pairs | 0.01% | 0.15% | 2.1% | 10% | 21.6% | 24.2% | 18.3% | 23.6% |
Of the 1.5M headwords, about 870K (57%) are reachable through the top-40 associations of other words. The remaining 43% are rare or isolated terms that don’t appear in other words’ association lists.
This is important because it’s a foundational property of the English language that you can exploit to create your own games. The semantic space is more connected than people realize.
Linguabase is actively maintained, not a static dataset:
License terms include current data with update arrangements available.