How It Works

A puzzle generation system built on a decade of curation, 70+ reference sources, and 130 million AI inferences.

The System

400K words
400K definitions & clues
100M+ connections

Linguabase automatically generates word association puzzles where every grouping holds together under player scrutiny. Topics within each level are strictly mutually exclusive—no word plausibly belongs to two categories in the same puzzle. Topics across levels don’t repeat. Every word is drawn from a vocabulary calibrated for both native and non-native English speakers.

fuzzy: velvet, wool, fleece
circus: acrobat, trapeze, tent
rodent: hamster → rat, burrow, gnaw
round: globe, wheel, ball → ring → moon

"hamster" → rat (hamster overlaps fuzzy)
"ball" → ring → moon (ball and ring overlap circus)

As each puzzle is assembled, the system clears the wordspace around each topic so nothing overlaps. It tracks which topics and associations have been used across the full level set. And it reaches into corners of English that keep level 2,000 from feeling like a remix of level 20.
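The overlap check described above can be sketched as a set-intersection test over association lists. The data and function names below are illustrative toy examples, not the production schema:

```python
# Illustrative sketch: reject a candidate word from a topic if it also
# associates with any other topic in the same puzzle. The association
# map is toy data, not the Linguabase schema.
ASSOCIATIONS = {
    "hamster": {"rodent", "fuzzy", "pet"},
    "rat": {"rodent"},
    "ball": {"round", "circus", "toy"},
    "globe": {"round"},
}

def is_exclusive(word, topic, other_topics):
    """A word is safe only if it links to no other topic in the puzzle."""
    links = ASSOCIATIONS.get(word, set())
    return topic in links and links.isdisjoint(other_topics)

# "hamster" is rejected from "rodent" because it also reads as "fuzzy";
# "rat" passes because it links to nothing else in the puzzle.
print(is_exclusive("hamster", "rodent", {"fuzzy", "circus", "round"}))  # False
print(is_exclusive("rat", "rodent", {"fuzzy", "circus", "round"}))      # True
```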

Puzzles are generated bespoke for your game’s mechanics. Order them a category or two larger than you need, and your team picks the strongest groupings — curating from abundance rather than building from scratch.

Already building levels in-house or with LLMs? Linguabase also audits existing puzzles—flagging words that may be unfamiliar to your audience, catching places where a word could fit more than one group, and suggesting expansions. Audit your most-complained-about levels first and see whether the data catches what your players are flagging.
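An audit pass of this kind could be sketched as two checks per word: a familiarity floor and a cross-group overlap test. Familiarity scores, thresholds, and association data here are invented for illustration:

```python
# Toy audit: flag unfamiliar words and cross-group overlaps in an
# existing puzzle. All scores and association data are illustrative.
FAMILIARITY = {"velvet": 0.9, "fleece": 0.8, "proboscidean": 0.1, "wool": 0.95}
ASSOCIATIONS = {"fleece": {"soft", "sheep"}, "wool": {"soft", "sheep"}}

def audit(groups, min_familiarity=0.3):
    flags = []
    for topic, words in groups.items():
        for w in words:
            # Check 1: is the word familiar enough for the audience?
            if FAMILIARITY.get(w, 0.0) < min_familiarity:
                flags.append((w, f"may be unfamiliar in '{topic}'"))
            # Check 2: could the word also fit another group in this puzzle?
            overlap = ASSOCIATIONS.get(w, set()) & (set(groups) - {topic})
            if overlap:
                flags.append((w, f"also fits {sorted(overlap)}"))
    return flags

puzzle = {"soft": ["velvet", "fleece"], "sheep": ["wool", "proboscidean"]}
for word, reason in audit(puzzle):
    print(word, "->", reason)
```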


The Data Foundation

[Chart: Words by letter count]
[Chart: Single words vs. words with spaces, by familiarity]

The puzzle generation runs on a structured data foundation with three parts: words, links, and meanings. From a raw pool of over 2 million terms—including technical words, proper nouns, and noise—we’ve surfaced a curated set of 400,000 that work well for games. The vocabulary selection and familiarity ranking are purpose-built for word game development.

Roughly half are single words. The other half are words with spaces—200,000 multi-word expressions like “night sky,” “comfort food,” “hold it together,” and “old wives’ tale.” These aren’t just word combinations—they name concepts, and they carry more weight than their parts. Traditional dictionaries cover about 3% of them. Including them doubles the pool of ideas available for puzzles, and they’re the vocabulary that makes levels feel natural rather than clinical. Read more about words with spaces →

Words

400K terms ranked by familiarity—from everyday vocabulary to crossword-worthy rarities. Includes 200K multi-word expressions. Filterable by letter count or difficulty for your game and audience.

Content filters at two severity levels: a hard-block list of offensive words, and a soft-block list of words carrying unwanted innuendo.
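Filtering by letter count, familiarity, and the two block lists might look like the following. Field names, scores, and the block-list contents are invented for illustration:

```python
# Illustrative filter over a term list: letter count, familiarity floor,
# and two-tier content blocking. All data and field names are invented.
TERMS = [
    {"term": "night sky", "familiarity": 0.9, "letters": 8},
    {"term": "proboscidean", "familiarity": 0.1, "letters": 12},
    {"term": "wheel", "familiarity": 0.95, "letters": 5},
]
HARD_BLOCK = set()          # offensive terms: always excluded
SOFT_BLOCK = {"night sky"}  # pretend innuendo (toy example), excluded when family_safe

def select(terms, max_letters, min_familiarity, family_safe=True):
    out = []
    for t in terms:
        if t["term"] in HARD_BLOCK:
            continue
        if family_safe and t["term"] in SOFT_BLOCK:
            continue
        if t["letters"] <= max_letters and t["familiarity"] >= min_familiarity:
            out.append(t["term"])
    return out

print(select(TERMS, max_letters=9, min_familiarity=0.5))  # ['wheel']
```

The two-tier design lets a family game enforce both lists while an adult-oriented game enforces only the hard block.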

Links

Over 100 million weighted relationships, including both categorical relations (soft → gentle, fuzzy, velvet) and looser associative ones (bunny → soft, fuzzy, rodent).

Every word broken down into senses and facets, with ~100 related words per entry on average. This is the layer that powers mutual exclusivity. Word families connect related forms (morphology): run → runs, running, runner, runway.
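One way to picture the sense split and morphology families is a per-word record keyed by sense. The schema below is a toy illustration, not the real data model:

```python
# Toy schema: one entry with sense-scoped related words and a
# morphological family. Structure is illustrative only.
ENTRY = {
    "word": "run",
    "senses": {
        "move fast": {"sprint", "jog", "dash"},
        "operate": {"manage", "execute", "launch"},
    },
    "family": {"runs", "running", "runner", "runway"},
}

def related(entry, sense):
    """Pull related words for one sense, so a category never mixes meanings."""
    return sorted(entry["senses"].get(sense, set()))

print(related(ENTRY, "operate"))  # ['execute', 'launch', 'manage']
```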

Meanings

400K definitions as ~55-word readable paragraphs covering all senses. Short clues for gameplay—1 to 5 words, multiple angles per term.

1.5 million usage examples from literature and journalism. Metadata that makes each word usable in a game, not just present in a list.
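A per-term record bundling definition, clues, and examples might look like this. The field names and sample content are illustrative, not the actual schema:

```python
from dataclasses import dataclass, field

# Illustrative record shape for the Meanings layer; not the real schema.
@dataclass
class Meaning:
    term: str
    definition: str                       # ~55-word readable paragraph
    clues: list                           # 1-5 word clues, multiple angles
    examples: list = field(default_factory=list)  # usage from literature/journalism

m = Meaning(
    term="comfort food",
    definition="Food prepared in a traditional style that evokes nostalgia...",
    clues=["cozy cooking", "mac and cheese, e.g."],
)
print(m.term, len(m.clues))  # comfort food 2
```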

All of this exists so that every level rewards the player’s vocabulary and lateral thinking—so that a native English speaker at level 500 still encounters categories that feel fair, interesting, and worth solving.

What the Associations Look Like

Approximately 100 related words per entry on average—a core set of top associations, plus sense-level pools organized by meaning:

Core Associations
elephant tusk, pachyderm, trunk, hippopotamus, mammoth, ivory, savanna, giraffe, Ganesh, lion, proboscidean, poaching, Dumbo, Elephas, Hannibal, matriarch, rhinoceros, mahout, herd, ears, circus, thick-skinned, hide, zebra, herbivore, peanuts, mammal, massive, Africa, megafauna, intelligent, Republican, stomp, trumpeting, conservation, gray, jungle, watering hole, memory, majestic

Notice what’s here that an LLM rarely surfaces: Republican, Ganesh, Dumbo, mahout, proboscidean. Nine distinct senses of “elephant”—anatomy, behavior, circus, heraldry, ivory, megafauna, size, symbolism, wildlife. LLMs typically cluster on 2–3.

Beyond core associations, the data provides pools organized by meaning—so your game can draw from specific facets:

Sense Pools
elephant [anatomy]
trunk, tusks, ears, pachyderm, proboscis, wrinkled, gray, prehensile, hide, feet, tail, molars, musth, thick-skinned
elephant [behavior]
matriarchal, herd, social, intelligence, mourning, memory, communication, infrasound, bonding, empathy, tool use, play, bathing, dusting, migration
elephant [symbolism]
memory, wisdom, good luck, Ganesh, strength, power, loyalty, patience, Republican, GOP, white elephant, elephant in the room, Horton
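Drawing from a single facet could be sketched as a pool lookup keyed by (word, sense). The pools below are abridged from the lists above; the sampling logic is illustrative:

```python
import random

# Facet-scoped draw: pick category words from one sense pool only, so a
# "symbolism" category never leaks anatomy words. Pools abridged from the
# example above; the lookup/sampling code is an illustrative sketch.
SENSE_POOLS = {
    ("elephant", "anatomy"): ["trunk", "tusks", "ears", "proboscis"],
    ("elephant", "symbolism"): ["memory", "Ganesh", "GOP", "white elephant"],
}

def draw(word, facet, k, seed=0):
    """Deterministically sample k words from one facet's pool."""
    return random.Random(seed).sample(SENSE_POOLS[(word, facet)], k)

picks = draw("elephant", "symbolism", 3)
print(picks)
```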

Quality Layers

291K false cognates removed: dig → digress, pan → pandemic
Capitalization intelligence: "polish" vs. "Polish" get separate lists
Superconnector demotion: common hubs penalized so rare links surface
Experiential associations (gestalt): crisis → siren, wedding → white
Function word coverage: "and," "while," "but" are not skipped
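Superconnector demotion can be pictured as a degree penalty on link weights: the more words a hub connects to, the more its links are discounted, so rare specific links outrank generic ones. The log formula and all numbers below are assumptions for illustration, not the production scoring:

```python
import math

# Toy link weights and hub degrees. A generic hub like "animal" links to
# enormous numbers of words; dividing by log(degree) lets the rarer,
# more specific "mahout" link outrank it. Formula and data illustrative.
RAW_WEIGHT = {("elephant", "animal"): 0.9, ("elephant", "mahout"): 0.5}
DEGREE = {"animal": 50000, "mahout": 40}

def demoted(pair):
    _, other = pair
    return RAW_WEIGHT[pair] / math.log(DEGREE[other] + math.e)

print(demoted(("elephant", "mahout")) > demoted(("elephant", "animal")))  # True
```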

What You Can Build

Difficulty tuning: filter by familiarity, association strength, sense complexity
Pathfinding: sugar → sweet → pleasant → calm → peace
Convergence: pre-calculated routes for hint systems
Category generation: mutually exclusive groups, no overlaps
Sense-aware queries: which "spring" connects to "coil"?
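Pathfinding of the sugar → peace sort is a shortest-path search over the association graph. A minimal breadth-first sketch, with an invented toy graph standing in for the real link data:

```python
from collections import deque

# Toy association graph; edges invented to mirror the example chain.
GRAPH = {
    "sugar": ["sweet", "candy"],
    "sweet": ["pleasant", "sugar"],
    "pleasant": ["calm", "sweet"],
    "calm": ["peace", "pleasant"],
    "candy": ["sweet"],
    "peace": ["calm"],
}

def find_path(start, goal):
    """Breadth-first search: the shortest chain of associations, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in GRAPH.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path("sugar", "peace"))  # ['sugar', 'sweet', 'pleasant', 'calm', 'peace']
```

In a hint system, such routes can be pre-computed per level (the "Convergence" item above) rather than searched at play time.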

Vocabulary Scale

This is larger than the Scrabble dictionary, Merriam-Webster Collegiate, or Collins, because it includes familiar multi-word expressions other dictionaries don’t. Roughly the size of Webster’s Third Unabridged, excluding the obscure and technical words that aren’t fun.

NASPA (Scrabble US): 176K
Merriam-Webster Collegiate: 225K
Collins Scrabble Words: 267K
Linguabase deployed: ~400K
Webster's Third Unabridged: 476K
Linguabase backend: 2M

Built Over a Decade

Linguabase began as part of a nonprofit program exploring a visual thesaurus, and was developed through years of curation and list-building, drawing on 70+ reference sources—from the NASA Thesaurus and Library of Congress subject headings to WordNet—along with professional lexicography by contributors to major dictionaries.

It incorporated AI early. A National Science Foundation XSEDE grant provided 2.3 million supercomputer hours for topic modeling and word embeddings, years before LLMs existed. An NSF SBIR grant and a Microsoft cloud services scholarship funded further expansion.

The system now pairs deep human curation with large-scale LLM validation—over 130 million inferences used to score, rank, and audit relationships, not generate them. The result combines depth that automation can’t reach with scale that humans can’t sustain.

Who Built It?

A professional lexicographer, a language data architect, 29 vocabulary contributors, and the person you’ll talk to when you reach out. About Linguabase →

Pricing →

Talk to us about your game.

linguabase@idea.org