Three problems LLMs can’t solve for word games.
LLMs are architected to give a good answer to a question—not to provide comprehensive data. Word games need three things LLMs can’t deliver: complete inventories (all valid words, not just some), balanced associations (all meanings, not just dominant ones), and reliable output (verifiable, consistent). These aren’t prompting problems—they’re architectural limits. (Example: ask an LLM to list all two-word phrases that name things. It can’t—we solved this.)
Each answer from an LLM is just a sliver of its training. You can’t extract comprehensive inventories without exhaustive queries—and even then, you can’t know what you missed:
LLMs sample from training distributions where some meanings appear far more often than others. Ask for associations and you get what’s statistically dominant—not balanced coverage across all senses.
This isn’t a prompting problem. You can ask for “diverse” or “unusual” associations—the LLM will try—but it’s still sampling from the same skewed distribution. The minority senses aren’t in the high-probability zone.
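The skew is easy to see in a toy simulation. The sense frequencies below are assumed for illustration (not measured corpus data), but the mechanism is the same: every "ask" samples the same lopsided distribution, so minority senses almost never surface.

```python
import random
from collections import Counter

random.seed(0)  # deterministic for illustration

# Hypothetical sense frequencies for "key" -- assumed numbers, not
# measured corpus data. Dominant senses dwarf minority ones.
sense_weights = {
    "lock/unlock": 600,
    "keyboard": 250,
    "essential": 120,
    "music (clef)": 25,
    "cryptography": 20,
    "island (cay)": 5,
}

senses = list(sense_weights)
weights = list(sense_weights.values())

# 1,000 independent "asks": each draw samples the skewed distribution.
counts = Counter(random.choices(senses, weights=weights, k=1000))
for sense, n in counts.most_common():
    print(f"{sense:13s} {n:4d} / 1000")
```

Re-prompting for "diverse" answers is just drawing again from the same weights; the island sense stays rare no matter how many times you ask.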
We asked Claude for word associations and compared the results. The pattern is consistent: LLMs give you the statistically dominant meanings and miss the rest.
| Word | Linguabase only | Both | LLM only |
|---|---|---|---|
| key | reef, atoll, cay, clef, transpose, tumbler, cryptography, fob, keystone | unlock, cipher, door, password, piano, code, solution | lock, keyboard, essential, metal, chain, scale, crucial |
| bridge | bidding, trump, slam, contract, luthier, nose, Wheatstone, cantilever, viaduct | crossing, span, arch, suspension, river, dental, overpass | connect, gap, cable, pier, structure, link |
| window | browser, tab, popup, opportunity, mullion, oculus, clerestory, fenestration | sill, pane, casement, screen, curtain, frame, view, glass | shade, ledge, transparent, light, ventilation, drapes |
| nephew | nepotism, avuncular, nibling, godson, godfather, doting, prodigal, namesake | niece, uncle, aunt, cousin, brother, sister, relative, kin | relation, generation, son, kinship, bond |
| penguin | Linux, Happy Feet, Morgan Freeman, rookery, porpoising, countershading, crèche | cold, flightless, krill, waddle, emperor, tuxedo, Antarctic | ice, swim, arctic, ocean, pebble |
| tornado | Joplin, Moore, Wizard of Oz, storm chaser, mesocyclone, mobile home, alley | twister, funnel, supercell, cyclone, vortex, storm, Dorothy | rotating, destruction, severe, Midwest, siren |
| giraffe | ossicones, blood pressure, Geoffrey, okapi, reticulated, Serengeti, ruminant | acacia, spots, neck, tall, tongue, savanna, Africa, calf | legs, pattern, horns, graceful |
Comparison: Claude vs. Linguabase, January 2025.
Which do you prefer? The leftmost column (unique to Linguabase), or the rightmost (created by an LLM)?
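The three columns above are a straightforward set partition. A minimal sketch, using an abbreviated subset of the "key" row:

```python
def partition(ours: set[str], llm: set[str]) -> dict[str, set[str]]:
    """Split two association lists into the three table columns."""
    return {
        "linguabase_only": ours - llm,  # distinctive senses the LLM missed
        "both": ours & llm,             # shared, mostly dominant senses
        "llm_only": llm - ours,         # generic filler
    }

# Abbreviated data from the "key" row above.
linguabase = {"reef", "cay", "clef", "unlock", "cipher", "door", "piano"}
llm = {"unlock", "cipher", "door", "piano", "lock", "keyboard", "essential"}

cols = partition(linguabase, llm)
print(cols["linguabase_only"])
```

The interesting signal for game design lives almost entirely in the first bucket.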
As game designers and wordplay lovers, even the best LLM output feels flat. And over time, it starts to feel repetitious—an insidious sameness that makes level 49 feel eerily like levels 36 and 85.
Good word games enchant and retain players when they include that kind of richness: non-obvious facets, technical depth, cultural touchstones, etymological connections, sensory and experiential associations. And they feel better when there’s less generic filler.
Some words connect to everything. “Heritage” appears 110,000+ times across our sources. “Unique” appears 168,000 times. If we didn’t penalize these superconnectors, they’d crowd out more interesting associations.
Our production pipeline applies inverse-frequency weighting to demote these superconnectors:
| Demotion level | Frequency percentile | Example terms |
|---|---|---|
| High | Top 0.1% | heritage, tradition, folklore |
| Medium-High | Top 0.5% | harmony, ancestry, lineage |
| Medium | Top 1% | innovation, solidarity |
| N/A | Bottom 50% | (most words) |
Demotions ensure superconnectors don’t dominate, making room for distinctive connections.
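The exact weighting function isn’t specified here; an IDF-style inverse-log penalty is one common realization. A sketch, assuming a corpus-wide mention total:

```python
import math

TOTAL_MENTIONS = 10_000_000  # assumed corpus-wide count, for illustration

def demoted_score(base_score: float, term_frequency: int) -> float:
    """Scale an association score by an inverse-log-frequency factor,
    so superconnectors ("heritage", "unique") are pushed down."""
    penalty = math.log(TOTAL_MENTIONS / term_frequency)
    return base_score * penalty

# Figures from the text: "heritage" ~110k mentions, "unique" ~168k.
print(demoted_score(1.0, 168_000))  # most demoted
print(demoted_score(1.0, 110_000))  # heavily demoted
print(demoted_score(1.0, 500))      # rare term keeps a high weight
```

The log keeps the penalty gentle across the long tail while still flattening the top 0.1%.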
Even when LLM output is useful, it’s not production-ready. Three issues make it unsuitable for shipping directly: nondeterministic output, unverifiable claims, and unreliable coverage.
Word games ship to millions of players. You can’t ship probabilistic output that might be different tomorrow, or quotes that might not exist.
You can use an LLM to check whether “key → reef” is a valid association. It’ll say yes. But it won’t generate that association reliably, because reef isn’t in the high-probability zone when the prompt is “what’s related to key?”
That’s the core insight: LLMs are good validators, bad generators—at scale, for lexical data.
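The validator pattern looks like this in outline: candidates come from the curated graph (the generator), and the LLM only answers yes/no. `ask_llm` is a hypothetical stand-in for any chat-completion call, hard-coded here so the sketch runs offline.

```python
def ask_llm(prompt: str) -> str:
    # Placeholder for a real chat-completion call. Hard-coded responses
    # keep the sketch runnable; a real validator returns the model's text.
    known_good = {"reef", "clef", "tumbler"}
    return "yes" if any(f'"{c}"' in prompt for c in known_good) else "no"

def validate_association(word: str, candidate: str) -> bool:
    prompt = f'Is "{candidate}" a valid association for "{word}"? Answer yes or no.'
    return ask_llm(prompt).strip().lower().startswith("yes")

# Candidates come from the graph, not from the LLM.
candidates = ["reef", "clef", "tumbler", "banana"]
validated = [c for c in candidates if validate_association("key", c)]
print(validated)
```

Note the division of labor: the graph proposes, the model disposes. The LLM never has to surface "reef" from a low-probability zone; it only has to recognize it.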
We use LLMs extensively for verification and filtering.
We spent a decade building the data the hard way:
Built from 1.5 million words (all of Wiktionary plus top Wikipedia entries). Shipped as a curated 400K-word graph with ~40M connections—every plausible word a player would use, without noise. The hard part isn’t building a huge list; it’s ranking it.
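The shape of that ranking step can be sketched as keeping only the top-scoring edges per headword. The scores below are toy values, not the actual Linguabase method:

```python
from heapq import nlargest

def prune_graph(edges: dict[str, dict[str, float]], k: int) -> dict[str, list[str]]:
    """Keep only the k highest-scoring associations per headword."""
    return {
        word: [n for n, _ in nlargest(k, scores.items(), key=lambda kv: kv[1])]
        for word, scores in edges.items()
    }

# Toy scored graph (illustrative scores; "heritage" already demoted).
raw = {"key": {"reef": 0.9, "unlock": 0.8, "heritage": 0.1, "clef": 0.7}}
print(prune_graph(raw, 3))
```

Building `raw` is the easy 1.5M-word part; choosing the scores that decide what survives the cut is where the decade went.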