Three problems LLMs can’t solve for word games.
LLMs are architected to give a good answer to a question—not to provide comprehensive data. Word games need three things LLMs can’t deliver: complete inventories (all valid words, not just some), balanced associations (all meanings, not just dominant ones), and reliable output (verifiable, consistent). These aren’t prompting problems—they’re architectural limits. (Example: ask an LLM to list all two-word phrases that name things. It can’t—we solved this.)
Each answer from an LLM is just a sliver of its training data. You can't extract comprehensive inventories without exhaustive queries, and even then, you can't know what you missed.
LLMs sample from training distributions where some meanings appear far more often than others. Ask for associations and you get what’s statistically dominant—not balanced coverage across all senses.
You can ask for “diverse” or “unusual” associations—the LLM will try—but it’s still sampling from the same skewed distribution. The minority senses aren’t in the high-probability zone.
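The skew is easy to demonstrate with a toy sampler. The sense counts below are invented for illustration; the point is that when dominant senses outnumber minority senses by orders of magnitude, repeated sampling almost never surfaces the rare ones:

```python
import random
from collections import Counter

# Hypothetical sense frequencies for "key": dominant senses dwarf
# minority ones, mimicking a skewed training-data distribution.
senses = {
    "lock": 500, "keyboard": 300, "essential": 150,
    "reef": 2, "clef": 2, "tumbler": 1,
}

random.seed(0)
words = list(senses)
weights = list(senses.values())

# Draw 1,000 associations the way sampling from the distribution would.
draws = random.choices(words, weights=weights, k=1000)
tally = Counter(draws)

minority = tally["reef"] + tally["clef"] + tally["tumbler"]
print(f"minority senses drawn: {minority} of 1000")
```

No amount of re-prompting changes the underlying weights; the minority senses stay in the long tail.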
We asked Claude for word associations and compared the results. The pattern is consistent: LLMs give you the statistically dominant meanings and miss the rest.
| Word | Linguabase only | Both | LLM only |
|---|---|---|---|
| key | reef, atoll, cay, clef, transpose, tumbler, cryptography, fob, keystone | unlock, cipher, door, password, piano, code, solution | lock, keyboard, essential, metal, chain, scale, crucial |
| bridge | bidding, trump, slam, contract, luthier, nose, Wheatstone, cantilever, viaduct | crossing, span, arch, suspension, river, dental, overpass | connect, gap, cable, pier, structure, link |
| window | browser, tab, popup, opportunity, mullion, oculus, clerestory, fenestration | sill, pane, casement, screen, curtain, frame, view, glass | shade, ledge, transparent, light, ventilation, drapes |
| nephew | nepotism, avuncular, nibling, godson, godfather, doting, prodigal, namesake | niece, uncle, aunt, cousin, brother, sister, relative, kin | relation, generation, son, kinship, bond |
| penguin | Linux, Happy Feet, Morgan Freeman, rookery, porpoising, countershading, crèche | cold, flightless, krill, waddle, emperor, tuxedo, Antarctic | ice, swim, arctic, ocean, pebble |
| tornado | Joplin, Moore, Wizard of Oz, storm chaser, mesocyclone, mobile home, alley | twister, funnel, supercell, cyclone, vortex, storm, Dorothy | rotating, destruction, severe, Midwest, siren |
| giraffe | ossicones, blood pressure, Geoffrey, okapi, reticulated, Serengeti, ruminant | acacia, spots, neck, tall, tongue, savanna, Africa, calf | legs, pattern, horns, graceful |
Comparison: Claude vs. Linguabase, January 2025.
Which do you prefer: the leftmost column (associations unique to Linguabase), or the rightmost (associations only the LLM produced)?
Over time, the sameness compounds—level 49 starts to feel eerily like levels 36 and 85.
Good word games retain players through that kind of richness: non-obvious facets, technical depth, cultural touchstones, etymological connections, sensory and experiential associations. Native English speakers notice when puzzle vocabulary is shallow—when every level draws from the same pool of animals, colors, and food. Vocabulary that rewards what players already know is what keeps them coming back.
Some words connect to everything. "Heritage" appears 110,000+ times across our sources; "unique" appears 168,000 times. If we didn't penalize these superconnectors, they'd crowd out more interesting associations.
Our production pipeline applies inverse-frequency weighting to demote these superconnectors:
| Demotion | Frequency percentile | Example terms |
|---|---|---|
| High | Top 0.1% | heritage, tradition, folklore |
| Medium-High | Top 0.5% | harmony, ancestry, lineage |
| Medium | Top 1% | innovation, solidarity |
| N/A | Bottom 50% | (most words) |
Demotions ensure superconnectors don’t dominate, making room for distinctive connections.
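A minimal sketch of inverse-frequency weighting, using an IDF-style formula. The counts for "unique" and "heritage" come from the figures above; the other terms and the exact formula are illustrative assumptions, not our production code:

```python
import math

# Corpus counts: superconnectors appear orders of magnitude more often
# than distinctive terms. "ossicones" and "mullion" counts are invented.
counts = {
    "unique": 168_000,
    "heritage": 110_000,
    "mullion": 85,
    "ossicones": 40,
}
total = sum(counts.values())

def demotion_weight(term: str) -> float:
    """IDF-style inverse-frequency weight: common terms score low,
    rare terms score high, so rare terms rank first."""
    return math.log(total / counts[term])

ranked = sorted(counts, key=demotion_weight, reverse=True)
print(ranked)  # distinctive terms outrank the superconnectors
```

Any monotonically decreasing function of frequency would serve; the key property is that a term's weight falls as its corpus count rises.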
Even when LLM output is useful, it's not production-ready. You can't ship probabilistic output that might be different tomorrow, or quotes that might not exist.
You can use an LLM to check whether “key → reef” is a valid association. It’ll say yes. But it won’t generate that association reliably, because reef isn’t in the high-probability zone when the prompt is “what’s related to key?”
That’s the core insight: LLMs are good validators, bad generators—at scale, for lexical data.
We use LLMs extensively for verification and filtering.
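A minimal sketch of the validate-not-generate pattern. Here `ask_llm` is a hypothetical stub standing in for a real LLM API call; the candidate pairs come from our own data, and the model is only asked to confirm or reject each one:

```python
def ask_llm(prompt: str) -> str:
    # Stub: a real pipeline would call an LLM API here. For illustration,
    # pretend the model recognizes "key -> reef" as a valid association.
    return "yes" if "reef" in prompt else "no"

def is_valid_association(word: str, assoc: str) -> bool:
    """Ask the LLM a yes/no question about a candidate pair we supply."""
    prompt = f'Is "{assoc}" a valid association for "{word}"? Answer yes or no.'
    return ask_llm(prompt).strip().lower().startswith("yes")

# Candidates come from our data, not from the model.
candidates = [("key", "reef"), ("key", "sandwich")]
kept = [pair for pair in candidates if is_valid_association(*pair)]
print(kept)  # the LLM filters; it does not generate
```

The design point: validation is a high-probability task (the answer is in context), while generation requires the rare sense to win a sampling lottery it almost never wins.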
We spent a decade building the data the hard way: