Linguabase vs. LLMs

Three problems LLMs can’t solve for word games.

Enumeration: can't extract comprehensive lists
Rankings: no absolute difficulty scale
Sense bias: clusters on dominant meanings
Clichés: the same banal associations repeat
Consistency: different output each run
Fabrication: can't verify against sources

LLMs are architected to give a good answer to a question, not to provide comprehensive data. Word games need three things LLMs can't deliver: complete inventories (all valid words, not just some), balanced associations (all meanings, not just the dominant ones), and reliable output (verifiable and consistent). These aren't prompting problems; they're architectural limits. (Example: ask an LLM to list every two-word phrase that names a thing. It can't. We solved this.)


1. The Enumeration Problem

Each answer from an LLM is just a sliver of its training data. You can't extract comprehensive inventories without exhaustive queries, and even then you can't know what you missed.
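The enumeration gap is easy to demonstrate with a toy simulation. The distribution below is invented for illustration, not measured from any model: each "answer" is a batch sampled from a Zipf-skewed vocabulary, and coverage plateaus well short of the full inventory, with nothing telling you what's still missing.

```python
import random


def sample_associations(rng, vocab_size=500, batch=25):
    """Draw one simulated 'LLM answer': a batch of items sampled from a
    Zipf-like distribution, so dominant items recur and rare ones lag."""
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    return rng.choices(range(vocab_size), weights=weights, k=batch)


rng = random.Random(0)
seen = set()
for _ in range(200):  # 200 queries, 5,000 samples in total
    seen.update(sample_associations(rng))

# Coverage of the 500-item inventory is still incomplete, and the
# sampler gives no signal about which items were never returned.
print(f"unique items seen: {len(seen)} / 500")
```

The exact count varies with the seed, but the tail of the distribution reliably stays unsampled.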

2. The Bias Problem

LLMs sample from training distributions where some meanings appear far more often than others. Ask for associations and you get what’s statistically dominant—not balanced coverage across all senses.

Sense Imbalance

elephant → 9 senses
LLM returns: trunk, gray, large
Misses: Republican, Ganesh, mahout

Cliché Blindness

key
Cliché: unlock, lock, door
Distinctive: reef, cay, clef, fob

Experiential Gaps

crisis
LLM: emergency, disaster
Misses: siren, stretcher, smoke

Capitalization Conflation

china ≠ China
march ≠ March
turkey ≠ Turkey

You can ask for “diverse” or “unusual” associations—the LLM will try—but it’s still sampling from the same skewed distribution. The minority senses aren’t in the high-probability zone.
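Nucleus (top-p) sampling, the truncation most LLM APIs apply, makes this concrete. The sense probabilities below are invented for illustration; the point is that anything below the cutoff can never be sampled, however the prompt is phrased.

```python
def nucleus(dist, top_p=0.9):
    """Keep the smallest set of candidates whose cumulative probability
    reaches top_p. Everything below the cutoff is unreachable."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for word, p in ranked:
        kept.append(word)
        cum += p
        if cum >= top_p - 1e-9:  # tolerance for float rounding
            break
    return kept


# Hypothetical association probabilities for "elephant" (illustrative only).
elephant = {
    "trunk": 0.30, "gray": 0.22, "large": 0.18, "Africa": 0.12,
    "ivory": 0.08, "Republican": 0.04, "Ganesh": 0.03, "mahout": 0.03,
}

print(nucleus(elephant))  # the minority senses never make the cut
```

Raising temperature reshuffles the kept set; it doesn't reach past the cutoff.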

Evidence: LLM vs. Linguabase

We asked Claude for word associations and compared the results. The pattern is consistent: LLMs give you the statistically dominant meanings and miss the rest.

key
  Linguabase only: reef, atoll, cay, clef, transpose, tumbler, cryptography, fob, keystone
  Both: unlock, cipher, door, password, piano, code, solution
  LLM only: lock, keyboard, essential, metal, chain, scale, crucial

bridge
  Linguabase only: bidding, trump, slam, contract, luthier, nose, Wheatstone, cantilever, viaduct
  Both: crossing, span, arch, suspension, river, dental, overpass
  LLM only: connect, gap, cable, pier, structure, link

window
  Linguabase only: browser, tab, popup, opportunity, mullion, oculus, clerestory, fenestration
  Both: sill, pane, casement, screen, curtain, frame, view, glass
  LLM only: shade, ledge, transparent, light, ventilation, drapes

nephew
  Linguabase only: nepotism, avuncular, nibling, godson, godfather, doting, prodigal, namesake
  Both: niece, uncle, aunt, cousin, brother, sister, relative, kin
  LLM only: relation, generation, son, kinship, bond

penguin
  Linguabase only: Linux, Happy Feet, Morgan Freeman, rookery, porpoising, countershading, crèche
  Both: cold, flightless, krill, waddle, emperor, tuxedo, Antarctic
  LLM only: ice, swim, arctic, ocean, pebble

tornado
  Linguabase only: Joplin, Moore, Wizard of Oz, storm chaser, mesocyclone, mobile home, alley
  Both: twister, funnel, supercell, cyclone, vortex, storm, Dorothy
  LLM only: rotating, destruction, severe, Midwest, siren

giraffe
  Linguabase only: ossicones, blood pressure, Geoffrey, okapi, reticulated, Serengeti, ruminant
  Both: acacia, spots, neck, tall, tongue, savanna, Africa, calf
  LLM only: legs, pattern, horns, graceful

Comparison: Claude vs. Linguabase, January 2025.

Which do you prefer: the associations unique to Linguabase, or the ones only the LLM produced?

Over time, the sameness compounds—level 49 starts to feel eerily like levels 36 and 85.

What’s Missing from the LLM Responses?

Good word games retain players through that kind of richness: non-obvious facets, technical depth, cultural touchstones, etymological connections, sensory and experiential associations. Native English speakers notice when puzzle vocabulary is shallow—when every level draws from the same pool of animals, colors, and food. Vocabulary that rewards what players already know is what keeps them coming back.

The Superconnector Problem

Some words connect to everything. "Heritage" appears 110,000+ times across our sources; "unique" appears 168,000 times. If we didn't penalize these superconnectors, they'd crowd out more interesting associations.

Our production pipeline applies inverse-frequency weighting to demote these superconnectors:

Demotion need   Percentile    Example terms
High            top 0.1%      heritage, tradition, folklore
Medium-high     top 0.5%      harmony, ancestry, lineage
Medium          top 1%        innovation, solidarity
N/A             bottom 50%    (most words)

Demotions ensure superconnectors don’t dominate, making room for distinctive connections.
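As a sketch of the idea (not our production code), one common inverse-frequency scheme scales each association score down by the log of the word's link count. The counts for "heritage" and "unique" are the figures quoted above; the rest are invented for illustration.

```python
import math

# Hypothetical corpus link counts. The "heritage" and "unique" figures
# come from the text; the low-frequency words are illustrative.
link_counts = {
    "unique": 168_000, "heritage": 110_000, "tradition": 95_000,
    "reef": 420, "clef": 260, "fob": 180,
}


def demoted_score(word, base_score=1.0):
    """Inverse-frequency weighting: the more links a word attracts
    across sources, the more its association score is scaled down."""
    freq = link_counts.get(word, 1)
    return base_score / math.log2(freq + 2)


# Distinctive, low-frequency words now outrank the superconnectors.
for word in sorted(link_counts, key=demoted_score, reverse=True):
    print(f"{word:10s} {demoted_score(word):.3f}")
```

Any monotone-decreasing weight works here; the log keeps the penalty gentle enough that moderately common words aren't buried.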

3. The Reliability Problem

Even when LLM output is useful, it’s not production-ready. Three issues make it unsuitable for shipping directly:

Non-Deterministic Output

Run 1: “ice cream”
Run 2: “ice-cream”
Run 3: “icecream”

Compound Transparency

mushroom ≠ mush + room
pineapple = pine + apple
sunflower = sun + flower

Fabrication

“The ephemeral beauty of cherry blossoms...”
—Japanese proverb
[FABRICATED]

You can’t ship probabilistic output that might be different tomorrow, or quotes that might not exist.
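Surface variation, at least, can be contained by canonicalizing output before deduplication. A minimal sketch: collapse whitespace, hyphens, and case into a single comparison key.

```python
import re


def canonicalize(phrase: str) -> str:
    """Collapse surface variants (spaces, hyphens, case) to one key so
    'ice cream', 'ice-cream', and 'icecream' compare as equal."""
    return re.sub(r"[\s\-]+", "", phrase).lower()


variants = ["ice cream", "ice-cream", "icecream", "Ice Cream"]
keys = {canonicalize(v) for v in variants}
print(keys)  # all four variants collapse to a single key
```

This handles spelling drift between runs; it does nothing for the deeper nondeterminism or for fabricated content, which need verification against sources.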

Validators, Not Generators

You can use an LLM to check whether “key → reef” is a valid association. It’ll say yes. But it won’t generate that association reliably, because reef isn’t in the high-probability zone when the prompt is “what’s related to key?”

That's the core insight: at scale, for lexical data, LLMs are good validators but bad generators.

We use LLMs extensively for verification and filtering.
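In sketch form, validation is a filter over candidate pairs rather than a request to generate them. The `llm_yes_no` callable here is a hypothetical stand-in for a wrapper around a real LLM API; the stub below exists only to make the example runnable.

```python
def filter_with_validator(word, candidates, llm_yes_no):
    """Keep only the candidates the validator confirms. `llm_yes_no`
    is any callable answering: is `candidate` a valid association
    for `word`? In production this would wrap an LLM call."""
    return [c for c in candidates if llm_yes_no(word, c)]


def stub_validator(word, candidate):
    """Illustrative stand-in for an LLM yes/no judgment."""
    known_valid = {("key", "reef"), ("key", "clef"), ("key", "door")}
    return (word, candidate) in known_valid


print(filter_with_validator("key", ["reef", "clef", "door", "zebra"], stub_validator))
```

The candidates come from elsewhere (our sources); the LLM only has to recognize a valid pair, which it does far more reliably than it produces one.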

How We Built It

We spent a decade building the data the hard way.

Read the full methodology →

Licensing and samples →

Talk to us about your game.

linguabase@idea.org