Three problems LLMs can’t solve for word games.
LLMs are architected to give a good answer to a question—not to provide comprehensive data. Word games need three things LLMs can’t deliver: complete inventories (all valid words, not just some), balanced associations (all meanings, not just dominant ones), and reliable output (verifiable, consistent). These aren’t prompting problems—they’re architectural limits. (Example: ask an LLM to list all two-word phrases that name things. It can’t—we solved this.)
Each answer from an LLM is just a sliver of its training data. You can't extract comprehensive inventories without exhaustive queries, and even then, you can't know what you missed.
LLMs sample from training distributions where some meanings appear far more often than others. Ask for associations and you get what’s statistically dominant—not balanced coverage across all senses.
You can ask for “diverse” or “unusual” associations—the LLM will try—but it’s still sampling from the same skewed distribution. The minority senses aren’t in the high-probability zone.
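The skew is easy to demonstrate with a toy sampler. The sense counts below are invented for illustration; the point is that when dominant senses outnumber minority senses by orders of magnitude, repeated sampling almost never surfaces the rare ones:

```python
import random
from collections import Counter

# Hypothetical sense frequencies for "key": dominant senses dwarf
# minority ones, mimicking a skewed training-data distribution.
senses = {
    "lock": 500, "keyboard": 300, "essential": 150,
    "reef": 2, "clef": 2, "tumbler": 1,
}

random.seed(0)
words = list(senses)
weights = list(senses.values())

# Draw 1,000 associations the way sampling from the distribution would.
draws = random.choices(words, weights=weights, k=1000)
tally = Counter(draws)

minority = tally["reef"] + tally["clef"] + tally["tumbler"]
print(f"minority senses drawn: {minority} of 1000")
```

No amount of re-prompting changes the underlying weights; the minority senses stay in the long tail.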
We asked Claude for word associations and compared the results. The pattern is consistent: LLMs give you the statistically dominant meanings and miss the rest.
| Word | Linguabase only | Both | LLM only |
|---|---|---|---|
| key | reef, atoll, cay, clef, transpose, tumbler, cryptography, fob, keystone | unlock, cipher, door, password, piano, code, solution | lock, keyboard, essential, metal, chain, scale, crucial |
| bridge | bidding, trump, slam, contract, luthier, nose, Wheatstone, cantilever, viaduct | crossing, span, arch, suspension, river, dental, overpass | connect, gap, cable, pier, structure, link |
| window | browser, tab, popup, opportunity, mullion, oculus, clerestory, fenestration | sill, pane, casement, screen, curtain, frame, view, glass | shade, ledge, transparent, light, ventilation, drapes |
| nephew | nepotism, avuncular, nibling, godson, godfather, doting, prodigal, namesake | niece, uncle, aunt, cousin, brother, sister, relative, kin | relation, generation, son, kinship, bond |
| penguin | Linux, Happy Feet, Morgan Freeman, rookery, porpoising, countershading, crèche | cold, flightless, krill, waddle, emperor, tuxedo, Antarctic | ice, swim, arctic, ocean, pebble |
| tornado | Joplin, Moore, Wizard of Oz, storm chaser, mesocyclone, mobile home, alley | twister, funnel, supercell, cyclone, vortex, storm, Dorothy | rotating, destruction, severe, Midwest, siren |
| giraffe | ossicones, blood pressure, Geoffrey, okapi, reticulated, Serengeti, ruminant | acacia, spots, neck, tall, tongue, savanna, Africa, calf | legs, pattern, horns, graceful |
Comparison: Claude vs. Linguabase, January 2025.
Which do you prefer: the leftmost column (associations unique to Linguabase), or the rightmost (associations only the LLM produced)?
Over time, the sameness compounds—level 49 starts to feel eerily like levels 36 and 85.
Good word games retain players through that kind of richness: non-obvious facets, technical depth, cultural touchstones, etymological connections, sensory and experiential associations. Native English speakers notice when puzzle vocabulary is shallow—when every level draws from the same pool of animals, colors, and food. Vocabulary that rewards what players already know is what keeps them coming back.
Some words connect to everything. "Heritage" appears 110,000+ times across our sources; "unique" appears 168,000 times. If we didn't penalize these superconnectors, they'd crowd out more interesting associations.
Our production pipeline applies inverse-frequency weighting to demote these superconnectors:
| Demotion | Frequency percentile | Example terms |
|---|---|---|
| High | Top 0.1% | heritage, tradition, folklore |
| Medium-High | Top 0.5% | harmony, ancestry, lineage |
| Medium | Top 1% | innovation, solidarity |
| N/A | Bottom 50% | (most words) |
Demotions ensure superconnectors don’t dominate, making room for distinctive connections.
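A minimal sketch of inverse-frequency weighting, using an IDF-style formula. The counts for "unique" and "heritage" come from the figures above; the other terms and the exact formula are illustrative assumptions, not our production code:

```python
import math

# Corpus counts: superconnectors appear orders of magnitude more often
# than distinctive terms. "ossicones" and "mullion" counts are invented.
counts = {
    "unique": 168_000,
    "heritage": 110_000,
    "mullion": 85,
    "ossicones": 40,
}
total = sum(counts.values())

def demotion_weight(term: str) -> float:
    """IDF-style inverse-frequency weight: common terms score low,
    rare terms score high, so rare terms rank first."""
    return math.log(total / counts[term])

ranked = sorted(counts, key=demotion_weight, reverse=True)
print(ranked)  # distinctive terms outrank the superconnectors
```

Any monotonically decreasing function of frequency would serve; the key property is that a term's weight falls as its corpus count rises.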
Even when LLM output is useful, it's not production-ready. You can't ship probabilistic output that might be different tomorrow, or quotes that might not exist.
You can use an LLM to check whether “key → reef” is a valid association. It’ll say yes. But it won’t generate that association reliably, because reef isn’t in the high-probability zone when the prompt is “what’s related to key?”
That’s the core insight: LLMs are good validators, bad generators—at scale, for lexical data.
We use LLMs extensively for verification and filtering.
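A minimal sketch of the validate-not-generate pattern. Here `ask_llm` is a hypothetical stub standing in for a real LLM API call; the candidate pairs come from our own data, and the model is only asked to confirm or reject each one:

```python
def ask_llm(prompt: str) -> str:
    # Stub: a real pipeline would call an LLM API here. For illustration,
    # pretend the model recognizes "key -> reef" as a valid association.
    return "yes" if "reef" in prompt else "no"

def is_valid_association(word: str, assoc: str) -> bool:
    """Ask the LLM a yes/no question about a candidate pair we supply."""
    prompt = f'Is "{assoc}" a valid association for "{word}"? Answer yes or no.'
    return ask_llm(prompt).strip().lower().startswith("yes")

# Candidates come from our data, not from the model.
candidates = [("key", "reef"), ("key", "sandwich")]
kept = [pair for pair in candidates if is_valid_association(*pair)]
print(kept)  # the LLM filters; it does not generate
```

The design point: validation is a high-probability task (the answer is in context), while generation requires the rare sense to win a sampling lottery it almost never wins.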
We spent a decade building the data the hard way: