Players expect every level to feel fresh and fair.
The beautiful thing about word-meaning games—what sets them apart from letter tiles or trivia—is the near-infinity of puzzle configurations, almost as vast as the space of ideas. Players love this genre because it makes them feel smart. Their challenge is lateral, not mechanical: finding links between concepts.
That same infinity is what makes puzzles hard to create.
The old-school model is one puzzle a day, 365 a year. For over a century, word puzzles have been a daily affair—crosswords in newspapers since 1913, now Connections at the Times and Pinpoint at LinkedIn.
Handcrafting doesn’t scale: at the New York Times, Wyna Liu needs at least 2.5 hours per Connections puzzle, and her constraint notebook keeps growing. Wordle is recycling answers, and Spelling Bee has been “dumpster diving” for new words.
The New York Times ships one small puzzle a day—a 4×4 board, a single solution, built by a brilliant constructor. Wyna Liu spends two and a half hours on each Connections board and says it never gets easier. This is slow and unscalable, but they don’t care. They monetize subscriptions and engagement, not the games directly. They only need 365 levels a year.
A mobile studio has different needs. Players expect a tap experience with larger boards, layered mechanics, and enough levels to retain them for months. That means 8 to 15 categories per board, thousands of levels, real monetization. If Wyna Liu needs two and a half hours for a satisfying 4×4, how long does a 10×5 take? A 15×5? And how much longer to validate that every row is clean?
A puzzlemaking team burns through the easier topics first. Puzzles need to be internally mutually exclusive—no overlapping answers across categories. Cross-checking every word against every other category is combinatorially expensive, and LLMs can’t sustain it: they collapse toward the same popular topics and lose consistency across levels. Your highest-LTV players notice the repetition first. They stop playing. That’s revenue left on the table.
The genre lives or dies on its content. Let’s look deeper at why scaling is so hard:
A word association puzzle isn’t a collection of independent topics. It’s a system where every word has tendrils reaching into every other row. “Pearls” belongs in “strung together” until “pearl-clutching” is three rows down. “Conduct” is fine in “handle oneself” until “chivalry” shows up in another row. “Cycles” sits comfortably in “strung together” until the player notices “washer and dryer.” Each word’s clarity depends on the entire rest of the grid.
For a puzzle with M rows and N columns, every word needs to be cross-referenced against M−1 other topics. That’s M × N × (M−1) checks.
Most cross-references are obviously fine. But the dangerous ones hide among the safe ones, and a human working row by row can’t hold the full grid in their head. They work locally and hope. Players find what they miss.
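To make the combinatorics concrete, here is a minimal sketch of the naive cross-check, using a hypothetical 3×4 grid and a toy overlap table standing in for a real semantic-fit test (none of this is Linguabase’s actual pipeline). Every word in every row is tested against every other row’s theme, giving M × N × (M−1) comparisons:

```python
# Hypothetical 3x4 grid: theme -> words. Real boards are larger.
grid = {
    "strung together": ["pearls", "beads", "lights", "cycles"],
    "handle oneself": ["conduct", "behave", "carry", "act"],
    "laundry": ["washer", "dryer", "detergent", "hamper"],
}

# Toy stand-in for a semantic-fit test: known dangerous overlaps,
# like "cycles" also reading as a washing-machine setting.
KNOWN_OVERLAPS = {("cycles", "laundry"), ("conduct", "strung together")}

def fits_theme(word: str, theme: str) -> bool:
    return (word, theme) in KNOWN_OVERLAPS

def cross_check(grid: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Flag every word that plausibly fits a theme other than its own."""
    conflicts = []
    for theme, words in grid.items():
        for word in words:
            for other in grid:
                if other != theme and fits_theme(word, other):
                    conflicts.append((word, other))
    return conflicts

M = len(grid)                       # rows (themes)
N = len(next(iter(grid.values())))  # columns (words per theme)
print(M * N * (M - 1))              # total checks: 3 * 4 * 2 = 24
print(cross_check(grid))
```

At 3×4 that is 24 checks; at 10×5 it is 450, and each check is a judgment call, not a string comparison. That is the load a human constructor is carrying implicitly.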
Humans and LLMs both need to be primed before they brainstorm well — pointed toward castle fortifications, tropical weather, downhill sports, robotics, sanitization. Without that direction, they circle the same familiar territory. And when LLMs generate the words within a topic, they collapse toward the most statistically common associations. Ask for “castle fortifications” and you get moat, drawbridge, rampart every time. They don’t find the words that make a category click — the ones that make a player pause, see the connection, and feel clever. LLMs are better at evaluating puzzles than generating them, and when left to generate on their own, they produce levels that feel flat.
Studios that push harder — forcing diversity through seed domains, or mapping every word in a vocabulary to its possible groups — get further, but still produce raw material, not finished puzzles. Sense separation, mutual exclusivity checking, cross-level deduplication, difficulty calibration, and content filtering are each a separate engineering problem on top of the raw data.
Linguabase has a structured network of over 100 million semantic relationships between over 2 million terms, built over fifteen years. Every term is ranked by familiarity — how recognizable it is to native and non-native English speakers. From that network, we’ve curated roughly 400,000 headwords that are viable for word games — familiar enough to recognize, interesting enough to play with. Half of those are multi-word expressions that standard dictionaries don’t index: ‘hot dog,’ ‘comfort food,’ ‘old wives’ tale,’ ‘hold it together.’ These words with spaces double the pool of ideas available for puzzles, and they’re the vocabulary that makes levels feel natural rather than clinical.
The deeper corpus matters even though players never see it. Ensuring mutual exclusivity—that no word plausibly fits two categories—requires knowing how all concepts interconnect, not just the game-facing 400K. Terms that will never appear in a puzzle still inform how we segment semantic space when building categories from the words that will.
When Linguabase builds a puzzle, it picks topics from across the full breadth of what English can talk about — not just what comes to mind first. As each topic is added, it clears the surrounding wordspace so the next topic can’t bleed into it. Within each topic, it picks members that capture something real about the category — words at the interesting edges, not just the obvious five. And it tracks what’s already been used across your entire game, so level 847 doesn’t echo level 212.
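The loop above can be sketched greedily: pick a topic, reserve its semantic neighborhood so later topics can’t bleed into it, and honor a game-wide record of words already used. The topic pool, neighborhoods, and member lists below are illustrative stand-ins, not Linguabase data or API:

```python
# Hypothetical topic pool and semantic neighborhoods.
TOPIC_MEMBERS = {
    "castle fortifications": ["moat", "portcullis", "keep", "battlement"],
    "tropical weather": ["monsoon", "humidity", "trade wind", "squall"],
    "downhill sports": ["slalom", "luge", "mogul", "bobsled"],
}
# Words too semantically close to a topic to reuse alongside it.
NEIGHBORHOOD = {
    "castle fortifications": {"rampart", "drawbridge"},
    "tropical weather": {"cyclone"},
    "downhill sports": {"ski"},
}

def build_level(topics_wanted: int, used_across_game: set[str]) -> dict[str, list[str]]:
    level: dict[str, list[str]] = {}
    blocked: set[str] = set(used_across_game)
    for topic, members in TOPIC_MEMBERS.items():
        if len(level) == topics_wanted:
            break
        # Reject topics whose members collide with reserved wordspace.
        fresh = [w for w in members if w not in blocked]
        if len(fresh) < len(members):
            continue
        level[topic] = fresh
        # Reserve the topic's words AND its semantic neighborhood.
        blocked |= set(fresh) | NEIGHBORHOOD.get(topic, set())
    return level

used = {"slalom"}  # appeared in an earlier level
level = build_level(3, used)
print(list(level))  # "downhill sports" is skipped: "slalom" was already used
```

The key design point is that exclusion is cumulative: each accepted topic shrinks the space available to the next one, which is exactly why the check can’t be done row by row after the fact.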
The data spans two types of semantic relationships, and both matter for gameplay. Categorical associations have an obvious shared trait — the category “soft” gives you gentle, fuzzy, velvet, and the player sees the connection quickly. Associative relationships require the player to decipher the common theme — seeing bunny, soft, fuzzy, and rodent together and realizing they all evoke the same thing. Each serves a different kind of thinking — one rewards sorting, the other rewards lateral leaps. Designers can mix both within a level to vary how it feels, or lean on one type to control difficulty.
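A designer mixing the two relationship types might encode a level spec like the following sketch, where the enum, the themes, and the mix ratio are all invented for illustration:

```python
from enum import Enum

class RelationType(Enum):
    CATEGORICAL = "categorical"  # shared trait is visible: soft -> gentle, fuzzy
    ASSOCIATIVE = "associative"  # theme must be deciphered: bunny, soft, rodent

# Hypothetical level spec mixing both types to vary how solving feels.
level_spec = [
    ("soft things", RelationType.CATEGORICAL),
    ("evokes a rabbit", RelationType.ASSOCIATIVE),
    ("string instruments", RelationType.CATEGORICAL),
]

associative_share = sum(
    1 for _, r in level_spec if r is RelationType.ASSOCIATIVE
) / len(level_spec)
print(round(associative_share, 2))  # 0.33: mostly categorical, a gentler level
```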
The result is puzzles with diverse topics, airtight categories, and interesting words — the three things that make players feel smart, and come back to solve more levels. That’s what separates a game people play for a week from a game people play for a year.
Linguabase distinguishes vocabulary difficulty from associative difficulty, so designers can control both independently. It ranks familiarity across the full vocabulary for both native and non-native English speakers — so difficulty tuning isn’t guesswork. It scores iconability — which words can be represented as a small visual icon (dog, snowflake) and which can’t (cohabitation). Games that mix text and imagery need to know the difference.
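As a rough sketch of how those scores become levers, a per-term record might look like this. The field names, scales, and values are hypothetical, chosen only to mirror the scores described above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TermProfile:
    """Hypothetical per-term record mirroring the scores described above."""
    term: str
    familiarity_native: float     # 0.0 (obscure) .. 1.0 (universal)
    familiarity_nonnative: float
    iconable: bool                # can it be drawn as a small visual icon?

TERMS = [
    TermProfile("dog", 1.0, 0.98, True),
    TermProfile("snowflake", 0.95, 0.85, True),
    TermProfile("cohabitation", 0.70, 0.40, False),
]

# Difficulty tuning: keep only terms a non-native audience will recognize.
easy_pool = [t.term for t in TERMS if t.familiarity_nonnative >= 0.8]
print(easy_pool)  # ['dog', 'snowflake']
```

Because vocabulary difficulty and associative difficulty are separate axes, a designer can build a board of easy words with a hard theme, or vice versa, instead of conflating the two.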
That’s what we built Linguabase to do: generate thousands of puzzles that don’t get worse. A useful workflow: order puzzles a category or two larger than you need, and let your team drop the weakest grouping and pick the strongest members. Human judgment on finished boards, not blank-page brainstorming.
Talk to us about your game.
linguabase@idea.org