What a word game developer actually finds.
Every developer building a word game starts with the same question: where do I get a word list? The answer turns out to be a fragmented ecosystem where every resource solves one narrow problem and nothing connects. You pick a word list, layer on frequency data, manually vet for offensive content, handle inflections yourself, and if you want any semantic relationships between words, you’re mostly out of luck. Here’s what’s actually out there, in roughly the order you’ll encounter it.
Most developers start here because it’s already on their machine. The file at /usr/share/dict/words ships with most Unix-like systems (minimal installs and containers often omit it), but its contents vary widely from one system to the next.
The format is one word per line, sorted alphabetically, with no metadata whatsoever. A typical developer experience: hack together inflections by appending S to four-letter words, manually separate common words from obscure ones, and vet for slurs, all before you have anything usable.
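As a rough illustration of that first cleanup pass, here is a minimal sketch that reads a one-word-per-line file and keeps only plain lowercase entries. The function name and the length limits are assumptions chosen for the example, not part of any standard tool.

```python
def load_words(path, min_len=3, max_len=15):
    """Read a one-word-per-line file and keep plain lowercase ASCII words."""
    words = set()
    # errors="replace" keeps the loop going if the file is not valid UTF-8
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            word = line.strip()
            # Drops capitalised entries (mostly proper nouns), possessives,
            # hyphenated forms, and accented spellings
            if not (word.isascii() and word.isalpha() and word.islower()):
                continue
            if min_len <= len(word) <= max_len:
                words.add(word)
    return words
```

Calling `load_words("/usr/share/dict/words")` on a machine that has the file gives a starting set, but frequency sorting and offensive-word vetting still have to happen elsewhere.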
SCOWL (Spell Checker Oriented Word Lists) is the most sophisticated free word list by a wide margin. Created by Kevin Atkinson in 2000, it’s the upstream source for English spell checking in Firefox, LibreOffice, and Debian’s system dictionaries. SCOWL’s key design idea is size levels: the size 60 threshold (~123,000 words) is specifically curated as the largest level Atkinson is “fairly confident does not contain any misspellings or invalid words.” It draws from over a dozen sources, including Moby Words, Brian Kelk’s UK frequency list, Alan Beale’s 12Dicts package, the ENABLE Scrabble list, the UKACD, and YAWL.
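The size levels are cumulative in practice: a "size 60" vocabulary is the union of every list at level 60 or below. The sketch below assumes the layout of SCOWL's `final` directory, where files are named like `english-words.60` and the numeric suffix is the size level. The file naming, the ISO-8859-1 encoding, and the function name are assumptions to check against the release you download.

```python
import re
from pathlib import Path

# Assumed naming: "<category>.<size>", e.g. "english-words.60"
LEVEL_SUFFIX = re.compile(r"\.(\d+)$")


def words_up_to_level(final_dir, max_level=60, categories=("english-words",)):
    """Union every list in `categories` whose size level is <= max_level."""
    words = set()
    for path in sorted(Path(final_dir).iterdir()):
        match = LEVEL_SUFFIX.search(path.name)
        if not path.is_file() or match is None:
            continue  # not a sized word list (README, subdirectory, ...)
        if int(match.group(1)) > max_level:
            continue  # above the requested size threshold
        if path.name[: match.start()] not in categories:
            continue  # e.g. skip "british-words" unless asked for
        # Assumption: older SCOWL releases ship ISO-8859-1 text
        for line in path.read_text(encoding="iso-8859-1").splitlines():
            if line.strip():
                words.add(line.strip())
    return words
```

Passing `categories=("english-words", "american-words")` would pull in the American spellings as well, which is roughly how a regional dictionary is assembled from the same source.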
Bundled with SCOWL is the 12Dicts project, which includes sub-lists like 3of6game, explicitly labeled “lists for use in word games”—about 41,000–82,000 words depending on which sub-list. Public domain and cleaner than raw SCOWL, but still just words in a file.
ENABLE (the Enhanced North American Benchmark Lexicon) was created by Mendel Cooper and Alan Beale as a free, public-domain alternative to copyrighted Scrabble dictionaries. It contains roughly 173,000 words with no licensing restrictions. This is what Words With Friends was built on: Zynga took ENABLE, added contemporary slang, removed offensive terms, and had a game dictionary at zero cost.
The critical limitation: ENABLE has not been updated since 2000. No COVID, no EMOJI, no BITCOIN. Cooper also created YAWL (Yet Another Word List), a ~264,000-word public domain superset, but it too hasn’t been maintained since ~2008.
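The usual workaround for a frozen list is an overlay: keep the base list untouched and maintain your own additions and removals beside it. A minimal sketch, with a hypothetical function name and the assumption that removals should always win over additions:

```python
def apply_overlay(base_words, additions=(), removals=()):
    """Return a new set: the base list plus additions, minus removals.

    Removals are applied last, so a blocked word stays out even if it
    also appears in the additions.
    """
    def normalise(word):
        return word.strip().lower()

    result = {normalise(w) for w in base_words}
    result |= {normalise(w) for w in additions}
    result -= {normalise(w) for w in removals}
    result.discard("")  # blank lines in an input file normalise to ""
    return result
```

Keeping the overlay in separate files means the base list can be swapped for a newer or larger one later without redoing the curation.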
Grady Ward’s Moby Project (public domain, 1996) is the largest free word resource: ~355,000 single words plus ~257,000 compounds, a thesaurus (30,260 root words, 2.5 million synonym entries), a pronunciator, a POS tagger, and a hyphenator. Fedora’s system dictionary is built from Moby. The problem is quality—accents are stripped, there’s contamination across lists, and SCOWL’s maintainer found enough errors that Moby’s name lists were demoted from size 50 to size 95. Useful as raw material but requires heavy filtering.
The official Scrabble lists (OSPD, NWL, and CSW) are binary: a word is valid or it isn’t. There is no frequency data, there are no relationships, and the tournament lists (NWL and CSW) carry no definitions at all. The 1995 controversy is worth knowing: ~167 offensive words were removed from OSPD, and a separate definition-free tournament list was created to sidestep the issue entirely. In 2020, 259 slurs were removed from NWL; NWL2023 reinstated about 105 that had inoffensive alternate meanings.
The UK Advanced Cryptics Dictionary, compiled by J. Ross Beresford. ~250,000 words curated for crossword construction and solving. Includes all inflected forms, common proper names, and longer entries for jumbo grids. Became the source for the English Open Word List (EOWL), a ~129,000-word derivative for computer word games. Also incorporated into SCOWL at the size 80 level. Freeware with attribution.
For American-style crossword construction, Peter Broda maintains a ~390,000-entry wordlist with phrases, proper names, and entries scored by puzzle-worthiness. Chris Jones’s scored wordlist (~170,000 entries) draws from Broda plus NYT/WSJ/WaPo published puzzles. The “Spread the Wordlist” project (~80,000 entries) takes a data-driven approach. These are designed for grid-filling, not game development, but they represent significant curation effort.
Quinapalus exposes multiple dictionaries through word-matching tools like Qat and Word Matcher, drawing from Chambers, UKACD, and SOWPODS. Crossword constructors describe Qat as “Wordfinder on steroids.” But Quinapalus is a query tool, not a downloadable dataset, so you can’t get the underlying data into your game.
Since most word lists don’t include frequency information, developers bolt it on separately. Cross-referencing a separate frequency dataset against your SCOWL or ENABLE list is how most developers sort common from obscure. It works, but it’s manual assembly, and frequency is not difficulty. “The” is frequent but trivial; “quotidian” is infrequent but educated adults know it.
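The cross-referencing step usually amounts to a dictionary lookup and a threshold. The sketch below assumes the frequency data arrives as `word<TAB>count` lines, which is a hypothetical format; real corpora differ in columns and casing. The function names and the cutoff are likewise illustrative.

```python
def load_counts(lines):
    """Parse 'word<TAB>count' lines into a dict, merging case variants."""
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[1].isdigit():
            continue  # skip headers and malformed rows
        word = parts[0].lower()
        counts[word] = counts.get(word, 0) + int(parts[1])
    return counts


def split_by_frequency(words, counts, min_count):
    """Partition words into (common, obscure), most frequent first."""
    ranked = sorted(words, key=lambda w: (-counts.get(w, 0), w))
    common = [w for w in ranked if counts.get(w, 0) >= min_count]
    obscure = [w for w in ranked if counts.get(w, 0) < min_count]
    return common, obscure
```

A word missing from the frequency data is treated as count zero and lands in the obscure bucket, which is one reason a raw frequency cutoff makes a poor difficulty rating.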
A word-finding query engine built on WordNet 3.0, Google Books Ngrams, word2vec, and the Paraphrase Database. Handles synonyms, rhymes, sound-alikes, and spelling patterns—free up to 100,000 requests per day. The limitation for game developers: it’s an HTTP API, not a downloadable database. You can’t precompute at scale or embed it in an offline game, and its relationships are a grab-bag from different sources without consistent ranking or typing.
Princeton’s psycholinguistic experiment (1985–2006) and the academic gold standard for computational semantics. The limitation isn’t vocabulary—it’s architecture. WordNet maps taxonomic hierarchies, not associative meaning. Look up “hiking” and you get: it’s a type of walk; trudging and backpacking are types of it. No trail, mountain, nature, wilderness, boots—none of the experiential associations a word game needs.
We use Wiktionary as one of our 70+ sources. It’s superb as a dictionary, but it’s strictly a dictionary, not a semantic graph. It tells you what words mean; it doesn’t map how words relate to each other. And using it requires parsing a 1GB XML dump with wildly inconsistent wiki markup: tens of thousands of edge cases that make reliable data extraction at scale a major engineering effort.
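Reading the dump is the easy part. A streaming parser walks it page by page without loading the whole file, as in the sketch below. It assumes the standard MediaWiki export shape (`page` elements containing a `title` and a revision `text`) and ignores the XML namespace, since the export schema version varies between dumps. The function names are illustrative.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_source):
    """Yield (title, wikitext) pairs one page at a time."""
    title, text = None, ""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            # Drop the finished page's children so its wikitext can be freed
            elem.clear()
```

`iter_pages` accepts a filename or a binary file object. The hard part starts afterwards: the `text` of each page is raw wiki markup, and turning its templates and headings into clean definitions is where the edge cases pile up.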
MIT Media Lab’s crowdsourced commonsense knowledge graph (1999, 21M+ edges, CC license): “ice is cold,” “people eat when hungry.” Its relations are coarse-grained and commonsense-focused—useful for “dogs are animals” but not for the fine-grained semantic distinctions that make word puzzles interesting. Large language models have now largely absorbed this kind of commonsense knowledge, making ConceptNet less central than it once was.
Wordnik maintains a free open-source wordlist on GitHub (~185,000 words, MIT license) and sells a Game Developer’s Dataset: ~190,000 words in a single JSON file with definitions, parts of speech, offensive-word flags, root-form cross-references, and frequency information. $1,250 for one game, $1,550 for unlimited titles. Vocabulary is built on ENABLE plus TWL—no Collins/SOWPODS, so it skews North American. The closest thing to a ready-to-integrate product for game developers, though limited to word metadata without semantic relationships.
The pattern is consistent: each resource solves one narrow problem. Word lists have no definitions. Definitions have no relationships. Frequency data doesn’t map to difficulty. Nothing ships with content filters or short clues for gameplay. And no single resource combines words, relationships, rankings, definitions, clues, and game-readiness in one place.
This isn’t an accident. The legal and economic structure of language data means the best work stays invisible. Here’s why.