Linguabase vs. Free Sources

Free sources won’t get you far enough.

Several free language resources exist. They’re fine for academic research or simple lookups. But if you’re building word games—whether spelling games that need validated vocabulary and difficulty rankings, or semantic games that need weighted associations—free sources have fundamental gaps.

Wiktionary

The Wikipedia of dictionaries. Volunteer editors have built the largest dictionary in history — 1.4 million English entries with definitions, pronunciations, etymologies, and translations. We use it as one of our 70+ sources.

Architecture: Web page per word
Download: dumps.wikimedia.org (~1GB compressed)
Format: XML wrapper, content in wiki markup
Pros
  • Massive coverage (~1.4M English entries, 10M+ multilingual)
  • Larger than any commercial dictionary
  • Many languages
  • Free (CC license)
  • Has thesaurus qualities for verbs/adjectives
Cons
  • Fundamentally a dictionary, not a semantic graph
  • Wiki markup parsing is wildly inconsistent
  • No typed relationships or weights
  • Littered with noise and occasional spam/abuse
  • No API (scraping required)
  • No difficulty rankings
  • No content filters
  • Limited multi-word expression coverage

Wiktionary is superb as a dictionary — but it’s strictly a dictionary, not a thesaurus or semantic graph. It tells you what words mean; it doesn’t map how words relate to each other. Coverage of multi-word expressions is spotty — you’ll find “hot dog” but not “boiling water.” And despite being human-readable, its collaborative history and the wildly different metadata needs of different word types have produced deep structural inconsistencies — tens of thousands of edge cases that make reliable, consistent data extraction effectively impossible at scale.
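The parsing pain is easy to demonstrate. Here is a minimal sketch (the sample markup and heading variants are illustrative, not exhaustive) that extracts definition lines, the `#`-prefixed lines under a part-of-speech heading, and already has to tolerate two different heading depths for the same section:

```python
import re

# Two abbreviated entries in Wiktionary-style wiki markup. Real dumps mix
# heading depths, templates, and per-language conventions far more than this.
ENTRIES = {
    "hike": "==English==\n===Noun===\n# A long walk, especially in the countryside.\n",
    "tramp": "==English==\n====Noun====\n# A long walk.\n# A homeless vagrant.\n",
}

def definitions(markup: str) -> list[str]:
    """Pull '#'-prefixed definition lines under any Noun heading (2-4 '=')."""
    defs = []
    in_noun = False
    for line in markup.splitlines():
        if re.fullmatch(r"={2,4}Noun={2,4}", line.strip()):
            in_noun = True
        elif line.startswith("="):          # any other heading ends the section
            in_noun = False
        elif in_noun and line.startswith("# "):
            defs.append(line[2:].strip())
    return defs

for word, markup in ENTRIES.items():
    print(word, definitions(markup))
```

Multiply the heading-depth tolerance by templates, language sections, and per-word-type metadata, and the edge cases run into the tens of thousands.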

WordNet

Princeton’s psycholinguistic experiment (1985–2006) and the academic gold standard for computational lexicography. WordNet abstracts each word meaning into a “synset” — a concept like “a long walk for exercise” — then lists the words that express it (hike, hiking, tramp) and connects synsets hierarchically: hiking IS-A walk, trudge IS-A hiking.

Architecture: Synsets with taxonomic hierarchy
Download:
Format: Custom DB files; Python/Java libraries
Pros
  • Academic gold standard
  • Clean hierarchical structure
  • Typed relationships (hypernym, meronym)
  • Well-documented
  • Free for research
Cons
  • Scale: only ~155K words
  • Frozen since 2006 (last major update)
  • No weights or gradation
  • No experiential/gestalt associations
  • Academic bias (built by researchers and grad students)
  • Excludes function words entirely
  • No difficulty rankings
  • No content filters
  • Excludes multi-word expressions by design

The limitation isn’t quality — it’s architecture. WordNet maps synonym clusters and taxonomic hierarchies, not associative meaning. Look up “hiking” and you get: it’s a type of walk; trudging and backpacking are types of it. That’s the entire semantic world. No trail, mountain, nature, wilderness, boots, campsite, elevation, or scenic — none of the experiential associations a word game needs. Multi-word expressions were deliberately excluded from scope. Each word lands in a small number of synsets with unweighted connections, so the richness of how people actually think about words is absent.
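WordNet's view of "hiking" can be modeled in a few lines. This is a toy reconstruction of the structure described above, not WordNet's actual file format: every query returns a handful of unweighted taxonomic neighbors, and nothing else.

```python
# Toy model of WordNet-style synsets: unweighted IS-A links only.
# (Illustrative structure; real WordNet ships as custom DB files.)
HYPERNYM = {            # child synset -> parent synset
    "hiking": "walk",
    "trudge": "hiking",
    "backpacking": "hiking",
}

def neighbors(synset: str) -> set[str]:
    """Everything a pure taxonomy can say about a synset."""
    ups = {HYPERNYM[synset]} if synset in HYPERNYM else set()
    downs = {child for child, parent in HYPERNYM.items() if parent == synset}
    return ups | downs

# Taxonomic neighbors only -- no 'trail', 'boots', 'wilderness', ...
print(neighbors("hiking"))
```

No amount of traversal surfaces the experiential associations, because they were never in the graph to begin with.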

ConceptNet

MIT Media Lab’s commonsense knowledge graph (launched 1999). Crowdsourced everyday facts like “ice is cold” and “people eat when hungry,” now grown to 21M+ edges across 300+ languages.

Architecture: Commonsense knowledge graph
Download: github.com/.../Downloads (~800MB compressed)
Format: Tab-separated CSV with JSON metadata
Pros
  • Pioneering work in commonsense AI
  • Large-scale (21M+ edges)
  • Typed relationships (IsA, PartOf, UsedFor...)
  • Multilingual (300+ languages)
  • API available
  • CC license
Cons
  • Noisy data (crowdsourced origins)
  • Coarse-grained relations
  • Weight scores unreliable
  • Commonsense focus, not lexical nuance
  • Largely superseded by LLMs for its original purpose
  • Inconsistent multi-word expression coverage

The core limitation is granularity and focus. ConceptNet’s relations are coarse-grained and commonsense-focused — great for “dogs are animals” but not for the fine-grained semantic distinctions that make word puzzles interesting. Multi-word expressions appear inconsistently. Its crowdsourced origins also mean significant noise. Large language models have now largely absorbed this kind of commonsense knowledge, making ConceptNet less central to current research.
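The dump format is straightforward to work with, which makes the granularity problem visible quickly. A sketch (the sample line mimics the dump's tab-separated layout of assertion URI, relation, start, end, and JSON metadata; the values are illustrative):

```python
import json

# One line in the style of a ConceptNet dump: tab-separated columns with
# JSON metadata in the last field. (Sample values are illustrative.)
LINE = (
    "/a/[/r/IsA/,/c/en/dog/,/c/en/animal/]\t/r/IsA\t/c/en/dog\t/c/en/animal\t"
    '{"weight": 2.0, "dataset": "/d/conceptnet/4/en"}'
)

def parse_edge(line: str) -> dict:
    uri, rel, start, end, meta = line.split("\t")
    info = json.loads(meta)
    return {
        "rel": rel.split("/")[-1],        # 'IsA' -- one coarse bucket
        "start": start.split("/")[-1],    # 'dog'
        "end": end.split("/")[-1],        # 'animal'
        "weight": info["weight"],         # crowd-derived, often unreliable
    }

print(parse_edge(LINE))
```

Every "dog is an animal" and "poodle is a dog" edge lands in the same IsA bucket; the relation vocabulary has no room for the finer distinctions a puzzle designer needs.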

DBpedia / Wikidata

Structured knowledge graphs for entities and facts. DBpedia extracts from Wikipedia infoboxes; Wikidata is a community-curated knowledge base with 100M+ items.

Architecture: Entity knowledge graphs
Download:
Format: RDF/JSON, SPARQL-queryable
Pros
  • Good for factual queries (who/what/when)
  • Massive scale (Wikidata: 100M+ items)
  • Structured and queryable (SPARQL)
  • Actively maintained
  • Used by Google Knowledge Graph, Apple, etc.
  • Free and open
Cons
  • Entity database, not a lexical resource
  • Factual relationships only (born-in, instance-of)
  • No word-sense info
  • No synonyms, antonyms, or associations
  • Focuses on named entities, not concepts

Both are entity databases, not lexical resources. They capture factual relationships — born-in, instance-of, part-of — not associative or conceptual proximity. Wikidata will tell you “cat is-a mammal”; it won’t tell you that “cat” evokes “curiosity” and “nine lives.” Different problem entirely.
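The kind of question Wikidata answers well is easy to express in SPARQL. A sketch (the query is only built as a string here, not sent to the endpoint; `wdt:P279` is Wikidata's standard subclass-of property):

```python
# A factual 'cat is-a mammal'-style query for Wikidata's SPARQL endpoint.
# Built but not executed here; P279 = 'subclass of' in Wikidata conventions.
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?classLabel WHERE {
  ?item rdfs:label "house cat"@en .
  ?item wdt:P279 ?class .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

# What a word game needs instead -- associative proximity -- has no SPARQL
# equivalent: there is no property linking 'cat' to 'curiosity'.
print(QUERY.strip())
```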

Quick-Fix Sources

Before developers find the academic sources above, they usually find simpler options: Scrabble word lists, frequency data, and free APIs. These are fine for prototyping. They’re not a data layer for production.

Scrabble Word Lists
enable1.txt, TWL, SOWPODS, Collins
Flat text files of tournament-legal words. The most common starting point — ~170K validated spellings.
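Using such a list is one line of code, which is exactly the point: membership is all you get. A sketch (the three-word sample stands in for a ~170K-line file such as enable1.txt):

```python
# A Scrabble-style word list supports exactly one operation: membership.
# (Inline sample stands in for a full file such as enable1.txt.)
WORDS = {"hike", "hiking", "quotidian"}

def is_playable(word: str) -> bool:
    return word.lower() in WORDS

print(is_playable("hiking"))   # valid spelling
print(is_playable("hikng"))    # typo, rejected
# No difficulty, no relations, no definitions -- just spellings.
```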
SCOWL & Friends
12dicts, Moby Project, FrequencyWords
Public domain word lists in tiered sizes (small → huge). SCOWL is spell-checker oriented; 12dicts is curated by commonality; Moby includes thesaurus and pronunciation data.
Frequency Data
Norvig, Google Books Ngrams, COCA
Word frequency from books, web crawls, or corpora. COCA is corpus-based; Ngrams shows trends over time.

Frequency ≠ difficulty. “The” is frequent but trivial; “quotidian” is infrequent but educated adults know it.
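Ranking by raw frequency makes the mismatch concrete. A sketch (the per-million counts are illustrative, roughly Zipf-shaped; the point is the ordering, not the numbers):

```python
# Illustrative per-million frequencies. Ranking by frequency alone calls
# 'the' the "easiest" word and leaves 'quotidian' and 'syzygy' nearly tied.
FREQ = {"the": 50000, "dog": 250, "quotidian": 0.3, "syzygy": 0.2}

by_freq = sorted(FREQ, key=FREQ.get, reverse=True)
print(by_freq)

# Yet most educated adults know 'quotidian' while few can spell 'syzygy':
# difficulty needs human-facing signals (familiarity, domain, education
# level), not corpus counts.
```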

Free APIs
Datamuse, Wordnik, RhymeZone, Free Dictionary API
Query-based lookups for synonyms, rhymes, definitions. Datamuse wraps WordNet; Wordnik has frequency data; RhymeZone does sound-alikes.
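These lookups are a single GET away, which is what makes them attractive for prototypes. A sketch that only builds the request URLs without fetching them (`ml`, meaning-like, and `rel_rhy`, rhymes-with, are documented Datamuse query parameters):

```python
from urllib.parse import urlencode

# Build Datamuse-style request URLs. (URLs are constructed, not fetched.)
BASE = "https://api.datamuse.com/words"

def datamuse_url(**params: str) -> str:
    return f"{BASE}?{urlencode(params)}"

print(datamuse_url(ml="hiking"))       # words with similar meaning
print(datamuse_url(rel_rhy="trail"))   # rhyming words
# Fine for prototyping -- but rate limits, uptime, and fixed result shapes
# make per-request third-party APIs a poor production data layer.
```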

Developers also reach for NLTK corpora (Python’s built-in word lists), crossword clue databases (for wordplay associations), Sporcle quizzes and Reddit threads (for crowd-sourced category ideas), TV Tropes and IMDb datasets (for pop culture lists), and generation tricks like compound words, homophones, and anagram sets. These are ingredients for brainstorming, not infrastructure for shipping. The gap between “I found a word list” and “I have a production-quality data layer” is where Linguabase fits.
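One of those generation tricks, anagram sets, takes only a sorted-letter signature (a standard technique; the sample words are illustrative):

```python
from collections import defaultdict

# Group words by their sorted-letter signature: words sharing a signature
# are anagrams of one another.
WORDS = ["listen", "silent", "enlist", "hike", "inlets"]

groups = defaultdict(list)
for w in WORDS:
    groups["".join(sorted(w))].append(w)

anagram_sets = [g for g in groups.values() if len(g) > 1]
print(anagram_sets)  # [['listen', 'silent', 'enlist', 'inlets']]
```

Tricks like this generate raw material; curating, ranking, and filtering that material is the production work the list above doesn't cover.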

Comparison Matrix

| Feature | Wiktionary | WordNet | ConceptNet | Linguabase |
|---|---|---|---|---|
| Relationship structure | None | Typed (hypernym, meronym) | Typed (36 relations) | Weighted by strength |
| Relationship weights | No | No | Yes (unreliable) | Yes (curated) |
| Data quality | Variable | High (dated) | Low-medium | High |
| Graph operations | No | Limited | Yes | Yes |
| Commercial use | CC license | Unclear | CC license | Licensed |
| Active maintenance | Yes (community) | Minimal | Limited | Yes |
| Production-ready API | No | No | Yes (limited) | Yes |
| Sense-balanced coverage | No | No | No | Yes |
| Directional weights | No | No | No | Yes |
| False cognate removal | No | No | No | 291K audited |
| Gestalt/experiential | No | No | Partial | Yes |
| Vocabulary scale | 1.4M English (10M+ multilingual) | 155K | ~300K | 1.5M (400K prod) |
| Multi-word expressions | Limited | No | Partial | 700K |

What Free Sources Can’t Do

Even if you combine all free sources, you still won’t have:

  • Difficulty rankings
  • Content filters
  • Weighted, directional associations
  • Sense-balanced coverage
  • Gestalt/experiential associations
  • Multi-word expressions at scale

These aren’t cleanup problems. They’re architectural gaps. Free sources were built for lookup and research, not for word games.

The Work Problem (Secondary)

Beyond capability gaps, free sources also require:

  • Parsing inconsistent wiki markup and custom file formats
  • Scraping where no API exists
  • Cleaning crowdsourced noise and spam
  • Reconciling licenses for commercial use

Next Steps

Linguabase: Over a decade of engineering already done. One clean API. Production-tested. How we built it → or see licensing options →