Free sources won’t get you far enough.
Several free language resources exist. They’re fine for academic research or simple lookups. But if you’re building word games—whether spelling games that need validated vocabulary and difficulty rankings, or semantic games that need weighted associations—free sources have fundamental gaps.
The Wikipedia of dictionaries. Volunteer editors have built the largest dictionary in history — 1.4 million English entries with definitions, pronunciations, etymologies, and translations. We use it as one of our 70+ sources.
Wiktionary is superb as a dictionary — but it’s strictly a dictionary, not a thesaurus or semantic graph. It tells you what words mean; it doesn’t map how words relate to each other. Coverage of multi-word expressions is spotty — you’ll find “hot dog” but not “boiling water.” And despite being human-readable, its collaborative history and the wildly different metadata needs of different word types have produced deep structural inconsistencies — tens of thousands of edge cases that make reliable, consistent data extraction effectively impossible at scale.
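To make the extraction problem concrete, here's a toy sketch with two invented entry fragments (not real Wiktionary markup, which is far messier): the same content encoded under two different header conventions, and a naive extractor that silently fails on one of them.

```python
import re
from typing import Optional

# Two simplified, invented entry fragments illustrating the kind of
# structural drift found across Wiktionary pages: identical information
# (part of speech plus first definition), encoded under different conventions.
ENTRY_A = """==English==
===Noun===
# A cooked sausage served in a bun."""

ENTRY_B = """==English==
====Noun====
# A cooked sausage served in a bun."""

def first_definition(wikitext: str) -> Optional[str]:
    """Naive extractor: assumes a level-3 POS header followed by a '#' line."""
    in_pos = False
    for line in wikitext.splitlines():
        if re.fullmatch(r"===\w+===", line):
            in_pos = True
        elif in_pos and line.startswith("# "):
            return line[2:]
    return None

print(first_definition(ENTRY_A))  # 'A cooked sausage served in a bun.'
print(first_definition(ENTRY_B))  # None: the header nesting depth differs
```

Multiply one edge case like this by tens of thousands and "parse Wiktionary reliably" stops being a weekend project.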
Princeton’s psycholinguistic experiment (1985–2006) and the academic gold standard for computational lexicography. WordNet abstracts each word meaning into a “synset” — a concept like “a long walk for exercise” — then lists the words that express it (hike, hiking, tramp) and connects synsets hierarchically: hiking IS-A walk, trudge IS-A hiking.
The limitation isn’t quality — it’s architecture. WordNet maps synonym clusters and taxonomic hierarchies, not associative meaning. Look up “hiking” and you get: it’s a type of walk; trudging and backpacking are types of it. That’s the entire semantic world. No trail, mountain, nature, wilderness, boots, campsite, elevation, or scenic — none of the experiential associations a word game needs. Multi-word expressions were deliberately excluded from scope. Each word lands in a small number of synsets with unweighted connections, so the richness of how people actually think about words is absent.
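WordNet's architecture can be sketched as a tiny in-memory model (hypothetical data mirroring the hiking example above; the real database ships as `wndb` files, commonly accessed via NLTK):

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    """One WordNet-style concept: a gloss plus the lemmas that express it."""
    gloss: str
    lemmas: list
    hypernyms: list = field(default_factory=list)  # IS-A parents

walk = Synset("the act of traveling by foot", ["walk", "walking"])
hike = Synset("a long walk for exercise", ["hike", "hiking", "tramp"], [walk])
trudge = Synset("a long difficult walk", ["trudge", "trudging"], [hike])

def hypernym_chain(s):
    """Follow IS-A links upward, as WordNet's taxonomy does."""
    chain = []
    while s.hypernyms:
        s = s.hypernyms[0]
        chain.append(s.lemmas[0])
    return chain

print(hypernym_chain(trudge))  # ['hike', 'walk']
```

Notice what the structure can express: types and supertypes, nothing else. There is no slot anywhere in this model where "trail" or "boots" could attach to "hike" — which is exactly the architectural gap described above.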
MIT Media Lab’s commonsense knowledge graph (1999). Crowdsourced everyday facts like “ice is cold” and “people eat when hungry,” now grown to 21M+ edges across 300+ languages.
The core limitation is granularity and focus. ConceptNet’s relations are coarse-grained and commonsense-focused — great for “dogs are animals” but not for the fine-grained semantic distinctions that make word puzzles interesting. Multi-word expressions appear inconsistently. Its crowdsourced origins also mean significant noise. Large language models have now largely absorbed this kind of commonsense knowledge, making ConceptNet less central to current research.
Structured knowledge graphs for entities and facts. DBpedia extracts structured data from Wikipedia infoboxes; Wikidata is a natively community-curated knowledge base (not derived from Wikipedia) with 100M+ items.

Both are entity databases, not lexical resources. They capture factual relationships — born-in, instance-of, part-of — not associative or conceptual proximity. Wikidata will tell you “cat is-a mammal”; it won’t tell you that “cat” evokes “curiosity” and “nine lives.” Different problem entirely.
Before developers find the academic sources above, they usually find simpler options: Scrabble word lists, frequency data, and free APIs. These are fine for prototyping. They’re not a data layer for production.
Frequency ≠ difficulty. “The” is frequent but trivial; “quotidian” is infrequent but educated adults know it.
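A toy sketch makes the mismatch concrete (the frequency ranks and familiarity labels below are invented for illustration, not real corpus data):

```python
# Illustrative only: invented frequency ranks and familiarity labels,
# showing why corpus frequency is a poor proxy for puzzle difficulty.
words = {
    "the":       {"freq_rank": 1,     "known_by_most_adults": True},
    "quotidian": {"freq_rank": 48000, "known_by_most_adults": True},
    "syzygy":    {"freq_rank": 52000, "known_by_most_adults": False},
}

# A frequency-only heuristic calls every rare word "hard"...
freq_hard = [w for w, d in words.items() if d["freq_rank"] > 10_000]
# ...but a vocabulary-aware measure separates the genuinely obscure.
truly_hard = [w for w in freq_hard if not words[w]["known_by_most_adults"]]

print(freq_hard)   # ['quotidian', 'syzygy']
print(truly_hard)  # ['syzygy']
```

Difficulty ranking needs a signal about what people actually know, which raw frequency lists don't carry.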
Developers also reach for NLTK corpora (Python’s built-in word lists), crossword clue databases (for wordplay associations), Sporcle quizzes and Reddit threads (for crowd-sourced category ideas), TV Tropes and IMDb datasets (for pop culture lists), and generation tricks like compound words, homophones, and anagram sets. These are ingredients for brainstorming, not infrastructure for shipping. The gap between “I found a word list” and “I have a production-quality data layer” is where Linguabase fits.
| Feature | Wiktionary | WordNet | ConceptNet | Linguabase |
|---|---|---|---|---|
| Relationship structure | None | Typed (hypernym, meronym) | Typed (36 relations) | Weighted by strength |
| Relationship weights | No | No | Yes (unreliable) | Yes (curated) |
| Data quality | Variable | High (dated) | Low-medium | High |
| Graph operations | No | Limited | Yes | Yes |
| Commercial use | CC license | Unclear | CC license | Licensed |
| Active maintenance | Yes (community) | Minimal | Limited | Yes |
| Production-ready API | No | No | Yes (limited) | Yes |
| Sense-balanced coverage | No | No | No | Yes |
| Directional weights | No | No | No | Yes |
| False cognate removal | No | No | No | 291K audited |
| Gestalt/experiential | No | No | Partial | Yes |
| Vocabulary scale | 1.4M English (10M+ multilingual) | 155K | ~300K | 1.5M (400K prod) |
| Multi-word expressions | Limited | No | Partial | 700K |
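Two of the rows above — "weighted by strength" and "directional weights" — are worth unpacking. A minimal sketch of what a directional, weighted association graph looks like (hypothetical edges and weights, not Linguabase's actual schema or API):

```python
# Hypothetical weighted, directional association edges. Weights are invented
# for illustration; the point is the asymmetry, not the numbers.
assoc = {
    ("cat", "kitten"): 0.62,
    ("kitten", "cat"): 0.91,   # the reverse edge is stronger: "kitten" almost
                               # always evokes "cat", but "cat" evokes many
                               # things besides "kitten"
    ("cat", "curiosity"): 0.44,
}

def strength(src, dst):
    """Association strength from src to dst; 0.0 if no edge exists."""
    return assoc.get((src, dst), 0.0)

# Asymmetry is the whole point of directional weights:
assert strength("kitten", "cat") > strength("cat", "kitten")

# Rank a word's associations by strength, the core query a word game makes:
neighbors = sorted(
    (dst for (src, dst) in assoc if src == "cat"),
    key=lambda d: -strength("cat", d),
)
print(neighbors)  # ['kitten', 'curiosity']
```

Untyped synonym links (Wiktionary) and unweighted taxonomies (WordNet) can't answer "what does this word most strongly evoke, in which direction?" — the graph shape above is what that query requires.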
Even if you combine all free sources, you still won't have sense-balanced coverage, directional relationship weights, audited false-cognate removal, or the gestalt, experiential associations word games depend on.
These aren’t cleanup problems. They’re architectural gaps. Free sources were built for lookup and research, not for word games.
Beyond capability gaps, free sources also demand significant engineering before they're production-ready: extraction from inconsistent formats, cleanup of crowdsourced noise, and ongoing maintenance.
Linguabase: Over a decade of engineering already done. One clean API. Production-tested. How we built it → or see licensing options →