Linguabase vs. Free Sources

Free sources won’t get you far enough.

Several free language resources exist. They’re fine for academic research or simple lookups. But if you’re building word games—whether spelling games that need validated vocabulary and difficulty rankings, or semantic games that need weighted associations—free sources have fundamental gaps.

Wiktionary

The Wikipedia of dictionaries. Volunteer editors have built the largest dictionary in history — 1.4 million English entries with definitions, pronunciations, etymologies, and translations. We use it as one of our 70+ sources.

Architecture: Web page per word
Download: dumps.wikimedia.org (~1GB compressed)
Format: XML wrapper, content in wiki markup
Pros
  • Massive coverage (~1.4M English entries, 10M+ multilingual)
  • Larger than any commercial dictionary
  • Many languages
  • Free (CC license)
  • Has thesaurus qualities for verbs/adjectives
Cons
  • Fundamentally a dictionary, not a semantic graph
  • Wiki markup parsing is wildly inconsistent
  • No typed relationships or weights
  • Littered with noise and occasional spam/abuse
  • No API (scraping required)
  • No difficulty rankings
  • No content filters
  • Limited multi-word expression coverage

Wiktionary is superb as a dictionary — but it’s strictly a dictionary, not a thesaurus or semantic graph. It tells you what words mean; it doesn’t map how words relate to each other. Coverage of multi-word expressions is spotty — you’ll find “hot dog” but not “boiling water.” And despite being human-readable, its collaborative history and the wildly different metadata needs of different word types have produced deep structural inconsistencies — tens of thousands of edge cases that make reliable, consistent data extraction effectively impossible at scale.
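The parsing pain is easy to demonstrate. Here is a minimal sketch (the sample markup and heading variants are illustrative, not exhaustive) that extracts definition lines, the `#`-prefixed lines under a part-of-speech heading, and already has to tolerate two different heading depths for the same section:

```python
import re

# Two abbreviated entries in Wiktionary-style wiki markup. Real dumps mix
# heading depths, templates, and per-language conventions far more than this.
ENTRIES = {
    "hike": "==English==\n===Noun===\n# A long walk, especially in the countryside.\n",
    "tramp": "==English==\n====Noun====\n# A long walk.\n# A homeless vagrant.\n",
}

def definitions(markup: str) -> list[str]:
    """Pull '#'-prefixed definition lines under any Noun heading (2-4 '=')."""
    defs = []
    in_noun = False
    for line in markup.splitlines():
        if re.fullmatch(r"={2,4}Noun={2,4}", line.strip()):
            in_noun = True
        elif line.startswith("="):          # any other heading ends the section
            in_noun = False
        elif in_noun and line.startswith("# "):
            defs.append(line[2:].strip())
    return defs

for word, markup in ENTRIES.items():
    print(word, definitions(markup))
```

Multiply the heading-depth tolerance by templates, language sections, and per-word-type metadata, and the edge cases run into the tens of thousands.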

WordNet

Princeton’s psycholinguistic experiment (1985–2006) and the academic gold standard for computational lexicography. WordNet abstracts each word meaning into a “synset” — a concept like “a long walk for exercise” — then lists the words that express it (hike, hiking, tramp) and connects synsets hierarchically: hiking IS-A walk, trudge IS-A hiking.

Architecture: Synsets with taxonomic hierarchy
Download:
Format: Custom DB files; Python/Java libraries
Pros
  • Academic gold standard
  • Clean hierarchical structure
  • Typed relationships (hypernym, meronym)
  • Well-documented
  • Free for research
Cons
  • Scale: only ~155K words
  • Frozen since 2006 (last major update)
  • No weights or gradation
  • No experiential/gestalt associations
  • Academic bias (built by researchers and grad students)
  • Excludes function words entirely
  • No difficulty rankings
  • No content filters
  • Excludes multi-word expressions by design

The limitation isn’t quality — it’s architecture. WordNet maps synonym clusters and taxonomic hierarchies, not associative meaning. Look up “hiking” and you get: it’s a type of walk; trudging and backpacking are types of it. That’s the entire semantic world. No trail, mountain, nature, wilderness, boots, campsite, elevation, or scenic — none of the experiential associations a word game needs. Multi-word expressions were deliberately excluded from scope. Each word lands in a small number of synsets with unweighted connections, so the richness of how people actually think about words is absent.
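WordNet's view of "hiking" can be modeled in a few lines. This is a toy reconstruction of the structure described above, not WordNet's actual file format: every query returns a handful of unweighted taxonomic neighbors, and nothing else.

```python
# Toy model of WordNet-style synsets: unweighted IS-A links only.
# (Illustrative structure; real WordNet ships as custom DB files.)
HYPERNYM = {            # child synset -> parent synset
    "hiking": "walk",
    "trudge": "hiking",
    "backpacking": "hiking",
}

def neighbors(synset: str) -> set[str]:
    """Everything a pure taxonomy can say about a synset."""
    ups = {HYPERNYM[synset]} if synset in HYPERNYM else set()
    downs = {child for child, parent in HYPERNYM.items() if parent == synset}
    return ups | downs

# Taxonomic neighbors only -- no 'trail', 'boots', 'wilderness', ...
print(neighbors("hiking"))
```

No amount of traversal surfaces the experiential associations, because they were never in the graph to begin with.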

ConceptNet

MIT Media Lab’s commonsense knowledge graph (launched 1999). Crowdsourced everyday facts like “ice is cold” and “people eat when hungry,” now grown to 21M+ edges across 300+ languages.

Architecture: Commonsense knowledge graph
Download: github.com/.../Downloads (~800MB compressed)
Format: Tab-separated CSV with JSON metadata
Pros
  • Pioneering work in commonsense AI
  • Large-scale (21M+ edges)
  • Typed relationships (IsA, PartOf, UsedFor...)
  • Multilingual (300+ languages)
  • API available
  • CC license
Cons
  • Noisy data (crowdsourced origins)
  • Coarse-grained relations
  • Weight scores unreliable
  • Commonsense focus, not lexical nuance
  • Largely superseded by LLMs for its original purpose
  • Inconsistent multi-word expression coverage

The core limitation is granularity and focus. ConceptNet’s relations are coarse-grained and commonsense-focused — great for “dogs are animals” but not for the fine-grained semantic distinctions that make word puzzles interesting. Multi-word expressions appear inconsistently. Its crowdsourced origins also mean significant noise. Large language models have now largely absorbed this kind of commonsense knowledge, making ConceptNet less central to current research.
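The dump format is straightforward to work with, which makes the granularity problem visible quickly. A sketch (the sample line mimics the dump's tab-separated layout of assertion URI, relation, start, end, and JSON metadata; the values are illustrative):

```python
import json

# One line in the style of a ConceptNet dump: tab-separated columns with
# JSON metadata in the last field. (Sample values are illustrative.)
LINE = (
    "/a/[/r/IsA/,/c/en/dog/,/c/en/animal/]\t/r/IsA\t/c/en/dog\t/c/en/animal\t"
    '{"weight": 2.0, "dataset": "/d/conceptnet/4/en"}'
)

def parse_edge(line: str) -> dict:
    uri, rel, start, end, meta = line.split("\t")
    info = json.loads(meta)
    return {
        "rel": rel.split("/")[-1],        # 'IsA' -- one coarse bucket
        "start": start.split("/")[-1],    # 'dog'
        "end": end.split("/")[-1],        # 'animal'
        "weight": info["weight"],         # crowd-derived, often unreliable
    }

print(parse_edge(LINE))
```

Every "dog is an animal" and "poodle is a dog" edge lands in the same IsA bucket; the relation vocabulary has no room for the finer distinctions a puzzle designer needs.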

DBpedia / Wikidata

Structured knowledge graphs for entities and facts. DBpedia extracts from Wikipedia infoboxes; Wikidata is a community-curated knowledge base with 100M+ items.

Architecture: Entity knowledge graphs
Download:
Format: RDF/JSON, SPARQL-queryable
Pros
  • Good for factual queries (who/what/when)
  • Massive scale (Wikidata: 100M+ items)
  • Structured and queryable (SPARQL)
  • Actively maintained
  • Used by Google Knowledge Graph, Apple, etc.
  • Free and open
Cons
  • Entity database, not a lexical resource
  • Factual relationships only (born-in, instance-of)
  • No word-sense info
  • No synonyms, antonyms, or associations
  • Focuses on named entities, not concepts

Both are entity databases, not lexical resources. They capture factual relationships — born-in, instance-of, part-of — not associative or conceptual proximity. Wikidata will tell you “cat is-a mammal”; it won’t tell you that “cat” evokes “curiosity” and “nine lives.” Different problem entirely.
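The kind of question Wikidata answers well is easy to express in SPARQL. A sketch (the query is only built as a string here, not sent to the endpoint; `wdt:P279` is Wikidata's standard subclass-of property):

```python
# A factual 'cat is-a mammal'-style query for Wikidata's SPARQL endpoint.
# Built but not executed here; P279 = 'subclass of' in Wikidata conventions.
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?classLabel WHERE {
  ?item rdfs:label "house cat"@en .
  ?item wdt:P279 ?class .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

# What a word game needs instead -- associative proximity -- has no SPARQL
# equivalent: there is no property linking 'cat' to 'curiosity'.
print(QUERY.strip())
```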

Quick-Fix Sources

Before developers find the academic sources above, they usually find simpler options: Scrabble word lists, frequency data, and free APIs. These are fine for prototyping. They’re not a data layer for production.

Scrabble Word Lists
enable1.txt, TWL, SOWPODS, Collins
Flat text files of tournament-legal words. The most common starting point — ~170K validated spellings.
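Using such a list is one line of code, which is exactly the point: membership is all you get. A sketch (the three-word sample stands in for a ~170K-line file such as enable1.txt):

```python
# A Scrabble-style word list supports exactly one operation: membership.
# (Inline sample stands in for a full file such as enable1.txt.)
WORDS = {"hike", "hiking", "quotidian"}

def is_playable(word: str) -> bool:
    return word.lower() in WORDS

print(is_playable("hiking"))   # valid spelling
print(is_playable("hikng"))    # typo, rejected
# No difficulty, no relations, no definitions -- just spellings.
```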
SCOWL & Friends
12dicts, Moby Project, FrequencyWords
Public domain word lists in tiered sizes (small → huge). SCOWL is spell-checker oriented; 12dicts is curated by commonality; Moby includes thesaurus and pronunciation data.
Frequency Data
Norvig, Google Books Ngrams, COCA
Word frequency from books, web crawls, or corpora. COCA is corpus-based; Ngrams shows trends over time.

Frequency ≠ difficulty. “The” is frequent but trivial; “quotidian” is infrequent but educated adults know it.
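Ranking by raw frequency makes the mismatch concrete. A sketch (the per-million counts are illustrative, roughly Zipf-shaped; the point is the ordering, not the numbers):

```python
# Illustrative per-million frequencies. Ranking by frequency alone calls
# 'the' the "easiest" word and leaves 'quotidian' and 'syzygy' nearly tied.
FREQ = {"the": 50000, "dog": 250, "quotidian": 0.3, "syzygy": 0.2}

by_freq = sorted(FREQ, key=FREQ.get, reverse=True)
print(by_freq)

# Yet most educated adults know 'quotidian' while few can spell 'syzygy':
# difficulty needs human-facing signals (familiarity, domain, education
# level), not corpus counts.
```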

Free APIs
Datamuse, Wordnik, RhymeZone, Free Dictionary API
Query-based lookups for synonyms, rhymes, definitions. Datamuse wraps WordNet; Wordnik has frequency data; RhymeZone does sound-alikes.
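These lookups are a single GET away, which is what makes them attractive for prototypes. A sketch that only builds the request URLs without fetching them (`ml`, meaning-like, and `rel_rhy`, rhymes-with, are documented Datamuse query parameters):

```python
from urllib.parse import urlencode

# Build Datamuse-style request URLs. (URLs are constructed, not fetched.)
BASE = "https://api.datamuse.com/words"

def datamuse_url(**params: str) -> str:
    return f"{BASE}?{urlencode(params)}"

print(datamuse_url(ml="hiking"))       # words with similar meaning
print(datamuse_url(rel_rhy="trail"))   # rhyming words
# Fine for prototyping -- but rate limits, uptime, and fixed result shapes
# make per-request third-party APIs a poor production data layer.
```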

Developers also reach for NLTK corpora (Python’s built-in word lists), crossword clue databases (for wordplay associations), Sporcle quizzes and Reddit threads (for crowd-sourced category ideas), TV Tropes and IMDb datasets (for pop culture lists), and generation tricks like compound words, homophones, and anagram sets. These are ingredients for brainstorming, not infrastructure for shipping. The gap between “I found a word list” and “I have a production-quality data layer” is where Linguabase fits.
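One of those generation tricks, anagram sets, takes only a sorted-letter signature (a standard technique; the sample words are illustrative):

```python
from collections import defaultdict

# Group words by their sorted-letter signature: words sharing a signature
# are anagrams of one another.
WORDS = ["listen", "silent", "enlist", "hike", "inlets"]

groups = defaultdict(list)
for w in WORDS:
    groups["".join(sorted(w))].append(w)

anagram_sets = [g for g in groups.values() if len(g) > 1]
print(anagram_sets)  # [['listen', 'silent', 'enlist', 'inlets']]
```

Tricks like this generate raw material; curating, ranking, and filtering that material is the production work the list above doesn't cover.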

Comparison Matrix

| Feature | Wiktionary | WordNet | ConceptNet | Linguabase |
|---|---|---|---|---|
| Relationship structure | None | Typed (hypernym, meronym) | Typed (36 relations) | Weighted by strength |
| Relationship weights | No | No | Yes (unreliable) | Yes (curated) |
| Data quality | Variable | High (dated) | Low-medium | High |
| Graph operations | No | Limited | Yes | Yes |
| Commercial use | CC license | Unclear | CC license | Licensed |
| Active maintenance | Yes (community) | Minimal | Limited | Yes |
| Production-ready API | No | No | Yes (limited) | Yes |
| Sense-balanced coverage | No | No | No | Yes |
| Directional weights | No | No | No | Yes |
| False cognate removal | No | No | No | 291K audited |
| Gestalt/experiential | No | No | Partial | Yes |
| Vocabulary scale | 1.4M English (10M+ multilingual) | 155K | ~300K | 1.5M (400K prod) |
| Multi-word expressions | Limited | No | Partial | 700K |

What Free Sources Can’t Do

Even if you combine all free sources, you still won’t have:

  • Difficulty rankings
  • Content filters
  • Weighted, directional associations
  • Sense-balanced coverage
  • Gestalt/experiential associations
  • Multi-word expressions at scale

These aren’t cleanup problems. They’re architectural gaps. Free sources were built for lookup and research, not for word games.

The Work Problem (Secondary)

Beyond capability gaps, free sources also require:

  • Parsing inconsistent wiki markup and custom file formats
  • Scraping where no API exists
  • Cleaning crowdsourced noise and spam
  • Reconciling licenses for commercial use

Next Steps

Linguabase: Over a decade of engineering already done. One clean API. Production-tested. How we built it → or see licensing options →