Linguabase vs. Free Sources

Free sources won’t get you far enough.

The Problem

Several free language resources exist. They’re fine for academic research or simple lookups. But if you need to traverse meaning-space — for games, AI reasoning, or semantic exploration — free sources fundamentally lack the architecture.

Wiktionary

Wiktionary is actually superb — by the late 2010s it surpassed commercial dictionaries in coverage, and we use it ourselves. But it’s fundamentally designed as a multilingual dictionary, not a semantic exploration tool.

Pros:
  • Massive coverage (7M+ entries)
  • Surpassed commercial dictionaries in coverage
  • Many languages
  • Free (CC license)
  • Thesaurus-like coverage for verbs and adjectives
Cons:
  • Fundamentally a dictionary, not a semantic graph
  • Wiki markup parsing is wildly inconsistent
  • No typed relationships or weights
  • Littered with noise and occasional spam/abuse
  • No structured lexical API (scraping and parsing required)
Bottom line: Excellent dictionary. Wrong tool for semantic exploration. Built for multilingual lookup, not for traversing meaning-space.
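
To make the parsing problem concrete, here is a minimal sketch in Python (using the requests library) that pulls one entry through MediaWiki's generic action API, the only machine interface Wiktionary offers. It returns raw wikitext; extracting anything semantic from that markup is left entirely to you.

```python
import requests

# Fetch the raw wikitext for the entry "dog" via the generic
# MediaWiki action API (there is no structured lexical API).
resp = requests.get(
    "https://en.wiktionary.org/w/api.php",
    params={
        "action": "parse",
        "page": "dog",
        "prop": "wikitext",
        "format": "json",
        "formatversion": 2,
    },
)
resp.raise_for_status()
wikitext = resp.json()["parse"]["wikitext"]

# Headings, templates, and section order vary from entry to entry,
# which is why downstream parsing is so inconsistent in practice.
print(wikitext[:300])
```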

WordNet

WordNet began in 1985 at Princeton under George Miller as a psycholinguistic experiment — a test of how the human mental lexicon is organized. Hand-crafted by graduate students and post-docs, it groups words into “synsets” (synonym sets) representing single concepts, connected by hierarchical relations like hypernymy (“dog IS-A animal”). The design was deliberately scoped: open-class words only (nouns, verbs, adjectives, adverbs), excluding function words, proper nouns, and multi-word expressions. Current scale: ~155K words organized into 117K synsets with 206K word-sense pairs.

The core limitation isn’t quality — it’s architecture. WordNet is fundamentally about clustering words around meanings, not about free-association outward from a word. It answers “what other words share this meaning?” not “what does this word make you think of?” It lacks any experiential color — no gestalt associations, no weighted connections, no sense of which relationships are stronger than others.
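
The architecture is easy to see in code. A minimal sketch using NLTK's bundled WordNet reader: every edge is typed and hierarchical, and none carries a weight.

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-time corpus download
from nltk.corpus import wordnet as wn

# "What other words share this meaning?" -- the question WordNet answers.
dog = wn.synset("dog.n.01")
print(dog.lemma_names())   # ['dog', 'domestic_dog', 'Canis_familiaris']

# Typed hierarchical edges: hypernymy ("dog IS-A ...").
print(dog.hypernyms())     # [Synset('canine.n.02'), Synset('domestic_animal.n.01')]

# Absent by design: no weights, no association strengths, no way to
# ask "what does 'dog' make you think of?"
```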

Pros:
  • Academic gold standard
  • Clean hierarchical structure
  • Typed relationships (hypernym, meronym)
  • Well-documented
  • Free for research
Cons:
  • Scale: only ~155K words
  • Frozen since 2006 (last major update)
  • No weights or gradation
  • No experiential/gestalt associations
  • Academic biases (grad-student creators)
  • Excludes function words entirely
Bottom line: Brilliant for what it was designed for — modeling synonym clusters and taxonomic hierarchies. But it’s not a tool for semantic exploration or free-association. Different architecture, different problem.

ConceptNet

ConceptNet emerged from MIT Media Lab’s Open Mind Common Sense project (1999), founded by Push Singh and Marvin Minsky. The insight was that AI systems lacked everyday knowledge humans take for granted — things like “ice is cold” or “people eat when hungry.” The approach was bottom-up crowdsourcing: ordinary people contributed statements in natural language, parsed into structured relations. By ConceptNet 5.x, it had grown to ~21 million edges connecting 8 million nodes across 300+ languages, integrating WordNet, Wiktionary, and other sources.

The core limitation is granularity and focus. ConceptNet’s relations are coarse-grained and commonsense-focused — great for “dogs are animals” but not for the fine-grained semantic distinctions that make word puzzles interesting. Its crowdsourced origins also mean significant noise. Large language models have now largely absorbed this kind of commonsense knowledge, making ConceptNet less central to current research.
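
Both the typed relations and the unreliable weights are easy to inspect through ConceptNet's public REST API. A minimal sketch:

```python
import requests

# Pull a handful of edges around the English term "ice".
resp = requests.get("https://api.conceptnet.io/c/en/ice", params={"limit": 5})
resp.raise_for_status()

for edge in resp.json()["edges"]:
    # Each edge is a coarse typed relation with a crowd-derived weight.
    print(f'{edge["start"]["label"]} -{edge["rel"]["label"]}-> '
          f'{edge["end"]["label"]}  (weight {edge["weight"]})')
```

You get back edges like "ice IsA solid" alongside noisier crowd contributions; the weight field roughly tracks how often and how reliably a statement was asserted, not fine-grained semantic strength.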

Pros:
  • Large-scale (21M+ edges)
  • Typed relationships (IsA, PartOf, UsedFor...)
  • Multilingual (300+ languages)
  • API available
  • CC license
Cons:
  • Noisy data (crowdsourced origins)
  • Coarse-grained relations
  • Unreliable weight scores
  • Commonsense focus, not lexical nuance
  • Largely superseded by LLMs for its original purpose
Bottom line: Pioneering work in commonsense AI, but the crowdsourced approach traded precision for scale. Good for broad commonsense inference, not for fine-grained semantic exploration.

DBpedia / Wikidata

DBpedia (2007) extracts structured data from Wikipedia infoboxes — it’s derived, not primary. No one contributes to DBpedia directly; it parses whatever Wikipedia editors put in infoboxes. Wikidata (2012) is the opposite: a primary, community-curated knowledge base where humans and bots enter statements directly. Originally created to centralize Wikipedia’s interlanguage links, Wikidata has grown into the world’s largest open knowledge graph with 100+ million items.

Both are entity databases, not lexical resources. They capture factual relationships — born-in, instance-of, part-of — not associative or conceptual proximity. Wikidata will tell you “cat is-a mammal”; it won’t tell you that “cat” evokes “curiosity” and “nine lives.” Different problem entirely.
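
The entity-vs-lexicon distinction shows up in the very first query. A minimal sketch against the public Wikidata SPARQL endpoint (Q146 is the item for the house cat, P31 the instance-of property):

```python
import requests

# Ask Wikidata: what is "cat" (Q146) an instance of?
query = """
SELECT ?classLabel WHERE {
  wd:Q146 wdt:P31 ?class .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    # Wikimedia asks clients to send a descriptive User-Agent;
    # this value is just an example.
    headers={"User-Agent": "semantic-demo/0.1"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["classLabel"]["value"])
```

You get taxonomic facts back; there is no query you can write here for "words that evoke cat."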

Pros:
  • Massive scale (Wikidata: 100M+ items)
  • Structured and queryable (SPARQL)
  • Actively maintained
  • Powers Google Knowledge Graph, Apple, etc.
  • Free and open
Cons:
  • Entity database, not a lexical resource
  • Factual relationships only (born-in, instance-of)
  • No word-sense information
  • No synonym/antonym/association relationships
  • Focuses on named entities, not concepts
Bottom line: Excellent for “Who is the president of France?” or “What year was X born?” Wrong tool entirely for “How does ‘democracy’ relate to ‘freedom’?” These are knowledge graphs for facts, not semantic networks for meaning.

Comparison Matrix

Feature                 | Wiktionary      | WordNet                   | ConceptNet           | Linguabase
------------------------|-----------------|---------------------------|----------------------|---------------------
Relationship structure  | None            | Typed (hypernym, meronym) | Typed (36 relations) | Weighted by strength
Relationship weights    | No              | No                        | Yes (unreliable)     | Yes (curated)
Data quality            | Variable        | High (dated)              | Low-medium           | High
Graph operations        | No              | Limited                   | Yes                  | Yes
Commercial use          | CC license      | Unclear                   | CC license           | Licensed
Active maintenance      | Yes (community) | Minimal                   | Limited              | Yes
Production-ready API    | No              | No                        | Yes (limited)        | Yes
Sense-balanced coverage | No              | No                        | No                   | Yes
Directional weights     | No              | No                        | No                   | Yes
False cognate removal   | No              | No                        | No                   | 291K audited
Gestalt/experiential    | No              | No                        | Partial              | Yes
Vocabulary scale        | 7M (raw)        | 155K                      | ~300K                | 1.5M (400K prod)

What Free Sources Can’t Do

Even if you combine all free sources, you still won’t have:

  • Curated, reliable relationship weights
  • Directional weights (A can evoke B far more strongly than B evokes A)
  • Sense-balanced coverage
  • Gestalt and experiential associations
  • Audited false-cognate removal

These aren’t cleanup problems. They’re architectural gaps. Free sources were built for lookup, not exploration.

The Work Problem (Secondary)

Beyond capability gaps, free sources also require substantial engineering before they’re usable:

  • Scraping and parsing (no structured APIs; wildly inconsistent markup)
  • Cleaning crowdsourced noise and spam
  • Reconciling incompatible schemas across sources
  • Ongoing maintenance as upstream data changes

Linguabase: Over a decade of engineering already done. One clean API. Production-tested.