What You Get

Structured data files you can embed, query, or integrate directly.

The Data Layers

Layer | What It Contains | Scale
Vocabulary Rankings | Master word list ranked by importance/suitability (common → obscure → noise) | 1.5M words
Core Associations | Primary weighted relationships per headword (~40 each) | 397K headwords
Sense Clouds | Associations grouped by word sense (bank_financial, bank_river, etc.) | 1.18M entries
Word Families | Morphological + etymological groupings (elephant → elephants, elephantine) | 385K families
Definitions | Narrative paragraphs covering all senses naturally | 400K words
Content Filters | Hard block (offensive) + soft block (suggestive) word lists | ~8K terms
Usage Examples | Illustrative quotations from literature, journalism, and scholarly sources | 1.46M quotes

Vocabulary Size: Your Choice

There’s no precise count for “words in English” — variations, compounds, and proper nouns blur every boundary. And ranking words by “importance” is surprisingly hard: pure frequency metrics rank “the” highest while missing that “red panda” is widely understood.

The Linguabase word ranking is an effective proxy for which words are known to a broad user base, including English language learners. The highest ranks align with vocabulary standards like the Oxford 3000 and CEFR levels, but are enhanced with common derivatives (both “actor” and “actors”). From there, the ranking extends through specialized jargon down to noise.

Our recommendation: 400K words. This covers every plausible word a user would want to interact with or see in a game. But we deliver à la carte:

Vocabulary | Use Case | Sample Words at Threshold
~100K | Abridged dictionary scope | conspecific, impeccability, shalwar
~400K (recommended) | Full production coverage | carboxylesterase, soft butch, naval strategies
~600K | Inclusive with more noise | divinas, benefactor’s, rock n’ roll
1.5M | Research/completeness | selenocysteines, straticulate, clobenpropit

The ranking includes commonly known proper nouns (Paris, Einstein, Toyota). If your application requires a no-proper-noun rule — like classic Scrabble — we can filter accordingly.
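As a rough sketch of how a vocabulary cutoff and a no-proper-noun rule might be applied on the consuming side: the example below assumes a tab-separated ranking file with hypothetical columns `word`, `rank`, and `is_proper`. Neither the column names nor the rank values come from the actual Linguabase schema; they are placeholders for illustration.

```python
import csv
import io

# Illustrative rows only: the column names (word, rank, is_proper) and the
# rank values are assumptions, not the actual Linguabase schema.
SAMPLE_RANKINGS = (
    "word\trank\tis_proper\n"
    "the\t1\t0\n"
    "Paris\t2500\t1\n"
    "red panda\t30000\t0\n"
    "conspecific\t100000\t0\n"
    "carboxylesterase\t400000\t0\n"
    "clobenpropit\t1500000\t0\n"
)


def load_vocabulary(tsv_text, max_rank, allow_proper_nouns=True):
    """Return words whose rank is within the cutoff, in file order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    words = []
    for row in reader:
        if int(row["rank"]) > max_rank:
            continue  # ranked below the chosen vocabulary threshold
        if not allow_proper_nouns and row["is_proper"] == "1":
            continue  # e.g. a Scrabble-style no-proper-noun rule
        words.append(row["word"])
    return words
```

Calling `load_vocabulary(SAMPLE_RANKINGS, 400_000)` keeps everything up to the recommended threshold, while passing `allow_proper_nouns=False` additionally drops entries flagged as proper nouns.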

Sample Data Format

Core Associations — your primary association lookup:

headword | target | weight
elephant | trunk | 5.0
elephant | pachyderm | 5.0
elephant | mammoth | 4.2
elephant | savanna | 3.8
elephant | Ganesh | 2.1
elephant | memory | 1.8
elephant | Republican | 0.9
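A minimal sketch of an association lookup over data shaped like the sample above. It assumes a tab-separated file with a `headword`, `target`, `weight` header row; the real layout depends on the delivery format, and the function names here are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the sample above; the tab-separated layout and header
# names are assumptions, since the real schema varies by delivery format.
SAMPLE_ASSOCIATIONS = (
    "headword\ttarget\tweight\n"
    "elephant\ttrunk\t5.0\n"
    "elephant\tpachyderm\t5.0\n"
    "elephant\tmammoth\t4.2\n"
    "elephant\tsavanna\t3.8\n"
    "elephant\tGanesh\t2.1\n"
    "elephant\tmemory\t1.8\n"
    "elephant\tRepublican\t0.9\n"
)


def load_associations(tsv_text):
    """Build {headword: [(target, weight), ...]} sorted by descending weight."""
    table = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        table[row["headword"]].append((row["target"], float(row["weight"])))
    for pairs in table.values():
        # The sort is stable, so equal weights keep their file order.
        pairs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(table)


def top_associations(table, headword, min_weight=0.0, limit=None):
    """Return (target, weight) pairs at or above min_weight, strongest first."""
    pairs = [p for p in table.get(headword, []) if p[1] >= min_weight]
    return pairs if limit is None else pairs[:limit]
```

A weight floor such as `min_weight=3.0` is one way to keep only the strongest links (here: trunk, pachyderm, mammoth, savanna); the right cutoff would depend on the application.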

Sense Clouds — associations grouped by meaning:

headword | sense | targets
bank | bank_financial | money, deposit, loan, vault, teller, savings...
bank | bank_river | shore, embankment, riverbank, waterway...
bank | bank_pool | cushion, rail, pocket, billiards...
bank | bank_aviation | tilt, turn, angle, maneuver...
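One possible use of sense-grouped data is picking the sense that best fits some surrounding words. The sketch below uses a naive overlap count, which is an illustration and not a method described by Linguabase. It assumes a tab-separated layout with comma-separated targets and includes only the targets visible in the sample above (the sample lists are truncated).

```python
import csv
import io

# Only the targets shown in the sample are included; the real lists are
# longer. The tab-separated layout and header names are assumptions.
SAMPLE_SENSES = (
    "headword\tsense\ttargets\n"
    "bank\tbank_financial\tmoney, deposit, loan, vault, teller, savings\n"
    "bank\tbank_river\tshore, embankment, riverbank, waterway\n"
    "bank\tbank_pool\tcushion, rail, pocket, billiards\n"
    "bank\tbank_aviation\ttilt, turn, angle, maneuver\n"
)


def load_sense_clouds(tsv_text):
    """Return {headword: {sense: [targets]}}."""
    clouds = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        targets = [t.strip() for t in row["targets"].split(",") if t.strip()]
        clouds.setdefault(row["headword"], {})[row["sense"]] = targets
    return clouds


def guess_sense(clouds, headword, context_words):
    """Pick the sense sharing the most targets with the context words.

    Returns None when no sense overlaps with the context at all.
    """
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, 0
    for sense, targets in clouds.get(headword, {}).items():
        overlap = len(context & set(targets))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

With this sample, context words such as "loan" and "teller" select `bank_financial`, while "turn" and "angle" select `bank_aviation`.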

Actual schema varies by delivery format. We’ll work with you on integration.

What You Can Build

Delivery Options

Format | Best For
TSV files | Embedding in apps, offline use, full control
SQLite database | Local querying, mobile apps, simple integration
JSON export | Web applications, JavaScript environments
API access | Server-side integration, real-time queries (optional add-on)
Custom format | Whatever your stack needs (we’re flexible)
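To give a feel for local querying with the SQLite option, here is a small sketch using Python's standard `sqlite3` module. The `associations` table, its columns, and the index name are hypothetical, and the demo builds an in-memory database from a few sample rows rather than opening a delivered file.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical associations table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE associations (headword TEXT, target TEXT, weight REAL)"
    )
    conn.executemany(
        "INSERT INTO associations VALUES (?, ?, ?)",
        [
            ("elephant", "trunk", 5.0),
            ("elephant", "pachyderm", 5.0),
            ("elephant", "mammoth", 4.2),
            ("elephant", "savanna", 3.8),
        ],
    )
    # An index on headword keeps per-word lookups fast on large tables.
    conn.execute("CREATE INDEX idx_headword ON associations (headword)")
    conn.commit()
    return conn


def strongest_targets(conn, headword, limit=3):
    """Return the strongest (target, weight) rows for one headword."""
    rows = conn.execute(
        "SELECT target, weight FROM associations "
        "WHERE headword = ? ORDER BY weight DESC, target ASC LIMIT ?",
        (headword, limit),
    )
    return rows.fetchall()
```

Parameterized queries (`?` placeholders) are used so that user-typed words can be passed in safely.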

Typical Questions

Can I see sample data before licensing?

Yes. Contact us and we’ll send evaluation samples for your specific use case.

What about updates?

Linguabase is actively maintained: new vocabulary is added as words enter mainstream usage, connection weights are refined, and ongoing auditing catches edge cases. The license includes the current data, and update arrangements are available.

Do you provide integration support?

Yes. We offer consulting on graph operations (pathfinding, distance calculation) and can help with integration into your stack.
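For a sense of what pathfinding over weighted associations can look like, here is a sketch using Dijkstra's algorithm. Several things are assumptions made for illustration: edges are treated as directed (headword to target), the step cost is taken as `1 / weight` so stronger links count as shorter, and every edge marked "hypothetical" is invented for the demo rather than taken from Linguabase data.

```python
import heapq

DEMO_EDGES = [
    ("elephant", "trunk", 5.0),    # from the sample data
    ("elephant", "memory", 1.8),   # from the sample data
    ("trunk", "luggage", 4.0),     # hypothetical
    ("memory", "computer", 4.0),   # hypothetical
    ("luggage", "airport", 4.5),   # hypothetical
    ("computer", "airport", 0.5),  # hypothetical
]


def build_graph(edges):
    """Build a directed adjacency list with cost = 1 / weight (an assumption)."""
    graph = {}
    for head, target, weight in edges:
        graph.setdefault(head, []).append((target, 1.0 / weight))
        graph.setdefault(target, [])
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns (total_cost, path) or (inf, [])."""
    if start not in graph or goal not in graph:
        return float("inf"), []
    best = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, step in graph[node]:
            new_cost = cost + step
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if goal not in best:
        return float("inf"), []
    # Walk the predecessor links back from the goal to rebuild the path.
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    path.reverse()
    return best[goal], path
```

With these demo edges, the route from "elephant" to "airport" runs through "trunk" and "luggage", because the strong links along that chain add up to a lower total cost than the route through "memory" and "computer".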

What’s NOT included?

Get in Touch

Linguabase is available for licensing to word game studios, AI/LLM companies, educational technology developers, and productivity tools.

Email: linguabase@idea.org

Tell us about your use case and we’ll send relevant samples and discuss licensing options.

About IDEA.org

Linguabase is developed by IDEA.org, a research organization focused on language, games, and education. We’ve been building language data since 2011 and currently publish word games including “In Other Words” (iOS).