Structured data files you can embed, query, or integrate directly.
| Layer | What It Contains | Scale |
|---|---|---|
| Vocabulary Rankings | Master word list ranked by importance/suitability (common → obscure → noise) | 1.5M words |
| Core Associations | Primary weighted relationships per headword (~40 each) | 397K headwords |
| Sense Clouds | Associations grouped by word sense (bank_financial, bank_river, etc.) | 1.18M entries |
| Word Families | Morphological + etymological groupings (elephant → elephants, elephantine) | 385K families |
| Definitions | Narrative paragraphs covering all senses naturally | 400K words |
| Content Filters | Hard block (offensive) + soft block (suggestive) word lists | ~8K terms |
| Usage Examples | Illustrative quotations from literature, journalism, and scholarly sources | 1.46M quotes |
There’s no precise count for “words in English”: variations, compounds, and proper nouns blur every boundary. Ranking words by “importance” is also surprisingly hard: pure frequency metrics rank “the” highest but miss that a multi-word term like “red panda” is widely understood.
The Linguabase word ranking is an effective proxy for which words a broad user base knows, including English language learners. The highest ranks align with vocabulary standards such as the Oxford 3000 and CEFR levels, extended with common derivatives (both “actor” and “actors”). From there the ranking continues through specialized jargon down to noise.
Our recommendation: 400K words. This covers every plausible word a user would want to interact with or see in a game. But we deliver à la carte:
| Vocabulary | Use Case | Sample Words at Threshold |
|---|---|---|
| ~100K | Abridged dictionary scope | conspecific, impeccability, shalwar |
| ~400K (recommended) | Full production coverage | carboxylesterase, soft butch, naval strategies |
| ~600K | Inclusive with more noise | divinas, benefactor’s, rock n’ roll |
| 1.5M | Research/completeness | selenocysteines, straticulate, clobenpropit |
The ranking includes commonly known proper nouns (Paris, Einstein, Toyota). If your application requires a no-proper-noun rule — like classic Scrabble — we can filter accordingly.
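As a rough illustration of how a vocabulary cutoff and a no-proper-noun rule might be applied on the consumer side, here is a minimal Python sketch. The column names (`rank`, `word`, `is_proper`) and the sample rows are assumptions made up for this example; they are not the actual Linguabase schema.

```python
import csv
import io

# Hypothetical ranking data: column names and row values are invented
# for illustration and do not reflect the real delivery format.
SAMPLE_TSV = (
    "rank\tword\tis_proper\n"
    "1\tthe\t0\n"
    "2\tParis\t1\n"
    "3\tactor\t0\n"
    "4\tactors\t0\n"
    "5\tstraticulate\t0\n"
)


def select_vocabulary(tsv_text, max_rank, allow_proper_nouns=True):
    """Return words ranked at or above the cutoff, optionally dropping proper nouns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    words = []
    for row in reader:
        if int(row["rank"]) > max_rank:
            continue  # below the chosen vocabulary threshold
        if not allow_proper_nouns and row["is_proper"] == "1":
            continue  # Scrabble-style rule: skip proper nouns
        words.append(row["word"])
    return words


print(select_vocabulary(SAMPLE_TSV, max_rank=4))
print(select_vocabulary(SAMPLE_TSV, max_rank=4, allow_proper_nouns=False))
```

In a real integration the cutoff would be one of the thresholds in the table above (for example, roughly 400K), and the proper-noun filter could equally be applied before delivery.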
Core Associations — your primary association lookup:
Sense Clouds — associations grouped by meaning:
Actual schema varies by delivery format. We’ll work with you on integration.
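Since the schema depends on the delivery format, the following Python sketch only shows one plausible in-memory shape for the two layers. The class names, field names, weights, and associated words are all hypothetical placeholders; only the sense labels `bank_financial` and `bank_river` come from the layer description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Association:
    target: str    # the associated word (hypothetical field name)
    weight: float  # relationship strength; the scale is an assumption


@dataclass
class SenseCloud:
    sense_id: str  # e.g. "bank_financial"
    associations: List[Association] = field(default_factory=list)


def top_associations(associations, n):
    """Return the n strongest association targets, highest weight first."""
    ranked = sorted(associations, key=lambda a: a.weight, reverse=True)
    return [a.target for a in ranked[:n]]


def senses_for(headword, clouds):
    """Return the sense clouds whose id starts with '<headword>_'."""
    prefix = headword + "_"
    return [c for c in clouds if c.sense_id.startswith(prefix)]


# Placeholder data, invented for illustration only.
CLOUDS = [
    SenseCloud("bank_financial", [Association("loan", 0.6), Association("money", 0.9)]),
    SenseCloud("bank_river", [Association("shore", 0.8)]),
]

for cloud in senses_for("bank", CLOUDS):
    print(cloud.sense_id, top_associations(cloud.associations, 2))
```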
| Format | Best For |
|---|---|
| TSV files | Embedding in apps, offline use, full control |
| SQLite database | Local querying, mobile apps, simple integration |
| JSON export | Web applications, JavaScript environments |
| API access | Server-side integration, real-time queries (optional add-on) |
| Custom format | Whatever your stack needs — we’re flexible |
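To give a feel for how the TSV and SQLite options relate, here is a hedged Python sketch that loads a small TSV into an in-memory SQLite table and queries it with the standard library. The table name, column names, and sample rows are assumptions for illustration, not the delivered schema.

```python
import csv
import io
import sqlite3

# Hypothetical association rows; names and weights are placeholders.
SAMPLE_TSV = (
    "headword\tassociation\tweight\n"
    "elephant\ttrunk\t0.9\n"
    "elephant\tivory\t0.7\n"
    "bank\tmoney\t0.8\n"
)


def load_associations(conn, tsv_text):
    """Create the table if needed and insert every TSV row into it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS associations "
        "(headword TEXT, association TEXT, weight REAL)"
    )
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(rows)  # skip the header line
    conn.executemany(
        "INSERT INTO associations VALUES (?, ?, ?)",
        ((h, a, float(w)) for h, a, w in rows),
    )
    conn.commit()


def lookup(conn, headword, limit=40):
    """Return (association, weight) pairs for a headword, strongest first."""
    cur = conn.execute(
        "SELECT association, weight FROM associations "
        "WHERE headword = ? ORDER BY weight DESC LIMIT ?",
        (headword, limit),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load_associations(conn, SAMPLE_TSV)
print(lookup(conn, "elephant"))
```

The default `limit=40` simply mirrors the roughly 40 associations per headword mentioned for the Core Associations layer.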
Can I see sample data before licensing?
Yes. Contact us and we’ll send evaluation samples for your specific use case.
What about updates?
Linguabase is actively maintained: new vocabulary is added as words enter mainstream usage, connection weights are refined, and ongoing auditing catches edge cases. The license includes the current data, and update arrangements are available.
Do you provide integration support?
Yes. We offer consulting on graph operations (pathfinding, distance calculation) and can help with integration into your stack.
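As one example of the kind of graph operation mentioned here, the sketch below runs Dijkstra's algorithm over a small weighted association graph to find a path and a distance between two words. The graph data is invented, and treating `1 / weight` as the edge cost is an assumption chosen for illustration (stronger associations become shorter edges); it is not a description of how Linguabase computes distance.

```python
import heapq

# Placeholder directed graph: word -> {associated word: weight}.
# Words and weights are invented for illustration.
GRAPH = {
    "elephant": {"trunk": 0.9, "ivory": 0.5},
    "trunk": {"tree": 0.8},
    "ivory": {"piano": 0.6},
    "tree": {},
    "piano": {},
}


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns (total_cost, path) or None if unreachable."""
    queue = [(0.0, start, [start])]
    best = {start: 0.0}
    while queue:
        cost, word, path = heapq.heappop(queue)
        if word == goal:
            return cost, path
        if cost > best.get(word, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, weight in graph.get(word, {}).items():
            if weight <= 0:
                continue  # ignore non-positive weights (assumed invalid)
            new_cost = cost + 1.0 / weight  # assumed cost: inverse of weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return None


print(shortest_path(GRAPH, "elephant", "tree"))
```

Because the placeholder graph is directed, `shortest_path(GRAPH, "tree", "elephant")` returns `None`; whether associations should be treated as symmetric is a design choice to settle per application.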
What’s NOT included?
Linguabase is available for licensing to word game studios, AI/LLM companies, educational technology developers, and makers of productivity tools.
Tell us about your use case and we’ll send relevant samples and discuss licensing options.
Linguabase is developed by IDEA.org, a research organization focused on language, games, and education. We’ve been building language data since 2011 and currently publish word games including “In Other Words” (iOS).