Words, links, meaning. License what you need.
| # | Layer | What It Contains | Scale | Size |
|---|---|---|---|---|
| 1 | Vocabulary | Validated words with difficulty scores—from everyday vocabulary to crossword-worthy rarities. Filter by obscurity, include or exclude proper nouns. | 400K–1.5M words | |
| 2 | Definitions | Readable 2–3 sentence paragraphs that cover all of a word's meanings in flowing prose. | Coverage for nearly all vocabulary | |
| 3 | Content Filters | Hard block (offensive) + soft block (suggestive) word lists for content control. | ~6K terms | |
| 4 | Word Associations | ~40 related words per entry, weighted by connection strength. Includes sense-level pools organized by meaning. | ~40M connections, scales with vocabulary | |
| 5 | Word Families | Morphological + etymological groupings (elephant → elephants, elephantine). | Coverage for nearly all vocabulary | |
| 6 | Usage Examples | Illustrative quotations: common words from literature, uncommon words from Wikipedia and open-access sources. | 1.46M quotes | |
Sizes shown are what we ship in-bundle for our own apps—compressed and with DRM applied. Many tables use 3-byte word IDs instead of strings, which keeps file sizes small while preserving fast lookups.
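To make the 3-byte ID idea concrete, here is a minimal sketch of how fixed-width IDs can be packed and read back. It is an illustration only, not the shipped file format: the little-endian byte order, the helper names `pack_ids` and `unpack_id`, and the tiny word list are all assumptions. Three bytes can address 2^24 (about 16.7 million) distinct IDs, which comfortably covers a 1.5M-word vocabulary.

```python
ID_BYTES = 3  # 3 bytes address 2**24 = 16,777,216 distinct word IDs


def pack_ids(ids):
    """Concatenate word IDs into one bytes object, 3 bytes per ID.

    Little-endian order is an assumption made for this sketch.
    int.to_bytes raises OverflowError if an ID does not fit in 3 bytes.
    """
    return b"".join(i.to_bytes(ID_BYTES, "little") for i in ids)


def unpack_id(buf, index):
    """Read the ID at position `index` without decoding the rest of the buffer."""
    start = index * ID_BYTES
    return int.from_bytes(buf[start:start + ID_BYTES], "little")


# Hypothetical word list: a word's position in the list is its ID.
words = ["apple", "banana", "cherry"]

# Store "banana" and "cherry" as related words using 6 bytes in total.
related = pack_ids([1, 2])
second_related_word = words[unpack_id(related, 1)]
```

Because every ID has the same width, the n-th entry sits at a known byte offset, so a lookup is a slice rather than a scan.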
Embed as files, query via API, or get custom exports. For game studios that want turnkey content, we also offer ready-made puzzle data built on these layers.
There’s no precise count for “words in English”—variations, compounds, and proper nouns blur every boundary. And ranking words by “importance” is surprisingly hard: pure frequency metrics rank “the” highest while missing that “red panda” is widely understood.
The Linguabase word ranking is an effective proxy for which words a broad user base knows, including English language learners. The highest ranks align with vocabulary standards such as the Oxford 3000 and CEFR levels, extended with common derivatives (both “actor” and “actors”). From there the ranking extends through specialized jargon down to noise.
Our recommendation: 400K words. This covers every plausible word a user would want to interact with or see in a game. But we deliver à la carte:
| Vocabulary | Use Case | Sample Words at Threshold |
|---|---|---|
| ~100K | Abridged dictionary scope | conspecific, impeccability, shalwar |
| ~400K (recommended) | Full production coverage | carboxylesterase, soft butch, naval strategies |
| ~600K | Inclusive with more noise | divinas, benefactor’s, rock n’ roll |
| 1.5M | Research/completeness | selenocysteines, straticulate, clobenpropit |
The ranking includes commonly known proper nouns (Paris, Einstein, Toyota). If your application requires a no-proper-noun rule—like classic Scrabble—we can filter accordingly.
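The threshold and proper-noun choices above amount to a simple filter over ranked entries. The sketch below assumes a hypothetical record with `rank` and `proper_noun` fields and the convention that rank 1 is the most widely known word. The rank numbers are invented for illustration and the actual delivered fields may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    word: str
    rank: int          # assumed convention: 1 = most widely known
    proper_noun: bool  # hypothetical flag, used for Scrabble-style rules


def select_vocabulary(entries, max_rank, allow_proper_nouns=True):
    """Keep words at or above the rank cutoff, optionally dropping proper nouns."""
    return [
        e.word
        for e in entries
        if e.rank <= max_rank and (allow_proper_nouns or not e.proper_noun)
    ]


# Invented sample data; the ranks are placeholders, not real Linguabase values.
entries = [
    Entry("the", 1, False),
    Entry("Paris", 2_500, True),
    Entry("conspecific", 100_000, False),
    Entry("carboxylesterase", 400_000, False),
    Entry("clobenpropit", 1_500_000, False),
]

# A 400K cutoff with a no-proper-noun rule.
game_words = select_vocabulary(entries, 400_000, allow_proper_nouns=False)
```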
Vocabulary (words with difficulty scores):
Word Associations:
Sense Clouds—associations grouped by meaning:
Actual schema varies by delivery format. We’ll work with you on integration.
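Purely as an illustration of how weighted associations can be grouped into sense-level pools, here is a small sketch. Every field name, sense label, weight, and word in it is hypothetical and is not taken from the delivered schema.

```python
from collections import defaultdict
from typing import NamedTuple


class Association(NamedTuple):
    source: str
    target: str
    weight: float  # assumed: higher means a stronger connection
    sense: str     # hypothetical label for the meaning this link belongs to


# Invented rows for a single headword.
rows = [
    Association("bank", "river", 0.8, "shoreline"),
    Association("bank", "money", 0.9, "finance"),
    Association("bank", "loan", 0.7, "finance"),
    Association("bank", "shore", 0.6, "shoreline"),
]


def sense_clouds(rows, word):
    """Group a word's associations by sense, strongest connection first."""
    clouds = defaultdict(list)
    for row in sorted(rows, key=lambda r: r.weight, reverse=True):
        if row.source == word:
            clouds[row.sense].append(row.target)
    return dict(clouds)


clouds = sense_clouds(rows, "bank")
```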
Or skip building entirely—we can generate production-ready puzzle data for your game mechanics directly.
| Format | Best For |
|---|---|
| TSV files | Embedding in apps, offline use, full control |
| SQLite database | Local querying, mobile apps, simple integration |
| JSON export | Web applications, JavaScript environments |
| API access | Server-side integration, real-time queries (optional add-on) |
| Custom format | Whatever your stack needs—we’re flexible |
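For teams weighing TSV against SQLite, the sketch below loads a small tab-separated export into an in-memory SQLite table and queries it, using only the Python standard library. The column names `word` and `difficulty`, the table name, and the sample rows are assumptions made for this example, not the delivered layout.

```python
import csv
import io
import sqlite3

# Invented stand-in for a delivered TSV file; the header names are assumed.
TSV = "word\tdifficulty\napple\t1\nconspecific\t7\nshalwar\t8\n"


def load_vocabulary(tsv_text):
    """Load tab-separated rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vocabulary (word TEXT PRIMARY KEY, difficulty INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    conn.executemany(
        "INSERT INTO vocabulary (word, difficulty) VALUES (?, ?)",
        ((row["word"], int(row["difficulty"])) for row in reader),
    )
    conn.commit()
    return conn


conn = load_vocabulary(TSV)

# Parameterized query: every word at or below a difficulty cutoff.
easy_words = [
    word
    for (word,) in conn.execute(
        "SELECT word FROM vocabulary WHERE difficulty <= ? ORDER BY word", (7,)
    )
]
```

Swapping `":memory:"` for a file path would persist the same table on disk, which is the usual route for bundling it with a mobile app.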
Can I see sample data before licensing?
Yes. Contact us and we’ll send evaluation samples for your specific use case.
What about updates?
Linguabase is actively maintained: new vocabulary is added as words enter mainstream usage, connection weights are refined, and ongoing auditing catches edge cases. The license includes the current data, and update arrangements are available.
Do you provide integration support?
Yes. We offer consulting on graph operations (pathfinding, distance calculation) and can help with integration into your stack.
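As one example of the kind of graph operation involved, the sketch below finds the strongest chain of associations between two words with Dijkstra's algorithm. The choice of algorithm, the 1/weight cost function, and the tiny graph are assumptions for illustration, not a description of Linguabase's own tooling.

```python
import heapq

# Invented association graph: word -> {related word: connection weight}.
# Assumed convention: a higher weight means a stronger connection.
graph = {
    "cat": {"pet": 0.9, "tiger": 0.5},
    "pet": {"dog": 0.9},
    "tiger": {"dog": 0.1},
    "dog": {},
}


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm with edge cost 1/weight, so strong links are cheap.

    Returns (distance, path), or (inf, []) when the goal is unreachable.
    """
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(
                    queue, (dist + 1.0 / weight, neighbor, path + [neighbor])
                )
    return float("inf"), []


distance, path = shortest_path(graph, "cat", "dog")
```

Here the route through "pet" wins over the route through "tiger" because its two strong links cost less in total than one medium link followed by a weak one.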
What’s NOT included?
Linguabase is available for licensing to word game studios, AI/LLM companies, educational technology developers, and makers of productivity tools. Game studios looking for turnkey content can also explore custom puzzle generation.
Tell us about your use case and we’ll send relevant samples and discuss licensing options.
If you need:
We built Linguabase for exploration and games, not reference lookups.
Linguabase is developed by IDEA.org, a research organization focused on language, games, and education. We’ve been building language data since 2011 and currently publish word games including “In Other Words” (iOS).