License What You Need

Six data layers, each solving a different problem.

400K words · 400K definitions · ~40M connections

Ship word data that works offline, loads fast, and fits in your app bundle. We built Linguabase for our own games—In Other Words (launched 2025) and OtherWordly (coming soon)—so every layer is optimized for mobile deployment: clean formats, no server round-trips required, and sizes you can actually ship.

| Layer | What It Contains | Scale | Size |
|---|---|---|---|
| 1. Vocabulary | Validated words with difficulty scores, from everyday vocabulary to crossword-worthy rarities. Filter by obscurity, include or exclude proper nouns. | 400K–1.5M words | 4 MB |
| 2. Definitions | Readable 2–3 sentence paragraphs covering all meanings naturally. | Coverage for nearly all vocabulary | 97 MB |
| 3. Content Filters | Hard block (offensive) + soft block (suggestive) word lists for content control. | ~6K terms | 56 KB |
| 4. Word Associations | ~40 related words per entry, weighted by connection strength. Includes sense-level pools organized by meaning. | ~40M connections, scales with vocabulary | 64–240 MB |
| 5. Word Families | Morphological + etymological groupings (elephant → elephants, elephantine). | Coverage for nearly all vocabulary | 22 MB |
| 6. Usage Examples | Illustrative quotations: common words from literature, uncommon words from Wikipedia and open-access sources. | 1.46M quotes | 260 MB |

Sizes shown are what we ship in-bundle for our own apps—compressed and with DRM applied. Many tables use 3-byte word IDs instead of strings, which keeps file sizes small while preserving fast lookups.
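As a rough illustration of why 3-byte IDs are enough: three bytes address 2^24 = 16,777,216 distinct values, comfortably above the 1.5M-word ceiling. The sketch below packs word IDs into 3 bytes each and reads one back by position. The little-endian layout and the function names are assumptions made for this example, not the actual Linguabase file format.

```python
ID_BYTES = 3  # 2**24 = 16,777,216 possible IDs

def pack_ids(ids):
    """Pack integer word IDs into a flat bytes object, 3 bytes per ID."""
    out = bytearray()
    for word_id in ids:
        if not 0 <= word_id < 1 << (8 * ID_BYTES):
            raise ValueError(f"ID {word_id} does not fit in {ID_BYTES} bytes")
        out += word_id.to_bytes(ID_BYTES, "little")  # assumed byte order
    return bytes(out)

def read_id(buf, index):
    """Read the ID at a given position without unpacking the whole buffer."""
    start = index * ID_BYTES
    return int.from_bytes(buf[start:start + ID_BYTES], "little")

# Three IDs take 9 bytes, versus 12 bytes as 32-bit integers.
packed = pack_ids([4523, 89234, 312456])
```

Because every entry has the same width, a lookup is a single slice at a computed offset, which is what keeps fixed-width tables fast to query.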

Ready to Integrate

Embed as files, query via API, or get custom exports.

Vocabulary Size: Your Choice

There’s no precise count for “words in English”—variations, compounds, and proper nouns blur every boundary. And ranking words by “importance” is surprisingly hard: pure frequency metrics rank “the” highest while missing that “red panda” is widely understood.

The Linguabase word ranking is an effective proxy for which words are known to a broad user base, including English language learners. The highest ranks align with vocabulary standards like the Oxford 3000 and CEFR levels, extended with common derivatives (both “actor” and “actors”). From there the ranking continues through specialized jargon and ends in noise.

Our recommendation: 400K words. This covers every plausible word a user would want to interact with or see in a game. But we deliver à la carte:

| Vocabulary | Use Case | Sample Words at Threshold |
|---|---|---|
| ~100K | Abridged dictionary scope | conspecific, impeccability, shalwar |
| ~400K (recommended) | Full production coverage | carboxylesterase, soft butch, naval strategies |
| ~600K | Inclusive with more noise | divinas, benefactor’s, rock n’ roll |
| 1.5M | Research/completeness | selenocysteines, straticulate, clobenpropit |

The ranking includes commonly known proper nouns (Paris, Einstein, Toyota). If your application requires a no-proper-noun rule—like classic Scrabble—we can filter accordingly.

Sample Data Format

Vocabulary (words with difficulty scores):

| word | rank | difficulty_tier |
|---|---|---|
| elephant | 4523 | common |
| pachyderm | 89234 | educated |
| Elephas | 312456 | technical |
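A minimal sketch of cutting the vocabulary down to a chosen size, assuming the TSV delivery uses the three columns shown in the sample and that `rank` counts up from the best-known word. The capitalization check is only a crude stand-in for a real proper-noun flag, which this sample does not show; `load_vocabulary` is a hypothetical helper.

```python
import csv
import io

# Sample rows from above, laid out as tab-separated text (assumed layout).
SAMPLE_TSV = (
    "word\trank\tdifficulty_tier\n"
    "elephant\t4523\tcommon\n"
    "pachyderm\t89234\teducated\n"
    "Elephas\t312456\ttechnical\n"
)

def load_vocabulary(tsv_text, max_rank, allow_capitalized=True):
    """Return words ranked at or above the cutoff, in file order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    words = []
    for row in reader:
        if int(row["rank"]) > max_rank:
            continue
        # Crude proxy for "no proper nouns": skip capitalized entries.
        if not allow_capitalized and row["word"][:1].isupper():
            continue
        words.append(row["word"])
    return words

abridged = load_vocabulary(SAMPLE_TSV, max_rank=100_000)
full = load_vocabulary(SAMPLE_TSV, max_rank=400_000)
lowercase_only = load_vocabulary(SAMPLE_TSV, max_rank=400_000,
                                 allow_capitalized=False)
```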

Word Associations:

| headword | target | weight |
|---|---|---|
| elephant | trunk | 5.0 |
| elephant | pachyderm | 5.0 |
| elephant | mammoth | 4.2 |
| elephant | savanna | 3.8 |
| elephant | Ganesh | 2.1 |
| elephant | memory | 1.8 |
| elephant | Republican | 0.9 |
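One way an app might consume rows like these is to index them by headword and filter by weight at lookup time. The sketch below uses the sample rows above; `build_index` and `related` are hypothetical helper names, and the weight threshold is an arbitrary example value.

```python
from collections import defaultdict

# Rows copied from the sample above: (headword, target, weight).
ROWS = [
    ("elephant", "trunk", 5.0),
    ("elephant", "pachyderm", 5.0),
    ("elephant", "mammoth", 4.2),
    ("elephant", "savanna", 3.8),
    ("elephant", "Ganesh", 2.1),
    ("elephant", "memory", 1.8),
    ("elephant", "Republican", 0.9),
]

def build_index(rows):
    """Group targets under each headword, strongest connection first."""
    index = defaultdict(list)
    for headword, target, weight in rows:
        index[headword].append((target, weight))
    for targets in index.values():
        # Stable sort: equal weights keep their original file order.
        targets.sort(key=lambda pair: -pair[1])
    return dict(index)

def related(index, headword, min_weight=0.0, limit=None):
    """Return targets at or above min_weight, optionally capped at limit."""
    hits = [t for t, w in index.get(headword, []) if w >= min_weight]
    return hits[:limit]

index = build_index(ROWS)
strong = related(index, "elephant", min_weight=3.0)
top_two = related(index, "elephant", limit=2)
```

A game could use a high threshold for easy puzzles (trunk, mammoth) and a lower one to admit looser links such as memory or Republican.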

Sense Clouds—associations grouped by meaning:

| headword | sense | targets |
|---|---|---|
| cloud | cloud_weather | sky, rain, cumulus, nimbus, overcast, storm... |
| cloud | cloud_computing | server, upload, storage, streaming, AWS... |
| cloud | cloud_figurative | vague, obscure, nebulous, hazy, uncertain... |
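Sense clouds lend themselves to simple disambiguation. The sketch below guesses which sense of a headword is meant by counting how many context words appear in each sense's targets. This overlap heuristic is our illustrative assumption, not a documented Linguabase feature, and the data holds only the truncated targets shown in the sample.

```python
# Sense clouds from the sample above, limited to the targets shown there.
SENSE_CLOUDS = {
    "cloud": {
        "cloud_weather": {"sky", "rain", "cumulus", "nimbus", "overcast", "storm"},
        "cloud_computing": {"server", "upload", "storage", "streaming", "AWS"},
        "cloud_figurative": {"vague", "obscure", "nebulous", "hazy", "uncertain"},
    }
}

def guess_sense(headword, context_words, clouds=SENSE_CLOUDS):
    """Return the sense sharing the most words with the context, or None."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, 0
    for sense, targets in clouds.get(headword, {}).items():
        overlap = len(context & {t.lower() for t in targets})
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

tech = guess_sense("cloud", ["upload", "photos", "to", "storage"])
weather = guess_sense("cloud", ["rain", "tomorrow"])
unknown = guess_sense("cloud", ["banana"])
```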

Actual schema varies by delivery format. We’ll work with you on integration.

What You Can Build

Delivery Options

| Format | Best For |
|---|---|
| TSV files | Embedding in apps, offline use, full control |
| SQLite database | Local querying, mobile apps, simple integration |
| JSON export | Web applications, JavaScript environments |
| API access | Server-side integration, real-time queries (optional add-on) |
| Custom format | Whatever your stack needs; we’re flexible |
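To give a feel for the SQLite option, here is a sketch of a local lookup using Python's standard `sqlite3` module. The table and column names (`words`, `associations`, `head_id`, `target_id`, `weight`) are invented for this example; the delivered schema may differ. It also assumes associations reference integer word IDs rather than strings.

```python
import sqlite3

# In-memory database with a hypothetical schema and a few toy rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
    CREATE TABLE associations (
        head_id   INTEGER NOT NULL REFERENCES words(id),
        target_id INTEGER NOT NULL REFERENCES words(id),
        weight    REAL NOT NULL
    );
    CREATE INDEX idx_assoc_head ON associations (head_id, weight DESC);
""")
conn.executemany(
    "INSERT INTO words (id, word) VALUES (?, ?)",
    [(1, "elephant"), (2, "trunk"), (3, "mammoth"), (4, "memory")],
)
conn.executemany(
    "INSERT INTO associations (head_id, target_id, weight) VALUES (?, ?, ?)",
    [(1, 2, 5.0), (1, 3, 4.2), (1, 4, 1.8)],
)

def top_associations(conn, word, limit=10):
    """Return (target, weight) pairs for a word, strongest first."""
    return conn.execute(
        """
        SELECT t.word, a.weight
        FROM associations AS a
        JOIN words AS h ON h.id = a.head_id
        JOIN words AS t ON t.id = a.target_id
        WHERE h.word = ?
        ORDER BY a.weight DESC
        LIMIT ?
        """,
        (word, limit),
    ).fetchall()

top_two = top_associations(conn, "elephant", limit=2)
```

The same query runs unchanged against a file-backed database bundled with a mobile app, so no server round-trip is involved.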

Typical Questions

Can I see sample data before licensing?

Yes. Contact us and we’ll send evaluation samples for your specific use case.

What about updates?

Linguabase is actively maintained: new vocabulary is added as words enter mainstream usage, connection weights are refined, and ongoing auditing catches edge cases. The license includes the current data, and update arrangements are available.

Do you provide integration support?

Yes. We offer consulting on graph operations (pathfinding, distance calculation) and can help with integration into your stack.
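For a sense of what pathfinding and distance calculation mean here, the sketch below runs a breadth-first search over a toy association graph. The graph contents are made up for illustration, connection weights are ignored, and edges are followed only from headword to target; a production approach could differ on all three points.

```python
from collections import deque

# Hypothetical toy graph: headword -> associated words (weights ignored).
GRAPH = {
    "elephant": ["trunk", "mammoth", "memory"],
    "trunk": ["tree", "luggage"],
    "mammoth": ["ice age", "fossil"],
    "tree": ["forest", "leaf"],
    "memory": ["storage"],
}

def shortest_path(graph, start, goal):
    """Return the fewest-hops path from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, []):
            if neighbor in previous:
                continue
            previous[neighbor] = current
            if neighbor == goal:
                # Walk the back-pointers to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None

def distance(graph, start, goal):
    """Number of hops between two words, or None if no path exists."""
    path = shortest_path(graph, start, goal)
    return None if path is None else len(path) - 1

path = shortest_path(GRAPH, "elephant", "forest")
hops = distance(GRAPH, "elephant", "forest")
```

Breadth-first search returns a path with the fewest hops. To prefer strong connections over short ones, a weighted search such as Dijkstra's algorithm would be the natural replacement.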

What’s NOT included?

Get in Touch

Linguabase is available for licensing to word game studios, AI/LLM companies, educational technology developers, and makers of productivity tools.

Email: linguabase@idea.org

Tell us about your use case and we’ll send relevant samples and discuss licensing options.

About IDEA.org

Linguabase is developed by IDEA.org, a research organization focused on language, games, and education. We’ve been building language data since 2011 and currently publish word games including “In Other Words” (iOS).