About - Linguabase

Linguabase generates puzzle levels and audits existing ones for word association and word categorization games—the genre that took off after NYT Connections. Every level is clean, non-repeating, and difficulty-calibrated, with airtight mutual exclusivity across categories—at whatever scale your game requires. Behind it is a structured network of 400,000 English terms and over 100 million semantic relationships, built over fifteen years. Linguabase is a product of IDEA.org.

Michael Douma

Founder & AI Systems Architect

Michael led the project from its origins as internal infrastructure for two word games through to the current system—the LLM pipeline, the production data, and the adaptation into a puzzle generation product for game studios. When you email linguabase@idea.org, you’re talking to the person who designed every layer of the data.

linguabase@idea.org michaeldouma.com

How It Was Built

The data accumulated over fifteen years across three phases. Professional lexicography and hand-built vocabulary came first: thousands of definitions, sense-grouped associations, and thematic word lists authored by people who understand English at a level automation can’t reach. Computational linguistics came next: 70+ structured linguistic sources integrated algorithmically, 2.3 million supercomputer hours of topic modeling and word embeddings via an NSF grant, and the algorithms that map and weight relationships across the full vocabulary. The current system layers 130 million LLM inferences on top of that foundation, generating, validating, ranking, and auditing at a scale the earlier phases couldn’t.

Credits

Foundational Data

The data layers built before LLMs existed—algorithms, lexicography, and source integration that the current system inherits and builds on.

Li Mei

Language Data Architect

Designed the algorithms that map and weight word relationships across the full vocabulary. Mathematics background (Shandong University), decades of software engineering.

Orin Hargraves

Lexicographer

Established the lexicographic framework for how word relationships should be structured—which senses matter, how to group associations, where human judgment is non-negotiable. Wrote 2,000+ custom definitions and 4,400+ sense-grouped associations. Contributor to Oxford, Macmillan, and other major dictionaries.

Contributors — Thematic Word Lists

Sally Smith manually curated 100 sets of mutually exclusive topics for OtherWordly, a process that became the precursor to the current automated pipeline.

The following contributed as content writers and reviewers, many of them linguistics grad students or post-docs. Each typically worked on a block of topics from the Dewey Decimal or Library of Congress classification system, for example sports played with balls, cathedral architectural elements, newspaper brand names (Globe, Post, Tribune), and soft candies.

Thouria Bensaoula

Brenda Darlene Hunter

Ronaldo Borja

Andrew Hursh

Catherine Carnovale

Louisa Jordan

Jerry Carr-Brion

Johan Josefsson

Mario Christiner

Emily Moline

Raluca Crisan

Sahana Pal

Jamie Friel

Miodrag Petrusevski

Nicole Gordiyenko

Olga L. Rachello

James Grama

Melody Ann Ross

Meredith Green

Jason Sankovic

Joan Ham

Brinda Sousley

David Hughes

Chitra Sundararajan

Samantha Williams

Ljubomir Stevanovic

Mary Thomas

Nick Williams

Rachel Usher

Support

Linguabase was supported by a $295,000 NSF SBIR Phase I grant and $300,000 in Microsoft for Startups compute credits.
