Linguabase generates puzzle levels and audits existing ones for word association and word categorization games—the genre that took off after NYT Connections. Every level is clean, non-repeating, and difficulty-calibrated, with airtight mutual exclusivity across categories—at whatever scale your game requires. Behind it is a structured network of 400,000 English terms and over 100 million semantic relationships, built over fifteen years. Linguabase is a product of IDEA.org.
Founder & AI Systems Architect
Michael led the project from its origins as internal infrastructure for two word games through to the current system—the LLM pipeline, the production data, and the adaptation into a puzzle generation product for game studios. When you email linguabase@idea.org, you’re talking to the person who designed every layer of the data.
The data accumulated over fifteen years across three phases. Professional lexicography and hand-built vocabulary came first—thousands of definitions, sense-grouped associations, and thematic word lists authored by people who understand English at a level automation can’t reach. Computational linguistics came next: 70+ structured linguistic sources integrated algorithmically, 2.3 million supercomputer hours of topic modeling and word embeddings via an NSF grant, and the algorithms that map and weight relationships across the full vocabulary. The current system layers 130 million LLM inferences on top of that foundation—generating, validating, ranking, and auditing at a scale the earlier phases couldn’t. Read the full story →
The data layers built before LLMs existed—algorithms, lexicography, and source integration that the current system inherits and builds on.
Language Data Architect
Designed the algorithms that map and weight word relationships across the full vocabulary. Mathematics background (Shandong University), decades of software engineering.
Lexicographer
Established the lexicographic framework for how word relationships should be structured—which senses matter, how to group associations, where human judgment is non-negotiable. Wrote 2,000+ custom definitions and 4,400+ sense-grouped associations. Contributor to Oxford, Macmillan, and other major dictionaries.
Sally Smith hand-curated 100 sets of mutually exclusive topics for OtherWordly, work that became the precursor to the current automated pipeline.
The following contributed as content writers and reviewers, many of them as linguistics grad students or post-docs. Each typically worked on a block of topics from the Dewey Decimal or Library of Congress classification system, such as sports played with balls, cathedral architectural elements, newspaper brand names (Globe, Post, Tribune), or soft candies.
Linguabase was supported by a $295,000 NSF SBIR Phase I grant and $300,000 in Microsoft for Startups compute credits.