Why the Best Language Data Is Not on GitHub

The copyright paradox that keeps high-quality curated language data invisible to the market.

The Invisibility Problem

The most valuable curated language data in the world—difficulty-ranked vocabularies, weighted semantic graphs, sense-balanced association networks—has never been publicly available. Not because it’s locked behind copyright, but precisely because it can’t be. The absence of copyright protection creates a paradox: the data is too expensive to build and too easy to copy, so anyone who builds something good keeps it behind closed doors.

The publicly visible landscape is excellent for what it was built to do. WordNet is a landmark of computational linguistics. Wiktionary is the largest dictionary ever created. ConceptNet mapped commonsense knowledge before LLMs existed. ENABLE gave an entire generation of word games a free vocabulary. But these projects were built by academics, hobbyists, and volunteers for purposes other than powering commercial word games at scale—and that’s not a criticism, it’s the point. The data that was purpose-built for commercial applications has never been published, because the economics don’t allow it.

The paradox: The more effort you invest in curating language data, the less legal protection you have over the raw result. Facts can’t be copyrighted. Creative expression can. But the value of language data lies in its accuracy as fact, not its creativity as expression.

The Legal Landscape: Word Lists and Copyright

The Feist Framework

Telephone directory from Wapakoneta, Ohio, June 1960
A small-town telephone directory from 1960. The Supreme Court’s 1991 Feist ruling held that the alphabetized names and numbers inside directories like this one aren’t copyrightable—no matter how much work went into collecting them.

The foundational case is Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991), where the Supreme Court unanimously held that copyright requires originality, not effort. Justice O’Connor’s opinion explicitly repudiated the “sweat of the brow” doctrine—the idea that sheer labor in compiling factual information earns copyright protection. Facts cannot be copyrighted, no matter how much work went into gathering them.

Facts do not owe their origin to an act of authorship. The distinction is one between creation and discovery: the first person to find and report a particular fact has not created the fact; he or she has merely discovered its existence.

Compilations of facts can qualify for thin copyright protection, but only when the selection, coordination, or arrangement displays “at least some minimal degree of creativity.” Even then, protection covers only the compiler’s original contribution (the particular selection and arrangement), never the underlying facts.

A bare word list—an alphabetized inventory of valid English words—sits at the weak end of this spectrum. The individual words are facts. Alphabetical arrangement is “practically inevitable” (the Court’s language in Feist). And if the selection criteria are mechanical (“all English words meeting X objective criteria”), the resulting compilation may lack even the minimal creativity threshold.
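The distinction between mechanical and editorial selection is easy to make concrete. Below is a minimal, hypothetical sketch (the criteria and function name are illustrative assumptions, not any real list's rules) of a selection process that applies only objective filters, the kind of compilation Feist suggests may fall below even the minimal creativity threshold:

```python
# Hypothetical sketch: a purely mechanical selection rule.
# Every criterion is objective; no editorial judgment enters the output,
# and the alphabetical ordering is "practically inevitable."
def mechanical_word_list(candidates):
    """Select words by fixed, objective criteria; return them alphabetized."""
    selected = {
        w.lower()
        for w in candidates
        if 2 <= len(w) <= 15      # objective length bounds
        and w.isalpha()           # letters only, no hyphens or digits
        and w.isascii()           # plain ASCII only
    }
    return sorted(selected)

print(mechanical_word_list(["aa", "Qi", "zyzzyva", "e-mail", "ok", "naïve", "a"]))
```

Contrast this with an editor deciding whether "qi" or a newly coined slang term belongs in a game dictionary: that judgment call, not the filtering, is where a compilation's thin copyright could attach.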

The Circuit Split

Post-Feist case law pulls in different directions. On the side favoring copyrightability, CCC Information Services v. Maclean Hunter (2d Cir. 1994) held that projected used car valuations were copyrightable because they reflected subjective editorial expertise—professional judgment, not mechanical rule-following. This is the strongest precedent for protecting curated data that involves expert judgment in weighting, ranking, or evaluation. Similarly, ADA v. Delta Dental (7th Cir. 1997) found that a dental procedure taxonomy was copyrightable because the classification choices were creative—procedures could be organized many different ways, and the ADA’s particular choices constituted expression.

On the other side, Southco v. Kanebridge (3d Cir. 2004, en banc) held that industrial part numbers generated by mechanically applying a predetermined coding system lacked originality. The court distinguished between a creative system and the outputs produced by mechanically applying it. And in Assessment Technologies v. WIREdata (7th Cir. 2003), Judge Posner held that compilation copyright “cannot prevent access to data that not only are neither copyrightable nor copyrighted, but were not created or obtained by the copyright owner.” Even when public domain data sits inside a copyrighted structure, extracting the underlying facts is permissible.

The Scrabble Word List—Untested Claims

The Official Scrabble Players Dictionary, Sixth Edition, published by Merriam-Webster
Merriam-Webster publishes the OSPD under license from Hasbro. The word list inside is asserted as copyrighted—but no court has ever ruled on whether an alphabetized list of valid game words meets the originality threshold.

The most prominent real-world example is the Scrabble word list ecosystem. Hasbro, NASPA, and Collins all claim copyright over their respective word lists and enforce those claims through licensing. None of these claims has ever been tested in court. The closest confrontation—the 2014 Zyzzyva controversy—ended not with a ruling but with NASPA settling rather than litigating against a company with $4+ billion in annual revenue.

Hasbro’s VP Jonathan Berkowitz stated in 2014: “We have a curated word list that is created for the purpose of playing the game and directly relates to playing the game. That’s copyrightable.” This assertion remains legally unvalidated.

The honest assessment: the OSPD's selection probably satisfies Feist's low creativity threshold, given its subjective editorial choices about offensive words, neologisms, and borderline entries. But the claim over the bare alphabetical NWL (the NASPA Word List) is weaker, and any protection would be "thin," covering the particular selection, never the individual words.

Where Richer Data Structures Stand

A weighted semantic graph with difficulty rankings, association strengths, sense-balanced coverage, and editorial judgments about word relationships occupies much stronger legal ground than a bare word list. Under the CCC framework, subjective expert judgment about how words relate to each other, how difficult they are, and which associations matter looks far more like the Red Book’s copyrightable valuations than like Feist’s uncopyrightable phone book.

The key distinction: The creative editorial contribution is embedded in every weight, every ranking, and every relationship—not just in which words were selected, but in how they were characterized. This is original expression layered on top of factual data, and it has a much stronger claim to copyright protection than any word list.
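To make the contrast with a bare word list tangible, here is a minimal sketch of the kind of record described above. The field names and values are hypothetical assumptions for illustration, not any vendor's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical record structure for a curated language database.
# The word itself is an uncopyrightable fact; every number below
# encodes an editorial judgment layered on top of that fact.
@dataclass
class WordEntry:
    word: str                   # the underlying fact
    difficulty: float           # editorial ranking: 0.0 easy .. 1.0 hard
    senses: dict = field(default_factory=dict)        # sense -> coverage weight
    associations: dict = field(default_factory=dict)  # related word -> strength

bank = WordEntry(
    word="bank",
    difficulty=0.15,
    senses={"financial institution": 0.7, "river edge": 0.3},
    associations={"money": 0.9, "river": 0.4, "vault": 0.6},
)

print(bank.difficulty, bank.associations["money"])
```

Copying the `word` fields out of such a database reproduces only facts; copying the weights reproduces the compiler's judgment, which is where the CCC-style argument for protection lives.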

The EU Contrast

The EU Database Directive (96/9/EC) provides a sui generis database right with no American equivalent. It protects database contents—not just structure—when the maker has made “substantial investment in obtaining, verification or presentation of the contents.” It lasts 15 years but renews with each substantial update. A curated language database would almost certainly qualify for EU protection based on investment alone, without meeting any creativity threshold. The practical implication: data that might be uncopyrightable in the United States receives robust protection in every EU member state.

Key Legal References

Case                            Year  Key Holding
Feist v. Rural Telephone        1991  Facts aren't copyrightable; compilations require minimal creativity
CCC v. Maclean Hunter           1994  Expert editorial judgment in valuations is copyrightable
ADA v. Delta Dental             1997  Creative taxonomic choices are copyrightable expression
Assessment Tech. v. WIREdata    2003  Can't use compilation copyright to lock up public domain facts
Southco v. Kanebridge           2004  Mechanically generated classifications lack originality
EU Database Directive           1996  Sui generis right protects substantial investment in databases

The Economic Paradox

The Copying Chain

Every major dictionary in history was built by copying the previous one and improving it. This isn’t a dirty secret—it’s the defining mechanism of lexicographic progress.

Title page of A World of Errors Discovered in the New World of Words by Thomas Blount, 1673
Thomas Blount’s A World of Errors (1673), his sharp-tongued response to Phillips’s copying. First and only edition. This copy sold at Bonhams in 2012 for $9,375.

Robert Cawdrey’s Table Alphabeticall (1604), generally considered the first monolingual English dictionary, contained 2,543 headwords. Scholars have demonstrated that more than four-fifths were taken directly from two sources: Edmund Coote’s English Schoole-Maister and Thomas Thomas’s Latin-English Dictionarium. Fewer than one in five of Cawdrey’s words can’t be traced to these predecessors. The pattern intensified with each subsequent dictionary. When Edward Phillips published his New World of English Words in 1658, he lifted entries wholesale from Thomas Blount’s Glossographia—published just two years earlier. Blount eventually documented cases where Phillips had copied even his misprints. Phillips’s response was in keeping with the spirit of the age: he quietly corrected the mistakes in his next edition and moved on.

Samuel Johnson’s 1755 Dictionary of the English Language was physically built on an interleaved copy of Nathan Bailey’s Dictionarium Britannicum, which Johnson annotated, crossed out, and expanded. He noted which entries he was carrying forward from prior dictionaries rather than documenting from his own reading. After Johnson’s dictionary appeared, a new edition of Bailey plagiarized massively from Johnson—completing a circular borrowing loop. Noah Webster publicly criticized Johnson’s work while building substantially on it—a posture his own successors would repeat in turn. When the OED appeared, it incorporated hundreds of Johnson’s definitions and thousands of his quotations unchanged.

None of this was piracy. It was—and remains—how lexicography works. The raw linguistic facts are and always have been free to use.

But this legal freedom creates an economic trap. Building a high-quality curated language database from scratch requires years of expert labor, large-scale computation, and deep domain expertise. The resulting product cannot be copyright-protected in its most basic form (the word list) and receives only thin protection in its richer forms (the relational structure). Anyone who publishes it openly watches it get copied instantly, with no legal recourse for the bare factual content.

The rational response is invisibility. Lock it inside your company. License it under contract (which is enforceable regardless of copyright status). Deliver puzzle output rather than raw data when possible. Control access, because you can’t control copying.

This is why the publicly available language data landscape looks the way it does—excellent academic and volunteer-built resources that were never designed for commercial word games, treated as the state of the art for lack of visible alternatives.

The Practical Reality

The result is a market where the barrier to entry isn’t legal—it’s economic. The real protection for curated language data has never been copyright. It’s the sheer cost of recreation. No one is going to independently replicate a decade of curation work when they can license it, and no one is going to publish it openly when they can’t protect it.

See what alternatives look like in practice, or explore what we offer.