The copyright paradox that keeps high-quality curated language data invisible to the market.
The most valuable curated language data in the world—difficulty-ranked vocabularies, weighted semantic graphs, sense-balanced association networks—has never been publicly available. Not because it’s locked behind copyright, but precisely because it can’t be. The absence of copyright protection creates a paradox: the data is too expensive to build and too easy to copy, so anyone who builds something good keeps it behind closed doors.
The publicly visible landscape is excellent for what it was built to do. WordNet is a landmark of computational linguistics. Wiktionary is the largest dictionary ever created. ConceptNet mapped commonsense knowledge before LLMs existed. ENABLE gave an entire generation of word games a free vocabulary. But these projects were built by academics, hobbyists, and volunteers for purposes other than powering commercial word games at scale—and that’s not a criticism, it’s the point. The data that was purpose-built for commercial applications has never been published, because the economics don’t allow it.
The foundational case is Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991), where the Supreme Court unanimously held that copyright requires originality, not effort. Justice O’Connor’s opinion explicitly repudiated the “sweat of the brow” doctrine—the idea that sheer labor in compiling factual information earns copyright protection. Facts cannot be copyrighted, no matter how much work went into gathering them.
As the opinion explains: “Facts do not owe their origin to an act of authorship. The distinction is one between creation and discovery: the first person to find and report a particular fact has not created the fact; he or she has merely discovered its existence.”
Compilations of facts can qualify for thin copyright protection, but only when the selection, coordination, or arrangement displays “at least some minimal degree of creativity.” Even then, protection covers only the compiler’s original contribution (the particular selection and arrangement), never the underlying facts.
A bare word list—an alphabetized inventory of valid English words—sits at the weak end of this spectrum. The individual words are facts. Alphabetical arrangement is “practically inevitable” (the Court’s language in Feist). And if the selection criteria are mechanical (“all English words meeting X objective criteria”), the resulting compilation may lack even the minimal creativity threshold.
Post-Feist case law pulls in different directions. On the side favoring copyrightability, CCC Information Services v. Maclean Hunter (2d Cir. 1994) held that projected used car valuations were copyrightable because they reflected subjective editorial expertise—professional judgment, not mechanical rule-following. This is the strongest precedent for protecting curated data that involves expert judgment in weighting, ranking, or evaluation. Similarly, ADA v. Delta Dental (7th Cir. 1997) found that a dental procedure taxonomy was copyrightable because the classification choices were creative—procedures could be organized many different ways, and the ADA’s particular choices constituted expression.
On the other side, Southco v. Kanebridge (3d Cir. 2004, en banc) held that industrial part numbers generated by mechanically applying a predetermined coding system lacked originality. The court distinguished between a creative system and the outputs produced by mechanically applying it. And in Assessment Technologies v. WIREdata (7th Cir. 2003), Judge Posner rejected the use of compilation copyright to block access to data that “not only are neither copyrightable nor copyrighted, but were not created or obtained by the copyright owner.” Even when public domain data sits inside a copyrighted structure, extracting the underlying facts is permissible.
The most prominent real-world example is the Scrabble word list ecosystem. Hasbro, NASPA, and Collins all claim copyright over their respective word lists and enforce those claims through licensing. None of these claims has ever been tested in court. The closest confrontation, the 2014 Zyzzyva controversy, ended not with a ruling but with NASPA settling rather than litigating against a company with more than $4 billion in annual revenue.
Hasbro’s VP Jonathan Berkowitz stated in 2014: “We have a curated word list that is created for the purpose of playing the game and directly relates to playing the game. That’s copyrightable.” This assertion remains legally unvalidated.
The honest assessment: the selection behind the Official Scrabble Players Dictionary (OSPD) probably satisfies Feist’s low creativity threshold, given subjective editorial choices about offensive words, neologisms, and borderline entries. But the claim over the bare alphabetical NASPA Word List (NWL) is weaker, and any protection would be “thin,” covering the particular selection, never the individual words.
A weighted semantic graph with difficulty rankings, association strengths, sense-balanced coverage, and editorial judgments about word relationships occupies much stronger legal ground than a bare word list. Under the CCC framework, subjective expert judgment about how words relate to each other, how difficult they are, and which associations matter looks far more like the copyrightable used car valuations in Maclean Hunter’s Red Book than like Feist’s uncopyrightable phone book.
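To make the structural contrast concrete, here is a minimal Python sketch of the two kinds of data. Every name and value in it (`CuratedEntry`, `Association`, the 1 to 5 difficulty scale, the example weights) is a hypothetical illustration rather than any real product's schema, and the sketch shows only where editorial judgment enters the data, not how a court would analyze it.

```python
from dataclasses import dataclass, field

# A bare word list: alphabetized facts, with no editorial judgment recorded.
bare_word_list = sorted(["river", "bank", "loan"])


@dataclass
class Association:
    target: str      # the related word
    strength: float  # hypothetical editor-assigned weight in [0, 1]
    sense: str       # which sense of the source word this link belongs to


@dataclass
class CuratedEntry:
    word: str
    difficulty: int  # hypothetical editor-assigned rank, 1 (easy) to 5 (hard)
    associations: list[Association] = field(default_factory=list)

    def senses_covered(self) -> set[str]:
        """Return the distinct senses represented among this word's links."""
        return {a.sense for a in self.associations}


# One curated entry: each number and label below is a judgment call,
# which is the part a bare list does not contain.
entry = CuratedEntry(
    word="bank",
    difficulty=2,
    associations=[
        Association("loan", 0.9, "finance"),
        Association("river", 0.7, "geography"),
    ],
)
```

The bare list holds only the words themselves, while each field on `CuratedEntry` records a choice that a different editor could reasonably have made differently.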
The EU Database Directive (96/9/EC) provides a sui generis database right with no American equivalent. It protects database contents, not just structure, when the maker has made a “substantial investment in either the obtaining, verification or presentation of the contents.” It lasts 15 years but renews with each substantial update. A curated language database would almost certainly qualify for EU protection based on investment alone, without meeting any creativity threshold. The practical implication: data that might be uncopyrightable in the United States receives robust protection in every EU member state.
| Case | Year | Key Holding |
|---|---|---|
| Feist v. Rural Telephone | 1991 | Facts aren’t copyrightable; compilations require minimal creativity |
| CCC v. Maclean Hunter | 1994 | Expert editorial judgment in valuations is copyrightable |
| ADA v. Delta Dental | 1997 | Creative taxonomic choices are copyrightable expression |
| Assessment Tech. v. WIREdata | 2003 | Can’t use compilation copyright to lock up public domain facts |
| Southco v. Kanebridge | 2004 | Mechanically generated classifications lack originality |
| EU Database Directive | 1996 | Sui generis right protects substantial investment in databases |
Every major dictionary in history was built by copying the previous one and improving it. This isn’t a dirty secret—it’s the defining mechanism of lexicographic progress.
But this legal freedom creates an economic trap. Building a high-quality curated language database from scratch requires years of expert labor, substantial computation, and deep domain expertise. The resulting product cannot be copyright-protected in its most basic form (the word list) and receives only thin protection in its richer forms (the relational structure). Anyone who publishes it openly watches it get copied instantly, with no legal recourse for the bare factual content.
This is why the publicly available language data landscape looks the way it does—excellent academic and volunteer-built resources that were never designed for commercial word games, treated as the state of the art for lack of visible alternatives.
The result is a market where the barrier to entry isn’t legal—it’s economic. The real protection for curated language data has never been copyright. It’s the sheer cost of recreation. No one is going to independently replicate a decade of curation work when they can license it, and no one is going to publish it openly when they can’t protect it.