 /\_/\
( o.o )
 > ^ <
languagecat
A free, comprehensive dataset for building language-learning apps
I'm building a free language-learning app, but I noticed I was spending as much time gathering and structuring data as actually developing the app. So I'm releasing the data I gathered to help others who are also interested in building a language-learning app. Welcome to languagecat!
languagecat provides a curated dataset with everything you need to get started building your own language-learning app.
This is a public resource created as part of the Yap.town project.
What do I need to make a language-learning app?
You need a dictionary and you need sentences. The dictionary is essential so users can look up words. The sentences are essential so you can have users practice translating. (I know there are other philosophies of language-learning apps which might not require this exact setup. But they can make their own datasets. 😸)
The Dictionary
Not just any dictionary will do. You need a dictionary focused on language learners, which is different from a traditional dictionary. Traditional dictionaries are exhaustive in their definitions and information, to the point of being mostly useless for beginners. Language learners need just the most essential information.
The Sentences
Language learners need a gradient of sentences from simple to difficult. The most important step in learning a language is getting a lot of input suited to the learner's skill level, so the app needs sentences that match every skill level. One simple way to build such a gradient is sketched below.
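This sketch illustrates the idea only (it is not necessarily how languagecat grades its sentences): rank every sentence by the frequency rank of its rarest word.

```python
# Sketch: order sentences from simple to hard by the frequency rank of their
# rarest word. This illustrates the idea, not languagecat's own grading.
def difficulty(tokens, freq_rank):
    # freq_rank maps a word to its frequency rank (1 = most common);
    # words not in the list count as maximally rare.
    return max((freq_rank.get(t, len(freq_rank) + 1) for t in tokens), default=0)

def sort_simple_to_hard(tokenized_sentences, freq_rank):
    # tokenized_sentences is a list of token lists, e.g. [["je", "bois"], ...]
    return sorted(tokenized_sentences, key=lambda s: difficulty(s, freq_rank))
```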
What's in languagecat?
languagecat is structured as "language packs" from language A to language B. (For example, there is a French-for-English-Speakers pack that contains French words with their definitions in English.)
Each language pack contains tens of thousands of words and hundreds of thousands of sentences.
Dictionary
For each word, I provide:
Since words can have multiple parts of speech, frequency data is tracked separately for each usage. This means the frequency analysis is much more sophisticated than simply counting occurrences of the raw text in my corpus: I ran a custom neural-network-based NLP pipeline over the entire corpus, but more on that later.
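To make that concrete, here is a hypothetical illustration of per-usage frequency. The field names and numbers are made up for illustration; they are not the actual languagecat schema.

```python
# Hypothetical illustration of per-usage frequency (field names and counts are
# illustrative assumptions, not the actual languagecat schema). The same
# surface form gets one entry per part of speech, each with its own frequency.
entries = [
    {"word": "bois", "pos": "NOUN", "definition": "wood",        "frequency": 412},
    {"word": "bois", "pos": "VERB", "definition": "(you) drink", "frequency": 1530},
]

# All usages of a word, each with its own frequency:
usages = [e for e in entries if e["word"] == "bois"]
```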
Phrasebook
In addition to individual words, I include multi-word terms and phrases. These are useful because they often have their own meaning that needs to be learned independently from their component words.
Sentences
Each sentence in the dataset is fully analyzed with detailed linguistic information for every word, making it easy to build interactive learning experiences.
This breakdown is extremely useful for building language-learning apps because it allows you to programmatically understand what's happening in a sentence. For example, "bois" in French can be either a verb ("drink") or a noun ("wood"). The two usages have completely different meanings. Looking at just the text, they're indistinguishable, but with this linguistic breakdown, your software can easily tell the difference.
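As a sketch of how you might use that breakdown in your own app, here is one way to pick the dictionary sense that matches the tagged part of speech. The token and dictionary shapes are assumptions for illustration, not the actual languagecat schema.

```python
# Sketch: pick the dictionary sense matching the part of speech tagged in the
# sentence. Token and dictionary shapes are illustrative assumptions.
def sense_for(token, dictionary):
    # token: e.g. {"text": "bois", "lemma": "boire", "pos": "VERB"}
    # dictionary: maps a lemma to its senses, one per part of speech
    senses = dictionary.get(token["lemma"], [])
    return next((s for s in senses if s["pos"] == token["pos"]), None)

dictionary = {
    "boire": [{"pos": "VERB", "definition": "to drink"}],
    "bois":  [{"pos": "NOUN", "definition": "wood"}],
}
print(sense_for({"text": "bois", "lemma": "boire", "pos": "VERB"}, dictionary))
# -> {'pos': 'VERB', 'definition': 'to drink'}
```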
Custom NLP Models
To create the sophisticated linguistic analysis in languagecat, I custom-trained my own suite of NLP models. These models are what enable the word-by-word breakdown, part-of-speech tagging, and frequency analysis that's separated by word usage.
All of my models are available on Hugging Face for anyone to use.
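If the models are published as standard Hugging Face token-classification checkpoints (an assumption on my part), loading one could look like the sketch below. The model id is a placeholder, not a real repository name; check the Hugging Face page for the actual ids.

```python
from transformers import pipeline

# Placeholder model id; see the languagecat Hugging Face page for real ones.
tagger = pipeline("token-classification", model="languagecat/french-pos-tagger")

for token in tagger("Je bois de l'eau."):
    print(token["word"], token["entity"], round(token["score"], 2))
```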
Movie Data
Sentences come from two main sources: Tatoeba and movies from Open Subtitles.
For sentences sourced from movies, I include the localized movie names and posters for the language being learned. For example, "Fight Club" becomes "El Club de la Lucha" in the Spanish dataset, complete with the Spanish movie poster.
I also provide separate frequency lists for each movie in the dataset. This allows you to show your users how close they are to being able to watch a particular movie—a great motivational feature for language learners.
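A sketch of that feature: compute what fraction of a movie's word occurrences the learner already knows. The "word" and "count" field names are my assumptions about the per-movie frequency list format.

```python
# "Movie readiness": what fraction of a movie's word occurrences does the
# learner already know? Field names are assumptions about the list format.
def coverage(movie_freq_list, known_words):
    total = sum(item["count"] for item in movie_freq_list)
    known = sum(item["count"] for item in movie_freq_list if item["word"] in known_words)
    return known / total if total else 0.0

# coverage(fight_club_freqs, user_vocabulary) -> e.g. 0.87,
# i.e. "you already know 87% of the words in this movie"
```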
Format
The data is provided in JSON, which is straightforward to parse in any programming language. Each part of each pack (e.g., the list of sentences, the translations) has its own JSON file, so you can only download what you need.
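For example, loading just the parts you need could look like this; the directory layout and file names here are illustrative, so check the actual pack contents after downloading.

```python
import json

# Illustrative only: the directory layout and file names inside a pack may differ.
def load_part(pack_dir, part):
    with open(f"{pack_dir}/{part}.json", encoding="utf-8") as f:
        return json.load(f)

sentences = load_part("french-for-english-speakers", "sentences")
translations = load_part("french-for-english-speakers", "translations")
```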
However, parsing JSON gets slow when you're dealing with tens of thousands of words and hundreds of thousands of sentences. For this reason, I also provide the data in a custom format built on a fork of rkyv (a Rust library for extremely fast deserialization). This is the same format used internally by yap.town. Please contact me if you'd like to use it; it's a little involved in order to be as efficient as possible, and it can't be used outside of a Rust program.
Download
The data is available here.
Language Packs Provided So Far
- 🇫🇷 French to English
- 🇬🇧 English to French
- 🇪🇸 English to Spanish
- 🇩🇪 English to German
- 🇰🇷 English to Korean (experimental)
Creating each language pack takes quite a bit of time and money. In total, I've spent about $3000 on this project so far. To be clear, most of this data wasn't generated manually—it came from a combination of external sources (Tatoeba and Open Subtitles), LLMs, and a small amount of manual labeling.
I'm currently working on Italian and Portuguese language packs.
What's Next?
I'm really curious to see what people will build with this. Language learning is such a personal journey, and I hope this dataset can save you the countless hours I spent gathering and structuring all this data.
If you end up building something with languagecat, I'd love to hear about it. Join the Discord to share what you're working on. Happy building!
This data could be used to create Anki decks that would be, to my knowledge, pedagogically better and more comprehensive than any that currently exist. Someone has actually requested this. Please cite this page (or Yap.Town) if you do.
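As a minimal sketch of what that could look like with the genanki Python library (the entry fields shown are illustrative assumptions and would really come from a languagecat dictionary file):

```python
import genanki

# Sample entries; in practice these would be loaded from a languagecat
# dictionary file (field names here are illustrative assumptions).
entries = [
    {"word": "bois", "definition": "wood; (you) drink"},
]

model = genanki.Model(
    1607392319, "languagecat basic",
    fields=[{"name": "Word"}, {"name": "Definition"}],
    templates=[{
        "name": "Card 1",
        "qfmt": "{{Word}}",
        "afmt": "{{FrontSide}}<hr id=\"answer\">{{Definition}}",
    }],
)
deck = genanki.Deck(2059400110, "languagecat sample deck")
for entry in entries:
    deck.add_note(genanki.Note(model=model, fields=[entry["word"], entry["definition"]]))
genanki.Package(deck).write_to_file("languagecat_sample.apkg")
```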
License
This dataset is licensed under CC BY-NC-ND 4.0.
You are free to share this dataset with proper attribution, but you may not use it for commercial purposes or create derivative works.
My goal with this license is that people can't use this as part of commercial projects (unless otherwise negotiated; I will probably say yes if you ask), and that any improvements anyone makes are upstreamed to the official version so everyone can benefit.