 /\_/\
( o.o )
 > ^ <
languagecat
A free, comprehensive dataset for building language-learning apps
I'm building a free language-learning app, but I noticed I was spending as much time gathering and structuring data as actually developing the app. So I'm releasing the data I gathered to help others who are also interested in building a language-learning app. Welcome to languagecat!
languagecat provides a curated dataset with everything you need to get started building your own language-learning app.
This is a public resource created as part of the Yap.town project.
What do I need to make a language-learning app?
You need a dictionary and you need sentences. The dictionary is essential so users can look up words. The sentences are essential so you can have users practice translating. (I know there are other philosophies of language-learning apps which might not require this exact setup. But they can make their own datasets. 😸)
The Dictionary
Not just any dictionary will do. You need a dictionary focused on language learners, which is different from a traditional dictionary. Traditional dictionaries are exhaustive in their definitions and information, to the point of being mostly useless for beginners. Language learners need just the most essential information.
The Sentences
Language learners need a gradient of sentences from simple to difficult. The most important step in learning a language is getting a lot of input suited to the learner's skill level, so the app needs sentences that match every skill level. One simple way to build such a gradient is sketched below.
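This sketch illustrates the idea only (it is not necessarily how languagecat grades its sentences): rank every sentence by the frequency rank of its rarest word.

```python
# Sketch: order sentences from simple to hard by the frequency rank of their
# rarest word. This illustrates the idea, not languagecat's own grading.
def difficulty(tokens, freq_rank):
    # freq_rank maps a word to its frequency rank (1 = most common);
    # words not in the list count as maximally rare.
    return max((freq_rank.get(t, len(freq_rank) + 1) for t in tokens), default=0)

def sort_simple_to_hard(tokenized_sentences, freq_rank):
    # tokenized_sentences is a list of token lists, e.g. [["je", "bois"], ...]
    return sorted(tokenized_sentences, key=lambda s: difficulty(s, freq_rank))
```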
What's in languagecat?
languagecat is structured as "language packs" from language A to language B. (For example, there is a French-for-English-Speakers pack that contains French words with their definitions in English.)
Each language pack contains tens of thousands of words and hundreds of thousands of sentences.
Dictionary
For each word, I provide:
Since words can have multiple parts of speech, frequency data is tracked separately for each usage. This means the frequency analysis is much more sophisticated than simply counting occurrences of the raw text in my corpus: I ran a custom neural-network-based NLP pipeline over the entire corpus, but more on that later.
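To make that concrete, here is a hypothetical illustration of per-usage frequency. The field names and numbers are made up for illustration; they are not the actual languagecat schema.

```python
# Hypothetical illustration of per-usage frequency (field names and counts are
# illustrative assumptions, not the actual languagecat schema). The same
# surface form gets one entry per part of speech, each with its own frequency.
entries = [
    {"word": "bois", "pos": "NOUN", "definition": "wood",        "frequency": 412},
    {"word": "bois", "pos": "VERB", "definition": "(you) drink", "frequency": 1530},
]

# All usages of a word, each with its own frequency:
usages = [e for e in entries if e["word"] == "bois"]
```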
Phrasebook
In addition to individual words, I include multi-word terms and phrases. These are useful because they often have their own meaning that needs to be learned independently from their component words.
Sentences
Each sentence in the dataset is fully analyzed with detailed linguistic information for every word, making it easy to build interactive learning experiences.
This breakdown is extremely useful for building language-learning apps because it allows you to programmatically understand what's happening in a sentence. For example, "bois" in French can be either a verb ("drink") or a noun ("wood"). The two usages have completely different meanings. Looking at just the text, they're indistinguishable, but with this linguistic breakdown, your software can easily tell the difference.
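As a sketch of how you might use that breakdown in your own app, here is one way to pick the dictionary sense that matches the tagged part of speech. The token and dictionary shapes are assumptions for illustration, not the actual languagecat schema.

```python
# Sketch: pick the dictionary sense matching the part of speech tagged in the
# sentence. Token and dictionary shapes are illustrative assumptions.
def sense_for(token, dictionary):
    # token: e.g. {"text": "bois", "lemma": "boire", "pos": "VERB"}
    # dictionary: maps a lemma to its senses, one per part of speech
    senses = dictionary.get(token["lemma"], [])
    return next((s for s in senses if s["pos"] == token["pos"]), None)

dictionary = {
    "boire": [{"pos": "VERB", "definition": "to drink"}],
    "bois":  [{"pos": "NOUN", "definition": "wood"}],
}
print(sense_for({"text": "bois", "lemma": "boire", "pos": "VERB"}, dictionary))
# -> {'pos': 'VERB', 'definition': 'to drink'}
```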
Custom NLP Models
To create the sophisticated linguistic analysis in languagecat, I custom-trained my own suite of NLP models. These models are what enable the word-by-word breakdown, part-of-speech tagging, and frequency analysis that's separated by word usage.
All of my models are available on Hugging Face for anyone to use.
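If the models are published as standard Hugging Face token-classification checkpoints (an assumption on my part), loading one could look like the sketch below. The model id is a placeholder, not a real repository name; check the Hugging Face page for the actual ids.

```python
from transformers import pipeline

# Placeholder model id; see the languagecat Hugging Face page for real ones.
tagger = pipeline("token-classification", model="languagecat/french-pos-tagger")

for token in tagger("Je bois de l'eau."):
    print(token["word"], token["entity"], round(token["score"], 2))
```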
Movie Data
Sentences come from two main sources: Tatoeba and movies from Open Subtitles.
For sentences sourced from movies, I include the localized movie names and posters for the language being learned. For example, "Fight Club" becomes "El Club de la Lucha" in the Spanish dataset, complete with the Spanish movie poster.
I also provide separate frequency lists for each movie in the dataset. This allows you to show your users how close they are to being able to watch a particular movie—a great motivational feature for language learners.
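A sketch of that feature: compute what fraction of a movie's word occurrences the learner already knows. The "word" and "count" field names are my assumptions about the per-movie frequency list format.

```python
# "Movie readiness": what fraction of a movie's word occurrences does the
# learner already know? Field names are assumptions about the list format.
def coverage(movie_freq_list, known_words):
    total = sum(item["count"] for item in movie_freq_list)
    known = sum(item["count"] for item in movie_freq_list if item["word"] in known_words)
    return known / total if total else 0.0

# coverage(fight_club_freqs, user_vocabulary) -> e.g. 0.87,
# i.e. "you already know 87% of the words in this movie"
```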
Format
The data is provided in JSON, which is straightforward to parse in any programming language. Each part of each pack (e.g., the list of sentences, the translations) has its own JSON file, so you can only download what you need.
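For example, loading just the parts you need could look like this; the directory layout and file names here are illustrative, so check the actual pack contents after downloading.

```python
import json

# Illustrative only: the directory layout and file names inside a pack may differ.
def load_part(pack_dir, part):
    with open(f"{pack_dir}/{part}.json", encoding="utf-8") as f:
        return json.load(f)

sentences = load_part("french-for-english-speakers", "sentences")
translations = load_part("french-for-english-speakers", "translations")
```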
However, parsing JSON gets slow when you're dealing with tens of thousands of words and hundreds of thousands of sentences. For this reason, I also provide the data in a custom format built on a fork of rkyv (a Rust library for extremely fast deserialization). This is the same format used internally by yap.town. Please contact me if you'd like to use it; it's a little involved in order to be as efficient as possible, and it can't be used outside of a Rust program.
Download
The data is available here.
Language Packs Provided So Far
- 🇫🇷 French to English
- 🇬🇧 English to French
- 🇪🇸 English to Spanish
- 🇩🇪 English to German
- 🇰🇷 English to Korean (experimental)
Creating each language pack takes quite a bit of time and money. In total, I've spent about $3000 on this project so far. To be clear, most of this data wasn't generated manually—it came from a combination of external sources (Tatoeba and Open Subtitles), LLMs, and a small amount of manual labeling.
I'm currently working on Italian and Portuguese language packs.
What's Next?
I'm really curious to see what people will build with this. Language learning is such a personal journey, and I hope this dataset can save you the countless hours I spent gathering and structuring all this data.
If you end up building something with languagecat, I'd love to hear about it. Join the Discord to share what you're working on. Happy building!
This data could be used to create Anki decks that would be, to my knowledge, pedagogically better and more comprehensive than any that currently exist. Someone has actually requested this. Please cite this page (or Yap.Town) if you do.
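As a minimal sketch of what that could look like with the genanki Python library (the entry fields shown are illustrative assumptions and would really come from a languagecat dictionary file):

```python
import genanki

# Sample entries; in practice these would be loaded from a languagecat
# dictionary file (field names here are illustrative assumptions).
entries = [
    {"word": "bois", "definition": "wood; (you) drink"},
]

model = genanki.Model(
    1607392319, "languagecat basic",
    fields=[{"name": "Word"}, {"name": "Definition"}],
    templates=[{
        "name": "Card 1",
        "qfmt": "{{Word}}",
        "afmt": "{{FrontSide}}<hr id=\"answer\">{{Definition}}",
    }],
)
deck = genanki.Deck(2059400110, "languagecat sample deck")
for entry in entries:
    deck.add_note(genanki.Note(model=model, fields=[entry["word"], entry["definition"]]))
genanki.Package(deck).write_to_file("languagecat_sample.apkg")
```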
License
This dataset is licensed under CC BY-NC-ND 4.0.
You are free to share this dataset with proper attribution, but you may not use it for commercial purposes or create derivative works.
My goal with this license is that people can't use this as part of commercial projects (unless otherwise negotiated; I will probably say yes if you ask), and that any improvements anyone makes are upstreamed to the official version so everyone can benefit.