Frequent Croatian Words

Francisco Kovacevich, June 2019

Some insight on the composition of the Croatian language and its most frequent words that might help the student.

All the source code can be found at github.com/frankovacevich/grand_dictionary_ipy_notebook.

About the Grand Croatian Dictionary

The Grand Croatian Dictionary started as a simple project to help me learn this beautiful language. My family has Croatian roots and I was required to learn the language to obtain Croatian citizenship, so I started learning from different online resources. It took me several months to even start to grasp the concept of declensions; I speak English, French and Spanish fluently, but none of these languages has this feature. Also, Slavic vocabulary is so different from that of these languages that it's hard to memorize even the simplest words.

My frustration with learning the language grew even greater when I tried to find useful material on the internet. I started building a dictionary by myself to recognize every word and its declensions. This is how the Croatian Dictionary came to be. I used data from the Wiktionary articles, which are very complete but cumbersomely spread across the English Wiktionary (en.wiktionary.org), the Croatian Wiktionary (hr.wiktionary.org) and the Serbo-Croatian Wiktionary (sh.wiktionary.org). I also used the Glosbe Dictionary (glosbe.com) and sometimes even Google Translate and Microsoft (Bing) Translator. There are surely many errors in the dictionary (some errors are present even in the Wiktionary pages), but I hope it will grow more reliable with time. A more complete dictionary for fluent Croatian speakers can be found at http://hjp.znanje.hr/. The HJP (Hrvatski Jezični Portal) is an extremely complete dictionary, with the disadvantages that it's entirely in Croatian, isn't available as an app and can't be used offline.

I benefited a lot from building the Grand Croatian Dictionary, and I was able to learn Croatian faster. Later I built a smaller spinoff of the Grand Croatian Dictionary for Russian (The Grand Russian Dictionary), also available on Google Play. I also found that Russian learners were lacking a tool to easily search for words and their declensions and conjugations.

With the dictionary I built for the Grand Croatian Dictionary App I was able to carry out some studies of the most used words in Croatian. I wanted to share the results on this website. Although learning directly from word frequency lists is not always the best way to learn a language, it certainly helps to learn faster.

Corpora

To study the most frequent words, I started searching for corpora in Croatian. In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed) [en.wikipedia.org/wiki/Text_corpus]. I found only one that is publicly available on the internet, compiled by Hermit Dave (https://invokeit.wordpress.com/frequency-word-lists/) using subtitles from an Open Subtitles database as a source.

There are some other corpora built by Croatian academics, but they only allow you to query the corpus, not download it in full. However, I was able to scrape the webpage of the Croatian Language Corpus (HJK) with some Python code to rebuild the frequency of each word in the corpus (which is, in the end, what I want to extract from each corpus). The Croatian Language Corpus (Croatian: Hrvatski jezični korpus, HJK) is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ) [en.wikipedia.org/wiki/Croatian_Language_Corpus].

Finally, I decided to complete my set with two more corpora. First, I used another Open Subtitles database (like Hermit Dave) to build a 4 GB corpus. I used the subtitles from the Spanish-Croatian dataset. Arguably, some phrases in this dataset are not common in current Croatian, since they might be forced or literal translations from Spanish, but I thought it was still valid to use it to expand the set of corpora. Lastly, I scraped some articles from the Croatian Wikipedia (hr.wikipedia.org). I scraped mainly long articles, and built an 800 MB corpus.
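As a rough illustration of what extracting word frequencies from raw text involves, here is a minimal Python sketch. It is not the code from the repository: the function name, the tokenization rule (runs of letters, lowercased) and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a "word" is any run of Unicode letters, so Croatian letters
# such as č, ć, đ, š and ž are kept; digits and punctuation are dropped.
WORD_RE = re.compile(r"[^\W\d_]+")

def word_frequencies(text):
    """Return a dict mapping each lowercased word to its relative frequency."""
    counts = Counter(WORD_RE.findall(text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}

# Illustrative sample text only
sample = "Ja sam tu. Ja sam Čovjek!"
freqs = word_frequencies(sample)
```

On a multi-gigabyte corpus the same counting would normally be done line by line, updating one `Counter`, rather than loading the whole text at once.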

With four corpora in hand, two built from Open Subtitles datasets and two from written literature and academic articles, I was satisfied. Here is the number of unique words (not repeated) in each corpus (in millions):

Hermit Dave's Corpus: 213.829154M words (download)
HJK Corpus: 88.316596M words (download)
Open Subtitles Corpus: 196.755303M words (download)
Wikipedia Corpus: 61.987162M words (download)

Word frequency lists for each corpus can be found in the GitHub repository listed above (with the .summary extension). Interestingly, the Open Subtitles corpus was very different from the Wikipedia and HJK corpora: the former contained many swear words and informal expressions that are common only in spoken language. Since I was planning to learn not only formal but also spoken Croatian, I decided to combine these three corpora (I left Hermit Dave's corpus out of most calculations because I wanted to work only with the corpora I had built, and use Hermit Dave's only for comparison later). To combine the corpora I used the following formula for each word, where $x$ is its frequency in a given corpus:

$$ \bar{x} = \left( \frac{1}{x_{Wiki}} + \frac{1}{x_{OpenSubs}} + \frac{1}{x_{HJK}} \right)^{-1} $$

This way, the words that are present in all three corpora are valued more highly, without eliminating those that are present in only one corpus (by analogy with a circuit with resistors in parallel instead of in series). I was satisfied with how this combined corpus turned out, so I ended up using it to build some frequency lists to study.
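The formula can be sketched in Python as follows. The text does not say how a word that is missing from one corpus is handled, so this sketch makes an explicit assumption: a missing word gets a small floor frequency, which keeps $1/x$ finite and ranks the word low without removing it. The function name and the `floor` value are illustrative.

```python
def combine_frequencies(freqs, floor=1e-9):
    """Combine per-corpus relative frequencies like resistors in parallel.

    freqs: iterable with the word's frequency in each corpus
           (0 or None if the word is absent from that corpus).
    floor: ASSUMPTION, not from the original text: the frequency assigned
           to a word absent from a corpus, so that 1/x stays finite.
    """
    reciprocal_sum = 0.0
    for x in freqs:
        x = max(x or 0.0, floor)  # replace missing or zero values by the floor
        reciprocal_sum += 1.0 / x
    return 1.0 / reciprocal_sum

# A word equally frequent in all three corpora
in_all = combine_frequencies([0.3, 0.3, 0.3])
# A word found in only one corpus is kept, but with a much lower score
in_one = combine_frequencies([0.3, 0, None])
```

With this choice, a word present in every corpus scores far higher than one seen in a single corpus, which matches the behaviour described above; a different treatment of missing words would change the ranking.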

Words present in the Grand Croatian Dictionary

Are all the words in the corpora present in the Grand Croatian Dictionary? Certainly not. Many words are very strange and rarely used. There are names of people and places that should not be in a dictionary. There are some errors and typos too. But there are also many words that should be in the Dictionary and are not, although after comparing the Dictionary with the corpora I was able to identify many missing words (and later added them).

To compare the corpora with the Grand Croatian Dictionary, it is important to bear in mind that some words can have many forms. Nouns and adjectives can be declined, and verbs can be conjugated. Each form appears in the corpora as is, but in the Dictionary nouns and adjectives are listed in the nominative singular and verbs in the infinitive. To see if a word from the corpora is present in the Dictionary, one has to check whether it is a form of some word that is in the Dictionary.
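One way to do this lookup is to build an index from every inflected form back to its headword. The sketch below is only an illustration under assumed data structures (a dict of headword to list of forms, and a dict of corpus word counts); the function names and the toy data are hypothetical, not the format used by the Dictionary.

```python
def build_form_index(dictionary):
    """Map every known form (including the headword itself) to its headwords."""
    index = {}
    for headword, forms in dictionary.items():
        for form in {headword, *forms}:
            index.setdefault(form.lower(), set()).add(headword)
    return index

def dictionary_coverage(corpus_counts, index):
    """Fraction of corpus tokens whose form maps to some headword."""
    total = sum(corpus_counts.values())
    known = sum(c for w, c in corpus_counts.items() if w.lower() in index)
    return known / total if total else 0.0

# Hypothetical toy data: one noun with some of its declined forms
dictionary = {"kuća": ["kuće", "kući", "kuću", "kućom"]}
corpus_counts = {"kuću": 3, "kuće": 2, "zagreb": 5}

index = build_form_index(dictionary)
coverage = dictionary_coverage(corpus_counts, index)
```

A form can map to more than one headword (homographs), which is why the index stores a set of headwords rather than a single one.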

Figure 1 below plots the frequency of each word against the number of words with that frequency (see https://en.wikipedia.org/wiki/Zipf%27s_law). Both lines were built with the HJK corpus: one for all the words in the corpus and one for only the words that are in the Dictionary. We can see that the lines diverge at frequencies below 0.016 (1/1000) words. This means that words with a greater frequency can be found in the Dictionary (with some exceptions such as names, which explains why the lines don't match exactly).

Figure 1. Word frequency vs count for the HJK corpus, comparing the words that are in the Dictionary with the total words in the corpus.

The same concept is shown in Figure 2 for the Open Subtitles corpus.

Figure 2. Word frequency vs count for the Open Subtitles corpus, comparing the words that are in the Dictionary with the total words in the corpus.
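The curves in Figures 1 and 2 count how many words share each frequency. A minimal Python sketch of that tally, assuming the per-word counts are available as a dictionary (the function name and the optional `only` filter are illustrative, not taken from the repository):

```python
from collections import Counter

def frequency_spectrum(word_counts, only=None):
    """Return sorted (count, number_of_words_with_that_count) pairs.

    word_counts: dict mapping word -> number of occurrences in the corpus.
    only: optional set of words to keep, e.g. the words found in the Dictionary.
    """
    kept = (c for w, c in word_counts.items() if only is None or w in only)
    return sorted(Counter(kept).items())

# Illustrative counts only
counts = {"i": 5, "je": 5, "kuća": 1, "zagreb": 1}
all_words = frequency_spectrum(counts)
in_dictionary = frequency_spectrum(counts, only={"i", "je", "kuća"})
```

Plotting both results on the same log-log axes gives the kind of comparison shown in the figures: the two curves coincide where every word is in the Dictionary and separate where words are missing.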

Comparison between corpora

Figure 3 shows the word frequency vs count plot for the different corpora, for only the words present in the Dictionary. One can see how the Wikipedia and the HJK corpora are similar.

Figure 4 shows the most frequent words in each corpus. It's interesting to see how the same words rank differently in each corpus, with the Open Subtitles corpus showing the greatest discrepancies.

Figure 3. Word frequency vs count for the different corpora, for only the words present in the Dictionary.

Figure 4. Most frequent words in each corpus.

Words by type

To study the words by frequency, it's often useful to arrange them by type (Verb, Adjective, Adverb and Noun). Since they are classified in the Dictionary, this is easy to do, though there may be some errors because some words are written exactly the same but derive from different words. This happens especially with adjectives and adverbs, since most adverbs can be formed from the neuter singular form of the corresponding adjective. Figures 5 and 6 show the frequency vs count for words in the combined corpus separated by type, with the second figure combining adjectives and adverbs.

Figure 5. Word frequency vs count for the combined corpus, separated by type.

Figure 6. Word frequency vs count for the combined corpus, separated by type, combining adjectives and adverbs together.
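Splitting a frequency list by type can be sketched as below. This assumes a lookup from each word to the set of types it may belong to; the function name and the toy data are hypothetical. A homograph such as "dobro" (an adverb, and also the neuter singular of the adjective "dobar") is counted under every type it matches, which is the source of the errors mentioned above.

```python
from collections import defaultdict

def split_by_type(counts, word_types):
    """Group word counts by word type; ambiguous words go into every matching type."""
    by_type = defaultdict(dict)
    for word, count in counts.items():
        # Words unknown to the lookup are skipped
        for word_type in word_types.get(word, ()):
            by_type[word_type][word] = count
    return dict(by_type)

# Hypothetical toy data
counts = {"dobro": 10, "kuća": 4, "xyz": 1}
word_types = {"dobro": {"adjective", "adverb"}, "kuća": {"noun"}}

by_type = split_by_type(counts, word_types)
```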

Frequency lists

From the previous figures one can derive how many words of each type one should learn to reach a certain level. There is some correlation between the size of one's vocabulary and one's overall command of the language (see for example https://universeofmemory.com/how-many-words-you-should-know/).
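One simple way to turn a frequency list into a learning target is to ask how many of the most frequent words are needed to cover a given share of all word occurrences. The sketch below illustrates that idea; the function name, the coverage criterion and the toy counts are assumptions for this example, not the method used to build the lists.

```python
def words_needed(counts, target):
    """Smallest number of top-frequency words covering `target` of all tokens.

    counts: dict mapping word -> number of occurrences.
    target: desired coverage as a fraction between 0 and 1.
    """
    total = sum(counts.values())
    covered = 0
    ranked = sorted(counts.values(), reverse=True)  # most frequent first
    for n, count in enumerate(ranked, start=1):
        covered += count
        if covered / total >= target:
            return n
    return len(counts)

# Illustrative counts only
toy_counts = {"a": 50, "b": 30, "c": 15, "d": 5}
n_80 = words_needed(toy_counts, 0.8)
n_90 = words_needed(toy_counts, 0.9)
```

Running this separately on the verb, noun and adjective lists would give a per-type estimate of how many words to learn for a chosen coverage.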

Here are the lists of the most frequent words for each type:

verbs (download)
adjectives and adverbs (download)
nouns (download)

In the verbs list, the verbs have been grouped by stem, which makes them easier to learn. They have also been paired with their corresponding perfective or imperfective form (with the imperfective form usually in italics). To learn more, see the repository indicated above.

In some cases I added a "!" sign to indicate that the word can be irregular. Nouns that have an unexpected gender have also been indicated.

Here are three other lists of words that might be useful:

irregular adjectives (download)
irregular comparatives (download)
nouns with unexpected ending (pdf) (download)