In this post, I will compare some lemmatizers for Portuguese. In order to do the comparison, I downloaded subtitles from various television programs. The sentences are written in European Portuguese (EP).
01.09.2020: I have migrated the post from my old blog and updated it to reflect the current state of lemmatization.
There is a Python binding for Hunspell called “CyHunspell”. The library provides a function stem that returns the root of a word. Unlike Spacy, it cannot take into account the context in which the word occurs. A dictionary can be downloaded here.
A tokenizer also has to be run beforehand. If we ignore special cases like mesoclisis, it is easy to write our own.
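A minimal sketch of such a tokenizer, using only a regular expression. The pattern and the function name tokenize are my own choices for illustration. Hyphenated forms (enclisis, mesoclisis) are split naively at the hyphen, so they are not handled properly.

```python
import re

# One token is either a run of word characters or a single punctuation mark.
# In Python 3, \w is Unicode-aware, so accented letters stay inside the word.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


print(tokenize("Estás bem ?"))            # ['Estás', 'bem', '?']
print(tokenize("Para mim estão boas !"))  # ['Para', 'mim', 'estão', 'boas', '!']
```

Each token can then be passed to the stemming function one at a time.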
The results are:
Not every word gets assigned a lemma, because some tokens don’t seem to have entries in the dictionary.
Another problem is the context. For example, the dictionary has two different stems for the word “sentido”: “sentir” and “sentido”. In the first case, the word could be part of a verb conjugated in the pretérito perfeito composto (tenho sentido etc.). In the second case, the word is a noun. Hence, we need a Part-of-Speech (POS) tagger to decide which case is the right one.
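To make the idea concrete, here is a small sketch of how a POS tag could select between several candidate stems. The candidate table, the tag names VERB and NOUN, and the function choose_lemma are hypothetical. Only the “sentido” example itself comes from the dictionary discussed above.

```python
# Hypothetical candidate table: surface form -> {POS tag: lemma}.
# In practice the candidates would come from the dictionary lookup
# and the tag from an external POS tagger.
CANDIDATES = {
    "sentido": {"VERB": "sentir", "NOUN": "sentido"},
}


def choose_lemma(token, pos_tag):
    """Return the lemma matching the POS tag, or the token itself if unknown."""
    by_pos = CANDIDATES.get(token.lower(), {})
    return by_pos.get(pos_tag, token)


print(choose_lemma("sentido", "VERB"))  # sentir
print(choose_lemma("sentido", "NOUN"))  # sentido
```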
This library is written in Java and requires an external tokenizer and POS Tagger.
When I used the right annotations, the lemmas were generated correctly. However, there is an issue with the size of the dictionary. Using the full dictionary “resources/acdc/lemas.total.txt” results in a “java.lang.OutOfMemoryError: GC overhead limit exceeded” exception. To fix this, one can either give Java more memory (for example by raising the maximum heap size with the JVM option -Xmx) or use a smaller dictionary.
NLTK is one of the most popular libraries for NLP-related tasks. However, it does not contain a lemmatizer for Portuguese. There are only two stemmers: RSLPStemmer and snowball.
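For completeness, a sketch of how both stemmers could be called. The helper names load_stemmers and stem_all are my own. NLTK is imported inside the loader so that the second helper works with any object that has a stem method. Keep in mind that a stemmer only strips suffixes, so the output is usually not a dictionary form.

```python
def load_stemmers():
    """Build both NLTK stemmers for Portuguese (NLTK is imported lazily)."""
    import nltk
    from nltk.stem import RSLPStemmer
    from nltk.stem.snowball import SnowballStemmer

    nltk.download("rslp", quiet=True)  # data files required by RSLPStemmer
    return {"rslp": RSLPStemmer(), "snowball": SnowballStemmer("portuguese")}


def stem_all(stemmers, tokens):
    """Map each stemmer name to the list of stems it produces."""
    return {
        name: [stemmer.stem(token) for token in tokens]
        for name, stemmer in stemmers.items()
    }
```

With NLTK installed, stem_all(load_stemmers(), ["estás", "bem"]) returns the stems of both stemmers side by side.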
Neural network based
Spacy is a relatively new NLP library for Python. A language model for Portuguese can be downloaded here. This model was trained with a CNN on the Universal Dependencies and WikiNER corpora.
Let us try some sentences.
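A minimal sketch of how the lemmas can be read from a Spacy pipeline. The model name pt_core_news_sm is an assumption (it has to be installed first, for example with python -m spacy download pt_core_news_sm), and the helper names are my own.

```python
def load_portuguese_pipeline(model_name="pt_core_news_sm"):
    """Load a Spacy pipeline; the default model name is an assumption."""
    import spacy  # imported lazily so that lemmatize() has no hard dependency

    return spacy.load(model_name)


def lemmatize(nlp, sentence):
    """Return (text, lemma, POS) triples for every token in the sentence."""
    return [(token.text, token.lemma_, token.pos_) for token in nlp(sentence)]
```

A call such as lemmatize(load_portuguese_pipeline(), "Estás bem ?") then lists each token together with its lemma and POS tag.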
There is a mistake with the word “Estás”. The lemma should be “estar”. Most Portuguese-speaking countries don’t use the second-person singular. Thus, the problem could be that the corpus does not contain enough texts written in EP.
To verify this, let us consider the sentences “Está bem ?” and “Você está bem ?”.
The library still doesn’t find the correct lemma. Only by explicitly adding a pronoun do we get the right result.
Maybe we will have more success with longer sentences.
The lemmas are a bit strange:
“no”: this is a contraction of “em + o”
“como”: the lemma should not be “comer”
“uma”: the lemma should not be “umar”
“para”: this should not be a verb
Even if the lemmas were intended to be written this way, they should at least be consistent. But Spacy sometimes assigns “para” as the lemma and not “parir” (for example in the sentence “Para mim estão boas !”).
Stanza was released just this year (2020). Like Spacy, it is quite easy to use and also provides pretrained Portuguese models. If there are any problems installing the library, try installing it directly from GitHub: pip install git+https://github.com/stanfordnlp/stanza.git.
The results are
The results are good, but still not perfect:
somehow no was replaced by o (see word.text)
miúda: should be miúdo
irmã: should be irmão
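One possible explanation for the first point, which I state as an assumption rather than something verified here: stanza distinguishes surface tokens from syntactic words, and its multi-word token expansion can turn a contraction into several words. Printing word.text would then show the expanded words instead of the original token. The sketch below lists both levels side by side. The helper names are my own.

```python
def load_stanza_pipeline():
    """Download the Portuguese models and build a lemmatization pipeline."""
    import stanza  # imported lazily so that tokens_and_words() stays standalone

    stanza.download("pt")
    return stanza.Pipeline("pt", processors="tokenize,mwt,pos,lemma")


def tokens_and_words(doc):
    """For each surface token, list the (text, lemma) pairs of its words."""
    rows = []
    for sentence in doc.sentences:
        for token in sentence.tokens:
            rows.append((token.text, [(w.text, w.lemma) for w in token.words]))
    return rows
```

Calling tokens_and_words(load_stanza_pipeline()("Estou no carro .")) on this illustrative sentence shows, for every surface token, which words and lemmas stanza derived from it.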
This library is a little more complex to install than stanza and spacy.
Then start Docker with sudo systemctl start docker and run
As alternatives to pt_bosque, there are also pt_gsd and pt_pud. I used pt_bosque, because it contains both European (CETEMPúblico) and Brazilian (CETENFolha) variants.
Then we can feed texts to the running Docker container
The result is in the CoNLL-U format. The library pyconll can be used for parsing the following output:
Since stanza also uses pt_bosque, the results are approximately the same. Still, the original paper reports slight improvements on almost all treebanks.
It is also possible to lemmatize entire texts by sending POST requests to the Docker container. The following bash script splits the file ../feed.txt into chunks of 10000 lines and appends the output to the file split/parsed.conllu.
The splitting is necessary when millions of sentences need to be lemmatized. This could be due to a bug, or I might simply not have enough RAM.
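The same chunking idea can also be sketched in Python with the standard library only. This is not the script I used. The function names are my own, and the server URL passed to parse_file is a placeholder that depends on how the container was started.

```python
import itertools
import urllib.request


def chunks(lines, size=10000):
    """Yield successive lists of at most `size` lines."""
    iterator = iter(lines)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def parse_chunk(text, url):
    """POST plain text to the parser server and return its reply as a string."""
    request = urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def parse_file(in_path, out_path, url, size=10000):
    """Parse a file chunk by chunk, appending every reply to out_path."""
    with open(in_path, encoding="utf-8") as source, \
            open(out_path, "a", encoding="utf-8") as target:
        for chunk in chunks(source, size):
            target.write(parse_chunk("".join(chunk), url))
```

A call could look like parse_file("../feed.txt", "split/parsed.conllu", "http://localhost:15000"), where the port is only an example.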
Instead of using the library pyconll, manual parsing can be performed as follows:
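A minimal sketch of such a manual parser. It relies only on the CoNLL-U layout, where each word line has ten tab-separated columns with the form, lemma and universal POS tag in columns 2 to 4. The sample text is hand-written for illustration and is not actual parser output.

```python
def parse_conllu(text):
    """Return a list of sentences, each a list of (form, lemma, upos) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            columns = line.split("\t")
            # Keep plain word lines only; skip multi-word token ranges
            # such as "1-2" and empty nodes such as "1.1".
            if columns[0].isdigit():
                current.append((columns[1], columns[2], columns[3]))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample, not real output of the parser.
SAMPLE = (
    "# text = no carro\n"
    "1-2\tno\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tem\tem\tADP\t_\t_\t_\t_\t_\t_\n"
    "2\to\to\tDET\t_\t_\t_\t_\t_\t_\n"
    "3\tcarro\tcarro\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

for form, lemma, upos in parse_conllu(SAMPLE)[0]:
    print(form, lemma, upos)
```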
The neural network based lemmatizers have gotten much better. Personally, I often use “Universal Lemmatizer” because it also works well in other languages such as German. The main alternative is stanza. This library also offers other tools such as NER (Named Entity Recognition).
However, no lemmatizer is perfect. It is easy to find sentences where there are obvious mistakes.