In this post, I will compare some lemmatizers for Portuguese. For the comparison, I downloaded subtitles from various television programs. The sentences are written in European Portuguese (EP).
spaCy is a relatively new NLP library for Python. A language model for Portuguese can be downloaded here. This model was trained with a CNN on the Universal Dependencies and WikiNER corpora.
Let us try some sentences.
There is a mistake with the word “Estás”: it is a verb, not a noun, and its lemma should be “estar”. In most Portuguese-speaking countries the second-person singular is rarely used. Thus, the problem could be that the training corpus does not contain enough text written in EP.
To verify this, let us consider the sentences “Está bem ?” and “Você está bem ?”.
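A check like this can be scripted along the following lines. This is only a sketch: the model name `pt_core_news_sm` and the helper names are assumptions of mine, and the model you downloaded may be called differently.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def load_portuguese_model(name: str = "pt_core_news_sm"):
    """Load a spaCy pipeline. The default model name is an assumption."""
    import spacy  # imported lazily so the helpers below work without spaCy

    return spacy.load(name)


def lemma_table(doc: Iterable) -> List[Tuple[str, str, str]]:
    """Return (text, coarse POS tag, lemma) for every token of a parsed doc."""
    return [(token.text, token.pos_, token.lemma_) for token in doc]


def analyse(nlp: Callable, sentences: Iterable[str]) -> Dict[str, list]:
    """Run each sentence through the pipeline and collect its lemma table."""
    return {sentence: lemma_table(nlp(sentence)) for sentence in sentences}
```

With spaCy and a Portuguese model installed, `analyse(load_portuguese_model(), ["Está bem ?", "Você está bem ?"])` returns one list of (token, POS, lemma) triples per sentence, which makes the two variants easy to compare side by side.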
The library now recognizes that “Está” is a verb, but it still does not find the correct lemma. Only when we explicitly add the pronoun do we get the right result.
Maybe we will have more success with longer sentences.
The lemmas are a bit strange:
- “no” is a contraction of “em + o”
- “como” is not a verb here and the lemma should not be “comer”
- “uma” should not be “umar”
- “para” is also not a verb here
If the lemmas were intended to be written this way, they should at least be consistent. But spaCy sometimes assigns “para” as the lemma instead of “parir” (for example in the sentence “Para mim estão boas !”).
There is a Python binding for Hunspell called “CyHunspell”. This library provides a function stem, which looks up the root of a word. In contrast to spaCy, it cannot take into account the context in which the word occurs. A dictionary can be downloaded here.
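As a rough sketch of how the lookup might be wired up: the dictionary name `pt_PT`, the directory `dictionaries` and both helper names below are placeholders I chose for illustration, so adjust them to wherever your `.dic` and `.aff` files actually live.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple


def load_hunspell(lang: str = "pt_PT", data_dir: str = "dictionaries"):
    """Open a Hunspell dictionary; lang and data_dir are placeholder assumptions."""
    from hunspell import Hunspell  # provided by the CyHunspell package

    return Hunspell(lang, hunspell_data_dir=data_dir)


def stem_tokens(
    tokens: Iterable[str], stem: Callable[[str], Sequence[str]]
) -> Dict[str, Tuple[str, ...]]:
    """Map each token to every stem the dictionary offers (possibly none)."""
    return {token: tuple(stem(token)) for token in tokens}
```

Calling `stem_tokens(tokens, load_hunspell().stem)` would then give one tuple of candidate stems per token. An empty tuple means the dictionary has no entry for that token.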
A tokenizer must also be applied beforehand. If we ignore special cases like mesoclisis, it is easy to write our own.
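A minimal tokenizer along these lines could be a single regular expression. This is a sketch rather than the exact code I used; keeping hyphenated forms such as “diz-me” together is a simplification, and the name `tokenize` is just illustrative.

```python
import re
from typing import List

# A word is a run of word characters (accented letters included, since \w is
# Unicode-aware in Python 3; it also matches digits and "_"), optionally
# followed by hyphen-joined parts. Any other non-space character is emitted
# as a token of its own.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


print(tokenize("Para mim estão boas !"))
# ['Para', 'mim', 'estão', 'boas', '!']
```

Because clitic forms are kept as one token, they may not be found in the dictionary. Splitting them properly is exactly the kind of special case this sketch leaves out.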
The results are:
Not every word gets assigned a lemma, because some tokens don’t seem to have entries in the dictionary.
Another problem is the context. For example, the dictionary has two different stems for the word “sentido”: “sentir” and “sentido”. In the first case, the word is the past participle of a verb, as in the pretérito perfeito composto (tenho sentido etc.). In the second case, the word is a noun. Hence, we need a Part-of-Speech (POS) tagger to decide which case is the right one.
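To make the idea concrete, here is a deliberately crude sketch of such a selection step. It assumes Universal Dependencies style tags (“VERB”, “AUX”, “NOUN”, ...) and guesses that a stem is a verb when it ends like an infinitive. Both the function name and the heuristic are my own illustration, not part of Hunspell.

```python
from typing import Optional, Sequence

# Endings of Portuguese infinitives ("ôr" and "or" cover "pôr" and verbs
# derived from it). Many nouns share these endings, so this is a heuristic.
INFINITIVE_ENDINGS = ("ar", "er", "ir", "or", "ôr")
VERB_TAGS = {"VERB", "AUX"}  # assumed Universal Dependencies tag names


def choose_lemma(stems: Sequence[str], pos: str) -> Optional[str]:
    """Pick one of the candidate stems based on the token's POS tag."""
    if not stems:
        return None  # the dictionary had no entry for this token
    infinitives = [s for s in stems if s.endswith(INFINITIVE_ENDINGS)]
    others = [s for s in stems if s not in infinitives]
    if pos in VERB_TAGS and infinitives:
        return infinitives[0]
    if pos not in VERB_TAGS and others:
        return others[0]
    return stems[0]  # no candidate fits the tag, fall back to the first one


print(choose_lemma(("sentir", "sentido"), "VERB"))  # sentir
print(choose_lemma(("sentir", "sentido"), "NOUN"))  # sentido
```

A real solution would need a proper POS tagger and a dictionary that stores the word class of each stem. The ending test only illustrates where the tag enters the decision.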
LemPORT is written in Java and requires an external tokenizer and POS tagger.
When I used the right annotations, the lemmas were generated correctly. However, there is an issue with the size of the dictionary. Using the full dictionary “resources/acdc/lemas.total.txt” results in a “java.lang.OutOfMemoryError: GC overhead limit exceeded” exception. To fix this, one can either give Java more memory (for example by raising the maximum heap size with the -Xmx option) or use a smaller dictionary.
NLTK is one of the most popular libraries for NLP-related tasks. However, it does not contain a lemmatizer for Portuguese. There are only two stemmers: RSLPStemmer and the Snowball stemmer.
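For completeness, the two stemmers can be compared with a few lines like the ones below. The helper names are mine, and note that RSLPStemmer needs its rule files, which are fetched once with `nltk.download("rslp")`.

```python
from typing import Dict, Iterable


def build_stemmers() -> dict:
    """Create both Portuguese stemmers shipped with NLTK."""
    from nltk.stem import RSLPStemmer, SnowballStemmer  # imported lazily

    # RSLPStemmer raises a LookupError until nltk.download("rslp") has been run.
    return {"rslp": RSLPStemmer(), "snowball": SnowballStemmer("portuguese")}


def stem_words(words: Iterable[str], stemmers: dict) -> Dict[str, Dict[str, str]]:
    """Return, for every word, the stem produced by each stemmer."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }
```

`stem_words(["sentido", "estás"], build_stemmers())` then shows both stems next to each other. Keep in mind that a stem is a truncated form and not necessarily a dictionary word, which is why stemmers are no substitute for a lemmatizer here.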
In the end, no library really convinced me. LemPORT seems to work fairly well, but it is written in Java. spaCy can be used if you train a better language model. Hunspell needs a better dictionary and requires a POS tagger. And NLTK contains no lemmatizers, only stemmers.