Portuguese Lemmatizers

In this post, I will compare some lemmatizers for Portuguese. In order to do the comparison, I downloaded subtitles from various television programs. The sentences are written in European Portuguese (EP).

Spacy

Spacy is a relatively new NLP library for Python. A language model for Portuguese can be downloaded here. This model was trained with a CNN on the Universal Dependencies and WikiNER corpora.

Let us try some sentences.

import spacy

nlp = spacy.load("pt_core_news_sm/pt_core_news_sm-2.0.0")

text = ""
pos = ""
lemma = ""
for token in nlp("Estás bem ?"):
    text += token.text + "\t"
    pos += token.pos_ + "\t"
    lemma += token.lemma_ + "\t"

print(text)
print(pos)
print(lemma)

The output is:

Estás   bem ?
NOUN    ADV PUNCT
Estás   bem ?

There is a mistake with the word “Estás”: it is a verb, not a noun, and the lemma should be “estar”. Most Portuguese-speaking countries don’t use the second-person singular, so the problem could be that the training corpus does not contain enough text written in EP.

To verify this, let us consider the sentences “Está bem ?” and “Você está bem ?”.
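Running the same loop as before over both sentences gives:

for sentence in ["Está bem ?", "Você está bem ?"]:
    text, pos, lemma = "", "", ""
    for token in nlp(sentence):
        text += token.text + "\t"
        pos += token.pos_ + "\t"
        lemma += token.lemma_ + "\t"
    print(text)
    print(pos)
    print(lemma)
    print()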

Está    bem ?   
VERB    ADV PUNCT   
Está    bem ?   

Você    está    bem ?   
PRON    VERB    ADV PUNCT   
Você    estar   bem ?   

The library now recognizes that “Está” is a verb, but it still doesn’t find the correct lemma. Only by explicitly adding a pronoun do we get the right result.

Maybe we have more success with longer sentences.

Não     ,       minha   miúda   no      sentido que     és
INTJ    PUNCT   DET     NOUN    PRON    NOUN    PRON    VERB
Não     ,       meu     miúdo   o       sentir  que     ser

como    uma     irmã    para    mim     .
ADP     DET     NOUN    ADP     PRON   PUNCT   
comer   umar    irmão   parir   mim     .

The lemmas are a bit strange:

  • “no” is a contraction of “em + o”
  • “como” is not a verb here, so the lemma should not be “comer”
  • “uma” should not become “umar”
  • “para” is also not a verb here, so the lemma should not be “parir”

Even if the lemmas were intended to be written this way, they should at least be consistent. But Spacy sometimes assigns “para” as the lemma and not “parir” (for example in the sentence “Para mim estão boas !”).
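This is easy to double-check with the same model (the exact output may of course vary with the model version):

for token in nlp("Para mim estão boas !"):
    print(token.text, token.pos_, token.lemma_, sep="\t")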

Hunspell

There exists a Python binding for Hunspell called “CyHunspell”. This library contains a stem function that returns the possible roots of a word. In contrast to Spacy, it cannot take the context in which a word occurs into account. A dictionary can be downloaded here.

It is also necessary to run a tokenizer beforehand. If we don’t consider special cases like mesoclisis, it’s easy to write our own.

import hunspell
import re

def tokenize(sentence):
    # Split on punctuation, digits and whitespace.
    tokens_regex = re.compile(r"([., :;\n()\"#!?1234567890/&%+])", flags=re.IGNORECASE)
    tokens = re.split(tokens_regex, sentence)
    postprocess = []
    # Separate clitic pronouns attached with a hyphen, e.g. "dá-me" -> "dá", "me".
    postprocess_regex = re.compile(r"\b(\w+)-(me|te|se|nos|vos|o|os|a|as|lo|los|la|las|lhe|lhes|lha|lhas|lho|lhos|no|na|nas|mo|ma|mos|mas|to|ta|tos|tas)\b", flags=re.IGNORECASE)
    for token in tokens:
        for token2 in re.split(postprocess_regex, token):
            if token2.strip():
                postprocess.append(token2)

    return postprocess

tokens = tokenize("Estás bem ?")
h = hunspell.Hunspell("pt_PT", hunspell_data_dir="/usr/share/hunspell/")

text = ""
lemmas = ""
for token in tokens:
    text += token + "\t"
    lemma = h.stem(token)
    if len(lemma) == 1:
        lemmas += lemma[0] + "\t"
    else:
        lemmas += token + "\t"

print(text)
print(lemmas)

The results are:

Estás   bem ?   
estás   bem ?   

Está    bem ?
está    bem ?

Não ,   minha   miúda   no  sentido que és  como    uma irmã    para    mim .
não ,   minha   miúdo   no  sentido que és  como    um  irmão   para    mim .

Not every word gets assigned a lemma, because some tokens don’t seem to have entries in the dictionary.

Another problem is the context. The dictionary has, for example, two different stems for the word “sentido”: “sentir” and “sentido”. In the first case, it could be part of a verb conjugated in the pretérito perfeito composto (tenho sentido etc.). In the second case, the word is a noun. Hence, we need a Part-of-Speech (POS) tagger to decide which case is the right one.
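The ambiguity can be made visible by asking Hunspell for all stems of each token. A small sketch reusing the tokenize function and the Hunspell object from above; the entries naturally depend on the dictionary in use, and tokens without an entry seem to return an empty result:

for token in tokenize("Não , minha miúda no sentido que és como uma irmã para mim ."):
    # Print every stem the dictionary offers for this token.
    print(token, h.stem(token))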

LemPORT

This library is written in Java and requires an external tokenizer and POS Tagger.

import lemma.Lemmatizer;

public class Main {

    public static void main(final String[] args) {
        final String[] tokens = {"Estás", "bem", "?"};
        // POS tags for the tokens ("v-fin" = finite verb, "adv" = adverb, "punc" = punctuation).
        final String[] tags = {"v-fin", "adv", "punc"};

        final Lemmatizer lemmatizer;
        final String[] lemmas;
        try {
            lemmatizer = new Lemmatizer();
            lemmas = lemmatizer.lemmatize(tokens, tags);
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }

        final StringBuilder token = new StringBuilder();
        final StringBuilder lemma = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            token.append(tokens[i]).append("\t");
            lemma.append(lemmas[i]).append("\t");
        }
        System.out.println(token);
        System.out.println(lemma);
    }
}

When I used the right annotations, the lemmas were generated correctly. However, there is an issue with the size of the dictionary: using the full dictionary “resources/acdc/lemas.total.txt” results in a “java.lang.OutOfMemoryError: GC overhead limit exceeded” exception. To fix this, one can either give Java more memory (for example via the -Xmx flag) or use a smaller dictionary.

NLTK

NLTK is one of the most popular libraries for NLP-related tasks. However, it does not contain a lemmatizer for Portuguese. There are only two stemmers that support Portuguese: RSLPStemmer and SnowballStemmer.
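For comparison, here is how both stemmers can be used; note that they return stems, not lemmas, and the example words are my own choice:

import nltk
from nltk.stem import RSLPStemmer
from nltk.stem.snowball import SnowballStemmer

nltk.download("rslp")  # the RSLP stemming rules only have to be downloaded once

rslp = RSLPStemmer()
snowball = SnowballStemmer("portuguese")

for word in ["Estás", "sentido", "miúda"]:
    print(word, rslp.stem(word), snowball.stem(word))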

Summary

In the end, no library really convinced me. LemPORT seems to work fairly well, but it is written in Java. Spacy can be used if you train a better language model. Hunspell needs a better dictionary and requires a POS tagger. And NLTK contains no lemmatizers, only stemmers.
