
Portuguese Lemmatizers (2020 update)

08 May 2018

In this post, I will compare several lemmatizers for Portuguese. To do the comparison, I downloaded subtitles from various television programs. The sentences are written in European Portuguese (EP).

01.09.2020: I have migrated the post from my old blog and updated it to reflect the current state of lemmatization.

Rule-based

Hunspell

There is a Python binding for Hunspell called “CyHunspell”. This library provides a function stem which can be used to get the root of a word. In contrast to Spacy, it cannot take the context in which a word occurs into account. A dictionary can be downloaded here.

A tokenizer also has to be applied beforehand. If we ignore special cases like mesoclisis, it is easy to write our own.

import hunspell
import re

def tokenize(sentence):
    # split on punctuation, whitespace and digits
    tokens_regex = re.compile(r"([., :;\n()\"#!?1234567890/&%+])", flags=re.IGNORECASE)
    tokens = re.split(tokens_regex, sentence)
    postprocess = []
    # separate enclitic pronouns, e.g. "dá-me" -> "dá", "me"
    postprocess_regex = re.compile(r"\b(\w+)-(me|te|se|nos|vos|o|os|a|as|lo|los|la|las|lhe|lhes|lha|lhas|lho|lhos|no|na|nas|mo|ma|mos|mas|to|ta|tos|tas)\b", flags=re.IGNORECASE)
    for token in tokens:
        for token2 in re.split(postprocess_regex, token):
            if token2.strip():
                postprocess.append(token2)

    return postprocess

tokens = tokenize("Estás bem ?")
h = hunspell.Hunspell("pt_PT", hunspell_data_dir="/usr/share/hunspell/")

text = ""
lemmas = ""
for token in tokens:
    text += token + "\t"
    lemma = h.stem(token)
    # keep the original token if the stem is missing or ambiguous
    if len(lemma) == 1:
        lemmas += lemma[0] + "\t"
    else:
        lemmas += token + "\t"

print(text)
print(lemmas)

The results are:

Estás   bem ?   
estás   bem ?   

Está    bem ?
está    bem ?

Não ,   minha   miúda   no  sentido que és  como    uma irmã    para    mim .
não ,   minha   miúdo   no  sentido que és  como    um  irmão   para    mim .

Not every word gets assigned a lemma, because some tokens don’t seem to have entries in the dictionary.

Another problem is context. For example, the dictionary has two different stems for the word “sentido”: “sentir” and “sentido”. In the first case, it could be a verb conjugated in the pretérito perfeito composto (tenho sentido, etc.). In the second case, the word is a noun. Hence, we need a part-of-speech (POS) tagger to decide which case is the right one.
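We can see the ambiguity directly by asking Hunspell for all stem candidates (a minimal sketch; the exact candidates depend on the dictionary):

# "sentido" has two possible stems, so stem() returns more than one candidate
print(h.stem("sentido"))  # e.g. ('sentido', 'sentir'); a POS tagger must pick one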

LemPORT

This library is written in Java and requires an external tokenizer and POS tagger.

import lemma.Lemmatizer;

public class Main {

    public static void main(final String[] args) {
        // tokens and their POS annotations in the tagset expected by LemPORT
        final String[] tokens = {"Estás", "bem", "?"};
        final String[] tags = {"v-fin", "adv", "punc"};

        final Lemmatizer lemmatizer;
        final String[] lemmas;
        try {
            lemmatizer = new Lemmatizer();
            lemmas = lemmatizer.lemmatize(tokens, tags);
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }

        final StringBuilder token = new StringBuilder();
        final StringBuilder lemma = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            token.append(tokens[i]).append("\t");
            lemma.append(lemmas[i]).append("\t");
        }
        System.out.println(token);
        System.out.println(lemma);
    }
}

When I used the right annotations, the lemmas were generated correctly. However, there is an issue with the size of the dictionary. Using the full dictionary “resources/acdc/lemas.total.txt” results in a “java.lang.OutOfMemoryError: GC overhead limit exceeded” exception. To fix this, one can either give Java more memory (e.g. with the -Xmx flag) or use a smaller dictionary.

NLTK

NLTK is one of the most popular libraries for NLP-related tasks. However, it does not contain a lemmatizer for Portuguese; there are only two stemmers, RSLPStemmer and SnowballStemmer.
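For reference, both stemmers are easy to try. The following is a minimal sketch; the stems in the comment are what I would expect, not verified output:

import nltk
from nltk.stem import RSLPStemmer
from nltk.stem.snowball import SnowballStemmer

nltk.download("rslp")  # data needed by the RSLP stemmer

rslp = RSLPStemmer()
snowball = SnowballStemmer("portuguese")

# stemmers only strip suffixes, so they return stems rather than lemmas,
# e.g. something like "sent" instead of "sentir"/"sentido"
print(rslp.stem("sentido"))
print(snowball.stem("sentido"))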

Neural network based

Spacy

Spacy is a relatively new NLP library for Python. A language model for Portuguese can be downloaded here. The model was trained with a CNN on the Universal Dependencies and WikiNER corpora.
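The model can also be fetched programmatically instead of via the link (a small sketch using Spacy's download helper):

import spacy.cli

# downloads the large Portuguese model used below
spacy.cli.download("pt_core_news_lg")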

Let us try some sentences.

import spacy

nlp = spacy.load("pt_core_news_lg")

text = ""
pos = ""
lemma = ""
for token in nlp("Estás bem ?"):
    text += token.text + "\t"
    pos += token.pos_ + "\t"
    lemma += token.lemma_ + "\t"

print(text)
print(pos)
print(lemma)

The results are:

Estás   bem ?   
AUX     ADV PUNCT   
Estás   bem ?   

There is a mistake with the word “Estás”: the lemma should be “estar”. Most Portuguese-speaking countries don’t use the second-person singular, so the problem could be that the training corpus does not contain enough text written in EP.

To verify this, let us consider the sentences “Está bem ?” and “Você está bem ?”.

Está    bem ?   
VERB    ADV PUNCT   
Está    bem ?   

Você    está    bem ?   
PRON    VERB    ADV PUNCT   
Você    estar   bem ?   

The library still doesn’t find the correct lemma. Only by explicitly adding a pronoun do we get the right result.

Maybe we will have more success with longer sentences.

Não ,     minha   miúda   no  sentido que  és
ADV PUNCT DET     NOUN    DET NOUN    PRON AUX
Não ,     meu     miúdo   o   sentir  que  ser

como   uma   irmã    para   mim   .   
ADP    DET   NOUN    ADP    PRON  PUNCT   
comer  umar  irmão   parir  mim   .   

The lemmas are a bit strange: “como” becomes “comer”, “uma” becomes “umar” (not even a word), “para” becomes “parir”, and “irmã” becomes “irmão”.

Assuming the lemmas were intended to be written this way, they should at least be consistent. But Spacy sometimes assigns “para” as the lemma instead of “parir” (for example in the sentence “Para mim estão boas !”).

Stanza

Stanza just came out this year (2020). Like Spacy, it is quite easy to use and also provides pretrained Portuguese models. If there are any problems installing the library, try installing directly from GitHub: pip install git+https://github.com/stanfordnlp/stanza.git.

import stanza

stanza.download('pt')        # download the Portuguese models (run once)
nlp = stanza.Pipeline('pt')  # tokenizer, POS tagger and lemmatizer pipeline

text = ""
pos = ""
lemma = ""
for sent in nlp("Não, minha miúda no sentido que és como uma irmã para mim.").sentences:
    for word in sent.words:
        text += word.text + "\t"
        pos += word.upos + "\t"
        lemma += word.lemma + "\t"

print(text)
print(pos)
print(lemma)

The results are

Estás   bem ?
AUX     ADV PUNCT
estar   bem ?

and

Não ,     minha miúda   em  o   sentido que
ADV PUNCT DET   NOUN    ADP DET NOUN    PRON
não ,     meu   miúda   em  o   sentido que

és  como  uma irmã    para    mim   .   
AUX ADP   DET NOUN    ADP     PRON  PUNCT   
ser como  um  irmã    para    eu    .   

The results are good, but still not perfect: the determiners are lemmatized to their masculine forms (“minha” → “meu”, “uma” → “um”), while the nouns keep their feminine forms (“miúda”, “irmã”).

Universal Lemmatizer

This library is a little more complex to install than stanza and spacy.

git clone https://github.com/TurkuNLP/Turku-neural-parser-pipeline.git
cd Turku-neural-parser-pipeline

Then start Docker (sudo systemctl start docker) and run

docker build -t "my_portuguese_parser" --build-arg models=pt_bosque --build-arg hardware=cpu -f Dockerfile-lang .

Alternatively, instead of pt_bosque, there are also pt_gsd and pt_pud. I used pt_bosque because it contains both European (CETEMPúblico) and Brazilian (CETENFolha) texts.

Then we can feed text to the Docker image:

echo "Não, minha miúda no sentido que és como uma irmã para mim." | docker run -i my_portuguese_parser stream pt_bosque parse_plaintext

The result is in the CoNLL-U format. The library pyconll can be used for parsing the following output:

# newdoc
# newpar
# sent_id = 1
# text = Não, minha miúda no sentido que és como uma irmã para mim.
1   Não     não     INTJ    _   _   4   advmod  _   SpaceAfter=No
2   ,       ,       PUNCT   _   _   1   punct   _   _
3   minha   meu     DET     _   Gender=Fem|Number=Sing|PronType=Prs 4   det _   _
4   miúda   miúda   NOUN    _   Gender=Fem|Number=Sing  0   root    _   _
5-6 no      _       _       _   _   _   _   _   _
5   em      em      ADP     _   _   7   case    _   _
6   o       o       DET     _   Definite=Def|Gender=Masc|Number=Sing|PronType=Art   7   det _   _
7   sentido sentido NOUN    _   Gender=Masc|Number=Sing 4   nmod    _   _
8   que     que     PRON    _   Gender=Masc|Number=Sing|PronType=Rel    9   nsubj   _   _
9   és      ser     VERB    _   Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin   7   acl:relcl   _   _
10  como    como    ADP     _   _   12  case    _   _
11  uma     um      DET     _   Definite=Ind|Gender=Fem|Number=Sing|PronType=Art    12  det _   _
12  irmã    irmã    NOUN    _   Gender=Fem|Number=Sing  9   obl _   _
13  para    para    ADP     _   _   14  case    _   _
14  mim     eu      PRON    _   Gender=Unsp|Number=Sing|Person=1|PronType=Prs   12  nmod    _   SpaceAfter=No
15  .       .       PUNCT   _   _   4   punct   _   SpacesAfter=\n
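As a sketch, the output above can then be read with pyconll (assuming it was saved to a file parsed.conllu):

import pyconll

# iterate over all sentences and print the token, lemma and POS columns
conll = pyconll.load_from_file("parsed.conllu")
for sentence in conll:
    for token in sentence:
        print(token.form, token.lemma, token.upos)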

But since stanza also uses pt_bosque, the results are approximately the same. Still, the original paper reports slight improvements on almost all treebanks.

It is also possible to lemmatize entire texts by sending POST requests to the docker image (assuming the container is running as a server on port 15000). The following bash script splits the file ../feed.txt into chunks of 10000 lines and appends the output to a split/parsed.conllu file.

mkdir split
cd split
split -l 10000 ../feed.txt
cd ..

for filename in split/*; do
    echo "$filename"
    # only send the chunks created by split (named xaa, xab, ...),
    # not the growing parsed.conllu file itself
    if [[ $filename == split/x* ]]
    then
        curl --request POST --header 'Content-Type: text/plain; charset=utf-8' --data-binary @"$filename" http://localhost:15000 >> "split/parsed.conllu"
    fi
done

The splitting is necessary when millions of sentences need to be lemmatized. This could be a bug, or I might simply not have enough RAM.

Instead of using the library pyconll, manual parsing can be performed as follows:

from tqdm import tqdm

train = []
with open("split/parsed.conllu", "r") as f:
    k = []
    for line in tqdm(f.readlines()):
        if "# text" in line:
            # start a new sentence: [raw text, list of (token, lemma, POS)]
            train.append([line.replace("# text =", "").strip(), []])
        if "#" in line or "\n" == line:
            continue
        s = line.split("\t")
        if "1" == s[0]:
            # a new token list starts, so flush the previous sentence
            if len(train) > 1:
                train[-2][1].extend(k)
            k = []
        k.append((s[1], s[2], s[3]))
    # flush the tokens of the last sentence
    if train:
        train[-1][1].extend(k)

out = []
for sent, sentence in tqdm(train):
    for token, lemma, pos in sentence:
        ...

Summary

The neural-network-based lemmatizers have gotten much better. Personally, I often use the “Universal Lemmatizer” because it also works well in other languages such as German. The main alternative is stanza, which also offers other tools such as NER (named entity recognition).

However, no lemmatizer is perfect. It is easy to find sentences where there are obvious mistakes.