Machine learning, computer vision, languages
08 May 2018
In this post, I compare several lemmatizers for Portuguese. For the comparison, I downloaded subtitles from various television programs. The sentences are written in European Portuguese (EP).
01.09.2020: I have migrated the post from my old blog and updated it to reflect the current state of lemmatization.
There is a Python binding for Hunspell called “CyHunspell”. This library provides a stem function that can be used to get the root of a word. In contrast to Spacy, it cannot take the context in which a word occurs into account. A dictionary can be downloaded here.
A tokenizer also has to be applied beforehand. If we ignore special cases like mesoclisis, it is easy to write our own.
import hunspell
import re


def tokenize(sentence):
    # Split on punctuation, digits and whitespace.
    tokens_regex = re.compile(r"([., :;\n()\"#!?1234567890/&%+])", flags=re.IGNORECASE)
    tokens = re.split(tokens_regex, sentence)
    postprocess = []
    # Split off clitic pronouns, e.g. "disse-me" -> "disse", "me".
    postprocess_regex = re.compile(r"\b(\w+)-(me|te|se|nos|vos|o|os|a|as|lo|los|la|las|lhe|lhes|lha|lhas|lho|lhos|no|na|nas|mo|ma|mos|mas|to|ta|tos|tas)\b", flags=re.IGNORECASE)
    for token in tokens:
        for token2 in re.split(postprocess_regex, token):
            if token2.strip():
                postprocess.append(token2)
    return postprocess


tokens = tokenize("Estás bem ?")
h = hunspell.Hunspell("pt_PT", hunspell_data_dir="/usr/share/hunspell/")

text = ""
lemmas = ""
for token in tokens:
    text += token + "\t"
    lemma = h.stem(token)
    if len(lemma) == 1:
        lemmas += lemma[0] + "\t"
    else:
        # Ambiguous or unknown token: keep the original form.
        lemmas += token + "\t"

print(text)
print(lemmas)
The results are:
Estás bem ?
estás bem ?
Está bem ?
está bem ?
Não , minha miúda no sentido que és como uma irmã para mim .
não , minha miúdo no sentido que és como um irmão para mim .
Not every word gets assigned a lemma, because some tokens don’t seem to have entries in the dictionary.
Another problem is context. For example, the dictionary has two different stems for the word “sentido”: “sentir” and “sentido”. In the first case, it could be a verb conjugated in the pretérito perfeito composto (tenho sentido etc.); in the second case, the word is a noun. Hence, we need a Part-of-Speech (POS) tagger to decide which reading is the right one.
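To illustrate this ambiguity, here is a minimal sketch that reuses the dictionary loaded above; the exact candidate list depends on the installed pt_PT dictionary.

import hunspell

h = hunspell.Hunspell("pt_PT", hunspell_data_dir="/usr/share/hunspell/")

# stem() returns every candidate stem; it cannot use the surrounding context,
# so ambiguous forms come back with more than one candidate.
print(h.stem("sentido"))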
The next lemmatizer is a library written in Java and requires an external tokenizer and POS tagger.
import lemma.Lemmatizer;

public class Main {
    public static void main(final String[] args) {
        // Tokens and their POS tags have to be provided by external tools.
        final String[] tokens = {"Estás", "bem", "?"};
        final String[] tags = {"v-fin", "adv", "punc"};

        final Lemmatizer lemmatizer;
        final String[] lemmas;
        try {
            lemmatizer = new Lemmatizer();
            lemmas = lemmatizer.lemmatize(tokens, tags);
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }

        final StringBuilder token = new StringBuilder();
        final StringBuilder lemma = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            token.append(tokens[i]).append("\t");
            lemma.append(lemmas[i]).append("\t");
        }
        System.out.println(token);
        System.out.println(lemma);
    }
}
When I used the right annotations, the lemmas were generated correctly. However, there is an issue with the size of the dictionary. Using the full dictionary “resources/acdc/lemas.total.txt” results in a “java.lang.OutOfMemoryError: GC overhead limit exceeded” exception. To fix this, one can either give Java more memory (for example via the -Xmx flag) or use a smaller dictionary.
NLTK is one of the most popular libraries for NLP-related tasks. However, it does not contain a lemmatizer for Portuguese. There are only two stemmers: the RSLPStemmer and the Snowball stemmer.
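For completeness, here is a small sketch of both stemmers; note that they only strip suffixes and do not return dictionary forms, so the output is not directly comparable to a lemma.

import nltk
from nltk.stem import RSLPStemmer, SnowballStemmer

# The RSLP rules have to be downloaded once.
nltk.download("rslp")

rslp = RSLPStemmer()
snowball = SnowballStemmer("portuguese")

for word in ["sentido", "irmã", "estás"]:
    print(word, rslp.stem(word), snowball.stem(word))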
Spacy is a relatively new NLP library for Python. A language model for Portuguese can be downloaded here. This model was trained with a CNN on the Universal Dependencies and WikiNER corpora.
Let us try some sentences.
import spacy

nlp = spacy.load("pt_core_news_lg")

text = ""
pos = ""
lemma = ""
for token in nlp("Estás bem ?"):
    text += token.text + "\t"
    pos += token.pos_ + "\t"
    lemma += token.lemma_ + "\t"

print(text)
print(pos)
print(lemma)
Estás bem ?
AUX ADV PUNCT
Estás bem ?
There is a mistake with the word “Estás”. The lemma should be “estar”. Most Portuguese-speaking countries don’t use the second-person singular. Thus, the problem could be that the corpus does not contain enough texts written in EP.
To verify this, let us consider the sentences “Está bem ?” and “Você está bem ?”.
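The check itself is a short sketch that reuses the nlp pipeline loaded above.

# Reusing the pipeline from above to compare the two variants.
for sentence in ["Está bem ?", "Você está bem ?"]:
    doc = nlp(sentence)
    print("\t".join(t.text for t in doc))
    print("\t".join(t.pos_ for t in doc))
    print("\t".join(t.lemma_ for t in doc))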
Está bem ?
VERB ADV PUNCT
Está bem ?
Você está bem ?
PRON VERB ADV PUNCT
Você estar bem ?
The library still does not find the correct lemma. Only by explicitly adding a pronoun do we get the right result.
Maybe we have more success with longer sentences.
Não , minha miúda no sentido que és
ADV PUNCT DET NOUN DET NOUN PRON AUX
Não , meu miúdo o sentir que ser
como uma irmã para mim .
ADP DET NOUN ADP PRON PUNCT
comer umar irmão parir mim .
The lemmas are a bit strange: “como” becomes “comer”, “uma” becomes “umar”, and “para” becomes “parir”. Assuming the lemmas were intended to be written this way, they should at least be consistent. But Spacy sometimes assigns “para” as the lemma and not “parir” (for example in the sentence “Para mim estão boas !”).
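Such an inconsistency is easy to reproduce; the following sketch (again reusing the nlp object from above) simply compares the lemma assigned to “para” in the two sentences.

# Compare the lemma that Spacy assigns to "para" in different contexts.
for sentence in ["como uma irmã para mim .", "Para mim estão boas !"]:
    doc = nlp(sentence)
    print([(t.text, t.lemma_) for t in doc if t.text.lower() == "para"])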
Stanza came out just this year (2020). Like Spacy, it is quite easy to use and also provides pretrained Portuguese models. If there are any problems installing the library, try installing directly from GitHub: pip install git+https://github.com/stanfordnlp/stanza.git.
import stanza

stanza.download('pt')
nlp = stanza.Pipeline('pt')

text = ""
pos = ""
lemma = ""
for sent in nlp("Não, minha miúda no sentido que és como uma irmã para mim.").sentences:
    for word in sent.words:
        text += word.text + "\t"
        pos += word.upos + "\t"
        lemma += word.lemma + "\t"

print(text)
print(pos)
print(lemma)
The results are:
Estás bem ?
AUX ADV PUNCT
estar bem ?
and
Não , minha miúda em o sentido que
ADV PUNCT DET NOUN ADP DET NOUN PRON
não , meu miúda em o sentido que
és como uma irmã para mim .
AUX ADP DET NOUN ADP PRON PUNCT
ser como um irmã para eu .
The results are good, but still not perfect: the contraction “no” was replaced by “em o” (see word.text).
The next library, the Turku neural parser pipeline (the “Universal Lemmatizer”), is a little more complex to install than stanza and spacy.
git clone https://github.com/TurkuNLP/Turku-neural-parser-pipeline.git
cd Turku-neural-parser-pipeline
Then start Docker with sudo systemctl start docker and run
docker build -t "my_portuguese_parser" --build-arg models=pt_bosque --build-arg hardware=cpu -f Dockerfile-lang .
Alternatively, instead of pt_bosque, there are also pt_gsd and pt_pud. I used pt_bosque because it contains both European (CETEMPúblico) and Brazilian (CETENFolha) variants.
Then we can feed texts to the docker image:
echo "Não, minha miúda no sentido que és como uma irmã para mim." | docker run -i my_portuguese_parser stream pt_bosque parse_plaintext
The result is in the CoNLL-U format. The library pyconll can be used for parsing output like the following (a short sketch is given after the example):
# newdoc
# newpar
# sent_id = 1
# text = Não, minha miúda no sentido que és como uma irmã para mim.
1 Não não INTJ _ _ 4 advmod _ SpaceAfter=No
2 , , PUNCT _ _ 1 punct _ _
3 minha meu DET _ Gender=Fem|Number=Sing|PronType=Prs 4 det _ _
4 miúda miúda NOUN _ Gender=Fem|Number=Sing 0 root _ _
5-6 no _ _ _ _ _ _ _ _
5 em em ADP _ _ 7 case _ _
6 o o DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 7 det _ _
7 sentido sentido NOUN _ Gender=Masc|Number=Sing 4 nmod _ _
8 que que PRON _ Gender=Masc|Number=Sing|PronType=Rel 9 nsubj _ _
9 és ser VERB _ Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 7 acl:relcl _ _
10 como como ADP _ _ 12 case _ _
11 uma um DET _ Definite=Ind|Gender=Fem|Number=Sing|PronType=Art 12 det _ _
12 irmã irmã NOUN _ Gender=Fem|Number=Sing 9 obl _ _
13 para para ADP _ _ 14 case _ _
14 mim eu PRON _ Gender=Unsp|Number=Sing|Person=1|PronType=Prs 12 nmod _ SpaceAfter=No
15 . . PUNCT _ _ 4 punct _ SpacesAfter=\n
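As a minimal sketch, the file produced by the parser can be read with pyconll roughly as follows; the file name is just a placeholder.

import pyconll

# Load a CoNLL-U file written by the parser (the path is an example).
conll = pyconll.load_from_file("parsed.conllu")

for sentence in conll:
    for token in sentence:
        # token.form is the surface form, token.lemma the lemma, token.upos the POS tag.
        print(token.form, token.lemma, token.upos)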
But since stanza also uses pt_bosque, the results are approximately the same. Still, the original paper reports slight improvements on almost all treebanks.
It is also possible to lemmatize entire texts by sending POST requests to the docker image. The following bash script splits a file ../feed.txt into chunks of 10,000 lines and appends the output to split/parsed.conllu.
mkdir split
cd split
split -l 10000 ../feed.txt
cd ..
for filename in split/*; do
    echo $filename
    if [[ $filename == split/x* ]]; then
        curl --request POST --header 'Content-Type: text/plain; charset=utf-8' --data-binary @"$filename" http://localhost:15000 >> "split/parsed.conllu"
    fi
done
The splitting is necessary when millions of sentences need to be lemmatized; otherwise the request fails for me. This could be a bug, or I might simply not have enough RAM.
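The same request can also be sent from Python, for example with the requests library; this is only a sketch, assuming the server is reachable on port 15000 as above and that chunk.txt stands for one of the split files.

import requests

# Send one chunk of plain text to the parser server (same port as in the curl call).
with open("chunk.txt", "rb") as f:  # "chunk.txt" is a placeholder for one split file
    response = requests.post(
        "http://localhost:15000",
        headers={"Content-Type": "text/plain; charset=utf-8"},
        data=f.read(),
    )

# Append the returned CoNLL-U output to the combined file.
with open("split/parsed.conllu", "a", encoding="utf-8") as out:
    out.write(response.text)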
Instead of using the library pyconll, manual parsing can be performed as follows:
from tqdm import tqdm

train = []
with open("split/parsed.conllu", "r") as f:
    k = []  # (form, lemma, upos) triples of the current sentence
    sent = ""
    for line in tqdm(f.readlines()):
        if "# text" in line:
            sent = line
            train.append([sent.replace("# text =", "").strip(), []])
        if "#" in line or "\n" == line:
            continue
        s = line.split("\t")
        if "1" == s[0]:
            # A new sentence starts: attach the collected tokens to the previous one.
            if len(train) > 1:
                train[-2][1].extend(k)
            k = []
        k.append((s[1], s[2], s[3]))
    # Do not forget the tokens of the last sentence.
    if train:
        train[-1][1].extend(k)

out = []
for sent, sentence in tqdm(train):
    for token, lemma, pos in sentence:
        ...
The neural-network-based lemmatizers have gotten much better. Personally, I often use the “Universal Lemmatizer” because it also works well in other languages such as German. The main alternative is stanza, which also offers other tools such as named entity recognition (NER).
However, no lemmatizer is perfect. It is easy to find sentences where there are obvious mistakes.