
24 Jul 2020

Having introduced Riemannian SGD in the last blog post, I will now give a concrete application of this optimization method. Poincaré embeddings [1][2] are hierarchical word embeddings which map integer-encoded words into hyperbolic space.

Even though the original paper uses the Poincaré unit ball, any reasonable manifold can work. For example, when the data is not really a tree, Euclidean space can produce better embeddings.

Poincaré Embeddings consist of the following components:

- dataset: \(\{(w_1, v_1), \dots, (w_n, v_n)\}\) where \(w_i\) is the parent and \(v_i\) is the child (e.g. hypernym and hyponym)
- distance function: the Poincaré ball needs \(d(x, y) = \text{arcosh}\left(1 + 2\frac{\lVert x - y\rVert^2}{(1 - \lVert x\rVert^2)(1 - \lVert y\rVert^2)}\right)\), while Euclidean space uses \(d(x, y) = \lVert x - y\rVert_2^2\)
- loss function: \(-\log\left(\text{softmax}\left(-d(x, y)\right)\right)\) (“categorical cross-entropy”)
- weights: a single \(m \times n\) matrix where \(m\) is the vocabulary size and \(n\) is the embedding dimension (a projection of words into the ball)
- metric: words that are connected should rank closer to each other than words without a connection (“mean rank”, sketched after this list).
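A minimal sketch of this metric, assuming a precomputed all-pairs distance matrix `dist` and a list `edges` of `(parent, child)` index pairs (both names are placeholders). A faithful evaluation would rank each child only against non-connected words; for brevity, this version ranks against all words:

```python
import torch

def mean_rank(dist: torch.Tensor, edges: list) -> float:
    """dist[i, j] holds d(word_i, word_j); edges holds (parent, child) index pairs."""
    ranks = []
    for parent, child in edges:
        row = dist[parent]
        # 1-based rank of the true child among all words, sorted by distance to the parent
        ranks.append(int((row <= row[child]).sum()))
    return sum(ranks) / len(ranks)
```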

First, we create the weights using the `Embedding` class. Then they are initialized close to \(0\). Since the Poincaré ball requires \(\lVert x\rVert < 1\), this won’t cause any trouble.
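Here is a minimal sketch; the sizes are placeholders, and the uniform range is one plausible reading of “close to \(0\)”:

```python
from torch import nn

vocab_size, dim = 1000, 2                  # placeholder sizes; 2-D makes visualization easy
emb = nn.Embedding(vocab_size, dim)
nn.init.uniform_(emb.weight, -1e-3, 1e-3)  # start near 0, i.e. well inside the unit ball
```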

During forward propagation, the input is split into two parts: the parent (index \(0\)) and the children (indices \(1\) to \(n\)). Next, we compute the distance from the parent to each of its children.
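A sketch of this step, assuming column \(0\) of `batch_X` holds the parent and the remaining columns hold the children; `poincare_distance` is defined in the next snippet:

```python
import torch
from torch import nn

def forward(emb: nn.Embedding, batch_X: torch.Tensor) -> torch.Tensor:
    """batch_X: (batch, 1 + n) tensor of word indices."""
    e = emb(batch_X)              # (batch, 1 + n, dim)
    parent = e[:, :1]             # (batch, 1, dim), kept 3-D so it broadcasts
    children = e[:, 1:]           # (batch, n, dim)
    return poincare_distance(parent, children)  # (batch, n)
```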

The line `x = 1 + 2 * sqdist / ((1 - squnorm) * (1 - sqvnorm))` causes numerical instability: as an embedding’s norm approaches \(1\), the denominator approaches \(0\), and rounding can even push the argument of \(\text{arcosh}\) below \(1\). A small `epsilon` guards against this. When using double-precision floating-point, `epsilon` can often be set to \(0\).
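One possible placement of `epsilon` is to clamp the argument of \(\text{arcosh}\) just above \(1\); this is a sketch, not necessarily how the official code guards the division:

```python
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, epsilon: float = 1e-7) -> torch.Tensor:
    sqdist = torch.sum((u - v) ** 2, dim=-1)
    squnorm = torch.sum(u ** 2, dim=-1)
    sqvnorm = torch.sum(v ** 2, dim=-1)
    x = 1 + 2 * sqdist / ((1 - squnorm) * (1 - sqvnorm))
    x = torch.clamp(x, min=1 + epsilon)          # arcosh is only defined for x >= 1
    return torch.log(x + torch.sqrt(x * x - 1))  # arcosh(x) = log(x + sqrt(x^2 - 1))
```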

First, set `torch.set_default_dtype(torch.float64)`. This is not strictly necessary, but it gives slightly better results and makes the network more stable.

Next, we need two distributions:

- categorical distribution: during the first \(20\) epochs, words/relations that occur more often are drawn with a higher frequency.
- uniform distribution: every word/relation has the same chance of being drawn.

It is not mentioned in the original paper, but the official code follows this approach.
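A sketch of this schedule, assuming `word_freq` is a 1-D float tensor of raw word/relation counts (the name is a placeholder):

```python
import torch

def draw_negatives(word_freq: torch.Tensor, n: int, epoch: int, burn_in: int = 20) -> torch.Tensor:
    """First `burn_in` epochs: frequency-weighted (categorical); afterwards: uniform."""
    weights = word_freq if epoch < burn_in else torch.ones_like(word_freq)
    return torch.multinomial(weights, n, replacement=True)
```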

Note that only the first two words in `batch_X` form a real relation. All other words are randomly drawn from either the uniform or the categorical distribution. The ground truth for `CrossEntropyLoss` is always the first element: `batch_y = torch.zeros(10, dtype=torch.long)`. All other children are negatives.
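Putting it together, a sketch of one loss computation, with random numbers standing in for the distances returned by the forward pass above:

```python
import torch
from torch import nn

batch_size, n_children = 10, 50                      # placeholder sizes
dists = torch.rand(batch_size, n_children)           # stand-in for forward(emb, batch_X)
logits = -dists                                      # smaller distance -> higher score
batch_y = torch.zeros(batch_size, dtype=torch.long)  # the positive child sits at index 0
loss = nn.CrossEntropyLoss()(logits, batch_y)        # log-softmax over children + NLL
```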

After training the neural network, we can visualize our embeddings.
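For 2-D embeddings, a minimal matplotlib sketch could look like this (`words`, a list mapping indices to labels, is a placeholder):

```python
import matplotlib.pyplot as plt

def plot_embeddings(emb, words):
    xy = emb.weight.detach().cpu().numpy()           # (vocab_size, 2)
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.add_patch(plt.Circle((0, 0), 1, fill=False))  # boundary of the Poincaré disk
    ax.scatter(xy[:, 0], xy[:, 1], s=5)
    for i, word in enumerate(words):
        ax.annotate(word, xy[i])
    ax.set_xlim(-1.05, 1.05)
    ax.set_ylim(-1.05, 1.05)
    ax.set_aspect("equal")
    plt.show()
```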

If you made no mistakes, the result for WordNet mammals should look like this:

WordNet mammals has a low Gromov hyperbolicity.

I also tried the OpenThesaurus dataset, which has a pretty high Gromov hyperbolicity. It is a German dataset, but even if you don’t understand the language, you should see the difference.

Besides visualization, these embeddings have various uses:

- clustering with hyperbolic k-means
- finding wrong relations in graphs [3]
- hypernym-hyponym detection [4]
- inside another network for other applications
- …

However, if you need word similarity or analogy, there are better word embeddings. For example, SGNS (skip-gram with negative sampling) produces quite good results for these tasks.

[1] M. Nickel and D. Kiela, *Poincaré Embeddings for Learning Hierarchical Representations*, 2017.

[2] M. Nickel and D. Kiela, *Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry*, 2018.

[3] S. Roller, D. Kiela and M. Nickel, *Hearst Patterns Revisited: Automatic Hypernym Detection from Large Text Corpora*, 2018.

[4] M. Le, S. Roller, L. Papaxanthos, D. Kiela and M. Nickel, *Inferring Concept Hierarchies from Text Corpora via Hyperbolic Embeddings*, 2019.