Having introduced Riemannian SGD in the last blog post, I will now give a concrete application of this optimization method. Poincaré embeddings [1][2] are hierarchical word embeddings which map integer-encoded words into hyperbolic space.
Even though the original paper used the Poincaré unit ball, any reasonable manifold can work. For example, when the data is not really tree-like, Euclidean space can produce better embeddings.
Poincaré Embeddings consist of the following components:
dataset: \(\{(w_1, v_1), \dots, (w_n, v_n)\}\) where \(w_i\) is the parent and \(v_i\) is the child (e.g. hypernym and hyponym)
distance function: the Poincaré ball needs \(d(x, y) = \operatorname{arcosh}\left(1 + 2\frac{\lVert x - y\rVert^2}{(1 - \lVert x\rVert^2)(1 - \lVert y\rVert^2)}\right)\), while Euclidean space uses \(d(x, y) = \lVert x - y\rVert_2^2\) (see the sketch after this list)
loss function: \(-\log\left(\text{softmax}\left(-d(x, y)\right)\right)\), i.e. “categorical crossentropy” with the negative distances as logits
weights: a single \(m \times n\) matrix where \(m\) is the vocabulary size and \(n\) is the embedding dimension (the projection/embedding layer)
metric: for each true pair, the child should rank closer to its parent than unconnected words do; averaging these ranks gives the “mean rank”.
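To make the two distance functions above concrete, here is a minimal PyTorch sketch. The function names, the EPS constant, and the clamping are my own choices, not the official implementation:

```python
import torch

EPS = 1e-7  # keeps the arcosh argument away from 1 (more on this below)

def poincare_distance(u, v):
    """Poincaré-ball distance; u and v can be broadcastable batches of points."""
    sqdist = torch.sum((u - v) ** 2, dim=-1)
    squnorm = torch.sum(u ** 2, dim=-1)
    sqvnorm = torch.sum(v ** 2, dim=-1)
    x = 1 + 2 * sqdist / ((1 - squnorm) * (1 - sqvnorm))
    # arcosh needs x >= 1, and its gradient blows up at x = 1
    return torch.acosh(torch.clamp(x, min=1 + EPS))

def euclidean_distance(u, v):
    """Squared Euclidean distance, an alternative for less tree-like data."""
    return torch.sum((u - v) ** 2, dim=-1)
```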
Implementation
Model
First, we create the weights using the Embedding function and initialize them close to \(0\). Since the Poincaré ball requires \(\lVert x\rVert < 1\), starting near the origin keeps every vector safely inside the ball.
During forward propagation the input is split into two parts: the parent (slice 0 to 1) and the children (slice 1 to n). Next, we compute the distance between the parent and every child.
The line x = 1 + 2 * sqdist / ((1 - squnorm) * (1 - sqvnorm)) is a source of numerical instability: arcosh is only defined for arguments \(\geq 1\) and its gradient blows up as the argument approaches \(1\), so a small epsilon is used to keep x away from \(1\). When using double-precision floating point, epsilon can often be set to \(0\).
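Putting this together, a minimal model could look like the following sketch. The class name, embedding dimension, and init range are illustrative choices, and poincare_distance is the helper defined above:

```python
import torch
import torch.nn as nn

class PoincareEmbedding(nn.Module):
    def __init__(self, vocab_size, dim=5, init_range=1e-3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        # initialize close to 0 so that ||x|| < 1 holds right from the start
        nn.init.uniform_(self.embedding.weight, -init_range, init_range)

    def forward(self, batch):
        # batch has shape (batch_size, 1 + number of children):
        # column 0 is the parent, the remaining columns are the children
        e = self.embedding(batch)
        parent = e[:, :1]      # (batch_size, 1, dim)
        children = e[:, 1:]    # (batch_size, children, dim)
        # negative distances serve as logits for CrossEntropyLoss later on
        return -poincare_distance(parent, children)
```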
Training
First, call torch.set_default_dtype(torch.float64). This is not strictly necessary, but it gives slightly better results and makes training more stable.
Next, we need two distributions:
categorical distribution: during the first \(20\) epochs, words/relations that occur more often are drawn with higher probability.
uniform distribution: every word/relation has the same chance of being drawn.
It is not mentioned in the original paper, but the official code follows this approach.
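A minimal sketch of this sampling schedule might look as follows. The helper itself and its name are my own, not the official code; the 20-epoch threshold is taken from the description above:

```python
import numpy as np

def make_negative_sampler(word_counts, burn_in_epochs=20):
    """word_counts[i] = how often word i appears in the relation list."""
    counts = np.asarray(word_counts, dtype=np.float64)
    categorical = counts / counts.sum()
    vocab_size = len(word_counts)

    def sample(epoch, k):
        if epoch < burn_in_epochs:
            # categorical: frequent words are drawn with higher probability
            return np.random.choice(vocab_size, size=k, p=categorical)
        # uniform: every word has the same chance of being drawn
        return np.random.randint(0, vocab_size, size=k)

    return sample
```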
Note that only the first two words in batch_X form a real relation (parent and true child). All other words are randomly drawn from either the uniform or the categorical distribution and serve as negatives. The ground truth for CrossEntropyLoss is therefore always the first child, i.e. batch_y = torch.zeros(10, dtype=torch.long).
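For illustration, a toy training step under these conventions could look like this. It reuses the sketches above, all sizes are made up, and the plain SGD optimizer is only a placeholder for the Riemannian SGD from the previous post:

```python
import torch
import torch.nn as nn

torch.set_default_dtype(torch.float64)  # optional, but more stable (see above)

relations = [(0, 1), (0, 2), (1, 3)]              # (parent, child) id pairs
model = PoincareEmbedding(vocab_size=50, dim=5)
sample = make_negative_sampler(word_counts=[1] * 50)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # placeholder for RSGD
loss_fn = nn.CrossEntropyLoss()
n_negatives = 10

def make_batch(pairs, epoch):
    rows = []
    for parent, child in pairs:
        negatives = sample(epoch, n_negatives)
        # column 0: parent, column 1: true child, remaining columns: negatives
        rows.append([parent, child, *negatives.tolist()])
    return torch.tensor(rows, dtype=torch.long)

batch_X = make_batch(relations, epoch=0)
batch_y = torch.zeros(len(relations), dtype=torch.long)  # true child is index 0

logits = model(batch_X)              # negative distances act as logits
loss = loss_fn(logits, batch_y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```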
Visualizations
After training the neural network, we can visualize our embeddings.
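For example, with a 2-dimensional embedding the Poincaré disk can be drawn directly with matplotlib. The trained model from the sketches above and an id_to_word mapping are assumptions here:

```python
import matplotlib.pyplot as plt

# assumes a trained 2-dimensional model and a dict id_to_word: index -> label
vectors = model.embedding.weight.detach().numpy()

fig, ax = plt.subplots(figsize=(8, 8))
ax.add_patch(plt.Circle((0, 0), 1.0, fill=False))  # boundary of the Poincaré disk
ax.scatter(vectors[:, 0], vectors[:, 1], s=10)
for idx, word in id_to_word.items():
    ax.annotate(word, (vectors[idx, 0], vectors[idx, 1]), fontsize=8)
ax.set_xlim(-1.05, 1.05)
ax.set_ylim(-1.05, 1.05)
ax.set_aspect("equal")
plt.show()
```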
If you made no mistakes, the result for WordNet mammals should look like this:
WordNet mammals has a low Gromov hyperbolicity, i.e. its relation graph is very tree-like.
I also tried the OpenThesaurus dataset, which has a fairly high Gromov hyperbolicity. It is a German dataset, but even if you don't speak the language, you should see the difference in the resulting plot.
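If you want to check this yourself, the (delta-)hyperbolicity of a dataset can be estimated from its graph distances via the four-point condition. The sketch below is my own helper, not part of the original code, and is only practical for small graphs:

```python
import itertools
import numpy as np

def gromov_delta(dist):
    """Naive O(n^4) delta-hyperbolicity of a finite metric space.

    dist is a symmetric n x n matrix of pairwise distances, e.g. shortest-path
    distances in the undirected relation graph. A tree has delta = 0; the
    smaller delta is, the more tree-like (hyperbolic) the dataset.
    """
    n = dist.shape[0]

    def gp(x, y, w):
        # Gromov product (x|y)_w = 0.5 * (d(x, w) + d(y, w) - d(x, y))
        return 0.5 * (dist[x, w] + dist[y, w] - dist[x, y])

    delta = 0.0
    for w, x, y, z in itertools.product(range(n), repeat=4):
        delta = max(delta, min(gp(x, z, w), gp(y, z, w)) - gp(x, y, w))
    return delta
```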
Applications
Besides visualization, these embeddings have various uses, for example hypernym detection [3] or inferring concept hierarchies from text corpora [4].
However, if you need word similarity or analogy, there are better word embeddings. For example, SGNS (skip-gram with negative sampling) produces quite good results for these tasks.
References
[1] M. Nickel and D. Kiela, Poincaré Embeddings for Learning Hierarchical Representations, 2017.
[2] M. Nickel and D. Kiela, Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry, 2018.
[3] S. Roller, D. Kiela and M. Nickel, Hearst Patterns Revisited: Automatic Hypernym Detection from Large Text Corpora, 2018.
[4] M. Le, S. Roller, L. Papaxanthos, D. Kiela and M. Nickel, Inferring Concept Hierarchies from Text Corpora via Hyperbolic Embeddings, 2019.