Machine learning, computer vision, languages
23 Jul 2020
A lot of recent papers use spaces other than the regular Euclidean space. This trend is sometimes called geometric deep learning. There is growing interest particularly in the domains of word embeddings and graphs.
Since geometric neural networks perform optimization in a different space, it is not possible to simply apply stochastic gradient descent.
The following two equations show what changes are necessary:
\[\begin{aligned} \text{SGD: } \theta_{t+1} &\gets \theta_t - \lambda \nabla \mathcal{L}\\ \text{RSGD: } \theta_{t+1} &\gets \exp_{\theta_t}\left(- \lambda \nabla_R \mathcal{L}\right) \end{aligned}\]where \(\exp_{\theta_t} : \mathcal{T}_{\theta_t}M \to M\) is the exponential map. It maps a small change, given by the tangent vector \(v \in \mathcal{T}_{\theta_t}M\), to a point on the manifold \(M\). \(\lambda\) is the learning rate.
\(\nabla_R\) is the Riemannian gradient, given by \(g_{\theta_t}^{-1} \nabla \mathcal{L}\) where \(g_{\theta_t}\) is the metric tensor. This gradient is also called the natural gradient. A derivation can be found in [1].
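Schematically, a single RSGD step can be written as a small function. This is only a sketch: metric_inv and exp_map are placeholders that each model below provides, and the elementwise product assumes a diagonal metric, which holds for the models discussed here.

```python
def rsgd_step(theta, grad, lr, metric_inv, exp_map):
    # Natural / Riemannian gradient: g_theta^{-1} * (Euclidean gradient).
    # metric_inv(theta) is assumed to return the diagonal of g_theta^{-1},
    # so an elementwise product is enough.
    riemannian_grad = metric_inv(theta) * grad
    # Move along the manifold instead of through the ambient space.
    return exp_map(theta, -lr * riemannian_grad)
```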
In the Euclidean space there is only one model: the exponential map is simply \(\exp_{\theta_t}(v) = \theta_t + v\) and the metric tensor is the identity, so RSGD reduces to ordinary SGD.
In the hyperbolic space there are multiple models:
Poincaré unit ball: \(\exp_{\theta_t}(v) = \theta_t \oplus \left(\tanh\left(\frac{\lambda_{\theta_t} \lVert v\rVert}{2}\right)\frac{v}{\lVert v\rVert}\right)\) where \(\oplus\) is the Möbius addition and \(\lambda_{\theta_t} = \frac{2}{1 - \lVert \theta_t\rVert^2}\) is the conformal factor (a code sketch of \(\oplus\) and both exponential maps follows below). However, the approximation \(\exp_{\theta_t}(v) \approx \theta_t + v\) tends to work better and is faster. The metric tensor is \(g_{\theta_t} = \lambda_{\theta_t}^2 I_n = \left(\frac{2}{1 - \lVert \theta_t\rVert^2}\right)^2 I_n\) where \(I_n\) is the identity matrix.
Hyperboloid: the metric tensor is \(g_{\theta_t} = \begin{bmatrix}-1 & 0 & \cdots & 0\\0 & 1 & \cdots & 0\\\vdots & \vdots & \ddots & \vdots\\0 & 0 & \cdots & 1\end{bmatrix}\). The exponential map is \(\exp_{\theta_t}(v) = \cosh(\lVert v\rVert_{\mathcal{L}})\,\theta_t + \sinh(\lVert v\rVert_{\mathcal{L}})\frac{v}{\lVert v\rVert_{\mathcal{L}}}\) where \(\lVert v\rVert_{\mathcal{L}} = \sqrt{-v_0^2 + \sum_{i=1}^n v_i^2}\) is the Lorentz norm.
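For concreteness, here is a small PyTorch sketch of the Möbius addition and the two exponential maps. The clamping against division by zero is my own choice; dedicated libraries handle the numerical stability more carefully.

```python
import torch

def mobius_add(x, y):
    # Möbius addition on the Poincaré unit ball
    xy = (x * y).sum(dim=-1, keepdim=True)
    xx = (x * x).sum(dim=-1, keepdim=True)
    yy = (y * y).sum(dim=-1, keepdim=True)
    numerator = (1 + 2 * xy + yy) * x + (1 - xx) * y
    denominator = 1 + 2 * xy + xx * yy
    return numerator / denominator.clamp_min(1e-15)

def exp_map_poincare(theta, v):
    # Exact exponential map on the Poincaré unit ball
    v_norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-15)
    lam = 2 / (1 - (theta * theta).sum(dim=-1, keepdim=True))
    return mobius_add(theta, torch.tanh(lam * v_norm / 2) * v / v_norm)

def exp_map_hyperboloid(theta, v):
    # Exact exponential map on the hyperboloid (Lorentz model);
    # ||v||_L = sqrt(-v_0^2 + sum_i v_i^2) for tangent vectors.
    v_norm_sq = -v[..., :1] ** 2 + (v[..., 1:] ** 2).sum(dim=-1, keepdim=True)
    v_norm = v_norm_sq.clamp_min(1e-15).sqrt()
    return torch.cosh(v_norm) * theta + torch.sinh(v_norm) * v / v_norm
```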
There are other models, but they are not as common. The models are mathematically equivalent, but one also has to consider numerical stability. Furthermore, it is possible to scale the models to change the curvature. For example, the conformal factor of a Poincaré ball of radius \(r\) becomes \(\lambda_{\theta_t} = \frac{2}{1 - \frac{\lVert \theta_t\rVert^2}{r^2}} = \frac{2r^2}{r^2 - \lVert \theta_t\rVert^2}\) (see this). The paper [3] gives a good introduction to hyperbolic geometry with neural networks.
For elliptic geometry see the paper [2].
The following code also contains the exact exponential map. I commented the relevant lines out because, empirically, the approximation produces slightly better results; refer to [4], where the authors did some more tests.
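A minimal sketch of such an optimizer for the Poincaré ball follows. The class name and details are my own; mobius_add is the helper from the sketch above, and the exact exponential map sits in the commented-out lines.

```python
import torch
from torch.optim.optimizer import Optimizer

class PoincareRSGD(Optimizer):
    """Minimal RSGD sketch for parameters on the Poincaré unit ball."""

    def __init__(self, params, lr=0.01):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr = group["lr"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                # Riemannian gradient: g^{-1} * grad = grad / lambda^2
                lam = 2 / (1 - (p * p).sum(dim=-1, keepdim=True))
                d_p = p.grad / lam ** 2
                # Approximation of the exponential map (retraction): theta + v
                p.add_(d_p, alpha=-lr)
                # Exact exponential map (commented out, see the text above):
                # v = -lr * d_p
                # v_norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-15)
                # p.copy_(mobius_add(p, torch.tanh(lam * v_norm / 2) * v / v_norm))
                # Keep the points strictly inside the unit ball
                norm = p.norm(dim=-1, keepdim=True).clamp_min(1e-15)
                p.mul_(torch.where(norm >= 1, (1 - 1e-5) / norm,
                                   torch.ones_like(norm)))
```

It is used like a regular PyTorch optimizer, e.g. PoincareRSGD(model.parameters(), lr=0.3).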
According to [5], the hyperboloid / Lorentz model is more stable during training. However, my tests showed similar results. Sometimes the Poincaré unit ball was actually more stable.
My preliminary tests with word embeddings showed the following disadvantages of the hyperboloid:
The Euclidean gradient first has to be turned into the Riemannian gradient and projected onto the tangent space at the current point (in the code: d_p = proj(p, d_p); see the sketch below).
The points have to satisfy the equality constraint \(-\theta_0^2 + \sum_{i=1}^n \theta_i^2 = -1\). This is more difficult than an inequality constraint, because the equality constraint is always active. In comparison, the Poincaré unit ball is defined by \(\lVert x\rVert < 1\). As long as the learning rate is reasonable, points will never fall off the unit ball.
A random initialization of the weights will violate the equality constraint. Hence, before training one should set the first coordinate of all weights \(w\) to torch.sqrt(1 + torch.sum((w.narrow(-1, 1, w.size(-1) - 1)) ** 2, dim=-1, keepdim=True)).
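Both points can be sketched in a few lines. proj mirrors the call above; lorentz_inner and reset_first_coordinate are names I chose for illustration.

```python
import torch

def lorentz_inner(x, y):
    # Lorentzian inner product <x, y>_L = -x_0 y_0 + sum_i x_i y_i
    return -x[..., :1] * y[..., :1] + (x[..., 1:] * y[..., 1:]).sum(dim=-1, keepdim=True)

def proj(p, d_p):
    # Project an ambient vector onto the tangent space at p, i.e. enforce <p, v>_L = 0.
    # d_p should already be the Riemannian gradient (first component of the
    # Euclidean gradient negated, since g^{-1} = diag(-1, 1, ..., 1)).
    return d_p + lorentz_inner(p, d_p) * p

def reset_first_coordinate(w):
    # Put randomly initialized weights back onto the hyperboloid: <w, w>_L = -1
    with torch.no_grad():
        w[..., :1] = torch.sqrt(
            1 + torch.sum(w.narrow(-1, 1, w.size(-1) - 1) ** 2, dim=-1, keepdim=True)
        )
    return w
```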
[1] S. Amari, Natural Gradient Works Efficiently in Learning, 1998.
[2] Y. Meng, J. Huang, G. Wang, C. Zhang, H. Zhuang, L. Kaplan and J. Han, Spherical Text Embedding, 2019.
[3] O. Ganea, G. Bécigneul and T. Hofmann, Hyperbolic Neural Networks, 2018.
[4] G. Bécigneul and O.-E. Ganea, Riemannian Adaptive Optimization Methods, 2018.
[5] M. Nickel and D. Kiela, Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry, 2018.