Deriving and implementing LDA


In this post, I will derive and implement Linear Discriminant Analysis (LDA). See also chapter 4.3 in [1].

LDA as classifier

Derivation

Given our data with features $x \in \mathbb{R}^p$, we want to find the class label $y \in \{1, \dots, K\}$. One way to solve this problem is to model the joint distribution $P(X, Y)$ and find the model parameters (i.e. a generative model).

Suppose our null hypothesis is class $Y=k$ and our alternative hypothesis is class $Y=\ell$. We use the ratio of the posteriors as decision rule and divide both hypotheses by each other:

$$\frac{P(Y=\ell \mid X=x)}{P(Y=k \mid X=x)} = \frac{P(X=x \mid Y=\ell)\,P(Y=\ell)}{P(X=x \mid Y=k)\,P(Y=k)} \tag{1}$$

Assume $X \mid Y=k \sim \mathcal{N}(\mu_k, \Sigma)$ for every class $k$. In words:

  • all features are normally distributed given a class label
  • all classes use the same covariance matrix
  • each class uses a different mean vector $\mu_k$

Next, insert the normal distribution and apply the logarithm to (1), where $\pi_k = P(Y=k)$ denotes the prior:

$$\log\frac{P(Y=\ell \mid X=x)}{P(Y=k \mid X=x)} = \log\frac{\pi_\ell}{\pi_k} - \frac{1}{2}(x-\mu_\ell)^T\Sigma^{-1}(x-\mu_\ell) + \frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) \tag{2}$$

$$= \log\frac{\pi_\ell}{\pi_k} - \frac{1}{2}(\mu_\ell+\mu_k)^T\Sigma^{-1}(\mu_\ell-\mu_k) + x^T\Sigma^{-1}(\mu_\ell-\mu_k) \tag{3}$$

The quadratic term $x^T\Sigma^{-1}x$ cancelled out, because there is only one covariance matrix for all classes.

Setting equation (3) $> 0$, we find that our alternative hypothesis is true when

$$x^T\Sigma^{-1}(\mu_\ell-\mu_k) > \frac{1}{2}(\mu_\ell+\mu_k)^T\Sigma^{-1}(\mu_\ell-\mu_k) - \log\frac{\pi_\ell}{\pi_k}.$$

Suppose we have more than two classes. Then we can use MAP estimation on (3) and pick the class with the largest discriminant score:

$$\hat{y} = \arg\max_k \; \delta_k(x), \qquad \delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k \tag{4}$$

To finish our derivation, we only have to specify how to estimate the parameters:

  • prior: $\hat{\pi}_k = N_k / N$, the fraction of samples in class $k$
  • mean: calculate the mean for each class (and for each feature), $\hat{\mu}_k = \frac{1}{N_k}\sum_{i:\,y_i=k} x_i$
  • covariance matrix: weighted mean of the per-class covariance matrices, $\hat{\Sigma} = \sum_k \hat{\pi}_k \hat{\Sigma}_k$

Implementation

from sklearn.datasets import load_iris
import numpy as np

X, y = load_iris(return_X_y=True)

# class means: one row per class, one column per feature
Mu = np.array([X[y == c, :].mean(axis=0) for c in np.unique(y)])

# priors: relative class frequencies
_, y_t = np.unique(y, return_inverse=True)
priors = np.bincount(y_t) / float(len(y))

# pooled covariance matrix: weighted mean of the per-class covariance matrices
cov = [np.cov(X[y == c, :], rowvar=False, bias=True) for c in np.unique(y)]
cov = np.average(cov, axis=0, weights=priors)

# solve cov @ Z = Mu.T instead of inverting cov; the rows of inv are mu_k^T Sigma^{-1}
inv = np.linalg.lstsq(cov, Mu.T, rcond=None)[0].T

# discriminant scores of equation (4), evaluated for all samples and classes at once
y_hat = np.argmax(-1/2 * np.diag(Mu @ inv.T) + X @ inv.T + np.log(priors), axis=1)
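As a quick sanity check (not part of the derivation above), the predictions should match scikit-learn's lsqr solver, which estimates the same pooled covariance and priors; a minimal sketch, assuming X, y and y_hat from the snippet above are still in scope:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(solver="lsqr")
lda.fit(X, y)
print(np.mean(lda.predict(X) == y_hat))  # expected to agree on (nearly) all samples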

Some remarks:

  • Mu is here a matrix (one mean vector per class) and not a single vector $\mu_k$.
  • In equation (4) the term $\mu_k^T\Sigma^{-1}\mu_k$ is a scalar; here it must be a vector with one entry per class.
  • Without taking the diagonal of the matrix $\mathrm{Mu}\,\Sigma^{-1}\mathrm{Mu}^T$, we would be multiplying different means, e.g. $\mu_0^T\Sigma^{-1}\mu_1$.
  • np.linalg.lstsq solves the equation $\Sigma Z = \mathrm{Mu}^T$, so we never have to form $\Sigma^{-1}$ explicitly.
  • It is possible to whiten the data matrix to achieve $\Sigma = I$.

LDA for dimensionality reduction

Derivation

Consider the following optimization problem:

$$\max_{w} \; \frac{w^T \Sigma_b w}{w^T \Sigma_w w}$$

In words: find a projection $w$ such that the between-class variance $w^T\Sigma_b w$ is maximized relative to the within-class variance $w^T\Sigma_w w$.

$\Sigma_w$ is given by the mean of all class covariance matrices (as before), and $\Sigma_b$ is the covariance of our data matrix minus $\Sigma_w$, i.e. $\Sigma_b = \Sigma_t - \Sigma_w$ with $\Sigma_t$ the total covariance.

Notice that this time we are not assuming normally distributed features to derive LDA. However, the situation is similar to PCA. The only zero-mean probability distribution that is fully described by the variance is the Gaussian distribution [2]. Hence, the features should still be normally distributed.

To solve this optimization problem, we fix $w^T\Sigma_w w = 1$ and apply a Lagrange multiplier $\lambda$. Setting $\frac{\partial}{\partial w}\left(w^T\Sigma_b w - \lambda(w^T\Sigma_w w - 1)\right) = 0$ yields the generalized eigenvalue problem:

$$\Sigma_b w = \lambda \Sigma_w w$$

Assume $\Sigma_w$ is non-singular. Then we can write

$$\Sigma_w^{-1}\Sigma_b\, w = \lambda w$$

We have transformed the generalized eigenvalue problem into an ordinary eigenvalue problem. To reduce the dimensionality of our data matrix, we now choose the eigenvectors with the largest eigenvalues, collect them in a matrix $W$, and project our data onto the new subspace via $XW$. This is the first way to perform the calculations.
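A minimal sketch of this first way in NumPy (self-contained; the names Sigma_w, Sigma_b, W and the choice d = 2 are my own, not part of the original derivation):

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes, counts = np.unique(y, return_counts=True)
priors = counts / len(y)

# within-class covariance: weighted mean of the per-class covariance matrices
Sigma_w = np.average(
    [np.cov(X[y == c], rowvar=False, bias=True) for c in classes],
    axis=0, weights=priors,
)
# between-class covariance: total covariance minus within-class covariance
Sigma_b = np.cov(X, rowvar=False, bias=True) - Sigma_w

# solve Sigma_w^{-1} Sigma_b w = lambda w and keep the d leading eigenvectors
evals, evecs = np.linalg.eig(np.linalg.solve(Sigma_w, Sigma_b))
d = 2
W = evecs.real[:, np.argsort(evals.real)[::-1][:d]]

X_reduced = X @ W  # project the data onto the new subspace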

However, there exists a method to avoid calculating $\Sigma_w^{-1}$. Simultaneous diagonalization finds a non-orthogonal matrix $W$ such that

$$W^T \Sigma_w W = I \quad \text{and} \quad W^T \Sigma_b W = \Lambda,$$

where $W$ gives us the eigenvectors and $\Lambda$ the eigenvalues of the generalized eigenvalue problem [3][4]. The calculations are (a short sketch follows the list):

  • Whiten $\Sigma_w$
    • diagonalize $\Sigma_w$, i.e. $\Sigma_w = U D U^T$
    • multiply by $D^{-1/2}U^T$ from the left and $U D^{-1/2}$ from the right
    • define $W_1 = U D^{-1/2}$
  • Calculate $\Sigma_b^{*} = W_1^T \Sigma_b W_1$
  • Diagonalize $\Sigma_b^{*}$, i.e. $\Sigma_b^{*} = V \Lambda V^T$
  • Eigenvectors are given by $W = W_1 V$
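A minimal NumPy sketch of these steps (assuming Sigma_w and Sigma_b as computed in the previous sketch; the names W1 and V are my own):

import numpy as np

# step 1: whiten Sigma_w, i.e. Sigma_w = U D U^T and W1 = U D^{-1/2}
D, U = np.linalg.eigh(Sigma_w)
W1 = U / np.sqrt(D)  # column j is scaled by D_j^{-1/2}, so W1.T @ Sigma_w @ W1 = I

# step 2: transform the between-class covariance
Sigma_b_star = W1.T @ Sigma_b @ W1

# step 3: diagonalize it, Sigma_b_star = V Lambda V^T
Lam, V = np.linalg.eigh(Sigma_b_star)

# step 4: eigenvectors of the generalized problem, ordered by decreasing eigenvalue
W = (W1 @ V)[:, np.argsort(Lam)[::-1]]

# check: W.T @ Sigma_w @ W is (numerically) the identity, W.T @ Sigma_b @ W is diagonal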

Implementation

There are already lots of implementations of this type of LDA on the internet. This is why I will only provide some remarks.

The easiest way to solve the generalized eigenvalue problem is to use SciPy’s function linalg.eigh.
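For example, a minimal sketch (assuming Sigma_b and Sigma_w as computed in the sketches above; eigh returns the eigenvalues in ascending order):

import numpy as np
from scipy import linalg

# generalized eigenvalue problem Sigma_b w = lambda Sigma_w w
evals, evecs = linalg.eigh(Sigma_b, Sigma_w)
W = evecs[:, np.argsort(evals)[::-1][:2]]  # keep the two leading discriminant directions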

Instead of doing an eigendecomposition, you can also use SVD and a whitening transformation. See the scikit-learn code and [5].

And finally, if you don’t want to implement LDA yourself, scikit-learn provides three solvers for LDA (a short usage example follows the list):

  • lsqr: Calculate the covariance matrix. Dimensionality reduction is not possible, because we don’t know the eigenvectors. I implemented this in the first section.
  • eigen: Calculate the covariance matrix. Dimensionality reduction is possible. Solve the generalized eigenvalue problem directly.
  • svd: Don’t calculate the covariance matrix. Dimensionality reduction is possible. Use whitening and SVD.
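For example, a minimal usage sketch (the solver and n_components values are just an example):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

clf = LinearDiscriminantAnalysis(solver="svd", n_components=2)
X_2d = clf.fit_transform(X, y)  # dimensionality reduction
y_hat = clf.predict(X)          # classification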

References

[1] T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer, 2008.

[2] J. Shlens. A Tutorial on Principal Component Analysis: Derivation, Discussion and Singular Value Decomposition, 2003.

[3] http://mlg.postech.ac.kr/~seungjin/courses/ml/handouts/handout12.pdf

[4] http://www.geo.tuwien.ac.at/downloads/tm/svd.pdf

[5] https://stats.stackexchange.com/questions/109747/why-is-pythons-scikit-learn-lda-not-working-correctly-and-how-does-it-compute-l
