# Deriving and implementing LDA

In this post, I will derive and implement Linear Discriminant Analysis (LDA). See also Section 4.3 in [1].

## LDA as classifier

### Derivation

Given our data with features $X = X_1, \dots, X_n$, we want to predict the class label $Y$. One way to solve this problem is to model the joint distribution $f_{X, Y}(\mathbf{x}, y) = f_{X \mid Y}(\mathbf{x} \mid y)f_Y(y)$ and estimate its parameters (i.e. a generative model).

Suppose our null hypothesis is class $Y=k$ and our alternative hypothesis is class $Y=\ell$. We use the ratio of the two hypotheses' densities as our decision rule.

Assume $X_1, \dots, X_n \mid Y=j \sim \mathcal{N}(\boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}), \forall j$. In words:

• all features are jointly normally distributed given the class label
• all classes share the same covariance matrix $\boldsymbol{\Sigma}$
• each class has its own mean vector $\boldsymbol{\mu}_{j}$

Next, we insert the normal density into (1) and apply the logarithm.

The term $\sqrt{(2\pi)^n\det\left({\boldsymbol{\Sigma}}\right)}$ cancels out, because both classes share the same covariance matrix.

Setting equation (3) $> 0$, we find that the alternative hypothesis is accepted when

Suppose we have more than two classes. Then we can apply MAP estimation to (3), i.e. pick the class with the highest score.
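Written out, the derivation leads to the familiar linear discriminant score per class (cf. [1]); MAP classification picks the class $j$ maximizing

$$\delta_j(\mathbf{x}) = \mathbf{x}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_j - \frac{1}{2}\boldsymbol{\mu}_j^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_j + \log f_Y(j).$$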

To finish our derivation, we only have to specify how to estimate the parameters:

• prior: $f_Y(j) = \frac{\text{number of samples in class } j}{\text{total number of samples}}$
• mean: $\boldsymbol{\mu}_{j}$ is the sample mean of class $j$ (computed per feature)
• covariance matrix: the weighted mean of the class covariance matrices $\boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_m$

### Implementation

Some remarks:

• $\mathbf{M}$ (`Mu`) is a matrix here, not a single vector $\boldsymbol{\mu}_j$.
• In equation (4) the term $\boldsymbol{\mu}_j^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_j$ is a scalar; here $\mathbf{M}^T\boldsymbol{\Sigma}^{-1}\mathbf{M}$ must yield a vector (one scalar per class).
• Without taking the diagonal of that matrix, we would mix the means of different classes, e.g. $\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_3$.
• `np.linalg.lstsq` solves the equation $\boldsymbol{\Sigma}\mathbf{X} = \mathbf{M}^T \iff \mathbf{X} = \boldsymbol{\Sigma}^{-1}\mathbf{M}^T$, avoiding an explicit matrix inverse.
• It is possible to whiten the data matrix to achieve $\boldsymbol{\Sigma} = \mathbf{I} = \boldsymbol{\Sigma}^{-1}$.
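The remarks above can be turned into a short NumPy sketch. The function names `fit_lda` and `predict_lda` are my own; the rows of `M` hold the class means:

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, the matrix of class means M and the pooled covariance."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    M = np.array([X[y == c].mean(axis=0) for c in classes])  # shape (k, d)
    # pooled covariance: weighted mean of the class covariance matrices
    d = X.shape[1]
    Sigma = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c] - X[y == c].mean(axis=0)
        Sigma += Xc.T @ Xc
    Sigma /= len(y) - len(classes)
    return classes, priors, M, Sigma

def predict_lda(X, classes, priors, M, Sigma):
    # solve Sigma Z = M^T instead of inverting: Z = Sigma^{-1} M^T, shape (d, k)
    Z = np.linalg.lstsq(Sigma, M.T, rcond=None)[0]
    # linear term x^T Sigma^{-1} mu_j for every sample and class
    scores = X @ Z
    # scalar term mu_j^T Sigma^{-1} mu_j per class: the diagonal of M Z
    scores -= 0.5 * np.diag(M @ Z)
    scores += np.log(priors)
    return classes[np.argmax(scores, axis=1)]
```

Note the `np.diag(M @ Z)` step: it extracts exactly the per-class scalars discussed above, discarding the mixed-class products.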

## LDA for dimensionality reduction

### Derivation

Consider the following optimization problem
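In the constrained form that matches the Lagrangian used below, it reads

$$\max_{\mathbf{c}} \; \mathbf{c}^T\boldsymbol{\Sigma}_b\mathbf{c} \quad \text{s.t.} \quad \mathbf{c}^T\boldsymbol{\Sigma}_w\mathbf{c} = 1.$$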

In words: find $\mathbf{c}$ such that the between-class variance $\boldsymbol{\Sigma}_b$ is maximized relative to the within-class variance $\boldsymbol{\Sigma}_w$.

$\boldsymbol{\Sigma}_w$ is given by the mean of all class covariance matrices (as before), and $\boldsymbol{\Sigma}_b$ is the covariance matrix of the data matrix $\mathbf{X}$ minus $\boldsymbol{\Sigma}_w$.
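A minimal sketch of these two scatter matrices (the helper name `scatter_matrices` is my own):

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class and between-class covariance as described above."""
    classes = np.unique(y)
    # within-class: mean of the per-class covariance matrices
    Sigma_w = np.mean([np.cov(X[y == c], rowvar=False) for c in classes], axis=0)
    # between-class: total covariance minus the within-class part
    Sigma_b = np.cov(X, rowvar=False) - Sigma_w
    return Sigma_w, Sigma_b
```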

Notice that this time we do not assume normally distributed features to derive LDA. However, the situation is similar to PCA: the only zero-mean probability distribution that is fully described by its variance is the Gaussian [2]. Hence, the features should still be normally distributed.

To solve this optimization problem, we apply Lagrange multipliers: $\mathcal{L}(\mathbf{c}, \lambda) = \mathbf{c}^T\boldsymbol{\Sigma}_b\mathbf{c} - \lambda\left(\mathbf{c}^T\boldsymbol{\Sigma}_w\mathbf{c} - 1\right)$. Setting $\nabla\mathcal{L}(\mathbf{c}, \lambda) = \mathbf{0}$ yields the generalized eigenvalue problem:

Assume $\boldsymbol{\Sigma}_w$ is non-singular.
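Multiplying the generalized eigenvalue problem $\boldsymbol{\Sigma}_b\mathbf{c} = \lambda\boldsymbol{\Sigma}_w\mathbf{c}$ from the left by $\boldsymbol{\Sigma}_w^{-1}$ gives

$$\boldsymbol{\Sigma}_w^{-1}\boldsymbol{\Sigma}_b\mathbf{c} = \lambda\mathbf{c}.$$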

We have transformed the generalized eigenvalue problem into an ordinary eigenvalue problem. To reduce the dimensionality of our data matrix, we now choose $k$ eigenvectors $\mathbf{c}$ and project our data onto the subspace they span. This is the first way to perform the calculations.

However, there exists a method to avoid calculating $\boldsymbol{\Sigma}_w^{-1}$. Simultaneous diagonalization finds a non-orthogonal matrix $\boldsymbol{\phi}$ such that

where $\boldsymbol{\phi}$ gives us the eigenvectors and $\boldsymbol{\Lambda}$ the eigenvalues of the generalized eigenvalue problem [3][4]. The calculations are

• Whiten $\boldsymbol{\Sigma}_w$:
• Diagonalize $\boldsymbol{\Sigma}_w$, i.e. $\boldsymbol{\Lambda}_w = \boldsymbol{\phi}_w^T\boldsymbol{\Sigma}_w\boldsymbol{\phi}_w$
• Multiply by $\boldsymbol{\Lambda}_w^{-\frac{1}{2}}$ from the left and right: $\mathbf{I} = \boldsymbol{\Lambda}_w^{-\frac{1}{2}}\boldsymbol{\phi}_w^T\boldsymbol{\Sigma}_w\boldsymbol{\phi}_w\boldsymbol{\Lambda}_w^{-\frac{1}{2}}$
• Define $\boldsymbol{\phi}_w' = \boldsymbol{\phi}_w\boldsymbol{\Lambda}_w^{-\frac{1}{2}}$
• Calculate $\boldsymbol{\Sigma}_b' = \boldsymbol{\phi}_w'^T\boldsymbol{\Sigma}_b\boldsymbol{\phi}_w'$
• Diagonalize $\boldsymbol{\Sigma}_b'$, i.e. $\boldsymbol{\Lambda} = \boldsymbol{\phi}_b^T\boldsymbol{\Sigma}_b'\boldsymbol{\phi}_b$
• The eigenvectors are given by $\boldsymbol{\phi} = \boldsymbol{\phi}_w\boldsymbol{\Lambda}_w^{-\frac{1}{2}}\boldsymbol{\phi}_b = \boldsymbol{\phi}_w'\boldsymbol{\phi}_b$
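The steps above can be sketched in NumPy (the function name is my own); since both scatter matrices are symmetric, `np.linalg.eigh` handles both diagonalizations:

```python
import numpy as np

def simultaneous_diagonalization(Sigma_w, Sigma_b):
    """Solve Sigma_b c = lambda Sigma_w c without forming Sigma_w^{-1}."""
    # diagonalize Sigma_w (symmetric, so eigh applies)
    Lw, Pw = np.linalg.eigh(Sigma_w)
    # whitening transform: scale the columns by Lw^{-1/2},
    # so that Pw_prime^T Sigma_w Pw_prime = I
    Pw_prime = Pw / np.sqrt(Lw)
    # transform the between-class scatter into the whitened space
    Sigma_b_prime = Pw_prime.T @ Sigma_b @ Pw_prime
    # diagonalize the transformed between-class scatter
    L, Pb = np.linalg.eigh(Sigma_b_prime)
    # eigenvectors of the generalized eigenvalue problem
    Phi = Pw_prime @ Pb
    return L, Phi
```

One can check the result by verifying $\boldsymbol{\Sigma}_b\boldsymbol{\phi} = \boldsymbol{\Sigma}_w\boldsymbol{\phi}\boldsymbol{\Lambda}$ numerically.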

### Implementation

There are already lots of implementations of this type of LDA on the internet, so I will only provide some remarks.

The easiest way to solve the generalized eigenvalue problem $\boldsymbol{\Sigma}_b \mathbf{c} = \lambda\boldsymbol{\Sigma}_w\mathbf{c}$ is to use SciPy's function `scipy.linalg.eigh`, which accepts the second matrix directly.
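For instance, with two small scatter matrices (values made up for illustration):

```python
import numpy as np
from scipy.linalg import eigh

# hypothetical scatter matrices; in practice computed from the data
Sigma_w = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_b = np.array([[1.0, 0.5], [0.5, 2.0]])

# eigh(a, b) solves the generalized problem a v = lambda b v;
# eigenvalues are returned in ascending order
eigvals, eigvecs = eigh(Sigma_b, Sigma_w)

# keep the eigenvectors with the largest eigenvalues for the projection
order = np.argsort(eigvals)[::-1]
C = eigvecs[:, order]
```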

Instead of doing an eigendecomposition, you can also use SVD and a whitening transformation. See the scikit-learn code and [5].

And finally, if you don’t want to implement LDA yourself: scikit-learn provides three solvers for LDA:

• `lsqr`: Calculates the covariance matrix. Dimensionality reduction is not possible, because no eigenvectors are computed. This is what I implemented in the first section.
• `eigen`: Calculates the covariance matrix and solves the generalized eigenvalue problem directly. Dimensionality reduction is possible.
• `svd`: Does not calculate the covariance matrix; uses whitening and SVD instead. Dimensionality reduction is possible.
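A quick usage sketch with synthetic data (the data itself is made up for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# two well-separated Gaussian classes in three dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

# solver="svd" (the default) supports dimensionality reduction via transform();
# with two classes, at most n_classes - 1 = 1 component is available
lda = LinearDiscriminantAnalysis(solver="svd", n_components=1)
X_reduced = lda.fit(X, y).transform(X)
```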

## References

[1] T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer, 2008.

[2] J. Shlens. A Tutorial on Principal Component Analysis: Derivation, Discussion and Singular Value Decomposition, 2003.
