# Time series classification with images and 2D CNNs

There are many methods to classify time series using neural networks. This blog post will mainly focus on two-dimensional CNNs and how 1D series can be represented as images.

# Polar coordinates

Let $X = \{x_1, x_2, \dots, x_n \mid x_t \in \mathbb{R}, t \in \mathbb{N}\}$ be observations of some sensor (gyroscope, goniometer, etc.). In Cartesian coordinates, a point is given by an index $t$ and a value $x_t$.

The tuple $(t, x_t)$ can be transformed to polar coordinates by setting $r_t = \sqrt{t^2 + x_t^2}$ and $\phi_t = \arccos\left(\frac{t}{r_t}\right)$ for $x_t \geq 0$ and $\phi_t = -\arccos\left(\frac{t}{r_t}\right)$ for $x_t < 0$. Then $\left(r_t\cos(\phi_t), r_t\sin(\phi_t)\right)$ gives back the original coordinates.

Now, let us assume that $r_t = 1$. Then all points lie on the unit circle. This requires $t$ and $x_t$ to be in the intervals $[0, 1]$ or $[-1, 1]$. The advantage of $[0, 1]$ is that the functions $x_t = \sqrt{1 - t^2}$ and $t = \sqrt{1 - x_t^2}$ are injective.

Since neural networks don’t care in general whether one scales to $[-1, 1]$ or $[0, 1]$, we choose the latter. The transformation is $\tilde{x}_t = \frac{x_t - \min(X)}{\max(X) - \min(X)}$.
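A minimal sketch of this transformation in NumPy (the function names are my own):

```python
import numpy as np

def scale_to_unit_interval(x):
    """Min-max scale a series to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def polar_angle(x_scaled):
    """With r_t = 1 and sin(phi_t) = x_t, the angle is arcsin(x_t)."""
    return np.arcsin(x_scaled)
```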

Polar coordinates make it possible to use trigonometric identities like $\cos(\phi_i + \phi_j) = \cos(\phi_i) \cos(\phi_j) - \sin(\phi_i)\sin(\phi_j)$ and $\sin(\phi_i - \phi_j) = \sin(\phi_i)\cos(\phi_j) - \cos(\phi_i)\sin(\phi_j)$.

When we keep the assumption $r_t = 1$, the trigonometric functions are simply $\cos(\phi_i) = \sqrt{1 - \tilde{x}_i^2}$ and $\sin(\phi_i) = \tilde{x}_i$. However, we lose the time $t$ because we set $r_t = 1$. Since $t$ is just an index, this won’t hurt the performance of the CNN.

# (quasi)-Gramian Angular Field (GAF) and Recurrence Plot (RP)

GAFs and RPs are both 2D plots that relate each time step to every other time step. For example, an entry could encode how much higher the value is at time $i$ than at time $j$.

The paper by Wang and Oates introduced the following two GAFs: $\cos(\phi_i + \phi_j)$ and $\sin(\phi_i - \phi_j)$ for all $i, j$ in the time series. These two plots use polar coordinates as described in the last section.

In comparison, RPs use Cartesian coordinates. For a vector of dimension $1$, the definition is $\sqrt{(x_i - x_j)^2} = |x_i - x_j|$ for all $i, j$.
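Under the conventions above ($r_t = 1$, series scaled to $[0, 1]$), both kinds of plots reduce to outer products; a sketch:

```python
import numpy as np

def gaf_rp(x_scaled):
    """Compute cos(phi_i + phi_j), sin(phi_i - phi_j) and the recurrence
    plot for a series already scaled to [0, 1] (so that r_t = 1)."""
    sin_phi = x_scaled                       # sin(phi_t) = x_t
    cos_phi = np.sqrt(1.0 - x_scaled ** 2)   # cos(phi_t) = sqrt(1 - x_t^2)
    gaf_cos = np.outer(cos_phi, cos_phi) - np.outer(sin_phi, sin_phi)  # cos(phi_i + phi_j)
    gaf_sin = np.outer(sin_phi, cos_phi) - np.outer(cos_phi, sin_phi)  # sin(phi_i - phi_j)
    rp = np.abs(x_scaled[:, None] - x_scaled[None, :])                 # |x_i - x_j|
    return gaf_cos, gaf_sin, rp
```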

Let us look at a simple example: a time series consisting of two values $x_1 = 1$ and $x_2 = 2$. There are 4 possibilities: $x_1 \to x_1$, $x_1 \to x_2$, $x_2 \to x_1$ and $x_2 \to x_2$.

The recurrence plot is given by:

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

To calculate the GAFs, polar coordinates are needed. Scaling $x_1$ and $x_2$ to $[0, 1]$ results in $\tilde{x}_1 = 0$ and $\tilde{x}_2 = 1$. Then by the trigonometric formulas from above, each entry of the first GAF is given by $\sqrt{1 - \tilde{x}_i^2}\sqrt{1 - \tilde{x}_j^2} - \tilde{x}_i\tilde{x}_j$.

The first GAF is:

$$\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$$

The second GAF can be calculated similarly.
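The worked example can be verified numerically in a few lines of NumPy:

```python
import numpy as np

# The worked example: x1 = 1, x2 = 2.
x = np.array([1.0, 2.0])
x_s = (x - x.min()) / (x.max() - x.min())     # scaled: [0, 1]
sin_phi, cos_phi = x_s, np.sqrt(1.0 - x_s ** 2)

rp = np.abs(x[:, None] - x[None, :])                            # recurrence plot
gaf1 = np.outer(cos_phi, cos_phi) - np.outer(sin_phi, sin_phi)  # cos(phi_i + phi_j)
gaf2 = np.outer(sin_phi, cos_phi) - np.outer(cos_phi, sin_phi)  # sin(phi_i - phi_j)
```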

These matrices can be fed to the neural network as input. In general, we can even use our own operations like $\max(x_i, x_j)$ or $\min(x_i, x_j)$. It’s also possible to combine different time series like $\sqrt{x_i^2 - y_j^2}$ or $\sqrt{x_i^2 - x_j^2} \cdot \sqrt{y_i^2 - y_j^2}$.
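Such custom plots are one-liners with NumPy’s ufunc `outer` method (the example values here are arbitrary):

```python
import numpy as np

x = np.array([0.2, 0.5, 0.9])
max_plot = np.maximum.outer(x, x)   # entry (i, j) is max(x_i, x_j)
min_plot = np.minimum.outer(x, x)   # entry (i, j) is min(x_i, x_j)
```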

# Implementation

Time series can become fairly long. For example, a time series containing $3000$ measurements would result in a $3000 \times 3000$ GAF or RP plot. Hence, we will first reduce the size with a piecewise aggregate approximation (PAA).
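A minimal PAA sketch, assuming the length of the time axis is divisible by the output length (real data may need padding or interpolation):

```python
import numpy as np

def paa(x, out_len):
    """Piecewise aggregate approximation along the time axis.
    x is an (m, n) matrix: m time steps, n time series.
    Assumes m is divisible by out_len."""
    m, n = x.shape
    return x.reshape(out_len, m // out_len, n).mean(axis=1)
```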

The PAA function takes as input an $m \times n$ matrix, where $m$ is the length of one time series and $n$ is the number of time series. Next, we apply PAA to the data and calculate the plots.

Since GAFs return values from $-1$ to $1$, the RP plots have to be scaled to the same range. The outer product is a slow operation, so I would recommend calculating the images only once and storing them in memory.
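Putting the pieces together, a sketch of precomputing the images for one series (the shape conventions and `out_len` default are my own):

```python
import numpy as np

def precompute_images(series, out_len=32):
    """Turn a 1D series into a stack of (out_len, out_len) images: two GAFs
    and an RP rescaled to [-1, 1]. Done once before training and cached.
    Assumes len(series) is divisible by out_len."""
    x = series.reshape(out_len, -1).mean(axis=1)            # PAA
    x = (x - x.min()) / (x.max() - x.min())                 # scale to [0, 1]
    sin_phi, cos_phi = x, np.sqrt(1.0 - x ** 2)
    gaf_cos = np.outer(cos_phi, cos_phi) - np.outer(sin_phi, sin_phi)
    gaf_sin = np.outer(sin_phi, cos_phi) - np.outer(cos_phi, sin_phi)
    rp = np.abs(x[:, None] - x[None, :])
    if rp.max() > 0:
        rp = 2.0 * rp / rp.max() - 1.0                      # rescale RP to [-1, 1]
    return np.stack([gaf_cos, gaf_sin, rp])
```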

The next step is to define a model for the neural network. I got the best results with a Wide Residual Network (Zagoruyko and Komodakis). I set the network width $k$ to $2$ and the depth to $N = 4$.
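A compact sketch of such a network in PyTorch, assuming the pre-activation variant of the residual block; $k$, $N$ and the group widths follow the Wide ResNet scheme, while the channel and class counts are placeholders:

```python
import torch
import torch.nn as nn

class WideBasic(nn.Module):
    """Pre-activation residual block: (BN -> ReLU -> Conv) twice, plus shortcut."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
                         if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + self.shortcut(x)

class WideResNet(nn.Module):
    """Wide ResNet with N blocks per group and widening factor k."""
    def __init__(self, n_channels, n_classes, N=4, k=2):
        super().__init__()
        widths = [16, 16 * k, 32 * k, 64 * k]
        layers = [nn.Conv2d(n_channels, widths[0], 3, padding=1, bias=False)]
        in_ch = widths[0]
        for i, w in enumerate(widths[1:]):
            for j in range(N):
                stride = 2 if i > 0 and j == 0 else 1  # downsample between groups
                layers.append(WideBasic(in_ch, w, stride))
                in_ch = w
        layers += [nn.BatchNorm2d(in_ch), nn.ReLU(),
                   nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                   nn.Linear(in_ch, n_classes)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```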

During training, it is important to use some kind of data augmentation, because residual networks tend to overfit. The following code randomly adds Gaussian distributed noise to the whole input matrix.
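A sketch of such an augmentation; the noise scale `sigma` and probability `p` are placeholders I chose, not values from the post:

```python
import numpy as np

def augment(batch, sigma=0.01, p=0.5, rng=None):
    """With probability p, add zero-mean Gaussian noise with standard
    deviation sigma to the whole input batch."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < p:
        return batch + rng.normal(0.0, sigma, size=batch.shape)
    return batch
```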

I tested the 2D CNN model on an activity recognition dataset with 10-fold cross validation. There were in total $19$ features (time series) which were transformed to $76$ RP/GAF plots of size $32 \times 32$.

The 2D CNN model performed consistently better than the MLP and at least as well as the 1D CNN and the 1D LSTM-CNN. More tests are needed, of course, but for specific datasets the performance is quite good.

To conclude this blog post, here are some input images (using some threshold).

### Random noise with $\sin(\phi_i - \phi_j)$:

### Gyroscope z-axis with $\sin(\phi_i - \phi_j)$:

# References

Zhiguang Wang and Tim Oates. “Imaging Time-Series to Improve Classification and Imputation”. https://arxiv.org/abs/1506.00327

S. Zagoruyko and N. Komodakis. “Wide Residual Networks”. https://arxiv.org/abs/1605.07146

N. Hatami, Y. Gavet, and J. Debayle. “Classification of Time-Series Images Using Deep Convolutional Neural Networks”. https://arxiv.org/abs/1710.00886