2. Classification of raw time series

Several algorithms can classify time series directly, without a separate feature-extraction step. The following sections describe the ones that are available in pyts; they can all be found in the pyts.classification module.

2.1. KNeighborsClassifier

The k-nearest neighbors algorithm is one of the simplest classification algorithms. KNeighborsClassifier finds the k nearest neighbors of a time series, and the predicted class is determined by majority voting among them. A key parameter of this algorithm is the metric used to find the nearest neighbors. A popular metric for time series is Dynamic Time Warping (see Metrics for time series). The one-nearest-neighbor algorithm with this metric is commonly considered a strong baseline for time series classification:

>>> from pyts.classification import KNeighborsClassifier
>>> from pyts.datasets import load_gunpoint
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> clf = KNeighborsClassifier(metric='dtw')
>>> clf.fit(X_train, y_train)
KNeighborsClassifier(...)
>>> clf.score(X_test, y_test)
0.91...
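
To make the decision rule explicit, here is a minimal sketch of the one-nearest-neighbor prediction for a single test series, written directly with pyts.metrics.dtw and reusing the variables defined above; the estimator implements the same idea with additional optimizations:

>>> import numpy as np
>>> from pyts.metrics import dtw
>>> # DTW distance between the first test series and every training series
>>> distances = np.array([dtw(x, X_test[0]) for x in X_train])
>>> # 1-NN prediction: the label of the closest training series
>>> y_pred = y_train[np.argmin(distances)]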

2.2. SAXVSM

SAX-VSM stands for Symbolic Aggregate approXimation in Vector Space Model. SAXVSM is an algorithm based on the SAX representation of time series in a vector space model. Subsequences are extracted with a sliding window, and each subsequence of real numbers is transformed into a word (i.e., a sequence of symbols) using the Symbolic Aggregate approXimation algorithm. Each time series is thus transformed into a bag of words (the order of the words is not taken into account). For each class, the bags of words of all the time series belonging to this class are merged into a single bag of words, yielding one bag of words per class. Finally, a term-frequency inverse-document-frequency (tf-idf) vector is computed for each class. Predictions are made using the cosine similarity between the term-frequency vector of a time series and the tf-idf vector of each class: the predicted class is the one yielding the highest cosine similarity.

>>> from pyts.classification import SAXVSM
>>> from pyts.datasets import load_gunpoint
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> clf = SAXVSM(window_size=34, sublinear_tf=False, use_idf=False)
>>> clf.fit(X_train, y_train)
SAXVSM(...)
>>> clf.score(X_test, y_test)
0.76
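
As a toy illustration of this decision rule, the sketch below picks the class whose tf-idf vector is most similar to the term-frequency vector of a series; the word counts and tf-idf values are made up, not taken from the fitted model above:

>>> import numpy as np
>>> tf = np.array([2., 0., 1., 3.])  # word counts for one time series
>>> tfidf = np.array([[1.5, 0.2, 0.9, 2.1],   # tf-idf vector of class 0
...                   [0.1, 2.4, 0.3, 0.5]])  # tf-idf vector of class 1
>>> cos = tfidf @ tf / (np.linalg.norm(tfidf, axis=1) * np.linalg.norm(tf))
>>> int(np.argmax(cos))  # predicted class: highest cosine similarity
0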

References

  • P. Senin and S. Malinchik, “SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model”. International Conference on Data Mining, 13, 1175-1180 (2013).

2.3. BOSSVS

BOSSVS stands for Bag-of-SFA Symbols in Vector Space, where SFA is the Symbolic Fourier Approximation. BOSSVS is another bag-of-words approach to time series classification. It is quite similar to SAX-VSM: it also builds a term-frequency inverse-document-frequency (tf-idf) vector for each class, but the words are made of symbols generated with the Symbolic Fourier Approximation algorithm instead of SAX.

>>> from pyts.classification import BOSSVS
>>> from pyts.datasets import load_gunpoint
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> clf = BOSSVS(window_size=28)
>>> clf.fit(X_train, y_train)
BOSSVS(...)
>>> clf.score(X_test, y_test)
0.98
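
The sketch below illustrates how the Symbolic Fourier Approximation algorithm turns time series into words, reusing X_train from above; the parameter values are arbitrary choices for illustration, and BOSSVS actually applies the transformation to sliding windows rather than to whole series:

>>> import numpy as np
>>> from pyts.approximation import SymbolicFourierApproximation
>>> sfa = SymbolicFourierApproximation(n_coefs=4, n_bins=4)
>>> X_sfa = sfa.fit_transform(X_train)  # one symbol per kept Fourier coefficient
>>> words = np.array([''.join(symbols) for symbols in X_sfa])  # one word per series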

References

  • P. Schäfer, “Scalable Time Series Classification”. Data Mining and Knowledge Discovery, 30(5), 1273-1298 (2016).

2.4. LearningShapelets

LearningShapelets is a shapelet-based classifier. A shapelet is defined as a contiguous subsequence of a time series. The distance between a shapelet and a time series is defined as the minimum of the distances between this shapelet and all the subsequences of identical length extracted from this time series. This estimator consists of two steps: computing the distances between the shapelets and the time series, then fitting a logistic regression using these distances as features. The algorithm learns the shapelets jointly with the coefficients of the logistic regression.

>>> from pyts.classification import LearningShapelets
>>> from pyts.datasets import load_gunpoint
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> clf = LearningShapelets(random_state=42, tol=0.01)
>>> clf.fit(X_train, y_train)
LearningShapelets(...)
>>> clf.score(X_test, y_test)
0.766...
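
A minimal sketch of the shapelet-to-series distance described above; the mean of squared differences used here is one common convention, and the exact formulation in pyts may differ:

>>> import numpy as np
>>> def shapelet_distance(shapelet, series):
...     """Minimum distance between a shapelet and all subsequences of
...     the same length extracted from a time series."""
...     length = len(shapelet)
...     windows = np.array([series[i:i + length]
...                         for i in range(len(series) - length + 1)])
...     return np.min(np.mean((windows - shapelet) ** 2, axis=1))
>>> dist = shapelet_distance(X_train[0, 30:60], X_test[0])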

References

  • J. Grabocka, N. Schilling, M. Wistuba and L. Schmidt-Thieme, “Learning Time-Series Shapelets”. International Conference on Data Mining, 14, 392-401 (2014).

2.5. TimeSeriesForest

TimeSeriesForest is a two-stage algorithm. First, it extracts three features from each of a given number of windows: the mean, the standard deviation and the slope of a simple linear regression. Then a random forest is fitted using the extracted features as input data. These three statistics are fast to compute and capture much of the information in a window. The windows are generated randomly, and their number is controlled with the n_windows parameter. Using the feature importance scores of the random forest, one can easily identify the windows that matter most for classifying time series.

>>> from pyts.datasets import load_gunpoint
>>> from pyts.classification import TimeSeriesForest
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> clf = TimeSeriesForest(random_state=43)
>>> clf.fit(X_train, y_train)
TimeSeriesForest(...)
>>> clf.score(X_test, y_test)
0.973333...
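
A minimal sketch of the three statistics extracted from one window (the window bounds here are arbitrary; TimeSeriesForest draws the windows at random):

>>> import numpy as np
>>> window = X_train[0, 20:50]  # one window of one training series
>>> t = np.arange(len(window))
>>> slope = np.polyfit(t, window, deg=1)[0]  # slope of the fitted line
>>> features = (window.mean(), window.std(), slope)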

References

  • H. Deng, G. Runger, E. Tuv and M. Vladimir, “A Time Series Forest for Classification and Feature Extraction”. Information Sciences, 239, 142-153 (2013).

2.6. Time Series Bag-of-Features

TSBF (an acronym for Time Series Bag-of-Features) is a more elaborate algorithm whose fitting procedure consists of the following steps (a sketch of the per-interval feature extraction is given after the list):

  • Random intervals are generated.
  • Each interval is split into several subintervals.
  • Three features are extracted from each subinterval: the mean, the standard deviation and the slope.
  • Four features are also extracted from the whole interval: the mean, the standard deviation and the start and end indices.
  • A first random forest classifier is fitted on this dataset of subsequences, and the label of a subsequence is given by the label of the time series from which this subsequence has been extracted.
  • Out-of-bag probabilities for each class are binned across all the subsequences extracted from a given time series, and the mean probability for each class is also computed; these values are the features extracted from the original data set.
  • A second random forest classifier is finally fitted using these extracted features.
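
Here is a sketch of the feature extraction for a single interval, following the description above; the function name, interval bounds and number of subintervals are illustrative choices, not pyts internals:

>>> import numpy as np
>>> from pyts.datasets import load_gunpoint
>>> X_train, _, _, _ = load_gunpoint(return_X_y=True)
>>> def interval_features(series, start, end, n_subintervals):
...     """Mean, std and slope of each subinterval, plus the mean, std
...     and start/end indices of the whole interval."""
...     features = []
...     edges = np.linspace(start, end, n_subintervals + 1).astype(int)
...     for i, j in zip(edges[:-1], edges[1:]):
...         slope = np.polyfit(np.arange(i, j), series[i:j], deg=1)[0]
...         features += [series[i:j].mean(), series[i:j].std(), slope]
...     features += [series[start:end].mean(), series[start:end].std(),
...                  start, end]
...     return np.array(features)
>>> feats = interval_features(X_train[0], start=10, end=70, n_subintervals=3)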

Since the final estimator is a random forest classifier, one can extract the feature importance scores:

>>> from pyts.datasets import load_gunpoint
>>> from pyts.classification import TSBF
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> clf = TSBF(random_state=43)
>>> clf.fit(X_train, y_train)
TSBF(...)
>>> clf.score(X_test, y_test)
0.97...
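
Assuming the fitted estimator follows the scikit-learn convention for forests and exposes a feature_importances_ attribute (an assumption, not verified against the pyts API reference), the scores could be retrieved with:

>>> importances = clf.feature_importances_  # assumed scikit-learn-style attribute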

References

  • M.G. Baydogan, G. Runger and E. Tuv, “A Bag-of-Features Framework to Classify Time Series”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), 2796-2802 (2013).