Scikit-learn compatibility¶
Scikit-learn is a very popular Python package for machine learning. If you are familiar with the scikit-learn API, you should feel comfortable with the pyts API, as it is heavily inspired by it. The following sections illustrate the compatibility between pyts and scikit-learn.
Estimator API¶
pyts provides two types of estimators:
- transformers: estimators that transform the input data,
- classifiers: estimators that classify the input data.
These estimators have the same basic methods as the ones from scikit-learn:
- Transformers:
  - fit: fit the transformer,
  - transform: transform the input data.
- Classifiers:
  - fit: fit the classifier,
  - predict: make predictions given the input data.
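For instance, here is a minimal sketch of this API, using StandardScaler as the transformer and SAXVSM as the classifier on the GunPoint dataset (both estimators are also used in the examples below; the default hyper-parameters are kept for brevity):
>>> from pyts.classification import SAXVSM
>>> from pyts.datasets import load_gunpoint
>>> from pyts.preprocessing import StandardScaler
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> transformer = StandardScaler()  # a pyts transformer
>>> X_train_scaled = transformer.fit(X_train).transform(X_train)
>>> clf = SAXVSM()  # a pyts classifier
>>> clf.fit(X_train, y_train)
SAXVSM(...)
>>> y_pred = clf.predict(X_test)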
Compatibility with existing tools from scikit-learn¶
Scikit-learn provides a lot of utilities, such as model selection tools and pipelines, that are commonly used in machine learning. Since the pyts API is compatible with the scikit-learn API, we do not need to reimplement these tools and can use them directly. We will illustrate this compatibility with two popular modules from scikit-learn: Model selection and Pipeline.
Model selection¶
Model selection is a core concept of machine learning. With a wide range of algorithms, each with several hyper-parameters, there needs to be a way to select the best model. One popular approach is to perform cross-validation over a grid of possible values for each hyper-parameter. The corresponding scikit-learn implementation is sklearn.model_selection.GridSearchCV.
We will illustrate the use of GridSearchCV with a classifier from pyts. Let’s say that we want to use the SAX-VSM classifier and tune the values of two of its hyper-parameters:
- window_size: 0.3, 0.5 or 0.7
- strategy: ‘quantile’ or ‘uniform’
We can define a GridSearchCV instance to find the best combination:
>>> clf = GridSearchCV(
... SAXVSM(),
... {'window_size': (0.3, 0.5, 0.7), 'strategy': ('uniform', 'quantile')},
... cv=5
... )
Then we can simply:
- fit on the training set by calling clf.fit(X_train, y_train),
- derive predictions on the test set by calling clf.predict(X_test),
- directly evaluate the performance on the test set by calling clf.score(X_test, y_test).
Here is a self-contained example:
>>> from pyts.classification import SAXVSM
>>> from pyts.datasets import load_gunpoint
>>> from sklearn.model_selection import GridSearchCV
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> clf = GridSearchCV(
... SAXVSM(),
... {'window_size': (0.3, 0.5, 0.7), 'strategy': ('uniform', 'quantile')},
... cv=5
... )
>>> clf.fit(X_train, y_train)
GridSearchCV(...)
>>> clf.best_params_
{'strategy': 'uniform', 'window_size': 0.5}
>>> clf.score(X_test, y_test)
0.846...
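After fitting, the other standard attributes of GridSearchCV are available as well. Here is a quick sketch of inspecting the cross-validation results (the actual values are omitted since they depend on the data and the scikit-learn version):
>>> best_estimator = clf.best_estimator_  # SAXVSM refitted on the whole training set with the best hyper-parameters
>>> best_cv_score = clf.best_score_  # mean cross-validated score of the best combination
>>> cv_results = clf.cv_results_  # detailed results for every combination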
Pipeline¶
Transformers are usually combined with a classifier to build a composite estimator. It is possible to build such an estimator in scikit-learn using sklearn.pipeline.Pipeline. You can use estimators from both pyts and scikit-learn to build your own composite estimator to classify time series.
We will illustrate this functionality with the following example. Let’s say that we want to build a composite estimator with the following steps:
1. Standardization of each time series using pyts.preprocessing.StandardScaler,
2. Feature extraction using pyts.transformation.BOSS,
3. Scaling of each feature using sklearn.preprocessing.MinMaxScaler,
4. Classification using sklearn.ensemble.RandomForestClassifier.
We just have to create a Pipeline instance with these estimators:
>>> clf = Pipeline([('scaler_1', StandardScaler()),
... ('boss', BOSS(sparse=False)),
... ('scaler_2', MinMaxScaler()),
... ('forest', RandomForestClassifier())])
Then we can simply:
- fit on the training set by calling clf.fit(X_train, y_train),
- derive predictions on the test set by calling clf.predict(X_test),
- directly evaluate the performance on the test set by calling clf.score(X_test, y_test).
Here is a self-contained example:
>>> from pyts.datasets import load_pig_central_venous_pressure
>>> from pyts.preprocessing import StandardScaler
>>> from pyts.transformation import BOSS
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import MinMaxScaler
>>> X_train, X_test, y_train, y_test = load_pig_central_venous_pressure(return_X_y=True)
>>> clf = Pipeline([('scaler_1', StandardScaler()),
... ('boss', BOSS(sparse=False)),
... ('scaler_2', MinMaxScaler()),
... ('forest', RandomForestClassifier(random_state=42))])
>>> clf.fit(X_train, y_train)
Pipeline(...)
>>> clf.score(X_test, y_test)
0.543...
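Since pyts estimators follow the scikit-learn API, these tools can also be combined: a Pipeline can itself be passed to GridSearchCV, and the hyper-parameters of each step are addressed with the "step name__parameter name" syntax. Here is a minimal sketch, using the smaller GunPoint dataset and an illustrative grid of values:
>>> from pyts.datasets import load_gunpoint
>>> from pyts.preprocessing import StandardScaler
>>> from pyts.transformation import BOSS
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import MinMaxScaler
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> pipeline = Pipeline([('scaler_1', StandardScaler()),
...                      ('boss', BOSS(sparse=False)),
...                      ('scaler_2', MinMaxScaler()),
...                      ('forest', RandomForestClassifier(random_state=42))])
>>> param_grid = {'boss__word_size': [2, 4],  # illustrative values for the BOSS step
...               'forest__n_estimators': [50, 100]}  # illustrative values for the forest step
>>> clf = GridSearchCV(pipeline, param_grid, cv=3)
>>> clf.fit(X_train, y_train)
GridSearchCV(...)
>>> y_pred = clf.predict(X_test)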