.. _scikit_learn_compatibility:

==========================
Scikit-learn compatibility
==========================

`Scikit-learn <https://scikit-learn.org>`_ is a very popular Python package
for machine learning. If you are familiar with the scikit-learn API, you
should feel comfortable with the pyts API, as it is heavily inspired by it.
The following sections illustrate the compatibility between pyts and
scikit-learn.

Estimator API
-------------

pyts provides two types of estimators:

- *transformers*: estimators that transform the input data,
- *classifiers*: estimators that classify the input data.

These estimators have the same basic methods as the ones from scikit-learn:

- Transformers:

  + ``fit``: fit the transformer,
  + ``transform``: transform the input data.

- Classifiers:

  + ``fit``: fit the classifier,
  + ``predict``: make predictions given the input data.
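As a quick illustration, here is a minimal sketch of these methods in action.
The particular estimators used here (``PiecewiseAggregateApproximation`` as
the transformer and ``KNeighborsClassifier`` as the classifier) are an
arbitrary choice for this sketch; any pyts transformer and classifier expose
the same methods:

>>> from pyts.approximation import PiecewiseAggregateApproximation
>>> from pyts.classification import KNeighborsClassifier
>>> from pyts.datasets import load_gunpoint
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> transformer = PiecewiseAggregateApproximation(window_size=2)
>>> X_train_new = transformer.fit(X_train).transform(X_train)  # fit, then transform
>>> clf = KNeighborsClassifier()
>>> clf = clf.fit(X_train_new, y_train)  # fit the classifier
>>> y_pred = clf.predict(transformer.transform(X_test))  # predict on new data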
Compatibility with existing tools from scikit-learn
---------------------------------------------------

Scikit-learn provides many utilities, such as model selection tools and
pipelines, that are routinely used in machine learning. Since the pyts API is
compatible with the scikit-learn API, there is no need to reimplement these
tools: they can be used directly. We will illustrate this compatibility with
two popular modules from scikit-learn:
`Model selection <https://scikit-learn.org/stable/model_selection.html>`_ and
`Pipeline <https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html>`_.

Model selection
^^^^^^^^^^^^^^^

Model selection is a core concept of machine learning. With a wide range of
algorithms and several hyper-parameters for each algorithm, there needs to be
a way to select the best model. One popular approach is to perform
cross-validation over a grid of possible values for each hyper-parameter. The
corresponding scikit-learn implementation is
`sklearn.model_selection.GridSearchCV <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html>`_.

We will illustrate the use of GridSearchCV with a classifier from pyts. Let's
say that we want to use the
`SAX-VSM <https://pyts.readthedocs.io/en/stable/generated/pyts.classification.SAXVSM.html>`_
classifier and tune the values of two of its hyper-parameters:

- *window_size*: 0.3, 0.5 or 0.7,
- *strategy*: 'quantile' or 'uniform'.

We can define a GridSearchCV instance to find the best combination:

>>> clf = GridSearchCV(
...     SAXVSM(),
...     {'window_size': (0.3, 0.5, 0.7), 'strategy': ('uniform', 'quantile')},
...     cv=5
... )

Then we can simply:

- fit on the training set by calling ``clf.fit(X_train, y_train)``,
- derive predictions on the test set by calling ``clf.predict(X_test)``,
- directly evaluate the performance on the test set by calling
  ``clf.score(X_test, y_test)``.

Here is a self-contained example:

>>> from pyts.classification import SAXVSM
>>> from pyts.datasets import load_gunpoint
>>> from sklearn.model_selection import GridSearchCV
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> clf = GridSearchCV(
...     SAXVSM(),
...     {'window_size': (0.3, 0.5, 0.7), 'strategy': ('uniform', 'quantile')},
...     cv=5
... )
>>> clf.fit(X_train, y_train)
GridSearchCV(...)
>>> clf.best_params_
{'strategy': 'uniform', 'window_size': 0.5}
>>> clf.score(X_test, y_test)
0.846...

Pipeline
^^^^^^^^

Transformers are usually combined with a classifier to build a composite
estimator. Such an estimator can be built in scikit-learn using
`sklearn.pipeline.Pipeline <https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html>`_.
You can use estimators from both pyts and scikit-learn to build your own
composite estimator to classify time series. We will illustrate this
functionality with the following example. Let's say that we want to build a
composite estimator with the following steps:

1. Standardization of each time series using
   `pyts.preprocessing.StandardScaler <https://pyts.readthedocs.io/en/stable/generated/pyts.preprocessing.StandardScaler.html>`_,
2. Feature extraction using
   `pyts.transformation.BOSS <https://pyts.readthedocs.io/en/stable/generated/pyts.transformation.BOSS.html>`_,
3. Scaling of each feature using
   `sklearn.preprocessing.MinMaxScaler <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html>`_,
4. Classification using
   `sklearn.ensemble.RandomForestClassifier <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>`_.

We just have to create a Pipeline instance with these estimators:

>>> clf = Pipeline([('scaler_1', StandardScaler()),
...                 ('boss', BOSS(sparse=False)),
...                 ('scaler_2', MinMaxScaler()),
...                 ('forest', RandomForestClassifier())])

Then we can simply:

- fit on the training set by calling ``clf.fit(X_train, y_train)``,
- derive predictions on the test set by calling ``clf.predict(X_test)``,
- directly evaluate the performance on the test set by calling
  ``clf.score(X_test, y_test)``.

Here is a self-contained example:

>>> from pyts.datasets import load_pig_central_venous_pressure
>>> from pyts.preprocessing import StandardScaler
>>> from pyts.transformation import BOSS
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import MinMaxScaler
>>> X_train, X_test, y_train, y_test = load_pig_central_venous_pressure(return_X_y=True)
>>> clf = Pipeline([('scaler_1', StandardScaler()),
...                 ('boss', BOSS(sparse=False)),
...                 ('scaler_2', MinMaxScaler()),
...                 ('forest', RandomForestClassifier(random_state=42))])
>>> clf.fit(X_train, y_train)
Pipeline(...)
>>> clf.score(X_test, y_test)
0.543...
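Since a Pipeline is itself an estimator, the two tools can also be combined:
the pipeline can be passed to GridSearchCV, and the hyper-parameters of each
step are addressed with the usual scikit-learn ``<step>__<parameter>``
convention. Here is a minimal sketch of this combination; the grid values
chosen for *word_size* and *n_estimators* are purely illustrative:

>>> from pyts.datasets import load_gunpoint
>>> from pyts.transformation import BOSS
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> pipeline = Pipeline([('boss', BOSS(sparse=False)),
...                      ('forest', RandomForestClassifier(random_state=42))])
>>> param_grid = {'boss__word_size': [2, 4],
...               'forest__n_estimators': [50, 100]}
>>> clf = GridSearchCV(pipeline, param_grid, cv=5)
>>> clf = clf.fit(X_train, y_train)  # cross-validate every combination
>>> y_pred = clf.predict(X_test)  # predict with the refitted best combination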