Scikit-Learn - Pipelines

Posted on Mar 26, 2022

Two of the most overlooked classes in scikit-learn are Pipeline and FeatureUnion. I’ve been using them for years (they were released around 2013), and I’m surprised that many folks don’t use them.

Example:

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state=0,
)
pipe = Pipeline([
    ('scaler', StandardScaler()), 
    ('svc', SVC())
])
pipe.fit(X_train, y_train)
pipe.predict(X_test[0:1])
array([0])

Note that results get passed sequentially between the pipeline steps: for example, the output of StandardScaler’s transform is passed as the input to SVC.
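FeatureUnion, mentioned above, is the parallel counterpart: instead of chaining steps, it fits several transformers on the same input and concatenates their outputs. A minimal sketch (the PCA/SelectKBest combination is just an illustrative choice, not part of the example above):

from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Run PCA and univariate selection side by side, then
# concatenate their outputs column-wise.
union = FeatureUnion([
    ('pca', PCA(n_components=2)),
    ('kbest', SelectKBest(k=1)),
])
X_train_union = union.fit_transform(X_train, y_train)
X_train_union.shape  # (75, 3): 2 PCA components + 1 selected feature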

You can of course use the estimators individually, like:

scaler = StandardScaler()
svc = SVC()
scaler.fit(X_train)
svc.fit(scaler.transform(X_train), y_train)
svc.predict(scaler.transform(X_test[0:1]))
array([0])

What makes the pipeline useful is that the whole list of operations gets compressed into a single step. Using the transformers individually can also cause production issues, because you can accidentally apply the steps out of order or skip one entirely.

Sometimes a pipeline module (i.e. an object like StandardScaler) can’t be fitted in a single step, for example when you need to make multiple function calls. I suggest compressing these modules into a single fit step. However, if that’s not possible, you can finish your multi-step fitting first and then combine the fitted transformers later.
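As a rough sketch of what compressing into a single fit step can look like, here’s a hypothetical transformer (the TwoPassScaler name and its internals are made up for illustration) that hides two passes over the data behind one fit call:

from sklearn.base import BaseEstimator, TransformerMixin

class TwoPassScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that needs two passes over the data."""

    def fit(self, X, y=None):
        # First pass: compute per-feature means.
        self.mean_ = X.mean(axis=0)
        # Second pass: compute spread around those means.
        self.scale_ = ((X - self.mean_) ** 2).mean(axis=0) ** 0.5
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

Because both passes happen inside fit, this drops straight into a Pipeline like any built-in transformer.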

Different from the official tutorial, I like to keep the pipelines separate, breaking the feature transforms and the model apart into two individual objects, like:

pipe = Pipeline([
    ('scaler', StandardScaler()),
])
pipe.fit(X_train, y_train)

This allows you to create train/test/validation splits before anything gets fed into the model directly. You can then use the train set to perform hyperparameter tuning:

X_train_pipe = pipe.transform(X_train)
X_test_pipe = pipe.transform(X_test)

It’s possible to put both the pipeline and the model into the hyperparameter tuning, but if the feature transformation steps take a while, you end up repeating the same transformation work for every candidate, which adds unnecessary training time.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'kernel': ('linear', 'rbf'),
    'C': [1, 10, 100]
}
grid_search = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,
)
grid_search.fit(X_train_pipe, y_train)
model = grid_search.best_estimator_
model.predict(X_test_pipe[0:1])
array([0])
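For comparison, the all-in-one version nests the feature steps inside the search and addresses parameters with the step__parameter naming convention; note that the scaler gets re-fitted for every fold and candidate:

# Slower variant: the scaler is re-run for every parameter combination.
full_search = GridSearchCV(
    Pipeline([
        ('scaler', StandardScaler()),
        ('svc', SVC()),
    ]),
    {
        'svc__kernel': ('linear', 'rbf'),
        'svc__C': [1, 10, 100],
    },
    cv=5,
)
full_search.fit(X_train, y_train)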

After you are done, you can connect the pipeline and the model back together, but don’t fit the steps again:

full_pipeline = Pipeline([
    ('feature_transformers', pipe),
    ('model', model),
])
full_pipeline.predict(X_test[0:1])

Then you can pickle the combined pipeline and save it to disk or to [DVC](https://dvc.org/):

import pickle

with open("location", "wb") as f:
    pickle.dump(full_pipeline, f)
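As an aside, for pipelines carrying large numpy arrays the scikit-learn persistence docs point to joblib, which is just as short:

import joblib
joblib.dump(full_pipeline, "location.joblib")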

And load it back up in production:

with open("location", "rb") as f:
    full_pipeline = pickle.load(f)

This makes it pretty handy if you are building an API, because predicting is usually a one-liner, and it prevents common issues such as accidentally using the transformers out of order:

full_pipeline.predict(<features>)
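For instance, a bare-bones Flask endpoint (the route and JSON field names here are hypothetical) reduces to little more than that one line:

import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[...]]}; the shape is an assumption.
    features = np.array(request.json["features"])
    return jsonify(prediction=full_pipeline.predict(features).tolist())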

One caveat: if you want to use your own transformers, you need to ship them in your own package, since unpickling requires the transformer classes to be importable wherever the pipeline is loaded. Later on I’ll give an introduction to creating your own transformers, modifying FeatureUnion to accept multiple types, and tips on fast pipeline transformers in production.