Scikit-Learn - Faster Transformers with Dictionaries

Posted on Apr 16, 2022

In production, when predicting row by row, we usually work with dictionaries or arrays. Pandas column assignment carries a large overhead, so if you have a long pipeline, most of the prediction latency comes from Pandas.

Therefore, a better approach is to use Pandas for training but not for prediction. Scikit-learn supports NumPy arrays, but you lose column information such as column names. It does not support dictionaries at all, so we will need to create a custom transformer that wraps the Scikit-learn classes.
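To make the first point concrete, here is a quick sketch of what converting a DataFrame to a NumPy array does to the column labels (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"fruits": ["Apple", "Banana"], "count": [3, 5]})
arr = df.to_numpy()

print(df.columns.tolist())  # ['fruits', 'count']
print(arr)  # the array keeps the values, but the column names are gone
```

Any transformer that needs to know which column it is operating on cannot recover that from the array alone.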

As an example, we will use the LabelEncoder from scikit-learn:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

sample = pd.DataFrame({"fruits": ["Apple", "Banana"]})
label_encoder = LabelEncoder().fit(sample["fruits"])

%timeit label_encoder.transform(sample["fruits"])
21.9 µs ± 366 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

That gives us about 21.9 µs, which is pretty fast, but it does not include the overhead of Pandas column assignment.

If we have a dictionary instead:

fruits = ["Apple", "Banana"]

%timeit label_encoder.transform(fruits)
14.7 µs ± 372 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

That is roughly 30% faster. Either way, ~20 µs is already fast; the problem appears once column assignment enters the picture. To handle both DataFrames and dictionaries, we wrap LabelEncoder in a custom transformer:

import copy
from typing import Union

from sklearn.base import BaseEstimator, TransformerMixin

class LabelEncoderMixed(BaseEstimator, TransformerMixin):
    """Wraps LabelEncoder so transform accepts a DataFrame or a dict."""

    def __init__(self, column: str):
        self.column = column
        self.label_encoder = None

    def fit(self, X: pd.DataFrame, y=None):
        if not isinstance(X, pd.DataFrame):
            raise NotImplementedError("fit expects a pandas DataFrame")

        self.label_encoder = LabelEncoder().fit(X[self.column])
        return self

    def transform(self, X: Union[pd.DataFrame, dict]) -> Union[pd.DataFrame, dict]:
        if not isinstance(X, (pd.DataFrame, dict)):
            raise NotImplementedError("transform expects a DataFrame or a dict")

        # Shallow copy so the caller's object is not mutated in place.
        X = copy.copy(X)
        X[self.column] = self.label_encoder.transform(X[self.column])
        return X

label_encoder_mixed = LabelEncoderMixed(column="fruits").fit(sample)

%timeit label_encoder_mixed.transform(sample)
177 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

It is much slower, with roughly 140 µs of extra latency coming from the Pandas column assignment. If you have many such steps, this overhead compounds quickly.
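A rough way to see the assignment overhead in isolation is to time a single copy-and-assign step on a DataFrame versus on a dict (the exact numbers will vary by machine; `step_frame` and `step_dict` are hypothetical names for one pipeline step):

```python
import timeit

import pandas as pd

df = pd.DataFrame({"fruits": ["Apple", "Banana"]})
row = {"fruits": ["Apple", "Banana"]}
codes = [0, 1]

def step_frame(x):
    # One pipeline step on a DataFrame: copy, then assign a column.
    x = x.copy()
    x["fruits"] = codes
    return x

def step_dict(x):
    # The same step on a dict: shallow copy, then assign a key.
    x = dict(x)
    x["fruits"] = codes
    return x

n = 10_000
t_frame = timeit.timeit(lambda: step_frame(df), number=n)
t_dict = timeit.timeit(lambda: step_dict(row), number=n)
print(f"DataFrame step: {t_frame / n * 1e6:.1f} µs, dict step: {t_dict / n * 1e6:.1f} µs")
```

The dict version of the step is typically orders of magnitude cheaper, which is why the gap widens with every step you add to the pipeline.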

But if we passed it a dictionary instead:

items = {"fruits": ["Apple", "Banana"]}

%timeit label_encoder_mixed.transform(items)
15.8 µs ± 291 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

It is still about as fast as the unwrapped version; the extra microsecond or so comes from the shallow copy we added.

If you require low latency, for example when predictions are served from a web app, then dictionaries or arrays are the way to go. For training, a Pandas DataFrame is fine, since most models accept one as input.
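Putting that together, here is a minimal sketch of the train-on-DataFrame, serve-from-dict pattern (the request payload is a made-up example):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Training: fit on a DataFrame column, where column names are convenient.
train = pd.DataFrame({"fruits": ["Apple", "Banana", "Apple"]})
encoder = LabelEncoder().fit(train["fruits"])

# Serving: a single request arrives as a plain dict, no Pandas involved.
request = {"fruits": ["Banana"]}
encoded = dict(request)  # shallow copy, as in the transformer above
encoded["fruits"] = encoder.transform(encoded["fruits"])

print(encoded["fruits"])  # LabelEncoder sorts classes: Apple -> 0, Banana -> 1
```

The fitted encoder carries everything it learned during training, so the serving path never has to construct a DataFrame at all.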