Scikit-Learn - Checking between dictionary and pandas modes in pipelines.

Posted on Apr 15, 2022

If a scikit-learn pipeline has two modes of operation, one taking a dict and one taking a pd.DataFrame, it’s important to sample a few rows of the training/testing sets and make sure the outputs of the two modes match.

import copy
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class PlusOne(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X is either a single row as a dict or a whole pd.DataFrame;
        # the return value has the same type as the input.
        if isinstance(X, dict):
            X = X.copy()
            for column in self.columns:
                X[column] += 1
        elif isinstance(X, pd.DataFrame):
            X = copy.copy(X)
            X[self.columns] += 1
        return X
plus_one: PlusOne = PlusOne(columns=["a"])
df = pd.DataFrame({"a": [2, 3, 4]})
row = {"a": 1}
plus_one.transform(df)

   a
0  3
1  4
2  5

plus_one.transform(row)

{'a': 2}

You can then loop through the DataFrame and check that transforming each row as a dictionary gives the same result as the corresponding row of the transformed DataFrame, for example:

for (_, df_row), (_, transform_row) in zip(df.iterrows(), plus_one.transform(df).iterrows()):
    assert plus_one.transform(df_row.to_dict()) == transform_row.to_dict()

It’s recommended to sample only a few rows, since pushing every row of the train/test sets through the dictionary mode one at a time may take a while. If the transformers already have sufficient unit tests, checking a few rows should suffice.
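As a rough sketch of the sampling idea, the check can be wrapped in a small helper. The name check_modes_match and its n and random_state parameters are made up for illustration and are not part of scikit-learn or pandas. The transformer is repeated (in a shortened form) so the snippet runs on its own, and it assumes the DataFrame mode returns a DataFrame with the same rows in the same order.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class PlusOne(BaseEstimator, TransformerMixin):
    """Shortened version of the transformer from the post."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Works for both a dict row and a pd.DataFrame.
        X = X.copy()
        for column in self.columns:
            X[column] += 1
        return X


def check_modes_match(transformer, df, n=5, random_state=0):
    """Hypothetical helper: compare dict mode and DataFrame mode on a sample.

    Returns the number of rows checked; raises AssertionError on a mismatch.
    """
    # Draw a small, reproducible sample instead of walking the whole set.
    sample = df.sample(n=min(n, len(df)), random_state=random_state)

    # DataFrame mode: transform the sample once, then split it into rows.
    expected = transformer.transform(sample).to_dict(orient="records")

    # Dict mode: transform each sampled row individually.
    rows = sample.to_dict(orient="records")
    for row, expected_row in zip(rows, expected):
        actual_row = transformer.transform(row)
        assert actual_row == expected_row, f"{actual_row} != {expected_row}"
    return len(rows)


df = pd.DataFrame({"a": [2, 3, 4], "b": [0.5, 1.5, 2.5]})
checked = check_modes_match(PlusOne(columns=["a"]), df, n=2)
```

Here to_dict(orient="records") is used instead of iterrows() because it builds each row dictionary column by column, whereas iterrows() returns each row as a single Series and may upcast mixed column types to a common dtype.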