Scikit-Learn - Checking consistency between dictionary and pandas modes in pipelines.
If a scikit-learn pipeline has two modes of operation, accepting either a dict (a single row) or a pd.DataFrame, it’s important to sample a few rows from the training/testing sets and make sure the outputs of the two modes match.
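from typing import Union

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class PlusOne(BaseEstimator, TransformerMixin):
    """Adds 1 to the given columns of a dict (single row) or a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn.
        return self

    def transform(self, X: Union[dict, pd.DataFrame]) -> Union[dict, pd.DataFrame]:
        if isinstance(X, dict):
            # Dictionary mode: copy so the caller's row is not mutated.
            X = X.copy()
            for column in self.columns:
                X[column] += 1
        elif isinstance(X, pd.DataFrame):
            # DataFrame mode: copy, then update all columns at once.
            X = X.copy()
            X[self.columns] += 1
        return X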
        
plus_one: PlusOne = PlusOne(columns=["a"])
df = pd.DataFrame({"a": [2, 3, 4]})
row = {"a": 1}

plus_one.transform(df)

|   | a |
|---|---|
| 0 | 3 |
| 1 | 4 |
| 2 | 5 |

plus_one.transform(row)
{'a': 2}

You can then loop through the dataframe and check that transforming each row as a dictionary gives the same output as the corresponding row of the transformed dataframe, for example:
for (_, df_row), (_, transform_row) in zip(df.iterrows(), plus_one.transform(df).iterrows()):
    assert plus_one.transform(df_row.to_dict()) == transform_row.to_dict()

It’s recommended to sample only a few rows, since running through the full train/test sets may take a while. If you have sufficient unit tests, checking a few rows should usually suffice.
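
The sampling advice can be sketched as a small helper. This is only an illustration, not part of scikit-learn: the function name `assert_modes_match`, its `n_rows` and `seed` parameters, and the `AddOne` stand-in class are all hypothetical, and it assumes the transformer returns a dict in dictionary mode and a DataFrame in DataFrame mode, as `PlusOne` does.

```python
import pandas as pd


def assert_modes_match(transformer, df: pd.DataFrame, n_rows: int = 3, seed: int = 0) -> None:
    """Check that dict mode and DataFrame mode agree on a few sampled rows."""
    # Sample at most n_rows rows; a fixed seed keeps the check reproducible.
    sample = df.sample(n=min(n_rows, len(df)), random_state=seed)
    # Transform the sampled rows once in DataFrame mode.
    expected = transformer.transform(sample)
    for (_, raw_row), (_, expected_row) in zip(sample.iterrows(), expected.iterrows()):
        # Transform the same row again, this time in dictionary mode.
        actual = transformer.transform(raw_row.to_dict())
        assert actual == expected_row.to_dict(), f"mismatch for input {raw_row.to_dict()}"


class AddOne:
    """Hypothetical stand-in that mimics PlusOne's dict/DataFrame behaviour."""

    def __init__(self, columns):
        self.columns = columns

    def transform(self, X):
        X = X.copy()  # works for both dict and DataFrame
        for column in self.columns:
            X[column] += 1
        return X


# Passes silently when both modes agree on the sampled rows.
assert_modes_match(AddOne(columns=["a"]), pd.DataFrame({"a": [2, 3, 4, 5, 6]}))
```

In practice you would pass the fitted pipeline and the real training or testing frame instead of the stand-in. Note that `iterrows()` does not preserve dtypes across a row of mixed column types, and `NaN` never compares equal to itself, so the strict `==` comparison may need loosening for such data.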