Scikit-Learn - Checking between dictionary and pandas modes in pipelines.
If a scikit-learn pipeline supports two modes of operation, dict (a single row) and pd.DataFrame (a batch of rows), then it’s important to sample a few rows of the training/testing sets to make sure the outputs of both modes match.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class PlusOne(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Dictionary mode: X is a single row given as {column: value}.
        if isinstance(X, dict):
            X = X.copy()
            for column in self.columns:
                X[column] += 1
        # DataFrame mode: X is a batch of rows.
        elif isinstance(X, pd.DataFrame):
            X = X.copy()
            X[self.columns] += 1
        return X


plus_one: PlusOne = PlusOne(columns=["a"])
df = pd.DataFrame({"a": [2, 3, 4]})
row = {"a": 1}
plus_one.transform(df)
| | a |
|---|---|
| 0 | 3 |
| 1 | 4 |
| 2 | 5 |
plus_one.transform(row)
{'a': 2}
You can then loop through the DataFrame and check that transforming each row as a dictionary gives the same result as the corresponding row of the transformed DataFrame, for example:
for (_, df_row), (_, transform_row) in zip(df.iterrows(), plus_one.transform(df).iterrows()):
    assert plus_one.transform(df_row.to_dict()) == transform_row.to_dict()
It’s recommended to sample only a few rows, since running through the full train/test sets may take a while. If you already have sufficient unit tests, checking a few rows should usually suffice.
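As a minimal sketch of the sampling idea, the check can be wrapped in a small helper that draws a few random rows with `DataFrame.sample` and reports which ones disagree. The names `PlusOneSketch` and `check_modes_match`, and the `n_rows` and `random_state` parameters, are hypothetical and only for illustration; `PlusOneSketch` is a stand-in for the transformer above so that the snippet runs without scikit-learn.

```python
import pandas as pd


class PlusOneSketch:
    """Hypothetical stand-in for PlusOne: adds 1 to the given columns."""

    def __init__(self, columns):
        self.columns = columns

    def transform(self, X):
        X = X.copy()  # both dict and DataFrame provide .copy()
        if isinstance(X, dict):
            for column in self.columns:
                X[column] += 1
        else:
            X[self.columns] += 1
        return X


def check_modes_match(transformer, df, n_rows=5, random_state=0):
    """Compare dict-mode and DataFrame-mode outputs on a random sample.

    Returns the index labels of the sampled rows whose outputs differ.
    """
    # Sample only a few rows instead of iterating over the whole set.
    sample = df.sample(n=min(n_rows, len(df)), random_state=random_state)
    expected = transformer.transform(sample)
    mismatches = []
    for (label, raw_row), (_, expected_row) in zip(
        sample.iterrows(), expected.iterrows()
    ):
        if transformer.transform(raw_row.to_dict()) != expected_row.to_dict():
            mismatches.append(label)
    return mismatches


df = pd.DataFrame({"a": range(1000), "b": range(1000)})
mismatches = check_modes_match(PlusOneSketch(columns=["a"]), df, n_rows=5)
assert mismatches == [], f"dict and DataFrame modes disagree on rows {mismatches}"
```

Returning the mismatching labels rather than asserting inside the loop makes it easier to inspect the offending rows. Note that this plain `==` comparison treats rows containing NaN as mismatches, since NaN is not equal to itself.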