Pandas - Apply vs Where in Small DataFrames

Posted on May 7, 2022

ML production prediction often requires keys that are missing from the input to be present and set to NaN or some other imputed value.
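As a rough illustration of this requirement, one way to make sure every expected key is present is DataFrame.reindex, which adds any missing column and fills it with NaN. The feature names and the EXPECTED_COLUMNS list below are hypothetical and not taken from any particular pipeline.

```python
import pandas as pd

# Hypothetical list of features the model expects (an assumption for illustration).
EXPECTED_COLUMNS = ["age", "income", "city"]

# An incoming prediction request that is missing the "income" key.
request = pd.DataFrame([{"age": 42, "city": "Oslo"}])

# reindex adds any missing column and fills it with NaN by default.
features = request.reindex(columns=EXPECTED_COLUMNS)

print(features)
print(features.dtypes)
```

Printing the dtypes after this step is a cheap way to see which type each column ended up with before imputation runs.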

The pipeline may detect types and impute the correct values. However, using apply versus where can lead to different DataFrame dtypes.

import pandas as pd
import numpy as np

df = pd.DataFrame({"a": ["a", None, np.nan, np.nan]})
a
0 a
1 None
2 NaN
3 NaN
df.dtypes
data
a object

Since the column contains a string value, its dtype is object. However, if we were to remove the string using apply:

df["a"] = df["a"].apply(lambda x: np.nan if x == "a" else x)
a
0 NaN
1 NaN
2 NaN
3 NaN
df.dtypes
data
a float64

Now the column dtype is float64, and the value in the 2nd row has changed from None to NaN. This is a side effect: when the column dtype changes from object to float64, None is converted to NaN.
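The side effect can be checked directly with an identity test against None. This is a minimal sketch on a standalone Series. Passing dtype=object explicitly is an assumption added here so that the starting dtype does not depend on how a given pandas version infers string columns.

```python
import numpy as np
import pandas as pd

# dtype=object is set explicitly (an assumption) so the starting point
# does not depend on the pandas version's string inference.
s = pd.Series(["a", None, np.nan, np.nan], dtype=object)

# apply re-infers the dtype from the values returned by the function.
replaced = s.apply(lambda x: np.nan if x == "a" else x)

print(s.iloc[1] is None)         # is the original element the None singleton?
print(replaced.dtype)            # dtype inferred from the returned values
print(replaced.iloc[1] is None)  # did None survive the apply call?
```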

To prevent this from happening, we can use the where function instead (applied to the original DataFrame):

df["a"] = df["a"].where(df["a"] != "a", other=np.nan)
a
0 NaN
1 None
2 NaN
3 NaN
df.dtypes
data
a object

By using the where function, we can preserve the column data type, and the 2nd row is still None.
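The same check can be sketched for where, alongside Series.mask, which is the inverse form (it replaces values where the condition is True). As before, dtype=object is set explicitly as an assumption, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", None, np.nan, np.nan], dtype=object)

# where keeps values where the condition is True and replaces the rest.
kept = s.where(s != "a", other=np.nan)

# mask is the inverse: it replaces values where the condition is True.
masked = s.mask(s == "a", other=np.nan)

print(kept.dtype, kept.iloc[1] is None)
print(masked.dtype, masked.iloc[1] is None)
```

Either form avoids calling a Python function per element, so the untouched elements pass through as they are.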