Pandas - Apply vs Where in Small DataFrames
ML prediction pipelines in production often require missing keys to be present and set to NaN or some other imputed value.
The pipeline may detect column types and impute values accordingly. However, using apply versus where to replace values can leave a column with a different dtype.
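As a rough illustration of the first point, one way to make missing keys present is to reindex the incoming record against the columns the model expects. The record contents and the `expected_columns` list are made up for this example.

```python
import pandas as pd

# Hypothetical list of features the model expects.
expected_columns = ["age", "city"]

# Hypothetical incoming record that lacks the "city" key.
record = {"age": 31}

# reindex adds any missing column and fills it with NaN.
features = pd.DataFrame([record]).reindex(columns=expected_columns)
print(features)
```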
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": ["a", None, np.nan, np.nan]})
| | a |
|---|---|
| 0 | "a" |
| 1 | None |
| 2 | NaN |
| 3 | NaN |
df.dtypes
| column | dtype |
|---|---|
| a | object |
Since the column contains a string value, its dtype is object. However, if we remove the string using apply:
df["a"] = df["a"].apply(lambda x: np.nan if x == "a" else x)
| | a |
|---|---|
| 0 | NaN |
| 1 | NaN |
| 2 | NaN |
| 3 | NaN |
df.dtypes
| column | dtype |
|---|---|
| a | float64 |
Now the column dtype is `float64`, and the value in the 2nd row has also changed from `None` to `NaN`. This is a side effect: when the column dtype changes from object to float, any `None` is converted to `NaN`.
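Because `None` and `NaN` can be hard to tell apart in printed output, an identity check makes the conversion visible. The sketch below rebuilds the column as a standalone Series named `s` (an illustrative name, not from the code above) and passes `dtype=object` explicitly so it does not rely on pandas' default type inference.

```python
import numpy as np
import pandas as pd

# dtype=object is set explicitly so the result does not depend on
# how the installed pandas version infers string columns.
s = pd.Series(["a", None, np.nan, np.nan], dtype=object)

# Which elements are the Python None object?
print([x is None for x in s])

replaced = s.apply(lambda x: np.nan if x == "a" else x)

# Inspect the dtype and repeat the identity check after apply.
print(replaced.dtype)
print([x is None for x in replaced])
```

Comparing the two identity checks shows whether the `None` in the 2nd row survived the call to apply.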
To prevent this from happening, we can use the `where` function on the original dataframe instead:
df["a"] = df["a"].where(df["a"] != "a", other=np.nan)
| | a |
|---|---|
| 0 | NaN |
| 1 | None |
| 2 | NaN |
| 3 | NaN |
df.dtypes
| column | dtype |
|---|---|
| a | object |
By using the `where` function, we preserve the column's dtype, and the value in the 2nd row is still `None`.
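`Series.mask` is the inverse of `where`: it replaces values where the condition is true, so the same replacement can be written with a positive condition. This is a minimal sketch on a standalone Series named `s` (an illustrative name), with `dtype=object` made explicit so it does not depend on pandas' default type inference.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", None, np.nan, np.nan], dtype=object)

# where keeps values where the condition holds and replaces the rest.
kept = s.where(s != "a", other=np.nan)

# mask replaces values where the condition holds and keeps the rest.
masked = s.mask(s == "a", other=np.nan)

print(kept.dtype, masked.dtype)
print(kept.iloc[1] is None, masked.iloc[1] is None)
```

Which of the two reads better is mostly a matter of whether the condition is easier to state as "keep" or as "replace".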