Pandas - Apply vs Where in Small DataFrames
Production ML prediction often requires missing keys to be present and set to NaN or some other imputed value.
The pipeline may detect column types and impute the appropriate values. However, using apply rather than where can change a column's dtype.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": ["a", None, np.nan, np.nan]})

| | a |
|---|---|
| 0 | "a" |
| 1 | None |
| 2 | NaN |
| 3 | NaN |
df.dtypes

| | dtype |
|---|---|
| a | object |
Because the column contains a string value, its dtype is object. However, if we were to remove the string using apply:
df["a"] = df["a"].apply(lambda x: np.nan if x == "a" else x)

| | a |
|---|---|
| 0 | NaN |
| 1 | NaN |
| 2 | NaN |
| 3 | NaN |
df.dtypes

| | dtype |
|---|---|
| a | float64 |
Now the column dtype is float64, and the value in the second row has changed from None to NaN. This is a side effect of apply: it re-infers the dtype from the returned values, and once every remaining value is missing the column becomes float64, which cannot hold None.
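One way to catch this kind of silent change in a pipeline is to compare the dtype before and after the transformation. The following is only a minimal sketch: the helper name `dtype_changed` is hypothetical, and the column is built with an explicit `dtype=object` on the assumption that the input should be an object column regardless of the pandas version's default string inference.

```python
import numpy as np
import pandas as pd


def dtype_changed(before: pd.Series, after: pd.Series) -> bool:
    """Hypothetical helper: True if the transformation altered the dtype."""
    return before.dtype != after.dtype


# Explicit object dtype so that None is stored as None (an assumption
# made for this sketch, not something the pipeline necessarily does).
s = pd.Series(["a", None, np.nan, np.nan], dtype=object)

# apply re-infers the dtype from the values the lambda returns.
applied = s.apply(lambda x: np.nan if x == "a" else x)

print(s.dtype, "->", applied.dtype)
print("dtype changed:", dtype_changed(s, applied))
```

A check like this could sit next to the type-detection step, so a changed dtype is reported rather than imputed incorrectly later.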
To prevent this from happening, we can use the where function:
df["a"] = df["a"].where(df["a"] != "a", other=np.nan)

| | a |
|---|---|
| 0 | NaN |
| 1 | None |
| 2 | NaN |
| 3 | NaN |
df.dtypes

| | dtype |
|---|---|
| a | object |
By using the where function, we preserve the column's data type, and the value in the second row is still None.
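The same replacement can also be written with `Series.mask`, which is the inverse of `where`: it replaces values where the condition is True instead of where it is False. The sketch below assumes an explicitly object-typed column and runs both forms on the same input; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Explicit object dtype is an assumption of this sketch.
s = pd.Series(["a", None, np.nan, np.nan], dtype=object)

# where keeps values where the condition holds and replaces the rest.
kept_with_where = s.where(s != "a", other=np.nan)

# mask replaces values where the condition holds and keeps the rest.
kept_with_mask = s.mask(s == "a", other=np.nan)

for name, result in [("where", kept_with_where), ("mask", kept_with_mask)]:
    print(name, result.dtype, "row 1 is None:", result.iloc[1] is None)
```

Which form to use is mostly a matter of readability: `mask` states the condition for the values being replaced, `where` states the condition for the values being kept.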