# Scikit-Learn - Modifying feature union to work with Pandas DataFrame

A sample transformer is as follows; it adds 1 to each of the specified columns:

```
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class PlusOne(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Select the configured columns and add 1 to every value.
        return X[self.columns] + 1
```

We can easily put this into a pipeline:

```
from sklearn.pipeline import make_pipeline

sample = pd.DataFrame([{"a": 1}])
pipe = make_pipeline(
    PlusOne(columns=["a"]),
    PlusOne(columns=["a"]),
)
```

`pipe.fit_transform(sample)`

| | a |
|---|---|
| 0 | 3 |

However, if we transform two columns using `make_union`, the pipeline still runs, but we lose our column names:

```
from sklearn.pipeline import make_pipeline, make_union

sample = pd.DataFrame([{"a": 1, "b": 2}])
pipe = make_pipeline(
    make_union(
        PlusOne(columns=["a"]),
        PlusOne(columns=["b"]),
    ),
)
```

`pipe.fit_transform(sample)`

| | 0 | 1 |
|---|---|---|
| 0 | 2 | 3 |

Notice that the output columns are now labeled `0` and `1`.

This is because of how the `FeatureUnion` class combines the transformer outputs:

```
class FeatureUnion(TransformerMixin, _BaseComposition):
    def fit_transform(self, X, y=None, **fit_params):
        """Fit all transformers, transform the data and concatenate results.

        Parameters
        ----------
        X : iterable or array-like, depending on transformers
            Input data to be transformed.
        y : array-like of shape (n_samples, n_outputs), default=None
            Targets for supervised learning.
        **fit_params : dict, default=None
            Parameters to pass to the fit method of the estimator.

        Returns
        -------
        X_t : array-like or sparse matrix of \
                shape (n_samples, sum_n_components)
            The `hstack` of results of transformers. `sum_n_components` is the
            sum of `n_components` (output dimension) over transformers.
        """
        results = self._parallel_func(X, y, fit_params, _fit_transform_one)
        if not results:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        Xs, transformers = zip(*results)
        self._update_transformer_list(transformers)
        return self._hstack(Xs)

    def transform(self, X):
        """Transform X separately by each transformer, concatenate results.

        Parameters
        ----------
        X : iterable or array-like, depending on transformers
            Input data to be transformed.

        Returns
        -------
        X_t : array-like or sparse matrix of \
                shape (n_samples, sum_n_components)
            The `hstack` of results of transformers. `sum_n_components` is the
            sum of `n_components` (output dimension) over transformers.
        """
        Xs = Parallel(n_jobs=self.n_jobs)(
            delayed(_transform_one)(trans, X, None, weight)
            for name, trans, weight in self._iter()
        )
        if not Xs:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        return self._hstack(Xs)

    def _hstack(self, Xs):
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = np.hstack(Xs)
        return Xs
```

Notice that the original implementation uses `_hstack` to stack the output features horizontally. For dense outputs, this calls the `hstack` function in NumPy.

For example, if the transformers returned DataFrames as output:

```
import numpy as np
np.hstack([pd.DataFrame([{"a": 3}]), pd.DataFrame([{"b": 4}])])
```

| | 0 | 1 |
|---|---|---|
| 0 | 3 | 4 |
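To make the difference concrete, here is a minimal sketch comparing `np.hstack` with `pd.concat` on the same two single-column DataFrames. The variable names (`left`, `right`, `stacked`, `joined`) are only illustrative.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame([{"a": 3}])
right = pd.DataFrame([{"b": 4}])

# np.hstack converts its inputs to plain arrays, so the labels are dropped.
stacked = np.hstack([left, right])

# pd.concat along axis=1 places the frames side by side and keeps the labels.
joined = pd.concat([left, right], axis=1)

print(type(stacked).__name__)  # ndarray
print(list(joined.columns))    # ['a', 'b']
```

The values are identical in both results; only the container type and the column labels differ.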

The stacked output of the pandas DataFrames loses its column names. Hence we need to modify the `_hstack` function so that it recognizes whether it is dealing with DataFrames or NumPy arrays.

An alternative implementation is as follows:

```
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.pipeline import FeatureUnion, _name_estimators


class ModifiedFeatureUnion(FeatureUnion):
    def _hstack(self, Xs):
        # Xs is a list of transformer outputs: pandas DataFrames,
        # numpy arrays or scipy sparse matrices.
        if isinstance(Xs[0], pd.DataFrame):
            # Concatenate column-wise so the column names are kept.
            return pd.concat(Xs, axis=1)
        # Anything else is stacked as in the original _hstack shown above.
        if any(sparse.issparse(f) for f in Xs):
            return sparse.hstack(Xs).tocsr()
        return np.hstack(Xs)


def modified_make_union(*transformers, n_jobs=None, verbose=False):
    # _name_estimators is a private scikit-learn helper that generates step names.
    return ModifiedFeatureUnion(
        _name_estimators(transformers), n_jobs=n_jobs, verbose=verbose
    )
```

```
sample = pd.DataFrame([{"a": 1, "b": 2}])
pipe = make_pipeline(
    modified_make_union(
        PlusOne(columns=["a"]),
        PlusOne(columns=["b"]),
    ),
)
pipe.fit_transform(sample)
```

| | a | b |
|---|---|---|
| 0 | 2 | 3 |

Since `Xs` is a list of `pd.DataFrame` objects, we can simply use `pd.concat`, with the caveat that all of the DataFrames must share the same index.
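The index caveat is easy to overlook, so here is a small sketch of what happens when the indexes differ. The frames `left` and `right` are made-up examples, and resetting the index is just one possible workaround, appropriate only when the rows are known to be in the same order.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"b": [3, 4]}, index=[1, 2])

# With mismatched indexes, concat aligns on the labels: the result covers
# the union of both indexes and the gaps are filled with NaN.
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)  # (3, 2)

# Resetting the index first gives a positional, row-by-row join instead.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
print(aligned.shape)  # (2, 2)
```

If a transformer filters or reorders rows, the label-based alignment is usually what silently introduces the extra rows and missing values.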

We also keep the original functionality by checking the type of the first item in the list, so NumPy arrays are stacked as before.

With this simple implementation, the pipeline can now support transformers that return pandas DataFrames. In addition, `FeatureUnion` can run its transformers in parallel when `n_jobs` is set, which can speed up the overall pipeline.
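As a rough illustration of the parallelism, the sketch below sets `n_jobs` on a plain `FeatureUnion`. It uses two built-in scalers purely as stand-in transformers, which is an assumption for the example and not part of the code above. Whether this gives any speedup depends on how expensive each transformer is; for tiny inputs like this one, the scheduling overhead will likely dominate.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# n_jobs=2 lets joblib run up to two transformers concurrently;
# the default n_jobs=None runs them one after another.
union = FeatureUnion(
    [("standard", StandardScaler()), ("minmax", MinMaxScaler())],
    n_jobs=2,
)
result = union.fit_transform(X)
print(result.shape)  # (3, 4)
```

The same `n_jobs` argument is accepted by `modified_make_union`, since it is passed straight through to the `FeatureUnion` constructor.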