pybind11 With Sklearn Pipelines - Introduction

Posted on Jun 21, 2025

I’ve used scikit-learn pipelines extensively in adtech over the past few years.

Adtech typically demands sub-20ms latency, which requires a careful balance between network calls to caches and local processing. Local operations, such as feature transformation for user attributes (location, interests), avoid network round trips entirely.

For fast local processing, options include Cython, NumPy, and pybind11.

pybind11 lets scikit-learn pipelines call C++ code, which speeds up heavy operations such as branching and looping. Stack-allocated C++ arrays are also significantly faster than their Python equivalents.

Our goal is to convert:

class PythonPlusOne:
    """A pure Python transformer that adds 1 to specified keys in a dictionary.
    
    Input format: Simple dictionary like {"a": 1, "b": 2}
    Output format: Same dictionary with specified keys incremented by 1
    """
    
    def __init__(self, columns):
        """Initialize with list of column names (keys) to transform."""
        self.columns = columns
    
    def fit(self, X, y=None):
        """Fit method (no-op for this transformer)."""
        return self
    
    def transform(self, X):
        """Transform the input dictionary by adding 1 to specified columns."""
        if not isinstance(X, dict):
            raise TypeError("Input must be a dictionary")
    
        result = X.copy()
        
        for col in self.columns:
            if col in result:
                result[col] = result[col] + 1.0
            else:
                raise KeyError(f"Column '{col}' not found in input dictionary")
        
        return result
    
    def fit_transform(self, X, y=None):
        """Convenience method to fit and transform in one step."""
        return self.fit(X, y).transform(X)

into:

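#include <pybind11/pybind11.h>
#include <pybind11/stl.h>  // converts Python dict <-> std::unordered_map

#include <string>
#include <unordered_map>

namespace py = pybind11;

// pybind11-facing wrapper. AddOneToKeysImpl is defined elsewhere (not shown
// here); transform() simply delegates to it.
class AddOneToKeys : private AddOneToKeysImpl {
public:
    // Inherit the AddOneToKeysImpl constructors.
    using AddOneToKeysImpl::AddOneToKeysImpl;

    // No-op fit, kept for scikit-learn compatibility; X and y are ignored.
    AddOneToKeys& fit(const py::object& /*X*/, const py::object& /*y*/ = py::none()) {
        return *this;
    }

    std::unordered_map<std::string, double> transform(const std::unordered_map<std::string, double>& X) {
        return AddOneToKeysImpl::transform(X);
    }

    // Equivalent to fit(X, y) followed by transform(X); y is ignored.
    std::unordered_map<std::string, double> fit_transform(const std::unordered_map<std::string, double>& X,
                                                          const py::object& /*y*/ = py::none()) {
        return transform(X);
    }
};

// Returns a copy of `input` with `scalar` added to every value.
// The structured binding below requires C++17.
inline std::unordered_map<std::string, double> add_scalar_to_dict(
    const std::unordered_map<std::string, double>& input,
    double scalar) {

    std::unordered_map<std::string, double> result;

    for (const auto& [key, value] : input) {
        result[key] = value + scalar;
    }

    return result;
}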
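Before porting a transformer, it helps to pin down the behaviour any replacement should keep. The following is a minimal sketch, not part of the original code: PlusOne is a condensed stand-in for PythonPlusOne, and check_contract and its make_transformer argument are hypothetical names introduced for illustration. Whether the C++ port raises KeyError for a missing key is an assumption, since AddOneToKeysImpl is not shown.

```python
class PlusOne:
    """Condensed stand-in for PythonPlusOne (same transform logic)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not isinstance(X, dict):
            raise TypeError("Input must be a dictionary")
        result = X.copy()
        for col in self.columns:
            if col not in result:
                raise KeyError(f"Column '{col}' not found in input dictionary")
            result[col] = result[col] + 1.0
        return result

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


def check_contract(make_transformer):
    """Check the behaviour a drop-in replacement is expected to preserve.

    `make_transformer` is a hypothetical factory that takes a list of keys.
    """
    row = {"a": 5.0, "b": 10.0}
    out = make_transformer(["a"]).fit_transform(row)

    assert out == {"a": 6.0, "b": 10.0}  # only the listed key changes
    assert row == {"a": 5.0, "b": 10.0}  # the input dict is not mutated

    # Assumption: a missing key is reported as KeyError, as in the Python version.
    try:
        make_transformer(["missing"]).transform(row)
    except KeyError:
        return True
    raise AssertionError("expected KeyError for a missing key")


check_contract(PlusOne)
```

Because the check only relies on duck typing, the same function could in principle be pointed at a pybind11-exposed class once it is built. Note that the add_scalar_to_dict helper above adds the scalar to every key, while the transformer only touches the listed keys.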

This series documents my notes on using pybind11 with scikit-learn pipelines.

A quick overview of performance for the three transformers benchmarked:

- AddOneToKeys: A transformer that adds 1.0 to specified keys in a dictionary. For example, if you have {"a": 5, "b": 10} and transform keys ["a"], you get {"a": 6.0, "b": 10}.

AddOneToKeys Performance:

Dict Size	Python	Pure C++	Pybind11	Python vs Pure C++	Pybind11 vs Python
10	0.5 μs	0.5 μs	1.5 μs	Same speed	3.0x slower
100	3.6 μs	4.0 μs	13.5 μs	1.1x faster	3.8x slower
1000	36.9 μs	43.6 μs	145.5 μs	1.2x faster	3.9x slower
10000	449.7 μs	676.1 μs	2253.6 μs	1.5x faster	5.0x slower
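The harness behind these numbers is not shown here. As a rough sketch of how per-call timings like the Python column could be collected, the snippet below uses the standard timeit module. The add_one function, the time_per_call_us helper, and the choice to transform every key are assumptions made for illustration, so the absolute numbers it prints will differ from the table.

```python
import timeit


def add_one(row, columns):
    """Pure Python baseline: copy the dict and add 1.0 to the listed keys."""
    result = row.copy()
    for col in columns:
        result[col] = result[col] + 1.0
    return result


def time_per_call_us(func, *args, repeat=5, number=200):
    """Best-of-`repeat` average time per call, in microseconds."""
    timer = timeit.Timer(lambda: func(*args))
    best_total = min(timer.repeat(repeat=repeat, number=number))
    return best_total / number * 1e6


for size in (10, 100, 1000):
    row = {f"k{i}": float(i) for i in range(size)}
    columns = list(row)  # assumption: every key is transformed
    print(f"{size:>5} keys: {time_per_call_us(add_one, row, columns):.1f} μs")
```

Taking the minimum over several repeats reduces noise from other processes. When a pybind11 function is timed the same way, the measurement also includes converting the dict to a std::unordered_map and back, which is one plausible reason the Pybind11 column is slower for an operation this small.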

- RollingStatistics: Computes rolling mean, standard deviation, and z-score for each key in a dictionary using a sliding window over sorted values. This involves sorting, windowing, and statistical calculations.
- IterativeComputation: Performs 1000 iterations of complex mathematical transformations (sin, cos, sqrt, log, exp) on each dictionary value. This is a CPU-intensive operation designed to test computational performance.
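To make the RollingStatistics description concrete, here is one possible reading of it in pure Python. This is a hypothetical sketch, not the benchmarked code: the rolling_stats name, the trailing window, the use of population standard deviation, and the output layout are all assumptions.

```python
import math


def rolling_stats(row, window=5):
    """One hypothetical reading of RollingStatistics (not the original code).

    Sorts the items by value, then for each key computes the mean and
    population standard deviation over the trailing window of up to
    `window` sorted values, plus the z-score of the key's own value.
    """
    ordered = sorted(row.items(), key=lambda kv: kv[1])
    values = [value for _, value in ordered]

    out = {}
    for i, (key, value) in enumerate(ordered):
        win = values[max(0, i - window + 1): i + 1]
        mean = sum(win) / len(win)
        variance = sum((v - mean) ** 2 for v in win) / len(win)
        std = math.sqrt(variance)
        # Guard against division by zero when the window is constant.
        zscore = (value - mean) / std if std > 0 else 0.0
        out[key] = {"mean": mean, "std": std, "zscore": zscore}
    return out


stats = rolling_stats({"a": 1.0, "b": 3.0}, window=2)
print(stats["b"])
```

The sort plus the per-key inner loop is the kind of branching and looping work where moving the body into C++ tends to pay off, which is consistent with the speedups in the tables.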

RollingStatistics Performance (window=5):

Dict Size	Python	Pure C++	Pybind11	Pybind11 Speedup	Pure C++ Speedup
10	14.6 μs	2.8 μs	4.7 μs	3.1x	5.2x
100	146.2 μs	33.4 μs	55.7 μs	2.6x	4.4x
1000	1537 μs	451 μs	751.7 μs	2.0x	3.4x

IterativeComputation Performance (1000 iterations):

Dict Size	Python	Pure C++	Pybind11	Pybind11 Speedup	Pure C++ Speedup
10	158 μs	22 μs	36.4 μs	4.3x	7.2x
100	1575 μs	222 μs	370.2 μs	4.3x	7.1x
1000	15620 μs	2364 μs	3939.5 μs	4.0x	6.6x
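The speedup columns are the Python time divided by the time of the faster implementation. A small sketch that recomputes them from the RollingStatistics rows, and also derives the gap between the Pybind11 and Pure C++ columns; the speedup helper and the "overhead" label are names introduced here, not the author's.

```python
def speedup(baseline_us, candidate_us):
    """How many times faster the candidate is than the baseline."""
    return baseline_us / candidate_us


# (dict size, Python μs, Pure C++ μs, Pybind11 μs) from the RollingStatistics table
rows = [
    (10, 14.6, 2.8, 4.7),
    (100, 146.2, 33.4, 55.7),
    (1000, 1537.0, 451.0, 751.7),
]

for size, python_us, cpp_us, pybind_us in rows:
    overhead_us = pybind_us - cpp_us  # Pybind11 time minus Pure C++ time
    print(
        f"{size:>5}: pybind11 {speedup(python_us, pybind_us):.1f}x, "
        f"pure C++ {speedup(python_us, cpp_us):.1f}x, "
        f"gap {overhead_us:.1f} μs"
    )
```

The gap between the Pybind11 and Pure C++ columns grows with dict size, which suggests that the cost of crossing the Python/C++ boundary scales with the amount of data converted.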

The series covers four scenarios:

┌─────────────────────────────────────────────────────────────────┐
│                     Which Pattern Should You Use?               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Start Here: What is your main application language?            │
│                                                                 │
│  Python-based Application                 C++-based Application │
│         │                                          │            │
│         ▼                                          ▼            │
│  Need Performance?                          Need Python libs?   │
│    Yes /    \ No                              Yes /    \ No     │
│       /      \                                   /      \       │
│      ▼        ▼                                 ▼        ▼      │
│  Pattern B  Pattern A                      Pattern C  Pattern D │
│  (Py→C++)   (Py→Py)                        (C++→Py)  (C++→C++) │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
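The same decision tree can be written as a small function. This is only an illustrative sketch: choose_pattern and its parameter names are hypothetical, and the returned letters refer to the patterns in the diagram.

```python
PATTERNS = {
    "A": "Py -> Py",
    "B": "Py -> C++",
    "C": "C++ -> Py",
    "D": "C++ -> C++",
}


def choose_pattern(main_language, needs_performance=False, needs_python_libs=False):
    """Return the pattern letter for the given situation.

    main_language: "python" or "c++" (the application's main language).
    needs_performance: only consulted for Python-based applications.
    needs_python_libs: only consulted for C++-based applications.
    """
    language = main_language.lower()
    if language == "python":
        return "B" if needs_performance else "A"
    if language == "c++":
        return "C" if needs_python_libs else "D"
    raise ValueError(f"Unsupported main language: {main_language!r}")


choice = choose_pattern("python", needs_performance=True)
print(choice, PATTERNS[choice])
```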