pybind11 With Sklearn Pipelines - Introduction

Posted on Jun 21, 2025

I’ve used scikit-learn pipelines extensively in adtech over the past few years.

Adtech typically demands sub-20 ms latency, which requires a careful balance between network calls to caches and local processing. Local operations, such as feature transformation for user attributes (location, interests), avoid network round trips entirely.

For fast local processing, options include Cython, NumPy, and pybind11.

pybind11 lets scikit-learn pipelines call C++ code, which speeds up branch-heavy and loop-heavy operations. Stack-allocated C++ arrays are also significantly faster than their Python equivalents.

Our goal is to convert:

class PythonPlusOne:
    """A pure Python transformer that adds 1 to specified keys in a dictionary.
    
    Input format: Simple dictionary like {"a": 1, "b": 2}
    Output format: Same dictionary with specified keys incremented by 1
    """
    
    def __init__(self, columns):
        """Initialize with list of column names (keys) to transform."""
        self.columns = columns
    
    def fit(self, X, y=None):
        """Fit method (no-op for this transformer)."""
        return self
    
    def transform(self, X):
        """Transform the input dictionary by adding 1 to specified columns."""
        if not isinstance(X, dict):
            raise TypeError("Input must be a dictionary")
    
        result = X.copy()
        
        for col in self.columns:
            if col in result:
                result[col] = result[col] + 1.0
            else:
                raise KeyError(f"Column '{col}' not found in input dictionary")
        
        return result
    
    def fit_transform(self, X, y=None):
        """Convenience method to fit and transform in one step."""
        return self.fit(X, y).transform(X)

into:

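// Headers and alias this snippet relies on. pybind11/stl.h provides the
// automatic conversion between Python dicts and std::unordered_map.
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include <string>
#include <unordered_map>

namespace py = pybind11;

// scikit-learn style wrapper around AddOneToKeysImpl (not shown in this snippet).
class AddOneToKeys : private AddOneToKeysImpl {
public:
    using AddOneToKeysImpl::AddOneToKeysImpl;  // inherit the constructors

    // No-op fit; returns the transformer itself so calls can be chained.
    AddOneToKeys& fit(const py::object& X, const py::object& y = py::none()) {
        return *this;
    }
    
    // Delegate the actual work to the implementation class.
    std::unordered_map<std::string, double> transform(const std::unordered_map<std::string, double>& X) {
        return AddOneToKeysImpl::transform(X);
    }
    
    // Nothing to fit, so this is equivalent to calling transform directly.
    std::unordered_map<std::string, double> fit_transform(const std::unordered_map<std::string, double>& X,
                                                          const py::object& y = py::none()) {
        return transform(X);
    }
};

// Return a copy of `input` with `scalar` added to every value.
inline std::unordered_map<std::string, double> add_scalar_to_dict(
    const std::unordered_map<std::string, double>& input, 
    double scalar) {
    
    std::unordered_map<std::string, double> result;
    result.reserve(input.size());  // avoid rehashing while filling the copy
    
    for (const auto& [key, value] : input) {
        result[key] = value + scalar;
    }
    
    return result;
}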

This series collects my notes on using pybind11 with scikit-learn pipelines.

A quick overview of performance, measured on three transformers:

  • AddOneToKeys: A transformer that adds 1.0 to specified keys in a dictionary. For example, given {"a": 5, "b": 10} and keys ["a"], the result is {"a": 6, "b": 10}.
  • RollingStatistics: Computes rolling mean, standard deviation, and z-score for each key in a dictionary using a sliding window over sorted values. This involves sorting, windowing, and statistical calculations.
  • IterativeComputation: Performs 1000 iterations of complex mathematical transformations (sin, cos, sqrt, log, exp) on each dictionary value. This is a CPU-intensive operation designed to test computational performance.

AddOneToKeys Performance:

| Dict Size | Python   | Pure C++ | Pybind11  | Python vs C++ | Pybind11 vs Python |
|-----------|----------|----------|-----------|---------------|--------------------|
| 10        | 0.5 μs   | 0.5 μs   | 1.5 μs    | Same speed    | 3.0x slower        |
| 100       | 3.6 μs   | 4.0 μs   | 13.5 μs   | 1.1x faster   | 3.8x slower        |
| 1000      | 36.9 μs  | 43.6 μs  | 145.5 μs  | 1.2x faster   | 3.9x slower        |
| 10000     | 449.7 μs | 676.1 μs | 2253.6 μs | 1.5x faster   | 5.0x slower        |

For this trivial operation, Pybind11 is 3x to 5x slower than pure Python: the work per call is too small to offset the cost of crossing the Python/C++ boundary.
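Per-call timings of this kind can be collected with the standard-library timeit module. The following is a minimal sketch, not the harness behind the numbers in this post: `add_one_to_keys` and `time_per_call_us` are hypothetical helpers, and transforming every key of the dictionary is an assumption about the benchmark setup.

```python
import timeit


def add_one_to_keys(data, columns):
    """Return a copy of `data` with 1.0 added to each key in `columns`."""
    result = data.copy()
    for col in columns:
        result[col] = result[col] + 1.0
    return result


def time_per_call_us(func, *args, number=200, repeat=5):
    """Best-of-`repeat` average time per call, in microseconds."""
    timer = timeit.Timer(lambda: func(*args))
    # Taking the minimum over several runs reduces noise from other processes.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e6


if __name__ == "__main__":
    for size in (10, 100, 1000, 10000):
        data = {f"key_{i}": float(i) for i in range(size)}
        columns = list(data)  # assumption: every key is transformed
        elapsed = time_per_call_us(add_one_to_keys, data, columns)
        print(f"{size:>6} keys: {elapsed:.1f} μs per call")
```

The same `time_per_call_us` helper could wrap a pybind11-backed transformer's `transform` method, which keeps the measurement identical on both sides of the comparison.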

RollingStatistics Performance (window=5):

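| Dict Size | Python   | Pure C++ | Pybind11 | Pybind11 Speedup | Pure C++ Speedup |
|-----------|----------|----------|----------|------------------|------------------|
| 10        | 14.6 μs  | 2.8 μs   | 4.7 μs   | 3.1x             | 5.2x             |
| 100       | 146.2 μs | 33.4 μs  | 55.7 μs  | 2.6x             | 4.4x             |
| 1000      | 1537 μs  | 451 μs   | 751.7 μs | 2.0x             | 3.4x             |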
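To make the RollingStatistics workload concrete, here is one possible pure-Python reading of it. The exact windowing used in the benchmark is not spelled out here, so this sketch assumes a trailing window over the values in sorted order and a population standard deviation; `rolling_statistics` is a hypothetical name.

```python
import math


def rolling_statistics(data, window=5):
    """Return {key: {"mean", "std", "zscore"}} using a trailing window
    over the dictionary's values in sorted order (an assumed layout)."""
    if window < 1:
        raise ValueError("window must be at least 1")

    # Order keys by value so each window covers neighbouring values.
    ordered_keys = sorted(data, key=data.get)
    values = [data[key] for key in ordered_keys]

    stats = {}
    for i, key in enumerate(ordered_keys):
        # Up to `window` values ending at position i (shorter at the start).
        chunk = values[max(0, i - window + 1): i + 1]
        mean = sum(chunk) / len(chunk)
        variance = sum((v - mean) ** 2 for v in chunk) / len(chunk)
        std = math.sqrt(variance)
        # A constant window has zero spread; report a z-score of 0.0 there.
        zscore = (values[i] - mean) / std if std > 0 else 0.0
        stats[key] = {"mean": mean, "std": std, "zscore": zscore}
    return stats
```

Sorting, slicing, and per-window arithmetic all run in the interpreter here, which is the kind of loop-heavy work where a compiled implementation has room to help.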

IterativeComputation Performance (1000 iterations):

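| Dict Size | Python   | Pure C++ | Pybind11  | Pybind11 Speedup | Pure C++ Speedup |
|-----------|----------|----------|-----------|------------------|------------------|
| 10        | 158 μs   | 22 μs    | 36.4 μs   | 4.3x             | 7.2x             |
| 100       | 1575 μs  | 222 μs   | 370.2 μs  | 4.3x             | 7.1x             |
| 1000      | 15620 μs | 2364 μs  | 3939.5 μs | 4.0x             | 6.6x             |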
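The IterativeComputation workload might look roughly like the sketch below. The actual formula is not given in this post, so the update rule is an illustrative assumption chosen only to exercise sin, cos, sqrt, log, and exp while staying numerically bounded; `iterative_computation` is a hypothetical name.

```python
import math


def iterative_computation(data, iterations=1000):
    """Apply a fixed math-heavy update to every value `iterations` times.

    The update rule is an assumption for illustration, not the benchmark's
    real formula.
    """
    result = {}
    for key, value in data.items():
        x = float(value)
        for _ in range(iterations):
            a = abs(x)
            # sqrt and log need non-negative input, hence the abs();
            # exp(-a) keeps the log term from growing.
            x = math.sin(x) * math.cos(x) + math.sqrt(a) + math.log1p(a) * math.exp(-a)
        result[key] = x
    return result
```

With 1000 iterations per value, the time is dominated by the inner loop rather than by dictionary handling, which matches the roughly constant speedups across dictionary sizes in the table.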

The series covers four scenarios:

┌─────────────────────────────────────────────────────────────────┐
│                     Which Pattern Should You Use?               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Start Here: What is your main application language?            │
│                                                                 │
│  Python-based Application                 C++-based Application │
│         │                                          │            │
│         ▼                                          ▼            │
│  Need Performance?                          Need Python libs?   │
│    Yes /    \ No                              Yes /    \ No     │
│       /      \                                   /      \       │
│      ▼        ▼                                 ▼        ▼      │
│  Pattern B  Pattern A                      Pattern C  Pattern D │
│  (Py→C++)   (Py→Py)                        (C++→Py)  (C++→C++) │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
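The decision tree above can also be read as a small lookup function. This is only a restatement of the diagram; `choose_pattern` and its parameter names are hypothetical.

```python
def choose_pattern(main_language, needs_performance=False, needs_python_libs=False):
    """Map the diagram's two questions to a pattern label."""
    language = main_language.lower()
    if language == "python":
        # Python application: only reach for C++ when performance demands it.
        return "Pattern B (Py→C++)" if needs_performance else "Pattern A (Py→Py)"
    if language == "c++":
        # C++ application: only embed Python when its libraries are needed.
        return "Pattern C (C++→Py)" if needs_python_libs else "Pattern D (C++→C++)"
    raise ValueError(f"Unsupported main language: {main_language!r}")
```

For example, a Python service with a tight latency budget maps to `choose_pattern("python", needs_performance=True)`, which is Pattern B.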