How to productize your machine learning model using Scikit-learn? [2/2]

As we saw in How to productize your machine learning model using Scikit-learn [1/2], it is crucial that production-ready code is versionable, testable, manageable, and integrable. After understanding these concepts and why it is important not to use Jupyter notebooks to productize your model pipeline, let’s look at a practical example of transforming prototype code from data scientists into production-ready code.

We’ll start by exploring the Pipeline class, discussing its significance and how to effectively use it. Following that, we will briefly cover how to adapt your development code into a Scikit-learn class, where you can encapsulate all your feature engineering steps as methods.

First, let’s understand what a Pipeline is in this context.

1. What is a Scikit-Learn Pipeline?

Figure 1. Scikit-learn logo.

According to the official documentation, the class sklearn.pipeline.Pipeline can be summarized as follows:

A Pipeline allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.

For a machine learning model pipeline, it is crucial to perform the same preprocessing steps on the training, validation, and test data to avoid data leakage. Data leakage is a common issue during model building, where information from outside the training dataset is inadvertently used to develop the model. In other words, information about the future is used to train the model, making it unsuitable for production. The main benefits of using a Scikit-Learn Pipeline class are:

  1. Consistency: A Pipeline ensures that your data transformations are consistently applied across all stages—training, validation, and testing. This consistency allows the model to handle unseen data appropriately and reliably.
  2. No data leakage: When preprocessing steps like scaling and normalizing are applied to the training dataset, using a Pipeline ensures that the test data remains unseen and unaltered by these transformations. This prevents “peeking” into the test data, which helps the model’s performance metrics accurately reflect its ability to generalize to new, unseen data.
  3. Fair evaluation: By maintaining consistent data transformations, a Pipeline ensures that during model evaluation, the metrics truly represent the model’s performance on new, unseen data. This allows for a fair and accurate assessment of the model’s true capabilities.

This is why keeping your set of data transformations reproducible using the sklearn.pipeline.Pipeline class is so important.

2. What does production-ready code look like?

To demonstrate production-ready code, I will use code snippets from a dataset with the following features: ‘City’, ‘Website’, ‘Revenue’, ‘Status’, and the target ‘Value’.

The hypothetical code provided by the data scientist is in the following format:

Feature engineering steps
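Since the original snippet isn't reproduced here, the sketch below shows what such prototype code typically looks like. The column names (`City`, `Website`, `Revenue`, `Status`, `Value`) and the derived feature names (`has_https`, `revenue_revised`) come from the article; the concrete transformation rules and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy data using the article's columns; the values are made up.
df = pd.DataFrame({
    "City": ["Lisbon", "Porto", "Lisbon", "Faro"],
    "Website": ["https://a.com", "http://b.com", "https://c.com", "http://d.com"],
    "Revenue": [1000.0, None, 2500.0, 400.0],
    "Status": ["active", "inactive", "active", "active"],
    "Value": [1, 0, 1, 0],
})

# Procedural feature engineering, as it might appear in a notebook:
df["has_https"] = df["Website"].str.startswith("https").astype(int)
df["revenue_revised"] = df["Revenue"].fillna(df["Revenue"].median())
df = pd.get_dummies(df, columns=["City", "Status"])
```

Code like this runs fine in a notebook, but every statistic (here, the median) is computed on whatever data happens to be in `df`, which is exactly how leakage creeps in.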

To refactor the provided code, we’ll employ object-oriented programming principles, utilizing a CustomPreprocessor and a Pipeline.

2.1. Setting up Scikit-learn
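Assuming a standard Python environment, setup amounts to installing the packages and importing the classes used throughout this post (the snippets here were written against scikit-learn >= 1.2):

```python
# Install the dependencies first, e.g.: pip install scikit-learn pandas
import sklearn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier

print(sklearn.__version__)
```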

2.2. Create a CustomPreprocessor

Our CustomPreprocessor() class will inherit from the BaseEstimator and TransformerMixin classes.

Before building our custom preprocessor, we need to convert the feature engineering steps from procedural programming into methods for the CustomPreprocessor class.

2.2.1. BaseEstimator class

According to the Scikit-learn documentation, BaseEstimator serves as a base class for all estimators in Scikit-learn. In our context, this means we’re leveraging a foundational class that provides several built-in functionalities, including parameter validation, data validation, and estimator serialization. You can find more details in the official BaseEstimator documentation.

2.2.2. TransformerMixin class

We will also need to utilize the TransformerMixin class to construct our CustomPreprocessor; it serves as a mixin class for all transformers in Scikit-Learn. You can refer to the official TransformerMixin documentation for more details.

Each feature engineering step will be encapsulated as a method within the CustomPreprocessor(), allowing us to utilize them in both the fit() and transform() methods.
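The resulting class skeleton might look like this (a minimal sketch; the private helper method names are illustrative placeholders, and the concrete method bodies are covered in the next sections):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class CustomPreprocessor(BaseEstimator, TransformerMixin):
    """Encapsulates the prototype's feature engineering as a Scikit-learn transformer."""

    def __init__(self):
        # hyperparameters and helper estimators are stored here
        pass

    # Each feature engineering step becomes a method (names are illustrative)
    def _add_has_https(self, X):
        return X

    def _revise_revenue(self, X):
        return X

    def fit(self, X, y=None):
        # learn parameters/statistics from the training data only
        return self  # returning self lets the Pipeline chain calls

    def transform(self, X):
        # apply each step using what was learned in fit()
        return self._revise_revenue(self._add_has_https(X))
```

Note that inheriting from TransformerMixin gives us fit_transform() for free, built on top of the fit() and transform() we define.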

2.3. Building a CustomPreprocessor

By inheriting from the BaseEstimator and TransformerMixin classes, we can create our new preprocessor to encapsulate preprocessing logic in a reusable and maintainable way. The first method is the constructor:

2.3.1. __init__ method

In the __init__ method, we initialize the object’s attributes and perform any setup or initialization tasks needed.

One of the feature engineering steps in this pipeline involves performing one-hot encoding. To handle categorical variables appropriately as part of the data preprocessing workflow, I opted to use the built-in OneHotEncoder() preprocessor. This ensures consistency and reusability throughout the preprocessing process.

2.3.2. fit() method

The primary goal of the fit() method is to learn and store the necessary parameters or statistics from the training data that are required for the preprocessing steps. For instance, when encoding categorical features using OneHotEncoder, the fit() method determines the unique categories present in the training data. This step ensures that the preprocessing is prepared to handle unseen data during prediction, aligning with the custom preprocessor’s design.

2.3.3. transform() method

The transform() method applies the parameters and statistics learned in fit() to perform the actual transformations on the data, such as encoding categorical features or scaling numerical values. When predicting with your machine learning pipeline, only the transform() method is used. It is therefore crucial that transform() applies exactly what was learned during fit(), so that the preprocessing performed during training is replicated accurately on new data at prediction time.

Note that, following the original code, I’m using self.cat_preprocessor.transform() with self.cat_columns to encode features based on the categories learned in the CustomPreprocessor().fit() method. The transform() method then concatenates the newly encoded features listed in cat_feats with ‘revenue_revised’ and ‘has_https’.

2.4. Build the Pipeline

In this pipeline, a GradientBoostingClassifier() was chosen as the machine learning model to predict a binary class. The previously created CustomPreprocessor() can be integrated using sklearn.pipeline.Pipeline().

The Pipeline class represents a sequence of data transformers with an optional final predictor.

2.5. Run the Pipeline

After running the pipeline, we can verify its success by examining the output, which connects our CustomPreprocessor with our model (GradientBoostingClassifier).

Figure 2. Scikit-learn pipeline output.

Conclusion

In this post, we’ve explored in detail how to transform machine learning code from a prototype into production-ready code using Scikit-learn’s pipeline capabilities.

We began by discussing the CustomPreprocessor class, emphasizing how to adapt prototype feature engineering steps into methods for object-oriented programming. This approach enhances testability, manageability, and integration, ensuring a reproducible and scalable solution for fair model evaluation.

Next, we delved into the Pipeline class, a powerful tool for sequentially applying data transformations across training, validation, and test datasets. By maintaining consistency, pipelines help prevent data leakage, a common issue where insights from test data unintentionally influence model training, compromising its ability to generalize to new data.

Ultimately, leveraging Scikit-learn pipelines and custom preprocessing classes facilitates the transition from experimental code to robust production-ready machine learning systems. This approach not only enhances code quality and maintainability but also ensures that models generalize effectively to new, unseen data.

Share your comments, thoughts, or key considerations regarding what is important when creating production-ready machine learning code.
