Boosting

Introduction

Combining predictions from multiple sources is usually preferable to relying on a single source. For example, a medical diagnosis carries much more weight if it is the result of a consensus of several experts. This idea of prediction by consensus is a powerful way to improve classification and regression models. In fact, a committee of models can achieve good performance even if each individual model is conceptually very simple.

Boosting is one such way of building a committee of models for classification or regression and is popularly implemented by an algorithm called AdaBoost.

AdaBoost

Given a dataset \(\{\mathbf{x}_i\}\) with target variables \(\{\mathbf{y}_i\}\), AdaBoost first initializes a weight for each data sample as \(w_i = \frac{1}{n}\). At each step of the algorithm, a simple classifier, called a weak learner, is fit to the weighted data. The weight of each sample is then adjusted based on that classifier's performance: if the sample was misclassified, its relative weight is increased. After all classifiers have been fit, they are combined to form an ensemble model.

The Algorithm

  1. Initialize the data weights \(\{w_i\}\) as \(w_i^{(1)} = \frac{1}{n}\) for \(i = 1, \dots, n\).

  2. For each round \(j = 1, \dots, M\), fit a weak learner \(f_j\) to the training data by minimizing the weighted misclassification cost:

    \[ \sum_{i=1}^n w_i^{(j)} \mathbb{1}(f_j(\mathbf{x}_i) \neq \mathbf{y}_i) \]

  3. Compute a weighted error rate

    \[ \epsilon_j = \frac{\sum_{i=1}^n w_i^{(j)} \mathbb{1}(f_j(\mathbf{x}_i) \neq \mathbf{y}_i)}{\sum_{i=1}^n w_i^{(j)}} \]

  4. Use the weighted error rate to compute a weight for the classifier, so that more accurate classifiers contribute more to the final ensemble:

    \[ \alpha_j = \ln \bigg\{\frac{1 - \epsilon_j}{\epsilon_j}\bigg\}. \]

  5. Update the data weights for the next model in the sequence:

    \[ w_i^{(j+1)} = w_i^{(j)} \exp\{\alpha_j \mathbb{1}(f_j(\mathbf{x}_i) \neq \mathbf{y}_i)\}. \]

Once all \(M\) weak learners are trained, and with class labels encoded as \(\pm 1\), the final model prediction is given by

\[ Y_M(\mathbf{x}) = \text{sign} \Bigg(\sum_{j=1}^M \alpha_j f_j(\mathbf{x})\Bigg). \]
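To make these steps concrete, below is a minimal sketch of the training loop in Python with NumPy. It is an illustration rather than a reference implementation: the helper names (best_stump, stump_predict), the brute-force stump search, and the \(\pm 1\) label encoding are assumptions made for this sketch, and it uses the classifier weight \(\alpha_j = \ln\{(1 - \epsilon_j)/\epsilon_j\}\) exactly as written above.

```python
import numpy as np

def best_stump(X, y, w):
    """Search every (feature, threshold, polarity) for the lowest weighted error."""
    n_samples, n_features = X.shape
    best, best_err = None, np.inf
    for f in range(n_features):
        for t in np.unique(X[:, f]):
            for polarity in (1, -1):
                pred = np.where(X[:, f] <= t, -polarity, polarity)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best, best_err = (f, t, polarity), err
    return best

def stump_predict(stump, X):
    f, t, polarity = stump
    return np.where(X[:, f] <= t, -polarity, polarity)

def adaboost_fit(X, y, n_rounds=3):
    """y is assumed to be encoded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                   # 1. uniform initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = best_stump(X, y, w)           # 2. fit a weak learner to the weighted data
        pred = stump_predict(stump, X)
        miss = pred != y
        eps = np.sum(w * miss) / np.sum(w)    # 3. weighted error rate
        eps = np.clip(eps, 1e-10, 1 - 1e-10)  # guard the log against eps = 0 or 1
        alpha = np.log((1 - eps) / eps)       # 4. classifier weight
        w = w * np.exp(alpha * miss)          # 5. up-weight misclassified samples
        w = w / np.sum(w)                     # renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # The final weighted vote Y_M(x) from the equation above.
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)
```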

Weak Learners

The weak learners can be any classification or regression model. However, they are typically chosen to be very simple so that the full ensemble remains cheap to train. For example, a complex deep learning model would be a poor choice for a weak learner.

One example of a weak learner is a simple linear model such as a perceptron; another is a decision stump, a decision tree of depth 1. The decision stump is the default weak learner in sklearn's implementation of AdaBoost.
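As a quick check of that default (assuming scikit-learn is installed; attribute names can vary slightly between versions), the snippet below fits an AdaBoostClassifier without specifying a base estimator and inspects the first fitted weak learner:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# A small synthetic binary classification problem.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# With no base estimator specified, sklearn falls back to its default weak learner.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(clf.estimators_[0])   # a DecisionTreeClassifier with max_depth=1
print(clf.score(X, y))      # training accuracy of the ensemble
```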

Example

Let’s put this together and walk through the first few steps of training an AdaBoost model using a decision stump as the weak learner. We will use a very simple dataset to keep the values easy to compute by hand.

Initial Data

x1   x2   y   weight
1    5    0   0.2
2    6    1   0.2
3    7    0   0.2
4    8    1   0.2
5    9    1   0.2

Weak Learner 1

The first learner is trained on the initial data and picks \(x_1 = 2.5\) as the split threshold. Samples where \(x_1 \leq 2.5\) are assigned to class 0 and all other samples are assigned to class 1. The data, along with this learner's predictions, are shown below.

x1   x2   y   weight   prediction
1    5    0   0.2      0
2    6    1   0.2      0
3    7    0   0.2      1
4    8    1   0.2      1
5    9    1   0.2      1

Error and weight

The error is simple enough to compute as all samples are currently weighted equally. Since two of the samples were misclassified, the error is the sum of their weights.

Total error

\(e_1 = 0.2 + 0.2 = 0.4\).

The weight of the classifier can then be computed.

Classifier weight

\(\alpha_1 = \frac{1}{2} \ln \big(\frac{1 - e_1}{e_1}\big) = 0.2027\).

The weights of our data can now be updated using this value of \(\alpha_1\). The weight of each sample is updated by multiplying each correctly classified sample by \(\exp\{-\alpha_1\}\) and each incorrectly classified sample by \(\exp\{\alpha_1\}\):

\[ w_i^{(j+1)} = w_i^{(j)} \exp\{\alpha_j \mathbb{1}(f_j(\mathbf{x}_i) \neq \mathbf{y}_i)\}. \]

NOTE: You will notice that the update applied to the weights in this example differs from the equation above. The example follows the original publication (TODO: reference Freund), which includes a factor of \(\frac{1}{2}\) in the classifier weight, multiplies correct and incorrect samples by \(\exp\{-\alpha_j\}\) and \(\exp\{\alpha_j\}\) respectively, and renormalizes the weights at the end of each round; here, the normalization is combined with the update. The two versions produce the same normalized weights: the symmetric \(\exp\{\pm\alpha_j\}\) update with \(\alpha_j = \frac{1}{2}\ln\{(1-\epsilon_j)/\epsilon_j\}\) is equivalent, after normalization, to the update rule above with \(\alpha_j = \ln\{(1-\epsilon_j)/\epsilon_j\}\). In either case, the updated weights are shown below.

x1   x2   y   weight
1    5    0   0.167
2    6    1   0.250
3    7    0   0.250
4    8    1   0.167
5    9    1   0.167
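For readers who want to verify the arithmetic by machine, here is a small NumPy check of round 1. The array names are ours, and the normalization is folded into the update exactly as described in the note above:

```python
import numpy as np

y    = np.array([0, 1, 0, 1, 1])          # true labels
pred = np.array([0, 0, 1, 1, 1])          # learner 1: class 0 if x1 <= 2.5
w    = np.full(5, 0.2)                    # initial weights
miss = pred != y

e1 = np.sum(w[miss])                      # 0.4
alpha1 = 0.5 * np.log((1 - e1) / e1)      # 0.2027
w = w * np.exp(np.where(miss, alpha1, -alpha1))
w = w / np.sum(w)                         # [0.167, 0.25, 0.25, 0.167, 0.167]
print(e1, round(alpha1, 4), np.round(w, 3))
```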

Weak Learner 2

The algorithm now moves to the next weak learner, which classifies the data given a threshold of \(x_1 = 3.5\). Its predictions are shown below.

x1   x2   y   weight   prediction
1    5    0   0.167    0
2    6    1   0.250    0
3    7    0   0.250    0
4    8    1   0.167    1
5    9    1   0.167    1

Only a single sample is misclassified, and the error is computed as before.

Total error

\(e_2 = 0.250\)

Classifier weight

\(\alpha_2 = \frac{1}{2} \ln \big(\frac{1 - e_2}{e_2}\big) = 0.5493\)

The weights are updated for each sample, yielding the following data:

x1   x2   y   weight
1    5    0   0.111
2    6    1   0.500
3    7    0   0.167
4    8    1   0.111
5    9    1   0.111

The second sample has been misclassified twice at this point, leading to a relatively high weight. This will hopefully be addressed by the third learner.
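Before moving on, the same check applied to round 2, starting from the exact round-1 weights, reproduces the table above:

```python
import numpy as np

y    = np.array([0, 1, 0, 1, 1])
pred = np.array([0, 0, 0, 1, 1])              # learner 2: class 0 if x1 <= 3.5
w    = np.array([1/6, 1/4, 1/4, 1/6, 1/6])    # exact weights after round 1
miss = pred != y

e2 = np.sum(w[miss]) / np.sum(w)              # 0.25
alpha2 = 0.5 * np.log((1 - e2) / e2)          # 0.5493
w = w * np.exp(np.where(miss, alpha2, -alpha2))
w = w / np.sum(w)
print(np.round(w, 3))                         # [0.111, 0.5, 0.167, 0.111, 0.111]
```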

Weak Learner 3

The final weak learner splits the data at \(x_2 = 6.5\), assigning class 0 when \(x_2 \leq 6.5\) and class 1 otherwise, which yields the following output for each sample.

x1   x2   y   weight   prediction
1    5    0   0.111    0
2    6    1   0.500    0
3    7    0   0.167    1
4    8    1   0.111    1
5    9    1   0.111    1

Unfortunately, sample 2 is too tricky for any of our weak learners. The total error is shown below. Since this is a binary classification problem, an error greater than 0.5 means that this weak learner performs worse than random guessing.

Total error

\(e_3 = 0.667\)

Classifier weight

\(\alpha_3 = \frac{1}{2} \ln \big(\frac{1 - e_3}{e_3}\big) = -0.3473\)

The negative classifier weight means that this learner's predictions are effectively reversed when the ensemble is evaluated. The updated weights of each data sample are given below.

x1   x2   y   weight
1    5    0   0.167
2    6    1   0.375
3    7    0   0.125
4    8    1   0.167
5    9    1   0.167

Final Classifier

The final classifier is a weighted vote of the three weak learners, with the weights being the classifier weights we calculated (0.2027, 0.5493, and -0.3473). The negative weight means that the third learner’s predictions are reversed.
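To see that vote in numbers, the sketch below (array names are ours) re-encodes each learner's class-0/class-1 predictions as \(\pm 1\), applies the three classifier weights, and takes the sign:

```python
import numpy as np

alphas = np.array([0.2027, 0.5493, -0.3473])   # classifier weights from above
preds01 = np.array([                           # rows: learners 1-3, cols: samples 1-5
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
])
preds = 2 * preds01 - 1                        # map {0, 1} -> {-1, +1}
votes = alphas @ preds                         # weighted sum for each sample
final = ((np.sign(votes) + 1) // 2).astype(int)  # back to {0, 1}
print(final)                                   # [0 0 0 1 1]
```

The ensemble classifies four of the five samples correctly; sample 2, which every weak learner misclassified, remains wrong.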
