Gradient Boosting
Summary
Gradient boosting is another boosting algorithm used for machine learning. Like AdaBoost, it builds decision trees in sequence, with each tree learning from the errors of the trees before it. The key differences are that gradient boosting typically grows trees larger than the stumps AdaBoost uses, fits each new tree to the residual errors of the current model rather than reweighting the training samples, and scales every tree by the same learning rate instead of giving each tree its own weight.
This is gradient boosting for regression, meaning it will be a model for predicting a continuous target.
The Gradient Boosting Process
The gradient boosting algorithm works by starting from a simple base prediction and then repeatedly reducing the error of that prediction to reach a final value. This is best explained through an example using the following dataset, with the target being the price of the car (a short code sketch of the dataset follows the table).
| Row Number | Cylinder Number | Car Height | Engine Location | Price |
|---|---|---|---|---|
| 1 | Four | 48.8 | Front | 12000 |
| 2 | Six | 48.8 | Back | 16500 |
| 3 | Five | 52.4 | Back | 15500 |
| 4 | Four | 54.3 | Front | 14000 |
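As a rough illustration, the worked example can be followed along in Python. The pandas DataFrame and the one-hot encoding of the categorical columns below are simply one convenient way to represent the table above; they are not part of the original example.

```python
import pandas as pd

# The four-car dataset from the table above.
cars = pd.DataFrame({
    "cylinder_number": ["Four", "Six", "Five", "Four"],
    "car_height": [48.8, 48.8, 52.4, 54.3],
    "engine_location": ["Front", "Back", "Back", "Front"],
    "price": [12000, 16500, 15500, 14000],
})

# One-hot encode the categorical features so a decision tree can split on them.
X = pd.get_dummies(cars[["cylinder_number", "car_height", "engine_location"]])
y = cars["price"]
```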
1. Creating Your Initial Prediction
Our initial prediction for the price of a new car is deliberately simple: the average of the observed prices. Average = (12000 + 16500 + 15500 + 14000) / 4 = 14500.
We then insert this prediction into the training data above and calculate the residuals. A residual is the error of the prediction, i.e. the observed value minus the predicted value (this calculation is also sketched in code after the table below).
| Row Number | Cylinder Number | Car Height | Engine Location | Price | Prediction | Residual |
|---|---|---|---|---|---|---|
| 1 | Four | 48.8 | Front | 12000 | 14500 | -2500 |
| 2 | Six | 48.8 | Back | 16500 | 14500 | 2000 |
| 3 | Five | 52.4 | Back | 15500 | 14500 | 1000 |
| 4 | Four | 54.3 | Front | 14000 | 14500 | -500 |
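Continuing the sketch above (and reusing the `y` defined there), step 1 amounts to two lines:

```python
# Step 1: the initial prediction is simply the mean price, and each residual
# is the observed price minus that prediction.
initial_prediction = y.mean()        # 14500.0
residuals = y - initial_prediction   # -2500, 2000, 1000, -500
```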
2. Creating a Model to Predict the Residual Amount
Now we will use decision trees to predict the residual amount. This is key: we are not predicting the price, which is the target value, but rather the error between our initial prediction and the observed data.
Because the residual is a continuous value, the tree is grown as a regression tree, choosing the splits that most reduce the squared error (variance) of the residuals rather than Gini impurity, which applies to classification targets. Please note that the trees created for gradient boosting have a fixed maximum size (for example, a maximum depth) decided on at the beginning of the project. This is to prevent the trees from overfitting the data.
The following decision tree was created:

As you can see, there are cases where we get more than one output per leaf: leaf R2,1 holds the residuals 2000 and 1000. We replace these two values with their average for this leaf, (2000 + 1000) / 2 = 1500.
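A minimal sketch of this step, assuming scikit-learn is available and reusing `X` and `residuals` from the earlier snippets. `DecisionTreeRegressor` grows a regression tree by minimising squared error, and each leaf automatically outputs the average residual of the rows that land in it. The `max_depth=2` size cap is an arbitrary choice for this tiny example, so the fitted tree will not necessarily match the one shown above.

```python
from sklearn.tree import DecisionTreeRegressor

# Step 2: fit a small regression tree to the residuals, not to the prices.
tree_1 = DecisionTreeRegressor(max_depth=2, random_state=0)
tree_1.fit(X, residuals)

# Each row's predicted residual is the average residual of its leaf,
# matching the per-leaf averaging described above.
predicted_residuals = tree_1.predict(X)
```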
3. Update Predictions from Initial
In this step we update our prediction using the new model we have created.
The updated prediction is calculated as follows:

New Prediction = Initial Prediction + (Learning Rate × Predicted Residual)
The learning rate is a multiplier between 0 and 1 chosen at the start of the project. It ensures that no single decision tree has too much bearing on the final model, which reduces the overfitting problem. We will use 0.1 for this example.
For example, assuming Item 1 falls into a leaf of its own, its predicted residual is -2500 and its new predicted value will be 14500 + (0.1 × -2500) = 14250.
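Under the same assumptions, the update itself is a single line; `initial_prediction` and `predicted_residuals` are the names introduced in the earlier sketches.

```python
# Step 3: nudge the prediction towards the observed prices by a fraction
# (the learning rate) of each row's predicted residual.
learning_rate = 0.1
updated_predictions = initial_prediction + learning_rate * predicted_residuals
# e.g. a row with a predicted residual of -2500 moves from 14500 to 14250.
```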
4. Conclusion
A fixed number of decision trees is decided upon at the beginning of the project: each new tree is fitted to the residuals of the model built so far, and once that number of trees is reached, the full model is ready to make predictions.
The hope is that as we add more decision trees, the residuals get smaller, because the combined model should become better and better at predicting the target value.
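Putting the steps together, a compact sketch of the whole training loop might look like the following. The function names, the default of 100 trees and the depth cap are illustrative choices, not fixed parts of the algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    """Fit an ensemble of small regression trees, each trained on the
    residuals of the model built so far."""
    initial_prediction = float(np.mean(y))
    predictions = np.full(len(y), initial_prediction)
    trees = []
    for _ in range(n_trees):
        residuals = y - predictions                   # errors of the current model
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                        # each new tree learns the residuals
        predictions = predictions + learning_rate * tree.predict(X)
        trees.append(tree)
    return initial_prediction, trees

def gradient_boost_predict(X, initial_prediction, trees, learning_rate=0.1):
    """Combine the initial prediction with the scaled output of every tree."""
    prediction = np.full(len(X), initial_prediction)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction
```

For real projects, scikit-learn's `GradientBoostingRegressor` implements the same idea with additional refinements such as subsampling and a choice of loss functions.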