Gradient Boosting

Summary

Gradient boosting is another boosting algorithm used in machine learning. Like AdaBoost, it builds an ensemble of decision trees in which each new tree learns from the errors of the trees before it. However, the key differences are that gradient boosting typically grows larger trees than AdaBoost's single-split stumps, and that every tree's contribution is scaled by the same learning rate rather than each tree receiving its own weight based on performance.

This article covers gradient boosting for regression, meaning the model predicts a continuous target.

The Gradient Boosting Process

The gradient boosting algorithm works by taking a base prediction, and then minimising the error of this prediction in order to get to a final value. This is best explained through an example using the following dataset, with the target being the price of the car.

| Row Number | Cylinder Number | Car Height | Engine Location | Price |
|---|---|---|---|---|
| 1 | Four | 48.8 | Front | 12000 |
| 2 | Six | 48.8 | Back | 16500 |
| 3 | Five | 52.4 | Back | 15500 |
| 4 | Four | 54.3 | Front | 14000 |
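
To make the walkthrough concrete, here is the same toy dataset laid out in Python (a minimal sketch; the list names are illustrative, and the values are copied from the table above):

```python
# Toy car dataset from the table above (one entry per row).
cylinder_number = ["Four", "Six", "Five", "Four"]
car_height = [48.8, 48.8, 52.4, 54.3]
engine_location = ["Front", "Back", "Back", "Front"]
price = [12000, 16500, 15500, 14000]  # target: the continuous value we want to predict
```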

1. Creating Your Initial Prediction

Our initial prediction for the price of any car is quite simple: the average of the observed prices, (12000 + 16500 + 15500 + 14000) / 4 = 14500.

We then insert this prediction into the training data above and calculate the residuals, i.e. the error between the prediction and the observed data.

\text{Residual} = \text{Observed Data} - \text{Predicted Data}
| Row Number | Cylinder Number | Car Height | Engine Location | Price | Prediction | Residual |
|---|---|---|---|---|---|---|
| 1 | Four | 48.8 | Front | 12000 | 14500 | -2500 |
| 2 | Six | 48.8 | Back | 16500 | 14500 | 2000 |
| 3 | Five | 52.4 | Back | 15500 | 14500 | 1000 |
| 4 | Four | 54.3 | Front | 14000 | 14500 | -500 |
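
As a quick sketch of this step (variable names are illustrative), the initial prediction and the residuals in the table can be reproduced directly:

```python
# Step 1: the initial prediction is the average of the observed prices.
price = [12000, 16500, 15500, 14000]
initial_prediction = sum(price) / len(price)  # 14500.0

# Residual = Observed Data - Predicted Data, for every row.
residuals = [p - initial_prediction for p in price]
print(residuals)  # [-2500.0, 2000.0, 1000.0, -500.0]
```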

2. Creating a Model to Predict the Residual Amount

Next, we build a decision tree to predict the residuals. This is key: we are not predicting the Price, which is the target value, but rather the error between our initial prediction and the observed data.

We build the decision tree on the residuals using a regression splitting criterion such as squared error (variance reduction); Gini impurity applies to classification trees, whereas our target here, the residual, is continuous. Please note that the trees created for gradient boosting have a fixed maximum depth, decided on at the beginning of the project, to prevent overfitting.

The following decision tree was created:

Gradient Boost Decision Tree Example

As you can see, some leaves contain more than one residual: the leaf holding the residuals from rows 2 and 3 (2000 and 1000) gets both. We replace those two values with a single average for that leaf, 1500.
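
A rough sketch of this step using scikit-learn's DecisionTreeRegressor (the numeric encoding of the categorical columns, the max_depth value, and the resulting splits are illustrative assumptions, not taken from the tree shown above):

```python
from sklearn.tree import DecisionTreeRegressor

# Features encoded numerically for brevity: [cylinders, car height, engine location]
# with Front = 0 and Back = 1 (an illustrative encoding choice).
X = [
    [4, 48.8, 0],  # row 1
    [6, 48.8, 1],  # row 2
    [5, 52.4, 1],  # row 3
    [4, 54.3, 0],  # row 4
]
residuals = [-2500.0, 2000.0, 1000.0, -500.0]

# A small fixed maximum depth keeps the tree weak and limits overfitting.
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, residuals)

# Each leaf predicts the average of the residuals that land in it,
# which is exactly the leaf-averaging described above.
predicted_residuals = tree.predict(X)
```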

3. Update Predictions from Initial

In this step we update our prediction based on the new model we have created.

The updated prediction is calculated as follows:

\text{Prediction} = \text{Initial Prediction} + \text{Learning Rate} \times \text{Predicted Residual from Tree}

The learning rate is a multiplier between 0 and 1, chosen at the start of the project. It ensures that no single decision tree has too much influence on the final model, which reduces the overfitting problem. We will use 0.1 for this example.

For example, the new predicted value for Item 1 will be:

\text{Item 1 Prediction} = \text{Initial Prediction} + \text{Learning Rate} \times \text{Predicted Residual from Tree}
\text{Item 1 Prediction} = 14500 + (0.1 \times 1500)
\text{Item 1 Prediction} = 14650
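
Written as code, the update is just the formula above applied with the chosen learning rate (a sketch reusing the numbers from the worked example):

```python
initial_prediction = 14500.0   # step 1: the average price
learning_rate = 0.1            # chosen at the start of the project
predicted_residual = 1500.0    # leaf value from the tree, as in the example above

# Step 3: nudge the prediction toward the observed value by a scaled step.
item_1_prediction = initial_prediction + learning_rate * predicted_residual
print(item_1_prediction)  # 14650.0
```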

4. Conclusion

Steps 2 and 3 are then repeated: each new tree is fit to the residuals of the current predictions, and its scaled output is added to the running prediction. A fixed number of decision trees is decided upon at the beginning of the project, and once that number is reached the full model is ready to make predictions.

The hope is that, as we add more trees, the residuals keep shrinking, because each tree corrects part of the remaining error and the combined model gets better at predicting the value.
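
Putting all of the steps together, here is a minimal end-to-end sketch of the training loop under the same assumptions as the earlier snippets (the encoded features and the values of n_trees, max_depth and learning_rate are illustrative); in practice you would normally reach for a library implementation such as scikit-learn's GradientBoostingRegressor:

```python
from sklearn.tree import DecisionTreeRegressor

# Toy dataset, encoded as before: [cylinders, car height, engine location].
X = [
    [4, 48.8, 0],
    [6, 48.8, 1],
    [5, 52.4, 1],
    [4, 54.3, 0],
]
y = [12000.0, 16500.0, 15500.0, 14000.0]

n_trees = 100        # fixed number of trees, chosen up front
learning_rate = 0.1  # chosen up front, between 0 and 1
max_depth = 2        # fixed maximum depth for every tree

# Step 1: start every prediction at the average of the target.
predictions = [sum(y) / len(y)] * len(y)
trees = []

for _ in range(n_trees):
    # Step 2: fit a shallow tree to the current residuals.
    residuals = [obs - pred for obs, pred in zip(y, predictions)]
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, residuals)
    trees.append(tree)

    # Step 3: update every prediction by a learning-rate-scaled step.
    step = tree.predict(X)
    predictions = [p + learning_rate * s for p, s in zip(predictions, step)]

# A new car is then scored as: the initial average plus learning_rate times
# the sum of the residual predictions from every tree in `trees`.
```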