Gradient Boosting. The (extreme) gradient boosting algorithm is another decision tree ensemble technique. However, it works on the principle of boosting instead of bagging, as in the random forest method. Just like bagging, boosting combines multiple trees into a single predictive model, but the way the trees are built differs significantly: they are grown sequentially instead of independently. Each tree is grown using information from the previously grown trees, so the model's precision improves after each iteration. Furthermore, each tree is fit on the original full data set instead of a bootstrap sample (although there is a hyperparameter for subsampling, cf. infra).

In this dissertation, the extreme gradient boosting version was used. It works on the same principle as general gradient boosting, but with a more regularized model formulation to control overfitting. As this method is also inherently binary, the one-vs-rest strategy was used here to make it multiclass. With this strategy, each classifier fits one class against all other classes. Compared to the one-vs-one strategy, this saves computation time, as only three classifiers are needed (one for each class). It is also more interpretable, since each class is represented by exactly one classifier. One-vs-rest is the default strategy for this method, and it is also the default for almost all classifiers.

Similar to random forest, boosting has hyperparameters that need to be tuned. Some examples are (▇▇▇▇▇▇▇ & ▇▇▇▇, 2018):
1. The number of trees. With this method, there is a danger of overfitting when this number is too large.
2. The shrinkage parameter, which controls the rate at which the method learns. Typical values are 0.01 or 0.001. A small shrinkage parameter will often require a large number of trees to achieve good model performance.
3. The number of splits in each tree, d. It is also called the interaction depth, as it controls the interaction order of the boosted model: d splits can involve at most d variables. If d = 1, each tree is a stump consisting of a single split. Thus, a higher value of d results in a more complex boosted ensemble.
4. Gamma, which regularises the model using information across trees. It specifies by how much the loss must be reduced by a split for that split to actually be made. The higher the value, the stronger the regularisation.
5. The sampling of the dataset at each boosting round.
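The setup above can be sketched in code. This is a minimal illustration, not the dissertation's actual model: it uses scikit-learn's GradientBoostingClassifier as a stand-in for extreme gradient boosting (in the xgboost library itself, the gamma penalty would be set via XGBClassifier(gamma=...)), a synthetic three-class data set, and illustrative hyperparameter values rather than tuned ones.

```python
# Hypothetical sketch of one-vs-rest gradient boosting on a 3-class problem.
# Data, class labels, and hyperparameter values are placeholders, not the
# dissertation's; GradientBoostingClassifier stands in for XGBoost here.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the real data: 3 classes, 10 features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = GradientBoostingClassifier(
    n_estimators=200,    # 1. number of trees (overfitting risk if too large)
    learning_rate=0.1,   # 2. shrinkage; the text cites 0.01-0.001, a larger
                         #    value is used here so fewer trees suffice
    max_depth=1,         # 3. interaction depth d (d=1: each tree is a stump)
    subsample=0.8,       # 5. fraction of the data sampled per boosting round
)
# 4. gamma has no direct equivalent in scikit-learn; in xgboost it is the
#    minimum loss reduction required for a split to be made.

# One-vs-rest: one binary boosted model fitted per class.
clf = OneVsRestClassifier(base).fit(X_train, y_train)

print(len(clf.estimators_))  # three classifiers, one per class
print(clf.score(X_test, y_test))
```

Because the wrapped problem has three classes, `clf.estimators_` holds exactly three fitted binary classifiers, matching the computational argument for one-vs-rest made above.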
