All three of these concepts fall under the umbrella of "Machine Learning", specifically, supervised machine learning.
"Bagging", is a word play synonym, which serves as a short abbreviation for "Bootstrap Aggregation". Bootstrap aggregation is a term which is utilized to describe a methodology in which multiple randomized observations are drawn from a sample data set. "Boosting" refers to the algorithm which analyzes numerous sample sets which were composed as a result of the previous process. Ultimately, from these sets, numerous decision trees are created. Into which, test data is eventually passed. Each observation within the test data set is analyzed as it passes through the numerous nodes of each individual tree. The results of the predictive output being the consensus of the results reached from a majority of the individual internal models.
How Bagging is Utilized
As previously discussed, bagging begins with a bootstrap sampling step. For demonstration purposes, let's consider how this sampling applies to a randomized version of the "iris" data frame. Here is a portion of the data frame as it currently exists within the "R" platform.
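If you would like to take a look for yourself, a similar view can be produced directly from the console:

# View the first few rows and the structure of the built-in "iris" data frame #
head(iris)
str(iris)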
A graphical representation of this process is illustrated below:
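For those who prefer code to pictures, here is a minimal sketch of the sampling step. The choice of three subsets is purely illustrative, and is not a default of any package:

# A minimal sketch of the bootstrap sampling step #
# Each subset is drawn from the rows of "iris" with replacement #
set.seed(454)
subsets <- lapply(1:3, function(i) { iris[sample(nrow(iris), replace = TRUE), ] })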
Building the Individual Trees
Once the bootstrap samples have been created, the next stage of the algorithm builds an individualized decision tree for each newly created set. Once each decision tree has been grown, the model's creation process is complete.
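As a rough illustration of this stage, the sketch below grows one classification tree per bootstrap subset created above using the "rpart" package. This is only an approximation of the idea, not the internal mechanics of the bagging() function:

# With the package "rpart" downloaded and enabled #
library(rpart)
# Fit one classification tree to each bootstrap subset #
trees <- lapply(subsets, function(d) { rpart(Species ~ ., data = d, method = "class") })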
The Decision Making Process
With the model created, the process of predicting dependent variable values can be initiated.
Remember that each decision tree was created from the observations contained within its corresponding bootstrap sample.
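Continuing the illustrative sketch from above, a new observation can be passed through each tree, with the final classification being the majority vote of the individual predictions:

# Pass a single observation through every tree and tally the votes #
votes <- sapply(trees, function(t) { as.character(predict(t, iris[1, ], type = "class")) })
# The consensus classification is the most frequent vote #
names(which.max(table(votes)))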
A Real Application Demonstration (Classification)
Again, we will utilize the "iris" data set, which comes included with the base R platform.
A short note on the standard notation utilized for this model type:
D = The training data set.
n = The number of observations within the training data set.
n′ = "n prime". The number of observations within each data subset.
m = The number of subsets.
In this example we will allow the bagging() command to perform its default function without specifying any additional options. If n′ = n, then each subset which is created from the training data set is expected to contain approximately (1 - 1/e) (≈63.2%) of the unique observations contained within the training data set, with the remainder being duplicates. A sample drawn in this manner is referred to as a bootstrap sample.
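If you would like to verify this expectation empirically, a quick simulation will suffice. The seed and the number of replications below are arbitrary choices:

# Draw n row indices with replacement and compute the fraction that are unique #
set.seed(454)
n <- nrow(iris)
mean(replicate(1000, length(unique(sample(n, replace = TRUE))) / n))
# The result should fall close to 1 - 1/e, or roughly 0.632 #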
# Create a training data set from the data frame: "iris" #
# Set randomization seed #
set.seed(454)
# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #
rannum <- runif(nrow(iris))
# Order the data frame rows according to the random values, shuffling the observations #
raniris <- iris[order(rannum), ]
# With the package "ipred" downloaded and enabled #
# Create the model #
mod <- bagging(Species ~., data= raniris[1:100,], type = "class")
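# The number of bootstrap trees is controlled by the nbagg argument; the package default (25 in current versions of "ipred") is used here #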
# View model classification results with training data #
prediction <- predict(mod, raniris[1:100,], type="class")
table(raniris[1:100,]$Species, predicted = prediction )
# View model classification results with test data #
prediction <- predict(mod, raniris[101:150,], type="class")
table(raniris[101:150,]$Species, predicted = prediction )
Console Output (1):
predicted
setosa versicolor virginica
setosa 31 0 0
versicolor 0 35 0
virginica 0 0 34
Console Output (2):
predicted
setosa versicolor virginica
setosa 19 0 0
versicolor 0 13 2
virginica 0 2 14
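From the test data output above, 19 + 13 + 14 = 46 of the 50 withheld observations were correctly classified, an accuracy of 92%. The same figure can be pulled directly from the confusion matrix:

# Overall test data accuracy taken from the confusion matrix #
conf <- table(raniris[101:150,]$Species, predicted = prediction)
sum(diag(conf)) / sum(conf)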
A Real Application Demonstration (ANOVA)
In this second example demonstration, all of the notational aspects of the model and the restrictions of the function still apply. However, in this case the dependent variable is continuous rather than categorical. To assess the predictive accuracy of the model, the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) values are calculated. For more information pertaining to the calculation and interpretation of these measurements, please consult the prior article.
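For reference, the rmse() function utilized below, which is provided by the "Metrics" package, is equivalent to the following hand-rolled definition:

# Hand-rolled equivalent of Metrics::rmse() #
RMSE <- function(actual, predicted) { sqrt(mean((actual - predicted)^2)) }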
# Create a training data set from the data frame: "iris" #
# Set randomization seed #
set.seed(454)
# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #
rannum <- runif(nrow(iris))
# Order the data frame rows according to the random values, shuffling the observations #
raniris <- iris[order(rannum), ]
# With the package "ipred" downloaded and enabled #
# Create the model #
anmod <- bagging(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova")
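# The method = "anova" option is passed through to the underlying rpart() calls, producing regression trees for the continuous response #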
# Compute the Root Mean Squared Error (RMSE) of model training data #
prediction <- predict(anmod, raniris[1:100,])
# With the package "metrics" downloaded and enabled #
rmse(raniris[1:100,]$Sepal.Length, prediction )
# Compute the Root Mean Squared Error (RMSE) of model test data #
prediction <- predict(anmod, raniris[101:150,])
# With the package "metrics" downloaded and enabled #
rmse(raniris[101:150,]$Sepal.Length, prediction )
Console Output (1) - Training Data:
[1] 0.3032058
Console Output (2) - Test Data:
[1] 0.3427076
# Create a function to calculate Mean Absolute Error #
MAE <- function(actual, predicted) {mean(abs(actual - predicted))}
# Compute the Mean Absolute Error (MAE) of model training data #
anprediction <- predict(anmod, raniris[1:100,])
MAE(raniris[1:100,]$Sepal.Length, anprediction)
# Compute the Mean Absolute Error (MAE) of model test data #
anprediction <- predict(anmod, raniris[101:150,])
MAE(raniris[101:150,]$Sepal.Length, anprediction)
Console Output (1) - Training Data:
[1] 0.2289299
Console Output (2) - Test Data:
[1] 0.2706003
Conclusions
The method on which the bagging() function is based was first proposed by Leo Breiman, who also co-developed the CART tree methodology and later created the random forest algorithm. You will likely never be inclined to use bagging as a standalone method of analysis. As was previously mentioned within this article, the justification for discussing this topic pertains solely to its applicability as a component of the random forest model. Therefore, from a pragmatic standpoint, if tree models are the model type which you wish to utilize when performing data analysis, you would either select the basic tree model for its simplicity, or the random forest model for its enhanced predictive ability.
That's all for today.
I'll see you next week,
-RD