Today we will be discussing a very important method which is often overlooked within most entry level statistics curriculums.
The concept of Polynomial Regression, when applied to data, allows the practitioner to create a model which could potentially provide a better predictive capacity than a linear model. The inclusion of exponential variables within a regression formula can also, additionally, account for non-linear sample data within a model.
Ideally, when graphed, linear data should resemble the illustration below:
However, there may be instances where data is non-linear. In this case, the graphical data output could instead resemble something similar to the following illustration:
A polynomial regression model can account for the data point curvature. Due to such, applying this method can, at times, allow for the creation of a more accurate predictive model.
Example:
# Create data vectors #
x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)
# Create polynomial variables #
# Squared #
x2 <- x^2
# Cubic #
x3 <- x^3
# Quadratic #
x4 <- x^4
# Create initial linear model #
lmtestset <- lm(formula = w ~ x)
summary(lmtestset)
anova(lmtestset)
# Create squared polynomial model #
lmtestset <- lm(formula = w ~ x + x2)
summary(lmtestset)
anova(lmtestset)
# Create cubic polynomial model #
lmtestset <- lm(formula = w ~ x + x2 + x3)
summary(lmtestset)
anova(lmtestset)
# Create quadratic polynomial model #
lmtestset <- lm(formula = w ~ x + x2 + x3 + x4)
summary(lmtestset)
anova(lmtestset)
As a product of each model’s creation, summary output will be produced. The typical coefficient of determination method for measuring fit will be misleading when assessing models of this type. Therefore, the code which I have created, produces an ANOVA output subsequent to the creation of each individual model.
To select the best fitting model for our data, we must compare the ANOVA outputs generated for each model composition.
What we are specifically seeking, is an increase in significance with each new variable addition. In the case of our example, the following model code produces the corresponding ANOVA output:
# Create squared polynomial model #
lmtestset <- lm(formula = w ~ x + x2)
anova(lmtestset)
Analysis of Variance Table
Response: w
As is demonstrated within the output, the addition of the “x2” variable did not increase specificity, as the p-value for this variable is insignificant at the .05 level. From this output, we can conclude that the original linear model is superior to its polynomial alternative.
However, if the variable “x2” was significant, we would then test the cubic model which includes the variable “x3”. We would continue in this manner until we came upon a variable addition which was not significant. At that point, we would utilize whatever model was assessed prior.
If, for the sake of example, the quadratic polynomial model was the superior model, the linear equation which can be assembled from the provided output would resemble the following:
W = 385.54 + (-3.01 * x) + (-6.25 * x^2) + (1.25 * x^3) + (-0.06 * x^4)
This can be derived from the output below:
# lmtestset <- lm(formula = w ~ x + x2 + x3 + x4) #
summary(lmtestset)
Call: lm(formula = w ~ x + x2 + x3 + x4)
Residuals:
1 2 3 4 5 6 7 8 9 10
27.668 5.832 -3.929 -10.547 -27.332 10.453 8.082 -26.168 21.832 -5.893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 385.53730 91.44558 4.216 0.00836 **
x -3.30179 128.28914 -0.026 0.98046
x2 -6.25192 49.92296 -0.125 0.90522
x3 1.24753 6.91815 0.180 0.86398
x4 -0.06333 0.31200 -0.203 0.84715
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Example:
# Create data vectors #
x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)
# Create polynomial variables #
# Squared #
x2 <- x^2
# Cubic #
x3 <- x^3
# Quadratic #
x4 <- x^4
# Create initial linear model #
lmtestset <- lm(formula = w ~ x)
summary(lmtestset)
anova(lmtestset)
# Create squared polynomial model #
lmtestset <- lm(formula = w ~ x + x2)
summary(lmtestset)
anova(lmtestset)
# Create cubic polynomial model #
lmtestset <- lm(formula = w ~ x + x2 + x3)
summary(lmtestset)
anova(lmtestset)
# Create quadratic polynomial model #
lmtestset <- lm(formula = w ~ x + x2 + x3 + x4)
summary(lmtestset)
anova(lmtestset)
As a product of each model’s creation, summary output will be produced. The typical coefficient of determination method for measuring fit will be misleading when assessing models of this type. Therefore, the code which I have created, produces an ANOVA output subsequent to the creation of each individual model.
To select the best fitting model for our data, we must compare the ANOVA outputs generated for each model composition.
What we are specifically seeking, is an increase in significance with each new variable addition. In the case of our example, the following model code produces the corresponding ANOVA output:
# Create squared polynomial model #
lmtestset <- lm(formula = w ~ x + x2)
anova(lmtestset)
Analysis of Variance Table
Response: w
Df Sum Sq Mean Sq F value Pr(>F)
x 1 1878.4 1878.42 4.1571 0.08084 .
x2 1 856.2 856.19 1.8948 0.21107
Residuals 7 3163.0 451.85
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
x 1 1878.4 1878.42 4.1571 0.08084 .
x2 1 856.2 856.19 1.8948 0.21107
Residuals 7 3163.0 451.85
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As is demonstrated within the output, the addition of the “x2” variable did not increase specificity, as the p-value for this variable is insignificant at the .05 level. From this output, we can conclude that the original linear model is superior to its polynomial alternative.
However, if the variable “x2” was significant, we would then test the cubic model which includes the variable “x3”. We would continue in this manner until we came upon a variable addition which was not significant. At that point, we would utilize whatever model was assessed prior.
If, for the sake of example, the quadratic polynomial model was the superior model, the linear equation which can be assembled from the provided output would resemble the following:
This can be derived from the output below:
# lmtestset <- lm(formula = w ~ x + x2 + x3 + x4) #
summary(lmtestset)
Call: lm(formula = w ~ x + x2 + x3 + x4)
Residuals:
1 2 3 4 5 6 7 8 9 10
27.668 5.832 -3.929 -10.547 -27.332 10.453 8.082 -26.168 21.832 -5.893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 385.53730 91.44558 4.216 0.00836 **
x -3.30179 128.28914 -0.026 0.98046
x2 -6.25192 49.92296 -0.125 0.90522
x3 1.24753 6.91815 0.180 0.86398
x4 -0.06333 0.31200 -0.203 0.84715
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.