The only difference between the two models, and it is minor at best, is the size of the tails inherent in each model's underlying distribution. The logistic distribution behind the Logit model has slightly flatter (heavier) tails than the normal distribution behind the Probit model. This is the sole aspect that separates the two models in an applied setting.
For this reason, and because the logistic regression model is more widely adopted, I would recommend against using the Probit Regression methodology unless explicitly instructed to do so.
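The difference in tails can be checked directly in R. The following is a minimal sketch rather than part of the original analysis: it assumes that the fair comparison is between the standard normal distribution and a logistic distribution rescaled to unit variance, and the variable names are chosen only for illustration.

```r
# The standard logistic distribution has standard deviation pi / sqrt(3),
# so multiplying z by this value evaluates a unit-variance logistic at z
logistic_sd <- pi / sqrt(3)

# Points in the lower tail, measured in standard deviations
z <- c(-2, -3, -4)

# Lower-tail probabilities under each distribution
probit_tail <- pnorm(z)
logit_tail <- plogis(z * logistic_sd)

# Side-by-side comparison; a ratio above 1 means the logistic tail is heavier
data.frame(z, probit_tail, logit_tail, ratio = logit_tail / probit_tail)
```

Near the center of the distribution the two curves are almost indistinguishable, which is why the choice of link rarely changes the practical conclusions.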
Example (SPSS):
It would seem that I am not alone in my confusion:
http://www-01.ibm.com/support/docview.wss?uid=swg21480469
I will now briefly explain how to create the model within the R platform.
# Create data vectors #
age <- c(55.00, 45.00, 33.00, 22.00, 34.00, 56.00, 78.00, 47.00, 38.00, 68.00, 49.00, 34.00, 28.00, 61.00, 26.00)
obese <- c(1.00, .00, .00, .00, 1.00, 1.00, .00, 1.00, 1.00, .00, 1.00, 1.00, .00, 1.00, .00)
smoking <- c(1.00, .00, .00, 1.00, 1.00, 1.00, .00, .00, 1.00, .00, .00, 1.00, .00, 1.00, 1.00)
cancer <- c(1.00, .00, .00, 1.00, .00, 1.00, .00, .00, 1.00, 1.00, .00, 1.00, 1.00, 1.00, .00)
# Combine data vectors into a single data frame #
cancerdata <- data.frame(cancer, smoking, obese, age)
# Create Probit Model #
probitmodel <- glm(cancer ~ smoking + obese + age, family=binomial(link= "probit"), data=cancerdata)
# Generate Model Summary #
summary(probitmodel)
This produces the output:
Call:
glm(formula = cancer ~ smoking + obese + age, family = binomial(link = "probit"),
    data = cancerdata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6239  -0.7546   0.5868   0.8184   1.8403

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.40234    1.30773  -1.072    0.284
smoking      1.55682    0.88940   1.750    0.080 .
obese       -0.24549    0.82711  -0.297    0.767
age          0.01792    0.02413   0.743    0.458
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20.728 on 14 degrees of freedom
Residual deviance: 16.795 on 11 degrees of freedom
AIC: 24.795

Number of Fisher Scoring iterations: 4
Which enables the creation of the model equation:
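Probit(p) = -1.40234 + (Smoking * 1.55682) + (Obese * -0.24549) + (Age * 0.01792)
Which can be implemented within the R platform as:
# Smoking #
Smoking <- 0
# Obese #
Obese <- 0
# Age #
Age <- 0
# Linear predictor on the probit (z-score) scale #
z <- -1.40234 + (Smoking * 1.55682) + (Obese * -0.24549) + (Age * 0.01792)
# Convert to a probability with the standard normal CDF (the inverse of the probit link) #
pnorm(z)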
The output is the predicted probability that the binary dependent variable equals 1 for the values entered.
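The same probability can be obtained without typing the coefficients by hand. The sketch below rebuilds the example data so that it runs on its own, then compares a manual calculation against R's predict() function. The individual described in newperson (a 50-year-old non-obese smoker) is hypothetical and chosen only for illustration.

```r
# Re-create the example data and the probit model
age <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26)
obese <- c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
smoking <- c(1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1)
cancer <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0)
cancerdata <- data.frame(cancer, smoking, obese, age)
probitmodel <- glm(cancer ~ smoking + obese + age,
                   family = binomial(link = "probit"), data = cancerdata)

# Hypothetical individual: smoker, not obese, age 50
newperson <- data.frame(smoking = 1, obese = 0, age = 50)

# Manual calculation: linear predictor, then the standard normal CDF
# Coefficient order is (Intercept), smoking, obese, age
manual_z <- sum(coef(probitmodel) * c(1, 1, 0, 50))
manual_p <- pnorm(manual_z)

# Built-in calculation: type = "response" returns a probability
predicted_p <- predict(probitmodel, newdata = newperson, type = "response")

manual_p
predicted_p
```

Both values should agree, since predict() with type = "response" applies the inverse of the model's link function, which for a probit model is pnorm().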
To check for model fit, we will utilize the Nagelkerke R-Squared statistic.
# Generate Nagelkerke R Squared #
# Download and Enable Package: "BaylorEdPsych" #
library(BaylorEdPsych)
PseudoR2(probitmodel)
This series of actions presents the following output:
        McFadden     Adj.McFadden        Cox.Snell       Nagelkerke
       0.1897432       -0.2927030        0.2306398        0.3079774
McKelvey.Zavoina           Effron
       0.3228210        0.2436036
           Count        Adj.Count              AIC    Corrected.AIC
       0.7333333        0.4285714       24.7947590       28.7947590
The only value we need to concern ourselves with is the Nagelkerke value. Because this value is interpreted in a way similar to the typical R-Squared value, at .308 we can conclude that this model is not a good fit for the data.
Additional interpretations can be drawn from this model, and they apply in the same way as they do to the logistic regression model. For more information on how these methods of interpretation are used, please consult the entry titled: Logistic Regression Analysis (Binary Categorical Variables).
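If the package is unavailable, the Nagelkerke value can be approximated from the deviances reported in the model summary. The sketch below is not taken from the package itself. It assumes ungrouped binary data, where the log-likelihood equals minus one half of the deviance, and it uses the rounded deviance values from the summary output, so small rounding differences are to be expected.

```r
# Values taken from the model summary
null_dev <- 20.728   # Null deviance
resid_dev <- 16.795  # Residual deviance
n <- 15              # Number of observations

# McFadden: proportional reduction in deviance
mcfadden <- 1 - resid_dev / null_dev

# Cox and Snell: based on the likelihood ratio, scaled by sample size
cox_snell <- 1 - exp(-(null_dev - resid_dev) / n)

# Nagelkerke: Cox and Snell divided by its maximum attainable value
nagelkerke <- cox_snell / (1 - exp(-null_dev / n))

round(c(McFadden = mcfadden, Cox.Snell = cox_snell, Nagelkerke = nagelkerke), 3)
```

The results should land close to the McFadden, Cox.Snell and Nagelkerke figures in the PseudoR2 output, which shows that the Nagelkerke statistic is simply a rescaling of the improvement in deviance over the null model.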