Reflections of a Data Scientist: (R) Probit Regression (SPSS)

The probit regression model serves the same purpose as the logistic regression model, which is modeling the probability of a binary outcome, and it is structured in a similar manner. It offers an alternative for the practitioner who wishes to pursue a differing methodology.

The only difference between the two models, and it is minor at best, lies in the link function and therefore in the tails of the resulting curve. The probit model is built on the standard normal distribution, while the logit model is built on the logistic distribution, which has slightly heavier (flatter) tails. In an applied setting, this is the sole aspect which separates the two models.
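To make the difference in tails concrete, here is a minimal sketch in base R that compares the two cumulative distribution functions. As an assumption for a like-for-like comparison, the logistic distribution is rescaled to unit variance (its standard form has variance pi^2 / 3). The variable names are illustrative only.

```r
# Compare tail probabilities of the probit (normal) and logit (logistic) links.
# Assumption: the logistic scale is chosen so both distributions have variance 1.
logistic_scale <- sqrt(3) / pi

z <- c(-4, -3, -2)

tails <- data.frame(
  z        = z,
  normal   = pnorm(z),
  logistic = plogis(z, scale = logistic_scale)
)

# A ratio above 1 means the logistic distribution puts more mass in the tail
tails$ratio <- tails$logistic / tails$normal

print(tails)
```

The ratio grows as z moves further from zero, which is where the two models diverge. Near the center of the distribution the curves are almost indistinguishable.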

For this reason, and because the logistic regression model is more widely adopted, I would recommend against utilizing the probit regression methodology unless explicitly instructed to do so.

Example (SPSS):

This is where the probit methodology would be demonstrated within the SPSS platform. However, I have not been able to determine, from either the interface or the internet, how this would be achieved.

It would seem that I am not alone in my confusion:

http://www-01.ibm.com/support/docview.wss?uid=swg21480469

Example (R):

I will now briefly explain how to create the model within the R platform.

# Create data vectors #

age <- c(55.00, 45.00, 33.00, 22.00, 34.00, 56.00, 78.00, 47.00, 38.00, 68.00, 49.00, 34.00, 28.00, 61.00, 26.00)

obese <- c(1.00, .00, .00, .00, 1.00, 1.00, .00, 1.00, 1.00, .00, 1.00, 1.00, .00, 1.00, .00)

smoking <- c(1.00, .00, .00, 1.00, 1.00, 1.00, .00, .00, 1.00, .00, .00, 1.00, .00, 1.00, 1.00)

cancer <- c(1.00, .00, .00, 1.00, .00, 1.00, .00, .00, 1.00, 1.00, .00, 1.00, 1.00, 1.00, .00)

# Combine data vectors into a single data frame #

cancerdata <- data.frame(cancer, smoking, obese, age)

# Create Probit Model #

probitmodel <- glm(cancer ~ smoking + obese + age, family=binomial(link= "probit"), data=cancerdata)

# Generate Model Summary #

summary(probitmodel)

This produces the output:

Call:
glm(formula = cancer ~ smoking + obese + age, family = binomial(link = "probit"),
    data = cancerdata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6239  -0.7546   0.5868   0.8184   1.8403

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.40234    1.30773  -1.072    0.284
smoking      1.55682    0.88940   1.750    0.080 .
obese       -0.24549    0.82711  -0.297    0.767
age          0.01792    0.02413   0.743    0.458
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 20.728 on 14 degrees of freedom
Residual deviance: 16.795 on 11 degrees of freedom
AIC: 24.795

Number of Fisher Scoring iterations: 4

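This output enables the creation of the model equation:

Probit(p) = -1.40234 + (Smoking * 1.55682) + (Obese * -0.24549) + (Age * 0.01792)

Here, Probit(p) is the inverse of the standard normal cumulative distribution function applied to the probability p. It takes the place of Logit(p) in the logistic regression model.

This equation can be implemented within the R platform as:

# Smoking #

Smoking <- 0

# Obese #

Obese <- 0

# Age #

Age <- 0

# Linear predictor on the probit scale #

z <- -1.40234 + (Smoking * 1.55682) + (Obese * -0.24549) + (Age * 0.01792)

# Convert to a probability with the standard normal CDF (the inverse of the probit link) #

pnorm(z)

The output is the predicted probability that the dependent binary variable equals 1 (here, the presence of cancer). Replace the zeros with the values of interest. Note that the probit model requires pnorm(), whereas plogis() applies only to the logistic regression model.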
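As a cross-check on the hand-built equation, the sketch below asks the fitted model for the same quantity through predict(). It refits the model from the data vectors above so that it runs on its own. The individual described in newperson is a hypothetical example and is not part of the original data.

```r
# Data vectors as defined earlier in this entry
age     <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26)
obese   <- c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
smoking <- c(1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1)
cancer  <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0)

cancerdata <- data.frame(cancer, smoking, obese, age)

probitmodel <- glm(cancer ~ smoking + obese + age,
                   family = binomial(link = "probit"), data = cancerdata)

# Hypothetical individual: a 50-year-old smoker who is not obese
newperson <- data.frame(smoking = 1, obese = 0, age = 50)

# Linear predictor on the probit scale, converted with the standard normal CDF
z_hat    <- predict(probitmodel, newdata = newperson, type = "link")
p_manual <- pnorm(z_hat)

# The same probability returned directly by predict()
p_direct <- predict(probitmodel, newdata = newperson, type = "response")

print(c(manual = unname(p_manual), direct = unname(p_direct)))
```

Both values should agree, since type = "response" applies the inverse link for you. Using predict() also avoids transcription errors when copying coefficients by hand.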

To check for model fit, we will utilize the Nagelkerke R-Squared statistic.

# Generate Nagelkerke R Squared #

# Install (if needed) and load package: "BaylorEdPsych" #

# install.packages("BaylorEdPsych")

library(BaylorEdPsych)

PseudoR2(probitmodel)

This series of actions presents the following output:

        McFadden     Adj.McFadden        Cox.Snell       Nagelkerke
       0.1897432       -0.2927030        0.2306398        0.3079774

McKelvey.Zavoina           Effron
       0.3228210        0.2436036

           Count        Adj.Count              AIC    Corrected.AIC
       0.7333333        0.4285714       24.7947590       28.7947590

The only value which we need to concern ourselves with is the Nagelkerke value. As this value is interpreted in a way which is similar to that of the typical R-Squared value, at .308 we can conclude that the model accounts for only a modest share of the variation in the outcome, and is therefore not a good fit for the data.
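For readers who would rather not depend on an additional package, the Cox and Snell and Nagelkerke values can be reproduced from the two deviances printed in the model summary. This is a minimal sketch, and it assumes ungrouped 0/1 outcome data, in which case the deviance equals -2 times the log-likelihood. The variable names are my own.

```r
# Values taken from the summary(probitmodel) output shown above.
# With the model object at hand, probitmodel$null.deviance and
# deviance(probitmodel) supply the same two numbers.
null_deviance     <- 20.728
residual_deviance <- 16.795
n                 <- 15      # number of observations

# Cox and Snell: 1 - (L_null / L_model)^(2 / n), written in terms of deviances
cox_snell <- 1 - exp((residual_deviance - null_deviance) / n)

# Nagelkerke: Cox and Snell rescaled by its maximum attainable value
nagelkerke <- cox_snell / (1 - exp(-null_deviance / n))

print(round(c(Cox.Snell = cox_snell, Nagelkerke = nagelkerke), 3))
```

Rounded to three decimals, the results agree with the Cox.Snell and Nagelkerke entries in the package output above.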

There are additional aspects of interpretation that can be discerned from this model, and they apply in the same way as they do to the logistic regression model. For more information as to how these methods of interpretation are utilized, please consult the entry titled: Logistic Regression Analysis (Binary Categorical Variables).
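To see how closely the two models agree on this particular data set, the sketch below fits both links to the same data and compares the fitted probabilities. It is an illustration only. The object names logitmodel and comparison are my own, and no specific output values are claimed here.

```r
# Data vectors as defined earlier in this entry
age     <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26)
obese   <- c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
smoking <- c(1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1)
cancer  <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0)

cancerdata <- data.frame(cancer, smoking, obese, age)

# Fit the same formula under both link functions
probitmodel <- glm(cancer ~ smoking + obese + age,
                   family = binomial(link = "probit"), data = cancerdata)
logitmodel  <- glm(cancer ~ smoking + obese + age,
                   family = binomial(link = "logit"), data = cancerdata)

# Fitted probabilities for each observation, side by side
comparison <- data.frame(
  probit = fitted(probitmodel),
  logit  = fitted(logitmodel)
)
comparison$difference <- comparison$probit - comparison$logit

print(round(comparison, 3))

# Largest disagreement between the two models, and their AIC values
print(max(abs(comparison$difference)))
print(c(probit = AIC(probitmodel), logit = AIC(logitmodel)))
```

The coefficients themselves are not directly comparable, because the two links operate on different scales. The fitted probabilities are the fairer basis for comparison.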

Wednesday, May 9, 2018
