Today we will be discussing an advanced but nonetheless useful topic: logistic regression. Before delving into this subject, if you are not already familiar with linear regression, I would advise you to review the articles pertaining to that topic.
Logistic regression is a method similar to linear regression; however, it is used specifically to create models for dependent variables that are binary in nature (True/False, Yes/No, Positive/Negative, 1/0, etc.).
Let’s begin with a simple example. I have previously entered some sample data into SPSS.
As we discussed in a previous article, you should edit the variable labels so that they correspond with the appropriate binary outcomes.
The next screen presents the options needed to structure the equation used to create the model. The logistic regression model is specified in a way similar to the linear regression model. In our example, “Cancer” will be the dependent variable; “Age”, “Obese”, and “Smoking” will be the model’s independent variables.
NOTE: In this example, our categorical variables are binary, meaning that they can only represent two values (0, 1). If we were working with sample data that contained categorical variables that were not binary, we would have to specify this in the following screen, which can be opened by clicking on the “Categorical” option.
From the options menu, select “Hosmer-Lemeshow goodness-of-fit”. This test will be useful in interpreting the model results.
So what does all of the generated output indicate?
For the most part, you can ignore the data listed in the first section of the output, labeled “Block 0”. The analysis begins with the output labeled “Block 1”.
To perform the analysis, start from Block 1. First, observe the portion of the output labeled “Omnibus Tests of Model Coefficients”. We will primarily be concerned with the bottom-most row, labeled “Model”.
The rightmost column, “Sig.”, indicates the significance level of the model. Depending on the pre-determined significance level that has been established, this value determines whether or not the model is significant. “df” represents the test’s degrees of freedom, and the “Chi-square” column contains the corresponding test statistic. All three values, in addition to the number of observations within the model, are utilized in tandem when documenting the model’s significance.
“-2 Log Likelihood”, “Cox & Snell R Square”, “Nagelkerke R Square” – All three columns are outputs of methods that assess the strength of the model.
Cox & Snell R Square – This value cannot reach 1; its maximum is approximately .75 for a binary outcome. Therefore, I believe that it is best to consult the “Nagelkerke R Square” value when assessing the overall model.
Nagelkerke R Square – This value can reach a maximum of 1. Thus, it is the measure that most closely resembles the coefficient of determination (R squared) from linear regression. I would recommend referring to this value when considering the strength of the predictive model.
The “Hosmer and Lemeshow Test” – This is a goodness-of-fit test. Only its significance value is of interest: a significance lower than .05 suggests that the model may not fit the data well, in which case additional testing should be performed. Since this test utilizes a Chi-Squared distribution, it is sensitive to sample size, and it is considered most accurate when the sample size is greater than 400.
“Contingency Table for Hosmer and Lemeshow Test” – This can be ignored.
“Classification Table” – This output is simple in appearance but easy to misread. It illustrates the number of positive and negative cases predicted by the model against the number of cases that actually occurred within the sample.
For our model, 5 non-occurrences of cancer were correctly predicted, and 2 non-occurrences were incorrectly predicted as occurrences. Thus, the percentage of correct predictions for non-occurrences is 71.4% (5/(5+2)).
Likewise, 6 occurrences of cancer were correctly predicted, and 2 occurrences were not predicted. Thus, the percentage of correct predictions for occurrences is 75% (6/(6+2)).
The total percentage of accurate predictions provided by the model is 73.3% ((6+5)/(6+2+5+2)).
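For readers following along in R, here is a minimal sketch that builds the same cross-tabulation, assuming the CancerModelLog object created in the “Model Creation within R” section below and SPSS’s default .5 probability cut-off:
# Predicted group membership at the .5 cut-off #
predicted <- ifelse(fitted(CancerModelLog) >= .5, 1, 0)
# Cross-tabulate observed outcomes against predicted outcomes #
table(Observed = Cancer, Predicted = predicted)
# Overall percentage of correct predictions #
mean(predicted == Cancer) * 100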
“Variables in the Equation”
The first column lists the model variables. “Age”, “Obese”, and “Smoking” are the independent variables. “Constant” is the model’s intercept term, which is included as part of the model.
So, if we were to construct the model as an equation, it would resemble:
Logit(p) = (Age * .030) + (Obese * -.389) + (Smoking * 2.544) – 2.344
S.E. – the standard error of each coefficient within the context of the model.
Wald – the value of the Wald test statistic for each variable. For our purposes, this value can be ignored, as its significance is reported in the “Sig.” column.
df – the degrees of freedom for each variable.
Sig. – the significance of each variable within the context of the model. An insignificant value (typically > .05) does not necessarily indicate that the variable itself should be excluded from subsequent model generation.
Exp(B) or “Odds Ratio” – This indicates the factor by which the odds of a positive outcome are multiplied when the corresponding variable is increased by one. This warrants further explanation, as logistic regression models the log odds (“logit”) of the outcome rather than the probability itself.
Returning to our original model equation:
Logit(p) = (Age * .030) + (Obese * -.389) + (Smoking * 2.544) – 2.344
Let’s assume that we wish to test our model on a 19-year-old who is not obese and who smokes. The equation would resemble:
Logit(p) = (19 * .030) + (0 * -.389) + (1 * 2.544) – 2.344
Logit(p) = .77
But Logit(p) does not equal probability; it is the log odds, Logit(p) = ln(p / (1 - p)), which here equals .77.
To generate the associated probability, you can enter the following into the R console:
plogis(.77)
Which produces the console output:
[1] 0.6835209
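The plogis() function is simply the inverse logit. The same probability follows directly from the formula p = exp(x) / (1 + exp(x)), where x is the logit value:
exp(.77) / (1 + exp(.77))
Which produces the console output:
[1] 0.6835209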
Alternatively, you could evaluate the model equation within R by utilizing the code:
a <- 19
b <- 0
c <- 1
p <- (.030 * a) + (-.389 * b) + (2.544 * c) - 2.344
plogis(p)
Which produces the console output:
[1] 0.6835209
In both cases, the output value is the probability of a positive occurrence, which in our scenario would equate to a cancer diagnosis. The numbers from our model are unrealistic, but this is only due to the unrealistic sample data that I hastily created.
Returning to the odds ratio.
The value given in the Exp(B) column is the exponential of the value given in column B. It indicates the multiplicative increase in the odds of a positive outcome, within the context of the model, when the corresponding variable is increased by one. To express this as a percentage increase, the value must first be transformed.
Within the context of the model, if an individual being assessed for cancer were to age one year, the odds of cancer would increase by:
(1.031 - 1) * 100
3.1%
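This figure can be checked in R with the rounded Age coefficient of .030; the small difference from 3.1% reflects the rounding of Exp(B) to 1.031:
(exp(.030) - 1) * 100
[1] 3.045453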
Data Sheet Output
Earlier in this exercise, I asked that you select “Save” from the menu options and then check “Probabilities” and “Group Membership”. In doing so, we instructed SPSS to output data directly to our data sheet. The results are listed below.
We are presented with two new variables, “PRE_1” and “PGR_1”. “PRE_1” represents the probability (.00 – 1.00) of a positive event occurring when the independent variable data found in the leftmost columns is input into the model. “PGR_1” represents the model’s predicted group membership (1 or 0) given that same data. This output is needed to create a ROC curve; ROC curves will be covered in a later article.
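For reference, the same two columns can be generated in R from the model fitted in the next section. The names PRE_1 and PGR_1 below simply mirror SPSS’s output and are otherwise arbitrary:
# Predicted probability of a positive outcome for each observation #
PRE_1 <- fitted(CancerModelLog)
# Predicted group membership at the .5 cut-off #
PGR_1 <- ifelse(PRE_1 >= .5, 1, 0)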
Model Creation within R
I will now briefly illustrate how to obtain the same results in R.
# Model Creation #
Age <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26)
Obese <- c(1,0,0,0,1,1,0,1,1,0,1,1,0,1,0)
Smoking <- c(1,0,0,1,1,1,0,0,1,0,0,1,0,1,1)
Cancer <- c(1,0,0,1,0,1,0,0,1,1,0,1,1,1,0)
CancerModel <- data.frame(Age, Obese, Smoking, Cancer)
# Analyze the Significance of the Model #
CancerModelLog <- glm(Cancer ~ Age + Obese + Smoking, family=binomial)
CancerModelLog1 <- glm(Cancer ~ 1, family=binomial)
anova(CancerModelLog, CancerModelLog1, test="LRT")
# Console Output #
Analysis of Deviance Table
Model 1: Cancer ~ Age + Obese + Smoking
Model 2: Cancer ~ 1
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 11 16.807
2 14 20.728 -3 -3.9209 0.2701
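To make the connection explicit: the “Deviance” column is the difference between the two residual deviances (20.728 - 16.807 = 3.921, on 14 - 11 = 3 degrees of freedom), and Pr(>Chi) is the corresponding chi-squared tail probability:
pchisq(20.728 - 16.807, df = 3, lower.tail = FALSE)
# approximately 0.2701, matching the Pr(>Chi) column #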
# Summary Creation and Output #
summary(CancerModelLog)
# Output #
Call:
glm(formula = Cancer ~ Age + Obese + Smoking, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6096 -0.7471 0.5980 0.8260 1.8485
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.34431 2.25748 -1.038 0.2991
Age 0.02984 0.04055 0.736 0.4617
Obese -0.38924 1.39132 -0.280 0.7797
Smoking 2.54387 1.53564 1.657 0.0976 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 20.728 on 14 degrees of freedom
Residual deviance: 16.807 on 11 degrees of freedom
AIC: 24.807
Number of Fisher Scoring iterations: 4
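For cross-reference with the SPSS “Variables in the Equation” table discussed earlier: “Estimate” corresponds to B, “Std. Error” to S.E., and “Pr(>|z|)” to Sig. SPSS’s Wald statistic is simply the square of R’s z value, and the Exp(B) column is the exponential of each coefficient:
# Wald statistic for Smoking, as SPSS reports it (approximately 2.744) #
(2.54387 / 1.53564)^2
# The full Exp(B) column in one step #
exp(coef(CancerModelLog))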
# Generate Nagelkerke R Squared #
# Download and Enable Package: "BaylorEdPsych" #
PseudoR2(CancerModelLog)
# Console Output #
McFadden Adj.McFadden Cox.Snell Nagelkerke McKelvey.Zavoina Effron Count
0.2328838 -0.2495624 0.2751639 0.3674311 0.3477522 0.3042371 0.8000000
Adj.Count AIC Corrected.AIC
0.5714286 23.9005542 27.9005542
# Calculate Exp(B) #
# (Intercept) #
exp(-2.34431)
# Age #
exp(0.02984)
# Obese #
exp(-0.38924)
# Smoking #
exp(2.54387)
# Output #
[1] 0.09591336
[1] 1.03029
[1] 0.6775716
[1] 12.72884
# In addition to the Hosmer and Lemeshow output, check the independent variables for multicollinearity with cor() and vif() #
# Generate Correlation Matrix #
correlationmatrix <- cor(CancerModel)
correlationmatrix
# Output #
Age Obese Smoking Cancer
Age 1.00000000 0.1231756 -0.2836428 0.02147097
Obese 0.12317556 1.0000000 0.4642857 0.19642857
Smoking -0.28364280 0.4642857 1.0000000 0.46428571
Cancer 0.02147097 0.1964286 0.4642857 1.00000000
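None of the correlations among the independent variables are especially high; the largest, .46 between Obese and Smoking, is only moderate. Multicollinearity therefore does not appear to be a serious concern for this model.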
# Generate Variance Inflation Factors #
# Download and Enable Package: "car" #
vif(CancerModelLog)
# Output #
Age Obese Smoking
1.286936 1.361140 1.654232
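All three values are close to 1. As a rule of thumb, VIF values above roughly 5 to 10 indicate problematic multicollinearity, so these independent variables appear acceptable.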