Example (SPSS):
In this demonstration, we will assume that you are attempting to predict an individual’s favorite color based on other personal characteristics.
We will begin with our data set. From the menu bar, select “Analyze”, then “Regression”, followed by “Multinomial Logistic”. This sequence of actions should cause the following menu to appear:
Next, click on the button labeled “Reference Category”; this should open the following menu:
Next, click the button labeled “Save”. This should cause the following sub-menu to appear:
Select the box labeled “Predicted category”, then click “Continue”. You will be returned to the initial menu. From this menu, click “OK”.
This produces voluminous output; however, we will concern ourselves only with the following portions:
The output above provides the model’s coefficient estimates. Though this may appear daunting at first, the information illustrated in the chart is read in much the same way as the coefficient output of a typical linear model.
In the case of our model, we will have three logit equations, one for each non-reference category:
Green = (Gender:Female * -35.791) + (Smoker:Yes * 34.774) + (Car:KIA * -17.40) + (Car:Ford * .985) + 17.40
Yellow = (Gender:Female * -36.664) + (Smoker:Yes * 15.892) + (Car:KIA * -35.632) + (Car:Ford * 1.499) + 16.886
Red = (Gender:Female * -19.199) + (Smoker:Yes * 18.880) + (Car:KIA * -37.252) + (Car:Ford * -19.974) + 18.506
As with all logit models, we need to transform the output values of each equation to generate the appropriate probabilities.
So, for our first example observation, our equations would resemble:
Observation 1 | Gender: Female | Smoker: No | Car: Chevy
Green = (1 * -35.791) + (0 * 34.774) + (0 * -17.40) + (0 * .985) + 17.40
Yellow = (1 * -36.664) + (0 * 15.892) + (0 * -35.632) + (0 * 1.499) + 16.886
Red = (1 * -19.199) + (0 * 18.880) + (0 * -37.252) + (0 * -19.974) + 18.506
Which produces the values of:
Green = -18.391
Yellow = -19.778
Red = -0.693
To produce probabilities, we transform these values with the following R code:
Green <- -18.391
Yellow <- -19.778
Red <- -0.693
# Green #
exp(Green) / (1 + exp(Green) + exp(Red) + exp(Yellow))
# Red #
exp(Red) / (1 + exp(Green) + exp(Red) + exp(Yellow))
# Yellow #
exp(Yellow) / (1 + exp(Green) + exp(Red) + exp(Yellow))
# Blue (Reference Category) #
1 / (1 + exp(Green) + exp(Red) + exp(Yellow))
Which produces the following outputs:
[1] 6.867167e-09
[1] 0.333366
[1] 1.715581e-09
[1] 0.666634
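Because “Blue” absorbs the remaining probability mass, the four probabilities should sum to exactly 1. A quick sanity check in R, using the log-odds values computed above:

```r
# Log-odds for observation 1, from the SPSS equations above
Green  <- -18.391
Yellow <- -19.778
Red    <- -0.693

denom <- 1 + exp(Green) + exp(Red) + exp(Yellow)
probs <- c(Green  = exp(Green)  / denom,
           Red    = exp(Red)    / denom,
           Yellow = exp(Yellow) / denom,
           Blue   = 1           / denom)  # reference category
sum(probs)  # equals 1
```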
Interpretation:
Each output value represents the probability of occurrence of one category of the dependent variable, with “Blue” serving as the reference category.
P(Green) = 6.867167e-09
P(Yellow) = 1.715581e-09
P(Red) = 0.333366
P(Blue) = 0.666634
Therefore, in the case of the first observation of our example data set, we can conclude that the reference category, “Blue”, has the highest predicted probability of occurrence.
The predicted values, as a result of the “Save” option, have been output into a column within the original data set.
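The “Predicted category” value that SPSS saves is simply the category carrying the largest of the four probabilities. A minimal sketch of that selection in R, using the probabilities calculated above:

```r
# Probabilities for observation 1, as computed above
probs <- c(Green  = 6.867167e-09,
           Yellow = 1.715581e-09,
           Red    = 0.333366,
           Blue   = 0.666634)

# The predicted category is the one with the highest probability
names(probs)[which.max(probs)]  # "Blue"
```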
Example (R):
If we wanted to repeat our analysis in R, we could do so with the following code:
# (Requires the package "nnet" to be installed and loaded) #
# Multinomial Logistic Regression #
color <- c("Red", "Blue", "Green", "Blue", "Blue", "Blue", "Green", "Green", "Green", "Yellow")
gender <- c("Female", "Female", "Male", "Female", "Female", "Male", "Male", "Male", "Female", "Male")
smoker <- c("No", "No", "Yes", "No", "No", "No", "No", "No", "Yes", "No")
car <- c("Chevy", "Chevy", "Ford", "Ford", "Chevy", "KIA", "Ford", "KIA", "Ford", "Ford")
color <- as.factor(color)
gender <- as.factor(gender)
smoker <- as.factor(smoker)
car <- as.factor(car)
testset <- data.frame(color, gender, smoker, car)
mlr <- multinom(color ~ gender + smoker + car, data = testset)
summary(mlr)
This produces the following output:
Call:
multinom(formula = color ~ gender + smoker + car, data = testset)
Coefficients:
(Intercept) genderMale smokerYes carFord carKIA
Green -40.2239699 36.73179 47.085203 21.36387 3.492186
Red -0.6931559 -17.00881 -3.891315 -20.23802 -11.832468
Yellow -41.0510233 37.33637 -10.943821 21.58634 -22.161372
Std. Errors:
(Intercept) genderMale smokerYes carFord carKIA
Green 0.4164966 4.164966e-01 7.125616e-14 3.642157e-01 6.388766e-01
Red 1.2247466 2.899257e-13 1.686282e-23 1.492263e-09 2.899257e-13
Yellow 0.3642157 3.642157e-01 6.870313e-26 3.642157e-01 1.119723e-12
Residual Deviance: 9.364263
AIC: 39.36426
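Note that summary() for a multinom fit reports coefficients and standard errors but no significance tests. If Wald z-statistics and p-values are wanted, they can be derived from the two tables. A sketch of this calculation (the model is refit here so the snippet stands alone; with a data set this small and nearly separable, the standard errors, and therefore the p-values, should be treated with caution):

```r
library(nnet)

# Example data from above
color  <- factor(c("Red", "Blue", "Green", "Blue", "Blue", "Blue", "Green", "Green", "Green", "Yellow"))
gender <- factor(c("Female", "Female", "Male", "Female", "Female", "Male", "Male", "Male", "Female", "Male"))
smoker <- factor(c("No", "No", "Yes", "No", "No", "No", "No", "No", "Yes", "No"))
car    <- factor(c("Chevy", "Chevy", "Ford", "Ford", "Chevy", "KIA", "Ford", "KIA", "Ford", "Ford"))
testset <- data.frame(color, gender, smoker, car)

mlr <- multinom(color ~ gender + smoker + car, data = testset, trace = FALSE)

# Wald z-statistic: coefficient divided by its standard error, per cell
z <- summary(mlr)$coefficients / summary(mlr)$standard.errors
# Two-sided p-values from the standard normal distribution
p <- 2 * pnorm(abs(z), lower.tail = FALSE)
z
p
```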
To test the model results, the code below can be used:
# Test Model #
# Gender : Male #
a <- 0
# Smoker : Yes #
b <- 0
# Car : Ford #
c <- 0
# Car : KIA #
d <- 0
Green <- -40.2239699 + (a * 36.73179) + (b * 47.085203) + (c * 21.36387) + (d * 3.492186)
Red <- -0.6931559 + (a * -17.00881) + (b * -3.891315) + (c * -20.23802) + (d * -11.832468)
Yellow <- -41.0510233 + (a * 37.33637) + (b * -10.943821) + (c * 21.58634) + (d * -22.161372)
# Green #
exp(Green) / (1 + exp(Green) + exp(Red) + exp(Yellow))
# Red #
exp(Red) / (1 + exp(Green) + exp(Red) + exp(Yellow))
# Yellow #
exp(Yellow) / (1 + exp(Green) + exp(Red) + exp(Yellow))
# Blue (Reference Category) #
1 / (1 + exp(Green) + exp(Red) + exp(Yellow))
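Rather than rebuilding the equations by hand, the fitted nnet model can generate the same probabilities and class predictions directly through predict(). A sketch (the data and model are rebuilt from the example above so the snippet stands alone):

```r
library(nnet)

# Example data from above
color  <- factor(c("Red", "Blue", "Green", "Blue", "Blue", "Blue", "Green", "Green", "Green", "Yellow"))
gender <- factor(c("Female", "Female", "Male", "Female", "Female", "Male", "Male", "Male", "Female", "Male"))
smoker <- factor(c("No", "No", "Yes", "No", "No", "No", "No", "No", "Yes", "No"))
car    <- factor(c("Chevy", "Chevy", "Ford", "Ford", "Chevy", "KIA", "Ford", "KIA", "Ford", "Ford"))
testset <- data.frame(color, gender, smoker, car)
mlr <- multinom(color ~ gender + smoker + car, data = testset, trace = FALSE)

# Per-observation probability of each color (one column per category)
probs <- predict(mlr, newdata = testset, type = "probs")
# Most likely color for each observation
pred <- predict(mlr, newdata = testset, type = "class")
```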
NOTE: The model’s internal coefficients differ depending on the platform used to generate the analysis, as each platform may select different reference categories. Though the model predictions do not differ, I would recommend, if publishing findings, utilizing SPSS in lieu of R. The rationale pertains to the auditing record that SPSS possesses: if data output contains abnormalities, R, being open source, cannot be held to account. Additionally, as the multinomial function within R exists as part of an external package, platform computational errors could have a greater likelihood of occurrence.