Reflections of a Data Scientist: Discriminant Analysis (SPSS)

Discriminant Analysis is, in summary, a less robust alternative to Binary Logistical Analysis (binary logistic regression). In every case I would recommend utilizing Binary Logistical Analysis in lieu of Discriminant Analysis. However, since this is a website dedicated to all things statistical, we will briefly cover this topic.
Discriminant Analysis is a very sensitive modeling methodology: outliers and unequal group sizes can distort its results. Additionally, there are several assumptions that must be checked prior to application.

These assumptions* are:

Multivariate Normality - Independent variables are normal for each level of the grouping variable.

Homogeneity of Variance - Variances and covariances of the independent variables are the same across the levels of the grouping variable.

Multicollinearity - Predictive power can decrease with an increased correlation between predictor variables.

Independence - Participants are assumed to be randomly sampled, and a participant’s score on one variable is assumed to be independent of scores on that variable for all other participants.
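
Two of these assumptions can be screened quickly before running the analysis. The following is a minimal Python sketch, separate from the SPSS workflow, that compares the per-group variance of one predictor and the correlation between two predictors. The data values below are made up purely for illustration and are not the sample data set used in this article.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical illustration data: (group, age, smoking) per participant.
rows = [
    (0, 34, 0), (0, 41, 0), (0, 29, 1), (0, 52, 0), (0, 47, 1),
    (1, 58, 1), (1, 63, 1), (1, 49, 0), (1, 71, 1), (1, 66, 1),
]

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread

# Homogeneity of variance: compare the sample variance of age in each group.
age_var = {g: variance([age for grp, age, _ in rows if grp == g]) for g in (0, 1)}
variance_ratio = max(age_var.values()) / min(age_var.values())

# Multicollinearity: correlation between two predictors across all cases.
r_age_smoking = pearson([r[1] for r in rows], [r[2] for r in rows])

print(f"Age variance by group: {age_var}")
print(f"Largest/smallest variance ratio: {variance_ratio:.2f}")
print(f"Correlation(age, smoking): {r_age_smoking:.2f}")
```

A variance ratio far from 1, or a correlation close to 1 or -1, suggests that the corresponding assumption deserves a closer look before the model is trusted.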

Example:

We’ll begin with a familiar sample data set:

From the “Analyze” menu, select “Classify”, then select “Discriminant”.

The following menu should appear. Using the topmost middle arrow, select “Cancer” as the “Grouping Variable”. Using the center arrow, select “Age”, “Obese” and “Smoking” as “Independents”.

Click on the “Define Range” button to open the following sub-menu. Since “Cancer” is a binary variable, we will set “Minimum” to “0”, and “Maximum” to “1”.

(Note: “1” indicates “Cancer”, and “0” indicates “No Cancer Detected”)

After clicking “Statistics”, check the box adjacent to “Unstandardized”.

Clicking on “Save” will open the menu below. Check the options labeled “Predicted group membership” and “Probabilities of group membership”.

Once this has been completed, click “OK”.

The following output should be generated:

Wilks’ Lambda – Two useful values are provided within this table. The first is the Wilks’ Lambda value. This value is similar to the coefficient of determination; however, it is interpreted in an inverse manner: a value of 0 would indicate perfect separation between the groups, and a value of 1 would indicate none. Therefore, if you would like an equivalent r-squared value for interpretation only, you can subtract the Wilks’ Lambda value from 1. The second value worth noting is the Chi-square significance: “Sig”. This value indicates the significance of the Wilks’ Lambda. If the p-value is less than .05, we can conclude that the derived model separates the groups to a statistically significant degree.
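
As a small worked example of this interpretation, the Python sketch below converts a Wilks’ Lambda value into its r-squared-style complement and into a chi-square statistic. It assumes the Chi-square column is based on Bartlett’s approximation, which is an assumption on my part rather than something stated in the output. The Lambda value, sample size and function name are placeholders, not values taken from this analysis.

```python
from math import log

def wilks_summary(wilks_lambda, n_cases, n_predictors, n_groups):
    """Summarize a Wilks' Lambda value for a discriminant model.

    Returns (explained, chi_square, df), where chi_square uses
    Bartlett's approximation: -(N - 1 - (p + g) / 2) * ln(Lambda).
    """
    explained = 1.0 - wilks_lambda  # r-squared-style complement
    chi_square = -(n_cases - 1 - (n_predictors + n_groups) / 2) * log(wilks_lambda)
    df = n_predictors * (n_groups - 1)
    return explained, chi_square, df

# Placeholder inputs: Lambda = 0.80, 100 cases, 3 predictors, 2 groups.
explained, chi_square, df = wilks_summary(0.80, 100, 3, 2)
print(f"1 - Lambda: {explained:.2f}")
print(f"Chi-square: {chi_square:.3f} on {df} degrees of freedom")
```

The chi-square statistic and its degrees of freedom are what the “Sig” value is derived from; the smaller the Lambda, the larger the statistic.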

Canonical Discriminant Function Coefficients – The values presented in the above table are the components of the predictive model. If we were to construct the model as an equation, it would resemble:

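D = (Age * .026) + (Obese * -.347) + (Smoking * 2.418) - 2.263

D is the discriminant score for a case. Unlike the output of a Binary Logistical Analysis model, it is not a logit, so passing it directly to the R function “plogis” will not, in general, reproduce the probability of a positive outcome; that probability also depends on the group centroids and the prior probabilities. The probabilities that SPSS saves to the data sheet, reviewed at the end of this article, already account for both. For more information pertaining to logits and “plogis”, please consult the article related to Binary Logistical Analysis that was previously featured on this blog.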
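
To make the equation concrete, here is a minimal Python sketch that evaluates it for a single case using the coefficients from the table. The second function shows one way a two-group discriminant score maps to a probability, assuming equal covariance within groups and a score scaled to unit within-group variance. The example case, the two centroid values and the equal priors are hypothetical placeholders; the actual centroids are listed in the “Functions at Group Centroids” table of the SPSS output.

```python
from math import exp, log

# Coefficients from the Canonical Discriminant Function Coefficients table.
COEFFICIENTS = {"Age": 0.026, "Obese": -0.347, "Smoking": 2.418}
CONSTANT = -2.263

def discriminant_score(case):
    """Evaluate the unstandardized discriminant function for one case."""
    return CONSTANT + sum(COEFFICIENTS[name] * case[name] for name in COEFFICIENTS)

def posterior_probability(score, centroid_0, centroid_1, prior_1=0.5):
    """Probability of group 1 given a score (two-group case).

    Assumes scores are normal within each group with variance 1,
    centered on the supplied (hypothetical) group centroids.
    """
    log_odds = (
        (centroid_1 - centroid_0) * score
        - (centroid_1 ** 2 - centroid_0 ** 2) / 2
        + log(prior_1 / (1 - prior_1))
    )
    return 1 / (1 + exp(-log_odds))

# Hypothetical case: a 55-year-old, non-obese smoker.
case = {"Age": 55, "Obese": 0, "Smoking": 1}
score = discriminant_score(case)

# Hypothetical centroids; replace with the values from the SPSS output.
prob = posterior_probability(score, centroid_0=-0.5, centroid_1=0.5)
print(f"Discriminant score: {score:.3f}")
print(f"Probability of group 1: {prob:.3f}")
```

With real centroids and priors in place of the placeholders, the resulting probability should be comparable to the probability columns that SPSS writes to the data sheet.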

The final output that we will review is the output that was produced within the original data sheet.

We are presented with three new variables. “Dis_1” represents the model’s predicted outcome (1 or 0) given the independent variable data. “Dis2_2” represents the probability (.00 – 1.00) of a positive outcome occurring, and “Dis2_1” represents the probability of a negative outcome occurring.

That’s all for now. Stay subscribed for more analytics and articles. Until next time, Data Heads!

*- https://en.wikipedia.org/wiki/Discriminant_function_analysis


Wednesday, March 14, 2018
