Reflections of a Data Scientist: Two Step Cluster (SPSS)

TwoStep Cluster Analysis is a procedure that uses an algorithm to group observations within a single data set based on similarities across a selected set of variables. The procedure is specific to SPSS, and you are unlikely to find it discussed in general statistics textbooks. Because the implementation that creates the output is proprietary, the results cannot easily be reproduced by hand or verified in other software. For this reason, I would recommend that this method be used sparingly and only in certain situations. Additionally, model output should always be provided along with the SPSS syntax (code) and data used to produce it.

Example:

To run a TwoStep Cluster analysis within SPSS, first choose “Analyze” from the top menu bar. Then choose “Classify”, and then choose “TwoStep Cluster”.

You will be presented with a dialog which contains the following options:

For our example model, we will include the categorical variables “Cat_Var1” and “Cat_Var2”, along with the continuous variable “Cont_Var1”. For “Distance Measure”, if your model does not contain categorical variables, and if you wish to manually specify the “Number of Clusters” which the model will include for analysis, it is best to change the option “Log-likelihood” to “Euclidean”. Since our example includes categorical variables, we will leave “Log-likelihood” selected.
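To see why “Euclidean” only suits models without categorical variables, it helps to look at what the measure computes. The following is a minimal Python sketch of a plain Euclidean distance between two observations. It is not SPSS code, and the function name and sample values are hypothetical. The calculation subtracts one value from another, which has no meaning for category labels.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two records of continuous values."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical observations, each holding two continuous values.
record_a = (3.0, 4.0)
record_b = (0.0, 0.0)
print(euclidean_distance(record_a, record_b))  # 5.0
```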

TwoStep Cluster Analysis (Menu Explanation)

This menu establishes the parameters that the model will adhere to upon its creation. If “Determine automatically” is selected, the algorithm will analyze the selected variables and produce the number of groupings that it determines to be most appropriate for the data.

If “Specify fixed” is selected, the algorithm will attempt to create the number of groupings specified by the user. This forced number of groupings will be used for the creation of the model output. (Reminder: If your model consists of only continuous variables, and you would like to specify the number of groupings, it is best to change “Distance Measure” to “Euclidean”.)

Example (cont.)

If we were to continue with our example and select “Output” from the above menu, we would be presented with the following:

Let’s select “Cont_Var2” from our “Variables” list. This will move the selection to the “Evaluation Fields” box. Selecting “Create cluster membership variable” beneath the “Working Data File” header will add a column to the data set, after the model output is produced, which records the cluster assigned to each observation.

Clicking on “Continue” from this menu, and “OK” from the prior menu, will provide the following output:

Model Summary (Explanation)

Model Summary

Algorithm – This cell lists the algorithm used to create the model.

Inputs – This cell lists the number of input variables used to create the model.

Clusters – This cell lists the number of clusters produced by the clustering algorithm.

Cluster Quality

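This output illustrates the overall quality of the model: how tightly grouped the observations within each cluster are, and how well separated the clusters are from one another.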
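SPSS labels this chart as a silhouette measure of cohesion and separation. Since the SPSS calculation cannot be inspected, the Python sketch below uses the textbook silhouette definition with Euclidean distance, which is an assumption on my part and may not match what SPSS computes internally. The function name and sample data are hypothetical. Values near 1 indicate tight, well-separated clusters, while values near or below 0 indicate overlapping clusters.

```python
import math

def silhouette_score(points, labels):
    """Mean textbook silhouette over all points, using Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widest = max(a, b)
        scores.append((b - a) / widest if widest > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical data: two tight groups that sit far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [1, 1, 2, 2]))
```

Assigning the same points to clusters that cut across the two groups (for example, labels 1, 2, 1, 2) produces a negative score, which is the pattern a poor model would show.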

(Double clicking on the TwoStep Cluster output provides the following illustration)

The above output is a graphical illustration of the clusters which, combined, represent the model in its entirety.

Chart (Explanation)

Size of Smallest Cluster – This is the number of observations contained within the smallest cluster. To the right of this value is the percentage of all observations in the model which the cluster represents.

Size of Largest Cluster – This is the number of observations contained within the largest cluster. To the right of this value is the percentage of all observations in the model which the cluster represents.

Ratio of Sizes: Largest Cluster to Smallest Cluster – This value is the ratio produced when the size of the largest cluster is divided by the size of the smallest cluster. As a rule of thumb, the value of this ratio should be no greater than 2.
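The three figures in this chart can be checked by hand from the cluster membership column. Below is a minimal Python sketch of that arithmetic. The function name and the membership values are hypothetical and are not taken from the example model.

```python
from collections import Counter

def cluster_size_summary(membership):
    """Smallest and largest cluster sizes, their percentages, and their ratio."""
    counts = Counter(membership)
    total = len(membership)
    smallest = min(counts.values())
    largest = max(counts.values())
    return {
        "smallest": smallest,
        "smallest_pct": 100.0 * smallest / total,
        "largest": largest,
        "largest_pct": 100.0 * largest / total,
        "ratio": largest / smallest,
    }

# Hypothetical membership column: 40, 35 and 25 observations in three clusters.
membership = [1] * 40 + [2] * 35 + [3] * 25
summary = cluster_size_summary(membership)
print(summary["smallest"], summary["largest"], summary["ratio"])  # 25 40 1.6
```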

If you change the “View” in the menu below the graphical output from “Cluster Sizes” to “Predictor Importance”, you will be presented with the following graphic:

If you change the “View” in the menu below the model summary from “Model Summary” to “Clusters”, you will be presented with the following table:

Within this table, we are presented with the following:

Cluster – The numerical identifier assigned to each cluster.

Label – There is no default label provided. However, if you would like to create a label for cluster “1”, this field enables you to do so.

Description – There is no default description provided. If you would like to create a description for cluster “1”, this field enables you to do so.

Size – The size of each cluster, shown as a percentage of the total number of observations contained within the model, followed by the number of observations within the cluster in parentheses.

Inputs

Listed in the order of predictive importance are the variables which make up each cluster. If you hover your mouse above any cell, a box will appear which contains a key pertaining to what is represented within the cell.

If a variable is categorical, its most frequent category is listed along with the frequency of its occurrence within the group.

If a variable is continuous, its mean value is listed instead.
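The cell contents described above can be reproduced for a single cluster with a few lines of Python. This is a minimal sketch rather than SPSS code. The function name and the sample rows are hypothetical, and only the variable names are borrowed from our example.

```python
from collections import Counter
from statistics import mean

def profile_cluster(rows, categorical, continuous):
    """Summarize one cluster: modal category (with its share) or mean value."""
    profile = {}
    for name in categorical:
        value, count = Counter(row[name] for row in rows).most_common(1)[0]
        # Most frequent category and the percentage of the cluster it covers.
        profile[name] = (value, 100.0 * count / len(rows))
    for name in continuous:
        profile[name] = mean(row[name] for row in rows)
    return profile

# Hypothetical observations that all belong to the same cluster.
cluster_rows = [
    {"Cat_Var1": "A", "Cont_Var1": 10.0},
    {"Cat_Var1": "A", "Cont_Var1": 20.0},
    {"Cat_Var1": "B", "Cont_Var1": 30.0},
    {"Cat_Var1": "A", "Cont_Var1": 40.0},
]
print(profile_cluster(cluster_rows, ["Cat_Var1"], ["Cont_Var1"]))
# {'Cat_Var1': ('A', 75.0), 'Cont_Var1': 25.0}
```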

You may recall that, at the beginning of this exercise, we selected “Cont_Var2” as our “Evaluation Field”. These next steps will demonstrate what this accomplished.

On the bottom right side of the menu bar which is displayed beneath the table graphic, there is a button that reads “Display”. Click this button and then select the option “Evaluation Fields”. This should populate “Cont_Var2” within the “Fields” box. Click “OK” after this variable appears.

This adds a bottom row to the table which contains the previously selected variable.

This variable was not included in the creation of the model; however, its values are displayed as if it were part of the model. This allows for the comparison of non-model variables to the clusters created by the algorithm.

Returning to the initial data set, you will see that an additional column has been created.

This column indicates the cluster to which each observation belongs. This data can be utilized to graph the findings of the model, a topic which will be discussed in our next entry.
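The same kind of comparison can be made outside of SPSS once each observation has a cluster assignment. Below is a minimal Python sketch which averages a non-model variable within each cluster. The function name and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def evaluation_means(membership, values):
    """Mean of a non-model variable within each cluster."""
    grouped = defaultdict(list)
    for cluster, value in zip(membership, values):
        grouped[cluster].append(value)
    return {cluster: mean(vals) for cluster, vals in sorted(grouped.items())}

# Hypothetical cluster assignments and "Cont_Var2" values for four observations.
membership = [1, 1, 2, 2]
cont_var2 = [2.0, 4.0, 10.0, 20.0]
print(evaluation_means(membership, cont_var2))  # {1: 3.0, 2: 15.0}
```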

Monday, January 29, 2018