The MICE package, is a package which assists with performing analysis on shoddily assembled data frames.
In the world of data science, the real world, not the YouTube world, or the classroom world, data often comes down in a less than optimal state. In most cases, this is more the reality of the matter.
Now, it would easy to throw up your hands and say, “I CAN’T PERFORM ANY SORT OF ANALYSIS WITH ALL OF THESE MISSING VARIABLES”,
~OR~
Unfortunately, for you, the data scientist, whoever passed you this data expects a product and not your excuses.
Fortunately, for all of us, there is a way forward.
Example:
Let’s say that you were given this small data set for analysis:
The data is provided in an .xls format, because why wouldn’t it be?
For the sake of not having you download an example data file, I have re-coded this data into the R format.
# Create Data Frame: "SheetB" #
VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, NA , 1, NA, 0, 0, 0, 0)
VarB <- c(20, 16, 20, 4, NA, NA, 13, 6, 2, 18, 12, NA, 13, 9, 14, 18, 6, NA, 5, 2)
VarC <- c(2, NA, 1, 1, NA, 2, 3, 1, 2, NA, 3, 4, 4, NA, 4, 3, 1, 2, 3, NA)
VarD <- c(70, 80, NA, 87, 79, 60, 61, 75, NA, 67, 62, 93, NA, 80, 91, 51, NA, 33, NA, 50)
VarE <- c(980, 800, 983, 925, 821, NA, NA, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)
SheetB <- data.frame(VarA, VarB, VarC, VarD, VarE)
If you would like to see a version of the initial example file with the missing values, the code to create this data frame is below:
# Create Data Frame: "SheetA" #
VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
VarB <- c(20, 16, 20, 4, 8, 17, 13, 6, 2, 18, 12, 17, 13, 9, 14, 18, 6, 13, 5, 2)
VarC <- c(2, 3, 1, 1, 1, 2, 3, 1, 2, 1, 3, 4, 4, 1, 4, 3, 1, 2, 3, 1)
VarD <- c(70, 80, 90, 87, 79, 60, 61, 75, 92, 67, 62, 93, 74, 80, 91, 51, 64, 33, 77, 50)
VarE <- c(980, 800, 983, 925, 821, 978, 881, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)
SheetA <- data.frame(VarA, VarB, VarC, VarD, VarE)
In our example, we’ll assume that the sheet which contains all values is unavailable to you (“SheetA”). Therefore, to perform any sort of meaningful analysis, you will need to either delete all observations which contain missing data variables (DON’T DO IT!), or, run an imputation function.
We will opt to do the latter, and the function which we will utilize, is the mice() function.
First, we will initialize the appropriate library:
# Initalitze Library #
library(mice)
Next, we will perform the imputation function contained within the library.
# Perform Imputation #
SheetB_Imputed <- mice(SheetB, m=1, maxit = 50, method = 'pmm', seed = 500)
SheetB: is the data frame which is being called by the function.
m = 1: This is the number of data frame imputation variations which will be generated as a result of the mice function. One is all that is necessary.
maxit: This is the number of max iterations which will occur as the mice function calculates what it determines to be the optimal value of each missing variable cell.
method: Is the method which will be utilized to perform this function.
seed: The mice() function partially relies on randomness to generate missing variable values. The seed value can be whatever value you determine to be appropriate.
After performing the above function, you should be greeted with the output below:
iter imp variable
1 1 VarA VarB VarC VarD VarE
2 1 VarA VarB VarC VarD VarE
3 1 VarA VarB VarC VarD VarE
4 1 VarA VarB VarC VarD VarE
5 1 VarA VarB VarC VarD VarE
6 1 VarA VarB VarC VarD VarE
7 1 VarA VarB VarC VarD VarE
8 1 VarA VarB VarC VarD VarE
9 1 VarA VarB VarC VarD VarE
10 1 VarA VarB VarC VarD VarE
11 1 VarA VarB VarC VarD VarE
12 1 VarA VarB VarC VarD VarE
13 1 VarA VarB VarC VarD VarE
14 1 VarA VarB VarC VarD VarE
15 1 VarA VarB VarC VarD VarE
16 1 VarA VarB VarC VarD VarE
17 1 VarA VarB VarC VarD VarE
18 1 VarA VarB VarC VarD VarE
19 1 VarA VarB VarC VarD VarE
20 1 VarA VarB VarC VarD VarE
21 1 VarA VarB VarC VarD VarE
22 1 VarA VarB VarC VarD VarE
23 1 VarA VarB VarC VarD VarE
24 1 VarA VarB VarC VarD VarE
25 1 VarA VarB VarC VarD VarE
26 1 VarA VarB VarC VarD VarE
27 1 VarA VarB VarC VarD VarE
28 1 VarA VarB VarC VarD VarE
29 1 VarA VarB VarC VarD VarE
30 1 VarA VarB VarC VarD VarE
31 1 VarA VarB VarC VarD VarE
32 1 VarA VarB VarC VarD VarE
33 1 VarA VarB VarC VarD VarE
34 1 VarA VarB VarC VarD VarE
35 1 VarA VarB VarC VarD VarE
36 1 VarA VarB VarC VarD VarE
37 1 VarA VarB VarC VarD VarE
38 1 VarA VarB VarC VarD VarE
39 1 VarA VarB VarC VarD VarE
40 1 VarA VarB VarC VarD VarE
41 1 VarA VarB VarC VarD VarE
42 1 VarA VarB VarC VarD VarE
43 1 VarA VarB VarC VarD VarE
44 1 VarA VarB VarC VarD VarE
45 1 VarA VarB VarC VarD VarE
46 1 VarA VarB VarC VarD VarE
47 1 VarA VarB VarC VarD VarE
48 1 VarA VarB VarC VarD VarE
49 1 VarA VarB VarC VarD VarE
50 1 VarA VarB VarC VarD VarE
The output is informing you that the iteration was performed a total of 50 times on one single set.
The code below assigns all the initial values from the original set, with newly estimated values, which now occupy the variable cells which were previously blank.
# Assign Original Values with Imputations to Data Frame #
SheetB_Imputed_Complete <- complete(SheetB_Imputed)
The outcome should resemble something like:
A quick warning, the mice() function cannot be utilized on data frames which contain unencoded categorical variable entries.
An example of this:
To get mice() to work correctly on this data set, you must recode "VARC" prior to proceeding. You could do this by changing each instance of "Spade" to 1, "Club" to 2, “Diamond" to 3, and "Heart" to 4.
For more information as it relates to this function, please check out this link.
That’s all for now, internet.
-RD
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.