Reflections of a Data Scientist: (R) Analysis of Variance

In this article, we will discuss ANOVA: specifically, when its use is appropriate and how it can be applied within R. This will likely be the final article of the current series of entries pertaining to The R Programming Language. Subsequent articles will discuss concepts and usage of software within the SPSS platform.

ANOVA is an abbreviation that represents a method known as The Analysis of Variance.

There are a few terms that are specific to ANOVA. Those are:

Way – Which refers to an independent variable within the ANOVA model.

Factor – Another term which refers to an independent variable.

Level – The category of an independent variable within the ANOVA model.

ANOVA is used to compare the means of several sample groups by analyzing the variance between the groups relative to the variance within them. In many ways it is similar to a t-test. However, ANOVA allows for multiple group comparisons, whereas the t-test only allows one single group to be compared to another single group.
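To make the link to the t-test concrete, here is a minimal sketch that runs both tests on two groups only. The ratings are borrowed from the low-salt and high-salt groups of the soup example later in this article, and the variable names (t_result, aov_table, and so on) are just illustrative labels. It assumes the pooled-variance form of the t-test (var.equal = TRUE), since that is the version that lines up with ANOVA.

```r
# Ratings borrowed from the low-salt and high-salt groups of the soup example
low  <- c(4, 1, 8)
high <- c(3, 2, 5)

rating <- c(low, high)
group  <- factor(c(rep("low", 3), rep("high", 3)))

# Two-sample t-test with pooled variance
t_result <- t.test(low, high, var.equal = TRUE)

# One-way ANOVA on the same two groups; [[1]] extracts the ANOVA table
aov_table <- summary(aov(rating ~ group))[[1]]

t_squared <- unname(t_result$statistic)^2
f_value   <- aov_table[["F value"]][1]

# With exactly two groups, F equals t squared and the p-values agree
all.equal(t_squared, f_value)
all.equal(t_result$p.value, aov_table[["Pr(>F)"]][1])
```

Both comparisons should come back TRUE, which is one way to see that ANOVA generalizes the two-group case rather than replacing it.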

A post-hoc test is often performed after an ANOVA has been calculated. It is used to investigate which sample groups differ from one another, and it is typically run when the ANOVA returns a significant result. We will discuss this topic in a different article.

Like the t-test, there are different variations of the ANOVA model, and which one applies depends on the data being analyzed. We will review three common ANOVA applications as they pertain to various data types. The model output is analyzed by means of the F-Test. For a detailed description of the F-Test, and what conclusions it provides, please refer to the previous article.

One Way ANOVA

As a reminder, Way, in this scenario, is referring to a single independent variable.

In a one way ANOVA, we are assuming the following:

1. Each sample is random.
2. Each sample is in no way influenced by the other sampling results.
3. Each dependent variable is sampled from a normally distributed population.
4. The variances of the samples should be equal, or approximately equal. The reason is that the population variances are assumed to be equal for each sample.

The hypothesis for this model type will be:

H0: u1 = u2 = u3 =…..etc.

H1: Not all means are equal.

Example Problem:

A chef wants to test whether patrons' preference for a soup he prepares depends on its salt content. He sets up a limited experiment in which he creates three types of soup: soup with a low amount of salt, soup with a medium amount of salt, and soup with a high amount of salt. He then serves this soup to his customers and asks them to rate their satisfaction on a scale from 1 to 8.

Low Salt Soup is rated: 4, 1, 8
Medium Salt Soup is rated: 4, 5, 3, 5
High Salt Soup is rated: 3, 2, 5

Hypothesis:

H0: u1 = u2 = u3 (Mean satisfaction is the same for all three salt levels.)

H1: Not all means are equal.

Let’s use this data to create a model within R:

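satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5) # Ratings in the order: low, medium, high

salt <- c(rep("low",3), rep("med",4), rep("high",3)) # Group label for each rating

salttest <- data.frame(satisfaction, salt)

results <- aov(satisfaction~salt, data=salttest) # One way ANOVA: satisfaction explained by salt level

summary(results)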

This produces the output:

            Df Sum Sq Mean Sq F value Pr(>F)
salt         2   1.92   0.958   0.209  0.816
Residuals    7  32.08   4.583
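As a rough sketch of where the numbers in this table come from, the sums of squares and the F value can be rebuilt by hand from the same ratings. The names used below (grand_mean, ss_between, ss_within, and so on) are my own labels for the pieces, not anything that aov returns.

```r
satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)
salt <- factor(c(rep("low", 3), rep("med", 4), rep("high", 3)))

grand_mean  <- mean(satisfaction)
group_means <- tapply(satisfaction, salt, mean)
group_sizes <- tapply(satisfaction, salt, length)

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: how far each rating sits from its own group mean
ss_within <- sum((satisfaction - ave(satisfaction, salt))^2)

df_between <- nlevels(salt) - 1                   # 3 groups - 1 = 2
df_within  <- length(satisfaction) - nlevels(salt) # 10 ratings - 3 groups = 7

# F is the ratio of the two mean squares
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(ss_between, ss_within, f_value, p_value), 3)
```

The rounded values should line up with the Sum Sq, F value and Pr(>F) columns of the summary table.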

If p < .05, we will reject the null hypothesis.

Hypothesis: 0.816 > .05

Since the model’s p-value (.816) is greater than the assumed alpha (.05), we fail to reject the null hypothesis. This indicates that, at the 95% confidence level, the data provided do not show a significant difference in customer satisfaction as it pertains to salt content in soup.
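The assumptions listed at the start of this section can be given a quick check in R. The sketch below is one possible approach, not the only one: it assumes a Shapiro-Wilk test on the model residuals for normality and Bartlett's test for equal variances, both of which ship with base R. With only ten ratings, neither test has much power, so treat the results as a rough screen.

```r
satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)
salt <- factor(c(rep("low", 3), rep("med", 4), rep("high", 3)))

results <- aov(satisfaction ~ salt)

# Assumption 3: normality, checked here on the model residuals
normality <- shapiro.test(residuals(results))

# Assumption 4: equal variances across the three salt groups
variances <- bartlett.test(satisfaction ~ salt)

# A small p-value in either test would suggest the assumption is questionable
normality$p.value
variances$p.value
```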

Two Way ANOVA

Two way, in this scenario, is referring to the two independent variables which will be utilized within this ANOVA model.

The hypothesis for this model type will be:

1.

H0: u1 = u2 = u3 =…..etc. (All means are equal)

H1: Not all means are equal.

2.

H0: uVar1 = uVar2 (Var1’s value does not significantly differ from Var2’s value)

H1: uVar1 NE uVar2

3.

H0: An interaction is absent.

H1: An interaction is present.

Example Problem:

Researchers want to test how study habits within two schools relate to student life satisfaction. They also believe that the school each group of students attends may have an impact of its own. Students from each school are assigned study material that totals 1 hour, 2 hours, or 3 hours per day. After one month, the satisfaction of each student group is measured on a scale from 1 to 10.

School A:

1 Hour of Study Time: 7, 2, 10, 2, 2
2 Hours of Study Time: 9, 10, 3, 10, 8
3 Hours of Study Time: 3, 6, 4, 7, 1

School B:

1 Hour of Study Time: 8, 5, 1, 3, 10
2 Hours of Study Time: 7, 5, 6, 4, 10
3 Hours of Study Time: 5, 5, 2, 2, 2

Let’s state our hypotheses, as they apply to this problem:

1.

H0: u1 = u2 = u3 (Satisfaction levels DO NOT differ depending on hours of daily study.)

H1: Not all means are equal. (Satisfaction levels DO differ depending on hours of daily study.)

2.

H0: uSchoolA = uSchoolB (Satisfaction levels DO NOT significantly differ depending on school.)

H1: uSchoolA NE uSchoolB (Satisfaction levels DO significantly differ depending on school.)

3.

H0: An interaction is absent. (The combination of school and study time is NOT impacting the outcome)

H1: An interaction is present. (The combination of school and study time IS impacting the outcome)

Entering this into R can be tricky, but stay with me:

satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8, 7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2) # For each study time: School A's five ratings, then School B's

studytime <- c(rep("One Hour",10), rep("Two Hours",10), rep("Three Hours",10))

school <- c(rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5))

schooltest <- data.frame(satisfaction, studytime, school)

results <- aov(satisfaction ~ studytime * school, data=schooltest) # The * includes both main effects and their interaction

summary(results)

Which produces the output:

                 Df Sum Sq Mean Sq F value Pr(>F)
studytime         2   62.6  31.300   3.809 0.0366 *
school            1    2.7   2.700   0.329 0.5718
studytime:school  2    7.8   3.900   0.475 0.6278
Residuals        24  197.2   8.217
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since we have three hypothesis tests, we must assess all three of the p-values present within the output.

Study Time

p = 0.0366

School

p = 0.5718

Study Time : School

p = 0.6278

In investigating the output we can make the following conclusions:

Hypothesis 1: 0.0366 < .05

Hypothesis 2: 0.5718 > .05

Hypothesis 3: 0.6278 > .05

If p < .05, we will reject the null hypothesis.

Hypothesis 1: Reject

Hypothesis 2: Fail to Reject

Hypothesis 3: Fail to Reject

So we can state:

Students of different schools did not have significantly different satisfaction levels. There was a significant difference in satisfaction between the levels of study time. No interaction effect was present.

(Note: aov is designed for balanced designs, so each combination of school and study time should contain the same number of observations.)
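A short sketch for inspecting the data behind this model: it counts the observations in each school and study time combination, and then computes the cell and group means. The object names (cell_counts, cell_means, and so on) are only illustrative.

```r
satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10,
                  9, 10, 3, 10, 8, 7, 5, 6, 4, 10,
                  3, 6, 4, 7, 1, 5, 5, 2, 2, 2)

hours <- c("One Hour", "Two Hours", "Three Hours")
studytime <- factor(rep(hours, each = 10), levels = hours)
school <- factor(rep(rep(c("SchoolA", "SchoolB"), each = 5), times = 3))

# Number of observations per cell; equal counts mean the design is balanced
cell_counts <- table(studytime, school)

# Mean satisfaction for each school and study time combination
cell_means <- tapply(satisfaction, list(studytime, school), mean)

# Mean satisfaction for each factor on its own
studytime_means <- tapply(satisfaction, studytime, mean) # 5.0, 7.2, 3.7
school_means    <- tapply(satisfaction, school, mean)    # 5.6, 5.0

cell_counts
cell_means
```

The means show where the study time effect comes from, but they do not say which pairs of levels differ significantly. That question belongs to post-hoc testing.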

Repeated-Measures ANOVA

A repeated measures ANOVA is similar to a paired t-test in that it samples from the same set more than once. This model contains one factor with at least two levels, and the levels are dependent.

Example Problem:

Researchers want to test the impact of reading existential philosophy on a group of 8 individuals. They measure the happiness of the participants three times, once prior to reading, once after reading the materials for one week, and once after reading the materials for two weeks. We will assume an alpha of .05.

Before Reading = 1, 8, 2, 4, 4, 10, 2, 9
After Reading = 4, 2, 5, 4, 3, 4, 2, 1
After Reading (wk. 2) = 5, 10, 1, 1, 4, 6, 1, 8

Hypothesis:

H0: u1 = u2 = u3

H1: Not all means are equal.

Let’s use this data to create a model within R:

library(lme4) # Install this package first with install.packages("lme4")

happiness <- c(1, 8, 2, 4, 4, 10, 2, 9, 4, 2, 5, 4, 3, 4, 2, 1, 5, 10, 1, 1, 4, 6, 1, 8)

week <- c(rep("Before", 8), rep("Week1", 8), rep("Week2", 8))

id <- rep(1:8, times = 3) # Participants 1 through 8, repeated for each of the three measurements

survey <- data.frame(id, happiness, week)

model <- lmer(happiness ~ week + (1|id), data=survey) # (1|id) gives each participant their own baseline

anova(model)

Which produces the output:

Analysis of Variance Table
     Df Sum Sq Mean Sq F value
week  2 15.083  7.5417  1.0462

The F-Test statistic = 1.0462

To calculate the p-value of our test statistic, we can use the following R code:

pf(q=1.0462, df1=2, df2=14, lower.tail=FALSE) # Test statistic = 1.0462, numerator degrees of freedom = 2, denominator degrees of freedom = 14

Which produces the output:

[1] 0.3771816
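The denominator degrees of freedom (14) are not shown in the model output, so here is a sketch of where they come from. It rebuilds the repeated-measures F-Test by hand, assuming the classic decomposition in which the variation between participants is removed from the error term. The variable names are my own labels.

```r
happiness <- c(1, 8, 2, 4, 4, 10, 2, 9,
               4, 2, 5, 4, 3, 4, 2, 1,
               5, 10, 1, 1, 4, 6, 1, 8)
week <- factor(rep(c("Before", "Week1", "Week2"), each = 8))
id   <- factor(rep(1:8, times = 3))

n_subjects   <- nlevels(id)   # 8 participants
n_conditions <- nlevels(week) # 3 measurements each
grand_mean   <- mean(happiness)

ss_total   <- sum((happiness - grand_mean)^2)
ss_week    <- n_subjects * sum((tapply(happiness, week, mean) - grand_mean)^2)
ss_subject <- n_conditions * sum((tapply(happiness, id, mean) - grand_mean)^2)

# What is left after removing the week and participant effects
ss_error <- ss_total - ss_week - ss_subject

df_week  <- n_conditions - 1                     # 3 - 1 = 2
df_error <- (n_conditions - 1) * (n_subjects - 1) # 2 * 7 = 14

f_value <- (ss_week / df_week) / (ss_error / df_error)
p_value <- pf(f_value, df_week, df_error, lower.tail = FALSE)

round(c(ss_week, f_value, p_value), 4)
```

The F value and p-value computed this way should match the 1.0462 and 0.377 shown above.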

If p < .05, we will reject the null hypothesis.

Hypothesis: 0.3771816 > .05

With this information, we can conclude that the three conditions did not significantly differ pertaining to level of happiness.

A similar method can be used to perform this analysis with the nlme package:

library(nlme) # Provides lme(); lme4 is not needed for this method

happiness <- c(1, 8, 2, 4, 4, 10, 2, 9, 4, 2, 5, 4, 3, 4, 2, 1, 5, 10, 1, 1, 4, 6, 1, 8)

week <- c(rep("Before", 8), rep("Week1", 8), rep("Week2", 8))

id <- rep(1:8, times = 3) # Participants 1 through 8, repeated for each of the three measurements

survey <- data.frame(id, happiness, week)

model <- lme(happiness ~ week, random=~1|id, data=survey)

anova(model)

This method saves some time because its output includes the p-value directly:

            numDF denDF  F-value p-value
(Intercept)     1    14 37.21053  <.0001
week            2    14  1.04624  0.3772
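For comparison, base R can also fit this design without any extra packages. The sketch below assumes the classic aov approach with an Error term, which tells R that the measurements are grouped by participant. Note that id has to be a factor here; left as a number, it would be treated as a single numeric predictor.

```r
happiness <- c(1, 8, 2, 4, 4, 10, 2, 9,
               4, 2, 5, 4, 3, 4, 2, 1,
               5, 10, 1, 1, 4, 6, 1, 8)
week <- factor(rep(c("Before", "Week1", "Week2"), each = 8))
id   <- factor(rep(1:8, times = 3))

survey <- data.frame(id, happiness, week)

# Error(id) separates between-participant variation from the within-participant test
fit <- aov(happiness ~ week + Error(id), data = survey)

fit_summary <- summary(fit)
fit_summary

# The week effect is tested in the "Within" stratum
within_table <- fit_summary[["Error: Within"]][[1]]
within_table[["F value"]][1]
within_table[["Pr(>F)"]][1]
```

The week row in the Within stratum should agree with the F value of about 1.046 on 2 and 14 degrees of freedom reported by the two methods above.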

That is all for now, Data Heads. The topic of the next article will be Post-Hoc Analysis. Stay tuned!


Tuesday, December 5, 2017
