Reflections of a Data Scientist: (R) Chi-Square

Chi-Square is an often overlooked concept in statistics. It has many uses, as will be demonstrated in this article. The first essential aspect of understanding Chi-Square is to understand its pronunciation. Many would assume that the pronunciation is "C-HI", or "ChEE". Neither is correct; the proper pronunciation is "Kai". Next, let's examine how a Chi-Square distribution appears when graphed.

Above is a graphical representation of the chi-square distribution. It shows the probability density curves of several chi-square distributions, each with a different number of degrees of freedom.
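A comparable plot can be drawn in R. The following is a minimal sketch rather than the code behind the original image: the chosen degrees of freedom, the x-axis range, and the variable names are assumptions made for illustration.

```r
x <- seq(0.01, 8, by = 0.01)          # start just above 0, where the df = 1 density is unbounded
dfs <- c(1, 2, 3, 4, 6, 9)            # degrees of freedom to compare (arbitrary choice)

# One column of density values per degrees-of-freedom setting.
dens <- sapply(dfs, function(k) dchisq(x, df = k))

# Draw all curves on one set of axes, capping the y-axis so the df = 1 curve does not dominate.
matplot(x, dens, type = "l", lty = 1, col = seq_along(dfs),
        ylim = c(0, 0.5), xlab = "x", ylab = "Density",
        main = "Chi-square densities")
legend("topright", legend = paste("df =", dfs),
       col = seq_along(dfs), lty = 1)
```

As the degrees of freedom increase, the peak of the curve moves to the right and the curve becomes more symmetric.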

Things to remember about the Chi-Squared Distribution:

1. It is a continuous probability distribution.

2. It is related to the standard normal distribution.

3. The sum of the squares of k independent standard normal variables follows a chi-square distribution with k degrees of freedom. In a goodness-of-fit test, the degrees of freedom equal the number of categories minus one.
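The link to the standard normal distribution can be checked with a short simulation. This is an illustrative sketch only: the seed, the number of draws, and the variable names are arbitrary choices. It sums k squared standard normal draws and compares the result with what a chi-square distribution with k degrees of freedom predicts.

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
k <- 3                            # degrees of freedom to illustrate
n_draws <- 100000                 # number of simulated values

# Each row holds k independent standard normal draws.
z <- matrix(rnorm(n_draws * k), ncol = k)

# Sum of squares across each row: one simulated chi-square value per row.
sim <- rowSums(z^2)

# A chi-square distribution with k degrees of freedom has mean k and variance 2k.
mean(sim)
var(sim)

# Share of simulated values beyond the theoretical 95th percentile (should be near 0.05).
mean(sim > qchisq(0.95, df = k))
```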

The chi-squared distribution is utilized for goodness-of-fit tests, meaning that it is used to test a set of observed counts against the counts that a model predicts. This is undertaken in order to determine whether the model is an accurate predictor. The degrees of freedom, or the number of categories minus one (k - 1), determine the shape of the probability density curve. Alpha, or 1 minus the confidence level, determines the size of the rejection region. This region is defined as the rightmost area beneath the distribution curve. The chi-square value is the sum, over all categories, of (observed - expected)^2 / expected. Once derived, this value is compared against the critical value from a chi-square distribution table. The chi-square value, in conjunction with the degrees of freedom and the alpha value, ultimately determines whether the null hypothesis is rejected.
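Instead of a printed table, the critical value and the right-tail probability can be obtained directly in R with qchisq() and pchisq(). The sketch below assumes alpha = 0.05 and 6 degrees of freedom; the test statistic in it is a made-up number used only to show the decision rule.

```r
alpha <- 0.05                     # significance level (1 - confidence level)
df <- 6                           # e.g. seven categories minus one

# Critical value: the point with probability alpha to its right.
critical_value <- qchisq(1 - alpha, df = df)
critical_value                    # about 12.59

# Right-tail probability (p-value) for a given test statistic.
# The statistic here is a hypothetical value, not taken from any data set.
example_statistic <- 10
p_value <- pchisq(example_statistic, df = df, lower.tail = FALSE)

# Reject the null hypothesis only when the statistic falls in the rejection region.
example_statistic > critical_value
```

Comparing the statistic with the critical value and comparing the p-value with alpha always lead to the same decision.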

Example:

A small motel owner has created a model which he believes is an accurate predictor of the number of individuals who will stay at his establishment each day. He presents you with his findings:

Monday: 20
Tuesday: 28
Wednesday: 18
Thursday: 25
Friday: 16
Saturday: 22
Sunday: 26

The following week, you are tasked with keeping track of guests who rent rooms at the motel. Here are your findings:

Monday: 14
Tuesday: 25
Wednesday: 22
Thursday: 18
Friday: 16
Saturday: 24
Sunday: 30

Given your findings, and assuming a 95% confidence level, can we assume that the motel owner's model is an accurate predictor?

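Model <- c(20, 28, 18, 25, 16, 22, 26)    # owner's predicted guests, Monday to Sunday

Results <- c(14, 25, 22, 18, 16, 24, 30)  # guests actually counted

# The observed counts go first; the model supplies the expected proportions.
# rescale.p = TRUE converts the predicted counts into proportions that sum to 1.
chisq.test(Results, p = Model, rescale.p = TRUE)

Console Output:

Chi-squared test for given probabilities

data: Results
X-squared = 5.7582, df = 6, p-value = 0.4508

Findings:

Degrees of Freedom (df) - 6
Confidence Level (CL) - .95
Alpha (α) (1-CL) - .05
Chi Square Test Statistic - 5.7582

This creates the hypothesis test parameters:

H0 : The model is a good fit (Null Hypothesis).

HA: The model is not a good fit.

The critical value of 12.59 is found when consulting the chi-square distribution table. Since our chi-square value is less than this value (5.7582 < 12.59), we cannot reject the null hypothesis at the 5% significance level. The week's counts are consistent with the owner's model, although this alone does not prove that the model is accurate.

Cannot Reject: Null Hypothesis.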

Example:

The same small motel owner also created an additional model which he believes is an accurate predictor of individuals who will stay at his establishment. He presents you with his findings:

Monday: 10%
Tuesday: 5%
Wednesday: 20%
Thursday: 10%
Friday: 20%
Saturday: 30%
Sunday: 5%

(Predicted percentage of total individuals who will stay throughout the week)

The following week, you are tasked with keeping track of guests who rent rooms at the motel. Here are your findings:

Monday: 11
Tuesday: 25
Wednesday: 30
Thursday: 13
Friday: 23
Saturday: 17
Sunday: 8

(Actual number of individuals who stayed throughout the week)

Given your findings, and assuming a 95% confidence level, can we assume that the motel owner's model is an accurate predictor?

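Model <- c(.10, .05, .20, .10, .20, .30, .05)  # predicted share of guests, Monday to Sunday

Results <- c(11, 25, 30, 13, 23, 17, 8)        # guests actually counted

# The predicted shares already sum to 1, so no rescaling is needed.
chisq.test(Results, p = Model, rescale.p = FALSE)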

Console Output:

Chi-squared test for given probabilities

data: Results
X-squared = 68.184, df = 6, p-value = 9.634e-13

Findings:

Degrees of Freedom (df) - 6
Confidence Level (CL) - .95
Alpha (α) (1-CL) - .05
Chi-Square Test Statistic - 68.184

This creates the hypothesis test parameters:

H0 : The model is a good fit (Null Hypothesis).

HA: The model is not a good fit.

The critical value of 12.59 is found when consulting the chi-squared distribution table. Since our chi-square value is greater than this value (68.184 > 12.59), we reject the null hypothesis at the 5% significance level: the observed counts are not consistent with the owner's model.

Reject: Null Hypothesis.
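The test statistic reported in the console output can also be reproduced by hand, which makes it clearer what chisq.test() is doing. The following is a sketch using the numbers from this example; the variable names are chosen here for illustration.

```r
observed <- c(11, 25, 30, 13, 23, 17, 8)            # guests counted, Monday to Sunday
model_p  <- c(.10, .05, .20, .10, .20, .30, .05)    # owner's predicted shares

# Expected counts: total guests distributed according to the model.
expected <- sum(observed) * model_p

# Contribution of each day: (observed - expected)^2 / expected.
contribution <- (observed - expected)^2 / expected

# Pearson statistic: the sum of the per-day contributions.
statistic <- sum(contribution)

# Degrees of freedom: number of categories minus one.
df <- length(observed) - 1

# Right-tail probability under the chi-square distribution.
p_value <- pchisq(statistic, df = df, lower.tail = FALSE)

round(statistic, 3)      # 68.184, matching the chisq.test() output
round(contribution, 2)   # shows which days drive the result
```

Most of the statistic comes from Tuesday, where the model predicted 5% of guests and far more were observed.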

Example:

While working as a statistician at a local university, you are tasked with evaluating, based on survey data, the level of job satisfaction that staff members report for their occupational role. The data that you gather from the surveys is as follows:

Role                  Satisfied   Unsatisfied
General Faculty       130         20
Professors            30          20
Adjunct Professors    80          20
Custodians            20          10

The question remains, however, as to whether the assigned role of each staff member has any impact on the survey results. To decide this, with 95% confidence, you must follow the subsequent steps.

First, we will need to input this survey data into R as a matrix. This can be achieved by utilizing the code below:

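# matrix() fills column by column: the first four values are the Satisfied counts,
# the last four are the Unsatisfied counts.
Model <- matrix(c(130, 30, 80, 20, 20, 20, 20, 10), nrow = 4, ncol = 2)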

The result should resemble:

     [,1] [,2]
[1,]  130   20
[2,]   30   20
[3,]   80   20
[4,]   20   10

Once this step has been completed, the next step is as simple as entering the code:

chisq.test(Model)

Console Output:

Pearson's Chi-squared test

data: Model
X-squared = 18.857, df = 3, p-value = 0.0002926

Findings:

Degrees of Freedom (df) - 3
Confidence Level (CL) - .95
Alpha (α) (1-CL) - .05
Chi Square Test Statistic - 18.857

This creates the hypothesis test parameters:

H0 : There is no association between job type and job satisfaction (Null Hypothesis). Job type and job satisfaction are independent variables.

HA: There is an association between job type and job satisfaction. Job type and job satisfaction are not independent variables.

The critical value of 7.815 is found when consulting the chi-squared distribution table. Since our chi-square value is greater than this value (18.857 > 7.815), we reject the null hypothesis at the 5% significance level and conclude that job type and job satisfaction are associated.

Reject: Null Hypothesis.
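To see where the association comes from, the object returned by chisq.test() can be inspected. The sketch below rebuilds the survey table with row and column labels (the labels and the names survey and fit are additions made here for readability) and looks at the expected counts and the standardized residuals.

```r
survey <- matrix(
  c(130, 30, 80, 20,    # Satisfied column
    20, 20, 20, 10),    # Unsatisfied column
  nrow = 4,
  dimnames = list(
    Role = c("General Faculty", "Professors", "Adjunct Professors", "Custodians"),
    Response = c("Satisfied", "Unsatisfied")
  )
)

fit <- chisq.test(survey)

# Expected counts under independence: row total * column total / grand total.
fit$expected
by_hand <- outer(rowSums(survey), colSums(survey)) / sum(survey)

# Degrees of freedom: (rows - 1) * (columns - 1).
(nrow(survey) - 1) * (ncol(survey) - 1)

# Standardized residuals: large absolute values mark the cells
# that depart most from what independence would predict.
round(fit$stdres, 2)
```

Adding the labels does not change the test itself; the statistic and degrees of freedom are the same as in the console output.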

* Source for Chi Square Distribution Image - https://en.wikipedia.org/wiki/Chi-squared_distribution

Monday, September 4, 2017
