A Problem of Probability: A Null Hypothesis Example
Two little league teams decide to flip a coin to determine which team gets to bat first. The best out of ten flips wins the coin toss: the red team chooses heads, and the blue team chooses tails. The coin is flipped ten times, and tails comes up all ten times. The red team cries foul and declares that the coin must be unfair.
The red team has developed the hypothesis that the coin is biased for tails. What is the probability that a fair coin would show up as “tails” in ten out of ten flips?
Since the coin should have a 50% chance of landing as heads or tails on each flip, we can test the likelihood of getting tails in ten out of ten flips using the binomial distribution equation.
In the case of the coin toss, the probability would be:
$(0.5)^{10} = 0.0009766$
In other words, the likelihood of a fair coin coming up tails ten times out of ten is less than 1/1000. Statistically, we would say that P<0.001 for ten tails occurring in ten coin tosses. So, was the coin fair?
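The calculation above can be checked with a short Python sketch. The helper name `binomial_pmf` is an illustrative choice, not something from the original text; it evaluates the standard binomial probability of exactly k successes in n independent trials.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials,
    each with success probability p (p = 0.5 for a fair coin)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability that a fair coin shows tails on all ten of ten flips.
p_ten_tails = binomial_pmf(10, 10)
print(f"{p_ten_tails:.7f}")   # 0.0009766
print(p_ten_tails < 0.001)    # True
```

For ten tails in ten flips the binomial coefficient is 1, so the result reduces to (0.5)^10, matching the figure in the text.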
Null Hypothesis: Determining the Likelihood of a Measurable Event.
We have two options: either the coin toss was fair, and we observed a rare event or the coin toss was unfair. We have to make a decision as to which option we believe – the basic statistical equation cannot determine which of the two scenarios is correct.
Most of us, however, would choose to believe that the coin was unfair. We would reject the hypothesis that the coin was fair (i.e. had a ½ chance of landing tails on each flip), and we would reject that hypothesis at the 0.001 level of significance. Most people would believe the coin was unfair rather than believe they had witnessed an event that occurs less than once in 1,000 tries.
The Null Hypothesis: Determining Bias
What if we wanted to test our theory that the coin was unfair? To study whether the “unfair coin” theory is true, we must first examine the theory that the coin is fair. We examine the fair coin first because we know what to expect from it: on average, ½ of the tosses will result in heads and ½ will result in tails. We cannot directly examine the possibility that the coin was unfair, because the probability of getting heads or tails is unknown for a biased coin.
The null hypothesis is the theory we can test directly. In the case of the coin toss, the null hypothesis would be that the coin is fair and has a 50% chance of landing as heads or tails for each toss of the coin. The null hypothesis is usually abbreviated as H0.
The alternative hypothesis is the theory we can’t test directly. In the coin toss case, the alternative hypothesis would be that the coin is biased. The alternative hypothesis is usually abbreviated as H1.
In the little league coin toss example above, we know that getting 10/10 tails with a fair coin is very unlikely: the chance that such a thing would happen is less than 1/1000. This is a rare event, so we would reject the null hypothesis (that the coin is fair) at the P<0.001 level of significance. By rejecting the null hypothesis, we accept the alternative hypothesis (i.e. the coin is unfair). Essentially, the acceptance or rejection of the null hypothesis is determined by the significance level: the threshold for how rare an event must be before we stop attributing it to chance.
Understanding Hypothesis Tests
A Second Example: The Null Hypothesis at Work
Consider another scenario: the little league team has another coin toss with a different coin, and flips 8 tails out of 10 coin tosses. Is the coin biased in this case?
Using the binomial distribution equation, we find that the likelihood of getting exactly 2 heads out of 10 tosses is 0.044. Do we reject the null hypothesis that the coin is fair at the 0.05 level (a 5% significance level)?
The answer is no, for the following reasons:
(1) If we consider the likelihood of getting 2/10 coin tosses as heads rare, then we must also consider the possibility of getting 1/10 and 0/10 coin tosses as heads rare. We must consider the aggregate probability of (0 out of 10) + (1 out of 10) + (2 out of 10). The three probabilities are 0.0009766 + 0.0097656 + 0.0439453. When added together, the probability of getting 2 (or fewer) coin tosses as heads in ten tries is 0.0547. We cannot reject the null hypothesis at the 0.05 significance level, because 0.0547 > 0.05.
(2) Since we are considering the likelihood of getting 2/10 coin tosses as heads, we must also consider the likelihood of getting 8/10 heads instead, which is just as likely with a fair coin. We are examining the Null Hypothesis that the coin is fair, so we must also examine the probability of getting 8, 9, or 10 heads out of ten tosses. Because we must examine this two-sided alternative, the probability of getting 8 or more heads out of 10 is also 0.0547. The “whole picture” is that the likelihood of a result this extreme is 2(0.0547) = 0.1094, or about 11%.
Getting 2 heads out of 10 coin tosses could not possibly be described as a “rare” event, unless we call something that happens 11% of the time “rare.” In this case, we would fail to reject (informally, “accept”) the Null Hypothesis that the coin is fair.
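The two steps of this argument (summing the lower tail, then doubling for the two-sided case) can be sketched in Python. The variable names and the `binomial_pmf` helper are illustrative assumptions, and the doubling relies on the fair-coin distribution being symmetric.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n tosses of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10
alpha = 0.05  # 5% significance level

# Reason (1): probability of 2 or fewer heads (0, 1, or 2).
lower_tail = sum(binomial_pmf(k, n) for k in range(0, 3))

# Reason (2): the mirror-image outcomes, 8 or more heads (8, 9, or 10).
upper_tail = sum(binomial_pmf(k, n) for k in range(8, 11))

two_sided = lower_tail + upper_tail

print(f"{lower_tail:.4f}")   # 0.0547
print(f"{two_sided:.4f}")    # 0.1094
print(two_sided < alpha)     # False: do not reject the null hypothesis
```

Because the two tails are equal for a fair coin, adding them gives the same result as the 2(0.0547) shortcut used in the text.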
Levels of Significance
There are many possible levels of significance in statistics, but in practice the level is usually reported as one of a few conventional values. The typical levels of significance are P<0.001, P<0.01, P<0.05, and P<0.10. If the actual probability is 0.024, for example, we would report P<0.05. It is possible to report the actual value (0.024), but many statisticians would quote the next larger conventional level for simplicity. Likewise, instead of quoting the exact probability of 0.0009766 for the coin toss, the 0.001 level would be used.
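As a small illustration of this rounding convention, the sketch below maps a computed probability to the smallest conventional level it falls under. The function name `reported_level` and the choice to return `None` when the probability is 0.10 or more are assumptions made for this example.

```python
def reported_level(p_value, levels=(0.001, 0.01, 0.05, 0.10)):
    """Return the smallest conventional level that p_value is below,
    or None if p_value is not below any of them."""
    for level in sorted(levels):
        if p_value < level:
            return level
    return None

print(reported_level(0.024))      # 0.05  -> report as P<0.05
print(reported_level(0.0009766))  # 0.001 -> report as P<0.001
print(reported_level(0.1094))     # None  -> not significant at any listed level
```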
Most of the time, a significance level of 0.05 is used for testing hypotheses.
Defining Rare: Significance Levels for the Null Hypothesis
The levels of significance used for deciding whether to reject the Null Hypothesis are essentially thresholds for how rare an event must be. What is rare? Is 5% an acceptable level of error? Is 1% an acceptable level of error?
The acceptability of error will vary depending on the application. If you are manufacturing toy tops, for example, 5% might be an acceptable level of error. If less than 5% of the toy tops wobble during testing, the toy company may declare that as acceptable and send out the product.
A 5% significance level, however, would be completely unacceptable for medical devices. If a cardiac pacemaker failed 5% of the time, for example, the device would be pulled from the market immediately. No one would accept a 5% failure rate for an implantable medical device. The standard for this sort of device would have to be much, much stricter: a significance level of 0.001 would be a better cut-off for this type of device.
One-Tailed vs. Two Tailed Tests
A hospital wants to determine if the trauma team’s average response time is appropriate. The emergency room claims they respond to a reported trauma with an average response time of 5 minutes or less.
If the hospital wants to determine the critical cut-off in only one direction (response time must be faster than x minutes), then we call this a one-tailed test. We might use this test if we didn’t care how fast the team was responding in a best-case scenario, but only cared about whether they were responding slower than the five-minute claim. The emergency room merely wants to determine if the response time is worse than the claim. A one-tailed test essentially evaluates whether the data shows something is "better" vs. "worse."
If the hospital wants to determine whether the response time is faster or slower than the stated time of 5 minutes, we would use a two-tailed test. In this circumstance, we would reject values that are too large or too small. This places a cut-off at both ends of the bell curve, and allows us to evaluate whether the average time is statistically similar to the claimed 5-minute time. A two-tailed test essentially evaluates whether something is "different" vs. "not different."
The critical value for a one-tailed test is 1.645 for a normal distribution at the 5% level: you must reject the Null Hypothesis if z > 1.645.
The critical value for a two-tailed test at the 5% level is ±1.96: you must reject the Null Hypothesis if z > 1.96 or if z < -1.96.
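Both critical values can be recovered from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper `reject_null` and the sample z value of 1.8 are hypothetical and only show how the two rejection rules differ.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
alpha = 0.05               # 5% significance level

# One-tailed: all 5% of the rejection region sits in the upper tail.
one_tailed_critical = std_normal.inv_cdf(1 - alpha)

# Two-tailed: the 5% is split into 2.5% in each tail.
two_tailed_critical = std_normal.inv_cdf(1 - alpha / 2)

print(round(one_tailed_critical, 3))  # 1.645
print(round(two_tailed_critical, 2))  # 1.96

def reject_null(z, alpha=0.05, two_tailed=True):
    """Apply the rejection rule for an upper one-tailed or a two-tailed z test."""
    if two_tailed:
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    return z > std_normal.inv_cdf(1 - alpha)

# A hypothetical z of 1.8 is rejected one-tailed but not two-tailed.
print(reject_null(1.8, two_tailed=False))  # True
print(reject_null(1.8, two_tailed=True))   # False
```

The same z value can lead to different decisions, which is why the choice between a one-tailed and a two-tailed test should be made before looking at the data.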
The z-score is a number that tells you how many standard deviations your data is from the mean. In order to use a z-table, you must first calculate your z-score. The equation for calculating a z score is: