Flipping a Coin: Is It Fair?
A Problem of Probability: A Null Hypothesis Example
Two little league teams decide to flip a coin to determine which team gets to bat first. The best out of ten flips wins the coin toss: the red team chooses heads, and the blue team chooses tails. The coin is flipped ten times, and tails come up all ten times. The red team cries foul and declares the coin must be unfair.
The red team has come up with the hypothesis that the coin is biased for tails. What is the probability that a fair coin would show up as “tails” in ten out of ten flips?
Since the coin should have a 50% chance of landing as heads or tails on each flip, we can test the likelihood of getting tails in ten out of ten flips using the binomial distribution equation.
In the case of the coin toss, the probability would be:
(0.5)^10 = 0.0009766
In other words, the likelihood of a fair coin coming up as tails ten times out of ten is less than 1/1000. Statistically, we would say that P<0.001 for ten tails to occur in ten coin tosses. So, was the coin fair?
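The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the original example.

```python
# Probability that a fair coin lands tails on all ten of ten flips.
p_tails = 0.5   # fair-coin assumption (this is the null hypothesis)
n_flips = 10

# Each flip is independent, so the probabilities multiply.
p_all_tails = p_tails ** n_flips

print(f"P(10 tails in 10 flips) = {p_all_tails:.7f}")  # 0.0009766
print("Rarer than 1 in 1000?", p_all_tails < 0.001)    # True
```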
Null Hypothesis: Determining the Likelihood of a Measurable Event.
We have two options: either the coin toss was fair and we observed a rare event, or the coin toss was unfair. We have to make a decision as to which option we believe – the basic statistical equation cannot determine which of the two scenarios is correct.
Most of us, however, would choose to believe that the coin was unfair. We would reject the hypothesis that the coin was fair (i.e. had a ½ chance of flipping tails vs. heads), and we would reject that hypothesis at the 0.001 level of significance. Most people would believe the coin was unfair, rather than believe they had witnessed an event that occurs less than 1/1000 times.
The Null Hypothesis: Determining Bias
What if we wanted to test our theory that the coin was unfair? To study whether the “unfair coin” theory is true, we must first examine the theory that the coin is fair. We examine the fair coin first because we know what to expect from it: on average, ½ of the tosses will result in heads and ½ will result in tails. We cannot directly examine the possibility that the coin is unfair, because the probability of getting heads or tails is unknown for a biased coin.
The Null Hypothesis is the theory we can test directly. In the case of the coin toss, the Null Hypothesis would be that the coin is fair, and has a 50% chance of landing as heads or tails for each toss of the coin. The null hypothesis is usually abbreviated as H0.
The Alternative Hypothesis is the theory we can’t test directly. In the case of the coin toss, the alternative hypothesis would be that the coin is biased. The alternative hypothesis is usually abbreviated as H1.
In the little league coin toss example above, we know that getting 10/10 tails with a fair coin is very unlikely: the chance that such a thing would happen is less than 1/1000. This is a rare event: we would reject the Null Hypothesis (that the coin is fair) at the P<0.001 level of significance. By rejecting the null hypothesis, we accept the alternative hypothesis (i.e. the coin is unfair). Essentially, whether we reject the null hypothesis is determined by the significance level: our threshold for how rare an event must be.
Understanding Hypothesis Tests
A Second Example: The Null Hypothesis at Work
Consider another scenario: the little league team has another coin toss with a different coin, and flips 8 tails out of 10 coin tosses. Is the coin biased in this case?
Using the binomial distribution equation, we find that the likelihood of getting 2 heads out of 10 tosses is 0.044. Do we reject the null hypothesis that the coin is fair at the 0.05 level (a 5% significance level)?
The answer is no, for the following reasons:
(1) If we consider the likelihood of getting 2/10 coin tosses as heads rare, then we must also consider the possibility of getting 1/10 and 0/10 coin tosses as heads rare. We must consider the aggregate probability of (0 out of 10) + (1 out of 10) + (2 out of 10). The three probabilities are 0.0009766 + 0.0097656 + 0.0439453. When added together, the probability of getting 2 (or fewer) coin tosses as heads in ten tries is 0.0547. We cannot reject the null hypothesis at the 0.05 significance level, because 0.0547 > 0.05.
(2) Since we are considering the likelihood of getting 2/10 coin tosses as heads, we must also consider the likelihood of getting 8/10 heads instead, which is just as likely with a fair coin. We are examining the Null Hypothesis that the coin is fair, so we must also examine the probability of getting 8, 9, or 10 heads out of ten tosses. Because we must examine this two-sided alternative, we add the probability of getting 8 or more heads out of 10, which is also 0.0547. The “whole picture” is that the likelihood of a result this extreme is 2(0.0547) = 0.109, or about 11%.
Getting 2 heads out of 10 coin tosses could not possibly be described as a “rare” event, unless we call something that happens 11% of the time “rare.” In this case, we would not reject the Null Hypothesis that the coin is fair.
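The two steps above (adding up the lower tail, then doubling for the two-sided case) can be reproduced with the binomial formula. The sketch below uses only the Python standard library; the helper name binomial_pmf is an illustrative choice, not something from the original article.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10

# Step (1): 2 or fewer heads (0, 1, or 2 heads out of 10).
lower_tail = sum(binomial_pmf(k, n) for k in range(0, 3))

# Step (2): the mirror image, 8 or more heads (8, 9, or 10 out of 10).
upper_tail = sum(binomial_pmf(k, n) for k in range(8, 11))

two_sided = lower_tail + upper_tail

print(f"P(2 or fewer heads)    = {lower_tail:.4f}")  # 0.0547
print(f"Two-sided probability  = {two_sided:.4f}")   # 0.1094
print("Reject at the 0.05 level?", two_sided < 0.05)  # False
```

For a fair coin the two tails are symmetric, which is why doubling the one-sided value gives the same answer as summing both tails.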
Levels of Significance
There are many levels of significance in statistics – usually, the level of significance is simplified to one of a few levels. The typical levels of significance are P<0.001, P<0.01, P<0.05, and P<0.10. If the actual level of significance is 0.024, for example, we would say P<0.05 for the purposes of calculation. It is possible to use the actual level (0.024), but most statisticians would use the next largest significance level for ease of calculation. Instead of calculating the probability of 0.0009766 for the coin toss, the 0.001 level would be used.
Most of the time, a significance level of 0.05 is used for testing hypotheses.
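As a rough illustration of the bracketing described above, the following sketch maps a calculated probability to the smallest conventional level it falls under. The function name reporting_level and the decision to return None for values of 0.10 or more are assumptions made for this example.

```python
# Conventional significance levels, from strictest to most lenient.
LEVELS = (0.001, 0.01, 0.05, 0.10)

def reporting_level(p):
    """Return the smallest conventional level that p falls below, or None."""
    for level in LEVELS:
        if p < level:
            return level
    return None  # not significant at any of the conventional levels

print(reporting_level(0.024))      # 0.05
print(reporting_level(0.0009766))  # 0.001
print(reporting_level(0.109))      # None
```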
Defining Rare: Significance Levels for the Null Hypothesis
The levels of significance used for deciding whether to reject the Null Hypothesis are essentially thresholds for how rare an event must be. What is rare? Is 5% an acceptable level of error? Is 1% an acceptable level of error?
The acceptability of error will vary depending on the application. If you are manufacturing toy tops, for example, 5% might be an acceptable level of error. If less than 5% of the toy tops wobble during testing, the toy company may declare that as acceptable and send out the product.
A 5% significance level, however, would be completely unacceptable for medical devices. If a cardiac pacemaker failed 5% of the time, for example, the device would be pulled from the market immediately. No one would accept a 5% failure rate for an implantable medical device. The threshold for this sort of device would have to be much, much stricter: a significance level of 0.001 would be a better cut-off for this type of device.
One and Two Tailed Tests
One-Tailed vs. Two Tailed Tests
A hospital wants to determine if the trauma team’s average response time is appropriate. The emergency room claims they respond to a reported trauma with an average response time of 5 minutes or less.
If the hospital wants to determine the critical cut-off for only one parameter (response time must be faster than x seconds), then we call this a one tailed test. We might use this test if we didn’t care how fast the team was responding in a best-case scenario, but only cared about whether they were responding slower than the five minute claim. The emergency room merely wants to determine if the response time is worse than the claim. A one tailed test essentially evaluates whether the data shows something is "better" vs. "worse."
If the hospital wants to determine whether the response time is faster or slower than the stated time of 5 minutes, we would use a two tailed test. In this circumstance, we would reject values that are too large or too small. This eliminates the outliers of response time on both ends of the bell curve, and allows us to evaluate whether the average time is statistically similar to the claimed 5 minute time. A two-tailed test essentially evaluates whether something is "different" vs. "not different."
The critical value for a one-tailed test is 1.645 for a normal distribution at the 5% level: you must reject the Null Hypothesis if z > 1.645.
The critical value for a two-tailed test is ±1.96 at the 5% level: you must reject the Null Hypothesis if z > 1.96 or if z < -1.96.
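These critical values come from the standard normal distribution and can be recovered without a printed table. A minimal sketch using NormalDist from Python's standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
alpha = 0.05               # 5% significance level

# One-tailed: all 5% of the rejection region sits in one tail.
one_tailed_critical = std_normal.inv_cdf(1 - alpha)

# Two-tailed: the 5% is split, 2.5% in each tail.
two_tailed_critical = std_normal.inv_cdf(1 - alpha / 2)

print(f"One-tailed critical value: {one_tailed_critical:.3f}")   # 1.645
print(f"Two-tailed critical value: +/-{two_tailed_critical:.2f}")  # +/-1.96
```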
The z-score is a number that tells you how many standard deviations your data is from the mean. In order to use a z-table, you must first calculate your z-score. The equation for calculating a z score is:
(x-μ)/σ = z
x = the observed value
μ = the mean
σ = the standard deviation
Another formula for calculating the z-score, used when testing a sample mean, is:
(x-μ)/(s/√n) = z
x = the observed mean
μ = the expected mean
s = the standard deviation
n = the sample size
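Both z-score calculations can be written as small functions. The sketch below assumes the second formula is the usual standard-error form, z = (x - μ)/(s/√n), which is what the variable list above implies. The function names and the sample numbers are made up for illustration.

```python
from math import sqrt

def z_score_value(x, mu, sigma):
    """z-score of a single observation x, given mean mu and std dev sigma."""
    return (x - mu) / sigma

def z_score_mean(observed_mean, expected_mean, s, n):
    """z-score of a sample mean, using the standard error s / sqrt(n)."""
    standard_error = s / sqrt(n)
    return (observed_mean - expected_mean) / standard_error

# Illustrative numbers only (not from the article):
print(z_score_value(130, 100, 15))    # 2.0
print(z_score_mean(52, 50, 10, 25))   # 1.0
```

Dividing by s/√n rather than s reflects that an average of n observations varies less than a single observation does.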
A One Tailed Test Example
Using the emergency room example above, the hospital observed 40 traumas. In the first scenario, the average response time was 5.8 minutes for the observed traumas. The sample standard deviation was 3 minutes for all traumas recorded. The null hypothesis is that the mean response time is five minutes or less. For the purposes of this test, we are using a significance level of 5% (0.05). First, we must compute a z-score:
Z = (5.8 min - 5.0 min)/(3 min/√40) = 1.69
The z-score is 1.69: using a z-score table, we obtain the number 0.9545. If the true mean response time were 5 minutes, the probability of observing a sample mean of 5.8 minutes or more would be 1 - 0.9545 = 0.0455, or 4.55%. Since 0.0455<0.05, we reject the null hypothesis that the mean response time is 5 minutes or less. The 5.8 minute response time is statistically significant: the average response time is worse than the claim.
The Null Hypothesis is that the response team has an average response time of five minutes or less. In this one-tailed test, we found that the response time was worse than the claimed time, so the Null Hypothesis is rejected.
If, however, the team had a 5.6 minute response time on average, the following would be observed:
Z = (5.6 min - 5.0 min)/(3 min/√40) = 1.27
The z-score is 1.27, which correlates to 0.8980 on the z-table. If the true mean response time were 5 minutes, the probability of observing a sample mean of 5.6 minutes or more would be 1 - 0.8980 = 0.102, or 10.2 percent. Since 0.102>0.05, we cannot reject the null hypothesis. The observed response time is, statistically speaking, consistent with a mean of five minutes or less.
Since this example uses a normal distribution, one can also simply look at the "critical number" of 1.645 for a one-tailed test and determine immediately that the z-score resulting from the 5.8 minute response time is statistically worse than the claimed mean, while the z-score from the 5.6 minute average response time is acceptable (statistically speaking).
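The one-tailed calculation for both scenarios can be sketched in Python. This assumes the 3-minute spread is a standard deviation (that reading reproduces the z-scores quoted above) and uses the normal distribution from the standard library in place of a z-table. The function name one_tailed_test is illustrative. Because the code does not round intermediate values, its results differ slightly from the table-based figures.

```python
from math import sqrt
from statistics import NormalDist

def one_tailed_test(sample_mean, claimed_mean, s, n):
    """Return (z, p) for the upper-tailed test of a sample mean."""
    z = (sample_mean - claimed_mean) / (s / sqrt(n))
    p = 1 - NormalDist().cdf(z)  # area to the right of z
    return z, p

ALPHA = 0.05
results = {}
for mean in (5.8, 5.6):
    z, p = one_tailed_test(mean, claimed_mean=5.0, s=3.0, n=40)
    results[mean] = (z, p)
    verdict = "reject" if p < ALPHA else "do not reject"
    print(f"mean={mean}: z={z:.2f}, p={p:.3f} -> {verdict} the null hypothesis")

# Roughly: 5.8 minutes gives z near 1.69 and p near 0.046 (reject);
#          5.6 minutes gives z near 1.26 and p near 0.103 (do not reject).
```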
One vs. Two Tailed Tests
A Two Tailed Test Example
We will use the emergency room example above and determine if the response times are statistically different than the stated mean.
With the 5.8 minute response time (calculated above), we have a z-score of 1.69. Using a normal distribution, we can see that 1.69 is not greater than 1.96. Thus, there is no statistical reason to doubt the emergency department's claim that their response time is five minutes. The null hypothesis in this case is not rejected: the data are consistent with a mean response time of five minutes.
The same is true for the 5.6 minute response time. With a z-score of 1.27, the null hypothesis is again not rejected. The emergency department's claim of a 5 minute response time is not statistically different than the observed response time.
In a two-tailed test, we are observing whether the data is statistically different or statistically the same. In this case, a two-tailed test shows that both a 5.8 minute response time and a 5.6 minute response time are not statistically different from the 5 minute claim.
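The two-tailed comparison can also be expressed as a probability instead of a critical-value check. The sketch below takes the z-scores quoted in the example as inputs; the function name two_tailed_p is an illustrative choice.

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Probability of a z-score at least this far from 0 in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

ALPHA = 0.05
for z in (1.69, 1.27):
    p = two_tailed_p(z)
    different = p < ALPHA
    print(f"z={z}: two-tailed p={p:.3f}, statistically different? {different}")

# Roughly: z=1.69 gives p near 0.091 and z=1.27 gives p near 0.204,
# so neither result is significant at the 5% level.
```

Doubling the one-tailed area is what makes the 5.8 minute result significant in the one-tailed test but not in the two-tailed test.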
Abuses of Hypothesis Testing
All tests are subject to error. A few of the most common mistakes in experiments (to falsely yield a significant result) include:
- Publishing the tests which support your conclusion, and hiding the data which do not support your conclusion.
- Conducting only one or two tests with a large sample size.
- Designing the experiment to yield the data you desire.
Sometimes researchers want to show no significant effect, and may:
- Publish only the data that supports a claim of "no effect."
- Conduct many tests with a very small sample size.
- Design the experiment to have few limits.
Experimenters may alter the chosen significance level, ignore or include outliers, or replace a two-tailed test with a one-tailed test to get the results they desire. Statistics can be manipulated, which is why experiments must be repeatable, peer-reviewed, and consist of a sufficient sample size with adequate repetition.
Leah Lefler (author) from Western New York on May 03, 2012:
The question, Gus, is whether it was random luck or someone stacked the deck so that you would find it! ;-) I hope you found it useful!
Gustave Kilthau from USA on May 03, 2012:
Howdy Leah - I had been awaiting this kind of article for a long time, so I really thank you for posting it where I could find it. The odds, as it was, were far less than 1 in 1,000 for the occurrence !
Leah Lefler (author) from Western New York on May 01, 2012:
Thanks, brackenb - I remember the idea of a null hypothesis throwing me for a loop in college (when I was first exposed to basic stats). I hope the article helps someone!
brackenb on May 01, 2012:
Well written and understandable even to me, a dullard at anything mathmatical! Interesting hub.