Results of Testing Whether Exam Questions Were Effective at Distinguishing Between Students of Higher and Lower Ability
Why Test Exam Questions for Effectiveness?
Every question on a completed examination paper can be checked using statistical methods (quantitatively) and human checks (qualitatively). Statistical checks test how well the questions discriminated between higher- and lower-level examinees, and also whether the test as a whole was a reliable measure of what the educator set out to test. Human checks (qualitative methods) can help to show whether a question was poorly written.
Example of Testing a Real Completed Exam Paper
I statistically analyzed an upper-intermediate English as a Foreign Language (EFL) formal examination given to 80 grade 12 students at a private school in Southeast Asia. Twenty questions (or items, as statistical analysts tend to call them) were statistically tested.
These 20 items were four-option multiple choice questions, and it's this kind of question that statistics are generally assumed to be most effective at checking. The idea is that you check to see which questions give you "bad" statistical results and then you flag these questions for review. Bad statistical results on their own might not be a reason to remove a question from future exams; qualitative checks also need to be done to examine the question's quality.
First Statistical Test: Distractor Analysis
Firstly, it's important to know the anatomy of a multiple choice question so that the terminology here can be understood. A multiple choice question consists of:
- A stem: The wording of the question; can be in the form of a question or of an incomplete sentence.
- Options: Also known as alternatives, these are the choices that the examinee will select from.
- Key: This is the correct option for the item.
- Distractors: The incorrect options.
Distractor Analysis has many different names. It is also known as Difficulty Index, Item Difficulty, Ease Index, IF (Item Facility), Percent Correct or p-value (called the p-number in the rest of this article). What it does is simply show you the proportion (or percentage) of students who took the test who answered the item correctly. You'd normally show it as a proportion (rather than a percentage), ranging from 0.0 to 1.0. A value of 0.0 would tell you that none of your students answered the question correctly. Conversely, a value of 1.0 shows that all students answered the item correctly.
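As a rough illustration of the calculation described above, here is a minimal Python sketch. It assumes the responses have already been marked as 1 (correct) or 0 (incorrect); the function name `item_difficulty` and the sample data are made up for the example and are not taken from my exam.

```python
def item_difficulty(scores):
    """Return the proportion of examinees who answered each item correctly.

    `scores` is a list of rows, one per examinee; each row holds 1 (correct)
    or 0 (incorrect) for every item on the test.
    """
    n_examinees = len(scores)
    n_items = len(scores[0])
    return [sum(row[i] for row in scores) / n_examinees for i in range(n_items)]


# Made-up responses: 4 examinees, 3 items (not data from the exam discussed here).
example_scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]

p_values = item_difficulty(example_scores)
print(p_values)  # [1.0, 0.5, 0.25]
```

In this toy data, everyone answered the first item correctly (1.0), half answered the second correctly (0.5) and only one in four answered the third correctly (0.25).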
This analysis is only possible on multiple choice questions, so in my example, I'm restricted to the responses from the first 20 questions of my test. My questions all had four options for examinees to choose from. The Distractor Analysis results are shown in the table below.
Distractor Analysis: What the Results Show
The first thing that jumps out from the data in the table shown above is question number 18. The p-number is 1.0, which means that all 80 students who took the test got this answer correct. This is not good news for that question, as it means that it was too easy. This question should be checked (qualitatively) to see why it was so easy. Were the distractors too obviously wrong? Was it not worth testing the content of this question at all, since all of the students knew it?
Six of my items were between 0.7 and 0.8, which is roughly ideal difficulty.
Anything with a p-number higher than 0.8 is getting into easy question territory, not that there's anything wrong with having a number of questions that are attainable for weaker students. A p-number above 0.9, though, suggests the question was too easy. For my test, in addition to question 18, I should definitely check how to adjust, or whether to keep, items 11 to 14, 16 and 20, as their p-numbers are higher than 0.9.
Aside from the "easy" questions alluded to above, there are some questions that might be too difficult for the learners. These are items with a p-number lower than 0.3; in my test, that's question numbers three and eight. Anything lower than 0.2 is considered very difficult; if you get that in your analysis, it could be that the language used in the question was too complicated or that the students hadn't understood what you had taught them.
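The flagging rules described above (higher than 0.9 is too easy, lower than 0.3 may be too difficult) could be automated along these lines. This is only a sketch: the function name `flag_items`, the default cut-offs as parameters and the sample p-numbers are my own assumptions for illustration, not results from my test.

```python
def flag_items(p_values, too_easy=0.9, too_hard=0.3):
    """Map 1-based item numbers to a review flag based on their p-number.

    Items between the two cut-offs are not flagged at all.
    """
    flags = {}
    for number, p in enumerate(p_values, start=1):
        if p > too_easy:
            flags[number] = "too easy"
        elif p < too_hard:
            flags[number] = "too difficult"
    return flags


# Made-up p-numbers for five items (not the values from my exam).
example_p = [0.75, 0.95, 0.25, 0.54, 1.0]
print(flag_items(example_p))
# {2: 'too easy', 3: 'too difficult', 5: 'too easy'}
```

A flag here only means "review this item"; whether the question stays in the exam is still a decision for the qualitative check.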
Ideal Difficulty: Optimum p-numbers
Five-option multiple choice questions: 0.60
Four-option multiple choice questions: 0.62
Three-option multiple choice questions: 0.66
True or false questions: 0.75
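One common rule of thumb, which I'm assuming lies behind figures like these, is that the ideal difficulty sits halfway between the score expected from pure guessing (one divided by the number of options) and a perfect score of 1.0. The sketch below uses a made-up function name, `midpoint_difficulty`, to show that this rule gives values within about 0.01 of the ones listed above.

```python
def midpoint_difficulty(n_options):
    """Halfway point between the chance score (1 / n_options) and 1.0."""
    chance = 1 / n_options
    return (chance + 1.0) / 2


for n in (5, 4, 3, 2):
    print(n, round(midpoint_difficulty(n), 3))
# 5 0.6
# 4 0.625
# 3 0.667
# 2 0.75
```

A true or false question counts as a two-option item here, which is why it has the highest optimum: half the class would be expected to get it right by guessing alone.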
Second Analysis: The Item Point-Biserial Correlation Coefficient
The item point-biserial correlation coefficient, shown in mathematical terms as rpbi, measures the degree to which a question differentiates correctly among those who took your test with regard to what the question was designed to measure. In other words, were the stronger students largely getting your questions right while the weaker ones were generally scoring less well (what you'd expect from an effective evaluation)?
Look back at my Distractor Analysis results above. Item number four has a p-number of 0.54, meaning roughly half the class got it right and half the class got it wrong; the students are split. The question is, which half got it right and which half got it wrong? If the weaker learners were the ones who got it right and the stronger learners were getting it wrong, then the item is flawed; the item point-biserial correlation coefficient will tell us if this is the case. My rpbi for this question was 0.38 (which is fairly good), meaning that stronger students were more likely than weaker students to get it correct.
The item point-biserial correlation coefficient measures the correlation between examinee performance on the item and their performance on the whole test. The full set of item point-biserial correlation coefficient results (rpbi) from my exam paper is shown in the table below.
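Because the item score is just 0 or 1, the point-biserial coefficient can be calculated as an ordinary Pearson correlation between the item column and the total scores. The sketch below assumes that approach and assumes the total includes the item itself; the function name `point_biserial` and the six made-up examinees are for illustration only.

```python
import math


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item column and total test scores."""
    n = len(item_scores)
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_item) * (y - mean_total)
              for x, y in zip(item_scores, total_scores))
    var_item = sum((x - mean_item) ** 2 for x in item_scores)
    var_total = sum((y - mean_total) ** 2 for y in total_scores)
    if var_item == 0 or var_total == 0:
        # Undefined when every examinee has the same score on the item or test.
        return float("nan")
    return cov / math.sqrt(var_item * var_total)


# Made-up data: the three highest scorers got the item right, the rest did not.
item = [1, 1, 1, 0, 0, 0]
totals = [18, 16, 15, 12, 10, 7]
r = point_biserial(item, totals)
print(round(r, 2))  # 0.89
```

Note that an item which every student answers correctly has no variation in its scores, so the coefficient cannot be calculated for it at all; the sketch returns NaN in that case.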
What the Item Point-Biserial Correlation Coefficient Results Mean
The scores for each of your items will range from minus one to plus one. Negative scores are bad because they mean that stronger students were more likely than weaker students to get the question wrong, so the question is dragging down the overall validity of your assessment. Thankfully I don't have any negative results. Any scores of 0.19 or less are probably poor questions and should be flagged for review (that's question number six for me as it's only got an rpbi of 0.5). Opinions vary, but around 0.3 is considered a good item, whilst 0.4 to 0.5 is thought to be very good (eight items for me).
The item point-biserial correlation coefficient, which is one kind of item discrimination index, thus shows you how well your questions distinguished between students who understood the concepts that you had taught them and those who did not.
Optimum Item Point-Biserial Correlation Coefficients
Very good items: 0.40 or higher
Good items: 0.30 to 0.39
Quite good items: 0.20 to 0.29
Poor items: 0.19 or lower
Very poor items: any negative value
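The bands listed above translate directly into a small lookup function. This is a sketch under one assumption of mine: values that fall between the listed bands (for example 0.195) are assigned to the lower band. The function name `classify_rpbi` is made up for the example.

```python
def classify_rpbi(r):
    """Return the quality band for an item point-biserial coefficient."""
    if r < 0:
        return "very poor"
    if r < 0.2:
        return "poor"
    if r < 0.3:
        return "quite good"
    if r < 0.4:
        return "good"
    return "very good"


print(classify_rpbi(0.38))  # good
```

The 0.38 used here is the rpbi I reported for item number four, which lands in the "good" band.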
Third Analysis: Cronbach's Alpha
Cronbach's Alpha tests whether your test was a consistent, reliable measure of the concepts you taught, and it will generally increase as the intercorrelations among your questions increase.
When you calculate Cronbach's Alpha you will get a number that normally ranges from zero to plus one. The higher your alpha, the more your questions are measuring the underlying concept as a whole. If all the questions in your exam are independent of one another, alpha will edge toward zero. Your alpha score should be above 0.5; otherwise, your test is considered not statistically reliable.
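For readers who want to see the arithmetic, here is a minimal sketch of the standard formula: alpha equals k / (k - 1) multiplied by (1 minus the sum of the item variances divided by the variance of the total scores), where k is the number of items. The function name `cronbach_alpha` and the tiny data set are assumptions for illustration, not my exam data.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's Alpha for a list of examinee rows of item scores.

    Assumes at least two items and that total scores are not all identical
    (otherwise the total variance is zero and alpha is undefined).
    """
    n_items = len(scores[0])
    # Variance of each item column across examinees.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of each examinee's total score.
    total_var = pvariance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Made-up responses: 4 examinees, 3 items marked 1 (correct) or 0 (incorrect).
example_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(cronbach_alpha(example_scores))  # 0.75
```

The same kind of variance (population or sample) has to be used for both the items and the totals; here both use the population variance, and the ratio comes out the same either way.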
Alpha for my test paper came out at above 0.7, as can be seen in the table below, suggesting it was quite a reliable test of what I'd set out to evaluate.
Statistical analyses (quantitative checks) are postvalidation protocols; that is, they take place after examinees have taken the test. The statistics will flag up items for review. The questions can then be checked for the quality of their writing. Again, this human checking is a postvalidation protocol.
However, before an educator gets to these stages, there should have been prevalidation measures in place to check the quality of the exam items before the exam was approved for distribution to students in the exam hall. It is best to check multiple choice exam items against a checklist to make sure the grammar, wording and subject accuracy are acceptable. The teacher creating the test should be checking this as a starting point when they build a test, not just waiting for those above them in managerial or department head positions to check on their behalf. Sometimes there are common construction errors that teachers are not aware of, and these could easily be avoided by using guidelines such as those created by Hansen and Dexter.