Results of Testing Whether Exam Questions Were Effective at Distinguishing Between Students of Higher and Lower Ability

Updated on April 14, 2019
murraylindsay

I've been an educational professional for many years, holding certified qualifications in that field.

Why Test Exam Questions for Effectiveness?

Every question of a completed examination paper can be checked by using statistical methods (quantitatively) and human checks (qualitatively). Statistical checks will test how well the questions were able to discriminate between higher and lower level examinees and also whether the test as a whole was a reliable measure of what the educator set out to test. Human checks (qualitative methods) can help to see if a question was poorly written.

Multiple choice answer grid laid over a student answer sheet

Example of Testing a Real Completed Exam Paper

I statistically analyzed an upper-intermediate English as a Foreign Language (EFL) formal examination given to 80 grade 12 students at a private school in Southeast Asia. Twenty questions (or items, as statistical analysts tend to call them) were tested.

These 20 items were four-option multiple choice questions, and it's this kind of question that statistics are generally assumed to be most effective at checking. The idea is that you check to see which questions give you "bad" statistical results and then you flag these questions for review. Bad statistical results on their own might not be a reason to remove the question from future exams; qualitative checks also need to be done to examine the question's quality.

First Statistical Test: Distractor Analysis

Firstly, it's important to know the anatomy of a multiple choice question so that the terminology here can be understood. A multiple choice question consists of:

  • A stem: The wording of the question; can be in the form of a question or of an incomplete sentence.
  • Options: Also known as alternatives, these are the choices that the examinee will select from.
  • Key: This is the correct option for the item.
  • Distractors: The incorrect options.

Distractor Analysis, as I'm using the term here, reports a statistic that has many different names: Difficulty Index, Item Difficulty, Ease Index, Item Facility (IF), Percent Correct or p-value. It simply shows you the proportion (or percentage) of students who took the test who answered the item correctly. (A fuller distractor analysis would also count how many students chose each incorrect option.) You'd normally show it as a proportion rather than a percentage, ranging from 0.0 to 1.0. A value of 0.0 tells you that none of your students answered the question correctly. Conversely, a value of 1.0 shows that all students answered the item correctly.
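As a rough illustration of the calculation, here is a minimal Python sketch. The function name `item_difficulty` and the tiny data set are my own assumptions for the example; they are not the real exam responses.

```python
def item_difficulty(responses, key):
    """Return, for each item, the proportion of examinees who chose the key.

    responses: one list of chosen options per examinee, e.g. ["B", "D", "A"]
    key: the correct option for each item, in the same order
    """
    n_examinees = len(responses)
    return [
        sum(1 for answers in responses if answers[i] == correct) / n_examinees
        for i, correct in enumerate(key)
    ]


# Hypothetical data: four examinees, three items (not the real exam data).
key = ["B", "D", "A"]
responses = [
    ["B", "D", "A"],
    ["B", "C", "A"],
    ["A", "D", "A"],
    ["B", "A", "A"],
]
p_values = item_difficulty(responses, key)
print(p_values)  # [0.75, 0.5, 1.0]
```

The third item comes out at 1.0 because every examinee in the made-up data chose the key.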

This analysis is only possible on multiple choice questions, so in my example I'm restricted to the responses to the first 20 questions of my test. My questions all had four options for examinees to choose from. The Distractor Analysis results are shown in the table below.

Distractor Analysis Results: 20 Multiple Choice Questions

Distractor Analysis: What the Results Show

The first thing that jumps out from the data in the table shown above is question number 18. Its p-number is 1, which means that all 80 students who took the test got this answer correct. This is not good news for that question as it means that it was too easy. This question should be checked (qualitatively) to see why it was so easy. Were the distractors too obviously wrong? Was it not worth testing the content of this question at all since all of the students knew it?

Six of my items were between 0.7 and 0.8, which is roughly ideal difficulty.

Anything with a p-number higher than 0.8 is getting into easy question territory, not that there's anything wrong with having a number of questions that are attainable for weaker students. A p-number above 0.9, though, suggests the question was too easy. For my test, in addition to question 18, I should definitely check whether to adjust or keep items 11 to 14, 16 and 20, as their p-numbers are higher than 0.9.

Aside from the "easy" questions alluded to above, there are some questions that might be too difficult for the learners. These are items with a p-number lower than 0.3; in my test, that's question numbers three and eight. Anything lower than 0.2 is considered very difficult; if you get that in your analysis, it could be that the language used in the question was too complicated or that the students hadn't understood what you had taught them.
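The thresholds above can be turned into a simple flagging routine. This is only a sketch: the function name `difficulty_band` and the band labels are my own, and apart from items 4 and 18 (whose p-numbers are quoted in this article) the values are hypothetical placeholders.

```python
def difficulty_band(p):
    """Classify an item's p-number using the thresholds described above."""
    if p > 0.9:
        return "too easy: review"
    if p > 0.8:
        return "easy"
    if p < 0.2:
        return "very difficult: review"
    if p < 0.3:
        return "difficult: review"
    return "acceptable"


# Items 4 and 18 use the p-numbers quoted in this article; the values for
# items 3 and 11 are hypothetical placeholders, not the real results.
p_numbers = {3: 0.25, 4: 0.54, 11: 0.93, 18: 1.0}

# Keep only the items whose band ends in "review".
flagged = {
    item: difficulty_band(p)
    for item, p in p_numbers.items()
    if difficulty_band(p).endswith("review")
}
print(flagged)
```

A flag here only means "look at this item again"; the decision to keep or drop it still needs the qualitative check.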

Ideal Difficulty: Optimum p-numbers

  • Five-option multiple choice questions: 0.60
  • Four-option multiple choice questions: 0.62
  • Three-option multiple choice questions: 0.66
  • True or false questions: 0.75
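These figures look consistent with a common rule of thumb that puts the ideal p-number halfway between the chance score (pure guessing) and a perfect score. That rule is my assumption about where the numbers come from, not something the figures above state, and the function name `midpoint_difficulty` is purely illustrative.

```python
def midpoint_difficulty(n_options):
    """Halfway point between the guessing rate and 1.0 for an item."""
    chance = 1 / n_options  # probability of guessing the key at random
    return chance + (1 - chance) / 2


# 5, 4 and 3 options, then true/false (2 options).
for n in (5, 4, 3, 2):
    print(n, f"{midpoint_difficulty(n):.3f}")
# 5 0.600
# 4 0.625
# 3 0.667
# 2 0.750
```

Under that assumption the listed values of 0.62 and 0.66 would simply be 0.625 and 0.667 cut to two decimal places.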

Second Analysis: The Item Point-Biserial Correlation Coefficient

The item point-biserial correlation coefficient, written in mathematical terms as rpbi, measures how well a question differentiates among those who took your test with regard to what the question was designed to measure. In other words, were the stronger students largely getting the question right while the weaker ones generally scored less well (what you'd expect from an effective evaluation)?

Look back at my Distractor Analysis results above. Item number four has a p-number of 0.54, meaning roughly half the class got it right and half the class got it wrong; the students are split. The question is, which half got it right and which half got it wrong? If the weaker learners were the ones who got it right and the stronger learners were getting it wrong, then the item is flawed. The item point-biserial correlation coefficient will tell us if this is the case. My rpbi for this question was 0.38 (which is fairly good), meaning that more of the stronger students and fewer of the weaker students got it correct.

The item point-biserial correlation coefficient measures the correlation between examinees' performance on the item and their performance on the whole test. The full set of item point-biserial correlation coefficient results (rpbi) from my exam paper is shown in the table below.
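For readers who want to see the arithmetic, here is a minimal sketch of one standard way to compute rpbi: the difference between the mean total score of those who got the item right and those who got it wrong, divided by the standard deviation of all total scores, multiplied by the square root of p times (1 - p). The function name and the data are my own assumptions, and the total score here includes the item itself, which may differ from how a given statistics package does it.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between one item (0/1) and total scores."""
    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    sd = pstdev(total_scores)  # population standard deviation of totals
    if not right or not wrong or sd == 0:
        # Undefined when everyone (or no one) got the item right,
        # or when all total scores are identical.
        return float("nan")
    p = len(right) / len(item_scores)  # the item's p-number
    return (mean(right) - mean(wrong)) / sd * sqrt(p * (1 - p))


# Hypothetical data for six examinees (not the real exam data).
item = [1, 1, 1, 0, 0, 0]          # 1 = answered this item correctly
totals = [18, 16, 14, 12, 10, 8]   # total test scores
r = point_biserial(item, totals)
print(round(r, 3))  # 0.878
```

In this made-up data the three highest scorers are exactly the ones who got the item right, so the coefficient is strongly positive. Note that an item everyone answers correctly has no defined rpbi at all, because there is no "wrong" group to compare against.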

Item point-biserial correlation coefficient (rpbi) results

What the Item Point-Biserial Correlation Coefficient Results Mean

The scores for each of your items will range from minus one to plus one. Negative scores are bad because they mean that weaker students were more likely than stronger students to get the question right, so the question is dragging down the overall validity of your assessment. Thankfully I don't have any negative results. Any scores of 0.19 or less are probably poor questions and should be flagged for review (that's question number six for me, as its rpbi fell below that threshold). Opinions vary, but around 0.3 is considered a good item, whilst 0.4 to 0.5 is thought to be very good (eight items for me).

As a measure of item discrimination, the item point-biserial correlation coefficient thus shows you how well your questions distinguished between students who understood the concepts that you had taught them and those who did not.

Optimum Item Point-Biserial Correlation Coefficients

  • Very good items: 0.40 or higher
  • Good items: 0.30 to 0.39
  • Quite good items: 0.20 to 0.29
  • Poor items: 0.19 or lower
  • Very poor items: any negative value
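The bands above translate directly into a small lookup. This is just a sketch with a function name of my own choosing; it treats anything below 0.2 as "poor", which matches "0.19 or lower" when coefficients are reported to two decimal places.

```python
def rpbi_band(r):
    """Map an item point-biserial coefficient onto the bands listed above."""
    if r < 0:
        return "very poor"
    if r < 0.2:
        return "poor"
    if r < 0.3:
        return "quite good"
    if r < 0.4:
        return "good"
    return "very good"


# 0.38 is the rpbi quoted in this article for item four.
print(rpbi_band(0.38))  # good
```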

Third Analysis: Cronbach's Alpha

Cronbach's Alpha tests whether your test was a consistent, reliable measure of the concepts you taught, and it will generally increase as the intercorrelations among your questions increase.

When you calculate Cronbach's Alpha you will normally get a number that ranges from zero to plus one. The higher your alpha, the more your questions are measuring the underlying concept as a whole. If all the questions in your exam are independent of one another, alpha will edge toward zero. Your alpha score should be above 0.5; otherwise your test is not considered statistically reliable.
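A minimal sketch of the usual formula may help here: alpha equals k / (k - 1) multiplied by (1 minus the sum of the item variances divided by the variance of the total scores), where k is the number of items. The function name and the small score matrix are my own assumptions, and I use population variances throughout for simplicity.

```python
from statistics import pvariance


def cronbach_alpha(score_matrix):
    """Cronbach's alpha for a matrix with one row of item scores per examinee."""
    k = len(score_matrix[0])  # number of items
    item_variances = [
        pvariance([row[i] for row in score_matrix]) for i in range(k)
    ]
    total_variance = pvariance([sum(row) for row in score_matrix])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: five examinees, four items scored 1 (right) or 0 (wrong).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 2))  # 0.8
```

The sketch assumes at least two items and some spread in the total scores; with identical totals the variance of the totals is zero and alpha cannot be calculated.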

Alpha for my test paper came out at above 0.7, as can be seen in the table below, suggesting it was quite a reliable test of what I'd set out to evaluate.

Cronbach's Alpha result for my exam

Qualitative Analysis

Statistical analyses (quantitative checks) are postvalidation protocols; that is, they can only take place after examinees have taken the test. The statistics will flag up items for review. The questions can then be checked for the quality of their writing. Again, this human checking is a postvalidation protocol.

However, before an educator gets to these stages, there should have been prevalidation measures in place to check the quality of the exam items before the exam was approved for distribution to students in the exam hall. It is best to check multiple choice exam items against a checklist to make sure the grammar, wording and subject accuracy are acceptable. The teacher creating the test should be checking this as a starting point when they build a test, not just waiting for those above them in managerial or department head positions to check on their behalf. Sometimes there are common construction errors that teachers are not aware of, and these could easily be avoided by using guidelines such as those created by Hansen and Dexter.
