Analyzing Multiple-Choice Item Responses
Understanding how to interpret three useful statistics concerning your
students’ multiple-choice test scores will help you construct well-designed
tests and improve instruction.
1. Item difficulty, P: the percentage of students who
correctly answered an item.
- Also referred to as the p-value
- Ranges from 0% to 100%, or more typically written as a proportion
0.00 to 1.00
- The higher the value, the easier the item
- P-values above 0.90 indicate very easy items that you should not
reuse in subsequent tests. If almost all students responded correctly,
the item probably addresses a concept not worth testing.
- P-values below 0.20 indicate very difficult items. If almost all
students responded incorrectly, either an item is flawed or students
did not understand the concept. Consider revising confusing language,
removing the item from subsequent tests, or targeting the concept for
re-instruction.
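As a minimal sketch of the calculation described above, the following Python computes P for each item from a matrix of scored responses and applies the 0.90 and 0.20 cutoffs. The data, the function names, and the 1/0 scoring layout are assumptions made for illustration only.

```python
# Scored responses: one row per student, one column per item
# (1 = correct, 0 = incorrect). The data are made up for illustration.
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
]

def item_difficulty(scores):
    """Return the proportion of students who answered each item correctly."""
    n_students = len(scores)
    n_items = len(scores[0])
    return [sum(row[i] for row in scores) / n_students for i in range(n_items)]

def flag_item(p, easy=0.90, hard=0.20):
    """Label an item using the cutoffs quoted in the text above."""
    if p > easy:
        return "very easy"
    if p < hard:
        return "very difficult"
    return "ok"

p_values = item_difficulty(scores)
for number, p in enumerate(p_values, start=1):
    print(f"Item {number}: P = {p:.2f} ({flag_item(p)})")
```

In this made-up data set, every student answers the first item correctly and no student answers the third item correctly, so those two items are the ones flagged for review.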
For maximum discrimination potential, desirable difficulty levels are
slightly higher than midway between chance (1.00 divided by the number
of choices) and perfect scores (1.00) for an item:
| Format | Ideal Difficulty |
| Five-response multiple-choice | .60 |
| Four-response multiple-choice | .62 |
| Three-response multiple-choice | .66 |
| True-false (two-response multiple-choice) | .75 |
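To make the "midway between chance and perfect" idea concrete, here is a small sketch that computes that midpoint for each format. The function name is hypothetical, and the sketch computes only the midpoint itself, so any upward adjustment implied by "slightly higher" is left to the reader's judgment.

```python
def midpoint_difficulty(n_choices):
    """Midpoint between the chance score (1 / n_choices) and a perfect score (1.00)."""
    chance = 1.0 / n_choices
    return (chance + 1.0) / 2.0

for n_choices in (5, 4, 3, 2):
    chance = 1.0 / n_choices
    midpoint = midpoint_difficulty(n_choices)
    print(f"{n_choices} choices: chance = {chance:.3f}, midpoint = {midpoint:.3f}")
```

For a four-response item, for example, chance is 0.25 and the midpoint is 0.625, which is close to the value listed in the table above.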
2. Item discrimination, R(IT): the relationship between
how well students performed on the item and their total test score.
- Also referred to as the Point-Biserial correlation (PBS)
- Ranges from -1.00 to 1.00; values for usable items are positive
- The higher the value, the more discriminating the item
- A highly discriminating item indicates that students with high test
scores responded correctly whereas students with low test scores responded
incorrectly.
Remove items with discrimination values near or below zero. A value near
zero indicates that the item does not distinguish high scorers from low
scorers, and a negative value indicates that students who performed poorly
on the test performed better on the item than students who performed well.
Such an item is probably confusing your better-scoring students in some way.
Evaluate items using four guidelines for classroom test discrimination
values:
| 0.40 or higher | very good items |
| 0.30 to 0.39 | good items |
| 0.20 to 0.29 | fairly good items |
| 0.19 or less | poor items |
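The discrimination value can be sketched as an ordinary Pearson correlation between the 0/1 item score and the total test score, which is what a point-biserial correlation amounts to for a dichotomous item. The code below is an illustration under stated assumptions: the data are made up, the function names are hypothetical, and the total score includes the item itself, which tends to inflate the correlation on short tests.

```python
import math

# Scored responses: one row per student, one column per item (made-up data).
scores = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
]

def pearson(x, y):
    """Pearson correlation; returns NaN when either variable has no variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def item_discrimination(scores):
    """Correlate each item's 0/1 scores with the students' total scores."""
    totals = [sum(row) for row in scores]
    n_items = len(scores[0])
    return [pearson([row[i] for row in scores], totals) for i in range(n_items)]

def rate_item(r):
    """Apply the four classroom guidelines listed in the table above."""
    if math.isnan(r):
        return "undefined"
    if r >= 0.40:
        return "very good"
    if r >= 0.30:
        return "good"
    if r >= 0.20:
        return "fairly good"
    return "poor"

r_values = item_discrimination(scores)
for number, r in enumerate(r_values, start=1):
    print(f"Item {number}: R(IT) = {r:.2f} ({rate_item(r)})")
```

An item that every student answers the same way has no variance, so its correlation is undefined; the sketch reports that case separately rather than treating it as zero.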
3. Reliability coefficient, ALPHA: a measure of the
amount of measurement error associated with a test score.
- Ranges from 0.00 to 1.00
- The higher the value, the more reliable the test score
- Typically, a measure of internal consistency, indicating how well
items are correlated with one another
- High reliability indicates that items are measuring the same construct
(e.g., knowledge of how to calculate integrals)
- Two ways to improve test reliability: 1) increase the number of items
or 2) use items with high discrimination values
| Reliability | Interpretation |
| .90 and above | Excellent reliability; at the level of the best standardized tests |
| .80 - .90 | Very good for a classroom test |
| .70 - .80 | Good for a classroom test; in the range of most classroom tests. There are probably a few items that could be improved. |
| .60 - .70 | Somewhat low. This test should be supplemented by other measures to determine grades. There are probably some items that could be improved. |
| .50 - .60 | Suggests need to revise the test, unless it is quite short (ten or fewer items). The test must be supplemented by other measures for grading. |
| .50 or below | Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision. |
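The text does not say how ALPHA is computed. Assuming it refers to Cronbach's coefficient alpha, a common internal-consistency measure, a minimal sketch looks like the following. The data and function names are hypothetical.

```python
# Scored responses: one row per student, one column per item (made-up data).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

def variance(values):
    """Population variance (divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def cronbach_alpha(scores):
    """Cronbach's alpha: k / (k - 1) * (1 - sum of item variances / total variance)."""
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)

alpha = cronbach_alpha(scores)
print(f"ALPHA = {alpha:.2f}")
```

The same variance definition is used for the items and for the total score, so the choice between dividing by n and by n - 1 cancels out of the ratio.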
Distractor Evaluation
Another useful item review technique is distractor evaluation.
You should consider each distractor an important part of an item in view
of nearly 50 years of research that shows that there is a relationship
between the distractors students choose and total test score. The quality
of the distractors influences student performance on a test item.
Although correct answers must be truly correct, it is just as important
that distractors be clearly incorrect, appealing to low scorers who have
not mastered the material rather than to high scorers. You should review
all item options to anticipate potential errors of judgment and inadequate
performance so you can revise, replace, or remove poor distractors.
One way to study responses to distractors is with a frequency table that
tells you the proportion of students who selected a given distractor.
Remove or replace distractors selected by few or no students, because
students evidently find them implausible.
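A frequency table of this kind can be sketched in a few lines of Python. The responses, the answer key, and the 5% cutoff for a "rarely chosen" distractor are all assumptions made for illustration; the text does not give a specific threshold.

```python
from collections import Counter

# Options chosen by ten students on one item (made-up data); the key is "B".
responses = list("BBABCBBABB")
options = ["A", "B", "C", "D"]
answer_key = "B"

def option_proportions(responses, options):
    """Return the proportion of students who selected each option."""
    counts = Counter(responses)
    n_students = len(responses)
    return {option: counts.get(option, 0) / n_students for option in options}

def weak_distractors(proportions, answer_key, min_share=0.05):
    """List distractors chosen by fewer than min_share of students (assumed cutoff)."""
    return [
        option
        for option, share in proportions.items()
        if option != answer_key and share < min_share
    ]

proportions = option_proportions(responses, options)
for option, share in proportions.items():
    marker = " (key)" if option == answer_key else ""
    print(f"{option}: {share:.2f}{marker}")
print("Candidates for revision:", weak_distractors(proportions, answer_key))
```

Here option D is never selected, so it is the only distractor the sketch lists as a candidate for revision.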
Caution when Interpreting Item Analysis Results
Mehrens and Lehmann (1973) offer three cautions about using the results
of item analysis:
- Item analysis data are not synonymous with item validity. An external
criterion is required to accurately judge the validity of test items.
By using the internal criterion of total test score, item analyses reflect
internal consistency of items rather than validity.
- The discrimination index is not always a measure of item quality.
There are a variety of reasons why an item may have low discrimination
power:
a) extremely difficult or easy items will have low ability to discriminate,
but such items are often needed to adequately sample course content
and objectives;
b) an item may show low discrimination if the test measures many content
areas and cognitive skills. For example, if the majority of the test
measures "knowledge of facts," then an item assessing "ability
to apply principles" may have a low correlation with total test
score, yet both types of items are needed to measure attainment of course
objectives.
- Item analysis data are tentative. Such data are influenced by the
type and number of students being tested, instructional procedures employed,
and chance errors. If repeated use of items is possible, statistics
should be recorded for each administration of each item.
References:
DeVellis, R. F. (1991). Scale development: Theory and applications.
Newbury Park: Sage Publications.
Haladyna, T. M. (1999). Developing and validating multiple-choice test
items (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Lord, F. M. (1952). The relation of the reliability of multiple-choice
tests to the distribution of item difficulties. Psychometrika, 17, 181-194.
Mehrens, W. A., & Lehmann, I. J. (1973). Measurement and Evaluation
in Education and Psychology. New York: Holt, Rinehart and Winston, 333-334.
Suen, H. K. (1990). Principles of test theories. Hillsdale, NJ: Lawrence
Erlbaum Associates.