An item analysis provides information useful for improving the quality and accuracy of multiple-choice tests. The program used by the Measurement and Evaluation Center produces a report in three sections: an item breakdown for each item, a frequency table of raw scores, and a summary table of test item statistics.
Each of these sections will be described below, and an actual item analysis report produced by the Measurement and Evaluation Center will provide examples. Guidelines will be suggested for interpreting some of the item analysis statistics. While these guidelines will be appropriate for many tests and test items, there are some purposes and types of test items for which these guidelines are not appropriate.
Table 1 shows the standard item breakdown for each item on the test. The item's number ("ITEM NO.") and its correct alternative ("KEYED RESPONSE") are presented at the top of the table.
Item Breakdown
ITEM NO. 3 ..... Keyed Response = B
| SPLIT | OMIT | A | B | C | D | E | SUM |
|---|---|---|---|---|---|---|---|
| ____ | ____ | ____ | ____ | ____ | ____ | ____ | ____ |
| SUM | | | | | | | |
| MEAN | | 64.7 | 72.7 | 61.1 | 65.7 | 63.3 | |
P TOT = 1.00 ..... P = .60 ..... R(IT) = .39
This frequency table shows how a class of 932 students responded to the third item on a test. Based on the total scores made on the test, the full group was divided (split) into four subgroups, each representing approximately one-fourth of the class and identified by a number (1 to 4) under the column head SPLIT. Each of the four rows shows how many respondents in that subgroup omitted Item 3 (OMIT column) and how many selected each of the five possible responses (Columns A - E). In this example, the line for the Split 1 subgroup reports the responses of those students who scored (on their total score) in the top one-fourth of the class, while the line for the Split 4 subgroup reports the responses of those who scored in the bottom one-fourth of the class. The "SUM" row and column report the column and row totals.
The "MEAN" row presents the average of the raw scores of those students who chose each of the possible responses (A - E) to an item. (The raw score is the number of items to which a person gives the correct response.)
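The split-by-quarter tabulation and the MEAN row described above can be sketched in Python. This is only an illustration, not the Center's program: the function name `item_breakdown`, the use of `None` to mark an omitted item, and the handling of tied scores and uneven group sizes are all assumptions.

```python
from collections import Counter

def item_breakdown(responses, totals, alternatives="ABCDE"):
    """Tabulate one item.

    responses: chosen alternative per person (None = item omitted)
    totals:    total raw score per person, in the same order
    """
    n = len(responses)
    # Rank persons from highest to lowest total score (ties keep input order).
    order = sorted(range(n), key=lambda i: totals[i], reverse=True)

    table = {}
    for split in range(4):
        # Split 1 is the top quarter, Split 4 the bottom quarter.
        members = order[split * n // 4:(split + 1) * n // 4]
        counts = Counter(
            "OMIT" if responses[i] is None else responses[i] for i in members
        )
        table[split + 1] = {k: counts.get(k, 0) for k in ["OMIT", *alternatives]}

    # MEAN row: average total raw score of the persons choosing each alternative.
    means = {}
    for alt in alternatives:
        scores = [totals[i] for i in range(n) if responses[i] == alt]
        means[alt] = sum(scores) / len(scores) if scores else None
    return table, means

# Hypothetical data for eight persons (not taken from the report).
responses = ["B", "B", "A", "B", "C", "B", "A", None]
totals = [90, 85, 80, 75, 70, 65, 60, 55]
table, means = item_breakdown(responses, totals)
```

In a well-functioning item, the count for the keyed response should fall from Split 1 to Split 4, and its MEAN should be the highest in the row.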
In an item breakdown, the number of persons who responded to the item is expressed as a proportion (P TOT) of the total number of persons on whose responses the item analysis was based. In the example shown in Table 1, everyone who took the test responded to Item 3; therefore, P TOT = 1.00.
The number of persons who answered an item with the correct response is expressed as a proportion (P) of the total number of persons who took the examination; P is an index of the item's level of difficulty for a given group of persons. This value is useful in evaluating whether or not the difficulty of an item is suited to the level of preparation for the persons taking the test. The higher the difficulty (P) value, the easier the item. In Table 1, 60% (i.e., 0.60 x 100%) of the students who took the examination answered this item correctly.
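As a small sketch of the two proportions just defined, the function below computes P TOT and P from a list of responses. The name `p_tot_and_p` and the `None` convention for omitted items are assumptions made for illustration.

```python
def p_tot_and_p(responses, key):
    """Return (P TOT, P) for one item.

    responses: chosen alternative per person (None = item omitted)
    key:       the keyed (correct) response
    """
    n = len(responses)
    answered = sum(1 for r in responses if r is not None)
    correct = sum(1 for r in responses if r == key)
    # Both proportions use the full group as the denominator.
    return answered / n, correct / n

# Hypothetical five-person example.
p_tot, p = p_tot_and_p(["B", "A", "B", None, "B"], key="B")

# Using the report's own counts for Item 3: 561 correct out of 932 persons.
p_item3 = round(561 / 932, 2)
```

With the report's counts, `p_item3` reproduces the P = .60 shown for Item 3.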
Because, in practice, users of an item analysis will obtain a range of difficulty values around the optimal difficulty value, it is best to use those items whose average difficulty levels approximate the optimal difficulty value. Ideally, this range should be from about 0.50 to 0.90. Items with difficulty values close to 1.00, or at or below the guessing level (for example, 0.20 for an item with five alternatives), may need to be rewritten or discarded. If possible, it is desirable to place the easier items at the beginning of the examination and then to make subsequent items progressively more difficult.
R(IT) is an item-total coefficient of correlation. It indicates the item's discrimination value--that is, whether or not the scores on the item differentiate between those persons who score high and those who score low on the test as a whole. For items that are scored 1 if answered correctly and 0 otherwise, this index is a point-biserial coefficient of correlation. In Table 1, the correlation coefficient between the scores on Item 3 and the raw scores (number correct) on the total test is 0.39.
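Because a point-biserial coefficient is simply a Pearson correlation in which one variable takes only the values 0 and 1, R(IT) can be sketched as follows. The function name `point_biserial` is hypothetical, and the sketch correlates the item with the total score that includes the item itself; whether the Center's program applies any correction is not stated in the source.

```python
import math

def point_biserial(item_scores, totals):
    """Pearson correlation between 0/1 item scores and total raw scores."""
    n = len(totals)
    mean_x = sum(item_scores) / n
    mean_y = sum(totals) / n
    # Sums of cross-products and squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, totals))
    sxx = sum((x - mean_x) ** 2 for x in item_scores)
    syy = sum((y - mean_y) ** 2 for y in totals)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the two highest scorers answered the item correctly.
r_it = point_biserial([1, 1, 0, 0], [4, 3, 2, 1])
```

A positive value means that persons who answered the item correctly tended to score higher on the test as a whole; the function is undefined when everyone (or no one) answers the item correctly, since the item then has no variance.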
As an example of a frequency table, Table 2 reports the number of persons who received each raw score. After the name of the test, Table 2 presents the total number of persons whose scores are included in the item analysis (N TOTAL), the mean raw score on the test for the total group in the item analysis (MEAN TOTAL), the standard deviation of the distribution of raw scores (S.D. TOTAL), and an index of the test's reliability or internal consistency (ALPHA). The standard deviation is a measure of the variability among the scores made by the group of persons tested. ALPHA provides one indication of the measurement accuracy of the test and, therefore, of the overall quality of the test. If a high alpha value is obtained (1.00 is the highest possible value), the test is considered to be highly reliable. However, a low alpha value may be due either to poor reliability or to other factors. For example, alpha is affected by the number of items on the test; generally speaking, the more items there are on the test, the higher the alpha value for the test will be. As a rough guideline, teacher-made classroom tests are typically thought to be sufficiently reliable if they have alpha values of 0.65 to 0.90; however, certain kinds of tests may have much lower alpha values and still have sufficient reliability.
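The source does not give the formula behind ALPHA. Assuming it is the usual coefficient alpha (Cronbach's alpha, which for items scored 0 or 1 coincides with KR-20), a minimal sketch looks like this; the function name `coefficient_alpha` and the use of population variances are assumptions.

```python
def coefficient_alpha(item_matrix):
    """Coefficient alpha for a persons-by-items matrix of item scores.

    item_matrix: one list per person, each holding that person's item scores.
    """
    k = len(item_matrix[0])  # number of items

    def variance(values):
        # Population variance (divides by N); an assumption of this sketch.
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [variance([person[j] for person in item_matrix]) for j in range(k)]
    total_var = variance([sum(person) for person in item_matrix])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical scores for four persons on three items.
alpha = coefficient_alpha([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
```

The factor `k / (k - 1)` and the sum over item variances show why, other things being equal, adding comparable items tends to raise alpha.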
Frequency Table of Raw Scores
(test name)
N TOTAL = 932 ..... MEAN TOTAL = 69.4 .....S.D. TOTAL = 10.2 ..... ALPHA = .84
| RAW SCORE | FREQ | PCTL RANK | PERCENT CORRECT | STAND. SCORE 50-10 | PCT |
|---|---|---|---|---|---|
Columns 1 and 2 of the table show the raw scores attained on the examination under RAW SCORE and the number (frequency) of students who received each raw score under FREQ.
Column 3, under PCTL RANK, presents each score's rank ordering relative to the other scores. The percentile rank (PCTL RANK) for a particular raw score is the percent of persons whose scores are lower than the midpoint of that score interval. (To see how PCTL RANK is calculated, please refer to the Glossary.)
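The Glossary calculation is not reproduced here. One common reading of "lower than the midpoint of that score interval" counts everyone below the score plus half of those at the score, which can be sketched as follows. The function name `percentile_ranks` and this midpoint convention are assumptions, not a confirmed description of the Center's program.

```python
def percentile_ranks(freq):
    """Map each raw score to its percentile rank.

    freq: dict of raw score -> number of persons with that score
    """
    n = sum(freq.values())
    ranks = {}
    below = 0  # persons with a strictly lower raw score
    for score in sorted(freq):
        # Everyone below, plus half of those at this score (the midpoint).
        ranks[score] = 100 * (below + 0.5 * freq[score]) / n
        below += freq[score]
    return ranks

# Hypothetical frequency table for ten persons.
ranks = percentile_ranks({50: 2, 60: 5, 70: 3})
```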
Column 4 shows the PERCENT CORRECT, which is calculated by dividing the total raw score (the total number of correct items) by the total number of items on which the item analysis was based and then multiplying by 100%. This information is most useful when one wants to assign a numerical grade on a 100-point scale.
Column 5 contains the standardized score corresponding to a particular raw score. Unless otherwise specified by the person requesting the item analysis, the raw scores are converted to a standard scale with a mean score of 50 points and a standard deviation of 10 points for this group of persons.
Conceptually, these standardized scores are simple z-scores for which the mean has been changed from 0.00 points to 50.00 and for which the standard deviation has been changed from 1.00 point to 10.00. Comparison between or among tests can be made in terms of z-scores; however, all z-scores below the mean (0.00) will be negative. The use of negative and positive scores for the comparison of two tests is not as easy as is the use of positive scores only. Therefore, the primary reason for using standardized scores with a mean of 50 and a standard deviation of 10 is to eliminate this problem. Comparison of standardized scores across two tests should be made only when the frequency distributions of standardized scores for both tests are very similar and when the standardized score scales are the same.
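The conversions behind Columns 4 and 5 can be sketched in a few lines. The function names are hypothetical; the mean of 69.4 and standard deviation of 10.2 are the values printed in the Table 2 header, while the 60-item test length in the last line is invented for the example because the report excerpt does not state it.

```python
def percent_correct(raw_score, n_items):
    """Raw score as a percentage of the number of items analyzed."""
    return raw_score / n_items * 100

def standard_score(raw_score, mean, sd, new_mean=50.0, new_sd=10.0):
    """Rescale a raw score: z-score first, then shift to mean 50, S.D. 10."""
    z = (raw_score - mean) / sd
    return new_mean + new_sd * z

# A raw score equal to the group mean maps to 50.
at_mean = standard_score(69.4, mean=69.4, sd=10.2)

# A raw score one standard deviation above the mean maps to about 60.
one_sd_above = standard_score(79.6, mean=69.4, sd=10.2)

# Hypothetical 60-item test: 45 correct answers.
pct = percent_correct(45, 60)
```

Because the rescaling is linear, the ordering and relative spacing of scores are unchanged; only the negative values of the z-score scale are removed.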
Column 6 (PCT) shows each raw score's frequency expressed as a percentage of N TOTAL, that is, the percent of persons who received that score.
For easy comparison and interpretation, this section summarizes the test item statistics presented in Table 1. After the name of the test, Table 3 presents N TOTAL, MEAN TOTAL, S.D. TOTAL, and ALPHA, which are defined above for Table 2.
Summary Table of Test Item Statistics
(test name)
N TOTAL = 932 ..... MEAN TOTAL = 69.4 .....S.D. TOTAL = 10.2 ..... ALPHA = .84
| ITEM | P | R(IT) | NC | MC | MI | OMIT | A | B | C | D | E |
|---|---|---|---|---|---|---|---|---|---|---|---|
| .83 | 774 | 70.39 | 64.66 | 122 | 774 | ||||||
| .90 | 836 | 70.08 | 63.64 | 836 | |||||||
| .60 | 561 | 72.69 | 64.47 | 233 | 561 | ||||||
| .66 | 612 | 71.06 | 66.26 | 612 | 295 | ||||||
| .69 | 641 | 71.26 | 65.35 | 152 | 641 | 114 | |||||
| .69 | 639 | 70.70 | 66.60 | 187 | 639 | ||||||
| .83 | 771 | 70.45 | 64.45 | 103 | 771 | ||||||
| .86 | 805 | 70.56 | 62.19 | 805 | 104 | ||||||
| .92 | 856 | 70.05 | 62.32 | 856 |
The first three columns of Table 3 (ITEM, P, and R(IT)) refer to the item's number, its level of difficulty, and its discrimination power, respectively; these terms were defined in Section 1.
The NC column refers to the number of persons who answered the item correctly. The mean of the raw scores of those persons who answered an item correctly is denoted by MC. The MI column contains the mean total raw score of the persons who did not answer the item with the correct response; this value is the same as the weighted average of the MEAN values of the OMIT group and each of the incorrect response groups. The mean score of those persons who answered an item correctly (MC) should be higher than the mean score of those who answered the item incorrectly (MI). The OMIT column indicates the number of persons who omitted the item. The last five columns, labeled A, B, C, D, E, represent the item's five alternatives and contain the number of persons who selected each alternative for a given item.
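The relationship among NC, MC, and MI can be sketched and checked against the report's own numbers. The function name `nc_mc_mi` is hypothetical. The check uses the Table 3 row with P = .60, whose values agree with those shown for Item 3 in Table 1: weighting MC and MI by their group sizes should recover MEAN TOTAL.

```python
def nc_mc_mi(item_scores, totals):
    """Return (NC, MC, MI) for one item.

    item_scores: 1 if the person answered correctly, 0 otherwise
                 (omits and wrong answers both count as 0)
    totals:      total raw score per person
    """
    correct = [t for s, t in zip(item_scores, totals) if s == 1]
    incorrect = [t for s, t in zip(item_scores, totals) if s != 1]
    mc = sum(correct) / len(correct) if correct else None
    mi = sum(incorrect) / len(incorrect) if incorrect else None
    return len(correct), mc, mi

# Hypothetical four-person example.
example = nc_mc_mi([1, 0, 1, 0], [10, 6, 8, 4])

# Consistency check with values printed in the report.
n_total, nc, mc, mi = 932, 561, 72.69, 64.47
overall_mean = (nc * mc + (n_total - nc) * mi) / n_total
print(round(overall_mean, 1))  # prints 69.4, matching MEAN TOTAL
```

The same check is a quick way to spot transcription errors in a summary row, since P, NC, MC, and MI are tied together by N TOTAL and MEAN TOTAL.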
The following is a list of some options that are available with the Item Analysis Program of the Measurement and Evaluation Center (MEC):