MEC Item Analysis

An item analysis provides information useful for improving the quality and accuracy of multiple-choice tests. The program used by the Measurement and Evaluation Center produces a report in three sections:

  1. Item Breakdown
  2. Frequency Table of Raw Scores
  3. Summary Table of Test Item Statistics

Each of these sections will be described below, and an actual item analysis report produced by the Measurement and Evaluation Center will provide examples. Guidelines will be suggested for interpreting some of the item analysis statistics. While these guidelines will be appropriate for many tests and test items, there are some purposes and types of test items for which these guidelines are not appropriate.

Table 1 shows the standard item breakdown for each item on the test. The item's number ("ITEM NO.") and its correct alternative ("KEYED RESPONSE") are presented at the top of the table.

Section 1: Item Breakdown

Table 1

Item Breakdown


ITEM NO. 3 ............................... Keyed Response = B


SPLIT   OMIT      A      B      C      D      E    SUM
    1      0     20    199      1     13      0    233
    2      0     55    157      5     16      0    233
    3      0     66    119     14     31      3    233
    4      0     92     86     26     28      1    233
______________________________________________________
  SUM      0    233    561     46     88      4    932
 MEAN      0   64.7   72.7   61.1   65.7   63.3

P TOT = 1.00 .......... P = .60 .......... R(IT) = .39

This frequency table shows how a class of 932 students responded to the third item on a test. Based on the total scores made on the test, the full group was divided (split) into four subgroups, each representing approximately one-fourth of the class and identified by a number (1 to 4) under the column head SPLIT. Each of the four rows shows how many respondents in that subgroup omitted Item 3 (OMIT column) and how many selected each of the five possible responses (Columns A - E). In this example, the line for the Split 1 subgroup reports the responses of those students who scored (on their total score) in the top one-fourth of the class, while the line for the Split 4 subgroup reports the responses of those who scored in the bottom one-fourth of the class. The "SUM" row and column report the column and row totals.
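The way such a frequency table is built can be sketched in a few lines of Python. This is only an illustration, not the Center's program: the function name `item_breakdown`, the eight-person data set, and the handling of tied total scores (ties simply keep their input order) are all assumptions made for the example.

```python
from collections import Counter


def item_breakdown(total_scores, responses, n_splits=4, alternatives="ABCDE"):
    """Count responses to one item within score-ranked subgroups.

    total_scores: total raw score of each person.
    responses:    that person's answer to the item ("A".."E"), or None if omitted.
    Returns one row per subgroup; the first row is the highest-scoring
    subgroup (SPLIT 1).
    """
    # Rank persons from highest to lowest total score.
    order = sorted(range(len(total_scores)),
                   key=lambda i: total_scores[i], reverse=True)
    n = len(order)
    rows = []
    for s in range(n_splits):
        # Integer arithmetic yields subgroups of approximately equal size.
        members = order[s * n // n_splits:(s + 1) * n // n_splits]
        counts = Counter(responses[i] for i in members)
        row = {"SPLIT": s + 1, "OMIT": counts[None]}
        for alt in alternatives:
            row[alt] = counts[alt]
        row["SUM"] = len(members)
        rows.append(row)
    return rows


# Made-up data: eight persons, one item.
scores = [10, 9, 8, 7, 6, 5, 4, 3]
answers = ["B", "B", "B", "A", "B", "C", None, "A"]
rows = item_breakdown(scores, answers)
for row in rows:
    print(row)
```

With real data, each printed row corresponds to one SPLIT line of the item breakdown, and summing the rows column by column gives the SUM row.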

The "MEAN" row presents the average of the raw scores of those students who chose each of the possible responses (A - E) to an item. (The raw score is the number of items to which a person gives the correct response.)

Guidelines.
Typically, as one moves from the first split (#1) to the last split (#4), it is desirable to see a decrease in the number of persons who chose the keyed response. This pattern should be reversed, however, for the incorrect alternatives. Table 1 shows a typical pattern of responses for a well-constructed item. If an alternative is chosen by either a very small proportion of persons or no one (see Column E), then that alternative may need to be rewritten to make it more attractive, but still incorrect; alternatively, the test item may need to be discarded. In addition, the mean score on the total test for the four subgroups combined should be higher for the correct response than for the incorrect alternatives.

In an item breakdown, the number of persons who responded to the item is expressed as a proportion (P TOT) of the total number of persons on whose responses the item analysis was based. In the example shown in Table 1, everyone who took the test responded to Item 3; therefore, P TOT = 1.00.

Guidelines.
The proportion of persons answering an item should be high. A low value may indicate (a) that an item is too difficult or (b) that the test is too long (i.e., the persons may not have had enough time to answer all of the items; this may have resulted in low P TOT's for some items, especially those at the end of the test).

The number of persons who answered an item with the correct response is expressed as a proportion (P) of the total number of persons who took the examination; P is an index of the item's level of difficulty for a given group of persons. This value is useful in evaluating whether or not the difficulty of an item is suited to the level of preparation for the persons taking the test. The higher the difficulty (P) value, the easier the item. In Table 1, 60% (i.e., 0.60 x 100%) of the students who took the examination answered this item correctly.
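As a check on the two proportions, P TOT and P can be recomputed from the SUM row of Table 1. The short Python sketch below uses only the counts printed in the table; the variable names are chosen for the example.

```python
# SUM row of Table 1 (Item 3, keyed response B).
counts = {"OMIT": 0, "A": 233, "B": 561, "C": 46, "D": 88, "E": 4}
keyed = "B"

n_total = sum(counts.values())                  # all persons in the analysis
p_tot = (n_total - counts["OMIT"]) / n_total    # proportion who responded
p = counts[keyed] / n_total                     # proportion answering correctly

print(f"N = {n_total}, P TOT = {p_tot:.2f}, P = {p:.2f}")
# N = 932, P TOT = 1.00, P = 0.60
```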

Guidelines.
In order to provide the greatest amount of useful information about differences in subject knowledge or skills among the persons tested, the difficulty level of an item should be slightly more than halfway between 1.00 and the guessing level for that item. For example, on a four-alternative, multiple-choice item, the random guessing level is 1.00/4 = 0.25; therefore, the optimal difficulty level is 0.25 + (1.00 - 0.25) / 2 = 0.625, or about 0.62. On a true-false question, the guessing level is 1.00/2 = 0.50, so the optimal difficulty level is 0.50 + (1.00 - 0.50) / 2 = 0.75.
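The same arithmetic applies to an item with any number of alternatives. The function below is a minimal sketch of the midpoint rule used in the examples above; the name `optimal_difficulty` is an assumption made for the illustration.

```python
def optimal_difficulty(n_alternatives):
    """Midpoint between the random-guessing level and 1.00."""
    guessing_level = 1.0 / n_alternatives
    return guessing_level + (1.0 - guessing_level) / 2.0


print(optimal_difficulty(4))  # four-alternative item: 0.625
print(optimal_difficulty(2))  # true-false item: 0.75
```

For a five-alternative item, such as those in the sample report, the same rule gives a value of about 0.60.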

Because, in practice, users of an item analysis will obtain a range of difficulty values around the optimal difficulty value, it is best to use those items whose average difficulty levels approximate the optimal difficulty value. Ideally, this range should be from about 0.50 to 0.90. Items with difficulty values close to 1.00 or at/below the guessing level may need to be rewritten or discarded. If possible, it is desirable to place the easier items at the beginning of the examination and then to make subsequent items progressively more difficult.

R(IT) is an item-total coefficient of correlation. It indicates the item's discrimination value--that is, whether or not the scores on the item differentiate between those persons who score high and those who score low on the test as a whole. For items that are scored 1 if answered correctly and 0 otherwise, this index is a point-biserial coefficient of correlation. In Table 1, the correlation coefficient between the scores on Item 3 and the raw scores (number correct) on the total test is 0.39.
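For an item scored 1 or 0, the point-biserial coefficient can be recovered from summary statistics: the mean total score of those who answered correctly, the mean of those who did not, the standard deviation of the total scores, and P. The sketch below applies that textbook identity to the values the sample report gives for Item 3 (MC = 72.69 and MI = 64.47 from Table 3, S.D. TOTAL = 10.2). It assumes S.D. TOTAL is computed with N in the denominator; with 932 persons the choice makes no practical difference. The function name is hypothetical.

```python
import math


def point_biserial(mean_correct, mean_incorrect, sd_total, p):
    """Point-biserial correlation between a 0/1 item and the total score."""
    return (mean_correct - mean_incorrect) / sd_total * math.sqrt(p * (1.0 - p))


r_it = point_biserial(72.69, 64.47, 10.2, 561 / 932)
print(round(r_it, 2))  # 0.39
```

The result agrees with the R(IT) value of .39 printed for Item 3.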

Guidelines.
In general, R(IT) values should be greater than 0.20. Items with values less than 0.20 do not yield much information about differences among the abilities of the persons tested. If this value is negative for an item, then the scoring key should be checked to ensure that the item was scored correctly. If it was scored correctly, then the item probably should be discarded or revised.

As an example of a frequency table, Table 2 reports the number of persons who received each raw score. After the name of the test, Table 2 presents the total number of persons whose scores are included in the item analysis (N TOTAL), the mean raw score on the test for the total group in the item analysis (MEAN TOTAL), the standard deviation of the distribution of raw scores (S.D. TOTAL), and an index of the test's reliability or internal consistency (ALPHA). The standard deviation is a measure of the variability among the scores made by the group of persons tested. ALPHA provides one indication of the measurement accuracy of the test and, therefore, of the overall quality of the test. If a high alpha value is obtained (1.00 is the highest possible value), the test is considered to be highly reliable. However, a low alpha value may be due either to poor reliability or to other factors. For example, alpha is affected by the number of items on the test; generally speaking, the more items there are on the test, the higher the alpha value for the test will be. As a rough guideline, teacher-made classroom tests are typically thought to be sufficiently reliable if they have alpha values of 0.65 to 0.90; however, certain kinds of tests may have much lower alpha values and still have sufficient reliability.
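The handout does not give the formula the program uses for ALPHA. Coefficient alpha in its usual form compares the sum of the item variances with the variance of the total scores, and the Python sketch below follows that form. The function name and the four-person, three-item data set are made up for the illustration.

```python
from statistics import pvariance


def coefficient_alpha(item_scores):
    """Coefficient alpha from a table of item scores.

    item_scores: one row per person, one score (for example 0 or 1) per item.
    """
    k = len(item_scores[0])                       # number of items
    totals = [sum(row) for row in item_scores]    # total score per person
    item_variances = [pvariance([row[j] for row in item_scores])
                      for j in range(k)]
    return k / (k - 1) * (1.0 - sum(item_variances) / pvariance(totals))


# Made-up data: four persons, three items scored 1 (correct) or 0.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = coefficient_alpha(scores)
print(round(alpha, 2))  # 0.75
```

Because only the ratio of variances enters the formula, using N or N - 1 in the variance denominators gives the same result as long as the choice is applied consistently.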

Section 2. Frequency Table of Raw Scores

Table 2

Frequency Table of Raw Scores

(test name)


N TOTAL = 932 ..... MEAN TOTAL = 69.4 .....S.D. TOTAL = 10.2 ..... ALPHA = .84


RAW             PCTL   PERCENT   STAND. 50-10
SCORE   FREQ    RANK   CORRECT          SCORE    PCT
   99      1     100     99.00          79.05     .1
   97      1     100     97.00          77.09     .1
   95      1     100     95.00          72.12     .1
   94      1     100     94.00          74.14     .1
   93      1      99     93.00          73.16     .2
   92      1      99     92.00          72.18     .3
   91      1      99     91.00          71.19     .9
    .      .       .         .              .      .
    .      .       .         .              .      .
    .      .       .         .              .      .
   70     40      54     70.00          50.57    4.3
   69     35      50     69.00          49.59    3.8
   68     38      46     68.00          48.61    4.1
   67     36      42     67.00          47.63    3.9
   66     43      37     66.00          46.65    4.6
   65     42      33     65.00          45.66    4.5
    .      .       .         .              .      .
    .      .       .         .              .      .
    .      .       .         .              .      .
   42      2       1     42.00          23.08     .2
   41      1       1     41.00          22.10     .1
   40      1       1     40.00          21.12     .1
   39      1       0     39.00          20.13     .1
   37      1       0     37.00          18.17     .1
   36      3       0     36.00          17.19     .3


Columns 1 and 2 of the table show the raw scores attained on the examination under RAW SCORE and the number (frequency) of students who received each raw score under FREQ.

Column 3 under PCTL RANK presents each score's rank ordering relative to the other scores. The percentile rank (PCTL RANK) for a particular raw score is the percent of persons whose scores are lower than the midpoint of that score interval. (To see how PCTL RANK is calculated, please refer to the Glossary.)
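The midpoint definition of percentile rank counts everyone below a score plus half of those at that score. A minimal Python sketch follows; the function name and the small frequency table are made up for the illustration and are not taken from the sample report.

```python
def percentile_ranks(freq):
    """Percentile rank of each raw score from a frequency table.

    freq: dict mapping raw score -> number of persons with that score.
    """
    n = sum(freq.values())
    ranks = {}
    below = 0  # persons scoring lower than the current score
    for score in sorted(freq):
        ranks[score] = 100.0 * (below + 0.5 * freq[score]) / n
        below += freq[score]
    return ranks


# Made-up frequency table: one person scored 3, two scored 4, one scored 5.
ranks = percentile_ranks({3: 1, 4: 2, 5: 1})
print(ranks)  # {3: 12.5, 4: 50.0, 5: 87.5}
```

The report prints these values rounded to whole numbers.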

Column 4 shows the PERCENT CORRECT, which is calculated by dividing the total raw score (the total number of correct items) by the total number of items on which the item analysis was based and then multiplying by 100%. This information is most useful when one wants to assign a numerical grade on a 100-point scale.

Column 5 contains the standardized score corresponding to a particular raw score. Unless otherwise specified by the person requesting the item analysis, the raw scores are converted to a standard scale with a mean score of 50 points and a standard deviation of 10 points for this group of persons.

Conceptually, these standardized scores are simple z-scores for which the mean has been changed from 0.00 points to 50.00 and for which the standard deviation has been changed from 1.00 point to 10.00. Comparison between or among tests can be made in terms of z-scores; however, all z-scores below the mean (0.00) will be negative. The use of negative and positive scores for the comparison of two tests is not as easy as is the use of positive scores only. Therefore, the primary reason for using standardized scores with a mean of 50 and a standard deviation of 10 is to eliminate this problem. Comparison of standardized scores across two tests should be made only when the frequency distributions of standardized scores for both tests are very similar and when the standardized score scales are the same.
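The conversion described above amounts to computing a z-score and then rescaling it. The sketch below uses the rounded MEAN TOTAL (69.4) and S.D. TOTAL (10.2) printed in Table 2, so its results can differ slightly in the second decimal place from the report, which presumably works from unrounded values. The function name is an assumption made for the example.

```python
def standard_score(raw, mean, sd, new_mean=50.0, new_sd=10.0):
    """Convert a raw score to a standard score with the given mean and SD."""
    z = (raw - mean) / sd          # z-score: mean 0, standard deviation 1
    return new_mean + new_sd * z   # rescale to mean 50, standard deviation 10


print(round(standard_score(70, 69.4, 10.2), 2))  # 50.59 (Table 2 reports 50.57)
print(standard_score(69.4, 69.4, 10.2))          # 50.0: the mean maps to 50
```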

Column 6 (PCT) expresses each score's frequency as a percent of the total number of persons on whose responses the item analysis is based.
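Columns 4 and 6 are simple percentages, and the row for a raw score of 70 in Table 2 can be reproduced as follows. The value of `n_items` is an assumption: the sample report never states the test length, but Table 3 ends with Item 100 and each PERCENT CORRECT equals its raw score, which suggests 100 items.

```python
n_total = 932   # N TOTAL from Table 2
n_items = 100   # assumed test length (not stated in the report)


def percent_correct(raw_score, n_items):
    """Raw score as a percent of the number of items (Column 4)."""
    return 100.0 * raw_score / n_items


def pct(freq, n_total):
    """Frequency of a raw score as a percent of all persons (Column 6)."""
    return 100.0 * freq / n_total


print(f"{percent_correct(70, n_items):.2f}")  # 70.00
print(f"{pct(40, n_total):.1f}")              # 4.3
```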

For easy comparison and interpretation, this section summarizes the test item statistics presented in Table 1. After the name of the test, Table 3 presents N TOTAL, MEAN TOTAL, S.D. TOTAL, and ALPHA, which are defined above for Table 2.

Section 3. Summary Table of Test Item Statistics

Table 3

Summary Table of Test Item Statistics

(test name)


N TOTAL = 932 ..... MEAN TOTAL = 69.4 .....S.D. TOTAL = 10.2 ..... ALPHA = .84


ITEM     P   R(IT)    NC      MC      MI   OMIT     A     B     C     D     E
  1.   .83    .21    744   70.39   64.66      0   122     1    35     0   774
  2.   .90    .19    836   70.08   63.64      1    59    21     0    15   836
  3.   .60    .39    561   72.69   64.47      0   233   561    46    88     4
  4.   .66    .22    612   71.06   66.26      1   612    15     3   295     6
  5.   .69    .27    641   71.26   65.35      1   152   641   114    18     6
  6.   .69    .19    639   70.70   66.60      1   187    99   639     1     5
  7.   .83    .22    771   70.45   64.45      0   103   771     5    40    13
  .      .      .      .       .       .      .     .     .     .     .     .
  .      .      .      .       .       .      .     .     .     .     .     .
  .      .      .      .       .       .      .     .     .     .     .     .
 99.   .68    .28    805   70.56   62.19      8     0   805     2    13   104
100.   .92    .21    856   70.05   62.32     10     0     1    61   856     4
4


The first three columns of Table 3 (ITEM, P, and R(IT)) give the item's number, its level of difficulty, and its discrimination power, respectively; these terms were defined in Section 1.
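Because the summary table lists P and R(IT) side by side, the guidelines from Section 1 can be applied to every item in one pass. The sketch below is one possible way to do so, not part of the Center's program. It uses the P and R(IT) values as printed in Table 3, treats the "about 0.50 to 0.90" difficulty range and the 0.20 discrimination threshold as fixed cutoffs, and the function name and flag wording are made up for the illustration.

```python
def flag_item(p, r_it, p_range=(0.50, 0.90), min_r=0.20):
    """Return a list of guideline warnings for one item (empty if none)."""
    flags = []
    if not (p_range[0] <= p <= p_range[1]):
        flags.append("difficulty outside suggested range")
    if r_it < 0:
        flags.append("negative discrimination: check the scoring key")
    elif r_it < min_r:
        flags.append("low discrimination")
    return flags


# (P, R(IT)) for the items shown in Table 3.
items = {
    1: (0.83, 0.21), 2: (0.90, 0.19), 3: (0.60, 0.39),
    4: (0.66, 0.22), 5: (0.69, 0.27), 6: (0.69, 0.19),
    7: (0.83, 0.22), 99: (0.68, 0.28), 100: (0.92, 0.21),
}

flagged = {}
for number, (p, r_it) in items.items():
    flags = flag_item(p, r_it)
    if flags:
        flagged[number] = flags

for number, flags in flagged.items():
    print(number, flags)
```

With these cutoffs, Items 2 and 6 are flagged for low discrimination and Item 100 for difficulty. Such flags are only prompts for review; as noted earlier, the guidelines are not appropriate for every test or item.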

The NC column refers to the number of persons who answered the item correctly. The mean of the raw scores of those persons who answered an item correctly is denoted by MC. The MI column contains the mean total raw score of the persons who did not answer the item with the correct response; this value is the same as the weighted average of the MEAN values of the OMIT group and each of the incorrect response groups. The mean score of those persons who answered an item correctly (MC) should be higher than the mean score of those who answered the item incorrectly (MI). The OMIT column indicates the number of persons who omitted the item. The last five columns, labeled A, B, C, D, E, represent the item's five alternatives and contain the number of persons who selected each alternative for a given item.
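The weighted-average relationship for MI can be checked against the sample report. The sketch below takes the SUM and MEAN rows of Table 1 for the incorrect alternatives of Item 3 (A, C, D, and E; no one omitted the item, so the OMIT group drops out) and compares the result with the MI value in Table 3. The variable names are chosen for the example.

```python
# Item 3, keyed response B: counts and mean total scores for the
# incorrect alternatives, taken from the SUM and MEAN rows of Table 1.
counts = {"A": 233, "C": 46, "D": 88, "E": 4}
means = {"A": 64.7, "C": 61.1, "D": 65.7, "E": 63.3}

n_incorrect = sum(counts.values())
mi = sum(counts[alt] * means[alt] for alt in counts) / n_incorrect

print(n_incorrect)    # 371
print(round(mi, 1))   # 64.5
```

The result agrees with the MI of 64.47 printed for Item 3 to within the rounding of the MEAN row, and it is well below the item's MC of 72.69, as the guideline recommends.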

Options

The following is a list of some options that are available with the Item Analysis Program of the Measurement and Evaluation Center (MEC):

  1. Each item on the test may contain up to five alternative responses. It is not necessary that all of the items on the test contain the same number of responses.
  2. For the item breakdown (see Table 1), the total group of persons can be divided (split) into any number of subgroups from 2 through 9.
  3. Any number of versions of a test (i.e., forms of the test that contain the same items arranged in different orders) can be combined into a single item analysis.
  4. One or more items may be omitted from the item analysis.
  5. When a test is composed of subsets of items (each with its own answer key), an item analysis can be performed for each subset and/or for the test as a whole.

Glossary of Item Analysis Terms

A, B, C, D, E
These letters refer to the alternative responses that are possible for answering a question.
ALPHA
Coefficient alpha, a measure of the internal consistency (reliability) of the test, considered to be an overall indicator of test quality. (See the journal references on the output listing for greater detail.)
FREQ
The number (frequency) of persons who received each raw score.
MC
Mean total raw score of the persons who answered the item with the correct response.
MEAN
Mean total score on the test, expressed in terms of the raw score (number of correct responses) scale, for those persons who gave the response represented by the column of frequencies immediately above the mean.
MEAN TOTAL
Mean total raw score on the test of all of the persons tested, expressed in terms of the raw score (number of correct responses) scale.
MI
Mean total score of the persons who did not answer the item with the correct response.
NC
Number of persons who answered the item with the correct response.
N TOTAL
Total number of persons on whose responses the item analysis is based.
OMIT
Number of persons who did not respond to the item.
P
The number of persons who answered the item with the correct response, expressed as a proportion of the total number of persons on whose responses the item analysis is based (NC divided by N TOTAL). P is an index of the difficulty of the item for that group of persons.
PCT
The number of persons who received each raw score, expressed as a percent of the total number of persons on whose responses the item analysis is based.
PCTL RANK
The percentile rank for a particular raw score is the percent of persons whose scores are lower than the midpoint of that particular score interval. (The number of persons whose scores are lower than the midpoint of a particular score interval is equal to the number of persons whose scores are lower than the given score plus half the number of persons whose scores are equal to the given score.)
PERCENT CORRECT
The total raw score expressed as a percent of the maximum possible raw score, which usually is the total number of items on which the item analysis is based.
RAW SCORE
Usually, the number of items to which a person gave the correct response.
R(IT)
A coefficient of correlation between scores on the item and the total scores on the test for the total number of persons on whose responses the item analysis is based. (For items that are scored 1 if answered correctly and 0 otherwise, this is a point-biserial correlation coefficient.) R(IT) is an index of the discrimination value of the item (i.e., whether or not it differentiated between persons who scored high on the test and those who scored low).
S.D. TOTAL
Standard deviation of the total raw scores of all of the persons tested, expressed in terms of the raw score scale. The standard deviation is a measure of variability among the scores.
SPLIT
Indicates that the total group of persons was divided (split) into a number of subgroups of approximately equal size. SPLIT 1 contains a summary of the responses made by the persons in the highest scoring subgroup, SPLIT 2 contains the responses made by those in the next-to-highest subgroup, etc.
STAND. 50-10 SCORE
A transformation of the corresponding raw score. These particular transformed scores (called Standard Scores) customarily are calculated to have a mean equal to 50 points and a standard deviation equal to 10 points.
SUM
Number of persons who chose each of the possible responses to an item (i.e., A, B, C, D, E); or, the number of persons in each subgroup (SPLIT).

This handout was prepared by Ralph J. De Ayala, Graduate Research Assistant III, and H. Paul Kelley, Director of the Measurement and Evaluation Center at The University of Texas at Austin, April 1987.


Updated April 1, 1997
MEC Scanning Office
Mail questions or comments to:
mecscore@www.utexas.edu