Teachers and Students
A sourcebook for UT-Austin faculty
Center for Teaching Effectiveness
University of Texas at Austin



Evaluating and Grading Students
Marilla D. Svinicki
Center for Teaching Effectiveness
University of Texas at Austin


The topic of this discussion is the design of an evaluation system for your course. Now you may be saying to yourself, "I haven't even met the class yet. How can I and why should I be thinking about how to evaluate them already?" Well, there are at least two reasons for doing it now, one very weighty reason and one very practical reason. First, the weighty reason. As outlined by the experts, the first step in the design of instruction is the identification of goals and objectives, followed closely by the design of evaluation. These two elements, the objectives and the evaluation method, determine which learning activities are needed in a situation. For example, if my objective is that the students will be able to describe the steps in the qualitative analysis of an unknown, and the evaluation of that objective will be in the laboratory, the learning activities should include laboratory time. If the evaluation will be on a written exam, I may not need to include actual laboratory experience; computer simulation or demonstration tapes may be sufficient. Therefore, before you can choose the types of learning activities your students will receive, you need to know what the final evaluation criteria will be, and those are most clearly laid out in the combination of objectives and evaluation.

On the more mundane level, you need to select an evaluation method before the beginning of the semester because, no matter how much we would like to think our students exist for the love of learning, we soon realize that one of their most frequent concerns is how they will be evaluated for a grade. This concern of the students has been codified in the University regulations which state that an instructor must notify the students of the basis for evaluation prior to the end of the add/drop period. Normally, this is done on the first day of class since it is usually a part of the syllabus.

Given these two very good reasons for getting started on the evaluation design, let's now consider what is involved in designing the evaluation system for your class.

There are two parts of the evaluation system which will require your attention. One is the selection of the types of activities which will be evaluated; the second is the selection of the grade assignment method. In this short time we cannot consider all the facets of these two topics, but what I hope to do is highlight the major considerations of each. If you want to go more deeply into any of the topics, you should feel free to contact us at the Center for additional information.

Selecting Types of Activities

Because so much depends upon the evaluation of a student's learning and the assigned grade, it is in everyone's interest to try to make the evaluation system as free from irrelevant errors as possible. Borrowing from the evaluation literature, I propose that you concern yourself with four R's of evaluation in attempting to design a system which will be acceptable to all concerned. Such a system should be:

Relevant
Reliable
Recognizable
Realistic

Let's look at each of these in turn and what it means for your course.

Relevant

In the jargon this is known as the validity of an evaluation method, but since that doesn't start with an R, I've changed it to relevance. This means that any activity used to evaluate a student's learning must be an accurate reflection of the skill or concept which is being tested. For example, if I am trying to determine if my students have learned the social and economic causes of the Civil War, the test must have questions which address that issue. Questions which ask students to list the major battles of the Civil War are not relevant to the objective. You may be saying that no one would be as foolish as that, but let me assure you that there are many documented cases of instructors who make equally flagrant violations of the principle of relevance. One of my favorites was in a graduate course in which the students were required to read a long list of primary sources and the test question was to match the authors' names with the article titles. The instructor claimed that if the students knew the pairings, they must have read the articles and, therefore, there was no need to actually test the contents.

What are the characteristics of a relevant evaluation? Oddly enough, one characteristic which might seem very mundane is that the evaluation activity must appear to be related to the course content (known in the jargon as face validity). A common complaint of students is that tests are not related to the content of the course or what was presented in class. Although we recognize that the things we assign are directly related to the course, the students often don't get the connection. And, student impressions aside, the more obvious the connection, the higher the probability that we really have a relevant, valid evaluation activity.

A second characteristic of relevant evaluations is that they are derived directly from the objectives (known in the jargon as content validity). The most obvious way to achieve this is to follow the objectives as closely as possible in selecting activities. If your course objective states that the students will be able to select the appropriate statistic for analyzing a given set of data, the evaluation should provide them with a set of data and have them select the analysis. The format for this evaluation could take many forms: an in-class exam where no actual calculations are done, an out-of-class homework assignment involving extensive calculations, a component of a large-scale semester-long project, an in-class exercise done in groups with class-generated data. All of these alternatives represent relevant tests of that objective. The difference among them would be in the sophistication possible under each condition. If I am working with undergraduates at the application level, and the skill I'm interested in is only selecting or recognizing, then in-class activities like multiple choice exams will meet my needs. If I am working with more sophisticated students, and expecting them to weigh the various alternatives before choosing, then the task requires additional time and resources and the out-of-class choices should be used.

Thus, one of the first steps in selecting an evaluation type is to analyze the objectives and design activities aimed directly at the content and level of those objectives. Figure 1 is a chart which suggests alternative evaluation methods for various levels and types of instructional objectives. To use this chart to select an evaluation method, begin by analyzing your own objectives with a chart like that shown in Figure 2. This chart lists the objectives of the course down one side and the level and type of those objectives across the top, with checks indicating the desired levels for each objective. By comparing the two charts, an instructor can identify evaluation possibilities for each course objective. The chart in Figure 2 is for a course I have taught on instructional design for graduate students working in industry and adult education. I can use these two charts to come up with a list of possible methods for evaluating each of the objectives of my course. A comparison of these possibilities can help me combine various objectives in different formats and test a single objective in more than one format. For example, a cross-check indicating that I expect the objective on describing the characteristics of various teaching methods to be at a low level suggests the possibility of using an in-class exam, or a discussion, or any other method appropriate for that level. By tying the evaluation methods I choose to the objectives, I increase the relevance of those evaluations and the probability that the resulting grade accurately reflects the students' skill and knowledge of the intended material.

Another characteristic of a relevant evaluation is how well performance on that evaluation predicts performance on other closely related skills, either at the same time (concurrent validity) or in the future (predictive validity). If the skill you are supposedly testing should be highly correlated with some other skill which you are also testing, chart the students' performances on each and see if they follow the same pattern. To use a simplified example, we can say that the ability to add two single digit numbers is a precursor to, and therefore highly correlated with, the ability to add two two-digit numbers. Therefore, students who do poorly on the former should not be able to do well on the latter. If they do, then one of the two tests is not measuring what it is supposed to be measuring and is therefore not relevant to the addition skill we are trying to evaluate.
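
The cross-check described above can be sketched in a few lines of Python. This is only an illustration: the scores are invented, the function name pearson_r is hypothetical, and the Pearson correlation coefficient is an assumed way of asking whether two sets of performances "follow the same pattern"; the text itself does not prescribe a particular statistic.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / (sx * sy)

# Invented scores for six students on two related tests (out of 10).
single_digit = [9, 7, 10, 4, 6, 8]
two_digit = [8, 6, 10, 3, 5, 8]

r = pearson_r(single_digit, two_digit)
print(f"correlation between the two tests: {r:.2f}")
```

A value near 1 suggests the two tests rank students in much the same way. A value near zero, for skills that ought to go together, would be a reason to look more closely at one or both tests.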

So the first R in our set is relevance, and it means that the evaluation activities we choose are really measuring the skills and knowledge which we intend them to measure.

Reliable

The second aspect of an evaluation activity is how reliably or consistently it measures whatever it measures without being affected too much by the situation in which the evaluation takes place. A student's grade should not hang on a single performance or on the mood of the person making the judgement. Of course, no system is so reliable that it will produce exactly the same evaluation of performance each time, but the goal here is to eliminate as many sources of error as possible and accept the fact that some errors and discrepancies will occur anyway.

The three biggest sources of error in reliably evaluating a student are 1) poor communication of expectations, 2) lack of consistent criteria for judgement, and 3) lack of sufficient information about performance.

Poor communication of expectations means that poor student performance may be the result of the student's failure to correctly interpret the task requirements. In written exams this usually is caused by ambiguous questions, unclear instructions, corrections given verbally during the test, and so on. In each case a bad grade is not a result of the student not knowing the material; it is a result of the student not understanding the question. In out-of-class assignments, this most often occurs when the instructor makes the assignment verbally without a written backup. The task, as originally designed, may be a very fine and relevant measure of the objective, but the way it is presented causes it to be misinterpreted and the student ends up answering a different question than the instructor intended.

Lack of consistent criteria for judgement means that if the same performance were to be judged a second time by the same grader, or if a second grader evaluated it, it might not receive the same grade because the basis for judging was unclear. The clearer the criterion for judging a student's performance, the more reliable the evaluation becomes. For example, one real strength of multiple choice tests is that the grading is very reliable. Either the students marked the correct answer or they didn't; very little is left to the judgement of the grader. On the other hand, essay tests are notoriously unreliable unless the instructor takes pains to make the criteria explicit and keeps checking to make sure he or she is not straying too far from the preset criteria. Therefore, to make your evaluation system reliable in this sense, choose types of evaluations which have clear standards you can specify for yourself, for others who may be grading in your course, and for your students.

The lack of sufficient information is the third source of error in evaluating students, not just in terms of the amount of information, but also in terms of variety of information sources. Not everyone excels in every format. Using only one format may introduce a source of bias for or against some students and lower the reliability of an evaluation. Let's look at an example in Figure 3. If we were to base our judgement of this student on the first exam score only, we might say that she was a B student. Then we add a second score and our estimate drops. A third score reinforces the first estimate.

Figure 3: Example
Hour Exams (100 points each): 85, 35, 65, 89, 90, 94
Final (100 points): 93
Lab Work (10 points each): 3, 2, 5, 7
Papers (50 points each): 10, 20

Which is correct? We need more information. Looking at the total set of exam scores, we find that this really is a good student who perhaps takes a while to get started. Were her grade based on only the first few scores, it would be unreliable. Now, let's look at a second aspect of more information as illustrated by the addition of her other grades, these on labs and reports. Obviously, this student excels at in-class exams, but does very poorly when longer analyses are required or when practical applications are tested. Any one set of activities alone does not give a reliable measure of this student's performance. We need them all to assign a reliable grade.
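
To make the Figure 3 example concrete, the following Python sketch recomputes the student's standing as each hour exam is added and then summarizes each type of activity as a percent of the points possible. The scores are those listed in Figure 3; the function names and the choice to summarize by percent are assumptions made for illustration.

```python
from itertools import accumulate

# Scores from Figure 3: (points earned on each item, points possible per item).
figure3 = {
    "hour exams": ([85, 35, 65, 89, 90, 94], 100),
    "final": ([93], 100),
    "lab work": ([3, 2, 5, 7], 10),
    "papers": ([10, 20], 50),
}

def percent(scores, possible):
    """Percent of all available points earned across a set of items."""
    return 100 * sum(scores) / (possible * len(scores))

def running_percent(scores, possible):
    """Percent earned so far after each successive item."""
    totals = accumulate(scores)
    return [100 * t / (possible * n) for n, t in enumerate(totals, start=1)]

# How the estimate of the student changes as each hour exam comes in.
exam_trend = running_percent(*figure3["hour exams"])
print("after each hour exam:", [round(p, 1) for p in exam_trend])

# One summary figure per type of activity.
by_category = {name: percent(s, p) for name, (s, p) in figure3.items()}
for name, pct in by_category.items():
    print(f"{name:10s} {pct:5.1f}%")
```

The running figures show how far an estimate based on the first one or two exams would be from the picture given by all six, and the category summaries show the gap between her in-class exam work and her lab and paper work.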

Recognizable

Our third R is the need for the evaluation system to be recognizable to the students. By this we mean that students should be aware of how they will be evaluated and their class activities should prepare them for those evaluations. Testing should not be a game of "guess what I'm going to ask you." There is far too much for students to learn as it is for them to spend time trying to "psych out" the instructor. One of the biggest complaints students have is that the basis for evaluation was unclear to them. An instructor should choose evaluation types which are clearly related to the content and daily activities of the course. He or she should plan learning activities which are similar in scope and complexity to the ones to be used for evaluation. The instructor should explain the activities and their relevance to the students. It should never be the case that the students come into a test not knowing what to expect. Students don't mind "hard" tests as long as there are no surprises, and they can recognize the relationship of the test to the course. Some instructors may criticize this as "teaching the test," but in reality the test should be the best statement of the course expectations and, therefore, should mirror the teaching. Furthermore, few courses are taught at such a low level that tests are verbatim transcripts of the class or text; rather, they are interpretations or new examples of the class or text material.

Realistic

All of the above activities require work, either on the part of the students or the teacher. So to avoid burning out either, the final R is that the evaluation system should be realistic; that is, the amount of information obtained should be balanced against the amount of work required. Too often we forget that our students are taking three to four other courses along with ours. We're less likely to forget that we are teaching two to three other courses as well. So, as much as we would like to have a large amount of data on each student to increase the reliability of our grades, or to validate each of our evaluation activities each semester, or to have crystal clear directions for all tests and assignments, we must also face the fact that unless the system we design is realistic, it will collapse under its own weight. What is a realistic system? Unfortunately, no one can give a blanket answer to that question. I can say that several smaller assignments tend to be more valuable than one large assignment. Alternatively, if a large assignment is called for, spreading it out across the semester and requiring components to be handed in periodically is a good technique, both from a learning and an administrative standpoint.

In Conclusion

When you are planning the overall system for evaluating your students, keep in mind these four R's:

Relevant
Reliable
Recognizable
Realistic

If you can build these ideas into your system from the beginning, you have a good chance of getting an accurate estimate of each student's achievement upon which to base your grades.

Selecting A System for Assigning Grades

Now we come to the second part of designing the evaluation system, selecting the system for assigning grades. We can't go deeply into the mechanics of actually computing grades, but we can look at some of the bigger issues in grading which determine how you choose a grade computation system. Later on in the semester, as you face the actual task of assigning grades, please feel free to contact us for assistance in getting started.

First, a warning. Because the grading policy you adopt is so closely tied to your personal philosophy of teaching and your view of your own role as a teacher, be sure you give these two areas significant thought before settling on a system. You will be the one who will have to defend grading decisions against both students and administrators. It is very difficult to defend a system in which you do not believe or which you have not carefully worked out. It is unlikely that anyone will seriously challenge the grades you give, but you have an ethical responsibility as a teacher to be sure that the grades you assign are your best estimate of your students' abilities, whether anyone else is looking over your shoulder or not.

Grading Systems and Philosophies

There are two basic grading philosophies currently in use. These are commonly called norm-referenced systems and criterion-referenced systems. Each system uses different methods for determining cutoff points for letter grades. Each can be applied to a single test or to the determination of final letter grades. Let's examine the procedures associated with each.

Norm-referenced systems: The assumption underlying norm-referenced systems is that whatever is being measured is distributed throughout the population according to a normal distribution, commonly known as the bell curve (Figure 4). In the normal distribution, a very few people will do either very well or very poorly while the great mass of the unwashed show up clustered around the middle. Indeed, when we take a random sample of the general population and measure just about anything, this is what we get. The assumption is that when we evaluate our students' achievement, it will follow this same distribution. Thus, the grades will reflect the curve. There will be a few students way out on one end of the curve who should get As; a few down on the other end who should get Fs; and the great mass in the middle who get Bs, Cs and Ds. The assignment of grades under these systems identifies those students who do significantly better or worse than their peers.

Some examples of norm-referenced systems are:

the simple curve: In this system the instructor determines beforehand that a certain percentage of students will receive A's and a similar percentage will receive F's. The same holds for B's and D's. The remainder receive C's. Cutoffs are based on the number of students in the class and are figured by counting down the distribution of grades until that number is reached. Of course, it never works out to be exactly equal, but the numbers in corresponding categories are close. Since this system involves nothing more sophisticated than counting, it is easy to use. A grade distribution figured by this method is shown in Figure 5a.

the normalized curve: This is a more sophisticated system in which the actual score a student earns is converted into what is called a standard score based on the class average and the distribution of the scores. Then, using standard tables, the instructor converts these standard scores into percentiles based on a normal curve. The student's score is reported as being in the 90th percentile or the 50th, with some predetermined percentiles representing each of the letter grades. The second set of grades in Figure 5a shows a normalized grade distribution. Percentile scores have some real advantages when it comes to comparing grades from a wide range of activities, but their computation and interpretation can be confusing. They are probably not practical for the classroom instructor unless he or she is familiar with statistics.

In both of the above cases, you can see that the student's grade depends on where he or she falls in relation to the rest of the class rather than on the absolute score he or she obtained. Thus, the student is in competition with the others being evaluated at the same time. A grade of A in one class may mean a test score of 99, while in another class it could be a test score of 79, depending on how well the class as a whole performed.
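
The two norm-referenced procedures described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the procedure used to produce Figure 5a: the quota percentages, the handling of tied scores, the class scores, and the function names are all invented, and the standard library's NormalDist stands in for the "standard tables."

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical quotas: the fraction of the class that receives each grade.
QUOTAS = [("A", 0.10), ("B", 0.20), ("C", 0.40), ("D", 0.20), ("F", 0.10)]

def simple_curve(scores, quotas=QUOTAS):
    """Count down the ranked scores, handing out grades by quota.

    Tied scores share the grade given to the first of them (an assumption).
    Returns a dict mapping each score to a letter grade.
    """
    ranked = sorted(scores, reverse=True)
    grades = {}
    cumulative = 0.0
    position = 0
    for letter, fraction in quotas:
        cumulative += fraction
        cutoff = min(len(ranked), round(cumulative * len(ranked)))
        while position < cutoff:
            grades.setdefault(ranked[position], letter)
            position += 1
    # Any scores left over by rounding receive the last grade in the list.
    for score in ranked[position:]:
        grades.setdefault(score, quotas[-1][0])
    return grades

def normal_percentile(score, scores):
    """Convert a raw score to a standard score, then to a normal-curve percentile."""
    spread = pstdev(scores)  # treats the class as the whole population (an assumption)
    if spread == 0:
        raise ValueError("all scores are identical; standard scores are undefined")
    z = (score - mean(scores)) / spread
    return 100 * NormalDist().cdf(z)

class_scores = [95, 88, 84, 79, 77, 74, 70, 66, 61, 50]  # invented data
curved = simple_curve(class_scores)
percentiles = {s: round(normal_percentile(s, class_scores)) for s in class_scores}
for s in class_scores:
    print(s, curved[s], f"{percentiles[s]}th percentile")
```

In both functions a student's result depends on the other scores passed in, which is the defining feature of a norm-referenced system. The instructor would still need to decide which percentiles correspond to which letter grades.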

Criterion-referenced systems: Opposed to norm-referenced systems are the criterion-referenced systems. The assumption underlying these systems is that there is an absolute quantity of whatever is being measured and the grade reflects how much of that quantity each student has. This is more like a strength test. We have a set criterion, the bell at the top, and each student takes a swing and achieves a given level which determines the grade he or she gets, regardless of how anyone else does.

The most common forms of criterion-referenced systems are:

percent of total points possible: In this system, there is a fixed number of points available to be earned. Earning 90% (or some other arbitrary percent) of those points will result in an A, while 80% will result in a B, and so on. The student is being evaluated against a pre-set criterion, hence the name, and not against his or her peers. It does not matter how many students reach a given level. If everyone earns the maximum, everyone gets an A. The third set of grades in Figure 5a was figured using this system.

mastery or pass/fail: In this case, there is only one pre-set level of achievement, usually based on a set of specific objectives which must be passed. If these are passed, the student moves on; if not, the student must repeat the evaluation or, alternatively, fails the course. Sometimes the specifics refer to a given percent of the total possible rather than to given skills. This is the case with the fourth set of grades in Figure 5a.

In both of the above cases, you can see that the student's grade depends on the absolute score he or she obtains rather than on the relative position of that score in the class. Thus, the student is in competition with an outside standard rather than his or her peers. A grade of A in this system would indicate a given level of achievement regardless of the performance of the class as a whole, but would tell us nothing about how the student compared with his or her peers.
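
A minimal Python sketch of the two criterion-referenced systems follows. The 90/80/70/60 cutoffs and the 80 percent passing level are assumptions chosen for illustration (the text notes that such percentages are arbitrary), and the function names are hypothetical.

```python
# Hypothetical cutoffs: minimum percent of total points needed for each grade.
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

def percent_of_total(points, possible, cutoffs=CUTOFFS):
    """Letter grade from the percent of total possible points earned."""
    pct = 100 * points / possible
    for minimum, letter in cutoffs:
        if pct >= minimum:
            return letter
    return "F"

def mastery(points, possible, passing_pct=80):
    """Pass/fail against a single pre-set level of achievement."""
    return "pass" if 100 * points / possible >= passing_pct else "fail"

# Each grade depends only on the student's own points, never on classmates.
print(percent_of_total(450, 500))  # 90% of the points
print(percent_of_total(399, 500))  # 79.8% of the points
print(mastery(400, 500))           # exactly at the passing level

# If everyone earns enough points, everyone gets an A.
print([percent_of_total(s, 100) for s in (95, 97, 100)])
```

Neither function ever looks at another student's score, so the number of students who can reach a given grade is unrestricted.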

Hybrid systems: Now let's look at some systems which have no clear-cut allegiance to either philosophy, but are very commonly used.

percent of maximum obtained: This system uses a predetermined set of cutoff percentages for each grade as in a criterion-referenced system, but bases the actual grades on the highest score earned in that class, rather than the highest possible score. This latter characteristic makes the grades somewhat comparative as in a norm-referenced system. The class performance plays a role in determining what is needed for each grade, but the number of students who can earn each grade is not restricted as in the norm-referenced systems. Except on the grossest level, the students are not in competition with one another. This system gives us neither absolute nor relative performance information, but it is easy to compute and easy for students to understand. The fifth set of scores in Figure 5a uses this system.

gap system: This could be labeled the interocular system since it involves laying out the score distribution and looking for gaps in the distribution. These breaks then determine the cut-off scores for the various grades. One advantage of this system is that the instructor has a practical reason for setting the grade cutoffs where they are. The idea is to identify real differences in performance which will then be reflected in the grades. Under this system, A performance really appears to be different from B performance because the two groups of students have a gap separating them. All the other systems are based on more or less arbitrary cutoffs, even though they may have a sound statistical basis. Like norm-referenced systems, the gap system gives us relative but not absolute performance information. It is also easy to compute and explain. The sixth set of grades in Figure 5a is based on a gap system.
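
The two hybrid systems can be sketched in Python as below. Both functions are illustrations under assumptions: the cutoff percentages and all scores are invented, the function names are hypothetical, and the rule of breaking the distribution at its widest gaps is only one mechanical stand-in for what the text describes as looking over the distribution by eye.

```python
def percent_of_maximum(scores, cutoffs=((90, "A"), (80, "B"), (70, "C"), (60, "D"))):
    """Grade each score as a percent of the highest score earned in the class."""
    top = max(scores)
    grades = {}
    for s in scores:
        pct = 100 * s / top
        grades[s] = next((letter for minimum, letter in cutoffs if pct >= minimum), "F")
    return grades

def gap_groups(scores, n_groups=5):
    """Split the ranked scores at the widest gaps between neighboring scores."""
    ranked = sorted(scores, reverse=True)
    # Each entry is (size of gap, index of the score just above the gap).
    gaps = [(ranked[i] - ranked[i + 1], i) for i in range(len(ranked) - 1)]
    widest = sorted(gaps, key=lambda g: g[0], reverse=True)[: n_groups - 1]
    breaks = sorted(i for size, i in widest if size > 0)
    groups, start = [], 0
    for i in breaks:
        groups.append(ranked[start : i + 1])
        start = i + 1
    groups.append(ranked[start:])
    return groups

# A class whose best score is 80 out of 100 (invented scores).
print(percent_of_maximum([80, 72, 65, 56, 40]))

# Invented scores with visible breaks; groups are paired with A through F in order.
for letter, group in zip("ABCDF", gap_groups([98, 96, 95, 86, 85, 83, 74, 72, 71, 60, 58, 40])):
    print(letter, group)
```

In practice an instructor using the gap system would likely adjust the breaks by judgement, since the widest gaps in a small class can fall in awkward places.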

What's the Difference in Terms of Grades?

The distribution of grades under these various systems does not differ remarkably in the set of scores shown in Figure 5a. In this class it might not matter which system the instructor chose since they would all come out about the same. This is because the underlying distribution of scores in this class is distributed fairly normally across the range of possible points. However, if you inspect the grade distributions for Figure 5b, a class in which the highest score is an 80 out of 100, you can see that now it makes a big difference which system is chosen. If we stick to criterion-referenced systems such as percentage total or pass/fail, many students will fail. Under these circumstances, the students will usually cry out for a "curve." To be honest, this might not be a bad move, since poor performance by an entire class might be an indication of a poorly constructed exam, or inadequate instruction, or some other variable over which students have little control. On the other hand, if the material being tested is something critical like the construction of a nuclear plant or the insertion of a needle for drug injection, I personally don't want anyone to curve the grades; I want a criterion-referenced system in place.

While there are many valid arguments which can be made for norm-referenced systems, they are usually made in the situation illustrated in Figure 5b where overall class performance is poor. They are seldom applied to a situation like that in Figure 5c where overall class performance is very high. In this case, most students prefer a criterion-referenced system which will allow everyone to receive the top grade. It is hard to imagine curving the grades in a graduate course, where the assumption of a normal distribution is not valid. On the other hand, suppose the purpose of this course were to determine which two students should be selected to receive fellowships or which two should be allowed into a special program for promising researchers? Under those conditions, a criterion-referenced system would not provide the comparative information needed to make those decisions.

As can be deduced from the above examples, no one grading system is the "right" system. The choice will depend on the purpose of the grade (to provide absolute or comparative evaluation), the type of content being evaluated (critical or non-critical), the type of students (how select the sample is), and the philosophy of the instructor. There are some other practical considerations, such as ease of computation, size of class, clarity to students, whether it is necessary for students to be able to track their progress toward a final grade, and so on. These last few practical considerations dealing with whether the students can monitor their progress may not be important to the instructor, but they are very important to the students. Students have very definite ideas about how final course grades are computed. They feel very insecure when they cannot predict how their final grade will turn out because it is going to be based on the final class distribution. In order to deal with this, many instructors use norm-referenced systems to assign periodic grades, such as those on hour tests, and then combine these into one course grade which is evaluated on a criterion-referenced basis. In fact, this may be the fairest system of all. Procedures for making such final grade determinations are described in the sections labeled "A criterion-referenced system" and "A norm-referenced system."

The real question to ask yourself is whether you wish your students' grades to provide information about their absolute performance level or about their relative performance level. That is the first and most important distinction you must make. From it will flow the other choices. No one can answer the question for you, although there may be a departmental or college recommendation or leaning toward one or the other. Neither system represents truth; each has its pros and cons. The best system for you is the one which reflects your own teaching philosophy.

A Norm-Referenced System for Final Grades

In order for norm-referenced systems to work correctly, student scores have to be distributed according to the normal curve, or at least distributed similarly on all measures which are to be combined. This is usually the case, especially in large or beginning level classes. If you are fortunate enough to have the scores on all your measures normally distributed or at least similarly distributed, you can follow the procedures outlined below. If any of the measures is badly out of line with the others in terms of the shape of its distribution, you would be well-advised to convert all the scores to T-scores before any combining of scores for a final grade. This procedure is described in a separate section. For now, let's look at combining scores for a fairly standard class. Refer to the student records shown in Figure 6a as we step through the process of figuring final grades.
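
For readers unfamiliar with the T-score conversion mentioned above, here is a minimal Python sketch. It uses the common linear definition (a standard score rescaled to a mean of 50 and a standard deviation of 10); the separate section referred to in the text may use a different variant. The weights, scores, and function names are invented for illustration.

```python
from statistics import mean, pstdev

def t_scores(scores):
    """Rescale one measure so its mean is 50 and its standard deviation is 10."""
    m, sd = mean(scores), pstdev(scores)
    if sd == 0:
        raise ValueError("all scores are identical; T-scores are undefined")
    return [50 + 10 * (s - m) / sd for s in scores]

def combine(measures, weights):
    """Weighted sum of T-scores for each student across several measures."""
    converted = [t_scores(m) for m in measures]
    n_students = len(measures[0])
    return [sum(w * t[i] for w, t in zip(weights, converted)) for i in range(n_students)]

# Invented data: three students, an exam out of 100 and a paper out of 50.
exam = [60, 70, 80]
paper = [10, 20, 30]
combined = combine([exam, paper], weights=[0.5, 0.5])
print([round(c, 1) for c in combined])
```

Because every measure is put on the same scale before combining, a measure with a wide spread of raw scores no longer outweighs one with a narrow spread.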

Half way between these two would be the cut-off between B and C. The same figuring would go into determining all the cut-off points.

A Criterion-Referenced (sort of) System for Final Grades

Criterion-referenced grading is the simpler of the two systems we've been discussing because it is based on fewer assumptions and fewer statistical concerns. We are going to look at one of many ways of doing it. As we step through the process, follow along on the grade set on the right in Figure 6c. To assign final grades under a criterion-referenced system:


November 11, 2002
The University of Texas at Austin
Copyright © 2002 Center for Teaching Effectiveness