
Analyzing quantitative content analysis data

Rater reliability

To establish reliability when evaluating content, use two or more raters and make sure they are rating in a consistent manner. Train them as a group, have them practice using the same criteria, and compare their results. They should continue to compare results as the study progresses to make sure their ratings do not drift apart. Although reliability will be higher when rating concrete content (for example, use of transitional phrases in an essay) rather than abstract content (for example, persuasiveness), rating only concrete content might lead to overlooking important indicators of quality. Once you begin ratings, refine procedures or category definitions as needed to increase reliability.

There are several ways to assess reliability among two or more raters. The simplest way is to record the percentage agreement:

% agreement = (# of ratings that agree / total # of ratings) x 100

If you are using a 4-point rating scale (0 = not present to 3 = exceeds criteria), compute the percentage of times raters made the same rating. If agreement between raters is below 80%, discuss the differences and repeat the process until agreement is satisfactory. Avoid using a rating scale with more than 5 points because raters will have difficulty making subtle distinctions. Many journals prefer Cohen's Kappa or the intraclass correlation coefficient because percentage agreement does not correct for chance agreement.
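The percentage agreement calculation can be sketched in a few lines of Python. This is only an illustration: the function name and the ten ratings are hypothetical, and the example assumes two raters scoring the same items on the 0 to 3 scale.

```python
def percent_agreement(rater_a, rater_b):
    """Return the percentage of items on which two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of items")
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100 * agreements / len(rater_a)


# Hypothetical ratings of ten essays on the 0-3 scale (not real data)
judge_1 = [0, 1, 2, 3, 2, 2, 1, 0, 3, 2]
judge_2 = [0, 1, 2, 3, 2, 1, 1, 0, 3, 3]

agreement = percent_agreement(judge_1, judge_2)
print(f"{agreement:.0f}% agreement")
```

With these made-up ratings the judges agree on 8 of 10 essays, which just reaches the 80% threshold.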

Cohen's Kappa is a measure of reliability that corrects for chance agreement and can be used for checklists that involve yes/no decisions or decisions between mutually exclusive categories. Generally, a Kappa value of .7 or greater indicates acceptable reliability.
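As a rough sketch of how the chance correction works, the standard formula is Kappa = (observed agreement - chance agreement) / (1 - chance agreement), where chance agreement is estimated from how often each rater uses each category. The function name and the yes/no decisions below are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters assigning items to mutually exclusive categories."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("Both raters must rate the same non-empty set of items")
    # Proportion of items on which the raters actually agree
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Agreement expected by chance, from each rater's category frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("Kappa is undefined when both raters use a single category")
    return (observed - expected) / (1 - expected)


# Hypothetical yes/no checklist decisions for ten essays (not real data)
judge_1 = ["yes"] * 6 + ["no"] * 4
judge_2 = ["yes"] * 5 + ["no"] * 4 + ["yes"]

kappa = cohens_kappa(judge_1, judge_2)
print(f"Kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 80% of items, yet Kappa is only about .58, which shows how much of the raw agreement could be due to chance.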

The intraclass correlation coefficient (ICC) is a measure of reliability between observers that can be used with ordinal or continuous data, such as observing the number of questions students ask during a class.
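There are several forms of the ICC, and the text above does not say which one to use. The sketch below assumes the simplest, the one-way random-effects ICC for a single rater, computed from between-target and within-target mean squares. The function name and the counts are hypothetical; a statistician can advise on which form fits your design.

```python
from statistics import mean


def icc_one_way(ratings):
    """One-way random-effects ICC for a single rater.

    `ratings` holds one row per rated target (for example, a class session),
    with one score per observer in each row.
    """
    n = len(ratings)        # number of targets
    k = len(ratings[0])     # observers per target
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("Need at least 2 targets, each scored by the same 2+ observers")
    grand_mean = mean([score for row in ratings for score in row])
    row_means = [mean(row) for row in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (score - m) ** 2 for row, m in zip(ratings, row_means) for score in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical counts of student questions in five class sessions,
# recorded by two observers (not real data)
question_counts = [[4, 5], [7, 7], [2, 3], [9, 8], [5, 5]]

icc = icc_one_way(question_counts)
print(f"ICC = {icc:.2f}")
```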

Once you establish adequate reliability, you can simplify analyses by averaging ratings across raters. Alternatively, you can analyze only one set of ratings if you designate one person at the outset as the main rater and use a secondary rater only to establish reliability.

Using content analysis in an experiment


Example 1:

You decide to conduct a study that compares the quality of papers for two groups of students in a sociology class. At the start of the semester, students agree to be randomly assigned to one of two groups. At the semester mid-point, each student in Group 1 posts a rough draft of their paper on the course website and receives an overall quality score ranging from 1-poor to 4-excellent from three other students in Group 1. Each student in Group 2 also posts a rough draft at the same time and receives both an overall quality score and detailed comments on an evaluation form from three other students in Group 2. Students comment about organization, use of examples, use of theory, and persuasiveness of arguments. Both groups incorporate the feedback in their papers. Two judges, who do not know which group students belong to, then rate overall paper quality (1-poor to 4-excellent) and use the rubric below to assess paper quality in four categories. You have three hypotheses: 1) As rated by the two judges, overall paper quality for Group 2 will be higher than for Group 1, on average; 2) Total rubric scores will be higher, on average, for Group 2 than Group 1; 3) Overall quality scores will not differ between groups for rough drafts.

Below is an example of ratings provided by the first judge for a student in Group 1:

Overall quality (1-poor; 2-average; 3-good; 4-excellent)
2 -- average

Rubric ratings of paper:
Not present = no use or demonstration of objective
Below criteria = little use or demonstration of objective or use is frequently inaccurate
Meets criteria = consistent use or demonstration of objective
Exceeds criteria = consistent and skillful use or demonstration of objective

Objective | Not Present (0 pts) | Below Criteria (1 pt) | Meets Criteria (2 pts) | Exceeds Criteria (3 pts)
Use of examples | | | |
Use of theory | | | |

Enter all data into a statistical program such as SPSS or SAS and calculate means and standard deviations for overall quality ratings and total rubric scores. Comparing means provides a rough sense of differences between the groups, but statistical tests are needed to judge whether those differences are larger than chance alone would be likely to produce. Many statistical programs report a p value: the probability of obtaining group differences at least as large as those observed if there were no true difference between the groups. For example, a p value of .05 means that differences this large would occur about 5% of the time by chance alone if the intervention had no effect. Prior to analyzing the data, you set a p value of .05 or less as the criterion for statistical significance. In addition, you make sure that outcome variables are normally distributed, an assumption of many statistical tests. If a variable is not normally distributed, consult with a statistician to determine if you need to transform the variable.
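If you do not have SPSS or SAS at hand, the descriptive step can be sketched with Python's standard library. The scores below are hypothetical overall quality ratings, assumed here to be averaged across the two judges, and the helper name is made up for illustration.

```python
from statistics import mean, stdev


def describe(scores):
    """Return the mean and sample standard deviation of a list of scores."""
    return mean(scores), stdev(scores)


# Hypothetical overall quality ratings (1-4), averaged across two judges
group_1 = [2.0, 2.5, 3.0, 2.0, 2.5, 3.0]
group_2 = [3.0, 3.5, 2.5, 3.5, 4.0, 3.0]

for label, scores in (("Group 1", group_1), ("Group 2", group_2)):
    m, sd = describe(scores)
    print(f"{label}: M = {m:.2f}, SD = {sd:.2f}")
```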

To determine if there is a difference in paper quality between the two groups, conduct two t-tests for independent groups (also called an independent samples t-test, or the t-test for independent means), one comparing average group ratings of overall quality and a second comparing total rubric scores. At the study's outset, it would be wise to give all participants a standardized test of writing quality to make sure that Groups 1 and 2 did not significantly differ in writing ability. This would enable you to conclude that later group differences in paper quality were not due to differences that existed before you began the study.
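To show what a statistical program computes for this test, here is a minimal sketch of the independent samples t statistic. It assumes the classic pooled-variance form (equal variances in both groups), uses hypothetical ratings, and stops at t and its degrees of freedom. The p value would come from a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_error, df


# Hypothetical overall quality ratings (1-4), averaged across two judges
group_1 = [2.0, 2.5, 3.0, 2.0, 2.5, 3.0]
group_2 = [3.0, 3.5, 2.5, 3.5, 4.0, 3.0]

t_value, df = independent_t(group_2, group_1)
# For df = 10, the two-tailed critical value at p = .05 is 2.228
print(f"t({df}) = {t_value:.2f}")
```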

If you discover that, before you start your study, two groups differ on a variable you are measuring, such as writing quality, you can control for these differences using an Analysis of Covariance (ANCOVA) procedure. You cannot use an ANCOVA, however, to control for pre-existing group differences when there is no random assignment, so consult with a statistician in this case.

To test for differences between three or more groups, use an independent samples analysis of variance (ANOVA). Obtaining a significant F value for an ANOVA tells you that, overall, scores differ between groups, but it does not tell you which groups are significantly different from each other. To answer that question, you must perform post-hoc comparisons after you obtain a significant F, using tests such as Tukey's or Scheffé's, which set more stringent significance levels as you make more comparisons. However, if you make specific predictions about differences between means, you can test these predictions with planned comparisons, which enable you to set significance levels at p < .05. Planned comparisons are performed instead of an overall ANOVA.
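The F value is the ratio of between-group variability to within-group variability. The sketch below computes it for a one-way independent samples ANOVA. The three groups and their total rubric scores are hypothetical, and post-hoc tests are left to a statistics package.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return F and its degrees of freedom for a one-way independent samples ANOVA."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical total rubric scores (0-12) for three groups of five students
group_a = [5, 6, 7, 6, 6]
group_b = [7, 8, 9, 8, 8]
group_c = [9, 10, 11, 10, 10]

f_value, df_between, df_within = one_way_anova_f(group_a, group_b, group_c)
# For df = (2, 12), the critical F at p = .05 is about 3.89
print(f"F({df_between}, {df_within}) = {f_value:.2f}")
```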


Example 2:

You decide to alter your study design from Example 1. You use the same procedure but ask the judges to make two additional ratings when students post rough drafts on the course website: 1) an overall rating of rough draft quality and 2) a total rubric score based on ratings of organization, use of examples, use of theory, and persuasiveness of arguments. Both groups receive feedback from other students, as described in Example 1, but judges' ratings are not shared with students. You then compare judges' rough and final draft ratings and test whether the average amount of change for Groups 1 and 2 is significantly different, using a mixed factorial ANOVA. A mixed ANOVA enables you to simultaneously consider change over time within each group and differences between the two groups.
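With exactly two groups and two rating occasions, the group-by-time interaction in a mixed ANOVA is equivalent to an independent samples t-test on each student's change score (the interaction F equals t squared). The sketch below takes that shortcut rather than running a full mixed ANOVA. The ratings and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def change_scores(rough, final):
    """Final-draft rating minus rough-draft rating for each student."""
    return [f - r for r, f in zip(rough, final)]


def independent_t(sample_a, sample_b):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b)), df


# Hypothetical overall quality ratings (1-4) for five students per group
group_1_change = change_scores(rough=[2, 2, 3, 2, 3], final=[2, 3, 3, 3, 3])
group_2_change = change_scores(rough=[2, 3, 2, 2, 3], final=[3, 4, 4, 3, 4])

t_value, df = independent_t(group_2_change, group_1_change)
# For df = 8, the two-tailed critical value at p = .05 is 2.306
print(f"Mean change: Group 1 = {mean(group_1_change):.1f}, Group 2 = {mean(group_2_change):.1f}")
print(f"t({df}) = {t_value:.2f}")
```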

Other statistical procedures

To test whether a statistically significant change has occurred within one group of students at two points in time (for example, the start and end of the semester), use a t-test for dependent means (also called a paired samples t-test, repeated measures t-test, or t-test for dependent samples). To compare ratings at three or more points in time (for example, the start, midpoint, and end of the semester), one option is a repeated measures analysis of variance (ANOVA) (also called ANOVA for correlated samples).
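For the two-time-point case, the t-test for dependent means works on the difference between each student's two scores. A minimal sketch with hypothetical start-of-semester and end-of-semester rubric totals:

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """t statistic and degrees of freedom for a paired samples t-test."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("Need the same 2+ students measured at both times")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


# Hypothetical total rubric scores for six students (not real data)
start_of_semester = [5, 6, 4, 7, 5, 6]
end_of_semester = [7, 7, 6, 8, 6, 8]

t_value, df = paired_t(start_of_semester, end_of_semester)
# For df = 5, the two-tailed critical value at p = .05 is 2.571
print(f"t({df}) = {t_value:.2f}")
```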

If you are rating a student product using categories that are not on a continuous scale (for example, "inadequate, satisfactory, above average"), you can test for differences between groups or times using a chi-square statistic. For example, you may rate whether the thesis for a research paper is "clearly stated" or "not clearly stated."
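The chi-square statistic compares the counts you observe in each category with the counts you would expect if group and category were unrelated. A short sketch for the thesis example, using hypothetical counts for two groups of 20 students:

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows are Group 1 and Group 2;
# columns are "clearly stated" and "not clearly stated"
thesis_counts = [[12, 8], [18, 2]]

chi2, df = chi_square(thesis_counts)
# For df = 1, the critical value at p = .05 is 3.84
print(f"chi-square({df}) = {chi2:.2f}")
```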

You might also compute correlations to determine whether there is a statistically significant positive or negative relationship between two continuous variables. For example, you could determine if ratings of the quality of student essays are related to students' satisfaction with the course. Be aware, however, that computing correlations between several sets of variables increases the chances of finding a relationship due to chance alone, and that finding significant correlations between variables does not tell you what causes those relationships.
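The sketch below computes a Pearson correlation for hypothetical essay quality and course satisfaction ratings. It also illustrates the warning about multiple correlations: assuming the tests are independent, the chance of at least one spurious "significant" result grows quickly with the number of correlations tested. Both function names are made up for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("Need 2+ paired observations")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of several independent tests is significant by chance."""
    return 1 - (1 - alpha) ** num_tests


# Hypothetical ratings for five students (not real data)
essay_quality = [2, 3, 3, 4, 1]
course_satisfaction = [3, 4, 5, 5, 2]

r = pearson_r(essay_quality, course_satisfaction)
print(f"r = {r:.2f}")
print(f"Chance of a spurious result across 10 tests: {chance_of_false_positive(10):.0%}")
```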

If you need additional help from someone knowledgeable about statistics, contact the research consulting staff at UT Austin's Division of Statistics & Scientific Computation.

Page last updated: Sep 21 2011
Copyright © 2007, The University of Texas at Austin