Conduct research

Analyzing observational data

Guide your analysis of observational data by the study's hypotheses. For example, if the goal is to assess improvement in instructors' oral presentation skills, you might use classroom observations to evaluate these skills before and after oral presentation training.

If you use multiple observers, make sure they report observations consistently. To improve inter-observer consistency (reliability), train observers as a group, have them practice with the same form, and compare their results. They should continue comparing results throughout the study so that their ratings do not drift apart. Conduct observations several times, in case any one session is unusual, and complete observation forms during and immediately following an observation. Observing for longer periods reduces the extent to which participants change their behavior in reaction to being observed. Reliability is higher for records of concrete behaviors than for abstract ones, but observing only concrete behaviors risks overlooking meaningful behavior. Reliability is also higher for checklists than for ratings, which involve greater observer judgment. Once you begin observations, refine observation procedures or definitions of observational categories as needed to increase reliability.

Assess reliability among multiple observers

Percentage agreement is the simplest way to assess reliability between two observers:

(# of ratings that agree / total # of ratings) x 100 = % agreement

For a 5-point rating scale, for example, you might compute the percentage of time two observers made the same rating or the percentage of time ratings differed by no more than one point. Agreement should be at least 80%. Many journals prefer Cohen's Kappa or the intraclass correlation coefficient, because percentage agreement does not correct for chance agreement.
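As a minimal sketch, percentage agreement can be computed directly from two observers' ratings (the ratings below are hypothetical):

```python
# Percentage agreement between two observers' 5-point ratings (hypothetical data).
obs1 = [4, 3, 5, 2, 4, 4, 3, 5, 2, 4]
obs2 = [4, 3, 4, 2, 4, 5, 3, 5, 2, 4]

# Exact agreement: both observers made the same rating.
exact = sum(a == b for a, b in zip(obs1, obs2)) / len(obs1) * 100
# Looser criterion: ratings differ by no more than one point.
within_one = sum(abs(a - b) <= 1 for a, b in zip(obs1, obs2)) / len(obs1) * 100

print(f"Exact agreement: {exact:.0f}%")        # 8 of 10 ratings match -> 80%
print(f"Within one point: {within_one:.0f}%")  # all 10 within a point -> 100%
```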

Cohen's Kappa can be used with checklists that require decisions between mutually exclusive categories, such as yes/no. Generally, a Kappa value of .7 or greater indicates acceptable reliability.
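A minimal hand computation of Cohen's Kappa for two observers' yes/no checklist decisions (the decisions below are hypothetical; libraries such as scikit-learn also provide this statistic):

```python
# Cohen's Kappa for two observers' yes/no decisions (hypothetical data).
obs1 = ["y", "y", "n", "y", "n", "y", "y", "n", "y", "y"]
obs2 = ["y", "n", "n", "y", "n", "y", "y", "y", "y", "y"]

n = len(obs1)
p_observed = sum(a == b for a, b in zip(obs1, obs2)) / n  # raw agreement

# Chance agreement: probability both say "y" plus probability both say "n".
p1_yes = obs1.count("y") / n
p2_yes = obs2.count("y") / n
p_chance = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)

# Kappa corrects the observed agreement for agreement expected by chance.
kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"Kappa = {kappa:.2f}")  # 0.52 here: below the .7 guideline
```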

The intraclass correlation coefficient (ICC) can be used with categorical or continuous data, such as the number of questions students ask during a class.
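As an illustration, a one-way ICC (the ICC(1,1) form, which treats raters as random) can be computed from its ANOVA mean squares; the ratings below are hypothetical, and packages such as pingouin offer more complete implementations:

```python
# One-way intraclass correlation, ICC(1,1), from its ANOVA mean squares.
# Rows = observed sessions (targets), columns = observers (hypothetical data).
ratings = [
    [4, 5],
    [2, 3],
    [5, 5],
    [3, 3],
    [4, 4],
]

n = len(ratings)      # number of targets
k = len(ratings[0])   # number of observers
grand = sum(sum(row) for row in ratings) / (n * k)
row_means = [sum(row) / k for row in ratings]

# Between-targets and within-targets mean squares.
ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
ms_within = sum((x - m) ** 2
                for row, m in zip(ratings, row_means)
                for x in row) / (n * (k - 1))

icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC(1,1) = {icc:.2f}")
```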

Once you have established adequate reliability, you can simplify analyses by averaging ratings of the same session across observers. Alternatively, you can analyze only one set of ratings if, before the study, you designate one observer as the main observer and use a secondary observer only to establish reliability.

With checklists, you can count the number of positive behaviors recorded, compute this as a percentage of total behaviors, and compare groups or times:

(# of yes responses / total # of responses) x 100 = % positive behaviors

For example, consider this checklist assessing lecture organization in a course at the start of a semester:


The instructor                                               Yes  No   Comments

1. stated the purpose of the lecture.                         X        stated clearly at start of class
2. explained the relation of the class to the previous one.   X        very clear and concise
3. put class objectives on a PowerPoint slide.                X        good, reinforced #1
4. verbally provided an outline of lecture content.           X        with PowerPoint slide
5. made transition statements between lecture segments.            X   mostly jumped to new topic
6. summarized periodically and at the end of class.                X   some at end, but otherwise no
7. connected different points or topics in summaries.              X   would be helpful to tie together

For this instructor, four of seven possible organization behaviors were observed.
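Applying the formula above to this checklist:

```python
# Percentage of positive behaviors from the checklist example above:
# four "yes" responses out of seven checklist items.
yes_responses = 4
total_responses = 7

pct_positive = yes_responses / total_responses * 100
print(f"{pct_positive:.1f}% positive behaviors")  # 57.1%
```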

In a study that included several instructors, a researcher averaged the number of organization behaviors for all participating instructors to find the mean and computed a standard deviation to assess the average variation in number of organization behaviors among instructors. The researcher found a mean of 3.7 positive behaviors with a standard deviation of 1.2. He shared these results with all instructors to help them improve. Six weeks later, observers recorded a mean of 6.0 positive behaviors for these same instructors with a standard deviation of 1.1, suggesting substantial improvement.

Test for statistical significance

While comparing means gives a rough sense of differences between times or groups, you must use statistical tests to demonstrate that these differences are unlikely to have occurred by chance. Many statistical programs provide a p value, which indicates the probability of obtaining differences at least as large as those observed if chance alone were operating. For example, a p value of .05 indicates a 5% probability of observing such differences by chance rather than because of the intervention. Prior to analyzing your data, set the p value that you will use as the criterion for statistical significance. A p value of .05 is most often used as a cutoff.

To test whether the improvement observed above is statistically significant, use a t-test for dependent means (also called a paired samples t-test, repeated measures t-test, or t-test for dependent samples). The easiest way to accomplish this is to enter the data in a statistical program like SPSS and to use the pull-down menu to run the test.
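The same test can also be run in a few lines with SciPy's `ttest_rel`; the before/after counts below are hypothetical:

```python
from scipy import stats

# Number of positive organization behaviors per instructor (hypothetical data),
# observed before and six weeks after receiving feedback.
before = [3, 4, 2, 5, 3, 4, 3, 2]
after = [6, 7, 5, 7, 6, 6, 5, 6]

# Paired samples t-test: each instructor is measured twice.
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```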

Most statistical tests require that outcome variables, such as the number of positive behaviors observed, be normally distributed; if a variable is not, consult someone knowledgeable about statistics to determine whether you must transform it. In addition, when you compare observation ratings for two groups using a t-test, the spread of the ratings (variance) should be roughly equal for both groups. Your outcome variable should be on a continuous rather than categorical scale.

To compare observation scores at three or more points in time, one option is a repeated measures analysis of variance (ANOVA) (also called ANOVA for correlated samples). A significant F value for an ANOVA tells you that, overall, scores differ at different times, but it does not tell you which scores are significantly different from each other. To answer that question, you must perform post-hoc comparisons after you obtain a significant F, using tests such as Tukey's and Scheffe's, which set more stringent significance levels as you make more comparisons. However, if you make specific predictions about differences between means, you can test these predictions with planned comparisons, which enable you to set significance levels at p < .05. Planned comparisons are performed instead of an overall ANOVA.
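A one-way repeated measures ANOVA can be computed by hand from its sums of squares; the scores below (five instructors observed at three times) are hypothetical, and statsmodels' `AnovaRM` provides the same test:

```python
import numpy as np
from scipy import stats

# Rows = instructors, columns = observation times (hypothetical data).
scores = np.array([
    [2, 4, 6],
    [3, 5, 6],
    [2, 3, 5],
    [4, 6, 7],
    [3, 4, 6],
], dtype=float)

n, k = scores.shape
grand = scores.mean()

# Partition the total sum of squares into time, subject, and error components.
ss_time = n * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_subject = k * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_total = ((scores - grand) ** 2).sum()
ss_error = ss_total - ss_time - ss_subject

df_time, df_error = k - 1, (n - 1) * (k - 1)
f_stat = (ss_time / df_time) / (ss_error / df_error)
p_value = stats.f.sf(f_stat, df_time, df_error)
print(f"F({df_time}, {df_error}) = {f_stat:.1f}, p = {p_value:.4f}")
```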

To determine if there are statistically significant differences between two groups, use a t-test for independent groups (also called an independent samples t-test, or the t-test for independent means). For example, if instructors in one group receive training to improve lecture organization while instructors in a second group do not, you could compare these groups by observing their lectures. To compare three groups or more, use an independent samples analysis of variance. Again, you will need to conduct post-hoc comparisons after obtaining a significant F value to determine differences between specific groups.
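An independent samples t-test with SciPy, using hypothetical counts of organization behaviors for trained versus untrained instructors:

```python
from scipy import stats

# Positive organization behaviors per lecture (hypothetical data).
trained = [6, 7, 5, 6, 7, 6]
untrained = [3, 4, 3, 2, 4, 3]

# Independent samples t-test; the two groups contain different instructors.
t_stat, p_value = stats.ttest_ind(trained, untrained)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```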

If you are comparing groups, observe them before your instructional intervention (e.g. training) to make sure there are not pre-existing differences that are statistically significant. If there are pre-test differences and you randomly assigned participants to the groups, you can control for these differences using an Analysis of Covariance (ANCOVA) procedure. You cannot use an ANCOVA, however, to control for pre-existing group differences in a field experiment, so consult with a statistician in this case.

Linear regression enables you to predict the level of an outcome variable using one or more continuous variables. For example, you might use the number of observed behaviors an instructor uses to organize lectures to predict later student test scores.
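A simple linear regression with SciPy's `linregress`, predicting hypothetical student test scores from the number of observed organization behaviors:

```python
from scipy import stats

# Organization behaviors observed per instructor, and that instructor's
# students' mean test score (hypothetical data).
behaviors = [2, 3, 4, 5, 6, 7]
test_scores = [70, 76, 79, 85, 91, 94]

result = stats.linregress(behaviors, test_scores)
print(f"score = {result.intercept:.1f} + {result.slope:.1f} * behaviors")
print(f"r = {result.rvalue:.2f}, p = {result.pvalue:.4f}")
```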

If participants are observed multiple times, hierarchical linear models (HLM) may be a better choice than a repeated measures ANOVA. HLM is particularly suited to analyzing data from repeated measurements or data with a hierarchical structure. For example, in much educational research, students are grouped within classrooms, which are grouped within schools. HLM takes into account that students from the same classroom or school have more in common than individuals randomly sampled from a larger population. HLM requires specialized software, available to UT Austin faculty and staff at a discount.

You might also compute correlations to determine whether there is a statistically significant positive or negative relationship between two continuous variables. For example, you could determine if the number of questions students ask is significantly related to ratings of course satisfaction. Be aware, however, that computing correlations between several sets of variables increases the chances of finding a relationship due to chance alone, and that finding significant correlations between variables does not tell you what causes those relationships.
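A Pearson correlation with SciPy, using hypothetical data on questions asked and course satisfaction ratings:

```python
from scipy import stats

# Questions asked per class and mean course satisfaction rating
# on a 5-point scale (hypothetical data).
questions = [1, 3, 5, 7, 9, 11]
satisfaction = [2.1, 2.8, 3.5, 4.0, 4.6, 4.9]

r, p_value = stats.pearsonr(questions, satisfaction)
print(f"r = {r:.2f}, p = {p_value:.4f}")
```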

For additional help from someone knowledgeable about statistics, contact the research consulting staff at UT Austin's Division of Statistics & Scientific Computation.

When using ratings, compare means of groups or times:

Mean (Time 1) = Sum of all ratings at Time 1 / # of ratings at Time 1
vs.
Mean (Time 2) = Sum of all ratings at Time 2 / # of ratings at Time 2

For example, observers rated the clarity of an instructor's lectures from 1 = no clarity to 5 = outstanding clarity. For instructors participating in a public speaking course, mean ratings of clarity increased from 2.3 before the course to 3.7 after the course. You would conduct a t-test for dependent means to see if this difference is statistically significant.

If you are rating a behavior using categories that are not on a continuous scale, you can test for differences between groups or times using a chi-square statistic. For example, in a study comparing classrooms equipped with computers for every student and those not equipped with computers, observers rated 200 instructors.


How often does the instructor act as a coach or facilitator during class?

                         Never   Rarely   Occasionally   Frequently   Extensively
Computer equipped
Not computer equipped
Never = not observed in any classes for this instructor
Rarely = less than five minutes per class
Occasionally = average of between 5 and 15 minutes per class
Frequently = average of between 15 and 25 minutes per class
Extensively = average of more than 25 minutes per class

The researcher decided ahead of time to combine the five rating categories into two larger categories: rarely or less and occasionally or more. With these larger categories, a chi-square test revealed that a higher proportion of instructors in computer-equipped classrooms at least occasionally act as coaches or facilitators compared to instructors in non-equipped classrooms:


How often does the instructor act as a coach or facilitator during class?

                         Rarely or less   Occasionally or more
Computer equipped
Not computer equipped
You should also analyze comments that accompany checklists or ratings, identifying themes and significant points, such as what works well and what needs improvement. Comments will often help you interpret numeric results.

If your study uses extensive observer commentary or a narrative log (a detailed, descriptive record of verbal and nonverbal behaviors), create a transcript using a word processing program and analyze it by coding.

Develop coding categories

A major step in analyzing qualitative data is coding speech into meaningful categories, enabling you to organize large amounts of text and discover patterns that would be difficult to detect by just reading observer commentary. Always keep the original copy of observer commentary. Bogdan and Biklin (1998) suggest first ordering narrative logs chronologically or by some other criterion. Carefully read all your data at least twice during long, undisturbed periods. Next, conduct initial coding by generating numerous category codes as you read the commentary, labeling related data without worrying about the variety of categories. Write notes to yourself, listing ideas or diagramming relationships you notice. Because codes are not always mutually exclusive, a phrase or section might be assigned several codes. Last, use focused coding to eliminate, combine, or subdivide coding categories and look for repeating ideas and larger themes that connect codes. Repeating ideas are the same idea expressed by different respondents, while a theme is a larger topic that organizes or connects a group of repeating ideas. Try to limit final codes to between 30 and 50. After you have developed coding categories, make a list that assigns each code an abbreviation and description.

Berkowitz (1997) suggests several questions to consider when coding qualitative data.

Bogdan and Biklin (1998) describe common types of coding categories but emphasize that your hypotheses shape your coding scheme.

Software programs can help with coding commentary from observations, understanding conceptual relationships, or counting key words. They facilitate systematic, efficient coding and complex analyses. Three popular software packages for qualitative coding and data analysis are Atlas.ti, NVivo7, and XSight.

Use visual devices to organize and guide your study

You may want to use matrices, concept maps, flow charts, or diagrams to illustrate relationships and themes. Visual devices help you think critically, confirm themes, and see new relationships.

Additional information

Aron, A., & Aron, E. N. (2002). Statistics for Psychology (3rd ed.). Upper Saddle River, NJ: Prentice Hall.

Berkowitz, S. (1997). Analyzing Qualitative Data. In J. Frechtling, L. Sharp, and Westat (Eds.), User-Friendly Handbook for Mixed Method Evaluations (Chapter 4). Retrieved June 21, 2006 from National Science Foundation, Directorate of Education and Human Resources Web site: http://www.ehr.nsf.gov/EHR/REC/pubs/NSF97-153/CHAP_4.HTM

Bogdan, R. B., & Biklin, S. K. (1998). Qualitative Research for Education: An Introduction to Theory and Methods (3rd ed.). Needham Heights, MA: Allyn and Bacon.

Chi-square: One Way. Retrieved May 6, 2005 from the Georgetown University, Department of Psychology, Research Methods and Statistics Resources Web site: http://www.georgetown.edu/departments/psychology/researchmethods/statistics/inferential/chisquareone.htm
Note: No longer available

Coe, R. (2000). What is an 'effect size'? A guide for users. Retrieved March 1, 2005 from the University of Durham, Curriculum, Evaluation and Management Centre, Evidence-Based Education-UK Web site: http://www.cemcentre.org/ebeuk/research/effectsize/ESguide.htm
Note: No longer available

Cohen's Kappa: Index of inter-rater reliability. Retrieved June 21, 2006 from University of Nebraska Psychology Department Research Design and Data Analysis Directory Web site: http://www-class.unl.edu/psycrs/handcomp/hckappa.PDF

Greiner, J. M. (2004). Trained observer ratings. In J. S. Wholey, H. P. Hatry, & K. E. Newcomer (Eds.), Handbook of Practical Program Evaluation (2nd ed.) (pp. 211-256). San Francisco: Jossey-Bass.

Helberg, C. (1995). Pitfalls of data analysis. Retrieved June 21, 2006 from: http://my.execpc.com/4A/B7/helberg/pitfalls/

Linear regression. (n.d.) Retrieved June 21, 2006 from the Yale University Department of Statistics Index of Courses 1997-98 Web site: http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

Lane, D. M. (2003). Tests of linear combinations of means, independent groups. Retrieved June 21, 2006 from the Hyperstat Online textbook: http://davidmlane.com/hyperstat/confidence_intervals.html

Lowry, R. P. (2005). Concepts and Applications of Inferential Statistics. Retrieved June 21, 2006 from: http://faculty.vassar.edu/lowry/webtext.html

Osborne, J. W. (2000). Advantages of hierarchical linear modeling. Practical Assessment, Research & Evaluation, 7(1). Retrieved June 21, 2006 from: http://PAREonline.net/getvn.asp?v=7&n=1

T-test. Retrieved December 4, 2007 from the Georgetown University, Department of Psychology, Research Methods and Statistics Resources Web site: http://www1.georgetown.edu/departments/psychology/resources/researchmethods/statistics/8318.html

Wuensch, K. L. (2003). Inter-rater Agreement. Retrieved June 21, 2006 from: http://core.ecu.edu/psyc/wuenschk/docs30/InterRater.doc

Page last updated: Sep 21 2011
Copyright © 2007, The University of Texas at Austin