# Analyzing observational data

Guide your analysis of observational data by the study's hypotheses. For example, if the goal is to assess improvement in instructors' oral presentation skills, you might use classroom observations to evaluate these skills before and after oral presentation training.

If you use multiple observers, make sure they report observations consistently. To improve consistency (reliability) between observers, train them as a group, have them practice using the same form, and compare their results. Observers should continue to compare results throughout the study so their ratings don't drift apart. Conduct observations several times, in case any one observation is unusual in some way, and complete observation forms during and immediately following an observation. Observing for longer periods reduces the extent to which participants change their behavior in reaction to being observed. Although reliability will be higher for records of concrete rather than abstract behaviors, observing only concrete behaviors might lead you to overlook meaningful behavior. Reliability is also higher for checklists than for ratings, which involve greater observer judgment. Once you begin observations, refine observation procedures or definitions of observational categories as needed to increase reliability.

## Assess reliability among multiple observers

**Percentage agreement** is the simplest way to assess reliability between two observers:

    (# of ratings that agree ÷ total # of ratings) x 100 = % agreement

For a 5-point rating scale, for example, you might compute the percentage
of time two observers made the same rating or the percentage of time
ratings differed by no more than one point. Agreement should be at least
80%. Many journals prefer Cohen's Kappa or the intraclass correlation
coefficient, because percentage agreement does not correct for chance
agreement.
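To make the computation concrete, here is a minimal Python sketch of percentage agreement; the observer ratings are hypothetical 5-point scores for ten observed sessions:

```python
# Hypothetical 5-point ratings from two observers of the same ten sessions.
obs1 = [3, 4, 2, 5, 3, 4, 4, 2, 3, 5]
obs2 = [3, 4, 3, 5, 3, 4, 5, 2, 3, 4]

def percent_agreement(a, b, tolerance=0):
    """Share of paired ratings that agree within `tolerance` points, x 100."""
    agree = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return 100 * agree / len(a)

print(percent_agreement(obs1, obs2))               # exact agreement: 70.0
print(percent_agreement(obs1, obs2, tolerance=1))  # within one point: 100.0
```

The `tolerance` parameter implements the "differed by no more than one point" variant mentioned above.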

Cohen's Kappa can be used with checklists that require decisions between mutually exclusive categories, such as yes/no. Generally, a Kappa value of .7 or greater indicates acceptable reliability.
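Kappa corrects observed agreement for the agreement expected by chance. A minimal sketch, using hypothetical yes/no decisions from two observers and only the standard library:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's Kappa for two raters judging the same items (nominal categories)."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same category at random,
    # given each rater's own category frequencies.
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical yes/no checklist decisions for eight observed items.
rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater2 = ["yes", "yes", "no", "no", "no", "no", "yes", "yes"]
print(round(cohens_kappa(rater1, rater2), 2))  # 0.75, above the .7 guideline
```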

The intraclass correlation coefficient (ICC) can be used with categorical or continuous data, such as the number of questions students ask during a class.

Once you have established adequate reliability, you can simplify analyses by averaging ratings of the same session across observers. Alternatively, you can analyze only one set of ratings if, before the study, you designate one observer as the main observer and use a secondary observer only to establish reliability.

**With checklists**, you can count the number of positive behaviors recorded, compute this as a percentage of total behaviors, and compare groups or times:

    (# of yes responses ÷ total # of items) x 100 = % positive behaviors

For example, consider this checklist assessing lecture organization in a course at the start of a semester:

Example

The instructor | Yes | No | Comments
---|---|---|---
1. stated the purpose of the lecture. | _X_ | | stated clearly at start of class
2. explained the relation of the class to the previous one. | _X_ | | very clear and concise
3. put class objectives on a PowerPoint slide. | _X_ | | good, reinforced #1
4. verbally provided an outline of lecture content. | _X_ | | with PowerPoint slide
5. made transition statements between lecture segments. | | _X_ | mostly jumped to new topic
6. summarized periodically and at the end of class. | | _X_ | some at end, but otherwise no
7. connected different points or topics in summaries. | | _X_ | would be helpful to tie together

For this instructor, four of seven possible organization behaviors were observed.
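The checklist tally can be scripted; in this sketch, the keys are shorthand for the seven checklist items above:

```python
# Yes/no checklist results for one instructor (True = behavior observed).
checklist = {
    "stated purpose": True,
    "related class to previous one": True,
    "objectives on slide": True,
    "outlined content": True,
    "transition statements": False,
    "periodic summaries": False,
    "connected points in summaries": False,
}

positive = sum(checklist.values())        # True counts as 1
pct = 100 * positive / len(checklist)
print(positive, f"{pct:.0f}%")            # 4 of 7 items, about 57%
```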

In a study that included several instructors, a researcher averaged the number of organization behaviors for all participating instructors to find the mean and computed a standard deviation to assess the average variation in number of organization behaviors among instructors. The researcher found a mean of 3.7 positive behaviors with a standard deviation of 1.2. He shared these results with all instructors to help them improve. Six weeks later, observers recorded a mean of 6.0 positive behaviors for these same instructors with a standard deviation of 1.1, suggesting substantial improvement.

## Test for statistical significance

While comparing means gives a rough sense of differences between times or groups, you must use statistical tests to demonstrate that these differences are unlikely to have occurred by chance. Many statistical programs provide a p value, the probability of obtaining differences as large as those observed if chance alone were operating. For example, a p value of .05 indicates a 5% probability that differences this large between groups would occur by chance rather than because of the intervention. Prior to analyzing your data, set the p value that you will use as the criterion for statistical significance. A p value of .05 is the most commonly used cutoff.

To test whether the improvement observed above is statistically significant, use a t-test for dependent means (also called a paired samples t-test, repeated measures t-test, or t-test for dependent samples). The easiest way to accomplish this is to enter the data in a statistical program like SPSS and to use the pull-down menu to run the test.
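If you would rather see the mechanics than use a pull-down menu, the t statistic for dependent means can be sketched with only the standard library. The behavior counts below are hypothetical, and a statistics program or t table still supplies the p value:

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic for a dependent-means (paired samples) t-test."""
    diffs = [b - a for a, b in zip(before, after)]
    n = len(diffs)
    # t = mean difference / standard error of the differences, df = n - 1
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical positive-behavior counts for six instructors, pre and post training.
pre  = [3, 4, 2, 5, 4, 4]
post = [6, 5, 5, 7, 6, 7]
print(round(paired_t(pre, post), 2))  # compare to the t distribution with df = 5
```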

Most statistical tests require that outcome variables, such as the number of positive behaviors observed, be normally distributed. If a variable is not, consult someone knowledgeable about statistics to determine whether you must transform it. In addition, when you compare observation ratings for two groups using a t-test, the spread of the ratings (variance) should be roughly equal in both groups, and your outcome variable should be on a continuous rather than a categorical scale.

To compare observation scores at three or more points in time, one option is a repeated measures analysis of variance (ANOVA) (also called ANOVA for correlated samples). A significant F value for an ANOVA tells you that, overall, scores differ at different times, but it does not tell you which scores are significantly different from each other. To answer that question, you must perform post-hoc comparisons after you obtain a significant F, using tests such as Tukey's and Scheffe's, which set more stringent significance levels as you make more comparisons. However, if you make specific predictions about differences between means, you can test these predictions with planned comparisons, which enable you to set significance levels at p < .05. Planned comparisons are performed instead of an overall ANOVA.

To determine if there are statistically significant differences between two groups, use a t-test for independent groups (also called an independent samples t-test, or the t-test for independent means). For example, if instructors in one group receive training to improve lecture organization while instructors in a second group do not, you could compare these groups by observing their lectures. To compare three groups or more, use an independent samples analysis of variance. Again, you will need to conduct post-hoc comparisons after obtaining a significant F value to determine differences between specific groups.
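The independent-groups version pools the two group variances; a minimal sketch with hypothetical organization-behavior counts for a trained and an untrained group:

```python
import math
from statistics import mean, variance

def independent_t(g1, g2):
    """t statistic for an independent-groups t-test (pooled variance)."""
    n1, n2 = len(g1), len(g2)
    # Pool the sample variances, weighted by degrees of freedom.
    pooled = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Hypothetical counts of organization behaviors per lecture.
trained   = [6, 7, 5, 6, 7, 5]
untrained = [4, 3, 5, 4, 3, 5]
print(round(independent_t(trained, untrained), 2))  # df = n1 + n2 - 2
```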

If you are comparing groups, observe them before your instructional intervention (e.g. training) to make sure there are not pre-existing differences that are statistically significant. If there are pre-test differences and you randomly assigned participants to the groups, you can control for these differences using an Analysis of Covariance (ANCOVA) procedure. You cannot use an ANCOVA, however, to control for pre-existing group differences in a field experiment, so consult with a statistician in this case.

Linear regression enables you to predict the level of an outcome variable using one or more continuous variables. For example, you might use the number of observed behaviors an instructor uses to organize lectures to predict later student test scores.

If participants are observed multiple times, hierarchical linear models (HLM) may be a better choice than a repeated measures ANOVA. HLM is particularly suited to analyzing data from repeated measurements or data with a hierarchical structure. For example, in much educational research, students are grouped within classrooms, which are grouped within schools. HLM takes into account that students from the same classroom or school have more in common than individuals randomly sampled from a larger population. HLM requires specialized software, available to UT Austin faculty and staff at a discount.

You might also compute correlations to determine whether there is a statistically significant positive or negative relationship between two continuous variables. For example, you could determine if the number of questions students ask is significantly related to ratings of course satisfaction. Be aware, however, that computing correlations between several sets of variables increases the chances of finding a relationship due to chance alone, and that finding significant correlations between variables does not tell you what causes those relationships.
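The Pearson correlation coefficient quantifies such a relationship; the question counts and satisfaction ratings in this sketch are hypothetical:

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two continuous variables."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

# Hypothetical data: questions asked per class vs. course-satisfaction rating.
questions    = [2, 5, 3, 8, 6, 4]
satisfaction = [3.2, 3.9, 3.0, 4.5, 4.0, 3.6]
print(round(pearson_r(questions, satisfaction), 2))  # ranges from -1 to +1
```

As the surrounding text cautions, even a large r says nothing about which variable, if either, causes the other.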

For additional help from someone knowledgeable about statistics, contact the research consulting staff at UT Austin's Division of Statistics & Scientific Computation.

**When using ratings**, compare means of groups or times:

    Mean (Time 1) = Sum of all ratings at Time 1 ÷ # of ratings at Time 1
    vs.
    Mean (Time 2) = Sum of all ratings at Time 2 ÷ # of ratings at Time 2

For example, observers rated the clarity of an instructor's lectures from 1 = no clarity to 5 = outstanding clarity. For instructors participating in a public speaking course, mean ratings of clarity increased from 2.3 before the course to 3.7 after the course. You would conduct a t-test for dependent means to see if this difference is statistically significant.

If you are rating a behavior using categories that are not on a continuous scale, you can test for differences between groups or times using a chi-square statistic. For example, in a study comparing classrooms equipped with computers for every student and those not equipped with computers, observers rated 200 instructors.

Example

**How often does the instructor act as a coach or facilitator during
class?**

&nbsp; | Never | Rarely | Occasionally | Frequently | Extensively
---|---|---|---|---|---
Computer equipped | 31% | 22% | 20% | 17% | 10%
Not computer equipped | 41% | 25% | 16% | 10% | 8%

**Definitions:**

- Never = not observed in any classes for this instructor
- Rarely = less than five minutes per class
- Occasionally = an average of between 5 and 15 minutes per class
- Frequently = an average of between 15 and 25 minutes per class
- Extensively = an average of more than 25 minutes per class

The researcher decided ahead of time to combine the five rating
categories into two larger categories: **rarely or less** and **occasionally
or more**. With these larger categories, a chi-square test
revealed that a higher proportion of instructors in computer-equipped
classrooms at least occasionally act as coaches or facilitators compared
to instructors in non-equipped classrooms:

Example

**How often does the instructor act as a coach or facilitator
during class?**

&nbsp; | Rarely or less | Occasionally or more
---|---|---
Computer equipped | 53% | 47%
Not computer equipped | 66% | 34%
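The chi-square statistic for a table like this is the sum, over cells, of (observed - expected)² / expected, where expected counts come from the row and column totals. The sketch below assumes, hypothetically, 100 instructors per group (the source reports only the percentages and a total of 200), so the resulting statistic is illustrative rather than the study's actual value; whether it reaches significance depends on the real group sizes:

```python
# Collapsed 2x2 table of counts: rows are groups, columns are rating categories.
# Counts assume a hypothetical 100 instructors per group.
observed = [[53, 47],   # computer equipped: rarely or less, occasionally or more
            [66, 34]]   # not computer equipped

def chi_square(table):
    """Chi-square statistic of independence for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat  # compare to the chi-square distribution with df = 1 for a 2x2 table

print(round(chi_square(observed), 2))
```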

**You should also analyze comments** that accompany
checklists or ratings, identifying themes and significant points,
such as what works well and what needs improvement. Comments will
often help you interpret numeric results.

If your study uses extensive observer commentary or a narrative log (a detailed, descriptive record of verbal and nonverbal behaviors), create a transcript using a word processing program and analyze it by coding.

### Develop coding categories

A major step in analyzing qualitative data is coding speech into meaningful categories, which enables you to organize large amounts of text and discover patterns that would be difficult to detect by just reading observer commentary. Always keep the original copy of observer commentary. Bogdan and Biklin (1998) suggest the following process:

1. Order narrative logs chronologically or by some other criterion, then carefully read all your data at least twice during long, undisturbed periods.
2. Conduct initial coding: generate numerous category codes as you read commentary, labeling related data without worrying about the variety of categories. Write notes to yourself, listing ideas or diagramming relationships you notice. Because codes are not always mutually exclusive, a phrase or section might be assigned several codes.
3. Use focused coding to eliminate, combine, or subdivide coding categories, and look for repeating ideas and larger themes that connect codes. Repeating ideas are the same idea expressed by different respondents, while a theme is a larger topic that organizes or connects a group of repeating ideas. Try to limit final codes to between 30 and 50.

After you have developed coding categories, make a list that assigns each code an abbreviation and description.
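Once segments are coded, simple tallies help surface repeating ideas. A minimal sketch, with hypothetical codes and excerpts:

```python
from collections import Counter

# Hypothetical coded segments from a narrative log: (code, excerpt) pairs.
# A real segment may carry several codes; here each carries one for simplicity.
coded_segments = [
    ("strategy", "instructor pauses to ask for questions"),
    ("activity", "students work in pairs on a problem"),
    ("strategy", "uses a student's name to draw her in"),
    ("event",    "fire alarm interrupts the class"),
    ("strategy", "writes key terms on the board"),
    ("activity", "students present solutions"),
]

# Tally how often each code appears, most frequent first.
counts = Counter(code for code, _ in coded_segments)
for code, n in counts.most_common():
    print(code, n)
```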

Berkowitz (1997) suggests considering these questions when coding qualitative data:

- What common themes emerge in observations about specific topics? How do these patterns (or lack thereof) help to illuminate the broader study’s hypotheses?
- Are there deviations from these patterns? If so, are there any factors that might explain these deviations?
- How are the environments or past experiences of participants related to their behavior and attitudes?
- What interesting stories emerge from observations? How do they help illuminate the study’s hypotheses?
- Do any of these patterns suggest that additional data may be needed? Do any of the hypotheses need to be revised?
- Are the patterns that emerge similar to the findings of other studies on the same topic? If not, what might explain these discrepancies?

Bogdan and Biklin (1998) provide common types of coding categories but emphasize that your hypotheses shape your coding scheme.

- **Setting/Context codes** provide background information on the setting, topic, or participants.
- **Defining the Situation codes** categorize the world view of participants and how they see themselves in relation to a setting or your topic.
- **Participant Perspective codes** capture how participants define a particular aspect of a setting, summed up in phrases they use such as, "Say what you mean, but don't say it mean."
- **Participants' Ways of Thinking about People and Objects codes** capture how participants categorize and view each other, outsiders, and objects. For example, a dean at a private school may categorize students: "There are crackerjack kids and there are junk kids."
- **Process codes** categorize sequences of events and changes over time.
- **Activity codes** identify recurring informal and formal behaviors.
- **Event codes**, in contrast, identify infrequent or unique happenings in the setting or lives of those being observed.
- **Strategy codes** relate to the ways people accomplish things, such as how instructors maintain students' attention during lectures.
- **Relationship and social structure codes** tell you about alliances, friendships, and adversaries, as well as about more formally defined relations such as social roles.
- **Method codes** identify your research approaches, procedures, dilemmas, and breakthroughs.

Software programs can help with coding commentary from observations, understanding conceptual relationships, and counting key words. They facilitate systematic, efficient coding and complex analyses. Three popular software packages for qualitative coding and data analysis are Atlas.ti, NVivo7, and XSight.

### Use visual devices to organize and guide your study

You may want to use matrices, concept maps, flow charts, or diagrams to illustrate relationships and themes. Visual devices help you think critically, confirm themes, and see new relationships.

## Additional information

Aron, A. & Aron, E. N. (2002). *Statistics for Psychology* (3rd ed.). Upper Saddle River, NJ: Prentice Hall.

Berkowitz, S. (1997). Analyzing qualitative data. In J. Frechtling, L. Sharp, and Westat (Eds.), *User-Friendly Handbook for Mixed Method Evaluations* (Chapter 4). Retrieved June 21, 2006 from the National Science Foundation, Directorate of Education and Human Resources Web site: http://www.ehr.nsf.gov/EHR/REC/pubs/NSF97-153/CHAP_4.HTM

Bogdan, R. B. & Biklin, S. K. (1998). *Qualitative Research for Education: An Introduction to Theory and Methods* (3rd ed.). Needham Heights, MA: Allyn and Bacon.

*Chi-square: One way.* Retrieved May 6, 2005 from the Georgetown University, Department of Psychology, Research Methods and Statistics Resources Web site: http://www.georgetown.edu/departments/psychology/researchmethods/statistics/inferential/chisquareone.htm (Note: no longer available)

Coe, R. (2000). *What is an 'effect size'? A guide for users*. Retrieved March 1, 2005 from the University of Durham, Curriculum, Evaluation and Management Centre, Evidence-Based Education-UK Web site: http://www.cemcentre.org/ebeuk/research/effectsize/ESguide.htm (Note: no longer available)

*Cohen's Kappa: Index of inter-rater reliability*. Retrieved June 21, 2006 from the University of Nebraska Psychology Department Research Design and Data Analysis Directory Web site: http://www-class.unl.edu/psycrs/handcomp/hckappa.PDF

Greiner, J. M. (2004). Trained observer ratings. In J. S. Wholey, H. P. Hatry & K. E. Newcomer (Eds.), *Handbook of Practical Program Evaluation* (2nd ed., pp. 211-256). San Francisco: Jossey-Bass.

Helberg, C. (1995). *Pitfalls of data analysis.* Retrieved June 21, 2006 from: http://my.execpc.com/4A/B7/helberg/pitfalls/

Lane, D. M. (2003). *Tests of linear combinations of means, independent groups.* Retrieved June 21, 2006 from the HyperStat Online textbook: http://davidmlane.com/hyperstat/confidence_intervals.html

*Linear regression.* (n.d.) Retrieved June 21, 2006 from the Yale University Department of Statistics Index of Courses 1997-98 Web site: http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

Lowry, R. P. (2005). *Concepts and Applications of Inferential Statistics*. Retrieved June 21, 2006 from: http://faculty.vassar.edu/lowry/webtext.html

Osborne, J. W. (2000). Advantages of hierarchical linear modeling. *Practical Assessment, Research & Evaluation,* 7(1). Retrieved June 21, 2006 from: http://PAREonline.net/getvn.asp?v=7&n=1

*T-test.* Retrieved December 4, 2007 from the Georgetown University, Department of Psychology, Research Methods and Statistics Resources Web site: http://www1.georgetown.edu/departments/psychology/resources/researchmethods/statistics/8318.html

Wuenschk, K. L. (2003). *Inter-rater agreement*. Retrieved June 21, 2006 from: http://core.ecu.edu/psyc/wuenschk/docs30/InterRater.doc