Is speech-gesture production ballistic or interactive?
Nobuhiro Furuyama
National Institute of Informatics
(JAPAN)
David McNeill
University of Chicago
(USA)
Mischa Park-Doob
University of Chicago
(USA)


Abstract

A “ballistic” model of speech and gesture argues that once the planning of a sentence and a gesture is completed in the conceptualizer, the rest of the production processes of speech and gesture are independent of each other. Observing Arrernte speakers who gesture at arm’s length, De Ruiter and Wilkins hypothesized that the preparation of their gestures should take longer than usual and that, if the production processes of speech and gesture are interactive, the co-expressive speech should be delayed to the same degree as the gesture to maintain synchrony between the two. The result was that the preparation phase of Arrernte gesture was longer than usual, but speech did not become delayed. They concluded the ballistic model was therefore correct. The present paper asks whether the same holds true for non-Arrernte speakers (e.g., English speakers) gesturing at arm’s length. Our preliminary analysis shows that the preparation phase does not necessarily take longer than usual and that in any case speech remains synchronized with the related gesture. We conclude that the production of speech and that of gesture are interactive throughout the entire process, and that De Ruiter and Wilkins’s findings likely resulted from factors other than the modularity of speech-gesture production.


Introduction

This paper argues that speech production and production of the concurrent gesture are interactive throughout the entire process, regardless of where in the gesture space a gesture is produced, or how long the preparation phase of a gesture lasts. This is meant to be contrasted with the ballistic (or modular) model of speech-gesture production (e.g., Levelt, Richardson, and La Heij 1985; De Ruiter and Wilkins 1998), which argues that once the planning of a sentence and a gesture is completed in the conceptualizer, speech production and gesture production are independent of each other.

Evidence in support of the ballistic model allegedly comes in part from a cross-linguistic (and cross-cultural) comparison of speech-gesture synchronization differences between Arrernte speakers and Dutch speakers (De Ruiter and Wilkins 1998). Arrernte is a language spoken in central Australia. The key fact is that Arrernte speakers, unlike non-Arrernte speakers, perform gestures at arm’s length; such gestures, performed at the outer limit of the gesture space, require a longer preparation phase. The question De Ruiter and Wilkins ask is, does the co-expressive (‘affiliated’) speech become delayed to the same degree as the gesture to maintain the synchrony between gesture and speech? The finding was that the preparation phases of Arrernte gestures were longer than those of Dutch gestures, but speech did not become delayed: gestures occurred after the co-expressive speech by an amount equal to the extra time needed for the gesture preparation. Thus, De Ruiter and Wilkins concluded, gesture and speech are ‘modular’ in that speech and gesture are on separate ballistic tracks; once launched they unfold independently of each other. This is schematically shown in (a) and (b) in Figure 1.

(a) Arrernte (large) gesture with long preparation phase
(b) non-Arrernte (small) gesture with short preparation phase

Figure 1. Relative timing between speech and the concurrent gesture (black = preparation phase; gray = stroke phase, or the interval between the onset of the stroke phase and the offset of the entire gesture movement; slanted stripe = speech).

One may naturally wonder what the synchronization relationship of speech and gesture would be if non-Arrernte speakers (Dutch or English) could somehow be induced to make their gestures at arm’s length, as Arrernte speakers do spontaneously. The present study examines this question, using English speakers as participants. A ballistic model of speech and gesture would predict that under these circumstances English speakers would exhibit behavior similar to that of Arrernte speakers: the preparation phase of English speakers’ gestures would take longer than normal, causing gesture strokes to become delayed with respect to the co-expressive speech. If this turns out to be the case, the ballistic model may prove to be correct. If not, as the present study argues, the result could show two things. First, if speech and gesture remain in synchrony despite an extension of the arms to the outer limit of the gesture space, the production of speech and the production of gesture are interactive throughout the duration of both processes. Second, the finding of De Ruiter and Wilkins’s experiment (that in Arrernte speech continues forward while the gesture becomes delayed by an amount equal to the extra time for gesture preparation) resulted from factors other than the modularity of speech-gesture production.


The Present Predictions

There are two possible scenarios that fit the interactive theory. One is that the preparation phases of large gestures by speakers of English do not become longer (i.e., they are comparable in duration to those of Dutch gestures), and that the relative timing between the onset of speech and that of the stroke phase is intact for large gestures.

The other possibility that also fits the interactive theory is that the preparation phases of large gestures by English speakers become longer than those of small (Dutch) gestures, yet the relative timing between the onset of speech and the stroke phase remains intact. These predictions are schematically shown in (c) and (d) of Figure 2, respectively. These diagrams are to be contrasted with the prediction that De Ruiter and Wilkins would make, as shown in (a) and (b) of Figure 1 above.
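
The contrast between the two models can be stated as a small calculation. The following is only an illustrative sketch: the function name and the "model" labels are hypothetical and do not come from De Ruiter and Wilkins or from the present study. It assumes that the ballistic model predicts a stroke delay equal to the extra preparation time relative to a baseline, while both interactive scenarios, (c) and (d), predict no extra delay. The two reference values are the mean preparation durations reported by De Ruiter and Wilkins (1998) for Dutch and Arrernte speakers.

```python
# Illustrative sketch only; helper and label names are hypothetical.

def predicted_stroke_delay(prep_ms, baseline_prep_ms, model):
    """Predicted extra delay (msec) of stroke onset relative to the
    co-expressive speech, compared with the baseline timing."""
    extra_prep = max(0.0, prep_ms - baseline_prep_ms)
    if model == "ballistic":
        # Speech runs on its own track, so the stroke lags behind it
        # by exactly the extra preparation time.
        return extra_prep
    if model == "interactive":
        # Speech and gesture adjust to each other, so synchrony is
        # preserved however long the preparation phase lasts.
        return 0.0
    raise ValueError(f"unknown model: {model!r}")


# Mean preparation durations (msec) reported by De Ruiter and Wilkins (1998).
DUTCH_PREP_MS = 559.0
ARRERNTE_PREP_MS = 803.0

print(predicted_stroke_delay(ARRERNTE_PREP_MS, DUTCH_PREP_MS, "ballistic"))    # 244.0
print(predicted_stroke_delay(ARRERNTE_PREP_MS, DUTCH_PREP_MS, "interactive"))  # 0.0
```

Under these assumptions, a lengthened preparation phase alone cannot separate the two accounts; only the timing of the stroke relative to speech can.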


Method

To induce English speakers to produce gestures at the outer limit of the gesture space (gestures that might induce longer preparation phases), we modified our standard procedure of eliciting gestures via a cartoon narration.

Eight native speakers of American English participated in the study. Pair 1 consisted of a female narrator and a male listener, all four subjects in Pairs 2 and 4 were female, and both members of Pair 3 were male (the subjects volunteered in pairs; we conducted the experiment without regard to gender). They were all undergraduate students at the University of Chicago at the time of the experiment, and were recruited through the “study-pool” list, an email database of interested volunteers maintained by the Department of Psychology.

The material used to elicit the narrative featured the well-known cartoon characters Sylvester and Tweety. The cartoon is entitled “Canary Row” (Warner Brothers, Inc.). A detailed description of the story and some aspects of the animated film can be found in the Appendices of McNeill (1992). We asked our subjects to recount the cartoon story and, as they spoke, to point at relevant photos (still image clips printed in color on standard US letter-sized paper, 23 in total) that were taken from each of the episodes of the cartoon and posted in a random arrangement on the walls of the experiment room.

The participants were tested in pairs. The narrator watched the entire cartoon once in a separate room. The narrator then joined the listener in the experiment room and was given approximately one minute to familiarize herself/himself with the locations of the image clips. Then the narrator was instructed to recount the entire cartoon story, and to point at the image clips posted on the walls of the experiment room whenever the narration came to a point where the image clips were relevant. The listener was instructed to attend carefully and attempt to remember the events recounted by the narrator. The cameras were then switched on and the pair was left alone in the experiment room until the task was completed. The narrator was videotaped such that her/his whole body would fit on the screen even if s/he produced a gesture at the outer limit of the gesture space. The listener was excluded from the camera field in favor of having a wider space available for the narrator.

(c) First prediction from the interactive theory for large gestures
(d) Second prediction from the interactive theory for large gestures
Figure 2. The present predictions on speech-gesture timing when a gesture is performed at the outer limit of the gesture space.


Results

The gesture space can be segmented as shown in Figure 3. The present analysis is limited to pointing gestures for which the preparation started in the center gesture space and terminated at the extreme periphery. This coding restriction ensured that we could see the effects of large movements on preparation phase duration and speech-gesture timing. Analysis shows that in most cases speech either remains synchronized with the meaningfully related gesture or even waits until the pointing gesture is fully executed at the outer limit of the gesture space.

Figure 3. The gesture space (Pedelty 1987, cited in McNeill 1992).

Figure 4 shows the histogram of the duration of the preparation phase when English speakers were induced to gesture at the outer limit of the gesture space.
(1) The mean duration of the preparation phase of the English speakers is 631.11 msec (SD = 34.27 msec; median = 566.67 msec). This is more comparable to that of the Dutch speakers (mean = 559 msec, SD unknown) than to that of the Arrernte speakers (mean = 803 msec, SD unknown).
(2) The histogram also shows that the distribution of preparation phase durations is not entirely normal. There are some cases of abnormally long preparation phases towards the right edge of the chart, and durations in the range 301-600 msec have higher frequencies than other durations. These deviations from an otherwise relatively normal distribution suggest that there may be several interacting factors involved. Crucially for the present purpose, however, performing the stroke phase of a gesture at the outer limit of the gesture space does not necessarily make the preparation phase longer. The longest preparation phase in the data set was 2166.67 msec (= 65 frames). When a preparation phase lasts as long as this, it is typically accompanied by a so-called pre-stroke hold between the onset of the preparation phase and the onset of the stroke phase. Our longest preparation phase is a case in point: there was a long pre-stroke hold. The shortest preparation phase for each subject was equal to or below 366.67 msec (= 11 frames). The shortest preparation phase in the entire data set lasted 333.33 msec (= 10 frames).
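
The frame counts and durations reported above can be checked with a short calculation. This is a minimal sketch, not part of the original analysis: the frame rate is not stated in the text, and the value of 30 frames per second is an assumption inferred from the reported pairs (e.g., 10 frames = 333.33 msec). The helper names are hypothetical. The second function simply reports which of the two means from De Ruiter and Wilkins lies nearer to the English mean.

```python
# Assumption: video recorded at 30 frames per second. The text does not
# state the frame rate; it is inferred from the reported frame/msec pairs.
FRAMES_PER_SECOND = 30


def frames_to_msec(frames, fps=FRAMES_PER_SECOND):
    """Convert a duration counted in video frames to milliseconds."""
    return round(frames * 1000 / fps, 2)


def nearest_reference(mean_ms, references):
    """Return the label of the reference mean closest to mean_ms."""
    return min(references, key=lambda label: abs(references[label] - mean_ms))


for frames in (10, 11, 17, 65):
    print(frames, "frames =", frames_to_msec(frames), "msec")

# Means (msec) as reported in the text.
reported_means = {"Dutch": 559.0, "Arrernte": 803.0}
print(nearest_reference(631.11, reported_means))  # Dutch
```

The English mean differs from the Dutch mean by about 72 msec and from the Arrernte mean by about 172 msec, which is the sense in which it is "more comparable" to the former.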
As to the relative timing between the stroke phase and the meaningfully related speech segments, they remain robustly in synchrony for the English speakers regardless of the length of the preparation phase. This is clearly shown in examples (1) through (4) below:
Example (1): A gesture with one of the shortest prep. phases¹
[there’s a picture / <h>of wha^t it looks like #]
Example (2): Another gesture with one of the shortest prep. phases
[Sylvester / on* / the picture’s on the door / goes u]p /
Example (3): A gesture with one of the mid-length prep. phases
and then<n> / the camera sorta [pulls back n’ we see that / driving the trolley #]
Example (4): A gesture with the longest prep. phase
[a<aa>nd / Tweety’s ¿cage? // ¿right? is sitting on the ¿window sill? /]
¹ The following symbols are used in the transcription: [ = onset of preparation phase; ] = offset of retraction; ^ = superimposed beat; / = unfilled speech pause; bold face = stroke phase and post-stroke hold; _ = pre-stroke hold; # = breath pause; < ... > = filled speech pause; * = aborted speech or “speech trouble”; ¿...? = rising intonational contour.
Figure 4. The histogram of the duration of preparation phase in msec when English speakers were induced to gesture at the outer limit of the gesture space. Labeled values: 366.67 msec (11 frames), 333.33 msec (10 frames), 566.67 msec (17 frames), 2166.67 msec (65 frames).


Discussion

Speech-gesture synchrony by non-Arrernte speakers would imply that, for the Arrernte, the temporal separation of speech and gesture is not just a mechanical effect of a larger gesture space. That is, there is something extra causing the separation of speech and gesture. This extra something could be a rhetorical use of gesture. Suppose that, in the Arrernte culture, there is active control of gesture such that gestures are timed to be deployed after the co-expressive speech. This control might be ‘rhetorical’ in that gestures are made to occur as ‘reinforcements’ or echoes of what is said in speech.

Even if this hypothetical rhetorical use is not present in the Arrernte, as long as there is some form of active control of gesture, gesture could follow speech, as it were, by design. The large gesture space then would be an effect of the gesture-speech timing difference rather than a cause of it. The extra long preparation phase would be the instrument of the speaker’s active control over the timing of the gesture. Using the preparation phase as the instrument of control would automatically make the amount of gesture delay equal the extra length of the preparation phase, as De Ruiter and Wilkins report.

But whatever the extra factor in Arrernte gesture performance is, something extra in the control of gesture means that gesture is not ballistic. And this means that Arrernte speech and gestures cannot be taken as evidence of modularity. On the contrary, they would reveal the very opposite of modularity—a continuing on-line process by Arrernte speakers of controlling the relationship between speech and gesture, in which the gesture is aimed to occur at the moment that the semantically co-expressive speech ends.


Conclusion

The evidence shows decisively that separation of speech and gesture in the Arrernte manner is not the result of using a larger gesture space with its attendant longer preparation phase. While a ballistic model of speech and gesture would predict gesture to be delayed according to length of preparation phase, our evidence has shown that speakers can maintain tight synchrony between speech and gesture even as preparation phase length varies widely. Such behavior is possible only if speakers exert careful control over the temporal relationship between the various parts of their simultaneously unfolding speech and gestures.


References

De Ruiter, J.P. and Wilkins, D. (1998). The synchronization of gesture and speech in Dutch and Arrernte (an Australian Aboriginal language). In S. Santi, I. Guaïtella, C. Cavé and G. Konopczynski (eds.), Oralité et Gestualité, pp. 603-607. Paris: L’Harmattan.
Kita, S. (2000). How representational gestures help speaking. In McNeill (2000).
Levelt, W.J., Richardson, G. and La Heij, W. (1985). Pointing and voicing in deictic expressions. Journal of Memory and Language, 24:133-164.
McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. Chicago: University of Chicago Press.
McNeill, D. (ed.) (2000). Language and Gesture. Cambridge: Cambridge University Press.
McNeill, D. and Duncan, S. (2000). Growth points in thinking-for-speaking. In McNeill (2000).
Nobe, S. (1996). Representational Gestures, Cognitive Rhythms, and Acoustic Aspects of Speech: A Network/Threshold Model of Gesture Production. Unpublished Ph.D. Dissertation, Department of Psychology, The University of Chicago.