Is speech-gesture production
ballistic or interactive?
Nobuhiro Furuyama
National Institute of Informatics
(JAPAN)
David McNeill
University of Chicago
(USA)
Mischa Park-Doob
University of Chicago
(USA)
Abstract
A ballistic model of speech and gesture argues that once the planning
of a sentence and a gesture is completed in the conceptualizer, the rest of
the production processes of speech and gesture are independent from each other.
Observing Arrernte speakers who gesture at arm's length, De Ruiter and
Wilkins hypothesized that the preparation of their gestures should take longer
than usual, and that the co-expressive speech should be delayed to the same
degree as the gesture to maintain synchrony between the two if their production
processes are interactive. The result was that the preparation phase of Arrernte
gesture was longer than usual, but speech did not become delayed. They concluded
the ballistic model was therefore correct. The present paper asks whether the
same holds true for non-Arrernte speakers (e.g., English speakers) gesturing
at arm's length. Our preliminary analysis shows that the preparation phase
does not necessarily take longer than usual and that in any case speech remains
synchronized with the related gesture. We conclude that the production of speech
and that of gesture are interactive throughout the entire process, and that
De Ruiter and Wilkins's findings likely resulted from factors other than
the modularity of speech-gesture production.
Introduction
This paper argues that speech production and production of the concurrent gesture
are interactive throughout the entire process, regardless of where in the gesture
space a gesture is produced, or how long the preparation phase of a gesture
lasts. This is meant to be contrasted with the ballistic (or modular) model
of speech-gesture production (e.g., Levelt, Richardson, and La Heij 1985; De
Ruiter and Wilkins 1998), which argues that once the planning of a sentence and
a gesture is completed in the conceptualizer, speech production and gesture production
are independent of each other.
Evidence in support of the ballistic model allegedly comes in part from a crosslinguistic
(and cross-cultural) comparison of speech-gesture synchronization differences
between Arrernte speakers and Dutch speakers (De Ruiter and Wilkins 1998). Arrernte
is a language spoken in central Australia. The key fact is that Arrernte speakers,
unlike non-Arrernte speakers, perform gestures at arm's length; such
gestures, performed at the outer limit of the gesture space, require a longer
preparation phase. The question De Ruiter and Wilkins ask is whether
the co-expressive ("affiliated") speech becomes delayed to the
same degree as the gesture, so as to maintain synchrony between gesture and
speech. The finding was that the preparation phases of Arrernte gestures were
longer than those of Dutch gestures, but speech did not become delayed: gestures
occurred after the co-expressive speech by an amount equal to the extra time
needed for the gesture preparation. Thus, De Ruiter and Wilkins concluded, gesture
and speech are modular in that speech and gesture are on separate
ballistic tracks; once launched they unfold independently of each other. This
is schematically shown in (a) and (b) in Figure 1.
(a) Arrernte (large) gesture with long preparation phase
(b) non-Arrernte (small) gesture with short preparation phase
Figure 1. Relative timing between speech and the concurrent gesture (black =
preparation phase; gray = stroke phase, or the interval between the onset of the
stroke phase and the offset of the entire gesture movement; slanted stripe = speech).
One may naturally wonder what the synchronization relationship of speech and
gesture would be if non-Arrernte speakers (Dutch or English) could somehow
be induced to make their gestures at arm's length, as Arrernte speakers
do spontaneously. The present study examines this question, using English
speakers as participants. A ballistic model of speech and gesture would predict
that under these circumstances English speakers would exhibit behavior similar
to that of Arrernte speakers: the preparation phase of English speakers'
gestures would take longer than normal, causing gesture strokes to become delayed
with respect to the co-expressive speech. If this turns out to be the case,
the ballistic model may prove to be correct. If not, as the present study argues,
the result could show two things. First, if speech and gesture remain in synchrony
despite an extension of the arms to the outer limit of the gesture space, the
production of speech and the production of gesture are interactive throughout the
duration of both processes. Second, the findings of De Ruiter and Wilkins's
experiment (that in Arrernte, speech continues forward while the gesture
becomes delayed by an amount equal to the extra time for gesture preparation) resulted
from factors other than the modularity of speech-gesture production.
The Present Predictions
There are two possible scenarios that fit the interactive theory. One is that
the preparation phases of large gestures by speakers of English do not become
longer (i.e., they are comparable to those of Dutch gestures), and that the relative
timing between the onset of speech and that of the stroke phase is intact for
large gestures.
The other possibility that also fits the interactive theory is that the preparation
phases of large gestures by English speakers become longer than those of small
(Dutch) gestures, yet the relative timing between the onset of speech and the
stroke phase remains intact. These predictions are schematically shown in (c)
and (d) of Figure 2, respectively. These diagrams are to be contrasted with
the prediction that De Ruiter and Wilkins would make, as shown in (a) and (b)
of Figure 1 above.
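The contrast between the two interactive scenarios and the ballistic prediction can be made concrete with a small numerical sketch. The Python code below is illustrative only and is not part of the original analysis. It assumes, as a simplification, that for a small (baseline) gesture the onset of the co-expressive speech coincides with the onset of the stroke. The function names and the Timing record are hypothetical, and the two durations are the Dutch and Arrernte mean preparation times cited in the Results.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    """Onsets in msec., measured from the onset of the preparation phase."""
    speech_onset: float
    stroke_onset: float


def ballistic(prep_ms: float, baseline_prep_ms: float) -> Timing:
    # Assumption: speech is launched on its own track, timed as if the
    # gesture had the baseline preparation; it does not wait for the stroke.
    return Timing(speech_onset=baseline_prep_ms, stroke_onset=prep_ms)


def interactive(prep_ms: float, baseline_prep_ms: float) -> Timing:
    # Assumption: speech onset is adjusted on-line to the actual stroke onset,
    # so the baseline preparation time plays no role.
    return Timing(speech_onset=prep_ms, stroke_onset=prep_ms)


def stroke_lag(t: Timing) -> float:
    """How long the stroke onset trails the speech onset (msec.)."""
    return t.stroke_onset - t.speech_onset


BASELINE_PREP_MS = 559.0  # Dutch mean preparation time cited in the Results
LONG_PREP_MS = 803.0      # Arrernte mean preparation time cited in the Results

# Ballistic prediction: the stroke lags speech by the extra preparation time.
print(stroke_lag(ballistic(LONG_PREP_MS, BASELINE_PREP_MS)))
# Interactive scenario (c): preparation is not lengthened; synchrony holds.
print(stroke_lag(interactive(BASELINE_PREP_MS, BASELINE_PREP_MS)))
# Interactive scenario (d): preparation is lengthened; synchrony still holds.
print(stroke_lag(interactive(LONG_PREP_MS, BASELINE_PREP_MS)))
```

Under these assumptions the ballistic model yields a stroke lag equal to the extra preparation time (244 msec. for the values used here), whereas both interactive scenarios yield a lag of zero.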
Method
To induce English speakers to produce gestures at the outer limit of the gesture
space (gestures that might induce longer preparation phases), we modified
our standard procedure of eliciting gestures via a cartoon narration.
Eight native speakers of American English participated in the study. Pair 1
consisted of a female narrator and male listener, all four subjects in Pairs
2 and 4 were female, and both members of Pair 3 were male (the subjects volunteered
in pairs; we conducted the experiment without regard to gender). They were
all undergraduate students at the University of Chicago at the time of the experiment,
and were recruited through the study-pool list, an email database
of interested volunteers maintained by the Department of Psychology. The material
used to elicit the narrative featured the well-known cartoon characters Sylvester
and Tweety. The cartoon is entitled "Canary Row" (Warner Brothers,
Inc.). A detailed description of the story and some aspects of the animated
film can be found in the Appendices of McNeill (1992). We asked our subjects
to recount the cartoon story and, as they spoke, to point at relevant photos
(still image clips printed in color on standard US letter-sized paper, 23 in
total) that were taken from each of the episodes of the cartoon and posted in
a random arrangement on the walls of the experiment room. The participants were
tested in pairs. The narrator watched the entire cartoon once in a separate
room. The narrator then joined the listener in the experiment room and was given
approximately one minute to familiarize herself/himself with the locations of
the image clips. Then the narrator was instructed to recount the entire cartoon
story, and to point at the image clips posted on the walls of the experiment
room whenever the narration came to a point where the image clips were relevant.
The listener was instructed to attend carefully and attempt to remember the
events recounted by the narrator. The cameras were then switched on and the
pair was left alone in the experiment room until the task was completed. The
narrator was videotaped such that her/his whole body would fit on the screen
even if s/he produced a gesture at the outer-limit of the gesture space. The
listener was excluded from the camera field in favor of having a wider space
available for the narrator.
(c) First prediction from the interactive theory for large gestures
(d) Second prediction from the interactive theory for large gestures
Figure 2. The present predictions on speech-gesture timing when a gesture is
performed at the outer limit of the gesture space.
Results
The gesture space can be segmented as shown in Figure 3. The present analysis
is limited to pointing gestures for which the preparation started in the center
gesture space and terminated at the extreme periphery. This coding restriction
ensured that we could see the effects of large movements on preparation phase
duration and speech-gesture timing. Analysis shows that in most cases speech
either remains synchronized with the meaningfully related gesture or even waits
until the pointing gesture is fully executed at the outer limit of the gesture
space.
Figure 4 shows the histogram of the duration of preparation phase when English
speakers were induced to gesture at the outer-limit of the gesture space.
(1) The mean duration of the preparation phase of the English speakers is 631.11
msec. (SD = 34.27 msec.; median = 566.67 msec.). This is closer to that
of the Dutch speakers (mean = 559 msec., SD unknown) than to that of the Arrernte
speakers (mean = 803 msec., SD unknown).
(2) The histogram also shows that the distribution of duration of preparation
phase is not entirely normal. There are some cases of abnormally long preparation
phases towards the right edge of the chart. Durations in the range 301-600 msec.
have higher frequencies than other durations. These deviations from an otherwise
relatively normal distribution imply that there may be several interacting factors
involved. Crucially for the present purpose, however, performing the stroke
phase of a gesture at the outer limit of the gesture space does not necessarily
make the duration of preparation phases longer. The longest preparation phase
in the data set was 2166.67 msec. (= 65 frames).
Figure 3. The gesture space (Pedelty 1987, cited in McNeill 1992).
When a preparation phase lasts as long as this, it is typically accompanied
by a so-called pre-stroke hold between the onset of the preparation phase and
the onset of the stroke phase. Our longest preparation phase is a case in point:
there was a long pre-stroke hold. The shortest preparation phase for each subject
was equal to or below 366.67 msec. (= 11 frames). The shortest preparation phase
in the entire data set lasted 333.33 msec. (=10 frames).
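The millisecond values reported here follow from the frame counts if the video was coded at 30 frames per second. That frame rate is not stated explicitly in the text; it is an assumption inferred from the reported pairs (for example, 65 frames = 2166.67 msec.). A minimal Python sketch of the conversion, using a hypothetical helper name:

```python
# Assumed video frame rate, inferred from the reported frame/msec. pairs.
FRAME_RATE_FPS = 30


def frames_to_msec(frames: int, fps: int = FRAME_RATE_FPS) -> float:
    """Convert a duration in video frames to milliseconds."""
    return frames * 1000.0 / fps


# Frame counts reported for the shortest, per-subject shortest, median,
# and longest preparation phases.
for frames in (10, 11, 17, 65):
    print(f"{frames} frames = {frames_to_msec(frames):.2f} msec.")
```

At this assumed rate one frame corresponds to about 33.33 msec., which is also the temporal resolution of the reported durations.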
The stroke phase and the meaningfully related
speech segments remain robustly in synchrony for the English speakers
regardless of the length of the preparation phase. This is clearly shown in
examples (1) through (4) below:
Example (1): A gesture with one of the shortest prep. phases (see note 1)
[there's a picture / <h>of wha^t it looks like #]
Example (2): Another gesture with one of the shortest prep. phases
[Sylvester / on* / the pictures on the door / goes u]p /
Example (3): A gesture with one of the mid-length prep. phases
and then<n> / the camera sorta [pulls back n we see that / driving
the trolley #]
Example (4): A gesture with the longest prep. phase
[a<aa>nd / Tweety's ¿cage? // ¿right? is sitting on
the ¿window sill? /]
Note 1. The following symbols are used in the transcription: [ = onset of
preparation phase; ] = offset of retraction; ^ = superimposed beat; / = unfilled
speech pause; bold face = stroke phase and post-stroke hold; _ = pre-stroke hold;
# = breath pause; < ... > = filled speech pause; * = aborted speech or speech
trouble; ¿...? = rising intonational contour.
Figure 4. Histogram of the duration of the preparation phase (in msec.)
when English speakers were induced to gesture at the outer limit of the
gesture space. Values marked in the figure: 333.33 msec. (10 frames),
366.67 msec. (11 frames), 566.67 msec. (17 frames), and 2166.67 msec. (65 frames).
Discussion
Speech-gesture synchrony by non-Arrernte speakers would imply that, for the
Arrernte, the temporal separation of speech and gesture is not just a mechanical
effect of a larger gesture space. That is, there is something extra causing
the separation of speech and gesture. This extra something could be a rhetorical
use of gesture. Suppose that, in the Arrernte culture, there is active control
of gesture such that gestures are timed to be deployed after the co-expressive
speech. This control might be rhetorical in that gestures are made
to occur as reinforcements or echoes of what is said in speech.
Even if this hypothetical rhetorical use is not present in the Arrernte, as
long as there is some form of active control of gesture, gesture could follow
speech, as it were, by design. The large gesture space then would be an effect
of the gesture-speech timing difference rather than a cause of it. The extra
long preparation phase would be the instrument of the speaker's active
control over the timing of the gesture. Using the preparation phase as the instrument
of control would automatically make the amount of gesture delay equal the extra
length of the preparation phase, as De Ruiter and Wilkins report.
But whatever the extra factor in Arrernte gesture performance is, something
extra in the control of gesture means that gesture is not ballistic. And this
means that Arrernte speech and gestures cannot be taken as evidence of modularity.
On the contrary, they would reveal the very opposite of modularity: a continuing
on-line process by Arrernte speakers of controlling the relationship between
speech and gesture, in which the gesture is aimed to occur at the moment that
the semantically co-expressive speech ends.
Conclusion
The evidence indicates that separation of speech and gesture in the Arrernte
manner is not the result of using a larger gesture space with its attendant
longer preparation phase. While a ballistic model of speech and gesture would
predict gesture to be delayed according to length of preparation phase, our
evidence has shown that speakers can maintain tight synchrony between speech
and gesture even as preparation phase length varies widely. Such behavior is
possible only if speakers exert careful control over the temporal relationship
between the various parts of their simultaneously unfolding speech and gestures.
References
De Ruiter, J.P. and Wilkins, D. (1998). The synchronization of gesture and speech
in Dutch and Arrernte (an Australian Aboriginal language). In S. Santi, I. Guaïtella,
C. Cavé and G. Konopczynski (eds.), Oralité et Gestualité,
pp. 603-607. Paris: L'Harmattan.
Kita, S. (2000). How representational gestures help speaking. In McNeill (2000).
Levelt, W.J., Richardson, G. and La Heij, W. (1985). Pointing and voicing in
deictic expressions. Journal of Memory and Language, 24:133-164.
McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. Chicago:
University of Chicago Press.
McNeill, D. (Ed.) (2000). Language and Gesture. Cambridge: Cambridge University
Press.
McNeill, D. and Duncan, S. (2000). Growth points in thinking-for-speaking.
In McNeill (2000).
Nobe, S. (1996). Representational Gestures, Cognitive Rhythms, and Acoustic
Aspects of Speech: A Network/Threshold Model of Gesture Production. Unpublished
Ph.D. Dissertation, Department of Psychology, The University of Chicago.