An audiovisual experiment
Introduction
Since the audiovisual composition Dynamics
was fundamentally based on my own subjective correspondences, I decided to create
a test to verify other people's reactions to several visual/aural associations.
To do that, I wrote a C program that generated visual as well as aural
events. The purpose of this test was to understand whether shapes are objectively
related to timbres, by asking viewers to indicate their preferences among timbre/shape
associations and to qualify such associations. To set up the
experiment, I created 8 visual sequences and 6 sounds, producing a grid of 48
combinations. I first created the sounds, and then the animations.
The Program
The program runs on an Apple Macintosh II. The purpose of the program was to provide
an environment in which I could both conduct my experiments and create an audiovisual
composition to be performed in real time. For this reason, I chose machines that
were going to be easily available, such as the Apple Macintosh II and the Akai
S900 sampler. The computer has an Apple videocard providing a resolution of 640
x 480 pixels at 67 Hz non-interlaced, a 13" Sony monitor, 4 Megabytes of
RAM, and a 40 Megabyte hard disc. The program fundamentally generates two 2-d
shapes, each constructed by means of 4 Bézier curves, and interpolates
the intermediate shapes between them, creating a simple animation. When the animation begins,
the program simultaneously sends a Note_On MIDI signal to the sampler, making
it play a sound. A Note_Off MIDI signal is also sent when the animation is done.
This real-time capability is important because the experience of producing Dynamics
clearly demonstrated that when working in an audiovisual environment it is very
important to have immediate feedback of what one is creating. In fact, it is often
necessary to compare the images with the music. In a real-time environment, this
is instantaneous.
Music
The software I wrote used the MIDI standard in order to communicate with musical
instruments. Although a number of commands are available, partly depending on
which musical device is controlled, this version of my program required only two
basic functions: Note_On and Note_Off. Note_On and Note_Off are the names of two
C functions that in turn call a MIDI_Write function that is present in another
package, the MIDI driver that Lee Boynton wrote at the MIT Media Lab. Future
versions of my program will include other MIDI capabilities, such as the panning
function and program change. The panning function allows the creation of stereo
movement of the sound source. The program change permits software-controlled change
of the instrument used.
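As an illustration of the two functions described above, the following is a minimal sketch of how Note_On and Note_Off could build their messages. The status bytes (0x90 and 0x80, combined with the channel number) follow the MIDI standard. The signature of MIDI_Write is not given in this text, so the MIDI_Write shown here is a hypothetical stand-in that merely records the bytes it receives, and the parameter names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's MIDI_Write. The real signature is
   not given in the text; this version only records the last message sent. */
static uint8_t midi_log[3];
static size_t midi_log_len = 0;

static void MIDI_Write(const uint8_t *bytes, size_t count)
{
    midi_log_len = count < sizeof midi_log ? count : sizeof midi_log;
    for (size_t i = 0; i < midi_log_len; i++)
        midi_log[i] = bytes[i];
}

/* Note On: status byte 0x90 | channel, then key number and velocity. */
void Note_On(uint8_t channel, uint8_t key, uint8_t velocity)
{
    uint8_t msg[3];
    msg[0] = (uint8_t)(0x90 | (channel & 0x0F));
    msg[1] = (uint8_t)(key & 0x7F);
    msg[2] = (uint8_t)(velocity & 0x7F);
    MIDI_Write(msg, sizeof msg);
}

/* Note Off: status byte 0x80 | channel, then key number and release velocity. */
void Note_Off(uint8_t channel, uint8_t key, uint8_t velocity)
{
    uint8_t msg[3];
    msg[0] = (uint8_t)(0x80 | (channel & 0x0F));
    msg[1] = (uint8_t)(key & 0x7F);
    msg[2] = (uint8_t)(velocity & 0x7F);
    MIDI_Write(msg, sizeof msg);
}
```

In this sketch, a call such as Note_On(0, 60, 100) at the start of an animation, followed by Note_Off(0, 60, 0) at its end, would produce the pair of messages described in the text. The key number and velocity values here are arbitrary examples.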
Images
Each Bézier curve is defined by 4 control points and approximated by 10
segments. The final control point of each curve is forced to be coincident with
the first of the next one, so that the 4 curves are connected. The program takes
as input, through a mouse, the 12 control points that define the initial shape.
Similarly, 12 more points are used as input for the final shape. Each point is
re-definable, so that the shape can be interactively changed by the user. Once
the two shapes are set, the program computes the interpolated control points,
using linear interpolation, and consequently computes the interpolated shapes.
Both the control points and the segments that approximate the interpolated curves
are stored in 1-d arrays so that the contents of the arrays are accessed very
quickly. Then the pre-computed shapes are displayed, using these Macintosh QuickDraw
routines: OpenPoly, PolyLine, FillPoly, ClosePoly. The colors of the initial and
final shapes are also user-definable, by means of the Color Picker Package, and
are linearly interpolated in the RGB space. However, in the version of the program
used to conduct experiments, the color of the first shape, when defined, is grey,
while the color of the second is white. The background is black. When animated,
the saturation of the color of the shapes is fixed at zero, so that the viewer
is not influenced by the hue. Once the animation is done, the segments that approximate
the interpolated shapes can be saved on disc and re-loaded.
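The geometric part of this procedure can be sketched as follows. This is an illustration under stated assumptions, not the original code: it assumes cubic Bézier curves (4 control points each, as stated above), evaluates them at evenly spaced parameter values to obtain the 10 segments, and linearly interpolates the control points between the initial and final shapes. The type Point2 and all function names are hypothetical.

```c
#define SEGMENTS 10        /* line segments per curve, as stated in the text */
#define CTRL_PER_CURVE 4   /* control points per curve */

typedef struct { double x, y; } Point2;   /* hypothetical point type */

/* Evaluate a cubic Bezier curve at parameter u in [0, 1]. */
Point2 bezier_point(const Point2 c[CTRL_PER_CURVE], double u)
{
    double v = 1.0 - u;
    double b0 = v * v * v;
    double b1 = 3.0 * v * v * u;
    double b2 = 3.0 * v * u * u;
    double b3 = u * u * u;
    Point2 p;
    p.x = b0 * c[0].x + b1 * c[1].x + b2 * c[2].x + b3 * c[3].x;
    p.y = b0 * c[0].y + b1 * c[1].y + b2 * c[2].y + b3 * c[3].y;
    return p;
}

/* Approximate one curve by SEGMENTS segments, i.e. SEGMENTS + 1 vertices. */
void flatten_curve(const Point2 c[CTRL_PER_CURVE], Point2 out[SEGMENTS + 1])
{
    for (int i = 0; i <= SEGMENTS; i++)
        out[i] = bezier_point(c, (double)i / SEGMENTS);
}

/* Linearly interpolate n control points between the initial shape (t = 0)
   and the final shape (t = 1). */
void interpolate_controls(const Point2 *start, const Point2 *end, int n,
                          double t, Point2 *out)
{
    for (int i = 0; i < n; i++) {
        out[i].x = start[i].x + (end[i].x - start[i].x) * t;
        out[i].y = start[i].y + (end[i].y - start[i].y) * t;
    }
}
```

For each frame, the 12 interpolated control points would be computed first and each of the 4 curves then flattened, so that the resulting vertices can be stored in an array and handed to the polygon routines at display time.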
Unfortunately, this method of animating shapes did not permit complex images,
because features like anti-aliasing or texture mapping, which could have greatly
enhanced the quality of the images, were not available. In fact, both anti-aliasing
and texture mapping are computationally expensive, and real-time animation would have been
impossible. A solution to this problem would have been to create all the shapes
in advance, and then to store in RAM (and subsequently display) the bit maps instead
of the approximated curves. There are two advantages to this method: the images
can be as complex as necessary, and the speed of display is very high. The cost
is that a huge amount of memory is needed. For example, an image of size 1"
x 1", relatively small, is made up of 72 x 72 pixels. At one bit per pixel, this is equivalent
to 648 bytes. If 8 color planes are used, then 5184 bytes are needed to store
one shape. One second of animation, at the speed of one frame per video refresh
(67 Hz), is therefore 347,328 bytes, a considerable amount if an animation lasting
2 minutes and involving 10 1" x 1" objects is considered. Such a configuration
would in fact require more than 416 Megabytes of memory. The situation would be
even worse if the objects were larger. Of course, it would be possible to have
most of the images stored on a huge hard disc, with only the necessary images
kept in RAM. A Macintosh II could probably sustain the transfer speed between
hard disc and RAM. However, the 40 Megabyte hard disc available for the experiment
did not have sufficient capacity.
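The memory estimate above can be reproduced with a short calculation. The sketch below assumes uncompressed bit maps with one bit per pixel per color plane, and the function names are hypothetical.

```c
/* Bytes needed for one uncompressed bit map of width x height pixels,
   with `planes` bits per pixel (rounded up to a whole byte). */
unsigned long bitmap_bytes(unsigned long width, unsigned long height,
                           unsigned long planes)
{
    return (width * height * planes + 7UL) / 8UL;
}

/* Bytes needed to store every frame of an animation in advance. */
unsigned long animation_bytes(unsigned long width, unsigned long height,
                              unsigned long planes,
                              unsigned long frames_per_second,
                              unsigned long seconds,
                              unsigned long objects)
{
    return bitmap_bytes(width, height, planes)
           * frames_per_second * seconds * objects;
}
```

With these definitions, bitmap_bytes(72, 72, 1) gives 648 and bitmap_bytes(72, 72, 8) gives 5184. One second at 67 frames per second is 5184 x 67 = 347,328 bytes, and animation_bytes(72, 72, 8, 67, 120, 10) gives 416,793,600 bytes for 2 minutes and 10 objects.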
The Sounds
The sounds were generated by using a commercial package called Softsynth, by Digidesign,
Inc. Both additive and FM syntheses are available, in combination if necessary.
It was particularly suited to my goal, because I was able to use additive synthesis
to create simple and precise sounds. In fact, additive synthesis gave me control
of each partial of the sound. Three sounds were generated by using FM synthesis,
which makes it quite easy to create complex sounds, with just 2 oscillators and
simple modulations.
When I chose the sounds, I knew which matching I was looking for. I therefore
created sounds with variable smoothness. Sounds 1 and 2 can be considered as medium
smooth, sounds 3 and 4 as very smooth, and sounds 5 and 6 as not smooth. Also,
I considered changes in frequency and amplitude envelope, as is underlined in
the following brief description of the sounds.
The sounds all lasted 5 seconds and were sampled at a frequency of 22 kHz. Sound
1 was an FM sound, using two oscillators. The carrier was set at 220 Hz, and its
attack lasted 0.5 seconds, and the decay 4.5 seconds. The modulator was set at
an amplitude of 1/3 the amplitude of the carrier, its frequency was again 220
Hz, and its attack and decay lasted 2 and 3 seconds respectively. Sound 2 had an
attack of 1.25 seconds and a decay of 3.75. The fundamental frequency was set
at 440 Hz, but in the case of this sound only the 11th through 20th partials were
used, producing a high frequency sound. Also, the partials were not in perfect
harmonic ratio, so the sound was a bit dirty. Sound 3 and sound 4 were similar,
both FM, and the difference was only in the frequency of the two oscillators.
In the first case, in fact, the carrier was 110 Hz and the modulator 154 Hz (ratio
1 : 1.4). In the second case, the carrier was 440 Hz and the modulator 616 Hz
(same ratio). This difference in turn affected the size of the shapes. The envelopes
and the amplitudes of the two oscillators were identical in both sounds. The attack of
the carrier was 0.05 seconds and the decay was 4.95 seconds. The attack of the
modulator was, inversely, 4.95 seconds, and the decay 0.05 seconds. The
last two sounds also had a similar structure. Both had a fundamental frequency of 220
Hz, with 32 harmonics set to their maximum level of amplitude. Therefore, the
only difference was in their envelope. Sound 5 had an attack of 4.975 seconds,
and a decay of 0.025. Sound 6 had the opposite characteristics.
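The attack and decay times listed above can be read as simple two-segment amplitude envelopes. The following sketch assumes that both segments are linear, which the text does not state. It is meant only to make the timing of each sound concrete, and the function name is hypothetical.

```c
/* Amplitude in [0, 1] at time t (in seconds) for an envelope that rises
   linearly from 0 to 1 over `attack` seconds and then falls linearly back
   to 0 over `decay` seconds. Linear segments are an assumption. */
double envelope(double t, double attack, double decay)
{
    if (t <= 0.0 || t >= attack + decay)
        return 0.0;                      /* outside the sound */
    if (t < attack)
        return t / attack;               /* rising segment */
    return (attack + decay - t) / decay; /* falling segment */
}
```

Under this assumption, the carrier of sound 1 would be described by envelope(t, 0.5, 4.5), sound 5 by envelope(t, 4.975, 0.025), and sound 6 by envelope(t, 0.025, 4.975).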
The Shapes
To create the images, I followed three simple principles of mapping: frequency/size,
envelope/brightness, smoothness of the sound/smoothness of the shape. Three characteristics
of a sound seem to determine its smoothness: a slow attack, an integer
ratio between the partials, or partials that remain steady over time.
The first two mappings were easy to visualize. In the first mapping, since the
sounds had just 4 levels of frequency (110, 220, 440, and 4,840 Hz), I created
4 different sizes. In the second case, I followed the simple amplitude envelope
of the sounds. In order to do that, I mapped audio level zero to black (the background)
and peak of the sound to white.
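The envelope/brightness mapping can be sketched as a conversion from an audio level to a grey value. The code below assumes 16-bit color components, as in QuickDraw's RGBColor record, and a linear relation between level and brightness, which is a simplification. GreyRGB and level_to_grey are hypothetical names, and the struct is only a local stand-in.

```c
/* Local stand-in for a color record with 16-bit components. */
typedef struct { unsigned short red, green, blue; } GreyRGB;

/* Map an audio level in [0, peak] to a grey value: level 0 gives black
   (the background) and the peak gives white. Saturation stays at zero
   because the three components are always equal. */
GreyRGB level_to_grey(double level, double peak)
{
    double x = (peak > 0.0) ? level / peak : 0.0;
    if (x < 0.0) x = 0.0;   /* clamp to the valid range */
    if (x > 1.0) x = 1.0;
    unsigned short v = (unsigned short)(x * 65535.0 + 0.5);
    GreyRGB c;
    c.red = v;
    c.green = v;
    c.blue = v;
    return c;
}
```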
The third mapping (smoothness of the sound/smoothness of the shape) was the most
difficult. The first sound was visually represented twice, with different sizes,
as a slightly changing round shape. Sound number two was visualized as a non-regular,
very small changing shape. It was visualized twice, changing the size. In the
third and fourth sound, emphasis was put on the transformation of the shape. In
fact, in these FM sounds a transforming spectrum was clearly detectable, and therefore
I created very plastic non-regular shapes. Otherwise, as before, the two sounds
were differentiated only by their sizes. The final two sounds, being steady over
time, were represented with steady triangular shapes. In the case of these sounds,
the envelope was the only distinguishing parameter.
Testing the subjects
The experiment was conducted in the Audio Studio of the MIT Media Lab's Music
and Cognition Group. Ten subjects, one at a time, watched the screen and listened
to the sounds through 2 speakers. Since all images were centered on the screen,
the sounds arrived from the center of the stereo image created by the two speakers.
Each subject was given a sheet like the one shown in fig. 1. The three symbols
(high, medium and low) indicated the degree of correspondence between the image
and the sound. When I subsequently analyzed the results, I assigned 2, 1 and 0
points, respectively, to those symbols. The shapes are placed along the horizontal
axis, with the sounds along the vertical axis. Keeping the shape fixed, the subjects
listened to the different sounds, proceeding from top to bottom within each
column, and through the columns from left to right. They were free to assign whatever symbol they wanted
to whatever correspondence, even repeating the high or low symbol several times.
Moreover, the subjects were also free to experience the sounds/shapes before actually
judging the correspondences. Even when judging they could replay the animation
several times. The average results, normalized between 0 and 1, are shown in fig.
2. The circled entries indicate the matchings I originally
created.
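The scoring described above can be sketched as follows, assuming that normalization means dividing each sum of points by the maximum attainable sum (2 points times the number of subjects). The array layout and the function name are hypothetical.

```c
#define N_SOUNDS 6     /* rows of the answer sheet */
#define N_SHAPES 8     /* columns of the answer sheet */
#define MAX_POINTS 2   /* points assigned to the "high" symbol */

/* scores[subject][sound][shape] holds 2, 1 or 0 points (high, medium, low).
   avg[sound][shape] receives the average over subjects, normalized to
   the range [0, 1]. */
void average_scores(int n_subjects,
                    int scores[][N_SOUNDS][N_SHAPES],
                    double avg[N_SOUNDS][N_SHAPES])
{
    for (int s = 0; s < N_SOUNDS; s++) {
        for (int h = 0; h < N_SHAPES; h++) {
            int sum = 0;
            for (int k = 0; k < n_subjects; k++)
                sum += scores[k][s][h];
            avg[s][h] = (double)sum / (double)(MAX_POINTS * n_subjects);
        }
    }
}
```

With ten subjects, a combination rated high by everyone would score 1, and one rated low by everyone would score 0.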
For each combination please choose one of the following symbols:

fig. 1

fig. 2
Interpreting the results
The interpretation of the results, done with the help of a statistical program,
is relatively complex. Two main findings can be drawn from this experiment:
the subjects were significantly affected by the changes of sounds (F(5,45) = 11.82,
p < .001), and, even more importantly, they were strongly affected by the interaction
of sounds and shapes (F(35,315) = 2.466, p < .001).
However, the following analysis seems plausible. Shape 1 was quite successful,
in the sense that the subjects mostly chose S1 as the best sound. Notice that
the best sounds, other than S1, were S3 and S4, probably because their shape is
rounded at the beginning, like that of S1. Sounds S3 and S4 always got good ratings,
except with shapes 7 and 8, which were in fact steady (representing the steady
sounds 5 and 6). Also, notice the low values of S6 across all the shapes, except
for its own shape, sh8. This is explainable considering that S6 had a very sudden
attack (0.025 seconds), while most of the shapes (except sh8 of course) had a
smoother attack. Therefore, a first conclusion is that the envelope of the sound
can be a factor in matching sounds with images. Shape 2 was the same as shape
1 but smaller. The results are similar to those of shape 1, although less convincing.
Shapes 3 and 4 also have similar results. In both cases the matching between shape
and sound doesn't seem to be successful enough. Shapes 5 and 6 are, by contrast,
extremely successful. In fact, without any doubt the subjects chose the FM sounds
associated with those shapes, giving a slight preference to their own sounds,
respectively S3 and S4. This result is particularly interesting, because it clearly
demonstrates that the changing spectrum of those FM sounds was well represented
by the changing shape of sh5 and sh6. Shape 7 was best represented by S5, although
other sounds got pretty good judgements. Similarly, shape 8 was best represented
by S6. Notice that also S1 and S2 got decent judgements, definitely better than
S3 and S4. This may be explainable by the fact that S1 and S2 are more steady
sounds than S3 and S4.
Conclusions
A few considerations can be made after looking at the results. The subjects preferred
some sound/shape associations rather than others. The loudness/brightness envelope
is certainly perceived as very important. In fact, if the envelope of the sound
does not correspond to the brightness of the shape, then a mismatch is easily
detected. Finally, more than the shape itself, which is nevertheless recognizable as
an element to be considered, the transformation of the shape is clearly perceived
as a change in the spectrum of the sound. These two results are in my opinion
particularly interesting, since they indicate that animated abstract objects are
well suited for the representation of changing timbres.