An audiovisual experiment




Introduction

Since the audiovisual composition Dynamics was fundamentally based on my own subjective correspondences, I decided to create a test to verify other people's reactions to several visual/aural associations. To do so, I wrote a C program that generated visual as well as aural events. The purpose of this test was to understand whether or not shapes are objectively related to timbres, by asking viewers to indicate their preferences among timbre/shape associations and to qualify such associations. For the experiment, I created 8 visual sequences and 6 sounds, producing a grid of 48 combinations. I first created the sounds, and then the animations.


The Program

The program runs on an Apple Macintosh II. Its purpose was to provide an environment in which I could both conduct my experiments and create an audiovisual composition to be performed in real time. For this reason, I chose machines that were going to be easily available, such as the Apple Macintosh II and the Akai S900 sampler. The computer has an Apple video card providing a resolution of 640 x 480 pixels at 67 Hz non-interlaced, a 13" Sony monitor, 4 Megabytes of RAM, and a 40 Megabyte hard disc. The program fundamentally generates two 2-d shapes, each constructed by means of 4 Bézier curves, and interpolates the intermediate shapes between them, creating a simple animation. When the animation begins, the program simultaneously sends a Note_On MIDI signal to the sampler, making it play a sound. A Note_Off MIDI signal is sent when the animation is done.

This real-time capability is important because the experience of producing Dynamics clearly demonstrated that, when working in an audiovisual environment, it is very important to have immediate feedback on what one is creating. In fact, it is often necessary to compare the images with the music. In a real-time environment, this comparison is instantaneous.


Music

The software I wrote used the MIDI standard to communicate with musical instruments. Although a number of commands are available, partly depending on which musical device is controlled, this version of my program required only two basic functions: Note_On and Note_Off. These are the names of two C functions that in turn call a MIDI_Write function provided by another package, the MIDI driver that Lee Boynton wrote at the MIT Media Lab. Future versions of my program will include other MIDI capabilities, such as panning and program change. Panning allows the creation of stereo movement of the sound source. Program change permits software-controlled change of the instrument used.
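As an illustration only, the two functions could be sketched as follows. The byte layout (status byte 0x90 or 0x80 combined with the channel, followed by key number and velocity) comes from the MIDI standard. The MIDI_Write signature, the logging stub that stands in for the driver, and the parameter names are assumptions, since the text does not describe the driver's actual interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's MIDI_Write: the real signature is
   not given in the text, so this stub only records the bytes it receives. */
static unsigned char midi_log[64];
static size_t midi_log_len = 0;

static void MIDI_Write(const unsigned char *bytes, size_t count)
{
    for (size_t i = 0; i < count && midi_log_len < sizeof midi_log; i++)
        midi_log[midi_log_len++] = bytes[i];
}

/* Note On: status byte 0x90 | channel, then key number and velocity. */
void Note_On(unsigned channel, unsigned key, unsigned velocity)
{
    assert(channel < 16 && key < 128 && velocity < 128);
    unsigned char msg[3] = { (unsigned char)(0x90 | channel),
                             (unsigned char)key,
                             (unsigned char)velocity };
    MIDI_Write(msg, sizeof msg);
}

/* Note Off: status byte 0x80 | channel, then key number and a neutral
   release velocity of 64 (0x40). */
void Note_Off(unsigned channel, unsigned key)
{
    assert(channel < 16 && key < 128);
    unsigned char msg[3] = { (unsigned char)(0x80 | channel),
                             (unsigned char)key,
                             0x40 };
    MIDI_Write(msg, sizeof msg);
}
```

In this sketch, a call such as Note_On(0, 60, 100) at the start of an animation would be paired with Note_Off(0, 60) at its end. The channel, key and velocity values are placeholders, not the ones used in the experiment.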


Images

Each Bézier curve is defined by 4 control points and approximated by 10 segments. The final control point of each curve is forced to coincide with the first of the next one, so that the 4 curves are connected. The program takes as input, through the mouse, the 12 control points that define the initial shape. Similarly, 12 more points are used as input for the final shape. Each point is re-definable, so that the shape can be interactively changed by the user. Once the two shapes are set, the program computes the interpolated control points, using linear interpolation, and consequently computes the interpolated shapes. Both the control points and the segments that approximate the interpolated curves are stored in 1-d arrays so that their contents can be accessed very quickly. Then the pre-computed shapes are displayed, using these Macintosh QuickDraw routines: OpenPoly, PolyLine, FillPoly, ClosePoly. The colors of the initial and final shapes are also user-definable, by means of the Color Picker Package, and are linearly interpolated in RGB space. However, in the version of the program used to conduct experiments, the color of the first shape, when defined, is grey, while the color of the second is white. The background is black. When animated, the saturation of the color of the shapes is fixed at zero, so that the viewer is not influenced by the hue. Once the animation is done, the segments that approximate the interpolated shapes can be saved on disc and re-loaded.
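The two computations described above, approximating a cubic Bézier curve by 10 segments and linearly interpolating the control points between the initial and final shape, can be sketched as follows. This is a minimal illustration, not the original code: the type Point2 and all function names are hypothetical, and floating-point coordinates are assumed for simplicity.

```c
#include <assert.h>

#define SEGMENTS 10            /* segments per curve, as in the text */
#define CURVE_POINTS 4         /* control points per cubic Bezier curve */

typedef struct { double x, y; } Point2;   /* hypothetical point type */

/* Linear interpolation between a and b, with t in [0, 1]. */
static double lerp(double a, double b, double t)
{
    return a + (b - a) * t;
}

/* Evaluate a cubic Bezier curve at parameter t (Bernstein form). */
Point2 bezier_point(const Point2 c[CURVE_POINTS], double t)
{
    double u = 1.0 - t;
    double b0 = u * u * u;
    double b1 = 3.0 * u * u * t;
    double b2 = 3.0 * u * t * t;
    double b3 = t * t * t;
    Point2 p = { b0 * c[0].x + b1 * c[1].x + b2 * c[2].x + b3 * c[3].x,
                 b0 * c[0].y + b1 * c[1].y + b2 * c[2].y + b3 * c[3].y };
    return p;
}

/* Approximate one curve by SEGMENTS line segments (SEGMENTS + 1 vertices). */
void bezier_polyline(const Point2 c[CURVE_POINTS], Point2 out[SEGMENTS + 1])
{
    for (int i = 0; i <= SEGMENTS; i++)
        out[i] = bezier_point(c, (double)i / SEGMENTS);
}

/* Control points of in-between frame `frame` out of `frames` frames:
   frame 0 reproduces the initial shape, frame frames - 1 the final one. */
void interpolate_controls(const Point2 *first, const Point2 *last, int count,
                          int frame, int frames, Point2 *out)
{
    assert(frames >= 2 && frame >= 0 && frame < frames);
    double t = (double)frame / (frames - 1);
    for (int i = 0; i < count; i++) {
        out[i].x = lerp(first[i].x, last[i].x, t);
        out[i].y = lerp(first[i].y, last[i].y, t);
    }
}
```

For a whole shape, interpolate_controls would be called with count set to 12, and bezier_polyline once for each of the 4 connected curves. The same lerp could also be applied to each RGB component to interpolate the colors.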
Unfortunately, this method of animating shapes did not permit complex images, because features like anti-aliasing or texture mapping, which could have greatly enhanced the quality of the images, were not available. In fact, both anti-aliasing and texture mapping are computationally expensive, and real-time animation would have been impossible. A solution to this problem would have been to create all the shapes in advance, and then to store in RAM (and subsequently display) the bit maps instead of the approximated curves. There are two advantages to this method: the images can be as complex as necessary, and the speed of display is very high. The cost is that a huge amount of memory is needed. For example, a relatively small image of size 1" x 1" is made up of 72 x 72 pixels. At one bit per pixel, this is equivalent to 648 bytes. If 8 color planes are used, then 5184 bytes are needed to store one shape. One second of animation, at the speed of one frame per video refresh (67 Hz), therefore takes 347,328 bytes, a considerable amount if an animation lasting 2 minutes and involving 10 1" x 1" objects is considered. Such a configuration would in fact require more than 416 Megabytes of memory. The situation would be even worse if the objects were larger. Of course, it would be possible to have most of the images stored on a huge hard disc, with only the necessary images kept in RAM. A Macintosh II could probably cope with the transfer speed between hard disc and RAM. However, the 40 Megabyte hard disc available for the experiment did not have sufficient capacity.
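The memory estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes, as the figures in the text imply, 72 pixels per inch and one stored bitmap per object per video refresh. The function names are hypothetical.

```c
#include <assert.h>

/* Bytes for one bitmap of width x height pixels with `planes` bits per pixel. */
unsigned long bitmap_bytes(unsigned long width, unsigned long height,
                           unsigned long planes)
{
    assert(planes > 0);
    return (width * height * planes + 7) / 8;   /* round up to whole bytes */
}

/* Bytes for `objects` bitmaps of `frame_bytes` each, stored for every frame
   of an animation running at `fps` frames per second for `seconds`. */
unsigned long animation_bytes(unsigned long frame_bytes, unsigned long fps,
                              unsigned long seconds, unsigned long objects)
{
    return frame_bytes * fps * seconds * objects;
}
```

With these helpers, bitmap_bytes(72, 72, 1) gives 648, bitmap_bytes(72, 72, 8) gives 5184, animation_bytes(5184, 67, 1, 1) gives 347,328, and animation_bytes(5184, 67, 120, 10) gives 416,793,600, matching the figures quoted above.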


The Sounds

The sounds were generated by using a commercial package called Softsynth, by Digidesign, Inc. Both additive and FM synthesis are available, in combination if necessary. The package was particularly suited to my goal, because additive synthesis, which gave me control of each partial of the sound, let me create simple and precise sounds. Three sounds were generated by using FM synthesis, which makes it quite easy to create complex sounds with just 2 oscillators and simple modulations.
When I chose the sounds, I knew which matchings I was looking for. I therefore created sounds of varying smoothness. Sounds 1 and 2 can be considered medium smooth, sounds 3 and 4 very smooth, and sounds 5 and 6 not smooth. I also considered changes in frequency and amplitude envelope, as outlined in the following brief description of the sounds.
The sounds all lasted 5 seconds and were sampled at a frequency of 22 kHz. Sound 1 was an FM sound, using two oscillators. The carrier was set at 220 Hz, with an attack of 0.5 seconds and a decay of 4.5 seconds. The modulator was set at 1/3 the amplitude of the carrier, its frequency was again 220 Hz, and its attack and decay were 2 and 3 seconds respectively. Sound 2 had an attack of 1.25 seconds and a decay of 3.75 seconds. The fundamental frequency was set at 440 Hz, but in the case of this sound only the 11th through 20th partials were used, producing a high frequency sound. Also, the partials were not in perfect harmonic ratio, so the sound was a bit dirty. Sound 3 and sound 4 were similar, both FM, and differed only in the frequency of the two oscillators. In the first case, the carrier was 110 Hz and the modulator 154 Hz (ratio 1 : 1.4). In the second case, the carrier was 440 Hz and the modulator 616 Hz (same ratio). This difference in turn affected the size of the shapes. The envelopes, as well as the amplitudes, of the two oscillators were identical in both sounds. The attack of the carrier was 0.05 seconds and the decay was 4.95 seconds. The attack of the modulator was, inversely, 4.95 seconds, and the decay 0.05 seconds. The last two sounds also had a similar structure. Both had a fundamental frequency of 220 Hz, with 32 harmonics set to their maximum level of amplitude. Therefore, the only difference was in their envelope. Sound 5 had an attack of 4.975 seconds, and a decay of 0.025 seconds. Sound 6 had the opposite characteristics.
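The attack and decay times listed above can be read as two-stage amplitude envelopes. The sketch below assumes linear ramps and a peak level of 1. The text does not say which curve Softsynth actually used, so both are assumptions, and the function name is hypothetical.

```c
#include <assert.h>

/* Amplitude (0..1) at time t, in seconds, of an envelope that rises
   linearly for `attack` seconds and then falls linearly for `decay`
   seconds. Linear ramps are an assumption made for this sketch. */
double envelope(double t, double attack, double decay)
{
    assert(attack > 0.0 && decay > 0.0);
    if (t <= 0.0 || t >= attack + decay)
        return 0.0;
    if (t < attack)
        return t / attack;
    return 1.0 - (t - attack) / decay;
}
```

Under these assumptions, envelope(t, 0.5, 4.5) describes the carrier of sound 1, envelope(t, 4.975, 0.025) describes sound 5, and envelope(t, 0.025, 4.975) describes sound 6.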


The Shapes

To create the images, I followed three simple principles of mapping: frequency/size, envelope/brightness, smoothness of the sound/smoothness of the shape. Three characteristics of a sound seem to determine its smoothness: a slow attack, integer ratios between the partials, and partials that remain steady over time.
The first two mappings were easy to visualize. In the first mapping, since the sounds had just 4 levels of frequency (110, 220, 440, and 4,840 Hz), I created 4 different sizes. In the second, I followed the simple amplitude envelope of the sounds. In order to do that, I mapped audio level zero to black (the background) and the peak of the sound to white.
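The envelope/brightness mapping can be sketched as a function from audio level to a grey value with zero saturation. The struct Rgb16 and the function name are hypothetical. The 16-bit component range and the linear relation between level and brightness are assumptions made for this sketch.

```c
#include <assert.h>

/* Hypothetical color type with 16-bit components (0 = off, 65535 = full). */
typedef struct { unsigned short red, green, blue; } Rgb16;

/* Map an audio level to a grey: level 0 gives black, `peak` gives white.
   Equal red, green and blue keep the saturation at zero. */
Rgb16 level_to_grey(double level, double peak)
{
    assert(peak > 0.0);
    double x = level / peak;
    if (x < 0.0) x = 0.0;          /* clamp to the displayable range */
    if (x > 1.0) x = 1.0;
    unsigned short v = (unsigned short)(x * 65535.0 + 0.5);
    Rgb16 c = { v, v, v };
    return c;
}
```

Evaluating such a function once per frame, with the sound's current envelope level as input, would make the shape fade in and out of the black background together with the sound.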
The third mapping (smoothness of the sound/smoothness of the shape) was the most difficult. The first sound was visually represented twice, in different sizes, as a slightly changing round shape. Sound number two was visualized as a non-regular, very small changing shape. It was also visualized twice, in different sizes. In the third and fourth sounds, emphasis was put on the transformation of the shape. In fact, in these FM sounds a transforming spectrum was clearly detectable, and therefore I created very plastic non-regular shapes. Otherwise, as before, the two sounds were differentiated only by their sizes. The final two sounds, being steady over time, were represented with steady triangular shapes. In the case of these sounds, the envelope was the only distinguishing parameter.


Testing the subjects


The experiment was conducted in the Audio Studio of the MIT Media Lab's Music and Cognition Group. Ten subjects, one at a time, watched the screen and listened to the sounds through 2 speakers. Since all images were centered on the screen, the sounds arrived from the center of the stereo image created by the two speakers. Each subject was given a sheet like the one shown in fig. 1. The three symbols (high, medium and low) indicated the degree of correspondence between the image and the sound. When I subsequently analyzed the results, I assigned 2, 1 and 0 points, respectively, to those symbols. The shapes are placed along the horizontal axis, with the sounds along the vertical axis. Keeping the shape fixed while listening to the different sounds, the subjects proceeded from top to bottom within each column, and from left to right across columns. They were free to assign any symbol to any correspondence, even repeating the high or low symbol several times. Moreover, the subjects were free to experience the sounds and shapes before actually judging the correspondences, and even while judging they could replay the animation several times. The average results, normalized between 0 and 1, are shown in fig. 2. The original matchings, that is, the combinations I originally created, are circled.
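The scoring described above can be sketched as follows. The sketch assumes that "normalized between 0 and 1" means dividing each cell's mean score by the maximum score of 2. The array layout and the function name are hypothetical.

```c
#include <assert.h>

#define SOUNDS 6
#define SHAPES 8

/* Mean score of each sound/shape cell over all subjects, divided by the
   maximum score (2) so that every result lies between 0 and 1.
   scores[k][s][h] is the 0, 1 or 2 points that subject k gave to the
   combination of sound s and shape h. */
void average_scores(int subjects, int scores[][SOUNDS][SHAPES],
                    double out[SOUNDS][SHAPES])
{
    assert(subjects > 0);
    for (int s = 0; s < SOUNDS; s++) {
        for (int h = 0; h < SHAPES; h++) {
            int total = 0;
            for (int k = 0; k < subjects; k++)
                total += scores[k][s][h];
            out[s][h] = (double)total / (2.0 * subjects);
        }
    }
}
```

A cell that every subject marked high would come out as 1, and a cell that every subject marked low would come out as 0.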

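fig. 1: the response sheet given to each subject, headed "For each combination please choose one of the following symbols:"

fig. 2: the average results, normalized between 0 and 1, with the original matchings circled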



Interpreting the results

The interpretation of the results, done with the help of a statistical program, is relatively complex. Two main findings can be drawn from this experiment: the subjects were significantly affected by the changes of sounds (F(5,45) = 11.82, p < .001), and, even more importantly, they were strongly affected by the interaction of sounds and shapes (F(35,315) = 2.466, p < .001).
However, the following analysis seems plausible. Shape 1 was quite successful, in the sense that the subjects mostly chose S1 as the best sound. Notice that the best sounds, other than S1, were S3 and S4, probably because their shapes are rounded at the beginning, like sh1. Sounds S3 and S4 always received good judgements, except with shapes 7 and 8, which were in fact steady (representing the steady sounds 5 and 6). Also, notice the low values of S6 across all the shapes except its own, sh8. This can be explained by considering that S6 had a very sudden attack (0.025 seconds), while most of the shapes (except sh8, of course) had a smoother attack. Therefore, a first conclusion is that the envelope of the sound can be a factor in matching sounds with images. Shape 2 was the same as shape 1 but smaller. The results are similar to those of shape 1, although less convincing.
Shapes 3 and 4 also have similar results. In both cases the matching between shape and sound does not seem to be successful enough. Shapes 5 and 6 are, by contrast, extremely successful. In fact, the subjects unmistakably chose the FM sounds associated with those shapes, giving a slight preference to each shape's own sound, respectively S3 and S4. This result is particularly interesting, because it clearly demonstrates that the changing spectrum of those FM sounds was well represented by the changing shape of sh5 and sh6. Shape 7 was best represented by S5, although other sounds received fairly good judgements. Similarly, shape 8 was best represented by S6. Notice that S1 and S2 also received decent judgements, definitely better than S3 and S4. This may be explained by the fact that S1 and S2 are steadier sounds than S3 and S4.


Conclusions

A few observations can be made after looking at the results. The subjects preferred some sound/shape associations over others. The loudness/brightness envelope is certainly perceived as very important. In fact, if the envelope of the sound does not correspond to the brightness of the shape, then a mismatch is easily detected. Finally, more than the shape itself, which is nevertheless recognizable as an element to be considered, the transformation of the shape is clearly perceived as a change in the spectrum of the sound. These two results are in my opinion particularly interesting, since they indicate that animated abstract objects are well suited for the representation of changing timbres.