3D Audio Primer
August 1998

What is 3D Audio?

3D audio techniques allow for the placement of sound potentially anywhere within a virtual space surrounding a listener. Audiophiles have used these techniques for years in making binaural recordings -- literally "two-eared" recordings made with microphones mounted in the ears of a mannequin or a dummy head. The spatial cues in these recordings are so accurate that, when listened to with headphones, they actually recreate a sense of "being there." The trumpet player sounds like he is standing just 10 feet away from you on your left, the crowd noise is coming from behind you, and when someone "knocks at the door" you may actually turn around to "see who's there." For a taste of 3D audio in your own living room, the Binaural Source sells a large selection of binaural CDs and cassettes. They also provide a good Q&A entitled Binaural for Beginners. Check them out -- you really do need to hear it to believe it!

Computer generated 3D audio seeks to imitate this binaural effect by extending the functionality of a computer system and allowing developers to provide a similar illusion of sound coming from locations all around the listener -- not just from the speakers in front of them. This effect can be made independent of the type of speakers used and is designed to create the immersive audio experience required by game and multimedia developers. As lead engineer for the Apple SoundSprocket, I created just such an effect for the Mac OS using a model of human hearing that recreates the same audio signals listeners experience in real-life situations.

In the next section, I provide a brief description of the high and low level APIs that developers can use to control these effects from within their applications. Then, in the following sections, I discuss two of the steps involved in modeling human hearing: first, creating stereo signals containing directional cues from a mono input and second, delivering the left and right signals to the left and right ears, respectively, without the need for headphones. Within each section, I also provide many links to related information elsewhere on the Web.

High and Low level APIs

Both the high and low level APIs in the Apple SoundSprocket allow developers to transparently control hardware or software filter implementations and have been designed to provide considerable flexibility in how 3D audio effects are implemented in a game or multimedia title. The low level API provides access to all of the 3D filtering parameters in their most efficient form, allowing developers to send messages (via the Sound Manager) that directly control all of the 3D audio parameters. The coordinates of sound sources in these messages are given in polar coordinates -- a form which most developers will find cumbersome.

Sometimes this low level interface is inconvenient. When this happens, a developer may instead use the high level API which consists of a collection of routines that produce the lower level messages described above. Here, coordinates are specified in the more natural Cartesian space with position and orientation vectors, transformation matrices or QuickDraw 3D camera positions. In addition, although velocities may still be specified explicitly, the high level API allows for their automatic computation based on an object's changing position.
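As a rough sketch of the two conveniences just described -- translating Cartesian positions into the polar form used by the low level messages, and computing velocities automatically from an object's changing position -- the logic might look something like this. The function names and the coordinate convention (listener facing down the negative z axis, as in QuickDraw 3D) are illustrative assumptions, not the actual SoundSprocket calls:

```python
import math

def cartesian_to_polar(x, y, z):
    """Convert a Cartesian position to (azimuth, elevation, distance).

    Angles are in radians. Assumes the listener faces down -z, so a
    point directly ahead yields azimuth 0 -- an assumed convention,
    for illustration only.
    """
    distance = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(x, -z)  # angle left/right of straight ahead
    elevation = math.asin(y / distance) if distance > 0 else 0.0
    return azimuth, elevation, distance

def velocity_from_positions(prev_pos, curr_pos, dt):
    """Estimate a velocity vector from two successive positions."""
    return tuple((c - p) / dt for p, c in zip(prev_pos, curr_pos))
```

For example, a source one unit straight ahead converts to azimuth 0, elevation 0, distance 1, and a source that moved one unit along x over half a second is assigned a velocity of 2 units per second in x.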

For a more detailed description of both of these APIs, along with everything else you need to start using them, download the Apple Game Sprocket SDK from Apple's Web site.

Creating Directional Cues

Before 3D audio can be generated by a computer, the relationship of elements within the simulated environment must be known. In a virtual audio environment, these elements include the listener and a number of sources which collectively produce the sounds heard by the listener. Each of these elements has a position and orientation in space, along with a velocity through space. Sound sources are not limited to points in space, but rather may be objects which emit sound uniformly over a larger area (for example, a fountain which makes noise over its entire surface). A source may also emit sound more loudly in one direction than in another.
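A minimal representation of the elements described above might be sketched as follows. All of the field names here are hypothetical -- they illustrate the kind of state each element carries, not the SoundSprocket's actual data structures:

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """State shared by the listener and all sound sources."""
    position: tuple = (0.0, 0.0, 0.0)      # location in space
    orientation: tuple = (0.0, 0.0, -1.0)  # facing direction (unit vector)
    velocity: tuple = (0.0, 0.0, 0.0)      # motion through space

@dataclass
class Source(Element):
    """A sound source; radius > 0 models an extended emitter
    (e.g. a fountain that makes noise over its entire surface)."""
    radius: float = 0.0       # 0 for a point source
    directivity: float = 1.0  # 1 = uniform; < 1 favors one direction

@dataclass
class Environment:
    listener: Element = field(default_factory=Element)
    sources: list = field(default_factory=list)
```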

Now, given this description of a virtual audio environment, the creation of directional cues requires the use of several digital filters which are contained within the Apple SoundSprocket extension. These filters are needed to transform a mono input signal into stereo output signals intended for a listener's right and left ears (see Sound Delivery below for a description of how these signals are actually delivered to the ears). The goal at this stage is to model the directional effect that a listener's head, torso and ears have on incoming sounds in the real world. In addition, several environmental effects must also be considered since they affect our spatial auditory perception and are thus essential to an accurate model. These effects include reverberant surfaces, damping due to the medium through which sound travels (air, water, etc.) and Doppler shifts caused by the motion of elements within the environment.
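Two of these environmental effects lend themselves to short formulas. The sketch below shows the classic Doppler frequency shift and a simple inverse-distance amplitude roll-off -- simplified textbook models, not the SoundSprocket's actual filters:

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def doppler_factor(source_radial_speed, listener_radial_speed,
                   c=SPEED_OF_SOUND):
    """Frequency scaling from the classic Doppler formula.

    Radial speeds are measured along the line between source and
    listener; positive values mean movement toward the other element.
    """
    return (c + listener_radial_speed) / (c - source_radial_speed)

def distance_gain(distance, reference=1.0):
    """Inverse-distance (1/r) amplitude roll-off, clamped so sources
    closer than the reference distance are not boosted."""
    return reference / max(distance, reference)
```

A source approaching the listener at 34.3 m/s raises pitch by a factor of about 1.11; the same source receding lowers it to about 0.91. Doubling the distance to a source halves its amplitude.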

Sound Delivery

Audio can be presented to the listener via stereo speakers or headphones. In this section, I provide a brief discussion of the relative strengths and weaknesses of these two solutions. (Note, mono configurations are not recommended because a single speaker cannot provide accurate spatial cues.)

When stereo speakers are used, they must be placed symmetrically on either side of the computer monitor and an additional filter within the Apple SoundSprocket extension must be used. This filter ensures that the sound produced by the left speaker is heard predominantly by the left ear and that the sound produced by the right speaker is heard predominantly by the right ear. This filter is necessary because the binaural effects encoded in the stereo signal (described above) are destroyed by crosstalk (sound from the left speaker heard by the right ear and vice versa).
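Under simplifying assumptions -- a symmetric listening position, with the path to the far ear modeled as a single attenuation and delay -- the core of such a crosstalk canceller can be sketched as follows. This is an illustration of the recursive cancellation idea only; a real canceller (including the one in the SoundSprocket) uses measured head-related filters rather than one gain and one delay:

```python
def cancel_crosstalk(left, right, gain=0.85, delay=3):
    """Naive symmetric crosstalk canceller (illustration only).

    Each speaker output subtracts an anti-phase copy of the *other*
    speaker's output, delayed by the extra travel time to the far ear
    (`delay` samples) and scaled by `gain` (head shadowing). The
    recursion also cancels the cancellation signal's own crosstalk.
    """
    n = len(left)
    out_l = [0.0] * n
    out_r = [0.0] * n
    for i in range(n):
        out_l[i] = left[i] - gain * (out_r[i - delay] if i >= delay else 0.0)
        out_r[i] = right[i] - gain * (out_l[i - delay] if i >= delay else 0.0)
    return out_l, out_r
```

Feeding an impulse into the left channel shows the structure at work: the right speaker emits a delayed, inverted copy to cancel the left speaker's sound at the right ear, and that copy in turn spawns a smaller correction from the left speaker, and so on.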

The primary drawback to the use of stereo speakers is that crosstalk can be eliminated from only one location within the listening environment. As a result, speaker use is restricted to a single listener -- usually with the added requirement that this lone listener keep their head centered between the two speakers at all times. Although a good crosstalk filter can increase the size of this "sweet spot" a little (and maybe even move it around the room), these basic restrictions will remain. If you would like a more detailed description of the elimination of crosstalk (and you don't mind technical papers which rely heavily on the use of mathematical formulas) please download and read my paper entitled Crosstalk Cancellation Theory.

Headphone use presents a very different set of engineering challenges. On the plus side, since headphones are placed right next to a listener's ears, there is no crosstalk (or need for the crosstalk cancellation filter described above) and thus the restrictions associated with speakers don't apply. On the minus side, however, because a listener wearing headphones effectively moves the entire virtual environment with them when they turn their head, the illusion of stability in the virtual world is lost. This can be fixed either by requiring the listener to keep their head still (admittedly, not a very practical solution) or by having them wear a device which tracks the movement of their head in the real world and updates their position in the virtual world accordingly. It should be noted that head-tracking devices have their own drawbacks -- primarily due to the inherent lag between head movements and their corresponding updates. Although this lag will decrease as the technology improves, the current lag is unfortunately large enough to cause a mild form of "sea sickness" in some listeners.

Copyright © 1996-98 by Tim Nufire, tnufire@ibink.com. All Rights Reserved.
Last updated: 8/98