Evolutions of Spatial Audio
Odie DeSmith, MASARY Technical Manager
Since the earliest days of MASARY, when we composed for the support beams of the Green Monster at Fenway Park, sound has been a fundamental part of our studio’s ethos. In the landscape of immersive art experiences, the visual components are often foregrounded over the auditory experience. We’ve all seen these spaces with four walls of projection showcasing the latest touring exhibits: Immersive Van Gogh, immersive ice cream, immersive etc. But what can’t be captured in photographs and Instagram posts are the major advancements happening in spatial sound. We live in a world where anyone with a set of $100 earbuds can stream full Atmos mixes directly from their smartphone. What was once reserved for IMAX theatres and academic research facilities is now more accessible than ever.
Before we discuss the technologies at the bleeding edge of this recent spatial audio awakening, let’s take a brief look at the history of multichannel audio and some of the pioneers who paved the way for it.
Early Sound Recording
Audio began in monophony. The earliest sound recording technologies emerged in the middle of the 19th century with Édouard-Léon Scott de Martinville’s Phonautograph. This device traced audio signals as a line on paper but could not play back sound. Twenty years later, Thomas Edison cracked the code with the invention of the Phonograph. While this primitive recording was the result of a purely mechanical, acoustic process, the advent of electricity quickly led to a series of subsequent inventions, such as electric loudspeakers, microphones, signal amplifiers, and electromagnetic recorders.
Property of the Library of Congress
And then there were two…
In 1876, Alexander Graham Bell’s moving iron loudspeaker was included in his first telephone patent. The French inventor and engineer Clément Ader improved upon this technology by producing the first telephone capable of transmitting a separate audio signal to each ear (known later as the Théâtrophone). His invention was showcased at the Paris Electrical Exhibition, where it was used to listen to a live stereophonic broadcast of an opera performance happening at the Palais Garnier. Audiences were shocked by what seemed at the time to be an auditory illusion. By the end of the 19th century, this two-channel telephone process had been commercialized across Europe as a novelty and party trick of sorts. Stereo proper, however, would take a few more decades to reach the masses. Long before they were cutting records with Pink Floyd, the British recording conglomerate EMI invented and patented the first modern stereophonic recording and playback technologies in the 1930s.
Image from the Les Maîtres de l’Affiche series by Jules Chéret
Fantasound
Around the same time, these recent advances in stereo recording methods had piqued the interest of none other than Walt Disney. Disney bet big on immersive audio and began developing proprietary audio technologies to debut alongside the film Fantasia. They called this audio system Fantasound. Disney’s work quickly led to many innovative multichannel recording techniques and advancements in both speaker placement and fidelity. Later evolutions of the Fantasound set-up even included a revolutionary form of volume automation known as “Togad,” one of the earliest examples of spatial sound automation at the playback stage. Varying electronic tones were recorded onto dedicated control tracks to modify the amplitude of corresponding speaker outputs. This allowed for spatial effects that gave audiences a sense of motion within the theatre.
"We know...that music emerging from one speaker behind the screen sounds thin, tinkly and strainy. We wanted to reproduce such beautiful masterpieces...so that audiences would feel as though they were standing at the podium with Stokowski," -Walt Disney
While Fantasound was a groundbreaking sonic experience that put a new lens on how recorded audio could be recreated spatially, it was reliant on custom hardware and advanced playback systems, and those limitations greatly stunted its long-term feasibility. The sheer amount of equipment necessary to deploy Disney’s sonic technologies proved to be their downfall. At the time, there was no real consistency across theatres’ technical systems, and implementing Disney’s pioneering audio technology would often require keeping a theatre dark for several days, if not weeks. Under war-time economic conditions especially, this undertaking just wasn’t financially viable for most theatres. Although the distribution rights to Fantasia were handed off in 1941 to RKO Radio Pictures, who soon after converted the picture to a mono soundtrack, an Academy Honorary Award was presented the following year to Disney for their “outstanding contribution to the advancement of the use of sound in motion pictures through the production of Fantasia.” The technology was well ahead of its time and proved that multichannel audio had a place in moving pictures, a sentiment that we now take for granted.
Although it took decades to get past these hurdles, cinema eventually developed standards for both equipment and content, which for the first time made multichannel audio practical for theatrical release.
Disney Fantasound System
Object-Based Mixing
Most modern spatial sound approaches rely on what’s known as object-based mixing. Object-based audio differs from earlier channel-based approaches in that the end mix is agnostic of speaker setup. Rather than mixing sounds to specified channels in the studio, audio is distributed independently and mixed only during reproduction: placement and panning information travels as metadata alongside each audio object, and a render engine reads that location data at the playback stage to determine how the audio is reproduced. This approach began gaining popularity with the release of Dolby Atmos in 2012. Dolby Atmos utilizes a combination of traditional channel-based playback, known as beds (7.1.2), alongside audio objects. The latest renditions of Atmos support up to 128 independent channels and can render to as many as 64 speakers.

At MASARY, object-based audio is essential for retaining flexibility across environments. We often don’t have total control of the spaces where we show work and need our compositions to be adaptable to different dimensions and speaker layouts. Knowing that our authored spatial automation will translate across playback conditions in the field is everything when you might only have 1-2 days on site to deploy an installation. Just like in the days of Disney and their deployment of Fantasound, standardization in the spaces we occupy is rare. Having a tool that allows us to adapt and shift the spatial audio attributes of both fixed and interactive media gives us the confidence to deploy our artworks without the need for hyper-regulated physical and technical conditions.
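To make the object-plus-metadata idea concrete, here is a minimal Python sketch of a renderer turning one authored object into speaker feeds for whatever layout it encounters. It uses simple distance-based amplitude panning in the spirit of DBAP; the function names and rolloff law are illustrative assumptions, not Dolby’s actual renderer:

```python
import numpy as np

def dbap_gains(obj_pos, speaker_positions, rolloff_db=6.0, eps=1e-6):
    """Per-speaker gains for an object at obj_pos, derived from distances alone."""
    dists = np.linalg.norm(speaker_positions - obj_pos, axis=1) + eps
    a = rolloff_db / (20.0 * np.log10(2.0))   # exponent implied by the rolloff law
    weights = 1.0 / dists**a
    return weights / np.linalg.norm(weights)  # normalize for constant power

def render_object(obj_audio, obj_pos, speaker_positions):
    """Mix one mono object into N speaker feeds; the layout can change per venue."""
    gains = dbap_gains(np.asarray(obj_pos, float),
                       np.asarray(speaker_positions, float))
    return np.outer(gains, obj_audio)  # shape: (n_speakers, n_samples)

# The same authored object rendered to two different venues:
tone = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)        # 1 s test tone
quad = [(-2.0, 2.0), (2.0, 2.0), (-2.0, -2.0), (2.0, -2.0)]      # 4.0 square room
ring8 = [(3 * np.cos(t), 3 * np.sin(t))
         for t in np.linspace(0, 2 * np.pi, 8, endpoint=False)]  # 8-speaker circle
feeds_quad = render_object(tone, (1.0, 0.5), quad)    # -> 4 speaker feeds
feeds_ring8 = render_object(tone, (1.0, 0.5), ring8)  # -> 8 speaker feeds
```

The point is that the authored material, the tone and its position, never changes between venues; only the gains derived at playback do.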
Head Tracking
Many modern headphones have head-tracking capabilities that can be used to translate audio objects across a stereo field based on a listener’s movement. This can give the illusion of moving through an environment and hearing fluctuating amplitudes. While in my experience this isn’t as effective as multichannel speaker setups, there’s no doubt that this technology is rapidly improving. Advancements in VR/AR experiences have furthered the need for spatial sound that is localized to a listener. Apple’s AirPods feature this capability and utilize the company’s motion-tracking hardware in conjunction with Dolby Atmos as part of their spatial audio program. Many of these formats support three degrees of freedom, meaning that sounds respond to the rotation of the user’s head: turning left or right, nodding up or down, and tilting side to side. Some go a step further, incorporating six degrees of freedom, where the user’s movement through space is also tracked. These technologies are a huge step for the accessibility of immersive audio. By the end of 2024, Apple had sold more than 75 million sets of AirPods. Not everyone can afford the equipment necessary for reproducing Atmos mixes with speakers, and this method provides an alternative that allows users to dip their toes into the rapidly expanding world of spatial sound. Furthermore, the broadening of audiences for immersive audio is already beginning to have an effect on the availability of recordings beyond traditional stereo formats. Almost all new releases on Apple Music feature spatial audio mixes, so this technology is expanding not only audiences but also the way music is recorded and mixed in the first place.
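Under the hood, 3DoF head tracking amounts to counter-rotating every source by the head’s orientation before it is panned, so sources stay put in the room while the head moves. Here is a minimal yaw-only sketch; this is my own illustrative code, not Apple’s implementation, which also handles pitch, roll, and HRTF filtering:

```python
import numpy as np

def yaw_matrix(yaw_rad):
    """2D rotation matrix for a head-yaw angle (counterclockwise positive)."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    return np.array([[c, -s], [s, c]])

def head_relative(source_pos, head_yaw_rad):
    """Rotate a world-space source into the listener's head frame."""
    return yaw_matrix(-head_yaw_rad) @ np.asarray(source_pos, float)

def stereo_gains(rel_pos):
    """Constant-power L/R gains from the source's azimuth in the head frame."""
    azimuth = np.arctan2(rel_pos[0], rel_pos[1])     # 0 rad = straight ahead
    pan = np.clip(azimuth / (np.pi / 2), -1.0, 1.0)  # map +/-90 deg to -1..+1
    theta = (pan + 1.0) * np.pi / 4                  # 0..pi/2
    return np.cos(theta), np.sin(theta)              # (left, right)

# A source 2 m straight ahead; the listener turns 90 degrees to the left.
# The source should now sit entirely in the right ear:
rel = head_relative((0.0, 2.0), np.radians(90))
print(stereo_gains(rel))  # approximately (0.0, 1.0)
```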
Wave Field Synthesis
All of the aforementioned sound formats rely on what’s known as a phantom sound source: a perceived sonic location between two or more speakers. The difference in the time it takes a sound to arrive at each ear, together with differences in amplitude, gives listeners a fairly accurate sense of sonic positioning. While this method of sound reproduction is effective for many applications, it relies on a listener’s location remaining relatively consistent, and it therefore has its limits for creating truly life-like sonic environments.
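A quick worked example makes the principle tangible. The classic “tangent law” picks left/right gains so that a listener at the sweet spot perceives a source at a chosen angle between two speakers at +/-30 degrees. This is a textbook sketch (positive angles toward the left speaker, constant-power normalization), not any particular product’s panner:

```python
import numpy as np

def tangent_law_gains(phantom_deg, speaker_deg=30.0):
    """L/R gains placing a phantom source at phantom_deg (positive = left)."""
    t = np.tan(np.radians(phantom_deg)) / np.tan(np.radians(speaker_deg))
    # Solve (gL - gR) / (gL + gR) = t subject to gL^2 + gR^2 = 1:
    gl = (1 + t) / np.sqrt(2 * (1 + t**2))
    gr = (1 - t) / np.sqrt(2 * (1 + t**2))
    return gl, gr

for angle in (0, 15, 30):
    gl, gr = tangent_law_gains(angle)
    print(f"phantom at {angle:+3d} deg -> L = {gl:.3f}, R = {gr:.3f}")
# 0 deg gives equal gains (center); 30 deg collapses to the left speaker.
```

Step off the center line, though, and the nearer speaker’s earlier arrival drags the image toward it, which is exactly the limitation the next approach attacks.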
However, around the turn of the millennium, research began popping up around a different method of spatial sound reproduction known as wave field synthesis (WFS). These systems use a massive array of loudspeakers to create what are known as artificial wavefronts. These wavefronts can be thought of as sound sources that seem to emit from in front of the speakers themselves, allowing the sonic origin to remain consistent regardless of a listener’s position. As exciting as that idea is, it comes with some massive technical hurdles. Wave field synthesis only works when deployed with an enormous sound system: we’re talking hundreds to thousands of specialized speakers. Technische Universität Berlin has 2,700 of them in a single space. It is also traditionally dependent on a highly controlled acoustic environment. As a result, these types of systems exist almost exclusively within the context of academic research.
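The principle behind those giant arrays can be stated compactly: drive every speaker with the delay and attenuation a real point source at the virtual position would impose, and the superposed outputs reconstruct that source’s wavefront. The sketch below shows just that skeleton; real WFS adds a proper 2.5D driving function, pre-equalization, and edge tapering, so treat this as a conceptual toy rather than a usable renderer:

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def wfs_delays_and_gains(source_pos, speaker_positions):
    """Per-speaker delay (s) and gain for a virtual point source."""
    dists = np.linalg.norm(speaker_positions - source_pos, axis=1)
    delays = dists / C            # farther speakers fire later
    gains = 1.0 / np.sqrt(dists)  # rough amplitude rolloff with distance
    return delays, gains / gains.max()

def delayed_feed(signal, delay_s, gain, sr=48000):
    """Delay a signal by whole samples (nearest-sample, no interpolation)."""
    pad = int(round(delay_s * sr))
    return gain * np.concatenate([np.zeros(pad), signal])

# A 64-speaker linear array at 10 cm spacing, virtual source 1 m behind it:
array_x = np.arange(64) * 0.10
speakers = np.column_stack([array_x, np.zeros(64)])
delays, gains = wfs_delays_and_gains(np.array([3.2, -1.0]), speakers)
feeds = [delayed_feed(np.random.randn(4800), d, g) for d, g in zip(delays, gains)]
```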
EMPAC Wave Field Synthesis Array at RPI
But there is light on the horizon when it comes to wave field synthesis being utilized within modern creative applications. The Berlin-based company Holoplot created arguably the first practical breakthrough in WFS with the introduction of their X1 sound modules. These modules are deployed similarly to a traditional concert line array but utilize electronic beam steering to deliver sound across both the horizontal and vertical axes. These are the speakers responsible for the audio deployment at the Sphere in Las Vegas, and they represent a true turning point in how WFS setups can be deployed and scaled. Although the technology Holoplot introduced is still far outside the budgets of most venues, it’s the first real inkling of these WFS principles moving towards practical commercial application.
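Beam steering itself rests on a simple phased-array relationship: delaying each element of a line of speakers by n*d*sin(theta)/c tilts the combined wavefront by theta, with no moving parts. A textbook illustration follows; Holoplot’s actual processing is proprietary and far more sophisticated than this:

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def steering_delays(n_elements, spacing_m, steer_deg):
    """Per-element delays (s) that aim a linear array's beam at steer_deg."""
    n = np.arange(n_elements)
    delays = n * spacing_m * np.sin(np.radians(steer_deg)) / C
    return delays - delays.min()  # shift so all delays are non-negative

# Steer a 16-element column (8 cm spacing) 20 degrees off-axis:
print(np.round(steering_delays(16, 0.08, 20.0) * 1e6, 1))  # microseconds
```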
At MASARY, we are excited by these developments and hope for a future where these tools are scalable, deployable, and realistic within the context of immersive public art experiences. Over the past four or five years, we have been at the forefront of deploying spatial audio in performance and installation projects. Our touring work Sound Sculpture operates in 4.1, Phase Garden is a 12.2 circular installation, and our ongoing performance work Ritual / System is a multichannel AV work that has been adapted to various venues and settings. Most recently, our work with the German-based company Media Apes has led us to their Astro Spatial Audio engine. This dedicated rendering processor is an amazing tool that incorporates the principles of WFS while allowing the scalability and deployment flexibility that is crucial to our artworks. When I spoke with Nûjîn Kartal, a close friend and fellow audio fanatic working with the company, he had this to share:
“WFS is based on the principle of recreating the first wavefront of a sound source. That makes the sound so realistic. Other technologies like VBAP, DBAP, Ambisonics and so on are great as well and we use these tools, but when it comes to smooth sound movement and a well rounded sound field, that’s not centered on a sweet spot, then WFS is still the way to go…
“Definitely these technologies (at the Sphere) are exciting. From my point of view these superlatives are exciting for me and other tech and immersive enthusiasts. On the other hand, I’ve heard a lot of not-so-exciting things about the Sphere. It is a very complex system, and I think for a venue like that Holoplot’s technology with their sound beaming was a good choice to go with, but this is not a technology for the immersive sound I am looking for. I want to have a well-rounded sound field. With sound beaming you have better control in complex acoustical situations, because you do not push energy into the room where it’s not needed. Domes generally have poor acoustics, but are really nice for visuals.
“I think it was great to show off what’s possible, but overall I personally think it’s not the best idea to put something of that size into the desert. Another part is to make it accessible. Thus my team and I are working continuously on creating immersive rooms that can be built anywhere, with a focus on sustainability regarding the built elements and technologies as far as possible. I think that there are more ways to build quality experiences and have sustainability involved as well.”
UI of spatial sound software from Media Apes
Conclusion
But why does all of this tech jargon and history matter? Despite the countless innovations since the mid-19th century, audio is still primarily consumed in a stereo L/R format. For us at MASARY, the answer is that we don’t just want to create artworks that fit standardized formats; we strive to create uncanny worlds within them.
Audio plays a core role in how we process and understand the world. Whether it’s a tiger in the bushes or the low rumble of an earthquake, our ability to localize and recognize sound is an essential survival mechanism. But beyond its core evolutionary role, it is also a huge part of how we process the world emotionally. Sound can make us dance, dream, love, and think in profound ways that go far beyond mere utility. We all have a favorite song. We know how it feels to hear the voice of a lost loved one. As visual art decorates space, music decorates time, stimulating a primal sort of pattern recognition that feels intrinsic to the human experience.
Phase Garden Light and Sound Installation - MASARY Studios 2022. (12.2 spatial audio via the Astro Spatial Audio engine) Photo by Aram Boghosian - Presented by Friends of Mount Auburn
As public art practitioners, we think that the ability to manipulate and alter space can inspire, bring communities together, and evoke a sense of wonder in a way that isn’t possible without these modern technologies. In that way, to us, audio is not just a complementary component, a score in a film or a video game, but a fundamental piece of the artwork, a piece of how we can reimagine and dream spaces to be. As much as we prioritize, and are often more attuned to, technical progression in digital visual domains (CGI, VR, etc.), we hope that audio can be just as inspiring and profound: think of those at the Paris Electrical Exhibition experiencing the first live stereo playback, or a child seeing Fantasia in theatres, as enamored with the auditory spectacle as with the innovative animation techniques, or the thrill of a concert at the Sphere in Las Vegas, feeling sounds swarm around you like a school of fish, or a whole immersive sonic landscape in your earbuds on the train. There are so many exciting advancements on the horizon of spatial sound. We hear the world in 3D, each sound source, whether a bird or a car, having its own spatial origin. We are just at the beginning of what it means to represent that reality through technology, and with it comes a toolkit that can expand and inspire creators for decades to come.