A new 'Spatial Audio for FaceTime Conferencing' patent application reveals Apple's use of a special Spatial Audio Controller
On Tuesday, Patently Apple posted a report titled "Apple has Won a Patent for FaceTime with Spatial Audio that includes a possible new Dimension." Our report included the video presented below with Craig Federighi introducing Spatial Audio for FaceTime conferences.
As with all major Apple projects, such as Spatial Audio for FaceTime conferences, Apple's engineers file several patents covering various aspects of their invention. Yesterday, the US Patent & Trademark Office published another patent application from Apple, this one relating to a spatial audio controller that controls how audio is spatialized in communication sessions such as FaceTime conferences.
In its patent background, Apple notes that while a local device is engaged in a communication session, such as a video conference call, with several remote devices, the local device may receive video data and audio data of the session from each remote device. The local device may use the video data to display a dynamic video representation of each remote participant, and may use the audio data to create spatialized audio renderings of each remote participant. To do this, the local device may perform spatial rendering operations on each remote device's audio data so that the local user perceives each remote participant as coming from a different location.
Performing these video and audio processing operations, however, takes a heavy toll on the local device's electronics (e.g., its central processing unit (CPU)). There is therefore a need for a spatial audio controller that creates and manages spatial audio renderings during a communication session with remote devices, taking into account both computational complexity and the video presentation, while preserving audio quality.
To overcome these deficiencies, Apple's patent application describes a local device with a spatial audio controller that performs audio signal processing operations to efficiently and effectively spatially render input audio streams from one or more remote devices during a communication session.
One aspect of the invention is a method performed by an electronic device (e.g., a local device) that is communicatively coupled with one or more remote devices with which it is engaged in a communication session. While engaged in the session, the local device receives an input audio stream from each remote device, along with a set of communication session parameters for each remote device.
For example, the parameters may include one or more voice activity detection (VAD) parameters based on a VAD signal received from each remote device, indicating the voice activity and/or voice strength of that device's remote participant.
In addition, when the devices are engaged in a video communication session (e.g., a video conference or FaceTime call), in which input video streams are received and visual representations (or tiles) of those streams are arranged in a graphical user interface (GUI) window on the local device's display screen, the session parameters may indicate how the tiles are arranged within the GUI (e.g., whether in the larger per-user canvas region or the smaller per-user roster region), the size of each tile, and so on.
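Read together, the per-device inputs the patent describes can be modeled roughly as follows. This is a minimal Swift sketch with hypothetical names; the patent application doesn't specify any data structures:

```swift
import Foundation

/// Where a participant's tile sits in the conference window.
enum TileRegion {
    case canvas   // the larger, more prominent per-user tile area
    case roster   // the smaller per-user tile strip
}

/// The inputs the local device receives for each remote device in a session.
struct RemoteDeviceSession {
    let audioStreamID: UUID    // identifies this device's input audio stream
    // Communication session parameters:
    let voiceActivity: Double  // VAD level, e.g. 0.0 (silent) to 1.0 (actively talking)
    let tileRegion: TileRegion // where the device's tile is arranged in the GUI
    let tileSize: Double       // on-screen size of the tile, e.g. in points
}
```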
Based on the set of communication session parameters, the local device determines, for each input audio stream, whether that stream is to be (1) rendered individually with respect to the other received input audio streams, or (2) rendered within a mix alongside one or more other input audio streams.
For example, an input audio stream may be individually rendered when at least one of the following holds: a VAD parameter such as voice activity is above a voice activity threshold (indicating that the remote participant is actively talking); the visual representation associated with the input audio stream is contained within a prominent area of the GUI (e.g., the canvas region); and/or the size of the visual representation (the tile showing the remote participant's video) is above a threshold size.

For each input audio stream determined to be rendered individually, the local device spatially renders that stream as an individual virtual sound source containing only that stream; the remaining streams are spatially rendered together as a single virtual sound source containing the mix. By individually rendering the streams that are likely most important to the local participant, while rendering the less critical streams as one mix, the local device reduces the computational processing required to spatially render all of the session's streams.
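As a rough illustration of that decision logic, here is a Swift sketch building on the model above. The thresholds and the OR-combination of the three tests are assumptions inferred from the patent's "at least one of" language:

```swift
// Hypothetical thresholds; the patent doesn't disclose concrete values.
let voiceActivityThreshold = 0.5
let tileSizeThreshold = 200.0  // points

/// Returns true if the stream should be spatially rendered as its own
/// virtual sound source; false if it can be folded into the shared mix.
func shouldRenderIndividually(_ device: RemoteDeviceSession) -> Bool {
    return device.voiceActivity > voiceActivityThreshold  // participant is actively talking
        || device.tileRegion == .canvas                   // tile sits in the prominent canvas region
        || device.tileSize > tileSizeThreshold            // tile is large on screen
}

/// Partition the session's streams: each "individual" stream gets its own
/// spatial-rendering pass, while everything else shares a single pass as one
/// mixed virtual sound source -- fewer passes, lower CPU cost.
func partitionStreams(_ devices: [RemoteDeviceSession])
    -> (individual: [RemoteDeviceSession], mixed: [RemoteDeviceSession]) {
    var individual: [RemoteDeviceSession] = []
    var mixed: [RemoteDeviceSession] = []
    for device in devices {
        if shouldRenderIndividually(device) {
            individual.append(device)
        } else {
            mixed.append(device)
        }
    }
    return (individual, mixed)
}
```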
According to another aspect of the disclosure, a method performed by the local device determines the panning range of several speakers based on the aspect ratio of the communication session's GUI (e.g., the window displayed on screen). The local device receives an input audio stream and an input video stream, and displays a visual representation of the input video stream within the GUI of the video communication session on a display screen (which may be integrated within the local device). The local device determines the aspect ratio of the GUI and, based on that ratio, determines an azimuth panning range that is at least a portion of the speakers' total azimuth panning range, and an elevation panning range that is at least a portion of their total elevation panning range. The local device then spatially renders the input audio stream to output a virtual sound source within those azimuth and elevation panning ranges.
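A plausible reading of that geometry in Swift follows. The reference aspect ratio, the total ranges, and the linear scaling rule are all assumptions; the patent only says the two panning ranges are derived from the GUI's aspect ratio:

```swift
/// Derive azimuth and elevation panning ranges (in degrees) from the
/// conference window's aspect ratio. Assumed rule: a wide window keeps the
/// full azimuth span but flattens elevation; a tall window does the opposite.
func panningRanges(windowWidth: Double, windowHeight: Double)
    -> (azimuth: ClosedRange<Double>, elevation: ClosedRange<Double>) {
    let totalAzimuth = 60.0     // speakers' total capability: ±60° azimuth (assumed)
    let totalElevation = 20.0   // ±20° elevation (assumed)
    let reference = 16.0 / 9.0  // treat 16:9 as the "full azimuth" window shape
    let aspect = windowWidth / windowHeight

    let azimuthScale = min(1.0, aspect / reference)    // narrower window -> narrower azimuth
    let elevationScale = min(1.0, reference / aspect)  // wider window -> flatter elevation

    let az = totalAzimuth * azimuthScale
    let el = totalElevation * elevationScale
    return (azimuth: -az ... az, elevation: -el ... el)
}

// A 16:9 window gets the full ±60° x ±20° ranges; a portrait 9:16 window is
// narrowed to roughly ±19° of azimuth while keeping the full ±20° of elevation.
```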
Apple's patent FIG. 1 below shows a system that includes one or more remote devices engaged in a communication session with a local device that includes a spatial audio controller for spatially rendering audio from those remote devices. FIG. 2 shows a block diagram of the local device (#2) that spatially renders input audio streams from the remote devices with which it is engaged in a communication session to output virtual sound sources.
Apple's patent FIG. 3 above illustrates an example graphical user interface (GUI) of a communication session being displayed by the local device, along with an arrangement of the virtual sound locations of the spatial audio being output during the session.
Apple's patent FIG. 4 below illustrates a block diagram of the local device that includes the spatial audio controller that performs spatial rendering operations during a communication session.
For more details, review Apple's patent application number US 20220394407 A1.
Apple Inventors
Christopher Garrido: Senior+ Manager, Real-Time Media Conferencing (RTC)
Karthick Santhanam: Software Engineering Manager (Group FaceTime+)
Sean Ramprashad: Engineering Manager: CoreAudio Algorithms-Communications
Marty Johnson: Distinguished Engineer: Audio Technology Developments
Daniel Boothe: Audio and Acoustic Technology Incubation
Konstantyn Komarov: Core Audio Software Engineer
Patrick Miauton: Real-Time Media Software Engineer
Austin Shyu: Software Engineer
Jae Woo Chang: Designer
Peter D. Callaway: No LinkedIn profile was found