Spaces:
Configuration error
Configuration error
# Spatial Audio RFC (draft) | |
*This document describes an open metadata scheme by which MP4 multimedia containers may accommodate spatial and head-locked stereo audio. Comments are welcome on the [spatial-media-discuss](https://groups.google.com/forum/#!forum/spatial-media-discuss) mailing list or by [filing an issue](https://github.com/google/spatial-media/issues) on GitHub.* | |
------------------------------------------------------ | |
## Metadata Format | |
### MP4 | |
Spatial audio metadata is stored in a new box, `SA3D`, defined in this RFC. | |
#### Spatial Audio Box (SA3D) | |
##### Definition | |
Box Type: `SA3D` | |
Container: Sound Sample Description box (e.g., `mp4a`, `lpcm`, `sowt`, etc.) | |
Mandatory: No | |
Quantity: Zero or one | |
When present, provides additional information about the spatial audio content contained in this audio track. | |
##### Syntax | |
``` | |
aligned(8) class SpatialAudioBox extends Box(‘SA3D’) { | |
unsigned int(8) version; | |
unsigned int(8) ambisonic_type; | |
unsigned int(32) ambisonic_order; | |
unsigned int(8) ambisonic_channel_ordering; | |
unsigned int(8) ambisonic_normalization; | |
unsigned int(32) num_channels; | |
for (i = 0; i < num_channels; i++) { | |
unsigned int(32) channel_map; | |
} | |
} | |
``` | |
##### Semantics | |
- `version` is an 8-bit unsigned integer that specifies the version of this box. Must be set to `0`. | |
- `head_locked_stereo` is a 1-bit flag used to indicate that the stored audio track contains head-locked stereo audio in addition to ambisonics audio. The flag should be set if the track contains head-locked stereo and unset otherwise. | |
- `ambisonic_type` is a 7-bit unsigned integer that specifies the type of ambisonic audio represented; the following values are defined: | |
| `ambisonic_type` | Ambisonic Type Description | | |
|:-----------------|:---------------------------| | |
| `0` | **Periphonic**: Indicates that the audio stored is a periphonic ambisonic sound field (i.e., full 3D). | | |
- `ambisonic_order` is a 32-bit unsigned integer that specifies the order of the ambisonic sound field. If the `ambisonic_type` is `0` (*periphonic*), this is a non-negative integer representing the periphonic ambisonic order; in this case, it should take a value of `sqrt(n) - 1`, where `n` is the number of channels in the represented ambisonic audio data. For example, a *periphonic* ambisonic sound field with `ambisonic_order = 1` requires `(ambisonic_order + 1)^2 = 4` ambisonic components. | |
- `ambisonic_channel_ordering` is an 8-bit integer specifying the channel ordering (i.e., spherical harmonics component ordering) used in the represented ambisonic audio data; the following values are defined: | |
| `ambisonic_channel_ordering` | Channel Ordering Description | | |
|:-----------------------------|:-----------------------------| | |
| `0` | **ACN**: The channel ordering used is the *Ambisonic Channel Number* (ACN) system. In this, given a spherical harmonic of degree `l` and order `m`, the corresponding ordering index `n` is given by `n = l * (l + 1) + m`. | | |
- `ambisonic_normalization` is an 8-bit unsigned integer specifying the normalization (i.e., spherical harmonics normalization) used in the represented ambisonic audio data; the following values are defined: | |
| `ambisonic_normalization` | Normalization Description | | |
|:--------------------------|:--------------------------| | |
| `0` | **SN3D**: The normalization used is *Schmidt semi-normalization* (SN3D). In this, the spherical harmonic of degree `l` and order `m` is normalized according to `sqrt((2 - δ(m)) * ((l - m)! / (l + m)!))`, where `δ(m)` is the *Kronecker delta* function, such that `δ(0) = 1` and `δ(m) = 0` otherwise. | | |
- `num_channels` is a 32-bit unsigned integer specifying the number of audio channels contained in the given audio track. | |
- `channel_map` is a sequence of 32-bit unsigned integers that maps audio channels in a given audio track to ambisonic components, given the defined `ambisonic_channel_ordering`. The sequence of `channel_map` values should match the channel sequence within the given audio track. | |
For the example case of `ambisonic_type = 0` (Periphonic), consider a 4-channel audio track containing ambisonic components *W*, *X*, *Y*, *Z* at channel indexes `0`, `1`, `2`, `3`, respectively. For `ambisonic_channel_ordering = 0` (ACN), the ordering of components should be *W*, *Y*, *Z*, *X*, so the `channel_map` sequence should be `0`, `2`, `3`, `1`. | |
As a simpler example, for a 4-channel audio track containing ambisonic components *W*, *Y*, *Z*, *X* at channel indexes `0`, `1`, `2`, `3`, respectively, the `channel_map` sequence should be specified as `0`, `1`, `2`, `3` when `ambisonic_channel_ordering = 0` (ACN). | |
For the example case of `ambisonic_type = 0` (Periphonic) with `head_locked_stereo = 1`, the stored audio will consist of `4` ambisonic components *W*, *Y*, *Z*, *X* in addition to head-locked stereo components *L* and *R*. In this case, the SA3D atom will define `num_channels = 6` and a `channel_map` specified as `0`, `1`, `2`, `3`, `4`, `5` indicating that the channels are laid out in the file as *W*, *Y*, *Z*, *X*, *L*, *R*. This representation extends to different layouts of ambisonics and head-locked stereo components. For example, a channel layout of `4`, `5`, `0`, `1`, `2`, `3` indicates that the layout of the stored audio is *L*, *R*, *W*, *Y*, *Z*, *X*. | |
##### Example | |
Here is an example MP4 box hierarchy for a file containing the `SA3D` box: | |
- moov | |
- trak | |
- mdia | |
- minf | |
- stbl | |
- stsd | |
- mp4a | |
- esds | |
- SA3D | |
where the `SA3D` box has the following data: | |
| Field Name | Value | | |
|:-----------|:-----| | |
| `version` | `0` | | |
| `ambisonic_type` | `0` | | |
| `ambisonic_order` | `1` | | |
| `ambisonic_channel_ordering` | `0` | | |
| `ambisonic_normalization` | `0` | | |
| `num_channels` | `4` | | |
| `channel_map` | `0` | | |
| `channel_map` | `2` | | |
| `channel_map` | `3` | | |
| `channel_map` | `1` | | |
------------------------------------------------------ | |
## Appendix 1 - Ambisonics | |
The traditional notion of ambisonics is used, where the sound field is represented by spherical harmonics coefficients using the *associated Legendre polynomials* (without *Condon-Shortley phase*) as the basis functions. Thus, the spherical harmonic of degree `l` and order `m` at elevation `E` and azimuth `A` is given by: | |
N(l, abs(m)) * P(l, abs(m), sin(E)) * T(m, A) | |
where: | |
- `N(l, m)` is the spherical harmonics normalization function used. | |
- `P(l, m, x)` is the (unnormalized) *associated Legendre polynomial*, without *Condon-Shortley phase*, of degree `l` and order `m` evaluated at `x`. | |
- `T(m, x)` is `sin(-m * x)` for `m < 0` and `cos(m * x)` otherwise. | |
### Conventions | |
#### Azimuth | |
- `A = 0`: The source is in front of the listener. | |
- `A` in `(0, pi/2)`: The source is in the forward-left quadrant. | |
- `A` in `(pi/2, pi)`: The source is in the back-left quadrant. | |
- `A` in `(-pi/2, 0)`: The source is in the forward-right quadrant. | |
- `A` in `(-pi, -pi/2)`: The source is in the back-right quadrant. | |
#### Elevation | |
- `E = 0`: The source is in the horizontal plane. | |
- `E` in `(0, pi/2]`: The source is above the listener. | |
- `E` in `[-pi/2, 0)`: The source is below the listener. | |