# Spatial Audio RFC (draft)
*This document describes an open metadata scheme by which MP4 multimedia containers may accommodate spatial and head-locked stereo audio. Comments are welcome on the [spatial-media-discuss](https://groups.google.com/forum/#!forum/spatial-media-discuss) mailing list or by [filing an issue](https://github.com/google/spatial-media/issues) on GitHub.*

------------------------------------------------------

## Metadata Format

### MP4
Spatial audio metadata is stored in a new box, `SA3D`, defined in this RFC.

#### Spatial Audio Box (SA3D)
##### Definition
Box Type: `SA3D`  
Container: Sound Sample Description box (e.g., `mp4a`, `lpcm`, `sowt`)  
Mandatory: No  
Quantity: Zero or one  

When present, this box provides additional information about the spatial audio content contained in this audio track.

##### Syntax
```
aligned(8) class SpatialAudioBox extends Box('SA3D') {
    unsigned int(8)  version;
    unsigned int(1)  head_locked_stereo;
    unsigned int(7)  ambisonic_type;
    unsigned int(32) ambisonic_order;
    unsigned int(8)  ambisonic_channel_ordering;
    unsigned int(8)  ambisonic_normalization;
    unsigned int(32) num_channels;
    for (i = 0; i < num_channels; i++) {
        unsigned int(32) channel_map;
    }
}
```

##### Semantics
- `version` is an 8-bit unsigned integer that specifies the version of this box. Must be set to `0`.

- `head_locked_stereo` is a 1-bit flag that indicates whether the stored audio track contains head-locked stereo audio in addition to ambisonic audio. The flag should be set to `1` if the track contains head-locked stereo and to `0` otherwise.

- `ambisonic_type` is a 7-bit unsigned integer that specifies the type of ambisonic audio represented; the following values are defined:

| `ambisonic_type` | Ambisonic Type Description |
|:-----------------|:---------------------------|
|   `0`   | **Periphonic**: Indicates that the audio stored is a periphonic ambisonic sound field (i.e., full 3D). |

- `ambisonic_order` is a 32-bit unsigned integer that specifies the order of the ambisonic sound field. If the `ambisonic_type` is `0` (*periphonic*), this is a non-negative integer representing the periphonic ambisonic order; in this case, it should take a value of `sqrt(n) - 1`, where `n` is the number of channels in the represented ambisonic audio data. For example, a *periphonic* ambisonic sound field with `ambisonic_order = 1` requires `(ambisonic_order + 1)^2 = 4` ambisonic components.

- `ambisonic_channel_ordering` is an 8-bit unsigned integer specifying the channel ordering (i.e., spherical harmonics component ordering) used in the represented ambisonic audio data; the following values are defined:

| `ambisonic_channel_ordering` | Channel Ordering Description |
|:-----------------------------|:-----------------------------|
|   `0`   | **ACN**: The channel ordering used is the *Ambisonic Channel Number* (ACN) system. In this, given a spherical harmonic of degree `l` and order `m`, the corresponding ordering index `n` is given by `n = l * (l + 1) + m`. |
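
The ACN formula can be checked with a short Python sketch. The function names `acn_index` and `acn_to_degree_order` are illustrative and not part of this RFC. The inverse mapping is derived from the formula on the observation that degree `l` covers the indexes `l^2` through `(l + 1)^2 - 1`.

```python
import math


def acn_index(l, m):
    """Return the ACN index n = l * (l + 1) + m for degree l and order m."""
    if l < 0 or abs(m) > l:
        raise ValueError("require l >= 0 and -l <= m <= l")
    return l * (l + 1) + m


def acn_to_degree_order(n):
    """Invert the ACN formula: recover (l, m) from index n."""
    if n < 0:
        raise ValueError("ACN index must be non-negative")
    l = math.isqrt(n)  # degree l spans indexes l*l .. (l+1)*(l+1) - 1
    m = n - l * (l + 1)
    return l, m


# First-order components in ACN order: W, Y, Z, X.
first_order = [acn_to_degree_order(n) for n in range(4)]
# first_order == [(0, 0), (1, -1), (1, 0), (1, 1)]
```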

- `ambisonic_normalization` is an 8-bit unsigned integer specifying the normalization (i.e., spherical harmonics normalization) used in the represented ambisonic audio data; the following values are defined:

| `ambisonic_normalization` | Normalization Description |
|:--------------------------|:--------------------------|
|   `0`   | **SN3D**: The normalization used is *Schmidt semi-normalization* (SN3D). In this, the spherical harmonic of degree `l` and order `m` is normalized according to `sqrt((2 - δ(m)) * ((l - abs(m))! / (l + abs(m))!))`, where `δ(m)` is the *Kronecker delta* function, such that `δ(0) = 1` and `δ(m) = 0` otherwise. |
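
As a rough illustration, the SN3D factor can be computed directly from the expression above. The function name `sn3d` is hypothetical, and the sketch takes `abs(m)` before evaluating the factorials, as Appendix 1 does when it evaluates `N(l, abs(m))`.

```python
import math


def sn3d(l, m):
    """SN3D normalization factor for degree l and order m."""
    m = abs(m)  # assumption: the factor depends only on the magnitude of m
    if l < 0 or m > l:
        raise ValueError("require l >= 0 and -l <= m <= l")
    delta = 1 if m == 0 else 0  # Kronecker delta
    return math.sqrt((2 - delta) * math.factorial(l - m) / math.factorial(l + m))


# All zeroth- and first-order components have a factor of 1 under SN3D.
first_order_factors = [sn3d(0, 0), sn3d(1, -1), sn3d(1, 0), sn3d(1, 1)]
```

Higher orders produce factors below 1; for example, `sn3d(2, 2)` evaluates `sqrt(2 / 24)`.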

- `num_channels` is a 32-bit unsigned integer specifying the number of audio channels contained in the given audio track.

- `channel_map` is a sequence of 32-bit unsigned integers that maps audio channels in a given audio track to ambisonic components, given the defined `ambisonic_channel_ordering`. The sequence of `channel_map` values should match the channel sequence within the given audio track.

  For the example case of `ambisonic_type = 0` (Periphonic), consider a 4-channel audio track containing ambisonic components *W*, *X*, *Y*, *Z* at channel indexes `0`, `1`, `2`, `3`, respectively. For `ambisonic_channel_ordering = 0` (ACN), the components are ordered *W*, *Y*, *Z*, *X*, with component indexes `0`, `1`, `2`, `3`. Channels `0` through `3` therefore carry components `0`, `3`, `1`, `2`, so the `channel_map` sequence should be `0`, `3`, `1`, `2`.

  As a simpler example, for a 4-channel audio track containing ambisonic components *W*, *Y*, *Z*, *X* at channel indexes `0`, `1`, `2`, `3`, respectively, the `channel_map` sequence should be specified as `0`, `1`, `2`, `3` when `ambisonic_channel_ordering = 0` (ACN).

  For the example case of `ambisonic_type = 0` (Periphonic) with `head_locked_stereo = 1`, the stored audio will consist of `4` ambisonic components *W*, *Y*, *Z*, *X* in addition to head-locked stereo components *L* and *R*. In this case, the `SA3D` box will define `num_channels = 6` and a `channel_map` specified as `0`, `1`, `2`, `3`, `4`, `5`, indicating that the channels are laid out in the file as *W*, *Y*, *Z*, *X*, *L*, *R*. This representation extends to different layouts of ambisonic and head-locked stereo components. For example, a `channel_map` of `4`, `5`, `0`, `1`, `2`, `3` indicates that the layout of the stored audio is *L*, *R*, *W*, *Y*, *Z*, *X*.
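
The fields described above might be read as in the following Python sketch. It is not a reference parser: the names `SpatialAudioBox`, `parse_sa3d_payload`, and `channel_for_each_component` are hypothetical, the input is assumed to be the box payload with the 8-byte box header already removed, and the byte after `version` is assumed to hold `head_locked_stereo` in its most significant bit followed by the 7-bit `ambisonic_type`. The permutation check is likewise an assumption of this sketch rather than a stated requirement.

```python
import struct
from typing import List, NamedTuple, Tuple


class SpatialAudioBox(NamedTuple):
    version: int
    head_locked_stereo: bool
    ambisonic_type: int
    ambisonic_order: int
    ambisonic_channel_ordering: int
    ambisonic_normalization: int
    channel_map: Tuple[int, ...]


# Fixed-size fields, big-endian with no padding: 1 + 1 + 4 + 1 + 1 + 4 = 12 bytes.
_FIXED = struct.Struct(">BBIBBI")


def parse_sa3d_payload(payload: bytes) -> SpatialAudioBox:
    """Parse the SA3D payload (the bytes after the 8-byte box header)."""
    if len(payload) < _FIXED.size:
        raise ValueError("SA3D payload too short")
    version, packed, order, ordering, norm, num_channels = _FIXED.unpack_from(payload, 0)
    if version != 0:
        raise ValueError("unsupported SA3D version: %d" % version)
    if len(payload) < _FIXED.size + 4 * num_channels:
        raise ValueError("SA3D payload truncated inside channel_map")
    channel_map = struct.unpack_from(">%dI" % num_channels, payload, _FIXED.size)
    return SpatialAudioBox(
        version=version,
        head_locked_stereo=bool(packed >> 7),  # assumed: most significant bit
        ambisonic_type=packed & 0x7F,          # assumed: remaining 7 bits
        ambisonic_order=order,
        ambisonic_channel_ordering=ordering,
        ambisonic_normalization=norm,
        channel_map=channel_map,
    )


def channel_for_each_component(channel_map) -> List[int]:
    """Invert channel_map: result[c] is the audio channel carrying component c."""
    if sorted(channel_map) != list(range(len(channel_map))):
        raise ValueError("channel_map is not a permutation of 0..num_channels-1")
    result = [0] * len(channel_map)
    for channel, component in enumerate(channel_map):
        result[component] = channel
    return result


# The head-locked example above: audio stored as L, R, W, Y, Z, X.
payload = struct.pack(">BBIBBI6I", 0, 0x80, 1, 0, 0, 6, 4, 5, 0, 1, 2, 3)
box = parse_sa3d_payload(payload)
channels = channel_for_each_component(box.channel_map)
# channels == [2, 3, 4, 5, 0, 1]: W, Y, Z, X are at channels 2-5; L, R at 0-1.
```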

##### Example

Here is an example MP4 box hierarchy for a file containing the `SA3D` box:

- moov
  - trak
    - mdia
      - minf
        - stbl
          - stsd
            - mp4a
              - esds
              - SA3D

where the `SA3D` box has the following data:

| Field Name | Value |
|:-----------|:-----|
| `version` | `0` |
| `ambisonic_type` | `0` |
| `ambisonic_order` | `1` |
| `ambisonic_channel_ordering` | `0` |
| `ambisonic_normalization` | `0` |
| `num_channels` | `4` |
| `channel_map` | `0` |
| `channel_map` | `2` |
| `channel_map` | `3` |
| `channel_map` | `1` |
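
For illustration, the example values above could be serialized as in the following Python sketch. The helper `build_sa3d_box` is hypothetical. It assumes the standard ISO base media box header (a 32-bit big-endian size that includes the header, followed by the four-character type) and fixes `ambisonic_type`, `ambisonic_channel_ordering`, and `ambisonic_normalization` to `0`, the only values this RFC defines. The channel-count check is inferred from the examples in the Semantics section, where head-locked stereo adds two channels to the `(ambisonic_order + 1)^2` ambisonic components.

```python
import struct


def build_sa3d_box(ambisonic_order, channel_map, head_locked_stereo=False):
    """Serialize an SA3D box (header plus payload) as bytes."""
    expected = (ambisonic_order + 1) ** 2 + (2 if head_locked_stereo else 0)
    if len(channel_map) != expected:
        raise ValueError("expected %d channels, got %d" % (expected, len(channel_map)))
    # Assumed packing: head_locked_stereo in the top bit, ambisonic_type (0) below it.
    packed_type = int(head_locked_stereo) << 7
    payload = struct.pack(
        ">BBIBBI",
        0,                 # version
        packed_type,       # head_locked_stereo + ambisonic_type
        ambisonic_order,
        0,                 # ambisonic_channel_ordering: ACN
        0,                 # ambisonic_normalization: SN3D
        len(channel_map),  # num_channels
    )
    payload += struct.pack(">%dI" % len(channel_map), *channel_map)
    header = struct.pack(">I4s", 8 + len(payload), b"SA3D")
    return header + payload


box = build_sa3d_box(1, [0, 2, 3, 1])
# len(box) == 36: 8-byte header, 12 bytes of fixed fields, 4 * 4 bytes of channel_map.
```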

------------------------------------------------------

## Appendix 1 - Ambisonics
The traditional notion of ambisonics is used, where the sound field is represented by spherical harmonics coefficients using the *associated Legendre polynomials* (without *Condon-Shortley phase*) as the basis functions. Thus, the spherical harmonic of degree `l` and order `m` at elevation `E` and azimuth `A` is given by:

    N(l, abs(m)) * P(l, abs(m), sin(E)) * T(m, A)

where:
- `N(l, m)` is the spherical harmonics normalization function used.
- `P(l, m, x)` is the (unnormalized) *associated Legendre polynomial*, without *Condon-Shortley phase*, of degree `l` and order `m` evaluated at `x`.
- `T(m, x)` is `sin(-m * x)` for `m < 0` and `cos(m * x)` otherwise.
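
The expression above can be evaluated numerically as in the following Python sketch. It assumes SN3D for `N(l, m)`, since that is the only normalization this RFC defines, and computes `P(l, m, x)` with the standard upward recurrence in degree, omitting the Condon-Shortley factor `(-1)^m`. All function names are illustrative.

```python
import math


def sn3d(l, m):
    """SN3D normalization N(l, m) for 0 <= m <= l."""
    delta = 1 if m == 0 else 0
    return math.sqrt((2 - delta) * math.factorial(l - m) / math.factorial(l + m))


def assoc_legendre(l, m, x):
    """Unnormalized associated Legendre polynomial P(l, m, x) for 0 <= m <= l,
    without the Condon-Shortley phase."""
    # Start from P(m, m, x) = (2m - 1)!! * (1 - x^2)^(m / 2).
    p_mm = 1.0
    root = math.sqrt(max(0.0, 1.0 - x * x))
    for k in range(1, m + 1):
        p_mm *= (2 * k - 1) * root
    if l == m:
        return p_mm
    # P(m + 1, m, x) = x * (2m + 1) * P(m, m, x).
    p_prev, p_curr = p_mm, x * (2 * m + 1) * p_mm
    # Upward recurrence in degree for the remaining steps.
    for d in range(m + 2, l + 1):
        p_next = ((2 * d - 1) * x * p_curr - (d + m - 1) * p_prev) / (d - m)
        p_prev, p_curr = p_curr, p_next
    return p_curr


def spherical_harmonic(l, m, azimuth, elevation):
    """Evaluate N(l, |m|) * P(l, |m|, sin(E)) * T(m, A), angles in radians."""
    am = abs(m)
    trig = math.sin(-m * azimuth) if m < 0 else math.cos(m * azimuth)
    return sn3d(l, am) * assoc_legendre(l, am, math.sin(elevation)) * trig


def encode_first_order(azimuth, elevation):
    """Gains for W, Y, Z, X (ACN order) for a source in the given direction."""
    degree_order = [(0, 0), (1, -1), (1, 0), (1, 1)]
    return [spherical_harmonic(l, m, azimuth, elevation) for l, m in degree_order]


front = encode_first_order(0.0, 0.0)
# front == [1.0, 0.0, 0.0, 1.0]
```

Under this sketch, a source straight ahead (`A = 0`, `E = 0`) excites only *W* and *X*, while moving it to `A = pi/2` shifts the unit gain from *X* to *Y*.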

### Conventions
#### Azimuth
- `A = 0`: The source is in front of the listener.
- `A` in `(0, pi/2)`: The source is in the forward-left quadrant.
- `A` in `(pi/2, pi)`: The source is in the back-left quadrant.
- `A` in `(-pi/2, 0)`: The source is in the forward-right quadrant.
- `A` in `(-pi, -pi/2)`: The source is in the back-right quadrant.

#### Elevation
- `E = 0`: The source is in the horizontal plane.
- `E` in `(0, pi/2]`: The source is above the listener.
- `E` in `[-pi/2, 0)`: The source is below the listener.