	# Spatial Audio RFC (draft)
	This document describes an open metadata scheme by which MP4 multimedia containers may accommodate spatial and head-locked stereo audio. Comments are welcome on the [spatial-media-discuss](https://groups.google.com/forum/#!forum/spatial-media-discuss) mailing list or by [filing an issue](https://github.com/google/spatial-media/issues) on GitHub.

	------------------------------------------------------

	## Metadata Format

	### MP4
	Spatial audio metadata is stored in a new box, `SA3D`, defined in this RFC.

	#### Spatial Audio Box (SA3D)
	##### Definition
	Box Type: `SA3D`
	Container: Sound Sample Description box (e.g., `mp4a`, `lpcm`, `sowt`, etc.)
	Mandatory: No
	Quantity: Zero or one

	When present, provides additional information about the spatial audio content contained in this audio track.

	##### Syntax
	```
	aligned(8) class SpatialAudioBox extends Box('SA3D') {
	    unsigned int(8)  version;
	    unsigned int(1)  head_locked_stereo;
	    unsigned int(7)  ambisonic_type;
	    unsigned int(32) ambisonic_order;
	    unsigned int(8)  ambisonic_channel_ordering;
	    unsigned int(8)  ambisonic_normalization;
	    unsigned int(32) num_channels;
	    for (i = 0; i < num_channels; i++) {
	        unsigned int(32) channel_map;
	    }
	}
	```
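
The field order above can be read with Python's `struct` module. The following is a minimal sketch, not a reference parser: the function name `parse_sa3d_payload` is hypothetical, the input is assumed to be the box payload with the 8-byte size/type header already stripped, and the 1-bit `head_locked_stereo` flag described under Semantics is assumed to occupy the most significant bit of the byte it shares with the 7-bit `ambisonic_type`.

```python
import struct

def parse_sa3d_payload(payload: bytes) -> dict:
    """Parse an SA3D payload (the bytes after the 8-byte box header)."""
    # Big-endian fixed part: version(8), flag+type(8), order(32),
    # channel ordering(8), normalization(8), num_channels(32).
    fixed = struct.Struct(">BBIBBI")
    if len(payload) < fixed.size:
        raise ValueError("SA3D payload too short")
    version, type_byte, order, ordering, normalization, num_channels = (
        fixed.unpack_from(payload, 0)
    )
    if version != 0:
        raise ValueError(f"unsupported SA3D version {version}")
    if len(payload) < fixed.size + 4 * num_channels:
        raise ValueError("SA3D payload truncated inside channel_map")
    channel_map = list(struct.unpack_from(f">{num_channels}I", payload, fixed.size))
    return {
        "version": version,
        # Assumption: the flag is the high bit, the type is the low 7 bits.
        "head_locked_stereo": bool(type_byte & 0x80),
        "ambisonic_type": type_byte & 0x7F,
        "ambisonic_order": order,
        "ambisonic_channel_ordering": ordering,
        "ambisonic_normalization": normalization,
        "num_channels": num_channels,
        "channel_map": channel_map,
    }
```

Validating `version` and the payload length before reading `channel_map` keeps a malformed box from being silently misread.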

	##### Semantics
	- `version` is an 8-bit unsigned integer that specifies the version of this box. Must be set to `0`.

	- `head_locked_stereo` is a 1-bit flag used to indicate that the stored audio track contains head-locked stereo audio in addition to ambisonics audio. The flag should be set if the track contains head-locked stereo and unset otherwise.

	- `ambisonic_type` is a 7-bit unsigned integer that specifies the type of ambisonic audio represented; the following values are defined:

	| `ambisonic_type` | Ambisonic Type Description |
	|:-----------------|:---------------------------|
	| `0` | Periphonic: Indicates that the audio stored is a periphonic ambisonic sound field (i.e., full 3D). |

	- `ambisonic_order` is a 32-bit unsigned integer that specifies the order of the ambisonic sound field. If the `ambisonic_type` is `0` (periphonic), this is a non-negative integer representing the periphonic ambisonic order; in this case, it should take a value of `sqrt(n) - 1`, where `n` is the number of channels in the represented ambisonic audio data. For example, a periphonic ambisonic sound field with `ambisonic_order = 1` requires `(ambisonic_order + 1)^2 = 4` ambisonic components.

	- `ambisonic_channel_ordering` is an 8-bit integer specifying the channel ordering (i.e., spherical harmonics component ordering) used in the represented ambisonic audio data; the following values are defined:

	| `ambisonic_channel_ordering` | Channel Ordering Description |
	|:-----------------------------|:-----------------------------|
	| `0` | ACN: The channel ordering used is the Ambisonic Channel Number (ACN) system. In this, given a spherical harmonic of degree `l` and order `m`, the corresponding ordering index `n` is given by `n = l * (l + 1) + m`. |
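
As an illustration of the ACN formula, the sketch below computes the index for a given degree and order and inverts it again. The function names are hypothetical and chosen only for this example.

```python
import math

def acn_index(l: int, m: int) -> int:
    """ACN index n = l * (l + 1) + m for degree l and order m."""
    if l < 0 or not -l <= m <= l:
        raise ValueError("require l >= 0 and -l <= m <= l")
    return l * (l + 1) + m

def acn_to_degree_order(n: int) -> tuple:
    """Invert the ACN formula: recover (l, m) from index n."""
    if n < 0:
        raise ValueError("ACN index must be non-negative")
    l = math.isqrt(n)           # degree is the integer square root of n
    return l, n - l * (l + 1)   # order follows from n = l * (l + 1) + m

# First-order components land in the order W, Y, Z, X.
for name, (l, m) in zip("WYZX", [(0, 0), (1, -1), (1, 0), (1, 1)]):
    print(name, acn_index(l, m))
```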

	- `ambisonic_normalization` is an 8-bit unsigned integer specifying the normalization (i.e., spherical harmonics normalization) used in the represented ambisonic audio data; the following values are defined:

	| `ambisonic_normalization` | Normalization Description |
	|:--------------------------|:--------------------------|
	| `0` | SN3D: The normalization used is Schmidt semi-normalization (SN3D). In this, the spherical harmonic of degree `l` and order `m` is normalized according to `sqrt((2 - δ(m)) * ((l - m)! / (l + m)!))`, where `δ(m)` is the Kronecker delta function, such that `δ(0) = 1` and `δ(m) = 0` otherwise. |
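
The SN3D factor can be evaluated directly from the expression in the table. This is a small sketch with a hypothetical function name; it takes `abs(m)` internally, mirroring how Appendix 1 calls the normalization as `N(l, abs(m))`.

```python
import math

def sn3d(l: int, m: int) -> float:
    """SN3D factor sqrt((2 - delta(m)) * (l - |m|)! / (l + |m|)!)."""
    am = abs(m)  # the appendix passes abs(m) to the normalization
    if l < 0 or am > l:
        raise ValueError("require l >= 0 and |m| <= l")
    delta = 1 if am == 0 else 0  # Kronecker delta: 1 only when m == 0
    return math.sqrt((2 - delta) * math.factorial(l - am) / math.factorial(l + am))

# All zeroth- and first-order components have a factor of exactly 1.
print([sn3d(l, m) for l, m in [(0, 0), (1, -1), (1, 0), (1, 1)]])
```

From second order onward the factors drop below 1 for `m != 0`; for example, `sn3d(2, 1)` equals `sqrt(1/3)`.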

	- `num_channels` is a 32-bit unsigned integer specifying the number of audio channels contained in the given audio track.

	- `channel_map` is a sequence of 32-bit unsigned integers that maps audio channels in a given audio track to ambisonic components, given the defined `ambisonic_channel_ordering`. The sequence of `channel_map` values should match the channel sequence within the given audio track.

	For the example case of `ambisonic_type = 0` (Periphonic), consider a 4-channel audio track containing ambisonic components W, X, Y, Z at channel indexes `0`, `1`, `2`, `3`, respectively. For `ambisonic_channel_ordering = 0` (ACN), the ordering of components should be W, Y, Z, X, so the `channel_map` sequence should be `0`, `2`, `3`, `1`.

	As a simpler example, for a 4-channel audio track containing ambisonic components W, Y, Z, X at channel indexes `0`, `1`, `2`, `3`, respectively, the `channel_map` sequence should be specified as `0`, `1`, `2`, `3` when `ambisonic_channel_ordering = 0` (ACN).

	For the example case of `ambisonic_type = 0` (Periphonic) with `head_locked_stereo = 1`, the stored audio will consist of `4` ambisonic components W, Y, Z, X in addition to head-locked stereo components L and R. In this case, the `SA3D` box will define `num_channels = 6` and a `channel_map` specified as `0`, `1`, `2`, `3`, `4`, `5`, indicating that the channels are laid out in the file as W, Y, Z, X, L, R. This representation extends to different layouts of ambisonics and head-locked stereo components. For example, a `channel_map` of `4`, `5`, `0`, `1`, `2`, `3` indicates that the layout of the stored audio is L, R, W, Y, Z, X.
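
The channel counts used in these examples follow from the `ambisonic_order` relation and the two extra head-locked stereo channels. The sketch below shows that arithmetic; both function names are hypothetical, and the "plus two" rule is taken from the six-channel example above rather than from a normative statement.

```python
import math

def ambisonic_order_from_channels(n: int) -> int:
    """Return sqrt(n) - 1 for a periphonic field with n ambisonic channels."""
    if n < 1:
        raise ValueError("need at least one ambisonic channel")
    root = math.isqrt(n)
    if root * root != n:
        raise ValueError(f"{n} is not a perfect square, so no integer order fits")
    return root - 1

def expected_num_channels(order: int, head_locked_stereo: bool) -> int:
    """(order + 1)^2 ambisonic channels, plus L and R when head-locked stereo is present."""
    return (order + 1) ** 2 + (2 if head_locked_stereo else 0)

print(ambisonic_order_from_channels(4))   # first-order field
print(expected_num_channels(1, True))     # first order plus head-locked stereo
```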

	##### Example

	Here is an example MP4 box hierarchy for a file containing the `SA3D` box:

	- moov
	  - trak
	    - mdia
	      - minf
	        - stbl
	          - stsd
	            - mp4a
	              - esds
	              - SA3D

	where the `SA3D` box has the following data:

	| Field Name | Value |
	|:-----------|:-----|
	| `version` | `0` |
	| `ambisonic_type` | `0` |
	| `ambisonic_order` | `1` |
	| `ambisonic_channel_ordering` | `0` |
	| `ambisonic_normalization` | `0` |
	| `num_channels` | `4` |
	| `channel_map` | `0` |
	| `channel_map` | `2` |
	| `channel_map` | `3` |
	| `channel_map` | `1` |
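
To make the example concrete, the sketch below serializes these values into a complete box, including the standard 8-byte header (32-bit size followed by the four-character type). The helper `build_sa3d_box` is hypothetical, and it assumes the `head_locked_stereo` flag is stored in the most significant bit of the byte shared with `ambisonic_type`.

```python
import struct

def build_sa3d_box(order, channel_map, head_locked_stereo=False,
                   ambisonic_type=0, ordering=0, normalization=0):
    """Serialize an SA3D box (header plus payload) as big-endian bytes."""
    # Assumption: flag in the high bit, 7-bit ambisonic_type in the low bits.
    type_byte = (0x80 if head_locked_stereo else 0) | (ambisonic_type & 0x7F)
    payload = struct.pack(">BBIBBI", 0, type_byte, order, ordering,
                          normalization, len(channel_map))
    payload += struct.pack(f">{len(channel_map)}I", *channel_map)
    # Box header: total size (including the header itself), then the type.
    return struct.pack(">I4s", 8 + len(payload), b"SA3D") + payload

box = build_sa3d_box(order=1, channel_map=[0, 2, 3, 1])
print(len(box))   # 8-byte header + 12 fixed bytes + 4 * 4 channel_map bytes
print(box.hex())
```

For the table above this yields a 36-byte box.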

	------------------------------------------------------

	## Appendix 1 - Ambisonics
	The traditional notion of ambisonics is used, where the sound field is represented by spherical harmonics coefficients using the associated Legendre polynomials (without Condon-Shortley phase) as the basis functions. Thus, the spherical harmonic of degree `l` and order `m` at elevation `E` and azimuth `A` is given by:

	`N(l, abs(m)) * P(l, abs(m), sin(E)) * T(m, A)`

	where:
	- `N(l, m)` is the spherical harmonics normalization function used.
	- `P(l, m, x)` is the (unnormalized) associated Legendre polynomial, without Condon-Shortley phase, of degree `l` and order `m` evaluated at `x`.
	- `T(m, x)` is `sin(-m * x)` for `m < 0` and `cos(m * x)` otherwise.
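
The definition above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: all function names are hypothetical, `N` is taken to be SN3D (`ambisonic_normalization = 0`), and the associated Legendre polynomial is computed with the standard upward recurrence in `l`, with the Condon-Shortley phase omitted.

```python
import math

def sn3d(l, m):
    """N(l, m) for non-negative m, using SN3D."""
    delta = 1 if m == 0 else 0
    return math.sqrt((2 - delta) * math.factorial(l - m) / math.factorial(l + m))

def legendre(l, m, x):
    """Unnormalized associated Legendre P(l, m, x), 0 <= m <= l, no Condon-Shortley phase."""
    # Seed: P(m, m, x) = (2m - 1)!! * (1 - x^2)^(m / 2)
    pmm = 1.0
    root = math.sqrt(max(0.0, 1.0 - x * x))
    for k in range(1, m + 1):
        pmm *= (2 * k - 1) * root
    if l == m:
        return pmm
    # P(m + 1, m, x) = x * (2m + 1) * P(m, m, x), then recur upward in degree.
    prev, cur = pmm, x * (2 * m + 1) * pmm
    for k in range(m + 2, l + 1):
        prev, cur = cur, ((2 * k - 1) * x * cur - (k + m - 1) * prev) / (k - m)
    return cur

def trig(m, a):
    """T(m, a): sin(-m * a) for m < 0, cos(m * a) otherwise."""
    return math.sin(-m * a) if m < 0 else math.cos(m * a)

def spherical_harmonic(l, m, elevation, azimuth):
    """N(l, |m|) * P(l, |m|, sin(E)) * T(m, A), angles in radians."""
    am = abs(m)
    return sn3d(l, am) * legendre(l, am, math.sin(elevation)) * trig(m, azimuth)

# First-order components in ACN order (W, Y, Z, X) for a source straight ahead.
front = [spherical_harmonic(l, m, 0.0, 0.0) for l, m in [(0, 0), (1, -1), (1, 0), (1, 1)]]
print(front)
```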

	### Conventions
	#### Azimuth
	- `A = 0`: The source is in front of the listener.
	- `A` in `(0, pi/2)`: The source is in the forward-left quadrant.
	- `A` in `(pi/2, pi)`: The source is in the back-left quadrant.
	- `A` in `(-pi/2, 0)`: The source is in the forward-right quadrant.
	- `A` in `(-pi, -pi/2)`: The source is in the back-right quadrant.

	#### Elevation
	- `E = 0`: The source is in the horizontal plane.
	- `E` in `(0, pi/2]`: The source is above the listener.
	- `E` in `[-pi/2, 0)`: The source is below the listener.
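
These sign conventions can be checked against first-order gains. The closed forms below are derived from the Appendix 1 formula with SN3D, where every first-order normalization factor equals 1; the function name is hypothetical and the sketch covers first order only.

```python
import math

def first_order_gains(azimuth: float, elevation: float) -> tuple:
    """Return (W, Y, Z, X) gains for a source direction given in radians."""
    w = 1.0
    y = math.cos(elevation) * math.sin(azimuth)  # positive for sources to the left
    z = math.sin(elevation)                      # positive for sources above
    x = math.cos(elevation) * math.cos(azimuth)  # positive for sources in front
    return w, y, z, x

# A source in the forward-left quadrant, slightly above the horizontal plane.
w, y, z, x = first_order_gains(math.pi / 4, math.pi / 6)
print(y > 0, z > 0, x > 0)
```

A source in the back-right quadrant below the listener gives negative `Y`, `Z`, and `X` under the same conventions.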