# VR180 Video Format

# 1. Introduction

VR180 cameras are a new category of VR camera that use two wide-angle cameras to
capture the world as you see it, with point-and-shoot simplicity. This document
describes the video format output by these devices. The format was chosen with
the following considerations in mind:

*   **FOV**: VR180 cameras capture sub-360 FOV rather than full 360. It is
    important to retain the original pixel density of the camera sensors in
    order to provide high pixel density for VR viewing.
*   **Projection**: Different versions of VR180 cameras may have different lenses
    and different camera projections. As such, the file format should be
    camera-independent.
*   **Motion**: The cameras can often be non-stationary due to unintentional
    shakes or intentional motion, for example, handheld capture of events or
    people. To avoid motion sickness, camera motion metadata should be saved for
    stabilized playback.
*   **Playback**: The file format should be easy to play back locally so that
    manufacturers can easily build their own apps. Android and iOS should have
    an easy way to play the raw video.

VR180 videos contain two types of metadata to jointly define the projection from
video frames to their partial viewports within a spherical coordinate system.

1.  **A global static projection** that defines the mapping from pixels to a
    local spherical coordinate system, typically covering only a sub-180° portion
    of the sphere. The [Spherical Metadata V2
    Spec](https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
    is adopted here to encode this global metadata. (See details in [section
    2](#2-mesh-projection)).
2.  **A dynamic orientation stream** that defines the rotation between the local
    coordinate system of each frame and the world coordinate system. A new
    [Camera Motion Metadata
    track](https://developers.google.com/streetview/publish/camm-spec) is
    created for encoding such per-frame metadata. (See [section
    3](#3-camera-motion-metadata)).

# 2. Mesh Projection

The [Spherical Metadata V2
Spec](https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
should be present in the file to define the static global projection of
individual frames to their local spherical coordinate system. Among the
projection types allowed by Spherical Metadata V2, the VR180 video format
requires the mesh projection, which is the most generic and supports fisheye
projections.

|<img src="equirect.jpg" width="500"> | <img src="fisheye.jpg" width="900">
:-------------------------------------------------------: | :---------------------:
(a) 360 equirectangular                                   | (b) fisheye mesh projection

Figure 1. Example of a video frame in the typical equirectangular format (a) and
in the fisheye mesh format (b).

By using the mesh projection type, the cameras can save the raw pixels in
side-by-side or over-under format in the video, and let the projection meshes
define the back-projection from pixels to 3D directions. This not only preserves
the pixel density of the camera sensors, but also saves production cost and
power consumption by avoiding expensive reprojection computation. To render such
videos, player clients simply need to draw the saved per-eye mesh with its
corresponding image as texture (see the sketch after the list below). To be
specific, in VR180:

*   Dual stereo mesh: video files contain two meshes, one mesh for each eye.
*   Fisheye projection: geometry-wise, the video frames are simple
    concatenations of the left and right views, with possible crop and rescale,
    but no other type of warping (e.g. de-fisheye).
*   Stereo mode: for better compatibility with video streaming services that are
    optimized for 16:9, landscape LEFT-RIGHT is preferred over portrait
    TOP-BOTTOM.
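
Below is a minimal C++ sketch of this rendering setup (an assumption about one
possible player implementation, not part of the spec): the per-eye mesh UVs,
which are defined over a single eye's sub-image, are remapped onto the
corresponding half of a LEFT-RIGHT frame before the mesh is drawn as a textured
triangle list.

```cpp
#include <vector>

// Hypothetical vertex layout for a projection mesh: ray direction plus UV.
struct MeshVertex {
  float x, y, z;  // Unit ray direction in OpenGL coordinates.
  float u, v;     // Texture coordinate over the eye's sub-image, in [0, 1].
};

// Remap per-eye UVs to the corresponding half of a side-by-side (LEFT-RIGHT) frame.
// eye: 0 for the left view, 1 for the right view.
void RemapUvForLeftRightFrame(int eye, std::vector<MeshVertex>* mesh) {
  for (MeshVertex& vertex : *mesh) {
    vertex.u = 0.5f * vertex.u + 0.5f * static_cast<float>(eye);
    // V is unchanged for LEFT-RIGHT packing.
  }
}
```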

## Mesh Generation

Once the cameras are calibrated, the mesh vertices can be generated by
straightforward back-projection of a grid of coordinates that covers the valid
image portion (inside the 180° image circle). Refer to the
[appendix](#appendix-mesh-generation-demo) for complete MATLAB demo code that
produces a full mesh for a fisheye camera. Below is pseudocode for computing a
single mesh vertex.

```matlab
% Returns the mesh vertex for an image point image_x, image_y for an eye.
% (width, height) : the size of the image of an eye.
% (image_x, image_y): image coordinate where (0, 0) and (width, height) are top-left
%                     and bottom-right corner respectively.
% eye_camera : the calibrated camera for an eye (left or right)
function [x, y, z, u, v] = GetMeshVertex(width, height, image_x, image_y, eye_camera)
  % Unit ray direction corresponding to the pixel.
  [x, y, z] = PixelToRay(eye_camera, image_x, image_y);
  % Negate Y and Z IF the camera parameterization follows the standard Computer Vision
  % convention where Y points down and Z points forward. This accounts for the
  % difference from the OpenGL coordinate system.
  [x, y, z] = [x, -y, -z];
  % Normalized OpenGL coordinate for the pixel, where the V coordinate needs to be flipped.
  u = image_x / width;
  v = (height - image_y) / height;
end
```

*   Although the video frame is a concatenation of left and right eye images
    (LEFT-RIGHT or OVER-UNDER), the mesh for each eye should be generated as if
    they are separate images.
*   A coarse mesh is preferred over a full-resolution mesh. Downsampled meshes
    work well as long as the resolution is reasonable, and they are more
    efficient for playback. A typical mesh resolution is a 40x40 grid.

# 3. Camera Motion Metadata

Camera rotations during video capture, expressed in a world coordinate system,
can be embedded as video metadata. This metadata is particularly important for
hand-held VR video:

*   By using camera rotation metadata, the player can render the video frames at
    the exact orientation at which they were captured (see the sketch after
    Figure 2). Compensating for the camera rotation essentially keeps the
    distant background static. Our experiments have shown that such stabilized
    viewing significantly reduces motion sickness in VR.
*   It is important to have high-quality rotation data (including a correct
    gravity vector); otherwise playback can cause motion sickness or be
    disorienting. This essentially requires a well-calibrated IMU along with
    on-device sensor fusion.

|<img src="motion1.jpg" width="300"> | <img src="motion2.jpg" width="300"> | <img src="motion3.jpg" width="300">|
:-: | :--:| :--:

Figure 2. Three equirectangular stereo views generated according to their
rotations.
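
As a hypothetical illustration of this compensation (a sketch, not the required
implementation), the C++ snippet below rotates each mesh vertex by the frame's
camera-to-world rotation so the frame is rendered at the orientation it was
captured; a real player would typically fold this rotation into its model-view
matrix instead.

```cpp
#include <vector>
#include <Eigen/Geometry>

// Hypothetical vertex type holding the unit ray direction of a mesh vertex.
struct RayVertex {
  Eigen::Vector3f direction;  // Ray in the local (camera) coordinate system.
  float u, v;                 // Texture coordinate.
};

// Apply the per-frame camera-to-world rotation M so that a local ray X is drawn
// along the world direction M * X, keeping the distant background static.
void StabilizeFrameMesh(const Eigen::Matrix3f& camera_to_world,
                        std::vector<RayVertex>* mesh) {
  for (RayVertex& vertex : *mesh) {
    vertex.direction = camera_to_world * vertex.direction;
  }
}
```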

## Camera Motion Metadata Track

We have created a new [Camera Motion Metadata
Track](https://developers.google.com/streetview/publish/camm-spec) for storing
various kinds of camera motion metadata, including camera orientation, gyroscope
readings, accelerometer readings, etc. The custom metadata track can be
identified by the new Camera Motion Metadata (camm) Sample Entry box.

For VR180 cameras, each video contains such a metadata track to store camera
rotation data. Each data sample in the metadata track is represented as a
bitstream in the following format:

<table>
  <tr>
    <td> Fields:           </td>
    <td>Description: </td>
  </tr>
  <tr>
    <td> int32 reserved;     </td>
    <td>Should be 0.</td>
  </tr>
  <tr>
    <td> float32 angle_axis[3];</td>
    <td>Angle-axis orientation in radians representing the rotation from the camera coordinate system to the world coordinate system.<br>
        Let M be the 3x3 rotation matrix corresponding to the angle-axis vector. For any ray X in the local coordinate system, <br>
      the ray direction in world coordinates is M * X.<br>
<br>
Such orientation information can be obtained by running 3DoF sensor fusion on the device. After integrating the IMU readings, <br>
      only the integrated global orientation needs to be recorded. <br>
          <br>
      Below is example C++ code for converting from a rotation matrix to the expected angle-axis vector using <a href="https://eigen.tuxfamily.org/dox/">Eigen3</a>.
<pre lang="cpp">Eigen::Matrix3f M = get_current_rotation_matrix();
Eigen::AngleAxisf aa(M);
Eigen::Vector3f angle_axis = aa.angle() * aa.axis(); </pre>
       </td>
  </tr>
</table>

*   The coordinate systems are right-handed. The camera coordinate system is
    defined with X pointing right, Y pointing downward, and Z pointing forward.
    The Y-axis of the global coordinate system should point down, along the
    gravity vector.

![Camera and world coordinate systems](coordinate.png)

*   IMU readings are typically given in the IMU's own coordinate system, and a
    rotation is needed to map them to the camera coordinate system if the two
    coordinate systems differ.
*   To have a consistent viewing experience, we recommend resetting the yaw
    angle for each new video recording, such that the orientation of the first
    frame has a yaw angle of 0.
*   All fields are little-endian (least significant byte first), and the 32-bit
    floating-point values use the IEEE 754-1985 format. The video recorder should
    maintain a struct of these fields in memory and copy the raw bytes into the
    metadata packets (a minimal sketch follows this list).
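
As a minimal sketch (an assumption about one possible recorder-side
implementation, not mandated by this spec), the snippet below defines a packed
struct matching the sample layout above, fills it from the current
camera-to-world rotation using Eigen3, and copies the raw bytes into a metadata
packet. On little-endian hosts the struct bytes can be copied directly.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include <Eigen/Geometry>

// Packed layout of one camera orientation sample (16 bytes on a little-endian host).
#pragma pack(push, 1)
struct CameraOrientationSample {
  int32_t reserved;      // Should be 0.
  float angle_axis[3];   // Rotation from camera to world coordinates, in radians.
};
#pragma pack(pop)
static_assert(sizeof(CameraOrientationSample) == 16, "unexpected struct padding");

// Build the raw payload of one metadata sample from the sensor-fusion output.
std::vector<uint8_t> MakeOrientationSample(const Eigen::Matrix3f& camera_to_world) {
  Eigen::AngleAxisf aa(camera_to_world);
  const Eigen::Vector3f angle_axis = aa.angle() * aa.axis();
  CameraOrientationSample sample = {0, {angle_axis.x(), angle_axis.y(), angle_axis.z()}};
  std::vector<uint8_t> payload(sizeof(sample));
  std::memcpy(payload.data(), &sample, sizeof(sample));
  return payload;
}
```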

Synchronization between metadata and video frames:

*   Video track and metadata track are synchronized by the presentation
    timestamp of the video and metadata samples.
*   Given the camera orientations for a discrete set of metadata presentation
    times, the continuous orientation at any given time is defined by linear
    interpolation of the neighboring camera orientations. When rendering a video
    frame, the player should obtain the frame rotation by interpolating at the
    presentation time of the video frame (see the sketch after this list).
*   The typical presentation time for a video frame is the start of the frame
    exposure, which does not take exposure time or rolling shutter into account.
    When the per-frame exposure time and rolling-shutter skew are known, better
    correspondence can be achieved by adjusting the presentation time of the
    video frames to the middle of the frame exposure duration:
    exposure_start_of_first_row + (pixel_exposure_time + rolling_shutter_skew)
    / 2.
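
Below is a sketch of this interpolation step (assumptions, not part of the spec:
metadata samples are stored sorted by presentation time in microseconds,
orientations are kept as quaternions, and quaternion slerp stands in for the
linear interpolation between neighboring orientations).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>
#include <Eigen/Geometry>

// One decoded metadata sample: presentation time plus camera-to-world rotation.
struct OrientationSample {
  int64_t time_us;
  Eigen::Quaternionf camera_to_world;
};

// Return the interpolated camera orientation at a video frame's presentation time.
// Assumes `samples` is non-empty and sorted by time_us.
Eigen::Quaternionf OrientationAtTime(const std::vector<OrientationSample>& samples,
                                     int64_t frame_time_us) {
  // Find the first metadata sample at or after the frame time.
  auto upper = std::lower_bound(
      samples.begin(), samples.end(), frame_time_us,
      [](const OrientationSample& sample, int64_t t) { return sample.time_us < t; });
  if (upper == samples.begin()) return samples.front().camera_to_world;
  if (upper == samples.end()) return samples.back().camera_to_world;
  const OrientationSample& next = *upper;
  const OrientationSample& prev = *(upper - 1);
  const float alpha = static_cast<float>(frame_time_us - prev.time_us) /
                      static_cast<float>(next.time_us - prev.time_us);
  // Interpolate between the two neighboring orientations.
  return prev.camera_to_world.slerp(alpha, next.camera_to_world);
}
```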

# 4. Identifying VR180 Videos

Below is an example box structure of a VR180 video:

```
[moov]
  [trak]                   // video track
    [mdia]
      [minf]
        [stbl]
          [stsd]
            [avc1]
              [st3d]      // spherical metadata v2
              [sv3d]      // spherical metadata v2
               ...
  [trak]                  // audio track
         ...
  [trak]                  // camera motion data track
    [mdia]
      [hdlr]              // handler = 'meta'
      [minf]
        [stbl]
          [stsd]
             [camm]       // camera motion sample entry
```

VR180 videos can be identified for custom processing or playback by the
existence and content of the Spherical Video Metadata V2 boxes. Optionally, the
camera motion metadata track provides the stabilization data that aligns the
video frames with a fixed world orientation.
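
As a rough, hypothetical sketch (not a full MP4 parser, and only an assumption
about how a tool might do a quick check), the C++ snippet below simply searches
the raw bytes of the moov box for the sv3d and camm four-character codes; a
robust implementation should walk the box tree along the paths shown above.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Return true if the four-character code appears anywhere in the given bytes.
static bool ContainsFourCc(const std::vector<uint8_t>& data, const std::string& fourcc) {
  return std::search(data.begin(), data.end(), fourcc.begin(), fourcc.end()) !=
         data.end();
}

// moov_bytes: the raw contents of the top-level moov box of the video file.
bool LooksLikeVr180(const std::vector<uint8_t>& moov_bytes, bool* has_motion_track) {
  // Spherical Video Metadata V2 with a mesh projection is the required signal.
  const bool has_spherical_v2 = ContainsFourCc(moov_bytes, "sv3d");
  // The camera motion metadata track is optional and enables stabilized playback.
  *has_motion_track = ContainsFourCc(moov_bytes, "camm");
  return has_spherical_v2;
}
```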

# Appendix - Mesh Generation Demo

```matlab
% Demo code for mesh generation from a fisheye camera. The mesh vertices and
% triangle indices are generated according to the definition of ProjectionMesh
% in Spherical Video V2 (
% https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
%
% Note: for stereo images that are composed of two sub-images for the left and
% right eyes, the meshes should be generated from the individual cameras that
% describe the sub-images of each eye as if they were separate images.
%
% Please note that Computer Vision typically uses a coordinate system in which X
% points right, Y points downward, and Z points forward, which has negated Y and
% Z compared to OpenGL. To generate a mesh from such a camera, the Y and Z
% coordinates need to be negated, and the texture coordinate V needs to be
% flipped similarly.
%
function spherical_mesh_demo()
  % Example fisheye camera for the demo.
  fisheye_camera = demo_camera();

  % Mesh resolution.
  grid_size_x = 40;
  grid_size_y = 40;

  % Generate the vertices and triangle indices from the camera.
  [vertices, tri] = generate_mesh(fisheye_camera, grid_size_x, grid_size_y);

  % Plot the UV triangulation in the image space.
  figure(1);
  u = reshape([vertices(:).u], grid_size_y, grid_size_x);
  v = reshape([vertices(:).v], grid_size_y, grid_size_x);
  trimesh(tri, u * fisheye_camera.image_size(1), ...
          v * fisheye_camera.image_size(2));
  set(gca, 'xlim', [0, fisheye_camera.image_size(1)],...
      'ylim', [0, fisheye_camera.image_size(2)]);
  axis equal;

  % Plot the mesh in 3D.
  figure(2);
  x = reshape([vertices(:).x], grid_size_y, grid_size_x);
  y = reshape([vertices(:).y], grid_size_y, grid_size_x);
  z = reshape([vertices(:).z], grid_size_y, grid_size_x);
  trimesh(tri, x, y, z);
  axis equal;
end


% Generate the mesh vertices and triangle indices using a grid in the
% intersection of 180 image circle and the image rectangle.
function [vertices, tri] = generate_mesh(fisheye_camera, grid_size_x,...
                                         grid_size_y)
  % Struct array for the mesh vertices.
  vertices = struct('u', {}, 'v', {}, 'x', {}, 'y', {}, 'z', {});
  % Triangle indices: one row of three vertex indices per triangle.
  tri = zeros(0, 3);

  % The radii along the x-axis and y-axis, assuming an elliptical image circle.
  radius = image_circle(fisheye_camera);

  % The vertical boundary of the image circle/ellipse.
  ymin = max(0, fisheye_camera.principal_point(2) - radius(2));
  ymax = min(fisheye_camera.image_size(2),...
             fisheye_camera.principal_point(2) + radius(2));

  for i = 1 : grid_size_y;
    % Y coordinate in the image.
    yi = ymin + (i - 1) * (ymax - ymin) / (grid_size_y - 1);
    % Y coordinate relative to image center.
    yc = yi - fisheye_camera.principal_point(2);

    % Horizontal boundary on the image circle along the given y coordinate.
    rx = radius(1) * sqrt(1 - yc^2 / (radius(2)^2));
    xmin = max(0, fisheye_camera.principal_point(1) - rx);
    xmax = min(fisheye_camera.image_size(1), ...
               fisheye_camera.principal_point(1) + rx);

    % Generate evenly spaced vertices along the horizontal line.
    for j = 1 : grid_size_x;
      % X coordinate
      xj = xmin + (j - 1) * (xmax - xmin) / (grid_size_x - 1);
      point = pixel_to_ray(fisheye_camera, xj, yi);

      % X, Y, Z, U, V for each vertex. To account for the difference between
      % the usual Computer Vision coordinate system and the OpenGL coordinate
      % system, the Y and Z coordinates of the point need to be negated, and V
      % needs to be flipped. If you are already using an OpenGL-like coordinate
      % system, this is not needed.
      vertices(i, j).x = point(1);
      vertices(i, j).y = - point(2);
      vertices(i, j).z = - point(3);
      vertices(i, j).u = xj / fisheye_camera.image_size(1);
      vertices(i, j).v = 1 - yi / fisheye_camera.image_size(2);
    end
  end

  % Generate triangle indices for the mesh.
  for j = 0 : grid_size_x - 2;
    for i = 0 : grid_size_y - 2;
      % Split the quad (i, i + 1) x (j, j + 1) into two triangles:
      tri(end + 1, :) = [grid_size_y * j + i + 1,...
                         grid_size_y * (j + 1) + i + 1,...
                         grid_size_y * j + i + 2];
      tri(end + 1, :) = [grid_size_y * j + i + 2,...
                         grid_size_y * (j + 1) + i + 1,...
                         grid_size_y * (j + 1) + i + 2];
     end
  end
end

% The example camera uses a typical Computer Vision fisheye camera model, which
% projects a 3D point in the world coordinate system as follows:
%
% 1. Transform world_point to camera coordinate system:
%       camera_point = ...
%          camera.world_to_camera_rotation * (world_point - camera.position);
% 2. Fisheye mapping.
%      theta = atan2(norm(camera_point(1:2)), camera_point(3));
% 3. Radial distortion factors
%      d = camera.radial_distortion
%      normalized_r = theta + d(1) * theta^3 + d(2) * theta^5 + d(3) * theta^7.
%      normalized_x = normalized_r * camera_point(1) / norm(camera_point(1:2));
%      normalized_y = normalized_r * camera_point(2) / norm(camera_point(1:2));
% 4. Map the normalized coordinate to pixels
%     x = camera.focal_length * normalized_x + camera.principal_point(1)
%     y = camera.focal_length * camera.pixel_aspect_ratio * normalized_y ...
%         + camera.principal_point(2);
function fisheye_camera = demo_camera()
  fisheye_camera = struct('image_size', [2160, 2160], ...
                          'principal_point', [1080, 1080], ...
                          'pixel_aspect_ratio', 1.2, ...
                          'focal_length', 828, ...
                          'radial_distortion', [-0.032, -0.00243, 0.001],...
                          'world_to_camera_rotation', eye(3), ...
                          'position', [0, 0, 0]);
end

% Map a pixel (x, y) to the ray direction in the world coordinate.
%
% Note this needs to be modified for cameras with different parametrization.
function point = pixel_to_ray(fisheye_camera, x, y)
  % Normalized Y coordinate.
  yn = (y - fisheye_camera.principal_point(2))/ fisheye_camera.focal_length ...
       / fisheye_camera.pixel_aspect_ratio;

  % Normalized X coordinate.
  xn = (x - fisheye_camera.principal_point(1)) / fisheye_camera.focal_length;

  % Normalized distance to image center.
  rn = sqrt(xn * xn + yn * yn);

  % Solve for the angle theta between the viewing ray and the optical axis
  % that satisfies:
  %   rn = theta + theta^3 * d(1) + theta^5 * d(2) + theta^7 * d(3);
  % The example uses just 3 distortion parameters, but it can easily be extended to more.
  d = fisheye_camera.radial_distortion;
  theta = roots([d(3), 0, d(2), 0, d(1), 0, 1.0, -rn]);

  % Take the smallest positive real solution.
  theta = min(theta(find(imag(theta) == 0 & real(theta) > 0)));

  if isempty(theta) || rn == 0
    % Degenerate case at the exact image center.
    point = [0; 0; 1];
  else
    % Generate the point in the camera coordinate system.
    point = [sin(theta) * xn / rn; sin(theta) * yn / rn; cos(theta)];
  end

  % Apply the inverse rotation to transform it from camera to world.
  point = fisheye_camera.world_to_camera_rotation' * point;
end

% Calculate X- and Y-radius of the image circle/ellipse for a fisheye camera.
%
% Note the image circle logic needs to be modified for cameras with different
% parametrization, for example, having non-zero skew.
function radius = image_circle(fisheye_camera)
  % Half of the desired image circle FOV. Note the maximum image circle should
  % be at most 180 degrees, but it is OK to make it smaller to avoid peripheral
  % regions with poor quality.
  theta = pi / 2;
  % Normalized distance to the image center.
  d = fisheye_camera.radial_distortion;
  normalized_r = theta + theta^3 * d(1) + theta^5 * d(2) + theta^7 * d(3);
  % Radius along X and Y axes.
  radius = normalized_r * fisheye_camera.focal_length...
           * [1, fisheye_camera.pixel_aspect_ratio];
end
```