# VR180 Video Format

# 1. Introduction

VR180 cameras are a new category of VR camera that use two wide-angle cameras to
capture the world as you see it, with point-and-shoot simplicity. This document
describes the video format output by these devices. The format was chosen with
the following considerations in mind:

*   **FOV**: VR180 cameras capture sub-360 FOV rather than full 360. It is
    important to retain the original pixel density of the camera sensors in
    order to provide high pixel density for VR viewing.
*   **Projection**: Different versions of VR180 cameras may have different lenses
    and different camera projections. As such, the file format should be
    camera-independent.
*   **Motion**: The cameras can often be non-stationary due to unintentional
    shakes or intentional motion, for example, handheld capture of events or
    people. To avoid motion sickness, camera motion metadata should be saved for
    stabilized playback.
*   **Playback**: The file format should be easy to play back locally so that
    manufacturers can easily build their own apps. Android and iOS should have
    an easy way to play the raw video.

VR180 videos contain two types of metadata to jointly define the projection from
video frames to their partial viewports within a spherical coordinate system.

1.  **A global static projection** that defines the mapping from pixels to a
    local spherical coordinate system, typically covering only a sub-180° portion
    of the sphere. The [Spherical Metadata V2
    Spec](https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
    is adopted here to encode this global metadata. (See details in [section
    2](#2-mesh-projection)).
2.  **A dynamic orientation stream** that defines the rotation between the local
    coordinate system of each frame and the world coordinate system. A new
    [Camera Motion Metadata
    track](https://developers.google.com/streetview/publish/camm-spec) is
    created for encoding such per-frame metadata. (See [section
    3](#3-camera-motion-metadata)).

# 2. Mesh Projection

The [Spherical Metadata V2
Spec](https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
should be present in the file to define the static global projection of
individual frames to their local spherical coordinate system. Among the
projection types allowed by Spherical Metadata V2, the VR180 video format
requires the mesh projection, which is the most generic and supports fisheye
projections.

|<img src="equirect.jpg" width="500"> | <img src="fisheye.jpg" width="900">
:-------------------------------------------------------: | :---------------------:
(a) 360 equirectangular                                   | (b) fisheye mesh projection

Figure 1. Example of a video frame in the typical equirectangular format (a) and
in the fisheye mesh format (b).

By using the mesh projection type, the cameras can save the raw pixels in
side-by-side or over-under format in the video, and let the projection meshes
define the back-projection from pixels to 3D directions. This not only preserves
the pixel density of the camera sensors, but also saves production cost and
power consumption by avoiding expensive reprojection computation. To render such
videos, player clients simply need to draw the saved per-eye mesh with its
corresponding image as texture (see the sketch after the list below). To be
specific, in VR180:

*   Dual stereo mesh: video files contain two meshes, one mesh for each eye.
*   Fisheye projection: geometry-wise, the video frames are simple
    concatenations of the left and right views, with possible crop and rescale,
    but no other type of warping (e.g. de-fisheye).
*   Stereo mode: for better compatibility with video streaming services that are
    optimized for 16:9, landscape LEFT-RIGHT is preferred over portrait
    TOP-BOTTOM.
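
Below is a minimal C++ sketch of this rendering setup (an assumption about one
possible player implementation, not part of the spec): the per-eye mesh UVs,
which are defined over a single eye's sub-image, are remapped onto the
corresponding half of a LEFT-RIGHT frame before the mesh is drawn as a textured
triangle list.

```cpp
#include <vector>

// Hypothetical vertex layout for a projection mesh: ray direction plus UV.
struct MeshVertex {
  float x, y, z;  // Unit ray direction in OpenGL coordinates.
  float u, v;     // Texture coordinate over the eye's sub-image, in [0, 1].
};

// Remap per-eye UVs to the corresponding half of a side-by-side (LEFT-RIGHT) frame.
// eye: 0 for the left view, 1 for the right view.
void RemapUvForLeftRightFrame(int eye, std::vector<MeshVertex>* mesh) {
  for (MeshVertex& vertex : *mesh) {
    vertex.u = 0.5f * vertex.u + 0.5f * static_cast<float>(eye);
    // V is unchanged for LEFT-RIGHT packing.
  }
}
```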

## Mesh Generation

Once the cameras are calibrated, the mesh vertices can be generated by
straightforward back-projection of a grid of coordinates that covers the valid
image portion (inside the 180° image circle). Refer to the
[appendix](#appendix-mesh-generation-demo) for complete MATLAB demo code that
produces a full mesh for a fisheye camera. Below is pseudocode for computing a
single mesh vertex.

```matlab
% Returns the mesh vertex for an image point image_x, image_y for an eye.
% (width, height) : the size of the image of an eye.
% (image_x, image_y): image coordinate where (0, 0) and (width, height) are top-left
%                     and bottom-right corner respectively.
% eye_camera : the calibrated camera for an eye (left or right)
function [x, y, z, u, v] = GetMeshVertex(width, height, image_x, image_y, eye_camera)
  % Unit ray direction corresponding to the pixel.
  [x, y, z] = PixelToRay(eye_camera, image_x, image_y);
  % Negate Y and Z IF the camera parameterization follows the standard Computer Vision
  % convention where Y points down and Z points forward. This accounts for the
  % difference from the OpenGL coordinate system.
  [x, y, z] = [x, -y, -z];
  % Normalized OpenGL coordinate for the pixel, where the V coordinate needs to be flipped.
  u = image_x / width;
  v = (height - image_y) / height;
end
```

*   Although the video frame is a concatenation of left and right eye images
    (LEFT-RIGHT or OVER-UNDER), the mesh for each eye should be generated as if
    they are separate images.
*   A coarse mesh is preferred over a full-resolution mesh. Downsampled meshes
    work well as long as the resolution is reasonable, and they are more
    efficient for playback. A typical mesh resolution is a 40x40 grid.

# 3. Camera Motion Metadata

Camera rotations during video capture, expressed in a world coordinate system,
can be embedded as video metadata. This metadata is particularly important for
hand-held VR video:

*   By using camera rotation metadata, the player can render the video frames at
    the exact orientation at which they were captured (see the sketch after
    Figure 2). Compensating for the camera rotation essentially keeps the
    distant background static. Our experiments have shown that such stabilized
    viewing significantly reduces motion sickness in VR.
*   It is important to have high-quality rotation data (including a correct
    gravity vector); otherwise playback can cause motion sickness or be
    disorienting. This essentially requires a well-calibrated IMU along with
    on-device sensor fusion.

|<img src="motion1.jpg" width="300"> | <img src="motion2.jpg" width="300"> | <img src="motion3.jpg" width="300">|
:-: | :--:| :--:

Figure 2. Three equirectangular stereo views generated according to their
rotations.
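
As a hypothetical illustration of this compensation (a sketch, not the required
implementation), the C++ snippet below rotates each mesh vertex by the frame's
camera-to-world rotation so the frame is rendered at the orientation it was
captured; a real player would typically fold this rotation into its model-view
matrix instead.

```cpp
#include <vector>
#include <Eigen/Geometry>

// Hypothetical vertex type holding the unit ray direction of a mesh vertex.
struct RayVertex {
  Eigen::Vector3f direction;  // Ray in the local (camera) coordinate system.
  float u, v;                 // Texture coordinate.
};

// Apply the per-frame camera-to-world rotation M so that a local ray X is drawn
// along the world direction M * X, keeping the distant background static.
void StabilizeFrameMesh(const Eigen::Matrix3f& camera_to_world,
                        std::vector<RayVertex>* mesh) {
  for (RayVertex& vertex : *mesh) {
    vertex.direction = camera_to_world * vertex.direction;
  }
}
```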

## Camera Motion Metadata Track

We have created a new [Camera Motion Metadata
Track](https://developers.google.com/streetview/publish/camm-spec) for storing
various kinds of camera motion metadata, including camera orientation, gyroscope
readings, accelerometer readings, etc. The custom metadata track can be
identified by the new Camera Motion Metadata (camm) Sample Entry box.

For VR180 cameras, each video contains such a metadata track to store camera
rotation data. Each data sample in the metadata track is represented as a
bitstream in the following format:

<table>
  <tr>
    <td> Fields:           </td>
    <td>Description: </td>
  </tr>
  <tr>
    <td> int32 reserved;     </td>
    <td>Should be 0.</td>
  </tr>
  <tr>
    <td> float32 angle_axis[3];</td>
    <td>Angle-axis orientation in radians representing the rotation from the camera coordinate system to the world coordinate system.<br>
        Let M be the 3x3 rotation matrix corresponding to the angle-axis vector. For any ray X in the local coordinate system, <br>
      the ray direction in world coordinates is M * X.<br>
<br>
Such orientation information can be obtained by running 3DoF sensor fusion on the device. After integrating the IMU readings, <br>
      only the integrated global orientation needs to be recorded. <br>
          <br>
      Below is example C++ code for converting from a rotation matrix to the expected angle-axis vector using <a href="https://eigen.tuxfamily.org/dox/">Eigen3</a>.
<pre lang="cpp">Eigen::Matrix3f M = get_current_rotation_matrix();
Eigen::AngleAxisf aa(M);
Eigen::Vector3f angle_axis = aa.angle() * aa.axis(); </pre>
       </td>
  </tr>
</table>

*   The coordinate systems are right-handed. The camera coordinate system is
    defined with X pointing right, Y pointing downward, and Z pointing forward.
    The Y-axis of the global coordinate system should point down, along the
    gravity vector.

![Camera and world coordinate systems](coordinate.png)

*   IMU readings are typically given in the IMU's own coordinate system, and a
    rotation is needed to map them to the camera coordinate system if the two
    coordinate systems differ.
*   To have a consistent viewing experience, we recommend resetting the yaw
    angle for each new video recording, such that the orientation of the first
    frame has a yaw angle of 0.
*   All fields are little-endian (least significant byte first), and the 32-bit
    floating-point values use the IEEE 754-1985 format. The video recorder should
    maintain a struct of these fields in memory and copy the raw bytes into the
    metadata packets (a minimal sketch follows this list).
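
As a minimal sketch (an assumption about one possible recorder-side
implementation, not mandated by this spec), the snippet below defines a packed
struct matching the sample layout above, fills it from the current
camera-to-world rotation using Eigen3, and copies the raw bytes into a metadata
packet. On little-endian hosts the struct bytes can be copied directly.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include <Eigen/Geometry>

// Packed layout of one camera orientation sample (16 bytes on a little-endian host).
#pragma pack(push, 1)
struct CameraOrientationSample {
  int32_t reserved;      // Should be 0.
  float angle_axis[3];   // Rotation from camera to world coordinates, in radians.
};
#pragma pack(pop)
static_assert(sizeof(CameraOrientationSample) == 16, "unexpected struct padding");

// Build the raw payload of one metadata sample from the sensor-fusion output.
std::vector<uint8_t> MakeOrientationSample(const Eigen::Matrix3f& camera_to_world) {
  Eigen::AngleAxisf aa(camera_to_world);
  const Eigen::Vector3f angle_axis = aa.angle() * aa.axis();
  CameraOrientationSample sample = {0, {angle_axis.x(), angle_axis.y(), angle_axis.z()}};
  std::vector<uint8_t> payload(sizeof(sample));
  std::memcpy(payload.data(), &sample, sizeof(sample));
  return payload;
}
```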

Synchronization between metadata and video frames:

*   Video track and metadata track are synchronized by the presentation
    timestamp of the video and metadata samples.
*   Given the camera orientations for a discrete set of metadata presentation
    times, the continuous orientation at any given time is defined by linear
    interpolation of the neighboring camera orientations. When rendering a video
    frame, the player should obtain the frame rotation by interpolating at the
    presentation time of the video frame (see the sketch after this list).
*   The typical presentation time for a video frame is the start of the frame
    exposure, which does not take exposure time or rolling shutter into account.
    When the per-frame exposure time and rolling-shutter skew are known, better
    correspondence can be achieved by adjusting the presentation time of the
    video frames to the middle of the frame exposure duration:
    exposure_start_of_first_row + (pixel_exposure_time + rolling_shutter_skew)
    / 2.
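
Below is a sketch of this interpolation step (assumptions, not part of the spec:
metadata samples are stored sorted by presentation time in microseconds,
orientations are kept as quaternions, and quaternion slerp stands in for the
linear interpolation between neighboring orientations).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>
#include <Eigen/Geometry>

// One decoded metadata sample: presentation time plus camera-to-world rotation.
struct OrientationSample {
  int64_t time_us;
  Eigen::Quaternionf camera_to_world;
};

// Return the interpolated camera orientation at a video frame's presentation time.
// Assumes `samples` is non-empty and sorted by time_us.
Eigen::Quaternionf OrientationAtTime(const std::vector<OrientationSample>& samples,
                                     int64_t frame_time_us) {
  // Find the first metadata sample at or after the frame time.
  auto upper = std::lower_bound(
      samples.begin(), samples.end(), frame_time_us,
      [](const OrientationSample& sample, int64_t t) { return sample.time_us < t; });
  if (upper == samples.begin()) return samples.front().camera_to_world;
  if (upper == samples.end()) return samples.back().camera_to_world;
  const OrientationSample& next = *upper;
  const OrientationSample& prev = *(upper - 1);
  const float alpha = static_cast<float>(frame_time_us - prev.time_us) /
                      static_cast<float>(next.time_us - prev.time_us);
  // Interpolate between the two neighboring orientations.
  return prev.camera_to_world.slerp(alpha, next.camera_to_world);
}
```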

# 4. Identifying VR180 Videos

Below is an example box structure of a VR180 video:

```
[moov]
  [trak]                   // video track
    [mdia]
      [minf]
        [stbl]
          [stsd]
            [avc1]
              [st3d]      // spherical metadata v2
              [sv3d]      // spherical metadata v2
               ...
  [trak]                  // audio track
         ...
  [trak]                  // camera motion data track
    [mdia]
      [hdlr]              // handler = 'meta'
      [minf]
        [stbl]
          [stsd]
             [camm]       // camera motion sample entry
```

VR180 videos can be identified for custom processing or playback by the
existence and content of the Spherical Video Metadata V2 boxes. Optionally, the
camera motion metadata track provides the stabilization data that aligns the
video frames with a fixed world orientation.
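
As a rough, hypothetical sketch (not a full MP4 parser, and only an assumption
about how a tool might do a quick check), the C++ snippet below simply searches
the raw bytes of the moov box for the sv3d and camm four-character codes; a
robust implementation should walk the box tree along the paths shown above.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Return true if the four-character code appears anywhere in the given bytes.
static bool ContainsFourCc(const std::vector<uint8_t>& data, const std::string& fourcc) {
  return std::search(data.begin(), data.end(), fourcc.begin(), fourcc.end()) !=
         data.end();
}

// moov_bytes: the raw contents of the top-level moov box of the video file.
bool LooksLikeVr180(const std::vector<uint8_t>& moov_bytes, bool* has_motion_track) {
  // Spherical Video Metadata V2 with a mesh projection is the required signal.
  const bool has_spherical_v2 = ContainsFourCc(moov_bytes, "sv3d");
  // The camera motion metadata track is optional and enables stabilized playback.
  *has_motion_track = ContainsFourCc(moov_bytes, "camm");
  return has_spherical_v2;
}
```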

# Appendix - Mesh Generation Demo

```matlab
% Demo code for mesh generation from a fisheye camera. The mesh vertices and
% triangle indices are generated according to the definition of ProjectionMesh
% in Spherical Video V2 (
% https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
%
% Note: for stereo images that are composed of two sub-images for the left and
% right eyes, the meshes should be generated from the individual cameras that
% describe the sub-images of each eye as if they were separate images.
%
% Please note that Computer Vision typically uses a coordinate system in which X
% points right, Y points downward, and Z points forward, which has negated Y and
% Z compared to OpenGL. To generate a mesh from such a camera, the Y and Z
% coordinates need to be negated, and the texture coordinate V needs to be
% flipped similarly.
%
function spherical_mesh_demo()
  % Example fisheye camera for the demo.
  fisheye_camera = demo_camera();

  % Mesh resolution.
  grid_size_x = 40;
  grid_size_y = 40;

  % Generate the vertices and triangle indices from the camera.
  [vertices, tri] = generate_mesh(fisheye_camera, grid_size_x, grid_size_y);

  % Plot the UV triangulation in the image space.
  figure(1);
  u = reshape([vertices(:).u], grid_size_y, grid_size_x);
  v = reshape([vertices(:).v], grid_size_y, grid_size_x);
  trimesh(tri, u * fisheye_camera.image_size(1), ...
          v * fisheye_camera.image_size(2));
  set(gca, 'xlim', [0, fisheye_camera.image_size(1)],...
      'ylim', [0, fisheye_camera.image_size(2)]);
  axis equal;

  % Plot the mesh in 3D.
  figure(2);
  x = reshape([vertices(:).x], grid_size_y, grid_size_x);
  y = reshape([vertices(:).y], grid_size_y, grid_size_x);
  z = reshape([vertices(:).z], grid_size_y, grid_size_x);
  trimesh(tri, x, y, z);
  axis equal;
end


% Generate the mesh vertices and triangle indices using a grid in the
% intersection of 180 image circle and the image rectangle.
function [vertices, tri] = generate_mesh(fisheye_camera, grid_size_x,...
                                         grid_size_y)
  % Struct array for the mesh vertices.
  vertices = struct('u', {}, 'v', {}, 'x', {}, 'y', {}, 'z', {});
  % Triangle indices: one row of three vertex indices per triangle.
  tri = zeros(0, 3);

  % The radii along the x-axis and y-axis, assuming an elliptical image circle.
  radius = image_circle(fisheye_camera);

  % The vertical boundary of the image circle/ellipse.
  ymin = max(0, fisheye_camera.principal_point(2) - radius(2));
  ymax = min(fisheye_camera.image_size(2),...
             fisheye_camera.principal_point(2) + radius(2));

  for i = 1 : grid_size_y;
    % Y coordinate in the image.
    yi = ymin + (i - 1) * (ymax - ymin) / (grid_size_y - 1);
    % Y coordinate relative to image center.
    yc = yi - fisheye_camera.principal_point(2);

    % Horizontal boundary on the image circle along the given y coordinate.
    rx = radius(1) * sqrt(1 - yc^2 / (radius(2)^2));
    xmin = max(0, fisheye_camera.principal_point(1) - rx);
    xmax = min(fisheye_camera.image_size(1), ...
               fisheye_camera.principal_point(1) + rx);

    % Generate evenly spaced vertices along the horizontal line.
    for j = 1 : grid_size_x;
      % X coordinate
      xj = xmin + (j - 1) * (xmax - xmin) / (grid_size_x - 1);
      point = pixel_to_ray(fisheye_camera, xj, yi);

      % X, Y, Z, U, V for each vertex. To account for the difference between
      % the usual Computer Vision coordinate system and the OpenGL coordinate
      % system, the Y and Z coordinates of the point need to be negated, and V
      % needs to be flipped. If you are already using an OpenGL-like coordinate
      % system, this is not needed.
      vertices(i, j).x = point(1);
      vertices(i, j).y = - point(2);
      vertices(i, j).z = - point(3);
      vertices(i, j).u = xj / fisheye_camera.image_size(1);
      vertices(i, j).v = 1 - yi / fisheye_camera.image_size(2);
    end
  end

  % Generate triangle indices for the mesh.
  for j = 0 : grid_size_x - 2;
    for i = 0 : grid_size_y - 2;
      % Split the quad (i, i + 1) x (j, j + 1) into two triangles:
      tri(end + 1, :) = [grid_size_y * j + i + 1,...
                         grid_size_y * (j + 1) + i + 1,...
                         grid_size_y * j + i + 2];
      tri(end + 1, :) = [grid_size_y * j + i + 2,...
                         grid_size_y * (j + 1) + i + 1,...
                         grid_size_y * (j + 1) + i + 2];
     end
  end
end

% The example camera uses a typical Computer Vision fisheye camera model, which
% projects a 3D point in the world coordinate system as follows:
%
% 1. Transform world_point to camera coordinate system:
%       camera_point = ...
%          camera.world_to_camera_rotation * (world_point - camera.position);
% 2. Fisheye mapping.
%      theta = atan2(norm(camera_point(1:2)), camera_point(3));
% 3. Radial distortion factors
%      d = camera.radial_distortion
%      normalized_r = theta + d(1) * theta^3 + d(2) * theta^5 + d(3) * theta^7.
%      normalized_x = normalized_r * camera_point(1) / norm(camera_point(1:2));
%      normalized_y = normalized_r * camera_point(2) / norm(camera_point(1:2));
% 4. Map the normalized coordinate to pixels
%     x = camera.focal_length * normalized_x + camera.principal_point(1)
%     y = camera.focal_length * camera.pixel_aspect_ratio * normalized_y ...
%         + camera.principal_point(2);
function fisheye_camera = demo_camera()
  fisheye_camera = struct('image_size', [2160, 2160], ...
                          'principal_point', [1080, 1080], ...
                          'pixel_aspect_ratio', 1.2, ...
                          'focal_length', 828, ...
                          'radial_distortion', [-0.032, -0.00243, 0.001],...
                          'world_to_camera_rotation', eye(3), ...
                          'position', [0, 0, 0]);
end

% Map a pixel (x, y) to the ray direction in the world coordinate.
%
% Note this needs to be modified for cameras with different parametrization.
function point = pixel_to_ray(fisheye_camera, x, y)
  % Normalized Y coordinate.
  yn = (y - fisheye_camera.principal_point(2))/ fisheye_camera.focal_length ...
       / fisheye_camera.pixel_aspect_ratio;

  % Normalized X coordinate.
  xn = (x - fisheye_camera.principal_point(1)) / fisheye_camera.focal_length;

  % Normalized distance to image center.
  rn = sqrt(xn * xn + yn * yn);

  % Solve for the angle theta between the viewing ray and the optical axis
  % that satisfies:
  %   rn = theta + theta^3 * d(1) + theta^5 * d(2) + theta^7 * d(3);
  % The example uses just 3 distortion parameters, but it can easily be extended to more.
  d = fisheye_camera.radial_distortion;
  theta = roots([d(3), 0, d(2), 0, d(1), 0, 1.0, -rn]);

  % Take the smallest positive real solution.
  theta = min(theta(find(imag(theta) == 0 & real(theta) > 0)));

  if isempty(theta) || rn == 0
    % Degenerate case at the exact image center.
    point = [0; 0; 1];
  else
    % Generate the point in the camera coordinate system.
    point = [sin(theta) * xn / rn; sin(theta) * yn / rn; cos(theta)];
  end

  % Apply the inverse rotation to transform it from camera to world.
  point = fisheye_camera.world_to_camera_rotation' * point;
end

% Calculate X- and Y-radius of the image circle/ellipse for a fisheye camera.
%
% Note the image circle logic needs to be modified for cameras with different
% parametrization, for example, having non-zero skew.
function radius = image_circle(fisheye_camera)
  % Half of the desired image circle FOV. Note the maximum image circle should
  % be at most 180 degrees, but it is OK to make it smaller to avoid peripheral
  % regions with poor quality.
  theta = pi / 2;
  % Normalized distance to the image center.
  d = fisheye_camera.radial_distortion;
  normalized_r = theta + theta^3 * d(1) + theta^5 * d(2) + theta^7 * d(3);
  % Radius along X and Y axes.
  radius = normalized_r * fisheye_camera.focal_length...
           * [1, fisheye_camera.pixel_aspect_ratio];
end
```