# VR180 Video Format
# 1. Introduction
VR180 cameras are a new category of VR camera that use two wide-angle cameras to
capture the world as you see it, with point-and-shoot simplicity. This document
describes the video format output by these devices. The choice of format
considers the following aspects:
* **FOV**: VR180 cameras capture sub-360 FOV rather than full 360. It is
important to retain the original pixel density of the camera sensors in
order to provide high pixel density for VR viewing.
* **Projection**: Different versions of VR180 cameras may have different lenses
  and camera projections. As such, the file format should be camera-independent.
* **Motion**: The cameras can often be non-stationary due to unintentional
shakes or intentional motion, for example, handheld capture of events or
people. To avoid motion sickness, camera motion metadata should be saved for
stabilized playback.
* **Playback**: The file format should be friendly enough for local playback
so that manufacturers can easily build their apps. Android and iOS should
have an easy way to play the raw video.
VR180 videos contain two types of metadata to jointly define the projection from
video frames to their partial viewports within a spherical coordinate system.
1. **A global static projection** that defines the mapping from the pixels to
local spherical coordinate systems, typically to only a sub-180 FOV part.
The [Spherical Metadata V2
Spec](https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
is adopted here to encode this global metadata. (See details in [section
2](#2-mesh-projection)).
2. **A dynamic orientation stream** that defines the rotation between the local
coordinate system of each frame and the world coordinate system. A new
[Camera Motion Metadata
track](https://developers.google.com/streetview/publish/camm-spec) is
created for encoding such per-frame metadata. (See [section
3](#3-camera-motion-metadata)).
# 2. Mesh Projection
The [Spherical Metadata V2
Spec](https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
should be present in the file to define the static global projection of
individual frames to their local spherical coordinate system. Among the
projection types allowed by Spherical Metadata V2, the VR180 Video format
requires a mesh projection, which is the most generic and works for fisheye
projections.
|<img src="equirect.jpg" width="500"> | <img src="fisheye.jpg" width="900">
:-------------------------------------------------------: | :---------------------:
(a) 360 equirectangular | (b) fisheye mesh projection
Figure 1. Example of a video frame in the typical equirectangular format and in
the fisheye mesh format.
By using the mesh projection type, the cameras can save the raw pixels in
side-by-side or over-under format in the video, and let the projection meshes
define the back-projection from pixels to 3D directions. This not only
preserves the pixel density of the camera sensors, but also saves production
cost and power consumption by avoiding expensive reprojection computation. To
render such videos, player clients simply need to draw the saved per-eye mesh
with the corresponding image as texture. To be specific, in VR180:
* Dual stereo mesh: video files contain two meshes, one mesh for each eye.
* Fisheye projection: geometry-wise, the video frames are simple
concatenations of left and right views with possible crop and rescale, but
there is no other type of warping (e.g. de-fisheye).
* Stereo mode: for better compatibility with video streaming services that are
  optimized for 16:9, landscape LEFT-RIGHT is preferred over portrait
  TOP-BOTTOM (see the sketch after this list).
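Because each eye's mesh is defined as if its sub-image were a standalone
texture, a player working directly with the full decoded frame can simply remap
the per-eye texture coordinates into the LEFT-RIGHT (or TOP-BOTTOM) layout
instead of cropping the texture. Below is a minimal, illustrative C++ sketch of
that remapping; the struct and function names are hypothetical and not part of
the format.

```cpp
#include <cstdint>
#include <vector>

// One mesh vertex as defined by the Spherical Video V2 mesh projection:
// a 3D direction plus texture coordinates relative to the per-eye sub-image.
struct MeshVertex {
  float x, y, z;  // Unit ray direction in OpenGL coordinates.
  float u, v;     // Texture coordinates in [0, 1] of the per-eye image.
};

struct EyeMesh {
  std::vector<MeshVertex> vertices;
  std::vector<uint32_t> triangle_indices;
};

// Remaps per-eye texture coordinates into a full LEFT-RIGHT frame, since the
// frame is a simple concatenation of the left and right sub-images.
void RemapToLeftRightFrame(bool is_right_eye, EyeMesh* mesh) {
  for (MeshVertex& vertex : mesh->vertices) {
    vertex.u = 0.5f * vertex.u + (is_right_eye ? 0.5f : 0.0f);
    // v is unchanged: both sub-images span the full frame height.
  }
}
```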
## Mesh Generation
Once the cameras are calibrated, the mesh vertices can be generated by
straightforward back-projection for a grid of coordinates that covers the valid
image portion (inside the 180-degree image circle). Refer to the
[appendix](#appendix-mesh-generation-demo) for complete Matlab demo code that
produces a full mesh for a fisheye camera. Below is the pseudocode for computing
a single mesh vertex.
```matlab
% Returns the mesh vertex for an image point image_x, image_y for an eye.
% (width, height) : the size of the image of an eye.
% (image_x, image_y): image coordinate where (0, 0) and (width, height) are top-left
% and bottom-right corner respectively.
% eye_camera : the calibrated camera for an eye (left or right)
function [x, y, z, u, v] = GetMeshVertex(width, height, image_x, image_y, eye_camera)
% Unit ray direction corresponding to the pixel.
[x, y, z] = PixelToRay(eye_camera, image_x, image_y);
% Negate Y and Z IF the camera parameterization follows the standard Computer
% Vision convention where Y points down and Z points forward. This accounts for
% the difference from the OpenGL coordinate system.
y = -y;
z = -z;
% Normalized OpenGL coordinate for the pixel, where the V coordinate needs to be flipped.
u = image_x / width;
v = (height - image_y) / height;
end
```
* Although the video frame is a concatenation of left and right eye images
  (LEFT-RIGHT or OVER-UNDER), the mesh for each eye should be generated as if
  they were separate images.
* A coarse mesh is preferred over a full-resolution mesh. Downsampled meshes
work well as long as the resolution is reasonable, and they are more
efficient for playback. A typical mesh resolution is a 40x40 grid.
# 3. Camera Motion Metadata
Camera rotations during video capture in a world coordinate system can be
embedded as video metadata. This metadata is particularly important for
hand-held VR video:
* By using camera rotation metadata, the player can render the video frames at
  the exact orientation they were captured. Compensating for the camera
  rotation essentially keeps the distant background static. Our experiments
  have shown that such stabilized viewing significantly reduces motion sickness
  in VR.
* It is important to have high-quality rotation data (including a correct
  gravity vector); otherwise playback can cause motion sickness or be
  disorienting. This requires a well-calibrated IMU along with on-device sensor
  fusion.
|<img src="motion1.jpg" width="300"> | <img src="motion2.jpg" width="300"> | <img src="motion3.jpg" width="300">|
:-: | :--:| :--:
Figure 2. Three equirectangular stereo views generated according to their
rotations.
## Camera Motion Metadata Track
We have created a new [Camera Motion Metadata
Track](https://developers.google.com/streetview/publish/camm-spec) for storing
various kinds of camera motion metadata, including camera orientation,
gyroscope readings, accelerometer readings, etc. The custom metadata track can
be identified by the new Camera Motion Metadata (camm) Sample Entry box.
For VR180 cameras, each video contains such a metadata track to store camera
rotation data. Each data sample in the metadata track is represented as a
bitstream in the following format:
<table>
<tr>
<td> Fields: </td>
<td>Description: </td>
</tr>
<tr>
<td> int32 reserved; </td>
<td>Should be 0.</td>
</tr>
<tr>
<td> float32 angle_axis[3];</td>
<td>Angle axis orientation in radians representing the rotation from camera coordinate system to world coordinate system.<br>
Let M be the 3x3 rotation matrix corresponding to the angle axis vector. For any ray X in the local coordinate system, <br>
the ray direction in the world coordinate system is M * X.<br>
<br>
Such orientation information can be obtained by running 3DoF sensor fusion on the device. After integrating the IMU readings, <br>
only the integrated global orientation needs to be recorded. <br>
<br>
Below is example C++ code for converting a rotation matrix to the expected angle-axis representation using <a href="https://eigen.tuxfamily.org/dox/">Eigen3</a>.
<pre lang="cpp">Eigen::Matrix3f M = get_current_rotation_matrix();
Eigen::AngleAxisf aa(M);
Eigen::Vector3f angle_axis = aa.angle() * aa.axis(); </pre>
</td>
</tr>
</table>
* The coordinate systems are right-handed. The camera coordinate system is
  defined as X pointing right, Y pointing downward, and Z pointing forward.
  The Y-axis of the global coordinate system should point down along the
  gravity vector.
* IMU readings are typically given in the IMU's own coordinate system, and a
  rotation is needed to map them to the camera coordinate system if the two
  coordinate systems differ.
* To have a consistent viewing experience, we recommend resetting the yaw
angle for each new video recording, such that the orientation of the first
frame has a yaw angle of 0.
* All fields are little-endian and least significant bit first, and the 32-bit
  floating-point values are in IEEE 754-1985 format. The video recorder should
  maintain a struct of these fields in memory and copy the raw data to video
  packets (see the sketch after this list).
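As a concrete illustration of that last point, here is a minimal sketch of how
a recorder might lay out and serialize one orientation sample. It is not
normative; the struct and function names are hypothetical, and it assumes a
little-endian host with IEEE 754 floats.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// In-memory layout of one camera-orientation sample, matching the bitstream
// defined above: a zero int32 followed by the angle-axis rotation.
#pragma pack(push, 1)
struct CammOrientationSample {
  int32_t reserved;     // Must be 0.
  float angle_axis[3];  // Camera-to-world rotation as angle-axis, in radians.
};
#pragma pack(pop)
static_assert(sizeof(CammOrientationSample) == 16, "unexpected padding");

// Copies the raw sample bytes into the payload of one metadata packet.
std::vector<uint8_t> SerializeOrientationSample(const float angle_axis[3]) {
  CammOrientationSample sample = {};
  std::memcpy(sample.angle_axis, angle_axis, sizeof(sample.angle_axis));
  std::vector<uint8_t> packet(sizeof(sample));
  std::memcpy(packet.data(), &sample, sizeof(sample));
  return packet;
}
```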
## Synchronization Between Metadata and Video Frames
* Video track and metadata track are synchronized by the presentation
timestamp of the video and metadata samples.
* Given the camera orientation for a discrete set of metadata presentation
  times, the continuous orientation for any given time is defined by linear
  interpolation of the neighboring camera orientations. When rendering a video
  frame, the player should obtain the frame rotation by interpolating at the
  presentation time of the video frame (see the sketch after this list).
* The typical presentation time for a video frame is the start of frame
  exposure, which does not take exposure time or rolling shutter into account.
  When the per-frame exposure time and rolling shutter skew are known, better
  correspondence can be achieved by adjusting the presentation time of each
  video frame to the middle of its exposure duration:
  `exposure_start_of_first_row + (pixel_exposure_time + rolling_shutter_skew) / 2`.
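Below is an illustrative sketch of the interpolation step referenced above,
using Eigen. The type and function names are hypothetical; spherical linear
interpolation of quaternions is used here as the rotation analogue of linearly
interpolating the neighboring orientations.

```cpp
#include <Eigen/Geometry>

// One decoded camm sample: presentation time plus the stored angle-axis.
struct OrientationSample {
  double presentation_time;    // Seconds, on the media timeline.
  Eigen::Vector3f angle_axis;  // Camera-to-world rotation, in radians.
};

// Converts the stored angle-axis vector into a quaternion.
Eigen::Quaternionf ToQuaternion(const Eigen::Vector3f& angle_axis) {
  const float angle = angle_axis.norm();
  if (angle == 0.0f) return Eigen::Quaternionf::Identity();
  return Eigen::Quaternionf(Eigen::AngleAxisf(angle, angle_axis / angle));
}

// Returns the camera-to-world rotation at the frame's presentation time, given
// the two metadata samples that bracket it.
Eigen::Quaternionf InterpolateOrientation(const OrientationSample& prev,
                                          const OrientationSample& next,
                                          double frame_time) {
  const double span = next.presentation_time - prev.presentation_time;
  const double t =
      span > 0.0 ? (frame_time - prev.presentation_time) / span : 0.0;
  return ToQuaternion(prev.angle_axis)
      .slerp(static_cast<float>(t), ToQuaternion(next.angle_axis));
}
```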
# 4. Identifying VR180 Videos
Below is an example box structure of a VR180 video:
```
[moov]
  [trak] // video track
    [mdia]
      [minf]
        [stbl]
          [stsd]
            [avc1]
              [st3d] // spherical metadata v2
              [sv3d] // spherical metadata v2
  ...
  [trak] // audio track
  ...
  [trak] // camera motion data track
    [mdia]
      [hdlr] // handler = 'meta'
      [minf]
        [stbl]
          [stsd]
            [camm] // camera motion sample entry
```
VR180 videos can be identified for custom processing or playback by the
existence and content of the Spherical Video Metadata V2 boxes. Optionally, the
camera motion metadata track provides the stabilization data that aligns the
video frames with a fixed world orientation.
# Appendix - Mesh Generation Demo
```matlab
% Demo code for mesh generation from a fisheye camera. The format of the mesh
% vertices and triangle indices are generated according to the definition of
% ProjectionMesh in Spherical Video V2 (
% https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
%
% Note that for stereo images composed of two sub-images for the left and right
% eye, the meshes should be generated from the individual cameras that describe
% the sub-images of each eye as if they were separate images.
%
% Please note that Computer Vision typically uses a coordinate system in which X
% points right, Y points downward, and Z points forward, which has negated Y and
% Z compared to OpenGL. To generate a mesh from such a camera, the Y and Z
% coordinates need to be negated, and the texture coordinate V needs to be
% flipped similarly.
%
function spherical_mesh_demo()
% Example fisheye camera for the demo.
fisheye_camera = demo_camera();
% Mesh resolution.
grid_size_x = 40;
grid_size_y = 40;
% Generate the vertices and triangle indices from the camera.
[vertices, tri] = generate_mesh(fisheye_camera, grid_size_x, grid_size_y);
% Plot the UV triangulation in the image space.
figure(1);
u = reshape([vertices(:).u], grid_size_y, grid_size_x);
v = reshape([vertices(:).v], grid_size_y, grid_size_x);
trimesh(tri, u * fisheye_camera.image_size(1), ...
v * fisheye_camera.image_size(2));
set(gca, 'xlim', [0, fisheye_camera.image_size(1)],...
'ylim', [0, fisheye_camera.image_size(2)]);
axis equal;
% Plot the mesh in 3D.
figure(2);
x = reshape([vertices(:).x], grid_size_y, grid_size_x);
y = reshape([vertices(:).y], grid_size_y, grid_size_x);
z = reshape([vertices(:).z], grid_size_y, grid_size_x);
trimesh(tri, x, y, z);
axis equal;
end
% Generate the mesh vertices and triangle indices using a grid in the
% intersection of 180 image circle and the image rectangle.
function [vertices, tri] = generate_mesh(fisheye_camera, grid_size_x,...
grid_size_y)
% Struct for the mesh vertex
vertices = struct('u', {}, 'v', {}, 'x', {}, 'y', {}, 'z', {});
% The radius along x-axis and y-axis, assuming an ellipse shape.
radius = image_circle(fisheye_camera);
% The vertical boundary of the image circle/ellipse.
ymin = max(0, fisheye_camera.principal_point(2) - radius(2));
ymax = min(fisheye_camera.image_size(2),...
fisheye_camera.principal_point(2) + radius(2));
for i = 1 : grid_size_y;
% Y coordinate in the image.
yi = ymin + (i - 1) * (ymax - ymin) / (grid_size_y - 1);
% Y coordinate relative to image center.
yc = yi - fisheye_camera.principal_point(2);
% Horizontal boundary on the image circle along the given y coordinate.
rx = radius(1) * sqrt(1 - yc^2 / (radius(2)^2));
xmin = max(0, fisheye_camera.principal_point(1) - rx);
xmax = min(fisheye_camera.image_size(1), ...
fisheye_camera.principal_point(1) + rx);
% Generate evenly spaced vertices along the horizontal line.
for j = 1 : grid_size_x;
% X coordinate
xj = xmin + (j - 1) * (xmax - xmin) / (grid_size_x - 1);
point = pixel_to_ray(fisheye_camera, xj, yi);
% X, Y, Z, U, V for each vertex. To account for the difference between the
% usual Computer Vision coordinate system and the OpenGL coordinate system, the
% Y and Z coordinates need to be negated, and V needs to be flipped. If you are
% already using an OpenGL-like coordinate system, this is not needed.
vertices(i, j).x = point(1);
vertices(i, j).y = - point(2);
vertices(i, j).z = - point(3);
vertices(i, j).u = xj / fisheye_camera.image_size(1);
vertices(i, j).v = 1 - yi / fisheye_camera.image_size(2);
end
end
% Generate triangle indices for the mesh. Initialize an empty Nx3 index list so
% that rows can be appended below.
tri = zeros(0, 3);
for j = 0 : grid_size_x - 2;
for i = 0 : grid_size_y - 2;
% Split the quad (i , i + 1) x (j, j + 1) to two triangles:
tri(end + 1, :) = [grid_size_y * j + i + 1,...
grid_size_y * (j + 1) + i + 1,...
grid_size_y * j + i + 2];
tri(end + 1, :) = [grid_size_y * j + i + 2,...
grid_size_y * (j + 1) + i + 1,...
grid_size_y * (j + 1) + i + 2];
end
end
end
% The example camera uses a typical Computer Vision fisheye camera model, which
% projects a 3D point in the world coordinate system to a pixel as follows:
%
% 1. Transform world_point to camera coordinate system:
% camera_point = ...
% camera.world_to_camera_rotation * (world_point - camera.position);
% 2. Fisheye mapping.
% theta = atan2(norm(camera_point(1:2)), camera_point(3));
% 3. Radial distortion factors
% d = camera.radial_distortion
% normalized_r = theta + d(1) * theta^3 + d(2) * theta^5 + d(3) * theta^7.
% normalized_x = normalized_r * camera_point(1) / norm(camera_point(1:2));
% normalized_y = normalized_r * camera_point(2) / norm(camera_point(1:2));
% 4. Map the normalized coordinate to pixels
% x = camera.focal_length * normalized_x + camera.principal_point(1)
% y = camera.focal_length * camera.pixel_aspect_ratio * normalized_y ...
% + camera.principal_point(2);
function fisheye_camera = demo_camera()
fisheye_camera = struct('image_size', [2160, 2160], ...
'principal_point', [1080, 1080], ...
'pixel_aspect_ratio', 1.2, ...
'focal_length', 828, ...
'radial_distortion', [-0.032, -0.00243, 0.001],...
'world_to_camera_rotation', eye(3), ...
'position', [0, 0, 0]);
end
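% The following helper is not part of the original demo; it is a sketch of the
% forward fisheye projection described in the comments above demo_camera,
% mapping a 3D world point to pixel coordinates. It is the inverse of
% pixel_to_ray below (ignoring out-of-FOV points).
function pixel = project_point(fisheye_camera, world_point)
% Transform the world point into the camera coordinate system.
camera_point = fisheye_camera.world_to_camera_rotation ...
* (world_point(:) - fisheye_camera.position(:));
% Angle between the viewing ray and the optical axis.
r = norm(camera_point(1:2));
theta = atan2(r, camera_point(3));
% Apply the radial distortion polynomial.
d = fisheye_camera.radial_distortion;
normalized_r = theta + d(1) * theta^3 + d(2) * theta^5 + d(3) * theta^7;
% Normalized image coordinates; guard the degenerate on-axis case.
if r > 0
normalized_x = normalized_r * camera_point(1) / r;
normalized_y = normalized_r * camera_point(2) / r;
else
normalized_x = 0;
normalized_y = 0;
end
% Map the normalized coordinates to pixels.
pixel = [fisheye_camera.focal_length * normalized_x ...
+ fisheye_camera.principal_point(1), ...
fisheye_camera.focal_length * fisheye_camera.pixel_aspect_ratio ...
* normalized_y + fisheye_camera.principal_point(2)];
end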
% Map a pixel (x, y) to the ray direction in the world coordinate.
%
% Note this needs to be modified for cameras with different parametrization.
function point = pixel_to_ray(fisheye_camera, x, y)
% Normalized Y coordinate.
yn = (y - fisheye_camera.principal_point(2))/ fisheye_camera.focal_length ...
/ fisheye_camera.pixel_aspect_ratio;
% Normalized X coordinate.
xn = (x - fisheye_camera.principal_point(1)) / fisheye_camera.focal_length;
% Normalized distance to image center.
rn = sqrt(xn * xn + yn * yn);
% Solve for the angle theta between the viewing ray and the optical axis
% that satisfies:
% rn = theta + theta^3 * d(1) + theta^5 * d(2) + theta^7 * d(3);
% The example uses just 3 parameters, but it can easily be extended to more.
d = fisheye_camera.radial_distortion;
theta = roots([d(3), 0, d(2), 0, d(1), 0, 1.0, -rn]);
% Take the smallest positive real solution.
theta = min(theta(find(imag(theta) == 0 & real(theta) > 0)));
% Degenerate case at the exact image center.
if isempty(theta); theta = 0; end;
% Generate the point in the camera coordinate system (guard against dividing
% by zero when rn is zero, i.e. exactly at the image center).
if rn > 0
point = [sin(theta) * xn / rn; sin(theta) * yn / rn; cos(theta)];
else
point = [0; 0; 1];
end
% Apply the inverse rotation to transform it from camera to world.
point = fisheye_camera.world_to_camera_rotation' * point;
end
% Calculate X- and Y-radius of the image circle/ellipse for a fisheye camera.
%
% Note the image circle logic needs to be modified for cameras with different
% parametrization, for example, having non-zero skew.
function radius = image_circle(fisheye_camera)
% Half of the desired image circle FOV. Note the maximum image circle should
% cover at most 180 degrees, but it is OK to make it smaller to exclude
% peripheral regions with poor quality.
theta = pi / 2;
% Normalized distance to the image center.
d = fisheye_camera.radial_distortion;
normalized_r = theta + theta^3 * d(1) + theta^5 * d(2) + theta^7 * d(3);
% Radius along X and Y axes.
radius = normalized_r * fisheye_camera.focal_length...
* [1, fisheye_camera.pixel_aspect_ratio];
end
```