# VR180 Video Format
# 1. Introduction

VR180 cameras are a new category of VR camera that use two wide-angle cameras
to capture the world as you see it with point-and-shoot simplicity. This
document describes the video format output by these devices. The design of the
format considers the following aspects:

*   **FOV**: VR180 cameras capture a sub-360 FOV rather than full 360. It is
    important to retain the original pixel density of the camera sensors in
    order to provide high pixel density for VR viewing.
*   **Projection**: Different versions of VR180 cameras may have different
    lenses and different camera projections. As such, the file format should
    be camera-independent.
*   **Motion**: The cameras can often be non-stationary due to unintentional
    shake or intentional motion, for example, handheld capture of events or
    people. To avoid motion sickness, camera motion metadata should be saved
    for stabilized playback.
*   **Playback**: The file format should be friendly enough for local playback
    so that manufacturers can easily build their own apps. Android and iOS
    should have an easy way to play the raw video.
VR180 videos contain two types of metadata that jointly define the projection
from video frames to their partial viewports within a spherical coordinate
system.

1.  **A global static projection** that defines the mapping from the pixels to
    local spherical coordinate systems, typically covering only a sub-180 FOV.
    The [Spherical Metadata V2
    Spec](https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
    is adopted here to encode this global metadata. (See details in
    [section 2](#2-mesh-projection).)
2.  **A dynamic orientation stream** that defines the rotation between the
    local coordinate system of each frame and the world coordinate system. A
    new [Camera Motion Metadata
    track](https://developers.google.com/streetview/publish/camm-spec) is
    created for encoding such per-frame metadata. (See
    [section 3](#3-camera-motion-metadata).)
# 2. Mesh Projection

The [Spherical Metadata V2
Spec](https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
should be present in the file to define the static global projection of
individual frames to their local spherical coordinate system. Among the
projection types allowed by Spherical Metadata V2, the VR180 Video format
requires a mesh projection, which is the most generic and works for fisheye
projections.
| <img src="equirect.jpg" width="500"> | <img src="fisheye.jpg" width="900"> |
| :----------------------------------: | :---------------------------------: |
|       (a) 360 equirectangular        |     (b) fisheye mesh projection      |

Figure 1. Example of a video frame in the typical equirectangular format and
in the mesh format.
By using the mesh projection type, the cameras can save the raw pixels in
side-by-side or over-under format in the video, and let the projection meshes
define the back-projection from pixels to 3D directions. This not only
preserves the pixel density of the camera sensors, but also saves production
cost and power consumption by shaving off expensive reprojection computation.
To render such videos, player clients simply need to draw the saved per-eye
mesh with the corresponding image as its texture. To be specific, in VR180:

*   Dual stereo mesh: video files contain two meshes, one for each eye.
*   Fisheye projection: geometry-wise, the video frames are simple
    concatenations of the left and right views with possible cropping and
    rescaling, but there is no other type of warping (e.g. de-fisheye).
*   Stereo mode: for better compatibility with video streaming services that
    are optimized for 16:9, landscape LEFT-RIGHT is preferred over portrait
    TOP-BOTTOM (see the sketch below).
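For illustration, here is a minimal sketch (not part of the spec; the struct
and function names are hypothetical) of how a player might map the per-eye
texture coordinates stored in the mesh into the full frame when the stereo
mode is LEFT-RIGHT:

```cpp
// Hypothetical player helper: maps a per-eye texture coordinate in
// [0, 1] x [0, 1], as stored in the projection mesh, into the full
// LEFT-RIGHT stereo frame, whose left half holds the left-eye image.
struct Uv {
  float u;
  float v;
};

Uv PerEyeToFrameUv(const Uv& per_eye, bool is_left_eye) {
  // Each eye occupies half of the frame horizontally; V is unchanged.
  return Uv{0.5f * per_eye.u + (is_left_eye ? 0.0f : 0.5f), per_eye.v};
}
```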
## Mesh Generation

Once the cameras are calibrated, the mesh vertices can be generated by
straightforward back-projection of a grid of coordinates that covers the valid
image portion (inside the 180 image circle). Refer to the
[appendix](#appendix-mesh-generation-demo) for complete Matlab demo code that
produces a full mesh for a fisheye camera. Below is pseudo-code for computing
a single mesh vertex.
```matlab
% Returns the mesh vertex for an image point (image_x, image_y) for an eye.
% (width, height) : the size of the image of an eye.
% (image_x, image_y): image coordinates, where (0, 0) and (width, height) are
%                     the top-left and bottom-right corners respectively.
% eye_camera : the calibrated camera for an eye (left or right).
function [x, y, z, u, v] = GetMeshVertex(width, height, image_x, image_y, eye_camera)
  % Unit ray direction corresponding to the pixel.
  [x, y, z] = PixelToRay(eye_camera, image_x, image_y);
  % Negate Y and Z if the camera parameterization follows the standard
  % Computer Vision convention where Y points down and Z points forward. This
  % accounts for the difference from the OpenGL coordinate system.
  y = -y;
  z = -z;
  % Normalized OpenGL texture coordinates for the pixel, where the V
  % coordinate needs to be flipped.
  u = image_x / width;
  v = (height - image_y) / height;
end
```
*   Although the video frame is a concatenation of the left and right eye
    images (LEFT-RIGHT or OVER-UNDER), the mesh for each eye should be
    generated as if the two were separate images.
*   A coarse mesh is preferred over a full-resolution mesh. Downsampled meshes
    work well as long as the resolution is reasonable, and they are more
    efficient for playback. A typical mesh resolution is a 40x40 grid.
# 3. Camera Motion Metadata

Camera rotations during video capture, expressed in a world coordinate system,
can be embedded as video metadata. This metadata is particularly important for
hand-held VR video:

*   By using the camera rotation metadata, the player can render the video
    frames at the exact orientation at which they were captured. Compensating
    for the camera rotation essentially keeps the distant background static.
    Our experiments have shown that such stabilized viewing significantly
    reduces motion sickness in VR.
*   It is important to have high-quality rotation data (including a correct
    gravity vector); otherwise the playback can cause motion sickness or be
    disorienting. This essentially requires a well-calibrated IMU along with
    on-device sensor fusion.
| <img src="motion1.jpg" width="300"> | <img src="motion2.jpg" width="300"> | <img src="motion3.jpg" width="300"> |
| :---------------------------------: | :---------------------------------: | :---------------------------------: |

Figure 2. Three equirectangular stereo views generated according to their
rotations.
## Camera Motion Metadata Track

We have created a new [Camera Motion Metadata
Track](https://developers.google.com/streetview/publish/camm-spec) for storing
various kinds of camera motion metadata, including camera orientation,
gyroscope readings, accelerometer readings, etc. The custom metadata track can
be identified by the new Camera Motion Metadata (camm) Sample Entry box.

For VR180 cameras, each video contains such a metadata track to store the
camera rotation data. Each data sample in the metadata track is represented as
a bitstream in the following format:
<table>
  <tr>
    <td>Field</td>
    <td>Description</td>
  </tr>
  <tr>
    <td>int32 reserved;</td>
    <td>Should be 0.</td>
  </tr>
  <tr>
    <td>float32 angle_axis[3];</td>
    <td>Angle-axis orientation in radians representing the rotation from the camera coordinate system to the world coordinate system.<br>
    Let M be the 3x3 rotation matrix corresponding to the angle-axis vector. For any ray X in the local coordinate system,<br>
    the ray direction in the world coordinate system is M * X.<br>
    <br>
    Such orientation information can be obtained by running 3DoF sensor fusion on the device. After integrating the IMU readings,<br>
    only the integrated global orientation needs to be recorded.<br>
    <br>
    Below is example C++ code for converting from a rotation matrix to the expected angle-axis vector using <a href="https://eigen.tuxfamily.org/dox/">Eigen3</a>.
    <pre lang="cpp">Eigen::Matrix3f M = get_current_rotation_matrix();
Eigen::AngleAxisf aa(M);
Eigen::Vector3f angle_axis = aa.angle() * aa.axis();</pre>
    </td>
  </tr>
</table>
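Conversely, at playback time the rotation matrix can be recovered from the
stored angle-axis vector. A minimal Eigen3 sketch (assuming the sample payload
has already been parsed into a vector; the function name is illustrative):

```cpp
#include <Eigen/Geometry>

// Recover the camera-to-world rotation matrix M from a stored angle-axis
// vector: the rotation angle is the vector's norm and the rotation axis is
// its direction. A zero vector denotes the identity rotation.
Eigen::Matrix3f AngleAxisToMatrix(const Eigen::Vector3f& angle_axis) {
  const float angle = angle_axis.norm();
  if (angle == 0.0f) return Eigen::Matrix3f::Identity();
  return Eigen::AngleAxisf(angle, angle_axis / angle).toRotationMatrix();
}
```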
*   The coordinate systems are right-handed. The camera coordinate system is
    defined with X pointing right, Y pointing downward, and Z pointing
    forward. The Y-axis of the global coordinate system should point down
    along the gravity vector.
*   IMU readings are typically in the IMU's own coordinate system, and a
    rotation is needed to map them to the camera coordinate system when the
    two coordinate systems differ.
*   To provide a consistent viewing experience, we recommend resetting the yaw
    angle for each new video recording, such that the orientation of the first
    frame has a yaw angle of 0.
*   All fields are little-endian (least significant byte first), and the
    32-bit floating-point values are in IEEE 754-1985 format. The video
    recorder should maintain a struct of these fields in memory and copy the
    raw data to the video packets, as sketched below.
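A minimal sketch of that last point (assuming a little-endian host with
IEEE 754 floats; the names are illustrative, not part of the spec):

```cpp
#include <cstdint>
#include <cstring>

// In-memory layout of one orientation sample, matching the bitstream above:
// a 32-bit reserved field followed by three 32-bit floats (16 bytes total).
struct CammOrientationSample {
  int32_t reserved;      // Should be 0.
  float angle_axis[3];   // Camera-to-world rotation as angle-axis, radians.
};
static_assert(sizeof(CammOrientationSample) == 16,
              "unexpected struct padding");

// On a little-endian machine the struct bytes can be copied into the
// metadata sample payload verbatim.
void SerializeSample(const CammOrientationSample& sample, uint8_t out[16]) {
  std::memcpy(out, &sample, sizeof(sample));
}
```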
## Synchronization Between Metadata and Video Frames
*   The video track and the metadata track are synchronized by the
    presentation timestamps of the video and metadata samples.
*   Given the camera orientations for a discrete set of metadata presentation
    times, the continuous orientation for any given time is defined by linear
    interpolation of the neighboring camera orientations. When rendering a
    video frame, the player should obtain the frame rotation by interpolating
    at the presentation time of the video frame, as sketched below.
*   The typical presentation time for a video frame is the start of frame
    exposure, which does not account for exposure time or rolling shutter.
    When the per-frame exposure time and rolling shutter skew are known,
    better correspondence can be achieved by adjusting the presentation time
    of the video frames to the middle of the frame exposure duration:
    `exposure_start_of_first_row + (pixel_exposure_time + rolling_shutter_skew) / 2`.
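A sketch of the interpolation step, using spherical linear interpolation
between the two neighboring orientations expressed as quaternions (Eigen3
types assumed; slerp is one reasonable reading of "linear interpolation" for
rotations):

```cpp
#include <Eigen/Geometry>

// Interpolate the camera orientation at a video frame's presentation time t,
// given the two neighboring metadata samples at times t0 <= t <= t1.
Eigen::Quaternionf InterpolateOrientation(const Eigen::Quaternionf& q0,
                                          double t0,
                                          const Eigen::Quaternionf& q1,
                                          double t1, double t) {
  const double alpha = (t1 > t0) ? (t - t0) / (t1 - t0) : 0.0;
  return q0.slerp(static_cast<float>(alpha), q1);
}
```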
# 4. Identifying VR180 Videos

Below is an example box structure of a VR180 video:
```
[moov]
  [trak]               // video track
    [mdia]
      [minf]
        [stbl]
          [stsd]
            [avc1]
              [st3d]   // spherical metadata v2
              [sv3d]   // spherical metadata v2
  ...
  [trak]               // audio track
  ...
  [trak]               // camera motion data track
    [mdia]
      [hdlr]           // handler = 'meta'
      [minf]
        [stbl]
          [stsd]
            [camm]     // camera motion sample entry
```
VR180 videos can be identified for custom processing or playback by the
presence and content of the Spherical Video Metadata V2. Optionally, the
camera motion metadata track provides the stabilization data that aligns the
video frames with a fixed world orientation. A simplified detection sketch is
given below.
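The sketch below walks the box tree shown above looking for the `camm` sample
entry. It is a starting point, not a compliant parser: it assumes the `moov`
box is in memory, ignores 64-bit (`largesize`) box headers, and only searches
for `camm`; finding `sv3d` additionally requires parsing the visual sample
entry fields inside `avc1`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Read a 32-bit big-endian integer, as used by MP4 box headers.
static uint32_t ReadU32BE(const uint8_t* p) {
  return (uint32_t{p[0]} << 24) | (uint32_t{p[1]} << 16) |
         (uint32_t{p[2]} << 8) | uint32_t{p[3]};
}

// Returns true if a "camm" sample entry is found in the given byte range.
// Recurses into the container boxes from the tree above; "stsd" carries an
// extra version/flags word and an entry count before its child entries.
static bool HasCammSampleEntry(const uint8_t* data, size_t size) {
  size_t offset = 0;
  while (offset + 8 <= size) {
    const uint32_t box_size = ReadU32BE(data + offset);
    if (box_size < 8 || box_size > size - offset) break;
    const uint8_t* type = data + offset + 4;
    if (std::memcmp(type, "camm", 4) == 0) return true;
    size_t header_size = 8;
    bool is_container =
        !std::memcmp(type, "moov", 4) || !std::memcmp(type, "trak", 4) ||
        !std::memcmp(type, "mdia", 4) || !std::memcmp(type, "minf", 4) ||
        !std::memcmp(type, "stbl", 4);
    if (std::memcmp(type, "stsd", 4) == 0) {
      is_container = true;
      header_size = 16;  // 8-byte box header + version/flags + entry_count.
    }
    if (is_container && box_size > header_size &&
        HasCammSampleEntry(data + offset + header_size,
                           box_size - header_size))
      return true;
    offset += box_size;
  }
  return false;
}
```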
# Appendix - Mesh Generation Demo
```matlab
% Demo code for mesh generation from a fisheye camera. The mesh vertices and
% triangle indices are generated according to the definition of ProjectionMesh
% in Spherical Video V2 (
% https://github.com/google/spatial-media/blob/master/docs/spherical-video-v2-rfc.md)
%
% Note that for stereo images composed of two sub-images for the left and
% right eye, the meshes should be generated from the individual cameras that
% describe the sub-images of each eye, as if they were separate images.
%
% Please note that Computer Vision typically uses a coordinate system in which
% X points right, Y points downward, and Z points forward, which has negated Y
% and Z compared to OpenGL. To generate a mesh from such a camera, the Y and Z
% coordinates need to be negated, and the texture coordinate V needs to be
% flipped similarly.
%
function spherical_mesh_demo()
  % Example fisheye camera for the demo.
  fisheye_camera = demo_camera();
  % Mesh resolution.
  grid_size_x = 40;
  grid_size_y = 40;
  % Generate the vertices and triangle indices from the camera.
  [vertices, tri] = generate_mesh(fisheye_camera, grid_size_x, grid_size_y);
  % Plot the UV triangulation in the image space.
  figure(1);
  u = reshape([vertices(:).u], grid_size_y, grid_size_x);
  v = reshape([vertices(:).v], grid_size_y, grid_size_x);
  trimesh(tri, u * fisheye_camera.image_size(1), ...
          v * fisheye_camera.image_size(2));
  set(gca, 'xlim', [0, fisheye_camera.image_size(1)],...
      'ylim', [0, fisheye_camera.image_size(2)]);
  axis equal;
  % Plot the mesh in 3D.
  figure(2);
  x = reshape([vertices(:).x], grid_size_y, grid_size_x);
  y = reshape([vertices(:).y], grid_size_y, grid_size_x);
  z = reshape([vertices(:).z], grid_size_y, grid_size_x);
  trimesh(tri, x, y, z);
  axis equal;
end
% Generate the mesh vertices and triangle indices using a grid over the
% intersection of the 180 image circle and the image rectangle.
function [vertices, tri] = generate_mesh(fisheye_camera, grid_size_x,...
                                         grid_size_y)
  % Struct array for the mesh vertices.
  vertices = struct('u', {}, 'v', {}, 'x', {}, 'y', {}, 'z', {});
  % Triangle index list, three vertex indices per row.
  tri = zeros(0, 3);
  % The radii along the x-axis and y-axis, assuming an ellipse shape.
  radius = image_circle(fisheye_camera);
  % The vertical boundary of the image circle/ellipse.
  ymin = max(0, fisheye_camera.principal_point(2) - radius(2));
  ymax = min(fisheye_camera.image_size(2),...
             fisheye_camera.principal_point(2) + radius(2));
  for i = 1 : grid_size_y
    % Y coordinate in the image.
    yi = ymin + (i - 1) * (ymax - ymin) / (grid_size_y - 1);
    % Y coordinate relative to the image center.
    yc = yi - fisheye_camera.principal_point(2);
    % Horizontal boundary of the image circle at the given y coordinate.
    rx = radius(1) * sqrt(1 - yc^2 / (radius(2)^2));
    xmin = max(0, fisheye_camera.principal_point(1) - rx);
    xmax = min(fisheye_camera.image_size(1), ...
               fisheye_camera.principal_point(1) + rx);
    % Generate evenly spaced vertices along the horizontal line.
    for j = 1 : grid_size_x
      % X coordinate in the image.
      xj = xmin + (j - 1) * (xmax - xmin) / (grid_size_x - 1);
      point = pixel_to_ray(fisheye_camera, xj, yi);
      % X, Y, Z, U, V for each vertex. To account for the difference between
      % the usual Computer Vision coordinate system and the OpenGL coordinate
      % system, the Y and Z coordinates of the point need to be negated, and
      % V needs to be flipped. If you are already using an OpenGL-like
      % coordinate system, this is not needed.
      vertices(i, j).x = point(1);
      vertices(i, j).y = -point(2);
      vertices(i, j).z = -point(3);
      vertices(i, j).u = xj / fisheye_camera.image_size(1);
      vertices(i, j).v = 1 - yi / fisheye_camera.image_size(2);
    end
  end
  % Generate triangle indices for the mesh. Vertex indices are 1-based
  % column-major linear indices into the grid.
  for j = 0 : grid_size_x - 2
    for i = 0 : grid_size_y - 2
      % Split the quad (i, i + 1) x (j, j + 1) into two triangles.
      tri(end + 1, :) = [grid_size_y * j + i + 1,...
                         grid_size_y * (j + 1) + i + 1,...
                         grid_size_y * j + i + 2];
      tri(end + 1, :) = [grid_size_y * j + i + 2,...
                         grid_size_y * (j + 1) + i + 1,...
                         grid_size_y * (j + 1) + i + 2];
    end
  end
end
% The example camera uses a typical Computer Vision fisheye camera model, which
% projects a 3D point in the world coordinate system as follows:
%
% 1. Transform world_point to the camera coordinate system:
%      camera_point = ...
%          camera.world_to_camera_rotation * (world_point - camera.position);
% 2. Fisheye mapping:
%      theta = atan2(norm(camera_point(1:2)), camera_point(3));
% 3. Radial distortion factors:
%      d = camera.radial_distortion;
%      normalized_r = theta + d(1) * theta^3 + d(2) * theta^5 + d(3) * theta^7;
%      normalized_x = normalized_r * camera_point(1) / norm(camera_point(1:2));
%      normalized_y = normalized_r * camera_point(2) / norm(camera_point(1:2));
% 4. Map the normalized coordinates to pixels:
%      x = camera.focal_length * normalized_x + camera.principal_point(1);
%      y = camera.focal_length * camera.pixel_aspect_ratio * normalized_y ...
%          + camera.principal_point(2);
function fisheye_camera = demo_camera()
  fisheye_camera = struct('image_size', [2160, 2160], ...
                          'principal_point', [1080, 1080], ...
                          'pixel_aspect_ratio', 1.2, ...
                          'focal_length', 828, ...
                          'radial_distortion', [-0.032, -0.00243, 0.001],...
                          'world_to_camera_rotation', eye(3), ...
                          'position', [0, 0, 0]);
end
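% For reference, the forward projection described in the comment above,
% mapping a 3D world point to a pixel. This helper is added for illustration;
% it is a direct transcription of steps 1-4 and the inverse of pixel_to_ray
% below.
function pixel = project_point(fisheye_camera, world_point)
  % 1. Transform the world point to the camera coordinate system.
  camera_point = fisheye_camera.world_to_camera_rotation ...
      * (world_point(:) - fisheye_camera.position(:));
  % 2. Fisheye mapping: angle between the viewing ray and the optical axis.
  theta = atan2(norm(camera_point(1:2)), camera_point(3));
  % 3. Radial distortion factors (the degenerate case on the optical axis,
  %    where norm(camera_point(1:2)) == 0, is not handled for brevity).
  d = fisheye_camera.radial_distortion;
  normalized_r = theta + d(1) * theta^3 + d(2) * theta^5 + d(3) * theta^7;
  normalized_x = normalized_r * camera_point(1) / norm(camera_point(1:2));
  normalized_y = normalized_r * camera_point(2) / norm(camera_point(1:2));
  % 4. Map the normalized coordinates to pixels.
  pixel = [fisheye_camera.focal_length * normalized_x ...
           + fisheye_camera.principal_point(1), ...
           fisheye_camera.focal_length * fisheye_camera.pixel_aspect_ratio ...
           * normalized_y + fisheye_camera.principal_point(2)];
end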
% Map a pixel (x, y) to the ray direction in the world coordinate system.
%
% Note this needs to be modified for cameras with a different parametrization.
function point = pixel_to_ray(fisheye_camera, x, y)
  % Normalized Y coordinate.
  yn = (y - fisheye_camera.principal_point(2)) / fisheye_camera.focal_length ...
       / fisheye_camera.pixel_aspect_ratio;
  % Normalized X coordinate.
  xn = (x - fisheye_camera.principal_point(1)) / fisheye_camera.focal_length;
  % Normalized distance to the image center.
  rn = sqrt(xn * xn + yn * yn);
  % Degenerate case at the exact image center: the ray is the optical axis.
  if rn == 0
    point = fisheye_camera.world_to_camera_rotation' * [0; 0; 1];
    return;
  end
  % Solve for the angle theta between the viewing ray and the optical axis
  % that satisfies:
  %   rn = theta + theta^3 * d(1) + theta^5 * d(2) + theta^7 * d(3);
  % The example uses just 3 parameters, but it can easily be extended to more.
  d = fisheye_camera.radial_distortion;
  theta = roots([d(3), 0, d(2), 0, d(1), 0, 1.0, -rn]);
  % Take the smallest positive real solution.
  theta = min(theta(imag(theta) == 0 & real(theta) > 0));
  % Fall back to the optical axis if no valid root is found.
  if isempty(theta); theta = 0; end
  % Generate the unit ray in the camera coordinate system.
  point = [sin(theta) * xn / rn; sin(theta) * yn / rn; cos(theta)];
  % Apply the inverse rotation to transform it from camera to world.
  point = fisheye_camera.world_to_camera_rotation' * point;
end
% Calculate the X- and Y-radius of the image circle/ellipse for a fisheye
% camera.
%
% Note the image circle logic needs to be modified for cameras with a
% different parametrization, for example, having non-zero skew.
function radius = image_circle(fisheye_camera)
  % Half-angle of the desired image circle. Note the image circle should span
  % at most 180 degrees, but it is OK to make it smaller to avoid peripheral
  % regions with poor quality.
  theta = pi / 2;
  % Normalized distance to the image center.
  d = fisheye_camera.radial_distortion;
  normalized_r = theta + theta^3 * d(1) + theta^5 * d(2) + theta^7 * d(3);
  % Radii along the X and Y axes.
  radius = normalized_r * fisheye_camera.focal_length...
           * [1, fisheye_camera.pixel_aspect_ratio];
end
```