Real-Time View Interpolation for Eye Gaze Corrected Video Conferencing

DUMONT, Maarten

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/19695

Title:	Real-Time View Interpolation for Eye Gaze Corrected Video Conferencing
Authors:	DUMONT, Maarten
Advisors:	BEKAERT, Philippe; LAFRUIT, Gauthier
Issue Date:	2015
Abstract:	Conventional video conferencing (e.g. Skype with a webcam) suffers from some fundamental flaws that keep it from attaining a true sense of immersivity and copresence and thereby emulating a real face-to-face conversation. Not least, it does not allow its users to look directly into each other’s eyes. The webcam is usually set up next to the screen or at best integrated into the bezel. This forces the user to alternate his gaze between looking at the screen to observe his remote conferencing partner and looking into the webcam. It is this conflict between both viewing directions that stands in the way of experiencing true eye contact. This issue of missing eye contact is the central problem to solve in this dissertation.

An Image-Based Approach
We opt for an image-based approach to solving our problem, meaning that we synthesize an eye gaze corrected image from real-world (live) captured images. In the newly reconstructed image, the user’s gaze will be corrected and thus the conflict between viewing directions will no longer be present. By using live imagery, we avoid the more artificial look and feel of many previous solutions that employ model-based reconstructions or avatar-based representations. Specifically, we investigate three main view synthesis algorithms to reach our goal. This results in contributions to environment mapping, disparity estimation from rectified stereo, and plane sweeping. By designing and implementing all algorithms for and on the GPU exclusively, we take advantage of its massively parallel processing capabilities and guarantee real-time performance and future-proof scalability. This strategy of exploiting the GPU for general (non-graphical) computations is known as general-purpose GPU (GPGPU) computing. Although developed here to correct eye gaze in video conferencing, our algorithms are more generally applicable to any type of scene and usage scenario.

Four Prototypes
We develop four different system prototypes, with each prototype relying on its specific (combination of) view synthesis algorithm(s) to reconstruct the eye gaze corrected image. Each view synthesis algorithm is enabled by a specific configuration of the capturing cameras, allowing us to arrange and present the prototypes according to increasing physical complexity of their camera setup.

Maintaining Camera Calibration
Maintaining the calibration of those cameras, however, may pose a challenge for a prototype that can be subject to a lot of dynamic user activity. Therefore, we first develop an efficient algorithm to detect camera movement and to subsequently reintegrate a single displaced camera into an a priori calibrated network of cameras. Assuming the intrinsic calibration of the displaced camera remains known (physical movement is reflected in the extrinsic parameters), we robustly recompute its extrinsic calibration as follows. First, we compute pairs of essential matrices between the displaced camera and its neighboring cameras using image point correspondences. This provides us with an estimate of a local coordinate frame for each camera pair, with each pair related to the real-world coordinates up to a similarity transformation. From all these estimates, we deduce a (mean) rotation and (intersecting) translation in the common coordinate frame of the previously fully calibrated system. Unlike other approaches, we do not explicitly reconstruct any 3D scene structure, but rely solely on image-space correspondences. We achieve a reprojection error of less than a pixel, comparable to state-of-the-art (de-)centralized network recalibration algorithms.

Prototype 1: Environment Remapping
Our first prototype is also our most outside-the-box solution. It requires only the bare minimum of capturing cameras, namely a single one, together with a single projector for display.
Drawing inspiration from the field of environment mapping, we capture omnidirectional video (in other words, the environment) by filming a spherical mirror (the northern hemisphere) and combine this, after a remap of the captured image, with projection on an identically shaped spherical screen (the southern hemisphere). Both hemispheres are combined into a single full sphere, forming a single communication device that captures from the top and displays at the bottom. The novelty lies in the observation that we do not perform image interpolation in the traditional sense, but rather compose an eye gaze corrected image by remapping the captured environment pixel-to-pixel. We develop the mathematical equations that govern this image transformation by mapping the captured input to the projected output, both interpreted as parallel rays of light under an affine camera model. The resulting equations are completely independent of the scene structure and do not require the recovery of scene depth. Consequently, they have to be precomputed only once, which allows for an extremely lightweight implementation that easily operates in real time on any contemporary GPU and even CPU. Unfolding the environmental reflection captured on a (relatively small) specular sphere yields omnidirectional imagery with a projection center located at the center of that sphere. Consequently, the user looks directly into the camera when looking at the center of the sphere and eye contact is inherently guaranteed. Moreover, the prototype effortlessly supports multiple users simultaneously, unveils their full spatial context and offers them an unprecedented freedom of movement. Its main drawback, however, is the image quality.
It is severely diminished by limitations of the mathematical model and off-the-shelf hardware components.

Edge-Sensitive Disparity Estimation with Iterative Refinement
Our second prototype, presented further below, relies heavily on our novel algorithm for accurate disparity estimation. We make three main contributions. First, we present a matching cost aggregation method that uses two edge-sensitive shape-adaptive support windows per pixel neighborhood. The windows are defined such that they cover image patches of similar color; one window follows horizontal edges in the image, the other vertical edges. Together they form the final aggregation window shape that closely follows all object edges and thereby achieves increased disparity hypothesis confidence. Second, we formalize an iterative process to further refine the estimated disparity map. It consists of four well-defined stages (cross-check, bitwise fast voting, invalid disparity handling, median filtering) and primarily relies on the same horizontal and vertical support windows. By assuming that color discontinuity boundaries in the image are also depth discontinuity boundaries in the scene, the refinement is able to efficiently detect and fill in occlusions. It only requires the input color images as prior knowledge, can be applied to any initially estimated disparity map and quickly converges to a final solution. Third, next to improving the cost aggregation and disparity refinement, we introduce the idea of restricting the disparity search range itself. We observe that peaks in the disparity map’s histogram indicate where objects are located in the scene, whereas noise most likely represents mismatches.
We derive a two-pass hierarchical method: after analyzing the histogram at a reduced image resolution, all disparity hypotheses whose histogram bin value does not reach a dynamically determined threshold (proportional to the image resolution or the histogram entropy) are excluded from the disparity search range at full resolution. Constructing the low-resolution histogram is relatively cheap, so the potential to simultaneously increase the matching quality and decrease the processing complexity (of any local stereo matching algorithm) is very high. Implementation is done in CUDA, a modern GPU programming paradigm that exposes the hardware as a massive pool of directly operable parallel threads and that maps very well to scanline-rectified pixel-wise algorithms. On contemporary hardware, we reach real-time performance of about 12 FPS for the standard resolution (450 × 375) of the Middlebury dataset. Our algorithm is easy to understand and implement and generates smooth disparity maps with sharp object edges and little to no artifacts. It is very competitive with the current state of the art of real-time local stereo matching algorithms.

Prototype 2: Stereo Interpolation
Our second prototype turns to rectified stereo interpolation. We mount two cameras around the screen, one to the left and one to the right, and seat the user horizontally in the middle. We then interpolate the intermediate (and thus eye gaze corrected) viewpoint by following (and extending) the depth-image-based rendering (DIBR) pipeline. This pipeline essentially consists of a disparity estimation and view synthesis stage. The view synthesis is straightforward and very lightweight, but relies heavily on accurate disparity estimation to correctly warp the input pixels to the intermediate viewpoint. On the one hand, the prototype is able to synthesize an eye gaze corrected image that contains very sharp and clearly discernible eyes.
On the other hand, its reliance on stereo matching also gives rise to its biggest disadvantages. First, the user is restricted to moving along the horizontal baseline between the left and right cameras, which makes eye contact difficult to maintain. Second, the small baseline preference of dense stereo matching forces us to either place the cameras around a smaller screen or assume a larger user-to-screen distance to avoid overly large occlusions.

Prototype 3: Plane Sweeping
Our third prototype aims to overcome these shortcomings by mounting six cameras closely around the screen on a custom-made lightweight metal frame. The more general camera configuration avoids large occlusions, but, as such a configuration is no longer suitable for rectified stereo, we must turn to plane sweeping to interpolate the eye gaze corrected image. The flexible plane sweeping algorithm allows us to reconstruct any freely selectable viewpoint, without the need for image extrapolation. Combined with a concurrently running eye tracker to determine the user’s viewpoint, this ensures that eye contact is maintained at all times and from any position and angle. A number of carefully considered design and implementation choices ensure better than real-time performance of about 40 FPS at SVGA resolution (800 × 600) without noticeable loss of visual quality, even on low-end hardware. First, from our strategy for disparity range restriction, we devise a method to efficiently keep a uniform distribution of planes focused around a single dominant object-of-interest (e.g. the user’s head and torso) as it moves through the scene. A Gaussian fit on the histogram of the depth map will indicate the depth (mean) and extent (standard deviation) of the object. We can use this to retroactively respond to movements of the object by dynamically shifting a condensed set of planes back and forth, instead of sweeping the entire space with a sparser distribution.
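
As a rough illustration of the Gaussian-based plane focusing described above, the following minimal Python sketch estimates the mean and standard deviation of a depth histogram and spreads a fixed number of planes uniformly around the mean. It is not the dissertation's GPU implementation: the function names, the moment-based fit (weighted mean and standard deviation instead of an explicit curve fit), and the `spread` factor of two standard deviations are all assumptions made for this example.

```python
import math


def fit_gaussian_to_histogram(bin_centers, counts):
    """Moment-based Gaussian fit (an assumed stand-in for the fit in the text).

    Returns the weighted mean (object depth) and standard deviation
    (object extent) of a depth histogram.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("depth histogram is empty")
    mean = sum(c * n for c, n in zip(bin_centers, counts)) / total
    variance = sum(n * (c - mean) ** 2 for c, n in zip(bin_centers, counts)) / total
    return mean, math.sqrt(variance)


def focused_planes(mean, std, num_planes, spread=2.0, near=0.0, far=float("inf")):
    """Place num_planes uniformly inside [mean - spread*std, mean + spread*std].

    'spread', 'near' and 'far' are hypothetical parameters: the interval is
    clamped to the [near, far] depth range of the scene.
    """
    lo = max(near, mean - spread * std)
    hi = min(far, mean + spread * std)
    if num_planes == 1:
        return [0.5 * (lo + hi)]
    step = (hi - lo) / (num_planes - 1)
    return [lo + i * step for i in range(num_planes)]


# Usage: a toy histogram with most depth samples around depth 2.0.
mean, std = fit_gaussian_to_histogram([1.0, 2.0, 3.0], [1, 2, 1])
planes = focused_planes(mean, std, num_planes=5)
```

Re-running the fit on the depth map of each new frame shifts the same condensed set of planes along with the object, which is the behavior the paragraph above describes.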
Concentrating the planes in this way not only improves the algorithmic performance, but also implicitly increases the accuracy of the plane sweep by significantly reducing the chance of mismatches. Second, we present an iterative spatial filter that removes photometric artifacts from the interpolated image. It does so by detecting and correcting geometric outliers in the jointly linked depth map that is assumed to be locally linear. Third, we use OpenGL and Cg to reprogram the GPU vertex and fragment processing stages of the traditional graphics rendering pipeline, which better suits the inherent structure and scattered memory access patterns of plane sweeping. We further improve the end-to-end performance by developing granular optimization schemes that map well to the polygon-based processing of the traditional graphics pipeline. Finally, a fine-tuned set of user-independent parameters grants the system a general applicability. The result is a fully functional prototype for close-up one-to-one eye gaze corrected video conferencing that has a minimal number of constraints, is intuitive to use and is very convincing as a proof of concept.

Prototype 4: Immersive Collaboration Environment
Our fourth and final prototype is realized after recognizing that current tools for computer-supported cooperative work (CSCW) suffer from two major deficiencies. First, they do not allow users to observe the body language, facial expressions and spatial context of the (remote) collaborators. Second, they lack the ability to naturally and synchronously manipulate objects in a shared environment. We solve these issues by integrating our plane sweeping algorithm for eye gaze correction into an immersive environment that supports collaboration at a distance.
In doing so, we identify and implement five fundamental technical requirements of the ultimate collaborative environment, namely dynamic image-based modeling, subsequent reconstruction and correction for rendering, a spatially immersive display, cooperative surface computing, and aural communication. We also propose our last adaptation of the plane sweeping algorithm to efficiently interpolate a complex scene that contains multiple dominant depths, e.g. when multiple users are present in the environment. This time, we interpret the cumulative histogram of the depth map as a probability density function that describes the likelihood that a plane should be positioned at a particular depth in the scene. The result is a non-uniform plane distribution that responds to a redistribution of any and all content in the scene. Our final prototype truly brings together many key research areas that have been the focus of our institute as a whole over the past years: view interpolation for free viewpoint video, calibration of camera networks, tracking, omnidirectional cameras, multi-projector immersive displays, multi-touch interfaces, and audio processing.

Seven Evaluated Requirements
From practical experience with our prototypes, we learn that other factors besides eye contact contribute to attaining a true sense of immersivity and copresence in video conferencing. Seven constantly recurring requirements have been identified: eye contact (and the related gaze awareness), spatial context, freedom of movement, visual quality, algorithmic performance, physical complexity, and communication modes (one-to-one, many-to-many, multi-party). We discover that they are subject to many trade-offs and interdependencies as we use them to (informally) evaluate and compare all our prototypes. A concise sociability study not only points toward the importance of the seven requirements, but also validates our initial preference for image-based methods.
However, to arrive at the ideal video conferencing solution, more insight should be gained into the concept of presence, what it means to experience a virtual telepresence and exactly what factors enable this experience. Nevertheless, we believe that the seven requirements, together with the experience gained in this dissertation, provide a reference framework on which to design, develop and evaluate any future solution to eye gaze corrected video conferencing.
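
The non-uniform plane distribution mentioned for Prototype 4 can be illustrated with a small Python sketch. It normalizes a depth histogram, accumulates it into a cumulative distribution, and places planes at equally spaced quantiles, so that more planes land where more depth samples are. This is only one plausible reading of the abstract, not the dissertation's implementation: the function name, the midpoint quantiles, and the linear interpolation inside each bin are assumptions made for this example.

```python
def planes_from_cumulative_histogram(bin_edges, counts, num_planes):
    """Place planes at equally spaced quantiles of a depth histogram.

    bin_edges has len(counts) + 1 entries. Depth ranges holding many samples
    receive many planes; empty ranges receive none. (Hypothetical helper.)
    """
    total = float(sum(counts))
    if total == 0:
        raise ValueError("depth histogram is empty")

    # Cumulative histogram, normalized so that it runs from 0 to 1.
    cdf = [0.0]
    for n in counts:
        cdf.append(cdf[-1] + n / total)

    planes = []
    for i in range(num_planes):
        q = (i + 0.5) / num_planes  # midpoint quantile (an assumed choice)
        # Find the first bin whose cumulative value reaches q.
        k = 0
        while k < len(counts) - 1 and cdf[k + 1] < q:
            k += 1
        mass = cdf[k + 1] - cdf[k]
        # Interpolate linearly inside the bin; guard against an empty bin.
        t = (q - cdf[k]) / mass if mass > 0 else 0.5
        planes.append(bin_edges[k] + t * (bin_edges[k + 1] - bin_edges[k]))
    return planes


# Usage: two users at different depths, 80% of the samples near depth 1-2
# and 20% near depth 3-4.
planes = planes_from_cumulative_histogram([0, 1, 2, 3, 4], [0, 8, 0, 2], 5)
```

In this toy case four of the five planes fall inside the dominant depth range and one inside the smaller one, while the empty ranges receive none.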
Document URI:	http://hdl.handle.net/1942/19695
Category:	T1
Type:	Theses and Dissertations
Appears in Collections:	PhD theses; Research publications

Files in This Item:

File	Description	Size	Format
mdumont-phd.pdf		7.71 MB	Adobe PDF
