Princeton > CS Dept > PIXL > Lunch

The PIXL lunch meets every Monday during the semester at noon in room 402 of the Computer Science building. To receive announcements, sign up for the "pixl-talks" mailing list.

Upcoming Talks

No talks scheduled yet.

Previous Talks

Monday, September 17, 2018

We will assign future talks and briefly discuss what everyone has worked on over the summer.

Monday, September 24, 2018
Attentive Human Action Recognition
Minh Hoai Nguyen, Stony Brook University

Enabling computers to recognize human actions in video has the potential to revolutionize many areas that benefit society, such as clinical diagnosis, human-computer interaction, and social robotics. Human action recognition, however, is tremendously challenging for computers due to the subtlety of human actions and the complexity of video data. Critical to the success of any human action recognition algorithm is its ability to attend to the relevant information at both training and prediction time.

In the first part of this talk, I will describe a novel approach for training human action classifiers, one that can explicitly factorize human actions from the co-occurring context. Our approach utilizes conjugate samples, which are video clips that are contextually similar to human action samples, but do not contain the actions. Our approach enables the classifier to attend to the relevant information and improve its performance in recognizing human actions under varying context.

In the second part of this talk, I will describe a method for early recognition of human actions, one that can take advantage of multiple cameras. To account for limited communication bandwidth and processing power, we learn a camera selection policy so that the system can attend to the most relevant information at each time step. This problem is formulated as a sequential decision process, and the attention policy is learned with reinforcement learning. Experiments on several datasets demonstrate the effectiveness of this approach for early recognition of human actions.

Minh Hoai Nguyen is an Assistant Professor of Computer Science at Stony Brook University. He received a Bachelor of Software Engineering from the University of New South Wales in 2006 and a Ph.D. in Robotics from Carnegie Mellon University in 2012. His research interests are in computer vision and machine learning. In 2012, Nguyen and his coauthor received the Best Student Paper Award at the IEEE Conference On Computer Vision and Pattern Recognition (CVPR).

Monday, October 01, 2018
Accelerating Neural Networks Using Box Filters
Linguang Zhang

This is a project at an early stage, and we sincerely solicit feedback. In a neural network, a large receptive field is typically achieved through a stack of small filters (e.g., 3x3), pooling layers, or dilated convolutions. These methods all suffer from a drastic increase in computational cost or memory footprint when the network architecture needs to be adjusted for a larger receptive field. In this project, we explore the possibility of allowing a convolution layer to have an arbitrarily large receptive field at a constant cost, using box filters. The intuition is that any convolution kernel can be approximated with multiple box filters, and the result of a box filter convolving with an image can be computed cheaply using a summed area table, with running time invariant to the size of the filter. This method could potentially be useful for vision applications that require large receptive fields but cannot afford a high computational cost.
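The key fact this project relies on, that a box filter costs the same no matter how large it is, can be illustrated with a small NumPy sketch. This is a generic illustration of the summed area table trick, not the project's code, and `box_filter_sum` is a name chosen here for the example:

```python
import numpy as np

def box_filter_sum(image, k):
    """Sum over every k x k window, computed from a summed area table
    (integral image). Cost per output pixel is O(1), independent of k."""
    h, w = image.shape
    # Integral image, padded with a leading row/column of zeros so that
    # sat[i, j] == image[:i, :j].sum().
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    # Each window sum needs only four corner lookups (inclusion-exclusion).
    return sat[k:, k:] - sat[:-k, k:] - sat[k:, :-k] + sat[:-k, :-k]
```

Dividing the result by k*k gives the box (mean) filter itself, and a weighted combination of a few box responses of different sizes approximates a larger smooth kernel at the same constant per-pixel cost.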

Monday, October 08, 2018
Weifeng Chen

Monday, October 15, 2018
Yuting Yang

Monday, October 22, 2018
Aishwarya Agrawal, Georgia Tech

Monday, November 05, 2018
Unifying Regression and Classification for Human Pose Estimation
Fangyin Wei

State-of-the-art human pose estimation methods are based on the heat map representation. Despite its good performance, the representation has some inherent issues, such as non-differentiable post-processing and quantization error. The work presented in this talk shows that a simple integral operation relates and unifies the heat map representation and joint regression, thus avoiding these issues. It is differentiable, efficient, and compatible with any heat-map-based method. Its effectiveness is convincingly validated via comprehensive ablation experiments under various settings, in particular on 3D pose estimation, for the first time. This method was used by the top two teams in the COCO 2018 Keypoint Detection Challenge. If time permits, another work on learning disentangled representations will also be presented.
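An integral operation of this kind is often implemented as a soft-argmax: normalize the heat map, then take the expected coordinate, which keeps the whole pipeline differentiable. A minimal NumPy sketch of that idea follows; it is a generic illustration rather than the speaker's implementation, and `soft_argmax_2d` is a name chosen here:

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Differentiable joint location from a heat map: softmax-normalize,
    then return the expected (x, y) coordinate over the grid."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())  # shift for numerical stability
    p /= p.sum()                         # probabilities over all pixels
    ys, xs = np.mgrid[0:h, 0:w]
    return float((p * xs).sum()), float((p * ys).sum())
```

Because the output is a weighted average over all grid positions, it is sub-pixel accurate and avoids the quantization error of a hard argmax over the heat map.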

Monday, November 12, 2018
Jiaqi Su, Elena Balashova

Jiaqi Su: Perceptually-Motivated Environment-Specific Speech Enhancement

We introduce a data-driven method to enhance speech recordings made in a specific environment. The method handles denoising, dereverberation, and equalization matching due to recording non-linearities in a unified framework. It relies on a new perceptual loss function that combines an adversarial loss with spectrogram features. We show that the method improves over state-of-the-art baseline methods in both subjective and objective evaluations.

Elena Balashova: Structure-Aware Shape Synthesis

We propose a new procedure to guide training of a data-driven shape generative model using a structure-aware loss function. Complex 3D shapes can often be summarized by a coarsely defined structure which is consistent and robust across a variety of observations. However, existing synthesis techniques do not account for structure during training, and thus often generate implausible and structurally unrealistic shapes. During training, we enforce structural constraints in order to ensure consistency and structure across the entire manifold. We propose a novel methodology for training 3D generative models that incorporates structural information into an end-to-end training pipeline.

Monday, November 19, 2018
Pose2Pose: Pose Selection and Transfer for 2D Character Animation
Nora Willett

When trying to create a 2D animated character which characterizes an actor or is designed for a specific performance, it is challenging to determine the type of arm and hand poses necessary for the most expressive animation. Our system helps a user select and transfer poses of an actor from an input video to an animated character. After extracting pose data from the video, we cluster similar arm and hand poses and expose them to the user for selection. After the user chooses the best poses for capturing the style and personality of the character, we can automatically assign poses to new video frames creating 2D animations in a variety of styles.

Monday, November 26, 2018
Learning from Synthetic Data through Joint Training and Evolution
Dawei Yang

For training deep neural networks, synthetic images have been widely used as an alternative to real images in order to reduce the effort of collecting data. However, networks pre-trained on synthetic data still need to be fine-tuned on real data due to the domain difference, and therefore manual collection is still required. In this talk, I will present our approach that avoids such expensive manual collection of large-scale image datasets. We achieve this by joint network training and dataset evolution. I will first show how our approach can outperform the state of the art on a shape-from-shading benchmark (past work), and then discuss how we can extend our approach to 3D reconstruction of indoor scenes (ongoing work).

Monday, December 03, 2018
Reconstructing Static and Dynamic Scenes from Video
Zachary Teed

As the abundance of digital cameras continues to increase, so does the rate of monocular video production, and there is a lot to learn from this massive source of data. One of the most fundamental problems of video understanding is recovering a 3D representation of the scene, including reasoning about dynamic objects and how the underlying 3D representation changes over time. To estimate 3D structure from video, we have proposed DeepV2D, a system which can reconstruct depth from a monocular video clip. DeepV2D is built by "differentiablizing" classical geometric algorithms and integrating them into an end-to-end trainable architecture. We show that by alternating between motion estimation and stereo reconstruction we can converge to high-quality depth estimates on several challenging benchmarks. However, one limitation of this reconstruction pipeline is its inability to recover the depth of dynamic objects, some of the most important parts of a scene. We are working on a new method which decomposes video into a set of rigidly moving components, and then estimates the 3D motion and structure of each of these objects.

Monday, December 10, 2018
SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition
Kaiyu Yang

Understanding the spatial relations between objects in images is a surprisingly challenging task. A chair may be "behind" a person even if it appears to the left of the person in the image (if the person is facing right). The predicate "on" entails very different spatial configurations in different contexts: a sweatshirt "on" a person versus a hat "on" a person versus a leash "on" a dog. Reasoning about spatial relations may further require understanding other objects in the scene: for example, whether two students are "next to" each other depends on whether there is a third student between them.

We introduce SpatialSense, a dataset specializing in spatial relation recognition which captures a broad spectrum of such challenges, allowing for proper benchmarking of computer vision techniques. SpatialSense is constructed through adversarial crowdsourcing, in which human annotators are tasked with finding spatial relations that are difficult to predict using simple cues such as 2D spatial configurations or language priors. Adversarial crowdsourcing significantly reduces dataset bias and samples much more interesting relations in the long tail compared to existing datasets. On SpatialSense, state-of-the-art recognition models perform comparably to simple baselines, suggesting that they rely on straightforward cues instead of fully reasoning about this complex task. The SpatialSense benchmark provides a path forward to advancing the spatial reasoning capabilities of computer vision systems.