
The PIXL lunch meets every Monday during the semester at noon in room 402 of the Computer Science building. To get on the mailing list to receive announcements, sign up for the "pixl-talks" list at

Upcoming Talks

Monday, May 01, 2017
Xinyi / Riley

Previous Talks

Monday, February 13, 2017
VoCo 2.0 and Wave/FFT-net
Zeyu Jin

Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to work in a text transcript of the narration and perform select, cut, copy and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. We present a system called VoCo, a text-based speech editor that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. The main idea is to use a text-to-speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor's own voice. An earlier version was presented at Adobe MAX. The first half of this talk will focus on the latest version of VoCo, submitted to SIGGRAPH.

VoCo has generated quite a bit of buzz lately, but it is not perfect. Since it is a data-driven method that concatenates audio snippets to form new words, the quality degrades from segmentation error and data insufficiency. We would like to address this problem with parametric methods. Although existing parametric methods introduce artifacts that are unfit for VoCo, we borrowed the idea from WaveNet and devised a parameter-to-waveform generator called FFT-net. While WaveNet works as a reenactment of the mathematical structure of wavelets with coefficients replaced by neural connections, FFT-net mimics the fast Fourier transform. Our experiment shows that FFT-net can produce an almost perfect audio signal from widely used parametric representations such as MFCC+pitch. In the second half of this talk I would like to demonstrate its unprecedented strength in text-to-speech synthesis, transformation, audio feature sonification and, most importantly, making a parametric-model-based VoCo.

Monday, February 20, 2017
Studying the Internet of (Any)Things: Confessions of a Human-Computer Interaction Researcher

The Internet has become a part of our daily lives, yet we often do not stop to consider how users interact with this infrastructure or what is needed to make the Internet of any and every thing run smoothly in everyday activities. In this talk, I present my current and ongoing work that focuses on helping users understand, gain control over, and make more effective use of Internet infrastructure in their day-to-day tasks. This will entail research confessions from the perspective of human-computer interaction and cover projects in ubiquitous computing, usable security, and information and communications technologies for development.

Marshini Chetty is a research scholar in the Department of Computer Science at Princeton University specializing in human-computer interaction and ubiquitous computing. Marshini designs, implements, and evaluates technologies to help users manage different aspects of Internet use, from security to performance. She often works in resource-constrained settings and uses her work to help inform policy. She has a Ph.D. in Human-Centered Computing from Georgia Institute of Technology, USA, and a Master's and Bachelor's in Computer Science from University of Cape Town, South Africa. Her passions are all things broadband-related and trying to make the world a better place, one bit at a time.

Monday, February 27, 2017
Ongoing project
Linguang Zhang

Detecting reliable interest points from a single image is often a critical procedure for many computer vision tasks. Previous interest point detectors are either handcrafted (corners and blobs) or learned by implicitly using handcrafted features as supervision. Most interest point detectors are designed and optimized for natural images. When the application domain is relatively limited, such as texture images, medical images and paintings, the definition of a reliable interest point is often unclear. In addition, previous interest point detectors often suffer from low repeatability and weak global distinctiveness. We believe that a more effective interest point detector can be learned in an unsupervised manner where no handcrafted definition is required. Such a detector will only retain interest points with both high repeatability and strong distinctiveness. Specifically, we plan to train a fully convolutional neural network that takes in an input image and outputs a heat map where pixels with high values correspond to good interest points.
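As a rough illustration of the last step only, the sketch below shows one common way to turn a heat map into discrete interest points: keep pixels that exceed a threshold and are strict local maxima in their 8-connected neighbourhood. The function name `extract_interest_points`, the threshold value, and the toy heat map are assumptions made for this example; the abstract does not say how points are read off the network output, and the network itself is not modelled here.

```python
def extract_interest_points(heat_map, threshold=0.5):
    """Return (row, col, score) for thresholded strict local maxima.

    heat_map is a list of equal-length rows of floats. This is a
    hypothetical post-processing step, not the method from the talk.
    """
    rows = len(heat_map)
    cols = len(heat_map[0]) if rows else 0
    points = []
    for r in range(rows):
        for c in range(cols):
            score = heat_map[r][c]
            if score < threshold:
                continue
            # Collect the 8-connected neighbours, clipped at the borders.
            neighbours = [
                heat_map[nr][nc]
                for nr in range(max(0, r - 1), min(rows, r + 2))
                for nc in range(max(0, c - 1), min(cols, c + 2))
                if (nr, nc) != (r, c)
            ]
            # Keep the pixel only if it beats every neighbour.
            if all(score > n for n in neighbours):
                points.append((r, c, score))
    # Strongest responses first.
    points.sort(key=lambda p: p[2], reverse=True)
    return points


# Toy 4x4 heat map with two clear peaks (made-up values).
heat_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.7],
    [0.0, 0.1, 0.4, 0.3],
]
print(extract_interest_points(heat_map))  # [(1, 1, 0.9), (2, 3, 0.7)]
```

A real pipeline would likely operate on arrays and use a larger suppression window; the nested lists here just keep the example dependency-free.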

Monday, March 06, 2017
Rethinking object detection in computer vision
Alex Berg

Object detection is one of the core problems in computer vision and is a lens through which to view the field. It brings together machine learning (classification, regression), the variation in appearance of objects and scenes with pose and articulation (lighting, geometry), and the difficulty of what to recognize for what purpose (semantics), all in a setting where computational complexity is not something to talk about in abstract terms, but matters every millisecond for inference and where it can take exaflops to train a model (computation).

I will talk about our ongoing work attacking all fronts of the detection problem. One is the speed-accuracy trade-off, which determines the settings where it is reasonably possible to use detection. Our work on single shot detection (SSD) is currently the leading approach [1,2]. Another direction is moving beyond detecting the presence and location of an object to detecting 3D pose. We are working on both learning deep-network models of how visual appearance changes with pose and object [3], as well as integrating pose estimation as a first class element in detection [4].

One place where pose is especially important is for object detection in the world around us, e.g., in robotics, as opposed to on isolated internet images without context. I call this setting "situated recognition". A key illustration that this setting is under-addressed is the lack of work in computer vision on the problem of active vision, where perception is integrated in a loop with sensor platform motion, a key challenge in robotics. I will present our work on a new approach to collecting datasets for training and evaluating situated recognition, allowing computer vision researchers to study active vision, for instance training networks using reinforcement learning on densely sampled data of real RGBD imagery without the difficulty of operating a robot in the training loop. This is a counterpoint to recent work using simulation and CG for such reinforcement learning, where our use of real images allows studying and evaluating real-world perception.

I will also briefly mention our lower-level work on computation for computer vision and deep learning algorithms and building tools for implementation on GPUs and FPGAs, as well as other ongoing projects.

Collaborators for major parts of this talk:
UNC students: Wei Liu, Cheng-Yang Fu, Phil Ammirato, Ric Poirson, Eunbyung Park
Outside academic collaborator: Prof. Jana Kosecka (George Mason University)
Adobe: Duygu Ceylan, Jimei Yang, Ersin Yumer
Google: Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed
Amazon: Ananth Ranga, Ambrish Tyagi

[1] SSD: Single Shot MultiBox Detector. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. ECCV 2016.

[2] DSSD: Deconvolutional Single Shot Detector. Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, Alexander C. Berg. arXiv preprint arXiv:1701.06659.

[3] Transformation-Grounded Image Generation Network for Novel 3D View Synthesis. Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, Alexander C. Berg. To appear, CVPR 2017.

[4] Fast Single Shot Detection and Pose Estimation. Patrick Poirson, Philip Ammirato, Cheng-Yang Fu, Wei Liu, Jana Kosecka, Alexander C. Berg. 3DV 2016.

[5] A Dataset for Developing and Benchmarking Active Vision. Phil Ammirato, Patrick Poirson, Eunbyung Park, Jana Kosecka, Alexander C. Berg. To appear, ICRA 2017.


Alex Berg's research concerns computational visual recognition. His work addresses aspects of computer, human, and robot vision. He has worked on general object recognition in images, action recognition in video, human pose identification in images, image parsing, face recognition, image search, and large-scale machine learning. He co-organizes the ImageNet Large Scale Visual Recognition Challenge, and organized the first Large-Scale Learning for Vision workshop. He is currently an associate professor in computer science at UNC Chapel Hill. Prior to that he was on the faculty at Stony Brook University, a research scientist at Columbia University, and a research scientist at Yahoo! Research. His PhD at U.C. Berkeley developed a novel approach to deformable template matching. He earned a BA and MA in Mathematics from Johns Hopkins University and learned to race sailboats at SSA in Annapolis. In 2013, his work received the Marr Prize.

Monday, March 13, 2017
Representation Learning from Sensorimotor Control

Recent work in deep reinforcement learning has proven very successful. AI agents can now outperform humans at playing Atari games or even Go, and can successfully play complex games such as Doom. This line of work is starting to address tasks closer to the real world, such as visual navigation within realistic interiors, or driving a car. In this talk, I will provide an overview of our early work on building a simulator for human navigation within indoor 3D environments, and frame the research problem as one of learning representations that allow generalization across various tasks in 3D scene understanding.

This work is done in collaboration with Angel Chang, Alexey Dosovitskiy, Vladlen Koltun, and Tom Funkhouser.

Monday, March 27, 2017
Learning Where to Look: Data-Driven Viewpoint Set Selection for 3D Scenes

The use of rendered images, whether from completely synthetic datasets or from 3D reconstructions, is increasingly prevalent in vision tasks. However, little attention has been given to how the selection of viewpoints affects the performance of rendered training sets. In this paper, we propose a data-driven approach to view set selection. Given a set of example images, we extract statistics describing their contents and generate a set of views matching the distribution of those statistics. Motivated by semantic segmentation tasks, we model the spatial distribution of each semantic object category within an image view volume. We provide a search algorithm that generates a sampling of likely candidate views according to the example distribution, and a set selection algorithm that chooses a subset of the candidates that jointly cover the example distribution. Results of experiments with these algorithms on SUNCG indicate that they are indeed able to produce view distributions similar to an example set from NYUDv2 according to the earth mover's distance. Furthermore, the selected views improve performance on semantic segmentation compared to alternative view selection algorithms.
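To make the "set selection" and "earth mover's distance" ideas above a little more concrete, here is a heavily simplified sketch. It assumes each candidate view is summarized by a single 1D histogram over ordered, unit-spaced bins, and it greedily adds the view whose pooled histogram is closest to the example histogram. The names `emd_1d` and `select_views`, the greedy strategy, and the toy data are all assumptions for illustration; the talk's actual statistics (per-category spatial distributions in a view volume) and selection algorithm are not specified in the abstract.

```python
def emd_1d(p, q):
    """Earth mover's distance between two 1D histograms.

    Both histograms are normalized to unit mass first, and bins are
    assumed to be ordered and one unit apart. Histograms with zero
    total mass are not supported.
    """
    total_p, total_q = sum(p), sum(q)
    carried = 0.0   # mass that still has to be moved to the next bin
    distance = 0.0
    for a, b in zip(p, q):
        carried += a / total_p - b / total_q
        distance += abs(carried)
    return distance


def select_views(candidates, target, k):
    """Greedily choose up to k candidate indices (hypothetical helper).

    At each step, add the candidate whose histogram, pooled with the
    ones already chosen, is closest to the target under emd_1d.
    """
    chosen = []
    pooled = [0.0] * len(target)
    remaining = list(range(len(candidates)))
    for _ in range(min(k, len(candidates))):
        def cost(i):
            merged = [x + y for x, y in zip(pooled, candidates[i])]
            return emd_1d(merged, target)
        best = min(remaining, key=cost)
        chosen.append(best)
        remaining.remove(best)
        pooled = [x + y for x, y in zip(pooled, candidates[best])]
    return chosen


# Made-up example: a target histogram and four candidate views.
target = [1, 2, 1]
candidates = [[4, 0, 0], [0, 4, 0], [0, 0, 4], [1, 2, 1]]
print(select_views(candidates, target, 2))  # [3, 1]
```

Greedy selection is only one possible strategy and gives no optimality guarantee; it is used here because it is short and easy to follow.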

Monday, April 03, 2017
Elena Sizikova

Monday, April 10, 2017
Triggering Artwork Swaps for Live Animation
Nora Willett

Recently, live animations of 2D characters have been used extensively on streaming platforms and broadcasts of TV shows. These characters respond to audience interaction and improvise conversations. While character pose can be adjusted by rotating, translating or warping components of the artwork, extreme pose changes require discrete artwork swaps. Current systems for animating these characters use a camera for head movements, audio for lip syncing and keyboard shortcuts for triggering swaps. Controlling all of these variables during a live performance is challenging. We present a multi-touch interface to facilitate triggering artwork swaps during a live performance. Our key contributions are an automatic layout design for iconographic triggers representing artwork changes, and a predictive triggering model trained on practice performances to suggest next poses during a live performance. We describe validation experiments on our predictive model, the results of a study with novice users, and interviews with professional animators.
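One simple way to picture a predictive triggering model is a first-order Markov chain over poses: count which pose followed which in the practice performances, then suggest the most frequent followers of the current pose. The sketch below does exactly that. It is an assumed stand-in, not the model from the talk, and the function names, pose labels, and practice sequences are invented for the example.

```python
from collections import Counter, defaultdict


def train_transitions(performances):
    """Count pose-to-pose transitions in practice performances.

    performances is a list of pose-label sequences (hypothetical data).
    """
    counts = defaultdict(Counter)
    for sequence in performances:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return counts


def suggest_next(counts, current_pose, n=2):
    """Return up to n poses that most often followed current_pose."""
    followers = counts.get(current_pose, Counter())
    return [pose for pose, _ in followers.most_common(n)]


# Made-up practice sessions for a character with four poses.
practice = [
    ["idle", "wave", "idle", "point"],
    ["idle", "wave", "shrug", "idle"],
    ["wave", "idle", "wave"],
]
model = train_transitions(practice)
print(suggest_next(model, "idle"))  # ['wave', 'point']
print(suggest_next(model, "wave"))  # ['idle', 'shrug']
```

In an interface like the one described, such suggestions could be used to highlight or enlarge the most likely triggers, while every other swap stays reachable.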

Monday, April 17, 2017
State of the Art in Methods and Representations for Fabrication-Aware Design "Meta-talk"
Amit Bermano

In about two weeks' time, I will be giving a 90-minute talk at Eurographics, presenting our latest State-of-The-ARt (STAR) report on computational fabrication. Since none of us really has a lot of experience with STAR talks, I would like to get as much feedback from the group as possible. In this talk, I will describe how the STAR talk will be structured, going over a fairly mature slide set. This "Meta-talk" should give the audience a clear idea of what the full 90-minute talk will be like, including timing, while still leaving plenty of time for discussion.

Monday, April 24, 2017
Understanding What is Behind You: Predicting 3D Structure and Semantics Beyond the Field of View
Shuran Song

Because of physical and cost constraints, many perception systems in robotic applications like autonomous drones and self-driving vehicles have a very limited field of view. However, the ability to understand the whole surrounding environment beyond the current observation is important for high-level reasoning and decision making. While there have been many inspiring works on predicting raw RGB pixel values for small unobserved image regions, such as image inpainting and texture synthesis, the predicted raw RGB pixels cannot be directly used for high-level planning. In this project, we focus on predicting the 3D structure and semantics of the surrounding environment outside the field of view. Specifically, given a semantically segmented RGB-D image, our goal is to predict the segmentation and depth maps of the environment that is not observed within the input view. To do that, we make use of RGB-D panorama images from real-world (Matterport 3D) and synthetic (SUNCG) scenes to train a multi-task generative adversarial network.