Statistical Learning and Computation for BigData Analysis

Fall 2014, Thursday Noons, Hill Center
11:40am -- 12pm (Room 502, Lunch)
12:00pm -- 1pm (Room 552, Talk)
Rutgers Busch Campus, 110 Frelinghuysen Rd Piscataway

The BigData seminar meets weekly or biweekly for presentations by invited researchers emphasizing either the theory or practice of statistical learning with BigData. Lunch will be served starting at 11:40am, with the talks running from 12pm to 1pm.

September 25, 2014

Speaker: Kilian Weinberger, Associate Professor in Computer Science, Washington University

Title: Learning with Marginalized Corruption

Abstract: If infinite amounts of labeled data are provided, many machine learning algorithms become perfect. With finite amounts of data, regularization or priors have to be used to introduce bias into a classifier. We propose a third option: learning with marginalized corrupted features. We (implicitly) corrupt existing data as a means to generate additional, infinitely many, training samples from a slightly different data distribution; this is computationally tractable because the corruption can be marginalized out in closed form. Our framework leads to machine learning algorithms that are fast, generalize well and naturally scale to very large data sets. We showcase this technology in the context of risk minimization with linear classifiers and deep learning for domain adaptation. We further show that our framework is not limited to features: marginalized corrupted labels and graph edges have promising applications in tag prediction of natural images and label propagation within protein-protein interaction networks.
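As a concrete illustration of the marginalization idea (a minimal numerical sketch with synthetic data, assuming blankout noise and the quadratic loss; not code from the talk): if each feature is independently zeroed with probability q, the expected loss over infinitely many corrupted copies has a simple closed form, which a Monte Carlo average over explicit corruptions reproduces.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)   # one training example (synthetic)
y = 1.0                  # its label
w = rng.normal(size=5)   # a weight vector
q = 0.3                  # blankout (dropout-style) corruption probability

# Closed form: E[x~_d] = (1-q) x_d and Var[x~_d] = q(1-q) x_d^2, so
# E[(y - w.x~)^2] = y^2 - 2 y (w.mu) + (w.mu)^2 + sum_d w_d^2 Var[x~_d]
mu = (1 - q) * x
var = q * (1 - q) * x ** 2
closed = y ** 2 - 2 * y * (w @ mu) + (w @ mu) ** 2 + (w ** 2) @ var

# Monte Carlo: average the loss over many explicitly corrupted copies
m = 200_000
mask = rng.random((m, x.size)) > q          # keep each feature w.p. 1-q
losses = (y - (mask * x) @ w) ** 2
print(closed, losses.mean())                # the two agree closely
```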

Bio: Kilian Q. Weinberger is an Associate Professor in the Department of Computer Science & Engineering at Washington University in St. Louis. He received his Ph.D. from the University of Pennsylvania in Machine Learning under the supervision of Lawrence Saul and his undergraduate degree in Mathematics and Computer Science from the University of Oxford. During his career he has won several best paper awards at ICML, CVPR, AISTATS and KDD (runner-up award). In 2011 he was awarded the Outstanding AAAI senior program chair award and in 2012 he received an NSF CAREER award. Kilian Weinberger’s research is in Machine Learning and its applications. In particular, he focuses on high dimensional data analysis, resource efficient learning, metric learning, machine learned web-search ranking, transfer- and multi-task learning as well as biomedical applications. Before joining Washington University in St. Louis, Kilian worked as a research scientist at Yahoo! Research in Santa Clara.

Spring 2014, Thursday Noons, Hill Center
11:40am -- 12pm (Room 502, Lunch)
12:00pm -- 1pm (Room 552, Talk)
Rutgers Busch Campus, 110 Frelinghuysen Rd Piscataway

January 30th, 2014

Speaker: Stephen Burley, Distinguished Professor in the Department of Chemistry and Chemical Biology, Rutgers University

Title: Integrative Structural Biology and the Big Data Revolution: Role of the Protein Data Bank

Abstract: The mission and activities of the Protein Data Bank will be described in some detail, with particular emphasis on the challenges and opportunities presented by the development of integrative structural biology.

Director, Center for Integrative Proteomics Research
Associate Director, RCSB Protein Data Bank
Director, BioMaPS Institute for Quantitative Biology
Distinguished Professor, Department of Chemistry and Chemical Biology
Member, Rutgers Cancer Institute of New Jersey
Rutgers, The State University of New Jersey

February 13, 2014

Speaker: Visa Koivunen, Academy Professor, Aalto University, Finland; Visiting Professor, Princeton University (on sabbatical).
(Due to severe weather conditions, this talk was cancelled. Please attend the SIP/EE seminar on 2/21/14 if you are still interested in hearing this talk.)

Title: Tensor models and techniques for analyzing high-dimensional data

Abstract: Analyzing high-dimensional and high-volume datasets is of interest in many fields of engineering and science. For example, modeling of multidimensional MIMO channels, analyzing sensor data collected by mobile terminals, and analyzing fMRI and surveillance data are among the emerging application areas. Tensors, which in the simplest definition are multi-way arrays, accommodate high-dimensional data sets naturally. Various tensor decompositions based on multilinear models are powerful tools to explore and reveal important information in high-dimensional data sets. In this presentation we will give a tutorial overview of tensor representations and methods for data analysis. In particular, tensor factorization methods and low-rank modeling techniques are considered. In addition, recent advances in sparse methods and regularization for reducing dimensionality, simplifying visualization and selecting variables when employing tensor models are introduced. Furthermore, statistically robust procedures for analyzing tensor data are proposed and their performance is studied in the face of outliers.
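To make the matricization step behind such decompositions concrete, here is a toy example (illustrative only, not material from the talk): unfold a 3-way array along mode 1 and fit a rank-1 approximation to the unfolding via truncated SVD.

```python
import numpy as np

T = np.arange(24, dtype=float).reshape(2, 3, 4)   # a small 3-way tensor, shape (I, J, K)
T1 = T.reshape(2, -1)                             # mode-1 unfolding: I x (J*K)

# Best rank-1 approximation of the unfolding (Eckart-Young theorem)
U, s, Vt = np.linalg.svd(T1, full_matrices=False)
T1_hat = s[0] * np.outer(U[:, 0], Vt[0])
rel_err = np.linalg.norm(T1 - T1_hat) / np.linalg.norm(T1)
print(rel_err)   # small: this particular unfolding is close to rank 1
```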

Bio: Visa Koivunen (IEEE Fellow) received his D.Sc. (EE) degree with honors from the University of Oulu, Finland. He received the primus doctor (best graduate) award among the doctoral graduates in the years 1989-1994. He is a member of Eta Kappa Nu. From 1992 to 1995 he was a visiting researcher at the University of Pennsylvania, Philadelphia, USA, and from 1997 to 1999 he was on the faculty at Tampere University of Technology. Since 1999 he has been a full Professor of Signal Processing at Helsinki University of Technology, Finland, now known as Aalto University. He was awarded the Academy Professor position (distinguished professor nominated by the Academy of Finland) for the years 2010-2014. He is one of the Principal Investigators in the SMARAD Center of Excellence in Research nominated by the Academy of Finland. From 2003 to 2006 he was also an adjunct full professor at the University of Pennsylvania, Philadelphia, USA. During his sabbatical in 2007 he was a Visiting Fellow at Princeton University, NJ, USA, and he has been a visiting fellow at Princeton multiple times since; he is currently on sabbatical there for the full academic year 2013-2014. He was a part-time Visiting Fellow at Nokia Research Center (2006-2012). Dr. Koivunen's research interests include statistical, communications and sensor array signal processing. He has published about 350 papers in international scientific conferences and journals. He co-authored the papers receiving the best paper award at IEEE PIMRC 2005, EUSIPCO 2006, EUCAP 2006 and COCORA 2012, and was awarded the IEEE Signal Processing Society best paper award for the year 2007 (with J. Eriksson). He served as an associate editor for IEEE Signal Processing Letters and IEEE Transactions on Signal Processing, is co-editor of an IEEE JSTSP special issue on Smart Grids, and is a member of the editorial board of IEEE Signal Processing Magazine. He has been a member of the IEEE Signal Processing Society technical committees SPCOM-TC and SAM-TC, and was the general chair of the IEEE SPAWC 2007 conference in Helsinki, Finland, in June 2007.

February 28, 2014
(Note the special date)

Speaker: Vincent Poor, Professor of Electrical Engineering at Princeton University. (Note the talk is on Friday, not Thursday.)

Title: Privacy in the Smart Grid: An Information Theoretic Framework

Abstract: The proliferation of electronic data generated in smart grid and other applications has made potential leakage of private information through such data an important issue. This talk will first describe a fundamental information theoretic framework for examining, in a general setting, the tradeoff between the privacy of data and its measurable benefits. This framework will then be used to investigate two problems arising in smart grid. The first of these is smart-meter privacy, in which the tradeoff between the privacy of information that can be inferred from meter data and the usefulness of that data is examined. The second is competitive privacy, which models situations in which multiple parties (e.g., power companies) need to exchange information to collaborate on tasks (e.g., management of a shared grid) without revealing company-sensitive data.

Bio: H. Vincent Poor is the Michael Henry Strater University Professor of Electrical Engineering at Princeton, where he is also Dean of the School of Engineering and Applied Science. His research interests are in the areas of information theory, statistical signal processing and stochastic analysis, and their applications in smart grid, wireless networks and related fields. His publications in these areas include the recent book Mechanisms and Games for Dynamic Spectrum Allocation, published by Cambridge University Press in 2014. Dr. Poor is a Fellow of the IEEE and a member of the National Academy of Engineering and the National Academy of Sciences.

March 13, 2014

Speaker: Kenneth W. Church, IBM Research

Title:Big Data Goes Mobile

Abstract: What is "big"? Time & space? Expense? Pounds? Power? Size of machine? Size of market? We will discuss many of these dimensions, but focus on throughput and latency (mobility of data). If our clouds can't import and export data at scale, they may turn into roach motels: data can check in, but it can't check out. DataScope is designed to make it easy to import and export hundreds of terabytes on disks. Amdahl's laws have stood up remarkably well to the test of time. These laws explain how to balance memory, cycles and IO. There is an opportunity to extend these laws to balance for mobility.
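For reference, the classic Amdahl balance heuristics (roughly one bit of I/O per instruction executed, and one byte of memory per instruction per second) can be expressed in a few lines; the system figures below are hypothetical, not from the talk.

```python
def amdahl_ratios(mips, io_bits_per_sec, mem_bytes):
    """Return (io_ratio, mem_ratio); values near 1.0 indicate an
    Amdahl-balanced system."""
    instr_per_sec = mips * 1e6
    io_ratio = io_bits_per_sec / instr_per_sec    # bits of I/O per instruction
    mem_ratio = mem_bytes / instr_per_sec         # bytes of memory per instr/sec
    return io_ratio, mem_ratio

# Hypothetical server: 10,000 MIPS, 10 Gbit/s of I/O, 16 GB of memory
io_r, mem_r = amdahl_ratios(10_000, 10e9, 16e9)
print(io_r, mem_r)   # 1.0, 1.6: I/O is balanced, memory slightly over-provisioned
```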


Bio: Ken is currently at IBM working on Siri-like applications of speech on phones. Before that, he was the Chief Scientist of the HLTCOE at JHU. He has worked at Microsoft and AT&T as well. Education: MIT (undergrad and graduate). He enjoys working with large datasets. Back in the 1980s, the Associated Press newswire (1 million words per week) was considered big, but he has since had the opportunity to work with much larger datasets such as AT&T's billing records and Bing's web logs. He has worked on many topics in computational linguistics including: web search, language modeling, text analysis, spelling correction, word-sense disambiguation, terminology, translation, lexicography, compression, speech (recognition and synthesis), OCR, as well as applications that go well beyond computational linguistics such as revenue assurance and virtual integration (using screen scraping and web crawling to integrate systems that traditionally don't talk together as well as they could, such as billing and customer care). Service: past president of ACL and former president of SIGDAT (the organization that organizes EMNLP). Honors: AT&T Fellow.

March 27, 2014

Speaker: Howard Karloff, Yahoo! Labs @ NYC

Title: Maximum Entropy Summary Trees

Abstract: Given a very large, node-weighted, rooted tree on, say, n nodes, if one has only enough space to display a k-node summary of the tree, what is the most informative way to draw the tree? We define a type of weighted tree that we call a "summary tree" of the original tree, which results from aggregating nodes of the original tree subject to certain constraints. We suggest that the best choice of which summary tree to use (among those with a fixed number of nodes) is the one that maximizes the information-theoretic entropy of a natural probability distribution associated with the summary tree, and we provide a (pseudopolynomial-time) dynamic-programming algorithm to compute this maximum entropy summary tree when the weights are integral. The result is an automated way to summarize large trees and retain as much information about them as possible, while using (and displaying) only a fraction of the original node set. We also provide an additive approximation algorithm and a greedy heuristic that are faster than the optimal algorithm, and generalize to trees with real-valued weights. This is joint work with Ken Shirley of AT&T Labs and Richard Cole of NYU.
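The entropy objective itself is simple to state: normalize a candidate summary tree's node weights into a probability distribution and compute its Shannon entropy. The two candidate summaries below are made up for illustration and are not from the paper.

```python
import math

def entropy(weights):
    """Shannon entropy (in bits) of the normalized node-weight distribution."""
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)

# Two hypothetical 3-node summaries of the same 100-unit-weight tree:
balanced = [34, 33, 33]   # aggregation spreading weight evenly
skewed = [90, 5, 5]       # aggregation lumping most weight into one node
print(entropy(balanced), entropy(skewed))
# The maximum-entropy criterion prefers the balanced summary.
```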

Bio: After receiving his PhD from Berkeley, Howard Karloff taught at the University of Chicago and Georgia Tech before leaving Georgia Tech as a full professor to join AT&T Labs--Research in 1999. He left AT&T Labs in 2013 to join Yahoo Labs in New York. An editor of ACM's Transactions on Algorithms and an ACM Fellow, he has served on the program committees of numerous conferences, chaired the 1998 Symposium on Discrete Algorithms (SODA) program committee, was general chair of the 2012 Symposium on Theory of Computing (STOC), and will be general chair of STOC 2014. He is the author of numerous journal and conference articles and the Birkhauser book "Linear Programming." His research interests span algorithms and optimization and extend to more applied areas of computer science such as databases, networking, and machine learning.

April 3, 2014

Speaker: Wei Liu, IBM Research

Title: Handling Big Data: A Machine Learning Perspective

Abstract: With the rapid development of the Internet, tremendous amounts of data, including millions or billions of images and videos, can now be collected for training machine learning models. Inspired by this trend, my current work is dedicated to developing large-scale machine learning techniques for the purpose of making classification and nearest neighbor search practical on big data. My first approach is to explore data graphs to aid classification and nearest neighbor search. A graph offers an attractive way of representing data and discovering essential information such as the neighborhood structure. However, both the graph construction process and graph-based learning techniques become computationally prohibitive at large scale. To this end, I propose an efficient large graph construction approach and subsequently apply it to develop scalable semi-supervised learning and unsupervised hashing algorithms. To address other practical application scenarios, I further develop advanced hashing techniques that incorporate supervised information or leverage unique formulations to cope with new forms of queries such as hyperplanes. All of the machine learning techniques I have proposed emphasize and pursue excellent performance in both speed and accuracy. The addressed problems, classification and nearest neighbor search, are fundamental to many practical problems across various disciplines. Therefore, I expect that the proposed solutions based on graphs and hashing will have a tremendous impact on a great number of realistic large-scale applications.

Bio: Wei Liu received the M.Phil. and Ph.D. degrees in electrical engineering from Columbia University, New York, NY, USA in 2012. Currently, he is a research staff member at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, where he was the Josef Raviv Memorial Postdoctoral Fellow for one year beginning in 2012. His research interests include machine learning, data mining, computer vision, and information retrieval. Dr. Liu is the recipient of the 2011-2012 Facebook Fellowship.

April 10, 2014

Speaker: Silvio Lattanzi, Senior Research Scientist at Google Research New York

Title: Large scale graph-mining

Abstract: The amount of data available and requiring analysis has grown at an astonishing rate in recent years. To cope with this deluge of information it is fundamental to design new algorithms to analyze data efficiently. In this talk, we describe our effort to build a large scale graph-mining library. We first describe the general framework and a few relevant problems that we are trying to solve. Then we describe in detail two results: one on local algorithms for clustering, and one on learning from noisy feedback in a crowdsourcing system.

To solve the first problem we develop a random-walk-based method with better theoretical guarantees than all previous work, both in terms of the clustering accuracy and the conductance of the output set. We also prove that our analysis is tight, and perform empirical evaluations to support our theory on both synthetic and real data.

For the second problem we introduce a new model of Gaussian mixtures that captures the setting where the data points correspond to ratings on a set of items provided by users with widely varying expertise. In this setting we study the single-item case and obtain efficient algorithms for the problem, complemented by near-matching lower bounds; we also obtain preliminary results for the multiple-items case.

Joint work with Flavio Chierichetti, Anirban Dasgupta, Ravi Kumar, Vahab Mirrokni and Zeyuan Allen Zhu.

Bio: Silvio Lattanzi is a Senior Research Scientist at Google Research New York. He received his PhD from the Sapienza University of Rome. During his PhD he interned twice at Google Research Mountain View and once at Yahoo! Research Santa Clara, and he also spent a semester visiting the University of Texas at Austin. His main research interests are in large-scale graph mining, information retrieval and probabilistic algorithms. Silvio has published several papers in top-tier conferences in information retrieval, social network analysis and algorithms. He has also served on the program committee or senior program committee of several top conferences, including WWW, WSDM and KDD.

April 24, 2014

Speaker: John Langford, Senior Researcher at Microsoft Research @NYC

Title: Learning to Interact

Abstract: Large quantities of data are not explicitly labeled in the manner of traditional supervised learning. Instead, they come from observations. How do we effectively learn to intervene given this data source? I will address both the process of learning as well as some new work about the process of optimally and efficiently gathering information.

Bio: John Langford is a machine learning research scientist; machine learning, he says, "is shifting from an academic discipline to an industrial tool". He is the author of a machine learning weblog and the principal developer of Vowpal Wabbit. John works at Microsoft Research New York, of which he was one of the founding members, and was previously affiliated with Yahoo! Research, Toyota Technological Institute, and IBM's Watson Research Center. He studied Physics and Computer Science at the California Institute of Technology, earning a double bachelor's degree in 1997, and received his Ph.D. in Computer Science from Carnegie Mellon University in 2002. He was the program co-chair for the 2012 International Conference on Machine Learning.

Fall 2013, Thursday Noons, Hill Center
11:40am -- 12pm (Room 502, Lunch)
12:00pm -- 1pm (Room 552, Talk)
Rutgers Busch Campus, 110 Frelinghuysen Rd Piscataway

October 3rd

Speaker: Edo Liberty, Senior Research Scientist, Yahoo! Research at NYC

Bio: Edo received his B.Sc. in Physics and Computer Science from Tel Aviv University and his Ph.D. in Computer Science from Yale University, under the supervision of Steven Zucker. During his PhD he spent time at both UCLA and Google as an engineer and a researcher. After that, he joined the Program in Applied Mathematics at Yale as a Post-Doctoral Fellow. In 2009 he joined Yahoo! Labs in Israel. He recently moved to New York to lead the machine learning group, which focuses on the theory and practice of (very) large scale data mining and machine learning: in particular, the theoretical foundations of machine learning, optimization, scalable scientific computing, and machine learning systems and platforms.

Title: Simple and Deterministic Matrix Sketches

Abstract: A sketch of a matrix A is another matrix B which is significantly smaller than A, but still approximates it well. Finding such sketches efficiently is an important building block in modern algorithms for approximating, for example, the PCA of massive matrices. This task is made more challenging in the streaming model, where each row of the input matrix can be processed only once and storage is severely limited. In this work, we adapt a well-known streaming algorithm for approximating item frequencies to the matrix sketching setting. Our experiments corroborate the algorithm's scalability and improved convergence rate. The presented algorithm is deterministic, simple to implement, and elementary to prove.
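The algorithm can be sketched in a few lines (my paraphrase of the published method, following its simplest variant; the error bound checked at the end is the one proved in the paper):

```python
import numpy as np

def frequent_directions(A, ell):
    """Deterministic streaming sketch B (ell x d) of A (n x d)."""
    B = np.zeros((ell, A.shape[1]))
    for row in A:
        zero_rows = np.where(~B.any(axis=1))[0]
        if len(zero_rows) == 0:
            # Sketch is full: rotate via SVD and shrink all singular values
            # by the smallest one, freeing at least one row.
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            s = np.sqrt(np.maximum(s ** 2 - s[-1] ** 2, 0.0))
            B = np.diag(s) @ Vt
            zero_rows = np.where(~B.any(axis=1))[0]
        B[zero_rows[0]] = row
    return B

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 20))
B = frequent_directions(A, ell=10)

# Guarantee: ||A^T A - B^T B||_2 <= 2 ||A||_F^2 / ell
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
bound = 2 * np.linalg.norm(A, 'fro') ** 2 / 10
print(err <= bound)   # True
```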

October 17th

Speaker: Ping Li, Department of Statistics & Biostatistics, Department of Computer Science, Rutgers University


Title: Flexible Statistical Modeling from Massive Data by Boosting and Trees (and Comparisons with Deep Learning)

Abstract: Logistic regression has been around for perhaps 100 years. In textbooks, the derivative of the log likelihood is written as (y_k - p_k), where y_k = 0 or 1 is the k-th class label and p_k is the class probability. About 5 years ago, I observed that the derivative can also be written as (y_k - p_k) - (y_0 - p_0) if the 0-th class is used as a baseline, due to the sum-to-zero constraint. The second derivative can be written differently too, of course. It turns out that using these new derivatives can lead to almost unbelievably substantial improvements in classification accuracy in the boosting framework (e.g., MART and LogitBoost).

A side consequence of this project is a fix for the known numerical issue with LogitBoost, which was still being criticized in recent papers (e.g., in Statistical Science in 2008). MART was invented to use only the first derivatives to build the trees, to avoid the numerical problem. It turns out that this numerical problem does not actually exist once we derive a new tree-split criterion. Indeed, using both the first and second derivatives to build the trees (as in LogitBoost) will often lead to more accurate results.
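The baseline-class derivative is easy to check numerically. Below, the sum-to-zero constraint is imposed by setting the class-0 logit to minus the sum of the others, and the closed form (y_k - p_k) - (y_0 - p_0) is compared against finite differences of the log likelihood (synthetic numbers; only an illustration of the identity in the abstract):

```python
import numpy as np

def log_lik(f_free, y, K):
    """Log likelihood with class 0 as baseline: f_0 = -sum(f_1..f_{K-1})."""
    f = np.concatenate(([-f_free.sum()], f_free))
    return f[y] - np.log(np.exp(f).sum())

rng = np.random.default_rng(2)
K = 4
f_free = rng.normal(size=K - 1)   # free logits f_1..f_{K-1}
y = 2                             # observed class label

f = np.concatenate(([-f_free.sum()], f_free))
p = np.exp(f) / np.exp(f).sum()
onehot = np.eye(K)[y]
analytic = (onehot[1:] - p[1:]) - (onehot[0] - p[0])

eps = 1e-6
numeric = np.array([(log_lik(f_free + eps * np.eye(K - 1)[k], y, K)
                     - log_lik(f_free - eps * np.eye(K - 1)[k], y, K)) / (2 * eps)
                    for k in range(K - 1)])
print(np.max(np.abs(analytic - numeric)))   # negligible: the two forms agree
```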

Trees and boosting algorithms have been extremely popular in industry. In fact, they are the default techniques in some search-related companies/divisions. In 2006, I participated in the development of the ranking algorithms at Microsoft. At that time, we already used millions of observations on a single machine. Because tree algorithms can be parallelized easily, the method scales up very nicely.

November 8th

(Jointly with ECE/SIP Seminar)

Speaker: Yann LeCun, Professor, New York University. (Note the special date.)

Title: Computer Perception with Deep Learning

Abstract: Pattern recognition tasks, particularly perceptual tasks such as vision and audition, require the extraction of good internal representations of the data prior to classification. Designing feature extractors that turn raw data into suitable representations for a classifier often requires a considerable amount of engineering and domain expertise.

The purpose of the emergent field of "Deep Learning" is to devise methods that can train entire pattern recognition systems in an integrated fashion, from raw inputs to ultimate output, using a combination of labeled and unlabeled samples.

Deep learning systems are multi-stage architectures in which the perceptual world is represented hierarchically. Features in successive stages are increasingly global, abstract, and invariant to irrelevant transformations of the input.

Convolutional networks (ConvNets) are a particular type of deep architecture, somewhat inspired by biology, consisting of multiple stages of filter banks interspersed with non-linear operations and spatial pooling. Deep learning models, particularly ConvNets, have become the record holders for a wide variety of benchmarks and competitions, including object recognition in images, semantic image labeling (2D and 3D), acoustic modeling for speech recognition, drug design, Asian handwriting recognition, pedestrian detection, road sign recognition, biological image segmentation, etc. The most recent speech recognition and image analysis systems deployed by Google, IBM, Microsoft, Baidu, NEC and others use deep learning, and many use convolutional networks.

A number of supervised methods, as well as unsupervised methods based on sparse auto-encoders, for training deep convolutional networks will be presented. Several applications will be shown through videos and live demos, including a category-level object recognition system that can be trained on the fly, a system that can label every pixel in an image with the category of the object it belongs to (scene parsing), and a pedestrian detector. Specialized hardware architectures that run these systems in real time will also be described.

Bio: Yann LeCun is the founding director of the Center for Data Science at New York University, and Silver Professor of Computer Science, Neural Science, and Electrical Engineering at the Courant Institute of Mathematical Science, the Center for Neural Science, and the ECE Department at NYU-Poly.

He received the Electrical Engineer Diploma from the École Supérieure d'Ingénieurs en Électrotechnique et Électronique (ESIEE), Paris in 1983, and a PhD in Computer Science from the Université Pierre et Marie Curie (Paris) in 1987. After a postdoc at the University of Toronto, he joined AT&T Bell Laboratories in Holmdel, NJ in 1988. He became head of the Image Processing Research Department at AT&T Labs-Research in 1996. He joined NYU as a professor in 2003, after a brief period as a Fellow of the NEC Research Institute in Princeton.

His current interests include machine learning, computer perception and vision, mobile robotics, and computational neuroscience. He has published over 180 technical papers and book chapters on these topics as well as on neural networks, handwriting recognition, image processing and compression, and on dedicated circuits and architectures for computer perception. The character recognition technology he developed at Bell Labs is used by several banks around the world to read checks and was reading between 10 and 20% of all the checks in the US in the early 2000s. His image compression technology, called DjVu, is used by hundreds of web sites and publishers and millions of users to access scanned documents on the Web. His pattern recognition methods, particularly one known as convolutional networks, are deployed in products and services by companies such as AT&T, Google, Microsoft, NEC, and Baidu for document recognition, human-computer interaction, image indexing, speech recognition, and video analytics.

LeCun has been on the editorial board of IJCV, IEEE PAMI, IEEE Trans. Neural Networks, was program chair of CVPR'06, and is chair of ICLR 2013 and 2014. He is on the science advisory board of Institute for Pure and Applied Mathematics. He has advised many large and small companies about machine learning technology, including several startups he co-founded. He is the recipient of the 2014 IEEE Neural Network Pioneer Award.

November 14th

Speaker: Sanjiv Kumar, Google Research at NYC

Title: Learning binary representations for fast similarity search in massive databases

Abstract: Binary coding based Approximate Nearest Neighbor (ANN) search in huge databases has attracted much attention recently due to its fast query time and drastically reduced storage needs. There are several challenges in developing a good ANN search system. A fundamental question that comes up often is: how difficult is ANN search in a given dataset? In other words, which data properties affect the quality of ANN search and how? Moreover, for different application scenarios, different types of learning methods are appropriate. In this talk, I will discuss what makes ANN search difficult, and a variety of binary coding techniques for non-negative data, data that lives on a manifold, and matrix data.
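For context, the simplest binary coding baseline is random hyperplane (sign-of-projection) hashing; the learned codes discussed in the talk aim to improve on this. A small sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
d, bits, n = 64, 32, 10_000
H = rng.normal(size=(d, bits))           # random hyperplanes

def encode(X):
    """32-bit binary codes: the sign pattern of random projections."""
    return (X @ H > 0).astype(np.uint8)

X = rng.normal(size=(n, d))              # synthetic database
q = X[0] + 0.05 * rng.normal(size=d)     # a query that is a near-duplicate of item 0

codes = encode(X)
qcode = encode(q[None, :])[0]
hamming = (codes != qcode).sum(axis=1)   # Hamming distance from query to every item
print(hamming[0], hamming.mean())        # item 0 is far closer than a typical item
```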

Bio: Sanjiv Kumar is currently a Research Scientist at Google Research, NY. He received his PhD from The Robotics Institute, Carnegie Mellon University in 2005, and a Masters from the Indian Institute of Technology Madras, India in 1997. From 1997 to 2000, he worked at National University Hospital, Singapore, developing a robotic colonoscopy system, and at the National Robotics Engineering Consortium, Pittsburgh, USA, on a robotic transportation system. His research interests include large scale machine learning and computer vision, graphical models, medical imaging and robotics.

November 21st

Speaker: Yi Wang, Professor of Radiology, Weill Cornell Medical College, NYC

Title: Bayesian image reconstruction to decode biomarkers from noisy incomplete data in MRI

Abstract: What is seen in a typical voxel (~1 mm³) in current medical imaging is a complex sum of contributions from millions of cells in that voxel, and is invariably highly contaminated by noise. Decoding critical cellular information about diseases from noisy image data is often an ill-posed inverse problem. Fortunately, there is abundant prior information in medical imaging, such as anatomic structure, to regularize the inverse problem using the Bayesian approach. We will demonstrate Bayesian reconstruction in magnetic resonance imaging (MRI), which is very sensitive to the presence of many diseases. One example is quantitative susceptibility mapping, which estimates from MRI data the molecular polarizability in the scanner magnet that reflects essential cellular activities. Another example is 4D imaging at high spatial-temporal resolution to capture the dynamic transport processes that perfuse and vitalize tissue.

Bio: Yi Wang (PhD 1994, University of Wisconsin-Madison) is the Faculty Distinguished Professor of Radiology and Professor of Biomedical Engineering at Cornell University. Dr. Wang is a Fellow of ISMRM and AIMBE. He is an active grant reviewer for many agencies including the NIH and the European Research Council, and as a PI he has been awarded multiple NIH grants. He has published more than 130 papers in peer-reviewed scientific journals and a textbook, "Principles of Magnetic Resonance Imaging." Dr. Wang has been a very active researcher in MRI: he has invented several key technologies in cardiovascular MRI, including the multi-station stepping-table platform, bolus-chase MRA, time-resolved contrast-enhanced MRA, and navigator motion compensation for cardiac MRI, and he has pioneered quantitative susceptibility mapping (QSM), a vibrant new field in MRI for studying the magnetic susceptibility properties of tissues in health and disease.
