Speaker Details

Speaker 1

Pang Wei Koh

University of Washington

Pang Wei Koh is an assistant professor in the Allen School of Computer Science and Engineering at the University of Washington and a visiting research scientist at AI2. His research interests are in the theory and practice of building reliable machine learning systems.

His research has been published in Nature and Cell, featured in media outlets such as The New York Times and The Washington Post, and recognized by the MIT Technology Review Innovators Under 35 Asia Pacific award and best paper awards at ICML and KDD.

He received his PhD and BS in Computer Science from Stanford University. Prior to his PhD, he was the third employee and Director of Partnerships at Coursera.

Talk

Title: Reliable and responsible data use: retrieval-based models and synthetic data

Abstract: How can we use our available data more efficiently and responsibly to build more reliable models? I will first describe how scaling up the amount of data available at inference time to retrieval-based language models can facilitate responsible use and improve performance across a variety of tasks without obvious saturation. This indicates that the data used at inference time, and not just at training time, should be considered a new dimension of scaling language models. Next, I will discuss when it might be useful to train on synthetic data that is itself derived from a generative model trained on the available real data. Finally, I will present our efforts to evaluate responsible data use by developing a benchmark for measuring copyright risk.