I'm an Assistant Professor of Computer Science at Harvard SEAS, where I lead the Data-Centric Machine Learning (DCML) group. I'm also Associate Faculty at the Kempner Institute, with affiliations at the Center for Research on Computation and Society and the Harvard Data Science Initiative, and a researcher at Microsoft Research New England.
My research seeks to make machine learning more broadly applicable (especially to data-poor applications) and trustworthy (e.g., robust and interpretable). I am particularly interested in the implications of these two directions for applications in the natural and medical sciences. My approach to the first of these goals draws on ideas from statistics, optimization, and applied mathematics, especially optimal transport, which I have used to develop methods that mitigate data scarcity through various kinds of geometric dataset manipulation: alignment, comparison, generation, and transformation. This talk provides a high-level overview of that part of my work. As for trustworthy machine learning, I have worked on methods for explaining the predictions of black-box models, shown their lack of robustness, proposed methods to robustify them, and sought inspiration in the social sciences to make them human-centered. In the past, I worked on various aspects of learning with highly structured data such as text and graphs, ranging from learning representations of structured objects, to generating them, to interpreting models that operate on them.
Prospective lab members: If you are interested in joining my group at Harvard, please read this.
I obtained a PhD in computer science from MIT, where I worked at CSAIL on various topics in machine learning and natural language processing. I also hold BSc (Licenciatura) and MS degrees in mathematics from ITAM and the Courant Institute (NYU), respectively. During the latter, I worked on semidefinite programming for domain adaptation under the supervision of Mehryar Mohri. Between my Master's and PhD, I spent a year at IBM's T.J. Watson Research Center, working with Ken Church and others in the Speech Recognition Group.
Most recent publications on Google Scholar.
Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains
Junhong Shen, Neil Tenenholtz, James Brian Hall, David Alvarez-Melis, Nicolo Fusi
ICML'24: International Conference on Machine Learning. 2024.
@InProceedings{pmlr-v235-shen24f, title = {Tag-{LLM}: Repurposing General-Purpose {LLM}s for Specialized Domains}, author = {Shen, Junhong and Tenenholtz, Neil and Hall, James Brian and Alvarez-Melis, David and Fusi, Nicolo}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {44759--44773}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/shen24f/shen24f.pdf}, url = {https://proceedings.mlr.press/v235/shen24f.html}, abstract = {Large Language Models (LLMs) have demonstrated remarkable proficiency in understanding and generating natural language. However, their capabilities wane in highly specialized domains underrepresented in the pretraining corpus, such as physical and biomedical sciences. This work explores how to repurpose general LLMs into effective task solvers for specialized domains. We introduce a novel, model-agnostic framework for learning custom input tags, which are parameterized as continuous vectors appended to the LLM’s embedding layer, to condition the LLM. We design two types of input tags: domain tags are used to delimit specialized representations (e.g., chemical formulas) and provide domain-relevant context; function tags are used to represent specific functions (e.g., predicting molecular properties) and compress function-solving instructions. We develop a three-stage protocol to learn these tags using auxiliary data and domain knowledge. By explicitly disentangling task domains from task functions, our method enables zero-shot generalization to unseen problems through diverse combinations of the input tags. It also boosts LLM’s performance in various specialized domains, such as predicting protein or chemical properties and modeling drug-target interactions, outperforming expert models tailored to these tasks.} }
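To make the tag mechanism concrete, here is a minimal sketch, assuming a HuggingFace-style backbone that accepts `inputs_embeds`; the class name, sizes, and initialization are illustrative, not the paper's released code:

```python
# Illustrative sketch of a learned "input tag": a few trainable soft-token
# vectors prepended to the token embeddings of a frozen, general-purpose LLM.
import torch
import torch.nn as nn

class TaggedLM(nn.Module):
    def __init__(self, base_lm, embed_dim, num_tag_tokens=4):
        super().__init__()
        self.base_lm = base_lm                       # frozen backbone
        for p in self.base_lm.parameters():
            p.requires_grad = False
        # e.g., a <protein> domain tag; only these vectors are trained
        self.tag = nn.Parameter(0.02 * torch.randn(num_tag_tokens, embed_dim))

    def forward(self, token_embeds):                 # (batch, seq, embed_dim)
        tag = self.tag.unsqueeze(0).expand(token_embeds.shape[0], -1, -1)
        return self.base_lm(inputs_embeds=torch.cat([tag, token_embeds], dim=1))
```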
Generating Synthetic Datasets by Interpolating along Generalized Geodesics
Jiaojiao Fan, David Alvarez-Melis
UAI'23: Uncertainty in Artificial Intelligence. 2023.
@INPROCEEDINGS{fan2023generating, title = "Generating Synthetic Datasets by Interpolating along Generalized Geodesics", booktitle = "Proceedings of the {Thirty-Ninth} Conference on Uncertainty in Artificial Intelligence", author = "Fan, Jiaojiao and Alvarez-Melis, David", publisher = "Proceedings of Machine Learning Research", year = 2023, conference = "Uncertainty in Artificial Intelligence" }
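The core geometric ingredient can be sketched with plain two-dataset displacement interpolation; the paper works with labeled datasets and generalized geodesics, but a minimal unlabeled version using the POT library looks like this (function name and sizes are illustrative):

```python
# Displacement interpolation between two point clouds via an OT coupling:
# each point moves a fraction t of the way along the (approximate) geodesic.
import numpy as np
import ot  # POT: Python Optimal Transport

def interpolate_datasets(X0, X1, t=0.5):
    a = np.full(len(X0), 1 / len(X0))               # uniform source weights
    b = np.full(len(X1), 1 / len(X1))               # uniform target weights
    G = ot.emd(a, b, ot.dist(X0, X1))               # optimal coupling
    T = (G / G.sum(axis=1, keepdims=True)) @ X1     # barycentric projection
    return (1 - t) * X0 + t * T

rng = np.random.default_rng(0)
X_mid = interpolate_datasets(rng.normal(0, 1, (100, 2)),
                             rng.normal(5, 1, (100, 2)))
```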
InfoOT: Information Maximizing Optimal Transport
Ching-Yao Chuang, Stefanie Jegelka, David Alvarez-Melis
ICML'23: International Conference on Machine Learning. 2023.
@INPROCEEDINGS{chuang2023infoot, title = "{InfoOT}: Information Maximizing Optimal Transport", booktitle = "Proceedings of the 40th International Conference on Machine Learning", author = "Chuang, Ching-Yao and Jegelka, Stefanie and Alvarez-Melis, David", editor = "Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan", publisher = "PMLR", volume = 202, pages = "6228--6242", series = "Proceedings of Machine Learning Research", institution = "PMLR", year = 2023 }
Optimizing Functionals on the Space of Probabilities with Input Convex Neural Networks
David Alvarez-Melis, Yair Schiff, Youssef Mroueh
Transactions on Machine Learning Research (TMLR). 2022.
Earlier version at OTML: NeurIPS'21 Workshop on Optimal Transport in Machine Learning.
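A minimal sketch of the input-convex architecture at the heart of this line of work, assuming standard ICNN constraints (non-negative weights on the recursive path, convex and non-decreasing activations); layer sizes are illustrative:

```python
# Input-convex neural network (ICNN): f(x) is convex in x because the z-path
# weights are clamped non-negative and softplus is convex and non-decreasing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    def __init__(self, dim, hidden=64, depth=3):
        super().__init__()
        self.Wx = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(depth)])
        self.Wz = nn.ModuleList([nn.Linear(hidden, hidden, bias=False)
                                 for _ in range(depth - 1)])
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        z = F.softplus(self.Wx[0](x))
        for Wx, Wz in zip(self.Wx[1:], self.Wz):
            z = F.softplus(Wx(x) + F.linear(z, Wz.weight.clamp(min=0)))
        # the output layer must also be non-negative in z to preserve convexity
        return F.linear(z, self.out.weight.clamp(min=0), self.out.bias)
```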
From Human Explanation to Model Interpretability: A Framework Based on Weight of Evidence
David Alvarez-Melis, Harmanpreet Kaur, Hal Daumé III, Hanna Wallach, Jennifer Wortman Vaughan
HCOMP '21: The 9th AAAI Conference on Human Computation and Crowdsourcing. 2021.
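The quantity at the center of this framework is easy to state: the weight of evidence of a hypothesis given some evidence is a log-likelihood ratio. A toy numeric example (the probabilities are made up for illustration):

```python
# Weight of evidence of hypothesis h given evidence e:
#   WoE(h : e) = log P(e | h) - log P(e | not h)
# Positive values mean e speaks in favor of h, and contributions from
# (conditionally independent) pieces of evidence add up.
import math

p_e_given_h, p_e_given_not_h = 0.8, 0.2
woe = math.log(p_e_given_h / p_e_given_not_h)
print(f"WoE = {woe:.3f} nats in favor of h")   # ~= 1.386
```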
Dataset Dynamics via Gradient Flows in Probability Space
David Alvarez-Melis, Nicolò Fusi
ICML'21: International Conference on Machine Learning. 2021.
@InProceedings{alvarez-melis2021dataset, title = {Dataset Dynamics via Gradient Flows in Probability Space}, author = {Alvarez-Melis, David and Fusi, Nicol\`o}, booktitle = {Proceedings of the 38th International Conference on Machine Learning}, pages = {219--230}, year = {2021}, editor = {Meila, Marina and Zhang, Tong}, volume = {139}, series = {Proceedings of Machine Learning Research}, month = {18--24 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v139/alvarez-melis21a/alvarez-melis21a.pdf}, url = {https://proceedings.mlr.press/v139/alvarez-melis21a.html}, abstract = {Various machine learning tasks, from generative modeling to domain adaptation, revolve around the concept of dataset transformation and manipulation. While various methods exist for transforming unlabeled datasets, principled methods to do so for labeled (e.g., classification) datasets are missing. In this work, we propose a novel framework for dataset transformation, which we cast as optimization over data-generating joint probability distributions. We approach this class of problems through Wasserstein gradient flows in probability space, and derive practical and efficient particle-based methods for a flexible but well-behaved class of objective functions. Through various experiments, we show that this framework can be used to impose constraints on classification datasets, adapt them for transfer learning, or to re-purpose fixed or black-box models to classify {—}with high accuracy{—} previously unseen datasets.} }
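A heavily simplified sketch of the particle view described in the abstract: the dataset is a cloud of labeled particles, and a functional over distributions is minimized by following its gradient at each particle. The objective below (pulling each point toward its class mean) is only a stand-in; the paper handles far more general functionals:

```python
# Particle-based "gradient flow" on a labeled dataset: features move,
# labels stay fixed, and the flow minimizes a functional of the point cloud.
import torch

X = torch.randn(200, 2, requires_grad=True)        # particle positions
y = torch.randint(0, 2, (200,))                    # fixed labels
opt = torch.optim.SGD([X], lr=0.1)

for _ in range(100):
    centers = torch.stack([X[y == c].detach().mean(0) for c in (0, 1)])
    F = ((X - centers[y]) ** 2).sum(dim=-1).mean() # toy objective
    opt.zero_grad()
    F.backward()
    opt.step()
```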
Geometric Dataset Distances via Optimal Transport
David Alvarez-Melis, Nicolò Fusi
NeurIPS'20: Neural Information Processing Systems. 2020.
Earlier version at AutoML @ ICML 2020.
@inproceedings{alvarez-melis2020geometric, author = {Alvarez-Melis, David and Fusi, Nicolo}, booktitle = {Advances in Neural Information Processing Systems}, editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin}, pages = {21428--21439}, publisher = {Curran Associates, Inc.}, title = {Geometric Dataset Distances via Optimal Transport}, url = {https://proceedings.neurips.cc/paper/2020/file/f52a7b2610fb4d3f74b4106fb80b233d-Paper.pdf}, volume = {33}, year = {2020} }
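A compact sketch of the idea using the POT library (not the released implementation): the ground cost between labeled samples adds a feature-space term to an OT distance between the corresponding class-conditional feature distributions:

```python
# Dataset distance sketch:
#   cost((x, y), (x', y')) = ||x - x'||^2 + W(alpha_y, alpha_y'),
# where alpha_y is the empirical feature distribution of class y.
import numpy as np
import ot

def dataset_distance(X1, y1, X2, y2):
    c1, c2 = np.unique(y1), np.unique(y2)
    # OT distance between every pair of class-conditional clouds
    Wyy = np.array([[ot.emd2([], [], ot.dist(X1[y1 == a], X2[y2 == b]))
                     for b in c2] for a in c1])     # [] -> uniform weights
    label_cost = Wyy[np.searchsorted(c1, y1)][:, np.searchsorted(c2, y2)]
    return ot.emd2([], [], ot.dist(X1, X2) + label_cost)
```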
Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces
David Alvarez-Melis, Youssef Mroueh, Tommi S. Jaakkola
AISTATS'20: Artificial Intelligence and Statistics. 2020.
Earlier version at OTML: NeurIPS'18 Workshop on Optimal Transport for Machine Learning. Spotlight.
@InProceedings{alvarez-melis2020unsupervised, title = {Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces}, author = {Alvarez-Melis, David and Mroueh, Youssef and Jaakkola, Tommi}, booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics}, pages = {1606--1617}, year = {2020}, editor = {Chiappa, Silvia and Calandra, Roberto}, volume = {108}, series = {Proceedings of Machine Learning Research}, month = {26--28 Aug}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v108/alvarez-melis20a/alvarez-melis20a.pdf}, url = {http://proceedings.mlr.press/v108/alvarez-melis20a.html}, }
Optimal Transport in Structured Domains: Algorithms and Applications
David Alvarez-Melis (advisor: Tommi S. Jaakkola)
PhD Thesis, MIT. 2019.
Learning Generative Models across Incomparable Spaces
Charlotte Bunne, David Alvarez-Melis, Andreas Krause, Stefanie Jegelka
ICML'19: International Conference on Machine Learning. 2019.
Earlier version at R2L: NeurIPS'18 Workshop on Relational Representation Learning. Best Paper Award.
@InProceedings{pmlr-v97-bunne19a, title = {Learning Generative Models across Incomparable Spaces}, author = {Bunne, Charlotte and Alvarez-Melis, David and Krause, Andreas and Jegelka, Stefanie}, booktitle = {Proceedings of the 36th International Conference on Machine Learning}, pages = {851--861}, year = {2019}, editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan}, volume = {97}, series = {Proceedings of Machine Learning Research}, address = {Long Beach, California, USA}, month = {09--15 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v97/bunne19a/bunne19a.pdf}, url = {http://proceedings.mlr.press/v97/bunne19a.html}, abstract = {Generative Adversarial Networks have shown remarkable success in learning a distribution that faithfully recovers a reference distribution in its entirety. However, in some cases, we may want to only learn some aspects (e.g., cluster or manifold structure), while modifying others (e.g., style, orientation or dimension). In this work, we propose an approach to learn generative models across such incomparable spaces, and demonstrate how to steer the learned distribution towards target properties. A key component of our model is the Gromov-Wasserstein distance, a notion of discrepancy that compares distributions relationally rather than absolutely. While this framework subsumes current generative models in identically reproducing distributions, its inherent flexibility allows application to tasks in manifold learning, relational learning and cross-domain learning.} }
Towards Optimal Transport with Global Invariances
David Alvarez-Melis, Stefanie Jegelka, Tommi S. Jaakkola
AISTATS'19: Artificial Intelligence and Statistics. 2019.
@InProceedings{pmlr-v89-alvarez-melis19a, title = {Towards Optimal Transport with Global Invariances}, author = {Alvarez-Melis, David and Jegelka, Stefanie and Jaakkola, Tommi S.}, booktitle = {Proceedings of Machine Learning Research}, pages = {1870--1879}, year = {2019}, editor = {Chaudhuri, Kamalika and Sugiyama, Masashi}, volume = {89}, series = {Proceedings of Machine Learning Research}, address = {}, month = {16--18 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v89/alvarez-melis19a/alvarez-melis19a.pdf}, url = {http://proceedings.mlr.press/v89/alvarez-melis19a.html}, abstract = {Many problems in machine learning involve calculating correspondences between sets of objects, such as point clouds or images. Discrete optimal transport provides a natural and successful approach to such tasks whenever the two sets of objects can be represented in the same space, or at least distances between them can be directly evaluated. Unfortunately neither requirement is likely to hold when object representations are learned from data. Indeed, automatically derived representations such as word embeddings are typically fixed only up to some global transformations, for example, reflection or rotation. As a result, pairwise distances across two such instances are ill-defined without specifying their relative transformation. In this work, we propose a general framework for optimal transport in the presence of latent global transformations. We cast the problem as a joint optimization over transport couplings and transformations chosen from a flexible class of invariances, propose algorithms to solve it, and show promising results in various tasks, including a popular unsupervised word translation benchmark.} }
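The joint optimization in the abstract admits a simple alternating scheme when the invariance class is the orthogonal group: solve an OT problem for a fixed rotation, then a Procrustes problem for a fixed coupling. A minimal sketch with POT and NumPy (the iteration count is illustrative):

```python
# Alternating minimization over couplings G and orthogonal maps P for
#   min_{G, P} sum_ij G_ij ||x_i - P y_j||^2.
import numpy as np
import ot

def invariant_ot(X, Y, iters=20):
    P = np.eye(X.shape[1])
    for _ in range(iters):
        G = ot.emd([], [], ot.dist(X, Y @ P.T))   # OT step (P fixed)
        U, _, Vt = np.linalg.svd(X.T @ G @ Y)     # Procrustes step (G fixed)
        P = U @ Vt
    return G, P
```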
Towards Robust Interpretability with Self-Explaining Neural Networks
David Alvarez-Melis, Tommi S. Jaakkola
NeurIPS'18: Neural Information Processing Systems. 2018.
@incollection{NIPS2018_8003, title = {Towards Robust Interpretability with Self-Explaining Neural Networks}, author = {Alvarez Melis, David and Jaakkola, Tommi}, booktitle = {Advances in Neural Information Processing Systems 31}, editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett}, pages = {7786--7795}, year = {2018}, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/8003-towards-robust-interpretability-with-self-explaining-neural-networks.pdf} }
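The architecture is summarized by one equation: predictions take the form f(x) = Σᵢ θᵢ(x) hᵢ(x), where h(x) are learned concepts and θ(x) their relevances, which serve as the explanation. A minimal sketch (the subnetworks and sizes are illustrative; the paper additionally regularizes θ to be locally stable):

```python
# Self-explaining model: prediction is linear in learned concepts, with
# input-dependent relevances acting as the explanation.
import torch
import torch.nn as nn

class SENN(nn.Module):
    def __init__(self, dim, n_concepts=5):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(),
                               nn.Linear(32, n_concepts))      # concepts
        self.theta = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(),
                                   nn.Linear(32, n_concepts))  # relevances

    def forward(self, x):
        concepts, relevances = self.h(x), self.theta(x)
        prediction = (concepts * relevances).sum(dim=-1)
        return prediction, relevances   # relevances double as the explanation
```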
Gromov-Wasserstein Alignment of Word Embedding Spaces
David Alvarez-Melis, Tommi S. Jaakkola
EMNLP'18: Empirical Methods in Natural Language Processing. 2018. Oral Presentation.
@InProceedings{alvarezmelis2018gromov, author = {Alvarez-Melis, David and Jaakkola, Tommi}, title = {Gromov-Wasserstein Alignment of Word Embedding Spaces}, booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing}, year = {2018}, publisher = {Association for Computational Linguistics}, pages = {1881--1890}, location = {Brussels, Belgium}, url = {http://aclweb.org/anthology/D18-1214} }
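Since Gromov-Wasserstein compares intra-space distances rather than cross-space ones, the alignment needs no seed dictionary. A minimal sketch with POT, with random matrices standing in for the two embedding spaces:

```python
# Unsupervised alignment of two embedding spaces with Gromov-Wasserstein:
# only within-space cost matrices are needed, never cross-space distances.
import numpy as np
import ot

rng = np.random.default_rng(0)
X_src, X_tgt = rng.normal(size=(500, 50)), rng.normal(size=(500, 50))

C1, C2 = ot.dist(X_src, X_src), ot.dist(X_tgt, X_tgt)
C1, C2 = C1 / C1.max(), C2 / C2.max()              # normalize scales
p = q = np.full(500, 1 / 500)
G = ot.gromov.gromov_wasserstein(C1, C2, p, q, 'square_loss')
matches = G.argmax(axis=1)   # word i -> most strongly coupled target word
```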
Structured Optimal Transport
David Alvarez-Melis, Tommi S. Jaakkola, Stefanie Jegelka
AISTATS'18: Artificial Intelligence and Statistics. 2018. Oral Presentation.
Earlier version at OTML: NIPS'17 Workshop on Optimal Transport for Machine Learning. Extended Oral.
@InProceedings{pmlr-v84-alvarez-melis18a, title = {Structured Optimal Transport}, author = {David Alvarez-Melis and Tommi Jaakkola and Stefanie Jegelka}, booktitle = {Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics}, pages = {1771--1780}, year = {2018}, editor = {Amos Storkey and Fernando Perez-Cruz}, volume = {84}, series = {Proceedings of Machine Learning Research}, address = {Playa Blanca, Lanzarote, Canary Islands}, month = {09--11 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v84/alvarez-melis18a/alvarez-melis18a.pdf}, url = {http://proceedings.mlr.press/v84/alvarez-melis18a.html}, abstract = {Optimal Transport has recently gained interest in machine learning for applications ranging from domain adaptation to sentence similarities or deep learning. Yet, its ability to capture frequently occurring structure beyond the "ground metric" is limited. In this work, we develop a nonlinear generalization of (discrete) optimal transport that is able to reflect much additional structure. We demonstrate how to leverage the geometry of this new model for fast algorithms, and explore connections and properties. Illustrative experiments highlight the benefit of the induced structured couplings for tasks in domain adaptation and natural language processing.} }
A Causal Framework for Explaining the Predictions of Black-Box Sequence-to-Sequence Models
David Alvarez-Melis, Tommi S. Jaakkola
EMNLP'17: Empirical Methods in Natural Language Processing. 2017.
@InProceedings{alvarezmelis2017causal, author = {Alvarez-Melis, David and Jaakkola, Tommi}, title = {A causal framework for explaining the predictions of black-box sequence-to-sequence models}, booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing}, month = {September}, year = {2017}, address = {Copenhagen, Denmark}, publisher = {Association for Computational Linguistics}, pages = {412--421}, url = {https://www.aclweb.org/anthology/D17-1042} }
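The framework's perturb-and-observe logic can be caricatured in a few lines. The paper generates semantically plausible perturbations with a variational autoencoder and then partitions a dependency graph; the leave-one-out sketch below only illustrates the black-box interface (`model` and `similarity` are placeholders for any seq2seq system and output-comparison function):

```python
# Leave-one-out caricature of perturbation-based explanation for a black-box
# sequence-to-sequence model: score each input token by how much deleting it
# changes the output.
def token_relevance(model, similarity, tokens):
    full_output = model(tokens)
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(1.0 - similarity(full_output, model(reduced)))
    return scores   # higher = the output depends more on that token
```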
Tree-structured Decoding with Doubly-recurrent Neural Networks
David Alvarez-Melis, Tommi S. Jaakkola
ICLR'17: International Conference on Learning Representations. 2017.
@inproceedings{alvarezmelis2017tree, title={Tree-structured decoding with doubly-recurrent neural networks}, author={Alvarez-Melis, David and Jaakkola, Tommi S}, booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)}, year={2017} }
Word Embeddings as Metric Recovery in Semantic Spaces
Tatsunori B. Hashimoto, David Alvarez-Melis, Tommi S. Jaakkola
TACL: Transactions of the Association for Computational Linguistics. 2016. (presented at ACL'16).
@article{Hashimoto2016Word, author = {Hashimoto, Tatsunori and Alvarez-Melis, David and Jaakkola, Tommi }, title = {Word Embeddings as Metric Recovery in Semantic Spaces}, journal = {Transactions of the Association for Computational Linguistics}, volume = {4}, year = {2016}, issn = {2307-387X}, url = {https://transacl.org/ojs/index.php/tacl/article/view/809}, pages = {273--286} }
Neural Unbalanced Optimal Transport via Cycle-Consistent Semi-Couplings
Frederike Lübeck*, Charlotte Bunne*, Gabriele Gut, Jacobo Sarabia del Castillo, Lucas Pelkmans, David Alvarez-Melis
Domain Adaptation using Optimal Transport for Invariant Learning using Histopathology Datasets
Kianoush Falahkheirkhah, Alex Lu, David Alvarez-Melis, Grace Huynh
Medical Imaging with Deep Learning (MIDL). 2023.
Are GANs overkill for NLP?
David Alvarez-Melis*, Vikas Garg*, Adam Tauman Kalai*
NeurIPS'22: Neural Information Processing Systems. 2022.
Hierarchical Optimal Transport for Comparing Histopathology Datasets
Anna Yeaton, Rahul G. Krishnan, Rebecca Mieloszyk, David Alvarez-Melis, Grace Huynh
Medical Imaging with Deep Learning (MIDL). 2022.
Interpretable Distribution Shift Detection using Optimal Transport
Neha Hulkund, Nicolo Fusi, Jennifer Wortman Vaughan, David Alvarez-Melis
DataPerf Workshop at ICML 2022.
Probabilistic Bias Mitigation in Word Embeddings
Hailey James-Sorenson, David Alvarez-Melis
HCML: NeurIPS'19 Workshop on Human-Centric Machine Learning. 2019.
Weight of Evidence as a Basis for Human-Oriented Explanations
David Alvarez-Melis, Hal Daumé III, Jennifer Wortman Vaughan, Hanna Wallach
HCML: NeurIPS'19 Workshop on Human-Centric Machine Learning. 2019.
Functional Transparency for Structured Data: a Game-Theoretic Approach
Guang-He Lee, Wengong Jin, David Alvarez-Melis, Tommi S. Jaakkola
ICML'19: International Conference on Machine Learning. 2019.
@InProceedings{pmlr-v97-lee19b, title = {Functional Transparency for Structured Data: a Game-Theoretic Approach}, author = {Lee, Guang-He and Jin, Wengong and Alvarez-Melis, David and Jaakkola, Tommi}, booktitle = {Proceedings of the 36th International Conference on Machine Learning}, pages = {3723--3733}, year = {2019}, editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan}, volume = {97}, series = {Proceedings of Machine Learning Research}, address = {Long Beach, California, USA}, month = {09--15 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v97/lee19b/lee19b.pdf}, url = {http://proceedings.mlr.press/v97/lee19b.html}, abstract = {We provide a new approach to training neural models to exhibit transparency in a well-defined, functional manner. Our approach naturally operates over structured data and tailors the predictor, functionally, towards a chosen family of (local) witnesses. The estimation problem is setup as a co-operative game between an unrestricted \emph{predictor} such as a neural network, and a set of \emph{witnesses} chosen from the desired transparent family. The goal of the witnesses is to highlight, locally, how well the predictor conforms to the chosen family of functions, while the predictor is trained to minimize the highlighted discrepancy. We emphasize that the predictor remains globally powerful as it is only encouraged to agree locally with locally adapted witnesses. We analyze the effect of the proposed approach, provide example formulations in the context of deep graph and sequence models, and empirically illustrate the idea in chemical property prediction, temporal modeling, and molecule representation learning.} }
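One way to render the co-operative game concrete: fit a local linear witness around each input and penalize the predictor's deviation from it. The sketch below (sizes, witness family, and sampling scheme are all illustrative assumptions) shows such a discrepancy term that could be added to a task loss:

```python
# Local-witness discrepancy: sample a neighborhood around each point, fit a
# linear witness by least squares, and measure how far the predictor strays
# from its best local linear approximation.
import torch

def transparency_penalty(predictor, x, eps=0.1, n_nbrs=16):
    b, f = x.shape
    nbrs = x.unsqueeze(1) + eps * torch.randn(b, n_nbrs, f)
    preds = predictor(nbrs.reshape(-1, f)).reshape(b, n_nbrs)
    ones = torch.ones(b, n_nbrs, 1)
    A = torch.cat([nbrs, ones], dim=-1)                       # design matrix
    beta = torch.linalg.lstsq(A, preds.unsqueeze(-1)).solution
    return ((A @ beta).squeeze(-1) - preds).pow(2).mean()

# total_loss = task_loss + lam * transparency_penalty(model, x_batch)
```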
Towards Robust, Locally Linear Deep Networks
Guang-He Lee, David Alvarez-Melis, Tommi S. Jaakkola
ICLR'19: International Conference on Learning Representations. 2019.
Game-theoretic Interpretability for Temporal Modeling
Guang-He Lee, David Alvarez-Melis, Tommi S. Jaakkola
Fairness, Accountability, and Transparency in Machine Learning (@ICML 2018).
On the Robustness of Interpretability Methods
David Alvarez-Melis, Tommi S. Jaakkola
Workshop on Human Interpretability in Machine Learning (@ICML 2018).
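The instability measure in this paper is essentially an empirical local Lipschitz constant of the explanation map. A small sketch (`explain` is a placeholder for any attribution method; random sampling stands in for the paper's constrained maximization):

```python
# Empirical explanation instability: worst observed ratio between the change
# in the explanation and the change in the input, over random perturbations.
import numpy as np

def explanation_instability(explain, x, eps=0.01, n_samples=50, seed=0):
    rng = np.random.default_rng(seed)
    e_x, worst = explain(x), 0.0
    for _ in range(n_samples):
        x_p = x + eps * rng.normal(size=x.shape)
        ratio = (np.linalg.norm(explain(x_p) - e_x)
                 / np.linalg.norm(x_p - x))
        worst = max(worst, ratio)
    return worst   # large values indicate fragile explanations
```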
The Emotional GAN: Priming Adversarial Generation of Art with Emotion
David Alvarez-Melis, Judith Amores
NIPS Workshop on Machine Learning for Creativity and Design. 2017.
Distributional Adversarial Networks
Chengtao Li*, David Alvarez-Melis*, Keyulu Xu, Stefanie Jegelka, Suvrit Sra
ICLR'17: International Conference on Learning Representations (Workshop track). 2017.
Topic Modeling in Twitter: Aggregating Tweets by Conversations
David Alvarez-Melis*, Martin Saveski*
ICWSM'16: International AAAI Conference on Web and Social Media. 2016. (Short Paper)
@inproceedings{alvarezmelis2016toic, author = {David Alvarez{-}Melis and Martin Saveski}, title = {Topic Modeling in Twitter: Aggregating Tweets by Conversations}, booktitle = {Proceedings of the Tenth International Conference on Web and Social Media (ICWSM)}, pages = {519--522}, year = {2016}, url = {http://www.aaai.org/ocs/index.php/ICWSM/ICWSM16/paper/view/13162}, }
Word, graph and manifold embedding from Markov processes
Tatsunori B. Hashimoto, David Alvarez-Melis, Tommi S. Jaakkola
NIPS 2015 Workshop on Nonparametric Methods for Large Scale Representation Learning. Oral presentation.
A translation of 'The characteristic function of a random phenomenon' by Bruno de Finetti
David Alvarez-Melis, Tamara Broderick
Translation. 2015.
The Matrix Multiplicative Weights Algorithm for Domain Adaptation
David Alvarez-Melis (advisor: Mehryar Mohri)
MS Thesis, Courant Institute. 2013.
Lax-Milgram's Theorem: Generalizations and Applications
David Alvarez-Melis (advisor: Carlos Bosch Giral)
BSc Thesis, ITAM. 2011.
Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains
Junhong Shen, Neil Tenenholtz, James Brian Hall,David Alvarez-Melis, Nicolo Fusi
ICML'24: International Conference on Machine Learning. 2024.
@InProceedings{pmlr-v235-shen24f, title = {Tag-{LLM}: Repurposing General-Purpose {LLM}s for Specialized Domains}, author = {Shen, Junhong and Tenenholtz, Neil and Hall, James Brian and Alvarez-Melis, David and Fusi, Nicolo}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {44759--44773}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/shen24f/shen24f.pdf}, url = {https://proceedings.mlr.press/v235/shen24f.html}, abstract = {Large Language Models (LLMs) have demonstrated remarkable proficiency in understanding and generating natural language. However, their capabilities wane in highly specialized domains underrepresented in the pretraining corpus, such as physical and biomedical sciences. This work explores how to repurpose general LLMs into effective task solvers for specialized domains. We introduce a novel, model-agnostic framework for learning custom input tags, which are parameterized as continuous vectors appended to the LLM’s embedding layer, to condition the LLM. We design two types of input tags: domain tags are used to delimit specialized representations (e.g., chemical formulas) and provide domain-relevant context; function tags are used to represent specific functions (e.g., predicting molecular properties) and compress function-solving instructions. We develop a three-stage protocol to learn these tags using auxiliary data and domain knowledge. By explicitly disentangling task domains from task functions, our method enables zero-shot generalization to unseen problems through diverse combinations of the input tags. It also boosts LLM’s performance in various specialized domains, such as predicting protein or chemical properties and modeling drug-target interactions, outperforming expert models tailored to these tasks.} }
Generating Synthetic Datasets by Interpolating along Generalized Geodesics
Jiaojiao Fan, David Alvarez-Melis
UAI'23: Uncertainty in Artificial Intelligence. 2023.
@INPROCEEDINGS{fan2023generating, title = "Generating Synthetic Datasets by Interpolating along Generalized Geodesics", booktitle = "Proceedings of the {Thirty-Ninth} Conference on Uncertainty in Artificial Intelligence", author = "Fan, Jiaojiao and Alvarez-Melis, David", publisher = "Proceedings of Machine Learning Research", year = 2023, conference = "Uncertainty in Artificial Intelligence" }
InfoOT: Information Maximizing Optimal Transport
Ching-Yao Chuang, Stefanie Jegelka, David Alvarez-Melis
ICML'23: International Conference on Machine Learning. 2023.
@INPROCEEDINGS{chuang2023infoot, title = "{InfoOT}: Information Maximizing Optimal Transport", booktitle = "Proceedings of the 40th International Conference on Machine Learning", author = "Chuang, Ching-Yao and Jegelka, Stefanie and Alvarez-Melis, David", editor = "Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan", publisher = "PMLR", volume = 202, pages = "6228--6242", series = "Proceedings of Machine Learning Research", institution = "PMLR", year = 2023 }
Domain Adaptation using Optimal Transport for Invariant Learning using Histopathology Datasets
Kianoush Falahkheirkhah, Alex Lu, David Alvarez-Melis, Grace Huynh
Medical Imaging with Deep Learning (MIDL). 2023.
Neural Unbalanced Optimal Transport via Cycle-Consistent Semi-Couplings
Frederike Lübeck*, Charlotte Bunne*, Gabriele Gut, Jacobo Sarabia del Castillo, Lucas Pelkmans, David Alvarez-Melis
Are GANs overkill for NLP?
David Alvarez-Melis*, Vikas Garg*, Adam Tauman Kalai*
NeurIPS'22: Neural Information Processing Systems. 2022.
Hierarchical Optimal Transport for Comparing Histopathology Datasets
Anna Yeaton, Rahul G. Krishnan, Rebecca Mieloszyk, David Alvarez-Melis, Grace Huynh
Medical Imaging with Deep Learning (MIDL). 2022.
Interpretable Distribution Shift Detection using Optimal Transport
Neha Hulkund, Nicolo Fusi, Jennifer Wortman Vaughan, David Alvarez-Melis
DataPerf Workshop at ICML 2022
Optimizing Functionals on the Space of Probabilities with Input Convex Neural Networks
David Alvarez-Melis, Yair Schiff, Youssef Mroueh
Transactions on Machine Learning Research (TMLR). 2022.
Earlier version at OTML: NeurIPS'21 Workshop on Optimal Transport in Machine Learning.
From Human Explanation to Model Interpretability: A Framework Based on Weight of Evidence
David Alvarez-Melis, Harmanpreet Kaur, Hal Daumé III, Hanna Wallach, Jennifer Wortman Vaughan
HCOMP '21: The 9th AAAI Conference on Human Computation and Crowdsourcing. 2021.
Dataset Dynamics via Gradient Flows in Probability Space
David Alvarez-Melis, Nicolò Fusi
ICML'21: International Conference on Machine Learning. 2021.
@InProceedings{alvarez-melis2021dataset, title = {Dataset Dynamics via Gradient Flows in Probability Space}, author = {Alvarez-Melis, David and Fusi, Nicol\`o}, booktitle = {Proceedings of the 38th International Conference on Machine Learning}, pages = {219--230}, year = {2021}, editor = {Meila, Marina and Zhang, Tong}, volume = {139}, series = {Proceedings of Machine Learning Research}, month = {18--24 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v139/alvarez-melis21a/alvarez-melis21a.pdf}, url = {https://proceedings.mlr.press/v139/alvarez-melis21a.html}, abstract = {Various machine learning tasks, from generative modeling to domain adaptation, revolve around the concept of dataset transformation and manipulation. While various methods exist for transforming unlabeled datasets, principled methods to do so for labeled (e.g., classification) datasets are missing. In this work, we propose a novel framework for dataset transformation, which we cast as optimization over data-generating joint probability distributions. We approach this class of problems through Wasserstein gradient flows in probability space, and derive practical and efficient particle-based methods for a flexible but well-behaved class of objective functions. Through various experiments, we show that this framework can be used to impose constraints on classification datasets, adapt them for transfer learning, or to re-purpose fixed or black-box models to classify {—}with high accuracy{—} previously unseen datasets.} }
Geometric Dataset Distances via Optimal Transport
David Alvarez-Melis, Nicolò Fusi
NeurIPS'20: Neural Information Processing Systems. 2020.
Earlier version at AutoML @ ICML 2020.
@inproceedings{alvarez-melis2020geometric, author = {Alvarez-Melis, David and Fusi, Nicolo}, booktitle = {Advances in Neural Information Processing Systems}, editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin}, pages = {21428--21439}, publisher = {Curran Associates, Inc.}, title = {Geometric Dataset Distances via Optimal Transport}, url = {https://proceedings.neurips.cc/paper/2020/file/f52a7b2610fb4d3f74b4106fb80b233d-Paper.pdf}, volume = {33}, year = {2020} }
Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces
David Alvarez-Melis, Youssef Mroueh, Tommi S. Jaakkola
AISTATS'20: Artificial Intelligence and Statistics. 2020.
Earlier version at OTML: NeurIPS'18 Workshop on Optimal Transport for Machine Learning. Spotlight.
@InProceedings{alvarez-melis2020unsupervised, title = {Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces}, author = {Alvarez-Melis, David and Mroueh, Youssef and Jaakkola, Tommi}, booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics}, pages = {1606--1617}, year = {2020}, editor = {Chiappa, Silvia and Calandra, Roberto}, volume = {108}, series = {Proceedings of Machine Learning Research}, month = {26--28 Aug}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v108/alvarez-melis20a/alvarez-melis20a.pdf}, url = {http://proceedings.mlr.press/v108/alvarez-melis20a.html}, }
Probabilistic Bias Mitigation in Word Embeddings
Hailey James-Sorenson, David Alvarez-Melis
HCML @ NeurIPS 2019.
Weight of Evidence as a Basis for Human-Oriented Explanations
David Alvarez-Melis, Hal Daumé III, Jennifer Wortman Vaughan, Hanna Wallach
HCML @ NeurIPS 2019.
Optimal Transport in Structured Domains: Algorithms and Applications
David Alvarez-Melis (advisor: Tommi S. Jaakkola)
PhD Thesis, MIT. 2019.
Functional Transparency for Structured Data: a Game-Theoretic Approach
Guang-He Lee, Wengong Jin, David Alvarez-Melis, Tommi S. Jaakkola
ICML'19: International Conference on Machine Learning. 2019.
@InProceedings{pmlr-v97-lee19b, title = {Functional Transparency for Structured Data: a Game-Theoretic Approach}, author = {Lee, Guang-He and Jin, Wengong and Alvarez-Melis, David and Jaakkola, Tommi}, booktitle = {Proceedings of the 36th International Conference on Machine Learning}, pages = {3723--3733}, year = {2019}, editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan}, volume = {97}, series = {Proceedings of Machine Learning Research}, address = {Long Beach, California, USA}, month = {09--15 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v97/lee19b/lee19b.pdf}, url = {http://proceedings.mlr.press/v97/lee19b.html}, abstract = {We provide a new approach to training neural models to exhibit transparency in a well-defined, functional manner. Our approach naturally operates over structured data and tailors the predictor, functionally, towards a chosen family of (local) witnesses. The estimation problem is setup as a co-operative game between an unrestricted \emph{predictor} such as a neural network, and a set of \emph{witnesses} chosen from the desired transparent family. The goal of the witnesses is to highlight, locally, how well the predictor conforms to the chosen family of functions, while the predictor is trained to minimize the highlighted discrepancy. We emphasize that the predictor remains globally powerful as it is only encouraged to agree locally with locally adapted witnesses. We analyze the effect of the proposed approach, provide example formulations in the context of deep graph and sequence models, and empirically illustrate the idea in chemical property prediction, temporal modeling, and molecule representation learning.} }
Learning Generative Models across Incomparable Spaces
Charlotte Bunne, David Alvarez-Melis, Andreas Krause, Stefanie Jegelka
ICML'19: International Conference on Machine Learning. 2019.
Earlier version at R2L: NeurIPS'18 Workshop on Relational Representation Learning. Best Paper Award.
@InProceedings{pmlr-v97-bunne19a, title = {Learning Generative Models across Incomparable Spaces}, author = {Bunne, Charlotte and Alvarez-Melis, David and Krause, Andreas and Jegelka, Stefanie}, booktitle = {Proceedings of the 36th International Conference on Machine Learning}, pages = {851--861}, year = {2019}, editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan}, volume = {97}, series = {Proceedings of Machine Learning Research}, address = {Long Beach, California, USA}, month = {09--15 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v97/bunne19a/bunne19a.pdf}, url = {http://proceedings.mlr.press/v97/bunne19a.html}, abstract = {Generative Adversarial Networks have shown remarkable success in learning a distribution that faithfully recovers a reference distribution in its entirety. However, in some cases, we may want to only learn some aspects (e.g., cluster or manifold structure), while modifying others (e.g., style, orientation or dimension). In this work, we propose an approach to learn generative models across such incomparable spaces, and demonstrate how to steer the learned distribution towards target properties. A key component of our model is the Gromov-Wasserstein distance, a notion of discrepancy that compares distributions relationally rather than absolutely. While this framework subsumes current generative models in identically reproducing distributions, its inherent flexibility allows application to tasks in manifold learning, relational learning and cross-domain learning.} }
Towards Robust, Locally Linear Deep Networks
Guang-He Lee, David Alvarez-Melis, Tommi S. Jaakkola
ICLR'19: International Conference on Learning Representations. 2019.
Towards Optimal Transport with Global Invariances
David Alvarez-Melis, Stefanie Jegelka, Tommi S. Jaakkola
AISTATS'19: Artificial Intelligence and Statistics. 2019.
@InProceedings{pmlr-v89-alvarez-melis19a, title = {Towards Optimal Transport with Global Invariances}, author = {Alvarez-Melis, David and Jegelka, Stefanie and Jaakkola, Tommi S.}, booktitle = {Proceedings of Machine Learning Research}, pages = {1870--1879}, year = {2019}, editor = {Chaudhuri, Kamalika and Sugiyama, Masashi}, volume = {89}, series = {Proceedings of Machine Learning Research}, address = {}, month = {16--18 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v89/alvarez-melis19a/alvarez-melis19a.pdf}, url = {http://proceedings.mlr.press/v89/alvarez-melis19a.html}, abstract = {Many problems in machine learning involve calculating correspondences between sets of objects, such as point clouds or images. Discrete optimal transport provides a natural and successful approach to such tasks whenever the two sets of objects can be represented in the same space, or at least distances between them can be directly evaluated. Unfortunately neither requirement is likely to hold when object representations are learned from data. Indeed, automatically derived representations such as word embeddings are typically fixed only up to some global transformations, for example, reflection or rotation. As a result, pairwise distances across two such instances are ill-defined without specifying their relative transformation. In this work, we propose a general framework for optimal transport in the presence of latent global transformations. We cast the problem as a joint optimization over transport couplings and transformations chosen from a flexible class of invariances, propose algorithms to solve it, and show promising results in various tasks, including a popular unsupervised word translation benchmark.} }
Towards Robust Interpretability with Self-Explaining Neural Networks
David Alvarez-Melis, Tommi S. Jaakkola
NeurIPS'18: Neural Information Processing Systems. 2018.
@incollection{NIPS2018_8003, title = {Towards Robust Interpretability with Self-Explaining Neural Networks}, author = {Alvarez Melis, David and Jaakkola, Tommi}, booktitle = {Advances in Neural Information Processing Systems 31}, editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett}, pages = {7786--7795}, year = {2018}, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/8003-towards-robust-interpretability-with-self-explaining-neural-networks.pdf} }
Gromov-Wasserstein Alignment of Word Embedding Spaces
David Alvarez-Melis, Tommi S. Jaakkola
EMNLP'18: Empirical Methods in Natural Language Processing. 2018. Oral Presentation.
@InProceedings{alvarezmelis2018gromov, author = {Alvarez-Melis, David and Jaakkola, Tommi}, title = {Gromov-Wasserstein Alignment of Word Embedding Spaces}, booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing}, year = {2018}, publisher = {Association for Computational Linguistics}, pages = {1881--1890}, location = {Brussels, Belgium}, url = {http://aclweb.org/anthology/D18-1214} }
Game-theoretic Interpretability for Temporal Modeling
Guang-He Lee, David Alvarez-Melis, Tommi S. Jaakkola
Fairness, Accountability, and Transparency in Machine Learning (@ICML 2018).
On the Robustness of Interpretability Methods
David Alvarez-Melis, Tommi S. Jaakkola
Workshop on Human Interpretability in Machine Learning (@ICML 2018).
Structured Optimal Transport
David Alvarez-Melis, Tommi S. Jaakkola, Stefanie Jegelka
AISTATS'18: Artificial Intelligence and Statistics. 2018. Oral Presentation.
Earlier version at NIPS Workshop on Optimal Transport for Machine Learning, 2017, as Extended Oral.
@InProceedings{pmlr-v84-alvarez-melis18a, title = {Structured Optimal Transport}, author = {David Alvarez-Melis and Tommi Jaakkola and Stefanie Jegelka}, booktitle = {Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics}, pages = {1771--1780}, year = {2018}, editor = {Amos Storkey and Fernando Perez-Cruz}, volume = {84}, series = {Proceedings of Machine Learning Research}, address = {Playa Blanca, Lanzarote, Canary Islands}, month = {09--11 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v84/alvarez-melis18a/alvarez-melis18a.pdf}, url = {http://proceedings.mlr.press/v84/alvarez-melis18a.html}, abstract = {Optimal Transport has recently gained interest in machine learning for applications ranging from domain adaptation to sentence similarities or deep learning. Yet, its ability to capture frequently occurring structure beyond the "ground metric" is limited. In this work, we develop a nonlinear generalization of (discrete) optimal transport that is able to reflect much additional structure. We demonstrate how to leverage the geometry of this new model for fast algorithms, and explore connections and properties. Illustrative experiments highlight the benefit of the induced structured couplings for tasks in domain adaptation and natural language processing.} }
The Emotional GAN: Priming Adversarial Generation of Art with Emotion
David Alvarez-Melis, Judith Amores
NIPS Workshop on Machine Learning for Creativity and Design. 2017.
Distributional Adversarial Networks
Chengtao Li*, David Alvarez-Melis*, Keyulu Xu, Stefanie Jegelka, Suvrit Sra
ICLR'17: International Conference on Learning Representations (Workshop track). 2017.
A Causal Framework for Explaining the Predictions of Black-Box Sequence-to-Sequence Models
David Alvarez-Melis, Tommi S. Jaakkola
EMNLP'17: Empirical Methods in Natural Language Processing. 2017.
@InProceedings{alvarezmelis2017causal, author = {Alvarez-Melis, David and Jaakkola, Tommi}, title = {A causal framework for explaining the predictions of black-box sequence-to-sequence models}, booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing}, month = {September}, year = {2017}, address = {Copenhagen, Denmark}, publisher = {Association for Computational Linguistics}, pages = {412--421}, url = {https://www.aclweb.org/anthology/D17-1042} }
Tree-structured Decoding with Doubly-recurrent Neural Networks
David Alvarez-Melis, Tommi S. Jaakkola
ICLR'17: International Conference on Learning Representations. 2017.
@inproceedings{alvarezmelis2017tree, title={Tree-structured decoding with doubly-recurrent neural networks}, author={Alvarez-Melis, David and Jaakkola, Tommi S}, booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)}, year={2017} }
Word Embeddings as Metric Recovery in Semantic Spaces
Tatsunori B. Hashimoto, David Alvarez-Melis, Tommi S. Jaakkola
TACL: Transactions of the Association for Computational Linguistics. 2016. (presented at ACL'16).
@article{Hashimoto2016Word, author = {Hashimoto, Tatsunori and Alvarez-Melis, David and Jaakkola, Tommi }, title = {Word Embeddings as Metric Recovery in Semantic Spaces}, journal = {Transactions of the Association for Computational Linguistics}, volume = {4}, year = {2016}, issn = {2307-387X}, url = {https://transacl.org/ojs/index.php/tacl/article/view/809}, pages = {273--286} }
Topic Modeling in Twitter: Aggregating Tweets by Conversations
David Alvarez-Melis*, Martin Saveski*
ICWSM'16: International AAAI Conference on Web and Social Media. 2016. (Short Paper)
@inproceedings{alvarezmelis2016topic, author = {David Alvarez{-}Melis and Martin Saveski}, title = {Topic Modeling in Twitter: Aggregating Tweets by Conversations}, booktitle = {Proceedings of the Tenth International Conference on Web and Social Media (ICWSM)}, pages = {519--522}, year = {2016}, url = {http://www.aaai.org/ocs/index.php/ICWSM/ICWSM16/paper/view/13162} }
Word, graph and manifold embedding from Markov processes
Tatsunori B. Hashimoto, David Alvarez-Melis, Tommi S. Jaakkola
NIPS 2015 Workshop on Nonparametric Methods for Large Scale Representation Learning. Oral presentation.
A translation of 'The characteristic function of a random phenomenon' by Bruno de Finetti
David Alvarez-Melis, Tamara Broderick
Translation. 2015.
The Matrix Multiplicative Weights Algorithm for Domain Adaptation
David Alvarez-Melis (advisor: Mehryar Mohri)
MS Thesis, Courant Institute. 2013.
Lax-Milgram's Theorem: Generalizations and Applications
David Alvarez-Melis (advisor: Carlos Bosch Giral)
BSc Thesis, ITAM. 2011.
Current and past courses I have taught or TA'd:
"Feynman was a truly great teacher. He prided himself on being able to devise ways to explain even the most profound ideas to beginning students. Once, I said to him, "Dick, explain to me, so that I can understand it, why spin one-half particles obey Fermi-Dirac statistics." Sizing up his audience perfectly, Feynman said, "I'll prepare a freshman lecture on it." But he came back a few days later to say, "I couldn't do it. I couldn't reduce it to the freshman level. That means we don't really understand it."
Full CV in PDF (or a shorter Résumé).
I am always looking for motivated students and postdocs to join my group. Unfortunately, I am not able to respond to all emails, so, depending on your situation, please follow one of the following routes:
If your email is not formatted as above, my filters won't catch it, so I will almost certainly not see it.
Outside of research, I enjoy running, brewing beer, and playing guitar. I also like quotes. Here are a few more:
"We cannot solve our problems with the same thinking we used when we created them." - A. Einstein
"The real danger is not that computers will begin to think like men, but that men will begin to think like computers" - Syndey J. Harris