
ICML 2025 Takeaways
Amplify has been attending ICML for close to a decade, and though this was personally only my third, it was undoubtedly the most action-packed ICML yet. Throughout the week, our team (myself, Barr, Grace and Neiman) enjoyed close conversations with researchers across workshops, poster sessions, our annual conference dinner, and even an inaugural conference boat trip!
We also hosted two themed breakfasts, one on audio and one on world models, where hot takes and pleasantries were exchanged over pastries in equal measure. We've distilled some of our key conference takeaways below, and we're excited to be back in Seoul next year for more breakthrough research (and definitely a second boat trip).
Takeaway 1: Audio model preprocessing pipelines are evolving
At Amplify, we are incredibly excited about audio models, as we expect them to soon become a dominant mode of human-computer interaction. During ICML we organized a breakfast on the topic with key researchers and attended the related talks and workshops.
One thing researchers seemed to agree on: training audio AI models requires feeding the model good audio representations (compressed versions of the raw waveform). However, unlike in other modalities, the research community has not yet reached consensus on what those representations should be, which was one of our key takeaways from the conference.
This topic arose during our audio-themed breakfast and was the focus of James Betker's talk at the amusingly titled audio workshop "AI Heard That!". James (one of the leads on OpenAI's GPT-4o) commented that the best work on designing good audio representations today comes from the open source community, since these representations are relatively cheap to train: capturing basic aspects of audio such as phonemes and rhythm is a much simpler task than complex translation and generation workflows.
He also mentioned that most prominent labs adopt these open source representations, so it's a great opportunity for researchers in the wild to make them better! During his talk, he broke down five key considerations when designing an audio representation space:
- What autoencoder architecture will you use?
- What information from your signal do you care about?
- What spatial compression do you need?
- Does your downstream model perform discrete prediction?
- Do you want to support real time?
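To make these considerations concrete, here is a minimal, hypothetical sketch in PyTorch (my own toy illustration, not any lab's actual design) of where each one shows up in code: a strided convolution as the autoencoder, a `compression` parameter controlling temporal downsampling, and an optional codebook lookup standing in for discrete prediction. Real-time support would additionally require causal, streaming-friendly layers, which are omitted here.

```python
import torch
import torch.nn as nn


class TinyAudioAutoencoder(nn.Module):
    """Toy waveform autoencoder illustrating the five design axes above."""

    def __init__(self, latent_dim=64, compression=320, discrete=False, codebook_size=1024):
        super().__init__()
        # Architecture + spatial compression: one latent vector per `compression`
        # samples (320 samples is roughly 20 ms of 16 kHz audio).
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=compression, stride=compression)
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel_size=compression, stride=compression)
        # Discrete prediction: optionally snap latents to a learned codebook so a
        # downstream autoregressive model can predict token ids instead of vectors.
        self.discrete = discrete
        if discrete:
            self.codebook = nn.Embedding(codebook_size, latent_dim)

    def encode(self, wav):                                # wav: (batch, 1, samples)
        z = self.encoder(wav)                             # (batch, latent_dim, frames)
        if not self.discrete:
            return z
        # Nearest-codebook lookup; a real VQ codec would also need a
        # straight-through estimator so gradients can flow through this step.
        dists = (z.transpose(1, 2).unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        ids = dists.argmin(dim=-1)                        # (batch, frames)
        return self.codebook(ids).transpose(1, 2)

    def forward(self, wav):
        return self.decoder(self.encode(wav))


# Round-trip one second of 16 kHz audio through the toy codec.
wav = torch.randn(2, 1, 16000)
print(TinyAudioAutoencoder(discrete=True)(wav).shape)     # torch.Size([2, 1, 16000])
```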
A second critical component of the audio preprocessing pipeline is tokenization: how do you divide the well-represented audio into meaningful chunks that are more efficient to compute over?
Albert Gu, a professor at CMU and Chief Scientist at Cartesia, presented a novel hierarchical network architecture called H-Net which learns this tokenization process end-to-end. Today's popular algorithms such as Byte Pair Encoding were invented in the early 1990s: they rigidly follow language-specific rules, fail to preserve semantic relationships, and can lead to a bloated vocabulary that doesn't reflect actual word usage or frequency. His framework lets the model dynamically create new context-aware tokens for a given problem, obviating the need for brittle, human-designed tokenizers. This follows an age-old lesson of AI progress: push to learn any component of the system that starts out human-designed.
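For contrast with a learned tokenizer, here is the textbook Byte Pair Encoding merge step in a few lines of Python; it makes clear how mechanically frequency-driven the classic approach is (the toy corpus and helper names are illustrative only).

```python
from collections import Counter

def most_frequent_pair(tokenized_words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for word, freq in tokenized_words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, tokenized_words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, new_symbol = " ".join(pair), "".join(pair)
    return {word.replace(merged, new_symbol): freq
            for word, freq in tokenized_words.items()}

# Toy corpus: words pre-split into characters, with their counts.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
pair = most_frequent_pair(corpus)        # e.g. ('e', 's')
print(pair, merge_pair(pair, corpus))    # merges are chosen purely by frequency
```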
While much of the audio research at the conference focused on human speech, Pratyusha Sharma's talk on understanding animal communication with audio models opened up a fascinating new application area. Her work at MIT on WhaleLM demonstrates that we can predict whale conversations and understand their syntax. While sperm whales are the starting point of this research trajectory, we may not be far away from communicating with dogs and many other animals in a language of their own!
Takeaway 2: World models continue to improve, but nobody is quite sure how to define them.
The phrase "world models" can divide opinion among AI researchers because it is used in many different contexts. We witnessed this confusion up close at our world models breakfast.
Some think world models refer to interactive video models, especially since DeepMind's Genie 2 was introduced as a "large scale foundation world model". The workshop dedicated to this line of research even colloquially named itself "Building Physically Plausible World Models".
A second contingent thinks world models refer to the latent space of any trained model, whether language, code or audio, and their research inspects this latent space to see whether the learned parameters truly represent the dynamics of the real world across modalities. The workshop focused on this community was, in turn, informally titled "Assessing World Models", with a focus on metrics for understanding. Confusing, right?
The takeaways below focus on world models defined as interactive video models (similar to Genie 2). I believe the opportunity to explore generated interactive worlds will make us rethink product R&D across gaming, robotics and many other industries. Nonetheless, we are equally enthusiastic about research that inspects whether trained models truly represent the real world, and about how mechanistic insights into this parameter space can translate into architectural breakthroughs!
Firstly, we have seen impressive efficiency gains in video model inference. One of our portfolio companies, Luma, recently presented a new technique titled "Inductive Moment Matching". Previously, once a video diffusion model had been trained, it had to be distilled into a smaller model so that new videos could be generated in a single step instead of hundreds, lowering latency. Luma's work replaces this two-step process with a single training procedure that trains more stably, accelerating us towards a world of faster video inference.
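To see why step count dominates latency, here is a toy sketch (decidedly not Luma's algorithm, just an illustration with a placeholder update rule): a many-step sampler pays for a full forward pass of the model at every step, while a one-step model pays once.

```python
import torch
import torch.nn as nn

# Stand-in for a large video model.
denoiser = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

def sample(num_steps: int) -> torch.Tensor:
    x = torch.randn(1, 512)                  # start from pure noise
    for _ in range(num_steps):               # each step is one full forward pass
        x = x - 0.01 * denoiser(x)           # placeholder update rule, not a real sampler
    return x

video_many_steps = sample(100)   # latency scales with ~100 forward passes
video_one_step = sample(1)       # one-step generation: a single forward pass
```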
Secondly, we saw interesting developments in loss functions for interactive video models. Although pixel reconstruction has traditionally been used and produces visually sharp frames, it tends to produce unrealistic motion and dynamics over long time frames. In VideoJAM, Meta's GenAI team introduced a joint appearance-motion representation which forces the model to learn smooth, long-range actions during training and steers the model during inference to generate consistent frames.
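As a rough illustration of the idea (not Meta's implementation; the optical-flow targets and the `lambda_motion` weight below are my own assumptions), a joint appearance-motion objective can be as simple as adding a motion-prediction term to the usual pixel loss:

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_frames, target_frames, pred_motion, target_motion,
               lambda_motion: float = 1.0) -> torch.Tensor:
    appearance = F.mse_loss(pred_frames, target_frames)   # pixel reconstruction term
    motion = F.mse_loss(pred_motion, target_motion)       # e.g. dense optical-flow targets
    return appearance + lambda_motion * motion

# Shapes: (batch, time, channels, height, width) for frames,
# (batch, time, 2, height, width) for a 2-D flow field.
frames_pred = torch.randn(2, 8, 3, 64, 64)
frames_true = torch.randn(2, 8, 3, 64, 64)
flow_pred = torch.randn(2, 8, 2, 64, 64)
flow_true = torch.randn(2, 8, 2, 64, 64)
print(joint_loss(frames_pred, frames_true, flow_pred, flow_true))
```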
Finally, we saw interesting benchmarks released for action taking in embodied world models, such as WorldSimBench. An embodied world model is simply an interactive video model from the perspective of a robot, and it lets us test a range of capabilities inside this neural simulation. WorldSimBench includes several perceptual and manipulation tasks across autonomous driving and robotics, testing the ability to incorporate temporal information and follow instructions.
One of the conclusions of our world models breakfast was that existing generations of world models cannot accurately capture object physics and so fail these robotic benchmarks today; for example, sometimes a hand will randomly pass through an object it's holding! Nonetheless, there was broad consensus that with more data this is a solvable problem, so we are excited to see these new benchmarks broken in the coming years.
Takeaway 3: Good computer use agents are around the corner
With the release of OpenAI's Operator and Anthropic's Claude Computer Use in the last year, it is no secret that computer use will be one of the next big modalities to be unlocked. While this will create an exciting, broad application surface area, it will also present interface design and security challenges for foundation model and agent companies.
Recently, Shunyu Yao of OpenAI pointed out in his blog post "The Second Half" that evaluations and rich environments for training agents are now the most important bottleneck to AI progress. They are needed not just to apply existing algorithms to new tasks, but also to provide a testbed for AI researchers working on new techniques.
In this vein, we were excited to see MILA and ServiceNow introduce the UI-Vision benchmark for computer use agents during the main conference. They created a set of desktop-centric tasks spanning 83 software applications and actions such as document editing and file management. Accompanied by rich bounding box annotations and user interaction trajectories (clicking, dragging, etc.), the tasks ascend in complexity from identifying a single UI component, to identifying a group of UI components, to predicting an entire user trajectory.
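As a flavor of how the simplest of these tasks might be scored (the schema below is my own guess, not UI-Vision's actual format), element grounding boils down to checking whether a predicted click lands inside an annotated bounding box:

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

def grounding_accuracy(predicted_clicks, annotated_boxes) -> float:
    """Fraction of predicted clicks that land inside the target UI element."""
    hits = sum(box.contains(x, y) for (x, y), box in zip(predicted_clicks, annotated_boxes))
    return hits / len(annotated_boxes)

# Example: two predicted clicks against two annotated target elements.
boxes = [BoundingBox(10, 10, 110, 40), BoundingBox(300, 500, 360, 530)]
clicks = [(57, 25), (290, 515)]           # the second click misses its element
print(grounding_accuracy(clicks, boxes))  # 0.5
```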
The panel discussion at the workshop on computer use agents also highlighted some pressing areas of research. Russ Salakhutdinov, who co-invented dropout among many other breakthroughs over the last 20 years, emphasized that some desktop-centric problems are more easily solved visually and others with text, so we will need mixed-modality models to solve most workflows effectively.
Secondly, Alexandre Drouin at ServiceNow commented that to make training computer use agents easier and more stable, we will need more efficient encodings of visual UIs than the raw screens humans see. Finally, both Alexandre and Russ agreed that while computer use agents will be wildly effective, building interfaces that keep a human in the loop, able to nudge the model along and convey nuanced preferences, is critical and remains largely unsolved. Since ChatGPT was as much an interface revolution as a modeling one, the breakout winner in computer use agents may need not just a great model but a novel human-in-the-loop frontend.
Takeaway 4: New techniques are improving compute efficiency at training and inference time
As the cost of model training runs skyrockets, methods that improve compute efficiency will become increasingly important.
At the conference, there was exciting work from Stanford on automating the writing of CUDA kernels. In KernelBench, they curated a dataset of 250 PyTorch workloads of increasing complexity to be translated into CUDA kernels. The generated kernels were evaluated on syntactic validity, correctness and performance. This work will hopefully yield automatically generated code for new architectures that surpasses human-designed kernels, speeding up AI research.
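The general shape of such an evaluation is easy to sketch (this is not the KernelBench harness itself, and the candidate below is plain PyTorch standing in for a generated CUDA kernel): check numerical correctness against the reference, then measure speedup.

```python
import time
import torch

def reference(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.softmax(x @ y, dim=-1)

def candidate(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    z = x @ y
    z = z - z.max(dim=-1, keepdim=True).values   # numerically stable softmax
    e = z.exp()
    return e / e.sum(dim=-1, keepdim=True)

x, y = torch.randn(256, 512), torch.randn(512, 512)

# 1) Correctness: the candidate must match the reference within tolerance.
assert torch.allclose(reference(x, y), candidate(x, y), atol=1e-5)

# 2) Performance: wall-clock speedup over the reference (on GPU you would
#    also call torch.cuda.synchronize() around the timers).
def timeit(fn, iters=50):
    start = time.perf_counter()
    for _ in range(iters):
        fn(x, y)
    return (time.perf_counter() - start) / iters

print(f"speedup: {timeit(reference) / timeit(candidate):.2f}x")
```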
A second interesting paper from Stanford was Cartridges. Using long contexts leads to high memory consumption in the KV cache, which is further exacerbated when a system has many users. This paper instead pre-trains a small KV cache for each corpus using a "self-study" technique, so that the model foresees questions that could be asked about the corpus. This combination led to 38x less memory consumption and 26x higher throughput, massively increasing inference efficiency.
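Some back-of-the-envelope arithmetic, with assumed model dimensions rather than numbers from the paper, shows why long contexts are so punishing on KV-cache memory and why a small per-corpus cache is attractive:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2, batch=1):
    # 2x for keys and values, stored for every layer, head, and position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

full_context = kv_cache_bytes(seq_len=128_000)   # a long document kept in context
small_cache = kv_cache_bytes(seq_len=4_000)      # a small trained per-corpus cache
print(f"{full_context / 2**30:.1f} GiB vs {small_cache / 2**30:.2f} GiB "
      f"({full_context / small_cache:.0f}x smaller)")   # ~15.6 GiB vs ~0.49 GiB
```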
Finally, Zachary Charles gave a great talk at the "Efficient Systems for Foundation Models" workshop on DeepMind's DiLoCo (Distributed Low-Communication Training of Language Models). As models and datasets grow, effective distributed training techniques become ever more necessary, because there are physical limits to how large a single GPU cluster can scale and to how many data centers of closely interconnected GPUs we can build. This framework enables islands of poorly connected, high-latency GPUs to be harnessed for training, with 500x less communication overhead than previous techniques and no performance drop-off.
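The structure of the training loop is worth sketching (a simplified toy version based on the published description; the model, data and hyperparameters below are stand-ins): each worker takes many cheap local steps, and only an averaged parameter delta is communicated for the infrequent outer update.

```python
import copy
import torch
import torch.nn as nn

global_model = nn.Linear(32, 32)
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)

def local_work(model, inner_steps=500):
    """One worker: many local optimizer steps on its own data shard, no communication."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(inner_steps):
        x = torch.randn(16, 32)                    # stand-in for a local batch
        loss = (model(x) - x).pow(2).mean()        # toy objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

for outer_round in range(3):                       # communication only happens here
    replicas = [local_work(copy.deepcopy(global_model)) for _ in range(4)]
    for p_global, *p_locals in zip(global_model.parameters(),
                                   *[r.parameters() for r in replicas]):
        # Outer "pseudo-gradient" = average drift of the replicas from the global model.
        p_global.grad = torch.stack([p_global.data - p.data for p in p_locals]).mean(0)
    outer_opt.step()
```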
Excitingly, a new paper at the same workshop, MuLoCo, had already plugged the much-heralded Muon into DiLoCo as the inner optimizer to improve convergence, quality and communication efficiency. Research moves faster and faster these days!
Takeaway 5: Scaling RL and synthetic data will improve model capabilities
While the well of pre-training data runs dry (as Ilya Sutskever memorably declared at NeurIPS), reasoning models such as OpenAI's o1 demonstrate that synthetic data can elicit many more model capabilities. Reinforcement learning is currently being used to scale this synthetic data as well as to unlock new paradigms such as true model creativity.
In “Training a Generally Curious Agent”, we saw how novel synthetic data can guide a model to better explore unseen environments. By training on a diverse set of tasks ranging from cellular automata to Wordle, the model generalized creative strategies to apply to out-of-distribution challenges. The authors also focused on curriculum learning to prioritize which tasks the model should see next given its learning potential, all to improve training efficiency.
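As a toy illustration of the general idea (not the paper's exact method), a curriculum sampler can simply favor tasks whose recent scores are changing fastest, i.e. where learning progress is highest:

```python
import random
from collections import defaultdict, deque

class CurriculumSampler:
    def __init__(self, tasks, window=20):
        self.tasks = tasks
        self.history = defaultdict(lambda: deque(maxlen=window))  # recent scores per task

    def record(self, task, score):
        self.history[task].append(score)

    def learning_progress(self, task):
        h = list(self.history[task])
        if len(h) < 4:
            return 1.0                          # explore tasks we know little about
        half = len(h) // 2
        return abs(sum(h[half:]) / (len(h) - half) - sum(h[:half]) / half)

    def sample(self):
        weights = [self.learning_progress(t) + 1e-3 for t in self.tasks]
        return random.choices(self.tasks, weights=weights, k=1)[0]

sampler = CurriculumSampler(["wordle", "cellular_automata", "bandit"])
task = sampler.sample()            # train on `task`, then report the resulting score
sampler.record(task, score=0.3)
```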
This curriculum idea builds on the Unsupervised Environment Design work that Natasha Jaques co-authored in 2020. In general, I'm excited to see curriculum learning gain wider adoption for efficient RL. Natasha also gave a brilliant talk at the conference on using adversarial training to make models cooperatively adapt to human behavior, a key problem in human-agent interaction.
Although promising, synthetic data has its pitfalls, as was highlighted in the "Collapse or Thrive" paper (and in Sarah's post on our blog). This research reminded us that training a model recursively on only self-generated data will inevitably lead to collapse; Ilia Shumailov of DeepMind also wrote a paper on this phenomenon called "The Curse of Recursion". I'm excited to see the science of data continue to evolve toward smarter curation and cost-effective ways to climb new hills. In the meantime, it is critical to maintain a sufficient quantity of real-world data to anchor the training distribution while augmenting with synthetic data, and the authors introduce techniques for gradually accumulating both types of data throughout training.
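A toy sketch of that "accumulate, don't replace" recipe (my own illustration of the general idea, not the paper's exact procedure) makes the contrast clear:

```python
def replace_strategy(real_data, generations, synthesize):
    data = list(real_data)
    for _ in range(generations):
        data = synthesize(data)          # each generation trains only on the last one
    return data                          # the real data has vanished: collapse risk

def accumulate_strategy(real_data, generations, synthesize):
    pool = list(real_data)
    for _ in range(generations):
        pool = pool + synthesize(pool)   # real data stays anchored in the mix
    return pool

def fake_synthesize(data):
    """Stand-in for 'train a model on `data`, then sample new examples from it'."""
    return [f"synthetic_{len(data)}"] * len(data)

print(len(accumulate_strategy(["doc1", "doc2"], generations=3, synthesize=fake_synthesize)))
```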
As always, it's an exciting time to be shoulder-deep in machine learning research. There's no place we'd rather be than on the field's front lines, and I can't wait to see you all next year!