NeurIPS 2024: main themes and takeaways
Amplify has a long history of attending NeurIPS (for the papers, of course) and went 5 strong this year for one of the densest and most exciting conferences we’ve been to in a while. So we (Barr, Sarah, Rohan, Mike, and Neiman) decided to write up a few of the themes that we heard a lot about. From advances in data curation to compute scaling laws and (finally) good audio models coming soon, it’s an exciting time to be a Neural Information Processing System.
Takeaway 1: RL is making a comeback
Over the last decade, reinforcement learning delivered several successes on tasks with a clear reward function (e.g., Atari games or Go), but it was never quite at the forefront of GenAI hype cycles. That might be changing, though: in the context of foundation models (language, vision, speech, and/or multimodal), researchers are using RL to teach agents to reason in environments where mathematical or programming solutions can be formally verified.
For example, researchers are generating code in Dafny and Rust and using formal verification to provide mathematical guarantees that AI-generated code is correct. Similarly, digital environments like OSWorld make it easier for researchers to test new algorithms, such as GFlowNet fine-tuning, on basic computer operations.
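As a concrete illustration, here's a minimal sketch of that generate-and-verify loop. Everything here is an assumption for illustration: `generate_candidate` stands in for an LLM sampling call, and the `dafny verify` invocation assumes a `dafny` CLI on the PATH (the exact subcommand varies by version).

```python
import subprocess
import tempfile

def verify_dafny(source: str) -> bool:
    """Shell out to the Dafny verifier; True iff every proof obligation checks."""
    with tempfile.NamedTemporaryFile(suffix=".dfy", mode="w", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run(["dafny", "verify", path], capture_output=True)
    return result.returncode == 0

def sample_verified_rewards(generate_candidate, prompt: str, n: int = 8):
    """Sample n candidate programs and attach a binary verifier reward to each.

    The 0/1 reward is exactly the kind of unambiguous signal RL needs:
    a program either proves its spec or it doesn't.
    """
    samples = [generate_candidate(prompt) for _ in range(n)]
    return [(s, 1.0 if verify_dafny(s) else 0.0) for s in samples]
```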
From a technical standpoint, we chatted with researchers who are revisiting the age-old problem of credit assignment to achieve sample efficiency. Knowing which reasoning steps to reward on the way to a correct answer can significantly reduce the number of attempts, and consequently the data and compute, that you need. Furthermore, making models learn tasks in the right order can boost efficiency, an area of research known as environment design. OMNI-EPIC from Jeff Clune's team showed great results in creating this learnable frontier in the context of code models, winning a Best Paper Award at the Intrinsically Motivated Open-ended Learning Workshop.
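Returning to credit assignment: a minimal sketch of the Monte Carlo flavor of step-level reward estimation, in the spirit of process-reward work. `rollout_fn` and `verify_fn` are hypothetical stand-ins for an LLM completion call and an answer checker.

```python
def step_level_rewards(steps, rollout_fn, verify_fn, k=16):
    """Estimate the value of each reasoning step by how often it leads to success.

    For each prefix of the chain, sample k completions and take the fraction
    that reach a verified answer as that step's reward. Steps whose value
    drops sharply are where the chain went wrong.
    """
    rewards = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        wins = sum(verify_fn(rollout_fn(prefix)) for _ in range(k))
        rewards.append(wins / k)
    return rewards
```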
Several past RL breakthroughs had agents play games against each other, a technique known as self-play, to massively increase the amount of data available for learning. Looking ahead, it seems likely we'll use self-play to improve language agents too. Recently, Noam Brown mentioned he's hiring for a multi-agent research team at OpenAI. He highlighted how useful self-play techniques were when he worked on Libratus (a champion poker bot), and he was excited that making intermediate agents reason against each other gives us another axis for scaling compute during post-training.
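For intuition, self-play in its simplest form is just a data generator. A toy sketch, where `agent` and `env` are hypothetical:

```python
def self_play_dataset(agent, env, n_games=1_000):
    """Pit an agent against a copy of itself and keep the outcome-labeled games.

    Every game yields a fresh trajectory labeled by who won, so the training
    data supply scales with compute rather than with human annotation.
    """
    dataset = []
    for _ in range(n_games):
        trajectory, winner = env.play(agent, agent)  # the agent is its own opponent
        dataset.append((trajectory, winner))
    return dataset
```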
Takeaway 2: Data curation makes models faster, better, and smaller
It’s no secret that better data yields better models. After NeurIPS, we are even more excited about new techniques for analyzing how data influences models. Cohere’s Procedural Knowledge in Pre-Training Drives LLM Reasoning used influence functions – a statistical technique that determines how changing a particular training input affects a model’s outputs – to show that while models often trace factual knowledge back to a single document, generalizable reasoning emerges from a collection of documents. We noticed equal interest at the conference in Aleksander Madry’s datamodels. This approach directly optimizes for the subset of the data that yields the highest performance on a given task, after which we can draw inferences from those subsets about how to curate data going forward.
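For a flavor of how influence can be estimated in practice, here's a rough first-order sketch using gradient dot products in the style of TracIn. This is a simplification: the full influence-function machinery also involves an inverse-Hessian term that we omit, and all names here are illustrative.

```python
import torch

def first_order_influence(model, loss_fn, train_batch, test_example):
    """Score each training example by grad(train loss) . grad(test loss).

    A positive score means a gradient step on that example would also lower
    the test example's loss, i.e., the example "helps" that prediction.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(x, y):
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.reshape(-1) for g in grads])

    g_test = flat_grad(*test_example)
    return [torch.dot(flat_grad(x, y), g_test).item() for x, y in train_batch]
```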
During post-training, scalable oversight techniques – methods for supervising and aligning models on tasks that are hard for humans to evaluate directly, so that capabilities can improve while behavior still conforms to our norms – have proven successful in generating high-quality synthetic data, augmenting evaluation, and supporting higher-throughput, higher-quality annotation. During the conference, we hosted a breakfast for scalable oversight researchers to discuss this topic further.
While previous research proposed using LLMs to judge or critique model outputs, new approaches address the limitations of these techniques. For example, a model that doesn’t know the answer to a question can watch stronger models debate each other and still select the right answer. This both generates new chains of thought that were not present in the pretraining data and provides a signal on whether each one is true or false, increasing the quantity and quality of supervision data. These methods could be especially helpful in domains without clear ground truth, such as legal or medical reasoning.
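A toy version of the debate protocol makes the appeal obvious: the judge only has to evaluate arguments, not produce the answer itself. All three callables below are hypothetical prompt-to-text LLM wrappers.

```python
def debate(question, debater_a, debater_b, weak_judge, rounds=2):
    """Two strong models argue opposing answers; a weaker model judges.

    The transcript doubles as supervision data: fresh chains of thought,
    plus the judge's verdict as a (noisy) correctness label.
    """
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("A: " + debater_a("\n".join(transcript)))
        transcript.append("B: " + debater_b("\n".join(transcript)))
    verdict = weak_judge("\n".join(transcript) + "\nWho argued correctly, A or B?")
    return transcript, verdict
```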
Takeaway 3: Emphasis is shifting from pretraining scaling laws to inference-time compute scaling
In his talk at the Compound AI Systems workshop, Noam Brown said: “I’ve never heard any serious AI researcher say that AI is hitting a wall.” Much of the discussion at NeurIPS this year was around where exactly AI progress will (and won’t) come from, with a particular focus on scaling laws.
One buzzy moment was Ilya Sutskever’s talk, in which he stated that “pre-training as we know it will end.” Major AI progress has come from scaling data and compute, but while compute keeps growing, the easily available internet data we pretrain on is finite. In short, the low-hanging fruit is gone. Scaling laws, which have historically driven major breakthroughs during pretraining, also inherently yield diminishing returns due to their logarithmic nature. It is also more costly to train enormous models – Noam posed the question: “would we pay trillions for better AI?” That said, even if pretraining on the low-hanging internet fruit slows down, pretraining is far from dead; there are plenty of data sources we haven’t yet figured out how to use, like under-utilized domain-specific data.
Accordingly, much of the conference’s focus shifted from pretraining scaling laws to gains from inference-time compute. Beyond exciting post-training techniques, over the last year we have seen strategies for how spending compute at inference time can predictably improve an agent’s accuracy. In the summer, work from Stanford and Berkeley started by simply scaling the number of LLM calls per query and filtering responses with majority vote. This demonstrated a path towards cost-optimal choices for increasing compute, and there are far more dimensions along which one can scale compute at inference time, as well as very different results depending on the types of questions you’re asking.
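The core recipe is almost embarrassingly simple, which is part of why it’s exciting as a scaling axis. A minimal sketch, with `sample_fn` as a hypothetical stochastic LLM call (temperature > 0):

```python
from collections import Counter

def majority_vote(sample_fn, question, n=32):
    """Spend more inference compute by sampling n answers and taking a vote.

    Assumes answers are canonicalized (e.g., just the final number), so
    identical answers compare equal; the vote share doubles as a rough
    confidence estimate.
    """
    answers = [sample_fn(question) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n
```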
At the conference, we saw how architecture search frameworks such as Archon help tractably reduce the large design space of ensembling, ranking, fusion, critiquing, and verification to a set of cost-optimal hyperparameters. Furthermore, work in Large Language Monkeys concluded that environments without a clear ground truth hinder inference-time scaling. The authors suggested that auto-formalizing informal problems – e.g., taking a math question in text and rewriting it as verifiable code – can bypass this bottleneck, and we’re excited to see how more investment into these data pipelines might progress.
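One way to picture the auto-formalization idea: have a model rewrite the problem as executable checking code, then use that check as a stand-in verifier. This is a sketch with hypothetical `formalize_fn` and `sample_fn` LLM calls, and the usual sandboxing caveats around `exec` apply.

```python
def autoformalize_filter(question, formalize_fn, sample_fn, n=32):
    """Turn an informal question into a verifier, then filter sampled answers.

    `formalize_fn` is assumed to return Python source defining
    check(answer) -> bool; any candidate that passes the check is kept,
    giving us ground-truth-like signal without labeled answers.
    """
    namespace: dict = {}
    exec(formalize_fn(question), namespace)  # run only in a trusted sandbox
    check = namespace["check"]
    candidates = [sample_fn(question) for _ in range(n)]
    return [a for a in candidates if check(a)]
```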
Takeaway 4: Good audio models are coming soon
Our second conference breakfast focused on audio models, spanning researchers working on everything from real-time speech-to-speech models to music-conditioned generation and background-noise reduction.
The biggest takeaway was that even as many researchers have switched focus to post-training and agents, pretraining marches on in other modalities: we still have significant amounts of unused web audio data! We have not yet figured out how to use most of this data due to privacy concerns, but research in differential privacy may soon unlock this corpus.
Secondly, it is tricky to have long, open-ended conversations with speech models today because they simply aren’t good enough. Fortunately, improved tokenization and data curation around a now-settled architecture – specifically streaming ASR-TTS (automatic speech recognition feeding text-to-speech) – should yield performant systems in the next few years.
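For readers newer to the space, the cascade looks roughly like this. A sketch only: all three streaming components are hypothetical generator-style callables.

```python
def conversation_turn(mic_chunks, asr_stream, llm_stream, tts_stream):
    """One turn through a streaming ASR -> LLM -> TTS cascade.

    Each stage consumes and emits incrementally, so the reply starts
    playing before the full response text has even been generated --
    the key to keeping conversational latency low.
    """
    partial_text = asr_stream(mic_chunks)        # transcripts as they stabilize
    for reply_text in llm_stream(partial_text):  # response tokens stream out
        yield from tts_stream(reply_text)        # audio chunks stream back
```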
Thirdly, most audio models today condition on text, but users want to condition on other modalities to generate more diverse outputs. For example, several papers in the Audio Imagination workshop focused on audio model conditioning, and Sony’s Diff-a-Riff lets you take a track you’re working on, add an audio clip for a theme you want to incorporate (e.g., a hummed melody), and output the musical accompaniment! This opens up fascinating new questions in human-computer interaction, as we can continuously supply new pieces of audio to nudge the model in a new direction. Beyond these technical advances, it was encouraging to learn about active speech data collection efforts in low-resource languages by AI4Bharat, paving the path to even greater access to audio models.
Takeaway 5: New architectural analyses explain poor reasoning results
Previously, most attempts to understand LLMs focused on examining model outputs, and jailbreak attempts used complex prompt optimization strategies to exploit model weaknesses. At NeurIPS, it was refreshing to see the spotlight shine on theoretical limitations of the transformer and reasoning architectures instead.
In Transformers Need Glasses, the authors showed how representation collapse leaves transformers unable to reliably count or copy over long inputs, limiting their reasoning capabilities. In addition, the Softmax is Not Enough talk at the System 2 Reasoning at Scale Workshop highlighted that even if generalizable neural circuitry is learned, limitations of the softmax function (one of the most pervasive functions in modern deep learning) mean a model cannot stay sharp as the number of inputs grows beyond what it saw in training, and so fails to generalize out of distribution. For the same reasons we’re excited about interpretability techniques, a granular understanding of how information propagates in transformers will help us scale better, more accurate models.
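You can see the dispersion argument in a few lines of numpy: hold one item’s logit advantage fixed and watch its attention weight melt away as the number of competing inputs grows. This is a toy illustration of the paper’s claim, not its formal result.

```python
import numpy as np

def max_attention_weight(n, logit_gap=2.0):
    """Softmax weight on the one 'correct' item among n, with a fixed logit edge."""
    logits = np.zeros(n)
    logits[0] = logit_gap                 # bounded advantage, as in trained models
    weights = np.exp(logits) / np.exp(logits).sum()
    return weights[0]

for n in [8, 64, 512, 4096]:
    print(n, round(max_attention_weight(n), 4))
# 8 -> 0.5135, 64 -> 0.105, 512 -> 0.0143, 4096 -> 0.0018:
# sharp selection learned at small n cannot survive at large n
```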
See you next year
Amplify showed up big for NeurIPS this year, with two thematic breakfasts (audio and scalable oversight), our yearly dinner across research areas, plus a professor / grad student lunch. We’ve been investing in AI for 10+ years, and it never gets old seeing the cutting edge every December. See you next year!