Annotation for AI doesn’t Scale

For decades, researchers have trained models on human data and data labeling vendors have built huge businesses creating that data—mostly by hiring cheap, unskilled labor. But as models get more capable, the traditional approach to data labeling is starting to fall short. Powering the next generation of foundation models is going to require a drastic rethinking of how we access human knowledge and preferences and integrate them into increasingly complex AI.

We still need human data to encode human preferences in models

To build AI products that are genuinely useful – that people like and actually want to use – we need real, human data. While some believe synthetic data, generated by models, will be sufficient to drive step-change improvements in model performance, I disagree. 

In particular, I believe human data will be most impactful when:

1) It supports new capabilities in more subjective domains where formal verification is not possible.

2) It’s used directly to improve a user experience, i.e., soliciting feedback directly from a user to improve that user’s experience.

Although “self-play” is possible in objective settings like coding and math, human data is needed in areas like law, medicine, and physics, where tasks cannot be formally or automatically verified. In these contexts, we need humans to review model outputs; we cannot use models to assess what they don’t know. Without human intervention, a language model can only generate synthetic data based on existing patterns in its training data; it cannot introduce fundamentally new skills or knowledge. For similar reasons, it is also unlikely that models can synthesize heterogeneous and evolving user preferences.

As such, I think that for at least the next decade, human data representing new skills and reflecting real user preferences will be critical to advancing GenAI applications and unlocking new use cases. 

Human data can also complement synthetic data. When developing DALL·E 3, OpenAI researchers blended human-written captions with synthetic ones during training, using a small percentage of human captions to regularize the data distribution. This helped mitigate biases in synthetic captions, reducing overfitting and improving the model’s ability to generate high-quality images from diverse prompts.
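To make this concrete, here’s a minimal sketch of what blending human and synthetic captions might look like in a training data pipeline. The 5% blend ratio and the loader functions are assumptions for illustration, not the actual DALL·E 3 recipe.

```python
import random

# Hypothetical loaders standing in for whatever data pipeline already exists.
def load_human_captions():
    return [("img_001.png", "A golden retriever leaping to catch a frisbee in a sunlit park")]

def load_synthetic_captions():
    return [("img_001.png", "A dog playing outside"),
            ("img_002.png", "A red car parked on a quiet street")]

HUMAN_CAPTION_RATIO = 0.05  # assumed blend ratio, for illustration only

def sample_training_caption(human_pool, synthetic_pool):
    """Sample an (image, caption) pair, occasionally drawing from the human pool
    to regularize an otherwise synthetic caption distribution."""
    if human_pool and random.random() < HUMAN_CAPTION_RATIO:
        return random.choice(human_pool)
    return random.choice(synthetic_pool)

batch = [sample_training_caption(load_human_captions(), load_synthetic_captions())
         for _ in range(8)]
```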

A brief history of annotation: from Mechanical Turk to Scale

At least since the release of WordNet in 1985, humans have been annotating data to train and evaluate models. In the 2000s, products like FigureEight and Amazon Mechanical Turk emerged to help machine learning teams collect data that could be used to train models via supervised learning. Then, in the 2010s, a new generation of data labeling vendors emerged, largely to address the needs of companies developing autonomous vehicles and applying computer vision. During this time, Scale AI emerged as the dominant vendor serving AV companies that needed massive labeled datasets. 

The data labeling tasks that FigureEight, Scale, Labelbox, and other vendors supported were quick, easy, and straightforward – like drawing a bounding box around pedestrians in a photo of a roadway or classifying the sentiment of a review as positive or negative. Because these tasks did not require domain expertise or nuanced judgment, data labeling vendors could easily outsource work to cheap, unskilled labor. With lower staffing costs, they could simply assign multiple labelers to the same task and compare outputs to ensure quality (consensus-based labeling). It was also relatively cheap to have humans review submitted labels and spot-check random samples to estimate overall quality. Additionally, it was far easier to design software tools that enforced labeling rules, highlighted potential errors, and flagged ambiguous cases for higher-level review.
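For context, here’s a minimal sketch of how consensus-based labeling works in this setting; the agreement threshold and label format are illustrative assumptions.

```python
from collections import Counter

def consensus_label(labels, min_agreement=2 / 3):
    """Majority-vote over redundant labels; escalate low-agreement items for review."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    if votes / len(labels) >= min_agreement:
        return label, "accepted"
    return label, "needs_review"  # route to a human reviewer or spot check

# Three annotators classify the sentiment of the same review.
print(consensus_label(["positive", "positive", "negative"]))  # ('positive', 'accepted')
print(consensus_label(["positive", "negative", "neutral"]))   # ('positive', 'needs_review')
```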

Through these data labeling workflows, companies like Scale built amazing businesses with hundreds of millions in revenue. However, growth started to stall in the early 2020s when self-driving technologies matured. As autonomous vehicles hit the road, companies collected an increasing amount of data from real-world deployments and needed less from human annotators.

Nonetheless, Scale and other labeling vendors saw an opportunity to accelerate growth by supporting a new customer: the burgeoning AI research labs developing foundation models that would power products like ChatGPT. Today, Scale is valued at nearly $14B and approaching $1B in annual recurring revenue. They support many of the top AI research labs and GenAI application companies in the world. 

However, many of these customers are starting to raise concerns about low quality and slow turnaround times. Based on our conversations with AI researchers and engineers at companies ranging from the largest AI research labs to earlier-stage startups, we’re hearing pretty conclusively that the traditional approach to data labeling is not meeting their needs.

Why the traditional data labeling model doesn’t work anymore

When Scale was started, most companies were using labeled data to train computer vision and language models through supervised learning. They needed large labeled datasets, where the labels represented responses to simple tasks. In the late 2010s though, a new modeling paradigm emerged. With self-supervised learning, AI researchers and developers no longer needed access to massive labeled datasets. Instead, they would pretrain models on even more massive, Internet-scale, unlabeled datasets. Models gained the uncanny ability to predict the next word in a sentence, which enabled them to do a surprisingly wide range of tasks. 

Still, to teach new behaviors and deliver compelling user experiences, these models had to be post-trained. Models could be post-trained through supervised learning on a labeled dataset, and/or they could be aligned with user preferences (to reflect what we subjectively think is good) through reinforcement learning from human feedback (RLHF). To implement the latter technique, AI practitioners collect data from human annotators who evaluate and compare model outputs. 
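To make that concrete, here is a hedged sketch of what a single pairwise-preference record collected from an annotator might look like before it feeds a reward model; the field names are illustrative, not any lab’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class PreferenceRecord:
    prompt: str          # what the model was asked to do
    response_a: str      # output from model (or checkpoint) A
    response_b: str      # output from model (or checkpoint) B
    preferred: str       # "a", "b", or "tie", as chosen by the annotator
    annotator_id: str    # for quality control and inter-annotator agreement
    rationale: str = ""  # optional free-text justification

record = PreferenceRecord(
    prompt="Summarize this contract clause in plain English.",
    response_a="The tenant must give 60 days' notice before moving out.",
    response_b="Notice obligations apply per section 4(b).",
    preferred="a",
    annotator_id="annotator_17",
    rationale="Response A is concrete and avoids jargon.",
)
```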

In addition, model reasoning can be improved through a technique called process supervision, wherein human annotators review each step a model executes as it tries to solve a problem. This technique can engender new and improved model capabilities.

The AI community quickly discovered that the processes and tools established vendors had created to solve the labeling problem in the supervised learning context could not be applied or extended to collect the annotations needed for RLHF, process supervision, or other post-training techniques. The entire landscape of what annotation means has changed, because the way we train models – and what they can do – has completely changed.

Labeling now requires increasingly expert humans 

When post-training a foundation model, researchers are trying to improve upon what is already a very capable model. The pretrained model can already do simple tasks that a human could quickly execute reasonably well. Through post-training, they are trying to get the model to do harder tasks and to understand subtle human proclivities. So the problem is that these hard, nuanced tasks are — not surprisingly — also hard for humans to do and to evaluate. 

Instead of, for example, identifying trucks on a highway, now we have annotation tasks requiring domain expertise like solving physics problems or planning complex travel.

We might be comparing the output of two code generation models — how do you decide which massive blob of code is better? If the coding task is short and simple or if there’s a big difference in the quality of the two outputs, a junior developer can do that job. But as the coding tasks get longer and harder, it becomes harder for humans to catch bugs and other problems. Once models are so capable that there are only small nuances to distinguish between a good and a better output, that is a highly skilled evaluation task for a senior engineer. But does a senior engineer even want to do that type of work? 

Consequently, more experienced annotators are needed. To build more and more capable models, we need more and more capable humans.  

Scale and new market entrants like Surge, Turing and Mercor are already helping AI research labs connect with expert coders and mathematicians who can evaluate AI output. However, there are many, many domains where they don’t yet provide any solution. AI researchers and engineers need to automate tasks in a growing number of industries and specialties like product design, customer support, or medicine. Finding experts in these specialized domains is even tougher; few are available and willing to perform annotation work.

Companies like Midjourney might need access to expert photographers; companies like Harvey might need access to attorneys – but finding, recruiting, and screening experts in this long tail of markets is really challenging and expensive. So it’s clear that new solutions are needed to access (and empower) expert annotators.

Quality control needs revamping

In addition, the quality control systems and processes designed for data labeling need to be rethought. Since expert annotators demand high compensation, it’s no longer reasonable to just improve quality through consensus-based labeling, even when dynamically adapting redundancy levels. Even when companies are willing to spend money on consensus-based labeling, they often find it challenging to achieve high concordance among annotators since tasks are getting more subjective. 

Moreover, traditional error-detection algorithms do not address new and emerging problems. For example, several companies we spoke with noted that it was clear that the annotators they worked with were pasting content directly from ChatGPT.

Disagreements about quality control processes are creating friction between annotators and the platforms that recruit them. Several annotators have complained that they weren’t paid what they were owed due to opaque submission policies, leading to frustration and distrust. Because annotators are usually compensated on a per-task basis, they may be incentivized to cut corners and find shortcuts so they can complete tasks as quickly as possible – thereby maximizing earnings. Unlike real users (who benefit from high-quality outputs), annotators don’t have a direct incentive to carefully review outputs before sharing their preferences.

The waiting game

Systems and processes for collecting expert annotations are just too slow. While data labelers could turn around simple tasks relatively quickly, assessing model outputs on complex tasks is more arduous – it simply takes human experts more time. This means that customers of established annotation vendors must wait weeks or even months before getting the data they need to improve their models and stay competitive.

This problem is compounded by the fact that AI practitioners often must iterate several times on the instructions and guidelines they share with annotators before they see substantial improvements in submission quality. This iteration was rarely necessary before, when tasks were simpler.

Lastly, companies must collect feedback from experts in carefully constructed environments tailored to the specific annotation task. For example, a company building text2sql models might need to provision access to a database and SQL workbench to ensure that annotators can interact with queries in a realistic setting. Similarly, a company doing document-centric tasks must provide annotators with access to real documents to elicit their preferences. However, configuring these environments takes additional time, resources and logistical effort, adding complexity to the data annotation process. 
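As a sketch of what provisioning such an environment can involve, here is a minimal, assumed setup that gives an annotator a throwaway SQLite database to run candidate queries against; a real deployment would use the customer’s actual schema and a proper SQL workbench.

```python
import sqlite3

def build_annotation_sandbox():
    """Create a throwaway database so annotators can execute candidate SQL and
    judge model outputs against real results instead of reading queries cold."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
        INSERT INTO orders VALUES (1, 'Acme', 120.0), (2, 'Globex', 75.5);
    """)
    return conn

def run_candidate_query(conn, sql):
    try:
        return {"ok": True, "rows": conn.execute(sql).fetchall()}
    except sqlite3.Error as exc:
        return {"ok": False, "error": str(exc)}

conn = build_annotation_sandbox()
# Two model-generated candidates for "total revenue per customer":
print(run_candidate_query(conn, "SELECT customer, SUM(total) FROM orders GROUP BY customer"))
print(run_candidate_query(conn, "SELECT customer, SUM(amount) FROM orders GROUP BY customer"))  # wrong column; fails
```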

Costs do scale

While data labeling and annotation vendors have experienced – and may continue to experience – rapid revenue growth, they often operate with low margins and struggle to scale. Unlike software businesses that can scale revenue without a proportional increase in costs, these companies face persistent margin pressure due to the high – and rising – cost of human labor. As companies demand more specialized annotations, vendors must recruit annotators with greater domain expertise, further increasing costs. To improve margins, new entrants must fervently streamline operational processes. They may also consider increasing automation, strategically integrating synthetic and human data, and/or selling related tools.

In summary, what is needed is a better approach to:

  1. Recruit a long tail of expert annotators (or enable amateur annotators to perform like experts) 
  2. Review expert annotations on increasingly challenging tasks 
  3. Improve the throughput of a limited pool of expert annotators
  4. Increase margins through automation and packaging

How do we do all of that?

Approaches to solve the advanced human annotation problem

We’re starting to see that scalable oversight techniques – pioneered by the AI safety community to ensure models don’t do things that jeopardize the human race – can be applied to improve models too, by enabling humans to efficiently oversee models at scale.

Efficient Use of Human Feedback 

The original paper on RLHF focused on reducing the cost of human oversight by selectively requesting human input only when necessary. The authors developed a method to estimate the uncertainty in the reward model and use this to determine when to request human feedback. By focusing on these uncertain cases, the system minimized redundant annotations. This active learning approach ensures that human oversight is concentrated on the most informative examples, allowing the agent to learn complex behaviors with significantly fewer human-labeled comparisons. In other words, we can minimize the amount of annotation required by only bringing it to bear when it’s necessary or will have maximum impact.
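A minimal sketch of the idea, assuming we already have an ensemble of reward models: the pairs where the ensemble disagrees most about which output wins are the ones routed to human annotators. The function names, the toy ensemble, and the disagreement measure are all illustrative assumptions.

```python
import statistics

def ensemble_disagreement(reward_models, prompt, output_a, output_b):
    """How much an ensemble of reward models disagrees about which output wins.
    Each reward model maps (prompt, output) to a scalar score."""
    margins = [rm(prompt, output_a) - rm(prompt, output_b) for rm in reward_models]
    return statistics.pstdev(margins)  # high spread -> uncertain -> worth asking a human

def select_pairs_for_human_feedback(candidate_pairs, reward_models, budget):
    """Spend a limited annotation budget on the most informative comparisons."""
    ranked = sorted(candidate_pairs,
                    key=lambda pair: ensemble_disagreement(reward_models, *pair),
                    reverse=True)
    return ranked[:budget]

# Toy ensemble: three "reward models" that just score outputs by length.
reward_models = [lambda prompt, out, w=w: len(out) * w for w in (0.8, 1.0, 1.2)]
pairs = [("Explain overfitting.", "Short answer.", "A much longer, more detailed answer."),
         ("Name a prime.", "Seven.", "Eleven.")]
print(select_pairs_for_human_feedback(pairs, reward_models, budget=1))
```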

Critic models and debate 

Researchers recently proposed using a model to generate a critique of another model’s output, which a human then reviews while comparing outputs, before selecting the one they prefer. This has two benefits:

  1. Humans can review longer outputs: it’s much easier and less time-consuming for a human to read a summary, for example, of why one codebase is superior to another, than having to review two entire codebases to make a comparison. 
  2. Humans can review more difficult tasks: the critique helps guide the human’s annotation. Think of it this way: if you asked someone on the street to do a blind tasting of wine, it’s a coin toss as to whether they could actually pick the better (whatever this means) wine. But if instead you first gave them Robert Parker’s analysis and ranking of each wine before letting them taste, there’s a much better chance they’ll be able to pick the winning wine. 

OpenAI has experimented with this approach with CriticGPT, finding that “when people get help from CriticGPT to review ChatGPT code they outperform those without help 60% of the time.” Knowing what to look for goes a long way toward higher-quality annotation.
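Here’s a minimal sketch of what critique-assisted comparison could look like inside an annotation pipeline. The llm and ask_human callables are assumed stand-ins for a model API and an annotation UI; this is not OpenAI’s actual CriticGPT setup.

```python
def critique_assisted_comparison(llm, ask_human, task, output_a, output_b):
    """Generate a critique of each candidate output, then show the critiques to the
    human annotator alongside the outputs to guide their preference judgment."""
    def critique(output):
        return llm(f"Task: {task}\n\nCandidate output:\n{output}\n\n"
                   "List concrete bugs, omissions, or weaknesses in this output.")

    # The human still makes the final call; the critiques just tell them where to look.
    return ask_human(task,
                     (output_a, critique(output_a)),
                     (output_b, critique(output_b)))
```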

In a similar vein, another technique has AI agents argue for or against a pair of outputs. Each agent makes its case by presenting evidence or reasoning on why one output is better, and the human annotator decides which AI provided the most convincing argument. This technique may be even more robust than a single critique, since a lone critic could mislead the annotator; the presence of an opposing debater can push human annotators to apply more scrutiny on particularly hard tasks.
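And a similarly hedged sketch of the debate variant, again with assumed llm and ask_human stand-ins: one agent argues for each output, and the human judges the arguments rather than the raw outputs alone.

```python
def debate_then_judge(llm, ask_human, task, output_a, output_b, rounds=2):
    """Two model 'debaters' each advocate for one candidate output over a few rounds;
    the transcript goes to a human judge who picks the winner."""
    transcript = []
    for _ in range(rounds):
        for side, own, other in (("A", output_a, output_b), ("B", output_b, output_a)):
            argument = llm(
                f"Task: {task}\n"
                f"You are arguing that output {side} is better.\n"
                f"Your output:\n{own}\n\nOpponent's output:\n{other}\n\n"
                f"Debate so far:\n{transcript}\n"
                "Make your strongest, evidence-based case."
            )
            transcript.append((side, argument))
    return ask_human(task, output_a, output_b, transcript)
```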

LLM-as-judge

The AI community has noted that LLMs are often better at verifying outputs than at generating them. Hence, many have proposed using LLMs to review the quality, accuracy, or coherence of outputs produced by other agents.

In fact, this method can also be applied to screen annotators and review their submissions. Unlike existing QC mechanisms, LLM-as-judge techniques could operate in near real time, so platforms could intervene and guide annotators immediately as concerns arise. This could head off the contentious disputes that occur when annotators are banned from platforms or refused payment.
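A hedged sketch of how an LLM judge might screen annotator submissions in near real time; the rubric prompt, threshold, and llm stand-in are assumptions, not any platform’s actual QC pipeline.

```python
import json

def judge_submission(llm, task, model_output, annotation, threshold=0.7):
    """Ask a judge model to score an annotator's submission, so low-quality or
    suspicious work triggers immediate, explainable feedback instead of a late ban."""
    verdict = llm(
        "You are reviewing a data-annotation submission.\n"
        f"Task: {task}\n"
        f"Model output being reviewed: {model_output}\n"
        f"Annotator's label and rationale: {annotation}\n"
        'Respond with JSON: {"score": <number between 0 and 1>, "reason": "<one sentence>"}'
    )
    result = json.loads(verdict)
    if result["score"] < threshold:
        return {"action": "flag_for_guidance", "reason": result["reason"]}
    return {"action": "accept", "reason": result["reason"]}
```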

Stay scrappy 

It’s worth repeating that the annotation problem cannot be solved through AI alone. As enterprises adopt AI, they will need human data annotation for an increasingly diverse range of tasks. Although these jobs may be small, they will cover a broader spectrum of needs. This will drive significant growth in the annotation market, but it will also become more fragmented, with a wider array of customers and use cases. To succeed, startups must excel at finding, recruiting, and managing annotators who can meet the evolving demands of enterprise AI. And this will require operational excellence.

There’s a huge opportunity here to enable higher-quality annotation for a market that’s hungry for advanced human data. If you are building tools that help to make human data annotation more efficient or more accessible for non-experts, reach out to our team!

And if you’re interested in reading more about what’s going on in annotation, we put together a reading list of papers from the last few years. You can check it out on GitHub here.

Thanks to Finbarr Timbers, Jonathan Frankle, and Tom Yan for thoughts and edits on this post.
