In March, DeepMind introduced the Perceiver model, which is based on the Transformer architecture but can scale to hundreds of thousands of inputs (like ConvNets) by leveraging an asymmetric attention mechanism: a small, fixed-size latent array cross-attends to the inputs, so compute no longer grows quadratically with input size. Although the Perceiver could handle different types of input data (image, video, audio, point cloud, and combinations thereof), it produced only a single classification label. More recently, DeepMind released Perceiver IO, which can handle both arbitrary inputs and arbitrary outputs. By using attention not only to encode but also to decode the latent array, Perceiver IO can, for instance, also generate language, estimate optical flow, and reconstruct multimodal videos with audio.
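
To make the encode/process/decode pattern concrete, here is a minimal PyTorch sketch of the asymmetric attention idea. This is an illustration under simplifying assumptions, not DeepMind's implementation: the class name, dimensions, and hyperparameters are invented, and the residual connections, LayerNorms, and MLP blocks of the real architecture are omitted for brevity.

```python
import torch
import torch.nn as nn

class PerceiverIOSketch(nn.Module):
    """Hypothetical sketch of Perceiver IO's encode/process/decode pattern."""

    def __init__(self, dim=256, num_latents=128, num_heads=8, depth=4):
        super().__init__()
        # Small, fixed-size latent array that stands between inputs and outputs.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # Encoder: cross-attention reads the (possibly huge) input array
        # into the latent array.
        self.encode = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Processor: standard self-attention over the latents only.
        self.process = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        )
        # Decoder: cross-attention with output queries produces outputs
        # of arbitrary size and structure.
        self.decode = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs, output_queries):
        # inputs:         (batch, M, dim), M can be very large
        # output_queries: (batch, N, dim), one query per desired output element
        b = inputs.shape[0]
        z = self.latents.expand(b, -1, -1)      # (batch, num_latents, dim)
        # Asymmetric attention: cost is O(M * num_latents), not O(M^2).
        z, _ = self.encode(z, inputs, inputs)
        for layer in self.process:
            # Self-attention in latent space: O(num_latents^2), independent of M.
            z, _ = layer(z, z, z)
        # Decoding via attention: output size is set by the queries, not the inputs.
        out, _ = self.decode(output_queries, z, z)
        return out                              # (batch, N, dim)


model = PerceiverIOSketch()
pixels = torch.randn(1, 10_000, 256)   # e.g. a long, flattened input array
queries = torch.randn(1, 10, 256)      # e.g. one query per output token
print(model(pixels, queries).shape)    # torch.Size([1, 10, 256])
```

Because the decoder's output shape is determined by the output queries rather than by the inputs, the same latent representation can be queried for a single label, a sequence of tokens, or a dense per-pixel prediction.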