Many ML model architectures use modality-specific inductive biases (e.g., spatial locality) that boost performance but must be modified to handle new types of data. In contrast, Transformer architectures make far fewer assumptions about their inputs but may require preprocessing steps to reduce computational complexity. In this paper, Jaegle et al. introduce the Perceiver, which can scale to hundreds of thousands of high-dimensional inputs, including images, videos, point clouds, and multimodal media, without domain-specific assumptions. To achieve this, the Perceiver uses cross-attention to map inputs into a small latent space whose size doesn’t depend on the input size. This could enable applications of ML in settings where effective inductive biases are not well understood.
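The key mechanism can be sketched as follows: a small, fixed-size latent array attends over an arbitrarily large input array, so the cost of all subsequent processing is independent of the input size. This is a minimal single-head sketch with illustrative sizes; the actual model also uses learned query/key/value projections, multiple attention heads, and a stack of latent self-attention layers, all omitted here.

```python
import numpy as np

def cross_attend(latents, inputs):
    """Single-head cross-attention: the latent array attends to the input array.

    latents: (N, D) latent array (N is small and fixed).
    inputs:  (M, D) flattened input array (M can be very large).
    Returns an updated latent array of shape (N, D), independent of M.
    """
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)            # (N, M) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the M inputs
    return weights @ inputs                             # (N, D) updated latents

rng = np.random.default_rng(0)
latents = rng.normal(size=(64, 32))     # N=64 latents of width 32 (illustrative)
image = rng.normal(size=(50_176, 32))   # e.g. a 224x224 image flattened to 50,176 elements
out = cross_attend(latents, image)
print(out.shape)                        # (64, 32): latent size is fixed regardless of input size
```

Because the output shape is always (N, D), the same architecture can ingest images, video, or point clouds simply by flattening them into an (M, D) array.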