Most machine learning systems are designed to perform well on a single task, such as recognizing faces or classifying digits. The task is baked into the neural network architecture, and supporting a new task typically requires adding another output head. In contrast, Gupta et al. propose a task-agnostic architecture, GPV-1, which can perform several vision-and-language tasks without any changes to the architecture. GPV-1 takes as input an image and text describing the task (e.g., a photo of a parking lot and the request “How many cars are in the image?”) and outputs relevant bounding boxes, confidences, and text. By reusing the same encoders and decoders across tasks, this architecture can outperform task-specific ones.
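To make the task-agnostic interface concrete, here is a minimal sketch of what a GPV-style API might look like. The names (`GPVOutput`, `run_gpv`) and the stub logic are hypothetical, not from the paper; the point is only that every task shares one `(image, prompt) → (boxes, confidences, text)` signature, so switching tasks means changing the prompt, not the architecture.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GPVOutput:
    """Single output type shared by all tasks."""
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)  # (x, y, w, h), normalized
    confidences: List[float] = field(default_factory=list)                        # one score per box
    text: str = ""                                                                # free-form text answer

def run_gpv(image, prompt: str) -> GPVOutput:
    # Stub standing in for the real model. The interface, not the logic,
    # is what matters: counting, captioning, VQA, and localization all
    # return the same structure.
    if "how many" in prompt.lower():
        # A counting question: hypothetical boxes plus a textual count.
        return GPVOutput(boxes=[(0.1, 0.2, 0.3, 0.3)], confidences=[0.9], text="1")
    if "describe" in prompt.lower():
        # A captioning request: text only, no boxes.
        return GPVOutput(text="a photo of a parking lot")
    return GPVOutput()

# Only the prompt changes between tasks; the call site is identical.
counting = run_gpv(image=None, prompt="How many cars are in the image?")
caption = run_gpv(image=None, prompt="Describe the image.")
```

Because the output type always carries boxes, confidences, and text, tasks that need only a subset (say, captioning) simply leave the other fields empty.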