Several multi-modal understanding systems use black-box object detection models as a component to identify concepts within an image. However, this design limits end-to-end optimization of the system's overall perceptive capability, because the multi-modal understanding system can only leverage the detected objects rather than the whole image. Moreover, most object detection systems cannot recognize concepts expressed in free-form text. To address this limitation, Kamath et al. introduce MDETR, an end-to-end text-modulated detection system trained with aligned free-form text and bounding boxes as its only form of supervision. MDETR uses a convolutional backbone to extract visual features and a pre-trained language model to extract text features. After the visual and text features are projected into a shared embedding space, a transformer encoder-decoder predicts the objects' bounding boxes and their grounding in the text.
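The architecture described above can be sketched as a minimal forward pass. This is an illustrative toy, not MDETR's actual implementation: the `MiniMDETR` class, its tiny convolutional stem (standing in for MDETR's ResNet backbone), the embedding-table "language model" (standing in for RoBERTa), and all dimensions are hypothetical simplifications chosen to keep the example self-contained.

```python
# Illustrative sketch of an MDETR-style forward pass, assuming small
# stand-in modules in place of the real backbones.
import torch
import torch.nn as nn

class MiniMDETR(nn.Module):
    def __init__(self, d_model=64, num_queries=10, vocab=100, max_text_len=16):
        super().__init__()
        # Stand-in convolutional backbone (MDETR uses a ResNet).
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Stand-in language model (MDETR uses RoBERTa).
        self.text_encoder = nn.Embedding(vocab, d_model)
        # Project both modalities into a shared embedding space.
        self.vis_proj = nn.Linear(d_model, d_model)
        self.txt_proj = nn.Linear(d_model, d_model)
        # Transformer encoder-decoder over the concatenated features.
        self.transformer = nn.Transformer(
            d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        # Learned object queries, as in DETR.
        self.query_embed = nn.Embedding(num_queries, d_model)
        # Heads: box regression and grounding of each box in the text
        # (one logit per token, plus one for "no grounded token").
        self.box_head = nn.Linear(d_model, 4)
        self.align_head = nn.Linear(d_model, max_text_len + 1)

    def forward(self, image, tokens):
        b = image.size(0)
        vis = self.backbone(image).flatten(2).transpose(1, 2)  # (B, HW, d)
        txt = self.text_encoder(tokens)                        # (B, T, d)
        # Concatenate projected visual and text features for the encoder.
        src = torch.cat([self.vis_proj(vis), self.txt_proj(txt)], dim=1)
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        hs = self.transformer(src, queries)                    # (B, Q, d)
        # Boxes in normalized (cx, cy, w, h); grounding logits per query.
        return self.box_head(hs).sigmoid(), self.align_head(hs)

model = MiniMDETR()
image = torch.randn(2, 3, 64, 64)
tokens = torch.randint(0, 100, (2, 16))
boxes, grounding = model(image, tokens)
# boxes: (2, 10, 4) normalized boxes; grounding: (2, 10, 17) logits
```

Each of the ten object queries thus yields both a box and a distribution over text tokens, which is what lets the model ground detected objects in the free-form query.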