Applications of computer vision can be limited because the best vision models excel at a single task but do not generalize to new tasks or datasets (e.g., through zero-shot learning). OpenAI has open-sourced CLIP, a neural network designed to address this problem through multimodal learning. CLIP is trained with natural language supervision on millions of image-text pairs scraped from the web. It can be instructed, through natural-language descriptions of the visual categories to be recognized, to perform a wide range of classification tasks.
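To illustrate the zero-shot idea, here is a minimal sketch of CLIP-style classification: the image and each candidate natural-language label are mapped to embeddings, and the label whose embedding is most similar (by cosine similarity, passed through a softmax) wins. The embeddings below are made-up 4-dimensional placeholders standing in for CLIP's real image and text encoders; the function name and temperature value are illustrative assumptions, not CLIP's API.

```python
import numpy as np

def cosine_softmax_classify(image_emb, text_embs, temperature=100.0):
    """Score one image embedding against text-prompt embeddings (CLIP-style)."""
    # L2-normalize so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)        # one logit per candidate label
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

# Hypothetical example: three candidate categories described in plain language,
# with toy embeddings standing in for the outputs of CLIP's encoders.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = np.array([[0.9, 0.1, 0.0, 0.1],
                      [0.1, 0.9, 0.1, 0.0],
                      [0.0, 0.1, 0.9, 0.1]])
image_emb = np.array([0.85, 0.2, 0.05, 0.1])  # closest to the "dog" prompt

probs = cosine_softmax_classify(image_emb, text_embs)
print(labels[int(np.argmax(probs))])  # → a photo of a dog
```

Because the classifier is defined entirely by the text prompts, swapping in a different list of label descriptions retargets the same model to a new task with no retraining.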