Most self-supervised learning models were developed for a single modality (e.g., speech, NLP, vision). Recently, however, Meta AI open-sourced data2vec, a framework that applies the same learning method to any modality (speech, text, images) using an identical learning objective across all of them. In data2vec, a teacher network first produces latent representations of the full, unmasked input; a student network then receives a masked version of the same input and is trained to predict those full-input representations. The framework currently includes speech and NLP models that can be trained through CLI tools. Pre-trained vision models will be made available soon.
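To make the teacher–student objective concrete, here is a minimal NumPy sketch of the training loop's core idea. Everything here is a toy stand-in (a one-layer `encode` function instead of a Transformer, zeroed-out tokens as the mask, random data); it illustrates the structure of the objective, not Meta's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Toy 'encoder': one linear layer + tanh (stand-in for a Transformer)."""
    return np.tanh(x @ w)

# Hypothetical toy setup: 8 input 'tokens', 16-dim features and representations.
d = 16
x = rng.normal(size=(8, d))           # original, unmasked input
student_w = rng.normal(size=(d, d)) * 0.1
teacher_w = student_w.copy()          # teacher starts as a copy of the student

# 1) Teacher encodes the *full* input to produce target representations.
targets = encode(x, teacher_w)

# 2) Student sees a *masked* version of the input.
mask = rng.random(8) < 0.5            # mask roughly half the tokens
x_masked = x.copy()
x_masked[mask] = 0.0                  # masked tokens replaced with zeros

preds = encode(x_masked, student_w)

# 3) Loss: regress student predictions onto teacher targets at masked positions.
loss = np.mean((preds[mask] - targets[mask]) ** 2)

# 4) The teacher tracks the student via an exponential moving average (EMA),
#    so targets improve as the student learns (self-distillation).
tau = 0.999
teacher_w = tau * teacher_w + (1 - tau) * student_w

print(loss)
```

Because the targets are continuous latent representations rather than modality-specific tokens (words, pixels, or speech units), the same regression objective can be reused unchanged across text, vision, and audio.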