To power ML-driven experiences that deepen engagement between buyers and merchants transacting on Facebook and Instagram shops, Facebook researchers sought to build multimodal representations for commerce-related applications. In this paper, Yu et al. describe CommerceMM, a pre-trained model that produces joint multimodal commerce embeddings using an image encoder together with Transformer-based text and multimodal fusion encoders. To improve performance on commerce-specific downstream tasks, they propose two sets of pre-training tasks (image-text tasks and omni-retrieval tasks) and apply modality randomization, in which Transformer layers are dynamically reallocated between the text and multimodal fusion encoders during training. They demonstrate that CommerceMM achieves SOTA performance on 7 downstream tasks.
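To make the modality randomization idea concrete, here is a minimal PyTorch sketch. It assumes a single shared stack of Transformer layers that is split at a randomly chosen point on each training step, with the first part acting as the text encoder and the remainder as the multimodal fusion encoder; the layer count, dimensions, and split procedure are illustrative assumptions, not the authors' exact scheme.

```python
import random
import torch
import torch.nn as nn

class ModalityRandomizedEncoder(nn.Module):
    """Toy sketch of modality randomization: a shared stack of Transformer
    layers is split at a random point each training step, so the first part
    serves as the text encoder and the rest as the multimodal fusion encoder.
    Layer count and widths are illustrative, not the paper's values."""

    def __init__(self, num_layers=12, d_model=256, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, text_tokens, image_tokens):
        # Randomly choose how many layers play the "text encoder" role this step;
        # at inference, fall back to a fixed even split.
        if self.training:
            split = random.randint(1, len(self.layers) - 1)
        else:
            split = len(self.layers) // 2

        h = text_tokens
        for layer in self.layers[:split]:      # text-only layers
            h = layer(h)

        # Concatenate image features with the text representation, then run the
        # remaining layers as the multimodal fusion encoder.
        h = torch.cat([h, image_tokens], dim=1)
        for layer in self.layers[split:]:      # fusion layers
            h = layer(h)
        return h

# Usage: a batch of 2 examples with 16 text tokens and 4 image patch
# embeddings, all of width 256.
model = ModalityRandomizedEncoder()
out = model(torch.randn(2, 16, 256), torch.randn(2, 4, 256))
print(out.shape)  # torch.Size([2, 20, 256])
```

The intent of varying the split during training is that no layer becomes permanently specialized to one role, so the same weights can later be configured with different text/fusion depths depending on the downstream task's latency and accuracy needs.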