At Facebook and several other technology companies, deep learning-based recommendation engines drive business-critical features and applications but consume vast amounts of resources, including compute, memory, and network capacity. To enable more efficient offline training of trillion-parameter recommendation models, Facebook developed the ZionEX hardware training platform and co-designed a high-performance training software stack implemented in Python. In this paper, Mudigere et al. discuss how they achieved a 40x improvement in offline training time through techniques such as extending PyTorch to support combined model and data parallelism, developing sharding algorithms to partition embedding tables across devices, and using reduced-precision communication to decrease bandwidth requirements, among other optimizations.
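To make the sharding idea concrete, here is a minimal sketch of one common load-balancing strategy (greedy longest-processing-time assignment): place each table, largest first, on the device with the least accumulated load. Sharding embedding tables this way is the model-parallel half of the hybrid scheme, since the tables are too large to replicate on every device. The table names and the size-only cost model below are illustrative, not the paper's actual sharder, which would also have to weigh compute, memory capacity, and communication cost.

```python
import heapq


def shard_tables(table_sizes: dict[str, int], num_devices: int) -> list[list[str]]:
    """Greedily assign embedding tables to devices to balance total load.

    A minimal sketch: cost is approximated by table size alone; a real
    sharder would fold in lookup frequency, memory limits, and network cost.
    """
    # Min-heap of (current_load, device_index) so we can always pick
    # the least-loaded device in O(log num_devices).
    heap = [(0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    placement: list[list[str]] = [[] for _ in range(num_devices)]

    # Largest tables first: the classic longest-processing-time heuristic.
    for name, size in sorted(table_sizes.items(), key=lambda kv: -kv[1]):
        load, device = heapq.heappop(heap)
        placement[device].append(name)
        heapq.heappush(heap, (load + size, device))
    return placement


if __name__ == "__main__":
    # Hypothetical tables; sizes are rough (rows x embedding_dim) proxies.
    sizes = {"user_id": 50_000_000, "ad_id": 20_000_000,
             "page_id": 8_000_000, "country": 200, "device_type": 50}
    for d, tables in enumerate(shard_tables(sizes, num_devices=2)):
        print(f"device {d}: {tables}")
```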
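Reduced-precision communication amounts to compressing tensors before they hit the network. The sketch below shows only the idea, under the assumption that fp32 gradients are cast to fp16 for transport and restored afterward, halving the bytes on the wire at the cost of some precision; in an actual distributed job the compressed buffer would be handed to a collective such as torch.distributed.all_reduce rather than printed.

```python
import torch


def compress_for_comm(grad: torch.Tensor) -> torch.Tensor:
    # Cast fp32 gradients to fp16 before putting them on the wire:
    # 2 bytes per element instead of 4.
    return grad.to(torch.float16)


def decompress(buf: torch.Tensor) -> torch.Tensor:
    # Restore fp32 on the receiving side before the optimizer step.
    return buf.to(torch.float32)


if __name__ == "__main__":
    grad = torch.randn(1_000_000)  # stand-in for a gradient shard
    wire = compress_for_comm(grad)

    print(f"fp32 payload: {grad.numel() * grad.element_size() / 1e6:.1f} MB")
    print(f"fp16 payload: {wire.numel() * wire.element_size() / 1e6:.1f} MB")

    restored = decompress(wire)
    print(f"max round-trip error: {(grad - restored).abs().max().item():.2e}")
```

The trade-off is the usual one for lossy gradient compression: bandwidth drops by half, while the fp16 rounding error stays small relative to typical gradient magnitudes.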