ML models to power recommendation systems and other services. While several reports discuss strategies to accelerate training and inference, less attention is paid to the data ingestion pipelines that feed these models, which span storage, reading, and preprocessing. When compute, network, and memory resources are insufficient, the ingestion pipeline stalls, training throughput drops, and GPU resources are wasted. Here, Facebook researchers describe their solution: 1) training data is stored in a structured data warehouse built on a distributed file system; 2) a Data PreProcessing (DPP) service eliminates data stalls by offloading preprocessing operations to disaggregated compute nodes; 3) efficiency optimizations are co-designed across storage and DPP.
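The core idea behind DPP can be sketched with a producer/consumer pipeline: preprocessing runs on separate workers that feed ready batches into a bounded buffer, so the trainer only ever consumes prepared input. This is a minimal single-machine sketch, not Facebook's implementation; the `dpp_worker`, `trainer`, and `preprocess` names and the normalization transform are all hypothetical stand-ins.

```python
import queue
import threading

def preprocess(record):
    # Hypothetical transform: normalize a raw feature vector.
    total = sum(record)
    return [x / total for x in record] if total else record

def dpp_worker(raw_records, batch_queue):
    # Stand-in for a disaggregated DPP node: transforms raw records
    # and pushes ready batches so the trainer never stalls on input.
    for record in raw_records:
        batch_queue.put(preprocess(record))
    batch_queue.put(None)  # sentinel marking end of stream

def trainer(batch_queue):
    # Stand-in for a GPU trainer: consumes only preprocessed batches.
    steps = 0
    while batch_queue.get() is not None:
        steps += 1  # a real trainer would run forward/backward here
    return steps

raw = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
q = queue.Queue(maxsize=2)  # bounded buffer decouples the two stages
t = threading.Thread(target=dpp_worker, args=(raw, q))
t.start()
steps = trainer(q)
t.join()
print(steps)  # → 3
```

Because the buffer is bounded, a slow consumer applies backpressure to the producer; in the disaggregated setting, scaling the number of DPP nodes independently of trainers is what keeps the GPUs from waiting on input.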