To develop increasingly complex models on increasingly large datasets, machine learning developers often need to leverage distributed parallel training. Although the computational performance of distributed training has improved due to advances in GPUs and hardware accelerators, distributed training is becoming increasingly network-bound, i.e., bottlenecked by the communication phases in which workers exchange gradient vectors to compute a global sum and update their models. To reduce training time, Sapio et al. propose an in-network aggregation primitive that can be implemented on programmable switch hardware. Their approach, SwitchML, aggregates gradient updates inside the network, which reduces the amount of data transmitted during synchronization phases. The authors demonstrate that SwitchML speeds up training workloads by up to 2.27x.
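
To make the idea concrete, the sketch below simulates the core of in-network aggregation in plain Python: workers convert their float gradients to fixed-point integers (switch ASICs generally lack floating-point units), the "switch" sums contributions in a small pool of aggregation slots, and once every worker has contributed to a slot the aggregated chunk is released and the slot is reused for the next chunk. The slot count, scaling factor, and chunk size here are illustrative assumptions, not SwitchML's actual parameters or packet format.

```python
import numpy as np

NUM_WORKERS = 4
CHUNK_SIZE = 64      # elements per "packet" (illustrative, not SwitchML's format)
NUM_SLOTS = 8        # size of the switch-side aggregator pool (assumed)
SCALE = 1 << 16      # fixed-point scaling factor (assumed; switches lack float ALUs)


class SwitchAggregator:
    """Toy model of a switch maintaining a pool of integer aggregation slots."""

    def __init__(self, num_workers, num_slots=NUM_SLOTS):
        self.num_workers = num_workers
        # Each slot holds a running integer sum and a count of contributions.
        self.slots = {s: [np.zeros(CHUNK_SIZE, dtype=np.int64), 0]
                      for s in range(num_slots)}

    def receive(self, slot, chunk):
        """Add one worker's fixed-point chunk; return the aggregate once complete."""
        acc, count = self.slots[slot]
        acc += chunk
        count += 1
        if count == self.num_workers:
            result = acc.copy()
            # Reset the slot so it can be reused for the next chunk.
            self.slots[slot] = [np.zeros(CHUNK_SIZE, dtype=np.int64), 0]
            return result
        self.slots[slot] = [acc, count]
        return None


def aggregate(worker_grads):
    """All-reduce equal-length float gradient vectors through the toy switch."""
    switch = SwitchAggregator(len(worker_grads))
    length = len(worker_grads[0])
    out = np.zeros(length, dtype=np.float64)
    for start in range(0, length, CHUNK_SIZE):
        slot = (start // CHUNK_SIZE) % NUM_SLOTS
        summed = None
        piece_len = min(CHUNK_SIZE, length - start)
        for grad in worker_grads:
            chunk = np.zeros(CHUNK_SIZE, dtype=np.int64)
            piece = grad[start:start + piece_len]
            chunk[:piece_len] = np.round(piece * SCALE).astype(np.int64)
            maybe = switch.receive(slot, chunk)
            if maybe is not None:
                summed = maybe
        # The last worker's contribution completes the slot; convert back to floats.
        out[start:start + piece_len] = summed[:piece_len] / SCALE
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(1000) for _ in range(NUM_WORKERS)]
    agg = aggregate(grads)
    print(np.allclose(agg, np.sum(grads, axis=0), atol=1e-3))  # True
```

In SwitchML itself the slot pool lives in the switch's register memory and aggregation happens as packets stream through, so each worker sends a gradient chunk once and receives the sum once, rather than exchanging data pairwise as in host-based all-reduce.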