Developers who want to target specialized hardware such as GPUs often use kernel libraries like PyTorch, whose high-level interfaces resemble those of their CPU equivalents. However, these kernel libraries often lack functions that their CPU counterparts provide, and they do not easily scale to datasets that exceed accelerator memory. Yuan et al. present "offload annotations" (OAs) to integrate existing CPU libraries with new accelerator libraries, enabling heterogeneous CPU-GPU computation. The approach asks an annotator to a) pair each CPU function with a parallel accelerator-library function, and b) specify an initial storage device, a method for transferring data between devices, and methods for partitioning and merging datasets that do not fit in accelerator memory. Their runtime implementation of OAs, Bach, schedules execution and data transfer based on estimated transfer size and computation cost, and achieves a median speedup of 6.3x over CPU-only workflows with minimal code modification.
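To make the annotation idea concrete, here is a minimal sketch of what an offload annotation might look like, assuming a decorator-based Python API. The names (`offload` and its parameters) are illustrative, not Bach's actual interface; CuPy stands in as the accelerator library mirroring NumPy:

```python
# Hypothetical offload-annotation sketch; `offload` and its parameter
# names are illustrative, not Bach's real API.
import numpy as np
import cupy as cp  # GPU array library with a NumPy-like interface

def offload(gpu_func, to_device, to_host, split, merge):
    """Attach a GPU equivalent and data-movement methods to a CPU function.
    A scheduler could inspect this metadata to decide, per call, whether
    offloading pays for the transfer cost."""
    def decorator(cpu_func):
        cpu_func.oa = {
            "gpu_func": gpu_func,    # parallel accelerator implementation
            "to_device": to_device,  # transfer inputs CPU -> GPU
            "to_host": to_host,      # transfer results GPU -> CPU
            "split": split,          # partition inputs that exceed GPU memory
            "merge": merge,          # recombine partial results on the CPU
        }
        return cpu_func
    return decorator

@offload(
    gpu_func=cp.add,
    to_device=cp.asarray,     # copy a NumPy array onto the GPU
    to_host=cp.asnumpy,       # copy a CuPy array back to host memory
    split=np.array_split,     # np.array_split(arr, n_chunks)
    merge=np.concatenate,
)
def add(x, y):
    """CPU implementation; the runtime may dispatch to `gpu_func` instead."""
    return np.add(x, y)
```

Under this sketch, application code keeps calling `add(x, y)` unchanged; a runtime like Bach would consult the attached metadata to choose a device, insert transfers, and split oversized inputs, which is how the approach achieves speedups without rewriting the workload.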