International Federation For Information Processing
FORESIGHT System for Joint Scheduling in ML Training
Pages
9
Time to read
37 mins
Publication
Language
English
This technical report presents FORESIGHT, a system that optimizes communication scheduling for distributed machine learning (ML) training. In shared clusters, ML jobs that execute independently of one another can suffer from network contention: when their communication events overlap, performance degrades. FORESIGHT addresses this by jointly optimizing scheduling across both the time and space dimensions, which lets it coordinate network traffic across jobs. The authors describe an iterative algorithm that refines scheduling decisions based on routing feedback until it reaches a contention-free schedule. In the authors' evaluations, FORESIGHT improves network efficiency and reduces ML job iteration times by up to 46%. These results support the case for network-aware scheduling as a way to improve distributed ML performance without changes to existing hardware or application frameworks. The report concludes by emphasizing that the approach scales and improves resource utilization in shared cluster environments.
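To make the idea of iterative joint scheduling concrete, the following toy sketch may help. It is not FORESIGHT's algorithm, which this summary does not specify. It assumes that "time" means the start offset of each job's periodic communication window and that "space" means which candidate network path the job uses. The job model, the link names (`up1`, `up2`), and the functions `windows_overlap`, `find_conflicts`, and `joint_schedule` are all hypothetical. Each round computes the contending job pairs (standing in for routing feedback) and applies the single path or offset change that removes the most contention.

```python
from itertools import combinations


def windows_overlap(s1, d1, s2, d2, period):
    """True if periodic windows [s1, s1+d1) and [s2, s2+d2) overlap modulo `period`."""
    gap = (s2 - s1) % period
    return gap < d1 or (period - gap) % period < d2


def find_conflicts(jobs, plan, period):
    """Stand-in for routing feedback: job pairs that share a link while both communicate."""
    conflicts = []
    for a, b in combinations(sorted(jobs), 2):
        path_a = jobs[a]["paths"][plan[a][0]]
        path_b = jobs[b]["paths"][plan[b][0]]
        shares_link = bool(set(path_a) & set(path_b))
        if shares_link and windows_overlap(
            plan[a][1], jobs[a]["comm"], plan[b][1], jobs[b]["comm"], period
        ):
            conflicts.append((a, b))
    return conflicts


def joint_schedule(jobs, period):
    """Greedy local search over (path index, start offset) per job.

    Returns a contention-free plan, or None if no single move improves
    the current plan (this toy search can get stuck in a local minimum).
    """
    plan = {name: (0, 0) for name in jobs}  # everyone starts on path 0 at offset 0
    conflicts = find_conflicts(jobs, plan, period)
    while conflicts:
        best_plan, best_conflicts = None, conflicts
        # Only jobs currently involved in contention are candidates to move.
        for name in sorted({job for pair in conflicts for job in pair}):
            for path_idx in range(len(jobs[name]["paths"])):  # space dimension
                for offset in range(period):                  # time dimension
                    trial = {**plan, name: (path_idx, offset)}
                    found = find_conflicts(jobs, trial, period)
                    if len(found) < len(best_conflicts):
                        best_plan, best_conflicts = trial, found
        if best_plan is None:
            return None
        plan, conflicts = best_plan, best_conflicts
    return plan


# Hypothetical workload: three jobs, a 10-unit iteration period, 4-unit bursts.
# Only job B has an alternative path, so link "up1" alone cannot fit all three.
period = 10
jobs = {
    "A": {"comm": 4, "paths": [("up1",)]},
    "B": {"comm": 4, "paths": [("up1",), ("up2",)]},
    "C": {"comm": 4, "paths": [("up1",)]},
}
plan = joint_schedule(jobs, period)
print(plan)
```

In this example three 4-unit bursts cannot share one link within a 10-unit period, so shifting offsets alone is not enough. The search first moves job A to offset 4, then reroutes job B onto `up2`, which leaves no contending pairs. A real scheduler would need a stronger search than this greedy descent, which is used here only to keep the sketch short.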