SR-APPFL is a scalable and resilient Argonne Federated Learning platform that features efficient and accurate modeling and simulation toolkits for federated learning systems. As part of this effort, we have developed FedDES, a discrete-event performance simulation framework for large-scale federated learning systems, and PACER, a userspace network rate controller in MPI with adaptive compression for parallel applications. With FedDES and PACER, we can perform large-scale simulations of federated learning workflows, providing an efficient platform for studying system performance and resilience.

In this project, the student will characterize the SR-APPFL platform by running real-world scientific applications and AI workloads on it. The AI tasks to be evaluated may include AI-for-science applications such as PowerGrid and SmartMeter. The overall workflow will be systematically analyzed to identify performance bottlenecks across computation, communication, and data movement. The insights gained from this characterization will be used to optimize system performance and improve the efficiency of AI applications running on the platform. The expected deliverables include optimized software implementations and publications in top-tier HPC conferences such as IPDPS, ICS, and HPDC.
Job Type
Part-time
Career Level
Intern
Education Level
No Education Listed
Number of Employees
1,001-5,000 employees