School of Computer Science Lecture Series, Elite Forum (菁英论坛) No. 43: Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks

           

Title: Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks


Date & Time: May 30, 2025, 10:00-11:00 am


Location: Room 813, Yanyuan Building #1 (Yanyuan campus)


Speaker: Ryan Huang


Host: Jin Xin (金鑫)


Abstract


Deep learning training is a complex process involving assorted components, which makes it prone to errors. While some errors cause immediate training job failures, many others are silent or latent. These silent training errors are inherently challenging to detect, yet they have severe consequences, such as wasting precious training resources and producing poor models. In this talk, I will present TrainCheck, a system we developed to proactively detect silent training errors and provide diagnostic clues.


The talk will begin by discussing why silent errors are common and difficult to catch, drawing on an empirical study we conducted. I will then walk through the core idea behind TrainCheck: instead of relying on high-level metrics like loss or accuracy, which are noisy and unreliable, we enforce precise training invariants, that is, semantic properties that should consistently hold during training. I will explain how TrainCheck automatically infers these invariants, deduces their preconditions, and validates them online. Our evaluation results show that TrainCheck detects real-world bugs within a single iteration and uncovers previously unknown silent errors in popular DL libraries. The talk will conclude with a discussion of future directions toward making deep learning training more robust to silent errors.


Bio



Ryan Huang is an Associate Professor in the EECS Department at the University of Michigan, Ann Arbor. Prior to that, he was an Assistant Professor at Johns Hopkins University. He leads the OrderLab, which conducts research broadly in computer systems and specializes in designing principled methods to improve the reliability and performance of software systems ranging from small mobile devices to large data centers. His work has received best paper awards at top systems venues. He is a recipient of the NSF CAREER Award.






School of Computer Science, Peking University