Title | ROG: A High Performance and Robust Distributed Training System for Robotic IoT |
Authors | |
Corresponding author | Zhao,Shixiong |
DOI | |
Publication date | 2022
|
Conference name | 55th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
|
ISSN | 1072-4451
|
ISBN | 978-1-6654-7428-3
|
Proceedings title | |
Volume | 2022-October
|
Pages | 336-353
|
Conference date | 1-5 Oct. 2022
|
Conference location | Chicago, IL, USA
|
Place of publication | 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA
|
Publisher | |
Abstract | Critical robotic tasks such as rescue and disaster response are more prevalently leveraging ML (Machine Learning) models deployed on a team of wireless robots, on which data parallel (DP) training over Internet of Things of these robots (robotic IoT) can harness the distributed hardware resources to adapt their models to changing environments as soon as possible. Unfortunately, due to the need for DP synchronization across all robots, the instability in wireless networks (i.e., fluctuating bandwidth due to occlusion and varying communication distance) often leads to severe stall of robots, which affects the training accuracy within a tight time budget and wastes energy stalling. Existing methods to cope with the instability of datacenter networks are incapable of handling such straggler effect. That is because they are conducting model-granulated transmission scheduling, which is much more coarse-grained than the granularity of transient network instability in real-world robotic IoT networks, making a previously reached schedule mismatch with the varying bandwidth during transmission. We present ROG, the first ROw-Granulated distributed training system optimized for ML training over unstable wireless networks. ROG confines the granularity of transmission and synchronization to each row of a layer's parameters and schedules the transmission of each row adaptively to the fluctuating bandwidth. In this way the ML training process can update partial and the most important gradients of a stale robot to avoid triggering stalls, while provably guaranteeing convergence. The evaluation shows that, given the same training time, ROG achieved about 4.9% to 6.5% training accuracy gain compared with the baselines and saved 20.4% to 50.7% of the energy to achieve the same training accuracy. |
Keywords | |
University attribution | Other
|
Language | English
|
Related links | [Scopus record] |
Indexed by | |
Funding | HK RGC GRF["17202318","17207117"]
; HK ITF[GHP/169/20SZ]
; NSFC[62132009]
|
WOS research areas | Computer Science
; Engineering
|
WOS categories | Computer Science, Hardware & Architecture
; Engineering, Electrical & Electronic
|
WOS accession number | WOS:000886530600020
|
EI accession number | 20224613109619
|
EI subject terms | Bandwidth
; Budget control
; Energy efficiency
; Internet of things
; Robots
; Wave transmission
|
EI classification codes | Energy Conservation:525.2
; Information Theory and Signal Processing:716.1
; Radio Systems and Equipment:716.3
; Data Communication, Equipment and Techniques:722.3
; Computer Software, Data Handling and Applications:723
; Robotics:731.5
|
Scopus record ID | 2-s2.0-85141723019
|
Source database | Scopus
|
Full-text link | https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9923782 |
Citation statistics |
Times cited [WOS]: 0
|
Output type | Conference paper |
Item identifier | http://sustech.caswiz.com/handle/2SGJ60CL/411872 |
Collection | Southern University of Science and Technology |
Author affiliations | 1. The University of Hong Kong, Department of Computer Science, Hong Kong, Hong Kong 2. Tsinghua University, Beijing, China 3. Institute of Software, Chinese Academy of Sciences, Beijing, China 4. Eee, Southern University of Science and Technology, China 5. Pujiang Lab, Shanghai, China |
Recommended citation (GB/T 7714) |
Guan,Xiuxian,Sun,Zekai,Deng,Shengliang,et al. ROG: A High Performance and Robust Distributed Training System for Robotic IoT[C]. 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA:IEEE COMPUTER SOC,2022:336-353.
|
Files in this item | There are no files associated with this item. |
|