Title

ROG: A High Performance and Robust Distributed Training System for Robotic IoT

Authors
Corresponding author: Zhao, Shixiong
DOI
Publication date
2022
Conference name
55th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
ISSN
1072-4451
ISBN
978-1-6654-7428-3
Proceedings title
Volume
2022-October
Pages
336-353
Conference dates
1-5 Oct. 2022
Conference location
Chicago, IL, USA
Place of publication
10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA
Publisher
Abstract
Critical robotic tasks such as rescue and disaster response are more prevalently leveraging ML (Machine Learning) models deployed on a team of wireless robots, on which data parallel (DP) training over Internet of Things of these robots (robotic IoT) can harness the distributed hardware resources to adapt their models to changing environments as soon as possible. Unfortunately, due to the need for DP synchronization across all robots, the instability in wireless networks (i.e., fluctuating bandwidth due to occlusion and varying communication distance) often leads to severe stall of robots, which affects the training accuracy within a tight time budget and wastes energy stalling. Existing methods to cope with the instability of datacenter networks are incapable of handling such straggler effect. That is because they are conducting model-granulated transmission scheduling, which is much more coarse-grained than the granularity of transient network instability in real-world robotic IoT networks, making a previously reached schedule mismatch with the varying bandwidth during transmission. We present ROG, the first ROw-Granulated distributed training system optimized for ML training over unstable wireless networks. ROG confines the granularity of transmission and synchronization to each row of a layer's parameters and schedules the transmission of each row adaptively to the fluctuating bandwidth. In this way the ML training process can update partial and the most important gradients of a stale robot to avoid triggering stalls, while provably guaranteeing convergence. The evaluation shows that, given the same training time, ROG achieved about 4.9% to 6.5% training accuracy gain compared with the baselines and saved 20.4% to 50.7% of the energy to achieve the same training accuracy.
Keywords
University attribution
Other
Language
English
Related links: [Scopus record]
Indexed in
Funding
HK RGC GRF["17202318","17207117"] ; HK ITF[GHP/169/20SZ] ; NSFC[62132009]
WOS research areas
Computer Science ; Engineering
WOS categories
Computer Science, Hardware & Architecture ; Engineering, Electrical & Electronic
WOS accession number
WOS:000886530600020
EI accession number
20224613109619
EI subject terms
Bandwidth ; Budget control ; Energy efficiency ; Internet of things ; Robots ; Wave transmission
EI classification codes
Energy Conservation:525.2 ; Information Theory and Signal Processing:716.1 ; Radio Systems and Equipment:716.3 ; Data Communication, Equipment and Techniques:722.3 ; Computer Software, Data Handling and Applications:723 ; Robotics:731.5
Scopus record ID
2-s2.0-85141723019
Source database
Scopus
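Full-text link: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9923782
Citation statistics
Times cited [WOS]: 0
Document type: Conference paper
Item identifier: http://sustech.caswiz.com/handle/2SGJ60CL/411872
Collection: Southern University of Science and Technology
Author affiliations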
1.The University of Hong Kong,Department of Computer Science,Hong Kong,Hong Kong
2.Tsinghua University,Beijing,China
3.Institute of Software,Chinese Academy of Sciences,Beijing,China
4.Eee,Southern University of Science and Technology,China
5.Pujiang Lab,Shanghai,China
Recommended citation
GB/T 7714
Guan,Xiuxian,Sun,Zekai,Deng,Shengliang,et al. ROG: A High Performance and Robust Distributed Training System for Robotic IoT[C]. 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA:IEEE COMPUTER SOC,2022:336-353.
Files in this item
No files are associated with this item.
