NARI Technology Co., Ltd., Nanjing 211100, Jiangsu, China
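ZHANG Wancai (1983- ), male, Ph.D., senior engineer at NARI Technology Co., Ltd. His main research interest is electric power information and communication.
ZHANG Nan (1982- ), female, senior engineer at NARI Technology Co., Ltd. Her main research interests are the application and innovation of artificial intelligence platforms, knowledge graphs, NLP semantic analysis, and RPA robots in power systems.
YANG Wenqing (1974- ), male, Ph.D., senior engineer at NARI Technology Co., Ltd. His main research interests are power grid big data analysis, power system data applications, and data value mining.
WANG Tao (1996- ), male, engineer at NARI Technology Co., Ltd. His main research interests are artificial intelligence, computer vision, digital image processing, and object detection.
ZHANG Wenqiang (1993- ), male, engineer at NARI Technology Co., Ltd. His main research interests are the construction of electric power information systems, artificial intelligence, and knowledge graphs.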
Received: 2024-07-05
Revised: 2024-09-15
Published in print: 2024-09-20
ZHANG Wancai, ZHANG Nan, YANG Wenqing, et al. Architecture design, key technologies, and application research of GPU heterogeneous resource pool platform based on virtualization[J]. Telecommunications Science, 2024, 40(09): 162-175. DOI: 10.11959/j.issn.1000-0801.2024216.
Artificial intelligence computing resources are expensive and subject to market supply disruptions. The traditional single-card, single-use model results in low resource utilization and efficiency, and existing techniques struggle to support efficient management and scheduling of diverse heterogeneous graphics processing unit (GPU) resources. To address this, a virtualization-based GPU heterogeneous resource pool platform was proposed. Firstly, the overall architecture, logical architecture, and functional architecture of the platform were planned and designed. Secondly, key technologies were studied, and a virtualized heterogeneous GPU resource pool framework and a scheduling model based on time slicing + load balancing were proposed. Finally, based on the proposed methods, several innovative application modes were put forward, including multiservice single-card stacking, cross-pull, cross-machine integration, hybrid deployment, and time division multiplexing. The proposed method provides enterprise-level AI applications with GPU computing resources that are compatible with GPUs from multiple manufacturers, support remote access, can be flexibly partitioned and aggregated, and can be elastically scheduled. Estimates and analysis indicate that, for the same development and training workload, the number of GPU cards can be reduced by up to 60% while operating efficiency is improved by a factor of four.
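The abstract names a scheduling model based on time slicing + load balancing but gives no details of it. The sketch below is therefore only a generic illustration of those two ideas, not the authors' model: the `GPU` class, the `place` and `time_slices` functions, the vendor tags, and the fractional "share" of a card are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass, field
from itertools import cycle, islice


@dataclass
class GPU:
    name: str                # hypothetical device label
    vendor: str              # hypothetical vendor tag (heterogeneous pool)
    capacity: float = 1.0    # assumption: one whole card equals 1.0
    tasks: list = field(default_factory=list)  # (task_name, share) pairs

    @property
    def load(self) -> float:
        # Fraction of the card already handed out to tasks.
        return sum(share for _, share in self.tasks)


def place(pool, task, share):
    """Load balancing: put the task on the least-loaded card that still fits."""
    candidates = [g for g in pool if g.load + share <= g.capacity]
    if not candidates:
        raise RuntimeError(f"no card can host {task} (share {share})")
    target = min(candidates, key=lambda g: g.load)
    target.tasks.append((task, share))
    return target


def time_slices(gpu, n_slices):
    """Time slicing: round-robin the card among its tasks, one quantum each."""
    names = [name for name, _ in gpu.tasks]
    return list(islice(cycle(names), n_slices)) if names else []


# Two cards from different (made-up) vendors share four workloads.
pool = [GPU("gpu0", "vendorA"), GPU("gpu1", "vendorB")]
for task, share in [("train", 0.5), ("infer1", 0.25),
                    ("infer2", 0.25), ("dev", 0.5)]:
    place(pool, task, share)

for gpu in pool:
    print(gpu.name, gpu.vendor, gpu.load, time_slices(gpu, 4))
```

In this toy run four workloads that would occupy four cards under a single-card, single-use model fit on two cards, with each card alternating between its two tenants. A real platform would also have to account for memory limits, vendor-specific drivers, and remote access, none of which is modeled here.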