
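China Mobile Group Design Institute Co., Ltd., Beijing 100080, China
CAO Yuanming (1983- ), male, senior engineer at China Mobile Group Design Institute Co., Ltd. His main research interests include cloud computing and artificial intelligence computing infrastructure.
LEI Ming (1973- ), female, senior engineer at China Mobile Group Design Institute Co., Ltd. Her main research interests include cloud computing and artificial intelligence computing infrastructure.
LIU Qin (1977- ), female, professor-level senior engineer at China Mobile Group Design Institute Co., Ltd. She is mainly engaged in consulting and research on computing resources such as cloud computing and artificial intelligence.
NIU Yingxia (1971- ), female, currently works at China Mobile Group Design Institute Co., Ltd. Her main research interests include computing power networks, artificial intelligence, and data service platforms.
WU Zhenyu (1975- ), male, senior engineer at China Mobile Group Design Institute Co., Ltd. His main research interests include cloud computing, computing power networks, and intelligent computing centers.
PAN Jie (1978- ), female, senior engineer at China Mobile Group Design Institute Co., Ltd. Her main research interests include computing power network security and network information security.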
Received: 2025-03-21
Revised: 2025-06-27
Accepted: 2025-06-11
Published in print: 2025-07-20
CAO Yuanming, LEI Ming, LIU Qin, et al. Research on optimization of storage system in intelligent computing center[J]. Telecommunications Science, 2025, 41(07): 164-175. DOI: 10.11959/j.issn.1000-0801.2025160.
Intelligent computing centers use distributed file storage for data preprocessing and model training, distributed object storage for raw data acquisition and model release, and distributed block storage to provide storage services for the resource management platform. In addition, high-performance distributed file storage is often used during model training to shorten checkpoint read and write times and improve cluster training efficiency. Across the full life cycle of large model training, data must be copied and migrated between storage systems that differ in protocol and read/write performance, which leads to duplicate copies of the same data and consumes computing resources and network bandwidth. To address these problems and give intelligent computing clusters a unified namespace, a converged file and object storage scheme and a tiered file system storage scheme are proposed. The former removes the need to move data between storage protocols, and the latter enables data to flow automatically between high-performance file storage (all-flash) and ordinary-performance file storage (hybrid flash, SSD and HDD). The schemes provide a reference for optimizing the storage systems of ultra-large-scale intelligent computing clusters.
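The two ideas in the abstract, one namespace shared by file and object access and automatic movement of data between an all-flash tier and a hybrid tier, can be illustrated with a small in-memory sketch. This is not the authors' implementation. The class `TieredNamespace`, the helper `to_object_ref`, the tier labels, the idle-time threshold `cold_after_s`, and the promote-on-read rule are all hypothetical choices made for illustration; the abstract does not state the actual tiering policy or the path-to-object mapping.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    size_bytes: int
    last_access: float          # timestamp in seconds, supplied by the caller
    tier: str = "all_flash"     # hypothetical tier labels: "all_flash" or "hybrid_flash"


@dataclass
class TieredNamespace:
    """Toy model of one namespace backed by two storage tiers (assumption)."""
    cold_after_s: float                         # idle time before demotion (assumed policy)
    files: dict = field(default_factory=dict)   # path -> FileEntry

    def write(self, path: str, size_bytes: int, now: float) -> None:
        # New data, such as a checkpoint, lands on the fast tier first.
        self.files[path] = FileEntry(size_bytes, now, "all_flash")

    def read(self, path: str, now: float) -> str:
        # Reading refreshes the access time and promotes cold data back to flash.
        entry = self.files[path]
        entry.last_access = now
        entry.tier = "all_flash"
        return entry.tier

    def run_tiering(self, now: float) -> list:
        # Demote files that have been idle for at least cold_after_s seconds.
        demoted = []
        for path, entry in self.files.items():
            idle = now - entry.last_access
            if entry.tier == "all_flash" and idle >= self.cold_after_s:
                entry.tier = "hybrid_flash"
                demoted.append(path)
        return sorted(demoted)


def to_object_ref(path: str) -> tuple:
    """Map a file path to a (bucket, key) pair so the same data is reachable
    through an object interface without copying. The mapping rule (first path
    component becomes the bucket) is an assumption of this sketch."""
    bucket, _, key = path.lstrip("/").partition("/")
    return bucket, key


if __name__ == "__main__":
    ns = TieredNamespace(cold_after_s=3600)
    ns.write("/ckpt/step100.pt", 10, now=0)
    ns.write("/ckpt/step200.pt", 10, now=3000)
    print(ns.run_tiering(now=3600))
    print(to_object_ref("/datasets/raw/a.bin"))
```

The path never changes when a file moves between tiers, which is the point of a unified namespace: training jobs keep using the same path while the tiering pass decides where the bytes live.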