Xiao Han, Li Cailin, Guo Baoyun, et al. OpenCL-Accelerated Parallel Algorithm for Segmented Prefix Sum[J]. Science Technology and Engineering, 2019, 19(31): 215-221.
OpenCL-Accelerated Parallel Algorithm for Segmented Prefix Sum
Received: 2019-01-14  Revised: 2019-02-27
DOI:
Keywords: segmented prefix sum; Graphics Processing Unit (GPU); Open Computing Language (OpenCL); parallel algorithm; performance optimization
Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program)
           
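Author  Affiliation
Xiao Han  Zhengzhou Normal University
Li Cailin  Shandong University of Technology
Guo Baoyun  Shandong University of Technology
Zhou Qinglei  Zhengzhou University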
Abstract:
      Prefix sum operations in numerical computing involve large data volumes and long run times. To address this, the paper proposes a segmented prefix sum parallel algorithm based on the Open Computing Language (OpenCL). First, the parallelism of the segmented prefix sum algorithm is analyzed, and a two-level parallel algorithm is designed through hierarchical decomposition and combination of the processing tasks. Next, the parallel algorithm is mapped onto a CPU+GPU platform through OpenCL programming, implementing hierarchical parallel prefix sum processing. Finally, based on the resources available in each Compute Unit (CU), the local memory allocated per CU is increased, and the work-item access pattern is changed to reduce bank conflicts and speed up memory access. Experimental results show that, on an NVIDIA Tesla C2075 platform under OpenCL, the parallel algorithm achieves speedups of 33.51×, 6.26×, and 2.41× over a serial algorithm on an AMD Opteron 2439 SE CPU, an OpenMP (Open Multi-Processing) parallel algorithm, and a Compute Unified Device Architecture (CUDA) parallel algorithm, respectively. These results verify the effectiveness and performance portability of the proposed parallel optimization method.
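The two-level decomposition described in the abstract can be illustrated with a minimal sequential sketch in Python. It is not the authors' OpenCL kernel: the function name `two_level_prefix_sum` and the `segment_size` parameter are hypothetical, and the sketch assumes that "segmented" means splitting the input into fixed-size blocks that are scanned independently and then combined through a scan of the block totals.

```python
from itertools import accumulate


def two_level_prefix_sum(data, segment_size):
    """Inclusive prefix sum computed segment by segment (illustrative only)."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")

    # Level 1: scan each segment independently. On a GPU, each segment
    # would plausibly map to one work-group using local memory.
    segments = [data[i:i + segment_size]
                for i in range(0, len(data), segment_size)]
    local_scans = [list(accumulate(seg)) for seg in segments]

    # Level 2: an exclusive scan of the segment totals gives the offset
    # that each segment must add to its local results.
    totals = [scan[-1] for scan in local_scans]
    offsets = [0] + list(accumulate(totals))[:-1]

    # Combine: add each segment's offset to every element of its local scan.
    return [value + offset
            for scan, offset in zip(local_scans, offsets)
            for value in scan]


result = two_level_prefix_sum([1, 2, 3, 4, 5, 6, 7], segment_size=3)
print(result)
```

Level 1 has no dependencies between segments, which is what makes it parallelizable; only the much shorter list of segment totals needs a second scan before the results are combined.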