OpenCL-Accelerated Parallel Algorithm for Segmented Prefix Sum
Affiliation:

1. Zhengzhou Normal University; 2. Shandong University of Technology; 3. Zhengzhou University

CLC Number:

TP311

Fund Project:

The National Natural Science Foundation of China (General Program, Key Program, Major Program)

    Abstract:

    Prefix sum operations in numerical computing involve large data volumes and long running times. To address this problem, this paper proposes a segmented prefix sum parallel algorithm based on the Open Computing Language (OpenCL). First, the parallelism of the segmented prefix sum algorithm is analyzed, and a two-level parallel segmented prefix sum algorithm is designed through hierarchical decomposition and combination of the processing tasks. The parallel algorithm is then mapped onto a CPU+GPU platform through OpenCL programming, implementing hierarchical parallel prefix sum processing. Finally, according to the resource constraints of the Compute Unit (CU), more local memory is allocated within each CU, and the access pattern of the work-items is improved to reduce bank conflicts and increase memory access speed. Experimental results show that, on the NVIDIA Tesla C2075 platform under the OpenCL architecture, the parallel prefix sum algorithm achieves speedups of 33.51 times, 6.26 times, and 2.41 times over, respectively, a serial algorithm on an AMD Opteron 2439 SE CPU, a parallel algorithm based on OpenMP (Open Multi-Processing), and a parallel algorithm based on the Compute Unified Device Architecture (CUDA). These results verify the effectiveness and performance portability of the proposed parallel optimization method.
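
    The abstract does not include the kernel itself, so the following is only a minimal sketch of the general two-level idea it describes, written as a sequential Python model rather than OpenCL. It assumes that "segmented" means splitting the input into fixed-size segments, that the prefix sum is inclusive, and that each segment would correspond to one work-group on the GPU. The function name `two_level_prefix_sum` and the parameter `segment_size` are illustrative and do not come from the paper. The local-memory and bank-conflict optimizations are not modeled.

```python
from itertools import accumulate


def two_level_prefix_sum(data, segment_size=4):
    """Inclusive prefix sum computed segment by segment (sequential model)."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")

    # Level 1: scan each segment independently. On a GPU, each segment
    # could be handled by one work-group; here it is a plain loop.
    segments = [
        list(accumulate(data[start:start + segment_size]))
        for start in range(0, len(data), segment_size)
    ]

    # Level 2: an exclusive scan over the segment totals yields the
    # offset that must be added to each segment.
    offsets = []
    running = 0
    for seg in segments:
        offsets.append(running)
        running += seg[-1]

    # Combine: add each segment's offset to its local results.
    return [value + offset
            for seg, offset in zip(segments, offsets)
            for value in seg]
```

    Because the level-1 scans are independent of one another, they are the part that can run in parallel; only the short scan over the segment totals couples the segments together.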

Cite this article

Xiao Han, Li Cailin, Guo Baoyun, et al. OpenCL-Accelerated Parallel Algorithm for Segmented Prefix Sum[J]. Science Technology and Engineering, 2019, 19(31): 215-221.

History
  • Received: 2019-01-14
  • Last revised: 2019-02-27
  • Accepted: 2019-03-12
  • Published online: 2019-11-20