University of Electronic Science and Technology of China: GPU Parallel Programming, course materials (lecture slides), Lecture 09: Parallel Patterns (Merge Sort)

PARALLEL PATTERNS: MERGE SORT

Data Parallelism / Data-Dependent Execution
[Slide diagram: parallel patterns classified along two axes, degree of data parallelism (data parallel vs. not data parallel) and data-dependent execution (data-independent vs. data-dependent), with Stencil, Histogram, SpMV, Prefix Scan, and Merge as examples; Merge sits at the data-dependent, not-data-parallel end.]

Objective
• Study increasingly sophisticated parallel merge kernels
• Observe the combined effects of data-dependent execution and a lack of data parallelism on GPU algorithm design

Merge Sort
• Input: two sorted arrays
• Output: the (sorted) union of the input arrays
Example: A = [8, 9, 10], B = [7, 10, 10, 12], C = [7, 8, 9, 10, 10, 10, 12]

Merge Sort
• A bottom-up divide-and-conquer sorting algorithm
• O(n log n) average- (and worst-) case performance
• O(n) additional space requirement
• Merging two arrays is the core computation
[Slide figure: example array being sorted by repeated merging]

Other Uses for Merge
• Taking the union of two (non-overlapping) sparse matrices represented in the CSR format
  – Each row is merged
  – col_indices are the keys (see the sketch below)
• In MapReduce, when Map produces sorted key-value pairs and Reduce must maintain sorting
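As an illustration of the CSR use case, the following is a minimal sequential sketch that merges one row from each of two non-overlapping CSR matrices, using the column indices as the merge keys. The names (merge_csr_row, colA, valA, and so on) are illustrative and do not come from the slides.

// Merge one row from each of two non-overlapping CSR matrices.
// (colA, valA) and (colB, valB) hold the column indices and values of the two
// rows, each sorted by column index; the merged row is written to (colC, valC).
// Returns the number of nonzeros in the merged row.
int merge_csr_row(const int *colA, const float *valA, int m,
                  const int *colB, const float *valB, int n,
                  int *colC, float *valC) {
    int i = 0, j = 0, k = 0;
    while (i < m && j < n) {
        if (colA[i] <= colB[j]) {              // column index is the merge key
            colC[k] = colA[i];
            valC[k++] = valA[i++];
        } else {
            colC[k] = colB[j];
            valC[k++] = valB[j++];
        }
    }
    while (i < m) { colC[k] = colA[i]; valC[k++] = valA[i++]; }   // rest of row A
    while (j < n) { colC[k] = colB[j]; valC[k++] = valB[j++]; }   // rest of row B
    return k;
}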

Sequential Merge

void merge(const int *A, int m, const int *B, int n, int *C) {
    int i = 0;   // Index into A
    int j = 0;   // Index into B
    int k = 0;   // Index into C
    // merge the initial overlapping sections of A and B
    while ((i < m) && (j < n)) {
        if (A[i] <= B[j]) {
            C[k++] = A[i++];
        } else {
            C[k++] = B[j++];
        }
    }
    if (i == m) {
        // done with A, place the rest of B
        for ( ; j < n; j++) {
            C[k++] = B[j];
        }
    } else {
        // done with B, place the rest of A
        for ( ; i < m; i++) {
            C[k++] = A[i];
        }
    }
}

Note: k increases by one for every iteration of the loops.
Note: In any given iteration (other than the first), the values of i and j are data-dependent.
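To connect this routine to the earlier claim that merging two arrays is the core computation of merge sort, here is a minimal bottom-up merge sort driver built on the merge() routine above. This is an illustrative sketch, not code from the slides; mergesort and tmp are assumed names.

#include <stdlib.h>
#include <string.h>

// Bottom-up merge sort: repeatedly merge sorted runs of width 1, 2, 4, ...
// until a single sorted run remains. Uses O(n) extra space, as noted earlier.
void mergesort(int *data, int n) {
    int *tmp = (int *)malloc(n * sizeof(int));
    for (int width = 1; width < n; width *= 2) {
        for (int lo = 0; lo < n; lo += 2 * width) {
            int mid = (lo + width < n) ? lo + width : n;         // end of left run
            int hi  = (lo + 2 * width < n) ? lo + 2 * width : n; // end of right run
            merge(&data[lo], mid - lo, &data[mid], hi - mid, &tmp[lo]);
        }
        memcpy(data, tmp, n * sizeof(int));
    }
    free(tmp);
}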

Sequential Merge Parallelization Challenges
• We could assign one thread to write each output element
• However, given a particular output location, the input element that belongs there is data-dependent
• The sequential merge is O(n) in the length of the output array, so we must be work-efficient

Observations about Merge
1. For any k s.t. 0 ≤ k < m + n, there is either:
   a. an i s.t. 0 ≤ i < m and C[k] comes from A[i], or
   b. a j s.t. 0 ≤ j < n and C[k] comes from B[j]
2. For any k s.t. 0 ≤ k < m + n, there is an i and a j s.t.:
   a. i + j = k
   b. 0 ≤ i ≤ m
   c. 0 ≤ j ≤ n
   d. The subarray C[0:k-1] is the result of merging A[0:i-1] and B[0:j-1]
Indices i and j are referred to as co-ranks.
For example, with A = [8, 9, 10] and B = [7, 10, 10, 12] from the earlier slide, k = 3 has co-ranks i = 2 and j = 1: C[0:2] = [7, 8, 9] is the merge of A[0:1] = [8, 9] and B[0:0] = [7].

A Merge Parallelization Approach
• Assume a co-rank function of the form:
    co-rank(k, A, B) = i,  with j = k - i
• We can use the co-rank function to map a range of output values to a range of input values
• We'll need to compute co-rank efficiently for a work-efficient merge (a binary-search sketch follows)
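As a preview of how the co-rank function can be computed efficiently, below is a sketch of one common binary-search formulation of co-rank, together with a simple kernel in which each thread produces one contiguous range of the output. The kernels studied in the remaining slides may differ in detail; co_rank, merge_sequential, and merge_basic_kernel are illustrative names, not taken from the slides.

// Device-side version of the sequential merge shown earlier.
__device__ void merge_sequential(const int *A, int m, const int *B, int n, int *C) {
    int i = 0, j = 0, k = 0;
    while (i < m && j < n) {
        C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];
    }
    while (i < m) C[k++] = A[i++];
    while (j < n) C[k++] = B[j++];
}

// Binary search for the co-rank i such that merging A[0:i-1] and B[0:k-i-1]
// produces C[0:k-1].
__device__ int co_rank(int k, const int *A, int m, const int *B, int n) {
    int i = (k < m) ? k : m;                  // largest possible i
    int j = k - i;
    int i_low = (k - n > 0) ? (k - n) : 0;    // smallest possible i
    int j_low = (k - m > 0) ? (k - m) : 0;
    while (1) {
        if (i > 0 && j < n && A[i - 1] > B[j]) {
            // i is too large: move the search window down in A
            int delta = (i - i_low + 1) >> 1;
            j_low = j;
            i -= delta;
            j += delta;
        } else if (j > 0 && i < m && B[j - 1] >= A[i]) {
            // i is too small: move the search window up in A
            int delta = (j - j_low + 1) >> 1;
            i_low = i;
            i += delta;
            j -= delta;
        } else {
            return i;                         // valid co-rank found
        }
    }
}

// Each thread uses co-rank to find its input sub-ranges, then merges them
// sequentially into its contiguous output range.
__global__ void merge_basic_kernel(const int *A, int m, const int *B, int n, int *C) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int total = m + n;
    int threads = blockDim.x * gridDim.x;
    int per_thread = (total + threads - 1) / threads;   // ceil(total / threads)
    int k_curr = (tid * per_thread < total) ? tid * per_thread : total;
    int k_next = ((tid + 1) * per_thread < total) ? (tid + 1) * per_thread : total;
    int i_curr = co_rank(k_curr, A, m, B, n);
    int i_next = co_rank(k_next, A, m, B, n);
    int j_curr = k_curr - i_curr;
    int j_next = k_next - i_next;
    merge_sequential(&A[i_curr], i_next - i_curr, &B[j_curr], j_next - j_curr, &C[k_curr]);
}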