中国高校课件下载中心 》 教学资源 》 大学文库

电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 05 PARALLEL COMPUTATION PATTERNS(HISTOGRAM)

文档信息
资源类别:文库
文档格式:PDF
文档页数:42
文件大小:1.17MB
团购合买:点击进入团购
内容简介
parallel histogram Data Racing condition privatized histogram kernel
刷新页面文档预览

LECTURE 5 PARALLEL COMPUTATION PATTERNS(HISTOGRAM)

Warps and SIMD Hardware

parallel histogram Data Racing condition privatized histogram kernel 笔十1女子 Uaimraity at Eleetreie Scieeand Tecgd

2 parallel histogram Data Racing condition privatized histogram kernel

Objective To learn the parallel histogram computation pattern -An important,useful computation -Very different from all the patterns we have covered so far in terms of output behavior of each thread: Output can be modified by all participating threads. 电子科妓女学 O

3 Objective – To learn the parallel histogram computation pattern – An important, useful computation – Very different from all the patterns we have covered so far in terms of output behavior of each thread: Output can be modified by all participating threads

A Text Histogram Example Define the bins as four-letter sections of the alphabet:a-d,e-h,i-l,n- p,… For each character in an input string,increment the appropriate bin counter. 一 In the phrase "Programming Massively Parallel Processors"the output histogram is shown below: 12 10 8 0 电子神戏女学 e-h m-p q-t U-X y-Z

4 A Text Histogram Example – Define the bins as four-letter sections of the alphabet: a-d, e-h, i-l, n￾p, … – For each character in an input string, increment the appropriate bin counter. – In the phrase “Programming Massively Parallel Processors” the output histogram is shown below: 4

A simple parallel histogram algorithm Partition the input into sections Have each thread to take a section of the input -Each thread iterates through its section. For each letter,increment the appropriate bin counter 电子科妓女学 O

5 A simple parallel histogram algorithm – Partition the input into sections – Have each thread to take a section of the input – Each thread iterates through its section. – For each letter, increment the appropriate bin counter

Sectioned Partitioning (Iteration #1) p r 0 g r a m m n g m a i e y p a r a e l p r o r Thread 0 Thread 1 Thread 2 Thread 3 0 0 0 3 0 0 0 a-d e-h i-l m-p q-t U-X y-Z 电子神越女学 0

6 Sectioned Partitioning (Iteration #1)

Sectioned Partitioning (Iteration #2) 0 r 0 g r a m m n m a i e p a r e p r 0 e Thread 0 Thread 1 Thread 2 Thread 3 2 0 0 A 1 0 0 a-d e-h i m-p q-t U-X y-2 电子神越女学 O

7 Sectioned Partitioning (Iteration #2) 7

Input Partitioning Affects Memory Access Efficiency Sectioned partitioning results in poor memory access efficiency Adjacent threads do not access adjacent memory locations Accesses are not coalesced DRAM bandwidth is poorly utilized 电子科妓女学 O

8 Input Partitioning Affects Memory Access Efficiency – Sectioned partitioning results in poor memory access efficiency – Adjacent threads do not access adjacent memory locations – Accesses are not coalesced – DRAM bandwidth is poorly utilized

Input Partitioning Affects Memory Access Efficiency Sectioned partitioning results in poor memory access efficiency Adjacent threads do not access adjacent memory locations Accesses are not coalesced DRAM bandwidth is poorly utilized 111112222233333 4444 4 Change to interleaved partitioning All threads process a contiguous section of elements They all move to the next section and repeat The memory accesses are coalesced 12 3 4 23412 3 4 1 2 3 4 1 2 3 4 电子神越女学 0

9 Input Partitioning Affects Memory Access Efficiency – Sectioned partitioning results in poor memory access efficiency – Adjacent threads do not access adjacent memory locations – Accesses are not coalesced – DRAM bandwidth is poorly utilized – Change to interleaved partitioning – All threads process a contiguous section of elements – They all move to the next section and repeat – The memory accesses are coalesced 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

Interleaved Partitioning of Input For coalescing and better memory access performance r o g r a mm i n g m a ss e p a Thread 0 Thread 1 Thread 2 Thread 3 0 1 0 2 1 0 0 a-d e-h m-p q-t U-X y-z 电子神越女学 O

10 Interleaved Partitioning of Input – For coalescing and better memory access performance …

刷新页面下载完整文档
VIP每日下载上限内不扣除下载券和下载次数;
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
相关文档