Shanghai Jiao Tong University: Multicore Architecture and Parallel Computing, course teaching resources (PPT lecture slides), Lecture 8: CUDA, cont'd
CS427 Multicore Architecture and Parallel Computing
Lecture 8: CUDA, cont'd
Prof. Xiaoyao Liang
2016/10/26

Register File Limitation
• If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
 – Each block requires 10 * 256 = 2560 registers
 – 8192 = 3 * 2560 + change
 – So three blocks can run on an SM as far as registers are concerned
• What if each thread increases its register use by 1?
 – Each block now requires 11 * 256 = 2816 registers
 – 8192 < 2816 * 3
 – Only two blocks can run on an SM, a 1/3 reduction of parallelism!
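
The register arithmetic on this slide can be sketched as a small host-side C++ helper (it also compiles inside a .cu file). The function name and parameters are hypothetical and chosen for illustration. The sketch considers only the register file, so other per-SM limits such as shared memory or the maximum number of resident threads are deliberately ignored.

```cpp
#include <cassert>

// Number of thread blocks that fit on one SM when the register file is the
// only constraint. Illustrative helper, not part of any CUDA API.
int blocksPerSmByRegisters(int regsPerSm, int threadsPerBlock, int regsPerThread) {
    assert(regsPerSm > 0 && threadsPerBlock > 0 && regsPerThread > 0);
    // A block is resident only if registers for all of its threads fit.
    int regsPerBlock = threadsPerBlock * regsPerThread;
    // Integer division rounds down: a partially fitting block does not count.
    return regsPerSm / regsPerBlock;
}
```

With the slide's figures, `blocksPerSmByRegisters(8192, 256, 10)` yields 3 and `blocksPerSmByRegisters(8192, 256, 11)` yields 2.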

Dynamic Partitioning
• Dynamic partitioning gives more flexibility to compilers/programmers
 – One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
 – This allows for finer-grained threading than traditional CPU threading models
 – The compiler can trade off between instruction-level parallelism and thread-level parallelism

ILP vs. TLP
• Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers, so 3 blocks fit. Global loads take 400 cycles.
 – 4 cycles * 4 inst * 24 warps = 384 < 400
• If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only 2 blocks fit.
 – 4 cycles * 8 inst * 16 warps = 512 > 400, better hiding memory latency
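
The latency-hiding estimate on this slide can be reproduced with a short host-side C++ sketch. The function names are hypothetical, the warp size of 32 is an assumption (it is what makes 3 blocks of 256 threads equal 24 warps), and the model is the slide's simple one: every resident warp contributes its independent instructions at 4 cycles each while one load is outstanding.

```cpp
#include <cassert>

// Warps resident on an SM for a given number of resident blocks.
// warpSize = 32 is an assumption consistent with the slide's 24 and 16 warps.
int residentWarps(int blocks, int threadsPerBlock, int warpSize = 32) {
    assert(blocks >= 0 && threadsPerBlock > 0 && warpSize > 0);
    return blocks * threadsPerBlock / warpSize;
}

// Cycles of independent work available to overlap with one global load.
int cyclesCovered(int cyclesPerInst, int independentInsts, int warps) {
    return cyclesPerInst * independentInsts * warps;
}

// True if the available work is at least as long as the load latency.
bool hidesLatency(int covered, int loadLatencyCycles) {
    return covered >= loadLatencyCycles;
}
```

Under these assumptions the first configuration covers 384 cycles and falls short of 400, while the second covers 512 cycles despite running fewer warps.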

Memory Coalescing
[Figure: a 4x4 matrix with elements M0,0 to M3,3 and its linearized layout in memory (M0,0 M1,0 M2,0 M3,0 M0,1 ... M3,3), showing the access direction in kernel code and the elements accessed in Time Period 1 and Time Period 2.]

Memory Coalescing
[Figure: the same 4x4 matrix and linearized layout, with a different access direction in kernel code and the elements accessed in Time Period 1 and Time Period 2.]

Memory Coalescing
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;

        // Identify the row and column of the Pd element to work on
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;

        float Pvalue = 0;
        // Loop over the Md and Nd tiles required to compute the Pd element
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // Collaborative loading of Md and Nd tiles into shared memory
            Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
            Nds[ty][tx] = Nd[Col + (m * TILE_WIDTH + ty) * Width];
            __syncthreads();

            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();
        }
        Pd[Row * Width + Col] = Pvalue;
    }
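
One way to see why the two tile loads in this kernel can coalesce is to compute the global indices they touch. The host-side C++ sketch below mirrors the kernel's index expressions and checks that threads with consecutive `tx` read consecutive elements. `TILE_WIDTH = 16` and the helper names are assumptions made for illustration; the slide does not give a tile size.

```cpp
#include <cassert>

constexpr int TILE_WIDTH = 16;  // assumed value, not stated on the slide

// Index into Md read by thread (tx, ty) of block row `by` in phase m,
// mirroring Md[Row*Width + (m*TILE_WIDTH + tx)] from the kernel.
int mdIndex(int by, int ty, int tx, int m, int width) {
    int row = by * TILE_WIDTH + ty;
    return row * width + (m * TILE_WIDTH + tx);
}

// Index into Nd read by thread (tx, ty) of block column `bx` in phase m,
// mirroring Nd[Col + (m*TILE_WIDTH + ty)*Width] from the kernel.
int ndIndex(int bx, int ty, int tx, int m, int width) {
    int col = bx * TILE_WIDTH + tx;
    return col + (m * TILE_WIDTH + ty) * width;
}

// True if, for a fixed ty, neighbouring tx values touch neighbouring
// elements in both loads (the unit-stride pattern coalescing needs).
bool unitStrideInTx(int bx, int by, int ty, int m, int width) {
    assert(width > 0 && width % TILE_WIDTH == 0);
    for (int tx = 0; tx + 1 < TILE_WIDTH; ++tx) {
        if (mdIndex(by, ty, tx + 1, m, width) - mdIndex(by, ty, tx, m, width) != 1)
            return false;
        if (ndIndex(bx, ty, tx + 1, m, width) - ndIndex(bx, ty, tx, m, width) != 1)
            return false;
    }
    return true;
}
```

In both loads `tx` appears only as a plain additive term, so the stride across `tx` is 1, while stepping `ty` moves by a full row of `width` elements.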

Increasing Per Thread Work
• Each thread computes two elements of Pd
 – Reduced loads from global memory (Md) to shared memory
 – Reduced instruction overhead
   More work done in each iteration
[Figure: tiled matrix diagram with tiles of width TILE_WIDTH.]
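
The reduction in Md loads can be estimated with a rough host-side C++ count. This sketch assumes that the two Pd elements of a thread lie in horizontally adjacent tiles, so one Md tile is loaded once per phase and reused for both, while each Pd tile still needs its own Nd tile. That layout, the struct, and the function name are assumptions for illustration and are not taken from the slide.

```cpp
#include <cassert>

struct LoadCounts {
    long long md;  // elements of Md read from global memory
    long long nd;  // elements of Nd read from global memory
};

// Total global-memory element loads for a Width x Width tiled multiply when
// each block computes `tilesPerBlock` horizontally adjacent Pd tiles.
LoadCounts globalLoads(long long width, long long tile, long long tilesPerBlock) {
    assert(width > 0 && tile > 0 && tilesPerBlock > 0);
    assert(width % (tile * tilesPerBlock) == 0);
    long long blocks = (width / tile) * (width / (tile * tilesPerBlock));
    long long phases = width / tile;   // iterations of the tile loop
    long long tileElems = tile * tile;
    return { blocks * phases * tileElems,                    // one Md tile per phase
             blocks * phases * tileElems * tilesPerBlock };  // one Nd tile per Pd tile
}
```

Under these assumptions, going from `tilesPerBlock = 1` to `tilesPerBlock = 2` halves the Md loads and leaves the Nd loads unchanged.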

Double Buffering
• One could double buffer the computation, getting a better instruction mix within each thread
 – This is classic software pipelining in ILP compilers

Without double buffering:
    Loop {
        Load current tile to shared memory
        syncthreads()
        Compute current tile
        syncthreads()
    }

With double buffering:
    Load next tile from global memory
    Loop {
        Deposit current tile to shared memory
        syncthreads()
        Load next tile from global memory
        Compute current tile
        syncthreads()
    }

Double Buffering
• Deposit blue tile from register into shared memory
• syncthreads
• Load orange tile into register
• Compute blue tile
• Deposit orange tile into shared memory
• ....
[Figure: tiled matrix diagram with tiles of width TILE_WIDTH.]
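
The loop order of double buffering can be illustrated with a sequential C++ emulation. This is not a GPU kernel: local arrays stand in for registers and shared memory, the input vectors stand in for global memory, and a tiled dot product replaces the matrix multiply to keep the sketch short. `TILE = 4` and the function name are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

constexpr int TILE = 4;  // assumed tile length for this illustration

// Tiled dot product that follows the double-buffered loop order:
// prefetch the next tile into "registers" while the current tile,
// already deposited in "shared memory", is being consumed.
long long dotDoubleBuffered(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size() && a.size() % TILE == 0);
    const int numTiles = static_cast<int>(a.size()) / TILE;
    int regA[TILE], regB[TILE];        // stand-in for per-thread registers
    int sharedA[TILE], sharedB[TILE];  // stand-in for shared memory
    long long sum = 0;
    if (numTiles == 0) return sum;

    // Prologue: load the first tile from "global memory" into registers.
    for (int i = 0; i < TILE; ++i) { regA[i] = a[i]; regB[i] = b[i]; }

    for (int t = 0; t < numTiles; ++t) {
        // Deposit the current tile from registers into shared memory.
        for (int i = 0; i < TILE; ++i) { sharedA[i] = regA[i]; sharedB[i] = regB[i]; }
        // A GPU kernel would call __syncthreads() here.

        // Load the next tile into registers, if there is one.
        if (t + 1 < numTiles) {
            for (int i = 0; i < TILE; ++i) {
                regA[i] = a[(t + 1) * TILE + i];
                regB[i] = b[(t + 1) * TILE + i];
            }
        }

        // Compute on the current tile held in shared memory.
        for (int i = 0; i < TILE; ++i)
            sum += static_cast<long long>(sharedA[i]) * sharedB[i];
        // A GPU kernel would call __syncthreads() here as well.
    }
    return sum;
}
```

On a GPU the benefit comes from the next tile's global loads being in flight while the current tile is computed; the sequential emulation shows only the ordering and that the result is unchanged.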