Shanghai Jiao Tong University: Multicore Architecture and Parallel Computing, course teaching resources (PPT lecture slides), Lecture 7 CUDA
CS427 Multicore Architecture and Parallel Computing
Lecture 7 CUDA
Prof. Xiaoyao Liang
2012/10/15
CUDA
• "Compute Unified Device Architecture"
➢ General-purpose programming model
➢ User kicks off batches of threads on the GPU
• Targeted software stack
➢ Compute-oriented drivers, language, and tools
• Driver for loading computation programs into the GPU
➢ Standalone driver, optimized for computation
➢ Interface designed for compute: a graphics-free API
➢ Data sharing with OpenGL buffer objects
➢ Guaranteed maximum download and readback speeds
➢ Explicit GPU memory management
GPU Location
[Figure: system block diagram with the labels CPU, FSB, Northbridge, RAM, AGP, and Southbridge]
GPU vs. CPU
[Figure: side-by-side chip layouts labeled CPU and GPU, each attached to DRAM, with areas labeled Control, ALU, and Cache]
CUDA Execution Model
[Figure: kernels running on the device as groups of thread blocks, with labels such as Block (0, 0), Block (1, 0), and Block (1, 1)]
CUDA Device and Threads
• A compute device
➢ Is a coprocessor to the CPU or host
➢ Has its own DRAM (device memory)
➢ Runs many threads in parallel
➢ Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels, which run on many threads
• Differences between GPU and CPU threads
➢ GPU threads are extremely lightweight
➢ Very little creation overhead
➢ A GPU needs 1000s of threads for full efficiency
➢ A multi-core CPU needs only a few
C Extension
• Declspecs
➢ global, device, shared, local, constant
• Keywords
➢ threadIdx, blockIdx
• Intrinsics
➢ __syncthreads
• Runtime API
➢ Memory, symbol, execution management
• Function launch

    __device__ float filter[N];

    __global__ void convolve(float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    float *myimage;
    cudaMalloc((void **)&myimage, bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);
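The launch comment on this slide ("100 blocks, 10 threads per block") implies 1000 threads in total. As a rough illustration of how those threads can be told apart, the sketch below emulates a one-dimensional launch sequentially in plain C. It is not CUDA code and not part of the lecture: `global_thread_index` and `emulate_launch` are hypothetical helper names, and the formula `block index * threads per block + thread index` is the conventional way a kernel derives a flat index from `blockIdx` and `threadIdx`, assumed here rather than taken from the slide.

```c
/* Flat index a thread would conventionally compute inside a kernel:
   blockIdx.x * blockDim.x + threadIdx.x. Plain-C stand-ins are used here. */
int global_thread_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Visits every (block, thread) pair of a 1D launch one after another and
   increments the counter of the element that thread would own.
   owner_count must hold num_blocks * threads_per_block entries.
   Returns the number of emulated threads. */
int emulate_launch(int num_blocks, int threads_per_block, int *owner_count)
{
    int visited = 0;
    for (int b = 0; b < num_blocks; ++b) {
        for (int t = 0; t < threads_per_block; ++t) {
            owner_count[global_thread_index(b, threads_per_block, t)] += 1;
            ++visited;
        }
    }
    return visited;
}
```

Calling `emulate_launch(100, 10, counts)` on a zeroed array of 1000 integers touches each element exactly once, which is the property that lets each GPU thread work on its own piece of data.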
Compilation Flow
[Figure: compilation flow diagram]
• Integrated source (foo.cu) → cudacc (EDG C/C++ frontend, Open64 Global Optimizer)
• GPU path: GPU Assembly (foo.s) → OCG → G80 SASS
• CPU path: CPU Host Code (foo.cpp) → gcc/cl
Source: Mark Murphy, "NVIDIA's Experience with Open64"
Compilation Flow
[Figure: compilation flow diagram]
• C/C++ CUDA Application → NVCC → CPU Code and PTX Code (virtual)
• PTX Code → PTX to Target Compiler (physical) → target code for G80 ... GPU
• Example source: float4 me = gx[gtid]; me.x += me.y * me.z;
• Example PTX: ld.global.v4.f32 ... ; mad.f32 ...
Matrix Multiplication

    void MatrixMultiplication(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                float sum = 0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[i * Width + k];
                    float b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }

[Figure: matrices M, N, and P, each WIDTH × WIDTH, with index k running along a row of M and a column of N]
• 1000 × 1000 = 1,000,000 independent dot products
• 1000 multiplies + 1000 accumulates per dot product
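To make the "independent dot products" count concrete, here is a minimal C sketch that splits the computation into one dot product per output element and counts them. The names `dot_row_col` and `matrix_multiply` are illustrative and do not come from the slide. The matrices are assumed to be square, stored row-major, and of size `width` × `width`.

```c
#include <assert.h>

/* Dot product of row i of M with column j of N (row-major, width x width).
   Each call performs `width` multiplies and `width` accumulates. */
float dot_row_col(const float *M, const float *N, int width, int i, int j)
{
    float sum = 0.0f;
    for (int k = 0; k < width; ++k) {
        sum += M[i * width + k] * N[k * width + j];
    }
    return sum;
}

/* Computes P = M * N and returns the number of dot products performed.
   Every P[i][j] reads only M and N, never another element of P, so the
   iterations of the two outer loops could run in any order or in parallel. */
long matrix_multiply(const float *M, const float *N, float *P, int width)
{
    assert(width > 0);
    long dot_products = 0;
    for (int i = 0; i < width; ++i) {
        for (int j = 0; j < width; ++j) {
            P[i * width + j] = dot_row_col(M, N, width, i, j);
            ++dot_products;
        }
    }
    return dot_products;
}
```

For `width` = 1000 the returned count is 1000 × 1000 = 1,000,000, matching the figure on the slide. Because no dot product depends on another, each one is a natural candidate for its own GPU thread.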