Modern Computer Architecture: Course Slides (English Lecture Notes), Lecture 15: GPGPU Architecture and Programming Paradigm

Advanced Computer Architecture Design and Its Applications in Data Centers and Cloud Computing
Lecture 14: GPGPU Architecture and Programming Paradigm

Outline
• GPGPU Architecture Overview
• Core Architecture
• Memory Hierarchy
• Interconnect
• CPU-GPU Interfacing
• Programming Paradigm

Basic Blocks
• Several shader cores / streaming multiprocessors (SMs)
• Interconnection network
• On-chip memory controllers
• On-chip caches (level 1/2)
• Off-chip DRAM
[Figure: the on-chip area holds SMs grouped into TPCs (TPC-0, TPC-1), attached to the host CPU over the PCI Express bus and linked through the interconnect network to L2 cache slices and DRAM controllers; the off-chip area holds the DRAM chips.]
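The basic blocks listed above can be inspected on a real device through the CUDA runtime. The following is a minimal sketch, assuming a CUDA-capable GPU at device index 0; the helper name smCount is hypothetical, while cudaGetDeviceProperties and the cudaDeviceProp fields are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of SMs on `device`, or -1 if the query fails.
int smCount(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.multiProcessorCount;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No usable CUDA device found\n");
        return 1;
    }
    std::printf("Device 0: %s\n", prop.name);
    // Shader cores / streaming multiprocessors
    std::printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    std::printf("Threads per warp:          %d\n", prop.warpSize);
    // On-chip storage
    std::printf("L2 cache:                  %d KB\n", prop.l2CacheSize / 1024);
    std::printf("Shared memory per block:   %zu KB\n", prop.sharedMemPerBlock / 1024);
    std::printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    // Off-chip DRAM and its interface
    std::printf("Memory bus width:          %d bits\n", prop.memoryBusWidth);
    std::printf("Off-chip DRAM:             %zu MB\n", prop.totalGlobalMem >> 20);
    return 0;
}
```

The printed values depend entirely on the installed GPU, so no expected output is given here.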

Basic Blocks
• Thread batch: the hardware unit of thread execution (warp on Nvidia, wavefront on ATI)
• Hardware thread scheduling
• Threads have dedicated registers
• Shared memory among the threads of a thread block
• Same PC for all threads in a warp
• Separate ALU and memory pipelines
[Figure: SMs are grouped into texture processor clusters (Cluster 0 to Cluster M) and connected by a high-bandwidth on-chip network to L2 slices and memory controllers (MC0 to MCL), which drive the off-chip DRAM array. Each SM contains SPs, a thread scheduler, instruction cache, decoder, texture cache, constant cache, shared memory, and register file. GPU kernels, e.g. matrixMul<<<...>>>(d_C, d_A, d_B, uiWA, uiWB);, are compiled with the CUDA compiler to PTX assembly, e.g. mov.s32 %r14, 15; and.b32 %r15, %r13, %r14; add.s32 %r16, %r15, %r12; shr.s32 %r17, %r16, 4;, and run as lightweight threads grouped into thread blocks.]
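How the threads of a block map onto warps can be made visible with a small kernel. This is a sketch under stated assumptions: the kernel and helper names (recordWarpLane, runWarpLane) are hypothetical, the blocks are one-dimensional, and the warp/lane split relies on consecutive thread indices being packed into the same warp.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread records which warp of its block it belongs to and its lane
// (position) inside that warp. warpSize is a built-in device variable.
__global__ void recordWarpLane(int *warpId, int *laneId) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    warpId[tid] = threadIdx.x / warpSize;
    laneId[tid] = threadIdx.x % warpSize;
}

// Hypothetical host helper: launches the kernel and copies the ids back.
// Returns false if any CUDA call fails (for example, no GPU present).
bool runWarpLane(int blocks, int threadsPerBlock,
                 std::vector<int> &warps, std::vector<int> &lanes) {
    int n = blocks * threadsPerBlock;
    size_t bytes = n * sizeof(int);
    int *dWarp = nullptr, *dLane = nullptr;
    if (cudaMalloc(&dWarp, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dLane, bytes) != cudaSuccess) { cudaFree(dWarp); return false; }

    recordWarpLane<<<blocks, threadsPerBlock>>>(dWarp, dLane);

    warps.resize(n);
    lanes.resize(n);
    // cudaMemcpy waits for the kernel and also reports launch failures.
    bool ok = cudaMemcpy(warps.data(), dWarp, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(lanes.data(), dLane, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dWarp);
    cudaFree(dLane);
    return ok;
}

int main() {
    std::vector<int> warps, lanes;
    if (!runWarpLane(2, 64, warps, lanes)) {
        std::printf("CUDA error or no device\n");
        return 1;
    }
    for (int t = 0; t < 64; t += 16)
        std::printf("thread %2d -> warp %d, lane %2d\n", t, warps[t], lanes[t]);
    return 0;
}
```

On hardware with 32-thread warps, a 64-thread block is split into two warps, so thread 32 is reported as warp 1, lane 0. All 32 lanes of a warp share one PC, which is why branches that split a warp are executed one path at a time.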

Streaming Multiprocessor
• Multithreading (MT) unit
• Instruction cache / decoder
• Several streaming processors (SPs)
• Load-store and SFU units
• Large register file
• Shared memory
• Shared texture cache
• Constant cache
[Figure: an SM inside a TPC, showing the MT unit, instruction cache and decoder, SPs, register file, shared memory, SFU and load/store units, constant cache, and the texture cache shared across the TPC.]

Examples: G80 and GT200
• MT unit (global block scheduler)
• TPC: texture processor cluster (a group of SMs that share the same texture unit)
• Two GPU generations are shown: G80 and GT200
[Figure: the global block scheduler and a TPC with its SM controllers, for GT200 and for G80.]

Examples: GT300
• GT300 (Fermi)
• Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
[Figure: Fermi Streaming Multiprocessor (SM), showing the instruction cache, two warp schedulers with their dispatch units, the register file (32,768 x 32-bit), CUDA cores (dispatch port, FP unit, INT unit, result queue), SFUs, and 64 KB of shared memory / L1 cache.]
Source: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Comparison
• G80 vs. GT200 vs. GT300 (Fermi); each entry below lists the values in that order
• Transistors: 681 million / 1.4 billion / 3.0 billion
• CUDA cores: 128 / 240 / 512
• Double-precision floating-point capability: none / 30 FMA ops/clock / 256 FMA ops/clock
• Single-precision floating-point capability: 128 MAD ops/clock / 240 MAD ops/clock / 512 FMA ops/clock
• Warp schedulers (per SM): 1 / 1 / 2
• Special function units (SFUs) per SM: 2 / 2 / 4
• Shared memory (per SM): 16 KB / 16 KB / configurable 48 KB or 16 KB
• L1 cache (per SM): none / none / configurable 16 KB or 48 KB
• L2 cache: none / none / 768 KB
• ECC memory support: no / no / yes
• Concurrent kernels: no / no / up to 16
• Load/store address width: 32-bit / 32-bit / 64-bit
Source: http://www.dvhardware.net/article38173.html
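The configurable split between shared memory and L1 cache on Fermi is exposed to programs as a per-kernel preference. The sketch below assumes a hypothetical kernel (reverseTile) that stages a 256-element tile in shared memory and a hypothetical helper (reverseTiles); cudaFuncSetCacheConfig and cudaFuncCachePreferShared are real CUDA runtime names. The call is a preference only, and on architectures without a configurable split it may have no effect.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each 256-thread block reverses its own 256-element tile,
// using shared memory as the staging buffer.
__global__ void reverseTile(const float *in, float *out) {
    __shared__ float tile[256];
    int base = blockIdx.x * blockDim.x;
    tile[threadIdx.x] = in[base + threadIdx.x];
    __syncthreads();  // the whole tile must be loaded before anyone reads it
    out[base + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical helper: reverses every 256-element tile of `data` in place.
// data.size() is assumed to be a multiple of 256.
bool reverseTiles(std::vector<float> &data) {
    // Ask for the larger shared-memory partition (48 KB shared / 16 KB L1 on Fermi).
    if (cudaFuncSetCacheConfig(reverseTile, cudaFuncCachePreferShared) != cudaSuccess)
        return false;

    size_t bytes = data.size() * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    bool ok = cudaMemcpy(dIn, data.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseTile<<<static_cast<int>(data.size() / 256), 256>>>(dIn, dOut);
        ok = cudaMemcpy(data.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return ok;
}

int main() {
    std::vector<float> data(1024);
    for (int i = 0; i < 1024; ++i) data[i] = static_cast<float>(i);
    if (!reverseTiles(data)) {
        std::printf("CUDA error or no device\n");
        return 1;
    }
    std::printf("out[0] = %.0f, out[255] = %.0f\n", data[0], data[255]);
    return 0;
}
```

A kernel that relies mostly on hardware caching rather than explicit staging could pass cudaFuncCachePreferL1 instead, which requests the opposite split.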

Example: GK110 (Kepler Architecture)
• More power efficient than Fermi
• New SM architecture (SMX)
• Revamped memory architecture
• Hardware support for new programming models
• Capable of Dynamic Parallelism
[Figure: "Kepler: Fast & Efficient", comparing the Fermi SM with the Kepler SMX (192 cores, 3x perf/watt), next to the GK110 block diagram with its PCI Express host interface, L2 cache, and SMX units.]
Source: http://www.nvidia.com/content/PDF/kepler/NVIDIA-KeplerGK110-Architecture-Whitepaper.pdf
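Dynamic Parallelism means that a kernel running on the GPU can launch further kernels without returning to the CPU. A minimal sketch follows, with hypothetical kernel and helper names (childFill, parentLaunch, runDynamic). It assumes a device of compute capability 3.5 or later and a build with relocatable device code, typically something like nvcc -rdc=true file.cu -lcudadevrt.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Child grid: fills one slice of the output with its own global indices.
__global__ void childFill(int *out, int offset) {
    out[offset + threadIdx.x] = offset + threadIdx.x;
}

// Parent grid: every parent thread launches its own child grid from the device.
__global__ void parentLaunch(int *out, int childThreads) {
    childFill<<<1, childThreads>>>(out, threadIdx.x * childThreads);
}

// Hypothetical host helper: runs `parents` parent threads, each spawning a
// child grid of `childThreads` threads, and returns the filled array.
bool runDynamic(int parents, int childThreads, std::vector<int> &result) {
    int n = parents * childThreads;
    size_t bytes = n * sizeof(int);
    int *dOut = nullptr;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return false;
    cudaMemset(dOut, 0xFF, bytes);  // every int starts as -1

    parentLaunch<<<1, parents>>>(dOut, childThreads);
    // A parent grid is not complete until all of its child grids are complete,
    // so one host-side synchronization covers both levels.
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    result.resize(n);
    ok = ok && cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dOut);
    return ok;
}

int main() {
    std::vector<int> result;
    if (!runDynamic(4, 8, result)) {
        std::printf("CUDA error, no device, or dynamic parallelism unsupported\n");
        return 1;
    }
    for (int v : result) std::printf("%d ", v);
    std::printf("\n");
    return 0;
}
```

If every child grid ran, the array holds 0 through 31 in order; any remaining -1 would indicate a child launch that did not execute.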

Basic GPGPU Processor Pipeline
• Simple in-order execution in SIMT (single instruction, multiple threads) fashion
• Schedule warp and fetch instruction: the scheduler chooses one of several warps (PCs)
• Fetches one instruction from the instruction cache (I$) per warp
• Decode and I-buffer: decodes the instruction, reads the registers, and dispatches; the scoreboard maintains dependencies
• The multi-ported register file provides data for all lanes
• Numerous ALU, FPU, LD/ST, and SFU lanes run simultaneously (at different speeds)
• Register write back: writeback updates the register file
[Figure: pipeline stages from top to bottom: schedule warp and fetch instruction (I-cache), decode + I-buffer and scoreboard, issue instruction, register file, execution units (integer ALU, float ALU, special functions, load/store unit with shared memory, data cache, and off-chip DRAM), register write back.]
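The interplay of warp scheduling and scoreboarding in the pipeline above can be illustrated with a small host-side model. This is a sketch, not a description of any real GPU: the round-robin policy, the one-issue-per-cycle limit, the single-source instruction format, and the latencies (4 cycles for a load, 1 for an add) are all assumptions made for illustration. The code is host-only and builds with nvcc or any C++ compiler.

```cuda
#include <cstdio>
#include <vector>

// Simplified instruction: one destination register, one source register,
// and the number of cycles until its result is written back.
struct Instr { int dst; int src; int latency; };

// Per-warp state: a program counter plus a scoreboard that stores, for each
// register, the cycle at which its pending write completes.
struct Warp {
    int pc = 0;
    std::vector<int> readyAt;
};

// Every warp runs the same in-order program. Each cycle the scheduler issues
// at most one instruction, taken from the first hazard-free warp in
// round-robin order. Returns the number of cycles needed to issue everything.
int simulate(const std::vector<Instr> &program, int numWarps, int numRegs = 8) {
    std::vector<Warp> warps(numWarps);
    for (Warp &w : warps) w.readyAt.assign(numRegs, 0);

    int remaining = numWarps * static_cast<int>(program.size());
    int cycle = 0;
    int next = 0;  // warp that gets first pick this cycle
    while (remaining > 0) {
        for (int k = 0; k < numWarps; ++k) {
            int id = (next + k) % numWarps;
            Warp &w = warps[id];
            if (w.pc >= static_cast<int>(program.size())) continue;  // warp finished
            const Instr &in = program[w.pc];
            // Scoreboard check: stall on a pending write to the source (RAW)
            // or to the destination (WAW).
            if (w.readyAt[in.src] > cycle || w.readyAt[in.dst] > cycle) continue;
            w.readyAt[in.dst] = cycle + in.latency;  // mark destination as pending
            ++w.pc;
            --remaining;
            next = (id + 1) % numWarps;
            break;  // one issue per cycle
        }
        ++cycle;
    }
    return cycle;
}

int main() {
    // r1 <- load [r0] (4 cycles), then r2 <- r1 + ... (1 cycle)
    std::vector<Instr> program = { {1, 0, 4}, {2, 1, 1} };
    const int warpCounts[] = {1, 2, 4};
    for (int n : warpCounts)
        std::printf("%d warp(s): %d cycles\n", n, simulate(program, n));
    return 0;
}
```

With these assumed latencies the model reports 5 cycles for one warp, 6 for two, and 8 for four. A single warp leaves the issue slot idle for three cycles while its add waits on the load, whereas four warps issue eight instructions in eight cycles because the scheduler always finds another ready warp.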