Modern Computer Architecture course slides (English lecture notes), Lecture 15: GPGPU Architecture and Programming Paradigm

Advanced Computer Architecture Design and Its Applications in Data Centers and Cloud Computing
Lecture 14: GPGPU Architecture and Programming Paradigm

Outline
• GPGPU Architecture Overview
• Core Architecture
• Memory Hierarchy
• Interconnect
• CPU-GPU Interfacing
• Programming Paradigm

Basic Blocks
• Several shader cores / streaming multiprocessors (SM)
• Interconnection network
• On-chip memory controllers
• On-chip caches (level 1/2)
• Off-chip DRAM
[Figure: block diagram. On-chip area: SMs grouped into TPC-0, TPC-1, ..., an interconnect network, L2 caches, and memory controllers. Off-chip area: DRAM chips. The host CPU attaches over the PCI Express bus.]

Basic Blocks
• Thread batch: the hardware unit of thread execution (warp on Nvidia, wavefront on ATI)
• Hardware thread scheduling
• Threads have dedicated registers
• Shared memory among the threads of a thread block
• Same PC for all threads in a warp
• Separate ALU and memory pipelines
[Figure: texture processor clusters 0 to M, each a group of SMs, connect through a high-bandwidth on-chip interconnect to L2 slices and memory controllers MC0 to MCL, which drive the off-chip DRAM array. Inset: one streaming multiprocessor with SPs, thread scheduler, instruction cache, decoder, texture cache, constant cache, shared memory, and register file. Side panel: a GPU kernel launch, matrixMul<<<...>>>(d_C, d_A, d_B, uiWA, uiWB);, is compiled with the CUDA compiler into PTX assembly (mov.s32 %r14, 15; and.b32 %r15, %r13, %r14; add.s32 %r16, %r15, %r12; shr.s32 %r17, %r16, 4;) and runs as lightweight threads grouped into thread blocks.]
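The grouping of threads into warps can be sketched in a few lines. This is a minimal illustration, not something taken from the slides: it assumes a warp width of 32 (the usual Nvidia value; ATI wavefronts are commonly 64) and that threads with consecutive linear indices within a block fall into the same warp. The names `WARP_SIZE`, `warp_and_lane`, and `warps_in_block` are hypothetical.

```python
WARP_SIZE = 32  # assumption: Nvidia warp width; not stated on the slide


def warp_and_lane(thread_idx, warp_size=WARP_SIZE):
    """Return (warp id, lane id) for a linear thread index within a thread block."""
    return divmod(thread_idx, warp_size)


def warps_in_block(block_size, warp_size=WARP_SIZE):
    """Number of warps needed for a block; the last warp may be only partly full."""
    return (block_size + warp_size - 1) // warp_size


if __name__ == "__main__":
    for t in (0, 31, 32, 100):
        print(t, warp_and_lane(t))
    print(warps_in_block(256), warps_in_block(100))  # prints "8 4"
```

All lanes of one warp share a program counter, so thread 100 in this sketch executes in lockstep with threads 96 to 127, not independently.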

Streaming Multiprocessor
• Multithreading (MT) unit
• Instruction cache/decoder
• Several streaming processors (SP)
• Load-store/SFU units
• Large register file
• Shared memory
• Shared texture caches
• Constant cache

Examples: G80 and GT200
• MT unit (global block) scheduler
• TPC: texture processor cluster (a group of SMs sharing the same texture unit)
• Two GPU generations shown: G80 and GT200

Examples: GT300
• GT300 (Fermi)
• Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
[Figure: Fermi Streaming Multiprocessor (SM)]
Source: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Comparison
• G80 vs. GT200 vs. GT300

| GPU | G80 | GT200 | Fermi |
|---|---|---|---|
| Transistors | 681 million | 1.4 billion | 3.0 billion |
| CUDA cores | 128 | 240 | 512 |
| Double-precision floating-point capability | None | 30 FMA ops/clock | 256 FMA ops/clock |
| Single-precision floating-point capability | 128 MAD ops/clock | 240 MAD ops/clock | 512 FMA ops/clock |
| Warp schedulers (per SM) | 1 | 1 | 2 |
| Special function units (SFUs) per SM | 2 | 2 | 4 |
| Shared memory (per SM) | 16 KB | 16 KB | Configurable 48 KB or 16 KB |
| L1 cache (per SM) | None | None | Configurable 16 KB or 48 KB |
| L2 cache | None | None | 768 KB |
| ECC memory support | No | No | Yes |
| Concurrent kernels | No | No | Up to 16 |
| Load/store address width | 32-bit | 32-bit | 64-bit |

Source: http://www.dvhardware.net/article38173.html
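The single-precision figures in the comparison can be turned into peak arithmetic rates with one multiplication. The sketch below assumes the usual convention that a MAD or FMA counts as two floating-point operations; the clock frequency is not given in the slides, so `shader_clock_ghz` is a hypothetical parameter, and the function names are illustrative.

```python
# Single-precision MAD/FMA ops per clock, as listed in the comparison.
SP_OPS_PER_CLOCK = {"G80": 128, "GT200": 240, "Fermi": 512}
FLOPS_PER_OP = 2  # assumption: one multiply-add counts as a multiply plus an add


def peak_flops_per_clock(gpu):
    """Peak single-precision floating-point operations per clock cycle."""
    return SP_OPS_PER_CLOCK[gpu] * FLOPS_PER_OP


def peak_gflops(gpu, shader_clock_ghz):
    """Peak GFLOP/s for a hypothetical shader clock given in GHz."""
    return peak_flops_per_clock(gpu) * shader_clock_ghz


if __name__ == "__main__":
    for gpu in SP_OPS_PER_CLOCK:
        print(gpu, peak_flops_per_clock(gpu))  # G80 256, GT200 480, Fermi 1024
```

Per clock, Fermi therefore offers four times the single-precision rate of G80; the realized GFLOP/s also depends on each part's clock frequency, which would have to be looked up separately.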

Example: GK110 (Kepler Architecture)
• More power efficient than Fermi
• New SM architecture (SMX)
• Revamped memory architecture
• Hardware support for new programming models
• Capable of Dynamic Parallelism
[Figure: "Kepler: Fast & Efficient". Fermi SM compared with Kepler SMX; labels: 192 cores, 3x Perf/Watt.]
Source: http://www.nvidia.com/content/PDF/kepler/NVIDIA-KeplerGK110-Architecture-Whitepaper.pdf

Basic GPGPU Processor Pipeline
• Simple in-order execution in SIMT (single instruction, multiple threads) fashion
• The scheduler chooses one of several warps (PCs)
• Fetch reads one instruction per warp from the I$
• Decode decodes the instruction, reads registers, and dispatches
  – The scoreboard tracks dependencies
• A multi-ported register file provides data for all lanes
• Numerous ALU, FPU, LD/ST, and SFU lanes run simultaneously (at different speeds)
• Writeback updates the register file
[Figure: pipeline stages. Schedule warp and fetch instruction (I-cache), decode with I-buffer and scoreboard, issue instruction, register file, then integer ALU, float ALU, special functions, and load/store unit (connected to shared memory, the data cache, and off-chip DRAM), ending in register write back.]
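The interplay of warp scheduling and the scoreboard can be illustrated with a toy cycle-level model. This is a sketch under stated assumptions, not the scheduler of any real GPU: it issues at most one warp instruction per cycle, picks warps round-robin, stalls a warp on read-after-write and write-after-write hazards, and uses made-up latencies. `Instr`, `simulate`, and `PROG` are hypothetical names.

```python
from collections import namedtuple

# One warp-wide instruction: opcode, destination register, source registers, latency.
Instr = namedtuple("Instr", "op dst srcs latency")


def simulate(programs, max_cycles=1000):
    """Return the issue trace as a list of (cycle, warp, op) tuples."""
    n = len(programs)
    pc = [0] * n                       # one PC per warp, shared by all its threads
    pending = [{} for _ in range(n)]   # scoreboard: register -> cycle its write completes
    trace = []
    last = -1                          # warp issued most recently (round-robin pointer)
    for cycle in range(max_cycles):
        # Writeback: clear scoreboard entries whose latency has elapsed.
        for sb in pending:
            for reg in [r for r, done in sb.items() if done <= cycle]:
                del sb[reg]
        if all(pc[w] >= len(programs[w]) for w in range(n)):
            break
        # Schedule: first ready warp after the last issued one, in-order within a warp.
        for offset in range(1, n + 1):
            w = (last + offset) % n
            if pc[w] >= len(programs[w]):
                continue
            ins = programs[w][pc[w]]
            # Stall if a source (RAW) or the destination (WAW) is still pending.
            if any(r in pending[w] for r in tuple(ins.srcs) + (ins.dst,)):
                continue
            pending[w][ins.dst] = cycle + ins.latency
            trace.append((cycle, w, ins.op))
            pc[w] += 1
            last = w
            break
    return trace


# Assumed latencies: 4 cycles for the load, 1 cycle for the add.
PROG = [Instr("ld", "r1", ("r0",), 4), Instr("add", "r2", ("r1", "r1"), 1)]

if __name__ == "__main__":
    print(simulate([PROG]))        # [(0, 0, 'ld'), (4, 0, 'add')]
    print(simulate([PROG, PROG]))  # [(0, 0, 'ld'), (1, 1, 'ld'), (4, 0, 'add'), (5, 1, 'add')]
```

With one warp, the dependent add waits four cycles behind the load. With two warps, the second warp's load issues during that stall, so four instructions issue in six cycles instead of ten; this is the latency hiding that hardware warp scheduling is meant to provide.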
