
Modern Computer Architecture: Course Slides (English Lecture Notes), Lecture 15: GPGPU Architecture and Programming Paradigm

Document information
Resource category: Document library
Document format: PDF
Pages: 57
File size: 4.72 MB

Advanced Computer Architecture Design and Its Applications in Data Centers and Cloud Computing
Lecture 14: GPGPU Architecture and Programming Paradigm

Outline
• GPGPU Architecture Overview
• Core Architecture
• Memory Hierarchy
• Interconnect
• CPU-GPU Interfacing
• Programming Paradigm

Basic Blocks
• Several shader cores / streaming multiprocessors (SMs)
• Interconnection network
• On-chip memory controllers
• On-chip caches (level 1/2)
• Off-chip DRAM
[Figure: the host CPU connects to the GPU over the PCI Express bus. On-chip area: SMs grouped into TPCs, the interconnect network, L2 caches, and DRAM controllers. Off-chip area: DRAM chips.]

Basic Blocks
• Thread batch: the hardware unit of thread execution (warp on Nvidia, wavefront on ATI)
• Hardware thread scheduling
• Threads have dedicated registers
• Shared memory among the threads of a thread block
• Same PC for all threads in a warp
• Separate ALU and memory pipelines
[Figure: SMs grouped into texture processor clusters (Cluster 0 to Cluster M) connect through a high-bandwidth on-chip network to the L2 slices, the memory controllers (MC0 to MCL), and the off-chip memory array. Each streaming multiprocessor contains SPs, a thread scheduler, an instruction cache and decoder, a texture cache, a constant cache, shared memory, and a register file.]
[Figure: GPU kernels, e.g. matrixMul<<<...>>>(d_C, d_A, d_B, uiWA, uiWB);, are compiled with the CUDA compiler into PTX assembly (mov.s32 %r14, 15; and.b32 %r15, %r13, %r14; add.s32 %r16, %r15, %r12; shr.s32 %r17, %r16, 4;) and run as lightweight threads grouped into thread blocks.]
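The grouping of a thread block into warps can be sketched with a few lines of Python. This is a minimal host-side illustration, not GPU code. It assumes a warp size of 32 (the NVIDIA value, which the slide does not state) and a linear thread index within the block; the function names `warp_and_lane` and `warps_per_block` are hypothetical.

```python
WARP_SIZE = 32  # assumption: NVIDIA warp width; pass another value to model other hardware


def warp_and_lane(thread_id, warp_size=WARP_SIZE):
    """Return (warp index, lane index) for a linear thread id inside a block."""
    return divmod(thread_id, warp_size)


def warps_per_block(threads_per_block, warp_size=WARP_SIZE):
    """Number of warps needed to hold a block (ceiling division)."""
    return (threads_per_block + warp_size - 1) // warp_size


if __name__ == "__main__":
    for tid in (0, 31, 32, 100):
        warp, lane = warp_and_lane(tid)
        print(f"thread {tid:3d} -> warp {warp}, lane {lane}")
    print("warps in a 16x16 block:", warps_per_block(16 * 16))
```

All threads that map to the same warp index share one PC, which is why a block whose size is not a multiple of the warp width still occupies a whole extra warp.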

Streaming Multiprocessor
• Multithreading (MT) unit
• Instruction cache / decoder
• Several streaming processors (SPs)
• Load-store / SFU units
• Large register file
• Shared memory
• Texture cache shared within the TPC
• Constant cache

Examples: G80 and GT200
• MT unit (global block scheduler)
• TPC: texture processor cluster (a group of SMs that share the same texture unit)
• Two GPU generations are shown: G80 and GT200

Examples: GT300
• GT300 (Fermi)
Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
[Figure: Fermi Streaming Multiprocessor (SM), showing the instruction cache, two warp schedulers with their dispatch units, the register file (32768 x 32-bit), CUDA cores (dispatch port, FP unit, INT unit, result queue), SFUs, and 64 KB of shared memory / L1 cache.]
Source: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Comparison
• G80 vs. GT200 vs. GT300

GPU | G80 | GT200 | Fermi
Transistors | 681 million | 1.4 billion | 3.0 billion
CUDA cores | 128 | 240 | 512
Double-precision floating-point capability | None | 30 FMA ops/clock | 256 FMA ops/clock
Single-precision floating-point capability | 128 MAD ops/clock | 240 MAD ops/clock | 512 FMA ops/clock
Warp schedulers (per SM) | 1 | 1 | 2
Special function units (SFUs) per SM | 2 | 2 | 4
Shared memory (per SM) | 16 KB | 16 KB | Configurable 48 KB or 16 KB
L1 cache (per SM) | None | None | Configurable 16 KB or 48 KB
L2 cache (unified, shared by all SMs) | None | None | 768 KB
ECC memory support | No | No | Yes
Concurrent kernels | No | No | Up to 16
Load/store address width | 32-bit | 32-bit | 64-bit

Source: http://www.dvhardware.net/article38173.html
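The per-clock figures in the comparison can be turned into ratios with a short calculation. The sketch below copies the CUDA core counts and the single- and double-precision ops/clock values from the comparison; encoding G80's "None" as 0 and the helper names `dp_to_sp_ratio` and `dp_gain_per_clock` are assumptions made for illustration. No clock frequencies are used, so the results are per-clock ratios rather than absolute FLOPS.

```python
# Per-clock throughput copied from the G80 / GT200 / Fermi comparison.
# Assumption: "None" for G80 double precision is encoded as 0.
GPUS = {
    "G80": {"cuda_cores": 128, "sp_ops_per_clock": 128, "dp_ops_per_clock": 0},
    "GT200": {"cuda_cores": 240, "sp_ops_per_clock": 240, "dp_ops_per_clock": 30},
    "Fermi": {"cuda_cores": 512, "sp_ops_per_clock": 512, "dp_ops_per_clock": 256},
}


def dp_to_sp_ratio(gpu):
    """Double-precision ops per clock as a fraction of single-precision ops per clock."""
    spec = GPUS[gpu]
    return spec["dp_ops_per_clock"] / spec["sp_ops_per_clock"]


def dp_gain_per_clock(new, old):
    """How many times more double-precision ops per clock `new` issues than `old`."""
    return GPUS[new]["dp_ops_per_clock"] / GPUS[old]["dp_ops_per_clock"]


if __name__ == "__main__":
    for name in GPUS:
        print(f"{name}: DP/SP per clock = {dp_to_sp_ratio(name):.3f}")
    print(f"Fermi vs GT200 DP per clock: {dp_gain_per_clock('Fermi', 'GT200'):.2f}x")
```

Under these figures, double precision runs at one eighth of the single-precision rate on GT200 and at half of it on Fermi.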

Example: GK110 (Kepler Architecture)
• More power efficient than Fermi
• New SM architecture (SMX)
• Revamped memory architecture
• Hardware support for new programming models
• Capable of Dynamic Parallelism
[Figure: "Kepler: Fast & Efficient", comparing a Fermi SM with a Kepler SMX (192 cores) and citing 3x performance per watt.]
Source: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

Basic GPGPU Processor Pipeline
• Simple in-order execution in SIMT
  – Single instruction, multiple threads
• The scheduler chooses one of several warps (PCs)
• Fetches one instruction per warp from the I$
• Decodes the instruction, reads the registers, and dispatches
  – The scoreboard tracks dependencies
• A multi-ported register file provides data for all lanes
• Numerous ALU, FPU, LD/ST, and SFU lanes run simultaneously (at different speeds)
• Writeback updates the register file
[Figure: pipeline stages. Schedule Warp and Fetch Instruction (I-cache), Decode with I-Buffer and Scoreboard, Issue Instruction, Register File, execution units (Integer ALU, Float ALU, Special Functions, Load/Store Unit with shared memory, data cache, and off-chip DRAM), Register Write Back.]
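The interplay of warp scheduling, in-order issue, and the scoreboard can be illustrated with a toy cycle-level model. This is a deliberately simplified sketch, not a description of any real GPU: it assumes one instruction issued per cycle, a round-robin scheduler, no branch divergence, and fixed latencies (a 4-cycle load and a 1-cycle add, both invented for the example). The names `Instr`, `Warp`, `make_warps`, and `simulate` are hypothetical.

```python
from collections import namedtuple

# dest: register written, srcs: registers read, latency: cycles until writeback
Instr = namedtuple("Instr", "dest srcs latency")

# Assumed two-instruction program: a load followed by a dependent add.
PROGRAM = [Instr("r1", (), 4), Instr("r2", ("r1",), 1)]


class Warp:
    def __init__(self, wid, program):
        self.wid = wid
        self.program = program
        self.pc = 0        # a single PC shared by every thread of the warp
        self.pending = {}  # scoreboard: register -> cycle at which it is written back

    def done(self):
        return self.pc >= len(self.program)

    def writeback(self, cycle):
        # Clear scoreboard entries whose results have arrived by this cycle.
        self.pending = {r: t for r, t in self.pending.items() if t > cycle}

    def can_issue(self):
        # In-order issue: only the instruction at the PC is considered.
        if self.done():
            return False
        ins = self.program[self.pc]
        regs = set(ins.srcs) | {ins.dest}
        return not any(r in self.pending for r in regs)


def make_warps(count):
    return [Warp(i, list(PROGRAM)) for i in range(count)]


def simulate(warps, max_cycles=1000):
    """Run until every warp has finished; return (cycles, issue trace)."""
    trace = []  # (cycle, warp id, pc) for each issued instruction
    last = -1   # index of the warp that issued most recently
    cycle = 0
    while not all(w.done() and not w.pending for w in warps):
        for w in warps:
            w.writeback(cycle)
        # Round-robin: start with the warp after the last one that issued.
        for k in range(1, len(warps) + 1):
            idx = (last + k) % len(warps)
            w = warps[idx]
            if w.can_issue():
                ins = w.program[w.pc]
                w.pending[ins.dest] = cycle + ins.latency
                trace.append((cycle, w.wid, w.pc))
                w.pc += 1
                last = idx
                break
        cycle += 1
        if cycle > max_cycles:
            raise RuntimeError("simulation did not finish")
    return cycle, trace


if __name__ == "__main__":
    for count in (1, 4):
        cycles, trace = simulate(make_warps(count))
        print(f"{count} warp(s): {len(trace)} instructions in {cycles} cycles")
```

With a single warp the dependent add stalls until the load writes back, whereas with four warps the scheduler fills those stall cycles with loads from the other warps. That is the latency-hiding effect the pipeline relies on.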
