《并行与分布式程序设计》课程教学参考书:NVIDIA《CUDA C PROGRAMMING GUIDE》(Design Guide,CHANGES FROM VERSION 9.0)

nVIDIA. CUDA C PROGRAMMING GUIDE PG-02829-001v9.2|May2018 Design Guide
CUDA C PROGRAMMING GUIDE PG-02829-001_v9.2 | May 2018 Design Guide

CHANGES FROM VERSION 9.0 Documented restriction that operator-overloads cannot be global functions in Operator Function. Removed guidance to break 8-byte shuffles into two 4-byte instructions.8-byte shuffle variants are provided since CUDA 9.0.See Warp Shuffle Functions. Passing restrict references to global functions is now supported. Updated comment in globalfunctions and function templates. Documented CUDA ENABLE CRC CHECK in CUDA Environment Variables. Warp matrix functions [PREVIEW FEATURE]now support matrix products with m=32,n=8,k=16 and m=8,n=32,k=16 in addition to m=n=k=16. Added new Unified Memory sections:System Allocator,Hardware Coherency, Access Counters www.nvidia.com CUDA C Programming Guide PG-02829-001v9.21ii
www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | ii CHANGES FROM VERSION 9.0 ‣ Documented restriction that operator-overloads cannot be __global__ functions in Operator Function. ‣ Removed guidance to break 8-byte shuffles into two 4-byte instructions. 8-byte shuffle variants are provided since CUDA 9.0. See Warp Shuffle Functions. ‣ Passing __restrict__ references to __global__ functions is now supported. Updated comment in __global__ functions and function templates. ‣ Documented CUDA_ENABLE_CRC_CHECK in CUDA Environment Variables. ‣ Warp matrix functions [PREVIEW FEATURE] now support matrix products with m=32, n=8, k=16 and m=8, n=32, k=16 in addition to m=n=k=16. ‣ Added new Unified Memory sections: System Allocator, Hardware Coherency, Access Counters

TABLE OF CONTENTS Chapter 1.Introduction..1 1.1.From Graphics Processing to General Purpose Parallel Computing...............................1 1.2.CUDA:A General-Purpose Parallel Computing Platform and Programming Model.............3 1.3.A Scalable Programming Model.....................................4 1.4.Document Structure....................... 6 8 Chapter 2.Programming Model................ 2.1.Kernets............. .8 2.2.Thread Hierarchy........................... 9 2.3.Memory Hierarchy................... .11 2.4.Heterogeneous Programming.................. 3 2.5.Compute Capability.............. 15 Chapter 3.Programming Interface........ …16 3.1.Compilation with NVCC............ 16 3.1.1.Compilation Workflow.... .17 3.1.1.1.Offline Compilation....... 17 3.1.1.2.Just-in-Time Compilation.. .17 3.1.2.Binary Compatibility............. .17 3.1.3.PTX Compatibility............... 18 3.1.4.Application Compatibility........... 18 3.1.5.C/C++Compatibility...... .19 3.1.6.64-Bit Compatibility............... 19 3.2.CUDA C Runtime.… .19 3.2.1.Initiatization.................... …20 3.2.2.Device Memory.… 20 3.2.3.Shared Memory.................. .24 3.2.4.Page-Locked Host Memory..................... 29 3.2.4.1.Portable Memory.................... .30 3.2.4.2.Write-combining Memory.......30 3.2.4.3.Mapped Memory.................. 30 3.2.5.Asynchronous Concurrent Execution............... .31 3.2.5.1.Concurrent Execution between Host and Device..... .32 3.2.5.2.Concurrent Kernel Execution......................... .32 3.2.5.3.Overlap of Data Transfer and Kernel Execution.... 32 3.2.5.4.Concurrent Data Transfers...................... 33 3.2.5.5.Streams....... .33 3.2.5.6.Events.… 37 3.2.5.7.Synchronous Calls.... .38 3.2.6.Multi-Device System...................... .38 3.2.6.1.Device Enumeration...... .38 3.2.6.2.Device Selection..................38 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2|ii
www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | iii TABLE OF CONTENTS Chapter 1. Introduction.........................................................................................1 1.1. From Graphics Processing to General Purpose Parallel Computing............................... 1 1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model.............3 1.3. A Scalable Programming Model.........................................................................4 1.4. Document Structure...................................................................................... 6 Chapter 2. Programming Model............................................................................... 8 2.1. Kernels......................................................................................................8 2.2. Thread Hierarchy......................................................................................... 9 2.3. Memory Hierarchy....................................................................................... 11 2.4. Heterogeneous Programming.......................................................................... 13 2.5. Compute Capability..................................................................................... 15 Chapter 3. Programming Interface..........................................................................16 3.1. Compilation with NVCC................................................................................ 16 3.1.1. Compilation Workflow.............................................................................17 3.1.1.1. Offline Compilation.......................................................................... 17 3.1.1.2. Just-in-Time Compilation....................................................................17 3.1.2. Binary Compatibility...............................................................................17 3.1.3. PTX Compatibility..................................................................................18 3.1.4. Application Compatibility.........................................................................18 3.1.5. C/C++ Compatibility............................................................................... 19 3.1.6. 64-Bit Compatibility............................................................................... 19 3.2. CUDA C Runtime.........................................................................................19 3.2.1. Initialization.........................................................................................20 3.2.2. Device Memory..................................................................................... 20 3.2.3. Shared Memory..................................................................................... 24 3.2.4. Page-Locked Host Memory........................................................................29 3.2.4.1. Portable Memory..............................................................................30 3.2.4.2. Write-Combining Memory....................................................................30 3.2.4.3. Mapped Memory...............................................................................30 3.2.5. Asynchronous Concurrent Execution............................................................ 31 3.2.5.1. Concurrent Execution between Host and Device........................................32 3.2.5.2. Concurrent Kernel Execution............................................................... 32 3.2.5.3. Overlap of Data Transfer and Kernel Execution......................................... 32 3.2.5.4. Concurrent Data Transfers.................................................................. 33 3.2.5.5. Streams.........................................................................................33 3.2.5.6. Events...........................................................................................37 3.2.5.7. Synchronous Calls.............................................................................38 3.2.6. Multi-Device System............................................................................... 38 3.2.6.1. Device Enumeration.......................................................................... 38 3.2.6.2. Device Selection.............................................................................. 38

3.2.6.3.Stream and Event Behavior..... .39 3.2.6.4.Peer-to-Peer Memory Access.....39 3.2.6.5.Peer-to-Peer Memory Copy............ .40 3.2.7.Unified Virtual Address Space............................. 41 3.2.8.Interprocess Communication...... 41 3.2.9.Error checking..................... …42 3.2.10.Call Stack.… …42 3.2.11.Texture and Surface Memory......... 42 3.2.11.1.Texture Memory................ 43 3.2.11.2.Surface Memory.............. 52 3.2.11.3.CUDA Arrays................ .56 3.2.11.4.Read/Write Coherency..... 56 3.2.12.Graphics Interoperability.......... .56 3.2.12.1.OpenGL Interoperability.... 57 3.2.12.2.Direct3D Interoperability...... .59 3.2.12.3.SLI Interoperability....... 65 3.3.Versioning and Compatibility............... .66 3.4.Compute Modes.................... 67 3.5.Mode Switches...68 3.6.Tesla Compute Cluster Mode for Windows... 68 Chapter 4.Hardware Implementation.................. 0…70 4.1.SIMT Architecture................. 70 4.2.Hardware Multithreading............... .72 Chapter 5.Performance Guidelines................. .74 5.1.Overall Performance Optimization Strategies.. 74 5.2.Maximize Utilization.......................... .74 5.2.1.Application Level.................... .74 5.2.2.Device Level.… .75 5.2.3.Multiprocessor Level...... .75 5.2.3.1.Occupancy Calculator............ .77 5.3.Maximize Memory Throughput...... 79 5.3.1.Data Transfer between Host and Device. 80 5.3.2.Device Memory Accesses............ .81 5.4.Maximize Instruction Throughput............ .85 5.4.1.Arithmetic Instructions......... .85 5.4.2.Control Flow Instructions.............. .89 5.4.3.Synchronization Instruction... .90 Appendix A.CUDA-Enabled GPUs............ …91 Appendix B.C Language Extensions..... .92 B.1.Function Execution Space Specifiers .92 B.1.1.device_ 92 B.1.2.global 92 B.1.3.ho5t. 93 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.21iV
www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | iv 3.2.6.3. Stream and Event Behavior................................................................. 39 3.2.6.4. Peer-to-Peer Memory Access................................................................39 3.2.6.5. Peer-to-Peer Memory Copy..................................................................40 3.2.7. Unified Virtual Address Space................................................................... 41 3.2.8. Interprocess Communication..................................................................... 41 3.2.9. Error Checking......................................................................................42 3.2.10. Call Stack.......................................................................................... 42 3.2.11. Texture and Surface Memory................................................................... 42 3.2.11.1. Texture Memory............................................................................. 43 3.2.11.2. Surface Memory............................................................................. 52 3.2.11.3. CUDA Arrays..................................................................................56 3.2.11.4. Read/Write Coherency..................................................................... 56 3.2.12. Graphics Interoperability........................................................................56 3.2.12.1. OpenGL Interoperability................................................................... 57 3.2.12.2. Direct3D Interoperability...................................................................59 3.2.12.3. SLI Interoperability..........................................................................65 3.3. Versioning and Compatibility.......................................................................... 66 3.4. Compute Modes..........................................................................................67 3.5. Mode Switches........................................................................................... 68 3.6. Tesla Compute Cluster Mode for Windows.......................................................... 68 Chapter 4. Hardware Implementation......................................................................70 4.1. SIMT Architecture....................................................................................... 70 4.2. Hardware Multithreading...............................................................................72 Chapter 5. Performance Guidelines........................................................................ 74 5.1. Overall Performance Optimization Strategies...................................................... 74 5.2. Maximize Utilization.................................................................................... 74 5.2.1. Application Level...................................................................................74 5.2.2. Device Level........................................................................................ 75 5.2.3. Multiprocessor Level...............................................................................75 5.2.3.1. Occupancy Calculator........................................................................ 77 5.3. Maximize Memory Throughput........................................................................ 79 5.3.1. Data Transfer between Host and Device....................................................... 80 5.3.2. Device Memory Accesses..........................................................................81 5.4. Maximize Instruction Throughput.....................................................................85 5.4.1. Arithmetic Instructions............................................................................85 5.4.2. Control Flow Instructions......................................................................... 89 5.4.3. Synchronization Instruction.......................................................................90 Appendix A. CUDA-Enabled GPUs........................................................................... 91 Appendix B. C Language Extensions........................................................................92 B.1. Function Execution Space Specifiers.................................................................92 B.1.1. __device__.......................................................................................... 92 B.1.2. __global__...........................................................................................92 B.1.3. __host__............................................................................................. 93

B.1.4.noinline and forceinline. 93 B.2.Variable Memory Space Specifiers............... 93 B.2.1._device_… .94 B.2.2.constant_ 94 B.2.3.shared_… .94 B.2.4.managed.… ..95 B.2.5.restrict_… .95 B.3.Built-in Vector Types.............. 97 B.3.1.char,short,int,long,longlong,float,double .97 B.3.2.dim3.… .98 B.4.Built-in Variables....................... 98 B.4.1.gridDim..… 98 B.4.2.blockldx.… .98 B.4.3.blockDim.............. .98 B.4.4.threadldx................... 99 B.4.5.warpSize.......... 99 B.5.Memory Fence Functions................. 99 B.6.Synchronization Functions.... ,102 B.7.Mathematical Functions................ .103 B.8.Texture Functions.......... .103 B.8.1.Texture object APl.......... …103 B.8.1.1.tex1Dfetch()...... ..103 B.8.1.2.tex1D0.… 103 B.8.1.3.tex1DLod0.… .103 B.8.1.4.tex1DGrad()........... 104 B.8.1.5.tex2D0.. 104 B.8.1.6.tex2DLod0.… ….104 B.8.1.7.tex2DGrad()........ 104 B.8.1.8.tex3D0.… ..104 B.8.1.9.tex3DLod0… .104 B.8.1.10.tex3DGrad().... ..105 B.8.1.11.tex1DLayered()........ .105 B.8.1.12.tex1DLayeredLod().. .105 B.8.1.13.tex1DLayeredGrad(). 105 B.8.1.14.tex2DLayered()...... 105 B.8.1.15.tex2DLayeredLod()......... .105 B.8.1.16.tex2DLayeredGrad()... .106 B.8.1.17.texCubemap()............ …106 B.8.1.18.texCubemapLod()...... .106 B.8.1.19.texCubemapLayered()...... .106 B.8.1.20.texCubemapLayeredLod(). .106 B.8.1.21.tex2Dgather()............. .106 B.8.2.Texture Reference API......... .107 www.nvidia.com CUDA C Programming Guide PG-02829-001_9.21V
www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | v B.1.4. __noinline__ and __forceinline__............................................................... 93 B.2. Variable Memory Space Specifiers....................................................................93 B.2.1. __device__.......................................................................................... 94 B.2.2. __constant__........................................................................................94 B.2.3. __shared__.......................................................................................... 94 B.2.4. __managed__....................................................................................... 95 B.2.5. __restrict__......................................................................................... 95 B.3. Built-in Vector Types................................................................................... 97 B.3.1. char, short, int, long, longlong, float, double................................................ 97 B.3.2. dim3..................................................................................................98 B.4. Built-in Variables........................................................................................ 98 B.4.1. gridDim.............................................................................................. 98 B.4.2. blockIdx..............................................................................................98 B.4.3. blockDim.............................................................................................98 B.4.4. threadIdx............................................................................................ 99 B.4.5. warpSize............................................................................................. 99 B.5. Memory Fence Functions...............................................................................99 B.6. Synchronization Functions............................................................................102 B.7. Mathematical Functions...............................................................................103 B.8. Texture Functions......................................................................................103 B.8.1. Texture Object API...............................................................................103 B.8.1.1. tex1Dfetch()..................................................................................103 B.8.1.2. tex1D()........................................................................................ 103 B.8.1.3. tex1DLod()....................................................................................103 B.8.1.4. tex1DGrad().................................................................................. 104 B.8.1.5. tex2D()........................................................................................ 104 B.8.1.6. tex2DLod()....................................................................................104 B.8.1.7. tex2DGrad().................................................................................. 104 B.8.1.8. tex3D()........................................................................................ 104 B.8.1.9. tex3DLod()....................................................................................104 B.8.1.10. tex3DGrad().................................................................................105 B.8.1.11. tex1DLayered()............................................................................. 105 B.8.1.12. tex1DLayeredLod().........................................................................105 B.8.1.13. tex1DLayeredGrad()....................................................................... 105 B.8.1.14. tex2DLayered()............................................................................. 105 B.8.1.15. tex2DLayeredLod().........................................................................105 B.8.1.16. tex2DLayeredGrad()....................................................................... 106 B.8.1.17. texCubemap().............................................................................. 106 B.8.1.18. texCubemapLod().......................................................................... 106 B.8.1.19. texCubemapLayered().....................................................................106 B.8.1.20. texCubemapLayeredLod()................................................................ 106 B.8.1.21. tex2Dgather()...............................................................................106 B.8.2. Texture Reference API...........................................................................107

B.8.2.1.tex1Dfetch(). .107 B.8.2.2.tex1D0.… 107 B.8.2.3.tex1DLod()....... .108 B.8.2.4.tex1DGrad()............... .108 B.8.2.5.tex2D0.. 108 B.8.2.6.tex2DLod().............. .108 B.8.2.7.tex2DGrad().......... ...108 B.8.2.8.tex3D0. …109 B.8.2.9.tex3DLod()............ ..109 B.8.2.10.tex3DGrad()............ .109 B.8.2.11.tex1DLayered()......... .109 B.8.2.12.tex1DLayeredLod().. .110 B.8.2.13.tex1DLayeredGrad()..... .110 B.8.2.14.tex2DLayered()...... ,110 B.8.2.15.tex2DLayeredLod()...... .110 B.8.2.16.tex2DLayeredGrad()... 111 B.8.2.17.texCubemap().............. .111 B.8.2.18.texCubemapLod()..... 111 B.8.2.19.texCubemapLayered()............ ….111 B.8.2.20.texCubemapLayeredLod(). …111 B.8.2.21.tex2Dgather()............... .112 B.9.Surface Functions............ .112 B.9.1.Surface object APl............... .112 B.9.1.1.surf1Dread()................ .112 B.9.1.2.surf1Dwrite............. .112 B.9.1.3.surf2Dread()............. .113 B.9.1.4.surf2Dwrite()............ ,113 B.9.1.5.surf3Dread()............ .113 B.9.1.6.surf3Dwrite().... .113 B.9.1.7.surf1DLayeredread().... 114 B.9.1.8.surf1DLayeredwrite()... 114 B.9.1.9.surf2DLayeredread()..... .114 B.9.1.10.surf2DLayeredwrite() .114 B.9.1.11.surfCubemapread()......... .115 B.9.1.12.surfCubemapwrite().... 115 B.9.1.13.surfCubemapLayeredread() .115 B.9.1.14.surfCubemapLayeredwrite() .115 B.9.2.Surface Reference APl.... .116 B.9.2.1.surf1Dread().......... .116 B.9.2.2.surf1Dwrite............. …116 B.9.2.3.surf2Dread().......... 116 B.9.2.4.surf2Dwrite()........ …116 B.9.2.5.surf3Dread()....... .117 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.21
www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | vi B.8.2.1. tex1Dfetch()..................................................................................107 B.8.2.2. tex1D()........................................................................................ 107 B.8.2.3. tex1DLod()....................................................................................108 B.8.2.4. tex1DGrad().................................................................................. 108 B.8.2.5. tex2D()........................................................................................ 108 B.8.2.6. tex2DLod()....................................................................................108 B.8.2.7. tex2DGrad().................................................................................. 108 B.8.2.8. tex3D()........................................................................................ 109 B.8.2.9. tex3DLod()....................................................................................109 B.8.2.10. tex3DGrad().................................................................................109 B.8.2.11. tex1DLayered()............................................................................. 109 B.8.2.12. tex1DLayeredLod().........................................................................110 B.8.2.13. tex1DLayeredGrad()....................................................................... 110 B.8.2.14. tex2DLayered()............................................................................. 110 B.8.2.15. tex2DLayeredLod().........................................................................110 B.8.2.16. tex2DLayeredGrad()....................................................................... 111 B.8.2.17. texCubemap().............................................................................. 111 B.8.2.18. texCubemapLod().......................................................................... 111 B.8.2.19. texCubemapLayered().....................................................................111 B.8.2.20. texCubemapLayeredLod()................................................................ 111 B.8.2.21. tex2Dgather()...............................................................................112 B.9. Surface Functions...................................................................................... 112 B.9.1. Surface Object API............................................................................... 112 B.9.1.1. surf1Dread()..................................................................................112 B.9.1.2. surf1Dwrite................................................................................... 112 B.9.1.3. surf2Dread()..................................................................................113 B.9.1.4. surf2Dwrite()................................................................................. 113 B.9.1.5. surf3Dread()..................................................................................113 B.9.1.6. surf3Dwrite()................................................................................. 113 B.9.1.7. surf1DLayeredread()........................................................................ 114 B.9.1.8. surf1DLayeredwrite()....................................................................... 114 B.9.1.9. surf2DLayeredread()........................................................................ 114 B.9.1.10. surf2DLayeredwrite()......................................................................114 B.9.1.11. surfCubemapread()........................................................................ 115 B.9.1.12. surfCubemapwrite()....................................................................... 115 B.9.1.13. surfCubemapLayeredread()...............................................................115 B.9.1.14. surfCubemapLayeredwrite()..............................................................115 B.9.2. Surface Reference API........................................................................... 116 B.9.2.1. surf1Dread()..................................................................................116 B.9.2.2. surf1Dwrite................................................................................... 116 B.9.2.3. surf2Dread()..................................................................................116 B.9.2.4. surf2Dwrite()................................................................................. 116 B.9.2.5. surf3Dread()..................................................................................117

B.9.2.6.surf3Dwrite()..... ,117 B.9.2.7.surf1DLayeredread()............. 117 B.9.2.8.surf1DLayeredwrite().... ..117 B.9.2.9.surf2DLayeredread()............ 118 B.9.2.10.surf2DLayeredwrite(). .118 B.9.2.11.surfCubemapread()............ 118 B.9.2.12.surfCubemapwrite().......... .118 B.9.2.13.surfCubemapLayeredread(). .119 B.9.2.14.surfCubemapLayeredwrite(). .119 B.10.Read-Only Data Cache Load Function. .119 B.11.Time Function.......................... ...119 B.12.Atomic Functions........... 120 B.12.1.Arithmetic Functions......... 121 B.12.1.1.atomicAdd().... .121 B.12.1.2.atomicSub0..... .121 B.12.1.3.atomicExch()... .122 B.12.1.4.atomicMin()............. .122 B.12.1.5.atomicMax()... ..122 B.12.1.6.atomiclnc().............. .122 B.12.1.7.atomicDec()... .123 B.12.1.8.atomiccAS()........ .123 B.12.2.Bitwise Functions... 123 B.12.2.1.atomicAnd().......... .123 B.12.2.2.atomic0r0.… 123 B.12.2.3.atomicXor().......... 124 B.13.Warp Vote Functions........... 124 B.14.Warp Match Functions...... 125 B.14.1.Synop5ys.… 125 B.14.2.Description...... 125 B.15.Warp Shuffle Functions...... .126 B.15.1.Synopsis.… 126 B.15.2.Description................. 126 B.15.3.Return Value............... 127 B.15.4.Notes..… 128 B.15.5.xamples..… 128 B.15.5.1.Broadcast of a single value across a warp................................... 128 B.15.5.2.Inclusive plus-scan across sub-partitions of 8 threads.... .129 B.15.5.3.Reduction across a warp...................... 129 B.16.Warp matrix functions [PREVIEW FEATURE].......... .129 B.16.1.Description.......... 130 B.16.2.xample.… 132 B.17.Profiler Counter Function............. 132 B.18.Assertion............... 133 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2|vii
www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | vii B.9.2.6. surf3Dwrite()................................................................................. 117 B.9.2.7. surf1DLayeredread()........................................................................ 117 B.9.2.8. surf1DLayeredwrite()....................................................................... 117 B.9.2.9. surf2DLayeredread()........................................................................ 118 B.9.2.10. surf2DLayeredwrite()......................................................................118 B.9.2.11. surfCubemapread()........................................................................ 118 B.9.2.12. surfCubemapwrite()....................................................................... 118 B.9.2.13. surfCubemapLayeredread()...............................................................119 B.9.2.14. surfCubemapLayeredwrite()..............................................................119 B.10. Read-Only Data Cache Load Function.............................................................119 B.11. Time Function.........................................................................................119 B.12. Atomic Functions..................................................................................... 120 B.12.1. Arithmetic Functions........................................................................... 121 B.12.1.1. atomicAdd().................................................................................121 B.12.1.2. atomicSub()................................................................................. 121 B.12.1.3. atomicExch()................................................................................122 B.12.1.4. atomicMin()................................................................................. 122 B.12.1.5. atomicMax().................................................................................122 B.12.1.6. atomicInc()..................................................................................122 B.12.1.7. atomicDec().................................................................................123 B.12.1.8. atomicCAS().................................................................................123 B.12.2. Bitwise Functions............................................................................... 123 B.12.2.1. atomicAnd().................................................................................123 B.12.2.2. atomicOr().................................................................................. 123 B.12.2.3. atomicXor()................................................................................. 124 B.13. Warp Vote Functions................................................................................. 124 B.14. Warp Match Functions............................................................................... 125 B.14.1. Synopsys.......................................................................................... 125 B.14.2. Description....................................................................................... 125 B.15. Warp Shuffle Functions..............................................................................126 B.15.1. Synopsis........................................................................................... 126 B.15.2. Description....................................................................................... 126 B.15.3. Return Value..................................................................................... 127 B.15.4. Notes.............................................................................................. 128 B.15.5. Examples..........................................................................................128 B.15.5.1. Broadcast of a single value across a warp............................................ 128 B.15.5.2. Inclusive plus-scan across sub-partitions of 8 threads............................... 129 B.15.5.3. Reduction across a warp................................................................. 129 B.16. Warp matrix functions [PREVIEW FEATURE]......................................................129 B.16.1. Description....................................................................................... 130 B.16.2. Example...........................................................................................132 B.17. Profiler Counter Function........................................................................... 132 B.18. Assertion............................................................................................... 133

B.19.Formatted Output.......... 134 B.19.1.Format Specifiers............... 134 B.19.2.Limitations......... 135 B.19.3.Associated Host-Side APl.........136 B.19.4.Examples...... .136 B.20.Dynamic Global Memory Allocation and Operations.137 B.20.1.Heap Memory Allocation....................138 B.20.2.Interoperability with Host Memory APl..... .138 B.20.3.Examples.… 138 B.20.3.1.Per Thread Allocation.................... .139 B.20.3.2.Per Thread Block Allocation............... …140 B.20.3.3.Allocation Persisting Between Kernel Launches. 141 B.21.Execution Configuration........................ .142 B.22.Launch Bounds..… .142 B.23.#pragma unrou........ .145 B.24.SIMD Video Instructions................. .145 Appendix C.Cooperative Groups......147 C.1.Introduction.................. ..147 C.2.Intra-block Groups...........148 C.2.1.Thread Groups and Thread Blocks... .148 C.2.2.Tiled Partitions......... ….149 C.2.3.Thread Block Tiles................ 149 C.2.4.Coalesced Groups......... 150 C.2.5.Uses of Intra-block Cooperative Groups....... 150 C.2.5.1.Discovery Pattern................. 150 C.2.5.2.Warp-Synchronous Code Pattern....... .151 C.2.5.3.Composition................... 152 C.3.Grid Synchronization.................... 152 C.4.Multi-Device Synchronization.... 154 Appendix D.CUDA Dynamic Parallelism...... …156 D.1.Introduction..… .156 D.1.1.Overview................. ...156 D.1.2.Glossary..… 156 D.2.Execution Environment and Memory Model. 157 D.2.1.Execution Environment............. 157 D.2.1.1.Parent and Child Grids.............. .157 D.2.1.2.Scope of CUDA Primitives.... 158 D.2.1.3.Synchronization............... …158 D.2.1.4.Streams and Events......... .158 D.2.1.5.Ordering and Concurrency........ .159 D.2.1.6.Device Management............ 159 D.2.2.Memory Model..… 159 D.2.2.1.Coherence and Consistency....... .160 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.21iii
www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | viii B.19. Formatted Output.................................................................................... 134 B.19.1. Format Specifiers............................................................................... 134 B.19.2. Limitations....................................................................................... 135 B.19.3. Associated Host-Side API.......................................................................136 B.19.4. Examples..........................................................................................136 B.20. Dynamic Global Memory Allocation and Operations............................................ 137 B.20.1. Heap Memory Allocation....................................................................... 138 B.20.2. Interoperability with Host Memory API......................................................138 B.20.3. Examples..........................................................................................138 B.20.3.1. Per Thread Allocation.....................................................................139 B.20.3.2. Per Thread Block Allocation............................................................. 140 B.20.3.3. Allocation Persisting Between Kernel Launches...................................... 141 B.21. Execution Configuration.............................................................................142 B.22. Launch Bounds........................................................................................ 142 B.23. #pragma unroll........................................................................................145 B.24. SIMD Video Instructions..............................................................................145 Appendix C. Cooperative Groups.......................................................................... 147 C.1. Introduction.............................................................................................147 C.2. Intra-block Groups.....................................................................................148 C.2.1. Thread Groups and Thread Blocks.............................................................148 C.2.2. Tiled Partitions....................................................................................149 C.2.3. Thread Block Tiles............................................................................... 149 C.2.4. Coalesced Groups................................................................................ 150 C.2.5. Uses of Intra-block Cooperative Groups...................................................... 150 C.2.5.1. Discovery Pattern........................................................................... 150 C.2.5.2. Warp-Synchronous Code Pattern..........................................................151 C.2.5.3. Composition.................................................................................. 152 C.3. Grid Synchronization.................................................................................. 152 C.4. Multi-Device Synchronization........................................................................ 154 Appendix D. CUDA Dynamic Parallelism..................................................................156 D.1. Introduction.............................................................................................156 D.1.1. Overview........................................................................................... 156 D.1.2. Glossary............................................................................................ 156 D.2. Execution Environment and Memory Model....................................................... 157 D.2.1. Execution Environment.......................................................................... 157 D.2.1.1. Parent and Child Grids.....................................................................157 D.2.1.2. Scope of CUDA Primitives................................................................. 158 D.2.1.3. Synchronization..............................................................................158 D.2.1.4. Streams and Events.........................................................................158 D.2.1.5. Ordering and Concurrency.................................................................159 D.2.1.6. Device Management........................................................................ 159 D.2.2. Memory Model.................................................................................... 159 D.2.2.1. Coherence and Consistency............................................................... 160

D.3.Programming Interface. ..162 D.3.1.CUDA C/C++Reference.................. 162 D.3.1.1.Device-Side Kernel Launch.......... , 162 D.3.1.2.Stream5..163 D.3.1.3.Events..… .164 D.3.1.4.Synchronization............................ 164 D.3.1.5.Device Management.................. .164 D.3.1.6.Memory Declarations.................. 165 D.3.1.7.API Errors and Launch Failures............ .166 D.3.1.8.APl Reference................. .167 D.3.2.Device-side Launch from PTX................ .168 D.3.2.1.Kernel Launch APls..................... 168 D.3.2.2.Parameter Buffer Layout..................... 170 D.3.3.Toolkit Support for Dynamic Parallelism..... .170 D.3.3.1.Including Device Runtime API in CUDA Code... 170 D.3.3.2.Compiling and Linking................ .171 D.4.Programming Guidelines.................. .171 D.4.1.Basics........... 171 D.4.2.Performance.................T2 D.4.2.1.Synchronization............... .172 D.4.2.2.Dynamic-parallelism-enabled Kernel Overhead...........................172 D.4.3.Implementation Restrictions and Limitations..... .173 D.4.3.1.Runtime.....4473 Appendix E.Mathematical Functions................ ............ 176 E.1.Standard Functions............ 176 E.2.Intrinsic Functions........................... 184 Appendix F.C/C++Language Support............ 187 F.1.C++11 Language Features............... 187 F.2.C++14 Language Features........ ..190 f.3.Restrictions...… 190 F.3.1.Host Compiler Extensions.... ..190 F.3.2.Preprocessor Symbols............... .191 F3.2.1._CUDA_ARCH._… 191 F3.3.Qualifiers...................... 192 F.3.3.1.Device Memory Space Specifiers... 192 F.3.3.2.managed_Memory Space Specifier. .193 F.3.3.3.Volatile Qualifier.......... .195 f3.4.Pointers.… …196 F.3.5.Operators......... .196 F.3.5.1.Assignment Operator............... …196 F.3.5.2.Address Operator.............. 196 F.3.6.Run Time Type Information (RTTI). 196 F.3.7.Exception Handling............ 196 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.21ix
www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | ix D.3. Programming Interface................................................................................162 D.3.1. CUDA C/C++ Reference..........................................................................162 D.3.1.1. Device-Side Kernel Launch................................................................ 162 D.3.1.2. Streams....................................................................................... 163 D.3.1.3. Events......................................................................................... 164 D.3.1.4. Synchronization..............................................................................164 D.3.1.5. Device Management........................................................................ 164 D.3.1.6. Memory Declarations....................................................................... 165 D.3.1.7. API Errors and Launch Failures........................................................... 166 D.3.1.8. API Reference................................................................................167 D.3.2. Device-side Launch from PTX.................................................................. 168 D.3.2.1. Kernel Launch APIs......................................................................... 168 D.3.2.2. Parameter Buffer Layout.................................................................. 170 D.3.3. Toolkit Support for Dynamic Parallelism......................................................170 D.3.3.1. Including Device Runtime API in CUDA Code........................................... 170 D.3.3.2. Compiling and Linking......................................................................171 D.4. Programming Guidelines.............................................................................. 171 D.4.1. Basics............................................................................................... 171 D.4.2. Performance.......................................................................................172 D.4.2.1. Synchronization..............................................................................172 D.4.2.2. Dynamic-parallelism-enabled Kernel Overhead........................................ 172 D.4.3. Implementation Restrictions and Limitations................................................173 D.4.3.1. Runtime.......................................................................................173 Appendix E. Mathematical Functions..................................................................... 176 E.1. Standard Functions.................................................................................... 176 E.2. Intrinsic Functions..................................................................................... 184 Appendix F. C/C++ Language Support.................................................................... 187 F.1. C++11 Language Features............................................................................. 187 F.2. C++14 Language Features............................................................................. 190 F.3. Restrictions.............................................................................................. 190 F.3.1. Host Compiler Extensions........................................................................190 F.3.2. Preprocessor Symbols.............................................................................191 F.3.2.1. __CUDA_ARCH__............................................................................. 191 F.3.3. Qualifiers........................................................................................... 192 F.3.3.1. Device Memory Space Specifiers.......................................................... 192 F.3.3.2. __managed__ Memory Space Specifier...................................................193 F.3.3.3. Volatile Qualifier.............................................................................195 F.3.4. Pointers............................................................................................. 196 F.3.5. Operators........................................................................................... 196 F.3.5.1. Assignment Operator........................................................................ 196 F.3.5.2. Address Operator............................................................................ 196 F.3.6. Run Time Type Information (RTTI)............................................................. 196 F.3.7. Exception Handling............................................................................... 196

F.3.8.Standard Library...... .196 f3.9.Functi0n5.… 196 F.3.9.1.External Linkage.................... .197 F.3.9.2.Implicitly-declared and explicitly-defaulted functions........................197 F.3.9.3.Function Parameters....... 198 F3.9.4.Static Variables within Function.........198 F3.9.5.Function Pointers.................... ..199 F.3.9.6.Function Recursion............... 199 F.3.9.7.Friend Functions.............. 199 F.3.9.8.Operator Function............. 200 f.3.10.Classes.… .200 F.3.10.1.Data Members........ .200 F.3.10.2.Function Members........... …200 F.3.10.3.Virtual Functions..... 200 F.3.10.4.Virtual Base Classes......... …201 F.3.10.5.Anonymous Unions..... 201 F.3.10.6.Windows-Specific............... 201 F.3.11.Templates................ 202 F.3.12.Trigraphs and Digraphs............... .202 F.3.13.Const-qualified variables.... 203 f3.14.Long Double.… 203 F.3.15.Deprecation Annotation......... 203 F3.16.C++11 Features.......... .204 F.3.16.1.Lambda Expressions................ .204 F3.16.2.std::initializer_list............... 205 F.3.16.3.Rvalue references...................... 206 F.3.16.4.Constexpr functions and function templates.. 206 F.3.16.5.Constexpr variables...................... 206 F.3.16.6.Inline namespaces................... .207 f.3.16.7.thread_local.............. ....208 F.3.16.8.global functions and function templates.... 208 F.3.16.9.device/constant/sharedvariables......................................210 F.3.16.10.Defaulted functions................. 210 F3.17.C++14 Features........... 211 F.3.17.1.Functions with deduced return type... 211 F3.17.2.Variable templates.......................... 212 F.4.Polymorphic Function Wrappers............ 212 F.5.Experimental Feature:Extended Lambdas....... .216 F.5.1.Extended Lambda Type Traits............ .217 F.5.2.Extended Lambda Restrictions................ 218 F.5.3.Notes onhost_devicelambdas..... 226 F.5.4.*this Capture By Value.................. 227 F.5.5.Additional Notes................... ...229 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2|×
www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | x F.3.8. Standard Library...................................................................................196 F.3.9. Functions........................................................................................... 196 F.3.9.1. External Linkage............................................................................. 197 F.3.9.2. Implicitly-declared and explicitly-defaulted functions................................ 197 F.3.9.3. Function Parameters........................................................................ 198 F.3.9.4. Static Variables within Function.......................................................... 198 F.3.9.5. Function Pointers............................................................................ 199 F.3.9.6. Function Recursion.......................................................................... 199 F.3.9.7. Friend Functions............................................................................. 199 F.3.9.8. Operator Function........................................................................... 200 F.3.10. Classes............................................................................................. 200 F.3.10.1. Data Members...............................................................................200 F.3.10.2. Function Members..........................................................................200 F.3.10.3. Virtual Functions........................................................................... 200 F.3.10.4. Virtual Base Classes........................................................................201 F.3.10.5. Anonymous Unions......................................................................... 201 F.3.10.6. Windows-Specific........................................................................... 201 F.3.11. Templates......................................................................................... 202 F.3.12. Trigraphs and Digraphs..........................................................................202 F.3.13. Const-qualified variables....................................................................... 203 F.3.14. Long Double...................................................................................... 203 F.3.15. Deprecation Annotation........................................................................ 203 F.3.16. C++11 Features...................................................................................204 F.3.16.1. Lambda Expressions........................................................................204 F.3.16.2. std::initializer_list..........................................................................205 F.3.16.3. Rvalue references.......................................................................... 206 F.3.16.4. Constexpr functions and function templates.......................................... 206 F.3.16.5. Constexpr variables........................................................................ 206 F.3.16.6. Inline namespaces..........................................................................207 F.3.16.7. thread_local................................................................................. 208 F.3.16.8. __global__ functions and function templates......................................... 208 F.3.16.9. __device__/__constant__/__shared__ variables...................................... 210 F.3.16.10. Defaulted functions.......................................................................210 F.3.17. C++14 Features...................................................................................211 F.3.17.1. Functions with deduced return type....................................................211 F.3.17.2. Variable templates......................................................................... 212 F.4. Polymorphic Function Wrappers..................................................................... 212 F.5. Experimental Feature: Extended Lambdas.........................................................216 F.5.1. Extended Lambda Type Traits...................................................................217 F.5.2. Extended Lambda Restrictions..................................................................218 F.5.3. Notes on __host__ __device__ lambdas.......................................................226 F.5.4. *this Capture By Value........................................................................... 227 F.5.5. Additional Notes...................................................................................229
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
- 《并行与分布式程序设计》课程教学参考书:CUDA C PROGRAMMING(CUDA编程指南4.0中文版).pdf
- 上海交通大学:《ROS机器人操作系统基础与实战》课程教学资源(实验指导书)09 机械臂规划Moveit.doc
- 上海交通大学:《ROS机器人操作系统基础与实战》课程教学资源(实验指导书)08 机器人导航包.doc
- 上海交通大学:《ROS机器人操作系统基础与实战》课程教学资源(实验指导书)07 机器人建模与仿真.doc
- 上海交通大学:《ROS机器人操作系统基础与实战》课程教学资源(实验指导书)06 机器视觉.doc
- 上海交通大学:《ROS机器人操作系统基础与实战》课程教学资源(实验指导书)05 外部设备的使用.doc
- 上海交通大学:《ROS机器人操作系统基础与实战》课程教学资源(实验指导书)04 调试和可视化.doc
- 上海交通大学:《ROS机器人操作系统基础与实战》课程教学资源(实验指导书)03 ROS基本元素实验(二).doc
- 上海交通大学:《ROS机器人操作系统基础与实战》课程教学资源(实验指导书)02 ROS基本元素实验(一).doc
- 上海交通大学:《ROS机器人操作系统基础与实战》课程教学资源(实验指导书)01 ROS系统安装.doc
- 《C程序与算法设计》课程教学资源(学习资料)快乐的Linux命令行.pdf
- 上海交通大学:《C程序与算法设计》课程教学资源(课件讲稿)Lecture 04 C语言运算符与表达式.pdf
- 上海交通大学:《C程序与算法设计》课程教学资源(课件讲稿)Lecture 02 Divide and Conquer.pptx
- 上海交通大学:《C程序与算法设计》课程教学资源(课件讲稿)Lecture 01 Greedy and Dynamic Programming.pptx
- 上海交通大学:《C程序与算法设计》课程教学资源(课件讲稿)Lecture 01 算法设计与分析基础.pdf
- 上海交通大学:《C程序与算法设计》课程教学资源(课件讲稿)Lecture 09 C程序组织.pdf
- 上海交通大学:《C程序与算法设计》课程教学资源(课件讲稿)Lecture 08 指针.pdf
- 上海交通大学:《C程序与算法设计》课程教学资源(课件讲稿)Lecture 07 函数.pdf
- 上海交通大学:《C程序与算法设计》课程教学资源(课件讲稿)Lecture 06 C语言数组.pdf
- 上海交通大学:《C程序与算法设计》课程教学资源(课件讲稿)Lecture 05 C语言语句.pdf
- 《并行与分布式程序设计》课程教学参考书:NVIDIA《CUDA C Programming》(Professional).pdf
- 《并行与分布式程序设计》课程教学参考书:CUDA《Programming Massively Parallel Processors》A Hands-on Approach(美,David B. Kirk and Wen-mei W. Hwu,英文版).pdf
- 《并行与分布式程序设计》课程教学参考书:CUDA《Programming Massively Parellel Processors》大规模并行处理器编程实战(美)David B.Kirk&Wen-mei W.Hwu(中文版).pdf
- 《并行与分布式程序设计》课程教学参考书:分布式与云计算(美)Tom White《Hadoop权威指南》(中文第3版).pdf
- 《并行与分布式程序设计》课程教学参考书:分布式与云计算《Spark大数据处理技术、应用与性能优化》(PDF扫描版).pdf
- 《并行与分布式程序设计》课程教学参考书:并行与并发编程《An Introduction to Parallel Programming》.pdf
- 《并行与分布式程序设计》课程教学参考书:并行与并发编程《C++ Concurrency in Action - Practical Multithreading》(Manning, 2012).pdf
- 《并行与分布式程序设计》课程教学参考书:并行与并发编程《Introduction to Parallel Computing》Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar(Second Edition).pdf
- 《并行与分布式程序设计》课程教学参考书:并行与并发编程《Java Concurrency In Practice》.pdf
- 《并行与分布式程序设计》课程教学参考书:并行与并发编程《JAVA并发编程实践》JAVA CONCURRENCY IN PRACTICE(中文完整版).pdf
- 《数据科学引论——Python之道》课程教学资源(课件讲稿)00 Python环境搭建.pdf
- 《数据科学引论——Python之道》课程教学资源(课件讲稿)01 什么是数据科学.pdf
- 《数据科学引论——Python之道》课程教学资源(课件讲稿)02 数据科学的应用(1/2).pdf
- 《数据科学引论——Python之道》课程教学资源(课件讲稿)03 数据科学的应用(2/2).pdf
- 《数据科学引论——Python之道》课程教学资源(课件讲稿)04 数据分析入门.pdf
- 《数据科学引论——Python之道》课程教学资源(课件讲稿)05 爬虫环境搭建.pdf
- 《数据科学引论——Python之道》课程教学资源(课件讲稿)05 网络爬虫介绍和样例.pdf
- 《数据科学引论——Python之道》课程教学资源(课件讲稿)06 统计初步.pdf
- 《数据科学引论——Python之道》课程教学资源(课件讲稿)07 数据科学方法学(1/2).pdf
- 《数据科学引论——Python之道》课程教学资源(课件讲稿)07 数据科学方法学(2/2).pdf