《GPU并行编程 GPU Parallel Programming》课程教学资源(参考文献)NVIDIA CUDA C Programming Guide(Design Guide,June 2017)

nVIDIA. CUDA C PROGRAMMING GUIDE PG-02829-001v8.01June2017 Design Guide
CUDA C PROGRAMMING GUIDE PG-02829-001_v8.0 | June 2017 Design Guide

CHANGES FROM VERSION 7.5 Updates to add compute capabilities 6.0,6.1 and 6.2,including: Updated Table 13 to mention support of 64-bit floating point atomicAdd on devices of compute capabilities 6.x. Added compute capabilities 6.0,6.1,and 6.2 to Table 14. Added atomicAdd()support of 64-bit floating point for compute capability 6.x to atomicAdd(). Added scoped atomics support for compute capability 6.x to Atomic Functions. Updated section Unified Memory Programming: Documented new features and behavior of architectures with compute capabilities 6.x. Added new section Performance Tuning. www.nvidia.com CUDA C Programming Guide PG-02829-001v8.01i
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | ii CHANGES FROM VERSION 7.5 ‣ Updates to add compute capabilities 6.0, 6.1 and 6.2, including: ‣ Updated Table 13 to mention support of 64-bit floating point atomicAdd on devices of compute capabilities 6.x. ‣ Added compute capabilities 6.0, 6.1, and 6.2 to Table 14. ‣ Added atomicAdd() support of 64-bit floating point for compute capability 6.x to atomicAdd(). ‣ Added scoped atomics support for compute capability 6.x to Atomic Functions. ‣ Updated section Unified Memory Programming: ‣ Documented new features and behavior of architectures with compute capabilities 6.x. ‣ Added new section Performance Tuning

TABLE OF CONTENTS Chapter 1.Introduction..1 1.1.From Graphics Processing to General Purpose Parallel Computing...............................1 1.2.CUDA:A General-Purpose Parallel Computing Platform and Programming Model.............3 1.3.A Scalable Programming Model.....................................4 1.4.Document Structure....................... …6 8 Chapter 2.Programming Model................ 2.1.Kernels..… .8 2.2.Thread Hierarchy........................... 9 2.3.Memory Hierarchy................... 11 2.4.Heterogeneous Programming.................. 3 2.5.Compute Capability.............. 15 Chapter 3.Programming Interface........ …16 3.1.Compilation with NVCC............ 16 3.1.1.Compilation Workflow.... 17 3.1.1.1.Offline Compilation....... 17 3.1.1.2.Just-in-Time Compilation.. .17 3.1.2.Binary Compatibility............. .17 3.1.3.PTX Compatibility............... 18 3.1.4.Application Compatibility........... 18 3.1.5.C/C++Compatibility...... .19 3.1.6.64-Bit Compatibility................ 19 3.2.cUDA C Runtime.............. .19 3.2.1.Initiatization................. …20 3.2.2.Device Memory.… 20 3.2.3.Shared Memory.................. .23 3.2.4.Page-Locked Host Memory..................... 28 3.2.4.1.Portable Memory.................... …29 3.2.4.2.Write-combining Memory.........29 3.2.4.3.Mapped Memory.................. 29 3.2.5.Asynchronous Concurrent Execution............... .30 3.2.5.1.Concurrent Execution between Host and Device..... .31 3.2.5.2.Concurrent Kernel Execution....................... 31 3.2.5.3.Overlap of Data Transfer and Kernel Execution.... ,31 3.2.5.4.Concurrent Data Transfers...................... 32 3.2.5.5.Streams.… .32 3.2.5.6.Events.… 36 3.2.5.7.Synchronous Calls.... .36 3.2.6.Multi-Device System...................... .37 3.2.6.1.Device Enumeration...... .37 3.2.6.2.Device Selection......................37 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0|ii
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | iii TABLE OF CONTENTS Chapter 1. Introduction.........................................................................................1 1.1. From Graphics Processing to General Purpose Parallel Computing............................... 1 1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model.............3 1.3. A Scalable Programming Model.........................................................................4 1.4. Document Structure...................................................................................... 6 Chapter 2. Programming Model............................................................................... 8 2.1. Kernels......................................................................................................8 2.2. Thread Hierarchy......................................................................................... 9 2.3. Memory Hierarchy....................................................................................... 11 2.4. Heterogeneous Programming.......................................................................... 13 2.5. Compute Capability..................................................................................... 15 Chapter 3. Programming Interface..........................................................................16 3.1. Compilation with NVCC................................................................................ 16 3.1.1. Compilation Workflow.............................................................................17 3.1.1.1. Offline Compilation.......................................................................... 17 3.1.1.2. Just-in-Time Compilation....................................................................17 3.1.2. Binary Compatibility...............................................................................17 3.1.3. PTX Compatibility..................................................................................18 3.1.4. Application Compatibility.........................................................................18 3.1.5. C/C++ Compatibility............................................................................... 19 3.1.6. 64-Bit Compatibility............................................................................... 19 3.2. CUDA C Runtime.........................................................................................19 3.2.1. Initialization.........................................................................................20 3.2.2. Device Memory..................................................................................... 20 3.2.3. Shared Memory..................................................................................... 23 3.2.4. Page-Locked Host Memory........................................................................28 3.2.4.1. Portable Memory..............................................................................29 3.2.4.2. Write-Combining Memory....................................................................29 3.2.4.3. Mapped Memory...............................................................................29 3.2.5. Asynchronous Concurrent Execution............................................................ 30 3.2.5.1. Concurrent Execution between Host and Device........................................31 3.2.5.2. Concurrent Kernel Execution............................................................... 31 3.2.5.3. Overlap of Data Transfer and Kernel Execution......................................... 31 3.2.5.4. Concurrent Data Transfers.................................................................. 32 3.2.5.5. Streams.........................................................................................32 3.2.5.6. Events...........................................................................................36 3.2.5.7. Synchronous Calls.............................................................................36 3.2.6. Multi-Device System............................................................................... 37 3.2.6.1. Device Enumeration.......................................................................... 37 3.2.6.2. Device Selection.............................................................................. 37

3.2.6.3.Stream and Event Behavior..... .37 3.2.6.4.Peer-to-Peer Memory Access.....38 3.2.6.5.Peer-to-Peer Memory Copy............ 38 3.2.7.Unified Virtual Address Space.....39 3.2.8.Interprocess Communication...... 40 3.2.9.Error checking..................... ……40 3.2.10.call Stack............. .41 3.2.11.Texture and Surface Memory......... 41 3.2.11.1.Texture Memory................ .41 3.2.11.2.Surface Memory.............. 51 3.2.11.3.CUDA Arrays................. .55 3.2.11.4.Read/Write Coherency..... 55 3.2.12.Graphics Interoperability.......... .55 3.2.12.1.OpenGL Interoperability.... 56 3.2.12.2.Direct3D Interoperability...... .58 3.2.12.3.SLI Interoperability....... .64 3.3.Versioning and Compatibility................ 65 3.4.Compute Modes.................... 66 3.5.Mode Switches..67 3.6.Tesla Compute Cluster Mode for Windows... .67 Chapter 4.Hardware Implementation.................. .68 4.1.SIMT Architecture................. 68 4.2.Hardware Multithreading............... .70 Chapter 5.Performance Guidelines.................. .71 5.1.Overall Performance Optimization Strategies.. 71 5.2.Maximize Utilization.......................... 71 5.2.1.Application Level.................... .71 5.2.2.Device Level..… .72 5.2.3.Multiprocessor Level...... 72 5.2.3.1.Occupancy Calculator............ .74 5.3.Maximize Memory Throughput...... 76 5.3.1.Data Transfer between Host and Device .77 5.3.2.Device Memory Accesses.......... 78 5.4.Maximize Instruction Throughput........... .82 5.4.1.Arithmetic Instructions...... .82 5.4.2.Control Flow Instructions.............. .86 5.4.3.Synchronization Instruction... .87 Appendix A.CUDA-Enabled GPUs......... …88 Appendix B.C Language Extensions.. 89 B.1.Function Type Qualifiers........... …89 B.1.1.device......... 89 B.1.2.global .89 B.1.3.ho5t..... .89 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.01iv
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | iv 3.2.6.3. Stream and Event Behavior................................................................. 37 3.2.6.4. Peer-to-Peer Memory Access................................................................38 3.2.6.5. Peer-to-Peer Memory Copy..................................................................38 3.2.7. Unified Virtual Address Space................................................................... 39 3.2.8. Interprocess Communication..................................................................... 40 3.2.9. Error Checking......................................................................................40 3.2.10. Call Stack.......................................................................................... 41 3.2.11. Texture and Surface Memory................................................................... 41 3.2.11.1. Texture Memory............................................................................. 41 3.2.11.2. Surface Memory............................................................................. 51 3.2.11.3. CUDA Arrays..................................................................................55 3.2.11.4. Read/Write Coherency..................................................................... 55 3.2.12. Graphics Interoperability........................................................................55 3.2.12.1. OpenGL Interoperability................................................................... 56 3.2.12.2. Direct3D Interoperability...................................................................58 3.2.12.3. SLI Interoperability..........................................................................64 3.3. Versioning and Compatibility.......................................................................... 65 3.4. Compute Modes..........................................................................................66 3.5. Mode Switches........................................................................................... 67 3.6. Tesla Compute Cluster Mode for Windows.......................................................... 67 Chapter 4. Hardware Implementation......................................................................68 4.1. SIMT Architecture....................................................................................... 68 4.2. Hardware Multithreading...............................................................................70 Chapter 5. Performance Guidelines........................................................................ 71 5.1. Overall Performance Optimization Strategies...................................................... 71 5.2. Maximize Utilization.................................................................................... 71 5.2.1. Application Level...................................................................................71 5.2.2. Device Level........................................................................................ 72 5.2.3. Multiprocessor Level...............................................................................72 5.2.3.1. Occupancy Calculator........................................................................ 74 5.3. Maximize Memory Throughput........................................................................ 76 5.3.1. Data Transfer between Host and Device....................................................... 77 5.3.2. Device Memory Accesses..........................................................................78 5.4. Maximize Instruction Throughput.....................................................................82 5.4.1. Arithmetic Instructions............................................................................82 5.4.2. Control Flow Instructions......................................................................... 86 5.4.3. Synchronization Instruction.......................................................................87 Appendix A. CUDA-Enabled GPUs........................................................................... 88 Appendix B. C Language Extensions........................................................................89 B.1. Function Type Qualifiers............................................................................... 89 B.1.1. __device__.......................................................................................... 89 B.1.2. __global__...........................................................................................89 B.1.3. __host__............................................................................................. 89

B.1.4.noinline and forceinline. 90 B.2.Variable Type Quatifiers......................... 90 B.2.1.device._… 90 B.2.2.constant_ 91 B.2.3.shared .91 B.2.4.managed.… .92 B.2.5.restrict........... .92 B.3.Built-in Vector Types............. 93 B.3.1.char,short,int,long,longlong,float,double .93 B.3.2.dim3. 94 B.4.Built-in Variables....................... 95 B.4.1.gridDim..… 95 B.4.2.blockldx........... .95 B.4.3.blockDim.… .95 B.4.4.threadldx................... .95 B.4.5.warpSize.… 95 B.5.Memory Fence Functions................. 95 B.6.Synchronization Functions.... 98 B.7.Mathematical Functions................ 99 B.8.Texture Functions.......... .99 B.8.1.Texture object APl.......... …100 B.8.1.1.tex1Dfetch()...... ..100 B.8.1.2.tex1D0. 100 B.8.1.3.tex1DLod0.… .100 B.8.1.4.tex1DGrad()........... 100 B.8.1.5.tex2D0.. 100 B.8.1.6.tex2DLod()....... ..100 B.8.1.7.tex2DGrad()........ .101 B.8.1.8.tex3D0. 101 B.8.1.9.tex3DLod()........... ..101 B.8.1.10.tex3DGrad().... ,101 B.8.1.11.tex1DLayered()........ .101 B.8.1.12.tex1DLayeredLod().. 101 B.8.1.13.tex1DLayeredGrad(). … 102 B.8.1.14.tex2DLayered()...... 102 B.8.1.15.tex2DLayeredLod()......... .102 B.8.1.16.tex2DLayeredGrad()... .102 B.8.1.17.texCubemap()............ .102 B.8.1.18.texCubemapLod()...... .102 B.8.1.19.texCubemapLayered()...... .103 B.8.1.20.texCubemapLayeredLod(). 103 B.8.1.21.tex2Dgather()............. .103 B.8.2.Texture Reference APl......... 104 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.01V
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | v B.1.4. __noinline__ and __forceinline__............................................................... 90 B.2. Variable Type Qualifiers................................................................................90 B.2.1. __device__.......................................................................................... 90 B.2.2. __constant__........................................................................................91 B.2.3. __shared__.......................................................................................... 91 B.2.4. __managed__....................................................................................... 92 B.2.5. __restrict__......................................................................................... 92 B.3. Built-in Vector Types................................................................................... 93 B.3.1. char, short, int, long, longlong, float, double................................................ 93 B.3.2. dim3..................................................................................................94 B.4. Built-in Variables........................................................................................ 95 B.4.1. gridDim.............................................................................................. 95 B.4.2. blockIdx..............................................................................................95 B.4.3. blockDim.............................................................................................95 B.4.4. threadIdx............................................................................................ 95 B.4.5. warpSize............................................................................................. 95 B.5. Memory Fence Functions...............................................................................95 B.6. Synchronization Functions............................................................................. 98 B.7. Mathematical Functions................................................................................ 99 B.8. Texture Functions....................................................................................... 99 B.8.1. Texture Object API...............................................................................100 B.8.1.1. tex1Dfetch()..................................................................................100 B.8.1.2. tex1D()........................................................................................ 100 B.8.1.3. tex1DLod()....................................................................................100 B.8.1.4. tex1DGrad().................................................................................. 100 B.8.1.5. tex2D()........................................................................................ 100 B.8.1.6. tex2DLod()....................................................................................100 B.8.1.7. tex2DGrad().................................................................................. 101 B.8.1.8. tex3D()........................................................................................ 101 B.8.1.9. tex3DLod()....................................................................................101 B.8.1.10. tex3DGrad().................................................................................101 B.8.1.11. tex1DLayered()............................................................................. 101 B.8.1.12. tex1DLayeredLod().........................................................................101 B.8.1.13. tex1DLayeredGrad()....................................................................... 102 B.8.1.14. tex2DLayered()............................................................................. 102 B.8.1.15. tex2DLayeredLod().........................................................................102 B.8.1.16. tex2DLayeredGrad()....................................................................... 102 B.8.1.17. texCubemap().............................................................................. 102 B.8.1.18. texCubemapLod().......................................................................... 102 B.8.1.19. texCubemapLayered().....................................................................103 B.8.1.20. texCubemapLayeredLod()................................................................ 103 B.8.1.21. tex2Dgather()...............................................................................103 B.8.2. Texture Reference API...........................................................................104

B.8.2.1.tex1Dfetch(). .104 B.8.2.2.tex1D0 104 B.8.2.3.tex1DLod(0.… .105 B.8.2.4.tex1DGrad()............... 105 B.8.2.5.tex2D0. 105 B.8.2.6.tex2DLod0.… .105 B.8.2.7.tex2DGrad().......... ...105 B.8.2.8.tex3D0.… …106 B.8.2.9.tex3DLod0… .106 B.8.2.10.tex3DGrad()............ .106 B.8.2.11.tex1DLayered()......... .......106 B.8.2.12.tex1DLayeredLod().. .107 B.8.2.13.tex1DLayeredGrad()..... .107 B.8.2.14.tex2DLayered()...... ,107 B.8.2.15.tex2DLayeredLod()...... .107 B.8.2.16.tex2DLayeredGrad()... 108 B.8.2.17.texCubemap().............. 108 B.8.2.18.texCubemapLod()..... 108 B.8.2.19.texCubemapLayered()............ 108 B.8.2.20.texCubemapLayeredLod(). …108 B.8.2.21.tex2Dgather()............... .109 B.9.Surface Functions............ ..109 B.9.1.Surface object APl............... 109 B.9.1.1.surf1Dread()................ .109 B.9.1.2.surf1Dwrite............. .109 B.9.1.3.surf2Dread()............. .110 B.9.1.4.surf2Dwrite()............ ,110 B.9.1.5.surf3Dread()............ .110 B.9.1.6.surf3Dwrite().... .110 B.9.1.7.surf1DLayeredread().... .111 B.9.1.8.surf1DLayeredwrite()... 111 B.9.1.9.surf2DLayeredread()..... 111 B.9.1.10.surf2DLayeredwrite() 111 B.9.1.11.surfCubemapread()......... .112 B.9.1.12.surfCubemapwrite().... 112 B.9.1.13.surfCubemapLayeredread() .112 B.9.1.14.surfCubemapLayeredwrite() ..112 B.9.2.Surface Reference APl.... .113 B.9.2.1.surf1Dread().......... ..113 B.9.2.2.surf1Dwrite............. …113 B.9.2.3.surf2Dread().......... .113 B.9.2.4.surf2Dwrite()....... .113 B.9.2.5.surf3Dread()....... ...114 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.01
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | vi B.8.2.1. tex1Dfetch()..................................................................................104 B.8.2.2. tex1D()........................................................................................ 104 B.8.2.3. tex1DLod()....................................................................................105 B.8.2.4. tex1DGrad().................................................................................. 105 B.8.2.5. tex2D()........................................................................................ 105 B.8.2.6. tex2DLod()....................................................................................105 B.8.2.7. tex2DGrad().................................................................................. 105 B.8.2.8. tex3D()........................................................................................ 106 B.8.2.9. tex3DLod()....................................................................................106 B.8.2.10. tex3DGrad().................................................................................106 B.8.2.11. tex1DLayered()............................................................................. 106 B.8.2.12. tex1DLayeredLod().........................................................................107 B.8.2.13. tex1DLayeredGrad()....................................................................... 107 B.8.2.14. tex2DLayered()............................................................................. 107 B.8.2.15. tex2DLayeredLod().........................................................................107 B.8.2.16. tex2DLayeredGrad()....................................................................... 108 B.8.2.17. texCubemap().............................................................................. 108 B.8.2.18. texCubemapLod().......................................................................... 108 B.8.2.19. texCubemapLayered().....................................................................108 B.8.2.20. texCubemapLayeredLod()................................................................ 108 B.8.2.21. tex2Dgather()...............................................................................109 B.9. Surface Functions...................................................................................... 109 B.9.1. Surface Object API............................................................................... 109 B.9.1.1. surf1Dread()..................................................................................109 B.9.1.2. surf1Dwrite................................................................................... 109 B.9.1.3. surf2Dread()..................................................................................110 B.9.1.4. surf2Dwrite()................................................................................. 110 B.9.1.5. surf3Dread()..................................................................................110 B.9.1.6. surf3Dwrite()................................................................................. 110 B.9.1.7. surf1DLayeredread()........................................................................ 111 B.9.1.8. surf1DLayeredwrite()....................................................................... 111 B.9.1.9. surf2DLayeredread()........................................................................ 111 B.9.1.10. surf2DLayeredwrite()......................................................................111 B.9.1.11. surfCubemapread()........................................................................ 112 B.9.1.12. surfCubemapwrite()....................................................................... 112 B.9.1.13. surfCubemapLayeredread()...............................................................112 B.9.1.14. surfCubemapLayeredwrite()..............................................................112 B.9.2. Surface Reference API........................................................................... 113 B.9.2.1. surf1Dread()..................................................................................113 B.9.2.2. surf1Dwrite................................................................................... 113 B.9.2.3. surf2Dread()..................................................................................113 B.9.2.4. surf2Dwrite()................................................................................. 113 B.9.2.5. surf3Dread()..................................................................................114

B.9.2.6.surf3Dwrite()..... ,114 B.9.2.7.surf1DLayeredread().............. 114 B.9.2.8.surf1DLayeredwrite().... .114 B.9.2.9.surf2DLayeredread()............ .115 B.9.2.10.surf2DLayeredwrite(). .115 B.9.2.11.surfCubemapread()............ .115 B.9.2.12.surfCubemapwrite().......... .…115 B.9.2.13.surfCubemapLayeredread().. .116 B.9.2.14.surfCubemapLayeredwrite(). .116 B.10.Read-Only Data Cache Load Function. .116 B.11.Time Function.......................... ..116 B.12.Atomic Functions........... ...117 B.12.1.Arithmetic Functions......... 118 B.12.1.1.atomicAdd().... ..118 B.12.1.2.atomicSub()............ .118 B.12.1.3.atomicExch()... .119 B.12.1.4.atomicMin()............. .119 B.12.1.5.atomicMax()... .119 B.12.1.6.atomiclnc().............. .119 B.12.1.7.atomicDec()... .120 B.12.1.8.atomiccAS().......... ….120 B.12.2.Bitwise Functions.... 120 B.12.2.1.atomicAnd().......... .120 B.12.2.2.atomic0r0.. 120 B.12.2.3.atomicXor()............ 121 B.13.Warp Vote Functions............ 121 B.14.Warp Shuffle Functions.............. 122 B.14.1.Synopsis..… .122 B.14.2.Description.... 122 B.14.3.Return Value.............. 123 B.14.4.Notes....................... 123 B.14.5.Example5........ .124 B.14.5.1.Broadcast of a single value across a warp........ 124 B.14.5.2.Inclusive plus-scan across sub-partitions of 8 threads.............................124 B.14.5.3.Reduction across a warp.............. 125 B.15.Profiler Counter Function.............................. .125 B.16.Assertion.................. 125 B.17.Formatted Output.............. .126 B.17.1.Format Specifiers............ 127 B.17.2.Limitations........ 127 B.17.3.Associated Host-Side APl.................... .128 B.17.4.Example5..129 B.18.Dynamic Global Memory Allocation and Operations........................130 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0|vi
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | vii B.9.2.6. surf3Dwrite()................................................................................. 114 B.9.2.7. surf1DLayeredread()........................................................................ 114 B.9.2.8. surf1DLayeredwrite()....................................................................... 114 B.9.2.9. surf2DLayeredread()........................................................................ 115 B.9.2.10. surf2DLayeredwrite()......................................................................115 B.9.2.11. surfCubemapread()........................................................................ 115 B.9.2.12. surfCubemapwrite()....................................................................... 115 B.9.2.13. surfCubemapLayeredread()...............................................................116 B.9.2.14. surfCubemapLayeredwrite()..............................................................116 B.10. Read-Only Data Cache Load Function.............................................................116 B.11. Time Function.........................................................................................116 B.12. Atomic Functions..................................................................................... 117 B.12.1. Arithmetic Functions........................................................................... 118 B.12.1.1. atomicAdd().................................................................................118 B.12.1.2. atomicSub()................................................................................. 118 B.12.1.3. atomicExch()................................................................................119 B.12.1.4. atomicMin()................................................................................. 119 B.12.1.5. atomicMax().................................................................................119 B.12.1.6. atomicInc()..................................................................................119 B.12.1.7. atomicDec().................................................................................120 B.12.1.8. atomicCAS().................................................................................120 B.12.2. Bitwise Functions............................................................................... 120 B.12.2.1. atomicAnd().................................................................................120 B.12.2.2. atomicOr().................................................................................. 120 B.12.2.3. atomicXor()................................................................................. 121 B.13. Warp Vote Functions................................................................................. 121 B.14. Warp Shuffle Functions..............................................................................122 B.14.1. Synopsis........................................................................................... 122 B.14.2. Description....................................................................................... 122 B.14.3. Return Value..................................................................................... 123 B.14.4. Notes.............................................................................................. 123 B.14.5. Examples..........................................................................................124 B.14.5.1. Broadcast of a single value across a warp............................................ 124 B.14.5.2. Inclusive plus-scan across sub-partitions of 8 threads............................... 124 B.14.5.3. Reduction across a warp................................................................. 125 B.15. Profiler Counter Function........................................................................... 125 B.16. Assertion............................................................................................... 125 B.17. Formatted Output.................................................................................... 126 B.17.1. Format Specifiers............................................................................... 127 B.17.2. Limitations....................................................................................... 127 B.17.3. Associated Host-Side API.......................................................................128 B.17.4. Examples..........................................................................................129 B.18. Dynamic Global Memory Allocation and Operations............................................ 130

B.18.1.Heap Memory Allocation...... 130 B.18.2.Interoperability with Host Memory APl.......................131 B.18.3.Examples..... .131 B.18.3.1.Per Thread Allocation.........131 B.18.3.2.Per Thread Block Allocation........ …132 B.18.3.3.Allocation Persisting Between Kernel Launches........................133 B.19.Execution Configuration.........134 B.20.Launch Bounds............. …134 B.21.#pragma unroul............. .137 B.22.SIMD Video Instructions...................... .137 Appendix C.CUDA Dynamic Parallelism.............. …139 C.1.Introduction..… .139 C.1.1.0 verview...... .139 C.1.2.Glossary.… 139 C.2.Execution Environment and Memory Model. 140 C.2.1.Execution Environment............. ..140 C.2.1.1.Parent and Child Grids............... .140 C.2.1.2.Scope of CUDA Primitives..... .141 C.2.1.3.Synchronization........ .141 C.2.1.4.Streams and Events......... .141 C.2.1.5.Ordering and Concurrency....... .142 C.2.1.6.Device Management....... .142 C.2.2.Memory Model...… 142 C.2.2.1.Coherence and Consistency.......... .143 C.3.Programming Interface................... .145 C.3.1.CUDA C/C++Reference............... ...145 C.3.1.1.Device-Side Kernel Launch... 145 C.3.1.2.Streams..… .146 C.3.1.3.Events..… ...147 C.3.1.4.Synchronization.............. .147 C.3.1.5.Device Management............. 147 C.3.1.6.Memory Declarations............... …148 C.3.1.7.API Errors and Launch Failures..... 149 C.3.1.8.API Reference........ .150 C.3.2.Device-side Launch from PTX............. 151 C.3.2.1.Kernel Launch APIs................... 151 C.3.2.2.Parameter Buffer Layout.............. 153 C.3.3.Toolkit Support for Dynamic Parallelism.......... …153 C.3.3.1.Including Device Runtime API in CUDA Code. .153 C.3.3.2.Compiling and Linking.............. 154 C.4.Programming Guidelines................ 154 C.4.1.Basic5.… …….154 C.4.2.Performance............. ......155 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.01iii
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | viii B.18.1. Heap Memory Allocation....................................................................... 130 B.18.2. Interoperability with Host Memory API......................................................131 B.18.3. Examples..........................................................................................131 B.18.3.1. Per Thread Allocation.....................................................................131 B.18.3.2. Per Thread Block Allocation............................................................. 132 B.18.3.3. Allocation Persisting Between Kernel Launches...................................... 133 B.19. Execution Configuration.............................................................................134 B.20. Launch Bounds........................................................................................ 134 B.21. #pragma unroll........................................................................................137 B.22. SIMD Video Instructions..............................................................................137 Appendix C. CUDA Dynamic Parallelism.................................................................. 139 C.1. Introduction.............................................................................................139 C.1.1. Overview........................................................................................... 139 C.1.2. Glossary............................................................................................ 139 C.2. Execution Environment and Memory Model....................................................... 140 C.2.1. Execution Environment.......................................................................... 140 C.2.1.1. Parent and Child Grids..................................................................... 140 C.2.1.2. Scope of CUDA Primitives..................................................................141 C.2.1.3. Synchronization..............................................................................141 C.2.1.4. Streams and Events.........................................................................141 C.2.1.5. Ordering and Concurrency.................................................................142 C.2.1.6. Device Management........................................................................ 142 C.2.2. Memory Model.................................................................................... 142 C.2.2.1. Coherence and Consistency............................................................... 143 C.3. Programming Interface................................................................................145 C.3.1. CUDA C/C++ Reference..........................................................................145 C.3.1.1. Device-Side Kernel Launch................................................................ 145 C.3.1.2. Streams....................................................................................... 146 C.3.1.3. Events......................................................................................... 147 C.3.1.4. Synchronization..............................................................................147 C.3.1.5. Device Management........................................................................ 147 C.3.1.6. Memory Declarations....................................................................... 148 C.3.1.7. API Errors and Launch Failures........................................................... 149 C.3.1.8. API Reference................................................................................150 C.3.2. Device-side Launch from PTX.................................................................. 151 C.3.2.1. Kernel Launch APIs......................................................................... 151 C.3.2.2. Parameter Buffer Layout.................................................................. 153 C.3.3. Toolkit Support for Dynamic Parallelism......................................................153 C.3.3.1. Including Device Runtime API in CUDA Code........................................... 153 C.3.3.2. Compiling and Linking......................................................................154 C.4. Programming Guidelines.............................................................................. 154 C.4.1. Basics............................................................................................... 154 C.4.2. Performance.......................................................................................155

C.4.2.1.Synchronization....... 155 C.4.2.2.Dynamic-parallelism-enabled Kernel Overhead.......................... 155 C.4.3.Implementation Restrictions and Limitations.................................156 C.4.3.1.Runtime..156 Appendix D.Mathematical Functions............. .159 D.1.Standard Functions............................. ….159 D.2.Intrinsic Functions................... ..167 Appendix E.C/C++Language Support............. 170 E.1.C++11 Language Features................. 170 E.2.C++14 Language Features................. 173 E.3.Restrictions..… .173 E.3.1.Host Compiler Extensions..... .173 E.3.2.Preprocessor Symbols............. .174 E.3.2.1._CUDA_ARCH.… ..174 E.3.3.Qualifiers............. .175 E.3.3.1.Device Memory Qualifiers.... .175 E.3.3.2.managed_Qualifier.............. .176 E.3.3.3.Volatile Qualifier......... 178 E.3.4.Pointers.… .179 E.3.5.0 perators.… .179 E.3.5.1.Assignment Operator.............. 179 E.3.5.2.Address Operator........... .179 E.3.6.Run Time Type Information (RTTI)... .179 E.3.7.Exception Handling................. 179 E.3.8.Standard Library............... …179 E.3.9.Functions................................. .180 E.3.9.1.External Linkage........................ …….180 E.3.9.2.Compiler generated functions......... 180 E.3.9.3.Function Parameters...... …180 E.3.9.4.Static Variables within Function... .181 E.3.9.5.Function Pointers............ ..181 E.3.9.6.Function Recursion................. …182 E.3.10.Classes...… 182 E.3.10.1.Data Members................... 182 E.3.10.2.Function Members..... 182 E.3.10.3.Virtual Functions................. .182 E.3.10.4.Virtual Base Classes... 182 E.3.10.5.Anonymous Unions....... …182 E.3.10.6.Windows-Specific...... ....182 E.3.11.Templates.… ….183 E.3.12.Trigraphs and Digraphs..... 183 E.3.13.Const-qualified variables. .184 E.3.14.C++11 Features.....… …184 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.01ix
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | ix C.4.2.1. Synchronization..............................................................................155 C.4.2.2. Dynamic-parallelism-enabled Kernel Overhead........................................ 155 C.4.3. Implementation Restrictions and Limitations................................................156 C.4.3.1. Runtime.......................................................................................156 Appendix D. Mathematical Functions..................................................................... 159 D.1. Standard Functions.................................................................................... 159 D.2. Intrinsic Functions..................................................................................... 167 Appendix E. C/C++ Language Support.................................................................... 170 E.1. C++11 Language Features............................................................................ 170 E.2. C++14 Language Features............................................................................ 173 E.3. Restrictions..............................................................................................173 E.3.1. Host Compiler Extensions....................................................................... 173 E.3.2. Preprocessor Symbols............................................................................ 174 E.3.2.1. __CUDA_ARCH__.............................................................................174 E.3.3. Qualifiers...........................................................................................175 E.3.3.1. Device Memory Qualifiers..................................................................175 E.3.3.2. __managed__ Qualifier.....................................................................176 E.3.3.3. Volatile Qualifier............................................................................ 178 E.3.4. Pointers.............................................................................................179 E.3.5. Operators.......................................................................................... 179 E.3.5.1. Assignment Operator....................................................................... 179 E.3.5.2. Address Operator............................................................................179 E.3.6. Run Time Type Information (RTTI).............................................................179 E.3.7. Exception Handling...............................................................................179 E.3.8. Standard Library.................................................................................. 179 E.3.9. Functions...........................................................................................180 E.3.9.1. External Linkage.............................................................................180 E.3.9.2. Compiler generated functions............................................................ 180 E.3.9.3. Function Parameters........................................................................180 E.3.9.4. Static Variables within Function.......................................................... 181 E.3.9.5. Function Pointers............................................................................181 E.3.9.6. Function Recursion..........................................................................182 E.3.10. Classes............................................................................................ 182 E.3.10.1. Data Members.............................................................................. 182 E.3.10.2. Function Members......................................................................... 182 E.3.10.3. Virtual Functions...........................................................................182 E.3.10.4. Virtual Base Classes....................................................................... 182 E.3.10.5. Anonymous Unions......................................................................... 182 E.3.10.6. Windows-Specific...........................................................................182 E.3.11. Templates.........................................................................................183 E.3.12. Trigraphs and Digraphs......................................................................... 183 E.3.13. Const-qualified variables.......................................................................184 E.3.14. C++11 Features.................................................................................. 184

E.3.14.1.Lambda Expressions............ 184 E.3.14.2.std::initializer_list......... 185 E.3.14.3.Rvalue references................. .186 E.3.14.4.Constexpr functions and function templates............................186 E.3.14.5.Constexpr variables.............. .186 E.3.14.6.Intine namespaces.187 E.3.14.7.thread local...........................................188 E.3.14.8.global functions and function templates..... .189 E.3.14.9.device_/constant_/shared_variables......................................190 E.3.14.10.Defautted functions.... 190 E.3.15.C+14 Features.........190 E.3.15.1.Functions with deduced return type... 190 E.3.15.2.Variable templates......................... .191 E.3.15.3.[[deprecated]]attribute........... ..192 E.4.Polymorphic Function Wrappers....................... .192 E.5.Experimental Feature:Extended Lambdas.... 195 E.5.1.Extended Lambda Type Traits.......................... .197 E.5.2.Extended Lambda Restrictions...... 198 E.5.3.Notes onhostdevicelambdas..............205 E.5.4.*this Capture By Value........... .206 E.5.5.Additional Notes.............. 208 E.6.Code Samples............... ...210 E.6.1.Data Aggregation Class.............. 210 E.6.2.Derived class................ .210 E.6.3.Class Template.............. 211 E.6.4.Function Template............... .211 E.6.5.Functor class............ .212 Appendix F.Texture Fetching................ .213 F.1.Nearest-Point Sampling.... ,213 F.2.Linear Filtering................... 214 f.3.Table Lookup..… 215 Appendix G.Compute Capabilities........... .217 G.1.Features and Technical Specifications. 217 G.2.Floating-Point Standard................ .221 G.3.Compute Capability 2.x...... 222 G.3.1.Architecture.................... 222 G.3.2.Global Memory..... 223 G.3.3.Shared Memory.......... 224 G.3.4.Constant Memory...... ....225 G.4.Compute Capability 3.x........... 225 G.4.1.Architecture............. .225 G.4.2.Global Memory........... 227 G.4.3.Shared Memory.......... .228 www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.01×
www.nvidia.com CUDA C Programming Guide PG-02829-001_v8.0 | x E.3.14.1. Lambda Expressions....................................................................... 184 E.3.14.2. std::initializer_list......................................................................... 185 E.3.14.3. Rvalue references..........................................................................186 E.3.14.4. Constexpr functions and function templates..........................................186 E.3.14.5. Constexpr variables........................................................................186 E.3.14.6. Inline namespaces......................................................................... 187 E.3.14.7. thread_local................................................................................ 188 E.3.14.8. __global__ functions and function templates.........................................189 E.3.14.9. __device__/__constant__/__shared__ variables......................................190 E.3.14.10. Defaulted functions...................................................................... 190 E.3.15. C++14 Features.................................................................................. 190 E.3.15.1. Functions with deduced return type................................................... 190 E.3.15.2. Variable templates.........................................................................191 E.3.15.3. [[deprecated]] attribute..................................................................192 E.4. Polymorphic Function Wrappers..................................................................... 192 E.5. Experimental Feature: Extended Lambdas........................................................ 195 E.5.1. Extended Lambda Type Traits.................................................................. 197 E.5.2. Extended Lambda Restrictions................................................................. 198 E.5.3. Notes on __host__ __device__ lambdas...................................................... 205 E.5.4. *this Capture By Value...........................................................................206 E.5.5. Additional Notes.................................................................................. 208 E.6. Code Samples...........................................................................................210 E.6.1. Data Aggregation Class.......................................................................... 210 E.6.2. Derived Class...................................................................................... 210 E.6.3. Class Template.................................................................................... 211 E.6.4. Function Template................................................................................211 E.6.5. Functor Class...................................................................................... 212 Appendix F. Texture Fetching.............................................................................. 213 F.1. Nearest-Point Sampling................................................................................213 F.2. Linear Filtering......................................................................................... 214 F.3. Table Lookup............................................................................................ 215 Appendix G. Compute Capabilities........................................................................ 217 G.1. Features and Technical Specifications............................................................. 217 G.2. Floating-Point Standard...............................................................................221 G.3. Compute Capability 2.x.............................................................................. 222 G.3.1. Architecture.......................................................................................222 G.3.2. Global Memory....................................................................................223 G.3.3. Shared Memory................................................................................... 224 G.3.4. Constant Memory.................................................................................225 G.4. Compute Capability 3.x.............................................................................. 225 G.4.1. Architecture.......................................................................................225 G.4.2. Global Memory....................................................................................227 G.4.3. Shared Memory................................................................................... 228
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
- 《GPU并行编程 GPU Parallel Programming》课程教学资源(参考文献)Methods of conjugate gradients for solving linear systems.pdf
- 《GPU并行编程 GPU Parallel Programming》课程教学资源(参考文献)NVIDIA Parallel Prefix Sum(Scan)with CUDA(April 2007).pdf
- 《GPU并行编程 GPU Parallel Programming》课程教学资源(参考文献)Single-pass Parallel Prefix Scan with Decoupled Look-back.pdf
- 《GPU并行编程 GPU Parallel Programming》课程教学资源(参考文献)Program Optimization Space Pruning for a Multithreaded GPU.pdf
- 《GPU并行编程 GPU Parallel Programming》课程教学资源(参考文献)Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA.pdf
- 《GPU并行编程 GPU Parallel Programming》课程教学资源(参考文献)Some Computer Organizations and Their Effectiveness.pdf
- 《GPU并行编程 GPU Parallel Programming》课程教学资源(参考文献)Software and the Concurrency Revolution.pdf
- 《GPU并行编程 GPU Parallel Programming》课程教学资源(参考文献)An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems.pdf
- 《GPU并行编程 GPU Parallel Programming》课程教学资源(参考文献)MPI A Message-Passing Interface Standard(Version 2.2).pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)19 Firewall Design Methods.pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)18 Web Security(SQL Injection and Cross-Site Request Forgery).pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)17 Web Security(Cookies and Cross Site Scripting,XSS).pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)16 Bloom Filter for Network Security.pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)15 Bloom Filters and its Variants.pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)14 Buffer Overflow Attacks.pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)13 Human Authentication.pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)12 Secure Socket Layer(SSL)、TLS(Transport Layer Security).pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)11 Public-Key Infrastructure.pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)10 Kerberos.pdf
- 南京大学:《网络安全与入侵检测 Network Security and Intrusion Detection》课程教学资源(课件讲稿)09 Authentication Using Symmetric Keys.pdf
- 电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 01 Introduction To Cuda C.pdf
- 电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 02 CUDA PARALLELISM MODEL.pdf
- 电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 03 MEMORY AND DATA LOCALITY.pdf
- 电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 04 Performance considerations.pdf
- 电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 05 PARALLEL COMPUTATION PATTERNS(HISTOGRAM).pdf
- 电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 06 PARALLEL COMPUTATION PATTERNS(SCAN).pdf
- 电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 07 JOINT CUDA-MPI PROGRAMMING.pdf
- 电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 08 Parallel Sparse Methods.pdf
- 电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 09 Parallel patterns(MERGE SORT).pdf
- 电子科技大学:《GPU并行编程 GPU Parallel Programming》课程教学资源(课件讲稿)Lecture 10 Computational Thinking.pdf
- 电子科技大学:《有限元理论与建模方法 Finite Element Analysis and Modeling》研究生课程教学资源(课件讲稿)课程简介(杜平安).pdf
- 电子科技大学:《有限元理论与建模方法 Finite Element Analysis and Modeling》研究生课程教学资源(课件讲稿)第一章 绪论.pdf
- 电子科技大学:《有限元理论与建模方法 Finite Element Analysis and Modeling》研究生课程教学资源(课件讲稿)第二章 有限元法的基本原理(平面问题有限元法).pdf
- 电子科技大学:《有限元理论与建模方法 Finite Element Analysis and Modeling》研究生课程教学资源(课件讲稿)第七章 动态分析有限元法 FEM of Dynamic Analysis.pdf
- 电子科技大学:《有限元理论与建模方法 Finite Element Analysis and Modeling》研究生课程教学资源(课件讲稿)第3~6章 其他问题有限元法.pdf
- 电子科技大学:《有限元理论与建模方法 Finite Element Analysis and Modeling》研究生课程教学资源(课件讲稿)第八章 热分析有限元法 FEM of Thermal Analysis.pdf
- 电子科技大学:《有限元理论与建模方法 Finite Element Analysis and Modeling》研究生课程教学资源(课件讲稿)第二篇 有限元建模方法 第十二章 有限元建模概述 Overview of Finite Element Modeling.pdf
- 电子科技大学:《有限元理论与建模方法 Finite Element Analysis and Modeling》研究生课程教学资源(课件讲稿)第二篇 有限元建模方法 第十一章 有限元建模的基本原则.pdf
- 电子科技大学:《有限元理论与建模方法 Finite Element Analysis and Modeling》研究生课程教学资源(课件讲稿)第二篇 有限元建模方法 第十四章 几何模型的建立.pdf
- 电子科技大学:《有限元理论与建模方法 Finite Element Analysis and Modeling》研究生课程教学资源(课件讲稿)第二篇 有限元建模方法 第十五章 单元类型及特性定义.pdf