=============================
User Guide for NVPTX Back-end
=============================
To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

This document assumes a basic familiarity with CUDA and the PTX
assembly language. Information about the CUDA Driver API and the PTX assembly
language can be found in the `CUDA documentation
<http://docs.nvidia.com/cuda/index.html>`_.
Marking Functions as Kernels
----------------------------
In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

    !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}
The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}
When compiled, the PTX kernel functions are callable by host-side code.

Address Spaces
--------------
The NVPTX back-end uses the following address space mapping:

============= ======================
Address Space Memory Space
============= ======================
0             Generic
1             Global
2             Internal Use
3             Shared
4             Constant
5             Local
============= ======================
Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory:

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]
LLVM IR functions can read from and write to this array, and host-side code
can copy data to it by name with the CUDA Driver API.
Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0. Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.

NVPTX Intrinsics
================
Address Space Conversion
------------------------
'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)
The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.
These intrinsics modify the pointer value to be a valid generic address space
pointer.
'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)
The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space. Note that these
intrinsics are only useful if the target address space of the pointer is
known. It is not legal to use address space conversion intrinsics to convert
a pointer from one non-generic address space to another non-generic address
space.
These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.
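
As a sketch of their use (the function name ``@store_shared`` is hypothetical,
and the caller must guarantee that the generic pointer really does point into
shared memory), a generic pointer can be narrowed back to the shared address
space before a store:

.. code-block:: llvm

    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)

    define void @store_shared(i8* %gp) {
      ; %gp must point into shared memory for this conversion to be valid
      %sp = call i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8* %gp)
      store i8 0, i8 addrspace(3)* %sp
      ret void
    }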
Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch bounds. These registers
map in the following way to CUDA builtins:

=============  =====================================
CUDA Builtin   PTX Special Register Intrinsic
=============  =====================================
``threadIdx``  ``@llvm.nvvm.read.ptx.sreg.tid.*``
``blockIdx``   ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
``blockDim``   ``@llvm.nvvm.read.ptx.sreg.ntid.*``
``gridDim``    ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
=============  =====================================
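
As an illustrative sketch (the function name ``@global_id`` is hypothetical),
the common CUDA index expression ``blockIdx.x * blockDim.x + threadIdx.x``
maps onto these intrinsics as follows:

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

    define i32 @global_id() {
      %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()     ; threadIdx.x
      %ntid = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()   ; blockDim.x
      %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() ; blockIdx.x
      %base = mul i32 %ctaid, %ntid
      %id = add i32 %base, %tid
      ret i32 %id
    }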
Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: llvm

    declare void @llvm.nvvm.barrier0()
The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
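
As a sketch (the names ``@sdata`` and ``@swap_pair`` are hypothetical, and a
two-thread block is assumed), a kernel in which threads exchange values
through shared memory must place a barrier between the stores and the loads:

.. code-block:: llvm

    ; Shared memory cannot be initialized, so the initializer is undef
    @sdata = internal addrspace(3) global [2 x float] undef

    define void @swap_pair(float* %in, float* %out) {
      %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %other = sub i32 1, %tid
      %inp = getelementptr float* %in, i32 %tid
      %val = load float* %inp
      %sp = getelementptr [2 x float] addrspace(3)* @sdata, i32 0, i32 %tid
      store float %val, float addrspace(3)* %sp
      ; all threads must complete their stores before any thread reads
      call void @llvm.nvvm.barrier0()
      %op = getelementptr [2 x float] addrspace(3)* @sdata, i32 0, i32 %other
      %oval = load float addrspace(3)* %op
      %outp = getelementptr float* %out, i32 %tid
      store float %oval, float* %outp
      ret void
    }

    declare void @llvm.nvvm.barrier0()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()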
For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.

Executing PTX
=============
The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.
Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);
JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction function;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");
For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.