=============================
User Guide for NVPTX Back-end
=============================
To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

This document assumes a basic familiarity with CUDA and the PTX
assembly language. Information about the CUDA Driver API and the PTX assembly
language can be found in the `CUDA documentation
<http://docs.nvidia.com/cuda/index.html>`_.
Marking Functions as Kernels
----------------------------
In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

    !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}
The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}
When compiled, the PTX kernel functions are callable by host-side code.

Address Spaces
--------------
The NVPTX back-end uses the following address space mapping:

============= ======================
Address Space Memory Space
============= ======================
0             Generic
1             Global
2             Internal Use
3             Shared
4             Constant
5             Local
============= ======================
Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory:

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]
LLVM IR functions can read from and write to this array, and host-side code
can copy data to it by name with the CUDA Driver API.
Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0. Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.

NVPTX Intrinsics
================
Address Space Conversion
------------------------
'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)
The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.
These intrinsics modify the pointer value to be a valid generic address space
pointer.
'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)
The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space. Note that these
intrinsics are only useful if the target address space of the pointer is
known. It is not legal to use address space conversion intrinsics to convert
a pointer from one non-generic address space to another non-generic address
space.
These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.
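
As a sketch of their use (the function name ``@store_shared`` is hypothetical,
and the caller must guarantee that the generic pointer really does point into
shared memory), a generic pointer can be narrowed back to the shared address
space before a store:

.. code-block:: llvm

    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)

    define void @store_shared(i8* %gp) {
      ; %gp must point into shared memory for this conversion to be valid
      %sp = call i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8* %gp)
      store i8 0, i8 addrspace(3)* %sp
      ret void
    }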
Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch bounds. These registers
map in the following way to CUDA builtins:

=============  =====================================
CUDA Builtin   PTX Special Register Intrinsic
=============  =====================================
``threadIdx``  ``@llvm.nvvm.read.ptx.sreg.tid.*``
``blockIdx``   ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
``blockDim``   ``@llvm.nvvm.read.ptx.sreg.ntid.*``
``gridDim``    ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
=============  =====================================
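
As an illustrative sketch (the function name ``@global_id`` is hypothetical),
the common CUDA index expression ``blockIdx.x * blockDim.x + threadIdx.x``
maps onto these intrinsics as follows:

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

    define i32 @global_id() {
      %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()     ; threadIdx.x
      %ntid = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()   ; blockDim.x
      %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() ; blockIdx.x
      %base = mul i32 %ctaid, %ntid
      %id = add i32 %base, %tid
      ret i32 %id
    }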
Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: llvm

    declare void @llvm.nvvm.barrier0()
The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
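
As a sketch (the names ``@sdata`` and ``@swap_pair`` are hypothetical, and a
two-thread block is assumed), a kernel in which threads exchange values
through shared memory must place a barrier between the stores and the loads:

.. code-block:: llvm

    ; Shared memory cannot be initialized, so the initializer is undef
    @sdata = internal addrspace(3) global [2 x float] undef

    define void @swap_pair(float* %in, float* %out) {
      %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %other = sub i32 1, %tid
      %inp = getelementptr float* %in, i32 %tid
      %val = load float* %inp
      %sp = getelementptr [2 x float] addrspace(3)* @sdata, i32 0, i32 %tid
      store float %val, float addrspace(3)* %sp
      ; all threads must complete their stores before any thread reads
      call void @llvm.nvvm.barrier0()
      %op = getelementptr [2 x float] addrspace(3)* @sdata, i32 0, i32 %other
      %oval = load float addrspace(3)* %op
      %outp = getelementptr float* %out, i32 %tid
      store float %oval, float* %outp
      ret void
    }

    declare void @llvm.nvvm.barrier0()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()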
For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.

Executing PTX
=============
The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.
Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);
JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction function;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");
For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.