1) TCM Userspace Design
  d) Implementation overview
  g) Other contingencies
2) Writing a user pass-through handler
  a) Discovering and configuring TCMU uio devices
  b) Waiting for events on the device(s)
  c) Managing the command ring
3) Command filtering and pass_level

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel. TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols. TCM also modularizes the data storage. There are existing
modules that use a file, a block device, RAM, or another SCSI device
as storage. These are called "backstores" or "storage engines". These
built-in modules are implemented entirely as kernel code.

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also modularizes
the actual data storage. These modules are referred to as "backstores"
or "storage engines". The target comes with backstores that allow a
file, a block device, RAM, or another SCSI device to be used for the
local storage needed for the exported SCSI LUN. Like the rest of LIO,
these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is that there are more distinct components to
configure, and potentially to malfunction. This is unavoidable, but
hopefully not fatal if we're careful to keep things as simple as
possible.

- Good performance: high throughput, low latency
- Cleanly handle if userspace never attaches, hangs, is killed, or is
  malicious
- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend

Implementation overview:

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region are: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands. Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.
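
As a minimal sketch of this offset convention, userspace can turn any
stored offset into a pointer relative to wherever the region is
currently mapped. The helper name region_ptr() is hypothetical and not
part of the TCMU interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert an offset stored in the shared region
 * into a pointer that is valid for the current mapping. The offset is
 * the same no matter where the region is mapped; only map_base changes.
 */
static inline void *region_ptr(void *map_base, size_t map_len, size_t off)
{
	assert(off < map_len);	/* offsets must stay inside the mapping */
	return (uint8_t *)map_base + off;
}
```

Because only map_base differs between mappings, a restarted handler
can call the same helper with its new mapping address and reach the
same entries.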

See target_core_user.h for the struct definitions.

The mailbox is always at the start of the shared memory region, and
contains a version, details about the starting offset and size of the
command ring, and head and tail pointers to be used by the kernel and
userspace (respectively) to put commands on the ring, and indicate
when the commands are completed.

version - 1 (userspace should abort if otherwise)
flags - none yet defined.
cmdr_off - The offset of the start of the command ring from the start
of the memory region, to account for the mailbox size.
cmdr_size - The size of the command ring. This does *not* need to be a
power of two.
cmd_head - Modified by the kernel to indicate when a command has been
placed on the ring.
cmd_tail - Modified by userspace to indicate when it has completed
processing of a command.
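
The checks a handler might run on these fields right after mapping the
region can be sketched as follows. The struct ex_mailbox layout and
field widths below are assumptions made for illustration only; the
authoritative definition is in target_core_user.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ex_mailbox {		/* illustrative layout, not the real struct */
	uint16_t version;
	uint16_t flags;
	uint32_t cmdr_off;
	uint32_t cmdr_size;
	uint32_t cmd_head;
	uint32_t cmd_tail;
};

/* Returns 1 if the mailbox looks usable for a mapping of map_len bytes. */
static int ex_mailbox_ok(const struct ex_mailbox *mb, size_t map_len)
{
	assert(mb != NULL);
	if (mb->version != 1)
		return 0;	/* unknown version: userspace should abort */
	if (mb->cmdr_size == 0 || mb->cmdr_off < sizeof(*mb))
		return 0;	/* the ring must start after the mailbox */
	if ((uint64_t)mb->cmdr_off + mb->cmdr_size > map_len)
		return 0;	/* the ring must fit inside the mapping */
	/* head and tail are offsets within the ring */
	return mb->cmd_head < mb->cmdr_size && mb->cmd_tail < mb->cmdr_size;
}
```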

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.
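
The head and tail arithmetic described here is small enough to sketch
directly. The helper names ring_advance() and ring_is_empty() are
hypothetical; they only restate the modulo rule and the empty-ring
test from the text:

```c
#include <assert.h>
#include <stdint.h>

/* Advance a head or tail offset by one entry's length, wrapping at cmdr_size. */
static inline uint32_t ring_advance(uint32_t pos, uint32_t entry_len,
				    uint32_t cmdr_size)
{
	assert(cmdr_size != 0 && pos < cmdr_size);
	return (pos + entry_len) % cmdr_size;
}

/* The ring is empty when the kernel's head and userspace's tail match. */
static inline int ring_is_empty(uint32_t cmd_head, uint32_t cmd_tail)
{
	return cmd_head == cmd_tail;
}
```

For example, with cmdr_size of 1000, a 40-byte entry at offset 960
moves the offset back to 0.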

TCMU commands start with a common header containing "len_op", a 32-bit
value that stores the length, as well as the opcode in the lowest
unused bits. Currently only two opcodes are defined, TCMU_OP_PAD and
TCMU_OP_CMD. When userspace encounters a command with PAD opcode, it
should skip ahead by the bytes in "length". (The kernel inserts PAD
entries to ensure each CMD entry fits contiguously into the circular
buffer.)
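
To illustrate how a length and an opcode can share one 32-bit value,
here is a minimal sketch. It assumes the opcode occupies the low three
bits, which are free when entry lengths are multiples of 8. The EX_*
names, the mask and the opcode values are all assumptions made for
this example; the real macros and accessors live in
target_core_user.h.

```c
#include <assert.h>
#include <stdint.h>

#define EX_OP_MASK 0x7u			/* assumed: low 3 bits hold the opcode */

enum { EX_OP_PAD = 0, EX_OP_CMD = 1 };	/* illustrative opcode values */

/* Combine an 8-byte-aligned length with an opcode. */
static inline uint32_t ex_pack_len_op(uint32_t len, uint32_t op)
{
	assert((len & EX_OP_MASK) == 0);	/* length leaves the low bits free */
	assert(op <= EX_OP_MASK);
	return len | op;
}

static inline uint32_t ex_get_op(uint32_t len_op)
{
	return len_op & EX_OP_MASK;
}

static inline uint32_t ex_get_len(uint32_t len_op)
{
	return len_op & ~EX_OP_MASK;
}
```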

When userspace handles a CMD, it finds the SCSI CDB (Command Data
Block) via tcmu_cmd_entry.req.cdb_off. This is an offset from the
start of the overall shared memory region, not the entry. The data
in/out buffers are accessible via the req.iov[] array. Note that
each iov.iov_base is also an offset from the start of the region.

TCMU currently does not support BIDI operations.

When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by entry.hdr.length (mod cmdr_size) and signals the
kernel via the UIO method, a 4-byte write to the file descriptor.
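
When a command fails, the sense buffer has to carry valid sense data.
The sketch below fills fixed-format sense data as laid out in the SCSI
specifications. The helper ex_set_sense() and the EX_* constants are
hypothetical names for this example, and the caller is assumed to pass
a pointer to the entry's sense buffer and its size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_SAM_STAT_CHECK_CONDITION 0x02 /* status value from the SCSI specs */
#define EX_SENSE_LEN 18			 /* minimal fixed-format sense data */

/* Fill fixed-format sense data; returns the SCSI status to report. */
static inline uint8_t ex_set_sense(uint8_t *sense, size_t sense_size,
				   uint8_t key, uint8_t asc, uint8_t ascq)
{
	assert(sense_size >= EX_SENSE_LEN);
	memset(sense, 0, sense_size);
	sense[0] = 0x70;		/* current error, fixed format */
	sense[2] = key & 0x0f;		/* sense key */
	sense[7] = EX_SENSE_LEN - 8;	/* additional sense length */
	sense[12] = asc;		/* additional sense code */
	sense[13] = ascq;		/* additional sense code qualifier */
	return EX_SAM_STAT_CHECK_CONDITION;
}
```

A handler rejecting an unsupported opcode could, for instance, pass
key 0x05 (ILLEGAL REQUEST) with asc 0x20 and ascq 0x00 (INVALID
COMMAND OPERATION CODE) and store the returned value in
rsp.scsi_status.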

The data area is the shared-memory space after the command ring. The
organization of this area is not defined in the TCMU interface, and
userspace should access only the parts referenced by pending iovs.
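
A minimal sketch of touching only the iov-referenced parts of the
region follows. struct ex_iov is a hypothetical stand-in for an iov
entry whose base is an offset from the start of the region, and
ex_gather() is an illustrative helper, not part of any TCMU API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ex_iov {			/* hypothetical stand-in for an iov entry */
	uint64_t base_off;	/* offset from the start of the shared region */
	uint64_t len;
};

/* Copy the data described by the iovs out of the region; returns bytes copied. */
static size_t ex_gather(const void *map, size_t map_len,
			const struct ex_iov *iov, size_t iov_cnt,
			void *dst, size_t dst_len)
{
	size_t copied = 0;

	for (size_t i = 0; i < iov_cnt; i++) {
		size_t n = iov[i].len;

		/* every iov must stay inside the mapping */
		assert(iov[i].base_off + n <= map_len);
		if (n > dst_len - copied)
			n = dst_len - copied;	/* truncate to the space left */
		memcpy((uint8_t *)dst + copied,
		       (const uint8_t *)map + iov[i].base_off, n);
		copied += n;
	}
	return copied;
}
```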

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning sysfs
class/uio/uio*/name. For TCMU devices, these names will be of the
form:

tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at:

/sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size", that
userspace needs to know for correct operation.
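
Splitting a UIO name into its parts and deriving the configfs location
can be sketched as below. The struct and function names are
hypothetical, the buffer sizes are arbitrary choices for the example,
and the configfs mount point is assumed to be the usual one. The
portion of the name after <device_name> is kept as one unparsed
string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct ex_tcmu_name {		/* hypothetical parsed form of the UIO name */
	unsigned int hba_num;
	char device_name[64];
	char rest[128];		/* "<subtype>/<path>", left unparsed here */
};

/* Returns 0 on success, -1 if the name does not have the TCMU form. */
static int ex_parse_uio_name(const char *name, struct ex_tcmu_name *out)
{
	assert(name != NULL && out != NULL);
	int n = sscanf(name, "tcm-user/%u/%63[^/]/%127[^\n]",
		       &out->hba_num, out->device_name, out->rest);
	return n == 3 ? 0 : -1;
}

/* Build the device's configfs path; returns 0 on success, -1 if buf is too small. */
static int ex_configfs_path(const struct ex_tcmu_name *nm, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/kernel/config/target/core/user_%u/%s",
			 nm->hba_num, nm->device_name);
	return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

Names belonging to other UIO drivers fail the parse and can simply be
skipped.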

<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to
LIO limitations.

For all devices so discovered, the user handler opens /dev/uioX and
calls mmap():

mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config". This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period.

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. The command ring is preserved. However, after the timeout
  period, the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory region.


Writing a user pass-through handler (with example code)
-------------------------------------------------------

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin
authors.

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.

a) Discovering and configuring TCMU UIO devices:

(error checking omitted for brevity)

      int fd, dev_fd;
      ssize_t ret;
      char buf[256];
      unsigned long long map_len;
      void *map;

      fd = open("/sys/class/uio/uio0/name", O_RDONLY);
      ret = read(fd, buf, sizeof(buf));
      close(fd);
      buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

      /* we only want uio devices whose name is a format we expect */
      if (strncmp(buf, "tcm-user", 8))
        exit(-1);

      /* Further checking for subtype also needed here */

      fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
      ret = read(fd, buf, sizeof(buf));
      close(fd);
      buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

      map_len = strtoull(buf, NULL, 0);

      dev_fd = open("/dev/uio0", O_RDWR);
      map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);

b) Waiting for events on the device(s)

      while (1) {
        char buf[4];

        int ret = read(dev_fd, buf, 4); /* will block */

        handle_device_events(dev_fd, map);
      }

c) Managing the command ring

      #include <linux/target_core_user.h>

      int handle_device_events(int fd, void *map)
      {
        struct tcmu_mailbox *mb = map;
        struct tcmu_cmd_entry *ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
        int did_some_work = 0;

        /* Process events from cmd ring until we catch up with cmd_head */
        while (ent != (void *)mb + mb->cmdr_off + mb->cmd_head) {
          if (tcmu_hdr_get_op(&ent->hdr) == TCMU_OP_CMD) {
            uint8_t *cdb = (void *)mb + ent->req.cdb_off;
            bool success = true;

            /* Handle command here. */
            printf("SCSI opcode: 0x%x\n", cdb[0]);

            /* Set response fields */
            if (success)
              ent->rsp.scsi_status = SCSI_NO_SENSE;
            else {
              /* Also fill in rsp->sense_buffer here */
              ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
            }
          }
          else {
            /* Do nothing for PAD entries */
          }

          /* update cmd_tail */
          mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(&ent->hdr)) % mb->cmdr_size;
          ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
          did_some_work = 1;
        }

        /* Notify the kernel that work has been finished */
        if (did_some_work) {
          uint32_t buf = 0;

          write(fd, &buf, 4);
        }
        return 0;
      }


Command filtering and pass_level
--------------------------------

TCMU supports a "pass_level" option with valid values of 0 or 1. When
the value is 0 (the default), nearly all SCSI commands received for
the device are passed through to the handler. This allows maximum
flexibility but increases the amount of code required by the handler,
to support all mandatory SCSI commands. If pass_level is set to 1,
then only IO-related commands are presented, and the rest are handled
by LIO's in-kernel command emulation. The commands presented at level
1 include all versions of:

Please be careful to return codes as defined by the SCSI
specifications. These are different from some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code
is 2, not 1.