Jens Axboe [Wed, 28 May 2014 16:18:51 +0000 (10:18 -0600)]
Merge branch 'for-3.16/core' into for-3.16/drivers
Pull in core changes (again), since we got rid of the alloc/free
hctx mq_ops hooks and mtip32xx then needed updating again.
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Wed, 28 May 2014 16:11:06 +0000 (18:11 +0200)]
blk-mq: remove alloc_hctx and free_hctx methods
There is no need for drivers to control hardware context allocation
now that we do the context to node mapping in common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 28 May 2014 16:15:41 +0000 (10:15 -0600)]
blk-mq: add file comments and update copyright notices
None of the blk-mq files have an explanatory comment at the top
for what that particular file does. Add that and add appropriate
copyright notices as well.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 28 May 2014 15:50:26 +0000 (09:50 -0600)]
Merge branch 'for-3.16/core' into for-3.16/drivers
mtip32xx uses blk_mq_alloc_reserved_request(), so pull in the
core changes so we have a properly merged end result.
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Tue, 27 May 2014 18:59:50 +0000 (20:59 +0200)]
blk-mq: remove blk_mq_alloc_request_pinned
We now only have one caller left and can open code it there in a cleaner
way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Tue, 27 May 2014 18:59:49 +0000 (20:59 +0200)]
blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request
We already do a non-blocking allocation in blk_mq_map_request, no need
to repeat it. Just call __blk_mq_alloc_request to wait directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Tue, 27 May 2014 18:59:48 +0000 (20:59 +0200)]
blk-mq: remove blk_mq_wait_for_tags
The current logic for blocking tag allocation is rather confusing, as we
first allocated and then free again a tag in blk_mq_wait_for_tags, just
to attempt a non-blocking allocation and then repeat if someone else
managed to grab the tag before us.
Instead change blk_mq_alloc_request_pinned to simply do a blocking tag
allocation itself and use the request we get back from it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Tue, 27 May 2014 18:59:47 +0000 (20:59 +0200)]
blk-mq: initialize request in __blk_mq_alloc_request
Both callers if __blk_mq_alloc_request want to initialize the request, so
lift it into the common path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Tue, 27 May 2014 18:59:46 +0000 (20:59 +0200)]
blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request
Instead of having two almost identical copies of the same code just let
the callers pass in the reserved flag directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Wed, 28 May 2014 14:08:02 +0000 (08:08 -0600)]
blk-mq: add helper to insert requests from irq context
Both the cache flush state machine and the SCSI midlayer want to submit
requests from irq context, and the current per-request requeue_work
unfortunately causes corruption due to sharing with the csd field for
flushes. Replace them with a per-request_queue list of requests to
be requeued.
Based on an earlier test by Ming Lei.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 28 May 2014 14:06:34 +0000 (08:06 -0600)]
blk-mq: remove stale comment for blk_mq_complete_request()
It works for both IPI and local completions as of commit
95f096849932.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jiri Kosina [Wed, 28 May 2014 09:55:23 +0000 (11:55 +0200)]
floppy: do not corrupt bio.bi_flags when reading block 0
Commit
41a55b4de39 ("floppy: silence warning during disk test") caused
bio.bi_flags being overwritten, and its initialization to BIO_UPTODATE
in bio_init() to be lost.
This was unnoticed until
7b7b68bba5 ("floppy: bail out in open() if
drive is not responding to block0 read"), because the error value wasn't
checked for in the bio completion callback.
Now we are actually looking at the error, and the loss of BIO_UPTODATE
causes EIO to be wrongly passed to the callback, which confuses the
FD_OPEN_SHOULD_FAIL_BIT logic.
Fix this by not destroying previous value of bi_flags when setting
BIO_QUIET.
Cc: Stephen Hemminger <shemminger@vyatta.com>
Reported-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Jens Axboe [Tue, 27 May 2014 23:46:48 +0000 (17:46 -0600)]
blk-mq: allow non-softirq completions
Right now we export two ways of completing a request:
1) blk_mq_complete_request(). This uses an IPI (if needed) and
completes through q->softirq_done_fn(). It also works with
timeouts.
2) blk_mq_end_io(). This completes inline, and ignores any timeout
state of the request.
Let blk_mq_complete_request() handle non-softirq_done_fn completions
as well, by just completing inline. If a driver has enough completion
ports to place completions correctly, it need not define a
mq_ops->complete() and we can avoid an indirect function call by
doing the completion inline.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 27 May 2014 18:06:53 +0000 (12:06 -0600)]
blk-mq: pass in suggested NUMA node to ->alloc_hctx()
Drivers currently have to figure this out on their own, and they
are missing information to do it properly. The ones that did
attempt to do it, do it wrong.
So just pass in the suggested node directly to the alloc
function.
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Tue, 27 May 2014 15:35:14 +0000 (23:35 +0800)]
block: only allocate/free mq_usage_counter in blk-mq
The percpu counter is only used for blk-mq, so move
its allocation and free inside blk-mq, and don't
allocate it for legacy queue device.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Tue, 27 May 2014 15:35:13 +0000 (23:35 +0800)]
blk-mq: avoid code duplication
blk_mq_exit_hw_queues() and blk_mq_free_hw_queues()
are introduced to avoid code duplication.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Tue, 27 May 2014 14:34:45 +0000 (08:34 -0600)]
blk-mq: fix leak of hctx->ctx_map
hctx->ctx_map should have been freed inside blk_mq_free_queue().
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Fabian Frederick [Mon, 26 May 2014 20:19:14 +0000 (22:19 +0200)]
block/blk-lib.c: make __blkdev_issue_zeroout static
__blkdev_issue_zeroout is only used in blk-lib.c
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Mon, 26 May 2014 09:45:02 +0000 (11:45 +0200)]
blk-mq: idle all hardware contexts before freeing a queue
Without this we can leak the active_queues reference if a command is
freed while it is considered active.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 23 May 2014 20:14:57 +0000 (14:14 -0600)]
blk-mq: allow setting of per-request timeouts
Currently blk-mq uses the queue timeout for all requests. But
for some commands, drivers may want to set a specific timeout
for special requests. Allow this to be passed in through
request->timeout, and use it if set.
Signed-off-by: Jens Axboe <axboe@fb.com>
Sam Bradshaw [Fri, 23 May 2014 19:30:16 +0000 (13:30 -0600)]
blk-mq: export blk_mq_tag_busy_iter
Export the blk-mq in-flight tag iterator for driver consumption.
This is particularly useful in exception paths or SRSI where
in-flight IOs need to be cancelled and/or reissued. The NVMe driver
conversion will use this.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Thu, 22 May 2014 16:40:51 +0000 (10:40 -0600)]
blk-mq: split make request handler for multi and single queue
We want slightly different behavior from them:
- On single queue devices, we currently use the per-process plug
for deferred IO and for merging.
- On multi queue devices, we don't use the per-process plug, but
we want to go straight to hardware for SYNC IO.
Split blk_mq_make_request() into a blk_sq_make_request() for single
queue devices, and retain blk_mq_make_request() for multi queue
devices. Then we don't need multiple checks for q->nr_hw_queues
in the request mapping.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 21 May 2014 20:01:15 +0000 (14:01 -0600)]
blk-mq: save memory by freeing requests on unused hardware queues
Depending on the topology of the machine and the number of queues
exposed by a device, we can end up in a situation where some of
the hardware queues are unused (as in, they don't map to any
software queues). For this case, free up the memory used by the
request map, as we will not use it. This can be a substantial
amount of memory, depending on the number of queues vs CPUs and
the queue depth of the device.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 21 May 2014 19:59:08 +0000 (13:59 -0600)]
blk-mq: allow the hctx cpu hotplug notifier to return errors
Prepare this for the next patch which adds more smarts in the
plugging logic, so that we can save some memory.
Signed-off-by: Jens Axboe <axboe@fb.com>
Robert Elliott [Tue, 20 May 2014 21:46:26 +0000 (16:46 -0500)]
blk-mq: Micro-optimize blk_queue_nomerges() check
In blk_mq_make_request(), do the blk_queue_nomerges() check
outside the call to blk_attempt_plug_merge() to eliminate
function call overhead when nomerges=2 (disabled)
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 20 May 2014 21:17:27 +0000 (15:17 -0600)]
blk-mq: initialize q->nr_requests after calling blk_queue_make_request()
blk_queue_make_requests() overwrites our set value for q->nr_requests,
turning it into the default of 128. Set this appropriately after
initializing queue values in blk_queue_make_request().
Signed-off-by: Jens Axboe <axboe@fb.com>
Asai Thambi S P [Tue, 20 May 2014 17:48:56 +0000 (10:48 -0700)]
mtip32xx: move error handling to service thread
Move error handling to service thread, and use mtip_set_timeout()
to set timeouts for HDIO_DRIVE_TASK and HDIO_DRIVE_CMD IOCTL commands.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 20 May 2014 17:49:02 +0000 (11:49 -0600)]
blk-mq: allow changing of queue depth through sysfs
For request_fn based devices, the block layer exports a 'nr_requests'
file through sysfs to allow adjusting of queue depth on the fly.
Currently this returns -EINVAL for blk-mq, since it's not wired up.
Wire this up for blk-mq, so that it now also always dynamic
adjustments of the allowed queue depth for any given block device
managed by blk-mq.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 20 May 2014 14:17:35 +0000 (08:17 -0600)]
htmldocs: fix bio.c location
Commit
f9c78b2be2ca moved bio.c from fs/ to block/, but didn't
update the docbook location. Fix that up.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 20 May 2014 02:01:52 +0000 (20:01 -0600)]
block: move mm/bounce.c to block/
Continue moving some of the block files that are scattered around.
bounce.c contains only code for bouncing the contents of a bio.
It's block proper code, not mm code.
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Mon, 19 May 2014 17:52:35 +0000 (11:52 -0600)]
Merge branch 'for-3.16/blk-mq-tagging' into for-3.16/core
Signed-off-by: Jens Axboe <axboe@fb.com>
Conflicts:
block/blk-mq-tag.c
Jens Axboe [Mon, 19 May 2014 15:23:55 +0000 (09:23 -0600)]
blk-mq: switch ctx pending map to the sparser blk_align_bitmap
Each hardware queue has a bitmap of software queues with pending
requests. When new IO is queued on a software queue, the bit is
set, and when IO is pruned on a hardware queue run, the bit is
cleared. This causes a lot of traffic. Switch this from the regular
BITS_PER_LONG bitmap to a sparser layout, similarly to what was
done for blk-mq tagging.
20% performance increase was observed for single threaded IO, and
about 15% performanc increase on multiple threads driving the
same device.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Mon, 19 May 2014 15:17:48 +0000 (09:17 -0600)]
blk-mq: move the cache friendly bitmap type of out blk-mq-tag
We will use it for the pending list in blk-mq core as well.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Mon, 19 May 2014 17:02:18 +0000 (11:02 -0600)]
block: move ioprio.c from fs/ to block/
Like commit
f9c78b2b, move this block related file outside
of fs/ and into the core block directory, block/.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Mon, 19 May 2014 14:16:41 +0000 (08:16 -0600)]
block: move bio.c and bio-integrity.c from fs/ to block/
They really belong in block/, especially now since it's not in
drivers/block/ anymore. Additionally, the get_maintainer script
gets it wrong when in fs/.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Fri, 16 May 2014 15:31:21 +0000 (23:31 +0800)]
virtio_blk: fix race between start and stop queue
When there isn't enough vring descriptor for adding to vq,
blk-mq will be put as stopped state until some of pending
descriptors are completed & freed.
Unfortunately, the vq's interrupt may come just before
blk-mq's BLK_MQ_S_STOPPED flag is set, so the blk-mq will
still be kept as stopped even though lots of descriptors
are completed and freed in the interrupt handler. The worst
case is that all pending descriptors are freed in the
interrupt handler, and the queue is kept as stopped forever.
This patch fixes the problem by starting/stopping blk-mq
with holding vq_lock.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 14 May 2014 14:22:56 +0000 (08:22 -0600)]
mtip32xx: stop block hardware queues before quiescing IO
We need to stop the block layer queues to prevent new "normal"
IO from entering the driver, while we wait for existing commands
to finish.
Signed-off-by: Jens Axboe <axboe@fb.com>
Dan Carpenter [Wed, 14 May 2014 12:54:18 +0000 (15:54 +0300)]
mtip32xx: blk_mq_init_queue() returns an ERR_PTR
We changed this from blk_alloc_queue_node() to blk_mq_init_queue() so
the check needs to be updated as well.
Fixes: ffc771b3ca8b2 ('mtip32xx: convert to use blk-mq')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 9 May 2014 15:42:02 +0000 (09:42 -0600)]
mtip32xx: convert to use blk-mq
This rips out timeout handling, requeueing, etc in converting
it to use blk-mq instead.
Acked-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 13 May 2014 21:10:52 +0000 (15:10 -0600)]
blk-mq: improve support for shared tags maps
This adds support for active queue tracking, meaning that the
blk-mq tagging maintains a count of active users of a tag set.
This allows us to maintain a notion of fairness between users,
so that we can distribute the tag depth evenly without starving
some users while allowing others to try unfair deep queues.
If sharing of a tag set is detected, each hardware queue will
track the depth of its own queue. And if this exceeds the total
depth divided by the number of active queues, the user is actively
throttled down.
The active queue count is done lazily to avoid bouncing that data
between submitter and completer. Each hardware queue gets marked
active when it allocates its first tag, and gets marked inactive
when 1) the last tag is cleared, and 2) the queue timeout grace
period has passed.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Sat, 10 May 2014 21:44:42 +0000 (15:44 -0600)]
Merge branch 'for-3.16/blk-mq-tagging' into for-3.16/core
Ming Lei [Sat, 10 May 2014 17:01:51 +0000 (01:01 +0800)]
blk-mq: bitmap tag: cleanup blk_mq_init_tags
Both nr_cache and nr_tags arn't needed for bitmap tag anymore.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Sat, 10 May 2014 21:43:14 +0000 (15:43 -0600)]
blk-mq: bitmap tag: select random tag betweet 0 and (depth - 1)
The selected tag should be selected at random between 0 and
(depth - 1) with probability 1/depth, instead between 0 and
(depth - 2) with probability 1/(depth - 1).
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Sat, 10 May 2014 17:01:49 +0000 (01:01 +0800)]
blk-mq: bitmap tag: remove barrier in bt_clear_tag()
The barrier isn't necessary because both atomic_dec_and_test()
and wake_up() implicate one barrier.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Sat, 10 May 2014 17:01:48 +0000 (01:01 +0800)]
blk-mq: bitmap tag: use clear_bit_unlock in bt_clear_tag()
The unlock memory barrier need to order access to req in free
path and clearing tag bit, otherwise either request free path
may see a allocated request, or initialized request in allocate
path might be modified by the ongoing free path.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 9 May 2014 21:48:23 +0000 (15:48 -0600)]
block: only calculate part_in_flight() once
We first check if we have inflight IO, then retrieve that
same number again. Usually this isn't that costly since the
chance of having the data dirtied in between is small, but
there's no reason for calling part_in_flight() twice.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 9 May 2014 20:54:08 +0000 (14:54 -0600)]
blk-mq: fix race in IO start accounting
Commit
c6d600c6 opened up a small race where we could attempt to
account IO completion on a request, racing with IO start accounting.
Fix this up by ensuring that we've accounted for IO start before
inserting the request.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 9 May 2014 19:41:15 +0000 (13:41 -0600)]
blk-mq: use sparser tag layout for lower queue depth
For best performance, spreading tags over multiple cachelines
makes the tagging more efficient on multicore systems. But since
we have 8 * sizeof(unsigned long) tags per cacheline, we don't
always get a nice spread.
Attempt to spread the tags over at least 4 cachelines, using fewer
number of bits per unsigned long if we have to. This improves
tagging performance in setups with 32-128 tags. For higher depths,
the spread is the same as before (BITS_PER_LONG tags per cacheline).
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 9 May 2014 15:36:49 +0000 (09:36 -0600)]
blk-mq: implement new and more efficient tagging scheme
blk-mq currently uses percpu_ida for tag allocation. But that only
works well if the ratio between tag space and number of CPUs is
sufficiently high. For most devices and systems, that is not the
case. The end result if that we either only utilize the tag space
partially, or we end up attempting to fully exhaust it and run
into lots of lock contention with stealing between CPUs. This is
not optimal.
This new tagging scheme is a hybrid bitmap allocator. It uses
two tricks to both be SMP friendly and allow full exhaustion
of the space:
1) We cache the last allocated (or freed) tag on a per blk-mq
software context basis. This allows us to limit the space
we have to search. The key element here is not caching it
in the shared tag structure, otherwise we end up dirtying
more shared cache lines on each allocate/free operation.
2) The tag space is split into cache line sized groups, and
each context will start off randomly in that space. Even up
to full utilization of the space, this divides the tag users
efficiently into cache line groups, avoiding dirtying the same
one both between allocators and between allocator and freeer.
This scheme shows drastically better behaviour, both on small
tag spaces but on large ones as well. It has been tested extensively
to show better performance for all the cases blk-mq cares about.
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Tue, 6 May 2014 10:12:45 +0000 (12:12 +0200)]
blk-mq: initialize struct request fields individually
This allows us to avoid a non-atomic memset over ->atomic_flags as well
as killing lots of duplicate initializations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Thu, 8 May 2014 20:50:19 +0000 (14:50 -0600)]
blk-mq: update a hotplug comment for grammar
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 7 May 2014 16:26:44 +0000 (10:26 -0600)]
blk-mq: add basic round-robin of what CPU to queue workqueue work on
Right now we just pick the first CPU in the mask, but that can
easily overload that one. Add some basic batching and round-robin
all the entries in the mask instead.
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:13 +0000 (17:05 -0700)]
cdrom: Remove unnecessary prototype for cdrom_get_disc_info
Move the function to the proper spot instead.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:12 +0000 (17:05 -0700)]
cdrom: Remove unnecessary prototype for cdrom_mrw_exit
Move the function to appropriate locations instead.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:11 +0000 (17:05 -0700)]
cdrom: Remove cdrom_count_tracks prototype
Move function to proper location instead.
Fix whitespace and embedded if too.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:10 +0000 (17:05 -0700)]
cdrom: Remove cdrom_get_next_writeable prototype
Move the function to the right spot instead.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:09 +0000 (17:05 -0700)]
cdrom: Remove cdrom_get_last_written prototype
Move the function instead.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:08 +0000 (17:05 -0700)]
cdrom: Move mmc_ioctls above cdrom_ioctl to remove unnecessary prototype
Neaten the spacing too.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:07 +0000 (17:05 -0700)]
cdrom: Remove unnecessary sanitize_format prototype
It's defined below without being called.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:06 +0000 (17:05 -0700)]
cdrom: Remove unnecessary check_for_audio_disc prototype
The actual static is defined below it but not used until later.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:05 +0000 (17:05 -0700)]
cdrom: Remove prototype for open_for_data
Move static function to the appropriate place to remove
the now unnecessary prototype.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:04 +0000 (17:05 -0700)]
cdrom: Remove obfuscating IOCTL_IN and IOCTL_OUT macros
Macros with hidden control flow aren't nice.
Just use copy_to/from_user directly instead.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:03 +0000 (17:05 -0700)]
cdrom: Remove unused CHECKAUDIO macro
It's unused, make it disappear.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Joe Perches [Mon, 5 May 2014 00:05:02 +0000 (17:05 -0700)]
cdrom: convert cdinfo to cd_dbg
It's a debugging message, mark it so.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Fabian Frederick [Fri, 2 May 2014 16:28:17 +0000 (18:28 +0200)]
block/blk-throttle.c: fix return of 0/1 with return type bool
Fix 4 coccinelle warnings.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
Fabian Frederick [Fri, 2 May 2014 16:21:45 +0000 (18:21 +0200)]
block/blk-iopoll.c: use iop instead of iopoll
All blk_iopoll functions use iop for parent iopoll structure except
blk_iopoll_complete.This also fixes one kernel-doc warning.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 2 May 2014 17:24:48 +0000 (11:24 -0600)]
blk-mq: remove extra requeue trace
We already issue a blktrace requeue event in
__blk_mq_requeue_request(), don't do it from the original caller
as well.
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Thu, 1 May 2014 07:12:36 +0000 (15:12 +0800)]
block: null_blk: fix use after free
entry(cmd->ll_list) may belong to new request once end_cmd()
returns, so fix the bug with the patch.
Without the change, it is easy to observe oops when
doing null_blk(timer) test.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Masanari Iida [Mon, 28 Apr 2014 03:38:34 +0000 (12:38 +0900)]
block: Fix format string mismatch in cfq-iosched.c
Fix format string mismatch in cfq_var_show()
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:35 +0000 (18:43 +0200)]
drbd: use list_first_entry_or_null in first_peer_device/first_connection
If there are no peer_devices or connections, I'd rather have NULL
than some "arbitrary" address pretending to point to a struct.
Helps to avoid hard to debug symptoms, in case we ever try to use
and dereference a drbd_connection or drbd_peer_device
where we in fact don't have any connection at all.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Philipp Reisner [Mon, 28 Apr 2014 16:43:34 +0000 (18:43 +0200)]
drbd: Allow attaching of a newly created device to any backing device
A newly created device was never exposed before, i.e. has a
exposed_data_uuid of 0. Then it is valid to attach to any current_uuid
of a backing device (of course also to a newly created one (4))
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Philipp Reisner [Mon, 28 Apr 2014 16:43:33 +0000 (18:43 +0200)]
drbd: Test cstate while holding req_lock
In case a connection transitions into C_TIMEOUT within the timer
function (request_timer_fn()) we need to make sure that the receiver
thread (potentially running on a different CPU) sees the updated
cstate later on.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Philipp Reisner [Mon, 28 Apr 2014 16:43:32 +0000 (18:43 +0200)]
drbd: use blk_set_stacking_limits()
...instead directly assigning to q->limits.discard_zeroes_data
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:31 +0000 (18:43 +0200)]
drbd: evaluate disk and network timeout on different requests
Just because it is the oldest not yet completed request
does not make it the oldest request waiting for disk.
Or waiting for the peer.
And we completely missed already completed requests
that would still hold references to activity log extents,
waiting only for the barrier ack.
Find two oldest not yet completely processed requests,
one that is still waiting for local completion,
and one that is still waiting for some response from the peer.
These may or may not be the same request object.
Then separately apply the network and disk timeouts, respectively.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Philipp Reisner [Mon, 28 Apr 2014 16:43:30 +0000 (18:43 +0200)]
drbd: Fix a hole in the challange-response connection authentication
In the implementation as it was, the two peers sent each other
a challenge, and expects the challenge hashed with the shared
secret back.
A attacker could simply wait for the challenge of the peer, and
send the same challenge back. Then it waits for the response, and
sends the same response back.
Prevent this by not accepting a challenge from the peer that is
the same as the challenge sent to the peer.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:29 +0000 (18:43 +0200)]
drbd: always implicitly close last epoch when idle
Once our sender thread needs to wait_for_work(),
and actually needs to schedule(), just before we do that,
we already check if it is useful to implicitly close the last epoch.
The condition was too strict: only implicitly close the epoch,
if there have been no new (write) requests at all.
The assumption was that if there were new requests, they would
always be communicated one way or another, and would send necessary
epoch separating barriers explicitly.
This is not always true, e.g. when becoming diskless,
or while explicitly starting a full resync.
The last communicated epoch could stay open for a long time,
locking down corresponding activity log extents.
It is safe to always implicitly send that last barrier, as soon as we
determin that there cannot be more requests in the last communicated
epoch, even if there have been (uncommunicated) new requests in new
epochs meanwhile.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:28 +0000 (18:43 +0200)]
drbd: add back some fairness to AL transactions
When batching more updates to the activity log into single transactions,
we lost the ability for new requests to force themselves into the active
set: all preparation steps became non-blocking, and if all currently
hot extents keep busy, they could starve out new incoming requests
to cold extents for quite a while.
This can only happen if your IO backend accepts more IO operations per
average DRBD replication round trip time than you have al-extents
configured.
If we have incoming requests to cold extents,
at least do one blocking update per transaction.
In an artificial worst-case workload on SSD with an asynchronous 600 ms
replication link, with al-extents = 7 (the minimum we allow), and
concurrent full resynch, without this patch, some write requests have
been observed to be starved for 40 seconds.
With this patch, application observed a worst case latency of twice the
replication round trip time.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:27 +0000 (18:43 +0200)]
drbd: keep max-bio size during detach/attach on disconnected primary
We want to store in persistent meta data what the peer DRBD can handle,
which, due to spreading requests to multiple bios,
may be more than its backing device can handle.
Otherwise, if a disconnected Primary temporarily loses access to its local data
as well, we may accidentally shrink the max-bio setting, portentially causing
already assembled, but not yet processed, application bios to be spuriously
failed due to device limits.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:26 +0000 (18:43 +0200)]
drbd: fix a race between start_resync and send_and_submit
In the drbd make request function, specifically in
drbd_send_and_submit(), we decide whether we want to send the actual
write request, or only a "set this block out of sync" information.
We do so based on the current connection state, while holding the req_lock.
The connection state is not supposed to change while holding the req_lock.
But in drbd_start_resync, we did change that state anyways,
while only holding the global_state_lock, which is enough to change
sync-after dependencies (paused vs active resync), but
not good enough to change the connection state.
Fix: in drbd_start_resync, first grab the req_lock to serialize with
drbd_send_and_submit(), before grabbing the global_state_lock
to be able to evaluate the sync-after dependencies.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:25 +0000 (18:43 +0200)]
drbd: Enable QUEUE_FLAG_DISCARD only if the peer can recieve P_TRIM
Allow the user of REQ_DISCARD.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:24 +0000 (18:43 +0200)]
drbd: prepare sending side for REQ_DISCARD
Note that I do NOT call __drbd_chk_io_error for failed REQ_DISCARD.
That may be wrong, though, or needs to differ between EOPNOTSUPP and
other errors...
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:23 +0000 (18:43 +0200)]
drbd: prepare receiving side for REQ_DISCARD
If the receiver needs to serve a discard request on a queue that does
not announce to be discard cabable, it falls back to do synchronous
blkdev_issue_zeroout().
We expect only "reasonably" large (up to one activity log extent?)
discard requests.
We do this to not to not block the receiver for too long in this
fallback code path, and to not set/clear too many bits inside one
spinlock_irq_save() in drbd_set_in_sync/drbd_set_out_of_sync,
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:22 +0000 (18:43 +0200)]
drbd: allow parallel promote/demote actions
We plan to use genl_family->parallel_ops = true in the future,
but need to review all possible interactions first.
For now, only selectively drop genl_lock() in drbd_set_role(),
instead serializing on our own internal resource->conf_update mutex.
We now can be promoted/demoted on many resources in parallel,
which may significantly improve cluster failover times
when fencing is required.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:21 +0000 (18:43 +0200)]
drbd: perpare for genetlink parallel_ops
Because all administrative requests via genetlink have been globally
serialized via genl_lock(), we used to have one static struct
drbd_config_context "admin context".
Move this on-stack to the respective callback functions.
This will allow us to selectively drop the genl_lock()
(or use genl_family->parallel_ops) in the future.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Philipp Reisner [Mon, 28 Apr 2014 16:43:20 +0000 (18:43 +0200)]
drbd: Do not BUG() when connection breaks in a special way
When a 'cluster wide' disconnect executes, the result comes back
from the peer, and immediately after that the connection breaks
then _conn_rq_cond() reported back SS_CW_SUCCESS.
Therefore _conn_request_state() calls conn_set_state(), which
has a BUG() in it.
The BUG() is hit because conn_is_valid_transition() does not like
the transaction. Which goes back to is_valid_soft_transition()
returning SS_OUTDATE_WO_CONN.
This fix is to consider an error reported by is_valid_soft_transition()
even when the peer agreed to the transaction.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:19 +0000 (18:43 +0200)]
drbd: don't let application IO pre-empt resync too often
Before, application IO could pre-empt resync activity
for up to hardcoded 20 seconds per resync request.
A very busy server could throttle the effective resync bandwidth
down to one request per 20 seconds.
Now, we only let application IO pre-empt resync traffic
while the current resync rate estimate is above c-min-rate.
If you disable the c-min-rate throttle feature (set c-min-rate = 0),
application IO will no longer pre-empt resync traffic at all.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:18 +0000 (18:43 +0200)]
drbd: fix potential distributed deadlock during verify or resync
If max-buffers and socket buffer sizes are "too small" for the chosen
resync rate, this could lead potentially lead to a distributed deadlock,
which may or may not resolve itself via the "ko-count" and request
timeout mechanism, or could be resolved by forced disconnect.
One option to deal with this is proper configuration:
use larger max-buffer and socket buffers settings,
or reduce the resync rate.
But even with bad configuration we should not deadlock,
but "gracefully" recover.
The issue is avoided by using only up to max-buffers/2 for resync
requests, and by using max-buffers not as a hard limit for data buffer
allocations, but as a throttle threshold only.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:17 +0000 (18:43 +0200)]
drbd: resync: fix too large bursts for very slow rates
While merging adjacent dirty blocks into resync requests,
the resync rate throttle was disregarded.
For very low resync rates, the effective rate may have exceeded
the intended rate by a larger margin.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lars Ellenberg [Mon, 28 Apr 2014 16:43:16 +0000 (18:43 +0200)]
drbd: fix stalled resync detection in /proc/drbd
If we don't make resync or verify progress for "too long",
we want to flag it as "stalled".
Since 2010, "use rolling marks for resync speed calculation"
this "too long" was wrong by a factor of HZ.
With HZ 250, it would have been flagged as stalled
after 100 minutes.
Hardcode 3 minutes instead.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Philipp Reisner [Mon, 28 Apr 2014 16:43:15 +0000 (18:43 +0200)]
drbd: Allow online layout change of AL while peer is not connected
If a user forces the operation he takes the blame in case
the peer does not have enough space. No reason to dey this...
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Philipp Reisner [Mon, 28 Apr 2014 16:43:14 +0000 (18:43 +0200)]
drbd: Remove drbd_wrappers.h
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Philipp Reisner [Mon, 28 Apr 2014 16:43:13 +0000 (18:43 +0200)]
drbd: Leave IO suspended if the fence handler find the peer primary
Actually we are clearing the susp_fen flag if we are not going
to call a fencing handler.
For setting the susp_fen flag needs to be edge-triggerd, and not
level triggered.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Philipp Reisner [Mon, 28 Apr 2014 16:43:12 +0000 (18:43 +0200)]
drbd: Break a deadlock while concurrent fencing and establishing a connection
When we need to outdate the peer while being promoted to primary,
and the connection gets established at the same time, we deadlock
in drbd_try_outdate_peer() when trying to clear the susp_fen
bit.
Fix this by setting the STATE_SENT bit while holding the mutex.
Using drbd_change_state(.. , CS_HARD, ..) which does not block
until STATE_SENT is cleared, is only for clearness. It does
not contribute anything to the fix.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 30 Apr 2014 19:43:56 +0000 (13:43 -0600)]
blk-mq: refactor request insertion/merging
Refactor the logic around adding a new bio to a software queue,
so we nest the ctx->lock where we really need it (merge and
insertion) and don't hold it when we don't (init and IO start
accounting).
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 30 Apr 2014 19:43:08 +0000 (13:43 -0600)]
blk-mq remove debug BUG_ON() when draining software queues
It's never been of any use, lets get rid of it.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 30 Apr 2014 02:49:48 +0000 (20:49 -0600)]
blk-mq: fix waiting for reserved tags
blk_mq_wait_for_tags() is only able to wait for "normal" tags,
not reserved tags. Pass in which one we should attempt to get
a tag for, so that waiting for reserved tags will work.
Reserved tags are used for internal commands, which are usually
serialized. Hence no waiting generally takes place, but we should
ensure that it actually works if users need that functionality.
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Fri, 25 Apr 2014 07:36:37 +0000 (00:36 -0700)]
random: export add_disk_randomness
This will be needed for pending changes to the scsi midlayer that now
calls lower level block APIs, as well as any blk-mq driver that wants to
contribute to the random pool.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Fri, 25 Apr 2014 12:14:48 +0000 (14:14 +0200)]
block: fold __blk_add_timer into blk_add_timer
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Fri, 25 Apr 2014 09:32:53 +0000 (02:32 -0700)]
blk-mq: respect rq_affinity
The blk-mq code is using it's own version of the I/O completion affinity
tunables, which causes a few issues:
- the rq_affinity sysfs file doesn't work for blk-mq devices, even if it
still is present, thus breaking existing tuning setups.
- the rq_affinity = 1 mode, which is the defauly for legacy request based
drivers isn't implemented at all.
- blk-mq drivers don't implement any completion affinity with the default
flag settings.
This patches removes the blk-mq ipi_redirect flag and sysfs file, as well
as the internal BLK_MQ_F_SHOULD_IPI flag and replaces it with code that
respects the queue-wide rq_affinity flags and also implements the
rq_affinity = 1 mode.
This means I/O completion affinity can now only be tuned block-queue wide
instead of per context, which seems more sensible to me anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Thu, 24 Apr 2014 14:51:47 +0000 (08:51 -0600)]
blk-mq: fix race with timeouts and requeue events
If a requeue event races with a timeout, we can get into the
situation where we attempt to complete a request from the
timeout handler when it's not start anymore. This causes a crash.
So have the timeout handler check that REQ_ATOM_STARTED is still
set on the request - if not, we ignore the event. If this happens,
the request has now been marked as complete. As a consequence, we
need to ensure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(),
as to maintain proper request state.
Signed-off-by: Jens Axboe <axboe@fb.com>