Jens Axboe [Sun, 15 Jan 2012 09:29:48 +0000 (10:29 +0100)]
Revert "block: recursive merge requests"
This reverts commit
274193224cdabd687d804a26e0150bb20f2dd52c.
We have some problems related to selection of empty queues
that need to be resolved, evidence so far points to the
recursive merge logic making either being the cause or at
least the accelerator for this. So revert it for now, until
we figure this out.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Martin K. Petersen [Fri, 13 Jan 2012 07:15:33 +0000 (08:15 +0100)]
block: Stop using macro stubs for the bio data integrity calls
Replace preprocessor macro stubs with real function declarations to
prevent warnings when CONFIG_BLK_DEV_INTEGRITY is disabled.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stephen Rothwell [Thu, 12 Jan 2012 08:17:30 +0000 (09:17 +0100)]
blockdev: convert some macros to static inlines
We prefer to program in C rather than preprocessor and it fixes this
warning when CONFIG_BLK_DEV_INTEGRITY is not set:
drivers/md/dm-table.c: In function 'dm_table_set_integrity':
drivers/md/dm-table.c:1285:3: warning: statement with no effect [-Wunused-value]
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Namjae Jeon [Thu, 12 Jan 2012 08:11:56 +0000 (09:11 +0100)]
fs: remove unneeded plug in mpage_readpages()
The block plug in mpage_readpages() duplicates the one in read_pages().
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Martin K. Petersen [Wed, 11 Jan 2012 15:29:31 +0000 (16:29 +0100)]
block: Add BLKROTATIONAL ioctl
Introduce an ioctl which permits applications to query whether a block
device is rotational.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Martin K. Petersen [Wed, 11 Jan 2012 15:27:11 +0000 (16:27 +0100)]
block: Introduce blk_set_stacking_limits function
Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.
This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.
Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 27 Dec 2011 17:52:16 +0000 (18:52 +0100)]
block: remove WARN_ON_ONCE() in exit_io_context()
6e736be7 "block: make ioc get/put interface more conventional and fix
race on alloction" added WARN_ON_ONCE() in exit_io_context() which
triggers if !PF_EXITING. All tasks hitting exit_io_context() from
task exit should have PF_EXITING set but task struct tearing down
after fork failure calls into the function without PF_EXITING,
triggering the condition.
WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92()
Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-
20111222-sasha-dirty #77
Call Trace:
[<
ffffffff810b69a3>] warn_slowpath_common+0x8f/0xb2
[<
ffffffff810b6a77>] warn_slowpath_null+0x18/0x1a
[<
ffffffff8181a7a2>] exit_io_context+0x40/0x92
[<
ffffffff810b58c9>] copy_process+0x126f/0x1453
[<
ffffffff810b5c1b>] do_fork+0x120/0x3e9
[<
ffffffff8106242f>] sys_clone+0x26/0x28
[<
ffffffff82425803>] stub_clone+0x13/0x20
---[ end trace
a2e4eb670b375238 ]---
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Sun, 25 Dec 2011 13:29:14 +0000 (14:29 +0100)]
block: an exiting task should be allowed to create io_context
While fixing io_context creation / task exit race condition,
6e736be7f2 "block: make ioc get/put interface more conventional and
fix race on alloction" also prevented an exiting (%PF_EXITING) task
from creating its own io_context. This is incorrect as exit path may
issue IOs, e.g. from exit_files(), and if those IOs are the first ones
issued by the task, io_context needs to be created to process the IOs.
Combined with the existing problem of io_context / io_cq creation
failure having the possibility of stalling IO, this problem results in
deterministic full IO lockup with certain workloads.
Fix it by allowing io_context creation regardless of %PF_EXITING for
%current.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 19 Dec 2011 09:36:44 +0000 (10:36 +0100)]
block: ioc_cgroup_changed() needs to be exported
With the ioc changed, ioc_cgroup_changed() can be used by modular
code. So ensure that it is exported.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shaohua Li [Fri, 16 Dec 2011 13:00:31 +0000 (14:00 +0100)]
block: recursive merge requests
In my workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1,
a+3,.... When the requests are flushed to queue, a and a+1 are merged
to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3)
aren't merged.
With recursive merge below, the workload throughput gets improved 20%
and context switch drops 60%.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shaohua Li [Fri, 16 Dec 2011 13:00:22 +0000 (14:00 +0100)]
block, cfq: fix empty queue crash caused by request merge
All requests of a queue could be merged to other requests of other queue.
Such queue will not have request in it, but it's in service tree. This
will cause kernel oops.
I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the
issue should exist without the patch.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:42 +0000 (00:33 +0100)]
block, cfq: move icq creation and rq->elv.icq association to block core
Now block layer knows everything necessary to create and associate
icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
get_request() such that, if elevator_type->icq_size is set, requests
are automatically associated with their matching icq's before
elv_set_request(). io_context reference is also managed by block core
on request alloc/free.
* Only ioprio/cgroup changed handling remains from cfq_get_cic().
Collapsed into cfq_set_request().
* This removes queue kicking on icq allocation failure (for now). As
icq allocation failure is rare and the only effect of queue kicking
achieved was possibily accelerating queue processing, this change
shouldn't be noticeable.
There is a larger underlying problem. Unlike request allocation,
icq allocation is not guaranteed to succeed eventually after
retries. The number of icq is unbound and thus mempool can't be the
solution either. This effectively adds allocation dependency on
memory free path and thus possibility of deadlock.
This usually wouldn't happen because icq allocation is not a hot
path and, even when the condition triggers, it's highly unlikely
that none of the writeback workers already has icq.
However, this is still possible especially if elevator is being
switched under high memory pressure, so we better get it fixed.
Probably the only solution is just bypassing elevator and appending
to dispatch queue on any elevator allocation failure.
* Comment added to explain how icq's are managed and synchronized.
This completes cleanup of io_context interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:42 +0000 (00:33 +0100)]
block, cfq: restructure io_cq creation path for io_context interface cleanup
Add elevator_ops->elevator_init_icq_fn() and restructure
cfq_create_cic() and rename it to ioc_create_icq().
The new function expects its caller to pass in io_context, uses
elevator_type->icq_cache, handles generic init, calls the new elevator
operation for elevator specific initialization, and returns pointer to
created or looked up icq. This leaves cfq_icq_pool variable without
any user. Removed.
This prepares for io_context interface cleanup and doesn't introduce
any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:42 +0000 (00:33 +0100)]
block, cfq: move io_cq exit/release to blk-ioc.c
With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced
with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
and q, and freeing automatically handled by blk-ioc. The elevator
operation only need to perform exit operation specific to the elevator
- in cfq's case, exiting the cfqq's.
Also, clearing of io_cq's on q detach is moved to block core and
automatically performed on elevator switch and q release.
Because the q io_cq points to might be freed before RCU callback for
the io_cq runs, blk-ioc code should remember to which cache the io_cq
needs to be freed when the io_cq is released. New field
io_cq->__rcu_icq_cache is added for this purpose. As both the new
field and rcu_head are used only after io_cq is released and the
q/ioc_node fields aren't, they are put into unions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:42 +0000 (00:33 +0100)]
block, cfq: move icq cache management to block core
Let elevators set ->icq_size and ->icq_align in elevator_type and
elv_register() and elv_unregister() respectively create and destroy
kmem_cache for icq.
* elv_register() now can return failure. All callers updated.
* icq caches are automatically named "ELVNAME_io_cq".
* cfq_slab_setup/kill() are collapsed into cfq_init/exit().
* While at it, minor indentation change for iosched_cfq.elevator_name
for consistency.
This will help moving icq management to block core. This doesn't
introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:42 +0000 (00:33 +0100)]
block, cfq: move io_cq lookup to blk-ioc.c
Now that all io_cq related data structures are in block core layer,
io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c.
Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with
parameter return type changes (cfqd -> request_queue, cfq_io_cq ->
io_cq) and cfq_cic_lookup() becomes thin wrapper around
cfq_cic_lookup().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:41 +0000 (00:33 +0100)]
block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
Most of icq management is about to be moved out of cfq into blk-ioc.
This patch prepares for it.
* Move cfqd->icq_list to request_queue->icq_list
* Make request explicitly point to icq instead of through elevator
private data. ->elevator_private[3] is replaced with sub struct elv
which contains icq pointer and priv[2]. cfq is updated accordingly.
* Meaningless clearing of ->elevator_private[0] removed from
elv_set_request(). At that point in code, the field was guaranteed
to be %NULL anyway.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:41 +0000 (00:33 +0100)]
block, cfq: reorganize cfq_io_context into generic and cfq specific parts
Currently io_context and cfq logics are mixed without clear boundary.
Most of io_context is independent from cfq but cfq_io_context handling
logic is dispersed between generic ioc code and cfq.
cfq_io_context represents association between an io_context and a
request_queue, which is a concept useful outside of cfq, but it also
contains fields which are useful only to cfq.
This patch takes out generic part and put it into io_cq (io
context-queue) and the rest into cfq_io_cq (cic moniker remains the
same) which contains io_cq. The following changes are made together.
* cfq_ttime and cfq_io_cq now live in cfq-iosched.c.
* All related fields, functions and constants are renamed accordingly.
* ioc->ioc_data is now "struct io_cq *" instead of "void *" and
renamed to icq_hint.
This prepares for io_context API cleanup. Documentation is currently
sparse. It will be added later.
Changes in this patch are mechanical and don't cause functional
change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:41 +0000 (00:33 +0100)]
block: remove elevator_queue->ops
elevator_queue->ops points to the same ops struct ->elevator_type.ops
is pointing to. The only effect of caching it in elevator_queue is
shorter notation - it doesn't save any indirect derefence.
Relocate elevator_type->list which used only during module init/exit
to the end of the structure, rename elevator_queue->elevator_type to
->type, and replace elevator_queue->ops with elevator_queue->type.ops.
This doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:40 +0000 (00:33 +0100)]
block: reorder elevator switch sequence
Elevator switch sequence first attached the new elevator, then tried
registering it (sysfs) and if that failed attached back the old
elevator. However, sysfs registration doesn't require the elevator to
be attached, so there is no reason to do the "detach, attach new,
register, maybe re-attach old" sequence. It can just do "register,
detach, attach".
* elevator_init_queue() is updated to set ->elevator_data directly and
return 0 / -errno. This allows elevator_exit() on an unattached
elevator.
* __elv_unregister_queue() which was necessary to unregister
unattached q is removed in favor of __elv_register_queue() which can
register unattached q.
* elevator_attach() becomes a single assignment and obscures more then
it helps. Dropped.
This will help cleaning up io_context handling across elevator switch.
This patch doesn't introduce visible behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:40 +0000 (00:33 +0100)]
block, cfq: replace current_io_context() with create_io_context()
When called under queue_lock, current_io_context() triggers lockdep
warning if it hits allocation path. This is because io_context
installation is protected by task_lock which is not IRQ safe, so it
triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
deadlock warning.
Given the restriction, accessor + creator rolled into one doesn't work
too well. Drop current_io_context() and let the users access
task->io_context directly inside queue_lock combined with explicit
creation using create_io_context().
Future ioc updates will further consolidate ioc access and the create
interface will be unexported.
While at it, relocate ioc internal interface declarations in blk.h and
add section comments before and after.
This patch does not introduce functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:40 +0000 (00:33 +0100)]
block, cfq: kill cic->key
Now that lazy paths are removed, cfqd_dead_key() is meaningless and
cic->q can be used whereever cic->key is used. Kill cic->key.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:39 +0000 (00:33 +0100)]
block, cfq: kill ioc_gone
Now that cic's are immediately unlinked under both locks, there's no
need to count and drain cic's before module unload. RCU callback
completion is waited with rcu_barrier().
While at it, remove residual RCU operations on cic_list.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:39 +0000 (00:33 +0100)]
block, cfq: remove delayed unlink
Now that all cic's are immediately unlinked from both ioc and queue,
lazy dropping from lookup path and trimming on elevator unregister are
unnecessary. Kill them and remove now unused elevator_ops->trim().
This also leaves call_for_each_cic() without any user. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:39 +0000 (00:33 +0100)]
block, cfq: unlink cfq_io_context's immediately
cic is association between io_context and request_queue. A cic is
linked from both ioc and q and should be destroyed when either one
goes away. As ioc and q both have their own locks, locking becomes a
bit complex - both orders work for removal from one but not from the
other.
Currently, cfq tries to circumvent this locking order issue with RCU.
ioc->lock nests inside queue_lock but the radix tree and cic's are
also protected by RCU allowing either side to walk their lists without
grabbing lock.
This rather unconventional use of RCU quickly devolves into extremely
fragile convolution. e.g. The following is from cfqd going away too
soon after ioc and q exits raced.
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
[ 88.503444]
Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
RIP: 0010:[<
ffffffff81397628>] [<
ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
...
Call Trace:
[<
ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
[<
ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
[<
ffffffff81389130>] exit_io_context+0x100/0x140
[<
ffffffff81098a29>] do_exit+0x579/0x850
[<
ffffffff81098d5b>] do_group_exit+0x5b/0xd0
[<
ffffffff81098de7>] sys_exit_group+0x17/0x20
[<
ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b
The only real hot path here is cic lookup during request
initialization and avoiding extra locking requires very confined use
of RCU. This patch makes cic removal from both ioc and request_queue
perform double-locking and unlink immediately.
* From q side, the change is almost trivial as ioc->lock nests inside
queue_lock. It just needs to grab each ioc->lock as it walks
cic_list and unlink it.
* From ioc side, it's a bit more difficult because of inversed lock
order. ioc needs its lock to walk its cic_list but can't grab the
matching queue_lock and needs to perform unlock-relock dancing.
Unlinking is now wholly done from put_io_context() and fast path is
optimized by using the queue_lock the caller already holds, which is
by far the most common case. If the ioc accessed multiple devices,
it tries with trylock. In unlikely cases of fast path failure, it
falls back to full double-locking dance from workqueue.
Double-locking isn't the prettiest thing in the world but it's *far*
simpler and more understandable than RCU trick without adding any
meaningful overhead.
This still leaves a lot of now unnecessary RCU logics. Future patches
will trim them.
-v2: Vivek pointed out that cic->q was being dereferenced after
cic->release() was called. Updated to use local variable @this_q
instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:39 +0000 (00:33 +0100)]
block, cfq: fix cic lookup locking
* cfq_cic_lookup() may be called without queue_lock and multiple tasks
can execute it simultaneously for the same shared ioc. Nothing
prevents them racing each other and trying to drop the same dead cic
entry multiple times.
* smp_wmb() in cfq_exit_cic() doesn't really do anything and nothing
prevents cfq_cic_lookup() seeing stale cic->key. This usually
doesn't blow up because by the time cic is exited, all requests have
been drained and new requests are terminated before going through
elevator. However, it can still be triggered by plug merge path
which doesn't grab queue_lock and thus can't check DEAD state
reliably.
This patch updates lookup locking such that,
* Lookup is always performed under queue_lock. This doesn't add any
more locking. The only issue is cfq_allow_merge() which can be
called from plug merge path without holding any lock. For now, this
is worked around by using cic of the request to merge into, which is
guaranteed to have the same ioc. For longer term, I think it would
be best to separate out plug merge method from regular one.
* Spurious ioc->lock locking around cic lookup hint assignment
dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:38 +0000 (00:33 +0100)]
block, cfq: fix race condition in cic creation path and tighten locking
cfq_get_io_context() would fail if multiple tasks race to insert cic's
for the same association. This patch restructures
cfq_get_io_context() such that slow path insertion race is handled
properly.
Note that the restructuring also makes cfq_get_io_context() called
under queue_lock and performs both ioc and cfqd insertions while
holding both ioc and queue locks. This is part of on-going locking
tightening and will be used to simplify synchronization rules.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:38 +0000 (00:33 +0100)]
block, cfq: move ioc ioprio/cgroup changed handling to cic
ioprio/cgroup change was handled by marking the changed state in ioc
and, on the following access to the ioc, performing RCU-protected
iteration through all cic's grabbing the matching queue_lock.
This patch moves the changed state to each cic. When ioprio or cgroup
changes, the respective bit is set on all cic's of the ioc and when
each of those cic (not ioc) is accessed, change is applied for that
specific ioc-queue pair.
This also fixes the following two race conditions between setting and
clearing of changed states.
* Missing barrier between assign/load of ioprio and ioprio_changed
allowed applying old ioprio.
* Change requests could happen between application of change and
clearing of changed variables.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:38 +0000 (00:33 +0100)]
block, cfq: misc updates to cfq_io_context
Make the following changes to prepare for ioc/cic management cleanup.
* Add cic->q so that ioc can determine the associated queue without
querying cfq. This will eventually replace ->key.
* Factor out cfq_release_cic() from cic_free_func(). This function
assumes that the caller handled locking.
* Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it
take only @cic.
* Restructure cfq_cic_link() for future updates.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:38 +0000 (00:33 +0100)]
block: misc updates to blk_get_queue()
* blk_get_queue() is peculiar in that it returns 0 on success and 1 on
failure instead of 0 / -errno or boolean. Update it such that it
returns %true on success and %false on failure.
* Make sure the caller checks for the return value.
* Separate out __blk_get_queue() which doesn't check whether @q is
dead and put it in blk.h. This will be used later.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:38 +0000 (00:33 +0100)]
block: make ioc get/put interface more conventional and fix race on alloction
Ignoring copy_io() during fork, io_context can be allocated from two
places - current_io_context() and set_task_ioprio(). The former is
always called from local task while the latter can be called from
different task. The synchornization between them are peculiar and
dubious.
* current_io_context() doesn't grab task_lock() and assumes that if it
saw %NULL ->io_context, it would stay that way until allocation and
assignment is complete. It has smp_wmb() between alloc/init and
assignment.
* set_task_ioprio() grabs task_lock() for assignment and does
smp_read_barrier_depends() between "ioc = task->io_context" and "if
(ioc)". Unfortunately, this doesn't achieve anything - the latter
is not a dependent load of the former. ie, if ioc itself were being
dereferenced "ioc->xxx", it would mean something (not sure what tho)
but as the code currently stands, the dependent read barrier is
noop.
As only one of the the two test-assignment sequences is task_lock()
protected, the task_lock() can't do much about race between the two.
Nothing prevents current_io_context() and set_task_ioprio() allocating
its own ioc for the same task and overwriting the other's.
Also, set_task_ioprio() can race with exiting task and create a new
ioc after exit_io_context() is finished.
ioc get/put doesn't have any reason to be complex. The only hot path
is accessing the existing ioc of %current, which is simple to achieve
given that ->io_context is never destroyed as long as the task is
alive. All other paths can happily go through task_lock() like all
other task sub structures without impacting anything.
This patch updates ioc get/put so that it becomes more conventional.
* alloc_io_context() is replaced with get_task_io_context(). This is
the only interface which can acquire access to ioc of another task.
On return, the caller has an explicit reference to the object which
should be put using put_io_context() afterwards.
* The functionality of current_io_context() remains the same but when
creating a new ioc, it shares the code path with
get_task_io_context() and always goes through task_lock().
* get_io_context() now means incrementing ref on an ioc which the
caller already has access to (be that an explicit refcnt or implicit
%current one).
* PF_EXITING inhibits creation of new io_context and once
exit_io_context() is finished, it's guaranteed that both ioc
acquisition functions return %NULL.
* All users are updated. Most are trivial but
smp_read_barrier_depends() removal from cfq_get_io_context() needs a
bit of explanation. I suppose the original intention was to ensure
ioc->ioprio is visible when set_task_ioprio() allocates new
io_context and installs it; however, this wouldn't have worked
because set_task_ioprio() doesn't have wmb between init and install.
There are other problems with this which will be fixed in another
patch.
* While at it, use NUMA_NO_NODE instead of -1 for wildcard node
specification.
-v2: Vivek spotted contamination from debug patch. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:37 +0000 (00:33 +0100)]
block: misc ioc cleanups
* int return from put_io_context() wasn't used by anybody. Make it
return void like other put functions and docbook-fy the function
comment.
* Reorder dummy declarations for !CONFIG_BLOCK case a bit.
* Make alloc_ioc_context() use __GFP_ZERO allocation, take init out of
if block and drop 0'ing.
* Docbook-fy current_io_context() comment.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:37 +0000 (00:33 +0100)]
block, cfq: move cfqd->cic_index to q->id
cfq allocates per-queue id using ida and uses it to index cic radix
tree from io_context. Move it to q->id and allocate on queue init and
free on queue release. This simplifies cfq a bit and will allow for
further improvements of io context life-cycle management.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:37 +0000 (00:33 +0100)]
block: add missing blk_queue_dead() checks
blk_insert_cloned_request(), blk_execute_rq_nowait() and
blk_flush_plug_list() either didn't check whether the queue was dead
or did it without holding queue_lock. Update them so that dead state
is checked while holding queue_lock.
AFAICS, this plugs all holes (requeue doesn't matter as the request is
transitioning atomically from in_flight to queued).
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:37 +0000 (00:33 +0100)]
block: fix drain_all condition in blk_drain_queue()
When trying to drain all requests, blk_drain_queue() checked only
q->rq.count[]; however, this only tracks REQ_ALLOCED requests. This
patch updates blk_drain_queue() such that it looks at all the counters
and queues so that request_queue is actually empty on completion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:37 +0000 (00:33 +0100)]
block: add blk_queue_dead()
There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead()
macro and use it.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 13 Dec 2011 23:33:37 +0000 (00:33 +0100)]
block, sx8: kill blk_insert_request()
The only user left for blk_insert_request() is sx8 and it can be
trivially switched to use blk_execute_rq_nowait() - special requests
aren't included in io stat and sx8 doesn't use block layer tagging.
Switch sx8 and kill blk_insert_requeset().
This patch doesn't introduce any functional difference.
Only compile tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 9 Dec 2011 23:09:32 +0000 (15:09 -0800)]
Linux 3.2-rc5
Linus Torvalds [Fri, 9 Dec 2011 22:45:44 +0000 (14:45 -0800)]
Merge git://git.samba.org/sfrench/cifs-2.6
* git://git.samba.org/sfrench/cifs-2.6:
cifs: check for NULL last_entry before calling cifs_save_resume_key
cifs: attempt to freeze while looping on a receive attempt
cifs: Fix sparse warning when calling cifs_strtoUCS
CIFS: Add descriptions to the brlock cache functions
Linus Torvalds [Fri, 9 Dec 2011 22:45:12 +0000 (14:45 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: Calling __pa() with an ioremap()ed address is invalid
x86, hpet: Immediately disable HPET timer 1 if rtc irq is masked
x86/intel_mid: Kconfig select fix
x86/intel_mid: Fix the Kconfig for MID selection
Linus Torvalds [Fri, 9 Dec 2011 22:41:50 +0000 (14:41 -0800)]
Merge branch 'spi/for-3.2' of git://git.pengutronix.de/git/wsa/linux-2.6
* 'spi/for-3.2' of git://git.pengutronix.de/git/wsa/linux-2.6:
spi/gpio: fix section mismatch warning
spi/fsl-espi: disable CONFIG_SPI_FSL_ESPI=m build
spi/nuc900: Include linux/module.h
spi/ath79: fix compile error due to missing include
Linus Torvalds [Fri, 9 Dec 2011 16:18:08 +0000 (08:18 -0800)]
Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
md: raid5 crash during degradation
md/raid5: never wait for bad-block acks on failed device.
md: ensure new badblocks are handled promptly.
md: bad blocks shouldn't cause a Blocked status on a Faulty device.
md: take a reference to mddev during sysfs access.
md: refine interpretation of "hold_active == UNTIL_IOCTL".
md/lock: ensure updates to page_attrs are properly locked.
Linus Torvalds [Fri, 9 Dec 2011 16:08:57 +0000 (08:08 -0800)]
Merge git://git./linux/kernel/git/cmetcalf/linux-tile
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
arch/tile: use new generic {enable,disable}_percpu_irq() routines
drivers/net/ethernet/tile: use skb_frag_page() API
asm-generic/unistd.h: support new process_vm_{readv,write} syscalls
arch/tile: fix double-free bug in homecache_free_pages()
arch/tile: add a few #includes and an EXPORT to catch up with kernel changes.
Linus Torvalds [Fri, 9 Dec 2011 16:08:14 +0000 (08:08 -0800)]
Merge branch 'iommu/fixes' of git://git./linux/kernel/git/joro/iommu
* 'iommu/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
MAINTAINERS: Update amd-iommu F: patterns
iommu/amd: Fix typo in kernel-parameters.txt
iommu/msm: Fix compile error in mach-msm/devices-iommu.c
Fix comparison using wrong pointer variable in dma debug code
Linus Torvalds [Fri, 9 Dec 2011 16:07:42 +0000 (08:07 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/tiwai/sound
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek - Fix lost speaker volume controls
ALSA: hda/realtek - Create "Bass Speaker" for two speaker pins
ALSA: hda/realtek - Don't create extra controls with channel suffix
ALSA: hda - Fix remaining VREF mute-LED NID check in post-3.1 changes
ALSA: hda - Fix GPIO LED setup for IDT 92HD75 codecs
ASoC: Provide a more complete DMA driver stub
ASoC: Remove references to corgi and spitz from machine driver document
ASoC: Make SND_SOC_MX27VIS_AIC32X4 depend on I2C
ASoC: Fix dependency for SND_SOC_RAUMFELD and SND_PXA2XX_SOC_HX4700
ASoC: uda1380: Return proper error in uda1380_modinit failure path
ASoC: kirkwood: Make SND_KIRKWOOD_SOC_OPENRD and SND_KIRKWOOD_SOC_T5325 depend on I2C
ASoC: Mark WM8994 ADC muxes as virtual
ALSA: hda/realtek - Fix Oops in alc_mux_select()
ALSA: sis7019 - give slow codecs more time to reset
Linus Torvalds [Fri, 9 Dec 2011 16:07:24 +0000 (08:07 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Do no try to schedule task events if there are none
lockdep, kmemcheck: Annotate ->lock in lockdep_init_map()
perf header: Use event_name() to get an event name
perf stat: Failure with "Operation not supported"
Mandeep Singh Baines [Thu, 8 Dec 2011 22:34:44 +0000 (14:34 -0800)]
sys_getppid: add missing rcu_dereference
In order to safely dereference current->real_parent inside an
rcu_read_lock, we need an rcu_dereference.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Thu, 8 Dec 2011 22:34:42 +0000 (14:34 -0800)]
rapidio/tsi721: modify PCIe capability settings
Modify initialization of PCIe capability registers in Tsi721 mport driver:
- change Completion Timeout value to avoid unexpected data transfer
aborts during intensive traffic.
- replace hardcoded offset of PCIe capability block by making it use the
common function.
This patch is applicable to kernel versions starting from 3.2-rc1.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Thu, 8 Dec 2011 22:34:36 +0000 (14:34 -0800)]
rapidio/tsi721: fix mailbox resource reporting
Bug fix for Tsi721 RapidIO mport driver: Tsi721 supports four RapidIO
mailboxes (MBOX0 - MBOX3) as defined by RapidIO specification. Mailbox
resources has to be properly reported to allow use of all available
mailboxes (initial version reports only MBOX0).
This patch is applicable to kernel versions staring from 3.2-rc1.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandre Bounine [Thu, 8 Dec 2011 22:34:35 +0000 (14:34 -0800)]
rapidio/tsi721: switch to dma_zalloc_coherent
Replace the pair dma_alloc_coherent()+memset() with the new
dma_zalloc_coherent() added by Andrew Morton for kernel version 3.2
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 8 Dec 2011 22:34:32 +0000 (14:34 -0800)]
procfs: do not overflow get_{idle,iowait}_time for nohz
Since commit
a25cac5198d4 ("proc: Consider NO_HZ when printing idle and
iowait times") we are reporting idle/io_wait time also while a CPU is
tickless. We rely on get_{idle,iowait}_time functions to retrieve
proper data.
These functions, however, use usecs_to_cputime to translate micro
seconds time to cputime64_t. This is just an alias to usecs_to_jiffies
which reduces the data type from u64 to unsigned int and also checks
whether the given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET)
and returns MAX_JIFFY_OFFSET in that case.
When we overflow depends on CONFIG_HZ but especially for CONFIG_HZ_300
it is quite low (
1431649781) so we are getting MAX_JIFFY_OFFSET for
>3000s! until we overflow unsigned int. Just for reference
CONFIG_HZ_100 has an overflow window around 20s, CONFIG_HZ_250 ~8s and
CONFIG_HZ_1000 ~2s.
This results in a bug when people saw [h]top going mad reporting 100%
CPU usage even though there was basically no CPU load. The reason was
simply that /proc/stat stopped reporting idle/io_wait changes (and
reported MAX_JIFFY_OFFSET) and so the only change happening was for user
system time.
Let's use nsecs_to_jiffies64 instead which doesn't reduce the precision
to 32b type and it is much more appropriate for cumulative time values
(unlike usecs_to_jiffies which intended for timeout calculations).
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Artem S. Tashkinov <t.artem@mailcity.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Thu, 8 Dec 2011 22:34:30 +0000 (14:34 -0800)]
mm: vmalloc: check for page allocation failure before vmlist insertion
Commit
f5252e00 ("mm: avoid null pointer access in vm_struct via
/proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
it is fully initialised. Unfortunately, it did not check that
__vmalloc_area_node() successfully populated the area. In the event of
allocation failure, the vmalloc area is freed but the pointer to freed
memory is inserted into the vmlist leading to a a crash later in
get_vmalloc_info().
This patch adds a check for ____vmalloc_area_node() failure within
__vmalloc_node_range. It does not use "goto fail" as in the previous
error path as a warning was already displayed by __vmalloc_area_node()
before it called vfree in its failure path.
Credit goes to Luciano Chavez for doing all the real work of identifying
exactly where the problem was.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
Tested-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.1.x+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 8 Dec 2011 22:34:27 +0000 (14:34 -0800)]
mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocks
setup_zone_migrate_reserve() expects that zone->start_pfn starts at
pageblock_nr_pages aligned pfn otherwise we could access beyond an
existing memblock resulting in the following panic if
CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:
IP: [<
c02d331d>] setup_zone_migrate_reserve+0xcd/0x180
*pdpt =
0000000000000000 *pde =
f000ff53f000ff53
Oops: 0000 [#1] SMP
Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
EIP: 0060:[<
c02d331d>] EFLAGS:
00010006 CPU: 0
EIP is at setup_zone_migrate_reserve+0xcd/0x180
EAX:
000c0000 EBX:
f5801fc0 ECX:
000c0000 EDX:
00000000
ESI:
000c01fe EDI:
000c01fe EBP:
00140000 ESP:
f2475f58
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 1, ti=
f2474000 task=
f2472cd0 task.ti=
f2474000)
Call Trace:
[<
c02d389c>] __setup_per_zone_wmarks+0xec/0x160
[<
c02d3a1f>] setup_per_zone_wmarks+0xf/0x20
[<
c08a771c>] init_per_zone_wmark_min+0x27/0x86
[<
c020111b>] do_one_initcall+0x2b/0x160
[<
c086639d>] kernel_init+0xbe/0x157
[<
c05cae26>] kernel_thread_helper+0x6/0xd
Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
EIP: [<
c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:
f2475f58
CR2:
00000000000012b4
We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
highstart_pfn = 0x36ffe.
The issue was introduced in 3.0-rc1 by
6d3163ce ("mm: check if any page
in a pageblock is reserved before marking it MIGRATE_RESERVE").
Make sure that start_pfn is always aligned to pageblock_nr_pages to
ensure that pfn_valid s always called at the start of each pageblock.
Architectures with holes in pageblocks will be correctly handled by
pfn_valid_within in pageblock_is_reserved.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Dang Bo <bdang@vmware.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Arve Hjnnevg <arve@android.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [3.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hillf Danton [Thu, 8 Dec 2011 22:34:20 +0000 (14:34 -0800)]
mm/migrate.c: pair unlock_page() and lock_page() when migrating huge pages
Avoid unlocking and unlocked page if we failed to lock it.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Youquan Song [Thu, 8 Dec 2011 22:34:18 +0000 (14:34 -0800)]
thp: set compound tail page _count to zero
Commit
70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times. But the current kernel does not
set page_tail->_count to zero if a 1GB page is utilized. So when an
IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a
tail page's _count does not equal zero.
kernel BUG at include/linux/mm.h:386!
invalid opcode: 0000 [#1] SMP
Call Trace:
gup_pud_range+0xb8/0x19d
get_user_pages_fast+0xcb/0x192
? trace_hardirqs_off+0xd/0xf
hva_to_pfn+0x119/0x2f2
gfn_to_pfn_memslot+0x2c/0x2e
kvm_iommu_map_pages+0xfd/0x1c1
kvm_iommu_map_memslots+0x7c/0xbd
kvm_iommu_map_guest+0xaa/0xbf
kvm_vm_ioctl_assigned_device+0x2ef/0xa47
kvm_vm_ioctl+0x36c/0x3a2
do_vfs_ioctl+0x49e/0x4e4
sys_ioctl+0x5a/0x7c
system_call_fastpath+0x16/0x1b
RIP gup_huge_pud+0xf2/0x159
Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Youquan Song [Thu, 8 Dec 2011 22:34:16 +0000 (14:34 -0800)]
thp: add compound tail page _mapcount when mapped
With the 3.2-rc kernel, IOMMU 2M pages in KVM works. But when I tried
to use IOMMU 1GB pages in KVM, I encountered an oops and the 1GB page
failed to be used.
The root cause is that 1GB page allocation calls gup_huge_pud() while 2M
page calls gup_huge_pmd. If compound pages are used and the page is a
tail page, gup_huge_pmd() increases _mapcount to record tail page are
mapped while gup_huge_pud does not do that.
So when the mapped page is relesed, it will result in kernel oops
because the page is not marked mapped.
This patch add tail process for compound page in 1GB huge page which
keeps the same process as 2M page.
Reproduce like:
1. Add grub boot option: hugepagesz=1G hugepages=8
2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages
3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages
-net none -device pci-assign,host=07:00.1
kernel BUG at mm/swap.c:114!
invalid opcode: 0000 [#1] SMP
Call Trace:
put_page+0x15/0x37
kvm_release_pfn_clean+0x31/0x36
kvm_iommu_put_pages+0x94/0xb1
kvm_iommu_unmap_memslots+0x80/0xb6
kvm_assign_device+0xba/0x117
kvm_vm_ioctl_assigned_device+0x301/0xa47
kvm_vm_ioctl+0x36c/0x3a2
do_vfs_ioctl+0x49e/0x4e4
sys_ioctl+0x5a/0x7c
system_call_fastpath+0x16/0x1b
RIP put_compound_page+0xd4/0x168
Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Zijlstra [Thu, 8 Dec 2011 22:34:13 +0000 (14:34 -0800)]
printk: avoid double lock acquire
Commit
4f2a8d3cf5e ("printk: Fix console_sem vs logbuf_lock unlock race")
introduced another silly bug where we would want to acquire an already
held lock. Avoid this.
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki [Thu, 8 Dec 2011 22:34:10 +0000 (14:34 -0800)]
memcg: update maintainers
More players joined to memory cgroup developments and Johannes' great work
changed internal design of memory cgroup dramatically. And he will do
more works. Michal Hokko did many bug fixes and know memory cgroup very
well. Daisuke Nishimura helped us very much but he seems busy now.
Thanks to his works.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jonghwan Choi [Thu, 8 Dec 2011 22:34:02 +0000 (14:34 -0800)]
drivers/rtc/rtc-s3c.c: fix driver clock enable/disable balance issues
If an error occurs after the clock is enabled, the enable/disable state
can become unbalanced.
Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Kukjin Kim <kgene.kim@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Thu, 8 Dec 2011 22:34:00 +0000 (14:34 -0800)]
CREDITS: update Kees's expired fingerprint and fix details
Small clean-up for my CREDITS entry; the GPG fingerprint was not up to
date, so I fixed other details at the same time too.
Signed-off-by: Kees Cook <kees@outflux.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Thu, 8 Dec 2011 22:33:57 +0000 (14:33 -0800)]
thp: reduce khugepaged freezing latency
khugepaged can sometimes cause suspend to fail, requiring that the user
retry the suspend operation.
Use wait_event_freezable_timeout() instead of
schedule_timeout_interruptible() to avoid missing freezer wakeups. A
try_to_freeze() would have been needed in the khugepaged_alloc_hugepage
tight loop too in case of the allocation failing repeatedly, and
wait_event_freezable_timeout will provide it too.
khugepaged would still freeze just fine by trying again the next minute
but it's better if it freezes immediately.
Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Jiri Slaby <jslaby@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Claudio Scordino [Thu, 8 Dec 2011 22:33:56 +0000 (14:33 -0800)]
fs/proc/meminfo.c: fix compilation error
Fix the error message "directives may not be used inside a macro argument"
which appears when the kernel is compiled for the cris architecture.
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Konstantin Khlebnikov [Thu, 8 Dec 2011 22:33:54 +0000 (14:33 -0800)]
vmscan: use atomic-long for shrinker batching
Use atomic-long operations instead of looping around cmpxchg().
[akpm@linux-foundation.org: massage atomic.h inclusions]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Konstantin Khlebnikov [Thu, 8 Dec 2011 22:33:51 +0000 (14:33 -0800)]
vmscan: fix initial shrinker size handling
A shrinker function can return -1, means that it cannot do anything
without a risk of deadlock. For example prune_super() does this if it
cannot grab a superblock refrence, even if nr_to_scan=0. Currently we
interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
according to this. So the next time around this shrinker can cause
really big pressure. Let's skip such shrinkers instead.
Also make total_scan signed, otherwise the check (total_scan < 0) below
never works.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Fri, 9 Dec 2011 04:21:40 +0000 (20:21 -0800)]
MAINTAINERS: Update amd-iommu F: patterns
Commit
29b68415e335 ("x86: amd_iommu: move to drivers/iommu/")
moved the files, update the patterns.
CC: Ohad Ben-Cohen <ohad@wizery.com>
CC: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Matt Fleming [Fri, 18 Nov 2011 13:09:11 +0000 (13:09 +0000)]
x86, efi: Calling __pa() with an ioremap()ed address is invalid
If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set
in ->attribute we currently call set_memory_uc(), which in turn
calls __pa() on a potentially ioremap'd address.
On CONFIG_X86_32 this is invalid, resulting in the following
oops on some machines:
BUG: unable to handle kernel paging request at
f7f22280
IP: [<
c10257b9>] reserve_ram_pages_type+0x89/0x210
[...]
Call Trace:
[<
c104f8ca>] ? page_is_ram+0x1a/0x40
[<
c1025aff>] reserve_memtype+0xdf/0x2f0
[<
c1024dc9>] set_memory_uc+0x49/0xa0
[<
c19334d0>] efi_enter_virtual_mode+0x1c2/0x3aa
[<
c19216d4>] start_kernel+0x291/0x2f2
[<
c19211c7>] ? loglevel+0x1b/0x1b
[<
c19210bf>] i386_start_kernel+0xbf/0xc8
A better approach to this problem is to map the memory region
with the correct attributes from the start, instead of modifying
it after the fact. The uncached case can be handled by
ioremap_nocache() and the cached by ioremap_cache().
Despite first impressions, it's not possible to use
ioremap_cache() to map all cached memory regions on
CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really
don't like being mapped into the vmalloc space, as detailed in
the following bug report,
https://bugzilla.redhat.com/show_bug.cgi?id=748516
Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA
regions are covered by the direct kernel mapping table on
CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI
regions via the direct kernel mapping with the initial call to
init_memory_mapping() in setup_arch(), whereas previously these
regions wouldn't be mapped if they were after the last E820_RAM
region until efi_ioremap() was called. Doing it this way allows
us to delete efi_ioremap() completely.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jeff Layton [Fri, 2 Dec 2011 01:23:34 +0000 (20:23 -0500)]
cifs: check for NULL last_entry before calling cifs_save_resume_key
Prior to commit
eaf35b1, cifs_save_resume_key had some NULL pointer
checks at the top. It turns out that at least one of those NULL
pointer checks is needed after all.
When the LastNameOffset in a FIND reply appears to be beyond the end of
the buffer, CIFSFindFirst and CIFSFindNext will set srch_inf.last_entry
to NULL. Since
eaf35b1, the code will now oops in this situation.
Fix this by having the callers check for a NULL last entry pointer
before calling cifs_save_resume_key. No change is needed for the
call site in cifs_readdir as it's not reachable with a NULL
current_entry pointer.
This should fix:
https://bugzilla.redhat.com/show_bug.cgi?id=750247
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Reported-by: Adam G. Metzler <adamgmetzler@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Jeff Layton [Fri, 2 Dec 2011 01:22:41 +0000 (20:22 -0500)]
cifs: attempt to freeze while looping on a receive attempt
In the recent overhaul of the demultiplex thread receive path, I
neglected to ensure that we attempt to freeze on each pass through the
receive loop.
Reported-and-Tested-by: Woody Suwalski <terraluna977@gmail.com>
Reported-and-Tested-by: Adam Williamson <awilliam@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Steve French [Thu, 10 Nov 2011 18:48:20 +0000 (12:48 -0600)]
cifs: Fix sparse warning when calling cifs_strtoUCS
Fix sparse endian check warning while calling cifs_strtoUCS
CHECK fs/cifs/smbencrypt.c
fs/cifs/smbencrypt.c:216:37: warning: incorrect type in argument 1
(different base types)
fs/cifs/smbencrypt.c:216:37: expected restricted __le16 [usertype] *<noident>
fs/cifs/smbencrypt.c:216:37: got unsigned short *<noident>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com
Pavel Shilovsky [Mon, 7 Nov 2011 13:11:24 +0000 (16:11 +0300)]
CIFS: Add descriptions to the brlock cache functions
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Adam Kwolek [Fri, 9 Dec 2011 03:26:11 +0000 (14:26 +1100)]
md: raid5 crash during degradation
NULL pointer access causes crash in raid5 module.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Linus Torvalds [Thu, 8 Dec 2011 21:21:28 +0000 (13:21 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimers: Fix time comparison
ptp: Fix clock_getres() implementation
Linus Torvalds [Thu, 8 Dec 2011 21:18:59 +0000 (13:18 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: drop spin lock when memory alloc fails
Btrfs: check if the to-be-added device is writable
Btrfs: try cluster but don't advance in search list
Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE
Linus Torvalds [Thu, 8 Dec 2011 21:18:38 +0000 (13:18 -0800)]
Merge branch 'fixes' of git://git./linux/kernel/git/arm/arm-soc
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (28 commits)
ARM: sa1100: fix build error
ARM: OMAP1: recalculate loops per jiffy after dpll1 reprogram
ARM: davinci: dm365 evm: align nand partition table to u-boot
ARM: davinci: da850 evm: change audio edma event queue to EVENTQ_0
ARM: davinci: dm646x evm: wrong register used in setup_vpif_input_channel_mode
ARM: davinci: dm646x does not have a DSP domain
ARM: davinci: psc: fix incorrect offsets
ARM: davinci: psc: fix incorrect mask
ARM: mx28: LRADC macro rename
arm: mx23: recognise stmp378x as mx23
ARM: mxs: fix machines' initializers order
ARM: mxs/tx28: add __initconst for fec pdata
ARM: S3C64XX: Staticise s3c6400_sysclass
ARM: S3C64XX: Add linux/export.h to dev-spi.c
ARM: S3C64XX: Remove extern from definition of framebuffer setup call
MAINTAINERS: Extend Samsung patterns to cover SPI and ASoC drivers
MAINTAINERS: Add linux-samsung-soc mailing list for Samsung
MAINTAINERS: Consolidate Samsung MAINTAINERS
ARM: CSR: PM: fix build error due to undeclared 'THIS_MODULE'
ARM: CSR: fix build error due to new mdesc->dma_zone_size
...
Tetsuo Handa [Thu, 8 Dec 2011 12:24:06 +0000 (21:24 +0900)]
TOMOYO: Fix pathname handling of disconnected paths.
Current tomoyo_realpath_from_path() implementation returns strange pathname
when calculating pathname of a file which belongs to lazy unmounted tree.
Use local pathname rather than strange absolute pathname in that case.
Also, this patch fixes a regression by commit
02125a82 "fix apparmor
dereferencing potentially freed dentry, sanitize __d_path() API".
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark Langsdorf [Fri, 18 Nov 2011 15:33:06 +0000 (16:33 +0100)]
x86, hpet: Immediately disable HPET timer 1 if rtc irq is masked
When HPET is operating in RTC mode, the TN_ENABLE bit on timer1
controls whether the HPET or the RTC delivers interrupts to irq8. When
the system goes into suspend, the RTC driver sends a signal to the
HPET driver so that the HPET releases control of irq8, allowing the
RTC to wake the system from suspend. The switchover is accomplished by
a write to the HPET configuration registers which currently only
occurs while servicing the HPET interrupt.
On some systems, I have seen the system suspend before an HPET
interrupt occurs, preventing the write to the HPET configuration
register and leaving the HPET in control of the irq8. As the HPET is
not active during suspend, it does not generate a wake signal and RTC
alarms do not work.
This patch forces the HPET driver to immediately transfer control of
the irq8 channel to the RTC instead of waiting until the next
interrupt event.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Arnd Bergmann [Thu, 8 Dec 2011 15:52:23 +0000 (15:52 +0000)]
Merge branch 'fixes' of git://github.com/hzhuang1/linux into fixes
Liu Bo [Thu, 8 Dec 2011 01:08:40 +0000 (20:08 -0500)]
Btrfs: drop spin lock when memory alloc fails
Drop spin lock in convert_extent_bit() when memory alloc fails,
otherwise, it will be a deadlock.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Li Zefan [Thu, 8 Dec 2011 01:08:40 +0000 (20:08 -0500)]
Btrfs: check if the to-be-added device is writable
If we call ioctl(BTRFS_IOC_ADD_DEV) directly, we'll succeed in adding
a readonly device to a btrfs filesystem, and btrfs will write to
that device, emitting kernel errors:
[ 3109.833692] lost page write due to I/O error on loop2
[ 3109.833720] lost page write due to I/O error on loop2
...
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Alexandre Oliva [Thu, 8 Dec 2011 01:08:40 +0000 (20:08 -0500)]
Btrfs: try cluster but don't advance in search list
When we find an existing cluster, we switch to its block group as the
current block group, possibly skipping multiple blocks in the process.
Furthermore, under heavy contention, multiple threads may fail to
allocate from a cluster and then release just-created clusters just to
proceed to create new ones in a different block group.
This patch tries to allocate from an existing cluster regardless of its
block group, and doesn't switch to that group, instead proceeding to
try to allocate a cluster from the group it was iterating before the
attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Jett.Zhou [Wed, 30 Nov 2011 06:32:54 +0000 (14:32 +0800)]
ARM: sa1100: fix build error
arm-eabi-4.4.3-ld:--defsym zreladdr=: syntax error
make[2]: *** [arch/arm/boot/compressed/vmlinux] Error 1
make[1]: *** [arch/arm/boot/compressed/vmlinux] Error 2
make: *** [uImage] Error 2
Signed-off-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Signed-off-by: Jett.Zhou <jtzhou@marvell.com>
NeilBrown [Thu, 8 Dec 2011 05:27:57 +0000 (16:27 +1100)]
md/raid5: never wait for bad-block acks on failed device.
Once a device is failed we really want to completely ignore it.
It should go away soon anyway.
In particular the presence of bad blocks on it should not cause us to
block as we won't be trying to write there anyway.
So as soon as we can check if a device is Faulty, do so and pretend
that it is already gone if it is Faulty.
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Thu, 8 Dec 2011 05:26:08 +0000 (16:26 +1100)]
md: ensure new badblocks are handled promptly.
When we mark blocks as bad we need them to be acknowledged by the
metadata handler promptly.
For an in-kernel metadata handler that was already being done. But
for an external metadata handler we need to alert it of the change by
sending a notification through the sysfs file. This adds that
notification.
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Thu, 8 Dec 2011 05:22:48 +0000 (16:22 +1100)]
md: bad blocks shouldn't cause a Blocked status on a Faulty device.
Once a device is marked Faulty the badblocks - whether acknowledged or
not - become irrelevant. So they shouldn't cause the device to be
marked as Blocked.
Without this patch, a process might write "-blocked" to clear the
Blocked status, but while that will correctly fail the device, it
won't remove the apparent 'blocked' status.
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Thu, 8 Dec 2011 04:49:46 +0000 (15:49 +1100)]
md: take a reference to mddev during sysfs access.
When we are accessing an mddev via sysfs we know that the
mddev cannot disappear because it has an embedded kobj which
is refcounted by sysfs.
And we also take the mddev_lock.
However this is not enough.
The final mddev_put could have been called and the
mddev_delayed_delete is waiting for sysfs to let go so it can destroy
the kobj and mddev.
In this state there are a lot of changes that should not be attempted.
To to guard against this we:
- initialise mddev->all_mddevs in on last put so the state can be
easily detected.
- in md_attr_show and md_attr_store, check ->all_mddevs under
all_mddevs_lock and mddev_get the mddev if it still appears to
be active.
This means that if we get to sysfs as the mddev is being deleted we
will get -EBUSY.
rdev_attr_store and rdev_attr_show are similar but already have
sufficient protection. They check that rdev->mddev still points to
mddev after taking mddev_lock. As this is cleared before delayed
removal which can only be requested under the mddev_lock, this
ensure the rdev and mddev are still alive.
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Thu, 8 Dec 2011 04:49:12 +0000 (15:49 +1100)]
md: refine interpretation of "hold_active == UNTIL_IOCTL".
We like md devices to disappear when they really are not needed.
However it is not possible to tell from the current state whether it
is needed or not. We can only tell from recent history of changes.
In particular immediately after we create an md device it looks very
similar to immediately after we have finished with it.
So we always preserve a newly created md device until something
significant happens. This state is stored in 'hold_active'.
The normal case is to keep it until an ioctl happens, as that will
normally either activate it, or explicitly de-activate it. If it
doesn't then it was probably created by mistake and it is now time to
get rid of it.
We can also modify an array via sysfs (instead of via ioctl) and we
currently treat any change via sysfs like an ioctl as a sign that if
it now isn't more active, it should be destroyed.
However this is not appropriate as changes made via sysfs are more
gradual so we should look for a more definitive change.
So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when
the array_state is changed via sysfs. Other changes via sysfs
are ignored.
Signed-off-by: NeilBrown <neilb@suse.de>
Olof Johansson [Thu, 8 Dec 2011 04:36:27 +0000 (20:36 -0800)]
Merge branch 'fixes' of git://git./linux/kernel/git/tmlind/linux-omap into fixes
Linus Torvalds [Thu, 8 Dec 2011 02:18:27 +0000 (18:18 -0800)]
Merge branch '3.2-rc-fixes' of git://git./linux/kernel/git/nab/target-pending
* '3.2-rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (25 commits)
iscsi-target: Fix hex2bin warn_unused compile message
target: Don't return an error if disabling unsupported features
target/rd: fix or rewrite the copy routine
target/rd: simplify the page/offset computation
target: remove the unused se_dev_list
target/file: walk properly over sg list
target: remove unused struct fields
target: Fix page length in emulated INQUIRY VPD page 86h
target: Handle 0 correctly in transport_get_sectors_6()
target: Don't return an error status for 0-length READ and WRITE
iscsi-target: Use kmemdup rather than duplicating its implementation
iscsi-target: Add missing F_BIT for iscsi_tm_rsp
iscsi-target: Fix residual count hanlding + remove iscsi_cmd->residual_count
target: Reject SCSI data overflow for fabrics using transport_generic_map_mem_to_cmd
target: remove the unused t_task_pt_sgl and t_task_pt_sgl_num se_cmd fields
target: remove the t_tasks_bidi se_cmd field
target: remove the t_tasks_fua se_cmd field
target: remove the se_ordered_node se_cmd field
target: remove the se_obj_ptr and se_orig_obj_ptr se_cmd fields
target: Drop config_item_name usage in fabric TFO->free_wwn()
...
Alexandre Oliva [Thu, 8 Dec 2011 00:50:42 +0000 (19:50 -0500)]
Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE
If we reach LOOP_NO_EMPTY_SIZE, we won't even try to use a cluster that
others might have set up. Odds are that there won't be one, but if
someone else succeeded in setting it up, we might as well use it, even
if we don't try to set up a cluster again.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Linus Torvalds [Thu, 8 Dec 2011 00:13:54 +0000 (16:13 -0800)]
Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: fix the logspace waiting algorithm
xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels
xfs: fix allocation length overflow in xfs_bmapi_write()
Linus Torvalds [Thu, 8 Dec 2011 00:12:29 +0000 (16:12 -0800)]
Merge branch 'pm-fixes' of git://git./linux/kernel/git/rafael/linux-pm
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / Driver core: leave runtime PM enabled during system shutdown
Linus Torvalds [Thu, 8 Dec 2011 00:12:14 +0000 (16:12 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/geert/linux-m68k
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: Wire up process_vm_{read,write}v
Ingo Molnar [Wed, 7 Dec 2011 22:23:44 +0000 (23:23 +0100)]
Merge branch 'perf/urgent' of git://github.com/acmel/linux into perf/urgent
Alan Stern [Tue, 6 Dec 2011 22:24:52 +0000 (23:24 +0100)]
PM / Driver core: leave runtime PM enabled during system shutdown
Disabling all runtime PM during system shutdown turns out not to be a
good idea, because some devices may need to be woken up from a
low-power state at that time.
The whole point of disabling runtime PM for system shutdown was to
prevent untimely runtime-suspend method calls. This patch (as1504)
accomplishes the same result by incrementing the usage count for each
device and waiting for ongoing runtime-PM callbacks to finish. This
is what we already do during system suspend and hibernation, which
makes sense since the shutdown method is pretty much a legacy analog
of the pm->poweroff method.
This fixes a recent regression on some OMAP systems introduced by
commit
af8db1508f2c9f3b6e633e2d2d906c6557c617f9 (PM / driver core:
disable device's runtime PM during shutdown).
Reported-and-tested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Manuel Lauss [Sat, 19 Nov 2011 12:08:10 +0000 (13:08 +0100)]
spi/gpio: fix section mismatch warning
Fixes:
The function __devinit spi_gpio_probe() references
a function __init spi_gpio_alloc.isra.4().
If spi_gpio_alloc.isra.4 is only used by spi_gpio_probe then
annotate spi_gpio_alloc.isra.4 with a matching annotation.
[wsa: fix spi_gpio_request(), too]
Signed-off-by: Manuel Lauss <manuel.lauss@googlemail.com>
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Jiri Slaby [Wed, 7 Dec 2011 20:18:16 +0000 (21:18 +0100)]
spi/fsl-espi: disable CONFIG_SPI_FSL_ESPI=m build
When spi_fsl_espi is chosen to be built as a module, there is a build
error because we test only CONFIG_SPI_FSL_ESPI in declaration of
struct mpc8xxx_spi in drivers/spi/spi_fsl_lib.h. Also some called
functions are not exported.
So we forbid CONFIG_SPI_FSL_ESPI to be tristate here.
The error looks like:
drivers/spi/spi_fsl_espi.c: In function 'fsl_espi_bufs':
drivers/spi/spi_fsl_espi.c:232: error: 'struct mpc8xxx_spi' has no member named 'len'
...
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Axel Lin [Thu, 24 Nov 2011 03:10:51 +0000 (11:10 +0800)]
spi/nuc900: Include linux/module.h
Include linux/module.h to fix below build error:
CC drivers/spi/spi-nuc900.o
drivers/spi/spi-nuc900.c:484: error: 'THIS_MODULE' undeclared here (not in a function)
drivers/spi/spi-nuc900.c:489: error: expected declaration specifiers or '...' before string constant
drivers/spi/spi-nuc900.c:489: warning: data definition has no type or storage class
drivers/spi/spi-nuc900.c:489: warning: type defaults to 'int' in declaration of 'MODULE_AUTHOR'
drivers/spi/spi-nuc900.c:489: warning: function declaration isn't a prototype
drivers/spi/spi-nuc900.c:490: error: expected declaration specifiers or '...' before string constant
drivers/spi/spi-nuc900.c:490: warning: data definition has no type or storage class
drivers/spi/spi-nuc900.c:490: warning: type defaults to 'int' in declaration of 'MODULE_DESCRIPTION'
drivers/spi/spi-nuc900.c:490: warning: function declaration isn't a prototype
drivers/spi/spi-nuc900.c:491: error: expected declaration specifiers or '...' before string constant
drivers/spi/spi-nuc900.c:491: warning: data definition has no type or storage class
drivers/spi/spi-nuc900.c:491: warning: type defaults to 'int' in declaration of 'MODULE_LICENSE'
drivers/spi/spi-nuc900.c:491: warning: function declaration isn't a prototype
drivers/spi/spi-nuc900.c:492: error: expected declaration specifiers or '...' before string constant
drivers/spi/spi-nuc900.c:492: warning: data definition has no type or storage class
drivers/spi/spi-nuc900.c:492: warning: type defaults to 'int' in declaration of 'MODULE_ALIAS'
drivers/spi/spi-nuc900.c:492: warning: function declaration isn't a prototype
make[2]: *** [drivers/spi/spi-nuc900.o] Error 1
make[1]: *** [drivers/spi] Error 2
make: *** [drivers] Error 2
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Acked-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Gabor Juhos [Wed, 16 Nov 2011 19:01:43 +0000 (20:01 +0100)]
spi/ath79: fix compile error due to missing include
Whithout including 'linux/module.h' spi-ath79 driver fails to compile
with the these errors:
drivers/spi/spi-ath79.c:273:12: error: 'THIS_MODULE' undeclared here (not in a function)
drivers/spi/spi-ath79.c:278:20: error: expected declaration specifiers or '...' before string constant
drivers/spi/spi-ath79.c:278:1: warning: data definition has no type or storage class
drivers/spi/spi-ath79.c:278:1: warning: type defaults to 'int' in declaration of 'MODULE_DESCRIPTION'
drivers/spi/spi-ath79.c:278:20: warning: function declaration isn't a prototype
drivers/spi/spi-ath79.c:279:15: error: expected declaration specifiers or '...' before string constant
drivers/spi/spi-ath79.c:279:1: warning: data definition has no type or storage class
drivers/spi/spi-ath79.c:279:1: warning: type defaults to 'int' in declaration of 'MODULE_AUTHOR'
drivers/spi/spi-ath79.c:279:15: warning: function declaration isn't a prototype
drivers/spi/spi-ath79.c:280:16: error: expected declaration specifiers or '...' before string constant
drivers/spi/spi-ath79.c:280:1: warning: data definition has no type or storage class
drivers/spi/spi-ath79.c:280:1: warning: type defaults to 'int' in declaration of 'MODULE_LICENSE'
drivers/spi/spi-ath79.c:280:16: warning: function declaration isn't a prototype
drivers/spi/spi-ath79.c:281:14: error: expected declaration specifiers or '...' before string constant
drivers/spi/spi-ath79.c:281:1: warning: data definition has no type or storage class
drivers/spi/spi-ath79.c:281:1: warning: type defaults to 'int' in declaration of 'MODULE_ALIAS'
drivers/spi/spi-ath79.c:281:14: warning: function declaration isn't a prototype
Signed-off-by: Gabor Juhos <juhosg@openwrt.org>
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Anton Vorontsov [Tue, 6 Dec 2011 23:16:26 +0000 (03:16 +0400)]
of/irq: Get rid of NO_IRQ usage
PPC32/64 defines NO_IRQ to zero, so no problems expected.
ARM defines NO_IRQ to -1, but OF code relies on IRQ domains support,
which returns correct ('0') value in 'no irq' case. So everything
should be fine.
Other arches might break if some of their OF drivers rely on NO_IRQ
being not 0. If so, the drivers must be fixed, finally.
[ Rob Herring points out that microblaze should be fixed, and has posted
a patch for testing for that. - Linus ]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Takashi Iwai [Wed, 7 Dec 2011 16:20:30 +0000 (17:20 +0100)]
ALSA: hda/realtek - Fix lost speaker volume controls
When there are the same or more number of HP pins are available, HP pins
are used as the primary outputs instead of the speaker pins. But, in
some cases (especially with ALC663 & co), some DACs are available only
with a later pin and it's assigned to a speaker, and since the driver
parses the pins from the lower NID, such a DAC was skipped eventually
without assignments. This resulted in a regression, the missing speaker
volume control in the new parser.
As a workaround for this, now the driver retries the pin->DAC mapping
again after restoring the speaker-pins as primary. This is still an ad
hoc fix, but it works so far for most of Realtek codecs.
Signed-off-by: Takashi Iwai <tiwai@suse.de>