Nikhil Rao [Thu, 21 Jul 2011 16:43:40 +0000 (09:43 -0700)]
sched: Add exports tracking cfs bandwidth control statistics
This change introduces statistics exports for the cpu sub-system. These are
added through the use of a stat file similar to that exported by other
subsystems.
The following exports are included:
nr_periods: number of periods in which execution occurred
nr_throttled: the number of periods above in which execution was throttled
throttled_time: cumulative wall-time that any cpus have been throttled for
this group
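As a rough illustration of how a consumer might use these counters, the
sketch below parses a buffer of "name value" lines and derives the percentage
of periods that were throttled. The "name value" layout is an assumption about
the stat file, not something stated in this changelog, and the names
struct cfs_stat_model, parse_cfs_stat() and throttled_pct() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three exported counters. */
struct cfs_stat_model {
	unsigned long long nr_periods;
	unsigned long long nr_throttled;
	unsigned long long throttled_time;
};

/*
 * Parse "name value" lines from @buf (assumed layout).
 * Returns the number of recognised fields; unknown names are skipped.
 */
static int parse_cfs_stat(const char *buf, struct cfs_stat_model *st)
{
	char name[32];
	unsigned long long val;
	int used, found = 0;

	memset(st, 0, sizeof(*st));
	while (sscanf(buf, "%31s %llu%n", name, &val, &used) == 2) {
		if (!strcmp(name, "nr_periods")) {
			st->nr_periods = val;
			found++;
		} else if (!strcmp(name, "nr_throttled")) {
			st->nr_throttled = val;
			found++;
		} else if (!strcmp(name, "throttled_time")) {
			st->throttled_time = val;
			found++;
		}
		buf += used;	/* continue after the value just consumed */
	}
	return found;
}

/* Percentage of periods in which the group was throttled. */
static unsigned int throttled_pct(const struct cfs_stat_model *st)
{
	if (!st->nr_periods)
		return 0;
	return (unsigned int)(st->nr_throttled * 100 / st->nr_periods);
}
```

A group that is throttled in a large share of its periods is a candidate for
a larger quota or a longer period.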
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:39 +0000 (09:43 -0700)]
sched: Throttle entities exceeding their allowed bandwidth
With the machinery in place to throttle and unthrottle entities, as well as
handle their participation (or lack thereof) we can now enable throttling.
There are 2 points at which we must check whether it's time to set throttled state:
put_prev_entity() and enqueue_entity().
- put_prev_entity() is the typical throttle path, we reach it by exceeding our
allocated run-time within update_curr()->account_cfs_rq_runtime() and going
through a reschedule.
- enqueue_entity() covers the case of a wake-up into an already throttled
group. In this case we know the group cannot be on_rq and can throttle
immediately. Checks are added at time of put_prev_entity() and
enqueue_entity().
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.091415417@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:38 +0000 (09:43 -0700)]
sched: Migrate throttled tasks on HOTPLUG
Throttled tasks are invisible to cpu-offline since they are not eligible for
selection by pick_next_task(). The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq, however this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.
Resolve this by unthrottling offline cpus so that threads can be migrated.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:37 +0000 (09:43 -0700)]
sched: Prevent buddy interactions with throttled entities
Buddies allow us to select "on-rq" entities without actually selecting them
from a cfs_rq's rb_tree. As a result we must ensure that throttled entities
are not falsely nominated as buddies. The fact that entities are dequeued
within throttle_entity is not sufficient for clearing buddy status as the
nomination may occur after throttling.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.886850167@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:36 +0000 (09:43 -0700)]
sched: Prevent interactions with throttled entities
From the perspective of load-balance and shares distribution, throttled
entities should be invisible.
However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present. In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n^2) costs.
Instead, track hierarchical throttled state at time of transition. This
allows us to easily identify whether an entity belongs to a throttled hierarchy
and avoid incorrect interactions with it.
Also, when an entity leaves a throttled hierarchy we need to advance its
time averaging for shares averaging so that the elapsed throttled time is not
considered as part of the cfs_rq's operation.
We also use this information to prevent buddy interactions in the wakeup and
yield_to() paths.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:35 +0000 (09:43 -0700)]
sched: Allow for positional tg_tree walks
Extend walk_tg_tree to accept a positional argument:
static int walk_tg_tree_from(struct task_group *from,
                             tg_visitor down, tg_visitor up, void *data)
Existing semantics are preserved, caller must hold rcu_lock() or sufficient
analogue.
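The intended semantics can be sketched in userspace as below. This is a
simplified model, not the kernel implementation: struct tg_node_model and
walk_tg_tree_from_model() are hypothetical names, recursion is used only for
brevity, and no locking is modelled.

```c
#include <stddef.h>

/* Simplified stand-in for a task group: first-child / next-sibling links. */
struct tg_node_model {
	struct tg_node_model *first_child;
	struct tg_node_model *next_sibling;
	int id;
};

typedef int (*tg_visitor_model)(struct tg_node_model *tg, void *data);

/*
 * Visit only the subtree rooted at @from. @down is called when a node is
 * entered and @up when it is left. A non-zero return from either visitor
 * aborts the walk and is passed back to the caller.
 */
static int walk_tg_tree_from_model(struct tg_node_model *from,
				   tg_visitor_model down,
				   tg_visitor_model up, void *data)
{
	struct tg_node_model *child;
	int ret;

	ret = down(from, data);
	if (ret)
		return ret;

	for (child = from->first_child; child; child = child->next_sibling) {
		ret = walk_tg_tree_from_model(child, down, up, data);
		if (ret)
			return ret;
	}

	return up(from, data);
}

/* Visitor that does nothing, for callers that need only one direction. */
static int tg_nop_model(struct tg_node_model *tg, void *data)
{
	(void)tg;
	(void)data;
	return 0;
}
```

Starting from an interior node leaves its siblings and ancestors untouched.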
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.677889157@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:34 +0000 (09:43 -0700)]
sched: Add support for unthrottling group entities
At the start of each period we refresh the global bandwidth pool. At this time
we must also unthrottle any cfs_rq entities who are now within bandwidth once
more (as quota permits).
Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
and their entities re-enqueued.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:33 +0000 (09:43 -0700)]
sched: Add support for throttling group entities
Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where there is no
run-time remaining.
Throttled entities are dequeued to prevent scheduling, additionally we mark
them as throttled (using cfs_rq->throttled) to prevent them from becoming
re-enqueued until they are unthrottled. A list of a task_group's throttled
entities is maintained on the cfs_bandwidth structure.
Note: While the machinery for throttling is added in this patch the act of
throttling an entity exceeding its bandwidth is deferred until later within
the series.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:32 +0000 (09:43 -0700)]
sched: Expire invalid runtime
Since quota is managed using a global state but consumed on a per-cpu basis
we need to ensure that our per-cpu state is appropriately synchronized.
Most importantly, runtime that is stale (from a previous period) should not be
locally consumable.
We take advantage of existing sched_clock synchronization about the jiffy to
efficiently detect whether we have (globally) crossed a quota boundary above.
One catch is that the direction of spread on sched_clock is undefined,
specifically, we don't know whether our local clock is behind or ahead
of the one responsible for the current expiration time.
Fortunately we can differentiate these by considering whether the
global deadline has advanced. If it has not, then we assume our clock to be
"fast" and advance our local expiration; otherwise, we know the deadline has
truly passed and we expire our local runtime.
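The decision described above can be sketched as follows. This is a userspace
model under stated assumptions: the struct and function names are
hypothetical, the tick length is an assumed constant, and the caller passes in
the local clock value.

```c
#include <stdint.h>

#define TICK_NSEC_MODEL 1000000ULL	/* assumed tick length: 1 ms */

/* Global (per task_group) state: deadline of the current period. */
struct expiry_global_model {
	uint64_t runtime_expires;
};

/* Per-cpu state: cached deadline and locally held runtime, in ns. */
struct expiry_local_model {
	uint64_t runtime_expires;
	int64_t runtime_remaining;
};

static void expire_local_runtime_model(struct expiry_local_model *l,
				       const struct expiry_global_model *g,
				       uint64_t now)
{
	/* Local deadline is still ahead of our clock: nothing to do. */
	if ((int64_t)(now - l->runtime_expires) < 0)
		return;

	/* Already in deficit: there is no runtime left to expire. */
	if (l->runtime_remaining < 0)
		return;

	if ((int64_t)(l->runtime_expires - g->runtime_expires) >= 0) {
		/* Global deadline has not advanced: assume our clock is fast. */
		l->runtime_expires += TICK_NSEC_MODEL;
	} else {
		/* Global deadline moved on: the local runtime is stale. */
		l->runtime_remaining = 0;
	}
}
```

The signed comparisons keep the checks meaningful if the 64-bit clock wraps.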
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:31 +0000 (09:43 -0700)]
sched: Add a timer to handle CFS bandwidth refresh
This patch adds a per-task_group timer which handles the refresh of the global
CFS bandwidth pool.
Since the RT pool is using a similar timer there's some small refactoring to
share this support.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:30 +0000 (09:43 -0700)]
sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth
Account bandwidth usage on the cfs_rq level versus the task_groups to which
they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
under cfs_rq->runtime_enabled.
cfs_rq's which belong to a bandwidth constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the global
pool as desired. Updates involving the global pool are currently protected
under cfs_bandwidth->lock, local runtime is protected by rq->lock.
This patch only assigns and tracks quota, no action is taken in the case that
cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.
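A minimal userspace sketch of this accounting, assuming a fixed slice size
and ignoring locking, might look like the following. All names here are
hypothetical stand-ins rather than the fields used by the patch. As in this
patch, nothing happens when the local pool ends up in deficit.

```c
#include <stdint.h>

#define SLICE_NS_MODEL 5000000LL	/* assumed slice size: 5 ms */

/* Global pool, one per task_group (kernel: under cfs_bandwidth->lock). */
struct global_pool_model {
	int unlimited;		/* non-zero models an unconstrained group */
	uint64_t runtime;	/* unclaimed runtime left this period, ns */
};

/* Local pool, one per cfs_rq (kernel: under rq->lock). */
struct local_pool_model {
	int runtime_enabled;
	int64_t runtime_remaining;
};

/* Withdraw one slice, plus any local deficit, from the global pool. */
static void assign_runtime_model(struct local_pool_model *l,
				 struct global_pool_model *g)
{
	uint64_t want = (uint64_t)(SLICE_NS_MODEL - l->runtime_remaining);
	uint64_t got = want;

	if (!g->unlimited) {
		if (got > g->runtime)
			got = g->runtime;
		g->runtime -= got;
	}
	l->runtime_remaining += (int64_t)got;
}

/* Charge @delta_exec ns of execution against the local pool. */
static void account_runtime_model(struct local_pool_model *l,
				  struct global_pool_model *g,
				  uint64_t delta_exec)
{
	if (!l->runtime_enabled)
		return;

	l->runtime_remaining -= (int64_t)delta_exec;
	if (l->runtime_remaining > 0)
		return;

	/* Local runtime exhausted: try to refill from the global pool. */
	assign_runtime_model(l, g);
}
```

Once the global pool is empty, runtime_remaining stays negative, which is the
condition later patches in the series act on.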
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:29 +0000 (09:43 -0700)]
sched: Validate CFS quota hierarchies
Add constraints validation for CFS bandwidth hierarchies.
Validate that:
max(child bandwidth) <= parent_bandwidth
In a quota limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.
This constraint is chosen over sum(child_bandwidth) as the notion of over-commit is
valuable within SCHED_OTHER. Some basic code from the RT case is re-factored
for reuse.
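The constraint can be sketched as below, comparing quota/period as a
fixed-point ratio. The helper names, the 20-bit shift and the
RUNTIME_INF_RATIO_MODEL sentinel are assumptions made for this illustration.
Periods are assumed to be non-zero.

```c
#include <stdint.h>

#define RUNTIME_INF_RATIO_MODEL UINT64_MAX	/* models "no limit" */
#define RATIO_SHIFT_MODEL 20

/* Express quota/period as a fixed-point ratio; @period must be non-zero. */
static uint64_t bw_ratio_model(uint64_t period, uint64_t quota)
{
	if (quota == RUNTIME_INF_RATIO_MODEL)
		return RUNTIME_INF_RATIO_MODEL;
	return (quota << RATIO_SHIFT_MODEL) / period;
}

/*
 * Check one child against its parent's effective ratio.
 * Returns 0 and stores the child's effective ratio in @effective, or -1 if
 * the child would exceed its parent.
 */
static int validate_child_model(uint64_t parent_ratio, uint64_t child_period,
				uint64_t child_quota, uint64_t *effective)
{
	uint64_t ratio = bw_ratio_model(child_period, child_quota);

	if (ratio == RUNTIME_INF_RATIO_MODEL)
		ratio = parent_ratio;	/* unconstrained: inherit from parent */
	else if (parent_ratio != RUNTIME_INF_RATIO_MODEL && ratio > parent_ratio)
		return -1;

	*effective = ratio;
	return 0;
}
```

Each child is checked on its own, so two children may both be granted the
parent's full bandwidth. That is the over-commit that the max() rule permits.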
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:28 +0000 (09:43 -0700)]
sched: Introduce primitives to account for CFS bandwidth tracking
In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth, and locally claimed bandwidth.
- The global bandwidth is per task_group, it represents a pool of unclaimed
bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per-cfs_rq, this represents allotments from
the global pool bandwidth assigned to a specific cpu.
Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
to consume over period above.
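To make the units concrete, the hypothetical helper below converts a
quota/period pair into milli-CPUs. Treating a negative quota as "no limit" is
an assumption of this sketch.

```c
/*
 * Convert cpu.cfs_quota_us / cpu.cfs_period_us into milli-CPUs.
 * Returns -1 when no limit applies (negative quota, assumed convention)
 * or when the period is not positive.
 */
static long long cfs_limit_millicpus(long long quota_us, long long period_us)
{
	if (quota_us < 0 || period_us <= 0)
		return -1;
	return quota_us * 1000 / period_us;
}
```

For example, a quota of 50000us over a period of 100000us allows half a CPU
(500 milli-CPUs), while 200000us over the same period allows two full CPUs.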
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul Turner [Thu, 21 Jul 2011 16:43:27 +0000 (09:43 -0700)]
sched: Implement hierarchical task accounting for SCHED_OTHER
Introduce hierarchical task accounting for the group scheduling case in CFS, as
well as promoting the responsibility for maintaining rq->nr_running to the
scheduling classes.
The primary motivation for this is that with scheduling classes supporting
bandwidth throttling it is possible for entities participating in throttled
sub-trees to not have root visible changes in rq->nr_running across activate
and de-activate operations. This in turn leads to incorrect idle and
weight-per-task load balance decisions.
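The effect can be sketched with a small userspace model. It combines the
hierarchical counter from this patch with the throttled check that arrives
later in the series. All names are hypothetical.

```c
#include <stddef.h>

/* One level of the group hierarchy; parent is NULL at the root. */
struct cfs_rq_acct_model {
	struct cfs_rq_acct_model *parent;
	unsigned int h_nr_running;	/* runnable tasks in this sub-tree */
	int throttled;
};

/* The cpu runqueue, whose nr_running feeds idle and load-balance logic. */
struct rq_acct_model {
	unsigned int nr_running;
};

/* Account one newly runnable task queued on @leaf. */
static void account_enqueue_model(struct rq_acct_model *rq,
				  struct cfs_rq_acct_model *leaf)
{
	struct cfs_rq_acct_model *c;

	for (c = leaf; c; c = c->parent) {
		c->h_nr_running++;
		/* Nothing above a throttled group may see the change. */
		if (c->throttled)
			return;
	}
	/* Reached the root: the task is visible to the runqueue. */
	rq->nr_running++;
}
```

With the scheduling class owning the rq->nr_running update, a wake-up inside
a throttled sub-tree no longer inflates the root-visible count.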
This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.
Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yong Zhang [Sat, 6 Aug 2011 00:10:04 +0000 (08:10 +0800)]
sched/cpupri: Remove cpupri->pri_active
Since [sched/cpupri: Remove the vec->lock], member pri_active
of struct cpupri is not needed any more, just remove it. Also
clean stuff related to it.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110806001004.GA2207@zhy
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Fri, 5 Aug 2011 12:27:49 +0000 (08:27 -0400)]
sched/cpupri: Fix memory barriers for vec updates to always be in order
[ This patch actually compiles. Thanks to Mike Galbraith for pointing
that out. I compiled and booted this patch with no issues. ]
Re-examining the cpupri patch, I see there's a possible race because the
updates of the two priorities' vec->counts are not protected by a memory
barrier.
When a RT runqueue is overloaded and wants to push an RT task to another
runqueue, it scans the RT priority vectors in a loop from lowest
priority to highest.
When we queue or dequeue an RT task that changes a runqueue's highest
priority task, we update the vectors to show that a runqueue is rated at
a different priority. To do this, we first set the new priority mask,
and increment the vec->count, and then set the old priority mask by
decrementing the vec->count.
If we are lowering the runqueue's RT priority rating, it will trigger a
RT pull, and we do not care if we miss pushing to this runqueue or not.
But if we raise the priority, but the priority is still lower than an RT
task that is looking to be pushed, we must make sure that this runqueue
is still seen by the push algorithm (the loop).
Because the loop reads from lowest to highest, and the new priority is
set before the old one is cleared, we will either see the new or old
priority set and the vector will be checked.
But! Since there's no memory barrier between the updates of the two, the
old count may be decremented first before the new count is incremented.
This means the loop may see the old count of zero and skip it, and also
the new count of zero before it was updated. A possible runqueue that
the RT task could move to could be missed.
A conditional memory barrier is placed between the vec->count updates
and is only called when both updates are done.
The smp_wmb() has also been changed to smp_mb__before_atomic_inc/dec(),
as they are not needed by archs that already synchronize
atomic_inc/dec().
The smp_rmb() has been moved to be called at every iteration of the loop
so that the race between seeing the two updates is visible by each
iteration of the loop, as an arch is free to optimize the reading of
memory of the counters in the loop.
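The writer-side ordering can be sketched with C11 atomics as a rough
analogue. This is an illustration only: the kernel uses its own barrier
primitives, and the array size, the PRI_NONE_MODEL sentinel and the function
name are assumptions.

```c
#include <stdatomic.h>

#define NR_PRI_MODEL 102	/* assumed number of priority vectors */
#define PRI_NONE_MODEL (-1)	/* runqueue has no rating at this end */

static atomic_int vec_count_model[NR_PRI_MODEL];

/*
 * Move a runqueue's rating from @oldpri to @newpri. The new count is raised
 * first, and a full fence is issued only when both updates happen, so a
 * scanner cannot observe the decrement before the increment.
 */
static void cpupri_move_model(int oldpri, int newpri)
{
	int need_fence = 0;

	if (newpri != PRI_NONE_MODEL) {
		atomic_fetch_add_explicit(&vec_count_model[newpri], 1,
					  memory_order_relaxed);
		need_fence = 1;
	}

	if (oldpri != PRI_NONE_MODEL) {
		if (need_fence)
			atomic_thread_fence(memory_order_seq_cst);
		atomic_fetch_sub_explicit(&vec_count_model[oldpri], 1,
					  memory_order_relaxed);
	}
}
```

A reader scanning the counts would pair this with a barrier (or acquire load)
on every iteration, matching the per-iteration smp_rmb() described above.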
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1312547269.18583.194.camel@gandalf.stny.rr.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Tue, 2 Aug 2011 20:36:12 +0000 (16:36 -0400)]
sched/cpupri: Remove the vec->lock
The cpupri vec->lock has been showing up as a top contention
lately. This is because the RT push/pull logic takes an
aggressive approach for migrating RT tasks. The cpupri logic is
in place to improve the performance of the push/pull when dealing
with machines that have a large number of CPUs.
The problem though is a vec->lock is required, where a vec is a
global per RT priority structure. That is, if there are lots of
RT tasks at the same priority, every time they are added or removed
from the RT queue, this global vec->lock is taken. Now that more
kernel threads are becoming RT (RCU boost and threaded interrupts)
this is becoming much more of an issue.
There are two variables that are being synced by the vec->lock.
The cpupri bitmask, and the vec->counter. The cpupri bitmask
is one bit per priority. If a RT priority vec has a process queued,
then the vec->count is > 0 and the cpupri bitmask is set for that
RT priority.
If the cpupri bitmask gets out of sync with the vec->counter, we could
end up pushing a low priority RT task to a high priority queue.
That RT task that could have run immediately could be queued on a
run queue with a higher priority task indefinitely.
The solution is not to use the cpupri bitmask and just look at the
vec->count directly when doing a pull. The cpupri bitmask is just
a fast way to scan the RT priorities when a pull is made. If, instead
of using the bitmask, we just examine all RT priorities and
look at the vec->counts, we can eliminate the vec->lock. The
scan of RT tasks is to find a run queue that we can push an RT task
to, and we do not push to a high priority queue, thus the scan only
needs to go from 1 to RT task->prio, and not all 100 RT priorities.
The push algorithm, which does the scan of RT priorities (and
scan of the bitmask) only happens when we have an overloaded RT run
queue (more than one RT task queued). The grabbing of the vec->lock
happens every time any RT task is queued or dequeued on the run
queue for that priority. The slowdown of the scan from not using
a bitmask is negligible compared to the speed up from removing the vec->lock
contention and replacing it with an atomic counter and memory barrier.
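A minimal sketch of the lockless scan, using C11 atomics in userspace, is
shown below. It assumes that index 0 is the lowest priority and that a higher
index means a higher priority. The names and the array size are hypothetical.

```c
#include <stdatomic.h>

#define NR_RT_PRI_MODEL 100	/* assumed number of RT priorities */

/* One counter per priority: how many runqueues are rated at it. */
static atomic_int rt_vec_count_model[NR_RT_PRI_MODEL];

/* Queue and dequeue paths touch only the counter, with no lock taken. */
static void rt_vec_inc_model(int pri)
{
	atomic_fetch_add(&rt_vec_count_model[pri], 1);
}

static void rt_vec_dec_model(int pri)
{
	atomic_fetch_sub(&rt_vec_count_model[pri], 1);
}

/*
 * Scan from the lowest priority up to, but not including, @task_pri and
 * return the first populated priority, or -1 if there is none. Priorities
 * at or above the task's own are never examined.
 */
static int find_lower_vec_model(int task_pri)
{
	int idx;

	for (idx = 0; idx < task_pri && idx < NR_RT_PRI_MODEL; idx++) {
		if (atomic_load(&rt_vec_count_model[idx]) > 0)
			return idx;
	}
	return -1;
}
```

The scan cost is bounded by the pushing task's priority, and it is only paid
on an overloaded runqueue, whereas the counter update happens on every queue
and dequeue.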
To prove this, I wrote a patch that times both the loop and the code
that grabs the vec->locks. I passed the patches to various people
(and companies) to test and show the results. I let everyone choose
their own load to test, giving different loads on the system,
for various different setups.
Here's some of the results: (snipping to a few CPUs to not make
this change log huge, but the results were consistent across
the entire system).
System 1 (24 CPUs)
Before patch:
CPU:    Name  Count    Max     Min    Average  Total
----    ----  -----    ---     ---    -------  -----
[...]
cpu 20: loop  3057     1.766   0.061  0.642    1963.170
        vec   6782949  90.469  0.089  0.414    2811760.503
cpu 21: loop  2617     1.723   0.062  0.641    1679.074
        vec   6782810  90.499  0.089  0.291    1978499.900
cpu 22: loop  2212     1.863   0.063  0.699    1547.160
        vec   6767244  85.685  0.089  0.435    2949676.898
cpu 23: loop  2320     2.013   0.062  0.594    1380.265
        vec   6781694  87.923  0.088  0.431    2928538.224
After patch:
cpu 20: loop  2078     1.579   0.061  0.533    1108.006
        vec   6164555  5.704   0.060  0.143    885185.809
cpu 21: loop  2268     1.712   0.065  0.575    1305.248
        vec   6153376  5.558   0.060  0.187    1154960.469
cpu 22: loop  1542     1.639   0.095  0.533    823.249
        vec   6156510  5.720   0.060  0.190    1172727.232
cpu 23: loop  1650     1.733   0.068  0.545    900.781
        vec   6170784  5.533   0.060  0.167    1034287.953
All times are in microseconds. The 'loop' is the amount of time spent
doing the loop across the priorities (before patch uses bitmask).
The 'vec' is the amount of time in the code that requires grabbing
the vec->lock. The second patch just does not have the vec lock, but
encompasses the same code.
Amazingly the loop code even went down on average. The vec code went
from .5 down to .18, that's more than half the time spent!
Note, more than one test was run, but they all had the same results.
System 2 (64 CPUs)
Before patch:
CPU:    Name  Count    Max      Min    Average  Total
----    ----  -----    ---      ---    -------  -----
cpu 60: loop  0        0        0      0        0
        vec   5410840  277.954  0.084  0.782    4232895.727
cpu 61: loop  0        0        0      0        0
        vec   4915648  188.399  0.084  0.570    2803220.301
cpu 62: loop  0        0        0      0        0
        vec   5356076  276.417  0.085  0.786    4214544.548
cpu 63: loop  0        0        0      0        0
        vec   4891837  170.531  0.085  0.799    3910948.833
After patch:
cpu 60: loop  0        0        0      0        0
        vec   5365118  5.080    0.021  0.063    340490.267
cpu 61: loop  0        0        0      0        0
        vec   4898590  1.757    0.019  0.071    347903.615
cpu 62: loop  0        0        0      0        0
        vec   5737130  3.067    0.021  0.119    687108.734
cpu 63: loop  0        0        0      0        0
        vec   4903228  1.822    0.021  0.071    348506.477
The test run during the measurement did not have any (very few,
from other CPUs) RT tasks pushing. But this shows that it helped
out tremendously with the contention, as the contention happens
because the vec->lock is taken only on queuing at an RT priority,
and different CPUs that queue tasks at the same priority will
have contention.
I tested on my own 4 CPU machine with the following results:
Before patch:
CPU:    Name  Count  Max      Min    Average  Total
----    ----  -----  ---      ---    -------  -----
cpu 0:  loop  2377   1.489    0.158  0.588    1398.395
        vec   4484   770.146  2.301  4.396    19711.755
cpu 1:  loop  2169   1.962    0.160  0.576    1250.110
        vec   4425   152.769  2.297  4.030    17834.228
cpu 2:  loop  2324   1.749    0.155  0.559    1299.799
        vec   4368   779.632  2.325  4.665    20379.268
cpu 3:  loop  2325   1.629    0.157  0.561    1306.113
        vec   4650   408.782  2.394  4.348    20222.577
After patch:
CPU:    Name  Count  Max      Min    Average  Total
----    ----  -----  ---      ---    -------  -----
cpu 0:  loop  2121   1.616    0.113  0.636    1349.189
        vec   4303   1.151    0.225  0.421    1811.966
cpu 1:  loop  2130   1.638    0.178  0.644    1372.927
        vec   4627   1.379    0.235  0.428    1983.648
cpu 2:  loop  2056   1.464    0.165  0.637    1310.141
        vec   4471   1.311    0.217  0.433    1937.927
cpu 3:  loop  2154   1.481    0.162  0.601    1295.083
        vec   4236   1.253    0.230  0.425    1803.008
This was running my migrate.c code that can be found at:
http://lwn.net/Articles/425763/
The migrate code does stress the RT tasks a bit. This shows that
the loop did increase a little after the patch, but not by much.
The vec code dropped dramatically. From 4.3us down to .42us.
That's a 10x improvement!
Tested-by: Mike Galbraith <mgalbraith@suse.de>
Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
Tested-by: Matthew Hank Sabins <msabins@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Chris Mason <chris.mason@oracle.com>
Link: http://lkml.kernel.org/r/1312317372.18583.101.camel@gandalf.stny.rr.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Fri, 17 Jun 2011 01:55:23 +0000 (21:55 -0400)]
sched: Use pushable_tasks to determine next highest prio
Hillf Danton proposed a patch (see link) that cleaned up the
sched_rt code that calculates the priority of the next highest priority
task to be used in finding run queues to pull from.
His patch removed the calculation of the next prio and just used the current
prio when determining if we should examine a run queue to pull from. The problem
with his patch was that it caused more false checks, because we check a run
queue for pushable tasks if the current priority of that run queue is higher
in priority than the task about to run on our run queue. But after grabbing
the locks and doing the real check, we may find that there is no task
of higher prio to pull. Thus the locks were taken with nothing to
do.
I added some trace_printks() to record when and how many times the run queue
locks were taken to check for pullable tasks, compared to how many times we
pulled a task.
With the current method, it was:
3806 locks taken vs 2812 pulled tasks
With Hillf's patch:
6728 locks taken vs 2804 pulled tasks
The number of times locks were taken to pull a task almost doubled, with
no improvement in the success rate.
But his patch did get me thinking. When we look at the priority of the highest
task to consider taking the locks to do a pull, a failure to pull can be one
of the following: (in order of most likely)
o RT task was pushed off already between the check and taking the lock
o Waiting RT task can not be migrated
o RT task's CPU affinity does not include the target run queue's CPU
o RT task's priority changed between the check and taking the lock
And with Hillf's patch, the thing that caused most of the failures is that
the RT task to pull was not at the right priority to pull (not greater than
the current RT task priority on the target run queue).
Most of the above cases we can't help. But the current method does not check
if the next highest prio RT task can be migrated or not, and if it can not,
we still grab the locks to do the test (we don't find out about this fact until
after we have the locks). I thought about this case, and realized that the
pushable task plist that is maintained only holds RT tasks that can migrate.
If we move the calculating of the next highest prio task from the inc/dec_rt_task()
functions into the queuing of the pushable tasks, then we only measure the
priorities of those tasks that we push, and we get this basically for free.
Not only does this patch make the code a little more efficient, it cleans it
up and makes it a little simpler.
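The idea can be sketched with a small userspace model in which the
next-highest priority is derived purely from the pushable list. The names,
the fixed-size sorted array and the capacity are assumptions made for this
sketch. As in the kernel, a lower number means a higher priority.

```c
#define MAX_RT_PRIO_MODEL 100	/* "nothing pushable" sentinel */
#define MAX_PUSHABLE_MODEL 16	/* assumed capacity of this model */

struct rt_rq_model {
	int pushable[MAX_PUSHABLE_MODEL]; /* prios of migratable tasks, ascending */
	int nr_pushable;
	int highest_prio_next;
};

static void rt_rq_init_model(struct rt_rq_model *rq)
{
	rq->nr_pushable = 0;
	rq->highest_prio_next = MAX_RT_PRIO_MODEL;
}

/* Queue a migratable task; the next-highest prio falls out for free. */
static int enqueue_pushable_model(struct rt_rq_model *rq, int prio)
{
	int i;

	if (rq->nr_pushable == MAX_PUSHABLE_MODEL)
		return -1;
	for (i = rq->nr_pushable; i > 0 && rq->pushable[i - 1] > prio; i--)
		rq->pushable[i] = rq->pushable[i - 1];
	rq->pushable[i] = prio;
	rq->nr_pushable++;
	if (prio < rq->highest_prio_next)
		rq->highest_prio_next = prio;
	return 0;
}

/* Remove one task of @prio and recompute from the head of the list. */
static void dequeue_pushable_model(struct rt_rq_model *rq, int prio)
{
	int i;

	for (i = 0; i < rq->nr_pushable && rq->pushable[i] != prio; i++)
		;
	if (i == rq->nr_pushable)
		return;
	for (; i < rq->nr_pushable - 1; i++)
		rq->pushable[i] = rq->pushable[i + 1];
	rq->nr_pushable--;
	rq->highest_prio_next = rq->nr_pushable ? rq->pushable[0]
						: MAX_RT_PRIO_MODEL;
}

/* Only take the locks if @src holds a migratable task that beats ours. */
static int worth_pulling_from_model(const struct rt_rq_model *src,
				    int this_curr_prio)
{
	return src->highest_prio_next < this_curr_prio;
}
```

Because non-migratable tasks never enter the list, they can no longer cause
the run queue locks to be taken for a pull that cannot succeed.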
Thanks to Hillf Danton for inspiring me on this patch.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Gregory Haskins <ghaskins@novell.com>
Link: http://lkml.kernel.org/r/BANLkTimQ67180HxCx5vgMqumqw1EkFh3qg@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Fri, 17 Jun 2011 01:55:22 +0000 (21:55 -0400)]
sched: Balance RT tasks when forked as well
When a new task is woken, the code to balance the RT task is currently
skipped in the select_task_rq() call. But it will be pushed if the rq
is currently overloaded with RT tasks anyway. The issue is that we
already queued the task, and if it does get pushed, it will have to
be dequeued and requeued on the new run queue. The advantage with
pushing it first is that we avoid this requeuing as we are pushing it
off before the task is ever queued.
See commit
318e0893ce3f524 ("sched: pre-route RT tasks on wakeup")
for more details.
The return of select_task_rq() when it is not a wake up has also been
changed to return task_cpu() instead of smp_processor_id(). This is more
of a sanity measure, because currently the only other user of select_task_rq()
besides wake ups is an exec, where task_cpu() should also be the same
as smp_processor_id(). But if it is used for other purposes, let's keep
the task on the same CPU. Why would we want to migrate it to the current
CPU?
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hillf Danton <dhillf@gmail.com>
Link: http://lkml.kernel.org/r/20110617015919.832743148@goodmis.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hillf Danton [Fri, 17 Jun 2011 01:55:21 +0000 (21:55 -0400)]
sched: Remove resetting exec_start in put_prev_task_rt()
There's no reason to clean the exec_start in put_prev_task_rt() as it is reset
when the task gets back to the run queue. This saves us doing a store() in the
fast path.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Yong Zhang <yong.zhang0@gmail.com>
Link: http://lkml.kernel.org/r/BANLkTimqWD=q6YnSDi-v9y=LMWecgEzEWg@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hillf Danton [Fri, 17 Jun 2011 01:55:20 +0000 (21:55 -0400)]
sched, rt: Fix rq->rt.pushable_tasks bug in push_rt_task()
Do not call dequeue_pushable_task() when failing to push an eligible
task, as it remains pushable, merely not at this particular moment.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mike Galbraith <mgalbraith@gmx.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yong Zhang <yong.zhang0@gmail.com>
Link: http://lkml.kernel.org/r/1306895385.4791.26.camel@marge.simson.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hillf Danton [Fri, 17 Jun 2011 01:55:19 +0000 (21:55 -0400)]
sched: Remove noop in lowest_flag_domain()
Checking for the validity of sd is removed, since it is already
checked by the for_each_domain macro.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTimT+Tut-3TshCDm-NiLLXrOznibNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hillf Danton [Fri, 17 Jun 2011 01:55:18 +0000 (21:55 -0400)]
sched: Remove noop in next_prio()
When computing the next priority for a given run-queue, the check for
RT priority of the task determined by the pick_next_highest_task_rt()
function could be removed, since only RT tasks are returned by the
function.
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTimxmWiof9s5AvS3v_0X+sMiE=0x5g@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Galbraith [Wed, 27 Jul 2011 15:14:55 +0000 (17:14 +0200)]
sched: fix broken SCHED_RESET_ON_FORK handling
Setting child->prio = current->normal_prio _after_ SCHED_RESET_ON_FORK has
been handled for an RT parent gives birth to a deranged mutant child with
non-RT policy, but RT prio and sched_class.
Move PI leakage protection up, always set priorities and weight, and if the
child is leaving RT class, reset rt_priority to the proper value.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311779695.8691.2.camel@marge.simson.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yong Zhang [Fri, 29 Jul 2011 08:20:33 +0000 (16:20 +0800)]
sched: Kill WAKEUP_PREEMPT
Remove the WAKEUP_PREEMPT feature; disabling it doesn't make any sense
and it has outlived its use by a long, long while.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110729082033.GB12106@zhy
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jan H. Schönherr [Mon, 1 Aug 2011 09:03:28 +0000 (11:03 +0200)]
sched: Remove rq->avg_load_per_task
Since commit
a2d47777 ("sched: fix stale value in average load per task")
the variable rq->avg_load_per_task is no longer required. Remove it.
Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1312189408-17172-1-git-send-email-schnhrr@cs.tu-berlin.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Linus Torvalds [Fri, 12 Aug 2011 07:35:46 +0000 (00:35 -0700)]
Merge git://git./linux/kernel/git/davem/sparc
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc: Don't do hypervisor calls on non-sun4v in DS driver.
David S. Miller [Fri, 12 Aug 2011 00:58:59 +0000 (17:58 -0700)]
sparc: Don't do hypervisor calls on non-sun4v in DS driver.
Reported-by: Pieter-Paul Giesberts <pieterpg@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Boaz Harrosh [Thu, 11 Aug 2011 21:29:25 +0000 (14:29 -0700)]
pnfs: Automatically select blocks & objects layouts
Just like files-layout, blocks & objects layouts are part of the
NFS 4.1 protocol and should be automatically selected if NFS_4_1
is selected. The small problem is that these depend on other
Kernel support being present, while files only depends on NFS
itself.
This patch removes the objects and blocks layouts from the user's
choices, but makes sure they are selected only if
the subsystems they depend on are present in the Kernel.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Acked-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Sandeen [Thu, 11 Aug 2011 14:54:31 +0000 (09:54 -0500)]
ext4: Properly count journal credits for long symlinks
Commit df5e6223407e ("ext4: fix deadlock in ext4_symlink() in ENOSPC
conditions") recalculated the number of credits needed for a long
symlink, in the process of splitting it into two transactions. However,
the first credit calculation under-counted because if selinux is
enabled, credits are needed to create the selinux xattr as well.
Overrunning the reservation will result in an OOPS in
jbd2_journal_dirty_metadata() due to this assert:
J_ASSERT_JH(jh, handle->h_buffer_credits > 0);
Fix this by increasing the reservation size.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Sandeen [Thu, 11 Aug 2011 14:51:46 +0000 (09:51 -0500)]
ext3: Properly count journal credits for long symlinks
Commit ae54870a1dc9 ("ext3: Fix lock inversion in ext3_symlink()")
recalculated the number of credits needed for a long symlink, in the
process of splitting it into two transactions. However, the first
credit calculation under-counted because if selinux is enabled, credits
are needed to create the selinux xattr as well.
Overrunning the reservation will result in an OOPS in
journal_dirty_metadata() due to this assert:
J_ASSERT_JH(jh, handle->h_buffer_credits > 0);
Fix this by increasing the reservation size.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vasiliy Kulikov [Mon, 8 Aug 2011 15:02:04 +0000 (19:02 +0400)]
move RLIMIT_NPROC check from set_user() to do_execve_common()
The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
check in set_user() to catch attempts to exceed NPROC via setuid() and
similar functions.
Before the check, an unprivileged user could greatly exceed the allowed
number of processes if the program relied on rlimit only. But the check
created a new security threat: many poorly written programs simply don't
check the setuid() return code and believe it cannot fail if executed
with root privileges. So, the check is removed in this patch because
buggy programs have too often led to privilege escalations.
NPROC can still be enforced in the common code flow of daemons
spawning user processes. Most daemons do fork()+setuid()+execve().
The check introduced in execve() (1) enforces the same limit as in
setuid() and (2) doesn't create similar security issues.
Neil Brown suggested tracking which specific process has exceeded the
limit by setting the PF_NPROC_EXCEEDED process flag. With the change only
this process would fail on execve(), and other processes' execve()
behaviour is not changed.
Solar Designer suggested re-checking whether the NPROC limit is still
exceeded at the moment of execve(). If the process was sleeping for
days between set*uid() and execve(), and the NPROC counter dropped
below the limit, a deferred execve() failure because the NPROC limit was
exceeded days ago would be unexpected. If the limit is not exceeded
anymore, we clear the flag on successful calls to execve() and fork().
The flag is also cleared on successful calls to set_user() as the limit
was exceeded for the previous user, not the current one.
A similar check was introduced in the -ow patches (without the process flag).
v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
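As an illustration of the failure mode described in the RLIMIT_NPROC
commit above, here is a minimal userspace sketch of the check that many
programs omitted: testing the return value of setuid() before carrying
on. The helper name drop_to_uid() is hypothetical and is not part of the
patch.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: switch to `uid` and confirm that the switch
 * really happened. Returns 0 on success, -1 on failure.
 */
static int drop_to_uid(uid_t uid)
{
	/* setuid() can fail; ignoring its result is the bug described above. */
	if (setuid(uid) != 0)
		return -1;

	/* Verify both the real and the effective uid after the call. */
	if (getuid() != uid || geteuid() != uid)
		return -1;

	return 0;
}
```

A daemon following the fork()+setuid()+execve() pattern would call such
a helper in the child and exit instead of calling execve() when it
returns -1.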
Linus Torvalds [Thu, 11 Aug 2011 16:03:48 +0000 (09:03 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf symbols: Check '/tmp/perf-' symbol file ownership
perf sched: Usage leftover from trace -> script rename
perf sched: Do not delete session object prematurely
perf tools: Check $HOME/.perfconfig ownership
perf, x86: Add model 45 SandyBridge support
perf tools: Add support to install perf python extension
perf tools: do not look at ./config for configuration
perf tools: Make clean leaves some files
perf lock: Dropping unsupported ':r' modifier
perf probe: Fix coredump introduced by probe module option
jump label: Reduce the cycle count by changing the link order
perf report: Use ui__warning in some more places
perf python: Add PERF_RECORD_{LOST,READ,SAMPLE} routine tables
perf evlist: Introduce 'disable' method
trace events: Update version number reference to new 3.x scheme for EVENT_POWER_TRACING_DEPRECATED
perf buildid-cache: Zero out buffer of filenames when adding/removing buildid
Tracey Dent [Thu, 11 Aug 2011 06:59:00 +0000 (02:59 -0400)]
MAINTAINERS: Update linus' git repository
Change to new git tree -
(git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git).
Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 11 Aug 2011 15:58:41 +0000 (08:58 -0700)]
Revert "EDAC: Correct Kconfig dependencies"
This reverts commit af9d220bac41dc3201893e1601cc7c44f7da4498.
It turns out that one was meant to be applied on top of the edac.git
tree in -next that has more i7core_edac changes, but that wasn't clear
in the original email.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peng Tao [Wed, 10 Aug 2011 22:29:21 +0000 (18:29 -0400)]
NFS41: make PNFS_BLOCK selectable
PNFS_BLOCK needs BLK_DEV_DM/MD, which is not a dependency for other
pnfs layout drivers. Separate it out so others can still build when
BLK_DEV_DM/MD is not enabled.
Also change "select" to "depends on" to avoid build failures.
Reported-and-tested-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Acked-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 11 Aug 2011 00:37:17 +0000 (17:37 -0700)]
Merge branch 'fixes' of /home/rmk/linux-2.6-arm
* 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm:
ARM: drop experimental status for ARM_PATCH_PHYS_VIRT
ARM: 7008/1: alignment: Make SIGBUS sent to userspace POSIXly correct
ARM: 7007/1: alignment: Prevent ignoring of faults with ARMv6 unaligned access model
ARM: 7010/1: mm: fix invalid loop for poison_init_mem
ARM: 7005/1: freshen up mm/proc-arm946.S
dmaengine: PL08x: Fix trivial build error
ARM: Fix build error for SMP=n builds
Linus Torvalds [Wed, 10 Aug 2011 19:36:45 +0000 (12:36 -0700)]
Merge branch 'merge' of git://git./linux/kernel/git/benh/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Really fix build without CONFIG_PCI
powerpc: Fix build without CONFIG_PCI
powerpc/4xx: Fix build of PCI code on 405
powerpc/pseries: Simplify vpa deregistration functions
powerpc/pseries: Cleanup VPA registration and deregistration errors
powerpc/pseries: Fix kexec on recent firmware versions
MAINTAINERS: change maintainership of mpc5xxx
powerpc: Make KVM_GUEST default to n
powerpc/kvm: Fix build errors with older toolchains
powerpc: Lack of ibm,io-events not that important!
powerpc: Move kdump default base address to half RMO size on 64bit
powerpc/perf: Disable pagefaults during callchain stack read
ppc: Remove duplicate definition of PV_POWER7
powerpc: pseries: Fix kexec on machines with more than 4TB of RAM
powerpc: Jump label misalignment causes oops at boot
powerpc: Clean up some panic messages in prom_init
powerpc: Fix device tree claim code
powerpc: Return the_cpu_ spec from identify_cpu
powerpc: mtspr/mtmsr should take an unsigned long
Linus Torvalds [Wed, 10 Aug 2011 18:08:06 +0000 (11:08 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
Ecryptfs: Add mount option to check uid of device being mounted = expect uid
eCryptfs: Fix payload_len unitialized variable warning
eCryptfs: fix compile error
eCryptfs: Return error when lower file pointer is NULL
Borislav Petkov [Wed, 10 Aug 2011 12:43:30 +0000 (14:43 +0200)]
EDAC: Correct Kconfig dependencies
Both AMD and Intel i7 EDAC drivers use MCE features and thus depend
on this functionality being present in the kernel. Express this in
Kconfig so that randconfig builds don't break.
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Benjamin Herrenschmidt [Wed, 10 Aug 2011 15:15:44 +0000 (01:15 +1000)]
powerpc: Really fix build without CONFIG_PCI
Brown paper bag day, previous commit wouldn't work very well with modules
enabled. Move the exports into the ifdef.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Russell King [Wed, 10 Aug 2011 09:17:07 +0000 (10:17 +0100)]
ARM: drop experimental status for ARM_PATCH_PHYS_VIRT
This has now been well tested, and several platforms are now selecting
this directly. It's time to drop its experimental status.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Ingo Molnar [Wed, 10 Aug 2011 08:20:52 +0000 (10:20 +0200)]
Merge branch 'perf/core' of git://git./linux/kernel/git/acme/linux into perf/urgent
John Johansen [Fri, 22 Jul 2011 15:14:15 +0000 (08:14 -0700)]
Ecryptfs: Add mount option to check uid of device being mounted = expect uid
Close a TOCTOU race for mounts done via ecryptfs-mount-private. The mount
source (device) can be raced when the ownership test is done in userspace.
Provide Ecryptfs a means to force the uid check at mount time.
Signed-off-by: John Johansen <john.johansen@canonical.com>
Cc: <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Jonathan Nieder [Mon, 8 Aug 2011 04:22:43 +0000 (06:22 +0200)]
cap_syslog: don't use WARN_ONCE for CAP_SYS_ADMIN deprecation warning
syslog-ng versions before 3.3.0beta1 (2011-05-12) assume that
CAP_SYS_ADMIN is sufficient to access syslog, so ever since CAP_SYSLOG
was introduced (2010-11-25) they have triggered a warning.
Commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now")
improved matters a little by making syslog-ng work again, just keeping
the WARN_ONCE(). But still, this is a warning that writes a stack trace
we don't care about to syslog, sets a taint flag, and alarms sysadmins
when nothing worse has happened than use of an old userspace with a
recent kernel.
Convert the WARN_ONCE to a printk_once to avoid that while continuing to
give userspace developers a hint that this is an unwanted
backward-compatibility feature and won't be around forever.
Reported-by: Ralf Hildebrandt <ralf.hildebrandt@charite.de>
Reported-by: Niels <zorglub_olsen@hotmail.com>
Reported-by: Paweł Sikora <pluto@agmk.net>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Liked-by: Gergely Nagy <algernon@madhouse-project.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Tue, 9 Aug 2011 09:56:26 +0000 (11:56 +0200)]
Revert "memcg: get rid of percpu_charge_mutex lock"
This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc.
The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
bit operations is sufficient but that is not true. Johannes Weiner has
reported a crash during parallel memory cgroup removal:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
Oops: 0000 [#1] PREEMPT SMP
Pid: 19677, comm: rmdir Tainted: G W 3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
RIP: 0010:[<ffffffff81083b70>] css_is_ancestor+0x20/0x70
RSP: 0018:ffff880077b09c88 EFLAGS: 00010202
Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
Call Trace:
[<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
[<ffffffff810feccf>] drain_all_stock+0x11f/0x170
[<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
[<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
[<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
[<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
[<ffffffff81114e7b>] do_rmdir+0xfb/0x110
[<ffffffff81114ea6>] sys_rmdir+0x16/0x20
[<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b
We are crashing because we try to dereference cached memcg when we are
checking whether we should wait for draining on the cache. The cache is
already cleaned up, though.
There is also a theoretical chance that the cached memcg gets freed
between the time we test for FLUSHING_CACHED_CHARGE and the time we
dereference it in mem_cgroup_same_or_subtree:
CPU0 CPU1 CPU2
mem=stock->cached
stock->cached=NULL
clear_bit
test_and_set_bit
test_bit() ...
<preempted> mem_cgroup_destroy
use after free
The percpu_charge_mutex protected from this race because sync draining
is exclusive.
It is safer to revert now and come up with a more parallel
implementation later.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 9 Aug 2011 19:51:25 +0000 (12:51 -0700)]
Merge branch 'slab/urgent' of git://git./linux/kernel/git/penberg/slab-2.6
* 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: Fix partial count comparison confusion
Tyler Hicks [Fri, 5 Aug 2011 09:15:19 +0000 (04:15 -0500)]
eCryptfs: Fix payload_len unitialized variable warning
fs/ecryptfs/keystore.c: In function ‘ecryptfs_generate_key_packet_set’:
fs/ecryptfs/keystore.c:1991:28: warning: ‘payload_len’ may be used uninitialized in this function [-Wuninitialized]
fs/ecryptfs/keystore.c:1976:9: note: ‘payload_len’ was declared here
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Roberto Sassu [Mon, 1 Aug 2011 11:33:38 +0000 (13:33 +0200)]
eCryptfs: fix compile error
This patch fixes the compile error reported at the address:
https://bugzilla.kernel.org/show_bug.cgi?id=40292
The problem arises when compiling eCryptfs as built-in and the 'encrypted'
key type as a module. The patch prevents this combination from being set in
the kernel configuration, by fixing the eCryptfs dependencies.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Reported-by: David Hill <hilld@binarystorm.net>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Tyler Hicks [Fri, 5 Aug 2011 03:58:51 +0000 (22:58 -0500)]
eCryptfs: Return error when lower file pointer is NULL
When an eCryptfs inode's lower file has been closed, and the pointer has
been set to NULL, return an error when trying to do a lower read or
write rather than calling BUG().
https://bugzilla.kernel.org/show_bug.cgi?id=37292
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Pekka Enberg [Tue, 9 Aug 2011 19:54:18 +0000 (22:54 +0300)]
perf symbols: Check '/tmp/perf-' symbol file ownership
The external symbol files are generated by JIT compilers, for example, but we
need to make sure they're ours before injecting them into 'perf report'.
Requested-by: Ingo Molnar <mingo@elte.hu>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1312919658-17158-1-git-send-email-penberg@kernel.org
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
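The ownership test described in the '/tmp/perf-' commit above can be
sketched in userspace roughly as follows. This is not perf's actual
code; the helper name file_is_ours() and its return convention are
assumptions made for illustration, and the real check may treat some
owners (such as root) differently.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if `path` is owned by the effective
 * user, 0 if it is owned by someone else, and -1 if stat() fails.
 */
static int file_is_ours(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;

	return st.st_uid == geteuid() ? 1 : 0;
}
```

A caller would skip the symbol file unless the helper returns 1.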
Christoph Lameter [Tue, 9 Aug 2011 18:01:32 +0000 (13:01 -0500)]
slub: Fix partial count comparison confusion
deactivate_slab() gets the comparison wrong when it checks whether more
than the minimum number of partial pages are on the partial list. An
effect of this may be that empty pages are not freed from
deactivate_slab(). The result could be an OOM due to growth of the
partial slabs per node. Frees mostly occur from __slab_free, which is
okay, so this would only affect use cases where a lot of switching
around of per cpu slabs occurs.
Switching per cpu slabs occurs with high frequency if debugging options are
enabled.
Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
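The intended comparison in the partial count commit above can be
modelled with a small predicate. This is a simplified sketch, not the
kernel code: the function and parameter names are hypothetical, and it
only captures the rule stated in the message, namely that an empty page
is freed when the partial list already holds more than the minimum.

```c
#include <stdbool.h>

/*
 * Simplified model (not the kernel code): decide whether an empty slab
 * page should be freed or kept on the node's partial list.
 * nr_partial  - pages currently on the partial list
 * min_partial - minimum number of partial pages to keep around
 */
static bool should_free_empty_slab(unsigned long nr_partial,
				   unsigned long min_partial)
{
	/* Free only when the list already holds more than the minimum. */
	return nr_partial > min_partial;
}
```

With the comparison inverted, a long partial list would never shed its
empty pages, which matches the growth described in the message.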
Jiri Olsa [Tue, 9 Aug 2011 12:46:51 +0000 (14:46 +0200)]
perf sched: Usage leftover from trace -> script rename
The 'perf sched' command usage still shows the 'trace' command instead of
the 'script' command.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110809124651.GD2056@jolsa.brq.redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Mon, 8 Aug 2011 21:03:34 +0000 (23:03 +0200)]
perf sched: Do not delete session object prematurely
The session object is released prematurely when processing events for
latency command. The session's thread objects are used within the
output_lat_thread function.
Running the following commands:
# perf sched record
# perf sched latency
the latter displays incorrect data and might cause an access violation.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1312837414-3819-1-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Linus Torvalds [Tue, 9 Aug 2011 15:42:16 +0000 (08:42 -0700)]
Merge branch 'slab/urgent' of git://git./linux/kernel/git/penberg/slab-2.6
* 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: fix check_bytes() for slub debugging
slub: Fix full list corruption if debugging is on
Arnaldo Carvalho de Melo [Tue, 9 Aug 2011 15:42:13 +0000 (12:42 -0300)]
perf tools: Check $HOME/.perfconfig ownership
Just like we do already for perf.data files.
Requested-by: Ingo Molnar <mingo@elte.hu>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Christian Ohm <chr.ohm@gmx.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jonathan Nieder <jrnieder@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-qgokmxsmvppwpc5404qhyk7e@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Linus Torvalds [Tue, 9 Aug 2011 15:41:36 +0000 (08:41 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
sound: pss - don't use the deprecated function check_region
ALSA: timer - Add NULL-check for invalid slave timer
ALSA: timer - Fix Oops at closing slave timer
ASoC: Acknowledge WM8996 interrupts before acting on them
ASoC: Rename WM8915 to WM8996
ALSA: Fix dependency of CONFIG_SND_TEA575X
ALSA: asihpi - use kzalloc()
ALSA: snd-usb-caiaq: Fix keymap for RigKontrol3
ALSA: snd-usb: Fix uninitialized variable usage
ALSA: hda - Fix a complile warning in patch_via.c
ALSA: hdspm - Fix uninitialized compile warnings
ALSA: usb-audio - add quirk for Keith McMillen StringPort
ALSA: snd-usb: operate on given mixer interface only
ALSA: snd-usb: avoid dividing by zero on invalid input
ALSA: snd-usb: Accept UAC2 FORMAT_TYPE descriptors with bLength > 6
sound: oss/pas2: Remove CLOCK_TICK_RATE dependency from PAS16 driver
ALSA: hda - Use auto-parser for ASUS UX50, Eee PC P901, S101 and P1005
ALSA: hda - Fix digital-mic mono recording on ASUS Eee PC
ASoC: sgtl5000: fix cache handling
ASoC: Disable wm_hubs periodic DC servo update
Alan Cox [Tue, 9 Aug 2011 13:30:37 +0000 (14:30 +0100)]
gma500: Fix clashes with DRM updates
The private object support has migrated from gma500 into the DRM core,
remove our now clashing copy.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ingo Molnar [Tue, 9 Aug 2011 14:44:27 +0000 (16:44 +0200)]
Merge branch 'perf/core' of git://git./linux/kernel/git/acme/linux into perf/urgent
Akinobu Mita [Sun, 7 Aug 2011 09:30:38 +0000 (18:30 +0900)]
slub: fix check_bytes() for slub debugging
The check_bytes() function is used by slub debugging. It returns a pointer
to the first unmatching byte for a character in the given memory area.
If the character for the matching byte is 0x80 or greater, check_bytes()
doesn't work, because the 64-bit pattern is generated as below.
value64 = value | value << 8 | value << 16 | value << 24;
value64 = value64 | value64 << 32;
The integer promotions are performed and sign-extended as the type of value
is u8. The upper 32 bits of value64 is 0xffffffff in the first line, and
the second line has no effect.
This fixes the 64-bit pattern generation.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Matt Mackall <mpm@selenic.com>
Reviewed-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
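The sign extension described in the check_bytes() commit above can be
reproduced in userspace. The sketch below is an assumption-laden model,
not the kernel code: pattern_sign_extended() spells out the promotion
steps with explicit casts (relying on the usual two's complement
conversion that GCC and Clang perform), and pattern_u64() shows one way
to build the pattern entirely in a 64-bit unsigned type. Both function
names are hypothetical.

```c
#include <stdint.h>

/*
 * Model of the broken pattern generation. In the original expression
 * the u8 operand is promoted to int, so the 32-bit pattern is a signed
 * value that is sign-extended when widened to 64 bits. The same steps
 * are spelled out here with explicit casts.
 */
static uint64_t pattern_sign_extended(uint8_t value)
{
	uint32_t low = (uint32_t)value | (uint32_t)value << 8 |
		       (uint32_t)value << 16 | (uint32_t)value << 24;
	uint64_t value64 = (uint64_t)(int64_t)(int32_t)low; /* sign extension */

	return value64 | value64 << 32;
}

/* One way to build the pattern without going through a signed int. */
static uint64_t pattern_u64(uint8_t value)
{
	uint64_t value64 = value;

	value64 |= value64 << 8;
	value64 |= value64 << 16;
	value64 |= value64 << 32;
	return value64;
}
```

For a byte such as 0x5a both functions agree; for 0x80 the first one
returns a value whose upper 32 bits are all ones.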
Christoph Lameter [Mon, 8 Aug 2011 16:16:56 +0000 (11:16 -0500)]
slub: Fix full list corruption if debugging is on
When a slab is freed by __slab_free() and the slab can only contain a
single object ever then it was full (and therefore not on the partial
lists but on the full list in the debug case) before we reached
slab_empty.
This caused the following full list corruption when SLUB debugging was enabled:
[ 5913.233035] ------------[ cut here ]------------
[ 5913.233097] WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
[ 5913.233101] Hardware name: Adamo 13
[ 5913.233105] list_del corruption. prev->next should be ffffea000434fd20, but was ffffea0004199520
[ 5913.233108] Modules linked in: nfs fscache fuse ebtable_nat ebtables ppdev parport_pc lp parport ipt_MASQUERADE iptable_nat nf_nat nfsd lockd nfs_acl auth_rpcgss xt_CHECKSUM sunrpc iptable_mangle bridge stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables rfcomm bnep arc4 iwlagn snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel btusb mac80211 snd_hda_codec bluetooth snd_hwdep snd_seq snd_seq_device snd_pcm usb_debug dell_wmi sparse_keymap cdc_ether usbnet cdc_acm uvcvideo cdc_wdm mii cfg80211 snd_timer dell_laptop videodev dcdbas snd microcode v4l2_compat_ioctl32 soundcore joydev tg3 pcspkr snd_page_alloc iTCO_wdt i2c_i801 rfkill iTCO_vendor_support wmi virtio_net kvm_intel kvm ipv6 xts gf128mul dm_crypt i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
[ 5913.233213] Pid: 0, comm: swapper Not tainted 3.0.0+ #127
[ 5913.233213] Call Trace:
[ 5913.233213] <IRQ> [<ffffffff8105df18>] warn_slowpath_common+0x83/0x9b
[ 5913.233213] [<ffffffff8105dfd3>] warn_slowpath_fmt+0x46/0x48
[ 5913.233213] [<ffffffff8127e7c1>] __list_del_entry+0x8d/0x98
[ 5913.233213] [<ffffffff8127e7da>] list_del+0xe/0x2d
[ 5913.233213] [<ffffffff814e0430>] __slab_free+0x1db/0x235
[ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
[ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
[ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
[ 5913.233213] [<ffffffff81133085>] kmem_cache_free+0x88/0x102
[ 5913.233213] [<ffffffff811706ab>] bvec_free_bs+0x35/0x37
[ 5913.233213] [<ffffffff811706e1>] bio_free+0x34/0x64
[ 5913.233213] [<ffffffff813dc390>] dm_bio_destructor+0x12/0x14
[ 5913.233213] [<ffffffff8116fef6>] bio_put+0x2b/0x2d
[ 5913.233213] [<ffffffff813dccab>] clone_endio+0x9e/0xb4
[ 5913.233213] [<ffffffff8116f7dd>] bio_endio+0x2d/0x2f
[ 5913.233213] [<ffffffffa00148da>] crypt_dec_pending+0x5c/0x8b [dm_crypt]
[ 5913.233213] [<ffffffffa00150a9>] crypt_endio+0x78/0x81 [dm_crypt]
[ Full discussion here: https://lkml.org/lkml/2011/8/4/375 ]
Make sure that we remove such a slab also from the full lists.
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Youquan Song [Tue, 2 Aug 2011 06:01:35 +0000 (14:01 +0800)]
perf, x86: Add model 45 SandyBridge support
Add support for Romley-EP SandyBridge.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Anhua Xu <anhua.xu@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1312264895-2010-1-git-send-email-youquan.song@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dave Martin [Thu, 28 Jul 2011 13:29:40 +0000 (14:29 +0100)]
ARM: 7008/1: alignment: Make SIGBUS sent to userspace POSIXly correct
With the UM_SIGNAL alignment fault mode, no siginfo structure is
passed to userspace.
POSIX specifies how siginfo_t should be populated for alignment
faults, so this patch does just that:
* si_signo = SIGBUS
* si_code = BUS_ADRALN
* si_addr = misaligned data address at which access was attempted
Signed-off-by: Dave Martin <dave.martin@linaro.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
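To show how userspace would consume the siginfo fields listed in the
SIGBUS commit above, here is a minimal sketch of a handler installed
with SA_SIGINFO. The variable and function names are hypothetical; only
sigaction(), siginfo_t and the si_* fields are standard POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Values recorded by the handler (hypothetical names for this sketch). */
static volatile sig_atomic_t last_signo;
static volatile sig_atomic_t last_code;
static void *volatile last_addr;

static void sigbus_handler(int signo, siginfo_t *info, void *context)
{
	(void)context;
	last_signo = signo;        /* SIGBUS */
	last_code = info->si_code; /* BUS_ADRALN for an alignment fault */
	last_addr = info->si_addr; /* address of the misaligned access */
}

/* Install the handler with SA_SIGINFO so that siginfo_t is delivered. */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

After a real alignment fault the handler would be expected to see
si_code equal to BUS_ADRALN; a signal sent with raise() or kill()
carries a different si_code.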
Dave Martin [Thu, 28 Jul 2011 13:28:52 +0000 (14:28 +0100)]
ARM: 7007/1: alignment: Prevent ignoring of faults with ARMv6 unaligned access model
Currently, it's possible to set the kernel to ignore alignment
faults when changing the alignment fault handling mode at runtime
via /proc/cpu/alignment, even though this is undesirable on ARMv6
and above, where it can result in infinite spins where an
un-fixed-up instruction repeatedly faults.
In addition, the kernel clobbers any alignment mode specified on
the command-line if running on ARMv6 or above.
This patch factors out the necessary safety check into a couple of
new helper functions, and checks and modifies the fault handling
mode as appropriate on boot and on writes to /proc/cpu/alignment.
Prior to ARMv6, the behaviour is unchanged.
For ARMv6 and above, the behaviour changes as follows:
* Attempting to ignore faults on ARMv6 results in the mode being
forced to UM_FIXUP instead. A warning is printed if this
happened as a result of a write to /proc/cpu/alignment. The
user's UM_WARN bit (if present) is still honoured.
* An alignment= argument from the kernel command-line is now
honoured, except that the kernel will modify the specified mode
as described above. This allows modes such as UM_SIGNAL and
UM_WARN to be active immediately from boot, which is useful for
debugging purposes.
Signed-off-by: Dave Martin <dave.martin@linaro.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
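The mode adjustment described in the ARMv6 alignment commit above can be
modelled in userspace as follows. This is a sketch under stated
assumptions rather than the kernel implementation: the UM_* names come
from the commit text, but their bit values, the helper name
safe_usermode() and the v6_unaligned parameter are assumptions made for
illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Mode bits; the names come from the commit text, the values are assumed. */
#define UM_WARN   (1 << 0)
#define UM_FIXUP  (1 << 1)
#define UM_SIGNAL (1 << 2)

/*
 * Hypothetical helper modelling the safety check: with the ARMv6
 * unaligned access model, a mode that would ignore faults is forced
 * to UM_FIXUP. Other bits, such as UM_WARN, are left untouched.
 */
static int safe_usermode(int new_mode, bool v6_unaligned, bool warn)
{
	if (v6_unaligned && !(new_mode & (UM_FIXUP | UM_SIGNAL))) {
		new_mode |= UM_FIXUP;
		if (warn)
			fprintf(stderr,
				"alignment: ignoring faults is unsafe here, using fixup mode\n");
	}
	return new_mode;
}
```

The same helper can serve both call sites named in the message: at boot
(warn = false) and on writes to /proc/cpu/alignment (warn = true).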
Jamie Iles [Thu, 4 Aug 2011 08:39:31 +0000 (09:39 +0100)]
ARM: 7010/1: mm: fix invalid loop for poison_init_mem
poison_init_mem() used a loop of:
while ((count = count - 4))
which has 2 problems - an off by one error so that we do one less word
than we should, and the other is that if count == 0 then we loop forever
and poison too much. On a platform with HAVE_TCM=y but nothing in the
TCM's, this caused corruption and the platform failed to boot.
Acked-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Jamie Iles <jamie@jamieiles.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
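The two problems named in the poison_init_mem() commit above are easy to
see when the loop is instrumented. The sketch below is a userspace
model: poison_buggy() keeps the loop shape quoted in the message, while
poison_fixed() is one possible corrected form and not necessarily the
exact loop the patch uses. The function names and the returned word
count are additions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define POISON_WORD 0xe7fddef0u /* illustrative fill pattern */

/*
 * The loop shape from the commit message; returns the number of words
 * written. Only call it with a nonzero multiple of 4: with count == 0
 * the subtraction wraps around and the loop runs far past the buffer.
 */
static size_t poison_buggy(uint32_t *p, size_t count)
{
	size_t written = 0;

	while ((count = count - 4)) {
		*p++ = POISON_WORD;
		written++;
	}
	return written;
}

/*
 * A corrected loop: test first, then write, then step. count is in
 * bytes and assumed to be a multiple of 4.
 */
static size_t poison_fixed(uint32_t *p, size_t count)
{
	size_t written = 0;

	for (; count != 0; count -= 4) {
		*p++ = POISON_WORD;
		written++;
	}
	return written;
}
```

For a 16-byte buffer the first loop writes three words and leaves the
last one untouched, while the second writes all four and does nothing
when count is 0.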
Brian S. Julin [Sun, 24 Jul 2011 15:53:50 +0000 (16:53 +0100)]
ARM: 7005/1: freshen up mm/proc-arm946.S
The file mm/proc-arm946.S contains a typo and is missing a structure
member in __arm946_proc_info. The former prevents compilation
and the latter causes problems during boot. It is likely this
file was manually copied from a similar file and not tested, then
later updates to the *_proc_info structures missed this file.
This patch will apply (with offset) with or without the
recent macro unification work that has been done in this directory.
This was verified against linux-next/stable last week.
See arm-linux-kernel thread:
http://lists.arm.linux.org.uk/lurker/message/20110718.103237.0106d468.en.html
Signed-off-by: Brian S. Julin <bri@abrij.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Russell King [Sat, 6 Aug 2011 08:34:26 +0000 (09:34 +0100)]
dmaengine: PL08x: Fix trivial build error
Something changed during the 3.1 merge window in the include files
which now causes the pl08x DMA engine driver to fail to build. Fix
this by adding the now necessary dma-mapping.h include:
drivers/dma/amba-pl08x.c: In function ‘pl08x_unmap_buffers’:
drivers/dma/amba-pl08x.c:1524: error: implicit declaration of function ‘dma_unmap_single’
drivers/dma/amba-pl08x.c:1527: error: implicit declaration of function ‘dma_unmap_page’
Acked-by: Vinod Koul <vinod.koul@intel.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Linus Torvalds [Mon, 8 Aug 2011 19:14:51 +0000 (12:14 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
TOMOYO: Fix incomplete read of /sys/kernel/security/tomoyo/profile
Peter Zijlstra [Wed, 27 Jul 2011 10:17:11 +0000 (12:17 +0200)]
mm: Fix fixup_user_fault() for MMU=n
In commit 2efaca927f5c ("mm/futex: fix futex writes on archs with SW
tracking of dirty & young") we forgot about MMU=n. This patch fixes
that.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/1311761831.24752.413.camel@twins
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Mon, 8 Aug 2011 18:55:20 +0000 (11:55 -0700)]
autofs4: fix debug printk warning uncovered by cleanup
The previous commit made the autofs4 debug printouts check types against
the printout format, and uncovered this bug:
fs/autofs4/waitq.c:106:2: warning: format ‘%08lx’ expects type ‘long unsigned int’, but argument 4 has type ‘autofs_wqt_t’
which is due to the insane type for wait_queue_token. That thing should
be some fixed well-defined size (preferably just 'unsigned int' or
'u32') but for unexplained reasons it is randomly either 'unsigned long'
or 'unsigned int' depending on the architecture.
For now, cast it to 'unsigned long' for printing, the way we do
elsewhere. Somebody else can try to explain the typedef mess.
(There's a reason we don't support excessive use of typedefs in the
kernel: it's usually just a good way of confusing yourself).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
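The workaround in the autofs4 printk commit above, casting to 'unsigned
long' for printing, looks like this in a small userspace sketch. The
typedef wait_token_t and the helper format_token() are hypothetical
stand-ins, not the autofs4 definitions.

```c
#include <stdio.h>

/* Stand-in for a typedef whose width differs between architectures. */
typedef unsigned int wait_token_t;

/* Format the token with %08lx; the cast makes the argument match. */
static int format_token(char *buf, size_t len, wait_token_t token)
{
	return snprintf(buf, len, "%08lx", (unsigned long)token);
}
```

Without the cast, a compiler that checks printf-style formats reports
the same kind of mismatch as the warning quoted in the message.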
Linus Torvalds [Mon, 8 Aug 2011 18:35:17 +0000 (11:35 -0700)]
autofs4: clean up uaotfs use of debug/info/warning printouts
Use 'pr_debug()' for DPRINTK, which will do the proper type checking on
the arguments (without generating code) even when DEBUG isn't #defined.
Also, use the standard __VA_ARGS__ for the macros, and stop the
pointless abuse of 'do { xyz } while (0)' when the macro is already a
perfectly well-formed single statement.
Reported-by: David Howells <dhowells@redhat.com>
Suggested-by: Joe Perches <joe@perches.com>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
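The macro style described in the autofs4 cleanup commit above can be
sketched in userspace. This is not the autofs4 code: debug_printf() is a
hypothetical stand-in for pr_debug(), and the format attribute and
'##__VA_ARGS__' are GCC/Clang extensions. The point is that arguments
are still checked against the format string when DEBUG is not defined,
and that the macro expands to a single expression.

```c
#include <stdio.h>

/*
 * Stand-in for pr_debug(). Without DEBUG it prints nothing, but the
 * format attribute still makes the compiler check the arguments.
 */
#ifdef DEBUG
#define debug_printf(...) fprintf(stderr, __VA_ARGS__)
#else
__attribute__((format(printf, 1, 2)))
static inline int debug_printf(const char *fmt, ...)
{
	(void)fmt;
	return 0;
}
#endif

/* A single expression, so no do { } while (0) wrapper is needed. */
#define DPRINTK(fmt, ...) \
	debug_printf("%s: " fmt "\n", __func__, ##__VA_ARGS__)
```

Because the macro is one expression, it can be used directly as a
statement, for example in an unbraced if/else, without the wrapper.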
Linus Torvalds [Mon, 8 Aug 2011 18:33:23 +0000 (11:33 -0700)]
cred: use 'const' in get_current_{user,groups}
Avoid annoying warnings from these functions ("discards qualifiers")
because they assign 'current_cred()' to a non-const pointer.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Mon, 8 Aug 2011 14:54:53 +0000 (15:54 +0100)]
CRED: Restore const to current_cred()
Commit 3295514841c2 ("fix rcu annotations noise in cred.h") accidentally
dropped the const of current->cred inside current_cred() by the
insertion of a cast to deal with an RCU annotation loss warning from
sparse.
Use an appropriate RCU wrapper instead so as not to lose the const.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Olsa [Fri, 22 Jul 2011 11:33:07 +0000 (13:33 +0200)]
perf tools: Add support to install perf python extension
Adding install-python_ext target to install python extension related
files. Installation directory is governed by python distutils package
and follows the DESTDIR variable settings.
Also moving python extension build output into '$(O)python_ext_build'
directory and making it configurable via PYTHON_EXTBUILD variable.
Keeping the '$(O)python/perf.so' file, so it can still be used for testing
as before.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110722113307.GA1931@jolsa.brq.redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jonathan Nieder [Fri, 5 Aug 2011 16:58:38 +0000 (18:58 +0200)]
perf tools: do not look at ./config for configuration
In addition to /etc/perfconfig and $HOME/.perfconfig, perf looks for
configuration in the file ./config, imitating git which looks at
$GIT_DIR/config. If ./config is not a perf configuration file, it
fails, or worse, treats it as a configuration file and changes behavior
in some unexpected way.
"config" is not an unusual name for a file to be lying around and perf
does not have a private directory dedicated for its own use, so let's
just stop looking for configuration in the cwd. Callers needing
context-sensitive configuration can use the PERF_CONFIG environment
variable.
Requested-by: Christian Ohm <chr.ohm@gmx.net>
Cc: 632923@bugs.debian.org
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Christian Ohm <chr.ohm@gmx.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110805165838.GA7237@elie.gateway.2wire.net
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kusanagi Kouichi [Sun, 7 Aug 2011 08:39:31 +0000 (17:39 +0900)]
perf tools: Make clean leaves some files
Use LIB_OBJS and BUILTIN_OBJS for .o files.
LIB_FILE is already prefixed with OUTPUT.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110807083932.9C0E514C03B@msa103.auone-net.jp
Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Zhu Yanhai [Sat, 30 Jul 2011 14:13:52 +0000 (22:13 +0800)]
perf lock: Dropping unsupported ':r' modifier
Looks to me like the :r modifier is not supported anymore, so remove it
from the list of events. Without this fix 'perf lock record' doesn't
work.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Zhu Yanhai <gaoyang.zyh@taobao.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1312035232-9534-1-git-send-email-gaoyang.zyh@taobao.com
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jovi Zhang [Mon, 25 Jul 2011 14:08:08 +0000 (22:08 +0800)]
perf probe: Fix coredump introduced by probe module option
perf dumps core if the user doesn't give the "-m" option to the probe
command; this patch fixes it.
[root@localhost perf]# ./perf probe --add='PROBE'
Segmentation fault (core dumped)
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311602888-2389-1-git-send-email-bookjovi@gmail.com
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Takashi Iwai [Mon, 8 Aug 2011 12:30:44 +0000 (14:30 +0200)]
Merge branch 'fix/asoc' into for-linus
Takashi Iwai [Mon, 8 Aug 2011 12:30:29 +0000 (14:30 +0200)]
Merge branch 'fix/kconfig' into for-linus
Wang Shaoyan [Mon, 8 Aug 2011 11:10:26 +0000 (19:10 +0800)]
sound: pss - don't use the deprecated function check_region
sound/oss/pss.c: In function 'configure_nonsound_components':
sound/oss/pss.c:676: warning: 'check_region' is deprecated (declared at include/linux/ioport.h:201)
Signed-off-by: Wang Shaoyan <wangshaoyan.pt@taobao.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Takashi Iwai [Mon, 8 Aug 2011 10:28:22 +0000 (12:28 +0200)]
ALSA: timer - Add NULL-check for invalid slave timer
Just to be sure.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Takashi Iwai [Mon, 8 Aug 2011 10:24:46 +0000 (12:24 +0200)]
ALSA: timer - Fix Oops at closing slave timer
A slave-timer instance has no timer reference, and this results in
NULL-dereference at stopping the timer, typically called at closing
the device.
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=40682
Cc: <stable@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Takashi Iwai [Mon, 8 Aug 2011 08:45:31 +0000 (10:45 +0200)]
Merge branch 'wm8996-rename' of git://git./linux/kernel/git/broonie/sound-2.6 into fix/asoc
Mark Brown [Wed, 20 Jul 2011 12:49:58 +0000 (13:49 +0100)]
ASoC: Acknowledge WM8996 interrupts before acting on them
This closes the small race between a status being read in response to an
interrupt and clearing the interrupt, meaning that if the status changes
between those periods we might not get a reassertion of the interrupt.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Mark Brown [Fri, 24 Jun 2011 11:10:44 +0000 (12:10 +0100)]
ASoC: Rename WM8915 to WM8996
For marketing reasons the part will be called WM8996. In order to avoid
user confusion rename the driver to reflect this.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Acked-by: Kukjin Kim <kgene.kim@samsung.com>
Acked-by: Liam Girdwood <lrg@ti.com>
Tetsuo Handa [Sat, 6 Aug 2011 14:38:30 +0000 (23:38 +0900)]
TOMOYO: Fix incomplete read of /sys/kernel/security/tomoyo/profile
Commit bd03a3e4 "TOMOYO: Add policy namespace support." forgot to set EOF flag
and forgot to print namespace at PREFERENCE line.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: James Morris <jmorris@namei.org>
Linus Torvalds [Mon, 8 Aug 2011 01:23:30 +0000 (18:23 -0700)]
Linux 3.1-rc1
Linus Torvalds [Sun, 7 Aug 2011 22:52:19 +0000 (15:52 -0700)]
Merge git://git./linux/kernel/git/davem/sparc
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc: Fix build with DEBUG_PAGEALLOC enabled.
Rafael J. Wysocki [Sun, 7 Aug 2011 22:26:50 +0000 (00:26 +0200)]
sh: Fix boot crash related to SCI
Commit d006199e72a9 ("serial: sh-sci: Regtype probing doesn't need to be
fatal.") made sci_init_single() return when sci_probe_regmap() succeeds,
although it should return when sci_probe_regmap() fails. This causes
systems using the serial sh-sci driver to crash during boot.
Fix the problem by using the right return condition.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 7 Aug 2011 22:49:11 +0000 (15:49 -0700)]
arm: remove stale export of 'sha_transform'
The generic library code already exports the generic function, this was
left-over from the ARM-specific version that just got removed.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 7 Aug 2011 21:07:03 +0000 (14:07 -0700)]
arm: remove "optimized" SHA1 routines
Since commit 1eb19a12bd22 ("lib/sha1: use the git implementation of
SHA-1"), the ARM SHA1 routines no longer work. The reason? They
depended on the larger 320-byte workspace, and now the sha1 workspace is
just 16 words (64 bytes). So the assembly version would overwrite the
stack randomly.
The optimized asm version is also probably slower than the new improved
C version, so there's no reason to keep it around. At least that was
the case in git, where what appears to be the same assembly language
version was removed two years ago because the optimized C BLK_SHA1 code
was faster.
Reported-and-tested-by: Joachim Eastwood <manabian@gmail.com>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
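The workspace sizes quoted above can be checked with a few lines of C. The constant and function names are made up for this sketch; the 80-word figure is derived from the 320 bytes cited (320 / 4) and matches the 80-entry SHA-1 message schedule.

```c
#include <stddef.h>
#include <stdint.h>

enum {
	SHA_WORDS_OLD = 80, /* 80 32-bit words: the 320-byte workspace */
	SHA_WORDS_NEW = 16  /* 16 words, as stated in the commit message */
};

/* Size in bytes of a workspace made of 32-bit words. */
size_t workspace_bytes(size_t words)
{
	return words * sizeof(uint32_t);
}

/* Bytes a routine written for the old workspace could write past the end
 * of a buffer sized for the new one. */
size_t overrun_bytes(void)
{
	return workspace_bytes(SHA_WORDS_OLD) - workspace_bytes(SHA_WORDS_NEW);
}
```

So code that assumes the larger workspace could touch up to 256 bytes beyond a 64-byte buffer, which is consistent with the stack corruption described.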
Al Viro [Sun, 7 Aug 2011 17:55:11 +0000 (18:55 +0100)]
fix rcu annotations noise in cred.h
task->cred is declared as __rcu, and access to other tasks' ->cred is,
indeed, protected. Access to current->cred does not need rcu_dereference()
at all, since only the task itself can change its ->cred. sparse, of
course, has no way of knowing that...
Add force-cast in current_cred(), make current_fsuid() et al. use it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 7 Aug 2011 16:53:20 +0000 (09:53 -0700)]
vfs: rename 'do_follow_link' to 'should_follow_link'
Al points out that the do_follow_link() helper function really is
misnamed - it's about whether we should try to follow a symlink or not,
not about actually doing the following.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Takashi Iwai [Sun, 7 Aug 2011 15:34:07 +0000 (17:34 +0200)]
ALSA: Fix dependency of CONFIG_SND_TEA575X
CONFIG_SND_TEA575X is enabled by RADIO_SF16FMR2, but the latter is
not a PCI device. Since tea575x-tuner itself is independent from the board
bus type, the config should be moved out of SND_PCI dependency.
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Thomas Meyer [Sat, 6 Aug 2011 11:26:20 +0000 (13:26 +0200)]
ALSA: asihpi - use kzalloc()
Use kzalloc rather than kmalloc followed by memset with 0.
This considers some simple cases that are common and easy to validate.
Note in particular that there are no ...s in the rule, so all of the
matched code has to be contiguous.
The semantic patch that makes this output is available
in scripts/coccinelle/api/alloc/kzalloc-simple.cocci.
More information about semantic patching is available at
http://coccinelle.lip6.fr/
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
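The transformation has a close userspace analogue, sketched here with malloc()/memset() and calloc() standing in for kmalloc()/memset() and kzalloc(). `struct item` and both function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

struct item {
	int id;
	char name[16];
};

/* Before: allocate, then clear the memory in a second step. */
struct item *item_new_two_step(void)
{
	struct item *it = malloc(sizeof(*it));

	if (it)
		memset(it, 0, sizeof(*it));
	return it;
}

/* After: a single call that returns zero-filled memory. */
struct item *item_new_zeroed(void)
{
	return calloc(1, sizeof(struct item));
}
```

Both functions return the same zeroed object; the second form is shorter and cannot forget the clearing step.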
Ari Savolainen [Sat, 6 Aug 2011 16:43:07 +0000 (19:43 +0300)]
Fix POSIX ACL permission check
After commit 3567866bf261: "RCUify freeing acls, let check_acl() go ahead in
RCU mode if acl is cached" posix_acl_permission is being called with an
unsupported flag and the permission check fails. This patch fixes the issue.
Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Sun, 7 Aug 2011 05:56:03 +0000 (22:56 -0700)]
Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
ore: Make ore its own module
exofs: Rename raid engine from exofs/ios.c => ore
exofs: ios: Move to a per inode components & device-table
exofs: Move exofs specific osd operations out of ios.c
exofs: Add offset/length to exofs_get_io_state
exofs: Fix truncate for the raid-groups case
exofs: Small cleanup of exofs_fill_super
exofs: BUG: Avoid sbi realloc
exofs: Remove pnfs-osd private definitions
nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4
Linus Torvalds [Sun, 7 Aug 2011 05:45:50 +0000 (22:45 -0700)]
vfs: optimize inode cache access patterns
The inode structure layout is largely random, and some of the vfs paths
really do care. The path lookup in particular is already quite D$
intensive, and profiles show that accessing the 'inode->i_op->xyz'
fields is quite costly.
We already optimized the dcache to not unnecessarily load the d_op
structure for members that are often NULL using the DCACHE_OP_xyz bits
in dentry->d_flags, and this does something very similar for the inode
ops that are used during pathname lookup.
It also re-orders the fields so that the fields accessed by 'stat' are
together at the beginning of the inode structure, and roughly in the
order accessed.
The effect of this seems to be in the 1-2% range for an empty kernel
"make -j" run (which is fairly kernel-intensive, mostly in filename
lookup), so it's visible. The numbers are fairly noisy, though, and
likely depend a lot on exact microarchitecture. So there's more tuning
to be done.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
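The idea of caching "is this operation present?" in a flag word can be sketched as follows. All names here (`struct ops`, `struct node`, `OPF_LOOKUP`, `node_init`, `node_lookup`, `example_lookup`) are invented for illustration and do not mirror the actual inode layout.

```c
#include <stddef.h>

/* Hypothetical operations table; entries are often NULL. */
struct ops {
	int (*lookup)(int key);
};

#define OPF_LOOKUP 0x1u /* set when ops->lookup is non-NULL */

struct node {
	unsigned int opflags;  /* hot: tested on every lookup */
	const struct ops *ops; /* cold: only loaded when the flag is set */
};

/* Record once, at setup time, which operations exist. */
void node_init(struct node *n, const struct ops *ops)
{
	n->ops = ops;
	n->opflags = (ops && ops->lookup) ? OPF_LOOKUP : 0;
}

/* The common case tests a flag already in the node's cache line and never
 * dereferences n->ops. Returns -1 when no lookup operation exists. */
int node_lookup(const struct node *n, int key)
{
	if (!(n->opflags & OPF_LOOKUP))
		return -1;
	return n->ops->lookup(key);
}

/* Sample operation used to exercise the sketch. */
int example_lookup(int key)
{
	return key * 2;
}
```

The saving comes from avoiding a second dependent load (node, then ops table) when the answer is usually "no such operation".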
Linus Torvalds [Sun, 7 Aug 2011 05:41:50 +0000 (22:41 -0700)]
vfs: renumber DCACHE_xyz flags, remove some stale ones
Gcc tends to generate better code with small integers, including the
DCACHE_xyz flag tests - so move the common ones to be first in the list.
Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
DCACHE_AUTOFS_PENDING values, their users no longer exist in the source
tree.
And add an "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
common case to be a nice straight-line fall-through.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
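A minimal sketch of the unlikely() hint, assuming GCC or Clang (`__builtin_expect` is a compiler builtin, not standard C). `FLAG_CUSTOM_COMPARE` and `names_match` are hypothetical, and `strcasecmp` (POSIX) merely plays the role of a rarely used custom comparison.

```c
#include <string.h>
#include <strings.h>

/* Tell the compiler the condition is expected to be false, so the common
 * path can be laid out as straight-line fall-through code. */
#define unlikely(x) __builtin_expect(!!(x), 0)

#define FLAG_CUSTOM_COMPARE 0x2u /* hypothetical: a custom compare is set */

/* Returns nonzero when the two names match. */
int names_match(unsigned int flags, const char *a, const char *b)
{
	if (unlikely(flags & FLAG_CUSTOM_COMPARE))
		return strcasecmp(a, b) == 0; /* rare path */

	return strcmp(a, b) == 0; /* common path */
}
```

The hint changes only code layout and branch prediction defaults, not the result of the test.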