Josef Bacik [Thu, 10 Nov 2011 13:29:20 +0000 (08:29 -0500)]
Btrfs: add allocator tracepoints
I used these tracepoints when figuring out what the cluster stuff was doing, so
add them to mainline in case we need to profile this stuff again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Josef Bacik [Fri, 13 Jan 2012 00:10:12 +0000 (19:10 -0500)]
Btrfs: don't call btrfs_throttle in file write
Btrfs_throttle will make us wait if there is a currently committing transaction
until we can open new transactions, which is ridiculous since we don't actually
start any transactions within the file write path anyway, so all this does is
introduce big latencies if we have a sync/fsync heavy workload going on while
somebody else is trying to do work. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Fri, 13 Jan 2012 00:10:12 +0000 (19:10 -0500)]
Btrfs: release space on error in page_mkwrite
If updating the inode gave us an ENOSPC we were just returning in page_mkwrite,
which is a problem since we make our reservation right before trying to update
the inode, so fix the out label so that we actually free our reservation.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Miao Xie [Fri, 13 Jan 2012 00:10:12 +0000 (19:10 -0500)]
Btrfs: fix btrfsck error 400 when truncating a compressed
Reproduce steps:
# mkfs.btrfs /dev/sdb5
# mount /dev/sdb5 -o compress=lzo /mnt
# dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1
# sync
# truncate -s 64K /mnt/tmpfile
root 5 inode 257 errors 400
This is because of the wrong if condition, which is used to check if we should
subtract the bytes of the dropped range from i_blocks/i_bytes of i-node or not.
When we truncate a compressed extent, btrfs substracts the bytes of the whole
extent, it's wrong. We should substract the real size that we truncate, no
matter it is a compressed extent or not. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Fri, 13 Jan 2012 00:10:12 +0000 (19:10 -0500)]
Btrfs: do not use btrfs_end_transaction_throttle everywhere
A user reported a problem where things like open with O_CREAT would take up to
30 seconds when he had nfs activity on the same mount. This is because all of
our quick metadata operations, like create, symlink etc all do
btrfs_end_transaction_throttle, which if the transaction is blocked will wait
for the commit to complete before it returns. This adds a ridiculous amount of
latency and isn't really needed. The normal btrfs_end_transaction will mark the
transaction as blocked and wake the transaction kthread up if it thinks the
transaction needs to end (this being in the running out of global reserve space
scenario), and this is all that is really needed since we've already done
everything we're going to do, we just need to return. This should help people
with the latency they were seeing when using synchronous heavy workloads.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Mon, 16 Jan 2012 20:27:58 +0000 (15:27 -0500)]
Merge branch 'integrity-check-patch-v2' of git://btrfs.giantdisaster.de/git/btrfs into integration
Conflicts:
fs/btrfs/ctree.h
fs/btrfs/super.c
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Mon, 16 Jan 2012 20:26:31 +0000 (15:26 -0500)]
Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into integration
Chris Mason [Mon, 16 Jan 2012 20:26:17 +0000 (15:26 -0500)]
Merge branch 'for-chris' of git://repo.or.cz/linux-btrfs-devel into integration
Conflicts:
fs/btrfs/volumes.c
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Mon, 16 Jan 2012 20:26:02 +0000 (15:26 -0500)]
Merge branch 'restriper' of git://github.com/idryomov/btrfs-unstable into integration
Chris Mason [Mon, 16 Jan 2012 20:25:42 +0000 (15:25 -0500)]
Merge branch 'allocation-fixes' into integration
Ilya Dryomov [Mon, 16 Jan 2012 20:04:49 +0000 (22:04 +0200)]
Btrfs: add balance progress reporting
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:49 +0000 (22:04 +0200)]
Btrfs: allow for resuming restriper after it was paused
Recognize BTRFS_BALANCE_RESUME flag passed from userspace. We use the
same heuristics used when recovering balance after a crash to try to
start where we left off last time.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:49 +0000 (22:04 +0200)]
Btrfs: allow for canceling restriper
Implement an ioctl for canceling restriper. Currently we wait until
relocation of the current block group is finished, in future this can be
done by triggering a commit. Balance item is deleted and no memory
about the interrupted balance is kept.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:49 +0000 (22:04 +0200)]
Btrfs: allow for pausing restriper
Implement an ioctl for pausing restriper. This pauses the relocation,
but balance is still considered to be "in progress": balance item is
not deleted, other volume operations cannot be started, etc. If paused
in the middle of profile changing operation we will continue making
allocations with the target profile.
Add a hook to close_ctree() to pause restriper and free its data
structures on unmount. (It's safe to unmount when restriper is in
"paused" state, we will resume with the same parameters on the next
mount)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:48 +0000 (22:04 +0200)]
Btrfs: add skip_balance mount option
Since restriper kthread starts involuntarily on mount and can suck cpu
and memory bandwidth add a mount option to forcefully skip it. The
restriper in that case hangs around in paused state and can be resumed
from userspace when it's convenient.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:48 +0000 (22:04 +0200)]
Btrfs: recover balance on mount
On mount, if balance item is found, resume balance in a separate
kernel thread.
Try to be smart to continue roughly where previous balance (or convert)
was interrupted. For chunk types that were being converted to some
profile we turn on soft convert, in case of a simple balance we turn on
usage filter and relocate only less-than-90%-full chunks of that type.
These are just heuristics but they help quite a bit, and can be improved
in future.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:48 +0000 (22:04 +0200)]
Btrfs: save balance parameters to disk
Introduce a new btree objectid for storing balance item. The reason is
to be able to resume restriper after a crash with the same parameters.
Balance item has a very high objectid and goes into tree of tree roots.
The key for the new item is as follows:
[ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ]
Older kernels simply ignore it so it's safe to mount with an older
kernel and then go back to the newer one.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:48 +0000 (22:04 +0200)]
Btrfs: soft profile changing mode (aka soft convert)
When doing convert from one profile to another if soft mode is on
restriper won't touch chunks that already have the profile we are
converting to. This is useful if e.g. half of the FS was converted
earlier.
The soft mode switch is (like every other filter) per-type. This means
that we can convert for example meta chunks the "hard" way while
converting data chunks selectively with soft switch.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:48 +0000 (22:04 +0200)]
Btrfs: implement online profile changing
Profile changing is done by launching a balance with
BTRFS_BALANCE_CONVERT bits set and target fields of respective
btrfs_balance_args structs initialized. Profile reducing code in this
case will pick restriper's target profile if it's available instead of
doing a blind reduce. If target profile is not yet available it goes
back to a plain reduce.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:48 +0000 (22:04 +0200)]
Btrfs: do not reduce profile in do_chunk_alloc()
Every caller of do_chunk_alloc() feeds it the reduced allocation
profile, so stop trying to reduce it one more time. Instead check the
validity of the passed profile.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:48 +0000 (22:04 +0200)]
Btrfs: virtual address space subset filter
Select chunks which have at least one byte located inside a given
[vstart, vend) virtual address space range.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:48 +0000 (22:04 +0200)]
Btrfs: devid subset filter
Select chunks which have at least one byte of at least one stripe
located on a device with devid X in a given [pstart,pend) physical
address range.
This filter only works when devid filter is turned on.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:47 +0000 (22:04 +0200)]
Btrfs: devid filter
Relocate chunks which have at least one stripe located on a device with
devid X.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:47 +0000 (22:04 +0200)]
Btrfs: usage filter
Select chunks that are less than X percent full.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:47 +0000 (22:04 +0200)]
Btrfs: profiles filter
Select chunks based on a given profile mask.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:47 +0000 (22:04 +0200)]
Btrfs: add basic infrastructure for selective balancing
This allows to have a separate set of filters for each chunk type
(data,meta,sys). The code however is generic and switch on chunk type
is only done once.
This commit also adds a type filter: it allows to balance for example
meta and system chunks w/o touching data ones.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:47 +0000 (22:04 +0200)]
Btrfs: add basic restriper infrastructure
Add basic restriper infrastructure: extended balancing ioctl and all
related ioctl data structures, add data structure for tracking
restriper's state to fs_info, etc. The semantics of the old balancing
ioctl are fully preserved.
Explicitly disallow any volume operations when balance is in progress.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:47 +0000 (22:04 +0200)]
Btrfs: make avail_*_alloc_bits fields dynamic
Currently when new chunks are created respective avail_alloc_bits field
is updated to reflect profiles of all chunks present in the system.
However when chunks are removed profile bits are never cleared.
This patch clears profile bit of respective avail_alloc_bits field when
the last chunk with that profile is removed. Restriper needs this to
properly operate when "downgrading".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:47 +0000 (22:04 +0200)]
Btrfs: add BTRFS_AVAIL_ALLOC_BIT_SINGLE bit
Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for
avail_{data,metadata,system}_alloc_bits fields, which gather info about
available allocation profiles in the FS. When chunk is created or read
from disk, its profile is OR'ed with the corresponding avail_alloc_bits
field. Since SINGLE is denoted by 0 in the on-disk format, currently
there is no way to tell when such chunks become avaialble. Restriper
needs that information, so add a separate bit for SINGLE profile.
This bit is going to be in-memory only, it should never be written out
to disk, so it's not a disk format change. However to avoid remappings
in future, reserve corresponding on-disk bit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:47 +0000 (22:04 +0200)]
Btrfs: introduce masks for chunk type and profile
Chunk's type and profile are encoded in u64 flags field. Introduce
masks to easily access them. Also fix the type of BTRFS_BLOCK_GROUP_*
constants, it should be ULL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Mon, 16 Jan 2012 20:04:47 +0000 (22:04 +0200)]
Btrfs: get rid of *_alloc_profile fields
{data,metadata,system}_alloc_profile fields have been unused for a long
time now. Get rid of them.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Li Zefan [Wed, 7 Dec 2011 03:38:24 +0000 (11:38 +0800)]
Btrfs: fix possible deadlock when opening a seed device
The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex,
but when we mount a filesystem which has backing seed devices, we have
this lock chain:
open_ctree()
lock(chunk_mutex);
read_chunk_tree();
read_one_dev();
open_seed_devices();
lock(uuid_mutex);
and then we hit a lockdep splat.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Wed, 7 Dec 2011 02:39:22 +0000 (10:39 +0800)]
Btrfs: update global block_rsv when creating a new block group
A bug was triggered while using seed device:
# mkfs.btrfs /dev/loop1
# btrfstune -S 1 /dev/loop1
# mount -o /dev/loop1 /mnt
# btrfs dev add /dev/loop2 /mnt
btrfs: block rsv returned -28
------------[ cut here ]------------
WARNING: at fs/btrfs/extent-tree.c:5969 btrfs_alloc_free_block+0x166/0x396 [btrfs]()
...
Call Trace:
...
[<
f7b7c31c>] btrfs_cow_block+0x101/0x147 [btrfs]
[<
f7b7eaa6>] btrfs_search_slot+0x1b8/0x55f [btrfs]
[<
f7b7f844>] btrfs_insert_empty_items+0x42/0x7f [btrfs]
[<
f7b7f8c1>] btrfs_insert_item+0x40/0x7e [btrfs]
[<
f7b8ac02>] btrfs_make_block_group+0x243/0x2aa [btrfs]
[<
f7bb3f53>] __btrfs_alloc_chunk+0x672/0x70e [btrfs]
[<
f7bb41ff>] init_first_rw_device+0x77/0x13c [btrfs]
[<
f7bb5a62>] btrfs_init_new_device+0x664/0x9fd [btrfs]
[<
f7bbb65a>] btrfs_ioctl+0x694/0xdbe [btrfs]
[<
c04f55f7>] do_vfs_ioctl+0x496/0x4cc
[<
c04f5660>] sys_ioctl+0x33/0x4f
[<
c07b9edf>] sysenter_do_call+0x12/0x38
---[ end trace
906adac595facc7d ]---
Since seed device is readonly, there's no usable space in the filesystem.
Afterwards we add a sprout device to it, and the kernel creates a METADATA
block group and a SYSTEM block group where comes free space we can reserve,
but we still get revervation failure because the global block_rsv hasn't
been updated accordingly.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Thu, 29 Dec 2011 06:47:27 +0000 (14:47 +0800)]
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/:
1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/:
2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Thu, 1 Dec 2011 06:06:42 +0000 (14:06 +0800)]
Btrfs: simplfy calculation of stripe length for discard operation
For btrfs raid, while discarding a range of space, we'll need to know
the start offset and length to discard for each device, and it's done
in btrfs_map_block().
However the calculation is a bit complex for raid0 and raid10, so I
reimplement it based on a fact that:
dev1 dev2 dev3 (raid0)
-----------------------------------
s0 s3 s6 s1 s4 s7 s2 s5
Each device has (total_stripes / nr_dev) stripes, or plus one.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Thu, 1 Dec 2011 04:55:47 +0000 (12:55 +0800)]
Btrfs: don't pre-allocate btrfs bio
We pre-allocate a btrfs bio with fixed size, and then may re-allocate
memory if we find stripes are bigger than the fixed size. But this
pre-allocation is not necessary.
Also we don't have to calcuate the stripe number twice.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Thu, 8 Dec 2011 07:07:24 +0000 (15:07 +0800)]
Btrfs: don't pass a trans handle unnecessarily in volumes.c
Some functions never use the transaction handle passed to them.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Thu, 29 Dec 2011 05:39:50 +0000 (13:39 +0800)]
Btrfs: reserve metadata space in btrfs_ioctl_setflags()
Check and reserve space for btrfs_update_inode().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Thu, 29 Dec 2011 05:36:45 +0000 (13:36 +0800)]
Btrfs: remove BUG_ON()s in btrfs_ioctl_setflags()
We can recover from errors and return -errno to user space.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Mon, 9 Jan 2012 06:36:28 +0000 (14:36 +0800)]
Btrfs: check the return value of io_ctl_init()
It can return -ENOMEM.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Mon, 9 Jan 2012 06:27:42 +0000 (14:27 +0800)]
Btrfs: avoid possible NULL deref in io_ctl_drop_pages()
If we run into some failure path in io_ctl_prepare_pages(),
io_ctl->pages[] array may have some NULL pointers.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Tue, 10 Jan 2012 08:41:01 +0000 (16:41 +0800)]
Btrfs: add pinned extents to on-disk free space cache correctly
I got this while running xfstests:
[24256.836098] block group
317849600 has an wrong amount of free space
[24256.836100] btrfs: failed to load free space cache for block group
317849600
We should clamp the extent returned by find_first_extent_bit(),
so the start of the extent won't smaller than the start of the
block group.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Li Zefan [Wed, 11 Jan 2012 01:54:49 +0000 (09:54 +0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mason/linux-btrfs into for-linus
Alexandre Oliva [Fri, 14 Oct 2011 15:10:36 +0000 (12:10 -0300)]
Btrfs: revamp clustered allocation logic
Parameterize clusters on minimum total size, minimum chunk size and
minimum contiguous size for at least one chunk, without limits on
cluster, window or gap sizes. Don't tolerate any fragmentation for
SSD_SPREAD; accept it for metadata, but try to keep data dense.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Alexandre Oliva [Mon, 28 Nov 2011 14:36:17 +0000 (12:36 -0200)]
Btrfs: don't set up allocation result twice
We store the allocation start and length twice in ins, once right
after the other, but with intervening calls that may prevent the
duplicate from being optimized out by the compiler. Remove one of the
assignments.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Alexandre Oliva [Mon, 12 Dec 2011 06:48:19 +0000 (04:48 -0200)]
Btrfs: test free space only for unclustered allocation
Since the clustered allocation may be taking extents from a different
block group, there's no point in spin-locking and testing the current
block group free space before attempting to allocate space from a
cluster, even more so when we might refrain from even trying the
cluster in the current block group because, after the cluster was set
up, not enough free space remained. Furthermore, cluster creation
attempts fail fast when the block group doesn't have enough free
space, so the test was completely superfluous.
I've move the free space test past the cluster allocation attempt,
where it is more useful, and arranged for a cluster in the current
block group to be released before trying an unclustered allocation,
when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
the cluster stands a chance of being combined with additional free
space in the block group so as to succeed in the allocation attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 6 Jan 2012 20:47:38 +0000 (15:47 -0500)]
Btrfs: use bigger metadata chunks on bigger filesystems
The 256MB chunk is a little small on a huge FS. This scales up the
chunk size.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 6 Jan 2012 20:41:34 +0000 (15:41 -0500)]
Btrfs: lower the bar for chunk allocation
The chunk allocation code has tried to keep a pretty tight lid on creating new
metadata chunks. This is partially because in the past the reservation
code didn't give us an accurate idea of how much space was being used.
The new code is much more accurate, so we're able to get rid of some of these
checks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 6 Jan 2012 20:23:57 +0000 (15:23 -0500)]
Btrfs: run chunk allocations while we do delayed refs
Btrfs tries to batch extent allocation tree changes to improve performance
and reduce metadata trashing. But it doesn't allocate new metadata chunks
while it is doing allocations for the extent allocation tree.
This commit changes the delayed refence code to do chunk allocations if we're
getting low on room. It prevents crashes and improves performance.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Jan Schmidt [Thu, 1 Dec 2011 13:35:19 +0000 (14:35 +0100)]
Btrfs: make sure we're not using obsolete code in btrfs_get_extent
There's code in btrfs_get_extent that should never be used. This patch turns
a WARN_ON(1) into a BUG(), hoping we can remove the transaction code from
btrfs_get_extent soon.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Jan Schmidt [Fri, 2 Dec 2011 13:56:41 +0000 (14:56 +0100)]
Btrfs: new backref walking code
The old backref iteration code could only safely be used on commit roots.
Besides this limitation, it had bugs in finding the roots for these
references. This commit replaces large parts of it by btrfs_find_all_roots()
which a) really finds all roots and the correct roots, b) works correctly
under heavy file system load, c) considers delayed refs.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Linus Torvalds [Wed, 4 Jan 2012 23:55:44 +0000 (15:55 -0800)]
Linux 3.2
Linus Torvalds [Wed, 4 Jan 2012 23:03:49 +0000 (15:03 -0800)]
Merge git://git./linux/kernel/git/davem/net
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
fix CAN MAINTAINERS SCM tree type
mwifiex: fix crash during simultaneous scan and connect
b43: fix regression in PIO case
ath9k: Fix kernel panic in AR2427 in AP mode
CAN MAINTAINERS update
net: fsl: fec: fix build for mx23-only kernel
sch_qfq: fix overflow in qfq_update_start()
Revert "Bluetooth: Increase HCI reset timeout in hci_dev_do_close"
Al Viro [Wed, 4 Jan 2012 10:51:03 +0000 (10:51 +0000)]
minixfs: misplaced checks lead to dentry leak
bitmap size sanity checks should be done *before* allocating ->s_root;
there their cleanup on failure would be correct. As it is, we do iput()
on root inode, but leak the root dentry...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 4 Jan 2012 16:29:20 +0000 (17:29 +0100)]
ptrace: ensure JOBCTL_STOP_SIGMASK is not zero after detach
This is the temporary simple fix for 3.2, we need more changes in this
area.
1. do_signal_stop() assumes that the running untraced thread in the
stopped thread group is not possible. This was our goal but it is
not yet achieved: a stopped-but-resumed tracee can clone the running
thread which can initiate another group-stop.
Remove WARN_ON_ONCE(!current->ptrace).
2. A new thread always starts with ->jobctl = 0. If it is auto-attached
and this group is stopped, __ptrace_unlink() sets JOBCTL_STOP_PENDING
but JOBCTL_STOP_SIGMASK part is zero, this triggers WANR_ON(!signr)
in do_jobctl_trap() if another debugger attaches.
Change __ptrace_unlink() to set the artificial SIGSTOP for report.
Alternatively we could change ptrace_init_task() to copy signr from
current, but this means we can copy it for no reason and hide the
possible similar problems.
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org> [3.1]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 4 Jan 2012 16:29:02 +0000 (17:29 +0100)]
ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race
Test-case:
int main(void)
{
int pid, status;
pid = fork();
if (!pid) {
for (;;) {
if (!fork())
return 0;
if (waitpid(-1, &status, 0) < 0) {
printf("ERR!! wait: %m\n");
return 0;
}
}
}
assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
assert(waitpid(-1, NULL, 0) == pid);
assert(ptrace(PTRACE_SETOPTIONS, pid, 0,
PTRACE_O_TRACEFORK) == 0);
do {
ptrace(PTRACE_CONT, pid, 0, 0);
pid = waitpid(-1, NULL, 0);
} while (pid > 0);
return 1;
}
It fails because ->real_parent sees its child in EXIT_DEAD state
while the tracer is going to change the state back to EXIT_ZOMBIE
in wait_task_zombie().
The offending commit is
823b018e which moved the EXIT_DEAD check,
but in fact we should not blame it. The original code was not
correct as well because it didn't take ptrace_reparented() into
account and because we can't really trust ->ptrace.
This patch adds the additional check to close this particular
race but it doesn't solve the whole problem. We simply can't
rely on ->ptrace in this case, it can be cleared if the tracer
is multithreaded by the exiting ->parent.
I think we should kill EXIT_DEAD altogether, we should always
remove the soon-to-be-reaped child from ->children or at least
we should never do the DEAD->ZOMBIE transition. But this is too
complex for 3.2.
Reported-and-tested-by: Denys Vlasenko <vda.linux@googlemail.com>
Tested-by: Lukasz Michalik <lmi@ift.uni.wroc.pl>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org> [3.0+]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 4 Jan 2012 22:57:55 +0000 (14:57 -0800)]
Merge git://git.samba.org/sfrench/cifs-2.6
* git://git.samba.org/sfrench/cifs-2.6:
[CIFS] default ntlmv2 for cifs mount delayed to 3.3
cifs: fix bad buffer length check in coalesce_t2
John W. Linville [Wed, 4 Jan 2012 16:37:30 +0000 (11:37 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/linville/wireless into for-davem
Linus Torvalds [Wed, 4 Jan 2012 15:57:22 +0000 (07:57 -0800)]
Revert "rtc: Expire alarms after the time is set."
This reverts commit
93b2ec0128c431148b216b8f7337c1a52131ef03.
The call to "schedule_work()" in rtc_initialize_alarm() happens too
early, and can cause oopses at bootup
Neil Brown explains why we do it:
"If you set an alarm in the future, then shutdown and boot again after
that time, then you will end up with a timer_queue node which is in
the past.
When this happens the queue gets stuck. That entry-in-the-past won't
get removed until and interrupt happens and an interrupt won't happen
because the RTC only triggers an interrupt when the alarm is "now".
So you'll find that e.g. "hwclock" will always tell you that
'select' timed out.
So we force the interrupt work to happen at the start just in case."
and has a patch that convert it to do things in-process rather than with
the worker thread, but right now it's too late to play around with this,
so we just revert the patch that caused problems for now.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Requested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Requested-by: John Stultz <john.stultz@linaro.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Schmidt [Wed, 23 Nov 2011 17:55:04 +0000 (18:55 +0100)]
Btrfs: added btrfs_find_all_roots()
This function gets a byte number (a data extent), collects all the leafs
pointing to it and walks up the trees to find all fs roots pointing to those
leafs. It also returns the list of all leafs pointing to that extent.
It does proper locking for the involved trees, can be used on busy file
systems and honors delayed refs.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Jan Schmidt [Mon, 12 Dec 2011 15:10:07 +0000 (16:10 +0100)]
Btrfs: add waitqueue instead of doing busy waiting for more delayed refs
Now that we may be holding back delayed refs for a limited period, we
might end up having no runnable delayed refs. Without this commit, we'd
do busy waiting in that thread until another (runnable) ref arives.
Instead, we're detecting this situation and use a waitqueue, such that
we only try to run more refs after
a) another runnable ref was added or
b) delayed refs are no longer held back
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Arne Jansen [Tue, 13 Sep 2011 13:16:43 +0000 (15:16 +0200)]
Btrfs: put back delayed refs that are too new
When processing a delayed ref, first check if there are still old refs in
the process of being added. If so, put this ref back to the tree. To avoid
looping on this ref, choose a newer one in the next loop.
btrfs_find_ref_cluster has to take care of that.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Arne Jansen [Wed, 14 Sep 2011 10:37:00 +0000 (12:37 +0200)]
Btrfs: add sequence numbers to delayed refs
Sequence numbers are needed to reconstruct the backrefs of a given extent to
a certain point in time. The total set of backrefs consist of the set of
backrefs recorded on disk plus the enqueued delayed refs for it that existed
at that moment.
This patch also adds a list that records all delayed refs which are
currently in the process of being added.
When walking all refs of an extent in btrfs_find_all_roots(), we freeze the
current state of delayed refs, honor anythinh up to this point and prevent
processing newer delayed refs to assert consistency.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Arne Jansen [Tue, 13 Sep 2011 08:55:48 +0000 (10:55 +0200)]
Btrfs: add nested locking mode for paths
This patch adds the possibilty to read-lock an extent even if it is already
write-locked from the same thread. btrfs_find_all_roots() needs this
capability.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Steve French [Wed, 4 Jan 2012 05:08:24 +0000 (23:08 -0600)]
[CIFS] default ntlmv2 for cifs mount delayed to 3.3
Turned out the ntlmv2 (default security authentication)
upgrade was harder to test than expected, and we ran
out of time to test against Apple and a few other servers
that we wanted to. Delay upgrade of default security
from ntlm to ntlmv2 (on mount) to 3.3. Still works
fine to specify it explicitly via "sec=ntlmv2" so this
should be fine.
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Jeff Layton [Sun, 1 Jan 2012 15:34:39 +0000 (10:34 -0500)]
cifs: fix bad buffer length check in coalesce_t2
The current check looks to see if the RFC1002 length is larger than
CIFSMaxBufSize, and fails if it is. The buffer is actually larger than
that by MAX_CIFS_HDR_SIZE.
This bug has been around for a long time, but the fact that we used to
cap the clients MaxBufferSize at the same level as the server tended
to paper over it. Commit
c974befa changed that however and caused this
bug to bite in more cases.
Reported-and-Tested-by: Konstantinos Skarlatos <k.skarlatos@gmail.com>
Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Linus Torvalds [Wed, 4 Jan 2012 01:32:13 +0000 (17:32 -0800)]
Revert "rtc: Disable the alarm in the hardware"
This reverts commit
c0afabd3d553c521e003779c127143ffde55a16f.
It causes failures on Toshiba laptops - instead of disabling the alarm,
it actually seems to enable it on the affected laptops, resulting in
(for example) the laptop powering on automatically five minutes after
shutdown.
There's a patch for it that appears to work for at least some people,
but it's too late to play around with this, so revert for now and try
again in the next merge window.
See for example
http://bugs.debian.org/652869
Reported-and-bisected-by: Andreas Friedrich <afrie@gmx.net> (Toshiba Tecra)
Reported-by: Antonio-M. Corbi Bellot <antonio.corbi@ua.es> (Toshiba Portege R500)
Reported-by: Marco Santos <marco.santos@waynext.com> (Toshiba Portege Z830)
Reported-by: Christophe Vu-Brugier <cvubrugier@yahoo.fr> (Toshiba Portege R830)
Cc: Jonathan Nieder <jrnieder@gmail.com>
Requested-by: John Stultz <john.stultz@linaro.org>
Cc: stable@kernel.org # for the versions that applied this
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mandeep Singh Baines [Tue, 3 Jan 2012 22:41:13 +0000 (14:41 -0800)]
hung_task: fix false positive during vfork
vfork parent uninterruptibly and unkillably waits for its child to
exec/exit. This wait is of unbounded length. Ignore such waits
in the hung_task detector.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
LKML-Reference: <
1325344394.28904.43.camel@lappy>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Tue, 3 Jan 2012 12:14:29 +0000 (13:14 +0100)]
security: Fix security_old_inode_init_security() when CONFIG_SECURITY is not set
Commit
1e39f384bb01 ("evm: fix build problems") makes the stub version
of security_old_inode_init_security() return 0 when CONFIG_SECURITY is
not set.
But that makes callers such as reiserfs_security_init() assume that
security_old_inode_init_security() has set name, value, and len
arguments properly - but security_old_inode_init_security() left them
uninitialized which then results in interesting failures.
Revert security_old_inode_init_security() to the old behavior of
returning EOPNOTSUPP since both callers (reiserfs and ocfs2) handle this
just fine.
[ Also fixed the S_PRIVATE(inode) case of the actual non-stub
security_old_inode_init_security() function to return EOPNOTSUPP
for the same reason, as pointed out by Mimi Zohar.
It got incorrectly changed to match the new function in commit
fb88c2b6cbb1: "evm: fix security/security_old_init_security return
code". - Linus ]
Reported-by: Jorge Bastos <mysql.jorge@decimal.pt>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oliver Hartkopp [Tue, 3 Jan 2012 19:57:43 +0000 (14:57 -0500)]
fix CAN MAINTAINERS SCM tree type
As pointed out by Joe Perches the SCM tree type was missing in my patch.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
CC: Oliver Hartkopp <oliver.hartkopp@volkswagen.de>
CC: Urs Thuermann <urs.thuermann@volkswagen.de>
CC: Wolfgang Grandegger <wg@grandegger.com>
CC: Marc Kleine-Budde <mkl@pengutronix.de>
CC: linux-can@vger.kernel.org
Amitkumar Karwar [Tue, 3 Jan 2012 00:18:40 +0000 (16:18 -0800)]
mwifiex: fix crash during simultaneous scan and connect
If 'iw connect' command is fired when driver is already busy in
serving 'iw scan' command, ssid specific scan operation for connect
is skipped. In this case cmd wait queue handler gets called with no
command in queue (i.e. adapter->cmd_queued = NULL).
This patch adds a NULL check in mwifiex_wait_queue_complete()
routine to fix crash observed during simultaneous scan and assoc
operations.
Signed-off-by: Amitkumar Karwar <akarwar@marvell.com>
Signed-off-by: Bing Zhao <bzhao@marvell.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Guennadi Liakhovetski [Mon, 26 Dec 2011 17:28:08 +0000 (18:28 +0100)]
b43: fix regression in PIO case
This patch fixes the regression, introduced by
commit
17030f48e31adde5b043741c91ba143f5f7db0fd
From: Rafał Miłecki <zajec5@gmail.com>
Date: Thu, 11 Aug 2011 17:16:27 +0200
Subject: [PATCH] b43: support new RX header, noticed to be used in 598.314+ fw
in PIO case.
Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Mohammed Shafi Shajakhan [Mon, 26 Dec 2011 05:12:15 +0000 (10:42 +0530)]
ath9k: Fix kernel panic in AR2427 in AP mode
don't do aggregation related stuff for 'AP mode client power save
handling' if aggregation is not enabled in the driver, otherwise it
will lead to panic because those data structures won't be never
intialized in 'ath_tx_node_init' if aggregation is disabled
EIP is at ath_tx_aggr_wakeup+0x37/0x80 [ath9k]
EAX:
e8c09a20 EBX:
f2a304e8 ECX:
00000001 EDX:
00000000
ESI:
e8c085e0 EDI:
f2a304ac EBP:
f40e1ca4 ESP:
f40e1c8c
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process swapper/1 (pid: 0, ti=
f40e0000 task=
f408e860
task.ti=
f40dc000)
Stack:
0001e966 e8c09a20 00000000 f2a304ac e8c085e0 f2a304ac
f40e1cb0 f8186741
f8186700 f40e1d2c f922988d f2a304ac 00000202 00000001
c0b4ba43 00000000
0000000f e8eb75c0 e8c085e0 205b0001 34383220 f2a304ac
f2a30000 00010020
Call Trace:
[<
f8186741>] ath9k_sta_notify+0x41/0x50 [ath9k]
[<
f8186700>] ? ath9k_get_survey+0x110/0x110 [ath9k]
[<
f922988d>] ieee80211_sta_ps_deliver_wakeup+0x9d/0x350
[mac80211]
[<
c018dc75>] ? __module_address+0x95/0xb0
[<
f92465b3>] ap_sta_ps_end+0x63/0xa0 [mac80211]
[<
f9246746>] ieee80211_rx_h_sta_process+0x156/0x2b0
[mac80211]
[<
f9247d1e>] ieee80211_rx_handlers+0xce/0x510 [mac80211]
[<
c018440b>] ? trace_hardirqs_on+0xb/0x10
[<
c056936e>] ? skb_queue_tail+0x3e/0x50
[<
f9248271>] ieee80211_prepare_and_rx_handle+0x111/0x750
[mac80211]
[<
f9248bf9>] ieee80211_rx+0x349/0xb20 [mac80211]
[<
f9248949>] ? ieee80211_rx+0x99/0xb20 [mac80211]
[<
f818b0b8>] ath_rx_tasklet+0x818/0x1d00 [ath9k]
[<
f8187a75>] ? ath9k_tasklet+0x35/0x1c0 [ath9k]
[<
f8187a75>] ? ath9k_tasklet+0x35/0x1c0 [ath9k]
[<
f8187b33>] ath9k_tasklet+0xf3/0x1c0 [ath9k]
[<
c0151b7e>] tasklet_action+0xbe/0x180
Cc: stable@kernel.org
Cc: Senthil Balasubramanian <senthilb@qca.qualcomm.com>
Cc: Rajkumar Manoharan <rmanohar@qca.qualcomm.com>
Reported-by: Ashwin Mendonca <ashwinloyal@gmail.com>
Tested-by: Ashwin Mendonca <ashwinloyal@gmail.com>
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qca.qualcomm.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
John W. Linville [Tue, 3 Jan 2012 19:26:56 +0000 (14:26 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/padovan/bluetooth
Oliver Hartkopp [Tue, 3 Jan 2012 08:40:28 +0000 (08:40 +0000)]
CAN MAINTAINERS update
Update the CAN MAINTAINERS section:
- point out active maintainers
- pull the CAN driver discussion away from netdev ML
- point to the new CAN web site on gitorious.org
- add CAN development git repository URL to submit patches
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
CC: Oliver Hartkopp <oliver.hartkopp@volkswagen.de>
CC: Urs Thuermann <urs.thuermann@volkswagen.de>
CC: Wolfgang Grandegger <wg@grandegger.com>
CC: Marc Kleine-Budde <mkl@pengutronix.de>
CC: linux-can@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Wolfram Sang [Tue, 3 Jan 2012 03:46:47 +0000 (03:46 +0000)]
net: fsl: fec: fix build for mx23-only kernel
If one only selects mx23-based boards, compile fails:
drivers/net/ethernet/freescale/fec.c:410:2: error: 'FEC_HASH_TABLE_HIGH' undeclared (first use in this function)
drivers/net/ethernet/freescale/fec.c:411:2: error: 'FEC_HASH_TABLE_LOW' undeclared (first use in this function)
This is because fec.h uses CONFIG_SOC_IMX28 to determine the register
layout of the core which makes sense since the MX23 does not have a fec.
However, Kconfig uses the broader ARCH_MXS symbol and this way even
makes the fec-driver default for MX23. Adapt Kconfig to use the more
precise SOC_IMX28 as well.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Cc: Fabio Estevam <fabio.estevam@freescale.com>
Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 2 Jan 2012 05:47:57 +0000 (05:47 +0000)]
sch_qfq: fix overflow in qfq_update_start()
grp->slot_shift is between 22 and 41, so using 32bit wide variables is
probably a typo.
This could explain QFQ hangs Dave reported to me, after 2^23 packets ?
(23 = 64 - 41)
Reported-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Müller [Fri, 30 Dec 2011 17:55:48 +0000 (12:55 -0500)]
drm/radeon/kms/atom: fix possible segfault in pm setup
If we end up with no power states, don't look up
current vddc.
fixes:
https://bugs.freedesktop.org/show_bug.cgi?id=44130
agd5f: fix patch formatting
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
Linus Torvalds [Mon, 2 Jan 2012 20:34:03 +0000 (12:34 -0800)]
Merge branch 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6
* 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6:
dt/device: Fix auxdata matching to handle entries without a name override
Linus Torvalds [Mon, 2 Jan 2012 03:36:08 +0000 (19:36 -0800)]
Merge git://git./linux/kernel/git/davem/net
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
netfilter: ctnetlink: fix timeout calculation
ipvs: try also real server with port 0 in backup server
skge: restore rx multicast filter on resume and after config changes
mlx4_en: nullify cq->vector field when closing completion queue
Linus Torvalds [Sat, 31 Dec 2011 19:55:06 +0000 (11:55 -0800)]
Merge branch 'fix/asoc' of git://git./linux/kernel/git/tiwai/sound
* 'fix/asoc' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ASoC: wm8776: add missing break in sample size switch
Mauro Carvalho Chehab [Sat, 31 Dec 2011 13:32:03 +0000 (11:32 -0200)]
gspca: Fix falling back to lower isoc alt settings
The current gspca core code has a regression where it no longer properly
falls back to lower alt settings when there is not enough bandwidth.
This causes many iso based usb-1 cameras to not work when plugged into a
usb2 hub or a sandybridge chipset motherboard!
This patch fixes this.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Sat, 31 Dec 2011 19:44:01 +0000 (11:44 -0800)]
futex: Fix uninterruptible loop due to gate_area
It was found (by Sasha) that if you use a futex located in the gate
area we get stuck in an uninterruptible infinite loop, much like the
ZERO_PAGE issue.
While looking at this problem, PeterZ realized you'll get into similar
trouble when hitting any install_special_pages() mapping. And are there
still drivers setting up their own special mmaps without page->mapping,
and without special VM or pte flags to make get_user_pages fail?
In most cases, if page->mapping is NULL, we do not need to retry at all:
Linus points out that even /proc/sys/vm/drop_caches poses no problem,
because it ends up using remove_mapping(), which takes care not to
interfere when the page reference count is raised.
But there is still one case which does need a retry: if memory pressure
called shmem_writepage in between get_user_pages_fast dropping page
table lock and our acquiring page lock, then the page gets switched from
filecache to swapcache (and ->mapping set to NULL) whatever the refcount.
Fault it back in to get the page->mapping needed for key->shared.inode.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xi Wang [Fri, 30 Dec 2011 15:40:17 +0000 (10:40 -0500)]
netfilter: ctnetlink: fix timeout calculation
The sanity check (timeout < 0) never works; the dividend is unsigned
and so is the division, which should have been a signed division.
long timeout = (ct->timeout.expires - jiffies) / HZ;
if (timeout < 0)
timeout = 0;
This patch converts the time values to signed for the division.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Julian Anastasov [Fri, 30 Dec 2011 05:19:02 +0000 (14:19 +0900)]
ipvs: try also real server with port 0 in backup server
We should not forget to try for real server with port 0
in the backup server when processing the sync message. We should
do it in all cases because the backup server can use different
forwarding method.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Zumbiehl [Fri, 30 Dec 2011 17:30:09 +0000 (17:30 +0000)]
skge: restore rx multicast filter on resume and after config changes
Restore skge hardware registers for multicast filtering to their
appropriate values after system resume and after hardware restarts
that are done when changing certain settings.
Signed-off-by: Florian Zumbiehl <florz@florz.de>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yevgeny Petrilin [Thu, 29 Dec 2011 05:49:58 +0000 (05:49 +0000)]
mlx4_en: nullify cq->vector field when closing completion queue
Caused loss of connectivity when changing ring size.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 30 Dec 2011 21:45:34 +0000 (13:45 -0800)]
Merge branch 'fixes' of ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm
* 'fixes' of http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm:
ARM: 7237/1: PL330: Fix driver freeze
ARM: 7197/1: errata: Remove SMP dependency for erratum 751472
ARM: 7196/1: errata: Remove SMP dependency for erratum 720789
ARM: 7220/1: mmc: mmci: Fixup error handling for dma
ARM: 7214/1: mmc: mmci: Fixup handling of MCI_STARTBITERR
Linus Torvalds [Fri, 30 Dec 2011 21:43:45 +0000 (13:43 -0800)]
Merge branch 'fixes' of git://git./linux/kernel/git/arm/arm-soc
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: plat-orion: make gpiochip label unique
enable uncompress log on cpuimx35sd
cpuimx35: fix touchscreen support
cpuimx35sd: fix Kconfig
clock-imx35: fix reboot in internal boot mode
dma: MX3_IPU fix depends
imx_v4_v5_defconfig: update default configuration
cpuimx25sd: fix Kconfig
arm/imx: fix cpufreq section mismatch
ARM:imx:fix pwm period value
ARM: OMAP: hwmod data: fix iva and mailbox hwmods for OMAP 3
Linus Torvalds [Fri, 30 Dec 2011 21:42:41 +0000 (13:42 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: sentelic - fix retrieving number of buttons
Input: sentelic - release mutex upon register write failure
Linus Torvalds [Fri, 30 Dec 2011 21:34:22 +0000 (13:34 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: disable use of dcache for readdir etc.
Linus Torvalds [Fri, 30 Dec 2011 21:34:00 +0000 (13:34 -0800)]
Merge branch 'v3.2-samsung-fixes-4' of git://git./linux/kernel/git/kgene/linux-samsung
* 'v3.2-samsung-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung:
ARM: EXYNOS: Remove duplicated SROMC static memory mapping
ARM: SAMSUNG: Fix build error when selecting CPU_FREQ_S3C24XX_DEBUGFS on S3C2440
Linus Torvalds [Fri, 30 Dec 2011 21:24:40 +0000 (13:24 -0800)]
Revert "clockevents: Set noop handler in clockevents_exchange_device()"
This reverts commit
de28f25e8244c7353abed8de0c7792f5f883588c.
It results in resume problems for various people. See for example
http://thread.gmane.org/gmane.linux.kernel/
1233033
http://thread.gmane.org/gmane.linux.kernel/
1233389
http://thread.gmane.org/gmane.linux.kernel/
1233159
http://thread.gmane.org/gmane.linux.kernel/
1227868/focus=
1230877
and the fedora and ubuntu bug reports
https://bugzilla.redhat.com/show_bug.cgi?id=767248
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/904569
which got bisected down to the stable version of this commit.
Reported-by: Jonathan Nieder <jrnieder@gmail.com>
Reported-by: Phil Miller <mille121@illinois.edu>
Reported-by: Philip Langdale <philipl@overt.org>
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg KH <gregkh@suse.de>
Cc: stable@kernel.org # for stable kernels that applied the original
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 30 Dec 2011 20:13:03 +0000 (12:13 -0800)]
Merge git://www.linux-watchdog.org/linux-watchdog
* git://www.linux-watchdog.org/linux-watchdog:
watchdog: iTCO_wdt.c - problems with newer hardware due to SMI clearing (part 2)
watchdog: hpwdt: Changes to handle NX secure bit in 32bit path
watchdog: sp805: Fix section mismatch in ID table.
watchdog: move coh901327 state holders
Linus Torvalds [Fri, 30 Dec 2011 01:36:15 +0000 (17:36 -0800)]
Merge branch 'iommu/fixes' of git://git./linux/kernel/git/joro/iommu
* 'iommu/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu: Initialize domain->handler in iommu_domain_alloc()
Linus Torvalds [Fri, 30 Dec 2011 01:35:33 +0000 (17:35 -0800)]
Merge git://git./linux/kernel/git/davem/net
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
packet: fix possible dev refcnt leak when bind fail
netem: dont call vfree() under spinlock and BH disabled
netfilter: ctnetlink: fix scheduling while atomic if helper is autoloaded
netfilter: ctnetlink: fix return value of ctnetlink_get_expect()
Linus Torvalds [Fri, 30 Dec 2011 01:09:16 +0000 (17:09 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix raw_spin_unlock_irqrestore() usage
oprofile, arm/sh: Fix oprofile_arch_exit() linkage issue
Linus Torvalds [Fri, 30 Dec 2011 01:05:45 +0000 (17:05 -0800)]
Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: log all dirty inodes in xfs_fs_sync_fs
xfs: log the inode in ->write_inode calls for kupdate
Linus Torvalds [Fri, 30 Dec 2011 00:33:37 +0000 (16:33 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix blk_queue_end_tag()
block: re-use existing 'reading' variable instead of checking direction again
block, cfq: fix empty queue crash caused by request merge
Hillf Danton [Wed, 28 Dec 2011 23:57:16 +0000 (15:57 -0800)]
mm: hugetlb: fix non-atomic enqueue of huge page
If a huge page is enqueued under the protection of hugetlb_lock, then the
operation is atomic and safe.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [2.6.37+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>