firefly-linux-kernel-4.4.55.git
13 years agoxen-blkfront: Provide for 'feature-flush-cache' the BLKIF_OP_WRITE_FLUSH_CACHE operation.
Konrad Rzeszutek Wilk [Thu, 5 May 2011 16:41:03 +0000 (12:41 -0400)]
xen-blkfront: Provide for 'feature-flush-cache' the BLKIF_OP_WRITE_FLUSH_CACHE operation.

The operation BLKIF_OP_WRITE_FLUSH_CACHE has existed in the Xen
tree header file for years but it was never present in the Linux tree
because the frontend (nor the backend) supported this interface.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
13 years agoxen-blkfront: fix data size for xenbus_gather in blkfront_connect
Marek Marczykowski [Tue, 3 May 2011 16:04:52 +0000 (12:04 -0400)]
xen-blkfront: fix data size for xenbus_gather in blkfront_connect

barrier variable is int, not long. This overflow caused another variable
override: "err" (in PV code) and "binfo" (in xenlinux code -
drivers/xen/blkfront/blkfront.c). The later caused incorrect device
flags (RO/removable etc).

Signed-off-by: Marek Marczykowski <marmarek@mimuw.edu.pl>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
[v1: Changed title]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
13 years agocciss: fix compile issue
Jens Axboe [Fri, 6 May 2011 14:27:00 +0000 (08:27 -0600)]
cciss: fix compile issue

drivers/block/cciss.c: In function ‘cciss_send_reset’:
drivers/block/cciss.c:2515:2: error: implicit declaration of function ‘fill_cmd’
drivers/block/cciss.c: At top level:
drivers/block/cciss.c:2531:12: error: conflicting types for ‘fill_cmd’
drivers/block/cciss.c:2534:1: note: an argument type that has a default promotion can’t match an empty parameter name list declaration
drivers/block/cciss.c:2515:18: note: previous implicit declaration of ‘fill_cmd’ was here
make[1]: *** [drivers/block/cciss.o] Error 1
make: *** [drivers/block/cciss.o] Error 2

Move fill_cmd() to above where it is first used.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: add cciss_tape_cmds module paramter
Stephen M. Cameron [Tue, 3 May 2011 19:54:12 +0000 (14:54 -0500)]
cciss: add cciss_tape_cmds module paramter

This is to allow number of commands reserved for use by SCSI tape drives
and medium changers to be adjusted at driver load time via the kernel
parameter cciss_tape_cmds, with a default value of 6, and a range
of 2 - 16 inclusive.  Previously, the driver limited the number of
commands which could be queued to the SCSI half of the the driver
to only 2.  This is to fix the problem that if you had more than
two tape drives, you couldn't, for example, erase or rewind them all
at the same time.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: do not use bit 2 doorbell reset
Stephen M. Cameron [Tue, 3 May 2011 19:54:07 +0000 (14:54 -0500)]
cciss: do not use bit 2 doorbell reset

It causes NMIs which are undesirable at best, unsurvivable at worst.
Prefer the soft reset instead.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: do not attempt PCI power management reset method if we know it won't work.
Stephen M. Cameron [Tue, 3 May 2011 19:54:02 +0000 (14:54 -0500)]
cciss: do not attempt PCI power management reset method if we know it won't work.

Just go straight to the soft-reset method instead.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: remove superfluous sleeps around reset code
Stephen M. Cameron [Tue, 3 May 2011 19:53:57 +0000 (14:53 -0500)]
cciss: remove superfluous sleeps around reset code

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: do soft reset if hard reset is broken
Stephen M. Cameron [Tue, 3 May 2011 19:53:52 +0000 (14:53 -0500)]
cciss: do soft reset if hard reset is broken

on driver load, if reset_devices is set, and the hard reset
attempts fail, try to bring up the controller to the point that
a command can be sent, and send it a soft reset command, then
after the reset undo whatever driver initialization was done to get
it to the point to take a command, and re-do it after the reset.

This is to get kdump to work on all the "non-resettable" controllers
(except 64xx controllers which can't be reset due to the potentially
shared cache module.)

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: use new doorbell-bit-5 reset method
Stephen M. Cameron [Tue, 3 May 2011 19:53:46 +0000 (14:53 -0500)]
cciss: use new doorbell-bit-5 reset method

The bit-2-doorbell reset method seemed to cause (survivable) NMIs
on some systems and (unsurvivable) IOCK NMIs on some G7 servers.
Firmware guys implemented a new doorbell method to alleviate these
problems triggered by bit 5 of the doorbell register.  We want to
use it if it's available.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: increase timeouts for post-reset no-ops
Stephen M. Cameron [Tue, 3 May 2011 19:53:41 +0000 (14:53 -0500)]
cciss: increase timeouts for post-reset no-ops

Just to reduce the messages about timeouts that appear.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: clarify messages around reset behavior
Stephen M. Cameron [Tue, 3 May 2011 19:53:36 +0000 (14:53 -0500)]
cciss: clarify messages around reset behavior

When waiting for the board to become "not ready"
don't print a message saying "waiting for board to
become ready" (possibly followed by a message saying
"failed waiting for board to become not ready".  Instead,
it should be "waiting for board to reset" and "failed
waiting for board to reset."

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
"
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: increase time to wait for board reset to start
Stephen M. Cameron [Tue, 3 May 2011 19:53:31 +0000 (14:53 -0500)]
cciss: increase time to wait for board reset to start

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: get rid of message related magic numbers
Stephen M. Cameron [Tue, 3 May 2011 19:53:26 +0000 (14:53 -0500)]
cciss: get rid of message related magic numbers

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: fix reply pool and block fetch table memory leaks
Stephen M. Cameron [Tue, 3 May 2011 19:53:21 +0000 (14:53 -0500)]
cciss: fix reply pool and block fetch table memory leaks

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: factor out irq request code
Stephen M. Cameron [Tue, 3 May 2011 19:53:16 +0000 (14:53 -0500)]
cciss: factor out irq request code

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: factor out scatterlist allocation functions
Stephen M. Cameron [Tue, 3 May 2011 19:53:10 +0000 (14:53 -0500)]
cciss: factor out scatterlist allocation functions

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: factor out command pool allocation functions
Stephen M. Cameron [Tue, 3 May 2011 19:53:05 +0000 (14:53 -0500)]
cciss: factor out command pool allocation functions

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: do a better job of detecting controller reset failure
Stephen M. Cameron [Tue, 3 May 2011 19:53:00 +0000 (14:53 -0500)]
cciss: do a better job of detecting controller reset failure

Detect failure of controller reset by noticing if the 32 bytes of
"driver version" we store on the hardware in the config table
fail to get zeroed out.  Previously we noticed if the controller
did not transition to "simple mode", but this did not detect reset
failure if the controller was already in simple mode prior to
the reset attempt (e.g. due to module parameter hpsa_simple_mode=1).

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocciss: add readl after writel in interrupt mask setting code
Stephen M. Cameron [Tue, 3 May 2011 19:52:54 +0000 (14:52 -0500)]
cciss: add readl after writel in interrupt mask setting code

This is to ensure the board interrupts are really off when
these functions return.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoiosched: remove redundant sprintf
Kees Cook [Fri, 6 May 2011 00:02:12 +0000 (18:02 -0600)]
iosched: remove redundant sprintf

After the anticipatory scheduler was dropped, there was no need to
special-case the request_module string. As such, drop the redundant
sprintf and stack variable.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoblock: Remove 'plug/unplug' comment in blk_execute_rq_nowait
Tao Ma [Thu, 5 May 2011 21:10:05 +0000 (15:10 -0600)]
block: Remove 'plug/unplug' comment in blk_execute_rq_nowait

unplug is replaced with blk_run_queue now in blk_execute_rq_nowait,
so change the comment accordingly.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoblock: don't block events on excl write for non-optical devices
Tejun Heo [Thu, 21 Apr 2011 18:54:46 +0000 (20:54 +0200)]
block: don't block events on excl write for non-optical devices

Disk event code automatically blocks events on excl write.  This is
primarily to avoid issuing polling commands while burning is in
progress.  This behavior doesn't fit other types of devices with
removeable media where polling commands don't have adverse side
effects and door locking usually doesn't exist.

This patch introduces new genhd flag which controls the auto-blocking
behavior and uses it to enable auto-blocking only on optical devices.

Note for stable: 2.6.38 and later only

Cc: stable@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoblock: rescan partitions on invalidated devices on -ENOMEDIA too
Tejun Heo [Thu, 21 Apr 2011 18:54:45 +0000 (20:54 +0200)]
block: rescan partitions on invalidated devices on -ENOMEDIA too

__blkdev_get() doesn't rescan partitions if disk->fops->open() fails,
which leads to ghost partition devices lingering after medimum removal
is known to both the kernel and userland.  The behavior also creates a
subtle inconsistency where O_NONBLOCK open, which doesn't fail even if
there's no medium, clears the ghots partitions, which is exploited to
work around the problem from userland.

Fix it by updating __blkdev_get() to issue partition rescan after
-ENOMEDIA too.

This was reported in the following bz.

 https://bugzilla.kernel.org/show_bug.cgi?id=13029

Note for stable: 2.6.38 and later only

Cc: stable@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Zeuthen <zeuthen@gmail.com>
Reported-by: Martin Pitt <martin.pitt@ubuntu.com>
Reported-by: Kay Sievers <kay.sievers@vrfy.org>
Tested-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agocdrom: always check_disk_change() on open
Tejun Heo [Thu, 21 Apr 2011 18:54:44 +0000 (20:54 +0200)]
cdrom: always check_disk_change() on open

cdrom_open() called check_disk_change() after the rest of open path
succeeded which leads to the following bizarre behavior.

* After media change, if the device opened without O_NONBLOCK,
  open_for_data() naturally fails with -ENOMEDIA and
  check_disk_change() is never called.  The media is known to be gone
  and the open failure makes it obvious to the userland but device
  invalidation never happens.

* But if the device is opened with O_NONBLOCK, all the checks are
  bypassed and cdrom_open() doesn't notice that the media is not there
  and check_disk_change() is called and invalidation happens.

There's nothing to be gained by avoiding calling check_disk_change()
on open failure.  Common cases end up calling check_disk_change()
anyway.  All we get is inconsistent behavior.

Fix it by moving check_disk_change() invocation to the top of
cdrom_open() so that it always gets called regardless of how the rest
of open proceeds.

Note for stable: 2.6.38 and later only

Cc: stable@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Amit Shah <amit.shah@redhat.com>
Tested-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoLinux 2.6.39-rc4
Linus Torvalds [Tue, 19 Apr 2011 04:26:00 +0000 (21:26 -0700)]
Linux 2.6.39-rc4

13 years agoMerge branch 'for-39-rc4' of git://codeaurora.org/quic/kernel/davidb/linux-msm
Linus Torvalds [Mon, 18 Apr 2011 22:44:29 +0000 (15:44 -0700)]
Merge branch 'for-39-rc4' of git://codeaurora.org/quic/kernel/davidb/linux-msm

* 'for-39-rc4' of git://codeaurora.org/quic/kernel/davidb/linux-msm:
  msm: timer: fix missing return value
  msm: Remove extraneous ffa device check

13 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Mon, 18 Apr 2011 20:29:03 +0000 (13:29 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: xen-kbdfront - fix mouse getting stuck after save/restore
  Input: estimate number of events per packet
  Input: evdev - indicate buffer overrun with SYN_DROPPED
  Input: document event types and codes and their intended use
  Input: add KEY_IMAGES specifically for AL Image Browser
  Input: twl4030_keypad - fix potential NULL dereference in twl4030_kp_probe()
  Input: h3600_ts - fix error handling at connect
  Input: twl4030_keypad - avoid potential NULL-pointer dereference

13 years agoMerge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
Linus Torvalds [Mon, 18 Apr 2011 20:21:18 +0000 (13:21 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block

* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: add blk_run_queue_async
  block: blk_delay_queue() should use kblockd workqueue
  md: fix up raid1/raid10 unplugging.
  md: incorporate new plugging into raid5.
  md: provide generic support for handling unplug callbacks.
  md - remove old plugging code.
  md/dm - remove remains of plug_fn callback.
  md: use new plugging interface for RAID IO.
  block: drop queue lock before calling __blk_run_queue() for kblockd punt
  Revert "block: add callback function for unplug notification"
  block: Enhance new plugging support to support general callbacks

13 years agoMerge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Linus Torvalds [Mon, 18 Apr 2011 19:24:24 +0000 (12:24 -0700)]
Merge branch 'merge' of git://git./linux/kernel/git/benh/powerpc

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/powermac: Build fix with SMP and CPU hotplug
  powerpc/perf_event: Skip updating kernel counters if register value shrinks
  powerpc: Don't write protect kernel text with CONFIG_DYNAMIC_FTRACE enabled
  powerpc: Fix oops if scan_dispatch_log is called too early
  powerpc/pseries: Use a kmem cache for DTL buffers
  powerpc/kexec: Fix regression causing compile failure on UP
  powerpc/85xx: disable Suspend support if SMP enabled
  powerpc/e500mc: Remove CPU_FTR_MAYBE_CAN_NAP/CPU_FTR_MAYBE_CAN_DOZE
  powerpc/book3e: Fix CPU feature handling on 64-bit e5500
  powerpc: Check device status before adding serial device
  powerpc/85xx: Don't add disabled PCIe devices

13 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
Linus Torvalds [Mon, 18 Apr 2011 19:24:05 +0000 (12:24 -0700)]
Merge git://git./linux/kernel/git/mason/btrfs-unstable

* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
  Btrfs: fix free space cache leak
  Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
  Btrfs end_bio_extent_readpage should look for locked bits
  Btrfs: don't force chunk allocation in find_free_extent
  Btrfs: Check validity before setting an acl
  Btrfs: Fix incorrect inode nlink in btrfs_link()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
  Btrfs: make uncache_state unconditional
  btrfs: using cached extent_state in set/unlock combinations
  Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
  Btrfs: fix subvolume mount by name problem when default mount subvolume is set
  fix user annotation in ioctl.c
  Btrfs: check for duplicate iov_base's when doing dio reads
  btrfs: properly handle overlapping areas in memmove_extent_buffer
  Btrfs: fix memory leaks in btrfs_new_inode()
  Btrfs: check for duplicate iov_base's when doing dio reads
  Btrfs: reuse the extent_map we found when calling btrfs_get_extent
  Btrfs: do not use async submit for small DIO io's
  Btrfs: don't split dio bios if we don't have to
  ...

13 years agoproc: do proper range check on readdir offset
Linus Torvalds [Mon, 18 Apr 2011 17:36:54 +0000 (10:36 -0700)]
proc: do proper range check on readdir offset

Rather than pass in some random truncated offset to the pid-related
functions, check that the offset is in range up-front.

This is just cleanup, the previous commit fixed the real problem.

Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agonext_pidmap: fix overflow condition
Linus Torvalds [Mon, 18 Apr 2011 17:35:30 +0000 (10:35 -0700)]
next_pidmap: fix overflow condition

next_pidmap() just quietly accepted whatever 'last' pid that was passed
in, which is not all that safe when one of the users is /proc.

Admittedly the proc code should do some sanity checking on the range
(and that will be the next commit), but that doesn't mean that the
helper functions should just do that pidmap pointer arithmetic without
checking the range of its arguments.

So clamp 'last' to PID_MAX_LIMIT.  The fact that we then do "last+1"
doesn't really matter, the for-loop does check against the end of the
pidmap array properly (it's only the actual pointer arithmetic overflow
case we need to worry about, and going one bit beyond isn't going to
overflow).

[ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]

Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
Analyzed-by: Robert Święcki <robert@swiecki.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoInput: xen-kbdfront - fix mouse getting stuck after save/restore
Igor Mammedov [Mon, 18 Apr 2011 17:17:17 +0000 (10:17 -0700)]
Input: xen-kbdfront - fix mouse getting stuck after save/restore

Mouse gets "stuck" after restore of PV guest but buttons are in working
condition.

If driver has been configured for ABS coordinates at start it will get
XENKBD_TYPE_POS events and then suddenly after restore it'll start getting
XENKBD_TYPE_MOTION events, that will be dropped later and they won't get
into user-space.

Regression was introduced by hunk 5 and 6 of
5ea5254aa0ad269cfbd2875c973ef25ab5b5e9db
("Input: xen-kbdfront - advertise either absolute or relative
coordinates").

Driver on restore should ask xen for request-abs-pointer again if it is
available. So restore parts that did it before 5ea5254.

Acked-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
[v1: Expanded the commit description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
13 years agoInput: estimate number of events per packet
Jeff Brown [Mon, 18 Apr 2011 17:08:02 +0000 (10:08 -0700)]
Input: estimate number of events per packet

Calculate a default based on the number of ABS axes, REL axes,
and MT slots for the device during input device registration.

Signed-off-by: Jeff Brown <jeffbrown@android.com>
Reviewed-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
13 years agoBtrfs: fix free space cache leak
Chris Mason [Mon, 18 Apr 2011 12:55:34 +0000 (08:55 -0400)]
Btrfs: fix free space cache leak

The free space caching code was recently reworked to
cache all the pages it needed instead of using find_get_page everywhere.

One loop was missed though, so it ended up leaking pages.  This fixes
it to use our page array instead of find_get_page.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
13 years agoblock: add blk_run_queue_async
Christoph Hellwig [Mon, 18 Apr 2011 09:41:33 +0000 (11:41 +0200)]
block: add blk_run_queue_async

Instead of overloading __blk_run_queue to force an offload to kblockd
add a new blk_run_queue_async helper to do it explicitly.  I've kept
the blk_queue_stopped check for now, but I suspect it's not needed
as the check we do when the workqueue items runs should be enough.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoblock: blk_delay_queue() should use kblockd workqueue
Jens Axboe [Mon, 18 Apr 2011 09:36:39 +0000 (11:36 +0200)]
block: blk_delay_queue() should use kblockd workqueue

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agomd: fix up raid1/raid10 unplugging.
NeilBrown [Mon, 18 Apr 2011 08:25:43 +0000 (18:25 +1000)]
md: fix up raid1/raid10 unplugging.

We just need to make sure that an unplug event wakes up the md
thread, which is exactly what mddev_check_plugged does.

Also remove some plug-related code that is no longer needed.

Signed-off-by: NeilBrown <neilb@suse.de>
13 years agomd: incorporate new plugging into raid5.
NeilBrown [Mon, 18 Apr 2011 08:25:43 +0000 (18:25 +1000)]
md: incorporate new plugging into raid5.

In raid5 plugging is used for 2 things:
 1/ collecting writes that require a bitmap update
 2/ collecting writes in the hope that we can create full
    stripes - or at least more-full.

We now release these different sets of stripes when plug_cnt
is zero.

Also in make_request, we call mddev_check_plug to hopefully increase
plug_cnt, and wake up the thread at the end if plugging wasn't
achieved for some reason.

Signed-off-by: NeilBrown <neilb@suse.de>
13 years agomd: provide generic support for handling unplug callbacks.
NeilBrown [Mon, 18 Apr 2011 08:25:42 +0000 (18:25 +1000)]
md: provide generic support for handling unplug callbacks.

When an md device adds a request to a queue, it can call
mddev_check_plugged.
If this succeeds then we know that the md thread will be woken up
shortly, and ->plug_cnt will be non-zero until then, so some
processing can be delayed.

If it fails, then no unplug callback is expected and the make_request
function needs to do whatever is required to make the request happen.

Signed-off-by: NeilBrown <neilb@suse.de>
13 years agomd - remove old plugging code.
NeilBrown [Mon, 18 Apr 2011 08:25:42 +0000 (18:25 +1000)]
md - remove old plugging code.

md has some plugging infrastructure for RAID5 to use because the
normal plugging infrastructure required a 'request_queue', and when
called from dm, RAID5 doesn't have one of those available.

This relied on the ->unplug_fn callback which doesn't exist any more.

So remove all of that code, both in md and raid5.  Subsequent patches
with restore the plugging functionality.

Signed-off-by: NeilBrown <neilb@suse.de>
13 years agomd/dm - remove remains of plug_fn callback.
NeilBrown [Mon, 18 Apr 2011 08:25:41 +0000 (18:25 +1000)]
md/dm - remove remains of plug_fn callback.

Now that unplugging is done differently, the unplug_fn callback is
never called, so it can be completely discarded.

Signed-off-by: NeilBrown <neilb@suse.de>
13 years agomd: use new plugging interface for RAID IO.
NeilBrown [Mon, 18 Apr 2011 08:25:41 +0000 (18:25 +1000)]
md: use new plugging interface for RAID IO.

md/raid submits a lot of IO from the various raid threads.
So adding start/finish plug calls to those so that some
plugging happens.

Signed-off-by: NeilBrown <neilb@suse.de>
13 years agoblock: drop queue lock before calling __blk_run_queue() for kblockd punt
Jens Axboe [Mon, 18 Apr 2011 07:59:55 +0000 (09:59 +0200)]
block: drop queue lock before calling __blk_run_queue() for kblockd punt

If we know we are going to punt to kblockd, we can drop the queue
lock before calling into __blk_run_queue() since it only does a
safe bit test and a workqueue call. Since kblockd needs to grab
this very lock as one of the first things it does, it's a good
optimization to drop the lock before waking kblockd.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoRevert "block: add callback function for unplug notification"
Jens Axboe [Mon, 18 Apr 2011 07:54:05 +0000 (09:54 +0200)]
Revert "block: add callback function for unplug notification"

MD can't use this since it really requires us to be able to
keep more than a single piece of state for the unplug. Commit
048c9374 added the required support for MD, so get rid of this
now unused code.

This reverts commit f75664570d8b75469cc468f23c2b27220984983b.

Conflicts:

block/blk-core.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoblock: Enhance new plugging support to support general callbacks
NeilBrown [Mon, 18 Apr 2011 07:52:22 +0000 (09:52 +0200)]
block: Enhance new plugging support to support general callbacks

md/raid requires an unplug callback, but as it does not uses
requests the current code cannot provide one.

So allow arbitrary callbacks to be attached to the blk_plug.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agopowerpc/powermac: Build fix with SMP and CPU hotplug
Benjamin Herrenschmidt [Mon, 18 Apr 2011 05:46:35 +0000 (15:46 +1000)]
powerpc/powermac: Build fix with SMP and CPU hotplug

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
13 years agopowerpc/perf_event: Skip updating kernel counters if register value shrinks
Eric B Munson [Fri, 15 Apr 2011 08:12:30 +0000 (08:12 +0000)]
powerpc/perf_event: Skip updating kernel counters if register value shrinks

Because of speculative event roll back, it is possible for some event coutners
to decrease between reads on POWER7.  This causes a problem with the way that
counters are updated.  Delta calues are calculated in a 64 bit value and the
top 32 bits are masked.  If the register value has decreased, this leaves us
with a very large positive value added to the kernel counters.  This patch
protects against this by skipping the update if the delta would be negative.
This can lead to a lack of precision in the coutner values, but from my testing
the value is typcially fewer than 10 samples at a time.

Signed-off-by: Eric B Munson <emunson@mgebm.net>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
13 years agopowerpc: Don't write protect kernel text with CONFIG_DYNAMIC_FTRACE enabled
Stefan Roese [Thu, 14 Apr 2011 23:49:53 +0000 (23:49 +0000)]
powerpc: Don't write protect kernel text with CONFIG_DYNAMIC_FTRACE enabled

This problem was noticed on an MPC855T platform. Ftrace did oops
when trying to write to the kernel text segment.

Many thanks to Joakim for finding the root cause of this problem.

Signed-off-by: Stefan Roese <sr@denx.de>
Cc: Joakim Tjernlund <joakim.tjernlund@transmode.se>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
13 years agopowerpc: Fix oops if scan_dispatch_log is called too early
Anton Blanchard [Thu, 7 Apr 2011 21:44:21 +0000 (21:44 +0000)]
powerpc: Fix oops if scan_dispatch_log is called too early

We currently enable interrupts before the dispatch log for the boot
cpu is setup. If a timer interrupt comes in early enough we oops in
scan_dispatch_log:

Unable to handle kernel paging request for data at address 0x00000010

...

.scan_dispatch_log+0xb0/0x170
.account_system_vtime+0xa0/0x220
.irq_enter+0x88/0xc0
.do_IRQ+0x48/0x230

The patch below adds a check to scan_dispatch_log to ensure the
dispatch log has been allocated.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
13 years agopowerpc/pseries: Use a kmem cache for DTL buffers
Nishanth Aravamudan [Wed, 13 Apr 2011 19:45:59 +0000 (19:45 +0000)]
powerpc/pseries: Use a kmem cache for DTL buffers

PAPR specifies that DTL buffers can not cross AMS environments (aka CMO
in the PAPR) and can not cross a memory entitlement granule boundary
(4k). This is found in section 14.11.3.2 H_REGISTER_VPA of the PAPR.
kmalloc does not guarantee an alignment of the allocation, though,
beyond 8 bytes (at least in my understanding). Create a special kmem
cache for DTL buffers with the alignment requirement.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
13 years agopowerpc/kexec: Fix regression causing compile failure on UP
Paul Gortmaker [Wed, 13 Apr 2011 06:30:08 +0000 (06:30 +0000)]
powerpc/kexec: Fix regression causing compile failure on UP

Recent commit b987812b3fcaf70fdf0037589e5d2f5f2453e6ce caused
a compile failure on UP because a considerably large block
of the file was included within CONFIG_SMP, hence making a stub
function not exposed on UP builds when it needed to be.

Relocate the stub to the #else /* ! CONFIG_SMP */ section
and also annotate the relevant else/endif so that nobody
else falls into the same trap I did.

Reported-by: Michael Guntsche <mike@it-loops.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
13 years agoMerge remote branch 'kumar/merge' into merge
Benjamin Herrenschmidt [Mon, 18 Apr 2011 02:09:37 +0000 (12:09 +1000)]
Merge remote branch 'kumar/merge' into merge

13 years agoMerge branch 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvar...
Linus Torvalds [Mon, 18 Apr 2011 00:37:02 +0000 (17:37 -0700)]
Merge branch 'i2c-for-linus' of git://git./linux/kernel/git/jdelvare/staging

* 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  i2c-algo-bit: Call pre/post_xfer for bit_test
  i2c: Improve deprecation warnings

13 years agoMerge branch 's5p-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Mon, 18 Apr 2011 00:36:45 +0000 (17:36 -0700)]
Merge branch 's5p-fixes-for-linus' of git://git./linux/kernel/git/kgene/linux-samsung

* 's5p-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung:
  ARM: SAMSUNG: Fix warning 's3c_pm_show_resume_irqs' defined but not used
  ARM: SAMSUNG: Fix build failure in PM CRC check code
  ARM: S5P: Remove unused s3c_pm_check_resume_pin

13 years agoalpha: Fix uninitialized value in read_persistent_clock.
Richard Henderson [Sun, 17 Apr 2011 20:05:26 +0000 (13:05 -0700)]
alpha: Fix uninitialized value in read_persistent_clock.

Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoalpha: Fix RTC interrupt setup.
Richard Henderson [Sun, 17 Apr 2011 20:05:25 +0000 (13:05 -0700)]
alpha: Fix RTC interrupt setup.

Following commit 091738a266fc ("genirq: Remove real old transition
functions") we removed an automatic conversion of no_irq_chip to
dummy_irq_chip.  This change needs to be propagated back into the alpha
backend.

Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoalpha: Remove set but unused variables.
Richard Henderson [Sun, 17 Apr 2011 20:05:24 +0000 (13:05 -0700)]
alpha: Remove set but unused variables.

This is a new warning in gcc 4.6.  Several of these variables are
used within #if 0 code, which probably ought to be removed.  Most
of the changes are legitimate cleanups.

Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoalpha: Don't force -Werror.
Richard Henderson [Sun, 17 Apr 2011 20:05:23 +0000 (13:05 -0700)]
alpha: Don't force -Werror.

There are outstanding gcc 4.6 warnings that need to be cleaned up
in the subdirectory.  No sense forcing the issue immediately.

Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agofs: synchronize_rcu when unregister_filesystem success not failure
Milton Miller [Thu, 14 Apr 2011 15:30:08 +0000 (10:30 -0500)]
fs: synchronize_rcu when unregister_filesystem success not failure

While checking unregister_filesystem for saftey vs extra calls for
"ext4: register ext2 and ext3 alias after ext4" I realized that
the synchronize_rcu() was called on the error path but not on
the success path.

Cc: stable (2.6.38)
Signed-off-by: Milton Miller <miltonm@bga.com>
[ This probably won't really make a difference since commit d863b50ab013
  ("vfs: call rcu_barrier after ->kill_sb()"), but it's the right thing
  to do.  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoi2c-algo-bit: Call pre/post_xfer for bit_test
Alex Deucher [Sun, 17 Apr 2011 08:20:19 +0000 (10:20 +0200)]
i2c-algo-bit: Call pre/post_xfer for bit_test

Apparently some distros set i2c-algo-bit.bit_test to 1 by
default.  In some cases this causes i2c_bit_add_bus
to fail and prevents the i2c bus from being added.  In the
radeon case, we fail to add the ddc i2c buses which prevents
the driver from being able to detect attached monitors.
The i2c bus works fine even if bit_test fails.  This is likely
due to gpio switching that is required and handled in the
pre/post_xfer hooks, so call the pre/post_xfer hooks in the
bit test as well.

Fixes:
https://bugs.freedesktop.org/show_bug.cgi?id=36221

Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: stable@kernel.org [.38 down to .34]
13 years agoi2c: Improve deprecation warnings
Jean Delvare [Sun, 17 Apr 2011 08:20:19 +0000 (10:20 +0200)]
i2c: Improve deprecation warnings

When warning on the use of deprecated i2c_driver methods
attach_adapter and detach_adapter, mention the name of the driver
which needs to be updated.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
13 years agoMerge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
Linus Torvalds [Sat, 16 Apr 2011 17:33:41 +0000 (10:33 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block

* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: make unplug timer trace event correspond to the schedule() unplug
  block: let io_schedule() flush the plug inline

13 years agoMerge branch 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Sat, 16 Apr 2011 17:33:13 +0000 (10:33 -0700)]
Merge branch 'usb-linus' of git://git./linux/kernel/git/gregkh/usb-2.6

* 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6: (43 commits)
  Revert "USB: isp1760-hcd: move imask clear after pending work is done"
  xHCI: Implement AMD PLL quirk
  xhci: Tell USB core both roothubs lost power.
  usbcore: Bug fix: system can't suspend with USB3.0 device connected to USB3.0 hub
  USB: Fix unplug of device with active streams
  USB: xhci - also free streams when resetting devices
  xhci: Fix NULL pointer deref in handle_port_status()
  USB: xhci - fix math in xhci_get_endpoint_interval()
  USB: xhci: simplify logic of skipping missed isoc TDs
  USB: xhci - remove excessive 'inline' markings
  USB: xhci: unsigned char never equals -1
  USB: xhci - fix unsafe macro definitions
  USB: fix formatting of SuperSpeed endpoints in /proc/bus/usb/devices
  USB: isp1760-hcd: move imask clear after pending work is done
  USB: fsl_qe_udc: send ZLP when zero flag and length % maxpacket == 0
  usb: qcserial add missing errorpath kfrees
  usb: qcserial avoid pointing to freed memory
  usb: Fix qcserial memory leak on rmmod
  USB: ftdi_sio: add ids for Hameg HO720 and HO730
  USB: option: Added support for Samsung GT-B3730/GT-B3710 LTE USB modem.
  ...

13 years agoMerge branches 'core-fixes-for-linus', 'perf-fixes-for-linus', 'sched-fixes-for-linus...
Linus Torvalds [Sat, 16 Apr 2011 16:45:08 +0000 (09:45 -0700)]
Merge branches 'core-fixes-for-linus', 'perf-fixes-for-linus', 'sched-fixes-for-linus', 'timer-fixes-for-linus' and 'x86-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup

* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec()
  perf: Fix a build error with some GCC versions

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix erroneous all_pinned logic
  sched: Fix sched-domain avg_load calculation

* 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  RTC: rtc-mrst: follow on to the change of rtc_device_register()
  RTC: add missing "return 0" in new alarm func for rtc-bfin.c
  RTC: Fix s3c compile error due to missing s3c_rtc_setpie
  RTC: Fix early irqs caused by calling rtc_set_alarm too early

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, amd: Disable GartTlbWlkErr when BIOS forgets it
  x86, NUMA: Fix fakenuma boot failure
  x86/mrst: Fix boot crash caused by incorrect pin to irq mapping
  x86/ce4100: Add reg property to bridges

13 years agoblock: make unplug timer trace event correspond to the schedule() unplug
Jens Axboe [Sat, 16 Apr 2011 11:51:05 +0000 (13:51 +0200)]
block: make unplug timer trace event correspond to the schedule() unplug

It's a pretty close match to what we had before - the timer triggering
would mean that nobody unplugged the plug in due time, in the new
scheme this matches very closely what the schedule() unplug now is.
It's essentially the difference between an explicit unplug (IO unplug)
or an implicit unplug (timer unplug, we scheduled with pending IO
queued).

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoblock: let io_schedule() flush the plug inline
Jens Axboe [Sat, 16 Apr 2011 11:27:55 +0000 (13:27 +0200)]
block: let io_schedule() flush the plug inline

Linus correctly observes that the most important dispatch cases
are now done from kblockd, this isn't ideal for latency reasons.
The original reason for switching dispatches out-of-line was to
avoid too deep a stack, so by _only_ letting the "accidental"
flush directly in schedule() be guarded by offload to kblockd,
we should be able to get the best of both worlds.

So add a blk_schedule_flush_plug() that offloads to kblockd,
and only use that from the schedule() path.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoBtrfs: avoid taking the chunk_mutex in do_chunk_alloc
Josef Bacik [Tue, 12 Apr 2011 00:20:11 +0000 (20:20 -0400)]
Btrfs: avoid taking the chunk_mutex in do_chunk_alloc

Everytime we try to allocate disk space we try and see if we can pre-emptively
allocate a chunk, but in the common case we don't allocate anything, so there is
no sense in taking the chunk_mutex at all.  So instead if we are allocating a
chunk, mark it in the space_info so we don't get two people trying to allocate
at the same time.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Liu Bo <liubo2009@cn.fujitsu.com>
13 years agoBtrfs end_bio_extent_readpage should look for locked bits
Chris Mason [Sat, 16 Apr 2011 10:55:39 +0000 (06:55 -0400)]
Btrfs end_bio_extent_readpage should look for locked bits

A recent commit caches the extent state in end_bio_extent_readpage,
but the search it does should look for locked extents.  This
fixes things to make it more effective.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
13 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh...
Linus Torvalds [Sat, 16 Apr 2011 03:31:15 +0000 (20:31 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ericvh/v9fs

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  net/9p: nwname should be an unsigned int
  9p: Fix sparse error
  fs/9p: Fix error reported by coccicheck
  9p: revert tsyncfs related changes
  fs/9p: Use write_inode for data sync on server
  fs/9p: Fix revalidate to return correct value

13 years agoMerge branch 'for-linus' of git://android.git.kernel.org/kernel/tegra
Linus Torvalds [Sat, 16 Apr 2011 03:19:17 +0000 (20:19 -0700)]
Merge branch 'for-linus' of git://android.git./kernel/tegra

* 'for-linus' of git://android.git.kernel.org/kernel/tegra:
  arm: tegra: fix error check in tegra2_clocks.c
  ARM: tegra: gpio: Fix unused variable warnings

13 years agoMerge branch 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm
Linus Torvalds [Sat, 16 Apr 2011 03:18:59 +0000 (20:18 -0700)]
Merge branch 'fixes' of /home/rmk/linux-2.6-arm

* 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm:
  ARM: 6879/1: fix personality test wrt usage of domain handlers
  ARM: 6878/1: fix personality flag propagation across an exec
  ARM: 6877/1: the ADDR_NO_RANDOMIZE personality flag should be honored with mmap()
  ARM: 6876/1: Kconfig.debug: Remove unused CONFIG_DEBUG_ERRORS
  ARM: pxa: convert incorrect IRQ_TO_IRQ() to irq_to_gpio()
  ARM: mmp: align NR_BUILTIN_GPIO with gpio interrupt number
  ARM: pxa: align NR_BUILTIN_GPIO with GPIO interrupt number
  ARM: pxa: always clear LPM bits for PXA168 MFPR
  pcmcia: limit pxa2xx_trizeps4 subdriver to trizeps4 platform
  pcmcia: limit pxa2xx_balloon3 subdriver to balloon3 platform
  ARM: pxafb: Fix access to nonexistent member of pxafb_info
  ARM: 6872/1: arch:common:Makefile Remove unused config in the Makefile.
  ARM: 6868/1: Preserve the VFP state during fork
  ARM: 6867/1: Introduce THREAD_NOTIFY_COPY for copy_thread() hooks
  ARM: 6866/1: Do not restrict HIGHPTE to !OUTER_CACHE
  ARM: 6865/1: perf: ensure pass through zero is counted on overflow
  ARM: 6864/1: hw_breakpoint: clear DBGVCR out of reset
  ARM: Only allow PM_SLEEP with CPUs which support suspend
  ARM: Make consolidated PM sleep code depend on PM_SLEEP

13 years agox86, amd: Disable GartTlbWlkErr when BIOS forgets it
Joerg Roedel [Fri, 15 Apr 2011 12:47:40 +0000 (14:47 +0200)]
x86, amd: Disable GartTlbWlkErr when BIOS forgets it

This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
the BIOS forgets to do is (or is just too old). Letting
these errors enabled can cause a sync-flood on the CPU
causing a reboot.

The AMD BKDG recommends disabling GART TLB Wlk Error completely.

This patch is the fix for

https://bugzilla.kernel.org/show_bug.cgi?id=33012

on my machine.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/20110415131152.GJ18463@8bytes.org
Tested-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
13 years agovfs: Fix absolute RCU path walk failures due to uninitialized seq number
Tim Chen [Fri, 15 Apr 2011 18:39:29 +0000 (11:39 -0700)]
vfs: Fix absolute RCU path walk failures due to uninitialized seq number

During RCU walk in path_lookupat and path_openat, the rcu lookup
frequently failed if looking up an absolute path, because when root
directory was looked up, seq number was not properly set in nameidata.

We dropped out of RCU walk in nameidata_drop_rcu due to mismatch in
directory entry's seq number.  We reverted to slow path walk that need
to take references.

With the following patch, I saw a 50% increase in an exim mail server
benchmark throughput on a 4-socket Nehalem-EX system.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: stable@kernel.org (v2.6.38)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agonet/9p: nwname should be an unsigned int
Harsh Prateek Bora [Thu, 31 Mar 2011 10:19:39 +0000 (15:49 +0530)]
net/9p: nwname should be an unsigned int

Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric VAn Hensbergen <ericvh@gmail.com>
13 years ago9p: Fix sparse error
Aneesh Kumar K.V [Thu, 24 Mar 2011 17:44:46 +0000 (23:14 +0530)]
9p: Fix sparse error

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
13 years agofs/9p: Fix error reported by coccicheck
Aneesh Kumar K.V [Thu, 24 Mar 2011 17:34:41 +0000 (23:04 +0530)]
fs/9p: Fix error reported by coccicheck

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
13 years ago9p: revert tsyncfs related changes
Aneesh Kumar K.V [Thu, 24 Mar 2011 15:08:35 +0000 (20:38 +0530)]
9p: revert tsyncfs related changes

Now that we use write_inode to flush server
cache related to fid, we don't need tsyncfs either fort dotl or dotu
protocols. For dotu this helps to do a more efficient server flush.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
13 years agofs/9p: Use write_inode for data sync on server
Aneesh Kumar K.V [Wed, 23 Mar 2011 09:41:27 +0000 (15:11 +0530)]
fs/9p: Use write_inode for data sync on server

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
13 years agofs/9p: Fix revalidate to return correct value
Aneesh Kumar K.V [Wed, 23 Mar 2011 14:18:16 +0000 (19:48 +0530)]
fs/9p: Fix revalidate to return correct value

revalidate should return > 0 on success. Also return 0 on ENOENT
to force do_revalidate to return NULL dentry;

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
13 years agoBtrfs: don't force chunk allocation in find_free_extent
Chris Mason [Fri, 15 Apr 2011 20:05:44 +0000 (16:05 -0400)]
Btrfs: don't force chunk allocation in find_free_extent

find_free_extent likes to allocate in contiguous clusters,
which makes writeback faster, especially on SSD storage.  As
the FS fragments, these clusters become harder to find and we have
to decide between allocating a new chunk to make more clusters
or giving up on the cluster to allocate from the free space
we have.

Right now it creates too many chunks, and you can end up with
a whole FS that is mostly empty metadata chunks.  This commit
changes the allocation code to be more strict and only
allocate new chunks when we've made good use of the chunks we
already have.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
13 years agox86, NUMA: Fix fakenuma boot failure
KOSAKI Motohiro [Fri, 15 Apr 2011 11:39:01 +0000 (20:39 +0900)]
x86, NUMA: Fix fakenuma boot failure

Currently, numa=fake boot parameter is broken. If it's used,
kernel may panic due to devide by zero error depending on CPU
configuration

Call Trace:
 [<ffffffff8104ad4c>] find_busiest_group+0x38c/0xd30
 [<ffffffff81086aff>] ? local_clock+0x6f/0x80
 [<ffffffff81050533>] load_balance+0xa3/0x600
 [<ffffffff81050f53>] idle_balance+0xf3/0x180
 [<ffffffff81550092>] schedule+0x722/0x7d0
 [<ffffffff81550538>] ? wait_for_common+0x128/0x190
 [<ffffffff81550a65>] schedule_timeout+0x265/0x320
 [<ffffffff81095815>] ? lock_release_holdtime+0x35/0x1a0
 [<ffffffff81550538>] ? wait_for_common+0x128/0x190
 [<ffffffff8109bb6c>] ? __lock_release+0x9c/0x1d0
 [<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
 [<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
 [<ffffffff81550540>] wait_for_common+0x130/0x190
 [<ffffffff81051920>] ? try_to_wake_up+0x510/0x510
 [<ffffffff8155067d>] wait_for_completion+0x1d/0x20
 [<ffffffff8107f36c>] kthread_create_on_node+0xac/0x150
 [<ffffffff81077bb0>] ? process_scheduled_works+0x40/0x40
 [<ffffffff8155045f>] ? wait_for_common+0x4f/0x190
 [<ffffffff8107a283>] __alloc_workqueue_key+0x1a3/0x590
 [<ffffffff81e0cce2>] cpuset_init_smp+0x6b/0x7b
 [<ffffffff81df3d07>] kernel_init+0xc3/0x182
 [<ffffffff8155d5e4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81553cd4>] ? retint_restore_args+0x13/0x13
 [<ffffffff81df3c44>] ? start_kernel+0x400/0x400
 [<ffffffff8155d5e0>] ? gs_change+0x13/0x13

The divede by zero is caused by the following line,
group->cpu_power==0:

 kernel/sched_fair.c::update_sg_lb_stats()
        /* Adjust by relative CPU power of the group */
        sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

This regression was caused by commit e23bba6044 ("x86-64, NUMA: Unify
emulated distance mapping") because it changes cpu -> node
mapping in the process of dropping fake_physnodes().

  old) all cpus are assinged node 0
  now) cpus are assigned round robin
       (the logic is implemented by numa_init_array())

  Note: The change in behavior only happens if the system doesn't
        have neither ACPI SRAT table nor AMD northbridge NUMA
information.

Round robin assignment doesn't work because init_numa_sched_groups_power()
assumes all logical cpus in the same physical cpu share the same node
(then it only accounts for group_first_cpu()), and the simple round robin
breaks the above assumption.

Thus, this patch implements a reassignment of node-ids if buggy firmware
or numa emulation makes wrong cpu node map. Tt enforce all logical cpus
in the same physical cpu share the same node.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/20110415203928.1303.A69D9226@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
13 years agoMerge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
Linus Torvalds [Fri, 15 Apr 2011 15:01:13 +0000 (08:01 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block

* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: only force kblockd unplugging from the schedule() path
  block: cleanup the block plug helper functions
  block, blk-sysfs: Use the variable directly instead of a function call
  block: move queue run on unplug to kblockd
  block: kill queue_sync_plugs()
  block: readd plug trace event
  block: add callback function for unplug notification
  block: add comment on why we save and disable interrupts in flush_plug_list()
  block: fixup block IO unplug trace call
  block: remove block_unplug_timer() trace point
  block: splice plug list to local context

13 years agoMerge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
Linus Torvalds [Fri, 15 Apr 2011 14:44:22 +0000 (07:44 -0700)]
Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6

* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: fix compilation warnings when compiling with gcc 4.5
  UBIFS: fix oops when R/O file-system is fsync'ed

13 years agofutex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup
Darren Hart [Thu, 14 Apr 2011 22:41:57 +0000 (15:41 -0700)]
futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup

The FLAGS_HAS_TIMEOUT flag was not getting set, causing the restart_block to
restart futex_wait() without a timeout after a signal.

Commit b41277dc7a18ee332d in 2.6.38 introduced the regression by accidentally
removing the the FLAGS_HAS_TIMEOUT assignment from futex_wait() during the setup
of the restart block. Restore the originaly behavior.

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=32922
Reported-by: Tim Smith <tsmith201104@yahoo.com>
Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/%3Cdaac0eb3af607f72b9a4d3126b2ba8fb5ed3b883.1302820917.git.dvhart%40linux.intel.com%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
13 years agovfs: fix incorrect dentry_update_name_case() BUG_ON() test
Linus Torvalds [Fri, 15 Apr 2011 14:34:26 +0000 (07:34 -0700)]
vfs: fix incorrect dentry_update_name_case() BUG_ON() test

The case we should be verifying when updating the dentry name is that
the _parent_ inode (the directory) semaphore is held, not the semaphore
for the dentry itself.  It's the directory locking that rename and
readdir() etc all care about.

The comment just above even says so - but then the BUG_ON() still
checked the dentry inode itself.

Very few people noticed, because this helper function really isn't used
for very much, so you had to be using ncpfs to ever hit it.

I think I should just remove the BUG_ON (the function really has just
one user), but let's run with it fixed for a while before getting rid of
it entirely.

Reported-and-tested-by: Bongani Hlope <bonganih@bankservafrica.com>
Reported-and-tested-by: Bernd Feige <bernd.feige@uniklinik-freiburg.de>
Cc: Petr Vandrovec <petr@vandrovec.name>,
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoblock: only force kblockd unplugging from the schedule() path
Jens Axboe [Fri, 15 Apr 2011 13:49:07 +0000 (15:49 +0200)]
block: only force kblockd unplugging from the schedule() path

For the explicit unplugging, we'd prefer to kick things off
immediately and not pay the penalty of the latency to switch
to kblockd. So let blk_finish_plug() do the run inline, while
the implicit-on-schedule-out unplug will punt to kblockd.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoblock: cleanup the block plug helper functions
Christoph Hellwig [Fri, 15 Apr 2011 13:20:10 +0000 (15:20 +0200)]
block: cleanup the block plug helper functions

It's a bit of a mess currently. task->plug is being cleared
and reset in __blk_finish_plug(), and blk_finish_plug() is
testing for a NULL plug which cannot happen even from schedule()
anymore since it uses blk_needs_flush_plug() to determine
whether to call into this function at all.

So get rid of some of the cruft.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/vapier...
Linus Torvalds [Fri, 15 Apr 2011 02:03:27 +0000 (19:03 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/vapier/blackfin

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/vapier/blackfin:
  Blackfin: SMP: fix cache flush loop
  Blackfin: time-ts: ack gptimer sooner to avoid missing short ints
  Blackfin: gptimers: fix thinko when disabling timers
  Blackfin: SMP: make all barriers handle cache issues

13 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph...
Linus Torvalds [Fri, 15 Apr 2011 02:02:55 +0000 (19:02 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: fix linger request requeueing

13 years agomm/thp: use conventional format for boolean attributes
Ben Hutchings [Thu, 14 Apr 2011 22:22:21 +0000 (15:22 -0700)]
mm/thp: use conventional format for boolean attributes

The conventional format for boolean attributes in sysfs is numeric ("0" or
"1" followed by new-line).  Any boolean attribute can then be read and
written using a generic function.  Using the strings "yes [no]", "[yes]
no" (read), "yes" and "no" (write) will frustrate this.

[akpm@linux-foundation.org: use kstrtoul()]
[akpm@linux-foundation.org: test_bit() doesn't return 1/0, per Neil]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: <stable@kernel.org> [2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoramfs: fix memleak on no-mmu arch
Bob Liu [Thu, 14 Apr 2011 22:22:20 +0000 (15:22 -0700)]
ramfs: fix memleak on no-mmu arch

On no-mmu arch, there is a memleak during shmem test.  The cause of this
memleak is ramfs_nommu_expand_for_mapping() added page refcount to 2
which makes iput() can't free that pages.

The simple test file is like this:

  int main(void)
  {
int i;
key_t k = ftok("/etc", 42);

for ( i=0; i<100; ++i) {
int id = shmget(k, 10000, 0644|IPC_CREAT);
if (id == -1) {
printf("shmget error\n");
}
if(shmctl(id, IPC_RMID, NULL ) == -1) {
printf("shm  rm error\n");
return -1;
}
}
printf("run ok...\n");
return 0;
  }

And the result:

  root:/> free
               total         used         free       shared      buffers
  Mem:         60320        17912        42408            0            0
  -/+ buffers:              17912        42408
  root:/> shmem
  run ok...
  root:/> free
               total         used         free       shared      buffers
  Mem:         60320        19096        41224            0            0
  -/+ buffers:              19096        41224
  root:/> shmem
  run ok...
  root:/> free
               total         used         free       shared      buffers
  Mem:         60320        20296        40024            0            0
  -/+ buffers:              20296        40024
  ...

After this patch the test result is:(no memleak anymore)

  root:/> free
               total         used         free       shared      buffers
  Mem:         60320        16668        43652            0            0
  -/+ buffers:              16668        43652
  root:/> shmem
  run ok...
  root:/> free
               total         used         free       shared      buffers
  Mem:         60320        16668        43652            0            0
  -/+ buffers:              16668        43652

Signed-off-by: Bob Liu <lliubbo@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoum: disable CONFIG_CMPXCHG_LOCAL
Richard Weinberger [Thu, 14 Apr 2011 22:22:18 +0000 (15:22 -0700)]
um: disable CONFIG_CMPXCHG_LOCAL

Commit 8a5ec0ba "Lockless (and preemptless) fastpaths for slub" makes use
of this_cpu_cmpxchg_double() which needs this_cpu_cmpxchg16b_emu() on
x86_64.  Implementing cmpxchg16b emulation for UML would introduce too
much complexity.  So just disable it.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Sergei Trofimovich <slyich@gmail.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoum: fix call tracer and bug handler
Richard Weinberger [Thu, 14 Apr 2011 22:22:17 +0000 (15:22 -0700)]
um: fix call tracer and bug handler

Commit 1de1502c ("x86, um: now we can get rid of trivial uml headers")
removed accidentally bug.h which broke UML's call tracer and bug
handler.

Without asm-generic/bug.h UML uses BUG() from arch/x86/ which makes use
of ud2.  UML cannot use ud2, it raises SIGILL in user mode.  As UML has
a different stack for handling signals the call trace will be cut off.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Sergei Trofimovich <slyich@gmail.com>
Tested-by: Sergei Trofimovich <slyich@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agofs/fhandle.c: add <linux/personality.h> for ia64
Jeff Mahoney [Thu, 14 Apr 2011 22:22:16 +0000 (15:22 -0700)]
fs/fhandle.c: add <linux/personality.h> for ia64

force_o_largefile() on ia64 is defined in <asm/fcntl.h> and requires
<linux/personality.h>.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoMAINTAINERS: change mail adress of Hans J. Koch
Hans J. Koch [Thu, 14 Apr 2011 22:22:16 +0000 (15:22 -0700)]
MAINTAINERS: change mail adress of Hans J. Koch

My old mail address doesn't exist anymore.  This patch changes all
occurences in MAINTAINERS to my new address.

Signed-off-by: Hans J. Koch <hjk@hansjkoch.de>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoRapidIO/mpc85xx: fix possible mport registration problems
Alexandre Bounine [Thu, 14 Apr 2011 22:22:14 +0000 (15:22 -0700)]
RapidIO/mpc85xx: fix possible mport registration problems

Fix a possible problem with mport registration left non-cleared after
fsl_rio_setup() exits on link error.  Abort mport initialization if
registration failed.

This patch is applicable to 2.6.39-rc1 only.  The problem does not exist
for earlier versions.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agoRapidIO: add IDT CPS-1432 switch definitions
Alexandre Bounine [Thu, 14 Apr 2011 22:22:14 +0000 (15:22 -0700)]
RapidIO: add IDT CPS-1432 switch definitions

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agooom-kill: remove boost_dying_task_prio()
KOSAKI Motohiro [Thu, 14 Apr 2011 22:22:13 +0000 (15:22 -0700)]
oom-kill: remove boost_dying_task_prio()

This is an almost-revert of commit 93b43fa ("oom: give the dying task a
higher priority").

That commit dramatically improved oom killer logic when a fork-bomb
occurs.  But I've found that it has nasty corner case.  Now cpu cgroup has
strange default RT runtime.  It's 0!  That said, if a process under cpu
cgroup promote RT scheduling class, the process never run at all.

If an admin inserts a !RT process into a cpu cgroup by setting
rtruntime=0, usually it runs perfectly because a !RT task isn't affected
by the rtruntime knob.  But if it promotes an RT task via an explicit
setscheduler() syscall or an OOM, the task can't run at all.  In short,
the oom killer doesn't work at all if admins are using cpu cgroup and don't
touch the rtruntime knob.

Eventually, kernel may hang up when oom kill occur.  I and the original
author Luis agreed to disable this logic.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years agovmscan: all_unreclaimable() use zone->all_unreclaimable as a name
KOSAKI Motohiro [Thu, 14 Apr 2011 22:22:12 +0000 (15:22 -0700)]
vmscan: all_unreclaimable() use zone->all_unreclaimable as a name

all_unreclaimable check in direct reclaim has been introduced at 2.6.19
by following commit.

2006 Sep 25; commit 408d8544; oom: use unreclaimable info

And it went through strange history. firstly, following commit broke
the logic unintentionally.

2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
      costly-order allocations

Two years later, I've found obvious meaningless code fragment and
restored original intention by following commit.

2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
      return value when priority==0

But, the logic didn't works when 32bit highmem system goes hibernation
and Minchan slightly changed the algorithm and fixed it .

2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
      in direct reclaim path

But, recently, Andrey Vagin found the new corner case. Look,

struct zone {
  ..
        int                     all_unreclaimable;
  ..
        unsigned long           pages_scanned;
  ..
}

zone->all_unreclaimable and zone->pages_scanned are neigher atomic
variables nor protected by lock.  Therefore zones can become a state of
zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
all_unreclaimable() return false even though zone->all_unreclaimabe=1.

This resulted in the kernel hanging up when executing a loop of the form

1. fork
2. mmap
3. touch memory
4. read memory
5. munmmap

as described in
http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725

Is this ignorable minor issue?  No.  Unfortunately, x86 has very small dma
zone and it become zone->all_unreclamble=1 easily.  and if it become
all_unreclaimable=1, it never restore all_unreclaimable=0.  Why?  if
all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
a-few-lru-pages>>DEF_PRIORITY always makes 0.  that mean no page scan at
all!

Eventually, oom-killer never works on such systems.  That said, we can't
use zone->pages_scanned for this purpose.  This patch restore
all_unreclaimable() use zone->all_unreclaimable as old.  and in addition,
to add oom_killer_disabled check to avoid reintroduce the issue of commit
d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").

Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>