Joe Perches [Thu, 3 Apr 2014 21:49:20 +0000 (14:49 -0700)]
checkpatch: ignore networking block comment style first lines in file
It's very common to have normal block comments for the initial comments
of a file description preface.
So for files in drivers/net and net/ don't emit a warning when the first
comment block in the file uses the normal block comment style and not
the networking block comment style.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:49:19 +0000 (14:49 -0700)]
checkpatch: use a more consistent function argument style
Instead of array indexing $_, use temporary variables like all the other
subroutines in the script use.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:49:18 +0000 (14:49 -0700)]
checkpatch: add test for char * arrays that could be static const
static const char* arrays create smaller text as each function call does
not have to populate the array.
Emit a warning when char *arrays aren't static const and the array is
not apparently global by being declared in the first column.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:49:17 +0000 (14:49 -0700)]
checkpatch: fix jiffies comparison and others
checkpatch could not distinguish between a variable in a struct named
jiffies and the normal jiffies.
foo->jiffies
would emit a "Comparing jiffies" arning.
Update the $Compare variable to do a negative look-behind for "-" when
finding a ">" so that a pointer dereference like -> isn't a comparison.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:49:16 +0000 (14:49 -0700)]
checkpatch: avoid sscanf test duplicated messages
Change a test of $dstat to $line to avoid possibly emitting the sscanf
warning multiple times.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:49:15 +0000 (14:49 -0700)]
checkpatch: update octal permissions warning
When checking permissions, make sure 4 octal digits are used, but allow
a single 0 too.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:49:14 +0000 (14:49 -0700)]
checkpatch: warn on uses of __constant_<foo> functions
Emit a warning when using any of these __constant_<foo> forms:
__constant_cpu_to_be[x]
__constant_cpu_to_le[x]
__constant_be[x]_to_cpu
__constant_le[x]_to_cpu
__constant_htons
__constant_ntohs
Using any of these outside of include/uapi/ isn't preferred as using the
function without __constant_ is identical when the argument is a
constant.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:49:13 +0000 (14:49 -0700)]
checkpatch: add checks for constant non-octal permissions
umode_t permissions are sometimes mistakenly written with decimal
constants. Verify that numeric permissions are using octal.
Add a list of the most commonly used functions and macros that have
umode_t permissions and the argument position.
Add a $Octal type to $Constant.
Allow $LvalOrFunc to be a pointer indirection too.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:49:12 +0000 (14:49 -0700)]
checkpatch: don't warn on some function pointer return styles
Checks for some function pointer return styles are too strict. Fix
them.
Multiple spaces after function pointer return types are allowed.
int (*foo)(int bar)
Spaces after function pointer returns of pointer types are not required.
int *(*foo)(int bar)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:49:11 +0000 (14:49 -0700)]
checkpatch: add test for long udelay
Holger reported:
: The macro udelay cannot handle large values because of loss-of-precision.
:
: IMHO udelay on ARM is broken, because it also cannot work with fast
: ARM processors (where bogomips >= 3355, which is in sight now). It's
: just not broken enough that someone did something against it ... so
: the current kludge is good enough.
Until then, warn on long udelay uses.
Also fix uses of $line that should have been $herecurr.
Signed-off-by: Joe Perches <joe@perches.com>
Reported-by: Holger Schurig <holgerschurig@gmail.com>
Cc: Sujith Manoharan <sujith@msujith.org>
Cc: John Linville <linville@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rashika Kheria [Thu, 3 Apr 2014 21:49:10 +0000 (14:49 -0700)]
lib/decompress_inflate.c: include appropriate header file
Include appropriate header file include/linux/decompress/inflate.h in
lib/decompress_inflate.c because it has prototype declaration of
function defined in lib/decompress_inflate.c.
Also, fix the guard around the header file
include/linux/decompress/inflate.h to use a more unique guard symbol.
This avoids conflict with the INFLATE_H defined by
zlib_inflate/inflate.h.
This eliminates the following warning in lib/decompress_inflate.c:
lib/decompress_inflate.c:35:17: warning: no previous prototype for `gunzip' [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rashika Kheria [Thu, 3 Apr 2014 21:49:09 +0000 (14:49 -0700)]
lib/clz_ctz.c: add prototype declarations in lib/clz_ctz.c
Add prototype declarations of functions in lib/clz_ctz.c. These
functions are required by GCC builtins and hence can not be removed
despite of their unreferenced appearance in kernel source.
This eliminates the following warning in lib/clz_ctz.c:
lib/clz_ctz.c:16:12: warning: no previous prototype for `__ctzsi2' [-Wmissing-prototypes]
lib/clz_ctz.c:22:12: warning: no previous prototype for `__clzsi2' [-Wmissing-prototypes]
lib/clz_ctz.c:44:12: warning: no previous prototype for `__clzdi2' [-Wmissing-prototypes]
lib/clz_ctz.c:50:12: warning: no previous prototype for `__ctzdi2' [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Chanho Min <chanho.min@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Borkmann [Thu, 3 Apr 2014 21:49:08 +0000 (14:49 -0700)]
lib/random32.c: minor cleanups and kdoc fix
These are just some very minor and misc cleanups in the PRNG. In
prandom_u32() we store the result in an unsigned long which is
unnecessary as it should be u32 instead that we get from
prandom_u32_state(). prandom_bytes_state()'s comment is in kdoc format,
so change it into such as it's done everywhere else. Also, use the
normal comment style for the header comment. Last but not least for
readability, add some newlines.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steven Rostedt [Thu, 3 Apr 2014 21:49:07 +0000 (14:49 -0700)]
lib/devres.c: fix some sparse warnings
Having a discussion about sparse warnings in the kernel, and that we
should clean them up, I decided to pick a random file to do so. This
happened to be devres.c which gives the following warnings:
CHECK lib/devres.c
lib/devres.c:83:9: warning: cast removes address space of expression
lib/devres.c:117:31: warning: incorrect type in return expression (different address spaces)
lib/devres.c:117:31: expected void [noderef] <asn:2>*
lib/devres.c:117:31: got void *
lib/devres.c:125:31: warning: incorrect type in return expression (different address spaces)
lib/devres.c:125:31: expected void [noderef] <asn:2>*
lib/devres.c:125:31: got void *
lib/devres.c:136:26: warning: incorrect type in assignment (different address spaces)
lib/devres.c:136:26: expected void [noderef] <asn:2>*[assigned] dest_ptr
lib/devres.c:136:26: got void *
lib/devres.c:226:9: warning: cast removes address space of expression
Mostly it's just the use of typecasting to void * without adding
__force, or returning ERR_PTR(-ESOMEERR) without typecasting to a
__iomem type.
I added a helper macro IOMEM_ERR_PTR() that does the typecast to make
the code a little nicer than adding ugly typecasts to the code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:49:06 +0000 (14:49 -0700)]
backlight: tps65217_bl: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:49:05 +0000 (14:49 -0700)]
backlight: platform_lcd: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:49:04 +0000 (14:49 -0700)]
backlight: lms283gf05: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Acked-by: Marek Vasut <marex@denx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:49:03 +0000 (14:49 -0700)]
backlight: lm3533_bl: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Acked-by: Johan Hovold <jhovold@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:49:02 +0000 (14:49 -0700)]
backlight: l4f00242t03: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:49:02 +0000 (14:49 -0700)]
backlight: ili9320: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:49:01 +0000 (14:49 -0700)]
backlight: ili922x: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Acked-by: Stefano Babic <sbabic@denx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:49:00 +0000 (14:49 -0700)]
backlight: hx8357: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Acked-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:48:59 +0000 (14:48 -0700)]
backlight: corgi_lcd: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:48:58 +0000 (14:48 -0700)]
backlight: adp8870: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Acked-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:48:57 +0000 (14:48 -0700)]
backlight: adp8860: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Acked-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:48:56 +0000 (14:48 -0700)]
backlight: aat2870: remove unnecessary OOM messages
The site-specific OOM messages are unnecessary, because they duplicate
the MM subsystem generic OOM message.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Acked-by: Jinyoung Park <jinyoungp@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Liu Ying [Thu, 3 Apr 2014 21:48:55 +0000 (14:48 -0700)]
backlight: update backlight status when necessary
We don't have to update a backlight status every time a blanking or
unblanking event comes because the backlight status may have already
been what we want. Another thought is that one backlight device may be
shared by multiple framebuffers. We don't hope blanking one of the
framebuffers may turn the backlight off for all the other framebuffers,
since they are likely being active to display something.
This patch makes the backlight status be updated only when the relevant
backlight device's use count changes from zero to one or from one to
zero.
Signed-off-by: Liu Ying <Ying.Liu@freescale.com>
Cc: Jingoo Han <jg1.han@samsung.com>
Cc: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Liu Ying [Thu, 3 Apr 2014 21:48:54 +0000 (14:48 -0700)]
backlight: update bd state & fb_blank properties when necessary
We don't have to update the state and fb_blank properties of a backlight
device every time a blanking or unblanking event comes because they may
have already been what we want. Another thought is that one backlight
device may be shared by multiple framebuffers. The backlight driver
should take the backlight device as a resource shared by all the
associated framebuffers.
This patch adds some logic to record each framebuffer's backlight usage
to determine the backlight device use count and whether the two
properties should be updated or not. To be more specific, only one
unblank operation on a certain blanked framebuffer may increase the
backlight device's use count by one, while one blank operation on a
certain unblanked framebuffer may decrease the use count by one, because
the userspace is likely to unblank an unblanked framebuffer or blank a
blanked framebuffer.
Signed-off-by: Liu Ying <Ying.Liu@freescale.com>
Cc: Jingoo Han <jg1.han@samsung.com>
Cc: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:48:53 +0000 (14:48 -0700)]
MAINTAINERS: remove Venkatesh from HPET, move to CREDITS
Seems he's gone off to bigger/better things. So long, etc...
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Simek [Thu, 3 Apr 2014 21:48:52 +0000 (14:48 -0700)]
MAINTAINERS: microblaze: use LKML as mailing list
microblaze-uclinux mailing list is almost dead and it is just causing
troubles for non subscribers which are getting email about waiting for
moderator. Approval never happens. Move it to LKML.
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Reported-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Opensource [Steve Twiss] [Thu, 3 Apr 2014 21:48:51 +0000 (14:48 -0700)]
MAINTAINERS: addition of Dialog Semiconductor files
Dialog Semiconductor Ltd would like to add a new section called DIALOG
SEMICONDUCTOR DRIVERS which contains a new e-mail address that can cover
all Dialog supported drivers: support.opensource@diasemi.com.
Signed-off-by: Opensource [Steve Twiss] <stwiss.opensource@diasemi.com>
Signed-off-by: David Dajun Chen <david.chen@diasemi.com>
Cc: Mark Brown <broonie@linaro.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:48:49 +0000 (14:48 -0700)]
MAINTAINERS: wimax@linuxwimax.org is subscribers-only
Mark it so.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Max Filippov [Thu, 3 Apr 2014 21:48:48 +0000 (14:48 -0700)]
MAINTAINERS: add xtensa irqchips to xtensa port entry
Now that irqchip drivers for xtensa live outside arch/xtensa we'd like
to add them to our maintenance list.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Marc Gauthier <marc@cadence.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Geert Uytterhoeven [Thu, 3 Apr 2014 21:48:47 +0000 (14:48 -0700)]
MAINTAINERS: mark SuperH orphan
Paul Mundt's email address bounces regularly, and he hasn't taken any
SuperH patches for about one year.
Signed-off-by: Geert Uytterhoeven <geert+renesas@linux-m68k.org>
Suggested-by: Paul Bolle <pebolle@tiscali.nl>
Cc: Paul Mundt <paul.mundt@gmail.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Laurent Pinchart <Laurent.pinchart@ideasonboard.com>
Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 3 Apr 2014 21:48:46 +0000 (14:48 -0700)]
MAINTAINERS: add backlight co-maintainers
Bryan Wu and Lee Jones volunteer to maintain backlight drivers and help
to setup git-tree for backlight subsystem. Thus, I add them as
backlight co-maintainers.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Acked-by: Bryan Wu <cooloney@gmail.com>
Acked-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jane Li [Thu, 3 Apr 2014 21:48:45 +0000 (14:48 -0700)]
printk: fix one circular lockdep warning about console_lock
Fix a warning about possible circular locking dependency.
If do in following sequence:
enter suspend -> resume -> plug-out CPUx (echo 0 > cpux/online)
lockdep will show warning as following:
======================================================
[ INFO: possible circular locking dependency detected ]
3.10.0 #2 Tainted: G O
-------------------------------------------------------
sh/1271 is trying to acquire lock:
(console_lock){+.+.+.}, at: console_cpu_notify+0x20/0x2c
but task is already holding lock:
(cpu_hotplug.lock){+.+.+.}, at: cpu_hotplug_begin+0x2c/0x58
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (cpu_hotplug.lock){+.+.+.}:
lock_acquire+0x98/0x12c
mutex_lock_nested+0x50/0x3d8
cpu_hotplug_begin+0x2c/0x58
_cpu_up+0x24/0x154
cpu_up+0x64/0x84
smp_init+0x9c/0xd4
kernel_init_freeable+0x78/0x1c8
kernel_init+0x8/0xe4
ret_from_fork+0x14/0x2c
-> #1 (cpu_add_remove_lock){+.+.+.}:
lock_acquire+0x98/0x12c
mutex_lock_nested+0x50/0x3d8
disable_nonboot_cpus+0x8/0xe8
suspend_devices_and_enter+0x214/0x448
pm_suspend+0x1e4/0x284
try_to_suspend+0xa4/0xbc
process_one_work+0x1c4/0x4fc
worker_thread+0x138/0x37c
kthread+0xa4/0xb0
ret_from_fork+0x14/0x2c
-> #0 (console_lock){+.+.+.}:
__lock_acquire+0x1b38/0x1b80
lock_acquire+0x98/0x12c
console_lock+0x54/0x68
console_cpu_notify+0x20/0x2c
notifier_call_chain+0x44/0x84
__cpu_notify+0x2c/0x48
cpu_notify_nofail+0x8/0x14
_cpu_down+0xf4/0x258
cpu_down+0x24/0x40
store_online+0x30/0x74
dev_attr_store+0x18/0x24
sysfs_write_file+0x16c/0x19c
vfs_write+0xb4/0x190
SyS_write+0x3c/0x70
ret_fast_syscall+0x0/0x48
Chain exists of:
console_lock --> cpu_add_remove_lock --> cpu_hotplug.lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(cpu_hotplug.lock);
lock(cpu_add_remove_lock);
lock(cpu_hotplug.lock);
lock(console_lock);
*** DEADLOCK ***
There are three locks involved in two sequence:
a) pm suspend:
console_lock (@suspend_console())
cpu_add_remove_lock (@disable_nonboot_cpus())
cpu_hotplug.lock (@_cpu_down())
b) Plug-out CPUx:
cpu_add_remove_lock (@(cpu_down())
cpu_hotplug.lock (@_cpu_down())
console_lock (@console_cpu_notify()) => Lockdeps prints warning log.
There should be not real deadlock, as flag of console_suspended can
protect this.
Although console_suspend() releases console_sem, it doesn't tell lockdep
about it. That results in the lockdep warning about circular locking
when doing the following: enter suspend -> resume -> plug-out CPUx (echo
0 > cpux/online)
Fix the problem by telling lockdep we actually released the semaphore in
console_suspend() and acquired it again in console_resume().
Signed-off-by: Jane Li <jiel@marvell.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simon Kågström [Thu, 3 Apr 2014 21:48:44 +0000 (14:48 -0700)]
include/linux/printk.h: remove double asmlinkage in printk_emit
The double asmlinkage was introduced in commit
7ff9554bb578 ("printk:
convert byte-buffer to variable-length record buffer").
Signed-off-by: Simon Kagstrom <simon.kagstrom@netinsight.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Mladek [Thu, 3 Apr 2014 21:48:43 +0000 (14:48 -0700)]
printk: do not compute the size of the message twice
This is just a tiny optimization. It removes duplicate computation of
the message size.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Mladek [Thu, 3 Apr 2014 21:48:42 +0000 (14:48 -0700)]
printk: use also the last bytes in the ring buffer
It seems that we have newer used the last byte in the ring buffer. In
fact, we have newer used the last 4 bytes because of padding.
First problem is in the check for free space. The exact number of free
bytes is enough to store the length of data.
Second problem is in the check where the ring buffer is rotated. The
left side counts the first unused index. It is unused, so it might be
the same as the size of the buffer.
Note that the first problem has to be fixed together with the second
one. Otherwise, the buffer is rotated even when there is enough space
on the end of the buffer. Then the beginning of the buffer is rewritten
and valid entries get corrupted.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Mladek [Thu, 3 Apr 2014 21:48:41 +0000 (14:48 -0700)]
printk: add comment about tricky check for text buffer size
There is no check for potential "text_len" overflow. It is not needed
because only valid level is detected. It took me some time to
understand why. It would deserve a comment ;-)
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Mladek [Thu, 3 Apr 2014 21:48:39 +0000 (14:48 -0700)]
printk: remove obsolete check for log level "c"
The kernel log level "c" was removed in commit
61e99ab8e35a ("printk:
remove the now unnecessary "C" annotation for KERN_CONT"). It is no
longer detected in printk_get_level(). Hence we do not need to check it
in vprintk_emit.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Mladek [Thu, 3 Apr 2014 21:48:38 +0000 (14:48 -0700)]
printk: remove duplicated check for log level
The check for the exact log level is already done in printk_get_level.
We do not need to duplicate it in printk_skip_level.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ryan Mallon [Thu, 3 Apr 2014 21:48:37 +0000 (14:48 -0700)]
vsprintf: remove %n handling
All in-kernel users of %n in format strings have now been removed and
the %n directive is ignored. Remove the handling of %n so that it is
treated the same as any other invalid format string directive. Keep a
warning in place to deter new instances of %n in format strings.
Signed-off-by: Ryan Mallon <rmallon@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daeseok Youn [Thu, 3 Apr 2014 21:48:36 +0000 (14:48 -0700)]
kernel/resource.c: make reallocate_resource() static
sparse says:
kernel/resource.c:518:5: warning:
symbol 'reallocate_resource' was not declared. Should it be static?
Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Gortmaker [Thu, 3 Apr 2014 21:48:35 +0000 (14:48 -0700)]
kernel: audit/fix non-modular users of module_init in core code
Code that is obj-y (always built-in) or dependent on a bool Kconfig
(built-in or absent) can never be modular. So using module_init as an
alias for __initcall can be somewhat misleading.
Fix these up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h
to obviously non-modular code, and that would be a worse thing.
The audit targets the following module_init users for change:
kernel/user.c obj-y
kernel/kexec.c bool KEXEC (one instance per arch)
kernel/profile.c bool PROFILING
kernel/hung_task.c bool DETECT_HUNG_TASK
kernel/sched/stats.c bool SCHEDSTATS
kernel/user_namespace.c bool USER_NS
Note that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of subsys_initcall (which makes sense for these
files) will thus change this registration from level 6-device to level
4-subsys (i.e. slightly earlier). However no observable impact of that
difference has been observed during testing.
Also, two instances of missing ";" at EOL are fixed in kexec.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Thu, 3 Apr 2014 21:48:34 +0000 (14:48 -0700)]
lib/syscall.c: unexport task_current_syscall()
It is only used by procfs and procfs cannot be a module.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Serge Hallyn [Thu, 3 Apr 2014 21:48:33 +0000 (14:48 -0700)]
xattr: guard against simultaneous glibc header inclusion
If the glibc xattr.h header is included after the uapi header,
compilation fails due to an enum re-using a #define from the uapi
header.
Protect against this by guarding the define and enum inclusions against
each other.
(See https://lists.debian.org/debian-glibc/2014/03/msg00029.html
and https://sourceware.org/glibc/wiki/Synchronizing_Headers
for more information.)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Allan McRae <allan@archlinux.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Markos Chandras [Thu, 3 Apr 2014 21:48:32 +0000 (14:48 -0700)]
samples/seccomp/Makefile: do not build tests if cross-compiling for MIPS
The Makefile is designed to use the host toolchain so it may be unsafe
to build the tests if the kernel has been configured and built for
another architecture. This fixes a build problem when the kernel has
been configured and built for the MIPS architecture but the host is not
MIPS (cross-compiled). The MIPS syscalls are only defined if one of the
following is true:
1) _MIPS_SIM == _MIPS_SIM_ABI64
2) _MIPS_SIM == _MIPS_SIM_ABI32
3) _MIPS_SIM == _MIPS_SIM_NABI32
Of course, none of these make sense on a non-MIPS toolchain and the
following build problem occurs when building on a non-MIPS host.
linux/usr/include/linux/kexec.h:50: userspace cannot reference function or variable defined in the kernel
samples/seccomp/bpf-direct.c: In function `emulator':
samples/seccomp/bpf-direct.c:76:17: error: `__NR_write' undeclared (first use in this function)
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 3 Apr 2014 21:48:31 +0000 (14:48 -0700)]
err.h: use bool for IS_ERR and IS_ERR_OR_NULL
Use the more natural return of bool for these tests.
No difference observed in .o files produced by gcc for x86.
Remove the dentry description of kernel pointers left over from the 90's
and 2002's cleanup move of parts of fs.h to err.h.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Josh Triplett [Thu, 3 Apr 2014 21:48:30 +0000 (14:48 -0700)]
SubmittingPatches: document the use of git
Most of the mechanical portions of SubmittingPatches exist to help patch
submitters replicate the output of git. Mention this explicitly, both
as a reminder that git will help with this process, and as signposting
to let git users know what they can safely skip.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Rob Landley <rob@landley.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Josh Triplett [Thu, 3 Apr 2014 21:48:29 +0000 (14:48 -0700)]
SubmittingPatches: add recommendation for mailing list references
SubmittingPatches already mentions referencing bugs fixed by a commit,
but doesn't mention citing relevant mailing list discussions. Add a
note to that effect, along with a recommendation to use the
https://lkml.kernel.org/ redirector.
Portions based on text from git's SubmittingPatches.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Rob Landley <rob@landley.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Josh Triplett [Thu, 3 Apr 2014 21:48:28 +0000 (14:48 -0700)]
SubmittingPatches: add style recommendation to use imperative descriptions
Most commit messages use this style, and the recommendation frequently
comes up in discussions (especially in response to patches that don't
use it), but that recommendation doesn't actually appear anywhere in
Documentation. Add this style guideline to SubmittingPatches, using the
description from git's SubmittingPatches.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Rob Landley <rob@landley.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Josh Triplett [Thu, 3 Apr 2014 21:48:27 +0000 (14:48 -0700)]
fs, kernel: permit disabling the uselib syscall
uselib hasn't been used since libc5; glibc does not use it. Support
turning it off.
When disabled, also omit the load_elf_library implementation from
binfmt_elf.c, which only uselib invokes.
bloat-o-meter:
add/remove: 0/4 grow/shrink: 0/1 up/down: 0/-785 (-785)
function old new delta
padzero 39 36 -3
uselib_flags 20 - -20
sys_uselib 168 - -168
SyS_uselib 168 - -168
load_elf_library 426 - -426
The new CONFIG_USELIB defaults to `y'.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wang YanQing [Thu, 3 Apr 2014 21:48:26 +0000 (14:48 -0700)]
kernel/groups.c: remove return value of set_groups
After commit
6307f8fee295 ("security: remove dead hook task_setgroups"),
set_groups will always return zero, so we could just remove return value
of set_groups.
This patch reduces code size, and simplfies code to use set_groups,
because we don't need to check its return value any more.
[akpm@linux-foundation.org: remove obsolete claims from set_groups() comment]
Signed-off-by: Wang YanQing <udknight@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fabian Frederick [Thu, 3 Apr 2014 21:48:25 +0000 (14:48 -0700)]
sys_sysfs: Add CONFIG_SYSFS_SYSCALL
sys_sysfs is an obsolete system call no longer supported by libc.
- This patch adds a default CONFIG_SYSFS_SYSCALL=y
- Option can be turned off in expert mode.
- cond_syscall added to kernel/sys_ni.c
[akpm@linux-foundation.org: tweak Kconfig help text]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rashika Kheria [Thu, 3 Apr 2014 21:48:24 +0000 (14:48 -0700)]
include/linux/syscalls.h: add sys32_quotactl() prototype
This eliminates the following warning in quota/compat.c:
fs/quota/compat.c:43:17: warning: no previous prototype for `sys32_quotactl' [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Raghavendra K T [Thu, 3 Apr 2014 21:48:23 +0000 (14:48 -0700)]
mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages
Currently max_sane_readahead() returns zero on the cpu whose NUMA node
has no local memory which leads to readahead failure. Fix this
readahead failure by returning minimum of (requested pages, 512). Users
running applications on a memory-less cpu which needs readahead such as
streaming application see considerable boost in the performance.
Result:
fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless
CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement.
fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB
testfile 32GB* 4G RAM numa machine (12 iterations) showed no impact on
the normal NUMA cases w/ patch.
Kernel Avg Stddev
base 7.4975 3.92%
patched 7.4174 3.26%
[Andrew: making return value PAGE_SIZE independent]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Davydov [Thu, 3 Apr 2014 21:48:22 +0000 (14:48 -0700)]
slub: do not drop slab_mutex for sysfs_slab_add
We release the slab_mutex while calling sysfs_slab_add from
__kmem_cache_create since commit
66c4c35c6bc5 ("slub: Do not hold
slub_lock when calling sysfs_slab_add()"), because kobject_uevent called
by sysfs_slab_add might block waiting for the usermode helper to exec,
which would result in a deadlock if we took the slab_mutex while
executing it.
However, apart from complicating synchronization rules, releasing the
slab_mutex on kmem cache creation can result in a kmemcg-related race.
The point is that we check if the memcg cache exists before going to
__kmem_cache_create, but register the new cache in memcg subsys after
it. Since we can drop the mutex there, several threads can see that the
memcg cache does not exist and proceed to creating it, which is wrong.
Fortunately, recently kobject_uevent was patched to call the usermode
helper with the UMH_NO_WAIT flag, making the deadlock impossible.
Therefore there is no point in releasing the slab_mutex while calling
sysfs_slab_add, so let's simplify kmem_cache_create synchronization and
fix the kmemcg-race mentioned above by holding the slab_mutex during the
whole cache creation path.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Greg KH <greg@kroah.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Davydov [Thu, 3 Apr 2014 21:48:21 +0000 (14:48 -0700)]
kobject: don't block for each kobject_uevent
Currently kobject_uevent has somewhat unpredictable semantics. The
point is, since it may call a usermode helper and wait for it to execute
(UMH_WAIT_EXEC), it is impossible to say for sure what lock dependencies
it will introduce for the caller - strictly speaking it depends on what
fs the binary is located on and the set of locks fork may take. There
are quite a few kobject_uevent's users that do not take this into
account and call it with various mutexes taken, e.g. rtnl_mutex,
net_mutex, which might potentially lead to a deadlock.
Since there is actually no reason to wait for the usermode helper to
execute there, let's make kobject_uevent start the helper asynchronously
with the aid of the UMH_NO_WAIT flag.
Personally, I'm interested in this, because I really want kobject_uevent
to be called under the slab_mutex in the slub implementation as it used
to be some time ago, because it greatly simplifies synchronization and
automatically fixes a kmemcg-related race. However, there was a
deadlock detected on an attempt to call kobject_uevent under the
slab_mutex (see https://lkml.org/lkml/2012/1/14/45), which was reported
to be fixed by releasing the slab_mutex for kobject_uevent.
Unfortunately, there was no information about who exactly blocked on the
slab_mutex causing the usermode helper to stall, neither have I managed
to find this out or reproduce the issue.
BTW, this is not the first attempt to make kobject_uevent use
UMH_NO_WAIT. Previous one was made by commit
f520360d93cd ("kobject:
don't block for each kobject_uevent"), but it was wrong (it passed
arguments allocated on stack to async thread) so it was reverted in
05f54c13cd0c ("Revert "kobject: don't block for each kobject_uevent".").
It targeted on speeding up the boot process though.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg KH <greg@kroah.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Thu, 3 Apr 2014 21:48:19 +0000 (14:48 -0700)]
drop_caches: add some documentation and info message
There is plenty of anecdotal evidence and a load of blog posts
suggesting that using "drop_caches" periodically keeps your system
running in "tip top shape". Perhaps adding some kernel documentation
will increase the amount of accurate data on its use.
If we are not shrinking caches effectively, then we have real bugs.
Using drop_caches will simply mask the bugs and make them harder to
find, but certainly does not fix them, nor is it an appropriate
"workaround" to limit the size of the caches. On the contrary, there
have been bug reports on issues that turned out to be misguided use of
cache dropping.
Dropping caches is a very drastic and disruptive operation that is good
for debugging and running tests, but if it creates bug reports from
production use, kernel developers should be aware of its use.
Add a bit more documentation about it, a syslog message to track down
abusers, and vmstat drop counters to help analyze problem reports.
[akpm@linux-foundation.org: checkpatch fixes]
[hannes@cmpxchg.org: add runtime suppression control]
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sasha Levin [Thu, 3 Apr 2014 21:48:18 +0000 (14:48 -0700)]
mm: remove read_cache_page_async()
This patch removes read_cache_page_async() which wasn't really needed
anywhere and simplifies the code around it a bit.
read_cache_page_async() is useful when we want to read a page into the
cache without waiting for it to complete. This happens when the
appropriate callback 'filler' doesn't complete its read operation and
releases the page lock immediately, and instead queues a different
completion routine to do that. This never actually happened anywhere in
the code.
read_cache_page_async() had 3 different callers:
- read_cache_page() which is the sync version, it would just wait for
the requested read to complete using wait_on_page_read().
- JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
function it supplied doesn't do any async reads, and would complete
before the filler function returns - making it actually a sync read.
- CRAMFS would call it using the read_mapping_page_async() wrapper, with
a similar story to JFFS2 - the filler function doesn't do anything that
reminds async reads and would always complete before the filler function
returns.
To sum it up, the code in mm/filemap.c never took advantage of having
read_cache_page_async(). While there are filler callbacks that do async
reads (such as the block one), we always called it with the
read_cache_page().
This patch adds a mandatory wait for read to complete when adding a new
page to the cache, and removes read_cache_page_async() and its wrappers.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Thu, 3 Apr 2014 21:48:17 +0000 (14:48 -0700)]
mm, thp: drop do_huge_pmd_wp_zero_page_fallback()
I've realized that there's no need for do_huge_pmd_wp_zero_page_fallback().
We can just split zero page with split_huge_page_pmd() and return
VM_FAULT_FALLBACK. handle_pte_fault() will handle write-protection
fault for us.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Thu, 3 Apr 2014 21:48:16 +0000 (14:48 -0700)]
mm: consolidate code to setup pte
Extract and consolidate code to setup pte from do_read_fault(),
do_cow_fault() and do_shared_fault().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Thu, 3 Apr 2014 21:48:15 +0000 (14:48 -0700)]
mm: consolidate code to call vm_ops->page_mkwrite()
There are two functions which need to call vm_ops->page_mkwrite():
do_shared_fault() and do_wp_page(). We can consolidate preparation
code.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Thu, 3 Apr 2014 21:48:13 +0000 (14:48 -0700)]
mm: introduce do_shared_fault() and drop do_fault()
Introduce do_shared_fault(). The function does what do_fault() does for
write faults to shared mappings
Unlike do_fault(), do_shared_fault() is relatively clean and
straight-forward.
Old do_fault() is not needed anymore. Let it die.
[lliubbo@gmail.com: fix NULL pointer dereference]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Thu, 3 Apr 2014 21:48:12 +0000 (14:48 -0700)]
mm: introduce do_cow_fault()
Introduce do_cow_fault(). The function does what do_fault() does for
write page faults to private mappings.
Unlike do_fault(), do_read_fault() is relatively clean and
straight-forward.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Thu, 3 Apr 2014 21:48:11 +0000 (14:48 -0700)]
mm: introduce do_read_fault()
Introduce do_read_fault(). The function does what do_fault() does for
read page faults.
Unlike do_fault(), do_read_fault() is pretty clean and straightforward.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Thu, 3 Apr 2014 21:48:10 +0000 (14:48 -0700)]
mm: do_fault(): extract to call vm_ops->do_fault() to separate function
Extract code to vm_ops->do_fault() and basic error handling to separate
function. The code will be reused.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Thu, 3 Apr 2014 21:48:08 +0000 (14:48 -0700)]
mm: rename __do_fault() -> do_fault()
Current __do_fault() is awful and unmaintainable. These patches try to
sort it out by split __do_fault() into three destinct codepaths:
- to handle read page fault;
- to handle write page fault to private mappings;
- to handle write page fault to shared mappings;
I also found page refcount leak in PageHWPoison() path of __do_fault().
This patch (of 7):
do_fault() is unused: no reason for underscores.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rashika Kheria [Thu, 3 Apr 2014 21:48:07 +0000 (14:48 -0700)]
include/linux/mm.h: remove ifdef condition
The ifdef conditions in include/linux/mm.h presents three cases:
- !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
There is no actual definition of function but include/linux/mm.h has a
static inline stub defined.
- defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
linux/mm.h does not define a prototype, but mm/page_alloc.c defines
the function.
Hence, compiler reports the following warning:
mm/page_alloc.c:4300:15: warning: no previous prototype for `__early_pfn_to_nid' [-Wmissing-prototypes]
- defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
The architecture defines the function, and linux/mm.h has a
prototype.
Thus, join the conditions of Case 2 and 3 ie eliminate the ifdef
condition of CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID to eliminate the missing
prototype warning from file mm/page_alloc.c.
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rashika Kheria [Thu, 3 Apr 2014 21:48:06 +0000 (14:48 -0700)]
mm/nobootmem.c: mark function as static
Mark function as static in nobootmem.c because it is not used outside
this file.
This eliminates the following warning in mm/nobootmem.c:
mm/nobootmem.c:324:15: warning: no previous prototype for `___alloc_bootmem_node' [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rashika Kheria [Thu, 3 Apr 2014 21:48:05 +0000 (14:48 -0700)]
mm/page_cgroup.c: mark functions as static
Mark functions as static in page_cgroup.c because they are not used
outside this file.
This eliminates the following warning in mm/page_cgroup.c:
mm/page_cgroup.c:177:6: warning: no previous prototype for `__free_page_cgroup' [-Wmissing-prototypes]
mm/page_cgroup.c:190:15: warning: no previous prototype for `online_page_cgroup' [-Wmissing-prototypes]
mm/page_cgroup.c:225:15: warning: no previous prototype for `offline_page_cgroup' [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rashika Kheria [Thu, 3 Apr 2014 21:48:04 +0000 (14:48 -0700)]
mm/process_vm_access.c: mark function as static
Mark function as static in process_vm_access.c because it is not used
outside this file.
This eliminates the following warning in mm/process_vm_access.c:
mm/process_vm_access.c:416:1: warning: no previous prototype for `compat_process_vm_rw' [-Wmissing-prototypes]
[akpm@linux-foundation.org: remove unneeded asmlinkage - compat_process_vm_rw isn't referenced from asm]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rashika Kheria [Thu, 3 Apr 2014 21:48:03 +0000 (14:48 -0700)]
mm/mmap.c: mark function as static
Mark function as static in mmap.c because they are not used outside this
file.
This eliminates the following warning in mm/mmap.c:
mm/mmap.c:407:6: warning: no previous prototype for `validate_mm' [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rashika Kheria [Thu, 3 Apr 2014 21:48:02 +0000 (14:48 -0700)]
mm/memory.c: mark functions as static
mark functions as static in memory.c because they are not used outside
this file.
This eliminates the following warnings in mm/memory.c:
mm/memory.c:3530:5: warning: no previous prototype for `numa_migrate_prep' [-Wmissing-prototypes]
mm/memory.c:3545:5: warning: no previous prototype for `do_numa_page' [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rashika Kheria [Thu, 3 Apr 2014 21:48:01 +0000 (14:48 -0700)]
mm/compaction.c: mark function as static
Mark function as static in compaction.c because it is not used outside
this file.
This eliminates the following warning from mm/compaction.c:
mm/compaction.c:1190:9: warning: no previous prototype for `sysfs_compact_node' [-Wmissing-prototypes
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Thu, 3 Apr 2014 21:48:00 +0000 (14:48 -0700)]
mm, compaction: avoid isolating pinned pages
Page migration will fail for memory that is pinned in memory with, for
example, get_user_pages(). In this case, it is unnecessary to take
zone->lru_lock or isolating the page and passing it to page migration
which will ultimately fail.
This is a racy check, the page can still change from under us, but in
that case we'll just fail later when attempting to move the page.
This avoids very expensive memory compaction when faulting transparent
hugepages after pinning a lot of memory with a Mellanox driver.
On a 128GB machine and pinning ~120GB of memory, before this patch we
see the enormous disparity in the number of page migration failures
because of the pinning (from /proc/vmstat):
compact_pages_moved 8450
compact_pagemigrate_failed
15614415
0.05% of pages isolated are successfully migrated and explicitly
triggering memory compaction takes 102 seconds. After the patch:
compact_pages_moved 9197
compact_pagemigrate_failed 7
99.9% of pages isolated are now successfully migrated in this
configuration and memory compaction takes less than one second.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Thu, 3 Apr 2014 21:47:59 +0000 (14:47 -0700)]
mm, hugetlb: mark some bootstrap functions as __init
Both prep_compound_huge_page() and prep_compound_gigantic_page() are
only called at bootstrap and can be marked as __init.
The __SetPageTail(page) in prep_compound_gigantic_page() happening
before page->first_page is initialized is not concerning since this is
bootstrap.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 3 Apr 2014 21:47:56 +0000 (14:47 -0700)]
mm: keep page cache radix tree nodes in check
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.
To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.
A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:
1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.
4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 3 Apr 2014 21:47:54 +0000 (14:47 -0700)]
lib: radix_tree: tree node interface
Make struct radix_tree_node part of the public interface and provide API
functions to create, look up, and delete whole nodes. Refactor the
existing insert, look up, delete functions on top of these new node
primitives.
This will allow the VM to track and garbage collect page cache radix
tree nodes.
[sasha.levin@oracle.com: return correct error code on insertion failure]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 3 Apr 2014 21:47:51 +0000 (14:47 -0700)]
mm: thrash detection-based file cache sizing
The VM maintains cached filesystem pages on two types of lists. One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
to benefit from caching in the past. We call the recently usedbut
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.
This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size. By maintaining
a history of recently evicted file pages it can detect frequently used
pages with an arbitrarily small inactive list size, and subsequently
apply pressure on the active list based on actual demand for cache, not
just overall eviction speed.
Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot. On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.
This fixes the VM's blindness towards working set changes in excess of
the inactive list. And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 3 Apr 2014 21:47:49 +0000 (14:47 -0700)]
mm + fs: store shadow entries in page cache
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page. As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently. At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.
Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate. Reclaim will check
for this flag before installing shadow pages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 3 Apr 2014 21:47:46 +0000 (14:47 -0700)]
mm + fs: prepare for non-page entries in page cache radix trees
shmem mappings already contain exceptional entries where swap slot
information is remembered.
To be able to store eviction information for regular page cache, prepare
every site dealing with the radix trees directly to handle entries other
than pages.
The common lookup functions will filter out non-page entries and return
NULL for page cache holes, just as before. But provide a raw version of
the API which returns non-page entries as well, and switch shmem over to
use it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 3 Apr 2014 21:47:44 +0000 (14:47 -0700)]
mm: filemap: move radix tree hole searching here
The radix tree hole searching code is only used for page cache, for
example the readahead code trying to get a a picture of the area
surrounding a fault.
It sufficed to rely on the radix tree definition of holes, which is
"empty tree slot". But this is about to change, though, as shadow page
descriptors will be stored in the page cache after the actual pages get
evicted from memory.
Move the functions over to mm/filemap.c and make them native page cache
operations, where they can later be adapted to handle the new definition
of "page cache hole".
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 3 Apr 2014 21:47:41 +0000 (14:47 -0700)]
mm: shmem: save one radix tree lookup when truncating swapped pages
Page cache radix tree slots are usually stabilized by the page lock, but
shmem's swap cookies have no such thing. Because the overall truncation
loop is lockless, the swap entry is currently confirmed by a tree lookup
and then deleted by another tree lookup under the same tree lock region.
Use radix_tree_delete_item() instead, which does the verification and
deletion with only one lookup. This also allows removing the
delete-only special case from shmem_radix_tree_replace().
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 3 Apr 2014 21:47:39 +0000 (14:47 -0700)]
lib: radix-tree: add radix_tree_delete_item()
Provide a function that does not just delete an entry at a given index,
but also allows passing in an expected item. Delete only if that item
is still located at the specified index.
This is handy when lockless tree traversals want to delete entries as
well because they don't have to do an second, locked lookup to verify
the slot has not changed under them before deleting the entry.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 3 Apr 2014 21:47:36 +0000 (14:47 -0700)]
fs: cachefiles: use add_to_page_cache_lru()
This code used to have its own lru cache pagevec up until
a0b8cab3 ("mm:
remove lru parameter from __pagevec_lru_add and remove parts of pagevec
API"). Now it's just add_to_page_cache() followed by lru_cache_add(),
might as well use add_to_page_cache_lru() directly.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 3 Apr 2014 21:47:34 +0000 (14:47 -0700)]
mm: vmstat: fix UP zone state accounting
Summary:
The VM maintains cached filesystem pages on two types of lists. One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
to benefit from caching in the past. We call the recently used list
"inactive list" and the frequently used list "active list".
Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages. This means
that working set changes bigger than half of cache memory go undetected
and thrash indefinitely, whereas working sets bigger than half of cache
memory are unprotected against used-once streams that don't even need
caching.
This happens on file servers and media streaming servers, where the
popular files and file sections change over time. Even though the
individual files might be smaller than half of memory, concurrent access
to many of them may still result in their inter-reference distance being
greater than half of memory. It's also been reported as a problem on
database workloads that switch back and forth between tables that are
bigger than half of memory. In these cases the VM never recognizes the
new working set and will for the remainder of the workload thrash disk
data which could easily live in memory.
Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved them
to the head of the inactive list. This model gave established working
sets more gracetime in the face of temporary use-once streams, but
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.
This series solves the problem by maintaining a history of pages evicted
from the inactive list, enabling the VM to detect frequently used pages
regardless of inactive list size and facilitate working set transitions.
Tests:
The reported database workload is easily demonstrated on a 8G machine
with two filesets a 6G. This fio workload operates on one set first,
then switches to the other. The VM should obviously always cache the
set that the workload is currently using.
This test is based on a problem encountered by Citus Data customers:
http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data
unpatched:
db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec
db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
sdb: ios=835750/4, merge=2/1, ticks=
4659739/60016, in_queue=
4719203, util=98.92%
real 27m15.541s
user 0m19.059s
sys 0m51.459s
patched:
db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec
db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec
sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=
1015923, util=90.40%
real 6m8.630s
user 0m14.714s
sys 0m31.233s
As can be seen, the unpatched kernel simply never adapts to the
workingset change and db2 is stuck indefinitely with secondary storage
speed. The patched kernel needs 2-3 iterations over db2 before it
replaces db1 and reaches full memory speed. Given the unbounded
negative affect of the existing VM behavior, these patches should be
considered correctness fixes rather than performance optimizations.
Another test resembles a fileserver or streaming server workload, where
data in excess of memory size is accessed at different frequencies.
There is very hot data accessed at a high frequency. Machines should be
fitted so that the hot set of such a workload can be fully cached or all
bets are off. Then there is a very big (compared to available memory)
set of data that is used-once or at a very low frequency; this is what
drives the inactive list and does not really benefit from caching.
Lastly, there is a big set of warm data in between that is accessed at
medium frequencies and benefits from caching the pages between the first
and last streamer of each burst.
unpatched:
hot: READ: io=128000MB, aggrb=160693KB/s, minb=160693KB/s, maxb=160693KB/s, mint=815665msec, maxt=815665msec
warm: READ: io= 81920MB, aggrb=109853KB/s, minb= 27463KB/s, maxb= 29244KB/s, mint=717110msec, maxt=763617msec
cold: READ: io= 30720MB, aggrb= 35245KB/s, minb= 35245KB/s, maxb= 35245KB/s, mint=892530msec, maxt=892530msec
sdb: ios=797960/4, merge=11763/1, ticks=
4307910/796, in_queue=
4308380, util=100.00%
patched:
hot: READ: io=128000MB, aggrb=160678KB/s, minb=160678KB/s, maxb=160678KB/s, mint=815740msec, maxt=815740msec
warm: READ: io= 81920MB, aggrb=147747KB/s, minb= 36936KB/s, maxb= 40960KB/s, mint=512000msec, maxt=567767msec
cold: READ: io= 30720MB, aggrb= 40960KB/s, minb= 40960KB/s, maxb= 40960KB/s, mint=768000msec, maxt=768000msec
sdb: ios=596514/4, merge=9341/1, ticks=
2395362/997, in_queue=
2396484, util=79.18%
In both kernels, the hot set is propagated to the active list and then
served from cache.
In both kernels, the beginning of the warm set is propagated to the
active list as well, but in the unpatched case the active list
eventually takes up half of memory and no new pages from the warm set
get activated, despite repeated access, and despite most of the active
list soon being stale. The patched kernel on the other hand detects the
thrashing and manages to keep this cache window rolling through the data
set. This frees up enough IO bandwidth that the cold set is served at
full speed as well and disk utilization even drops by 20%.
For reference, this same test was performed with the traditional
demotion mechanism, where deactivation is coupled to inactive list
reclaim. However, this had the same outcome as the unpatched kernel:
while the warm set does indeed get activated continuously, it is forced
out of the active list by inactive list pressure, which is dictated
primarily by the unrelated cold set. The warm set is evicted before
subsequent streamers can benefit from it, even though there would be
enough space available to cache the pages of interest.
Costs:
Page reclaim used to shrink the radix trees but now the tree nodes are
reused for shadow entries, where the cost depends heavily on the page
cache access patterns. However, with workloads that maintain spatial or
temporal locality, the shadow entries are either refaulted quickly or
reclaimed along with the inode object itself. Workloads that will
experience a memory cost increase are those that don't really benefit
from caching in the first place.
A more predictable alternative would be a fixed-cost separate pool of
shadow entries, but this would incur relatively higher memory cost for
well-behaved workloads at the benefit of cornercases. It would also
make the shadow entry lookup more costly compared to storing them
directly in the cache structure.
Future:
To simplify the merging process, this patch set is implementing thrash
detection on a global per-zone level only for now, but the design is
such that it can be extended to memory cgroups as well. All we need to
do is store the unique cgroup ID along the node and zone identifier
inside the eviction cookie to identify the lruvec.
Right now we have a fixed ratio (50:50) between inactive and active list
but we already have complaints about working sets exceeding half of
memory being pushed out of the cache by simple streaming in the
background. Ultimately, we want to adjust this ratio and allow for a
much smaller inactive list. These patches are an essential step in this
direction because they decouple the VMs ability to detect working set
changes from the inactive list size. This would allow us to base the
inactive list size on the combined readahead window size for example and
potentially protect a much bigger working set.
It's also a big step towards activating pages with a reuse distance
larger than memory, as long as they are the most frequently used pages
in the workload. This will require knowing more about the access
frequency of active pages than what we measure right now, so it's also
deferred in this series.
Another possibility of having thrashing information would be to revisit
the idea of local reclaim in the form of zero-config memory control
groups. Instead of having allocating tasks go straight to global
reclaim, they could try to reclaim the pages in the memcg they are part
of first as long as the group is not thrashing. This would allow a user
to drop e.g. a back-up job in an otherwise unconfigured memcg and it
would only inflate (and possibly do global reclaim) until it has enough
memory to do proper readahead. But once it reaches that point and stops
thrashing it would just recycle its own used-once pages without kicking
out the cache of any other tasks in the system more than necessary.
This patch (of 10):
Fengguang Wu's build testing spotted problems with inc_zone_state() and
dec_zone_state() on UP configurations in out-of-tree patches.
inc_zone_state() is declared but not defined, dec_zone_state() is
missing entirely.
Just like with *_zone_page_state(), they can be defined like their
preemption-unsafe counterparts on UP.
[akpm@linux-foundation.org: make it build]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Davydov [Thu, 3 Apr 2014 21:47:32 +0000 (14:47 -0700)]
mm: vmscan: shrink_slab: rename max_pass -> freeable
The name `max_pass' is misleading, because this variable actually keeps
the estimate number of freeable objects, not the maximal number of
objects we can scan in this pass, which can be twice that. Rename it to
reflect its actual meaning.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Thu, 3 Apr 2014 21:47:31 +0000 (14:47 -0700)]
mm, hugetlb: improve page-fault scalability
The kernel can currently only handle a single hugetlb page fault at a
time. This is due to a single mutex that serializes the entire path.
This lock protects from spurious OOM errors under conditions of low
availability of free hugepages. This problem is specific to hugepages,
because it is normal to want to use every single hugepage in the system
- with normal pages we simply assume there will always be a few spare
pages which can be used temporarily until the race is resolved.
Address this problem by using a table of mutexes, allowing a better
chance of parallelization, where each hugepage is individually
serialized. The hash key is selected depending on the mapping type.
For shared ones it consists of the address space and file offset being
faulted; while for private ones the mm and virtual address are used.
The size of the table is selected based on a compromise of collisions
and memory footprint of a series of database workloads.
Large database workloads that make heavy use of hugepages can be
particularly exposed to this issue, causing start-up times to be
painfully slow. This patch reduces the startup time of a 10 Gb Oracle
DB (with ~5000 faults) from 37.5 secs to 25.7 secs. Larger workloads
will naturally benefit even more.
NOTE:
The only downside to this patch, detected by Joonsoo Kim, is that a
small race is possible in private mappings: A child process (with its
own mm, after cow) can instantiate a page that is already being handled
by the parent in a cow fault. When low on pages, can trigger spurious
OOMs. I have not been able to think of a efficient way of handling
this... but do we really care about such a tiny window? We already
maintain another theoretical race with normal pages. If not, one
possible way to is to maintain the single hash for private mappings --
any workloads that *really* suffer from this scaling problem should
already use shared mappings.
[akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Thu, 3 Apr 2014 21:47:30 +0000 (14:47 -0700)]
mm, hugetlb: use vma_resv_map() map types
Util now, we get a resv_map by two ways according to each mapping type.
This makes code dirty and unreadable. Unify it.
[davidlohr@hp.com: code cleanups]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Thu, 3 Apr 2014 21:47:28 +0000 (14:47 -0700)]
mm, hugetlb: remove resv_map_put
This is a preparation patch to unify the use of vma_resv_map()
regardless of the map type. This patch prepares it by removing
resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not
for all resv_maps.
[davidlohr@hp.com: update changelog]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Thu, 3 Apr 2014 21:47:27 +0000 (14:47 -0700)]
mm, hugetlb: fix race in region tracking
There is a race condition if we map a same file on different processes.
Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
When we do mmap, we don't grab a hugetlb_instantiation_mutex, but only
mmap_sem (exclusively). This doesn't prevent other tasks from modifying
the region structure, so it can be modified by two processes
concurrently.
To solve this, introduce a spinlock to resv_map and make region
manipulation function grab it before they do actual work.
[davidlohr@hp.com: updated changelog]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Thu, 3 Apr 2014 21:47:26 +0000 (14:47 -0700)]
mm, hugetlb: improve, cleanup resv_map parameters
To change a protection method for region tracking to find grained one,
we pass the resv_map, instead of list_head, to region manipulation
functions.
This doesn't introduce any functional change, and it is just for
preparing a next step.
[davidlohr@hp.com: update changelog]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Thu, 3 Apr 2014 21:47:25 +0000 (14:47 -0700)]
mm, hugetlb: unify region structure handling
Currently, to track reserved and allocated regions, we use two different
ways, depending on the mapping. For MAP_SHARED, we use
address_mapping's private_list and, while for MAP_PRIVATE, we use a
resv_map.
Now, we are preparing to change a coarse grained lock which protect a
region structure to fine grained lock, and this difference hinder it.
So, before changing it, unify region structure handling, consistently
using a resv_map regardless of the kind of mapping.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Thu, 3 Apr 2014 21:47:24 +0000 (14:47 -0700)]
mm: optimize put_mems_allowed() usage
Since put_mems_allowed() is strictly optional, its a seqcount retry, we
don't need to evaluate the function if the allocation was in fact
successful, saving a smp_rmb some loads and comparisons on some relative
fast-paths.
Since the naming, get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.
This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
where it is important to note that the return value of the latter call
is inverted from its previous incarnation.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Thu, 3 Apr 2014 21:47:23 +0000 (14:47 -0700)]
mm, compaction: ignore pageblock skip when manually invoking compaction
The cached pageblock hint should be ignored when triggering compaction
through /proc/sys/vm/compact_memory so all eligible memory is isolated.
Manually invoking compaction is known to be expensive, there's no need
to skip pageblocks based on heuristics (mainly for debugging).
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Davydov [Thu, 3 Apr 2014 21:47:22 +0000 (14:47 -0700)]
mm: vmscan: remove shrink_control arg from do_try_to_free_pages()
There is no need passing on a shrink_control struct from
try_to_free_pages() and friends to do_try_to_free_pages() and then to
shrink_zones(), because it is only used in shrink_zones() and the only
field initialized on the top level is gfp_mask, which is always equal to
scan_control.gfp_mask. So let's move shrink_control initialization to
shrink_zones().
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Davydov [Thu, 3 Apr 2014 21:47:20 +0000 (14:47 -0700)]
mm: vmscan: move call to shrink_slab() to shrink_zones()
This reduces the indentation level of do_try_to_free_pages() and removes
extra loop over all eligible zones counting the number of on-LRU pages.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Glauber Costa <glommer@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Davydov [Thu, 3 Apr 2014 21:47:19 +0000 (14:47 -0700)]
mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim
When direct reclaim is executed by a process bound to a set of NUMA
nodes, we should scan only those nodes when possible, but currently we
will scan kmem from all online nodes even if the kmem shrinker is NUMA
aware. That said, binding a process to a particular NUMA node won't
prevent it from shrinking inode/dentry caches from other nodes, which is
not good. Fix this.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>