Programming Languages Research Group: Git - firefly-linux-kernel-4.4.55.git/log

Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-android

Merge remote-tracking branch 'lsk/v3.10/topic/arm64-crypto' into linux-linaro-lsk

Merge remote-tracking branch 'lsk/v3.10/topic/arm64-fpsimd' into linux-linaro-lsk

Conflicts:
arch/arm64/Kconfig
arch/arm64/include/asm/thread_info.h
arch/arm64/kernel/fpsimd.c

arm64: add support for kernel mode NEON in interrupt context

This patch modifies kernel_neon_begin() and kernel_neon_end(), so
they may be called from any context. To address the case where only
a couple of registers are needed, kernel_neon_begin_partial(u32) is
introduced which takes as a parameter the number of bottom 'n' NEON
q-registers required. To mark the end of such a partial section, the
regular kernel_neon_end() should be used.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
(cherry picked from commit 190f1ca85d071114930dd7abe6b5d103e9d5572f)
Signed-off-by: Mark Brown <broonie@linaro.org>

arm64: defer reloading a task's FPSIMD state to userland resume

If a task gets scheduled out and back in again and nothing has touched
its FPSIMD state in the mean time, there is really no reason to reload
it from memory. Similarly, repeated calls to kernel_neon_begin() and
kernel_neon_end() will preserve and restore the FPSIMD state every time.

This patch defers the FPSIMD state restore to the last possible moment,
i.e., right before the task returns to userland. If a task does not return to
userland at all (for any reason), the existing FPSIMD state is preserved
and may be reused by the owning task if it gets scheduled in again on the
same CPU.

This patch adds two more functions to abstract away from straight FPSIMD
register file saves and restores:
- fpsimd_restore_current_state -> ensure current's FPSIMD state is loaded
- fpsimd_flush_task_state -> invalidate live copies of a task's FPSIMD state

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
(cherry picked from commit 005f78cd88494457ed38ce817f4e3fe5d372f0cb)
Signed-off-by: Mark Brown <broonie@linaro.org>

arm64: add abstractions for FPSIMD state manipulation

There are two tacit assumptions in the FPSIMD handling code that will no longer
hold after the next patch that optimizes away some FPSIMD state restores:
. the FPSIMD registers of this CPU contain the userland FPSIMD state of
task 'current';
. when switching to a task, its FPSIMD state will always be restored from
memory.

This patch adds the following functions to abstract away from straight FPSIMD
register file saves and restores:
- fpsimd_preserve_current_state -> ensure current's FPSIMD state is saved
- fpsimd_update_current_state -> replace current's FPSIMD state

Where necessary, the signal handling and fork code are updated to use the above
wrappers instead of poking into the FPSIMD registers directly.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
(cherry picked from commit c51f92693c35c141cf7d9b7e2fcbb81128324eb4)
Signed-off-by: Mark Brown <broonie@linaro.org>

arm64: kernel: implement fpsimd CPU PM notifier

When a CPU enters a low power state, its FP register content is lost.
This patch adds a notifier to save the FP context on CPU shutdown
and restore it on CPU resume. The context is saved and restored only
if the suspending thread is not a kernel thread, mirroring the current
context switch behaviour.

Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
(cherry picked from commit fb1ab1ab3889fc23ed90e452502662311ebdf229)
Signed-off-by: Mark Brown <broonie@linaro.org>

arm64: fix possible invalid FPSIMD initialization state

If context switching happens during executing fpsimd_flush_thread(),
stale value in FPSIMD registers will be saved into current thread's
fpsimd_state by fpsimd_thread_switch(). That may cause invalid
initialization state for the new process, so disable preemption
when executing fpsimd_flush_thread().

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 6db83cea1c975b9a102e17def7d2795814e1ae2b)
Signed-off-by: Mark Brown <broonie@linaro.org>

arm64: add support for kernel mode NEON

Add <asm/neon.h> containing kernel_neon_begin/kernel_neon_end function
declarations and corresponding definitions in fpsimd.c

These are needed to wrap uses of NEON in kernel mode. The names are
identical to the ones used in arm/ so code using intrinsics or
vectorized by GCC can be shared between arm and arm64.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 4cfb36136480c029a29dbf63a623506e6ed7282b)
Signed-off-by: Mark Brown <broonie@linaro.org>

Merge remote-tracking branch 'lsk/v3.10/topic/aosp' into linux-linaro-lsk-android

Conflicts:
net/wireless/nl80211.c

Merge branch 'linaro-android-3.10-lsk' of git://git.linaro.org/people/jstultz/android into lsk-v3.10-aosp

crypto: create generic version of ablk_helper

Create a generic version of ablk_helper so it can be reused
by other architectures.

Acked-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit a62b01cd6cc1feb5e80d64d6937c291473ed82cb)
Signed-off-by: Mark Brown <broonie@linaro.org>

crypto: remove direct blkcipher_walk dependency on transform

In order to allow other uses of the blkcipher walk API than the blkcipher
algos themselves, this patch copies some of the transform data members to the
walk struct so the transform is only accessed at walk init time.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 822be00fe67105a90e536df52d1e4d688f34b5b2)
Signed-off-by: Mark Brown <broonie@linaro.org>

crypto: allow blkcipher walks over AEAD data

This adds the function blkcipher_aead_walk_virt_block, which allows the caller
to use the blkcipher walk API to handle the input and output scatterlists.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 4f7f1d7cff8f2c170ce0319eb4c01a82c328d34f)
Signed-off-by: Mark Brown <broonie@linaro.org>

Merge branch 'linaro-fixes/android-3.10' into linaro-android-3.10-lsk

Merge branch 'upstream/android-3.10' into linaro-fixes/android-3.10

nl80211: cumulative vendor command support patch

Based on commit d3fd06d0259232e1362c6d1da136970d26628467
Author: Johannes Berg <johannes.berg@intel.com>
Date: Sat Jan 25 10:17:18 2014 -0800
nl80211: vendor command support

Change-Id: I832eb4da295fe7b2c9bd8ff69ae80fe7bfe30add
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>

Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-android

Merge remote-tracking branch 'lsk/v3.10/topic/configs' into linux-linaro-lsk

configs: Enable some SELinux options

Going forward Android will require SELinux and we may as well make sure
it does't get in the way for other configurations.

Signed-off-by: Mark Brown <broonie@linaro.org>

Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-android

Merge remote-tracking branch 'lsk/linux-linaro-lsk-android' into linux-linaro-lsk-android

Merge remote-tracking branch 'lsk/v3.10/topic/configs' into linux-linaro-lsk

Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-android

Merge tag v3.10.44 into linux-linaro-lsk

This is the 3.10.44 stable release.

power: Add property CHARGE_COUNTER_EXT and 64-bit precision properties

Add POWER_SUPPLY_PROP_CHARGE_COUNTER_EXT that stores accumulated charge
in nAh units as a signed 64-bit value.

Add generic support for signed 64-bit property values.

Change-Id: I2bd34b1e95ffba24e7bfef81f398f22bd2aaf05e
Signed-off-by: Todd Poynor <toddpoynor@google.com>

nl80211: fix another nl80211_fam.attrbuf race

commit c319d50bfcf678c2857038276d9fab3c6646f3bf upstream.

This is similar to the race Linus had reported, but in this case
it's an older bug: nl80211_prepare_wdev_dump() uses the wiphy
index in cb->args[0] as it is and thus parses the message over
and over again instead of just once because 0 is the first valid
wiphy index. Similar code in nl80211_testmode_dump() correctly
offsets the wiphy_index by 1, do that here as well.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-android

Merge remote-tracking branch 'lsk/v3.10/topic/configs' into linux-linaro-lsk

configs: Enable ARMv8 crypto acceleration build

Signed-off-by: Mark Brown <broonie@linaro.org>

Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-android

Linux 3.10.44

ahci: add PCI ID for Marvell 88SE91A0 SATA Controller

commit 754a292fe6b08196cb135c03b404444e17de520a upstream.

Add support for Marvell Technology Group Ltd. 88SE91A0 SATA 6Gb/s
Controller by adding its PCI ID.

Signed-off-by: Andreas Schrägle <ajs124.ajs124@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ahci: Add Device ID for HighPoint RocketRaid 642L

commit d251836508fb26cd1a22b41381739835ee23728d upstream.

This device normally comes with a proprietary driver, using a web GUI
to configure RAID:
http://www.highpoint-tech.com/USA_new/series_rr600-download.htm
But thankfully it also works out of the box with the AHCI driver,
being just a Marvell 88SE9235.

Devices 640L, 644L, 644LS should also be supported but not tested here.

Signed-off-by: Jérôme Carretero <cJ-ko@zougloub.eu>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mei: me: drop harmful wait optimization

commit 07cd7be3d92eeeae1f92a017f2cfe4fdd9256526 upstream.

It my take time till ME_RDY will be cleared after the reset,
so we cannot check the bit before we got the interrupt

Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

SCSI: megaraid: Use resource_size_t for PCI resources, not long

commit 11f8a7b31f2140b0dc164bb484281235ffbe51d3 upstream.

The assumption that sizeof(long) >= sizeof(resource_size_t) can lead to
truncation of the PCI resource address, meaning this driver didn't work
on 32-bit systems with 64-bit PCI adressing ranges.

Signed-off-by: Ben Collins <ben.c@servergy.com>
Acked-by: Sumit Saxena <sumit.saxena@lsi.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

auditsc: audit_krule mask accesses need bounds checking

commit a3c54931199565930d6d84f4c3456f6440aefd41 upstream.

Fixes an easy DoS and possible information disclosure.

This does nothing about the broken state of x32 auditing.

eparis: If the admin has enabled auditd and has specifically loaded
audit rules. This bug has been around since before git. Wow...

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mm/compaction: make isolate_freepages start at pageblock boundary

commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.

The compaction freepage scanner implementation in isolate_freepages()
starts by taking the current cc->free_pfn value as the first pfn.  In a
for loop, it scans from this first pfn to the end of the pageblock, and
then subtracts pageblock_nr_pages from the first pfn to obtain the first
pfn for the next for loop iteration.

This means that when cc->free_pfn starts at offset X rather than being
aligned on pageblock boundary, the scanner will start at offset X in all
scanned pageblock, ignoring potentially many free pages.  Currently this
can happen when

a) zone's end pfn is not pageblock aligned, or

b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE
    enabled and a hole spanning the beginning of a pageblock

This patch fixes the problem by aligning the initial pfn in
isolate_freepages() to pageblock boundary.  This also permits replacing
the end-of-pageblock alignment within the for loop with a simple
pageblock_nr_pages increment.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Heesub Shin <heesub.shin@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Dongjun Shin <d.j.shin@samsung.com>
Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mm: compaction: detect when scanners meet in isolate_freepages

commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.

Compaction of a zone is finished when the migrate scanner (which begins
at the zone's lowest pfn) meets the free page scanner (which begins at
the zone's highest pfn).  This is detected in compact_zone() and in the
case of direct compaction, the compact_blockskip_flush flag is set so
that kswapd later resets the cached scanner pfn's, and a new compaction
may again start at the zone's borders.

The meeting of the scanners can happen during either scanner's activity.
However, it may currently fail to be detected when it occurs in the free
page scanner, due to two problems.  First, isolate_freepages() keeps
free_pfn at the highest block where it isolated pages from, for the
purposes of not missing the pages that are returned back to allocator
when migration fails.  Second, failing to isolate enough free pages due
to scanners meeting results in -ENOMEM being returned by
migrate_pages(), which makes compact_zone() bail out immediately without
calling compact_finished() that would detect scanners meeting.

This failure to detect scanners meeting might result in repeated
attempts at compaction of a zone that keep starting from the cached
pfn's close to the meeting point, and quickly failing through the
-ENOMEM path, without the cached pfns being reset, over and over.  This
has been observed (through additional tracepoints) in the third phase of
the mmtests stress-highalloc benchmark, where the allocator runs on an
otherwise idle system.  The problem was observed in the DMA32 zone,
which was used as a fallback to the preferred Normal zone, but on the
4GB system it was actually the largest zone.  The problem is even
amplified for such fallback zone - the deferred compaction logic, which
could (after being fixed by a previous patch) reset the cached scanner
pfn's, is only applied to the preferred zone and not for the fallbacks.

The problem in the third phase of the benchmark was further amplified by
commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") which
resulted in a non-deterministic regression of the allocation success
rate from ~85% to ~65%.  This occurs in about half of benchmark runs,
making bisection problematic.  It is unlikely that the commit itself is
buggy, but it should put more pressure on the DMA32 zone during phases 1
and 2, which may leave it more fragmented in phase 3 and expose the bugs
that this patch fixes.

The fix is to make scanners meeting in isolate_freepage() stay that way,
and to check in compact_zone() for scanners meeting when migrate_pages()
returns -ENOMEM.  The result is that compact_finished() also detects
scanners meeting and sets the compact_blockskip_flush flag to make
kswapd reset the scanner pfn's.

The results in stress-highalloc benchmark show that the "regression" by
commit 81c0a2bb515f in phase 3 no longer occurs, and phase 1 and 2
allocation success rates are also significantly improved.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mm: compaction: reset cached scanner pfn's before reading them

commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.

Compaction caches pfn's for its migrate and free scanners to avoid
scanning the whole zone each time.  In compact_zone(), the cached values
are read to set up initial values for the scanners.  There are several
situations when these cached pfn's are reset to the first and last pfn
of the zone, respectively.  One of these situations is when a compaction
has been deferred for a zone and is now being restarted during a direct
compaction, which is also done in compact_zone().

However, compact_zone() currently reads the cached pfn's *before*
resetting them.  This means the reset doesn't affect the compaction that
performs it, and with good chance also subsequent compactions, as
update_pageblock_skip() is likely to be called and update the cached
pfn's to those being processed.  Another chance for a successful reset
is when a direct compaction detects that migration and free scanners
meet (which has its own problems addressed by another patch) and sets
update_pageblock_skip flag which kswapd uses to do the reset because it
goes to sleep.

This is clearly a bug that results in non-deterministic behavior, so
this patch moves the cached pfn reset to be performed *before* the
values are read.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

target: Fix alua_access_state attribute OOPs for un-configured devices

commit f1453773514bb8b0bba0716301e8c8f17f8d39c7 upstream.

This patch fixes a OOPs where an attempt to write to the per-device
alua_access_state configfs attribute at:

/sys/kernel/config/target/core/$HBA/$DEV/alua/$TG_PT_GP/alua_access_state

results in an NULL pointer dereference when the backend device has not
yet been configured.

This patch adds an explicit check for DF_CONFIGURED, and fails with
-ENODEV to avoid this case.

Reported-by: Chris Boot <crb@tiger-computing.co.uk>
Reported-by: Philip Gaw <pgaw@darktech.org.uk>
Cc: Chris Boot <crb@tiger-computing.co.uk>
Cc: Philip Gaw <pgaw@darktech.org.uk>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

target: Allow READ_CAPACITY opcode in ALUA Standby access state

commit e7810c2d2c37fa8e58dda74b00790dab60fe6fba upstream.

This patch allows READ_CAPACITY + SAI_READ_CAPACITY_16 opcode
processing to occur while the associated ALUA group is in Standby
access state.

This is required to avoid host side LUN probe failures during the
initial scan if an ALUA group has already implicitly changed into
Standby access state.

This addresses a bug reported by Chris + Philip using dm-multipath
+ ESX hosts configured with ALUA multipath.

(Drop v3.15 specific set_ascq usage - nab)

Reported-by: Chris Boot <crb@tiger-computing.co.uk>
Reported-by: Philip Gaw <pgaw@darktech.org.uk>
Cc: Chris Boot <crb@tiger-computing.co.uk>
Cc: Philip Gaw <pgaw@darktech.org.uk>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

iscsi-target: Fix wrong buffer / buffer overrun in iscsi_change_param_value()

commit 79d59d08082dd0a0a18f8ceb78c99f9f321d72aa upstream.

In non-leading connection login, iscsi_login_non_zero_tsih_s1() calls
iscsi_change_param_value() with the buffer it uses to hold the login
PDU, not a temporary buffer.  This leads to the login header getting
corrupted and login failing for non-leading connections in MC/S.

Fix this by adding a wrapper iscsi_change_param_sprintf() that handles
the temporary buffer itself to avoid confusion.  Also handle sending a
reject in case of failure in the wrapper, which lets the calling code
get quite a bit smaller and easier to read.

Finally, bump the size of the temporary buffer from 32 to 64 bytes to be
safe, since "MaxRecvDataSegmentLength=" by itself is 25 bytes; with a
trailing NUL, a value >= 1M will lead to a buffer overrun.  (This isn't
the default but we don't need to run right at the ragged edge here)

(Fix up context changes for v3.10.y - nab)

Reported-by: Santosh Kulkarni <santosh.kulkarni@calsoftinc.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

iser-target: Fix multi network portal shutdown regression

commit 2363d196686e44c0158929e7cf96c8589a24a81b upstream.

This patch fixes a iser-target specific regression introduced in
v3.15-rc6 with:

commit 14f4b54fe38f3a8f8392a50b951c8aa43b63687a
Author: Sagi Grimberg <sagig@mellanox.com>
Date: Tue Apr 29 13:13:47 2014 +0300

Target/iscsi,iser: Avoid accepting transport connections during stop stage

where the change to set iscsi_np->enabled = false within
iscsit_clear_tpg_np_login_thread() meant that a iscsi_np with
two iscsi_tpg_np exports would have it's parent iscsi_np set
to a disabled state, even if other iscsi_tpg_np exports still
existed.

This patch changes iscsit_clear_tpg_np_login_thread() to only
set iscsi_np->enabled = false when shutdown = true, and also
changes iscsit_del_np() to set iscsi_np->enabled = true when
iscsi_np->np_exports is non zero.

(Fix up context changes for v3.10.y - nab)

Cc: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Target/iscsi,iser: Avoid accepting transport connections during stop stage

commit 14f4b54fe38f3a8f8392a50b951c8aa43b63687a upstream.

When the target is in stop stage, iSER transport initiates RDMA disconnects.
The iSER initiator may wish to establish a new connection over the
still existing network portal. In this case iSER transport should not
accept and resume new RDMA connections. In order to learn that, iscsi_np
is added with enabled flag so the iSER transport can check when deciding
weather to accept and resume a new connection request.

The iscsi_np is enabled after successful transport setup, and disabled
before iscsi_np login threads are cleaned up.

(Fix up context changes for v3.10.y - nab)

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

netfilter: ipv4: defrag: set local_df flag on defragmented skb

commit 895162b1101b3ea5db08ca6822ae9672717efec0 upstream.

else we may fail to forward skb even if original fragments do fit
outgoing link mtu:

1. remote sends 2k packets in two 1000 byte frags, DF set
2. we want to forward but only see '2k > mtu and DF set'
3. we then send icmp error saying that outgoing link is 1500

But original sender never sent a packet that would not fit
the outgoing link.

Setting local_df makes outgoing path test size vs.
IPCB(skb)->frag_max_size, so we will still send the correct
error in case the largest original size did not fit
outgoing link mtu.

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Suggested-by: Maxime Bizon <mbizon@freebox.fr>
Fixes: 5f2d04f1f9 (ipv4: fix path MTU discovery with connection tracking)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ARM: mvebu: fix NOR bus-width in Armada XP OpenBlocks AX3 Device Tree

commit 6e20bae8a39c40d4e03698e4160bad2d2629062b upstream.

The mvebu-devbus driver had a serious bug, which lead to a 8 bits bus
width declared in the Device Tree being considered as a 16 bits bus
width when configuring the hardware.

This bug in mvebu-devbus driver was compensated by a symetric mistake
in the Armada XP OpenBlocks AX3 Device Tree: a 8 bits bus width was
declared, even though the hardware actually has a 16 bits bus width
connection with the NOR flash.

Now that we have fixed the mvebu-devbus driver to behave according to
its Device Tree binding, this commit fixes the problematic Device Tree
files as well.

This bug was introduced in commit
a7d4f81821f7eec3175f8e23dd6949c71ab2da43 ('ARM: mvebu: Add support for
NOR flash device on Openblocks AX3 board') which was merged in v3.10.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Link: https://lkml.kernel.org/r/1397489361-5833-5-git-send-email-thomas.petazzoni@free-electrons.com
Fixes: a7d4f81821f7 ('ARM: mvebu: Add support for NOR flash device on Openblocks AX3 board')
Cc: stable@vger.kernel.org # v3.10+
Acked-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: Jason Cooper <jason@lakedaemon.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ARM: mvebu: fix NOR bus-width in Armada XP GP Device Tree

commit 1a88f809ccb5db1509a7514b187c00b3a995fc82 upstream.

The mvebu-devbus driver had a serious bug, which lead to a 8 bits bus
width declared in the Device Tree being considered as a 16 bits bus
width when configuring the hardware.

This bug in mvebu-devbus driver was compensated by a symetric mistake
in the Armada XP GP Device Tree: a 8 bits bus width was declared, even
though the hardware actually has a 16 bits bus width connection with
the NOR flash.

Now that we have fixed the mvebu-devbus driver to behave according to
its Device Tree binding, this commit fixes the problematic Device Tree
files as well.

This bug was introduced in commit
da8d1b38356853c37116f9afa29f15648d7fb159 ('ARM: mvebu: Add support for
NOR flash device on Armada XP-GP board') which was merged in v3.10.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Link: https://lkml.kernel.org/r/1397489361-5833-3-git-send-email-thomas.petazzoni@free-electrons.com
Fixes: da8d1b383568 ('ARM: mvebu: Add support for NOR flash device on Armada XP-GP board')
Cc: stable@vger.kernel.org # v3.10+
Acked-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: Jason Cooper <jason@lakedaemon.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mlx4_en: don't use napi_synchronize inside mlx4_en_netpoll

commit c98235cb8584a72e95786e17d695a8e5fafcd766 upstream.

The mlx4 driver is triggering schedules while atomic inside
mlx4_en_netpoll:

spin_lock_irqsave(&cq->lock, flags);
napi_synchronize(&cq->napi);
^^^^^ msleep here
mlx4_en_process_rx_cq(dev, cq, 0);
spin_unlock_irqrestore(&cq->lock, flags);

This was part of a patch by Alexander Guller from Mellanox in 2011,
but it still isn't upstream.

Signed-off-by: Chris Mason <clm@fb.com>
Acked-By: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Masoud Sharbiani <msharbiani@twitter.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

fs,userns: Change inode_capable to capable_wrt_inode_uidgid

commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream.

The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces. For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.

This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.

Fixes CVE-2014-4014.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge remote-tracking branch 'lsk/v3.10/topic/asoc-compress' into linux-linaro-lsk

Merge remote-tracking branch 'lsk/v3.10/topic/arm64-crypto' into linux-linaro-lsk

Conflicts:
arch/arm64/Kconfig

ASoC: compress: Add suport for DPCM into compressed audio

Currently compressed audio streams are statically routed from the /dev
to the DAI link. Some DSPs can route compressed data to multiple BE DAIs
like they do for PCM data.

Add support to allow dynamically routed compressed streams using the existing
DPCM infrastructure. This patch adds special FE versions of the compressed ops
that work out the runtime routing.

Signed-off-by: Liam Girdwood <liam.r.girdwood@linux.intel.com>
Acked-by: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Mark Brown <broonie@linaro.org>
(cherry picked from commit 2a99ef0fdb35a0f8d6b56ccc5d9d821e9ff100c1)
Signed-off-by: Mark Brown <broonie@linaro.org>

ASoC: DPCM: make some DPCM API calls non static for compressed usage

The ASoC compressed code needs to call the internal DPCM APIs in order to
dynamically route compressed data to different DAIs.

Signed-off-by: Liam Girdwood <liam.r.girdwood@linux.intel.com>
Signed-off-by: Mark Brown <broonie@linaro.org>
(cherry picked from commit 23607025303af6e84bc2cd4cabe89c21f6a22a3f)
Signed-off-by: Mark Brown <broonie@linaro.org>

ASoC: compress: use soc_xxx handlers for metadata

the compress metadata handlers were wrongly named sst_xxx

Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Mark Brown <broonie@linaro.org>
(cherry picked from commit 02bd90e86dc63728feebaf2b238684208ccb19eb)
Signed-off-by: Mark Brown <broonie@linaro.org>

prctl: adds the capable(CAP_SYS_NICE) check to PR_SET_TIMERSLACK_PID.

Adds a capable() check to make sure that arbitary apps do not change the
timer slack for other apps.

Bug: 15000427
Change-Id: I558a2551a0e3579c7f7e7aae54b28aa9d982b209
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>

arm64: enable generic CPU feature modalias matching for this architecture

This enables support for the generic CPU feature modalias implementation that
wires up optional CPU features to udev based module autoprobing.

A file <asm/cpufeature.h> is provided that maps CPU feature numbers to
elf_hwcap bits, which is the standard way on arm64 to advertise optional CPU
features both internally and to user space.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[catalin.marinas@arm.com: removed unnecessary "!!"]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 3be1a5c4f75989cf457f13f38ff0913dff6d4996)
Signed-off-by: Mark Brown <broonie@linaro.org>
Conflicts:
arch/arm64/Kconfig

cpu: add generic support for CPU feature based module autoloading

This patch adds support for advertising optional CPU features over udev
using the modalias, and for declaring compatibility with/dependency upon
such a feature in a module.

The mapping between feature numbers and actual features should be provided
by the architecture in a file called <asm/cpufeature.h> which exports the
following functions/macros:
- cpu_feature(FEAT), a preprocessor macro that maps token FEAT to a
  numeric index;
- bool cpu_have_feature(n), returning whether this CPU has support for
  feature #n;
- MAX_CPU_FEATURES, an upper bound for 'n' in the previous function.

The feature can then be enabled by setting CONFIG_GENERIC_CPU_AUTOPROBE
for the architecture.

For instance, a module that registers its module init function using

  module_cpu_feature_match(FEAT_X, module_init_function)

will be probed automatically when the CPU's support for the 'FEAT_X'
feature is advertised over udev, and will only allow the module to be
loaded by hand if the 'FEAT_X' feature is supported.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 67bad2fdb754dbef14596c0b5d28b3a12c8dfe84)
Signed-off-by: Mark Brown <broonie@linaro.org>
Conflicts:
drivers/base/cpu.c

binfmt_elf: add ELF_HWCAP2 to compat auxv entries

Add ELF_HWCAP2 to the set of auxv entries that is passed to
a 32-bit ELF program running in 32-bit compat mode under a
64-bit kernel.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 66d6e3b385778c2eef3b172a037235516207393e)
Signed-off-by: Mark Brown <broonie@linaro.org>

Merge branch 'v3.10/topic/pinctrl' into linux-linaro-lsk

pinctrl: Fix pin_config_*_set_bulk APIs

In commit baeae2041e14d5b7a272b9941b88fb56d716f643, Linaro-specific
APIs pin_config_set_bulk and pin_config_group_set_bulk were introduced.
However, pinconf_check_ops was not updated, and the wrong function was
used in the PIN_MAP_TYPE_CONFIGS_PIN case in pinconf_apply_setting.

Change-Id: Idc4a802f3d0086b927b808f65270e548b5b927be
Signed-off-by: Sherman Yin <syin@broadcom.com>

cpufreq: Persist cpufreq time in state data across hotplug

Cpufreq time_in_state data for all CPUs is made persistent across
hotplug and exposed to userspace via sysfs file
/sys/devices/system/cpu/cpufreq/all_time_in_state

Change-Id: I97cb5de24b6de16189bf8b5df9592d0a6e6ddf32
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>

cpufreq: interactive: prevents the frequency to directly raise above the
hispeed_freq from a lower frequency.

When the load was below go_hispeed_load, there is a possibility that
choose_freq() would return a frequency which would be higher than the
hispeed_freq. According to the policy we should first jump to the
hispeed_freq, stay there for above_hispeed_delay and then be allowed to
raise higher than that.

Added a check to prevent the frequency to be directly raised to
something higher than the hispeed_freq.

Change-Id: Icda5d848dd9beadcc18835082ddf269732c75bd0
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>

Merge remote-tracking branch 'lsk/v3.10/topic/mailbox' into linux-linaro-lsk

Merge remote-tracking branch 'lsk/v3.10/topic/configs' into linux-linaro-lsk

mailbox: Remove const from client argument of mbox_request_channel()

The struct mbox_client supplied to mbox_request_channel() is const but
it is stored in the channel in a non-constant member causing compiler
warnings.

While the mailbox API should treat the struct mailbox_client as const
itself the struct is passed back to the channel in callbacks without
a const so we need to either remove the const, change the callbacks to
take const or cast the const away when doing callbacks. Take the simplest
option and just remove the const.

Signed-off-by: Mark Brown <broonie@linaro.org>

mailbox: Prototype mbox_client_peek_data()

This is an interface intended for client drivers and exported but it is
not prototyped in the headers so can't be used. Add a prototype.

Signed-off-by: Mark Brown <broonie@linaro.org>

MAINTAINERS: Add maintainer entry for Mailbox API

Add myself as the maintainer of mailbox api

Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Signed-off-by: Mark Brown <broonie@linaro.org>

mailbox: Fix deleteing poll timer

Try to delete the timer only if it was init/used.

Signed-off-by: LeyFoon Tan <lftan.linux@gmail.com>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Signed-off-by: Mark Brown <broonie@linaro.org>

Mailbox: Generic: Specify mailbox api bindings

Due to the platform specific nature of remote's firmware, we can't do much
to facilitate shareable client drivers. So the Mailbox api is more like a
bunch of useful functions put together. The only feature that can be
abstracted out is mailbox channel assignment to client drivers, which this
binding specifies.

Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Signed-off-by: Mark Brown <broonie@linaro.org>

mailbox: Introduce framework for mailbox

Introduce common framework for client/protocol drivers and
controller drivers of Inter-Processor-Communication (IPC).

Client driver developers should have a look at
include/linux/mailbox_client.h to understand the part of
the API exposed to client drivers.
Similarly controller driver developers should have a look
at include/linux/mailbox_controller.h

Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Signed-off-by: Mark Brown <broonie@linaro.org>

configs: Enable mailbox API

Signed-off-by: Mark Brown <broonie@linaro.org>

Merge remote-tracking branch 'lsk/v3.10/topic/arm64-fvp' into linux-linaro-lsk

Merge remote-tracking branch 'lsk/v3.10/topic/mm-timer' into linux-linaro-lsk

Conflicts:
drivers/clocksource/arm_arch_timer.c

clocksource: arch_timer: Do not register arch_sys_counter twice

Commit:

65cd4f6 ("arch_timer: Move to generic sched_clock framework")

added code to register the arch_sys_counter in arch_timer_register(),
but it is already registered in arch_counter_register().

This results in the timer being added to the clocksource list twice,
therefore causing an infinite loop in the list.

Remove the duplicate registration and register the scheduler
clock after the original registration instead.

This fixes a hang during boot on Tegra114 (Cortex-A15).

[ While I've only tested this on Tegra114, I suspect the same hang
during boot happens for all processors that use this clock source. ]

Signed-off-by: Thierry Reding <treding@nvidia.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: http://lkml.kernel.org/r/1381843911-31962-1-git-send-email-treding@nvidia.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4a7d3e8a9939cf8073c247286d81cbe0ae48eba2)
Signed-off-by: Mark Brown <broonie@linaro.org>
Conflicts:
drivers/clocksource/arm_arch_timer.c

Merge remote-tracking branch 'lsk/v3.10/topic/juno' into linux-linaro-lsk

arm64: Juno: Carve out memory reserved for secure access.

Trusted Firmware 0.4 reserves the top 16MB of RAM available
in the first 4GB address space for the secure world use.
Carve out that memory from the device tree to avoid triggering
faults when accessing protected memory.

Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
Signed-off-by: Mark Brown <broonie@linaro.org>

arm64: Add cpu idle information to the fast model DT

Signed-off-by: Mark Brown <broonie@linaro.org>

clocksource: arch_timer: Add support for memory mapped timers

Add support for the memory mapped timers by filling in the
read/write functions and adding some parsing code. Note that we
only register one clocksource, preferring the cp15 based
clocksource over the mmio one.

To keep things simple we register one global clockevent. This
covers the case of UP and SMP systems with only mmio hardware and
systems where the memory mapped timers are used as the broadcast
timer in low power modes.

The DT binding allows for per-CPU memory mapped timers in case we
want to support that in the future, but the code isn't added
here. We also don't do much for hypervisor support, although it
should be possible to support it by searching for at least two
frames where one frame has the virtual capability and then
updating KVM timers to support it.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: Rob Herring <robherring2@gmail.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
(cherry picked from commit 220069945b298d3998c6598b081c466dca259929)
Signed-off-by: Mark Brown <broonie@linaro.org>
Conflicts:
drivers/clocksource/arm_arch_timer.c

clocksource: arch_timer: Push the read/write wrappers deeper

We're going to introduce support to read and write the memory
mapped timer registers in the next patch, so push the cp15
read/write functions one level deeper. This simplifies the next
patch and makes it clearer what's going on.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
(cherry picked from commit 60faddf6eb3aba16068032bdcf35e18ace4bfb21)
Signed-off-by: Mark Brown <broonie@linaro.org>

Documentation: Add memory mapped ARM architected timer binding

Add a binding for the arm architected timer hardware's memory
mapped interface. The mmio timer hardware is made up of one base
frame and a collection of up to 8 timer frames, where each of the
8 timer frames can have either one or two views. A frame
typically maps to a privilege level (user/kernel, hypervisor,
secure). The first view has full access to the registers within a
frame, while the second view can be restricted to particular
registers within a frame. Each frame must support a physical
timer. It's optional for a frame to support a virtual timer.

Cc: devicetree-discuss@lists.ozlabs.org
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Rob Herring <robherring2@gmail.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
(cherry picked from commit d53ef114cf40a043e3cc3fa70dbcdfb268a7e4dc)
Signed-off-by: Mark Brown <broonie@linaro.org>

clocksource: arch_timer: Pass clock event to set_mode callback

There isn't any reason why we don't pass the event here and we'll
need it in the near future for memory mapped arch timers anyway.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
(cherry picked from commit 1ff99ea65687d921cb71f330491ec4205c00eb9f)
Signed-off-by: Mark Brown <broonie@linaro.org>

clocksource: arch_timer: Make register accessors less error-prone

Using an enum for the register we wish to access allows newer
compilers to determine if we've forgotten a case in our switch
statement. This allows us to remove the BUILD_BUG() instances in
the arm64 port, avoiding problems where optimizations may not
happen.

To try and force better code generation we're currently marking
the accessor functions as inline, but newer compilers can ignore
the inline keyword unless it's marked __always_inline. Luckily on
arm and arm64 inline is __always_inline, but let's make
everything __always_inline to be explicit.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
(cherry picked from commit e09f3cc0184d6b5c3816f921b7ffb67623e5e834)
Signed-off-by: Mark Brown <broonie@linaro.org>

mailbox: rename pl320-ipc specific mailbox.h

The patch 30058677 "ARM / highbank: add support for pl320 IPC"
added a pl320 IPC specific header file as a generic mailbox.h.
This file has been renamed appropriately to allow the
introduction of the generic mailbox API framework.

Acked-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Suman Anna <s-anna@ti.com>
Signed-off-by: Mark Brown <broonie@linaro.org>

netfilter: fix seq_printf type mismatch warning

The return type of atomic64_read() varies depending on arch. The
arm64 version is being changed from long long to long in the mainline
for v3.16, causing a seq_printf type mismatch (%llu) in
guid_ctrl_proc_show().

This commit fixes the type mismatch by casting atomic64_read() to u64.

Change-Id: Iae0a6bd4314f5686a9f4fecbe6203e94ec0870de
Signed-off-by: Sherman Yin <shermanyin@gmail.com>

Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-android

Merge remote-tracking branch 'lsk/v3.10/topic/mm' into linux-linaro-lsk

mm, hugetlb: return a reserved page to a reserved pool if failed

If we fail with a reserved page, just calling put_page() is not
sufficient, because put_page() invoke free_huge_page() at last step and it
doesn't know whether a page comes from a reserved pool or not.  So it
doesn't do anything related to reserved count.  This makes reserve count
lower than how we need, because reserve count already decrease in
dequeue_huge_page_vma().  This patch fix this situation.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 07443a85ad90c7b62fbe11dcd3d6a1de1e10516f)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: grab a page_table_lock after page_cache_release

We don't need to grab a page_table_lock when we try to release a page.
So, defer to grab a page_table_lock.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8312034f3604bc0339c40545c538116f4ddad152)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: remove useless check about mapping type

is_vma_resv_set(vma, HPAGE_RESV_OWNER) implys that this mapping is for
private. So we don't need to check whether this mapping is for shared or
not.

This patch is just for clean-up.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5944d0116c773319a48ea6812d1891aa6d0bbbbf)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: fix subpool accounting handling

If we alloc hugepage with avoid_reserve, we don't dequeue reserved one.
So, we should check subpool counter when avoid_reserve. This patch
implement it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8bb3f12e7d4f7b043a7c5aa3831e72041e80dc4a)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: change variable name reservations to resv

'reservations' is so long name as a variable and we use 'resv_map' to
represent 'struct resv_map' in other place. To reduce confusion and
unreadability, change it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f522c3ac00a49128115f99a5fcb95a447601c1c3)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: protect reserved pages when soft offlining a hugepage

Don't use the reserve pool when soft offlining a hugepage. Check we have
free pages outside the reserve pool before we dequeue the huge page.
Otherwise, we can steal other's reserve page.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4ef91848043679b272a1a5b8e2879acf696ba9e2)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache

If a vma with VM_NORESERVE allocate a new page for page cache, we should
check whether this area is reserved or not.  If this address is already
reserved by other process(in case of chg == 0), we should decrement
reserve count, because this allocated page will go into page cache and
currently, there is no way to know that this page comes from reserved pool
or not when releasing inode.  This may introduce over-counting problem to
reserved count.  With following example code, you can easily reproduce
this situation.

Assume 2MB, nr_hugepages = 100

        size = 20 * MB;
        flag = MAP_SHARED;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (p == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                return -1;
        }

        flag = MAP_SHARED | MAP_NORESERVE;
        q = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (q == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
        }
        q[0] = 'c';

After finish the program, run 'cat /proc/meminfo'.  You can see below
result.

HugePages_Free:      100
HugePages_Rsvd:        1

To fix this, we should check our mapping type and tracked region.  If our
mapping is VM_NORESERVE, VM_MAYSHARE and chg is 0, this imply that current
allocated page will go into page cache which is already reserved region
when mapping is created.  In this case, we should decrease reserve count.
As implementing above, this patch solve the problem.

[akpm@linux-foundation.org: fix spelling in comment]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit af0ed73e699bb0453603b1d1a4727377641b2096)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: remove decrement_hugepage_resv_vma()

Now, Checking condition of decrement_hugepage_resv_vma() and
vma_has_reserves() is same, so we can clean-up this function with
vma_has_reserves(). Additionally, decrement_hugepage_resv_vma() has only
one call site, so we can remove function and embed it into
dequeue_huge_page_vma() directly. This patch implement it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a63884e921cb33a6beb260fa88bcbf1712d98a9a)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()

If we map the region with MAP_NORESERVE and MAP_SHARED, we can skip to
check reserve counting and eventually we cannot be ensured to allocate a
huge page in fault time.  With following example code, you can easily find
this situation.

Assume 2MB, nr_hugepages = 100

        fd = hugetlbfs_unlinked_fd();
        if (fd < 0)
                return 1;

        size = 200 * MB;
        flag = MAP_SHARED;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (p == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                return -1;
        }

        size = 2 * MB;
        flag = MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB | MAP_NORESERVE;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, -1, 0);
        if (p == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
        }
        p[0] = '0';
        sleep(10);

During executing sleep(10), run 'cat /proc/meminfo' on another process.

HugePages_Free:       99
HugePages_Rsvd:      100

Number of free should be higher or equal than number of reserve, but this
aren't.  This represent that non reserved shared mapping steal a reserved
page.  Non reserved shared mapping should not eat into reserve space.

If we consider VM_NORESERVE in vma_has_reserve() and return 0 which mean
that we don't have reserved pages, then we check that we have enough free
pages in dequeue_huge_page_vma().  This prevent to steal a reserved page.

With this change, above test generate a SIGBUG which is correct, because
all free pages are reserved and non reserved shared mapping can't get a
free page.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 72231b03ccf126ca04fba8359998ef7dfd195577)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: do not use a page in page cache for cow optimization

Currently, we use a page with mapped count 1 in page cache for cow
optimization.  If we find this condition, we don't allocate a new page and
copy contents.  Instead, we map this page directly.  This may introduce a
problem that writting to private mapping overwrite hugetlb file directly.
You can find this situation with following code.

        size = 20 * MB;
        flag = MAP_SHARED;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (p == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                return -1;
        }
        p[0] = 's';
        fprintf(stdout, "BEFORE STEAL PRIVATE WRITE: %c\n", p[0]);
        munmap(p, size);

        flag = MAP_PRIVATE;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (p == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
        }
        p[0] = 'c';
        munmap(p, size);

        flag = MAP_SHARED;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (p == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                return -1;
        }
        fprintf(stdout, "AFTER STEAL PRIVATE WRITE: %c\n", p[0]);
        munmap(p, size);

We can see that "AFTER STEAL PRIVATE WRITE: c", not "AFTER STEAL PRIVATE
WRITE: s".  If we turn off this optimization to a page in page cache, the
problem is disappeared.

So, I change the trigger condition of optimization.  If this page is not
AnonPage, we don't do optimization.  This makes this optimization turning
off for a page cache.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 37a2140dc2145a6f154172286944a1861e978dfd)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: remove redundant list_empty check in gather_surplus_pages()

If list is empty, list_for_each_entry_safe() doesn't do anything. So,
this check is redundant. Remove it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c0d934ba278935fa751057091fe4a7c02d814f68)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: fix and clean-up node iteration code to alloc or free

Current node iteration code have a minor problem which do one more node
rotation if we can't succeed to allocate.  For example, if we start to
allocate at node 0, we stop to iterate at node 0.  Then we start to
allocate at node 1 for next allocation.

I introduce new macros "for_each_node_mask_to_[alloc|free]" and fix and
clean-up node iteration code to alloc or free.  This makes code more
understandable.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b2261026825ed34066b24069359d118098bb1876)
Signed-off-by: Mark Brown <broonie@linaro.org>

mm, hugetlb: clean-up alloc_huge_page()

Unify successful allocation paths to make the code more readable. There
are no functional changes.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 81a6fcae3ff3f6af1c9d7e31499e68fda2b3f58d)
Signed-off-by: Mark Brown <broonie@linaro.org>