Insu Yun [Wed, 17 Feb 2016 16:47:35 +0000 (11:47 -0500)]
tipc: unlock in error path
[ Upstream commit
b53ce3e7d407aa4196877a48b8601181162ab158 ]
tipc_bcast_unlock need to be unlocked in error path.
Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Anton Protopopov [Wed, 17 Feb 2016 02:43:16 +0000 (21:43 -0500)]
rtnl: RTM_GETNETCONF: fix wrong return value
[ Upstream commit
a97eb33ff225f34a8124774b3373fd244f0e83ce ]
An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr that can mislead userspace.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Phil Sutter [Wed, 17 Feb 2016 14:37:43 +0000 (15:37 +0100)]
IFF_NO_QUEUE: Fix for drivers not calling ether_setup()
[ Upstream commit
a813104d923339144078939175faf4e66aca19b4 ]
My implementation around IFF_NO_QUEUE driver flag assumed that leaving
tx_queue_len untouched (specifically: not setting it to zero) by drivers
would make it possible to assign a regular qdisc to them without having
to worry about setting tx_queue_len to a useful value. This was only
partially true: I overlooked that some drivers don't call ether_setup()
and therefore not initialize tx_queue_len to the default value of 1000.
Consequently, removing the workarounds in place for that case in qdisc
implementations which cared about it (namely, pfifo, bfifo, gred, htb,
plug and sfb) leads to problems with these specific interface types and
qdiscs.
Luckily, there's already a sanitization point for drivers setting
tx_queue_len to zero, which can be reused to assign the fallback value
most qdisc implementations used, which is 1.
Fixes: 348e3435cbefa ("net: sched: drop all special handling of tx_queue_len == 0")
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Thu, 18 Feb 2016 13:39:18 +0000 (05:39 -0800)]
tcp/dccp: fix another race at listener dismantle
[ Upstream commit
7716682cc58e305e22207d5bb315f26af6b1e243 ]
Ilya reported following lockdep splat:
kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel:
4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-
ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<
ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0: (rcu_read_lock){......}, at: [<
ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1: (rcu_read_lock){......}, at: [<
ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2: (slock-AF_INET){+.-...}, at: [<
ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<
ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child as been queued
into accept queue.
We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Xin Long [Thu, 18 Feb 2016 13:21:19 +0000 (21:21 +0800)]
route: check and remove route cache when we get route
[ Upstream commit
deed49df7390d5239024199e249190328f1651e7 ]
Since the gc of ipv4 route was removed, the route cached would has
no chance to be removed, and even it has been timeout, it still could
be used, cause no code to check it's expires.
Fix this issue by checking and removing route cache when we get route.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jamal Hadi Salim [Thu, 18 Feb 2016 12:38:04 +0000 (07:38 -0500)]
net_sched fix: reclassification needs to consider ether protocol changes
[ Upstream commit
619fe32640b4b01f370574d50344ae0f62689816 ]
actions could change the etherproto in particular with ethernet
tunnelled data. Typically such actions, after peeling the outer header,
will ask for the packet to be reclassified. We then need to restart
the classification with the new proto header.
Example setup used to catch this:
sudo tc qdisc add dev $ETH ingress
sudo $TC filter add dev $ETH parent ffff: pref 1 protocol 802.1Q \
u32 match u32 0 0 flowid 1:1 \
action vlan pop reclassify
Fixes: 3b3ae880266d ("net: sched: consolidate tc_classify{,_compat}")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Guillaume Nault [Mon, 15 Feb 2016 16:01:10 +0000 (17:01 +0100)]
pppoe: fix reference counting in PPPoE proxy
[ Upstream commit
29e73269aa4d36f92b35610c25f8b01c789b0dc8 ]
Drop reference on the relay_po socket when __pppoe_xmit() succeeds.
This is already handled correctly in the error path.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mark Tomlinson [Mon, 15 Feb 2016 03:24:44 +0000 (16:24 +1300)]
l2tp: Fix error creating L2TP tunnels
[ Upstream commit
853effc55b0f975abd6d318cca486a9c1b67e10f ]
A previous commit (
33f72e6) added notification via netlink for tunnels
when created/modified/deleted. If the notification returned an error,
this error was returned from the tunnel function. If there were no
listeners, the error code ESRCH was returned, even though having no
listeners is not an error. Other calls to this and other similar
notification functions either ignore the error code, or filter ESRCH.
This patch checks for ESRCH and does not flag this as an error.
Reviewed-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eugenia Emantayev [Wed, 17 Feb 2016 15:24:27 +0000 (17:24 +0200)]
net/mlx4_en: Avoid changing dev->features directly in run-time
[ Upstream commit
925ab1aa9394bbaeac47ee5b65d3fdf0fb8135cf ]
It's forbidden to manually change dev->features in run-time. Currently, this is
done in the driver to make sure that GSO_UDP_TUNNEL is advertized only when
VXLAN tunnel is set. However, since the stack actually does features intersection
with hw_enc_features, we can safely revert to advertizing features early when
registering the netdevice.
Fixes: f4a1edd56120 ('net/mlx4_en: Advertize encapsulation offloads [...]')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eugenia Emantayev [Wed, 17 Feb 2016 15:24:23 +0000 (17:24 +0200)]
net/mlx4_en: Choose time-stamping shift value according to HW frequency
[ Upstream commit
31c128b66e5b28f468076e4f3ca3025c35342041 ]
Previously, the shift value used for time-stamping was constant and didn't
depend on the HW chip frequency. Change that to take the frequency into account
and calculate the maximal value in cycles per wraparound of ten seconds. This
time slot was chosen since it gives a good accuracy in time synchronization.
Algorithm for shift value calculation:
* Round up the maximal value in cycles to nearest power of two
* Calculate maximal multiplier by division of all 64 bits set
to above result
* Then, invert the function clocksource_khz2mult() to get the shift from
maximal mult value
Fixes: ec693d47010e ('net/mlx4_en: Add HW timestamping (TS) support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Amir Vadai [Wed, 17 Feb 2016 15:24:22 +0000 (17:24 +0200)]
net/mlx4_en: Count HW buffer overrun only once
[ Upstream commit
281e8b2fdf8e4ef366b899453cae50e09b577ada ]
RdropOvflw counts overrun of HW buffer, therefore should
be used for rx_fifo_errors only.
Currently RdropOvflw counter is mistakenly also set into
rx_missed_errors and rx_over_errors too, which makes the
device total dropped packets accounting to show wrong results.
Fix that. Use it for rx_fifo_errors only.
Fixes: c27a02cd94d6 ('mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC')
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bjørn Mork [Fri, 12 Feb 2016 15:42:14 +0000 (16:42 +0100)]
qmi_wwan: add "4G LTE usb-modem U901"
[ Upstream commit
aac8d3c282e024c344c5b86dc1eab7af88bb9716 ]
Thomas reports:
T: Bus=01 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#= 4 Spd=480 MxCh= 0
D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1
P: Vendor=05c6 ProdID=6001 Rev=00.00
S: Manufacturer=USB Modem
S: Product=USB Modem
S: SerialNumber=
1234567890ABCDEF
C: #Ifs= 5 Cfg#= 1 Atr=e0 MxPwr=500mA
I: If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
I: If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
I: If#= 2 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
I: If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan
I: If#= 4 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=usb-storage
Reported-by: Thomas Schäfer <tschaefer@t-online.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Fri, 12 Feb 2016 06:50:29 +0000 (22:50 -0800)]
tcp: md5: release request socket instead of listener
[ Upstream commit
729235554d805c63e5e274fcc6a98e71015dd847 ]
If tcp_v4_inbound_md5_hash() returns an error, we must release
the refcount on the request socket, not on the listener.
The bug was added for IPv4 only.
Fixes: 079096f103fac ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jon Paul Maloy [Wed, 10 Feb 2016 21:14:57 +0000 (16:14 -0500)]
tipc: fix premature addition of node to lookup table
[ Upstream commit
d5c91fb72f1652ea3026925240a0998a42ddb16b ]
In commit
5266698661401a ("tipc: let broadcast packet reception
use new link receive function") we introduced a new per-node
broadcast reception link instance. This link is created at the
moment the node itself is created. Unfortunately, the allocation
is done after the node instance has already been added to the node
lookup hash table. This creates a potential race condition, where
arriving broadcast packets are able to find and access the node
before it has been fully initialized, and before the above mentioned
link has been created. The result is occasional crashes in the function
tipc_bcast_rcv(), which is trying to access the not-yet existing link.
We fix this by deferring the addition of the node instance until after
it has been fully initialized in the function tipc_node_create().
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Rainer Weikusat [Thu, 11 Feb 2016 19:37:27 +0000 (19:37 +0000)]
af_unix: Guard against other == sk in unix_dgram_sendmsg
[ Upstream commit
a5527dda344fff0514b7989ef7a755729769daa1 ]
The unix_dgram_sendmsg routine use the following test
if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
to determine if sk and other are in an n:1 association (either
established via connect or by using sendto to send messages to an
unrelated socket identified by address). This isn't correct as the
specified address could have been bound to the sending socket itself or
because this socket could have been connected to itself by the time of
the unix_peer_get but disconnected before the unix_state_lock(other). In
both cases, the if-block would be entered despite other == sk which
might either block the sender unintentionally or lead to trying to unlock
the same spin lock twice for a non-blocking send. Add a other != sk
check to guard against this.
Fixes: 7d267278a9ec ("unix: avoid use-after-free in ep_remove_wait_queue")
Reported-By: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Tested-by: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Rainer Weikusat [Mon, 8 Feb 2016 18:47:19 +0000 (18:47 +0000)]
af_unix: Don't set err in unix_stream_read_generic unless there was an error
[ Upstream commit
1b92ee3d03af6643df395300ba7748f19ecdb0c5 ]
The present unix_stream_read_generic contains various code sequences of
the form
err = -EDISASTER;
if (<test>)
goto out;
This has the unfortunate side effect of possibly causing the error code
to bleed through to the final
out:
return copied ? : err;
and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at
http://pad.lv/
1540731
Change it such that err is only set if an error condition was detected.
Fixes: 3822b5c2fc62 ("af_unix: Revert 'lock_interruptible' in stream receive code")
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Thu, 4 Feb 2016 14:23:28 +0000 (06:23 -0800)]
ipv4: fix memory leaks in ip_cmsg_send() callers
[ Upstream commit
919483096bfe75dda338e98d56da91a263746a0a ]
Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.
Callers are responsible for the freeing.
Many thanks to Dmitry for the report and diagnostic.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jay Vosburgh [Tue, 2 Feb 2016 21:35:56 +0000 (13:35 -0800)]
bonding: Fix ARP monitor validation
[ Upstream commit
21a75f0915dde8674708b39abfcda113911c49b1 ]
The current logic in bond_arp_rcv will accept an incoming ARP for
validation if (a) the receiving slave is either "active" (which includes
the currently active slave, or the current ARP slave) or, (b) there is a
currently active slave, and it has received an ARP since it became active.
For case (b), the receiving slave isn't the currently active slave, and is
receiving the original broadcast ARP request, not an ARP reply from the
target.
This logic can fail if there is no currently active slave. In
this situation, the ARP probe logic cycles through all slaves, assigning
each in turn as the "current_arp_slave" for one arp_interval, then setting
that one as "active," and sending an ARP probe from that slave. The
current logic expects the ARP reply to arrive on the sending
current_arp_slave, however, due to switch FDB updating delays, the reply
may be directed to another slave.
This can arise if the bonding slaves and switch are working, but
the ARP target is not responding. When the ARP target recovers, a
condition may result wherein the ARP target host replies faster than the
switch can update its forwarding table, causing each ARP reply to be sent
to the previous current_arp_slave. This will never pass the logic in
bond_arp_rcv, as neither of the above conditions (a) or (b) are met.
Some experimentation on a LAN shows ARP reply round trips in the
200 usec range, but my available switches never update their FDB in less
than 4000 usec.
This patch changes the logic in bond_arp_rcv to additionally
accept an ARP reply for validation on any slave if there is a current ARP
slave and it sent an ARP probe during the previous arp_interval.
Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave really works")
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Daniel Borkmann [Wed, 10 Feb 2016 15:47:11 +0000 (16:47 +0100)]
bpf: fix branch offset adjustment on backjumps after patching ctx expansion
[ Upstream commit
a1b14d27ed0965838350f1377ff97c93ee383492 ]
When ctx access is used, the kernel often needs to expand/rewrite
instructions, so after that patching, branch offsets have to be
adjusted for both forward and backward jumps in the new eBPF program,
but for backward jumps it fails to account the delta. Meaning, for
example, if the expansion happens exactly on the insn that sits at
the jump target, it doesn't fix up the back jump offset.
Analysis on what the check in adjust_branches() is currently doing:
/* adjust offset of jmps if necessary */
if (i < pos && i + insn->off + 1 > pos)
insn->off += delta;
else if (i > pos && i + insn->off + 1 < pos)
insn->off -= delta;
First condition (forward jumps):
Before: After:
insns[0] insns[0]
insns[1] <--- i/insn insns[1] <--- i/insn
insns[2] <--- pos insns[P] <--- pos
insns[3] insns[P] `------| delta
insns[4] <--- target_X insns[P] `-----|
insns[5] insns[3]
insns[4] <--- target_X
insns[5]
First case is if we cross pos-boundary and the jump instruction was
before pos. This is handeled correctly. I.e. if i == pos, then this
would mean our jump that we currently check was the patchlet itself
that we just injected. Since such patchlets are self-contained and
have no awareness of any insns before or after the patched one, the
delta is correctly not adjusted. Also, for the second condition in
case of i + insn->off + 1 == pos, means we jump to that newly patched
instruction, so no offset adjustment are needed. That part is correct.
Second condition (backward jumps):
Before: After:
insns[0] insns[0]
insns[1] <--- target_X insns[1] <--- target_X
insns[2] <--- pos <-- target_Y insns[P] <--- pos <-- target_Y
insns[3] insns[P] `------| delta
insns[4] <--- i/insn insns[P] `-----|
insns[5] insns[3]
insns[4] <--- i/insn
insns[5]
Second interesting case is where we cross pos-boundary and the jump
instruction was after pos. Backward jump with i == pos would be
impossible and pose a bug somewhere in the patchlet, so the first
condition checking i > pos is okay only by itself. However, i +
insn->off + 1 < pos does not always work as intended to trigger the
adjustment. It works when jump targets would be far off where the
delta wouldn't matter. But, for example, where the fixed insn->off
before pointed to pos (target_Y), it now points to pos + delta, so
that additional room needs to be taken into account for the check.
This means that i) both tests here need to be adjusted into pos + delta,
and ii) for the second condition, the test needs to be <= as pos
itself can be a target in the backjump, too.
Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alexander Duyck [Tue, 9 Feb 2016 10:49:54 +0000 (02:49 -0800)]
flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen
[ Upstream commit
461547f3158978c180d74484d58e82be9b8e7357 ]
This patch fixes an issue with unaligned accesses when using
eth_get_headlen on a page that was DMA aligned instead of being IP aligned.
The fact is when trying to check the length we don't need to be looking at
the flow label so we can reorder the checks to first check if we are
supposed to gather the flow label and then make the call to actually get
it.
v2: Updated path so that either STOP_AT_FLOW_LABEL or KEY_FLOW_LABEL can
cause us to check for the flow label.
Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alexander Duyck [Tue, 9 Feb 2016 14:14:43 +0000 (06:14 -0800)]
net: Copy inner L3 and L4 headers as unaligned on GRE TEB
[ Upstream commit
78565208d73ca9b654fb9a6b142214d52eeedfd1 ]
This patch corrects the unaligned accesses seen on GRE TEB tunnels when
generating hash keys. Specifically what this patch does is make it so that
we force the use of skb_copy_bits when the GRE inner headers will be
unaligned due to NET_IP_ALIGNED being a non-zero value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Xin Long [Wed, 3 Feb 2016 15:33:30 +0000 (23:33 +0800)]
sctp: translate network order to host order when users get a hmacid
[ Upstream commit
7a84bd46647ff181eb2659fdc99590e6f16e501d ]
Commit
ed5a377d87dc ("sctp: translate host order to network order when
setting a hmacid") corrected the hmacid byte-order when setting a hmacid.
but the same issue also exists on getting a hmacid.
We fix it by changing hmacids to host order when users get them with
getsockopt.
Fixes: Commit ed5a377d87dc ("sctp: translate host order to network order when setting a hmacid")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sandeep Pillai [Wed, 3 Feb 2016 09:10:44 +0000 (14:40 +0530)]
enic: increment devcmd2 result ring in case of timeout
[ Upstream commit
ca7f41a4957b872577807169bd7464b36aae9b9c ]
Firmware posts the devcmd result in result ring. In case of timeout, driver
does not increment the current result pointer and firmware could post the
result after timeout has occurred. During next devcmd, driver would be
reading the result of previous devcmd.
Fix this by incrementing result even in case of timeout.
Fixes: 373fb0873d43 ("enic: add devcmd2")
Signed-off-by: Sandeep Pillai <sanpilla@cisco.com>
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Siva Reddy Kallam [Wed, 3 Feb 2016 08:39:38 +0000 (14:09 +0530)]
tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs
[ Upstream commit
b7d987295c74500b733a0ba07f9a9bcc4074fa83 ]
tg3_tso_bug() can hit a condition where the entire tx ring is not big
enough to segment the GSO packet. For example, if MSS is very small,
gso_segs can exceed the tx ring size. When we hit the condition, it
will cause tx timeout.
tg3_tso_bug() is called to handle TSO and DMA hardware bugs.
For TSO bugs, if tg3_tso_bug() cannot succeed, we have to drop the packet.
For DMA bugs, we can still fall back to linearize the SKB and let the
hardware transmit the TSO packet.
This patch adds a function tg3_tso_bug_gso_check() to check if there
are enough tx descriptors for GSO before calling tg3_tso_bug().
The caller will then handle the error appropriately - drop or
lineraize the SKB.
v2: Corrected patch description to avoid confusion.
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Acked-by: Prashant Sreedharan <prashant@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hans Westgaard Ry [Wed, 3 Feb 2016 08:26:57 +0000 (09:26 +0100)]
net:Add sysctl_max_skb_frags
[ Upstream commit
5f74f82ea34c0da80ea0b49192bb5ea06e063593 ]
Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Wed, 3 Feb 2016 03:31:12 +0000 (19:31 -0800)]
tcp: do not drop syn_recv on all icmp reports
[ Upstream commit
9cf7490360bf2c46a16b7525f899e4970c5fc144 ]
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.
This is of course incorrect.
A specific list of ICMP messages should be able to drop a SYN_RECV.
For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hannes Frederic Sowa [Wed, 3 Feb 2016 01:11:03 +0000 (02:11 +0100)]
unix: correctly track in-flight fds in sending process user_struct
[ Upstream commit
415e3d3e90ce9e18727e8843ae343eda5a58fad6 ]
The commit referenced in the Fixes tag incorrectly accounted the number
of in-flight fds over a unix domain socket to the original opener
of the file-descriptor. This allows another process to arbitrary
deplete the original file-openers resource limit for the maximum of
open files. Instead the sending processes and its struct cred should
be credited.
To do so, we add a reference counted struct user_struct pointer to the
scm_fp_list and use it to account for the number of inflight unix fds.
Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets")
Reported-by: David Herrmann <dh.herrmann@gmail.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Wed, 3 Feb 2016 01:55:01 +0000 (17:55 -0800)]
ipv6: fix a lockdep splat
[ Upstream commit
44c3d0c1c0a880354e9de5d94175742e2c7c9683 ]
Silence lockdep false positive about rcu_dereference() being
used in the wrong context.
First one should use rcu_dereference_protected() as we own the spinlock.
Second one should be a normal assignation, as no barrier is needed.
Fixes: 18367681a10bd ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
subashab@codeaurora.org [Tue, 2 Feb 2016 02:11:10 +0000 (02:11 +0000)]
ipv6: addrconf: Fix recursive spin lock call
[ Upstream commit
16186a82de1fdd868255448274e64ae2616e2640 ]
A rcu stall with the following backtrace was seen on a system with
forwarding, optimistic_dad and use_optimistic set. To reproduce,
set these flags and allow ipv6 autoconf.
This occurs because the device write_lock is acquired while already
holding the read_lock. Back trace below -
INFO: rcu_preempt self-detected stall on CPU { 1} (t=2100 jiffies
g=3992 c=3991 q=4471)
<6> Task dump for CPU 1:
<2> kworker/1:0 R running task 12168 15 2 0x00000002
<2> Workqueue: ipv6_addrconf addrconf_dad_work
<6> Call trace:
<2> [<
ffffffc000084da8>] el1_irq+0x68/0xdc
<2> [<
ffffffc000cc4e0c>] _raw_write_lock_bh+0x20/0x30
<2> [<
ffffffc000bc5dd8>] __ipv6_dev_ac_inc+0x64/0x1b4
<2> [<
ffffffc000bcbd2c>] addrconf_join_anycast+0x9c/0xc4
<2> [<
ffffffc000bcf9f0>] __ipv6_ifa_notify+0x160/0x29c
<2> [<
ffffffc000bcfb7c>] ipv6_ifa_notify+0x50/0x70
<2> [<
ffffffc000bd035c>] addrconf_dad_work+0x314/0x334
<2> [<
ffffffc0000b64c8>] process_one_work+0x244/0x3fc
<2> [<
ffffffc0000b7324>] worker_thread+0x2f8/0x418
<2> [<
ffffffc0000bb40c>] kthread+0xe0/0xec
v2: do addrconf_dad_kick inside read lock and then acquire write
lock for ipv6_ifa_notify as suggested by Eric
Fixes: 7fd2561e4ebdd ("net: ipv6: Add a sysctl to make optimistic
addresses useful candidates")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Erik Kline <ek@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Paolo Abeni [Fri, 29 Jan 2016 11:30:20 +0000 (12:30 +0100)]
ipv6/udp: use sticky pktinfo egress ifindex on connect()
[ Upstream commit
1cdda91871470f15e79375991bd2eddc6e86ddb1 ]
Currently, the egress interface index specified via IPV6_PKTINFO
is ignored by __ip6_datagram_connect(), so that RFC 3542 section 6.7
can be subverted when the user space application calls connect()
before sendmsg().
Fix it by initializing properly flowi6_oif in connect() before
performing the route lookup.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Paolo Abeni [Fri, 29 Jan 2016 11:30:19 +0000 (12:30 +0100)]
ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
[ Upstream commit
6f21c96a78b835259546d8f3fb4edff0f651d478 ]
The current implementation of ip6_dst_lookup_tail basically
ignore the egress ifindex match: if the saddr is set,
ip6_route_output() purposefully ignores flowi6_oif, due
to the commit
d46a9d678e4c ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE
flag if saddr set"), if the saddr is 'any' the first route lookup
in ip6_dst_lookup_tail fails, but upon failure a second lookup will
be performed with saddr set, thus ignoring the ifindex constraint.
This commit adds an output route lookup function variant, which
allows the caller to specify lookup flags, and modify
ip6_dst_lookup_tail() to enforce the ifindex match on the second
lookup via said helper.
ip6_route_output() becames now a static inline function build on
top of ip6_route_output_flags(); as a side effect, out-of-tree
modules need now a GPL license to access the output route lookup
functionality.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Wed, 27 Jan 2016 18:52:43 +0000 (10:52 -0800)]
tcp: beware of alignments in tcp_get_info()
[ Upstream commit
ff5d749772018602c47509bdc0093ff72acd82ec ]
With some combinations of user provided flags in netlink command,
it is possible to call tcp_get_info() with a buffer that is not 8-bytes
aligned.
It does matter on some arches, so we need to use put_unaligned() to
store the u64 fields.
Current iproute2 package does not trigger this particular issue.
Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: 977cb0ecf82e ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ido Schimmel [Wed, 27 Jan 2016 14:16:43 +0000 (15:16 +0100)]
switchdev: Require RTNL mutex to be held when sending FDB notifications
[ Upstream commit
4f2c6ae5c64c353fb1b0425e4747e5603feadba1 ]
When switchdev drivers process FDB notifications from the underlying
device they resolve the netdev to which the entry points to and notify
the bridge using the switchdev notifier.
However, since the RTNL mutex is not held there is nothing preventing
the netdev from disappearing in the middle, which will cause
br_switchdev_event() to dereference a non-existing netdev.
Make switchdev drivers hold the lock at the beginning of the
notification processing session and release it once it ends, after
notifying the bridge.
Also, remove switchdev_mutex and fdb_lock, as they are no longer needed
when RTNL mutex is held.
Fixes: 03bf0c281234 ("switchdev: introduce switchdev notifier")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Joe Stringer [Fri, 22 Jan 2016 23:49:12 +0000 (15:49 -0800)]
inet: frag: Always orphan skbs inside ip_defrag()
[ Upstream commit
8282f27449bf15548cb82c77b6e04ee0ab827bdc ]
Later parts of the stack (including fragmentation) expect that there is
never a socket attached to frag in a frag_list, however this invariant
was not enforced on all defrag paths. This could lead to the
BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the
end of this commit message.
While the call could be added to openvswitch to fix this particular
error, the head and tail of the frags list are already orphaned
indirectly inside ip_defrag(), so it seems like the remaining fragments
should all be orphaned in all circumstances.
kernel BUG at net/ipv4/ip_output.c:586!
[...]
Call Trace:
<IRQ>
[<
ffffffffa0205270>] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch]
[<
ffffffffa02167a7>] ovs_fragment+0xcc/0x214 [openvswitch]
[<
ffffffff81667830>] ? dst_discard_out+0x20/0x20
[<
ffffffff81667810>] ? dst_ifdown+0x80/0x80
[<
ffffffffa0212072>] ? find_bucket.isra.2+0x62/0x70 [openvswitch]
[<
ffffffff810e0ba5>] ? mod_timer_pending+0x65/0x210
[<
ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
[<
ffffffffa03205a2>] ? nf_conntrack_in+0x252/0x500 [nf_conntrack]
[<
ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
[<
ffffffffa02051a3>] do_output.isra.29+0xe3/0x1b0 [openvswitch]
[<
ffffffffa0206411>] do_execute_actions+0xe11/0x11f0 [openvswitch]
[<
ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
[<
ffffffffa0206822>] ovs_execute_actions+0x32/0xd0 [openvswitch]
[<
ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch]
[<
ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
[<
ffffffffa02068a2>] ovs_execute_actions+0xb2/0xd0 [openvswitch]
[<
ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch]
[<
ffffffffa0215019>] ? ovs_ct_get_labels+0x49/0x80 [openvswitch]
[<
ffffffffa0213a1d>] ovs_vport_receive+0x5d/0xa0 [openvswitch]
[<
ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
[<
ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
[<
ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
[<
ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch]
[<
ffffffffa02148fc>] internal_dev_xmit+0x6c/0x140 [openvswitch]
[<
ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch]
[<
ffffffff81660299>] dev_hard_start_xmit+0x2b9/0x5e0
[<
ffffffff8165fc21>] ? netif_skb_features+0xd1/0x1f0
[<
ffffffff81660f20>] __dev_queue_xmit+0x800/0x930
[<
ffffffff81660770>] ? __dev_queue_xmit+0x50/0x930
[<
ffffffff810b53f1>] ? mark_held_locks+0x71/0x90
[<
ffffffff81669876>] ? neigh_resolve_output+0x106/0x220
[<
ffffffff81661060>] dev_queue_xmit+0x10/0x20
[<
ffffffff816698e8>] neigh_resolve_output+0x178/0x220
[<
ffffffff816a8e6f>] ? ip_finish_output2+0x1ff/0x590
[<
ffffffff816a8e6f>] ip_finish_output2+0x1ff/0x590
[<
ffffffff816a8cee>] ? ip_finish_output2+0x7e/0x590
[<
ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0
[<
ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0
[<
ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80
[<
ffffffff816a9c9c>] ip_finish_output+0x17c/0x340
[<
ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190
[<
ffffffff816ab4c0>] ip_output+0x70/0x110
[<
ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80
[<
ffffffff816aa9f9>] ip_local_out+0x39/0x70
[<
ffffffff816abf89>] ip_send_skb+0x19/0x40
[<
ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40
[<
ffffffff816df21a>] icmp_push_reply+0xea/0x120
[<
ffffffff816df93d>] icmp_reply.constprop.23+0x1ed/0x230
[<
ffffffff816df9ce>] icmp_echo.part.21+0x4e/0x50
[<
ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
[<
ffffffff810d5f9e>] ? rcu_read_lock_held+0x5e/0x70
[<
ffffffff816dfa06>] icmp_echo+0x36/0x70
[<
ffffffff816e0d11>] icmp_rcv+0x271/0x450
[<
ffffffff816a4ca7>] ip_local_deliver_finish+0x127/0x3a0
[<
ffffffff816a4bc1>] ? ip_local_deliver_finish+0x41/0x3a0
[<
ffffffff816a5160>] ip_local_deliver+0x60/0xd0
[<
ffffffff816a4b80>] ? ip_rcv_finish+0x560/0x560
[<
ffffffff816a46fd>] ip_rcv_finish+0xdd/0x560
[<
ffffffff816a5453>] ip_rcv+0x283/0x3e0
[<
ffffffff810b6302>] ? match_held_lock+0x192/0x200
[<
ffffffff816a4620>] ? inet_del_offload+0x40/0x40
[<
ffffffff8165d062>] __netif_receive_skb_core+0x392/0xae0
[<
ffffffff8165e68e>] ? process_backlog+0x8e/0x230
[<
ffffffff810b53f1>] ? mark_held_locks+0x71/0x90
[<
ffffffff8165d7c8>] __netif_receive_skb+0x18/0x60
[<
ffffffff8165e678>] process_backlog+0x78/0x230
[<
ffffffff8165e6dd>] ? process_backlog+0xdd/0x230
[<
ffffffff8165e355>] net_rx_action+0x155/0x400
[<
ffffffff8106b48c>] __do_softirq+0xcc/0x420
[<
ffffffff816a8e87>] ? ip_finish_output2+0x217/0x590
[<
ffffffff8178e78c>] do_softirq_own_stack+0x1c/0x30
<EOI>
[<
ffffffff8106b88e>] do_softirq+0x4e/0x60
[<
ffffffff8106b948>] __local_bh_enable_ip+0xa8/0xb0
[<
ffffffff816a8eb0>] ip_finish_output2+0x240/0x590
[<
ffffffff816a9a31>] ? ip_do_fragment+0x831/0x8a0
[<
ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0
[<
ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0
[<
ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80
[<
ffffffff816a9c9c>] ip_finish_output+0x17c/0x340
[<
ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190
[<
ffffffff816ab4c0>] ip_output+0x70/0x110
[<
ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80
[<
ffffffff816aa9f9>] ip_local_out+0x39/0x70
[<
ffffffff816abf89>] ip_send_skb+0x19/0x40
[<
ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40
[<
ffffffff816d55d3>] raw_sendmsg+0x7d3/0xc30
[<
ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
[<
ffffffff816e7557>] ? inet_sendmsg+0xc7/0x1d0
[<
ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
[<
ffffffff816e759a>] inet_sendmsg+0x10a/0x1d0
[<
ffffffff816e7495>] ? inet_sendmsg+0x5/0x1d0
[<
ffffffff8163e398>] sock_sendmsg+0x38/0x50
[<
ffffffff8163ec5f>] ___sys_sendmsg+0x25f/0x270
[<
ffffffff811aadad>] ? handle_mm_fault+0x8dd/0x1320
[<
ffffffff8178c147>] ? _raw_spin_unlock+0x27/0x40
[<
ffffffff810529b2>] ? __do_page_fault+0x1e2/0x460
[<
ffffffff81204886>] ? __fget_light+0x66/0x90
[<
ffffffff8163f8e2>] __sys_sendmsg+0x42/0x80
[<
ffffffff8163f932>] SyS_sendmsg+0x12/0x20
[<
ffffffff8178cb17>] entry_SYSCALL_64_fastpath+0x12/0x6f
Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f
66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
RIP [<
ffffffff816a9a92>] ip_do_fragment+0x892/0x8a0
RSP <
ffff88006d603170>
Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Parthasarathy Bhuvaragan [Wed, 27 Jan 2016 10:35:59 +0000 (11:35 +0100)]
tipc: fix connection abort during subscription cancel
[ Upstream commit
4d5cfcba2f6ec494d8810b9e3c0a7b06255c8067 ]
In 'commit
7fe8097cef5f ("tipc: fix nullpointer bug when subscribing
to events")', we terminate the connection if the subscription
creation fails.
In the same commit, the subscription creation result was based on
the value of the subscription pointer (set in the function) instead
of the return code.
Unfortunately, the same function tipc_subscrp_create() handles
subscription cancel request. For a subscription cancellation request,
the subscription pointer cannot be set. Thus if a subscriber has
several subscriptions and cancels any of them, the connection is
terminated.
In this commit, we terminate the connection based on the return value
of tipc_subscrp_create().
Fixes: commit 7fe8097cef5f ("tipc: fix nullpointer bug when subscribing to events")
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Russell King [Sun, 24 Jan 2016 09:22:05 +0000 (09:22 +0000)]
net: dsa: fix mv88e6xxx switches
[ Upstream commit
db0e51afa481088e6396f11e02018d64113a6578 ]
Since commit
76e398a62712 ("net: dsa: use switchdev obj for VLAN add/del
ops"), the Marvell 88E6xxx switch has been unable to pass traffic
between ports - any received traffic is discarded by the switch.
Taking a port out of bridge mode and configuring a vlan on it also the
port to start passing traffic.
With the debugfs files re-instated to allow debug of this issue by
comparing the register settings between the working and non-working
case, the reason becomes clear:
GLOBAL GLOBAL2 SERDES 0 1 2 3 4 5 6
- 7: 1111 707f 2001 2 2 2 2 2 0 2
+ 7: 1111 707f 2001 1 1 1 1 1 0 1
Register 7 for the ports is the default vlan tag register, and in the
non-working setup, it has been set to 2, despite vlan 2 not being
configured. This causes the switch to drop all packets coming in to
these ports. The working setup has the default vlan tag register set
to 1, which is the default vlan when none is configured.
Inspection of the code reveals why. The code prior to this commit
was:
- for (vid = vlan->vid_begin; vid <= vlan->vid_end; ++vid) {
...
- if (!err && vlan->flags & BRIDGE_VLAN_INFO_PVID)
- err = ds->drv->port_pvid_set(ds, p->port, vid);
but the new code is:
+ for (vid = vlan->vid_begin; vid <= vlan->vid_end; ++vid) {
...
+ }
...
+ if (pvid)
+ err = _mv88e6xxx_port_pvid_set(ds, port, vid);
This causes the new code to always set the default vlan to one higher
than the old code.
Fix this.
Fixes: 76e398a62712 ("net: dsa: use switchdev obj for VLAN add/del ops")
Cc: <stable@vger.kernel.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Marcelo Ricardo Leitner [Fri, 22 Jan 2016 20:29:49 +0000 (18:29 -0200)]
sctp: allow setting SCTP_SACK_IMMEDIATELY by the application
[ Upstream commit
27f7ed2b11d42ab6d796e96533c2076ec220affc ]
This patch extends commit
b93d6471748d ("sctp: implement the sender side
for SACK-IMMEDIATELY extension") as it didn't white list
SCTP_SACK_IMMEDIATELY on sctp_msghdr_parse(), causing it to be
understood as an invalid flag and returning -EINVAL to the application.
Note that the actual handling of the flag is already there in
sctp_datamsg_from_user().
https://tools.ietf.org/html/rfc7053#section-7
Fixes: b93d6471748d ("sctp: implement the sender side for SACK-IMMEDIATELY extension")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hannes Frederic Sowa [Fri, 22 Jan 2016 00:39:43 +0000 (01:39 +0100)]
pptp: fix illegal memory access caused by multiple bind()s
[ Upstream commit
9a368aff9cb370298fa02feeffa861f2db497c18 ]
Several times already this has been reported as kasan reports caused by
syzkaller and trinity and people always looked at RCU races, but it is
much more simple. :)
In case we bind a pptp socket multiple times, we simply add it to
the callid_sock list but don't remove the old binding. Thus the old
socket stays in the bucket with unused call_id indexes and doesn't get
cleaned up. This causes various forms of kasan reports which were hard
to pinpoint.
Simply don't allow multiple binds and correct error handling in
pptp_bind. Also keep sk_state bits in place in pptp_connect.
Fixes: 00959ade36acad ("PPTP: PPP over IPv4 (Point-to-Point Tunneling Protocol)")
Cc: Dmitry Kozlov <xeb@mail.ru>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Sun, 24 Jan 2016 21:53:50 +0000 (13:53 -0800)]
af_unix: fix struct pid memory leak
[ Upstream commit
fa0dc04df259ba2df3ce1920e9690c7842f8fa4b ]
Dmitry reported a struct pid leak detected by a syzkaller program.
Bug happens in unix_stream_recvmsg() when we break the loop when a
signal is pending, without properly releasing scm.
Fixes: b3ca9b02b007 ("net: fix multithreaded signal handling in unix recv routines")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Thu, 21 Jan 2016 16:02:54 +0000 (08:02 -0800)]
tcp: fix NULL deref in tcp_v4_send_ack()
[ Upstream commit
e62a123b8ef7c5dc4db2c16383d506860ad21b47 ]
Neal reported crashes with this stack trace :
RIP: 0010:[<
ffffffff8c57231b>] tcp_v4_send_ack+0x41/0x20f
...
CR2:
0000000000000018 CR3:
000000044005c000 CR4:
00000000001427e0
...
[<
ffffffff8c57258e>] tcp_v4_reqsk_send_ack+0xa5/0xb4
[<
ffffffff8c1a7caa>] tcp_check_req+0x2ea/0x3e0
[<
ffffffff8c19e420>] tcp_rcv_state_process+0x850/0x2500
[<
ffffffff8c1a6d21>] tcp_v4_do_rcv+0x141/0x330
[<
ffffffff8c56cdb2>] sk_backlog_rcv+0x21/0x30
[<
ffffffff8c098bbd>] tcp_recvmsg+0x75d/0xf90
[<
ffffffff8c0a8700>] inet_recvmsg+0x80/0xa0
[<
ffffffff8c17623e>] sock_aio_read+0xee/0x110
[<
ffffffff8c066fcf>] do_sync_read+0x6f/0xa0
[<
ffffffff8c0673a1>] SyS_read+0x1e1/0x290
[<
ffffffff8c5ca262>] system_call_fastpath+0x16/0x1b
The problem here is the skb we provide to tcp_v4_send_ack() had to
be parked in the backlog of a new TCP fastopen child because this child
was owned by the user at the time an out of window packet arrived.
Before queuing a packet, TCP has to set skb->dev to NULL as the device
could disappear before packet is removed from the queue.
Fix this issue by using the net pointer provided by the socket (being a
timewait or a request socket).
IPv6 is immune to the bug : tcp_v6_send_response() already gets the net
pointer from the socket if provided.
Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Paolo Abeni [Wed, 17 Feb 2016 18:30:01 +0000 (19:30 +0100)]
lwt: fix rx checksum setting for lwt devices tunneling over ipv6
[ Upstream commit
c868ee7063bdb53f3ef9eac7bcec84960980b471 ]
the commit
35e2d1152b22 ("tunnels: Allow IPv6 UDP checksums to be
correctly controlled.") changed the default xmit checksum setting
for lwt vxlan/geneve ipv6 tunnels, so that now the checksum is not
set into external UDP header.
This commit changes the rx checksum setting for both lwt vxlan/geneve
devices created by openvswitch accordingly, so that lwt over ipv6
tunnel pairs are again able to communicate with default values.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jesse Gross [Thu, 21 Jan 2016 00:22:47 +0000 (16:22 -0800)]
tunnels: Allow IPv6 UDP checksums to be correctly controlled.
[ Upstream commit
35e2d1152b22eae99c961affbe85374bef05a775 ]
When configuring checksums on UDP tunnels, the flags are different
for IPv4 vs. IPv6 (and reversed). However, when lightweight tunnels
are enabled the flags used are always the IPv4 versions, which are
ignored in the IPv6 code paths. This uses the correct IPv6 flags, so
checksums can be controlled appropriately.
Fixes: a725e514 ("vxlan: metadata based tunneling for IPv6")
Fixes: abe492b4 ("geneve: UDP checksum configuration via netlink")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Manfred Rudigier [Wed, 20 Jan 2016 10:22:28 +0000 (11:22 +0100)]
net: dp83640: Fix tx timestamp overflow handling.
[ Upstream commit
81e8f2e930fe76b9814c71b9d87c30760b5eb705 ]
PHY status frames are not reliable, the PHY may not be able to send them
during heavy receive traffic. This overflow condition is signaled by the
PHY in the next status frame, but the driver did not make use of it.
Instead it always reported wrong tx timestamps to user space after an
overflow happened because it assigned newly received tx timestamps to old
packets in the queue.
This commit fixes this issue by clearing the tx timestamp queue every time
an overflow happens, so that no timestamps are delivered for overflow
packets. This way time stamping will continue correctly after an overflow.
Signed-off-by: Manfred Rudigier <manfred.rudigier@omicron.at>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jesse Gross [Thu, 21 Jan 2016 01:59:49 +0000 (17:59 -0800)]
gro: Make GRO aware of lightweight tunnels.
[ Upstream commit
ce87fc6ce3f9f4488546187e3757cf666d9d4a2a ]
GRO is currently not aware of tunnel metadata generated by lightweight
tunnels and stored in the dst. This leads to two possible problems:
* Incorrectly merging two frames that have different metadata.
* Leaking of allocated metadata from merged frames.
This avoids those problems by comparing the tunnel information before
merging, similar to how we handle other metadata (such as vlan tags),
and releasing any state when we are done.
Reported-by: John <john.phillips5@hpe.com>
Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ursula Braun [Tue, 19 Jan 2016 09:41:33 +0000 (10:41 +0100)]
af_iucv: Validate socket address length in iucv_sock_bind()
[ Upstream commit
52a82e23b9f2a9e1d429c5207f8575784290d008 ]
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Evgeny Cherkashin <Eugene.Crosser@ru.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Greg Kroah-Hartman [Thu, 25 Feb 2016 20:01:36 +0000 (12:01 -0800)]
Linux 4.4.3
Luis R. Rodriguez [Wed, 3 Feb 2016 06:25:26 +0000 (16:55 +1030)]
modules: fix modparam async_probe request
commit
4355efbd80482a961cae849281a8ef866e53d55c upstream.
Commit
f2411da746985 ("driver-core: add driver module
asynchronous probe support") added async probe support,
in two forms:
* in-kernel driver specification annotation
* generic async_probe module parameter (modprobe foo async_probe)
To support the generic kernel parameter parse_args() was
extended via commit
ecc8617053e0 ("module: add extra
argument for parse_params() callback") however commit
failed to
f2411da746985 failed to add the required argument.
This causes a crash then whenever async_probe generic
module parameter is used. This was overlooked when the
form in which in-kernel async probe support was reworked
a bit... Fix this as originally intended.
Cc: Hannes Reinecke <hare@suse.de>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> [minimized]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Rusty Russell [Wed, 3 Feb 2016 06:25:26 +0000 (16:55 +1030)]
module: wrapper for symbol name.
commit
2e7bac536106236104e9e339531ff0fcdb7b8147 upstream.
This trivial wrapper adds clarity and makes the following patch
smaller.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Thomas Gleixner [Thu, 14 Jan 2016 16:54:48 +0000 (16:54 +0000)]
itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper
commit
51cbb5242a41700a3f250ecfb48dcfb7e4375ea4 upstream.
As Helge reported for timerfd we have the same issue in itimers. We return
remaining time larger than the programmed relative time to user space in case
of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra time
added in hrtimer_start_range_ns().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Helge Deller <deller@gmx.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: dhowells@redhat.com
Link: http://lkml.kernel.org/r/20160114164159.528222587@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Thomas Gleixner [Thu, 14 Jan 2016 16:54:47 +0000 (16:54 +0000)]
posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper
commit
572c39172684c3711e4a03c9a7380067e2b0661c upstream.
As Helge reported for timerfd we have the same issue in posix timers. We
return remaining time larger than the programmed relative time to user space
in case of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra
time added in hrtimer_start_range_ns().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Helge Deller <deller@gmx.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: dhowells@redhat.com
Link: http://lkml.kernel.org/r/20160114164159.450510905@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Thomas Gleixner [Thu, 14 Jan 2016 16:54:46 +0000 (16:54 +0000)]
timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper
commit
b62526ed11a1fe3861ab98d40b7fdab8981d788a upstream.
Helge reported that a relative timer can return a remaining time larger than
the programmed relative time on parisc and other architectures which have
CONFIG_TIME_LOW_RES set. This happens because we add a jiffie to the resulting
expiry time to prevent short timeouts.
Use the new function hrtimer_expires_remaining_adjusted() to calculate the
remaining time. It takes that extra added time into account for relative
timers.
Reported-and-tested-by: Helge Deller <deller@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: dhowells@redhat.com
Link: http://lkml.kernel.org/r/20160114164159.354500742@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mateusz Guzik [Wed, 20 Jan 2016 23:01:02 +0000 (15:01 -0800)]
prctl: take mmap sem for writing to protect against others
commit
ddf1d398e517e660207e2c807f76a90df543a217 upstream.
An unprivileged user can trigger an oops on a kernel with
CONFIG_CHECKPOINT_RESTORE.
proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
start/end values. These get sanity checked as follows:
BUG_ON(arg_start > arg_end);
BUG_ON(env_start > env_end);
These can be changed by prctl_set_mm. Turns out also takes the semaphore for
reading, effectively rendering it useless. This results in:
kernel BUG at fs/proc/base.c:240!
invalid opcode: 0000 [#1] SMP
Modules linked in: virtio_net
CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task:
ffff880077a68000 ti:
ffff8800784d0000 task.ti:
ffff8800784d0000
RIP: proc_pid_cmdline_read+0x520/0x530
RSP: 0018:
ffff8800784d3db8 EFLAGS:
00010206
RAX:
ffff880077c5b6b0 RBX:
ffff8800784d3f18 RCX:
0000000000000000
RDX:
0000000000000002 RSI:
00007f78e8857000 RDI:
0000000000000246
RBP:
ffff8800784d3e40 R08:
0000000000000008 R09:
0000000000000001
R10:
0000000000000000 R11:
0000000000000001 R12:
0000000000000050
R13:
00007f78e8857800 R14:
ffff88006fcef000 R15:
ffff880077c5b600
FS:
00007f78e884a740(0000) GS:
ffff88007b200000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
CR2:
00007f78e8361770 CR3:
00000000790a5000 CR4:
00000000000006f0
Call Trace:
__vfs_read+0x37/0x100
vfs_read+0x82/0x130
SyS_read+0x58/0xd0
entry_SYSCALL_64_fastpath+0x12/0x76
Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
RIP proc_pid_cmdline_read+0x520/0x530
---[ end trace
97882617ae9c6818 ]---
Turns out there are instances where the code just reads aformentioned
values without locking whatsoever - namely environ_read and get_cmdline.
Interestingly these functions look quite resilient against bogus values,
but I don't believe this should be relied upon.
The first patch gets rid of the oops bug by grabbing mmap_sem for
writing.
The second patch is optional and puts locking around aformentioned
consumers for safety. Consumers of other fields don't seem to benefit
from similar treatment and are left untouched.
This patch (of 2):
The code was taking the semaphore for reading, which does not protect
against readers nor concurrent modifications.
The problem could cause a sanity checks to fail in procfs's cmdline
reader, resulting in an OOPS.
Note that some functions perform an unlocked read of various mm fields,
but they seem to be fine despite possible modificaton.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dave Chinner [Mon, 18 Jan 2016 21:28:10 +0000 (08:28 +1100)]
xfs: log mount failures don't wait for buffers to be released
commit
85bec5460ad8e05e0a8d70fb0f6750eb719ad092 upstream.
Recently I've been seeing xfs/051 fail on 1k block size filesystems.
Trying to trace the events during the test lead to the problem going
away, indicating that it was a race condition that lead to this
ASSERT failure:
XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
.....
[<
ffffffff814e1257>] xfs_free_perag+0x87/0xb0
[<
ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900
[<
ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0
[<
ffffffff811d8800>] mount_bdev+0x180/0x1b0
[<
ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20
[<
ffffffff811d90a8>] mount_fs+0x38/0x170
[<
ffffffff811f4347>] vfs_kern_mount+0x67/0x120
[<
ffffffff811f7018>] do_mount+0x218/0xd60
[<
ffffffff811f7e5b>] SyS_mount+0x8b/0xd0
When I finally caught it with tracing enabled, I saw that AG 2 had
an elevated reference count and a buffer was responsible for it. I
tracked down the specific buffer, and found that it was missing the
final reference count release that would put it back on the LRU and
hence be found by xfs_wait_buftarg() calls in the log mount failure
handling.
The last four traces for the buffer before the assert were (trimmed
for relevance)
kworker/0:1-5259 xfs_buf_iodone: hold 2 lock 0 flags ASYNC
kworker/0:1-5259 xfs_buf_ioerror: hold 2 lock 0 error -5
mount-7163 xfs_buf_lock_done: hold 2 lock 0 flags ASYNC
mount-7163 xfs_buf_unlock: hold 2 lock 1 flags ASYNC
This is an async write that is completing, so there's nobody waiting
for it directly. Hence we call xfs_buf_relse() once all the
processing is complete. That does:
static inline void xfs_buf_relse(xfs_buf_t *bp)
{
xfs_buf_unlock(bp);
xfs_buf_rele(bp);
}
Now, it's clear that mount is waiting on the buffer lock, and that
it has been released by xfs_buf_relse() and gained by mount. This is
expected, because at this point the mount process is in
xfs_buf_delwri_submit() waiting for all the IO it submitted to
complete.
The mount process, however, is waiting on the lock for the buffer
because it is in xfs_buf_delwri_submit(). This waits for IO
completion, but it doesn't wait for the buffer reference owned by
the IO to go away. The mount process collects all the completions,
fails the log recovery, and the higher level code then calls
xfs_wait_buftarg() to free all the remaining buffers in the
filesystem.
The issue is that on unlocking the buffer, the scheduler has decided
that the mount process has higher priority than the the kworker
thread that is running the IO completion, and so immediately
switched contexts to the mount process from the semaphore unlock
code, hence preventing the kworker thread from finishing the IO
completion and releasing the IO reference to the buffer.
Hence by the time that xfs_wait_buftarg() is run, the buffer still
has an active reference and so isn't on the LRU list that the
function walks to free the remaining buffers. Hence we miss that
buffer and continue onwards to tear down the mount structures,
at which time we get find a stray reference count on the perag
structure. On a non-debug kernel, this will be ignored and the
structure torn down and freed. Hence when the kworker thread is then
rescheduled and the buffer released and freed, it will access a
freed perag structure.
The problem here is that when the log mount fails, we still need to
quiesce the log to ensure that the IO workqueues have returned to
idle before we run xfs_wait_buftarg(). By synchronising the
workqueues, we ensure that all IO completions are fully processed,
not just to the point where buffers have been unlocked. This ensures
we don't end up in the situation above.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dave Chinner [Mon, 18 Jan 2016 21:21:46 +0000 (08:21 +1100)]
Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"
commit
3e85286e75224fa3f08bdad20e78c8327742634e upstream.
This reverts commit
24ba16bb3d499c49974669cd8429c3e4138ab102 as it
prevents machines from suspending. This regression occurs when the
xfsaild is idle on entry to suspend, and so there s no activity to
wake it from it's idle sleep and hence see that it is supposed to
freeze. Hence the freezer times out waiting for it and suspend is
cancelled.
There is no obvious fix for this short of freezing the filesystem
properly, so revert this change for now.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dave Chinner [Mon, 11 Jan 2016 20:03:44 +0000 (07:03 +1100)]
xfs: inode recovery readahead can race with inode buffer creation
commit
b79f4a1c68bb99152d0785ee4ea3ab4396cdacc6 upstream.
When we do inode readahead in log recovery, we do can do the
readahead before we've replayed the icreate transaction that stamps
the buffer with inode cores. The inode readahead verifier catches
this and marks the buffer as !done to indicate that it doesn't yet
contain valid inodes.
In adding buffer error notification (i.e. setting b_error = -EIO at
the same time as as we clear the done flag) to such a readahead
verifier failure, we can then get subsequent inode recovery failing
with this error:
XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32
This occurs when readahead completion races with icreate item replay
such as:
inode readahead
find buffer
lock buffer
submit RA io
....
icreate recovery
xfs_trans_get_buffer
find buffer
lock buffer
<blocks on RA completion>
.....
<ra completion>
fails verifier
clear XBF_DONE
set bp->b_error = -EIO
release and unlock buffer
<icreate gains lock>
icreate initialises buffer
marks buffer as done
adds buffer to delayed write queue
releases buffer
At this point, we have an initialised inode buffer that is up to
date but has an -EIO state registered against it. When we finally
get to recovering an inode in that buffer:
inode item recovery
xfs_trans_read_buffer
find buffer
lock buffer
sees XBF_DONE is set, returns buffer
sees bp->b_error is set
fail log recovery!
Essentially, we need xfs_trans_get_buf_map() to clear the error status of
the buffer when doing a lookup. This function returns uninitialised
buffers, so the buffer returned can not be in an error state and
none of the code that uses this function expects b_error to be set
on return. Indeed, there is an ASSERT(!bp->b_error); in the
transaction case in xfs_trans_get_buf_map() that would have caught
this if log recovery used transactions....
This patch firstly changes the inode readahead failure to set -EIO
on the buffer, and secondly changes xfs_buf_get_map() to never
return a buffer with an error state set so this first change doesn't
cause unexpected log recovery failures.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Darrick J. Wong [Mon, 4 Jan 2016 05:13:21 +0000 (16:13 +1100)]
libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct
commit
96f859d52bcb1c6ea6f3388d39862bf7143e2f30 upstream.
Because struct xfs_agfl is 36 bytes long and has a 64-bit integer
inside it, gcc will quietly round the structure size up to the nearest
64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE
macro returning incorrect results for v5 filesystems on 64-bit
machines (118 items instead of 119). As a result, a 32-bit xfs_repair
will see garbage in AGFL item 119 and complain.
Therefore, tell gcc not to pad the structure so that the AGFL size
calculation is correct.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Miklos Szeredi [Fri, 11 Dec 2015 15:30:49 +0000 (16:30 +0100)]
ovl: setattr: check permissions before copy-up
commit
cf9a6784f7c1b5ee2b9159a1246e327c331c5697 upstream.
Without this copy-up of a file can be forced, even without actually being
allowed to do anything on the file.
[Arnd Bergmann] include <linux/pagemap.h> for PAGE_CACHE_SIZE (used by
MAX_LFS_FILESIZE definition).
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Miklos Szeredi [Wed, 9 Dec 2015 15:11:59 +0000 (16:11 +0100)]
ovl: root: copy attr
commit
ed06e069775ad9236087594a1c1667367e983fb5 upstream.
We copy i_uid and i_gid of underlying inode into overlayfs inode. Except
for the root inode.
Fix this omission.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Konstantin Khlebnikov [Mon, 16 Nov 2015 15:44:11 +0000 (18:44 +0300)]
ovl: check dentry positiveness in ovl_cleanup_whiteouts()
commit
84889d49335627bc770b32787c1ef9ebad1da232 upstream.
This patch fixes kernel crash at removing directory which contains
whiteouts from lower layers.
Cache of directory content passed as "list" contains entries from all
layers, including whiteouts from lower layers. So, lookup in upper dir
(moved into work at this stage) will return negative entry. Plus this
cache is filled long before and we can race with external removal.
Example:
mkdir -p lower0/dir lower1/dir upper work overlay
touch lower0/dir/a lower0/dir/b
mknod lower1/dir/a c 0 0
mount -t overlay none overlay -o lowerdir=lower1:lower0,upperdir=upper,workdir=work
rm -fr overlay/dir
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vito Caputo [Sat, 24 Oct 2015 12:19:46 +0000 (07:19 -0500)]
ovl: use a minimal buffer in ovl_copy_xattr
commit
e4ad29fa0d224d05e08b2858e65f112fd8edd4fe upstream.
Rather than always allocating the high-order XATTR_SIZE_MAX buffer
which is costly and prone to failure, only allocate what is needed and
realloc if necessary.
Fixes https://github.com/coreos/bugs/issues/489
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Miklos Szeredi [Tue, 10 Nov 2015 16:08:41 +0000 (17:08 +0100)]
ovl: allow zero size xattr
commit
97daf8b97ad6f913a34c82515be64dc9ac08d63e upstream.
When ovl_copy_xattr() encountered a zero size xattr no more xattrs were
copied and the function returned success. This is clearly not the desired
behavior.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Thomas Gleixner [Sat, 19 Dec 2015 20:07:38 +0000 (20:07 +0000)]
futex: Drop refcount if requeue_pi() acquired the rtmutex
commit
fb75a4282d0d9a3c7c44d940582c2d226cf3acfb upstream.
If the proxy lock in the requeue loop acquires the rtmutex for a
waiter then it acquired also refcount on the pi_state related to the
futex, but the waiter side does not drop the reference count.
Add the missing free_pi_state() call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Bhuvanesh_Surachari@mentor.com
Cc: Andy Lowe <Andy_Lowe@mentor.com>
Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Toshi Kani [Wed, 17 Feb 2016 21:11:29 +0000 (13:11 -0800)]
devm_memremap_release(): fix memremap'd addr handling
commit
9273a8bbf58a15051e53a777389a502420ddc60e upstream.
The pmem driver calls devm_memremap() to map a persistent memory range.
When the pmem driver is unloaded, this memremap'd range is not released
so the kernel will leak a vma.
Fix devm_memremap_release() to handle a given memremap'd address
properly.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kirill A. Shutemov [Wed, 17 Feb 2016 21:11:35 +0000 (13:11 -0800)]
ipc/shm: handle removed segments gracefully in shm_mmap()
commit
1ac0b6dec656f3f78d1c3dd216fad84cb4d0a01e upstream.
remap_file_pages(2) emulation can reach file which represents removed
IPC ID as long as a memory segment is mapped. It breaks expectations of
IPC subsystem.
Test case (rewritten to be more human readable, originally autogenerated
by syzkaller[1]):
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#define PAGE_SIZE 4096
int main()
{
int id;
void *p;
id = shmget(IPC_PRIVATE, 3 * PAGE_SIZE, 0);
p = shmat(id, NULL, 0);
shmctl(id, IPC_RMID, NULL);
remap_file_pages(p, 3 * PAGE_SIZE, 0, 7, 0);
return 0;
}
The patch changes shm_mmap() and code around shm_lock() to propagate
locking error back to caller of shm_mmap().
[1] http://github.com/google/syzkaller
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter [Tue, 26 Jan 2016 09:24:25 +0000 (12:24 +0300)]
intel_scu_ipcutil: underflow in scu_reg_access()
commit
b1d353ad3d5835b16724653b33c05124e1b5acf1 upstream.
"count" is controlled by the user and it can be negative. Let's prevent
that by making it unsigned. You have to have CAP_SYS_RAWIO to call this
function so the bug is not as serious as it could be.
Fixes: 5369c02d951a ('intel_scu_ipc: Utility driver for intel scu ipc')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vineet Gupta [Fri, 12 Feb 2016 00:13:09 +0000 (16:13 -0800)]
mm,thp: khugepaged: call pte flush at the time of collapse
commit
6a6ac72fd6ea32594b316513e1826c3f6db4cc93 upstream.
This showed up on ARC when running LMBench bw_mem tests as Overlapping
TLB Machine Check Exception triggered due to STLB entry (2M pages)
overlapping some NTLB entry (regular 8K page).
bw_mem 2m touches a large chunk of vaddr creating NTLB entries. In the
interim khugepaged kicks in, collapsing the contiguous ptes into a
single pmd. pmdp_collapse_flush()->flush_pmd_tlb_range() is called to
flush out NTLB entries for the ptes. This for ARC (by design) can only
shootdown STLB entries (for pmd). The stray NTLB entries cause the
overlap with the subsequent STLB entry for collapsed page. So make
pmdp_collapse_flush() call pte flush interface not pmd flush.
Note that originally all thp flush call sites in generic code called
flush_tlb_range() leaving it to architecture to implement the flush for
pte and/or pmd. Commit
12ebc1581ad11454 changed this by calling a new
opt-in API flush_pmd_tlb_range() which made the semantics more explicit
but failed to distinguish the pte vs pmd flush in generic code, which is
what this patch fixes.
Note that ARC can fixed w/o touching the generic pmdp_collapse_flush()
by defining a ARC version, but that defeats the purpose of generic
version, plus sementically this is the right thing to do.
Fixes STAR
9000961194: LMBench on AXS103 triggering duplicate TLB
exceptions with super pages
Fixes: 12ebc1581ad11454 ("mm,thp: introduce flush_pmd_tlb_range")
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Fri, 5 Feb 2016 23:36:16 +0000 (15:36 -0800)]
dump_stack: avoid potential deadlocks
commit
d7ce36924344ace0dbdc855b1206cacc46b36d45 upstream.
Some servers experienced fatal deadlocks because of a combination of
bugs, leading to multiple cpus calling dump_stack().
The checksumming bug was fixed in commit
34ae6a1aa054 ("ipv6: update
skb->csum when CE mark is propagated").
The second problem is a faulty locking in dump_stack()
CPU1 runs in process context and calls dump_stack(), grabs dump_lock.
CPU2 receives a TCP packet under softirq, grabs socket spinlock, and
call dump_stack() from netdev_rx_csum_fault().
dump_stack() spins on atomic_cmpxchg(&dump_lock, -1, 2), since
dump_lock is owned by CPU1
While dumping its stack, CPU1 is interrupted by a softirq, and happens
to process a packet for the TCP socket locked by CPU2.
CPU1 spins forever in spin_lock() : deadlock
Stack trace on CPU1 looked like :
NMI backtrace for cpu 1
RIP: _raw_spin_lock+0x25/0x30
...
Call Trace:
<IRQ>
tcp_v6_rcv+0x243/0x620
ip6_input_finish+0x11f/0x330
ip6_input+0x38/0x40
ip6_rcv_finish+0x3c/0x90
ipv6_rcv+0x2a9/0x500
process_backlog+0x461/0xaa0
net_rx_action+0x147/0x430
__do_softirq+0x167/0x2d0
call_softirq+0x1c/0x30
do_softirq+0x3f/0x80
irq_exit+0x6e/0xc0
smp_call_function_single_interrupt+0x35/0x40
call_function_single_interrupt+0x6a/0x70
<EOI>
printk+0x4d/0x4f
printk_address+0x31/0x33
print_trace_address+0x33/0x3c
print_context_stack+0x7f/0x119
dump_trace+0x26b/0x28e
show_trace_log_lvl+0x4f/0x5c
show_stack_log_lvl+0x104/0x113
show_stack+0x42/0x44
dump_stack+0x46/0x58
netdev_rx_csum_fault+0x38/0x3c
__skb_checksum_complete_head+0x6e/0x80
__skb_checksum_complete+0x11/0x20
tcp_rcv_established+0x2bd5/0x2fd0
tcp_v6_do_rcv+0x13c/0x620
sk_backlog_rcv+0x15/0x30
release_sock+0xd2/0x150
tcp_recvmsg+0x1c1/0xfc0
inet_recvmsg+0x7d/0x90
sock_recvmsg+0xaf/0xe0
___sys_recvmsg+0x111/0x3b0
SyS_recvmsg+0x5c/0xb0
system_call_fastpath+0x16/0x1b
Fixes: b58d977432c8 ("dump_stack: serialize the output from dump_stack()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Konstantin Khlebnikov [Fri, 5 Feb 2016 23:37:01 +0000 (15:37 -0800)]
radix-tree: fix oops after radix_tree_iter_retry
commit
732042821cfa106b3c20b9780e4c60fee9d68900 upstream.
Helper radix_tree_iter_retry() resets next_index to the current index.
In following radix_tree_next_slot current chunk size becomes zero. This
isn't checked and it tries to dereference null pointer in slot.
Tagged iterator is fine because retry happens only at slot 0 where tag
bitmask in iter->tags is filled with single bit.
Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ohad Ben-Cohen <ohad@wizery.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Matthew Wilcox [Wed, 3 Feb 2016 00:57:55 +0000 (16:57 -0800)]
drivers/hwspinlock: fix race between radix tree insertion and lookup
commit
c6400ba7e13a41539342f1b6e1f9e78419cb0148 upstream.
of_hwspin_lock_get_id() is protected by the RCU lock, which means that
insertions can occur simultaneously with the lookup. If the radix tree
transitions from a height of 0, we can see a slot with the indirect_ptr
bit set, which will cause us to at least read random memory, and could
cause other havoc.
Fix this by using the newly introduced radix_tree_iter_retry().
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ohad Ben-Cohen <ohad@wizery.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Matthew Wilcox [Wed, 3 Feb 2016 00:57:52 +0000 (16:57 -0800)]
radix-tree: fix race in gang lookup
commit
46437f9a554fbe3e110580ca08ab703b59f2f95a upstream.
If the indirect_ptr bit is set on a slot, that indicates we need to redo
the lookup. Introduce a new function radix_tree_iter_retry() which
forces the loop to retry the lookup by setting 'slot' to NULL and
turning the iterator back to point at the problematic entry.
This is a pretty rare problem to hit at the moment; the lookup has to
race with a grow of the radix tree from a height of 0. The consequences
of hitting this race are that gang lookup could return a pointer to a
radix_tree_node instead of a pointer to whatever the user had inserted
in the tree.
Fixes: cebbd29e1c2f ("radix-tree: rewrite gang lookup using iterator")
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ohad Ben-Cohen <ohad@wizery.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Rich Felker [Fri, 22 Jan 2016 23:11:05 +0000 (15:11 -0800)]
MAINTAINERS: return arch/sh to maintained state, with new maintainers
commit
114bf37e04d839b555b3dc460b5e6ce156f49cf0 upstream.
Add Yoshinori Sato and Rich Felker as maintainers for arch/sh
(SUPERH).
Signed-off-by: Rich Felker <dalias@libc.org>
Signed-off-by: Yoshinori Sato <ysato@users.sourceforge.jp>
Acked-by: D. Jeff Dionne <jeff@uClinux.org>
Acked-by: Rob Landley <rob@landley.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Martijn Coenen [Sat, 16 Jan 2016 00:57:49 +0000 (16:57 -0800)]
memcg: only free spare array when readers are done
commit
6611d8d76132f86faa501de9451a89bf23fb2371 upstream.
A spare array holding mem cgroup threshold events is kept around to make
sure we can always safely deregister an event and have an array to store
the new set of events in.
In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.
Fixed by only freeing after synchronize_rcu().
Signed-off-by: Martijn Coenen <maco@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Michael Holzheu [Wed, 3 Feb 2016 00:57:26 +0000 (16:57 -0800)]
numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390
commit
5c2ff95e41c9290d16556cd02e35b25d81be8fe0 upstream.
When working with hugetlbfs ptes (which are actually pmds) is not valid to
directly use pte functions like pte_present() because the hardware bit
layout of pmds and ptes can be different. This is the case on s390.
Therefore we have to convert the hugetlbfs ptes first into a valid pte
encoding with huge_ptep_get().
Currently the /proc/<pid>/numa_maps code uses hugetlbfs ptes without
huge_ptep_get(). On s390 this leads to the following two problems:
1) The pte_present() function returns false (instead of true) for
PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing
completely in the "numa_maps" output.
2) The pte_dirty() function always returns false for all hugetlb ptes.
Therefore these pages are reported as "mapped=xxx" instead of
"dirty=xxx".
Therefore use huge_ptep_get() to correctly convert the hugetlb ptes.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mike Kravetz [Sat, 16 Jan 2016 00:57:37 +0000 (16:57 -0800)]
fs/hugetlbfs/inode.c: fix bugs in hugetlb_vmtruncate_list()
commit
9aacdd354d197ad64685941b36d28ea20ab88757 upstream.
Hillf Danton noticed bugs in the hugetlb_vmtruncate_list routine. The
argument end is of type pgoff_t. It was being converted to a vaddr
offset and passed to unmap_hugepage_range. However, end was also being
used as an argument to the vma_interval_tree_foreach controlling loop.
In addition, the conversion of end to vaddr offset was incorrect.
hugetlb_vmtruncate_list is called as part of a file truncate or
fallocate hole punch operation.
When truncating a hugetlbfs file, this bug could prevent some pages from
being unmapped. This is possible if there are multiple vmas mapping the
file, and there is a sufficiently sized hole between the mappings. The
size of the hole between two vmas (A,B) must be such that the starting
virtual address of B is greater than (ending virtual address of A <<
PAGE_SHIFT). In this case, the pages in B would not be unmapped. If
pages are not properly unmapped during truncate, the following BUG is
hit:
kernel BUG at fs/hugetlbfs/inode.c:428!
In the fallocate hole punch case, this bug could prevent pages from
being unmapped as in the truncate case. However, for hole punch the
result is that unmapped pages will not be removed during the operation.
For hole punch, it is also possible that more pages than desired will be
unmapped. This unnecessary unmapping will cause page faults to
reestablish the mappings on subsequent page access.
Fixes: 1bfad99ab (" hugetlbfs: hugetlb_vmtruncate_list() needs to take a range")Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sergey Senozhatsky [Thu, 14 Jan 2016 23:16:53 +0000 (15:16 -0800)]
scripts/bloat-o-meter: fix python3 syntax error
commit
72214a24a7677d4c7501eecc9517ed681b5f2db2 upstream.
In Python3+ print is a function so the old syntax is not correct
anymore:
$ ./scripts/bloat-o-meter vmlinux.o vmlinux.o.old
File "./scripts/bloat-o-meter", line 61
print "add/remove: %s/%s grow/shrink: %s/%s up/down: %s/%s (%s)" % \
^
SyntaxError: invalid syntax
Fix by calling print as a function.
Tested on python 2.7.11, 3.5.1
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Laura Abbott [Thu, 14 Jan 2016 23:16:50 +0000 (15:16 -0800)]
dma-debug: switch check from _text to _stext
commit
ea535e418c01837d07b6c94e817540f50bfdadb0 upstream.
In include/asm-generic/sections.h:
/*
* Usage guidelines:
* _text, _data: architecture specific, don't use them in
* arch-independent code
* [_stext, _etext]: contains .text.* sections, may also contain
* .rodata.*
* and/or .init.* sections
_text is not guaranteed across architectures. Architectures such as ARM
may reuse parts which are not actually text and erroneously trigger a bug.
Switch to using _stext which is guaranteed to contain text sections.
Came out of https://lkml.kernel.org/g/<
567B1176.
4000106@redhat.com>
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sudip Mukherjee [Thu, 14 Jan 2016 23:16:47 +0000 (15:16 -0800)]
m32r: fix m32104ut_defconfig build fail
commit
601f1db653217f205ffa5fb33514b4e1711e56d1 upstream.
The build of m32104ut_defconfig for m32r arch was failing for long long
time with the error:
ERROR: "memory_start" [fs/udf/udf.ko] undefined!
ERROR: "memory_end" [fs/udf/udf.ko] undefined!
ERROR: "memory_end" [drivers/scsi/sg.ko] undefined!
ERROR: "memory_start" [drivers/scsi/sg.ko] undefined!
ERROR: "memory_end" [drivers/i2c/i2c-dev.ko] undefined!
ERROR: "memory_start" [drivers/i2c/i2c-dev.ko] undefined!
As done in other architectures export the symbols to fix the error.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mathias Nyman [Tue, 26 Jan 2016 15:50:12 +0000 (17:50 +0200)]
xhci: Fix list corruption in urb dequeue at host removal
commit
5c82171167adb8e4ac77b91a42cd49fb211a81a0 upstream.
xhci driver frees data for all devices, both usb2 and and usb3 the
first time usb_remove_hcd() is called, including td_list and and xhci_ring
structures.
When usb_remove_hcd() is called a second time for the second xhci bus it
will try to dequeue all pending urbs, and touches td_list which is already
freed for that endpoint.
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mathias Nyman [Tue, 26 Jan 2016 15:50:04 +0000 (17:50 +0200)]
Revert "xhci: don't finish a TD if we get a short-transfer event mid TD"
commit
a6835090716a85f2297668ba593bd00e1051e662 upstream.
This reverts commit
e210c422b6fd ("xhci: don't finish a TD if we get a
short transfer event mid TD")
Turns out that most host controllers do not follow the xHCI specs and never
send the second event for the last TRB in the TD if there was a short event
mid-TD.
Returning the URB directly after the first short-transfer event is far
better than never returning the URB. (class drivers usually timeout
after 30sec). For the hosts that do send the second event we will go
back to treating it as misplaced event and print an error message for it.
The origial patch was sent to stable kernels and needs to be reverted from
there as well
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David Woodhouse [Mon, 15 Feb 2016 12:42:38 +0000 (12:42 +0000)]
iommu/vt-d: Clear PPR bit to ensure we get more page request interrupts
commit
46924008273ed03bd11dbb32136e3da4cfe056e1 upstream.
According to the VT-d specification we need to clear the PPR bit in
the Page Request Status register when handling page requests, or the
hardware won't generate any more interrupts.
This wasn't actually necessary on SKL/KBL (which may well be the
subject of a hardware erratum, although it's harmless enough). But
other implementations do appear to get it right, and we only ever get
one interrupt unless we clear the PPR bit.
Reported-by: CQ Tang <cq.tang@intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CQ Tang [Wed, 13 Jan 2016 21:15:03 +0000 (21:15 +0000)]
iommu/vt-d: Fix 64-bit accesses to 32-bit DMAR_GSTS_REG
commit
fda3bec12d0979aae3f02ee645913d66fbc8a26e upstream.
This is a 32-bit register. Apparently harmless on real hardware, but
causing justified warnings in simulation.
Signed-off-by: CQ Tang <cq.tang@intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
David Woodhouse [Tue, 12 Jan 2016 19:18:06 +0000 (19:18 +0000)]
iommu/vt-d: Fix mm refcounting to hold mm_count not mm_users
commit
e57e58bd390a6843db58560bf7b8341665d2e058 upstream.
Holding mm_users works OK for graphics, which was the first user of SVM
with VT-d. However, it works less well for other devices, where we actually
do a mmap() from the file descriptor to which the SVM PASID state is tied.
In this case on process exit we end up with a recursive reference count:
- The MM remains alive until the file is closed and the driver's release()
call ends up unbinding the PASID.
- The VMA corresponding to the mmap() remains intact until the MM is
destroyed.
- Thus the file isn't closed, even when exit_files() runs, because the
VMA is still holding a reference to it. And the MM remains alive…
To address this issue, we *stop* holding mm_users while the PASID is bound.
We already hold mm_count by virtue of the MMU notifier, and that can be
made to be sufficient.
It means that for a period during process exit, the fun part of mmput()
has happened and exit_mmap() has been called so the MM is basically
defunct. But the PGD still exists and the PASID is still bound to it.
During this period, we have to be very careful — exit_mmap() doesn't use
mm->mmap_sem because it doesn't expect anyone else to be touching the MM
(quite reasonably, since mm_users is zero). So we also need to fix the
fault handler to just report failure if mm_users is already zero, and to
temporarily bump mm_users while handling any faults.
Additionally, exit_mmap() calls mmu_notifier_release() *before* it tears
down the page tables, which is too early for us to flush the IOTLB for
this PASID. And __mmu_notifier_release() removes every notifier from the
list, so when exit_mmap() finally *does* tear down the mappings and
clear the page tables, we don't get notified. So we work around this by
clearing the PASID table entry in our MMU notifier release() callback.
That way, the hardware *can't* get any pages back from the page tables
before they get cleared.
Hardware designers have confirmed that the resulting 'PASID not present'
faults should be handled just as gracefully as 'page not present' faults,
the important criterion being that they don't perturb the operation for
any *other* PASID in the system.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Baoquan He [Wed, 20 Jan 2016 14:01:19 +0000 (22:01 +0800)]
iommu/amd: Correct the wrong setting of alias DTE in do_attach
commit
9b1a12d29109234d2b9718d04d4d404b7da4e794 upstream.
In below commit alias DTE is set when its peripheral is
setting DTE. However there's a code bug here to wrongly
set the alias DTE, correct it in this patch.
commit
e25bfb56ea7f046b71414e02f80f620deb5c6362
Author: Joerg Roedel <jroedel@suse.de>
Date: Tue Oct 20 17:33:38 2015 +0200
iommu/amd: Set alias DTE in do_attach/do_detach
Signed-off-by: Baoquan He <bhe@redhat.com>
Tested-by: Mark Hounschell <markh@compro.net>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jeremy McNicoll [Fri, 15 Jan 2016 05:33:06 +0000 (21:33 -0800)]
iommu/vt-d: Don't skip PCI devices when disabling IOTLB
commit
da972fb13bc5a1baad450c11f9182e4cd0a091f6 upstream.
Fix a simple typo when disabling IOTLB on PCI(e) devices.
Fixes: b16d0cb9e2fc ("iommu/vt-d: Always enable PASID/PRI PCI capabilities before ATS")
Signed-off-by: Jeremy McNicoll <jmcnicol@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dmitry Torokhov [Sat, 16 Jan 2016 18:04:49 +0000 (10:04 -0800)]
Input: vmmouse - fix absolute device registration
commit
d4f1b06d685d11ebdaccf11c0db1cb3c78736862 upstream.
We should set device's capabilities first, and then register it,
otherwise various handlers already present in the kernel will not be
able to connect to the device.
Reported-by: Lauri Kasanen <cand@gmx.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
James Bottomley [Wed, 20 Jan 2016 22:58:29 +0000 (14:58 -0800)]
string_helpers: fix precision loss for some inputs
commit
564b026fbd0d28e9f70fb3831293d2922bb7855b upstream.
It was noticed that we lose precision in the final calculation for some
inputs. The most egregious example is size=3000 blk_size=1900 in units
of 10 should yield 5.70 MB but in fact yields 3.00 MB (oops).
This is because the current algorithm doesn't correctly account for
all the remainders in the logarithms. Fix this by doing a correct
calculation in the remainders based on napier's algorithm.
Additionally, now we have the correct result, we have to account for
arithmetic rounding because we're printing 3 digits of precision. This
means that if the fourth digit is five or greater, we have to round up,
so add a section to ensure correct rounding. Finally account for all
possible inputs correctly, including zero for block size.
Fixes: b9f28d863594c429e1df35a0474d2663ca28b307
Signed-off-by: James Bottomley <JBottomley@Odin.com>
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Aurélien Francillon [Sun, 3 Jan 2016 04:39:54 +0000 (20:39 -0800)]
Input: i8042 - add Fujitsu Lifebook U745 to the nomux list
commit
dd0d0d4de582a6a61c032332c91f4f4cb2bab569 upstream.
Without i8042.nomux=1 the Elantech touch pad is not working at all on
a Fujitsu Lifebook U745. This patch does not seem necessary for all
U745 (maybe because of different BIOS versions?). However, it was
verified that the patch does not break those (see opensuse bug 883192:
https://bugzilla.opensuse.org/show_bug.cgi?id=883192).
Signed-off-by: Aurélien Francillon <aurelien@francillon.net>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Benjamin Tissoires [Tue, 12 Jan 2016 01:35:38 +0000 (17:35 -0800)]
Input: elantech - mark protocols v2 and v3 as semi-mt
commit
6544a1df11c48c8413071aac3316792e4678fbfb upstream.
When using a protocol v2 or v3 hardware, elantech uses the function
elantech_report_semi_mt_data() to report data. This devices are rather
creepy because if num_finger is 3, (x2,y2) is (0,0). Yes, only one valid
touch is reported.
Anyway, userspace (libinput) is now confused by these (0,0) touches,
and detect them as palm, and rejects them.
Commit
3c0213d17a09 ("Input: elantech - fix semi-mt protocol for v3 HW")
was sufficient enough for xf86-input-synaptics and libinput before it has
palm rejection. Now we need to actually tell libinput that this device is
a semi-mt one and it should not rely on the actual values of the 2 touches.
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kirill A. Shutemov [Wed, 17 Feb 2016 21:11:15 +0000 (13:11 -0800)]
mm: fix regression in remap_file_pages() emulation
commit
48f7df329474b49d83d0dffec1b6186647f11976 upstream.
Grazvydas Ignotas has reported a regression in remap_file_pages()
emulation.
Testcase:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define SIZE (4096 * 3)
int main(int argc, char **argv)
{
unsigned long *p;
long i;
p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
if (remap_file_pages(p, 4096, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
assert(p[0] == 1);
munmap(p, SIZE);
return 0;
}
The second remap_file_pages() fails with -EINVAL.
The reason is that remap_file_pages() emulation assumes that the target
vma covers whole area we want to over map. That assumption is broken by
first remap_file_pages() call: it split the area into two vma.
The solution is to check next adjacent vmas, if they map the same file
with the same flags.
Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Grazvydas Ignotas <notasas@gmail.com>
Tested-by: Grazvydas Ignotas <notasas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Konstantin Khlebnikov [Fri, 5 Feb 2016 23:36:50 +0000 (15:36 -0800)]
mm: replace vma_lock_anon_vma with anon_vma_lock_read/write
commit
12352d3cae2cebe18805a91fab34b534d7444231 upstream.
Sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
anon_vma appeared between lock and unlock. We have to check anon_vma
first or call anon_vma_prepare() to be sure that it's here. There are
only few users of these legacy helpers. Let's get rid of them.
This patch fixes anon_vma lock imbalance in validate_mm(). Write lock
isn't required here, read lock is enough.
And reorders expand_downwards/expand_upwards: security_mmap_addr() and
wrapping-around check don't have to be under anon vma lock.
Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kirill A. Shutemov [Fri, 22 Jan 2016 00:40:27 +0000 (16:40 -0800)]
mm: fix mlock accouting
commit
7162a1e87b3e380133dadc7909081bb70d0a7041 upstream.
Tetsuo Handa reported underflow of NR_MLOCK on munlock.
Testcase:
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#define BASE ((void *)0x400000000000)
#define SIZE (1UL << 21)
int main(int argc, char *argv[])
{
void *addr;
system("grep Mlocked /proc/meminfo");
addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED,
-1, 0);
if (addr == MAP_FAILED)
printf("mmap() failed\n"), exit(1);
munmap(addr, SIZE);
system("grep Mlocked /proc/meminfo");
return 0;
}
It happens on munlock_vma_page() due to unfortunate choice of nr_pages
data type:
__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
For unsigned int nr_pages, implicitly casted to long in
__mod_zone_page_state(), it becomes something around UINT_MAX.
munlock_vma_page() usually called for THP as small pages go though
pagevec.
Let's make nr_pages signed int.
Similar fixes in
6cdb18ad98a4 ("mm/vmstat: fix overflow in
mod_zone_page_state()") used `long' type, but `int' here is OK for a
count of the number of sub-pages in a huge page.
Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michel Lespinasse <walken@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Williams [Wed, 6 Jan 2016 02:37:23 +0000 (18:37 -0800)]
libnvdimm: fix namespace object confusion in is_uuid_busy()
commit
e07ecd76d4db7bda1e9495395b2110a3fe28845a upstream.
When btt devices were re-worked to be child devices of regions this
routine was overlooked. It mistakenly attempts to_nd_namespace_pmem()
or to_nd_namespace_blk() conversions on btt and pfn devices. By luck to
date we have happened to be hitting valid memory leading to a uuid
miscompare, but a recent change to struct nd_namespace_common causes:
BUG: unable to handle kernel NULL pointer dereference at
0000000000000001
IP: [<
ffffffff814610dc>] memcmp+0xc/0x40
[..]
Call Trace:
[<
ffffffffa0028631>] is_uuid_busy+0xc1/0x2a0 [libnvdimm]
[<
ffffffffa0028570>] ? to_nd_blk_region+0x50/0x50 [libnvdimm]
[<
ffffffff8158c9c0>] device_for_each_child+0x50/0x90
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Naoya Horiguchi [Sat, 16 Jan 2016 00:54:03 +0000 (16:54 -0800)]
mm: soft-offline: check return value in second __get_any_page() call
commit
d96b339f453997f2f08c52da3f41423be48c978f upstream.
I saw the following BUG_ON triggered in a testcase where a process calls
madvise(MADV_SOFT_OFFLINE) on thps, along with a background process that
calls migratepages command repeatedly (doing ping-pong among different
NUMA nodes) for the first process:
Soft offlining page 0x60000 at 0x700000600000
__get_any_page: 0x60000 free buddy page
page:
ffffea0001800000 count:0 mapcount:-127 mapping: (null) index:0x1
flags: 0x1fffc0000000000()
page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
------------[ cut here ]------------
kernel BUG at /src/linux-dev/include/linux/mm.h:342!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi
CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G O 4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task:
ffff88007c63d5c0 ti:
ffff88007c210000 task.ti:
ffff88007c210000
RIP: 0010:[<
ffffffff8118998c>] [<
ffffffff8118998c>] put_page+0x5c/0x60
RSP: 0018:
ffff88007c213e00 EFLAGS:
00010246
Call Trace:
put_hwpoison_page+0x4e/0x80
soft_offline_page+0x501/0x520
SyS_madvise+0x6bc/0x6f0
entry_SYSCALL_64_fastpath+0x12/0x6a
Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 <0f> 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47
RIP [<
ffffffff8118998c>] put_page+0x5c/0x60
RSP <
ffff88007c213e00>
The root cause resides in get_any_page() which retries to get a refcount
of the page to be soft-offlined. This function calls
put_hwpoison_page(), expecting that the target page is putback to LRU
list. But it can be also freed to buddy. So the second check need to
care about such case.
Fixes: af8fae7c0886 ("mm/memory-failure.c: clean up soft_offline_page()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ravi Bangoria [Mon, 7 Dec 2015 06:55:02 +0000 (12:25 +0530)]
perf kvm record/report: 'unprocessable sample' error while recording/reporting guest data
commit
3caeaa562733c4836e61086ec07666635006a787 upstream.
While recording guest samples in host using perf kvm record, it will
populate unprocessable sample error, though samples will be recorded
properly. While generating report using perf kvm report, no samples will
be processed and same error will populate. We have seen this behaviour
with upstream perf(4.4-rc3) on x86 and ppc64 hardware.
Reason behind this failure is, when it tries to fetch machine from
rb_tree of machines, it fails. As a part of tracing a bug, we figured
out that this code was incorrectly refactored in commit
54245fdc3576
("perf session: Remove wrappers to machines__find").
This patch will change the functionality such that if it can't fetch
machine in first trial, it will create one node of machine and add that to
rb_tree. So next time when it tries to fetch same machine from rb_tree,
it won't fail. Actually it was the case before refactoring of code in
aforementioned commit.
This patch is generated from acme perf/core branch.
Below I've mention an example that demonstrate the behaviour before and
after applying patch.
Before applying patch:
[Note: One needs to run guest before recording data in host]
ravi@ravi-bangoria:~$ ./perf kvm record -a
Warning:
5903 unprocessable samples recorded.
Do you have a KVM guest running and not using 'perf kvm'?
[ perf record: Captured and wrote 1.409 MB perf.data.guest (285 samples) ]
ravi@ravi-bangoria:~$ ./perf kvm report --stdio
Warning:
5903 unprocessable samples recorded.
Do you have a KVM guest running and not using 'perf kvm'?
# To display the perf.data header info, please use --header/--header-only options.
#
# Total Lost Samples: 0
#
# Samples: 285 of event 'cycles'
# Event count (approx.):
88715406
#
# Overhead Command Shared Object Symbol
# ........ ....... ............. ......
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
After applying patch:
ravi@ravi-bangoria:~$ ./perf kvm record -a
[ perf record: Captured and wrote 1.188 MB perf.data.guest (17 samples) ]
ravi@ravi-bangoria:~$ ./perf kvm report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
# Total Lost Samples: 0
#
# Samples: 17 of event 'cycles'
# Event count (approx.): 700746
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ ......................
#
34.19% :5758 [unknown] [g] 0xffffffff818682ab
22.79% :5758 [unknown] [g] 0xffffffff812dc7f8
22.79% :5758 [unknown] [g] 0xffffffff818650d0
14.83% :5758 [unknown] [g] 0xffffffff8161a1b6
2.49% :5758 [unknown] [g] 0xffffffff818692bf
0.48% :5758 [unknown] [g] 0xffffffff81869253
0.05% :5758 [unknown] [g] 0xffffffff81869250
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Fixes: 54245fdc3576 ("perf session: Remove wrappers to machines__find")
Link: http://lkml.kernel.org/r/1449471302-11283-1-git-send-email-ravi.bangoria@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Greg Kurz [Wed, 13 Jan 2016 17:28:17 +0000 (18:28 +0100)]
KVM: PPC: Fix ONE_REG AltiVec support
commit
b4d7f161feb3015d6306e1d35b565c888ff70c9d upstream.
The get and set operations got exchanged by mistake when moving the
code from book3s.c to powerpc.c.
Fixes: 3840edc8033ad5b86deee309c1c321ca54257452
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Thomas Huth [Fri, 20 Nov 2015 08:11:45 +0000 (09:11 +0100)]
KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8
commit
760a7364f27d974d100118d88190e574626e18a6 upstream.
In the old DABR register, the BT (Breakpoint Translation) bit
is bit number 61. In the new DAWRX register, the WT (Watchpoint
Translation) bit is bit number 59. So to move the DABR-BT bit
into the position of the DAWRX-WT bit, it has to be shifted by
two, not only by one. This fixes hardware watchpoints in gdb of
older guests that only use the H_SET_DABR/X interface instead
of the new H_SET_MODE interface.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Andre Przywara [Wed, 3 Feb 2016 16:56:51 +0000 (16:56 +0000)]
KVM: arm/arm64: Fix reference to uninitialised VGIC
commit
b3aff6ccbb1d25e506b60ccd9c559013903f3464 upstream.
Commit
4b4b4512da2a ("arm/arm64: KVM: Rework the arch timer to use
level-triggered semantics") brought the virtual architected timer
closer to the VGIC. There is one occasion were we don't properly
check for the VGIC actually having been initialized before, but
instead go on to check the active state of some IRQ number.
If userland hasn't instantiated a virtual GIC, we end up with a
kernel NULL pointer dereference:
=========
Unable to handle kernel NULL pointer dereference at virtual address
00000000
pgd =
ffffffc9745c5000
[
00000000] *pgd=
00000009f631e003, *pud=
00000009f631e003, *pmd=
0000000000000000
Internal error: Oops:
96000006 [#2] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 2144 Comm: kvm_simplest-ar Tainted: G D 4.5.0-rc2+ #1300
Hardware name: ARM Juno development board (r1) (DT)
task:
ffffffc976da8000 ti:
ffffffc976e28000 task.ti:
ffffffc976e28000
PC is at vgic_bitmap_get_irq_val+0x78/0x90
LR is at kvm_vgic_map_is_active+0xac/0xc8
pc : [<
ffffffc0000b7e28>] lr : [<
ffffffc0000b972c>] pstate:
20000145
....
=========
Fix this by bailing out early of kvm_timer_flush_hwstate() if we don't
have a VGIC at all.
Reported-by: Cosmin Gorgovan <cosmin@linux-geek.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Marek Szyprowski [Tue, 16 Feb 2016 14:14:44 +0000 (15:14 +0100)]
arm64: dma-mapping: fix handling of devices registered before arch_initcall
commit
722ec35f7faefcc34d12616eca7976a848870f9d upstream.
This patch ensures that devices, which got registered before arch_initcall
will be handled correctly by IOMMU-based DMA-mapping code.
Fixes: 13b8629f6511 ("arm64: Add IOMMU dma_ops")
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tony Lindgren [Thu, 14 Jan 2016 20:20:48 +0000 (12:20 -0800)]
ARM: OMAP2+: Fix ppa_zero_params and ppa_por_params for rodata
commit
4da597d16602d14405b71a18d45e1c59f28f0fd2 upstream.
We don't want to write to .text so let's move ppa_zero_params and
ppa_por_params to .data and access them via pointers.
Note that I have not been able to test as we I don't have a HS
omap4 to test with. The code has been changed in similar way as
for omap3 though.
Cc: Kees Cook <keescook@chromium.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Nishanth Menon <nm@ti.com>
Cc: Richard Woodruff <r-woodruff2@ti.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tero Kristo <t-kristo@ti.com>
Acked-by: Nicolas Pitre <nico@linaro.org>
Fixes: 1e6b48116a95 ("ARM: mm: allow non-text sections to be
non-executable")
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tony Lindgren [Thu, 14 Jan 2016 20:20:47 +0000 (12:20 -0800)]
ARM: OMAP2+: Fix save_secure_ram_context for rodata
commit
a5311d4d13df80bd71a9e47f9ecaf327f478fab1 upstream.
We don't want to write to .text and we can move save_secure_ram_context
into .data as it all gets copied into SRAM anyways.
Cc: Kees Cook <keescook@chromium.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Nishanth Menon <nm@ti.com>
Cc: Richard Woodruff <r-woodruff2@ti.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: Tero Kristo <t-kristo@ti.com>
Acked-by: Nicolas Pitre <nico@linaro.org>
Fixes: 1e6b48116a95 ("ARM: mm: allow non-text sections to be
non-executable")
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>