Florian Westphal [Sun, 22 Dec 2013 23:32:31 +0000 (00:32 +0100)]
net: rose: restore old recvmsg behavior
[ Upstream commit f81152e35001e91997ec74a7b4e040e6ab0acccf ]
The recvmsg handler in net/rose/af_rose.c performs a size check on ->msg_namelen.
After commit f3d3342602f8bcbf37d7c46641cb9bca7618eb1c
(net: rework recvmsg handler msg_name and msg_namelen logic), we now
always take the else branch because namelen is initialized to 0.
Digging in the netdev-vger-cvs git repo shows that msg_namelen was
initialized with a fixed size since at least 1995, so the else branch
was never taken.
Compile tested only.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sasha Levin [Thu, 19 Dec 2013 04:49:42 +0000 (23:49 -0500)]
rds: prevent dereference of a NULL device
[ Upstream commit c2349758acf1874e4c2b93fe41d072336f1a31d0 ]
Binding might result in a NULL device, which is dereferenced
causing this BUG:
[ 1317.260548] BUG: unable to handle kernel NULL pointer dereference at 0000000000000974
[ 1317.261847] IP: [<ffffffff84225f52>] rds_ib_laddr_check+0x82/0x110
[ 1317.263315] PGD 418bcb067 PUD 3ceb21067 PMD 0
[ 1317.263502] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1317.264179] Dumping ftrace buffer:
[ 1317.264774] (ftrace buffer empty)
[ 1317.265220] Modules linked in:
[ 1317.265824] CPU: 4 PID: 836 Comm: trinity-child46 Tainted: G W 3.13.0-rc4-next-20131218-sasha-00013-g2cebb9b-dirty #4159
[ 1317.267415] task: ffff8803ddf33000 ti: ffff8803cd31a000 task.ti: ffff8803cd31a000
[ 1317.268399] RIP: 0010:[<ffffffff84225f52>] [<ffffffff84225f52>] rds_ib_laddr_check+0x82/0x110
[ 1317.269670] RSP: 0000:ffff8803cd31bdf8 EFLAGS: 00010246
[ 1317.270230] RAX: 0000000000000000 RBX: ffff88020b0dd388 RCX: 0000000000000000
[ 1317.270230] RDX: ffffffff8439822e RSI: 00000000000c000a RDI: 0000000000000286
[ 1317.270230] RBP: ffff8803cd31be38 R08: 0000000000000000 R09: 0000000000000000
[ 1317.270230] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 1317.270230] R13: 0000000054086700 R14: 0000000000a25de0 R15: 0000000000000031
[ 1317.270230] FS: 00007ff40251d700(0000) GS:ffff88022e200000(0000) knlGS:0000000000000000
[ 1317.270230] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1317.270230] CR2: 0000000000000974 CR3: 00000003cd478000 CR4: 00000000000006e0
[ 1317.270230] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1317.270230] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602
[ 1317.270230] Stack:
[ 1317.270230] 0000000054086700 5408670000a25de0 5408670000000002 0000000000000000
[ 1317.270230] ffffffff84223542 00000000ea54c767 0000000000000000 ffffffff86d26160
[ 1317.270230] ffff8803cd31be68 ffffffff84223556 ffff8803cd31beb8 ffff8800c6765280
[ 1317.270230] Call Trace:
[ 1317.270230] [<ffffffff84223542>] ? rds_trans_get_preferred+0x42/0xa0
[ 1317.270230] [<ffffffff84223556>] rds_trans_get_preferred+0x56/0xa0
[ 1317.270230] [<ffffffff8421c9c3>] rds_bind+0x73/0xf0
[ 1317.270230] [<ffffffff83e4ce62>] SYSC_bind+0x92/0xf0
[ 1317.270230] [<ffffffff812493f8>] ? context_tracking_user_exit+0xb8/0x1d0
[ 1317.270230] [<ffffffff8119313d>] ? trace_hardirqs_on+0xd/0x10
[ 1317.270230] [<ffffffff8107a852>] ? syscall_trace_enter+0x32/0x290
[ 1317.270230] [<ffffffff83e4cece>] SyS_bind+0xe/0x10
[ 1317.270230] [<ffffffff843a6ad0>] tracesys+0xdd/0xe2
[ 1317.270230] Code: 00 8b 45 cc 48 8d 75 d0 48 c7 45 d8 00 00 00 00 66 c7 45 d0 02 00 89 45 d4 48 89 df e8 78 49 76 ff 41 89 c4 85 c0 75 0c 48 8b 03 <80> b8 74 09 00 00 01 74 06 41 bc 9d ff ff ff f6 05 2a b6 c2 02
[ 1317.270230] RIP [<ffffffff84225f52>] rds_ib_laddr_check+0x82/0x110
[ 1317.270230] RSP <ffff8803cd31bdf8>
[ 1317.270230] CR2: 0000000000000974
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Li RongQing [Thu, 19 Dec 2013 04:40:26 +0000 (12:40 +0800)]
ipv6: always set the new created dst's from in ip6_rt_copy
[ Upstream commit 24f5b855e17df7e355eacd6c4a12cc4d6a6c9ff0 ]
ip6_rt_copy only sets dst.from if ort has the flags RTF_ADDRCONF and RTF_DEFAULT,
but prefix routes that were installed by hand locally can have an
expiration, and no flag combination can ensure that a potential from
never expires, so we should always set the newly created dst's from.
This also fixes the newly created dst always being expired when the ort, which
is created by RA, has RTF_EXPIRES and RTF_ADDRCONF but no RTF_DEFAULT.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Thu, 19 Dec 2013 18:53:02 +0000 (10:53 -0800)]
net: fec: fix potential use after free
[ Upstream commit 7a2a84518cfb263d2c4171b3d63671f88316adb2 ]
skb_tx_timestamp(skb) should be called _before_ TX completion
has a chance to trigger, otherwise it is too late and we access
freed memory.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: de5fb0a05348 ("net: fec: put tx to napi poll function to fix dead lock")
Cc: Frank Li <Frank.Li@freescale.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Frank Li <Frank.Li@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Salva Peiró [Tue, 17 Dec 2013 09:06:30 +0000 (10:06 +0100)]
hamradio/yam: fix info leak in ioctl
[ Upstream commit 8e3fbf870481eb53b2d3a322d1fc395ad8b367ed ]
The yam_ioctl() code fails to initialise the cmd field
of the struct yamdrv_ioctl_cfg. Add an explicit memset(0)
before filling the structure to avoid the 4-byte info leak.
Signed-off-by: Salva Peiró <speiro@ai2.upv.es>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Wenliang Fan [Tue, 17 Dec 2013 03:25:28 +0000 (11:25 +0800)]
drivers/net/hamradio: Integer overflow in hdlcdrv_ioctl()
[ Upstream commit e9db5c21d3646a6454fcd04938dd215ac3ab620a ]
The local variable 'bi' comes from userspace. If userspace passed a
large number to 'bi.data.calibrate', there would be an integer overflow
in the following line:
s->hdlctx.calibrate = bi.data.calibrate * s->par.bitrate / 16;
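The overflow and one possible guard can be pictured with a small user-space sketch. The helper name calibrate_ticks and its rejection policy are hypothetical illustrations, not the code of the upstream fix; the bitrate parameter stands in for s->par.bitrate.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: computes calibrate * bitrate / 16 only when the
 * intermediate product fits in an int. Returns false instead of
 * letting the multiplication overflow.
 */
static bool calibrate_ticks(int calibrate, int bitrate, int *out)
{
    assert(out != NULL);
    if (calibrate < 0 || bitrate <= 0)
        return false;
    /* calibrate * bitrate would exceed INT_MAX: reject the request. */
    if (calibrate > INT_MAX / bitrate)
        return false;
    *out = calibrate * bitrate / 16;
    return true;
}
```

A caller would treat a false return as an invalid argument from userspace rather than storing a wrapped value.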
Signed-off-by: Wenliang Fan <fanwlexca@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Daniel Borkmann [Mon, 16 Dec 2013 23:38:39 +0000 (00:38 +0100)]
net: inet_diag: zero out uninitialized idiag_{src,dst} fields
[ Upstream commit b1aac815c0891fe4a55a6b0b715910142227700f ]
Jakub reported while working with nlmon netlink sniffer that parts of
the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6.
That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3].
In fact, it seems that we can leak 6 * sizeof(u32) bytes of kernel [slab]
memory through this. At least, in udp_dump_one(), we allocate a skb in ...
rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL);
... and then pass that to inet_sk_diag_fill() that puts the whole struct
inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0],
r->id.idiag_dst[0] and leave the rest untouched:
r->id.idiag_src[0] = inet->inet_rcv_saddr;
r->id.idiag_dst[0] = inet->inet_daddr;
struct inet_diag_msg embeds struct inet_diag_sockid, which is correctly and
fully filled out in the IPv6 case, but not for IPv4.
So just zero them out by using plain memset (for this little amount of
bytes it's probably not worth the extra check for idiag_family == AF_INET).
Similarly, fix also other places where we fill that out.
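The zero-then-fill pattern described above can be sketched in user space as follows. struct sockid_sketch is a simplified stand-in for the address part of struct inet_diag_sockid, and fill_ipv4_sockid is a hypothetical name; neither is taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the address fields of struct inet_diag_sockid. */
struct sockid_sketch {
    uint32_t src[4];
    uint32_t dst[4];
};

/*
 * Fill an IPv4 socket id. Both arrays are zeroed first so that words
 * 1..3, which IPv4 never uses, cannot carry stale memory to the reader.
 */
static void fill_ipv4_sockid(struct sockid_sketch *id,
                             uint32_t saddr, uint32_t daddr)
{
    assert(id != NULL);
    memset(id->src, 0, sizeof(id->src));
    memset(id->dst, 0, sizeof(id->dst));
    id->src[0] = saddr;
    id->dst[0] = daddr;
}
```

Without the two memset calls, the six unused 32-bit words would keep whatever the buffer held before, which is the 6 * sizeof(u32) leak the message describes.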
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Timo Teräs [Mon, 16 Dec 2013 09:02:09 +0000 (11:02 +0200)]
ip_gre: fix msg_name parsing for recvfrom/recvmsg
[ Upstream commit 0e3da5bb8da45890b1dc413404e0f978ab71173e ]
ipgre_header_parse() needs to parse the tunnel's ip header and it
uses mac_header to locate the iphdr. This got broken when gre tunneling
was refactored as mac_header is no longer updated to point to iphdr.
Introduce skb_pop_mac_header() helper to do the mac_header assignment
and use it in ipgre_rcv() to fix msg_name parsing.
Bug introduced in commit c54419321455 (GRE: Refactor GRE tunneling code.)
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sasha Levin [Fri, 13 Dec 2013 15:54:22 +0000 (10:54 -0500)]
net: unix: allow bind to fail on mutex lock
[ Upstream commit 37ab4fa7844a044dc21fde45e2a0fc2f3c3b6490 ]
This is similar to the set_peek_off patch where calling bind while the
socket is stuck in unix_dgram_recvmsg() will block and cause a hung task
spew after a while.
This is also the last place that did a straightforward mutex_lock(), so
there shouldn't be any more of these patches.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hannes Frederic Sowa [Fri, 13 Dec 2013 14:12:27 +0000 (15:12 +0100)]
ipv6: fix illegal mac_header comparison on 32bit
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jason Wang [Fri, 13 Dec 2013 09:21:27 +0000 (17:21 +0800)]
netvsc: don't flush peers notifying work during setting mtu
[ Upstream commit 50dc875f2e6e2e04aed3b3033eb0ac99192d6d02 ]
There's a possible deadlock if we flush the peers notifying work during setting
mtu:
[ 22.991149] ======================================================
[ 22.991173] [ INFO: possible circular locking dependency detected ]
[ 22.991198] 3.10.0-54.0.1.el7.x86_64.debug #1 Not tainted
[ 22.991219] -------------------------------------------------------
[ 22.991243] ip/974 is trying to acquire lock:
[ 22.991261] ((&(&net_device_ctx->dwork)->work)){+.+.+.}, at: [<ffffffff8108af95>] flush_work+0x5/0x2e0
[ 22.991307]
but task is already holding lock:
[ 22.991330] (rtnl_mutex){+.+.+.}, at: [<ffffffff81539deb>] rtnetlink_rcv+0x1b/0x40
[ 22.991367]
which lock already depends on the new lock.
[ 22.991398]
the existing dependency chain (in reverse order) is:
[ 22.991426]
-> #1 (rtnl_mutex){+.+.+.}:
[ 22.991449] [<ffffffff810dfdd9>] __lock_acquire+0xb19/0x1260
[ 22.991477] [<ffffffff810e0d12>] lock_acquire+0xa2/0x1f0
[ 22.991501] [<ffffffff81673659>] mutex_lock_nested+0x89/0x4f0
[ 22.991529] [<ffffffff815392b7>] rtnl_lock+0x17/0x20
[ 22.991552] [<ffffffff815230b2>] netdev_notify_peers+0x12/0x30
[ 22.991579] [<ffffffffa0340212>] netvsc_send_garp+0x22/0x30 [hv_netvsc]
[ 22.991610] [<ffffffff8108d251>] process_one_work+0x211/0x6e0
[ 22.991637] [<ffffffff8108d83b>] worker_thread+0x11b/0x3a0
[ 22.991663] [<ffffffff81095e5d>] kthread+0xed/0x100
[ 22.991686] [<ffffffff81681c6c>] ret_from_fork+0x7c/0xb0
[ 22.991715]
-> #0 ((&(&net_device_ctx->dwork)->work)){+.+.+.}:
[ 22.991715] [<ffffffff810de817>] check_prevs_add+0x967/0x970
[ 22.991715] [<ffffffff810dfdd9>] __lock_acquire+0xb19/0x1260
[ 22.991715] [<ffffffff810e0d12>] lock_acquire+0xa2/0x1f0
[ 22.991715] [<ffffffff8108afde>] flush_work+0x4e/0x2e0
[ 22.991715] [<ffffffff8108e1b5>] __cancel_work_timer+0x95/0x130
[ 22.991715] [<ffffffff8108e303>] cancel_delayed_work_sync+0x13/0x20
[ 22.991715] [<ffffffffa03404e4>] netvsc_change_mtu+0x84/0x200 [hv_netvsc]
[ 22.991715] [<ffffffff815233d4>] dev_set_mtu+0x34/0x80
[ 22.991715] [<ffffffff8153bc2a>] do_setlink+0x23a/0xa00
[ 22.991715] [<ffffffff8153d054>] rtnl_newlink+0x394/0x5e0
[ 22.991715] [<ffffffff81539eac>] rtnetlink_rcv_msg+0x9c/0x260
[ 22.991715] [<ffffffff8155cdd9>] netlink_rcv_skb+0xa9/0xc0
[ 22.991715] [<ffffffff81539dfa>] rtnetlink_rcv+0x2a/0x40
[ 22.991715] [<ffffffff8155c41d>] netlink_unicast+0xdd/0x190
[ 22.991715] [<ffffffff8155c807>] netlink_sendmsg+0x337/0x750
[ 22.991715] [<ffffffff8150d219>] sock_sendmsg+0x99/0xd0
[ 22.991715] [<ffffffff8150d63e>] ___sys_sendmsg+0x39e/0x3b0
[ 22.991715] [<ffffffff8150eba2>] __sys_sendmsg+0x42/0x80
[ 22.991715] [<ffffffff8150ebf2>] SyS_sendmsg+0x12/0x20
[ 22.991715] [<ffffffff81681d19>] system_call_fastpath+0x16/0x1b
This is because we hold the rtnl_lock() before ndo_change_mtu() and try to flush
the work in netvsc_change_mtu(); in the meantime, netdev_notify_peers() may be
called from the worker and also try to take the rtnl_lock. As a result, the
flush never completes. Solve this by not canceling and flushing the work.
This is safe because the transmission done by NETDEV_NOTIFY_PEERS is
synchronized with the netif_tx_disable() called by netvsc_change_mtu().
Reported-by: Yaju Cao <yacao@redhat.com>
Tested-by: Yaju Cao <yacao@redhat.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Nat Gurumoorthy [Mon, 9 Dec 2013 18:43:21 +0000 (10:43 -0800)]
tg3: Initialize REG_BASE_ADDR at PCI config offset 120 to 0
[ Upstream commit 388d3335575f4c056dcf7138a30f1454e2145cd8 ]
The new tg3 driver leaves REG_BASE_ADDR (PCI config offset 120)
uninitialized. From power on reset this register may have garbage in it. The
Register Base Address register defines the device local address of a
register. The data pointed to by this location is read or written using
the Register Data register (PCI config offset 128). When REG_BASE_ADDR has
garbage any read or write of Register Data Register (PCI 128) will cause the
PCI bus to lock up. The TCO watchdog will fire and bring down the system.
Signed-off-by: Nat Gurumoorthy <natg@google.com>
Acked-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sasha Levin [Sat, 7 Dec 2013 22:26:27 +0000 (17:26 -0500)]
net: unix: allow set_peek_off to fail
[ Upstream commit 12663bfc97c8b3fdb292428105dd92d563164050 ]
unix_dgram_recvmsg() will hold the readlock of the socket until recv
is complete.
At the same time, we may try to setsockopt(SO_PEEK_OFF), which will hang until
unix_dgram_recvmsg() completes (which can take a while) without allowing
us to break out of it, triggering a hung task spew.
Instead, allow set_peek_off to fail; this way userspace will not hang.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changli Gao [Sun, 8 Dec 2013 14:36:56 +0000 (09:36 -0500)]
net: drop_monitor: fix the value of maxattr
[ Upstream commit d323e92cc3f4edd943610557c9ea1bb4bb5056e8 ]
maxattr in genl_family should be used to save the max attribute
type, but not the max command type. Drop monitor doesn't support
any attributes, so we should leave it as zero.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hannes Frederic Sowa [Sat, 7 Dec 2013 02:33:45 +0000 (03:33 +0100)]
ipv6: don't count addrconf generated routes against gc limit
[ Upstream commit a3300ef4bbb1f1e33ff0400e1e6cf7733d988f4f ]
Brett Ciphery reported that new ipv6 addresses failed to get installed
because the addrconf generated dsts were counted against the dst gc
limit. We don't need to count those routes, just as we currently don't count
administratively added routes.
Because the max_addresses check enforces a limit on unbounded address
generation first in case someone plays with router advertisements, we
are still safe here.
Reported-by: Brett Ciphery <brett.ciphery@windriver.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Daniel Borkmann [Fri, 6 Dec 2013 10:36:15 +0000 (11:36 +0100)]
packet: fix send path when running with proto == 0
[ Upstream commit 66e56cd46b93ef407c60adcac62cf33b06119d50 ]
Commit e40526cb20b5 introduced a cached dev pointer, that gets
hooked into register_prot_hook(), __unregister_prot_hook() to
update the device used for the send path.
We need to fix this up, as otherwise this will not work with
sockets created with protocol = 0, plus with sll_protocol = 0
passed via sockaddr_ll when doing the bind.
So instead, assign the pointer directly. The compiler can inline
these helper functions automagically.
While at it, also assume the cached dev fast-path as likely(),
and document this variant of socket creation as it seems it is
not widely used (seems not even the author of TX_RING was aware
of that in his reference example [1]). Tested with the reproducer
from e40526cb20b5.
[1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example
Fixes: e40526cb20b5 ("packet: fix use after free race in send path when dev is released")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Salam Noureddine <noureddine@aristanetworks.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Andrey Vagin [Thu, 5 Dec 2013 14:36:21 +0000 (18:36 +0400)]
virtio: delete napi structures from netdev before releasing memory
[ Upstream commit d4fb84eefe5164f6a6ea51d0a9e26280c661a0dd ]
free_netdev calls netif_napi_del too, but it's too late, because napi
structures are placed on vi->rq. netif_napi_add() is called from
virtnet_alloc_queues.
general protection fault: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables virtio_balloon pcspkr virtio_net(-) i2c_pii
CPU: 1 PID: 347 Comm: rmmod Not tainted 3.13.0-rc2+ #171
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800b779c420 ti: ffff8800379e0000 task.ti: ffff8800379e0000
RIP: 0010:[<ffffffff81322e19>] [<ffffffff81322e19>] __list_del_entry+0x29/0xd0
RSP: 0018:ffff8800379e1dd0 EFLAGS: 00010a83
RAX: 6b6b6b6b6b6b6b6b RBX: ffff8800379c2fd0 RCX: dead000000200200
RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000001 RDI: ffff8800379c2fd0
RBP: ffff8800379e1dd0 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8800379c2f90
R13: ffff880037839160 R14: 0000000000000000 R15: 00000000013352f0
FS: 00007f1400e34740(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f464124c763 CR3: 00000000b68cf000 CR4: 00000000000006e0
Stack:
ffff8800379e1df0 ffffffff8155beab 6b6b6b6b6b6b6b2b ffff8800378391c0
ffff8800379e1e18 ffffffff8156499b ffff880037839be0 ffff880037839d20
ffff88003779d3f0 ffff8800379e1e38 ffffffffa003477c ffff88003779d388
Call Trace:
[<ffffffff8155beab>] netif_napi_del+0x1b/0x80
[<ffffffff8156499b>] free_netdev+0x8b/0x110
[<ffffffffa003477c>] virtnet_remove+0x7c/0x90 [virtio_net]
[<ffffffff813ae323>] virtio_dev_remove+0x23/0x80
[<ffffffff813f62ef>] __device_release_driver+0x7f/0xf0
[<ffffffff813f6ca0>] driver_detach+0xc0/0xd0
[<ffffffff813f5f28>] bus_remove_driver+0x58/0xd0
[<ffffffff813f72ec>] driver_unregister+0x2c/0x50
[<ffffffff813ae65e>] unregister_virtio_driver+0xe/0x10
[<ffffffffa0036942>] virtio_net_driver_exit+0x10/0x6ce [virtio_net]
[<ffffffff810d7cf2>] SyS_delete_module+0x172/0x220
[<ffffffff810a732d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff810f5d4c>] ? __audit_syscall_entry+0x9c/0xf0
[<ffffffff81677f69>] system_call_fastpath+0x16/0x1b
Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00
RIP [<ffffffff81322e19>] __list_del_entry+0x29/0xd0
RSP <ffff8800379e1dd0>
---[ end trace d5931cd3f87c9763 ]---
Fixes: 986a4f4d452d (virtio_net: multiqueue support)
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jason Wang [Wed, 11 Dec 2013 05:08:34 +0000 (13:08 +0800)]
macvtap: signal truncated packets
[ Upstream commit ce232ce01d61b184202bb185103d119820e1260c ]
macvtap_put_user() never returns a value greater than the iov length, which in
fact bypasses the truncation check in macvtap_recvmsg(). Fix this by always
returning the size of the packet plus the possible vlan header so that the
truncation check works.
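The idea of reporting the full packet size so the caller can detect truncation can be sketched in user space as below. put_packet and was_truncated are hypothetical names, and the vlan header that the real fix also accounts for is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical copy helper: copies at most buf_len bytes into buf but
 * reports the full packet length, not the number of bytes copied.
 */
static size_t put_packet(char *buf, size_t buf_len,
                         const char *pkt, size_t pkt_len)
{
    size_t copied = pkt_len < buf_len ? pkt_len : buf_len;

    assert(buf != NULL && pkt != NULL);
    memcpy(buf, pkt, copied);
    return pkt_len;
}

/* Caller-side check: the packet was cut short if it did not fit. */
static bool was_truncated(size_t reported_len, size_t buf_len)
{
    return reported_len > buf_len;
}
```

If put_packet returned the copied count instead, was_truncated could never be true, which mirrors the bypassed check described in the message.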
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Zhi Yong Wu [Fri, 6 Dec 2013 06:16:51 +0000 (14:16 +0800)]
tun: update file current position
[ Upstream commit d0b7da8afa079ffe018ab3e92879b7138977fc8f ]
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Zhi Yong Wu [Fri, 6 Dec 2013 06:16:50 +0000 (14:16 +0800)]
macvtap: update file current position
[ Upstream commit e6ebc7f16ca1434a334647aa56399c546be4e64b ]
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vlad Yasevich [Tue, 26 Nov 2013 17:37:12 +0000 (12:37 -0500)]
macvtap: Do not double-count received packets
[ Upstream commit 006da7b07bc4d3a7ffabad17cf639eec6849c9dc ]
Currently macvlan will count received packets after calling each
vlan's receive handler. Macvtap attempts to count the packet
yet again when the user reads the packet from the tap socket.
This code doesn't do this consistently either. Remove the
counting from macvtap and let only macvlan count received
packets.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Venkat Venkatsubra [Mon, 2 Dec 2013 23:41:39 +0000 (15:41 -0800)]
rds: prevent BUG_ON triggered on congestion update to loopback
[ Upstream commit 18fc25c94eadc52a42c025125af24657a93638c0 ]
After a congestion update on a local connection, when rds_ib_xmit returns
fewer bytes than there are in the message, rds_send_xmit calls
rds_ib_xmit again with an offset that causes BUG_ON(off & RDS_FRAG_SIZE)
to trigger.
For a 4Kb PAGE_SIZE rds_ib_xmit returns min(8240,4096)=4096 when actually
the message contains 8240 bytes. rds_send_xmit thinks there is more to send
and calls rds_ib_xmit again with a data offset "off" of 4096-48(rds header)
=4048 bytes thus hitting the BUG_ON(off & RDS_FRAG_SIZE) [RDS_FRAG_SIZE=4k].
The commit 6094628bfd94323fc1cea05ec2c6affd98c18f7f
"rds: prevent BUG_ON triggering on congestion map updates" introduced
this regression. That change was addressing the triggering of a different
BUG_ON in rds_send_xmit() on PowerPC architecture with 64Kbytes PAGE_SIZE:
    BUG_ON(ret != 0 &&
           conn->c_xmit_sg == rm->data.op_nents);
This was the sequence it was going through:
(rds_ib_xmit)
    /* Do not send cong updates to IB loopback */
    if (conn->c_loopback
        && rm->m_inc.i_hdr.h_flags & RDS_FLAG_CONG_BITMAP) {
            rds_cong_map_updated(conn->c_fcong, ~(u64) 0);
            return sizeof(struct rds_header) + RDS_CONG_MAP_BYTES;
    }
rds_ib_xmit returns 8240
rds_send_xmit:
    c_xmit_data_off = 0 + 8240 - 48 (rds header accounted only the first time)
                    = 8192
    c_xmit_data_off < 65536 (sg->length), so calls rds_ib_xmit again
rds_ib_xmit returns 8240
rds_send_xmit:
    c_xmit_data_off = 8192 + 8240 = 16432, calls rds_ib_xmit again
    and so on (c_xmit_data_off 24672,32912,41152,49392,57632)
rds_ib_xmit returns 8240
On this iteration this sequence causes the BUG_ON in rds_send_xmit:
    while (ret) {
            tmp = min_t(int, ret, sg->length - conn->c_xmit_data_off);
            [tmp = 65536 - 57632 = 7904]
            conn->c_xmit_data_off += tmp;
            [c_xmit_data_off = 57632 + 7904 = 65536]
            ret -= tmp;
            [ret = 8240 - 7904 = 336]
            if (conn->c_xmit_data_off == sg->length) {
                    conn->c_xmit_data_off = 0;
                    sg++;
                    conn->c_xmit_sg++;
                    BUG_ON(ret != 0 &&
                           conn->c_xmit_sg == rm->data.op_nents);
                    [c_xmit_sg = 1, rm->data.op_nents = 1]
What the current fix does:
Since the congestion update over loopback is not actually transmitted
as a message, all that rds_ib_xmit needs to do is let the caller think
the full message has been transmitted and not return partial bytes.
It will return 8240 (RDS_CONG_MAP_BYTES+48) when PAGE_SIZE is 4Kb.
And 64Kb+48 when page size is 64Kb.
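The offset arithmetic in the 64Kb walk-through can be replayed with a small user-space model. This is only a sketch of the accounting quoted from rds_send_xmit, assuming a single scatterlist entry; walk_one_sg and struct xmit_walk are hypothetical names, and the 48-byte header size is taken from the text.

```c
#include <assert.h>

#define RDS_HEADER_BYTES 48u /* rds header size, as stated in the message */

struct xmit_walk {
    unsigned int calls;    /* rds_ib_xmit calls until the sg entry is used up */
    unsigned int leftover; /* bytes of the last return value still unaccounted */
};

/*
 * Model of the data-offset accounting for one sg entry of sg_len bytes
 * when every rds_ib_xmit call reports ret_per_call bytes.
 */
static struct xmit_walk walk_one_sg(unsigned int sg_len,
                                    unsigned int ret_per_call)
{
    struct xmit_walk w = { 0, 0 };
    unsigned int off = 0;

    assert(sg_len > 0 && ret_per_call > RDS_HEADER_BYTES);
    for (;;) {
        unsigned int ret = ret_per_call;
        unsigned int room, tmp;

        /* The header is accounted for only on the first call. */
        if (w.calls++ == 0)
            ret -= RDS_HEADER_BYTES;
        room = sg_len - off;
        tmp = ret < room ? ret : room;
        off += tmp;
        ret -= tmp;
        if (off == sg_len) {
            w.leftover = ret;
            return w;
        }
    }
}
```

With sg_len 65536 and 8240 bytes per call, the model needs eight calls and ends with 336 bytes left over while no sg entry remains, which is the condition the BUG_ON checks. With 65536 + 48 bytes per call it finishes in one call with nothing left over.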
Reported-by: Josh Hunt <joshhunt00@gmail.com>
Tested-by: Honggang Li <honli@redhat.com>
Acked-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Mon, 2 Dec 2013 16:51:13 +0000 (08:51 -0800)]
net: do not pretend FRAGLIST support
[ Upstream commit 28e24c62ab3062e965ef1b3bcc244d50aee7fa85 ]
Few network drivers really support frag_list; those that do are virtual drivers.
Some drivers wrongly advertise the NETIF_F_FRAGLIST feature.
If an skb with a frag_list is given to them, the packet on the wire will be
corrupt.
Remove this flag, as the core networking stack will make sure to
provide packets that can be sent without corruption.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Cc: Anirudha Sarangi <anirudh@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kamala R [Mon, 2 Dec 2013 14:25:21 +0000 (19:55 +0530)]
IPv6: Fixed support for blackhole and prohibit routes
[ Upstream commit 7150aede5dd241539686e17d9592f5ebd28a2cda ]
The behaviour of blackhole and prohibit routes has been corrected by setting
the input and output pointers of the dst variable appropriately. For
blackhole routes they are set to dst_discard; for prohibit routes they are
set to ip6_pkt_prohibit and ip6_pkt_prohibit_out respectively.
ipv6: ip6_pkt_prohibit(_out) should not depend on
CONFIG_IPV6_MULTIPLE_TABLES
We need ip6_pkt_prohibit(_out) available without
CONFIG_IPV6_MULTIPLE_TABLES
Signed-off-by: Kamala R <kamala@aristanetworks.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Nestor Lopez Casado [Thu, 18 Jul 2013 13:21:30 +0000 (06:21 -0700)]
HID: Revert "Revert "HID: Fix logitech-dj: missing Unifying device issue""
commit c63e0e370028d7e4033bd40165f18499872b5183 upstream.
This reverts commit 8af6c08830b1ae114d1a8b548b1f8b056e068887.
This patch re-adds the workaround introduced by 596264082f10dd4
which was reverted by 8af6c08830b1ae114.
The original patch 596264 was needed to overcome a situation where
the hid-core would drop incoming reports while probe() was being
executed.
This issue was solved by c849a6143bec520af which added
hid_device_io_start() and hid_device_io_stop() that enable a specific
hid driver to opt-in for input reports while its probe() is being
executed.
Commit a9dd22b730857347 modified hid-logitech-dj so as to use the
functionality added to hid-core. Having done that, workaround 596264
was no longer necessary and was reverted by 8af6c08.
We now encounter a different problem that ends up 'again' thwarting
the Unifying receiver enumeration. The problem is time and usb controller
dependent. Occasionally the reports sent to the usb receiver to start
the paired devices enumeration fail with -EPIPE and the receiver never
gets to enumerate the paired devices.
With dcd9006b1b053c7b1c the problem was "hidden" as the call to the usb
driver became asynchronous and nothing was catching the error from the
failing URB.
As the root cause for this failing SET_REPORT is not understood yet
(possibly a race in the usb controller drivers or a problem with the
Unifying receiver), reintroducing this workaround solves the problem.
Overall what this workaround does is: If an input report from an
unknown device is received, then a (re)enumeration is performed.
related bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1194649
Signed-off-by: Nestor Lopez Casado <nlopezcasad@logitech.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kuninori Morimoto [Thu, 18 Apr 2013 06:40:57 +0000 (23:40 -0700)]
gpio-rcar: R-Car GPIO IRQ share interrupt
commit c234962b808f289237a40e4ce5fc1c8066d1c9d0 upstream.
R-Car H1 and Gen2 GPIO interrupts are assigned per GPIO domain,
but Gen1 E1/M1 GPIO interrupts are shared by all GPIO domains.
The gpio-rcar driver needs the IRQF_SHARED flag for these.
This patch was tested on the Bock-W board.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Magnus Damm [Wed, 18 Sep 2013 20:01:16 +0000 (15:01 -0500)]
clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
commit 2199a5574b6d94b9ca26c6345356f45ec60fef8b upstream.
Update the STI driver by setting cpu_possible_mask to make EMEV2
SMP work as expected together with the ARM broadcast timer.
This breakage was introduced by:
f7db706 ARM: 7674/1: smp: Avoid dummy clockevent being preferred over real hardware clock-event
Without this fix SMP operation is broken on EMEV2 since no
broadcast timer interrupts trigger on the secondary CPU cores.
Signed-off-by: Magnus Damm <damm@opensource.se>
Tested-by: Simon Horman <horms+renesas@verge.net.au>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Axel Lin [Mon, 6 May 2013 09:03:32 +0000 (17:03 +0800)]
irqchip: renesas-irqc: Fix irqc_probe error handling
commit dfaf820a13ec160f06556e08dab423818ba87f14 upstream.
The code in the goto err3 path is wrong because it will call free_irq() with k == 0,
which means it does free_irq(p->irq[-1].requested_irq, &p->irq[-1]);
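A common way to write such an unwind path so that index -1 is never touched is to pre-decrement the counter in the loop condition. The sketch below models that pattern in user space; fake_request, fake_free, and request_all are hypothetical stand-ins and do not reproduce the driver's code.

```c
#include <assert.h>

#define MAX_IRQS 8

static int requested[MAX_IRQS]; /* 1 while the hypothetical IRQ slot is held */

/* Stand-in for request_irq(): fails for the slot given by fail_at. */
static int fake_request(int idx, int fail_at)
{
    if (idx == fail_at)
        return -1;
    requested[idx] = 1;
    return 0;
}

/* Stand-in for free_irq(): asserts that the index is in range. */
static void fake_free(int idx)
{
    assert(idx >= 0 && idx < MAX_IRQS);
    requested[idx] = 0;
}

/* Requests n slots; on failure releases exactly the slots already held. */
static int request_all(int n, int fail_at)
{
    int k;

    for (k = 0; k < n; k++) {
        if (fake_request(k, fail_at) < 0)
            goto unwind;
    }
    return 0;
unwind:
    /* Pre-decrement: when k == 0 the loop body never runs. */
    while (--k >= 0)
        fake_free(k);
    return -1;
}
```

When the very first request fails, k is 0 on entry to the unwind path and nothing is freed, so the out-of-range access described above cannot occur.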
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Greg Kroah-Hartman [Thu, 9 Jan 2014 20:25:15 +0000 (12:25 -0800)]
Linux 3.10.26
Nobuhiro Iwamatsu [Thu, 2 Jan 2014 20:58:53 +0000 (12:58 -0800)]
sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
commit ad70b029d2c678386384bd72c7fa2705c449b518 upstream.
min_low_pfn and max_low_pfn are used in the pfn_valid macro when
CONFIG_FLATMEM is defined. When a function that uses pfn_valid is used in a
driver module, max_low_pfn and min_low_pfn are undefined, and the build
fails.
ERROR: "min_low_pfn" [drivers/block/aoe/aoe.ko] undefined!
ERROR: "max_low_pfn" [drivers/block/aoe/aoe.ko] undefined!
make[2]: *** [__modpost] Error 1
make[1]: *** [modules] Error 2
This patch fixes this problem.
Signed-off-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@gmail.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Whitney [Mon, 6 Jan 2014 19:00:23 +0000 (14:00 -0500)]
ext4: fix bigalloc regression
commit d0abafac8c9162f39c4f6b2f8141b772a09b3770 upstream.
Commit f5a44db5d2 introduced a regression on filesystems created with
the bigalloc feature (cluster size > blocksize). It causes xfstests
generic/006 and /013 to fail with an unexpected JBD2 failure and
transaction abort that leaves the test file system in a read only state.
Other xfstests run on bigalloc file systems are likely to fail as well.
The cause is the accidental use of a cluster mask where a cluster
offset was needed in ext4_ext_map_blocks().
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Catalin Marinas [Fri, 29 Nov 2013 10:56:14 +0000 (10:56 +0000)]
arm64: Use Normal NonCacheable memory for writecombine
commit 4f00130b70e5eee813cc7bc298e0f3fdf79673cc upstream.
This provides better performance compared to Device GRE and also allows
unaligned accesses. Such memory is intended to be used with standard RAM
(e.g. framebuffers) and not I/O.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Catalin Marinas [Wed, 1 May 2013 15:34:22 +0000 (16:34 +0100)]
arm64: Do not flush the D-cache for anonymous pages
commit 7249b79f6b4cc3c2aa9138dca52e535a4c789107 upstream.
The D-cache on AArch64 is VIPT non-aliasing, so there is no need to
flush it for anonymous pages.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Will Deacon <will.deacon@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Catalin Marinas [Wed, 1 May 2013 11:23:05 +0000 (12:23 +0100)]
arm64: Avoid cache flushing in flush_dcache_page()
commit b5b6c9e9149d8a7c3f1d7b9d0c046c6184e1dd17 upstream.
The flush_dcache_page() function is called when the kernel has modified a
page cache page. Since the D-cache on AArch64 does not have aliases
this function can simply mark the page as dirty for later flushing via
set_pte_at()/__sync_icache_dcache() if the page is executable (to ensure
the I-D cache coherency).
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Will Deacon <will.deacon@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mark Rutland [Tue, 26 Mar 2013 13:41:35 +0000 (13:41 +0000)]
ARM: KVM: arch_timers: zero CNTVOFF upon return to host
commit f793c23ebbe5afd1cabf4a42a3a297022213756f upstream.
To use the virtual counters from the host, we need to ensure that
CNTVOFF doesn't change unexpectedly. When we change to a guest, we
replace the host's CNTVOFF, but we don't restore it when returning to
the host.
As the host sets CNTVOFF to zero, and never changes it, we can simply
zero CNTVOFF when returning to the host. This patch adds said zeroing to
the return to host path.
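A small sketch of why a non-zero CNTVOFF matters to the host, assuming the architectural relationship that the virtual count is the physical count minus CNTVOFF. The cpu_model structure and the enter_guest()/return_to_host() helpers are hypothetical and only model the register value, not the real world-switch code.

```c
#include <stdint.h>

/* Hypothetical model holding only the offset register. */
struct cpu_model {
	uint64_t cntvoff;
};

/* The virtual count is the physical count minus CNTVOFF. */
uint64_t virtual_count(uint64_t physical_count, uint64_t cntvoff)
{
	return physical_count - cntvoff;
}

/* Entering a guest installs the guest's offset. */
void enter_guest(struct cpu_model *cpu, uint64_t guest_offset)
{
	cpu->cntvoff = guest_offset;
}

/* Returning to the host zeroes the offset again. */
void return_to_host(struct cpu_model *cpu)
{
	cpu->cntvoff = 0;
}
```

With the offset zeroed on return, the host's virtual and physical counts agree again, which is the property the commit message asks for.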
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Acked-by: Christoffer Dall <cdall@cs.columbia.edu>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Marc Zyngier [Wed, 30 Jan 2013 18:17:49 +0000 (18:17 +0000)]
ARM: hyp: initialize CNTVOFF to zero
commit 0af0b189abf73d232af782df2f999235cd2fed7f upstream.
In order to be able to use the virtual counter in a safe way,
make sure it is initialized to zero before dropping to SVC.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Dave Martin <dave.martin@linaro.org>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mark Rutland [Wed, 30 Jan 2013 17:51:26 +0000 (17:51 +0000)]
clocksource: arch_timer: use virtual counters
commit 0d651e4e65e96989f72236bf83bd4c6e55eb6ce4 upstream.
Switching between reading the virtual or physical counters is
problematic, as some core code wants a view of time before we're fully
set up. Using a function pointer and switching the source after the
first read can make time appear to go backwards, and having a check in
the read function is an unfortunate block on what we want to be a fast
path.
Instead, this patch makes us always use the virtual counters. If we're a
guest, or don't have hyp mode, we'll use the virtual timers, and as such
don't care about CNTVOFF as long as it doesn't change in such a way as
to make time appear to travel backwards. As the guest will use the
virtual timers, a (potential) KVM host must use the physical timers
(which can wake up the host even if they fire while a guest is
executing), and hence a host must have CNTVOFF set to zero so as to have
a consistent view of time between the physical timers and virtual
counters.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Catalin Marinas [Mon, 2 Sep 2013 15:33:54 +0000 (16:33 +0100)]
arm64: Remove unused cpu_name ascii in arch/arm64/mm/proc.S
commit f3a1d7d53dccf51959aec16b574617cc6bfeca09 upstream.
This string has been moved to arch/arm64/kernel/cputable.c.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Catalin Marinas [Thu, 14 Nov 2013 15:15:37 +0000 (15:15 +0000)]
arm64: dts: Reserve the memory used for secondary CPU release address
commit df503ba7f653c590b475ab80bde788edf5af70d5 upstream.
With the spin-table SMP booting method, secondary CPUs poll a location
passed in the DT. The foundation-v8.dts file doesn't have this memory
reserved and there is a risk of Linux using it before secondary CPUs are
started.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
AKASHI Takahiro [Thu, 3 Oct 2013 05:47:44 +0000 (06:47 +0100)]
arm64: check for number of arguments in syscall_get/set_arguments()
commit 7b22c03536a539142f931815528d55df455ffe2d upstream.
In ftrace_syscall_enter(),
    syscall_get_arguments(..., 0, n, ...)
        if (i == 0) { <handle orig_x0> ...; n--;}
        memcpy(..., n * sizeof(args[0]));
If the number of arguments (n) is zero and the argument index (i) is also
zero in syscall_get_arguments(), no arguments should be copied by memcpy().
Otherwise 'n--' can wrap around to a very large positive number and an
unexpected amount of data will be copied. Tracing system calls that take no
arguments, such as sync(void), may hit this case and eventually corrupt the
system.
This patch fixes the issue both in syscall_get_arguments() and
syscall_set_arguments().
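The guard can be sketched in user space as follows, assuming n is an unsigned count. The regs_model structure, MAX_ARGS and get_arguments() are hypothetical simplifications, and the extra bounds check is an addition for this sketch rather than part of the patch.

```c
#include <string.h>

#define MAX_ARGS 6

/* Hypothetical, simplified register file: orig_x0 plus slots x0..x5. */
struct regs_model {
	unsigned long orig_x0;
	unsigned long regs[MAX_ARGS];
};

/*
 * Copy n arguments starting at index i into args.
 * Returns the number of arguments copied.
 */
unsigned int get_arguments(const struct regs_model *r, unsigned int i,
			   unsigned int n, unsigned long *args)
{
	unsigned int copied = 0;

	if (n == 0)		/* the guard: without it, n-- below would wrap */
		return 0;
	if (i + n > MAX_ARGS)	/* sketch-only bounds check */
		return 0;

	if (i == 0) {
		args[0] = r->orig_x0;
		args++;
		i++;
		n--;
		copied++;
	}
	memcpy(args, &r->regs[i], n * sizeof(args[0]));
	return copied + n;
}
```

Returning early for n == 0 means the decrement is only reached when at least one argument was requested.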
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jiang Liu [Fri, 27 Sep 2013 08:04:41 +0000 (09:04 +0100)]
arm64: fix possible invalid FPSIMD initialization state
commit 6db83cea1c975b9a102e17def7d2795814e1ae2b upstream.
If a context switch happens while fpsimd_flush_thread() is executing,
stale values in the FPSIMD registers will be saved into the current thread's
fpsimd_state by fpsimd_thread_switch(). That may leave the new process with
an invalid initialization state, so disable preemption while executing
fpsimd_flush_thread().
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Feng Kan [Tue, 23 Jul 2013 17:52:31 +0000 (18:52 +0100)]
arm64: Change kernel stack size to 16K
commit 845ad05ec31e0f3872a321e10dbeaf872022632c upstream.
Written by Catalin Marinas, tested by APM on storm platform. This is needed
because of the failures encountered when running SpecWeb benchmark test.
Signed-off-by: Feng Kan <fkan@apm.com>
Acked-by: Kumar Sankaran <ksankaran@apm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mark Rutland [Tue, 9 Jul 2013 14:16:06 +0000 (15:16 +0100)]
arm64: virt: ensure visibility of __boot_cpu_mode
commit 82b2f495fba338d1e3098dde1df54944a9c19751 upstream.
Secondary CPUs write to __boot_cpu_mode with caches disabled, and thus a
cached value of __boot_cpu_mode may be incoherent with that in memory.
This could lead to a failure to detect mismatched boot modes.
This patch adds flushing to ensure that writes by secondaries to
__boot_cpu_mode are made visible before we test against it.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <cdall@cs.columbia.edu>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Catalin Marinas [Fri, 19 Jul 2013 14:08:15 +0000 (15:08 +0100)]
arm64: Only enable local interrupts after the CPU is marked online
commit 53ae3acd4390ffeecb3a11dbd5be347b5a3d98f2 upstream.
There is a slight chance that (timer) interrupts are triggered before a
secondary CPU has been marked online with implications on softirq thread
affinity.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Josh Durgin [Thu, 5 Sep 2013 00:57:31 +0000 (17:57 -0700)]
rbd: fix error handling from rbd_snap_name()
commit da6a6b63978d45f9ae582d1f362f182012da3a22 upstream.
rbd_snap_name() calls rbd_dev_v{1,2}_snap_name() depending on the
format of the image. The format 1 version returns NULL on error, which
is handled by the caller. The format 2 version returns an ERR_PTR,
which the caller of rbd_snap_name() does not expect.
Fortunately this is unlikely to occur in practice because
rbd_snap_id_by_name() is called before rbd_snap_name(). This would hit
similar errors to rbd_snap_name() (like the snapshot not existing) and
return early, so rbd_snap_name() would not hit an error unless the
snapshot was removed between the two calls or memory was exhausted.
Use an ERR_PTR in rbd_dev_v1_snap_name() so that the specific error
can be propagated, and it is consistent with rbd_dev_v2_snap_name().
Handle the ERR_PTR in the only rbd_snap_name() caller.
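The difference between a NULL return and an error pointer can be sketched in user space. The sketch_* helpers below are a hypothetical emulation of the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() idea (an errno encoded in the top 4095 pointer values), and snap_name_lookup()/use_snap_name() are invented for illustration only.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* User-space emulation: encode a negative errno in a pointer value. */
static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical lookup: returns a name, or an encoded errno on failure. */
const char *snap_name_lookup(int snap_id)
{
	static const char *names[] = { "snap-a", "snap-b" };

	if (snap_id < 0 || snap_id >= 2)
		return sketch_err_ptr(-ENOENT);
	return names[snap_id];
}

/* Caller side: a NULL test alone would miss the error-pointer case. */
long use_snap_name(int snap_id)
{
	const char *name = snap_name_lookup(snap_id);

	if (sketch_is_err(name))
		return sketch_ptr_err(name);
	return 0;
}
```

Because an error pointer is non-NULL, a caller written for the NULL convention would treat it as a valid name, which is the mismatch the commit removes by using one convention for both formats.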
Suggested-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Josh Durgin [Fri, 30 Aug 2013 02:16:42 +0000 (19:16 -0700)]
rbd: ignore unmapped snapshots that no longer exist
commit efadc98aab674153709cc357ba565f04e3164fcd upstream.
This prevents erroring out while adding a device when a snapshot
unrelated to the current mapping is deleted between reading the
snapshot context and reading the snapshot names. If the mapped
snapshot name is not found an error still occurs as usual.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Josh Durgin [Fri, 30 Aug 2013 00:26:31 +0000 (17:26 -0700)]
rbd: fix use-after free of rbd_dev->disk
commit 9875201e10496612080e7d164acc8f625c18725c upstream.
Removing a device deallocates the disk, unschedules the watch, and
finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
from the watch callback, updates the disk size and rbd_dev
structure. With no locking between them, rbd_dev_refresh() may use the
device or rbd_dev after they've been freed.
To fix this, check whether RBD_DEV_FLAG_REMOVING is set before
updating the disk size in rbd_dev_refresh(). In order to prevent a
race where rbd_dev_refresh() is already revalidating the disk when
rbd_remove() is called, move the call to rbd_bus_del_dev() after the
watch is unregistered and all notifies are complete. It's safe to
defer deleting this structure because no new requests can be submitted
once the RBD_DEV_FLAG_REMOVING is set, since the device cannot be
opened.
Fixes: http://tracker.ceph.com/issues/5636
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Josh Durgin [Fri, 30 Aug 2013 00:36:03 +0000 (17:36 -0700)]
rbd: make rbd_obj_notify_ack() synchronous
commit 20e0af67ce88c657d0601977b9941a2256afbdaa upstream.
The only user of rbd_obj_notify_ack() is rbd_watch_cb(). It was used
asynchronously with no tracking of when the notify ack completes, so
it may still be in progress when the osd_client is shut down. This
results in a BUG() since the osd client assumes no requests are in
flight when it stops. Since all notifies are flushed before the
osd_client is stopped, waiting for the notify ack to complete before
returning from the watch callback ensures there are no notify acks in
flight during shutdown.
Rename rbd_obj_notify_ack() to rbd_obj_notify_ack_sync() to reflect
its new synchronous nature.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Josh Durgin [Fri, 30 Aug 2013 00:31:03 +0000 (17:31 -0700)]
rbd: complete notifies before cleaning up osd_client and rbd_dev
commit 9abc59908e0c5f983aaa91150da32d5b62cf60b7 upstream.
To ensure rbd_dev is not used after it's released, flush all pending
notify callbacks before calling rbd_dev_image_release(). No new
notifies can be added to the queue at this point because the watch has
already been unregistered with the osd_client.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Josh Durgin [Thu, 29 Aug 2013 04:43:09 +0000 (21:43 -0700)]
libceph: add function to ensure notifies are complete
commit dd935f44a40f8fb02aff2cc0df2269c92422df1c upstream.
Without a way to flush the osd client's notify workqueue, a watch
event that is unregistered could continue receiving callbacks
indefinitely.
Unregistering the event simply means no new notifies are added to the
queue, but there may still be events in the queue that will call the
watch callback for the event. If the queue is flushed after the event
is unregistered, the caller can be sure no more watch callbacks will
occur for the canceled watch.
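The ordering described above (unregister first, then flush) can be modelled with a single-threaded sketch. Everything here is hypothetical: watch_event, notify_queue and the three helpers only imitate the idea of a notify workqueue and are not the libceph API.

```c
#define QUEUE_LEN 8

/* Hypothetical, single-threaded model of a watch event. */
struct watch_event {
	int registered;		/* new notifies are accepted only while set */
	int callbacks_run;	/* how many times the watch callback fired */
};

/* Hypothetical model of the notify queue. */
struct notify_queue {
	struct watch_event *pending[QUEUE_LEN];
	int count;
};

/* Queue a notify; returns 1 if queued, 0 if refused. */
int queue_notify(struct notify_queue *q, struct watch_event *ev)
{
	if (!ev->registered || q->count == QUEUE_LEN)
		return 0;
	q->pending[q->count++] = ev;
	return 1;
}

/* After this, no new notifies are queued for the event. */
void unregister_event(struct watch_event *ev)
{
	ev->registered = 0;
}

/* Run every callback that is already queued; returns how many ran. */
int flush_notifies(struct notify_queue *q)
{
	int ran = 0;

	while (q->count > 0) {
		q->pending[--q->count]->callbacks_run++;
		ran++;
	}
	return ran;
}
```

Once the flush returns, the queue is empty and the unregistered event refuses new entries, so no further callbacks can run for it in this model.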
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Josh Durgin [Thu, 29 Aug 2013 00:08:10 +0000 (17:08 -0700)]
rbd: fix null dereference in dout
commit c35455791c1131e7ccbf56ea6fbdd562401c2ce2 upstream.
The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but
the dout() always dereferences it. Move this to another dout() protected
by a check that order is non-NULL.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Josh Durgin [Tue, 27 Aug 2013 21:45:46 +0000 (14:45 -0700)]
rbd: fix buffer size for writes to images with snapshots
commit 03507db631c94a48e316c7f638ffb2991544d617 upstream.
rbd_osd_req_create() needs to know the snapshot context size to create
a buffer large enough to send it with the message front. It gets this
from the img_request, which was not set for the obj_request yet. This
resulted in trying to write past the end of the front payload, hitting
this BUG:
libceph: BUG_ON(p > msg->front.iov_base + msg->front.iov_len);
Fix this by associating the obj_request with its img_request
immediately after it's created, before the osd request is created.
Fixes: http://tracker.ceph.com/issues/5760
Suggested-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
majianpeng [Wed, 21 Aug 2013 07:02:51 +0000 (15:02 +0800)]
ceph: allow sync_read/write return partial successed size of read/write.
commit ee7289bfadda5f4ef60884547ebc9989c8fb314a upstream.
sync_read/write may perform operations on multiple stripes. If one of them
hits an error, return the size that was successfully transferred so far
rather than an error value. The exception is a write that hits -EOLDSNAPC:
in that case the whole write is retried.
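A minimal sketch of the "return what already succeeded" rule, assuming each stripe operation returns a byte count or a negative errno. The stripe_op_fn type, sync_io() and the fail_at() test operation are hypothetical, and the -EOLDSNAPC retry for writes is left out.

```c
#include <errno.h>

/*
 * Hypothetical per-stripe operation: returns bytes transferred (> 0)
 * or a negative errno.
 */
typedef long (*stripe_op_fn)(int stripe, void *ctx);

/*
 * Run the operation over nstripes stripes. If a stripe fails after some
 * data was already transferred, report the partial size; report the error
 * only when nothing succeeded.
 */
long sync_io(int nstripes, stripe_op_fn op, void *ctx)
{
	long done = 0;
	int i;

	for (i = 0; i < nstripes; i++) {
		long ret = op(i, ctx);

		if (ret < 0)
			return done ? done : ret;
		done += ret;
	}
	return done;
}

/* Example operation: fails with -EIO on the stripe number stored in ctx. */
long fail_at(int stripe, void *ctx)
{
	return stripe == *(int *)ctx ? -EIO : 4096;
}
```

A caller that receives a short positive count can then decide whether to continue from that offset, which matches how short reads and writes are usually reported to user space.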
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
majianpeng [Tue, 6 Aug 2013 08:20:38 +0000 (16:20 +0800)]
ceph: fix bugs about handling short-read for sync read mode.
commit 02ae66d8b229708fd94b764f6c17ead1c7741fcf upstream.
cephfs . show_layout
>layout.data_pool: 0
>layout.object_size: 4194304
>layout.stripe_unit: 4194304
>layout.stripe_count: 1
TestA:
>dd if=/dev/urandom of=test bs=1M count=2 oflag=direct
>dd if=/dev/urandom of=test bs=1M count=2 seek=4 oflag=direct
>dd if=test of=/dev/null bs=6M count=1 iflag=direct
The messages from func striped_read are:
ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
ceph: file.c:350 : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT
ceph: file.c:381 : zero tail 4194304
ceph: file.c:390 : striped_read returns 6291456
The hole in the file is from 2M to 4M, but the code actually zeroes the last
4M, including the final 2M area which isn't a hole.
Using this patch, the messages are:
ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
ceph: file.c:358 : zero gap 2097152 to 4194304
ceph: file.c:350 : striped_read 4194304~2097152 (read 4194304) got 2097152
ceph: file.c:384 : striped_read returns 6291456
TestB:
>echo majianpeng > test
>dd if=test of=/dev/null bs=2M count=1 iflag=direct
The messages are:
ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
ceph: file.c:350 : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT
ceph: file.c:390 : striped_read returns 11
In this case it performed one more striped_read, which is pointless.
Using this patch, the messages are:
ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
ceph: file.c:384 : striped_read returns 11
Big thanks to Yan Zheng for the patch.
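The "zero the gap, not the tail" behaviour can be illustrated with a toy model. This is only a sketch under simplifying assumptions: objects are 4 bytes, the file fits in two objects, and the request is clamped to the file size up front. The file_model structure and striped_read_model() are hypothetical and much simpler than the real striped_read().

```c
#include <stddef.h>
#include <string.h>

#define OBJ_SIZE 4u	/* hypothetical object size, tiny for illustration */
#define NOBJ 2u

struct file_model {
	size_t size;				/* logical file size */
	size_t present[NOBJ];			/* bytes stored at the start of each object */
	unsigned char data[NOBJ][OBJ_SIZE];
};

static size_t min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Read up to want bytes from one object; may return fewer (short read). */
static size_t read_object(const struct file_model *f, size_t obj, size_t off,
			  size_t want, unsigned char *buf)
{
	size_t avail = off < f->present[obj] ? f->present[obj] - off : 0;
	size_t got = min_sz(want, avail);

	memcpy(buf, &f->data[obj][off], got);
	return got;
}

/* Assumes f->size <= NOBJ * OBJ_SIZE. Returns the number of bytes produced. */
size_t striped_read_model(const struct file_model *f, size_t pos, size_t len,
			  unsigned char *buf)
{
	size_t start = pos;
	size_t left;

	if (pos >= f->size)
		return 0;
	left = min_sz(len, f->size - pos);	/* never read past end of file */

	while (left > 0) {
		size_t obj = pos / OBJ_SIZE;
		size_t off = pos % OBJ_SIZE;
		size_t want = min_sz(left, OBJ_SIZE - off);
		size_t got = read_object(f, obj, off, want, buf);

		/* Short read inside the file: zero only the gap in this object. */
		if (got < want)
			memset(buf + got, 0, want - got);

		pos += want;
		buf += want;
		left -= want;
	}
	return pos - start;
}
```

With 2 bytes present in each object and a 6-byte file, the model returns the first object's data, two zero bytes for the hole, and then the second object's data, and a file shorter than one object is satisfied by a single object read.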
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter [Thu, 15 Aug 2013 05:58:59 +0000 (08:58 +0300)]
libceph: create_singlethread_workqueue() doesn't return ERR_PTRs
commit dbcae088fa660086bde6e10d63bb3c9264832d85 upstream.
create_singlethread_workqueue() returns NULL on error, and it doesn't
return ERR_PTRs.
I tweaked the error handling a little to be consistent with earlier in
the function.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter [Thu, 15 Aug 2013 05:52:48 +0000 (08:52 +0300)]
libceph: potential NULL dereference in ceph_osdc_handle_map()
commit b72e19b9225d4297a18715b0998093d843d170fa upstream.
There are two places where we read "nr_maps". If both of them are set to
zero then we would hit a NULL dereference here.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter [Thu, 15 Aug 2013 05:51:58 +0000 (08:51 +0300)]
libceph: fix error handling in handle_reply()
commit 1874119664dafda3ef2ed9b51b4759a9540d4a1a upstream.
We've tried to fix the error paths in this function before, but there
is still a hidden goto in the ceph_decode_need() macro which goes to the
wrong place. We need to release the "req" and unlock a mutex before
returning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
majianpeng [Fri, 2 Aug 2013 10:14:48 +0000 (18:14 +0800)]
ceph: Add check returned value on func ceph_calc_ceph_pg.
commit 2fbcbff1d6b9243ef71c64a8ab993bc3c7bb7af1 upstream.
ceph_calc_ceph_pg() may fail, so check its return value.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter [Tue, 23 Jul 2013 13:48:01 +0000 (16:48 +0300)]
ceph: cleanup types in striped_read()
commit 688bac461ba3e9d221a879ab40b687f5d7b5b19c upstream.
We pass in a u64 value for "len" and then immediately truncate away the
upper 32 bits.
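The effect of narrowing a 64-bit length can be shown with a short sketch. The function names are hypothetical, and uint32_t is used for the narrow type so the conversion is well defined in standard C.

```c
#include <stdint.h>

/* Buggy shape: the 64-bit length is narrowed, dropping the upper 32 bits. */
uint32_t remaining_u32(uint64_t len, uint64_t done)
{
	uint32_t left = (uint32_t)len;

	return left - (uint32_t)done;
}

/* Fixed shape: keep the full 64-bit width throughout. */
uint64_t remaining_u64(uint64_t len, uint64_t done)
{
	return len - done;
}
```

For a 5 GiB request the narrow version sees only 1 GiB, because 5 GiB is 0x140000000 and its low 32 bits are 0x40000000.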
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Nathaniel Yazdani [Mon, 5 Aug 2013 04:04:30 +0000 (21:04 -0700)]
ceph: fix null pointer dereference
commit c338c07c51e3106711fad5eb599e375eadb6855d upstream.
When register_session() is given an out-of-range argument for mds,
ceph_mdsmap_get_addr() will return a null pointer, which would be given to
ceph_con_open() & be dereferenced, causing a kernel oops. This fixes bug #4685
in the Ceph bug tracker <http://tracker.ceph.com/issues/4685>.
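A minimal sketch of validating the index before the address is used, assuming the lookup returns NULL for an out-of-range value as described above. The mdsmap_model and mds_addr_model types and both functions are hypothetical, and the real fix may reject the argument at a different point.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical address record and map. */
struct mds_addr_model {
	int port;
};

struct mdsmap_model {
	int max_mds;
	struct mds_addr_model addrs[4];
};

/* Returns NULL when mds is out of range. */
const struct mds_addr_model *map_get_addr(const struct mdsmap_model *m, int mds)
{
	if (mds < 0 || mds >= m->max_mds)
		return NULL;
	return &m->addrs[mds];
}

/* Check the lookup result before dereferencing it. */
int open_session(const struct mdsmap_model *m, int mds, int *port_out)
{
	const struct mds_addr_model *addr = map_get_addr(m, mds);

	if (!addr)
		return -EINVAL;
	*port_out = addr->port;
	return 0;
}
```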
Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Yan, Zheng [Mon, 24 Jun 2013 06:41:27 +0000 (14:41 +0800)]
libceph: call r_unsafe_callback when unsafe reply is received
commit 61c5d6bf7074ee32d014dcdf7698dc8c59eb712d upstream.
We can't use !req->r_sent to check if OSD request is sent for the
first time, because __cancel_request() zeros req->r_sent
when OSD map changes. Rather than adding a new variable to struct
ceph_osd_request to indicate if it's sent for the first time, we
can call the unsafe callback only when unsafe OSD reply is received.
If OSD's first reply is safe, just skip calling the unsafe callback.
The purpose of unsafe callback is adding unsafe request to a list,
so that fsync(2) can wait for the safe reply. fsync(2) doesn't need
to wait for a write(2) that hasn't returned yet. So it's OK to add
request to the unsafe list when the first OSD reply is received.
(ceph_sync_write() returns after receiving the first OSD reply)
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sasha Levin [Mon, 1 Jul 2013 22:33:39 +0000 (18:33 -0400)]
ceph: avoid accessing invalid memory
commit 5446429630257f4723829409337a26c076907d5d upstream.
When mounting ceph with a dev name that starts with a slash, ceph
would attempt to access the character before that slash. Since we
don't actually own that byte of memory, we would trigger an
invalid access:
[ 43.499934] BUG: unable to handle kernel paging request at ffff880fa3a97fff
[ 43.500984] IP: [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300
[ 43.501491] PGD 743b067 PUD 10283c4067 PMD 10282a6067 PTE 8000000fa3a97060
[ 43.502301] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 43.503006] Dumping ftrace buffer:
[ 43.503596] (ftrace buffer empty)
[ 43.504046] CPU: 0 PID: 10879 Comm: mount Tainted: G W 3.10.0-sasha #1129
[ 43.504851] task: ffff880fa625b000 ti: ffff880fa3412000 task.ti: ffff880fa3412000
[ 43.505608] RIP: 0010:[<ffffffff818f3884>] [<ffffffff818f3884>] parse_mount_options$
[ 43.506552] RSP: 0018:ffff880fa3413d08 EFLAGS: 00010286
[ 43.507133] RAX: ffff880fa3a98000 RBX: ffff880fa3a98000 RCX: 0000000000000000
[ 43.507893] RDX: ffff880fa3a98001 RSI: 000000000000002f RDI: ffff880fa3a98000
[ 43.508610] RBP: ffff880fa3413d58 R08: 0000000000001f99 R09: ffff880fa3fe64c0
[ 43.509426] R10: ffff880fa3413d98 R11: ffff880fa38710d8 R12: ffff880fa3413da0
[ 43.509792] R13: ffff880fa3a97fff R14: 0000000000000000 R15: ffff880fa3413d90
[ 43.509792] FS: 00007fa9c48757e0(0000) GS:ffff880fd2600000(0000) knlGS:000000000000$
[ 43.509792] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 43.509792] CR2: ffff880fa3a97fff CR3: 0000000fa3bb9000 CR4: 00000000000006b0
[ 43.509792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 43.509792] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 43.509792] Stack:
[ 43.509792] 0000e5180000000e ffffffff85ca1900 ffff880fa38710d8 ffff880fa3413d98
[ 43.509792] 0000000000000120 0000000000000000 ffff880fa3a98000 0000000000000000
[ 43.509792] ffffffff85cf32a0 0000000000000000 ffff880fa3413dc8 ffffffff818f3c72
[ 43.509792] Call Trace:
[ 43.509792] [<ffffffff818f3c72>] ceph_mount+0xa2/0x390
[ 43.509792] [<ffffffff81226314>] ? pcpu_alloc+0x334/0x3c0
[ 43.509792] [<ffffffff81282f8d>] mount_fs+0x8d/0x1a0
[ 43.509792] [<ffffffff812263d0>] ? __alloc_percpu+0x10/0x20
[ 43.509792] [<ffffffff8129f799>] vfs_kern_mount+0x79/0x100
[ 43.509792] [<ffffffff812a224d>] do_new_mount+0xcd/0x1c0
[ 43.509792] [<ffffffff812a2e8d>] do_mount+0x15d/0x210
[ 43.509792] [<ffffffff81220e55>] ? strndup_user+0x45/0x60
[ 43.509792] [<ffffffff812a2fdd>] SyS_mount+0x9d/0xe0
[ 43.509792] [<ffffffff83fd816c>] tracesys+0xdd/0xe2
[ 43.509792] Code: 4c 8b 5d c0 74 0a 48 8d 50 01 49 89 14 24 eb 17 31 c0 48 83 c9 ff $
[ 43.509792] RIP [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300
[ 43.509792] RSP <ffff880fa3413d08>
[ 43.509792] CR2: ffff880fa3a97fff
[ 43.509792] ---[ end trace 22469cd81e93af51 ]---
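The out-of-bounds read described above can be avoided by checking the position before stepping back one character. The following user-space sketch assumes a device name of the form "address:/path"; server_part_len() is a hypothetical helper, not the actual parse_mount_options() code.

```c
#include <errno.h>
#include <string.h>

/*
 * Returns the length of the address part that precedes the ':' separator,
 * or -EINVAL if the name does not have the expected shape.
 */
long server_part_len(const char *dev_name)
{
	const char *slash = strchr(dev_name, '/');
	const char *end = slash ? slash : dev_name + strlen(dev_name);

	/* The guard: never step back past the start of the buffer. */
	if (end == dev_name)
		return -EINVAL;

	end--;			/* expect the ':' separator here */
	if (*end != ':')
		return -EINVAL;
	return (long)(end - dev_name);
}
```

A name such as "/mnt" puts the first slash at index 0, so without the guard the code would read the byte before the string, which is the access reported in the trace.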
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
majianpeng [Tue, 25 Jun 2013 06:48:19 +0000 (14:48 +0800)]
ceph: Free mdsc if alloc mdsc->mdsmap failed.
commit fb3101b6f0db9ae3f35dc8e6ec908d0af8cdf12e upstream.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sage Weil [Sun, 9 Jun 2013 15:40:39 +0000 (08:40 -0700)]
rbd: fix a couple warnings
commit e976cad0f0dbe5440a4ca38e29e1f932d9319125 upstream.
gcc isn't quite smart enough and generates these warnings:
drivers/block/rbd.c: In function 'rbd_img_request_fill':
drivers/block/rbd.c:1266:22: warning: 'bio_list' may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/block/rbd.c:2186:14: note: 'bio_list' was declared here
drivers/block/rbd.c:2247:10: warning: 'pages' may be used uninitialized in this function [-Wmaybe-uninitialized]
even though they are initialized for their respective code paths.
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Yan, Zheng [Sun, 2 Jun 2013 10:40:23 +0000 (18:40 +0800)]
libceph: fix truncate size calculation
commit ccca4e37b1a912da3db68aee826557ea66145273 upstream.
Check the "not truncated yet" case.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Yan, Zheng [Fri, 31 May 2013 07:54:44 +0000 (15:54 +0800)]
libceph: fix safe completion
commit eb845ff13a44477f8a411baedbf11d678b9daf0a upstream.
handle_reply() calls complete_request() only if the first OSD reply
has ONDISK flag.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Fri, 31 May 2013 22:40:44 +0000 (17:40 -0500)]
rbd: protect against concurrent unmaps
commit 82a442d239695a242c4d584464c9606322cd02aa upstream.
Make sure two concurrent unmap operations on the same rbd device
won't collide, by only proceeding with the removal and cleanup of a
device if it is not already underway.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Fri, 31 May 2013 20:17:01 +0000 (15:17 -0500)]
rbd: set removing flag while holding list lock
commit 751cc0e3cfabdda87c4c21519253c6751e97a8d4 upstream.
When unmapping a device, its id is supplied, and that is used to
look up which rbd device should be unmapped. Looking up the
device involves searching the rbd device list while holding
a spinlock that protects access to that list.
Currently all of this is done under protection of the control lock,
but that protection is going away soon. To ensure the rbd_dev is
still valid (still on the list) while setting its REMOVING flag, do
so while still holding the list lock. To do so, get rid of
__rbd_get_dev(), and open code what it did in the one place it
was used.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Thu, 23 May 2013 01:54:25 +0000 (20:54 -0500)]
rbd: flush dcache after zeroing page data
commit e215605417b87732c6debf65da6d953016a1e5bc upstream.
Neither zero_bio_chain() nor zero_pages() contains a call to flush
caches after zeroing a portion of a page. This can cause problems
on architectures that have caches that allow virtual address
aliasing.
This resolves:
http://tracker.ceph.com/issues/4777
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex Elder [Thu, 23 May 2013 01:54:25 +0000 (20:54 -0500)]
libceph: add lingering request reference when registered
commit 96e4dac66f69d28af2b736e723364efbbdf9fdee upstream.
When an osd request is set to linger, the osd client holds onto the
request so it can be re-submitted following certain osd map changes.
The osd client holds a reference to the request until it is
unregistered. This is used by rbd for watch requests.
Currently, the reference is taken when the request is marked with
the linger flag. This means that if an error occurs after that
time but before the request completes successfully, that
reference is leaked.
There's really no reason to take the reference until the request is
registered in the osd client's list of lingering requests, and
that only happens when the lingering (watch) request completes
successfully.
So take that reference only when it gets registered following
successful completion, and drop it (as before) when the request
gets unregistered. This avoids the reference problem on error
in rbd.
Rearrange ceph_osdc_unregister_linger_request() to avoid using
the request pointer after it may have been freed.
And hold an extra reference in kick_requests() while handling
a linger request that has not yet been registered, to ensure
it doesn't go away.
This resolves:
http://tracker.ceph.com/issues/3859
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Emil Goode [Tue, 28 May 2013 14:59:00 +0000 (16:59 +0200)]
ceph: improve error handling in ceph_mdsmap_decode
commit c213b50b7dcbf06abcfbf1e4eee5b76586718bd9 upstream.
This patch makes the following improvements to the error handling
in the ceph_mdsmap_decode function:
- Add a NULL check for return value from kcalloc
- Make use of the variable err
Signed-off-by: Emil Goode <emilgoode@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dinh Nguyen [Tue, 10 Dec 2013 18:49:18 +0000 (19:49 +0100)]
clocksource: dw_apb_timer_of: Fix read_sched_clock
commit 85dc6ee1237c8a4a7742e6abab96a20389b7d682 upstream.
The read_sched_clock should return the ~value because the clock is a
countdown implementation. read_sched_clock() should be the same as
__apbt_read_clocksource().
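As a minimal sketch of why the inversion is needed, assuming a 32-bit down-counting timer register (the helper name below is hypothetical and is not the driver's function): inverting the raw value turns a decreasing count into an increasing one, which is what a scheduler clock expects.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert the raw value of a 32-bit down-counting
 * timer into a value that increases over time.
 */
static inline uint32_t example_countdown_to_upcount(uint32_t raw)
{
    return ~raw;    /* same as 0xffffffff - raw */
}
```

A later reading of the hardware counter is smaller, so its inverted value is larger, and the resulting clock moves forward.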
Signed-off-by: Dinh Nguyen <dinguyen@altera.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Paul Moore [Tue, 10 Dec 2013 19:58:01 +0000 (14:58 -0500)]
selinux: process labeled IPsec TCP SYN-ACK packets properly in selinux_ip_postroute()
commit c0828e50485932b7e019df377a6b0a8d1ebd3080 upstream.
Due to difficulty in arriving at the proper security label for
TCP SYN-ACK packets in selinux_ip_postroute(), we need to check packets
while/before they are undergoing XFRM transforms instead of waiting
until afterwards so that we can determine the correct security label.
Reported-by: Janak Desai <Janak.Desai@gtri.gatech.edu>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Paul Moore [Tue, 10 Dec 2013 19:57:54 +0000 (14:57 -0500)]
selinux: look for IPsec labels on both inbound and outbound packets
commit 817eff718dca4e54d5721211ddde0914428fbb7c upstream.
Previously selinux_skb_peerlbl_sid() would only check for labeled
IPsec security labels on inbound packets. This patch enables it to
check both inbound and outbound traffic for labeled IPsec security
labels.
Reported-by: Janak Desai <Janak.Desai@gtri.gatech.edu>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Geert Uytterhoeven [Thu, 19 Dec 2013 01:08:48 +0000 (17:08 -0800)]
sh: always link in helper functions extracted from libgcc
commit 84ed8a99058e61567f495cc43118344261641c5f upstream.
E.g. landisk_defconfig, which has CONFIG_NTFS_FS=m:
ERROR: "__ashrdi3" [fs/ntfs/ntfs.ko] undefined!
For "lib-y", if no symbols in a compilation unit are referenced by other
units, the compilation unit will not be included in vmlinux. This
breaks modules that do reference those symbols.
Use "obj-y" instead to fix this.
http://kisskb.ellerman.id.au/kisskb/buildresult/8838077/
This doesn't fix all cases. There are others, e.g. udivsi3.
This is also not limited to sh, many architectures handle this in the
same way.
A simple solution is to unconditionally include all helper functions.
A more complex solution is to make the choice of "lib-y" or "obj-y" depend
on CONFIG_MODULES:
obj-$(CONFIG_MODULES) += ...
lib-y($CONFIG_MODULES) += ...
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Tested-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
Reviewed-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Stephen Boyd [Tue, 10 Dec 2013 23:19:03 +0000 (15:19 -0800)]
gpio: msm: Fix irq mask/unmask by writing bits instead of numbers
commit 4cc629b7a20945ce35628179180329b6bc9e552b upstream.
We should be writing bits here but instead we're writing the
numbers that correspond to the bits we want to write. Fix it by
wrapping the numbers in the BIT() macro. This fixes gpios acting
as interrupts.
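The difference between writing a bit number and writing the bit itself can be shown with a small sketch. The function names and the bit position used in the usage note are hypothetical and are not taken from the msm gpio driver; BIT() is defined locally to mirror the kernel macro.

```c
#include <stdint.h>

#define BIT(n) (1UL << (n))

/* Buggy pattern: ORs in the bit *number* (3 is 0b011), not the bit itself. */
static uint32_t example_set_by_number(uint32_t reg, unsigned int bit)
{
    return reg | bit;
}

/* Fixed pattern: ORs in the mask for that bit position. */
static uint32_t example_set_by_mask(uint32_t reg, unsigned int bit)
{
    return reg | (uint32_t)BIT(bit);
}
```

For bit position 3 the buggy variant sets bits 0 and 1 instead of bit 3, and for bit position 0 it sets nothing at all.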
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Roger Quadros [Thu, 5 Dec 2013 09:23:35 +0000 (11:23 +0200)]
gpio: twl4030: Fix regression for twl gpio LED output
commit f5837ec11f8cfa6d53ebc5806582771b2c9988c6 upstream.
Commit 0b2aa8be introduced a regression that causes failure
in setting LED GPO direction to OUT.
This causes USB host probe failures for Beagleboard C4.
platform usb_phy_gen_xceiv.2: Driver usb_phy_gen_xceiv requests probe deferral
hsusb2_vcc: Failed to request enable GPIO510: -22
reg-fixed-voltage reg-fixed-voltage.0.auto: Failed to register regulator: -22
reg-fixed-voltage: probe of reg-fixed-voltage.0.auto failed with error -22
direction_out/direction_in must return 0 if the operation succeeded.
Also, don't update direction flag and output data if twl4030_set_gpio_direction()
failed inside twl_direction_out();
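A minimal sketch of the two rules stated above, assuming a hypothetical GPIO state structure and a stand-in for the hardware access (none of these names come from the twl4030 driver): return exactly 0 on success, and leave the cached direction and output value untouched when the hardware call fails.

```c
#include <stdbool.h>

struct example_gpio {       /* hypothetical cached state */
    bool is_output;
    int  value;
};

/* Stand-in for the hardware access; returns 0 or a negative error code. */
static int example_set_direction_hw(bool fail)
{
    return fail ? -22 : 0;
}

static int example_direction_out(struct example_gpio *g, int value, bool hw_fails)
{
    int ret = example_set_direction_hw(hw_fails);

    if (ret)
        return ret;         /* do not touch cached state on failure */

    g->is_output = true;
    g->value = value;
    return 0;               /* success is reported as exactly 0 */
}
```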
Signed-off-by: Roger Quadros <rogerq@ti.com>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Theodore Ts'o [Mon, 9 Dec 2013 02:12:59 +0000 (21:12 -0500)]
jbd2: don't BUG but return ENOSPC if a handle runs out of space
commit f6c07cad081ba222d63623d913aafba5586c1d2c upstream.
If a handle runs out of space, we currently stop the kernel with a BUG
in jbd2_journal_dirty_metadata(). This makes it hard to figure out
what might be going on. So return an error of ENOSPC, so we can let
the file system layer figure out what is going on, to make it more
likely we can get useful debugging information. This should make it
easier to debug problems such as the one which was reported by:
https://bugzilla.kernel.org/show_bug.cgi?id=44731
The only two callers of this function are ext4_handle_dirty_metadata()
and ocfs2_journal_dirty(). The ocfs2 function will trigger a
BUG_ON(), which means there will be no change in behavior. The ext4
function will call ext4_error_inode() which will print the useful
debugging information and then handle the situation using ext4's error
handling mechanisms (i.e., which might mean halting the kernel or
remounting the file system read-only).
Also, since both file systems already call WARN_ON(), drop the WARN_ON
from jbd2_journal_dirty_metadata() to avoid two stack traces from
being displayed.
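The change in behavior can be sketched as follows. This is a simplified illustration with a hypothetical handle structure, not the jbd2 code: instead of stopping the machine when the handle has no credits left, the function reports -ENOSPC and lets the caller decide how to react.

```c
#include <errno.h>

struct example_handle {     /* hypothetical; not the jbd2 handle_t */
    int buffer_credits;
};

/* Returns 0 on success, or -ENOSPC when the handle has run out of space. */
static int example_dirty_metadata(struct example_handle *h)
{
    if (h->buffer_credits <= 0)
        return -ENOSPC;     /* previously this condition would BUG */

    h->buffer_credits--;
    return 0;
}
```

A caller in the file system layer can then log its own context and apply its own error policy.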
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Martin Schwidefsky [Wed, 18 Dec 2013 13:36:18 +0000 (14:36 +0100)]
s390/3270: fix allocation of tty3270_screen structure
commit 36d9f4d3b68c7035ead3850dc85f310a579ed0eb upstream.
The tty3270_alloc_screen function is called from tty3270_install with
swapped arguments, the number of columns instead of rows and vice versa.
The number of rows is typically smaller than the number of columns which
makes the screen array too big but the individual cell arrays for the
lines too small. Creating lines longer than the number of rows will
clobber the memory after the end of the cell array.
The fix is simple, call tty3270_alloc_screen with the correct argument
order.
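To illustrate why the argument order matters, here is a small sketch with a hypothetical allocator (not the tty3270 code) that creates one line structure per row and one cell per column in each line. Calling it with the arguments swapped yields too many lines, each of them too short.

```c
#include <stdlib.h>

struct example_line {
    char *cells;
    int len;
};

/* Allocates `rows` lines with room for `cols` cells each; NULL on failure. */
static struct example_line *example_alloc_screen(unsigned int rows,
                                                 unsigned int cols)
{
    struct example_line *screen = calloc(rows, sizeof(*screen));
    unsigned int i;

    if (!screen)
        return NULL;
    for (i = 0; i < rows; i++) {
        screen[i].cells = calloc(cols, 1);
        if (!screen[i].cells) {
            while (i--)
                free(screen[i].cells);
            free(screen);
            return NULL;
        }
        screen[i].len = (int)cols;
    }
    return screen;
}

static void example_free_screen(struct example_line *screen, unsigned int rows)
{
    unsigned int i;

    if (!screen)
        return;
    for (i = 0; i < rows; i++)
        free(screen[i].cells);
    free(screen);
}
```

With a 24x80 screen, example_alloc_screen(24, 80) gives 24 lines of 80 cells, while example_alloc_screen(80, 24) gives 80 lines of only 24 cells, so writing an 80-character line would run past the end of the cell array.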
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vladimir Davydov [Thu, 2 Jan 2014 20:58:47 +0000 (12:58 -0800)]
memcg: fix memcg_size() calculation
commit 695c60830764945cf61a2cc623eb1392d137223e upstream.
The mem_cgroup structure contains nr_node_ids pointers to
mem_cgroup_per_node objects, not the objects themselves.
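The size calculation can be sketched with simplified, hypothetical structures (they only mimic the shape described above and are not the kernel definitions): a structure ending in an array of pointers needs room for the pointers, not for the objects they point to.

```c
#include <stddef.h>

struct example_per_node {
    long stats[64];
};

struct example_memcg {
    int id;
    struct example_per_node *nodeinfo[];    /* array of pointers */
};

/* Over-allocates: counts whole per-node objects. */
static size_t example_memcg_size_wrong(size_t nr_nodes)
{
    return sizeof(struct example_memcg) +
           nr_nodes * sizeof(struct example_per_node);
}

/* Counts only the pointers stored in the trailing array. */
static size_t example_memcg_size(size_t nr_nodes)
{
    return sizeof(struct example_memcg) +
           nr_nodes * sizeof(struct example_per_node *);
}
```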
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Steven Whitehouse [Wed, 18 Dec 2013 14:14:52 +0000 (14:14 +0000)]
GFS2: Fix incorrect invalidation for DIO/buffered I/O
commit dfd11184d894cd0a92397b25cac18831a1a6a5bc upstream.
In patch 209806aba9d540dde3db0a5ce72307f85f33468f we allowed
local deferred locks to be granted against a cached exclusive
lock. That opened up a corner case which this patch now
fixes.
The solution to the problem is to check whether we have cached
pages each time we do direct I/O and if so to unmap, flush
and invalidate those pages. Since the glock state machine
normally does that for us, mostly the code will be a no-op.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Steven Whitehouse [Fri, 6 Dec 2013 11:52:34 +0000 (11:52 +0000)]
GFS2: don't hold s_umount over blkdev_put
commit dfe5b9ad83a63180f358b27d1018649a27b394a9 upstream.
This is a GFS2 version of Tejun's patch:
4f331f01b9c43bf001d3ffee578a97a1e0633eac
vfs: don't hold s_umount over close_bdev_exclusive() call
In this case it's blkdev_put itself that is the issue and this
patch uses the same solution of dropping and retaking s_umount.
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dmitry Torokhov [Fri, 27 Dec 2013 01:44:29 +0000 (17:44 -0800)]
Input: allocate absinfo data when setting ABS capability
commit 28a2a2e1aedbe2d8b2301e6e0e4e63f6e4177aca upstream.
We need to make sure we allocate absinfo data when we are setting one of
EV_ABS/ABS_XXX capabilities, otherwise we may bomb when we try to emit this
event.
Tested-by: Paul Cercueil <pcercuei@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Naoya Horiguchi [Thu, 2 Jan 2014 20:58:51 +0000 (12:58 -0800)]
mm/memory-failure.c: transfer page count from head page to tail page after split thp
commit a3e0f9e47d5ef7858a26cc12d90ad5146e802d47 upstream.
Memory failures on thp tail pages cause kernel panic like below:
mce: [Hardware Error]: Machine check events logged
MCE exception done on CPU 7
BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: [<ffffffff811b7cd1>] dequeue_hwpoisoned_huge_page+0x131/0x1e0
PGD bae42067 PUD ba47d067 PMD 0
Oops: 0000 [#1] SMP
...
CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G M O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25
...
Call Trace:
me_huge_page+0x3e/0x50
memory_failure+0x4bb/0xc20
mce_process_work+0x3e/0x70
process_one_work+0x171/0x420
worker_thread+0x11b/0x3a0
? manage_workers.isra.25+0x2b0/0x2b0
kthread+0xe4/0x100
? kthread_create_on_node+0x190/0x190
ret_from_fork+0x7c/0xb0
? kthread_create_on_node+0x190/0x190
...
RIP dequeue_hwpoisoned_huge_page+0x131/0x1e0
CR2: 0000000000000058
The reasoning behind this problem is as follows:
- When we have a memory error on a thp tail page, the memory error
handler grabs a refcount of the head page to keep the thp under us.
- Before unmapping the error page from processes, we split the thp,
where page refcounts of both head and tail pages don't change.
- Then we call try_to_unmap() over the error page (which was a tail
page before). Since we didn't pin the error page to handle the memory
error, this error page is freed and removed from the LRU list.
- We never have the error page on the LRU list, so the first page state
check returns "unknown page," then we move to the second check
with the saved page flag.
- The saved page flag has PG_tail set, so the second page state check
returns "hugepage."
- We call me_huge_page() for the freed error page, then we hit the above panic.
The root cause is that we didn't move the refcount from the head page to
the tail page after splitting the thp. So this patch does that.
This panic was introduced by commit 524fca1e73 ("HWPOISON: fix
misjudgement of page_action() for errors on mlocked pages"). Note that we
did have the same refcount problem before this commit, but it was just
ignored because we had only first page state check which returned "unknown
page." The commit changed the refcount problem from "doesn't work" to
"kernel panic."
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Rik van Riel [Thu, 2 Jan 2014 20:58:46 +0000 (12:58 -0800)]
mm: fix use-after-free in sys_remap_file_pages
commit 4eb919825e6c3c7fb3630d5621f6d11e98a18b3a upstream.
remap_file_pages calls mmap_region, which may merge the VMA with other
existing VMAs, and free "vma". This can lead to a use-after-free bug.
Avoid the bug by remembering vm_flags before calling mmap_region, and
not trying to dereference vma later.
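The pattern of the fix can be sketched in plain C. The structure and functions below are hypothetical stand-ins, not the mm code: the value that is needed later is copied out before calling a function that may free the object, and the pointer is not dereferenced afterwards.

```c
#include <stdlib.h>

struct example_vma {
    unsigned long flags;
};

/* Stand-in for a call that may merge the object away and free it. */
static void example_merge(struct example_vma *vma, int do_free)
{
    if (do_free)
        free(vma);
}

static unsigned long example_remap(struct example_vma *vma, int merged)
{
    unsigned long flags = vma->flags;   /* snapshot while vma is valid */

    example_merge(vma, merged);
    /* vma must not be dereferenced from here on */
    return flags;
}
```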
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jianguo Wu [Thu, 19 Dec 2013 01:08:59 +0000 (17:08 -0800)]
mm/hugetlb: check for pte NULL pointer in __page_check_address()
commit 98398c32f6687ee1e1f3ae084effb4b75adb0747 upstream.
In __page_check_address(), if address's pud is not present,
huge_pte_offset() will return NULL, we should check the return value.
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: qiuxishi <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Joonsoo Kim [Thu, 19 Dec 2013 01:08:52 +0000 (17:08 -0800)]
mm/compaction: respect ignore_skip_hint in update_pageblock_skip
commit 6815bf3f233e0b10c99a758497d5d236063b010b upstream.
update_pageblock_skip() only fits compaction that tries to isolate by
pageblock unit. If isolate_migratepages_range() is called by CMA, it
tries to isolate regardless of pageblock unit and, because of
ignore_skip_hint, it doesn't consult get_pageblock_skip(). We should
also respect ignore_skip_hint in update_pageblock_skip() to avoid
setting wrong information.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mel Gorman [Thu, 19 Dec 2013 01:08:45 +0000 (17:08 -0800)]
mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates
commit af2c1401e6f9177483be4fad876d0073669df9df upstream.
According to documentation on barriers, stores issued before a LOCK can
complete after the lock implying that it's possible tlb_flush_pending
can be visible after a page table update. As per revised documentation,
this patch adds a smp_mb__before_spinlock to guarantee the correct
ordering.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Rik van Riel [Thu, 19 Dec 2013 01:08:44 +0000 (17:08 -0800)]
mm: fix TLB flush race between migration, and change_protection_range
commit 20841405940e7be0617612d521e206e4b6b325db upstream.
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A                   CPU B                   CPU C
                                                load TLB entry
make entry PTE/PMD_NUMA
                        fault on entry
                                                read/write old page
                        start migrating page
                        change PTE/PMD to new page
                                                read/write old page [*]
flush TLB
                                                reload TLB from new entry
                                                read/write new page
                                                lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Oleg Nesterov [Mon, 12 Aug 2013 16:14:00 +0000 (18:14 +0200)]
sched: fix the theoretical signal_wake_up() vs schedule() race
commit e0acd0a68ec7dbf6b7a81a87a867ebd7ac9b76c4 upstream.
This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock the code like
__set_current_state(TASK_INTERRUPTIBLE);
schedule();
can miss a signal. This is the special case of wait-for-condition,
it relies on try_to_wake_up/schedule interaction and thus it does
not need mb() between __set_current_state() and if(signal_pending).
However, this __set_current_state() can move into the critical
section protected by rq->lock, now that try_to_wake_up() takes
another lock we need to ensure that it can't be reordered with
"if (signal_pending(current))" check inside that section.
The patch is actually one-liner, it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does by the same reason.
We turn this wmb() into the new helper, smp_mb__before_spinlock(),
for better documentation and to allow the architectures to change
the default implementation.
While at it, kill smp_mb__after_lock(), it has no callers.
Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mel Gorman [Thu, 19 Dec 2013 01:08:39 +0000 (17:08 -0800)]
mm: numa: avoid unnecessary work on the failure path
commit eb4489f69f224356193364dc2762aa009738ca7f upstream.
If a PMD changes during a THP migration then migration aborts but the
failure path is doing more work than is necessary.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mel Gorman [Thu, 19 Dec 2013 01:08:38 +0000 (17:08 -0800)]
mm: numa: ensure anon_vma is locked to prevent parallel THP splits
commit c3a489cac38d43ea6dc4ac240473b44b46deecf7 upstream.
The anon_vma lock prevents parallel THP splits and any associated
complexity that arises when handling splits during THP migration. This
patch checks if the lock was successfully acquired and bails from THP
migration if it failed for any reason.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mel Gorman [Thu, 19 Dec 2013 01:08:34 +0000 (17:08 -0800)]
mm: clear pmd_numa before invalidating
commit 67f87463d3a3362424efcbe8b40e4772fd34fc61 upstream.
On x86, PMD entries are similar to _PAGE_PROTNONE protection and are
handled as NUMA hinting faults. The following two page table protection
bits are what defines them
_PAGE_NUMA:set _PAGE_PRESENT:clear
A PMD is considered present if any of the _PAGE_PRESENT, _PAGE_PROTNONE,
_PAGE_PSE or _PAGE_NUMA bits are set. If pmdp_invalidate encounters a
pmd_numa, it clears the present bit leaving _PAGE_NUMA which will be
considered not present by the CPU but present by pmd_present. The
existing caller of pmdp_invalidate should handle it but it's an
inconsistent state for a PMD. This patch keeps the state consistent
when calling pmdp_invalidate.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Rob Herring [Mon, 30 Dec 2013 01:37:43 +0000 (19:37 -0600)]
Revert "of/address: Handle #address-cells > 2 specially"
commit 13fcca8f25f4e9ce7f55da9cd353bb743236e212 upstream.
This reverts commit e38c0a1fbc5803cbacdaac0557c70ac8ca5152e7.
Nikita Yushchenko reports:
While trying to make freescale p2020ds and mpc8572ds boards working
with mainline kernel, I faced that commit e38c0a1f (Handle
Both these boards have uli1575 chip.
Corresponding part in device tree is something like
uli1575@0 {
reg = <0x0 0x0 0x0 0x0 0x0>;
#size-cells = <2>;
#address-cells = <3>;
ranges = <0x2000000 0x0 0x80000000
0x2000000 0x0 0x80000000
0x0 0x20000000
0x1000000 0x0 0x0
0x1000000 0x0 0x0
0x0 0x10000>;
isa@1e {
...
I.e. it has #address-cells = <3>
With commit e38c0a1f reverted, devices under uli1575 are registered
correctly, e.g. for rtc
OF: ** translation for device /pcie@ffe09000/pcie@0/uli1575@0/isa@1e/rtc@70 **
OF: bus is isa (na=2, ns=1) on /pcie@ffe09000/pcie@0/uli1575@0/isa@1e
OF: translating address: 00000001 00000070
OF: parent bus is default (na=3, ns=2) on /pcie@ffe09000/pcie@0/uli1575@0
OF: walking ranges...
OF: ISA map, cp=0, s=1000, da=70
OF: parent translation for: 01000000 00000000 00000000
OF: with offset: 70
OF: one level translation: 00000000 00000000 00000070
OF: parent bus is pci (na=3, ns=2) on /pcie@ffe09000/pcie@0
OF: walking ranges...
OF: default map, cp=a0000000, s=20000000, da=70
OF: default map, cp=0, s=10000, da=70
OF: parent translation for: 01000000 00000000 00000000
OF: with offset: 70
OF: one level translation: 01000000 00000000 00000070
OF: parent bus is pci (na=3, ns=2) on /pcie@ffe09000
OF: walking ranges...
OF: PCI map, cp=0, s=10000, da=70
OF: parent translation for: 01000000 00000000 00000000
OF: with offset: 70
OF: one level translation: 01000000 00000000 00000070
OF: parent bus is default (na=2, ns=2) on /
OF: walking ranges...
OF: PCI map, cp=0, s=10000, da=70
OF: parent translation for: 00000000 ffc10000
OF: with offset: 70
OF: one level translation: 00000000 ffc10070
OF: reached root node
With commit e38c0a1f in place, address translation fails:
OF: ** translation for device /pcie@ffe09000/pcie@0/uli1575@0/isa@1e/rtc@70 **
OF: bus is isa (na=2, ns=1) on /pcie@ffe09000/pcie@0/uli1575@0/isa@1e
OF: translating address: 00000001 00000070
OF: parent bus is default (na=3, ns=2) on /pcie@ffe09000/pcie@0/uli1575@0
OF: walking ranges...
OF: ISA map, cp=0, s=1000, da=70
OF: parent translation for: 01000000 00000000 00000000
OF: with offset: 70
OF: one level translation: 00000000 00000000 00000070
OF: parent bus is pci (na=3, ns=2) on /pcie@ffe09000/pcie@0
OF: walking ranges...
OF: default map, cp=a0000000, s=20000000, da=70
OF: default map, cp=0, s=10000, da=70
OF: not found !
Thierry Reding confirmed this commit was not needed after all:
"We ended up merging a different address representation for Tegra PCIe
and I've confirmed that reverting this commit doesn't cause any obvious
regressions. I think all other drivers in drivers/pci/host ended up
copying what we did on Tegra, so I wouldn't expect any other breakage
either."
There doesn't appear to be a simple way to support both behaviours, so
reverting this as nothing should be depending on the new behaviour.
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Rafael J. Wysocki [Tue, 31 Dec 2013 12:37:46 +0000 (13:37 +0100)]
intel_pstate: Fail initialization if P-state information is missing
commit 98a947abdd54e5de909bebadfced1696ccad30cf upstream.
If pstate.current_pstate is 0 after the initial
intel_pstate_get_cpu_pstates(), this means that we were unable to
obtain any useful P-state information and there is no reason to
continue, so free memory and return an error in that case.
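The check described above can be sketched like this. The structure, the helper names and the choice of -ENODEV are assumptions made for illustration and are not taken from the intel_pstate driver: if the initial read leaves the current P-state at 0, the allocation is released and an error is returned, so no later code divides by that value.

```c
#include <errno.h>
#include <stdlib.h>

struct example_cpu {        /* hypothetical per-CPU data */
    int current_pstate;
    int max_pstate;
};

/* Stand-in for reading P-state limits; 0 means nothing useful was read. */
static void example_get_pstates(struct example_cpu *cpu, int hw_value)
{
    cpu->current_pstate = hw_value;
    cpu->max_pstate = hw_value;
}

/* Returns 0 and stores the new object in *out, or a negative error code. */
static int example_init_cpu(struct example_cpu **out, int hw_value)
{
    struct example_cpu *cpu = calloc(1, sizeof(*cpu));

    *out = NULL;
    if (!cpu)
        return -ENOMEM;

    example_get_pstates(cpu, hw_value);
    if (!cpu->current_pstate) {
        free(cpu);          /* nothing usable: free memory and fail */
        return -ENODEV;     /* error code chosen for illustration */
    }

    *out = cpu;
    return 0;
}
```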
This fixes the following divide error occurring in a nested KVM
guest:
Intel P-state driver initializing.
Intel pstate controlling: cpu 0
cpufreq: __cpufreq_add_dev: ->get() failed
divide error: 0000 [#1] SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-0.rc4.git5.1.fc21.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff88001ea20000 ti: ffff88001e9bc000 task.ti: ffff88001e9bc000
RIP: 0010:[<ffffffff815c551d>] [<ffffffff815c551d>] intel_pstate_timer_func+0x11d/0x2b0
RSP: 0000:ffff88001ee03e18 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88001a454348 RCX: 0000000000006100
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88001ee03e38 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88001ea20000 R11: 0000000000000000 R12: 00000c0a1ea20000
R13: 1ea200001ea20000 R14: ffffffff815c5400 R15: ffff88001a454348
FS: 0000000000000000(0000) GS:ffff88001ee00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000000006f0
Stack:
fffffffb1a454390 ffffffff821a4500 ffff88001a454390 0000000000000100
ffff88001ee03ea8 ffffffff81083e9a ffffffff81083e15 ffffffff82d5ed40
ffffffff8258cc60 0000000000000000 ffffffff81ac39de 0000000000000000
Call Trace:
<IRQ>
[<ffffffff81083e9a>] call_timer_fn+0x8a/0x310
[<ffffffff81083e15>] ? call_timer_fn+0x5/0x310
[<ffffffff815c5400>] ? pid_param_set+0x130/0x130
[<ffffffff81084354>] run_timer_softirq+0x234/0x380
[<ffffffff8107aee4>] __do_softirq+0x104/0x430
[<ffffffff8107b5fd>] irq_exit+0xcd/0xe0
[<ffffffff81770645>] smp_apic_timer_interrupt+0x45/0x60
[<ffffffff8176efb2>] apic_timer_interrupt+0x72/0x80
<EOI>
[<ffffffff810e15cd>] ? vprintk_emit+0x1dd/0x5e0
[<ffffffff81757719>] printk+0x67/0x69
[<ffffffff815c1493>] __cpufreq_add_dev.isra.13+0x883/0x8d0
[<ffffffff815c14f0>] cpufreq_add_dev+0x10/0x20
[<ffffffff814a14d1>] subsys_interface_register+0xb1/0xf0
[<ffffffff815bf5cf>] cpufreq_register_driver+0x9f/0x210
[<ffffffff81fb19af>] intel_pstate_init+0x27d/0x3be
[<ffffffff81761e3e>] ? mutex_unlock+0xe/0x10
[<ffffffff81fb1732>] ? cpufreq_gov_dbs_init+0x12/0x12
[<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0
[<ffffffff8109dbf5>] ? parse_args+0x225/0x3f0
[<ffffffff81f64193>] kernel_init_freeable+0x1fc/0x287
[<ffffffff81f638d0>] ? do_early_param+0x88/0x88
[<ffffffff8174b530>] ? rest_init+0x150/0x150
[<ffffffff8174b53e>] kernel_init+0xe/0x130
[<ffffffff8176e27c>] ret_from_fork+0x7c/0xb0
[<ffffffff8174b530>] ? rest_init+0x150/0x150
Code: c1 e0 05 48 63 bc 03 10 01 00 00 48 63 83 d0 00 00 00 48 63 d6 48 c1 e2 08 c1 e1 08 4c 63 c2 48 c1 e0 08 48 98 48 c1 e0 08 48 99 <49> f7 f8 48 98 48 0f af f8 48 c1 ff 08 29 f9 89 ca c1 fa 1f 89
RIP [<ffffffff815c551d>] intel_pstate_timer_func+0x11d/0x2b0
RSP <ffff88001ee03e18>
---[ end trace f166110ed22cc37a ]---
Kernel panic - not syncing: Fatal exception in interrupt
Reported-and-tested-by: Kashyap Chamarthy <kchamart@redhat.com>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Larry Finger [Wed, 11 Dec 2013 23:13:10 +0000 (17:13 -0600)]
rtlwifi: pci: Fix oops on driver unload
commit 9278db6279e28d4d433bc8a848e10b4ece8793ed upstream.
On Fedora systems, unloading rtl8192ce causes an oops. This patch fixes the
problem reported at https://bugzilla.redhat.com/show_bug.cgi?id=852761.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Johannes Berg [Mon, 16 Dec 2013 11:04:36 +0000 (12:04 +0100)]
radiotap: fix bitmap-end-finding buffer overrun
commit bd02cd2549cfcdfc57cb5ce57ffc3feb94f70575 upstream.
Evan Huus found (by fuzzing in wireshark) that the radiotap
iterator code can access beyond the length of the buffer if
the first bitmap claims an extension but then there's no
data at all. Fix this.
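The kind of check involved can be sketched as follows. This is a simplified illustration, not the radiotap iterator: it assumes a buffer that starts directly with a chain of little-endian 32-bit bitmaps in which bit 31 announces another bitmap, and all names are hypothetical. Each bitmap is read only after confirming that it fits inside the buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_EXT_BIT (1u << 31)

/* Reads a little-endian 32-bit value from an unaligned byte buffer. */
static uint32_t example_get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Counts the chained bitmaps at the start of buf. Returns the count, or -1
 * if a bitmap (including one announced by an extension bit) does not fit
 * within len bytes.
 */
static int example_count_bitmaps(const uint8_t *buf, size_t len)
{
    size_t off = 0;
    int count = 0;

    for (;;) {
        if (len - off < 4)      /* off never exceeds len here */
            return -1;
        count++;
        if (!(example_get_le32(buf + off) & EXAMPLE_EXT_BIT))
            return count;
        off += 4;
    }
}
```

A buffer whose only bitmap sets the extension bit but carries no further data is rejected instead of being read past its end.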
Reported-by: Evan Huus <eapache@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tejun Heo [Wed, 18 Dec 2013 12:07:32 +0000 (07:07 -0500)]
libata, freezer: avoid block device removal while system is frozen
commit 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557 upstream.
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.
During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.
839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.
I believe the right thing to do is getting rid of freezable kthreads
and workqueues. This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(
v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
as a module.
v3: Comment updated and polling interval changed to 10ms as suggested
by Rafael.
v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
defined when FREEZER is not configured thus breaking build.
Reported by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
Reviewed-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Robin H. Johnson [Mon, 16 Dec 2013 17:31:19 +0000 (09:31 -0800)]
libata: disable a disk via libata.force params
commit b8bd6dc36186fe99afa7b73e9e2d9a98ad5c4865 upstream.
A user on StackExchange had a failing SSD that's soldered directly
onto the motherboard of his system. The BIOS does not give any option
to disable it at all, so he can't just hide it from the OS via the
BIOS.
The old IDE layer had hdX=noprobe override for situations like this,
but that was never ported to the libata layer.
This patch implements a disable flag for libata.force.
Example use:
libata.force=2.0:disable
[v2 of the patch, removed the nodisable flag per Tejun Heo]
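To illustrate how a value such as "2.0:disable" breaks down, here is a
small userspace sketch that splits one "[PORT[.DEVICE]:]VAL" entry into
its parts. It is not the libata parser; struct force_spec and
parse_force_spec() are hypothetical names, and -1 is used here to mean
"not specified".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one parsed entry; -1 means "not given". */
struct force_spec {
	int port;
	int device;
	char val[32];
};

/* Parse "[PORT[.DEVICE]:]VAL". Returns 0 on success, -1 on malformed input. */
static int parse_force_spec(const char *spec, struct force_spec *out)
{
	const char *val = spec;
	const char *colon;

	assert(spec != NULL && out != NULL);
	out->port = -1;
	out->device = -1;

	colon = strchr(spec, ':');
	if (colon) {
		char *end;
		long port = strtol(spec, &end, 10);

		if (end == spec || port < 0)
			return -1;	/* no decimal port before ':' */
		out->port = (int)port;

		if (*end == '.') {
			const char *dev_start = end + 1;
			long dev = strtol(dev_start, &end, 10);

			if (end == dev_start || dev < 0)
				return -1;	/* '.' without a device number */
			out->device = (int)dev;
		}
		if (end != colon)
			return -1;	/* trailing junk before ':' */
		val = colon + 1;
	}

	if (*val == '\0' || strlen(val) >= sizeof(out->val))
		return -1;
	strcpy(out->val, val);
	return 0;
}
```

With the example above, parsing "2.0:disable" yields port 2, device 0 and
the value "disable"; an entry without an ID part leaves both numbers
unspecified.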
Signed-off-by: Robin H. Johnson <robbat2@gentoo.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://unix.stackexchange.com/questions/102648/how-to-tell-linux-kernel-3-0-to-completely-ignore-a-failing-disk
Link: http://askubuntu.com/questions/352836/how-can-i-tell-linux-kernel-to-completely-ignore-a-disk-as-if-it-was-not-even-co
Link: http://superuser.com/questions/599333/how-to-disable-kernel-probing-for-drive
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vincent Pelletier [Tue, 21 May 2013 20:30:58 +0000 (22:30 +0200)]
libata: Add atapi_dmadir force flag
commit 966fbe193f47c68e70a80ec9991098e88e7959cb upstream.
Some devices require DMADIR to be enabled, but are not detected as such
by atapi_id_dmadir. One such example is "Asus Serillel 2"
SATA-host-to-PATA-device bridge: the bridge itself requires DMADIR,
even if the bridged device does not.
As atapi_dmadir module parameter can cause problems with some devices
(as per Tejun Heo's memory), enabling it globally may not be possible
depending on the hardware.
This patch adds atapi_dmadir in the form of a "force" horkage value,
allowing global, per-bus and per-device control.
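The three levels of control can be pictured as a scope match with
wildcards. The sketch below is a userspace illustration under that
assumption, not the libata code: struct force_scope and scope_matches()
are hypothetical names, and -1 stands for "any".

```c
#include <stdbool.h>

/* Hypothetical scope of a force entry; -1 acts as a wildcard. */
struct force_scope {
	int port;	/* -1: any port (global) */
	int device;	/* -1: any device on the matched port */
};

/* Return true if a force entry with this scope applies to (port, device). */
static bool scope_matches(struct force_scope scope, int port, int device)
{
	if (scope.port != -1 && scope.port != port)
		return false;
	if (scope.device != -1 && scope.device != device)
		return false;
	return true;
}
```

A scope of { -1, -1 } applies everywhere, { 1, -1 } applies to every
device on port 1, and { 1, 0 } applies only to device 0 on port 1.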
Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>