Matt Bennett [Wed, 23 Sep 2015 04:58:31 +0000 (16:58 +1200)]
ip6_gre: Reduce log level in ip6gre_err() to debug
Currently error log messages in ip6gre_err are printed at 'warn'
level. This is different to most other tunnel types which don't
print any messages. These log messages don't provide any information
that couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".
This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.
Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wilson Kok [Wed, 23 Sep 2015 04:40:22 +0000 (21:40 -0700)]
fib_rules: fix fib rule dumps across multiple skbs
dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.
This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 23 Sep 2015 00:04:58 +0000 (17:04 -0700)]
bnx2x: byte swap rss_key to comply to Toeplitz specs
After a good amount of debugging, I found bnx2x was byte swaping
the 40 bytes of rss_key.
If we byte swap the key, then bnx2x generates hashes matching
MSDN specs as documented in (Verifying the RSS Hash Calculation)
https://msdn.microsoft.com/en-us/library/windows/hardware/
ff571021%
28v=vs.85%29.aspx
It is mostly a non issue, unless we want to mix different NIC
in a host, and want consistent hashing among all of them, ie
if they all use the boot time generated rss key, or if some application
is choosing specific tuple(s) so that incoming traffic lands into known
rx queue(s).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Wed, 23 Sep 2015 00:01:11 +0000 (17:01 -0700)]
net: revert "net_sched: move tp->root allocation into fw_init()"
fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.
Fixes: 33f8b9ecdb15 ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 24 Sep 2015 21:31:37 +0000 (14:31 -0700)]
Merge branch 'lwt_arp'
Jiri Benc says:
====================
lwtunnel: make it really work, for IPv4
One of the selling points of lwtunnel was the ability to specify the tunnel
destination using routes. However, this doesn't really work currently, as
ARP and ndisc replies are not handled correctly. ARP and ndisc replies won't
have tunnel metadata attached, thus they will be sent out with the default
parameters or not sent at all, either way never reaching the requester.
Most of the egress tunnel parameters can be inferred from the ingress
metada. The only and important exception is UDP ports. This patchset infers
the egress data from the ingress data and disallow settings of UDP ports in
tunnel routes. If there's a need for different UDP ports, a new interface
needs to be created for each port combination. Note that it's still possible
to specify the UDP ports to use, it just needs to be done while creating the
vxlan/geneve interface.
This covers only ARPs. IPv6 ndisc has the same problem but is harder to
solve, as there's already dst attached to outgoing skbs. Ideas to solve this
are welcome.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Tue, 22 Sep 2015 16:12:12 +0000 (18:12 +0200)]
lwtunnel: remove source and destination UDP port config option
The UDP tunnel config is asymmetric wrt. to the ports used. The source and
destination ports from one direction of the tunnel are not related to the
ports of the other direction. We need to be able to respond to ARP requests
using the correct ports without involving routing.
As the consequence, UDP ports need to be fixed property of the tunnel
interface and cannot be set per route. Remove the ability to set ports per
route. This is still okay to do, as no kernel has been released with these
attributes yet.
Note that the ability to specify source and destination ports is preserved
for other users of the lwtunnel API which don't use routes for tunnel key
specification (like openvswitch).
If in the future we rework ARP handling to allow port specification, the
attributes can be added back.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Tue, 22 Sep 2015 16:12:11 +0000 (18:12 +0200)]
ipv4: send arp replies to the correct tunnel
When using ip lwtunnels, the additional data for xmit (basically, the actual
tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
metadata dst. When replying to ARP requests, we need to send the reply to
the same tunnel the request came from. This means we need to construct
proper metadata dst for ARP replies.
We could perform another route lookup to get a dst entry with the correct
lwtstate. However, this won't always ensure that the outgoing tunnel is the
same as the incoming one, and it won't work anyway for IPv4 duplicate
address detection.
The only thing to do is to "reverse" the ip_tunnel_info.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudeep Holla [Mon, 21 Sep 2015 15:47:09 +0000 (16:47 +0100)]
net: gianfar: remove misuse of IRQF_NO_SUSPEND flag
The device is set as wakeup capable using proper wakeup API but the
driver misuses IRQF_NO_SUSPEND to set the interrupt as wakeup source
which is incorrect.
This patch removes the use of IRQF_NO_SUSPEND flags replacing it with
enable_irq_wake instead.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Claudiu Manoil <claudiu.manoil@freescale.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pravin B Shelar [Tue, 22 Sep 2015 19:57:53 +0000 (12:57 -0700)]
skbuff: Fix skb checksum flag on skb pull
VXLAN device can receive skb with checksum partial. But the checksum
offset could be in outer header which is pulled on receive. This results
in negative checksum offset for the skb. Such skb can cause the assert
failure in skb_checksum_help(). Following patch fixes the bug by setting
checksum-none while pulling outer header.
Following is the kernel panic msg from old kernel hitting the bug.
------------[ cut here ]------------
kernel BUG at net/core/dev.c:1906!
RIP: 0010:[<
ffffffff81518034>] skb_checksum_help+0x144/0x150
Call Trace:
<IRQ>
[<
ffffffffa0164c28>] queue_userspace_packet+0x408/0x470 [openvswitch]
[<
ffffffffa016614d>] ovs_dp_upcall+0x5d/0x60 [openvswitch]
[<
ffffffffa0166236>] ovs_dp_process_packet_with_key+0xe6/0x100 [openvswitch]
[<
ffffffffa016629b>] ovs_dp_process_received_packet+0x4b/0x80 [openvswitch]
[<
ffffffffa016c51a>] ovs_vport_receive+0x2a/0x30 [openvswitch]
[<
ffffffffa0171383>] vxlan_rcv+0x53/0x60 [openvswitch]
[<
ffffffffa01734cb>] vxlan_udp_encap_recv+0x8b/0xf0 [openvswitch]
[<
ffffffff8157addc>] udp_queue_rcv_skb+0x2dc/0x3b0
[<
ffffffff8157b56f>] __udp4_lib_rcv+0x1cf/0x6c0
[<
ffffffff8157ba7a>] udp_rcv+0x1a/0x20
[<
ffffffff8154fdbd>] ip_local_deliver_finish+0xdd/0x280
[<
ffffffff81550128>] ip_local_deliver+0x88/0x90
[<
ffffffff8154fa7d>] ip_rcv_finish+0x10d/0x370
[<
ffffffff81550365>] ip_rcv+0x235/0x300
[<
ffffffff8151ba1d>] __netif_receive_skb+0x55d/0x620
[<
ffffffff8151c360>] netif_receive_skb+0x80/0x90
[<
ffffffff81459935>] virtnet_poll+0x555/0x6f0
[<
ffffffff8151cd04>] net_rx_action+0x134/0x290
[<
ffffffff810683d8>] __do_softirq+0xa8/0x210
[<
ffffffff8162fe6c>] call_softirq+0x1c/0x30
[<
ffffffff810161a5>] do_softirq+0x65/0xa0
[<
ffffffff810687be>] irq_exit+0x8e/0xb0
[<
ffffffff81630733>] do_IRQ+0x63/0xe0
[<
ffffffff81625f2e>] common_interrupt+0x6e/0x6e
Reported-by: Anupam Chanda <achanda@vmware.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Tue, 22 Sep 2015 03:38:56 +0000 (11:38 +0800)]
netlink: Replace rhash_portid with bound
On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way. You have to pair store_release
> and load_acquire. Besides, it isn't a particularly good idea to
OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.
> depend on memory barriers embedded in other data structures like the
> above. Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.
But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.
> There's no reason to be overly smart here. This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.
It's not about being overly smart. It's about actually understanding
what's going on with the code. I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.
> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> > }
> > }
> >
> > - if (!nlk->portid) {
> > + if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable. That doesn't change anything. Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.
The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.
However, there is a real bug here that none of these acquire/release
helpers discovered. The two bound tests here used to be a single
one. Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket. So we need to
repeat the portid check in order to maintain consistency.
> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> > return -EPERM;
> >
> > - if (!nlk->portid)
> > + if (!nlk->bound)
>
> Don't we need load_acquire here too? Is this path holding a lock
> which makes that unnecessary?
Ditto.
---8<---
The commit
1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.
Tejun is right that a barrier is unavoidable. Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.
Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.
I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one. This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.
Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Nacked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville [Tue, 22 Sep 2015 17:09:32 +0000 (13:09 -0400)]
geneve: use network byte order for destination port config parameter
This is primarily for consistancy with vxlan and other tunnels which
use network byte order for similar parameters.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Woodhouse [Wed, 23 Sep 2015 08:45:31 +0000 (09:45 +0100)]
8139cp: Dump contents of descriptor ring on TX timeout
We are seeing unexplained TX timeouts under heavy load. Let's try to get
a better idea of what's going on.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Woodhouse [Wed, 23 Sep 2015 08:45:16 +0000 (09:45 +0100)]
8139cp: Fix DMA unmapping of transmitted buffers
The low 16 bits of the 'opts1' field in the TX descriptor are supposed
to still contain the buffer length when the descriptor is handed back to
us. In practice, at least on my hardware, they don't. So stash the
original value of the opts1 field and get the length to unmap from
there.
There are other ways we could have worked out the length, but I actually
want a stash of the opts1 field anyway so that I can dump it alongside
the contents of the descriptor ring when we suffer a TX timeout.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Woodhouse [Wed, 23 Sep 2015 08:44:57 +0000 (09:44 +0100)]
8139cp: Reduce duplicate csum/tso code in cp_start_xmit()
We calculate the value of the opts1 descriptor field in three different
places. With two different behaviours when given an invalid packet to
be checksummed — none of them correct. Sort that out.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Woodhouse [Wed, 23 Sep 2015 08:44:38 +0000 (09:44 +0100)]
8139cp: Fix TSO/scatter-gather descriptor setup
When sending a TSO frame in multiple buffers, we were neglecting to set
the first descriptor up in TSO mode.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Woodhouse [Wed, 23 Sep 2015 08:44:06 +0000 (09:44 +0100)]
8139cp: Fix tx_queued debug message to print correct slot numbers
After a certain amount of staring at the debug output of this driver, I
realised it was lying to me.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Woodhouse [Wed, 23 Sep 2015 08:43:41 +0000 (09:43 +0100)]
8139cp: Do not re-enable RX interrupts in cp_tx_timeout()
If an RX interrupt was already received but NAPI has not yet run when
the RX timeout happens, we end up in cp_tx_timeout() with RX interrupts
already disabled. Blindly re-enabling them will cause an IRQ storm.
(This is made particularly horrid by the fact that cp_interrupt() always
returns that it's handled the interrupt, even when it hasn't actually
done anything. If it didn't do that, the core IRQ code would have
detected the storm and handled it, I'd have had a clear smoking gun
backtrace instead of just a spontaneously resetting router, and I'd have
at *least* two days of my life back. Changing the return value of
cp_interrupt() will be argued about under separate cover.)
Unconditionally leave RX interrupts disabled after the reset, and
schedule NAPI to check the receive ring and re-enable them.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 23 Sep 2015 21:37:38 +0000 (14:37 -0700)]
Merge branch 'netcp-fixes'
Murali Karicheri says:
====================
net: netcp: a set of bug fixes
This patch series fixes a set of issues in netcp driver seen during internal
testing of the driver. While at it, do some clean up as well.
The fixes are tested on K2HK, K2L and K2E EVMs and the boot up logs can be
seen at
http://pastebin.ubuntu.com/
12533100/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Karicheri, Muralidharan [Wed, 23 Sep 2015 17:37:11 +0000 (13:37 -0400)]
net: netcp: fix deadlock reported by lockup detector
A deadlock trace is seen in netcp driver with lockup detector enabled.
The trace log is provided below for reference. This patch fixes the
bug by removing the usage of netcp_modules_lock within ndo_ops functions.
ndo_{open/close/ioctl)() is already called with rtnl_lock held. So there
is no need to hold another mutex for serialization across processes on
multiple cores. So remove use of netcp_modules_lock mutex from these
ndo ops functions.
ndo_set_rx_mode() shouldn't be using a mutex as it is called from atomic
context. In the case of ndo_set_rx_mode(), there can be call to this API
without rtnl_lock held from an atomic context. As the underlying modules
are expected to add address to a hardware table, it is to be protected
across concurrent updates and hence a spin lock is used to synchronize
the access. Same with ndo_vlan_rx_add_vid() & ndo_vlan_rx_kill_vid().
Probably the netcp_modules_lock is used to protect the module not being
removed as part of rmmod. Currently this is not fully implemented and
assumes the interface is brought down before doing rmmod of modules.
The support for rmmmod while interface is up is expected in a future
patch set when additional modules such as pa, qos are added. For now
all of the tests such as if up/down, reboot, iperf works fine with this
patch applied.
Deadlock trace seen with lockup detector enabled is shown below for
reference.
[ 16.863014] ======================================================
[ 16.869183] [ INFO: possible circular locking dependency detected ]
[ 16.875441]
4.1.6-01265-gfb1e101 #1 Tainted: G W
[ 16.881176] -------------------------------------------------------
[ 16.887432] ifconfig/1662 is trying to acquire lock:
[ 16.892386] (netcp_modules_lock){+.+.+.}, at: [<
c03e8110>]
netcp_ndo_open+0x168/0x518
[ 16.900321]
[ 16.900321] but task is already holding lock:
[ 16.906144] (rtnl_mutex){+.+.+.}, at: [<
c053a418>] devinet_ioctl+0xf8/0x7e4
[ 16.913206]
[ 16.913206] which lock already depends on the new lock.
[ 16.913206]
[ 16.921372]
[ 16.921372] the existing dependency chain (in reverse order) is:
[ 16.928844]
-> #1 (rtnl_mutex){+.+.+.}:
[ 16.932865] [<
c06023f0>] mutex_lock_nested+0x68/0x4a8
[ 16.938521] [<
c04c5758>] register_netdev+0xc/0x24
[ 16.943831] [<
c03e65c0>] netcp_module_probe+0x214/0x2ec
[ 16.949660] [<
c03e8a54>] netcp_register_module+0xd4/0x140
[ 16.955663] [<
c089654c>] keystone_gbe_init+0x10/0x28
[ 16.961233] [<
c000977c>] do_one_initcall+0xb8/0x1f8
[ 16.966714] [<
c0867e04>] kernel_init_freeable+0x148/0x1e8
[ 16.972720] [<
c05f9994>] kernel_init+0xc/0xe8
[ 16.977682] [<
c0010038>] ret_from_fork+0x14/0x3c
[ 16.982905]
-> #0 (netcp_modules_lock){+.+.+.}:
[ 16.987619] [<
c006eab0>] lock_acquire+0x118/0x320
[ 16.992928] [<
c06023f0>] mutex_lock_nested+0x68/0x4a8
[ 16.998582] [<
c03e8110>] netcp_ndo_open+0x168/0x518
[ 17.004064] [<
c04c48f0>] __dev_open+0xa8/0x10c
[ 17.009112] [<
c04c4b74>] __dev_change_flags+0x94/0x144
[ 17.014853] [<
c04c4c3c>] dev_change_flags+0x18/0x48
[ 17.020334] [<
c053a9fc>] devinet_ioctl+0x6dc/0x7e4
[ 17.025729] [<
c04a59ec>] sock_ioctl+0x1d0/0x2a8
[ 17.030865] [<
c0142844>] do_vfs_ioctl+0x41c/0x688
[ 17.036173] [<
c0142ae4>] SyS_ioctl+0x34/0x5c
[ 17.041046] [<
c000ff60>] ret_fast_syscall+0x0/0x54
[ 17.046441]
[ 17.046441] other info that might help us debug this:
[ 17.046441]
[ 17.054434] Possible unsafe locking scenario:
[ 17.054434]
[ 17.060343] CPU0 CPU1
[ 17.064862] ---- ----
[ 17.069381] lock(rtnl_mutex);
[ 17.072522] lock(netcp_modules_lock);
[ 17.078875] lock(rtnl_mutex);
[ 17.084532] lock(netcp_modules_lock);
[ 17.088366]
[ 17.088366] *** DEADLOCK ***
[ 17.088366]
[ 17.094279] 1 lock held by ifconfig/1662:
[ 17.098278] #0: (rtnl_mutex){+.+.+.}, at: [<
c053a418>]
devinet_ioctl+0xf8/0x7e4
[ 17.105774]
[ 17.105774] stack backtrace:
[ 17.110124] CPU: 1 PID: 1662 Comm: ifconfig Tainted: G W
4.1.6-01265-gfb1e101 #1
[ 17.118637] Hardware name: Keystone
[ 17.122123] [<
c00178e4>] (unwind_backtrace) from [<
c0013cbc>]
(show_stack+0x10/0x14)
[ 17.129862] [<
c0013cbc>] (show_stack) from [<
c05ff450>]
(dump_stack+0x84/0xc4)
[ 17.137079] [<
c05ff450>] (dump_stack) from [<
c0068e34>]
(print_circular_bug+0x210/0x330)
[ 17.145161] [<
c0068e34>] (print_circular_bug) from [<
c006ab7c>]
(validate_chain.isra.35+0xf98/0x13ac)
[ 17.154372] [<
c006ab7c>] (validate_chain.isra.35) from [<
c006da60>]
(__lock_acquire+0x52c/0xcc0)
[ 17.163149] [<
c006da60>] (__lock_acquire) from [<
c006eab0>]
(lock_acquire+0x118/0x320)
[ 17.171058] [<
c006eab0>] (lock_acquire) from [<
c06023f0>]
(mutex_lock_nested+0x68/0x4a8)
[ 17.179140] [<
c06023f0>] (mutex_lock_nested) from [<
c03e8110>]
(netcp_ndo_open+0x168/0x518)
[ 17.187484] [<
c03e8110>] (netcp_ndo_open) from [<
c04c48f0>]
(__dev_open+0xa8/0x10c)
[ 17.195133] [<
c04c48f0>] (__dev_open) from [<
c04c4b74>]
(__dev_change_flags+0x94/0x144)
[ 17.203129] [<
c04c4b74>] (__dev_change_flags) from [<
c04c4c3c>]
(dev_change_flags+0x18/0x48)
[ 17.211560] [<
c04c4c3c>] (dev_change_flags) from [<
c053a9fc>]
(devinet_ioctl+0x6dc/0x7e4)
[ 17.219729] [<
c053a9fc>] (devinet_ioctl) from [<
c04a59ec>]
(sock_ioctl+0x1d0/0x2a8)
[ 17.227378] [<
c04a59ec>] (sock_ioctl) from [<
c0142844>]
(do_vfs_ioctl+0x41c/0x688)
[ 17.234939] [<
c0142844>] (do_vfs_ioctl) from [<
c0142ae4>]
(SyS_ioctl+0x34/0x5c)
[ 17.242242] [<
c0142ae4>] (SyS_ioctl) from [<
c000ff60>]
(ret_fast_syscall+0x0/0x54)
[ 17.258855] netcp-1.0
2620110.netcp eth0: Link is Up - 1Gbps/Full - flow
control off
[ 17.271282] BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:616
[ 17.279712] in_atomic(): 1, irqs_disabled(): 0, pid: 1662, name: ifconfig
[ 17.286500] INFO: lockdep is turned off.
[ 17.290413] Preemption disabled at:[< (null)>] (null)
[ 17.295728]
[ 17.297214] CPU: 1 PID: 1662 Comm: ifconfig Tainted: G W
4.1.6-01265-gfb1e101 #1
[ 17.305735] Hardware name: Keystone
[ 17.309223] [<
c00178e4>] (unwind_backtrace) from [<
c0013cbc>]
(show_stack+0x10/0x14)
[ 17.316970] [<
c0013cbc>] (show_stack) from [<
c05ff450>]
(dump_stack+0x84/0xc4)
[ 17.324194] [<
c05ff450>] (dump_stack) from [<
c06023b0>]
(mutex_lock_nested+0x28/0x4a8)
[ 17.332112] [<
c06023b0>] (mutex_lock_nested) from [<
c03e9840>]
(netcp_set_rx_mode+0x160/0x210)
[ 17.340724] [<
c03e9840>] (netcp_set_rx_mode) from [<
c04c483c>]
(dev_set_rx_mode+0x1c/0x28)
[ 17.348982] [<
c04c483c>] (dev_set_rx_mode) from [<
c04c490c>]
(__dev_open+0xc4/0x10c)
[ 17.356724] [<
c04c490c>] (__dev_open) from [<
c04c4b74>]
(__dev_change_flags+0x94/0x144)
[ 17.364729] [<
c04c4b74>] (__dev_change_flags) from [<
c04c4c3c>]
(dev_change_flags+0x18/0x48)
[ 17.373166] [<
c04c4c3c>] (dev_change_flags) from [<
c053a9fc>]
(devinet_ioctl+0x6dc/0x7e4)
[ 17.381344] [<
c053a9fc>] (devinet_ioctl) from [<
c04a59ec>]
(sock_ioctl+0x1d0/0x2a8)
[ 17.388994] [<
c04a59ec>] (sock_ioctl) from [<
c0142844>]
(do_vfs_ioctl+0x41c/0x688)
[ 17.396563] [<
c0142844>] (do_vfs_ioctl) from [<
c0142ae4>]
(SyS_ioctl+0x34/0x5c)
[ 17.403873] [<
c0142ae4>] (SyS_ioctl) from [<
c000ff60>]
(ret_fast_syscall+0x0/0x54)
[ 17.413772] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
udhcpc (v1.20.2) started
Sending discover...
[ 18.690666] netcp-1.0
2620110.netcp eth0: Link is Up - 1Gbps/Full - flow
control off
Sending discover...
[ 22.250972] netcp-1.0
2620110.netcp eth0: Link is Up - 1Gbps/Full - flow
control off
[ 22.258721] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 22.265458] BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:616
[ 22.273896] in_atomic(): 1, irqs_disabled(): 0, pid: 342, name: kworker/1:1
[ 22.280854] INFO: lockdep is turned off.
[ 22.284767] Preemption disabled at:[< (null)>] (null)
[ 22.290074]
[ 22.291568] CPU: 1 PID: 342 Comm: kworker/1:1 Tainted: G W
4.1.6-01265-gfb1e101 #1
[ 22.300255] Hardware name: Keystone
[ 22.303750] Workqueue: ipv6_addrconf addrconf_dad_work
[ 22.308895] [<
c00178e4>] (unwind_backtrace) from [<
c0013cbc>]
(show_stack+0x10/0x14)
[ 22.316643] [<
c0013cbc>] (show_stack) from [<
c05ff450>]
(dump_stack+0x84/0xc4)
[ 22.323867] [<
c05ff450>] (dump_stack) from [<
c06023b0>]
(mutex_lock_nested+0x28/0x4a8)
[ 22.331786] [<
c06023b0>] (mutex_lock_nested) from [<
c03e9840>]
(netcp_set_rx_mode+0x160/0x210)
[ 22.340394] [<
c03e9840>] (netcp_set_rx_mode) from [<
c04c9d18>]
(__dev_mc_add+0x54/0x68)
[ 22.348401] [<
c04c9d18>] (__dev_mc_add) from [<
c05ab358>]
(igmp6_group_added+0x168/0x1b4)
[ 22.356580] [<
c05ab358>] (igmp6_group_added) from [<
c05ad2cc>]
(ipv6_dev_mc_inc+0x4f0/0x5a8)
[ 22.365019] [<
c05ad2cc>] (ipv6_dev_mc_inc) from [<
c058f0d0>]
(addrconf_dad_work+0x21c/0x33c)
[ 22.373460] [<
c058f0d0>] (addrconf_dad_work) from [<
c0042850>]
(process_one_work+0x214/0x8d0)
[ 22.381986] [<
c0042850>] (process_one_work) from [<
c0042f54>]
(worker_thread+0x48/0x4bc)
[ 22.390071] [<
c0042f54>] (worker_thread) from [<
c004868c>]
(kthread+0xf0/0x108)
[ 22.397381] [<
c004868c>] (kthread) from [<
c0010038>]
Trace related to incorrect usage of mutex inside ndo_set_rx_mode
[ 24.086066] BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:616
[ 24.094506] in_atomic(): 1, irqs_disabled(): 0, pid: 1682, name: ifconfig
[ 24.101291] INFO: lockdep is turned off.
[ 24.105203] Preemption disabled at:[< (null)>] (null)
[ 24.110511]
[ 24.112005] CPU: 2 PID: 1682 Comm: ifconfig Tainted: G W
4.1.6-01265-gfb1e101 #1
[ 24.120518] Hardware name: Keystone
[ 24.124018] [<
c00178e4>] (unwind_backtrace) from [<
c0013cbc>]
(show_stack+0x10/0x14)
[ 24.131772] [<
c0013cbc>] (show_stack) from [<
c05ff450>]
(dump_stack+0x84/0xc4)
[ 24.138989] [<
c05ff450>] (dump_stack) from [<
c06023b0>]
(mutex_lock_nested+0x28/0x4a8)
[ 24.146908] [<
c06023b0>] (mutex_lock_nested) from [<
c03e9840>]
(netcp_set_rx_mode+0x160/0x210)
[ 24.155523] [<
c03e9840>] (netcp_set_rx_mode) from [<
c04c483c>]
(dev_set_rx_mode+0x1c/0x28)
[ 24.163787] [<
c04c483c>] (dev_set_rx_mode) from [<
c04c490c>]
(__dev_open+0xc4/0x10c)
[ 24.171531] [<
c04c490c>] (__dev_open) from [<
c04c4b74>]
(__dev_change_flags+0x94/0x144)
[ 24.179528] [<
c04c4b74>] (__dev_change_flags) from [<
c04c4c3c>]
(dev_change_flags+0x18/0x48)
[ 24.187966] [<
c04c4c3c>] (dev_change_flags) from [<
c053a9fc>]
(devinet_ioctl+0x6dc/0x7e4)
[ 24.196145] [<
c053a9fc>] (devinet_ioctl) from [<
c04a59ec>]
(sock_ioctl+0x1d0/0x2a8)
[ 24.203803] [<
c04a59ec>] (sock_ioctl) from [<
c0142844>]
(do_vfs_ioctl+0x41c/0x688)
[ 24.211373] [<
c0142844>] (do_vfs_ioctl) from [<
c0142ae4>]
(SyS_ioctl+0x34/0x5c)
[ 24.218676] [<
c0142ae4>] (SyS_ioctl) from [<
c000ff60>]
(ret_fast_syscall+0x0/0x54)
[ 24.227156] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karicheri, Muralidharan [Wed, 23 Sep 2015 17:37:10 +0000 (13:37 -0400)]
net: netcp: allocate buffers to desc before re-enable interrupt
Currently netcp_rxpool_refill() that refill descriptors and attached
buffers to fdq while interrupt is enabled as part of NAPI poll. Doing
it while interrupt is disabled could be beneficial as hardware will
not be starved when CPU is busy with processing interrupt.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karicheri, Muralidharan [Wed, 23 Sep 2015 17:37:09 +0000 (13:37 -0400)]
net: netcp: check for interface handle in netcp_module_probe()
Currently netcp_module_probe() doesn't check the return value of
of_parse_phandle() that points to the interface data for the
module and then pass the node ptr to the module which is incorrect.
Check for return value and free the intf_modpriv if there is error.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karicheri, Muralidharan [Wed, 23 Sep 2015 17:37:08 +0000 (13:37 -0400)]
net: netcp: add error check to netcp_allocate_rx_buf()
Currently, if netcp_allocate_rx_buf() fails due no descriptors
in the rx free descriptor queue, inside the netcp_rxpool_refill() function
the iterative loop to fill buffers doesn't terminate right away. So modify
the netcp_allocate_rx_buf() to return an error code and use it break the
loop when there is error.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karicheri, Muralidharan [Wed, 23 Sep 2015 17:37:07 +0000 (13:37 -0400)]
net: netcp: move netcp_register_interface() to after attach module
The netcp interface is not fully initialized before attach the module
to the interface. For example, the tx pipe/rx pipe is initialized
in ethss module as part of attach(). So until this is complete, the
interface can't be registered. So move registration of interface to
net device outside the current loop that attaches the modules to the
interface.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karicheri, Muralidharan [Wed, 23 Sep 2015 17:37:06 +0000 (13:37 -0400)]
net: netcp: remove dead code from the driver
netcp_core is the first driver that will get initialized and the modules
(ethss, pa etc) will then get initialized. So the code at the end of
netcp_probe() that iterate over the modules is a dead code as the module
list will be always be empty. So remove this code.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WingMan Kwok [Wed, 23 Sep 2015 17:37:05 +0000 (13:37 -0400)]
net: netcp: ethss: fix error in calling sgmii api with incorrect offset
On K2HK, sgmii module registers of slave 0 and 1 are mem
mapped to one contiguous block, while those of slave 2
and 3 are mapped to another contiguous block. However,
on K2E and K2L, sgmii module registers of all slaves are
mem mapped to one contiguous block. SGMII APIs expect
slave 0 sgmii base when API is invoked for slave 0 and 1,
and slave 2 sgmii base when invoked for other slaves.
Before this patch, slave 0 sgmii base is always passed to
sgmii API for K2E regardless which slave is the API invoked
for. This patch fixes the problem.
Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Woodhouse [Wed, 23 Sep 2015 18:45:08 +0000 (19:45 +0100)]
Fix AF_PACKET ABI breakage in 4.2
Commit
7d82410950aa ("virtio: add explicit big-endian support to memory
accessors") accidentally changed the virtio_net header used by
AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.
Since virtio_legacy_is_little_endian() is a very long identifier,
define a vio_le macro and use that throughout the code instead of the
hard-coded 'false' for little-endian.
This restores the ABI to match 4.1 and earlier kernels, and makes my
test program work again.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Horman [Wed, 23 Sep 2015 18:57:58 +0000 (14:57 -0400)]
netpoll: Close race condition between poll_one_napi and napi_disable
Drivers might call napi_disable while not holding the napi instance poll_lock.
In those instances, its possible for a race condition to exist between
poll_one_napi and napi_disable. That is to say, poll_one_napi only tests the
NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
the following may happen:
CPU0 CPU1
ndo_tx_timeout napi_poll_dev
napi_disable poll_one_napi
test_and_set_bit (ret 0)
test_bit (ret 1)
reset adapter napi_poll_routine
If the adapter gets a tx timeout without a napi instance scheduled, its possible
for the adapter to think it has exclusive access to the hardware (as the napi
instance is now scheduled via the napi_disable call), while the netpoll code
thinks there is simply work to do. The result is parallel hardware access
leading to corrupt data structures in the driver, and a crash.
Additionaly, there is another, more critical race between netpoll and
napi_disable. The disabled napi state is actually identical to the scheduled
state for a given napi instance. The implication being that, if a napi instance
is disabled, a netconsole instance would see the napi state of the device as
having been scheduled, and poll it, likely while the driver was dong something
requiring exclusive access. In the case above, its fairly clear that not having
the rings in a state ready to be polled will cause any number of crashes.
The fix should be pretty easy. netpoll uses its own bit to indicate that that
the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
We can just gate disabling on that bit as well as the sched bit. That should
prevent netpoll from conducting a napi poll if we convert its set bit to a
test_and_set_bit operation to provide mutual exclusion
Change notes:
V2)
Remove a trailing whtiespace
Resubmit with proper subject prefix
V3)
Clean up spacing nits
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jmaxwell@redhat.com
Tested-by: jmaxwell@redhat.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 23 Sep 2015 21:00:21 +0000 (14:00 -0700)]
tcp: add proper TS val into RST packets
RST packets sent on behalf of TCP connections with TS option (RFC 7323
TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.
A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
ecr 0], length 0
B > A: Flags [S.], seq
2444755794, ack 1, win 28960, options [mss
1460,nop,nop,TS val
7264344 ecr 100], length 0
A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
7264344], length 0
B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
ecr 110], length 0
We need to call skb_mstamp_get() to get proper TS val,
derived from skb->skb_mstamp
Note that RFC 1323 was advocating to not send TS option in RST segment,
but RFC 7323 recommends the opposite :
Once TSopt has been successfully negotiated, that is both <SYN> and
<SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
segment for the duration of the connection, and SHOULD be sent in an
<RST> segment (see Section 5.2 for details)
Note this RFC recommends to send TS val = 0, but we believe it is
premature : We do not know if all TCP stacks are properly
handling the receive side :
When an <RST> segment is
received, it MUST NOT be subjected to the PAWS check by verifying an
acceptable value in SEG.TSval, and information from the Timestamps
option MUST NOT be used to update connection state information.
SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.
In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
to decide to send TS val = 0, if it buys something.
Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Armstrong [Tue, 22 Sep 2015 09:28:14 +0000 (11:28 +0200)]
net: dsa: Fix Marvell Egress Trailer check
The Marvell Egress rx trailer check must be fixed to
correctly detect bad bits in the third byte of the
Eggress trailer as described in the Table 28 of the
88E6060 datasheet.
The current code incorrectly omits to check the third
byte and checks the fourth byte twice.
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitriy Vyukov [Tue, 22 Sep 2015 08:51:52 +0000 (10:51 +0200)]
lib: fix data race in rhashtable_rehash_one
rhashtable_rehash_one() uses complex logic to update entry->next field,
after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion:
entry->next = 1 | ((base + off) << 1)
This can be compiled along the lines of:
entry->next = base + off
entry->next <<= 1
entry->next |= 1
Which will break concurrent readers.
NULLS value recomputation is not needed here, so just remove
the complex logic.
The data race was found with KernelThreadSanitizer (KTSAN).
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Klauser [Tue, 22 Sep 2015 07:29:49 +0000 (09:29 +0200)]
ch9200: Convert to use module_usb_driver
Converts the ch9200 driver to use the module_usb_driver() macro which
makes the code smaller and a bit simpler.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Matthew Garrett <mjg59@srcf.ucam.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesse Gross [Tue, 22 Sep 2015 03:21:20 +0000 (20:21 -0700)]
openvswitch: Zero flows on allocation.
When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.
While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.
In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.
This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.
Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Mon, 21 Sep 2015 20:42:59 +0000 (21:42 +0100)]
net: dsa: actually force the speed on the CPU port
Commit
54d792f257c6 ("net: dsa: Centralise global and port setup
code into mv88e6xxx.") merged in the 4.2 merge window broke the link
speed forcing for the CPU port of Marvell DSA switches. The original
code was:
/* MAC Forcing register: don't force link, speed, duplex
* or flow control state to any particular values on physical
* ports, but force the CPU port and all DSA ports to 1000 Mb/s
* full duplex.
*/
if (dsa_is_cpu_port(ds, p) || ds->dsa_port_mask & (1 << p))
REG_WRITE(addr, 0x01, 0x003e);
else
REG_WRITE(addr, 0x01, 0x0003);
but the new code does a read-modify-write:
reg = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_PCS_CTRL);
if (dsa_is_cpu_port(ds, port) ||
ds->dsa_port_mask & (1 << port)) {
reg |= PORT_PCS_CTRL_FORCE_LINK |
PORT_PCS_CTRL_LINK_UP |
PORT_PCS_CTRL_DUPLEX_FULL |
PORT_PCS_CTRL_FORCE_DUPLEX;
if (mv88e6xxx_6065_family(ds))
reg |= PORT_PCS_CTRL_100;
else
reg |= PORT_PCS_CTRL_1000;
The link speed in the PCS control register is a two bit field. Forcing
the link speed in this way doesn't ensure that the bit field is set to
the correct value - on the hardware I have here, the speed bitfield
remains set to 0x03, resulting in the speed not being forced to gigabit.
We must clear both bits before forcing the link speed.
Fixes: 54d792f257c6 ("net: dsa: Centralise global and port setup code into mv88e6xxx.")
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville [Mon, 21 Sep 2015 14:29:09 +0000 (10:29 -0400)]
geneve: ensure ECN info is handled properly in all tx/rx paths
Partially due to a pre-exising "thinko", the new metadata-based tx/rx
paths were handling ECN propagation differently than the traditional
tx/rx paths. This patch removes the "thinko" (involving multiple
ip_hdr assignments) on the rx path and corrects the ECN handling on
both the rx and tx paths.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 19 Sep 2015 16:48:04 +0000 (09:48 -0700)]
inet: fix races in reqsk_queue_hash_req()
Before allowing lockless LISTEN processing, we need to make
sure to arm the SYN_RECV timer before the req socket is visible
in hash tables.
Also, req->rsk_hash should be written before we set rsk_refcnt
to a non zero value.
Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 19 Sep 2015 16:08:34 +0000 (09:08 -0700)]
tcp/dccp: fix timewait races in timer handling
When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.
As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.
This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.
Note that because we use mod_timer_pinned(), we have the guarantee
the timer wont expire before we set tw_refcnt as we run in BH context.
To make things more readable I introduced inet_twsk_reschedule() helper.
When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.
Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains this bug was spotted ~5 months after
its introduction.
A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.
Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Fri, 18 Sep 2015 21:47:55 +0000 (17:47 -0400)]
sunvnet: Invoke SET_NETDEV_DEV() to set up the vdev in vnet_new()
`ls /sys/devices/channel-devices/vnet-port-0-0/net' is missing without
this change, and applications like NetworkManager are looking in
sysfs for the information.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville [Fri, 18 Sep 2015 20:20:32 +0000 (16:20 -0400)]
geneve: remove vlan-related feature assignment
The code handling vlan tag insertion was dropped in commit
371bd1061d29
("geneve: Consolidate Geneve functionality in single module."). Now we
need to drop the related vlan feature bits in the netdev structure.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matthew Garrett [Sun, 20 Sep 2015 09:25:38 +0000 (02:25 -0700)]
usbnet: New driver for QinHeng CH9200 devices
There's a bunch of cheap USB 10/100 devices based on QinHeng chipsets. The
vendor driver supports the CH9100 and CH9200 devices, but the majority of
the code is of the if (ch9100) {} else {} form, with the most significant
difference being that CH9200 provides a real MII interface but CH9100 fakes
one with a bunch of global variables and magic commands. I don't have a
CH9100, so it's probably better if someone who does provides an independent
driver for it. In any case, this is a lightly cleaned up version of the
vendor driver with all the CH9100 code dropped.
Signed-off-by: Matthew Garrett <mjg59@srcf.ucam.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 21 Sep 2015 23:11:21 +0000 (16:11 -0700)]
Merge branch 'phy-of-autoload'
Luis de Bethencourt says:
====================
net: phy: Fix module autoload for OF platform drivers
These patches add the missing MODULE_DEVICE_TABLE() for OF to export
the information so modules have the correct aliases built-in and
autoloading works correctly.
A longer explanation by Javier Canillas can be found here:
https://lkml.org/lkml/2015/7/30/519
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Luis de Bethencourt [Fri, 18 Sep 2015 16:16:29 +0000 (18:16 +0200)]
net: phy: mdio-gpio: Fix module autoload for OF platform driver
This platform driver has a OF device ID table but the OF module
alias information is not created so module autoloading won't work.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Luis de Bethencourt [Fri, 18 Sep 2015 16:16:12 +0000 (18:16 +0200)]
net: phy: mdio-bcm-unimac: Fix module autoload for OF platform driver
This platform driver has a OF device ID table but the OF module
alias information is not created so module autoloading won't work.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 21 Sep 2015 23:09:11 +0000 (16:09 -0700)]
Merge branch 'net-of-autoload'
Luis de Bethencourt says:
====================
net: Fix module autoload for OF platform drivers
These patches add the missing MODULE_DEVICE_TABLE() for OF to export
the information so modules have the correct aliases built-in and
autoloading works correctly.
A longer explanation by Javier Canillas can be found here:
https://lkml.org/lkml/2015/7/30/519
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Luis de Bethencourt [Fri, 18 Sep 2015 15:56:21 +0000 (17:56 +0200)]
net: moxa: Fix module autoload for OF platform driver
This platform driver has a OF device ID table but the OF module
alias information is not created so module autoloading won't work.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Luis de Bethencourt [Fri, 18 Sep 2015 15:55:27 +0000 (17:55 +0200)]
net: gianfar_ptp: Fix module autoload for OF platform driver
This platform driver has a OF device ID table but the OF module
alias information is not created so module autoloading won't work.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Luis de Bethencourt [Fri, 18 Sep 2015 15:55:02 +0000 (17:55 +0200)]
net: bcmgenet: Fix module autoload for OF platform driver
This platform driver has a OF device ID table but the OF module
alias information is not created so module autoloading won't work.
Signed-off-by: Luis de Bethencourt <luis@osg.samsung.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Luis de Bethencourt [Fri, 18 Sep 2015 15:54:30 +0000 (17:54 +0200)]
net: systemport: Fix module autoload for OF platform driver
This platform driver has a OF device ID table but the OF module
alias information is not created so module autoloading won't work.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Luis de Bethencourt [Fri, 18 Sep 2015 15:54:00 +0000 (17:54 +0200)]
net: arc: Fix module autoload for OF platform driver
This platform driver has a OF device ID table but the OF module
alias information is not created so module autoloading won't work.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Fri, 18 Sep 2015 11:16:50 +0000 (19:16 +0800)]
netlink: Fix autobind race condition that leads to zero port ID
The commit
c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID. This led to kernel
deadlocks that were observed by multiple people.
This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.
Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael S. Tsirkin [Fri, 18 Sep 2015 10:41:09 +0000 (13:41 +0300)]
macvtap: fix TUNSETSNDBUF values > 64k
Upon TUNSETSNDBUF, macvtap reads the requested sndbuf size into
a local variable u.
commit
39ec7de7092b ("macvtap: fix uninitialized access on
TUNSETIFF") changed its type to u16 (which is the right thing to
do for all other macvtap ioctls), breaking all values > 64k.
The value of TUNSETSNDBUF is actually a signed 32 bit integer, so
the right thing to do is to read it into an int.
Cc: David S. Miller <davem@davemloft.net>
Fixes: 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF")
Reported-by: Mark A. Peloquin
Bisected-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Fri, 18 Sep 2015 09:47:41 +0000 (11:47 +0200)]
ip6tunnel: make rx/tx bytes counters consistent
Like the previous patch, which fixes ipv4 tunnels, here is the ipv6 part.
Before the patch, the external ipv6 header + gre header were included on
tx.
After the patch:
$ ping -c1 192.168.6.121 ; ip -s l ls dev ip6gre1
PING 192.168.6.121 (192.168.6.121) 56(84) bytes of data.
64 bytes from 192.168.6.121: icmp_req=1 ttl=64 time=1.92 ms
--- 192.168.6.121 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.923/1.923/1.923/0.000 ms
7: ip6gre1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/gre6 20:01:06:60:30:08:c1:c3:00:00:00:00:00:00:01:23 peer 20:01:06:60:30:08:c1:c3:00:00:00:00:00:00:01:21
RX: bytes packets errors dropped overrun mcast
84 1 0 0 0 0
TX: bytes packets errors dropped carrier collsns
84 1 0 0 0 0
$ ping -c1 192.168.1.121 ; ip -s l ls dev ip6tnl1
PING 192.168.1.121 (192.168.1.121) 56(84) bytes of data.
64 bytes from 192.168.1.121: icmp_req=1 ttl=64 time=2.28 ms
--- 192.168.1.121 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.288/2.288/2.288/0.000 ms
8: ip6tnl1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1452 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/tunnel6 2001:660:3008:c1c3::123 peer 2001:660:3008:c1c3::121
RX: bytes packets errors dropped overrun mcast
84 1 0 0 0 0
TX: bytes packets errors dropped carrier collsns
84 1 0 0 0 0
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Fri, 18 Sep 2015 09:47:40 +0000 (11:47 +0200)]
iptunnel: make rx/tx bytes counters consistent
This was already done a long time ago in
commit
64194c31a0b6 ("inet: Make tunnel RX/TX byte counters more consistent")
but tx path was broken (at least since 3.10).
Before the patch the gre header was included on tx.
After the patch:
$ ping -c1 192.168.0.121 ; ip -s l ls dev gre1
PING 192.168.0.121 (192.168.0.121) 56(84) bytes of data.
64 bytes from 192.168.0.121: icmp_req=1 ttl=64 time=2.95 ms
--- 192.168.0.121 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.955/2.955/2.955/0.000 ms
7: gre1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1468 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/gre 10.16.0.249 peer 10.16.0.121
RX: bytes packets errors dropped overrun mcast
84 1 0 0 0 0
TX: bytes packets errors dropped carrier collsns
84 1 0 0 0 0
Reported-by: Julien Meunier <julien.meunier@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 21 Sep 2015 05:32:20 +0000 (22:32 -0700)]
Merge git://git./pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patch contains Netfilter fixes for your net tree, they are:
1) nf_log_unregister() should only set to NULL the logger that is being
unregistered, instead of everything else. Patch from Florian Westphal.
2) Fix a crash when accessing physoutdev from PREROUTING in br_netfilter.
This is partially reverting the patch to shrink nf_bridge_info to 32 bytes.
Also from Florian.
3) Use existing match/target extensions in the internal nft_compat extension
lists when the extension is family unspecific (ie. NFPROTO_UNSPEC).
4) Wait for rcu grace period before leaving nf_log_unregister().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Erik Hugne [Fri, 18 Sep 2015 08:46:31 +0000 (10:46 +0200)]
tipc: reinitialize pointer after skb linearize
The msg pointer into header may change after skb linearization.
We must reinitialize it after calling skb_linearize to prevent
operating on a freed or invalid pointer.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reported-by: Tamás Végh <tamas.vegh@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kevin Hao [Fri, 18 Sep 2015 07:42:30 +0000 (15:42 +0800)]
Revert "net/phy: Add Vitesse 8641 phy ID"
This reverts commit
1298267b548a78840bd4b3e030993ff8747ca5e6.
That commit claim that the Vitesse VSC8641 is compatible with Vitesse
82xx. But this is not true. It seems that all the registers used
in Vitesse phy driver are not compatible between 8641 and 82xx.
It does cause malfunction of the Ethernet on p1010rdb-pa board.
So we definitely need a rework in order to support the 8641 phy
in this driver.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Woodhouse [Thu, 17 Sep 2015 23:21:54 +0000 (00:21 +0100)]
8139cp: Call __cp_set_rx_mode() from cp_tx_timeout()
Unless we reset the RX config, on real hardware I don't seem to receive
any packets after a TX timeout.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Woodhouse [Thu, 17 Sep 2015 23:19:08 +0000 (00:19 +0100)]
8139cp: Use dev_kfree_skb_any() instead of dev_kfree_skb() in cp_clean_rings()
This can be called from cp_tx_timeout() with interrupts disabled.
Spotted by Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikola Forró [Thu, 17 Sep 2015 14:01:32 +0000 (16:01 +0200)]
net: Fix behaviour of unreachable, blackhole and prohibit routes
Man page of ip-route(8) says following about route types:
unreachable - these destinations are unreachable. Packets are dis‐
carded and the ICMP message host unreachable is generated. The local
senders get an EHOSTUNREACH error.
blackhole - these destinations are unreachable. Packets are dis‐
carded silently. The local senders get an EINVAL error.
prohibit - these destinations are unreachable. Packets are discarded
and the ICMP message communication administratively prohibited is
generated. The local senders get an EACCES error.
In the inet6 address family, this was correct, except the local senders
got ENETUNREACH error instead of EHOSTUNREACH in case of unreachable route.
In the inet address family, all three route types generated ICMP message
net unreachable, and the local senders got ENETUNREACH error.
In both address families all three route types now behave consistently
with documentation.
Signed-off-by: Nikola Forró <nforro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Wed, 16 Sep 2015 13:27:43 +0000 (15:27 +0200)]
bna: check for dma mapping errors
Check for DMA mapping errors, recover from them and register them in
ethtool stats like other errors.
Cc: Rasesh Mody <rasesh.mody@qlogic.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Rasesh Mody <rasesh.mody@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 17 Sep 2015 15:38:00 +0000 (08:38 -0700)]
tcp_cubic: do not set epoch_start in the future
Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start
is normally set at ACK processing time, not at send time.
Doing a proper fix would need to add an additional state variable,
and does not seem worth the trouble, given CUBIC bug has been there
forever before Jana noticed it.
Let's simply not set epoch_start in the future, otherwise
bictcp_update() could overflow and CUBIC would again
grow cwnd too fast.
This was detected thanks to a packetdrill test Neal wrote that was flaky
before applying this fix.
Fixes: 30927520dbae ("tcp_cubic: better follow cubic curve after idle period")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Jana Iyengar <jri@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Taku Izumi [Thu, 17 Sep 2015 14:21:21 +0000 (23:21 +0900)]
fjes: fix off-by-one error at fjes_hw_update_zone_task()
Dan Carpenter reported off-by-one error of fjes at
http://www.mail-archive.com/netdev@vger.kernel.org/msg77520.html
Actually this is a bug.
ep_shm_info[epidx].{es_status, zone} should be update
inside for loop.
This patch fixes this bug.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Thu, 17 Sep 2015 14:28:31 +0000 (16:28 +0200)]
MAINTAINERS: remove bouncing email address for qlcnic
I got this automated message from <shahed.shaikh@qlogic.com> when submitting
a qlcnic patch:
> Shahed Shaikh is no longer with QLogic. If you require assistance please
> contact Ariel Elior Ariel.Elior@qlogic.com
There's no point in having a bouncing address in MAINTAINERS.
CC: Dept-GELinuxNICDev@qlogic.com
CC: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 18 Sep 2015 05:32:16 +0000 (22:32 -0700)]
Merge branch 'vxlan-fixes'
Jiri Benc says:
====================
vxlan fixes
This fixes various issues with vxlan related to IPv6.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Thu, 17 Sep 2015 14:11:14 +0000 (16:11 +0200)]
bnx2x: track vxlan port count
The callback for adding vxlan port can be called with the same port for
both IPv4 and IPv6. Do not disable the offloading when the same port for
both protocols is added and later one of them removed.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Thu, 17 Sep 2015 14:11:13 +0000 (16:11 +0200)]
be2net: allow offloading with the same port for IPv4 and IPv6
The callback for adding vxlan port can be called with the same port for both
IPv4 and IPv6. Do not disable the offloading if this occurs.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Sathya Perla <sathya.perla@avagotech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Thu, 17 Sep 2015 14:11:12 +0000 (16:11 +0200)]
qlcnic: track vxlan port count
The callback for adding vxlan port can be called with the same port for
both IPv4 and IPv6. Do not disable the offloading when the same port for
both protocols is added and later one of them removed.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Thu, 17 Sep 2015 14:11:11 +0000 (16:11 +0200)]
vxlan: reject IPv6 addresses if IPv6 is not configured
When IPv6 address is set without IPv6 configured, the vxlan socket is mostly
treated as an IPv4 one but various lookus in fdb etc. still take the
AF_INET6 into account. This creates incosistencies with weird consequences.
Just reject IPv6 addresses in such case.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Thu, 17 Sep 2015 14:11:10 +0000 (16:11 +0200)]
vxlan: set needed headroom correctly
vxlan_setup is called when allocating the net_device, i.e. way before
vxlan_newlink (or vxlan_dev_configure) is called. This means
vxlan->default_dst is actually unset in vxlan_setup and the condition that
sets needed_headroom always takes the else branch.
Set the needed_headrom at the point when we have the information about
the address family available.
Fixes: e4c7ed415387c ("vxlan: add ipv6 support")
Fixes: 2853af6a2ea1a ("vxlan: use dev->needed_headroom instead of dev->hard_header_len")
CC: Cong Wang <cwang@twopensource.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Grzeschik [Thu, 17 Sep 2015 13:26:16 +0000 (15:26 +0200)]
MAINTAINERS: add arcnet and take maintainership
Add entry for arcnet to MAINTAINERS file and add myself as the
maintainer of the subsystem.
Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Cc: davem@davemloft.net
Cc: joe@perches.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Grzeschik [Thu, 17 Sep 2015 13:18:34 +0000 (15:18 +0200)]
ARCNET: fix hard_header_len limit
For arcnet the bare minimum header only contains the 4 bytes to
specify source, dest and offset (1, 1 and 2 bytes respectively).
The corresponding struct is struct arc_hardware.
The struct archdr contains additionally a union of possible soft
headers. When doing $insertusecasehere packets might well
include short (or even no?) soft headers.
For this reason only use arc_hardware instead of archdr to
determine the hard_header_len for an arcnet device.
Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 18 Sep 2015 05:25:51 +0000 (22:25 -0700)]
Merge branch 'for-upstream' of git://git./linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:
====================
pull request: bluetooth 2015-09-17
Here's one important patch for the 4.3-rc series that fixes an issue
with Bluetooth LE encryption failing because of a too early check for
the SMP context.
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sasha Levin [Wed, 16 Sep 2015 19:30:21 +0000 (15:30 -0400)]
atm: deal with setting entry before mkip was called
If we didn't call ATMARP_MKIP before ATMARP_ENCAP the VCC descriptor is
non-existant and we'll end up dereferencing a NULL ptr:
[
1033173.491930] kasan: GPF could be caused by NULL-ptr deref or user memory accessirq event stamp: 123386
[
1033173.493678] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[
1033173.493689] Modules linked in:
[
1033173.493697] CPU: 9 PID: 23815 Comm: trinity-c64 Not tainted
4.2.0-next-20150911-sasha-00043-g353d875-dirty #2545
[
1033173.493706] task:
ffff8800630c4000 ti:
ffff880063110000 task.ti:
ffff880063110000
[
1033173.493823] RIP: clip_ioctl (net/atm/clip.c:320 net/atm/clip.c:689)
[
1033173.493826] RSP: 0018:
ffff880063117a88 EFLAGS:
00010203
[
1033173.493828] RAX:
dffffc0000000000 RBX:
0000000000000000 RCX:
000000000000000c
[
1033173.493830] RDX:
0000000000000002 RSI:
ffffffffb3f10720 RDI:
0000000000000014
[
1033173.493832] RBP:
ffff880063117b80 R08:
ffff88047574d9a4 R09:
0000000000000000
[
1033173.493834] R10:
0000000000000000 R11:
0000000000000000 R12:
1ffff1000c622f53
[
1033173.493836] R13:
ffff8800cb905500 R14:
ffff8808d6da2000 R15:
00000000fffffdfd
[
1033173.493840] FS:
00007fa56b92d700(0000) GS:
ffff880478000000(0000) knlGS:
0000000000000000
[
1033173.493843] CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
[
1033173.493845] CR2:
0000000000000000 CR3:
00000000630e8000 CR4:
00000000000006a0
[
1033173.493855] Stack:
[
1033173.493862]
ffffffffb0b60444 000000000000eaea 0000000041b58ab3 ffffffffb3c3ce32
[
1033173.493867]
ffffffffb0b6f3e0 ffffffffb0b60444 ffffffffb5ea2e50 1ffff1000c622f5e
[
1033173.493873]
ffff8800630c4cd8 00000000000ee09a ffffffffb3ec4888 ffffffffb5ea2de8
[
1033173.493874] Call Trace:
[
1033173.494108] do_vcc_ioctl (net/atm/ioctl.c:170)
[
1033173.494113] vcc_ioctl (net/atm/ioctl.c:189)
[
1033173.494116] svc_ioctl (net/atm/svc.c:605)
[
1033173.494200] sock_do_ioctl (net/socket.c:874)
[
1033173.494204] sock_ioctl (net/socket.c:958)
[
1033173.494244] do_vfs_ioctl (fs/ioctl.c:43 fs/ioctl.c:607)
[
1033173.494290] SyS_ioctl (fs/ioctl.c:622 fs/ioctl.c:613)
[
1033173.494295] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
[
1033173.494362] Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 50 09 00 00 49 8b 9e 60 06 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 14 48 89 fa 48 c1 ea 03 <0f> b6 04 02 48 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 14 09 00
All code
========
0: fa cli
1: 48 c1 ea 03 shr $0x3,%rdx
5: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
9: 0f 85 50 09 00 00 jne 0x95f
f: 49 8b 9e 60 06 00 00 mov 0x660(%r14),%rbx
16: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
1d: fc ff df
20: 48 8d 7b 14 lea 0x14(%rbx),%rdi
24: 48 89 fa mov %rdi,%rdx
27: 48 c1 ea 03 shr $0x3,%rdx
2b:* 0f b6 04 02 movzbl (%rdx,%rax,1),%eax <-- trapping instruction
2f: 48 89 fa mov %rdi,%rdx
32: 83 e2 07 and $0x7,%edx
35: 38 d0 cmp %dl,%al
37: 7f 08 jg 0x41
39: 84 c0 test %al,%al
3b: 0f 85 14 09 00 00 jne 0x955
Code starting with the faulting instruction
===========================================
0: 0f b6 04 02 movzbl (%rdx,%rax,1),%eax
4: 48 89 fa mov %rdi,%rdx
7: 83 e2 07 and $0x7,%edx
a: 38 d0 cmp %dl,%al
c: 7f 08 jg 0x16
e: 84 c0 test %al,%al
10: 0f 85 14 09 00 00 jne 0x92a
[
1033173.494366] RIP clip_ioctl (net/atm/clip.c:320 net/atm/clip.c:689)
[
1033173.494368] RSP <
ffff880063117a88>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Wed, 16 Sep 2015 15:26:14 +0000 (17:26 +0200)]
ipv6: ip6_fragment: fix headroom tests and skb leak
David Woodhouse reports skb_under_panic when we try to push ethernet
header to fragmented ipv6 skbs:
skbuff: skb_under_panic: text:
c1277f1e len:1294 put:14 head:
dec98000
data:
dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan
[..]
ip6_finish_output2+0x196/0x4da
David further debugged this:
[..] offending fragments were arriving here with skb_headroom(skb)==10.
Which is reasonable, being the Solos ADSL card's header of 8 bytes
followed by 2 bytes of PPP frame type.
The problem is that if netfilter ipv6 defragmentation is used, skb_cow()
in ip6_forward will only see reassembled skb.
Therefore, headroom is overestimated by 8 bytes (we pulled fragment
header) and we don't check the skbs in the frag_list either.
We can't do these checks in netfilter defrag since outdev isn't known yet.
Furthermore, existing tests in ip6_fragment did not consider the fragment
or ipv6 header size when checking headroom of the fraglist skbs.
While at it, also fix a skb leak on memory allocation -- ip6_fragment
must consume the skb.
I tested this e1000 driver hacked to not allocate additional headroom
(we end up in slowpath, since LL_RESERVED_SPACE is 16).
If 2 bytes of headroom are allocated, fastpath is taken (14 byte
ethernet header was pulled, so 16 byte headroom available in all
fragments).
Reported-by: David Woodhouse <dwmw2@infradead.org>
Diagnosed-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Woodhouse [Wed, 16 Sep 2015 11:35:00 +0000 (12:35 +0100)]
solos-pci: Increase headroom on received packets
A comment in include/linux/skbuff.h says that:
* Various parts of the networking layer expect at least 32 bytes of
* headroom, you should not reduce this.
This was demonstrated by a panic when handling fragmented IPv6 packets:
http://marc.info/?l=linux-netdev&m=
144236093519172&w=2
It's not entirely clear if that comment is still valid — and if it is,
perhaps netif_rx() ought to be enforcing it with a warning.
But either way, it is rather stupid from a performance point of view
for us to be receiving packets into a buffer which doesn't have enough
room to prepend an Ethernet header — it means that *every* incoming
packet is going to be need to be reallocated. So let's fix that.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Javier Martinez Canillas [Wed, 16 Sep 2015 09:11:22 +0000 (11:11 +0200)]
net: ks8851: Export OF module alias information
Drivers needs to export the OF id table and this be built into
the module or udev won't have the necessary information to autoload
the driver module when the device is registered via OF.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 16 Sep 2015 01:29:47 +0000 (18:29 -0700)]
net/mlx4_en: really allow to change RSS key
When changing rss key, we do not want to overwrite user provided key
by the one provided by netdev_rss_key_fill(), which is the host random
key generated at boot time.
Fixes: 947cbb0ac242 ("net/mlx4_en: Support for configurable RSS hash function")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eyal Perry <eyalpe@mellanox.com>
CC: Amir Vadai <amirv@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 15 Sep 2015 22:10:50 +0000 (15:10 -0700)]
net: Fix vti use case with oif in dst lookups
Steffen reported that the recent change to add oif to dst lookups breaks
the VTI use case. The problem is that with the oif set in the flow struct
the comparison to the nh_oif is triggered. Fix by splitting the
FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache
bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare
nh oif (FLOWI_FLAG_SKIP_NH_OIF).
Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Tue, 15 Sep 2015 11:50:09 +0000 (17:20 +0530)]
cxgb4: add device ID for few T5 adapters
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Sutter [Tue, 15 Sep 2015 08:33:07 +0000 (10:33 +0200)]
net: qdisc: enhance default_qdisc documentation
Aside from some lingual cleanup, point out which interfaces are not or
partly covered by this setting.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 15 Sep 2015 16:50:14 +0000 (10:50 -0600)]
net: Add documentation for VRF device
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Stringer [Mon, 14 Sep 2015 18:14:50 +0000 (11:14 -0700)]
openvswitch: Fix IPv6 exthdr handling with ct helpers.
Static code analysis reveals the following bug:
net/openvswitch/conntrack.c:281 ovs_ct_helper()
warn: unsigned 'protoff' is never less than zero.
This signedness bug breaks error handling for IPv6 extension headers when
using conntrack helpers. Fix the error by using a local signed variable.
Fixes: cae3a2627520: "openvswitch: Allow attaching helpers to ct
action"
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Sun, 13 Sep 2015 17:18:33 +0000 (10:18 -0700)]
ipv6: include NLM_F_REPLACE in route replace notifications
This patch adds NLM_F_REPLACE flag to ipv6 route replace notifications.
This makes nlm_flags in ipv6 replace notifications consistent
with ipv4.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Thu, 17 Sep 2015 11:37:00 +0000 (13:37 +0200)]
netfilter: nf_log: wait for rcu grace after logger unregistration
The nf_log_unregister() function needs to call synchronize_rcu() to make sure
that the objects are not dereferenced anymore on module removal.
Fixes: 5962815a6a56 ("netfilter: nf_log: use an array of loggers instead of list")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Johan Hedberg [Fri, 4 Sep 2015 09:22:46 +0000 (12:22 +0300)]
Bluetooth: Delay check for conn->smp in smp_conn_security()
There are several actions that smp_conn_security() might make that do
not require a valid SMP context (conn->smp pointer). One of these
actions is to encrypt the link with an existing LTK. If the SMP
context wasn't initialized properly we should still allow the
independent actions to be done, i.e. the check for the context should
only be done at the last possible moment.
Reported-by: Chuck Ebbert <cebbert.lkml@gmail.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.0+
Julia Lawall [Sun, 13 Sep 2015 12:15:27 +0000 (14:15 +0200)]
dccp: drop null test before destroy functions
Remove unneeded NULL test.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression x;
@@
-if (x != NULL)
\(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
@@
expression x;
@@
-if (x != NULL) {
\(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
x = NULL;
-}
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julia Lawall [Sun, 13 Sep 2015 12:15:18 +0000 (14:15 +0200)]
net: core: drop null test before destroy functions
Remove unneeded NULL test.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@ expression x; @@
-if (x != NULL) {
\(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
x = NULL;
-}
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julia Lawall [Sun, 13 Sep 2015 12:15:03 +0000 (14:15 +0200)]
atm: he: drop null test before destroy functions
Remove unneeded NULL test.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@ expression x; @@
-if (x != NULL)
\(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesse Gross [Sat, 12 Sep 2015 01:38:28 +0000 (18:38 -0700)]
openvswitch: Fix mask generation for nested attributes.
Masks were added to OVS flows in a way that was backwards compatible
with userspace programs that did not generate masks. As a result, it is
possible that we may receive flows that do not have a mask and we need
to synthesize one.
Generating a mask requires iterating over attributes and descending into
nested attributes. For each level we need to know the size to generate the
correct mask. We do this with a linked table of attribute types.
Although the logic to handle these nested attributes was there in concept,
there are a number of bugs in practice. Examples include incomplete links
between tables, variable length attributes being treated as nested and
missing sanity checks.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sjoerd Simons [Fri, 11 Sep 2015 20:25:48 +0000 (22:25 +0200)]
net: stmmac: Use msleep rather then udelay for reset delay
The reset delays used for stmmac are in the order of 10ms to 1 second,
which is far too long for udelay usage, so switch to using msleep.
Practically this fixes the PHY not being reliably detected in some cases
as udelay wouldn't actually delay for long enough to let the phy
reliably be reset.
Signed-off-by: Sjoerd Simons <sjoerd.simons@collabora.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 15 Sep 2015 21:44:29 +0000 (14:44 -0700)]
rtnetlink: catch -EOPNOTSUPP errors from ndo_bridge_getlink
problem reported:
kernel 4.1.3
------------
# bridge vlan
port vlan ids
eth0 1 PVID Egress Untagged
90
91
92
93
94
95
96
97
98
99
100
vmbr0 1 PVID Egress Untagged
94
kernel 4.2
-----------
# bridge vlan
port vlan ids
ndo_bridge_getlink can return -EOPNOTSUPP when an interfaces
ndo_bridge_getlink op is set to switchdev_port_bridge_getlink
and CONFIG_SWITCHDEV is not defined. This today can happen to
bond, rocker and team devices. This patch adds -EOPNOTSUPP
checks after calls to ndo_bridge_getlink.
Fixes: 85fdb956726ff2a ("switchdev: cut over to new switchdev_port_bridge_getlink")
Reported-by: Alexandre DERUMIER <aderumier@odiso.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Guinot [Tue, 15 Sep 2015 20:41:21 +0000 (22:41 +0200)]
net: mvneta: fix DMA buffer unmapping in mvneta_rx()
This patch fixes a regression introduced by the commit
a84e32894191
("net: mvneta: fix refilling for Rx DMA buffers"). Due to this commit
the newly allocated Rx buffers are DMA-unmapped in place of those passed
to the networking stack. Obviously, this causes data corruptions.
This patch fixes the issue by ensuring that the right Rx buffers are
DMA-unmapped.
Reported-by: Oren Laskin <oren@igneous.io>
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
Fixes: a84e32894191 ("net: mvneta: fix refilling for Rx DMA buffers")
Cc: <stable@vger.kernel.org> # v3.8+
Tested-by: Oren Laskin <oren@igneous.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 15 Sep 2015 21:53:46 +0000 (14:53 -0700)]
Merge branch 'ip6tunnel_dst'
Martin KaFai Lau says:
====================
ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
v4:
- Fix a compilation error in patch 5 when CONFIG_LOCKDEP is turned on and
re-test it
v3:
- Merge a 'if else if' test in patch 4
- Use rcu_dereference_protected in patch 5 to fix a sparse check when
CONFIG_SPARSE_RCU_POINTER is enabled
v2:
- Add patch 4 and 5 to remove the spinlock
v1:
This patch series is to fix the dst refcnt bugs in ip6_tunnel.
Patch 1 and 2 are the prep works. Patch 3 is the fix.
I can reproduce the bug by adding and removing the ip6gre tunnel
while running a super_netperf TCP_CRR test. I get the following
trace by adding WARN_ON_ONCE(newrefcnt < 0) to dst_release():
[ 312.760432] ------------[ cut here ]------------
[ 312.774664] WARNING: CPU: 2 PID: 10263 at net/core/dst.c:288 dst_release+0xf3/0x100()
[ 312.776041] Modules linked in: k10temp coretemp hwmon ip6_gre ip6_tunnel tunnel6 ipmi_devintf ipmi_ms\
ghandler ip6table_filter ip6_tables xt_NFLOG nfnetlink_log nfnetlink xt_comment xt_statistic iptable_fil\
ter ip_tables x_tables nfsv3 nfs_acl nfs fscache lockd grace mptctl netconsole autofs4 rpcsec_gss_krb5 a\
uth_rpcgss oid_registry sunrpc ipv6 dm_mod loop iTCO_wdt iTCO_vendor_support serio_raw rtc_cmos pcspkr i\
2c_i801 i2c_core lpc_ich mfd_core ehci_pci ehci_hcd e1000e mlx4_en ptp pps_core vxlan udp_tunnel ip6_udp\
_tunnel mlx4_core sg button ext3 jbd mpt2sas raid_class
[ 312.785302] CPU: 2 PID: 10263 Comm: netperf Not tainted
4.2.0-rc8-00046-g4db9b63-dirty #15
[ 312.791695] Hardware name: Quanta Freedom /Windmill-EP, BIOS F03_3B04 09/12/2013
[ 312.792965]
ffffffff819dca2c ffff8811dfbdf6f8 ffffffff816537de ffff88123788fdb8
[ 312.794263]
0000000000000000 ffff8811dfbdf738 ffffffff81052646 ffff8811dfbdf768
[ 312.795593]
ffff881203a98180 00000000ffffffff ffff88242927a000 ffff88120a2532e0
[ 312.796946] Call Trace:
[ 312.797380] [<
ffffffff816537de>] dump_stack+0x45/0x57
[ 312.798288] [<
ffffffff81052646>] warn_slowpath_common+0x86/0xc0
[ 312.799699] [<
ffffffff8105273a>] warn_slowpath_null+0x1a/0x20
[ 312.800852] [<
ffffffff8159f9b3>] dst_release+0xf3/0x100
[ 312.801834] [<
ffffffffa03f1308>] ip6_tnl_dst_store+0x48/0x70 [ip6_tunnel]
[ 312.803738] [<
ffffffffa03fd0b6>] ip6gre_xmit2+0x536/0x720 [ip6_gre]
[ 312.804774] [<
ffffffffa03fd40a>] ip6gre_tunnel_xmit+0x16a/0x410 [ip6_gre]
[ 312.805986] [<
ffffffff8159934b>] dev_hard_start_xmit+0x23b/0x390
[ 312.808810] [<
ffffffff815a2f5f>] ? neigh_destroy+0xef/0x140
[ 312.809843] [<
ffffffff81599a6c>] __dev_queue_xmit+0x48c/0x4f0
[ 312.813931] [<
ffffffff81599ae3>] dev_queue_xmit_sk+0x13/0x20
[ 312.814993] [<
ffffffff815a0832>] neigh_direct_output+0x12/0x20
[ 312.817448] [<
ffffffffa021d633>] ip6_finish_output2+0x183/0x460 [ipv6]
[ 312.818762] [<
ffffffff81306fc5>] ? find_next_bit+0x15/0x20
[ 312.819671] [<
ffffffffa021fd79>] ip6_finish_output+0x89/0xe0 [ipv6]
[ 312.820720] [<
ffffffffa021fe14>] ip6_output+0x44/0xe0 [ipv6]
[ 312.821762] [<
ffffffff815c8809>] ? nf_hook_slow+0x69/0xc0
[ 312.823123] [<
ffffffffa021d232>] ip6_xmit+0x242/0x4c0 [ipv6]
[ 312.824073] [<
ffffffffa021c9f0>] ? ac6_proc_exit+0x20/0x20 [ipv6]
[ 312.825116] [<
ffffffffa024c751>] inet6_csk_xmit+0x61/0xa0 [ipv6]
[ 312.826127] [<
ffffffff815eb590>] tcp_transmit_skb+0x4f0/0x9b0
[ 312.827441] [<
ffffffff815ed267>] tcp_connect+0x637/0x7a0
[ 312.828327] [<
ffffffffa0245906>] tcp_v6_connect+0x2d6/0x550 [ipv6]
[ 312.829581] [<
ffffffff81606f05>] __inet_stream_connect+0x95/0x2f0
[ 312.830600] [<
ffffffff810ae13a>] ? hrtimer_try_to_cancel+0x1a/0xf0
[ 312.833456] [<
ffffffff812fba19>] ? timerqueue_add+0x59/0xb0
[ 312.834407] [<
ffffffff81607198>] inet_stream_connect+0x38/0x50
[ 312.835886] [<
ffffffff8157cb17>] SYSC_connect+0xb7/0xf0
[ 312.840035] [<
ffffffff810af6d3>] ? do_setitimer+0x1b3/0x200
[ 312.840983] [<
ffffffff810af75a>] ? alarm_setitimer+0x3a/0x70
[ 312.841941] [<
ffffffff8157d7ae>] SyS_connect+0xe/0x10
[ 312.842818] [<
ffffffff81659297>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 312.844206] ---[ end trace
43f3ecd86c3b1313 ]---
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 15 Sep 2015 21:30:09 +0000 (14:30 -0700)]
ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel
This patch uses a seqlock to ensure consistency between idst->dst and
idst->cookie. It also makes dst freeing from fib tree to undergo a
rcu grace period.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 15 Sep 2015 21:30:08 +0000 (14:30 -0700)]
ipv6: Avoid double dst_free
It is a prep work to get dst freeing from fib tree undergo
a rcu grace period.
The following is a common paradigm:
if (ip6_del_rt(rt))
dst_free(rt)
which means, if rt cannot be deleted from the fib tree, dst_free(rt) now.
1. We don't know the ip6_del_rt(rt) failure is because it
was not managed by fib tree (e.g. DST_NOCACHE) or it had already been
removed from the fib tree.
2. If rt had been managed by the fib tree, ip6_del_rt(rt) failure means
dst_free(rt) has been called already. A second
dst_free(rt) is not always obviously safe. The rt may have
been destroyed already.
3. If rt is a DST_NOCACHE, dst_free(rt) should not be called.
4. It is a stopper to make dst freeing from fib tree undergo a
rcu grace period.
This patch is to use a DST_NOCACHE flag to indicate a rt is
not managed by the fib tree.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 15 Sep 2015 21:30:07 +0000 (14:30 -0700)]
ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
Problems in the current dst_entry cache in the ip6_tunnel:
1. ip6_tnl_dst_set is racy. There is no lock to protect it:
- One major problem is that the dst refcnt gets messed up. F.e.
the same dst_cache can be released multiple times and then
triggering the infamous dst refcnt < 0 warning message.
- Another issue is the inconsistency between dst_cache and
dst_cookie.
It can be reproduced by adding and removing the ip6gre tunnel
while running a super_netperf TCP_CRR test.
2. ip6_tnl_dst_get does not take the dst refcnt before returning
the dst.
This patch:
1. Create a percpu dst_entry cache in ip6_tnl
2. Use a spinlock to protect the dst_cache operations
3. ip6_tnl_dst_get always takes the dst refcnt before returning
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 15 Sep 2015 21:30:06 +0000 (14:30 -0700)]
ipv6: Rename the dst_cache helper functions in ip6_tunnel
It is a prep work to fix the dst_entry refcnt bugs in
ip6_tunnel.
This patch rename:
1. ip6_tnl_dst_check() to ip6_tnl_dst_get() to better
reflect that it will take a dst refcnt in the next patch.
2. ip6_tnl_dst_store() to ip6_tnl_dst_set() to have a more
conventional name matching with ip6_tnl_dst_get().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 15 Sep 2015 21:30:05 +0000 (14:30 -0700)]
ipv6: Refactor common ip6gre_tunnel_init codes
It is a prep work to fix the dst_entry refcnt bugs in ip6_tunnel.
This patch refactors some common init codes used by both
ip6gre_tunnel_init and ip6gre_tap_init.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Mon, 14 Sep 2015 16:04:09 +0000 (18:04 +0200)]
netfilter: nft_compat: skip family comparison in case of NFPROTO_UNSPEC
Fix lookup of existing match/target structures in the corresponding list
by skipping the family check if NFPROTO_UNSPEC is used.
This is resulting in the allocation and insertion of one match/target
structure for each use of them. So this not only bloats memory
consumption but also severely affects the time to reload the ruleset
from the iptables-compat utility.
After this patch, iptables-compat-restore and iptables-compat take
almost the same time to reload large rulesets.
Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Mon, 14 Sep 2015 15:06:27 +0000 (17:06 +0200)]
netfilter: bridge: fix routing of bridge frames with call-iptables=1
We can't re-use the physoutdev storage area.
1. When using NFQUEUE in PREROUTING, we attempt to bump a bogus
refcnt since nf_bridge->physoutdev is garbage (ipv4/ipv6 address)
2. for same reason, we crash in physdev match in FORWARD or later if
skb is routed instead of bridged.
This increases nf_bridge_info to 40 bytes, but we have no other choice.
Fixes: 72b1e5e4cac7 ("netfilter: bridge: reduce nf_bridge_info to 32 bytes again")
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Wed, 9 Sep 2015 00:57:21 +0000 (02:57 +0200)]
netfilter: nf_log: don't zap all loggers on unregister
like nf_log_unset, nf_log_unregister must not reset the list of loggers.
Otherwise, a call to nf_log_unregister() will render loggers of other nf
protocols unusable:
iptables -A INPUT -j LOG
modprobe nf_log_arp ; rmmod nf_log_arp
iptables -A INPUT -j LOG
iptables: No chain/target/match by that name
Fixes: 30e0c6a6be ("netfilter: nf_log: prepare net namespace support for loggers")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>