firefly-linux-kernel-4.4.55.git
Fugang Duan [Fri, 12 Sep 2014 21:00:48 +0000 (05:00 +0800)]
net: fec: change data structure to support multiqueue

This patch only changes the data structures to support multiqueue.
Only one queue is enabled.

The Ethernet multiqueue mechanism can improve performance on SMP systems.
With a single hardware queue, multiqueue can balance CPU load. With
multiple hardware queues, multiple cores can process network packets in
parallel. See the following article for details on the advantages of
multiqueue:
http://vger.kernel.org/~davem/davem_nyc09.pdf
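The sketch below is a rough illustration of "change the data structure, keep one queue". Per-queue state is held in fixed arrays, and a TX queue is picked from a flow hash. All names here (`fec_priv_sketch`, `FEC_MAX_TX_QUEUES`, and so on) are hypothetical and do not reproduce the driver's actual structures. With `num_tx_queues == 1`, every packet maps to queue 0, which matches the single enabled queue.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limits; the real driver defines its own maxima. */
#define FEC_MAX_TX_QUEUES 3
#define FEC_MAX_RX_QUEUES 3

struct fec_txq_sketch {
    unsigned int index;
    unsigned long packets;
};

struct fec_rxq_sketch {
    unsigned int index;
    unsigned long packets;
};

/* Per-queue state lives in arrays instead of a single embedded ring. */
struct fec_priv_sketch {
    unsigned int num_tx_queues;
    unsigned int num_rx_queues;
    struct fec_txq_sketch txq[FEC_MAX_TX_QUEUES];
    struct fec_rxq_sketch rxq[FEC_MAX_RX_QUEUES];
};

static void fec_init_queues(struct fec_priv_sketch *p,
                            unsigned int ntx, unsigned int nrx)
{
    assert(ntx >= 1 && ntx <= FEC_MAX_TX_QUEUES);
    assert(nrx >= 1 && nrx <= FEC_MAX_RX_QUEUES);
    p->num_tx_queues = ntx;
    p->num_rx_queues = nrx;
    for (unsigned int i = 0; i < ntx; i++)
        p->txq[i] = (struct fec_txq_sketch){ .index = i, .packets = 0 };
    for (unsigned int i = 0; i < nrx; i++)
        p->rxq[i] = (struct fec_rxq_sketch){ .index = i, .packets = 0 };
}

/* Spread flows across TX queues; with one queue this always returns queue 0. */
static struct fec_txq_sketch *fec_select_txq(struct fec_priv_sketch *p,
                                             uint32_t flow_hash)
{
    return &p->txq[flow_hash % p->num_tx_queues];
}
```

Enabling more queues later only changes the counts passed to `fec_init_queues()`. The selection code stays the same.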

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Frank Li <frank.li@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fugang Duan [Fri, 12 Sep 2014 21:00:47 +0000 (05:00 +0800)]
net:fec: add enet AVB feature macro define for imx6sx

Add the enet AVB feature macro definition for imx6sx.

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Frank Li <Frank.Li@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fugang Duan [Fri, 12 Sep 2014 21:00:46 +0000 (05:00 +0800)]
net:fec: add enet refrence clock for i.MX 6SX chip

The i.MX6sx enet has the following clocks for user configuration:
clk_ipg: ipg_clk_s, ipg_clk_mac0_s, 66 MHz.
clk_ahb: enet system clock; for imx6sx it is the enet AXI clock.
 For imx6sx, it is also the clock source for interrupt coalescing.
 Clock range: 200 MHz to 266 MHz.
clk_ref: reference clock for TX and RX. In imx6sx enet RGMII mode,
 the reference clock is 125 MHz and comes from the internal PLL or an
 external source. On the i.MX6sx-arm2 board, it comes from the internal PLL.
 clk_ref is optional and depends on the board.
clk_enet_out: this clock can be output from the internal PLL to supply a
 50 MHz clock to the PHY. clk_enet_out is optional and depends on the
 chip and board.
clk_ptp: IEEE 1588 timestamp clock. It is optional and depends on the chip.

This patch adds clk_ref to distinguish between the different clocks.
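The required/optional split above can be pictured as a small clock table, paired with a check that only the mandatory clocks must be present. This is a hypothetical sketch. The table, `enet_first_missing_clk()`, and the `present[]` array are illustrative, and they do not show how the FEC driver actually acquires its clocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Clock roles as listed in the commit message; the table itself is hypothetical. */
struct enet_clk_sketch {
    const char *name;
    bool optional;          /* depends on chip and/or board */
};

static const struct enet_clk_sketch imx6sx_enet_clks[] = {
    { "clk_ipg",      false },  /* 66 MHz */
    { "clk_ahb",      false },  /* 200-266 MHz AXI clock, also drives IRQ coalescing */
    { "clk_ref",      true  },  /* 125 MHz RGMII TX/RX reference */
    { "clk_enet_out", true  },  /* 50 MHz output for the PHY */
    { "clk_ptp",      true  },  /* IEEE 1588 timestamp clock */
};

#define ENET_NUM_CLKS (sizeof(imx6sx_enet_clks) / sizeof(imx6sx_enet_clks[0]))

/* Returns the index of the first missing mandatory clock, or -1 if none is missing. */
static int enet_first_missing_clk(const bool present[ENET_NUM_CLKS])
{
    assert(present != NULL);
    for (size_t i = 0; i < ENET_NUM_CLKS; i++)
        if (!present[i] && !imx6sx_enet_clks[i].optional)
            return (int)i;
    return -1;
}
```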

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Frank Li <Frank.Li@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Fri, 12 Sep 2014 21:58:44 +0000 (23:58 +0200)]
net: DSA: Marvell mv88e6171 switch driver

This is the Marvell driver with some cleanups by Claudio Leite
and myself.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Cc: Claudio Leite <leitec@staticky.com>
Signed-off-by: Claudio Leite <leitec@staticky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 13 Sep 2014 21:12:25 +0000 (17:12 -0400)]
Merge branch 'be2net-next'

Sathya Perla says:

====================
be2net: patch set

Patch 1 fixes some minor issues with log messages in be2net.

Patch 2 replaces strcpy() calls with strlcpy() to avoid possible buffer
overflow.

Patch 3 improves the RX buffer posting scheme for jumbo frames.

Patch 4 replaces the use of v0 of SET_FLOW_CONTROL cmd with v1 to receive
a definitive completion status from FW.

Patch 5 adds support for the ethtool "-m" option.

Patch 6 fixes port-type reporting via ethtool get_settings for QSFP/SFP+
interfaces.

Patch 7 fixes the usage of MODIFY_EQD FW cmd to target a max of 8 EQs on
Lancer chip.

Patch 8 enables PCIe error reporting even for VFs.

Please consider applying this patch set to net-next. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Kalesh AP [Fri, 12 Sep 2014 12:09:21 +0000 (17:39 +0530)]
be2net: enable PCIe error reporting on VFs too

Currently PCIe error reporting is enabled only on PFs. This patch enables
this feature on VFs too as Lancer VFs support it.

Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kalesh AP [Fri, 12 Sep 2014 12:09:20 +0000 (17:39 +0530)]
be2net: send a max of 8 EQs to be_cmd_modify_eqd() on Lancer

The MODIFY_EQ_DELAY FW cmd on Lancer is supported for a max of 8 EQs per cmd.
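Respecting that limit means issuing one command for each group of up to 8 EQs. The sketch below shows the batching loop, with a callback standing in for the real firmware command. `modify_eqd_batched()`, the callback type, and the recording helper are all hypothetical.

```c
#include <assert.h>

/* Per the commit message, Lancer accepts at most 8 EQs per MODIFY_EQ_DELAY cmd. */
#define LANCER_MAX_EQS_PER_CMD 8

/* Hypothetical callback standing in for issuing one FW command. */
typedef int (*modify_eqd_fn)(const int *eq_ids, int num, void *ctx);

/* Send num_eqs EQ updates in batches of at most max_per_cmd; stop on the first error. */
static int modify_eqd_batched(const int *eq_ids, int num_eqs, int max_per_cmd,
                              modify_eqd_fn send, void *ctx)
{
    assert(max_per_cmd > 0);
    for (int i = 0; i < num_eqs; i += max_per_cmd) {
        int n = num_eqs - i;
        if (n > max_per_cmd)
            n = max_per_cmd;
        int rc = send(&eq_ids[i], n, ctx);
        if (rc)
            return rc;
    }
    return 0;
}

/* Example callback: records the size of each command that would be sent. */
struct batch_log {
    int sizes[16];
    int count;
};

static int record_batch(const int *eq_ids, int num, void *ctx)
{
    struct batch_log *bl = ctx;
    (void)eq_ids;
    if (bl->count >= 16)
        return -1;
    bl->sizes[bl->count++] = num;
    return 0;
}
```

With 20 EQs, this issues three commands covering 8, 8, and 4 EQs.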

Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ravikumar Nelavelli [Fri, 12 Sep 2014 12:09:19 +0000 (17:39 +0530)]
be2net: fix port-type reporting in get_settings

Report the ethtool port-type/supported/advertising values based on the
cable_type for QSFP and SFP+ interfaces. The cable_type is parsed from
the transceiver data fetched from the FW.

Signed-off-by: Ravikumar Nelavelli <ravikumar.nelavelli@emulex.com>
Signed-off-by: Suresh Reddy <Suresh.Reddy@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark Leonard [Fri, 12 Sep 2014 12:09:18 +0000 (17:39 +0530)]
be2net: add ethtool "-m" option support

This patch adds support for the dump-module-eeprom and module-info
ethtool options.

Signed-off-by: Mark Leonard <mark.leonard@emulex.com>
Signed-off-by: Suresh Reddy <Suresh.Reddy@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Fri, 12 Sep 2014 12:09:17 +0000 (17:39 +0530)]
be2net: use v1 of SET_FLOW_CONTROL command

In some configurations the FW doesn't allow changing flow control settings
of a link. Unless a v1 version of the SET_FLOW_CONTROL cmd is used, the FW
doesn't report an error to the driver.

Signed-off-by: Suresh Reddy <Suresh.Reddy@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ajit Khaparde [Fri, 12 Sep 2014 12:09:16 +0000 (17:39 +0530)]
be2net: fix RX fragment posting for jumbo frames

In the RX path, the driver currently consumes up to 64 (budget) packets in
one NAPI sweep. When a received packet is larger than the fragment size
(2K), more than one fragment is consumed for each packet. Because the
driver currently posts at most 64 fragments, not all of the consumed
fragments may be replenished. This can cause avoidable drops in the RX path.
This patch fixes this by posting max(consumed_frags, 64) fragments. This is
done only when there are at least 64 free slots in the RXQ.
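Take 9000-byte jumbo frames and 2K fragments as an illustrative case. Each packet spans five fragments, so a full 64-packet sweep consumes 320 fragments. The old scheme would repost only 64 of them. The sketch below shows the new posting rule. The function name and ring size are hypothetical. Capping the result at the number of free slots is an assumption, not something the commit message states.

```c
#include <assert.h>

#define MAX_RX_POST 64      /* NAPI budget / default post count, per the commit */
#define RXQ_LEN_SKETCH 1024 /* hypothetical RX ring size */

/* Number of RX fragments to post after a NAPI sweep that consumed consumed_frags. */
static unsigned int rx_frags_to_post(unsigned int consumed_frags,
                                     unsigned int free_slots)
{
    assert(free_slots <= RXQ_LEN_SKETCH);
    if (free_slots < MAX_RX_POST)
        return 0;   /* wait until at least 64 slots are free */
    unsigned int want = consumed_frags > MAX_RX_POST ? consumed_frags : MAX_RX_POST;
    /* Assumption: never post more than the ring can hold right now. */
    return want < free_slots ? want : free_slots;
}
```

For the jumbo case above, `rx_frags_to_post(320, 400)` replenishes all 320 consumed fragments instead of 64.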

Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Fri, 12 Sep 2014 12:09:15 +0000 (17:39 +0530)]
be2net: replace strcpy with strlcpy

Replace strcpy with strlcpy, as it avoids a possible buffer overflow.
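strlcpy() copies at most size - 1 bytes and always NUL-terminates a non-empty buffer. It returns the length of the source, so callers can detect truncation. strlcpy() is not available in every C library. The sketch below therefore re-implements those semantics under the hypothetical name `strlcpy_sketch()`.

```c
#include <assert.h>
#include <string.h>

/* Userspace sketch with strlcpy() semantics: copy at most size - 1 bytes,
 * NUL-terminate when size > 0, and return strlen(src) so that a return
 * value >= size signals truncation. */
static size_t strlcpy_sketch(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size > 0) {
        assert(dst != NULL);
        size_t n = len >= size ? size - 1 : len;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```

Copying a 15-character string into an 8-byte buffer yields a truncated but terminated "be2net-". The return value of 15 reveals the truncation. A plain strcpy() would have written past the end of the buffer.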

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Fri, 12 Sep 2014 12:09:14 +0000 (17:39 +0530)]
be2net: fix some log messages

This patch fixes the following minor issues with log messages in be2net:
  1) A period is not required at the end of a log message.
  2) Remove "Unknown grp5 event" logs to reduce noise. The driver can safely
     ignore async events from FW it's not interested in.
  3) Reword a log message for better readability to say that SRIOV
     "is disabled" rather than "not supported".

Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hannes Frederic Sowa [Fri, 12 Sep 2014 12:04:43 +0000 (14:04 +0200)]
net: filter: constify detection of pkt_type_offset

Currently we have two pkt_type_offset functions that do the same thing,
spread across the architecture files. Remove them and replace them
with a PKT_TYPE_OFFSET macro helper, which uses offsetof to get the
constant offset of a zero-sized sk_buff member placed right in front of
the bitfield. This new offset marker does not change the size of struct sk_buff.
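The zero-sized marker trick can be reproduced in userspace with GCC or Clang, both of which accept zero-length arrays as an extension. The toy struct below is hypothetical; the real struct sk_buff is much larger. The sketch checks at compile time that the marker does not change the struct size. It then reads pkt_type from the raw byte at `PKT_TYPE_OFFSET`, choosing the mask by bitfield endianness, in the same spirit as the kernel's PKT_TYPE_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy layouts; the real struct sk_buff is far larger. */
struct skb_plain {
    unsigned int len;
    unsigned short queue_mapping;
    unsigned char pkt_type:3, pfmemalloc:1, ignore_df:1, nfctinfo:3;
};

struct skb_marked {
    unsigned int len;
    unsigned short queue_mapping;
    unsigned char pkt_type_offset_marker[0];  /* GNU C zero-length array: takes no space */
    unsigned char pkt_type:3, pfmemalloc:1, ignore_df:1, nfctinfo:3;
};

/* The marker costs nothing... */
_Static_assert(sizeof(struct skb_plain) == sizeof(struct skb_marked),
               "offset marker must not change the struct size");

/* ...and offsetof() on it yields the byte holding pkt_type as a compile-time constant. */
#define PKT_TYPE_OFFSET offsetof(struct skb_marked, pkt_type_offset_marker)

/* The bit position of pkt_type inside that byte depends on the bitfield order. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define PKT_TYPE_SHIFT 5
#else
#define PKT_TYPE_SHIFT 0
#endif
#define PKT_TYPE_MAX (7u << PKT_TYPE_SHIFT)

/* Read pkt_type from the raw byte, as a load at PKT_TYPE_OFFSET would. */
static unsigned int raw_pkt_type(const struct skb_marked *skb)
{
    const unsigned char *base = (const unsigned char *)skb;
    return (base[PKT_TYPE_OFFSET] & PKT_TYPE_MAX) >> PKT_TYPE_SHIFT;
}
```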

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 12 Sep 2014 04:18:09 +0000 (21:18 -0700)]
net: dsa: change tag_protocol to an enum

Now that we introduced an additional multiplexing/demultiplexing layer
with commit 3e8a72d1dae37 ("net: dsa: reduce number of protocol hooks")
that lives within the DSA code, we no longer need to have a given switch
driver tag_protocol be an actual ethertype value, instead, we can
replace it with an enum: dsa_tag_protocol.

Do this replacement in the drivers, which allows us to get rid of the
cpu_to_be16()/htons() dance, and remove ETH_P_BRCMTAG since we do not
need it anymore.
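With an enum, the tagging protocol becomes a small integer that can index a table directly, with no `htons()`/`cpu_to_be16()` conversion. The enumerators below follow our understanding of `enum dsa_tag_protocol` at the time; include/net/dsa.h has the authoritative list. The ops table, the count sentinel, and the lookup function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum dsa_tag_protocol {
    DSA_TAG_PROTO_NONE = 0,
    DSA_TAG_PROTO_DSA,
    DSA_TAG_PROTO_TRAILER,
    DSA_TAG_PROTO_EDSA,
    DSA_TAG_PROTO_BRCM,
    DSA_TAG_PROTO_COUNT_SKETCH,   /* hypothetical sentinel for table sizing */
};

struct tag_ops_sketch {
    const char *name;
    /* rcv()/xmit() hooks omitted */
};

static const struct tag_ops_sketch tag_ops_table[DSA_TAG_PROTO_COUNT_SKETCH] = {
    [DSA_TAG_PROTO_DSA]     = { "dsa" },
    [DSA_TAG_PROTO_TRAILER] = { "trailer" },
    [DSA_TAG_PROTO_EDSA]    = { "edsa" },
    [DSA_TAG_PROTO_BRCM]    = { "brcm" },
};

/* Plain integer indexing: no byte-order conversion is involved. */
static const struct tag_ops_sketch *tag_ops_lookup(enum dsa_tag_protocol proto)
{
    if ((unsigned int)proto >= (unsigned int)DSA_TAG_PROTO_COUNT_SKETCH)
        return NULL;
    if (tag_ops_table[proto].name == NULL)   /* e.g. DSA_TAG_PROTO_NONE */
        return NULL;
    return &tag_ops_table[proto];
}
```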

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Fri, 12 Sep 2014 02:43:11 +0000 (10:43 +0800)]
r8152: support VLAN

Support hardware VLAN offload for TX and RX, and enable it by default.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Thu, 11 Sep 2014 23:12:57 +0000 (07:12 +0800)]
net: stmmac: fix return value check in socfpga_dwmac_parse_data()

In case of error, the function devm_ioremap_resource() returns
ERR_PTR() and never returns NULL. The NULL test in the return
value check should be replaced with IS_ERR().
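The kernel encodes small negative errno values in the top of the pointer range. This lets a function like devm_ioremap_resource() report failure without ever returning NULL. The sketch below re-creates `ERR_PTR()`/`IS_ERR()`/`PTR_ERR()` in userspace, following the kernel's MAX_ERRNO of 4095. It uses a hypothetical `fake_ioremap_resource()` to show why the check must be `IS_ERR()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace re-creation of the kernel's error-pointer convention:
 * the top MAX_ERRNO addresses encode negative errno values. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for devm_ioremap_resource(): never returns NULL. */
static char fake_regs[64];
static void *fake_ioremap_resource(int fail)
{
    return fail ? ERR_PTR(-EBUSY) : fake_regs;
}

static int parse_data_sketch(int fail, void **base_out)
{
    void *base = fake_ioremap_resource(fail);

    if (IS_ERR(base))          /* a NULL check would never trigger here */
        return (int)PTR_ERR(base);
    *base_out = base;
    return 0;
}
```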

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 11 Sep 2014 22:07:16 +0000 (15:07 -0700)]
ipv6: exit early in addrconf_notify() if IPv6 is disabled

If IPv6 is explicitly disabled before the interface comes up,
it makes no sense to continue when it comes up, not even to
print a message.

(I am not sure about the other cases, so I prefer not to touch them.)

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 13 Sep 2014 20:38:53 +0000 (16:38 -0400)]
Merge branch 'ipv6-cleanups'

Cong Wang says:

====================
ipv6: clean up locking code in anycast and mcast

This patchset cleans up the locking code in anycast.c and mcast.c
and makes the refcount code more readable.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
v1 -> v2:
* refactor some code and move it into a separate patch
* update comments
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 11 Sep 2014 22:35:16 +0000 (15:35 -0700)]
ipv6: refactor ipv6_dev_mc_inc()

Refactor out allocation and initialization and make
the refcount code more readable.
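The "refactor out allocation and initialization" step can be illustrated with a join counter on a list. The increment path only looks up an existing entry and bumps its user count, while a separate helper builds new entries. The structure and function names below are hypothetical and simplified: a 16-byte array stands in for `struct in6_addr`, and the caller is assumed to hold RTNL.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified multicast entry. */
struct mc_entry {
    unsigned char addr[16];
    unsigned int users;          /* how many joins reference this group */
    struct mc_entry *next;
};

/* Allocation and initialization, factored out of the "inc" path. */
static struct mc_entry *mc_alloc(const unsigned char addr[16])
{
    struct mc_entry *mc = calloc(1, sizeof(*mc));

    if (!mc)
        return NULL;
    memcpy(mc->addr, addr, sizeof(mc->addr));
    mc->users = 1;
    return mc;
}

/* The inc path is now just "find and bump, or allocate and link". */
static int mc_inc(struct mc_entry **list, const unsigned char addr[16])
{
    for (struct mc_entry *mc = *list; mc; mc = mc->next) {
        if (memcmp(mc->addr, addr, sizeof(mc->addr)) == 0) {
            mc->users++;
            assert(mc->users > 1);
            return 0;
        }
    }
    struct mc_entry *mc = mc_alloc(addr);
    if (!mc)
        return -ENOMEM;
    mc->next = *list;
    *list = mc;
    return 0;
}
```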

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 11 Sep 2014 22:35:15 +0000 (15:35 -0700)]
ipv6: update the comment in mcast.c

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 11 Sep 2014 22:35:14 +0000 (15:35 -0700)]
ipv6: drop some rcu_read_lock in mcast

Similarly, this code is already protected by the RTNL lock.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 11 Sep 2014 22:35:13 +0000 (15:35 -0700)]
ipv6: drop ipv6_sk_mc_lock in mcast

Similarly, this code is already protected by the RTNL lock.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 11 Sep 2014 22:35:12 +0000 (15:35 -0700)]
ipv6: refactor __ipv6_dev_ac_inc()

Refactor out allocation and initialization and make
the refcount code more readable.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 11 Sep 2014 22:35:11 +0000 (15:35 -0700)]
ipv6: clean up ipv6_dev_ac_inc()

Make it accept inet6_dev, and rename it to __ipv6_dev_ac_inc()
to reflect this change.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 11 Sep 2014 22:35:10 +0000 (15:35 -0700)]
ipv6: remove ipv6_sk_ac_lock

Just move the RTNL lock up, so that the anycast list is now
protected by RTNL.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 11 Sep 2014 22:35:09 +0000 (15:35 -0700)]
ipv6: drop useless rcu_read_lock() in anycast

This code is now protected by the RTNL lock, so the RCU read lock
is no longer needed.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 13 Sep 2014 20:29:57 +0000 (16:29 -0400)]
Merge branch 'bonding-next'

Nikolay Aleksandrov says:

====================
bonding: get rid of curr_slave_lock

This is the second patch-set dealing with bond locking, and its purpose
is to convert curr_slave_lock into a spinlock called "mode_lock", which can
be used by the various modes for their specific needs. The first three
patches clean up the use of curr_slave_lock and prepare it for the
conversion, which is done in patch 4; then the modes that were using
their own locks are converted to use the new "mode_lock", giving us the
opportunity to remove their locks.
This patch-set has been tested in each mode by running enslave/release of
slaves in parallel with traffic transmission and miimon=1, i.e. running
all the time. In fact, this led to the discovery of a subtle bug related to
RCU, which will be fixed in -net.
Also did an allmodconfig test just in case :-)

v2: fix bond_3ad_state_machine_handler's use of mode_lock and
    curr_slave_lock
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Thu, 11 Sep 2014 20:49:28 +0000 (22:49 +0200)]
bonding: adjust locking comments

Now that locks have been removed, remove some unnecessary comments and
adjust others to reflect reality. Also add a comment to "mode_lock" to
describe its current users and give a brief summary why they need it.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Thu, 11 Sep 2014 20:49:27 +0000 (22:49 +0200)]
bonding: 3ad: convert to bond->mode_lock

Now that we have bond->mode_lock, we can remove the state_machine_lock
and use it in its place. There are no fast paths requiring the per-port
spinlocks so it should be okay to consolidate them into mode_lock.
Also move it inside the unbinding function as we don't want to expose
mode_lock outside of the specific modes.

Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Thu, 11 Sep 2014 20:49:26 +0000 (22:49 +0200)]
bonding: alb: convert to bond->mode_lock

The ALB/TLB specific spinlocks are no longer necessary as we now have
bond->mode_lock for this purpose, so convert them and remove them from
struct alb_bond_info.
Also remove the unneeded lock/unlock functions and use spin_lock/unlock
directly.

Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Thu, 11 Sep 2014 20:49:25 +0000 (22:49 +0200)]
bonding: convert curr_slave_lock to a spinlock and rename it

curr_slave_lock is now a misleading name; a much better name is
mode_lock, as it'll be used for each mode's purposes. It's also no longer
necessary to use a rwlock; a simple spinlock is enough.

Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Thu, 11 Sep 2014 20:49:24 +0000 (22:49 +0200)]
bonding: clean curr_slave_lock use

Almost all users of curr_slave_lock already hold RTNL, as we've discussed
previously, so there's no point in using it; the one case where the lock
must stay is the 3ad code, and in fact it's the only one.
It's okay to remove it from bond_do_fail_over_mac() as it's called with
RTNL and drops the curr_slave_lock anyway.
bond_change_active_slave() is one of the main places where
curr_slave_lock was used; it's okay to remove it there as all callers hold
RTNL these days before calling it. That's why we move the ASSERT_RTNL() to
the beginning, to catch any potential offenders to this rule.
The RTNL argument actually applies to all of the places where
curr_slave_lock has been removed from in this patch.
Also remove the unnecessary bond_deref_active_protected() macro and use
rtnl_dereference() instead.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Thu, 11 Sep 2014 20:49:23 +0000 (22:49 +0200)]
bonding: alb: remove curr_slave_lock

First in rlb_teach_disabled_mac_on_primary() it's okay to remove
curr_slave_lock as all callers except bond_alb_monitor() already hold
RTNL, and in case bond_alb_monitor() is executing we can at most have a
period with bad throughput (very unlikely though).
In bond_alb_monitor() it's okay to remove the read_lock as the slave
list is walked with RCU and the worst that could happen is another
transmitter at the same time and thus for a period which currently is 10
seconds (bond_alb.h: BOND_ALB_LP_TICKS).
And bond_alb_handle_active_change() is okay because it's always called
with RTNL. Removed the ASSERT_RTNL() because it'll be inserted in the
parent function in a following patch.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Thu, 11 Sep 2014 20:49:22 +0000 (22:49 +0200)]
bonding: 3ad: clean up curr_slave_lock usage

Remove the read_lock in bond_3ad_lacpdu_recv() since when the slave is
being released its rx_handler is removed before 3ad unbind, so even if
packets arrive, they won't see the slave in an inconsistent state.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rusty Russell [Thu, 11 Sep 2014 00:47:38 +0000 (10:17 +0930)]
virtio_ring: unify direct/indirect code paths.

virtqueue_add() populates the virtqueue descriptor table from the sgs
given.  If it uses an indirect descriptor table, then it puts a single
descriptor in the descriptor table pointing to the kmalloc'ed indirect
table where the sg is populated.

Previously vring_add_indirect() did the allocation and the simple
linear layout.  We replace that with alloc_indirect() which allocates
the indirect table then chains it like the normal descriptor table so
we can reuse the core logic.
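The shared-logic idea can be sketched as follows. The indirect table is allocated with its `next` fields pre-chained (i to i + 1), so the same fill loop that walks the main ring's free list can walk it too. The descriptor layout and flag values follow the virtio spec's `struct vring_desc`. The helpers are modeled loosely on the alloc_indirect() described above but are not its actual code. The buffer struct and the use of virtual addresses (real code uses DMA addresses) are simplifications.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Flag values from the virtio spec. */
#define VRING_DESC_F_NEXT     1
#define VRING_DESC_F_WRITE    2
#define VRING_DESC_F_INDIRECT 4

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical buffer description standing in for a scatterlist entry. */
struct buf_sketch {
    void *data;
    uint32_t len;
    int device_writable;
};

/* Allocate an indirect table pre-chained like the ring's free list (i -> i + 1). */
static struct vring_desc *alloc_indirect(unsigned int total)
{
    struct vring_desc *desc = calloc(total, sizeof(*desc));
    if (!desc)
        return NULL;
    for (unsigned int i = 0; i < total; i++)
        desc[i].next = (uint16_t)(i + 1);
    return desc;
}

/* Core logic shared by both paths: fill n descriptors by following ->next
 * from head, then clear NEXT on the last one. Returns the next unused index. */
static unsigned int fill_chain(struct vring_desc *desc, unsigned int head,
                               const struct buf_sketch *bufs, unsigned int n)
{
    unsigned int i = head, prev = head;

    assert(n > 0);
    for (unsigned int k = 0; k < n; k++) {
        desc[i].flags = VRING_DESC_F_NEXT |
                        (bufs[k].device_writable ? VRING_DESC_F_WRITE : 0);
        desc[i].addr = (uint64_t)(uintptr_t)bufs[k].data; /* real code: DMA address */
        desc[i].len = bufs[k].len;
        prev = i;
        i = desc[i].next;
    }
    desc[prev].flags &= (uint16_t)~VRING_DESC_F_NEXT;
    return i;
}

/* Indirect path: one ring slot points at a separately allocated table. */
static int add_indirect(struct vring_desc *ring, unsigned int slot,
                        const struct buf_sketch *bufs, unsigned int n,
                        struct vring_desc **table_out)
{
    struct vring_desc *table = alloc_indirect(n);
    if (!table)
        return -1;
    fill_chain(table, 0, bufs, n);
    ring[slot].addr = (uint64_t)(uintptr_t)table;
    ring[slot].len = (uint32_t)(n * sizeof(struct vring_desc));
    ring[slot].flags = VRING_DESC_F_INDIRECT;
    *table_out = table;
    return 0;
}
```

The direct path would call `fill_chain()` on the main ring, starting at its free-list head. That is the reuse of core logic described above.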

This change slows down pktgen (which uses direct descriptors) by less than
1%, and slows down vring_bench as well, but it's far neater.

vring_bench before:
1061485790-1104800648(1.08254e+09+/-6.6e+06)ns
vring_bench after:
1125610268-1183528965(1.14172e+09+/-8e+06)ns

pktgen before:
   787781-796334(793165+/-2.4e+03)pps 365-369(367.5+/-1.2)Mb/sec (365530384-369498976(3.68028e+08+/-1.1e+06)bps) errors: 0

pktgen after:
   779988-790404(786391+/-2.5e+03)pps 361-366(364.35+/-1.3)Mb/sec (361914432-366747456(3.64885e+08+/-1.2e+06)bps) errors: 0

Now, if we force indirect descriptors by turning off any_header_sg
in virtio_net.c:

pktgen before:
  713773-721062(718374+/-2.1e+03)pps 331-334(332.95+/-0.92)Mb/sec (331190672-334572768(3.33325e+08+/-9.6e+05)bps) errors: 0
pktgen after:
  710542-719195(714898+/-2.4e+03)pps 329-333(331.15+/-1.1)Mb/sec (329691488-333706480(3.31713e+08+/-1.1e+06)bps) errors: 0
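The relative changes implied by the mean values above work out as follows:

```latex
\begin{aligned}
\text{pktgen, direct:} &\quad \frac{793165 - 786391}{793165} = \frac{6774}{793165} \approx 0.85\%\ \text{slower} \\
\text{pktgen, forced indirect:} &\quad \frac{718374 - 714898}{718374} = \frac{3476}{718374} \approx 0.48\%\ \text{slower} \\
\text{vring\_bench:} &\quad \frac{1.14172\times10^{9}}{1.08254\times10^{9}} \approx 1.055\ (\approx 5.5\%\ \text{slower})
\end{aligned}
```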

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rusty Russell [Thu, 11 Sep 2014 00:47:37 +0000 (10:17 +0930)]
virtio_ring: assume sgs are always well-formed.

We used to have several callers which just used arrays.  They're
gone, so we can use sg_next() everywhere, simplifying the code.

On my laptop, this slowed down vring_bench by 15%:

vring_bench before:
936153354-967745359(9.44739e+08+/-6.1e+06)ns
vring_bench after:
1061485790-1104800648(1.08254e+09+/-6.6e+06)ns

However, a more realistic test using pktgen on an AMD FX(tm)-8320 saw
an improvement of about 1%:

pktgen before:
  767390-792966(785159+/-6.5e+03)pps 356-367(363.75+/-2.9)Mb/sec (356068960-367936224(3.64314e+08+/-3e+06)bps) errors: 0

pktgen after:
   787781-796334(793165+/-2.4e+03)pps 365-369(367.5+/-1.2)Mb/sec (365530384-369498976(3.68028e+08+/-1.1e+06)bps) errors: 0
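The two changes, computed from the means above:

```latex
\begin{aligned}
\text{vring\_bench:} &\quad \frac{1.08254\times10^{9}}{9.44739\times10^{8}} \approx 1.146\ (\approx 15\%\ \text{slower}) \\
\text{pktgen:} &\quad \frac{793165 - 785159}{785159} = \frac{8006}{785159} \approx 1.02\%\ \text{faster}
\end{aligned}
```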

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rusty Russell [Thu, 11 Sep 2014 00:47:36 +0000 (10:17 +0930)]
virtio_net: pass well-formed sgs to virtqueue_add_*()

This is the only driver that doesn't hand virtqueue_add_inbuf() and
virtqueue_add_outbuf() a well-formed, well-terminated sg. Fix it
so we can make virtqueue_add_*() simpler.
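A well-terminated scatterlist carries its own end marker. Walkers can therefore follow sg_next() instead of being passed a separate element count. The simplified sketch below uses a plain flag, whereas the kernel packs the end (and chain) markers into `page_link`. All names are hypothetical, and chaining is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct scatterlist. */
struct sg_sketch {
    const void *buf;
    unsigned int len;
    int is_end;
};

/* Like sg_init_table(): clear all entries and mark the last one as the end. */
static void sg_init_table_sketch(struct sg_sketch *sgl, unsigned int nents)
{
    assert(nents > 0);
    memset(sgl, 0, nents * sizeof(*sgl));
    sgl[nents - 1].is_end = 1;
}

/* Like sg_next() without chaining: NULL after the terminating entry. */
static struct sg_sketch *sg_next_sketch(struct sg_sketch *sg)
{
    return sg->is_end ? NULL : sg + 1;
}

/* A walker needs only the first entry; the end marker stops it. */
static unsigned int sg_total_len(struct sg_sketch *sg)
{
    unsigned int total = 0;

    for (; sg; sg = sg_next_sketch(sg))
        total += sg->len;
    return total;
}
```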

pktgen results:
modprobe pktgen
echo 'add_device eth0' > /proc/net/pktgen/kpktgend_0
echo nowait 1 > /proc/net/pktgen/eth0
echo count 1000000 > /proc/net/pktgen/eth0
echo clone_skb 100000 > /proc/net/pktgen/eth0
echo dst_mac 4e:14:25:a9:30:ac > /proc/net/pktgen/eth0
echo dst 192.168.1.2 > /proc/net/pktgen/eth0
for i in `seq 20`; do echo start > /proc/net/pktgen/pgctrl; tail -n1 /proc/net/pktgen/eth0; done

Before:
  746547-793084(786421+/-9.6e+03)pps 346-367(364.4+/-4.4)Mb/sec (346397808-367990976(3.649e+08+/-4.5e+06)bps) errors: 0

After:
  767390-792966(785159+/-6.5e+03)pps 356-367(363.75+/-2.9)Mb/sec (356068960-367936224(3.64314e+08+/-3e+06)bps) errors: 0

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 13 Sep 2014 16:43:24 +0000 (12:43 -0400)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2014-09-12

This series contains updates to e1000, ixgbe and ixgbevf.

Mark provides two fixes to reduce compile warnings produced by ixgbe
and ixgbevf.

Alex provides two patches for ixgbe. The first removes the receive buffer
allocation at the end of ixgbe_clean_rx_irq() to avoid the extra latency
introduced by the MMIO write. The second addresses several issues in the
current ixgbe implementation of busy poll sockets. It was possible for
frames to be delivered out of order if they were held in GRO, so this is
addressed by flushing the GRO buffers before releasing the q_vector back
to the idle state. Also, we had to take a spinlock when changing the state
to and from idle; to resolve this, the state value was replaced with an
atomic, using atomic_cmpxchg to change the value from idle and a simple
atomic set to restore it to idle once it has been acquired. This means a
locked operation is needed only to acquire the vector, not to release it.

Florian Westphal provides several patches for e1000 that clean up and
update the driver. He moved e1000_tbi_adjust_stats() so that he could
make the function static, added a helper function for the TBI workaround
that was located in two different Rx clean functions, added an
e1000_rx_buffer struct for use on receive since transmit and receive
have different requirements, and updated e1000 to use the
napi_gro_frags API.
====================
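Alex's second patch describes a common lock-free ownership pattern. A compare-and-swap takes the q_vector out of the idle state, and a plain atomic store puts it back. The sketch below expresses that pattern with C11 `<stdatomic.h>` in place of the kernel's `atomic_cmpxchg()`/`atomic_set()`. The state names and struct are hypothetical. The acquire/release orderings are a choice made for this userspace example, not something taken from ixgbe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical owner states for a q_vector. */
enum { QV_STATE_IDLE = 0, QV_STATE_NAPI = 1, QV_STATE_POLL = 2 };

struct q_vector_sketch {
    atomic_int state;
};

/* Acquire: a single compare-and-swap from IDLE to the new owner. */
static bool qv_lock(struct q_vector_sketch *qv, int owner)
{
    int expected = QV_STATE_IDLE;

    return atomic_compare_exchange_strong_explicit(&qv->state, &expected, owner,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

/* Release: only the current owner writes here, so a plain atomic store
 * suffices and no read-modify-write (locked) instruction is needed. */
static void qv_unlock(struct q_vector_sketch *qv)
{
    atomic_store_explicit(&qv->state, QV_STATE_IDLE, memory_order_release);
}
```

A second contender (for example busy poll while NAPI owns the vector) simply fails the compare-and-swap and backs off, without spinning on a lock.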

Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 13 Sep 2014 16:30:33 +0000 (12:30 -0400)]
Merge branch 'sched_rcu'

John Fastabend says:

====================
net/sched rcu classifiers and tcf

This series converts the tcf_proto usage to RCU.

This requires updating each classifier individually to handle the
new copy/update requirement, and also updating the core list
traversals. This assumes that updates to the tables are infrequent
compared to the number of packets per second being classified. On a
10Gbps link running near line rate we can easily produce 12+ million
packets per second, so IMO this is a reasonable assumption. The
updates are serialized by RTNL.

I have done some basic testing on this series and do not see any
immediate splats or issues. The patch series has been running
on my dev systems for a month or so now and I've not seen any
issues, although my configurations are not overly complicated.

My test cases at this point cover all the filters with a
tight loop to add/remove filters, some basic estimator tests
where I add an estimator to the qdisc and use pktgen to verify
that the statistics are accurate, and finally a small script to
exercise the 'tc actions' interface. Feel free to send me more
tests off list and I can run them.

This is prep work to drop the qdisc lock, with the first
target being the ingress qdisc. Still to be done: making the
tc actions RCU safe and the statistics per-CPU. These patches
are in the works.

Comments:
  - Checkpatch is still giving errors on some >80 char lines; I know
    about this. IMO the way to fix this is to restructure the sched
    code to avoid being so heavily indented. But doing this here
    bloats the patchset, and anyway there are already lots of >80
    char lines in these files. I would prefer to keep the patches as is,
    but let me know if others think I should fix these and I will.
    A follow-up patch set could restructure the code and fix this
    throughout the code blocks.
====================
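The copy/update requirement mentioned in the cover letter is the classic RCU publish pattern. Readers dereference a shared pointer without locks, while a serialized updater copies the object, modifies the copy, and publishes it atomically. The minimal userspace sketch below uses C11 atomics, not the kernel RCU API, and all names are hypothetical. Instead of waiting for a grace period, it hands the retired object back to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct filter_sketch {
    int prio;
    int classid;
};

/* Shared pointer read locklessly on the packet path. */
static _Atomic(struct filter_sketch *) active_filter;

/* Reader: analogous to rcu_dereference(); returns -1 if no filter is installed. */
static int classify_sketch(void)
{
    struct filter_sketch *f =
        atomic_load_explicit(&active_filter, memory_order_acquire);
    return f ? f->classid : -1;
}

/* Updater, assumed to be serialized by an outer lock (RTNL in the kernel):
 * copy the current object, modify the copy, then publish it. The old copy is
 * returned rather than freed because concurrent readers may still hold it;
 * the kernel would free it after a grace period (e.g. with kfree_rcu()). */
static int change_classid_sketch(int classid, struct filter_sketch **retired)
{
    struct filter_sketch *old =
        atomic_load_explicit(&active_filter, memory_order_relaxed);
    struct filter_sketch *copy = malloc(sizeof(*copy));

    if (!copy)
        return -1;
    if (old)
        *copy = *old;
    else
        copy->prio = 0;
    copy->classid = classid;
    atomic_store_explicit(&active_filter, copy, memory_order_release); /* ~ rcu_assign_pointer() */
    *retired = old;
    return 0;
}
```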

Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Sep 2014 03:10:24 +0000 (20:10 -0700)]
net: sched: rcu'ify cls_bpf

This patch makes the cls_bpf classifier RCU safe. The tcf_lock
was being used to protect a list of cls_bpf_prog; now this list
is RCU safe and updates occur with rcu_replace.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Sep 2014 03:09:49 +0000 (20:09 -0700)]
net: sched: rcu'ify cls_rsvp

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Sep 2014 03:09:16 +0000 (20:09 -0700)]
net: sched: make cls_u32 lockless

Make the cls_u32 classifier safe to run without holding a lock. This patch
converts the statistics that are kept in the u32_classify() read section
into per-CPU counters.

This patch was tested with a tight u32 filter add/delete loop while
generating traffic with pktgen. By running pktgen on VLAN devices
created on top of a physical device, we can hit the qdisc layer
correctly. For ingress qdiscs, a loopback cable was used.

for i in {1..100}; do
        q=`echo $i%8|bc`;
        echo -n "u32 tos: iteration $i on queue $q";
        tc filter add dev p3p2 parent $p prio $i u32 match ip tos 0x10 0xff \
                  action skbedit queue_mapping $q;
        sleep 1;
        tc filter del dev p3p2 prio $i;

        echo -n "u32 tos hash table: iteration $i on queue $q";
        tc filter add dev p3p2 parent $p protocol ip prio $i handle 628: u32 divisor 1
        tc filter add dev p3p2 parent $p protocol ip prio $i u32 \
                match ip protocol 17 0xff link 628: offset at 0 mask 0xf00 shift 6 plus 0
        tc filter add dev p3p2 parent $p protocol ip prio $i u32 \
                ht 628:0 match ip tos 0x10 0xff action skbedit queue_mapping $q
        sleep 2;
        tc filter del dev p3p2 prio $i
        sleep 1;
done

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Sep 2014 03:08:47 +0000 (20:08 -0700)]
net: sched: make cls_u32 per cpu

This uses per cpu counters in cls_u32 in preparation
to convert over to rcu.
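Per-CPU counters make the hot-path increment cheap and lock-free, at the cost of a more expensive summation when the statistics are read. The sketch below shows that trade-off with a fixed, hypothetical CPU count. The `rcnt`/`rhit` field names are illustrative. The caller passes its CPU id explicitly, where the kernel would rely on per-CPU helpers such as this_cpu_inc() to stay on the current CPU.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4   /* hypothetical CPU count */

/* One slot per CPU, each on its own cache line to avoid false sharing. */
struct u32_pcnt_sketch {
    _Alignas(64) unsigned long long rcnt;  /* packets examined */
    unsigned long long rhit;               /* packets matched */
};

static struct u32_pcnt_sketch pcnt[NR_CPUS_SKETCH];

/* Fast path: each CPU only touches its own slot, so no shared lock is needed. */
static void u32_count_sketch(unsigned int cpu, int matched)
{
    assert(cpu < NR_CPUS_SKETCH);
    pcnt[cpu].rcnt++;
    if (matched)
        pcnt[cpu].rhit++;
}

/* Slow path (e.g. a stats dump): fold every CPU's slot into one total. */
static void u32_sum_sketch(unsigned long long *rcnt, unsigned long long *rhit)
{
    *rcnt = 0;
    *rhit = 0;
    for (unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
        *rcnt += pcnt[cpu].rcnt;
        *rhit += pcnt[cpu].rhit;
    }
}
```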

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Sep 2014 03:08:20 +0000 (20:08 -0700)]
net: sched: RCU cls_tcindex

Make cls_tcindex RCU safe.

This patch adds a new RCU routine, rcu_dereference_bh_rtnl(), to check
that the caller holds either the RCU read lock or RTNL. This is needed
because tcindex_lookup() is called in both contexts.
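The check that rcu_dereference_bh_rtnl() performs can be modeled as an assertion that at least one of two locks is held before a protected pointer is used. In the sketch below, two flags stand in for lockdep's bookkeeping. The table layout and `tcindex_lookup_sketch()` are hypothetical simplifications, not the real cls_tcindex structures, and a plain pointer replaces the RCU-annotated one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for what lockdep knows the caller holds. */
static bool rcu_bh_read_held;   /* packet (classify) path */
static bool rtnl_held;          /* control (change/delete) path */

/* Dereference is legal if either the RCU-bh read lock or RTNL is held. */
#define deref_bh_rtnl_sketch(p) (assert(rcu_bh_read_held || rtnl_held), (p))

struct tcindex_filter_sketch {
    unsigned int key;
    unsigned int classid;
};

struct tcindex_data_sketch {
    struct tcindex_filter_sketch *table;   /* directly indexed by key */
    unsigned int size;
};

static struct tcindex_data_sketch *tcindex_root;

/* Called from both contexts, hence the either/or check. */
static struct tcindex_filter_sketch *tcindex_lookup_sketch(unsigned int key)
{
    struct tcindex_data_sketch *p = deref_bh_rtnl_sketch(tcindex_root);

    if (!p || key >= p->size)
        return NULL;
    return &p->table[key];
}
```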

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Sep 2014 03:07:50 +0000 (20:07 -0700)]
net: sched: RCU cls_route

RCU-ify the route classifier. For now, however, spinlocks are used to
protect the fastmap cache.

The issue here is that the fastmap may be read by one CPU while the
cache is being updated by another. An array of pointers could be
one possible solution.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Sep 2014 03:07:22 +0000 (20:07 -0700)]
net: sched: fw use RCU

RCU'ify fw classifier.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Sep 2014 03:06:55 +0000 (20:06 -0700)]
net: sched: cls_flow use RCU

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sat, 13 Sep 2014 03:06:26 +0000 (20:06 -0700)]
net: sched: cls_cgroup use RCU

Make the cgroup classifier safe for RCU.

Also drop the rcu_read_lock()/rcu_read_unlock() calls in the classify
routine. If rcu_read_lock() is not already held on entry to this
routine, we have issues with deleting the classifier chain anyway, so
the extra rcu_read_lock()/rcu_read_unlock() pair is unnecessary; as
far as I know, all paths hold rcu_read_lock.

If there is a case where classify is called without the RCU read lock
held, an RCU splat will occur and we can correct it.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: sched: cls_basic use RCU
John Fastabend [Sat, 13 Sep 2014 03:05:59 +0000 (20:05 -0700)]
net: sched: cls_basic use RCU

Enable the basic classifier for RCU.

Dereferencing tp->root may look a bit strange here, but by my
accounting it is needed: tp->root is allocated at init time and must
be kfree'd at destroy time. Because it may be referenced in the
classify() path, we must wait an RCU grace period before freeing
it. We use kfree_rcu() and the rcu_ APIs to enforce this. This pattern
is used in all the classifiers.

Also, hgenerator can be incremented without concern because it
is always incremented under RTNL.
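A user-space sketch of the "wait a grace period before freeing" rule, assuming a single global reader counter in place of real RCU. This is a deliberate simplification: kfree_rcu() queues the free to run after a grace period without blocking the caller and without a shared reader count, whereas this sketch simply spins until no reader is active. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical classifier root: allocated at init, freed at destroy. */
struct basic_head_sketch {
	int nfilters;
};

static _Atomic(struct basic_head_sketch *) root_sketch; /* plays tp->root */
static atomic_int readers_sketch; /* readers inside a read section */
static int freed_sketch;          /* counts frees, for demonstration */

/* Reader entry: stands in for rcu_read_lock() + rcu_dereference(). */
static struct basic_head_sketch *read_begin_sketch(void)
{
	atomic_fetch_add(&readers_sketch, 1);
	return atomic_load(&root_sketch); /* may be NULL after destroy */
}

/* Reader exit: stands in for rcu_read_unlock(). */
static void read_end_sketch(void)
{
	atomic_fetch_sub(&readers_sketch, 1);
}

/* Destroy: unpublish the root, wait out current readers, then free it. */
static void destroy_sketch(void)
{
	struct basic_head_sketch *old = atomic_exchange(&root_sketch, NULL);

	/* Crude "grace period": no reader can still be using 'old' after this. */
	while (atomic_load(&readers_sketch) != 0)
		;
	free(old);
	freed_sketch++;
}
```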

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: rcu-ify tcf_proto
John Fastabend [Sat, 13 Sep 2014 03:05:27 +0000 (20:05 -0700)]
net: rcu-ify tcf_proto

RCU'ify tcf_proto; this allows tc_classify() to be called without
holding any locks. Updaters are protected by RTNL.

This patch prepares the core net_sched infrastructure for running
the classifier/action chains without holding the qdisc lock. However,
it does nothing to ensure that the cls_xxx and act_xxx types also work
without locking. Additional patches are required to address the fallout.
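A user-space sketch of the publish/subscribe pattern that lets a classifier chain be walked without locks, using C11 release/acquire atomics in place of rcu_assign_pointer()/rcu_dereference(). The chain type and fields are hypothetical, updaters are assumed to be serialized by an outer lock (RTNL in the kernel), and removal plus deferred freeing are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical chain entry, loosely standing in for struct tcf_proto. */
struct proto_sketch {
	int prio;    /* lower value is consulted first */
	int key;     /* what this entry matches on */
	int classid; /* result when it matches */
	_Atomic(struct proto_sketch *) next;
};

/* Updater: insert 'p' in priority order; caller holds the updater lock. */
static void chain_insert_sketch(_Atomic(struct proto_sketch *) *head,
				struct proto_sketch *p)
{
	_Atomic(struct proto_sketch *) *link = head;
	struct proto_sketch *cur;

	/* Relaxed loads are enough here: no other updater can run concurrently. */
	while ((cur = atomic_load_explicit(link, memory_order_relaxed)) &&
	       cur->prio <= p->prio)
		link = &cur->next;

	/* Finish initializing p before any reader can reach it... */
	atomic_store_explicit(&p->next, cur, memory_order_relaxed);
	/* ...then publish it; pairs with the readers' acquire loads. */
	atomic_store_explicit(link, p, memory_order_release);
}

/* Reader: lockless walk, returns the first match in priority order. */
static int chain_classify_sketch(_Atomic(struct proto_sketch *) *head, int key)
{
	struct proto_sketch *cur = atomic_load_explicit(head, memory_order_acquire);

	for (; cur; cur = atomic_load_explicit(&cur->next, memory_order_acquire))
		if (cur->key == key)
			return cur->classid;
	return -1;
}
```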

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: qdisc: use rcu prefix and silence sparse warnings
John Fastabend [Sat, 13 Sep 2014 03:04:52 +0000 (20:04 -0700)]
net: qdisc: use rcu prefix and silence sparse warnings

Add __rcu notation to qdisc handling; doing this makes smatch output
more legible. In any case, some of these sites should be using
rcu_dereference(); see qdisc_all_tx_empty(), qdisc_tx_changing(),
and so on.

Also, the *wake_queue() APIs are commonly called from driver timer
routines without the RCU lock or RTNL held, so I added rcu_read_lock()
blocks around netif_wake_subqueue() and netif_tx_wake_queue().

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: sched: rcu'ify cls_bpf
John Fastabend [Sat, 13 Sep 2014 03:10:24 +0000 (20:10 -0700)]
net: sched: rcu'ify cls_bpf

This patch makes the cls_bpf classifier RCU safe. The tcf_lock
was being used to protect a list of cls_bpf_prog; this list is now
RCU safe and updates occur with rcu_replace.
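A user-space sketch of replacing one list entry so that lockless readers see either the old entry or the new one, never a broken list. C11 release/acquire atomics stand in for the kernel's RCU list helpers, updaters are assumed to be serialized by an outer lock, and the entry type is hypothetical. The old entry keeps its next pointer, so a reader already standing on it can still finish its walk; it may only be freed after a grace period.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list entry, loosely standing in for struct cls_bpf_prog. */
struct prog_sketch {
	unsigned int handle;
	int verdict;
	_Atomic(struct prog_sketch *) next;
};

/*
 * Swap in 'new' for the entry with the same handle. Returns the old entry
 * (to be freed only after all readers are done with it), or NULL if none.
 */
static struct prog_sketch *list_replace_sketch(_Atomic(struct prog_sketch *) *head,
					       struct prog_sketch *new)
{
	_Atomic(struct prog_sketch *) *link = head;
	struct prog_sketch *old;

	while ((old = atomic_load_explicit(link, memory_order_relaxed))) {
		if (old->handle == new->handle) {
			/* new inherits old's successor before it becomes visible */
			atomic_store_explicit(&new->next,
					      atomic_load_explicit(&old->next, memory_order_relaxed),
					      memory_order_relaxed);
			/* one release store switches readers from old to new */
			atomic_store_explicit(link, new, memory_order_release);
			return old;
		}
		link = &old->next;
	}
	return NULL;
}

/* Lockless reader: look up the verdict for a handle. */
static int list_lookup_sketch(_Atomic(struct prog_sketch *) *head, unsigned int handle)
{
	for (struct prog_sketch *p = atomic_load_explicit(head, memory_order_acquire); p;
	     p = atomic_load_explicit(&p->next, memory_order_acquire))
		if (p->handle == handle)
			return p->verdict;
	return -1;
}
```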

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: sched: rcu'ify cls_rsvp
John Fastabend [Sat, 13 Sep 2014 03:09:49 +0000 (20:09 -0700)]
net: sched: rcu'ify cls_rsvp

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: sched: make cls_u32 lockless
John Fastabend [Sat, 13 Sep 2014 03:09:16 +0000 (20:09 -0700)]
net: sched: make cls_u32 lockless

Make the cls_u32 classifier safe to run without holding a lock. This
patch converts the statistics kept in the u32_classify() read section
into per-CPU counters.

This patch was tested with a tight u32 filter add/delete loop while
generating traffic with pktgen. By running pktgen on VLAN devices
created on top of a physical device, we can exercise the qdisc layer
correctly. For ingress qdiscs, a loopback cable was used.

# $p is assumed to hold the parent qdisc handle (set before running).
for i in {1..100}; do
        q=$((i % 8));    # cycle through tx queues 0-7
        echo -n "u32 tos: iteration $i on queue $q";
        tc filter add dev p3p2 parent $p prio $i u32 match ip tos 0x10 0xff \
                  action skbedit queue_mapping $q;
        sleep 1;
        tc filter del dev p3p2 prio $i;

        echo -n "u32 tos hash table: iteration $i on queue $q";
        tc filter add dev p3p2 parent $p protocol ip prio $i handle 628: u32 divisor 1
        tc filter add dev p3p2 parent $p protocol ip prio $i u32 \
                match ip protocol 17 0xff link 628: offset at 0 mask 0xf00 shift 6 plus 0
        tc filter add dev p3p2 parent $p protocol ip prio $i u32 \
                ht 628:0 match ip tos 0x10 0xff action skbedit queue_mapping $q
        sleep 2;
        tc filter del dev p3p2 prio $i
        sleep 1;
done

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosunvnet: Avoid sending superfluous LDC messages.
Sowmini Varadhan [Thu, 11 Sep 2014 13:57:22 +0000 (09:57 -0400)]
sunvnet: Avoid sending superfluous LDC messages.

When sending out a burst of packets across multiple descriptors,
it is sufficient to send one LDC "start" trigger for
the first descriptor, so do not send an LDC "start" for every
pass through vnet_start_xmit. Similarly, it is sufficient to send
one "DRING_STOPPED" trigger for the last dring (and if that
fails, hold off and send the trigger later).

Reducing the number of LDC messages helps avoid filling up the
LDC channel with superfluous messages that risk triggering flow
control on the channel, and also improves performance.
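A rough model of the trigger logic described above, with hypothetical names and a plain counter standing in for each LDC message: only the first descriptor of a burst sends "start", the end of the burst sends a single "stopped", and a "stopped" that cannot be sent is remembered so it can be retried later. This is a sketch of the idea, not the sunvnet code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port transmit state; counters model LDC messages. */
struct vnet_tx_sketch {
	bool started;      /* a "start" trigger is already outstanding */
	bool stop_pending; /* a "stopped" trigger still needs to be sent */
	int start_msgs;
	int stop_msgs;
};

/* Called once per packet placed on a descriptor. */
static void tx_one_sketch(struct vnet_tx_sketch *t)
{
	/* ... fill the next descriptor here ... */
	if (!t->started) { /* only the first descriptor of a burst */
		t->start_msgs++;
		t->started = true;
	}
}

/*
 * Called after the last descriptor of a burst. 'channel_ok' models whether
 * the LDC write succeeded; on failure the trigger is kept for a later retry.
 */
static void burst_done_sketch(struct vnet_tx_sketch *t, bool channel_ok)
{
	if (!channel_ok) {
		t->stop_pending = true;
		return;
	}
	t->stop_msgs++;
	t->stop_pending = false;
	t->started = false; /* the next packet begins a new burst */
}
```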

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Raghuram Kothakota <raghuram.kothakota@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: axienet: remove unnecessary ether_setup after alloc_etherdev
Subbaraya Sundeep Bhatta [Thu, 11 Sep 2014 09:23:33 +0000 (14:53 +0530)]
net: axienet: remove unnecessary ether_setup after alloc_etherdev

Calling ether_setup() is redundant since alloc_etherdev() already
calls it.

Signed-off-by: Subbaraya Sundeep Bhatta <sbhatta@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoethernet: amd: use pr_info_once()
Varka Bhadram [Thu, 11 Sep 2014 07:20:50 +0000 (12:50 +0530)]
ethernet: amd: use pr_info_once()

Use pr_info_once() to print the driver version info only once in
the probe function. There is no need for a static variable here.
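A user-space imitation of the "print once" idea, for illustration only: the kernel's printk_once() family hides a per-call-site static flag inside the macro, so a driver no longer needs its own. The macro, counter and probe function below are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int printed_sketch; /* counts emitted messages, for demonstration */

/* Each expansion gets its own static flag, so each call site prints once. */
#define print_once_sketch(...)			\
	do {					\
		static bool done_;		\
		if (!done_) {			\
			done_ = true;		\
			printed_sketch++;	\
			printf(__VA_ARGS__);	\
		}				\
	} while (0)

/* Hypothetical probe routine that may run once per device. */
static void probe_sketch(void)
{
	/* No hand-rolled "already printed" variable is needed here. */
	print_once_sketch("hypothetical-driver: version 1.0\n");
}
```

Probing three devices with this sketch prints the version line a single time.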

Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoudp: Fix inverted NAPI_GRO_CB(skb)->flush test
Scott Wood [Thu, 11 Sep 2014 02:23:18 +0000 (21:23 -0500)]
udp: Fix inverted NAPI_GRO_CB(skb)->flush test

Commit 2abb7cdc0d ("udp: Add support for doing checksum unnecessary
conversion") caused napi_gro_cb structs with the "flush" field zero to
take the "udp_gro_receive" path rather than the "set flush to 1" path
that they would previously take.  As a result I saw booting from an NFS
root hang shortly after starting userspace, with "server not
responding" messages.

This change to the handling of "flush == 0" packets appears to be
incidental to the goal of adding new code in the case where
skb_gro_checksum_validate_zero_check() returns zero.  Based on that and
the fact that it breaks things, I'm assuming that it is unintentional.

Fixes: 2abb7cdc0d ("udp: Add support for doing checksum unnecessary conversion")
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'sock_queue_err_skb'
David S. Miller [Fri, 12 Sep 2014 21:51:32 +0000 (17:51 -0400)]
Merge branch 'sock_queue_err_skb'

Alexander Duyck says:

====================
Address reference counting issues with sock_queue_err_skb

After looking over the code for skb_clone_sk, following some comments made
by Eric Dumazet, I have come to the conclusion that skb_clone_sk is taking
the correct approach to handling sk_refcnt when creating a buffer that is
eventually meant to be returned to the socket via the sock_queue_err_skb
function.

However, upon reviewing other callers, I found what I believe to be a
possible reference count issue in the path for handling "wifi ack" packets.
To address this, I have applied the same logic that is currently in place,
so that sk_refcnt is forced to stay at least 1, or else we do not
provide an skb to return in the sk_error_queue.
====================
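A user-space sketch of the "keep sk_refcnt at least 1" rule, in the spirit of the kernel's atomic_inc_not_zero(): a reference is taken only while the count is still non-zero; otherwise the socket is already being torn down and no skb is handed back. The socket is reduced to a bare atomic_int and the caller is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still alive (count > 0). */
static bool inc_not_zero_sketch(atomic_int *refcnt)
{
	int old = atomic_load(refcnt);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value; retry. */
		if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Hypothetical caller: only queue an error-queue buffer for a live socket. */
static bool try_queue_err_sketch(atomic_int *sk_refcnt)
{
	if (!inc_not_zero_sketch(sk_refcnt))
		return false; /* socket is going away: drop instead of queueing */
	/* ... clone the buffer, queue it, release the reference later ... */
	return true;
}
```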

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agomac80211: Resolve sk_refcnt/sk_wmem_alloc issue in wifi ack path
Alexander Duyck [Wed, 10 Sep 2014 22:05:42 +0000 (18:05 -0400)]
mac80211: Resolve sk_refcnt/sk_wmem_alloc issue in wifi ack path

There is a possible issue with the use (or lack thereof) of sk_refcnt and
sk_wmem_alloc in the wifi ack status functionality.

Specifically, if a socket were to request acknowledgements, and its
sk_refcnt were to drop to 0, leaving it waiting on sk_wmem_alloc to reach
0, then sock_queue_err_skb could orphan the last buffer, resulting in
__sk_free being called on the socket. The buffer would then be enqueued on
sk_error_queue even though the queue has already been flushed, resulting in
at least a memory leak, if not data corruption.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoskb: Add documentation for skb_clone_sk
Alexander Duyck [Wed, 10 Sep 2014 22:05:26 +0000 (18:05 -0400)]
skb: Add documentation for skb_clone_sk

This change adds documentation to skb_clone_sk() to help clarify the
purpose of the function for other developers.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoRevert "ipv4: Clarify in docs that accept_local requires rp_filter."
Sébastien Barré [Wed, 10 Sep 2014 16:20:23 +0000 (18:20 +0200)]
Revert "ipv4: Clarify in docs that accept_local requires rp_filter."

This reverts commit c801e3cc1925 ("ipv4: Clarify in docs that accept_local requires rp_filter.").
It is not needed anymore since commit 1dced6a85482 ("ipv4: Restore accept_local behaviour in fib_validate_source()").

Suggested-by: Julian Anastasov <ja@ssi.bg>
Cc: Gregory Detal <gregory.detal@uclouvain.be>
Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoe1000: switch to napi_gro_frags api
Florian Westphal [Wed, 3 Sep 2014 13:34:42 +0000 (13:34 +0000)]
e1000: switch to napi_gro_frags api

napi_gro_frags allows skb re-use in case GRO can merge payload pages
into an skb on the GRO lists.

netperf TCP_STREAM, kvm-e1000 emulation, mtu 9k:
     Recv   Send    Send
     Socket Socket  Message  Elapsed
     Size   Size    Size     Time     Throughput
     bytes  bytes   bytes    secs.    10^6bits/sec
old: 87380  16384   16384    30.00    8985.78
new: 87380  16384   16384    30.00    9907.05
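Working the numbers from the table above, the relative throughput gain of the new code is:

```latex
\frac{9907.05 - 8985.78}{8985.78} = \frac{921.27}{8985.78} \approx 0.1025 \quad (\text{about } 10.3\%)
```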

Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agoe1000: convert to build_skb
Florian Westphal [Wed, 3 Sep 2014 13:34:36 +0000 (13:34 +0000)]
e1000: convert to build_skb

Instead of preallocating Rx skbs, allocate them right before sending
inbound packets up the stack.

e1000-kvm, mtu1500, netperf TCP_STREAM:
     Recv   Send    Send
     Socket Socket  Message  Elapsed
     Size   Size    Size     Time     Throughput
     bytes  bytes   bytes    secs.    10^6bits/sec
old: 87380  16384   16384    60.00    4532.40
new: 87380  16384   16384    60.00    4599.05

Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agoe1000: rename struct e1000_buffer to e1000_tx_buffer
Florian Westphal [Wed, 3 Sep 2014 13:34:31 +0000 (13:34 +0000)]
e1000: rename struct e1000_buffer to e1000_tx_buffer

and remove *page; it's only used for Rx.

Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agoe1000: add and use e1000_rx_buffer info for Rx
Florian Westphal [Wed, 3 Sep 2014 13:34:26 +0000 (13:34 +0000)]
e1000: add and use e1000_rx_buffer info for Rx

e1000 uses the same metadata struct for Rx and Tx, but Tx and Rx have
different requirements.

For Rx, we only need to store a buffer and a DMA address.

A follow-up patch will remove the skb for Rx, bringing rx_buffer_info
down to 16 bytes on x86_64 (the shared buffer_info is currently 48 bytes).
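A sketch of what an Rx-only metadata struct of that shape could look like, assuming a 64-bit DMA address; the struct and typedef names are hypothetical, not the driver's. With 8-byte pointers, one buffer pointer plus one 8-byte address is 16 bytes with no padding, which the static assertion checks at compile time on 64-bit targets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit dma_addr_t. */
typedef uint64_t dma_addr_sketch_t;

/* Rx-only metadata: the receive buffer and its DMA address, nothing else. */
struct rx_buffer_sketch {
	void *data;            /* page or fragment the NIC receives into */
	dma_addr_sketch_t dma; /* bus address programmed into the descriptor */
};

/* 8 + 8 = 16 bytes whenever pointers are 8 bytes wide (e.g. x86_64). */
_Static_assert(sizeof(void *) != 8 || sizeof(struct rx_buffer_sketch) == 16,
	       "expected 16 bytes with 8-byte pointers");
```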

Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agoe1000: perform copybreak ahead of DMA unmap
Florian Westphal [Wed, 3 Sep 2014 13:34:21 +0000 (13:34 +0000)]
e1000: perform copybreak ahead of DMA unmap

Currently we unmap the DMA range and then copy the data to a new skb.
Change this so that we can keep the mapping when the data is copied.
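A user-space sketch of the reordered copybreak decision, with a hypothetical threshold and slot type and malloc() standing in for skb allocation. Small frames are copied out while the buffer stays mapped (the driver would only sync it for the CPU), so the slot can be reused as-is; only when the buffer itself goes up the stack is it unmapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define COPYBREAK_SKETCH 256 /* hypothetical copybreak threshold, bytes */

/* Hypothetical Rx slot: a buffer the device DMAs into, and its map state. */
struct rx_slot_sketch {
	unsigned char *buf;
	bool mapped;
};

/*
 * Deliver 'len' received bytes and return a buffer now owned by the caller.
 * Copy path: slot keeps its (still mapped) buffer. Otherwise: unmap and
 * hand the original buffer up, leaving the slot to be refilled.
 */
static unsigned char *rx_deliver_sketch(struct rx_slot_sketch *slot, size_t len)
{
	if (len <= COPYBREAK_SKETCH) {
		unsigned char *copy = malloc(len);

		if (copy) {
			memcpy(copy, slot->buf, len);
			return copy; /* no unmap: slot->buf is reused */
		}
		/* allocation failed: fall back to handing up the original */
	}
	slot->mapped = false; /* stands in for the DMA unmap call */
	unsigned char *orig = slot->buf;
	slot->buf = NULL;     /* slot needs a fresh buffer */
	return orig;
}
```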

Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agoe1000: move tbi workaround code into helper function
Florian Westphal [Wed, 3 Sep 2014 13:34:15 +0000 (13:34 +0000)]
e1000: move tbi workaround code into helper function

It's the same in both handlers.

Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agoe1000: move e1000_tbi_adjust_stats to where its used
Florian Westphal [Wed, 3 Sep 2014 13:34:10 +0000 (13:34 +0000)]
e1000: move e1000_tbi_adjust_stats to where its used

... and make it static.

Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agoixgbe: Refactor busy poll socket code to address multiple issues
Alexander Duyck [Sat, 26 Jul 2014 02:42:44 +0000 (02:42 +0000)]
ixgbe: Refactor busy poll socket code to address multiple issues

This change addresses several issues in the current ixgbe implementation of
busy poll sockets.

The first issue was that frames could be delivered out of order if they
were held in GRO. This is addressed by flushing the GRO buffers before
releasing the q_vector back to the idle state.

The other issue was that we had to take a spinlock when changing the
state to and from idle. To resolve this, I have replaced the state value
with an atomic, using atomic_cmpxchg to move it away from idle and a
simple atomic set to restore it to idle once we are done with it. This
means a locked operation is needed only to acquire the vector, not to
release it.
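A user-space sketch of the lock-free ownership scheme described above, using C11 atomics in place of atomic_cmpxchg()/atomic_set(); the state names and struct are hypothetical. Acquiring is a single compare-and-swap out of IDLE; releasing is a plain store, which is safe because only the current owner ever releases (the driver flushes GRO before this point).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical ownership states for a queue vector. */
enum { QV_IDLE_SKETCH = 0, QV_NAPI_SKETCH = 1, QV_POLL_SKETCH = 2 };

struct q_vector_sketch {
	atomic_int state;
};

/* Acquire: one compare-and-swap moves the vector out of IDLE, or fails. */
static bool qv_lock_sketch(struct q_vector_sketch *qv, int owner)
{
	int expected = QV_IDLE_SKETCH;

	return atomic_compare_exchange_strong(&qv->state, &expected, owner);
}

/* Release: only the owner calls this, so a plain atomic store suffices. */
static void qv_unlock_sketch(struct q_vector_sketch *qv)
{
	atomic_store_explicit(&qv->state, QV_IDLE_SKETCH, memory_order_release);
}
```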

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agoixgbe: Drop Rx alloc at end of Rx cleanup
Alexander Duyck [Sat, 26 Jul 2014 02:42:39 +0000 (02:42 +0000)]
ixgbe: Drop Rx alloc at end of Rx cleanup

This change removes the Rx buffer allocation at the end of ixgbe_clean_rx_irq.
The reason for removing this is to avoid the extra latency introduced by the
MMIO write.  This can amount to somewhere around an extra 100ns of latency and
one extra message worth of PCIe bus overhead.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agoixgbevf: Resolve missing-field-initializers warnings
Mark Rustad [Thu, 24 Jul 2014 06:19:29 +0000 (06:19 +0000)]
ixgbevf: Resolve missing-field-initializers warnings

Resolve missing-field-initializers warnings by using
designated initialization.
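A small illustration of the kind of change meant here, using a hypothetical table rather than the ixgbevf structures. A positional initializer such as { "tx_packets", 8, 8 } leaves trailing members implicitly zero and triggers -Wmissing-field-initializers; GCC does not emit that warning for designated initializers, and members left out are still zero-initialized.

```c
#include <assert.h>

/* Hypothetical table entry; 'flags' stands for a later-added member. */
struct stat_sketch {
	const char *name;
	int size;
	int offset;
	int flags;
};

/* Designated initializers: intent is explicit and 'flags' is still 0. */
static const struct stat_sketch stats_sketch[] = {
	{ .name = "rx_packets", .size = 8, .offset = 0 },
	{ .name = "tx_packets", .size = 8, .offset = 8 },
};
```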

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agoixgbe: Resolve warnings produced in W=2 builds
Mark Rustad [Thu, 24 Jul 2014 06:19:24 +0000 (06:19 +0000)]
ixgbe: Resolve warnings produced in W=2 builds

This patch resolves warnings produced by ixgbe in W=2 kernel
builds: missing-field-initializers warnings and shadow warnings.
None of these point to any deeper problem, so just resolve them
so that any new warnings stand out and get analyzed.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
10 years agonet: bpf: only build bpf_jit_binary_{alloc, free}() when jit selected
Daniel Borkmann [Wed, 10 Sep 2014 13:01:02 +0000 (15:01 +0200)]
net: bpf: only build bpf_jit_binary_{alloc, free}() when jit selected

Since BPF JIT depends on the availability of the module_alloc() and
module_free() helpers (HAVE_BPF_JIT and MODULES), we should build
that code only when BPF_JIT is enabled in our config, just
like other JIT code. This fixes builds for arm/marzen_defconfig
and sh/rsk7269_defconfig.

====================
kernel/built-in.o: In function `bpf_jit_binary_alloc':
/home/cwang/linux/kernel/bpf/core.c:144: undefined reference to `module_alloc'
kernel/built-in.o: In function `bpf_jit_binary_free':
/home/cwang/linux/kernel/bpf/core.c:164: undefined reference to `module_free'
make: *** [vmlinux] Error 1
====================

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: 738cbe72adc5 ("net: bpf: consolidate JIT binary allocator")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'cxgb4-next'
David S. Miller [Wed, 10 Sep 2014 21:02:37 +0000 (14:02 -0700)]
Merge branch 'cxgb4-next'

Hariprasad Shenai says:

====================
cxgb4: Allow FW size up to 1MB, support for S25FL032P flash and misc. fixes

This patch series allows firmware sizes up to 1MB and adds support for the
S25FL032P flash. It fixes t4_flash_erase_sectors() to throw an error when the
sectors to be erased aren't in the flash, and adds a warning message for
adapters with flashes smaller than 2Mb. It also adds the device ID of a new
adapter and removes the device ID of the debug adapter.

The patch series is created against the 'net-next' tree and includes patches
for the cxgb4 and cxgb4vf drivers.

We have included all the maintainers of the respective drivers. Kindly review
the changes and let us know if you have any review comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agocxgb4/cxgb4vf: Add device ID for new adapter and remove for dbg adapter
Hariprasad Shenai [Wed, 10 Sep 2014 12:14:31 +0000 (17:44 +0530)]
cxgb4/cxgb4vf: Add device ID for new adapter and remove for dbg adapter

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agocxgb4: Add warning msg when attaching to adapters which have FLASHes smaller than 2Mb
Hariprasad Shenai [Wed, 10 Sep 2014 12:14:30 +0000 (17:44 +0530)]
cxgb4: Add warning msg when attaching to adapters which have FLASHes smaller than 2Mb

Based on original work by Casey Leedom <leedom@chelsio.com>

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agocxgb4: Fix t4_flash_erase_sectors() to throw an error when requested to erase sectors...
Hariprasad Shenai [Wed, 10 Sep 2014 12:14:29 +0000 (17:44 +0530)]
cxgb4: Fix t4_flash_erase_sectors() to throw an error when requested to erase sectors which aren't in the FLASH

Based on original work by Casey Leedom <leedom@chelsio.com>
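A hedged sketch of the kind of range check the subject line describes, using a hypothetical flash descriptor and function rather than the real cxgb4 signature: a request that reaches past the last sector of the part is rejected with -EINVAL before any erase command is issued.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flash geometry plus a demo counter of erased sectors. */
struct flash_sketch {
	int nsec;   /* number of erase sectors on the part */
	int erased; /* sectors "erased" so far, for demonstration */
};

/* Erase sectors start..end inclusive; refuse ranges outside the flash. */
static int flash_erase_sectors_sketch(struct flash_sketch *f, int start, int end)
{
	if (start < 0 || start > end || end >= f->nsec)
		return -EINVAL;

	for (int s = start; s <= end; s++)
		f->erased++; /* stands in for the per-sector erase command */
	return 0;
}
```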

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agocxgb4: Add support to S25FL032P flash
Hariprasad Shenai [Wed, 10 Sep 2014 12:14:28 +0000 (17:44 +0530)]
cxgb4: Add support to S25FL032P flash

Add support for the Spansion S25FL032P flash.
Based on original work by Dimitris Michailidis.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agocxgb4: Allow T4/T5 firmware sizes up to 1MB
Hariprasad Shenai [Wed, 10 Sep 2014 12:14:27 +0000 (17:44 +0530)]
cxgb4: Allow T4/T5 firmware sizes up to 1MB

Based on original work by Casey Leedom <leedom@chelsio.com>

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agotipc: fix sparse warnings
Erik Hugne [Wed, 10 Sep 2014 12:02:50 +0000 (14:02 +0200)]
tipc: fix sparse warnings

This fixes the following sparse warning:
sparse: symbol 'tipc_update_nametbl' was not declared. Should it be static?
Also, the function is changed to return a bool indicating success, rather
than a potentially freed pointer.
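A sketch of why returning a bool is safer here, with a hypothetical entry type and function that only mirror the shape of the problem: when one branch frees the object, returning the pointer would hand the caller a value it must never dereference, while a bool reports the outcome without that hazard. The function is also static, which is what the sparse warning asks about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical name-table entry. */
struct publ_sketch {
	int key;
};

/* Apply an update; a withdraw consumes (frees) the entry. */
static bool update_nametbl_sketch(struct publ_sketch *p, bool withdraw)
{
	if (!p)
		return false;
	if (withdraw) {
		free(p); /* p must not be used after this point */
		return true;
	}
	/* ... insert p into the table ... */
	return true;
}
```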

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonet: ethernet: arc: Don't free Rockchip resources before disconnect from phy
Romain Perier [Wed, 10 Sep 2014 07:51:13 +0000 (07:51 +0000)]
net: ethernet: arc: Don't free Rockchip resources before disconnect from phy

Freeing resources before disconnecting from the PHY and calling into the
core driver is wrong and should not happen. Fixing the order also avoids a
4-5s delay caused by the phy_disconnect() timeout.

Signed-off-by: Romain Perier <romain.perier@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
David S. Miller [Wed, 10 Sep 2014 19:46:32 +0000 (12:46 -0700)]
Merge git://git./linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
nf-next pull request

The following patchset contains Netfilter/IPVS updates for your
net-next tree. Regarding nf_tables, most updates focus on consolidating
the NAT infrastructure and adding support for masquerading. More
specifically, they are:

1) Use __u8 instead of u_int8_t in arptables header, from
   Mike Frysinger.

2) Add support to match by skb->pkttype to the meta expression, from
   Ana Rey.

3) Add support to match by cpu to the meta expression, also from
   Ana Rey.

4) Fix a smatch warning about IPSET_ATTR_MARKMASK validation, patch from
   Vytas Dauksa.

5) Fix the range support for IPv4 in the netnet and netportnet hash types,
   from Sergey Popovich.

6) Resolve missing-field-initializer warnings, from Mark Rustad.

7) Fix possible integer overflows in ipset reported by Dan Carpenter,
   from Jozsef Kadlecsik.

8) Filter out accounting objects in nfacct by type, so you can
   selectively reset quotas, from Alexey Perevalov.

9) Move specific NAT IPv4 functions to the core so x_tables and
   nf_tables can share the same NAT IPv4 engine.

10) Use the new NAT IPv4 functions from nft_chain_nat_ipv4.

11) Move specific NAT IPv6 functions to the core so x_tables and
    nf_tables can share the same NAT IPv6 engine.

12) Use the new NAT IPv6 functions from nft_chain_nat_ipv6.

13) Refactor code to add nft_delrule(), which can be reused in the
    enhancement of the NFT_MSG_DELTABLE to remove a table and its
    content, from Arturo Borrero.

14) Add a helper function to unregister chain hooks, from
    Arturo Borrero.

15) A cleanup to rename to nft_delrule_by_chain for consistency with
    the new nft_*() functions, also from Arturo.

16) Add support to match devgroup to the meta expression, from Ana Rey.

17) Reduce stack usage for IPVS socket option, from Julian Anastasov.

18) Remove unnecessary textsearch state initialization in xt_string,
    from Bojan Prtvar.

19) Add several helper functions to nf_tables, more work to prepare
    the enhancement of NFT_MSG_DELTABLE, again from Arturo Borrero.

20) Enhance NFT_MSG_DELTABLE to delete a table and its content, from
    Arturo Borrero.

21) Support NAT flags in the nat expression to indicate the flavour,
    e.g. random fully, from Arturo.

22) Add missing audit code to ebtables when replacing tables, from
    Nicolas Dichtel.

23) Generalize the IPv4 masquerading code to allow its re-use from
    nf_tables, from Arturo.

24) Generalize the IPv6 masquerading code, also from Arturo.

25) Add the new masq expression to support IPv4/IPv6 masquerading
    from nf_tables, also from Arturo.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agonetfilter: Convert pr_warning to pr_warn
Joe Perches [Wed, 10 Sep 2014 04:17:32 +0000 (21:17 -0700)]
netfilter: Convert pr_warning to pr_warn

Use the more common pr_warn.

Other miscellanea:

o Coalesce formats
o Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoiucv: Convert pr_warning to pr_warn
Joe Perches [Wed, 10 Sep 2014 04:17:31 +0000 (21:17 -0700)]
iucv: Convert pr_warning to pr_warn

Use the more common pr_warn.
Coalesce formats.
Realign arguments.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agopktgen: Convert pr_warning to pr_warn
Joe Perches [Wed, 10 Sep 2014 04:17:30 +0000 (21:17 -0700)]
pktgen: Convert pr_warning to pr_warn

Use the more common pr_warn.
Realign arguments.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoatm: Convert pr_warning to pr_warn
Joe Perches [Wed, 10 Sep 2014 04:17:28 +0000 (21:17 -0700)]
atm: Convert pr_warning to pr_warn

Use the more common pr_warn.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoMerge branch 'ipip_sit_gro'
David S. Miller [Wed, 10 Sep 2014 04:29:50 +0000 (21:29 -0700)]
Merge branch 'ipip_sit_gro'

Tom Herbert says:

====================
net: enable GRO for IPIP and SIT

This patch set populates the IPIP and SIT offload structures with
gro_receive and gro_complete functions, which enables GRO for these
tunnels. It also fixes a problem in IPv6 where flush_id was not being
properly initialized.

Performance results are below. Note that these tests were done on bnx2x,
which doesn't provide RX checksum offload for IPIP or SIT (i.e. it does
not give CHECKSUM_COMPLETE). Also, in this case RSS uses only a 2-tuple
hash rather than a 4-tuple hash, so all the packets between two hosts
wind up on the same queue. The net result is that the interrupting CPU
is the bottleneck in GRO (it checksums every packet).

Testing:

netperf TCP_STREAM between two hosts using bnx2x.

* Before fix

IPIP
  1 connection
    6.53% CPU utilization
    6544.71 Mbps
  20 connections
    13.79% CPU utilization
    9284.54 Mbps

SIT
  1 connection
    6.68% CPU utilization
    5653.36 Mbps
  20 connections
    18.88% CPU utilization
    9154.61 Mbps

* After fix

IPIP
  1 connection
    5.73% CPU utilization
    9279.53 Mbps
  20 connections
    7.14% CPU utilization
    7279.35 Mbps

SIT
  1 connection
    2.95% CPU utilization
    9143.36 Mbps
  20 connections
    7.09% CPU utilization
    6255.3 Mbps
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agosit: Add gro callbacks to sit_offload
Tom Herbert [Tue, 9 Sep 2014 18:23:16 +0000 (11:23 -0700)]
sit: Add gro callbacks to sit_offload

Add ipv6_gro_receive and ipv6_gro_complete to sit_offload to
support GRO.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10 years agoipip: Add gro callbacks to ipip offload
Tom Herbert [Tue, 9 Sep 2014 18:23:15 +0000 (11:23 -0700)]
ipip: Add gro callbacks to ipip offload

Add inet_gro_receive and inet_gro_complete to ipip_offload to
support GRO.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>