Age | Commit message | Author | Files | Lines
2016-11-17 | Merge branch 'rds-ha-failover-fixes' | David S. Miller | 9 | -21/+86
Sowmini Varadhan says:

====================
RDS: TCP: HA/Failover fixes

This series contains a set of fixes for bugs exposed when we ran the following in a loop between a test machine pair:

    while (1); do
        # modprobe rds-tcp on test nodes
        # run rds-stress in bi-dir mode between test machine pair
        # modprobe -r rds-tcp on test nodes
    done

rds-stress in bi-dir mode will cause both nodes to initiate RDS-TCP connections at almost the same instant, exposing the bugs fixed in this series. Without the fixes, rds-stress reports sporadic packet drops, and packets arriving out of sequence. After the fixes, we have been able to run the test overnight without any issues.

Each patch has a detailed description of the root cause fixed by the patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 | RDS: TCP: Force every connection to be initiated by numerically smaller IP address | Sowmini Varadhan | 3 | -18/+26
When 2 RDS peers initiate an RDS-TCP connection simultaneously, there is a potential for "duelling syns" on either/both sides. See commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()") for a description of this condition, and the arbitration logic which ensures that the numerically larger IP address in the TCP connection is bound to the RDS_TCP_PORT ("canonical ordering"). The rds_connection should not be marked as RDS_CONN_UP until the arbitration logic has converged, for the following reason. The sender may start transmitting RDS datagrams as soon as RDS_CONN_UP is set, and the sender removes datagrams from the rds_connection's cp_retrans queue based on TCP acks. If the TCP ack was sent from a TCP socket that got reset as part of duel arbitration (but before data was delivered to the receiver's RDS socket layer), the sender may end up prematurely freeing the datagram, and the datagram is no longer reliably deliverable. This patch remedies that condition by making sure that, upon receipt of the 3WH completion state change notification of TCP_ESTABLISHED in rds_tcp_state_change, we mark the rds_connection as RDS_CONN_UP if, and only if, the IP addresses and ports for the connection are canonically ordered. In all other cases, rds_tcp_state_change will force an rds_conn_path_drop(), and rds_queue_reconnect() on both peers will restart the connection to ensure canonical ordering. A side effect of enforcing this condition in rds_tcp_state_change() is that rds_tcp_accept_one_path() can now be refactored for simplicity. It is also no longer possible to encounter an RDS_CONN_UP connection in the arbitration logic in rds_tcp_accept_one(). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
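The "canonical ordering" test described in this commit can be sketched in userspace C. This is only an illustration, not the kernel code: the function name, the use of host-byte-order addresses, and the treatment of equal addresses are assumptions, and `SKETCH_RDS_TCP_PORT` merely stands in for RDS_TCP_PORT.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for RDS_TCP_PORT; the value is an assumption of this sketch. */
#define SKETCH_RDS_TCP_PORT 16385u

/*
 * Return true when the TCP connection runs from the numerically smaller
 * IP (transient port) to the numerically larger IP (the RDS listen port).
 * Addresses are assumed to be IPv4 in host byte order.
 */
static bool sketch_conn_is_canonical(uint32_t local_ip, uint16_t local_port,
                                     uint32_t peer_ip, uint16_t peer_port)
{
    if (local_ip < peer_ip)         /* we must be the initiator */
        return peer_port == SKETCH_RDS_TCP_PORT;
    if (local_ip > peer_ip)         /* we must be the acceptor */
        return local_port == SKETCH_RDS_TCP_PORT;
    return true;                    /* same address: nothing to order (assumption) */
}
```

In terms of the commit text, a caller would mark the connection up only when this check passes, and otherwise drop the path so that the reconnect logic retries in the canonical direction.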
2016-11-17 | RDS: TCP: Track peer's connection generation number | Sowmini Varadhan | 6 | -3/+57
The RDS transport has to be able to distinguish between two types of failure events:

 (a) when the transport fails (e.g., TCP connection reset) but the RDS socket/connection layer on both sides stays the same
 (b) when the peer's RDS layer itself resets (e.g., due to module reload or machine reboot at the peer)

In case (a) both sides must reconnect and continue the RDS messaging without any message loss or disruption to the message sequence numbers, and this is achieved by rds_send_path_reset().

In case (b) we should reset all rds_connection state to the new incarnation of the peer. Examples of state that needs to be reset are next expected rx sequence number from, or messages to be retransmitted to, the new incarnation of the peer.

To achieve this, the RDS handshake probe added as part of commit 5916e2c1554f ("RDS: TCP: Enable multipath RDS for TCP") is enhanced so that sender and receiver of the RDS ping-probe will add a generation number as part of the RDS_EXTHDR_GEN_NUM extension header. Each peer stores local and remote generation numbers as part of each rds_connection. Changes in generation number will be detected via incoming handshake probe ping request or response and will allow the receiver to reset rds_connection state.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
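A minimal sketch of how a stored peer generation number can separate case (a) from case (b). The struct, its fields, and the convention that 0 means "no generation number received" are hypothetical and chosen for illustration; the kernel's rds_connection layout is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state; field names are illustrative only. */
struct sketch_rds_conn {
    uint32_t peer_gen_num;    /* last generation seen from the peer, 0 = none yet */
    uint64_t next_rx_seq;     /* next expected receive sequence number */
    unsigned int retrans_len; /* messages still queued for retransmission */
};

/*
 * Record the generation number carried by a handshake probe.
 * Returns true when the peer is a new incarnation (case b) and the
 * per-connection state was reset; false for a plain transport
 * reconnect (case a) or a first contact.
 */
static bool sketch_peer_gen_update(struct sketch_rds_conn *c, uint32_t gen)
{
    bool reset = false;

    if (gen == 0)             /* assumption: 0 means no generation number sent */
        return false;
    if (c->peer_gen_num != 0 && c->peer_gen_num != gen) {
        c->next_rx_seq = 0;   /* forget sequence state of the old incarnation */
        c->retrans_len = 0;   /* nothing is owed to the new incarnation */
        reset = true;
    }
    c->peer_gen_num = gen;
    return reset;
}
```

Seeing the same generation number again after a TCP reset leaves sequence state untouched, which matches the "continue without message loss" requirement of case (a).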
2016-11-17 | RDS: TCP: set RDS_FLAG_RETRANSMITTED in cp_retrans list | Sowmini Varadhan | 1 | -0/+3
As noted in rds_recv_incoming(), sequence numbers on data packets can decrease in the failover case, and the Rx path is equipped to recover from this if RDS_FLAG_RETRANSMITTED is set on the rds header of an incoming message with a suspect sequence number. The RDS_FLAG_RETRANSMITTED bit in the header is predicated on the RDS_FLAG_RETRANSMITTED flag in the rds_message, so make sure the flag is set on messages queued for retransmission. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 | net: stmmac: replace if (netif_msg_type) by their netif_xxx counterpart | LABBE Corentin | 1 | -28/+21
As suggested by Joe Perches, we could replace all if (netif_msg_type(priv)) dev_xxx(priv->devices, ...) by the simpler macro netif_xxx(priv, hw, priv->dev, ...) Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 | net: stmmac: replace hardcoded function name by __func__ | LABBE Corentin | 1 | -4/+3
Some print statements have the function name hardcoded. It is better to use __func__ instead. Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 | net: stmmac: replace all pr_xxx by their netdev_xxx counterpart | LABBE Corentin | 2 | -95/+123
The stmmac driver uses lots of pr_xxx functions to print information. This is bad since we cannot know which device logged the information (all the more so if two stmmac devices are present). Furthermore, it seems to wrongly assume that all log lines will always be consecutive, by using a dev_xxx followed by some indented pr_xxx like this:

    kernel: sun7i-dwmac 1c50000.ethernet: no reset control found
    kernel: Ring mode enabled
    kernel: No HW DMA feature register supported
    kernel: Normal descriptors
    kernel: TX Checksum insertion supported

So this patch replaces all pr_xxx by their netdev_xxx counterpart, except for some prints where netdev_xxx produces ugly output like:

    sun7i-dwmac 1c50000.ethernet (unnamed net_device) (uninitialized): no reset control found

In those cases, I keep dev_xxx. At the same time, I remove some "stmmac:" prefixes since they would duplicate what dev_xxx displays.

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 | net_sched: sch_fq: use hash_ptr() | Eric Dumazet | 1 | -2/+2
When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful, and I chose hash_32(). Linus Torvalds and George Spelvin fixed this issue, so we can use hash_ptr() to get more entropy on 64bit arches with Terabytes of memory, and avoid the cast games. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
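To illustrate why hashing the full pointer gives "more entropy" than hashing a 32-bit cast of it, here is a small userspace sketch of multiplicative hashing in both widths. The function names are hypothetical, and the multiplier constants are assumed to mirror the kernel's golden-ratio constants; treat them as illustrative values rather than a statement about any particular kernel version.

```c
#include <stdint.h>

/* Multipliers assumed to mirror the kernel's golden-ratio constants. */
#define SKETCH_GOLDEN_RATIO_32 0x61C88647u
#define SKETCH_GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Hash only the low 32 bits of an address (the old "cast" approach). */
static uint32_t sketch_hash_low32(uint64_t addr, unsigned int bits)
{
    return ((uint32_t)addr * SKETCH_GOLDEN_RATIO_32) >> (32 - bits);
}

/* Hash all 64 bits of an address, as a pointer hash can on 64-bit arches. */
static uint32_t sketch_hash_full64(uint64_t addr, unsigned int bits)
{
    return (uint32_t)((addr * SKETCH_GOLDEN_RATIO_64) >> (64 - bits));
}
```

Two addresses that differ only above bit 31 always land in the same bucket with `sketch_hash_low32()`, while `sketch_hash_full64()` can spread them; this is the situation that becomes relevant on machines with very large memory.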
2016-11-17 | net/mlx5e: remove napi_hash_del() calls | Eric Dumazet | 1 | -4/+0
Calling napi_hash_del() after netif_napi_del() is pointless. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 | net/mlx4_en: remove napi_hash_del() call | Eric Dumazet | 1 | -4/+0
There is no need to call napi_hash_del()+synchronize_rcu() before calling netif_napi_del(): netif_napi_del() does this already. Using napi_hash_del() in a driver is useful only when dealing with a batch of NAPI structures, so that a single synchronize_rcu() can be used. mlx4_en_deactivate_cq() is deactivating a single NAPI. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | Merge branch 'mlxsw-i2c' | David S. Miller | 8 | -5/+783
Jiri Pirko says:

====================
mlxsw: Introduce support for I2C bus

Vadim says:

This patchset adds I2C access support for SwitchX, SwitchX2, SwitchIB, SwitchIB2 and Spectrum silicon. It contains:
 - Small changes in mlxsw core code, needed for I2C bus support;
 - I2C driver, which obtains the I2C input/output mailbox settings and provides the command interface implementation;
 - Minimal driver, which works on top of the I2C driver and allows running the mlxsw command interface over the I2C bus.

Use case: On a system which does not have PCI to the ASIC (BMC), hwmon functionality (sensors, pwm, tacho) will be available through I2C.

Usage (manual probing):
    echo mlxsw_minimal 0x48 > /sys/bus/i2c/devices/i2c-2/new_device

Sysfs interface:
    /sys/bus/i2c/devices/2-0048/hwmon/hwmon5/pwm1
    /sys/bus/i2c/devices/2-0048/hwmon/hwmon5/temp1_input
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | mlxsw: minimal: Add I2C support for Mellanox ASICs | Vadim Pasternak | 3 | -0/+110
Add I2C access support for Mellanox ASICs:
 - Virtual Protocol Interconnect switches SwitchX, SwitchX2, providing InfiniBand, Ethernet and Fibre Channel connectivity;
 - InfiniBand switches SwitchIB, SwitchIB2;
 - Ethernet switch Spectrum.

Example of probing activation:
    echo mlxsw_minimal 0x48 > /sys/bus/i2c/devices/i2c-2/new_device

Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | mlxsw: Invoke driver's init/fini methods only if defined | Vadim Pasternak | 1 | -5/+9
We are going to add a minimal driver on top of the mlxsw core infrastructure, which will be mainly used for hardware monitoring in Baseboard management controller (BMC) installations. Unlike the switch drivers (e.g., spectrum, switchx2), this driver does not initialize the ASIC and therefore doesn't need to implement the init() and fini() methods in its 'mlxsw_driver' struct. Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
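The "call the method only if it is defined" pattern from this commit, sketched in plain C. The struct and function names are hypothetical stand-ins and do not reproduce the mlxsw_driver definition; `sketch_switch_init()` exists only to exercise the non-NULL path.

```c
#include <stddef.h>

/* Hypothetical driver descriptor with optional callbacks. */
struct sketch_driver {
    const char *kind;
    int (*init)(void *priv);    /* optional: may be NULL */
    void (*fini)(void *priv);   /* optional: may be NULL */
};

/* Example init for a "full" driver: counts how often it ran. */
static int sketch_switch_init(void *priv)
{
    *(int *)priv += 1;
    return 0;
}

/* Core-side registration: only invoke init() when the driver provides it. */
static int sketch_driver_register(const struct sketch_driver *drv, void *priv)
{
    if (drv->init)
        return drv->init(priv);
    return 0;                   /* nothing to initialize is not an error */
}

/* Core-side teardown: only invoke fini() when the driver provides it. */
static void sketch_driver_unregister(const struct sketch_driver *drv, void *priv)
{
    if (drv->fini)
        drv->fini(priv);
}
```

A monitoring-only driver can then leave both pointers NULL and still register successfully.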
2016-11-16 | mlxsw: Introduce support for I2C bus | Vadim Pasternak | 4 | -0/+654
Add I2C bus implementation for Mellanox Technologies Switch ASICs. This includes a command interface implementation using input/output mailboxes, whose location is retrieved from the firmware during probe time. Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | mlxsw: Add bus capability flag | Vadim Pasternak | 3 | -0/+10
The mlxsw core infrastructure currently assumes that communication with the ASIC is always possible using Ethernet management datagrams (EMADs), but this is only possible when the PCI bus is used. The bus capability flag is added to indicate EMAD support and make core initialize EMAD communication only when it's set. Otherwise, register access is done using command interface. Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
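A small sketch of a bus capability flag steering the register access method. All names here are hypothetical; the commit does not spell out the flag's name or where it lives, so this only illustrates the decision described in the text.

```c
#include <stdbool.h>

/* The two register access paths named in the commit text. */
enum sketch_access {
    SKETCH_ACCESS_EMAD,   /* Ethernet management datagrams */
    SKETCH_ACCESS_CMD     /* command interface (mailboxes) */
};

/* Hypothetical bus descriptor carrying the capability flag. */
struct sketch_bus {
    const char *kind;
    bool emad_capable;
};

/* Core-side choice: use EMADs only when the bus says it supports them. */
static enum sketch_access sketch_pick_access(const struct sketch_bus *bus)
{
    return bus->emad_capable ? SKETCH_ACCESS_EMAD : SKETCH_ACCESS_CMD;
}
```

With this shape, a PCI bus would set the flag and an I2C bus would leave it clear, so the core never tries to bring up EMAD communication where it cannot work.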
2016-11-16 | net: netcp: replace IS_ERR_OR_NULL by IS_ERR | Julia Lawall | 1 | -3/+3
knav_queue_open always returns an ERR_PTR value, never NULL. This can be confirmed by unfolding the function calls and conforms to the function's documentation. Thus, replace IS_ERR_OR_NULL by IS_ERR in error checks.

The change is made using the following semantic patch: (http://coccinelle.lip6.fr/)

    // <smpl>
    @@
    expression x;
    statement S;
    @@

    x = knav_queue_open(...);
    if (
    -   IS_ERR_OR_NULL
    +   IS_ERR
        (x)) S
    // </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
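For readers unfamiliar with the error-pointer convention, the following userspace sketch imitates it. The `sketch_` helpers are re-implementations written for this example (not the kernel macros), and `sketch_queue_open()` is a hypothetical opener that, like the function discussed above, reports failure through an error pointer and never through NULL.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in the top page of the address space. */
static void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Decode the errno value from an error pointer. */
static long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True only for encoded errors; NULL is NOT an error here. */
static bool sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}

/* True for encoded errors and for NULL. */
static bool sketch_is_err_or_null(const void *p)
{
    return p == NULL || sketch_is_err(p);
}

/* Hypothetical opener: returns a valid pointer or an error pointer. */
static void *sketch_queue_open(int id)
{
    static int queue;           /* dummy object to hand out */

    if (id < 0)
        return sketch_err_ptr(-ENODEV);
    return &queue;
}
```

When a function can never return NULL, the `sketch_is_err()` form states the contract precisely, which is the point of the semantic patch.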
2016-11-16 | sctp: use new rhlist interface on sctp transport rhashtable | Xin Long | 5 | -50/+64
Currently the sctp transport rhashtable uses hash(lport, dport, daddr) as the key to hash a node to one chain. If on one host thousands of assocs connect to one server with the same lport and different laddrs (although it's not a normal case), all the transports would be hashed into the same chain. This may cause insertion of a new node to keep returning -EBUSY, as the chain is too long and sctp inserts a transport node in a loop, which could even lead to system hangs there. The new rhlist interface handles this case, where there are many nodes with the same key in one chain: it puts them into a list and then makes this list a single node of the chain. This patch replaces the rhashtable_ interface with the rhltable_ interface. Since a chain would no longer be too long and insertion would not return -EBUSY with this fix, the reinsert loop is also removed here. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | Merge branch 'bnxt_en-next' | David S. Miller | 4 | -16/+298
Michael Chan says: ==================== bnxt_en: Updates. New firmware spec. update, autoneg update, and UDP RSS support. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | bnxt_en: Add ethtool -n|-N rx-flow-hash support. | Michael Chan | 1 | -3/+164
To display and modify the RSS hash. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | bnxt_en: Add UDP RSS support for 57X1X chips. | Michael Chan | 2 | -8/+16
The newer chips have proper support for 4-tuple UDP RSS. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | bnxt_en: Enhance autoneg support. | Michael Chan | 2 | -0/+24
On some dual port NICs, the speed setting on one port can affect the available speed on the other port. Add logic to detect these changes and adjust the advertised speed settings when necessary. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | bnxt_en: Update firmware interface spec to 1.5.4. | Michael Chan | 2 | -5/+94
Use the new FORCE_LINK_DWN bit to shutdown link during close. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | netpoll: more efficient locking | Eric Dumazet | 4 | -11/+10
Callers of netpoll_poll_lock() own NAPI_STATE_SCHED. Callers of netpoll_poll_unlock() have BH blocked between the NAPI_STATE_SCHED being cleared and poll_lock being released. We can avoid the spinlock, which has no contention, and use cmpxchg() on poll_owner, which we need to set anyway. This removes a possible lockdep violation after the cited commit, since sk_busy_loop() re-enables BH before calling busy_poll_stop(). Fixes: 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
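The idea of replacing a spinlock with a compare-and-exchange on an owner field can be sketched with C11 atomics. This is a userspace analogy under stated assumptions: the struct, the function names, the use of -1 as the "unowned" value, and the try-lock shape (rather than a spin loop) are all choices made for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the NAPI field involved; -1 means unowned. */
struct sketch_napi {
    atomic_int poll_owner;
};

static void sketch_napi_init(struct sketch_napi *n)
{
    atomic_init(&n->poll_owner, -1);
}

/*
 * Try to take ownership for the given CPU id. Setting the owner and
 * "locking" are the same atomic step, so no separate spinlock is needed.
 */
static bool sketch_poll_trylock(struct sketch_napi *n, int cpu)
{
    int expected = -1;

    return atomic_compare_exchange_strong(&n->poll_owner, &expected, cpu);
}

/* Release ownership so another CPU's compare-and-exchange can succeed. */
static void sketch_poll_unlock(struct sketch_napi *n)
{
    atomic_store(&n->poll_owner, -1);
}
```

A blocking variant would simply retry `sketch_poll_trylock()` in a loop until it succeeds.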
2016-11-16 | cadence: Add LSO support. | Rafal Ozieblo | 2 | -12/+140
New Cadence GEM hardware supports Large Segment Offload (LSO): TCP segmentation offload (TSO) as well as UDP fragmentation offload (UFO). Support for those features was added to the driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | netronome: don't access real_num_rx_queues directly | Arnd Bergmann | 1 | -8/+6
The netdev->real_num_rx_queues setting is only available if CONFIG_SYSFS is enabled, so we now get a build failure when that is turned off:

    netronome/nfp/nfp_net_common.c: In function 'nfp_net_ring_swap_enable':
    netronome/nfp/nfp_net_common.c:2489:18: error: 'struct net_device' has no member named 'real_num_rx_queues'; did you mean 'real_num_tx_queues'?

As far as I can tell, the check here is only used as an optimization that we can skip in order to fix the compilation. If sysfs is disabled, the following netif_set_real_num_rx_queues() has no effect.

Fixes: 164d1e9e5d52 ("nfp: add support for ethtool .set_channels") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | sfc: remove napi_hash_del() call | Eric Dumazet | 1 | -3/+2
Calling napi_hash_del() after netif_napi_del() is pointless. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Edward Cree <ecree@solarflare.com> Cc: Bert Kenward <bkenward@solarflare.com> Acked-by: Bert Kenward <bkenward@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | lwtunnel: subtract tunnel headroom from mtu on output redirect | David Lebrun | 1 | -1/+2
This patch changes the lwtunnel_headroom() function which is called in ipv4_mtu() and ip6_mtu(), to also return the correct headroom value when the lwtunnel state is OUTPUT_REDIRECT. This patch enables e.g. SR-IPv6 encapsulations to work without manually setting the route mtu. Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
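A sketch of the MTU adjustment this commit describes: the tunnel headroom is subtracted from the route MTU when the lightweight tunnel state redirects output. The struct and helpers are hypothetical; in particular, also honouring a transmit-redirect flag and ignoring a headroom that does not fit in the MTU are assumptions of this sketch.

```c
#include <stdbool.h>

/* Hypothetical lightweight-tunnel state; field names are illustrative. */
struct sketch_lwtstate {
    bool xmit_redirect;     /* assumption: the pre-existing condition */
    bool output_redirect;   /* the case this commit adds */
    unsigned int headroom;  /* bytes the encapsulation will prepend */
};

/* Headroom to reserve, or 0 when no redirect applies or it cannot fit. */
static unsigned int sketch_lwt_headroom(const struct sketch_lwtstate *s,
                                        unsigned int mtu)
{
    if (s && (s->xmit_redirect || s->output_redirect) && s->headroom < mtu)
        return s->headroom;
    return 0;
}

/* MTU seen by the upper layers once the headroom is accounted for. */
static unsigned int sketch_effective_mtu(const struct sketch_lwtstate *s,
                                         unsigned int mtu)
{
    return mtu - sketch_lwt_headroom(s, mtu);
}
```

With the output-redirect case covered, an encapsulating route no longer needs its MTU lowered by hand to leave room for the extra headers.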
2016-11-16 | mlxsw: spectrum_router: Adjust placement of FIB abort warning | Ido Schimmel | 1 | -3/+3
The recent merge commit bb598c1b8c9b ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") would cause the FIB abort warning to fire whenever we flush the FIB tables - either during module removal or actual abort. Move it back to its rightful location in the FIB abort function. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | net: dsa: mv88e6xxx: Respect SPEED_UNFORCED, don't set force bit | Andrew Lunn | 1 | -1/+1
SPEED_UNFORCED indicates the MAC & PHY should perform auto-negotiation to determine a speed which works. If this is called for, don't set the force bit. If it is set, the MAC actually does 10Gbps, which the internal PHYs don't support. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | Merge branch 'amd-xgbe-next' | David S. Miller | 2 | -3/+3
Tom Lendacky says:

====================
amd-xgbe: AMD XGBE driver updates 2016-11-15

This patch series addresses some minor issues found in the recently accepted patch series for the AMD XGBE driver.

The following fixes are included in this driver update series:
 - Fix a possibly uninitialized variable in the debugfs support
 - Fix the GPIO pin number constraint check

This patch series is based on net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | amd-xgbe: Fix maximum GPIO value check | Lendacky, Thomas | 1 | -2/+2
The GPIO support in the hardware allows for up to 16 GPIO pins, enumerated from 0 to 15. The driver uses the wrong value (16) to validate the GPIO pin range in the routines to set and clear the GPIO output pins. Update the code to use the correct value (15). Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
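The off-by-one described above is easy to see in a small sketch. The names and the bitmask representation of the output pins are hypothetical; only the 0 to 15 pin range comes from the commit text.

```c
#include <errno.h>

/* Pins are numbered 0..15, so 15 is the largest valid pin number. */
#define SKETCH_GPIO_MAX_PIN 15u

/* Set one output pin in a 16-bit mask; reject anything past pin 15. */
static int sketch_gpio_set(unsigned int *out_mask, unsigned int gpio)
{
    if (gpio > SKETCH_GPIO_MAX_PIN)   /* comparing against 16 would let pin 16 through */
        return -EINVAL;
    *out_mask |= 1u << gpio;
    return 0;
}

/* Clear one output pin, with the same range check. */
static int sketch_gpio_clr(unsigned int *out_mask, unsigned int gpio)
{
    if (gpio > SKETCH_GPIO_MAX_PIN)
        return -EINVAL;
    *out_mask &= ~(1u << gpio);
    return 0;
}
```

Checking `gpio > 16` instead would accept pin 16, which lies outside the 16-pin range the hardware provides.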
2016-11-16 | amd-xgbe: Fix possible uninitialized variable | Lendacky, Thomas | 1 | -1/+1
The debugfs support in the driver uses a common routine to write the debugfs values. In this routine, if the input file position is non-zero then the write routine will not return an error and an output parameter will not have been set. Because an error isn't returned an uninitialized value will be written into a register. Fix the common write routine to return an error if the input file position is non-zero, which will propagate back to the caller. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
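A userspace sketch of the failure mode and its fix: a shared write helper that parses a value into an output parameter must return an error on every path where it leaves that parameter unset. The function name, the hexadecimal parsing, and the buffer size are assumptions made for this example.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a hex value from buf into *value.
 * Returns the number of bytes consumed, or a negative errno value.
 * *value is written only on success, so callers must check the result.
 */
static long sketch_common_write(const char *buf, size_t count,
                                long *ppos, unsigned int *value)
{
    char work[16];
    char *end;

    if (*ppos != 0)                 /* returning 0 here would leave *value unset */
        return -EINVAL;
    if (count == 0 || count >= sizeof(work))
        return -ENOSPC;

    memcpy(work, buf, count);
    work[count] = '\0';

    *value = (unsigned int)strtoul(work, &end, 16);
    if (end == work)                /* no hex digits found */
        return -EINVAL;

    *ppos += (long)count;
    return (long)count;
}
```

A caller that writes the parsed value into a register can then bail out on any negative return and never touches an uninitialized variable.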
2016-11-16 | Merge branch 'nway-reset' | David S. Miller | 5 | -0/+5
Florian Fainelli says: ==================== net: Implement ethtool::nway_reset for a few drivers This patch series depends on "net: phy: Centralize auto-negotation restart" since it provides phy_ethtool_nway_reset as a helper function. The drivers here already support PHYLIB, so there really is no reason why restarting auto-negotiation would not be possible with these. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | net: ethernet: marvell: pxa168_eth: Implement ethtool::nway_reset | Florian Fainelli | 1 | -0/+1
Implement ethtool::nway_reset using phy_ethtool_nway_reset. We are already using dev->phydev all over the place so this comes for free. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | net: ethernet: mvpp2: Implement ethtool::nway_reset | Florian Fainelli | 1 | -0/+1
Implement ethtool::nway_reset using phy_ethtool_nway_reset. We are already using dev->phydev all over the place so this comes for free. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | net: ethernet: mvneta: Implement ethtool::nway_reset | Florian Fainelli | 1 | -0/+1
Implement ethtool::nway_reset using phy_ethtool_nway_reset. We are already using dev->phydev all over the place so this comes for free. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | net: ethoc: Implement ethtool::nway_reset | Florian Fainelli | 1 | -0/+1
Implement ethtool::nway_reset using phy_ethtool_nway_reset. We are already using dev->phydev all over the place so this comes for free. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | net: stmmac: Implement ethtool::nway_reset | Florian Fainelli | 1 | -0/+1
Utilize the generic phy_ethtool_nway_reset() helper function to implement an autonegotiation restart. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | Merge branch 'busypoll-preemption-and-other-optimizations' | David S. Miller | 5 | -38/+113
Eric Dumazet says: ==================== net: busy-poll: allow preemption and other optimizations It is time to have preemption points in sk_busy_loop() and improve its scalability. Also napi_complete() and friends can tell drivers when it is safe to not re-enable device interrupts, saving some overhead under high busy polling. mlx4 and bnx2x are changed accordingly, to show how this busy polling status can be exploited by drivers. Next steps will implement Zach Brown's suggestion, where NAPI polling would be enabled all the time for some chosen queues. This is needed for efficient epoll() support anyway. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | bnx2x: switch to napi_complete_done() | Eric Dumazet | 1 | -7/+8
Switch from napi_complete() to napi_complete_done() for better GRO support (gro_flush_timeout) and core NAPI features. Do not rearm interrupts if we are busy polling, to reduce bus and interrupts overhead. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | net/mlx4_en: use napi_complete_done() return value | Eric Dumazet | 1 | -2/+2
Do not rearm interrupts if we are busy polling. mlx4 uses separate CQ for TX and RX, so number of TX interrupts does not change, unfortunately. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | net: busy-poll: return busypolling status to drivers | Eric Dumazet | 2 | -7/+10
NAPI drivers use napi_complete_done() or napi_complete() when they have drained the RX ring, right before re-enabling device interrupts. In busy polling, we can avoid interrupts being delivered since we are polling the RX ring in a controlled loop. Drivers can choose to use the napi_complete_done() return value to reduce interrupt overhead while busy polling is active. This is optional; legacy drivers should work fine even if not updated. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
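The driver-side pattern described here, sketched as a self-contained simulation. Nothing below is kernel code: the struct, the counter standing in for "re-enable device interrupts", and both function names are hypothetical, and the completion helper only models the boolean contract from the text (false while busy polling is active).

```c
#include <stdbool.h>

/* Hypothetical model of the NAPI state a driver cares about. */
struct sketch_bp_napi {
    bool in_busy_poll;        /* a busy-poll loop currently owns this NAPI */
    bool scheduled;           /* models the "scheduled" state bit */
    unsigned int irq_rearms;  /* how often the driver re-enabled interrupts */
};

/*
 * Model of the completion call: returns false while busy polling is
 * active, telling the driver it need not re-enable interrupts.
 */
static bool sketch_complete_done(struct sketch_bp_napi *n)
{
    if (n->in_busy_poll)
        return false;
    n->scheduled = false;
    return true;
}

/* Model of a driver poll routine that honours the return value. */
static int sketch_driver_poll(struct sketch_bp_napi *n, int pending, int budget)
{
    int done = pending < budget ? pending : budget;

    /* Ring drained: complete, and re-arm interrupts only if told to. */
    if (done < budget && sketch_complete_done(n))
        n->irq_rearms++;
    return done;
}
```

A driver that ignores the return value and always re-arms interrupts still behaves correctly; it just pays for interrupts that the busy-poll loop makes redundant.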
2016-11-16 | net: busy-poll: remove need_resched() from sk_can_busy_loop() | Eric Dumazet | 1 | -3/+2
Now that sk_busy_loop() can schedule by itself, we can remove the need_resched() check from sk_can_busy_loop(). Also add a const to its struct sock parameter. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | net: busy-poll: allow preemption in sk_busy_loop() | Eric Dumazet | 2 | -20/+92
After commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job"), sk_busy_loop() needs a bit of care: softirqs might be delayed since we do not allow preemption yet.

This patch adds preemption points in sk_busy_loop(), and makes sure no unnecessary cache line dirtying or atomic operations are done while looping.

A new flag is added into napi->state: NAPI_STATE_IN_BUSY_POLL. This prevents napi_complete_done() from clearing NAPIF_STATE_SCHED, so that sk_busy_loop() does not have to grab it again. Similarly, netpoll_poll_lock() is done one time.

This gives about 10 to 20 % improvement in various busy polling tests, especially when many threads are busy polling in configurations with large number of NIC queues. This should allow experimenting with bigger delays without hurting overall latencies.

Tested: On a 40Gb mlx4 NIC, 32 RX/TX queues.

    echo 70 >/proc/sys/net/core/busy_read
    for i in `seq 1 40`; do echo -n $i: ; ./super_netperf $i -H lpaa24 -t UDP_RR -- -N -n; done

         Before:   After:
     1:    90072    92819
     2:   157289   184007
     3:   235772   213504
     4:   344074   357513
     5:   394755   458267
     6:   461151   487819
     7:   549116   625963
     8:   544423   716219
     9:   720460   738446
    10:   794686   837612
    11:   915998   923960
    12:   937507   925107
    13:  1019677   971506
    14:  1046831  1113650
    15:  1114154  1148902
    16:  1105221  1179263
    17:  1266552  1299585
    18:  1258454  1383817
    19:  1341453  1312194
    20:  1363557  1488487
    21:  1387979  1501004
    22:  1417552  1601683
    23:  1550049  1642002
    24:  1568876  1601915
    25:  1560239  1683607
    26:  1640207  1745211
    27:  1706540  1723574
    28:  1638518  1722036
    29:  1734309  1757447
    30:  1782007  1855436
    31:  1724806  1888539
    32:  1717716  1944297
    33:  1778716  1869118
    34:  1805738  1983466
    35:  1815694  2020758
    36:  1893059  2035632
    37:  1843406  2034653
    38:  1888830  2086580
    39:  1972827  2143567
    40:  1877729  2181851

Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | bpf: Fix compilation warning in __bpf_lru_list_rotate_inactive | Martin KaFai Lau | 1 | -1/+1
gcc-6.2.1 gives the following warning:

    kernel/bpf/bpf_lru_list.c: In function ‘__bpf_lru_list_rotate_inactive.isra.3’:
    kernel/bpf/bpf_lru_list.c:201:28: warning: ‘next’ may be used uninitialized in this function [-Wmaybe-uninitialized]

The "next" is currently initialized in the while() loop which must have >=1 iterations. This patch initializes next to get rid of the compiler warning.

Fixes: 3a08c2fd7634 ("bpf: LRU List") Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 | ipv6: sr: add option to control lwtunnel support | David Lebrun | 3 | -3/+23
This patch adds a new option CONFIG_IPV6_SEG6_LWTUNNEL to enable/disable support of encapsulation with the lightweight tunnels. When this option is enabled, CONFIG_LWTUNNEL is automatically selected.

Fix commit 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")

Without a proper option to control lwtunnel support for SR-IPv6, if CONFIG_LWTUNNEL=n then the IPv6 initialization fails as a consequence of seg6_iptunnel_init() failure with EOPNOTSUPP:

    NET: Registered protocol family 10
    IPv6: Attempt to unregister permanent protocol 6
    IPv6: Attempt to unregister permanent protocol 136
    IPv6: Attempt to unregister permanent protocol 17
    NET: Unregistered protocol family 10

Tested (compiling, booting, and loading ipv6 module when relevant) with possible combinations of CONFIG_IPV6={y,m,n}, CONFIG_IPV6_SEG6_LWTUNNEL={y,n} and CONFIG_LWTUNNEL={y,n}.

Reported-by: Lorenzo Colitti <lorenzo@google.com> Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 | Merge branch 'alx-multiqueue-support' | David S. Miller | 2 | -170/+420
Tobias Regnery says:

====================
alx: add multi queue support

This patchset lays the groundwork for multi queue support in the alx driver and enables multi queue support for the tx path by default. The hardware supports up to 4 tx queues.

Benefits are better utilization of multi core cpus and the usage of the msi-x support by default which splits the handling of rx / tx and misc other interrupts.

The rx path is a little bit harder because apparently (based on the limited information from the downstream driver) the hardware supports up to 8 rss queues but only has one hardware descriptor ring on the rx side. So the rx path will be part of another patchset.

Tested on my AR8161 ethernet adapter with different tests:
 - there are no regressions observed during my daily usage
 - iperf tcp and udp tests shows no performance regressions
 - netperf TCP_RR and UDP_RR shows a slight performance increase of about 1-2% with this patchset applied

This work is based on the downstream driver at github.com/qca/alx

Changes in V2:
 - drop unneeded casts in alx_alloc_rx_ring (Patch 1)
 - add additional information about testing and benefit to the changelog
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 | alx: enable multiple tx queues | Tobias Regnery | 1 | -2/+6
Enable multiple tx queues by default based on the number of online cpus. The hardware supports up to four tx queues. Based on the downstream driver at github.com/qca/alx Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 | alx: enable msi-x interrupts by default | Tobias Regnery | 1 | -5/+1
Remove the module parameter to enable msi-x support and enable msi-x interrupts unconditionally by default. This is a preparatory step to enable multi queue support by default, because this is only working with msi-x interrupts. Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 | alx: prepare tx path for multi queue support | Tobias Regnery | 1 | -13/+45
This patch prepares the tx path to send data on multiple tx queues. It introduces per queue register addresses and uses them in the alx_tx_queue structs. There are new helper functions for the queue mapping in the tx path. Based on the downstream driver at github.com/qca/alx Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>