linux - Linux Kernel (branches are rebased on master from time to time)

Age	Commit message (Collapse)	Author	Files	Lines
2016-06-09	net: add netdev_lockdep_set_classes() helper	Eric Dumazet	7	-82/+23
	It is time to add netdev_lockdep_set_classes() helper so that lockdep annotations per device type are easier to manage. This removes a lot of copies and missing annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09	net: sched: fix qdisc->running lockdep annotations	Eric Dumazet	2	-3/+7
	1) qdisc_run_begin() is really using the equivalent of a trylock. Instead of using write_seqcount_begin(), use a combination of raw_write_seqcount_begin() and correct lockdep annotation. 2) sch_direct_xmit() should use regular spin_lock(root_lock) Fixes: f9eb8aea2a1e ("net_sched: transform qdisc running bit into a seqcount") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09	netvsc: get rid of completion timeouts	Vitaly Kuznetsov	2	-100/+31
	I'm hitting 5 second timeout in rndis_filter_set_rss_param() while setting RSS parameters for the device. When this happens we end up returning -ETIMEDOUT from the function and rndis_filter_device_add() falls back to setting net_device->max_chn = 1; net_device->num_chn = 1; net_device->num_sc_offered = 0; but after a moment the rndis request succeeds and subchannels start to appear. netvsc_sc_open() does unconditional nvscdev->num_sc_offered-- and it becomes U32_MAX-1. Consequent rndis_filter_device_remove() will hang while waiting for all U32_MAX-1 subchannels to appear and this is not going to happen. The immediate issue could be solved by adding num_sc_offered > 0 check to netvsc_sc_open() but we're getting out of sync with the host and it's not easy to adjust things later, e.g. in this particular case we'll be creating queues without a user request for it and races are expected. Same applies to other parts of the driver which have the same completion timeout. Following the trend in drivers/hv/* code I suggest we remove all these timeouts completely. As a guest we can always trust the host we're running on and if the host screws things up there is no easy way to recover anyway. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09	sit: remove unnecessary protocol check in ipip6_tunnel_xmit()	Simon Horman	1	-3/+0
	ipip6_tunnel_xmit() is called immediately after checking that skb->protocol is htons(ETH_P_IPV6) so there is no need to check it a second time. Found by inspection. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	Merge branch 'cbq-kill-drop'	David S. Miller	20	-680/+15
	Florian Westphal says: ==================== sched, cbq: remove OVL_STRATEGY/POLICE support iproute2 does not implement any options that result in the TCA_CBQ_OVL_STRATEGY/TCA_CBQ_POLICE attributes being set/used. This series removes these two attributes from cbq and makes kernel reject them via EOPNOTSUPP in case they are present. The two followup changes then remove several features from qdisc infrastructure that are then no longer used/needed. These are: - The 'drop' method provided by most qdiscs - the 'reshape_fail' function used by some qdiscs - the __parent member in struct Qdisc I tested this with allmod and allyesconfig builds and also with a brief cbq script: tc qdisc add dev eth0 root handle 1:0 cbq bandwidth 10Mbit avpkt 1000 cell 8 tc class add dev eth0 parent 1:0 classid 1:1 est 1sec 8sec cbq bandwidth 10Mbit rate 5Mbit prio 1 allot 1514 maxburst 20 cell 8 avpkt 1000 bounded split 1:0 defmap 3f tc class add dev eth0 parent 1:0 classid 1:2 est 1sec 8sec cbq bandwidth 10Mbit rate 5Mbit prio 1 allot 1514 maxburst 20 cell 8 avpkt 1000 bounded split 1:0 defmap 3f tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip tos 0x10 0xff classid 1:1 police rate 2Mbit burst 10K reclassify tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip tos 0x0c 0xff classid 1:2 tc filter add dev eth0 parent 1:0 protocol ip prio 2 u32 match ip tos 0x10 0xff classid 1:2 tc filter add dev eth0 parent 1:0 protocol ip prio 3 u32 match ip tos 0x0 0x0 classid 1:2 No changes since v1 except patch #5 to fix up struct Qdisc layout. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	sched: place state, next_sched and gso_skb in same cacheline again	Florian Westphal	1	-2/+2
	Earlier commits removed two members from struct Qdisc which places next_sched/gso_skb into a different cacheline than ->state. This restores the struct layout to what it was before the removal. Move the two members, then add an annotation so they all reside in the same cacheline. This adds a 16 byte hole after cpu_qstats. The hole could be closed but as it doesn't decrease total struct size just do it this way. Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	sched: remove qdisc->drop	Florian Westphal	19	-395/+0
	after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no more callers of ->drop() outside of other ->drop functions, i.e. nothing calls them. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	sched: remove qdisc_rehape_fail	Florian Westphal	5	-26/+7
	After the removal of TCA_CBQ_POLICE in cbq scheduler qdisc->reshape_fail is always NULL, i.e. qdisc_rehape_fail is now the same as qdisc_drop. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	cbq: remove TCA_CBQ_POLICE support	Florian Westphal	2	-98/+1
	iproute2 doesn't implement any cbq option that results in this attribute being sent to kernel. To make use of it, user would have to - patch iproute2 - add a class - attach a qdisc to the class (default pfifo doesn't work as q->handle is 0 and cbq_set_police() is a no-op in this case) - re-'add' the same class (tc class change ...) again - user must also specifiy a defmap (e.g. 'split 1:0 defmap 3f'), since this 'police' feature relies on its presence - the added qdisc must be one of bfifo, pfifo or netem If all of these conditions are met and _some_ leaf qdiscs, namely p/bfifo, netem, plug or tbf would drop a packet, kernel calls back into cbq, which will attempt to re-queue the skb into a different class as indicated by the parents' defmap entry for TC_PRIO_BESTEFFORT. [ i.e. we behave as if tc_classify returned TC_ACT_RECLASSIFY ]. This feature, which isn't documented or implemented in iproute2, and isn't implemented consistently (most qdiscs like sfq, codel, etc drop right away instead of attempting this reclassification) is the sole reason for the reshape_fail and __parent member in Qdisc struct. So remove TCA_CBQ_POLICE support from the kernel, reject it via EOPNOTSUPP so userspace knows we don't support it, and then remove no-longer needed infrastructure in followup commit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	cbq: remove TCA_CBQ_OVL_STRATEGY support	Florian Westphal	1	-160/+6
	since initial revision of cbq in 2004 iproute 2 has never implemented support for TCA_CBQ_OVL_STRATEGY, which is what needs to be set to activate the class->drop() call (TC_CBQ_OVL_DROP strategy must be set by userspace value must be set by userspace). David Miller says: It seems really safe to kill this thing off, flag an error if someone tries to set the attribute, and therefore kill off all of the non-default cbq_ovl_*() functions. A followup commit can then remove all .drop qdisc methods since this removed the only caller. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	ip6gre: Allow live link address change	Shweta Choudaha	1	-0/+3
	The ip6 GRE tap device should not be forced to down state to change the mac address and should allow live address change for tap device similar to ipv4 gre. Signed-off-by: Shweta Choudaha <schoudah@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	Merge branch 'vrf-fib-rule-improve'	David S. Miller	8	-12/+214
	David Ahern says: ==================== net: vrf: Improve use of FIB rules Currently, VRFs require 1 oif and 1 iif rule per address family per VRF. As the number of VRF devices increases it brings scalability issues with the increasing rule list. All of the VRF rules have the same format with the exception of the specific table id to direct the lookup. Since the table id is available from the oif or iif in the loopup, the VRF rules can be consolidated to a single rule that pulls the table from the VRF device. This solution still allows a user to insert their own rules for VRFs, including rules with additional attributes. Accordingly, it is backwards compatible with existing setups and allows other policy routing as desired. Hopefully v5 is the charm; my e-waste can is getting full. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: vrf: Add l3mdev rules on first device create	David Ahern	1	-1/+105
	Add l3mdev rule per address family when the first VRF device is created. The rules are installed with a default preference of 1000. Users can replace the default rule as desired. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: Add l3mdev rule	David Ahern	7	-11/+109
	Currently, VRFs require 1 oif and 1 iif rule per address family per VRF. As the number of VRF devices increases it brings scalability issues with the increasing rule list. All of the VRF rules have the same format with the exception of the specific table id to direct the lookup. Since the table id is available from the oif or iif in the loopup, the VRF rules can be consolidated to a single rule that pulls the table from the VRF device. This patch introduces a new rule attribute l3mdev. The l3mdev rule means the table id used for the lookup is pulled from the L3 master device (e.g., VRF) rather than being statically defined. With the l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev rule per address family (IPv4 and IPv6). If an admin wishes to insert higher priority rules for specific VRFs those rules will co-exist with the l3mdev rule. This capability means current VRF scripts will co-exist with this new simpler implementation. Currently, the rules list for both ipv4 and ipv6 look like this: $ ip ru ls 1000: from all oif vrf1 lookup 1001 1000: from all iif vrf1 lookup 1001 1000: from all oif vrf2 lookup 1002 1000: from all iif vrf2 lookup 1002 1000: from all oif vrf3 lookup 1003 1000: from all iif vrf3 lookup 1003 1000: from all oif vrf4 lookup 1004 1000: from all iif vrf4 lookup 1004 1000: from all oif vrf5 lookup 1005 1000: from all iif vrf5 lookup 1005 1000: from all oif vrf6 lookup 1006 1000: from all iif vrf6 lookup 1006 1000: from all oif vrf7 lookup 1007 1000: from all iif vrf7 lookup 1007 1000: from all oif vrf8 lookup 1008 1000: from all iif vrf8 lookup 1008 ... 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default With the l3mdev rule the list is just the following regardless of the number of VRFs: $ ip ru ls 1000: from all lookup [l3mdev table] 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default (Note: the above pretty print of the rule is based on an iproute2 prototype. Actual verbage may change) Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	Merge branch 'tipc-small-fixes'	David S. Miller	2	-12/+12
	Jon Maloy says: ==================== tipc: two small fixes We fix a couple of rarely seen anomalies discovered during testing. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	tipc: change node timer unit from jiffies to ms	Jon Paul Maloy	2	-10/+10
	The node keepalive interval is recalculated at each timer expiration to catch any changes in the link tolerance, and stored in a field in struct tipc_node. We use jiffies as unit for the stored value. This is suboptimal, because it makes the calculation unnecessary complex, including two unit conversions. The conversions also lead to a rounding error that causes the link "abort limit" to be 3 in the normal case, instead of 4, as intended. This again leads to unnecessary link resets when the network is pushed close to its limit, e.g., in an environment with hundreds of nodes or namesapces. In this commit, we do instead let the keepalive value be calculated and stored in milliseconds, so that there is only one conversion and the rounding error is eliminated. We also remove a redundant "keepalive" field in struct tipc_link. This is remnant from the previous implementation. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	tipc: correct error in node fsm	Jon Paul Maloy	1	-2/+2
	commit 88e8ac7000dc ("tipc: reduce transmission rate of reset messages when link is down") revealed a flaw in the node FSM, as defined in the log of commit 66996b6c47ed ("tipc: extend node FSM"). We see the following scenario: 1: Node B receives a RESET message from node A before its link endpoint is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This event will not change the node FSM state, but the (distinct) link FSM will move to state RESETTING. 2: As an effect of the previous event, the local endpoint on B will declare node A lost, and post the event SELF_DOWN to the its node FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning that no messages will be accepted from A until it receives another RESET message that confirms that A's endpoint has been reset. This is wasteful, since we know this as a fact already from the first received RESET, but worse is that the link instance's FSM has not wasted this information, but instead moved on to state ESTABLISHING, meaning that it repeatedly sends out ACTIVATE messages to the reset peer A. 3: Node A will receive one of the ACTIVATE messages, move its link FSM to state ESTABLISHED, and start repeatedly sending out STATE messages to node B. 4: Node B will consistently drop these messages, since it can only accept accept a RESET according to its node FSM. 5: After four lost STATE messages node A will reset its link and start repeatedly sending out RESET messages to B. 6: Because of the reduced send rate for RESET messages, it is very likely that A will receive an ACTIVATE (which is sent out at a much higher frequency) before it gets the chance to send a RESET, and A may hence quickly move back to state ESTABLISHED and continue sending out STATE messages, which will again be dropped by B. 7: GOTO 5. 8: After having repeated the cycle 5-7 a number of times, node A will by chance get in between with sending a RESET, and the situation is resolved. Unfortunately, we have seen that it may take a substantial amount of time before this vicious loop is broken, sometimes in the order of minutes. We correct this by making a small correction to the node FSM: When a node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't need to wait for RESET confirmation from of an endpoint that we alread know has been reset. It also means that node B in the scenario above will not be dropping incoming STATE messages, and the link can come up immediately. Finally, a symmetry comparison reveals that the FSM has a similar error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING. Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect of this logical error, we choose fix this one, too. The node FSM looks as follows after those changes: +----------------------------------------+ \| PEER_DOWN_EVT\| \| \| +------------------------+----------------+ \| \|SELF_DOWN_EVT \| \| \| \| \| \| \| \| +-----------+ +-----------+ \| \| \|NODE_ \| \|NODE_ \| \| \| +----------\|FAILINGOVER\|<---------\|SYNCHING \|-----------+ \| \| \|SELF_ +-----------+ FAILOVER_+-----------+ PEER_ \| \| \| \|DOWN_EVT \| A BEGIN_EVT A \| DOWN_EVT\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|FAILOVER_ \|FAILOVER_ \|SYNCH_ \|SYNCH_ \| \| \| \| \|END_EVT \|BEGIN_EVT \|BEGIN_EVT\|END_EVT \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| +--------------+ \| \| \| \| \| +-------->\| SELF_UP_ \|<-------+ \| \| \| \| +-----------------\| PEER_UP \|----------------+ \| \| \| \| \|SELF_DOWN_EVT +--------------+ PEER_DOWN_EVT\| \| \| \| \| \| A A \| \| \| \| \| \| \| \| \| \| \| \| \| \| PEER_UP_EVT\| \|SELF_UP_EVT \| \| \| \| \| \| \| \| \| \| \| V V V \| \| V V V +------------+ +-----------+ +-----------+ +------------+ \|SELF_DOWN_ \| \|SELF_UP_ \| \|PEER_UP_ \| \|PEER_DOWN \| \|PEER_LEAVING\| \|PEER_COMING\| \|SELF_COMING\| \|SELF_LEAVING\| +------------+ +-----------+ +-----------+ +------------+ \| \| A A \| \| \| \| \| \| \| \| \| SELF_ \| \|SELF_ \|PEER_ \|PEER_ \| \| DOWN_EVT\| \|UP_EVT \|UP_EVT \|DOWN_EVT \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| +--------------+ \| \| \|PEER_DOWN_EVT +--->\| SELF_DOWN_ \|<---+ SELF_DOWN_EVT\| +------------------->\| PEER_DOWN \|<--------------------+ +--------------+ Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	Merge branch 'dsa-misc-improvements'	David S. Miller	7	-97/+230
	Florian Fainelli says: ==================== net: dsa: misc improvements This patch series builds on top of Andrew's "New DSA bind, switches as devices" patch set and does the following: - add a few helper functions/goodies for net/dsa/dsa2.c to be as close as possible from net/dsa/dsa.c in terms of what drivers can expect, in particular the slave MDIO bus and the enabled_port_mask and phy_mii_mask - fix the CPU port ethtools ops to work in a multiple tree setup since we can no longer assume a single tree is supported - make the bcm_sf2 driver register its own MDIO bus, yet assign it to ds->slave_mii_bus for everything to work in net/dsa/slave.c wrt. PHY probing, this is a tad cleaner than what we have now Changes in v2: Most of the previous patches have been dropped to just keep the relevant ones now. Changes in v3: - split the addition of the slave MII bus as a separate patch - properly unwind all operations at the right place and right time (ethtool ops, slave MDIO bus - fixed a few typos here and there Changes in v4: - removed superfluous dst agrument to dsa_cpu_port_ethtool_{setup,restore} ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: dsa: bcm_sf2: Register our slave MDIO bus	Florian Fainelli	2	-81/+140
	Register a slave MDIO bus which allows us to divert problematic read/writes towards conflicting pseudo-PHY address (30). Do no longer rely on DSA's slave_mii_bus, but instead provide our own implementation which offers more flexibility as to what to do, and when to register it. We need to register it by the time we are able to get access to our memory mapped registers, which is not until drv->setup() time. In order to avoid forward declarations, we need to re-order the function bodies a bit. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: dsa: Initialize CPU port ethtool ops per tree	Florian Fainelli	5	-10/+50
	Now that we can properly support multiple distinct trees in the system, using a global variable: dsa_cpu_port_ethtool_ops is getting clobbered as soon as the second switch tree gets probed, and we don't want that. We need to move this to be dynamically allocated, and since we can't really be comparing addresses anymore to determine first time initialization versus any other times, just move this to dsa.c and dsa2.c where the remainder of the dst/ds initialization happens. The operations teardown restores the master netdev's ethtool_ops to its original ethtool_ops pointer (typically within the Ethernet driver) Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: dsa: Add initialization helper for CPU port ethtool_ops	Florian Fainelli	2	-6/+9
	Add a helper function: dsa_cpu_port_ethtool_init() which initializes a custom ethtool_ops structure with custom DSA ethtool operations for CPU ports. This is a preliminary change to move the initialization outside of net/dsa/slave.c. Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: dsa: Provide a slave MII bus if needed	Florian Fainelli	1	-0/+15
	Mimic what net/dsa/dsa.c does and provide a slave MII bus by default which will be created if the driver implements a phy_read method. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: dsa: Initialize ds->enabled_port_mask and ds->phys_mii_mask	Florian Fainelli	1	-0/+15
	Some drivers rely on these two bitmasks to contain the correct values for them to successfully probe and initialize at drv->setup() time, calculate correct values to put in both masks as early as possible in dsa_get_ports_dn(). Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: dsa: Provide unique DSA slave MII bus names	Florian Fainelli	1	-1/+2
	In case we have multiples trees and switches with the same index, we need to add another discriminating id: the switch tree. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: sched: fix missing doc annotations	Eric Dumazet	2	-1/+3
	"make htmldocs" complains otherwise: .//net/core/gen_stats.c:168: warning: No description found for parameter 'running' .//include/linux/netdevice.h:1867: warning: No description found for parameter 'qdisc_running_key' Fixes: f9eb8aea2a1e ("net_sched: transform qdisc running bit into a seqcount") Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: Reduce queue allocation to one in kdump kernel	Hariprasad Shenai	1	-1/+3
	When in kdump kernel, reduce memory usage by only using a single Queue Set for multiqueue devices. So make netif_get_num_default_rss_queues() return one, when in kdump kernel. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	Merge branch 'qed-dcbnl'	David S. Miller	10	-1/+2168
	Sudarsana Reddy Kalluru says: ==================== qed/qede support for dcbnl. This series adds the dcbnl functionality to the driver. Patch (1) adds the qed infrastucture for querying/configuring the dcbx parameters. Patch (2) adds the qed infrastructure for dcbnl APIs. And patch (3) adds the qede support for dcbnl. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	qede: Add dcbnl support.	Sudarsana Reddy Kalluru	4	-0/+356
	This patch adds the interfaces for ieee/cee dcbnl callbacks and registers them with the kernel. Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	qed: Add dcbnl support.	Sudarsana Reddy Kalluru	3	-0/+1159
	This patch adds the implementation for both cee/ieee dcbnl callbacks by using the qed query/config APIs. Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	qed: Add support for query/config dcbx.	Sudarsana Reddy Kalluru	4	-1/+653
	Query API reads the dcbx data from the device shared memory and return it to the caller. The config API configures the user provided dcbx values on the device, and initiates the dcbx negotiation with the peer. Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	fsl/qe: Do not prefix header guard with CONFIG_	Andreas Ziegler	1	-2/+2
	The CONFIG_ prefix should only be used for options which can be configured through Kconfig and not for guarding headers. Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	drivers/net/fsl_ucc: Do not prefix header guard with CONFIG_	Andreas Ziegler	1	-2/+2
	The CONFIG_ prefix should only be used for options which can be configured through Kconfig and not for guarding headers. Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	ila: Perform only one translation in forwarding path	Tom Herbert	4	-9/+12
	When setting up ILA in a router we noticed that the the encapsulation is invoked twice: once in the route input path and again upon route output. To resolve this we add a flag set_csum_neutral for the ila_update_ipv6_locator. If this flag is set and the checksum neutral bit is also set we assume that checksum-neutral translation has already been performed and take no further action. The flag is set only in ila_output path. The flag is not set for ila_input and ila_xlat. Tested: Used 3 netns to set to emulate a router and two hosts. The router translates SIR addresses between the two destinations in other two netns. Verified ping and netperf are functional. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	tcp: accept RST if SEQ matches right edge of right-most SACK block	Pau Espin Pedrol	1	-3/+23
	RFC 5961 advises to only accept RST packets containing a seq number matching the next expected seq number instead of the whole receive window in order to avoid spoofing attacks. However, this situation is not optimal in the case SACK is in use at the time the RST is sent. I recently run into a scenario in which packet losses were high while uploading data to a server, and userspace was willing to frequently terminate connections by sending a RST. In this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting for a lost packet retransmission and SACK blocks are used to let the client continue uploading data. At some point later on, the client sends the RST (snd_nxt), which matches the next expected seq number of the right-most SACK block on the receiver side which is going forward receiving data. In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the frozen main ACK at receiver side and thus gets dropped and a challenge ACK is sent, which gets usually lost due to network conditions. The main consequence is that the connection stays alive for a while even if it made sense to accept the RST. This can get really bad if lots of connections like this one are created in few seconds, allocating all the resources of the server easily. For security reasons, not all SACK blocks are checked (there could be a big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it wouldn't make sense to check for RST in blocks other than the right-most received one because the sender is not expected to be sending new data after the RST. For simplicity, only up to the 4 most recently updated SACK blocks (selective_acks[4] field) are compared to find the right-most block, as usually those are the ones with bigger probability to contain it. This patch was tested in a 3.18 kernel and probed to improve the situation in the scenario described above. Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	qed: potential overflow in qed_cxt_src_t2_alloc()	Dan Carpenter	1	-1/+1
	In the current code "ent_per_page" could be more than "conn_num" making "conn_num" negative after the subtraction. In the next iteration through the loop then the negative is treated as a very high positive meaning we don't put a limit on "ent_num". It could lead to memory corruption. Fixes: dbb799c39717 ('qed: Initialize hardware for new protocols') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	Merge branch 'vrf-local'	David S. Miller	2	-33/+202
	David Ahern says: ==================== net: vrf: Add support for local traffic to local addresses Add support for locally originated traffic to VRF-local addresses, be it addresses on enslaved devices or addresses on the VRF device: $ ip addr show dev red 33: red: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc pfifo_fast state UP group default qlen 1000 link/ether be:00:53:b5:e4:25 brd ff:ff:ff:ff:ff:ff inet 1.1.1.1/32 scope global red valid_lft forever preferred_lft forever inet6 1111:1::1/128 scope global valid_lft forever preferred_lft forever $ ip addr show dev eth1 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:e0:f9:79:34:bd brd ff:ff:ff:ff:ff:ff inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1 valid_lft forever preferred_lft forever inet6 2100:1::1/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe79:34bd/64 scope link valid_lft forever preferred_lft forever $ ping -c1 -I red 10.100.1.1 ping: Warning: source address might be selected on device other than red. PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data. 64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms $ ping -c1 -I red 1.1.1.1 PING 1.1.1.1 (1.1.1.1) from 1.1.1.1 red: 56(84) bytes of data. 64 bytes from 1.1.1.1: icmp_seq=1 ttl=64 time=0.136 ms --- 1.1.1.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.136/0.136/0.136/0.000 ms $ ping6 -c1 -I red 2100:1::1 ping6: Warning: source address might be selected on device other than red. PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes 64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.167 ms --- 2100:1::1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.167/0.167/0.167/0.000 ms $ ping6 -c1 -I red 1111::1 PING 1111::1(1111::1) from 1111:1::1 red: 56 data bytes 64 bytes from 1111::1: icmp_seq=1 ttl=64 time=0.187 ms --- 1111::1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.187/0.187/0.187/0.000 ms This change also enables use of loopback address on the VRF device: $ ip addr add dev red 127.0.0.1/8 $ ping -c1 -I red 127.0.0.1 PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: vrf: ipv6 support for local traffic to local addresses	David Ahern	2	-4/+86
	Add support for locally originated traffic to VRF-local IPv6 addresses. Similar to IPv4 a local dst is set on the skb and the packet is reinserted with a call to netif_rx. With this patch, ping, tcp and udp packets to a local IPv6 address are successfully routed: $ ip addr show dev eth1 4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1 valid_lft forever preferred_lft forever inet6 2100:1::1/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe1c:b974/64 scope link valid_lft forever preferred_lft forever $ ping6 -c1 -I red 2100:1::1 ping6: Warning: source address might be selected on device other than red. PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes 64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms ip6_input is exported so the VRF driver can use it for the dst input function. The dst_alloc function for IPv4 defaults to setting the input and output functions; IPv6's does not. VRF does not need to duplicate the Rx path so just export the ipv6 input function. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: vrf: ipv4 support for local traffic to local addresses	David Ahern	1	-2/+98
	Add support for locally originated traffic to VRF-local addresses. If destination device for an skb is the loopback or VRF device then set its dst to a local version of the VRF cached dst_entry and call netif_rx to insert the packet onto the rx queue - similar to what is done for loopback. This patch handles IPv4 support; follow on patch handles IPv6. With this patch, ping, tcp and udp packets to a local IPv4 address are successfully routed: $ ip addr show dev eth1 4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1 valid_lft forever preferred_lft forever inet6 2100:1::1/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe1c:b974/64 scope link valid_lft forever preferred_lft forever $ ping -c1 -I red 10.100.1.1 ping: Warning: source address might be selected on device other than red. PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data. 64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms This patch also enables use of IPv4 loopback address on the VRF device: $ ip addr add dev red 127.0.0.1/8 $ ping -c1 -I red 127.0.0.1 PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08	net: vrf: Minor refactoring for local address patches	David Ahern	1	-27/+18
	Move the stripping of the ethernet header from is_ip_tx_frame into the ipv4 and ipv6 outbound functions and collapse vrf_send_v4_prep into vrf_process_v4_outbound. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	gue: Implement direction IP encapsulation	Tom Herbert	1	-5/+76
	This patch implements direct encapsulation of IPv4 and IPv6 packets in UDP. This is done a version "1" of GUE and as explained in I-D draft-ietf-nvo3-gue-03. Changes here are only in the receive path, fou with IPxIPx already supports the transmit side. Both the normal receive path and GRO path are modified to check for GUE version and check for IP version in the case that GUE version is "1". Tested: IPIP with direct GUE encap 1 TCP_STREAM 4530 Mbps 200 TCP_RR 1297625 tps 135/232/444 90/95/99% latencies IP4IP6 with direct GUE encap 1 TCP_STREAM 4903 Mbps 200 TCP_RR 1184481 tps 149/253/473 90/95/99% latencies IP6IP6 direct GUE encap 1 TCP_STREAM 5146 Mbps 200 TCP_RR 1202879 tps 146/251/472 90/95/99% latencies SIT with direct GUE encap 1 TCP_STREAM 6111 Mbps 200 TCP_RR 1250337 tps 139/241/467 90/95/99% latencies Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	Merge branch 'net-sched-fast-stats'	David S. Miller	29	-85/+158
	Eric Dumazet says: ==================== net: sched: faster stats gathering A while back, I sent one RFC patch using lockless stats gathering on 64bit arches. This patch series does it more cleanly, using a seqcount. Since qdisc/class stats are written at dequeue() time, we can ask the dequeue to change the seqcount, so that stats readers can avoid taking the root qdisc lock, and instead the typical read_seqcount_{begin\|retry} guarded loop. This does not change fast path costs, as the seqcount increments are not more expensive than the bit manipulation, and allows readers to not freeze the fast path anymore. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	net: sched: do not acquire qdisc spinlock in qdisc/class stats dump	Eric Dumazet	20	-69/+126
	Large tc dumps (tc -s {qdisc\|class} sh dev ethX) done by Google BwE host agent [1] are problematic at scale : For each qdisc/class found in the dump, we currently lock the root qdisc spinlock in order to get stats. Sampling stats every 5 seconds from thousands of HTB classes is a challenge when the root qdisc spinlock is under high pressure. Not only the dumps take time, they also slow down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases. An audit of existing qdiscs showed that sch_fq_codel is the only qdisc that might need the qdisc lock in fq_codel_dump_stats() and fq_codel_dump_class_stats() In v2 of this patch, I now use the Qdisc running seqcount to provide consistent reads of packets/bytes counters, regardless of 32/64 bit arches. I also changed rate estimators to use the same infrastructure so that they no longer need to lock root qdisc lock. [1] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kevin Athey <kda@google.com> Cc: Xiaotian Pei <xiaotian@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	net_sched: transform qdisc running bit into a seqcount	Eric Dumazet	10	-16/+32
	Instead of using a single bit (__QDISC___STATE_RUNNING) in sch->__state, use a seqcount. This adds lockdep support, but more importantly it will allow us to sample qdisc/class statistics without having to grab qdisc root lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	Merge branch 'be2net-noncrit-fixes'	David S. Miller	5	-159/+256
	Sathya Perla says: ==================== be2net: patch set Hi David, the following patch set contains three non-critical fixes that can go into the net-next tree. Patch 1 fixes the logic for provisioning queue pairs on VFs to take into account the limit on number of TXQs too as in some profiles the number of TXQs is less than that of RXQs. Patch 2 enables WoL support from shutdown on Skyhawk. Patch 3 enhances the logic for provisioning queue pairs on VFs on SR-IOV over multi-partition configs. Each PF (partition) on a port has to compute the number of RSS tables it's VFs can use. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	be2net: Fix provisioning of RSS for VFs in multi-partition configurations	Somnath Kotur	4	-30/+130
	Currently, we do not distribute queue resources to enable RSS for VFs in multi-channel/partition configurations. Fix this by having each PF(SRIOV capable) calculate it's share of the 15 RSS Policy Tables available per port before provisioning resources for all the VFs. This proportional share calculation is done based on division of the PF's MAX VFs with the Total MAX VFs on that port. It also needs to learn about the no: of NIC PFs on the port and subtract that from the 15 RSS Policy Tables on the port. Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	be2net: Enable Wake-On-LAN from shutdown for Skyhawk	Sriharsha Basavapatna	4	-47/+39
	Skyhawk does support wake-up from ACPI shutdown state - S5, provided the platform supports it (like Auxiliary power source etc). The changes listed below are done to fix this. 1) There's no need to defer the HW configuration of WOL to be_suspend(). Remove this in be_suspend() and move it to be_set_wol() ethtool function so it is configured directly in the context of ethtool. This automatically takes care of the shutdown case. 2) The driver incorrectly uses WOL_CAP field in the FW response to get_acpi_wol_cap() command, to determine if WOL is enabled. Instead the driver must rely on the macaddr field in the response to infer WOL state. 3) In be_get_config() during init, if we find that WOL is enabled in FW, call pci_enable_wake() to enable pmcsr.pme_en bit. This is needed to support persistent WOL configuration provided by the FW in some platforms. 4) Remove code in be_set_wol() that writes to PCICFG_PM_CONTROL_OFFSET to set pme_en bit; pci_enable_wake() sets that. Fixes: 028991e49 ("Enabling Wake-on-LAN is not supported in S5 state") Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Sathya Perla <sathya.perla@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	be2net: use max-TXQs limit too while provisioning VF queue pairs	Suresh Reddy	4	-85/+90
	When the PF driver provisions resources for VFs, it currently only looks at max RSS queues available to calculate the number of VF queue pairs. This logic breaks when there are less number of TX-queues than RSS-queues. This patch fixes this problem by using the max-TXQs available in the PF-pool in the calculations. As a part of this change the be_calculate_vf_qs() routine is renamed as be_calculate_vf_res() and the code that calculates limits on other related resources is moved here to contain all resource calculation code inside one routine. Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: Sathya Perla <sathya.perla@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	drivers/net: support hdlc function for QE-UCC	Zhao Qiang	7	-2/+1379
	The driver add hdlc support for Freescale QUICC Engine. It support NMSI and TSA mode. Signed-off-by: Zhao Qiang <qiang.zhao@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	fsl/qe: Add QE TDM lib	Zhao Qiang	5	-5/+377
	QE has module to support TDM, some other protocols supported by QE are based on TDM. add a qe-tdm lib, this lib provides functions to the protocols using TDM to configurate QE-TDM. Signed-off-by: Zhao Qiang <qiang.zhao@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07	fsl/qe: Make regs resouce_size_t	Zhao Qiang	1	-1/+1
	Signed-off-by: Zhao Qiang <qiang.zhao@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>