summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2015-05-12switchdev: remove NETIF_F_HW_SWITCH_OFFLOAD feature flagScott Feldman5-12/+4
Roopa said remove the feature flag for this series and she'll work on bringing it back if needed at a later date. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: convert fib_ipv4_add/del over to switchdev_port_obj_add/delScott Feldman3-56/+55
The IPv4 FIB ops convert nicely to the switchdev objs and we're left with only four switchdev ops: port get/set and port add/del. Other objs will follow, such as FDB. So go ahead and convert IPv4 FIB over to switchdev obj for consistency, anticipating more objs to come. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: cut over to new switchdev_port_bridge_getlinkScott Feldman3-14/+3
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: add new switchdev_port_bridge_getlinkScott Feldman2-0/+38
Like bridge_setlink, add switchdev wrapper to handle bridge_getlink and call into port driver to get port attrs. For now, only BR_LEARNING and BR_LEARNING_SYNC are returned. To add more, we'll probably want to break away from ndo_dflt_bridge_getlink() and build the netlink skb directly in the switchdev code. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12bridge: revert br_dellink change back to originalScott Feldman1-10/+1
This is revert of: commit 68e331c785b8 ("bridge: offload bridge port attributes to switch asic if feature flag set") Restore br_dellink back to original and don't call into SELF port driver. rtnetlink.c:bridge_dellink() already does a call into port driver for SELF. bridge vlan add/del cmd defaults to MASTER. From man page for bridge vlan add/del cmd: self the vlan is configured on the specified physical device. Required if the device is the bridge device. master the vlan is configured on the software bridge (default). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: remove unused switchdev_port_bridge_dellinkScott Feldman2-45/+0
Now we can remove old wrappers for dellink. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: cut over to new switchdev_port_bridge_dellinkScott Feldman3-2/+3
Rocker, bonding and team and switch over to the new switchdev_port_bridge_dellink to avoid duplicating code in each driver. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: add new switchdev_port_bridge_dellinkScott Feldman2-12/+18
Same change as setlink. Provide the wrapper op for SELF ndo_bridge_dellink and call into the switchdev driver to delete afspec VLANs. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12bridge: restore br_setlink back to originalScott Feldman1-10/+1
This is revert of: commit 68e331c785b8 ("bridge: offload bridge port attributes to switch asic if feature flag set") Restore br_setlink back to original and don't call into SELF port driver. rtnetlink.c:bridge_setlink() already does a call into port driver for SELF. bridge set link cmd defaults to MASTER. From man page for bridge link set cmd: self link setting is configured on specified physical device master link setting is configured on the software bridge (default) The link setting has two values: the device-side value and the software bridge-side value. These are independent and settable using the bridge link set cmd by specifying some combination of [master] | [self]. Furthermore, the device-side and bridge-side settings have their own initial value, viewable from bridge -d link show cmd. Restoring br_setlink back to original makes rocker (the only in-kernel user of SELF link settings) work as first implement: two-sided values. It's true that when both MASTER and SELF are specified from the command, two netlink notifications are generated, one for each side of the settings. The user-space app can distiquish between the two notifications by observing the MASTER or SELF flag. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: remove old switchdev_port_bridge_setlinkScott Feldman2-39/+0
New attr-based bridge_setlink can recurse lower devs and recover on err, so remove old wrapper (including ndo_dflt_switchdev_port_bridge_setlink). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: cut over to new switchdev_port_bridge_setlinkScott Feldman3-43/+3
Rocker, bonding, and team can now use the switchdev bridge setlink to parse raw netlink; no need to duplicate this code in each driver. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: add new switchdev bridge setlinkScott Feldman1-11/+140
Add new switchdev_port_bridge_setlink that can be used by drivers implementing .ndo_bridge_setlink to set switchdev bridge attributes. Basically turn the raw rtnl_bridge_setlink netlink into switchdev attr sets. Proper netlink attr policy checking is done on the protinfo part of the netlink msg. Currently, for protinfo, only bridge port attrs BR_LEARNING and BR_LEARNING_SYNC are parsed and passed to port driver. For afspec, VLAN objs are passed so switchdev driver can set VLANs assigned to SELF. To illustrate with iproute2 cmd, we have: bridge vlan add vid 10 dev sw1p1 self master To add VLAN 10 to port sw1p1 for both the bridge (master) and the device (self). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: add bridge port flags attrScott Feldman2-0/+27
rocker: use switchdev get/set attr for bridge port flags Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12rocker: use switchdev add/del obj for bridge port vlansScott Feldman1-7/+123
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: add port vlan objScott Feldman1-0/+8
VLAN obj has flags (PVID and untagged) as well as start and end vid ranges. The switchdev driver can optimize programing the device using the ranges. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: introduce switchdev add/del obj opsScott Feldman2-0/+138
Like switchdev attr get/set, add new switchdev obj add/del. switchdev objs will be things like VLANs or FIB entries, so add/del fits better for objects than get/set used for attributes. Use same two-phase prepare-commit transaction model as in attr set. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: convert STP update to switchdev attr setScott Feldman5-50/+31
STP update is just a settable port attribute, so convert switchdev_port_stp_update to an attr set. For DSA, the prepare phase is skipped and STP updates are only done in the commit phase. This is because currently the DSA drivers don't need to allocate any memory for STP updates and the STP update will not fail to HW (unless something horrible goes wrong on the MDIO bus, in which case the prepare phase wouldn't have been able to predict anyway). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12rocker: support prepare-commit transaction modelScott Feldman1-249/+451
For rocker, support prepare-commit transaction model for setting attributes (and for adding objects). This requires rocker to preallocate memory needed for the commit up front in the prepare phase. Since rtnl_lock is held between prepare-commit, store the allocated memory on a queue hanging off of the rocker_port. Also, in prepare phase, do everything right up to calling into HW. The same code paths are tranversed in the driver for both prepare and commit phases. In some cases, any state modified in the prepare phase must be reverted before returning so the commit phase makes the same decisions. As a consequence of holding rtnl_lock in process context for all attr sets (and obj adds), all memory is GFP_KERNEL allocated and we don't need to busy spin waiting for the device to complete the command. So the bulk of this patch is simplifying the memory allocations to only use GFP_KERNEL and to remove the nowait flag and busy spin loop. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: convert parent_id_get to switchdev attr getScott Feldman6-57/+51
Switch ID is just a gettable port attribute. Convert switchdev op switchdev_parent_id_get to a switchdev attr. Note: for sysfs and netlink interfaces, SWITCHDEV_ATTR_PORT_PARENT_ID is called with SWITCHDEV_F_NO_RECUSE to limit switch ID user-visiblity to only port netdevs. So when a port is stacked under bond/bridge, the user can only query switch id via the switch ports, but not via the upper devices Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: introduce get/set attrs opsScott Feldman2-0/+212
Add two new swdev ops for get/set switch port attributes. Most swdev interactions on a port are gets or sets on port attributes, so rather than adding ops for each attribute, let's define clean get/set ops for all attributes, and then we can have clear, consistent rules on how attributes propagate on stacked devs. Add the basic algorithms for get/set attr ops. Use the same recusive algo to walk lower devs we've used for STP updates, for example. For get, compare attr value for each lower dev and only return success if attr values match across all lower devs. For sets, set the same attr value for all lower devs. We'll use a two-phase prepare-commit transaction model for sets. In the first phase, the driver(s) are asked if attr set is OK. If all OK, the commit attr set in second phase. A driver would NACK the prepare phase if it can't set the attr due to lack of resources or support, within it's control. RTNL lock must be held across both phases because we'll recurse all lower devs first in prepare phase, and then recurse all lower devs again in commit phase. If any lower dev fails the prepare phase, we need to abort the transaction for all lower devs. If lower dev recusion isn't desired, allow a flag SWITCHDEV_F_NO_RECURSE to indicate get/set only work on port (lowest) device. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: s/swdev_/switchdev_/Jiri Pirko5-58/+59
Turned out that "switchdev" sticks. So just unify all related terms to use this prefix. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12switchdev: s/netdev_switch_/switchdev_/ and s/NETDEV_SWITCH_/SWITCHDEV_/Jiri Pirko11-171/+166
Turned out that "switchdev" sticks. So just unify all related terms to use this prefix. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12net_sched: gred: add TCA_GRED_LIMIT attributeDavid Ward2-5/+26
In a GRED qdisc, if the default "virtual queue" (VQ) does not have drop parameters configured, then packets for the default VQ are not subjected to RED and are only dropped if the queue is larger than the net_device's tx_queue_len. This behavior is useful for WRED mode, since these packets will still influence the calculated average queue length and (therefore) the drop probability for all of the other VQs. However, for some drivers tx_queue_len is zero. In other cases the user may wish to make the limit the same for all VQs (including the default VQ with no drop parameters). This change adds a TCA_GRED_LIMIT attribute to set the GRED queue limit, in bytes, during qdisc setup. (This limit is in bytes to be consistent with the drop parameters.) The default limit is the same as for a bfifo queue (tx_queue_len * psched_mtu). If the drop parameters of any VQ are configured with a smaller limit than the GRED queue limit, that VQ will still observe the smaller limit instead. Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12ARM: net: add JIT support for loads from struct seccomp_data.Nicolas Schichan1-0/+10
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12Merge branch 'netdev_page_frags'David S. Miller12-171/+223
Alexander Duyck says: ==================== Refactor netdev page frags and move them into mm/ This patch series addresses several things. First I found an issue in the performance of the pfmemalloc check from build_skb. To work around it I have provided a cached copy of pfmemalloc to be used in __netdev_alloc_skb and __napi_alloc_skb. Second I moved the page fragment allocation logic into the mm tree and added functionality for freeing page fragments. I had to fix igb before I could do this as it was using a reference to NETDEV_FRAG_PAGE_MAX_SIZE incorrectly. Finally I went through and replaced all of the duplicate code that was calling put_page and replaced it with calls to skb_free_frag. With these changes in place a simple receive and drop test increased from a packet rate of 8.9Mpps to 9.8Mpps. The gains breakdown as follows: 8.9Mpps Before 9.8Mpps After ------------------------ ------------------------ 7.8% put_compound_page 9.1% __free_page_frag 3.9% skb_free_head 1.1% put_page 4.9% build_skb 3.8% __napi_alloc_skb 2.5% __alloc_rx_skb 1.9% __napi_alloc_skb ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12bnx2x, tg3: Replace put_page(virt_to_head_page()) with skb_free_frag()Alexander Duyck2-2/+2
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12hisilicon: Replace put_page(virt_to_head_page()) with skb_free_frag()Alexander Duyck1-1/+1
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12e1000: Replace e1000_free_frag with skb_free_fragAlexander Duyck1-12/+7
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12mvneta: Replace put_page(virt_to_head_page(ptr)) w/ skb_free_fragAlexander Duyck1-1/+1
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12netcp: Replace put_page(virt_to_head_page(ptr)) w/ skb_free_fragAlexander Duyck1-1/+1
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12net: Add skb_free_frag to replace use of put_page in freeing skb->headAlexander Duyck2-4/+11
This change adds a function called skb_free_frag which is meant to compliment the function netdev_alloc_frag. The general idea is to enable a more lightweight version of page freeing since we don't actually need all the overhead of a put_page, and we don't quite fit the model of __free_pages. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12mm/net: Rename and move page fragment handling from net/ to mm/Alexander Duyck5-97/+127
This change moves the __alloc_page_frag functionality out of the networking stack and into the page allocation portion of mm. The idea it so help make this maintainable by placing it with other page allocation functions. Since we are moving it from skbuff.c to page_alloc.c I have also renamed the basic defines and structure from netdev_alloc_cache to page_frag_cache to reflect that this is now part of a different kernel subsystem. I have also added a simple __free_page_frag function which can handle freeing the frags based on the skb->head pointer. The model for this is based off of __free_pages since we don't actually need to deal with all of the cases that put_page handles. I incorporated the virt_to_head_page call and compound_order into the function as it actually allows for a signficant size reduction by reducing code duplication. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12net: Store virtual address instead of page in netdev_alloc_cacheAlexander Duyck2-26/+34
This change makes it so that we store the virtual address of the page in the netdev_alloc_cache instead of the page pointer. The idea behind this is to avoid multiple calls to page_address since the virtual address is required for every access, but the page pointer is only needed at allocation or reset of the page. While I was at it I also reordered the netdev_alloc_cache structure a bit so that the size is always 16 bytes by dropping size in the case where PAGE_SIZE is greater than or equal to 32KB. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12igb: Don't use NETDEV_FRAG_PAGE_MAX_SIZE in descriptor calculationAlexander Duyck1-8/+3
This change updates igb so that it will correctly perform the descriptor count calculation. Previously it was taking NETDEV_FRAG_PAGE_MAX_SIZE into account with isn't really correct since a different value is used to determine the size of the pages used for TCP. That is actually determined by SKB_FRAG_PAGE_ORDER. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12net: Use cached copy of pfmemalloc to avoid accessing pageAlexander Duyck1-62/+79
While testing I found that the testing for pfmemalloc in build_skb was rather expensive. I found the issue to be two-fold. First we have to get from the virtual address to the head page and that comes at the cost of something like 11 cycles. Then there is the cost for reading pfmemalloc out of the head page which can be cache cold due to the fact that put_page_testzero is likely invalidating the cache-line on one or more CPUs as the fragments can be shared. To avoid this extra expense I have added a pfmemalloc member to the netdev_alloc_cache. I then pushed pieces of __alloc_rx_skb into __napi_alloc_skb and __netdev_alloc_skb so that I could rewrite them to make use of the cached pfmemalloc value. The result is that my perf traces show a reduction from 9.28% overhead to 3.7% for the code covered by build_skb, __alloc_rx_skb, and __napi_alloc_skb when performing a test with the packet being dropped instead of being handed to napi_gro_receive. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: sched: deprecate enqueue_root()Eric Dumazet2-8/+2
Only left enqueue_root() user is netem, and it looks not necessary : qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: ll_temac: Use one return statement instead of twoMichal Simek1-3/+1
Use one return statement instead of two to simplify the code. Both are returning the same value. Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: fec: add support of ethtool get_regsPhilippe Reynes1-0/+78
This enables the ethtool's "-d" and "--register-dump" options for fec devices. Signed-off-by: Philippe Reynes <tremyfr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: sched: fix typo in net_device ifdefDaniel Borkmann1-1/+1
This should have been #ifdef not #if. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Fixes: d2788d34885d ("net: sched: further simplify handle_ing") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11Merge branch 'handle_ing_lightweight'David S. Miller3-77/+33
Daniel Borkmann says: ==================== handle_ing update These are a couple of cleanups to make ingress a bit more lightweight. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: sched: further simplify handle_ingDaniel Borkmann3-61/+31
Ingress qdisc has no other purpose than calling into tc_classify() that executes attached classifier(s) and action(s). It has a 1:1 relationship to dev->ingress_queue. After having commit 087c1a601ad7 ("net: sched: run ingress qdisc without locks") removed the central ingress lock, one major contention point is gone. The extra indirection layers however, are not necessary for calling into ingress qdisc. pktgen calling locally into netif_receive_skb() with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps. We can redirect the private classifier list to the netdev directly, without changing any classifier API bits (!) and execute on that from handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed, ingress qdisc doesn't have a queue and thus dev_deactivate_queue() is also not applicable, ingress_cl_list provides similar behaviour. In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc. One next possible step is the removal of the dev's ingress (dummy) netdev_queue, and to only have the list member in the netdevice itself. Note, the filter chain is RCU protected and individual filter elements are being kfree'd by sched subsystem after RCU grace period. RCU read lock is being held by __netif_receive_skb_core(). Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: sched: consolidate handle_ing and ing_filterDaniel Borkmann1-30/+16
Given quite some code has been removed from ing_filter(), we can just consolidate that function into handle_ing() and get rid of a few instructions at the same time. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11test: bpf: extend "load 64-bit immediate" testcaseXi Wang1-1/+2
Extend the testcase to catch a signedness bug in the arm64 JIT: test_bpf: #58 load 64-bit immediate jited:1 ret -1 != 1 FAIL (1 times) This is useful to ensure other JITs won't have a similar bug. Link: https://lkml.org/lkml/2015/5/8/458 Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Xi Wang <xi.wang@gmail.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11Merge branch 'bonding_netlink_lacp'David S. Miller10-9/+328
Jonathan Toppins says: ==================== add netlink support for new lacp bonding parameters This is a resubmit of Mahesh's last 3 bonding patches from this series (http://marc.info/?l=linux-netdev&m=142432864626179&w=2) with one additional kernel patch which adds the netlink bits. I have noted any modifications I did to the original patches just above my signoff line. Patch 5 is the iproute2 support for these bonding options. All patches were coded against the net-next branch of their respective projects. v2: * rebased * only send these new parameters via netlink when bond is in mode 4 * fixed ad_actor_sys_prio to be 0xFFFF by default even when the bond is initially created in mode 0 and switched to mode 4 v3: * reverted changes to bond_option_ad_actor_system_set() from v1 in Mahesh's patch "bonding: Allow userspace to set actors' macaddr in an AD-system." Instead implementing all setting in the option specific set function as Nik suggested. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11bonding: add netlink support for sys prio, actor sys mac, and port keyAndy Gospodarek3-9/+74
Adds netlink support for the following bonding options: * BOND_OPT_AD_ACTOR_SYS_PRIO * BOND_OPT_AD_ACTOR_SYSTEM * BOND_OPT_AD_USER_PORT_KEY When setting the actor system mac address we assume the netlink message contains a binary mac and not a string representation of a mac. Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> [jt: completed the setting side of the netlink attributes] Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11bonding: Implement user key part of port_key in an AD system.Mahesh Bandewar7-7/+123
The port key has three components - user-key, speed-part, and duplex-part. The LSBit is for the duplex-part, next 5 bits are for the speed while the remaining 10 bits are the user defined key bits. Get these 10 bits from the user-space (through the SysFs interface) and use it to form the admin port-key. Allowed range for the user-key is 0 - 1023 (10 bits). If it is not provided then use zero for the user-key-bits (default). It can set using following example code - # modprobe bonding mode=4 # usr_port_key=$(( RANDOM & 0x3FF )) # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key # echo +eth1 > /sys/class/net/bond0/bonding/slaves ... # ip link set bond0 up Signed-off-by: Mahesh Bandewar <maheshb@google.com> Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com> [jt: * fixed up style issues reported by checkpatch * fixed up context from change in ad_actor_sys_prio patch] Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11bonding: Allow userspace to set actors' macaddr in an AD-system.Mahesh Bandewar8-1/+70
In an AD system, the communication between actor and partner is the business between these two entities. In the current setup anyone on the same L2 can "guess" the LACPDU contents and then possibly send the spoofed LACPDUs and trick the partner causing connectivity issues for the AD system. This patch allows to use a random mac-address obscuring it's identity making it harder for someone in the L2 is do the same thing. This patch allows user-space to choose the mac-address for the AD-system. This mac-address can not be NULL or a Multicast. If the mac-address is set from user-space; kernel will honor it and will not overwrite it. In the absence (value from user space); the logic will default to using the masters' mac as the mac-address for the AD-system. It can be set using example code below - # modprobe bonding mode=4 # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \ $(( (RANDOM & 0xFE) | 0x02 )) \ $(( RANDOM & 0xFF )) \ $(( RANDOM & 0xFF )) \ $(( RANDOM & 0xFF )) \ $(( RANDOM & 0xFF )) \ $(( RANDOM & 0xFF ))) # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system # echo +eth1 > /sys/class/net/bond0/bonding/slaves ... # ip link set bond0 up Signed-off-by: Mahesh Bandewar <maheshb@google.com> Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com> [jt: fixed up style issues reported by checkpatch] Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11bonding: Allow userspace to set actors' system_priority in AD systemMahesh Bandewar8-2/+71
This patch allows user to randomize the system-priority in an ad-system. The allowed range is 1 - 0xFFFF while default value is 0xFFFF. If user does not specify this value, the system defaults to 0xFFFF, which is what it was before this patch. Following example code could set the value - # modprobe bonding mode=4 # sys_prio=$(( 1 + RANDOM + RANDOM )) # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio # echo +eth1 > /sys/class/net/bond0/bonding/slaves ... # ip link set bond0 up Signed-off-by: Mahesh Bandewar <maheshb@google.com> Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com> [jt: * fixed up style issues reported by checkpatch * changed how the default value is set in bond_check_params(), this makes the default consistent between what gets set for a new bond and what the default is claimed to be in the bonding options.] Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11Merge branch 'kernel_socket_netns'David S. Miller72-238/+166
Eric W. Biederman says: ==================== Cleanup the kernel sockets. Right now the situtation for allocating kernel sockets is a mess. - sock_create_kern does not take a namespace parameter. - kernel sockets must not reference count a network namespace and keep it alive or else we will have a reference counting loop. - The way we avoid the reference counting loop with sk_change_net and sk_release_kernel are major hacks. This patchset addresses this mess by fixing sock_create_kern to do everything necessary to create a kernel socket. None of the current users of kernel sockets need the network namespace reference counted. Either kernel sockets are network namespace aware (and using the current hacks) or kernel sockets are limited to the initial network namespace in which case it does not matter. This patchset starts by addressing tun which should be using normal userspace sockets like macvtap. Then sock_create_kern is fixed to take a network namespace. Then the in kernel status of sockets are passed through to sk_alloc. Then sk_alloc is fixed to not reference count the network namespace of kernel sockets. Then the callers of sock_create_kern are fixed up to stop using hacks. Then netlink which uses it's own flavor of sock_create_kern is fixed. Finally the hacks that are sk_change_net and sk_release_kernel are removed. When it is all done the code is easier to follow, easier to use, easier to maintain and shorter by about 70 lines. ==================== Reported-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11net: kill sk_change_net and sk_release_kernelEric W. Biederman2-36/+0
These functions are no longer needed and no longer used kill them. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>