summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2020-01-27udp: Support UDP fraglist GRO/GSO.Steffen Klassert2-25/+106
This patch extends UDP GRO to support fraglist GRO/GSO by using the previously introduced infrastructure. If the feature is enabled, all UDP packets are going to fraglist GRO (local input and forward). After validating the csum, we mark ip_summed as CHECKSUM_UNNECESSARY for fraglist GRO packets to make sure that the csum is not touched. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net: Support GRO/GSO fraglist chaining.Steffen Klassert2-1/+92
This patch adds the core functions to chain/unchain GSO skbs at the frag_list pointer. This also adds a new GSO type SKB_GSO_FRAGLIST and a is_flist flag to napi_gro_cb which indicates that this flow will be GROed by fraglist chaining. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net: Add a netdev software feature set that defaults to off.Steffen Klassert1-1/+1
The previous patch added the NETIF_F_GRO_FRAGLIST feature. This is a software feature that should default to off. Current software features default to on, so add a new feature set that defaults to off. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net: Add fraglist GRO/GSO feature flagsSteffen Klassert1-0/+1
This adds new Fraglist GRO/GSO feature flags. They will be used to configure fraglist GRO/GSO what will be implemented with some followup paches. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27rxrpc: Fix use-after-free in rxrpc_receive_data()David Howells1-5/+7
The subpacket scanning loop in rxrpc_receive_data() references the subpacket count in the private data part of the sk_buff in the loop termination condition. However, when the final subpacket is pasted into the ring buffer, the function is no longer has a ref on the sk_buff and should not be looking at sp->* any more. This point is actually marked in the code when skb is cleared (but sp is not - which is an error). Fix this by caching sp->nr_subpackets in a local variable and using that instead. Also clear 'sp' to catch accesses after that point. This can show up as an oops in rxrpc_get_skb() if sp->nr_subpackets gets trashed by the sk_buff getting freed and reused in the meantime. Fixes: e2de6c404898 ("rxrpc: Use info in skbuff instead of reparsing a jumbo packet") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net_sched: ematch: reject invalid TCF_EM_SIMPLEEric Dumazet1-0/+3
It is possible for malicious userspace to set TCF_EM_SIMPLE bit even for matches that should not have this bit set. This can fool two places using tcf_em_is_simple() 1) tcf_em_tree_destroy() -> memory leak of em->data if ops->destroy() is NULL 2) tcf_em_tree_dump() wrongly report/leak 4 low-order bytes of a kernel pointer. BUG: memory leak unreferenced object 0xffff888121850a40 (size 32): comm "syz-executor927", pid 7193, jiffies 4294941655 (age 19.840s) hex dump (first 32 bytes): 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000f67036ea>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline] [<00000000f67036ea>] slab_post_alloc_hook mm/slab.h:586 [inline] [<00000000f67036ea>] slab_alloc mm/slab.c:3320 [inline] [<00000000f67036ea>] __do_kmalloc mm/slab.c:3654 [inline] [<00000000f67036ea>] __kmalloc_track_caller+0x165/0x300 mm/slab.c:3671 [<00000000fab0cc8e>] kmemdup+0x27/0x60 mm/util.c:127 [<00000000d9992e0a>] kmemdup include/linux/string.h:453 [inline] [<00000000d9992e0a>] em_nbyte_change+0x5b/0x90 net/sched/em_nbyte.c:32 [<000000007e04f711>] tcf_em_validate net/sched/ematch.c:241 [inline] [<000000007e04f711>] tcf_em_tree_validate net/sched/ematch.c:359 [inline] [<000000007e04f711>] tcf_em_tree_validate+0x332/0x46f net/sched/ematch.c:300 [<000000007a769204>] basic_set_parms net/sched/cls_basic.c:157 [inline] [<000000007a769204>] basic_change+0x1d7/0x5f0 net/sched/cls_basic.c:219 [<00000000e57a5997>] tc_new_tfilter+0x566/0xf70 net/sched/cls_api.c:2104 [<0000000074b68559>] rtnetlink_rcv_msg+0x3b2/0x4b0 net/core/rtnetlink.c:5415 [<00000000b7fe53fb>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477 [<00000000e83a40d0>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442 [<00000000d62ba933>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] [<00000000d62ba933>] netlink_unicast+0x223/0x310 net/netlink/af_netlink.c:1328 [<0000000088070f72>] netlink_sendmsg+0x2c0/0x570 net/netlink/af_netlink.c:1917 [<00000000f70b15ea>] sock_sendmsg_nosec net/socket.c:639 [inline] [<00000000f70b15ea>] sock_sendmsg+0x54/0x70 net/socket.c:659 [<00000000ef95a9be>] ____sys_sendmsg+0x2d0/0x300 net/socket.c:2330 [<00000000b650f1ab>] ___sys_sendmsg+0x8a/0xd0 net/socket.c:2384 [<0000000055bfa74a>] __sys_sendmsg+0x80/0xf0 net/socket.c:2417 [<000000002abac183>] __do_sys_sendmsg net/socket.c:2426 [inline] [<000000002abac183>] __se_sys_sendmsg net/socket.c:2424 [inline] [<000000002abac183>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2424 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+03c4738ed29d5d366ddf@syzkaller.appspotmail.com Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net: include struct nhmsg size in nh nlmsg sizeStephen Worley1-1/+3
Include the size of struct nhmsg size when calculating how much of a payload to allocate in a new netlink nexthop notification message. Without this, we will fail to fill the skbuff at certain nexthop group sizes. You can reproduce the failure with the following iproute2 commands: ip link add dummy1 type dummy ip link add dummy2 type dummy ip link add dummy3 type dummy ip link add dummy4 type dummy ip link add dummy5 type dummy ip link add dummy6 type dummy ip link add dummy7 type dummy ip link add dummy8 type dummy ip link add dummy9 type dummy ip link add dummy10 type dummy ip link add dummy11 type dummy ip link add dummy12 type dummy ip link add dummy13 type dummy ip link add dummy14 type dummy ip link add dummy15 type dummy ip link add dummy16 type dummy ip link add dummy17 type dummy ip link add dummy18 type dummy ip link add dummy19 type dummy ip ro add 1.1.1.1/32 dev dummy1 ip ro add 1.1.1.2/32 dev dummy2 ip ro add 1.1.1.3/32 dev dummy3 ip ro add 1.1.1.4/32 dev dummy4 ip ro add 1.1.1.5/32 dev dummy5 ip ro add 1.1.1.6/32 dev dummy6 ip ro add 1.1.1.7/32 dev dummy7 ip ro add 1.1.1.8/32 dev dummy8 ip ro add 1.1.1.9/32 dev dummy9 ip ro add 1.1.1.10/32 dev dummy10 ip ro add 1.1.1.11/32 dev dummy11 ip ro add 1.1.1.12/32 dev dummy12 ip ro add 1.1.1.13/32 dev dummy13 ip ro add 1.1.1.14/32 dev dummy14 ip ro add 1.1.1.15/32 dev dummy15 ip ro add 1.1.1.16/32 dev dummy16 ip ro add 1.1.1.17/32 dev dummy17 ip ro add 1.1.1.18/32 dev dummy18 ip ro add 1.1.1.19/32 dev dummy19 ip next add id 1 via 1.1.1.1 dev dummy1 ip next add id 2 via 1.1.1.2 dev dummy2 ip next add id 3 via 1.1.1.3 dev dummy3 ip next add id 4 via 1.1.1.4 dev dummy4 ip next add id 5 via 1.1.1.5 dev dummy5 ip next add id 6 via 1.1.1.6 dev dummy6 ip next add id 7 via 1.1.1.7 dev dummy7 ip next add id 8 via 1.1.1.8 dev dummy8 ip next add id 9 via 1.1.1.9 dev dummy9 ip next add id 10 via 1.1.1.10 dev dummy10 ip next add id 11 via 1.1.1.11 dev dummy11 ip next add id 12 via 1.1.1.12 dev dummy12 ip next add id 13 via 1.1.1.13 dev dummy13 ip next add id 14 via 1.1.1.14 dev dummy14 ip next add id 15 via 1.1.1.15 dev dummy15 ip next add id 16 via 1.1.1.16 dev dummy16 ip next add id 17 via 1.1.1.17 dev dummy17 ip next add id 18 via 1.1.1.18 dev dummy18 ip next add id 19 via 1.1.1.19 dev dummy19 ip next add id 1111 group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19 ip next del id 1111 Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net_sched: walk through all child classes in tc_bind_tclass()Cong Wang1-11/+30
In a complex TC class hierarchy like this: tc qdisc add dev eth0 root handle 1:0 cbq bandwidth 100Mbit \ avpkt 1000 cell 8 tc class add dev eth0 parent 1:0 classid 1:1 cbq bandwidth 100Mbit \ rate 6Mbit weight 0.6Mbit prio 8 allot 1514 cell 8 maxburst 20 \ avpkt 1000 bounded tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \ sport 80 0xffff flowid 1:3 tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \ sport 25 0xffff flowid 1:4 tc class add dev eth0 parent 1:1 classid 1:3 cbq bandwidth 100Mbit \ rate 5Mbit weight 0.5Mbit prio 5 allot 1514 cell 8 maxburst 20 \ avpkt 1000 tc class add dev eth0 parent 1:1 classid 1:4 cbq bandwidth 100Mbit \ rate 3Mbit weight 0.3Mbit prio 5 allot 1514 cell 8 maxburst 20 \ avpkt 1000 where filters are installed on qdisc 1:0, so we can't merely search from class 1:1 when creating class 1:3 and class 1:4. We have to walk through all the child classes of the direct parent qdisc. Otherwise we would miss filters those need reverse binding. Fixes: 07d79fc7d94e ("net_sched: add reverse binding for tc class") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net_sched: fix ops->bind_class() implementationsCong Wang10-29/+76
The current implementations of ops->bind_class() are merely searching for classid and updating class in the struct tcf_result, without invoking either of cl_ops->bind_tcf() or cl_ops->unbind_tcf(). This breaks the design of them as qdisc's like cbq use them to count filters too. This is why syzbot triggered the warning in cbq_destroy_class(). In order to fix this, we have to call cl_ops->bind_tcf() and cl_ops->unbind_tcf() like the filter binding path. This patch does so by refactoring out two helper functions __tcf_bind_filter() and __tcf_unbind_filter(), which are lockless and accept a Qdisc pointer, then teaching each implementation to call them correctly. Note, we merely pass the Qdisc pointer as an opaque pointer to each filter, they only need to pass it down to the helper functions without understanding it at all. Fixes: 07d79fc7d94e ("net_sched: add reverse binding for tc class") Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27nf_tables: Add set type for arbitrary concatenation of rangesStefano Brivio3-1/+2106
This new set type allows for intervals in concatenated fields, which are expressed in the usual way, that is, simple byte concatenation with padding to 32 bits for single fields, and given as ranges by specifying start and end elements containing, each, the full concatenation of start and end values for the single fields. Ranges are expanded to composing netmasks, for each field: these are inserted as rules in per-field lookup tables. Bits to be classified are divided in 4-bit groups, and for each group, the lookup table contains 4^2 buckets, representing all the possible values of a bit group. This approach was inspired by the Grouper algorithm: http://www.cse.usf.edu/~ligatti/projects/grouper/ Matching is performed by a sequence of AND operations between bucket values, with buckets selected according to the value of packet bits, for each group. The result of this sequence tells us which rules matched for a given field. In order to concatenate several ranged fields, per-field rules are mapped using mapping arrays, one per field, that specify which rules should be considered while matching the next field. The mapping array for the last field contains a reference to the element originally inserted. The notes in nft_set_pipapo.c cover the algorithm in deeper detail. A pure hash-based approach is of no use here, as ranges need to be classified. An implementation based on "proxying" the existing red-black tree set type, creating a tree for each field, was considered, but deemed impractical due to the fact that elements would need to be shared between trees, at least as long as we want to keep UAPI changes to a minimum. A stand-alone implementation of this algorithm is available at: https://pipapo.lameexcu.se together with notes about possible future optimisations (in pipapo.c). This algorithm was designed with data locality in mind, and can be highly optimised for SIMD instruction sets, as the bulk of the matching work is done with repetitive, simple bitwise operations. At this point, without further optimisations, nft_concat_range.sh reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB L2$): TEST: performance net,port [ OK ] baseline (drop from netdev hook): 10190076pps baseline hash (non-ranged entries): 6179564pps baseline rbtree (match on first field only): 2950341pps set with 1000 full, ranged entries: 2304165pps port,net [ OK ] baseline (drop from netdev hook): 10143615pps baseline hash (non-ranged entries): 6135776pps baseline rbtree (match on first field only): 4311934pps set with 100 full, ranged entries: 4131471pps net6,port [ OK ] baseline (drop from netdev hook): 9730404pps baseline hash (non-ranged entries): 4809557pps baseline rbtree (match on first field only): 1501699pps set with 1000 full, ranged entries: 1092557pps port,proto [ OK ] baseline (drop from netdev hook): 10812426pps baseline hash (non-ranged entries): 6929353pps baseline rbtree (match on first field only): 3027105pps set with 30000 full, ranged entries: 284147pps net6,port,mac [ OK ] baseline (drop from netdev hook): 9660114pps baseline hash (non-ranged entries): 3778877pps baseline rbtree (match on first field only): 3179379pps set with 10 full, ranged entries: 2082880pps net6,port,mac,proto [ OK ] baseline (drop from netdev hook): 9718324pps baseline hash (non-ranged entries): 3799021pps baseline rbtree (match on first field only): 1506689pps set with 1000 full, ranged entries: 783810pps net,mac [ OK ] baseline (drop from netdev hook): 10190029pps baseline hash (non-ranged entries): 5172218pps baseline rbtree (match on first field only): 2946863pps set with 1000 full, ranged entries: 1279122pps v4: - fix build for 32-bit architectures: 64-bit division needs div_u64() (kbuild test robot <lkp@intel.com>) v3: - rework interface for field length specification, NFT_SET_SUBKEY disappears and information is stored in description - remove scratch area to store closing element of ranges, as elements now come with an actual attribute to specify the upper range limit (Pablo Neira Ayuso) - also remove pointer to 'start' element from mapping table, closing key is now accessible via extension data - use bytes right away instead of bits for field lengths, this way we can also double the inner loop of the lookup function to take care of upper and lower bits in a single iteration (minor performance improvement) - make it clearer that set operations are actually atomic API-wise, but we can't e.g. implement flush() as one-shot action - fix type for 'dup' in nft_pipapo_insert(), check for duplicates only in the next generation, and in general take care of differentiating generation mask cases depending on the operation (Pablo Neira Ayuso) - report C implementation matching rate in commit message, so that AVX2 implementation can be compared (Pablo Neira Ayuso) v2: - protect access to scratch maps in nft_pipapo_lookup() with local_bh_disable/enable() (Florian Westphal) - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's already implied (Florian Westphal) - explain why partial allocation failures don't need handling in pipapo_realloc_scratch(), rename 'm' to clone and update related kerneldoc to make it clear we're not operating on the live copy (Florian Westphal) - add expicit check for priv->start_elem in nft_pipapo_insert() to avoid ending up in nft_pipapo_walk() with a NULL start element, and also zero it out in every operation that might make it invalid, so that insertion doesn't proceed with an invalid element (Florian Westphal) Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27netfilter: nf_tables: Support for sets with multiple ranged fieldsStefano Brivio2-1/+92
Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used to specify the length of each field in a set concatenation. This allows set implementations to support concatenation of multiple ranged items, as they can divide the input key into matching data for every single field. Such set implementations would be selected as they specify support for NFT_SET_INTERVAL and allow desc->field_count to be greater than one. Explicitly disallow this for nft_set_rbtree. In order to specify the interval for a set entry, userspace would include in NFTA_SET_DESC_CONCAT attributes field lengths, and pass range endpoints as two separate keys, represented by attributes NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END. While at it, export the number of 32-bit registers available for packet matching, as nftables will need this to know the maximum number of field lengths that can be specified. For example, "packets with an IPv4 address between 192.0.2.0 and 192.0.2.42, with destination port between 22 and 25", can be expressed as two concatenated elements: NFTA_SET_ELEM_KEY: 192.0.2.0 . 22 NFTA_SET_ELEM_KEY_END: 192.0.2.42 . 25 and NFTA_SET_DESC_CONCAT attribute would contain: NFTA_LIST_ELEM NFTA_SET_FIELD_LEN: 4 NFTA_LIST_ELEM NFTA_SET_FIELD_LEN: 2 v4: No changes v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY v2: No changes Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attributePablo Neira Ayuso2-23/+64
Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the interval between kernel and userspace. This patch also adds the NFT_SET_EXT_KEY_END extension to store the closing element value in this interval. v4: No changes v3: New patch [sbrivio: refactor error paths and labels; add corresponding nft_set_ext_type for new key; rebase] Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27netfilter: nf_tables: add nft_setelem_parse_key()Pablo Neira Ayuso1-46/+45
Add helper function to parse the set element key netlink attribute. v4: No changes v3: New patch [sbrivio: refactor error paths and labels; use NFT_DATA_VALUE_MAXLEN instead of sizeof(*key) in helper, value can be longer than that; rebase] Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-26tcp: export count for rehash attemptsAbdul Kabbani4-1/+13
Using IPv6 flow-label to swiftly route around avoid congested or disconnected network path can greatly improve TCP reliability. This patch adds SNMP counters and a OPT_STATS counter to track both host-level and connection-level statistics. Network administrators can use these counters to evaluate the impact of this new ability better. Export count for rehash attempts to 1) two SNMP counters: TcpTimeoutRehash (rehash due to timeouts), and TcpDuplicateDataRehash (rehash due to receiving duplicate packets) 2) Timestamping API SOF_TIMESTAMPING_OPT_STATS. Signed-off-by: Abdul Kabbani <akabbani@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller37-156/+286
Minor conflict in mlx5 because changes happened to code that has moved meanwhile. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller10-68/+144
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Missing netlink attribute sanity check for NFTA_OSF_DREG, from Florian Westphal. 2) Use bitmap infrastructure in ipset to fix KASAN slab-out-of-bounds reads, from Jozsef Kadlecsik. 3) Missing initial CLOSED state in new sctp connection through ctnetlink events, from Jiri Wiesner. 4) Missing check for NFT_CHAIN_HW_OFFLOAD in nf_tables offload indirect block infrastructure, from wenxu. 5) Add __nft_chain_type_get() to sanity check family and chain type. 6) Autoload modules from the nf_tables abort path to fix races reported by syzbot. 7) Remove unnecessary skb->csum update on inet_proto_csum_replace16(), from Praveen Chaudhary. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25net: sched: Make TBF Qdisc offloadablePetr Machata1-0/+55
Invoke ndo_setup_tc as appropriate to signal init / replacement, destroying and dumping of TBF Qdisc. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25net: sched: sch_tbf: Don't overwrite backlog before dumpingPetr Machata1-1/+0
In 2011, in commit b0460e4484f9 ("sch_tbf: report backlog information"), TBF started copying backlog depth from the child Qdisc before dumping, with the motivation that the backlog was otherwise not visible in "tc -s qdisc show". Later, in 2016, in commit 8d5958f424b6 ("sch_tbf: update backlog as well"), TBF got a full-blown backlog tracking. However it kept copying the child's backlog over before dumping. That line is now unnecessary, so remove it. As shown in the following example, backlog is still reported correctly: # tc -s qdisc show dev veth0 invisible qdisc tbf 1: root refcnt 2 rate 1Mbit burst 128Kb lat 82.8s Sent 505475370 bytes 406985 pkt (dropped 0, overlimits 812544 requeues 0) backlog 81972b 66p requeues 0 qdisc bfifo 0: parent 1:1 limit 10Mb Sent 505475370 bytes 406985 pkt (dropped 0, overlimits 0 requeues 0) backlog 81972b 66p requeues 0 Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25devlink: Add health recover notifications on devlink flowsMoshe Shemesh1-87/+89
Devlink health recover notifications were added only on driver direct updates of health_state through devlink_health_reporter_state_update(). Add notifications on updates of health_state by devlink flows of report and recover. Moved functions devlink_nl_health_reporter_fill() and devlink_recover_notify() to avoid forward declaration. Fixes: 97ff3bd37fac ("devlink: add devink notification when reporter update health state") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25net: atm: use %*ph to print small bufferAndy Shevchenko2-70/+28
Use %*ph format to print small buffer as hex string. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25mptcp: Fix code formattingMat Martineau2-3/+3
checkpatch.pl had a few complaints in the last set of MPTCP patches: ERROR: code indent should use tabs where possible +^I subflow, sk->sk_family, icsk->icsk_af_ops, target, mapped);$ CHECK: Comparison to NULL could be written "!new_ctx" + if (new_ctx == NULL) { ERROR: "foo * bar" should be "foo *bar" +static const struct proto_ops * tcp_proto_ops(struct sock *sk) Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25mptcp: do not inherit inet proto opsFlorian Westphal1-19/+51
We need to initialise the struct ourselves, else we expose tcp-specific callbacks such as tcp_splice_read which will then trigger splat because the socket is an mptcp one: BUG: KASAN: slab-out-of-bounds in tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57 Write of size 8 at addr ffff888116aa21d0 by task syz-executor.0/5478 CPU: 1 PID: 5478 Comm: syz-executor.0 Not tainted 5.5.0-rc6 #3 Call Trace: tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57 tcp_rcv_space_adjust+0x72/0x7f0 net/ipv4/tcp_input.c:612 tcp_read_sock+0x622/0x990 net/ipv4/tcp.c:1674 tcp_splice_read+0x20b/0xb40 net/ipv4/tcp.c:791 do_splice+0x1259/0x1560 fs/splice.c:1205 To prevent build error with ipv6, add the recv/sendmsg function declaration to ipv6.h. The functions are already accessible "thanks" to retpoline related work, but they are currently only made visible by socket.c specific INDIRECT_CALLABLE macros. Reported-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24net: Fix skb->csum update in inet_proto_csum_replace16().Praveen Chaudhary1-3/+17
skb->csum is updated incorrectly, when manipulation for NF_NAT_MANIP_SRC\DST is done on IPV6 packet. Fix: There is no need to update skb->csum in inet_proto_csum_replace16(), because update in two fields a.) IPv6 src/dst address and b.) L4 header checksum cancels each other for skb->csum calculation. Whereas inet_proto_csum_replace4 function needs to update skb->csum, because update in 3 fields a.) IPv4 src/dst address, b.) IPv4 Header checksum and c.) L4 header checksum results in same diff as L4 Header checksum for skb->csum calculation. [ pablo@netfilter.org: a few comestic documentation edits ] Signed-off-by: Praveen Chaudhary <pchaudhary@linkedin.com> Signed-off-by: Zhenggen Xu <zxu@linkedin.com> Signed-off-by: Andy Stracner <astracner@linkedin.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24netfilter: nf_tables: autoload modules from the abort pathPablo Neira Ayuso2-43/+89
This patch introduces a list of pending module requests. This new module list is composed of nft_module_request objects that contain the module name and one status field that tells if the module has been already loaded (the 'done' field). In the first pass, from the preparation phase, the netlink command finds that a module is missing on this list. Then, a module request is allocated and added to this list and nft_request_module() returns -EAGAIN. This triggers the abort path with the autoload parameter set on from nfnetlink, request_module() is called and the module request enters the 'done' state. Since the mutex is released when loading modules from the abort phase, the module list is zapped so this is iteration occurs over a local list. Therefore, the request_module() calls happen when object lists are in consistent state (after fulling aborting the transaction) and the commit list is empty. On the second pass, the netlink command will find that it already tried to load the module, so it does not request it again and nft_request_module() returns 0. Then, there is a look up to find the object that the command was missing. If the module was successfully loaded, the command proceeds normally since it finds the missing object in place, otherwise -ENOENT is reported to userspace. This patch also updates nfnetlink to include the reason to enter the abort phase, which is required for this new autoload module rationale. Fixes: ec7470b834fe ("netfilter: nf_tables: store transaction list locally while requesting module") Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24netfilter: nf_tables: add __nft_chain_type_get()Pablo Neira Ayuso1-8/+21
This new helper function validates that unknown family and chain type coming from userspace do not trigger an out-of-bound array access. Bail out in case __nft_chain_type_get() returns NULL from nft_chain_parse_hook(). Fixes: 9370761c56b6 ("netfilter: nf_tables: convert built-in tables/chains to chain types") Reported-by: syzbot+156a04714799b1d480bc@syzkaller.appspotmail.com Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24netfilter: nf_tables_offload: fix check the chain offload flagwenxu1-1/+1
In the nft_indr_block_cb the chain should check the flag with NFT_CHAIN_HW_OFFLOAD. Fixes: 9a32669fecfb ("netfilter: nf_tables_offload: support indr block call") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24netfilter: conntrack: sctp: use distinct states for new SCTP connectionsJiri Wiesner1-3/+3
The netlink notifications triggered by the INIT and INIT_ACK chunks for a tracked SCTP association do not include protocol information for the corresponding connection - SCTP state and verification tags for the original and reply direction are missing. Since the connection tracking implementation allows user space programs to receive notifications about a connection and then create a new connection based on the values received in a notification, it makes sense that INIT and INIT_ACK notifications should contain the SCTP state and verification tags available at the time when a notification is sent. The missing verification tags cause a newly created netfilter connection to fail to verify the tags of SCTP packets when this connection has been created from the values previously received in an INIT or INIT_ACK notification. A PROTOINFO event is cached in sctp_packet() when the state of a connection changes. The CLOSED and COOKIE_WAIT state will be used for connections that have seen an INIT and INIT_ACK chunk, respectively. The distinct states will cause a connection state change in sctp_packet(). Signed-off-by: Jiri Wiesner <jwiesner@suse.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24mptcp: cope with later TCP fallbackPaolo Abeni2-17/+103
With MPTCP v1, passive connections can fallback to TCP after the subflow becomes established: syn + MP_CAPABLE -> <- syn, ack + MP_CAPABLE ack, seq = 3 -> // OoO packet is accepted because in-sequence // passive socket is created, is in ESTABLISHED // status and tentatively as MP_CAPABLE ack, seq = 2 -> // no MP_CAPABLE opt, subflow should fallback to TCP We can't use the 'subflow' socket fallback, as we don't have it available for passive connection. Instead, when the fallback is detected, replace the mptcp socket with the underlying TCP subflow. Beyond covering the above scenario, it makes a TCP fallback socket as efficient as plain TCP ones. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: process MP_CAPABLE data optionChristoph Paasch4-27/+95
This patch implements the handling of MP_CAPABLE + data option, as per RFC 6824 bis / RFC 8684: MPTCP v1. On the server side we can receive the remote key after that the connection is established. We need to explicitly track the 'missing remote key' status and avoid emitting a mptcp ack until we get such info. When a late/retransmitted/OoO pkt carrying MP_CAPABLE[+data] option is received, we have to propagate the mptcp seq number info to the msk socket. To avoid ABBA locking issue, explicitly check for that in recvmsg(), where we own msk and subflow sock locks. The above also means that an established mp_capable subflow - still waiting for the remote key - can be 'downgraded' to plain TCP. Such change could potentially block a reader waiting for new data forever - as they hook to msk, while later wake-up after the downgrade will be on subflow only. The above issue is not handled here, we likely have to get rid of msk->fallback to handle that cleanly. Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: parse and emit MP_CAPABLE option according to v1 specChristoph Paasch5-38/+148
This implements MP_CAPABLE options parsing and writing according to RFC 6824 bis / RFC 8684: MPTCP v1. Local key is sent on syn/ack, and both keys are sent on 3rd ack. MP_CAPABLE messages len are updated accordingly. We need the skbuff to correctly emit the above, so we push the skbuff struct as an argument all the way from tcp code to the relevant mptcp callbacks. When processing incoming MP_CAPABLE + data, build a full blown DSS-like map info, to simplify later processing. On child socket creation, we need to record the remote key, if available. Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: move from sha1 (v0) to sha256 (v1)Paolo Abeni4-59/+104
For simplicity's sake use directly sha256 primitives (and pull them as a required build dep). Add optional, boot-time self-tests for the hmac function. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: new sysctl to control the activation per NSMatthieu Baerts4-5/+146
New MPTCP sockets will return -ENOPROTOOPT if MPTCP support is disabled for the current net namespace. We are providing here a way to control access to the feature for those that need to turn it on or off. The value of this new sysctl can be different per namespace. We can then restrict the usage of MPTCP to the selected NS. In case of serious issues with MPTCP, administrators can now easily turn MPTCP off. Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: allow collapsing consecutive sendpages on the same substreamPaolo Abeni1-15/+60
If the current sendmsg() lands on the same subflow we used last, we can try to collapse the data. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: recvmsg() can drain data from multiple subflowsPaolo Abeni1-10/+168
With the previous patch in place, the msk can detect which subflow has the current map with a simple walk, let's update the main loop to always select the 'current' subflow. The exit conditions now closely mirror tcp_recvmsg() to get expected timeout and signal behavior. Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Co-developed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: add subflow write space signalling and mptcp_pollFlorian Westphal3-0/+57
Add new SEND_SPACE flag to indicate that a subflow has enough space to accept more data for transmission. It gets cleared at the end of mptcp_sendmsg() in case ssk has run below the free watermark. It is (re-set) from the wspace callback. This allows us to use msk->flags to determine the poll mask. Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Implement MPTCP receive pathMat Martineau5-5/+591
Parses incoming DSS options and populates outgoing MPTCP ACK fields. MPTCP fields are parsed from the TCP option header and placed in an skb extension, allowing the upper MPTCP layer to access MPTCP options after the skb has gone through the TCP stack. The subflow implements its own data_ready() ops, which ensures that the pending data is in sequence - according to MPTCP seq number - dropping out-of-seq skbs. The DATA_READY bit flag is set if this is the case. This allows the MPTCP socket layer to determine if more data is available without having to consult the individual subflows. It additionally validates the current mapping and propagates EoF events to the connection socket. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Co-developed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Write MPTCP DSS headers to outgoing data packetsMat Martineau3-6/+285
Per-packet metadata required to write the MPTCP DSS option is written to the skb_ext area. One write to the socket may contain more than one packet of data, which is copied to page fragments and mapped in to MPTCP DSS segments with size determined by the available page fragments and the maximum mapping length allowed by the MPTCP specification. If do_tcp_sendpages() splits a DSS segment in to multiple skbs, that's ok - the later skbs can either have duplicated DSS mapping information or none at all, and the receiver can handle that. The current implementation uses the subflow frag cache and tcp sendpages to avoid excessive code duplication. More work is required to ensure that it works correctly under memory pressure and to support MPTCP-level retransmissions. The MPTCP DSS checksum is not yet implemented. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Add setsockopt()/getsockopt() socket operationsPeter Krystad1-0/+58
set/getsockopt behaviour with multiple subflows is undefined. Therefore, for now, we return -EOPNOTSUPP unless we're in fallback mode. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Add shutdown() socket operationPeter Krystad1-0/+66
Call shutdown on all subflows in use on the given socket, or on the fallback socket. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Add key generation and token treePeter Krystad6-8/+428
Generate the local keys, IDSN, and token when creating a new socket. Introduce the token tree to track all tokens in use using a radix tree with the MPTCP token itself as the index. Override the rebuild_header callback in inet_connection_sock_af_ops for creating the local key on a new outgoing connection. Override the init_req callback of tcp_request_sock_ops for creating the local key on a new incoming connection. Will be used to obtain the MPTCP parent socket to handle incoming joins. Co-developed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Create SUBFLOW socket for incoming connectionsPeter Krystad1-5/+231
Add subflow_request_sock type that extends tcp_request_sock and add an is_mptcp flag to tcp_request_sock distinguish them. Override the listen() and accept() methods of the MPTCP socket proto_ops so they may act on the subflow socket. Override the conn_request() and syn_recv_sock() handlers in the inet_connection_sock to handle incoming MPTCP SYNs and the ACK to the response SYN. Add handling in tcp_output.c to add MP_CAPABLE to an outgoing SYN-ACK response for a subflow_request_sock. Co-developed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Handle MP_CAPABLE options for outgoing connectionsPeter Krystad7-24/+603
Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request, to capture the MP_CAPABLE in the received SYN-ACK, to add MP_CAPABLE to the final ACK of the three-way handshake. Use the .sk_rx_dst_set() handler in the subflow proto to capture when the responding SYN-ACK is received and notify the MPTCP connection layer. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Associate MPTCP context with TCP socketPeter Krystad4-7/+272
Use ULP to associate a subflow_context structure with each TCP subflow socket. Creating these sockets requires new bind and connect functions to make sure ULP is set up immediately when the subflow sockets are created. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Handle MPTCP TCP optionsPeter Krystad5-1/+145
Add hooks to parse and format the MP_CAPABLE option. This option is handled according to MPTCP version 0 (RFC6824). MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in coordination with related code changes. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Add MPTCP socket stubsMat Martineau8-0/+195
Implements the infrastructure for MPTCP sockets. MPTCP sockets open one in-kernel TCP socket per subflow. These subflow sockets are only managed by the MPTCP socket that owns them and are not visible from userspace. This commit allows a userspace program to open an MPTCP socket with: sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP); The resulting socket is simply a wrapper around a single regular TCP socket, without any of the MPTCP protocol implemented over the wire. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24net: bridge: vlan: add per-vlan stateNikolay Aleksandrov5-19/+133
The first per-vlan option added is state, it is needed for EVPN and for per-vlan STP. The state allows to control the forwarding on per-vlan basis. The vlan state is considered only if the port state is forwarding in order to avoid conflicts and be consistent. br_allowed_egress is called only when the state is forwarding, but the ingress case is a bit more complicated due to the fact that we may have the transition between port:BR_STATE_FORWARDING -> vlan:BR_STATE_LEARNING which should still allow the bridge to learn from the packet after vlan filtering and it will be dropped after that. Also to optimize the pvid state check we keep a copy in the vlan group to avoid one lookup. The state members are modified with *_ONCE() to annotate the lockless access. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24net: bridge: vlan: add basic option setting supportNikolay Aleksandrov3-7/+129
This patch adds support for option modification of single vlans and ranges. It allows to only modify options, i.e. skip create/delete by using the BRIDGE_VLAN_INFO_ONLY_OPTS flag. When working with a range option changes we try to pack the notifications as much as possible. v2: do full port (all vlans) notification only when creating/deleting vlans for compatibility, rework the range detection when changing options, add more verbose extack errors and check if a vlan should be used (br_vlan_should_use checks) Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24net: bridge: vlan: add basic option dumping supportNikolay Aleksandrov4-7/+48
We'll be dumping the options for the whole range if they're equal. The first range vlan will be used to extract the options. The commit doesn't change anything yet it just adds the skeleton for the support. The dump will happen when the first option is added. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24net: bridge: check port state before br_allowed_egressNikolay Aleksandrov1-1/+1
If we make sure that br_allowed_egress is called only when we have BR_STATE_FORWARDING state then we can avoid a test later when we add per-vlan state. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24ipv6_route_seq_next should increase position indexVasily Averin1-5/+2
if seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>