summaryrefslogtreecommitdiffstats
path: root/net/ipv6
AgeCommit message (Collapse)AuthorFilesLines
2021-06-29net: sock: introduce sk_error_reportAlexander Aring3-4/+4
This patch introduces a function wrapper to call the sk_error_report callback. That will prepare to add additional handling whenever sk_error_report is called, for example to trace socket errors. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28ipv6: ICMPV6: add response to ICMPV6 RFC 8335 PROBE messagesAndreas Roeseler1-3/+18
This patch builds off of commit 2b246b2569cd2ac6ff700d0dce56b8bae29b1842 and adds functionality to respond to ICMPV6 PROBE requests. Add icmp_build_probe function to construct PROBE requests for both ICMPV4 and ICMPV6. Modify icmpv6_rcv to detect ICMPV6 PROBE messages and call the icmpv6_echo_reply handler. Modify icmpv6_echo_reply to build a PROBE response message based on the queried interface. This patch has been tested using a branch of the iputils git repo which can be found here: https://github.com/Juniper-Clinic-2020/iputils/tree/probe-request Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/gitDavid S. Miller7-114/+0
/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2021-06-28 1) Remove an unneeded error assignment in esp4_gro_receive(). From Yang Li. 2) Add a new byseq state hashtable to find acquire states faster. From Sabrina Dubroca. 3) Remove some unnecessary variables in pfkey_create(). From zuoqilin. 4) Remove the unused description from xfrm_type struct. From Florian Westphal. 5) Fix a spelling mistake in the comment of xfrm_state_ok(). From gushengxian. 6) Replace hdr_off indirections by a small helper function. From Florian Westphal. 7) Remove xfrm4_output_finish and xfrm6_output_finish declarations, they are not used anymore.From Antony Antony. 8) Remove xfrm replay indirections. From Florian Westphal. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28ip6_tunnel: allow redirecting ip6gre and ipxip6 packets to eth devicesGuillaume Nault1-0/+1
Reset the mac_header pointer even when the tunnel transports only L3 data (in the ARPHRD_ETHER case, this is already done by eth_type_trans). This prevents other parts of the stack from mistakenly accessing the outer header after the packet has been decapsulated. In practice, this allows to push an Ethernet header to ipip6, ip6ip6, mplsip6 or ip6gre packets and redirect them to an Ethernet device: $ tc filter add dev ip6tnl0 ingress matchall \ action vlan push_eth dst_mac 00:00:5e:00:53:01 \ src_mac 00:00:5e:00:53:00 \ action mirred egress redirect dev eth0 Without this patch, push_eth refuses to add an ethernet header because the skb appears to already have a MAC header. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28sit: allow redirecting ip6ip, ipip and mplsip packets to eth devicesGuillaume Nault1-0/+4
Even though sit transports L3 data (IPv6, IPv4 or MPLS) packets, it needs to reset the mac_header pointer, so that other parts of the stack don't mistakenly access the outer header after the packet has been decapsulated. There are two rx handlers to modify: ipip6_rcv() for the ip6ip mode and sit_tunnel_rcv() which is used to re-implement the ipip and mplsip modes of ipip.ko. This allows to push an Ethernet header to sit packets and redirect them to an Ethernet device: $ tc filter add dev sit0 ingress matchall \ action vlan push_eth dst_mac 00:00:5e:00:53:01 \ src_mac 00:00:5e:00:53:00 \ action mirred egress redirect dev eth0 Without this patch, push_eth refuses to add an ethernet header because the skb appears to already have a MAC header. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24ipv6: delete useless dst check in ip6_dst_lookup_tailzhang kai1-5/+3
parameter dst always points to null. Signed-off-by: zhang kai <zhangkaiheb@126.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24net: ip: avoid OOM kills with large UDP sends over loopbackJakub Kicinski1-15/+17
Dave observed number of machines hitting OOM on the UDP send path. The workload seems to be sending large UDP packets over loopback. Since loopback has MTU of 64k kernel will try to allocate an skb with up to 64k of head space. This has a good chance of failing under memory pressure. What's worse if the message length is <32k the allocation may trigger an OOM killer. This is entirely avoidable, we can use an skb with page frags. af_unix solves a similar problem by limiting the head length to SKB_MAX_ALLOC. This seems like a good and simple approach. It means that UDP messages > 16kB will now use fragments if underlying device supports SG, if extra allocator pressure causes regressions in real workloads we can switch to trying the large allocation first and falling back. v4: pre-calculate all the additions to alloclen so we can be sure it won't go over order-2 Reported-by: Dave Jones <dsj@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-5/+22
Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-18seg6: add support for SRv6 End.DT46 BehaviorAndrea Mayer1-22/+72
IETF RFC 8986 [1] includes the definition of SRv6 End.DT4, End.DT6, and End.DT46 Behaviors. The current SRv6 code in the Linux kernel only implements End.DT4 and End.DT6 which can be used respectively to support IPv4-in-IPv6 and IPv6-in-IPv6 VPNs. With End.DT4 and End.DT6 it is not possible to create a single SRv6 VPN tunnel to carry both IPv4 and IPv6 traffic. The proposed End.DT46 implementation is meant to support the decapsulation of IPv4 and IPv6 traffic coming from a single SRv6 tunnel. The implementation of the SRv6 End.DT46 Behavior in the Linux kernel greatly simplifies the setup and operations of SRv6 VPNs. The SRv6 End.DT46 Behavior leverages the infrastructure of SRv6 End.DT{4,6} Behaviors implemented so far, because it makes use of a VRF device in order to force the routing lookup into the associated routing table. To make the End.DT46 work properly, it must be guaranteed that the routing table used for routing lookup operations is bound to one and only one VRF during the tunnel creation. Such constraint has to be enforced by enabling the VRF strict_mode sysctl parameter, i.e.: $ sysctl -wq net.vrf.strict_mode=1 Note that the same approach is used for the SRv6 End.DT4 Behavior and for the End.DT6 Behavior in VRF mode. The command used to instantiate an SRv6 End.DT46 Behavior is straightforward, i.e.: $ ip -6 route add 2001:db8::1 encap seg6local action End.DT46 vrftable 100 dev vrf100. [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-enddt46-decapsulation-and-s ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Performance and impact of SRv6 End.DT46 Behavior on the SRv6 Networking ======================================================================= This patch aims to add the SRv6 End.DT46 Behavior with minimal impact on the performance of SRv6 End.DT4 and End.DT6 Behaviors. In order to verify this, we tested the performance of the newly introduced SRv6 End.DT46 Behavior and compared it with the performance of SRv6 End.DT{4,6} Behaviors, considering both the patched kernel and the kernel before applying the End.DT46 patch (referred to as vanilla kernel). In details, the following decapsulation scenarios were considered: 1.a) IPv6 traffic in SRv6 End.DT46 Behavior on patched kernel; 1.b) IPv4 traffic in SRv6 End.DT46 Behavior on patched kernel; 2.a) SRv6 End.DT6 Behavior (VRF mode) on patched kernel; 2.b) SRv6 End.DT4 Behavior on patched kernel; 3.a) SRv6 End.DT6 Behavior (VRF mode) on vanilla kernel (without the End.DT46 patch); 3.b) SRv6 End.DT4 Behavior on vanilla kernel (without the End.DT46 patch). All tests were performed on a testbed deployed on the CloudLab [2] facilities. We considered IPv{4,6} traffic handled by a single core (at 2.4 GHz on a Xeon(R) CPU E5-2630 v3) on kernel 5.13-rc1 using packets of size ~ 100 bytes. Scenario (1.a): average 684.70 kpps; std. dev. 0.7 kpps; Scenario (1.b): average 711.69 kpps; std. dev. 1.2 kpps; Scenario (2.a): average 690.70 kpps; std. dev. 1.2 kpps; Scenario (2.b): average 722.22 kpps; std. dev. 1.7 kpps; Scenario (3.a): average 690.02 kpps; std. dev. 2.6 kpps; Scenario (3.b): average 721.91 kpps; std. dev. 1.2 kpps; Considering the results for the patched kernel (1.a, 1.b, 2.a, 2.b) we observe that the performance degradation incurred in using End.DT46 rather than End.DT6 and End.DT4 respectively for IPv6 and IPv4 traffic is minimal, around 0.9% and 1.5%. Such very minimal performance degradation is the price to be paid if one prefers to use a single tunnel capable of handling both types of traffic (IPv4 and IPv6). Comparing the results for End.DT4 and End.DT6 under the patched and the vanilla kernel (2.a, 2.b, 3.a, 3.b) we observe that the introduction of the End.DT46 patch has no impact on the performance of End.DT4 and End.DT6. [2] https://www.cloudlab.us Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-3/+11
Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-17 The following pull-request contains BPF updates for your *net-next* tree. We've added 50 non-merge commits during the last 25 day(s) which contain a total of 148 files changed, 4779 insertions(+), 1248 deletions(-). The main changes are: 1) BPF infrastructure to migrate TCP child sockets from a listener to another in the same reuseport group/map, from Kuniyuki Iwashima. 2) Add a provably sound, faster and more precise algorithm for tnum_mul() as noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan. 3) Streamline error reporting changes in libbpf as planned out in the 'libbpf: the road to v1.0' effort, from Andrii Nakryiko. 4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu. 5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek. 6) Support new LLVM relocations in libbpf to make them more linker friendly, also add a doc to describe the BPF backend relocations, from Yonghong Song. 7) Silence long standing KUBSAN complaints on register-based shifts in interpreter, from Daniel Borkmann and Eric Biggers. 8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when target arch cannot be determined, from Lorenz Bauer. 9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson. 10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi. 11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF programs to bpf_helpers.h header, from Florent Revest. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-15tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.Kuniyuki Iwashima1-3/+11
This patch also changes the code to call reuseport_migrate_sock() and inet_reqsk_clone(), but unlike the other cases, we do not call inet_reqsk_clone() right after reuseport_migrate_sock(). Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener has three kinds of refcnt: (A) for listener itself (B) carried by reuqest_sock (C) sock_hold() in tcp_v[46]_rcv() While processing the req, (A) may disappear by close(listener). Also, (B) can disappear by accept(listener) once we put the req into the accept queue. So, we have to hold another refcnt (C) for the listener to prevent use-after-free. For socket migration, we call reuseport_migrate_sock() to select a listener with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv(). This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv(). Thus we have to take another refcnt (B) for the newly cloned request_sock. In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and try to put the new req into the accept queue. By migrating req after winning the "own_req" race, we can avoid such a worst situation: CPU 1 looks up req1 CPU 2 looks up req1, unhashes it, then CPU 1 loses the race CPU 3 looks up req2, unhashes it, then CPU 2 loses the race ... Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
2021-06-14mld: avoid unnecessary high order page allocation in mld_newpack()Taehee Yoo1-11/+14
If link mtu is too big, mld_newpack() allocates high-order page. But most mld packets don't need high-order page. So, it might waste unnecessary pages. To avoid this, it makes mld_newpack() try to allocate order-0 page. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14ipv6: fib6: remove redundant initialization of variable errColin Ian King1-1/+1
The variable err is being initialized with a value that is never read, the assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11xfrm: remove hdr_offset indirectionFlorian Westphal4-10/+0
After previous patches all remaining users set the function pointer to the same function: xfrm6_find_1stfragopt. So remove this function pointer and call ip6_find_1stfragopt directly. Reduces size of xfrm_type to 64 bytes on 64bit platforms. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-11xfrm: ipv6: move mip6_rthdr_offset into xfrm coreFlorian Westphal1-48/+0
Place the call into the xfrm core. After this all remaining users set the hdr_offset function pointer to the same function which opens the possiblity to remove the indirection. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-11xfrm: ipv6: move mip6_destopt_offset into xfrm coreFlorian Westphal1-49/+0
This helper is relatively small, just move this to the xfrm core and call it directly. Next patch does the same for the ROUTING type. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller1-4/+18
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix a crash when stateful expression with its own gc callback is used in a set definition. 2) Skip IPv6 packets from any link-local address in IPv6 fib expression. Add a selftest for this scenario, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09udp: fix race between close() and udp_abort()Paolo Abeni1-0/+3
Kaustubh reported and diagnosed a panic in udp_lib_lookup(). The root cause is udp_abort() racing with close(). Both racing functions acquire the socket lock, but udp{v6}_destroy_sock() release it before performing destructive actions. We can't easily extend the socket lock scope to avoid the race, instead use the SOCK_DEAD flag to prevent udp_abort from doing any action when the critical race happens. Diagnosed-and-tested-by: Kaustubh Pandey <kapandey@codeaurora.org> Fixes: 5d77dca82839 ("net: diag: support SOCK_DESTROY for UDP sockets") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-localFlorian Westphal1-4/+18
The ip6tables rpfilter match has an extra check to skip packets with "::" source address. Extend this to ipv6 fib expression. Else ipv6 duplicate address detection packets will fail rpf route check -- lookup returns -ENETUNREACH. While at it, extend the prerouting check to also cover the ingress hook. Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1543 Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-09xfrm: remove description from xfrm_type structFlorian Westphal6-7/+0
Its set but never read. Reduces size of xfrm_type to 64 bytes on 64bit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-08net: ipv4: Remove unneed BUG() functionZheng Yongjun1-1/+1
When 'nla_parse_nested_deprecated' failed, it's no need to BUG() here, return -EINVAL is ok. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2-2/+9
Bug fixes overlapping feature additions and refactoring, mostly. Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03ipv6: Fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptionsCoco Li1-2/+6
Reported by syzbot: HEAD commit: 90c911ad Merge tag 'fixes' of git://git.kernel.org/pub/scm.. git tree: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master dashboard link: https://syzkaller.appspot.com/bug?extid=123aa35098fd3c000eb7 compiler: Debian clang version 11.0.1-2 ================================================================== BUG: KASAN: slab-out-of-bounds in fib6_nh_get_excptn_bucket net/ipv6/route.c:1604 [inline] BUG: KASAN: slab-out-of-bounds in fib6_nh_flush_exceptions+0xbd/0x360 net/ipv6/route.c:1732 Read of size 8 at addr ffff8880145c78f8 by task syz-executor.4/17760 CPU: 0 PID: 17760 Comm: syz-executor.4 Not tainted 5.12.0-rc8-syzkaller #0 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x202/0x31e lib/dump_stack.c:120 print_address_description+0x5f/0x3b0 mm/kasan/report.c:232 __kasan_report mm/kasan/report.c:399 [inline] kasan_report+0x15c/0x200 mm/kasan/report.c:416 fib6_nh_get_excptn_bucket net/ipv6/route.c:1604 [inline] fib6_nh_flush_exceptions+0xbd/0x360 net/ipv6/route.c:1732 fib6_nh_release+0x9a/0x430 net/ipv6/route.c:3536 fib6_info_destroy_rcu+0xcb/0x1c0 net/ipv6/ip6_fib.c:174 rcu_do_batch kernel/rcu/tree.c:2559 [inline] rcu_core+0x8f6/0x1450 kernel/rcu/tree.c:2794 __do_softirq+0x372/0x7a6 kernel/softirq.c:345 invoke_softirq kernel/softirq.c:221 [inline] __irq_exit_rcu+0x22c/0x260 kernel/softirq.c:422 irq_exit_rcu+0x5/0x20 kernel/softirq.c:434 sysvec_apic_timer_interrupt+0x91/0xb0 arch/x86/kernel/apic/apic.c:1100 </IRQ> asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632 RIP: 0010:lock_acquire+0x1f6/0x720 kernel/locking/lockdep.c:5515 Code: f6 84 24 a1 00 00 00 02 0f 85 8d 02 00 00 f7 c3 00 02 00 00 49 bd 00 00 00 00 00 fc ff df 74 01 fb 48 c7 44 24 40 0e 36 e0 45 <4b> c7 44 3d 00 00 00 00 00 4b c7 44 3d 09 00 00 00 00 43 c7 44 3d RSP: 0018:ffffc90009e06560 EFLAGS: 00000206 RAX: 1ffff920013c0cc0 RBX: 0000000000000246 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffc90009e066e0 R08: dffffc0000000000 R09: fffffbfff1f992b1 R10: fffffbfff1f992b1 R11: 0000000000000000 R12: 0000000000000000 R13: dffffc0000000000 R14: 0000000000000000 R15: 1ffff920013c0cb4 rcu_lock_acquire+0x2a/0x30 include/linux/rcupdate.h:267 rcu_read_lock include/linux/rcupdate.h:656 [inline] ext4_get_group_info+0xea/0x340 fs/ext4/ext4.h:3231 ext4_mb_prefetch+0x123/0x5d0 fs/ext4/mballoc.c:2212 ext4_mb_regular_allocator+0x8a5/0x28f0 fs/ext4/mballoc.c:2379 ext4_mb_new_blocks+0xc6e/0x24f0 fs/ext4/mballoc.c:4982 ext4_ext_map_blocks+0x2be3/0x7210 fs/ext4/extents.c:4238 ext4_map_blocks+0xab3/0x1cb0 fs/ext4/inode.c:638 ext4_getblk+0x187/0x6c0 fs/ext4/inode.c:848 ext4_bread+0x2a/0x1c0 fs/ext4/inode.c:900 ext4_append+0x1a4/0x360 fs/ext4/namei.c:67 ext4_init_new_dir+0x337/0xa10 fs/ext4/namei.c:2768 ext4_mkdir+0x4b8/0xc00 fs/ext4/namei.c:2814 vfs_mkdir+0x45b/0x640 fs/namei.c:3819 ovl_do_mkdir fs/overlayfs/overlayfs.h:161 [inline] ovl_mkdir_real+0x53/0x1a0 fs/overlayfs/dir.c:146 ovl_create_real+0x280/0x490 fs/overlayfs/dir.c:193 ovl_workdir_create+0x425/0x600 fs/overlayfs/super.c:788 ovl_make_workdir+0xed/0x1140 fs/overlayfs/super.c:1355 ovl_get_workdir fs/overlayfs/super.c:1492 [inline] ovl_fill_super+0x39ee/0x5370 fs/overlayfs/super.c:2035 mount_nodev+0x52/0xe0 fs/super.c:1413 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x86/0x270 fs/super.c:1497 do_new_mount fs/namespace.c:2903 [inline] path_mount+0x196f/0x2be0 fs/namespace.c:3233 do_mount fs/namespace.c:3246 [inline] __do_sys_mount fs/namespace.c:3454 [inline] __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3431 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x4665f9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f68f2b87188 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 000000000056bf60 RCX: 00000000004665f9 RDX: 00000000200000c0 RSI: 0000000020000000 RDI: 000000000040000a RBP: 00000000004bfbb9 R08: 0000000020000100 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000056bf60 R13: 00007ffe19002dff R14: 00007f68f2b87300 R15: 0000000000022000 Allocated by task 17768: kasan_save_stack mm/kasan/common.c:38 [inline] kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:427 [inline] ____kasan_kmalloc+0xc2/0xf0 mm/kasan/common.c:506 kasan_kmalloc include/linux/kasan.h:233 [inline] __kmalloc+0xb4/0x380 mm/slub.c:4055 kmalloc include/linux/slab.h:559 [inline] kzalloc include/linux/slab.h:684 [inline] fib6_info_alloc+0x2c/0xd0 net/ipv6/ip6_fib.c:154 ip6_route_info_create+0x55d/0x1a10 net/ipv6/route.c:3638 ip6_route_add+0x22/0x120 net/ipv6/route.c:3728 inet6_rtm_newroute+0x2cd/0x2260 net/ipv6/route.c:5352 rtnetlink_rcv_msg+0xb34/0xe70 net/core/rtnetlink.c:5553 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2502 netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline] netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1338 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1927 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x5a2/0x900 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmsg+0x319/0x400 net/socket.c:2433 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae Last potentially related work creation: kasan_save_stack+0x27/0x50 mm/kasan/common.c:38 kasan_record_aux_stack+0xee/0x120 mm/kasan/generic.c:345 __call_rcu kernel/rcu/tree.c:3039 [inline] call_rcu+0x1b1/0xa30 kernel/rcu/tree.c:3114 fib6_info_release include/net/ip6_fib.h:337 [inline] ip6_route_info_create+0x10c4/0x1a10 net/ipv6/route.c:3718 ip6_route_add+0x22/0x120 net/ipv6/route.c:3728 inet6_rtm_newroute+0x2cd/0x2260 net/ipv6/route.c:5352 rtnetlink_rcv_msg+0xb34/0xe70 net/core/rtnetlink.c:5553 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2502 netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline] netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1338 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1927 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x5a2/0x900 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmsg+0x319/0x400 net/socket.c:2433 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae Second to last potentially related work creation: kasan_save_stack+0x27/0x50 mm/kasan/common.c:38 kasan_record_aux_stack+0xee/0x120 mm/kasan/generic.c:345 insert_work+0x54/0x400 kernel/workqueue.c:1331 __queue_work+0x981/0xcc0 kernel/workqueue.c:1497 queue_work_on+0x111/0x200 kernel/workqueue.c:1524 queue_work include/linux/workqueue.h:507 [inline] call_usermodehelper_exec+0x283/0x470 kernel/umh.c:433 kobject_uevent_env+0x1349/0x1730 lib/kobject_uevent.c:617 kvm_uevent_notify_change+0x309/0x3b0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4809 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:877 [inline] kvm_put_kvm+0x9c/0xd10 arch/x86/kvm/../../../virt/kvm/kvm_main.c:920 kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3120 __fput+0x352/0x7b0 fs/file_table.c:280 task_work_run+0x146/0x1c0 kernel/task_work.c:140 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:174 [inline] exit_to_user_mode_prepare+0x10b/0x1e0 kernel/entry/common.c:208 __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline] syscall_exit_to_user_mode+0x26/0x70 kernel/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff8880145c7800 which belongs to the cache kmalloc-192 of size 192 The buggy address is located 56 bytes to the right of 192-byte region [ffff8880145c7800, ffff8880145c78c0) The buggy address belongs to the page: page:ffffea00005171c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x145c7 flags: 0xfff00000000200(slab) raw: 00fff00000000200 ffffea00006474c0 0000000200000002 ffff888010c41a00 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880145c7780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff8880145c7800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff8880145c7880: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff8880145c7900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880145c7980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ================================================================== In the ip6_route_info_create function, in the case that the nh pointer is not NULL, the fib6_nh in fib6_info has not been allocated. Therefore, when trying to free fib6_info in this error case using fib6_info_release, the function will call fib6_info_destroy_rcu, which it will access fib6_nh_release(f6i->fib6_nh); However, f6i->fib6_nh doesn't have any refcount yet given the lack of allocation causing the reported memory issue above. Therefore, releasing the empty pointer directly instead would be the solution. Fixes: f88d8ea67fbdb ("ipv6: Plumb support for nexthop object in a fib6_info") Fixes: 706ec91916462 ("ipv6: Fix nexthop refcnt leak when creating ipv6 route info") Signed-off-by: Coco Li <lixiaoyan@google.com> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03sit: set name of device back to struct parmszhang kai1-0/+3
addrconf_set_sit_dstaddr will use parms->name. Signed-off-by: zhang kai <zhangkaiheb@126.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-02sit: replace 68 with micro IPV4_MIN_MTUzhang kai1-1/+1
Use meaningfull micro IPV4_MIN_MTU Signed-off-by: zhang kai <zhangkaiheb@126.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2-2/+2
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Support for SCTP chunks matching on nf_tables, from Phil Sutter. 2) Skip LDMXCSR, we don't need a valid MXCSR state. From Stefano Brivio. 3) CONFIG_RETPOLINE for nf_tables set lookups, from Florian Westphal. 4) A few Kconfig leading spaces removal, from Juerg Haefliger. 5) Remove spinlock from xt_limit, from Jason Baron. 6) Remove useless initialization in xt_CT, oneliner from Yang Li. 7) Tree-wide replacement of netlink_unicast() by nfnetlink_unicast(). 8) Reduce footprint of several structures: xt_action_param, nft_pktinfo and nf_hook_state, from Florian. 10) Add nft_thoff() and nft_sk() helpers and use them, also from Florian. 11) Fix documentation in nf_tables pipapo avx2, from Florian Westphal. 12) Fix clang-12 fmt string warnings, also from Florian. ====================
2021-05-31ipv6: align code with contextRocco Yue1-4/+4
The Tab key is used three times, causing the code block to be out of alignment with the context. Signed-off-by: Rocco Yue <rocco.yue@mediatek.com> Link: https://lore.kernel.org/r/20210530113811.8817-1-rocco.yue@mediatek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-31ipv6: use prandom_u32() for ID generationWilly Tarreau1-23/+5
This is a complement to commit aa6dd211e4b1 ("inet: use bigger hash table for IP ID generation"), but focusing on some specific aspects of IPv6. Contary to IPv4, IPv6 only uses packet IDs with fragments, and with a minimum MTU of 1280, it's much less easy to force a remote peer to produce many fragments to explore its ID sequence. In addition packet IDs are 32-bit in IPv6, which further complicates their analysis. On the other hand, it is often easier to choose among plenty of possible source addresses and partially work around the bigger hash table the commit above permits, which leaves IPv6 partially exposed to some possibilities of remote analysis at the risk of weakening some protocols like DNS if some IDs can be predicted with a good enough probability. Given the wide range of permitted IDs, the risk of collision is extremely low so there's no need to rely on the positive increment algorithm that is shared with the IPv4 code via ip_idents_reserve(). We have a fast PRNG, so let's simply call prandom_u32() and be done with it. Performance measurements at 10 Gbps couldn't show any difference with the previous code, even when using a single core, because due to the large fragments, we're limited to only ~930 kpps at 10 Gbps and the cost of the random generation is completely offset by other operations and by the network transfer time. In addition, this change removes the need to update a shared entry in the idents table so it may even end up being slightly faster on large scale systems where this matters. The risk of at least one collision here is about 1/80 million among 10 IDs, 1/850k among 100 IDs, and still only 1/8.5k among 1000 IDs, which remains very low compared to IPv4 where all IDs are reused every 4 to 80ms on a 10 Gbps flow depending on packet sizes. Reported-by: Amit Klein <aksecurity@gmail.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210529110746.6796-1-w@1wt.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-29netfilter: nf_tables: add and use nft_sk helperFlorian Westphal1-1/+1
This allows to change storage placement later on without changing readers. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-29netfilter: x_tables: reduce xt_action_param by 8 byteFlorian Westphal1-1/+1
The fragment offset in ipv4/ipv6 is a 16bit field, so use u16 instead of unsigned int. On 64bit: 40 bytes to 32 bytes. By extension this also reduces nft_pktinfo (56 to 48 byte). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-4/+3
cdc-wdm: s/kill_urbs/poison_urbs/ to fix build Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-21ipv6: record frag_max_size in atomic fragments in input pathFrancesco Ruggeri1-1/+3
Commit dbd1759e6a9c ("ipv6: on reassembly, record frag_max_size") filled the frag_max_size field in IP6CB in the input path. The field should also be filled in case of atomic fragments. Fixes: dbd1759e6a9c ('ipv6: on reassembly, record frag_max_size') Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-19net: Add notifications when multipath hash field changeIdo Schimmel1-1/+17
In-kernel notifications are already sent when the multipath hash policy itself changes, but not when the multipath hash fields change. Add these notifications, so that interested listeners (e.g., switch ASIC drivers) could perform the necessary configuration. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18ipv6: Add custom multipath hash policyIdo Schimmel2-1/+127
Add a new multipath hash policy where the packet fields used for hash calculation are determined by user space via the fib_multipath_hash_fields sysctl that was introduced in the previous patch. The current set of available packet fields includes both outer and inner fields, which requires two invocations of the flow dissector. Avoid unnecessary dissection of the outer or inner flows by skipping dissection if none of the outer or inner fields are required. In accordance with the existing policies, when an skb is not available, packet fields are extracted from the provided flow key. In which case, only outer fields are considered. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18ipv6: Add a sysctl to control multipath hash fieldsIdo Schimmel2-0/+17
A subsequent patch will add a new multipath hash policy where the packet fields used for multipath hash calculation are determined by user space. This patch adds a sysctl that allows user space to set these fields. The packet fields are represented using a bitmask and are common between IPv4 and IPv6 to allow user space to use the same numbering across both protocols. For example, to hash based on standard 5-tuple: # sysctl -w net.ipv6.fib_multipath_hash_fields=0x0037 net.ipv6.fib_multipath_hash_fields = 0x0037 To avoid introducing holes in 'struct netns_sysctl_ipv6', move the 'bindv6only' field after the multipath hash fields. The kernel rejects unknown fields, for example: # sysctl -w net.ipv6.fib_multipath_hash_fields=0x1000 sysctl: setting key "net.ipv6.fib_multipath_hash_fields": Invalid argument Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18ipv6: Calculate multipath hash inside switch statementIdo Schimmel1-2/+4
A subsequent patch will add another multipath hash policy where the multipath hash is calculated directly by the policy specific code and not outside of the switch statement. Prepare for this change by moving the multipath hash calculation inside the switch statement. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18ipv6: Use a more suitable label nameIdo Schimmel1-2/+2
The 'out_timer' label was added in commit 63152fc0de4d ("[NETNS][IPV6] ip6_fib - gc timer per namespace") when the timer was allocated on the heap. Commit 417f28bb3407 ("netns: dont alloc ipv6 fib timer list") removed the allocation, but kept the label name. Rename it to a more suitable name. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17mld: fix panic in mld_newpack()Taehee Yoo1-3/+0
mld_newpack() doesn't allow to allocate high order page, only order-0 allocation is allowed. If headroom size is too large, a kernel panic could occur in skb_put(). Test commands: ip netns del A ip netns del B ip netns add A ip netns add B ip link add veth0 type veth peer name veth1 ip link set veth0 netns A ip link set veth1 netns B ip netns exec A ip link set lo up ip netns exec A ip link set veth0 up ip netns exec A ip -6 a a 2001:db8:0::1/64 dev veth0 ip netns exec B ip link set lo up ip netns exec B ip link set veth1 up ip netns exec B ip -6 a a 2001:db8:0::2/64 dev veth1 for i in {1..99} do let A=$i-1 ip netns exec A ip link add ip6gre$i type ip6gre \ local 2001:db8:$A::1 remote 2001:db8:$A::2 encaplimit 100 ip netns exec A ip -6 a a 2001:db8:$i::1/64 dev ip6gre$i ip netns exec A ip link set ip6gre$i up ip netns exec B ip link add ip6gre$i type ip6gre \ local 2001:db8:$A::2 remote 2001:db8:$A::1 encaplimit 100 ip netns exec B ip -6 a a 2001:db8:$i::2/64 dev ip6gre$i ip netns exec B ip link set ip6gre$i up done Splat looks like: kernel BUG at net/core/skbuff.c:110! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.12.0+ #891 Workqueue: ipv6_addrconf addrconf_dad_work RIP: 0010:skb_panic+0x15d/0x15f Code: 92 fe 4c 8b 4c 24 10 53 8b 4d 70 45 89 e0 48 c7 c7 00 ae 79 83 41 57 41 56 41 55 48 8b 54 24 a6 26 f9 ff <0f> 0b 48 8b 6c 24 20 89 34 24 e8 4a 4e 92 fe 8b 34 24 48 c7 c1 20 RSP: 0018:ffff88810091f820 EFLAGS: 00010282 RAX: 0000000000000089 RBX: ffff8881086e9000 RCX: 0000000000000000 RDX: 0000000000000089 RSI: 0000000000000008 RDI: ffffed1020123efb RBP: ffff888005f6eac0 R08: ffffed1022fc0031 R09: ffffed1022fc0031 R10: ffff888117e00187 R11: ffffed1022fc0030 R12: 0000000000000028 R13: ffff888008284eb0 R14: 0000000000000ed8 R15: 0000000000000ec0 FS: 0000000000000000(0000) GS:ffff888117c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8b801c5640 CR3: 0000000033c2c006 CR4: 00000000003706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600 ? ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600 skb_put.cold.104+0x22/0x22 ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600 ? rcu_read_lock_sched_held+0x91/0xc0 mld_newpack+0x398/0x8f0 ? ip6_mc_hdr.isra.26.constprop.46+0x600/0x600 ? lock_contended+0xc40/0xc40 add_grhead.isra.33+0x280/0x380 add_grec+0x5ca/0xff0 ? mld_sendpack+0xf40/0xf40 ? lock_downgrade+0x690/0x690 mld_send_initial_cr.part.34+0xb9/0x180 ipv6_mc_dad_complete+0x15d/0x1b0 addrconf_dad_completed+0x8d2/0xbb0 ? lock_downgrade+0x690/0x690 ? addrconf_rs_timer+0x660/0x660 ? addrconf_dad_work+0x73c/0x10e0 addrconf_dad_work+0x73c/0x10e0 Allowing high order page allocation could fix this problem. Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14tcp: add tracepoint for checksum errorsJakub Kicinski1-0/+2
Add a tracepoint for capturing TCP segments with a bad checksum. This makes it easy to identify sources of bad frames in the fleet (e.g. machines with faulty NICs). It should also help tools like IOvisor's tcpdrop.py which are used today to get detailed information about such packets. We don't have a socket in many cases so we must open code the address extraction based just on the skb. v2: add missing export for ipv6=m Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-29net: Remove redundant assignment to errYang Li1-3/+0
Variable 'err' is set to -ENOMEM but this value is never read as it is overwritten with a new value later on, hence the 'If statements' and assignments are redundantand and can be removed. Cleans up the following clang-analyzer warning: net/ipv6/seg6.c:126:4: warning: Value stored to 'err' is never read [clang-analyzer-deadcode.DeadStores] Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-29seg6: add counters support for SRv6 BehaviorsAndrea Mayer1-2/+196
This patch provides counters for SRv6 Behaviors as defined in [1], section 6. For each SRv6 Behavior instance, counters defined in [1] are: - the total number of packets that have been correctly processed; - the total amount of traffic in bytes of all packets that have been correctly processed; In addition, this patch introduces a new counter that counts the number of packets that have NOT been properly processed (i.e. errors) by an SRv6 Behavior instance. Counters are not only interesting for network monitoring purposes (i.e. counting the number of packets processed by a given behavior) but they also provide a simple tool for checking whether a behavior instance is working as we expect or not. Counters can be useful for troubleshooting misconfigured SRv6 networks. Indeed, an SRv6 Behavior can silently drop packets for very different reasons (i.e. wrong SID configuration, interfaces set with SID addresses, etc) without any notification/message to the user. Due to the nature of SRv6 networks, diagnostic tools such as ping and traceroute may be ineffective: paths used for reaching a given router can be totally different from the ones followed by probe packets. In addition, paths are often asymmetrical and this makes it even more difficult to keep up with the journey of the packets and to understand which behaviors are actually processing our traffic. When counters are enabled on an SRv6 Behavior instance, it is possible to verify if packets are actually processed by such behavior and what is the outcome of the processing. Therefore, the counters for SRv6 Behaviors offer an non-invasive observability point which can be leveraged for both traffic monitoring and troubleshooting purposes. [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-counters Troubleshooting using SRv6 Behavior counters -------------------------------------------- Let's make a brief example to see how helpful counters can be for SRv6 networks. Let's consider a node where an SRv6 End Behavior receives an SRv6 packet whose Segment Left (SL) is equal to 0. In this case, the End Behavior (which accepts only packets with SL >= 1) discards the packet and increases the error counter. This information can be leveraged by the network operator for troubleshooting. Indeed, the error counter is telling the user that the packet: (i) arrived at the node; (ii) the packet has been taken into account by the SRv6 End behavior; (iii) but an error has occurred during the processing. The error (iii) could be caused by different reasons, such as wrong route settings on the node or due to an invalid SID List carried by the SRv6 packet. Anyway, the error counter is used to exclude that the packet did not arrive at the node or it has not been processed by the behavior at all. Turning on/off counters for SRv6 Behaviors ------------------------------------------ Each SRv6 Behavior instance can be configured, at the time of its creation, to make use of counters. This is done through iproute2 which allows the user to create an SRv6 Behavior instance specifying the optional "count" attribute as shown in the following example: $ ip -6 route add 2001:db8::1 encap seg6local action End count dev eth0 per-behavior counters can be shown by adding "-s" to the iproute2 command line, i.e.: $ ip -s -6 route show 2001:db8::1 2001:db8::1 encap seg6local action End packets 0 bytes 0 errors 0 dev eth0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Impact of counters for SRv6 Behaviors on performance ==================================================== To determine the performance impact due to the introduction of counters in the SRv6 Behavior subsystem, we have carried out extensive tests. We chose to test the throughput achieved by the SRv6 End.DX2 Behavior because, among all the other behaviors implemented so far, it reaches the highest throughput which is around 1.5 Mpps (per core at 2.4 GHz on a Xeon(R) CPU E5-2630 v3) on kernel 5.12-rc2 using packets of size ~ 100 bytes. Three different tests were conducted in order to evaluate the overall throughput of the SRv6 End.DX2 Behavior in the following scenarios: 1) vanilla kernel (without the SRv6 Behavior counters patch) and a single instance of an SRv6 End.DX2 Behavior; 2) patched kernel with SRv6 Behavior counters and a single instance of an SRv6 End.DX2 Behavior with counters turned off; 3) patched kernel with SRv6 Behavior counters and a single instance of SRv6 End.DX2 Behavior with counters turned on. All tests were performed on a testbed deployed on the CloudLab facilities [2], a flexible infrastructure dedicated to scientific research on the future of Cloud Computing. Results of tests are shown in the following table: Scenario (1): average 1504764,81 pps (~1504,76 kpps); std. dev 3956,82 pps Scenario (2): average 1501469,78 pps (~1501,47 kpps); std. dev 2979,85 pps Scenario (3): average 1501315,13 pps (~1501,32 kpps); std. dev 2956,00 pps As can be observed, throughputs achieved in scenarios (2),(3) did not suffer any observable degradation compared to scenario (1). Thanks to Jakub Kicinski and David Ahern for their valuable suggestions and comments provided during the discussion of the proposed RFCs. [2] https://www.cloudlab.us Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-27net: bridge: mcast: fix broken length + header check for MRDv6 Adv.Linus Lüssing1-5/+7
The IPv6 Multicast Router Advertisements parsing has the following two issues: For one thing, ICMPv6 MRD Advertisements are smaller than ICMPv6 MLD messages (ICMPv6 MRD Adv.: 8 bytes vs. ICMPv6 MLDv1/2: >= 24 bytes, assuming MLDv2 Reports with at least one multicast address entry). When ipv6_mc_check_mld_msg() tries to parse an Multicast Router Advertisement its MLD length check will fail - and it will wrongly return -EINVAL, even if we have a valid MRD Advertisement. With the returned -EINVAL the bridge code will assume a broken packet and will wrongly discard it, potentially leading to multicast packet loss towards multicast routers. The second issue is the MRD header parsing in br_ip6_multicast_mrd_rcv(): It wrongly checks for an ICMPv6 header immediately after the IPv6 header (IPv6 next header type). However according to RFC4286, section 2 all MRD messages contain a Router Alert option (just like MLD). So instead there is an IPv6 Hop-by-Hop option for the Router Alert between the IPv6 and ICMPv6 header, again leading to the bridge wrongly discarding Multicast Router Advertisements. To fix these two issues, introduce a new return value -ENODATA to ipv6_mc_check_mld() to indicate a valid ICMPv6 packet with a hop-by-hop option which is not an MLD but potentially an MRD packet. This also simplifies further parsing in the bridge code, as ipv6_mc_check_mld() already fully checks the ICMPv6 header and hop-by-hop option. These issues were found and fixed with the help of the mrdisc tool (https://github.com/troglobit/mrdisc). Fixes: 4b3087c7e37f ("bridge: Snoop Multicast Router Advertisements") Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-26netfilter: allow to turn off xtables compat layerFlorian Westphal1-8/+8
The compat layer needs to parse untrusted input (the ruleset) to translate it to a 64bit compatible format. We had a number of bugs in this department in the past, so allow users to turn this feature off. Add CONFIG_NETFILTER_XTABLES_COMPAT kconfig knob and make it default to y to keep existing behaviour. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: ip6_tables: pass table pointer via nf_hook_opsFlorian Westphal6-54/+61
Same patch as the ip_tables one: removal of all accesses to ip6_tables xt_table pointers. After this patch the struct net xt_table anchors can be removed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: xt_nat: pass table to hookfnFlorian Westphal1-10/+35
This changes how ip(6)table nat passes the ruleset/table to the evaluation loop. At the moment, it will fetch the table from struct net. This change stores the table in the hook_ops 'priv' argument instead. This requires to duplicate the hook_ops for each netns, so they can store the (per-net) xt_table structure. The dupliated nat hook_ops get stored in net_generic data area. They are free'd in the namespace exit path. This is a pre-requisite to remove the xt_table/ruleset pointers from struct net. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: x_tables: remove paranoia testsFlorian Westphal5-15/+0
No need for these. There is only one caller, the xtables core, when the table is registered for the first time with a particular network namespace. After ->table_init() call, the table is linked into the tables[af] list, so next call to that function will skip the ->table_init(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: ip6tables: unregister the tables by nameFlorian Westphal6-33/+22
Same as the previous patch, but for ip6tables. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: x_tables: remove ipt_unregister_tableFlorian Westphal2-10/+1
Its the same function as ipt_unregister_table_exit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: disable defrag once its no longer neededFlorian Westphal1-6/+23
When I changed defrag hooks to no longer get registered by default I intentionally made it so that registration can only be un-done by unloading the nf_defrag_ipv4/6 module. In hindsight this was too conservative; there is no reason to keep defrag on while there is no feature dependency anymore. Moreover, this won't work if user isn't allowed to remove nf_defrag module. This adds the disable() functions for both ipv4 and ipv6 and calls them from conntrack, TPROXY and the xtables socket module. ipvs isn't converted here, it will behave as before this patch and will need module removal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-0/+2
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Add vlan match and pop actions to the flowtable offload, patches from wenxu. 2) Reduce size of the netns_ct structure, which itself is embedded in struct net Make netns_ct a read-mostly structure. Patches from Florian Westphal. 3) Add FLOW_OFFLOAD_XMIT_UNSPEC to skip dst check from garbage collector path, as required by the tc CT action. From Roi Dayan. 4) VLAN offload fixes for nftables: Allow for matching on both s-vlan and c-vlan selectors. Fix match of VLAN id due to incorrect byteorder. Add a new routine to properly populate flow dissector ethertypes. 5) Missing keys in ip{6}_route_me_harder() results in incorrect routes. This includes an update for selftest infra. Patches from Ido Schimmel. 6) Add counter hardware offload support through FLOW_CLS_STATS. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>