summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2022-01-07af_packet: fix tracking issues in packet_do_bind()Eric Dumazet1-19/+8
It appears that my changes in packet_do_bind() were slightly wrong. syzbot found that calling bind() twice would trigger a false positive. Remove proto_curr/dev_curr variables and rewrite things to be less confusing (like not having to use netdev_tracker_alloc(), and instead use the standard dev_hold_track()) Fixes: f1d9268e0618 ("net: add net device refcount tracker to struct packet_type") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220107183953.3886647-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07mptcp: reuse __mptcp_make_csum in validate_data_csumGeliang Tang1-9/+6
This patch reused __mptcp_make_csum() in validate_data_csum() instead of open-coding. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07mptcp: change the parameter of __mptcp_make_csumGeliang Tang2-4/+5
This patch changed the type of the last parameter of __mptcp_make_csum() from __sum16 to __wsum. And export this function in protocol.h. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07net/tls: Fix skb memory leak when running kTLS trafficGal Pressman1-0/+1
The cited Fixes commit introduced a memory leak when running kTLS traffic (with/without hardware offloads). I'm running nginx on the server side and wrk on the client side and get the following: unreferenced object 0xffff8881935e9b80 (size 224): comm "softirq", pid 0, jiffies 4294903611 (age 43.204s) hex dump (first 32 bytes): 80 9b d0 36 81 88 ff ff 00 00 00 00 00 00 00 00 ...6............ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000efe2a999>] build_skb+0x1f/0x170 [<00000000ef521785>] mlx5e_skb_from_cqe_mpwrq_linear+0x2bc/0x610 [mlx5_core] [<00000000945d0ffe>] mlx5e_handle_rx_cqe_mpwrq+0x264/0x9e0 [mlx5_core] [<00000000cb675b06>] mlx5e_poll_rx_cq+0x3ad/0x17a0 [mlx5_core] [<0000000018aac6a9>] mlx5e_napi_poll+0x28c/0x1b60 [mlx5_core] [<000000001f3369d1>] __napi_poll+0x9f/0x560 [<00000000cfa11f72>] net_rx_action+0x357/0xa60 [<000000008653b8d7>] __do_softirq+0x282/0x94e [<00000000644923c6>] __irq_exit_rcu+0x11f/0x170 [<00000000d4085f8f>] irq_exit_rcu+0xa/0x20 [<00000000d412fef4>] common_interrupt+0x7d/0xa0 [<00000000bfb0cebc>] asm_common_interrupt+0x1e/0x40 [<00000000d80d0890>] default_idle+0x53/0x70 [<00000000f2b9780e>] default_idle_call+0x8c/0xd0 [<00000000c7659e15>] do_idle+0x394/0x450 I'm not familiar with these areas of the code, but I've added this sk_defer_free_flush() to tls_sw_recvmsg() based on a hunch and it resolved the issue. Fixes: f35f821935d8 ("tcp: defer skb freeing after socket lock is released") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220102081253.9123-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07sch_cake: revise Diffserv docsKevin Bracey1-20/+20
Documentation incorrectly stated that CS1 is equivalent to LE for diffserv8. But when LE was added to the table, CS1 was pushed into tin 1, leaving only LE in tin 0. Also "TOS1" no longer exists, as that is the same codepoint as LE. Make other tweaks properly distinguishing codepoints from classes and putting current Diffserve codepoints ahead of legacy ones. Signed-off-by: Kevin Bracey <kevin@bracey.fi> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://lore.kernel.org/r/20220106215637.3132391-1-kevin@bracey.fi Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07mptcp: avoid atomic bit manipulation when possiblePaolo Abeni3-31/+38
Currently the msk->flags bitmask carries both state for the mptcp_release_cb() - mostly touched under the mptcp data lock - and others state info touched even outside such lock scope. As a consequence, msk->flags is always manipulated with atomic operations. This change splits such bitmask in two separate fields, so that we use plain bit operations when touching the cb-related info. The MPTCP_PUSH_PENDING bit needs additional care, as it is the only CB related field currently accessed either under the mptcp data lock or the mptcp socket lock. Let's add another mask just for such bit's sake. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: cleanup MPJ subflow list handlingPaolo Abeni5-104/+60
We can simplify the join list handling leveraging the mptcp_release_cb(): if we can acquire the msk socket lock at mptcp_finish_join time, move the new subflow directly into the conn_list, otherwise place it on join_list and let the release_cb process such list. Since pending MPJ connection are now always processed in a timely way, we can avoid flushing the join list every time we have to process all the current subflows. Additionally we can now use the mptcp data lock to protect the join_list, removing the additional spin lock. Finally, the MPJ handshake is now always finalized under the msk socket lock, we can drop the additional synchronization between mptcp_finish_join() and mptcp_close(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: do not block subflows creation on errorsPaolo Abeni4-31/+71
If the MPTCP configuration allows for multiple subflows creation, and the first additional subflows never reach the fully established status - e.g. due to packets drop or reset - the in kernel path manager do not move to the next subflow. This patch introduces a new PM helper to cope with MPJ subflow creation failure and delay and hook it where appropriate. Such helper triggers additional subflow creation, as needed and updates the PM subflow counter, if the current one is closing. Additionally start all the needed additional subflows as soon as the MPTCP socket is fully established, so we don't have to cope with slow MPJ handshake blocking the next subflow creation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: keep track of local endpoint still available for each mskPaolo Abeni4-54/+87
Include into the path manager status a bitmap tracking the list of local endpoints still available - not yet used - for the relevant mptcp socket. Keep such map updated at endpoint creation/deletion time, so that we can easily skip already used endpoint at local address selection time. The endpoint used by the initial subflow is lazyly accounted at subflow creation time: the usage bitmap is be up2date before endpoint selection and we avoid such unneeded task in some relevant scenarios - e.g. busy servers accepting incoming subflows but not creating any additional ones nor annuncing additional addresses. Overall this allows for fair local endpoints usage in case of subflow failure. As a side effect, this patch also enforces that each endpoint is used at most once for each mptcp connection. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: clean-up MPJ option writingPaolo Abeni1-21/+23
Check for all MPJ variant at once, this reduces the number of conditionals traversed on average and will simplify the next patch. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: fix per socket endpoint accountingPaolo Abeni1-6/+12
Since full-mesh endpoint support, the reception of a single ADD_ADDR option can cause multiple subflows creation. When such option is accepted we increment 'add_addr_accepted' by one. When we received a paired RM_ADDR option, we deleted all the relevant subflows, decrementing 'add_addr_accepted' by one for each of them. We have a similar issue for 'local_addr_used' Fix them moving the pm endpoint accounting outside the subflow traversal. Fixes: 1a0d6136c5f0 ("mptcp: local addresses fullmesh") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: implement support for user-space disconnectPaolo Abeni1-4/+13
Handle explicitly AF_UNSPEC in mptcp_stream_connnect() to allow user-space to disconnect established MPTCP connections Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: cleanup accept and pollPaolo Abeni3-20/+7
After the previous patch, msk->subflow will never be deleted during the whole msk lifetime. We don't need anymore to acquire references to it in mptcp_stream_accept() and we can use the listener subflow accept queue to simplify mptcp_poll() for listener socket. Overall this removes a lock pair and 4 more atomic operations per accept(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: full disconnect implementationPaolo Abeni4-28/+98
The current mptcp_disconnect() implementation lacks several steps, we additionally need to reset the msk socket state and flush the subflow list. Factor out the needed helper to avoid code duplication. Additionally ensure that the initial subflow is disposed only after mptcp_close(), just reset it at disconnect time. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: implement fastclose xmit pathPaolo Abeni2-12/+46
Allow the MPTCP xmit path to add MP_FASTCLOSE suboption on RST egress packets. Additionally reorder related options writing to reduce the number of conditionals required in the fast path. Co-developed-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: keep snd_una updated for fallback socketPaolo Abeni1-0/+1
After shutdown, for fallback MPTCP sockets, we always have write_seq == snd_una+1 The above will foul OUTQ ioctl(). Keep snd_una in sync with write_seq even after shutdown. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski13-50/+189
Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-01-06 We've added 41 non-merge commits during the last 2 day(s) which contain a total of 36 files changed, 1214 insertions(+), 368 deletions(-). The main changes are: 1) Various fixes in the verifier, from Kris and Daniel. 2) Fixes in sockmap, from John. 3) bpf_getsockopt fix, from Kuniyuki. 4) INET_POST_BIND fix, from Menglong. 5) arm64 JIT fix for bpf pseudo funcs, from Hou. 6) BPF ISA doc improvements, from Christoph. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits) bpf: selftests: Add bind retry for post_bind{4, 6} bpf: selftests: Use C99 initializers in test_sock.c net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() bpf/selftests: Test bpf_d_path on rdonly_mem. libbpf: Add documentation for bpf_map batch operations selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3 xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames xdp: Move conversion to xdp_frame out of map functions page_pool: Store the XDP mem id page_pool: Add callback to init pages when they are allocated xdp: Allow registering memory model without rxq reference samples/bpf: xdpsock: Add timestamp for Tx-only operation samples/bpf: xdpsock: Add time-out for cleaning Tx samples/bpf: xdpsock: Add sched policy and priority support samples/bpf: xdpsock: Add cyclic TX operation capability samples/bpf: xdpsock: Add clockid selection support samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation samples/bpf: xdpsock: Add VLAN support for Tx-only operation libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API libbpf 1.0: Deprecate bpf_map__is_offload_neutral() ... ==================== Link: https://lore.kernel.org/r/20220107013626.53943-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()Menglong Dong8-0/+10
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
2022-01-06ethtool: use phydev variableTom Rix1-4/+4
In ethtool_get_phy_stats(), the phydev varaible is set to dev->phydev but dev->phydev is still used. Replace dev->phydev uses with phydev. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06gro: add ability to control gro max packet sizeCoco Li3-1/+28
Eric Dumazet suggested to allow users to modify max GRO packet size. We have seen GRO being disabled by users of appliances (such as wifi access points) because of claimed bufferbloat issues, or some work arounds in sch_cake, to split GRO/GSO packets. Instead of disabling GRO completely, one can chose to limit the maximum packet size of GRO packets, depending on their latency constraints. This patch adds a per device gro_max_size attribute that can be changed with ip link command. ip link set dev eth0 gro_max_size 16000 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: fix SOF_TIMESTAMPING_BIND_PHC to work with multiple socketsMiroslav Lichvar1-3/+6
When multiple sockets using the SOF_TIMESTAMPING_BIND_PHC flag received a packet with a hardware timestamp (e.g. multiple PTP instances in different PTP domains using the UDPv4/v6 multicast or L2 transport), the timestamps received on some sockets were corrupted due to repeated conversion of the same timestamp (by the same or different vclocks). Fix ptp_convert_timestamp() to not modify the shared skb timestamp and return the converted timestamp as a ktime_t instead. If the conversion fails, return 0 to not confuse the application with timestamps corresponding to an unexpected PHC. Fixes: d7c088265588 ("net: socket: support hardware timestamp conversion to PHC bound") Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Cc: Yangbo Lu <yangbo.lu@nxp.com> Cc: Richard Cochran <richardcochran@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: setup master before portsVladimir Oltean1-10/+13
It is said that as soon as a network interface is registered, all its resources should have already been prepared, so that it is available for sending and receiving traffic. One of the resources needed by a DSA slave interface is the master. dsa_tree_setup -> dsa_tree_setup_ports -> dsa_port_setup -> dsa_slave_create -> register_netdevice -> dsa_tree_setup_master -> dsa_master_setup -> sets up master->dsa_ptr, which enables reception Therefore, there is a short period of time after register_netdevice() during which the master isn't prepared to pass traffic to the DSA layer (master->dsa_ptr is checked by eth_type_trans). Same thing during unregistration, there is a time frame in which packets might be missed. Note that this change opens us to another race: dsa_master_find_slave() will get invoked potentially earlier than the slave creation, and later than the slave deletion. Since dp->slave starts off as a NULL pointer, the earlier calls aren't a problem, but the later calls are. To avoid use-after-free, we should zeroize dp->slave before calling dsa_slave_destroy(). In practice I cannot really test real life improvements brought by this change, since in my systems, netdevice creation races with PHY autoneg which takes a few seconds to complete, and that masks quite a few races. Effects might be noticeable in a setup with fixed links all the way to an external system. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: first set up shared ports, then non-shared portsVladimir Oltean1-13/+37
After commit a57d8c217aad ("net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports"), the port setup and teardown procedure became asymmetric. The fact of the matter is that user ports need the shared ports to be up before they can be used for CPU-initiated termination. And since we register net devices for the user ports, those won't be functional until we also call the setup for the shared (CPU, DSA) ports. But we may do that later, depending on the port numbering scheme of the hardware we are dealing with. It just makes sense that all shared ports are brought up before any user port is. I can't pinpoint any issue due to the current behavior, but let's change it nonetheless, for consistency's sake. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}Vladimir Oltean2-2/+10
DSA needs to simulate master tracking events when a binding is first with a DSA master established and torn down, in order to give drivers the simplifying guarantee that ->master_state_change calls are made only when the master's readiness state to pass traffic changes. master_state_change() provide a operational bool that DSA driver can use to understand if DSA master is operational or not. To avoid races, we need to block the reception of NETDEV_UP/NETDEV_CHANGE/NETDEV_GOING_DOWN events in the netdev notifier chain while we are changing the master's dev->dsa_ptr (this changes what netdev_uses_dsa(dev) reports). The dsa_master_setup() and dsa_master_teardown() functions optionally require the rtnl_mutex to be held, if the tagger needs the master to be promiscuous, these functions call dev_set_promiscuity(). Move the rtnl_lock() from that function and make it top-level. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: stop updating master MTU from master.cVladimir Oltean1-24/+1
At present there are two paths for changing the MTU of the DSA master. The first is: dsa_tree_setup -> dsa_tree_setup_ports -> dsa_port_setup -> dsa_slave_create -> dsa_slave_change_mtu -> dev_set_mtu(master) The second is: dsa_tree_setup -> dsa_tree_setup_master -> dsa_master_setup -> dev_set_mtu(dev) So the dev_set_mtu() call from dsa_master_setup() has been effectively superseded by the dsa_slave_change_mtu(slave_dev, ETH_DATA_LEN) that is done from dsa_slave_create() for each user port. The later function also updates the master MTU according to the largest user port MTU from the tree. Therefore, updating the master MTU through a separate code path isn't needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: merge rtnl_lock sections in dsa_slave_createVladimir Oltean1-3/+1
Currently dsa_slave_create() has two sequences of rtnl_lock/rtnl_unlock in a row. Remove the rtnl_unlock() and rtnl_lock() in between, such that the operation can execute slighly faster. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: reorder PHY initialization with MTU setup in slave.cVladimir Oltean1-7/+7
In dsa_slave_create() there are 2 sections that take rtnl_lock(): MTU change and netdev registration. They are separated by PHY initialization. There isn't any strict ordering requirement except for the fact that netdev registration should be last. Therefore, we can perform the MTU change a bit later, after the PHY setup. A future change will then be able to merge the two rtnl_lock sections into one. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch 'master' of ↵David S. Miller8-8/+88
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-01-06 1) Fix some clang_analyzer warnings about never read variables. From luo penghao. 2) Check for pols[0] only once in xfrm_expand_policies(). From Jean Sacren. 3) The SA curlft.use_time was updated only on SA cration time. Update whenever the SA is used. From Antony Antony 4) Add support for SM3 secure hash. From Xu Jia. 5) Add support for SM4 symmetric cipher algorithm. From Xu Jia. 6) Add a rate limit for SA mapping change messages. From Antony Antony. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05xdp: Add xdp_do_redirect_frame() for pre-computed xdp_framesToke Høiland-Jørgensen1-11/+54
Add an xdp_do_redirect_frame() variant which supports pre-computed xdp_frame structures. This will be used in bpf_prog_run() to avoid having to write to the xdp_frame structure when the XDP program doesn't modify the frame boundaries. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-6-toke@redhat.com
2022-01-05xdp: Move conversion to xdp_frame out of map functionsToke Høiland-Jørgensen1-7/+17
All map redirect functions except XSK maps convert xdp_buff to xdp_frame before enqueueing it. So move this conversion of out the map functions and into xdp_do_redirect(). This removes a bit of duplicated code, but more importantly it makes it possible to support caller-allocated xdp_frame structures, which will be added in a subsequent commit. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-5-toke@redhat.com
2022-01-05page_pool: Store the XDP mem idToke Høiland-Jørgensen2-2/+4
Store the XDP mem ID inside the page_pool struct so it can be retrieved later for use in bpf_prog_run(). Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220103150812.87914-4-toke@redhat.com
2022-01-05page_pool: Add callback to init pages when they are allocatedToke Høiland-Jørgensen1-0/+2
Add a new callback function to page_pool that, if set, will be called every time a new page is allocated. This will be used from bpf_test_run() to initialise the page data with the data provided by userspace when running XDP programs with redirect turned on. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220103150812.87914-3-toke@redhat.com
2022-01-05xdp: Allow registering memory model without rxq referenceToke Høiland-Jørgensen1-30/+62
The functions that register an XDP memory model take a struct xdp_rxq as parameter, but the RXQ is not actually used for anything other than pulling out the struct xdp_mem_info that it embeds. So refactor the register functions and export variants that just take a pointer to the xdp_mem_info. This is in preparation for enabling XDP_REDIRECT in bpf_prog_run(), using a page_pool instance that is not connected to any network device. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-2-toke@redhat.com
2022-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski21-177/+269
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05bpf: Add SO_RCVBUF/SO_SNDBUF in _bpf_getsockopt().Kuniyuki Iwashima1-0/+6
This patch exposes SO_RCVBUF/SO_SNDBUF through bpf_getsockopt(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220104013153.97906-3-kuniyu@amazon.co.jp
2022-01-05bpf: Fix SO_RCVBUF/SO_SNDBUF handling in _bpf_setsockopt().Kuniyuki Iwashima1-0/+2
The commit 4057765f2dee ("sock: consistent handling of extreme SO_SNDBUF/SO_RCVBUF values") added a change to prevent underflow in setsockopt() around SO_SNDBUF/SO_RCVBUF. This patch adds the same change to _bpf_setsockopt(). Fixes: 4057765f2dee ("sock: consistent handling of extreme SO_SNDBUF/SO_RCVBUF values") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220104013153.97906-2-kuniyu@amazon.co.jp
2022-01-05bpf, sockmap: Fix double bpf_prog_put on error case in map_linkJohn Fastabend1-8/+13
sock_map_link() is called to update a sockmap entry with a sk. But, if the sock_map_init_proto() call fails then we return an error to the map_update op against the sockmap. In the error path though we need to cleanup psock and dec the refcnt on any programs associated with the map, because we refcnt them early in the update process to ensure they are pinned for the psock. (This avoids a race where user deletes programs while also updating the map with new socks.) In current code we do the prog refcnt dec explicitely by calling bpf_prog_put() when the program was found in the map. But, after commit '38207a5e81230' in this error path we've already done the prog to psock assignment so the programs have a reference from the psock as well. This then causes the psock tear down logic, invoked by sk_psock_put() in the error path, to similarly call bpf_prog_put on the programs there. To be explicit this logic does the prog->psock assignment: if (msg_*) psock_set_prog(...) Then the error path under the out_progs label does a similar check and dec with: if (msg_*) bpf_prog_put(...) And the teardown logic sk_psock_put() does ... psock_set_prog(msg_*, NULL) ... triggering another bpf_prog_put(...). Then KASAN gives us this splat, found by syzbot because we've created an inbalance between bpf_prog_inc and bpf_prog_put calling put twice on the program. BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline] BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline] kernel/bpf/syscall.c:1829 BUG: KASAN: vmalloc-out-of-bounds in bpf_prog_put+0x8c/0x4f0 kernel/bpf/syscall.c:1829 kernel/bpf/syscall.c:1829 Read of size 8 at addr ffffc90000e76038 by task syz-executor020/3641 To fix clean up error path so it doesn't try to do the bpf_prog_put in the error path once progs are assigned then it relies on the normal psock tear down logic to do complete cleanup. For completness we also cover the case whereh sk_psock_init_strp() fails, but this is not expected because it indicates an incorrect socket type and should be caught earlier. Fixes: 38207a5e8123 ("bpf, sockmap: Attach map progs to psock early for feature probes") Reported-by: syzbot+bb73e71cf4b8fd376a4f@syzkaller.appspotmail.com Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220104214645.290900-1-john.fastabend@gmail.com
2022-01-05bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()John Fastabend1-0/+27
Applications can be confused slightly because we do not always return the same error code as expected, e.g. what the TCP stack normally returns. For example on a sock err sk->sk_err instead of returning the sock_error we return EAGAIN. This usually means the application will 'try again' instead of aborting immediately. Another example, when a shutdown event is received we should immediately abort instead of waiting for data when the user provides a timeout. These tend to not be fatal, applications usually recover, but introduces bogus errors to the user or introduces unexpected latency. Before 'c5d2177a72a16' we fell back to the TCP stack when no data was available so we managed to catch many of the cases here, although with the extra latency cost of calling tcp_msg_wait_data() first. To fix lets duplicate the error handling in TCP stack into tcp_bpf so that we get the same error codes. These were found in our CI tests that run applications against sockmap and do longer lived testing, at least compared to test_sockmap that does short-lived ping/pong tests, and in some of our test clusters we deploy. Its non-trivial to do these in a shorter form CI tests that would be appropriate for BPF selftests, but we are looking into it so we can ensure this keeps working going forward. As a preview one idea is to pull in the packetdrill testing which catches some of this. Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220104205918.286416-1-john.fastabend@gmail.com
2022-01-05netlink: do not allocate a device refcount tracker in ethnl_default_notify()Eric Dumazet1-1/+0
As reported by Johannes, the tracker allocated in ethnl_default_notify() is not really needed, as this function is not expected to change a device reference count. Fixes: e4b8954074f6 ("netlink: add net device refcount tracker to struct ethnl_req_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Johannes Berg <johannes@sipsolutions.net> Tested-by: Johannes Berg <johannes@sipsolutions.net> Link: https://lore.kernel.org/r/20220105170849.2610470-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05net/sched: add missing tracker information in qdisc_create()Eric Dumazet1-1/+1
qdisc_create() error path needs to use dev_put_track() because qdisc_alloc() allocated the tracker. Fixes: 606509f27f67 ("net/sched: add net device refcount tracker to struct Qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220104170439.3790052-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05net: dsa: remove cross-chip support for HSRVladimir Oltean3-49/+13
The cross-chip notifiers for HSR are bypass operations, meaning that even though all switches in a tree are notified, only the switch specified in the info structure is targeted. We can eliminate the unnecessary complexity by deleting the cross-chip notifier logic and calling the ds->ops straight from port.c. Cc: George McCollister <george.mccollister@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: George McCollister <george.mccollister@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: remove cross-chip support for MRPVladimir Oltean3-106/+20
The cross-chip notifiers for MRP are bypass operations, meaning that even though all switches in a tree are notified, only the switch specified in the info structure is targeted. We can eliminate the unnecessary complexity by deleting the cross-chip notifier logic and calling the ds->ops straight from port.c. Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: fix incorrect function pointer check for MRP ring rolesVladimir Oltean1-2/+2
The cross-chip notifier boilerplate code meant to check the presence of ds->ops->port_mrp_add_ring_role before calling it, but checked ds->ops->port_mrp_add instead, before calling ds->ops->port_mrp_add_ring_role. Therefore, a driver which implements one operation but not the other would trigger a NULL pointer dereference. There isn't any such driver in DSA yet, so there is no reason to backport the change. Issue found through code inspection. Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Fixes: c595c4330da0 ("net: dsa: add MRP support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: make dsa_switch :: num_ports an unsigned intVladimir Oltean1-1/+1
Currently, num_ports is declared as size_t, which is defined as __kernel_ulong_t, therefore it occupies 8 bytes of memory. Even switches with port numbers in the range of tens are exotic, so there is no need for this amount of storage. Additionally, because the max_num_bridges member right above it is also 4 bytes, it means the compiler needs to add padding between the last 2 fields. By reducing the size, we don't need that padding and can reduce the struct size. Before: pahole -C dsa_switch net/dsa/slave.o struct dsa_switch { struct device * dev; /* 0 8 */ struct dsa_switch_tree * dst; /* 8 8 */ unsigned int index; /* 16 4 */ u32 setup:1; /* 20: 0 4 */ u32 vlan_filtering_is_global:1; /* 20: 1 4 */ u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */ u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */ u32 untag_bridge_pvid:1; /* 20: 4 4 */ u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */ u32 vlan_filtering:1; /* 20: 6 4 */ u32 pcs_poll:1; /* 20: 7 4 */ u32 mtu_enforcement_ingress:1; /* 20: 8 4 */ /* XXX 23 bits hole, try to pack */ struct notifier_block nb; /* 24 24 */ /* XXX last struct has 4 bytes of padding */ void * priv; /* 48 8 */ void * tagger_data; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_chip_data * cd; /* 64 8 */ const struct dsa_switch_ops * ops; /* 72 8 */ u32 phys_mii_mask; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ struct mii_bus * slave_mii_bus; /* 88 8 */ unsigned int ageing_time_min; /* 96 4 */ unsigned int ageing_time_max; /* 100 4 */ struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */ struct devlink * devlink; /* 112 8 */ unsigned int num_tx_queues; /* 120 4 */ unsigned int num_lag_ids; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int max_num_bridges; /* 128 4 */ /* XXX 4 bytes hole, try to pack */ size_t num_ports; /* 136 8 */ /* size: 144, cachelines: 3, members: 27 */ /* sum members: 132, holes: 2, sum holes: 8 */ /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 16 bytes */ }; After: pahole -C dsa_switch net/dsa/slave.o struct dsa_switch { struct device * dev; /* 0 8 */ struct dsa_switch_tree * dst; /* 8 8 */ unsigned int index; /* 16 4 */ u32 setup:1; /* 20: 0 4 */ u32 vlan_filtering_is_global:1; /* 20: 1 4 */ u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */ u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */ u32 untag_bridge_pvid:1; /* 20: 4 4 */ u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */ u32 vlan_filtering:1; /* 20: 6 4 */ u32 pcs_poll:1; /* 20: 7 4 */ u32 mtu_enforcement_ingress:1; /* 20: 8 4 */ /* XXX 23 bits hole, try to pack */ struct notifier_block nb; /* 24 24 */ /* XXX last struct has 4 bytes of padding */ void * priv; /* 48 8 */ void * tagger_data; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_chip_data * cd; /* 64 8 */ const struct dsa_switch_ops * ops; /* 72 8 */ u32 phys_mii_mask; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ struct mii_bus * slave_mii_bus; /* 88 8 */ unsigned int ageing_time_min; /* 96 4 */ unsigned int ageing_time_max; /* 100 4 */ struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */ struct devlink * devlink; /* 112 8 */ unsigned int num_tx_queues; /* 120 4 */ unsigned int num_lag_ids; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int max_num_bridges; /* 128 4 */ unsigned int num_ports; /* 132 4 */ /* size: 136, cachelines: 3, members: 27 */ /* sum members: 128, holes: 1, sum holes: 4 */ /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 8 bytes */ }; Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04Merge tag 'mac80211-for-net-2022-01-04' of ↵Jakub Kicinski4-82/+55
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Two more changes: - mac80211: initialize a variable to avoid using it uninitialized - mac80211 mesh: put some data structures into the container to fix bugs with and not have to deal with allocation failures * tag 'mac80211-for-net-2022-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211: mac80211: mesh: embedd mesh_paths and mpp_paths into ieee80211_if_mesh mac80211: initialize variable have_higher_than_11mbit ==================== Link: https://lore.kernel.org/r/20220104144449.64937-1-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-04mac80211: use ieee80211_bss_get_elem()Johannes Berg1-7/+7
Instead of ieee80211_bss_get_ie(), use the more typed ieee80211_bss_get_elem(). Link: https://lore.kernel.org/r/20211220113609.56f8e2a70152.Id5a56afb8a4f9b38d10445e5a1874e93e84b5251@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: Add stations iterator where the iterator function may sleepMartin Blumenstingl1-0/+13
ieee80211_iterate_active_interfaces() and ieee80211_iterate_active_interfaces_atomic() already exist, where the former allows the iterator function to sleep. Add ieee80211_iterate_stations() which is similar to ieee80211_iterate_stations_atomic() but allows the iterator to sleep. This is needed for adding SDIO support to the rtw88 driver. Some interators there are reading or writing registers. With the SDIO ops (sdio_readb, sdio_writeb and friends) this means that the iterator function may sleep. Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Link: https://lore.kernel.org/r/20211228211501.468981-2-martin.blumenstingl@googlemail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: allow non-standard VHT MCS-10/11Ping-Ke Shih1-1/+1
Some AP can possibly try non-standard VHT rate and mac80211 warns and drops packets, and leads low TCP throughput. Rate marked as a VHT rate but data is invalid: MCS: 10, NSS: 2 WARNING: CPU: 1 PID: 7817 at net/mac80211/rx.c:4856 ieee80211_rx_list+0x223/0x2f0 [mac8021 Since commit c27aa56a72b8 ("cfg80211: add VHT rate entries for MCS-10 and MCS-11") has added, mac80211 adds this support as well. After this patch, throughput is good and iw can get the bitrate: rx bitrate: 975.1 MBit/s VHT-MCS 10 80MHz short GI VHT-NSS 2 or rx bitrate: 1083.3 MBit/s VHT-MCS 11 80MHz short GI VHT-NSS 2 Buglink: https://bugzilla.suse.com/show_bug.cgi?id=1192891 Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/r/20220103013623.17052-1-pkshih@realtek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: mesh: embedd mesh_paths and mpp_paths into ieee80211_if_meshPavel Skripkin3-81/+54
Syzbot hit NULL deref in rhashtable_free_and_destroy(). The problem was in mesh_paths and mpp_paths being NULL. mesh_pathtbl_init() could fail in case of memory allocation failure, but nobody cared, since ieee80211_mesh_init_sdata() returns void. It led to leaving 2 pointers as NULL. Syzbot has found null deref on exit path, but it could happen anywhere else, because code assumes these pointers are valid. Since all ieee80211_*_setup_sdata functions are void and do not fail, let's embedd mesh_paths and mpp_paths into parent struct to avoid adding error handling on higher levels and follow the pattern of others setup_sdata functions Fixes: 60854fd94573 ("mac80211: mesh: convert path table to rhashtable") Reported-and-tested-by: syzbot+860268315ba86ea6b96b@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Link: https://lore.kernel.org/r/20211230195547.23977-1-paskripkin@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: initialize variable have_higher_than_11mbitTom Rix1-1/+1
Clang static analysis reports this warnings mlme.c:5332:7: warning: Branch condition evaluates to a garbage value have_higher_than_11mbit) ^~~~~~~~~~~~~~~~~~~~~~~ have_higher_than_11mbit is only set to true some of the time in ieee80211_get_rates() but is checked all of the time. So have_higher_than_11mbit needs to be initialized to false. Fixes: 5d6a1b069b7f ("mac80211: set basic rates earlier") Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20211223162848.3243702-1-trix@redhat.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>