summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2022-01-06Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski36-368/+1214
Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-01-06 We've added 41 non-merge commits during the last 2 day(s) which contain a total of 36 files changed, 1214 insertions(+), 368 deletions(-). The main changes are: 1) Various fixes in the verifier, from Kris and Daniel. 2) Fixes in sockmap, from John. 3) bpf_getsockopt fix, from Kuniyuki. 4) INET_POST_BIND fix, from Menglong. 5) arm64 JIT fix for bpf pseudo funcs, from Hou. 6) BPF ISA doc improvements, from Christoph. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits) bpf: selftests: Add bind retry for post_bind{4, 6} bpf: selftests: Use C99 initializers in test_sock.c net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() bpf/selftests: Test bpf_d_path on rdonly_mem. libbpf: Add documentation for bpf_map batch operations selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3 xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames xdp: Move conversion to xdp_frame out of map functions page_pool: Store the XDP mem id page_pool: Add callback to init pages when they are allocated xdp: Allow registering memory model without rxq reference samples/bpf: xdpsock: Add timestamp for Tx-only operation samples/bpf: xdpsock: Add time-out for cleaning Tx samples/bpf: xdpsock: Add sched policy and priority support samples/bpf: xdpsock: Add cyclic TX operation capability samples/bpf: xdpsock: Add clockid selection support samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation samples/bpf: xdpsock: Add VLAN support for Tx-only operation libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API libbpf 1.0: Deprecate bpf_map__is_offload_neutral() ... ==================== Link: https://lore.kernel.org/r/20220107013626.53943-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06Merge branch 'net: bpf: handle return value of post_bind{4,6} and add ↵Alexei Starovoitov10-148/+233
selftests for it' Menglong Dong says: ==================== From: Menglong Dong <imagedong@tencent.com> The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. The second patch use C99 initializers in test_sock.c The third patch is the selftests for this modification. Changes since v4: - use C99 initializers in test_sock.c before adding the test case Changes since v3: - add the third patch which use C99 initializers in test_sock.c Changes since v2: - NULL check for sk->sk_prot->put_port Changes since v1: - introduce 'put_port' field for 'struct proto' - add selftests for it ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-06bpf: selftests: Add bind retry for post_bind{4, 6}Menglong Dong1-20/+130
With previous patch, kernel is able to 'put_port' after sys_bind() fails. Add the test for that case: rebind another port after sys_bind() fails. If the bind success, it means previous bind operation is already undoed. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-4-imagedong@tencent.com
2022-01-06bpf: selftests: Use C99 initializers in test_sock.cMenglong Dong1-128/+92
Use C99 initializers for the initialization of 'tests' in test_sock.c. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-3-imagedong@tencent.com
2022-01-06net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()Menglong Dong9-0/+11
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
2022-01-06bpf/selftests: Test bpf_d_path on rdonly_mem.Hao Luo2-1/+49
The second parameter of bpf_d_path() can only accept writable memories. Rdonly_mem obtained from bpf_per_cpu_ptr() can not be passed into bpf_d_path for modification. This patch adds a selftest to verify this behavior. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220106205525.2116218-1-haoluo@google.com
2022-01-06libbpf: Add documentation for bpf_map batch operationsGrant Seltzer2-6/+117
This adds documention for: - bpf_map_delete_batch() - bpf_map_lookup_batch() - bpf_map_lookup_and_delete_batch() - bpf_map_update_batch() This also updates the public API for the `keys` parameter of `bpf_map_delete_batch()`, and both the `keys` and `values` parameters of `bpf_map_update_batch()` to be constants. Signed-off-by: Grant Seltzer <grantseltzer@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220106201304.112675-1-grantseltzer@gmail.com
2022-01-06selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3Andrii Nakryiko1-2/+2
PT_REGS*() macro on some architectures force-cast struct pt_regs to other types (user_pt_regs, etc) and might drop volatile modifiers, if any. Volatile isn't really required as pt_regs value isn't supposed to change during the BPF program run, so this is correct behavior. But progs/loop3.c relies on that volatile modifier to ensure that loop is preserved. Fix loop3.c by declaring i and sum variables as volatile instead. It preserves the loop and makes the test pass on all architectures (including s390x which is currently broken). Fixes: 3cc31d794097 ("libbpf: Normalize PT_REGS_xxx() macro definitions") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220106205156.955373-1-andrii@kernel.org
2022-01-06veth: Do not record rx queue hint in veth_xmitDaniel Borkmann1-1/+0
Laurent reported that they have seen a significant amount of TCP retransmissions at high throughput from applications residing in network namespaces talking to the outside world via veths. The drops were seen on the qdisc layer (fq_codel, as per systemd default) of the phys device such as ena or virtio_net due to all traffic hitting a _single_ TX queue _despite_ multi-queue device. (Note that the setup was _not_ using XDP on veths as the issue is generic.) More specifically, after edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence") which made it all the way back to v4.19.184+, skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX queue by default for veths) instead of leaving at 0. This is eventually retained and callbacks like ena_select_queue() will also pick single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the traffic is forwarded to that device via upper stack or other means. Similarly, for others not implementing ndo_select_queue() if XPS is disabled, netdev_pick_tx() might call into the skb_tx_hash() and check for prior skb_rx_queue_recorded() as well. In general, it is a _bad_ idea for virtual devices like veth to mess around with queue selection [by default]. Given dev->real_num_tx_queues is by default 1, the skb->queue_mapping was left untouched, and so prior to edbea9220251 the netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys device. Unbreak this and restore prior behavior by removing the skb_record_rx_queue() from veth_xmit() altogether. If the veth peer has an XDP program attached, then it would return the first RX queue index in xdp_md->rx_queue_index (unless configured in non-default manner). However, this is still better than breaking the generic case. Fixes: edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence") Fixes: 638264dc9022 ("veth: Support per queue XDP ring") Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Cc: Toshiaki Makita <toshiaki.makita1@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06ethernet: ibmveth: use default_groups in kobj_typeGreg Kroah-Hartman1-1/+2
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the ibmveth sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Cristobal Forno <cforno12@linux.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: netdev@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06sfc: Use swap() instead of open coding itJiapeng Chong1-10/+4
Clean the following coccicheck warning: ./drivers/net/ethernet/sfc/efx_channels.c:870:36-37: WARNING opportunity for swap(). ./drivers/net/ethernet/sfc/efx_channels.c:824:36-37: WARNING opportunity for swap(). Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06ethtool: use phydev variableTom Rix1-4/+4
In ethtool_get_phy_stats(), the phydev varaible is set to dev->phydev but dev->phydev is still used. Replace dev->phydev uses with phydev. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: macb: use .mac_select_pcs() interfaceRussell King (Oracle)2-15/+14
Convert the PCS selection to use mac_select_pcs, which allows the PCS to perform any validation it needs. We must use separate phylink_pcs instances for the USX and SGMII PCS, rather than just changing the "ops" pointer before re-setting it to phylink as this interface queries the PCS, rather than requesting it to be changed. Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06gro: add ability to control gro max packet sizeCoco Li6-1/+41
Eric Dumazet suggested to allow users to modify max GRO packet size. We have seen GRO being disabled by users of appliances (such as wifi access points) because of claimed bufferbloat issues, or some work arounds in sch_cake, to split GRO/GSO packets. Instead of disabling GRO completely, one can chose to limit the maximum packet size of GRO packets, depending on their latency constraints. This patch adds a per device gro_max_size attribute that can be changed with ip link command. ip link set dev eth0 gro_max_size 16000 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: fix SOF_TIMESTAMPING_BIND_PHC to work with multiple socketsMiroslav Lichvar3-13/+18
When multiple sockets using the SOF_TIMESTAMPING_BIND_PHC flag received a packet with a hardware timestamp (e.g. multiple PTP instances in different PTP domains using the UDPv4/v6 multicast or L2 transport), the timestamps received on some sockets were corrupted due to repeated conversion of the same timestamp (by the same or different vclocks). Fix ptp_convert_timestamp() to not modify the shared skb timestamp and return the converted timestamp as a ktime_t instead. If the conversion fails, return 0 to not confuse the application with timestamps corresponding to an unexpected PHC. Fixes: d7c088265588 ("net: socket: support hardware timestamp conversion to PHC bound") Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Cc: Yangbo Lu <yangbo.lu@nxp.com> Cc: Richard Cochran <richardcochran@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: warn about dsa_port and dsa_switch bit fields being non atomicVladimir Oltean1-0/+8
As discussed during review here: https://patchwork.kernel.org/project/netdevbpf/patch/20220105132141.2648876-3-vladimir.oltean@nxp.com/ we should inform developers about pitfalls of concurrent access to the boolean properties of dsa_switch and dsa_port, now that they've been converted to bit fields. No other measure than a comment needs to be taken, since the code paths that update these bit fields are not concurrent with each other. Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: don't enumerate dsa_switch and dsa_port bit fields using commasVladimir Oltean1-58/+56
This is a cosmetic incremental fixup to commits 7787ff776398 ("net: dsa: merge all bools of struct dsa_switch into a single u32") bde82f389af1 ("net: dsa: merge all bools of struct dsa_port into a single u8") The desire to make this change was enunciated after posting these patches here: https://patchwork.kernel.org/project/netdevbpf/cover/20220105132141.2648876-1-vladimir.oltean@nxp.com/ but due to a slight timing overlap (message posted at 2:28 p.m. UTC, merge commit is at 2:46 p.m. UTC), that comment was missed and the changes were applied as-is. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch 'dsa-init-cleanups'David S. Miller3-50/+60
Vladimir Oltean says: ==================== DSA initialization cleanups These patches contain miscellaneous work that makes the DSA init code path symmetric with the teardown path, and some additional patches carried by Ansuel Smith for his register access over Ethernet work, but those patches can be applied as-is too. https://patchwork.kernel.org/project/netdevbpf/patch/20211214224409.5770-3-ansuelsmth@gmail.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: setup master before portsVladimir Oltean1-10/+13
It is said that as soon as a network interface is registered, all its resources should have already been prepared, so that it is available for sending and receiving traffic. One of the resources needed by a DSA slave interface is the master. dsa_tree_setup -> dsa_tree_setup_ports -> dsa_port_setup -> dsa_slave_create -> register_netdevice -> dsa_tree_setup_master -> dsa_master_setup -> sets up master->dsa_ptr, which enables reception Therefore, there is a short period of time after register_netdevice() during which the master isn't prepared to pass traffic to the DSA layer (master->dsa_ptr is checked by eth_type_trans). Same thing during unregistration, there is a time frame in which packets might be missed. Note that this change opens us to another race: dsa_master_find_slave() will get invoked potentially earlier than the slave creation, and later than the slave deletion. Since dp->slave starts off as a NULL pointer, the earlier calls aren't a problem, but the later calls are. To avoid use-after-free, we should zeroize dp->slave before calling dsa_slave_destroy(). In practice I cannot really test real life improvements brought by this change, since in my systems, netdevice creation races with PHY autoneg which takes a few seconds to complete, and that masks quite a few races. Effects might be noticeable in a setup with fixed links all the way to an external system. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: first set up shared ports, then non-shared portsVladimir Oltean1-13/+37
After commit a57d8c217aad ("net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports"), the port setup and teardown procedure became asymmetric. The fact of the matter is that user ports need the shared ports to be up before they can be used for CPU-initiated termination. And since we register net devices for the user ports, those won't be functional until we also call the setup for the shared (CPU, DSA) ports. But we may do that later, depending on the port numbering scheme of the hardware we are dealing with. It just makes sense that all shared ports are brought up before any user port is. I can't pinpoint any issue due to the current behavior, but let's change it nonetheless, for consistency's sake. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}Vladimir Oltean2-2/+10
DSA needs to simulate master tracking events when a binding is first with a DSA master established and torn down, in order to give drivers the simplifying guarantee that ->master_state_change calls are made only when the master's readiness state to pass traffic changes. master_state_change() provide a operational bool that DSA driver can use to understand if DSA master is operational or not. To avoid races, we need to block the reception of NETDEV_UP/NETDEV_CHANGE/NETDEV_GOING_DOWN events in the netdev notifier chain while we are changing the master's dev->dsa_ptr (this changes what netdev_uses_dsa(dev) reports). The dsa_master_setup() and dsa_master_teardown() functions optionally require the rtnl_mutex to be held, if the tagger needs the master to be promiscuous, these functions call dev_set_promiscuity(). Move the rtnl_lock() from that function and make it top-level. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: stop updating master MTU from master.cVladimir Oltean1-24/+1
At present there are two paths for changing the MTU of the DSA master. The first is: dsa_tree_setup -> dsa_tree_setup_ports -> dsa_port_setup -> dsa_slave_create -> dsa_slave_change_mtu -> dev_set_mtu(master) The second is: dsa_tree_setup -> dsa_tree_setup_master -> dsa_master_setup -> dev_set_mtu(dev) So the dev_set_mtu() call from dsa_master_setup() has been effectively superseded by the dsa_slave_change_mtu(slave_dev, ETH_DATA_LEN) that is done from dsa_slave_create() for each user port. The later function also updates the master MTU according to the largest user port MTU from the tree. Therefore, updating the master MTU through a separate code path isn't needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: merge rtnl_lock sections in dsa_slave_createVladimir Oltean1-3/+1
Currently dsa_slave_create() has two sequences of rtnl_lock/rtnl_unlock in a row. Remove the rtnl_unlock() and rtnl_lock() in between, such that the operation can execute slighly faster. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: reorder PHY initialization with MTU setup in slave.cVladimir Oltean1-7/+7
In dsa_slave_create() there are 2 sections that take rtnl_lock(): MTU change and netdev registration. They are separated by PHY initialization. There isn't any strict ordering requirement except for the fact that netdev registration should be last. Therefore, we can perform the MTU change a bit later, after the PHY setup. A future change will then be able to merge the two rtnl_lock sections into one. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch 'master' of ↵David S. Miller11-8/+96
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-01-06 1) Fix some clang_analyzer warnings about never read variables. From luo penghao. 2) Check for pols[0] only once in xfrm_expand_policies(). From Jean Sacren. 3) The SA curlft.use_time was updated only on SA cration time. Update whenever the SA is used. From Antony Antony 4) Add support for SM3 secure hash. From Xu Jia. 5) Add support for SM4 symmetric cipher algorithm. From Xu Jia. 6) Add a rate limit for SA mapping change messages. From Antony Antony. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05xdp: Add xdp_do_redirect_frame() for pre-computed xdp_framesToke Høiland-Jørgensen2-11/+58
Add an xdp_do_redirect_frame() variant which supports pre-computed xdp_frame structures. This will be used in bpf_prog_run() to avoid having to write to the xdp_frame structure when the XDP program doesn't modify the frame boundaries. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-6-toke@redhat.com
2022-01-05xdp: Move conversion to xdp_frame out of map functionsToke Høiland-Jørgensen4-45/+39
All map redirect functions except XSK maps convert xdp_buff to xdp_frame before enqueueing it. So move this conversion of out the map functions and into xdp_do_redirect(). This removes a bit of duplicated code, but more importantly it makes it possible to support caller-allocated xdp_frame structures, which will be added in a subsequent commit. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-5-toke@redhat.com
2022-01-05page_pool: Store the XDP mem idToke Høiland-Jørgensen3-4/+11
Store the XDP mem ID inside the page_pool struct so it can be retrieved later for use in bpf_prog_run(). Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220103150812.87914-4-toke@redhat.com
2022-01-05page_pool: Add callback to init pages when they are allocatedToke Høiland-Jørgensen2-0/+4
Add a new callback function to page_pool that, if set, will be called every time a new page is allocated. This will be used from bpf_test_run() to initialise the page data with the data provided by userspace when running XDP programs with redirect turned on. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220103150812.87914-3-toke@redhat.com
2022-01-05xdp: Allow registering memory model without rxq referenceToke Høiland-Jørgensen2-30/+65
The functions that register an XDP memory model take a struct xdp_rxq as parameter, but the RXQ is not actually used for anything other than pulling out the struct xdp_mem_info that it embeds. So refactor the register functions and export variants that just take a pointer to the xdp_mem_info. This is in preparation for enabling XDP_REDIRECT in bpf_prog_run(), using a page_pool instance that is not connected to any network device. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-2-toke@redhat.com
2022-01-05Merge branch 'samples/bpf: xdpsock app enhancements'Alexei Starovoitov1-22/+341
Ong Boon says: ==================== First of all, sorry for taking more time to get back to this series and thanks to all valuble feedback in series-1 at [1] from Jesper and Song Liu. Since then I have looked into what Jesper suggested in [2] and worked on revising the patch series into several patches for ease of review: v1->v2: 1/7: [No change]. Add VLAN tag (ID & Priority) to the generated Tx-Only frames. 2/7: [No change]. Add DMAC and SMAC setting to the generated Tx-Only frames. If parameters are not set, previous DMAC and SMAC are used. 3/7: [New]. Add support for selecting different CLOCK for clock_gettime() used in get_nsecs. 4/7: [New]. This is a total rework from series-1 3/4-patch [3]. It uses clock_nanosleep() suggested by Jesper. In addition, added statistic for Tx schedule variance under application stat (-a|--app-stats). Make the cyclic Tx operation and --poll mode to be mutually- exclusive. Still, the ability to specify TX cycle time and used together with batch size and packet count remain the same. 5/7: [New]. Add the support for TX process schedule policy and priority setting. By default, SCHED_OTHER policy is used. This too is matching the schedule policy setting in [2]. 6/7: [Change]. This is update from series-1 4/4-patch [4]. Added TX clean process time-out in 1s granularity with configurable retries count (-O|--retries). 7/7: [New]. Added timestamp for TX packet following pktgen_hdr format matching the implementation in [2]. However, the sequence ID remains the same as it is instead of process schedule diff in [2]. To summarize on what program options have been added with v2 series using an example below:- DMAC (-G) = fa:8d:f1:e2:0b:e8 SMAC (-H) = ce:17:07:17:3e:3a VLAN tagged (-V) VLAN ID (-J) = 12 VLAN Pri (-K) = 3 Tx Queue (-q) = 3 Cycle Time in us (-T) = 1000 Batch (-b) = 2 Packet Count = 6 Tx schedule policy (-W) = FIFO Tx schedule priority (-U) = 50 Clock selection (-w) = REALTIME Tx timeout retries(-O) = 5 Tx timestamp (-y) Cyclic Tx schedule stat (-a) Note: xdpsock sets UDP dest-port and src-port to 0x1000 as default. Sending Board ============= $ xdpsock -i eth0 -t -N -z -H ce:17:07:17:3e:3a -G fa:8d:f1:e2:0b:e8 \ -V -J 12 -K 3 -q 3 \ -T 1000 -b 2 -C 6 -W FIFO -U 50 -w REALTIME \ -O 5 -y -a sock0@eth0:3 txonly xdp-drv pps pkts 0.00 rx 0 0 tx 0 6 calls/s count rx empty polls 0 0 fill fail polls 0 0 copy tx sendtos 0 0 tx wakeup sendtos 0 5 opt polls 0 0 period min ave max cycle Cyclic TX 1000000 31033 32009 33397 3 Receiving Board =============== $ tcpdump -nei eth0 udp port 0x1000 -vv -Q in -X \ --time-stamp-precision nano tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes 03:46:40.520111580 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44) 10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16 0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~.... 0x0010: 0a0a 0a20 1000 1000 0018 e997 be9b e955 ...............U 0x0020: 0000 0000 61cd 2ba1 0006 987c ....a.+....| 03:46:40.520112163 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44) 10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16 0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~.... 0x0010: 0a0a 0a20 1000 1000 0018 e996 be9b e955 ...............U 0x0020: 0000 0001 61cd 2ba1 0006 987c ....a.+....| 03:46:40.521066860 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44) 10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16 0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~.... 0x0010: 0a0a 0a20 1000 1000 0018 e5af be9b e955 ...............U 0x0020: 0000 0002 61cd 2ba1 0006 9c62 ....a.+....b 03:46:40.521067012 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44) 10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16 0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~.... 0x0010: 0a0a 0a20 1000 1000 0018 e5ae be9b e955 ...............U 0x0020: 0000 0003 61cd 2ba1 0006 9c62 ....a.+....b 03:46:40.522061935 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44) 10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16 0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~.... 0x0010: 0a0a 0a20 1000 1000 0018 e1c5 be9b e955 ...............U 0x0020: 0000 0004 61cd 2ba1 0006 a04a ....a.+....J 03:46:40.522062173 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44) 10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16 0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~.... 0x0010: 0a0a 0a20 1000 1000 0018 e1c4 be9b e955 ...............U 0x0020: 0000 0005 61cd 2ba1 0006 a04a ....a.+....J I have tested the above with both tagged and untagged packet format and based on the timestamp in tcpdump found that the timing of the batch cyclic transmission is correct. Appreciate if community can give the patch series v2 a try and point out any gap. Thanks Boon Leong [1] https://patchwork.kernel.org/project/netdevbpf/cover/20211124091821.3916046-1-boon.leong.ong@intel.com/ [2] https://github.com/netoptimizer/network-testing/blob/master/src/udp_pacer.c [3] https://patchwork.kernel.org/project/netdevbpf/patch/20211124091821.3916046-4-boon.leong.ong@intel.com/ [4] https://patchwork.kernel.org/project/netdevbpf/patch/20211124091821.3916046-5-boon.leong.ong@intel.com/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-05samples/bpf: xdpsock: Add timestamp for Tx-only operationOng Boon Leong1-9/+68
It may be useful to add timestamp for Tx packets for continuous or cyclic transmit operation. The timestamp and sequence ID of a Tx packet are stored according to pktgen header format. To enable per-packet timestamp, use -y|--tstamp option. If timestamp is off, pktgen header is not included in the UDP payload. This means receiving side can use the magic number for pktgen for differentiation. The implementation supports both VLAN tagged and untagged option. By default, the minimum packet size is set at 64B. However, if VLAN tagged is on (-V), the minimum packet size is increased to 66B just so to fit the pktgen_hdr size. Added hex_dump() into the code path just for future cross-checking. As before, simply change to "#define DEBUG_HEXDUMP 1" to inspect the accuracy of TX packet. Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211230035447.523177-8-boon.leong.ong@intel.com
2022-01-05samples/bpf: xdpsock: Add time-out for cleaning TxOng Boon Leong1-2/+9
When user sets tx-pkt-count and in case where there are invalid Tx frame, the complete_tx_only_all() process polls indefinitely. So, this patch adds a time-out mechanism into the process so that the application can terminate automatically after it retries 3*polling interval duration. v1->v2: Thanks to Jesper's and Song Liu's suggestion. - clean-up git message to remove polling log - make the Tx time-out retries configurable with 1s granularity Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211230035447.523177-7-boon.leong.ong@intel.com
2022-01-05samples/bpf: xdpsock: Add sched policy and priority supportOng Boon Leong1-2/+59
By default, TX schedule policy is SCHED_OTHER (round-robin time-sharing). To improve TX cyclic scheduling, we add SCHED_FIFO policy and its priority by using -W FIFO or --policy=FIFO and -U <PRIO> or --schpri=<PRIO>. A) From xdpsock --app-stats, for SCHED_OTHER policy: $ xdpsock -i eth0 -t -N -z -T 1000 -b 16 -C 100000 -a period min ave max cycle Cyclic TX 1000000 53507 75334 712642 6250 B) For SCHED_FIFO policy and schpri=50: $ xdpsock -i eth0 -t -N -z -T 1000 -b 16 -C 100000 -a -W FIFO -U 50 period min ave max cycle Cyclic TX 1000000 3699 24859 54397 6250 Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211230035447.523177-6-boon.leong.ong@intel.com
2022-01-05samples/bpf: xdpsock: Add cyclic TX operation capabilityOng Boon Leong1-5/+80
Tx cycle time is in micro-seconds unit. By combining the batch size (-b M) and Tx cycle time (-T|--tx-cycle N), xdpsock now can transmit batch-size of packets every N-us periodically. Cyclic TX operation is not applicable if --poll mode is used. To transmit 16 packets every 1ms cycle time for total of 100000 packets silently: $ xdpsock -i eth0 -T -N -z -T 1000 -b 16 -C 100000 To print cyclic TX schedule variance stats, use --app-stats|-a: $ xdpsock -i eth0 -T -N -z -T 1000 -b 16 -C 100000 -a sock0@eth0:0 txonly xdp-drv pps pkts 0.00 rx 0 0 tx 0 100000 calls/s count rx empty polls 0 0 fill fail polls 0 0 copy tx sendtos 0 0 tx wakeup sendtos 0 6254 opt polls 0 0 period min ave max cycle Cyclic TX 1000000 53507 75334 712642 6250 Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211230035447.523177-5-boon.leong.ong@intel.com
2022-01-05samples/bpf: xdpsock: Add clockid selection supportOng Boon Leong1-2/+38
User specifies the clock selection by using -w CLOCK or --clock=CLOCK where CLOCK=[REALTIME, TAI, BOOTTIME, MONOTONIC]. The default CLOCK selection is MONOTONIC. The implementation of clock selection parsing is borrowed from iproute2/tc/q_taprio.c Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211230035447.523177-4-boon.leong.ong@intel.com
2022-01-05samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operationOng Boon Leong1-5/+30
To set Dest MAC address (-G|--tx-dmac) only: $ xdpsock -i eth0 -t -N -z -G aa:bb:cc:dd:ee:ff To set Source MAC address (-H|--tx-smac) only: $ xdpsock -i eth0 -t -N -z -H 11:22:33:44:55:66 To set both Dest and Source MAC address: $ xdpsock -i eth0 -t -N -z -G aa:bb:cc:dd:ee:ff \ -H 11:22:33:44:55:66 The default Dest and Source MAC address remain the same as before. Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20211230035447.523177-3-boon.leong.ong@intel.com
2022-01-05samples/bpf: xdpsock: Add VLAN support for Tx-only operationOng Boon Leong1-15/+75
In multi-queue environment testing, the support for VLAN-tag based steering is useful. So, this patch adds the capability to add VLAN tag (VLAN ID and Priority) to the generated Tx frame. To set the VLAN ID=10 and Priority=2 for Tx only through TxQ=3: $ xdpsock -i eth0 -t -N -z -q 3 -V -J 10 -K 2 If VLAN ID (-J) and Priority (-K) is set, it default to VLAN ID = 1 VLAN Priority = 0. For example, VLAN-tagged Tx only, xdp copy mode through TxQ=1: $ xdpsock -i eth0 -t -N -c -q 1 -V Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211230035447.523177-2-boon.leong.ong@intel.com
2022-01-05Merge branch 'net-lantiq_xrx200-improve-ethernet-performance'Jakub Kicinski2-23/+41
Aleksander Jan Bajkowski says: ==================== net: lantiq_xrx200: improve ethernet performance This patchset improves Ethernet performance by 15%. NAT Performance results on BT Home Hub 5A (kernel 5.10.89, mtu 1500): Down Up Before 539 Mbps 599 Mbps After 624 Mbps 695 Mbps ==================== Link: https://lore.kernel.org/r/20220104151144.181736-1-olek2@wp.pl Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05net: lantiq_xrx200: convert to build_skbAleksander Jan Bajkowski1-20/+36
We can increase the efficiency of rx path by using buffers to receive packets then build SKBs around them just before passing into the network stack. In contrast, preallocating SKBs too early reduces CPU cache efficiency. NAT Performance results on BT Home Hub 5A (kernel 5.10.89, mtu 1500): Down Up Before 577 Mbps 648 Mbps After 624 Mbps 695 Mbps Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05net: lantiq_xrx200: increase napi poll weigthAleksander Jan Bajkowski1-2/+4
NAT Performance results on BT Home Hub 5A (kernel 5.10.89, mtu 1500): Down Up Before 545 Mbps 625 Mbps After 577 Mbps 648 Mbps Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05MIPS: lantiq: dma: increase descritor countAleksander Jan Bajkowski1-1/+1
NAT Performance results on BT Home Hub 5A (kernel 5.10.89, mtu 1500): Down Up Before 539 Mbps 599 Mbps After 545 Mbps 625 Mbps Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05testptp: set pin function before other requestsMiroslav Lichvar1-12/+12
When the -L option of the testptp utility is specified with other options (e.g. -p to enable PPS output), the user probably wants to apply it to the pin configured by the -L option. Reorder the code to set the pin function before other function requests to avoid confusing users. Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Link: https://lore.kernel.org/r/20220105152506.3256026-1-mlichvar@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05libbpf 1.0: Deprecate bpf_object__find_map_by_offset() APIChristy Lee1-1/+2
API created with simplistic assumptions about BPF map definitions. It hasn’t worked for a while, deprecate it in preparation for libbpf 1.0. [0] Closes: https://github.com/libbpf/libbpf/issues/302 Signed-off-by: Christy Lee <christylee@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220105003120.2222673-1-christylee@fb.com
2022-01-05libbpf 1.0: Deprecate bpf_map__is_offload_neutral()Christy Lee2-1/+2
Deprecate bpf_map__is_offload_neutral(). It’s most probably broken already. PERF_EVENT_ARRAY isn’t the only map that’s not suitable for hardware offloading. Applications can directly check map type instead. [0] Closes: https://github.com/libbpf/libbpf/issues/306 Signed-off-by: Christy Lee <christylee@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220105000601.2090044-1-christylee@fb.com
2022-01-05libbpf: Support repeated legacy kprobes on same functionQiang Wang1-1/+4
If repeated legacy kprobes on same function in one process, libbpf will register using the same probe name and got -EBUSY error. So append index to the probe name format to fix this problem. Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Qiang Wang <wangqiang.wq.frank@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211227130713.66933-2-wangqiang.wq.frank@bytedance.com
2022-01-05libbpf: Use probe_name for legacy kprobeQiang Wang1-1/+1
Fix a bug in commit 46ed5fc33db9, which wrongly used the func_name instead of probe_name to register legacy kprobe. Fixes: 46ed5fc33db9 ("libbpf: Refactor and simplify legacy kprobe code") Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Qiang Wang <wangqiang.wq.frank@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Hengqi Chen <hengqi.chen@gmail.com> Reviewed-by: Hengqi Chen <hengqi.chen@gmail.com> Link: https://lore.kernel.org/bpf/20211227130713.66933-1-wangqiang.wq.frank@bytedance.com
2022-01-05libbpf: Deprecate bpf_perf_event_read_simple() APIChristy Lee2-8/+15
With perf_buffer__poll() and perf_buffer__consume() APIs available, there is no reason to expose bpf_perf_event_read_simple() API to users. If users need custom perf buffer, they could re-implement the function. Mark bpf_perf_event_read_simple() and move the logic to a new static function so it can still be called by other functions in the same file. [0] Closes: https://github.com/libbpf/libbpf/issues/310 Signed-off-by: Christy Lee <christylee@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211229204156.13569-1-christylee@fb.com
2022-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski88-404/+918
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05bpf: Add SO_RCVBUF/SO_SNDBUF in _bpf_getsockopt().Kuniyuki Iwashima1-0/+6
This patch exposes SO_RCVBUF/SO_SNDBUF through bpf_getsockopt(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220104013153.97906-3-kuniyu@amazon.co.jp