path: root/net/xdp
Age | Commit message | Author | Files | Lines
2020-08-31 | xsk: Add shared umem support between devices | Magnus Karlsson | 1 | -7/+4

Add support to share a umem between different devices. This mode can be invoked with the XDP_SHARED_UMEM bind flag. Previously, sharing was only supported within the same device.

Note that when sharing a umem between devices, just as when sharing a umem between queue ids, you need to create a fill ring and a completion ring and tie them to the socket (with two setsockopts, one for each ring) before you bind with the XDP_SHARED_UMEM flag. This is so that the single-producer single-consumer semantics of the rings can be upheld.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-13-git-send-email-magnus.karlsson@intel.com
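A hedged userspace sketch of the setup described above. XDP_UMEM_FILL_RING, XDP_UMEM_COMPLETION_RING, XDP_SHARED_UMEM and struct sockaddr_xdp are the real uAPI from <linux/if_xdp.h>; the ring size, function name and omitted error handling are illustrative assumptions.

  #include <linux/if_xdp.h>
  #include <sys/socket.h>

  /* Sketch: bind a second AF_XDP socket to another device/queue while
   * sharing the umem registered on umem_fd. Error handling omitted. */
  static int bind_shared_umem(int umem_fd, unsigned int ifindex,
                              unsigned int queue_id)
  {
          int fd = socket(AF_XDP, SOCK_RAW, 0);
          int ring_sz = 2048; /* assumed ring size */

          /* each socket sharing the umem needs its own two rings */
          setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING,
                     &ring_sz, sizeof(ring_sz));
          setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
                     &ring_sz, sizeof(ring_sz));

          struct sockaddr_xdp sxdp = {
                  .sxdp_family = AF_XDP,
                  .sxdp_ifindex = ifindex,        /* the second device */
                  .sxdp_queue_id = queue_id,
                  .sxdp_flags = XDP_SHARED_UMEM,
                  .sxdp_shared_umem_fd = umem_fd, /* owner of the umem */
          };
          return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
  }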
2020-08-31 | xsk: Add shared umem support between queue ids | Magnus Karlsson | 2 | -16/+54

Add support to share a umem between queue ids on the same device. This mode can be invoked with the XDP_SHARED_UMEM bind flag. Previously, sharing was only supported within the same queue id and device, and you shared one set of fill and completion rings. However, note that when sharing a umem between queue ids, you need to create a fill ring and a completion ring and tie them to the socket before you bind with the XDP_SHARED_UMEM flag. This is so that the single-producer single-consumer semantics can be upheld.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-12-git-send-email-magnus.karlsson@intel.com
2020-08-31 | xsk: Enable sharing of dma mappings | Magnus Karlsson | 2 | -42/+142

Enable the sharing of dma mappings by moving them out of the buffer pool. Instead we put each dma mapped umem region in a list in the umem structure. If dma has already been mapped for this umem and device, it is not mapped again and the existing dma mappings are reused.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-9-git-send-email-magnus.karlsson@intel.com
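The reuse check follows a find-before-map pattern. A hedged illustration below; the struct and field names are simplified stand-ins inferred from the description, not a verbatim excerpt of the patch.

  /* One entry per (umem, device) dma mapping, linked off the umem. */
  struct dma_map_sketch {
          struct list_head list;  /* linked into the umem's mapping list */
          struct device *dev;     /* device the pages were mapped for */
          dma_addr_t *dma_pages;  /* one dma handle per umem page */
          u32 refcount;           /* buffer pools sharing this mapping */
  };

  static struct dma_map_sketch *find_dma_map(struct list_head *maps,
                                             struct device *dev)
  {
          struct dma_map_sketch *m;

          list_for_each_entry(m, maps, list)
                  if (m->dev == dev)
                          return m;  /* already mapped: reuse it */
          return NULL;               /* caller dma-maps and adds to list */
  }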
2020-08-31 | xsk: Move addrs from buffer pool to umem | Magnus Karlsson | 2 | -19/+24

Replicate the addrs pointer in the buffer pool to the umem. This mapping will be the same for all buffer pools sharing the same umem. In the buffer pool we leave the addrs pointer for performance reasons.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-8-git-send-email-magnus.karlsson@intel.com
2020-08-31 | xsk: Move xsk_tx_list and its lock to buffer pool | Magnus Karlsson | 4 | -37/+32

Move the xsk_tx_list and the xsk_tx_list_lock from the umem to the buffer pool. This is so that we can share the umem between multiple HW queues in a later commit. There is one xsk_tx_list per device and queue id, so it should be located in the buffer pool.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-7-git-send-email-magnus.karlsson@intel.com
2020-08-31 | xsk: Move queue_id, dev and need_wakeup to buffer pool | Magnus Karlsson | 6 | -71/+39

Move queue_id, dev, and need_wakeup from the umem to the buffer pool. This is so that we can share the umem between multiple HW queues in a later commit. There is one buffer pool per dev and queue id, so these variables should belong to the buffer pool, not the umem. need_wakeup is also something that is set at a per-NAPI level, so there is usually one per device and queue id; move this to the buffer pool too.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-6-git-send-email-magnus.karlsson@intel.com
2020-08-31 | xsk: Move fill and completion rings to buffer pool | Magnus Karlsson | 4 | -46/+49

Move the fill and completion rings from the umem to the buffer pool. This is so that we can share the umem between multiple HW queue ids in a later commit. In this case, we need one fill and completion ring per queue id. As the buffer pool is per queue id and napi id, this is a natural place for it, and one umem structure can be shared between these buffer pools.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-5-git-send-email-magnus.karlsson@intel.com
2020-08-31 | xsk: Create and free buffer pool independently from umem | Magnus Karlsson | 6 | -182/+225

Create and free the buffer pool independently from the umem. Move these operations that are performed on the buffer pool from the umem create and destroy functions to new create and destroy functions just for the buffer pool. This is so that in later commits we can instantiate multiple buffer pools per umem when sharing a umem between HW queues and/or devices. We also eradicate the back pointer from the umem to the buffer pool, as this will not work when we introduce the possibility to have multiple buffer pools per umem.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-4-git-send-email-magnus.karlsson@intel.com
2020-08-31 | xsk: i40e: ice: ixgbe: mlx5: Rename xsk zero-copy driver interfaces | Magnus Karlsson | 2 | -30/+39

Rename the AF_XDP zero-copy driver interface functions to better reflect what they do after the replacement of umems with buffer pools in the previous commit. Mostly it is about replacing the umem part of the function names with xsk_buff and also having them take a buffer pool pointer instead of a umem. The various ring functions have also been renamed in the process so that they have the same naming convention as the internal functions in xsk_queue.h. This is so that it will be clearer what they do, and also for consistency.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-3-git-send-email-magnus.karlsson@intel.com
2020-08-31 | xsk: i40e: ice: ixgbe: mlx5: Pass buffer pool to driver instead of umem | Magnus Karlsson | 2 | -24/+26

Replace the explicit umem reference passed to the driver in AF_XDP zero-copy mode with the buffer pool instead. This is in preparation for extending the functionality of the zero-copy mode so that umems can be shared between queues on the same netdev and also between netdevs. In this commit, only a umem reference has been added to the buffer pool struct. Later commits will add other entities to it: entities that differ between queue ids and netdevs even though the umem is shared between them.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com
2020-08-28 | bpf: Relax max_entries check for most of the inner map types | Martin KaFai Lau | 1 | -1/+8

Most of the maps do not use max_entries during verification time. Thus, their map_meta_equal() does not need to enforce max_entries when the map is inserted as an inner map at runtime. The max_entries check is removed from the default implementation bpf_map_meta_equal(). The prog_array_map and xsk_map are exceptions: their map_gen_lookup uses max_entries to generate inline lookup code, so they implement their own map_meta_equal() to enforce max_entries. Since there are only two cases now, the max_entries check is not refactored and stays in its own .c file.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011813.1970516-1-kafai@fb.com
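A hedged sketch of what an exception-case map_meta_equal() looks like: enforce max_entries first, then delegate to the common check. The function shape is inferred from the description above, not copied from the patch.

  /* Inline lookup code is generated against max_entries, so an inner
   * map may only be replaced by one with the same max_entries. */
  static bool xsk_map_meta_equal_sketch(const struct bpf_map *meta0,
                                        const struct bpf_map *meta1)
  {
          if (meta0->max_entries != meta1->max_entries)
                  return false;

          return bpf_map_meta_equal(meta0, meta1);
  }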
2020-08-28 | bpf: Add map_meta_equal map ops | Martin KaFai Lau | 1 | -0/+1

Some properties of the inner map are used at verification time. When an inner map is inserted into an outer map at runtime, bpf_map_meta_equal() is currently used to ensure those properties of the inserted inner map stay the same as at verification time. In particular, the current bpf_map_meta_equal() checks max_entries, which turns out to be too restrictive for most of the maps, which do not use max_entries at verification time. It limits the use case of replacing a smaller inner map with a larger inner map. There are some maps that do use max_entries during verification, though. For example, the map_gen_lookup in array_map_ops uses max_entries to generate the inline lookup code.

To accommodate differences between maps, the map_meta_equal op is added to bpf_map_ops. Each map type can decide what to check when its map is used as an inner map at runtime.

Also, some map types cannot be used as an inner map, and they are currently blacklisted in bpf_map_meta_alloc() in map_in_map.c. It is not unusual that new map types may not be aware that such a blacklist exists. This patch enforces an explicit opt-in and only allows a map to be used as an inner map if it has implemented the map_meta_equal ops. It is based on the discussion in [1].

All maps that support inner maps have their map_meta_equal pointing to bpf_map_meta_equal in this patch. A later patch will relax the max_entries check for most maps. bpf_types.h counts 28 map types. This patch adds 23 ".map_meta_equal" by using coccinelle. -5 for:

  BPF_MAP_TYPE_PROG_ARRAY
  BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
  BPF_MAP_TYPE_STRUCT_OPS
  BPF_MAP_TYPE_ARRAY_OF_MAPS
  BPF_MAP_TYPE_HASH_OF_MAPS

The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc() is moved so that the same error is returned.

[1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
2020-07-28 | xdp: Prevent kernel-infoleak in xsk_getsockopt() | Peilin Ye | 1 | -1/+1

xsk_getsockopt() is copying uninitialized stack memory to userspace when 'extra_stats' is 'false'. Fix it. Doing '= {};' is sufficient since currently 'struct xdp_statistics' is defined as follows:

  struct xdp_statistics {
          __u64 rx_dropped;
          __u64 rx_invalid_descs;
          __u64 tx_invalid_descs;
          __u64 rx_ring_full;
          __u64 rx_fill_ring_empty_descs;
          __u64 tx_ring_empty_descs;
  };

When copied to userspace, 'stats' will not contain any uninitialized 'holes' between struct fields.

Fixes: 8aa5a33578e9 ("xsk: Add new statistics")
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/bpf/20200728053604.404631-1-yepeilin.cs@gmail.com
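Given the one-line diffstat above, the fix plausibly amounts to zero-initializing the local variable (a hedged reconstruction, not a verbatim excerpt of the patch):

  -       struct xdp_statistics stats;
  +       struct xdp_statistics stats = {};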
2020-07-24 | net: pass a sockptr_t into ->setsockopt | Christoph Hellwig | 1 | -4/+4

Rework the remaining setsockopt code to pass a sockptr_t instead of a plain user pointer. This removes the last remaining set_fs(KERNEL_DS) outside of architecture specific code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154]
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
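For reference, a hedged sketch of what a converted handler looks like. copy_from_sockptr() is the real helper; the function itself and the option it handles are illustrative assumptions.

  /* optval is a sockptr_t wrapping either a user or a kernel pointer;
   * copy_from_sockptr() hides the difference, so no set_fs() is needed. */
  static int setsockopt_sketch(sockptr_t optval, unsigned int optlen)
  {
          int val;

          if (optlen < sizeof(val))
                  return -EINVAL;
          if (copy_from_sockptr(&val, optval, sizeof(val)))
                  return -EFAULT;
          /* ... apply val ... */
          return 0;
  }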
2020-07-13 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | David S. Miller | 4 | -5/+55

Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-07-13

The following pull-request contains BPF updates for your *net-next* tree.

We've added 36 non-merge commits during the last 7 day(s) which contain a total of 62 files changed, 2242 insertions(+), 468 deletions(-).

The main changes are:

1) Avoid trace_printk warning banner by switching bpf_trace_printk to use its own tracing event, from Alan.

2) Better libbpf support on older kernels, from Andrii.

3) Additional AF_XDP stats, from Ciara.

4) build time resolution of BTF IDs, from Jiri.

5) BPF_CGROUP_INET_SOCK_RELEASE hook, from Stanislav.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-13 | xsk: Add xdp statistics to xsk_diag | Ciara Loftus | 1 | -0/+17

Add xdp statistics to the information dumped through the xsk_diag interface.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200708072835.4427-4-ciara.loftus@intel.com
2020-07-13 | xsk: Add new statistics | Ciara Loftus | 3 | -5/+38

It can be useful for the user to know the reason behind a dropped packet. Introduce new counters which track drops on the receive path caused by:

  1. rx ring being full
  2. fill ring being empty

Also, on the tx path, introduce a counter which tracks the number of times we attempt to pull from the tx ring when it is empty.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200708072835.4427-2-ciara.loftus@intel.com
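A hedged userspace sketch of reading the new counters. XDP_STATISTICS and struct xdp_statistics are real uAPI; the length check and everything else here are illustrative assumptions.

  #include <linux/if_xdp.h>
  #include <stdio.h>
  #include <sys/socket.h>

  /* Sketch: dump the drop counters of an AF_XDP socket. A kernel with
   * the extra stats returns a larger struct, hence the length check. */
  static void print_xsk_drop_stats(int fd)
  {
          struct xdp_statistics stats;
          socklen_t len = sizeof(stats);

          if (getsockopt(fd, SOL_XDP, XDP_STATISTICS, &stats, &len))
                  return;
          if (len < sizeof(stats))
                  return; /* kernel predates the extra statistics */

          printf("rx_ring_full:             %llu\n",
                 (unsigned long long)stats.rx_ring_full);
          printf("rx_fill_ring_empty_descs: %llu\n",
                 (unsigned long long)stats.rx_fill_ring_empty_descs);
          printf("tx_ring_empty_descs:      %llu\n",
                 (unsigned long long)stats.tx_ring_empty_descs);
  }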
2020-07-11 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | David S. Miller | 1 | -50/+4

All conflicts seemed rather trivial, with some guidance from Saeed Mahameed on the tc_ct.c one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 | xsk: Use dma_need_sync instead of reimplementing it | Christoph Hellwig | 1 | -47/+3

Use the dma_need_sync helper instead of (not always entirely correctly) poking into the dma-mapping internals.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200629130359.2690853-5-hch@lst.de
2020-06-30 | xsk: Remove a double pool->dev assignment in xp_dma_map | Christoph Hellwig | 1 | -1/+0

->dev is already assigned at the top of the function; remove the duplicate assignment at the end.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200629130359.2690853-4-hch@lst.de
2020-06-30 | xsk: Replace the cheap_dma flag with a dma_need_sync flag | Christoph Hellwig | 1 | -3/+2

Invert the polarity and better name the flag so that the use case is properly documented.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200629130359.2690853-3-hch@lst.de
2020-06-22 | bpf: Set map_btf_{name, id} for all map types | Andrey Ignatov | 1 | -0/+3

Set map_btf_name and map_btf_id for all map types so that map fields can be accessed by bpf programs.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com
2020-06-13 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 1 | -3/+1

Pull networking fixes from David Miller:

 1) Fix cfg80211 deadlock, from Johannes Berg.
 2) RXRPC fails to send notifications, from David Howells.
 3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from Geliang Tang.
 4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.
 5) The ucc_geth driver needs __netdev_watchdog_up exported, from Valentin Longchamp.
 6) Fix hashtable memory leak in dccp, from Wang Hai.
 7) Fix how nexthops are marked as FDB nexthops, from David Ahern.
 8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.
 9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.
10) Fix link speed reporting in iavf driver, from Brett Creeley.
11) When a channel is used for XSK and then reused again later for XSK, we forget to clear out the relevant data structures in mlx5 which causes all kinds of problems. Fix from Maxim Mikityanskiy.
12) Fix memory leak in genetlink, from Cong Wang.
13) Disallow sockmap attachments to UDP sockets, it simply won't work. From Lorenz Bauer.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  net: ethernet: ti: ale: fix allmulti for nu type ale
  net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
  net: atm: Remove the error message according to the atomic context
  bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
  libbpf: Support pre-initializing .bss global variables
  tools/bpftool: Fix skeleton codegen
  bpf: Fix memlock accounting for sock_hash
  bpf: sockmap: Don't attach programs to UDP sockets
  bpf: tcp: Recv() should return 0 when the peer socket is closed
  ibmvnic: Flush existing work items before device removal
  genetlink: clean up family attributes allocations
  net: ipa: header pad field only valid for AP->modem endpoint
  net: ipa: program upper nibbles of sequencer type
  net: ipa: fix modem LAN RX endpoint id
  net: ipa: program metadata mask differently
  ionic: add pcie_print_link_status
  rxrpc: Fix race between incoming ACK parser and retransmitter
  net/mlx5: E-Switch, Fix some error pointer dereferences
  net/mlx5: Don't fail driver on failure to create debugfs
  net/mlx5e: CT: Fix ipv6 nat header rewrite actions
  ...
2020-06-13 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | David S. Miller | 1 | -3/+1

Alexei Starovoitov says:

====================
pull-request: bpf 2020-06-12

The following pull-request contains BPF updates for your *net* tree.

We've added 26 non-merge commits during the last 10 day(s) which contain a total of 27 files changed, 348 insertions(+), 93 deletions(-).

The main changes are:

1) sock_hash accounting fix, from Andrey.

2) libbpf fix and probe_mem sanitizing, from Andrii.

3) sock_hash fixes, from Jakub.

4) devmap_val fix, from Jesper.

5) load_bytes_relative fix, from YiFei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-11 | xdp: Fix xsk_generic_xmit errno | Li RongQing | 1 | -3/+1

Propagate the sock_alloc_send_skb error code rather than setting it to EAGAIN unconditionally when skb allocation fails; the unconditional EAGAIN might cause user space to loop unnecessarily.

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1591852266-24017-1-git-send-email-lirongqing@baidu.com
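A hedged before/after sketch of the change; the call context is assumed from xsk_generic_xmit, not quoted from the patch.

  /* before: every allocation failure was reported as EAGAIN */
  skb = sock_alloc_send_skb(sk, len, 1, &err);
  if (!skb) {
          err = -EAGAIN;
          goto out;
  }

  /* after: keep whatever errno sock_alloc_send_skb() stored in err,
   * so user space can distinguish the failure modes */
  skb = sock_alloc_send_skb(sk, len, 1, &err);
  if (!skb)
          goto out;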
2020-06-09 | mmap locking API: use coccinelle to convert mmap_sem rwsem call sites | Michel Lespinasse | 1 | -2/+2

This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule:

  // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

  @@
  expression mm;
  @@
  (
  -init_rwsem
  +mmap_init_lock
  |
  -down_write
  +mmap_write_lock
  |
  -down_write_killable
  +mmap_write_lock_killable
  |
  -down_write_trylock
  +mmap_write_trylock
  |
  -up_write
  +mmap_write_unlock
  |
  -downgrade_write
  +mmap_write_downgrade
  |
  -down_read
  +mmap_read_lock
  |
  -down_read_killable
  +mmap_read_lock_killable
  |
  -down_read_trylock
  +mmap_read_trylock
  |
  -up_read
  +mmap_read_unlock
  )
  -(&mm->mmap_sem)
  +(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
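In net/xdp this amounts to a mechanical substitution like the following hedged sketch (the pin_user_pages() call site is assumed from the umem-pinning entries elsewhere in this log):

  /* before */
  down_read(&current->mm->mmap_sem);
  ret = pin_user_pages(address, umem->npgs, gup_flags, umem->pgs, NULL);
  up_read(&current->mm->mmap_sem);

  /* after */
  mmap_read_lock(current->mm);
  ret = pin_user_pages(address, umem->npgs, gup_flags, umem->pgs, NULL);
  mmap_read_unlock(current->mm);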
2020-06-04 | net/xdp: use shift instead of 64 bit division | Pavel Machek | 1 | -1/+1

64bit division is kind of expensive, and shift should do the job here.

Signed-off-by: Pavel Machek (CIP) <pavel@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
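The pattern, as a hedged sketch (the variable is illustrative; the actual line changed is not quoted in this log):

  /* before: a 64-bit division, which can be a runtime-library call */
  npgs = size / PAGE_SIZE;

  /* after: PAGE_SIZE is a power of two, so a shift is equivalent */
  npgs = size >> PAGE_SHIFT;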
2020-05-31 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | David S. Miller | 1 | -2/+6

xdp_umem.c had overlapping changes between the 64-bit math fix for the calculation of npgs and the removal of the zerocopy memory type which got rid of the chunk_size_nohdr member.

The mlx5 Kconfig conflict is a case where we just take the net-next copy of the Kconfig entry dependency as it takes on the ESWITCH dependency by one level of indirection which is what the 'net' conflicting change is trying to ensure.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 | xsk: Add overflow check for u64 division, stored into u32 | Björn Töpel | 1 | -2/+6

The npgs member of struct xdp_umem is a u32 entity, and stores the number of pages the UMEM consumes. The calculation of npgs

  npgs = size / PAGE_SIZE

can overflow. To avoid overflow scenarios, the division is now first stored in a u64, and the result is verified to fit into 32b. An alternative would be storing the npgs as a u64; however, this wastes memory and allows an unrealistically large packet area.

Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
Reported-by: "Minh Bùi Quang" <minhquangbui99@gmail.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/CACtPs=GGvV-_Yj6rbpzTVnopgi5nhMoCcTkSkYrJHGQHJWFZMQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20200525080400.13195-1-bjorn.topel@gmail.com
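A hedged sketch of the check, with the shape inferred from the description above:

  u64 npgs;

  npgs = size / PAGE_SIZE;
  if (npgs > U32_MAX)          /* must fit the u32 member of xdp_umem */
          return -EINVAL;
  umem->npgs = (u32)npgs;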
2020-05-21 | xsk: Explicitly inline functions and move definitions | Björn Töpel | 3 | -143/+65

In order to reduce the number of function calls, the struct xsk_buff_pool definition is moved to xsk_buff_pool.h. The functions xp_get_dma(), xp_dma_sync_for_cpu(), xp_dma_sync_for_device(), xp_validate_desc() and various helper functions are explicitly inlined. Further, move xp_get_handle() and xp_release() to xsk.c, to allow for the compiler to perform inlining.

rfc->v1: Make sure xp_validate_desc() is inlined for Tx perf. (Maxim)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-15-bjorn.topel@gmail.com
2020-05-21 | xsk: Remove MEM_TYPE_ZERO_COPY and corresponding code | Björn Töpel | 5 | -269/+9

There are no users of MEM_TYPE_ZERO_COPY. Remove all corresponding code, including the "handle" member of struct xdp_buff.

rfc->v1: Fixed spelling in commit message. (Björn)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-13-bjorn.topel@gmail.com
2020-05-21 | xsk: Introduce AF_XDP buffer allocation API | Björn Töpel | 6 | -113/+582

In order to simplify AF_XDP zero-copy enablement for NIC driver developers, a new AF_XDP buffer allocation API is added. The implementation is based on a single core (single producer/consumer) buffer pool for the AF_XDP UMEM.

A buffer is allocated using the xsk_buff_alloc() function, and returned using xsk_buff_free(). If a buffer is disassociated with the pool, e.g. when a buffer is passed to an AF_XDP socket, a buffer is said to be released. Currently, the release function is only used by the AF_XDP internals and not visible to the driver.

Drivers using this API should register the XDP memory model with the new MEM_TYPE_XSK_BUFF_POOL type. The API is defined in net/xdp_sock_drv.h. The buffer type is struct xdp_buff, and follows the lifetime of regular xdp_buffs, i.e. the lifetime of an xdp_buff is restricted to a NAPI context. In other words, the API is not replacing xdp_frames.

In addition to introducing the API and implementations, the AF_XDP core is migrated to use the new APIs.

rfc->v1: Fixed build errors/warnings for m68k and riscv. (kbuild test robot)
         Added headroom/chunk size getter. (Maxim/Björn)
v1->v2:  Swapped SoBs. (Maxim)
v2->v3:  Initialize struct xdp_buff member frame_sz. (Björn)
         Add API to query the DMA address of a frame. (Maxim)
         Do DMA sync for CPU till the end of the frame to handle possible growth (frame_sz). (Maxim)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-6-bjorn.topel@gmail.com
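A hedged driver-side sketch of the allocation flow this API enables. xsk_buff_alloc(), xsk_buff_free() and xsk_buff_xdp_get_dma() are the interface names introduced here; the surrounding RX-refill loop and its ring structure are illustrative assumptions.

  /* Sketch: refill count RX slots from the AF_XDP buffer pool inside
   * NAPI context; hardware-descriptor programming is driver specific. */
  static u16 refill_rx_sketch(struct my_rx_ring *ring, u16 count)
  {
          struct xdp_buff *xdp;
          u16 i;

          for (i = 0; i < count; i++) {
                  xdp = xsk_buff_alloc(ring->umem);
                  if (!xdp)
                          break;  /* fill ring ran dry */
                  ring->slots[i].xdp = xdp;
                  /* program xsk_buff_xdp_get_dma(xdp) into the HW desc */
          }
          return i;  /* buffers are handed back via xsk_buff_free() */
  }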
2020-05-21 | xsk: Move defines only used by AF_XDP internals to xsk.h | Björn Töpel | 2 | -0/+16

Move the XSK_NEXT_PG_CONTIG_{MASK,SHIFT}, and XDP_UMEM_USES_NEED_WAKEUP defines from xdp_sock.h to the AF_XDP internal xsk.h file. Also, start using the BIT{,_ULL} macro instead of explicit shifts.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-5-bjorn.topel@gmail.com
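The BIT() change follows this pattern (a hedged example; the bit position is an assumption for illustration):

  /* before: open-coded shift */
  #define XDP_UMEM_USES_NEED_WAKEUP (1 << 10)

  /* after: BIT() from linux/bits.h signals a single-bit flag */
  #define XDP_UMEM_USES_NEED_WAKEUP BIT(10)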
2020-05-21 | xsk: Move driver interface to xdp_sock_drv.h | Magnus Karlsson | 3 | -2/+3

Move the AF_XDP zero-copy driver interface to its own include file called xdp_sock_drv.h. This, hopefully, will make it more clear for NIC driver implementors to know what functions to use for zero-copy support.

v4->v5: Fix -Wmissing-prototypes by include header file. (Jakub)

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-4-bjorn.topel@gmail.com
2020-05-21 | xsk: Move xskmap.c to net/xdp/ | Björn Töpel | 3 | -1/+284

The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related code. Also, move AF_XDP struct definitions, and function declarations only used by AF_XDP internals, into net/xdp/xsk.h.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
2020-05-04 | xsk: Remove unnecessary member in xdp_umem | Magnus Karlsson | 1 | -4/+3

Remove the unnecessary address member from struct xdp_umem, as it is only used during umem registration. There is no need to carry this around, as it is used neither at run-time nor when unregistering the umem.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1588599232-24897-3-git-send-email-magnus.karlsson@intel.com
2020-05-04 | xsk: Change two variable names for increased clarity | Magnus Karlsson | 4 | -17/+17

Change two variable names so that it is clearer what they represent. The first one is xsk_list, which in fact only contains the list of AF_XDP sockets with a Tx component; change this to xsk_tx_list for improved clarity. The second variable is size in the ring structure. One might think that this is the size of the ring, but it is in fact the size of the umem, copied into the ring structure to improve performance. Rename this variable to umem_size to avoid any confusion.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1588599232-24897-2-git-send-email-magnus.karlsson@intel.com
2020-04-26 | xsk: Fix typo in xsk_umem_consume_tx and xsk_generic_xmit comments | Tobias Klauser | 1 | -2/+2

s/backpreassure/backpressure/

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20200421232927.21082-1-tklauser@distanz.ch
2020-04-15 | xsk: Add missing check on user supplied headroom size | Magnus Karlsson | 1 | -3/+2

Add a check that the headroom cannot be larger than the available space in the chunk. In the current code, a malicious user can set the headroom to a value larger than the chunk size minus the fixed XDP headroom. That way packets with a length larger than the supported size in the umem could get accepted and result in an out-of-bounds write.

Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
Reported-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=207225
Link: https://lore.kernel.org/bpf/1586849715-23490-1-git-send-email-magnus.karlsson@intel.com
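A hedged sketch of the added validation at umem-registration time (variable names assumed; XDP_PACKET_HEADROOM is the fixed kernel headroom mentioned above):

  /* user headroom plus the fixed XDP headroom must leave room for data */
  if (headroom >= chunk_size - XDP_PACKET_HEADROOM)
          return -EINVAL;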
2020-04-06 | xsk: Fix out of boundary write in __xsk_rcv_memcpy | Li RongQing | 1 | -2/+3

first_len is the remainder of the first page we are copying into; if the amount to copy is larger than that, an out-of-page-boundary write would otherwise happen.

Fixes: c05cd3645814 ("xsk: add support to allow unaligned chunk placement")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1585813930-19712-1-git-send-email-lirongqing@baidu.com
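A hedged illustration of the split-copy pattern being guarded (names follow the description above, not the patch itself):

  /* Copy len bytes into a chunk that may straddle a page boundary:
   * cap the first memcpy at what is left of the first page. */
  u64 first_len = PAGE_SIZE - (addr & (PAGE_SIZE - 1));

  memcpy(first_pg_addr, from_buf, min_t(u64, first_len, len));
  if (len > first_len)
          memcpy(next_pg_addr, from_buf + first_len, len - first_len);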
2020-02-28 | xdp: Replace zero-length array with flexible-array member | Gustavo A. R. Silva | 1 | -2/+2

The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99:

  struct foo {
          int stuff;
          struct boo array[];
  };

By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-11 | xsk: Publish global consumer pointers when NAPI is finished | Magnus Karlsson | 2 | -1/+4

The commit 4b638f13bab4 ("xsk: Eliminate the RX batch size") introduced a much more lazy way of updating the global consumer pointers from the kernel side, by only doing so when running out of entries in the fill or Tx rings (the rings consumed by the kernel). This can result in a deadlock with the user application if the kernel requires more than one entry to proceed and the application cannot put these entries in the fill ring because the kernel has not updated the global consumer pointer since the ring is not empty.

Fix this by publishing the local kernel side consumer pointer whenever we have completed Rx or Tx processing in the kernel. This way, user space will have an up-to-date view of the consumer pointers whenever it gets to execute in the one core case (application and driver on the same core), or after a certain number of packets have been processed in the two core case (application and driver on different cores).

A side effect of this patch is that the one core case gets better performance, but the two core case gets worse. The reason that the one core case improves is that updating the global consumer pointer is relatively cheap since the application by definition is not running when the kernel is (they are on the same core) and it is beneficial for the application, once it gets to run, to have pointers that are as up to date as possible since it then can operate on more packets and buffers. In the two core case, the most important performance aspect is to minimize the number of accesses to the global pointers since they are shared between two cores and bounce between the caches of those cores. This patch results in more updates to global state, which means lower performance in the two core case.

Fixes: 4b638f13bab4 ("xsk: Eliminate the RX batch size")
Reported-by: Ryan Goodfellow <rgoodfel@isi.edu>
Reported-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Link: https://lore.kernel.org/bpf/1581348432-6747-1-git-send-email-magnus.karlsson@intel.com
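The publish step itself is a single release store; a hedged generic sketch of the mechanism, with the names being simplified stand-ins:

  /* Make the locally cached consumer index visible to user space so
   * it can refill the ring before the kernel runs dry. */
  static void publish_consumer_sketch(struct xsk_queue_sketch *q)
  {
          /* pairs with user space acquire-reading ring->consumer */
          smp_store_release(&q->ring->consumer, q->cached_cons);
  }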
2020-01-31 | mm, tree-wide: rename put_user_page*() to unpin_user_page*() | John Hubbard | 1 | -1/+1

In order to provide a clearer, more symmetric API for pinning and unpinning DMA pages. This way, pin_user_pages*() calls match up with unpin_user_pages*() calls, and the API is a lot closer to being self-explanatory.

Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 | net/xdp: set FOLL_PIN via pin_user_pages() | John Hubbard | 1 | -1/+1

Convert net/xdp to use the new pin_longterm_pages() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already calling put_user_page() instead of put_page(). Therefore, in order to convert from the get_user_pages()/put_page() model, to the pin_user_pages()/put_user_page() model, the only change required here is to change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/20200107224558.2362728-18-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
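The conversion pattern, as a hedged sketch (the arguments are assumed from the umem-pinning context of this directory):

  /* before: generic GUP; pages released later with put_user_page() */
  ret = get_user_pages(address, npgs, gup_flags, pages, NULL);

  /* after: FOLL_PIN-tracked pinning; the release side then becomes
   * unpin_user_page()/unpin_user_pages() per the previous entry */
  ret = pin_user_pages(address, npgs, gup_flags, pages, NULL);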
2020-01-22 | xsk, net: Make sock_def_readable() have external linkage | Björn Töpel | 1 | -1/+1

XDP sockets use the default implementation of struct sock's sk_data_ready callback, which is sock_def_readable(). This function is called in the XDP socket fast-path, and involves a retpoline. By letting sock_def_readable() have external linkage, and being called directly, the retpoline can be avoided.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200120092917.13949-1-bjorn.topel@gmail.com
2020-01-15 | xsk: Support allocations of large umems | Magnus Karlsson | 1 | -3/+4

When registering a umem area that is sufficiently large (>1G on an x86), kmalloc cannot be used to allocate one of the internal data structures, as the size requested gets too large. Use kvmalloc instead, which falls back on vmalloc if the allocation is too large for kmalloc. Also add accounting for this structure as it is triggered by a user space action (the XDP_UMEM_REG setsockopt) and it is by far the largest structure of kernel allocated memory in xsk.

Reported-by: Ryan Goodfellow <rgoodfel@isi.edu>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1578995365-7050-1-git-send-email-magnus.karlsson@intel.com
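The pattern, as a hedged sketch (field names taken from the surrounding entries; the exact call is an assumption):

  /* kvcalloc() falls back to vmalloc() when the request is too big
   * for kmalloc(); __GFP_ACCOUNT charges the user space action that
   * triggered the XDP_UMEM_REG setsockopt. Freed with kvfree(). */
  umem->pgs = kvcalloc(umem->npgs, sizeof(*umem->pgs),
                       GFP_KERNEL | __GFP_ACCOUNT);
  if (!umem->pgs)
          return -ENOMEM;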
2019-12-27 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | David S. Miller | 3 | -224/+241

Daniel Borkmann says:

====================
pull-request: bpf-next 2019-12-27

The following pull-request contains BPF updates for your *net-next* tree.

We've added 127 non-merge commits during the last 17 day(s) which contain a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).

There are three merge conflicts. Conflicts and resolutions look as follows:

1) Merge conflict in net/bpf/test_run.c:

There was a tree-wide cleanup c593642c8be0 ("treewide: Use sizeof_field() macro") which gets in the way with b590cb5f802d ("bpf: Switch to offsetofend in BPF_PROG_TEST_RUN"):

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                             sizeof_field(struct __sk_buff, priority),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
  >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

There are a few occasions that look similar to this. Always take the chunk with offsetofend(). Note that there is one where the fields differ in here:

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                             sizeof_field(struct __sk_buff, tstamp),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
  >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to 850a88cc4096 ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").

2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:

(I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)

  <<<<<<< HEAD
          if (is_13b_check(off, insn))
                  return -1;
          emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
  =======
          emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
  >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

Result should look like:

          emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);

3) Merge conflict in arch/riscv/include/asm/pgtable.h:

  <<<<<<< HEAD
  =======
  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE   (SZ_128M)
  #define BPF_JIT_REGION_START  (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END    (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE     BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END      (VMALLOC_START - 1)
  #define VMEMMAP_START    (VMALLOC_START - VMEMMAP_SIZE)

  #define vmemmap          ((struct page *)VMEMMAP_START)
  >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

Only take the BPF_* defines from there and move them higher up in the same file. Remove the rest from the chunk. The VMALLOC_* etc defines got moved via 01f52e16b868 ("riscv: define vmemmap before pfn_to_page calls"). Result:

  [...]
  #define __S101  PAGE_READ_EXEC
  #define __S110  PAGE_SHARED_EXEC
  #define __S111  PAGE_SHARED_EXEC

  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE   (SZ_128M)
  #define BPF_JIT_REGION_START  (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END    (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE     BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END      (VMALLOC_START - 1)
  #define VMEMMAP_START    (VMALLOC_START - VMEMMAP_SIZE)
  [...]

Let me know if there are any other issues. Anyway, the main changes are:

1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific to a provided BPF object file. This provides an alternative, simplified API compared to standard libbpf interaction. Also, add libbpf extern variable resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.

2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also, add various BPF riscv JIT improvements, from Björn Töpel.

3) Extend bpftool to allow matching BPF programs and maps by name, from Paul Chaignon.

4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI flag for allowing updates without service interruption, from Andrey Ignatov.

5) Cleanup and simplification of ring access functions for AF_XDP with a bonus of 0-5% performance improvement, from Magnus Karlsson.

6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.

7) Move and extend test_select_reuseport into BPF program tests under BPF selftests, from Jakub Sitnicki.

8) Various BPF sample improvements for xdpsock for customizing parameters to set up and benchmark AF_XDP, from Jay Jayatheerthan.

9) Improve libbpf to provide a ulimit hint on permission denied errors. Also change XDP sample programs to attach in driver mode by default, from Toke Høiland-Jørgensen.

10) Extend BPF test infrastructure to allow changing skb mark from tc BPF programs, from Nikita V. Shirokov.

11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.

12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after libbpf conversion, from Jesper Dangaard Brouer.

13) Minor misc improvements from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 | xsk: Use struct_size() helper | Magnus Karlsson | 1 | -8/+7

Improve readability and maintainability by using the struct_size() helper when allocating the AF_XDP rings.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1576759171-28550-13-git-send-email-magnus.karlsson@intel.com
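struct_size() computes the size of a struct with a trailing flexible array and saturates on overflow. A hedged sketch of the kind of change involved (the ring type and allocator are assumptions):

  struct rxtx_ring_sketch {
          struct xdp_ring ptrs;
          struct xdp_desc desc[];  /* flexible array of descriptors */
  };

  /* before: open-coded, overflow-prone size arithmetic */
  size = sizeof(struct rxtx_ring_sketch) +
         nentries * sizeof(struct xdp_desc);

  /* after: struct_size() from linux/overflow.h
   * (ring is a struct rxtx_ring_sketch *) */
  size = struct_size(ring, desc, nentries);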
2019-12-20 | xsk: Add function naming comments and reorder functions | Magnus Karlsson | 2 | -136/+167

Add comments on how the ring access functions are named and how they are supposed to be used for producers and consumers. The functions are also reordered so that the consumer functions are in the beginning and the producer functions in the end, for easier reference. Put this in a separate patch as the diff might look a little odd, but no functionality has changed in this patch.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1576759171-28550-12-git-send-email-magnus.karlsson@intel.com
2019-12-20 | xsk: Remove unnecessary READ_ONCE of data | Magnus Karlsson | 1 | -2/+2

There are two unnecessary READ_ONCE of descriptor data. These are not needed since the data is written by the producer before it signals that the data is available by incrementing the producer pointer. As the access to this producer pointer is serialized and the consumer always reads the descriptor after it has read and synchronized with the producer counter, the write of the descriptor will have fully completed and it does not matter if the consumer has any read tearing.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1576759171-28550-11-git-send-email-magnus.karlsson@intel.com
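The ordering argument, as a hedged generic sketch (ring layout simplified):

  /* producer: write the descriptor, then publish with a release store */
  ring->desc[prod & mask] = *desc;
  smp_store_release(&ring->producer, prod + 1);        /* A */

  /* consumer: acquire-read the producer index; after that, a plain
   * load of the descriptor suffices and READ_ONCE() is not required */
  avail = smp_load_acquire(&ring->producer);           /* pairs with A */
  if (cons != avail)
          *desc = ring->desc[cons & mask];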