summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2022-01-02net/smc: add comments for smc_link_{usable|sendable}Dust Li1-1/+16
Add comments for both smc_link_sendable() and smc_link_usable() to help better distinguish and use them. No function changes. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net: socket.c: style fixHamish MacDonald1-1/+1
Removed spaces and added a tab that was causing an error on checkpatch Signed-off-by: Hamish MacDonald <elusivenode@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02ipv6: ioam: Support for Queue depth data fieldJustin Iurman1-1/+15
v3: - Report 'backlog' (bytes) instead of 'qlen' (number of packets) v2: - Fix sparse warning (use rcu_dereference) This patch adds support for the queue depth in IOAM trace data fields. The draft [1] says the following: The "queue depth" field is a 4-octet unsigned integer field. This field indicates the current length of the egress interface queue of the interface from where the packet is forwarded out. The queue depth is expressed as the current amount of memory buffers used by the queue (a packet could consume one or more memory buffers, depending on its size). An existing function (i.e., qdisc_qstats_qlen_backlog) is used to retrieve the current queue length without reinventing the wheel. Note: it was tested and qlen is increasing when an artificial delay is added on the egress with tc. [1] https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-data#section-5.4.2.7 Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: remove redundant re-assignment of pointer linkColin Ian King1-1/+0
The pointer link is being re-assigned the same value that it was initialized with in the previous declaration statement. The re-assignment is redundant and can be removed. Fixes: 387707fdf486 ("net/smc: convert static link ID to dynamic references") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: Introduce TCP ULP supportTony Lu1-7/+86
This implements TCP ULP for SMC, helps applications to replace TCP with SMC protocol in place. And we use it to implement transparent replacement. This replaces original TCP sockets with SMC, reuse TCP as clcsock when calling setsockopt with TCP_ULP option, and without any overhead. To replace TCP sockets with SMC, there are two approaches: - use setsockopt() syscall with TCP_ULP option, if error, it would fallback to TCP. - use BPF prog with types BPF_CGROUP_INET_SOCK_CREATE or others to replace transparently. BPF hooks some points in create socket, bind and others, users can inject their BPF logics without modifying their applications, and choose which connections should be replaced with SMC by calling setsockopt() in BPF prog, based on rules, such as TCP tuples, PID, cgroup, etc... BPF doesn't support calling setsockopt with TCP_ULP now, I will send the patches after this accepted. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: Add net namespace for tracepointsTony Lu1-7/+16
This prints net namespace ID, helps us to distinguish different net namespaces when using tracepoints. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: Print net namespace in logTony Lu2-9/+14
This adds net namespace ID to the kernel log, net_cookie is unique in the whole system. It is useful in container environment. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: Add netlink net namespace supportTony Lu2-7/+12
This adds net namespace ID to diag of linkgroup, helps us to distinguish different namespaces, and net_cookie is unique in the whole system. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net/smc: Introduce net namespace support for linkgroupTony Lu4-12/+42
Currently, rdma device supports exclusive net namespace isolation, however linkgroup doesn't know and support ibdev net namespace. Applications in the containers don't want to share the nics if we enabled rdma exclusive mode. Every net namespaces should have their own linkgroups. This patch introduce a new field net for linkgroup, which is standing for the ibdev net namespace in the linkgroup. The net in linkgroup is initialized with the net namespace of link's ibdev. It compares the net of linkgroup and sock or ibdev before choose it, if no matched, create new one in current net namespace. If rdma net namespace exclusive mode is not enabled, it behaves as before. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller39-40/+84
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-12-30 The following pull-request contains BPF updates for your *net-next* tree. We've added 72 non-merge commits during the last 20 day(s) which contain a total of 223 files changed, 3510 insertions(+), 1591 deletions(-). The main changes are: 1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii. 2) Beautify and de-verbose verifier logs, from Christy. 3) Composable verifier types, from Hao. 4) bpf_strncmp helper, from Hou. 5) bpf.h header dependency cleanup, from Jakub. 6) get_func_[arg|ret|arg_cnt] helpers, from Jiri. 7) Sleepable local storage, from KP. 8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski25-121/+172
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c commit 077cdda764c7 ("net/mlx5e: TC, Fix memory leak with rules with internal port") commit 31108d142f36 ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'") commit 4390c6edc0fb ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'") https://lore.kernel.org/all/20211229065352.30178-1-saeed@kernel.org/ net/smc/smc_wr.c commit 49dc9013e34b ("net/smc: Use the bitmap API when applicable") commit 349d43127dac ("net/smc: fix kernel panic caused by race of smc_sock") bitmap_zero()/memset() is removed by the fix Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-30net/smc: Use the bitmap API when applicableChristophe JAILLET1-14/+5
Using the bitmap API is less verbose than hand writing them. It also improves the semantic. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-30ethtool: Remove redundant ret assignmentsluo penghao1-1/+0
The assignment here will be overwritten, so it should be deleted The clang_analyzer complains as follows: net/ethtool/netlink.c: Value stored to 'ret' is never read Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-29bpf: Allow bpf_local_storage to be used by sleepable programsKP Singh1-1/+7
Other maps like hashmaps are already available to sleepable programs. Sleepable BPF programs run under trace RCU. Allow task, sk and inode storage to be used from sleepable programs. This allows sleepable and non-sleepable programs to provide shareable annotations on kernel objects. Sleepable programs run in trace RCU where as non-sleepable programs run in a normal RCU critical section i.e. __bpf_prog_enter{_sleepable} and __bpf_prog_exit{_sleepable}) (rcu_read_lock or rcu_read_lock_trace). In order to make the local storage maps accessible to both sleepable and non-sleepable programs, one needs to call both call_rcu_tasks_trace and call_rcu to wait for both trace and classical RCU grace periods to expire before freeing memory. Paul's work on call_rcu_tasks_trace allows us to have per CPU queueing for call_rcu_tasks_trace. This behaviour can be achieved by setting rcupdate.rcu_task_enqueue_lim=<num_cpus> boot parameter. In light of these new performance changes and to keep the local storage code simple, avoid adding a new flag for sleepable maps / local storage to select the RCU synchronization (trace / classical). Also, update the dereferencing of the pointers to use rcu_derference_check (with either the trace or normal RCU locks held) with a common bpf_rcu_lock_held helper method. Signed-off-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20211224152916.1550677-2-kpsingh@kernel.org
2021-12-29net/ncsi: check for error return from call to nla_put_u32Jiasheng Jiang1-1/+5
As we can see from the comment of the nla_put() that it could return -EMSGSIZE if the tailroom of the skb is insufficient. Therefore, it should be better to check the return value of the nla_put_u32 and return the error code if error accurs. Also, there are many other functions have the same problem, and if this patch is correct, I will commit a new version to fix all. Fixes: 955dc68cb9b2 ("net/ncsi: Add generic netlink family") Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn> Link: https://lore.kernel.org/r/20211229032118.1706294-1-jiasheng@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-29net: bridge: mcast: fix br_multicast_ctx_vlan_global_disabled helperNikolay Aleksandrov1-3/+3
We need to first check if the context is a vlan one, then we need to check the global bridge multicast vlan snooping flag, and finally the vlan's multicast flag, otherwise we will unnecessarily enable vlan mcast processing (e.g. querier timers). Fixes: 7b54aaaf53cb ("net: bridge: multicast: add vlan state initialization and control") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20211228153142.536969-1-nikolay@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-29net: fix use-after-free in tw_timer_handlerMuchun Song1-6/+4
A real world panic issue was found as follow in Linux 5.4. BUG: unable to handle page fault for address: ffffde49a863de28 PGD 7e6fe62067 P4D 7e6fe62067 PUD 7e6fe63067 PMD f51e064067 PTE 0 RIP: 0010:tw_timer_handler+0x20/0x40 Call Trace: <IRQ> call_timer_fn+0x2b/0x120 run_timer_softirq+0x1ef/0x450 __do_softirq+0x10d/0x2b8 irq_exit+0xc7/0xd0 smp_apic_timer_interrupt+0x68/0x120 apic_timer_interrupt+0xf/0x20 This issue was also reported since 2017 in the thread [1], unfortunately, the issue was still can be reproduced after fixing DCCP. The ipv4_mib_exit_net is called before tcp_sk_exit_batch when a net namespace is destroyed since tcp_sk_ops is registered befrore ipv4_mib_ops, which means tcp_sk_ops is in the front of ipv4_mib_ops in the list of pernet_list. There will be a use-after-free on net->mib.net_statistics in tw_timer_handler after ipv4_mib_exit_net if there are some inflight time-wait timers. This bug is not introduced by commit f2bf415cfed7 ("mib: add net to NET_ADD_STATS_BH") since the net_statistics is a global variable instead of dynamic allocation and freeing. Actually, commit 61a7e26028b9 ("mib: put net statistics on struct net") introduces the bug since it put net statistics on struct net and free it when net namespace is destroyed. Moving init_ipv4_mibs() to the front of tcp_init() to fix this bug and replace pr_crit() with panic() since continuing is meaningless when init_ipv4_mibs() fails. [1] https://groups.google.com/g/syzkaller/c/p1tn-_Kc6l4/m/smuL_FMAAgAJ?pli=1 Fixes: 61a7e26028b9 ("mib: put net statistics on struct net") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Cong Wang <cong.wang@bytedance.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20211228104145.9426-1-songmuchun@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-29Merge tag 'for-net-next-2021-12-29' of ↵Jakub Kicinski14-1934/+2369
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for Foxconn MT7922A - Add support for Realtek RTL8852AE - Rework HCI event handling to use skb_pull_data * tag 'for-net-next-2021-12-29' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (62 commits) Bluetooth: MGMT: Fix spelling mistake "simultanous" -> "simultaneous" Bluetooth: vhci: Set HCI_QUIRK_VALID_LE_STATES Bluetooth: MGMT: Fix LE simultaneous roles UUID if not supported Bluetooth: hci_sync: Add check simultaneous roles support Bluetooth: hci_sync: Wait for proper events when connecting LE Bluetooth: hci_sync: Add support for waiting specific LE subevents Bluetooth: hci_sync: Add hci_le_create_conn_sync Bluetooth: hci_event: Use skb_pull_data when processing inquiry results Bluetooth: hci_sync: Push sync command cancellation to workqueue Bluetooth: hci_qca: Stop IBS timer during BT OFF Bluetooth: btusb: Add support for Foxconn MT7922A Bluetooth: btintel: Add missing quirks and msft ext for legacy bootloader Bluetooth: btusb: Add two more Bluetooth parts for WCN6855 Bluetooth: L2CAP: Fix using wrong mode Bluetooth: hci_sync: Fix not always pausing advertising when necessary Bluetooth: mgmt: Make use of mgmt_send_event_skb in MGMT_EV_DEVICE_CONNECTED Bluetooth: mgmt: Make use of mgmt_send_event_skb in MGMT_EV_DEVICE_FOUND Bluetooth: mgmt: Introduce mgmt_alloc_skb and mgmt_send_event_skb Bluetooth: btusb: Return error code when getting patch status failed Bluetooth: btusb: Handle download_firmware failure cases ... ==================== Link: https://lore.kernel.org/r/20211229211258.2290966-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-29net: bridge: mcast: add and enforce startup query interval minimumNikolay Aleksandrov5-3/+22
As reported[1] if startup query interval is set too low in combination with large number of startup queries and we have multiple bridges or even a single bridge with multiple querier vlans configured we can crash the machine. Add a 1 second minimum which must be enforced by overwriting the value if set lower (i.e. without returning an error) to avoid breaking user-space. If that happens a log message is emitted to let the admin know that the startup interval has been set to the minimum. It doesn't make sense to make the startup interval lower than the normal query interval so use the same value of 1 second. The issue has been present since these intervals could be user-controlled. [1] https://lore.kernel.org/netdev/e8b9ce41-57b9-b6e2-a46a-ff9c791cf0ba@gmail.com/ Fixes: d902eee43f19 ("bridge: Add multicast count/interval sysfs entries") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-29net: bridge: mcast: add and enforce query interval minimumNikolay Aleksandrov5-3/+22
As reported[1] if query interval is set too low and we have multiple bridges or even a single bridge with multiple querier vlans configured we can crash the machine. Add a 1 second minimum which must be enforced by overwriting the value if set lower (i.e. without returning an error) to avoid breaking user-space. If that happens a log message is emitted to let the administrator know that the interval has been set to the minimum. The issue has been present since these intervals could be user-controlled. [1] https://lore.kernel.org/netdev/e8b9ce41-57b9-b6e2-a46a-ff9c791cf0ba@gmail.com/ Fixes: d902eee43f19 ("bridge: Add multicast count/interval sysfs entries") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-29ipv6: raw: check passed optlen before readingTamir Duberstein1-0/+3
Add a check that the user-provided option is at least as long as the number of bytes we intend to read. Before this patch we would blindly read sizeof(int) bytes even in cases where the user passed optlen<sizeof(int), which would potentially read garbage or fault. Discovered by new tests in https://github.com/google/gvisor/pull/6957 . The original get_user call predates history in the git repo. Signed-off-by: Tamir Duberstein <tamird@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20211229200947.2862255-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-29xsk: Initialise xskb free_list_nodeCiara Loftus1-0/+1
This commit initialises the xskb's free_list_node when the xskb is allocated. This prevents a potential false negative returned from a call to list_empty for that node, such as the one introduced in commit 199d983bc015 ("xsk: Fix crash on double free in buffer pool") In my environment this issue caused packets to not be received by the xdpsock application if the traffic was running prior to application launch. This happened when the first batch of packets failed the xskmap lookup and XDP_PASS was returned from the bpf program. This action is handled in the i40e zc driver (and others) by allocating an skbuff, freeing the xdp_buff and adding the associated xskb to the xsk_buff_pool's free_list if it hadn't been added already. Without this fix, the xskb is not added to the free_list because the check to determine if it was added already returns an invalid positive result. Later, this caused allocation errors in the driver and the failure to receive packets. Fixes: 199d983bc015 ("xsk: Fix crash on double free in buffer pool") Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API") Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20211220155250.2746-1-ciara.loftus@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-29net: Don't include filter.h from net/sock.hJakub Kicinski32-0/+35
sock.h is pretty heavily used (5k objects rebuilt on x86 after it's touched). We can drop the include of filter.h from it and add a forward declaration of struct sk_filter instead. This decreases the number of rebuilt objects when bpf.h is touched from ~5k to ~1k. There's a lot of missing includes this was masking. Primarily in networking tho, this time. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
2021-12-29of: net: support NVMEM cells with MAC in text formatRafał Miłecki1-11/+22
Some NVMEM devices have text based cells. In such cases MAC is stored in a XX:XX:XX:XX:XX:XX format. Use mac_pton() to parse such data and support those NVMEM cells. This is required to support e.g. a very popular U-Boot and its environment variables. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-28net: caif: remove redundant assignment to variable expectlenColin Ian King1-1/+0
Variable expectlen is being assigned a value that is never read, the assignment occurs before a return statement. The assignment is redundant and can be removed. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-28net/smc: fix kernel panic caused by race of smc_sockDust Li8-76/+57
A crash occurs when smc_cdc_tx_handler() tries to access smc_sock but smc_release() has already freed it. [ 4570.695099] BUG: unable to handle page fault for address: 000000002eae9e88 [ 4570.696048] #PF: supervisor write access in kernel mode [ 4570.696728] #PF: error_code(0x0002) - not-present page [ 4570.697401] PGD 0 P4D 0 [ 4570.697716] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 4570.698228] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc4+ #111 [ 4570.699013] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/0 [ 4570.699933] RIP: 0010:_raw_spin_lock+0x1a/0x30 <...> [ 4570.711446] Call Trace: [ 4570.711746] <IRQ> [ 4570.711992] smc_cdc_tx_handler+0x41/0xc0 [ 4570.712470] smc_wr_tx_tasklet_fn+0x213/0x560 [ 4570.712981] ? smc_cdc_tx_dismisser+0x10/0x10 [ 4570.713489] tasklet_action_common.isra.17+0x66/0x140 [ 4570.714083] __do_softirq+0x123/0x2f4 [ 4570.714521] irq_exit_rcu+0xc4/0xf0 [ 4570.714934] common_interrupt+0xba/0xe0 Though smc_cdc_tx_handler() checked the existence of smc connection, smc_release() may have already dismissed and released the smc socket before smc_cdc_tx_handler() further visits it. smc_cdc_tx_handler() |smc_release() if (!conn) | | |smc_cdc_tx_dismiss_slots() | smc_cdc_tx_dismisser() | |sock_put(&smc->sk) <- last sock_put, | smc_sock freed bh_lock_sock(&smc->sk) (panic) | To make sure we won't receive any CDC messages after we free the smc_sock, add a refcount on the smc_connection for inflight CDC message(posted to the QP but haven't received related CQE), and don't release the smc_connection until all the inflight CDC messages haven been done, for both success or failed ones. Using refcount on CDC messages brings another problem: when the link is going to be destroyed, smcr_link_clear() will reset the QP, which then remove all the pending CQEs related to the QP in the CQ. To make sure all the CQEs will always come back so the refcount on the smc_connection can always reach 0, smc_ib_modify_qp_reset() was replaced by smc_ib_modify_qp_error(). And remove the timeout in smc_wr_tx_wait_no_pending_sends() since we need to wait for all pending WQEs done, or we may encounter use-after- free when handling CQEs. For IB device removal routine, we need to wait for all the QPs on that device been destroyed before we can destroy CQs on the device, or the refcount on smc_connection won't reach 0 and smc_sock cannot be released. Fixes: 5f08318f617b ("smc: connection data control (CDC)") Reported-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-28net/smc: don't send CDC/LLC message if link not readyDust Li5-5/+11
We found smc_llc_send_link_delete_all() sometimes wait for 2s timeout when testing with RDMA link up/down. It is possible when a smc_link is in ACTIVATING state, the underlaying QP is still in RESET or RTR state, which cannot send any messages out. smc_llc_send_link_delete_all() use smc_link_usable() to checks whether the link is usable, if the QP is still in RESET or RTR state, but the smc_link is in ACTIVATING, this LLC message will always fail without any CQE entering the CQ, and we will always wait 2s before timeout. Since we cannot send any messages through the QP before the QP enter RTS. I add a wrapper smc_link_sendable() which checks the state of QP along with the link state. And replace smc_link_usable() with smc_link_sendable() in all LLC & CDC message sending routine. Fixes: 5f08318f617b ("smc: connection data control (CDC)") Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-27net: udp: fix alignment problem in udp4_seq_show()yangxingwu1-1/+1
$ cat /pro/net/udp before: sl local_address rem_address st tx_queue rx_queue tr tm->when 26050: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 26320: 0100007F:0143 00000000:0000 07 00000000:00000000 00:00000000 27135: 00000000:8472 00000000:0000 07 00000000:00000000 00:00000000 after: sl local_address rem_address st tx_queue rx_queue tr tm->when 26050: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 26320: 0100007F:0143 00000000:0000 07 00000000:00000000 00:00000000 27135: 00000000:8472 00000000:0000 07 00000000:00000000 00:00000000 Signed-off-by: yangxingwu <xingwu.yang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-27net/smc: fix using of uninitialized completionsKarsten Graul1-2/+4
In smc_wr_tx_send_wait() the completion on index specified by pend->idx is initialized and after smc_wr_tx_send() was called the wait for completion starts. pend->idx is used to get the correct index for the wait, but the pend structure could already be cleared in smc_wr_tx_process_cqe(). Introduce pnd_idx to hold and use a local copy of the correct index. Fixes: 09c61d24f96d ("net/smc: wait for departure of an IB message") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-27net: bridge: Get SIOCGIFBR/SIOCSIFBR ioctl working in compat modeRemi Pommarel2-43/+52
In compat mode SIOC{G,S}IFBR ioctls were only supporting BRCTL_GET_VERSION returning an artificially version to spur userland tool to use SIOCDEVPRIVATE instead. But some userland tools ignore that and use SIOC{G,S}IFBR unconditionally as seen with busybox's brctl. Example of non working 32-bit brctl with CONFIG_COMPAT=y: $ brctl show brctl: SIOCGIFBR: Invalid argument Example of fixed 32-bit brctl with CONFIG_COMPAT=y: $ brctl show bridge name bridge id STP enabled interfaces br0 Signed-off-by: Remi Pommarel <repk@triplefau.lt> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-27ip6_vti: initialize __ip6_tnl_parm struct in vti6_siocdevprivateWilliam Zhao1-0/+2
The "__ip6_tnl_parm" struct was left uninitialized causing an invalid load of random data when the "__ip6_tnl_parm" struct was used elsewhere. As an example, in the function "ip6_tnl_xmit_ctl()", it tries to access the "collect_md" member. With "__ip6_tnl_parm" being uninitialized and containing random data, the UBSAN detected that "collect_md" held a non-boolean value. The UBSAN issue is as follows: =============================================================== UBSAN: invalid-load in net/ipv6/ip6_tunnel.c:1025:14 load of value 30 is not a valid value for type '_Bool' CPU: 1 PID: 228 Comm: kworker/1:3 Not tainted 5.16.0-rc4+ #8 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Workqueue: ipv6_addrconf addrconf_dad_work Call Trace: <TASK> dump_stack_lvl+0x44/0x57 ubsan_epilogue+0x5/0x40 __ubsan_handle_load_invalid_value+0x66/0x70 ? __cpuhp_setup_state+0x1d3/0x210 ip6_tnl_xmit_ctl.cold.52+0x2c/0x6f [ip6_tunnel] vti6_tnl_xmit+0x79c/0x1e96 [ip6_vti] ? lock_is_held_type+0xd9/0x130 ? vti6_rcv+0x100/0x100 [ip6_vti] ? lock_is_held_type+0xd9/0x130 ? rcu_read_lock_bh_held+0xc0/0xc0 ? lock_acquired+0x262/0xb10 dev_hard_start_xmit+0x1e6/0x820 __dev_queue_xmit+0x2079/0x3340 ? mark_lock.part.52+0xf7/0x1050 ? netdev_core_pick_tx+0x290/0x290 ? kvm_clock_read+0x14/0x30 ? kvm_sched_clock_read+0x5/0x10 ? sched_clock_cpu+0x15/0x200 ? find_held_lock+0x3a/0x1c0 ? lock_release+0x42f/0xc90 ? lock_downgrade+0x6b0/0x6b0 ? mark_held_locks+0xb7/0x120 ? neigh_connected_output+0x31f/0x470 ? lockdep_hardirqs_on+0x79/0x100 ? neigh_connected_output+0x31f/0x470 ? ip6_finish_output2+0x9b0/0x1d90 ? rcu_read_lock_bh_held+0x62/0xc0 ? ip6_finish_output2+0x9b0/0x1d90 ip6_finish_output2+0x9b0/0x1d90 ? ip6_append_data+0x330/0x330 ? ip6_mtu+0x166/0x370 ? __ip6_finish_output+0x1ad/0xfb0 ? nf_hook_slow+0xa6/0x170 ip6_output+0x1fb/0x710 ? nf_hook.constprop.32+0x317/0x430 ? ip6_finish_output+0x180/0x180 ? __ip6_finish_output+0xfb0/0xfb0 ? lock_is_held_type+0xd9/0x130 ndisc_send_skb+0xb33/0x1590 ? __sk_mem_raise_allocated+0x11cf/0x1560 ? dst_output+0x4a0/0x4a0 ? ndisc_send_rs+0x432/0x610 addrconf_dad_completed+0x30c/0xbb0 ? addrconf_rs_timer+0x650/0x650 ? addrconf_dad_work+0x73c/0x10e0 addrconf_dad_work+0x73c/0x10e0 ? addrconf_dad_completed+0xbb0/0xbb0 ? rcu_read_lock_sched_held+0xaf/0xe0 ? rcu_read_lock_bh_held+0xc0/0xc0 process_one_work+0x97b/0x1740 ? pwq_dec_nr_in_flight+0x270/0x270 worker_thread+0x87/0xbf0 ? process_one_work+0x1740/0x1740 kthread+0x3ac/0x490 ? set_kthread_struct+0x100/0x100 ret_from_fork+0x22/0x30 </TASK> =============================================================== The solution is to initialize "__ip6_tnl_parm" struct to zeros in the "vti6_siocdevprivate()" function. Signed-off-by: William Zhao <wizhao@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-25sctp: use call_rcu to free endpointXin Long3-22/+36
This patch is to delay the endpoint free by calling call_rcu() to fix another use-after-free issue in sctp_sock_dump(): BUG: KASAN: use-after-free in __lock_acquire+0x36d9/0x4c20 Call Trace: __lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218 lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844 __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline] _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168 spin_lock_bh include/linux/spinlock.h:334 [inline] __lock_sock+0x203/0x350 net/core/sock.c:2253 lock_sock_nested+0xfe/0x120 net/core/sock.c:2774 lock_sock include/net/sock.h:1492 [inline] sctp_sock_dump+0x122/0xb20 net/sctp/diag.c:324 sctp_for_each_transport+0x2b5/0x370 net/sctp/socket.c:5091 sctp_diag_dump+0x3ac/0x660 net/sctp/diag.c:527 __inet_diag_dump+0xa8/0x140 net/ipv4/inet_diag.c:1049 inet_diag_dump+0x9b/0x110 net/ipv4/inet_diag.c:1065 netlink_dump+0x606/0x1080 net/netlink/af_netlink.c:2244 __netlink_dump_start+0x59a/0x7c0 net/netlink/af_netlink.c:2352 netlink_dump_start include/linux/netlink.h:216 [inline] inet_diag_handler_cmd+0x2ce/0x3f0 net/ipv4/inet_diag.c:1170 __sock_diag_cmd net/core/sock_diag.c:232 [inline] sock_diag_rcv_msg+0x31d/0x410 net/core/sock_diag.c:263 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2477 sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:274 This issue occurs when asoc is peeled off and the old sk is freed after getting it by asoc->base.sk and before calling lock_sock(sk). To prevent the sk free, as a holder of the sk, ep should be alive when calling lock_sock(). This patch uses call_rcu() and moves sock_put and ep free into sctp_endpoint_destroy_rcu(), so that it's safe to try to hold the ep under rcu_read_lock in sctp_transport_traverse_process(). If sctp_endpoint_hold() returns true, it means this ep is still alive and we have held it and can continue to dump it; If it returns false, it means this ep is dead and can be freed after rcu_read_unlock, and we should skip it. In sctp_sock_dump(), after locking the sk, if this ep is different from tsp->asoc->ep, it means during this dumping, this asoc was peeled off before calling lock_sock(), and the sk should be skipped; If this ep is the same with tsp->asoc->ep, it means no peeloff happens on this asoc, and due to lock_sock, no peeloff will happen either until release_sock. Note that delaying endpoint free won't delay the port release, as the port release happens in sctp_endpoint_destroy() before calling call_rcu(). Also, freeing endpoint by call_rcu() makes it safe to access the sk by asoc->base.sk in sctp_assocs_seq_show() and sctp_rcv(). Thanks Jones to bring this issue up. v1->v2: - improve the changelog. - add kfree(ep) into sctp_endpoint_destroy_rcu(), as Jakub noticed. Reported-by: syzbot+9276d76e83e3bcde6c99@syzkaller.appspotmail.com Reported-by: Lee Jones <lee.jones@linaro.org> Fixes: d25adbeb0cdb ("sctp: fix an use-after-free issue in sctp_sock_dump") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-23udp: using datalen to cap ipv6 udp max gso segmentsCoco Li1-1/+1
The max number of UDP gso segments is intended to cap to UDP_MAX_SEGMENTS, this is checked in udp_send_skb(). skb->len contains network and transport header len here, we should use only data len instead. This is the ipv6 counterpart to the below referenced commit, which missed the ipv6 change Fixes: 158390e45612 ("udp: using datalen to cap max gso segments") Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20211223222441.2975883-1-lixiaoyan@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski24-48/+81
include/net/sock.h commit 8f905c0e7354 ("inet: fully convert sk->sk_rx_dst to RCU rules") commit 43f51df41729 ("net: move early demux fields close to sk_refcnt") https://lore.kernel.org/all/20211222141641.0caa0ab3@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-23Bluetooth: MGMT: Fix spelling mistake "simultanous" -> "simultaneous"Colin Ian King1-1/+1
There is a spelling mistake in a bt_dev_info message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2021-12-23net: bridge: fix ioctl old_deviceless bridge argumentRemi Pommarel1-1/+1
Commit 561d8352818f ("bridge: use ndo_siocdevprivate") changed the source and destination arguments of copy_{to,from}_user in bridge's old_deviceless() from args[1] to uarg breaking SIOC{G,S}IFBR ioctls. Commit cbd7ad29a507 ("net: bridge: fix ioctl old_deviceless bridge argument") fixed only BRCTL_{ADD,DEL}_BRIDGES commands leaving BRCTL_GET_BRIDGES one untouched. The fixes BRCTL_GET_BRIDGES as well and has been tested with busybox's brctl. Example of broken brctl: $ brctl show bridge name bridge id STP enabled interfaces brctl: can't get bridge name for index 0: No such device or address Example of fixed brctl: $ brctl show bridge name bridge id STP enabled interfaces br0 8000.000000000000 no Fixes: 561d8352818f ("bridge: use ndo_siocdevprivate") Signed-off-by: Remi Pommarel <repk@triplefau.lt> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/all/20211223153139.7661-2-repk@triplefau.lt/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-23net: dsa: tag_ocelot: use traffic class to map priority on injected headerXiaoliang Yang1-1/+5
For Ocelot switches, the CPU injected frames have an injection header where it can specify the QoS class of the packet and the DSA tag, now it uses the SKB priority to set that. If a traffic class to priority mapping is configured on the netdevice (with mqprio for example ...), it won't be considered for CPU injected headers. This patch make the QoS class aligned to the priority to traffic class mapping if it exists. Fixes: 8dce89aa5f32 ("net: dsa: ocelot: add tagger for Ocelot/Felix switches") Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com> Signed-off-by: Marouen Ghodhbane <marouen.ghodhbane@nxp.com> Link: https://lore.kernel.org/r/20211223072211.33130-1-xiaoliang.yang_1@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-23flow_offload: fix suspicious RCU usage when offloading tc actionBaowen Zheng1-2/+9
Fix suspicious rcu_dereference_protected() usage when offloading tc action. We should hold tcfa_lock to offload tc action in action initiation. Without these changes, the following warning will be observed: WARNING: suspicious RCU usage 5.16.0-rc5-net-next-01504-g7d1f236dcffa-dirty #50 Tainted: G I ----------------------------- include/net/tc_act/tc_tunnel_key.h:33 suspicious rcu_dereference_protected() usage! 1 lock held by tc/12108: CPU: 4 PID: 12108 Comm: tc Tainted: G Hardware name: Dell Inc. PowerEdge R740/07WCGN, BIOS 1.6.11 11/20/2018 Call Trace: <TASK> dump_stack_lvl+0x49/0x5e dump_stack+0x10/0x12 lockdep_rcu_suspicious+0xed/0xf8 tcf_tunnel_key_offload_act_setup+0x1de/0x300 [act_tunnel_key] tcf_action_offload_add_ex+0xc0/0x1f0 tcf_action_init+0x26a/0x2f0 tcf_action_add+0xa9/0x1f0 tc_ctl_action+0xfb/0x170 rtnetlink_rcv_msg+0x169/0x510 ? sched_clock+0x9/0x10 ? rtnl_newlink+0x70/0x70 netlink_rcv_skb+0x55/0x100 rtnetlink_rcv+0x15/0x20 netlink_unicast+0x1a8/0x270 netlink_sendmsg+0x245/0x490 sock_sendmsg+0x65/0x70 ____sys_sendmsg+0x219/0x260 ? __import_iovec+0x2c/0x150 ___sys_sendmsg+0xb7/0x100 ? __lock_acquire+0x3d5/0x1f40 ? __this_cpu_preempt_check+0x13/0x20 ? lock_is_held_type+0xe4/0x140 ? sched_clock+0x9/0x10 ? ktime_get_coarse_real_ts64+0xbe/0xd0 ? __this_cpu_preempt_check+0x13/0x20 ? lockdep_hardirqs_on+0x7e/0x100 ? ktime_get_coarse_real_ts64+0xbe/0xd0 ? trace_hardirqs_on+0x2a/0xf0 __sys_sendmsg+0x5a/0xa0 ? syscall_trace_enter.constprop.0+0x1dd/0x220 __x64_sys_sendmsg+0x1f/0x30 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f4db7bb7a60 Fixes: 8cbfe939abe9 ("flow_offload: allow user to offload tc action to net device") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-23sctp: move hlist_node and hashent out of sctp_ep_commonXin Long3-26/+17
Struct sctp_ep_common is included in both asoc and ep, but hlist_node and hashent are only needed by ep after asoc_hashtable was dropped by Commit b5eff7128366 ("sctp: drop the old assoc hashtable of sctp"). So it is better to move hlist_node and hashent from sctp_ep_common to sctp_endpoint, and it saves some space for each asoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-22Bluetooth: MGMT: Fix LE simultaneous roles UUID if not supportedLuiz Augusto von Dentz2-47/+77
If controller/driver don't support LE simultaneous roles its UUID shall be omitted when responding to MGMT_OP_READ_EXP_FEATURES_INFO. This also rework the support introducing HCI_LE_SIMULTANEOUS_ROLES flag so it can be detected when userspace wants to use or not. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-12-22Bluetooth: hci_sync: Add check simultaneous roles supportLuiz Augusto von Dentz2-16/+13
This attempts to check if the controller can act as both central and peripheral simultaneously and in case it does skip suspending advertising or in case of directed advertising don't fail if scanning. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-12-22Bluetooth: hci_sync: Wait for proper events when connecting LELuiz Augusto von Dentz1-4/+7
When using HCI_OP_LE_CREATE_CONN wait for HCI_EV_LE_CONN_COMPLETE before completing it and for HCI_OP_LE_EXT_CREATE_CONN wait for HCI_EV_LE_ENHANCED_CONN_COMPLETE before resuming advertising. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-12-22Bluetooth: hci_sync: Add support for waiting specific LE subeventsLuiz Augusto von Dentz2-16/+27
This adds support for waiting for specific LE subevents instead of command status which may only indicate that the commands is in progress and a different event is used to complete the operation. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-12-22Bluetooth: hci_sync: Add hci_le_create_conn_syncLuiz Augusto von Dentz6-347/+295
This adds hci_le_create_conn_sync and make hci_le_connect use it instead of queueing multiple commands which may conflict with the likes of hci_update_passive_scan which uses hci_cmd_sync_queue. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-12-22Bluetooth: hci_event: Use skb_pull_data when processing inquiry resultsLuiz Augusto von Dentz1-2/+18
This makes each result entry to be checked using skb_pull_data instead of acessing them by index. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-12-22Bluetooth: hci_sync: Push sync command cancellation to workqueueBenjamin Berg3-3/+28
syzbot reported that hci_cmd_sync_cancel may sleep from the wrong context. To avoid this, create a new work item that pushes the relevant parts into a different context. Note that we keep the old implementation with the name __hci_cmd_sync_cancel as the sleeping behaviour is desired in some cases. Reported-and-tested-by: syzbot+485cc00ea7cf41dfdbf1@syzkaller.appspotmail.com Fixes: c97a747efc93 ("Bluetooth: btusb: Cancel sync commands for certain URB errors") Signed-off-by: Benjamin Berg <bberg@redhat.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-12-21devlink: Add new "event_eq_size" generic device paramShay Drory1-0/+5
Add new device generic parameter to determine the size of the asynchronous control events EQ. For example, to reduce event EQ size to 64, execute: $ devlink dev param set pci/0000:06:00.0 \ name event_eq_size value 64 cmode driverinit $ devlink dev reload pci/0000:06:00.0 Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-21devlink: Add new "io_eq_size" generic device paramShay Drory1-0/+5
Add new device generic parameter to determine the size of the I/O completion EQs. For example, to reduce I/O EQ size to 64, execute: $ devlink dev param set pci/0000:06:00.0 \ name io_eq_size value 64 cmode driverinit $ devlink dev reload pci/0000:06:00.0 Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-21Bluetooth: L2CAP: Fix using wrong modeLuiz Augusto von Dentz1-2/+10
If user has a set to use SOCK_STREAM the socket would default to L2CAP_MODE_ERTM which later needs to be adjusted if the destination address is LE which doesn't support such mode. Fixes: 15f02b9105625 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-12-21Bluetooth: hci_sync: Fix not always pausing advertising when necessaryLuiz Augusto von Dentz1-4/+2
hci_pause_advertising_sync shall always pause advertising until hci_resume_advertising_sync but instance 0x00 doesn't count in adv_instance_cnt so it was causing it to be skipped. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>