Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull io_uring updates from Jens Axboe:
"A lot of stuff has been going on this cycle, with improving the
support for networked IO (and hence unbounded request completion
times) being one of the major themes. There's been a set of fixes done
this week, I'll send those out as well once we're certain we're fully
happy with them.
This contains:
- Unification of the "normal" submit path and the SQPOLL path (Pavel)
- Support for sparse (and bigger) file sets, and updating of those
file sets without needing to unregister/register again.
- Independently sized CQ ring, instead of just making it always 2x
the SQ ring size. This makes it more flexible for networked
applications.
- Support for overflowed CQ ring, never dropping events but providing
backpressure on submits.
- Add support for absolute timeouts, not just relative ones.
- Support for generic cancellations. This divorces io_uring from
workqueues as well, which additionally gets us one step closer to
generic async system call support.
- With cancellations, we can support grabbing the process file table
as well, just like we do mm context. This allows support for system
calls that create file descriptors, like accept4() support that's
built on top of that.
- Support for io_uring tracing (Dmitrii)
- Support for linked timeouts. These abort an operation if it isn't
completed by the time noted in the linke timeout.
- Speedup tracking of poll requests
- Various cleanups making the coder easier to follow (Jackie, Pavel,
Bob, YueHaibing, me)
- Update MAINTAINERS with new io_uring list"
* tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits)
io_uring: make POLL_ADD/POLL_REMOVE scale better
io-wq: remove now redundant struct io_wq_nulls_list
io_uring: Fix getting file for non-fd opcodes
io_uring: introduce req_need_defer()
io_uring: clean up io_uring_cancel_files()
io-wq: ensure free/busy list browsing see all items
io-wq: ensure we have a stable view of ->cur_work for cancellations
io_wq: add get/put_work handlers to io_wq_create()
io_uring: check for validity of ->rings in teardown
io_uring: fix potential deadlock in io_poll_wake()
io_uring: use correct "is IO worker" helper
io_uring: fix -ENOENT issue with linked timer with short timeout
io_uring: don't do flush cancel under inflight_lock
io_uring: flag SQPOLL busy condition to userspace
io_uring: make ASYNC_CANCEL work with poll and timeout
io_uring: provide fallback request for OOM situations
io_uring: convert accept4() -ERESTARTSYS into -EINTR
io_uring: fix error clear of ->file_table in io_sqe_files_register()
io_uring: separate the io_free_req and io_free_req_find_next interface
io_uring: keep io_put_req only responsible for release and put req
...
|
|
In commit 3975b097e577 ("convert stream-like files -> stream_open, even
if they use noop_llseek") Kirill used a coccinelle script to change
"nonseekable_open()" to "stream_open()", which changed the trivial cases
of stream-like file descriptors to the new model with FMODE_STREAM.
However, the two big cases - sockets and pipes - don't actually have
that trivial pattern at all, and were thus never converted to
FMODE_STREAM even though it makes lots of sense to do so.
That's particularly true when looking forward to the next change:
getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
decide whether f_pos updates are needed or not. And if they are, we'll
always do them atomically.
This came up because KCSAN (correctly) noted that the non-locked f_pos
updates are data races: they are clearly benign for the case where we
don't care, but it would be good to just not have that issue exist at
all.
Note that the reason we used FMODE_ATOMIC_POS originally is that only
doing it for the minimal required case is "safer" in that it's possible
that the f_pos locking can cause unnecessary serialization across the
whole write() call. And in the worst case, that kind of serialization
can cause deadlock issues: think writers that need readers to empty the
state using the same file descriptor.
[ Note that the locking is per-file descriptor - because it protects
"f_pos", which is obviously per-file descriptor - so it only affects
cases where you literally use the same file descriptor to both read
and write.
So a regular pipe that has separate reading and writing file
descriptors doesn't really have this situation even though it's the
obvious case of "reader empties what a bit writer concurrently fills"
But we want to make pipes as being stream-line anyway, because we
don't want the unnecessary overhead of locking, and because a named
pipe can be (ab-)used by reading and writing to the same file
descriptor. ]
There are likely a lot of other cases that might want FMODE_STREAM, and
looking for ".llseek = no_llseek" users and other cases that don't have
an lseek file operation at all and making them use "stream_open()" might
be a good idea. But pipes and sockets are likely to be the two main
cases.
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Once udp stack has set the UDP_SKB_IS_STATELESS flag, later skb free
assumes all skb head state has been dropped already.
This will leak the extension memory in case the skb has extensions other
than the ipsec secpath, e.g. bridge nf data.
To fix this, set the UDP_SKB_IS_STATELESS flag only if we don't have
extensions or if the extension space can be free'd.
Fixes: 895b5c9f206eb7d25dc1360a ("netfilter: drop bridge nf reset from nf_reset")
Cc: Paolo Abeni <pabeni@redhat.com>
Reported-by: Byron Stanoszek <gandalf@winds.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The "ivm->vf" variable is a u32, but the problem is that a number of
drivers cast it to an int and then forget to check for negatives. An
example of this is in the cxgb4 driver.
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
2890 static int cxgb4_mgmt_get_vf_config(struct net_device *dev,
2891 int vf, struct ifla_vf_info *ivi)
^^^^^^
2892 {
2893 struct port_info *pi = netdev_priv(dev);
2894 struct adapter *adap = pi->adapter;
2895 struct vf_info *vfinfo;
2896
2897 if (vf >= adap->num_vfs)
^^^^^^^^^^^^^^^^^^^
2898 return -EINVAL;
2899 vfinfo = &adap->vfinfo[vf];
^^^^^^^^^^^^^^^^^^^^^^^^^^
There are 48 functions affected.
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:8435 hclge_set_vf_vlan_filter() warn: can 'vfid' underflow 's32min-2147483646'
drivers/net/ethernet/freescale/enetc/enetc_pf.c:377 enetc_pf_set_vf_mac() warn: can 'vf' underflow 's32min-2147483646'
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:2899 cxgb4_mgmt_get_vf_config() warn: can 'vf' underflow 's32min-254'
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:2960 cxgb4_mgmt_set_vf_rate() warn: can 'vf' underflow 's32min-254'
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:3019 cxgb4_mgmt_set_vf_rate() warn: can 'vf' underflow 's32min-254'
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:3038 cxgb4_mgmt_set_vf_vlan() warn: can 'vf' underflow 's32min-254'
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:3086 cxgb4_mgmt_set_vf_link_state() warn: can 'vf' underflow 's32min-254'
drivers/net/ethernet/chelsio/cxgb/cxgb2.c:791 get_eeprom() warn: can 'i' underflow 's32min-(-4),0,4-s32max'
drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:82 bnxt_set_vf_spoofchk() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:164 bnxt_set_vf_trust() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:186 bnxt_get_vf_config() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:228 bnxt_set_vf_mac() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:264 bnxt_set_vf_vlan() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:293 bnxt_set_vf_bw() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:333 bnxt_set_vf_link_state() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:2595 bnx2x_vf_op_prep() warn: can 'vfidx' underflow 's32min-63'
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:2595 bnx2x_vf_op_prep() warn: can 'vfidx' underflow 's32min-63'
drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c:2281 bnx2x_post_vf_bulletin() warn: can 'vf' underflow 's32min-63'
drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c:2285 bnx2x_post_vf_bulletin() warn: can 'vf' underflow 's32min-63'
drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c:2286 bnx2x_post_vf_bulletin() warn: can 'vf' underflow 's32min-63'
drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c:2292 bnx2x_post_vf_bulletin() warn: can 'vf' underflow 's32min-63'
drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c:2297 bnx2x_post_vf_bulletin() warn: can 'vf' underflow 's32min-63'
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c:1832 qlcnic_sriov_set_vf_mac() warn: can 'vf' underflow 's32min-254'
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c:1864 qlcnic_sriov_set_vf_tx_rate() warn: can 'vf' underflow 's32min-254'
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c:1937 qlcnic_sriov_set_vf_vlan() warn: can 'vf' underflow 's32min-254'
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c:2005 qlcnic_sriov_get_vf_config() warn: can 'vf' underflow 's32min-254'
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c:2036 qlcnic_sriov_set_vf_spoofchk() warn: can 'vf' underflow 's32min-254'
drivers/net/ethernet/emulex/benet/be_main.c:1914 be_get_vf_config() warn: can 'vf' underflow 's32min-65534'
drivers/net/ethernet/emulex/benet/be_main.c:1915 be_get_vf_config() warn: can 'vf' underflow 's32min-65534'
drivers/net/ethernet/emulex/benet/be_main.c:1922 be_set_vf_tvt() warn: can 'vf' underflow 's32min-65534'
drivers/net/ethernet/emulex/benet/be_main.c:1951 be_clear_vf_tvt() warn: can 'vf' underflow 's32min-65534'
drivers/net/ethernet/emulex/benet/be_main.c:2063 be_set_vf_tx_rate() warn: can 'vf' underflow 's32min-65534'
drivers/net/ethernet/emulex/benet/be_main.c:2091 be_set_vf_link_state() warn: can 'vf' underflow 's32min-65534'
drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:2609 ice_set_vf_port_vlan() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:3050 ice_get_vf_cfg() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:3103 ice_set_vf_spoofchk() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:3181 ice_set_vf_mac() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:3237 ice_set_vf_trust() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:3286 ice_set_vf_link_state() warn: can 'vf_id' underflow 's32min-65534'
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:3919 i40e_validate_vf() warn: can 'vf_id' underflow 's32min-2147483646'
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:3957 i40e_ndo_set_vf_mac() warn: can 'vf_id' underflow 's32min-2147483646'
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4104 i40e_ndo_set_vf_port_vlan() warn: can 'vf_id' underflow 's32min-2147483646'
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4263 i40e_ndo_set_vf_bw() warn: can 'vf_id' underflow 's32min-2147483646'
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4309 i40e_ndo_get_vf_config() warn: can 'vf_id' underflow 's32min-2147483646'
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4371 i40e_ndo_set_vf_link_state() warn: can 'vf_id' underflow 's32min-2147483646'
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4441 i40e_ndo_set_vf_spoofchk() warn: can 'vf_id' underflow 's32min-2147483646'
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4441 i40e_ndo_set_vf_spoofchk() warn: can 'vf_id' underflow 's32min-2147483646'
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4504 i40e_ndo_set_vf_trust() warn: can 'vf_id' underflow 's32min-2147483646'
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
NET_RAW is less dangerous, so more likely to be available to a process,
so check it first to prevent some spurious logging.
This matches IP_TRANSPARENT which checks NET_RAW first.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
kobject_put() should only be called in error path.
Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jouni Hogander <jouni.hogander@unikie.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Previously we will return directly if (!rt || !rt->fib6_nh.fib_nh_gw_family)
in function rt6_probe(), but after commit cc3a86c802f0
("ipv6: Change rt6_probe to take a fib6_nh"), the logic changed to
return if there is fib_nh_gw_family.
Fixes: cc3a86c802f0 ("ipv6: Change rt6_probe to take a fib6_nh")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
kobject_init_and_add takes reference even when it fails. This has
to be given up by the caller in error handling. Otherwise memory
allocated by kobject_init_and_add is never freed. Originally found
by Syzkaller:
BUG: memory leak
unreferenced object 0xffff8880679f8b08 (size 8):
comm "netdev_register", pid 269, jiffies 4294693094 (age 12.132s)
hex dump (first 8 bytes):
72 78 2d 30 00 36 20 d4 rx-0.6 .
backtrace:
[<000000008c93818e>] __kmalloc_track_caller+0x16e/0x290
[<000000001f2e4e49>] kvasprintf+0xb1/0x140
[<000000007f313394>] kvasprintf_const+0x56/0x160
[<00000000aeca11c8>] kobject_set_name_vargs+0x5b/0x140
[<0000000073a0367c>] kobject_init_and_add+0xd8/0x170
[<0000000088838e4b>] net_rx_queue_update_kobjects+0x152/0x560
[<000000006be5f104>] netdev_register_kobject+0x210/0x380
[<00000000e31dab9d>] register_netdevice+0xa1b/0xf00
[<00000000f68b2465>] __tun_chr_ioctl+0x20d5/0x3dd0
[<000000004c50599f>] tun_chr_ioctl+0x2f/0x40
[<00000000bbd4c317>] do_vfs_ioctl+0x1c7/0x1510
[<00000000d4c59e8f>] ksys_ioctl+0x99/0xb0
[<00000000946aea81>] __x64_sys_ioctl+0x78/0xb0
[<0000000038d946e5>] do_syscall_64+0x16f/0x580
[<00000000e0aa5d8f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<00000000285b3d1a>] 0xffffffffffffffff
Cc: David Miller <davem@davemloft.net>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jouni Hogander <jouni.hogander@unikie.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
when configuring act_pedit rules, the number of keys is validated only on
addition of a new entry. This is not sufficient to avoid hitting a WARN()
in the traffic path: for example, it is possible to replace a valid entry
with a new one having 0 extended keys, thus causing splats in dmesg like:
pedit BUG: index 42
WARNING: CPU: 2 PID: 4054 at net/sched/act_pedit.c:410 tcf_pedit_act+0xc84/0x1200 [act_pedit]
[...]
RIP: 0010:tcf_pedit_act+0xc84/0x1200 [act_pedit]
Code: 89 fa 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e ac 00 00 00 48 8b 44 24 10 48 c7 c7 a0 c4 e4 c0 8b 70 18 e8 1c 30 95 ea <0f> 0b e9 a0 fa ff ff e8 00 03 f5 ea e9 14 f4 ff ff 48 89 58 40 e9
RSP: 0018:ffff888077c9f320 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffac2983a2
RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff888053927bec
RBP: dffffc0000000000 R08: ffffed100a726209 R09: ffffed100a726209
R10: 0000000000000001 R11: ffffed100a726208 R12: ffff88804beea780
R13: ffff888079a77400 R14: ffff88804beea780 R15: ffff888027ab2000
FS: 00007fdeec9bd740(0000) GS:ffff888053900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffdb3dfd000 CR3: 000000004adb4006 CR4: 00000000001606e0
Call Trace:
tcf_action_exec+0x105/0x3f0
tcf_classify+0xf2/0x410
__dev_queue_xmit+0xcbf/0x2ae0
ip_finish_output2+0x711/0x1fb0
ip_output+0x1bf/0x4b0
ip_send_skb+0x37/0xa0
raw_sendmsg+0x180c/0x2430
sock_sendmsg+0xdb/0x110
__sys_sendto+0x257/0x2b0
__x64_sys_sendto+0xdd/0x1b0
do_syscall_64+0xa5/0x4e0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fdeeb72e993
Code: 48 8b 0d e0 74 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 0d d6 2c 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 4b cc 00 00 48 89 04 24
RSP: 002b:00007ffdb3de8a18 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000055c81972b700 RCX: 00007fdeeb72e993
RDX: 0000000000000040 RSI: 000055c81972b700 RDI: 0000000000000003
RBP: 00007ffdb3dea130 R08: 000055c819728510 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
R13: 000055c81972b6c0 R14: 000055c81972969c R15: 0000000000000080
Fix this moving the check on 'nkeys' earlier in tcf_pedit_init(), so that
attempts to install rules having 0 keys are always rejected with -EINVAL.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The taprio qdisc allows to set mqprio setting but only once. In case
if mqprio settings are provided next time the error is returned as
it's not allowed to change traffic class mapping in-flignt and that
is normal. But if configuration is absolutely the same - no need to
return error. It allows to provide same command couple times,
changing only base time for instance, or changing only scheds maps,
but leaving mqprio setting w/o modification. It more corresponds the
message: "Changing the traffic mapping of a running schedule is not
supported", so reject mqprio if it's really changed.
Also corrected TC_BITMASK + 1 for consistency, as proposed.
Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Bring back tls_sw_sendpage_locked. sk_msg redirection into a socket
with TLS_TX takes the following path:
tcp_bpf_sendmsg_redir
tcp_bpf_push_locked
tcp_bpf_push
kernel_sendpage_locked
sock->ops->sendpage_locked
Also update the flags test in tls_sw_sendpage_locked to allow flag
MSG_NO_SHARED_FRAGS. bpf_tcp_sendmsg sets this.
Link: https://lore.kernel.org/netdev/CA+FuTSdaAawmZ2N8nfDDKu3XLpXBbMtcCT0q4FntDD2gn8ASUw@mail.gmail.com/T/#t
Link: https://github.com/wdebruij/kerneltools/commits/icept.2
Fixes: 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP")
Fixes: f3de19af0f5b ("Revert \"net/tls: remove unused function tls_sw_sendpage_locked\"")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit eec4844fae7c ("proc/sysctl: add shared variables for range
check") did:
- .extra2 = &two,
+ .extra2 = SYSCTL_ONE,
here, which doesn't seem to be intentional, given the changelog.
This patch restores it to the previous, as the value of 2 still makes
sense (used in fib_multipath_hash()).
Fixes: eec4844fae7c ("proc/sysctl: add shared variables for range check")
Cc: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
info->options_len is 'u8' type, and when opts_len with a value >
IP_TUNNEL_OPTS_MAX, 'info->options_len = opts_len' will cast int
to u8 and set a wrong value to info->options_len.
Kernel crashed in my test when doing:
# opts="0102:80:00800022"
# for i in {1..99}; do opts="$opts,0102:80:00800022"; done
# ip link add name geneve0 type geneve dstport 0 external
# tc qdisc add dev eth0 ingress
# tc filter add dev eth0 protocol ip parent ffff: \
flower indev eth0 ip_proto udp action tunnel_key \
set src_ip 10.0.99.192 dst_ip 10.0.99.193 \
dst_port 6081 id 11 geneve_opts $opts \
action mirred egress redirect dev geneve0
So we should do the similar check as cls_flower does, return error
when opts_len > IP_TUNNEL_OPTS_MAX in tunnel_key_copy_opts().
Fixes: 0ed5269f9e41 ("net/sched: add tunnel option support to act_tunnel_key")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In case of errors in unlink_clip_vcc, the logging level is set to
pr_crit but failures in clip_setentry are handled by pr_err().
The patch changes the severity consistent across invocations.
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In route.c, inet_rtm_getroute_build_skb() creates an skb with no
headroom. This skb is then used by inet_rtm_getroute() which may pass
it to rt_fill_info() and, from there, to ipmr_get_route(). The later
might try to reuse this skb by cloning it and prepending an IPv4
header. But since the original skb has no headroom, skb_push() triggers
skb_under_panic():
skbuff: skb_under_panic: text:00000000ca46ad8a len:80 put:20 head:00000000cd28494e data:000000009366fd6b tail:0x3c end:0xec0 dev:veth0
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:108!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 6 PID: 587 Comm: ip Not tainted 5.4.0-rc6+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
RIP: 0010:skb_panic+0xbf/0xd0
Code: 41 a2 ff 8b 4b 70 4c 8b 4d d0 48 c7 c7 20 76 f5 8b 44 8b 45 bc 48 8b 55 c0 48 8b 75 c8 41 54 41 57 41 56 41 55 e8 75 dc 7a ff <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
RSP: 0018:ffff888059ddf0b0 EFLAGS: 00010286
RAX: 0000000000000086 RBX: ffff888060a315c0 RCX: ffffffff8abe4822
RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88806c9a79cc
RBP: ffff888059ddf118 R08: ffffed100d9361b1 R09: ffffed100d9361b0
R10: ffff88805c68aee3 R11: ffffed100d9361b1 R12: ffff88805d218000
R13: ffff88805c689fec R14: 000000000000003c R15: 0000000000000ec0
FS: 00007f6af184b700(0000) GS:ffff88806c980000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffc8204a000 CR3: 0000000057b40006 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
skb_push+0x7e/0x80
ipmr_get_route+0x459/0x6fa
rt_fill_info+0x692/0x9f0
inet_rtm_getroute+0xd26/0xf20
rtnetlink_rcv_msg+0x45d/0x630
netlink_rcv_skb+0x1a5/0x220
rtnetlink_rcv+0x15/0x20
netlink_unicast+0x305/0x3a0
netlink_sendmsg+0x575/0x730
sock_sendmsg+0xb5/0xc0
___sys_sendmsg+0x497/0x4f0
__sys_sendmsg+0xcb/0x150
__x64_sys_sendmsg+0x48/0x50
do_syscall_64+0xd2/0xac0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Actually the original skb used to have enough headroom, but the
reserve_skb() call was lost with the introduction of
inet_rtm_getroute_build_skb() by commit 404eb77ea766 ("ipv4: support
sport, dport and ip_proto in RTM_GETROUTE").
We could reserve some headroom again in inet_rtm_getroute_build_skb(),
but this function shouldn't be responsible for handling the special
case of ipmr_get_route(). Let's handle that directly in
ipmr_get_route() by calling skb_realloc_headroom() instead of
skb_clone().
Fixes: 404eb77ea766 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
FASTOPEN does not work with SMC-sockets. Since SMC allows fallback to
TCP native during connection start, the FASTOPEN setsockopts trigger
this fallback, if the SMC-socket is still in state SMC_INIT.
But if a FASTOPEN setsockopt is called after a non-blocking connect(),
this is broken, and fallback does not make sense.
This change complements
commit cd2063604ea6 ("net/smc: avoid fallback in case of non-blocking connect")
and fixes the syzbot reported problem "WARNING in smc_unhash_sk".
Reported-by: syzbot+8488cc4cf1c9e09b8b86@syzkaller.appspotmail.com
Fixes: e1bbdd570474 ("net/smc: reduce sock_put() for fallback sockets")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently WR sizes are updated from rds_ib_sysctl_max_send_wr and
rds_ib_sysctl_max_recv_wr when a connection is shut down. As a result,
a connection being down while rds_ib_sysctl_max_send_wr or
rds_ib_sysctl_max_recv_wr are updated, will not update the sizes when
it comes back up.
Move resizing of WRs to rds_ib_setup_qp so that connections will be setup
with the most current WR sizes.
Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This sequence of operations:
ip link set dev br0 type bridge vlan_filtering 1
bridge vlan del dev swp2 vid 1
ip link set dev br0 type bridge vlan_filtering 1
ip link set dev br0 type bridge vlan_filtering 0
apparently fails with the message:
[ 31.305716] sja1105 spi0.1: Reset switch and programmed static config. Reason: VLAN filtering
[ 31.322161] sja1105 spi0.1: Couldn't determine PVID attributes (pvid 0)
[ 31.328939] sja1105 spi0.1: Failed to setup VLAN tagging for port 1: -2
[ 31.335599] ------------[ cut here ]------------
[ 31.340215] WARNING: CPU: 1 PID: 194 at net/switchdev/switchdev.c:157 switchdev_port_attr_set_now+0x9c/0xa4
[ 31.349981] br0: Commit of attribute (id=6) failed.
[ 31.354890] Modules linked in:
[ 31.357942] CPU: 1 PID: 194 Comm: ip Not tainted 5.4.0-rc6-01792-gf4f632e07665-dirty #2062
[ 31.366167] Hardware name: Freescale LS1021A
[ 31.370437] [<c03144dc>] (unwind_backtrace) from [<c030e184>] (show_stack+0x10/0x14)
[ 31.378153] [<c030e184>] (show_stack) from [<c11d1c1c>] (dump_stack+0xe0/0x10c)
[ 31.385437] [<c11d1c1c>] (dump_stack) from [<c034c730>] (__warn+0xf4/0x10c)
[ 31.392373] [<c034c730>] (__warn) from [<c034c7bc>] (warn_slowpath_fmt+0x74/0xb8)
[ 31.399827] [<c034c7bc>] (warn_slowpath_fmt) from [<c11ca204>] (switchdev_port_attr_set_now+0x9c/0xa4)
[ 31.409097] [<c11ca204>] (switchdev_port_attr_set_now) from [<c117036c>] (__br_vlan_filter_toggle+0x6c/0x118)
[ 31.418971] [<c117036c>] (__br_vlan_filter_toggle) from [<c115d010>] (br_changelink+0xf8/0x518)
[ 31.427637] [<c115d010>] (br_changelink) from [<c0f8e9ec>] (__rtnl_newlink+0x3f4/0x76c)
[ 31.435613] [<c0f8e9ec>] (__rtnl_newlink) from [<c0f8eda8>] (rtnl_newlink+0x44/0x60)
[ 31.443329] [<c0f8eda8>] (rtnl_newlink) from [<c0f89f20>] (rtnetlink_rcv_msg+0x2cc/0x51c)
[ 31.451477] [<c0f89f20>] (rtnetlink_rcv_msg) from [<c1008df8>] (netlink_rcv_skb+0xb8/0x110)
[ 31.459796] [<c1008df8>] (netlink_rcv_skb) from [<c1008648>] (netlink_unicast+0x17c/0x1f8)
[ 31.468026] [<c1008648>] (netlink_unicast) from [<c1008980>] (netlink_sendmsg+0x2bc/0x3b4)
[ 31.476261] [<c1008980>] (netlink_sendmsg) from [<c0f43858>] (___sys_sendmsg+0x230/0x250)
[ 31.484408] [<c0f43858>] (___sys_sendmsg) from [<c0f44c84>] (__sys_sendmsg+0x50/0x8c)
[ 31.492209] [<c0f44c84>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x28)
[ 31.500090] Exception stack(0xedf47fa8 to 0xedf47ff0)
[ 31.505122] 7fa0: 00000002 b6f2e060 00000003 beabd6a4 00000000 00000000
[ 31.513265] 7fc0: 00000002 b6f2e060 5d6e3213 00000128 00000000 00000001 00000006 000619c4
[ 31.521405] 7fe0: 00086078 beabd658 0005edbc b6e7ce68
The reason is the implementation of br_get_pvid:
static inline u16 br_get_pvid(const struct net_bridge_vlan_group *vg)
{
if (!vg)
return 0;
smp_rmb();
return vg->pvid;
}
Since VID 0 is an invalid pvid from the bridge's point of view, let's
add this check in dsa_8021q_restore_pvid to avoid restoring a pvid that
doesn't really exist.
Fixes: 5f33183b7fdf ("net: dsa: tag_8021q: Restore bridge VLANs when enabling vlan_filtering")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
in the receive path (more precisely in ip6_rcv_core()) the
skb->transport_header is set to skb->network_header + sizeof(*hdr). As a
consequence, after routing operations, destination input expects to find
skb->transport_header correctly set to the next protocol (or extension
header) that follows the network protocol. However, decap behaviors (DX*,
DT*) remove the outer IPv6 and SRH extension and do not set again the
skb->transport_header pointer correctly. For this reason, the patch sets
the skb->transport_header to the skb->network_header + sizeof(hdr) in each
DX* and DT* behavior.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
pskb_may_pull may change pointers in header. For this reason, it is
mandatory to reload any pointer that points into skb header.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The tipc prefix for log messages generated by tipc was
removed in commit 07f6c4bc048a ("tipc: convert tipc reference
table to use generic rhashtable").
This is still a useful prefix so add it back.
Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2019-11-13
this is a pull request of 9 patches for net/master, hopefully for the v5.4
release cycle.
All nine patches are by Oleksij Rempel and fix locking and use-after-free bugs
in the j1939 stack found by the syzkaller syzbot.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2019-11-13
1) Fix a page memleak on xfrm state destroy.
2) Fix a refcount imbalance if a xfrm_state
gets invaild during async resumption.
From Xiaodong Xu.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
j1939_session_destroy() and __j1939_priv_release() should be called only
if session, ecu or socket are not linked or used by any one else. If at
least one of these resources is linked, then the reference counting is
broken somewhere.
This warning will be triggered before KASAN will do, and will make it
easier to debug initial issue. This works on platforms without KASAN
support.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
|
|
j1939_can_recv() can be called in parallel with socket release. In this
case sk_release and sk_destruct can be done earlier than
j1939_can_recv() is processed.
Reported-by: syzbot+ca172a0ac477ac90f045@syzkaller.appspotmail.com
Reported-by: syzbot+07ca5bce8530070a5650@syzkaller.appspotmail.com
Reported-by: syzbot+a47537d3964ef6c874e1@syzkaller.appspotmail.com
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
|
|
hrtimer_try_to_cancel() instead of hrtimer_cancel()
This part of the code protected by lock used in the hrtimer as well.
Using hrtimer_cancel() will trigger dead lock.
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
|
|
We link the socket to the session to be able provide socket specific
notifications. For example messages over error queue.
We need to keep the socket held, while we have a reference to it.
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
|
|
only once
j1939_session_cancel() was modifying session->state without protecting
it by locks and without checking actual state of the session.
This patch moves j1939_tp_set_rxtimeout() into j1939_session_cancel()
and adds the missing locking.
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
|
|
j1939_sk_sendmsg()
j1939_sk_sendmsg() should be protected by lock_sock() to avoid race with
j1939_sk_bind() and j1939_sk_release().
Reported-by: syzbot+afd421337a736d6c1ee6@syzkaller.appspotmail.com
Reported-by: syzbot+6d04f6a1b31a0ae12ca9@syzkaller.appspotmail.com
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
|
|
This patch avoids a NULL pointer deref crash if ndev->ml_priv is NULL.
Reported-by: syzbot+95c8e0d9dffde15b6c5c@syzkaller.appspotmail.com
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
|
|
This patch delays the j1939_priv_put() until the socket is destroyed via
the sk_destruct callback, to avoid use-after-free problems.
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
|
|
In j1939 we need our own struct sock::sk_destruct callback. Export the
generic af_can can_sock_destruct() that allows us to chain-call it.
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
|
|
If an SMC socket is immediately terminated after a non-blocking connect()
has been called, a memory leak is possible.
Due to the sock_hold move in
commit 301428ea3708 ("net/smc: fix refcounting for non-blocking connect()")
an extra sock_put() is needed in smc_connect_work(), if the internal
TCP socket is aborted and cancels the sk_stream_wait_connect() of the
connect worker.
Reported-by: syzbot+4b73ad6fc767e576e275@syzkaller.appspotmail.com
Fixes: 301428ea3708 ("net/smc: fix refcounting for non-blocking connect()")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
An ESP packet could be decrypted in async mode if the input handler for
this packet returns -EINPROGRESS in xfrm_input(). At this moment the device
reference in skb is held. Later xfrm_input() will be invoked again to
resume the processing.
If the transform state is still valid it would continue to release the
device reference and there won't be a problem; however if the transform
state is not valid when async resumption happens, the packet will be
dropped while the device reference is still being held.
When the device is deleted for some reason and the reference to this
device is not properly released, the kernel will keep logging like:
unregister_netdevice: waiting for ppp2 to become free. Usage count = 1
The issue is observed when running IPsec traffic over a PPPoE device based
on a bridge interface. By terminating the PPPoE connection on the server
end for multiple times, the PPPoE device on the client side will eventually
get stuck on the above warning message.
This patch will check the async mode first and continue to release device
reference in async resumption, before it is dropped due to invalid state.
v2: Do not assign address family from outer_mode in the transform if the
state is invalid
v3: Release device reference in the error path instead of jumping to resume
Fixes: 4ce3dbe397d7b ("xfrm: Fix xfrm_input() to verify state is valid when (encap_type < 0)")
Signed-off-by: Xiaodong Xu <stid.smth@gmail.com>
Reported-by: Bo Chen <chenborfc@163.com>
Tested-by: Bo Chen <chenborfc@163.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
When setting the dump's time-stamp, use ktime_get_real in addition to
jiffies. This simplifies the user space implementation and bypasses
some inconsistent behavior with translating jiffies to current time.
The time taken is transformed into nsec, to comply with y2038 issue.
Fixes: c8e1da0bf923 ("devlink: Add health report functionality")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There is a race between driver code that does setup/cleanup of device
and devlink reload operation that in some drivers works with the same
code. Use after free could we easily obtained by running:
while true; do
echo "0000:00:10.0" >/sys/bus/pci/drivers/mlxsw_spectrum2/bind
devlink dev reload pci/0000:00:10.0 &
echo "0000:00:10.0" >/sys/bus/pci/drivers/mlxsw_spectrum2/unbind
done
Fix this by enabling reload only after setup of device is complete and
disabling it at the beginning of the cleanup process.
Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: 2d8dc5bbf4e7 ("devlink: Add support for reload")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The "42f5cda5eaf4" commit rightly set SOCK_DONE on peer shutdown,
but there is an issue if we receive the SHUTDOWN(RDWR) while the
virtio_transport_close_timeout() is scheduled.
In this case, when the timeout fires, the SOCK_DONE is already
set and the virtio_transport_close_timeout() will not call
virtio_transport_reset() and virtio_transport_do_close().
This causes that both sockets remain open and will never be released,
preventing the unloading of [virtio|vhost]_transport modules.
This patch fixes this issue, calling virtio_transport_reset() and
virtio_transport_do_close() when we receive the SHUTDOWN(RDWR)
and there is nothing left to read.
Fixes: 42f5cda5eaf4 ("vsock/virtio: set SOCK_DONE on peer shutdown")
Cc: Stephen Barber <smbarber@chromium.org>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the first 5 minutes after boot (time of INITIAL_JIFFIES),
ieee80211_sta_last_active() returns zero if last_ack is zero. This
leads to "inactive time" showing jiffies_to_msecs(jiffies).
# iw wlan0 station get fc:ec:da:64:a6:dd
Station fc:ec:da:64:a6:dd (on wlan0)
inactive time: 4294894049 ms
.
.
connected time: 70 seconds
Fix by returning last_rx if last_ack == 0.
Signed-off-by: Ahmed Zaki <anzaki@gmail.com>
Link: https://lore.kernel.org/r/20191031121243.27694-1-anzaki@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
If ieee80211_txq_setup_flows() fails, we don't clean up LED
state properly, leading to crashes later on, fix that.
Fixes: dc8b274f0952 ("mac80211: Move up init of TXQs")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20191105154110.1ccf7112ba5d.I0ba865792446d051867b33153be65ce6b063d98c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Hendrik reported routes in the main table using source address are not
removed when the address is removed. The problem is that fib_sync_down_addr
does not account for devices in the default VRF which are associated
with the main table. Fix by updating the table id reference.
Fixes: 5a56a0b3a45d ("net: Don't delete routes in different VRFs")
Reported-by: Hendrik Donner <hd@os-cillation.de>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While looking at a syzbot KCSAN report [1], I found multiple
issues in this code :
1) fib6_nh->last_probe has an initial value of 0.
While probably okay on 64bit kernels, this causes an issue
on 32bit kernels since the time_after(jiffies, 0 + interval)
might be false ~24 days after boot (for HZ=1000)
2) The data-race found by KCSAN
I could use READ_ONCE() and WRITE_ONCE(), but we also can
take the opportunity of not piling-up too many rt6_probe_deferred()
works by using instead cmpxchg() so that only one cpu wins the race.
[1]
BUG: KCSAN: data-race in find_match / find_match
write to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 1:
rt6_probe net/ipv6/route.c:663 [inline]
find_match net/ipv6/route.c:757 [inline]
find_match+0x5bd/0x790 net/ipv6/route.c:733
__find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
find_rr_leaf net/ipv6/route.c:852 [inline]
rt6_select net/ipv6/route.c:896 [inline]
fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
__tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
tcp_transmit_skb net/ipv4/tcp_output.c:1185 [inline]
tcp_xmit_probe_skb+0x19b/0x1d0 net/ipv4/tcp_output.c:3735
read to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 0:
rt6_probe net/ipv6/route.c:657 [inline]
find_match net/ipv6/route.c:757 [inline]
find_match+0x521/0x790 net/ipv6/route.c:733
__find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
find_rr_leaf net/ipv6/route.c:852 [inline]
rt6_select net/ipv6/route.c:896 [inline]
fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
__tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 18894 Comm: udevd Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: cc3a86c802f0 ("ipv6: Change rt6_probe to take a fib6_nh")
Fixes: f547fac624be ("ipv6: rate-limit probes for neighbourless routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The function nfc_put_device(dev) is called twice to drop the reference
to dev when there is no associated local llcp. Remove one of them to fix
the bug.
Fixes: 52feb444a903 ("NFC: Extend netlink interface for LTO, RW, and MIUX parameters support")
Fixes: d9b8d8e19b07 ("NFC: llcp: Service Name Lookup netlink interface")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We leak the page that we use to create skb page fragments
when destroying the xfrm_state. Fix this by dropping a
page reference if a page was assigned to the xfrm_state.
Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
Reported-by: JD <jdtxs00@gmail.com>
Reported-by: Paul Wouters <paul@nohats.ca>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Missing register size validation in bitwise and cmp offloads.
2) Fix error code in ip_set_sockfn_get() when copy_to_user() fails,
from Dan Carpenter.
3) Oneliner to copy MAC address in IPv6 hash:ip,mac sets, from
Stefano Brivio.
4) Missing policy validation in ipset with NL_VALIDATE_STRICT,
from Jozsef Kadlecsik.
5) Fix unaligned access to private data area of nf_tables instructions,
from Lukas Wunner.
6) Relax check for object updates, reported as a regression by
Eric Garver, patch from Fernando Fernandez Mancera.
7) Crash on ebtables dnat extension when used from the output path.
From Florian Westphal.
8) Fix bogus EOPNOTSUPP when updating basechain flags.
9) Fix bogus EBUSY when updating a basechain that is already offloaded.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If a pnet table entry is to be added mentioning a valid ethernet
interface, but an invalid infiniband or ISM device, the dev_put()
operation for the ethernet interface is called twice, resulting
in a negative refcount for the ethernet interface, which disables
removal of such a network interface.
This patch removes one of the dev_put() calls.
Fixes: 890a2cb4a966 ("net/smc: rework pnet table")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TLS TX needs to release and re-acquire the socket lock if send buffer
fills up.
TLS SW TX path currently depends on only allowing one thread to enter
the function by the abuse of sk_write_pending. If another writer is
already waiting for memory no new ones are allowed in.
This has two problems:
- writers don't wake other threads up when they leave the kernel;
meaning that this scheme works for single extra thread (second
application thread or delayed work) because memory becoming
available will send a wake up request, but as Mallesham and
Pooja report with larger number of threads it leads to threads
being put to sleep indefinitely;
- the delayed work does not get _scheduled_ but it may _run_ when
other writers are present leading to crashes as writers don't
expect state to change under their feet (same records get pushed
and freed multiple times); it's hard to reliably bail from the
work, however, because the mere presence of a writer does not
guarantee that the writer will push pending records before exiting.
Ensuring wakeups always happen will make the code basically open
code a mutex. Just use a mutex.
The TLS HW TX path does not have any locking (not even the
sk_write_pending hack), yet it uses a per-socket sg_tx_data
array to push records.
Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Reported-by: Mallesham Jatharakonda <mallesh537@gmail.com>
Reported-by: Pooja Trivedi <poojatrivedi@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
sk_write_pending being not zero does not guarantee that partial
record will be pushed. If the thread waiting for memory times out
the pending record may get stuck.
In case of tls_device there is no path where parial record is
set and writer present in the first place. Partial record is
set only in tls_push_sg() and tls_push_sg() will return an
error immediately. All tls_device callers of tls_push_sg()
will return (and not wait for memory) if it failed.
Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
sk_msg_trim() tries to only update curr pointer if it falls into
the trimmed region. The logic, however, does not take into the
account pointer wrapping that sk_msg_iter_var_prev() does nor
(as John points out) the fact that msg->sg is a ring buffer.
This means that when the message was trimmed completely, the new
curr pointer would have the value of MAX_MSG_FRAGS - 1, which is
neither smaller than any other value, nor would it actually be
correct.
Special case the trimming to 0 length a little bit and rework
the comparison between curr and end to take into account wrapping.
This bug caused the TLS code to not copy all of the message, if
zero copy filled in fewer sg entries than memcopy would need.
Big thanks to Alexander Potapenko for the non-KMSAN reproducer.
v2:
- take into account that msg->sg is a ring buffer (John).
Link: https://lore.kernel.org/netdev/20191030160542.30295-1-jakub.kicinski@netronome.com/ (v1)
Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
Reported-by: syzbot+f8495bff23a879a6d0bd@syzkaller.appspotmail.com
Reported-by: syzbot+6f50c99e8f6194bf363f@syzkaller.appspotmail.com
Co-developed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a new filter is added to cls_api, the function
tcf_chain_tp_insert_unique() looks up the protocol/priority/chain to
determine if the tcf_proto is duplicated in the chain's hashtable. It then
creates a new entry or continues with an existing one. In cls_flower, this
allows the function fl_ht_insert_unque to determine if a filter is a
duplicate and reject appropriately, meaning that the duplicate will not be
passed to drivers via the offload hooks. However, when a tcf_proto is
destroyed it is removed from its chain before a hardware remove hook is
hit. This can lead to a race whereby the driver has not received the
remove message but duplicate flows can be accepted. This, in turn, can
lead to the offload driver receiving incorrect duplicate flows and out of
order add/delete messages.
Prevent duplicates by utilising an approach suggested by Vlad Buslov. A
hash table per block stores each unique chain/protocol/prio being
destroyed. This entry is only removed when the full destroy (and hardware
offload) has completed. If a new flow is being added with the same
identiers as a tc_proto being detroyed, then the add request is replayed
until the destroy is complete.
Fixes: 8b64678e0af8 ("net: sched: refactor tp insert/delete for concurrent execution")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reported-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Don't swap oper and admin schedules too early, it's not correct and
causes crash.
Steps to reproduce:
1)
tc qdisc replace dev eth0 parent root handle 100 taprio \
num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
queues 1@0 1@1 1@2 \
base-time $SOME_BASE_TIME \
sched-entry S 01 80000 \
sched-entry S 02 15000 \
sched-entry S 04 40000 \
flags 2
2)
tc qdisc replace dev eth0 parent root handle 100 taprio \
base-time $SOME_BASE_TIME \
sched-entry S 01 90000 \
sched-entry S 02 20000 \
sched-entry S 04 40000 \
flags 2
3)
tc qdisc replace dev eth0 parent root handle 100 taprio \
base-time $SOME_BASE_TIME \
sched-entry S 01 150000 \
sched-entry S 02 200000 \
sched-entry S 04 40000 \
flags 2
Do 2 3 2 .. steps more times if not happens and observe:
[ 305.832319] Unable to handle kernel write to read-only memory at
virtual address ffff0000087ce7f0
[ 305.910887] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
[ 305.919306] Hardware name: Texas Instruments AM654 Base Board (DT)
[...]
[ 306.017119] x1 : ffff800848031d88 x0 : ffff800848031d80
[ 306.022422] Call trace:
[ 306.024866] taprio_free_sched_cb+0x4c/0x98
[ 306.029040] rcu_process_callbacks+0x25c/0x410
[ 306.033476] __do_softirq+0x10c/0x208
[ 306.037132] irq_exit+0xb8/0xc8
[ 306.040267] __handle_domain_irq+0x64/0xb8
[ 306.044352] gic_handle_irq+0x7c/0x178
[ 306.048092] el1_irq+0xb0/0x128
[ 306.051227] arch_cpu_idle+0x10/0x18
[ 306.054795] do_idle+0x120/0x138
[ 306.058015] cpu_startup_entry+0x20/0x28
[ 306.061931] rest_init+0xcc/0xd8
[ 306.065154] start_kernel+0x3bc/0x3e4
[ 306.068810] Code: f2fbd5b7 f2fbd5b6 d503201f f9400422 (f9000662)
[ 306.074900] ---[ end trace 96c8e2284a9d9d6e ]---
[ 306.079507] Kernel panic - not syncing: Fatal exception in interrupt
[ 306.085847] SMP: stopping secondary CPUs
[ 306.089765] Kernel Offset: disabled
Try to explain one of the possible crash cases:
The "real" admin list is assigned when admin_sched is set to
new_admin, it happens after "swap", that assigns to oper_sched NULL.
Thus if call qdisc show it can crash.
Farther, next second time, when sched list is updated, the admin_sched
is not NULL and becomes the oper_sched, previous oper_sched was NULL so
just skipped. But then admin_sched is assigned new_admin, but schedules
to free previous assigned admin_sched (that already became oper_sched).
Farther, next third time, when sched list is updated,
while one more swap, oper_sched is not null, but it was happy to be
freed already (while prev. admin update), so while try to free
oper_sched the kernel panic happens at taprio_free_sched_cb().
So, move the "swap emulation" where it should be according to function
comment from code.
Fixes: 9c66d15646760e ("taprio: Add support for hardware offloading")
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|