Age | Commit message | Author | Files | Lines
2017-10-20 | bpf: avoid preempt enable/disable in sockmap using tcp_skb_cb region | John Fastabend | 8 | -30/+36
SK_SKB BPF programs are run from the socket/tcp context but early in the stack before much of the TCP metadata is needed in tcp_skb_cb. So we can use some unused fields to place BPF metadata needed for SK_SKB programs when implementing the redirect function. This allows us to drop the preempt disable logic. It does however require an API change so sk_redirect_map() has been updated to additionally provide ctx_ptr to skb. Note, we do however continue to disable/enable preemption around actual BPF program running to account for map updates. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 | bpf: enforce TCP only support for sockmap | John Fastabend | 2 | -1/+17
Only TCP sockets have been tested and at the moment the state change callback only handles TCP sockets. This adds a check to ensure that sockets actually being added are TCP sockets. For net-next we can consider UDP support. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 | sctp: add the missing sock_owned_by_user check in sctp_icmp_redirect | Xin Long | 1 | -1/+1
sctp processes ICMP redirect packets in sctp_icmp_redirect(), which calls sctp_transport_dst_check(), where tp->dst can be released. The problem is that sctp_icmp_redirect() does not check sock_owned_by_user() before that call, so tp->dst could be freed while a process that owns the socket is still accessing it, triggering a use-after-free. This patch fixes it by checking sock_owned_by_user() before calling sctp_transport_dst_check() in sctp_icmp_redirect(), so that tp->dst is not released while a user still holds the sock lock. The issue fixed for dccp/tcp in commit 45caeaa5ac0b ("dccp/tcp: fix routing redirect race") also applies to sctp and needs this same check. Fixes: 55be7a9c6074 ("ipv4: Add redirect support to all protocol icmp error handlers") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 | sctp: do not peel off an assoc from one netns to another one | Xin Long | 1 | -0/+4
When peeling off an association to a sock in another netns, the transports in this assoc are not rehashed and keep using the old key in the hashtable. As a transport uses sk->net as the hash key when inserted into the hashtable, these transports would not be removed from the hashtable when the sock is closed and all transports are freed, because the netns has changed; a later lookup of an asoc that dereferences those transports could then cause a use-after-free. This is a very old issue, present since the very beginning. ChunYu found it with syzkaller fuzz testing with this series: socket$inet6_sctp() bind$inet6() sendto$inet6() unshare(0x40000000) getsockopt$inet_sctp6_SCTP_GET_ASSOC_ID_LIST() getsockopt$inet_sctp6_SCTP_SOCKOPT_PEELOFF() This patch blocks the call when peeling an assoc off from one netns to another, so that the netns of all transports does not go out of sync with the key in the hashtable. Note that this patch doesn't fix it by rehashing transports, as it's difficult to handle the situation where the tuple is already in use in the new netns. Besides, no one would want to peel off an assoc to another netns, considering that ipaddrs, ifaces, etc. are usually different. Reported-by: ChunYu Wang <chunwang@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 | Merge branch 'bpf-Fix-for-BPF-devmap-percpu-allocation-splat' | David S. Miller | 4 | -12/+14
Daniel Borkmann says: ==================== bpf: Fix for BPF devmap percpu allocation splat The set fixes a splat in devmap percpu allocation when we alloc the flush bitmap. Patch 1 is a prerequisite for the fix in patch 2, patch 1 is rather small, so if this could be routed via -net, for example, with Tejun's Ack that would be good. Patch 3 gets rid of remaining PCPU_MIN_UNIT_SIZE checks, which are percpu allocator internals and should not be used. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 | bpf: do not test for PCPU_MIN_UNIT_SIZE before percpu allocations | Daniel Borkmann | 2 | -5/+1
PCPU_MIN_UNIT_SIZE is an implementation detail of the percpu allocator. Given we support __GFP_NOWARN now, lets just let the allocation request fail naturally instead. The two call sites from BPF mistakenly assumed __GFP_NOWARN would work, so no changes needed to their actual __alloc_percpu_gfp() calls which use the flag already. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 | bpf: fix splat for illegal devmap percpu allocation | Daniel Borkmann | 1 | -2/+3
It was reported that syzkaller was able to trigger a splat on devmap percpu allocation due to illegal/unsupported allocation request size passed to __alloc_percpu(): [ 70.094249] illegal size (32776) or align (8) for percpu allocation [ 70.094256] ------------[ cut here ]------------ [ 70.094259] WARNING: CPU: 3 PID: 3451 at mm/percpu.c:1365 pcpu_alloc+0x96/0x630 [...] [ 70.094325] Call Trace: [ 70.094328] __alloc_percpu_gfp+0x12/0x20 [ 70.094330] dev_map_alloc+0x134/0x1e0 [ 70.094331] SyS_bpf+0x9bc/0x1610 [ 70.094333] ? selinux_task_setrlimit+0x5a/0x60 [ 70.094334] ? security_task_setrlimit+0x43/0x60 [ 70.094336] entry_SYSCALL_64_fastpath+0x1a/0xa5 This was due to too large max_entries for the map such that we surpassed the upper limit of PCPU_MIN_UNIT_SIZE. It's fine to fail naturally here, so switch to __alloc_percpu_gfp() and pass __GFP_NOWARN instead. Fixes: 11393cc9b9be ("xdp: Add batching support to redirect map") Reported-by: Mark Rutland <mark.rutland@arm.com> Reported-by: Shankara Pailoor <sp3485@columbia.edu> Reported-by: Richard Weinberger <richard@nod.at> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 | mm, percpu: add support for __GFP_NOWARN flag | Daniel Borkmann | 1 | -5/+10
Add an option for pcpu_alloc() to support __GFP_NOWARN flag. Currently, we always throw a warning when size or alignment is unsupported (and also dump stack on failed allocation requests). The warning itself is harmless since we return NULL anyway for any failed request, which callers are required to handle anyway. However, it becomes harmful when panic_on_warn is set. The rationale for the WARN() in pcpu_alloc() is that it can be tracked when larger than supported allocation requests are made such that allocations limits can be tweaked if warranted. This makes sense for in-kernel users, however, there are users of pcpu allocator where allocation size is derived from user space requests, e.g. when creating BPF maps. In these cases, the requests should fail gracefully without throwing a splat. The current work-around was to check allocation size against the upper limit of PCPU_MIN_UNIT_SIZE from call-sites for bailing out prior to a call to pcpu_alloc() in order to avoid throwing the WARN(). This is bad in multiple ways since PCPU_MIN_UNIT_SIZE is an implementation detail, and having the checks on call-sites only complicates the code for no good reason. Thus, lets fix it generically by supporting the __GFP_NOWARN flag that users can then use with calling the __alloc_percpu_gfp() helper instead. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
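The behaviour described above can be sketched in userspace C. This is only a model under stated assumptions, not the kernel code: `model_pcpu_alloc`, `MODEL_MAX_UNIT_SIZE`, `MODEL_GFP_NOWARN` and `model_warn_count` are hypothetical names, and the 32 KiB limit is an assumed value chosen so that the 32776-byte request from the devmap splat is rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real limit and flag live in the kernel. */
#define MODEL_MAX_UNIT_SIZE (32u << 10)  /* assumed upper bound, 32 KiB */
#define MODEL_GFP_NOWARN    0x1u

unsigned int model_warn_count;  /* counts simulated WARN() splats */

/* Model of the size check: oversized requests always fail, and they
 * additionally warn unless the caller passed the no-warn flag. */
void *model_pcpu_alloc(size_t size, unsigned int flags)
{
    bool do_warn = !(flags & MODEL_GFP_NOWARN);

    assert(size > 0);
    if (size > MODEL_MAX_UNIT_SIZE) {
        if (do_warn)
            model_warn_count++;  /* useful signal for in-kernel callers */
        return NULL;             /* callers must handle failure either way */
    }
    return calloc(1, size);
}
```

A caller whose size comes from user space would pass `MODEL_GFP_NOWARN` and simply check for NULL, which is why the call sites no longer need their own size checks.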
2017-10-19 | Merge branch 'ena-fixes' | David S. Miller | 2 | -5/+6
Netanel Belgazal says: ==================== ENA ethernet driver bug fixes Some fixes for ENA ethernet driver ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 | net: ena: fix wrong max Tx/Rx queues on ethtool | Netanel Belgazal | 1 | -2/+2
ena_get_channels(), the ethtool callback, exposed the maximum number of queues as the maximum ENA supports (128 queues) rather than the number of queues actually created. Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 | net: ena: fix rare kernel crash when bar memory remap fails | Netanel Belgazal | 1 | -1/+2
This failure is rare and only found on testing where deliberately fail devm_ioremap() [ 451.170464] ena 0000:04:00.0: failed to remap regs bar 451.170549] Workqueue: pciehp-1 pciehp_power_thread [ 451.170551] task: ffff88085a5f2d00 task.stack: ffffc9000756c000 [ 451.170552] RIP: 0010:devm_iounmap+0x2d/0x40 [ 451.170553] RSP: 0018:ffffc9000756fac0 EFLAGS: 00010282 [ 451.170554] RAX: 00000000fffffffe RBX: 0000000000000000 RCX: 0000000000000000 [ 451.170555] RDX: ffffffff813a7e00 RSI: 0000000000000282 RDI: 0000000000000282 [ 451.170556] RBP: ffffc9000756fac8 R08: 00000000fffffffe R09: 00000000000009b7 [ 451.170557] R10: 0000000000000005 R11: 00000000000009b6 R12: ffff880856c9d0a0 [ 451.170558] R13: ffffc9000f5c90c0 R14: ffff880856c9d0a0 R15: 0000000000000028 [ 451.170559] FS: 0000000000000000(0000) GS:ffff88085f400000(0000) knlGS:0000000000000000 [ 451.170560] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 451.170561] CR2: 00007f169038b000 CR3: 0000000001c09000 CR4: 00000000003406f0 [ 451.170562] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 451.170562] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 451.170563] Call Trace: [ 451.170572] ena_release_bars.isra.48+0x34/0x60 [ena] [ 451.170574] ena_probe+0x144/0xd90 [ena] [ 451.170579] ? ida_simple_get+0x98/0x100 [ 451.170585] ? kernfs_next_descendant_post+0x40/0x50 [ 451.170591] local_pci_probe+0x45/0xa0 [ 451.170592] pci_device_probe+0x157/0x180 [ 451.170599] driver_probe_device+0x2a8/0x460 [ 451.170600] __device_attach_driver+0x7e/0xe0 [ 451.170602] ? 
driver_allows_async_probing+0x30/0x30 [ 451.170603] bus_for_each_drv+0x68/0xb0 [ 451.170605] __device_attach+0xdd/0x160 [ 451.170607] device_attach+0x10/0x20 [ 451.170610] pci_bus_add_device+0x4f/0xa0 [ 451.170611] pci_bus_add_devices+0x39/0x70 [ 451.170613] pciehp_configure_device+0x96/0x120 [ 451.170614] pciehp_enable_slot+0x1b3/0x290 [ 451.170616] pciehp_power_thread+0x3b/0xb0 [ 451.170622] process_one_work+0x149/0x360 [ 451.170623] worker_thread+0x4d/0x3c0 [ 451.170626] kthread+0x109/0x140 [ 451.170627] ? rescuer_thread+0x380/0x380 [ 451.170628] ? kthread_park+0x60/0x60 [ 451.170632] ret_from_fork+0x25/0x30 Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 | net: ena: reduce the severity of some printouts | Netanel Belgazal | 1 | -2/+2
Decrease log level of checksum errors as these messages can be triggered remotely by bad packets. Signed-off-by: Netanel Belgazal <netanel@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 | bpf: disallow arithmetic operations on context pointer | Jakub Kicinski | 2 | -2/+20
Commit f1174f77b50c ("bpf/verifier: rework value tracking") removed the crafty selection of which pointer types are allowed to be modified. This is OK for most pointer types since adjust_ptr_min_max_vals() will catch operations on immutable pointers. One exception is PTR_TO_CTX which is now allowed to be offseted freely. The intent of aforementioned commit was to allow context access via modified registers. The offset passed to ->is_valid_access() verifier callback has been adjusted by the value of the variable offset. What is missing, however, is taking the variable offset into account when the context register is used. Or in terms of the code adding the offset to the value passed to the ->convert_ctx_access() callback. This leads to the following eBPF user code: r1 += 68 r0 = *(u32 *)(r1 + 8) exit being translated to this in kernel space: 0: (07) r1 += 68 1: (61) r0 = *(u32 *)(r1 +180) 2: (95) exit Offset 8 is corresponding to 180 in the kernel, but offset 76 is valid too. Verifier will "accept" access to offset 68+8=76 but then "convert" access to offset 8 as 180. Effective access to offset 248 is beyond the kernel context. (This is a __sk_buff example on a debug-heavy kernel - packet mark is 8 -> 180, 76 would be data.) Dereferencing the modified context pointer is not as easy as dereferencing other types, because we have to translate the access to reading a field in kernel structures which is usually at a different offset and often of a different size. To allow modifying the pointer we would have to make sure that given eBPF instruction will always access the same field or the fields accessed are "compatible" in terms of offset and size... Disallow dereferencing modified context pointers and add to selftests the test case described here. 
Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
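The offset arithmetic in the message above (68 + 8 = 76 is validated, but 8 is converted to 180, giving an effective 248) can be reproduced with a small model. This is a hedged sketch, not verifier code: the one-entry table and the function names are hypothetical, and only the numbers 8, 68, 76, 180 and 248 come from the commit message.

```c
#include <stddef.h>

/* Hypothetical one-entry conversion table taken from the example:
 * user offset 8 (packet mark) maps to kernel offset 180. */
struct ctx_map { int user_off; int kernel_off; };

static const struct ctx_map model_map[] = { { 8, 180 } };

/* Offset that gets validated: register adjustment plus instruction offset. */
int checked_user_offset(int reg_adjust, int insn_off)
{
    return reg_adjust + insn_off;
}

/* Offset actually dereferenced when only insn_off is converted and the
 * register adjustment is left in place. Returns -1 if insn_off is unmapped. */
int effective_kernel_offset(int reg_adjust, int insn_off)
{
    for (size_t i = 0; i < sizeof(model_map) / sizeof(model_map[0]); i++)
        if (model_map[i].user_off == insn_off)
            return reg_adjust + model_map[i].kernel_off;
    return -1;
}

/* The fix in spirit: refuse any dereference of a modified context pointer. */
int ctx_access_allowed(int reg_adjust)
{
    return reg_adjust == 0;
}
```

With `reg_adjust = 68` and `insn_off = 8`, the validated offset and the effective offset disagree, which is the mismatch the patch removes by rejecting the access outright.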
2017-10-18 | netlink: fix netlink_ack() extack race | Johannes Berg | 1 | -4/+4
It seems that it's possible to toggle NETLINK_F_EXT_ACK through setsockopt() while another thread/CPU is building a message inside netlink_ack(), which could then trigger the WARN_ON()s I added since if it goes from being turned off to being turned on between allocating and filling the message, the skb could end up being too small. Avoid this whole situation by storing the value of this flag in a separate variable and using that throughout the function instead. Fixes: 2d4bc93368f5 ("netlink: extended ACK reporting") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
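The "read the flag once" pattern described above might look roughly like the following userspace model. Everything here is an assumption made for illustration: the struct names, the two length constants and the `between` callback (which stands in for a concurrent setsockopt()) are hypothetical and do not mirror the real netlink_ack().

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BASE_LEN 16u  /* hypothetical size of the plain ACK */
#define MODEL_EXT_LEN  32u  /* hypothetical extra room for extended ACK data */

struct model_sock { bool ext_ack; };  /* stands in for the socket flag */

struct model_ack { size_t allocated; size_t written; };

/* 'between' simulates another thread toggling the flag mid-build. */
struct model_ack model_build_ack(struct model_sock *sk,
                                 void (*between)(struct model_sock *))
{
    struct model_ack ack;
    bool ext_ack = sk->ext_ack;  /* read the flag exactly once */

    ack.allocated = MODEL_BASE_LEN + (ext_ack ? MODEL_EXT_LEN : 0);
    if (between)
        between(sk);             /* the flag may change here */
    /* Fill using the snapshot, never sk->ext_ack, so the sizes agree. */
    ack.written = MODEL_BASE_LEN + (ext_ack ? MODEL_EXT_LEN : 0);
    return ack;
}

void model_enable_ext_ack(struct model_sock *sk)
{
    sk->ext_ack = true;
}
```

Passing `model_enable_ext_ack` as `between` flips the flag after the size has been computed; because both steps use the local copy, the written length still matches the allocation.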
2017-10-18 | ibmvnic: Fix calculation of number of TX header descriptors | Thomas Falcon | 1 | -8/+8
This patch correctly sets the number of additional header descriptors that will be sent in an indirect SCRQ entry. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 | mlxsw: core: Fix possible deadlock | Ido Schimmel | 1 | -1/+10
When an EMAD is transmitted, a timeout work item is scheduled with a delay of 200ms, so that another EMAD will be retried until a maximum of five retries. In certain situations, it's possible for the function waiting on the EMAD to be associated with a work item that is queued on the same workqueue (`mlxsw_core`) as the timeout work item. This results in flushing a work item on the same workqueue. According to commit e159489baa71 ("workqueue: relax lockdep annotation on flush_work()") the above may lead to a deadlock in case the workqueue has only one worker active or if the system in under memory pressure and the rescue worker is in use. The latter explains the very rare and random nature of the lockdep splats we have been seeing: [ 52.730240] ============================================ [ 52.736179] WARNING: possible recursive locking detected [ 52.742119] 4.14.0-rc3jiri+ #4 Not tainted [ 52.746697] -------------------------------------------- [ 52.752635] kworker/1:3/599 is trying to acquire lock: [ 52.758378] (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c4fa4>] flush_work+0x3a4/0x5e0 [ 52.767837] but task is already holding lock: [ 52.774360] (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0 [ 52.784495] other info that might help us debug this: [ 52.791794] Possible unsafe locking scenario: [ 52.798413] CPU0 [ 52.801144] ---- [ 52.803875] lock(mlxsw_core_driver_name); [ 52.808556] lock(mlxsw_core_driver_name); [ 52.813236] *** DEADLOCK *** [ 52.819857] May be due to missing lock nesting notation [ 52.827450] 3 locks held by kworker/1:3/599: [ 52.832221] #0: (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0 [ 52.842846] #1: ((&(&bridge->fdb_notify.dw)->work)){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0 [ 52.854537] #2: (rtnl_mutex){+.+.}, at: [<ffffffff822ad8e7>] rtnl_lock+0x17/0x20 [ 52.863021] stack backtrace: [ 52.867890] CPU: 1 PID: 599 Comm: kworker/1:3 Not 
tainted 4.14.0-rc3jiri+ #4 [ 52.875773] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016 [ 52.886267] Workqueue: mlxsw_core mlxsw_sp_fdb_notify_work [mlxsw_spectrum] [ 52.894060] Call Trace: [ 52.909122] __lock_acquire+0xf6f/0x2a10 [ 53.025412] lock_acquire+0x158/0x440 [ 53.047557] flush_work+0x3c4/0x5e0 [ 53.087571] __cancel_work_timer+0x3ca/0x5e0 [ 53.177051] cancel_delayed_work_sync+0x13/0x20 [ 53.182142] mlxsw_reg_trans_bulk_wait+0x12d/0x7a0 [mlxsw_core] [ 53.194571] mlxsw_core_reg_access+0x586/0x990 [mlxsw_core] [ 53.225365] mlxsw_reg_query+0x10/0x20 [mlxsw_core] [ 53.230882] mlxsw_sp_fdb_notify_work+0x2a3/0x9d0 [mlxsw_spectrum] [ 53.237801] process_one_work+0x8f1/0x12f0 [ 53.321804] worker_thread+0x1fd/0x10c0 [ 53.435158] kthread+0x28e/0x370 [ 53.448703] ret_from_fork+0x2a/0x40 [ 53.453017] mlxsw_spectrum 0000:01:00.0: EMAD retries (2/5) (tid=bf4549b100000774) [ 53.453119] mlxsw_spectrum 0000:01:00.0: EMAD retries (5/5) (tid=bf4549b100000770) [ 53.453132] mlxsw_spectrum 0000:01:00.0: EMAD reg access failed (tid=bf4549b100000770,reg_id=200b(sfn),type=query,status=0(operation performed)) [ 53.453143] mlxsw_spectrum 0000:01:00.0: Failed to get FDB notifications Fix this by creating another workqueue for EMAD timeouts, thereby preventing the situation of a work item trying to flush a work item queued on the same workqueue. Fixes: caf7297e7ab5f ("mlxsw: core: Introduce support for asynchronous EMAD register access") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | Merge tag 'mac80211-for-davem-2017-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 | David S. Miller | 1 | -4/+17
Johannes Berg says: ==================== Just a single fix, for a WoWLAN-related part of CVE-2017-13080. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | dev_ioctl: add missing NETDEV_CHANGE_TX_QUEUE_LEN event notification | Xin Long | 1 | -1/+12
When a device's tx_queue_len is changed via netlink or net-sysfs, a NETDEV_CHANGE_TX_QUEUE_LEN event notification is raised. dev_ioctl did not raise this event, so no userspace notification was sent for changes made through ioctl. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | net/sched: cls_flower: Set egress_dev mark when calling into the HW driver | Or Gerlitz | 1 | -0/+2
Commit 7091d8c ('net/sched: cls_flower: Add offload support using egress Hardware device') made sure (when fl_hw_replace_filter is called) to put the egress_dev mark on the persistent structure instance. Hence, subsequent calls into the HW driver for stats and deletion will note it and act accordingly. With commit de4784ca030f this property is lost, and hence the HW driver fails to operate (stats, delete) on the offloaded flow when called. Fix it by setting the egress_dev flag whenever the ingress device is different from the hw device, since this is exactly the condition under which we're calling into the HW driver through the egress port net-device. Fixes: de4784ca030f ('net: sched: get rid of struct tc_to_netdev') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | tun: call dev_get_valid_name() before register_netdevice() | Cong Wang | 3 | -3/+9
register_netdevice() could fail early when we have an invalid dev name, in which case ->ndo_uninit() is not called. For tun device, this is a problem because a timer etc. are already initialized and it expects ->ndo_uninit() to clean them up. We could move these initializations into a ->ndo_init() so that register_netdevice() knows better, however this is still complicated due to the logic in tun_detach(). Therefore, I choose to just call dev_get_valid_name() before register_netdevice(), which is quicker and much easier to audit. And for this specific case, it is already enough. Fixes: 96442e42429e ("tuntap: choose the txq based on rxq") Reported-by: Dmitry Alexeev <avekceeb@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | net: enable interface alias removal via rtnl | Nicolas Dichtel | 1 | -1/+4
IFLA_IFALIAS is defined as NLA_STRING. It means that the minimal length of the attribute is 1 ("\0"). However, to remove an alias, the attribute length must be 0 (see dev_set_alias()). Let's define the type to NLA_BINARY to allow 0-length string, so that the alias can be removed. Example: $ ip l s dummy0 alias foo $ ip l l dev dummy0 5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether ae:20:30:4f:a7:f3 brd ff:ff:ff:ff:ff:ff alias foo Before the patch: $ ip l s dummy0 alias "" RTNETLINK answers: Numerical result out of range After the patch: $ ip l s dummy0 alias "" $ ip l l dev dummy0 5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether ae:20:30:4f:a7:f3 brd ff:ff:ff:ff:ff:ff CC: Oliver Hartkopp <oliver@hartkopp.net> CC: Stephen Hemminger <stephen@networkplumber.org> Fixes: 96ca4a2cc145 ("net: remove ifalias on empty given alias") Reported-by: Julien FLoret <julien.floret@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
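The difference between the two attribute types described above comes down to the minimum accepted payload length. The sketch below models only that rule. `model_validate_len` and the enum are hypothetical names, the maximum-length handling is simplified, and the use of ERANGE is inferred from the "Numerical result out of range" message in the example.

```c
#include <errno.h>
#include <stddef.h>

enum model_attr_type { MODEL_NLA_STRING, MODEL_NLA_BINARY };

/* Returns 0 if the payload length is acceptable, -ERANGE otherwise. */
int model_validate_len(enum model_attr_type type, size_t attr_len,
                       size_t max_len)
{
    switch (type) {
    case MODEL_NLA_STRING:
        /* A string needs at least its terminating "\0". */
        if (attr_len < 1)
            return -ERANGE;
        break;
    case MODEL_NLA_BINARY:
        /* Binary data may be empty, which is what alias removal sends. */
        break;
    }
    return attr_len > max_len ? -ERANGE : 0;
}
```

A zero-length attribute is rejected under the string rule and accepted under the binary rule, matching the before and after behaviour of `ip l s dummy0 alias ""`.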
2017-10-16 | Merge branch 'rtnetlink-dev-notification-fixes' | David S. Miller | 1 | -2/+6
Xin Long says: ==================== rtnetlink: a bunch of fixes for userspace notifications in changing dev properties Whenever any property of a link, address, route, etc. changes, by whatever means, the kernel should notify the userspace programs that listen for such events. The patchset "rtnetlink: Cleanup user notifications for netdev events" tried to fix a redundant notifications issue, but it also introduced a side effect: afterwards, user notifications were only sent when dev properties were changed via the netlink API, because it removed the handling of some events in rtnetlink_event, where the notifications were sent to users. As a result, no notification was generated when dev properties were changed in other ways, such as ioctl or sysfs, which may cause some user programs not to work as expected. This patchset fixes it by bringing some of these netdev events back and also fixes the old redundant notifications issue in a proper way. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | rtnetlink: do not set notification for tx_queue_len in do_setlink | Xin Long | 1 | -1/+1
Handling of the NETDEV_CHANGE_TX_QUEUE_LEN event in rtnetlink_event already sends a notification to userspace, and setting tx_queue_len in do_setlink triggers NETDEV_CHANGE_TX_QUEUE_LEN. So do_setlink should no longer set the DO_SETLINK_NOTIFY status for this change, which would send a second notification. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | rtnetlink: check DO_SETLINK_NOTIFY correctly in do_setlink | Xin Long | 1 | -1/+1
The check 'status & DO_SETLINK_NOTIFY' in do_setlink doesn't really work once status has DO_SETLINK_MODIFIED set, since the two flags share a bit: DO_SETLINK_MODIFIED is 0x1 and DO_SETLINK_NOTIFY is 0x3. Considering that notifications are supposed to be sent only when status has the flag DO_SETLINK_NOTIFY, the right check would be: (status & DO_SETLINK_NOTIFY) == DO_SETLINK_NOTIFY This would avoid lots of duplicated notifications when setting some properties of a link. Fixes: ba9989069f4e ("rtnl/do_setlink(): notify when a netdev is modified") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
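The two checks can be compared side by side in a few lines of C. The flag values are the ones quoted in the message; the helper names `notify_old` and `notify_fixed` are made up for this sketch and do not appear in the kernel.

```c
#include <stdbool.h>

#define DO_SETLINK_MODIFIED 0x1
#define DO_SETLINK_NOTIFY   0x3

/* Old check: true whenever any NOTIFY bit is set, including the bit
 * that NOTIFY shares with MODIFIED. */
bool notify_old(int status)
{
    return (status & DO_SETLINK_NOTIFY) != 0;
}

/* Fixed check: true only when every NOTIFY bit is set. */
bool notify_fixed(int status)
{
    return (status & DO_SETLINK_NOTIFY) == DO_SETLINK_NOTIFY;
}
```

With `status == DO_SETLINK_MODIFIED`, the old check fires even though no notification was requested, while the fixed check does not.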
2017-10-16 | rtnetlink: bring NETDEV_CHANGEUPPER event process back in rtnetlink_event | Xin Long | 1 | -0/+1
libteam needs this event notification in userspace when dev's master dev has been changed. After this, the redundant notifications issue would be fixed in the later patch 'rtnetlink: check DO_SETLINK_NOTIFY correctly in do_setlink'. Fixes: b6b36eb23a46 ("rtnetlink: Do not generate notifications for NETDEV_CHANGEUPPER event") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | rtnetlink: bring NETDEV_POST_TYPE_CHANGE event process back in rtnetlink_event | Xin Long | 1 | -0/+1
As I said in patch 'rtnetlink: bring NETDEV_CHANGEMTU event process back in rtnetlink_event', removing NETDEV_POST_TYPE_CHANGE event was not the right fix for the redundant notifications issue. So bring this event process back to rtnetlink_event and the old redundant notifications issue would be fixed in the later patch 'rtnetlink: check DO_SETLINK_NOTIFY correctly in do_setlink'. Fixes: aef091ae58aa ("rtnetlink: Do not generate notifications for POST_TYPE_CHANGE event") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | rtnetlink: bring NETDEV_CHANGE_TX_QUEUE_LEN event process back in rtnetlink_event | Xin Long | 1 | -0/+1
The same fix for changing mtu in the patch 'rtnetlink: bring NETDEV_CHANGEMTU event process back in rtnetlink_event' is needed for changing tx_queue_len. Note that the redundant notifications issue for tx_queue_len will be fixed in the later patch 'rtnetlink: do not set notification for tx_queue_len in do_setlink'. Fixes: 27b3b551d8a7 ("rtnetlink: Do not generate notifications for NETDEV_CHANGE_TX_QUEUE_LEN event") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | rtnetlink: bring NETDEV_CHANGEMTU event process back in rtnetlink_event | Xin Long | 1 | -0/+1
Commit 085e1a65f04f ("rtnetlink: Do not generate notifications for MTU events") tried to fix the redundant notifications issue when ip link set mtu by removing NETDEV_CHANGEMTU event process in rtnetlink_event. But it also resulted in no notification generated when dev's mtu is changed via other methods, like: 'ifconfig eth1 mtu 1400' or 'echo 1400 > /sys/class/net/eth1/mtu' It would cause users not to be notified by this change. This patch is to fix it by bringing NETDEV_CHANGEMTU event back into rtnetlink_event, and the redundant notifications issue will be fixed in the later patch 'rtnetlink: check DO_SETLINK_NOTIFY correctly in do_setlink'. Fixes: 085e1a65f04f ("rtnetlink: Do not generate notifications for MTU events") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 | mac80211: accept key reinstall without changing anything | Johannes Berg | 1 | -4/+17
When a key is reinstalled we can reset the replay counters etc. which can lead to nonce reuse and/or replay detection being impossible, breaking security properties, as described in the "KRACK attacks". In particular, CVE-2017-13080 applies to GTK rekeying that happened in firmware while the host is in D3, with the second part of the attack being done after the host wakes up. In this case, the wpa_supplicant mitigation isn't sufficient since wpa_supplicant doesn't know the GTK material. In case this happens, simply silently accept the new key coming from userspace but don't take any action on it since it's the same key; this keeps the PN replay counters intact. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-10-14 | Merge branch 'bnxt_en-fixes' | David S. Miller | 5 | -34/+112
Michael Chan says: ==================== bnxt_en: bug fixes. Various bug fixes for the VF/PF link change logic, VF resource checking, potential firmware response corruption on NVRAM and DCB parameters, and reading the wrong register for PCIe link speed on the VF. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 | bnxt_en: Fix possible corruption in DCB parameters from firmware. | Sankar Patchineelam | 1 | -5/+18
hwrm_send_message() is replaced with _hwrm_send_message(), and hwrm_cmd_lock mutex lock is grabbed for the whole period of firmware call until the firmware DCB parameters have been copied. This will prevent possible corruption of the firmware data. Fixes: 7df4ae9fe855 ("bnxt_en: Implement DCBNL to support host-based DCBX.") Signed-off-by: Sankar Patchineelam <sankar.patchineelam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 | bnxt_en: Fix possible corrupted NVRAM parameters from firmware response. | Michael Chan | 3 | -1/+10
In bnxt_find_nvram_item(), it is copying firmware response data after releasing the mutex. This can cause the firmware response data to be corrupted if the next firmware response overwrites the response buffer. The rare problem shows up when running ethtool -i repeatedly. Fix it by calling the new variant _hwrm_send_message_silent() that requires the caller to take the mutex and to release it after the response data has been copied. Fixes: 3ebf6f0a09a2 ("bnxt_en: Add installed-package version reporting via Ethtool GDRVINFO") Reported-by: Sarveswara Rao Mygapula <sarveswararao.mygapula@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 | bnxt_en: Fix VF resource checking. | Michael Chan | 1 | -2/+9
In bnxt_sriov_enable(), we calculate to see if we have enough hardware resources to enable the requested number of VFs. The logic to check for minimum completion rings and statistics contexts is missing. Add the required checks so that VF configuration won't fail. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 | bnxt_en: Fix VF PCIe link speed and width logic. | Vasundhara Volam | 1 | -1/+1
PCIE PCIE_EP_REG_LINK_STATUS_CONTROL register is only defined in PF config space, so we must read it from the PF. Fixes: 90c4f788f6c0 ("bnxt_en: Report PCIe link speed and width during driver load") Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 | bnxt_en: Don't use rtnl lock to protect link change logic in workqueue. | Michael Chan | 3 | -12/+21
As a further improvement to the PF/VF link change logic, use a private mutex instead of the rtnl lock to protect link change logic. With the new mutex, we don't have to take the rtnl lock in the workqueue when we have to handle link related functions. If the VF and PF drivers are running on the same host and both take the rtnl lock and one is waiting for the other, it will cause timeout. This patch fixes these timeouts. Fixes: 90c694bb7181 ("bnxt_en: Fix RTNL lock usage on bnxt_update_link().") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14bnxt_en: Improve VF/PF link change logic.Michael Chan1-13/+53
Link status query firmware messages originating from the VFs are forwarded to the PF. The driver handles these interactions in a workqueue for the VF and PF. The VF driver waits for the response from the PF in the workqueue. If the PF and VF drivers are running on the same host and the work for both PF and VF is queued on the same workqueue, the VF driver may not get the response if the PF work item is queued behind it on the same workqueue. This will lead to the VF link query message timing out. To prevent this, we create a private workqueue for PFs instead of using the common workqueue. The VF query and PF response will never be on the same workqueue. Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14net: dsa: mv88e6060: fix switch MAC addressVivien Didelot1-2/+8
The 88E6060 Ethernet switch always transmits the multicast bit of the switch MAC address as a zero. It re-uses the corresponding bit 8 of the register "Switch MAC Address Register Bytes 0 & 1" for "DiffAddr". If the "DiffAddr" bit is 0, then all ports transmit the same source address. If it is set to 1, then bits 2:0 are used for the port number. The mv88e6060 driver is currently wrongly shifting the MAC address byte 0 by 9. To fix this, shift it by 8 as usual and clear its bit 0. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Woojung Huh <woojung.huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
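The register computation described above can be sketched in plain C. This is a minimal illustration, assuming MAC byte 0 occupies register bits 15:8 and byte 1 occupies bits 7:0, as the commit message implies. The helper name `mac01_reg_value()` is hypothetical and is not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build the value for "Switch MAC Address
 * Register Bytes 0 & 1" from the first two bytes of a MAC address. */
static uint16_t mac01_reg_value(const uint8_t *addr)
{
    assert(addr != NULL);

    /* Byte 0 goes into bits 15:8 (shift by 8, not 9), byte 1 into 7:0. */
    uint16_t val = (uint16_t)((addr[0] << 8) | addr[1]);

    /* Bit 0 of byte 0 (the multicast bit) lands on register bit 8,
     * which the switch re-uses as DiffAddr. Clearing it means all
     * ports transmit the same source address. */
    val &= (uint16_t)~(1u << 8);

    return val;
}
```

For example, an address starting `ff:12` would give `0xfe12` under these assumptions: byte 0 is kept apart from its lowest bit, which is forced to zero.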
2017-10-14l2tp: check ps->sock before running pppol2tp_session_ioctl()Guillaume Nault1-0/+3
When pppol2tp_session_ioctl() is called by pppol2tp_tunnel_ioctl(), the session may be unconnected. That is, it was created by pppol2tp_session_create() and hasn't been connected with pppol2tp_connect(). In this case, ps->sock is NULL, so we need to check for this case in order to avoid dereferencing a NULL pointer. Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP") Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14net: fix typo in skbuff.cWenhua Shi1-1/+1
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13net: stmmac: dwmac_lib: fix interchanged sleep/timeout values in DMA reset functionEmiliano Ingrassia1-1/+1
The DMA reset timeout, used in readl_poll_timeout, is ten times shorter than the sleep time. This patch fixes this by swapping the two values back to what they were before readl_poll_timeout was introduced. Fixes: 8a70aeca80c2 ("net: stmmac: Use readl_poll_timeout") Signed-off-by: Emiliano Ingrassia <ingrassia@epigenesys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
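To see why a timeout shorter than the sleep interval is a problem, here is a small simulation of a poll-with-sleep loop. It is a sketch under stated assumptions: `poll_sim()` is a hypothetical function that advances a simulated clock instead of sleeping, the microsecond values in the usage note are illustrative rather than the driver's, and the loop only approximates the general shape of a "check, test timeout, sleep" helper.

```c
#include <assert.h>

/* Simulated polling loop. The condition becomes true once the
 * simulated clock reaches ready_at_us. Returns the number of times
 * the condition was checked, or -1 if the timeout expired first. */
static int poll_sim(unsigned ready_at_us, unsigned sleep_us,
                    unsigned timeout_us)
{
    unsigned now_us = 0;
    int reads = 0;

    assert(sleep_us > 0);
    for (;;) {
        reads++;
        if (now_us >= ready_at_us)
            return reads;      /* condition met */
        if (now_us >= timeout_us)
            return -1;         /* gave up */
        now_us += sleep_us;    /* "sleep" between checks */
    }
}
```

With a 10000 us sleep and a 100000 us timeout, a condition that becomes true at 50000 us is seen on the sixth check. With the two values swapped, the loop samples only at 0 and 100000 us, so the timeout no longer bounds the wait and the first sleep alone overshoots it tenfold.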
2017-10-13liquidio: fix timespec64_to_ns typoArnd Bergmann1-1/+1
While experimenting with changes to the timekeeping code, I ran into a build error in the liquidio driver: drivers/net/ethernet/cavium/liquidio/lio_main.c: In function 'liquidio_ptp_settime': drivers/net/ethernet/cavium/liquidio/lio_main.c:1850:22: error: passing argument 1 of 'timespec_to_ns' from incompatible pointer type [-Werror=incompatible-pointer-types] The driver had a type mismatch since it was first merged, but this never caused problems because it is only built on 64-bit architectures that define timespec and timespec64 to the same type. If we ever want to compile-test the driver on 32-bit or change the way that 64-bit timespec64 is defined, we need to fix it, so let's just do it now. Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13Merge tag 'wireless-drivers-for-davem-2017-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-driversDavid S. Miller18-197/+232
Kalle Valo says: ==================== wireless-drivers fixes for 4.14 Nothing really special standing out, all of these are important fixes which should go to 4.14. iwlwifi * fix support for 3168 device series * fix a potential crash when using FW debugging recording * improve channel flags parsing to avoid warnings on too long traces * return -ENODATA when the temperature is not available, since the -EIO we were returning was causing fatal errors in userspace * avoid printing too many messages in dmesg when using monitor mode, since this can become very noisy and completely flood the logs brcmsmac * reduce stack usage to avoid frame size warnings with KASAN brcmfmac * add a check to avoid copying uninitialised memory rtlwifi * fix a regression with rtl8821ae starting from v4.11 where connections were frequently lost ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-12ip: update policy routing config helpStephen Hemminger1-5/+3
The kernel config help for policy routing was still pointing at an ancient document from 2000 that refers to Linux 2.1. Update it to point to something that is at least occasionally updated. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11net/ncsi: Don't limit vids based on hot_channelSamuel Mendoza-Jonas2-8/+10
Currently we drop any new VLAN ids if there are more than the current (or last used) channel can support. Most importantly this is a problem if no channel has been selected yet, resulting in a segfault. Secondly this does not necessarily reflect the capabilities of any other channels. Instead only drop a new VLAN id if we are already tracking the maximum allowed by the NCSI specification. Per-channel limits are already handled by ncsi_add_filter(), but add a message to set_one_vid() to make it obvious that the channel can not support any more VLAN ids. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11r8169: only enable PCI wakeups when WOL is activeDaniel Drake1-2/+0
rtl_init_one() currently enables PCI wakeups if the ethernet device is found to be WOL-capable. There is no need to do this when rtl8169_set_wol() will correctly enable or disable the same wakeup flag when WOL is activated/deactivated. This works around an ACPI DSDT bug which prevents the Acer laptop models Aspire ES1-533, Aspire ES1-732, PackardBell ENTE69AP and Gateway NE533 from entering S3 suspend - even when no ethernet cable is connected. On these platforms, the DSDT says that GPE08 is a wakeup source for ethernet, but this GPE fires as soon as the system goes into suspend, waking the system up immediately. Having the wakeup normally disabled avoids this issue in the default case. With this change, WOL will continue to be unusable on these platforms (it will instantly wake up if WOL is later enabled by the user) but we do not expect this to be a commonly used feature on these consumer laptops. We have separately determined that WOL works fine without any ACPI GPEs enabled during sleep, so a DSDT fix or override would be possible to make WOL work. Signed-off-by: Daniel Drake <drake@endlessm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11macsec: fix memory leaks when skb_to_sgvec failsSabrina Dubroca1-0/+2
Fixes: cda7ea690350 ("macsec: check return value of skb_to_sgvec always") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10net: call cgroup_sk_alloc() earlier in sk_clone_lock()Eric Dumazet1-2/+1
If, for some reason, the newly allocated child needs to be freed, we will call cgroup_put() (via sk_free_unlock_clone()) while the corresponding cgroup_get() was not yet done, and we will free memory too soon. Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10Revert "net: defer call to cgroup_sk_alloc()"Eric Dumazet3-6/+13
This reverts commit fbb1fb4ad415cb31ce944f65a5ca700aaf73a227. This was not the proper fix, so let's cleanly revert it so that the following patch can be carried to stable versions. sock_cgroup_ptr() callers do not expect a NULL return value. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10Merge branch 'nfp-fix-ethtool-stats-and-page-allocation'David S. Miller2-9/+19
Jakub Kicinski says: ==================== nfp: fix ethtool stats and page allocation Two fixes for net. First one makes sure we handle gather of stats on 32bit machines correctly (ouch). The second fix solves a potential NULL-deref if we fail to allocate a page with XDP running. I used Fixes: tags pointing to where the bug was introduced, but for patch 1 it has been in the driver "for ever" and fix won't backport cleanly beyond commit 325945ede6d4 ("nfp: split software and hardware vNIC statistics") which is in net. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10nfp: handle page allocation failuresJakub Kicinski1-6/+14
page_address() does not handle a NULL argument gracefully; make sure we NULL-check the page pointer before passing it to page_address(). Fixes: ecd63a0217d5 ("nfp: add XDP support in the driver") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
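The pattern behind this fix, checking the allocation result before deriving an address from it, can be shown with a user-space sketch. All names here are hypothetical stand-ins (`struct page` is a toy type, and `alloc_page_stub()`, `page_address_stub()` and `rx_alloc_buf()` are not kernel or driver functions). The sketch only mirrors the ordering of the check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a page; not the kernel's struct page. */
struct page {
    unsigned char data[4096];
};

/* Hypothetical allocator that can be told to fail. */
static struct page *alloc_page_stub(int fail)
{
    return fail ? NULL : malloc(sizeof(struct page));
}

/* Hypothetical address helper that, like the one in the commit
 * message, must never be handed a NULL page. */
static void *page_address_stub(struct page *page)
{
    assert(page != NULL);
    return page->data;
}

/* NULL-check the page first, and only then ask for its address. */
static void *rx_alloc_buf(int fail)
{
    struct page *page = alloc_page_stub(fail);

    if (!page)
        return NULL;   /* propagate the allocation failure */
    return page_address_stub(page);
}
```

A caller then only has to handle a NULL buffer, rather than a crash inside the address helper.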