path: root/net
2019-04-27  bpf: Introduce bpf sk local storage  [Martin KaFai Lau, 5 files, -0/+824]

After allowing a bpf prog to
- directly read the skb->sk ptr
- get the fullsock bpf_sock by "bpf_sk_fullsock()"
- get the bpf_tcp_sock by "bpf_tcp_sock()"
- get the listener sock by "bpf_get_listener_sock()"
- avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock" into different bpf running contexts,
this patch is another effort to make bpf's network programming more intuitive to do (together with memory and performance benefits).

When a bpf prog needs to store data for a sk, the current practice is to define a map with the usual 4-tuple (src/dst ip/port) as the key. If multiple bpf progs need to store different sk data, multiple maps have to be defined, wasting memory to store the duplicated keys (i.e. the 4-tuple) in each of the bpf maps. [ The smallest key could be the sk pointer itself, which requires some enhancement in the verifier and is a separate topic. ] Also, the bpf prog needs to clean up the elem when the sk is freed; otherwise, the bpf map becomes full and unusable quickly. The sk-free tracking can currently be done during sk state transitions (e.g. BPF_SOCK_OPS_STATE_CB). The size of the map needs to be predefined, which usually ends up over-provisioned in production. Even if the map were re-sizable, since sks naturally come and go, this potential re-size operation is arguably redundant if the data can be directly connected to the sk itself instead of proxy-ing through a bpf map.

This patch introduces sk->sk_bpf_storage to provide local storage space at the sk for bpf progs to use. The space will be allocated when the first bpf prog has created data for this particular sk. The design optimizes the bpf prog's lookup (optionally followed by an inline update). bpf_spin_lock should be used if the inline update needs to be protected.

BPF_MAP_TYPE_SK_STORAGE:
-----------------------
To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can be created to fit different bpf progs' needs. The map enforces BTF to allow printing the sk-local-storage during a system-wide sk dump (e.g. "ss -ta") in the future.

The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to lookup/update/delete a "sk-local-storage" data from a particular sk. Think of the map as a meta-data (or "type") of a "sk-local-storage". This particular "type" of "sk-local-storage" data can then be stored in any sk. The main purposes of this map are mostly:
1. Define the size of a "sk-local-storage" type.
2. Provide a similar syscall userspace API as the map (e.g. lookup/update, map-id, map-btf...etc.)
3. Keep track of all sk's storages of this "type" and clean them up when the map is freed.

sk->sk_bpf_storage:
------------------
The main lookup/update/delete is done on sk->sk_bpf_storage (which is a "struct bpf_sk_storage"). When doing a lookup, the "map" pointer is now used as the "key" to search on the sk_storage->list. The "map" pointer is actually serving as the "type" of the "sk-local-storage" that is being requested.

To allow a very fast lookup, it should be as fast as looking up an array at a stable offset. At the same time, it is not ideal to set a hard limit on the number of sk-local-storage "types" that the system can have. Hence, this patch takes a cache approach. The last search result from sk_storage->list is cached in sk_storage->cache[], which is a fixed-size array. Each "sk-local-storage" type has a stable offset into the cache[] array.
In the future, a map flag could be introduced to do cache opt-out/enforcement if it becomes necessary. The cache size is 16 (i.e. 16 types of "sk-local-storage"). Programs can share maps. On the program side, having a few bpf_progs running in the networking hotpath is already a lot, and the bpf_prog should have already consolidated the existing sock-key-ed map usage to minimize the map lookup penalty. 16 has enough runway to grow.

All sk-local-storage data will be removed from sk->sk_bpf_storage during sk destruction.

bpf_sk_storage_get() and bpf_sk_storage_delete():
------------------------------------------------
Instead of using bpf_map_(lookup|update|delete)_elem(), the bpf prog needs to use the new helpers bpf_sk_storage_get() and bpf_sk_storage_delete(). The verifier can then enforce the ARG_PTR_TO_SOCKET argument. bpf_sk_storage_get() also allows "creating" a new elem if one does not exist in the sk, via the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE. BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together, this has eliminated the potential use cases for an equivalent bpf_map_update_elem() API (for bpf_prog) in this patch.

Misc notes:
----------
1. map_get_next_key is not supported. From the userspace syscall perspective, the map has the socket fd as the key while the map can be shared by pinned-file or map-id. Since btf is enforced, the existing "ss" could be enhanced to pretty print the local-storage. Supporting a kernel-defined btf with the 4-tuple as the return key could be explored later also.
2. The sk->sk_lock cannot be acquired. Atomic operations are used instead, e.g. cmpxchg is done on the sk->sk_bpf_storage ptr. Please refer to the source code comments for the details of the synchronization cases and considerations.
3. The mem is charged to sk->sk_omem_alloc as the sk filter does.

Benchmark:
---------
Here is the benchmark data collected by turning on the "kernel.bpf_stats_enabled" sysctl. Two bpf progs are tested: one with the usual bpf hashmap (max_entries = 8192) with the sk ptr as the key (the verifier is modified to support the sk ptr as the key, which should have shortened the key lookup time), and another with the new BPF_MAP_TYPE_SK_STORAGE. Both store a "u32 cnt", do a lookup on "egress_skb/cgroup" for each egress skb and then bump the cnt. netperf is used to drive data with 4096 connected UDP sockets.
BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run):
  27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b
      run_time_ns 58280107540  run_cnt 381347633
      loaded_at 2019-04-15T13:46:39-0700  uid 0
      xlated 344B  jited 258B  memlock 4096B  map_ids 16  btf_id 5

BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run):
  30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6
      run_time_ns 25617093319  run_cnt 390989739
      loaded_at 2019-04-15T13:47:54-0700  uid 0
      xlated 168B  jited 156B  memlock 4096B  map_ids 17  btf_id 6

[The original message ends with an ASCII diagram, collapsed during extraction, showing how the objects are organized: each sk's *sk_bpf_storage points to a bpf_sk_storage whose list links that sk's elems (snode, data, map_node), while each bpf_map's list links the map_node of its elems across all sks.]

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
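For illustration only (not part of the patch), here is a minimal BPF C sketch of how a program might declare a BPF_MAP_TYPE_SK_STORAGE map and use the new helper; the struct name, map name and map declaration style are assumptions made for the example, not taken from the patch:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* One counter stored per socket; names are invented for the sketch. */
struct sk_cnt {
    __u32 cnt;
};

struct {
    __uint(type, BPF_MAP_TYPE_SK_STORAGE);
    __uint(map_flags, BPF_F_NO_PREALLOC);   /* sk-storage is not preallocated */
    __type(key, int);
    __type(value, struct sk_cnt);
} sk_cnt_map SEC(".maps");

SEC("cgroup_skb/egress")
int count_egress(struct __sk_buff *skb)
{
    struct bpf_sock *sk = skb->sk;
    struct sk_cnt *val;

    if (!sk)
        return 1;
    sk = bpf_sk_fullsock(sk);
    if (!sk)
        return 1;

    /* Create the per-sk elem on first use, then bump it in place. */
    val = bpf_sk_storage_get(&sk_cnt_map, sk, NULL,
                             BPF_SK_STORAGE_GET_F_CREATE);
    if (val)
        __sync_fetch_and_add(&val->cnt, 1);
    return 1;
}

char _license[] SEC("license") = "GPL";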
2019-04-26  selftests: bpf: test writable buffers in raw tps  [Matt Mullins, 1 file, -0/+4]

This tests that:
 * a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE cannot be attached if it uses either:
   * a variable offset to the tracepoint buffer, or
   * an offset beyond the size of the tracepoint buffer
 * a tracer can modify the buffer provided when attached to a writable tracepoint in bpf_prog_test_run

Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25  bpf: support BPF_PROG_QUERY for BPF_FLOW_DISSECTOR attach_type  [Stanislav Fomichev, 1 file, -0/+39]

target_fd is the target namespace. If there is a flow dissector BPF program attached to that namespace, its (single) id is returned.

v5: * drop net ref right after rcu unlock (Daniel Borkmann)
v4: * add missing put_net (Jann Horn)
v3: * add missing inline to skb_flow_dissector_prog_query static def (kbuild test robot <lkp@intel.com>)
v2: * don't sleep in rcu critical section (Jakub Kicinski)
    * check input prog_cnt (exit early)

Cc: Jann Horn <jannh@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
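As a hedged illustration (the netns path and function name below are only examples), a userspace query could go through libbpf's bpf_prog_query() wrapper:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>

/* Sketch: print the id of the flow dissector program attached to the
 * network namespace referred to by a netns fd, if there is one. */
static int query_flow_dissector(const char *netns_path /* e.g. "/proc/self/ns/net" */)
{
    __u32 prog_ids[1] = {};
    __u32 prog_cnt = 1;
    int err, target_fd = open(netns_path, O_RDONLY);

    if (target_fd < 0)
        return -1;
    err = bpf_prog_query(target_fd, BPF_FLOW_DISSECTOR, 0, NULL,
                         prog_ids, &prog_cnt);
    if (!err && prog_cnt)
        printf("flow dissector prog id: %u\n", prog_ids[0]);
    close(target_fd);
    return err;
}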
2019-04-24  bpf: update skb->protocol in bpf_skb_net_grow  [Willem de Bruijn, 1 file, -0/+8]

Some tunnels, like sit, change the network protocol of the packet. If so, update skb->protocol to match the new type.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23  bpf/flow_dissector: don't adjust nhoff by ETH_HLEN in BPF_PROG_TEST_RUN  [Stanislav Fomichev, 1 file, -3/+0]
Now that we use skb-less flow dissector let's return true nhoff and thoff. We used to adjust them by ETH_HLEN because that's how it was done in the skb case. For VLAN tests that looks confusing: nhoff is pointing to vlan parts :-\ Warning, this is an API change for BPF_PROG_TEST_RUN! Feel free to drop if you think that it's too late at this point to fix it. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23  net: pass net_device argument to the eth_get_headlen  [Stanislav Fomichev, 1 file, -2/+3]
Update all users of eth_get_headlen to pass network device, fetch network namespace from it and pass it down to the flow dissector. This commit is a noop until administrator inserts BPF flow dissector program. Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: intel-wired-lan@lists.osuosl.org Cc: Yisen Zhuang <yisen.zhuang@huawei.com> Cc: Salil Mehta <salil.mehta@huawei.com> Cc: Michael Chan <michael.chan@broadcom.com> Cc: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
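As a hedged sketch of what the call-site change looks like in a driver's rx path (the wrapper function and variable names are invented for illustration):

/* Before: len = eth_get_headlen(data, max_len);
 * After: the receiving net_device is passed through so the flow dissector
 * can look up a BPF program attached to dev_net(dev). */
static unsigned int rx_header_pull_len(struct net_device *netdev,
                                       void *data, unsigned int max_len)
{
    return eth_get_headlen(netdev, data, max_len);
}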
2019-04-23  flow_dissector: handle no-skb use case  [Stanislav Fomichev, 1 file, -27/+25]

When called without an skb, gather all required data from __skb_flow_dissect's arguments and use the recently introduced no-skb mode of the bpf flow dissector. Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23  net: plumb network namespace into __skb_flow_dissect  [Stanislav Fomichev, 2 files, -12/+20]
This new argument will be used in the next patches for the eth_get_headlen use case. eth_get_headlen calls flow dissector with only data (without skb) so there is currently no way to pull attached BPF flow dissector program. With this new argument, we can amend the callers to explicitly pass network namespace so we can use attached BPF program. Signed-off-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23  bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode  [Stanislav Fomichev, 1 file, -30/+17]
Now that we have bpf_flow_dissect which can work on raw data, use it when doing BPF_PROG_TEST_RUN for flow dissector. Simplifies bpf_prog_test_run_flow_dissector and allows us to test no-skb mode. Note, that previously, with bpf_flow_dissect_skb we used to call eth_type_trans which pulled L2 (ETH_HLEN) header and we explicitly called skb_reset_network_header. That means flow_keys->nhoff would be initialized to 0 (skb_network_offset) in init_flow_keys. Now we call bpf_flow_dissect with nhoff set to ETH_HLEN and need to undo it once the dissection is done to preserve the existing behavior. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23  flow_dissector: switch kernel context to struct bpf_flow_dissector  [Stanislav Fomichev, 3 files, -52/+102]
struct bpf_flow_dissector has a small subset of sk_buff fields that flow dissector BPF program is allowed to access and an optional pointer to real skb. Real skb is used only in bpf_skb_load_bytes helper to read non-linear data. The real motivation for this is to be able to call flow dissector from eth_get_headlen context where we don't have an skb and need to dissect raw bytes. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
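A hedged sketch of the new context, with the field list approximated from this description (the header added by the patch is authoritative):

/* Only what a BPF flow dissector program may touch, plus an optional
 * backing skb used solely by bpf_skb_load_bytes() for non-linear data. */
struct bpf_flow_dissector {
    struct bpf_flow_keys *flow_keys;
    const struct sk_buff *skb;      /* NULL in the no-skb (raw data) mode */
    void *data;
    void *data_end;
};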
2019-04-22  bridge: Fix possible use-after-free when deleting bridge port  [Ido Schimmel, 1 file, -1/+2]

When a bridge port is being deleted, do not dereference it later in br_vlan_port_event() as it can result in a use-after-free [1] if the RCU callback was executed before invoking the function.

[1]
[ 129.638551] ==================================================================
[ 129.646904] BUG: KASAN: use-after-free in br_vlan_port_event+0x53c/0x5fd
[ 129.654406] Read of size 8 at addr ffff8881e4aa1ae8 by task ip/483
[ 129.663008] CPU: 0 PID: 483 Comm: ip Not tainted 5.1.0-rc5-custom-02265-ga946bd73daac #1383
[ 129.672359] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[ 129.682484] Call Trace:
[ 129.685242]  dump_stack+0xa9/0x10e
[ 129.689068]  print_address_description.cold.2+0x9/0x25e
[ 129.694930]  kasan_report.cold.3+0x78/0x9d
[ 129.704420]  br_vlan_port_event+0x53c/0x5fd
[ 129.728300]  br_device_event+0x2c7/0x7a0
[ 129.741505]  notifier_call_chain+0xb5/0x1c0
[ 129.746202]  rollback_registered_many+0x895/0xe90
[ 129.793119]  unregister_netdevice_many+0x48/0x210
[ 129.803384]  rtnl_delete_link+0xe1/0x140
[ 129.815906]  rtnl_dellink+0x2a3/0x820
[ 129.844166]  rtnetlink_rcv_msg+0x397/0x910
[ 129.868517]  netlink_rcv_skb+0x137/0x3a0
[ 129.882013]  netlink_unicast+0x49b/0x660
[ 129.900019]  netlink_sendmsg+0x755/0xc90
[ 129.915758]  ___sys_sendmsg+0x761/0x8e0
[ 129.966315]  __sys_sendmsg+0xf0/0x1c0
[ 129.988918]  do_syscall_64+0xa4/0x470
[ 129.993032]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 129.998696] RIP: 0033:0x7ff578104b58
...
[ 130.073811] Allocated by task 479:
[ 130.077633]  __kasan_kmalloc.constprop.5+0xc1/0xd0
[ 130.083008]  kmem_cache_alloc_trace+0x152/0x320
[ 130.088090]  br_add_if+0x39c/0x1580
[ 130.092005]  do_set_master+0x1aa/0x210
[ 130.096211]  do_setlink+0x985/0x3100
[ 130.100224]  __rtnl_newlink+0xc52/0x1380
[ 130.104625]  rtnl_newlink+0x6b/0xa0
[ 130.108541]  rtnetlink_rcv_msg+0x397/0x910
[ 130.113136]  netlink_rcv_skb+0x137/0x3a0
[ 130.117538]  netlink_unicast+0x49b/0x660
[ 130.121939]  netlink_sendmsg+0x755/0xc90
[ 130.126340]  ___sys_sendmsg+0x761/0x8e0
[ 130.130645]  __sys_sendmsg+0xf0/0x1c0
[ 130.134753]  do_syscall_64+0xa4/0x470
[ 130.138864]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 130.146195] Freed by task 0:
[ 130.149421]  __kasan_slab_free+0x125/0x170
[ 130.154016]  kfree+0xf3/0x310
[ 130.157349]  kobject_put+0x1a8/0x4c0
[ 130.161363]  rcu_core+0x859/0x19b0
[ 130.165175]  __do_softirq+0x250/0xa26
[ 130.170956] The buggy address belongs to the object at ffff8881e4aa1ae8 which belongs to the cache kmalloc-1k of size 1024
[ 130.184972] The buggy address is located 0 bytes inside of 1024-byte region [ffff8881e4aa1ae8, ffff8881e4aa1ee8)

Fixes: 9c0ec2e7182a ("bridge: support binding vlan dev link state to vlan member bridge ports")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Cc: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Mike Manning <mmanning@vyatta.att-mail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22  net: devlink: Add extack to shared buffer operations  [Ido Schimmel, 1 file, -9/+13]
Add extack to shared buffer set operations, so that meaningful error messages could be propagated to the user. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22  net: strparser: make it explicitly non-modular  [Paul Gortmaker, 1 file, -10/+4]

The Kconfig currently controlling compilation of this code is:

net/strparser/Kconfig:config STREAM_PARSER
net/strparser/Kconfig:  def_bool n

...meaning that it currently is not being built as a module by anyone. Let's remove the modular code that is essentially orphaned, so that when reading the driver there is no doubt it is builtin-only.

Since module_init translates to device_initcall in the non-modular case, the init ordering remains unchanged with this commit. For clarity, we change the fcn name mod_init to dev_init at the same time.

We replace module.h with init.h and export.h; the latter since this file exports some syms.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22  net: bpfilter: dont use module_init in non-modular code  [Paul Gortmaker, 1 file, -2/+1]

The Kconfig controlling this code is:

bpfilter/Kconfig:menuconfig BPFILTER
bpfilter/Kconfig:       bool "BPF based packet filtering framework (BPFILTER)"

Since it isn't a module, we shouldn't use module_init(). Instead we use device_initcall() - which is exactly what module_init() defaults to for non-modular code/builds.

We don't remove <linux/module.h> from the includes since this file does a request_module() and hence is a valid user of that header file, even though it is not modular itself.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22  cgroup: net: remove left over MODULE_LICENSE tag  [Paul Gortmaker, 1 file, -2/+0]

The Kconfig currently controlling compilation of this code is:

net/Kconfig:config CGROUP_NET_PRIO
net/Kconfig:    bool "Network priority cgroup"

...meaning that it currently is not being built as a module by anyone, as module support was discontinued in 2014.

We delete the MODULE_LICENSE tag since all that information is already contained at the top of the file in the comments. We don't delete module.h from the includes since it was no longer there to begin with.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rosen, Rami" <rami.rosen@intel.com>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22  net: Rename net/nexthop.h net/rtnh.h  [David Ahern, 6 files, -6/+6]
The header contains rtnh_ macros so rename the file accordingly. Allows a later patch to use the nexthop.h name for the new nexthop code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  [David S. Miller, 3 files, -25/+108]

Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-04-22

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) allow stack/queue helpers from more bpf program types, from Alban.
2) allow parallel verification of root bpf programs, from Alexei.
3) introduce bpf sysctl hook for trusted root cases, from Andrey.
4) recognize var/datasec in btf deduplication, from Andrii.
5) cpumap performance optimizations, from Jesper.
6) verifier prep for alu32 optimization, from Jiong.
7) libbpf xsk cleanup, from Magnus.
8) other various fixes and cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-21  net: ax25: fix misuse of %x  [Fuqian Huang, 1 file, -2/+2]
Pointers should be printed with %p or %px rather than cast to long type and printed with %8.8lx. Change %8.8lx to %p to print the pointer. Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19  tcp: properly reset skb->truesize for tx recycling  [Eric Dumazet, 1 file, -1/+1]
tcp sendmsg() and sendpage() normally advance skb->data_len and skb->truesize by the payload added to an skb. But sendmsg(fd, ..., MSG_ZEROCOPY) has to account for whole pages, even if a single byte of payload is used in the page. This means that we can not assume skb->truesize can be adjusted by skb->data_len. We must instead overwrite its value. Otherwise skb->truesize is too big and can hit socket sndbuf limit, especially if the skb is recycled multiple times :/ Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19  tipc: introduce new socket option TIPC_SOCK_RECVQ_USED  [Tung Nguyen, 1 file, -0/+3]

When using TIPC_SOCK_RECVQ_DEPTH for getsockopt(), it returns the number of buffers in the receive socket buffer, which is not so helpful for user space applications. This commit introduces the new option TIPC_SOCK_RECVQ_USED, which returns the current allocated bytes of the receive socket buffer. This helps user space applications dimension their buffer usage to avoid buffer overload issues.

Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
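A hedged userspace sketch of reading the new option next to the existing TIPC_SOCK_RECVQ_DEPTH (error handling trimmed; the exact option value comes from <linux/tipc.h>):

#include <stdio.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/* Sketch: bytes currently allocated in the receive buffer vs. buffer count. */
static void print_rcvq_usage(int sk)
{
    unsigned int used = 0, depth = 0;
    socklen_t len = sizeof(used);

    if (!getsockopt(sk, SOL_TIPC, TIPC_SOCK_RECVQ_USED, &used, &len))
        printf("receive queue: %u bytes allocated\n", used);
    len = sizeof(depth);
    if (!getsockopt(sk, SOL_TIPC, TIPC_SOCK_RECVQ_DEPTH, &depth, &len))
        printf("receive queue: %u buffers\n", depth);
}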
2019-04-19  net: socket: implement 64-bit timestamps  [Arnd Bergmann, 1 file, -6/+18]

The 'timeval' and 'timespec' data structures used for socket timestamps are going to be redefined in user space based on 64-bit time_t in future versions of the C library to deal with the y2038 overflow problem, which breaks the ABI definition.

Unlike many modern ioctl commands, SIOCGSTAMP and SIOCGSTAMPNS do not use the _IOR() macro to encode the size of the transferred data, so it remains ambiguous whether the application uses the old or new layout.

The best workaround I could find is rather ugly: we redefine the command code based on the size of the respective data structure with a ternary operator. This lets it get evaluated as late as possible, hopefully after that structure is visible to the caller. We cannot use an #ifdef here, because linux/sockios.h might have been included before any libc header that could determine the size of time_t.

The ioctl implementation now interprets the new command codes as always referring to the 64-bit structure on all architectures, while the old architecture-specific command code still refers to the old architecture-specific layout. The new command number is only used when they are actually different.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
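A hedged sketch of the uapi trick described above; the macro names follow the description, but the exact header contents and values may differ from what is shown here:

/* Conceptual sketch: the generic SIOCGSTAMP resolves to the old or the new
 * command code depending on how large 'struct timeval' is at the point of
 * use, so a 64-bit time_t libc transparently gets the 64-bit layout while
 * existing binaries keep the old one. */
#define SIOCGSTAMP_OLD  0x8906                          /* assumed value */
#define SIOCGSTAMP_NEW  _IOR(0x89, 0x06, long long[2])  /* assumed encoding */

#define SIOCGSTAMP \
        ((sizeof(struct timeval) == 8) ? SIOCGSTAMP_OLD : SIOCGSTAMP_NEW)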
2019-04-19  net: rework SIOCGSTAMP ioctl handling  [Arnd Bergmann, 30 files, -227/+71]

The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many socket protocol handlers, and all of those end up calling the same sock_get_timestamp()/sock_get_timestampns() helper functions, which results in a lot of duplicate code.

With the introduction of 64-bit time_t on 32-bit architectures, this gets worse, as we then need four different ioctl commands in each socket protocol implementation. To simplify that, let's add a new .gettstamp() operation in struct proto_ops, and move the ioctl implementation into the common sock_ioctl()/compat_sock_ioctl_trans() functions that these all go through. We can reuse the sock_get_timestamp() implementation, but generalize it so it can deal with both native and compat mode, as well as timeval and timespec structures.

Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
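A hedged sketch of the new operation; the prototype and helper name are inferred from the description above and should be checked against the patch:

/* One generic implementation handles timeval vs timespec and native vs
 * 32-bit/compat layouts, so protocol ioctl handlers can drop their own
 * SIOCGSTAMP code and simply point .gettstamp at it. */
int sock_gettstamp(struct socket *sock, void __user *userstamp,
                   bool timeval, bool time32);

static const struct proto_ops example_proto_ops = {
    /* ... existing ops ... */
    .gettstamp = sock_gettstamp,
};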
2019-04-19  bridge: update vlan dev link state for bridge netdev changes  [Mike Manning, 1 file, -3/+47]
If vlan bridge binding is enabled, then the link state of a vlan device that is an upper device of the bridge tracks the state of bridge ports that are members of that vlan. But this can only be done when the link state of the bridge is up. If it is down, then the link state of the vlan devices must also be down. This is to maintain existing behavior for when STP is enabled and there are no live ports, in which case the link state for the bridge and any vlan devices is down. Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19  bridge: update vlan dev state when port added to or deleted from vlan  [Mike Manning, 1 file, -0/+19]
If vlan bridge binding is enabled, then the link state of a vlan device that is an upper device of the bridge should track the state of bridge ports that are members of that vlan. So if a bridge port becomes or stops being a member of a vlan, then update the link state of the vlan device if necessary. Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19  bridge: support binding vlan dev link state to vlan member bridge ports  [Mike Manning, 3 files, -4/+174]

In the case of vlan filtering on bridges, the bridge may also have the corresponding vlan devices as upper devices. A vlan bridge binding mode is added to allow the link state of the vlan device to track only the state of the subset of bridge ports that are also members of the vlan, rather than that of all bridge ports. This mode is set with a vlan flag rather than a bridge sysfs so that the 8021q module is aware that it should not set the link state for the vlan device.

If bridge vlan is configured, the bridge device event handling results in the link state for an upper device being set, if it is a vlan device with the vlan bridge binding mode enabled. This also sets a vlan_bridge_binding flag so that subsequent UP/DOWN/CHANGE events for the ports in that bridge result in a link state update of the vlan device if required.

The link state of the vlan device is up if there is at least one bridge port that is a vlan member that is admin & oper up, otherwise its oper state is IF_OPER_LOWERLAYERDOWN.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19  vlan: do not transfer link state in vlan bridge binding mode  [Mike Manning, 2 files, -11/+26]
In vlan bridge binding mode, the link state is no longer transferred from the lower device. Instead it is set by the bridge module according to the state of bridge ports that are members of the vlan. Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19  vlan: support binding link state to vlan member bridge ports  [Mike Manning, 2 files, -2/+4]
In the case of vlan filtering on bridges, the bridge may also have the corresponding vlan devices as upper devices. Currently the link state of vlan devices is transferred from the lower device. So this is up if the bridge is in admin up state and there is at least one bridge port that is up, regardless of the vlan that the port is a member of. The link state of the vlan device may need to track only the state of the subset of ports that are also members of the corresponding vlan, rather than that of all ports. Add a flag to specify a vlan bridge binding mode, by which the link state is no longer automatically transferred from the lower device, but is instead determined by the bridge ports that are members of the vlan. Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18  net/sched: taprio: fix build without 64bit div  [Jakub Kicinski, 1 file, -6/+11]

Recent changes to taprio did not use the correct div64 helpers, leading to:

net/sched/sch_taprio.o: In function `taprio_dequeue':
sch_taprio.c:(.text+0x34a): undefined reference to `__divdi3'
net/sched/sch_taprio.o: In function `advance_sched':
sch_taprio.c:(.text+0xa0b): undefined reference to `__divdi3'
net/sched/sch_taprio.o: In function `taprio_init':
sch_taprio.c:(.text+0x1450): undefined reference to `__divdi3'
/home/jkicinski/devel/linux/Makefile:1032: recipe for target 'vmlinux' failed

Use math64 helpers.

Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18  l2tp: fix set but not used variable  [Jakub Kicinski, 1 file, -2/+1]

GCC complains:

net/l2tp/l2tp_ppp.c: In function ‘pppol2tp_ioctl’:
net/l2tp/l2tp_ppp.c:1073:6: warning: variable ‘val’ set but not used [-Wunused-but-set-variable]
  int val;
      ^~~

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18  ipv6: Add rate limit mask for ICMPv6 messages  [Stephen Suryaputra, 2 files, -9/+31]

To make ICMPv6 closer to ICMPv4, add a ratemask parameter. Since the ICMP message types use larger numeric values, a simple bitmask doesn't fit, so I use a large bitmap. The input and output are in the form of a list of ranges.

Set the default to rate limit all error messages but Packet Too Big. For Packet Too Big, use the ratemask instead of the hard-coded behavior. There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow() aren't called; this patch only adds them to icmpv6_echo_reply().

Rate limiting error messages is mandated by RFC 4443, but RFC 4890 says that it is also acceptable to rate limit informational messages. Thus, I removed the current hard-coded behavior of icmpv6_mask_allow() that doesn't rate limit informational messages.

v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL isn't defined, expand the description in ip-sysctl.txt and remove unnecessary conditional before kfree().
v3: Inline the bitmap instead of dynamically allocating it. A pointer to it is still needed because of the way proc_do_large_bitmap works.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
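A hedged illustration of driving the new knob from a C program; the proc path and the range syntax are taken from the description above and should be treated as assumptions:

#include <stdio.h>

/* Sketch: rate limit all ICMPv6 error messages except Packet Too Big
 * (type 2), expressed as the list of ranges the new sysctl accepts. */
static int set_icmpv6_ratemask(const char *ranges /* e.g. "0-1,3-127" */)
{
    FILE *f = fopen("/proc/sys/net/ipv6/icmp/ratemask", "w"); /* assumed path */

    if (!f)
        return -1;
    fprintf(f, "%s\n", ranges);
    return fclose(f);
}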
2019-04-17  net ipv6: Prevent neighbor add if protocol is disabled on device  [David Ahern, 2 files, -0/+22]

Disabling IPv6 on an interface removes existing entries but nothing prevents new entries from being manually added. To that end, add a new neigh_table operation, allow_add, that is called on RTM_NEWNEIGH to see if neighbor entries are allowed on a given device. If IPv6 is disabled on the device, allow_add returns false and passes a message back to the user via extack.

    $ echo 1 > /proc/sys/net/ipv6/conf/eth1/disable_ipv6
    $ ip -6 neigh add fe80::4c88:bff:fe21:2704 dev eth1 lladdr de:ad:be:ef:01:01
    Error: IPv6 is disabled on this device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Add fib6_type and fib6_flags to fib6_result  [David Ahern, 2 files, -38/+49]

Add the fib6_flags and fib6_type to fib6_result. Update the lookup helpers to set them and update post-fib-lookup users to use the version from the result. This allows nexthop objects to have a blackhole nexthop.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to fib lookups  [David Ahern, 5 files, -43/+36]
Change fib6_lookup and fib6_table_lookup to take a fib6_result and set f6i and nh rather than returning a fib6_info. For now both always return 0. A later patch set can make these more like the IPv4 counterparts and return EINVAL, EACCESS, etc based on fib6_type. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to fib6_table_lookup tracepoint  [David Ahern, 1 file, -3/+3]
Change fib6_table_lookup tracepoint to take the fib6_result and use the fib6_info and fib6_nh from it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to rt6_select and find_rr_leaf  [David Ahern, 1 file, -39/+43]
Pass fib6_result to rt6_select. Instead of returning the fib entry, it will set f6i and nh based on the lookup. find_rr_leaf is changed to remove the match option in favor of taking fib6_result and having __find_rr_leaf set f6i in the result. In the process, update fib6_info references in __find_rr_leaf to f6i names. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to rt6_device_match  [David Ahern, 1 file, -19/+30]
Pass fib6_result to rt6_device_match with f6i set. rt6_device_match updates f6i in the result if it finds a better match and sets nh. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to ip6_mtu_from_fib6 and fib6_mtu  [David Ahern, 3 files, -14/+19]
Change ip6_mtu_from_fib6 and fib6_mtu to take a fib6_result over a fib6_info. Update both to use the fib6_nh from fib6_result. Since the signature of ip6_mtu_from_fib6 is already changing, add const to daddr and saddr. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to rt6_insert_exception  [David Ahern, 1 file, -16/+17]
Update rt6_insert_exception to take a fib6_result over a fib6_info. Change ort to f6i from the fib6_result and rename to better reflect what it references (a fib6_info). Since this function is already getting changed, update the comments to reference fib6_info variables rather than the older rt6_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to ip6_rt_get_dev_rcu and ip6_rt_copy_init  [David Ahern, 1 file, -22/+27]

Now that all callers are updated to have a fib6_result, pass it down to ip6_rt_get_dev_rcu, ip6_rt_copy_init, and ip6_rt_init_dst. In the process, change ort to f6i in ip6_rt_copy_init to make it clear it is a reference to a fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to pcpu route functions  [David Ahern, 1 file, -13/+14]

Update ip6_rt_pcpu_alloc, rt6_get_pcpu_route and rt6_make_pcpu_route to take a fib6_result over a fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to ip6_create_rt_rcu  [David Ahern, 1 file, -16/+21]
Change ip6_create_rt_rcu to take fib6_result over a fib6_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to ip6_rt_cache_alloc  [David Ahern, 1 file, -22/+26]

Change ip6_rt_cache_alloc to take a fib6_result over a fib6_info. Since ip6_rt_cache_alloc is the only caller, update the rt6_is_gw_or_nonexthop helper to take fib6_result as well.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Pass fib6_result to rt6_find_cached_rt  [David Ahern, 1 file, -14/+21]

Simplify rt6_find_cached_rt for the fast-path cases and pass fib6_result to rt6_find_cached_rt. Rename the local return variable to ret to maintain consistency with the fib6_result name. Update the comment in rt6_find_cached_rt to reference the new fib6_info names rather than the old names from when fib entries were an rt6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  ipv6: Rename fib6_multipath_select and pass fib6_result  [David Ahern, 4 files, -54/+56]

Add 'struct fib6_result' to hold the fib entry and fib6_nh from a fib lookup as separate entries, similar to what IPv4 now has with fib_result.

Rename fib6_multipath_select to fib6_select_path, pass fib6_result to it, and set f6i and nh in the result once a path selection is done. Call fib6_select_path unconditionally for path selection, which means moving the sibling and oif check to fib6_select_path. To handle the two different call paths (two callers only call multipath_select if flowi6_oif == 0 and the other always calls it), add a new have_oif_match argument that controls the sibling walk if relevant.

Update callers of fib6_multipath_select accordingly and have them use the fib6_info and fib6_nh from the result. This is needed for multipath nexthop objects where a single f6i can point to multiple fib6_nh (similar to IPv4).

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
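A hedged sketch of the result structure as it looks once the whole series above is applied; the field set is inferred from these commit messages rather than copied from the header:

/* Outcome of an IPv6 fib lookup: the fib entry plus the nexthop actually
 * selected, so callers stop assuming a 1:1 fib6_info-to-fib6_nh mapping. */
struct fib6_result {
    struct fib6_nh   *nh;
    struct fib6_info *f6i;
    u32               fib6_flags;  /* added by "ipv6: Add fib6_type and fib6_flags ..." */
    u8                fib6_type;
};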
2019-04-17  net: core: introduce build_skb_around  [Jesper Dangaard Brouer, 1 file, -19/+52]

The function build_skb() also has the responsibility to allocate and clear the SKB structure. Introduce a new function build_skb_around() that moves the responsibility of allocation and clearing to the caller. This allows the caller to use kmem_cache (slab/slub) bulk allocation APIs.

The next patch uses this function combined with kmem_cache_alloc_bulk.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
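A hedged sketch of the intended usage together with bulk slab allocation (the batch size, function name and error handling are illustrative; the memset reflects the clearing responsibility that moves to the caller):

#define RX_BULK 8

static void rx_bulk_build(void *frames[RX_BULK], unsigned int frag_size)
{
    struct sk_buff *skbs[RX_BULK];
    unsigned int i, n;

    /* Allocate a batch of SKB shells from the slab in one call ... */
    n = kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC, RX_BULK,
                              (void **)skbs);
    for (i = 0; i < n; i++) {
        struct sk_buff *skb;

        /* ... clear each shell (now the caller's job) ... */
        memset(skbs[i], 0, offsetof(struct sk_buff, tail));
        /* ... and wire it to an existing data buffer. */
        skb = build_skb_around(skbs[i], frames[i], frag_size);
        if (!skb)
            continue;
        /* hand skb up the stack */
    }
}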
2019-04-17  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  [David S. Miller, 58 files, -258/+460]
Conflict resolution of af_smc.c from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  [Linus Torvalds, 56 files, -220/+451]

Pull networking fixes from David Miller:

 1) Handle init flow failures properly in iwlwifi driver, from Shahar S Matityahu.
 2) mac80211 TXQs need to be unscheduled on powersave start, from Felix Fietkau.
 3) SKB memory accounting fix in A-MDSU aggregation, from Felix Fietkau.
 4) Increase RCU lock hold time in mlx5 FPGA code, from Saeed Mahameed.
 5) Avoid checksum complete with XDP in mlx5, also from Saeed.
 6) Fix netdev feature clobbering in ibmvnic driver, from Thomas Falcon.
 7) Partial sent TLS record leak fix from Jakub Kicinski.
 8) Reject zero size iova range in vhost, from Jason Wang.
 9) Allow pending work to complete before clcsock release from Karsten Graul.
10) Fix XDP handling max MTU in thunderx, from Matteo Croce.
11) A lot of protocols look at the sa_family field of a sockaddr before validating its length is large enough, from Tetsuo Handa.
12) Don't write to free'd pointer in qede ptp error path, from Colin Ian King.
13) Have to recompile IP options in ipv4_link_failure because it can be invoked from ARP, from Stephen Suryaputra.
14) Doorbell handling fixes in qed from Denis Bolotin.
15) Revert net-sysfs kobject register leak fix, it causes new problems. From Wang Hai.
16) Spectre v1 fix in ATM code, from Gustavo A. R. Silva.
17) Fix put of BROPT_VLAN_STATS_PER_PORT in bridging code, from Nikolay Aleksandrov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (111 commits)
  socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW
  tcp: tcp_grow_window() needs to respect tcp_space()
  ocelot: Clean up stats update deferred work
  ocelot: Don't sleep in atomic context (irqs_disabled())
  net: bridge: fix netlink export of vlan_stats_per_port option
  qed: fix spelling mistake "faspath" -> "fastpath"
  tipc: set sysctl_tipc_rmem and named_timeout right range
  tipc: fix link established but not in session
  net: Fix missing meta data in skb with vlan packet
  net: atm: Fix potential Spectre v1 vulnerabilities
  net/core: work around section mismatch warning for ptp_classifier
  net: bridge: fix per-port af_packet sockets
  bnx2x: fix spelling mistake "dicline" -> "decline"
  route: Avoid crash from dereferencing NULL rt->from
  MAINTAINERS: normalize Woojung Huh's email address
  bonding: fix event handling for stacked bonds
  Revert "net-sysfs: Fix memory leak in netdev_register_kobject"
  rtnetlink: fix rtnl_valid_stats_req() nlmsg_len check
  qed: Fix the DORQ's attentions handling
  qed: Fix missing DORQ attentions
  ...
2019-04-16  socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW  [Arnd Bergmann, 1 file, -2/+2]
It looks like the new socket options only work correctly for native execution, but in case of compat mode fall back to the old behavior as we ignore the 'old_timeval' flag. Rework so we treat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW the same way in compat and native 32-bit mode. Cc: Deepa Dinamani <deepa.kernel@gmail.com> Fixes: a9beb86ae6e5 ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16  tcp: tcp_grow_window() needs to respect tcp_space()  [Eric Dumazet, 1 file, -5/+5]

For some reason, tcp_grow_window() correctly tests if enough room is present before attempting to increase tp->rcv_ssthresh, but does not prevent it from growing past tcp_space(). This is causing hard-to-debug issues, like failing the (__tcp_select_window(sk) >= tp->rcv_wnd) test in __tcp_ack_snd_check(), causing ACK delays and possibly slow flows.

Depending on tcp_rmem[2], MTU, and the skb->len/skb->truesize ratio, we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000" after about 60 round trips, when the active side no longer sends immediate acks.

This bug predates git history.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16  net: bridge: fix netlink export of vlan_stats_per_port option  [Nikolay Aleksandrov, 1 file, -1/+1]
Since the introduction of the vlan_stats_per_port option the netlink export of it has been broken since I made a typo and used the ifla attribute instead of the bridge option to retrieve its state. Sysfs export is fine, only netlink export has been affected. Fixes: 9163a0fc1f0c0 ("net: bridge: add support for per-port vlan stats") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>