path: root/net
Age | Commit message | Author | Files | Lines (-/+)
2017-07-01 | net: convert sock.sk_wmem_alloc from atomic_t to refcount_t | Reshetova, Elena | 32 | -66/+65
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This helps avoid accidental refcount overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
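For readers unfamiliar with the pattern, a minimal sketch of what such a conversion typically looks like; the struct and helpers below are illustrative only, not code from the patch:

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct example_obj {                    /* hypothetical refcounted object */
            refcount_t refcnt;              /* was: atomic_t refcnt; */
    };

    static void example_obj_init(struct example_obj *obj)
    {
            refcount_set(&obj->refcnt, 1);  /* was: atomic_set(&obj->refcnt, 1); */
    }

    static void example_obj_hold(struct example_obj *obj)
    {
            refcount_inc(&obj->refcnt);     /* saturates and warns instead of wrapping */
    }

    static void example_obj_put(struct example_obj *obj)
    {
            if (refcount_dec_and_test(&obj->refcnt))  /* was: atomic_dec_and_test() */
                    kfree(obj);
    }

The key difference is that refcount_inc() saturates and warns on overflow instead of silently wrapping to zero, which is what turns an overflow bug into a warning rather than a use-after-free.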
2017-07-01 | net: convert sk_buff_fclones.fclone_ref from atomic_t to refcount_t | Reshetova, Elena | 1 | -5/+5
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This helps avoid accidental refcount overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 | net: convert sk_buff.users from atomic_t to refcount_t | Reshetova, Elena | 14 | -38/+38
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This helps avoid accidental refcount overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 | net: convert nf_bridge_info.use from atomic_t to refcount_t | Reshetova, Elena | 1 | -2/+2
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This helps avoid accidental refcount overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 | net: convert neigh_params.refcnt from atomic_t to refcount_t | Reshetova, Elena | 1 | -4/+4
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This helps avoid accidental refcount overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 | net: convert neighbour.refcnt from atomic_t to refcount_t | Reshetova, Elena | 3 | -11/+11
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This helps avoid accidental refcount overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 | net: convert inet_peer.refcnt from atomic_t to refcount_t | Reshetova, Elena | 1 | -9/+9
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This helps avoid accidental refcount overflows that might lead to use-after-free situations. This conversion requires an overall +1 on the whole refcounting scheme. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 17 | -30/+90
A set of overlapping changes in macvlan and the rocker driver, nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next | David S. Miller | 31 | -476/+905
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next tree. This batch contains connection tracking updates for the cleanup iteration path, patches from Florian Westphal:

1) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(); just set the dying bit to let the CPU release them.

2) Add nf_ct_iterate_destroy() to be used on module removal, to kill conntrack entries from all namespaces.

3) Restart iteration on hashtable resizing, since both may occur at the same time.

4) Use the new nf_ct_iterate_destroy() to remove conntrack entries with NAT mappings on module removal.

5) Use nf_ct_iterate_destroy() to remove conntrack entries on helper module removal, from Liping Zhang.

6) Use nf_ct_iterate_cleanup_net() to remove the timeout extension if the user requests this, also from Liping.

7) Add net_ns_barrier() and use it from the FTP helper, to make sure no concurrent namespace removal happens while the helper module is being removed.

8) Use NFPROTO_MAX in the layer 3 conntrack protocol array, to reduce module size. Same thing in nf_tables.

Updates for the nf_tables infrastructure:

9) Prepare usage of the extended ACK reporting infrastructure for nf_tables.

10) Remove unnecessary forward declaration in the nf_tables hash set.

11) Skip set size estimation if the number of elements is not specified.

12) Changes to accommodate a (faster) unresizable hash set implementation, for anonymous sets and dynamic size fixed sets with no timeouts.

13) Faster lookup function for unresizable hash tables with 2- and 4-byte keys.

And, finally, a bunch of assorted small updates and cleanups:

14) Do not hold a reference to the netdev from ipt_CLUSTER; instead subscribe to device events and look up the index from the packet path. This fixes an issue present since the very beginning, patch from Xin Long.

15) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal.

16) Use ebt_invalid_target() whenever possible in the ebtables tree, from Gao Feng.

17) Calm down a compilation warning in the nf_dup infrastructure, patch from stephen hemminger.

18) Make functions static in the nftables rt expression, also from stephen.

19) Update the Makefile to use the canonical method of specifying nf_tables-objs. From Jike Song.

20) Use nf_conntrack_helpers_register() in amanda and H323.

21) Space cleanup for ctnetlink, from linzhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 | net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish() | Michal Kubeček | 1 | -7/+17
Recently I started seeing warnings about pages with refcount -1. The problem was traced to packets being reused after their head was merged into a GRO packet by skb_gro_receive(). Although bisecting pointed to commit c21b48cc1bbf ("net: adjust skb->truesize in ___pskb_trim()"), and I have never seen the issue on a kernel with that commit reverted, I believe the real problem appeared earlier, when the option to merge the head frag in GRO was implemented. Handling of the NAPI_GRO_FREE_STOLEN_HEAD state was only added to the GRO_MERGED_FREE branch of napi_skb_finish(), so if the driver uses napi_gro_frags() and the head is merged (which in my case happens after the skb_condense() call added by the commit mentioned above), the skb is reused including the head that has been merged. As a result, we release the page reference twice and eventually end up with a negative page refcount. To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish() the same way it is done in napi_skb_finish(). Fixes: d7e8883cfcf4 ("net: make GRO aware of skb->head_frag") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
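A sketch of the handling described above, mirroring the existing GRO_MERGED_FREE branch of napi_skb_finish(); the exact helper names on the napi_frags_finish() path are simplified here:

    case GRO_MERGED_FREE:
            if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
                    napi_skb_free_stolen_head(skb);   /* head was merged: free skb, drop page ref once */
            else
                    napi_reuse_skb(napi, skb);        /* nothing stolen: safe to recycle the skb */
            break;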
2017-06-29 | net: bridge: constify attribute_group structures. | Arvind Yadav | 1 | -1/+1
attribute_groups are not supposed to change at runtime. All functions working with attribute_groups provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const.

File size before:
   text    data    bss    dec    hex  filename
   2645     896      0   3541    dd5  net/bridge/br_sysfs_br.o

File size after adding 'const':
   text    data    bss    dec    hex  filename
   2701     832      0   3533    dcd  net/bridge/br_sysfs_br.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 | net: constify attribute_group structures. | Arvind Yadav | 1 | -3/+3
attribute_groups are not supposed to change at runtime. All functions working with attribute_groups provided by <linux/device.h> work with const attribute_group. So mark the non-const structs as const.

File size before:
   text    data    bss     dec    hex  filename
   9968    3168     16   13152   3360  net/core/net-sysfs.o

File size after adding 'const':
   text    data    bss     dec    hex  filename
  10160    2976     16   13152   3360  net/core/net-sysfs.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
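The change itself is mechanical; roughly (with a hypothetical attribute list):

    static struct attribute *example_attrs[] = {
            /* &dev_attr_foo.attr, ...  (hypothetical attributes) */
            NULL,
    };

    static const struct attribute_group example_group = {   /* was: static struct attribute_group */
            .attrs = example_attrs,
    };

Since sysfs_create_group() and related helpers already take a const struct attribute_group *, no callers need to change; the structure simply moves from writable .data into read-only .rodata, which is what the size diff above shows.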
2017-06-29 | net: ipmr: Add ipmr_rtm_getroute | Donald Sharp | 1 | -1/+62
Add to RTNL_FAMILY_IPMR, RTM_GETROUTE the ability to retrieve one (S,G) mroute from a specified table. A (*,G) request will return mroute information for just that particular mroute, if it exists. This is because it is entirely possible to have more S's than can fit in one skb to return to the requesting process. Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 | net: sched: Fix one possible panic when no destroy callback | Gao Feng | 1 | -1/+2
When a qdisc fails to init, qdisc_create() invokes the destroy callback to clean up. But there is no check that the callback actually exists, so qdiscs that do not define one (such as codel and fq) can trigger a panic. Take codel as an example: when a malicious user constructs an invalid netlink msg, codel_init->codel_change->nla_parse_nested fails. The kernel then invokes the destroy callback directly, but qdisc codel doesn't define one, and the result is a panic. Now add a check for destroy to avoid the possible panic. Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation") Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
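A minimal sketch of the guard on the init-failure path (simplified; the real error path in qdisc_create() performs additional cleanup):

    err_out:
            /* Not every Qdisc_ops (e.g. codel, fq) provides a destroy callback,
             * so check before calling it when init fails.
             */
            if (ops->destroy)
                    ops->destroy(sch);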
2017-06-27 | ipv6: udp: leverage scratch area helpers | Paolo Abeni | 1 | -5/+9
The commit b65ac44674dd ("udp: try to avoid 2 cache miss on dequeue") leveraged the scratch area helpers for UDP v4, but I forgot to update the IPv6 code path accordingly. This change extends the scratch area usage to the IPv6 code, syncing the two implementations and giving some performance benefit. IPv6 is again almost on the same level as IPv4, performance-wise. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-27 | udp: move scratch area helpers into the include file | Paolo Abeni | 1 | -60/+0
So that they can be later used by the IPv6 code, too. Also lift the comments a bit. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-27 | tcp: fix null ptr deref in getsockopt(..., TCP_ULP, ...) | Dave Watson | 1 | -0/+5
If icsk_ulp_ops is unset, it dereferences a null ptr. Add a null ptr check.

    BUG: KASAN: null-ptr-deref in copy_to_user include/linux/uaccess.h:168 [inline]
    BUG: KASAN: null-ptr-deref in do_tcp_getsockopt.isra.33+0x24f/0x1e30 net/ipv4/tcp.c:3057
    Read of size 4 at addr 0000000000000020 by task syz-executor1/15452

Signed-off-by: Dave Watson <davejwatson@fb.com> Reported-by: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
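A sketch of the guarded TCP_ULP branch in do_tcp_getsockopt(); the behaviour for an unset ULP (reporting an empty name) is a simplification here:

    case TCP_ULP:
            if (get_user(len, optlen))
                    return -EFAULT;
            len = min_t(unsigned int, len, TCP_ULP_NAME_MAX);
            if (!icsk->icsk_ulp_ops) {              /* no ULP attached: avoid NULL deref */
                    if (put_user(0, optlen))
                            return -EFAULT;
                    return 0;
            }
            if (put_user(len, optlen))
                    return -EFAULT;
            if (copy_to_user(optval, icsk->icsk_ulp_ops->name, len))
                    return -EFAULT;
            return 0;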
2017-06-27 | net: prevent sign extension in dev_get_stats() | Eric Dumazet | 1 | -3/+3
Similar to the fix provided by Dominik Heidler in commit 9b3dc0a17d73 ("l2tp: cast l2tp traffic counter to unsigned") we need to take care of 32bit kernels in dev_get_stats(). When using atomic_long_read(), we add a 'long' to u64 and might misinterpret high order bit, unless we cast to unsigned. Fixes: caf586e5f23ce ("net: add a core netdev->rx_dropped counter") Fixes: 015f0688f57ca ("net: net: add a core netdev->tx_dropped counter") Fixes: 6e7333d315a76 ("net: add rx_nohandler stat counter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
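The fix amounts to casting the signed long counters before the 64-bit addition; a sketch using the counters named in the Fixes tags, where storage is the struct rtnl_link_stats64 being filled in:

    /* On 32-bit kernels atomic_long_read() returns a 32-bit signed long.
     * Adding it to a u64 sign-extends it first, so a counter with the high
     * bit set would be misread as a huge value; cast to unsigned long.
     */
    storage->rx_dropped   += (unsigned long)atomic_long_read(&dev->rx_dropped);
    storage->tx_dropped   += (unsigned long)atomic_long_read(&dev->tx_dropped);
    storage->rx_nohandler += (unsigned long)atomic_long_read(&dev->rx_nohandler);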
2017-06-26 | net: add netlink_ext_ack argument to rtnl_link_ops.slave_validate | Matthias Schiffer | 1 | -1/+2
Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 | net: add netlink_ext_ack argument to rtnl_link_ops.slave_changelink | Matthias Schiffer | 2 | -2/+4
Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 | net: add netlink_ext_ack argument to rtnl_link_ops.validate | Matthias Schiffer | 11 | -15/+28
Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 | net: add netlink_ext_ack argument to rtnl_link_ops.changelink | Matthias Schiffer | 11 | -14/+24
Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 | net: add netlink_ext_ack argument to rtnl_link_ops.newlink | Matthias Schiffer | 13 | -14/+27
Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 | sctp: adjust ssthresh when transport is idle | Marcelo Ricardo Leitner | 1 | -0/+2
RFC 4960 Errata 3.27 identifies that ssthresh should be adjusted to cwnd because otherwise the transport could lock into the congestion avoidance phase, especially if ssthresh was previously reduced by some packet drop, leading to poor performance. The Errata says to adjust ssthresh to cwnd only once, though the same goal is achieved by updating it every time we update cwnd too. The caveat is that we could take longer to get back up to speed, but that should be compensated by the fact that we don't adjust on an RTO basis (as the RFC says) but based on Heartbeats, which are usually way longer. See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.27 Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 | sctp: adjust cwnd increase in Congestion Avoidance phase | Marcelo Ricardo Leitner | 1 | -8/+18
RFC4960 Errata 3.26 identified that, while RFC4960 states that cwnd should never grow by more than 1*MTU per RTT, Section 7.2.2 was underspecified and as described could allow cwnd to increase by more than that. This patch updates it so that partial_bytes_acked is capped at cwnd if flight_size doesn't reach cwnd, protecting against that case. See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.26 Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 | sctp: allow increasing cwnd regardless of ctsn moving or not | Marcelo Ricardo Leitner | 1 | -9/+8
As per RFC4960 Errata 3.22, this condition is not needed anymore as it could cause the partial_bytes_acked to not consider the TSNs acked in the Gap Ack Blocks although they were received by the peer successfully. This patch thus drops the check for new Cumulative TSN Ack Point, leaving just the flight_size < cwnd one. See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.22 Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 | sctp: update order of adjustments of partial_bytes_acked and cwnd | Marcelo Ricardo Leitner | 1 | -7/+8
RFC4960 Errata 3.12 says RFC4960 is unclear about the order of adjustments applied to partial_bytes_acked and cwnd in the congestion avoidance phase, and that the actual order should be: partial_bytes_acked is reset to (partial_bytes_acked - cwnd); next, cwnd is increased by MTU. We were first increasing cwnd and then subtracting the new value from partial_bytes_acked, which leads to a different result, as pba ends up smaller than it should be, and could cause cwnd to not grow as much. See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.12 Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
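In code terms, a simplified sketch of the corrected order (names are illustrative, not the real sctp_transport fields):

    /* Congestion avoidance per RFC 4960 Errata 3.12: once a full cwnd's worth
     * of data has been acked, first charge pba against the OLD cwnd, then
     * grow cwnd by at most one MTU per RTT.
     */
    static void ca_adjust(u32 *pba, u32 *cwnd, u32 flight_size, u32 pmtu)
    {
            if (*pba >= *cwnd && flight_size >= *cwnd) {
                    *pba -= *cwnd;     /* reset pba against the old cwnd ... */
                    *cwnd += pmtu;     /* ... and only then increase cwnd   */
            }
    }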
2017-06-25 | tcp: reset sk_rx_dst in tcp_disconnect() | WANG Cong | 1 | -0/+2
We have to reset sk->sk_rx_dst when we disconnect a TCP connection, because otherwise, when we re-connect it, this dst reference is simply overwritten in tcp_finish_connect(). This fixes a dst leak which leads to a loopback dev refcnt leak. It is a long-standing bug; Kevin reported a very similar (if not the same) bug before. Thanks to Andrei for providing such a reliable reproducer, which greatly narrowed down the problem. Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.") Reported-by: Andrei Vagin <avagin@gmail.com> Reported-by: Kevin Xu <kaiwen.xu@hulu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
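A minimal sketch of the reset as it would sit in tcp_disconnect():

    /* Drop the cached input route so a later re-connect does not silently
     * overwrite (and thereby leak) this reference in tcp_finish_connect().
     */
    dst_release(sk->sk_rx_dst);
    sk->sk_rx_dst = NULL;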
2017-06-25 | net: ipv6: reset daddr and dport in sk if connect() fails | Wei Wang | 2 | -2/+9
In __ip6_datagram_connect(), reset sk->sk_v6_daddr and inet->dport if an error occurs. In udp_v6_early_demux(), check sk_state to make sure it is in TCP_ESTABLISHED state. Together, this makes sure an unconnected UDP socket won't be considered a valid candidate for early demux. v3: add TCP_ESTABLISHED state check in udp_v6_early_demux() v2: fix compilation error Fixes: 5425077d73e0 ("net: ipv6: Add early demux handler for UDP unicast") Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 | af_iucv: Move sockaddr length checks to before accessing sa_family in bind and connect handlers | Mateusz Jurczyk | 1 | -5/+3
Verify that the caller-provided sockaddr structure is large enough to contain the sa_family field, before accessing it in bind() and connect() handlers of the AF_IUCV socket. Since neither syscall enforces a minimum size of the corresponding memory region, very short sockaddrs (zero or one byte long) result in operating on uninitialized memory while referencing .sa_family. Fixes: 52a82e23b9f2 ("af_iucv: Validate socket address length in iucv_sock_bind()") Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com> [jwi: removed unneeded null-check for addr] Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
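One common shape for such a check, sketched with hypothetical names (the actual iucv handlers use their own variables and may compare against the full sockaddr_iucv size):

    static int example_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
    {
            /* Reject sockaddrs too short to even hold sa_family; without this,
             * a zero- or one-byte address would make us read uninitialized memory.
             */
            if (addr_len < offsetofend(struct sockaddr, sa_family))
                    return -EINVAL;
            if (addr->sa_family != AF_IUCV)
                    return -EAFNOSUPPORT;
            /* ... proceed with the real bind logic ... */
            return 0;
    }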
2017-06-25 | net/iucv: improve endianness handling | Hans Wippel | 1 | -1/+1
Use proper endianness conversion for an skb protocol assignment. Given that IUCV is only available on big endian systems (s390), this simply avoids an endianness warning reported by sparse. Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com> Reviewed-by: Julian Wiedmann <jwi@linux.vnet.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 | net: store port/representator id in metadata_dst | Jakub Kicinski | 4 | -8/+18
Switches and modern SR-IOV enabled NICs may multiplex traffic from port representators and control messages over a single set of hardware queues. Control messages and muxed traffic may need ordered delivery. Those requirements make it hard to comfortably use the TC infrastructure today unless we have a way of attaching metadata to skbs at the upper device. Because a single set of queues is used for many netdevs, stopping the TC/sched queues of all of them reliably is impossible, and the lower device has to resort to returning NETDEV_TX_BUSY and usually has to take extra locks on the fastpath. This patch attempts to enable port/representative devs to attach metadata to skbs which carry the port id. This way representatives can be queueless and all queuing can be performed at the lower netdev in the usual way. Traffic arriving on the port/representative interfaces will have metadata attached and will subsequently be queued to the lower device for transmission. The lower device should recognize the metadata and translate it to a HW specific format, which is most likely either a special header inserted before the network headers or descriptor/metadata fields. Metadata is associated with the lower device by storing the netdev pointer along with the port id, so that if TC decides to redirect or mirror, the new netdev will not try to interpret it. This is mostly for SR-IOV devices, since switches don't have lower netdevs today. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 | tls: return -EFAULT if copy_to_user() fails | Dan Carpenter | 1 | -4/+6
The copy_to_user() function returns the number of bytes remaining but we want to return -EFAULT here. Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
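The general pattern, sketched with a hypothetical structure rather than the tls getsockopt code itself:

    struct example_info {          /* hypothetical payload */
            u32 version;
    };

    static int example_copy_out(char __user *optval, const struct example_info *info)
    {
            /* copy_to_user() returns the number of bytes it could NOT copy.
             * Returning that count directly would hand the caller a positive
             * value, so translate any failure into -EFAULT instead.
             */
            if (copy_to_user(optval, info, sizeof(*info)))
                    return -EFAULT;
            return 0;
    }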
2017-06-23 | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next | David S. Miller | 8 | -44/+66
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2017-06-23

1) Use memdup_user to simplify xfrm_user_policy. From Geliang Tang.

2) Make xfrm_dev_register static to silence a sparse warning. From Wei Yongjun.

3) Use crypto_memneq to check the ICV in the AH protocol. From Sabrina Dubroca.

4) Remove some unused variables in esp6. From Stephen Hemminger.

5) Extend XFRM MIGRATE to allow changing the UDP encapsulation port. From Antony Antony.

6) Include the UDP encapsulation port in km_migrate announcements. From Antony Antony.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec | David S. Miller | 7 | -11/+45
Steffen Klassert says:
====================
pull request (net): ipsec 2017-06-23

1) Fix xfrm garbage collecting when unregistering a netdevice. From Hangbin Liu.

2) Fix a NULL pointer dereference when exiting a network namespace. From Hangbin Liu.

3) Fix some error codes in pfkey to prevent a NULL pointer dereference. From Dan Carpenter.

4) Fix a NULL pointer dereference on allocation failure in pfkey. From Dan Carpenter.

5) Adjust the IPv6 payload_len to include extension headers. Otherwise we corrupt the packets when doing ESP GRO in transport mode. From Yossi Kuperman.

6) Set nhoff to the proper offset of the IPv6 nexthdr when doing ESP GRO. From Yossi Kuperman.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 | tcp: fix out-of-bounds access in ULP sysctl | Jakub Kicinski | 1 | -0/+1
KASAN reports out-of-bound access in proc_dostring() coming from proc_tcp_available_ulp() because in case TCP ULP list is empty the buffer allocated for the response will not have anything printed into it. Set the first byte to zero to avoid strlen() going out-of-bounds. Fixes: 734942cc4ea6 ("tcp: ULP infrastructure") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
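A sketch of the fix as described; the buffer size constant and surrounding sysctl plumbing are simplified and illustrative:

    char *buf = kmalloc(ULP_BUF_LEN, GFP_USER);   /* ULP_BUF_LEN: illustrative size */

    if (!buf)
            return -ENOMEM;
    /* If no ULPs are registered, nothing is written into buf, and
     * proc_dostring() would run strlen() over uninitialized memory;
     * start from an empty string.
     */
    buf[0] = '\0';
    tcp_get_available_ulp(buf, ULP_BUF_LEN);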
2017-06-23 | sit: use __GFP_NOWARN for user controlled allocation | WANG Cong | 1 | -1/+1
The memory allocation size is controlled by user-space, if it is too large just fail silently and return NULL, not to mention there is a fallback allocation later. Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
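The pattern is simply to pass __GFP_NOWARN when the size is user-controlled and failure is handled gracefully; a generic sketch (variable names illustrative):

    /* 'len' is user-controlled; an absurdly large request should just fail
     * quietly, not spew an allocation warning, since a fallback path exists.
     */
    buf = kmalloc(len, GFP_KERNEL | __GFP_NOWARN);
    if (!buf)
            return NULL;    /* caller falls back to the alternate allocation */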
2017-06-23 | bpf: possibly avoid extra masking for narrower load in verifier | Yonghong Song | 1 | -39/+53
Commit 31fd85816dbe ("bpf: permits narrower load from bpf program context fields") permits narrower loads for certain ctx fields. The commit, however, generates a masking operation even if the prog-specific ctx conversion already produces a result with the narrower size. For example, for __sk_buff->protocol, the ctx conversion loads the data into a register with a 2-byte load; a narrower 2-byte load should not generate masking. For __sk_buff->vlan_present, the conversion function sets the result to either 0 or 1, essentially a byte; a narrower 2-byte or 1-byte load should not generate masking either. To avoid unnecessary masking, the prog-specific *_is_valid_access now passes converted_op_size back to the verifier, which indicates the valid data width after the perceived future conversion. Based on this information, the verifier is able to avoid unnecessary masking. Since we want more information back from prog-specific *_is_valid_access checking, all of it is packed into one data structure for more clarity. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 | xdp: add reporting of offload mode | Jakub Kicinski | 2 | -5/+4
Extend the XDP_ATTACHED_* values to include offloaded mode. Let drivers report whether program is installed in the driver or the HW by changing the prog_attached field from bool to u8 (type of the netlink attribute). Exploit the fact that the value of XDP_ATTACHED_DRV is 1, therefore since all drivers currently assign the mode with double negation: mode = !!xdp_prog; no drivers have to be modified. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
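The reporting values live in the netlink UAPI; roughly (see include/uapi/linux/if_link.h for the authoritative definition):

    enum {
            XDP_ATTACHED_NONE = 0,   /* no program attached */
            XDP_ATTACHED_DRV,        /* 1: native driver mode, so !!xdp_prog still works */
            XDP_ATTACHED_SKB,        /* 2: generic/skb mode */
            XDP_ATTACHED_HW,         /* new: program offloaded to hardware */
    };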
2017-06-23 | xdp: add HW offload mode flag for installing programs | Jakub Kicinski | 2 | -4/+7
Add an installation-time flag for requesting that the program be installed only if it can be offloaded to HW. Internally, a new command for ndo_xdp is added; this way we avoid putting checks into drivers, since they all return -EINVAL on an unknown command. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 | xdp: pass XDP flags into install handlers | Jakub Kicinski | 1 | -2/+3
Pass XDP flags to the xdp ndo. This will allow drivers to look at the mode flags and make decisions about offload. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 | net: account for current skb length when deciding about UFO | Michal Kubeček | 2 | -2/+3
Our customer encountered stuck NFS writes for blocks starting at specific offsets w.r.t. page boundary caused by networking stack sending packets via UFO enabled device with wrong checksum. The problem can be reproduced by composing a long UDP datagram from multiple parts using MSG_MORE flag:

    sendto(sd, buff, 1000, MSG_MORE, ...);
    sendto(sd, buff, 1000, MSG_MORE, ...);
    sendto(sd, buff, 3000, 0, ...);

Assume this packet is to be routed via a device with MTU 1500 and NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(), this condition is tested (among others) to decide whether to call ip_ufo_append_data():

    ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))

At the moment, we already have skb with 1028 bytes of data which is not marked for GSO so that the test is false (fragheaderlen is usually 20). Thus we append second 1000 bytes to this skb without invoking UFO. Third sendto(), however, has sufficient length to trigger the UFO path so that we end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb() uses udp_csum() to calculate the checksum but that assumes all fragments have correct checksum in skb->csum which is not true for UFO fragments. When checking against MTU, we need to add skb->len to length of new segment if we already have a partially filled skb and fragheaderlen only if there isn't one. In the IPv6 case, skb can only be null if this is the first segment so that we have to use headersize (length of the first IPv6 header) rather than fragheaderlen (length of IPv6 header of further fragments) for skb == NULL. Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach") Fixes: e4c5e13aa45c ("ipv6: Should use consistent conditional judgement for ip6 fragment between __ip6_append_data and ip6_finish_output") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
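A sketch of the adjusted MTU check described above (exact parenthesization in __ip_append_data() may differ):

    /* Count what is already queued in the tail skb (if any) toward the MTU
     * check; only add the fragment header length when starting a fresh skb.
     */
    if (((length + (skb ? skb->len : fragheaderlen)) > mtu) ||
        (skb && skb_is_gso(skb))) {
            /* take the UFO path */
    }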
2017-06-23 | udp: fix poll() | Paolo Abeni | 1 | -10/+17
Michael reported a UDP breakage caused by the commit b65ac44674dd ("udp: try to avoid 2 cache miss on dequeue"). The function __first_packet_length() can update the checksum bits of the pending skb, making the scratch area out-of-sync, and setting skb->csum, if the skb was previously in need of checksum validation. On a later recvmsg() for such an skb, checksum validation will be invoked again - due to the wrong udp_skb_csum_unnecessary() value - and will fail, causing the valid skb to be dropped. This change addresses the issue by refreshing the scratch area in __first_packet_length() after the possible checksum update. Fixes: b65ac44674dd ("udp: try to avoid 2 cache miss on dequeue") Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 | udp/v6: prefetch rmem_alloc in udp6_queue_rcv_skb() | Paolo Abeni | 1 | -0/+1
Very similar to commit dd99e425be23 ("udp: prefetch rmem_alloc in udp_queue_rcv_skb()"), this allows saving a cache miss when the BH is the bottleneck for UDP over IPv6 packet processing, e.g. for small packets when a single RX NIC ingress queue is in use. Performance under flood when multiple NIC RX queues are used is unaffected, but when a single NIC RX queue is in use this gives ~8% performance improvement. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 | ipv6: avoid unregistering inet6_dev for loopback | WANG Cong | 1 | -2/+3
The per netns loopback_dev->ip6_ptr is unregistered and set to NULL when its MTU is set smaller than IPV6_MIN_MTU; this means rt->rt6i_idev can end up set to NULL after a rt6_uncached_list_flush_dev(), and we then crash on another call. In this case we should just bring its inet6_dev down, rather than unregistering it; at least prior to commit 176c39af29bc ("netns: fix addrconf_ifdown kernel panic") we always overrode the case for loopback. Thanks a lot to Andrey for finding a reliable reproducer. Fixes: 176c39af29bc ("netns: fix addrconf_ifdown kernel panic") Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Daniel Lezcano <dlezcano@fr.ibm.com> Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 | rds: tcp: set linger to 1 when unloading a rds-tcp | Sowmini Varadhan | 5 | -2/+7
If we are unloading the rds_tcp module, we can set linger to 1 and drop pending packets to accelerate reconnect. The peer will end up resetting the connection based on new generation numbers of the new incarnation, so hanging on to unsent TCP packets via linger is mostly pointless in this case. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Tested-by: Jenny Xu <jenny.x.xu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 | rds: tcp: send handshake ping-probe from passive endpoint | Sowmini Varadhan | 4 | -11/+12
The RDS handshake ping probe added by commit 5916e2c1554f ("RDS: TCP: Enable multipath RDS for TCP") is sent from rds_sendmsg() before the first data packet is sent to a peer. If the conversation is not bidirectional (i.e., one side is always passive and never invokes rds_sendmsg()) and the passive side restarts its rds_tcp module, a new HS ping probe needs to be sent, so that the number of paths can be re-established. This patch achieves that by sending a HS ping probe from rds_tcp_accept_one() when c_npaths is 0 (i.e., we have not done a handshake probe with this peer yet). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Tested-by: Jenny Xu <jenny.x.xu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 | tcp: Add a tcp_filter hook before handle ack packet | Chenbo Feng | 2 | -0/+4
Currently, in both the ipv4 and ipv6 code paths, the ack packet received while the sk is in TCP_NEW_SYN_RECV state is not filtered by the socket filter or cgroup filter, since it is handled from tcp_child_process and never reaches the tcp_filter inside tcp_v4_rcv or tcp_v6_rcv. Adding a tcp_filter hook here makes sure all ingress tcp packets can be correctly filtered. Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 | ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER | WANG Cong | 1 | -1/+5
In commit 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf") I assumed NETDEV_REGISTER and NETDEV_UNREGISTER are paired. Unfortunately, as reported by jeffy, netdev_wait_allrefs() could rebroadcast the NETDEV_UNREGISTER event until all refs are gone. We have to add an additional check to avoid this corner case. For netdev_wait_allrefs() dev->reg_state is NETREG_UNREGISTERED; for dev_change_net_namespace() dev->reg_state is NETREG_REGISTERED. So check for dev->reg_state != NETREG_UNREGISTERED. Fixes: 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf") Reported-by: jeffy <jeffy.chen@rock-chips.com> Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
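A sketch of the extra check, simplified from ip6_route_dev_notify():

    } else if (event == NETDEV_UNREGISTER &&
               dev->reg_state != NETREG_UNREGISTERED) {
            /* NETDEV_UNREGISTER may be rebroadcast by netdev_wait_allrefs()
             * (by then reg_state is NETREG_UNREGISTERED); only drop the idev
             * references on the first, real notification.
             */
            in6_dev_put(net->ipv6.ip6_null_entry->rt6i_idev);
            /* ... likewise for the prohibit and blackhole entries ... */
    }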
2017-06-22 | esp6_offload: Fix IP6CB(skb)->nhoff for ESP GRO | Yossi Kuperman | 1 | -0/+25
IP6CB(skb)->nhoff is the offset of the nexthdr field in an IPv6 header, unless there are extension headers present, in which case nhoff points to the nexthdr field of the last extension header. In non-GRO code path, nhoff is set by ipv6_rcv before any XFRM code is executed. Conversely, in GRO code path (when esp6_offload is loaded), nhoff is not set. The following functions fail to read the correct value and eventually the packet is dropped: xfrm6_transport_finish xfrm6_tunnel_input xfrm6_rcv_tnl Set nhoff to the proper offset of nexthdr in esp6_gro_receive. Fixes: 7785bba299a8 ("esp: Add a software GRO codepath") Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>