summaryrefslogtreecommitdiffstats
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2016-12-07netfilter: rpfilter: bypass ipv4 lbcast packets with zeronet sourceLiping Zhang2-9/+12
Otherwise, DHCP Discover packets(0.0.0.0->255.255.255.255) may be dropped incorrectly. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06netfilter: nft_fib_ipv4: initialize *dest to zeroLiping Zhang1-0/+2
Otherwise, if fib lookup fail, *dest will be filled with garbage value, so reverse path filtering will not work properly: # nft add rule x prerouting fib saddr oif eq 0 drop Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06netfilter: nft_fib: convert htonl to ntohl properlyLiping Zhang1-1/+1
Acctually ntohl and htonl are identical, so this doesn't affect anything, but it is conceptually wrong. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06netfilter: x_tables: pack percpu counter allocationsFlorian Westphal2-6/+12
instead of allocating each xt_counter individually, allocate 4k chunks and then use these for counter allocation requests. This should speed up rule evaluation by increasing data locality, also speeds up ruleset loading because we reduce calls to the percpu allocator. As Eric points out we can't use PAGE_SIZE, page_allocator would fail on arches with 64k page size. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06netfilter: x_tables: pass xt_counters struct to counter allocatorFlorian Westphal2-8/+2
Keeps some noise away from a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06netfilter: x_tables: pass xt_counters struct instead of packet counterFlorian Westphal2-4/+4
On SMP we overload the packet counter (unsigned long) to contain percpu offset. Hide this from callers and pass xt_counters address instead. Preparation patch to allocate the percpu counters in page-sized batch chunks. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06netfilter: defrag: only register defrag functionality if neededFlorian Westphal2-4/+44
nf_defrag modules for ipv4 and ipv6 export an empty stub function. Any module that needs the defragmentation hooks registered simply 'calls' this empty function to create a phony module dependency -- modprobe will then load the defrag module too. This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook registration until the functionality is requested within a network namespace instead of module load time for all namespaces. Hooks are only un-registered on module unload or when a namespace that used such defrag functionality exits. We have to use struct net for this as the register hooks can be called before netns initialization here from the ipv4/ipv6 conntrack module init path. There is no unregister functionality support, defrag will always be active once it was requested inside a net namespace. The reason is that defrag has impact on nft and iptables rulesets (without defrag we might see framents). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: conntrack: register hooks in netns when needed by rulesetFlorian Westphal1-11/+44
This makes use of nf_ct_netns_get/put added in previous patch. We add get/put functions to nf_conntrack_l3proto structure, ipv4 and ipv6 then implement use-count to track how many users (nft or xtables modules) have a dependency on ipv4 and/or ipv6 connection tracking functionality. When count reaches zero, the hooks are unregistered. This delays activation of connection tracking inside a namespace until stateful firewall rule or nat rule gets added. This patch breaks backwards compatibility in the sense that connection tracking won't be active anymore when the protocol tracker module is loaded. This breaks e.g. setups that ctnetlink for flow accounting and the like, without any '-m conntrack' packet filter rules. Followup patch restores old behavour and makes new delayed scheme optional via sysctl. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: nf_tables: add conntrack dependencies for nat/masq/redir expressionsFlorian Westphal2-0/+14
so that conntrack core will add the needed hooks in this namespace. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: nat: add dependencies on conntrack moduleFlorian Westphal1-1/+7
MASQUERADE, S/DNAT and REDIRECT already call functions that depend on the conntrack module. However, since the conntrack hooks are now registered in a lazy fashion (i.e., only when needed) a symbol reference is not enough. Thus, when something is added to a nat table, make sure that it will see packets by calling nf_ct_netns_get() which will register the conntrack hooks in the current netns. An alternative would be to add these dependencies to the NAT table. However, that has problems when using non-modular builds -- we might register e.g. ipv6 conntrack before its initcall has run, leading to NULL deref crashes since its per-netns storage has not yet been allocated. Adding the dependency in the modules instead has the advantage that nat table also does not register its hooks until rules are added. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: add and use nf_ct_netns_get/putFlorian Westphal2-4/+4
currently aliased to try_module_get/_put. Will be changed in next patch when we add functions to make use of ->net argument to store usercount per l3proto tracker. This is needed to avoid registering the conntrack hooks in all netns and later only enable connection tracking in those that need conntrack. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: conntrack: remove unused init_net hookFlorian Westphal1-6/+0
since adf0516845bcd0 ("netfilter: remove ip_conntrack* sysctl compat code") the only user (ipv4 tracker) sets this to an empty stub function. After this change nf_ct_l3proto_pernet_register() is also empty, but this will change in a followup patch to add conditional register of the hooks. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: conntrack: built-in support for UDPliteDavide Caratti1-0/+3
CONFIG_NF_CT_PROTO_UDPLITE is no more a tristate. When set to y, connection tracking support for UDPlite protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)|| udplite| ipv4 | ipv6 |nf_conntrack ---------++--------+--------+--------+-------------- none || 432538 | 828755 | 828676 | 6141434 UDPlite || - | 829649 | 829362 | 6498204 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: conntrack: built-in support for SCTPDavide Caratti1-0/+3
CONFIG_NF_CT_PROTO_SCTP is no more a tristate. When set to y, connection tracking support for SCTP protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)|| sctp | ipv4 | ipv6 | nf_conntrack ---------++--------+--------+--------+-------------- none || 498243 | 828755 | 828676 | 6141434 SCTP || - | 829254 | 829175 | 6547872 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: conntrack: built-in support for DCCPDavide Caratti1-0/+3
CONFIG_NF_CT_PROTO_DCCP is no more a tristate. When set to y, connection tracking support for DCCP protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)|| dccp | ipv4 | ipv6 | nf_conntrack ---------++--------+--------+--------+-------------- none || 469140 | 828755 | 828676 | 6141434 DCCP || - | 830566 | 829935 | 6533526 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04netfilter: update Arturo Borrero Gonzalez email addressArturo Borrero Gonzalez2-4/+4
The email address has changed, let's update the copyright statements. Signed-off-by: Arturo Borrero Gonzalez <arturo@debian.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-03ipv4: fib: Replay events when registering FIB notifierIdo Schimmel1-2/+146
Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") introduced a new notification chain to notify listeners (f.e., switchdev drivers) about addition and deletion of routes. However, upon registration to the chain the FIB tables can already be populated, which means potential listeners will have an incomplete view of the tables. Solve that by dumping the FIB tables and replaying the events to the passed notification block. The dump itself is done using RCU in order not to starve consumers that need RTNL to make progress. The integrity of the dump is ensured by reading the FIB change sequence counter before and after the dump under RTNL. This allows us to avoid the problematic situation in which the dumping process sends a ENTRY_ADD notification following ENTRY_DEL generated by another process holding RTNL. Callers of the registration function may pass a callback that is executed in case the dump was inconsistent with current FIB tables. The number of retries until a consistent dump is achieved is set to a fixed number to prevent callers from looping for long periods of time. In case current limit proves to be problematic in the future, it can be easily converted to be configurable using a sysctl. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03ipv4: fib: Allow for consistent FIB dumpingIdo Schimmel2-0/+3
The next patch will enable listeners of the FIB notification chain to request a dump of the FIB tables. However, since RTNL isn't taken during the dump, it's possible for the FIB tables to change mid-dump, which will result in inconsistency between the listener's table and the kernel's. Allow listeners to know about changes that occurred mid-dump, by adding a change sequence counter to each net namespace. The counter is incremented just before a notification is sent in the FIB chain. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03ipv4: fib: Convert FIB notification chain to be atomicIdo Schimmel1-4/+4
In order not to hold RTNL for long periods of time we're going to dump the FIB tables using RCU. Convert the FIB notification chain to be atomic, as we can't block in RCU critical sections. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03ipv4: fib: Export free_fib_info()Ido Schimmel1-0/+1
The FIB notification chain is going to be converted to an atomic chain, which means switchdev drivers will have to offload FIB entries in deferred work, as hardware operations entail sleeping. However, while the work is queued fib info might be freed, so a reference must be taken. To release the reference (and potentially free the fib info) fib_info_put() will be called, which in turn calls free_fib_info(). Export free_fib_info() so that modules will be able to invoke fib_info_put(). Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03udp: be less conservative with sock rmem accountingPaolo Abeni1-2/+2
Before commit 850cbaddb52d ("udp: use it's own memory accounting schema"), the udp protocol allowed sk_rmem_alloc to grow beyond the rcvbuf by the whole current packet's truesize. After said commit we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue is empty. As reported by Jesper this cause a performance regression for some (small) values of rcvbuf. This commit is intended to fix the regression restoring the old handling of the rcvbuf limit. Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller6-5/+11
Couple conflicts resolved here: 1) In the MACB driver, a bug fix to properly initialize the RX tail pointer properly overlapped with some changes to support variable sized rings. 2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix overlapping with a reorganization of the driver to support ACPI, OF, as well as PCI variants of the chip. 3) In 'net' we had several probe error path bug fixes to the stmmac driver, meanwhile a lot of this code was cleaned up and reorganized in 'net-next'. 4) The cls_flower classifier obtained a helper function in 'net-next' called __fl_delete() and this overlapped with Daniel Borkamann's bug fix to use RCU for object destruction in 'net'. It also overlapped with Jiri's change to guard the rhashtable_remove_fast() call with a check against tc_skip_sw(). 5) In mlx4, a revert bug fix in 'net' overlapped with some unrelated changes in 'net-next'. 6) In geneve, a stale header pointer after pskb_expand_head() bug fix in 'net' overlapped with a large reorganization of the same code in 'net-next'. Since the 'net-next' code no longer had the bug in question, there was nothing to do other than to simply take the 'net-next' hunks. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02bpf: Add new cgroup attach type to enable sock modificationsDavid Ahern1-1/+11
Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run any time a process in the cgroup opens an AF_INET or AF_INET6 socket. Currently only sk_bound_dev_if is exported to userspace for modification by a bpf program. This allows a cgroup to be configured such that AF_INET{6} sockets opened by processes are automatically bound to a specific device. In turn, this enables the running of programs that do not support SO_BINDTODEVICE in a specific VRF context / L3 domain. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02tcp: allow to turn tcp timestamp randomization offFlorian Westphal1-0/+1
Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple flows, we can deduct the relative qdisc delays (think of fq pacing). This should work even if we have one flow per remote peer." Having random per flow (or host) offsets doesn't allow that anymore so add a way to turn this off. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02tcp: randomize tcp timestamp offsets for each connectionFlorian Westphal5-7/+16
jiffies based timestamps allow for easy inference of number of devices behind NAT translators and also makes tracking of hosts simpler. commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset") added the main infrastructure that is needed for per-connection ts randomization, in particular writing/reading the on-wire tcp header format takes the offset into account so rest of stack can use normal tcp_time_stamp (jiffies). So only two items are left: - add a tsoffset for request sockets - extend the tcp isn generator to also return another 32bit number in addition to the ISN. Re-use of ISN generator also means timestamps are still monotonically increasing for same connection quadruple, i.e. PAWS will still work. Includes fixes from Eric Dumazet. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02ipv4: Set skb->protocol properly for local outputEli Cooper1-0/+2
When xfrm is applied to TSO/GSO packets, it follows this path: xfrm_output() -> xfrm_output_gso() -> skb_gso_segment() where skb_gso_segment() relies on skb->protocol to function properly. This patch sets skb->protocol to ETH_P_IP before dst_output() is called, fixing a bug where GSO packets sent through a sit tunnel are dropped when xfrm is involved. Cc: stable@vger.kernel.org Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02route: Set lwtstate for local traffic and cached input dstsThomas Graf1-13/+26
A route on the output path hitting a RTN_LOCAL route will keep the dst associated on its way through the loopback device. On the receive path, the dst_input() call will thus invoke the input handler of the route created in the output path. Thus, lwt redirection for input must be done for dsts allocated in the otuput path as well. Also, if a route is cached in the input path, the allocated dst should respect lwtunnel configuration on the nexthop as well. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02route: Set orig_output when redirecting to lwt on locally generated trafficThomas Graf1-1/+3
orig_output for IPv4 was only set for dsts which hit an input route. Set it consistently for locally generated traffic as well to allow lwt to continue the dst_output() path as configured by the nexthop. Fixes: 2536862311d ("lwt: Add support to redirect dst.input") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-01Merge branch 'master' of ↵David S. Miller1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2016-12-01 1) Change the error value when someone tries to run 32bit userspace on a 64bit host from -ENOTSUPP to the userspace exported -EOPNOTSUPP. Fix from Yi Zhao. 2) On inbound, ESN sequence numbers are already in network byte order. So don't try to convert it again, this fixes integrity verification for ESN. Fixes from Tobias Brunner. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2-3/+6
Pablo Neira Ayuso says: ==================== Netfilter fixes for net This is a large batch of Netfilter fixes for net, they are: 1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist structure that allows to have several objects with the same key. Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is expecting a return value similar to memcmp(). Change location of the nat_bysource field in the nf_conn structure to avoid zeroing this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us to crashes. From Florian Westphal. 2) Don't allow malformed fragments go through in IPv6, drop them, otherwise we hit GPF, patch from Florian Westphal. 3) Fix crash if attributes are missing in nft_range, from Liping Zhang. 4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia. 5) Two patches from David Ahern to fix netfilter interaction with vrf. From David Ahern. 6) Fix element timeout calculation in nf_tables, we take milliseconds from userspace, but we use jiffies from kernelspace. Patch from Anders K. Pedersen. 7) Missing validation length netlink attribute for nft_hash, from Laura Garcia. 8) Fix nf_conntrack_helper documentation, we don't default to off anymore for a bit of time so let's get this in sync with the code. I know is late but I think these are important, specifically the NAT bits, as they are mostly addressing fallout from recent changes. I also read there are chances to have -rc8, if that is the case, that would also give us a bit more time to test this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.Lorenzo Colitti1-1/+2
Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") made __build_flow_key call sock_net(sk) to determine the network namespace of the passed-in socket. This crashes if sk is NULL. Fix this by getting the network namespace from the skb instead. Fixes: e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") Reported-by: Erez Shitrit <erezsh@dev.mellanox.co.il> Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed ↵Hongxu Jia1-2/+2
in 64bit kernel Since 09d9686047db ("netfilter: x_tables: do compat validation via translate_table"), it used compatr structure to assign newinfo structure. In translate_compat_table of ip_tables.c and ip6_tables.c, it used compatr->hook_entry to replace info->hook_entry and compatr->underflow to replace info->underflow, but not do the same replacement in arp_tables.c. It caused invoking 32-bit "arptbale -P INPUT ACCEPT" failed in 64bit kernel. -------------------------------------- root@qemux86-64:~# arptables -P INPUT ACCEPT root@qemux86-64:~# arptables -P INPUT ACCEPT ERROR: Policy for `INPUT' offset 448 != underflow 0 arptables: Incompatible with this kernel -------------------------------------- Fixes: 09d9686047db ("netfilter: x_tables: do compat validation via translate_table") Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-30tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPINGFrancis Yan1-0/+20
This patch exports the sender chronograph stats via the socket SO_TIMESTAMPING channel. Currently we can instrument how long a particular application unit of data was queued in TCP by tracking SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having these sender chronograph stats exported simultaneously along with these timestamps allow further breaking down the various sender limitation. For example, a video server can tell if a particular chunk of video on a connection takes a long time to deliver because TCP was experiencing small receive window. It is not possible to tell before this patch without packet traces. To prepare these stats, the user needs to set SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags while requesting other SOF_TIMESTAMPING TX timestamps. When the timestamps are available in the error queue, the stats are returned in a separate control message of type SCM_TIMESTAMPING_OPT_STATS, in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME, TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30tcp: export sender limits chronographs to TCP_INFOFrancis Yan1-0/+20
This patch exports all the sender chronograph measurements collected in the previous patches to TCP_INFO interface. Note that busy time exported includes all the other sending limits (rwnd-limited, sndbuf-limited). Internally the time unit is jiffy but externally the measurements are in microseconds for future extensions. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30tcp: instrument how long TCP is limited by insufficient send bufferFrancis Yan3-3/+24
This patch measures the amount of time when TCP runs out of new data to send to the network due to insufficient send buffer, while TCP is still busy delivering (i.e. write queue is not empty). The goal is to indicate either the send buffer autotuning or user SO_SNDBUF setting has resulted network under-utilization. The measurement starts conservatively by checking various conditions to minimize false claims (i.e. under-estimation is more likely). The measurement stops when the SOCK_NOSPACE flag is cleared. But it does not account the time elapsed till the next application write. Also the measurement only starts if the sender is still busy sending data, s.t. the limit accounted is part of the total busy time. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30tcp: instrument how long TCP is limited by receive windowFrancis Yan1-2/+9
This patch measures the total time when the TCP stops sending because the receiver's advertised window is not large enough. Note that once the limit is lifted we are likely in the busy status if we have data pending. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30tcp: instrument how long TCP is busy sendingFrancis Yan2-3/+19
This patch measures TCP busy time, which is defined as the period of time when sender has data (or FIN) to send. The time starts when data is buffered and stops when the write queue is flushed by ACKs or error events. Note the busy time does not include SYN time, unless data is included in SYN (i.e. Fast Open). It does include FIN time even if the FIN carries no payload. Excluding pure FIN is possible but would incur one additional test in the fast path, which may not be worth it. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30tcp: instrument tcp sender limits chronographsFrancis Yan1-0/+30
This patch implements the skeleton of the TCP chronograph instrumentation on sender side limits: 1) idle (unspec) 2) busy sending data other than 3-4 below 3) rwnd-limited 4) sndbuf-limited The limits are enumerated 'tcp_chrono'. Since a connection in theory can idle forever, we do not track the actual length of this uninteresting idle period. For the rest we track how long the sender spends in each limit. At any point during the life time of a connection, the sender must be in one of the four states. If there are multiple conditions worthy of tracking in a chronograph then the highest priority enum takes precedence over the other conditions. So that if something "more interesting" starts happening, stop the previous chrono and start a new one. The time unit is jiffy(u32) in order to save space in tcp_sock. This implies application must sample the stats no longer than every 49 days of 1ms jiffy. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30esp4: Fix integrity verification when ESN are usedTobias Brunner1-1/+1
When handling inbound packets, the two halves of the sequence number stored on the skb are already in network order. Fixes: 7021b2e1cddd ("esp4: Switch to new AEAD interface") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-11-29GSO: Reload iph after pskb_may_pullArnaldo Carvalho de Melo1-1/+1
As it may get stale and lead to use after free. Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Alexander Duyck <aduyck@mirantis.com> Cc: Andrey Konovalov <andreyknvl@google.com> Fixes: cbc53e08a793 ("GSO: Add GSO type for fixed IPv4 ID") Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28tcp: Set DEFAULT_TCP_CONG to bbr if DEFAULT_BBR is setJulian Wollrath1-0/+1
Signed-off-by: Julian Wollrath <jwollrath@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-2/+2
udplite conflict is resolved by taking what 'net-next' did which removed the backlog receive method assignment, since it is no longer necessary. Two entries were added to the non-priv ethtool operations switch statement, one in 'net' and one in 'net-next, so simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25net: ipv4, ipv6: run cgroup eBPF egress programsDaniel Mack1-2/+24
If the cgroup associated with the receiving socket has an eBPF programs installed, run them from ip_output(), ip6_output() and ip_mc_output(). From mentioned functions we have two socket contexts as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through okfn()."). We explicitly need to use sk instead of skb->sk here, since otherwise the same program would run multiple times on egress when encap devices are involved, which is not desired in our case. eBPF programs used in this context are expected to either return 1 to let the packet pass, or != 1 to drop them. The programs have access to the skb through bpf_skb_load_bytes(), and the payload starts at the network headers (L3). Note that cgroup_bpf_run_filter() is stubbed out as static inline nop for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if the feature is unused. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24tcp: enhance tcp_collapse_retrans() with skb_shift()Eric Dumazet1-11/+11
In commit 2331ccc5b323 ("tcp: enhance tcp collapsing"), we made a first step allowing copying right skb to left skb head. Since all skbs in socket write queue are headless (but possibly the very first one), this strategy often does not work. This patch extends tcp_collapse_retrans() to perform frag shifting, thanks to skb_shift() helper. This helper needs to not BUG on non headless skbs, as callers are ok with that. Tested: Following packetdrill test now passes : 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8> +.100 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 +0 write(4, ..., 200) = 200 +0 > P. 1:201(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 201:401(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 401:601(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 601:801(200) ack 1 +.001 write(4, ..., 200) = 200 +0 > P. 801:1001(200) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1001:1101(100) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1101:1201(100) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1201:1301(100) ack 1 +.001 write(4, ..., 100) = 100 +0 > P. 1301:1401(100) ack 1 +.099 < . 1:1(0) ack 201 win 257 +.001 < . 1:1(0) ack 201 win 257 <nop,nop,sack 1001:1401> +0 > P. 201:1001(800) ack 1 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24udplite: call proper backlog handlersEric Dumazet3-3/+3
In commits 93821778def10 ("udp: Fix rcv socket locking") and f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite was forgotten. This leads to crashes if UDPlite header is pulled twice, which happens starting from commit e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Bug found by syzkaller team, thanks a lot guys ! Note that backlog use in UDP/UDPlite is scheduled to be removed starting from linux-4.10, so this patch is only needed up to linux-4.9 Fixes: 93821778def1 ("udp: Fix rcv socket locking") Fixes: f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb") Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24netfilter: Update ip_route_me_harder to consider L3 domainDavid Ahern1-1/+4
ip_route_me_harder is not considering the L3 domain and sending lookups to the wrong table. For example consider the following output rule: iptables -I OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset using perf to analyze lookups via the fib_table_lookup tracepoint shows: vrf-test 1187 [001] 46887.295927: fib:fib_table_lookup: table 255 oif 0 iif 0 src 0.0.0.0 dst 10.100.1.254 tos 0 scope 0 flags 0 ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms]) ffffffff81493aac fib_table_lookup ([kernel.kallsyms]) ffffffff8148dda3 __inet_dev_addr_type ([kernel.kallsyms]) ffffffff8148ddf6 inet_addr_type ([kernel.kallsyms]) ffffffff8149e344 ip_route_me_harder ([kernel.kallsyms]) and vrf-test 1187 [001] 46887.295933: fib:fib_table_lookup: table 255 oif 0 iif 1 src 10.100.1.254 dst 10.100.1.2 tos 0 scope 0 flags ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms]) ffffffff81493aac fib_table_lookup ([kernel.kallsyms]) ffffffff814998ff fib4_rule_action ([kernel.kallsyms]) ffffffff81437f35 fib_rules_lookup ([kernel.kallsyms]) ffffffff81499758 __fib_lookup ([kernel.kallsyms]) ffffffff8144f010 fib_lookup.constprop.34 ([kernel.kallsyms]) ffffffff8144f759 __ip_route_output_key_hash ([kernel.kallsyms]) ffffffff8144fc6a ip_route_output_flow ([kernel.kallsyms]) ffffffff8149e39b ip_route_me_harder ([kernel.kallsyms]) In both cases the lookups are directed to table 255 rather than the table associated with the device via the L3 domain. Update both lookups to pull the L3 domain from the dst currently attached to the skb. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller5-24/+125
All conflicts were simple overlapping changes except perhaps for the Thunder driver. That driver has a change_mtu method explicitly for sending a message to the hardware. If that fails it returns an error. Normally a driver doesn't need an ndo_change_mtu method becuase those are usually just range changes, which are now handled generically. But since this extra operation is needed in the Thunder driver, it has to stay. However, if the message send fails we have to restore the original MTU before the change because the entire call chain expects that if an error is thrown by ndo_change_mtu then the MTU did not change. Therefore code is added to nicvf_change_mtu to remember the original MTU, and to restore it upon nicvf_update_hw_max_frs() failue. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21tcp: make undo_cwnd mandatory for congestion modulesFlorian Westphal7-6/+18
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh, which un-does reno halving behaviour. It seems more appropriate to let congctl algorithms pair .ssthresh and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it up for all congestion algorithms that used to rely on the fallback. Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21tcp: add cwnd_undo functions to various tcp cc algorithmsFlorian Westphal5-1/+55
congestion control algorithms that do not halve cwnd in their .ssthresh should provide a .cwnd_undo rather than rely on current fallback which assumes reno halving (and thus doubles the cwnd). All of these do 'something else' in their .ssthresh implementation, thus store the cwnd on loss and provide .undo_cwnd to restore it again. A followup patch will remove the fallback and all algorithms will need to provide a .cwnd_undo function. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21tcp: zero ca_priv area when switching cc algorithmsFlorian Westphal1-1/+3
We need to zero out the private data area when application switches connection to different algorithm (TCP_CONGESTION setsockopt). When congestion ops get assigned at connect time everything is already zeroed because sk_alloc uses GFP_ZERO flag. But in the setsockopt case this contains whatever previous cc placed there. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>