summaryrefslogtreecommitdiffstats
path: root/Documentation/networking/ip-sysctl.txt
AgeCommit message (Collapse)AuthorFilesLines
2017-04-24net/tcp_fastopen: Disable active side TFO in certain scenariosWei Wang1-0/+8
Middlebox firewall issues can potentially cause server's data being blackholed after a successful 3WHS using TFO. Following are the related reports from Apple: https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf Slide 31 identifies an issue where the client ACK to the server's data sent during a TFO'd handshake is dropped. C ---> syn-data ---> S C <--- syn/ack ----- S C (accept & write) C <---- data ------- S C ----- ACK -> X S [retry and timeout] https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf Slide 5 shows a similar situation that the server's data gets dropped after 3WHS. C ---- syn-data ---> S C <--- syn/ack ----- S C ---- ack --------> S S (accept & write) C? X <- data ------ S [retry and timeout] This is the worst failure b/c the client can not detect such behavior to mitigate the situation (such as disabling TFO). Failing to proceed, the application (e.g., SSL library) may simply timeout and retry with TFO again, and the process repeats indefinitely. The proposed solution is to disable active TFO globally under the following circumstances: 1. client side TFO socket detects out of order FIN 2. client side TFO socket receives out of order RST We disable active side TFO globally for 1hr at first. Then if it happens again, we disable it for 2h, then 4h, 8h, ... And we reset the timeout to 1hr if a client side TFO sockets not opened on loopback has successfully received data segs from server. And we examine this condition during close(). The rational behind it is that when such firewall issue happens, application running on the client should eventually close the socket as it is not able to get the data it is expecting. Or application running on the server should close the socket as it is not able to receive any response from client. In both cases, out of order FIN or RST will get received on the client given that the firewall will not block them as no data are in those frames. And we want to disable active TFO globally as it helps if the middle box is very close to the client and most of the connections are likely to fail. Also, add a debug sysctl: tcp_fastopen_blackhole_detect_timeout_sec: the initial timeout to use when firewall blackhole issue happens. This can be set and read. When setting it to 0, it means to disable the active disable logic. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Add sysctl to toggle early demux for tcp and udpsubashab@codeaurora.org1-1/+10
Certain system process significant unconnected UDP workload. It would be preferrable to disable UDP early demux for those systems and enable it for TCP only. By disabling UDP demux, we see these slight gains on an ARM64 system- 782 -> 788Mbps unconnected single stream UDPv4 633 -> 654Mbps unconnected UDPv4 different sources The performance impact can change based on CPU architecure and cache sizes. There will not much difference seen if entire UDP hash table is in cache. Both sysctls are enabled by default to preserve existing behavior. v1->v2: Change function pointer instead of adding conditional as suggested by Stephen. v2->v3: Read once in callers to avoid issues due to compiler optimizations. Also update commit message with the tests. v3->v4: Store and use read once result instead of querying pointer again incorrectly. v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n} Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Suggested-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Tom Herbert <tom@herbertland.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: ipv6: Add sysctl for minimum prefix len acceptable in RIOs.Joel Scherpelz1-2/+11
This commit adds a new sysctl accept_ra_rt_info_min_plen that defines the minimum acceptable prefix length of Route Information Options. The new sysctl is intended to be used together with accept_ra_rt_info_max_plen to configure a range of acceptable prefix lengths. It is useful to prevent misconfigurations from unintentionally blackholing too much of the IPv6 address space (e.g., home routers announcing RIOs for fc00::/7, which is incorrect). Signed-off-by: Joel Scherpelz <jscherpelz@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21net: ipv4: add support for ECMP hash policy choiceNikolay Aleksandrov1-0/+8
This patch adds support for ECMP hash policy choice via a new sysctl called fib_multipath_hash_policy and also adds support for L4 hashes. The current values for fib_multipath_hash_policy are: 0 - layer 3 (default) 1 - layer 4 If there's an skb hash already set and it matches the chosen policy then it will be used instead of being calculated (currently only for L4). In L3 mode we always calculate the hash due to the ICMP error special case, the flow dissector's field consistentification should handle the address order thus we can remove the address reversals. If the skb is provided we always use it for the hash calculation, otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16tcp: remove tcp_tw_recycleSoheil Hassas Yeganeh1-5/+0
The tcp_tw_recycle was already broken for connections behind NAT, since the per-destination timestamp is not monotonically increasing for multiple machines behind a single destination address. After the randomization of TCP timestamp offsets in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection), the tcp_tw_recycle is broken for all types of connections for the same reason: the timestamps received from a single machine is not monotonically increasing, anymore. Remove tcp_tw_recycle, since it is not functional. Also, remove the PAWSPassive SNMP counter since it is only used for tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req since the strict argument is only set when tcp_tw_recycle is enabled. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Lutz Vieweg <lvml@5t9.de> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12Make IP 'forwarding' doc more preciseNeil Jerram1-1/+2
It wasn't clear if the 'forwarding' setting needs to be enabled on the interface that packets are received from, or on the interface that packets are forwarded to, or both. In fact (according to my code reading) the setting is relevant on the interface that packets are received from, so this change updates the doc to say that. Signed-off-by: Neil Jerram <neil@tigera.io> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30net: Avoid receiving packets with an l3mdev on unbound UDP socketsRobert Shearman1-0/+7
Packets arriving in a VRF currently are delivered to UDP sockets that aren't bound to any interface. TCP defaults to not delivering packets arriving in a VRF to unbound sockets. IP route lookup and socket transmit both assume that unbound means using the default table and UDP applications that haven't been changed to be aware of VRFs may not function correctly in this case since they may not be able to handle overlapping IP address ranges, or be able to send packets back to the original sender if required. So add a sysctl, udp_l3mdev_accept, to control this behaviour with it being analgous to the existing tcp_l3mdev_accept, namely to allow a process to have a VRF-global listen socket. Have this default to off as this is the behaviour that users will expect, given that there is no explicit mechanism to set unmodified VRF-unaware application into a default VRF. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24Introduce a sysctl that modifies the value of PROT_SOCK.Krister Johansen1-0/+9
Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl that denotes the first unprivileged inet port in the namespace. To disable all privileged ports set this to zero. It also checks for overlap with the local port range. The privileged and local range may not overlap. The use case for this change is to allow containerized processes to bind to priviliged ports, but prevent them from ever being allowed to modify their container's network configuration. The latter is accomplished by ensuring that the network namespace is not a child of the user namespace. This modification was needed to allow the container manager to disable a namespace's priviliged port restrictions without exposing control of the network namespace to processes in the user namespace. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13tcp: remove thin_dupack featureYuchung Cheng1-12/+0
Thin stream DUPACK is to start fast recovery on only one DUPACK provided the connection is a thin stream (i.e., low inflight). But this older feature is now subsumed with RACK. If a connection receives only a single DUPACK, RACK would arm a reordering timer and soon starts fast recovery instead of timeout if no further ACKs are received. The socket option (THIN_DUPACK) is kept as a nop for compatibility. Note that this patch does not change another thin-stream feature which enables linear RTO. Although it might be good to generalize that in the future (i.e., linear RTO for the first say 3 retries). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13tcp: remove early retransmitYuchung Cheng1-14/+5
This patch removes the support of RFC5827 early retransmit (i.e., fast recovery on small inflight with <3 dupacks) because it is subsumed by the new RACK loss detection. More specifically when RACK receives DUPACKs, it'll arm a reordering timer to start fast recovery after a quarter of (min)RTT, hence it covers the early retransmit except RACK does not limit itself to specific inflight or dupack numbers. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03ipv6 addrconf: Implemented enhanced DAD (RFC7527)Erik Nordmark1-0/+9
Implemented RFC7527 Enhanced DAD. IPv6 duplicate address detection can fail if there is some temporary loopback of Ethernet frames. RFC7527 solves this by including a random nonce in the NS messages used for DAD, and if an NS is received with the same nonce it is assumed to be a looped back DAD probe and is ignored. RFC7527 is enabled by default. Can be disabled by setting both of conf/{all,interface}/enhanced_dad to zero. Signed-off-by: Erik Nordmark <nordmark@arista.com> Signed-off-by: Bob Gilligan <gilligan@arista.com> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02tcp: allow to turn tcp timestamp randomization offFlorian Westphal1-2/+7
Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple flows, we can deduct the relative qdisc delays (think of fq pacing). This should work even if we have one flow per remote peer." Having random per flow (or host) offsets doesn't allow that anymore so add a way to turn this off. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09igmp: Document sysctl force_igmp_versionHangbin Liu1-0/+15
There is some difference between force_igmp_version and force_mld_version. Add document to make users aware of this. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23net-tcp: retire TFO_SERVER_WO_SOCKOPT2 configYuchung Cheng1-22/+23
TFO_SERVER_WO_SOCKOPT2 was intended for debugging purposes during Fast Open development. Remove this config option and also update/clean-up the documentation of the Fast Open sysctl. Reported-by: Piotr Jurkiewicz <piotr.jerzy.jurkiewicz@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-29Documentation: ip-sysctl.txt: clarify secure_redirectsEric Garver1-3/+5
Clarify how secure_redirects works. Mention that RFC1122 always applies. Signed-off-by: Eric Garver <e@erig.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11net: ipv4: Consider failed nexthops in multipath routesDavid Ahern1-0/+10
Multipath route lookups should consider knowledge about next hops and not select a hop that is known to be failed. Example: [h2] [h3] 15.0.0.5 | | 3| 3| [SP1] [SP2]--+ 1 2 1 2 | | /-------------+ | | \ / | | X | | / \ | | / \---------------\ | 1 2 1 2 12.0.0.2 [TOR1] 3-----------------3 [TOR2] 12.0.0.3 4 4 \ / \ / \ / -------| |-----/ 1 2 [TOR3] 3| | [h1] 12.0.0.1 host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5: root@h1:~# ip ro ls ... 12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1 15.0.0.0/16 nexthop via 12.0.0.2 dev swp1 weight 1 nexthop via 12.0.0.3 dev swp1 weight 1 ... If the link between tor3 and tor1 is down and the link between tor1 and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and ssh 15.0.0.5 gets the other. Connections that attempt to use the 12.0.0.2 nexthop fail since that neighbor is not reachable: root@h1:~# ip neigh show ... 12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE 12.0.0.2 dev swp1 FAILED ... The failed path can be avoided by considering known neighbor information when selecting next hops. If the neighbor lookup fails we have no knowledge about the nexthop, so give it a shot. If there is an entry then only select the nexthop if the state is sane. This is similar to what fib_detect_death does. To maintain backward compatibility use of the neighbor information is based on a new sysctl, fib_multipath_use_neigh. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-21igmp: Document sysctl_igmp_max_msfBenjamin Poirier1-3/+8
Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-21net: Fix indentation of the conf/ documentation blockBenjamin Poirier1-5/+5
Commit d67ef35fff67 ("clarify documentation for net.ipv4.igmp_max_memberships") mistakenly indented a block of documentation such that it now looks like it belongs to a specific sysctl. Restore that block's original position. Cc: Jeremy Eder <jeder@redhat.com> Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25net: ipv6: Make address flushing on ifdown optionalDavid Ahern1-0/+9
Currently, all ipv6 addresses are flushed when the interface is configured down, including global, static addresses: $ ip -6 addr show dev eth1 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000 inet6 2100:1::2/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe79:34bd/64 scope link valid_lft forever preferred_lft forever $ ip link set dev eth1 down $ ip -6 addr show dev eth1 << nothing; all addresses have been flushed>> Add a new sysctl to make this behavior optional. The new setting defaults to flush all addresses to maintain backwards compatibility. When the set global addresses with no expire times are not flushed on an admin down. The sysctl is per-interface or system-wide for all interfaces $ sysctl -w net.ipv6.conf.eth1.keep_addr_on_down=1 or $ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1 Will keep addresses on eth1 on an admin down. $ ip -6 addr show dev eth1 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000 inet6 2100:1::2/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe79:34bd/64 scope link valid_lft forever preferred_lft forever $ ip link set dev eth1 down $ ip -6 addr show dev eth1 3: eth1: <BROADCAST,MULTICAST> mtu 1500 state DOWN qlen 1000 inet6 2100:1::2/120 scope global tentative valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe79:34bd/64 scope link tentative valid_lft forever preferred_lft forever Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11ipv6: add option to drop unsolicited neighbor advertisementsJohannes Berg1-0/+7
In certain 802.11 wireless deployments, there will be NA proxies that use knowledge of the network to correctly answer requests. To prevent unsolicitd advertisements on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_unsolicited_na". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11ipv6: add option to drop unicast encapsulated in L2 multicastJohannes Berg1-0/+6
In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv6 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11ipv4: add option to drop gratuitous ARP packetsJohannes Berg1-0/+6
In certain 802.11 wireless deployments, there will be ARP proxies that use knowledge of the network to correctly answer requests. To prevent gratuitous ARP frames on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_gratuitous_arp". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11ipv4: add option to drop unicast encapsulated in L2 multicastJohannes Berg1-0/+7
In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv4 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Additionally, enabling this option provides compliance with a SHOULD clause of RFC 1122. Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-20net: change tcp_syn_retries documentationXin Long1-1/+1
Documentation should be kept consistent with the code: static int tcp_syn_retries_max = MAX_TCP_SYNCNT; #define MAX_TCP_SYNCNT 127 Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18net: Allow accepted sockets to be bound to l3mdev domainDavid Ahern1-0/+8
Allow accepted sockets to derive their sk_bound_dev_if setting from the l3mdev domain in which the packets originated. A sysctl setting is added to control the behavior which is similar to sk_mark and sysctl_tcp_fwmark_accept. This effectively allow a process to have a "VRF-global" listen socket, with child sockets bound to the VRF device in which the packet originated. A similar behavior can be achieved using sk_mark, but a solution using marks is incomplete as it does not handle duplicate addresses in different L3 domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev domain provides a complete solution. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16net: sctp: dynamically enable or disable pf stateZhu Yanjun1-1/+22
As we all know, the value of pf_retrans >= max_retrans_path can disable pf state. The variables of pf_retrans and max_retrans_path can be changed by the userspace application. Sometimes the user expects to disable pf state while the 2 variables are changed to enable pf state. So it is necessary to introduce a new variable to disable pf state. According to the suggestions from Vlad Yasevich, extra1 and extra2 are removed. The initialization of pf_enable is added. Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-09net: Documentation: Fix default value tcp_limit_output_bytesNiklas Cassel1-1/+1
Commit c39c4c6abb89 ("tcp: double default TSQ output bytes limit") updated default value for tcp_limit_output_bytes Signed-off-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30Merge branch 'master' of ↵David S. Miller1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2015-10-30 1) The flow cache is limited by the flow cache limit which depends on the number of cpus and the xfrm garbage collector threshold which is independent of the number of cpus. This leads to the fact that on systems with more than 16 cpus we hit the xfrm garbage collector limit and refuse new allocations, so new flows are dropped. On systems with 16 or less cpus, we hit the flowcache limit. In this case, we shrink the flow cache instead of refusing new flows. We increase the xfrm garbage collector threshold to INT_MAX to get the same behaviour, independent of the number of cpus. 2) Fix some unaligned accesses on sparc systems. From Sowmini Varadhan. 3) Fix some header checks in _decode_session4. We may call pskb_may_pull with a negative value converted to unsigened int from pskb_may_pull. This can lead to incorrect policy lookups. We fix this by a check of the data pointer position before we call pskb_may_pull. 4) Reload skb header pointers after calling pskb_may_pull in _decode_session4 as this may change the pointers into the packet. 5) Add a missing statistic counter on inner mode errors. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: use RACK to detect lossesYuchung Cheng1-0/+9
This patch implements the second half of RACK that uses the the most recent transmit time among all delivered packets to detect losses. tcp_rack_mark_lost() is called upon receiving a dubious ACK. It then checks if an not-yet-sacked packet was sent at least "reo_wnd" prior to the sent time of the most recently delivered. If so the packet is deemed lost. The "reo_wnd" reordering window starts with 1msec for fast loss detection and changes to min-RTT/4 when reordering is observed. We found 1msec accommodates well on tiny degree of reordering (<3 pkts) on faster links. We use min-RTT instead of SRTT because reordering is more of a path property but SRTT can be inflated by self-inflicated congestion. The factor of 4 is borrowed from the delayed early retransmit and seems to work reasonably well. Since RACK is still experimental, it is now used as a supplemental loss detection on top of existing algorithms. It is only effective after the fast recovery starts or after the timeout occurs. The fast recovery is still triggered by FACK and/or dupack threshold instead of RACK. We introduce a new sysctl net.ipv4.tcp_recovery for future experiments of loss recoveries. For now RACK can be disabled by setting it to 0. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: track min RTT using windowed min-filterYuchung Cheng1-0/+8
Kathleen Nichols' algorithm for tracking the minimum RTT of a data stream over some measurement window. It uses constant space and constant time per update. Yet it almost always delivers the same minimum as an implementation that has to keep all the data in the window. The measurement window is tunable via sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes. The algorithm keeps track of the best, 2nd best & 3rd best min values, maintaining an invariant that the measurement time of the n'th best >= n-1'th best. It also makes sure that the three values are widely separated in the time window since that bounds the worse case error when that data is monotonically increasing over the window. Upon getting a new min, we can forget everything earlier because it has no value - the new min is less than everything else in the window by definition and it's the most recent. So we restart fresh on every new min and overwrites the 2nd & 3rd choices. The same property holds for the 2nd & 3rd best. Therefore we have to maintain two invariants to maximize the information in the samples, one on values (1st.v <= 2nd.v <= 3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <= now). These invariants determine the structure of the code The RTT input to the windowed filter is the minimum RTT measured from ACK or SACK, or as the last resort from TCP timestamps. The accessor tcp_min_rtt() returns the minimum RTT seen in the window. ~0U indicates it is not available. The minimum is 1usec even if the true RTT is below that. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-14Revert "ipv4/icmp: redirect messages can use the ingress daddr as source"Paolo Abeni1-17/+2
Revert the commit e2ca690b657f ("ipv4/icmp: redirect messages can use the ingress daddr as source"), which tried to introduce a more suitable behaviour for ICMP redirect messages generated by VRRP routers. However RFC 5798 section 8.1.1 states: The IPv4 source address of an ICMP redirect should be the address that the end-host used when making its next-hop routing decision. while said commit used the generating packet destination address, which do not match the above and in most cases leads to no redirect packets to be generated. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12ipv4/icmp: redirect messages can use the ingress daddr as sourcePaolo Abeni1-2/+17
This patch allows configuring how the source address of ICMP redirect messages is selected; by default the old behaviour is retained, while setting icmp_redirects_use_orig_daddr force the usage of the destination address of the packet that caused the redirect. The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the following scenario: Two machines are set up with VRRP to act as routers out of a subnet, they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to x.x.x.254/24. If a host in said subnet needs to get an ICMP redirect from the VRRP router, i.e. to reach a destination behind a different gateway, the source IP in the ICMP redirect is chosen as the primary IP on the interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2. The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2, and will continue to use the wrong next-op. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29xfrm: Let the flowcache handle its size by default.Steffen Klassert1-2/+4
The xfrm flowcache size is limited by the flowcache limit (4096 * number of online cpus) and the xfrm garbage collector threshold (2 * 32768), whatever is reached first. This means that we can hit the garbage collector limit only on systems with more than 16 cpus. On such systems we simply refuse new allocations if we reach the limit, so new flows are dropped. On syslems with 16 or less cpus, we hit the flowcache limit. In this case, we shrink the flow cache instead of refusing new flows. We increase the xfrm garbage collector threshold to INT_MAX to get the same behaviour, independent of the number of cpus. The xfrm garbage collector threshold can still be set below the flowcache limit to reduce the memory usage of the flowcache. Tested-by: Dan Streetman <dan.streetman@canonical.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-08-31IGMP: Document igmp_link_local_mcast_reportsPhilip Downey1-0/+5
Document the addition of a new sysctl variable which controls the generation of IGMP reports for link local multicast groups in the 224.0.0.X range. IGMP reports for local multicast groups can now be optionally inhibited by setting the value to zero e.g.: echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports To retain backwards compatibility the previous behaviour is retained by default on system boot or reverted by setting the value back to non-zero. Signed-off-by: Philip Downey <pdowney@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25tcp: refine pacing rate determinationEric Dumazet1-0/+15
When TCP pacing was added back in linux-3.12, we chose to apply a fixed ratio of 200 % against current rate, to allow probing for optimal throughput even during slow start phase, where cwnd can be doubled every other gRTT. At Google, we found it was better applying a different ratio while in Congestion Avoidance phase. This ratio was set to 120 %. We've used the normal tcp_in_slow_start() helper for a while, then tuned the condition to select the conservative ratio as soon as cwnd >= ssthresh/2 : - After cwnd reduction, it is safer to ramp up more slowly, as we approach optimal cwnd. - Initial ramp up (ssthresh == INFINITY) still allows doubling cwnd every other RTT. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-12net: Document xfrm4_gc_thresh and xfrm6_gc_threshAlexander Duyck1-0/+10
This change adds documentation for xfrm4_gc_thresh and xfrm6_gc_thresh based on the comments in commit eeb1b73378b56 ("xfrm: Increase the garbage collector threshold"). Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-07-31ipv6: Enable auto flow labels by defaultTom Herbert1-1/+1
Initialize auto_flowlabels to one. This enables automatic flow labels, individual socket may disable them using the IPV6_AUTOFLOWLABEL socket option. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31ipv6: Implement different admin modes for automatic flow labelsTom Herbert1-7/+13
Change the meaning of net.ipv6.auto_flowlabels to provide a mode for automatic flow labels generation. There are four modes: 0: flow labels are disabled 1: flow labels are enabled, sockets can opt-out 2: flow labels are allowed, sockets can opt-in 3: flow labels are enabled and enforced, no opt-out for sockets np->autoflowlabel is initialized according to the sysctl value. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30net/ipv6: add sysctl option accept_ra_min_hop_limitHangbin Liu1-0/+8
Commit 6fd99094de2b ("ipv6: Don't reduce hop limit for an interface") disabled accept hop limit from RA if it is smaller than the current hop limit for security stuff. But this behavior kind of break the RFC definition. RFC 4861, 6.3.4. Processing Received Router Advertisements A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time, and Retrans Timer) may contain a value denoting that it is unspecified. In such cases, the parameter should be ignored and the host should continue using whatever value it is already using. If the received Cur Hop Limit value is non-zero, the host SHOULD set its CurHopLimit variable to the received value. So add sysctl option accept_ra_min_hop_limit to let user choose the minimum hop limit value they can accept from RA. And set default to 1 to meet RFC standards. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22ipv6: sysctl to restrict candidate source addressesErik Kline1-0/+7
Per RFC 6724, section 4, "Candidate Source Addresses": It is RECOMMENDED that the candidate source addresses be the set of unicast addresses assigned to the interface that will be used to send to the destination (the "outgoing" interface). Add a sysctl to enable this behaviour. Signed-off-by: Erik Kline <ek@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09ipv6: Nonlocal bindTom Herbert1-0/+5
Add support to allow non-local binds similar to how this was done for IPv4. Non-local binds are very useful in emulating the Internet in a box, etc. This add the ip_nonlocal_bind sysctl under ipv6. Testing: Set up nonlocal binding and receive routing on a host, e.g.: ip -6 rule add from ::/0 iif eth0 lookup 200 ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200 sysctl -w net.ipv6.ip_nonlocal_bind=1 Set up routing to 2001:0:0:1::/64 on peer to go to first host ping6 -I 2001:0:0:1::1 peer-address -- to verify Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-27tcp/dccp: try to not exhaust ip_local_port_range in connect()Eric Dumazet1-3/+5
A long standing problem on busy servers is the tiny available TCP port range (/proc/sys/net/ipv4/ip_local_port_range) and the default sequential allocation of source ports in connect() system call. If a host is having a lot of active TCP sessions, chances are very high that all ports are in use by at least one flow, and subsequent bind(0) attempts fail, or have to scan a big portion of space to find a slot. In this patch, I changed the starting point in __inet_hash_connect() so that we try to favor even [1] ports, leaving odd ports for bind() users. We still perform a sequential search, so there is no guarantee, but if connect() targets are very different, end result is we leave more ports available to bind(), and we spread them all over the range, lowering time for both connect() and bind() to find a slot. This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range is even, ie if start/end values have different parity. Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to 32768 - 60999 (instead of 32768 - 61000) There is no change on security aspects here, only some poor hashing schemes could be eventually impacted by this change. [1] : The odd/even property depends on ip_local_port_range values parity Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19tcp: add rfc3168, section 6.1.1.1. fallbackDaniel Borkmann1-0/+9
This work as a follow-up of commit f7b3bec6f516 ("net: allow setting ecn via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing ECN connections. In other words, this work adds a retry with a non-ECN setup SYN packet, as suggested from the RFC on the first timeout: [...] A host that receives no reply to an ECN-setup SYN within the normal SYN retransmission timeout interval MAY resend the SYN and any subsequent SYN retransmissions with CWR and ECE cleared. [...] Schematic client-side view when assuming the server is in tcp_ecn=2 mode, that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend ECN sysctl to allow server-side only ECN"): 1) Normal ECN-capable path: SYN ECE CWR -----> <----- SYN ACK ECE ACK -----> 2) Path with broken middlebox, when client has fallback: SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN -----> <----- SYN ACK ACK -----> In case we would not have the fallback implemented, the middlebox drop point would basically end up as: SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) In any case, it's rather a smaller percentage of sites where there would occur such additional setup latency: it was found in end of 2014 that ~56% of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the fallback would mitigate with a slight latency trade-off. Recent related paper on this topic: Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth, Gorry Fairhurst, and Richard Scheffenegger: "Enabling Internet-Wide Deployment of Explicit Congestion Notification." Proc. PAM 2015, New York. http://ecn.ethz.ch/ecn-pam15.pdf Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168, section 6.1.1.1. fallback on timeout. For users explicitly not wanting this which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that allows for disabling the fallback. tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but rather we let tcp_ecn_rcv_synack() take that over on input path in case a SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent ECN being negotiated eventually in that case. Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch> Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch> Cc: Eric Dumazet <edumazet@google.com> Cc: Dave That <dave.taht@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-03ipv6: Flow label state rangesTom Herbert1-0/+8
This patch divides the IPv6 flow label space into two ranges: 0-7ffff is reserved for flow label manager, 80000-fffff will be used for creating auto flow labels (per RFC6438). This only affects how labels are set on transmit, it does not affect receive. This range split can be disbaled by systcl. Background: IPv6 flow labels have been an unmitigated disappointment thus far in the lifetime of IPv6. Support in HW devices to use them for ECMP is lacking, and OSes don't turn them on by default. If we had these we could get much better hashing in IPv6 networks without resorting to DPI, possibly eliminating some of the motivations to to define new encaps in UDP just for getting ECMP. Unfortunately, the initial specfications of IPv6 did not clarify how they are to be used. There has always been a vague concept that these can be used for ECMP, flow hashing, etc. and we do now have a good standard how to this in RFC6438. The problem is that flow labels can be either stateful or stateless (as in RFC6438), and we are presented with the possibility that a stateless label may collide with a stateful one. Attempts to split the flow label space were rejected in IETF. When we added support in Linux for RFC6438, we could not turn on flow labels by default due to this conflict. This patch splits the flow label space and should give us a path to enabling auto flow labels by default for all IPv6 packets. This is an API change so we need to consider compatibility with existing deployment. The stateful range is chosen to be the lower values in hopes that most uses would have chosen small numbers. Once we resolve the stateless/stateful issue, we can proceed to look at enabling RFC6438 flow labels by default (starting with scaled testing). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23ipv6: add documentation for stable_secret, idgen_delay and idgen_retries knobsHannes Frederic Sowa1-0/+25
Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20net: neighbour: Document {mcast, ucast}_solicit, mcast_resolicit.YOSHIFUJI Hideaki/吉藤英明1-1/+13
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06ipv4: Documenting two sysctls for tcp PMTU probeFan Du1-0/+10
Namely tcp_probe_interval to control how often to restart a probe. And tcp_probe_threshold to control when stop the probing in respect to the width of search range in bytes Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacksNeal Cardwell1-0/+22
Helpers for mitigating ACK loops by rate-limiting dupacks sent in response to incoming out-of-window packets. This patch includes: - rate-limiting logic - sysctl to control how often we allow dupacks to out-of-window packets - SNMP counter for cases where we rate-limited our dupack sending The rate-limiting logic in this patch decides to not send dupacks in response to out-of-window segments if (a) they are SYNs or pure ACKs and (b) the remote endpoint is sending them faster than the configured rate limit. We rate-limit our responses rather than blocking them entirely or resetting the connection, because legitimate connections can rely on dupacks in response to some out-of-window segments. For example, zero window probes are typically sent with a sequence number that is below the current window, and ZWPs thus expect to thus elicit a dupack in response. We allow dupacks in response to TCP segments with data, because these may be spurious retransmissions for which the remote endpoint wants to receive DSACKs. This is safe because segments with data can't realistically be part of ACK loops, which by their nature consist of each side sending pure/data-less ACKs to each other. The dupack interval is controlled by a new sysctl knob, tcp_invalid_ratelimit, given in milliseconds, in case an administrator needs to dial this upward in the face of a high-rate DoS attack. The name and units are chosen to be analogous to the existing analogous knob for ICMP, icmp_ratelimit. The default value for tcp_invalid_ratelimit is 500ms, which allows at most one such dupack per 500ms. This is chosen to be 2x faster than the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule 2.4). We allow the extra 2x factor because network delay variations can cause packets sent at 1 second intervals to be compressed and arrive much closer. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25net: ipv6: Add sysctl entry to disable MTU updates from RAHarout Hedeshian1-0/+7
The kernel forcefully applies MTU values received in router advertisements provided the new MTU is less than the current. This behavior is undesirable when the user space is managing the MTU. Instead a sysctl flag 'accept_ra_mtu' is introduced such that the user space can control whether or not RA provided MTU updates should be applied. The default behavior is unchanged; user space must explicitly set this flag to 0 for RA MTUs to be ignored. Signed-off-by: Harout Hedeshian <harouth@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12update ip-sysctl.txt documentation (v2)Ani Sinha1-0/+2
Update documentation to reflect the fact that /proc/sys/net/ipv4/route/max_size is no longer used for ipv4. Signed-off-by: Ani Sinha <ani@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>