From bafd4bd4dcfa13145db7f951251eef3e10f8c278 Mon Sep 17 00:00:00 2001 From: Steffen Klassert Date: Mon, 9 Sep 2013 10:38:38 +0200 Subject: xfrm: Decode sessions with output interface. The output interface matching does not work on forward policy lookups, the output interface of the flowi is always 0. Fix this by setting the output interface when we decode the session. Signed-off-by: Steffen Klassert --- net/ipv4/xfrm4_policy.c | 1 + 1 file changed, 1 insertion(+) (limited to 'net/ipv4') diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c index 9a459be24af7..ccde54248c8c 100644 --- a/net/ipv4/xfrm4_policy.c +++ b/net/ipv4/xfrm4_policy.c @@ -107,6 +107,7 @@ _decode_session4(struct sk_buff *skb, struct flowi *fl, int reverse) memset(fl4, 0, sizeof(struct flowi4)); fl4->flowi4_mark = skb->mark; + fl4->flowi4_oif = skb_dst(skb)->dev->ifindex; if (!ip_is_fragment(iph)) { switch (iph->protocol) { -- cgit v1.2.3 From 7263a5187f9e9de45fcb51349cf0e031142c19a1 Mon Sep 17 00:00:00 2001 From: Christophe Gouault Date: Tue, 8 Oct 2013 17:21:22 +0200 Subject: vti: get rid of nf mark rule in prerouting This patch fixes and improves the use of vti interfaces (while lightly changing the way of configuring them). Currently: - it is necessary to identify and mark inbound IPsec packets destined to each vti interface, via netfilter rules in the mangle table at prerouting hook. - the vti module cannot retrieve the right tunnel in input since commit b9959fd3: vti tunnels all have an i_key, but the tunnel lookup is done with flag TUNNEL_NO_KEY, so there no chance to retrieve them. - the i_key is used by the outbound processing as a mark to lookup for the right SP and SA bundle. This patch uses the o_key to store the vti mark (instead of i_key) and enables: - to avoid the need for previously marking the inbound skbuffs via a netfilter rule. - to properly retrieve the right tunnel in input, only based on the IPsec packet outer addresses. - to properly perform an inbound policy check (using the tunnel o_key as a mark). - to properly perform an outbound SPD and SAD lookup (using the tunnel o_key as a mark). - to keep the current mark of the skbuff. The skbuff mark is neither used nor changed by the vti interface. Only the vti interface o_key is used. SAs have a wildcard mark. SPs have a mark equal to the vti interface o_key. The vti interface must be created as follows (i_key = 0, o_key = mark): ip link add vti1 mode vti local 1.1.1.1 remote 2.2.2.2 okey 1 The SPs attached to vti1 must be created as follows (mark = vti1 o_key): ip xfrm policy add dir out mark 1 tmpl src 1.1.1.1 dst 2.2.2.2 \ proto esp mode tunnel ip xfrm policy add dir in mark 1 tmpl src 2.2.2.2 dst 1.1.1.1 \ proto esp mode tunnel The SAs are created with the default wildcard mark. There is no distinction between global vs. vti SAs. Just their addresses will possibly link them to a vti interface: ip xfrm state add src 1.1.1.1 dst 2.2.2.2 proto esp spi 1000 mode tunnel \ enc "cbc(aes)" "azertyuiopqsdfgh" ip xfrm state add src 2.2.2.2 dst 1.1.1.1 proto esp spi 2000 mode tunnel \ enc "cbc(aes)" "sqbdhgqsdjqjsdfh" To avoid matching "global" (not vti) SPs in vti interfaces, global SPs should no use the default wildcard mark, but explicitly match mark 0. To avoid a double SPD lookup in input and output (in global and vti SPDs), the NOPOLICY and NOXFRM options should be set on the vti interfaces: echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_policy echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_xfrm The outgoing traffic is steered to vti1 by a route via the vti interface: ip route add 192.168.0.0/16 dev vti1 The incoming IPsec traffic is steered to vti1 because its outer addresses match the vti1 tunnel configuration. Signed-off-by: Christophe Gouault Signed-off-by: David S. Miller --- net/ipv4/ip_vti.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c index e805e7b3030e..6e87f853d033 100644 --- a/net/ipv4/ip_vti.c +++ b/net/ipv4/ip_vti.c @@ -125,8 +125,17 @@ static int vti_rcv(struct sk_buff *skb) iph->saddr, iph->daddr, 0); if (tunnel != NULL) { struct pcpu_tstats *tstats; + u32 oldmark = skb->mark; + int ret; - if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) + + /* temporarily mark the skb with the tunnel o_key, to + * only match policies with this mark. + */ + skb->mark = be32_to_cpu(tunnel->parms.o_key); + ret = xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb); + skb->mark = oldmark; + if (!ret) return -1; tstats = this_cpu_ptr(tunnel->dev->tstats); @@ -135,7 +144,6 @@ static int vti_rcv(struct sk_buff *skb) tstats->rx_bytes += skb->len; u64_stats_update_end(&tstats->syncp); - skb->mark = 0; secpath_reset(skb); skb->dev = tunnel->dev; return 1; @@ -167,7 +175,7 @@ static netdev_tx_t vti_tunnel_xmit(struct sk_buff *skb, struct net_device *dev) memset(&fl4, 0, sizeof(fl4)); flowi4_init_output(&fl4, tunnel->parms.link, - be32_to_cpu(tunnel->parms.i_key), RT_TOS(tos), + be32_to_cpu(tunnel->parms.o_key), RT_TOS(tos), RT_SCOPE_UNIVERSE, IPPROTO_IPIP, 0, dst, tiph->saddr, 0, 0); -- cgit v1.2.3 From 031afe4990a7c9dbff41a3a742c44d3e740ea0a1 Mon Sep 17 00:00:00 2001 From: Yuchung Cheng Date: Sat, 12 Oct 2013 10:16:27 -0700 Subject: tcp: fix incorrect ca_state in tail loss probe On receiving an ACK that covers the loss probe sequence, TLP immediately sets the congestion state to Open, even though some packets are not recovered and retransmisssion are on the way. The later ACks may trigger a WARN_ON check in step D of tcp_fastretrans_alert(), e.g., https://bugzilla.redhat.com/show_bug.cgi?id=989251 The fix is to follow the similar procedure in recovery by calling tcp_try_keep_open(). The sender switches to Open state if no packets are retransmissted. Otherwise it goes to Disorder and let subsequent ACKs move the state to Recovery or Open. Reported-By: Michael Sterrett Tested-By: Dormando Signed-off-by: Yuchung Cheng Acked-by: Neal Cardwell Signed-off-by: David S. Miller --- net/ipv4/tcp_input.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net/ipv4') diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 113dc5f17d47..53974c7c04fe 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3291,7 +3291,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag) tcp_init_cwnd_reduction(sk, true); tcp_set_ca_state(sk, TCP_CA_CWR); tcp_end_cwnd_reduction(sk); - tcp_set_ca_state(sk, TCP_CA_Open); + tcp_try_keep_open(sk); NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSPROBERECOVERY); } -- cgit v1.2.3 From c52e2421f7368fd36cbe330d2cf41b10452e39a9 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 15 Oct 2013 11:54:30 -0700 Subject: tcp: must unclone packets before mangling them TCP stack should make sure it owns skbs before mangling them. We had various crashes using bnx2x, and it turned out gso_size was cleared right before bnx2x driver was populating TC descriptor of the _previous_ packet send. TCP stack can sometime retransmit packets that are still in Qdisc. Of course we could make bnx2x driver more robust (using ACCESS_ONCE(shinfo->gso_size) for example), but the bug is TCP stack. We have identified two points where skb_unclone() was needed. This patch adds a WARN_ON_ONCE() to warn us if we missed another fix of this kind. Kudos to Neal for finding the root cause of this bug. Its visible using small MSS. Signed-off-by: Eric Dumazet Signed-off-by: Neal Cardwell Cc: Yuchung Cheng Signed-off-by: David S. Miller --- net/ipv4/tcp_output.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index c6f01f2cdb32..8fad1c155688 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -986,6 +986,9 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb) static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb, unsigned int mss_now) { + /* Make sure we own this skb before messing gso_size/gso_segs */ + WARN_ON_ONCE(skb_cloned(skb)); + if (skb->len <= mss_now || !sk_can_gso(sk) || skb->ip_summed == CHECKSUM_NONE) { /* Avoid the costly divide in the normal @@ -1067,9 +1070,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, if (nsize < 0) nsize = 0; - if (skb_cloned(skb) && - skb_is_nonlinear(skb) && - pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) + if (skb_unclone(skb, GFP_ATOMIC)) return -ENOMEM; /* Get a new skb... force flag on. */ @@ -2344,6 +2345,8 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) int oldpcount = tcp_skb_pcount(skb); if (unlikely(oldpcount > 1)) { + if (skb_unclone(skb, GFP_ATOMIC)) + return -ENOMEM; tcp_init_tso_segs(sk, skb, cur_mss); tcp_adjust_pcount(sk, skb, oldpcount - tcp_skb_pcount(skb)); } -- cgit v1.2.3 From 8f26fb1c1ed81c33f5d87c5936f4d9d1b4118918 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 15 Oct 2013 12:24:54 -0700 Subject: tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit sk_can_gso() should only be used as a hint in tcp_sendmsg() to build GSO packets in the first place. (As a performance hint) Once we have GSO packets in write queue, we can not decide they are no longer GSO only because flow now uses a route which doesn't handle TSO/GSO. Core networking stack handles the case very well for us, all we need is keeping track of packet counts in MSS terms, regardless of segmentation done later (in GSO or hardware) Right now, if tcp_fragment() splits a GSO packet in two parts, @left and @right, and route changed through a non GSO device, both @left and @right have pcount set to 1, which is wrong, and leads to incorrect packet_count tracking. This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with MTU discovery") Signed-off-by: Eric Dumazet Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Reported-by: Maciej Żenczykowski Signed-off-by: David S. Miller --- net/ipv4/tcp_output.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 8fad1c155688..d46f2143305c 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -989,8 +989,7 @@ static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb, /* Make sure we own this skb before messing gso_size/gso_segs */ WARN_ON_ONCE(skb_cloned(skb)); - if (skb->len <= mss_now || !sk_can_gso(sk) || - skb->ip_summed == CHECKSUM_NONE) { + if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) { /* Avoid the costly divide in the normal * non-TSO case. */ -- cgit v1.2.3 From e93b7d748be887cd7639b113ba7d7ef792a7efb9 Mon Sep 17 00:00:00 2001 From: Jiri Pirko Date: Sat, 19 Oct 2013 12:29:17 +0200 Subject: ip_output: do skb ufo init for peeked non ufo skb as well Now, if user application does: sendto lenmtu flag 0 The skb is not treated as fragmented one because it is not initialized that way. So move the initialization to fix this. introduced by: commit e89e9cf539a28df7d0eb1d0a545368e9920b34ac "[IPv4/IPv6]: UFO Scatter-gather approach" Signed-off-by: Jiri Pirko Acked-by: Hannes Frederic Sowa Signed-off-by: David S. Miller --- net/ipv4/ip_output.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index a04d872c54f9..3982eabf61e1 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -772,15 +772,20 @@ static inline int ip_ufo_append_data(struct sock *sk, /* initialize protocol header pointer */ skb->transport_header = skb->network_header + fragheaderlen; - skb->ip_summed = CHECKSUM_PARTIAL; skb->csum = 0; - /* specify the length of each IP datagram fragment */ - skb_shinfo(skb)->gso_size = maxfraglen - fragheaderlen; - skb_shinfo(skb)->gso_type = SKB_GSO_UDP; + __skb_queue_tail(queue, skb); + } else if (skb_is_gso(skb)) { + goto append; } + skb->ip_summed = CHECKSUM_PARTIAL; + /* specify the length of each IP datagram fragment */ + skb_shinfo(skb)->gso_size = maxfraglen - fragheaderlen; + skb_shinfo(skb)->gso_type = SKB_GSO_UDP; + +append: return skb_append_datato_frags(sk, skb, getfrag, from, (length - transhdrlen)); } -- cgit v1.2.3 From 02cf4ebd82ff0ac7254b88e466820a290ed8289a Mon Sep 17 00:00:00 2001 From: Neal Cardwell Date: Mon, 21 Oct 2013 15:40:19 -0400 Subject: tcp: initialize passive-side sk_pacing_rate after 3WHS For passive TCP connections, upon receiving the ACK that completes the 3WHS, make sure we set our pacing rate after we get our first RTT sample. On passive TCP connections, when we receive the ACK completing the 3WHS we do not take an RTT sample in tcp_ack(), but rather in tcp_synack_rtt_meas(). So upon receiving the ACK that completes the 3WHS, tcp_ack() leaves sk_pacing_rate at its initial value. Originally the initial sk_pacing_rate value was 0, so passive-side connections defaulted to sysctl_tcp_min_tso_segs (2 segs) in skbuffs made in the first RTT. With a default initial cwnd of 10 packets, this happened to be correct for RTTs 5ms or bigger, so it was hard to see problems in WAN or emulated WAN testing. Since 7eec4174ff ("pkt_sched: fq: fix non TCP flows pacing"), the initial sk_pacing_rate is 0xffffffff. So after that change, passive TCP connections were keeping this value (and using large numbers of segments per skbuff) until receiving an ACK for data. Signed-off-by: Neal Cardwell Cc: Eric Dumazet Cc: Yuchung Cheng Acked-by: Eric Dumazet Signed-off-by: David S. Miller --- net/ipv4/tcp_input.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'net/ipv4') diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 53974c7c04fe..a16b01b537ba 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5712,6 +5712,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb, } else tcp_init_metrics(sk); + tcp_update_pacing_rate(sk); + /* Prevent spurious tcp_cwnd_restart() on first data packet */ tp->lsndtime = tcp_time_stamp; -- cgit v1.2.3