summaryrefslogtreecommitdiffstats
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
2012-09-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-5/+13
Conflicts: drivers/net/team/team.c drivers/net/usb/qmi_wwan.c net/batman-adv/bat_iv_ogm.c net/ipv4/fib_frontend.c net/ipv4/route.c net/l2tp/l2tp_netlink.c The team, fib_frontend, route, and l2tp_netlink conflicts were simply overlapping changes. qmi_wwan and bat_iv_ogm were of the "use HEAD" variety. With help from Antonio Quartulli. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27net: remove sk_init() helperEric Dumazet1-2/+0
It seems sk_init() has no value today and even does strange things : # grep . /proc/sys/net/core/?mem_* /proc/sys/net/core/rmem_default:212992 /proc/sys/net/core/rmem_max:131071 /proc/sys/net/core/wmem_default:212992 /proc/sys/net/core/wmem_max:131071 We can remove it completely. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shan Wei <davidshan@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27tunnel: drop packet if ECN present with not-ECTstephen hemminger1-0/+76
Linux tunnels were written before RFC6040 and therefore never implemented the corner case of ECN getting set in the outer header and the inner header not being ready for it. Section 4.2. Default Tunnel Egress Behaviour. o If the inner ECN field is Not-ECT, the decapsulator MUST NOT propagate any other ECN codepoint onwards. This is because the inner Not-ECT marking is set by transports that rely on dropped packets as an indication of congestion and would not understand or respond to any other ECN codepoint [RFC4774]. Specifically: * If the inner ECN field is Not-ECT and the outer ECN field is CE, the decapsulator MUST drop the packet. * If the inner ECN field is Not-ECT and the outer ECN field is Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the outgoing packet with the ECN field cleared to Not-ECT. This patch moves the ECN decap logic out of the individual tunnels into a common place. It also adds logging to allow detecting broken systems that set ECN bits incorrectly when tunneling (or an intermediate router might be changing the header). Overloads rx_frame_error to keep track of ECN related error. Thanks to Chris Wright who caught this while reviewing the new VXLAN tunnel. This code was tested by injecting faulty logic in other end GRE to send incorrectly encapsulated packets. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24net: use a per task frag allocatorEric Dumazet2-15/+16
We currently use a per socket order-0 page cache for tcp_sendmsg() operations. This page is used to build fragments for skbs. Its done to increase probability of coalescing small write() into single segments in skbs still in write queue (not yet sent) But it wastes a lot of memory for applications handling many mostly idle sockets, since each socket holds one page in sk->sk_sndmsg_page Its also quite inefficient to build TSO 64KB packets, because we need about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit page allocator more than wanted. This patch adds a per task frag allocator and uses bigger pages, if available. An automatic fallback is done in case of memory pressure. (up to 32768 bytes per frag, thats order-3 pages on x86) This increases TCP stream performance by 20% on loopback device, but also benefits on other network devices, since 8x less frags are mapped on transmit and unmapped on tx completion. Alexander Duyck mentioned a probable performance win on systems with IOMMU enabled. Its possible some SG enabled hardware cant cope with bigger fragments, but their ndo_start_xmit() should already handle this, splitting a fragment in sub fragments, since some arches have PAGE_SIZE=65536 Successfully tested on various ethernet devices. (ixgbe, igb, bnx2x, tg3, mellanox mlx4) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24net: Remove unnecessary NULL check in scm_destroy().David S. Miller1-1/+1
All callers provide a non-NULL scm argument. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-22tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHSNeal Cardwell1-0/+1
When taking SYNACK RTT samples for servers using TCP Fast Open, fix the code to ensure that we only call tcp_valid_rtt_meas() after we receive the ACK that completes the 3-way handshake. Previously we were always taking an RTT sample in tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we receive the SYN. So for TFO we must wait until tcp_rcv_state_process() to take the RTT sample. To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock() before we set the snt_synack timestamp, since tcp_synack_rtt_meas() already ensures that we only take a SYNACK RTT sample if snt_synack is non-zero. To be careful, we only take a snt_synack timestamp when a SYNACK transmit or retransmit succeeds. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-22tcp: extract code to compute SYNACK RTTNeal Cardwell1-0/+9
In preparation for adding another spot where we compute the SYNACK RTT, extract this code so that it can be shared. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds4-5/+13
Pull networking updates from David Miller: "More bug fixes, nothing gets past these guys" 1) More kernel info leaks found by Mathias Krause, this time in the IPSEC configuration layers. 2) When IPSEC policies change, we do not properly make sure that cached routes (which could now be stale) throughout the system will be revalidated. Fix this by generalizing the generation count invalidation scheme used by ipv4. From Nicolas Dichtel. 3) When repairing TCP sockets, we need to allow to restore not just the send window scale, but the receive one too. Extend the existing interface to achieve this in a backwards compatible way. From Andrey Vagin. 4) A fix for FCOE scatter gather feature validation erroneously caused scatter gather to be disabled for things like AOE too. From Ed L Cashin. 5) Several cases of mishandling of error pointers, from Mathias Krause, Wei Yongjun, and Devendra Naga. 6) Fix gianfar build, from Richard Cochran. 7) CAP_NET_* failures should return -EPERM not -EACCES, from Zhao Hongjiang. 8) Hardware reset fix in janz-ican3 CAN driver, from Ira W Snyder. 9) Fix oops during rmmod in ti_hecc CAN driver, from Marc Kleine-Budde. 10) The removal of the conditional compilation of the clk support code in the stmmac driver broke things. This is because the interfaces used are the ones that don't also perform the enable/disable of the clk. Fix from Stefan Roese. 11) The QFQ packet scheduler can record out of range virtual start times, resulting later in misbehavior and even crashes. Fix from Paolo Valente. 12) If MSG_WAITALL is used with IOAT DMA under TCP, we can wedge the receiver when the advertised receive window goes to zero. Detect this case and force the processing of the IOAT DMA queue when it happens to avoid getting stuck. Fix from Michal Kubecek. 13) batman-adv assumes that test_bit() returns only 0 or 1, but this is not true for x86 (which returns -1 or 0, via the 'sbb' instruction). Fix from Linus Lussing. 14) Fix small packet corruption in e1000, from Tushar Dave. 15) make_blackhole() in the IPSEC policy code can do one read unlock too many, fix from Li RongQing. 16) The new tcp_try_coalesce() code introduced a bug in TCP URG handling, fix from Eric Dumazet. 17) Fix memory leak in __netif_receive_skb() when doing zerocopy and when hit an OOM condition. From Michael S Tsirkin. 18) netxen blindly deferences pdev->bus->self, which is not guarenteed to be non-NULL. Fix from Nikolay Aleksandrov. 19) Fix a performance regression caused by mistakes in ipv6 checksum validation in the bnx2x driver, fix from Michal Schmidt. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (45 commits) net/stmmac: Use clk_prepare_enable and clk_disable_unprepare net: change return values from -EACCES to -EPERM net/irda: sh_sir: fix return value check in sh_sir_set_baudrate() stmmac: fix return value check in stmmac_open_ext_timer() gianfar: fix phc index build failure ipv6: fix return value check in fib6_add() bnx2x: remove false warning regarding interrupt number can: ti_hecc: fix oops during rmmod can: janz-ican3: fix support for older hardware revisions net: do not disable sg for packets requiring no checksum aoe: assert AoE packets marked as requiring no checksum at91ether: return PTR_ERR if call to clk_get fails xfrm_user: don't copy esn replay window twice for new states xfrm_user: ensure user supplied esn replay window is valid xfrm_user: fix info leak in copy_to_user_tmpl() xfrm_user: fix info leak in copy_to_user_policy() xfrm_user: fix info leak in copy_to_user_state() xfrm_user: fix info leak in copy_to_user_auth() net: qmi_wwan: adding Huawei E367, ZTE MF683 and Pantech P4200 tcp: restore rcv_wscale in a repair mode (v2) ...
2012-09-19ipv6: unify fragment thresh handling codeAmerigo Wang1-1/+1
Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: David Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19ipv6: make ip6_frag_nqueues() and ip6_frag_mem() static inlineAmerigo Wang1-2/+11
Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: David Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19ipv6: unify conntrack reassembly expire code with standard oneAmerigo Wang1-0/+19
Two years ago, Shan Wei tried to fix this: http://patchwork.ozlabs.org/patch/43905/ The problem is that RFC2460 requires an ICMP Time Exceeded -- Fragment Reassembly Time Exceeded message should be sent to the source of that fragment, if the defragmentation times out. " If insufficient fragments are received to complete reassembly of a packet within 60 seconds of the reception of the first-arriving fragment of that packet, reassembly of that packet must be abandoned and all the fragments that have been received for that packet must be discarded. If the first fragment (i.e., the one with a Fragment Offset of zero) has been received, an ICMP Time Exceeded -- Fragment Reassembly Time Exceeded message should be sent to the source of that fragment. " As Herbert suggested, we could actually use the standard IPv6 reassembly code which follows RFC2460. With this patch applied, I can see ICMP Time Exceeded sent from the receiver when the sender sent out 3/4 fragmented IPv6 UDP packet. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: David Miller <davem@davemloft.net> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: netfilter-devel@vger.kernel.org Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19ipv6: add a new namespace for nf_conntrack_reasmAmerigo Wang2-0/+11
As pointed by Michal, it is necessary to add a new namespace for nf_conntrack_reasm code, this prepares for the second patch. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: David Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: netfilter-devel@vger.kernel.org Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18ipv6: use net->rt_genid to check dst validityNicolas Dichtel1-3/+2
IPv6 dst should take care of rt_genid too. When a xfrm policy is inserted or deleted, all dst should be invalidated. To force the validation, dst entries should be created with ->obsolete set to DST_OBSOLETE_FORCE_CHK. This was already the case for all functions calling ip6_dst_alloc(), except for ip6_rt_copy(). As a consequence, we can remove the specific code in inet6_connection_sock. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18netns: move net->ipv4.rt_genid to net->rt_genidNicolas Dichtel2-1/+10
This commit prepares the use of rt_genid by both IPv4 and IPv6. Initialization is left in IPv4 part. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18ipv4/route: arg delay is useless in rt_cache_flush()Nicolas Dichtel1-1/+1
Since route cache deletion (89aef8921bfbac22f), delay is no more used. Remove it. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-17include/net/sock.h: squelch compiler warning in sk_rmem_schedule()Chuck Lever1-1/+1
This warning: In file included from linux/include/linux/tcp.h:227:0, from linux/include/linux/ipv6.h:221, from linux/include/net/ipv6.h:16, from linux/include/linux/sunrpc/clnt.h:26, from linux/net/sunrpc/stats.c:22: linux/include/net/sock.h: In function `sk_rmem_schedule': linux/nfs-2.6/include/net/sock.h:1339:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] is seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2) using the -Wextra option. Commit c76562b6709f ("netvm: prevent a stream-specific deadlock") accidentally replaced the "size" parameter of sk_rmem_schedule() with an unsigned int. This changes the semantics of the comparison in the return statement. In sk_wmem_schedule we have syntactically the same comparison, but "size" is a signed integer. In addition, __sk_mem_schedule() takes a signed integer for its "size" parameter, so there is an implicit type conversion in sk_rmem_schedule() anyway. Revert the "size" parameter back to a signed integer so that the semantics of the expressions in both sk_[rw]mem_schedule() are exactly the same. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17llc: Remove stray reference to sysctl_llc_station_ack_timeout.David S. Miller1-1/+0
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-17Merge branch 'for-davem' of ↵David S. Miller1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John W. Linville says: ==================== This is another batch of updates intended for the 3.7 stream. There are not a lot of large items, but iwlwifi, mwifiex, rt2x00, ath9k, and brcmfmac all get some attention. Wei Yongjun also provides a series of small maintenance fixes. This also includes a pull of the wireless tree in order to satisfy some prerequisites for later patches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-1/+4
Conflicts: net/netfilter/nfnetlink_log.c net/netfilter/xt_LOG.c Rather easy conflict resolution, the 'net' tree had bug fixes to make sure we checked if a socket is a time-wait one or not and elide the logging code if so. Whereas on the 'net-next' side we are calculating the UID and GID from the creds using different interfaces due to the user namespace changes from Eric Biederman. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-14Merge branch 'master' of ↵John W. Linville2-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
2012-09-10netlink: Rename pid to portid to avoid confusionEric W. Biederman6-51/+51
It is a frequent mistake to confuse the netlink port identifier with a process identifier. Try to reduce this confusion by renaming fields that hold port identifiers portid instead of pid. I have carefully avoided changing the structures exported to userspace to avoid changing the userspace API. I have successfully built an allyesconfig kernel with this change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07Merge branch 'master' of ↵John W. Linville4-6/+44
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless
2012-09-07Merge branch 'for-john' of ↵John W. Linville1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
2012-09-07ipv4/route: arg delay is useless in rt_cache_flush()Nicolas Dichtel1-1/+1
Since route cache deletion (89aef8921bfbac22f), delay is no more used. Remove it. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07scm: Don't use struct ucred in NETLINK_CB and struct scm_cookie.Eric W. Biederman1-4/+19
Passing uids and gids on NETLINK_CB from a process in one user namespace to a process in another user namespace can result in the wrong uid or gid being presented to userspace. Avoid that problem by passing kuids and kgids instead. - define struct scm_creds for use in scm_cookie and netlink_skb_parms that holds uid and gid information in kuid_t and kgid_t. - Modify scm_set_cred to fill out scm_creds by heand instead of using cred_to_ucred to fill out struct ucred. This conversion ensures userspace does not get incorrect uid or gid values to look at. - Modify scm_recv to convert from struct scm_creds to struct ucred before copying credential values to userspace. - Modify __scm_send to populate struct scm_creds on in the scm_cookie, instead of just copying struct ucred from userspace. - Modify netlink_sendmsg to copy scm_creds instead of struct ucred into the NETLINK_CB. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07Merge branch 'master' of ↵John W. Linville1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem John W. Linville says: ==================== Please pull these fixes intended for 3.6. There are more commits here than I would like -- I got a bit behind while I was stalking Steven Rostedt in San Diego last week... I'll slow it down after this! There are a couple of pulls here. One is from Johannes: "Please pull (according to the below information) to get a few fixes. * a fix to properly disconnect in the driver when authentication or association fails * a fix to prevent invalid information about mesh paths being reported to userspace * a memory leak fix in an nl80211 error path" The other comes via Gustavo: "A few updates for the 3.6 kernel. There are two btusb patches to add more supported devices through the new USB_VENDOR_AND_INTEFACE_INFO() macro and another one that add a new device id for a Sony Vaio laptop, one fix for a user-after-free and, finally, two patches from Vinicius to fix a issue in SMP pairing." Along with those... Arend van Spriel provides a fix for a use-after-free bug in brcmfmac. Daniel Drake avoids a hang by not trying to touch the libertas hardware duing suspend if it is already powered-down. Felix Fietkau provides a batch of ath9k fixes that adress some potential problems with power settings, as well as a fix to avoid a potential interrupt storm. Gertjan van Wingerde provides a register-width fix for rt2x00, and a rt2x00 fix to prevent incorrectly detecting the rfkill status. He also provides a device ID patch. Hante Meuleman gives us three brcmfmac fixes, one that properly initializes a command structure, one that fixes a race condition that could lose usb requests, and one that removes some log spam. Marc Kleine-Budde offers an rt2x00 fix for a voltage setting on some specific devices. Mohammed Shafi Shajakhan sent an ath9k fix to avoid a crash related to using timers that aren't allocated when 2 wire bluetooth coexistence hardware is in use. Sergei Poselenov changes rt2800usb to do some validity checking for received packets, avoiding crashes on an ARM Soc. Stone Piao gives us an mwifiex fix for an incorrectly set skb length value for a command buffer. All of these are localized to their specific drivers, and relatively small. The power-related patches from Felix are bigger than I would like, but I merged them in consideration of their isolation to ath9k and the sensitive nature of power settings in wireless devices. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05ipv6: fix handling of blackhole and prohibit routesNicolas Dichtel1-0/+1
When adding a blackhole or a prohibit route, they were handling like classic routes. Moreover, it was only possible to add this kind of routes by specifying an interface. Bug already reported here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498498 Before the patch: $ ip route add blackhole 2001::1/128 RTNETLINK answers: No such device $ ip route add blackhole 2001::1/128 dev eth0 $ ip -6 route | grep 2001 2001::1 dev eth0 metric 1024 After: $ ip route add blackhole 2001::1/128 $ ip -6 route | grep 2001 blackhole 2001::1 dev lo metric 1024 error -22 v2: wrong patch v3: add a field fc_type in struct fib6_config to store RTN_* type Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05cfg80211: add kerneldoc entry for "vht_cap"Robert P. J. Day1-0/+1
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2012-09-04xfrm: Workaround incompatibility of ESN and async cryptoSteffen Klassert1-0/+3
ESN for esp is defined in RFC 4303. This RFC assumes that the sequence number counters are always up to date. However, this is not true if an async crypto algorithm is employed. If the sequence number counters are not up to date on sequence number check, we may incorrectly update the upper 32 bit of the sequence number. This leads to a DOS. We workaround this by comparing the upper sequence number, (used for authentication) with the upper sequence number computed after the async processing. We drop the packet if these numbers are different. To do this, we introduce a recheck function that does this check in the ESN case. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-03Merge branch 'master' of git://1984.lsi.us.es/nf-nextDavid S. Miller16-95/+171
2012-09-03tcp: use PRR to reduce cwin in CWR stateYuchung Cheng1-2/+8
Use proportional rate reduction (PRR) algorithm to reduce cwnd in CWR state, in addition to Recovery state. Retire the current rate-halving in CWR. When losses are detected via ACKs in CWR state, the sender enters Recovery state but the cwnd reduction continues and does not restart. Rename and refactor cwnd reduction functions since both CWR and Recovery use the same algorithm: tcp_init_cwnd_reduction() is new and initiates reduction state variables. tcp_cwnd_reduction() is previously tcp_update_cwnd_in_recovery(). tcp_ends_cwnd_reduction() is previously tcp_complete_cwr(). The rate halving functions and logic such as tcp_cwnd_down(), tcp_min_cwnd(), and the cwnd moderation inside tcp_enter_cwr() are removed. The unused parameter, flag, in tcp_cwnd_reduction() is also removed. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextPablo Neira Ayuso12-55/+234
This merges (3f509c6 netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectation) to Patrick McHardy's IPv6 NAT changes.
2012-09-03netfilter: nf_conntrack: add nf_ct_timeout_lookupPablo Neira Ayuso1-0/+20
This patch adds the new nf_ct_timeout_lookup function to encapsulate the timeout policy attachment that is called in the nf_conntrack_in path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-08-31tcp: TCP Fast Open Server - support TFO listenersJerry Chu2-15/+4
This patch builds on top of the previous patch to add the support for TFO listeners. This includes - 1. allocating, properly initializing, and managing the per listener fastopen_queue structure when TFO is enabled 2. changes to the inet_csk_accept code to support TFO. E.g., the request_sock can no longer be freed upon accept(), not until 3WHS finishes 3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg() if it's a TFO socket 4. properly closing a TFO listener, and a TFO socket before 3WHS finishes 5. supporting TCP_FASTOPEN socket option 6. modifying tcp_check_req() to use to check a TFO socket as well as request_sock 7. supporting TCP's TFO cookie option 8. adding a new SYN-ACK retransmit handler to use the timer directly off the TFO socket rather than the listener socket. Note that TFO server side will not retransmit anything other than SYN-ACK until the 3WHS is completed. The patch also contains an important function "reqsk_fastopen_remove()" to manage the somewhat complex relation between a listener, its request_sock, and the corresponding child socket. See the comment above the function for the detail. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31tcp: TCP Fast Open Server - header & support functionsJerry Chu2-7/+75
This patch adds all the necessary data structure and support functions to implement TFO server side. It also documents a number of flags for the sysctl_tcp_fastopen knob, and adds a few Linux extension MIBs. In addition, it includes the following: 1. a new TCP_FASTOPEN socket option an application must call to supply a max backlog allowed in order to enable TFO on its listener. 2. A number of key data structures: "fastopen_rsk" in tcp_sock - for a big socket to access its request_sock for retransmission and ack processing purpose. It is non-NULL iff 3WHS not completed. "fastopenq" in request_sock_queue - points to a per Fast Open listener data structure "fastopen_queue" to keep track of qlen (# of outstanding Fast Open requests) and max_qlen, among other things. "listener" in tcp_request_sock - to point to the original listener for book-keeping purpose, i.e., to maintain qlen against max_qlen as part of defense against IP spoofing attack. 3. various data structure and functions, many in tcp_fastopen.c, to support server side Fast Open cookie operations, including /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31tcp: Increase timeout for SYN segmentsAlex Bergmann1-4/+14
Commit 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side") changed the initRTO from 3secs to 1sec in accordance to RFC6298 (former RFC2988bis). This reduced the time till the last SYN retransmission packet gets sent from 93secs to 31secs. RFC1122 is stating that the retransmission should be done for at least 3 minutes, but this seems to be quite high. "However, the values of R1 and R2 may be different for SYN and data segments. In particular, R2 for a SYN segment MUST be set large enough to provide retransmission of the segment for at least 3 minutes. The application can close the connection (i.e., give up on the open attempt) sooner, of course." This patch increases the value of TCP_SYN_RETRIES to the value of 6, providing a retransmission window of 63secs. The comments for SYN and SYNACK retries have also been updated to describe the current settings. The same goes for the documentation file "Documentation/networking/ip-sysctl.txt". Signed-off-by: Alexander Bergmann <alex@linlab.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+1
Merge the 'net' tree to get the recent set of netfilter bug fixes in order to assist with some merge hassles Pablo is going to have to deal with for upcoming changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31Merge branch 'master' of git://1984.lsi.us.es/nfDavid S. Miller1-0/+1
2012-08-31netfilter: nf_conntrack: fix racy timer handling with reliable eventsPablo Neira Ayuso1-0/+1
Existing code assumes that del_timer returns true for alive conntrack entries. However, this is not true if reliable events are enabled. In that case, del_timer may return true for entries that were just inserted in the dying list. Note that packets / ctnetlink may hold references to conntrack entries that were just inserted to such list. This patch fixes the issue by adding an independent timer for event delivery. This increases the size of the ecache extension. Still we can revisit this later and use variable size extensions to allocate this area on demand. Tested-by: Oliver Smith <olipro@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-08-30netfilter: ip6tables: add MASQUERADE targetPatrick McHardy2-2/+4
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30netfilter: ipv6: add IPv6 NAT supportPatrick McHardy3-0/+7
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30net: core: add function for incremental IPv6 pseudo header checksum updatesPatrick McHardy1-0/+3
Add inet_proto_csum_replace16 for incrementally updating IPv6 pseudo header checksums for IPv6 NAT. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: David S. Miller <davem@davemloft.net>
2012-08-30netfilter: add protocol independent NAT corePatrick McHardy9-90/+125
Convert the IPv4 NAT implementation to a protocol independent core and address family specific modules. Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30netfilter: nf_nat: add protoff argument to packet mangling functionsPatrick McHardy1-3/+8
For mangling IPv6 packets the protocol header offset needs to be known by the NAT packet mangling functions. Add a so far unused protoff argument and convert the conntrack and NAT helpers to use it in preparation of IPv6 NAT. Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-27Bluetooth: Change signature of smp_conn_security()Vinicius Costa Gomes1-1/+1
To make it clear that it may be called from contexts that may not have any knowledge of L2CAP, we change the connection parameter, to receive a hci_conn. This also makes it clear that it is checking the security of the link. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@openbossa.org> Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
2012-08-26ipv4: fix path MTU discovery with connection trackingPatrick McHardy2-0/+4
IPv4 conntrack defragments incoming packet at the PRE_ROUTING hook and (in case of forwarded packets) refragments them at POST_ROUTING independent of the IP_DF flag. Refragmentation uses the dst_mtu() of the local route without caring about the original fragment sizes, thereby breaking PMTUD. This patch fixes this by keeping track of the largest received fragment with IP_DF set and generates an ICMP fragmentation required error during refragmentation if that size exceeds the MTU. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: David S. Miller <davem@davemloft.net>
2012-08-24Merge branch 'for-next' of ↵David S. Miller6-7/+22
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace This is an initial merge in of Eric Biederman's work to start adding user namespace support to the networking. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24Merge branch 'master' of ↵John W. Linville3-21/+117
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
2012-08-23packet: fix broken build.Rami Rosen1-1/+1
This patch fixes a broken build due to a missing header: ... CC net/ipv4/proc.o In file included from include/net/net_namespace.h:15, from net/ipv4/proc.c:35: include/net/netns/packet.h:11: error: field 'sklist_lock' has incomplete type ... The lock of netns_packet has been replaced by a recent patch to be a mutex instead of a spinlock, but we need to replace the header file to be linux/mutex.h instead of linux/spinlock.h as well. See commit 0fa7fa98dbcc2789409ed24e885485e645803d7f: packet: Protect packet sk list with mutex (v2) patch, Signed-off-by: Rami Rosen <rosenr@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22packet: Protect packet sk list with mutex (v2)Pavel Emelyanov1-1/+1
Change since v1: * Fixed inuse counters access spotted by Eric In patch eea68e2f (packet: Report socket mclist info via diag module) I've introduced a "scheduling in atomic" problem in packet diag module -- the socket list is traversed under rcu_read_lock() while performed under it sk mclist access requires rtnl lock (i.e. -- mutex) to be taken. [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002 [152363.820573] 4 locks held by crtools/12517: [152363.820581] #0: (sock_diag_mutex){+.+.+.}, at: [<ffffffff81a2dcb5>] sock_diag_rcv+0x1f/0x3e [152363.820613] #1: (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff81a2de70>] sock_diag_rcv_msg+0xdb/0x11a [152363.820644] #2: (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81a67d01>] netlink_dump+0x23/0x1ab [152363.820693] #3: (rcu_read_lock){.+.+..}, at: [<ffffffff81b6a049>] packet_diag_dump+0x0/0x1af Similar thing was then re-introduced by further packet diag patches (fanount mutex and pgvec mutex for rings) :( Apart from being terribly sorry for the above, I propose to change the packet sk list protection from spinlock to mutex. This lock currently protects two modifications: * sklist * prot inuse counters The sklist modifications can be just reprotected with mutex since they already occur in a sleeping context. The inuse counters modifications are trickier -- the __this_cpu_-s are used inside, thus requiring the caller to handle the potential issues with contexts himself. Since packet sockets' counters are modified in two places only (packet_create and packet_release) we only need to protect the context from being preempted. BH disabling is not required in this case. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>