summaryrefslogtreecommitdiffstats
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2011-05-10xfrm: Assign the inner mode output function to the dst entrySteffen Klassert2-2/+7
As it is, we assign the outer modes output function to the dst entry when we create the xfrm bundle. This leads to two problems on interfamily scenarios. We might insert ipv4 packets into ip6_fragment when called from xfrm6_output. The system crashes if we try to fragment an ipv4 packet with ip6_fragment. This issue was introduced with git commit ad0081e4 (ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed). The second issue is, that we might insert ipv4 packets in netfilter6 and vice versa on interfamily scenarios. With this patch we assign the inner mode output function to the dst entry when we create the xfrm bundle. So xfrm4_output/xfrm6_output from the inner mode is used and the right fragmentation and netfilter functions are called. We switch then to outer mode with the output_finish functions. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08tcp_cubic: limit delayed_ack ratio to prevent divide errorstephen hemminger1-2/+7
TCP Cubic keeps a metric that estimates the amount of delayed acknowledgements to use in adjusting the window. If an abnormally large number of packets are acknowledged at once, then the update could wrap and reach zero. This kind of ACK could only happen when there was a large window and huge number of ACK's were lost. This patch limits the value of delayed ack ratio. The choice of 32 is just a conservative value since normally it should be range of 1 to 4 packets. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-04net: ip_expire() must revalidate routeEric Dumazet1-16/+15
Commit 4a94445c9a5c (net: Use ip_route_input_noref() in input path) added a bug in IP defragmentation handling, in case timeout is fired. When a frame is defragmented, we use last skb dst field when building final skb. Its dst is valid, since we are in rcu read section. But if a timeout occurs, we take first queued fragment to build one ICMP TIME EXCEEDED message. Problem is all queued skb have weak dst pointers, since we escaped RCU critical section after their queueing. icmp_send() might dereference a now freed (and possibly reused) part of memory. Calling skb_dst_drop() and ip_route_input_noref() to revalidate route is the only possible choice. Reported-by: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-02sysctl: net: call unregister_net_sysctl_table where neededLucian Adrian Grijincu1-1/+1
ctl_table_headers registered with register_net_sysctl_table should have been unregistered with the equivalent unregister_net_sysctl_table Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-01ipv4: don't spam dmesg with "Using LC-trie" messagesAlexey Dobriyan1-3/+0
fib_trie_table() is called during netns creation and Chromium uses clone(CLONE_NEWNET) to sandbox renderer process. Don't print anything. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-25net: provide cow_metrics() methods to blackhole dst_opsHeld Bernhard1-0/+7
Since commit 62fa8a846d7d (net: Implement read-only protection and COW'ing of metrics.) the kernel throws an oops. [ 101.620985] BUG: unable to handle kernel NULL pointer dereference at (null) [ 101.621050] IP: [< (null)>] (null) [ 101.621084] PGD 6e53c067 PUD 3dd6a067 PMD 0 [ 101.621122] Oops: 0010 [#1] SMP [ 101.621153] last sysfs file: /sys/devices/virtual/ppp/ppp/uevent [ 101.621192] CPU 2 [ 101.621206] Modules linked in: l2tp_ppp pppox ppp_generic slhc l2tp_netlink l2tp_core deflate zlib_deflate twofish_x86_64 twofish_common des_generic cbc ecb sha1_generic hmac af_key iptable_filter snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device loop snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm snd_timer snd i2c_i801 iTCO_wdt psmouse soundcore snd_page_alloc evdev uhci_hcd ehci_hcd thermal [ 101.621552] [ 101.621567] Pid: 5129, comm: openl2tpd Not tainted 2.6.39-rc4-Quad #3 Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R [ 101.621637] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [ 101.621684] RSP: 0018:ffff88003ddeba60 EFLAGS: 00010202 [ 101.621716] RAX: ffff88003ddb5600 RBX: ffff88003ddb5600 RCX: 0000000000000020 [ 101.621758] RDX: ffffffff81a69a00 RSI: ffffffff81b7ee61 RDI: ffff88003ddb5600 [ 101.621800] RBP: ffff8800537cd900 R08: 0000000000000000 R09: ffff88003ddb5600 [ 101.621840] R10: 0000000000000005 R11: 0000000000014b38 R12: ffff88003ddb5600 [ 101.621881] R13: ffffffff81b7e480 R14: ffffffff81b7e8b8 R15: ffff88003ddebad8 [ 101.621924] FS: 00007f06e4182700(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 [ 101.621971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 101.622005] CR2: 0000000000000000 CR3: 0000000045274000 CR4: 00000000000006e0 [ 101.622046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 101.622087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 101.622129] Process openl2tpd (pid: 5129, threadinfo ffff88003ddea000, task ffff88003de9a280) [ 101.622177] Stack: [ 101.622191] ffffffff81447efa ffff88007d3ded80 ffff88003de9a280 ffff88007d3ded80 [ 101.622245] 0000000000000001 ffff88003ddebbb8 ffffffff8148d5a7 0000000000000212 [ 101.622299] ffff88003dcea000 ffff88003dcea188 ffffffff00000001 ffffffff81b7e480 [ 101.622353] Call Trace: [ 101.622374] [<ffffffff81447efa>] ? ipv4_blackhole_route+0x1ba/0x210 [ 101.622415] [<ffffffff8148d5a7>] ? xfrm_lookup+0x417/0x510 [ 101.622450] [<ffffffff8127672a>] ? extract_buf+0x9a/0x140 [ 101.622485] [<ffffffff8144c6a0>] ? __ip_flush_pending_frames+0x70/0x70 [ 101.622526] [<ffffffff8146fbbf>] ? udp_sendmsg+0x62f/0x810 [ 101.622562] [<ffffffff813f98a6>] ? sock_sendmsg+0x116/0x130 [ 101.622599] [<ffffffff8109df58>] ? find_get_page+0x18/0x90 [ 101.622633] [<ffffffff8109fd6a>] ? filemap_fault+0x12a/0x4b0 [ 101.622668] [<ffffffff813fb5c4>] ? move_addr_to_kernel+0x64/0x90 [ 101.622706] [<ffffffff81405d5a>] ? verify_iovec+0x7a/0xf0 [ 101.622739] [<ffffffff813fc772>] ? sys_sendmsg+0x292/0x420 [ 101.622774] [<ffffffff810b994a>] ? handle_pte_fault+0x8a/0x7c0 [ 101.622810] [<ffffffff810b76fe>] ? __pte_alloc+0xae/0x130 [ 101.622844] [<ffffffff810ba2f8>] ? handle_mm_fault+0x138/0x380 [ 101.622880] [<ffffffff81024af9>] ? do_page_fault+0x189/0x410 [ 101.622915] [<ffffffff813fbe03>] ? sys_getsockname+0xf3/0x110 [ 101.622952] [<ffffffff81450c4d>] ? ip_setsockopt+0x4d/0xa0 [ 101.622986] [<ffffffff813f9932>] ? sockfd_lookup_light+0x22/0x90 [ 101.623024] [<ffffffff814b61fb>] ? system_call_fastpath+0x16/0x1b [ 101.623060] Code: Bad RIP value. [ 101.623090] RIP [< (null)>] (null) [ 101.623125] RSP <ffff88003ddeba60> [ 101.623146] CR2: 0000000000000000 [ 101.650871] ---[ end trace ca3856a7d8e8dad4 ]--- [ 101.651011] __sk_free: optmem leakage (160 bytes) detected. The oops happens in dst_metrics_write_ptr() include/net/dst.h:124: return dst->ops->cow_metrics(dst, p); dst->ops->cow_metrics is NULL and causes the oops. Provide cow_metrics() methods, like we did in commit 214f45c91bb (net: provide default_advmss() methods to blackhole dst_ops) Signed-off-by: Held Bernhard <berny156@gmx.de> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-14ip: ip_options_compile() resilient to NULL skb routeEric Dumazet1-3/+3
Scot Doyle demonstrated ip_options_compile() could be called with an skb without an attached route, using a setup involving a bridge, netfilter, and forged IP packets. Let's make ip_options_compile() and ip_options_rcv_srr() a bit more robust, instead of changing bridge/netfilter code. With help from Hiroaki SHIMODA. Reported-by: Scot Doyle <lkml@scotdoyle.com> Tested-by: Scot Doyle <lkml@scotdoyle.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-13Revert "tcp: disallow bind() to reuse addr/port"David S. Miller1-3/+2
This reverts commit c191a836a908d1dd6b40c503741f91b914de3348. It causes known regressions for programs that expect to be able to use SO_REUSEADDR to shutdown a socket, then successfully rebind another socket to the same ID. Programs such as haproxy and amavisd expect this to work. This should fix kernel bugzilla 32832. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-12net: Do not wrap sysctl igmp_max_memberships in IP_MULTICASTJoakim Tjernlund1-3/+0
controlling igmp_max_membership is useful even when IP_MULTICAST is off. Quagga(an OSPF deamon) uses multicast addresses for all interfaces using a single socket and hits igmp_max_membership limit when there are 20 interfaces or more. Always export sysctl igmp_max_memberships in proc, just like igmp_max_msf Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-12inetpeer: reduce stack usageEric Dumazet1-6/+7
On 64bit arches, we use 752 bytes of stack when cleanup_once() is called from inet_getpeer(). Lets share the avl stack to save ~376 bytes. Before patch : # objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl 0x000006c3 unlink_from_pool [inetpeer.o]: 376 0x00000721 unlink_from_pool [inetpeer.o]: 376 0x00000cb1 inet_getpeer [inetpeer.o]: 376 0x00000e6d inet_getpeer [inetpeer.o]: 376 0x0004 inet_initpeers [inetpeer.o]: 112 # size net/ipv4/inetpeer.o text data bss dec hex filename 5320 432 21 5773 168d net/ipv4/inetpeer.o After patch : objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl 0x00000c11 inet_getpeer [inetpeer.o]: 376 0x00000dcd inet_getpeer [inetpeer.o]: 376 0x00000ab9 peer_check_expire [inetpeer.o]: 328 0x00000b7f peer_check_expire [inetpeer.o]: 328 0x0004 inet_initpeers [inetpeer.o]: 112 # size net/ipv4/inetpeer.o text data bss dec hex filename 5163 432 21 5616 15f0 net/ipv4/inetpeer.o Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Scot Doyle <lkml@scotdoyle.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Reviewed-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds3-4/+10
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits) net: Add support for SMSC LAN9530, LAN9730 and LAN89530 mlx4_en: Restoring RX buffer pointer in case of failure mlx4: Sensing link type at device initialization ipv4: Fix "Set rt->rt_iif more sanely on output routes." MAINTAINERS: add entry for Xen network backend be2net: Fix suspend/resume operation be2net: Rename some struct members for clarity pppoe: drop PPPOX_ZOMBIEs in pppoe_flush_dev dsa/mv88e6131: add support for mv88e6085 switch ipv6: Enable RFS sk_rxhash tracking for ipv6 sockets (v2) be2net: Fix a potential crash during shutdown. bna: Fix for handling firmware heartbeat failure can: mcp251x: Allow pass IRQ flags through platform data. smsc911x: fix mac_lock acquision before calling smsc911x_mac_read iwlwifi: accept EEPROM version 0x423 for iwl6000 rt2x00: fix cancelling uninitialized work rtlwifi: Fix some warnings/bugs p54usb: IDs for two new devices wl12xx: fix potential buffer overflow in testmode nvs push zd1211rw: reset rx idle timer from tasklet ...
2011-04-07ipv4: Fix "Set rt->rt_iif more sanely on output routes."OGAWA Hirofumi2-2/+7
Commit 1018b5c01636c7c6bda31a719bda34fc631db29a ("Set rt->rt_iif more sanely on output routes.") breaks rt_is_{output,input}_route. This became the cause to return "IP_PKTINFO's ->ipi_ifindex == 0". To fix it, this does: 1) Add "int rt_route_iif;" to struct rtable 2) For input routes, always set rt_route_iif to same value as rt_iif 3) For output routes, always set rt_route_iif to zero. Set rt_iif as it is done currently. 4) Change rt_is_{output,input}_route() to test rt_route_iif Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-07Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6Linus Torvalds14-20/+20
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings
2011-04-05Merge branch 'master' of ↵David S. Miller1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6
2011-04-04netfilter: af_info: add 'strict' parameter to limit lookup to .oifFlorian Westphal1-1/+1
ipv6 fib lookup can set RT6_LOOKUP_F_IFACE flag to restrict search to an interface, but this flag cannot be set via struct flowi. Also, it cannot be set via ip6_route_output: this function uses the passed sock struct to determine if this flag is required (by testing for nonzero sk_bound_dev_if). Work around this by passing in an artificial struct sk in case 'strict' argument is true. This is required to replace the rt6_lookup call in xt_addrtype.c with nf_afinfo->route(). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-04-04netfilter: af_info: add network namespace parameter to route hookFlorian Westphal1-2/+3
This is required to eventually replace the rt6_lookup call in xt_addrtype.c with nf_afinfo->route(). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-04-01tcp: len check is unnecessarily devastating, change to WARN_ONIlpo Järvinen1-1/+2
All callers are prepared for alloc failures anyway, so this error can safely be boomeranged to the callers domain without super bad consequences. ...At worst the connection might go into a state where each RTO tries to (unsuccessfully) re-fragment with such a mis-sized value and eventually dies. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-31Fix common misspellingsLucas De Marchi14-20/+20
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-30fib: add rtnl locking in ip_fib_net_exitEric Dumazet1-0/+2
Daniel J Blueman reported a lockdep splat in trie_firstleaf(), caused by RTNL being not locked before a call to fib_table_flush() Reported-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-30net: gre: provide multicast mappings for ipv4 and ipv6Timo Teräs1-0/+3
My commit 6d55cb91a0020ac0 (gre: fix hard header destination address checking) broke multicast. The reason is that ip_gre used to get ipgre_header() calls with zero destination if we have NOARP or multicast destination. Instead the actual target was decided at ipgre_tunnel_xmit() time based on per-protocol dissection. Instead of allowing the "abuse" of ->header() calls with invalid destination, this creates multicast mappings for ip_gre. This also fixes "ip neigh show nud noarp" to display the proper multicast mappings used by the gre device. Reported-by: Doug Kehn <rdkehn@yahoo.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Acked-by: Doug Kehn <rdkehn@yahoo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-28ipv4: Don't ip_rt_put() an error pointer in RAW sockets.David S. Miller1-0/+1
Reported-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-27ipv4: Fix IP timestamp option (IPOPT_TS_PRESPEC) handling in ip_options_echo()Jan Luebbe1-3/+3
The current handling of echoed IP timestamp options with prespecified addresses is rather broken since the 2.2.x kernels. As far as i understand it, it should behave like when originating packets. Currently it will only timestamp the next free slot if: - there is space for *two* timestamps - some random data from the echoed packet taken as an IP is *not* a local IP This first is caused by an off-by-one error. 'soffset' points to the next free slot and so we only need to have 'soffset + 7 <= optlen'. The second bug is using sptr as the start of the option, when it really is set to 'skb_network_header(skb)'. I just use dptr instead which points to the timestamp option. Finally it would only timestamp for non-local IPs, which we shouldn't do. So instead we exclude all unicast destinations, similar to what we do in ip_options_compile(). Signed-off-by: Jan Luebbe <jluebbe@debian.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-25ipv4: do not ignore route errorsJulian Anastasov1-2/+2
The "ipv4: Inline fib_semantic_match into check_leaf" change forgets to return the route errors. check_leaf should return the same results as fib_table_lookup. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-24ipv4: Fix nexthop caching wrt. scoping.David S. Miller3-17/+13
Move the scope value out of the fib alias entries and into fib_info, so that we always use the correct scope when recomputing the nexthop cached source address. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-24ipv4: Invalidate nexthop cache nh_saddr more correctly.David S. Miller3-27/+22
Any operation that: 1) Brings up an interface 2) Adds an IP address to an interface 3) Deletes an IP address from an interface can potentially invalidate the nh_saddr value, requiring it to be recomputed. Perform the recomputation lazily using a generation ID. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-24ipv4: fix fib metricsEric Dumazet1-1/+1
Alessandro Suardi reported that we could not change route metrics : ip ro change default .... advmss 1400 This regression came with commit 9c150e82ac50 (Allocate fib metrics dynamically). fib_metrics is no longer an array, but a pointer to an array. Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Alessandro Suardi <alessandro.suardi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-23ipv4: fix ip_rt_update_pmtu()Eric Dumazet1-2/+0
commit 2c8cec5c10bc (Cache learned PMTU information in inetpeer) added an extra inet_putpeer() call in ip_rt_update_pmtu(). This results in various problems, since we can free one inetpeer, while it is still in use. Ref: http://www.spinics.net/lists/netdev/msg159121.html Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-23ipv4: Fallback to FIB local table in __ip_dev_find().David S. Miller1-0/+16
In commit 9435eb1cf0b76b323019cebf8d16762a50a12a19 ("ipv4: Implement __ip_dev_find using new interface address hash.") we reimplemented __ip_dev_find() so that it doesn't have to do a full FIB table lookup. Instead, it consults a hash table of addresses configured to interfaces. This works identically to the old code in all except one case, and that is for loopback subnets. The old code would match the loopback device for any IP address that falls within a subnet configured to the loopback device. Handle this corner case by doing the FIB lookup. We could implement this via inet_addr_onlink() but: 1) Someone could configure many addresses to loopback and inet_addr_onlink() is a simple list traversal. 2) We know the old code works. Reported-by: Julian Anastasov <ja@ssi.bg> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22tcp: Make undo_ssthresh arg to tcp_undo_cwr() a bool.David S. Miller1-6/+6
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22tcp: avoid cwnd moderation in undoYuchung Cheng1-5/+7
In the current undo logic, cwnd is moderated after it was restored to the value prior entering fast-recovery. It was moderated first in tcp_try_undo_recovery then again in tcp_complete_cwr. Since the undo indicates recovery was false, these moderations are not necessary. If the undo is triggered when most of the outstanding data have been acknowledged, the (restored) cwnd is falsely pulled down to a small value. This patch removes these cwnd moderations if cwnd is undone a) during fast-recovery b) by receiving DSACKs past fast-recovery Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22ipv4: optimize route adding on secondary promotionJulian Anastasov1-1/+2
Optimize the calling of fib_add_ifaddr for all secondary addresses after the promoted one to start from their place, not from the new place of the promoted secondary. It will save some CPU cycles because we are sure the promoted secondary was first for the subnet and all next secondaries do not change their place. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22ipv4: remove the routes on secondary promotionJulian Anastasov1-0/+11
The secondary address promotion relies on fib_sync_down_addr to remove all routes created for the secondary addresses when the old primary address is deleted. It does not happen for cases when the primary address is also in another subnet. Fix that by deleting local and broadcast routes for all secondaries while they are on device list and by faking that all addresses from this subnet are to be deleted. It relies on fib_del_ifaddr being able to ignore the IPs from the concerned subnet while checking for duplication. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22ipv4: fix route deletion for IPs on many subnetsJulian Anastasov1-13/+88
Alex Sidorenko reported for problems with local routes left after IP addresses are deleted. It happens when same IPs are used in more than one subnet for the device. Fix fib_del_ifaddr to restrict the checks for duplicate local and broadcast addresses only to the IFAs that use our primary IFA or another primary IFA with same address. And we expect the prefsrc to be matched when the routes are deleted because it is possible they to differ only by prefsrc. This patch prevents local and broadcast routes to be leaked until their primary IP is deleted finally from the box. As the secondary address promotion needs to delete the routes for all secondaries that used the old primary IFA, add option to ignore these secondaries from the checks and to assume they are already deleted, so that we can safely delete the route while these IFAs are still on the device list. Reported-by: Alex Sidorenko <alexandre.sidorenko@hp.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22ipv4: match prefsrc when deleting routesJulian Anastasov1-0/+2
fib_table_delete forgets to match the routes by prefsrc. Callers can specify known IP in fc_prefsrc and we should remove the exact route. This is needed for cases when same local or broadcast addresses are used in different subnets and the routes differ only in prefsrc. All callers that do not provide fc_prefsrc will ignore the route prefsrc as before and will delete the first occurence. That is how the ip route del default magic works. Current callers are: - ip_rt_ioctl where rtentry_to_fib_config provides fc_prefsrc only when the provided device name matches IP label with colon. - inet_rtm_delroute where RTA_PREFSRC is optional too - fib_magic which deals with routes when deleting addresses and where the fc_prefsrc is always set with the primary IP for the concerned IFA. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-20netfilter: ipt_CLUSTERIP: fix buffer overflowVasiliy Kulikov1-1/+4
'buffer' string is copied from userspace. It is not checked whether it is zero terminated. This may lead to overflow inside of simple_strtoul(). Changli Gao suggested to copy not more than user supplied 'size' bytes. It was introduced before the git epoch. Files "ipt_CLUSTERIP/*" are root writable only by default, however, on some setups permissions might be relaxed to e.g. network admin user. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-20netfilter: xtables: fix reentrancyEric Dumazet1-2/+2
commit f3c5c1bfd4308 (make ip_tables reentrant) introduced a race in handling the stackptr restore, at the end of ipt_do_table() We should do it before the call to xt_info_rdunlock_bh(), or we allow cpu preemption and another cpu overwrites stackptr of original one. A second fix is to change the underflow test to check the origptr value instead of 0 to detect underflow, or else we allow a jump from different hooks. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-15net_sched: fix ip_tos2prioDan Siemon1-1/+1
ECN support incorrectly maps ECN BESTEFFORT packets to TC_PRIO_FILLER (1) instead of TC_PRIO_BESTEFFORT (0) This means ECN enabled flows are placed in pfifo_fast/prio low priority band, giving ECN enabled flows [ECT(0) and CE codepoints] higher drop probabilities. This is rather unfortunate, given we would like ECN being more widely used. Ref : http://www.coverfire.com/archives/2011/03/13/pfifo_fast-and-ecn/ Signed-off-by: Dan Siemon <dan@coverfire.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dave Täht <d@taht.net> Cc: Jonathan Morton <chromatix99@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-15Merge branch 'master' of ↵David S. Miller2-12/+35
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
2011-03-15Merge branch 'master' of ↵David S. Miller5-145/+6
git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 Conflicts: Documentation/feature-removal-schedule.txt
2011-03-15netfilter: ipt_addrtype: rename to xt_addrtypeFlorian Westphal3-145/+0
Followup patch will add ipv6 support. ipt_addrtype.h is retained for compatibility reasons, but no longer used by the kernel. Signed-off-by: Florian Westphal <fwestphal@astaro.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-15netfilter: ip_tables: fix infoleak to userspaceVasiliy Kulikov1-0/+3
Structures ipt_replace, compat_ipt_replace, and xt_get_revision are copied from userspace. Fields of these structs that are zero-terminated strings are not checked. When they are used as argument to a format string containing "%s" in request_module(), some sensitive information is leaked to userspace via argument of spawned modprobe process. The first and the third bugs were introduced before the git epoch; the second was introduced in 2722971c (v2.6.17-rc1). To trigger the bug one should have CAP_NET_ADMIN. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-15netfilter: arp_tables: fix infoleak to userspaceVasiliy Kulikov1-0/+3
Structures ipt_replace, compat_ipt_replace, and xt_get_revision are copied from userspace. Fields of these structs that are zero-terminated strings are not checked. When they are used as argument to a format string containing "%s" in request_module(), some sensitive information is leaked to userspace via argument of spawned modprobe process. The first bug was introduced before the git epoch; the second is introduced by 6b7d31fc (v2.6.15-rc1); the third is introduced by 6b7d31fc (v2.6.15-rc1). To trigger the bug one should have CAP_NET_ADMIN. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-14tcp_cubic: fix low utilization of CUBIC with HyStartSangtae Ha1-0/+7
HyStart sets the initial exit point of slow start. Suppose that HyStart exits at 0.5BDP in a BDP network and no history exists. If the BDP of a network is large, CUBIC's initial cwnd growth may be too conservative to utilize the link. CUBIC increases the cwnd 20% per RTT in this case. Signed-off-by: Sangtae Ha <sangtae.ha@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp_cubic: make the delay threshold of HyStart less sensitiveSangtae Ha1-1/+1
Make HyStart less sensitive to abrupt delay variations due to buffer bloat. Signed-off-by: Sangtae Ha <sangtae.ha@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Reported-by: Lucas Nussbaum <lucas.nussbaum@loria.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp_cubic: enable high resolution ack time if neededstephen hemminger1-0/+4
This is a refined version of an earlier patch by Lucas Nussbaum. Cubic needs RTT values in milliseconds. If HZ < 1000 then the values will be too coarse. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Reported-by: Lucas Nussbaum <lucas.nussbaum@loria.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp_cubic: fix clock dependencystephen hemminger1-12/+19
The hystart code was written with assumption that HZ=1000. Replace the use of jiffies with bictcp_clock as a millisecond real time clock. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Reported-by: Lucas Nussbaum <lucas.nussbaum@loria.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp_cubic: make ack train delta value a parameterstephen hemminger1-1/+4
Make the spacing between ACK's that indicates a train a tuneable value like other hystart values. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp_cubic: fix comparison of jiffiesstephen hemminger1-2/+4
Jiffies wraps around therefore the correct way to compare is to use cast to signed value. Note: cubic is not using full jiffies value on 64 bit arch because using full unsigned long makes struct bictcp grow too large for the available ca_priv area. Includes correction from Sangtae Ha to improve ack train detection. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp: fix RTT for quick packets in congestion controlstephen hemminger1-1/+1
In the congestion control interface, the callback for each ACK includes an estimated round trip time in microseconds. Some algorithms need high resolution (Vegas style) but most only need jiffie resolution. If RTT is not accurate (like a retransmission) -1 is used as a flag value. When doing coarse resolution if RTT is less than a a jiffie then 0 should be returned rather than no estimate. Otherwise algorithms that expect good ack's to trigger slow start (like CUBIC Hystart) will be confused. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-13inetpeer: should use call_rcu() variantEric Dumazet1-1/+1
After commit 7b46ac4e77f3224a (inetpeer: Don't disable BH for initial fast RCU lookup.), we should use call_rcu() to wait proper RCU grace period. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>