summaryrefslogtreecommitdiffstats
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2014-03-04UAPI: add MPLS label stack definitionSimon Wunderlich2-0/+40
Labels for the Multiprotocol Label Switching are defined in RFC 3032 which was superseded by RFC 5462. Add the definition to UAPI and a stub header for include/linux. Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-04if_ether.h: add IEEE 802.21 EthertypeSimon Wunderlich1-0/+1
Add the Ethertype for IEEE Std 802.21 - Media Independent Handover Protocol. This Ethertype is used for network control messages. Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-03tcp: snmp stats for Fast Open, SYN rtx, and data pktsYuchung Cheng1-0/+3
Add the following snmp stats: TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse the remote does not accept it or the attempts timed out. TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down retransmissions into SYN, fast-retransmits, timeout retransmits, etc. TCPOrigDataSent: number of outgoing packets with original data (excluding retransmission but including data-in-SYN). This counter is different from TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is more useful to track the TCP retransmission rate. Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to TCPFastOpenPassive. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Lawrence Brakmo <brakmo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-02net/mlx4_en: Use union for BlueFlame WQEAmir Vadai1-3/+8
When BlueFlame is turned on, control segment of the TX WQE is changed, and the second line of it is used for QPN. Changed code to use a union in the mlx4_wqe_ctrl_seg instead of casting. This makes the code clearer and solves the static checker warning: drivers/net/ethernet/mellanox/mlx4/en_tx.c:839 mlx4_en_xmit() warn: potential memory corrupting cast 4 vs 2 bytes CC: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-02net/mlx4: Replace mlx4_en_mac_to_u64() with mlx4_mac_to_u64()Eugenia Emantayev1-0/+12
Currently, the EN driver uses a private static function mlx4_en_mac_to_u64(). Move it to a common include file (driver.h) for mlx4_en and mlx4_ib for further use. Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-286lowpan: handling 6lowpan fragmentation via inet_frag apiAlexander Aring1-0/+9
This patch drops the current way of 6lowpan fragmentation on receiving side and replace it with a implementation which use the inet_frag api. The old fragmentation handling has some race conditions and isn't rfc4944 compatible. Also adding support to match fragments on destination address, source address, tag value and datagram_size which is missing in the current implementation. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-28net: ns: add ieee802154_6lowpan namespaceAlexander Aring2-0/+17
This patch adds necessary ieee802154 6lowpan namespace to provide the inet_frag information. This is a initial support for handling 6lowpan fragmentation with the inet_frag api. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-286lowpan: add frag information structAlexander Aring1-0/+7
This patch adds a 6lowpan fragmentation struct into cb of skb which is necessary to hold fragmentation information. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-27ipv6: addrconf: silence sparse endianness warningsBjørn Mork1-3/+3
Avoid the following sparse __CHECK_ENDIAN__ warnings: include/net/addrconf.h:318:25: warning: restricted __be64 degrades to integer include/net/addrconf.h:318:70: warning: restricted __be64 degrades to integer include/net/addrconf.h:330:25: warning: restricted __be64 degrades to integer include/net/addrconf.h:330:70: warning: restricted __be64 degrades to integer include/net/addrconf.h:347:25: warning: restricted __be64 degrades to integer include/net/addrconf.h:348:26: warning: restricted __be64 degrades to integer include/net/addrconf.h:349:18: warning: restricted __be64 degrades to integer The warnings are false but they make it harder to spot real bugs. Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-27Merge branch 'master' of ↵David S. Miller1-15/+68
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== This is the rework of the IPsec virtual tunnel interface for ipv4 to support inter address family tunneling and namespace crossing. The only change to the last RFC version is a compile fix for an odd configuration where CONFIG_XFRM is set but CONFIG_INET is not set. 1) Add and use a IPsec protocol multiplexer. 2) Add xfrm_tunnel_skb_cb to the skb common buffer to store a receive callback there. 3) Make vti work with i_key set by not including the i_key when comupting the hash for the tunnel lookup in case of vti tunnels. 4) Update ip_vti to use it's own receive hook. 5) Remove xfrm_tunnel_notifier, this is replaced by the IPsec protocol multiplexer. 6) We need to be protocol family indepenent, so use the on xfrm_lookup returned dst_entry instead of the ipv4 rtable in vti_tunnel_xmit(). 7) Add support for inter address family tunneling. 8) Check if the tunnel endpoints of the xfrm state and the vti interface are matching and return an error otherwise. 8) Enable namespace crossing tor vti devices. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-27net: move net_device priv_flags out from UAPILuis R. Rodriguez2-83/+83
These are private to userspace, and they're unstable anyway and can be shuffled at will (see 080e4130b1fb) so any userspace application relying on them is on crack. Test compiled with allyesconfig. mcgrof@drvbp1 /pub/mem/mcgrof/net-next (git::master)$ make allyesconfig mcgrof@drvbp1 /pub/mem/mcgrof/net-next (git::master)$ time make -j 20 ... BUILD arch/x86/boot/bzImage Setup is 16992 bytes (padded to 17408 bytes). System is 56153 kB CRC 721d2751 Kernel: arch/x86/boot/bzImage is ready (#1) real 19m35.744s user 280m37.984s sys 27m54.104s Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-27net: kdoc struct net_device flags and priv_flagsLuis R. Rodriguez1-52/+161
We have documentation for these flags but they're scattered all over the place. #defines don't allow documentation to be written easily so to help to start bringing some documentation together use the enums kdoc practice but keep the defines to allow userspace to be able to #ifdef them. I've verified the same values are assigned before and after with a simple userspace test program [0] and checksumming the output. [0] http://drvbp1.linux-foundation.org/~mcgrof/kdoc/netdev_flags/ mcgrof@gnat ~/tmp $ ./check-flags | sha1sum 0ec5b6b1840aa3bb9ce464e61c564820871c92c3 - Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26tcp: switch rtt estimations to usec resolutionEric Dumazet3-9/+16
Upcoming congestion controls for TCP require usec resolution for RTT estimations. Millisecond resolution is simply not enough these days. FQ/pacing in DC environments also require this change for finer control and removal of bimodal behavior due to the current hack in tcp_update_pacing_rate() for 'small rtt' TCP_CONG_RTT_STAMP is no longer needed. As Julian Anastasov pointed out, we need to keep user compatibility : tcp_metrics used to export RTT and RTTVAR in msec resolution, so we added RTT_US and RTTVAR_US. An iproute2 patch is needed to use the new attributes if provided by the kernel. In this example ss command displays a srtt of 32 usecs (10Gbit link) lpk51:~# ./ss -i dst lpk52 Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port tcp ESTAB 0 1 10.246.11.51:42959 10.246.11.52:64614 cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448 cwnd:10 send 3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559 Updated iproute2 ip command displays : lpk51:~# ./ip tcp_metrics | grep 10.246.11.52 10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source 10.246.11.51 Old binary displays : lpk51:~# ip tcp_metrics | grep 10.246.11.52 10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source 10.246.11.51 With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Larry Brakmo <brakmo@google.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26net: add skb_mstamp infrastructureEric Dumazet1-2/+57
ktime_get() is too expensive on some cases, and we'd like to get usec resolution timestamps in TCP stack. This patch adds a light weight facility using a combination of local_clock() and jiffies samples. Instead of : u64 t0, t1; t0 = ktime_get(); // stuff t1 = ktime_get(); delta_us = ktime_us_delta(t1, t0); use : struct skb_mstamp t0, t1; skb_mstamp_get(&t0); // stuff skb_mstamp_get(&t1); delta_us = skb_mstamp_us_delta(&t1, &t0); Note : local_clock() might have a (bounded) drift between cpus. Do not use this infra in place of ktime_get() without understanding the issues. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Larry Brakmo <brakmo@google.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26ipv6: yet another new IPV6_MTU_DISCOVER option IPV6_PMTUDISC_OMITHannes Frederic Sowa2-1/+12
This option has the same semantic as IP_PMTUDISC_OMIT for IPv4 which got recently introduced. It doesn't honor the path mtu discovered by the host but in contrary to IPV6_PMTUDISC_INTERFACE allows the generation of fragments if the packet size exceeds the MTU of the outgoing interface MTU. Fixes: 93b36cf3425b9b ("ipv6: support IPV6_PMTU_INTERFACE on sockets") Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMITHannes Frederic Sowa2-1/+12
IP_PMTUDISC_INTERFACE has a design error: because it does not allow the generation of fragments if the interface mtu is exceeded, it is very hard to make use of this option in already deployed name server software for which I introduced this option. This patch adds yet another new IP_MTU_DISCOVER option to not honor any path mtu information and not accepting new icmp notifications destined for the socket this option is enabled on. But we allow outgoing fragmentation in case the packet size exceeds the outgoing interface mtu. As such this new option can be used as a drop-in replacement for IP_PMTUDISC_DONT, which is currently in use by most name server software making the adoption of this option very smooth and easy. The original advantage of IP_PMTUDISC_INTERFACE is still maintained: ignoring incoming path MTU updates and not honoring discovered path MTUs in the output path. Fixes: 482fc6094afad5 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE") Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26net: Add sysfs file for port numberAmir Vadai1-0/+4
Add a sysfs file to enable user space to query the device port number used by a netdevice instance. This is needed for devices that have multiple ports on the same PCI function. Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26net: tcp: add mib counters to track zero window transitionsFlorian Westphal1-0/+3
Three counters are added: - one to track when we went from non-zero to zero window - one to track the reverse - one counter incremented when we want to announce zero window, but can't because we would shrink current window. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26net: order MPLS ethertypes numericallyNeil Jerram1-2/+2
All ethertypes other than ETH_P_MPLS_UC, ETH_P_MPLS_MC and ETH_P_ATMMPOA were already ordered numerically. This commit moves those three ETH_P_... values into correct numerical order too. Signed-off-by: Neil Jerram <Neil.Jerram@metaswitch.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-25xfrm4: Remove xfrm_tunnel_notifierSteffen Klassert1-2/+0
This was used from vti and is replaced by the IPsec protocol multiplexer hooks. It is now unused, so remove it. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-02-25xfrm: Add xfrm_tunnel_skb_cb to the skb common bufferSteffen Klassert1-12/+38
IPsec vti_rcv needs to remind the tunnel pointer to check it later at the vti_rcv_cb callback. So add this pointer to the IPsec common buffer, initialize it and check it to avoid transport state matching of a tunneled packet. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-02-25xfrm4: Add IPsec protocol multiplexerSteffen Klassert1-1/+30
This patch add an IPsec protocol multiplexer. With this it is possible to add alternative protocol handlers as needed for IPsec virtual tunnel interfaces. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-02-24Merge branch 'master' of ↵David S. Miller7-9/+85
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== 1) Introduce skb_to_sgvec_nomark function to add further data to the sg list without calling sg_unmark_end first. Needed to add extended sequence number informations. From Fan Du. 2) Add IPsec extended sequence numbers support to the Authentication Header protocol for ipv4 and ipv6. From Fan Du. 3) Make the IPsec flowcache namespace aware, from Fan Du. 4) Avoid creating temporary SA for every packet when no key manager is registered. From Horia Geanta. 5) Support filtering of SA dumps to show only the SAs that match a given filter. From Nicolas Dichtel. 6) Remove caching of xfrm_policy_sk_bundles. The cached socket policy bundles are never used, instead we create a new cache entry whenever xfrm_lookup() is called on a socket policy. Most protocols cache the used routes to the socket, so this caching is not needed. 7) Fix a forgotten SADB_X_EXT_FILTER length check in pfkey, from Nicolas Dichtel. 8) Cleanup error handling of xfrm_state_clone. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-20Merge branch 'master' of ↵John W. Linville5-108/+242
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
2014-02-19ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsgHannes Frederic Sowa1-1/+2
In case we decide in udp6_sendmsg to send the packet down the ipv4 udp_sendmsg path because the destination is either of family AF_INET or the destination is an ipv4 mapped ipv6 address, we don't honor the maybe specified ipv4 mapped ipv6 address in IPV6_PKTINFO. We simply can check for this option in ip_cmsg_send because no calls to ipv6 module functions are needed to do so. Reported-by: Gert Doering <gert@space.net> Cc: Tore Anderson <tore@fud.no> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-19xfrm: Remove caching of xfrm_policy_sk_bundlesSteffen Klassert1-1/+0
We currently cache socket policy bundles at xfrm_policy_sk_bundles. These cached bundles are never used. Instead we create and cache a new one whenever xfrm_lookup() is called on a socket policy. Most protocols cache the used routes to the socket, so let's remove the unused caching of socket policy bundles in xfrm. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-02-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller35-260/+264
Conflicts: drivers/net/bonding/bond_3ad.h drivers/net/bonding/bond_main.c Two minor conflicts in bonding, both of which were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-18Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds4-0/+8
Pull drm fixes from Dave Airlie: "Lots of little small things, nothing too major: nouveau regression fixes, vmware fixes for the new hw support, memory leaks in error path fixes" * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (31 commits) drm/radeon/ni: fix typo in dpm sq ramping setup drm/radeon/si: fix typo in dpm sq ramping setup drm/radeon: fix CP semaphores on CIK drm/radeon: delete a stray tab drm/radeon: fix display tiling setup on SI drm/radeon/dpm: reduce r7xx vblank mclk threshold to 200 drm/radeon: fill in DRM_CAPs for cursor size drm: add DRM_CAPs for cursor size drm/radeon: unify bpc handling drm/ttm: Fix memory leak in ttm_agp_backend.c drm/ttm: declare 'struct device' in ttm_page_alloc.h drm/nouveau: fix TTM_PL_TT memtype on pre-nv50 drm/nv50/disp: use correct register to determine DP display bpp drm/nouveau/fb: use correct ram oclass for nv1a hardware drm/nv50/gr: add missing nv_error parameter priv drm/nouveau: fix ENG_RUNLIST register address drm/nv4c/bios: disallow retrieving from prom on nv4x igp's drm/nv4c/vga: decode register is in a different place on nv4x igp's drm/nv4c/mc: nv4x igp's have a different msi rearm register drm/nouveau: set irq_enabled manually ...
2014-02-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds3-17/+50
Pull networking fixes from David Miller: 1) kvaser CAN driver has fixed limits of some of it's table, validate that we won't exceed those limits at probe time. Fix from Olivier Sobrie. 2) Fix rtl8192ce disabling interrupts for too long, from Olivier Langlois. 3) Fix botched shift in ath5k driver, from Dan Carpenter. 4) Fix corruption of deferred packets in TIPC, from Erik Hugne. 5) Fix newlink error path in macvlan driver, from Cong Wang. 6) Fix netpoll deadlock in bonding, from Ding Tianhong. 7) Handle GSO packets properly in forwarding path when fragmentation is necessary on egress, from Florian Westphal. 8) Fix axienet build errors, from Michal Simek. 9) Fix refcounting of ubufs on tx in vhost net driver, from Michael S Tsirkin. 10) Carrier status isn't set properly in hyperv driver, from Haiyang Zhang. 11) Missing pci_disable_device() in tulip_remove_one), from Ingo Molnar. 12) AF_PACKET qdisc bypass mode doesn't adhere to driver provided TX queue selection method. Add a fallback method mechanism to fix this bug, from Daniel Borkmann. 13) Fix regression in link local route handling on GRE tunnels, from Nicolas Dichtel. 14) Bonding can assign dup aggregator IDs in some sequences of configuration, fix by making the allocation counter per-bond instead of global. From Jiri Bohac. 15) sctp_connectx() needs compat translations, from Daniel Borkmann. 16) Fix of_mdio PHY interrupt parsing, from Ben Dooks * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits) MAINTAINERS: add entry for the PHY library of_mdio: fix phy interrupt passing net: ethernet: update dependency and help text of mvneta NET: fec: only enable napi if we are successful af_packet: remove a stray tab in packet_set_ring() net: sctp: fix sctp_connectx abi for ia32 emulation/compat mode ipv4: fix counter in_slow_tot irtty-sir.c: Do not set_termios() on irtty_close() bonding: 802.3ad: make aggregator_identifier bond-private usbnet: remove generic hard_header_len check gre: add link local route when local addr is any batman-adv: fix potential kernel paging error for unicast transmissions batman-adv: avoid double free when orig_node initialization fails batman-adv: free skb on TVLV parsing success batman-adv: fix TT CRC computation by ensuring byte order batman-adv: fix potential orig_node reference leak batman-adv: avoid potential race condition when adding a new neighbour batman-adv: properly check pskb_may_pull return value batman-adv: release vlan object after checking the CRC batman-adv: fix TT-TVLV parsing on OGM reception ...
2014-02-18rtnl: make ifla_policy staticJiri Pirko1-1/+1
The only place this is used outside rtnetlink.c is veth. So provide wrapper function for this usage. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-19Merge tag 'ttm-fixes-3.14-2014-02-18' of ↵Dave Airlie1-0/+2
git://people.freedesktop.org/~thomash/linux into drm-fixes Pull request of 2014-02-18 One compile fix and one memory leak. * tag 'ttm-fixes-3.14-2014-02-18' of git://people.freedesktop.org/~thomash/linux: drm/ttm: Fix memory leak in ttm_agp_backend.c drm/ttm: declare 'struct device' in ttm_page_alloc.h
2014-02-19Merge tag 'vmwgfx-fixes-3.14-2014-02-18' of ↵Dave Airlie1-0/+1
git://people.freedesktop.org/~thomash/linux into drm-fixes Pull request of 2014-02-18. Nothing special. The biggest change is adding a couple of command defines and packing the command data correctly. * tag 'vmwgfx-fixes-3.14-2014-02-18' of git://people.freedesktop.org/~thomash/linux: drm/vmwgfx: Fix command defines and checks drm/vmwgfx: Fix possible integer overflow drm/vmwgfx: Remove stray const drm/vmwgfx: unlock on error path in vmw_execbuf_process() drm/vmwgfx: Get maximum mob size from register SVGA_REG_MOB_MAX_SIZE drm/vmwgfx: Fix a couple of sparse warnings and errors
2014-02-18bonding: remove useless updating of slave->dev->last_rxVeaceslav Falico1-7/+1
Now that all the logic is handled via last_arp_rx, we don't need to use last_rx. CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-18drm: add DRM_CAPs for cursor sizeAlex Deucher2-0/+5
Some hardware may not support standard 64x64 cursors. Add a drm cap to query the cursor size from the kernel. Some examples include radeon CIK parts (128x128 cursors) and armada (32x64 or 64x32). This allows things like device specific ddxes to remove asics specific logic and also allows xf86-video-modesetting to work properly with hw cursors on this hardware. Default to 64 if the driver doesn't specify a size. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Rob Clark <robdclark@gmail.com>
2014-02-18drm/ttm: declare 'struct device' in ttm_page_alloc.hAlexandre Courbot1-0/+2
Declare 'struct device' explicitly in ttm_page_alloc.h as this file does not include any file declaring it. This removes the following warning: warning: 'struct device' declared inside parameter list Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> Reviewed-by: Thierry Reding <treding@nvidia.com>
2014-02-17Merge branch 'for-linus' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fixes from Sage Weil: "We have some patches fixing up ACL support issues from Zheng and Guangliang and a mount option to enable/disable this support. (These fixes were somewhat delayed by the Chinese holiday.) There is also a small fix for cached readdir handling when directories are fragmented" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: fix __dcache_readdir() ceph: add acl, noacl options for cephfs mount ceph: make ceph_forget_all_cached_acls() static inline ceph: add missing init_acl() for mkdir() and atomic_open() ceph: fix ceph_set_acl() ceph: fix ceph_removexattr() ceph: remove xattr when null value is given to setxattr() ceph: properly handle XATTR_CREATE and XATTR_REPLACE
2014-02-17ieee802154: add netlink APIs for smartMAC configurationPhoebe Buckheister3-0/+24
Introduce new netlink attributes for SET_PHY_ATTRS: * CSMA minimal backoff exponent * CSMA maximal backoff exponent * CSMA retry limit * frame retransmission limit The CSMA attributes shall correspond to minBE, maxBE and maxCSMABackoffs of 802.15.4, respectively. The frame retransmission shall correspond to maxFrameRetries of 802.15.4, unless given as -1: then the old behaviour of the stack shall apply. For RF2xy, the old behaviour is to not do channel sensing at all and simply send *right now*, which is not intended behaviour for most applications and actually prohibited for some channel/page combinations. For all values except frame retransmission limit, the defaults of 802.15.4 apply. Frame retransmission limits are set to -1 to indicate backward-compatible behaviour. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17ieee802154: add support for setting CCA energy detection levelsPhoebe Buckheister3-0/+10
Since three of the four clear channel assesment modes make use of energy detection, provide an API to set the energy detection threshold. Driver support for this is available in at86rf230 for the RF212 chips. Since for these chips the minimal energy detection threshold depends on page and channel used, add a field to struct at86rf230_local that stores the minimal threshold. Actual ED thresholds are configured as offsets from this value. For RF212, setting the ED threshold will not work before a channel/page has been set due to the dependency of energy detection in the chip and the actual channel/page selected. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17ieee802154: add support for CCA mode in wpan physPhoebe Buckheister3-0/+7
The standard describes four modes of clear channel assesment: "energy above threshold", "carrier found", and the logical and/or of these two. Support for CCA mode setting is included in the at86rf230 driver, predicated for RF212 chips. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17ieee802154: add support for listen-before-talk in wpan_phyPhoebe Buckheister3-0/+10
Listen-before-talk is an alternative to CSMA in uncoordinated networks and prescribed by european regulations if one wants to have a device with radio duty cycles above 10% (or less in some bands). Add a phy property to enable/disable LBT in the phy, including support in the at86rf230 driver for RF212 chips. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17ieee802154: add TX power control to wpan_phyPhoebe Buckheister3-2/+13
Replace the current u8 transmit_power in wpan_phy with s8 transmit_power. The u8 field contained the actual tx power and a tolerance field, which no physical radio every used. Adjust sysfs entries to keep compatibility with userspace, give tolerances of +-1dB statically there. This patch only adds support for this in the at86rf230 driver and the RF212 chip. Configuration calculation for RF212 is also somewhat basic, but does the job - the RF212 datasheet gives a large table with suggested values for combinations of TX power and page/channel, if this does not work well, we might have to copy the whole table. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17net: phy: allow PHY drivers to implement their own software resetFlorian Fainelli1-0/+5
As pointed out by Shaohui, most 10G PHYs out there have a non-standard compliant software reset sequence, eventually something much more complex than just toggling the BMCR_RESET bit. Allow PHY driver to implement their own soft_reset() callback to deal with that. If no callback is provided, call into genphy_soft_reset() which makes sure the existing behavior is kept intact. Reported-by: Shaohui Xie <Shaohui.Xie@freescale.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17net: phy: move PHY software reset to genphy_soft_resetFlorian Fainelli1-0/+1
As pointed out by Shaohui, this function is generic for 10/100/1000 PHYs, but 10G PHYs might have a slightly different reset sequence which prevents most of them from using this function. Move the BMCR_RESET based software resent sequence to genphy_soft_reset() in preparation for allowing PHY drivers to implement a soft_reset() callback. Reported-by: Shaohui Xie <Shaohui.Xie@freescale.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17Merge tag 'dma-buf-for-3.14' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/sumits/dma-buf Pull dma-buf fix from Sumit Semwal: "Just some debugfs output updates. There's another patch related to dma-buf, but it'll get upstreamed via Greg KH's pull request" * tag 'dma-buf-for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/sumits/dma-buf: dma-buf: update debugfs output
2014-02-17ceph: remove xattr when null value is given to setxattr()Yan, Zheng1-2/+3
For the setxattr request, introduce a new flag CEPH_XATTR_REMOVE to distinguish null value case from the zero-length value case. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17Merge branch 'merge' of ↵Linus Torvalds1-0/+39
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc fixes from Ben Herrenschmidt: "Here are some more powerpc fixes for 3.14 The main one is a nasty issue with the NUMA balancing support which requires a small generic change and the addition of a new accessor to set _PAGE_NUMA. Both have been reviewed and acked by Mel and Rik. The changelog should have plenty of details but basically, without this fix, we get random user segfaults and/or corruptions due to missing TLB/hash flushes. Aneesh series of 3 patches fixes it. We have some vDSO vs. perf fixes from Anton, some small EEH fixes from Gavin, a ppc32 regression vs the stack overflow detector, and a fix for the way we handle PCIe host bridge speed settings on pseries (which is needed for proper operations of AMD graphics cards on Power8)" * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc/eeh: Disable EEH on reboot powerpc/eeh: Cleanup on eeh_subsystem_enabled powerpc/powernv: Rework EEH reset powerpc: Use unstripped VDSO image for more accurate profiling data powerpc: Link VDSOs at 0x0 mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit mm: Dirty accountable change only apply to non prot numa case powerpc/mm: Add new "set" flag argument to pte/pmd update function powerpc/pseries: Add Gen3 definitions for PCIE link speed powerpc/pseries: Fix regression on PCI link speed powerpc: Set the correct ksp_limit on ppc32 when switching to irq stack
2014-02-17ipsec: add support of limited SA dumpNicolas Dichtel3-6/+29
The goal of this patch is to allow userland to dump only a part of SA by specifying a filter during the dump. The kernel is in charge to filter SA, this avoids to generate useless netlink traffic (it save also some cpu cycles). This is particularly useful when there is a big number of SA set on the system. Note that I removed the union in struct xfrm_state_walk to fix a problem on arm. struct netlink_callback->args is defined as a array of 6 long and the first long is used in xfrm code to flag the cb as initialized. Hence, we must have: sizeof(struct xfrm_state_walk) <= sizeof(long) * 5. With the union, it was false on arm (sizeof(struct xfrm_state_walk) was sizeof(long) * 7), due to the padding. In fact, whatever the arch is, this union seems useless, there will be always padding after it. Removing it will not increase the size of this struct (and reduce it on arm). Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-02-17netdevice: move netdev_cap_txqueue for shared usage to headerDaniel Borkmann1-0/+20
In order to allow users to invoke netdev_cap_txqueue, it needs to be moved into netdevice.h header file. While at it, also add kernel doc header to document the API. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17netdevice: add queue selection fallback handler for ndo_select_queueDaniel Borkmann1-3/+6
Add a new argument for ndo_select_queue() callback that passes a fallback handler. This gets invoked through netdev_pick_tx(); fallback handler is currently __netdev_pick_tx() as most drivers invoke this function within their customized implementation in case for skbs that don't need any special handling. This fallback handler can then be replaced on other call-sites with different queue selection methods (e.g. in packet sockets, pktgen etc). This also has the nice side-effect that __netdev_pick_tx() is then only invoked from netdev_pick_tx() and export of that function to modules can be undone. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17net: sctp: Fix a_rwnd/rwnd management to reflect real state of the ↵Matija Glavinic Pecotic1-13/+1
receiver's buffer Implementation of (a)rwnd calculation might lead to severe performance issues and associations completely stalling. These problems are described and solution is proposed which improves lksctp's robustness in congestion state. 1) Sudden drop of a_rwnd and incomplete window recovery afterwards Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data), but size of sk_buff, which is blamed against receiver buffer, is not accounted in rwnd. Theoretically, this should not be the problem as actual size of buffer is double the amount requested on the socket (SO_RECVBUF). Problem here is that this will have bad scaling for data which is less then sizeof sk_buff. E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion of traffic of this size (less then 100B). An example of sudden drop and incomplete window recovery is given below. Node B exhibits problematic behavior. Node A initiates association and B is configured to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp message in 4G (LTE) network). On B data is left in buffer by not reading socket in userspace. Lets examine when we will hit pressure state and declare rwnd to be 0 for scenario with above stated parameters (rwnd == 10000, chunk size == 43, each chunk is sent in separate sctp packet) Logic is implemented in sctp_assoc_rwnd_decrease: socket_buffer (see below) is maximum size which can be held in socket buffer (sk_rcvbuf). current_alloced is amount of data currently allocated (rx_count) A simple expression is given for which it will be examined after how many packets for above stated parameters we enter pressure state: We start by condition which has to be met in order to enter pressure state: socket_buffer < currently_alloced; currently_alloced is represented as size of sctp packets received so far and not yet delivered to userspace. x is the number of chunks/packets (since there is no bundling, and each chunk is delivered in separate packet, we can observe each chunk also as sctp packet, and what is important here, having its own sk_buff): socket_buffer < x*each_sctp_packet; each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is twice the amount of initially requested size of socket buffer, which is in case of sctp, twice the a_rwnd requested: 2*rwnd < x*(payload+sizeof(struc sk_buff)); sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000 and each payload size is 43 20000 < x(43+190); x > 20000/233; x ~> 84; After ~84 messages, pressure state is entered and 0 rwnd is advertised while received 84*43B ~= 3612B sctp data. This is why external observer notices sudden drop from 6474 to 0, as it will be now shown in example: IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017] IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839] IP A.34340 > B.12345: sctp (1) [COOKIE ECHO] IP B.12345 > A.34340: sctp (1) [COOKIE ACK] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0] <...> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18] --> Sudden drop IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] At this point, rwnd_press stores current rwnd value so it can be later restored in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This condition is not met since rwnd, after it hit 0, must first reach rwnd_press by adding amount which is read from userspace. Let us observe values in above example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the amount of actual sctp data currently waiting to be delivered to userspace is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed only for sctp data, which is ~3500. Condition is never met, and when userspace reads all data, rwnd stays on 3569. IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0] IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0] --> At this point userspace read everything, rwnd recovered only to 3569 IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18] IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0] Reproduction is straight forward, it is enough for sender to send packets of size less then sizeof(struct sk_buff) and receiver keeping them in its buffers. 2) Minute size window for associations sharing the same socket buffer In case multiple associations share the same socket, and same socket buffer (sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one of the associations can permanently drop rwnd of other association(s). Situation will be typically observed as one association suddenly having rwnd dropped to size of last packet received and never recovering beyond that point. Different scenarios will lead to it, but all have in common that one of the associations (let it be association from 1)) nearly depleted socket buffer, and the other association blames socket buffer just for the amount enough to start the pressure. This association will enter pressure state, set rwnd_press and announce 0 rwnd. When data is read by userspace, similar situation as in 1) will occur, rwnd will increase just for the size read by userspace but rwnd_press will be high enough so that association doesn't have enough credit to reach rwnd_press and restore to previous state. This case is special case of 1), being worse as there is, in the worst case, only one packet in buffer for which size rwnd will be increased. Consequence is association which has very low maximum rwnd ('minute size', in our case down to 43B - size of packet which caused pressure) and as such unusable. Scenario happened in the field and labs frequently after congestion state (link breaks, different probabilities of packet drop, packet reordering) and with scenario 1) preceding. Here is given a deterministic scenario for reproduction: >From node A establish two associations on the same socket, with rcvbuf_policy being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1 repeat scenario from 1), that is, bring it down to 0 and restore up. Observe scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered', bring it down close to 0, as in just one more packet would close it. This has as a consequence that association number 2 is able to receive (at least) one more packet which will bring it in pressure state. E.g. if association 2 had rwnd of 10000, packet received was 43, and we enter at this point into pressure, rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will increase for 43, but conditions to restore rwnd to original state, just as in 1), will never be satisfied. --> Association 1, between A.y and B.12345 IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569] IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613] IP A.55915 > B.12345: sctp (1) [COOKIE ECHO] IP B.12345 > A.55915: sctp (1) [COOKIE ACK] --> Association 2, between A.z and B.12346 IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173] IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240] IP A.55915 > B.12346: sctp (1) [COOKIE ECHO] IP B.12346 > A.55915: sctp (1) [COOKIE ACK] --> Deplete socket buffer by sending messages of size 43B over association 1 IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0] <...> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0] --> Sudden drop on 1 IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] --> Here userspace read, rwnd 'recovered' to 3698, now deplete again using association 1 so there is place in buffer for only one more packet IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0] --> Socket buffer is almost depleted, but there is space for one more packet, send them over association 2, size 43B IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18] IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] --> Immediate drop IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0] --> Read everything from the socket, both association recover up to maximum rwnd they are capable of reaching, note that association 1 recovered up to 3698, and association 2 recovered only to 43 IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0] IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18] IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0] IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18] IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0] A careful reader might wonder why it is necessary to reproduce 1) prior reproduction of 2). It is simply easier to observe when to send packet over association 2 which will push association into the pressure state. Proposed solution: Both problems share the same root cause, and that is improper scaling of socket buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while calculating rwnd is not possible due to fact that there is no linear relationship between amount of data blamed in increase/decrease with IP packet in which payload arrived. Even in case such solution would be followed, complexity of the code would increase. Due to nature of current rwnd handling, slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is entered is rationale, but it gives false representation to the sender of current buffer space. Furthermore, it implements additional congestion control mechanism which is defined on implementation, and not on standard basis. Proposed solution simplifies whole algorithm having on mind definition from rfc: o Receiver Window (rwnd): This gives the sender an indication of the space available in the receiver's inbound buffer. Core of the proposed solution is given with these lines: sctp_assoc_rwnd_update: if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0) asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1; else asoc->rwnd = 0; We advertise to sender (half of) actual space we have. Half is in the braces depending whether you would like to observe size of socket buffer as SO_RECVBUF or twice the amount, i.e. size is the one visible from userspace, that is, from kernelspace. In this way sender is given with good approximation of our buffer space, regardless of the buffer policy - we always advertise what we have. Proposed solution fixes described problems and removes necessity for rwnd restoration algorithm. Finally, as proposed solution is simplification, some lines of code, along with some bytes in struct sctp_association are saved. Version 2 of the patch addressed comments from Vlad. Name of the function is set to be more descriptive, and two parts of code are changed, in one removing the superfluous call to sctp_assoc_rwnd_update since call would not result in update of rwnd, and the other being reordering of the code in a way that call to sctp_assoc_rwnd_update updates rwnd. Version 3 corrected change introduced in v2 in a way that existing function is not reordered/copied in line, but it is correctly called. Thanks Vlad for suggesting. Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>