summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2015-10-12ipv4: Only compute net once in ip_call_ra_chainEric W. Biederman1-1/+2
ip_call_ra_chain is called early in the forwarding chain from ip_forward and ip_mr_input, which makes skb->dev the correct expression to get the input network device and dev_net(skb->dev) a correct expression for the network namespace the packet is being processed in. Compute the network namespace and store it in a variable to make the code clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12packet: fix match_fanout_group()Eric Dumazet1-3/+3
Recent TCP listener patches exposed a prior af_packet bug : match_fanout_group() blindly assumes it is always safe to cast sk to a packet socket to compare fanout with af_packet_priv But SYNACK packets can be sent while attached to request_sock, which are smaller than a "struct sock". We can read non existent memory and crash. Fixes: c0de08d04215 ("af_packet: don't emit packet on orig fanout group") Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Eric Leblond <eric@regit.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12ipv4/icmp: redirect messages can use the ingress daddr as sourcePaolo Abeni2-1/+15
This patch allows configuring how the source address of ICMP redirect messages is selected; by default the old behaviour is retained, while setting icmp_redirects_use_orig_daddr force the usage of the destination address of the packet that caused the redirect. The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the following scenario: Two machines are set up with VRRP to act as routers out of a subnet, they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to x.x.x.254/24. If a host in said subnet needs to get an ICMP redirect from the VRRP router, i.e. to reach a destination behind a different gateway, the source IP in the ICMP redirect is chosen as the primary IP on the interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2. The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2, and will continue to use the wrong next-op. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12bridge: try switchdev op first in __vlan_vid_add/delJiri Pirko1-36/+22
Some drivers need to implement both switchdev vlan ops and vid_add/kill ndos. For that to work in bridge code, we need to try switchdev op first when adding/deleting vlan id. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12net: shrink struct sock and request_sock by 8 bytesEric Dumazet7-17/+17
One 32bit hole is following skc_refcnt, use it. skc_incoming_cpu can also be an union for request_sock rcv_wnd. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12net: SO_INCOMING_CPU setsockopt() supportEric Dumazet5-5/+21
SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command to fetch incoming cpu handling a particular TCP flow after accept() This commits adds setsockopt() support and extends SO_REUSEPORT selection logic : If a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if CPU handling the packet matches the specified one. This allows to build very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same cpu. This provides optimal NUMA behavior and keep cpu caches hot. Note that __inet_lookup_listener() still has to iterate over the list of all listeners. Following patch puts sk_refcnt in a different cache line to let this iteration hit only shared and read mostly cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12packet: support per-packet fwmark for af_packet sendmsgEdward Jee1-1/+9
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12sock: support per-packet fwmarkEdward Jee1-0/+26
It's useful to allow users to set fwmark for an individual packet, without changing the socket state. The function this patch adds in sock layer can be used by the protocols that need such a feature. Signed-off-by: Edward Hyunkoo Jee <edjee@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12bpf: enable non-root eBPF programsAlexei Starovoitov1-1/+2
In order to let unprivileged users load and execute eBPF programs teach verifier to prevent pointer leaks. Verifier will prevent - any arithmetic on pointers (except R10+Imm which is used to compute stack addresses) - comparison of pointers (except if (map_value_ptr == 0) ... ) - passing pointers to helper functions - indirectly passing pointers in stack to helper functions - returning pointer from bpf program - storing pointers into ctx or maps Spill/fill of pointers into stack is allowed, but mangling of pointers stored in the stack or reading them byte by byte is not. Within bpf programs the pointers do exist, since programs need to be able to access maps, pass skb pointer to LD_ABS insns, etc but programs cannot pass such pointer values to the outside or obfuscate them. Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs, so that socket filters (tcpdump), af_packet (quic acceleration) and future kcm can use it. tracing and tc cls/act program types still require root permissions, since tracing actually needs to be able to see all kernel pointers and tc is for root only. For example, the following unprivileged socket filter program is allowed: int bpf_prog1(struct __sk_buff *skb) { u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)); u64 *value = bpf_map_lookup_elem(&my_map, &index); if (value) *value += skb->len; return 0; } but the following program is not: int bpf_prog1(struct __sk_buff *skb) { u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)); u64 *value = bpf_map_lookup_elem(&my_map, &index); if (value) *value += (u64) skb; return 0; } since it would leak the kernel address into the map. Unprivileged socket filter bpf programs have access to the following helper functions: - map lookup/update/delete (but they cannot store kernel pointers into them) - get_random (it's already exposed to unprivileged user space) - get_smp_processor_id - tail_call into another socket filter program - ktime_get_ns The feature is controlled by sysctl kernel.unprivileged_bpf_disabled. This toggle defaults to off (0), but can be set true (1). Once true, bpf programs and maps cannot be accessed from unprivileged process, and the toggle cannot be set back to false. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12bridge: push bridge setting ageing_time down to switchdevScott Feldman5-7/+29
Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't support setting ageing_time (or setting bridge attrs in general). If push fails, don't update ageing_time in bridge and return err to user. If push succeeds, update ageing_time in bridge and run gc_timer now to recalabrate when to run gc_timer next, based on new ageing_time. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12switchdev: skip over ports returning -EOPNOTSUPP when recursing portsScott Feldman1-1/+8
This allows us to recurse over all the ports, skipping over unsupporting ports. Without the change, the recursion would stop at first unsupported port. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12tcp: change type of alive from int to boolRichard Sailer1-3/+3
The alive parameter of tcp_orphan_retries, indicates whether the connection is assumed alive or not. In the function and all places calling it is used as a boolean value. Therefore this changes the type of alive to bool in the function definition and all calling locations. Since tcp_orphan_tries is a tcp_timer.c local function no change in any other file or header is necessary. Signed-off-by: Richard Sailer <richard@weltraumpflege.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12bridge: allow adding of fdb entries pointing to the bridge deviceRoopa Prabhu2-30/+93
This patch enables adding of fdb entries pointing to the bridge device. This can be used to propagate mac address of vlan interfaces configured on top of the vlan filtering bridge. Before: $bridge fdb add 44:38:39:00:27:9f dev bridge RTNETLINK answers: Invalid argument After: $bridge fdb add 44:38:39:00:27:9f dev bridge Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11tcp: fix RFS vs lockless listenersEric Dumazet2-0/+2
Before recent TCP listener patches, we were updating listener sk->sk_rxhash before the cloning of master socket. children sk_rxhash was therefore correct after the normal 3WHS. But with lockless listener, we no longer dirty/change listener sk_rxhash as it would be racy. We need to correctly update the child sk_rxhash, otherwise first data packet wont hit correct cpu if RFS is used. Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Willem de Bruijn <willemb@google.com> Cc: Tom Herbert <tom@herbertland.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11net: dsa: use switchdev obj in port_fdb_delVivien Didelot1-1/+1
For consistency with the FDB add operation, propagate the switchdev_obj_port_fdb structure in the DSA drivers. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11net: dsa: push prepare phase in port_fdb_addVivien Didelot1-1/+1
Now that the prepare phase is pushed down to the DSA drivers, propagate it to the port_fdb_add function. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11net: dsa: add port_fdb_prepareVivien Didelot1-2/+5
Push the prepare phase for FDB operations down to the DSA drivers, with a new port_fdb_prepare function. Currently only mv88e6xxx is affected. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11Merge branch 'for-upstream' of ↵David S. Miller21-135/+1870
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2015-10-08 Here's another set of Bluetooth & 802.15.4 patches for the 4.4 kernel. 802.15.4: - Many improvements & fixes to the mrf24j40 driver - Fixes and cleanups to nl802154, mac802154 & ieee802154 code Bluetooth: - New chipset support in btmrvl driver - Fixes & cleanups to btbcm, btmrvl, bpa10x & btintel drivers - Support for vendor specific diagnostic data through common API - Cleanups to the 6lowpan code - New events & message types for monitor channel Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11net: synack packets can be attached to request socketsEric Dumazet1-1/+2
selinux needs few changes to accommodate fact that SYNACK messages can be attached to a request socket, lacking sk_security pointer (Only syncookies are still attached to a TCP_LISTEN socket) Adds a new sk_listener() helper, and use it in selinux and sch_fq Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported by: kernel test robot <ying.huang@linux.intel.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Eric Paris <eparis@parisplace.org> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11bpf: fix cb access in socket filter programsAlexei Starovoitov2-10/+12
eBPF socket filter programs may see junk in 'u32 cb[5]' area, since it could have been used by protocol layers earlier. For socket filter programs used in af_packet we need to clean 20 bytes of skb->cb area if it could be used by the program. For programs attached to TCP/UDP sockets we need to save/restore these 20 bytes, since it's used by protocol layers. Remove SK_RUN_FILTER macro, since it's no longer used. Long term we may move this bpf cb area to per-cpu scratch, but that requires addition of new 'per-cpu load/store' instructions, so not suitable as a short term fix. Fixes: d691f9e8d440 ("bpf: allow programs to write to certain skb fields") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-09net/sched: make sch_blackhole.c explicitly non-modularPaul Gortmaker1-12/+3
The Kconfig currently controlling compilation of this code is: net/sched/Kconfig:menuconfig NET_SCHED net/sched/Kconfig: bool "QoS and/or fair queueing" ...meaning that it currently is not being built as a module by anyone. Lets remove the modular code that is essentially orphaned, so that when reading the driver there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering remains unchanged with this commit. We can change to one of the other priority initcalls (subsys?) at any later date, if desired. We also delete the MODULE_LICENSE tag since all that information is already contained at the top of the file in the comments. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-09net/dcb: make dcbnl.c explicitly non-modularPaul Gortmaker1-27/+3
The Kconfig currently controlling compilation of this code is: net/dcb/Kconfig:config DCB net/dcb/Kconfig: bool "Data Center Bridging support" ...meaning that it currently is not being built as a module by anyone. Lets remove the modular code that is essentially orphaned, so that when reading the driver there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering remains unchanged with this commit. We can change to one of the other priority initcalls (subsys?) at any later date, if desired. We also delete the MODULE_LICENSE tag etc. since all that information is (or is now) already contained at the top of the file in the comments. Cc: "David S. Miller" <davem@davemloft.net> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Anish Bhatt <anish@chelsio.com> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Shani Michaeli <shanim@mellanox.com> Cc: netdev@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-09net/core: make sock_diag.c explicitly non-modularPaul Gortmaker1-11/+3
The Makefile currently controlling compilation of this code lists it under "obj-y" ...meaning that it currently is not being built as a module by anyone. Lets remove the modular code that is essentially orphaned, so that when reading the driver there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering remains unchanged with this commit. We can change to one of the other priority initcalls (subsys?) at any later date, if desired. We can't remove module.h since the file uses other module related stuff even though it is not modular itself. We move the information from the MODULE_LICENSE tag to the top of the file, since that information is not captured anywhere else. The MODULE_ALIAS_NET_PF_PROTO becomes a no-op in the non modular case, so it is removed. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Craig Gallek <kraig@google.com> Cc: netdev@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-09net/core: lockdep_rtnl_is_held can be booleanYaowei Bai1-1/+1
This patch makes lockdep_rtnl_is_held return bool due to this particular function only using either one or zero as its return value. In another patch lockdep_is_held is also made return bool. No functional change. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-09net/dccp: dccp_bad_service_code can be booleanYaowei Bai1-2/+2
This patch makes dccp_bad_service_code return bool due to these particular functions only using either one or zero as their return value. dccp_list_has_service is also been made return bool in this patchset. No functional change. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-09net/nfnetlink: lockdep_nfnl_is_held can be booleanYaowei Bai1-1/+1
This patch makes lockdep_nfnl_is_held return bool to improve readability due to this particular function only using either one or zero as its return value. No functional change. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-09net/netlink: lockdep_genl_is_held can be booleanYaowei Bai1-1/+1
This patch makes lockdep_genl_is_held return bool to improve readability due to this particular function only using either one or zero as its return value. No functional change. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08Bluetooth: Fix basic debugfs entries for unconfigured controllersMarcel Holtmann1-14/+15
When the controller is unconfigured (for example it does not have a valid Bluetooth address), then the basic debugfs entries for dut_mode and vendor_diag are not creates. Ensure they are created in __hci_init and also __hci_unconf_init functions. One of them is called during setup stage of a new controller. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-08bpf: split state from prandom_u32() and consolidate {c, e}BPF prngsDaniel Borkmann1-7/+2
While recently arguing on a seccomp discussion that raw prandom_u32() access shouldn't be exposed to unpriviledged user space, I forgot the fact that SKF_AD_RANDOM extension actually already does it for some time in cBPF via commit 4cd3675ebf74 ("filter: added BPF random opcode"). Since prandom_u32() is being used in a lot of critical networking code, lets be more conservative and split their states. Furthermore, consolidate eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF, bpf_get_prandom_u32() was only accessible for priviledged users, but should that change one day, we also don't want to leak raw sequences through things like eBPF maps. One thought was also to have own per bpf_prog states, but due to ABI reasons this is not easily possible, i.e. the program code currently cannot access bpf_prog itself, and copying the rnd_state to/from the stack scratch space whenever a program uses the prng seems not really worth the trouble and seems too hacky. If needed, taus113 could in such cases be implemented within eBPF using a map entry to keep the state space, or get_random_bytes() could become a second helper in cases where performance would not be critical. Both sides can trigger a one-time late init via prandom_init_once() on the shared state. Performance-wise, there should even be a tiny gain as bpf_user_rnd_u32() saves one function call. The PRNG needs to live inside the BPF core since kernels could have a NET-less config as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Chema Gonzalez <chema@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08net: move net_get_random_once to libHannes Frederic Sowa1-49/+0
There's no good reason why users outside of networking should not be using this facility, f.e. for initializing their seeds. Therefore, make it accessible from there as get_random_once(). Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-086lowpan: move shared settings to lowpan_netdev_setupAlexander Aring3-10/+5
This patch moves values for all lowpan interface to the shared implementation of 6lowpan. This patch also quietly fixes the forgotten IFF_NO_QUEUE flag for the bluetooth 6LoWPAN interface. An identically commit is 4afbc0d ("net: 6lowpan: convert to using IFF_NO_QUEUE") which wasn't changed for bluetooth 6lowpan. All 6lowpan interfaces should be virtual with IFF_NO_QUEUE, using EUI64 address length, the mtu size is 1280 (IPV6_MIN_MTU) and the netdev type is ARPHRD_6LOWPAN. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-08net: Do not drop to make_route if oif is l3mdevDavid Ahern1-1/+2
Commit deaa0a6a930 ("net: Lookup actual route when oif is VRF device") exposed a bug in __ip_route_output_key_hash for VRF devices: on FIB lookup failure if the oif is specified the current logic drops to make_route on the assumption that the route tables are wrong. For VRF/L3 master devices this leads to wrong dst entries and route lookups. For example: $ ip route ls table vrf-red unreachable default broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2 local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2 broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2 $ ip route get oif vrf-red 1.1.1.1 1.1.1.1 dev vrf-red src 10.0.0.2 cache With this patch: $ ip route get oif vrf-red 1.1.1.1 RTNETLINK answers: No route to host which is the correct response based on the default route Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08bpf, skb_do_redirect: clear sender_cpu before xmitDaniel Borkmann1-0/+1
Similar to commit c29390c6dfee ("xps: must clear sender_cpu before forwarding"), we also need to clear the skb->sender_cpu when moving from RX to TX via skb_do_redirect() due to the shared location of napi_id (used on RX) and sender_cpu (used on TX). Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08Merge branch 'net/rds/4.3-v3' of ↵David S. Miller12-282/+448
git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux Santosh Shilimkar says: ==================== RDS: connection scalability and performance improvements [v4] Re-sending the same patches from v3 again since my repost of patch 05/14 from v3 was whitespace damaged. [v3] Updated patch "[PATCH v2 05/14] RDS: defer the over_batch work to send worker" as per David Miller's comment [4] to avoid the magic value usage. Patch now makes use of already available but unused send_batch_count module parameter. Rest of the patches are same as earlier version v2 [3] [v2]: Dropped "[PATCH 05/15] RDS: increase size of hash-table to 8K" from earlier version [1]. I plan to address the hash table scalability using re-sizable hash tables as suggested by David Laight and David Miller [2] This series addresses RDS connection bottlenecks on massive workloads and improve the RDMA performance almost by 3X. RDS TCP also gets a small gain of about 12%. RDS is being used in massive systems with high scalability where several hundred thousand end points and tens of thousands of local processes are operating in tens of thousand sockets. Being RC(reliable connection), socket bind and release happens very often and any inefficiencies in bind hash look ups hurts the overall system performance. RDS bin hash-table uses global spin-lock which is the biggest bottleneck. To make matter worst, it uses rcu inside global lock for hash buckets. This is being addressed by simply using per bucket rw lock which makes the locking simple and very efficient. The hash table size is still an issue and I plan to address it by using re-sizable hash tables as suggested on the list. For RDS RDMA improvement, the completion handling is revamped so that we can do batch completions. Both send and receive completion handlers are split logically to achieve the same. RDS 8K messages being one of the key usecase, mr pool is adapted to have the 8K mrs along with default 1M mrs. And while doing this, few fixes and couple of bottlenecks seen with rds_sendmsg() are addressed. Series applies against 4.3-rc1 as well net-next. Its tested on Oracle hardware with IB fabric for both bcopy as well as RDMA mode. RDS TCP is tested with iXGB NIC. Like last time, iWARP transport is untested with these changes. The patchset is also available at below git repo: git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git net/rds/4.3-v3 As a side note, the IB HCA driver I used for testing misses at least 3 important patches in upstream to see the full blown IB performance and am hoping to get that in mainline with help of them. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08dst: Pass net into dst->outputEric W. Biederman12-37/+30
The network namespace is already passed into dst_output pass it into dst->output lwt->output and friends. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv4, ipv6: Pass net into ip_local_out and ip6_local_outEric W. Biederman12-19/+17
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv4, ipv6: Pass net into __ip_local_out and __ip6_local_outEric W. Biederman3-7/+5
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv4: Cache net in ip_build_and_send_pkt and ip_queue_xmitEric W. Biederman1-4/+6
Compute net and store it in a variable in the functions ip_build_and_send_pkt and ip_queue_xmit so that it does not need to be recomputed next time it is needed. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv4: Cache net in iptunnel_xmitEric W. Biederman1-2/+2
Store net in a variable in ip_tunnel_xmit so it does not need to be recomputed when it is used again. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv6: Merge ip6_local_out and ip6_local_out_skEric W. Biederman6-12/+6
Stop hidding the sk parameter with an inline helper function and make all of the callers pass it, so that it is clear what the function is doing. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv6: Merge __ip6_local_out and __ip6_local_out_skEric W. Biederman3-9/+4
Only __ip6_local_out_sk has callers so rename __ip6_local_out_sk __ip6_local_out and remove the previous __ip6_local_out. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv4: Merge ip_local_out and ip_local_out_skEric W. Biederman7-12/+12
It is confusing and silly hiding a parameter so modify all of the callers to pass in the appropriate socket or skb->sk if no socket is known. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv4: Merge __ip_local_out and __ip_local_out_skEric W. Biederman3-9/+4
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08dst: Pass a sk into .local_outEric W. Biederman7-7/+7
For consistency with the other similar methods in the kernel pass a struct sock into the dst_ops .local_out method. Simplifying the socket passing case is needed a prequel to passing a struct net reference into .local_out. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08net: Pass net into dst_output and remove dst_output_okfnEric W. Biederman18-25/+27
Replace dst_output_okfn with dst_output Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08xfrm: Only compute net once in xfrm_policy_queue_processEric W. Biederman1-4/+3
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv4: Fix ip_queue_xmit to pass sk into ip_local_out_skEric W. Biederman1-1/+1
After a packet has been encapsulated by a tunnel we should use the tunnel sockets local multicast loopback flag to control if the encapsulated packet should be locally loopback back. Pass sk into ip_local_out_sk so that in the rare case we are dealing with a tunneled packet whose tunnel destination address is a multicast address the kernel properly decides to loopback this packet. In practice I don't think this matters as ip_queue_xmit is used by tcp, l2tp and sctp none of which I am aware of uses ip level multicasting as they are all point to point communications protocols. Let's fix this before someone uses ip_queue_xmit for a tunnel protocol that does use multicast. Fixes: aad88724c9d5 ("ipv4: add a sock pointer to dst->output() path.") Fixes: b0270e91014d ("ipv4: add a sock pointer to ip_queue_xmit()") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv4: Fix ip_local_out_sk by passing the sk into __ip_local_out_skEric W. Biederman1-1/+1
In the rare case where sk != skb->sk ip_local_out_sk arranges to call dst->output differently if the skb is queued or not. This is a bug. Fix this bug by passing the sk parameter of ip_local_out_sk through from ip_local_out_sk to __ip_local_out_sk (skipping __ip_local_out). Fixes: 7026b1ddb6b8 ("netfilter: Pass socket pointer down through okfn().") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08Bluetooth: 6lowpan: Remove unnecessary chan_get() functionJohan Hedberg1-15/+5
The chan_get() function just adds unnecessary indirection to calling the chan_create() call. The only added value it gives is the chan->ops assignment, but that can equally well be done in the calling code. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-08Bluetooth: 6lowpan: Rename confusing 'pchan' variablesJohan Hedberg1-21/+21
The typical convention when having both a child and a parent channel variable is to call the former 'chan' and the latter 'pchan'. When there's only one variable it's called chan. Rename the 'pchan' variables in the 6lowpan code to follow this convention. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>