summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2018-02-19tipc: don't call sock_release() in atomic contextPaolo Abeni1-1/+1
syzbot reported a scheduling while atomic issue at netns destruction time: BUG: sleeping function called from invalid context at net/core/sock.c:2769 in_atomic(): 1, irqs_disabled(): 0, pid: 85, name: kworker/u4:3 5 locks held by kworker/u4:3/85: #0: ((wq_completion)"%s""netns"){+.+.}, at: [<00000000c9792deb>] process_one_work+0xaaf/0x1af0 kernel/workqueue.c:2084 #1: (net_cleanup_work){+.+.}, at: [<00000000adc12e2a>] process_one_work+0xb01/0x1af0 kernel/workqueue.c:2088 #2: (net_sem){++++}, at: [<000000009ccb5669>] cleanup_net+0x23f/0xd20 net/core/net_namespace.c:494 #3: (net_mutex){+.+.}, at: [<00000000a92767d9>] cleanup_net+0xa7d/0xd20 net/core/net_namespace.c:496 #4: (&(&srv->idr_lock)->rlock){+...}, at: [<000000001343e568>] spin_lock_bh include/linux/spinlock.h:315 [inline] #4: (&(&srv->idr_lock)->rlock){+...}, at: [<000000001343e568>] tipc_topsrv_stop+0x231/0x610 net/tipc/topsrv.c:685 CPU: 0 PID: 85 Comm: kworker/u4:3 Not tainted 4.16.0-rc1+ #230 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 ___might_sleep+0x2b2/0x470 kernel/sched/core.c:6128 __might_sleep+0x95/0x190 kernel/sched/core.c:6081 lock_sock_nested+0x37/0x110 net/core/sock.c:2769 lock_sock include/net/sock.h:1463 [inline] tipc_release+0x103/0xff0 net/tipc/socket.c:572 sock_release+0x8d/0x1e0 net/socket.c:594 tipc_topsrv_stop+0x3c0/0x610 net/tipc/topsrv.c:696 tipc_exit_net+0x15/0x40 net/tipc/core.c:96 ops_exit_list.isra.6+0xae/0x150 net/core/net_namespace.c:148 cleanup_net+0x6ba/0xd20 net/core/net_namespace.c:529 process_one_work+0xbbf/0x1af0 kernel/workqueue.c:2113 worker_thread+0x223/0x1990 kernel/workqueue.c:2247 kthread+0x33c/0x400 kernel/kthread.c:238 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:429 This is caused by tipc_topsrv_stop() releasing the listener socket with the idr lock held. This changeset addresses the issue moving the release operation outside such lock. Reported-and-tested-by: syzbot+749d9d87c294c00ca856@syzkaller.appspotmail.com Fixes: 0ef897be12b8 ("tipc: separate topology server listener socket from subcsriber sockets") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: ///jon Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19tipc: fix bug on error path in tipc_topsrv_kern_subscr()Jon Maloy1-3/+4
In commit cc1ea9ffadf7 ("tipc: eliminate struct tipc_subscriber") we re-introduced an old bug on the error path in the function tipc_topsrv_kern_subscr(). We now re-introduce the correction too. Reported-by: syzbot+f62e0f2a0ef578703946@syzkaller.appspotmail.com Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19Merge branch 'pernet_ops-conversions-part-2'David S. Miller24-0/+26
Kirill Tkhai says: ==================== Converting pernet_operations (part #2) This patchset continues to review and to convert pernet_operations to async. There are mostly ipv6, also some regular used netfilter pernet_operations are involved. One more converted is cfg80211_pernet_ops. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert iptable_filter_net_opsKirill Tkhai1-0/+1
These pernet_operations register and unregister net::ipv4.iptable_filter table. Since there are no packets in-flight at the time of exit method is working, iptables rules should not be touched. Also, pernet_operations should not send ipv4 packets each other. So, it's safe to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert ip_tables_net_ops, udplite6_net_ops and xt_net_opsKirill Tkhai3-0/+3
ip_tables_net_ops and udplite6_net_ops create and destroy /proc entries. xt_net_ops does nothing. So, we are able to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert ip6_frags_opsKirill Tkhai1-0/+1
Exit methods calls inet_frags_exit_net() with global ip6_frags as argument. So, after we make the pernet_operations async, a pair of exit methods may be called to iterate this hash table. Since there is inet_frag_worker(), which already may work in parallel with inet_frags_exit_net(), and it can make the same cleanup, that inet_frags_exit_net() does, it's safe. So we may mark these pernet_operations as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert fib6_net_ops, ipv6_addr_label_ops and ip6_segments_opsKirill Tkhai3-0/+3
These pernet_operations register and unregister tables and lists for packets forwarding. All of the entities are per-net. Init methods makes simple initializations, and since net is not visible for foreigners at the time it is working, it can't race with anything. Exit method is executed when there are only local devices, and there mustn't be packets in-flight. Also, it looks like no one pernet_operations want to send ipv6 packets to foreign net. The same reasons are for ipv6_addr_label_ops and ip6_segments_ops. So, we are able to mark all them as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert xfrm6_net_opsKirill Tkhai1-0/+1
These pernet_operations create sysctl tables and initialize net::xfrm.xfrm6_dst_ops used for routing. It doesn't look like another pernet_operations send ipv6 packets to foreign net namespaces, so it should be safe to mark the pernet_operations as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert ip6_flowlabel_net_opsKirill Tkhai1-0/+1
These pernet_operations create and destroy /proc entries. ip6_fl_purge() makes almost the same actions as timer ip6_fl_gc_timer does, and as it can be executed in parallel with ip6_fl_purge(), two parallel ip6_fl_purge() may be executed. So, we can mark it async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert ping_v6_net_opsKirill Tkhai1-0/+1
These pernet_operations only register and unregister /proc entries, so it's possible to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert ipv6_sysctl_net_opsKirill Tkhai1-0/+1
These pernet_operations create and destroy sysctl tables. They are not touched by another net pernet_operations. So, it's possible to execute them in parallel with others. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert tcpv6_net_opsKirill Tkhai1-0/+1
These pernet_operations create and destroy net::ipv6.tcp_sk socket, which is used in tcp_v6_send_response() only. It looks like foreign pernet_operations don't want to set ipv6 connection inside destroyed net, so this socket may be created in destroyed in parallel with anything else. inet_twsk_purge() is also safe for that, as described in patch for tcp_sk_ops. So, it's possible to mark them as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert fib6_rules_net_opsKirill Tkhai1-0/+1
These pernet_operations register and unregister net::ipv6.fib6_rules_ops, which are used for routing. It looks like there are no pernet_operations, which send ipv6 packages to another net, so we are able to mark them as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert ipv6_inetpeer_opsKirill Tkhai1-0/+1
net->ipv6.peers is dereferenced in three places via inet_getpeer_v6(), and it's used to handle skb. All the users of inet_getpeer_v6() do not look like be able to be called from foreign net pernet_operations, so we may mark them as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert raw6_net_ops, udplite6_net_ops, ipv6_proc_ops, if6_proc_net_ops ↵Kirill Tkhai4-0/+5
and ip6_route_net_late_ops These pernet_operations create and destroy /proc entries and safely may be converted and safely may be mark as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert icmpv6_sk_ops, ndisc_net_ops and igmp6_net_opsKirill Tkhai3-0/+3
These pernet_operations create and destroy net::ipv6.icmp_sk socket, used to send ICMP or error reply. Nobody can dereference the socket to handle a packet before net is initialized, as there is no routing; nobody can do that in parallel with exit, as all of devices are moved to init_net or destroyed and there are no packets it-flight. So, it's possible to mark these pernet_operations as async. The same for ndisc_net_ops and for igmp6_net_ops. The last one also creates and destroys /proc entries. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert ip6mr_net_opsKirill Tkhai1-0/+1
These pernet_operations create and destroy /proc entries, populate and depopulate net::rules_ops and multiroute table. All the structures are pernet, and they are not touched by foreign net pernet_operations. So, it's possible to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert cfg80211_pernet_opsKirill Tkhai1-0/+1
This patch finishes converting pernet_operations registered in net/wireless directory. These pernet_operations have only exit method, which moves devices to init_net. This action is not pernet_operations-specific, and function cfg80211_switch_netns() may be called all time during the system life. All necessary protection against concurrent cfg80211_pernet_exit() is made by rtnl_lock(). So, cfg80211_pernet_ops is able to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: Convert inet6_net_opsKirill Tkhai1-0/+1
init method initializes sysctl defaults, allocates percpu arrays and creates /proc entries. exit method reverts the above. There are no pernet_operations, which are interested in the above entities of foreign net namespace, so inet6_net_ops are able to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19dpaa_eth: fix pause capability advertisement logicJake Moroni1-1/+1
The ADVERTISED_Asym_Pause bit was being improperly set when both rx and tx pause were enabled. When rx and tx are both enabled, only the ADVERTISED_Pause bit is supposed to be set. Signed-off-by: Jake Moroni <mail@jakemoroni.com> Acked-by: Madalin Bucur <madalin.bucur@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19sh_eth: simplify sh_eth_check_reset()Sergei Shtylyov1-10/+6
The *while* loop in this function can be turned into a normal *for* loop. And getting rid of the single return point saves us a few more LoCs... Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19Merge branch 'dwmac-meson8b-small-cleanup'David S. Miller1-116/+92
Martin Blumenstingl says: ==================== dwmac-meson8b: small cleanup This is a follow-up to my previous series "dwmac-meson8b: clock fixes for Meson8b" from [0]. during the review of that series it was found that the clock registration could be simplified. now that the previous series has landed we can start cleaning up the clock registration. the goal of this series is to simplify the code in the dwmac-meson8b driver. no functional changes are intended. I have tested this on my Khadas VIM2 (GXM SoC, with RGMII PHY) and my Endless Mini (EC-100, Meson8b SoC with RMII PHY, .dts support is not part of mainline yet). no problems were found. [0] http://lists.infradead.org/pipermail/linux-amlogic/2018-January/006143.html ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: stmmac: dwmac-meson8b: make the clock configurations privateMartin Blumenstingl1-21/+24
The common clock framework needs access to the "clock configuration" structs during runtime. However, only the common clock framework should access these. Ensure this by moving the configuration structs out of struct meson8b_dwmac, so only meson8b_init_rgmii_tx_clk() and the common clock framework know about these configurations. Suggested-by: Jerome Brunet <jbrunet@baylibre.com> Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Acked-by: Jerome Brunet <jbrunet@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: stmmac: dwmac-meson8b: only keep struct device aroundMartin Blumenstingl1-10/+9
Nothing in the dwmac-meson8b driver (except .probe itself) requires the platform_device anymore after .probe has finished. Replace it with a pointer to struct device since this is what the functions inside the driver are actually accessing. No functional changes. Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: stmmac: dwmac-meson8b: simplify clock registrationMartin Blumenstingl1-91/+65
To goal of this patch is to simplify the registration of the RGMII TX clock (and it's parent clocks). This is achieved by: - introducing the meson8b_dwmac_register_clk helper-function to remove code duplication when registering a single clock (this saves a few lines since we have 4 clocks internally) - using devm_add_action_or_reset to disable the RGMII TX clock automatically when needed. This also allows us to re-use the standard stmmac_pltfr_remove function. - devm_kasprintf() and devm_kstrdup() are not used anymore to generate the clock name (these are replaced by a variable on the stack) because the common clock framework already uses kstrdup() internally. No functional changes intended. Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Jerome Brunet <jbrunet@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19net: dsa: mv88e6xxx: hwtstamp: remove unnecessary range checking testsGustavo A. R. Silva1-9/+0
_port_ is already known to be a valid index in the callers [1]. So these checks are unnecessary. [1] https://lkml.org/lkml/2018/2/16/469 Addresses-Coverity-ID: 1465287 Addresses-Coverity-ID: 1465291 Suggested-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19ptr_ring: Remove now-redundant smp_read_barrier_depends()Andrea Parri1-3/+4
Because READ_ONCE() now implies smp_read_barrier_depends(), the smp_read_barrier_depends() in __ptr_ring_consume() is redundant; this commit removes it and updates the comments. Signed-off-by: Andrea Parri <parri.andrea@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: <linux-kernel@vger.kernel.org> Cc: <netdev@vger.kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16tun: export flags, uid, gid, queue information over netlinkSabrina Dubroca2-0/+74
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16net: Only honor ifindex in IP_PKTINFO if non-0David Ahern1-2/+4
Only allow ifindex from IP_PKTINFO to override SO_BINDTODEVICE settings if the index is actually set in the message. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16net: dsa: mv88e6xxx: avoid unintended sign extension on a 16 bit shiftColin Ian King1-2/+3
The shifting of timehi by 16 bits to the left will be promoted to a 32 bit signed int and then sign-extended to an u64. If the top bit of timehi is set then all then all the upper bits of ns end up as also being set because of the sign-extension. Fix this by making timehi and timelo u64. Also move the declaration of ns. Detected by CoverityScan, CID#1465288 ("Unintended sign extension") Fixes: c6fe0ad2c349 ("net: dsa: mv88e6xxx: add rx/tx timestamping support") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16ravb: add support for changing MTUNiklas Söderlund2-7/+28
Allow for changing the MTU within the limit of the maximum size of a descriptor (2048 bytes). Add the callback to change MTU from user-space and take the configurable MTU into account when configuring the hardware. Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16Merge branch 'nfp-whitespace-sync-and-flower-TCP-flags'David S. Miller5-142/+204
Jakub Kicinski says: ==================== nfp: whitespace sync and flower TCP flags Whitespace cleanup from Michael and flower offload support for matching on TCP flags from Pieter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16nfp: flower: implement tcp flag match offloadPieter Jansen van Vuuren4-2/+64
Implement tcp flag match offloading. Current tcp flag match support include FIN, SYN, RST, PSH and URG flags, other flags are unsupported. The PSH and URG flags are only set in the hardware fast path when used in combination with the SYN, RST and PSH flags. Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16nfp: standardize FW header whitespaceMichael Rapson1-140/+140
The nfp_net_ctrl.h file used spaces for indentation in the past but tabs have crept in. Host driver files use tabs for indentation by default, so let's convert to tabs for consistency across the file and our drivers. Signed-off-by: Michael Rapson <michael.rapson@netronome.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16Merge branch 'net-sched-act-add-extack-support'David S. Miller19-129/+215
Alexander Aring says: ==================== net: sched: act: add extack support this patch series adds extack support for the TC action subsystem. As example I for the extack support in a TC action I choosed mirred action. - Alex Cc: David Ahern <dsahern@gmail.com> changes since v3: - adapt recommended changes from Davide Caratti, please check if I catch everything. Thanks. changes since v2: - remove newline in extack of generic walker handling Thanks to Davide Caratti - add kernel@mojatatu.com in cc ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16net: sched: act: mirred: add extack supportAlexander Aring1-4/+11
This patch adds extack support for TC mirred action. Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16net: sched: act: handle extack in tcf_generic_walkerAlexander Aring18-20/+23
This patch adds extack handling for a common used TC act function "tcf_generic_walker()" to add an extack message on failures. The tcf_generic_walker() function can fail if get a invalid command different than DEL and GET. The naming "action" here is wrong, the correct naming would be command. Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16net: sched: act: add extack for walk callbackAlexander Aring18-20/+38
This patch adds extack support for act walker callback api. This prepares to handle extack support inside each specific act implementation. Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16net: sched: act: add extack for lookup callbackAlexander Aring18-19/+37
This patch adds extack support for act lookup callback api. This prepares to handle extack support inside each specific act implementation. Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16net: sched: act: add extack to init callbackAlexander Aring18-20/+24
This patch adds extack support for act init callback api. This prepares to handle extack support inside each specific act implementation. Based on work by David Ahern <dsahern@gmail.com> Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16net: sched: act: handle generic action errorsAlexander Aring1-32/+61
This patch adds extack support for generic act handling. The extack will be set deeper to each called function which is not part of netdev core api. Based on work by David Ahern <dsahern@gmail.com> Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16net: sched: act: add extack to initAlexander Aring3-10/+16
This patch adds extack to tcf_action_init and tcf_action_init_1 functions. These are necessary to make individual extack handling in each act implementation. Based on work by David Ahern <dsahern@gmail.com> Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16net: sched: act: fix code styleAlexander Aring3-11/+12
This patch is used by subsequent patches. It fixes code style issues caught by checkpatch. Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16Merge branch 'RDS-zerocopy-support'David S. Miller11-35/+339
Sowmini Varadhan says: ==================== RDS: zerocopy support This is version 3 of the series, following up on review comments for http://patchwork.ozlabs.org/project/netdev/list/?series=28530 Review comments addressed Patch 4 - fix fragile use of skb->cb[], do not set ee_code incorrectly. Patch 5: - remove needless bzero of skb->cb[], consolidate err cleanup A brief overview of this feature follows. This patch series provides support for MSG_ZERCOCOPY on a PF_RDS socket based on the APIs and infrastructure added by Commit f214f915e7db ("tcp: enable MSG_ZEROCOPY") For single threaded rds-stress testing using rds-tcp with the ixgbe driver using 1M message sizes (-a 1M -q 1M) preliminary results show that there is a significant reduction in latency: about 90 usec with zerocopy, compared with 200 usec without zerocopy. This patchset modifies the above for zerocopy in the following manner. - if the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and, - if the SO_ZEROCOPY socket option has been set on the PF_RDS socket, application pages sent down with rds_sendmsg are pinned. The pinning uses the accounting infrastructure added by a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages"). The message is unpinned when all references to the message go down to 0, and the message is freed by rds_message_purge. A multithreaded application using this infrastructure must send down a unique 32 bit cookie as ancillary data with each sendmsg invocation. The format of this ancillary data is described in Patch 5 of the series. The cookie is passed up to the application on the sk_error_queue when the message is unpinned, indicating to the application that it is now safe to free/reuse the message buffer. The details of the completion notification are provided in Patch 4 of this series. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16selftests/net: add zerocopy support for PF_RDS test caseSowmini Varadhan1-4/+64
Send a cookie with sendmsg() on PF_RDS sockets, and process the returned batched cookies in do_recv_completion() Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16selftests/net: add support for PF_RDS socketsSowmini Varadhan1-1/+64
Add support for basic PF_RDS client-server testing in msg_zerocopy Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16rds: zerocopy Tx support.Sowmini Varadhan4-8/+91
If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and, if the SO_ZEROCOPY socket option has been set on the PF_RDS socket, application pages sent down with rds_sendmsg() are pinned. The pinning uses the accounting infrastructure added by Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages") The payload bytes in the message may not be modified for the duration that the message has been pinned. A multi-threaded application using this infrastructure may thus need to be notified about send-completion so that it can free/reuse the buffers passed to rds_sendmsg(). Notification of send-completion will identify each message-buffer by a cookie that the application must specify as ancillary data to rds_sendmsg(). The ancillary data in this case has cmsg_level == SOL_RDS and cmsg_type == RDS_CMSG_ZCOPY_COOKIE. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16rds: support for zcopy completion notificationSowmini Varadhan5-7/+96
RDS removes a datagram (rds_message) from the retransmit queue when an ACK is received. The ACK indicates that the receiver has queued the RDS datagram, so that the sender can safely forget the datagram. When all references to the rds_message are quiesced, rds_message_purge is called to release resources used by the rds_message If the datagram to be removed had pinned pages set up, add an entry to the rs->rs_znotify_queue so that the notifcation will be sent up via rds_rm_zerocopy_callback() when the rds_message is eventually freed by rds_message_purge. rds_rm_zerocopy_callback() attempts to batch the number of cookies sent with each notification to a max of SO_EE_ORIGIN_MAX_ZCOOKIES. This is achieved by checking the tail skb in the sk_error_queue: if this has room for one more cookie, the cookie from the current notification is added; else a new skb is added to the sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will trigger a ->sk_error_report to notify the application. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16sock: permit SO_ZEROCOPY on PF_RDS socketSowmini Varadhan1-11/+14
allow the application to set SO_ZEROCOPY on the underlying sk of a PF_RDS socket Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16rds: hold a sock ref from rds_message to the rds_sockSowmini Varadhan2-7/+8
The existing model holds a reference from the rds_sock to the rds_message, but the rds_message does not itself hold a sock_put() on the rds_sock. Instead the m_rs field in the rds_message is assigned when the message is queued on the sock, and nulled when the message is dequeued from the sock. We want to be able to notify userspace when the rds_message is actually freed (from rds_message_purge(), after the refcounts to the rds_message go to 0). At the time that rds_message_purge() is called, the message is no longer on the rds_sock retransmit queue. Thus the explicit reference for the m_rs is needed to send a notification that will signal to userspace that it is now safe to free/reuse any pages that may have been pinned down for zerocopy. This patch manages the m_rs assignment in the rds_message with the necessary refcount book-keeping. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>