summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2018-10-17net/mlx5: Add a no-append flow insertion modePaul Blakey5-9/+24
If no-append flag is set, we will add a new FTE, instead of appending the actions of the inserted rule when the same match already exists. While here, move the has_flow_tag boolean indicator to be a flag too. This patch doesn't change any functionality. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanmox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-17net/mlx5: E-Switch, Add chains and prioritiesPaul Blakey3-86/+339
A chain is a group of priorities, so use the fdb parallel sub namespaces to implement chains, and a flow table for each priority in them. Because these namespaces are parallel and in series to the slow path fdb, the chains aren't connected to one another (but to the slow path), and one must use a explicit goto action to reach a different chain. Flow tables for the priorities will be created on demand and destroyed once not used. The Firmware has four pools of tables for sizes S/XS/M/L (4k, 64k, 1m, 4m). We maintain ghost copies of the pools occupancy. When a new table is to be created, we scan the pools from large to small and find the 1st table size which can be now created. When a table is destroyed, we update the relevant pool. Multi chain/prio isn't enabled yet by this patch, for now all flows will use the default chain 0, and prio 1. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-17net/mlx5: E-Switch, Have explicit API to delete fwd rulesOr Gerlitz3-2/+14
Be symmetric with the e-switch API to add rules which has a specific function to add fwd rules which are used as part of vport mirroring. This patch doesn't change any functionality. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-17net/mlx5: Split FDB fast path prio to multiple namespacesPaul Blakey5-11/+101
Towards supporting multi-chains and priorities, split the FDB fast path to multiple namespaces (sub namespaces), each with multiple priorities. This patch adds a new flow steering type, FS_TYPE_PRIO_CHAINS, which is like current FS_TYPE_PRIO, but may contain only namespaces, and those will be in parallel to one another in terms of managing of the flow tables connections inside them. Meaning, while searching for the next or previous flow table to connect for a new table inside such namespace we skip the parallel namespaces in the same level under the FS_TYPE_PRIO_CHAINS prio we originated from. We use this new type for splitting the fast path prio into multiple parallel namespaces, each containing normal prios. The prios inside them (and their tables) will be connected to one another, but not from one parallel namespace to another, instead the last prio in each namespace will be connected to the next prio in the containing FDB namespace, which is the slow path prio. Signed-off-by: Paul Blakey <paulb@mellanox.com> Acked-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-17net/mlx5: Add cap bits for multi fdb encapPaul Blakey1-1/+3
If set, the firmware supports creating of flow tables with encap enabled while VFs are configured, if we already created one (restriction still applies on the first creation). Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-17net/mlx5e: Split TC add rule path for nic vs e-switchRoi Dayan1-47/+138
Move to have clear separation on the code path to add nic vs e-switch flows. While here we break the code that deals with adding offloaded TC tool to few smaller stages, each on helper function. Besides getting us simpler and readable code, these are pre-steps for being able to have two HW flows serving one SW TC flow for some e-switch use cases. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-17net/mlx5e: Change return type of tc add flow functionsRabie Loulou1-47/+39
Refactor the flow add utility functions to return err code instead of rule pointers. This will allow for simpler logic when one tc rule is duplicated to two HW rules in downstream patches. Signed-off-by: Rabie Loulou <rabiel@mellanox.com> Signed-off-by: Shahar Klein <shahark@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-17net/mlx5: Use flow counter IDs and not the wrapping cache objectMark Bloch9-19/+23
Currently, when a flow rule is created using the FS core layer, the caller has to pass the entire flow counter object and not just the counter HW handle (ID). This requires both the FS core and the caller to have knowledge about the inner implementation of the FS layer flow counters cache and limits the possible users. Move to use the counter ID across the place when dealing with flows. Doing this decoupling, now can we privatize the inner implementation of the flow counters. Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-17net/mlx5: E-Switch, Get counters for offloaded flows from callersMark Bloch5-36/+33
There's no real reason for the e-switch logic to manage the creation of counters for offloaded flows. The API already has the directive for the caller to denote they want to attach a counter to the created flow. As such, we go and move the management of flow counters to the mlx5e tc offload logic. This also lets us remove an inelegant interface where the FS layer had to provide a way to retrieve a counter from a flow rule. Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-17Merge branch 'mlx5-next' of ↵Saeed Mahameed21-230/+392
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into net-next mlx5 updates for both net-next and rdma-next * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: (21 commits) net/mlx5: Expose DC scatter to CQE capability bit net/mlx5: Update mlx5_ifc with DEVX UID bits net/mlx5: Set uid as part of DCT commands net/mlx5: Set uid as part of SRQ commands net/mlx5: Set uid as part of SQ commands net/mlx5: Set uid as part of RQ commands net/mlx5: Set uid as part of QP commands net/mlx5: Set uid as part of CQ commands net/mlx5: Rename incorrect naming in IFC file net/mlx5: Export packet reformat alloc/dealloc functions net/mlx5: Pass a namespace for packet reformat ID allocation net/mlx5: Expose new packet reformat capabilities {net, RDMA}/mlx5: Rename encap to reformat packet net/mlx5: Move header encap type to IFC header file net/mlx5: Break encap/decap into two separated flow table creation flags net/mlx5: Add support for more namespaces when allocating modify header net/mlx5: Export modify header alloc/dealloc functions net/mlx5: Add proper NIC TX steering flow tables support net/mlx5: Cleanup flow namespace getter switch logic net/mlx5: Add memic command opcode to command checker ... Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-16tcp, ulp: remove socket lock assertion on ULP cleanupDaniel Borkmann1-2/+4
Eric reported that syzkaller triggered a splat in tcp_cleanup_ulp() where assertion sock_owned_by_me() failed. This happened through inet_csk_prepare_forced_close() first releasing the socket lock, then calling into tcp_done(newsk) which is called after the inet_csk_prepare_forced_close() and therefore without the socket lock held. The sock_owned_by_me() assertion can generally be removed as the only place where tcp_cleanup_ulp() is called from now is out of inet_csk_destroy_sock() -> sk->sk_prot->destroy() where socket is in dead state and unreachable. Therefore, add a comment why the check is not needed instead. Fixes: 8b9088f806e1 ("tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net/mlx5: Expose DC scatter to CQE capability bitYonatan Cohen1-1/+2
dc_req_scat_data_cqe capability bit determines if requester scatter to cqe is available for 64 bytes CQE over DC transport type. Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com> Reviewed-by: Guy Levi <guyle@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-10-16Merge branch 'hns3-Some-cleanup-and-bugfix-for-desc-filling'David S. Miller2-78/+62
Yunsheng Lin says: ==================== Some cleanup and bugfix for desc filling When retransmiting packets, skb_cow_head which is called in hns3_set_tso may clone a new header. And driver will clear the checksum of the header after doing DMA map, so HW will read the old header whose L3 checksum is not cleared and calculate a wrong L3 checksum. Also When sending a big fragment using multiple buffer descriptor, hns3 does one maping, but do multiple unmapping when tx is done, which may cause unmapping problem. This patchset does some cleanup before fixing the above problem. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: hns3: fix for multiple unmapping DMA problemFuyun Liang1-5/+8
When sending a big fragment using multiple buffer descriptor, hns3 does one maping, but do multiple unmapping when tx is done, which may cause unmapping problem. To fix it, this patch makes sure the value of desc_cb.length of the non-first bd is zero. If desc_cb.length is zero, we do not unmap the buffer. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: hns3: rename hns_nic_dma_unmapFuyun Liang1-7/+7
To keep symmetrical, this patch renames hns_nic_dma_unmap to hns3_clear_desc. Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: hns3: add handling for big TX fragmentFuyun Liang1-14/+31
This patch unifies big tx fragment handling for tso and non-tso case. Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: hns3: move DMA map into hns3_fill_descPeng Li2-26/+24
To solve the L3 checksum error problem which happens when driver does not clear L3 checksum, DMA map should be done after calling skb_cow_head. This patch moves DMA map into hns3_fill_desc to ensure that DMA map is done after calling skb_cow_head. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: hns3: remove hns3_fill_desc_tsoPeng Li1-39/+5
This patch removes hns3_fill_desc_tso in preparation for fixing some desc filling bug, because for tso or non-tso case, we will use the unified hns3_fill_desc. Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16Merge branch 'qed-Align-PTT-and-add-various-link-modes'David S. Miller8-103/+587
Rahul Verma says: ==================== Align PTT and add various link modes. This series aligns the ptt propagation as local ptt or global ptt. Adds new transceiver modes, speed capabilities and board config, which is utilized to display the enhanced link modes, media types and speed. Enhances the link with detailed information. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16qed: Prevent link getting down in case of autoneg-off.Rahul Verma1-7/+33
Newly added link modes are required to be added during setting link modes. If the new link mode is not available during qed_set_link, it may cause link getting down due to empty supported capability, being passed to MFW, after setting autoneg off/on with current/supported speed. Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16qede: Check available link modes before link set from ethtool.Rahul Verma1-19/+45
Set link mode after checking available "supported" link caps of the port. Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16qed: Add supported link and advertise link to display in ethtool.Rahul Verma5-58/+426
Added transceiver type, speed capability and board types in HSI, are utilizing to display the accurate link information in ethtool. Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16qed: Added supported transceiver modes, speed capability and board config to ↵Rahul Verma1-1/+53
HSI. Added transceiver modes with different speed and media type, speed capability and supported board types in HSI, which will be utilizing to display correct specification of link modes and speed type. Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16qed: Align local and global PTT to propagate through the APIs.Rahul Verma5-23/+35
Align the use of local PTT to propagate through the qed_mcp* API's. Global ptt should not be used. Register access should be done through layers. Register address is mapped into a PTT, PF translation table. Several interface functions require a PTT to direct read/write into register. There is a pool of PTT maintained, and several PTT are used simultaneously to access device registers in different flows. Same PTT should not be used in flows that can run concurrently. To avoid running out of PTT resources, too many PTT should not be acquired without releasing them. Every PF has a global PTT, which is used throughout the life of PF, in most important flows for register access. Generic functions acquire the PTT locally and release after the use. This patch aligns the use of Global PTT and Local PTT accordingly. Signed-off-by: Rahul Verma <rahul.verma@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: aquantia: make function aq_fw2x_update_stats staticYueHaibing1-1/+1
Fixes the following sparse warning: drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c:282:5: warning: symbol 'aq_fw2x_update_stats' was not declared. Should it be static? Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16Merge branch 'net-Kernel-side-filtering-for-route-dumps'David S. Miller13-95/+387
David Ahern says: ==================== net: Kernel side filtering for route dumps Implement kernel side filtering of route dumps by protocol (e.g., which routing daemon installed the route), route type (e.g., unicast), table id and nexthop device. iproute2 has been doing this filtering in userspace for years; pushing the filters to the kernel side reduces the amount of data the kernel sends and reduces wasted cycles on both sides processing unwanted data. These initial options provide a huge improvement for efficiently examining routes on large scale systems. v2 - better handling of requests for a specific table. Rather than walking the hash of all tables, lookup the specific table and dump it - refactor mr_rtm_dumproute moving the loop over the table into a helper that can be invoked directly - add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure it is returned even when the dump returns nothing ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net/ipv4: Bail early if user only wants prefix entriesDavid Ahern1-2/+6
Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the flag is set in the dump request, just return. In the process of this change, move the CLONE check to use the new filter flags. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net/ipv6: Bail early if user only wants cloned entriesDavid Ahern1-2/+5
Similar to IPv4, IPv6 fib no longer contains cloned routes. If a user requests a route dump for only cloned entries, no sense walking the FIB and returning everything. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net/mpls: Handle kernel side filtering of route dumpsDavid Ahern1-5/+28
Update the dump request parsing in MPLS for the non-INET case to enable kernel side filtering. If INET is disabled the only filters that make sense for MPLS are protocol and nexthop device. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: Enable kernel side filtering of route dumpsDavid Ahern6-15/+53
Update parsing of route dump request to enable kernel side filtering. Allow filtering results by protocol (e.g., which routing daemon installed the route), route type (e.g., unicast), table id and nexthop device. These amount to the low hanging fruit, yet a huge improvement, for dumping routes. ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can be used to look up the device index without taking a reference. From there filter->dev is only used during dump loops with the lock still held. Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results have been filtered should no entries be returned. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: Plumb support for filtering ipv4 and ipv6 multicast route dumpsDavid Ahern4-12/+74
Implement kernel side filtering of routes by egress device index and table id. If the table id is given in the filter, lookup table and call mr_table_dump directly for it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16ipmr: Refactor mr_rtm_dumprouteDavid Ahern2-33/+61
Move per-table loops from mr_rtm_dumproute to mr_table_dump and export mr_table_dump for dumps by specific table id. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net/mpls: Plumb support for filtering route dumpsDavid Ahern1-1/+41
Implement kernel side filtering of routes by egress device index and protocol. MPLS uses only a single table and route type. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net/ipv6: Plumb support for filtering route dumpsDavid Ahern2-14/+54
Implement kernel side filtering of routes by table id, egress device index, protocol, and route type. If the table id is given in the filter, lookup the table and call fib6_dump_table directly for it. Move the existing route flags check for prefix only routes to the new filter. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net/ipv4: Plumb support for filtering route dumpsDavid Ahern3-13/+39
Implement kernel side filtering of routes by table id, egress device index, protocol and route type. If the table id is given in the filter, lookup the table and call fib_table_dump directly for it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net: Add struct for fib dump filterDavid Ahern7-11/+37
Add struct fib_dump_filter for options on limiting which routes are returned in a dump request. The current list is table id, protocol, route type, rtm_flags and nexthop device index. struct net is needed to lookup the net_device from the index. Declare the filter for each route dump handler and plumb the new arguments from dump handlers to ip_valid_fib_dump_req. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16netlink: Add answer_flags to netlink_callbackDavid Ahern2-1/+3
With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED flag is set on a message back to the user if the data returned is influenced by some input attributes. Normally this can be done as messages are added to the skb, but if the filter results in no data being returned, the user could be confused as to why. This patch adds answer_flags to the netlink_callback allowing dump handlers to set the NLM_F_DUMP_FILTERED at a minimum in the NLMSG_DONE message ensuring the flag gets back to the user. The netlink_callback space is initialized to 0 via a memset in __netlink_dump_start, so init of the new answer_flags is covered. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller54-3659/+4962
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-10-16 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable sk_msg BPF integration for the latter, from Daniel and John. 2) Enable BPF syscall side to indicate for maps that they do not support a map lookup operation as opposed to just missing key, from Prashant. 3) Add bpftool map create command which after map creation pins the map into bpf fs for further processing, from Jakub. 4) Add bpftool support for attaching programs to maps allowing sock_map and sock_hash to be used from bpftool, from John. 5) Improve syscall BPF map update/delete path for map-in-map types to wait a RCU grace period for pending references to complete, from Daniel. 6) Couple of follow-up fixes for the BPF socket lookup to get it enabled also when IPv6 is compiled as a module, from Joe. 7) Fix a generic-XDP bug to handle the case when the Ethernet header was mangled and thus update skb's protocol and data, from Jesper. 8) Add a missing BTF header length check between header copies from user space, from Wenwen. 9) Minor fixups in libbpf to use __u32 instead u32 types and include proper perf_event.h uapi header instead of perf internal one, from Yonghong. 10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS to bpftool's build, from Jiri. 11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install with_addr.sh script from flow dissector selftest, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15net: phy: merge phy_start_aneg and phy_start_aneg_privHeiner Kallweit1-18/+3
After commit 9f2959b6b52d ("net: phy: improve handling delayed work") the sync parameter isn't needed any longer in phy_start_aneg_priv(). This allows to merge phy_start_aneg() and phy_start_aneg_priv(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15hv_netvsc: fix vf serial matching with pci slot infoHaiyang Zhang1-4/+11
The VF device's serial number is saved as a string in PCI slot's kobj name, not the slot->number. This patch corrects the netvsc driver, so the VF device can be successfully paired with synthetic NIC. Fixes: 00d7ddba1143 ("hv_netvsc: pair VF based on serial number") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15Merge branch 'tcp-second-round-for-EDT-conversion'David S. Miller10-58/+78
Eric Dumazet says: ==================== tcp: second round for EDT conversion First round of EDT patches left TCP stack in a non optimal state. - High speed flows suffered from loss of performance, addressed by the first patch of this series. - Second patch brings pacing to the current state of networking, since we now reach ~100 Gbit on a single TCP flow. - Third patch implements a mitigation for scheduling delays, like the one we did in sch_fq in the past. - Fourth patch removes one special case in sch_fq for ACK packets. - Fifth patch removes a serious perfomance cost for TCP internal pacing. We should setup the high resolution timer only if really needed. - Sixth patch fixes a typo in BBR. - Last patch is one minor change in cdg congestion control. Neal Cardwell also has a patch series fixing BBR after EDT adoption. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15tcp: cdg: use tcp high resolution clock cacheEric Dumazet1-1/+1
We store in tcp socket a cache of most recent high resolution clock, there is no need to call local_clock() again, since this cache is good enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15tcp_bbr: fix typo in bbr_pacing_margin_percentNeal Cardwell1-2/+2
There was a typo in this parameter name. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15tcp: optimize tcp internal pacingEric Dumazet1-15/+16
When TCP implements its own pacing (when no fq packet scheduler is used), it is arming high resolution timer after a packet is sent. But in many cases (like TCP_RR kind of workloads), this high resolution timer expires before the application attempts to write the following packet. This overhead also happens when the flow is ACK clocked and cwnd limited instead of being limited by the pacing rate. This leads to extra overhead (high number of IRQ) Now tcp_wstamp_ns is reserved for the pacing timer only (after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"), we can setup the timer only when a packet is about to be sent, and if tcp_wstamp_ns is in the future. This leads to a ~10% performance increase in TCP_RR workloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()Eric Dumazet1-1/+1
With the new EDT model, sch_fq no longer has to special case TCP pure acks, since their skb->tstamp will allow them being sent without pacing delay. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15tcp: mitigate scheduling jitter in EDT pacing modelEric Dumazet1-6/+13
In commit fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers drifts") we added a mitigation for scheduling jitter in fq packet scheduler. This patch does the same in TCP stack, now it is using EDT model. Note that this mitigation is valid for both external (fq packet scheduler) or internal TCP pacing. This uses the same strategy than the above commit, allowing a time credit of half the packet currently sent. Consider following case : An skb is sent, after an idle period of 300 usec. The air-time (skb->len/pacing_rate) is 500 usec Instead of setting the pacing timer to now+500 usec, it will use now+min(500/2, 300) -> now+250usec This is like having a token bucket with a depth of half an skb. Tested: tc qdisc replace dev eth0 root pfifo_fast Before netperf -P0 -H remote -- -q 1000000000 # 8000Mbit 540000 262144 262144 10.00 7710.43 After : netperf -P0 -H remote -- -q 1000000000 # 8000 Mbit 540000 262144 262144 10.00 7999.75 # Much closer to 8000Mbit target Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15net: extend sk_pacing_rate to unsigned longEric Dumazet7-32/+40
sk_pacing_rate has beed introduced as a u32 field in 2013, effectively limiting per flow pacing to 34Gbit. We believe it is time to allow TCP to pace high speed flows on 64bit hosts, as we now can reach 100Gbit on one TCP flow. This patch adds no cost for 32bit kernels. The tcpi_pacing_rate and tcpi_max_pacing_rate were already exported as 64bit, so iproute2/ss command require no changes. Unfortunately the SO_MAX_PACING_RATE socket option will stay 32bit and we will need to add a new option to let applications control high pacing rates. State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741 timer:(on,003ms,0) ino:91863 sk:2 <-> skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0) ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448 rcvmss:536 advmss:1448 cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177 segs_in:3916318 data_segs_out:177279175 bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2) send 28045.5Mbps lastrcv:73333 pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps busy:73333ms unacked:135 retrans:0/157 rcv_space:14480 notsent:2085120 minrtt:0.013 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15tcp: do not change tcp_wstamp_ns in tcp_mstamp_refreshEric Dumazet3-4/+8
In EDT design, I made the mistake of using tcp_wstamp_ns to store the last tcp_clock_ns() sample and to store the pacing virtual timer. This causes major regressions at high speed flows. Introduce tcp_clock_cache to store last tcp_clock_ns(). This is needed because some arches have slow high-resolution kernel time service. tcp_wstamp_ns is only updated when a packet is sent. Note that we can remove tcp_mstamp in the future since tcp_mstamp is essentially tcp_clock_cache/1000, so the apparent socket size increase is temporary. Fixes: 9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15net: bridge: fix a possible memory leak in __vlan_addLi RongQing1-0/+4
After per-port vlan stats, vlan stats should be released when fail to add vlan Fixes: 9163a0fc1f0c0 ("net: bridge: add support for per-port vlan stats") CC: bridge@lists.linux-foundation.org cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> CC: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15rxrpc: Add /proc/net/rxrpc/peers to display peer listDavid Howells3-0/+130
Add /proc/net/rxrpc/peers to display the list of peers currently active. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>