linux - Linux Kernel (branches are rebased on master from time to time)

Age	Commit message (Collapse)	Author	Files	Lines
2020-01-07	net/mlx5: Use async EQ setup cleanup helpers for multiple EQs	Parav Pandit	1	-65/+49
	Use helper routines to setup and teardown multiple EQs and reuse the code in setup, cleanup and error unwinding flows. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-07	net/mlx5: Reduce No CQ found log level from warn to debug	Parav Pandit	1	-1/+2
	In below sequence, a EQE entry arrives for a CQ which is on the path of being destroyed. cpu-0 cpu-1 ------ ----- mlx5_core_destroy_cq() mlx5_eq_comp_int() mlx5_eq_del_cq() [..] radix_tree_delete() [..] [..] mlx5_eq_cq_get() /* Didn't find CQ is * a valid case. / / destroy CQ in hw */ mlx5_cmd_exec() This is still a valid scenario and correct delete CQ sequence, as mirror of the CQ create sequence. Hence, suppress the non harmful debug message from warn to debug level. Keep the debug log message rate limited because user application can trigger it repeatedly. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-07	net/mlx5: Increase the max number of channels to 128	Fan Li	3	-10/+12
	Currently the max number of channels is limited to 64, which is half of the indirection table size to allow some flexibility. But on servers with more than 64 cores, users may want to utilize more queues. This patch increases the advertised max number of channels to 128 by changing the ratio between channels and indirection table slots to 1:1. At the same time, the driver still enable no more than 64 channels at loading. Users can change it by ethtool afterwards. Signed-off-by: Fan Li <fanl@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-07	net/mlx5e: Support accept action on nic table	Tonghao Zhang	1	-0/+4
	In one case, we may forward packets from one vport to others, but only one packets flow will be accepted, which destination ip was assign to VF. +-----+ +-----+ +-----+ \| VFn \| \| VF1 \| \| VF0 \| accept +--+--+ +--+--+ hairpin +--^--+ \| \| <--------------- \| \| \| \| +--+-----------v-+ +--+-------------+ \| eswitch PF1 \| \| eswitch PF0 \| +----------------+ +----------------+ tc filter add dev $PF0 protocol all parent ffff: prio 1 handle 1 \ flower skip_sw action mirred egress redirect dev $VF0_REP tc filter add dev $VF0 protocol ip parent ffff: prio 1 handle 1 \ flower skip_sw dst_ip $VF0_IP action pass tc filter add dev $VF0 protocol all parent ffff: prio 2 handle 2 \ flower skip_sw action mirred egress redirect dev $VF1 Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-07	mlx5: work around high stack usage with gcc	Arnd Bergmann	1	-0/+3
	In some configurations, gcc tries too hard to optimize this code: drivers/net/ethernet/mellanox/mlx5/core/en_stats.c: In function 'mlx5e_grp_sw_update_stats': drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:302:1: error: the frame size of 1336 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] As was stated in the bug report, the reason is that gcc runs into a corner case in the register allocator that is rather hard to fix in a good way. As there is an easy way to work around it, just add a comment and the barrier that stops gcc from trying to overoptimize the function. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92657 Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-07	net/mlx5: limit the function in local scope	Zhu Yanjun	2	-4/+2
	The function mlx5_buf_alloc_node is only used by the function in the local scope. So it is appropriate to limit this function in the local scope. Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-06	Merge branch 'Unique-mv88e6xxx-IRQ-names'	David S. Miller	5	-6/+31
	Andrew Lunn says: ==================== Unique mv88e6xxx IRQ names There are a few boards which have multiple mv88e6xxx switches. With such boards, it can be hard to determine which interrupts belong to which switches. Make the interrupt names unique by including the device name in the interrupt name. For the SERDES interrupt, also include the port number. As a result of these patches ZII devel C looks like: 50: 0 gpio-vf610 27 Level mv88e6xxx-0.1:00 54: 0 mv88e6xxx-g1 3 Edge mv88e6xxx-0.1:00-g1-atu-prob 56: 0 mv88e6xxx-g1 5 Edge mv88e6xxx-0.1:00-g1-vtu-prob 58: 0 mv88e6xxx-g1 7 Edge mv88e6xxx-0.1:00-g2 61: 0 mv88e6xxx-g2 1 Edge !mdio-mux!mdio@1!switch@0!mdio:01 62: 0 mv88e6xxx-g2 2 Edge !mdio-mux!mdio@1!switch@0!mdio:02 63: 0 mv88e6xxx-g2 3 Edge !mdio-mux!mdio@1!switch@0!mdio:03 64: 0 mv88e6xxx-g2 4 Edge !mdio-mux!mdio@1!switch@0!mdio:04 70: 0 mv88e6xxx-g2 10 Edge mv88e6xxx-0.1:00-serdes-10 75: 0 mv88e6xxx-g2 15 Edge mv88e6xxx-0.1:00-watchdog 76: 5 gpio-vf610 26 Level mv88e6xxx-0.2:00 80: 0 mv88e6xxx-g1 3 Edge mv88e6xxx-0.2:00-g1-atu-prob 82: 0 mv88e6xxx-g1 5 Edge mv88e6xxx-0.2:00-g1-vtu-prob 84: 4 mv88e6xxx-g1 7 Edge mv88e6xxx-0.2:00-g2 87: 2 mv88e6xxx-g2 1 Edge !mdio-mux!mdio@2!switch@0!mdio:01 88: 0 mv88e6xxx-g2 2 Edge !mdio-mux!mdio@2!switch@0!mdio:02 89: 0 mv88e6xxx-g2 3 Edge !mdio-mux!mdio@2!switch@0!mdio:03 90: 0 mv88e6xxx-g2 4 Edge !mdio-mux!mdio@2!switch@0!mdio:04 95: 3 mv88e6xxx-g2 9 Edge mv88e6xxx-0.2:00-serdes-9 96: 0 mv88e6xxx-g2 10 Edge mv88e6xxx-0.2:00-serdes-10 101: 0 mv88e6xxx-g2 15 Edge mv88e6xxx-0.2:00-watchdog Interrupt names like !mdio-mux!mdio@2!switch@0!mdio:01 are created by phylib for the integrated PHYs. The mv88e6xxx driver does not determine these names. ==================== Tested-by: Chris Healy <cphealy@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: dsa: mv88e6xxx: Unique ATU and VTU IRQ names	Andrew Lunn	3	-2/+10
	Dynamically generate a unique interrupt name for the VTU and ATU, based on the device name. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: dsa: mv88e6xxx: Unique g2 IRQ name	Andrew Lunn	2	-1/+5
	Dynamically generate a unique g2 interrupt name, based on the device name. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: dsa: mv88e6xxx: Unique watchdog IRQ name	Andrew Lunn	2	-1/+5
	Dynamically generate a unique watchdog interrupt name, based on the device name. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: dsa: mv88e6xxx: Unique SERDES interrupt names	Andrew Lunn	2	-1/+6
	Dynamically generate a unique SERDES interrupt name, based on the device name and the port the SERDES is for. For example: 95: 3 mv88e6xxx-g2 9 Edge mv88e6xxx-0.2:00-serdes-9 96: 0 mv88e6xxx-g2 10 Edge mv88e6xxx-0.2:00-serdes-10 The 0.2:00 indicates the switch and -9 indicates port 9. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: dsa: mv88e6xxx: Unique IRQ name	Andrew Lunn	2	-1/+5
	Dynamically generate a unique switch interrupt name, based on the device name. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	Merge branch 'ethtool-allow-nesting-of-begin-and-complete-callbacks'	David S. Miller	4	-33/+32
	Michal Kubecek says: ==================== ethtool: allow nesting of begin() and complete() callbacks The ethtool ioctl interface used to guarantee that ethtool_ops callbacks were always called in a block between calls to ->begin() and ->complete() (if these are defined) and that this whole block was executed with RTNL lock held: rtnl_lock(); ops->begin(); /* other ethtool_ops calls */ ops->complete(); rtnl_unlock(); This prevented any nesting or crossing of the begin-complete blocks. However, this is no longer guaranteed even for ioctl interface as at least ethtool_phys_id() releases RTNL lock while waiting for a timer. With the introduction of netlink ethtool interface, the begin-complete pairs are naturally nested e.g. when a request triggers a netlink notification. Fortunately, only minority of networking drivers implements begin() and complete() callbacks and most of those that do, fall into three groups: - wrappers for pm_runtime_get_sync() and pm_runtime_put() - wrappers for clk_prepare_enable() and clk_disable_unprepare() - begin() checks netif_running() (fails if false), no complete() First two have their own refcounting, third is safe w.r.t. nesting of the blocks. Only three in-tree networking drivers need an update to deal with nesting of begin() and complete() calls: via-velocity and epic100 perform resume and suspend on their own and wil6210 completely serializes the calls using its own mutex (which would lead to a deadlock if a request request triggered a netlink notification). The series addresses these problems. changes between v1 and v2: - fix inverted condition in epic100 ethtool_begin() (thanks to Andrew Lunn) ==================== Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	epic100: allow nesting of ethtool_ops begin() and complete()	Michal Kubecek	1	-2/+5
	Unlike most networking drivers using begin() and complete() ethtool_ops callbacks to resume a device which is down and suspend it again when done, epic100 does not use standard refcounted infrastructure but sets device sleep state directly. With the introduction of netlink ethtool interface, we may have nested begin-complete blocks so that inner complete() would put the device back to sleep for the rest of the outer block. To avoid rewriting an old and not very actively developed driver, just add a nesting counter and only perform resume and suspend on the outermost level. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	via-velocity: allow nesting of ethtool_ops begin() and complete()	Michal Kubecek	2	-4/+11
	Unlike most networking drivers using begin() and complete() ethtool_ops callbacks to resume a device which is down and suspend it again when done, via-velocity does not use standard refcounted infrastructure but sets device sleep state directly. With the introduction of netlink ethtool interface, we may have nested begin-complete blocks so that inner complete() would put the device back to sleep for the rest of the outer block. To avoid rewriting an old and not very actively developed driver, just add a nesting counter and only perform resume and suspend on the outermost level. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	wil6210: get rid of begin() and complete() ethtool_ops	Michal Kubecek	1	-27/+16
	The wil6210 driver locks a mutex in begin() ethtool_ops callback and unlocks it in complete() so that all ethtool requests are serialized. This is not going to work correctly with netlink interface; e.g. when ioctl triggers a netlink notification, netlink code would call begin() again while the mutex taken by ioctl code is still held by the same task. Let's get rid of the begin() and complete() callbacks and move the mutex locking into the remaining ethtool_ops handlers except get_drvinfo which only copies strings that are not changing so that there is no need for serialization. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	fcnal-test: Fix vrf argument in local tcp tests	David Ahern	1	-4/+4
	The recent MD5 tests added duplicate configuration in the default VRF. This change exposed a bug in existing tests designed to verify no connection when client and server are not in the same domain. The server should be running bound to the vrf device with the client run in the default VRF (the -2 option is meant for validating connection data). Fix the option for both tests. While technically this is a bug in previous releases, the tests are properly failing since the default VRF does not have any routing configuration so there really is no need to backport to prior releases. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	gtp: simplify error handling code in 'gtp_encap_enable()'	Christophe JAILLET	1	-6/+3
	'gtp_encap_disable_sock(sk)' handles the case where sk is NULL, so there is no need to test it before calling the function. This saves a few line of code. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	Merge branch 'mlxsw-Disable-checks-in-hardware-pipeline'	David S. Miller	3	-1/+200
	Ido Schimmel says: ==================== mlxsw: Disable checks in hardware pipeline Amit says: The hardware pipeline contains some checks that, by default, are configured to drop packets. Since the software data path does not drop packets due to these reasons and since we are interested in offloading the software data path to hardware, then these checks should be disabled in the hardware pipeline as well. This patch set changes mlxsw to disable four of these checks and adds corresponding selftests. The tests pass both when the software data path is exercised (using veth pair) and when the hardware data path is exercised (using mlxsw ports in loopback). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	selftests: forwarding: router: Add test case for destination IP link-local	Amit Cohen	1	-0/+25
	Add test case to check that packets are not dropped when they need to be routed and their destination is link-local, i.e., 169.254.0.0/16. Signed-off-by: Amit Cohen <amitc@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	mlxsw: spectrum: Disable DIP_LINK_LOCAL check in hardware pipeline	Amit Cohen	2	-0/+3
	The check drops packets if they need to be routed and their destination IP is link-local, i.e., belongs to 169.254.0.0/16 address range. Disable the check since the kernel forwards such packets and does not drop them. Signed-off-by: Amit Cohen <amitc@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	selftests: forwarding: router: Add test case for source IP equals destination IP	Amit Cohen	1	-0/+44
	Add test case to check that packets are not dropped when they need to be routed and their source IP equals to their destination IP. Signed-off-by: Amit Cohen <amitc@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	mlxsw: spectrum: Disable SIP_DIP check in hardware pipeline	Amit Cohen	2	-0/+3
	The check drops packets if they need to be routed and their source IP equals to their destination IP. Disable the check since the kernel forwards such packets and does not drop them. Signed-off-by: Amit Cohen <amitc@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	selftests: forwarding: router: Add test case for multicast destination MAC ↵	Amit Cohen	1	-0/+82
	mismatch Add test case to check that packets are not dropped when they need to be routed and their multicast MAC mismatched to their multicast destination IP. i.e., destination IP is multicast and * for IPV4: DMAC != {01-00-5E-0 (25 bits), DIP[22:0]} * for IPV6: DMAC != {33-33-0 (16 bits), DIP[31:0]} Signed-off-by: Amit Cohen <amitc@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	mlxsw: spectrum: Disable MC_DMAC check in hardware pipeline	Amit Cohen	2	-0/+3
	The check drops packets if they need to be routed and their multicast MAC mismatched to their multicast destination IP. For IPV4: DMAC is mismatched if it is different from {01-00-5E-0 (25 bits), DIP[22:0]} For IPV6: DMAC is mismatched if it is different from {33-33-0 (16 bits), DIP[31:0]} Disable the check since the kernel forwards such packets and does not drop them. Signed-off-by: Amit Cohen <amitc@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	selftests: forwarding: router: Add test case for source IP in class E	Amit Cohen	1	-1/+37
	Add test case to check that packets are not dropped when they need to be routed and their source IP in class E, (i.e., 240.0.0.0 – 255.255.255.254). Signed-off-by: Amit Cohen <amitc@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	mlxsw: spectrum: Disable SIP_CLASS_E check in hardware pipeline	Amit Cohen	2	-0/+3
	The check drops packets if they need to be routed and their source IP is from class E, i.e., belongs to 240.0.0.0/4 address range, but different from 255.255.255.255. Disable the check since the kernel forwards such packets and does not drop them. Signed-off-by: Amit Cohen <amitc@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	Merge branch 'hns3-next'	David S. Miller	11	-67/+259
	Huazhong Tan says: ==================== net: hns3: misc updates for -net-next This series includes some misc updates for the HNS3 ethernet driver. [patch 1] adds trace events support. [patch 2] re-organizes TQP's vector handling. [patch 3] renames the name of TQP vector. [patch 4] rewrites a log in the hclge_map_ring_to_vector(). [patch 5] modifies the name of misc IRQ vector. [patch 6] handles the unexpected speed 0 return from HW. [patch 7] replaces an unsuitable variable type. [patch 8] modifies an unsuitable reset level for HW error. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: hns3: modify an unsuitable reset level for hardware error	Huazhong Tan	1	-2/+2
	According to hardware user manual, when hardware reports error 'roc_pkt_without_key_port', the driver should assert function reset to do the recovery. So this patch uses HNAE3_FUNC_RESET to replace HNAE3_GLOBAL_RESET. Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: hns3: replace an unsuitable variable type in ↵	Huazhong Tan	2	-1/+4
	hclge_inform_reset_assert_to_vf() In hclge_inform_reset_assert_to_vf(), variable reset_type(enum type) will be copied into msg_data whose size is 2 bytes. Currently, hip08 is a little-endian machine, so the lower two bytes of reset_type will be copied to msg_data. But when running on a big-endian machine, msg_data will have a wrong value(the higher two bytes of reset_type). So this patch modifies the type of reset_type to u16, and adds a build check in case enum hnae3_reset_type has value larger than U16_MAX. Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: hns3: add protection when get SFP speed as 0	Guojia Liao	1	-0/+6
	In some case, the MAC speed get from hardware maybe 0, it should not be set to mac->speed. Signed-off-by: Guojia Liao <liaoguojia@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: hns3: modify the IRQ name of misc vectors	Yonglong Liu	5	-4/+10
	The misc IRQ of all the devices have the same name, so it's hard to find the right misc IRQ of the device. This patch modifies the misc IRQ names as "hclge/hclgevf"-misc- "pci name". And now the IRQ name is not related to net device name anymore, so change the HNAE3_INT_NAME_LEN to 32 bytes, and that is enough. Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: hns3: modify an unsuitable log in hclge_map_ring_to_vector()	Yonglong Liu	1	-1/+1
	When the returned vector_id less than 0, the message should print out the vector who is getting vector index fail. So this patch replaces vector_id with vector, and re-format the message. Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: hns3: modify the IRQ name of TQP vector	Yonglong Liu	1	-9/+12
	When rename the net devices, the IRQ number can not be fetched by the net device name, because the driver request the IRQ resources only when the vector resource changed, and the rename operation did not change the vector resources, so the IRQ name keeps the previous net device name. So this patch modifies the name of the TQP IRQ as "pci driver name"-"pci name"-"TxRx"-"index". Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: hns3: re-organize vector handle	Yonglong Liu	1	-48/+52
	To prevent loss user's IRQ affinity configuration when DOWN, this patch moves out release/request operation of the vector handle from net DOWN/UP, just do it when vector resource changes. Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06	net: hns3: add trace event support for HNS3 driver	Yunsheng Lin	4	-2/+172
	This adds trace support for HNS3 driver. It also declares some events which could be used to trace the events when a TX/RX BD is processed, and other events which are related to the processing of sk_buff, such as TSO, GRO. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	Merge branch 'Convert-Felix-DSA-switch-to-PHYLINK'	David S. Miller	27	-130/+1036
	Vladimir Oltean says: ==================== Convert Felix DSA switch to PHYLINK Unlike most other conversions, this one is not by far a trivial one, and should be seen as "Layerscape PCS meets PHYLINK". Actually, the PCS doesn't need a lot of hand-holding and most of our other devices 'just work' (this one included) without any sort of operating system awareness, just an initialization procedure done typically in the bootloader. Our issues start when the PCS stops from "just working", and that is where PHYLINK comes in handy. The PCS is not specific to the Vitesse / Microsemi / Microchip switching core at all. Variations of this SerDes/PCS design can also be found on DPAA1 and DPAA2 hardware. The main idea of the abstraction provided is that the PCS looks so much like a PHY device, that we model it as an actual PHY device and run the generic PHY functions on it, where appropriate. The 4xSGMII, QSGMII and QSXGMII modes are fairly straightforward. The SerDes protocol which the driver calls 2500Base-X mode (a misnomer) is more interesting. There is a description of how it works and what can be done with it in patch 9/9 (in a comment above vsc9959_pcs_init_2500basex). In short, it is a fixed speed protocol with no auto-negotiation whatsoever. From my research of the SGMII-2500 patent [1], it has nothing to do with SGMII-2500. That one: * does not define any change to the AN base page compared to plain 10/100/1000 SGMII. This implies that the 2500 speed is not negotiable, but the other speeds are. In our case, when the SerDes is configured for this protocol it's configured for good, there's no going back to SGMII. * runs at a higher base frequency than regular SGMII. So SGMII-2500 operating at 1000 Mbps wouldn't interoperate with plain SGMII at 1000 Mbps. Strange, but ok.. * Emulates lower link speeds than 2500 by duplicating the codewords twice, then thrice, then twice again etc (2.5/25/250 times on average). The Layerscape PCS doesn't do that (it is fixed at 2500 Mbaud). But on the other hand it isn't completely compatible with Base-X either, since it doesn't do 802.3z / clause 37 auto negotiation (flow control, local/remote fault etc). It is compatible with 2500Base-X without in-band AN, and that is exactly how we decided to expose it (this is actually similar to what others do). For SGMII and USXGMII, the driver is using the PHYLINK 'managed = "in-band-status"' DTS binding to figure out whether in-band AN is expected to be enabled in the PCS or not. It is expected that the attached PHY follows suite, but there is a gap here: the PHY driver does not react to this setting, so only one of "AN on" and "AN off" works on any particular PHY, even though that PHY might support bypassing the SGMII AN process, as is the case on the VSC8514 PHY present on the LS1028A-RDB board. A separate series will be sent to propose a way to deal with that. I dropped the Ocelot PHYLINK conversion because: * I don't have VSC7514 hardware anyway * The hardware is so different in this regard that there's almost nothing to share anyway. Changes in v5: - Added the register write to DEV_CLOCK_CFG back in felix_phylink_mac_config in patch 9/9. Changes in v4: - This is mostly a resend of v3, with the only notable change that I've dropped the PHY core patches for in_band_autoneg and I'll propose them independently. v1 series: https://www.spinics.net/lists/netdev/msg613869.html RFC v2 series: https://www.spinics.net/lists/netdev/msg620128.html v3 series: https://www.spinics.net/lists/netdev/msg622060.html v4 series: https://www.spinics.net/lists/netdev/msg622606.html [0]: https://www.spinics.net/lists/netdev/msg613869.html [1]: https://patents.google.com/patent/US7356047B1/en ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	net: dsa: felix: Add PCS operations for PHYLINK	Vladimir Oltean	4	-17/+767
	Layerscape SoCs traditionally expose the SerDes configuration/status for Ethernet protocols (PCS for SGMII/USXGMII/10GBase-R etc etc) in a register format that is compatible with clause 22 or clause 45 (depending on SerDes protocol). Each MAC has its own internal MDIO bus on which there is one or more of these PCS's, responding to commands at a configurable PHY address. The per-port internal MDIO bus (which is just for PCSs) is totally separate and has nothing to do with the dedicated external MDIO controller (which is just for PHYs), but the register map for the MDIO controller is the same. The VSC9959 (Felix) switch instantiated in the LS1028A is integrated in hardware with the ENETC PCS of its DSA master, and reuses its MDIO controller driver, so Felix has been made to depend on it in Kconfig. +------------------------------------------------------------------------+ \| +--------+ GMII (typically disabled via RCW) \| \| ENETC PCI \| ENETC \|--------------------------+ \| \| Root Complex \| port 3 \|-----------------------+ \| \| \| Integrated +--------+ \| \| \| \| Endpoint \| \| \| \| +--------+ 2.5G GMII \| \| \| \| \| ENETC \|--------------+ \| \| \| \| \| port 2 \|-----------+ \| \| \| \| \| +--------+ \| \| \| \| \| \| +--------+ +--------+ \| \| \| Felix \| \| Felix \| \| \| \| port 4 \| \| port 5 \| \| \| +--------+ +--------+ \| \| \| \| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ \| \| \| ENETC \| \| ENETC \| \| Felix \| \| Felix \| \| Felix \| \| Felix \| \| \| \| port 0 \| \| port 1 \| \| port 0 \| \| port 1 \| \| port 2 \| \| port 3 \| \| +------------------------------------------------------------------------+ \| \|\|\|\| SerDes \| \|\|\|\| \|\|\|\| \|\|\|\| \|\|\|\| \| \| +--------+block \| +--------------------------------------------+ \| \| \| ENETC \| \| \| ENETC port 2 internal MDIO bus \| \| \| \| port 0 \| \| \| PCS PCS PCS PCS \| \| \| \| PCS \| \| \| 0 1 2 3 \| \| +-----------------\|------------------------------------------------------+ v v v v v v SGMII/ RGMII QSGMII/QSXGMII/4xSGMII/4x1000Base-X/4x2500Base-X USXGMII/ (bypasses 1000Base-X/ SerDes) 2500Base-X In the LS1028A SoC described above, the VSC9959 Felix switch is PF5 of the ENETC root complex, and has 2 BARs: - BAR 4: the switch's effective registers - BAR 0: the MDIO controller register map lended from ENETC port 2 (PF2), for accessing its associated PCS's. This explanation is necessary because the patch does some renaming "pci_bar" -> "switch_pci_bar" for clarity, which would otherwise appear a bit obtuse. The fact that the internal MDIO bus is "borrowed" is relevant because the register map is found in PF5 (the switch) but it triggers an access fault if PF2 (the ENETC DSA master) is not enabled. This is not treated in any way (and I don't think it can be treated). All of this is so SoC-specific, that it was contained as much as possible in the platform-integration file felix_vsc9959.c. We need to parse and pre-validate the device tree because of 2 reasons: - The PHY mode (SerDes protocol) cannot change at runtime due to SoC design. - There is a circular dependency in that we need to know what clause the PCS speaks in order to find it on the internal MDIO bus. But the clause of the PCS depends on what phy-mode it is configured for. The goal of this patch is to make steps towards removing the bootloader dependency for SGMII PCS pre-configuration, as well as to add support for monitoring the in-band SGMII AN between the PCS and the system-side link partner (PHY or other MAC). In practice the bootloader dependency is not completely removed. U-Boot pre-programs the PHY address at which each PCS can be found on the internal MDIO bus (MDEV_PORT). This is needed because the PCS of each port has the same out-of-reset PHY address of zero. The SerDes register for changing MDEV_PORT is pretty deep in the SoC (outside the addresses of the ENETC PCI BARs) and therefore inaccessible to us from here. Felix VSC9959 and Ocelot VSC7514 are integrated very differently in their respective SoCs, and for that reason Felix does not use the Ocelot core library for PHYLINK. On one hand we don't want to impose the fixed phy-mode limitation to Ocelot, and on the other hand Felix doesn't need to force the MAC link speed the way Ocelot does, since the MAC is connected to the PCS through a fixed GMII, and the PCS is the one who does the rate adaptation at lower link speeds, which the MAC does not even need to know about. In fact changing the GMII speed for Felix irrecoverably breaks transmission through that port until a reset. The pair with ENETC port 3 and Felix port 5 is optional and doesn't support tagging. When we enable it, swp5 is a regular slave port, albeit an internal one. The trouble is that it doesn't work, and that is because the DSA PHYLIB adaptation layer doesn't treat fixed-link slave ports. So that is yet another reason for wanting to convert Felix to the native PHYLINK API. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	net: mscc: ocelot: export ANA, DEV and QSYS registers to include/soc/mscc	Vladimir Oltean	4	-3/+3
	Since the Felix DSA driver is implementing its own PHYLINK instance due to SoC differences, it needs access to the few registers that are common, mainly for flow control. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	net: mscc: ocelot: make phy_mode a member of the common struct ocelot_port	Vladimir Oltean	4	-6/+8
	The Ocelot switchdev driver and the Felix DSA one need it for different reasons. Felix (or at least the VSC9959 instantiation in NXP LS1028A) is integrated with the traditional NXP Layerscape PCS design which does not support runtime configuration of SerDes protocol. So it needs to pre-validate the phy-mode from the device tree and prevent PHYLINK from attempting to change it. For this, it needs to cache it in a private variable. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	enetc: Set MDIO_CFG_HOLD to the recommended value of 2	Vladimir Oltean	1	-4/+8
	This increases the MDIO hold time to 5 enet_clk cycles from the previous value of 0. This is actually the out-of-reset value, that the driver was previously overwriting with 0. Zero worked for the external MDIO, but breaks communication with the internal MDIO buses on which the PCS of ENETC SI's and Felix switch are found. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	enetc: Make MDIO accessors more generic and export to include/linux/fsl	Claudiu Manoil	9	-99/+176
	Within the LS1028A SoC, the register map for the ENETC MDIO controller is instantiated a few times: for the central (external) MDIO controller, for the internal bus of each standalone ENETC port, and for the internal bus of the Felix switch. Refactoring is needed to support multiple MDIO buses from multiple drivers. The enetc_hw structure is made an opaque type and a smaller enetc_mdio_priv is created. 'mdio_base' - MDIO registers base address - is being parameterized, to be able to work with different MDIO register bases. The ENETC MDIO bus operations are exported from the fsl-enetc-mdio kernel object, the same that registers the central MDIO controller (the dedicated PF). The ENETC main driver has been changed to select it, and use its exported helpers to further register its private MDIO bus. The DSA Felix driver will do the same. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	net: dsa: Pass pcs_poll flag from driver to PHYLINK	Vladimir Oltean	2	-0/+6
	The DSA drivers that implement .phylink_mac_link_state should normally register an interrupt for the PCS, from which they should call phylink_mac_change(). However not all switches implement this, and those who don't should set this flag in dsa_switch in the .setup callback, so that PHYLINK will poll for a few ms until the in-band AN link timer expires and the PCS state settles. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	net: phylink: add support for polling MAC PCS	Vladimir Oltean	3	-2/+6
	Some MAC PCS blocks are unable to provide interrupts when their status changes. As we already have support in phylink for polling status, use this to provide a hook for MACs to enable polling mode. The patch idea was picked up from Russell King's suggestion on the macb phylink patch thread here [0] but the implementation was changed. Instead of introducing a new phylink_start_poll() function, which would make the implementation cumbersome for common PHYLINK implementations for multiple types of devices, like DSA, just add a boolean property to the phylink_config structure, which is just as backwards-compatible. https://lkml.org/lkml/2019/12/16/603 Suggested-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	net: phylink: make QSGMII a valid PHY mode for in-band AN	Vladimir Oltean	1	-0/+1
	QSGMII is a SerDes protocol clocked at 5 Gbaud (4 times higher than SGMII which is clocked at 1.25 Gbaud), with the same 8b/10b encoding and some extra symbols for synchronization. Logically it offers 4 SGMII interfaces multiplexed onto the same physical lanes. Each MAC PCS has its own in-band AN process with the system side of the QSGMII PHY, which is identical to the regular SGMII AN process. So allow QSGMII as a valid in-band AN mode, since it is no different from software perspective from regular SGMII. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	mii: Add helpers for parsing SGMII auto-negotiation	Vladimir Oltean	2	-0/+62
	Typically a MAC PCS auto-configures itself after it receives the negotiated copper-side link settings from the PHY, but some MAC devices are more special and need manual interpretation of the SGMII AN result. In other cases, the PCS exposes the entire tx_config_reg base page as it is transmitted on the wire during auto-negotiation, so it makes sense to be able to decode the equivalent lp_advertised bit mask from the raw u16 (of course, "lp" considering the PCS to be the local PHY). Therefore, add the bit definitions for the SGMII registers 4 and 5 (local device ability, link partner ability), as well as a link_mode conversion helper that can be used to feed the AN results into phy_resolve_aneg_linkmode. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	Merge branch 'dsa-deferred-xmit'	David S. Miller	6	-96/+94
	Vladimir Oltean says: ==================== Improvements to the DSA deferred xmit After the feedback received on v1: https://www.spinics.net/lists/netdev/msg622617.html I've decided to move the deferred xmit implementation completely within the sja1105 driver. The executive summary for this series is the same as it was for v1 (better for everybody): - For those who don't use it, thanks to one less assignment in the hotpath (and now also thanks to less code in the DSA core) - For those who do, by making its scheduling more amenable and moving it outside the generic workqueue (since it still deals with packet hotpath, after all) There are some simplification (1/3) and cosmetic (3/3) patches in the areas next to the code touched by the main patch (2/3). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	net: dsa: tag_sja1105: Slightly improve the Xmas tree in sja1105_xmit	Vladimir Oltean	1	-2/+1
	This is a cosmetic patch that makes the dp, tx_vid, queue_mapping and pcp local variable definitions a bit closer in length, so they don't look like an eyesore as much. The 'ds' variable is not used otherwise, except for ds->dp. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	net: dsa: Make deferred_xmit private to sja1105	Vladimir Oltean	6	-69/+93
	There are 3 things that are wrong with the DSA deferred xmit mechanism: 1. Its introduction has made the DSA hotpath ever so slightly more inefficient for everybody, since DSA_SKB_CB(skb)->deferred_xmit needs to be initialized to false for every transmitted frame, in order to figure out whether the driver requested deferral or not (a very rare occasion, rare even for the only driver that does use this mechanism: sja1105). That was necessary to avoid kfree_skb from freeing the skb. 2. Because L2 PTP is a link-local protocol like STP, it requires management routes and deferred xmit with this switch. But as opposed to STP, the deferred work mechanism needs to schedule the packet rather quickly for the TX timstamp to be collected in time and sent to user space. But there is no provision for controlling the scheduling priority of this deferred xmit workqueue. Too bad this is a rather specific requirement for a feature that nobody else uses (more below). 3. Perhaps most importantly, it makes the DSA core adhere a bit too much to the NXP company-wide policy "Innovate Where It Doesn't Matter". The sja1105 is probably the only DSA switch that requires some frames sent from the CPU to be routed to the slave port via an out-of-band configuration (register write) rather than in-band (DSA tag). And there are indeed very good reasons to not want to do that: if that out-of-band register is at the other end of a slow bus such as SPI, then you limit that Ethernet flow's throughput to effectively the throughput of the SPI bus. So hardware vendors should definitely not be encouraged to design this way. We do _not_ want more widespread use of this mechanism. Luckily we have a solution for each of the 3 issues: For 1, we can just remove that variable in the skb->cb and counteract the effect of kfree_skb with skb_get, much to the same effect. The advantage, of course, being that anybody who doesn't use deferred xmit doesn't need to do any extra operation in the hotpath. For 2, we can create a kernel thread for each port's deferred xmit work. If the user switch ports are named swp0, swp1, swp2, the kernel threads will be named swp0_xmit, swp1_xmit, swp2_xmit (there appears to be a 15 character length limit on kernel thread names). With this, the user can change the scheduling priority with chrt $(pidof swp2_xmit). For 3, we can actually move the entire implementation to the sja1105 driver. So this patch deletes the generic implementation from the DSA core and adds a new one, more adequate to the requirements of PTP TX timestamping, in sja1105_main.c. Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05	net: dsa: sja1105: Always send through management routes in slot 0	Vladimir Oltean	2	-26/+1
	I finally found out how the 4 management route slots are supposed to be used, but.. it's not worth it. The description from the comment I've just deleted in this commit is still true: when more than 1 management slot is active at the same time, the switch will match frames incoming [from the CPU port] on the lowest numbered management slot that matches the frame's DMAC. My issue was that one was not supposed to statically assign each port a slot. Yes, there are 4 slots and also 4 non-CPU ports, but that is a mere coincidence. Instead, the switch can be used like this: every management frame gets a slot at the right of the most recently assigned slot: Send mgmt frame 1 through S0: S0 x x x Send mgmt frame 2 through S1: S0 S1 x x Send mgmt frame 3 through S2: S0 S1 S2 x Send mgmt frame 4 through S3: S0 S1 S2 S3 The difference compared to the old usage is that the transmission of frames 1-4 doesn't need to wait until the completion of the management route. It is safe to use a slot to the right of the most recently used one, because by protocol nobody will program a slot to your left and "steal" your route towards the correct egress port. So there is a potential throughput benefit here. But mgmt frame 5 has no more free slot to use, so it has to wait until _all_ of S0, S1, S2, S3 are full, in order to use S0 again. And that's actually exactly the problem: I was looking for something that would bring more predictable transmission latency, but this is exactly the opposite: 3 out of 4 frames would be transmitted quicker, but the 4th would draw the short straw and have a worse worst-case latency than before. Useless. Things are made even worse by PTP TX timestamping, which is something I won't go deeply into here. Suffice to say that the fact there is a driver-level lock on the SPI bus offsets any potential throughput gains that parallelism might bring. So there's no going back to the multi-slot scheme, remove the "mgmt_slot" variable from sja1105_port and the dummy static assignment made at probe time. While passing by, also remove the assignment to casc_port altogether. Don't pretend that we support cascaded setups. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>