path: root/drivers/infiniband/hw
Age | Commit message | Author | Files | Lines
2019-11-06 | RDMA/hns: Correct the value of srq_desc_size | Wenpeng Liang | 1 | -1/+1
srq_desc_size should be rounded up to a power of two before it is used, otherwise related calculations may allocate the wrong amount of memory for the SRQ buffer. Fixes: c7bcb13442e1 ("RDMA/hns: Add SRQ support for hip08 kernel mode") Link: https://lore.kernel.org/r/1572575610-52530-3-git-send-email-liweihang@hisilicon.com Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com> Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
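The rounding step described above can be sketched in userspace C. This is a minimal illustration, not the driver's code: `round_up_pow2()` stands in for the kernel's `roundup_pow_of_two()`, and `srq_buf_bytes()` is a hypothetical helper showing why the rounded size, rather than the raw size, should feed the buffer calculation.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's roundup_pow_of_two().
 * n must be non-zero and at most 2^31. */
static uint32_t round_up_pow2(uint32_t n)
{
    assert(n != 0 && n <= 0x80000000u);
    n--;
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    return n + 1;
}

/* Hypothetical helper: bytes needed for `entries` descriptors whose raw
 * size is `desc_size`, with each descriptor padded to a power of two. */
static uint64_t srq_buf_bytes(uint32_t entries, uint32_t desc_size)
{
    return (uint64_t)entries * round_up_pow2(desc_size);
}
```

With a raw descriptor size of 48 bytes, each slot is padded to 64 bytes, so four entries need 256 bytes rather than 192.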
2019-11-06 | RDMA/hns: Correct the value of HNS_ROCE_HEM_CHUNK_LEN | Sirong Wang | 1 | -1/+1
Size of pointer to buf field of struct hns_roce_hem_chunk should be considered when calculating HNS_ROCE_HEM_CHUNK_LEN, or sg table size will be larger than expected when allocating hem. Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver") Link: https://lore.kernel.org/r/1572575610-52530-2-git-send-email-liweihang@hisilicon.com Signed-off-by: Sirong Wang <wangsirong@huawei.com> Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
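A rough sketch of the sizing issue follows. Everything here is a hypothetical stand-in: the 256-byte budget, the `list_node` and `sg_entry` types, and the field layout are assumptions made for illustration and are not taken from the driver. The point is only that when every entry carries both a scatterlist slot and a `buf` pointer, the per-entry cost in the divisor must include the pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; sizes are illustrative, not the kernel's. */
struct list_node { struct list_node *next, *prev; };
struct sg_entry {
    unsigned long page_link;
    unsigned int offset, length;
    unsigned long dma_address;
};

#define CHUNK_BUDGET 256 /* assumed byte budget for one chunk */
#define CHUNK_HDR    (sizeof(struct list_node) + 2 * sizeof(int))

/* Entry count when the buf pointer is ignored (too large). */
#define CHUNK_LEN_NO_PTR ((CHUNK_BUDGET - CHUNK_HDR) / sizeof(struct sg_entry))

/* Entry count when each entry also pays for one void pointer. */
#define CHUNK_LEN \
    ((CHUNK_BUDGET - CHUNK_HDR) / (sizeof(struct sg_entry) + sizeof(void *)))

struct hem_chunk {
    struct list_node list;
    int npages;
    int nsg;
    struct sg_entry mem[CHUNK_LEN];
    void *buf[CHUNK_LEN];
};
```

With the pointer included in the divisor, the whole structure stays within the assumed budget; with `CHUNK_LEN_NO_PTR` the two arrays together would exceed it.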
2019-11-06 | RDMA/qedr: Remove unsupported modify_port callback | Kamal Heib | 3 | -9/+0
There is no need to always return zero for a function that is not supported. Fixes: ac1b36e55a51 ("qedr: Add support for user context verbs") Link: https://lore.kernel.org/r/20191028155931.1114-5-kamalheib1@gmail.com Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-06 | RDMA/ocrdma: Remove unsupported modify_port callback | Kamal Heib | 3 | -9/+0
There is no need to always return zero for a function that is not supported. Fixes: fe2caefcdf58 ("RDMA/ocrdma: Add driver for Emulex OneConnect IBoE RDMA adapter") Link: https://lore.kernel.org/r/20191028155931.1114-4-kamalheib1@gmail.com Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-06 | RDMA/hns: Remove unsupported modify_port callback | Kamal Heib | 1 | -7/+0
There is no need to always return zero for a function that is not supported. Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver") Link: https://lore.kernel.org/r/20191028155931.1114-3-kamalheib1@gmail.com Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-06 | IB/hfi1: TID RDMA WRITE should not return IB_WC_RNR_RETRY_EXC_ERR | Kaike Wan | 1 | -8/+8
Normal RDMA WRITE request never returns IB_WC_RNR_RETRY_EXC_ERR to ULPs because it does not need post receive buffer on the responder side. Consequently, as an enhancement to normal RDMA WRITE request inside the hfi1 driver, TID RDMA WRITE request should not return such an error status to ULPs, although it does receive RNR NAKs from the responder when TID resources are not available. This behavior is violated when qp->s_rnr_retry_cnt is set in current hfi1 implementation. This patch enforces these semantics by avoiding any reaction to the updates of the RNR QP attributes. Fixes: 3c6cb20a0d17 ("IB/hfi1: Add TID RDMA WRITE functionality into RDMA verbs") Link: https://lore.kernel.org/r/20191025195842.106825.71532.stgit@awfm-01.aw.intel.com Cc: <stable@vger.kernel.org> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-06 | IB/hfi1: Calculate flow weight based on QP MTU for TID RDMA | Kaike Wan | 3 | -11/+6
For a TID RDMA WRITE request, a QP on the responder side could be put into a queue when a hardware flow is not available. A RNR NAK will be returned to the requester with a RNR timeout value based on the position of the QP in the queue. The tid_rdma_flow_wt variable is used to calculate the timeout value and is determined by using a MTU of 4096 at the module loading time. This could reduce the timeout value by half from the desired value, leading to excessive RNR retries. This patch fixes the issue by calculating the flow weight with the real MTU assigned to the QP. Fixes: 07b923701e38 ("IB/hfi1: Add functions to receive TID RDMA WRITE request") Link: https://lore.kernel.org/r/20191025195836.106825.77769.stgit@awfm-01.aw.intel.com Cc: <stable@vger.kernel.org> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
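The message does not give the weight formula, so the sketch below assumes one that scales linearly with the MTU. `FLOW_PSN_SPACE`, `SEGMENT_BYTES` and `flow_weight()` are hypothetical names and values chosen for illustration. Under that assumption, a weight fixed at MTU 4096 is half of what a QP running at MTU 8192 should use, which matches the halving described above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants, for illustration only. */
#define FLOW_PSN_SPACE 2048u          /* PSNs available to one hardware flow */
#define SEGMENT_BYTES  (256u * 1024u) /* bytes carried by one segment */

/* Hypothetical weight: how many segments one flow can carry at this MTU. */
static uint32_t flow_weight(uint32_t mtu_bytes)
{
    assert(mtu_bytes != 0);
    return (uint32_t)(((uint64_t)FLOW_PSN_SPACE * mtu_bytes) / SEGMENT_BYTES);
}
```

Computing the weight per QP from its own MTU, instead of once at module load, avoids the mismatch.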
2019-11-06 | IB/hfi1: Ensure r_tid_ack is valid before building TID RDMA ACK packet | Kaike Wan | 1 | -17/+27
The index r_tid_ack is used to indicate the next TID RDMA WRITE request to acknowledge in the ring s_ack_queue[] on the responder side and should be set to a valid index other than its initial value before r_tid_tail is advanced to the next TID RDMA WRITE request and particularly before a TID RDMA ACK is built. Otherwise, a NULL pointer dereference may result: BUG: unable to handle kernel paging request at ffff9a32d27abff8 IP: [<ffffffffc0d87ea6>] hfi1_make_tid_rdma_pkt+0x476/0xcb0 [hfi1] PGD 2749032067 PUD 0 Oops: 0000 1 SMP Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib(OE) hfi1(OE) rdmavt(OE) nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod target_core_mod ib_ucm dm_mirror dm_region_hash dm_log mlx5_ib dm_mod zfs(POE) rpcrdma sunrpc rdma_ucm ib_uverbs opa_vnic ib_iser zunicode(POE) ib_umad zavl(POE) icp(POE) sb_edac intel_powerclamp coretemp rdma_cm intel_rapl iosf_mbi iw_cm libiscsi scsi_transport_iscsi kvm ib_cm iTCO_wdt mxm_wmi iTCO_vendor_support irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd zcommon(POE) znvpair(POE) pcspkr spl(OE) mei_me sg mei ioatdma lpc_ich joydev i2c_i801 shpchp ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 mlx5_core drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ixgbe ahci ttm mlxfw ib_core libahci devlink mdio crct10dif_pclmul crct10dif_common drm ptp libata megaraid_sas crc32c_intel i2c_algo_bit pps_core i2c_core dca [last unloaded: rdmavt] CPU: 15 PID: 68691 Comm: kworker/15:2H Kdump: loaded Tainted: P W OE ------------ 3.10.0-862.2.3.el7_lustre.x86_64 #1 Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016 Workqueue: hfi0_0 
_hfi1_do_tid_send [hfi1] task: ffff9a01f47faf70 ti: ffff9a11776a8000 task.ti: ffff9a11776a8000 RIP: 0010:[<ffffffffc0d87ea6>] [<ffffffffc0d87ea6>] hfi1_make_tid_rdma_pkt+0x476/0xcb0 [hfi1] RSP: 0018:ffff9a11776abd08 EFLAGS: 00010002 RAX: ffff9a32d27abfc0 RBX: ffff99f2d27aa000 RCX: 00000000ffffffff RDX: 0000000000000000 RSI: 0000000000000220 RDI: ffff99f2ffc05300 RBP: ffff9a11776abd88 R08: 000000000001c310 R09: ffffffffc0d87ad4 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a117a423c00 R13: ffff9a117a423c00 R14: ffff9a03500c0000 R15: ffff9a117a423cb8 FS: 0000000000000000(0000) GS:ffff9a117e9c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff9a32d27abff8 CR3: 0000002748a0e000 CR4: 00000000001607e0 Call Trace: [<ffffffffc0d88874>] _hfi1_do_tid_send+0x194/0x320 [hfi1] [<ffffffffaf0b2dff>] process_one_work+0x17f/0x440 [<ffffffffaf0b3ac6>] worker_thread+0x126/0x3c0 [<ffffffffaf0b39a0>] ? manage_workers.isra.24+0x2a0/0x2a0 [<ffffffffaf0bae31>] kthread+0xd1/0xe0 [<ffffffffaf0bad60>] ? insert_kthread_work+0x40/0x40 [<ffffffffaf71f5f7>] ret_from_fork_nospec_begin+0x21/0x21 [<ffffffffaf0bad60>] ? insert_kthread_work+0x40/0x40 hfi1 0000:05:00.0: hfi1_0: reserved_op: opcode 0xf2, slot 2, rsv_used 1, rsv_ops 1 Code: 00 00 41 8b 8d d8 02 00 00 89 c8 48 89 45 b0 48 c1 65 b0 06 48 8b 83 a0 01 00 00 48 01 45 b0 48 8b 45 b0 41 80 bd 10 03 00 00 00 <48> 8b 50 38 4c 8d 7a 50 74 45 8b b2 d0 00 00 00 85 f6 0f 85 72 RIP [<ffffffffc0d87ea6>] hfi1_make_tid_rdma_pkt+0x476/0xcb0 [hfi1] RSP <ffff9a11776abd08> CR2: ffff9a32d27abff8 This problem can happen if a RESYNC request is received before r_tid_ack is modified. This patch fixes the issue by making sure that r_tid_ack is set to a valid value before a TID RDMA ACK is built. Functions are defined to simplify the code. 
Fixes: 07b923701e38 ("IB/hfi1: Add functions to receive TID RDMA WRITE request") Fixes: 7cf0ad679de4 ("IB/hfi1: Add a function to receive TID RDMA RESYNC packet") Link: https://lore.kernel.org/r/20191025195830.106825.44022.stgit@awfm-01.aw.intel.com Cc: <stable@vger.kernel.org> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-06 | IB/hfi1: Ensure full Gen3 speed in a Gen4 system | James Erwin | 1 | -1/+3
If an hfi1 card is inserted in a Gen4 system, the driver will avoid the gen3 speed bump and the card will operate at half speed. This is because the driver avoids the gen3 speed bump when the parent bus speed isn't identical to gen3, 8.0GT/s. This is not compatible with gen4 and newer speeds. Fix by relaxing the test to explicitly look for the lower capability speeds which inherently allows for gen4 and all future speeds. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Link: https://lore.kernel.org/r/20191101192059.106248.1699.stgit@awfm-01.aw.intel.com Cc: <stable@vger.kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: James Erwin <james.erwin@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
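The difference between the strict and the relaxed test can be sketched as follows. The enum and both function names are hypothetical; the kernel uses its own `PCIE_SPEED_*` constants, and the actual condition in the driver may be written differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical speed enum, ordered from slowest to fastest. */
enum link_speed {
    SPEED_GEN1_2_5GT,
    SPEED_GEN2_5_0GT,
    SPEED_GEN3_8_0GT,
    SPEED_GEN4_16_0GT,
};

/* Strict test: any parent that is not exactly Gen3 skips the bump,
 * which wrongly includes Gen4 and newer. */
static bool skip_gen3_bump_strict(enum link_speed parent)
{
    return parent != SPEED_GEN3_8_0GT;
}

/* Relaxed test: only parents known to be slower than Gen3 skip it. */
static bool skip_gen3_bump_relaxed(enum link_speed parent)
{
    return parent == SPEED_GEN1_2_5GT || parent == SPEED_GEN2_5_0GT;
}
```

Listing the slower speeds explicitly means a Gen4 or later parent no longer skips the bump.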
2019-11-06 | RDMA/qedr: Add iWARP doorbell recovery support | Michal Kalderon | 2 | -6/+43
This patch adds the iWARP specific doorbells to the doorbell recovery mechanism. Link: https://lore.kernel.org/r/20191030094417.16866-9-michal.kalderon@marvell.com Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-06 | RDMA/qedr: Add doorbell overflow recovery support | Michal Kalderon | 2 | -50/+275
Use the doorbell recovery mechanism to register rdma related doorbells that will be restored in case there is a doorbell overflow attention. Link: https://lore.kernel.org/r/20191030094417.16866-8-michal.kalderon@marvell.com Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-06 | RDMA/qedr: Use the common mmap API | Michal Kalderon | 4 | -121/+98
Remove all functions related to mmap from qedr and use the common API. Link: https://lore.kernel.org/r/20191030094417.16866-7-michal.kalderon@marvell.com Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-06 | RDMA/efa: Use the common mmap_xa helpers | Michal Kalderon | 3 | -194/+153
Remove the functions related to managing the mmap_xa database. This code was replaced with common code in ib_core. Link: https://lore.kernel.org/r/20191030094417.16866-5-michal.kalderon@marvell.com Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-06 | RDMA: Connect between the mmap entry and the umap_priv structure | Michal Kalderon | 4 | -10/+19
The rdma_user_mmap_io interface created a common interface for drivers to correctly map hw resources and zap them once the ucontext is destroyed enabling the drivers to safely free the hw resources. However, this meant the drivers need to delay freeing the resource to the ucontext destroy phase to ensure they were no longer mapped. The new mechanism for a common way of handling user/driver address mapping enabled notifying the driver if all umap_priv mappings were removed, and enabled freeing the hw resources when they are done with and not delay it until ucontext destroy. Since not all drivers use the mechanism, NULL can be sent to the rdma_user_mmap_io interface to continue working as before. Drivers that use the mmap_xa interface can pass the entry being mapped to the rdma_user_mmap_io function to be linked together. Link: https://lore.kernel.org/r/20191030094417.16866-4-michal.kalderon@marvell.com Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-01 | IB/mlx5: Introduce and use mlx5_core_is_vf() | Parav Pandit | 1 | -1/+1
Instead of deciding whether a given device is a virtual function based on whether it is a PF, use the already defined MLX5_COREDEV_VF by introducing a helper API, mlx5_core_is_vf(). This makes it possible to clearly identify PF, VF and non virtual functions. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Vu Pham <vuhuong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
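A small sketch shows why a positive check is clearer than negating the PF test. The enum, the struct and both function names are hypothetical stand-ins rather than the mlx5 definitions, and `COREDEV_OTHER` is an assumed third kind of device added only to make the difference visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device kinds; the third value is an assumption. */
enum coredev_type { COREDEV_PF, COREDEV_VF, COREDEV_OTHER };

struct core_dev {
    enum coredev_type type;
};

/* Inference by negation: everything that is not a PF is treated as a VF. */
static bool is_vf_by_negation(const struct core_dev *dev)
{
    assert(dev);
    return dev->type != COREDEV_PF;
}

/* Positive check in the spirit of the helper described above. */
static bool core_is_vf(const struct core_dev *dev)
{
    assert(dev);
    return dev->type == COREDEV_VF;
}
```

For a device that is neither PF nor VF, the negation-based test reports a VF while the positive check does not.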
2019-10-31 | IB/mlx5: Test write combining support | Michael Guralnik | 4 | -3/+223
Linux can run in all sorts of physical machines and VMs where write combining may or may not be supported. Currently there is no way to reliably tell if the system supports WC, or not. The driver uses WC to optimize posting work to the HCA, and getting this wrong in either direction can cause a significant performance loss. Add a test in mlx5_ib initialization process to test whether write-combining is supported on the machine. The test will run as part of the enable_driver callback to ensure that the test runs after the device is setup and can create and modify the QP needed, but runs before the device is exposed to the users. The test opens UD QP and posts NOP WQEs, the WQE written to the BlueFlame is different from the WQE in memory, requesting CQE only on the BlueFlame WQE. By checking whether we received a completion on one of these WQEs we can know if BlueFlame succeeded and this write-combining must be supported. Change reporting of BlueFlame support to be dependent on write-combining support instead of the FW's guess as to what the machine can do. Link: https://lore.kernel.org/r/20191027062234.10993-1-leon@kernel.org Signed-off-by: Michael Guralnik <michaelgur@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-31 | RDMA/mlx5: Return proper error value | Leon Romanovsky | 1 | -1/+1
Returned value from mlx5_mr_cache_alloc() is checked to be error or real pointer. Return proper error code instead of NULL which is not checked later. Fixes: 81713d3788d2 ("IB/mlx5: Add implicit MR support") Link: https://lore.kernel.org/r/20191029055721.7192-1-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
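The error-pointer convention behind this fix can be sketched in userspace. `err_ptr()`, `ptr_err()` and `is_err()` below are simplified stand-ins for the kernel's `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()`; `cache_alloc()`, `use_entry()` and the choice of `-ENOMEM` are hypothetical and only illustrate why a callee must not return NULL when its caller checks only for error pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel's error-pointer helpers. */
static inline void *err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator: on failure it returns an error pointer,
 * never NULL, because the caller below only tests is_err(). */
static void *cache_alloc(int simulate_failure)
{
    static int entry;

    if (simulate_failure)
        return err_ptr(-ENOMEM); /* illustrative error code */
    return &entry;
}

/* Caller that distinguishes only "error pointer" from "real pointer". */
static long use_entry(int simulate_failure)
{
    void *e = cache_alloc(simulate_failure);

    if (is_err(e))
        return ptr_err(e);
    return 0;
}
```

Note that `is_err(NULL)` is false, so a NULL return would pass the caller's check and be dereferenced later.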
2019-10-29 | RDMA/hns: Fix build error again | Arnd Bergmann | 2 | -5/+20
This is not the first attempt to fix building random configurations, unfortunately the attempt in commit a07fc0bb483e ("RDMA/hns: Fix build error") caused a new problem when CONFIG_INFINIBAND_HNS_HIP06=m and CONFIG_INFINIBAND_HNS_HIP08=y: drivers/infiniband/hw/hns/hns_roce_main.o:(.rodata+0xe60): undefined reference to `__this_module' Revert commits a07fc0bb483e ("RDMA/hns: Fix build error") and a3e2d4c7e766 ("RDMA/hns: remove obsolete Kconfig comment") to get back to the previous state, then fix the issues described there differently, by adding more specific dependencies: INFINIBAND_HNS can now only be built-in if at least one of HNS or HNS3 are built-in, and the individual back-ends are only available if that code is reachable from the main driver. Fixes: a07fc0bb483e ("RDMA/hns: Fix build error") Fixes: a3e2d4c7e766 ("RDMA/hns: remove obsolete Kconfig comment") Fixes: dd74282df573 ("RDMA/hns: Initialize the PCI device for hip08 RoCE") Fixes: 08805fdbeb2d ("RDMA/hns: Split hw v1 driver from hns roce driver") Link: https://lore.kernel.org/r/20191007211826.3361202-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | Merge branch 'odp_rework' into rdma.git for-next | Jason Gunthorpe | 6 | -591/+604
Jason Gunthorpe says:
====================
In order to hoist the interval tree code out of the drivers and into the mmu_notifiers it is necessary for the drivers to not use the interval tree for other things. This series replaces the interval tree with an xarray and along the way re-aligns all the locking to use a sensible SRCU model where the 'update' step is done by modifying an xarray. The result is overall much simpler and with less locking in the critical path. Many functions were reworked for clarity and small details like using 'imr' to refer to the implicit MR make the entire code flow here more readable. This also squashes at least two race bugs on its own, and quite possibly more that haven't been identified.
====================
Merge conflicts with the odp statistics patch resolved.
* branch 'odp_rework':
  RDMA/odp: Remove broken debugging call to invalidate_range
  RDMA/mlx5: Do not race with mlx5_ib_invalidate_range during create and destroy
  RDMA/mlx5: Do not store implicit children in the odp_mkeys xarray
  RDMA/mlx5: Rework implicit ODP destroy
  RDMA/mlx5: Avoid double lookups on the pagefault path
  RDMA/mlx5: Reduce locking in implicit_mr_get_data()
  RDMA/mlx5: Use an xarray for the children of an implicit ODP
  RDMA/mlx5: Split implicit handling from pagefault_mr
  RDMA/mlx5: Set the HW IOVA of the child MRs to their place in the tree
  RDMA/mlx5: Lift implicit_mr_alloc() into the two routines that call it
  RDMA/mlx5: Rework implicit_mr_get_data
  RDMA/mlx5: Delete struct mlx5_priv->mkey_table
  RDMA/mlx5: Use a dedicated mkey xarray for ODP
  RDMA/mlx5: Split sig_err MR data into its own xarray
  RDMA/mlx5: Use SRCU properly in ODP prefetch
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Do not race with mlx5_ib_invalidate_range during create and destroy | Jason Gunthorpe | 3 | -59/+88
For creation, as soon as the umem_odp is created the notifier can be called, however the underlying MR may not have been setup yet. This would cause problems if mlx5_ib_invalidate_range() runs. There is some confusing/unlocked/racy code that might be trying to solve this, but without locks it isn't going to work right. Instead trivially solve the problem by short-circuiting the invalidation if there are not yet any DMA mapped pages. By definition there is nothing to invalidate in this case. The create code will have the umem fully setup before anything is DMA mapped, and npages is fully locked by the umem_mutex. For destroy, invalidate the entire MR at the HW to stop DMA then DMA unmap the pages before destroying the MR. This drives npages to zero and prevents similar racing with invalidate while the MR is undergoing destruction. Arguably it would be better if the umem was created after the MR and destroyed before, but that would require a big rework of the MR code. Fixes: 6aec21f6a832 ("IB/mlx5: Page faults handling infrastructure") Link: https://lore.kernel.org/r/20191009160934.3143-15-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Do not store implicit children in the odp_mkeys xarray | Jason Gunthorpe | 1 | -30/+6
These mkeys are entirely internal and are never used by the HW for page fault. They should also never be used by userspace for prefetch. Simplify & optimize things by not including them in the xarray. Since the prefetch path can now never see a child mkey there is no need for the second synchronize_srcu() during imr destroy. Link: https://lore.kernel.org/r/20191009160934.3143-14-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Rework implicit ODP destroy | Jason Gunthorpe | 4 | -64/+120
Use SRCU in a sensible way by removing all MRs in the implicit tree from the two xarrays (the update operation), then a synchronize, followed by a normal single threaded teardown. This is only a little unusual from the normal pattern as there can still be some work pending in the unbound wq that may also require a workqueue flush. This is tracked with a single atomic, consolidating the redundant existing atomics and wait queue. For understand-ability the entire ODP implicit create/destroy flow now largely exists in a single pair of functions within odp.c, with a few support functions for tearing down an unused child. Link: https://lore.kernel.org/r/20191009160934.3143-13-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Avoid double lookups on the pagefault path | Jason Gunthorpe | 1 | -106/+80
Now that the locking is simplified combine pagefault_implicit_mr() with implicit_mr_get_data() so that we sweep over the idx range only once, and do the single xlt update at the end, after the child umems are setup. This avoids double iteration/xa_loads plus the sketchy failure path if the xa_load() fails. Link: https://lore.kernel.org/r/20191009160934.3143-12-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Reduce locking in implicit_mr_get_data() | Jason Gunthorpe | 1 | -12/+26
Now that the child MRs are stored in an xarray we can rely on the SRCU lock to protect the xa_load and use xa_cmpxchg on the slow allocation path to resolve races with concurrent page fault. This reduces the scope of the critical section of umem_mutex for implicit MRs to only cover mlx5_ib_update_xlt, and avoids taking a lock at all if the child MR is already in the xarray. This makes it consistent with the normal ODP MR critical section for umem_lock, and the locking approach used for destroying an unused implicit child MR. The MLX5_IB_UPD_XLT_ATOMIC is no longer needed in implicit_get_child_mr() since it is no longer called with any locks. Link: https://lore.kernel.org/r/20191009160934.3143-11-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
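The lookup-then-compare-and-exchange pattern described above can be approximated in userspace with C11 atomics. This is an analogue under stated assumptions, not the xarray API: a fixed array of atomic pointers stands in for the xarray, `get_child()` and `struct child` are hypothetical, and the lifetime protection that SRCU provides in the kernel is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NUM_SLOTS 64 /* assumed table size, for illustration */

struct child {
    unsigned long idx;
};

/* Fixed table of atomic pointers standing in for the xarray. */
static _Atomic(struct child *) slots[NUM_SLOTS];

/* Return the child at idx, allocating it if absent. No lock is taken:
 * the fast path is a plain load, and racing allocators are resolved by
 * a compare-and-exchange so that exactly one allocation wins. */
static struct child *get_child(unsigned long idx)
{
    struct child *cur, *fresh, *expected = NULL;

    assert(idx < NUM_SLOTS);

    /* Fast path: the child already exists. */
    cur = atomic_load(&slots[idx]);
    if (cur)
        return cur;

    /* Slow path: allocate outside any lock. */
    fresh = malloc(sizeof(*fresh));
    if (!fresh)
        return NULL;
    fresh->idx = idx;

    /* Publish only if the slot is still empty. */
    if (atomic_compare_exchange_strong(&slots[idx], &expected, fresh))
        return fresh;

    /* Lost the race: drop our copy and use the winner's entry,
     * which the failed exchange stored in `expected`. */
    free(fresh);
    return expected;
}
```

Repeated calls for the same index return the same pointer, and a thread that loses the race frees its own allocation rather than overwriting the published entry.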
2019-10-28 | RDMA/mlx5: Use an xarray for the children of an implicit ODP | Jason Gunthorpe | 2 | -133/+67
Currently the child leaves are stored in the shared interval tree and every lookup for a child must be done under the interval tree rwsem. This is further complicated by dropping the rwsem during iteration (ie the odp_lookup(), odp_next() pattern), which requires a very tricky and difficult to understand locking scheme with SRCU. Instead reserve the interval tree for the exclusive use of the mmu notifier related code in umem_odp.c and give each implicit MR a xarray containing all the child MRs. Since the size of each child is 1GB of VA, a 1 level xarray will index 64G of VA, and a 2 level will index 2TB, making xarray a much better data structure choice than an interval tree. The locking properties of xarray will be used in the next patches to rework the implicit ODP locking scheme into something simpler. At this point, the xarray is locked by the implicit MR's umem_mutex, and read can also be locked by the odp_srcu. Link: https://lore.kernel.org/r/20191009160934.3143-10-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Split implicit handling from pagefault_mr | Jason Gunthorpe | 1 | -49/+76
The single routine has a very confusing scheme to advance to the next child MR when working on an implicit parent. This scheme can only be used when working with an implicit parent and must not be triggered when working on a normal MR. Re-arrange things by directly putting all the single-MR stuff into one function and calling it in a loop for the implicit case. Simplify some of the error handling in the new pagefault_real_mr() to remove unneeded gotos. Link: https://lore.kernel.org/r/20191009160934.3143-9-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Set the HW IOVA of the child MRs to their place in the tree | Jason Gunthorpe | 1 | -2/+3
Instead of rewriting all the IOVA's to 0 as things progress down the tree make the IOVA of the children equal to placement in the tree. This makes things easier to understand by keeping mmkey.iova == HW configuration. Link: https://lore.kernel.org/r/20191009160934.3143-8-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Lift implicit_mr_alloc() into the two routines that call it | Jason Gunthorpe | 1 | -77/+74
This makes the routines easier to understand, particularly with respect to the locking requirements of the entire sequence. The implicit_mr_alloc() had a lot of ifs specializing it to each of the callers, and only a very small amount of code was actually shared. Following patches will cause the flow in the two functions to diverge further. Link: https://lore.kernel.org/r/20191009160934.3143-7-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Rework implicit_mr_get_data | Jason Gunthorpe | 1 | -54/+69
This function is intended to loop across each MTT chunk in the implicit parent that intersects the range [io_virt, io_virt+bnct). But it has a confusing construction, so:
- Consistently use imr and odp_imr to refer to the implicit parent to avoid confusion with the normal mr and odp of the child
- Directly compute the inclusive start/end indexes by shifting. This is clearer to understand the intent and avoids any errors from unaligned values of addr
- Iterate directly over the range of MTT indexes, do not make a loop out of goto
- Follow 'success oriented flow', with goto error unwind
- Directly calculate the range of idx's that need update_xlt
- Ensure that any leaf MR added to the interval tree always results in an update to the XLT
Link: https://lore.kernel.org/r/20191009160934.3143-6-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
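Computing inclusive start and end indexes by shifting can be sketched as below. `CHILD_SHIFT` assumes a 1 GiB span per index, and `struct idx_range`, `child_idx_range()` and the byte-count parameter name are hypothetical; the driver's actual chunk size and names may differ.

```c
#include <assert.h>
#include <stdint.h>

#define CHILD_SHIFT 30 /* assumption: each index covers 1 GiB of VA */

struct idx_range {
    uint64_t first; /* inclusive */
    uint64_t last;  /* inclusive */
};

/* Map the half-open byte range [io_virt, io_virt + bcnt) to the
 * inclusive range of indexes it touches. Shifting the first and the
 * last byte address works for unaligned starts and lengths alike. */
static struct idx_range child_idx_range(uint64_t io_virt, uint64_t bcnt)
{
    struct idx_range r;

    assert(bcnt > 0 && io_virt <= UINT64_MAX - (bcnt - 1));
    r.first = io_virt >> CHILD_SHIFT;
    r.last = (io_virt + bcnt - 1) >> CHILD_SHIFT;
    return r;
}
```

A two-byte range that straddles a 1 GiB boundary yields two indexes, while an aligned range of exactly 1 GiB yields one, so a plain `for` loop from `first` to `last` visits exactly the chunks that intersect the range.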
2019-10-28 | RDMA/mlx5: Delete struct mlx5_priv->mkey_table | Jason Gunthorpe | 1 | -9/+0
No users are left, delete it. Link: https://lore.kernel.org/r/20191009160934.3143-5-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Use a dedicated mkey xarray for ODP | Jason Gunthorpe | 5 | -73/+83
There is a per device xarray storing mkeys that is used to store every mkey in the system. However, this xarray is now only read by ODP for certain ODP designated MRs (ODP, implicit ODP, MW, DEVX_INDIRECT). Create an xarray only for use by ODP, that only contains ODP related MKeys. This xarray is protected by SRCU and all erases are protected by a synchronize. This improves performance:
- All MRs in the odp_mkeys xarray are ODP MRs, so some tests for is_odp() can be deleted. The xarray will also consume fewer nodes.
- normal MR's are never mixed with ODP MRs in a SRCU data structure so performance sucking synchronize_srcu() on every MR destruction is not needed.
- No smp_load_acquire(live) and xa_load() double barrier on read
Due to the SRCU locking scheme care must be taken with the placement of the xa_store(). Once it completes the MR is immediately visible to other threads and only through a xa_erase() & synchronize_srcu() cycle could it be destroyed. Link: https://lore.kernel.org/r/20191009160934.3143-4-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Split sig_err MR data into its own xarray | Jason Gunthorpe | 4 | -17/+28
The locking model for signature is completely different than ODP, do not share the same xarray that relies on SRCU locking to support ODP. Simply store the active mlx5_core_sig_ctx's in an xarray when signature MRs are created and rely on trivial xarray locking to serialize everything. The overhead of storing only a handful of SIG related MRs is going to be much less than an xarray full of every mkey. Link: https://lore.kernel.org/r/20191009160934.3143-3-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/mlx5: Use SRCU properly in ODP prefetch | Jason Gunthorpe | 1 | -141/+121
When working with SRCU protected xarrays the xarray itself should be the SRCU 'update' point. Instead prefetch is using live as the SRCU update point and this prevents switching the locking design to use the xarray instead. To solve this the prefetch must only read from the xarray once, and hold on to the actual MR pointer for the duration of the async operation. Incrementing num_pending_prefetch delays destruction of the MR, so it is suitable. Prefetch calls directly to the pagefault_mr using the MR pointer and only does a single xarray lookup. All the testing if a MR is prefetchable or not is now done only in the prefetch code and removed from the pagefault critical path. Link: https://lore.kernel.org/r/20191009160934.3143-2-jgg@ziepe.ca Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | Merge tag 'v5.4-rc5' into rdma.git for-next | Jason Gunthorpe | 10 | -121/+122
Linux 5.4-rc5 For dependencies in the next patches Conflict resolved by keeping the delete of the unlock. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/vmw_pvrdma: Use resource ids from physical device if available | Bryan Tan | 2 | -29/+101
This change allows the RDMA stack to use physical resource numbers if they are passed up from the device. This is accomplished by separating the concept of the QP number from the QP handle. Previously, the two were the same, as the QP number was exposed to the guest and also used to reference a virtual QP in the device backend. With physical resource numbers exposed, the QP number given to the guest is the number assigned from the physical HCA's QP, while the QP handle is still the internal handle used to reference a virtual QP. Regardless of whether the device is exposing physical ids, the driver will still try to pick up the QP handle from the backend if possible. The MR keys exposed to the guest will also be the MR keys created by the physical HCA, instead of virtual MR keys. The distinction between handle and keys is already present for MRs so there is no need to do anything special here. A new version of the create QP response has been added to the device API to pass up the QP number and handle. The driver will also report these to userspace in the udata response if userspace supports it or not create the queuepair if not. I also had to do a refactor of the destroy qp code to reuse it if we fail to copy to userspace. Link: https://lore.kernel.org/r/20191028181444.19448-1-aditr@vmware.com Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: Adit Ranadive <aditr@vmware.com> Signed-off-by: Bryan Tan <bryantan@vmware.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28 | RDMA/hns: Prevent memory leaks of eq->buf_list | Lijun Ou | 1 | -3/+3
eq->buf_list->buf and eq->buf_list should also be freed when eqe_hop_num is set to 0, or there will be memory leaks. Fixes: a5073d6054f7 ("RDMA/hns: Add eq support of hip08") Link: https://lore.kernel.org/r/1572072995-11277-3-git-send-email-liweihang@hisilicon.com Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28RDMA/iw_cxgb4: Avoid freeing skb twice in arp failure casePotnuri Bharat Teja1-2/+0
_put_ep_safe() and _put_pass_ep_safe() free the skb before it is freed by process_work(). Fix the double free by freeing the skb only in process_work(). Fixes: 1dad0ebeea1c ("iw_cxgb4: Avoid touch after free error in ARP failure handlers") Link: https://lore.kernel.org/r/1572006880-5800-1-git-send-email-bharat@chelsio.com Signed-off-by: Dakshaja Uppalapati <dakshaja@chelsio.com> Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28RDMA/iw_cxgb4: Report correct port speed/widthPotnuri Bharat Teja1-3/+4
Query speed/width from corresponding netdev. Link: https://lore.kernel.org/r/1572001022-4533-1-git-send-email-bharat@chelsio.com Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28RDMA/mlx5: Use irq xarray locking for mkey_tableJason Gunthorpe1-2/+2
The mkey_table xarray is touched by the reg_mr_callback() function which is called from a hard irq. Thus all other uses of xa_lock must use the _irq variants. WARNING: inconsistent lock state 5.4.0-rc1 #12 Not tainted -------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. python3/343 [HC0[0]:SC0[0]:HE1:SE1] takes: ffff888182be1d40 (&(&xa->xa_lock)->rlock#3){?.-.}, at: xa_erase+0x12/0x30 {IN-HARDIRQ-W} state was registered at: lock_acquire+0xe1/0x200 _raw_spin_lock_irqsave+0x35/0x50 reg_mr_callback+0x2dd/0x450 [mlx5_ib] mlx5_cmd_exec_cb_handler+0x2c/0x70 [mlx5_core] mlx5_cmd_comp_handler+0x355/0x840 [mlx5_core] [..] Possible unsafe locking scenario: CPU0 ---- lock(&(&xa->xa_lock)->rlock#3); <Interrupt> lock(&(&xa->xa_lock)->rlock#3); *** DEADLOCK *** 2 locks held by python3/343: #0: ffff88818eb4bd38 (&uverbs_dev->disassociate_srcu){....}, at: ib_uverbs_ioctl+0xe5/0x1e0 [ib_uverbs] #1: ffff888176c76d38 (&file->hw_destroy_rwsem){++++}, at: uobj_destroy+0x2d/0x90 [ib_uverbs] stack backtrace: CPU: 3 PID: 343 Comm: python3 Not tainted 5.4.0-rc1 #12 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x86/0xca print_usage_bug.cold.50+0x2e5/0x355 mark_lock+0x871/0xb50 ? match_held_lock+0x20/0x250 ? check_usage_forwards+0x240/0x240 __lock_acquire+0x7de/0x23a0 ? __kasan_check_read+0x11/0x20 ? mark_lock+0xae/0xb50 ? mark_held_locks+0xb0/0xb0 ? find_held_lock+0xca/0xf0 lock_acquire+0xe1/0x200 ? xa_erase+0x12/0x30 _raw_spin_lock+0x2a/0x40 ? xa_erase+0x12/0x30 xa_erase+0x12/0x30 mlx5_ib_dealloc_mw+0x55/0xa0 [mlx5_ib] uverbs_dealloc_mw+0x3c/0x70 [ib_uverbs] uverbs_free_mw+0x1a/0x20 [ib_uverbs] destroy_hw_idr_uobject+0x49/0xa0 [ib_uverbs] [..] 
Fixes: 0417791536ae ("RDMA/mlx5: Add missing synchronize_srcu() for MW cases") Link: https://lore.kernel.org/r/20191024234910.GA9038@ziepe.ca Acked-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28RDMA/qedr: Fix memory leak in user qp and mrMichal Kalderon1-2/+10
User QP pbls weren't freed properly. MR pbls weren't freed properly. Fixes: e0290cce6ac0 ("qedr: Add support for memory registeration verbs") Link: https://lore.kernel.org/r/20191027200451.28187-5-michal.kalderon@marvell.com Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28RDMA/qedr: Fix synchronization methods and memory leaks in qedrMichal Kalderon3-75/+158
Re-design of the iWARP CM related objects reference counting and synchronization methods, to ensure operations are synchronized correctly and that memory allocated for "ep" is properly released. Also makes sure QP memory is not released before ep is finished accessing it. Whereas the QP object is created/destroyed by external operations, the ep is created/destroyed by internal operations and represents the tcp connection associated with the QP. QP destruction flow: - needs to wait for ep establishment to complete (either successfully or with error) - needs to wait for ep disconnect to be fully posted to avoid a race condition of disconnect being called after reset. - both the operations above don't always happen, so we use atomic flags to indicate whether the qp destruction flow needs to wait for these completions or not. If the destroy is called before these operations began, the flows will check the flags and not execute them (connect / disconnect). We use a completion structure for waiting for the completions mentioned above. The QP refcnt was changed to a kref object. The EP has a kref added to it to handle an additional worker thread accessing it. Memory Leaks - https://www.spinics.net/lists/linux-rdma/msg83762.html Concurrency not managed correctly - https://www.spinics.net/lists/linux-rdma/msg67949.html Fixes: de0089e692a9 ("RDMA/qedr: Add iWARP connection management qp related callbacks") Link: https://lore.kernel.org/r/20191027200451.28187-4-michal.kalderon@marvell.com Reported-by: Chuck Lever <chuck.lever@oracle.com> Reported-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
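The reference-counting idea in the commit above (an object shared with a worker thread is freed only when the last holder drops it) can be illustrated with a small userspace sketch. This is not the qedr code and does not use the kernel kref API; `struct ep_sketch`, `ep_create()`, `ep_get()` and `ep_put()` are hypothetical names, and C11 atomics stand in for the kernel primitives.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for an endpoint object shared between contexts. */
struct ep_sketch {
	atomic_int refcount;   /* number of live references */
	int *released;         /* set to 1 when the object is freed (for demo) */
};

/* Allocate the object with one reference held by the creator. */
static struct ep_sketch *ep_create(int *released)
{
	struct ep_sketch *ep = malloc(sizeof(*ep));

	if (!ep)
		return NULL;
	atomic_init(&ep->refcount, 1);
	ep->released = released;
	return ep;
}

/* Take an extra reference, e.g. before handing the object to a worker. */
static void ep_get(struct ep_sketch *ep)
{
	atomic_fetch_add(&ep->refcount, 1);
}

/* Drop a reference; free the object when the last one goes away.
 * Returns 1 if the object was freed, 0 otherwise. */
static int ep_put(struct ep_sketch *ep)
{
	/* atomic_fetch_sub returns the value held before the decrement */
	if (atomic_fetch_sub(&ep->refcount, 1) == 1) {
		*ep->released = 1;
		free(ep);
		return 1;
	}
	return 0;
}
```

In this sketch the creator and the worker each call `ep_put()` once; whichever runs last frees the memory, so neither needs to know about the other's lifetime.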
2019-10-28RDMA/qedr: Fix qpids xarray api usedMichal Kalderon3-4/+4
The qpids xarray isn't accessed from irq context and therefore there is no need to use the xa_XXX_irq version of the apis. Remove the _irq. Fixes: b6014f9e5f39 ("qedr: Convert qpidr to XArray") Link: https://lore.kernel.org/r/20191027200451.28187-3-michal.kalderon@marvell.com Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28RDMA/qedr: Fix srqs xarray initializationMichal Kalderon1-0/+1
There was a missing initialization for the srqs xarray. SRQs xarray can also be called from irq context when searching for an element and uses the xa_XXX_irq apis, therefore should be initialized with IRQ flags. Fixes: 9fd15987ed27 ("qedr: Convert srqidr to XArray") Link: https://lore.kernel.org/r/20191027200451.28187-2-michal.kalderon@marvell.com Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28RDMA/hns: Fix memory leak on 'context' on error return pathColin Ian King1-4/+8
Currently, the error return path taken when the call to function dev->dfx->query_cqc_info fails will leak object 'context'. Fix this by making the 'err' exit path return the actual return code rather than a fixed -EMSGSIZE, setting ret appropriately on every error return path, and routing the leaking path through 'err' rather than just returning without freeing context. Link: https://lore.kernel.org/r/20191024131034.19989-1-colin.king@canonical.com Addresses-Coverity: ("Resource leak") Fixes: e1c9a0dc2939 ("RDMA/hns: Dump detailed driver-specific CQ") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
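The single-exit cleanup pattern described in the commit above can be sketched in userspace C as follows. This is an illustration only, not the hns driver code: `query_info_sketch()` and `dump_info_sketch()` are hypothetical names, and `malloc()`/`free()` stand in for the kernel allocators.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device query that may fail. */
static int query_info_sketch(int *context, int should_fail)
{
	if (should_fail)
		return -EINVAL;
	*context = 42;
	return 0;
}

/* Every path after the allocation leaves through 'err', so 'context'
 * is always freed and the real error code is propagated in 'ret'. */
static int dump_info_sketch(int should_fail, int *result)
{
	int *context;
	int ret;

	context = malloc(sizeof(*context));
	if (!context)
		return -ENOMEM;   /* nothing allocated yet, plain return is fine */

	ret = query_info_sketch(context, should_fail);
	if (ret)
		goto err;         /* do not return directly: that would leak */

	*result = *context;
	ret = 0;
err:
	free(context);
	return ret;
}
```

The success path falls through the same label, which keeps the free in one place and makes a forgotten cleanup on a newly added error path less likely.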
2019-10-28RDMA/hns: Bugfix for qpc/cqc timer configurationYangyang Li1-2/+2
The qpc/cqc timer entry size needs to be one page, but it is currently hard-coded to 4096, which is not appropriate in 64K page scenarios. So it should be changed to PAGE_SIZE. Fixes: 0e40dc2f70cd ("RDMA/hns: Add timer allocation support for hip08") Link: https://lore.kernel.org/r/1571908917-16220-3-git-send-email-liweihang@hisilicon.com Signed-off-by: Yangyang Li <liyangyang20@huawei.com> Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28RDMA/hns: Fix to support 64K page for srqLijun Ou1-2/+2
The SRQ's page size configuration for the BA and the buffer should depend on the current PAGE_SHIFT, or it can't work with 64K pages. Fixes: c7bcb13442e1 ("RDMA/hns: Add SRQ support for hip08 kernel mode") Link: https://lore.kernel.org/r/1571908917-16220-2-git-send-email-liweihang@hisilicon.com Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
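Why a value derived from the page size differs from a hard-coded one can be seen in a small userspace sketch. It assumes a POSIX system, where `sysconf(_SC_PAGESIZE)` plays the role that PAGE_SIZE plays in the kernel; `page_shift_of()` and `current_page_shift()` are hypothetical helpers, not driver functions.

```c
#include <unistd.h>

/* Smallest shift such that (1 << shift) >= page_size.
 * For power-of-two page sizes this is log2(page_size). */
static unsigned int page_shift_of(long page_size)
{
	unsigned int shift = 0;

	while ((1L << shift) < page_size)
		shift++;
	return shift;
}

/* Page shift of the running system, queried at run time instead of
 * assuming 4K pages. */
static unsigned int current_page_shift(void)
{
	return page_shift_of(sysconf(_SC_PAGESIZE));
}
```

`page_shift_of(4096)` gives 12 while `page_shift_of(65536)` gives 16, so a configuration that assumes a shift of 12 describes only one sixteenth of a 64K page.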
2019-10-28RDMA/hns: Delete BITS_PER_BYTE redefinitionLeon Romanovsky1-2/+0
HNS redefined a define that is already available in bits.h and didn't use it, so we can safely delete it. Link: https://lore.kernel.org/r/20191023054239.31648-1-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-28RDMA/hns: Prevent undefined behavior in hns_roce_set_user_sq_size()Jason Gunthorpe1-4/+6
The "ucmd->log_sq_bb_count" variable is a user controlled variable in the 0-255 range. If we shift more than the number of bits in an int then it's undefined behavior (it shift wraps), and potentially the int could become negative. Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver") Link: https://lore.kernel.org/r/20190608092514.GC28890@mwanda Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
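A bounds check of the kind this commit message calls for might look like the userspace sketch below. It is not the driver's actual fix: `checked_pow2()` is a hypothetical helper, and it assumes an int with no padding bits, so that its width is `sizeof(int) * CHAR_BIT`.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: compute 1 << log_count only when the result is
 * representable as a non-negative int. Returns false for shift counts
 * that would be undefined behavior or would reach the sign bit. */
static bool checked_pow2(uint8_t log_count, int *out)
{
	/* The top bit of an int is the sign bit, so the largest safe
	 * shift of the value 1 is (width - 2). */
	const unsigned int max_shift = sizeof(int) * CHAR_BIT - 2;

	if (log_count > max_shift)
		return false;

	*out = 1 << log_count;
	return true;
}
```

A caller would reject the user request (for example with -EINVAL) when the helper returns false, instead of using the shifted value.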
2019-10-22IB/mlx5: Align usage of QP1 create flags with rest of mlx5 definesMichael Guralnik3-11/+6
There is little value in keeping a separate function for one flag; provide it directly like any other mlx5 define. Link: https://lore.kernel.org/r/20191020064400.8344-2-leon@kernel.org Signed-off-by: Michael Guralnik <michaelgur@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-10-22IB/mlx5: Remove dead codeRan Rozenstein2-16/+0
mlx5_ib_dc_atomic_is_supported function is not used anywhere. Remove the dead code. Fixes: a60109dc9a95 ("IB/mlx5: Add support for extended atomic operations") Link: https://lore.kernel.org/r/20191020064454.8551-1-leon@kernel.org Signed-off-by: Ran Rozenstein <ranro@mellanox.com> Reviewed-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>