Age | Commit message (Collapse) | Author | Files | Lines |
|
Instead of enabling dynamic debug globally with CONFIG_DYNAMIC_DEBUG,
CONFIG_DYNAMIC_DEBUG_CORE will only enable core function of dynamic
debug. With the DYNAMIC_DEBUG_MODULE defined for any modules, dynamic
debug will be tied to them.
This is useful for people who only want to enable dynamic debug for
kernel modules without worrying about kernel image size and memory
consumption is increasing too much.
[orson.zhai@unisoc.com: v2]
Link: http://lkml.kernel.org/r/1587408228-10861-1-git-send-email-orson.unisoc@gmail.com
Signed-off-by: Orson Zhai <orson.zhai@unisoc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: http://lkml.kernel.org/r/1586521984-5890-1-git-send-email-orson.unisoc@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now that FMR support is gone, this attribute can be deleted from all
places.
Link: https://lore.kernel.org/r/13-v3-f58e6669d5d3+2cf-fmr_removal_jgg@mellanox.com
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Now that FMR support is gone, this attribute can be deleted from all
places.
Link: https://lore.kernel.org/r/12-v3-f58e6669d5d3+2cf-fmr_removal_jgg@mellanox.com
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
After removing FMR support from all the RDMA ULPs and providers, there
is no need to keep FMR operation for IB devices.
Link: https://lore.kernel.org/r/11-v3-f58e6669d5d3+2cf-fmr_removal_jgg@mellanox.com
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Allow a ULP to ask the core to provide a completion queue based on a
least-used search on a per-device CQ pools. The device CQ pools grow in a
lazy fashion when more CQs are requested.
This feature reduces the amount of interrupts when using many QPs. Using
shared CQs allows for more effcient completion handling. It also reduces
the amount of overhead needed for CQ contexts.
Test setup:
Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers.
Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch.
TX-depth = 32. The patch was applied in the nvme driver on both the target
and initiator. Four controllers are accessed from each core. In the
current test case we have exposed sixteen NVMe namespaces using four
different subsystems (four namespaces per subsystem) from one NVM port.
Each controller allocated X queues (RDMA QPs) and attached to Y CQs.
Before this series we had X == Y, i.e for four controllers we've created
total of 4X QPs and 4X CQs. In the shared case, we've created 4X QPs and
only X CQs which means that we have four controllers that share a
completion queue per core. Until fourteen cores there is no significant
change in performance and the number of interrupts per second is less than
a million in the current case.
==================================================
|Cores|Current KIOPs |Shared KIOPs |improvement|
|-----|---------------|--------------|-----------|
|14 |2332 |2723 |16.7% |
|-----|---------------|--------------|-----------|
|20 |2086 |2712 |30% |
|-----|---------------|--------------|-----------|
|28 |1971 |2669 |35.4% |
|=================================================
|Cores|Current avg lat|Shared avg lat|improvement|
|-----|---------------|--------------|-----------|
|14 |767us |657us |14.3% |
|-----|---------------|--------------|-----------|
|20 |1225us |943us |23% |
|-----|---------------|--------------|-----------|
|28 |1816us |1341us |26.1% |
========================================================
|Cores|Current interrupts|Shared interrupts|improvement|
|-----|------------------|-----------------|-----------|
|14 |1.6M/sec |0.4M/sec |72% |
|-----|------------------|-----------------|-----------|
|20 |2.8M/sec |0.6M/sec |72.4% |
|-----|------------------|-----------------|-----------|
|28 |2.9M/sec |0.8M/sec |63.4% |
====================================================================
|Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement|
|-----|------------------------|-----------------------|-----------|
|14 |67ms |6ms |90.9% |
|-----|------------------------|-----------------------|-----------|
|20 |5ms |6ms |-10% |
|-----|------------------------|-----------------------|-----------|
|28 |8.7ms |6ms |25.9% |
|===================================================================
Performance improvement with sixteen disks (sixteen CQs per core) is
comparable.
Link: https://lore.kernel.org/r/1590568495-101621-3-git-send-email-yaminf@mellanox.com
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
A pre-step for adding shared CQs. Add the infrastructure to prevent shared
CQ users from altering the CQ configurations. For now all cqs are marked
as private (non-shared). The core driver should use the new force
functions to perform resize/destroy/moderation changes that are not
allowed for users of shared CQs.
Link: https://lore.kernel.org/r/1590568495-101621-2-git-send-email-yaminf@mellanox.com
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
These constants are going to be used in the ioctl interface in coming
patches so they are part of the UAPI, place them in the correct header
for clarity.
Link: https://lore.kernel.org/r/20200519072711.257271-5-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
While creating a uobject every create reaches a point where the uobject is
fully initialized. For ioctls that go on to copy_to_user this means they
need to open code the destruction of a fully created uobject - ie the
RDMA_REMOVE_DESTROY sort of flow.
Open coding this creates bugs, eg the CQ does not properly flush the
events list when it does its error unwind.
Provide a uverbs_finalize_uobj_create() function which indicates that the
uobject is fully initialized and that abort should call to destroy_hw to
destroy the uobj->object and related.
Methods can call this function if they go on to have error cases after
setting uobj->object. Once done those error cases can simply do return,
without an error unwind.
Link: https://lore.kernel.org/r/20200519072711.257271-2-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Currently the ipoib UD mtu is restricted to 4K bytes. Remove this
limitation so that the IPOIB module can potentially use an MTU (in UD
mode) that is bounded by the MTU of the underlying device. A field is
added to the ib_port_attr structure to indicate the maximum physical
MTU the underlying device supports.
Link: https://lore.kernel.org/r/20200511160618.173205.23053.stgit@awfm-01.aw.intel.com
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Adds capability to create a qpn to be recognized as an accelerated
UD QP for ipoib.
This is accomplished by reserving 0x81 in byte[0] of the qpn as the
prefix for these qp types and reserving qpns between 0x810000 and
0x81ffff.
The hfi1 capability mask already contained a flag for the VNIC netdev.
This has been renamed and extended to include both VNIC and ipoib.
The rvt code to allocate qps now recognizes this flag and sets 0x81
into byte[0] of the qpn.
The code to allocate qpns is modified to reset the qpn numbering when it
is detected that a value is located in byte[0] for a UD QP and it is a
qpn being requested for net dev use. If it is a regular UD QP then it is
allowable to have bits set in byte[0] of the qpn and provide the
previously normal behavior.
The code to free the qpn now checks for the AIP prefix value of 0x81 and
removes it from the qpn before being freed so that the lower 16 bit
number can be reused.
This patch requires minor changes in the IB core and ipoib to facilitate
the creation of accelerated UP QPs.
Link: https://lore.kernel.org/r/20200511160607.173205.11757.stgit@awfm-01.aw.intel.com
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This patch implements the mechanism to accelerate the transmit side of
a multiple transmit queue RDMA netdev by submitting the packets to
the SDMA engine directly instead of sending through the verbs layer.
This patch also changes the UD/SEND_ONLY op to output the entropy value
in byte 0 of deth[1]. UD/SEND_ONLY_WITH_IMMEDIATE uses the previous
behavior with no entropy value being output.
The code in the ipoib rdma netdev which submits tx requests upon
successful submission will call trace_sdma_output_ibhdr to output
the ibhdr to the trace buffer.
Link: https://lore.kernel.org/r/20200511160548.173205.45616.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The uverbs layer largely duplicate the code in ib_create_srq(), with the
slight difference that it passes in a udata. Move all the code together
into ib_create_srq_user() and provide an inline for kernel users, similar
to other create calls.
Link: https://lore.kernel.org/r/20200506082444.14502-6-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add two hash functions to distribute RoCE v2 UDP source and Flowlabel
symmetrically. These are user visible API and any change in the
implementation needs to be tested for inter-operability between old and
new variant.
Link: https://lore.kernel.org/r/20200504051935.269708-2-leon@kernel.org
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
When a client is added it isn't allowed to fail, but all the client's have
various failure paths within their add routines.
This creates the very fringe condition where the client was added, failed
during add and didn't set the client_data. The core code will then still
call other client_data centric ops like remove(), rename(), get_nl_info(),
and get_net_dev_by_params() with NULL client_data - which is confusing and
unexpected.
If the add() callback fails, then do not call any more client ops for the
device, even remove.
Remove all the now redundant checks for NULL client_data in ops callbacks.
Update all the add() callbacks to return error codes
appropriately. EOPNOTSUPP is used for cases where the ULP does not support
the ib_device - eg because it only works with IB.
Link: https://lore.kernel.org/r/20200421172440.387069-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add a call to rdma_lag_get_ah_roce_slave() when the address handle is
created. Lower driver can use it to select the QP's affinity port.
Link: https://lore.kernel.org/r/20200430192146.12863-15-maorg@mellanox.com
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add support to get the RoCE LAG xmit slave by building skb of the RoCE
packet and call to master_get_xmit_slave. If driver wants to get the
slave assume all slaves are available, then need to set
RDMA_LAG_FLAGS_HASH_ALL_SLAVES in flags.
Link: https://lore.kernel.org/r/20200430192146.12863-14-maorg@mellanox.com
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Following patch adds additional argument to the create AH function, so it
make sense to group ah_attr and flags arguments in struct.
Link: https://lore.kernel.org/r/20200430192146.12863-13-maorg@mellanox.com
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Acked-by: Gal Pressman <galpress@amazon.com>
Acked-by: Weihang Li <liweihang@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
Link: https://lore.kernel.org/r/20200213010425.GA13068@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> # added a few more
|
|
This function accepts a udata but does nothing with it, and is never
passed a !NULL udata. Rename it to ib_create_qp which was the only caller
and remove the udata.
Link: https://lore.kernel.org/r/20200213191911.GA9898@ziepe.ca
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
From https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma
Leon Romanovsky says:
====================
Use ODP MRs for kernel ULPs
The following series extends MR creation routines to allow creation of
user MRs through kernel ULPs as a proxy. The immediate use case is to
allow RDS to work over FS-DAX, which requires ODP (on-demand-paging)
MRs to be created and such MRs were not possible to create prior this
series.
The first part of this patchset extends RDMA to have special verb
ib_reg_user_mr(). The common use case that uses this function is a
userspace application that allocates memory for HCA access but the
responsibility to register the memory at the HCA is on an kernel ULP.
This ULP acts as an agent for the userspace application.
The second part provides advise MR functionality for ULPs. This is
integral part of ODP flows and used to trigger pagefaults in advance
to prepare memory before running working set.
The third part is actual user of those in-kernel APIs.
====================
* tag 'rds-odp-for-5.5':
net/rds: Use prefetch for On-Demand-Paging MR
net/rds: Handle ODP mr registration/unregistration
net/rds: Detect need of On-Demand-Paging memory registration
RDMA/mlx5: Fix handling of IOVA != user_va in ODP paths
IB/mlx5: Mask out unsupported ODP capabilities for kernel QPs
RDMA/mlx5: Don't fake udata for kernel path
IB/mlx5: Add ODP WQE handlers for kernel QPs
IB/core: Add interface to advise_mr for kernel users
IB/core: Introduce ib_reg_user_mr
IB: Allow calls to ib_umem_get from kernel ULPs
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add a new relaxed ordering access flag for memory regions. Using memory
regions with relaxed ordeing set can enhance performance.
This access flag is handled in a best-effort manner, drivers should ignore
if they don't support setting relaxed ordering.
Link: https://lore.kernel.org/r/1578506740-22188-9-git-send-email-yishaih@mellanox.com
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Define a range of access flags that are defined to be optional, both
uverbs and drivers should enable getting them and use if they are
applicable
This will be used, for example, for the relaxed ordering access flag which
unsupporting drivers can ignore.
Link: https://lore.kernel.org/r/1578506740-22188-7-git-send-email-yishaih@mellanox.com
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Verify that MR access flags that are passed from user are all supported
ones, otherwise an error is returned.
Fixes: 4fca03778351 ("IB/uverbs: Move ib_access_flags and ib_read_counters_flags to uapi")
Link: https://lore.kernel.org/r/1578506740-22188-6-git-send-email-yishaih@mellanox.com
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Allow ULPs to call advise_mr, so they can control ODP regions
in the same way as user space applications.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Add ib_reg_user_mr() for kernel ULPs to register user MRs.
The common use case that uses this function is a userspace application
that allocates memory for HCA access but the responsibility to register
the memory at the HCA is on an kernel ULP. This ULP that acts as an agent
for the userspace application.
This function is intended to be used without a user context so vendor
drivers need to be aware of calling reg_user_mr() device operation with
udata equal to NULL.
Among all drivers, i40iw is the only driver which relies on presence
of udata, so check udata existence for that driver.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
This is a struct ib_uwq_object pointer, instead of using container_of()
all over the place just store it with its actual type.
Link: https://lore.kernel.org/r/1578504126-9400-10-git-send-email-yishaih@mellanox.com
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This is a struct ib_usrq_object pointer, instead of using container_of()
all over the place just store it with its actual type.
Link: https://lore.kernel.org/r/1578504126-9400-9-git-send-email-yishaih@mellanox.com
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This is a struct ib_uqp_object pointer, instead of using container_of()
all over the place just store it with its actual type.
Link: https://lore.kernel.org/r/1578504126-9400-8-git-send-email-yishaih@mellanox.com
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This is a struct ib_ucq_object pointer, instead of using container_of()
all over the place just store it with its actual type.
Link: https://lore.kernel.org/r/1578504126-9400-7-git-send-email-yishaih@mellanox.com
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This lock is used to protect the qp->open_list linked list. As a side
effect it seems to also globally serialize the qp event_handler, but it
isn't clear if that is a deliberate design.
Link: https://lore.kernel.org/r/20191212113024.336702-5-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Given that ib_cache structure has only single member now, merge the cache
lock directly in the ib_device.
Link: https://lore.kernel.org/r/20191212113024.336702-4-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Currently when the low level driver notifies Pkey, GID, and port change
events they are notified to the registered handlers in the order they are
registered.
IB core and other ULPs such as IPoIB are interested in GID, LID, Pkey
change events.
Since all GID queries done by ULPs are serviced by IB core, and the IB
core deferes cache updates to a work queue, it is possible for other
clients to see stale cache data when they handle their own events.
For example, the below call tree shows how ipoib will call
rdma_query_gid() concurrently with the update to the cache sitting in the
WQ.
mlx5_ib_handle_event()
ib_dispatch_event()
ib_cache_event()
queue_work() -> slow cache update
[..]
ipoib_event()
queue_work()
[..]
work handler
ipoib_ib_dev_flush_light()
__ipoib_ib_dev_flush()
ipoib_dev_addr_changed_valid()
rdma_query_gid() <- Returns old GID, cache not updated.
Move all the event dispatch to a work queue so that the cache update is
always done before any clients are notified.
Fixes: f35faa4ba956 ("IB/core: Simplify ib_query_gid to always refer to cache")
Link: https://lore.kernel.org/r/20191212113024.336702-3-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Sample trace events:
kworker/u29:0-300 [007] 120.042217: cq_alloc: cq.id=4 nr_cqe=161 comp_vector=2 poll_ctx=WORKQUEUE
<idle>-0 [002] 120.056292: cq_schedule: cq.id=4
kworker/2:1H-482 [002] 120.056402: cq_process: cq.id=4 wake-up took 109 [us] from interrupt
kworker/2:1H-482 [002] 120.056407: cq_poll: cq.id=4 requested 16, returned 1
<idle>-0 [002] 120.067503: cq_schedule: cq.id=4
kworker/2:1H-482 [002] 120.067537: cq_process: cq.id=4 wake-up took 34 [us] from interrupt
kworker/2:1H-482 [002] 120.067541: cq_poll: cq.id=4 requested 16, returned 1
<idle>-0 [002] 120.067657: cq_schedule: cq.id=4
kworker/2:1H-482 [002] 120.067672: cq_process: cq.id=4 wake-up took 15 [us] from interrupt
kworker/2:1H-482 [002] 120.067674: cq_poll: cq.id=4 requested 16, returned 1
...
systemd-1 [002] 122.392653: cq_schedule: cq.id=4
kworker/2:1H-482 [002] 122.392688: cq_process: cq.id=4 wake-up took 35 [us] from interrupt
kworker/2:1H-482 [002] 122.392693: cq_poll: cq.id=4 requested 16, returned 16
kworker/2:1H-482 [002] 122.392836: cq_poll: cq.id=4 requested 16, returned 16
kworker/2:1H-482 [002] 122.392970: cq_poll: cq.id=4 requested 16, returned 16
kworker/2:1H-482 [002] 122.393083: cq_poll: cq.id=4 requested 16, returned 16
kworker/2:1H-482 [002] 122.393195: cq_poll: cq.id=4 requested 16, returned 3
Several features to note in this output:
- The WCE count and context type are reported at allocation time
- The CPU and kworker for each CQ is evident
- The CQ's restracker ID is tagged on each trace event
- CQ poll scheduling latency is measured
- Details about how often single completions occur versus multiple
completions are evident
- The cost of the ULP's completion handler is recorded
Link: https://lore.kernel.org/r/20191218201815.30584.3481.stgit@manet.1015granger.net
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Introduce rdma_user_mmap_entry_insert_range() API to be used once the
required key for the given entry should be in a given range.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20191212100237.330654-2-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
"This is another round of bug fixing and cleanup. This time the focus
is on the driver pattern to use mmu notifiers to monitor a VA range.
This code is lifted out of many drivers and hmm_mirror directly into
the mmu_notifier core and written using the best ideas from all the
driver implementations.
This removes many bugs from the drivers and has a very pleasing
diffstat. More drivers can still be converted, but that is for another
cycle.
- A shared branch with RDMA reworking the RDMA ODP implementation
- New mmu_interval_notifier API. This is focused on the use case of
monitoring a VA and simplifies the process for drivers
- A common seq-count locking scheme built into the
mmu_interval_notifier API usable by drivers that call
get_user_pages() or hmm_range_fault() with the VA range
- Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen
GntDev drivers to the new API. This deletes a lot of wonky driver
code.
- Two improvements for hmm_range_fault(), from testing done by Ralph"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
mm/hmm: make full use of walk_page_range()
xen/gntdev: use mmu_interval_notifier_insert
mm/hmm: remove hmm_mirror and related
drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
drm/amdgpu: Call find_vma under mmap_sem
nouveau: use mmu_interval_notifier instead of hmm_mirror
nouveau: use mmu_notifier directly for invalidate_range_start
drm/radeon: use mmu_interval_notifier_insert
RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
RDMA/odp: Use mmu_interval_notifier_insert()
mm/hmm: define the pre-processor related parts of hmm.h even if disabled
mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
mm/mmu_notifier: add an interval tree notifier
mm/mmu_notifier: define the header pre-processor parts even if disabled
mm/hmm: allow snapshot of the special zero page
|
|
Pull rdma updates from Jason Gunthorpe:
"Again another fairly quiet cycle with few notable core code changes
and the usual variety of driver bug fixes and small improvements.
- Various driver updates and bug fixes for siw, bnxt_re, hns, qedr,
iw_cxgb4, vmw_pvrdma, mlx5
- Improvements in SRPT from working with iWarp
- SRIOV VF support for bnxt_re
- Skeleton kernel-doc files for drivers/infiniband
- User visible counters for events related to ODP
- Common code for tracking of mmap lifetimes so that drivers can link
HW object liftime to a VMA
- ODP bug fixes and rework
- RDMA READ support for efa
- Removal of the very old cxgb3 driver"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (168 commits)
RDMA/hns: Delete unnecessary callback functions for cq
RDMA/hns: Rename the functions used inside creating cq
RDMA/hns: Redefine the member of hns_roce_cq struct
RDMA/hns: Redefine interfaces used in creating cq
RDMA/efa: Expose RDMA read related attributes
RDMA/efa: Support remote read access in MR registration
RDMA/efa: Store network attributes in device attributes
IB/hfi1: remove redundant assignment to variable ret
RDMA/bnxt_re: Fix missing le16_to_cpu
RDMA/bnxt_re: Fix stat push into dma buffer on gen p5 devices
RDMA/bnxt_re: Fix chip number validation Broadcom's Gen P5 series
RDMA/bnxt_re: Fix Kconfig indentation
IB/mlx5: Implement callbacks for getting VFs GUID attributes
IB/ipoib: Add ndo operation for getting VFs GUID attributes
IB/core: Add interfaces to get VF node and port GUIDs
net/core: Add support for getting VF GUIDs
RDMA/qedr: Fix null-pointer dereference when calling rdma_user_mmap_get_offset
RDMA/cm: Use refcount_t type for refcount variable
IB/mlx5: Support extended number of strides for Striding RQ
IB/mlx4: Update HW GID table while adding vlan GID
...
|
|
Danit Goldberg says:
====================
This series extends RTNETLINK to provide IB port and node GUIDs, which
were configured for Infiniband VFs.
The functionality to set VF GUIDs already existed for a long time, and
here we are adding the missing "get" so that netlink will be symmetric and
various cloud orchestration tools will be able to manage such VFs more
naturally.
The iproute2 was extended too to present those GUIDs.
- ip link show <device>
For example:
- ip link set ib4 vf 0 node_guid 22:44:33:00:33:11:00:33
- ip link set ib4 vf 0 port_guid 10:21:33:12:00:11:22:10
- ip link show ib4
ib4: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff,
spoof checking off, NODE_GUID 22:44:33:00:33:11:00:33, PORT_GUID 10:21:33:12:00:11:22:10, link-state disable, trust off, query_rss off
====================
Based on the mlx5-next branch from
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux for
dependencies
* branch 'ib-guids': (35 commits)
IB/mlx5: Implement callbacks for getting VFs GUID attributes
IB/ipoib: Add ndo operation for getting VFs GUID attributes
IB/core: Add interfaces to get VF node and port GUIDs
net/core: Add support for getting VF GUIDs
net/mlx5: Add new chain for netfilter flow table offload
net/mlx5: Refactor creating fast path prio chains
net/mlx5: Accumulate levels for chains prio namespaces
net/mlx5: Define fdb tc levels per prio
net/mlx5: Rename FDB_* tc related defines to FDB_TC_* defines
net/mlx5: Simplify fdb chain and prio eswitch defines
IB/mlx5: Load profile according to RoCE enablement state
IB/mlx5: Rename profile and init methods
net/mlx5: Handle "enable_roce" devlink param
net/mlx5: Document flow_steering_mode devlink param
devlink: Add new "enable_roce" generic device param
net/mlx5: fix spelling mistake "metdata" -> "metadata"
net/mlx5: fix kvfree of uninitialized pointer spec
IB/mlx5: Introduce and use mlx5_core_is_vf()
net/mlx5: E-switch, Enable metadata on own vport
net/mlx5: Refactor ingress acl configuration
...
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Replace the internal interval tree based mmu notifier with the new common
mmu_interval_notifier_insert() API. This removes a lot of code and fixes a
deadlock that can be triggered in ODP:
zap_page_range()
mmu_notifier_invalidate_range_start()
[..]
ib_umem_notifier_invalidate_range_start()
down_read(&per_mm->umem_rwsem)
unmap_single_vma()
[..]
__split_huge_page_pmd()
mmu_notifier_invalidate_range_start()
[..]
ib_umem_notifier_invalidate_range_start()
down_read(&per_mm->umem_rwsem) // DEADLOCK
mmu_notifier_invalidate_range_end()
up_read(&per_mm->umem_rwsem)
mmu_notifier_invalidate_range_end()
up_read(&per_mm->umem_rwsem)
The umem_rwsem is held across the range_start/end as the ODP algorithm for
invalidate_range_end cannot tolerate changes to the interval
tree. However, due to the nested invalidation regions the second
down_read() can deadlock if there are competing writers. The new core code
provides an alternative scheme to solve this problem.
Fixes: ca748c39ea3f ("RDMA/umem: Get rid of per_mm->notifier_count")
Link: https://lore.kernel.org/r/20191112202231.3856-6-jgg@ziepe.ca
Tested-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Provide ability to get node and port GUIDs of VFs to be symmetrical
to already existing set option.
Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
All users of process_mad() converts input pointers from ib_mad_hdr to be
ib_mad, update the function declaration to use ib_mad directly.
Also remove not used input MAD size parameter.
Link: https://lore.kernel.org/r/20191029062745.7932-17-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Tested-By: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The rdma_user_mmap_io interface created a common interface for drivers to
correctly map hw resources and zap them once the ucontext is destroyed
enabling the drivers to safely free the hw resources.
However, this meant the drivers need to delay freeing the resource to the
ucontext destroy phase to ensure they were no longer mapped. The new
mechanism for a common way of handling user/driver address mapping enabled
notifying the driver if all umap_priv mappings were removed, and enabled
freeing the hw resources when they are done with and not delay it until
ucontext destroy.
Since not all drivers use the mechanism, NULL can be sent to the
rdma_user_mmap_io interface to continue working as before. Drivers that
use the mmap_xa interface can pass the entry being mapped to the
rdma_user_mmap_io function to be linked together.
Link: https://lore.kernel.org/r/20191030094417.16866-4-michal.kalderon@marvell.com
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Create some common API's for adding entries to a xa_mmap. Searching for
an entry and freeing one.
The general approach is copied from the EFA driver and improved to be more
general and do more to help the drivers. Integration with the core allows
a reference counted scheme with a free function so that the driver can
know when its mmaps are all gone.
This significant new functionality will be helpful for drivers to have the
correct lifetime model for mmap objects.
Link: https://lore.kernel.org/r/20191030094417.16866-3-michal.kalderon@marvell.com
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
If dev->dma_device->params == NULL then the maximum DMA segment size is 64
KB. See also the dma_get_max_seg_size() implementation. This patch fixes
the following kernel warning:
DMA-API: infiniband rxe0: mapping sg segment longer than device claims to support [len=126976] [max=65536]
WARNING: CPU: 4 PID: 4848 at kernel/dma/debug.c:1220 debug_dma_map_sg+0x3d9/0x450
RIP: 0010:debug_dma_map_sg+0x3d9/0x450
Call Trace:
srp_queuecommand+0x626/0x18d0 [ib_srp]
scsi_queue_rq+0xd02/0x13e0 [scsi_mod]
__blk_mq_try_issue_directly+0x2b3/0x3f0
blk_mq_request_issue_directly+0xac/0xf0
blk_insert_cloned_request+0xdf/0x170
dm_mq_queue_rq+0x43d/0x830 [dm_mod]
__blk_mq_try_issue_directly+0x2b3/0x3f0
blk_mq_request_issue_directly+0xac/0xf0
blk_mq_try_issue_list_directly+0xb8/0x170
blk_mq_sched_insert_requests+0x23c/0x3b0
blk_mq_flush_plug_list+0x529/0x730
blk_flush_plug_list+0x21f/0x260
blk_mq_make_request+0x56b/0xf20
generic_make_request+0x196/0x660
submit_bio+0xae/0x290
blkdev_direct_IO+0x822/0x900
generic_file_direct_write+0x110/0x200
__generic_file_write_iter+0x124/0x2a0
blkdev_write_iter+0x168/0x270
aio_write+0x1c4/0x310
io_submit_one+0x971/0x1390
__x64_sys_io_submit+0x12a/0x390
do_syscall_64+0x6f/0x2e0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Link: https://lore.kernel.org/r/20191025225830.257535-2-bvanassche@acm.org
Cc: <stable@vger.kernel.org>
Fixes: 0b5cb3300ae5 ("RDMA/srp: Increase max_segment_size")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add RDMA nldev netlink interface for dumping MR statistics information.
Output example:
$ ./ibv_rc_pingpong -o -P -s 500000000
local address: LID 0x0001, QPN 0x00008a, PSN 0xf81096, GID ::
$ rdma stat show mr
dev mlx5_0 mrn 2 page_faults 122071 page_invalidations 0
Link: https://lore.kernel.org/r/20191016062308.11886-5-leon@kernel.org
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Introduce ODP diagnostic counters and count the following
per MR within IB/mlx5 driver:
1) Page faults:
Total number of faulted pages.
2) Page invalidations:
Total number of pages invalidated by the OS during all
invalidation events. The translations can be no longer
valid due to either non-present pages or mapping changes.
Link: https://lore.kernel.org/r/20191016062308.11886-2-leon@kernel.org
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The issue is in drivers/infiniband/core/uverbs_std_types_cq.c in the
UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE) function. We check that:
if (attr.comp_vector >= attrs->ufile->device->num_comp_vectors) {
But we don't check if "attr.comp_vector" is negative. It could
potentially lead to an array underflow. My concern would be where
cq->vector is used in the create_cq() function from the cxgb4 driver.
And really "attr.comp_vector" is appears as a u32 to user space so that's
the right type to use.
Fixes: 9ee79fce3642 ("IB/core: Add completion queue (cq) object actions")
Link: https://lore.kernel.org/r/20191011133419.GA22905@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
If there are more scatter entries than the recommended limit provided by
the ib device, UMR registration is used. This will provide optimal
performance when performing large RDMA READs over devices that advertise
the threshold capability.
With ConnectX-5 running NVMeoF RDMA with FIO single QP 128KB writes:
Without use of cap: 70Gb/sec
With use of cap: 84Gb/sec
Link: https://lore.kernel.org/r/20191007135933.12483-3-leon@kernel.org
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Pull RDMA subsystem updates from Jason Gunthorpe:
"This cycle mainly saw lots of bug fixes and clean up code across the
core code and several drivers, few new functional changes were made.
- Many cleanup and bug fixes for hns
- Various small bug fixes and cleanups in hfi1, mlx5, usnic, qed,
bnxt_re, efa
- Share the query_port code between all the iWarp drivers
- General rework and cleanup of the ODP MR umem code to fit better
with the mmu notifier get/put scheme
- Support rdma netlink in non init_net name spaces
- mlx5 support for XRC devx and DC ODP"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (99 commits)
RDMA: Fix double-free in srq creation error flow
RDMA/efa: Fix incorrect error print
IB/mlx5: Free mpi in mp_slave mode
IB/mlx5: Use the original address for the page during free_pages
RDMA/bnxt_re: Fix spelling mistake "missin_resp" -> "missing_resp"
RDMA/hns: Package operations of rq inline buffer into separate functions
RDMA/hns: Optimize cmd init and mode selection for hip08
IB/hfi1: Define variables as unsigned long to fix KASAN warning
IB/{rdmavt, hfi1, qib}: Add a counter for credit waits
IB/hfi1: Add traces for TID RDMA READ
RDMA/siw: Relax from kmap_atomic() use in TX path
IB/iser: Support up to 16MB data transfer in a single command
RDMA/siw: Fix page address mapping in TX path
RDMA: Fix goto target to release the allocated memory
RDMA/usnic: Avoid overly large buffers on stack
RDMA/odp: Add missing cast for 32 bit
RDMA/hns: Use devm_platform_ioremap_resource() to simplify code
Documentation/infiniband: update name of some functions
RDMA/cma: Fix false error message
RDMA/hns: Fix wrong assignment of qp_access_flags
...
|
|
This is a significant simplification, no extra list is kept per FD, and
the interval tree is now shared between all the ucontexts, reducing
overhead if there are multiple ucontexts active.
Link: https://lore.kernel.org/r/20190806231548.25242-7-jgg@ziepe.ca
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Jason Gunthorpe says:
====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================
The branch is based on v5.3-rc5 due to dependencies
* odp_fixes:
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
RDMA/core: Make invalidate_range a device operation
RDMA/odp: Use kvcalloc for the dma_list and page_list
RDMA/odp: Check for overflow when computing the umem_odp end
RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
RDMA/odp: Split creating a umem_odp from ib_umem_get
RDMA/odp: Make the three ways to create a umem_odp clear
RMDA/odp: Consolidate umem_odp initialization
RDMA/odp: Make it clearer when a umem is an implicit ODP umem
RDMA/odp: Iterate over the whole rbtree directly
RDMA/odp: Use the common interval tree library instead of generic
RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|