Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl updates from Dan Williams:
"Compute Express Link (CXL) updates for this cycle.
The highlight is new driver-core infrastructure and CXL subsystem
changes for allowing lockdep to validate device_lock() usage. Thanks
to PeterZ for setting me straight on the current capabilities of the
lockdep API, and Greg acked it as well.
On the CXL ACPI side this update adds support for CXL _OSC so that
platform firmware knows that it is safe to still grant Linux native
control of PCIe hotplug and error handling in the presence of CXL
devices. A circular dependency problem was discovered between suspend
and CXL memory for cases where the suspend image might be stored in
CXL memory where that image also contains the PCI register state to
restore to re-enable the device. Disable suspend for now until an
architecture is defined to clarify that conflict.
Lastly a collection of reworks, fixes, and cleanups to the CXL
subsystem where support for snooping mailbox commands and properly
handling the "mem_enable" flow are the highlights.
Summary:
- Add driver-core infrastructure for lockdep validation of
device_lock(), and fixup a deadlock report that was previously
hidden behind the 'lockdep no validate' policy.
- Add CXL _OSC support for claiming native control of CXL hotplug and
error handling.
- Disable suspend in the presence of CXL memory unless and until a
protocol is identified for restoring PCI device context from memory
hosted on CXL PCI devices.
- Add support for snooping CXL mailbox commands to protect against
inopportune changes, like set-partition with the 'immediate' flag
set.
- Rework how the driver detects legacy CXL 1.1 configurations (CXL
DVSEC / 'mem_enable') before enabling new CXL 2.0 decode
configurations (CXL HDM Capability).
- Miscellaneous cleanups and fixes from -next exposure"
* tag 'cxl-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (47 commits)
cxl/port: Enable HDM Capability after validating DVSEC Ranges
cxl/port: Reuse 'struct cxl_hdm' context for hdm init
cxl/port: Move endpoint HDM Decoder Capability init to port driver
cxl/pci: Drop @info argument to cxl_hdm_decode_init()
cxl/mem: Merge cxl_dvsec_ranges() and cxl_hdm_decode_init()
cxl/mem: Skip range enumeration if mem_enable clear
cxl/mem: Consolidate CXL DVSEC Range enumeration in the core
cxl/pci: Move cxl_await_media_ready() to the core
cxl/mem: Validate port connectivity before dvsec ranges
cxl/mem: Fix cxl_mem_probe() error exit
cxl/pci: Drop wait_for_valid() from cxl_await_media_ready()
cxl/pci: Consolidate wait_for_media() and wait_for_media_ready()
cxl/mem: Drop mem_enabled check from wait_for_media()
nvdimm: Fix firmware activation deadlock scenarios
device-core: Kill the lockdep_mutex
nvdimm: Drop nd_device_lock()
ACPI: NFIT: Drop nfit_device_lock()
nvdimm: Replace lockdep_mutex with local lock classes
cxl: Drop cxl_device_lock()
cxl/acpi: Add root device lockdep validation
...
|
|
The recovery write thread started out as a normal pwrite thread and
when the filesystem was told about potential media error in the
range, filesystem turns the normal pwrite to a dax_recovery_write.
The recovery write consists of clearing media poison, clearing page
HWPoison bit, reenable page-wide read-write permission, flush the
caches and finally write. A competing pread thread will be held
off during the recovery process since data read back might not be
valid, and this is achieved by clearing the badblock records after
the recovery write is complete. Competing recovery write threads
are already serialized by writer lock held by dax_iomap_rw().
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/165247997655.53156.8381418704988035976.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Refactor the pmem_clear_poison() function such that the common
shared code between the typical write path and the recovery write
path is factored out.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/20220422224508.440670-7-jane.chu@oracle.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Introduce dax_recovery_write() operation. The function is used to
recover a dax range that contains poison. Typical use case is when
a user process receives a SIGBUS with si_code BUS_MCEERR_AR
indicating poison(s) in a dax range, in response, the user process
issues a pwrite() to the page-aligned dax range, thus clears the
poison and puts valid data in the range.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/20220422224508.440670-6-jane.chu@oracle.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Up till now, dax_direct_access() is used implicitly for normal
access, but for the purpose of recovery write, dax range with
poison is requested. To make the interface clear, introduce
enum dax_access_mode {
DAX_ACCESS,
DAX_RECOVERY_WRITE,
}
where DAX_ACCESS is used for normal dax access, and
DAX_RECOVERY_WRITE is used for dax recovery write.
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/165247982851.52965.11024212198889762949.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
The set_memory_uc() approach doesn't work well in all cases.
As Dan pointed out when "The VMM unmapped the bad page from
guest physical space and passed the machine check to the guest."
"The guest gets virtual #MC on an access to that page. When
the guest tries to do set_memory_uc() and instructs cpa_flush()
to do clean caches that results in taking another fault / exception
perhaps because the VMM unmapped the page from the guest."
Since the driver has special knowledge to handle NP or UC,
mark the poisoned page with NP and let driver handle it when
it comes down to repair.
Please refer to discussions here for more details.
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
Now since poisoned page is marked as not-present, in order to
avoid writing to a not-present page and trigger kernel Oops,
also fix pmem_do_write().
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/165272615484.103830.2563950688772226611.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
It is not clear why the original implementation of overwrite support
required the dimm driver to be active before overwrite could proceed. In
fact that can lead to cases where the kernel retains an invalid cached
copy of the labels from before the overwrite. Unfortunately the kernel
has not only allowed that case, but enforced it.
Going forward, allow for overwrite to happen while the label area is
offline, and follow-on with updates to 'ndctl sanitize-dimm --overwrite'
to trigger the label area invalidation by default.
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Krzysztof Kensicki <krzysztof.kensicki@intel.com>
Fixes: 7d988097c546 ("acpi/nfit, libnvdimm/security: Add security DSM overwrite support")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Lockdep reports the following deadlock scenarios for CXL root device
power-management, device_prepare(), operations, and device_shutdown()
operations for 'nd_region' devices:
Chain exists of:
&nvdimm_region_key --> &nvdimm_bus->reconfig_mutex --> system_transition_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(system_transition_mutex);
lock(&nvdimm_bus->reconfig_mutex);
lock(system_transition_mutex);
lock(&nvdimm_region_key);
Chain exists of:
&cxl_nvdimm_bridge_key --> acpi_scan_lock --> &cxl_root_key
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&cxl_root_key);
lock(acpi_scan_lock);
lock(&cxl_root_key);
lock(&cxl_nvdimm_bridge_key);
These stem from holding nvdimm_bus_lock() over hibernate_quiet_exec()
which walks the entire system device topology taking device_lock() along
the way. The nvdimm_bus_lock() is protecting against unregistration,
multiple simultaneous ops callers, and preventing activate_show() from
racing activate_store(). For the first 2, the lock is redundant.
Unregistration already flushes all ops users, and sysfs already prevents
multiple threads to be active in an ops handler at the same time. For
the last userspace should already be waiting for its last
activate_store() to complete, and does not need activate_show() to flush
the write side, so this lock usage can be deleted in these attributes.
Fixes: 48001ea50d17 ("PM, libnvdimm: Add runtime firmware activation support")
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/165074883800.4116052.10737040861825806582.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Now that all NVDIMM subsystem locking is validated with custom lock
classes, there is no need for the custom usage of the lockdep_mutex.
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/165055521979.3745911.10751769706032029999.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
In response to an attempt to expand dev->lockdep_mutex for device_lock()
validation [1], Peter points out [2] that the lockdep API already has
the ability to assign a dedicated lock class per subsystem device-type.
Use lockdep_set_class() to override the default device_lock()
'__lockdep_no_validate__' class for each NVDIMM subsystem device-type. This
enables lockdep to detect deadlocks and recursive locking within the
device-driver core and the subsystem.
Link: https://lore.kernel.org/r/164982968798.684294.15817853329823976469.stgit@dwillia2-desk3.amr.corp.intel.com [1]
Link: https://lore.kernel.org/r/Ylf0dewci8myLvoW@hirez.programming.kicks-ass.net [2]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/165055520896.3745911.8021255583475547548.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"The update for this cycle includes the deprecation of block-aperture
mode and a new perf events interface for the papr_scm nvdimm driver.
The perf events approach was acked by PeterZ.
- Add perf support for nvdimm events, initially only for 'papr_scm'
devices.
- Deprecate the 'block aperture' support in libnvdimm, it only ever
existed in the specification, not in shipping product"
* tag 'libnvdimm-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
nvdimm/blk: Fix title level
MAINTAINERS: remove section LIBNVDIMM BLK: MMIO-APERTURE DRIVER
powerpc/papr_scm: Fix build failure when
drivers/nvdimm: Fix build failure when CONFIG_PERF_EVENTS is not set
nvdimm/region: Delete nd_blk_region infrastructure
ACPI: NFIT: Remove block aperture support
nvdimm/namespace: Delete nd_namespace_blk
nvdimm/namespace: Delete blk namespace consideration in shared paths
nvdimm/blk: Delete the block-aperture window driver
nvdimm/region: Fix default alignment for small regions
docs: ABI: sysfs-bus-nvdimm: Document sysfs event format entries for nvdimm pmu
powerpc/papr_scm: Add perf interface support
drivers/nvdimm: Add perf interface to expose nvdimm performance stats
drivers/nvdimm: Add nvdimm pmu structure
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL (Compute Express Link) updates from Dan Williams:
"This development cycle extends the subsystem to discover CXL resources
throughout a CXL/PCIe switch topology and respond to hot add/remove
events anywhere in that topology.
This is more foundational infrastructure in preparation for dynamic
memory region provisioning support. Recall that CXL memory regions, as
the new "Theory of Operation" section of
Documentation/driver-api/cxl/memory-devices.rst describes, bring
storage volume striping semantics to memory.
The hot add/remove behavior is validated with extensions to the
cxl_test unit test environment and this test in the cxl-cli test
suite:
https://github.com/pmem/ndctl/blob/djbw/for-74/cxl/test/cxl-topology.sh
Summary:
- Add a driver for 'struct cxl_memdev' objects responsible for
CXL.mem operation as distinct from 'cxl_pci' mailbox operations.
Its primary responsibility is enumerating an endpoint 'struct
cxl_port' and all the 'struct cxl_port' instances between an
endpoint and the CXL platform root.
- Add a driver for 'struct cxl_port' objects responsible for
enumerating and operating all Host-managed Device Memory (HDM)
decoder resources between the platform-level CXL memory
description, all intervening host bridges / switches, and the HDM
resources in endpoints.
- Update the cxl_pci driver to validate CXL.mem operation precursors
to HDM decoder operation like ready-polling, and legacy CXL 1.1
DVSEC based CXL.mem configuration.
- Add basic lockdep coverage for usage of device_lock() on CXL
subsystem objects similar to what exists for LIBNVDIMM. Include a
compile-time switch for which subsystem to validate at run-time.
- Update cxl_test to emulate a one level switch topology.
- Document a "Theory of Operation" for the subsystem.
- Add 'numa_node' and 'serial' attributes to cxl_memdev sysfs
- Include miscellaneous fixes for spec / QEMU CXL emulation
compatibility and static analysis reports"
* tag 'cxl-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (48 commits)
cxl/core/port: Fix NULL but dereferenced coccicheck error
cxl/port: Hold port reference until decoder release
cxl/port: Fix endpoint refcount leak
cxl/core: Fix cxl_device_lock() class detection
cxl/core/port: Fix unregister_port() lock assertion
cxl/regs: Fix size of CXL Capability Header Register
cxl/core/port: Handle invalid decoders
cxl/core/port: Fix / relax decoder target enumeration
tools/testing/cxl: Add a physical_node link
tools/testing/cxl: Enumerate mock decoders
tools/testing/cxl: Mock one level of switches
tools/testing/cxl: Fix root port to host bridge assignment
tools/testing/cxl: Mock dvsec_ranges()
cxl/core/port: Add endpoint decoders
cxl/core: Move target_list out of base decoder attributes
cxl/mem: Add the cxl_mem driver
cxl/core/port: Add switch port enumeration
cxl/memdev: Add numa_node attribute
cxl/pci: Emit device serial number
cxl/pci: Implement wait for media active
...
|
|
The following build failure occurs when CONFIG_PERF_EVENTS is not set
as generic pmu functions are not visible in that scenario.
|-- s390-randconfig-r044-20220313
| |-- nd_perf.c:(.text):undefined-reference-to-perf_pmu_migrate_context
| |-- nd_perf.c:(.text):undefined-reference-to-perf_pmu_register
| `-- nd_perf.c:(.text):undefined-reference-to-perf_pmu_unregister
Similar build failure in nds32 architecture:
nd_perf.c:(.text+0x21e): undefined reference to `perf_pmu_migrate_context'
nd_perf.c:(.text+0x434): undefined reference to `perf_pmu_register'
nd_perf.c:(.text+0x57c): undefined reference to `perf_pmu_unregister'
Fix this issue by adding check for CONFIG_PERF_EVENTS config option
and disabling the nvdimm perf interface incase this config is not set.
Also remove function declaration of perf_pmu_migrate_context,
perf_pmu_register, perf_pmu_unregister functions from nd.h as these are
common pmu functions which are part of perf_event.h and since we
are disabling nvdimm perf interface incase CONFIG_PERF_EVENTS option
is not set, we not need to declare them in nd.h
Also move the platform_device header file addition part from nd.h to
nd_perf.c and add stub functions for register_nvdimm_pmu and
unregister_nvdimm_pmu functions to handle CONFIG_PERF_EVENTS=n
case.
Fixes: 0fab1ba6ad6b ("drivers/nvdimm: Add perf interface to expose nvdimm performance stats") (Commit id based on libnvdimm-for-next tree)
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Link: https://lore.kernel.org/all/62317124.YBQFU33+s%2FwdvWGj%25lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/20220323164550.109768-1-kjain@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Pull folio updates from Matthew Wilcox:
- Rewrite how munlock works to massively reduce the contention on
i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew
Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
mm/damon: minor cleanup for damon_pa_young
selftests/vm/transhuge-stress: Support file-backed PMD folios
mm/filemap: Support VM_HUGEPAGE for file mappings
mm/readahead: Switch to page_cache_ra_order
mm/readahead: Align file mappings for non-DAX
mm/readahead: Add large folio readahead
mm: Support arbitrary THP sizes
mm: Make large folios depend on THP
mm: Fix READ_ONLY_THP warning
mm/filemap: Allow large folios to be added to the page cache
mm: Turn can_split_huge_page() into can_split_folio()
mm/vmscan: Convert pageout() to take a folio
mm/vmscan: Turn page_check_references() into folio_check_references()
mm/vmscan: Account large folios correctly
mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
mm/vmscan: Free non-shmem folios without splitting them
mm/rmap: Constify the rmap_walk_control argument
mm/rmap: Convert rmap_walk() to take a folio
mm: Turn page_anon_vma() into folio_anon_vma()
mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
...
|
|
Now that the nd_namespace_blk infrastructure is removed, delete all the
region machinery to coordinate provisioning aliased capacity between
PMEM and BLK.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/164688418803.2879318.1302315202397235855.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Now that none of the configuration paths consider BLK namespaces, delete
the BLK namespace data and supporting code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/164688417727.2879318.11691110761800109662.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Given is_namespace_blk() is never true outside of the NVDIMM unit tests
delete the support from namespace device management.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/164688417214.2879318.4698377272678028573.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Block Aperture Window support was an attempt to layer an error model
over PMEM for platforms that did not support machine-check-recovery.
However, it was abandoned before it ever shipped, and only ever existed
in the ACPI specification. Meanwhile Linux has carried a large pile of
dead code for non-shipping infrastructure. For years it has been off to
the side out of the way, but now CXL and recent directions with DAX
support have the potential to collide with this code.
In preparation for adding discontiguous namespace support, a
pre-requisite for the nvdimm subsystem to replace device-mapper for
striping + concatenation use cases, delete BLK aperture support.
On the obscure chance that some hardware vendor shipped support for this
mode, note that the driver will still keep BLK space reserved in the
label area. So an end user in this case would still have the opportunity
to report the regression to get BLK-mode support restored without
risking the data they have on that device.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/164688416668.2879318.16903178375774275120.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
In preparation for removing BLK aperture support the NVDIMM unit tests
discovered that the default alignment can be set higher than the
capacity of the region. Fall back to PAGE_SIZE in that case.
Given this has not been seen in the wild, elide notifying -stable.
Fixes: 2522afb86a8c ("libnvdimm/region: Introduce an 'align' attribute")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/164688416128.2879318.17890707310125575258.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
A common interface is added to get performance stats reporting
support for nvdimm devices. Added interface defines supported
event list, config fields for the event attributes and their
corresponding bit values which are exported via sysfs.
Interface also added support for pmu register/unregister functions,
cpu hotplug feature along with macros for handling events addition
via sysfs. It adds attribute groups for format, cpumask and events
to the pmu structure.
User could use the standard perf tool to access perf events exposed
via nvdimm pmu.
[Declare pmu functions in nd.h file to resolve implicit-function-declaration
warning and make hotplug function static as reported by kernel test robot]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Link: https://lore.kernel.org/all/202202241242.zqzGkguy-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Madhavan Srinivasan <maddy@in.ibm.com>
Link: https://lore.kernel.org/r/20220225143024.47947-3-kjain@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Using local kmaps slightly reduces the chances to stray writes, and
the bvec interface cleans up the code a little bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220303111905.321089-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Using local kmaps slightly reduces the chances to stray writes, and
the bvec interface cleans up the code a little bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220303111905.321089-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.
Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
When CONFIG_PROVE_LOCKING is enabled the 'struct device' definition gets
an additional mutex that is not clobbered by
lockdep_set_novalidate_class() like the typical device_lock(). This
allows for local annotation of subsystem locks with mutex_lock_nested()
per the subsystem's object/lock hierarchy. For CXL, this primarily needs
the ability to lock ports by depth and child objects of ports by their
parent parent-port lock.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
Link: https://lore.kernel.org/r/164365853422.99383.1052399160445197427.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment. NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is no good reason to keep genhd.h separate from the main blkdev.h
header that includes it. So fold the contents of genhd.h into blkdev.h
and remove genhd.h entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220124093913.742411-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull virtio updates from Michael Tsirkin:
"virtio,vdpa,qemu_fw_cfg: features, cleanups, and fixes.
- partial support for < MAX_ORDER - 1 granularity for virtio-mem
- driver_override for vdpa
- sysfs ABI documentation for vdpa
- multiqueue config support for mlx5 vdpa
- and misc fixes, cleanups"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (42 commits)
vdpa/mlx5: Fix tracking of current number of VQs
vdpa/mlx5: Fix is_index_valid() to refer to features
vdpa: Protect vdpa reset with cf_mutex
vdpa: Avoid taking cf_mutex lock on get status
vdpa/vdpa_sim_net: Report max device capabilities
vdpa: Use BIT_ULL for bit operations
vdpa/vdpa_sim: Configure max supported virtqueues
vdpa/mlx5: Report max device capabilities
vdpa: Support reporting max device capabilities
vdpa/mlx5: Restore cur_num_vqs in case of failure in change_num_qps()
vdpa: Add support for returning device configuration information
vdpa/mlx5: Support configuring max data virtqueue
vdpa/mlx5: Fix config_attr_mask assignment
vdpa: Allow to configure max data virtqueues
vdpa: Read device configuration only if FEATURES_OK
vdpa: Sync calls set/get config/status with cf_mutex
vdpa/mlx5: Distribute RX virtqueues in RQT object
vdpa: Provide interface to read driver features
vdpa: clean up get_config_size ret value handling
virtio_ring: mark ring unused on error
...
|
|
This will enable cleanups down the road.
The idea is to disable cbs, then add "flush_queued_cbs" callback
as a parameter, this way drivers can flush any work
queued after callbacks have been disabled.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20211013105226.20225-1-mst@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
These methods indirect the actual DAX read/write path. In the end pmem
uses magic flush and mc safe variants and fuse and dcssblk use plain ones
while device mapper picks redirects to the underlying device.
Add set_dax_nocache() and set_dax_nomc() APIs to control which copy
routines are used to remove indirect call from the read/write fast path
as well as a lot of boilerplate code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com> [virtiofs]
Link: https://lore.kernel.org/r/20211215084508.435401-5-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Remove the DAXDEV_F_SYNC flag and thus the flags argument to alloc_dax and
just let the drivers call set_dax_synchronous directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211215084508.435401-4-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
These two wrappers are never used.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211215084508.435401-2-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Just open code the block size and dax_dev == NULL checks in the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> [erofs]
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-9-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Replace the dax_host_hash with an xarray indexed by the pointer value
of the gendisk, and require explicitly calls from the block drivers that
want to associate their gendisk with a dax_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-5-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
CONFIG_DAX_DRIVER only selects CONFIG_DAX now, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211129102203.2243509-4-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm update from Dan Williams:
"A single cleanup that precedes some deeper PMEM/DAX reworks that did
not settle in time for v5.16:
- Continue the cleanup of the dax api in preparation for a dax-device
block-device divorce"
* tag 'libnvdimm-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
nvdimm/pmem: move dax_attribute_group from dax to pmem
|
|
Pull more block driver updates from Jens Axboe:
- Last series adding error handling support for add_disk() in drivers.
After this one, and once the SCSI side has been merged, we can
finally annotate add_disk() as must_check. (Luis)
- bcache fixes (Coly)
- zram fixes (Ming)
- ataflop locking fix (Tetsuo)
- nbd fixes (Ye, Yu)
- MD merge via Song
- Cleanup (Yang)
- sysfs fix (Guoqing)
- Misc fixes (Geert, Wu, luo)
* tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block: (34 commits)
bcache: Revert "bcache: use bvec_virt"
ataflop: Add missing semicolon to return statement
floppy: address add_disk() error handling on probe
ataflop: address add_disk() error handling on probe
block: update __register_blkdev() probe documentation
ataflop: remove ataflop_probe_lock mutex
mtd/ubi/block: add error handling support for add_disk()
block/sunvdc: add error handling support for add_disk()
z2ram: add error handling support for add_disk()
nvdimm/pmem: use add_disk() error handling
nvdimm/pmem: cleanup the disk if pmem_release_disk() is yet assigned
nvdimm/blk: add error handling support for add_disk()
nvdimm/blk: avoid calling del_gendisk() on early failures
nvdimm/btt: add error handling support for add_disk()
nvdimm/btt: use goto error labels on btt_blk_init()
loop: Remove duplicate assignments
drbd: Fix double free problem in drbd_create_device
nvdimm/btt: do not call del_gendisk() if not needed
bcache: fix use-after-free problem in bcache_device_free()
zram: replace fsync_bdev with sync_blockdev
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl updates from Dan Williams:
"More preparation and plumbing work in the CXL subsystem.
From an end user perspective the highlight here is lighting up the CXL
Persistent Memory related commands (label read / write) with the
generic ioctl() front-end in LIBNVDIMM.
Otherwise, the ability to instantiate new persistent and volatile
memory regions is still on track for v5.17.
Summary:
- Fix support for platforms that do not enumerate every ACPI0016 (CXL
Host Bridge) in the CHBS (ACPI Host Bridge Structure).
- Introduce a common pci_find_dvsec_capability() helper, clean up
open coded implementations in various drivers.
- Add 'cxl_test' for regression testing CXL subsystem ABIs.
'cxl_test' is a module built from tools/testing/cxl/ that mocks up
a CXL topology to augment the nascent support for emulation of CXL
devices in QEMU.
- Convert libnvdimm to use the uuid API.
- Complete the definition of CXL namespace labels in libnvdimm.
- Tunnel libnvdimm label operations from nd_ioctl() back to the CXL
mailbox driver. Enable 'ndctl {read,write}-labels' for CXL.
- Continue to sort and refactor functionality into distinct driver
and core-infrastructure buckets. For example, mailbox handling is
now a generic core capability consumed by the PCI and cxl_test
drivers"
* tag 'cxl-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (34 commits)
ocxl: Use pci core's DVSEC functionality
cxl/pci: Use pci core's DVSEC functionality
PCI: Add pci_find_dvsec_capability to find designated VSEC
cxl/pci: Split cxl_pci_setup_regs()
cxl/pci: Add @base to cxl_register_map
cxl/pci: Make more use of cxl_register_map
cxl/pci: Remove pci request/release regions
cxl/pci: Fix NULL vs ERR_PTR confusion
cxl/pci: Remove dev_dbg for unknown register blocks
cxl/pci: Convert register block identifiers to an enum
cxl/acpi: Do not fail cxl_acpi_probe() based on a missing CHBS
cxl/pci: Disambiguate cxl_pci further from cxl_mem
Documentation/cxl: Add bus internal docs
cxl/core: Split decoder setup into alloc + add
tools/testing/cxl: Introduce a mock memory device + driver
cxl/mbox: Move command definitions to common location
cxl/bus: Populate the target list at decoder create
tools/testing/cxl: Introduce a mocked-up CXL port hierarchy
cxl/pmem: Add support for multiple nvdimm-bridge objects
cxl/pmem: Translate NVDIMM label commands to CXL label commands
...
|
|
Now that device_add_disk() supports returning an error, use
that. We must unwind alloc_dax() on error.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211103230437.1639990-7-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Prior to devm being able to use pmem_release_disk() there are other
failure which can occur for which we must account for and release the
disk for. Address those few cases.
Fixes: 3dd60fb9d95d ("nvdimm/pmem: stop using q_usage_count as external pgmap refcount")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211103230437.1639990-6-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.
Since nvdimm/blk uses devm we just need to move the devm
registration towards the end. And in hindsight, that seems
to also provide a fix given del_gendisk() should not be
called unless the disk was already added via add_disk().
The probably of that issue happening is low though, like
OOM while calling devm_add_action(), so the fix is minor.
We manually unwind in case of add_disk() failure prior
to the devm registration.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211103230437.1639990-5-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If nd_integrity_init() fails we'd get del_gendisk() called,
but that's not correct as we should only call that if we're
done with device_add_disk(). Fix this by providing unwinding
prior to the devm call being registered and moving the devm
registration to the very end.
This should fix calling del_gendisk() if nd_integrity_init()
fails. I only spotted this issue through code inspection. It
does not fix any real world bug.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211103230437.1639990-4-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211103230437.1639990-3-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This will make it easier to share common error paths.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211103230437.1639990-2-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
del_gendisk() should not called if the disk has not been added. Fix this.
Fixes: 41cd8b70c37a ("libnvdimm, btt: add support for blk integrity")
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211103165843.1402142-1-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block updates from Jens Axboe:
- mq-deadline accounting improvements (Bart)
- blk-wbt timer fix (Andrea)
- Untangle the block layer includes (Christoph)
- Rework the poll support to be bio based, which will enable adding
support for polling for bio based drivers (Christoph)
- Block layer core support for multi-actuator drives (Damien)
- blk-crypto improvements (Eric)
- Batched tag allocation support (me)
- Request completion batching support (me)
- Plugging improvements (me)
- Shared tag set improvements (John)
- Concurrent queue quiesce support (Ming)
- Cache bdev in ->private_data for block devices (Pavel)
- bdev dio improvements (Pavel)
- Block device invalidation and block size improvements (Xie)
- Various cleanups, fixes, and improvements (Christoph, Jackie,
Masahira, Tejun, Yu, Pavel, Zheng, me)
* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
blk-mq-debugfs: Show active requests per queue for shared tags
block: improve readability of blk_mq_end_request_batch()
virtio-blk: Use blk_validate_block_size() to validate block size
loop: Use blk_validate_block_size() to validate block size
nbd: Use blk_validate_block_size() to validate block size
block: Add a helper to validate the block size
block: re-flow blk_mq_rq_ctx_init()
block: prefetch request to be initialized
block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
block: add rq_flags to struct blk_mq_alloc_data
block: add async version of bio_set_polled
block: kill DIO_MULTI_BIO
block: kill unused polling bits in __blkdev_direct_IO()
block: avoid extra iter advance with async iocb
block: Add independent access ranges support
blk-mq: don't issue request directly in case that current is to be blocked
sbitmap: silence data race warning
blk-cgroup: synchronize blkg creation against policy deactivation
block: refactor bio_iov_bvec_set()
block: add single bio async direct IO helper
...
|
|
Originally all DAX access when through block_device operations and thus
needed a queue reference. But since commit cccbce671582
("filesystem-dax: convert to dax_direct_access()") all this happens at
the DAX device level which uses its own refcounting. Having the external
refcount thus wasn't needed but has otherwise been harmless for long
time.
But now that "block: drain file system I/O on del_gendisk" waits for
q_usage_count to reach 0 in del_gendisk this whole scheme can't work
anymore (and pmem is the only driver abusing q_usage_count like that).
So switch to the internal reference and remove the unbalanced
blk_freeze_queue_start that is taken care of by del_gendisk.
Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019073641.2323410-2-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.
Polling for the bio itself leads to a few advantages:
- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Split the integrity/metadata handling definitions out into a new header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The recent block layer refactoring broke the way how the pmem driver
abused device_add_disk. Fix this by properly passing the attribute groups
to device_add_disk.
Fixes: 52b85909f85d ("block: fold register_disk into device_add_disk")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210922173431.2454024-2-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
dax_attribute_group is only used by the pmem driver, and can avoid the
completely pointless lookup by the disk name if moved there. This
leaves just a single caller of dax_get_by_host, so move dax_get_by_host
into the same ifdef block as that caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20210922173431.2454024-3-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|