summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2021-09-08riscv: only select GENERIC_IOREMAP if MMU support is enabledChristoph Hellwig1-1/+1
nommu ioremap is an inline stub in asm-generic/io.h. Link: https://lkml.kernel.org/r/20210825072036.GA29161@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: remove redundant compound_head() callingMuchun Song2-6/+7
There is a READ_ONCE() in the macro of compound_head(), which will prevent compiler from optimizing the code when there are more than once calling of it in a function. Remove the redundant calling of compound_head() from page_to_index() and page_add_file_rmap() for better code generation. Link: https://lkml.kernel.org/r/20210811101431.83940-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: use helper zone_is_zone_device() to simplify the codeMiaohe Lin1-3/+1
Patch series "Cleanup and fixups for memory hotplug". This series contains cleanup to use helper function to simplify the code. Also we fix some potential bugs. More details can be found in the respective changelogs. This patch (of 3): Use helper zone_is_zone_device() to simplify the code and remove some explicit CONFIG_ZONE_DEVICE codes. Link: https://lkml.kernel.org/r/20210821094246.10149-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210821094246.10149-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online ↵David Hildenbrand3-4/+89
policy Currently, the "auto-movable" online policy does not allow for hotplugged KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we can have, primarily, because there is no coordiantion across memory devices and we don't want to create zone-imbalances accidentially when unplugging memory. However, within a single memory device it's different. Let's allow for KERNEL memory within a dynamic memory group to allow for more MOVABLE within the same memory group. The only thing we have to take care of is that the managing driver avoids zone imbalances by unplugging MOVABLE memory first, otherwise there can be corner cases where unplug of memory could result in (accidential) zone imbalances. virtio-mem is the only user of dynamic memory groups and recently added support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we don't need a new toggle to enable it for dynamic memory groups. We limit this handling to dynamic memory groups, because: * We want to keep the runtime overhead for collecting stats when onlining a single memory block small. We tend to have only a handful of dynamic memory groups, but we can have quite some static memory groups (e.g., 256 DIMMs). * It doesn't make too much sense for static memory groups, as we try onlining all applicable memory blocks either completely to ZONE_MOVABLE or not. In ordinary operation, we won't have a mixture of zones within a static memory group. When adding memory to a dynamic memory group, we'll first online memory to ZONE_MOVABLE as long as early KERNEL memory allows for it. Then, we'll online the next unit(s) to ZONE_NORMAL, until we can online the next unit(s) to ZONE_MOVABLE. For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will result in a layout like: [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]... ^ movable memory due to early kernel memory ^ allows for more movable memory ... ^-----^ ... here ^ allows for more movable memory ... ^-----^ ... here While the created layout is sub-optimal when it comes to contiguous zones, it gives us the maximum flexibility when dynamically growing/shrinking a device; we can grow small VMs really big in small steps, and still shrink reliably to e.g., 1/4 of the maximum VM size in this example, removing full memory blocks along with meta data more reliably. Mark dynamic memory groups in the xarray such that we can efficiently iterate over them when collecting stats. In usual setups, we have one virtio-mem device per NUMA node, and usually only a small number of NUMA nodes. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: memory group aware "auto-movable" online policyDavid Hildenbrand3-12/+57
Use memory groups to improve our "auto-movable" onlining policy: 1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE only if all other memory blocks in the group are either MOVABLE or could be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture. 2. For dynamic memory groups (e.g., a virtio-mem device), online a memory block MOVABLE only if all other memory blocks inside the current unit are either MOVABLE or could be onlined MOVABLE. For a virtio-mem device with a device block size with 512 MiB, all 128 MiB memory blocks wihin a 512 MiB unit will either be MOVABLE or not, not a mixture. We have to pass the memory group to zone_for_pfn_range() to take the memory group into account. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08virtio-mem: use a single dynamic memory group for a single virtio-mem deviceDavid Hildenbrand1-3/+19
Let's use a single dynamic memory group. Link: https://lkml.kernel.org/r/20210806124715.17090-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08dax/kmem: use a single static memory group for a single probed unitDavid Hildenbrand1-8/+32
Although dax/kmem users often disable auto-onlining and instead online memory manually (usually to ZONE_MOVABLE), there is still value in having auto-onlining be aware of the relationship of memory blocks. Let's treat one probed unit as a single static memory device, similar to a single ACPI memory device. Link: https://lkml.kernel.org/r/20210806124715.17090-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08ACPI: memhotplug: use a single static memory group for a single memory deviceDavid Hildenbrand1-5/+30
Let's group all memory we add for a single memory device - we want a single node for that (which also seems to be the sane thing to do). We won't care for now about memory that was already added to the system (e.g., via e820) -- usually *all* memory of a memory device was already added and we'll fail acpi_memory_enable_device(). Link: https://lkml.kernel.org/r/20210806124715.17090-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: track present pages in memory groupsDavid Hildenbrand4-14/+34
Let's track all present pages in each memory group. Especially, track memory present in ZONE_MOVABLE and memory present in one of the kernel zones (which really only is ZONE_NORMAL right now as memory groups only apply to hotplugged memory) separately within a memory group, to prepare for making smart auto-online decision for individual memory blocks within a memory group based on group statistics. Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08drivers/base/memory: introduce "memory groups" to logically group memory blocksDavid Hildenbrand4-6/+215
In our "auto-movable" memory onlining policy, we want to make decisions across memory blocks of a single memory device. Examples of memory devices include ACPI memory devices (in the simplest case a single DIMM) and virtio-mem. For now, we don't have a connection between a single memory block device and the real memory device. Each memory device consists of 1..X memory block devices. Let's logically group memory blocks belonging to the same memory device in "memory groups". Memory groups can span multiple physical ranges and a memory group itself does not contain any information regarding physical ranges, only properties (e.g., "max_pages") necessary for improved memory onlining. Introduce two memory group types: 1) Static memory group: E.g., a single ACPI memory device, consisting of 1..X memory resources. A memory group consists of 1..Y memory blocks. The whole group is added/removed in one go. If any part cannot get offlined, the whole group cannot be removed. 2) Dynamic memory group: E.g., a single virtio-mem device. Memory is dynamically added/removed in a fixed granularity, called a "unit", consisting of 1..X memory blocks. A unit is added/removed in one go. If any part of a unit cannot get offlined, the whole unit cannot be removed. In case of 1) we usually want either all memory managed by ZONE_MOVABLE or none. In case of 2) we usually want to have as many units as possible managed by ZONE_MOVABLE. We want a single unit to be of the same type. For now, memory groups are an internal concept that is not exposed to user space; we might want to change that in the future, though. add_memory() users can specify a mgid instead of a nid when passing the MHP_NID_IS_MGID flag. Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: introduce "auto-movable" online policyDavid Hildenbrand1-0/+191
When onlining without specifying a zone (using "online" instead of "online_kernel" or "online_movable"), we currently select a zone such that existing zones are kept contiguous. This online policy made sense in the past, where contiguous zones where required. We'd like to implement smarter policies, however: * User space has little insight. As one example, it has no idea which memory blocks logically belong together (e.g., to a DIMM or to a virtio-mem device). * Drivers that add memory in separate memory blocks, especially virtio-mem, want memory to get onlined right from the kernel when adding. So we really want to have onlining to differing zones managed in the kernel, configured by user space. We see more and more cases where we might eventually hotplug a lot of memory in the future (e.g., eventually grow a 2 GiB VM to 64 GiB), however: * Resizing happens dynamically, in smaller steps in both directions (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...) * We still want as much flexibility as possible, especially, hotunplugging as much memory as possible later. We can really only use "online_movable" if we know that the amount of memory we are going to hotplug upfront, and we know that it won't result in a zone imbalance. So in our example, a 2 GiB VM that could grow to 64 GiB could currently not use "online_movable", and instead, "online_kernel" would have to be used, resulting in worse (no) memory hotunplug reliability. Let's add a new "auto-movable" online policy that considers the current zone ratios (global, per-node) to determine, whether we a memory block can be onlined to ZONE_MOVABLE: MOVABLE : KERNEL However, internally we'll only consider the following ratio for now: MOVABLE : KERNEL_EARLY For now, we don't allow for hotplugged KERNEL memory to allow for more MOVABLE memory, because there is no coordination across memory devices. In follow-up patches, we will allow for more KERNEL memory within a memory device to allow for more MOVABLE memory within the same memory device -- which only makes sense for special memory device types. We base our calculation on "present pages", see the code comments for details. Hotplugged memory will get online to ZONE_MOVABLE if the configured ratio allows for it. Depending on the setup, this can result in fragmented zones, which can make compaction slower and dynamic allocation of gigantic pages when not using CMA less reliable (... which is already pretty unreliable). The old policy will be the default and called "contig-zones". In follow-up patches, our new policy will use additional information, such as memory groups, to make even smarter decisions across memory blocks. Configuration: * memory_hotplug.online_policy is used to switch between both polices and defaults to "contig-zones". * memory_hotplug.auto_movable_ratio defines the maximum ratio is in percent and defaults to "301" -- allowing e.g., most 8 GiB machines to grow to 32 GiB and have all hotplugged memory in ZONE_MOVABLE. The additional percent accounts for a handful of lost present pages (e.g., firmware allocations). User space is expected to adjust this ratio when enabling the new "auto-movable" policy, though. * memory_hotplug.auto_movable_numa_aware considers numa node stats in addition to global stats, and defaults to "true". Note: just like the old policy, the new policy won't take things like unmovable huge pages or memory ballooning that doesn't support balloon compaction into account. User space has to configure onlining accordingly. Link: https://lkml.kernel.org/r/20210806124715.17090-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: track present early pages per zoneDavid Hildenbrand5-11/+29
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3. I. Goal The goal of this series is improving in-kernel auto-online support. It tackles the fundamental problems that: 1) We can create zone imbalances when onlining all memory blindly to ZONE_MOVABLE, in the worst case crashing the system. We have to know upfront how much memory we are going to hotplug such that we can safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE via "online_movable". This is far from practical and only applicable in limited setups -- like inside VMs under the RHV/oVirt hypervisor which will never hotplug more than 3 times the boot memory (and the limitation is only in place due to the Linux limitation). 2) We see more setups that implement dynamic VM resizing, hot(un)plugging memory to resize VM memory. In these setups, we might hotplug a lot of memory, but it might happen in various small steps in both directions (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the primary driver of this upstream right now, performing such dynamic resizing NUMA-aware via multiple virtio-mem devices. Onlining all hotplugged memory to ZONE_NORMAL means we basically have no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can easily run into zone imbalances when growing a VM. We want a mixture, and we want as much memory as reasonable/configured in ZONE_MOVABLE. Details regarding zone imbalances can be found at [1]. 3) Memory devices consist of 1..X memory block devices, however, the kernel doesn't really track the relationship. Consequently, also user space has no idea. We want to make per-device decisions. As one example, for memory hotunplug it doesn't make sense to use a mixture of zones within a single DIMM: we want all MOVABLE if possible, otherwise all !MOVABLE, because any !MOVABLE part will easily block the whole DIMM from getting hotunplugged. As another example, virtio-mem operates on individual units that span 1..X memory blocks. Similar to a DIMM, we want a unit to either be all MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however, all units of a virtio-mem device logically belong together and are managed (added/removed) by a single driver. We want as much memory of a virtio-mem device to be MOVABLE as possible. 4) We want memory onlining to be done right from the kernel while adding memory, not triggered by user space via udev rules; for example, this is reqired for fast memory hotplug for drivers that add individual memory blocks, like virito-mem. We want a way to configure a policy in the kernel and avoid implementing advanced policies in user space. The auto-onlining support we have in the kernel is not sufficient. All we have is a) online everything MOVABLE (online_movable) b) online everything !MOVABLE (online_kernel) c) keep zones contiguous (online). This series allows configuring c) to mean instead "online movable if possible according to the coniguration, driven by a maximum MOVABLE:KERNEL ratio" -- a new onlining policy. II. Approach This series does 3 things: 1) Introduces the "auto-movable" online policy that initially operates on individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio to make a decision whether a memory block will be onlined to ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL memory does not allow for more MOVABLE memory (details in the patches). CMA memory is treated like MOVABLE memory. 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory groups and uses group information to make decisions in the "auto-movable" online policy across memory blocks of a single memory device (modeled as memory group). More details can be found in patch #3 or in the DIMM example below. 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by allowing ZONE_NORMAL memory within a dynamic memory group to allow for more ZONE_MOVABLE memory within the same memory group. The target use case is dynamic VM resizing using virtio-mem. See the virtio-mem example below. I remember that the basic idea of using a ratio to implement a policy in the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I lost the pointer to that discussion). For me, the main use case is using it along with virtio-mem (and DIMMs / ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the amount of memory we can hotunplug reliably again if we might eventually hotplug a lot of memory to a VM. III. Target Usage The target usage will be: 1) Linux boots with "mhp_default_online_type=offline" 2) User space (e.g., systemd unit) configures memory onlining (according to a config file and system properties), for example: * Setting memory_hotplug.online_policy=auto-movable * Setting memory_hotplug.auto_movable_ratio=301 * Setting memory_hotplug.auto_movable_numa_aware=true 3) User space enabled auto onlining via "echo online > /sys/devices/system/memory/auto_online_blocks" 4) User space triggers manual onlining of all already-offline memory blocks (go over offline memory blocks and set them to "online") IV. Example For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of 301% results in the following layout: Memory block 0-15: DMA32 (early) Memory block 32-47: Normal (early) Memory block 48-79: Movable (DIMM 0) Memory block 80-111: Movable (DIMM 1) Memory block 112-143: Movable (DIMM 2) Memory block 144-275: Normal (DIMM 3) Memory block 176-207: Normal (DIMM 4) ... all Normal (-> hotplugged Normal memory does not allow for more Movable memory) For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM will result in the following layout: Memory block 0-15: DMA32 (early) Memory block 32-47: Normal (early) Memory block 48-143: Movable (virtio-mem, first 12 GiB) Memory block 144: Normal (virtio-mem, next 128 MiB) Memory block 145-147: Movable (virtio-mem, next 384 MiB) Memory block 148: Normal (virtio-mem, next 128 MiB) Memory block 149-151: Movable (virtio-mem, next 384 MiB) ... Normal/Movable mixture as above (-> hotplugged Normal memory allows for more Movable memory within the same device) Which gives us maximum flexibility when dynamically growing/shrinking a VM in smaller steps. V. Doc Update I'll update the memory-hotplug.rst documentation, once the overhaul [1] is usptream. Until then, details can be found in patch #2. VI. Future Work 1) Use memory groups for ppc64 dlpar 2) Being able to specify a portion of (early) kernel memory that will be excluded from the ratio. Like "128 MiB globally/per node" are excluded. This might be helpful when starting VMs with extremely small memory footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting the first hotplugged units getting onlined to ZONE_MOVABLE. One alternative would be a trigger to not consider ZONE_DMA memory in the ratio. We'll have to see if this is really rrequired. 3) Indicate to user space that MOVABLE might be a bad idea -- especially relevant when memory ballooning without support for balloon compaction is active. This patch (of 9): For implementing a new memory onlining policy, which determines when to online memory blocks to ZONE_MOVABLE semi-automatically, we need the number of present early (boot) pages -- present pages excluding hotplugged pages. Let's track these pages per zone. Pass a page instead of the zone to adjust_present_page_count(), similar as adjust_managed_page_count() and derive the zone from the page. It's worth noting that a memory block to be offlined/onlined is either completely "early" or "not early". add_memory() and friends can only add complete memory blocks and we only online/offline complete (individual) memory blocks. Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: Hui Zhu <teawater@gmail.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08ACPI: memhotplug: memory resources cannot be enabled yetDavid Hildenbrand1-4/+0
We allocate + initialize everything from scratch. In case enabling the device fails, we free all memory resourcs. Link: https://lkml.kernel.org/r/20210712124052.26491-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joe Perches <joe@perches.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Len Brown <lenb@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Rich Felker <dalias@libc.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: remove nid parameter from remove_memory() and friendsDavid Hildenbrand6-31/+30
There is only a single user remaining. We can simply lookup the nid only used for node offlining purposes when walking our memory blocks. We don't expect to remove multi-nid ranges; and if we'd ever do, we most probably don't care about removing multi-nid ranges that actually result in empty nodes. If ever required, we can detect the "multi-nid" scenario and simply try offlining all online nodes. Link: https://lkml.kernel.org/r/20210712124052.26491-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joe Perches <joe@perches.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta@ionos.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Rich Felker <dalias@libc.org> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: remove nid parameter from arch_remove_memory()David Hildenbrand10-22/+11
The parameter is unused, let's remove it. Link: https://lkml.kernel.org/r/20210712124052.26491-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390] Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Baoquan He <bhe@redhat.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Joe Perches <joe@perches.com> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: Jia He <justin.he@arm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()David Hildenbrand2-4/+4
Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory" These are all cleanups and one fix previously sent as part of [1]: [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory groups. These patches make sense even without the other series, therefore I pulled them out to make the other series easier to digest. [1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com This patch (of 4): Checkpatch complained on a follow-up patch that we are using "unsigned" here, which defaults to "unsigned int" and checkpatch is correct. As we will search for a fitting zone using the wrong pfn, we might end up onlining memory to one of the special kernel zones, such as ZONE_DMA, which can end badly as the onlined memory does not satisfy properties of these zones. Use "unsigned long" instead, just as we do in other places when handling PFNs. This can bite us once we have physical addresses in the range of multiple TB. Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: virtualization@lists.linux-foundation.org Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joe Perches <joe@perches.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Rich Felker <dalias@libc.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: memory_hotplug: cleanup after removal of pfn_valid_within()Mike Rapoport1-6/+3
When test_pages_in_a_zone() used pfn_valid_within() is has some logic surrounding pfn_valid_within() checks. Since pfn_valid_within() is gone, this logic can be removed. Link: https://lkml.kernel.org/r/20210713080035.7464-3-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONEMike Rapoport8-75/+11
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08memory-hotplug.rst: complete admin-guide overhaulDavid Hildenbrand1-306/+455
The memory hot(un)plug documentation is outdated and incomplete. Most of the content dates back to 2007, so it's time for a major overhaul. Let's rewrite, reorganize and update most parts of the documentation. In addition to memory hot(un)plug, also add some details regarding ZONE_MOVABLE, with memory hotunplug being one of its main consumers. Drop the file history, that information can more reliably be had from the git log. The style of the document is also properly fixed that e.g., "restview" renders it cleanly now. In the future, we might add some more details about virt users like virtio-mem, the XEN balloon, the Hyper-V balloon and ppc64 dlpar. Link: https://lkml.kernel.org/r/20210707073205.3835-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08memory-hotplug.rst: remove locking details from admin-guideDavid Hildenbrand1-39/+0
Patch series "memory-hotplug.rst: complete admin-guide overhaul", v3. This patch (of 2): We have the same content at Documentation/core-api/memory-hotplug.rst and it doesn't fit into the admin-guide. The documentation was accidentially duplicated when merging. Link: https://lkml.kernel.org/r/20210707073205.3835-1-david@redhat.com Link: https://lkml.kernel.org/r/20210707073205.3835-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-29Linux 5.14v5.14Linus Torvalds1-1/+1
2021-08-29Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fix from Stephen Boyd: "One hotfix for a NULL pointer deref in the Renesas usb clk driver" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: renesas: rcar-usb2-clock-sel: Fix kernel NULL pointer dereference
2021-08-29Merge tag 'sched_urgent_for_v5.14' of ↵Linus Torvalds2-27/+121
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Have get_push_task() check whether current has migration disabled and thus avoid useless invocations of the migration thread - Rework initialization flow so that all rq->core's are initialized, even of CPUs which have not been onlined yet, so that iterating over them all works as expected * tag 'sched_urgent_for_v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix get_push_task() vs migrate_disable() sched: Fix Core-wide rq->lock for uninitialized CPUs
2021-08-29Merge tag 'irq_urgent_for_v5.14' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Borislav Petkov: - Have msix_mask_all() check a global control which says whether MSI-X masking should be done and thus make it usable on Xen-PV too * tag 'irq_urgent_for_v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: PCI/MSI: Skip masking MSI-X on Xen PV
2021-08-29Merge tag 'perf_urgent_for_v5.14' of ↵Linus Torvalds4-2/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Prevent the amd/power module from being removed while in use - Mark AMD IBS as not supporting content exclusion - Add a workaround for AMD erratum #1197 where IBS registers might not be restored properly after exiting CC6 state - Fix a potential truncation of a 32-bit variable due to shifting - Read the correct bits describing the number of configurable address ranges on Intel PT * tag 'perf_urgent_for_v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/amd/power: Assign pmu.module perf/x86/amd/ibs: Extend PERF_PMU_CAP_NO_EXCLUDE to IBS Op perf/x86/amd/ibs: Work around erratum #1197 perf/x86/intel/uncore: Fix integer overflow on 23 bit left shift of a u32 perf/x86/intel/pt: Fix mask of num_address_ranges
2021-08-29Merge tag 'x86_urgent_for_v5.14' of ↵Linus Torvalds3-9/+30
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Fix build error on RHEL where -Werror=maybe-uninitialized is set. - Restore the firmware's IDT when calling EFI boot services and before ExitBootServices() has been called. This fixes a boot failure on what appears to be a tablet with 32-bit UEFI running a 64-bit kernel. * tag 'x86_urgent_for_v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/resctrl: Fix a maybe-uninitialized build warning treated as error x86/efi: Restore Firmware IDT before calling ExitBootServices()
2021-08-29Revert "parisc: Add assembly implementations for memset, strlen, strcpy, ↵Helge Deller5-157/+74
strncpy and strcat" This reverts commit 83af58f8068ea3f7b3c537c37a30887bfa585069. It turns out that at least the assembly implementation for strncpy() was buggy. Revert the whole commit and return back to the default coding. Signed-off-by: Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> # v5.4+ Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-28clk: renesas: rcar-usb2-clock-sel: Fix kernel NULL pointer dereferenceAdam Ford1-1/+1
The probe was manually passing NULL instead of dev to devm_clk_hw_register. This caused a Unable to handle kernel NULL pointer dereference error. Fix this by passing 'dev'. Signed-off-by: Adam Ford <aford173@gmail.com> Fixes: a20a40a8bbc2 ("clk: renesas: rcar-usb2-clock-sel: Fix error handling in .probe()") Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-08-28Merge tag 'scsi-fixes' of ↵Linus Torvalds1-3/+6
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fix from James Bottomley: "A single fix for a race introduced by a fix that went into 5.14-rc5" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: core: Fix hang of freezing queue between blocking and running device
2021-08-28Merge tag 'usb-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usbLinus Torvalds6-76/+89
Pull USB fixes from Greg KH: "Here are a few tiny USB fixes for reported issues with some USB drivers. These fixes include: - gadget driver fixes for regressions - tcpm driver fix - dwc3 driver fixes - xhci renesas firmware loading fix, again. - usb serial option driver device id addition - usb serial ch341 revert for regression All all of these have been in linux-next with no reported problems" * tag 'usb-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: usb: gadget: u_audio: fix race condition on endpoint stop usb: gadget: f_uac2: fixup feedback endpoint stop usb: typec: tcpm: Raise vdm_sm_running flag only when VDM SM is running usb: renesas-xhci: Prefer firmware loading on unknown ROM state usb: dwc3: gadget: Stop EP0 transfers during pullup disable usb: dwc3: gadget: Fix dwc3_calc_trbs_left() Revert "USB: serial: ch341: fix character loss at high transfer rates" USB: serial: option: add new VID/PID to support Fibocom FG150
2021-08-28Merge tag 'powerpc-5.14-7' of ↵Linus Torvalds2-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix scv implicit soft-mask table for relocated (eg. kdump) kernels - Re-enable ARCH_ENABLE_SPLIT_PMD_PTLOCK, which was disabled due to a typo Thanks to Lukas Bulwahn, Nicholas Piggin, and Daniel Axtens. * tag 'powerpc-5.14-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s: Fix scv implicit soft-mask table for relocated kernels powerpc: Re-enable ARCH_ENABLE_SPLIT_PMD_PTLOCK
2021-08-27Merge tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-blockLinus Torvalds4-45/+21
Pull block fixes from Jens Axboe: - Revert the mq-deadline priority handling, it's causing serious performance regressions. While experimental patches exists to fix this up, it's too late to do so now. Revert it and re-do it properly for 5.15 instead. - Fix a NULL vs IS_ERR() regression in this release (Dan) - Fix a mq-deadline accounting regression in this release (Bart) - Mark cryptoloop as deprecated. It's broken and dm-crypt fully supports it, and it's actively intefering with loop. Plan on removal for 5.16 (Christoph) * tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-block: cryptoloop: add a deprecation warning pd: fix a NULL vs IS_ERR() check Revert "block/mq-deadline: Prioritize high-priority requests" mq-deadline: Fix request accounting
2021-08-27Merge tag 'soc-fixes-5.14-4' of ↵Linus Torvalds2-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC fixes from Arnd Bergmann: "Just two trivial fixes from the reset driver tree, nothing else came up since the last soc fixes" * tag 'soc-fixes-5.14-4' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: reset: reset-zynqmp: Fixed the argument data type reset: RESET_MCHP_SPARX5 should depend on ARCH_SPARX5
2021-08-27Merge tag 'acpi-5.14-rc8' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fix from Rafael Wysocki: "Fix a regression introduced during this cycle that has been partially addressed by an earlier commit (Andy Shevchenko)" * tag 'acpi-5.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: media: ipu3-cio2: Drop reference on error path in cio2_bridge_connect_sensor()
2021-08-27Merge tag 'pm-5.14-rc8' of ↵Linus Torvalds2-6/+12
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix two issues introduced during this cycle, one of which is a regression and the other one affects new code. Specifics: - Prevent the operating performance points (OPP) code from crashing when some entries in the table of required OPPs are set to error pointer values (Marijn Suijten) - Prevent the generic power domains (genpd) framework from incorrectly overriding the performance state of a device set by its driver while it is runtime-suspended or when runtime PM of it is disabled (Dmitry Osipenko)" * tag 'pm-5.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: domains: Improve runtime PM performance state handling opp: core: Check for pending links before reading required_opp pointers
2021-08-27virtio-mem: fix sleeping in RCU read side section in virtio_mem_online_page_cb()David Hildenbrand1-1/+8
virtio_mem_set_fake_offline() might sleep now, and we call it under rcu_read_lock(). To fix it, simply move the rcu_read_unlock() further up, as we're done with the device. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 6cc26d77613a: "virtio-mem: use page_offline_(start|end) when setting PageOffline() Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: virtualization@lists.linux-foundation.org Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-27Merge branch 'pm-opp'Rafael J. Wysocki1-4/+4
* pm-opp: opp: core: Check for pending links before reading required_opp pointers
2021-08-27Merge tag 'riscv-for-linus-5.14-rc8' of ↵Linus Torvalds3-1/+9
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - device tree updates for the Microsemi Polarfire development kit that fix some mismatches between the u-boot and Linux enternet entries - ensure that the F register state is correctly reflected in core dumps * tag 'riscv-for-linus-5.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: dts: microchip: Add ethernet0 to the aliases node riscv: dts: microchip: Use 'local-mac-address' for emac1 riscv: Ensure the value of FP registers in the core dump file is up to date
2021-08-27Merge tag 'mmc-v5.14-rc7' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc Pull MMC host fix from Ulf Hansson: - sdhci-iproc: Fix clock error for ACPI rpi's * tag 'mmc-v5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: Revert "mmc: sdhci-iproc: Set SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN on BCM2711"
2021-08-27cryptoloop: add a deprecation warningChristoph Hellwig2-2/+4
Support for cryptoloop has been officially marked broken and deprecated in favor of dm-crypt (which supports the same broken algorithms if needed) in Linux 2.6.4 (released in March 2004), and support for it has been entirely removed from losetup in util-linux 2.23 (released in April 2013). Add a warning and a deprecation schedule. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210827163250.255325-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-27Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds4-8/+27
Pull ARM fix from Russell King: "Resolve a Keystone 2 kernel mapping regression" * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 9104/2: Fix Keystone 2 kernel mapping regression
2021-08-27Revert "mmc: sdhci-iproc: Set SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN on BCM2711"Ulf Hansson1-2/+1
This reverts commit 419dd626e357e89fc9c4e3863592c8b38cfe1571. It turned out that the change from the reverted commit breaks the ACPI based rpi's because it causes the 100Mhz max clock to be overridden to the return from sdhci_iproc_get_max_clock(), which is 0 because there isn't a OF/DT based clock device. Reported-by: Jeremy Linton <jeremy.linton@arm.com> Fixes: 419dd626e357 ("mmc: sdhci-iproc: Set SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN on BCM2711") Acked-by: Stefan Wahren <stefan.wahren@i2se.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2021-08-27usb: gadget: u_audio: fix race condition on endpoint stopJerome Brunet1-4/+4
If the endpoint completion callback is call right after the ep_enabled flag is cleared and before usb_ep_dequeue() is call, we could do a double free on the request and the associated buffer. Fix this by clearing ep_enabled after all the endpoint requests have been dequeued. Fixes: 7de8681be2cd ("usb: gadget: u_audio: Free requests only after callback") Cc: stable <stable@vger.kernel.org> Reported-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Link: https://lore.kernel.org/r/20210827092927.366482-1-jbrunet@baylibre.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-27usb: gadget: f_uac2: fixup feedback endpoint stopJerome Brunet1-4/+11
When the uac2 function is stopped, there seems to be an issue reported on some platforms (Intel Merrifield at least) BUG: kernel NULL pointer dereference, address: 0000000000000008 ... RIP: 0010:dwc3_gadget_del_and_unmap_request+0x19/0xe0 ... Call Trace: dwc3_remove_requests.constprop.0+0x12f/0x170 __dwc3_gadget_ep_disable+0x7a/0x160 dwc3_gadget_ep_disable+0x3d/0xd0 usb_ep_disable+0x1c/0x70 u_audio_stop_capture+0x79/0x120 [u_audio] afunc_set_alt+0x73/0x80 [usb_f_uac2] composite_setup+0x224/0x1b90 [libcomposite] The issue happens only when the gadget is using the sync type "async", not "adaptive". This indicates that problem is coming from the feedback endpoint, which is only used with async synchronization mode. The problem is that request is freed regardless of usb_ep_dequeue(), which ends up badly if the request is not actually dequeued yet. Update the feedback endpoint free function to release the endpoint the same way it is done for the data endpoint, which takes care of the problem. Fixes: 24f779dac8f3 ("usb: gadget: f_uac2/u_audio: add feedback endpoint support") Reported-by: Ferry Toth <ftoth@exalondelft.nl> Tested-by: Ferry Toth <ftoth@exalondelft.nl> Acked-by: Felipe Balbi <balbi@kernel.org> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Link: https://lore.kernel.org/r/20210827075853.266912-1-jbrunet@baylibre.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-27pd: fix a NULL vs IS_ERR() checkDan Carpenter1-1/+1
blk_mq_alloc_disk() returns error pointers, it doesn't return NULL so correct the check. Fixes: 262d431f9000 ("pd: use blk_mq_alloc_disk and blk_cleanup_disk") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20210827100023.GB9449@kili Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-26Merge tag 'drm-fixes-2021-08-27' of git://anongit.freedesktop.org/drm/drmLinus Torvalds8-46/+63
Pull drm fixes from Dave Airlie: "Last set of fixes for 5.14, nothing major a couple of i915, couple of imx and a few amdgpu. All pretty small. i915: - Fix syncmap memory leak - Drop redundant display port debug print amdgpu: - Fix for pinning display buffers multiple times - Fix delayed work handling for GFXOFF - Fix build when CONFIG_SUSPEND is not set imx: - fix planar offset calculations - fix accidental partial revert" * tag 'drm-fixes-2021-08-27' of git://anongit.freedesktop.org/drm/drm: drm/i915/dp: Drop redundant debug print drm/i915: Fix syncmap memory leak drm/amdgpu: Fix build with missing pm_suspend_target_state module export drm/amdgpu: Cancel delayed work when GFXOFF is disabled drm/amdgpu: use the preferred pin domain after the check drm/imx: ipuv3-plane: fix accidental partial revert of 8 pixel alignment fix gpu: ipu-v3: Fix i.MX IPU-v3 offset calculations for (semi)planar U/V formats
2021-08-27Merge tag 'imx-drm-fixes-2021-08-18' of git://git.pengutronix.de/pza/linux ↵Dave Airlie2-16/+16
into drm-fixes drm/imx: imx-drm alignment and plane offset fixes Fix an accidental partial revert of commit 94dfec48fca7 ("drm/imx: Add 8 pixel alignment fix") and plane offset calculations for capture of non-aligned resolutions. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Philipp Zabel <p.zabel@pengutronix.de> Link: https://patchwork.freedesktop.org/patch/msgid/85a41af99beb2c9e7d6020435a135bf9f205a5ff.camel@pengutronix.de
2021-08-27Merge tag 'amd-drm-fixes-5.14-2021-08-25' of ↵Dave Airlie4-23/+36
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-5.14-2021-08-25: amdgpu: - Fix for pinning display buffers multiple times - Fix delayed work handling for GFXOFF - Fix build when CONFIG_SUSPEND is not set Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210826032658.4068-1-alexander.deucher@amd.com
2021-08-27Merge tag 'drm-intel-fixes-2021-08-26' of ↵Dave Airlie2-7/+11
git://anongit.freedesktop.org/drm/drm-intel into drm-fixes - Fix syncmap memory leak - Drop redundant display port debug print Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/YSfSeHbyS5wBZtNJ@intel.com
2021-08-27PCI/MSI: Skip masking MSI-X on Xen PVMarek Marczykowski-Górecki1-0/+3
When running as Xen PV guest, masking MSI-X is a responsibility of the hypervisor. The guest has no write access to the relevant BAR at all - when it tries to, it results in a crash like this: BUG: unable to handle page fault for address: ffffc9004069100c #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0 e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e] e1000_probe+0x41f/0xdb0 [e1000e] local_pci_probe+0x42/0x80 (...) The recently introduced function msix_mask_all() does not check the global variable pci_msi_ignore_mask which is set by XEN PV to bypass the masking of MSI[-X] interrupts. Add the check to make this function XEN PV compatible. Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries") Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210826170342.135172-1-marmarek@invisiblethingslab.com