summaryrefslogtreecommitdiffstats
path: root/arch
AgeCommit message (Collapse)AuthorFilesLines
2021-07-02Merge branch 'akpm' (patches from Andrew)Linus Torvalds104-242/+160
Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ...
2021-07-01Merge tag 'mips_5.14' of ↵Linus Torvalds37-115/+250
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS updates from Thomas Bogendoerfer: - add support for OpeneEmbed SOM9331 board - Ingenic fixes/improvments - other fixes and cleanups * tag 'mips_5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (39 commits) MIPS: Fix PKMAP with 32-bit MIPS huge page support MIPS: CI20: Add second percpu timer for SMP. MIPS: CI20: Reduce clocksource to 750 kHz. MIPS: Ingenic: Add MAC syscon nodes for Ingenic SoCs. dt-bindings: clock: Add documentation for MAC PHY control bindings. MIPS: X1830: Respect cell count of common properties. MIPS: set mips32r5 for virt extensions MIPS: loongsoon64: Reserve memory below starting pfn to prevent Oops MIPS: MT extensions are not available on MIPS32r1 mips/kvm: Use BUG_ON instead of if condition followed by BUG MIPS: OCTEON: octeon-usb: Use devm_platform_get_and_ioremap_resource() MIPS: add PMD table accounting into MIPS'pmd_alloc_one MIPS: Loongson64: fix spelling of SPDX tag MIPS: ingenic: rs90: Add dedicated VRAM memory region MIPS: ingenic: gcw0: Set codec to cap-less mode for FM radio MIPS: ingenic: jz4780: Fix I2C nodes to match DT doc MIPS: ingenic: Select CPU_SUPPORTS_CPUFREQ && MIPS_EXTERNAL_TIMER MIPS: Kconfig: ingenic: Ensure MACH_INGENIC_GENERIC selects all SoCs MIPS: cpu-probe: Fix FPU detection on Ingenic JZ4760(B) MIPS: boot: Support specifying UART port on Ingenic SoCs ...
2021-07-01Merge tag 'pinctrl-v5.14-1' of ↵Linus Torvalds11-717/+7
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control updates from Linus Walleij: "This is the bulk of pin control changes for the v5.14 kernel. Not so much going on. No core changes, just drivers. The most interesting would be that MIPS Ralink is migrating to pin control and we have some bindings but not yet code for the Apple M1 pin controller. New drivers: - Last merge window we created a driver for the Ralink RT2880. We are now moving the Ralink SoC pin control drivers out of the MIPS architecture code and into the pin control subsystem. This concerns RT288X, MT7620, RT305X, RT3883 and MT7621. - Qualcomm SM6125 SoC pin control driver. - Qualcomm spmi-gpio support for PM7325. - Qualcomm spmi-mpp also handles PMI8994 (just a compatible string) - Mediatek MT8365 SoC pin controller. - New device HID for the AMD GPIO controller. Improvements: - Pin bias config support for a slew of Renesas pin controllers. - Incremental improvements and non-urgent bug fixes to the Renesas SoC drivers. - Implement irq_set_wake on the AMD pin controller so we can wake up from external pin events. Misc: - Devicetree bindings for the Apple M1 pin controller, we will probably see a proper driver for this soon as well" * tag 'pinctrl-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (54 commits) pinctrl: ralink: rt305x: add missing include pinctrl: stm32: check for IRQ MUX validity during alloc() pinctrl: zynqmp: some code cleanups drivers: qcom: pinctrl: Add pinctrl driver for sm6125 dt-bindings: pinctrl: qcom: sm6125: Document SM6125 pinctrl driver dt-bindings: pinctrl: mcp23s08: add documentation for reset-gpios pinctrl: mcp23s08: Add optional reset GPIO pinctrl: mediatek: fix mode encoding pinctrl: mcp23s08: Fix missing unlock on error in mcp23s08_irq() pinctrl: bcm: Constify static pinmux_ops pinctrl: bcm: Constify static pinctrl_ops pinctrl: ralink: move RT288X SoC pinmux config into a new 'pinctrl-rt288x.c' file pinctrl: ralink: move MT7620 SoC pinmux config into a new 'pinctrl-mt7620.c' file pinctrl: ralink: move RT305X SoC pinmux config into a new 'pinctrl-rt305x.c' file pinctrl: ralink: move RT3883 SoC pinmux config into a new 'pinctrl-rt3883.c' file pinctrl: ralink: move MT7621 SoC pinmux config into a new 'pinctrl-mt7621.c' file pinctrl: ralink: move ralink architecture pinmux header into the driver pinctrl: single: config: enable the pin's input pinctrl: mtk: Fix mt8365 Kconfig dependency pinctrl: mcp23s08: fix race condition in irq handler ...
2021-07-01Merge tag 'clk-for-linus' of ↵Linus Torvalds25-525/+313
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk updates from Stephen Boyd: "This round has a diffstat dominated by Qualcomm clk drivers. Honestly though that's just a bunch of data so the diffstat reflects that. Looking beyond that there's just a bunch of updates all around in various clk drivers. Renesas and NXP (for i.MX) are two SoC vendors that have a lot of patches in here. Overall the driver changes look to be mostly enabling more clks and non-critical fixes that we could hold until the next merge window. I'm especially excited about the series from Arnd that graduates clkdev to be the only implementation of clk_get() and clk_put(). That's a good step in the right direction to migreate eveerything over to the common clk framework. Now we don't have to worry about clkdev specific details, they're just part of the clk API now. Core: - clkdev is now the only option, i.e. clk_get()/clk_put() is implemented in only one place in the kernel instead of in drivers/clk/clkdev.c and in architectures that want their own implementation New Drivers: - Texas Instruments' LMK04832 Ultra Low-Noise JESD204B Compliant Clock Jitter Cleaner With Dual Loop PLLs - Qualcomm MDM9607 GCC - Qualcomm SC8180X display clks - Qualcomm SM6125 GCC - Qualcomm SM8250 CAMCC (camera) - Renesas RZ/G2L SoC - Hisilicon hi3559A SoC Updates: - Stop using clock-output-names in ST clk drivers (yay!) - Support secure mode of STM32MP1 SoCs - Improve clock support for Actions S500 SoC - duty cycle setting support on qcom clks - Add TI am33xx spread spectrum clock support - Use determine_rate() for the Amlogic pll ops instead of round_rate() - Restrict Amlogic gp0/1 and audio plls range on g12a/sm1 - Improve Amlogic axg-audio controller error on deferral - Add NNA clocks on Amlogic g12a - Reduce memory footprint of Rockchip PLL rate tables - A fix for the newly added Rockchip rk3568 clk driver - Exported clock for the newly added Rockchip video decoder - Remove audio ipg clock from i.MX8MP - Remove deprecated legacy clock binding for i.MX SCU clock driver - Use common clk-imx8qxp for both i.MX8QXP and i.MX8QM - Add multiple clocks to clk-imx8qxp driver (enet, hdmi, lcdif, audio, parallel interface) - Add dedicated clock ops for i.MX paralel interface - Different fixes for clocks controlled by ATF on i.MX SoCs - Add A53/A72 frequency scaling support i.MX clk-scu driver - Add special case for DCSS clock on suspend for i.MX clk-scu driver - Add parent save/restore on suspend/resume to i.MX clk-scu driver - Skip runtime PM enablement for CPU clocks in i.MX clk-scu driver - Remove the sys1_pll/sys2_pll clock gates for i.MX8MQ and their bindings - Tegra clk driver no longer deasserts resets on clk_enable as it gets in the way of certain power-up sequences - Fix compile testing for Tegra clk driver - One patch to fix a divider on the Allwinner v3s Audio PLL - Add support for CPU core clock boost modes on Renesas R-Car Gen3 - Add ISPCS (Image Signal Processor) clocks on Renesas R-Car V3U - Switch SH/R-Mobile and R-Car "DIV6" clocks to .determine_rate() and improve support for multiple parents - Switch Renesas RZ/N1 divider clocks to .determine_rate() - Add ZA2 (Audio Clock Generator) clock on Renesas R-Car D3 - Convert ar7 to common clk framework - Convert ralink to common clk framework" * tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (161 commits) clk: zynqmp: Handle divider specific read only flag clk: zynqmp: Use firmware specific mux clock flags clk: zynqmp: Use firmware specific divider clock flags clk: zynqmp: Use firmware specific common clock flags clk: lmk04832: Use of match table clk: lmk04832: Depend on SPI clk: stm32mp1: new compatible for secure RCC support dt-bindings: clock: stm32mp1 new compatible for secure rcc dt-bindings: reset: add MCU HOLD BOOT ID for SCMI reset domains on stm32mp15 dt-bindings: reset: add IDs for SCMI reset domains on stm32mp15 dt-bindings: clock: add IDs for SCMI clocks on stm32mp15 reset: stm32mp1: remove stm32mp1 reset clk: hisilicon: Add clock driver for hi3559A SoC dt-bindings: Document the hi3559a clock bindings clk: si5341: Add sysfs properties to allow checking/resetting device faults clk: si5341: Add silabs,iovdd-33 property clk: si5341: Add silabs,xaxb-ext-clk property clk: si5341: Allow different output VDD_SEL values clk: si5341: Update initialization magic clk: si5341: Check for input clock presence and PLL lock on startup ...
2021-07-01Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drmLinus Torvalds1-0/+2
Pull drm updates from Dave Airlie: "Highlights: - AMD enables two more GPUs, with resulting header files - i915 has started to move to TTM for discrete GPU and enable DG1 discrete GPU support (not by default yet) - new HyperV drm driver - vmwgfx adds arm64 support - TTM refactoring ongoing - 16bpc display support for AMD hw Otherwise it's just the usual insane amounts of work all over the place in lots of drivers and the core, as mostly summarised below: Core: - mark AGP ioctls as legacy - disable force probing for non-master clients - HDR metadata property helpers - HDMI infoframe signal colorimetry support - remove drm_device.pdev pointer - remove DRM_KMS_FB_HELPER config option - remove drm_pci_alloc/free - drm_err_*/drm_dbg_* helpers - use drm driver names for fbdev - leaked DMA handle fix - 16bpc fixed point format fourcc - add prefetching memcpy for WC - Documentation fixes aperture: - add aperture ownership helpers dp: - aux fixes - downstream 0 port handling - use extended base receiver capability DPCD - Rename DP_PSR_SELECTIVE_UPDATE to better mach eDP spec - mst: use khz as link rate during init - VCPI fixes for StarTech hub ttm: - provide tt_shrink file via debugfs - warn about freeing pinned BOs - fix swapping error handling - move page alignment into BO - cleanup ttm_agp_backend - add ttm_sys_manager - don't override vm_ops - ttm_bo_mmap removed - make ttm_resource base of all managers - remove VM_MIXEDMAP usage panel: - sysfs_emit support - simple: runtime PM support - simple: power up panel when reading EDID + caching bridge: - MHDP8546: HDCP support + DT bindings - MHDP8546: Register DP AUX channel with userspace - TI SN65DSI83 + SN65DSI84: add driver - Sil8620: Fix module dependencies - dw-hdmi: make CEC driver loading optional - Ti-sn65dsi86: refclk fixes, subdrivers, runtime pm - It66121: Add driver + DT bindings - Adv7511: Support I2S IEC958 encoding - Anx7625: fix power-on delay - Nwi-dsi: Modesetting fixes; Cleanups - lt6911: add missing MODULE_DEVICE_TABLE - cdns: fix PM reference leak hyperv: - add new DRM driver for HyperV graphics efifb: - non-PCI device handling fixes i915: - refactor IP/device versioning - XeLPD Display IP preperation work - ADL-P enablement patches - DG1 uAPI behind BROKEN - disable mmap ioctl for discerte GPUs - start enabling HuC loading for Gen12+ - major GuC backend rework for new platforms - initial TTM support for Discrete GPUs - locking rework for TTM prep - use correct max source link rate for eDP - %p4cc format printing - GLK display fixes - VLV DSI panel power fixes - PSR2 disabled for RKL and ADL-S - ACPI _DSM invalid access fixed - DMC FW path abstraction - ADL-S PCI ID update - uAPI headers converted to kerneldoc - initial LMEM support for DG1 - x86/gpu: add Jasperlake to gen11 early quirks amdgpu: - Aldebaran updates + initial SR-IOV - new GPU: Beige Goby and Yellow Carp support - more LTTPR display work - Vangogh updates - SDMA 5.x GCR fixes - PCIe ASPM support - Renoir TMZ enablement - initial multiple eDP panel support - use fdinfo to track devices/process info - pin/unpin TTM fixes - free resource on fence usage query - fix fence calculation - fix hotunplug/suspend issues - GC/MM register access macro cleanup for SR-IOV - W=1 fixes - ACPI ATCS/ATIF handling rework - 16bpc fixed point format support - Initial smartshift support - RV/PCO power tuning fixes - new INFO query for additional vbios info amdkfd: - SR-IOV aldebaran support - HMM SVM support radeon: - SMU regression fixes - Oland flickering fix vmwgfx: - enable console with fbdev emulation - fix cpu updates of coherent multisample surfaces - remove reservation semaphore - add initial SVGA3 support - support arm64 msm: - devcoredump support for display errors - dpu/dsi: yaml bindings conversion - mdp5: alpha/blend_mode/zpos support - a6xx: cached coherent buffer support - gpu iova fault improvement - a660 support rockchip: - RK3036 win1 scaling support - RK3066/3188 missing register support - RK3036/3066/3126/3188 alpha support mediatek: - MT8167 HDMI support - MT8183 DPI dual edge support tegra: - fixed YUV support/scaling on Tegra186+ ast: - use pcim_iomap - fix DP501 EDID bochs: - screen blanking support etnaviv: - export more GPU ID values to userspace - add HWDB entry for GPU on i.MX8MP - rework linear window calcs exynos: - pm runtime changes imx: - Annotate dma_fence critical section - fix PRG modifiers after drmm conversion - Add 8 pixel alignment fix for 1366x768 - fix YUV advertising - add color properties ingenic: - IPU planes fix panfrost: - Mediatek MT8183 support + DT bindings - export AFBC_FEATURES register to userspace simpledrm: - %pr for printing resources nouveau: - pin/unpin TTM fixes qxl: - unpin shadow BO virtio: - create dumb BOs as guest blob vkms: - drmm_universal_plane_alloc - add XRGB plane composition - overlay support" * tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm: (1570 commits) drm/i915: Reinstate the mmap ioctl for some platforms drm/i915/dsc: abstract helpers to get bigjoiner primary/secondary crtc Revert "drm/msm/mdp5: provide dynamic bandwidth management" drm/msm/mdp5: provide dynamic bandwidth management drm/msm/mdp5: add perf blocks for holding fudge factors drm/msm/mdp5: switch to standard zpos property drm/msm/mdp5: add support for alpha/blend_mode properties drm/msm/mdp5: use drm_plane_state for pixel blend mode drm/msm/mdp5: use drm_plane_state for storing alpha value drm/msm/mdp5: use drm atomic helpers to handle base drm plane state drm/msm/dsi: do not enable PHYs when called for the slave DSI interface drm/msm: Add debugfs to trigger shrinker drm/msm/dpu: Avoid ABBA deadlock between IRQ modules drm/msm: devcoredump iommu fault support iommu/arm-smmu-qcom: Add stall support drm/msm: Improve the a6xx page fault handler iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info iommu/arm-smmu: Add support for driver IOMMU fault handlers drm/msm: export hangcheck_period in debugfs drm/msm/a6xx: add support for Adreno 660 GPU ...
2021-07-01Merge tag 'fs_for_v5.14-rc1' of ↵Linus Torvalds17-17/+18
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull misc fs updates from Jan Kara: "The new quotactl_fd() syscall (remake of quotactl_path() syscall that got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs, isofs, and writeback fixes and cleanups" * tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: writeback: fix obtain a reference to a freeing memcg css quota: remove unnecessary oom message isofs: remove redundant continue statement quota: Wire up quotactl_fd syscall quota: Change quotactl_path() systcall to an fd-based one reiserfs: Remove unneed check in reiserfs_write_full_page() udf: Fix NULL pointer dereference in udf_symlink function reiserfs: add check for invalid 1st journal block
2021-07-01kprobes: remove duplicated strong free_insn_page in x86 and s390Barry Song2-11/+0
free_insn_page() in x86 and s390 is same with the common weak function in kernel/kprobes.c. Plus, the comment "Recover page to RW mode before releasing it" in x86 seems insensible to be there since resetting mapping is done by common code in vfree() of module_memfree(). So drop these two duplicated strong functions and related comment, then mark the common one in kernel/kprobes.c strong. Link: https://lkml.kernel.org/r/20210608065736.32656-1-song.bao.hua@hisilicon.com Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Qi Liu <liuqi115@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01kernel.h: split out panic and oops helpersAndy Shevchenko18-1/+19
kernel.h is being used as a dump for all kinds of stuff for a long time. Here is the attempt to start cleaning it up by splitting out panic and oops helpers. There are several purposes of doing this: - dropping dependency in bug.h - dropping a loop by moving out panic_notifier.h - unload kernel.h from something which has its own domain At the same time convert users tree-wide to use new headers, although for the time being include new header back to kernel.h to avoid twisted indirected includes for existing users. [akpm@linux-foundation.org: thread_info.h needs limits.h] [andriy.shevchenko@linux.intel.com: ia64 fix] Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Co-developed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Sebastian Reichel <sre@kernel.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Stephen Boyd <sboyd@kernel.org> Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01mm: remove special swap entry functionsAlistair Popple1-1/+1
Patch series "Add support for SVM atomics in Nouveau", v11. Introduction ============ Some devices have features such as atomic PTE bits that can be used to implement atomic access to system memory. To support atomic operations to a shared virtual memory page such a device needs access to that page which is exclusive of the CPU. This series introduces a mechanism to temporarily unmap pages granting exclusive access to a device. These changes are required to support OpenCL atomic operations in Nouveau to shared virtual memory (SVM) regions allocated with the CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the OpenCL SVM feature is available at https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/ OpenCL_API.html#_shared_virtual_memory . Implementation ============== Exclusive device access is implemented by adding a new swap entry type (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main difference is that on fault the original entry is immediately restored by the fault handler instead of waiting. Restoring the entry triggers calls to MMU notifers which allows a device driver to revoke the atomic access permission from the GPU prior to the CPU finalising the entry. Patches ======= Patches 1 & 2 refactor existing migration and device private entry functions. Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated functionality into separate functions - try_to_migrate_one() and try_to_munlock_one(). Patch 5 renames some existing code but does not introduce functionality. Patch 6 is a small clean-up to swap entry handling in copy_pte_range(). Patch 7 contains the bulk of the implementation for device exclusive memory. Patch 8 contains some additions to the HMM selftests to ensure everything works as expected. Patch 9 is a cleanup for the Nouveau SVM implementation. Patch 10 contains the implementation of atomic access for the Nouveau driver. Testing ======= This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program which checks that GPU atomic accesses to system memory are atomic. Without this series the test fails as there is no way of write-protecting the page mapping which results in the device clobbering CPU writes. For reference the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/ Further testing has been performed by adding support for testing exclusive access to the hmm-tests kselftests. This patch (of 10): Remove multiple similar inline functions for dealing with different types of special swap entries. Both migration and device private swap entries use the swap offset to store a pfn. Instead of multiple inline functions to obtain a struct page for each swap entry type use a common function pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn() functions as this results is shorter code that is easier to understand. Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01mm/thp: define default pmd_pgtable()Anshuman Khandual32-43/+19
Currently most platforms define pmd_pgtable() as pmd_page() duplicating the same code all over. Instead just define a default value i.e pmd_page() for pmd_pgtable() and let platforms override when required via <asm/pgtable.h>. All the existing platform that override pmd_pgtable() have been moved into their respective <asm/pgtable.h> header in order to precede before the new generic definition. This makes it much cleaner with reduced code. Link: https://lkml.kernel.org/r/1623646133-20306-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Nick Hu <nickhu@andestech.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Brian Cain <bcain@codeaurora.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01mm: define default value for FIRST_USER_ADDRESSAnshuman Khandual25-43/+0
Currently most platforms define FIRST_USER_ADDRESS as 0UL duplication the same code all over. Instead just define a generic default value (i.e 0UL) for FIRST_USER_ADDRESS and let the platforms override when required. This makes it much cleaner with reduced code. The default FIRST_USER_ADDRESS here would be skipped in <linux/pgtable.h> when the given platform overrides its value via <asm/pgtable.h>. Link: https://lkml.kernel.org/r/1620615725-24623-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Guo Ren <guoren@kernel.org> [csky] Acked-by: Stafford Horne <shorne@gmail.com> [openrisc] Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> [RISC-V] Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Brian Cain <bcain@codeaurora.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tablesDavid Hildenbrand4-0/+12
I. Background: Sparse Memory Mappings When we manage sparse memory mappings dynamically in user space - also sometimes involving MAP_NORESERVE - we want to dynamically populate/ discard memory inside such a sparse memory region. Example users are hypervisors (especially implementing memory ballooning or similar technologies like virtio-mem) and memory allocators. In addition, we want to fail in a nice way (instead of generating SIGBUS) if populating does not succeed because we are out of backend memory (which can happen easily with file-based mappings, especially tmpfs and hugetlbfs). While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for reliably discarding memory for most mapping types, there is no generic approach to populate page tables and preallocate memory. Although mmap() supports MAP_POPULATE, it is not applicable to the concept of sparse memory mappings, where we want to populate/discard dynamically and avoid expensive/problematic remappings. In addition, we never actually report errors during the final populate phase - it is best-effort only. fallocate() can be used to preallocate file-based memory and fail in a safe way. However, it cannot really be used for any private mappings on anonymous files via memfd due to COW semantics. In addition, fallocate() does not actually populate page tables, so we still always get pagefaults on first access - which is sometimes undesired (i.e., real-time workloads) and requires real prefaulting of page tables, not just a preallocation of backend storage. There might be interesting use cases for sparse memory regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy as it does not prefault page tables. II. On preallcoation/prefaulting from user space Because we don't have a proper interface, what applications (like QEMU and databases) end up doing is touching (i.e., reading+writing one byte to not overwrite existing data) all individual pages. However, that approach 1) Can result in wear on storage backing, because we end up reading/writing each page; this is especially a problem for dax/pmem. 2) Can result in mmap_sem contention when prefaulting via multiple threads. 3) Requires expensive signal handling, especially to catch SIGBUS in case of hugetlbfs/shmem/file-backed memory. For example, this is problematic in hypervisors like QEMU where SIGBUS handlers might already be used by other subsystems concurrently to e.g, handle hardware errors. "Simply" doing preallocation concurrently from other thread is not that easy. III. On MADV_WILLNEED Extending MADV_WILLNEED is not an option because 1. It would change the semantics: "Expect access in the near future." and "might be a good idea to read some pages" vs. "Definitely populate/ preallocate all memory and definitely fail on errors.". 2. Existing users (like virtio-balloon in QEMU when deflating the balloon) don't want populate/prealloc semantics. They treat this rather as a hint to give a little performance boost without too much overhead - and don't expect that a lot of memory might get consumed or a lot of time might be spent. IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by MAP_POPULATE, with the following semantics: 1. MADV_POPULATE_READ can be used to prefault page tables just like manually reading each individual page. This will not break any COW mappings. The shared zero page might get mapped and no backend storage might get preallocated -- allocation might be deferred to write-fault time. Especially shared file mappings require an explicit fallocate() upfront to actually preallocate backend memory (blocks in the file system) in case the file might have holes. 2. If MADV_POPULATE_READ succeeds, all page tables have been populated (prefaulted) readable once. 3. MADV_POPULATE_WRITE can be used to preallocate backend memory and prefault page tables just like manually writing (or reading+writing) each individual page. This will break any COW mappings -- e.g., the shared zeropage is never populated. 4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated (prefaulted) writable once. 5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special mappings marked with VM_PFNMAP and VM_IO. Also, proper access permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such mapping is encountered, madvise() fails with -EINVAL. 6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables might have been populated. 7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON when encountering a HW poisoned page in the range. 8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot protect from the OOM (Out Of Memory) handler killing the process. While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e., preallocate memory and prefault page tables for VMs), one issue is that whenever we prefault pages writable, the pages have to be marked dirty, because the CPU could dirty them any time. while not a real problem for hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each page will be marked dirty and has to be written back later when evicting. MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole mapping from backend storage without marking it dirty, such that eviction won't have to write it back. As discussed above, shared file mappings might require an explciit fallocate() upfront to achieve preallcoation+prepopulation. Although sparse memory mappings are the primary use case, this will also be useful for other preallocate/prefault use cases where MAP_POPULATE is not desired or the semantics of MAP_POPULATE are not sufficient: as one example, QEMU users can trigger preallocation/prefaulting of guest RAM after the mapping was created -- and don't want errors to be silently suppressed. Looking at the history, MADV_POPULATE was already proposed in 2013 [1], however, the main motivation back than was performance improvements -- which should also still be the case. V. Single-threaded performance comparison I did a short experiment, prefaulting page tables on completely *empty mappings/files* and repeated the experiment 10 times. The results correspond to the shortest execution time. In general, the performance benefit for huge pages is negligible with small mappings. V.1: Private mappings POPULATE_READ and POPULATE_WRITE is fastest. Note that Reading/POPULATE_READ will populate the shared zeropage where applicable -- which result in short population times. The fastest way to allocate backend storage (here: swap or huge pages) and prefault page tables is POPULATE_WRITE. V.2: Shared mappings fallocate() is fastest, however, doesn't prefault page tables. POPULATE_WRITE is faster than simple writes and read/writes. POPULATE_READ is faster than simple reads. Without a fd, the fastest way to allocate backend storage and prefault page tables is POPULATE_WRITE. With an fd, the fastest way is usually FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one exception are actual files: FALLOCATE+Read is slightly faster than FALLOCATE+POPULATE_READ. The fastest way to allocate backend storage prefault page tables is FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then, FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as dirty. v.3: Detailed results ================================================== 2 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 0.119 ms Anon 4 KiB : Write : 0.222 ms Anon 4 KiB : Read/Write : 0.380 ms Anon 4 KiB : POPULATE_READ : 0.060 ms Anon 4 KiB : POPULATE_WRITE : 0.158 ms Memfd 4 KiB : Read : 0.034 ms Memfd 4 KiB : Write : 0.310 ms Memfd 4 KiB : Read/Write : 0.362 ms Memfd 4 KiB : POPULATE_READ : 0.039 ms Memfd 4 KiB : POPULATE_WRITE : 0.229 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.033 ms tmpfs : Write : 0.313 ms tmpfs : Read/Write : 0.406 ms tmpfs : POPULATE_READ : 0.039 ms tmpfs : POPULATE_WRITE : 0.285 ms file : Read : 0.033 ms file : Write : 0.351 ms file : Read/Write : 0.408 ms file : POPULATE_READ : 0.039 ms file : POPULATE_WRITE : 0.290 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 237.940 ms Anon 4 KiB : Write : 708.409 ms Anon 4 KiB : Read/Write : 1054.041 ms Anon 4 KiB : POPULATE_READ : 124.310 ms Anon 4 KiB : POPULATE_WRITE : 572.582 ms Memfd 4 KiB : Read : 136.928 ms Memfd 4 KiB : Write : 963.898 ms Memfd 4 KiB : Read/Write : 1106.561 ms Memfd 4 KiB : POPULATE_READ : 78.450 ms Memfd 4 KiB : POPULATE_WRITE : 805.881 ms Memfd 2 MiB : Read : 357.116 ms Memfd 2 MiB : Write : 357.210 ms Memfd 2 MiB : Read/Write : 357.606 ms Memfd 2 MiB : POPULATE_READ : 356.094 ms Memfd 2 MiB : POPULATE_WRITE : 356.937 ms tmpfs : Read : 137.536 ms tmpfs : Write : 954.362 ms tmpfs : Read/Write : 1105.954 ms tmpfs : POPULATE_READ : 80.289 ms tmpfs : POPULATE_WRITE : 822.826 ms file : Read : 137.874 ms file : Write : 987.025 ms file : Read/Write : 1107.439 ms file : POPULATE_READ : 80.413 ms file : POPULATE_WRITE : 857.622 ms hugetlbfs : Read : 355.607 ms hugetlbfs : Write : 355.729 ms hugetlbfs : Read/Write : 356.127 ms hugetlbfs : POPULATE_READ : 354.585 ms hugetlbfs : POPULATE_WRITE : 355.138 ms ************************************************** 2 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 0.394 ms Anon 4 KiB : Write : 0.348 ms Anon 4 KiB : Read/Write : 0.400 ms Anon 4 KiB : POPULATE_READ : 0.326 ms Anon 4 KiB : POPULATE_WRITE : 0.273 ms Anon 2 MiB : Read : 0.030 ms Anon 2 MiB : Write : 0.030 ms Anon 2 MiB : Read/Write : 0.030 ms Anon 2 MiB : POPULATE_READ : 0.030 ms Anon 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 4 KiB : Read : 0.412 ms Memfd 4 KiB : Write : 0.372 ms Memfd 4 KiB : Read/Write : 0.419 ms Memfd 4 KiB : POPULATE_READ : 0.343 ms Memfd 4 KiB : POPULATE_WRITE : 0.288 ms Memfd 4 KiB : FALLOCATE : 0.137 ms Memfd 4 KiB : FALLOCATE+Read : 0.446 ms Memfd 4 KiB : FALLOCATE+Write : 0.330 ms Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 2 MiB : FALLOCATE : 0.030 ms Memfd 2 MiB : FALLOCATE+Read : 0.031 ms Memfd 2 MiB : FALLOCATE+Write : 0.031 ms Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 0.030 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.416 ms tmpfs : Write : 0.369 ms tmpfs : Read/Write : 0.425 ms tmpfs : POPULATE_READ : 0.346 ms tmpfs : POPULATE_WRITE : 0.295 ms tmpfs : FALLOCATE : 0.139 ms tmpfs : FALLOCATE+Read : 0.447 ms tmpfs : FALLOCATE+Write : 0.333 ms tmpfs : FALLOCATE+Read/Write : 0.454 ms tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms file : Read : 0.191 ms file : Write : 0.511 ms file : Read/Write : 0.524 ms file : POPULATE_READ : 0.196 ms file : POPULATE_WRITE : 0.434 ms file : FALLOCATE : 0.004 ms file : FALLOCATE+Read : 0.197 ms file : FALLOCATE+Write : 0.554 ms file : FALLOCATE+Read/Write : 0.480 ms file : FALLOCATE+POPULATE_READ : 0.201 ms file : FALLOCATE+POPULATE_WRITE : 0.381 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms hugetlbfs : FALLOCATE : 0.030 ms hugetlbfs : FALLOCATE+Read : 0.031 ms hugetlbfs : FALLOCATE+Write : 0.031 ms hugetlbfs : FALLOCATE+Read/Write : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 1053.090 ms Anon 4 KiB : Write : 913.642 ms Anon 4 KiB : Read/Write : 1060.350 ms Anon 4 KiB : POPULATE_READ : 893.691 ms Anon 4 KiB : POPULATE_WRITE : 782.885 ms Anon 2 MiB : Read : 358.553 ms Anon 2 MiB : Write : 358.419 ms Anon 2 MiB : Read/Write : 357.992 ms Anon 2 MiB : POPULATE_READ : 357.533 ms Anon 2 MiB : POPULATE_WRITE : 357.808 ms Memfd 4 KiB : Read : 1078.144 ms Memfd 4 KiB : Write : 942.036 ms Memfd 4 KiB : Read/Write : 1100.391 ms Memfd 4 KiB : POPULATE_READ : 925.829 ms Memfd 4 KiB : POPULATE_WRITE : 804.394 ms Memfd 4 KiB : FALLOCATE : 304.632 ms Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms Memfd 4 KiB : FALLOCATE+Write : 933.186 ms Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms Memfd 2 MiB : Read : 358.131 ms Memfd 2 MiB : Write : 358.099 ms Memfd 2 MiB : Read/Write : 358.250 ms Memfd 2 MiB : POPULATE_READ : 357.563 ms Memfd 2 MiB : POPULATE_WRITE : 357.334 ms Memfd 2 MiB : FALLOCATE : 356.735 ms Memfd 2 MiB : FALLOCATE+Read : 358.152 ms Memfd 2 MiB : FALLOCATE+Write : 358.331 ms Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms tmpfs : Read : 1087.265 ms tmpfs : Write : 950.840 ms tmpfs : Read/Write : 1107.567 ms tmpfs : POPULATE_READ : 922.605 ms tmpfs : POPULATE_WRITE : 810.094 ms tmpfs : FALLOCATE : 306.320 ms tmpfs : FALLOCATE+Read : 1169.796 ms tmpfs : FALLOCATE+Write : 933.730 ms tmpfs : FALLOCATE+Read/Write : 1191.610 ms tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms file : Read : 654.101 ms file : Write : 1259.142 ms file : Read/Write : 1289.509 ms file : POPULATE_READ : 661.642 ms file : POPULATE_WRITE : 1106.816 ms file : FALLOCATE : 1.864 ms file : FALLOCATE+Read : 656.328 ms file : FALLOCATE+Write : 1153.300 ms file : FALLOCATE+Read/Write : 1180.613 ms file : FALLOCATE+POPULATE_READ : 668.347 ms file : FALLOCATE+POPULATE_WRITE : 996.143 ms hugetlbfs : Read : 357.245 ms hugetlbfs : Write : 357.413 ms hugetlbfs : Read/Write : 357.120 ms hugetlbfs : POPULATE_READ : 356.321 ms hugetlbfs : POPULATE_WRITE : 356.693 ms hugetlbfs : FALLOCATE : 355.927 ms hugetlbfs : FALLOCATE+Read : 357.074 ms hugetlbfs : FALLOCATE+Write : 357.120 ms hugetlbfs : FALLOCATE+Read/Write : 356.983 ms hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms ************************************************** [1] https://lkml.org/lkml/2013/6/27/698 [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm: generalize ZONE_[DMA|DMA32]Kefeng Wang13-60/+11
ZONE_[DMA|DMA32] configs have duplicate definitions on platforms that subscribe to them. Instead, just make them generic options which can be selected on applicable platforms. Also only x86/arm64 architectures could enable both ZONE_DMA and ZONE_DMA32 if EXPERT, add ARCH_HAS_ZONE_DMA_SET to make dma zone configurable and visible on the two architectures. Link: https://lkml.kernel.org/r/20210528074557.17768-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> [RISC-V] Acked-by: Michal Simek <michal.simek@xilinx.com> [microblaze] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Russell King <linux@armlinux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2Anshuman Khandual2-2/+2
ARCH_ENABLE_SPLIT_PMD_PTLOCK is irrelevant unless there are more than two page table levels including PMD (also per Documentation/vm/split_page_table_lock.rst). Make this dependency explicit on remaining platforms i.e x86 and s390 where ARCH_ENABLE_SPLIT_PMD_PTLOCK is subscribed. Link: https://lkml.kernel.org/r/1622013501-20409-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390 Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30arm64/mm: drop HAVE_ARCH_PFN_VALIDAnshuman Khandual3-39/+0
CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64 platforms and free_unused_memmap() would just return without creating any holes in the memmap mapping. There is no need for any special handling in pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped. This also moves the pfn upper bits sanity check into generic pfn_valid(). Link: https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30arm64: drop pfn_valid_within() and simplify pfn_valid()Mike Rapoport2-2/+1
The arm64's version of pfn_valid() differs from the generic because of two reasons: * Parts of the memory map are freed during boot. This makes it necessary to verify that there is actual physical memory that corresponds to a pfn which is done by querying memblock. * There are NOMAP memory regions. These regions are not mapped in the linear map and until the previous commit the struct pages representing these areas had default values. As the consequence of absence of the special treatment of NOMAP regions in the memory map it was necessary to use memblock_is_map_memory() in pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that generic mm functionality would not treat a NOMAP page as a normal page. Since the NOMAP regions are now marked as PageReserved(), pfn walkers and the rest of core mm will treat them as unusable memory and thus pfn_valid_within() is no longer required at all and can be disabled on arm64. pfn_valid() can be slightly simplified by replacing memblock_is_map_memory() with memblock_is_memory(). [rppt@kernel.org: fix merge fix] Link: https://lkml.kernel.org/r/YJtoQhidtIJOhYsV@kernel.org Link: https://lkml.kernel.org/r/20210511100550.28178-5-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30arm64: decouple check whether pfn is in linear map from pfn_valid()Mike Rapoport6-6/+19
The intended semantics of pfn_valid() is to verify whether there is a struct page for the pfn in question and nothing else. Yet, on arm64 it is used to distinguish memory areas that are mapped in the linear map vs those that require ioremap() to access them. Introduce a dedicated pfn_is_map_memory() wrapper for memblock_is_map_memory() to perform such check and use it where appropriate. Using a wrapper allows to avoid cyclic include dependencies. While here also update style of pfn_valid() so that both pfn_valid() and pfn_is_map_memory() declarations will be consistent. Link: https://lkml.kernel.org/r/20210511100550.28178-4-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm/kconfig: move HOLES_IN_ZONE into mmKefeng Wang3-9/+1
commit a55749639dc1 ("ia64: drop marked broken DISCONTIGMEM and VIRTUAL_MEM_MAP") drop VIRTUAL_MEM_MAP, so there is no need HOLES_IN_ZONE on ia64. Also move HOLES_IN_ZONE into mm/Kconfig, select it if architecture needs this feature. Link: https://lkml.kernel.org/r/20210417075946.181402-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Will Deacon <will@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm: sparsemem: use huge PMD mapping for vmemmap pagesMuchun Song1-6/+2
The preparation of splitting huge PMD mapping of vmemmap pages is ready, so switch the mapping from PTE to PMD. Link: https://lkml.kernel.org/r/20210616094915.34432-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30powerpc/8xx: add support for huge pages on VMAP and VMALLOCChristophe Leroy2-1/+44
powerpc 8xx has 4 page sizes: - 4k - 16k - 512k - 8M At the time being, vmalloc and vmap only support huge pages which are leaf at PMD level. Here the PMD level is 4M, it doesn't correspond to any supported page size. For now, implement use of 16k and 512k pages which is done at PTE level. Support of 8M pages will be implemented later, it requires vmalloc to support hugepd tables. Link: https://lkml.kernel.org/r/8b972f1c03fb6bd59953035f0a3e4d26659de4f8.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm/pgtable: add stubs for {pmd/pub}_{set/clear}_hugeChristophe Leroy2-23/+31
For architectures with no PMD and/or no PUD, add stubs similar to what we have for architectures without P4D. [christophe.leroy@csgroup.eu: arm64: define only {pud/pmd}_{set/clear}_huge when useful] Link: https://lkml.kernel.org/r/73ec95f40cafbbb69bdfb43a7f53876fd845b0ce.1620990479.git.christophe.leroy@csgroup.eu [christophe.leroy@csgroup.eu: x86: define only {pud/pmd}_{set/clear}_huge when useful] Link: https://lkml.kernel.org/r/7fbf1b6bc3e15c07c24fa45278d57064f14c896b.1620930415.git.christophe.leroy@csgroup.eu Link: https://lkml.kernel.org/r/5ac5976419350e8e048d463a64cae449eb3ba4b0.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm/hugetlb: change parameters of arch_make_huge_pte()Christophe Leroy5-14/+8
Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2. This series implements huge VMAP and VMALLOC on powerpc 8xx. Powerpc 8xx has 4 page sizes: - 4k - 16k - 512k - 8M At the time being, vmalloc and vmap only support huge pages which are leaf at PMD level. Here the PMD level is 4M, it doesn't correspond to any supported page size. For now, implement use of 16k and 512k pages which is done at PTE level. Support of 8M pages will be implemented later, it requires use of hugepd tables. To allow this, the architecture provides two functions: - arch_vmap_pte_range_map_size() which tells vmap_pte_range() what page size to use. A stub returning PAGE_SIZE is provided when the architecture doesn't provide this function. - arch_vmap_pte_supported_shift() which tells __vmalloc_node_range() what page shift to use for a given area size. A stub returning PAGE_SHIFT is provided when the architecture doesn't provide this function. This patch (of 5): At the time being, arch_make_huge_pte() has the following prototype: pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, struct page *page, int writable); vma is used to get the pages shift or size. vma is also used on Sparc to get vm_flags. page is not used. writable is not used. In order to use this function without a vma, replace vma by shift and flags. Also remove the used parameters. Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm: hugetlb: add a kernel parameter hugetlb_free_vmemmapMuchun Song1-2/+6
Add a kernel parameter hugetlb_free_vmemmap to enable the feature of freeing unused vmemmap pages associated with each hugetlb page on boot. We disable PMD mapping of vmemmap pages for x86-64 arch when this feature is enabled. Because vmemmap_remap_free() depends on vmemmap being base page mapped. Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAPMuchun Song1-1/+1
The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of some vmemmap pages associated with pre-allocated HugeTLB pages. For example, on X86_64 6 vmemmap pages of size 4KB each can be saved for each 2MB HugeTLB page. 4094 vmemmap pages of size 4KB each can be saved for each 1GB HugeTLB page. When a HugeTLB page is allocated or freed, the vmemmap array representing the range associated with the page will need to be remapped. When a page is allocated, vmemmap pages are freed after remapping. When a page is freed, previously discarded vmemmap pages must be allocated before remapping. The config option is introduced early so that supporting code can be written to depend on the option. The initial version of the code only provides support for x86-64. If config HAVE_BOOTMEM_INFO_NODE is enabled, the freeing vmemmap page code denpend on it to free vmemmap pages. Otherwise, just use free_reserved_page() to free vmemmmap pages. The routine register_page_bootmem_info() is used to register bootmem info. Therefore, make sure register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP is defined. Link: https://lkml.kernel.org/r/20210510030027.56044-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30mm: memory_hotplug: factor out bootmem core functions to bootmem_info.cMuchun Song2-1/+3
Patch series "Free some vmemmap pages of HugeTLB page", v23. This patch series will free some vmemmap pages(struct page structures) associated with each HugeTLB page when preallocated to save memory. In order to reduce the difficulty of the first version of code review. In this version, we disable PMD/huge page mapping of vmemmap if this feature was enabled. This acutely eliminates a bunch of the complex code doing page table manipulation. When this patch series is solid, we cam add the code of vmemmap page table manipulation in the future. The struct page structures (page structs) are used to describe a physical page frame. By default, there is an one-to-one mapping from a page frame to it's corresponding page struct. The HugeTLB pages consist of multiple base page size pages and is supported by many architectures. See hugetlbpage.rst in the Documentation directory for more details. On the x86 architecture, HugeTLB pages of size 2MB and 1GB are currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of 4096 base pages. For each base page, there is a corresponding page struct. Within the HugeTLB subsystem, only the first 4 page structs are used to contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER provides this upper limit. The only 'useful' information in the remaining page structs is the compound_head field, and this field is the same for all tail pages. By removing redundant page structs for HugeTLB pages, memory can returned to the buddy allocator for other uses. When the system boot up, every 2M HugeTLB has 512 struct page structs which size is 8 pages(sizeof(struct page) * 512 / PAGE_SIZE). HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | -------------> | 2 | | | +-----------+ +-----------+ | | | 3 | -------------> | 3 | | | +-----------+ +-----------+ | | | 4 | -------------> | 4 | | 2MB | +-----------+ +-----------+ | | | 5 | -------------> | 5 | | | +-----------+ +-----------+ | | | 6 | -------------> | 6 | | | +-----------+ +-----------+ | | | 7 | -------------> | 7 | | | +-----------+ +-----------+ | | | | | | +-----------+ The value of page->compound_head is the same for all tail pages. The first page of page structs (page 0) associated with the HugeTLB page contains the 4 page structs necessary to describe the HugeTLB. The only use of the remaining pages of page structs (page 1 to page 7) is to point to page->compound_head. Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs will be used for each HugeTLB page. This will allow us to free the remaining 6 pages to the buddy allocator. Here is how things look after remapping. HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | ----------------^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | 3 | ------------------+ | | | | | | +-----------+ | | | | | | | 4 | --------------------+ | | | | 2MB | +-----------+ | | | | | | 5 | ----------------------+ | | | | +-----------+ | | | | | 6 | ------------------------+ | | | +-----------+ | | | | 7 | --------------------------+ | | +-----------+ | | | | | | +-----------+ When a HugeTLB is freed to the buddy system, we should allocate 6 pages for vmemmap pages and restore the previous mapping relationship. Apart from 2MB HugeTLB page, we also have 1GB HugeTLB page. It is similar to the 2MB HugeTLB page. We also can use this approach to free the vmemmap pages. In this case, for the 1GB HugeTLB page, we can save 4094 pages. This is a very substantial gain. On our server, run some SPDK/QEMU applications which will use 1024GB HugeTLB page. With this feature enabled, we can save ~16GB (1G hugepage)/~12GB (2MB hugepage) memory. Because there are vmemmap page tables reconstruction on the freeing/allocating path, it increases some overhead. Here are some overhead analysis. 1) Allocating 10240 2MB HugeTLB pages. a) With this patch series applied: # time echo 10240 > /proc/sys/vm/nr_hugepages real 0m0.166s user 0m0.000s sys 0m0.166s # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }' Attaching 2 probes... @latency: [8K, 16K) 5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [16K, 32K) 4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32K, 64K) 4 | | b) Without this patch series: # time echo 10240 > /proc/sys/vm/nr_hugepages real 0m0.067s user 0m0.000s sys 0m0.067s # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }' Attaching 2 probes... @latency: [4K, 8K) 10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [8K, 16K) 93 | | Summarize: this feature is about ~2x slower than before. 2) Freeing 10240 2MB HugeTLB pages. a) With this patch series applied: # time echo 0 > /proc/sys/vm/nr_hugepages real 0m0.213s user 0m0.000s sys 0m0.213s # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; } kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }' Attaching 2 probes... @latency: [8K, 16K) 6 | | [16K, 32K) 10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [32K, 64K) 7 | | b) Without this patch series: # time echo 0 > /proc/sys/vm/nr_hugepages real 0m0.081s user 0m0.000s sys 0m0.081s # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; } kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }' Attaching 2 probes... @latency: [4K, 8K) 6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [8K, 16K) 3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@ | [16K, 32K) 8 | | Summary: The overhead of __free_hugepage is about ~2-3x slower than before. Although the overhead has increased, the overhead is not significant. Like Mike said, "However, remember that the majority of use cases create HugeTLB pages at or shortly after boot time and add them to the pool. So, additional overhead is at pool creation time. There is no change to 'normal run time' operations of getting a page from or returning a page to the pool (think page fault/unmap)". Despite the overhead and in addition to the memory gains from this series. The following data is obtained by Joao Martins. Very thanks to his effort. There's an additional benefit which is page (un)pinners will see an improvement and Joao presumes because there are fewer memmap pages and thus the tail/head pages are staying in cache more often. Out of the box Joao saw (when comparing linux-next against linux-next + this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages): get_user_pages(): ~32k -> ~9k unpin_user_pages(): ~75k -> ~70k Usually any tight loop fetching compound_head(), or reading tail pages data (e.g. compound_head) benefit a lot. There's some unpinning inefficiencies Joao was fixing[2], but with that in added it shows even more: unpin_user_pages(): ~27k -> ~3.8k [1] https://lore.kernel.org/linux-mm/20210409205254.242291-1-mike.kravetz@oracle.com/ [2] https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.martins@oracle.com/ This patch (of 9): Move bootmem info registration common API to individual bootmem_info.c. And we will use {get,put}_page_bootmem() to initialize the page for the vmemmap pages or free the vmemmap pages to buddy in the later patch. So move them out of CONFIG_MEMORY_HOTPLUG_SPARSE. This is just code movement without any functional change. Link: https://lkml.kernel.org/r/20210510030027.56044-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210510030027.56044-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Oliver Neukum <oneukum@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Mina Almasry <almasrymina@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30Merge tag 'net-next-5.14' of ↵Linus Torvalds13-135/+1212
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - BPF: - add syscall program type and libbpf support for generating instructions and bindings for in-kernel BPF loaders (BPF loaders for BPF), this is a stepping stone for signed BPF programs - infrastructure to migrate TCP child sockets from one listener to another in the same reuseport group/map to improve flexibility of service hand-off/restart - add broadcast support to XDP redirect - allow bypass of the lockless qdisc to improving performance (for pktgen: +23% with one thread, +44% with 2 threads) - add a simpler version of "DO_ONCE()" which does not require jump labels, intended for slow-path usage - virtio/vsock: introduce SOCK_SEQPACKET support - add getsocketopt to retrieve netns cookie - ip: treat lowest address of a IPv4 subnet as ordinary unicast address allowing reclaiming of precious IPv4 addresses - ipv6: use prandom_u32() for ID generation - ip: add support for more flexible field selection for hashing across multi-path routes (w/ offload to mlxsw) - icmp: add support for extended RFC 8335 PROBE (ping) - seg6: add support for SRv6 End.DT46 behavior - mptcp: - DSS checksum support (RFC 8684) to detect middlebox meddling - support Connection-time 'C' flag - time stamping support - sctp: packetization Layer Path MTU Discovery (RFC 8899) - xfrm: speed up state addition with seq set - WiFi: - hidden AP discovery on 6 GHz and other HE 6 GHz improvements - aggregation handling improvements for some drivers - minstrel improvements for no-ack frames - deferred rate control for TXQs to improve reaction times - switch from round robin to virtual time-based airtime scheduler - add trace points: - tcp checksum errors - openvswitch - action execution, upcalls - socket errors via sk_error_report Device APIs: - devlink: add rate API for hierarchical control of max egress rate of virtual devices (VFs, SFs etc.) - don't require RCU read lock to be held around BPF hooks in NAPI context - page_pool: generic buffer recycling New hardware/drivers: - mobile: - iosm: PCIe Driver for Intel M.2 Modem - support for Qualcomm MSM8998 (ipa) - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU) - NXP SJA1110 Automotive Ethernet 10-port switch - Qualcomm QCA8327 switch support (qca8k) - Mikrotik 10/25G NIC (atl1c) Driver changes: - ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP (our first foray into MAC/PHY description via ACPI) - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx - Mellanox/Nvidia NIC (mlx5) - NIC VF offload of L2 bridging - support IRQ distribution to Sub-functions - Marvell (prestera): - add flower and match all - devlink trap - link aggregation - Netronome (nfp): connection tracking offload - Intel 1GE (igc): add AF_XDP support - Marvell DPU (octeontx2): ingress ratelimit offload - Google vNIC (gve): new ring/descriptor format support - Qualcomm mobile (rmnet & ipa): inline checksum offload support - MediaTek WiFi (mt76) - mt7915 MSI support - mt7915 Tx status reporting - mt7915 thermal sensors support - mt7921 decapsulation offload - mt7921 enable runtime pm and deep sleep - Realtek WiFi (rtw88) - beacon filter support - Tx antenna path diversity support - firmware crash information via devcoredump - Qualcomm WiFi (wcn36xx) - Wake-on-WLAN support with magic packets and GTK rekeying - Micrel PHY (ksz886x/ksz8081): add cable test support" * tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits) tcp: change ICSK_CA_PRIV_SIZE definition tcp_yeah: check struct yeah size at compile time gve: DQO: Fix off by one in gve_rx_dqo() stmmac: intel: set PCI_D3hot in suspend stmmac: intel: Enable PHY WOL option in EHL net: stmmac: option to enable PHY WOL with PMT enabled net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del} net: use netdev_info in ndo_dflt_fdb_{add,del} ptp: Set lookup cookie when creating a PTP PPS source. net: sock: add trace for socket errors net: sock: introduce sk_error_report net: dsa: replay the local bridge FDB entries pointing to the bridge dev too net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev net: dsa: include fdb entries pointing to bridge in the host fdb list net: dsa: include bridge addresses which are local in the host fdb list net: dsa: sync static FDB entries on foreign interfaces to hardware net: dsa: install the host MDB and FDB entries in the master's RX filter net: dsa: reference count the FDB addresses at the cross-chip notifier level net: dsa: introduce a separate cross-chip notifier type for host FDBs net: dsa: reference count the MDB entries at the cross-chip notifier level ...
2021-06-30Merge tag 'microblaze-v5.14' of git://git.monstr.eu/linux-2.6-microblazeLinus Torvalds2-4/+1
Pull microblaze updates from Michal Simek: - Remove unused PAGE_UP/DOWN macros - Fix trivial spelling mistake * tag 'microblaze-v5.14' of git://git.monstr.eu/linux-2.6-microblaze: arch: microblaze: Fix spelling mistake "vesion" -> "version" microblaze: Cleanup unused functions
2021-06-30Merge tag 'clang-features-v5.14-rc1' of ↵Linus Torvalds6-18/+28
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull clang feature updates from Kees Cook: - Add CC_HAS_NO_PROFILE_FN_ATTR in preparation for PGO support in the face of the noinstr attribute, paving the way for PGO and fixing GCOV. (Nick Desaulniers) - x86_64 LTO coverage is expanded to 32-bit x86. (Nathan Chancellor) - Small fixes to CFI. (Mark Rutland, Nathan Chancellor) * tag 'clang-features-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR compiler_attributes.h: cleanups for GCC 4.9+ compiler_attributes.h: define __no_profile, add to noinstr x86, lto: Enable Clang LTO for 32-bit as well CFI: Move function_nocfi() into compiler.h MAINTAINERS: Add Clang CFI section
2021-06-30Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-blockLinus Torvalds2-37/+12
Pull core block updates from Jens Axboe: - disk events cleanup (Christoph) - gendisk and request queue allocation simplifications (Christoph) - bdev_disk_changed cleanups (Christoph) - IO priority improvements (Bart) - Chained bio completion trace fix (Edward) - blk-wbt fixes (Jan) - blk-wbt enable/disable fix (Zhang) - Scheduler dispatch improvements (Jan, Ming) - Shared tagset scheduler improvements (John) - BFQ updates (Paolo, Luca, Pietro) - BFQ lock inversion fix (Jan) - Documentation improvements (Kir) - CLONE_IO block cgroup fix (Tejun) - Remove of ancient and deprecated block dump feature (zhangyi) - Discard merge fix (Ming) - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas, Yang) * tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits) block: fix discard request merge block/mq-deadline: Remove a WARN_ON_ONCE() call blk-mq: update hctx->dispatch_busy in case of real scheduler blk: Fix lock inversion between ioc lock and bfqd lock bfq: Remove merged request already in bfq_requests_merged() block: pass a gendisk to bdev_disk_changed block: move bdev_disk_changed block: add the events* attributes to disk_attrs block: move the disk events code to a separate file block: fix trace completion for chained bio block/partitions/msdos: Fix typo inidicator -> indicator block, bfq: reset waker pointer with shared queues block, bfq: check waker only for queues with no in-flight I/O block, bfq: avoid delayed merge of async queues block, bfq: boost throughput by extending queue-merging times block, bfq: consider also creation time in delayed stable merge block, bfq: fix delayed stable merge check block, bfq: let also stably merged queues enjoy weight raising blk-wbt: make sure throttle is enabled properly blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled() ...
2021-06-30MIPS: Fix PKMAP with 32-bit MIPS huge page supportWei Li1-1/+1
When 32-bit MIPS huge page support is enabled, we halve the number of pointers a PTE page holds, making its last half go to waste. Correspondingly, we should halve the number of kmap entries, as we just initialized only a single pte table for that in pagetable_init(). Fixes: 35476311e529 ("MIPS: Add partial 32-bit huge page support") Signed-off-by: Wei Li <liwei391@huawei.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-06-30MIPS: CI20: Add second percpu timer for SMP.周琰杰 (Zhou Yanjie)1-10/+14
1.Add a new TCU channel as the percpu timer of core1, this is to prepare for the subsequent SMP support. The newly added channel will not adversely affect the current single-core state. 2.Adjust the position of TCU node to make it consistent with the order in jz4780.dtsi file. Tested-by: Nikolaus Schaller <hns@goldelico.com> # on CI20 Signed-off-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com> Acked-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-06-30MIPS: CI20: Reduce clocksource to 750 kHz.周琰杰 (Zhou Yanjie)1-2/+2
The original clock (3 MHz) is too fast for the clocksource, there will be a chance that the system may get stuck. Reported-by: Nikolaus Schaller <hns@goldelico.com> Tested-by: Nikolaus Schaller <hns@goldelico.com> # on CI20 Signed-off-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com> Acked-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-06-30MIPS: Ingenic: Add MAC syscon nodes for Ingenic SoCs.周琰杰 (Zhou Yanjie)2-0/+14
Add MAC syscon nodes for X1000 SoC and X1830 SoC from Ingenic. Signed-off-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com> Acked-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-06-30MIPS: X1830: Respect cell count of common properties.周琰杰 (Zhou Yanjie)1-5/+4
If N fields of X cells should be provided, then that's what the devicetree should represent, instead of having one single field of (N * X) cells. Signed-off-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com> Acked-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-06-29Merge branch 'akpm' (patches from Andrew)Linus Torvalds75-794/+94
Merge misc updates from Andrew Morton: "191 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab, slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap, mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization, pagealloc, and memory-failure)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits) mm,hwpoison: make get_hwpoison_page() call get_any_page() mm,hwpoison: send SIGBUS with error virutal address mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes mm/page_alloc: allow high-order pages to be stored on the per-cpu lists mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA docs: remove description of DISCONTIGMEM arch, mm: remove stale mentions of DISCONIGMEM mm: remove CONFIG_DISCONTIGMEM m68k: remove support for DISCONTIGMEM arc: remove support for DISCONTIGMEM arc: update comment about HIGHMEM implementation alpha: remove DISCONTIGMEM and NUMA mm/page_alloc: move free_the_page mm/page_alloc: fix counting of managed_pages mm/page_alloc: improve memmap_pages dbg msg mm: drop SECTION_SHIFT in code comments mm/page_alloc: introduce vm.percpu_pagelist_high_fraction mm/page_alloc: limit the number of pages on PCP lists when reclaim is active mm/page_alloc: scale the number of pages that are batch freed ...
2021-06-29Merge tag 'acpi-5.14-rc1' of ↵Linus Torvalds1-71/+47
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI updates from Rafael Wysocki: "These update the ACPICA code in the kernel to the 20210604 upstream revision, add preliminary support for the Platform Runtime Mechanism (PRM), address issues related to the handling of device dependencies in the ACPI device eunmeration code, improve the tracking of ACPI power resource states, improve the ACPI support for suspend-to-idle on AMD systems, continue the unification of message printing in the ACPI code, address assorted issues and clean up the code in a number of places. Specifics: - Update ACPICA code in the kernel to upstrea revision 20210604 including the following changes: - Add defines for the CXL Host Bridge Structureand and add the CFMWS structure definition to CEDT (Alison Schofield). - iASL: Finish support for the IVRS ACPI table (Bob Moore). - iASL: Add support for the SVKL table (Bob Moore). - iASL: Add full support for RGRT ACPI table (Bob Moore). - iASL: Add support for the BDAT ACPI table (Bob Moore). - iASL: add disassembler support for PRMT (Erik Kaneda). - Fix memory leak caused by _CID repair function (Erik Kaneda). - Add support for PlatformRtMechanism OpRegion (Erik Kaneda). - Add PRMT module header to facilitate parsing (Erik Kaneda). - Add _PLD panel positions (Fabian Wüthrich). - MADT: add Multiprocessor Wakeup Mailbox Structure and the SVKL table headers (Kuppuswamy Sathyanarayanan). - Use ACPI_FALLTHROUGH (Wei Ming Chen). - Add preliminary support for the Platform Runtime Mechanism (PRM) to allow the AML interpreter to call PRM functions (Erik Kaneda). - Address some issues related to the handling of device dependencies reported by _DEP in the ACPI device enumeration code and clean up some related pieces of it (Rafael Wysocki). - Improve the tracking of states of ACPI power resources (Rafael Wysocki). - Improve ACPI support for suspend-to-idle on AMD systems (Alex Deucher, Mario Limonciello, Pratik Vishwakarma). - Continue the unification and cleanup of message printing in the ACPI code (Hanjun Guo, Heiner Kallweit). - Fix possible buffer overrun issue with the description_show() sysfs attribute method (Krzysztof Wilczyński). - Improve the acpi_mask_gpe kernel command line parameter handling and clean up the core ACPI code related to sysfs (Andy Shevchenko, Baokun Li, Clayton Casciato). - Postpone bringing devices in the general ACPI PM domain to D0 during resume from system-wide suspend until they are really needed (Dmitry Torokhov). - Make the ACPI processor driver fix up C-state latency if not ordered (Mario Limonciello). - Add support for identifying devices depening on the given one that are not its direct descendants with the help of _DEP (Daniel Scally). - Extend the checks related to ACPI IRQ overrides on x86 in order to avoid false-positives (Hui Wang). - Add battery DPTF participant for Intel SoCs (Sumeet Pawnikar). - Rearrange the ACPI fan driver and device power management code to use a common list of device IDs (Rafael Wysocki). - Fix clang CFI violation in the ACPI BGRT table parsing code and clean it up (Nathan Chancellor). - Add GPE-related quirks for some laptops to the EC driver (Chris Chiu, Zhang Rui). - Make the ACPI PPTT table parsing code populate the cache-id value if present in the firmware (James Morse). - Remove redundant clearing of context->ret.pointer from acpi_run_osc() (Hans de Goede). - Add missing acpi_put_table() in acpi_init_fpdt() (Jing Xiangfeng). - Make ACPI APEI handle ARM Processor Error CPER records like Memory Error ones to avoid user space task lockups (Xiaofei Tan). - Stop warning about disabled ACPI in APEI (Jon Hunter). - Fix fall-through warning for Clang in the SBSHC driver (Gustavo A. R. Silva). - Add custom DSDT file as Makefile prerequisite (Richard Fitzgerald). - Initialize local variable to avoid garbage being returned (Colin Ian King). - Simplify assorted pieces of code, address assorted coding style and documentation issues and comment typos (Baokun Li, Christophe JAILLET, Clayton Casciato, Liu Shixin, Shaokun Zhang, Wei Yongjun, Yang Li, Zhen Lei)" * tag 'acpi-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (97 commits) ACPI: PM: postpone bringing devices to D0 unless we need them ACPI: tables: Add custom DSDT file as makefile prerequisite ACPI: bgrt: Use sysfs_emit ACPI: bgrt: Fix CFI violation ACPI: EC: trust DSDT GPE for certain HP laptop ACPI: scan: Simplify acpi_table_events_fn() ACPI: PM: Adjust behavior for field problems on AMD systems ACPI: PM: s2idle: Add support for new Microsoft UUID ACPI: PM: s2idle: Add support for multiple func mask ACPI: PM: s2idle: Refactor common code ACPI: PM: s2idle: Use correct revision id ACPI: sysfs: Remove tailing return statement in void function ACPI: sysfs: Use __ATTR_RO() and __ATTR_RW() macros ACPI: sysfs: Sort headers alphabetically ACPI: sysfs: Refactor param_get_trace_state() to drop dead code ACPI: sysfs: Unify pattern of memory allocations ACPI: sysfs: Allow bitmap list to be supplied to acpi_mask_gpe ACPI: sysfs: Make sparse happy about address space in use ACPI: scan: Fix race related to dropping dependencies ACPI: scan: Reorganize acpi_device_add() ...
2021-06-29Merge branches 'clk-legacy', 'clk-vc5', 'clk-allwinner', 'clk-nvidia' and ↵Stephen Boyd23-514/+302
'clk-imx' into clk-next * clk-legacy: clkdev: remove unused clkdev_alloc() interfaces clkdev: remove CONFIG_CLKDEV_LOOKUP m68k: coldfire: remove private clk_get/clk_put m68k: coldfire: use clkdev_lookup on most coldfire mips: ralink: convert to CONFIG_COMMON_CLK mips: ar7: convert to CONFIG_COMMON_CLK mips: ar7: convert to clkdev_lookup * clk-vc5: clk: vc5: fix output disabling when enabling a FOD * clk-allwinner: clk: sunxi-ng: v3s: fix incorrect postdivider on pll-audio * clk-nvidia: clk: tegra: clk-tegra124-dfll-fcpu: don't use devm functions for regulator clk: tegra: tegra124-emc: Fix clock imbalance in emc_set_timing() clk: tegra: Add stubs needed for compile-testing clk: tegra: Don't deassert reset on enabling clocks clk: tegra: Mark external clocks as not having reset control clk: tegra: cclk: Handle thermal DIV2 CPU frequency throttling clk: tegra: Don't allow zero clock rate for PLLs clk: tegra: Halve SCLK rate on Tegra20 clk: tegra: Ensure that PLLU configuration is applied properly clk: tegra: Fix refcounting of gate clocks clk: tegra30: Use 300MHz for video decoder by default * clk-imx: clk: imx8mq: remove SYS PLL 1/2 clock gates clk: imx: scu: Do not enable runtime PM for CPU clks clk: imx: scu: add parent save and restore clk: imx: scu: Only save DC SS clock using non-cached clock rate clk: imx: scu: Add A72 frequency scaling support clk: imx: scu: Add A53 frequency scaling support clk: imx: scu: bypass pi_pll enable status restore clk: imx: scu: detach pd if can't power up clk: imx: scu: bypass cpu clock save and restore clk: imx: scu: add parallel port clock ops clk: imx: scu: add more scu clocks clk: imx: scu: add enet rgmii gpr clocks clk: imx8qm: add clock valid resource checking clk: imx8qxp: add clock valid checking mechnism clk: imx: scu: add gpr clocks support clk: imx: scu: remove legacy scu clock binding support dt-bindings: arm: imx: scu: drop deprecated legacy clock binding dt-bindings: arm: imx: scu: fix naming typo of clk compatible string clk: imx: Remove the audio ipg clock from imx8mp
2021-06-29Merge tag 'x86-entry-2021-06-29' of ↵Linus Torvalds16-217/+126
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry code related updates from Thomas Gleixner: - Consolidate the macros for .byte ... opcode sequences - Deduplicate register offset defines in include files - Simplify the ia32,x32 compat handling of the related syscall tables to get rid of #ifdeffery. - Clear all EFLAGS which are not required for syscall handling - Consolidate the syscall tables and switch the generation over to the generic shell script and remove the CFLAGS tweaks which are not longer required. - Use 'int' type for system call numbers to match the generic code. - Add more selftests for syscalls * tag 'x86-entry-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/syscalls: Don't adjust CFLAGS for syscall tables x86/syscalls: Remove -Wno-override-init for syscall tables x86/uml/syscalls: Remove array index from syscall initializers x86/syscalls: Clear 'offset' and 'prefix' in case they are set in env x86/entry: Use int everywhere for system call numbers x86/entry: Treat out of range and gap system calls the same x86/entry/64: Sign-extend system calls on entry to int selftests/x86/syscall: Add tests under ptrace to syscall_numbering_64 selftests/x86/syscall: Simplify message reporting in syscall_numbering selftests/x86/syscall: Update and extend syscall_numbering_64 x86/syscalls: Switch to generic syscallhdr.sh x86/syscalls: Use __NR_syscalls instead of __NR_syscall_max x86/unistd: Define X32_NR_syscalls only for 64-bit kernel x86/syscalls: Stop filling syscall arrays with *_sys_ni_syscall x86/syscalls: Switch to generic syscalltbl.sh x86/entry/x32: Rename __x32_compat_sys_* to __x64_compat_sys_*
2021-06-29Merge tag 'x86-irq-2021-06-29' of ↵Linus Torvalds7-53/+35
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 interrupt related updates from Thomas Gleixner: - Consolidate the VECTOR defines and the usage sites. - Cleanup GDT/IDT related code and replace open coded ASM with proper native helper functions. * tag 'x86-irq-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kexec: Set_[gi]dt() -> native_[gi]dt_invalidate() in machine_kexec_*.c x86: Add native_[ig]dt_invalidate() x86/idt: Remove address argument from idt_invalidate() x86/irq: Add and use NR_EXTERNAL_VECTORS and NR_SYSTEM_VECTORS x86/irq: Remove unused vectors defines
2021-06-29Merge tag 'timers-core-2021-06-29' of ↵Linus Torvalds2-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Time and clocksource/clockevent related updates: Core changes: - Infrastructure to support per CPU "broadcast" devices for per CPU clockevent devices which stop in deep idle states. This allows us to utilize the more efficient architected timer on certain ARM SoCs for normal operation instead of permanentely using the slow to access SoC specific clockevent device. - Print the name of the broadcast/wakeup device in /proc/timer_list - Make the clocksource watchdog more robust against delays between reading the current active clocksource and the watchdog clocksource. Such delays can be caused by NMIs, SMIs and vCPU preemption. Handle this by reading the watchdog clocksource twice, i.e. before and after reading the current active clocksource. In case that the two watchdog reads shows an excessive time delta, the read sequence is repeated up to 3 times. - Improve the debug output and add a test module for the watchdog mechanism. - Reimplementation of the venerable time64_to_tm() function with a faster and significantly smaller version. Straight from the source, i.e. the author of the related research paper contributed this! Driver changes: - No new drivers, not even new device tree bindings! - Fixes, improvements and cleanups and all over the place" * tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) time/kunit: Add missing MODULE_LICENSE() time: Improve performance of time64_to_tm() clockevents: Use list_move() instead of list_del()/list_add() clocksource: Print deviation in nanoseconds when a clocksource becomes unstable clocksource: Provide kernel module to test clocksource watchdog clocksource: Reduce clocksource-skew threshold clocksource: Limit number of CPUs checked for clock synchronization clocksource: Check per-CPU clock synchronization when marked unstable clocksource: Retry clock read if long delays detected clockevents: Add missing parameter documentation clocksource/drivers/timer-ti-dm: Drop unnecessary restore clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround clocksource/drivers/arm_global_timer: Remove duplicated argument in arm_global_timer clocksource/drivers/arm_global_timer: Make symbol 'gt_clk_rate_change_nb' static arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore clocksource/drivers/arm_global_timer: Implement rate compensation whenever source clock changes clocksource/drivers/ingenic: Rename unreasonable array names clocksource/drivers/timer-ti-dm: Save and restore timer TIOCP_CFG clocksource/drivers/mediatek: Ack and disable interrupts on suspend clocksource/drivers/samsung_pwm: Constify source IO memory ...
2021-06-29Merge tag 'irq-core-2021-06-29' of ↵Linus Torvalds31-12/+54
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for the interrupt subsystem: Core changes: - Cleanup and simplification of common code to invoke the low level interrupt flow handlers when this invocation requires irqdomain resolution. Add the necessary core infrastructure. - Provide a proper interface for modular PMU drivers to set the interrupt affinity. - Add a request flag which allows to exclude interrupts from spurious interrupt detection. Useful especially for IPI handlers which always return IRQ_HANDLED which turns the spurious interrupt detection into a pointless waste of CPU cycles. Driver changes: - Bulk convert interrupt chip drivers to the new irqdomain low level flow handler invocation mechanism. - Add device tree bindings for the Renesas R-Car M3-W+ SoC - Enable modular build of the Qualcomm PDC driver - The usual small fixes and improvements" * tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) dt-bindings: interrupt-controller: arm,gic-v3: Describe GICv3 optional properties irqchip: gic-pm: Remove redundant error log of clock bulk irqchip/sun4i: Remove unnecessary oom message irqchip/irq-imx-gpcv2: Remove unnecessary oom message irqchip/imgpdc: Remove unnecessary oom message irqchip/gic-v3-its: Remove unnecessary oom message irqchip/gic-v2m: Remove unnecessary oom message irqchip/exynos-combiner: Remove unnecessary oom message irqchip: Bulk conversion to generic_handle_domain_irq() genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ() genirq: Add generic_handle_domain_irq() helper irqchip/nvic: Convert from handle_IRQ() to handle_domain_irq() irqdesc: Fix __handle_domain_irq() comment genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and co irqdomain: Introduce irq_resolve_mapping() irqdomain: Protect the linear revmap with RCU irqdomain: Cache irq_data instead of a virq number in the revmap irqdomain: Use struct_size() helper when allocating irqdomain irqdomain: Make normal and nomap irqdomains exclusive powerpc: Move the use of irq_domain_add_nomap() behind a config option ...
2021-06-29Merge tag 'hyperv-next-signed-20210629' of ↵Linus Torvalds2-48/+1
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: "Just a few minor enhancement patches and bug fixes" * tag 'hyperv-next-signed-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: PCI: hv: Add check for hyperv_initialized in init_hv_pci_drv() Drivers: hv: Move Hyper-V extended capability check to arch neutral code drivers: hv: Fix missing error code in vmbus_connect() x86/hyperv: fix logical processor creation hv_utils: Fix passing zero to 'PTR_ERR' warning scsi: storvsc: Use blk_mq_unique_tag() to generate requestIDs Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer hv_balloon: Remove redundant assignment to region_start
2021-06-29mm,hwpoison: send SIGBUS with error virutal addressNaoya Horiguchi1-2/+11
Now an action required MCE in already hwpoisoned address surely sends a SIGBUS to current process, but the SIGBUS doesn't convey error virtual address. That's not optimal for hwpoison-aware applications. To fix the issue, make memory_failure() call kill_accessing_process(), that does pagetable walk to find the error virtual address. It could find multiple virtual addresses for the same error page, and it seems hard to tell which virtual address is correct one. But that's rare and sending incorrect virtual address could be better than no address. So let's report the first found virtual address for now. [naoya.horiguchi@nec.com: fix walk_page_range() return] Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Jue Wang <juew@google.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMAMike Rapoport26-40/+40
After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA configuration options are equivalent. Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead. Done with $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \ $(git grep -wl CONFIG_NEED_MULTIPLE_NODES) $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \ $(git grep -wl NEED_MULTIPLE_NODES) with manual tweaks afterwards. [rppt@linux.ibm.com: fix arm boot crash] Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29arch, mm: remove stale mentions of DISCONIGMEMMike Rapoport6-25/+4
There are several places that mention DISCONIGMEM in comments or have stale code guarded by CONFIG_DISCONTIGMEM. Remove the dead code and update the comments. Link: https://lkml.kernel.org/r/20210608091316.3622-7-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29m68k: remove support for DISCONTIGMEMMike Rapoport5-76/+1
DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map in v5.11. Remove the support for DISCONTIGMEM entirely. Link: https://lkml.kernel.org/r/20210608091316.3622-5-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29arc: remove support for DISCONTIGMEMMike Rapoport3-61/+0
DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map in v5.11. Remove the support for DISCONTIGMEM entirely. Link: https://lkml.kernel.org/r/20210608091316.3622-4-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Vineet Gupta <vgupta@synopsys.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29arc: update comment about HIGHMEM implementationMike Rapoport1-8/+5
Arc does not use DISCONTIGMEM to implement high memory, update the comment describing how high memory works to reflect this. Link: https://lkml.kernel.org/r/20210608091316.3622-3-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Vineet Gupta <vgupta@synopsys.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29alpha: remove DISCONTIGMEM and NUMAMike Rapoport15-540/+4
Patch series "Remove DISCONTIGMEM memory model", v3. SPARSEMEM memory model was supposed to entirely replace DISCONTIGMEM a (long) while ago. The last architectures that used DISCONTIGMEM were updated to use other memory models in v5.11 and it is about the time to entirely remove DISCONTIGMEM from the kernel. This set removes DISCONTIGMEM from alpha, arc and m68k, simplifies memory model selection in mm/Kconfig and replaces usage of redundant CONFIG_NEED_MULTIPLE_NODES and CONFIG_FLAT_NODE_MEM_MAP with CONFIG_NUMA and CONFIG_FLATMEM respectively. I've also removed NUMA support on alpha that was BROKEN for more than 15 years. There were also minor updates all over arch/ to remove mentions of DISCONTIGMEM in comments and #ifdefs. This patch (of 9): NUMA is marked broken on alpha for more than 15 years and DISCONTIGMEM was replaced with SPARSEMEM in v5.11. Remove both NUMA and DISCONTIGMEM support from alpha. Link: https://lkml.kernel.org/r/20210608091316.3622-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210608091316.3622-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: define default MAX_PTRS_PER_* in include/pgtable.hDaniel Axtens1-2/+0
Commit c65e774fb3f6 ("x86/mm: Make PGDIR_SHIFT and PTRS_PER_P4D variable") made PTRS_PER_P4D variable on x86 and introduced MAX_PTRS_PER_P4D as a constant for cases which need a compile-time constant (e.g. fixed-size arrays). powerpc likewise has boot-time selectable MMU features which can cause other mm "constants" to vary. For KASAN, we have some static PTE/PMD/PUD/P4D arrays so we need compile-time maximums for all these constants. Extend the MAX_PTRS_PER_ idiom, and place default definitions in include/pgtable.h. These define MAX_PTRS_PER_x to be PTRS_PER_x unless an architecture has defined MAX_PTRS_PER_x in its arch headers. Clean up pgtable-nop4d.h and s390's MAX_PTRS_PER_P4D definitions while we're at it: both can just pick up the default now. Link: https://lkml.kernel.org/r/20210624034050.511391-4-dja@axtens.net Signed-off-by: Daniel Axtens <dja@axtens.net> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Marco Elver <elver@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>