summaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdkfd
AgeCommit message (Collapse)AuthorFilesLines
2023-01-10drm/amdkfd: Fix NULL pointer error for GC 11.0.1 on mGPUEric Huang1-1/+1
The point bo->kfd_bo is NULL for queue's write pointer BO when creating queue on mGPU. To avoid using the pointer fixes the error. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-10drm/amdkfd: Add sync after creating vram boEric Huang1-0/+9
There will be data corruption on vram allocated by svm if the initialization is not complete and application is writting on the memory. Adding sync to wait for the initialization completion is to resolve this issue. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-03drm/amdkfd: Fix kernel warning during topology setupMukul Joshi1-1/+1
This patch fixes the following kernel warning seen during driver load by correctly initializing the p2plink attr before creating the sysfs file: [ +0.002865] ------------[ cut here ]------------ [ +0.002327] kobject: '(null)' (0000000056260cfb): is not initialized, yet kobject_put() is being called. [ +0.004780] WARNING: CPU: 32 PID: 1006 at lib/kobject.c:718 kobject_put+0xaa/0x1c0 [ +0.001361] Call Trace: [ +0.001234] <TASK> [ +0.001067] kfd_remove_sysfs_node_entry+0x24a/0x2d0 [amdgpu] [ +0.003147] kfd_topology_update_sysfs+0x3d/0x750 [amdgpu] [ +0.002890] kfd_topology_add_device+0xbd7/0xc70 [amdgpu] [ +0.002844] ? lock_release+0x13c/0x2e0 [ +0.001936] ? smu_cmn_send_smc_msg_with_param+0x1e8/0x2d0 [amdgpu] [ +0.003313] ? amdgpu_dpm_get_mclk+0x54/0x60 [amdgpu] [ +0.002703] kgd2kfd_device_init.cold+0x39f/0x4ed [amdgpu] [ +0.002930] amdgpu_amdkfd_device_init+0x13d/0x1f0 [amdgpu] [ +0.002944] amdgpu_device_init.cold+0x1464/0x17b4 [amdgpu] [ +0.002970] ? pci_bus_read_config_word+0x43/0x80 [ +0.002380] amdgpu_driver_load_kms+0x15/0x100 [amdgpu] [ +0.002744] amdgpu_pci_probe+0x147/0x370 [amdgpu] [ +0.002522] local_pci_probe+0x40/0x80 [ +0.001896] work_for_cpu_fn+0x10/0x20 [ +0.001892] process_one_work+0x26e/0x5a0 [ +0.002029] worker_thread+0x1fd/0x3e0 [ +0.001890] ? process_one_work+0x5a0/0x5a0 [ +0.002115] kthread+0xea/0x110 [ +0.001618] ? kthread_complete_and_exit+0x20/0x20 [ +0.002422] ret_from_fork+0x1f/0x30 [ +0.001808] </TASK> [ +0.001103] irq event stamp: 59837 [ +0.001718] hardirqs last enabled at (59849): [<ffffffffb30fab12>] __up_console_sem+0x52/0x60 [ +0.004414] hardirqs last disabled at (59860): [<ffffffffb30faaf7>] __up_console_sem+0x37/0x60 [ +0.004414] softirqs last enabled at (59654): [<ffffffffb307d9c7>] irq_exit_rcu+0xd7/0x130 [ +0.004205] softirqs last disabled at (59649): [<ffffffffb307d9c7>] irq_exit_rcu+0xd7/0x130 [ +0.004203] ---[ end trace 0000000000000000 ]--- Fixes: 0f28cca87e9a ("drm/amdkfd: Extend KFD device topology to surface peer-to-peer links") Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2022-12-20drm/amdkfd: Fix double release compute pasidPhilip Yang1-3/+9
If kfd_process_device_init_vm returns failure after vm is converted to compute vm and vm->pasid set to compute pasid, KFD will not take pdd->drm_file reference. As a result, drm close file handler maybe called to release the compute pasid before KFD process destroy worker to release the same pasid and set vm->pasid to zero, this generates below WARNING backtrace and NULL pointer access. Add helper amdgpu_amdkfd_gpuvm_set_vm_pasid and call it at the last step of kfd_process_device_init_vm, to ensure vm pasid is the original pasid if acquiring vm failed or is the compute pasid with pdd->drm_file reference taken to avoid double release same pasid. amdgpu: Failed to create process VM object ida_free called for id=32770 which is not allocated. WARNING: CPU: 57 PID: 72542 at ../lib/idr.c:522 ida_free+0x96/0x140 RIP: 0010:ida_free+0x96/0x140 Call Trace: amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu] amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu] drm_file_free.part.13+0x216/0x270 [drm] drm_close_helper.isra.14+0x60/0x70 [drm] drm_release+0x6e/0xf0 [drm] __fput+0xcc/0x280 ____fput+0xe/0x20 task_work_run+0x96/0xc0 do_exit+0x3d0/0xc10 BUG: kernel NULL pointer dereference, address: 0000000000000000 RIP: 0010:ida_free+0x76/0x140 Call Trace: amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu] amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu] drm_file_free.part.13+0x216/0x270 [drm] drm_close_helper.isra.14+0x60/0x70 [drm] drm_release+0x6e/0xf0 [drm] __fput+0xcc/0x280 ____fput+0xe/0x20 task_work_run+0x96/0xc0 do_exit+0x3d0/0xc10 Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-20drm/amdkfd: Fix kfd_process_device_init_vm error handlingPhilip Yang1-6/+6
Should only destroy the ib_mem and let process cleanup worker to free the outstanding BOs. Reset the pointer in pdd->qpd structure, to avoid NULL pointer access in process destroy worker. BUG: kernel NULL pointer dereference, address: 0000000000000010 Call Trace: amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel+0x46/0xb0 [amdgpu] kfd_process_device_destroy_cwsr_dgpu+0x40/0x70 [amdgpu] kfd_process_destroy_pdds+0x71/0x190 [amdgpu] kfd_process_wq_release+0x2a2/0x3b0 [amdgpu] process_one_work+0x2a1/0x600 worker_thread+0x39/0x3d0 Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-29drm/amdkfd: Remove unnecessary condition in kfd_topology_add_device()Dan Carpenter1-3/+2
We re-arranged this code recently so "ret" is always zero at this point. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-29drm/amdkfd: add GC 11.0.4 KFD supportYifan Zhang2-0/+3
Add initial support for GC 11.0.4 in KFD compute driver. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Aaron Liu <aaron.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-23drm/amdkfd: Release the topology_lock in error caseFelix Kuehling1-55/+65
Move the topology-locked part of kfd_topology_add_device into a separate function to simlpify error handling and release the topology lock consistently. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com> Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-22Merge tag 'amd-drm-next-6.2-2022-11-18' of ↵Dave Airlie7-29/+26
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.2-2022-11-18: amdgpu: - SR-IOV fixes - Clean up DC checks - DCN 3.2.x fixes - DCN 3.1.x fixes - Don't enable degamma on asics which don't support it - IP discovery fixes - BACO fixes - Fix vbios allocation handling when vkms is enabled - Drop buggy tdr advanced mode GPU reset handling - Fix the build when DCN is not set in kconfig - MST DSC fixes - Userptr fixes - FRU and RAS EEPROM fixes - VCN 4.x RAS support - Aldrebaran CU occupancy reporting fix - PSP ring cleanup amdkfd: - Memory limit fix - Enable cooperative launch on gfx 10.3 amd-drm-next-6.2-2022-11-11: amdgpu: - SMU 13.x updates - GPUVM TLB race fix - DCN 3.1.4 updates - DCN 3.2.x updates - PSR fixes - Kerneldoc fix - Vega10 fan fix - GPUVM locking fixes in error pathes - BACO fix for Beige Goby - EEPROM I2C address cleanup - GFXOFF fix - Fix DC memory leak in error pathes - Flexible array updates - Mtype fix for GPUVM PTEs - Move Kconfig into amdgpu directory - SR-IOV updates - Fix possible memory leak in CS IOCTL error path amdkfd: - Fix possible memory overrun - CRIU fixes radeon: - ACPI ref count fix - HDA audio notifier support - Move Kconfig into radeon directory UAPI: - Add new GEM_CREATE flags to help to transition more KFD functionality to the DRM UAPI. These are used internally in the driver to align location based memory coherency requirements from memory allocated in the KFD with how we manage GPUVM PTEs. They are currently blocked in the GEM_CREATE IOCTL as we don't have a user right now. They are just used internally in the kernel driver for now for existing KFD memory allocations. So a change to the UAPI header, but no functional change in the UAPI. From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221118170807.6505-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2022-11-17drm/amdkfd: enable cooperative launch for gfx10.3Jonathan Kim1-1/+4
FW fix available to enable cooperative launch for GFX10.3. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-17drm/amdgpu: cleanup amdgpu_hmm_range_get_pagesChristian König1-3/+3
Remove unused parameters and cleanup dead code. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-17drm/amdgpu: rename the files for HMM handlingChristian König2-2/+1
Clean that up a bit, no functional change. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-16Merge tag 'drm-misc-next-2022-11-10-1' of ↵Dave Airlie1-11/+6
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for 6.2: UAPI Changes: Cross-subsystem Changes: Core Changes: - atomic-helper: Add begin_fb_access and end_fb_access hooks - fb-helper: Rework to move fb emulation into helpers - scheduler: rework entity flush, kill and fini - ttm: Optimize pool allocations Driver Changes: - amdgpu: scheduler rework - hdlcd: Switch to DRM-managed resources - ingenic: Fix registration error path - lcdif: FIFO threshold tuning - meson: Fix return type of cvbs' mode_valid - ofdrm: multiple fixes (kconfig, types, endianness) - sun4i: A100 and D1 support - panel: - New Panel: Jadard JD9365DA-H3 Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/20221110083612.g63eaocoaa554soh@houat
2022-11-09drm/amdkfd: Make kfd_fill_cache_non_crat_info() as staticMa Jun1-1/+1
kfd_fill_cache_non_crat_info() is only used in kfd_topology.c, so make it as static. Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-09drm/amdkfd: Fix the memory overrunMa Jun1-1/+1
Fix the memory overrun issue caused by wrong array size. Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reported-by: coverity-bot <keescook+coverity-bot@chromium.org> Addresses-Coverity-ID: 1527133 ("Memory - corruptions") Fixes: c0cc999f3c32e6 ("drm/amdkfd: Fix the warning of array-index-out-of-bounds") Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-09drm/amdkfd: Fix error handling in criu_checkpointFelix Kuehling1-19/+15
Checkpoint BOs last. That way we don't need to close dmabuf FDs if something else fails later. This avoids problematic access to user mode memory in the error handling code path. criu_checkpoint_bos has its own error handling and cleanup that does not depend on access to user memory. In the private data, keep BOs before the remaining objects. This is necessary to restore things in the correct order as restoring events depends on the events-page BO being restored first. Fixes: be072b06c739 ("drm/amdkfd: CRIU export BOs as prime dmabuf objects") Reported-by: Jann Horn <jannh@google.com> CC: Rajneesh Bhardwaj <Rajneesh.Bhardwaj@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-and-tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-09drm/amdkfd: Fix error handling in kfd_criu_restore_eventsFelix Kuehling1-2/+1
mutex_unlock before the exit label because all the error code paths that jump there didn't take that lock. This fixes unbalanced locking errors in case of restore errors. Fixes: 40e8a766a761 ("drm/amdkfd: CRIU checkpoint and restore events") Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdkfd: Fix the warning of array-index-out-of-boundsMa Jun4-292/+282
For some GPUs with more CUs, the original sibling_map[32] in struct crat_subtype_cache is not enough to save the cache information when create the VCRAT table, so skip filling the struct crat_subtype_cache info instead fill struct kfd_cache_properties directly to fix this problem. Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdkfd: update GFX11 CWSR trap handlerJay Cornwall2-381/+389
With corresponding FW change fixes issue where triggering CWSR on a workgroup with waves in s_barrier wouldn't lead to a back-off and therefore cause a hang. Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Tested-by: Graham Sider <Graham.Sider@amd.com> Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Graham Sider <Graham.Sider@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-03drm/amdgpu: cleanup scheduler job initialization v2Christian König1-11/+6
Init the DRM scheduler base class while allocating the job. This makes the whole handling much more cleaner. v2: fix coding style Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-7-christian.koenig@amd.com
2022-11-01drm/amdkfd: Remove unused variableMa Jun2-2/+0
kfd_topology_device->cache_count is not used by other fucntions, so remove it. Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdkfd: Cleanup kfd_dev structMukul Joshi7-53/+47
Cleanup kfd_dev struct by removing ddev and pdev as both drm_device and pci_dev can be fetched from amdgpu_device. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdkfd: Fix NULL pointer dereference in svm_migrate_to_ram()Yang Li1-3/+1
./drivers/gpu/drm/amd/amdkfd/kfd_migrate.c:985:58-62: ERROR: p is NULL but dereferenced. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2549 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdkfd: remove unused struct cdit_headerPaulo Miguel Almeida1-23/+1
struct cdit_header was never used across any of the amd drivers nor this is exposed to UAPI so it can be removed. This patch removes struct cdit_header and refactor code accordingly Signed-off-by: Paulo Miguel Almeida <paulo.miguel.almeida.rodenas@gmail.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdkfd: remove unused kfd_pm4_headers_diq header filePaulo Miguel Almeida1-291/+0
kfd_pm4_headers_diq.h header is a leftover from the old H/W debugger module support added on commit fbeb661bfa895dc ("drm/amdkfd: Add skeleton H/W debugger module support"). That implementation was removed after a while and the last file that included that header was removed on commit 5bdd3eb253544b1 ("drm/amdkfd: Remove unused old debugger implementation"). This patch removes the unused header file kfd_pm4_headers_diq.h Signed-off-by: Paulo Miguel Almeida <paulo.miguel.almeida.rodenas@gmail.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24drm/amdkfd: introduce dummy cache info for property asicPrike Liang1-1/+52
This dummy cache info will enable kfd base function support. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24drm/amdkfd: correct the cache info for gfx1036Jesse Zhang1-1/+52
correct the cache information for gfx1036 Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24drm/amdkfd: update gfx1037 Lx cache settingPrike Liang1-1/+52
Update the gfx1037 L1/L2 cache setting. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24drm/amdkfd: use vma_lookup() instead of find_vma()Deming Wang2-13/+12
Using vma_lookup() verifies the start address is contained in the found vma. This results in easier to read the code. Signed-off-by: Deming Wang <wangdeming@inspur.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-14Merge tag 'mm-stable-2022-10-13' of ↵Linus Torvalds3-13/+19
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - fix a race which causes page refcounting errors in ZONE_DEVICE pages (Alistair Popple) - fix userfaultfd test harness instability (Peter Xu) - various other patches in MM, mainly fixes * tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (29 commits) highmem: fix kmap_to_page() for kmap_local_page() addresses mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page mm/selftest: uffd: explain the write missing fault check mm/hugetlb: use hugetlb_pte_stable in migration race check mm/hugetlb: fix race condition of uffd missing/minor handling zram: always expose rw_page LoongArch: update local TLB if PTE entry exists mm: use update_mmu_tlb() on the second thread kasan: fix array-bounds warnings in tests hmm-tests: add test for migrate_device_range() nouveau/dmem: evict device private memory during release nouveau/dmem: refactor nouveau_dmem_fault_copy_one() mm/migrate_device.c: add migrate_device_range() mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page() mm/memremap.c: take a pgmap reference on page allocation mm: free device private pages have zero refcount mm/memory.c: fix race when faulting a device private page mm/damon: use damon_sz_region() in appropriate place mm/damon: move sz_damon_region to damon_sz_region lib/test_meminit: add checks for the allocation functions ...
2022-10-12mm: free device private pages have zero refcountAlistair Popple1-1/+1
Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount") device private pages have no longer had an extra reference count when the page is in use. However before handing them back to the owning device driver we add an extra reference count such that free pages have a reference count of one. This makes it difficult to tell if a page is free or not because both free and in use pages will have a non-zero refcount. Instead we should return pages to the drivers page allocator with a zero reference count. Kernel code can then safely use kernel functions such as get_page_unless_zero(). Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/memory.c: fix race when faulting a device private pageAlistair Popple3-12/+18
Patch series "Fix several device private page reference counting issues", v2 This series aims to fix a number of page reference counting issues in drivers dealing with device private ZONE_DEVICE pages. These result in use-after-free type bugs, either from accessing a struct page which no longer exists because it has been removed or accessing fields within the struct page which are no longer valid because the page has been freed. During normal usage it is unlikely these will cause any problems. However without these fixes it is possible to crash the kernel from userspace. These crashes can be triggered either by unloading the kernel module or unbinding the device from the driver prior to a userspace task exiting. In modules such as Nouveau it is also possible to trigger some of these issues by explicitly closing the device file-descriptor prior to the task exiting and then accessing device private memory. This involves some minor changes to both PowerPC and AMD GPU code. Unfortunately I lack hardware to test either of those so any help there would be appreciated. The changes mimic what is done in for both Nouveau and hmm-tests though so I doubt they will cause problems. This patch (of 8): When the CPU tries to access a device private page the migrate_to_ram() callback associated with the pgmap for the page is called. However no reference is taken on the faulting page. Therefore a concurrent migration of the device private page can free the page and possibly the underlying pgmap. This results in a race which can crash the kernel due to the migrate_to_ram() function pointer becoming invalid. It also means drivers can't reliably read the zone_device_data field because the page may have been freed with memunmap_pages(). Close the race by getting a reference on the page while holding the ptl to ensure it has not been freed. Unfortunately the elevated reference count will cause the migration required to handle the fault to fail. To avoid this failure pass the faulting page into the migrate_vma functions so that if an elevated reference count is found it can be checked to see if it's expected or not. [mpe@ellerman.id.au: fix build] Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Lyude Paul <lyude@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-06drm/amdgpu: Enable F32_WPTR_POLL_ENABLE in mqdRuili Ji1-1/+2
This patch is to fix the SDMA user queue doorbell missing issue on SDMA 6.0. F32_WPTR_POLL_ENABLE has to be set if doorbell mode is used. Otherwise ringing SDMA user queue doorbell can't wake up system from gfxoff. Signed-off-by: Ruili Ji <ruiliji2@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.0.x
2022-09-30drm/amdkfd: Fix UBSAN shift-out-of-bounds warningFelix Kuehling1-24/+21
This was fixed in initialize_cpsch before, but not in initialize_nocpsch. Factor sdma bitmap initialization into a helper function to apply the correct implementation in both cases without duplicating it. v2: Added a range check Reported-by: Ellis Michael <ellis@ellismichael.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Graham Sider <Graham.Sider@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-30drm/amdkfd: Track unified memory when switching xnack modePhilip Yang3-7/+80
Unified memory usage with xnack off is tracked to avoid oversubscribe system memory, with xnack on, we don't track unified memory usage to allow memory oversubscribe. When switching xnack mode from off to on, subsequent free ranges allocated with xnack off will not unreserve memory. When switching xnack mode from on to off, subsequent free ranges allocated with xnack on will unreserve memory. Both cases cause memory accounting unbalanced. When switching xnack mode from on to off, need reserve already allocated svm range memory. When switching xnack mode from off to on, need unreserve already allocated svm range memory. v6: Take prange lock to access range child list v5: Handle prange child ranges v4: Handle reservation memory failure v3: Handle switching xnack mode race with svm_range_deferred_list_work v2: Handle both switching xnack from on to off and from off to on cases Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29drm/amdkfd: fix dropped interrupt in kfd_int_process_v11Graham Sider1-3/+3
Shader wave interrupts were getting dropped in event_interrupt_wq_v11 if the PRIV bit was set to 1. This would often lead to a hang. Until debugger logic is upstreamed, expand comment to stop early return. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29drm/amdgpu: pass queue size and is_aql_queue to MESGraham Sider1-0/+2
Update mes_v11_api_def.h add_queue API with is_aql_queue parameter. Also re-use gds_size for the queue size (unused for KFD). MES requires the queue size in order to compute the actual wptr offset within the queue RB since it increases monotonically for AQL queues. v2: Make is_aql_queue assign clearer Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29drm/amdkfd: fix MQD init for GFX11 in init_mqdGraham Sider1-0/+4
Set remaining compute_static_thread_mgmt_se* accordingly. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29drm/amdgpu: Enable SA software trap.David Belanger2-384/+408
Enables support for software trap for MES >= 4. Adapted from implementation from Jay Cornwall. v2: Add IP version check in conditions. v3: Remove debugger code changes. Signed-off-by: Jay Cornwall <Jay.Cornwall@amd.com> Signed-off-by: David Belanger <david.belanger@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-19drm/amdkfd: Fix spelling mistake "detroyed" -> "destroyed"Colin Ian King1-1/+1
There is a spelling mistake in a pr_debug message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-19drm/amdkfd: Use the consolidated MQD manager functions for GFX11Shiwu Zhang1-73/+12
To remove duplication for GFX11 as well, use the common MQD manager functions defined in kfd_mqd_manager.c for all version of managers Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13drm/amdkfd: Migrate in CPU page fault use current mmPhilip Yang1-1/+2
migrate_vma_setup shows below warning because we don't hold another process mm mmap_lock. We should use current vmf->vma->vm_mm instead, the caller already hold current mmap lock inside CPU page fault handler. WARNING: CPU: 10 PID: 3054 at include/linux/mmap_lock.h:155 find_vma Call Trace: walk_page_range+0x76/0x150 migrate_vma_setup+0x18a/0x640 svm_migrate_vram_to_ram+0x245/0xa10 [amdgpu] svm_migrate_to_ram+0x36f/0x470 [amdgpu] do_swap_page+0xcfe/0xec0 __handle_mm_fault+0x96b/0x15e0 handle_mm_fault+0x13f/0x3e0 do_user_addr_fault+0x1e7/0x690 Fixes: e1f84eef313f ("drm/amdkfd: handle CPU fault on COW mapping") Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13drm/amdkfd: Remove prefault before migrating to VRAMPhilip Yang3-31/+5
Prefaulting potentially allocates system memory pages before a migration. This adds unnecessary overhead. Instead we can skip unallocated pages in the migration and just point migrate->dst to a 0-initialized VRAM page directly. Then the VRAM page will be inserted to the PTE. A subsequent CPU page fault will migrate the page back to system memory. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13drm/amdkfd: handle CPU fault on COW mappingPhilip Yang1-13/+29
If CPU page fault in a page with zone_device_data svm_bo from another process, that means it is COW mapping in the child process and the range is migrated to VRAM by parent process. Migrate the parent process range back to system memory to recover the CPU page fault. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13amd/amdkfd: fix repeated words in commentswangjianli1-1/+1
Delete the redundant word 'to'. Signed-off-by: wangjianli <wangjianli@cdjrlc.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13drm/amdkfd: Fix CRIU restore op due to doorbell offsetRajneesh Bhardwaj3-0/+16
Recently introduced change to allocate doorbells only when the first queue is created or mapped for CPU / GPU access, did not consider Checkpoint Restore scenario completely. This fix allows the CRIU restore operation by extending the doorbell optimization to CRIU restore scenario. Fixes: 16f0013157bf ("drm/amdkfd: Allocate doorbells only when needed") Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-12Merge tag 'amd-drm-next-6.1-2022-09-08' of ↵Dave Airlie5-17/+34
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.1-2022-09-08: amdgpu: - Mode2 reset for RDNA2 - Lots of new DC documentation - Add documentation about different asic families - DSC improvements - Aldebaran fixes - Misc spelling and grammar fixes - GFXOFF stats support for vangogh - DC frame size fixes - NBIO 7.7 updates - DCN 3.2 updates - DCN 3.1.4 Updates - SMU 13.x updates - Misc bug fixes - Rework DC register offset handling - GC 11.x updates - PSP 13.x updates - SDMA 6.x updates - GMC 11.x updates - SR-IOV updates - PSP fixes for TA unloading - DSC passthrough support - Misc code cleanups amdkfd: - ISA fixes for some GC 10.3 IPs - Misc code cleanups radeon: - Delayed work flush fix - Use time_after for some jiffies calculations drm: - DSC passthrough aux support Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220908155202.57862-1-alexander.deucher@amd.com
2022-08-30drm/amdkfd: Added GFX 11.0.3 SupportDavid Belanger2-0/+9
Added missing cases for GFX 11.0.3 code in a few switch statements. Signed-off-by: David Belanger <david.belanger@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-29drm/amdkfd: remove redundant variables err and retJinpeng Cui1-7/+2
Return value from kfd_wait_on_events() and io_remap_pfn_range() directly instead of taking this in another redundant variable. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Jinpeng Cui <cui.jinpeng2@zte.com.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-25drm/amdkfd: Fix isa version for the GC 10.3.7Prike Liang1-5/+1
Correct the isa version for handling KFD test. Fixes: 7c4f4f197e0c ("drm/amdkfd: Add GC 10.3.6 and 10.3.7 KFD definitions") Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Aaron Liu <aaron.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>