drm/amdgpu: introduce a kind of halt state for amdgpu device

It is useful to maintain error context when debugging SW/FW issues. Introduce amdgpu_device_halt() for this purpose. It will bring hardware to a kind of halt state, so that no one can touch it any more. Compare to a simple hang, the system will keep stable at least for SSH access. Then it should be trivial to inspect the hardware state and see what's going on. v2: - Set adev->no_hw_access earlier to avoid potential crashes.(Christian) Suggested-by: Christian Koenig <christian.koenig@amd.com> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: Christian Koenig <christian.koenig@amd.co> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
author: Lang Yu <lang.yu@amd.com> 2021-12-09 15:10:32 +0800
committer: Alex Deucher <alexander.deucher@amd.com> 2021-12-13 16:33:16 -0500
commit: 34f3a4a98bd388ad6298c42dc9b00c72d3398330 (patch)
tree: f690ee30a63e7e3a94e4c8bce3dcfdd8c79bb3df /drivers/gpu/drm
parent: cace4bff750ff4f55b16c3aa90aa9376d7488929 (diff)
download: linux-34f3a4a98bd388ad6298c42dc9b00c72d3398330.tar.bz2
2 files changed, 41 insertions, 0 deletions
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 54c882a6b433..89278f17a70b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,
 void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
 		struct amdgpu_ring *ring);
 
+void amdgpu_device_halt(struct amdgpu_device *adev);
+
 /* atpx handler */
 #if defined(CONFIG_VGA_SWITCHEROO)
 void amdgpu_register_atpx_handler(void);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 15e3f4473e23..c9b8e5d8b824 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5663,3 +5663,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
 
 	amdgpu_asic_invalidate_hdp(adev, ring);
 }
+
+/**
+ * amdgpu_device_halt() - bring hardware to some kind of halt state
+ *
+ * @adev: amdgpu_device pointer
+ *
+ * Bring hardware to some kind of halt state so that no one can touch it
+ * any more. It will help to maintain error context when error occurred.
+ * Compare to a simple hang, the system will keep stable at least for SSH
+ * access. Then it should be trivial to inspect the hardware state and
+ * see what's going on. Implemented as following:
+ *
+ * 1. drm_dev_unplug() makes device inaccessible to user space(IOCTLs, etc),
+ *    clears all CPU mappings to device, disallows remappings through page faults
+ * 2. amdgpu_irq_disable_all() disables all interrupts
+ * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
+ * 4. set adev->no_hw_access to avoid potential crashes after setp 5
+ * 5. amdgpu_device_unmap_mmio() clears all MMIO mappings
+ * 6. pci_disable_device() and pci_wait_for_pending_transaction()
+ *    flush any in flight DMA operations
+ */
+void amdgpu_device_halt(struct amdgpu_device *adev)
+{
+	struct pci_dev *pdev = adev->pdev;
+	struct drm_device *ddev = &adev->ddev;
+
+	drm_dev_unplug(ddev);
+
+	amdgpu_irq_disable_all(adev);
+
+	amdgpu_fence_driver_hw_fini(adev);
+
+	adev->no_hw_access = true;
+
+	amdgpu_device_unmap_mmio(adev);
+
+	pci_disable_device(pdev);
+	pci_wait_for_pending_transaction(pdev);
+}
author	Lang Yu <lang.yu@amd.com>	2021-12-09 15:10:32 +0800
committer	Alex Deucher <alexander.deucher@amd.com>	2021-12-13 16:33:16 -0500
commit	34f3a4a98bd388ad6298c42dc9b00c72d3398330 (patch)
tree	f690ee30a63e7e3a94e4c8bce3dcfdd8c79bb3df /drivers/gpu/drm
parent	cace4bff750ff4f55b16c3aa90aa9376d7488929 (diff)
download	linux-34f3a4a98bd388ad6298c42dc9b00c72d3398330.tar.bz2