summaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
AgeCommit message (Collapse)AuthorFilesLines
2021-05-10drm/amdgpu: force enable gfx ras for vega20 wsStanley.Yang1-0/+21
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10drm/amdgpu: enable gfx ras in aldebran by defaultHawking Zhang1-1/+2
gfx ras now can be enabled by default in aldebaran Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10drm/amdgpu: enable ras error count query and reset for HDPHawking Zhang1-0/+10
add hdp block ras error query and reset support in amdgpu ras error count query and reset interface Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28drm/amdgpu: provide socket/die id info in RAS messageHawking Zhang1-2/+27
Add socket/die information in RAS messages for platforms that support query those information Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-23drm/amdgpu: disable gfx ras by default in aldebaranHawking Zhang1-2/+1
aldebaran gfx ras is still under development Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-23drm/amdgpu: optimize gfx ras features flag cleanStanley.Yang1-5/+5
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: Reset RAS error count and status regsMukul Joshi1-0/+6
Reset the RAS error count and error status registers after reading to prevent over reporting error counts on Aldebaran. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-By: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: fix a error injection failed issueDennis Li1-1/+1
because "sscanf(str, "retire_page")" always return 0, if application use the raw data for error injection, it always wrongly falls into "op == 3". Change to use strstr instead. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15drm/amdgpu: Add double-sscanf but invertLuben Tuikov1-2/+5
Add back the double-sscanf so that both decimal and hexadecimal values could be read in, but this time invert the scan so that hexadecimal format with a leading 0x is tried first, and if that fails, then try decimal format. Also use a logical-AND instead of nesting double if-conditional. See commit "drm/amdgpu: Fix a bug for input with double sscanf" Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15drm/amdgpu: Fix kernel-doc for the RAS sysfs interfaceLuben Tuikov1-23/+24
Imporve the kernel-doc for the RAS sysfs interface. Fix the grammar, fix the context. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15drm/amdgpu: Add bad_page_cnt_threshold to debugfsLuben Tuikov1-0/+2
Add bad_page_cnt_threshold to debugfs, an optional file system used for debugging, for reporting purposes only--it usually matches the size of EEPROM but may be different depending on the "bad_page_threshold" kernel module option. The "bad_page_cnt_threshold" is a dynamically computed value. It depends on three things: the VRAM size; the size of the EEPROM (or the size allocated to the RAS table therein); and the "bad_page_threshold" module parameter. It is a dynamically computed value, when the amdgpu module is run, on which further parameters and logic depend, and as such it is helpful to see the dynamically computed value in debugfs. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15drm/amdgpu: Fix a bug in checking the result of reserve pageLuben Tuikov1-6/+3
Fix if (ret) --> if (!ret), a bug, for "retire_page", which caused the kernel to recall the method with *pos == end of file, and that bounced back with error. On the first run, we advanced *pos, but returned 0 back to fs layer, also a bug. Fix the logic of the check of the result of amdgpu_reserve_page_direct()--it is 0 on success, and non-zero on error, not the other way around. This patch fixes this bug. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15drm/amdgpu: Fix a bug for input with double sscanfLuben Tuikov1-8/+5
Remove double-sscanf to scan for %llu and 0x%llx, as that is not going to work! The %llu will consume the "0" in "0x" of your input, and the hex value you think you're entering will always be 0. That is, a valid hex value can never be consumed. On the other hand, just entering a hex number without leading 0x will either be scanned as a string and not match, for instance FAB123, or the leading decimal portion is scanned as the %llu, for instance 123FAB will be scanned as 123, which is not correct. Thus remove the first %llu scan and leave only the %llx scan, removing the leading 0x since %llx can scan either. Addresses are usually always hex values, so this suffices. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Xinhui Pan <xinhui.pan@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: page retire over debugfs mechanismJohn Clements1-0/+67
added support in RAS debugfs to add bad page for isolated page retirement testing Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: RAS harvest on driver loadJohn Clements1-0/+29
In event of RAS UE + warm reset, error counters shall be harvested and cleared on driver load Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: split gfx callbacks into ras and non-ras onesHawking Zhang1-12/+18
gfx ras is only available in cerntain ip generations. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: split mmhub callbacks into ras and non-ras onesHawking Zhang1-8/+12
mmhub ras is only avaiable in cerntain mmhub ip generation. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: split umc callbacks to ras and non-ras onesHawking Zhang1-4/+6
umc ras is not managed by gpu driver when gpu is connected to cpu through xgmi. split umc callbacks into ras and non-ras ones so gpu driver only initializes umc ras callbacks when it manages umc ras. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: move xgmi ras functions to xgmi_ras_funcsHawking Zhang1-1/+3
xgmi ras is not managed by gpu driver when gpu is connected to cpu through xgmi. move all xgmi ras functions to xgmi_ras_funcs so gpu driver only initializes xgmi ras functions when it manages xgmi ras. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: split nbio callbacks into ras and non-ras onesHawking Zhang1-6/+24
nbio ras is not managed by gpu driver when gpu is connected to cpu through xgmi. split nbio callbacks into ras and non-ras ones so gpu driver only initializes nbio ras callbacks when it manages nbio ras. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: initialze ras caps per paltform configHawking Zhang1-12/+23
Driver only manages GFX/SDMA/MMHUB RAS in platforms that gpu node is connected to cpu through XGMI, other than that, it queries VBIOS for RAS capabilities. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amd: cleanup coding style a bitBernard Zhao1-4/+3
Fix patch check warning: WARNING: suspect code indent for conditional statements (8, 17) + if (obj && obj->use < 0) { + DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name); WARNING: braces {} are not necessary for single statement blocks + if (obj && obj->use < 0) { + DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name); + } Signed-off-by: Bernard Zhao <bernard@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: support sdma error injectionStanley.Yang1-0/+1
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reivewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: Convert sysfs sprintf/snprintf family to sysfs_emitTian Tao1-5/+3
Fix the following coccicheck warning: drivers/gpu//drm/amd/amdgpu/amdgpu_ras.c:434:9-17: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:220:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:249:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/df_v3_6.c:208:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_psp.c:2973:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:75:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:112:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:58:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:93:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:125:9-17: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:52:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:71:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:140:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:164:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:186:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:208:8-16: WARNING: use scnprintf or sprintf drivers/gpu//drm/amd/amdgpu/amdgpu_atombios.c:1916:8-16: WARNING: use scnprintf or sprintf Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09drm/amdgpu: Fix check for RAS supportLuben Tuikov1-9/+6
Use positive logic to check for RAS support. Rename the function to actually indicate what it is testing for. Essentially, make the function a predicate with the correct name. Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23drm/amdgpu: revert "reserve backup pages for bad page retirment"Christian König1-18/+11
As noted during the review this approach doesn't make sense at all. We should not apply any limitation on the VRAM applications can use inside the kernel. If an application or end user wants to reserve a certain amount of VRAM for bad pages handling we should do this in the upper layer. This reverts commit f89b881c81d9a6481fc17b46b351ca38f5dd6f3a. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23drm/amdgpu: fix send ras disable cmd when asic not support rasStanley.Yang1-11/+46
cause: It is necessary to send ras disable command to ras-ta during gfx block ras later init, because the ras capability is disable read from vbios for vega20 gaming, but the ras context is released during ras init process, this will cause send ras disable command to ras-to failed. how: Delay releasing ras context, the ras context will be released after gfx block later init done. Changed from V1: move release_ras_context into ras_resume Changed from V2: check BIT(UMC) is more reasonable before access eeprom table Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23drm/amdgpu: support query ecc cap for SIENNA_CICHLIDHawking Zhang1-2/+2
driver needs to query umc_info_v3_3 for ecc capability in sienna_cichlid Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23drm/amdgpu: skip read eeprom for device that pending on XGMI resetshaoyunl1-0/+6
Read eeprom through SMU doesn't works stable on XGMI reset during test. skip it for now Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23drm/amdgpu: harvest edc status when connected to host via xGMIDennis Li1-9/+41
When connected to a host via xGMI, system fatal errors may trigger warm reset, driver has no change to query edc status before reset. Therefore in this case, driver should harvest previous error loging registers during boot, instead of only resetting them. v2: 1. IP's ras_manager object is created when its ras feature is enabled, so change to query edc status after amdgpu_ras_late_init called 2. change to enable watchdog timer after finishing gfx edc init Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reivewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23drm/amdgpu: enable watchdog feature for SQ of aldebaranDennis Li1-0/+3
SQ's watchdog timer monitors forward progress, a mask of which waves caused the watchdog timeout is recorded into ras status registers and then trigger a system fatal error event. v2: 1. change *query_timeout_status to *query_sq_timeout_status. 2. move query_sq_timeout_status into amdgpu_ras_do_recovery. 3. add module parameters to enable/disable fatal error event and modify the watchdog timer. v3: 1. remove unused parameters of *enable_watchdog_timer Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-26drm/amdgpu: remove unnecessary reading for epprom headerDennis Li1-16/+0
If the number of badpage records exceed the threshold, driver has updated both epprom header and control->tbl_hdr.header before gpu reset, therefore GPU recovery thread no need to read epprom header directly. v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-26drm/amdgpu: reserve backup pages for bad page retirmentDennis Li1-11/+18
To ensure user has a constant of VRAM accessible in run-time, driver reserves limit backup pages when init, and return ones when bad pages retired, to keep no change of unused memory size. v2: refine codes to calculate badpags threshold Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-18drm/amdgpu: mark local function as staticNirmoy Das1-1/+1
Mark amdgpu_ras_debugfs_create_ctrl_node() as static. Fixes: eb14235668777b ("drm/amdgpu: do not keep debugfs dentry") Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-18drm/amdgpu: do not keep debugfs dentryNirmoy Das1-43/+30
Cleanup unnecessary debugfs dentries and surrounding functions. v3: remove return value check for debugfs_create_file() v2: remove ttm_debugfs_entries array. do not init variables. Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-01-15drm/amdgpu: toggle on DF Cstate after finishing xgmi injectionGuchun Chen1-1/+1
Fixes: 5c23e9e05e42 ("drm/amdgpu: Update RAS XGMI error inject sequence") Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-01-05drm/amdgpu: fix no bad_pages issue after umc ue injectionDennis Li1-4/+4
old code wrongly used the bad page status as the function return value, which cause amdgpu_ras_badpages_read always return failed. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-12-08drm/amdgpu: fix debugfs creation/removal, againArnd Bergmann1-8/+5
There is still a warning when CONFIG_DEBUG_FS is disabled: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function] 1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev) Change the code again to make the compiler actually drop this code but not warn about it. Fixes: ae2bf61ff39e ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS") Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-11-13drm/amd/amdgpu/amdgpu_ras: Make local function ↵Lee Jones1-2/+2
'amdgpu_ras_error_status_query' static Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1482:6: warning: no previous prototype for ‘amdgpu_ras_error_status_query’ [-Wmissing-prototypes] Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-11-13drm/amd/amdgpu/amdgpu_ras: Remove unused function 'amdgpu_ras_error_cure'Lee Jones1-7/+0
Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:908:5: warning: no previous prototype for ‘amdgpu_ras_error_cure’ [-Wmissing-prototypes] Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-11-02drm/amdgpu/amdgpu: improve code indentation and alignmentDeepak R Varma1-5/+5
General code indentation and alignment changes such as replace spaces by tabs or align function arguments as per the coding style guidelines. The patch corrects issues for various amdgpu_*.c files for this driver. Issue reported by checkpatch script. Signed-off-by: Deepak R Varma <mh12gx2825@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-30drm/amdgpu: remove unneeded semicolonTom Rix1-1/+1
A semicolon is not needed after a switch statement. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-30drm/amdgpu: fix the issue of reserving bad pages failedDennis Li1-111/+39
In amdgpu_ras_reset_gpu, because bad pages may not be freed, it has high probability to reserve bad pages failed. Change to reserve bad pages when freeing VRAM. v2: 1. avoid allocating the drm_mm node outside of amdgpu_vram_mgr.c 2. move bad page reserving into amdgpu_ras_add_bad_pages, if vram mgr reserve bad page failed, it will put it into pending list, otherwise put it into processed list; 3. remove amdgpu_ras_release_bad_pages, because retired page's info has been moved into amdgpu_vram_mgr v3: 1. formate code style; 2. rename amdgpu_vram_reserve_scope as amdgpu_vram_reservation; 3. rename scope_pending as reservations_pending; 4. rename scope_processed as reserved_pages; 5. change to iterate over all the pending ones and try to insert them with drm_mm_reserve_node(); v4: 1. rename amdgpu_vram_mgr_reserve_scope as amdgpu_vram_mgr_reserve_range; 2. remove unused include "amdgpu_ras.h"; 3. rename amdgpu_vram_mgr_check_and_reserve as amdgpu_vram_mgr_do_reserve; 4. refine amdgpu_vram_mgr_reserve_range to call amdgpu_vram_mgr_do_reserve. Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-30drm/amdgpu: change to save bad pages in UMC error interrupt callbackDennis Li1-4/+1
Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce the unnecessary calling of amdgpu_ras_save_bad_pages. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-21Revert drm/amdgpu: disable sienna chichlid UMC RASJohn Clements1-2/+2
This reverts commit 265c280a4807419249644156654e5c40a235ea84. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-25drm/amdgpu: fix a warning in amdgpu_ras.c (v2)Alex Deucher1-1/+4
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c: In function ‘amdgpu_ras_fs_init’: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1284:2: warning: ignoring return value of ‘sysfs_create_group’, declared with attribute warn_unused_result [-Wunused-result] 1284 | sysfs_create_group(&adev->dev->kobj, &group); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ v2: just print an error for sysfs group creation failure Acked-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-25drm/amdgpu: clean up ras sysfs creation (v2)Guchun Chen1-56/+31
Merge ras sysfs creation together by calling sysfs_create_group once, as sysfs_update_group may not work properly as expected. v2: improve commit message Signed-off-by: Guchun Chen <guchun.chen@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-25drm/amdgpu: disable sienna chichlid UMC RASJohn Clements1-2/+2
disable UMC RAS in lieu of stability issues on certain sku Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-22drm/amdgpu: update athub interrupt harvesting handleStanley.Yang1-1/+42
GCEA/MMHUB EA error should not result to DF freeze, this is fixed in next generation, but for some reasons the GCEA/MMHUB EA error will result to DF freeze in previous generation, diver should avoid to indicate GCEA/MMHUB EA error as hw fatal error in kernel message by read GCEA/MMHUB err status registers. Changed from V1: make query_ras_error_status function more general make read mmhub er status register more friendly Changed from V2: move ras error status query function into do_recovery workqueue Changed from V3: remove useless code from V2, print GCEA error status instance number Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-26drm/amdkfd: fix set kfd node ras properties valueStanley.Yang1-9/+19
The ctx->features are new RAS implementation which is only available for Vega20 and onwards, it is not available for vega10, vega10 should follow legacy ECC implementation. Changed from V1: wrap function to initialize kfd node properties Changed from V2: remove wrap function and SDMA SRAM ECC check Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>