After I updated my Radeon VII system from kernel 5.6 to kernel 5.7, the
following started to happen on each boot:
...
BUG: kernel NULL pointer dereference, address: 0000000000000128
...
CPU: 9 PID: 1940 Comm: modprobe Tainted: G E 5.7.2-200.im0.fc32.x86_64 #1
Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 1407 04/02/2020
RIP: 0010:lock_bus+0x42/0x60 [amdgpu]
...
Call Trace:
i2c_smbus_xfer+0x3d/0xf0
i2c_default_probe+0xf3/0x130
i2c_detect.isra.0+0xfe/0x2b0
? kfree+0xa3/0x200
? kobject_uevent_env+0x11f/0x6a0
? i2c_detect.isra.0+0x2b0/0x2b0
__process_new_driver+0x1b/0x20
bus_for_each_dev+0x64/0x90
? 0xffffffffc0f34000
i2c_register_driver+0x73/0xc0
do_one_initcall+0x46/0x200
? _cond_resched+0x16/0x40
? kmem_cache_alloc_trace+0x167/0x220
? do_init_module+0x23/0x260
do_init_module+0x5c/0x260
__do_sys_init_module+0x14f/0x170
do_syscall_64+0x5b/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
...
The error appears when an I2C device driver tries to probe for devices
using the adapter registered by `smu_v11_0_i2c_eeprom_control_init()`.
The code supporting this adapter requires `adev->psp.ras.ras` to be
non-NULL, which is true only when `amdgpu_ras_init()` detects HW support
by calling `amdgpu_ras_check_supported()`.
Before 9015d60c9ee1, the adapter was registered by

    -> amdgpu_device_ip_init()
      -> amdgpu_ras_recovery_init()
        -> amdgpu_ras_eeprom_init()
          -> smu_v11_0_i2c_eeprom_control_init()

after verifying that `adev->psp.ras.ras` is not NULL in
`amdgpu_ras_recovery_init()`. Currently it is registered
unconditionally by

    -> amdgpu_device_ip_init()
      -> pp_sw_init()
        -> hwmgr_sw_init()
          -> vega20_smu_init()
            -> smu_v11_0_i2c_eeprom_control_init()

The fix simply adds a HW support check (ras == NULL => no support) before
calling `smu_v11_0_i2c_eeprom_control_{init,fini}()`.
Please note that a similar fix may also be required for CHIP_ARCTURUS.
I do not know whether any actual Arcturus hardware without RAS exists,
or whether calling `smu_i2c_eeprom_init()` makes any sense when there is
no HW support.
Cc: [email protected]
Fixes: 9015d60c9ee1 ("drm/amdgpu: Move EEPROM I2C adapter to amdgpu_device")
Signed-off-by: Ivan Mironov <[email protected]>
Tested-by: Bjorn Nostvold <[email protected]>
---
Changelog:
v1:
- Added "Tested-by" for another user who used this patch to fix the
same issue.
v0:
- Patch introduced.
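
Note for reviewers (not part of the commit message): the failure mode
described above can be modeled in userspace roughly as follows. This is
only a sketch under stated assumptions. Every `fake_*` name is
hypothetical and stands in for the corresponding amdgpu structure or
callback; none of this is the driver's actual code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the RAS context (adev->psp.ras.ras). */
struct fake_ras {
	int features;
};

/* Hypothetical stand-in for the device. */
struct fake_dev {
	struct fake_ras *ras;     /* NULL when the HW has no RAS support */
	bool adapter_registered;  /* models the EEPROM I2C adapter being visible */
};

/*
 * Models an adapter callback that needs the RAS context. In this sketch
 * the missing context is reported as an error instead of being
 * dereferenced.
 */
static int fake_lock_bus(struct fake_dev *dev)
{
	if (!dev->ras)
		return -1;
	return 0;
}

/* Guarded registration, mirroring the shape of the check added here. */
static int fake_smu_init(struct fake_dev *dev)
{
	if (dev->ras)
		dev->adapter_registered = true;
	return 0;
}

/* Guarded teardown, so init and fini stay symmetric. */
static void fake_smu_fini(struct fake_dev *dev)
{
	if (dev->ras)
		dev->adapter_registered = false;
}

/*
 * Models an I2C client driver probing the adapter.
 * Returns 0 if there is nothing to probe, 1 on a successful probe,
 * and -1 if the adapter callback fails.
 */
static int fake_probe(struct fake_dev *dev)
{
	if (!dev->adapter_registered)
		return 0;
	return fake_lock_bus(dev) == 0 ? 1 : -1;
}
```

With the guard in place, a device without RAS never exposes the adapter,
so a later probe finds nothing to call into. Without the guard in
`fake_smu_init()`, the same probe would reach `fake_lock_bus()` with
`ras == NULL`, which is the situation the trace above shows.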
---
drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c b/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c
index 2fb97554134f..c2e0fbbccf56 100644
--- a/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c
+++ b/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c
@@ -522,9 +522,11 @@ static int vega20_smu_init(struct pp_hwmgr *hwmgr)
 	priv->smu_tables.entry[TABLE_ACTIVITY_MONITOR_COEFF].version = 0x01;
 	priv->smu_tables.entry[TABLE_ACTIVITY_MONITOR_COEFF].size = sizeof(DpmActivityMonitorCoeffInt_t);
 
-	ret = smu_v11_0_i2c_eeprom_control_init(&adev->pm.smu_i2c);
-	if (ret)
-		goto err4;
+	if (adev->psp.ras.ras) {
+		ret = smu_v11_0_i2c_eeprom_control_init(&adev->pm.smu_i2c);
+		if (ret)
+			goto err4;
+	}
 
 	return 0;
 
@@ -560,7 +562,8 @@ static int vega20_smu_fini(struct pp_hwmgr *hwmgr)
 			(struct vega20_smumgr *)(hwmgr->smu_backend);
 	struct amdgpu_device *adev = hwmgr->adev;
 
-	smu_v11_0_i2c_eeprom_control_fini(&adev->pm.smu_i2c);
+	if (adev->psp.ras.ras)
+		smu_v11_0_i2c_eeprom_control_fini(&adev->pm.smu_i2c);
 
 	if (priv) {
 		amdgpu_bo_free_kernel(&priv->smu_tables.entry[TABLE_PPTABLE].handle,
--
2.26.2
The issue still reproduces on the latest 5.8.0-rc2+
(8be3a53e18e0e1a98f288f6c7f5e9da3adbe9c49).
Applied. Thanks!
Alex
On Thu, Jun 25, 2020 at 1:14 PM Ivan Mironov <[email protected]> wrote:
> [...]