Hi,
The goal of this patchset is to improve debugging device resets on amdgpu.
The first patch creates a new module parameter to disable soft recoveries,
ensuring every recovery go through the full device reset, making easier to
generate resets from userspace tools like [0] and [1]. This is important to
validate how the stack behaves on resets, from end-to-end.
The second patch is a small addition to mark guilty jobs that causes soft
recoveries for API consistency.
The last patches are a rework to store more information at devcoredump files,
making it more useful to be attached to bug reports.
The new coredump content look like this:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.4.0-rc7-tony+
module: amdgpu
time: 702.743534320
process_name: vulkan-triangle PID: 4561
IBs:
[0] 0xffff800100545000
[1] 0xffff800100001000
ring name: gfx_0.0.0
Due to nested IBs, this may not be the one that really caused the hang, but it
gives some direction.
Thanks,
André
[0] https://gitlab.freedesktop.org/andrealmeid/gpu-timeout
[1] https://github.com/andrealmeid/vulkan-triangle-v1
André Almeida (6):
drm/amdgpu: Create a module param to disable soft recovery
drm/amdgpu: Mark contexts guilty for causing soft recoveries
drm/amdgpu: Rework coredump to use memory dynamically
drm/amdgpu: Limit info in coredump for kernel threads
drm/amdgpu: Log IBs and ring name at coredump
drm/amdgpu: Create version number for coredumps
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 21 +++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 6 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 99 +++++++++++++++++-----
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 9 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 6 +-
5 files changed, 112 insertions(+), 29 deletions(-)
--
2.41.0
Create a module parameter to disable soft recoveries on amdgpu, making
every recovery go through the device reset path. This option makes
easier to force device resets for testing and debugging purposes.
Signed-off-by: André Almeida <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 9 +++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 6 +++++-
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index a84bd4a0c421..dbe062a087c5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -189,6 +189,7 @@ extern uint amdgpu_force_long_training;
extern int amdgpu_lbpw;
extern int amdgpu_compute_multipipe;
extern int amdgpu_gpu_recovery;
+extern bool amdgpu_soft_recovery;
extern int amdgpu_emu_mode;
extern uint amdgpu_smu_memory_pool_size;
extern int amdgpu_smu_pptable_id;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3b711babd4e2..7c69f3169aa6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -163,6 +163,7 @@ uint amdgpu_force_long_training;
int amdgpu_lbpw = -1;
int amdgpu_compute_multipipe = -1;
int amdgpu_gpu_recovery = -1; /* auto */
+bool amdgpu_soft_recovery = true;
int amdgpu_emu_mode;
uint amdgpu_smu_memory_pool_size;
int amdgpu_smu_pptable_id = -1;
@@ -540,6 +541,14 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
+/**
+ * DOC: gpu_soft_recovery (bool)
+ * Set true to allow the driver to try soft recoveries if a job get stuck. Set
+ * to false to always force a GPU reset during recovery.
+ */
+MODULE_PARM_DESC(gpu_soft_recovery, "Enable GPU soft recovery mechanism (default: true)");
+module_param_named(gpu_soft_recovery, amdgpu_soft_recovery, bool, 0644);
+
/**
* DOC: emu_mode (int)
* Set value 1 to enable emulation mode. This is only needed when running on an emulator. The default is 0 (disabled).
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 80d6e132e409..40678d9fb17e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -434,8 +434,12 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
struct dma_fence *fence)
{
unsigned long flags;
+ ktime_t deadline;
- ktime_t deadline = ktime_add_us(ktime_get(), 10000);
+ if (!amdgpu_soft_recovery)
+ return false;
+
+ deadline = ktime_add_us(ktime_get(), 10000);
if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence)
return false;
--
2.41.0
If a DRM fence is set to -ENODATA, that means that this context was a
cause of a soft reset, but is never marked as guilty. Flag it as guilty
and log to user that this context won't accept more submissions.
Signed-off-by: André Almeida <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 0dc9c655c4fb..fe8e47d063da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -459,6 +459,12 @@ int amdgpu_ctx_get_entity(struct amdgpu_ctx *ctx, u32 hw_ip, u32 instance,
ctx_entity = &ctx->entities[hw_ip][ring]->entity;
r = drm_sched_entity_error(ctx_entity);
if (r) {
+ if (r == -ENODATA) {
+ DRM_ERROR("%s (%d) context caused a reset,"
+ "marking it guilty and refusing new submissions.\n",
+ current->comm, current->pid);
+ atomic_set(&ctx->guilty, 1);
+ }
DRM_DEBUG("error entity %p\n", ctx_entity);
return r;
}
--
2.41.0
Am 11.07.23 um 23:34 schrieb André Almeida:
> If a DRM fence is set to -ENODATA, that means that this context was a
> cause of a soft reset, but is never marked as guilty. Flag it as guilty
> and log to user that this context won't accept more submissions.
>
> Signed-off-by: André Almeida <[email protected]>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 0dc9c655c4fb..fe8e47d063da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -459,6 +459,12 @@ int amdgpu_ctx_get_entity(struct amdgpu_ctx *ctx, u32 hw_ip, u32 instance,
> ctx_entity = &ctx->entities[hw_ip][ring]->entity;
> r = drm_sched_entity_error(ctx_entity);
> if (r) {
> + if (r == -ENODATA) {
> + DRM_ERROR("%s (%d) context caused a reset,"
> + "marking it guilty and refusing new submissions.\n",
> + current->comm, current->pid);
> + atomic_set(&ctx->guilty, 1);
> + }
I'm going back and forth with that as well.
Michel has a very good point that it often is sufficient to cancel just
one rough shader to keep going.
But Marek has a very good point as well that when that happens multiple
times we probably want to block the application from making further
submissions.
Christian.
> DRM_DEBUG("error entity %p\n", ctx_entity);
> return r;
> }