2022-08-18 05:05:10

by Akhil P Oommen

Subject: [PATCH v4 0/7] Improve GPU Recovery


Recently, I debugged a few device crashes which occurred during recovery
after a hangcheck timeout. It looks like there are a few things we can
do to improve our chances of a successful GPU recovery.

The first is to ensure that the CX GDSC collapses, which clears the
internal state in the GPU's CX domain. The first 5 patches try to handle
this.

The rest of the patches ensure that a few internal blocks like CP, GMU
and GBIF are halted properly before proceeding to a snapshot followed by
recovery. They also handle 'prepare slumber' HFI failure correctly. These
are A6x specific improvements.

This series is rebased on top of v2 of [1], which is based on
Linus's master branch.

[1] https://patchwork.freedesktop.org/series/106860/

Changes in v4:
- Keep active_submit lock across the suspend & resume (Rob)
- Clear gpu->active_submits to silence a WARN() during runpm suspend (Rob)

Changes in v3:
- Use reset interface from gpucc driver to poll for cx gdsc collapse
https://patchwork.freedesktop.org/series/106860/
- Use single pm refcount for all active submits

Changes in v2:
- Rebased on msm-next tip

Akhil P Oommen (7):
drm/msm: Remove unnecessary pm_runtime_get/put
drm/msm: Take single rpm refcount on behalf of all submits
drm/msm: Correct pm_runtime votes in recover worker
drm/msm: Fix cx collapse issue during recovery
drm/msm/a6xx: Ensure CX collapse during gpu recovery
drm/msm/a6xx: Improve gpu recovery sequence
drm/msm/a6xx: Handle GMU prepare-slumber hfi failure

drivers/gpu/drm/msm/adreno/a6xx.xml.h | 4 ++
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 83 ++++++++++++++++++++++-------------
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 43 ++++++++++++++++--
drivers/gpu/drm/msm/msm_gpu.c | 21 ++++++---
drivers/gpu/drm/msm/msm_gpu.h | 4 ++
drivers/gpu/drm/msm/msm_ringbuffer.c | 4 --
6 files changed, 114 insertions(+), 45 deletions(-)

--
2.7.4


2022-08-18 05:11:00

by Akhil P Oommen

Subject: [PATCH v4 1/7] drm/msm: Remove unnecessary pm_runtime_get/put

We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
pm_runtime_get/put from msm_job_run().

Signed-off-by: Akhil P Oommen <[email protected]>
---

(no changes since v1)

drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 56eecb4..cad4c35 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
msm_gem_unlock(obj);
}

- pm_runtime_get_sync(&gpu->pdev->dev);
-
/* TODO move submit path over to using a per-ring lock.. */
mutex_lock(&gpu->lock);

@@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)

mutex_unlock(&gpu->lock);

- pm_runtime_put(&gpu->pdev->dev);
-
return dma_fence_get(submit->hw_fence);
}

--
2.7.4

2022-08-18 05:11:32

by Akhil P Oommen

Subject: [PATCH v4 3/7] drm/msm: Correct pm_runtime votes in recover worker

In the scenario where there is only a single submit which is hung, the
gpu is power collapsed when that submit is retired. Because of this, by
the time we call recover(), the gpu state would already be cleared. Fix
this by correctly managing the pm runtime votes.

Signed-off-by: Akhil P Oommen <[email protected]>
---

(no changes since v1)

drivers/gpu/drm/msm/msm_gpu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index e1dd3cc..1945efb 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -398,7 +398,6 @@ static void recover_worker(struct kthread_work *work)
/* Record the crash state */
pm_runtime_get_sync(&gpu->pdev->dev);
msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
- pm_runtime_put_sync(&gpu->pdev->dev);

kfree(cmd);
kfree(comm);
@@ -446,6 +445,8 @@ static void recover_worker(struct kthread_work *work)
}
}

+ pm_runtime_put_sync(&gpu->pdev->dev);
+
mutex_unlock(&gpu->lock);

msm_gpu_retire(gpu);
--
2.7.4
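
The vote ordering described in the commit message of patch 3/7 can be
illustrated with a small user-space model. This is only a sketch under
stated assumptions: struct toy_gpu, toy_get(), toy_put() and the two
toy_recover_*() helpers are hypothetical names invented for this
illustration and are not part of the msm driver or the kernel runtime PM
API. The model assumes that the hung submit holds one power vote, that
retiring it drops that vote, and that dropping the last vote clears the
gpu state.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a runtime PM usage count; NOT the kernel API. */
struct toy_gpu {
	int votes;         /* outstanding power votes */
	bool state_valid;  /* cleared once the last vote is dropped */
};

void toy_get(struct toy_gpu *gpu)
{
	gpu->votes++;
}

void toy_put(struct toy_gpu *gpu)
{
	assert(gpu->votes > 0);
	if (--gpu->votes == 0)
		gpu->state_valid = false;  /* modelled power collapse */
}

/* Old ordering: the worker's vote is dropped right after the capture. */
bool toy_recover_old(struct toy_gpu *gpu)
{
	bool seen;

	toy_get(gpu);             /* worker vote for crash state capture */
	toy_put(gpu);             /* dropped immediately after capture */
	toy_put(gpu);             /* hung submit retired: last vote gone */
	seen = gpu->state_valid;  /* what the recover step would find */
	return seen;
}

/* New ordering: the worker's vote is held until recovery is done. */
bool toy_recover_new(struct toy_gpu *gpu)
{
	bool seen;

	toy_get(gpu);             /* worker vote for crash state capture */
	toy_put(gpu);             /* hung submit retired; worker vote remains */
	seen = gpu->state_valid;  /* state is still intact here */
	toy_put(gpu);             /* worker vote dropped after recovery */
	return seen;
}
```

Starting from a toy_gpu with one vote and valid state, the old ordering
reaches the recover step with the state already cleared, while the new
ordering keeps it intact until the final put. The real driver has more
moving parts than this model captures.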