Recently, I debugged a few device crashes which occurred during recovery
after a hangcheck timeout. It looks like there are a few things we can
do to improve our chances of a successful gpu recovery.

The first is to ensure that the CX GDSC collapses, which clears the
internal state in the gpu's CX domain. The first 5 patches try to handle
this.

The rest of the patches ensure that a few internal blocks like CP, GMU
and GBIF are halted properly before proceeding to a snapshot followed by
recovery. They also handle a 'prepare slumber' hfi failure correctly.
These are A6xx-specific improvements.

This series is rebased on top of [1], which is based on linus's master
branch.

[1] https://patchwork.freedesktop.org/series/106860/
Changes in v3:
- Use reset interface from gpucc driver to poll for cx gdsc collapse
https://patchwork.freedesktop.org/series/106860/
- Use single pm refcount for all active submits
Changes in v2:
- Rebased on msm-next tip
Akhil P Oommen (8):
drm/msm: Remove unnecessary pm_runtime_get/put
drm/msm: Take single rpm refcount on behalf of all submits
drm/msm: Correct pm_runtime votes in recover worker
drm/msm: Fix cx collapse issue during recovery
drm/msm/a6xx: Ensure CX collapse during gpu recovery
drm/msm/adreno: Remove a WARN() during runtime_suspend
drm/msm/a6xx: Improve gpu recovery sequence
drm/msm/a6xx: Handle GMU prepare-slumber hfi failure
drivers/gpu/drm/msm/adreno/a6xx.xml.h | 4 ++
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 83 +++++++++++++++++++-----------
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 35 +++++++++++--
drivers/gpu/drm/msm/adreno/adreno_device.c | 7 ---
drivers/gpu/drm/msm/msm_gpu.c | 21 +++++---
drivers/gpu/drm/msm/msm_gpu.h | 4 ++
drivers/gpu/drm/msm/msm_ringbuffer.c | 4 --
7 files changed, 106 insertions(+), 52 deletions(-)
--
2.7.4
We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
pm_runtime_get/put from msm_job_run().
Signed-off-by: Akhil P Oommen <[email protected]>
---
(no changes since v1)
drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 56eecb4..cad4c35 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
 		msm_gem_unlock(obj);
 	}

-	pm_runtime_get_sync(&gpu->pdev->dev);
-
 	/* TODO move submit path over to using a per-ring lock.. */
 	mutex_lock(&gpu->lock);

@@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)

 	mutex_unlock(&gpu->lock);

-	pm_runtime_put(&gpu->pdev->dev);
-
 	return dma_fence_get(submit->hw_fence);
 }

--
2.7.4
On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <[email protected]> wrote:
>
> We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
> pm_runtime_get/put from msm_job_run().
>
> Signed-off-by: Akhil P Oommen <[email protected]>
> ---
>
> (no changes since v1)
>
> drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
> index 56eecb4..cad4c35 100644
> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
> @@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
> msm_gem_unlock(obj);
> }
>
> - pm_runtime_get_sync(&gpu->pdev->dev);
> -
This is removing a _get_sync() and simply relying on a _get() (async)
in msm_gpu_submit().. that seems pretty likely to go badly? I think
it should probably replace the _get() in msm_gpu_submit() with
_get_sync() (but also since this is changing position of
resume/suspend vs active_lock, please make sure you test with lockdep
enabled)
BR,
-R
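
To make the concern above concrete, here is a minimal userspace sketch of
the difference between an asynchronous and a synchronous runtime-PM get.
It is a toy model only: struct toy_dev and every toy_* helper are
hypothetical names invented for illustration and are not the kernel's
runtime-PM implementation or the msm driver's code. The only fact it
relies on is that pm_runtime_get() queues a resume while
pm_runtime_get_sync() waits for the resume to finish.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a runtime-PM managed device (hypothetical, not struct device). */
struct toy_dev {
	int usage_count;      /* references held by callers */
	bool powered;         /* true once the resume has actually run */
	bool resume_pending;  /* an async resume was queued but has not run yet */
};

/* Models an asynchronous get: take a reference and only queue the resume. */
static void toy_get_async(struct toy_dev *d)
{
	d->usage_count++;
	if (!d->powered)
		d->resume_pending = true;
}

/* Models the deferred work that later performs the queued resume. */
static void toy_run_pending(struct toy_dev *d)
{
	if (d->resume_pending) {
		d->powered = true;
		d->resume_pending = false;
	}
}

/* Models a synchronous get: the device is powered before this returns. */
static void toy_get_sync(struct toy_dev *d)
{
	d->usage_count++;
	d->powered = true;
	d->resume_pending = false;
}

/* Hypothetical hardware access: succeeds only if the device is powered. */
static bool toy_touch_hw(const struct toy_dev *d)
{
	return d->powered;
}
```

In this model, calling toy_touch_hw() right after toy_get_async() fails
until toy_run_pending() has run, while it succeeds immediately after
toy_get_sync(). That window is the risk being pointed out if the submit
path only holds an async reference.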
> /* TODO move submit path over to using a per-ring lock.. */
> mutex_lock(&gpu->lock);
>
> @@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
>
> mutex_unlock(&gpu->lock);
>
> - pm_runtime_put(&gpu->pdev->dev);
> -
> return dma_fence_get(submit->hw_fence);
> }
>
> --
> 2.7.4
>
On 7/31/2022 9:25 PM, Rob Clark wrote:
> On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <[email protected]> wrote:
>> We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
>> pm_runtime_get/put from msm_job_run().
>>
>> Signed-off-by: Akhil P Oommen <[email protected]>
>> ---
>>
>> (no changes since v1)
>>
>> drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
>> 1 file changed, 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
>> index 56eecb4..cad4c35 100644
>> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
>> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
>> @@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
>> msm_gem_unlock(obj);
>> }
>>
>> - pm_runtime_get_sync(&gpu->pdev->dev);
>> -
> This is removing a _get_sync() and simply relying on a _get() (async)
> in msm_gpu_submit().. that seems pretty likely to go badly? I think
> it should probably replace the _get() in msm_gpu_submit() with
> _get_sync() (but also since this is changing position of
> resume/suspend vs active_lock, please make sure you test with lockdep
> enabled)
>
> BR,
> -R
As discussed in the other patch, this is correctly handled in
msm_gpu_submit(). And from active_lock perspective, there is no change
actually. GPU is ON by the time we touch active_lock in both cases.
-Akhil.
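
The argument above can be sketched with a small userspace model. It
assumes, as stated in this reply, that the submit path itself resumes the
GPU synchronously before the lock is touched; that assumption is not
verified here. struct toy_gpu and all toy_* functions are hypothetical
names for illustration, and the put is modelled as immediate, which is a
simplification.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy GPU with a runtime-PM style usage count (hypothetical model). */
struct toy_gpu {
	int usage_count;
	bool powered;
};

/* Synchronous get: power on when the first reference is taken. */
static void toy_gpu_get_sync(struct toy_gpu *g)
{
	if (g->usage_count++ == 0)
		g->powered = true;
}

/* Put: power off when the last reference is dropped (simplified). */
static void toy_gpu_put(struct toy_gpu *g)
{
	if (--g->usage_count == 0)
		g->powered = false;
}

/*
 * Assumed shape of the submit path: take a reference that is held until
 * the submit retires, then report the power state seen at the point
 * where the lock would be taken.
 */
static bool toy_submit(struct toy_gpu *g)
{
	toy_gpu_get_sync(g);
	return g->powered;
}

/* Before the patch: an extra get/put pair wraps the submit. */
static bool toy_job_run_old(struct toy_gpu *g)
{
	bool on_at_lock;

	toy_gpu_get_sync(g);
	on_at_lock = toy_submit(g);
	toy_gpu_put(g);
	return on_at_lock;
}

/* After the patch: only the submit path's own reference remains. */
static bool toy_job_run_new(struct toy_gpu *g)
{
	return toy_submit(g);
}
```

Under these assumptions both variants see the GPU powered at the lock and
both leave exactly one reference held for the in-flight submit, so the
outer pair adds nothing. If the submit path's get were asynchronous
instead, the model would no longer hold.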
>> /* TODO move submit path over to using a per-ring lock.. */
>> mutex_lock(&gpu->lock);
>>
>> @@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
>>
>> mutex_unlock(&gpu->lock);
>>
>> - pm_runtime_put(&gpu->pdev->dev);
>> -
>> return dma_fence_get(submit->hw_fence);
>> }
>>
>> --
>> 2.7.4
>>