2022-07-09 06:03:21

by Akhil P Oommen

Subject: [PATCH v2 0/7] Improve GPU Recovery


Recently, I debugged a few device crashes which occurred during recovery
after a hangcheck timeout. It looks like there are a few things we can
do to improve our chances of a successful gpu recovery.

The first one is to ensure that the CX GDSC collapses, which clears the
internal states in the gpu's CX domain. The first 5 patches try to handle this.

The rest of the patches ensure that a few internal blocks like CP, GMU
and GBIF are halted properly before proceeding with a snapshot followed by
recovery. Also, handle the 'prepare slumber' hfi failure correctly. These
are A6xx-specific improvements.

Changes in v2:
- Rebased on msm-next tip

Akhil P Oommen (7):
drm/msm: Remove unnecessary pm_runtime_get/put
drm/msm: Correct pm_runtime votes in recover worker
drm/msm: Fix cx collapse issue during recovery
drm/msm: Ensure cx gdsc collapse during recovery
arm64: dts: qcom: sc7280: Update gpu register list
drm/msm/a6xx: Improve gpu recovery sequence
drm/msm/a6xx: Handle GMU prepare-slumber hfi failure

arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++-
drivers/gpu/drm/msm/adreno/a6xx.xml.h | 4 ++
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 83 ++++++++++++++++++++++-------------
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 36 +++++++++++++--
drivers/gpu/drm/msm/msm_gpu.c | 9 ++--
drivers/gpu/drm/msm/msm_gpu.h | 1 +
drivers/gpu/drm/msm/msm_ringbuffer.c | 4 --
7 files changed, 100 insertions(+), 43 deletions(-)

--
2.7.4


2022-07-09 06:05:37

by Akhil P Oommen

Subject: [PATCH v2 1/7] drm/msm: Remove unnecessary pm_runtime_get/put

We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
pm_runtime_get/put from msm_job_run().

Signed-off-by: Akhil P Oommen <[email protected]>
---

(no changes since v1)

drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 56eecb4..cad4c35 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
msm_gem_unlock(obj);
}

- pm_runtime_get_sync(&gpu->pdev->dev);
-
/* TODO move submit path over to using a per-ring lock.. */
mutex_lock(&gpu->lock);

@@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)

mutex_unlock(&gpu->lock);

- pm_runtime_put(&gpu->pdev->dev);
-
return dma_fence_get(submit->hw_fence);
}

--
2.7.4
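
The assumption behind this patch is that the submit path already takes a
runtime PM reference that stays held across the hardware access done in
msm_job_run(), which is what makes the get/put pair there redundant. A
minimal sketch of that pattern, with abbreviated, hypothetical function
names rather than the exact msm_gpu.c code:

    #include <linux/device.h>
    #include <linux/pm_runtime.h>

    /*
     * Illustrative sketch only: the submit path takes a runtime PM
     * reference before any hardware access and the retire path drops it,
     * so the scheduler run callback does not need its own get/put pair
     * around the same hardware access.
     */
    static void submit_path(struct device *gpu_dev)
    {
            pm_runtime_get_sync(gpu_dev);   /* GPU stays powered from here on */
            /* ... write the ringbuffer, ring the doorbell, etc. ... */
    }

    static void retire_path(struct device *gpu_dev)
    {
            /* ... the job has completed and been retired ... */
            pm_runtime_put_autosuspend(gpu_dev);    /* balances the submit-time get */
    }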

2022-07-09 06:08:53

by Akhil P Oommen

Subject: [PATCH v2 4/7] drm/msm: Ensure cx gdsc collapse during recovery

To improve our chance of a successful recovery, we should ensure that the
cx headswitch collapses. The cx headswitch might be kept enabled through a
vote from another driver like iommu or even another hardware subsystem.
So, poll the cx gdscr register to ensure that it collapses during
recovery.

Signed-off-by: Akhil P Oommen <[email protected]>
---

(no changes since v1)

drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 13 ++++++++++++-
drivers/gpu/drm/msm/msm_gpu.c | 4 ++++
drivers/gpu/drm/msm/msm_gpu.h | 1 +
3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 7ed347c..9aaa469 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1257,11 +1257,15 @@ static void a6xx_dump(struct msm_gpu *gpu)
#define VBIF_RESET_ACK_TIMEOUT 100
#define VBIF_RESET_ACK_MASK 0x00f0

+#define CX_GDSCR_OFFSET 0x106c
+#define CX_GDSC_ON_MASK BIT(31)
+
static void a6xx_recover(struct msm_gpu *gpu)
{
struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
- int i;
+ int i, ret;
+ u32 val;

adreno_dump_info(gpu);

@@ -1288,6 +1292,13 @@ static void a6xx_recover(struct msm_gpu *gpu)
/* And the final one from recover worker */
pm_runtime_put_sync(&gpu->pdev->dev);

+ if (gpu->gpucc_io) {
+ ret = readl_poll_timeout(gpu->gpucc_io + CX_GDSCR_OFFSET, val,
+ !(val & CX_GDSC_ON_MASK), 100, 500000);
+ if (ret)
+ DRM_DEV_INFO(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
+ }
+
for (i = gpu->active_submits; i > 0; i--)
pm_runtime_get(&gpu->pdev->dev);

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index aa6f34f..7ada0785 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -865,6 +865,10 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
goto fail;
}

+ gpu->gpucc_io = msm_ioremap(pdev, "gpucc");
+ if (IS_ERR(gpu->gpucc_io))
+ gpu->gpucc_io = NULL;
+
/* Get Interrupt: */
gpu->irq = platform_get_irq(pdev, 0);
if (gpu->irq < 0) {
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 4d935fe..1fe498f 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -226,6 +226,7 @@ struct msm_gpu {
int global_faults;

void __iomem *mmio;
+ void __iomem *gpucc_io;
int irq;

struct msm_gem_address_space *aspace;
--
2.7.4

2022-07-09 06:09:15

by Akhil P Oommen

Subject: [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery

There is some hardware logic under the CX domain. For a successful
recovery, we should ensure the cx headswitch collapses so that all the
stale states are cleared out. This is especially true for the a6xx family
where we have a GMU co-processor.

Currently, cx doesn't collapse due to a devlink between the gpu and its
smmu. So the gpu's *struct device* needs to be runtime suspended to ensure
that the iommu driver removes its vote on the cx gdsc.

Signed-off-by: Akhil P Oommen <[email protected]>
---

(no changes since v1)

drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
drivers/gpu/drm/msm/msm_gpu.c | 2 --
2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 4d50110..7ed347c 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
*/
gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);

- gpu->funcs->pm_suspend(gpu);
- gpu->funcs->pm_resume(gpu);
+ /*
+ * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
+ * First drop the usage count from all active submits
+ */
+ for (i = gpu->active_submits; i > 0; i--)
+ pm_runtime_put(&gpu->pdev->dev);
+
+ /* And the final one from recover worker */
+ pm_runtime_put_sync(&gpu->pdev->dev);
+
+ for (i = gpu->active_submits; i > 0; i--)
+ pm_runtime_get(&gpu->pdev->dev);
+
+ pm_runtime_get_sync(&gpu->pdev->dev);

msm_gpu_hw_init(gpu);
}
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 18c1544..aa6f34f 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -422,9 +422,7 @@ static void recover_worker(struct kthread_work *work)
/* retire completed submits, plus the one that hung: */
retire_submits(gpu);

- pm_runtime_get_sync(&gpu->pdev->dev);
gpu->funcs->recover(gpu);
- pm_runtime_put_sync(&gpu->pdev->dev);

/*
* Replay all remaining submits starting with highest priority
--
2.7.4

2022-07-09 06:10:15

by Akhil P Oommen

Subject: [PATCH v2 6/7] drm/msm/a6xx: Improve gpu recovery sequence

We can do a few more things to improve our chance at a successful gpu
recovery, especially during a hangcheck timeout:
1. Halt CP and GMU core
2. Do RBBM GBIF HALT sequence
3. Do a soft reset of GPU core

Signed-off-by: Akhil P Oommen <[email protected]>
---

(no changes since v1)

drivers/gpu/drm/msm/adreno/a6xx.xml.h | 4 ++
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 77 +++++++++++++++++++++--------------
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 7 ++++
3 files changed, 58 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx.xml.h b/drivers/gpu/drm/msm/adreno/a6xx.xml.h
index b03e2c4..beea4a7 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx.xml.h
+++ b/drivers/gpu/drm/msm/adreno/a6xx.xml.h
@@ -1413,6 +1413,10 @@ static inline uint32_t REG_A6XX_RBBM_PERFCTR_RBBM_SEL(uint32_t i0) { return 0x00

#define REG_A6XX_RBBM_GBIF_CLIENT_QOS_CNTL 0x00000011

+#define REG_A6XX_RBBM_GBIF_HALT 0x00000016
+
+#define REG_A6XX_RBBM_GBIF_HALT_ACK 0x00000017
+
#define REG_A6XX_RBBM_WAIT_FOR_GPU_IDLE_CMD 0x0000001c
#define A6XX_RBBM_WAIT_FOR_GPU_IDLE_CMD_WAIT_GPU_IDLE 0x00000001

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index 310a317..ec25f19 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -873,9 +873,47 @@ static void a6xx_gmu_rpmh_off(struct a6xx_gmu *gmu)
(val & 1), 100, 1000);
}

+#define GBIF_CLIENT_HALT_MASK BIT(0)
+#define GBIF_ARB_HALT_MASK BIT(1)
+
+static void a6xx_bus_clear_pending_transactions(struct adreno_gpu *adreno_gpu)
+{
+ struct msm_gpu *gpu = &adreno_gpu->base;
+
+ if (!a6xx_has_gbif(adreno_gpu)) {
+ gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0xf);
+ spin_until((gpu_read(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL1) &
+ 0xf) == 0xf);
+ gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0);
+
+ return;
+ }
+
+ /* Halt the gx side of GBIF */
+ gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 1);
+ spin_until(gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT_ACK) & 1);
+
+ /* Halt new client requests on GBIF */
+ gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_CLIENT_HALT_MASK);
+ spin_until((gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK) &
+ (GBIF_CLIENT_HALT_MASK)) == GBIF_CLIENT_HALT_MASK);
+
+ /* Halt all AXI requests on GBIF */
+ gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_ARB_HALT_MASK);
+ spin_until((gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK) &
+ (GBIF_ARB_HALT_MASK)) == GBIF_ARB_HALT_MASK);
+
+ /* The GBIF halt needs to be explicitly cleared */
+ gpu_write(gpu, REG_A6XX_GBIF_HALT, 0x0);
+}
+
/* Force the GMU off in case it isn't responsive */
static void a6xx_gmu_force_off(struct a6xx_gmu *gmu)
{
+ struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
+ struct adreno_gpu *adreno_gpu = &a6xx_gpu->base;
+ struct msm_gpu *gpu = &adreno_gpu->base;
+
/* Flush all the queues */
a6xx_hfi_stop(gmu);

@@ -887,6 +925,15 @@ static void a6xx_gmu_force_off(struct a6xx_gmu *gmu)

/* Make sure there are no outstanding RPMh votes */
a6xx_gmu_rpmh_off(gmu);
+
+ /* Halt the gmu cm3 core */
+ gmu_write(gmu, REG_A6XX_GMU_CM3_SYSRESET, 1);
+
+ a6xx_bus_clear_pending_transactions(adreno_gpu);
+
+ /* Reset GPU core blocks */
+ gpu_write(gpu, REG_A6XX_RBBM_SW_RESET_CMD, 1);
+ udelay(100);
}

static void a6xx_gmu_set_initial_freq(struct msm_gpu *gpu, struct a6xx_gmu *gmu)
@@ -1014,36 +1061,6 @@ bool a6xx_gmu_isidle(struct a6xx_gmu *gmu)
return true;
}

-#define GBIF_CLIENT_HALT_MASK BIT(0)
-#define GBIF_ARB_HALT_MASK BIT(1)
-
-static void a6xx_bus_clear_pending_transactions(struct adreno_gpu *adreno_gpu)
-{
- struct msm_gpu *gpu = &adreno_gpu->base;
-
- if (!a6xx_has_gbif(adreno_gpu)) {
- gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0xf);
- spin_until((gpu_read(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL1) &
- 0xf) == 0xf);
- gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0);
-
- return;
- }
-
- /* Halt new client requests on GBIF */
- gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_CLIENT_HALT_MASK);
- spin_until((gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK) &
- (GBIF_CLIENT_HALT_MASK)) == GBIF_CLIENT_HALT_MASK);
-
- /* Halt all AXI requests on GBIF */
- gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_ARB_HALT_MASK);
- spin_until((gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK) &
- (GBIF_ARB_HALT_MASK)) == GBIF_ARB_HALT_MASK);
-
- /* The GBIF halt needs to be explicitly cleared */
- gpu_write(gpu, REG_A6XX_GBIF_HALT, 0x0);
-}
-
/* Gracefully try to shut down the GMU and by extension the GPU */
static void a6xx_gmu_shutdown(struct a6xx_gmu *gmu)
{
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 9aaa469..60c3033 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -987,6 +987,10 @@ static int hw_init(struct msm_gpu *gpu)
/* Make sure the GMU keeps the GPU on while we set it up */
a6xx_gmu_set_oob(&a6xx_gpu->gmu, GMU_OOB_GPU_SET);

+ /* Clear GBIF halt in case GX domain was not collapsed */
+ if (a6xx_has_gbif(adreno_gpu))
+ gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
+
gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);

/*
@@ -1276,6 +1280,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
if (hang_debug)
a6xx_dump(gpu);

+ /* Halt SQE first */
+ gpu_write(gpu, REG_A6XX_CP_SQE_CNTL, 3);
+
/*
* Turn off keep alive that might have been enabled by the hang
* interrupt
--
2.7.4

2022-07-09 06:10:17

by Akhil P Oommen

Subject: [PATCH v2 7/7] drm/msm/a6xx: Handle GMU prepare-slumber hfi failure

When the prepare-slumber hfi fails, we should follow the
a6xx_gmu_force_off() sequence.

Signed-off-by: Akhil P Oommen <[email protected]>
---

(no changes since v1)

drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index ec25f19..e033d6a 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -1086,7 +1086,11 @@ static void a6xx_gmu_shutdown(struct a6xx_gmu *gmu)
a6xx_bus_clear_pending_transactions(adreno_gpu);

/* tell the GMU we want to slumber */
- a6xx_gmu_notify_slumber(gmu);
+ ret = a6xx_gmu_notify_slumber(gmu);
+ if (ret) {
+ a6xx_gmu_force_off(gmu);
+ return;
+ }

ret = gmu_poll_timeout(gmu,
REG_A6XX_GPU_GMU_AO_GPU_CX_BUSY_STATUS, val,
--
2.7.4

2022-07-09 06:21:28

by Akhil P Oommen

Subject: [PATCH v2 2/7] drm/msm: Correct pm_runtime votes in recover worker

In the scenario where there is only a single submit which is hung, the gpu is
power collapsed when it is retired. Because of this, by the time we call
recover(), the gpu state would already be cleared. Fix this by correctly
managing the pm_runtime votes.

Signed-off-by: Akhil P Oommen <[email protected]>
---

(no changes since v1)

drivers/gpu/drm/msm/msm_gpu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index c2bfcf3f..18c1544 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -394,7 +394,6 @@ static void recover_worker(struct kthread_work *work)
/* Record the crash state */
pm_runtime_get_sync(&gpu->pdev->dev);
msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
- pm_runtime_put_sync(&gpu->pdev->dev);

kfree(cmd);
kfree(comm);
@@ -442,6 +441,8 @@ static void recover_worker(struct kthread_work *work)
}
}

+ pm_runtime_put_sync(&gpu->pdev->dev);
+
mutex_unlock(&gpu->lock);

msm_gpu_retire(gpu);
--
2.7.4

2022-07-09 07:00:21

by Akhil P Oommen

Subject: [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list

Update gpu register array with gpucc memory region.

Signed-off-by: Akhil P Oommen <[email protected]>
---

(no changes since v1)

arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi b/arch/arm64/boot/dts/qcom/sc7280.dtsi
index e66fc67..defdb25 100644
--- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
+++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
@@ -2228,10 +2228,12 @@
compatible = "qcom,adreno-635.0", "qcom,adreno";
reg = <0 0x03d00000 0 0x40000>,
<0 0x03d9e000 0 0x1000>,
- <0 0x03d61000 0 0x800>;
+ <0 0x03d61000 0 0x800>,
+ <0 0x03d90000 0 0x2000>;
reg-names = "kgsl_3d0_reg_memory",
"cx_mem",
- "cx_dbgc";
+ "cx_dbgc",
+ "gpucc";
interrupts = <GIC_SPI 300 IRQ_TYPE_LEVEL_HIGH>;
iommus = <&adreno_smmu 0 0x401>;
operating-points-v2 = <&gpu_opp_table>;
--
2.7.4

2022-07-11 23:38:38

by Doug Anderson

Subject: Re: [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery

Hi,

On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <[email protected]> wrote:
>
> There are some hardware logic under CX domain. For a successful
> recovery, we should ensure cx headswitch collapses to ensure all the
> stale states are cleard out. This is especially true to for a6xx family
> where we can GMU co-processor.
>
> Currently, cx doesn't collapse due to a devlink between gpu and its
> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
> that the iommu driver removes its vote on cx gdsc.
>
> Signed-off-by: Akhil P Oommen <[email protected]>
> ---
>
> (no changes since v1)
>
> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
> drivers/gpu/drm/msm/msm_gpu.c | 2 --
> 2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 4d50110..7ed347c 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
> */
> gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>
> - gpu->funcs->pm_suspend(gpu);
> - gpu->funcs->pm_resume(gpu);
> + /*
> + * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
> + * First drop the usage count from all active submits
> + */
> + for (i = gpu->active_submits; i > 0; i--)
> + pm_runtime_put(&gpu->pdev->dev);
> +
> + /* And the final one from recover worker */
> + pm_runtime_put_sync(&gpu->pdev->dev);
> +
> + for (i = gpu->active_submits; i > 0; i--)
> + pm_runtime_get(&gpu->pdev->dev);
> +
> + pm_runtime_get_sync(&gpu->pdev->dev);

In response to v1, Rob suggested pm_runtime_force_suspend/resume().
Those seem like they would work to me, too. Why not use them?
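
As a point of reference, a minimal sketch of what the suggested
pm_runtime_force_suspend()/pm_runtime_force_resume() alternative could look
like in the recovery path. The helper name is hypothetical and this is only
an illustration of the suggestion, not code from the series:

    #include <linux/device.h>
    #include <linux/pm_runtime.h>

    /*
     * Illustration of the suggested alternative: force the GPU device into
     * runtime suspend (which also keeps runtime PM disabled until the
     * matching resume), let the power domain collapse, then bring the
     * device back up.
     */
    static void recover_power_cycle(struct device *dev)
    {
            int ret;

            ret = pm_runtime_force_suspend(dev);
            if (ret)
                    dev_err(dev, "force suspend failed: %d\n", ret);

            ret = pm_runtime_force_resume(dev);
            if (ret)
                    dev_err(dev, "force resume failed: %d\n", ret);
    }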

2022-07-11 23:46:42

by Doug Anderson

Subject: Re: [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list

Hi,

On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <[email protected]> wrote:
>
> Update gpu register array with gpucc memory region.
>
> Signed-off-by: Akhil P Oommen <[email protected]>
> ---
>
> (no changes since v1)
>
> arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> index e66fc67..defdb25 100644
> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> @@ -2228,10 +2228,12 @@
> compatible = "qcom,adreno-635.0", "qcom,adreno";
> reg = <0 0x03d00000 0 0x40000>,
> <0 0x03d9e000 0 0x1000>,
> - <0 0x03d61000 0 0x800>;
> + <0 0x03d61000 0 0x800>,
> + <0 0x03d90000 0 0x2000>;
> reg-names = "kgsl_3d0_reg_memory",
> "cx_mem",
> - "cx_dbgc";
> + "cx_dbgc",
> + "gpucc";

This doesn't seem right. Shouldn't you be coordinating with the
existing gpucc instead of reaching into its registers?

-Doug

2022-07-12 05:14:35

by Akhil P Oommen

Subject: Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery

On 7/12/2022 4:52 AM, Doug Anderson wrote:
> Hi,
>
> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <[email protected]> wrote:
>> There are some hardware logic under CX domain. For a successful
>> recovery, we should ensure cx headswitch collapses to ensure all the
>> stale states are cleard out. This is especially true to for a6xx family
>> where we can GMU co-processor.
>>
>> Currently, cx doesn't collapse due to a devlink between gpu and its
>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>> that the iommu driver removes its vote on cx gdsc.
>>
>> Signed-off-by: Akhil P Oommen <[email protected]>
>> ---
>>
>> (no changes since v1)
>>
>> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>> drivers/gpu/drm/msm/msm_gpu.c | 2 --
>> 2 files changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> index 4d50110..7ed347c 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>> */
>> gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>
>> - gpu->funcs->pm_suspend(gpu);
>> - gpu->funcs->pm_resume(gpu);
>> + /*
>> + * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
>> + * First drop the usage count from all active submits
>> + */
>> + for (i = gpu->active_submits; i > 0; i--)
>> + pm_runtime_put(&gpu->pdev->dev);
>> +
>> + /* And the final one from recover worker */
>> + pm_runtime_put_sync(&gpu->pdev->dev);
>> +
>> + for (i = gpu->active_submits; i > 0; i--)
>> + pm_runtime_get(&gpu->pdev->dev);
>> +
>> + pm_runtime_get_sync(&gpu->pdev->dev);
> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
> Those seem like they would work to me, too. Why not use them?
Quoting my previous response which I seem to have sent only to Freedreno
list:

"I believe it is supposed to be used only during system sleep state
transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
fail by disabling RPM here."

-Akhil

2022-07-12 17:04:54

by Rob Clark

Subject: Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery

On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
<[email protected]> wrote:
>
> On 7/12/2022 4:52 AM, Doug Anderson wrote:
> > Hi,
> >
> > On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <[email protected]> wrote:
> >> There are some hardware logic under CX domain. For a successful
> >> recovery, we should ensure cx headswitch collapses to ensure all the
> >> stale states are cleard out. This is especially true to for a6xx family
> >> where we can GMU co-processor.
> >>
> >> Currently, cx doesn't collapse due to a devlink between gpu and its
> >> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
> >> that the iommu driver removes its vote on cx gdsc.
> >>
> >> Signed-off-by: Akhil P Oommen <[email protected]>
> >> ---
> >>
> >> (no changes since v1)
> >>
> >> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
> >> drivers/gpu/drm/msm/msm_gpu.c | 2 --
> >> 2 files changed, 14 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> index 4d50110..7ed347c 100644
> >> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >> */
> >> gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
> >>
> >> - gpu->funcs->pm_suspend(gpu);
> >> - gpu->funcs->pm_resume(gpu);
> >> + /*
> >> + * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
> >> + * First drop the usage count from all active submits
> >> + */
> >> + for (i = gpu->active_submits; i > 0; i--)
> >> + pm_runtime_put(&gpu->pdev->dev);
> >> +
> >> + /* And the final one from recover worker */
> >> + pm_runtime_put_sync(&gpu->pdev->dev);
> >> +
> >> + for (i = gpu->active_submits; i > 0; i--)
> >> + pm_runtime_get(&gpu->pdev->dev);
> >> +
> >> + pm_runtime_get_sync(&gpu->pdev->dev);
> > In response to v1, Rob suggested pm_runtime_force_suspend/resume().
> > Those seem like they would work to me, too. Why not use them?
> Quoting my previous response which I seem to have sent only to Freedreno
> list:
>
> "I believe it is supposed to be used only during system sleep state
> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
> fail by disabling RPM here."

The comment about not wanting other runpm calls to fail is valid.. but
that is also solvable, ie. by holding a lock around runpm calls.
Which I think we need to do anyways, otherwise looping over
gpu->active_submits is racy..

I think pm_runtime_force_suspend/resume() is the least-bad option.. or
at least I'm not seeing any obvious alternative that is better

BR,
-R

2022-07-12 20:12:12

by Akhil P Oommen

Subject: Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery

On 7/12/2022 10:14 PM, Rob Clark wrote:
> On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
> <[email protected]> wrote:
>> On 7/12/2022 4:52 AM, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <[email protected]> wrote:
>>>> There are some hardware logic under CX domain. For a successful
>>>> recovery, we should ensure cx headswitch collapses to ensure all the
>>>> stale states are cleard out. This is especially true to for a6xx family
>>>> where we can GMU co-processor.
>>>>
>>>> Currently, cx doesn't collapse due to a devlink between gpu and its
>>>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>>>> that the iommu driver removes its vote on cx gdsc.
>>>>
>>>> Signed-off-by: Akhil P Oommen <[email protected]>
>>>> ---
>>>>
>>>> (no changes since v1)
>>>>
>>>> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>>> drivers/gpu/drm/msm/msm_gpu.c | 2 --
>>>> 2 files changed, 14 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> index 4d50110..7ed347c 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>>> */
>>>> gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>>>
>>>> - gpu->funcs->pm_suspend(gpu);
>>>> - gpu->funcs->pm_resume(gpu);
>>>> + /*
>>>> + * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
>>>> + * First drop the usage count from all active submits
>>>> + */
>>>> + for (i = gpu->active_submits; i > 0; i--)
>>>> + pm_runtime_put(&gpu->pdev->dev);
>>>> +
>>>> + /* And the final one from recover worker */
>>>> + pm_runtime_put_sync(&gpu->pdev->dev);
>>>> +
>>>> + for (i = gpu->active_submits; i > 0; i--)
>>>> + pm_runtime_get(&gpu->pdev->dev);
>>>> +
>>>> + pm_runtime_get_sync(&gpu->pdev->dev);
>>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
>>> Those seem like they would work to me, too. Why not use them?
>> Quoting my previous response which I seem to have sent only to Freedreno
>> list:
>>
>> "I believe it is supposed to be used only during system sleep state
>> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
>> fail by disabling RPM here."
> The comment about not wanting other runpm calls to fail is valid.. but
> that is also solveable, ie. by holding a lock around runpm calls.
> Which I think we need to do anyways, otherwise looping over
> gpu->active_submits is racey..
>
> I think pm_runtime_force_suspend/resume() is the least-bad option.. or
> at least I'm not seeing any obvious alternative that is better
>
> BR,
> -R
We are holding gpu->lock here which will block further submissions from the
scheduler. Will active_submits still race?

It is possible that there is another thread which successfully completed
pm_runtime_get() and, while it accesses the hardware, we pulled the plug on
the regulator/clock here. That will result in an obvious device crash. So I
can think of 2 solutions:

1. wrap *every* pm_runtime_get/put with a mutex. Something like:
            mutex_lock();
            pm_runtime_get();
            <... access hardware here ...>
            pm_runtime_put();
            mutex_unlock();

2. Drop runtime votes from every submit in the recover worker and wait/poll
for the regulator to collapse in case there are transient votes on the
regulator from other threads/subsystems.

Option (2) seems simpler to me.  What do you think?

-Akhil.
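
To make option (1) above concrete, a minimal sketch of what wrapping every
hardware access in a driver-level lock around the runtime PM reference could
look like; the lock and helper names below are hypothetical, not code from
the series:

    #include <linux/device.h>
    #include <linux/mutex.h>
    #include <linux/pm_runtime.h>

    /*
     * Hypothetical sketch of option (1): every hardware access takes a
     * driver-level lock around its runtime PM reference, so the recover
     * worker can drop the last usage count under the same lock without
     * racing other register accesses.
     */
    static DEFINE_MUTEX(gpu_active_lock);

    static int gpu_hw_access(struct device *dev, void (*access)(struct device *))
    {
            int ret;

            mutex_lock(&gpu_active_lock);

            ret = pm_runtime_get_sync(dev);
            if (ret < 0) {
                    pm_runtime_put_noidle(dev);     /* balance the failed get */
                    mutex_unlock(&gpu_active_lock);
                    return ret;
            }

            access(dev);                    /* ... touch GPU registers here ... */

            pm_runtime_put_autosuspend(dev);
            mutex_unlock(&gpu_active_lock);

            return 0;
    }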

2022-07-14 05:47:00

by Akhil P Oommen

Subject: Re: [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list

On 7/12/2022 4:57 AM, Doug Anderson wrote:
> Hi,
>
> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <[email protected]> wrote:
>> Update gpu register array with gpucc memory region.
>>
>> Signed-off-by: Akhil P Oommen <[email protected]>
>> ---
>>
>> (no changes since v1)
>>
>> arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>> index e66fc67..defdb25 100644
>> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
>> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>> @@ -2228,10 +2228,12 @@
>> compatible = "qcom,adreno-635.0", "qcom,adreno";
>> reg = <0 0x03d00000 0 0x40000>,
>> <0 0x03d9e000 0 0x1000>,
>> - <0 0x03d61000 0 0x800>;
>> + <0 0x03d61000 0 0x800>,
>> + <0 0x03d90000 0 0x2000>;
>> reg-names = "kgsl_3d0_reg_memory",
>> "cx_mem",
>> - "cx_dbgc";
>> + "cx_dbgc",
>> + "gpucc";
> This doesn't seem right. Shouldn't you be coordinating with the
> existing gpucc instead of reaching into its registers?
>
> -Doug
IIUC, the qcom gdsc driver doesn't ensure the hardware has collapsed since
gdscs are vote-able switches. Ideally, we should ensure that the hw has
collapsed for gpu recovery because there could be transient votes from
other subsystems like the hypervisor using their vote register.

I am not sure how complex the plumbing to the gpucc driver would be to allow
the gpu driver to check hw status. OTOH, with this patch, the gpu driver does
a read operation on a gpucc register which is in an always-on domain. That
means we don't need to vote for any resource to access this register.

Stephen/Rajendra/Taniya, any suggestion?

-Akhil.


2022-07-19 04:28:22

by Akhil P Oommen

Subject: Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list

On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
> On 7/12/2022 4:57 AM, Doug Anderson wrote:
>> Hi,
>>
>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen
>> <[email protected]> wrote:
>>> Update gpu register array with gpucc memory region.
>>>
>>> Signed-off-by: Akhil P Oommen <[email protected]>
>>> ---
>>>
>>> (no changes since v1)
>>>
>>>   arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>> b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>> index e66fc67..defdb25 100644
>>> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>> @@ -2228,10 +2228,12 @@
>>>                          compatible = "qcom,adreno-635.0",
>>> "qcom,adreno";
>>>                          reg = <0 0x03d00000 0 0x40000>,
>>>                                <0 0x03d9e000 0 0x1000>,
>>> -                             <0 0x03d61000 0 0x800>;
>>> +                             <0 0x03d61000 0 0x800>,
>>> +                             <0 0x03d90000 0 0x2000>;
>>>                          reg-names = "kgsl_3d0_reg_memory",
>>>                                      "cx_mem",
>>> -                                   "cx_dbgc";
>>> +                                   "cx_dbgc",
>>> +                                   "gpucc";
>> This doesn't seem right. Shouldn't you be coordinating with the
>> existing gpucc instead of reaching into its registers?
>>
>> -Doug
> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they
> are vote-able switches. Ideally, we should ensure that the hw has
> collapsed for gpu recovery because there could be transient votes from
> other subsystems like hypervisor using their vote register.
>
> I am not sure how complex the plumbing to gpucc driver would be to allow
> gpu driver to check hw status. OTOH, with this patch, gpu driver does a
> read operation on a gpucc register which is in always-on domain. That
> means we don't need to vote any resource to access this register.
>
> Stephen/Rajendra/Taniya, any suggestion?
>
> -Akhil.
>
>
Gentle ping.

-Akhil

2022-07-19 05:51:05

by Stephen Boyd

[permalink] [raw]
Subject: Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list

Quoting Akhil P Oommen (2022-07-18 21:07:05)
> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
> > On 7/12/2022 4:57 AM, Doug Anderson wrote:
> >> Hi,
> >>
> >> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen
> >> <[email protected]> wrote:
> >>> Update gpu register array with gpucc memory region.
> >>>
> >>> Signed-off-by: Akhil P Oommen <[email protected]>
> >>> ---
> >>>
> >>> (no changes since v1)
> >>>
> >>>   arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
> >>>   1 file changed, 4 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi
> >>> b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> >>> index e66fc67..defdb25 100644
> >>> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
> >>> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> >>> @@ -2228,10 +2228,12 @@
> >>>                          compatible = "qcom,adreno-635.0",
> >>> "qcom,adreno";
> >>>                          reg = <0 0x03d00000 0 0x40000>,
> >>>                                <0 0x03d9e000 0 0x1000>,
> >>> -                             <0 0x03d61000 0 0x800>;
> >>> +                             <0 0x03d61000 0 0x800>,
> >>> +                             <0 0x03d90000 0 0x2000>;
> >>>                          reg-names = "kgsl_3d0_reg_memory",
> >>>                                      "cx_mem",
> >>> -                                   "cx_dbgc";
> >>> +                                   "cx_dbgc",
> >>> +                                   "gpucc";
> >> This doesn't seem right. Shouldn't you be coordinating with the
> >> existing gpucc instead of reaching into its registers?
> >>
> > IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they
> > are vote-able switches. Ideally, we should ensure that the hw has
> > collapsed for gpu recovery because there could be transient votes from
> > other subsystems like hypervisor using their vote register.
> >
> > I am not sure how complex the plumbing to gpucc driver would be to allow
> > gpu driver to check hw status. OTOH, with this patch, gpu driver does a
> > read operation on a gpucc register which is in always-on domain. That
> > means we don't need to vote any resource to access this register.
> >
> > Stephen/Rajendra/Taniya, any suggestion?

Why can't you assert a gpu reset signal with the reset APIs? This series
seems to jump through a bunch of hoops to get the gdsc and power domain
to "reset" when I don't know why any of that is necessary. Can't we
simply assert a reset to the hardware after recovery completes so the
device is back into a good known POR (power on reset) state?

2022-07-19 07:05:26

by Akhil P Oommen

Subject: Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list

On 7/19/2022 11:19 AM, Stephen Boyd wrote:
> Quoting Akhil P Oommen (2022-07-18 21:07:05)
>> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
>>> On 7/12/2022 4:57 AM, Doug Anderson wrote:
>>>> Hi,
>>>>
>>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen
>>>> <[email protected]> wrote:
>>>>> Update gpu register array with gpucc memory region.
>>>>>
>>>>> Signed-off-by: Akhil P Oommen <[email protected]>
>>>>> ---
>>>>>
>>>>> (no changes since v1)
>>>>>
>>>>>   arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
>>>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>>>> b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>>>> index e66fc67..defdb25 100644
>>>>> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>>>> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>>>> @@ -2228,10 +2228,12 @@
>>>>>                          compatible = "qcom,adreno-635.0",
>>>>> "qcom,adreno";
>>>>>                          reg = <0 0x03d00000 0 0x40000>,
>>>>>                                <0 0x03d9e000 0 0x1000>,
>>>>> -                             <0 0x03d61000 0 0x800>;
>>>>> +                             <0 0x03d61000 0 0x800>,
>>>>> +                             <0 0x03d90000 0 0x2000>;
>>>>>                          reg-names = "kgsl_3d0_reg_memory",
>>>>>                                      "cx_mem",
>>>>> -                                   "cx_dbgc";
>>>>> +                                   "cx_dbgc",
>>>>> +                                   "gpucc";
>>>> This doesn't seem right. Shouldn't you be coordinating with the
>>>> existing gpucc instead of reaching into its registers?
>>>>
>>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they
>>> are vote-able switches. Ideally, we should ensure that the hw has
>>> collapsed for gpu recovery because there could be transient votes from
>>> other subsystems like hypervisor using their vote register.
>>>
>>> I am not sure how complex the plumbing to gpucc driver would be to allow
>>> gpu driver to check hw status. OTOH, with this patch, gpu driver does a
>>> read operation on a gpucc register which is in always-on domain. That
>>> means we don't need to vote any resource to access this register.
>>>
>>> Stephen/Rajendra/Taniya, any suggestion?
> Why can't you assert a gpu reset signal with the reset APIs? This series
> seems to jump through a bunch of hoops to get the gdsc and power domain
> to "reset" when I don't know why any of that is necessary. Can't we
> simply assert a reset to the hardware after recovery completes so the
> device is back into a good known POR (power on reset) state?
That is because there is no register interface to reset GPU CX domain.
The recommended sequence from HW design folks is to collapse both cx and
gx gdsc to properly reset gpu/gmu.

-Akhil.

2022-07-19 07:42:52

by Stephen Boyd

Subject: Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list

Quoting Akhil P Oommen (2022-07-18 23:37:16)
> On 7/19/2022 11:19 AM, Stephen Boyd wrote:
> > Quoting Akhil P Oommen (2022-07-18 21:07:05)
> >> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
> >>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they
> >>> are vote-able switches. Ideally, we should ensure that the hw has
> >>> collapsed for gpu recovery because there could be transient votes from
> >>> other subsystems like hypervisor using their vote register.
> >>>
> >>> I am not sure how complex the plumbing to gpucc driver would be to allow
> >>> gpu driver to check hw status. OTOH, with this patch, gpu driver does a
> >>> read operation on a gpucc register which is in always-on domain. That
> >>> means we don't need to vote any resource to access this register.

Reading between the lines here, you're saying that you have to read the
gdsc register to make sure that the gdsc is in some state? Can you
clarify exactly what you're doing? And how do you know that something
else in the kernel can't cause the register to change after it is read?
It certainly seems like we can't be certain because there is voting
involved.

> >>>
> >>> Stephen/Rajendra/Taniya, any suggestion?
> > Why can't you assert a gpu reset signal with the reset APIs? This series
> > seems to jump through a bunch of hoops to get the gdsc and power domain
> > to "reset" when I don't know why any of that is necessary. Can't we
> > simply assert a reset to the hardware after recovery completes so the
> > device is back into a good known POR (power on reset) state?
> That is because there is no register interface to reset GPU CX domain.
> The recommended sequence from HW design folks is to collapse both cx and
> gx gdsc to properly reset gpu/gmu.
>

Ok. One knee jerk reaction is to treat the gdsc as a reset then and
possibly mux that request along with any power domain on/off so that if
the reset is requested and the power domain is off nothing happens.
Otherwise if the power domain is on then it manually sequences and
controls the two gdscs so that the GPU is reset and then restores the
enable state of the power domain.

2022-07-19 09:58:39

by Rajendra Nayak

Subject: Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list



On 7/19/2022 12:49 PM, Stephen Boyd wrote:
> Quoting Akhil P Oommen (2022-07-18 23:37:16)
>> On 7/19/2022 11:19 AM, Stephen Boyd wrote:
>>> Quoting Akhil P Oommen (2022-07-18 21:07:05)
>>>> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
>>>>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they
>>>>> are vote-able switches. Ideally, we should ensure that the hw has
>>>>> collapsed for gpu recovery because there could be transient votes from
>>>>> other subsystems like hypervisor using their vote register.
>>>>>
>>>>> I am not sure how complex the plumbing to gpucc driver would be to allow
>>>>> gpu driver to check hw status. OTOH, with this patch, gpu driver does a
>>>>> read operation on a gpucc register which is in always-on domain. That
>>>>> means we don't need to vote any resource to access this register.
>
> Reading between the lines here, you're saying that you have to read the
> gdsc register to make sure that the gdsc is in some state? Can you
> clarify exactly what you're doing? And how do you know that something
> else in the kernel can't cause the register to change after it is read?
> It certainly seems like we can't be certain because there is voting
> involved.

yes, this looks like the best case effort to get the gpu to recover, but
the kernel driver really has no control to make sure this condition can
always be met (because it depends on other entities like hyp, trustzone etc right?)
Why not just put a worst case polling delay?

>
>>>>>
>>>>> Stephen/Rajendra/Taniya, any suggestion?
>>> Why can't you assert a gpu reset signal with the reset APIs? This series
>>> seems to jump through a bunch of hoops to get the gdsc and power domain
>>> to "reset" when I don't know why any of that is necessary. Can't we
>>> simply assert a reset to the hardware after recovery completes so the
>>> device is back into a good known POR (power on reset) state?
>> That is because there is no register interface to reset GPU CX domain.
>> The recommended sequence from HW design folks is to collapse both cx and
>> gx gdsc to properly reset gpu/gmu.
>>
>
> Ok. One knee jerk reaction is to treat the gdsc as a reset then and
> possibly mux that request along with any power domain on/off so that if
> the reset is requested and the power domain is off nothing happens.
> Otherwise if the power domain is on then it manually sequences and
> controls the two gdscs so that the GPU is reset and then restores the
> enable state of the power domain.

2022-07-20 06:06:29

by Akhil P Oommen

Subject: Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list

On 7/19/2022 3:26 PM, Rajendra Nayak wrote:
>
>
> On 7/19/2022 12:49 PM, Stephen Boyd wrote:
>> Quoting Akhil P Oommen (2022-07-18 23:37:16)
>>> On 7/19/2022 11:19 AM, Stephen Boyd wrote:
>>>> Quoting Akhil P Oommen (2022-07-18 21:07:05)
>>>>> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
>>>>>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since
>>>>>> they
>>>>>> are vote-able switches. Ideally, we should ensure that the hw has
>>>>>> collapsed for gpu recovery because there could be transient votes
>>>>>> from
>>>>>> other subsystems like hypervisor using their vote register.
>>>>>>
>>>>>> I am not sure how complex the plumbing to gpucc driver would be
>>>>>> to allow
>>>>>> gpu driver to check hw status. OTOH, with this patch, gpu driver
>>>>>> does a
>>>>>> read operation on a gpucc register which is in always-on domain.
>>>>>> That
>>>>>> means we don't need to vote any resource to access this register.
>>
>> Reading between the lines here, you're saying that you have to read the
>> gdsc register to make sure that the gdsc is in some state? Can you
>> clarify exactly what you're doing? And how do you know that something
>> else in the kernel can't cause the register to change after it is read?
>> It certainly seems like we can't be certain because there is voting
>> involved.
From the gpu driver, the cx_gdscr.bit[31] (power off status) register can be
polled to ensure that it *collapsed at least once*. We don't need to
care if something turns the gdsc ON after that.

>
> yes, this looks like the best case effort to get the gpu to recover, but
> the kernel driver really has no control to make sure this condition can
> always be met (because it depends on other entities like hyp,
> trustzone etc right?)
> Why not just put a worst case polling delay?

I didn't get you entirely. Where do you mean to keep the polling delay?
>
>>
>>>>>>
>>>>>> Stephen/Rajendra/Taniya, any suggestion?
>>>> Why can't you assert a gpu reset signal with the reset APIs? This
>>>> series
>>>> seems to jump through a bunch of hoops to get the gdsc and power
>>>> domain
>>>> to "reset" when I don't know why any of that is necessary. Can't we
>>>> simply assert a reset to the hardware after recovery completes so the
>>>> device is back into a good known POR (power on reset) state?
>>> That is because there is no register interface to reset GPU CX domain.
>>> The recommended sequence from HW design folks is to collapse both cx
>>> and
>>> gx gdsc to properly reset gpu/gmu.
>>>
>>
>> Ok. One knee jerk reaction is to treat the gdsc as a reset then and
>> possibly mux that request along with any power domain on/off so that if
>> the reset is requested and the power domain is off nothing happens.
>> Otherwise if the power domain is on then it manually sequences and
>> controls the two gdscs so that the GPU is reset and then restores the
>> enable state of the power domain.
It would be fatal to asynchronously pull the plug on CX gdsc forcefully
because there might be another gpu/smmu driver thread accessing
registers in cx domain.

-Akhil.

2022-07-20 18:48:50

by Rob Clark

Subject: Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery

On Tue, Jul 12, 2022 at 12:15 PM Akhil P Oommen
<[email protected]> wrote:
>
> On 7/12/2022 10:14 PM, Rob Clark wrote:
> > On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
> > <[email protected]> wrote:
> >> On 7/12/2022 4:52 AM, Doug Anderson wrote:
> >>> Hi,
> >>>
> >>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <[email protected]> wrote:
> >>>> There are some hardware logic under CX domain. For a successful
> >>>> recovery, we should ensure cx headswitch collapses to ensure all the
> >>>> stale states are cleard out. This is especially true to for a6xx family
> >>>> where we can GMU co-processor.
> >>>>
> >>>> Currently, cx doesn't collapse due to a devlink between gpu and its
> >>>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
> >>>> that the iommu driver removes its vote on cx gdsc.
> >>>>
> >>>> Signed-off-by: Akhil P Oommen <[email protected]>
> >>>> ---
> >>>>
> >>>> (no changes since v1)
> >>>>
> >>>> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
> >>>> drivers/gpu/drm/msm/msm_gpu.c | 2 --
> >>>> 2 files changed, 14 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> index 4d50110..7ed347c 100644
> >>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >>>> */
> >>>> gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
> >>>>
> >>>> - gpu->funcs->pm_suspend(gpu);
> >>>> - gpu->funcs->pm_resume(gpu);
> >>>> + /*
> >>>> + * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
> >>>> + * First drop the usage count from all active submits
> >>>> + */
> >>>> + for (i = gpu->active_submits; i > 0; i--)
> >>>> + pm_runtime_put(&gpu->pdev->dev);
> >>>> +
> >>>> + /* And the final one from recover worker */
> >>>> + pm_runtime_put_sync(&gpu->pdev->dev);
> >>>> +
> >>>> + for (i = gpu->active_submits; i > 0; i--)
> >>>> + pm_runtime_get(&gpu->pdev->dev);
> >>>> +
> >>>> + pm_runtime_get_sync(&gpu->pdev->dev);
> >>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
> >>> Those seem like they would work to me, too. Why not use them?
> >> Quoting my previous response which I seem to have sent only to Freedreno
> >> list:
> >>
> >> "I believe it is supposed to be used only during system sleep state
> >> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
> >> fail by disabling RPM here."
> > The comment about not wanting other runpm calls to fail is valid.. but
> > that is also solveable, ie. by holding a lock around runpm calls.
> > Which I think we need to do anyways, otherwise looping over
> > gpu->active_submits is racey..
> >
> > I think pm_runtime_force_suspend/resume() is the least-bad option.. or
> > at least I'm not seeing any obvious alternative that is better
> >
> > BR,
> > -R
> We are holding gpu->lock here which will block further submissions from
> scheduler. Will active_submits still race?
>
> It is possible that there is another thread which successfully completed
> pm_runtime_get() and while it access the hardware, we pulled the plug on
> regulator/clock here. That will result in obvious device crash. So I can
> think of 2 solutions:
>
> 1. wrap *every* pm_runtime_get/put with a mutex. Something like:
> mutex_lock();
> pm_runtime_get();
> < ... access hardware here >>
> pm_runtime_put();
> mutex_unlock();
>
> 2. Drop runtime votes from every submit in recover worker and wait/poll
> for regulator to collapse in case there are transient votes on
> regulator from other threads/subsystems.
>
> Option (2) seems simpler to me. What do you think?
>

But I think without #1 you could still be racing w/ some other path
that touches the hw, like devfreq, right. They could be holding a
runpm ref, so even if you loop over active_submits decrementing the
runpm ref, it still doesn't drop to zero

BR,
-R

2022-07-20 20:58:50

by Akhil P Oommen

Subject: Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery

On 7/20/2022 11:36 PM, Rob Clark wrote:
> On Tue, Jul 12, 2022 at 12:15 PM Akhil P Oommen
> <[email protected]> wrote:
>> On 7/12/2022 10:14 PM, Rob Clark wrote:
>>> On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
>>> <[email protected]> wrote:
>>>> On 7/12/2022 4:52 AM, Doug Anderson wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <[email protected]> wrote:
>>>>>> There are some hardware logic under CX domain. For a successful
>>>>>> recovery, we should ensure cx headswitch collapses to ensure all the
>>>>>> stale states are cleard out. This is especially true to for a6xx family
>>>>>> where we can GMU co-processor.
>>>>>>
>>>>>> Currently, cx doesn't collapse due to a devlink between gpu and its
>>>>>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>>>>>> that the iommu driver removes its vote on cx gdsc.
>>>>>>
>>>>>> Signed-off-by: Akhil P Oommen <[email protected]>
>>>>>> ---
>>>>>>
>>>>>> (no changes since v1)
>>>>>>
>>>>>> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>>>>> drivers/gpu/drm/msm/msm_gpu.c | 2 --
>>>>>> 2 files changed, 14 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> index 4d50110..7ed347c 100644
>>>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>>>>> */
>>>>>> gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>>>>>
>>>>>> - gpu->funcs->pm_suspend(gpu);
>>>>>> - gpu->funcs->pm_resume(gpu);
>>>>>> + /*
>>>>>> + * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
>>>>>> + * First drop the usage count from all active submits
>>>>>> + */
>>>>>> + for (i = gpu->active_submits; i > 0; i--)
>>>>>> + pm_runtime_put(&gpu->pdev->dev);
>>>>>> +
>>>>>> + /* And the final one from recover worker */
>>>>>> + pm_runtime_put_sync(&gpu->pdev->dev);
>>>>>> +
>>>>>> + for (i = gpu->active_submits; i > 0; i--)
>>>>>> + pm_runtime_get(&gpu->pdev->dev);
>>>>>> +
>>>>>> + pm_runtime_get_sync(&gpu->pdev->dev);
>>>>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
>>>>> Those seem like they would work to me, too. Why not use them?
>>>> Quoting my previous response which I seem to have sent only to Freedreno
>>>> list:
>>>>
>>>> "I believe it is supposed to be used only during system sleep state
>>>> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
>>>> fail by disabling RPM here."
>>> The comment about not wanting other runpm calls to fail is valid.. but
>>> that is also solveable, ie. by holding a lock around runpm calls.
>>> Which I think we need to do anyways, otherwise looping over
>>> gpu->active_submits is racey..
>>>
>>> I think pm_runtime_force_suspend/resume() is the least-bad option.. or
>>> at least I'm not seeing any obvious alternative that is better
>>>
>>> BR,
>>> -R
>> We are holding gpu->lock here which will block further submissions from
>> scheduler. Will active_submits still race?
>>
>> It is possible that there is another thread which successfully completed
>> pm_runtime_get() and while it access the hardware, we pulled the plug on
>> regulator/clock here. That will result in obvious device crash. So I can
>> think of 2 solutions:
>>
>> 1. wrap *every* pm_runtime_get/put with a mutex. Something like:
>> mutex_lock();
>> pm_runtime_get();
>> < ... access hardware here >>
>> pm_runtime_put();
>> mutex_unlock();
>>
>> 2. Drop runtime votes from every submit in recover worker and wait/poll
>> for regulator to collapse in case there are transient votes on
>> regulator from other threads/subsystems.
>>
>> Option (2) seems simpler to me. What do you think?
>>
> But I think without #1 you could still be racing w/ some other path
> that touches the hw, like devfreq, right. They could be holding a
> runpm ref, so even if you loop over active_submits decrementing the
> runpm ref, it still doesn't drop to zero
>
> BR,
> -R
Yes, you are right. There could be some transient votes from other
threads/drivers/subsystems. This is the reason we need to poll for cx
gdsc collapse in the next patch. Even with #1, it is difficult to
coordinate with the smmu driver and close to impossible with tz/hyp.

-Akhil.

2022-07-21 16:10:18

by Akhil P Oommen

Subject: Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list

On 7/20/2022 11:34 AM, Akhil P Oommen wrote:
> On 7/19/2022 3:26 PM, Rajendra Nayak wrote:
>>
>>
>> On 7/19/2022 12:49 PM, Stephen Boyd wrote:
>>> Quoting Akhil P Oommen (2022-07-18 23:37:16)
>>>> On 7/19/2022 11:19 AM, Stephen Boyd wrote:
>>>>> Quoting Akhil P Oommen (2022-07-18 21:07:05)
>>>>>> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
>>>>>>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed
>>>>>>> since they
>>>>>>> are vote-able switches. Ideally, we should ensure that the hw has
>>>>>>> collapsed for gpu recovery because there could be transient
>>>>>>> votes from
>>>>>>> other subsystems like hypervisor using their vote register.
>>>>>>>
>>>>>>> I am not sure how complex the plumbing to gpucc driver would be
>>>>>>> to allow
>>>>>>> gpu driver to check hw status. OTOH, with this patch, gpu driver
>>>>>>> does a
>>>>>>> read operation on a gpucc register which is in always-on domain.
>>>>>>> That
>>>>>>> means we don't need to vote any resource to access this register.
>>>
>>> Reading between the lines here, you're saying that you have to read the
>>> gdsc register to make sure that the gdsc is in some state? Can you
>>> clarify exactly what you're doing? And how do you know that something
>>> else in the kernel can't cause the register to change after it is read?
>>> It certainly seems like we can't be certain because there is voting
>>> involved.
> From gpu driver, cx_gdscr.bit[31] (power off status) register can be
> polled to ensure that it *collapsed at least once*. We don't need to
> care if something turns ON gdsc after that.
>
>>
>> yes, this looks like the best case effort to get the gpu to recover, but
>> the kernel driver really has no control to make sure this condition can
>> always be met (because it depends on other entities like hyp,
>> trustzone etc right?)
>> Why not just put a worst case polling delay?
>
> I didn't get you entirely. Where do you mean to keep the polling delay?
>>
>>>
>>>>>>>
>>>>>>> Stephen/Rajendra/Taniya, any suggestion?
>>>>> Why can't you assert a gpu reset signal with the reset APIs? This
>>>>> series
>>>>> seems to jump through a bunch of hoops to get the gdsc and power
>>>>> domain
>>>>> to "reset" when I don't know why any of that is necessary. Can't we
>>>>> simply assert a reset to the hardware after recovery completes so the
>>>>> device is back into a good known POR (power on reset) state?
>>>> That is because there is no register interface to reset GPU CX domain.
>>>> The recommended sequence from HW design folks is to collapse both
>>>> cx and
>>>> gx gdsc to properly reset gpu/gmu.
>>>>
>>>
>>> Ok. One knee jerk reaction is to treat the gdsc as a reset then and
>>> possibly mux that request along with any power domain on/off so that if
>>> the reset is requested and the power domain is off nothing happens.
>>> Otherwise if the power domain is on then it manually sequences and
>>> controls the two gdscs so that the GPU is reset and then restores the
>>> enable state of the power domain.
> It would be fatal to forcefully pull the plug on the CX gdsc
> asynchronously, because another gpu/smmu driver thread might be
> accessing registers in the cx domain at the same time.
>
> -Akhil.
>
But we can move the cx collapse polling to gpucc and expose it to the gpu
driver using the 'reset' framework. I am not very familiar with the clk
driver, but I did a rough prototype here (untested):
https://zerobin.net/?d34b5f958be3b9b8#NKGzdPy9fgcuOqXZ/XqjI7b8JWcivqe+oSTf4yWHSOU=

If this approach is acceptable, I will send it out as a separate series.

-Akhil.
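
For context, the provider side of that idea could look roughly like the
following (untested sketch; the struct/function names, register offset and
timeout are assumptions made here for illustration, the actual prototype is
the zerobin paste above):

#include <linux/bits.h>
#include <linux/container_of.h>
#include <linux/io.h>
#include <linux/iopoll.h>
#include <linux/reset-controller.h>

struct gpucc_cx_reset {
        struct reset_controller_dev rcdev;
        void __iomem *base;             /* gpucc register space */
        u32 gdscr_offset;               /* assumed CX GDSCR offset */
};

/* .reset op: wait for the CX gdsc to report "power off" (bit 31) once. */
static int gpucc_cx_collapse_reset(struct reset_controller_dev *rcdev,
                                   unsigned long id)
{
        struct gpucc_cx_reset *r =
                container_of(rcdev, struct gpucc_cx_reset, rcdev);
        u32 val;

        return readl_poll_timeout(r->base + r->gdscr_offset, val,
                                  val & BIT(31), 100, 500000);
}

static const struct reset_control_ops gpucc_cx_reset_ops = {
        .reset = gpucc_cx_collapse_reset,
};

/* Registration (devm_reset_controller_register() etc.) is omitted here. */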

2022-07-22 15:48:32

by Rob Clark

[permalink] [raw]
Subject: Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list

On Thu, Jul 21, 2022 at 9:04 AM Akhil P Oommen <[email protected]> wrote:
>
> On 7/20/2022 11:34 AM, Akhil P Oommen wrote:
> > On 7/19/2022 3:26 PM, Rajendra Nayak wrote:
> >>
> >>
> >> On 7/19/2022 12:49 PM, Stephen Boyd wrote:
> >>> Quoting Akhil P Oommen (2022-07-18 23:37:16)
> >>>> On 7/19/2022 11:19 AM, Stephen Boyd wrote:
> >>>>> Quoting Akhil P Oommen (2022-07-18 21:07:05)
> >>>>>> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
> >>>>>>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed
> >>>>>>> since they
> >>>>>>> are vote-able switches. Ideally, we should ensure that the hw has
> >>>>>>> collapsed for gpu recovery because there could be transient
> >>>>>>> votes from
> >>>>>>> other subsystems like hypervisor using their vote register.
> >>>>>>>
> >>>>>>> I am not sure how complex the plumbing to gpucc driver would be
> >>>>>>> to allow
> >>>>>>> gpu driver to check hw status. OTOH, with this patch, gpu driver
> >>>>>>> does a
> >>>>>>> read operation on a gpucc register which is in always-on domain.
> >>>>>>> That
> >>>>>>> means we don't need to vote any resource to access this register.
> >>>
> >>> Reading between the lines here, you're saying that you have to read the
> >>> gdsc register to make sure that the gdsc is in some state? Can you
> >>> clarify exactly what you're doing? And how do you know that something
> >>> else in the kernel can't cause the register to change after it is read?
> >>> It certainly seems like we can't be certain because there is voting
> >>> involved.
> > From gpu driver, cx_gdscr.bit[31] (power off status) register can be
> > polled to ensure that it *collapsed at least once*. We don't need to
> > care if something turns ON gdsc after that.
> >
> >>
> >> yes, this looks like the best case effort to get the gpu to recover, but
> >> the kernel driver really has no control to make sure this condition can
> >> always be met (because it depends on other entities like hyp,
> >> trustzone etc right?)
> >> Why not just put a worst case polling delay?
> >
> > I didn't get you entirely. Where do you mean to keep the polling delay?
> >>
> >>>
> >>>>>>>
> >>>>>>> Stephen/Rajendra/Taniya, any suggestion?
> >>>>> Why can't you assert a gpu reset signal with the reset APIs? This
> >>>>> series
> >>>>> seems to jump through a bunch of hoops to get the gdsc and power
> >>>>> domain
> >>>>> to "reset" when I don't know why any of that is necessary. Can't we
> >>>>> simply assert a reset to the hardware after recovery completes so the
> >>>>> device is back into a good known POR (power on reset) state?
> >>>> That is because there is no register interface to reset GPU CX domain.
> >>>> The recommended sequence from HW design folks is to collapse both
> >>>> cx and
> >>>> gx gdsc to properly reset gpu/gmu.
> >>>>
> >>>
> >>> Ok. One knee jerk reaction is to treat the gdsc as a reset then and
> >>> possibly mux that request along with any power domain on/off so that if
> >>> the reset is requested and the power domain is off nothing happens.
> >>> Otherwise if the power domain is on then it manually sequences and
> >>> controls the two gdscs so that the GPU is reset and then restores the
> >>> enable state of the power domain.
> > It would be fatal to forcefully pull the plug on the CX gdsc
> > asynchronously, because another gpu/smmu driver thread might be
> > accessing registers in the cx domain at the same time.
> >
> > -Akhil.
> >
> But we can move the cx collapse polling to gpucc and expose it to the gpu
> driver using the 'reset' framework. I am not very familiar with the clk
> driver, but I did a rough prototype here (untested):
> https://zerobin.net/?d34b5f958be3b9b8#NKGzdPy9fgcuOqXZ/XqjI7b8JWcivqe+oSTf4yWHSOU=
>
> If this approach is acceptable, I will send it out as a separate series.
>

I'm not super familiar w/ reset framework, but this approach seems
like it would avoid needing to play games with working around runpm as
well. So that seems like a cleaner approach.

BR,
-R
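
On the consumer side, the gpu driver's part of such an approach would be
fairly small; a rough sketch under the assumption of a 'cx_collapse' reset
line in DT (the name and struct fields below are invented for illustration):

#include <linux/device.h>
#include <linux/err.h>
#include <linux/reset.h>

struct my_gpu {
        struct device *dev;
        struct reset_control *cx_collapse;      /* assumed field name */
};

/* At probe: look up the (optional) reset line exposed by gpucc. */
static int my_gpu_get_cx_reset(struct my_gpu *gpu)
{
        gpu->cx_collapse =
                devm_reset_control_get_optional_exclusive(gpu->dev, "cx_collapse");

        return PTR_ERR_OR_ZERO(gpu->cx_collapse);
}

/* In the recovery path, after the driver's own runpm votes are dropped. */
static void my_gpu_force_cx_collapse(struct my_gpu *gpu)
{
        /* A no-op if the reset line is not described in DT. */
        reset_control_reset(gpu->cx_collapse);
}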

2022-07-22 17:58:31

by Akhil P Oommen

[permalink] [raw]
Subject: Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery

On 7/21/2022 2:08 AM, Akhil P Oommen wrote:
> On 7/20/2022 11:36 PM, Rob Clark wrote:
>> On Tue, Jul 12, 2022 at 12:15 PM Akhil P Oommen
>> <[email protected]> wrote:
>>> On 7/12/2022 10:14 PM, Rob Clark wrote:
>>>> On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
>>>> <[email protected]> wrote:
>>>>> On 7/12/2022 4:52 AM, Doug Anderson wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen
>>>>>> <[email protected]> wrote:
>>>>>>> There is some hardware logic under the CX domain. For a successful
>>>>>>> recovery, we should ensure the cx headswitch collapses so that all
>>>>>>> the stale states are cleared out. This is especially true for the
>>>>>>> a6xx family, where we have a GMU co-processor.
>>>>>>>
>>>>>>> Currently, cx doesn't collapse due to a devlink between gpu and its
>>>>>>> smmu. So the *struct gpu device* needs to be runtime suspended
>>>>>>> to ensure
>>>>>>> that the iommu driver removes its vote on cx gdsc.
>>>>>>>
>>>>>>> Signed-off-by: Akhil P Oommen <[email protected]>
>>>>>>> ---
>>>>>>>
>>>>>>> (no changes since v1)
>>>>>>>
>>>>>>>     drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>>>>>>     drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>>>>>>>     2 files changed, 14 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>>> index 4d50110..7ed347c 100644
>>>>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>>>>>>             */
>>>>>>>            gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>>>>>>
>>>>>>> -       gpu->funcs->pm_suspend(gpu);
>>>>>>> -       gpu->funcs->pm_resume(gpu);
>>>>>>> +       /*
>>>>>>> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
>>>>>>> +        * First drop the usage count from all active submits
>>>>>>> +        */
>>>>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>>>>> +               pm_runtime_put(&gpu->pdev->dev);
>>>>>>> +
>>>>>>> +       /* And the final one from recover worker */
>>>>>>> +       pm_runtime_put_sync(&gpu->pdev->dev);
>>>>>>> +
>>>>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>>>>> +               pm_runtime_get(&gpu->pdev->dev);
>>>>>>> +
>>>>>>> +       pm_runtime_get_sync(&gpu->pdev->dev);
>>>>>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
>>>>>> Those seem like they would work to me, too. Why not use them?
>>>>> Quoting my previous response which I seem to have sent only to
>>>>> Freedreno
>>>>> list:
>>>>>
>>>>> "I believe it is supposed to be used only during system sleep state
>>>>> transitions. Btw, we don't want pm_runtime_get() calls from
>>>>> elsewhere to
>>>>> fail by disabling RPM here."
>>>> The comment about not wanting other runpm calls to fail is valid.. but
>>>> that is also solvable, ie. by holding a lock around runpm calls.
>>>> Which I think we need to do anyways, otherwise looping over
>>>> gpu->active_submits is racy..
>>>>
>>>> I think pm_runtime_force_suspend/resume() is the least-bad option.. or
>>>> at least I'm not seeing any obvious alternative that is better
>>>>
>>>> BR,
>>>> -R
>>> We are holding gpu->lock here which will block further submissions from
>>> scheduler. Will active_submits still race?
>>>
>>> It is possible that there is another thread which successfully
>>> completed pm_runtime_get() and, while it accesses the hardware, we
>>> pulled the plug on the regulator/clock here. That would result in an
>>> obvious device crash. So I can think of 2 solutions:
>>>
>>> 1. wrap *every* pm_runtime_get/put with a mutex. Something like:
>>>               mutex_lock();
>>>               pm_runtime_get();
>>>               < ... access hardware here ... >
>>>               pm_runtime_put();
>>>               mutex_unlock();
>>>
>>> 2. Drop runtime votes from every submit in recover worker and wait/poll
>>> for regulator to collapse in case there are transient votes on
>>> regulator  from other threads/subsystems.
>>>
>>> Option (2) seems simpler to me.  What do you think?
>>>
>> But I think without #1 you could still be racing w/ some other path
>> that touches the hw, like devfreq, right.  They could be holding a
>> runpm ref, so even if you loop over active_submits decrementing the
>> runpm ref, it still doesn't drop to zero
>>
>> BR,
>> -R
> Yes, you are right. There could be some transient votes from other
> threads/drivers/subsystems. This is the reason we need to poll for cx
> gdsc collapse in the next patch. Even with #1, it is difficult to
> coordinate with the smmu driver and close to impossible with tz/hyp.
>
> -Akhil.

Rob,

Summarizing my responses:
1. We cannot blindly force the cx headswitch off because that would
impact other gpu driver threads, the smmu driver, tz/hyp etc., which may
access cx domain registers at the same time.
2. We need to drop all our rpm votes on the 'gpu device' instead of a
single vote on the 'gmu device' because of [1]. Otherwise, the smmu
driver's vote on the cx headswitch will block its collapse forever.

This is the high-level sequence implemented in the current version of the
series (a rough sketch follows at the end of this mail):
1. Drop all rpm votes on the 'gpu device', which will indirectly let the
smmu driver drop its vote on the cx HS.
2. To take care of transient votes from other threads/hyp etc., poll the
cx gdsc hw register to ensure that it has collapsed. (We might be able
to move this to the gpucc driver depending on the consensus on the other
patch.)

[1] https://lkml.org/lkml/2018/8/30/590

-Akhil.
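
Pulling the two steps together, the recovery-side sequence described above
would look roughly like this (untested sketch; the collapse-wait helper is
a placeholder for whichever mechanism, direct gdscr poll or reset
framework, ends up being used):

#include <linux/pm_runtime.h>

#include "msm_gpu.h"

/* Placeholder for the gdscr poll / reset_control_reset() call. */
static void wait_for_cx_collapse(struct msm_gpu *gpu);

static void recover_drop_votes_and_collapse(struct msm_gpu *gpu)
{
        int i;

        /* Step 1: drop every runpm vote held on the gpu device... */
        for (i = gpu->active_submits; i > 0; i--)
                pm_runtime_put(&gpu->pdev->dev);

        /* ...including the one taken by the recover worker itself. */
        pm_runtime_put_sync(&gpu->pdev->dev);

        /* Step 2: wait until the CX gdsc has actually collapsed once. */
        wait_for_cx_collapse(gpu);

        /* Restore the votes so the rest of recovery proceeds as before. */
        for (i = gpu->active_submits; i > 0; i--)
                pm_runtime_get(&gpu->pdev->dev);

        pm_runtime_get_sync(&gpu->pdev->dev);
}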