LinuxLists.cc - [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery

2024-01-24 03:00:44

Subject: [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery

v1 reference:
https://patchwork.kernel.org/project/dri-devel/cover/[email protected]/

Changes v1 -> v2:
- Dropped patch 1 which aimed to fix
https://gitlab.freedesktop.org/mesa/mesa/-/issues/8415 .
That will require more testing and an actual fix to the irq/timeout
handler race. It can be solved separately so I am deferring it to a
followup patch and keeping that issue open.

- Added patches 2 and 4 to cover "reset time out" and bus stop bit to
hard reset in gp as well.

- Added handling of all processors in synchronize_irq in patch 5 to
cover multiple pp. Dropped unnecessary duplicate fence in patch 5.

- Added patch 7 in v2. After some discussion in patch 4 (v1), it seems
to be reasonable to bump our timeout value so that we further decrease
the chance of users actually hitting any of these timeouts by default.

- Reworked patch 8 in v2. Since I broadened the work to not only focus
in pp anymore, I also included the change to the other blocks as well.

- Collected some reviews and acks in unmodified patches.

Erico Nunes (8):
drm/lima: reset async_reset on pp hard reset
drm/lima: reset async_reset on gp hard reset
drm/lima: set pp bus_stop bit before hard reset
drm/lima: set gp bus_stop bit before hard reset
drm/lima: handle spurious timeouts due to high irq latency
drm/lima: remove guilty drm_sched context handling
drm/lima: increase default job timeout to 10s
drm/lima: standardize debug messages by ip name

drivers/gpu/drm/lima/lima_ctx.c | 2 +-
drivers/gpu/drm/lima/lima_ctx.h | 1 -
drivers/gpu/drm/lima/lima_gp.c | 39 +++++++++++++++++++++-------
drivers/gpu/drm/lima/lima_l2_cache.c | 6 +++--
drivers/gpu/drm/lima/lima_mmu.c | 18 ++++++-------
drivers/gpu/drm/lima/lima_pmu.c | 3 ++-
drivers/gpu/drm/lima/lima_pp.c | 37 ++++++++++++++++++++------
drivers/gpu/drm/lima/lima_sched.c | 38 ++++++++++++++++++++++-----
drivers/gpu/drm/lima/lima_sched.h | 3 +--
9 files changed, 107 insertions(+), 40 deletions(-)

--
2.43.0

2024-01-24 03:01:04

by Erico Nunes

[permalink] [raw]

Subject: [PATCH v2 3/8] drm/lima: set pp bus_stop bit before hard reset

This is required for reliable hard resets. Otherwise, doing a hard reset
while a task is still running (such as a task which is being stopped by
the drm_sched timeout handler) may result in random mmu write timeouts
or lockups which cause the entire gpu to hang.

Signed-off-by: Erico Nunes <[email protected]>
Reviewed-by: Vasily Khoruzhick <[email protected]>
---
drivers/gpu/drm/lima/lima_pp.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
index a8f8f63b8295..ac097dd75072 100644
--- a/drivers/gpu/drm/lima/lima_pp.c
+++ b/drivers/gpu/drm/lima/lima_pp.c
@@ -168,6 +168,11 @@ static void lima_pp_write_frame(struct lima_ip *ip, u32 *frame, u32 *wb)
}
}

+static int lima_pp_bus_stop_poll(struct lima_ip *ip)
+{
+ return !!(pp_read(LIMA_PP_STATUS) & LIMA_PP_STATUS_BUS_STOPPED);
+}
+
static int lima_pp_hard_reset_poll(struct lima_ip *ip)
{
pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0xC01A0000);
@@ -181,6 +186,14 @@ static int lima_pp_hard_reset(struct lima_ip *ip)

pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0xC0FFE000);
pp_write(LIMA_PP_INT_MASK, 0);
+
+ pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_STOP_BUS);
+ ret = lima_poll_timeout(ip, lima_pp_bus_stop_poll, 10, 100);
+ if (ret) {
+ dev_err(dev->dev, "pp %s bus stop timeout\n", lima_ip_name(ip));
+ return ret;
+ }
+
pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_FORCE_RESET);
ret = lima_poll_timeout(ip, lima_pp_hard_reset_poll, 10, 100);
if (ret) {
--
2.43.0

2024-01-24 03:01:04

by Erico Nunes

[permalink] [raw]

Subject: [PATCH v2 4/8] drm/lima: set gp bus_stop bit before hard reset

This is required for reliable hard resets. Otherwise, doing a hard reset
while a task is still running (such as a task which is being stopped by
the drm_sched timeout handler) may result in random mmu write timeouts
or lockups which cause the entire gpu to hang.

Signed-off-by: Erico Nunes <[email protected]>
---
drivers/gpu/drm/lima/lima_gp.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/lima/lima_gp.c b/drivers/gpu/drm/lima/lima_gp.c
index b9a06e701a33..4355fa7b17f4 100644
--- a/drivers/gpu/drm/lima/lima_gp.c
+++ b/drivers/gpu/drm/lima/lima_gp.c
@@ -166,6 +166,11 @@ static void lima_gp_task_run(struct lima_sched_pipe *pipe,
gp_write(LIMA_GP_CMD, cmd);
}

+static int lima_gp_bus_stop_poll(struct lima_ip *ip)
+{
+ return !!(gp_read(LIMA_GP_STATUS) & LIMA_GP_STATUS_BUS_STOPPED);
+}
+
static int lima_gp_hard_reset_poll(struct lima_ip *ip)
{
gp_write(LIMA_GP_PERF_CNT_0_LIMIT, 0xC01A0000);
@@ -179,6 +184,13 @@ static int lima_gp_hard_reset(struct lima_ip *ip)

gp_write(LIMA_GP_PERF_CNT_0_LIMIT, 0xC0FFE000);
gp_write(LIMA_GP_INT_MASK, 0);
+
+ gp_write(LIMA_GP_CMD, LIMA_GP_CMD_STOP_BUS);
+ ret = lima_poll_timeout(ip, lima_gp_bus_stop_poll, 10, 100);
+ if (ret) {
+ dev_err(dev->dev, "%s bus stop timeout\n", lima_ip_name(ip));
+ return ret;
+ }
gp_write(LIMA_GP_CMD, LIMA_GP_CMD_RESET);
ret = lima_poll_timeout(ip, lima_gp_hard_reset_poll, 10, 100);
if (ret) {
--
2.43.0

2024-01-24 03:01:44

by Erico Nunes

[permalink] [raw]

Subject: [PATCH v2 7/8] drm/lima: increase default job timeout to 10s

The previous 500ms default timeout was fairly optimistic and could be
hit by real world applications. Many distributions targeting devices
with a Mali-4xx already bumped this timeout to a higher limit.
We can be generous here with a high value as 10s since this should
mostly catch buggy jobs like infinite loop shaders, and these don't
seem to happen very often in real applications.

Signed-off-by: Erico Nunes <[email protected]>
---
drivers/gpu/drm/lima/lima_sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index c2e78605e43e..00b19adfc888 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -505,7 +505,7 @@ static void lima_sched_recover_work(struct work_struct *work)
int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
{
unsigned int timeout = lima_sched_timeout_ms > 0 ?
- lima_sched_timeout_ms : 500;
+ lima_sched_timeout_ms : 10000;

pipe->fence_context = dma_fence_context_alloc(1);
spin_lock_init(&pipe->fence_lock);
--
2.43.0

2024-01-24 03:03:06

by Erico Nunes

[permalink] [raw]

Subject: [PATCH v2 6/8] drm/lima: remove guilty drm_sched context handling

Marking the context as guilty currently only makes the application which
hits a single timeout problem to stop its rendering context entirely.
All jobs submitted later are dropped from the guilty context.

Lima runs on fairly underpowered hardware for modern standards and it is
not entirely unreasonable that a rendering job may time out occasionally
due to high system load or too demanding application stack. In this case
it would be generally preferred to report the error but try to keep the
application going.

Other similar embedded GPU drivers don't make use of the guilty context
flag. Now that there are reliability improvements to the lima timeout
recovery handling, drop the guilty contexts to let the application keep
running in this case.

Signed-off-by: Erico Nunes <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Vasily Khoruzhick <[email protected]>
---
drivers/gpu/drm/lima/lima_ctx.c | 2 +-
drivers/gpu/drm/lima/lima_ctx.h | 1 -
drivers/gpu/drm/lima/lima_sched.c | 5 ++---
drivers/gpu/drm/lima/lima_sched.h | 3 +--
4 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/lima/lima_ctx.c b/drivers/gpu/drm/lima/lima_ctx.c
index 8389f2d7d021..0e668fc1e0f9 100644
--- a/drivers/gpu/drm/lima/lima_ctx.c
+++ b/drivers/gpu/drm/lima/lima_ctx.c
@@ -19,7 +19,7 @@ int lima_ctx_create(struct lima_device *dev, struct lima_ctx_mgr *mgr, u32 *id)
kref_init(&ctx->refcnt);

for (i = 0; i < lima_pipe_num; i++) {
- err = lima_sched_context_init(dev->pipe + i, ctx->context + i, &ctx->guilty);
+ err = lima_sched_context_init(dev->pipe + i, ctx->context + i);
if (err)
goto err_out0;
}
diff --git a/drivers/gpu/drm/lima/lima_ctx.h b/drivers/gpu/drm/lima/lima_ctx.h
index 74e2be09090f..5b1063ce968b 100644
--- a/drivers/gpu/drm/lima/lima_ctx.h
+++ b/drivers/gpu/drm/lima/lima_ctx.h
@@ -13,7 +13,6 @@ struct lima_ctx {
struct kref refcnt;
struct lima_device *dev;
struct lima_sched_context context[lima_pipe_num];
- atomic_t guilty;

/* debug info */
char pname[TASK_COMM_LEN];
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 814428564637..c2e78605e43e 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -154,13 +154,12 @@ void lima_sched_task_fini(struct lima_sched_task *task)
}

int lima_sched_context_init(struct lima_sched_pipe *pipe,
- struct lima_sched_context *context,
- atomic_t *guilty)
+ struct lima_sched_context *context)
{
struct drm_gpu_scheduler *sched = &pipe->base;

return drm_sched_entity_init(&context->base, DRM_SCHED_PRIORITY_NORMAL,
- &sched, 1, guilty);
+ &sched, 1, NULL);
}

void lima_sched_context_fini(struct lima_sched_pipe *pipe,
diff --git a/drivers/gpu/drm/lima/lima_sched.h b/drivers/gpu/drm/lima/lima_sched.h
index 6a11764d87b3..6bd4f3b70109 100644
--- a/drivers/gpu/drm/lima/lima_sched.h
+++ b/drivers/gpu/drm/lima/lima_sched.h
@@ -91,8 +91,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
void lima_sched_task_fini(struct lima_sched_task *task);

int lima_sched_context_init(struct lima_sched_pipe *pipe,
- struct lima_sched_context *context,
- atomic_t *guilty);
+ struct lima_sched_context *context);
void lima_sched_context_fini(struct lima_sched_pipe *pipe,
struct lima_sched_context *context);
struct dma_fence *lima_sched_context_queue_task(struct lima_sched_task *task);
--
2.43.0

2024-01-24 03:03:29

by Erico Nunes

[permalink] [raw]

Subject: [PATCH v2 8/8] drm/lima: standardize debug messages by ip name

Some debug messages carried the ip name, or included "lima", or
included both the ip name and then the numbered ip name again.
Make the messages more consistent by always looking up and showing
the ip name first.

Signed-off-by: Erico Nunes <[email protected]>
---
drivers/gpu/drm/lima/lima_gp.c | 20 +++++++++++---------
drivers/gpu/drm/lima/lima_l2_cache.c | 6 ++++--
drivers/gpu/drm/lima/lima_mmu.c | 18 +++++++++---------
drivers/gpu/drm/lima/lima_pmu.c | 3 ++-
drivers/gpu/drm/lima/lima_pp.c | 19 ++++++++++---------
5 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/lima/lima_gp.c b/drivers/gpu/drm/lima/lima_gp.c
index 4355fa7b17f4..6b354e2fb61d 100644
--- a/drivers/gpu/drm/lima/lima_gp.c
+++ b/drivers/gpu/drm/lima/lima_gp.c
@@ -34,11 +34,11 @@ static irqreturn_t lima_gp_irq_handler(int irq, void *data)
if (state & LIMA_GP_IRQ_MASK_ERROR) {
if ((state & LIMA_GP_IRQ_MASK_ERROR) ==
LIMA_GP_IRQ_PLBU_OUT_OF_MEM) {
- dev_dbg(dev->dev, "gp out of heap irq status=%x\n",
- status);
+ dev_dbg(dev->dev, "%s out of heap irq status=%x\n",
+ lima_ip_name(ip), status);
} else {
- dev_err(dev->dev, "gp error irq state=%x status=%x\n",
- state, status);
+ dev_err(dev->dev, "%s error irq state=%x status=%x\n",
+ lima_ip_name(ip), state, status);
if (task)
task->recoverable = false;
}
@@ -89,7 +89,8 @@ static int lima_gp_soft_reset_async_wait(struct lima_ip *ip)
v & LIMA_GP_IRQ_RESET_COMPLETED,
0, 100);
if (err) {
- dev_err(dev->dev, "gp soft reset time out\n");
+ dev_err(dev->dev, "%s soft reset time out\n",
+ lima_ip_name(ip));
return err;
}

@@ -194,7 +195,7 @@ static int lima_gp_hard_reset(struct lima_ip *ip)
gp_write(LIMA_GP_CMD, LIMA_GP_CMD_RESET);
ret = lima_poll_timeout(ip, lima_gp_hard_reset_poll, 10, 100);
if (ret) {
- dev_err(dev->dev, "gp hard reset timeout\n");
+ dev_err(dev->dev, "%s hard reset timeout\n", lima_ip_name(ip));
return ret;
}

@@ -220,8 +221,9 @@ static void lima_gp_task_error(struct lima_sched_pipe *pipe)
{
struct lima_ip *ip = pipe->processor[0];

- dev_err(ip->dev->dev, "gp task error int_state=%x status=%x\n",
- gp_read(LIMA_GP_INT_STAT), gp_read(LIMA_GP_STATUS));
+ dev_err(ip->dev->dev, "%s task error int_state=%x status=%x\n",
+ lima_ip_name(ip), gp_read(LIMA_GP_INT_STAT),
+ gp_read(LIMA_GP_STATUS));

lima_gp_hard_reset(ip);
}
@@ -324,7 +326,7 @@ int lima_gp_init(struct lima_ip *ip)
err = devm_request_irq(dev->dev, ip->irq, lima_gp_irq_handler,
IRQF_SHARED, lima_ip_name(ip), ip);
if (err) {
- dev_err(dev->dev, "gp %s fail to request irq\n",
+ dev_err(dev->dev, "%s fail to request irq\n",
lima_ip_name(ip));
return err;
}
diff --git a/drivers/gpu/drm/lima/lima_l2_cache.c b/drivers/gpu/drm/lima/lima_l2_cache.c
index c4080a02957b..184106ce55f8 100644
--- a/drivers/gpu/drm/lima/lima_l2_cache.c
+++ b/drivers/gpu/drm/lima/lima_l2_cache.c
@@ -21,7 +21,8 @@ static int lima_l2_cache_wait_idle(struct lima_ip *ip)
!(v & LIMA_L2_CACHE_STATUS_COMMAND_BUSY),
0, 1000);
if (err) {
- dev_err(dev->dev, "l2 cache wait command timeout\n");
+ dev_err(dev->dev, "%s wait command timeout\n",
+ lima_ip_name(ip));
return err;
}
return 0;
@@ -83,7 +84,8 @@ int lima_l2_cache_init(struct lima_ip *ip)
spin_lock_init(&ip->data.lock);

size = l2_cache_read(LIMA_L2_CACHE_SIZE);
- dev_info(dev->dev, "l2 cache %uK, %u-way, %ubyte cache line, %ubit external bus\n",
+ dev_info(dev->dev, "%s %uK, %u-way, %ubyte cache line, %ubit external bus\n",
+ lima_ip_name(ip),
1 << (((size >> 16) & 0xff) - 10),
1 << ((size >> 8) & 0xff),
1 << (size & 0xff),
diff --git a/drivers/gpu/drm/lima/lima_mmu.c b/drivers/gpu/drm/lima/lima_mmu.c
index a1ae6c252dc2..e18317c5ca8c 100644
--- a/drivers/gpu/drm/lima/lima_mmu.c
+++ b/drivers/gpu/drm/lima/lima_mmu.c
@@ -22,7 +22,8 @@
cond, 0, 100); \
if (__ret) \
dev_err(dev->dev, \
- "mmu command %x timeout\n", cmd); \
+ "%s command %x timeout\n", \
+ lima_ip_name(ip), cmd); \
__ret; \
})

@@ -40,14 +41,13 @@ static irqreturn_t lima_mmu_irq_handler(int irq, void *data)
if (status & LIMA_MMU_INT_PAGE_FAULT) {
u32 fault = mmu_read(LIMA_MMU_PAGE_FAULT_ADDR);

- dev_err(dev->dev, "mmu page fault at 0x%x from bus id %d of type %s on %s\n",
- fault, LIMA_MMU_STATUS_BUS_ID(status),
- status & LIMA_MMU_STATUS_PAGE_FAULT_IS_WRITE ? "write" : "read",
- lima_ip_name(ip));
+ dev_err(dev->dev, "%s page fault at 0x%x from bus id %d of type %s\n",
+ lima_ip_name(ip), fault, LIMA_MMU_STATUS_BUS_ID(status),
+ status & LIMA_MMU_STATUS_PAGE_FAULT_IS_WRITE ? "write" : "read");
}

if (status & LIMA_MMU_INT_READ_BUS_ERROR)
- dev_err(dev->dev, "mmu %s irq bus error\n", lima_ip_name(ip));
+ dev_err(dev->dev, "%s irq bus error\n", lima_ip_name(ip));

/* mask all interrupts before resume */
mmu_write(LIMA_MMU_INT_MASK, 0);
@@ -102,14 +102,14 @@ int lima_mmu_init(struct lima_ip *ip)

mmu_write(LIMA_MMU_DTE_ADDR, 0xCAFEBABE);
if (mmu_read(LIMA_MMU_DTE_ADDR) != 0xCAFEB000) {
- dev_err(dev->dev, "mmu %s dte write test fail\n", lima_ip_name(ip));
+ dev_err(dev->dev, "%s dte write test fail\n", lima_ip_name(ip));
return -EIO;
}

err = devm_request_irq(dev->dev, ip->irq, lima_mmu_irq_handler,
IRQF_SHARED, lima_ip_name(ip), ip);
if (err) {
- dev_err(dev->dev, "mmu %s fail to request irq\n", lima_ip_name(ip));
+ dev_err(dev->dev, "%s fail to request irq\n", lima_ip_name(ip));
return err;
}

@@ -152,7 +152,7 @@ void lima_mmu_page_fault_resume(struct lima_ip *ip)
u32 v;

if (status & LIMA_MMU_STATUS_PAGE_FAULT_ACTIVE) {
- dev_info(dev->dev, "mmu resume\n");
+ dev_info(dev->dev, "%s resume\n", lima_ip_name(ip));

mmu_write(LIMA_MMU_INT_MASK, 0);
mmu_write(LIMA_MMU_DTE_ADDR, 0xCAFEBABE);
diff --git a/drivers/gpu/drm/lima/lima_pmu.c b/drivers/gpu/drm/lima/lima_pmu.c
index e397e1146e96..113cb9b215cd 100644
--- a/drivers/gpu/drm/lima/lima_pmu.c
+++ b/drivers/gpu/drm/lima/lima_pmu.c
@@ -21,7 +21,8 @@ static int lima_pmu_wait_cmd(struct lima_ip *ip)
v, v & LIMA_PMU_INT_CMD_MASK,
100, 100000);
if (err) {
- dev_err(dev->dev, "timeout wait pmu cmd\n");
+ dev_err(dev->dev, "%s timeout wait pmu cmd\n",
+ lima_ip_name(ip));
return err;
}

diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
index ac097dd75072..d0d2db0ef1ce 100644
--- a/drivers/gpu/drm/lima/lima_pp.c
+++ b/drivers/gpu/drm/lima/lima_pp.c
@@ -26,8 +26,8 @@ static void lima_pp_handle_irq(struct lima_ip *ip, u32 state)
if (state & LIMA_PP_IRQ_MASK_ERROR) {
u32 status = pp_read(LIMA_PP_STATUS);

- dev_err(dev->dev, "pp error irq state=%x status=%x\n",
- state, status);
+ dev_err(dev->dev, "%s error irq state=%x status=%x\n",
+ lima_ip_name(ip), state, status);

pipe->error = true;

@@ -125,7 +125,7 @@ static int lima_pp_soft_reset_async_wait_one(struct lima_ip *ip)

ret = lima_poll_timeout(ip, lima_pp_soft_reset_poll, 0, 100);
if (ret) {
- dev_err(dev->dev, "pp %s reset time out\n", lima_ip_name(ip));
+ dev_err(dev->dev, "%s reset time out\n", lima_ip_name(ip));
return ret;
}

@@ -190,14 +190,14 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_STOP_BUS);
ret = lima_poll_timeout(ip, lima_pp_bus_stop_poll, 10, 100);
if (ret) {
- dev_err(dev->dev, "pp %s bus stop timeout\n", lima_ip_name(ip));
+ dev_err(dev->dev, "%s bus stop timeout\n", lima_ip_name(ip));
return ret;
}

pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_FORCE_RESET);
ret = lima_poll_timeout(ip, lima_pp_hard_reset_poll, 10, 100);
if (ret) {
- dev_err(dev->dev, "pp hard reset timeout\n");
+ dev_err(dev->dev, "%s hard reset timeout\n", lima_ip_name(ip));
return ret;
}

@@ -274,7 +274,7 @@ int lima_pp_init(struct lima_ip *ip)
err = devm_request_irq(dev->dev, ip->irq, lima_pp_irq_handler,
IRQF_SHARED, lima_ip_name(ip), ip);
if (err) {
- dev_err(dev->dev, "pp %s fail to request irq\n",
+ dev_err(dev->dev, "%s fail to request irq\n",
lima_ip_name(ip));
return err;
}
@@ -309,7 +309,7 @@ int lima_pp_bcast_init(struct lima_ip *ip)
err = devm_request_irq(dev->dev, ip->irq, lima_pp_bcast_irq_handler,
IRQF_SHARED, lima_ip_name(ip), ip);
if (err) {
- dev_err(dev->dev, "pp %s fail to request irq\n",
+ dev_err(dev->dev, "%s fail to request irq\n",
lima_ip_name(ip));
return err;
}
@@ -423,8 +423,9 @@ static void lima_pp_task_error(struct lima_sched_pipe *pipe)
for (i = 0; i < pipe->num_processor; i++) {
struct lima_ip *ip = pipe->processor[i];

- dev_err(ip->dev->dev, "pp task error %d int_state=%x status=%x\n",
- i, pp_read(LIMA_PP_INT_STATUS), pp_read(LIMA_PP_STATUS));
+ dev_err(ip->dev->dev, "%s task error %d int_state=%x status=%x\n",
+ lima_ip_name(ip), i, pp_read(LIMA_PP_INT_STATUS),
+ pp_read(LIMA_PP_STATUS));

lima_pp_hard_reset(ip);
}
--
2.43.0

2024-01-30 01:16:25

by Qiang Yu

[permalink] [raw]

Subject: Re: [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery

Serial is Reviewed-by: QIang Yu <[email protected]>

On Wed, Jan 24, 2024 at 11:00 AM Erico Nunes <[email protected]> wrote:
>
> v1 reference:
> https://patchwork.kernel.org/project/dri-devel/cover/[email protected]/
>
> Changes v1 -> v2:
> - Dropped patch 1 which aimed to fix
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/8415 .
> That will require more testing and an actual fix to the irq/timeout
> handler race. It can be solved separately so I am deferring it to a
> followup patch and keeping that issue open.
>
> - Added patches 2 and 4 to cover "reset time out" and bus stop bit to
> hard reset in gp as well.
>
> - Added handling of all processors in synchronize_irq in patch 5 to
> cover multiple pp. Dropped unnecessary duplicate fence in patch 5.
>
> - Added patch 7 in v2. After some discussion in patch 4 (v1), it seems
> to be reasonable to bump our timeout value so that we further decrease
> the chance of users actually hitting any of these timeouts by default.
>
> - Reworked patch 8 in v2. Since I broadened the work to not only focus
> in pp anymore, I also included the change to the other blocks as well.
>
> - Collected some reviews and acks in unmodified patches.
>
>
> Erico Nunes (8):
> drm/lima: reset async_reset on pp hard reset
> drm/lima: reset async_reset on gp hard reset
> drm/lima: set pp bus_stop bit before hard reset
> drm/lima: set gp bus_stop bit before hard reset
> drm/lima: handle spurious timeouts due to high irq latency
> drm/lima: remove guilty drm_sched context handling
> drm/lima: increase default job timeout to 10s
> drm/lima: standardize debug messages by ip name
>
> drivers/gpu/drm/lima/lima_ctx.c | 2 +-
> drivers/gpu/drm/lima/lima_ctx.h | 1 -
> drivers/gpu/drm/lima/lima_gp.c | 39 +++++++++++++++++++++-------
> drivers/gpu/drm/lima/lima_l2_cache.c | 6 +++--
> drivers/gpu/drm/lima/lima_mmu.c | 18 ++++++-------
> drivers/gpu/drm/lima/lima_pmu.c | 3 ++-
> drivers/gpu/drm/lima/lima_pp.c | 37 ++++++++++++++++++++------
> drivers/gpu/drm/lima/lima_sched.c | 38 ++++++++++++++++++++++-----
> drivers/gpu/drm/lima/lima_sched.h | 3 +--
> 9 files changed, 107 insertions(+), 40 deletions(-)
>
> --
> 2.43.0
>

2024-02-12 08:40:26

by Qiang Yu

[permalink] [raw]

Subject: Re: [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery

applied to drm-misc-next

On Tue, Jan 30, 2024 at 9:07 AM Qiang Yu <[email protected]> wrote:
>
> Serial is Reviewed-by: QIang Yu <[email protected]>
>
> On Wed, Jan 24, 2024 at 11:00 AM Erico Nunes <[email protected]> wrote:
> >
> > v1 reference:
> > https://patchwork.kernel.org/project/dri-devel/cover/[email protected]/
> >
> > Changes v1 -> v2:
> > - Dropped patch 1 which aimed to fix
> > https://gitlab.freedesktop.org/mesa/mesa/-/issues/8415 .
> > That will require more testing and an actual fix to the irq/timeout
> > handler race. It can be solved separately so I am deferring it to a
> > followup patch and keeping that issue open.
> >
> > - Added patches 2 and 4 to cover "reset time out" and bus stop bit to
> > hard reset in gp as well.
> >
> > - Added handling of all processors in synchronize_irq in patch 5 to
> > cover multiple pp. Dropped unnecessary duplicate fence in patch 5.
> >
> > - Added patch 7 in v2. After some discussion in patch 4 (v1), it seems
> > to be reasonable to bump our timeout value so that we further decrease
> > the chance of users actually hitting any of these timeouts by default.
> >
> > - Reworked patch 8 in v2. Since I broadened the work to not only focus
> > in pp anymore, I also included the change to the other blocks as well.
> >
> > - Collected some reviews and acks in unmodified patches.
> >
> >
> > Erico Nunes (8):
> > drm/lima: reset async_reset on pp hard reset
> > drm/lima: reset async_reset on gp hard reset
> > drm/lima: set pp bus_stop bit before hard reset
> > drm/lima: set gp bus_stop bit before hard reset
> > drm/lima: handle spurious timeouts due to high irq latency
> > drm/lima: remove guilty drm_sched context handling
> > drm/lima: increase default job timeout to 10s
> > drm/lima: standardize debug messages by ip name
> >
> > drivers/gpu/drm/lima/lima_ctx.c | 2 +-
> > drivers/gpu/drm/lima/lima_ctx.h | 1 -
> > drivers/gpu/drm/lima/lima_gp.c | 39 +++++++++++++++++++++-------
> > drivers/gpu/drm/lima/lima_l2_cache.c | 6 +++--
> > drivers/gpu/drm/lima/lima_mmu.c | 18 ++++++-------
> > drivers/gpu/drm/lima/lima_pmu.c | 3 ++-
> > drivers/gpu/drm/lima/lima_pp.c | 37 ++++++++++++++++++++------
> > drivers/gpu/drm/lima/lima_sched.c | 38 ++++++++++++++++++++++-----
> > drivers/gpu/drm/lima/lima_sched.h | 3 +--
> > 9 files changed, 107 insertions(+), 40 deletions(-)
> >
> > --
> > 2.43.0
> >