LinuxLists.cc - [PATCH 0/2] mmc: cqhci: Spurious TCN fix

2023-10-27 14:57:01

Subject: [PATCH 0/2] mmc: cqhci: Spurious TCN fix

Hi Adrian,

This series contains a fix for the "spurious TCN" issue that happens
during CQE recovery. We discussed this problem in the:
"[PATCH] mmc: cqhci: Be more verbose in error irq handler" thread.
The fix implemented in the form of a quirk, that is applied to the JSL
controller only. This was done this way, since I believe that this is a
hardware issue. Note that this quirk might be applicable to controllers
on other Intel SoCs too.
Please let me know what do you think.

Regards,
Kornel

Kornel Dulęba (2):
mmc: cqhci: Add a quirk to clear stale TC
mmc: sdhci-pci: Enable the clear stale TC quirk on JSL

drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++
drivers/mmc/host/cqhci.h | 1 +
drivers/mmc/host/sdhci-pci-core.c | 6 +++++
3 files changed, 49 insertions(+)

--
2.42.0.820.g83a721a137-goog

2023-10-27 14:57:17

by Kornel Dulęba

[permalink] [raw]

Subject: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

This fix addresses a stale task completion event issued right after the
CQE recovery. As it's a hardware issue the fix is done in form of a
quirk.

When error interrupt is received the driver runs recovery logic is run.
It halts the controller, clears all pending tasks, and then re-enables
it. On some platforms a stale task completion event is observed,
regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.

This results in either:
a) Spurious TC completion event for an empty slot.
b) Corrupted data being passed up the stack, as a result of premature
completion for a newly added task.

To fix that re-enable the controller, clear task completion bits,
interrupt status register and halt it again.
This is done at the end of the recovery process, right before interrupts
are re-enabled.

Signed-off-by: Kornel Dulęba <[email protected]>
---
drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
drivers/mmc/host/cqhci.h | 1 +
2 files changed, 43 insertions(+)

diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
index b3d7d6d8d654..e534222df90c 100644
--- a/drivers/mmc/host/cqhci-core.c
+++ b/drivers/mmc/host/cqhci-core.c
@@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
/* CQHCI could be expected to clear it's internal state pretty quickly */
#define CQHCI_CLEAR_TIMEOUT 20

+/*
+ * During CQE recovery all pending tasks are cleared from the
+ * controller and its state is being reset.
+ * On some platforms the controller sets a task completion bit for
+ * a stale(previously cleared) task right after being re-enabled.
+ * This results in a spurious interrupt at best and corrupted data
+ * being passed up the stack at worst. The latter happens when
+ * the driver enqueues a new request on the problematic task slot
+ * before the "spurious" task completion interrupt is handled.
+ * To fix it:
+ * 1. Re-enable controller by clearing the halt flag.
+ * 2. Clear interrupt status and the task completion register.
+ * 3. Halt the controller again to be consistent with quirkless logic.
+ *
+ * This assumes that there are no pending requests on the queue.
+ */
+static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
+{
+ u32 reg;
+
+ WARN_ON(cq_host->qcnt);
+ cqhci_writel(cq_host, 0, CQHCI_CTL);
+ if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
+ pr_err("%s: cqhci: CQE failed to exit halt state\n",
+ mmc_hostname(cq_host->mmc));
+ }
+ reg = cqhci_readl(cq_host, CQHCI_TCN);
+ cqhci_writel(cq_host, reg, CQHCI_TCN);
+ reg = cqhci_readl(cq_host, CQHCI_IS);
+ cqhci_writel(cq_host, reg, CQHCI_IS);
+
+ /*
+ * Halt the controller again.
+ * This is only needed so that we're consistent across quirk
+ * and quirkless logic.
+ */
+ cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
+}
+
static void cqhci_recovery_finish(struct mmc_host *mmc)
{
struct cqhci_host *cq_host = mmc->cqe_private;
@@ -1108,6 +1147,9 @@ static void cqhci_recovery_finish(struct mmc_host *mmc)
mmc->cqe_on = false;
spin_unlock_irqrestore(&cq_host->lock, flags);

+ if (cq_host->quirks & CQHCI_QUIRK_CLEAR_STALE_TC)
+ cqhci_quirk_clear_stale_tc(cq_host);
+
/* Ensure all writes are done before interrupts are re-enabled */
wmb();

diff --git a/drivers/mmc/host/cqhci.h b/drivers/mmc/host/cqhci.h
index 1a12e40a02e6..36131038c091 100644
--- a/drivers/mmc/host/cqhci.h
+++ b/drivers/mmc/host/cqhci.h
@@ -239,6 +239,7 @@ struct cqhci_host {

u32 quirks;
#define CQHCI_QUIRK_SHORT_TXFR_DESC_SZ 0x1
+#define CQHCI_QUIRK_CLEAR_STALE_TC 0x2

bool enabled;
bool halted;
--
2.42.0.820.g83a721a137-goog

2023-10-27 14:57:21

by Kornel Dulęba

[permalink] [raw]

Subject: [PATCH 2/2] mmc: sdhci-pci: Enable the clear stale TC quirk on JSL

This patch applies the CQHCI_QUIRK_CLEAR_STALE_TC to MMC controller on
Jasper Lake platform.
When run in CQ mode, the controller on Jasper Lake sets a stale task
completion event after CQE recovery is done.

Signed-off-by: Kornel Dulęba <[email protected]>
---
drivers/mmc/host/sdhci-pci-core.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/mmc/host/sdhci-pci-core.c b/drivers/mmc/host/sdhci-pci-core.c
index 7c14feb5db77..a7e637f5cb4f 100644
--- a/drivers/mmc/host/sdhci-pci-core.c
+++ b/drivers/mmc/host/sdhci-pci-core.c
@@ -960,6 +960,12 @@ static int glk_emmc_add_host(struct sdhci_pci_slot *slot)

cq_host->mmio = host->ioaddr + 0x200;
cq_host->quirks |= CQHCI_QUIRK_SHORT_TXFR_DESC_SZ;
+ /*
+ * The controller on Jasper Lake signals a stale task completion
+ * event after CQE recovery.
+ */
+ if (slot->chip->pdev->device == PCI_DEVICE_ID_INTEL_JSL_EMMC)
+ cq_host->quirks |= CQHCI_QUIRK_CLEAR_STALE_TC;
cq_host->ops = &glk_cqhci_ops;

dma64 = host->flags & SDHCI_USE_64_BIT_DMA;
--
2.42.0.820.g83a721a137-goog

2023-10-30 19:31:15

by Adrian Hunter

[permalink] [raw]

Subject: Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

On 27/10/23 17:56, Kornel Dulęba wrote:
> This fix addresses a stale task completion event issued right after the
> CQE recovery. As it's a hardware issue the fix is done in form of a
> quirk.
>
> When error interrupt is received the driver runs recovery logic is run.
> It halts the controller, clears all pending tasks, and then re-enables
> it. On some platforms a stale task completion event is observed,
> regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
>
> This results in either:
> a) Spurious TC completion event for an empty slot.
> b) Corrupted data being passed up the stack, as a result of premature
> completion for a newly added task.
>
> To fix that re-enable the controller, clear task completion bits,
> interrupt status register and halt it again.
> This is done at the end of the recovery process, right before interrupts
> are re-enabled.
>
> Signed-off-by: Kornel Dulęba <[email protected]>
> ---
> drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
> drivers/mmc/host/cqhci.h | 1 +
> 2 files changed, 43 insertions(+)
>
> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> index b3d7d6d8d654..e534222df90c 100644
> --- a/drivers/mmc/host/cqhci-core.c
> +++ b/drivers/mmc/host/cqhci-core.c
> @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
> /* CQHCI could be expected to clear it's internal state pretty quickly */
> #define CQHCI_CLEAR_TIMEOUT 20
>
> +/*
> + * During CQE recovery all pending tasks are cleared from the
> + * controller and its state is being reset.
> + * On some platforms the controller sets a task completion bit for
> + * a stale(previously cleared) task right after being re-enabled.
> + * This results in a spurious interrupt at best and corrupted data
> + * being passed up the stack at worst. The latter happens when
> + * the driver enqueues a new request on the problematic task slot
> + * before the "spurious" task completion interrupt is handled.
> + * To fix it:
> + * 1. Re-enable controller by clearing the halt flag.
> + * 2. Clear interrupt status and the task completion register.
> + * 3. Halt the controller again to be consistent with quirkless logic.
> + *
> + * This assumes that there are no pending requests on the queue.
> + */
> +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
> +{
> + u32 reg;
> +
> + WARN_ON(cq_host->qcnt);
> + cqhci_writel(cq_host, 0, CQHCI_CTL);
> + if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
> + pr_err("%s: cqhci: CQE failed to exit halt state\n",
> + mmc_hostname(cq_host->mmc));
> + }
> + reg = cqhci_readl(cq_host, CQHCI_TCN);
> + cqhci_writel(cq_host, reg, CQHCI_TCN);
> + reg = cqhci_readl(cq_host, CQHCI_IS);
> + cqhci_writel(cq_host, reg, CQHCI_IS);
> +
> + /*
> + * Halt the controller again.
> + * This is only needed so that we're consistent across quirk
> + * and quirkless logic.
> + */
> + cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
> +}

Thanks a lot for tracking this down!

It could be that the "un-halt" starts a task, so it would be
better to force the "clear" to work if possible, which
should be the case if CQE is disabled.

Would you mind trying the code below? Note the increased
CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
when CQE has not halted.

diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
index b3d7d6d8d654..534c13069833 100644
--- a/drivers/mmc/host/cqhci-core.c
+++ b/drivers/mmc/host/cqhci-core.c
@@ -987,7 +987,7 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
* layers will need to send a STOP command), so we set the timeout based on a
* generous command timeout.
*/
-#define CQHCI_START_HALT_TIMEOUT 5
+#define CQHCI_START_HALT_TIMEOUT 500

static void cqhci_recovery_start(struct mmc_host *mmc)
{
@@ -1075,28 +1075,27 @@ static void cqhci_recovery_finish(struct mmc_host *mmc)

ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);

- if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
- ok = false;
-
/*
* The specification contradicts itself, by saying that tasks cannot be
* cleared if CQHCI does not halt, but if CQHCI does not halt, it should
* be disabled/re-enabled, but not to disable before clearing tasks.
* Have a go anyway.
*/
- if (!ok) {
- pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
- cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
- cqcfg &= ~CQHCI_ENABLE;
- cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
- cqcfg |= CQHCI_ENABLE;
- cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
- /* Be sure that there are no tasks */
- ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
- if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
- ok = false;
- WARN_ON(!ok);
- }
+ if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
+ ok = false;
+
+ cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
+ cqcfg &= ~CQHCI_ENABLE;
+ cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
+
+ cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
+ cqcfg |= CQHCI_ENABLE;
+ cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
+
+ cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
+
+ if (!ok)
+ cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT);

cqhci_recover_mrqs(cq_host);

2023-11-01 11:31:34

by Kornel Dulęba

[permalink] [raw]

Subject: Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

On Mon, Oct 30, 2023 at 8:31 PM Adrian Hunter <[email protected]> wrote:
>
> On 27/10/23 17:56, Kornel Dulęba wrote:
> > This fix addresses a stale task completion event issued right after the
> > CQE recovery. As it's a hardware issue the fix is done in form of a
> > quirk.
> >
> > When error interrupt is received the driver runs recovery logic is run.
> > It halts the controller, clears all pending tasks, and then re-enables
> > it. On some platforms a stale task completion event is observed,
> > regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
> >
> > This results in either:
> > a) Spurious TC completion event for an empty slot.
> > b) Corrupted data being passed up the stack, as a result of premature
> > completion for a newly added task.
> >
> > To fix that re-enable the controller, clear task completion bits,
> > interrupt status register and halt it again.
> > This is done at the end of the recovery process, right before interrupts
> > are re-enabled.
> >
> > Signed-off-by: Kornel Dulęba <[email protected]>
> > ---
> > drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
> > drivers/mmc/host/cqhci.h | 1 +
> > 2 files changed, 43 insertions(+)
> >
> > diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> > index b3d7d6d8d654..e534222df90c 100644
> > --- a/drivers/mmc/host/cqhci-core.c
> > +++ b/drivers/mmc/host/cqhci-core.c
> > @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
> > /* CQHCI could be expected to clear it's internal state pretty quickly */
> > #define CQHCI_CLEAR_TIMEOUT 20
> >
> > +/*
> > + * During CQE recovery all pending tasks are cleared from the
> > + * controller and its state is being reset.
> > + * On some platforms the controller sets a task completion bit for
> > + * a stale(previously cleared) task right after being re-enabled.
> > + * This results in a spurious interrupt at best and corrupted data
> > + * being passed up the stack at worst. The latter happens when
> > + * the driver enqueues a new request on the problematic task slot
> > + * before the "spurious" task completion interrupt is handled.
> > + * To fix it:
> > + * 1. Re-enable controller by clearing the halt flag.
> > + * 2. Clear interrupt status and the task completion register.
> > + * 3. Halt the controller again to be consistent with quirkless logic.
> > + *
> > + * This assumes that there are no pending requests on the queue.
> > + */
> > +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
> > +{
> > + u32 reg;
> > +
> > + WARN_ON(cq_host->qcnt);
> > + cqhci_writel(cq_host, 0, CQHCI_CTL);
> > + if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
> > + pr_err("%s: cqhci: CQE failed to exit halt state\n",
> > + mmc_hostname(cq_host->mmc));
> > + }
> > + reg = cqhci_readl(cq_host, CQHCI_TCN);
> > + cqhci_writel(cq_host, reg, CQHCI_TCN);
> > + reg = cqhci_readl(cq_host, CQHCI_IS);
> > + cqhci_writel(cq_host, reg, CQHCI_IS);
> > +
> > + /*
> > + * Halt the controller again.
> > + * This is only needed so that we're consistent across quirk
> > + * and quirkless logic.
> > + */
> > + cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
> > +}
>
> Thanks a lot for tracking this down!
>
> It could be that the "un-halt" starts a task, so it would be
> better to force the "clear" to work if possible, which
> should be the case if CQE is disabled.
>
> Would you mind trying the code below? Note the increased
> CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
> when CQE has not halted.

Sure, I'll try it out tomorrow, as I don't have access to the DUT today.
BTW do we even need to halt the controller in the recovery_finish logic?
It has already been halted in recovery_start, I guess it could be
there in case the recovery_start halt didn't work.
But in that case shouldn't we do this disable/re-enable dance in recovery_start?

>
>
> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> index b3d7d6d8d654..534c13069833 100644
> --- a/drivers/mmc/host/cqhci-core.c
> +++ b/drivers/mmc/host/cqhci-core.c
> @@ -987,7 +987,7 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
> * layers will need to send a STOP command), so we set the timeout based on a
> * generous command timeout.
> */
> -#define CQHCI_START_HALT_TIMEOUT 5
> +#define CQHCI_START_HALT_TIMEOUT 500
>
> static void cqhci_recovery_start(struct mmc_host *mmc)
> {
> @@ -1075,28 +1075,27 @@ static void cqhci_recovery_finish(struct mmc_host *mmc)
>
> ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
>
> - if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> - ok = false;
> -
> /*
> * The specification contradicts itself, by saying that tasks cannot be
> * cleared if CQHCI does not halt, but if CQHCI does not halt, it should
> * be disabled/re-enabled, but not to disable before clearing tasks.
> * Have a go anyway.
> */
> - if (!ok) {
> - pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
> - cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> - cqcfg &= ~CQHCI_ENABLE;
> - cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> - cqcfg |= CQHCI_ENABLE;
> - cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> - /* Be sure that there are no tasks */
> - ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
> - if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> - ok = false;
> - WARN_ON(!ok);
> - }
> + if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> + ok = false;
> +
> + cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> + cqcfg &= ~CQHCI_ENABLE;
> + cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> +
> + cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> + cqcfg |= CQHCI_ENABLE;
> + cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> +
> + cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
> +
> + if (!ok)
> + cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT);
>
> cqhci_recover_mrqs(cq_host);
>
>

2023-11-01 12:01:23

by Adrian Hunter

[permalink] [raw]

Subject: Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

On 1/11/23 13:31, Kornel Dulęba wrote:
> On Mon, Oct 30, 2023 at 8:31 PM Adrian Hunter <[email protected]> wrote:
>>
>> On 27/10/23 17:56, Kornel Dulęba wrote:
>>> This fix addresses a stale task completion event issued right after the
>>> CQE recovery. As it's a hardware issue the fix is done in form of a
>>> quirk.
>>>
>>> When error interrupt is received the driver runs recovery logic is run.
>>> It halts the controller, clears all pending tasks, and then re-enables
>>> it. On some platforms a stale task completion event is observed,
>>> regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
>>>
>>> This results in either:
>>> a) Spurious TC completion event for an empty slot.
>>> b) Corrupted data being passed up the stack, as a result of premature
>>> completion for a newly added task.
>>>
>>> To fix that re-enable the controller, clear task completion bits,
>>> interrupt status register and halt it again.
>>> This is done at the end of the recovery process, right before interrupts
>>> are re-enabled.
>>>
>>> Signed-off-by: Kornel Dulęba <[email protected]>
>>> ---
>>> drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
>>> drivers/mmc/host/cqhci.h | 1 +
>>> 2 files changed, 43 insertions(+)
>>>
>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
>>> index b3d7d6d8d654..e534222df90c 100644
>>> --- a/drivers/mmc/host/cqhci-core.c
>>> +++ b/drivers/mmc/host/cqhci-core.c
>>> @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
>>> /* CQHCI could be expected to clear it's internal state pretty quickly */
>>> #define CQHCI_CLEAR_TIMEOUT 20
>>>
>>> +/*
>>> + * During CQE recovery all pending tasks are cleared from the
>>> + * controller and its state is being reset.
>>> + * On some platforms the controller sets a task completion bit for
>>> + * a stale(previously cleared) task right after being re-enabled.
>>> + * This results in a spurious interrupt at best and corrupted data
>>> + * being passed up the stack at worst. The latter happens when
>>> + * the driver enqueues a new request on the problematic task slot
>>> + * before the "spurious" task completion interrupt is handled.
>>> + * To fix it:
>>> + * 1. Re-enable controller by clearing the halt flag.
>>> + * 2. Clear interrupt status and the task completion register.
>>> + * 3. Halt the controller again to be consistent with quirkless logic.
>>> + *
>>> + * This assumes that there are no pending requests on the queue.
>>> + */
>>> +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
>>> +{
>>> + u32 reg;
>>> +
>>> + WARN_ON(cq_host->qcnt);
>>> + cqhci_writel(cq_host, 0, CQHCI_CTL);
>>> + if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
>>> + pr_err("%s: cqhci: CQE failed to exit halt state\n",
>>> + mmc_hostname(cq_host->mmc));
>>> + }
>>> + reg = cqhci_readl(cq_host, CQHCI_TCN);
>>> + cqhci_writel(cq_host, reg, CQHCI_TCN);
>>> + reg = cqhci_readl(cq_host, CQHCI_IS);
>>> + cqhci_writel(cq_host, reg, CQHCI_IS);
>>> +
>>> + /*
>>> + * Halt the controller again.
>>> + * This is only needed so that we're consistent across quirk
>>> + * and quirkless logic.
>>> + */
>>> + cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
>>> +}
>>
>> Thanks a lot for tracking this down!
>>
>> It could be that the "un-halt" starts a task, so it would be
>> better to force the "clear" to work if possible, which
>> should be the case if CQE is disabled.
>>
>> Would you mind trying the code below? Note the increased
>> CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
>> when CQE has not halted.
>
> Sure, I'll try it out tomorrow, as I don't have access to the DUT today.
> BTW do we even need to halt the controller in the recovery_finish logic?
> It has already been halted in recovery_start, I guess it could be
> there in case the recovery_start halt didn't work.
> But in that case shouldn't we do this disable/re-enable dance in recovery_start?

"Halt" might be waiting on an operation to finish, so the STOP
command is meant to help bring that to a conclusion.

2023-11-02 09:22:16

by Kornel Dulęba

[permalink] [raw]

Subject: Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

On Mon, Oct 30, 2023 at 8:31 PM Adrian Hunter <[email protected]> wrote:
>
> On 27/10/23 17:56, Kornel Dulęba wrote:
> > This fix addresses a stale task completion event issued right after the
> > CQE recovery. As it's a hardware issue the fix is done in form of a
> > quirk.
> >
> > When error interrupt is received the driver runs recovery logic is run.
> > It halts the controller, clears all pending tasks, and then re-enables
> > it. On some platforms a stale task completion event is observed,
> > regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
> >
> > This results in either:
> > a) Spurious TC completion event for an empty slot.
> > b) Corrupted data being passed up the stack, as a result of premature
> > completion for a newly added task.
> >
> > To fix that re-enable the controller, clear task completion bits,
> > interrupt status register and halt it again.
> > This is done at the end of the recovery process, right before interrupts
> > are re-enabled.
> >
> > Signed-off-by: Kornel Dulęba <[email protected]>
> > ---
> > drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
> > drivers/mmc/host/cqhci.h | 1 +
> > 2 files changed, 43 insertions(+)
> >
> > diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> > index b3d7d6d8d654..e534222df90c 100644
> > --- a/drivers/mmc/host/cqhci-core.c
> > +++ b/drivers/mmc/host/cqhci-core.c
> > @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
> > /* CQHCI could be expected to clear it's internal state pretty quickly */
> > #define CQHCI_CLEAR_TIMEOUT 20
> >
> > +/*
> > + * During CQE recovery all pending tasks are cleared from the
> > + * controller and its state is being reset.
> > + * On some platforms the controller sets a task completion bit for
> > + * a stale(previously cleared) task right after being re-enabled.
> > + * This results in a spurious interrupt at best and corrupted data
> > + * being passed up the stack at worst. The latter happens when
> > + * the driver enqueues a new request on the problematic task slot
> > + * before the "spurious" task completion interrupt is handled.
> > + * To fix it:
> > + * 1. Re-enable controller by clearing the halt flag.
> > + * 2. Clear interrupt status and the task completion register.
> > + * 3. Halt the controller again to be consistent with quirkless logic.
> > + *
> > + * This assumes that there are no pending requests on the queue.
> > + */
> > +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
> > +{
> > + u32 reg;
> > +
> > + WARN_ON(cq_host->qcnt);
> > + cqhci_writel(cq_host, 0, CQHCI_CTL);
> > + if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
> > + pr_err("%s: cqhci: CQE failed to exit halt state\n",
> > + mmc_hostname(cq_host->mmc));
> > + }
> > + reg = cqhci_readl(cq_host, CQHCI_TCN);
> > + cqhci_writel(cq_host, reg, CQHCI_TCN);
> > + reg = cqhci_readl(cq_host, CQHCI_IS);
> > + cqhci_writel(cq_host, reg, CQHCI_IS);
> > +
> > + /*
> > + * Halt the controller again.
> > + * This is only needed so that we're consistent across quirk
> > + * and quirkless logic.
> > + */
> > + cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
> > +}
>
> Thanks a lot for tracking this down!
>
> It could be that the "un-halt" starts a task, so it would be
> better to force the "clear" to work if possible, which
> should be the case if CQE is disabled.
>
> Would you mind trying the code below? Note the increased
> CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
> when CQE has not halted.

I've run a quick test and it works just fine.
Your approach looks better than what I proposed, since as you
mentioned, doing it like this avoids some weird side effects, e.g. DMA
to freed memory.
Do you plan to include it in the other series that you posted yesterday?

>
> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> index b3d7d6d8d654..534c13069833 100644
> --- a/drivers/mmc/host/cqhci-core.c
> +++ b/drivers/mmc/host/cqhci-core.c
> @@ -987,7 +987,7 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
> * layers will need to send a STOP command), so we set the timeout based on a
> * generous command timeout.
> */
> -#define CQHCI_START_HALT_TIMEOUT 5
> +#define CQHCI_START_HALT_TIMEOUT 500
>
> static void cqhci_recovery_start(struct mmc_host *mmc)
> {
> @@ -1075,28 +1075,27 @@ static void cqhci_recovery_finish(struct mmc_host *mmc)
>
> ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
>
> - if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> - ok = false;
> -
> /*
> * The specification contradicts itself, by saying that tasks cannot be
> * cleared if CQHCI does not halt, but if CQHCI does not halt, it should
> * be disabled/re-enabled, but not to disable before clearing tasks.
> * Have a go anyway.
> */
> - if (!ok) {
> - pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
> - cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> - cqcfg &= ~CQHCI_ENABLE;
> - cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> - cqcfg |= CQHCI_ENABLE;
> - cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> - /* Be sure that there are no tasks */
> - ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
> - if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> - ok = false;
> - WARN_ON(!ok);
> - }
> + if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> + ok = false;
> +
> + cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> + cqcfg &= ~CQHCI_ENABLE;
> + cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> +
> + cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> + cqcfg |= CQHCI_ENABLE;
> + cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> +
> + cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
> +
> + if (!ok)
> + cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT);
>
> cqhci_recover_mrqs(cq_host);
>
>

2023-11-02 11:01:42

by Adrian Hunter

[permalink] [raw]

Subject: Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

On 2/11/23 11:21, Kornel Dulęba wrote:
> On Mon, Oct 30, 2023 at 8:31 PM Adrian Hunter <[email protected]> wrote:
>>
>> On 27/10/23 17:56, Kornel Dulęba wrote:
>>> This fix addresses a stale task completion event issued right after the
>>> CQE recovery. As it's a hardware issue the fix is done in form of a
>>> quirk.
>>>
>>> When error interrupt is received the driver runs recovery logic is run.
>>> It halts the controller, clears all pending tasks, and then re-enables
>>> it. On some platforms a stale task completion event is observed,
>>> regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
>>>
>>> This results in either:
>>> a) Spurious TC completion event for an empty slot.
>>> b) Corrupted data being passed up the stack, as a result of premature
>>> completion for a newly added task.
>>>
>>> To fix that re-enable the controller, clear task completion bits,
>>> interrupt status register and halt it again.
>>> This is done at the end of the recovery process, right before interrupts
>>> are re-enabled.
>>>
>>> Signed-off-by: Kornel Dulęba <[email protected]>
>>> ---
>>> drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
>>> drivers/mmc/host/cqhci.h | 1 +
>>> 2 files changed, 43 insertions(+)
>>>
>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
>>> index b3d7d6d8d654..e534222df90c 100644
>>> --- a/drivers/mmc/host/cqhci-core.c
>>> +++ b/drivers/mmc/host/cqhci-core.c
>>> @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
>>> /* CQHCI could be expected to clear it's internal state pretty quickly */
>>> #define CQHCI_CLEAR_TIMEOUT 20
>>>
>>> +/*
>>> + * During CQE recovery all pending tasks are cleared from the
>>> + * controller and its state is being reset.
>>> + * On some platforms the controller sets a task completion bit for
>>> + * a stale(previously cleared) task right after being re-enabled.
>>> + * This results in a spurious interrupt at best and corrupted data
>>> + * being passed up the stack at worst. The latter happens when
>>> + * the driver enqueues a new request on the problematic task slot
>>> + * before the "spurious" task completion interrupt is handled.
>>> + * To fix it:
>>> + * 1. Re-enable controller by clearing the halt flag.
>>> + * 2. Clear interrupt status and the task completion register.
>>> + * 3. Halt the controller again to be consistent with quirkless logic.
>>> + *
>>> + * This assumes that there are no pending requests on the queue.
>>> + */
>>> +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
>>> +{
>>> + u32 reg;
>>> +
>>> + WARN_ON(cq_host->qcnt);
>>> + cqhci_writel(cq_host, 0, CQHCI_CTL);
>>> + if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
>>> + pr_err("%s: cqhci: CQE failed to exit halt state\n",
>>> + mmc_hostname(cq_host->mmc));
>>> + }
>>> + reg = cqhci_readl(cq_host, CQHCI_TCN);
>>> + cqhci_writel(cq_host, reg, CQHCI_TCN);
>>> + reg = cqhci_readl(cq_host, CQHCI_IS);
>>> + cqhci_writel(cq_host, reg, CQHCI_IS);
>>> +
>>> + /*
>>> + * Halt the controller again.
>>> + * This is only needed so that we're consistent across quirk
>>> + * and quirkless logic.
>>> + */
>>> + cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
>>> +}
>>
>> Thanks a lot for tracking this down!
>>
>> It could be that the "un-halt" starts a task, so it would be
>> better to force the "clear" to work if possible, which
>> should be the case if CQE is disabled.
>>
>> Would you mind trying the code below? Note the increased
>> CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
>> when CQE has not halted.
>
> I've run a quick test and it works just fine.

Thank you!

> Your approach looks better than what I proposed, since as you
> mentioned, doing it like this avoids some weird side effects, e.g. DMA
> to freed memory.
> Do you plan to include it in the other series that you posted yesterday?

Yes I will do that

2023-11-02 14:07:12

by Kornel Dulęba

[permalink] [raw]

Subject: Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

On Thu, Nov 02, 2023 at 01:01:22PM +0200, Adrian Hunter wrote:
> On 2/11/23 11:21, Kornel Dulęba wrote:
> > On Mon, Oct 30, 2023 at 8:31 PM Adrian Hunter <[email protected]> wrote:
> >>
> >> On 27/10/23 17:56, Kornel Dulęba wrote:
> >>> This fix addresses a stale task completion event issued right after the
> >>> CQE recovery. As it's a hardware issue the fix is done in form of a
> >>> quirk.
> >>>
> >>> When error interrupt is received the driver runs recovery logic is run.
> >>> It halts the controller, clears all pending tasks, and then re-enables
> >>> it. On some platforms a stale task completion event is observed,
> >>> regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
> >>>
> >>> This results in either:
> >>> a) Spurious TC completion event for an empty slot.
> >>> b) Corrupted data being passed up the stack, as a result of premature
> >>> completion for a newly added task.
> >>>
> >>> To fix that re-enable the controller, clear task completion bits,
> >>> interrupt status register and halt it again.
> >>> This is done at the end of the recovery process, right before interrupts
> >>> are re-enabled.
> >>>
> >>> Signed-off-by: Kornel Dulęba <[email protected]>
> >>> ---
> >>> drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
> >>> drivers/mmc/host/cqhci.h | 1 +
> >>> 2 files changed, 43 insertions(+)
> >>>
> >>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> >>> index b3d7d6d8d654..e534222df90c 100644
> >>> --- a/drivers/mmc/host/cqhci-core.c
> >>> +++ b/drivers/mmc/host/cqhci-core.c
> >>> @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
> >>> /* CQHCI could be expected to clear it's internal state pretty quickly */
> >>> #define CQHCI_CLEAR_TIMEOUT 20
> >>>
> >>> +/*
> >>> + * During CQE recovery all pending tasks are cleared from the
> >>> + * controller and its state is being reset.
> >>> + * On some platforms the controller sets a task completion bit for
> >>> + * a stale(previously cleared) task right after being re-enabled.
> >>> + * This results in a spurious interrupt at best and corrupted data
> >>> + * being passed up the stack at worst. The latter happens when
> >>> + * the driver enqueues a new request on the problematic task slot
> >>> + * before the "spurious" task completion interrupt is handled.
> >>> + * To fix it:
> >>> + * 1. Re-enable controller by clearing the halt flag.
> >>> + * 2. Clear interrupt status and the task completion register.
> >>> + * 3. Halt the controller again to be consistent with quirkless logic.
> >>> + *
> >>> + * This assumes that there are no pending requests on the queue.
> >>> + */
> >>> +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
> >>> +{
> >>> + u32 reg;
> >>> +
> >>> + WARN_ON(cq_host->qcnt);
> >>> + cqhci_writel(cq_host, 0, CQHCI_CTL);
> >>> + if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
> >>> + pr_err("%s: cqhci: CQE failed to exit halt state\n",
> >>> + mmc_hostname(cq_host->mmc));
> >>> + }
> >>> + reg = cqhci_readl(cq_host, CQHCI_TCN);
> >>> + cqhci_writel(cq_host, reg, CQHCI_TCN);
> >>> + reg = cqhci_readl(cq_host, CQHCI_IS);
> >>> + cqhci_writel(cq_host, reg, CQHCI_IS);
> >>> +
> >>> + /*
> >>> + * Halt the controller again.
> >>> + * This is only needed so that we're consistent across quirk
> >>> + * and quirkless logic.
> >>> + */
> >>> + cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
> >>> +}
> >>
> >> Thanks a lot for tracking this down!
> >>
> >> It could be that the "un-halt" starts a task, so it would be
> >> better to force the "clear" to work if possible, which
> >> should be the case if CQE is disabled.
> >>
> >> Would you mind trying the code below? Note the increased
> >> CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
> >> when CQE has not halted.
> >
> > I've run a quick test and it works just fine.
>
> Thank you!
>
> > Your approach looks better than what I proposed, since as you
> > mentioned, doing it like this avoids some weird side effects, e.g. DMA
> > to freed memory.
> > Do you plan to include it in the other series that you posted yesterday?
>
> Yes I will do that

Feel free to add "Tested-by: Kornel Dulęba <[email protected]>" and
maybe "Reported-by".

2023-11-02 14:19:13

by Radoslaw Biernacki

[permalink] [raw]

Subject: Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

On Thu, Nov 2, 2023 at 3:07 PM Kornel Dulęba <[email protected]> wrote:
>
> On Thu, Nov 02, 2023 at 01:01:22PM +0200, Adrian Hunter wrote:
> > On 2/11/23 11:21, Kornel Dulęba wrote:
> > > On Mon, Oct 30, 2023 at 8:31 PM Adrian Hunter <[email protected]> wrote:
> > >>
> > >> On 27/10/23 17:56, Kornel Dulęba wrote:
> > >>> This fix addresses a stale task completion event issued right after the
> > >>> CQE recovery. As it's a hardware issue the fix is done in form of a
> > >>> quirk.
> > >>>
> > >>> When error interrupt is received the driver runs recovery logic is run.
> > >>> It halts the controller, clears all pending tasks, and then re-enables
> > >>> it. On some platforms a stale task completion event is observed,
> > >>> regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
> > >>>
> > >>> This results in either:
> > >>> a) Spurious TC completion event for an empty slot.
> > >>> b) Corrupted data being passed up the stack, as a result of premature
> > >>> completion for a newly added task.
> > >>>
> > >>> To fix that re-enable the controller, clear task completion bits,
> > >>> interrupt status register and halt it again.
> > >>> This is done at the end of the recovery process, right before interrupts
> > >>> are re-enabled.
> > >>>
> > >>> Signed-off-by: Kornel Dulęba <[email protected]>
> > >>> ---
> > >>> drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
> > >>> drivers/mmc/host/cqhci.h | 1 +
> > >>> 2 files changed, 43 insertions(+)
> > >>>
> > >>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> > >>> index b3d7d6d8d654..e534222df90c 100644
> > >>> --- a/drivers/mmc/host/cqhci-core.c
> > >>> +++ b/drivers/mmc/host/cqhci-core.c
> > >>> @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
> > >>> /* CQHCI could be expected to clear it's internal state pretty quickly */
> > >>> #define CQHCI_CLEAR_TIMEOUT 20
> > >>>
> > >>> +/*
> > >>> + * During CQE recovery all pending tasks are cleared from the
> > >>> + * controller and its state is being reset.
> > >>> + * On some platforms the controller sets a task completion bit for
> > >>> + * a stale(previously cleared) task right after being re-enabled.
> > >>> + * This results in a spurious interrupt at best and corrupted data
> > >>> + * being passed up the stack at worst. The latter happens when
> > >>> + * the driver enqueues a new request on the problematic task slot
> > >>> + * before the "spurious" task completion interrupt is handled.
> > >>> + * To fix it:
> > >>> + * 1. Re-enable controller by clearing the halt flag.
> > >>> + * 2. Clear interrupt status and the task completion register.
> > >>> + * 3. Halt the controller again to be consistent with quirkless logic.
> > >>> + *
> > >>> + * This assumes that there are no pending requests on the queue.
> > >>> + */
> > >>> +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
> > >>> +{
> > >>> + u32 reg;
> > >>> +
> > >>> + WARN_ON(cq_host->qcnt);
> > >>> + cqhci_writel(cq_host, 0, CQHCI_CTL);
> > >>> + if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
> > >>> + pr_err("%s: cqhci: CQE failed to exit halt state\n",
> > >>> + mmc_hostname(cq_host->mmc));
> > >>> + }
> > >>> + reg = cqhci_readl(cq_host, CQHCI_TCN);
> > >>> + cqhci_writel(cq_host, reg, CQHCI_TCN);
> > >>> + reg = cqhci_readl(cq_host, CQHCI_IS);
> > >>> + cqhci_writel(cq_host, reg, CQHCI_IS);
> > >>> +
> > >>> + /*
> > >>> + * Halt the controller again.
> > >>> + * This is only needed so that we're consistent across quirk
> > >>> + * and quirkless logic.
> > >>> + */
> > >>> + cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
> > >>> +}
> > >>
> > >> Thanks a lot for tracking this down!
> > >>
> > >> It could be that the "un-halt" starts a task, so it would be
> > >> better to force the "clear" to work if possible, which
> > >> should be the case if CQE is disabled.
> > >>
> > >> Would you mind trying the code below? Note the increased
> > >> CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
> > >> when CQE has not halted.
> > >
> > > I've run a quick test and it works just fine.
> >
> > Thank you!
> >
> > > Your approach looks better than what I proposed, since as you
> > > mentioned, doing it like this avoids some weird side effects, e.g. DMA
> > > to freed memory.
> > > Do you plan to include it in the other series that you posted yesterday?
> >
> > Yes I will do that
>
> Feel free to add "Tested-by: Kornel Dulęba <[email protected]>" and
> maybe "Reported-by".

I do not want to be you advocate Kornel, but I think you earned a
Co-developed-by
That was a lot of work.

2023-11-02 14:25:53

by Adrian Hunter

[permalink] [raw]

Subject: Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

On 2/11/23 16:18, Radoslaw Biernacki wrote:
> On Thu, Nov 2, 2023 at 3:07 PM Kornel Dulęba <[email protected]> wrote:
>>
>> On Thu, Nov 02, 2023 at 01:01:22PM +0200, Adrian Hunter wrote:
>>> On 2/11/23 11:21, Kornel Dulęba wrote:
>>>> On Mon, Oct 30, 2023 at 8:31 PM Adrian Hunter <[email protected]> wrote:
>>>>>
>>>>> On 27/10/23 17:56, Kornel Dulęba wrote:
>>>>>> This fix addresses a stale task completion event issued right after the
>>>>>> CQE recovery. As it's a hardware issue the fix is done in form of a
>>>>>> quirk.
>>>>>>
>>>>>> When error interrupt is received the driver runs recovery logic is run.
>>>>>> It halts the controller, clears all pending tasks, and then re-enables
>>>>>> it. On some platforms a stale task completion event is observed,
>>>>>> regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
>>>>>>
>>>>>> This results in either:
>>>>>> a) Spurious TC completion event for an empty slot.
>>>>>> b) Corrupted data being passed up the stack, as a result of premature
>>>>>> completion for a newly added task.
>>>>>>
>>>>>> To fix that re-enable the controller, clear task completion bits,
>>>>>> interrupt status register and halt it again.
>>>>>> This is done at the end of the recovery process, right before interrupts
>>>>>> are re-enabled.
>>>>>>
>>>>>> Signed-off-by: Kornel Dulęba <[email protected]>
>>>>>> ---
>>>>>> drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
>>>>>> drivers/mmc/host/cqhci.h | 1 +
>>>>>> 2 files changed, 43 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
>>>>>> index b3d7d6d8d654..e534222df90c 100644
>>>>>> --- a/drivers/mmc/host/cqhci-core.c
>>>>>> +++ b/drivers/mmc/host/cqhci-core.c
>>>>>> @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
>>>>>> /* CQHCI could be expected to clear it's internal state pretty quickly */
>>>>>> #define CQHCI_CLEAR_TIMEOUT 20
>>>>>>
>>>>>> +/*
>>>>>> + * During CQE recovery all pending tasks are cleared from the
>>>>>> + * controller and its state is being reset.
>>>>>> + * On some platforms the controller sets a task completion bit for
>>>>>> + * a stale(previously cleared) task right after being re-enabled.
>>>>>> + * This results in a spurious interrupt at best and corrupted data
>>>>>> + * being passed up the stack at worst. The latter happens when
>>>>>> + * the driver enqueues a new request on the problematic task slot
>>>>>> + * before the "spurious" task completion interrupt is handled.
>>>>>> + * To fix it:
>>>>>> + * 1. Re-enable controller by clearing the halt flag.
>>>>>> + * 2. Clear interrupt status and the task completion register.
>>>>>> + * 3. Halt the controller again to be consistent with quirkless logic.
>>>>>> + *
>>>>>> + * This assumes that there are no pending requests on the queue.
>>>>>> + */
>>>>>> +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
>>>>>> +{
>>>>>> + u32 reg;
>>>>>> +
>>>>>> + WARN_ON(cq_host->qcnt);
>>>>>> + cqhci_writel(cq_host, 0, CQHCI_CTL);
>>>>>> + if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
>>>>>> + pr_err("%s: cqhci: CQE failed to exit halt state\n",
>>>>>> + mmc_hostname(cq_host->mmc));
>>>>>> + }
>>>>>> + reg = cqhci_readl(cq_host, CQHCI_TCN);
>>>>>> + cqhci_writel(cq_host, reg, CQHCI_TCN);
>>>>>> + reg = cqhci_readl(cq_host, CQHCI_IS);
>>>>>> + cqhci_writel(cq_host, reg, CQHCI_IS);
>>>>>> +
>>>>>> + /*
>>>>>> + * Halt the controller again.
>>>>>> + * This is only needed so that we're consistent across quirk
>>>>>> + * and quirkless logic.
>>>>>> + */
>>>>>> + cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
>>>>>> +}
>>>>>
>>>>> Thanks a lot for tracking this down!
>>>>>
>>>>> It could be that the "un-halt" starts a task, so it would be
>>>>> better to force the "clear" to work if possible, which
>>>>> should be the case if CQE is disabled.
>>>>>
>>>>> Would you mind trying the code below? Note the increased
>>>>> CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
>>>>> when CQE has not halted.
>>>>
>>>> I've run a quick test and it works just fine.
>>>
>>> Thank you!
>>>
>>>> Your approach looks better than what I proposed, since as you
>>>> mentioned, doing it like this avoids some weird side effects, e.g. DMA
>>>> to freed memory.Kornel
>>>> Do you plan to include it in the other series that you posted yesterday?
>>>
>>> Yes I will do that
>>
>> Feel free to add "Tested-by: Kornel Dulęba <[email protected]>" and
>> maybe "Reported-by".
>
> I do not want to be you advocate Kornel, but I think you earned a
> Co-developed-by
> That was a lot of work.

Absolutely! Thanks a lot Kornel!