LinuxLists.cc - [PATCH] mtd: nand: pxa3xx-nand: fix random command timeouts

2015-08-12 16:26:27

Subject: [PATCH] mtd: nand: pxa3xx-nand: fix random command timeouts

When 2 commands are submitted in a row, and the second is very quick,
the completion of the second command might never come. This happens
especially if the second command is quick, such as a status read after
an erase.

The issue is that in the interrupt handler, the status bits are cleared
after the new command is issued. There is a small temporal window where
this happens :
- the previous command has set the command done bit
- the ready for a command bit is set
- the handler submits the next command
- just then, the command completes, and the command done bit is still
set
- the handler clears the "previous" command done bit
- the handler exits

In this flow, the "command done" of the next command will never trigger
a new interrupt to finish the status command, as it was cleared for both
commands.

Fix this by clearing the status bit before submitting a new command.

Signed-off-by: Robert Jarzmik <[email protected]>
---
drivers/mtd/nand/pxa3xx_nand.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/mtd/nand/pxa3xx_nand.c b/drivers/mtd/nand/pxa3xx_nand.c
index b0737aec7caf..718097414b9c 100644
--- a/drivers/mtd/nand/pxa3xx_nand.c
+++ b/drivers/mtd/nand/pxa3xx_nand.c
@@ -687,8 +687,10 @@ static irqreturn_t pxa3xx_nand_irq(int irq, void *devid)
is_ready = 1;
}

+ /* clear NDSR to let the controller exit the IRQ */
+ nand_writel(info, NDSR, status);
+
if (status & NDSR_WRCMDREQ) {
- nand_writel(info, NDSR, NDSR_WRCMDREQ);
status &= ~NDSR_WRCMDREQ;
info->state = STATE_CMD_HANDLE;

@@ -709,8 +711,6 @@ static irqreturn_t pxa3xx_nand_irq(int irq, void *devid)
nand_writel(info, NDCB0, info->ndcb3);
}

- /* clear NDSR to let the controller exit the IRQ */
- nand_writel(info, NDSR, status);
if (is_completed)
complete(&info->cmd_complete);
if (is_ready)
--
2.1.4

2015-08-16 15:33:11

by Ezequiel Garcia

[permalink] [raw]

Subject: Re: [PATCH] mtd: nand: pxa3xx-nand: fix random command timeouts

On 12 Aug 06:22 PM, Robert Jarzmik wrote:
> When 2 commands are submitted in a row, and the second is very quick,
> the completion of the second command might never come. This happens
> especially if the second command is quick, such as a status read after
> an erase.
>
> The issue is that in the interrupt handler, the status bits are cleared
> after the new command is issued. There is a small temporal window where
> this happens :
> - the previous command has set the command done bit
> - the ready for a command bit is set
> - the handler submits the next command
> - just then, the command completes, and the command done bit is still
> set
> - the handler clears the "previous" command done bit
> - the handler exits
>
> In this flow, the "command done" of the next command will never trigger
> a new interrupt to finish the status command, as it was cleared for both
> commands.
>
> Fix this by clearing the status bit before submitting a new command.
>

This fix looks correct. Thanks!

Couple questions:

1. In which platform are you seeing this bug?
2. Is this a regression? (i.e. should we queue it for -stable?)

Also, one might question why we can't just write NDSR right after it's read,
before we wake the IRQ thread or start DMA. It appears this is
a requirement of BCH, as per the comment in drain_fifo.

It would be nice to put a comment explaining why we clear NDSR only
before the check to WRCMDREQ. Maybe even copy-pasting something
from the commit log?

I'd like to say "Yay, let's pick it" but I'd like to make sure this is
tested on all platforms first (unless you've tested it already).

> Signed-off-by: Robert Jarzmik <[email protected]>
> ---
> drivers/mtd/nand/pxa3xx_nand.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/mtd/nand/pxa3xx_nand.c b/drivers/mtd/nand/pxa3xx_nand.c
> index b0737aec7caf..718097414b9c 100644
> --- a/drivers/mtd/nand/pxa3xx_nand.c
> +++ b/drivers/mtd/nand/pxa3xx_nand.c
> @@ -687,8 +687,10 @@ static irqreturn_t pxa3xx_nand_irq(int irq, void *devid)
> is_ready = 1;
> }
>
> + /* clear NDSR to let the controller exit the IRQ */
> + nand_writel(info, NDSR, status);
> +
> if (status & NDSR_WRCMDREQ) {
> - nand_writel(info, NDSR, NDSR_WRCMDREQ);
> status &= ~NDSR_WRCMDREQ;
> info->state = STATE_CMD_HANDLE;
>
> @@ -709,8 +711,6 @@ static irqreturn_t pxa3xx_nand_irq(int irq, void *devid)
> nand_writel(info, NDCB0, info->ndcb3);
> }
>
> - /* clear NDSR to let the controller exit the IRQ */
> - nand_writel(info, NDSR, status);
> if (is_completed)
> complete(&info->cmd_complete);
> if (is_ready)
> --
> 2.1.4
>
>
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/

--
Ezequiel Garcia, VanguardiaSur
http://www.vanguardiasur.com.ar

2015-08-16 22:22:20

by Robert Jarzmik

[permalink] [raw]

Subject: Re: [PATCH] mtd: nand: pxa3xx-nand: fix random command timeouts

Ezequiel Garcia <[email protected]> writes:

> On 12 Aug 06:22 PM, Robert Jarzmik wrote:
>
> This fix looks correct. Thanks!
>
> Couple questions:
>
> 1. In which platform are you seeing this bug?
zylonite with a pxa310 (ie. internal stacked NAND).

> 2. Is this a regression? (i.e. should we queue it for -stable?)
No, it's been there for ages I think.

> Also, one might question why we can't just write NDSR right after it's read,
> before we wake the IRQ thread or start DMA. It appears this is
> a requirement of BCH, as per the comment in drain_fifo.
For irq thread that won't make any difference, the irq handler will finish first
and clear the bits anyway. For DMA it's better.

And more generaly speaking, I like it better, to clear it once read.

> It would be nice to put a comment explaining why we clear NDSR only
> before the check to WRCMDREQ. Maybe even copy-pasting something
> from the commit log?
If we move it up to something like that :
status = nand_readl(info, NDSR);
nand_writel(info, NDSR, status);
Then the comment is overkill I think.

> I'd like to say "Yay, let's pick it" but I'd like to make sure this is
> tested on all platforms first (unless you've tested it already).
I tested on zylonite (where bug occurs) and cm-x300 (where bug never occurs).

Cheers.

--
Robert

2015-08-17 19:09:08

by Ezequiel Garcia

[permalink] [raw]

Subject: Re: [PATCH] mtd: nand: pxa3xx-nand: fix random command timeouts

On 16 August 2015 at 19:18, Robert Jarzmik <[email protected]> wrote:
> Ezequiel Garcia <[email protected]> writes:
>
>> On 12 Aug 06:22 PM, Robert Jarzmik wrote:
>>
>> This fix looks correct. Thanks!
>>
>> Couple questions:
>>
>> 1. In which platform are you seeing this bug?
> zylonite with a pxa310 (ie. internal stacked NAND).
>
>> 2. Is this a regression? (i.e. should we queue it for -stable?)
> No, it's been there for ages I think.
>
>> Also, one might question why we can't just write NDSR right after it's read,
>> before we wake the IRQ thread or start DMA. It appears this is
>> a requirement of BCH, as per the comment in drain_fifo.
> For irq thread that won't make any difference, the irq handler will finish first
> and clear the bits anyway. For DMA it's better.
>
> And more generaly speaking, I like it better, to clear it once read.
>
>> It would be nice to put a comment explaining why we clear NDSR only
>> before the check to WRCMDREQ. Maybe even copy-pasting something
>> from the commit log?
> If we move it up to something like that :
> status = nand_readl(info, NDSR);
> nand_writel(info, NDSR, status);
> Then the comment is overkill I think.
>

Well, the comment in drain_fifo says that doing this would break
reads with BCH enabled:

/*
* According to the datasheet, when reading from NDDB
* with BCH enabled, after each 32 bytes reads, we
* have to make sure that the NDSR.RDDREQ bit is set.
*
* Drain the FIFO 8 32 bits reads at a time, and skip
* the polling on the last read.
*/

In other words, it seems that we must wake the IRQ thread handler
_before_ we clear RDDREQ, not after.

So unless I'm completely off, the current patch is right, and a comment
would be helpful.

>> I'd like to say "Yay, let's pick it" but I'd like to make sure this is
>> tested on all platforms first (unless you've tested it already).
> I tested on zylonite (where bug occurs) and cm-x300 (where bug never occurs).
>

OK, I'll see about testing your four patches on some Armada 370/XP.

--
Ezequiel García, VanguardiaSur
http://www.vanguardiasur.com.ar

2015-08-17 19:19:51

by Robert Jarzmik

[permalink] [raw]

Subject: Re: [PATCH] mtd: nand: pxa3xx-nand: fix random command timeouts

Ezequiel Garcia <[email protected]> writes:

> In other words, it seems that we must wake the IRQ thread handler
> _before_ we clear RDDREQ, not after.
>
> So unless I'm completely off, the current patch is right, and a comment
> would be helpful.
Ok Ezequiel, I'll wait for your Tested-by, and respin with something like :

/* Clear all status bit before issuing the next command, which can and
* will alter the status bits and will deserve a new interrupt on its
* own.
*/

> OK, I'll see about testing your four patches on some Armada 370/XP.
Yeah, that would be good.

Cheers.

--
Robert

2015-08-19 00:46:34

by Brian Norris

[permalink] [raw]

Subject: Re: [PATCH] mtd: nand: pxa3xx-nand: fix random command timeouts

Hi Robert,

On Mon, Aug 17, 2015 at 09:15:48PM +0200, Robert Jarzmik wrote:
> Ezequiel Garcia <[email protected]> writes:
> > So unless I'm completely off, the current patch is right, and a comment
> > would be helpful.
> Ok Ezequiel, I'll wait for your Tested-by, and respin with something like :
>
> /* Clear all status bit before issuing the next command, which can and
> * will alter the status bits and will deserve a new interrupt on its
> * own.
> */

To be clear: I should hold off on this patch?

I'm going to try to clear my queue of things that are simple/obvious
and/or fix bugs (I think you spoke up about a different pxa3xx_nand
bugfix which I will try to find). There's too big a backlog for me to
get through everything for the merge window...

Thanks,
Brian

2015-08-19 06:26:18

by Robert Jarzmik

[permalink] [raw]

Subject: Re: [PATCH] mtd: nand: pxa3xx-nand: fix random command timeouts

Brian Norris <[email protected]> writes:

> Hi Robert,
>
> On Mon, Aug 17, 2015 at 09:15:48PM +0200, Robert Jarzmik wrote:
>> Ezequiel Garcia <[email protected]> writes:
>> > So unless I'm completely off, the current patch is right, and a comment
>> > would be helpful.
>> Ok Ezequiel, I'll wait for your Tested-by, and respin with something like :
>>
>> /* Clear all status bit before issuing the next command, which can and
>> * will alter the status bits and will deserve a new interrupt on its
>> * own.
>> */
>
> To be clear: I should hold off on this patch?
If you can hold until this evening (ie. in 12h), I'll post the same version of
the patch with Ezequiel's comment, and Ack/Test if he provides it.

If you have to decide in the next few hours, take it, this fix should land in
v4.3.

> I'm going to try to clear my queue of things that are simple/obvious
> and/or fix bugs (I think you spoke up about a different pxa3xx_nand
> bugfix which I will try to find). There's too big a backlog for me to
> get through everything for the merge window...

I'll resend my pxa3xx-nand backlog this evening for fixes I have on mind which
are simple fixes :
- this one (random command timeouts)
- Antoine's one (resent yesterday)
- my fix for probe (mtd: nand: pxa3xx_nand: fix early spurious interrupt)
If you can wait this evening it will spare you digging through your mails.

Or alternatively Ezequiel, if you want to aggregate the patches because you have
others on mind also, please feel free to overtake this from me before my evening :)

Cheers.

--
Robert