2017-12-17 12:05:59

by Willy Tarreau

Subject: pxa3xx_nand times out in 4.14 with JFFS2

Hello,

I recently bought a Linksys WRT1900ACS which hosts an Armada 385 and a
NAND flash. While I could get OpenWRT to work flawlessly on it using
kernel 4.4, mainline 4.14.6 fails with many messages like this one:

pxa3xx-nand f10d0000.flash: Wait time out!!!

Looking a bit closer, I found that it was triggered by my boot scripts
detecting the JFFS2 signature (0x1985) and trying to mount mtdblock5. Under
OpenWrt's kernel, however, this partition mounts fine.

I tried reading both /dev/mtd5 and /dev/mtdblock5 using cat and then dd
and got no errors either, so it seems that JFFS2 triggers a specific
operation that causes the flash (or the driver) to fail. By the way, here's
the device identification:

[ 0.638155] pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
[ 0.644661] nand: device found, Manufacturer ID: 0x01, Chip ID: 0xf1
[ 0.649732] nand: AMD/Spansion S34ML01G2
[ 0.652369] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[ 0.658662] pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048

I found a few threads on the net where people were changing the timeout
in the driver, and OpenWRT has this larger timeout, so naturally I thought
it was the way to fix it and I tried, but surprisingly nothing changed at
all.

I found a long troubleshooting session in the list's archives where
CONFIG_PREEMPT had to be enabled to work around the problem here :

https://patchwork.ozlabs.org/patch/566837/

So I tried to enable it as well but it didn't change anything for me.

I enabled Robert's readl/writel debug traces and added some messages
at the entry and exit of every single function in the driver to try to
narrow it down. I've placed the traces here:

http://1wt.eu/wrt1900acs/

(Please note that the config there doesn't have PREEMPT enabled, but it
matches the kernel used to produce the dumps.) Also, you'll notice
in the config that I added the out-of-tree mwlwifi driver; I did it late
in the evening after I spent my whole day trying to fix the NAND issue,
so it's irrelevant. However, if anyone prefers that I re-run the traces
without it, I will.

I purposely split the trace in two steps: one before trying to mount the
JFFS2 fs, and the second one during and after the mount attempt. I also
took another trace with JFFS2 debugging enabled in case that helps.

I noticed that timeouts always occur after such a sequence :

> pxa3xx_nand_irq:845
pxa3xx-nand f10d0000.flash: pxa3xx_nand_irq():854 nand_readl(0x0014) = 0x2
< pxa3xx_nand_irq:932
< pxa3xx_nand_irq:933
> pxa3xx_nand_irq_thread:829
> handle_data_pio:728
> drain_fifo:693
< drain_fifo:723
> drain_fifo:693
< drain_fifo:723
< handle_data_pio:761
pxa3xx-nand f10d0000.flash: pxa3xx_nand_irq_thread():833 nand_writel(0x6, 0x0014)
< pxa3xx_nand_irq_thread:835

So in short we enter pxa3xx_nand_irq, read register 0x0014 and find 2,
then later we enter pxa3xx_nand_irq_thread(), handle_data_pio(), leave
it, and issue a nand_writel() to register 0x0014 and leave
pxa3xx_nand_irq_thread(). And at this point we wait till the timeout
strikes (apparently the CHIP_DELAY_TIMEOUT since I'm observing 2 seconds
and that's what it's set to).
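
For anyone reading the trace, the two register values can be decoded into flag names. The sketch below is only an illustration: it assumes that register 0x0014 is the controller's status register (NDSR), and the bit names and positions are quoted from memory of the driver's NDSR_* definitions, so they should be checked against pxa3xx_nand.c before being relied on.

```python
# Assumed layout of the low NDSR bits; verify against the NDSR_* macros
# in drivers/mtd/nand/pxa3xx_nand.c.
NDSR_BITS = {
    0: "WRCMDREQ",  # controller requests a command
    1: "RDDREQ",    # read data is waiting in the FIFO
    2: "WRDREQ",    # controller requests write data
}


def decode_ndsr(value):
    """Return the names of the known flags set in a status register value."""
    return [name for bit, name in sorted(NDSR_BITS.items()) if value & (1 << bit)]


print(decode_ndsr(0x2))  # value read in pxa3xx_nand_irq()
print(decode_ndsr(0x6))  # value written back in pxa3xx_nand_irq_thread()
```

Under that assumption, the interrupt handler sees a read-data request, and the thread drains the FIFO and then acknowledges both data-request flags before the wait begins.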

Interestingly, this morning I left the machine hung like this and came
back 20 minutes later to find that mtdblock5 had finally been successfully
mounted (1322 seconds later precisely). I don't know if something suddenly
decided to work, or if an operation is performed 650 times and waits for
this 2 second timeout for each operation. I've left it running again to
see if it happens again.
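
As a quick sanity check on the "650 times" estimate, the arithmetic can be written out. This is only a sketch and assumes that the whole delay consists of back-to-back 2 second timeouts, which the trace does not prove.

```python
TOTAL_MOUNT_SECONDS = 1322  # time until mtdblock5 was finally mounted
TIMEOUT_SECONDS = 2         # observed delay per timeout


def implied_timeouts(total_seconds, timeout_seconds):
    """Upper bound on the number of timeouts if they account for all of the delay."""
    return total_seconds // timeout_seconds


print(implied_timeouts(TOTAL_MOUNT_SECONDS, TIMEOUT_SECONDS))
```

The result is 661, so the order of magnitude matches an operation that is repeated several hundred times and times out on each attempt.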

I have not tested 4.15-rc3 on it yet, though the only change to the
driver is irrelevant to this issue. I also didn't go back to older
kernels as I'm not sure when this machine was supposed to start to be
supported (the DTS came in 4.12 but given that people started to complain
about the timeout in openwrt's 4.4, it's not clear).

At this point I'm out of ideas so if anyone wants me to test specific
config options to report more info, or to test patches, they're welcome!
There's no emergency for me to get this machine to work properly, it's
expected to replace my current one, but only once it works, so in the
mean time it's more of a development platform lying on my desk :-)

Thanks,
Willy


2017-12-17 12:34:10

by Boris Brezillon

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

Hi Willy,

On Sun, 17 Dec 2017 13:05:03 +0100
Willy Tarreau wrote:

> Hello,
>
> I recently bought a Linksys WRT1900ACS which hosts an Armada 385 and a
> NAND flash. While I could get OpenWRT to work flawlessly on it using
> kernel 4.4, mainline 4.14.6 fails with a lot of such messages :
>
> pxa3xx-nand f10d0000.flash: Wait time out!!!
>
> Looking a bit closer, I found that it was triggered by my boot scripts
> detecting the JFFS2 signature (0x1985) and trying to mount mtdblock5. But
> under openwrt's kernel this partition mounts pretty fine.
>
> I tried to read both /dev/mtd5 and /dev/mtdblock5 using cat then dd
> and got no issue either, so it seems that JFFS2 triggers a specific
> operation causing the flash (or driver) to fail. By the way, here's the
> device identification :
>
> [ 0.638155] pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
> [ 0.644661] nand: device found, Manufacturer ID: 0x01, Chip ID: 0xf1
> [ 0.649732] nand: AMD/Spansion S34ML01G2
> [ 0.652369] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> [ 0.658662] pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
>
> I found a few threads on the net where people were changing the timeout
> in the driver, and OpenWRT has this larger timeout, so naturally I thought
> it was the way to fix it and I tried, but surprisingly nothing changed at
> all.
>
> I found a long troubleshooting session in the list's archives where
> CONFIG_PREEMPT had to be enabled to work around the problem here :
>
> https://patchwork.ozlabs.org/patch/566837/
>
> So I tried to enable it as well but it didn't change anything for me.
>
> I enabled Robert's readl/writel debug traces and have added some messages
> at the entrance and exit of every single function in the driver to try to
> spot a bit more of it. I've placed the traces there :
>
> http://1wt.eu/wrt1900acs/
>
> (Please note that the config above doesn't have PREEMPT enabled but it
> matches the kernel used to produce the dumps there). Also you'll notice
> in the config that I added the out-of-tree mwlwifi driver; I did it late
> in the evening after I spent my whole day trying to fix the NAND issue,
> so it's irrelevant. However if some prefer that I re-run traces without
> it I will.
>
> I purposely split the trace in two steps : one before trying to mount the
> JFFS2 fs, and the second one during and after the mount attempt. I also
> took another trace with JFFS2 debugging enabled in case that helps.
>
> I noticed that timeouts always occur after such a sequence :
>
> > pxa3xx_nand_irq:845
> pxa3xx-nand f10d0000.flash: pxa3xx_nand_irq():854 nand_readl(0x0014) = 0x2
> < pxa3xx_nand_irq:932
> < pxa3xx_nand_irq:933
> > pxa3xx_nand_irq_thread:829
> > handle_data_pio:728
> > drain_fifo:693
> < drain_fifo:723
> > drain_fifo:693
> < drain_fifo:723
> < handle_data_pio:761
> pxa3xx-nand f10d0000.flash: pxa3xx_nand_irq_thread():833 nand_writel(0x6, 0x0014)
> < pxa3xx_nand_irq_thread:835
>
> So in short we enter pxa3xx_nand_irq, read register 0x0014 and find 2,
> then later we enter pxa3xx_nand_irq_thread(), handle_data_pio(), leave
> it, and issue a nand_writel() to register 0x0014 and leave
> pxa3xx_nand_irq_thread(). And at this point we wait till the timeout
> strikes (apparently the CHIP_DELAY_TIMEOUT since I'm observing 2 seconds
> and that's what it's set to).
>
> Interestingly, this morning I left the machine hung like this and came
> back 20 minutes later to find that mtdblock5 had finally been successfully
> mounted (1322 seconds later precisely). I don't know if something suddenly
> decided to work, or if an operation is performed 650 times and waits for
> this 2 second timeout for each operation. I've left it running again to
> see if it happens again.
>
> I have not tested 4.15-rc3 on it yet, though the only change to the
> driver is irrelevant to this issue. I also didn't go back to older
> kernels as I'm not sure when this machine was supposed to start to be
> supported (the DTS came in 4.12 but given that people started to complain
> about the timeout in openwrt's 4.4, it's not clear).
>
> At this point I'm out of ideas so if anyone wants me to test specific
> config options to report more info, or to test patches, they're welcome!
> There's no emergency for me to get this machine to work properly, it's
> expected to replace my current one, but only once it works, so in the
> mean time it's more of a development platform lying on my desk :-)

You should have a look at this thread [1], and in case you don't want
to read everything, you can just test the solution proposed here [2].

[1]http://linux-mtd.infradead.narkive.com/Rd5UaRPO/bug-pxa3xx-wait-time-out-when-scanning-for-bb
[2]http://patchwork.ozlabs.org/patch/847411/

2017-12-17 13:17:37

by Willy Tarreau

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

Hi Boris!

On Sun, Dec 17, 2017 at 01:33:55PM +0100, Boris Brezillon wrote:
> You should have a look at this thread [1], and in case you don't want
> to read everything,

I've read it entirely, it was very instructive!

> you can just test the solution proposed here [2].
>
> [1]http://linux-mtd.infradead.narkive.com/Rd5UaRPO/bug-pxa3xx-wait-time-out-when-scanning-for-bb
> [2]http://patchwork.ozlabs.org/patch/847411/

Well done for such a quick reply! I can confirm that your proposed
patch below does fix it for me! Now I understand why only jffs2 was
triggering the issue if it only affects OOB, and I guess I would have
faced it as well with nanddump had I thought about testing it.

I'm queuing this one here to continue to progress on my machine, feel
free to add my tested-by if the patch gets merged, or to ping me to
test any other option you'd like to confirm!

Thanks!
Willy

---

diff --git a/drivers/mtd/nand/pxa3xx_nand.c b/drivers/mtd/nand/pxa3xx_nand.c
index 321a90c..adb9fd8 100644
--- a/drivers/mtd/nand/pxa3xx_nand.c
+++ b/drivers/mtd/nand/pxa3xx_nand.c
@@ -950,6 +950,7 @@ static void prepare_start_command(struct pxa3xx_nand_info *info, int command)

 	switch (command) {
 	case NAND_CMD_READ0:
+	case NAND_CMD_READOOB:
 	case NAND_CMD_PAGEPROG:
 		info->use_ecc = 1;
 		break;
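
To make the effect of the one-line change concrete, here is a toy model of the switch. It is not the driver code: the opcode values are the standard NAND ones as recalled from the MTD headers, and the assumption that use_ecc is 0 for any command not listed in the switch is mine.

```python
# Standard NAND opcodes (assumed to match the kernel's NAND_CMD_* values).
NAND_CMD_READ0 = 0x00
NAND_CMD_PAGEPROG = 0x10
NAND_CMD_READOOB = 0x50

ECC_COMMANDS_BEFORE = {NAND_CMD_READ0, NAND_CMD_PAGEPROG}
ECC_COMMANDS_AFTER = ECC_COMMANDS_BEFORE | {NAND_CMD_READOOB}


def use_ecc(command, patched):
    """Model of the switch: return 1 if the command runs with ECC enabled."""
    ecc_commands = ECC_COMMANDS_AFTER if patched else ECC_COMMANDS_BEFORE
    return 1 if command in ecc_commands else 0


for patched in (False, True):
    print(patched, use_ecc(NAND_CMD_READOOB, patched))
```

In this model only the OOB read changes behaviour, which is consistent with plain reads through cat and dd working while JFFS2 mounts did not.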


2017-12-17 14:25:20

by Ezequiel Garcia

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On 17 December 2017 at 10:17, Willy Tarreau wrote:
> Hi Boris!
>
> On Sun, Dec 17, 2017 at 01:33:55PM +0100, Boris Brezillon wrote:
>> You should have a look at this thread [1], and in case you don't want
>> to read everything,
>
> I've read it entirely, it was very instructive!
>
>> you can just test the solution proposed here [2].
>>
>> [1]http://linux-mtd.infradead.narkive.com/Rd5UaRPO/bug-pxa3xx-wait-time-out-when-scanning-for-bb
>> [2]http://patchwork.ozlabs.org/patch/847411/
>
> Well done for such a quick reply! I can confirm that your proposed
> patch below does fix it for me! Now I understand why only jffs2 was
> triggering the issue if it only affects OOB, and I guess I would have
> faced it as well with nanddump had I thought about testing it.
>
> I'm queuing this one here to continue to progress on my machine, feel
> free to add my tested-by if the patch gets merged, or to ping me to
> test any other option you'd like to confirm!
>
> Thanks!
> Willy
>
> ---
>
> diff --git a/drivers/mtd/nand/pxa3xx_nand.c b/drivers/mtd/nand/pxa3xx_nand.c
> index 321a90c..adb9fd8 100644
> --- a/drivers/mtd/nand/pxa3xx_nand.c
> +++ b/drivers/mtd/nand/pxa3xx_nand.c
> @@ -950,6 +950,7 @@ static void prepare_start_command(struct pxa3xx_nand_info *info, int command)
>
> switch (command) {
> case NAND_CMD_READ0:
> + case NAND_CMD_READOOB:
> case NAND_CMD_PAGEPROG:
> info->use_ecc = 1;
> break;
>
>

If we can confirm that with this patch, bad block markers can be read
without issues, then it's good to go.

Willy, think you could try to test that?
--
Ezequiel García, VanguardiaSur
http://www.vanguardiasur.com.ar

2017-12-17 14:27:54

by Ezequiel Garcia

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On 17 December 2017 at 09:05, Willy Tarreau wrote:
> Hello,
>
> I recently bought a Linksys WRT1900ACS which hosts an Armada 385 and a
> NAND flash. While I could get OpenWRT to work flawlessly on it using
> kernel 4.4, mainline 4.14.6 fails with a lot of such messages :
>
> pxa3xx-nand f10d0000.flash: Wait time out!!!
>

Boris,

Any idea why this issue is on v4.14, but not observed on v4.4?

Also, is this somehow related to Armada 385 only?
--
Ezequiel García, VanguardiaSur
http://www.vanguardiasur.com.ar

2017-12-17 14:53:18

by Boris Brezillon

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On Sun, 17 Dec 2017 11:27:51 -0300
Ezequiel Garcia wrote:

> On 17 December 2017 at 09:05, Willy Tarreau <[email protected]> wrote:
> > Hello,
> >
> > I recently bought a Linksys WRT1900ACS which hosts an Armada 385 and a
> > NAND flash. While I could get OpenWRT to work flawlessly on it using
> > kernel 4.4, mainline 4.14.6 fails with a lot of such messages :
> >
> > pxa3xx-nand f10d0000.flash: Wait time out!!!
> >
>
> Boris,
>
> Any idea why this issue is on v4.14, but not observed on v4.4?

I have absolutely no idea.

>
> Also, is this somehow related to Armada 385 only?

I doubt it. My guess is that almost nobody uses JFFS2 these days, which
may explain why this problem has not been detected before.

2017-12-17 15:01:14

by Willy Tarreau

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On Sun, Dec 17, 2017 at 03:53:05PM +0100, Boris Brezillon wrote:
> On Sun, 17 Dec 2017 11:27:51 -0300
> Ezequiel Garcia <[email protected]> wrote:
>
> > On 17 December 2017 at 09:05, Willy Tarreau <[email protected]> wrote:
> > > Hello,
> > >
> > > I recently bought a Linksys WRT1900ACS which hosts an Armada 385 and a
> > > NAND flash. While I could get OpenWRT to work flawlessly on it using
> > > kernel 4.4, mainline 4.14.6 fails with a lot of such messages :
> > >
> > > pxa3xx-nand f10d0000.flash: Wait time out!!!
> > >
> >
> > Boris,
> >
> > Any idea why this issue is on v4.14, but not observed on v4.4?
>
> I have absolutely no idea.

Warning, the 4.4 in openwrt very likely is heavily patched! That's also
why I'm moving to mainline instead (to know what I'm using). I've seen
some nand timeout changes in the patches. I don't know if anything else
is applied to the driver (it's always a pain to find where to dig, as
there is no unified list of all patches for a given architecture).

> > Also, is this somehow related to Armada 385 only?
>
> I doubt it. My guess is that almost nobody uses JFFS2 these days, which
> may explain why this problem has not been detected before.

That's very likely indeed.

Ezequiel, to answer your question about dumping bad blocks: this flash
doesn't report any bad blocks yet (cool), but I could run "nanddump
--oob --bb=dumpbad" on all MTD devices without issues. The last one has
8 BBT blocks. I didn't find any bad block, but I could confirm that
dumping oob apparently worked, as it returned data that differs from the
non-oob dump on the last partition (the one containing the oob blocks),
so I guess we're fine:

# cmp -l raw oob
40822793 377 61
40822794 377 164
40822795 377 142
40822796 377 102
40822797 377 126
40822798 377 115
40822799 377 1
40957961 377 115
40957962 377 126
40957963 377 102
40957964 377 142
40957965 377 164
40957966 377 60
40957967 377 1
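
The cmp output can be decoded to see what the OOB bytes contain. The sketch below assumes that the oob file stores each page as 2048 data bytes followed by 64 OOB bytes, which is how nanddump's binary output with --oob is normally laid out; the helper name decode_runs is hypothetical.

```python
PAGE_SIZE = 2048  # "page size: 2048" from the kernel log
OOB_SIZE = 64     # "OOB size: 64" from the kernel log

# Lines of `cmp -l raw oob`: 1-based offset, octal byte in raw, octal byte in oob.
CMP_LINES = """\
40822793 377 61
40822794 377 164
40822795 377 142
40822796 377 102
40822797 377 126
40822798 377 115
40822799 377 1
40957961 377 115
40957962 377 126
40957963 377 102
40957964 377 142
40957965 377 164
40957966 377 60
40957967 377 1
"""


def decode_runs(cmp_lines):
    """Group consecutive differing offsets into (start_offset, bytes) pairs."""
    runs = []
    for line in cmp_lines.splitlines():
        offset, _raw, oob = line.split()
        offset = int(offset) - 1  # cmp -l counts offsets from 1
        value = int(oob, 8)       # cmp -l prints byte values in octal
        if runs and runs[-1][0] + len(runs[-1][1]) == offset:
            runs[-1][1].append(value)
        else:
            runs.append((offset, bytearray([value])))
    return [(start, bytes(data)) for start, data in runs]


for start, data in decode_runs(CMP_LINES):
    page, in_page = divmod(start, PAGE_SIZE + OOB_SIZE)
    print(page, in_page - PAGE_SIZE, data)
```

Both runs decode to a six-character signature followed by a 0x01 byte ("1tbBVM" and "MVBbt0"), and under the layout assumption both sit at OOB offset 8, 64 pages (one erase block) apart. That looks like a bad block table marker and its mirror, so the OOB reads do appear to return real data.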

Hoping this helps,
Willy

2017-12-17 15:10:05

by Willy Tarreau

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On Sun, Dec 17, 2017 at 04:00:43PM +0100, Willy Tarreau wrote:
> > > Any idea why this issue is on v4.14, but not observed on v4.4?
> >
> > I have absolutely no idea.
>
> Warning, the 4.4 in openwrt very likely is heavily patched! That's also
> why I'm moving to mainline instead (to know what I'm using). I've seen
> some nand timeout changes in the patches. I don't know if anything else
> is applied to the driver (it's always a pain to find where to dig, as
> there is no unified list of all patches for a given architecture).

Given the description here, I suspect this is how they got rid of the
problem there :

https://github.com/lede-project/source/blob/lede-17.01/target/linux/mvebu/patches-4.4/110-pxa3xxx_revert_irq_thread.patch

---
Revert "mtd: pxa3xx-nand: handle PIO in threaded interrupt"

This reverts commit 24542257a3b987025d4b998ec2d15e556c98ad3f
This upstream change has been causing spurious timeouts on accesses
to the NAND flash if something else on the system is causing
significant latency.

Nothing guarantees that the thread will run in time, so the
usual timeout is unreliable.
---

Willy

2017-12-17 15:53:39

by Ezequiel Garcia

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On 17 December 2017 at 12:00, Willy Tarreau wrote:
> On Sun, Dec 17, 2017 at 03:53:05PM +0100, Boris Brezillon wrote:
>> On Sun, 17 Dec 2017 11:27:51 -0300
>> Ezequiel Garcia <[email protected]> wrote:
>>
>> > On 17 December 2017 at 09:05, Willy Tarreau <[email protected]> wrote:
>> > > Hello,
>> > >
>> > > I recently bought a Linksys WRT1900ACS which hosts an Armada 385 and a
>> > > NAND flash. While I could get OpenWRT to work flawlessly on it using
>> > > kernel 4.4, mainline 4.14.6 fails with a lot of such messages :
>> > >
>> > > pxa3xx-nand f10d0000.flash: Wait time out!!!
>> > >
>> >
>> > Boris,
>> >
>> > Any idea why this issue is on v4.14, but not observed on v4.4?
>>
>> I have absolutely no idea.
>
> Warning, the 4.4 in openwrt very likely is heavily patched! That's also
> why I'm moving to mainline instead (to know what I'm using). I've seen
> some nand timeout changes in the patches. I don't know if anything else
> is applied to the driver (it's always a pain to find where to dig, as
> there is no unified list of all patches for a given architecture).
>
>> > Also, is this somehow related to Armada 385 only?
>>
>> I doubt it. My guess is that almost nobody uses JFFS2 these days, which
>> may explain why this problem has not been detected before.
>
> That's very likely indeed.
>
> Ezequiel, to answer your question about dumping bad blocks, this flash
> doesn't report any bad blocks yet (cool) however I could issue "nanddump
> --oob --bb=dumpbad" on all MTD devices without issues. The last one has
> 8 BBT blocks. I didn't find any bad block, but I could confirm that
> dumping oob apparently worked as it returned data that differs from the
> non-oob dump on the last partition (the one containing the oob blocks),
> so I guess we're fine :
>

If not too much to ask, this is the test that I believe is needed.
You seem to have a setup ready, hence why I'm asking you, if
possible, to give it a shot.

(1) Scrub the BBT from the NAND. Or scrub the whole NAND.
You cannot do this from the kernel, it needs to be done from the bootloader.

(2) Mark a couple blocks as bad using the OOB -- AFAICR, there
was a command to do this in the bootloader.

(3) Boot, let Linux create the BBT and see if it catches the bad blocks.

This would guarantee that devices with factory bad blocks,
(and no BBT), would be OK with this patch.
--
Ezequiel García, VanguardiaSur
http://www.vanguardiasur.com.ar

2017-12-17 16:24:12

by Willy Tarreau

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On Sun, Dec 17, 2017 at 12:53:36PM -0300, Ezequiel Garcia wrote:
> If not too much to ask, this is the test that I believe is needed.
> You seem to have a setup ready, hence why I'm asking you, if
> possible, to give it a shot.
>
> (1) Scrub the BBT from the NAND. Or scrub the whole NAND.
> You cannot do this from the kernel, it needs to be done from the bootloader.
>
> (2) Mark a couple blocks as bad using the OOB -- AFAICR, there
> was a command to do this in the bootloader.
>
> (3) Boot, let Linux create the BBT and see if it catches the bad blocks.

Are the current boot loaders safe regarding the scrub operation? I'm
asking because that's how I bricked my mirabox a few years ago when
trying to mark a bad block from u-boot :-/ If someone who knows these
commands well can help me limit the risk by only touching a small part
at the end of the flash (or the unused area), I'd prefer that :-)

> This would guarantee that devices with factory bad blocks,
> (and no BBT), would be OK with this patch.

I see. I'm fine with trying provided I have reasonably good assurance
that I won't have to go through the kwboot pain again :-/

Cheers,
Willy

2017-12-17 18:07:49

by Boris Brezillon

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On Sun, 17 Dec 2017 17:23:42 +0100
Willy Tarreau wrote:

> On Sun, Dec 17, 2017 at 12:53:36PM -0300, Ezequiel Garcia wrote:
> > If not too much to ask, this is the test that I believe is needed.
> > You seem to have a setup ready, hence why I'm asking you, if
> > possible, to give it a shot.
> >
> > (1) Scrub the BBT from the NAND. Or scrub the whole NAND.
> > You cannot do this from the kernel, it needs to be done from the bootloader.
> >
> > (2) Mark a couple blocks as bad using the OOB -- AFAICR, there
> > was a command to do this in the bootloader.
> >
> > (3) Boot, let Linux create the BBT and see if it catches the bad blocks.
>
> Are the current boot loaders safe regarding the scrub operation ? I'm
> asking because that's how I bricked my mirabox a few years ago when
> trying to mark a bad block from u-boot :-/ If someone has a good
> knowledge of these commands to limit the risk and helps me only playing
> with a small part at the end of the flash (or in the unused area) I'd
> prefer it :-)
>
> > This would guarantee that devices with factory bad blocks,
> > (and no BBT), would be OK with this patch.
>
> I see. I'm fine with trying provided I have reasonably good assurance
> that I won't have to go through the kwboot pain again :-/

There's an easy test you can do without scrubbing the NAND:
1/ comment out the nand-on-flash-bbt property in your DT (this will trigger
a full scan)
2/ from u-boot (before booting the kernel), erase a block that you know
contains nothing important
3/ during the kernel scan, make sure this block is not reported as bad

2017-12-17 19:01:30

by Willy Tarreau

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On Sun, Dec 17, 2017 at 07:07:46PM +0100, Boris Brezillon wrote:
> > > This would guarantee that devices with factory bad blocks,
> > > (and no BBT), would be OK with this patch.
> >
> > I see. I'm fine with trying provided I have reasonably good assurance
> > that I won't have to go through the kwboot pain again :-/
>
> There's an easy test you can do without scrubbing the NAND:
> 1/ comment the nand-on-flash-bbt property in your DT (this will trigger
> a full scan)
> 2/ from u-boot (before booting the kernel), erase a block that you know
> contains nothing important
> 3/ during the kernel scan, make sure this block is not reported as bad

OK so I tried and never faced any error. Thus I also attempted to mark
a bad block in u-boot, it appeared in the bad blocks table, then I had
to scrub the whole table to get rid of it. Each time when I booted I
saw the message "Scanning device for bad blocks" but no error ever
happened. So I hope it's OK.

Please find a summary of my tests below.

Marvell>> nand erase 280000 1000

NAND erase: device 0 offset 0x280000, size 0x1000
Erasing at 0x280000 -- 100% complete.
OK

Marvell>> nand dump 280000
Page 00280000 dump:
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
...
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
OOB:
ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff

Boot....
# nanddump -c --oob --bb=dumpbad /dev/mtd8 >/tmp/dump-mtd8.txt
=> only ff everywhere

# dmesg
...
pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
nand: device found, Manufacturer ID: 0x01, Chip ID: 0xf1
nand: AMD/Spansion S34ML01G2
nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
Scanning device for bad blocks
...


Marvell>> nand markbad 280000
Bad block table written to 0x000007fe0000, version 0x02
Bad block table written to 0x000007fc0000, version 0x02
>>> orion_nfc_wait_for_completion_timeout command timed out!, status (0x100)
command 19 execution timed out (CS -1, NDCR=0x8104bfff, NDSR=0x100).
block 0x00280000 successfully marked as bad
Marvell>>

Marvell>> nand bad

Device 0 bad blocks:
280000
7f00000
7f20000
7f40000
7f60000
7f80000
7fa0000
7fc0000
7fe0000

Boot...
# dmesg
...
[ 0.875117] pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
[ 0.881627] nand: device found, Manufacturer ID: 0x01, Chip ID: 0xf1
[ 0.886697] nand: AMD/Spansion S34ML01G2
[ 0.889326] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[ 0.895628] pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
[ 0.901316] Scanning device for bad blocks
...


Marvell>> nand scrub 7f00000 100000
Erasing at 0x7fe0000 -- 100% complete.
Bad block table not found for chip 0
Bad block table not found for chip 0
Bad block table written to 0x000007fe0000, version 0x01
Bad block table written to 0x000007fc0000, version 0x01
OK
Marvell>> nand bad

Device 0 bad blocks:
7f00000
7f20000
7f40000
7f60000
7f80000
7fa0000
7fc0000
7fe0000

Boot...
# dmesg
...
[ 0.875322] pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
[ 0.881834] nand: device found, Manufacturer ID: 0x01, Chip ID: 0xf1
[ 0.886904] nand: AMD/Spansion S34ML01G2
[ 0.889533] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[ 0.895835] pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
[ 0.901524] Scanning device for bad blocks
[ 1.202116] ata2: SATA link down (SStatus 0 SControl 300)
[ 1.206245] ata1: SATA link down (SStatus 0 SControl 300)
[ 1.244449] 10 ofpart partitions found on MTD device pxa3xx_nand-0
[ 1.249345] Creating 10 MTD partitions on "pxa3xx_nand-0":
[ 1.253551] 0x000000000000-0x000000200000 : "u-boot"
[ 1.257523] 0x000000200000-0x000000240000 : "u_env"
[ 1.261372] 0x000000240000-0x000000280000 : "s_env"
[ 1.265217] 0x000000900000-0x000000a00000 : "devinfo"
[ 1.269229] 0x000000a00000-0x000003200000 : "kernel1"
[ 1.273326] 0x000001000000-0x000003200000 : "rootfs1"
[ 1.277407] 0x000003200000-0x000005a00000 : "kernel2"
[ 1.281509] 0x000003800000-0x000005a00000 : "rootfs2"
[ 1.285591] 0x000005a00000-0x000008000000 : "syscfg"
[ 1.289596] 0x000000280000-0x000000900000 : "unused_area"
...
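
To relate the addresses listed by "nand bad" to the partition table in this log, a small lookup helps. This is only a sketch built from the values printed above; the helper name partitions_for is hypothetical.

```python
ERASE_SIZE = 128 * 1024  # "erase size: 128 KiB" from the kernel log

# (start, end, name) copied from the "Creating 10 MTD partitions" log.
PARTITIONS = [
    (0x0000000, 0x0200000, "u-boot"),
    (0x0200000, 0x0240000, "u_env"),
    (0x0240000, 0x0280000, "s_env"),
    (0x0900000, 0x0A00000, "devinfo"),
    (0x0A00000, 0x3200000, "kernel1"),
    (0x1000000, 0x3200000, "rootfs1"),
    (0x3200000, 0x5A00000, "kernel2"),
    (0x3800000, 0x5A00000, "rootfs2"),
    (0x5A00000, 0x8000000, "syscfg"),
    (0x0280000, 0x0900000, "unused_area"),
]


def partitions_for(address):
    """Return every partition whose [start, end) range contains the address."""
    return [name for start, end, name in PARTITIONS if start <= address < end]


# The block marked bad for the test, then the eight blocks at the end of the flash.
BAD_BLOCKS = [0x280000] + [0x7F00000 + i * ERASE_SIZE for i in range(8)]
for address in BAD_BLOCKS:
    print(hex(address), partitions_for(address))
```

The test block at 0x280000 is the first block of "unused_area", and the eight blocks from 0x7f00000 to 0x7fe0000 all fall in "syscfg". Those eight blocks span 0x100000 bytes, which matches the size passed to "nand scrub".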

Willy

2017-12-17 21:01:32

by Ezequiel Garcia

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On 17 December 2017 at 16:00, Willy Tarreau wrote:
> On Sun, Dec 17, 2017 at 07:07:46PM +0100, Boris Brezillon wrote:
>> > > This would guarantee that devices with factory bad blocks,
>> > > (and no BBT), would be OK with this patch.
>> >
>> > I see. I'm fine with trying provided I have reasonably good assurance
>> > that I won't have to go through the kwboot pain again :-/
>>
>> There's an easy test you can do without scrubbing the NAND:
>> 1/ comment the nand-on-flash-bbt property in your DT (this will trigger
>> a full scan)
>> 2/ from u-boot (before booting the kernel), erase a block that you know
>> contains nothing important
>> 3/ during the kernel scan, make sure this block is not reported as bad
>
> OK so I tried and never faced any error. Thus I also attempted to mark
> a bad block in u-boot, it appeared in the bad blocks table, then I had
> to scrub the whole table to get rid of it. Each time when I booted I
> saw the message "Scanning device for bad blocks" but no error ever
> happened. So I hope it's OK.
>

Nice. Thanks a lot Willy. I think this acks Boris' patch.
--
Ezequiel García, VanguardiaSur
http://www.vanguardiasur.com.ar

2017-12-17 21:17:15

by Willy Tarreau

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On Sun, Dec 17, 2017 at 06:01:29PM -0300, Ezequiel Garcia wrote:
> On 17 December 2017 at 16:00, Willy Tarreau <[email protected]> wrote:
> > On Sun, Dec 17, 2017 at 07:07:46PM +0100, Boris Brezillon wrote:
> >> > > This would guarantee that devices with factory bad blocks,
> >> > > (and no BBT), would be OK with this patch.
> >> >
> >> > I see. I'm fine with trying provided I have reasonably good assurance
> >> > that I won't have to go through the kwboot pain again :-/
> >>
> >> There's an easy test you can do without scrubbing the NAND:
> >> 1/ comment the nand-on-flash-bbt property in your DT (this will trigger
> >> a full scan)
> >> 2/ from u-boot (before booting the kernel), erase a block that you know
> >> contains nothing important
> >> 3/ during the kernel scan, make sure this block is not reported as bad
> >
> > OK so I tried and never faced any error. Thus I also attempted to mark
> > a bad block in u-boot, it appeared in the bad blocks table, then I had
> > to scrub the whole table to get rid of it. Each time when I booted I
> > saw the message "Scanning device for bad blocks" but no error ever
> > happened. So I hope it's OK.
> >
>
> Nice. Thanks a lot Willy. I think this acks Boris' patch.

You're welcome, you and Boris fixed my problem very quickly allowing me
to continue to prepare my new router :-)

BTW, Boris please don't forget to mark your fix for -stable.

Thanks,
Willy

2017-12-17 21:26:14

by Boris Brezillon

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

+Miquel

On Sun, 17 Dec 2017 22:16:50 +0100
Willy Tarreau wrote:

> On Sun, Dec 17, 2017 at 06:01:29PM -0300, Ezequiel Garcia wrote:
> > On 17 December 2017 at 16:00, Willy Tarreau <[email protected]> wrote:
> > > On Sun, Dec 17, 2017 at 07:07:46PM +0100, Boris Brezillon wrote:
> > >> > > This would guarantee that devices with factory bad blocks,
> > >> > > (and no BBT), would be OK with this patch.
> > >> >
> > >> > I see. I'm fine with trying provided I have reasonably good assurance
> > >> > that I won't have to go through the kwboot pain again :-/
> > >>
> > >> There's an easy test you can do without scrubbing the NAND:
> > >> 1/ comment the nand-on-flash-bbt property in your DT (this will trigger
> > >> a full scan)
> > >> 2/ from u-boot (before booting the kernel), erase a block that you know
> > >> contains nothing important
> > >> 3/ during the kernel scan, make sure this block is not reported as bad
> > >
> > > OK so I tried and never faced any error. Thus I also attempted to mark
> > > a bad block in u-boot, it appeared in the bad blocks table, then I had
> > > to scrub the whole table to get rid of it. Each time when I booted I
> > > saw the message "Scanning device for bad blocks" but no error ever
> > > happened. So I hope it's OK.
> > >
> >
> > Nice. Thanks a lot Willy. I think this acks Boris' patch.
>
> You're welcome, you and Boris fixed my problem very quickly allowing me
> to continue to prepare my new router :-)
>
> BTW, Boris please don't forget to mark your fix for -stable.

Actually, if things go well it will only be applied to stable
releases (I really hope we'll be able to switch to Miquel's driver in
4.16). BTW, if you have some time, maybe you can test Miquel's [1]
branch and let us know if it still works properly.

Thanks,

Boris

[1]https://github.com/miquelraynal/linux/tree/marvell/nand-next/nfc

2017-12-17 21:46:23

by Miquel Raynal

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

Hello Willy,

On Sun, 17 Dec 2017 22:26:11 +0100
Boris Brezillon wrote:

> +Miquel
>
> On Sun, 17 Dec 2017 22:16:50 +0100
> Willy Tarreau wrote:
>
> > On Sun, Dec 17, 2017 at 06:01:29PM -0300, Ezequiel Garcia wrote:
> > > On 17 December 2017 at 16:00, Willy Tarreau wrote:
> > > > On Sun, Dec 17, 2017 at 07:07:46PM +0100, Boris Brezillon
> > > > wrote:
> > > >> > > This would guarantee that devices with factory bad blocks,
> > > >> > > (and no BBT), would be OK with this patch.
> > > >> >
> > > >> > I see. I'm fine with trying provided I have reasonably good
> > > >> > assurance that I won't have to go through the kwboot pain
> > > >> > again :-/
> > > >>
> > > >> There's an easy test you can do without scrubbing the NAND:
> > > >> 1/ comment out the nand-on-flash-bbt property in your DT (this
> > > >> will trigger a full scan)
> > > >> 2/ from u-boot (before booting the kernel), erase a block that
> > > >> you know contains nothing important
> > > >> 3/ during the kernel scan, make sure this block is not
> > > >> reported as bad
> > > >
> > > > OK so I tried and never faced any error. Thus I also attempted
> > > > to mark a bad block in u-boot, it appeared in the bad blocks
> > > > table, then I had to scrub the whole table to get rid of it.
> > > > Each time when I booted I saw the message "Scanning device for
> > > > bad blocks" but no error ever happened. So I hope it's OK.
> > > >
> > >
> > > Nice. Thanks a lot Willy. I think this acks Boris' patch.
> >
> > You're welcome, you and Boris fixed my problem very quickly
> > allowing me to continue to prepare my new router :-)
> >
> > BTW, Boris please don't forget to mark your fix for -stable.
>
> Actually, if things go well it will only be applied to stable
> releases (I really hope we'll be able to switch to Miquel's driver in
> 4.16). BTW, if you have some time, maybe you can test Miquel's [1]
> branch and let us know if it still works properly.

As Boris said, we would really welcome a test of this branch, because
you almost have the same setup as Sean in the thread "pxa3xx: wait time
out when scanning for bb" and I am running out of explanations for his
problem unless it is related to U-Boot. So if you could try booting
with and without the on-flash-bbt property and report whether it fails
or not it would be of great help!

Thanks,
Miquèl

>
> Thanks,
>
> Boris
>
> [1]https://github.com/miquelraynal/linux/tree/marvell/nand-next/nfc



--
Miquel Raynal, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

2017-12-18 06:37:48

by Willy Tarreau

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

Hi guys,

On Sun, Dec 17, 2017 at 10:46:17PM +0100, Miquel RAYNAL wrote:
> > Actually, if things go well it will only be applied to stable
> > releases (I really hope we'll be able to switch to Miquel's driver in
> > 4.16). BTW, if you have some time, maybe you can test Miquel's [1]
> > branch and let us know if it still works properly.
>
> As Boris said, we would really welcome a test of this branch, because
> you almost have the same setup as Sean in the thread "pxa3xx: wait time
> out when scanning for bb" and I am running out of explanation for his
> problem unless it is related to U-Boot. So if you could try booting
> with and without the on-flash-bbt property and report whether it fails
> or not it would be of great help!

Yes, I noticed your work mentioned in some of the threads I've read
during my troubleshooting session and considered giving it a try. I'll
probably do this next weekend.

Thanks!
Willy

2017-12-18 07:06:46

by Willy Tarreau

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On Mon, Dec 18, 2017 at 07:37:15AM +0100, Willy Tarreau wrote:
> > As Boris said, we would really welcome a test of this branch, because
> > you almost have the same setup as Sean in the thread "pxa3xx: wait time
> > out when scanning for bb" and I am running out of explanation for his
> > problem unless it is related to U-Boot. So if you could try booting
> > with and without the on-flash-bbt property and report whether it fails
> > or not it would be of great help!
>
> Yes, I noticed your work mentioned in some of the threads I've read
> during my troubleshooting session and considered giving it a try. I'll
> probably do this next weekend.

In the end I figured the test was quick enough and could help you, so I
built and booted it. I'm getting this at boot:

marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error while writing BBT block -110
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error -110 while marking block 1023 bad
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error while writing BBT block -110
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error -110 while marking block 1022 bad
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error while writing BBT block -110
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error -110 while marking block 1021 bad
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error while writing BBT block -110
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error -110 while marking block 1020 bad
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error while writing BBT block -110
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error -110 while marking block 1019 bad
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error while writing BBT block -110
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error -110 while marking block 1018 bad
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error while writing BBT block -110
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error -110 while marking block 1017 bad
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error while writing BBT block -110
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error -110 while marking block 1016 bad
No space left to write bad block table
nand_bbt: error while writing bad block table -28
marvell-nfc f10d0000.nand-controller: nand_scan_tail failed: -28
marvell-nfc: probe of f10d0000.nand-controller failed with error -28

Then no MTD appears in /proc/mtd.

Hoping this helps,
Willy
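
For readers following the log above, here is a minimal Python sketch that summarises such a boot log: it counts the controller timeouts, lists the blocks nand_bbt tried to mark bad, and decodes the error codes. The errno names are the standard Linux values (110 is ETIMEDOUT, 28 is ENOSPC). The helper name `summarize` and the shortened `SAMPLE` text are illustrative assumptions, not part of any kernel tool.

```python
import re

# Linux errno values seen in the log above, hard-coded so the sketch
# gives the same answer on any platform.
ERRNO_NAMES = {110: "ETIMEDOUT", 28: "ENOSPC"}

TIMEOUT_RE = re.compile(r"Timeout waiting for RB signal")
MARK_BAD_RE = re.compile(r"error -(\d+) while marking block (\d+) bad")
PROBE_FAIL_RE = re.compile(r"probe of \S+ failed with error -(\d+)")


def summarize(log_text):
    """Hypothetical helper: summarise a marvell-nfc / nand_bbt boot log."""
    timeouts = 0
    bad_blocks = []      # (block number, errno name) pairs
    probe_error = None   # errno name of the final probe failure, if any
    for line in log_text.splitlines():
        if TIMEOUT_RE.search(line):
            timeouts += 1
        m = MARK_BAD_RE.search(line)
        if m:
            code, block = int(m.group(1)), int(m.group(2))
            bad_blocks.append((block, ERRNO_NAMES.get(code, "unknown")))
        m = PROBE_FAIL_RE.search(line)
        if m:
            probe_error = ERRNO_NAMES.get(int(m.group(1)), "unknown")
    return {"timeouts": timeouts,
            "bad_blocks": bad_blocks,
            "probe_error": probe_error}


# Shortened excerpt of the log quoted in the message above.
SAMPLE = """\
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error while writing BBT block -110
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
nand_bbt: error -110 while marking block 1023 bad
marvell-nfc: probe of f10d0000.nand-controller failed with error -28
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Run against the full log, this would show every write attempt timing out (ETIMEDOUT) on blocks 1023 down to 1016, after which the probe gives up with ENOSPC because no block is left to hold the table.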

2017-12-18 10:22:13

by Miquel Raynal

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

Hello Willy,

On Mon, 18 Dec 2017 08:06:17 +0100
Willy Tarreau wrote:

> On Mon, Dec 18, 2017 at 07:37:15AM +0100, Willy Tarreau wrote:
> > > As Boris said, we would really welcome a test of this branch,
> > > because you almost have the same setup as Sean in the thread
> > > "pxa3xx: wait time out when scanning for bb" and I am running out
> > > of explanation for his problem unless it is related to U-Boot. So
> > > if you could try booting with and without the on-flash-bbt
> > > property and report whether it fails or not it would be of great
> > > help!
> >
> > Yes, I noticed your work mentioned in some of the threads I've read
> > during my troubleshooting session and considered giving it a try.
> > I'll probably do this next weekend.
>
> Finally I figured the test was quick enough and could help you, so I
> built and booted it, I'm getting this at boot :
>
> marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
> nand_bbt: error while writing BBT block -110
> marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
> marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
> nand_bbt: error -110 while marking block 1023 bad
> marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
> nand_bbt: error while writing BBT block -110
> marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal
> marvell-nfc f10d0000.nand-controller: Timeout waiting for RB signal

Thanks for testing.

I fixed two problems happening during read/write of 2kiB page NAND
chips, I am quite confident this would solve the issues you report
here. Could you please give it a try?

Same branch as before [1], just some more fixups! on it :)

Thank you,
Miquèl


[1] https://github.com/miquelraynal/linux/tree/marvell/nand-next/nfc

2017-12-18 21:52:46

by Willy Tarreau

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

Hi Miquel,

On Mon, Dec 18, 2017 at 11:22:08AM +0100, Miquel RAYNAL wrote:
> I fixed two problems happening during read/write of 2kiB page NAND
> chips, I am quite confident this would solve the issues you report
> here. Could you please give it a try?

So I just tested right now, and good news, it now works pretty fine
here, and my jffs2 properly mounted (without requiring Boris' fix
for oob) :

# dmesg|grep -i nand
[ 0.770395] nand: device found, Manufacturer ID: 0x01, Chip ID: 0xf1
[ 0.775474] nand: AMD/Spansion S34ML01G2
[ 0.778103] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[ 0.794080] 10 ofpart partitions found on MTD device pxa3xx_nand-0
[ 0.798975] Creating 10 MTD partitions on "pxa3xx_nand-0":
[ 3.245034] jffs2: version 2.2. (NAND) (SUMMARY) \xffffffc2\xffffffa9 2001-2006 Red Hat, Inc.

I was at first surprised to see this "pxa3xx_nand-0" still appearing until I
realized that it's how it's called in the device tree :-)

Cheers,
Willy
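
As a side note on the geometry reported in the dmesg output above, a small Python sketch of the arithmetic, using only the figures printed by the kernel (128 MiB chip, 128 KiB erase size, 2048-byte pages). The function name `geometry` is an illustrative assumption.

```python
MIB = 1024 * 1024
KIB = 1024


def geometry(chip_size, erase_size, page_size):
    """Return (erase block count, pages per block, index of last block)."""
    blocks = chip_size // erase_size
    pages_per_block = erase_size // page_size
    return blocks, pages_per_block, blocks - 1


if __name__ == "__main__":
    # Values taken from the "nand:" lines in the dmesg output.
    blocks, pages_per_block, last_block = geometry(128 * MIB, 128 * KIB, 2048)
    print(f"{blocks} blocks, {pages_per_block} pages per block, "
          f"last block index {last_block}")
```

The chip therefore has 1024 erase blocks, numbered 0 to 1023, so the blocks 1023 down to 1016 in the earlier failing log are the last eight blocks of the device. That is consistent with the bad block table being written near the end of the chip, although the thread itself does not state this.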

2017-12-19 00:13:22

by Miquel Raynal

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

Hi Willy,

On Mon, 18 Dec 2017 22:52:17 +0100
Willy Tarreau wrote:

> Hi Miquel,
>
> On Mon, Dec 18, 2017 at 11:22:08AM +0100, Miquel RAYNAL wrote:
> > I fixed two problems happening during read/write of 2kiB page NAND
> > chips, I am quite confident this would solve the issues you report
> > here. Could you please give it a try?
>
> So I just tested right now, and good news, it now works pretty fine
> here, and my jffs2 properly mounted (without requiring Boris' fix
> for oob)

Great! Thanks for testing.

Boris' fix wouldn't apply anyway as it was written for pxa3xx_nand.c
and here you are using marvell_nand.c and the code is really different.

>
> # dmesg|grep -i nand
> [ 0.770395] nand: device found, Manufacturer ID: 0x01, Chip ID: 0xf1
> [ 0.775474] nand: AMD/Spansion S34ML01G2
> [ 0.778103] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> [ 0.794080] 10 ofpart partitions found on MTD device pxa3xx_nand-0
> [ 0.798975] Creating 10 MTD partitions on "pxa3xx_nand-0":
> [ 3.245034] jffs2: version 2.2. (NAND) (SUMMARY) \xffffffc2\xffffffa9 2001-2006 Red Hat, Inc.
>
> I was at first surprised to see this "pxa3xx_nand-0" still appearing
> until I realized that it's how it's called in the device tree :-)

That is right, but if you create a DTS for your own board feel free to
change it, this is just a default name. Giving it some meaning (like
"main-storage" or "backup-storage") is how you could use this label.

Thanks again for your help, reviews or tested-by's are welcome for this
driver ;)

Miquèl

2017-12-19 05:35:27

by Willy Tarreau

Subject: Re: pxa3xx_nand times out in 4.14 with JFFS2

On Tue, Dec 19, 2017 at 01:13:15AM +0100, Miquel RAYNAL wrote:
> Boris' fix wouldn't apply anyway as it was written for pxa3xx_nand.c
> and here you are using marvell_nand.c and the code is really different.

yep I noticed already during the first problem, I couldn't even find
where the equivalent part of the code was located.

> Thanks again for your help, reviews or tested-by's are welcome for this
> driver ;)

do not hesitate to add my tested-by on your current patchset if that helps.

Willy