2024-03-06 14:45:45

by Alexander Dahl

Subject: mtd: nand: raw: Possible bug in nand_onfi_detect()?

Hello everyone,

I think I found a bug in nand_onfi_detect() which was introduced with
commit c27842e7e11f ("mtd: rawnand: onfi: Adapt the parameter page
read to constraint controllers") back in 2020.

Background on how I found this: I'm currently struggling to get raw
NAND flash access working on an at91 sam9x60 SoC with a Spansion
S34ML02G1 SLC raw NAND flash on a custom board. The setup is
comparable to the sam9x60 curiosity board, and the problem can be
reproduced on that board.

NAND flash on the sam9x60 curiosity board works fine with what is in
the mainline Linux kernel. However, after removing the line 'rb-gpios =
<&pioD 5 GPIO_ACTIVE_HIGH>;' from at91-sam9x60_curiosity.dts, all data
read from the flash appears to be zeros only. (I did not add that
line to the dts of my custom board at first; this is how I stumbled
over this.)

I have no explanation for that behaviour. It should work without R/B#
by reading the status register; maybe we can investigate that in depth
later. However, those all-zero data reads happen when reading the ONFI
param page as well as when reading data from the OOB/spare area later,
and I bet it's the same with regular data.

This read error reveals a bug in nand_onfi_detect(). After setting
up some things there's this for loop:

for (i = 0; i < ONFI_PARAM_PAGES; i++) {

For i = 0 nand_read_param_page_op() is called. In my case all zeros
are returned, and thus the calculated CRC does not match the all-zero
CRC that was read. So the usual break after successfully reading the
first page is skipped, and nand_change_read_column_op() is called to
read the second page. I think that one always fails on this line:

if (offset_in_page + len > mtd->writesize + mtd->oobsize) {

Those variables contain the following values:

offset_in_page: 256
len: 256
mtd->writesize: 0
mtd->oobsize: 0

The condition is true and nand_change_read_column_op() returns with
-EINVAL, because mtd->writesize and mtd->oobsize are not set yet in
that code path. Those are probably initialized later, maybe with
parameters read from that ONFI param page?
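
The arithmetic above can be reproduced outside the kernel. The
following is a minimal sketch, not kernel code: struct geometry and
check_column_range() are hypothetical stand-ins for the two struct
mtd_info fields and for the quoted condition.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the two struct mtd_info fields used by the
 * check; this is not the kernel's definition.
 */
struct geometry {
	unsigned int writesize;	/* main area size in bytes, 0 until detected */
	unsigned int oobsize;	/* spare area size in bytes, 0 until detected */
};

/* Mirrors the quoted condition: reject a read that ends past the page. */
static int check_column_range(const struct geometry *g,
			      unsigned int offset_in_page, unsigned int len)
{
	assert(g != NULL);

	if (offset_in_page + len > g->writesize + g->oobsize)
		return -EINVAL;

	return 0;
}
```

With both sizes still 0, a call with offset_in_page = 256 and len = 256
compares 512 > 0 and returns -EINVAL. With an assumed geometry of
2048 + 64 bytes the same call compares 512 > 2112 and passes.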

Returning with error from nand_change_read_column_op() leads to
jumping out of nand_onfi_detect() early, and no ONFI param page is
evaluated at all, although the second or third page could be intact.
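
The control flow described above can be modelled roughly as follows.
This is a simplified sketch under assumptions: struct fake_chip,
read_copy() and detect() are invented for illustration, three copies
of the param page are assumed, and the real loop in
nand_onfi_detect() does more than this.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define PARAM_PAGE_COPIES 3	/* assumed number of redundant copies */

/* Hypothetical model of the chip state during identification. */
struct fake_chip {
	bool geometry_known;		/* false while writesize is still 0 */
	bool crc_ok[PARAM_PAGE_COPIES];	/* would copy i pass its CRC check? */
};

/*
 * Copy 0 is fetched with the dedicated param page read. Later copies
 * are fetched with a column change, which is modelled here as failing
 * with -EINVAL while the geometry is unknown.
 */
static int read_copy(const struct fake_chip *chip, int i)
{
	if (i == 0)
		return 0;

	return chip->geometry_known ? 0 : -EINVAL;
}

/* Returns the index of the first accepted copy, or -1 if none. */
static int detect(const struct fake_chip *chip)
{
	int i;

	assert(chip != NULL);

	for (i = 0; i < PARAM_PAGE_COPIES; i++) {
		if (read_copy(chip, i))
			return -1;	/* early exit, later copies untried */
		if (chip->crc_ok[i])
			return i;
	}

	return -1;
}
```

In this model a chip with a bad first copy and intact second and third
copies yields -1 as long as geometry_known is false, which matches the
behaviour reported above.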

I guess this would also fail with any other reason for the CRCs not
matching in the first page, but I have no faulty NAND flash chip to
confirm that.
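
Why an all-zero param page cannot pass the CRC check can be shown with
a small standalone sketch. It assumes the ONFI CRC-16 parameters
(polynomial 0x8005, initial value 0x4F4E, computed over the first 254
bytes, stored little-endian in the last two bytes); the function names
are made up for this example and are not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC_SEED 0x4F4E	/* assumed ONFI initial value */
#define CRC_POLY 0x8005	/* assumed ONFI polynomial */

/* Bitwise CRC-16, most significant bit first, no final XOR. */
static uint16_t onfi_crc16_sketch(uint16_t crc, const uint8_t *p, size_t len)
{
	int i;

	assert(p != NULL);

	while (len--) {
		crc ^= (uint16_t)(*p++ << 8);
		for (i = 0; i < 8; i++)
			crc = (uint16_t)((crc << 1) ^
					 ((crc & 0x8000) ? CRC_POLY : 0));
	}

	return crc;
}

/* Nonzero when the stored CRC matches the first 254 bytes of the copy. */
static int param_page_crc_ok(const uint8_t page[256])
{
	/* The stored CRC occupies the last two bytes, low byte first. */
	uint16_t stored = (uint16_t)(page[254] | (page[255] << 8));

	return onfi_crc16_sketch(CRC_SEED, page, 254) == stored;
}
```

Because the seed is nonzero and the polynomial has its lowest bit set,
feeding only zero bytes never drives the register to zero. A page of
256 zero bytes therefore carries a stored CRC of 0x0000 that does not
match the computed value.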

Greets
Alex



2024-03-06 15:53:04

by Miquel Raynal

Subject: Re: mtd: nand: raw: Possible bug in nand_onfi_detect()?

Hi Alexander,

[email protected] wrote on Wed, 6 Mar 2024 15:36:04 +0100:

> Hello everyone,
>
> I think I found a bug in nand_onfi_detect() which was introduced with
> commit c27842e7e11f ("mtd: rawnand: onfi: Adapt the parameter page
> read to constraint controllers") back in 2020.

Interesting. I don't think this patch broke anything, as
constrained controllers would just not support the read_data_op() call
anyway.

That being said, I don't see why the atmel controller would
refuse this operation, as it is supposed to support all
operations without limitation. This is one of the three issues
you have that probably need fixing.

> Background on how I found this: I'm currently struggling getting raw
> nand flash access to fly with an at91 sam9x60 SoC and a S34ML02G1
> Spansion SLC raw NAND flash on a custom board. The setup is
> comparable to the sam9x60 curiosity board and can be reproduced with
> that one.
>
> NAND flash on sam9x60 curiosity board works fine with what is in
> mainline Linux kernel. However after removing the line 'rb-gpios =
> <&pioD 5 GPIO_ACTIVE_HIGH>;' from at91-sam9x60_curiosity.dts all data
> read from the flash appears to be zeros only. (I did not add that
> line to the dts of my custom board first, this is how I stumbled over
> this.)
>
> I have no explanation for that behaviour, it should work without R/B#
> by reading the status register, maybe we investigate that
> in depth later.

I don't see why at first glance. The default is "no RB" if no property
is given in the DT, so it should work. Tracing the wait ready function
calls might help.

> However those all zeros data reads happens when
> reading the ONFI param page as well es data read from OOB/spare area
> later and I bet it's the same with usual data.

Reading data without observing tWB + tR may lead to this.

> This read error reveals a bug in nand_onfi_detect(). After setting
> up some things there's this for loop:
>
> for (i = 0; i < ONFI_PARAM_PAGES; i++) {
>
> For i = 0 nand_read_param_page_op() is called and in my case all zeros
> are returned and thus the CRC calculated does not match the all zeros
> CRC read. So the usual break on successful reading the first page is
> skipped and for reading the second page nand_change_read_column_op()
> is called. I think that one always fails on this line:
>
> if (offset_in_page + len > mtd->writesize + mtd->oobsize) {
>
> Those variables contain the following values:
>
> offset_in_page: 256
> len: 256
> mtd->writesize: 0
> mtd->oobsize: 0

Indeed. We probably need some kind of extra check that does not perform
the if clause above if !mtd->writesize.
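
One possible shape of such a check, sketched as a standalone model
rather than a kernel patch: struct geometry and
check_column_range_relaxed() are hypothetical names, and this is not
necessarily the form of any fix that was merged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant struct mtd_info fields. */
struct geometry {
	unsigned int writesize;	/* 0 while the chip is being identified */
	unsigned int oobsize;
};

/*
 * Relaxed variant: the bounds comparison only runs once the page size
 * is known. During identification there is nothing to compare against.
 */
static int check_column_range_relaxed(const struct geometry *g,
				      unsigned int offset_in_page,
				      unsigned int len)
{
	assert(g != NULL);

	if (g->writesize &&
	    offset_in_page + len > g->writesize + g->oobsize)
		return -EINVAL;

	return 0;
}
```

With writesize still 0, reading 256 bytes at offset 256 is now allowed,
while out-of-range requests are still rejected once the geometry has
been filled in. The trade-off is that no upper bound at all is enforced
during identification in this sketch.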

> The condition is true and nand_change_read_column_op() returns with
> -EINVAL, because mtd->writesize and mtd->oobsize are not set yet in
> that code path. Those are probably initialized later, maybe with
> parameters read from that ONFI param page?
>
> Returning with error from nand_change_read_column_op() leads to
> jumping out of nand_onfi_detect() early, and no ONFI param page is
> evaluated at all, although the second or third page could be intact.
>
> I guess this would also fail with any other reason for not matching
> CRCs in the first page, but I have not faulty NAND flash chip to
> confirm that.

Thanks for the whole report, it is interesting and should lead to fixes:
- why does the controller refuse the datain op?
- why is nand_soft_waitrdy() not enough?
- changing the condition in nand_change_read_column_op()

Can you take care of these?

Thanks,
Miquèl

2024-03-07 16:03:51

by Alexander Dahl

Subject: Re: mtd: nand: raw: Possible bug in nand_onfi_detect()?

Hello Miquel,

thanks for looking into this, see my remarks below.

Am Wed, Mar 06, 2024 at 04:48:31PM +0100 schrieb Miquel Raynal:
> Hi Alexander,
>
> [email protected] wrote on Wed, 6 Mar 2024 15:36:04 +0100:
>
> > Hello everyone,
> >
> > I think I found a bug in nand_onfi_detect() which was introduced with
> > commit c27842e7e11f ("mtd: rawnand: onfi: Adapt the parameter page
> > read to constraint controllers") back in 2020.
>
> Interesting. I don't think this patch did broke anything, as
> constrained controllers would just not support the read_data_op() call
> anyway.
>
> That being said, I don't see why the atmel controller would
> refuse this operation, as it is supposed to support all
> operations without limitation. This is one of the three issues
> you have, that probably needs fixing.

I found a flaw in my debug messages that hid the underlying issue
here. I'm afraid this is another bug, introduced by you with commit
9f820fc0651c ("mtd: rawnand: Check the data only read pattern only
once"). See this line in rawnand_check_data_only_read_support():

if (!nand_read_data_op(chip, NULL, SZ_512, true, true))

This leads to nand_read_data_op() returning -EINVAL, because it checks
whether its second argument is non-NULL.

I guess not only the atmel NAND controller is affected here, but _all_
NAND controllers? The flag can never be set, and so use_datain is
always false here?
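
The interaction can be sketched in isolation. The functions below are
hypothetical models of the argument check only, not the kernel
functions themselves; the additional !len test and the relaxed variant
are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the early check as described above: a NULL buffer is refused. */
static int read_data_args_strict(const void *buf, unsigned int len,
				 bool check_only)
{
	(void)check_only;	/* not consulted, a probe is treated as a read */

	if (!len || !buf)
		return -EINVAL;

	return 0;
}

/* Possible relaxation: a probe moves no data, so it needs no buffer. */
static int read_data_args_relaxed(const void *buf, unsigned int len,
				  bool check_only)
{
	if (!len || (!check_only && !buf))
		return -EINVAL;

	return 0;
}

/* Models the support probe: NULL buffer, 512 bytes, check_only set. */
static bool supports_data_only_read(int (*check)(const void *,
						 unsigned int, bool))
{
	assert(check != NULL);

	return !check(NULL, 512, true);
}
```

With the strict variant the probe always fails, so the capability flag
can never be set on any controller. With the relaxed variant the probe
can succeed, while a real read with a NULL buffer is still rejected.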

> > Background on how I found this: I'm currently struggling getting raw
> > nand flash access to fly with an at91 sam9x60 SoC and a S34ML02G1
> > Spansion SLC raw NAND flash on a custom board. The setup is
> > comparable to the sam9x60 curiosity board and can be reproduced with
> > that one.
> >
> > NAND flash on sam9x60 curiosity board works fine with what is in
> > mainline Linux kernel. However after removing the line 'rb-gpios =
> > <&pioD 5 GPIO_ACTIVE_HIGH>;' from at91-sam9x60_curiosity.dts all data
> > read from the flash appears to be zeros only. (I did not add that
> > line to the dts of my custom board first, this is how I stumbled over
> > this.)
> >
> > I have no explanation for that behaviour, it should work without R/B#
> > by reading the status register, maybe we investigate that
> > in depth later.
>
> I don't see why at a first look. The default is "no RB" if no property
> is given in the DT so it should work.

Correct, nand_soft_waitrdy() is used in that case.

> Tracing the wait ready function calls might help.

Did that already. On each call the status register read returns E0h
and nand_soft_waitrdy() returns without error, because the
NAND_STATUS_READY flag is set. Everything looks fine at that point,
although the data read afterwards is not.
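
For reference, a small sketch that decodes the E0h status value. The
bit masks are written from memory to match the NAND_STATUS_* values in
the kernel's rawnand.h and should be checked against that header; the
struct and function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed status register bit masks. */
#define STATUS_FAIL		0x01
#define STATUS_TRUE_READY	0x20
#define STATUS_READY		0x40
#define STATUS_WP		0x80	/* set means not write protected */

struct status_bits {
	bool fail;
	bool true_ready;
	bool ready;
	bool not_write_protected;
};

/* Splits a raw status byte into the individual flags. */
static void decode_status(uint8_t status, struct status_bits *out)
{
	assert(out != NULL);

	out->fail = status & STATUS_FAIL;
	out->true_ready = status & STATUS_TRUE_READY;
	out->ready = status & STATUS_READY;
	out->not_write_protected = status & STATUS_WP;
}
```

For 0xE0 this reports ready, true ready and not write protected, with
the fail bit clear. That is consistent with a status poll concluding
the chip is ready, so the status value alone does not explain the
all-zero reads.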

> > However those all zeros data reads happens when
> > reading the ONFI param page as well es data read from OOB/spare area
> > later and I bet it's the same with usual data.
>
> Reading data without observing tWB + tR may lead to this.

I already suspected some timing issue. Deeper investigation will have
to wait until we have soldered some wires to the chip and connected a
logic analyzer, however. At least that's the plan, but it will have to
wait a few days until I have finished some other tasks.

> > This read error reveals a bug in nand_onfi_detect(). After setting
> > up some things there's this for loop:
> >
> > for (i = 0; i < ONFI_PARAM_PAGES; i++) {
> >
> > For i = 0 nand_read_param_page_op() is called and in my case all zeros
> > are returned and thus the CRC calculated does not match the all zeros
> > CRC read. So the usual break on successful reading the first page is
> > skipped and for reading the second page nand_change_read_column_op()
> > is called. I think that one always fails on this line:
> >
> > if (offset_in_page + len > mtd->writesize + mtd->oobsize) {
> >
> > Those variables contain the following values:
> >
> > offset_in_page: 256
> > len: 256
> > mtd->writesize: 0
> > mtd->oobsize: 0
>
> Indeed. We probably need some kind of extra check that does not perform
> the if clause above if !mtd->writesize.
>
> > The condition is true and nand_change_read_column_op() returns with
> > -EINVAL, because mtd->writesize and mtd->oobsize are not set yet in
> > that code path. Those are probably initialized later, maybe with
> > parameters read from that ONFI param page?
> >
> > Returning with error from nand_change_read_column_op() leads to
> > jumping out of nand_onfi_detect() early, and no ONFI param page is
> > evaluated at all, although the second or third page could be intact.
> >
> > I guess this would also fail with any other reason for not matching
> > CRCs in the first page, but I have not faulty NAND flash chip to
> > confirm that.
>
> Thanks for the whole report, it is interesting and should lead to fixes:
> - why does the controller refuses the datain op?

See above.

> - why nand_soft_waitrdy is not enough?

I don't know. That's one reason I asked here.

> - changing the condition in nand_change_read_column_op()
>
> Can you take care of these?

The last one probably, after reading the code in depth again; I'm
unsure about the other two.

Greets
Alex


2024-03-07 17:19:57

by Miquel Raynal

Subject: Re: mtd: nand: raw: Possible bug in nand_onfi_detect()?

Hi Alexander,

[email protected] wrote on Thu, 7 Mar 2024 17:02:16 +0100:

> Hello Miquel,
>
> thanks for looking into this, see my remarks below.
>
> Am Wed, Mar 06, 2024 at 04:48:31PM +0100 schrieb Miquel Raynal:
> > Hi Alexander,
> >
> > [email protected] wrote on Wed, 6 Mar 2024 15:36:04 +0100:
> >
> > > Hello everyone,
> > >
> > > I think I found a bug in nand_onfi_detect() which was introduced with
> > > commit c27842e7e11f ("mtd: rawnand: onfi: Adapt the parameter page
> > > read to constraint controllers") back in 2020.
> >
> > Interesting. I don't think this patch did broke anything, as
> > constrained controllers would just not support the read_data_op() call
> > anyway.
> >
> > That being said, I don't see why the atmel controller would
> > refuse this operation, as it is supposed to support all
> > operations without limitation. This is one of the three issues
> > you have, that probably needs fixing.
>
> I found a flaw in my debug messages hiding the underlying issue for
> this. I'm afraid this is another bug introduced by you with commit
> 9f820fc0651c ("mtd: rawnand: Check the data only read pattern only
> once"). See this line in rawnand_check_data_only_read_support():
>
> if (!nand_read_data_op(chip, NULL, SZ_512, true, true))
>
> This leads to nand_read_data_op() returning -EINVAL, because it checks
> if its second argument is non-NULL.

Ah, finally. Yes, this makes more sense. I was already notified in
private of something there. I think the contributor told me he would
get back to me on it and did not, and I am unable to find the original
mail or the thread again in my mailer. Anyhow, this rings a bell, and
I am now pretty convinced about the bug you raised. Can you please
propose a fix?

You can propose two fixes actually, one for the NULL value and another
one for mtd->writesize being unset at this stage.

IIRC the original reporter told me about bitflips in his parameter page
(which cannot be generated on demand, and this is rather uncommon).

> I guess not only the atmel nand controller is affected here, but _all_
> nand controllers? The flag can never be set, and so use_datain is
> false here?
>
> > > Background on how I found this: I'm currently struggling getting raw
> > > nand flash access to fly with an at91 sam9x60 SoC and a S34ML02G1
> > > Spansion SLC raw NAND flash on a custom board. The setup is
> > > comparable to the sam9x60 curiosity board and can be reproduced with
> > > that one.
> > >
> > > NAND flash on sam9x60 curiosity board works fine with what is in
> > > mainline Linux kernel. However after removing the line 'rb-gpios =
> > > <&pioD 5 GPIO_ACTIVE_HIGH>;' from at91-sam9x60_curiosity.dts all data
> > > read from the flash appears to be zeros only. (I did not add that
> > > line to the dts of my custom board first, this is how I stumbled over
> > > this.)
> > >
> > > I have no explanation for that behaviour, it should work without R/B#
> > > by reading the status register, maybe we investigate that
> > > in depth later.
> >
> > I don't see why at a first look. The default is "no RB" if no property
> > is given in the DT so it should work.
>
> Correct, nand_soft_waitrdy() is used in that case.
>
> > Tracing the wait ready function calls might help.
>
> Did that already. On each call here the status register read contains
> E0h and nand_soft_waitrdy() returns without error, because the
> NAND_STATUS_READY flag is set. It just looks fine, although it is
> not afterwards.

Strange. Just to be sure, how are you testing? Please make a single
page read (minimal length with mtd_debug or any length with nanddump) to
be sure you're not affected by the continuous read bugs (also mine).

> > > However those all zeros data reads happens when
> > > reading the ONFI param page as well es data read from OOB/spare area
> > > later and I bet it's the same with usual data.
> >
> > Reading data without observing tWB + tR may lead to this.
>
> I already suspected some timing issue. Deeper investigation will have
> to wait until we soldered some wires to the chip and connect a logic
> analyzer however. At least that's the plan, but this will have to
> wait some days until after I finished some other tasks.

Sure.

>
> > > This read error reveals a bug in nand_onfi_detect(). After setting
> > > up some things there's this for loop:
> > >
> > > for (i = 0; i < ONFI_PARAM_PAGES; i++) {
> > >
> > > For i = 0 nand_read_param_page_op() is called and in my case all zeros
> > > are returned and thus the CRC calculated does not match the all zeros
> > > CRC read. So the usual break on successful reading the first page is
> > > skipped and for reading the second page nand_change_read_column_op()
> > > is called. I think that one always fails on this line:
> > >
> > > if (offset_in_page + len > mtd->writesize + mtd->oobsize) {
> > >
> > > Those variables contain the following values:
> > >
> > > offset_in_page: 256
> > > len: 256
> > > mtd->writesize: 0
> > > mtd->oobsize: 0
> >
> > Indeed. We probably need some kind of extra check that does not perform
> > the if clause above if !mtd->writesize.
> >
> > > The condition is true and nand_change_read_column_op() returns with
> > > -EINVAL, because mtd->writesize and mtd->oobsize are not set yet in
> > > that code path. Those are probably initialized later, maybe with
> > > parameters read from that ONFI param page?
> > >
> > > Returning with error from nand_change_read_column_op() leads to
> > > jumping out of nand_onfi_detect() early, and no ONFI param page is
> > > evaluated at all, although the second or third page could be intact.
> > >
> > > I guess this would also fail with any other reason for not matching
> > > CRCs in the first page, but I have not faulty NAND flash chip to
> > > confirm that.
> >
> > Thanks for the whole report, it is interesting and should lead to fixes:
> > - why does the controller refuses the datain op?
>
> See above.
>
> > - why nand_soft_waitrdy is not enough?
>
> I don't know. That's one reason I asked here.
>
> > - changing the condition in nand_change_read_column_op()
> >
> > Can you take care of these?
>
> The last one probably after in depth reading of the code again, unsure
> for the other two.

The first one is "easy" now, I guess?

For the middle one we need more investigation, of course.

Thanks for the debugging and sorry for the troubles.

Miquèl

2024-03-25 13:52:40

by Miquel Raynal

Subject: Re: mtd: nand: raw: Possible bug in nand_onfi_detect()?

Hello Alexander,

> > > > The condition is true and nand_change_read_column_op() returns with
> > > > -EINVAL, because mtd->writesize and mtd->oobsize are not set yet in
> > > > that code path. Those are probably initialized later, maybe with
> > > > parameters read from that ONFI param page?
> > > >
> > > > Returning with error from nand_change_read_column_op() leads to
> > > > jumping out of nand_onfi_detect() early, and no ONFI param page is
> > > > evaluated at all, although the second or third page could be intact.
> > > >
> > > > I guess this would also fail with any other reason for not matching
> > > > CRCs in the first page, but I have not faulty NAND flash chip to
> > > > confirm that.
> > >
> > > Thanks for the whole report, it is interesting and should lead to fixes:
> > > - why does the controller refuses the datain op?
> >
> > See above.
> >
> > > - why nand_soft_waitrdy is not enough?
> >
> > I don't know. That's one reason I asked here.
> >
> > > - changing the condition in nand_change_read_column_op()
> > >
> > > Can you take care of these?

Now would be a perfect time to send these fixes. Could you work on them?

Thanks!
Miquèl

2024-03-25 14:17:14

by Alexander Dahl

Subject: Re: mtd: nand: raw: Possible bug in nand_onfi_detect()?

Hello Miquèl,

Am Mon, Mar 25, 2024 at 10:09:16AM +0100 schrieb Miquel Raynal:
> Hello Alexander,
>
> > > > > The condition is true and nand_change_read_column_op() returns with
> > > > > -EINVAL, because mtd->writesize and mtd->oobsize are not set yet in
> > > > > that code path. Those are probably initialized later, maybe with
> > > > > parameters read from that ONFI param page?
> > > > >
> > > > > Returning with error from nand_change_read_column_op() leads to
> > > > > jumping out of nand_onfi_detect() early, and no ONFI param page is
> > > > > evaluated at all, although the second or third page could be intact.
> > > > >
> > > > > I guess this would also fail with any other reason for not matching
> > > > > CRCs in the first page, but I have not faulty NAND flash chip to
> > > > > confirm that.
> > > >
> > > > Thanks for the whole report, it is interesting and should lead to fixes:
> > > > - why does the controller refuses the datain op?
> > >
> > > See above.
> > >
> > > > - why nand_soft_waitrdy is not enough?
> > >
> > > I don't know. That's one reason I asked here.
> > >
> > > > - changing the condition in nand_change_read_column_op()
> > > >
> > > > Can you take care of these?
>
> Now would be a perfect time to send these fixes. Could you work on them?

I'm sorry, no, not yet. I have some more important work to do, which
will take another one or two weeks before I can return to this
problem. It will have to wait, at least from my side.

Greets
Alex

>
> Thanks!
> Miquèl

2024-05-07 16:18:17

by Miquel Raynal

Subject: Re: mtd: nand: raw: Possible bug in nand_onfi_detect()?

Hello,

[email protected] wrote on Wed, 6 Mar 2024 15:36:04 +0100:

> Hello everyone,
>
> I think I found a bug in nand_onfi_detect() which was introduced with
> commit c27842e7e11f ("mtd: rawnand: onfi: Adapt the parameter page
> read to constraint controllers") back in 2020.
>
> Background on how I found this: I'm currently struggling getting raw
> nand flash access to fly with an at91 sam9x60 SoC and a S34ML02G1
> Spansion SLC raw NAND flash on a custom board. The setup is
> comparable to the sam9x60 curiosity board and can be reproduced with
> that one.
>
> NAND flash on sam9x60 curiosity board works fine with what is in
> mainline Linux kernel. However after removing the line 'rb-gpios =
> <&pioD 5 GPIO_ACTIVE_HIGH>;' from at91-sam9x60_curiosity.dts all data
> read from the flash appears to be zeros only. (I did not add that
> line to the dts of my custom board first, this is how I stumbled over
> this.)
>
> I have no explanation for that behaviour, it should work without R/B#
> by reading the status register, maybe we investigate that
> in depth later. However those all zeros data reads happens when
> reading the ONFI param page as well es data read from OOB/spare area
> later and I bet it's the same with usual data.
>
> This read error reveals a bug in nand_onfi_detect(). After setting
> up some things there's this for loop:
>
> for (i = 0; i < ONFI_PARAM_PAGES; i++) {
>
> For i = 0 nand_read_param_page_op() is called and in my case all zeros
> are returned and thus the CRC calculated does not match the all zeros
> CRC read. So the usual break on successful reading the first page is
> skipped and for reading the second page nand_change_read_column_op()
> is called. I think that one always fails on this line:
>
> if (offset_in_page + len > mtd->writesize + mtd->oobsize) {
>
> Those variables contain the following values:
>
> offset_in_page: 256
> len: 256
> mtd->writesize: 0
> mtd->oobsize: 0
>
> The condition is true and nand_change_read_column_op() returns with
> -EINVAL, because mtd->writesize and mtd->oobsize are not set yet in
> that code path. Those are probably initialized later, maybe with
> parameters read from that ONFI param page?
>
> Returning with error from nand_change_read_column_op() leads to
> jumping out of nand_onfi_detect() early, and no ONFI param page is
> evaluated at all, although the second or third page could be intact.
>
> I guess this would also fail with any other reason for not matching
> CRCs in the first page, but I have not faulty NAND flash chip to
> confirm that.

Sorry for the time it took on my side.

Here is a link to another similar report:
https://lore.kernel.org/linux-mtd/DM6PR05MB4506554457CF95191A670BDEF7062@DM6PR05MB4506.namprd05.prod.outlook.com/
And here is a link to the series attempting to fix this:
https://lore.kernel.org/linux-mtd/[email protected]/T/#t

Thanks,
Miquèl