LinuxLists.cc - [BUG] Nand support broken with v2.6.36-rc1

2010-08-17 11:36:54

Subject: [BUG] Nand support broken with v2.6.36-rc1

Hello again,

Answering my own question here. Yes indeed with the new code a driver
change seems to be needed. The badblock pattern used with this nand is no longer
supported with the stock kernel code. I added this to the nand driver
itself.

static uint8_t scan_ff_pattern[] = { 0xff, 0xff };
static struct nand_bbt_descr rbppc_nand_smallpage = {
.options = NAND_BBT_SCAN2NDPAGE,
.offs = NAND_SMALL_BADBLOCK_POS,
.len = 1,
.pattern = scan_ff_pattern
};

and the driver is working again. But shouldn't this be supported by the stock level code as well?

Kind regards,
Michael Guntsche

2010-08-17 17:01:25

by Brian Norris

[permalink] [raw]

Subject: Re: [BUG] Nand support broken with v2.6.36-rc1

Hello,

On 08/17/2010 01:52 AM, Michael Guntsche wrote:
> The only thing that might be special with the nand driver that is being
> used is that a different oob layout is being used.
>
> static struct nand_ecclayout rbppc_nand_oob_16 = {
> .eccbytes = 6,
> .eccpos = { 8, 9, 10, 13, 14, 15 },
> .oobavail = 9,
> .oobfree = { { 0, 4 }, { 6, 2 }, { 11, 2 }, { 4, 1 } }
> };

On 08/17/2010 04:36 AM, Michael Guntsche wrote:
> I added this to the nand driver itself.
>
> static uint8_t scan_ff_pattern[] = { 0xff, 0xff };
> static struct nand_bbt_descr rbppc_nand_smallpage = {
> .options = NAND_BBT_SCAN2NDPAGE,
> .offs = NAND_SMALL_BADBLOCK_POS,
> .len = 1,
> .pattern = scan_ff_pattern
> };
>
> and the driver is working again. But shouldn't this be supported by the stock level code as well?

Why yes, it should! Somebody (probably me) goofed. Your nand_ecclayout
is conflicting with the kernel's choice of bad block position. Recent
changes must have affected which position is chosen automatically by the
kernel.

One of the following two cases is likely the problem:
(1) Your chip is supposed to use offset 0, not 5, for the BBM (i.e.,
NAND_LARGE_BADBLOCK_POS, not NAND_SMALL_BADBLOCK_POS), and so your
ecclayout should not be leaving byte 0 in the "oobfree" array (a design
flaw since you first began using this chip)
(2) I made the commit that you mentioned
(c7b28e25cb9beb943aead770ff14551b55fa8c79) too restrictive in allowing
chips to use NAND_SMALL_BADBLOCK_POS.

Option 2 is likely the case, and in fact, I realized a stupid mistake I
made in refactoring the detection here.

I have been studying data from hundreds of flash chips to find where the
factory-determined markers should be stored. Unfortunately, I can't
cover all of them, and so your Hynix chip is likely one that was
overlooked. Could you send the full NAND ID string (8 bytes, not just
the manufacturer and chip ID), an exact part number for the flash, and a
datasheet? Any one of those could help (the datasheet being the most
important), but whatever you can provide is helpful. More data on your
chip would allow me to determine the problem for sure; I will send a
patch ASAP once I get your information.

Sorry for the trouble!

On another note, it may be intelligent to have the kernel-specific
systems check for such a conflict between bad-block markers and ECC
layout. If a position needed by the bad-block marker is listed in
"oobfree" or "eccpos" then we have a problem. Sound like a good idea
anybody? If so, what would be the best approach:
* print an error and quit detection
* try to modify the ecclayout, bbm info or both
* try to modify, and fall-back to error message and quit if necessary

Thanks,
Brian

2010-08-17 17:47:56

by Michael Guntsche

[permalink] [raw]

Subject: Re: [BUG] Nand support broken with v2.6.36-rc1

On 17 Aug 10 10:00, Brian Norris wrote:
> One of the following two cases is likely the problem:
> (1) Your chip is supposed to use offset 0, not 5, for the BBM (i.e.,
> NAND_LARGE_BADBLOCK_POS, not NAND_SMALL_BADBLOCK_POS), and so your
> ecclayout should not be leaving byte 0 in the "oobfree" array (a
> design flaw since you first began using this chip)

First, I am just an end user so I have no access to the datasheets etc. I
just got the code from the board manufactrurer (2.6.27) and forward
port it to recent kernels.

The reason I am using a specific layout is because the bootloader on
this board expects it this way. It formats it this way in the beginning
and I cannot change that.

> Could you send the full NAND ID string (8 bytes, not
If you can tell me where I can find that I'll be more than happy to send
it to you. But as I said I think the reason for this is this special
bootloader.

Please tell me, if you need more informations.

Kind regards,
Michael

2010-08-17 18:50:20

by Brian Norris

[permalink] [raw]

Subject: Re: [BUG] Nand support broken with v2.6.36-rc1

Hi,

On 08/17/2010 10:47 AM, Michael Guntsche wrote:
> First, I am just an end user so I have no access to the datasheets etc. I
> just got the code from the board manufactrurer (2.6.27) and forward
> port it to recent kernels.

I see. No problem. We'll work with what you can do:

If you can simply find the NAND chip part number (it would be printed on
the chip itself), that will be helpful.

Also, there are a few things you can do under a working kernel (e.g.,
2.6.35?).

First, have you ever used any of the mtdutils? In particular, running
the command "mtdinfo -a" and sending the output is helpful if you have
the utility installed on your board.

Second, since you are doing the forward-porting, I assume you can do a
little bit of coding/patching. To print the whole ID string, you can add
a simple "printk" line to the code in "drivers/mtd/nand/nand_base.c".
For example, on the 2.6.35 kernel, you can just apply the patch below.
Then, on boot, the ID string will print (or at least show up in "dmesg"
or "syslog"). That info can help a little.

> The reason I am using a specific layout is because the bootloader on
> this board expects it this way. It formats it this way in the beginning
> and I cannot change that.

Well, if the new commit that broke your board is getting the block
marker *correct* according to the factory specifications, then this
particular problem is your setup's problem; perhaps there could be a
workaround, like I mentioned about checking for these kind of conflicts.
However, I'm still hypothesizing that I simply got the detection wrong,
and so my fix will solve your problem.

Thanks,
Brian

---
drivers/mtd/nand/nand_base.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index 4a7b864..d2d1fab 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -2809,8 +2809,10 @@ static struct nand_flash_dev
*nand_get_flash_type(struct mtd_info *mtd,

/* Read entire ID string */

- for (i = 0; i < 8; i++)
+ for (i = 0; i < 8; i++) {
id_data[i] = chip->read_byte(mtd);
+ printk(KERN_INFO "ID byte %i: %#x\n", i, id_data[i]);
+ }

if (id_data[0] != *maf_id || id_data[1] != dev_id) {
printk(KERN_INFO "%s: second ID read did not match "
--
1.7.0.4

2010-08-17 20:06:16

by Michael Guntsche

[permalink] [raw]

Subject: Re: [BUG] Nand support broken with v2.6.36-rc1

On 17 Aug 10 11:49, Brian Norris wrote:
> First, have you ever used any of the mtdutils? In particular,
> running the command "mtdinfo -a" and sending the output is helpful
> if you have the utility installed on your board.
hmm mtdinfo tries to open /sys/class/mtd/mtd0/dev which das not exist
the device is working ok as block device on the other hand so let's try
the next thing.

Output booting with a patched .36-rc1

[ 0.279217] rbppc_nand_probe: MikroTik RouterBOARD 600 series NAND driver, version 0.0.2
[ 0.287535] ID byte 0: 0xad
[ 0.290373] ID byte 1: 0x76
[ 0.293185] ID byte 2: 0xad
[ 0.295985] ID byte 3: 0x76
[ 0.298798] ID byte 4: 0xad
[ 0.301610] ID byte 5: 0x76
[ 0.304423] ID byte 6: 0xad
[ 0.307223] ID byte 7: 0x76
[ 0.310046] NAND device: Manufacturer ID: 0xad, Chip ID: 0x76 (Hynix NAND 64MiB 3,3V 8-bit)

Hope this helps...

2010-08-17 21:42:50

by Brian Norris

[permalink] [raw]

Subject: Re: [BUG] Nand support broken with v2.6.36-rc1

On 08/17/2010 01:05 PM, Michael Guntsche wrote:
> On 17 Aug 10 11:49, Brian Norris wrote:
>> First, have you ever used any of the mtdutils? In particular,
>> running the command "mtdinfo -a" and sending the output is helpful
>> if you have the utility installed on your board.
> hmm mtdinfo tries to open /sys/class/mtd/mtd0/dev which das not exist
> the device is working ok as block device on the other hand so let's try
> the next thing.

I'm not an expert on the workings of mtdutils, so I don't know
why your device does not have the necessary sysfs entries. Perhaps
weird hardware features or a strange, incorrect driver (I can't work
on your board specific driver for you). I suppose it's OK to ignore
this problem for the moment.

> Output booting with a patched .36-rc1
>
> [ 0.279217] rbppc_nand_probe: MikroTik RouterBOARD 600 series NAND driver, version 0.0.2
> [ 0.287535] ID byte 0: 0xad
> [ 0.290373] ID byte 1: 0x76
> [ 0.293185] ID byte 2: 0xad
> [ 0.295985] ID byte 3: 0x76
> [ 0.298798] ID byte 4: 0xad
> [ 0.301610] ID byte 5: 0x76
> [ 0.304423] ID byte 6: 0xad
> [ 0.307223] ID byte 7: 0x76
> [ 0.310046] NAND device: Manufacturer ID: 0xad, Chip ID: 0x76 (Hynix NAND 64MiB 3,3V 8-bit)
>
> Hope this helps...
>

Honestly, that doesn't really help :) I guess the device is old
enough it does not have an extended ID. In that case, I will need
the part number to be able to diagnose for sure. Can you find the
physical chip on the board and give me whatever labeling is on it?

In place of that, though, you can just try this patch on 2.6.36-rc1.
I believe it should satisfy the intention of my previous (faulty)
commit while reverting the regression behavior. If this works OK, I
will submit it to be included in the mainline kernel.

Thanks for taking the time to debug this.

Brian

---
drivers/mtd/nand/nand_base.c | 10 +++-------
1 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index a3c7473..a22ed7b 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -2934,14 +2934,10 @@ static struct nand_flash_dev *nand_get_flash_type(struct mtd_info *mtd,
chip->chip_shift = ffs((unsigned)(chip->chipsize >> 32)) + 32 - 1;

/* Set the bad block position */
- if (!(busw & NAND_BUSWIDTH_16) && (*maf_id == NAND_MFR_STMICRO ||
- (*maf_id == NAND_MFR_SAMSUNG &&
- mtd->writesize == 512) ||
- *maf_id == NAND_MFR_AMD))
- chip->badblockpos = NAND_SMALL_BADBLOCK_POS;
- else
+ if (mtd->writesize > 512 || (busw & NAND_BUSWIDTH_16))
chip->badblockpos = NAND_LARGE_BADBLOCK_POS;
-
+ else
+ chip->badblockpos = NAND_SMALL_BADBLOCK_POS;

/* Get chip options, preserve non chip based options */
chip->options &= ~NAND_CHIPOPTIONS_MSK;
--
1.7.0.4

2010-08-18 05:53:21

by Michael Guntsche

[permalink] [raw]

Subject: Re: [BUG] Nand support broken with v2.6.36-rc1

On 17 Aug 10 14:42, Brian Norris wrote:
> In place of that, though, you can just try this patch on 2.6.36-rc1.
> I believe it should satisfy the intention of my previous (faulty)
> commit while reverting the regression behavior. If this works OK, I
> will submit it to be included in the mainline kernel.

Hi Brian,

Applying this patch fixes the problem for me. I removed my workaround
and just tried your patch and was able to mount both mtdblock devices.

Thank you for fixing this,
Michael Guntsche

2010-08-18 18:25:41

by Brian Norris

[permalink] [raw]

Subject: [PATCH] mtd: nand: Fix regression in BBM detection

Commit c7b28e25cb9beb943aead770ff14551b55fa8c79 caused a regression
in detection of factory-set bad block markers, especially for certain
small-page NAND. This fix removes some unneeded constraints on using
NAND_SMALL_BADBLOCK_POS, making the detection code more correct.

This regression can be seen, for example, in Hynix HY27US081G1M and
similar.

Signed-off-by: Brian Norris <[email protected]>
---
drivers/mtd/nand/nand_base.c | 10 +++-------
1 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index a3c7473..a22ed7b 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -2934,14 +2934,10 @@ static struct nand_flash_dev *nand_get_flash_type(struct mtd_info *mtd,
chip->chip_shift = ffs((unsigned)(chip->chipsize >> 32)) + 32 - 1;

/* Set the bad block position */
- if (!(busw & NAND_BUSWIDTH_16) && (*maf_id == NAND_MFR_STMICRO ||
- (*maf_id == NAND_MFR_SAMSUNG &&
- mtd->writesize == 512) ||
- *maf_id == NAND_MFR_AMD))
- chip->badblockpos = NAND_SMALL_BADBLOCK_POS;
- else
+ if (mtd->writesize > 512 || (busw & NAND_BUSWIDTH_16))
chip->badblockpos = NAND_LARGE_BADBLOCK_POS;
-
+ else
+ chip->badblockpos = NAND_SMALL_BADBLOCK_POS;

/* Get chip options, preserve non chip based options */
chip->options &= ~NAND_CHIPOPTIONS_MSK;
--
1.7.0.4

2010-08-18 19:36:05

by Abdoulaye Walsimou GAYE

[permalink] [raw]

Subject: Re: [PATCH] mtd: nand: Fix regression in BBM detection

On 08/18/2010 08:25 PM, Brian Norris wrote:
> Commit c7b28e25cb9beb943aead770ff14551b55fa8c79 caused a regression
> in detection of factory-set bad block markers, especially for certain
> small-page NAND. This fix removes some unneeded constraints on using
> NAND_SMALL_BADBLOCK_POS, making the detection code more correct.
>
> This regression can be seen, for example, in Hynix HY27US081G1M and
> similar.
>
> Signed-off-by: Brian Norris<[email protected]>
> ---
> drivers/mtd/nand/nand_base.c | 10 +++-------
> 1 files changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
> index a3c7473..a22ed7b 100644
> --- a/drivers/mtd/nand/nand_base.c
> +++ b/drivers/mtd/nand/nand_base.c
> @@ -2934,14 +2934,10 @@ static struct nand_flash_dev *nand_get_flash_type(struct mtd_info *mtd,
> chip->chip_shift = ffs((unsigned)(chip->chipsize>> 32)) + 32 - 1;
>
> /* Set the bad block position */
> - if (!(busw& NAND_BUSWIDTH_16)&& (*maf_id == NAND_MFR_STMICRO ||
> - (*maf_id == NAND_MFR_SAMSUNG&&
> - mtd->writesize == 512) ||
> - *maf_id == NAND_MFR_AMD))
> - chip->badblockpos = NAND_SMALL_BADBLOCK_POS;
> - else
> + if (mtd->writesize> 512 || (busw& NAND_BUSWIDTH_16))
> chip->badblockpos = NAND_LARGE_BADBLOCK_POS;
> -
> + else
> + chip->badblockpos = NAND_SMALL_BADBLOCK_POS;
>
> /* Get chip options, preserve non chip based options */
> chip->options&= ~NAND_CHIPOPTIONS_MSK;
>

Brian,
Sorry for the long delay!
I tested the above patch unfortunately it does not help in my case!
And when I go further and put a JFFS2 in that partition and boot the
board I have

(S3c2410 nand hardware ECC enable):
mtd->read(0x400 bytes from 0x1274000) returned ECC error
mtd->read(0x3c08 bytes from 0x12743f8) returned ECC error

(without S3c2410 nand hardware ECC enable):
uncorrectable error :
uncorrectable error :
uncorrectable error :
uncorrectable error :
mtd->read(0x400 bytes from 0x1274000) returned ECC error
uncorrectable error :
uncorrectable error :
uncorrectable error :
[...]
uncorrectable error :
uncorrectable error :
mtd->read(0x3c08 bytes from 0x12743f8) returned ECC error

Despite these errors I can actually use the board (no kernel panic)!
The part is Samsung K9F1208U0C - PCB0

Hope that helps!

Thanks,
AWG

2010-08-19 00:05:00

by Brian Norris

[permalink] [raw]

Subject: Re: [PATCH] mtd: nand: Fix regression in BBM detection

On 08/18/2010 12:30 PM, Abdoulaye Walsimou GAYE wrote:
> Brian,
> Sorry for the long delay!
> I tested the above patch unfortunately it does not help in my case!

Understood. That makes sense. In fact, your problem is most likely *not*
related to this commit. As I mentioned before, please try narrowing down
what specifically caused this; if I read correctly, you jumped from
2.6.33 to 2.6.36-rc1. There have been several important changes between
those releases. Notably, this commit may be giving Samsung chips problems:
426c457a3216fac74e

This thread is covering a few problems with Samsung:
http://lists.infradead.org/pipermail/linux-mtd/2010-August/031590.html

> And when I go further and put a JFFS2 in that partition and boot the
> board I have
<snip>
> Despite these errors I can actually use the board (no kernel panic)!
> The part is Samsung K9F1208U0C - PCB0

Unless you really know what you're doing, I wouldn't be writing/erasing
the flash if it's not detecting bad blocks properly.

Let me know if you have trouble with narrowing down to the problem commit.

Brian