2008-06-09 10:13:44

by David Brownell

Subject: [patch 2.6.26-rc5-git] at91_nand speedup via {read,write}s{b,w}()

This uses __raw_{read,write}s{b,w}() primitives to access data on NAND
chips for more efficient I/O.
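
For anyone unfamiliar with the string I/O primitives, here is a minimal user-space sketch of the idea, not the kernel implementation: every access hits the same device address while the buffer pointer advances. The sketch_* names are made up for illustration; the real __raw_readsb()/__raw_readsw() are per-architecture, and on ARM and AVR32 they are hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space stand-ins for __raw_readsb()/__raw_readsw(). */

/* Read `count` bytes from one fixed address into consecutive buffer slots. */
void sketch_readsb(const volatile uint8_t *port, uint8_t *buf, size_t count)
{
	while (count--)
		*buf++ = *port;
}

/* Same idea for a 16-bit bus: `count` is in 16-bit words, not bytes. */
void sketch_readsw(const volatile uint16_t *port, void *buf, size_t count)
{
	uint8_t *dst = buf;

	while (count--) {
		uint16_t word = *port;	/* one 16-bit bus access */

		memcpy(dst, &word, sizeof(word));
		dst += sizeof(word);
	}
}

/* Shaped like the patch's read_buf16 hook: the caller passes a byte
 * length, so the 16-bit variant halves it. */
void sketch_read_buf16(const volatile uint16_t *port, uint8_t *buf, int len)
{
	assert(len >= 0 && len % 2 == 0);
	sketch_readsw(port, buf, (size_t)len / 2);
}
```

The 16-bit hook halves len because the NAND core passes a byte count while the word primitive counts 16-bit transfers, which is the "len / 2" in the patch below.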

On an arm926 with memory clocked at 100 MHz, this reduced the elapsed
time for a 64 MByte read by 16%. ("dd" /dev/mtd0 to /dev/null, with
an 8-bit NAND using hardware ECC and 128KB blocksize.)

Also some minor section tweaks:

- Use platform_driver_probe() so no pointer to probe() lingers
after that code has been removed at run-time.

- Use __exit and __exit_p so the remove() code will normally be
removed by the linker.
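
A rough user-space model of the two section tweaks above, under the assumption that the driver is built in rather than modular. The sketch_* names are hypothetical; the only kernel behaviour relied on is that __exit_p(x) evaluates to x in a module build and to NULL otherwise. The real platform_driver_probe() also registers the driver and binds devices, which this model skips.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct platform_driver. */
struct sketch_driver {
	int (*probe)(void);
	int (*remove)(void);
};

/* Define SKETCH_MODULE to model a loadable-module build. */
#ifdef SKETCH_MODULE
#define sketch_exit_p(fn) (fn)
#else
#define sketch_exit_p(fn) NULL	/* built in: exit code is discarded */
#endif

int sketch_probe(void)
{
	return 0;
}

int sketch_remove(void)
{
	return 0;
}

/* As in the patch: no .probe pointer is stored in the driver structure. */
struct sketch_driver sketch_nand_driver = {
	.remove = sketch_exit_p(sketch_remove),
};

/* Loosely models platform_driver_probe(): probe is passed in and called
 * once, so nothing keeps its address after init memory is freed. */
int sketch_driver_probe(struct sketch_driver *drv, int (*probe)(void))
{
	assert(drv != NULL && probe != NULL);
	return probe();
}
```

With SKETCH_MODULE undefined, both function pointers in the structure end up NULL, so nothing refers to code that has been discarded.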

Since these buffer read/write calls are new, this increases the runtime
code footprint (by 88 bytes on my build, after the section tweaks).

Signed-off-by: David Brownell <[email protected]>
---
Yeah, this does make you wonder why the *default* nand r/w code isn't
using these primitives; this speedup shouldn't be platform-specific.

Posting this now since I think this should either be incorporated into
the new atmel_nand.c code or into drivers/mtd/nand/nand_base.c ...
both arm and avr32 support these calls, I'm not sure whether or not
some platforms don't support them.

drivers/mtd/nand/at91_nand.c | 46 ++++++++++++++++++++++++++++++++++++++-----
1 file changed, 41 insertions(+), 5 deletions(-)

--- a/drivers/mtd/nand/at91_nand.c 2008-04-28 11:05:34.000000000 -0700
+++ b/drivers/mtd/nand/at91_nand.c 2008-04-28 21:59:34.000000000 -0700
@@ -146,6 +146,37 @@ static void at91_nand_disable(struct at9
}

/*
+ * Minimal-overhead PIO for data access.
+ */
+static void at91_read_buf(struct mtd_info *mtd, u8 *buf, int len)
+{
+ struct nand_chip *nand_chip = mtd->priv;
+
+ __raw_readsb(nand_chip->IO_ADDR_R, buf, len);
+}
+
+static void at91_read_buf16(struct mtd_info *mtd, u8 *buf, int len)
+{
+ struct nand_chip *nand_chip = mtd->priv;
+
+ __raw_readsw(nand_chip->IO_ADDR_R, buf, len / 2);
+}
+
+static void at91_write_buf(struct mtd_info *mtd, const u8 *buf, int len)
+{
+ struct nand_chip *nand_chip = mtd->priv;
+
+ __raw_writesb(nand_chip->IO_ADDR_W, buf, len);
+}
+
+static void at91_write_buf16(struct mtd_info *mtd, const u8 *buf, int len)
+{
+ struct nand_chip *nand_chip = mtd->priv;
+
+ __raw_writesw(nand_chip->IO_ADDR_W, buf, len / 2);
+}
+
+/*
* write oob for small pages
*/
static int at91_nand_write_oob_512(struct mtd_info *mtd,
@@ -440,8 +471,14 @@ static int __init at91_nand_probe(struct

nand_chip->chip_delay = 20; /* 20us command delay time */

- if (host->board->bus_width_16) /* 16-bit bus width */
+ if (host->board->bus_width_16) { /* 16-bit bus width */
nand_chip->options |= NAND_BUSWIDTH_16;
+ nand_chip->read_buf = at91_read_buf16;
+ nand_chip->write_buf = at91_write_buf16;
+ } else {
+ nand_chip->read_buf = at91_read_buf;
+ nand_chip->write_buf = at91_write_buf;
+ }

platform_set_drvdata(pdev, host);
at91_nand_enable(host);
@@ -548,7 +585,7 @@ err_ecc_ioremap:
/*
* Remove a NAND device.
*/
-static int __devexit at91_nand_remove(struct platform_device *pdev)
+static int __exit at91_nand_remove(struct platform_device *pdev)
{
struct at91_nand_host *host = platform_get_drvdata(pdev);
struct mtd_info *mtd = &host->mtd;
@@ -565,8 +602,7 @@ static int __devexit at91_nand_remove(st
}

static struct platform_driver at91_nand_driver = {
- .probe = at91_nand_probe,
- .remove = at91_nand_remove,
+ .remove = __exit_p(at91_nand_remove),
.driver = {
.name = "at91_nand",
.owner = THIS_MODULE,
@@ -575,7 +611,7 @@ static struct platform_driver at91_nand_

static int __init at91_nand_init(void)
{
- return platform_driver_register(&at91_nand_driver);
+ return platform_driver_probe(&at91_nand_driver, at91_nand_probe);
}


2008-06-09 11:31:00

by Haavard Skinnemoen

Subject: Re: [patch 2.6.26-rc5-git] at91_nand speedup via {read,write}s{b,w}()

David Brownell <[email protected]> wrote:
> This uses __raw_{read,write}s{b,w}() primitives to access data on NAND
> chips for more efficient I/O.
>
> On an arm926 with memory clocked at 100 MHz, this reduced the elapsed
> time for a 64 MByte read by 16%. ("dd" /dev/mtd0 to /dev/null, with
> an 8-bit NAND using hardware ECC and 128KB blocksize.)

Nice. Here are some numbers from my setup (256 MB, 8-bit, software ECC).

Before:
real 2m38.131s
user 0m0.228s
sys 2m37.740s

After:
real 2m27.404s
user 0m0.180s
sys 2m27.068s

which is a 6.8% speedup. I guess hardware ECC helps...though I can't
seem to get it to work properly. Is there anything I need to do besides
flash_eraseall when changing the ECC layout?
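
For reference, the percentage here is the reduction in elapsed ("real") time. A small sketch of the arithmetic, with a helper name that is only illustrative:

```c
#include <assert.h>

/* Percentage reduction in elapsed time, given two wall-clock readings
 * in seconds (the "real" lines printed by time). */
double percent_reduction(double before_s, double after_s)
{
	assert(before_s > 0.0);
	return 100.0 * (before_s - after_s) / before_s;
}
```

Feeding it 158.131 and 147.404 (2m38.131s and 2m27.404s) gives roughly 6.8.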

Also, I wonder if we can use the DMA engine framework to get rid of all
that "sys" time...?

> Also some minor section tweaks:
>
> - Use platform_driver_probe() so no pointer to probe() lingers
> after that code has been removed at run-time.
>
> - Use __exit and __exit_p so the remove() code will normally be
> removed by the linker.
>
> Since these buffer read/write calls are new, this increases the runtime
> code footprint (by 88 bytes on my build, after the section tweaks).

Yeah, I spotted a bug in __raw_readsb on avr32, so I guess those
functions haven't actually been used before...

> Signed-off-by: David Brownell <[email protected]>
> ---
> Yeah, this does make you wonder why the *default* nand r/w code isn't
> using these primitives; this speedup shouldn't be platform-specific.
>
> Posting this now since I think this should either be incorporated into
> the new atmel_nand.c code or into drivers/mtd/nand/nand_base.c ...
> both arm and avr32 support these calls, I'm not sure whether or not
> some platforms don't support them.

I'll leave it up to the MTD people to decide whether or not to update
nand_base.c. Below is your patch rebased onto my patchset. I'll include
it in my next series after I figure out where to send it.

Haavard

From ad420ea11f9c8aa0fcad2ce1c3af69c02a2dc447 Mon Sep 17 00:00:00 2001
From: David Brownell <[email protected]>
Date: Mon, 9 Jun 2008 03:13:28 -0700
Subject: [PATCH] atmel_nand speedup via {read,write}s{b,w}()

This uses __raw_{read,write}s{b,w}() primitives to access data on NAND
chips for more efficient I/O.

On an arm926 with memory clocked at 100 MHz, this reduced the elapsed
time for a 64 MByte read by 16%. ("dd" /dev/mtd0 to /dev/null, with
an 8-bit NAND using hardware ECC and 128KB blocksize.)

Also some minor section tweaks:

- Use platform_driver_probe() so no pointer to probe() lingers
after that code has been removed at run-time.

- Use __exit and __exit_p so the remove() code will normally be
removed by the linker.

Since these buffer read/write calls are new, this increases the runtime
code footprint (by 88 bytes on my build, after the section tweaks).

Signed-off-by: David Brownell <[email protected]>
[[email protected]: rebase onto atmel_nand rename]
Signed-off-by: Haavard Skinnemoen <[email protected]>
---
drivers/mtd/nand/atmel_nand.c | 46 ++++++++++++++++++++++++++++++++++++----
1 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/drivers/mtd/nand/atmel_nand.c b/drivers/mtd/nand/atmel_nand.c
index 325ce29..d9f7a5d 100644
--- a/drivers/mtd/nand/atmel_nand.c
+++ b/drivers/mtd/nand/atmel_nand.c
@@ -142,6 +142,37 @@ static int atmel_nand_device_ready(struct mtd_info *mtd)
}

/*
+ * Minimal-overhead PIO for data access.
+ */
+static void atmel_read_buf(struct mtd_info *mtd, u8 *buf, int len)
+{
+ struct nand_chip *nand_chip = mtd->priv;
+
+ __raw_readsb(nand_chip->IO_ADDR_R, buf, len);
+}
+
+static void atmel_read_buf16(struct mtd_info *mtd, u8 *buf, int len)
+{
+ struct nand_chip *nand_chip = mtd->priv;
+
+ __raw_readsw(nand_chip->IO_ADDR_R, buf, len / 2);
+}
+
+static void atmel_write_buf(struct mtd_info *mtd, const u8 *buf, int len)
+{
+ struct nand_chip *nand_chip = mtd->priv;
+
+ __raw_writesb(nand_chip->IO_ADDR_W, buf, len);
+}
+
+static void atmel_write_buf16(struct mtd_info *mtd, const u8 *buf, int len)
+{
+ struct nand_chip *nand_chip = mtd->priv;
+
+ __raw_writesw(nand_chip->IO_ADDR_W, buf, len / 2);
+}
+
+/*
* write oob for small pages
*/
static int atmel_nand_write_oob_512(struct mtd_info *mtd,
@@ -436,8 +467,14 @@ static int __init atmel_nand_probe(struct platform_device *pdev)

nand_chip->chip_delay = 20; /* 20us command delay time */

- if (host->board->bus_width_16) /* 16-bit bus width */
+ if (host->board->bus_width_16) { /* 16-bit bus width */
nand_chip->options |= NAND_BUSWIDTH_16;
+ nand_chip->read_buf = atmel_read_buf16;
+ nand_chip->write_buf = atmel_write_buf16;
+ } else {
+ nand_chip->read_buf = atmel_read_buf;
+ nand_chip->write_buf = atmel_write_buf;
+ }

platform_set_drvdata(pdev, host);
atmel_nand_enable(host);
@@ -546,7 +583,7 @@ err_nand_ioremap:
/*
* Remove a NAND device.
*/
-static int __devexit atmel_nand_remove(struct platform_device *pdev)
+static int __exit atmel_nand_remove(struct platform_device *pdev)
{
struct atmel_nand_host *host = platform_get_drvdata(pdev);
struct mtd_info *mtd = &host->mtd;
@@ -564,8 +601,7 @@ static int __devexit atmel_nand_remove(struct platform_device *pdev)
}

static struct platform_driver atmel_nand_driver = {
- .probe = atmel_nand_probe,
- .remove = atmel_nand_remove,
+ .remove = __exit_p(atmel_nand_remove),
.driver = {
.name = "atmel_nand",
.owner = THIS_MODULE,
@@ -574,7 +610,7 @@ static struct platform_driver atmel_nand_driver = {

static int __init atmel_nand_init(void)
{
- return platform_driver_register(&atmel_nand_driver);
+ return platform_driver_probe(&atmel_nand_driver, atmel_nand_probe);
}


--
1.5.5.3

2008-06-09 16:48:53

by Haavard Skinnemoen

Subject: Re: [patch 2.6.26-rc5-git] at91_nand speedup via {read,write}s{b,w}()

Haavard Skinnemoen <[email protected]> wrote:
> which is a 6.8% speedup. I guess hardware ECC helps...though I can't
> seem to get it to work properly. Is there anything I need to do besides
> flash_eraseall when changing the ECC layout?

Turns out there's an AP7000 erratum that hasn't made it to the data
sheet yet. The IC designers have already come up with a workaround,
which I've implemented below. This brings the time down to

real 2m0.934s
user 0m0.140s
sys 2m0.700s

which is a nice improvement.

Haavard

From 57d4f806c28a068baae12558794733e838016a71 Mon Sep 17 00:00:00 2001
From: Haavard Skinnemoen <[email protected]>
Date: Mon, 9 Jun 2008 18:31:25 +0200
Subject: [PATCH] atmel_nand: Work around AT32AP7000 errata

The ALE signal isn't correctly wired up to the ECC controller on the
AP7000, so it starts calculating ECC during the address cycles.

Work around this by resetting the ECC controller between the address and
data cycles.
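
To illustrate why the reset helps, here is a toy user-space model. It is only a sketch: a single XOR accumulator stands in for the controller's parity registers (the real hardware computes a proper ECC), and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the ECC controller's parity state. */
struct sketch_ecc {
	uint8_t parity;
};

void sketch_ecc_reset(struct sketch_ecc *ecc)
{
	ecc->parity = 0;
}

/* Every byte the controller sees on the bus is folded into the parity. */
void sketch_ecc_feed(struct sketch_ecc *ecc, const uint8_t *bytes, size_t n)
{
	while (n--)
		ecc->parity ^= *bytes++;
}

/* A page read as the controller sees it: address cycles, then data cycles.
 * With reset_before_data set, the parity covers the data alone. */
uint8_t sketch_read_page(struct sketch_ecc *ecc,
			 const uint8_t *addr, size_t addr_len,
			 const uint8_t *data, size_t data_len,
			 int reset_before_data)
{
	assert(ecc != NULL);

	sketch_ecc_reset(ecc);
	sketch_ecc_feed(ecc, addr, addr_len);	/* erratum: ALE is not honoured */
	if (reset_before_data)
		sketch_ecc_reset(ecc);		/* the workaround */
	sketch_ecc_feed(ecc, data, data_len);
	return ecc->parity;
}
```

Without the reset, the result depends on the page address as well as the page contents, so it cannot match an ECC computed over the data alone.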

Signed-off-by: Haavard Skinnemoen <[email protected]>
---
drivers/mtd/nand/atmel_nand.c | 25 +++++++++++++++++++++++--
1 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/mtd/nand/atmel_nand.c b/drivers/mtd/nand/atmel_nand.c
index d9f7a5d..b769ef3 100644
--- a/drivers/mtd/nand/atmel_nand.c
+++ b/drivers/mtd/nand/atmel_nand.c
@@ -33,6 +33,7 @@
#include <asm/io.h>

#include <asm/arch/board.h>
+#include <asm/arch/cpu.h>

#ifdef CONFIG_MTD_NAND_ATMEL_ECC_HW
#define hard_ecc 1
@@ -264,6 +265,19 @@ static int atmel_nand_read_page(struct mtd_info *mtd,
uint8_t *ecc_pos;
int stat;

+ /*
+ * Errata: ALE is incorrectly wired up to the ECC controller
+ * on the AP7000, so it will include the address cycles in the
+ * ECC calculation.
+ *
+ * Workaround: Reset the parity registers before reading the
+ * actual data.
+ */
+ if (cpu_is_at32ap7000()) {
+ struct atmel_nand_host *host = chip->priv;
+ ecc_writel(host->ecc, CR, ATMEL_ECC_RST);
+ }
+
/* read the page */
chip->read_buf(mtd, p, eccsize);

@@ -377,9 +391,16 @@ static int atmel_nand_correct(struct mtd_info *mtd, u_char *dat,
}

/*
- * Enable HW ECC : unsused
+ * Enable HW ECC : unused on most chips
*/
-static void atmel_nand_hwctl(struct mtd_info *mtd, int mode) { ; }
+static void atmel_nand_hwctl(struct mtd_info *mtd, int mode)
+{
+ if (cpu_is_at32ap7000()) {
+ struct nand_chip *nand_chip = mtd->priv;
+ struct atmel_nand_host *host = nand_chip->priv;
+ ecc_writel(host->ecc, CR, ATMEL_ECC_RST);
+ }
+}

#ifdef CONFIG_MTD_PARTITIONS
static const char *part_probes[] = { "cmdlinepart", NULL };
--
1.5.5.3

2008-06-09 17:07:49

by David Brownell

Subject: Re: [patch 2.6.26-rc5-git] at91_nand speedup via {read,write}s{b,w}()

On Monday 09 June 2008, Haavard Skinnemoen wrote:
> David Brownell <[email protected]> wrote:
> > This uses __raw_{read,write}s{b,w}() primitives to access data on NAND
> > chips for more efficient I/O.
> >
> > On an arm926 with memory clocked at 100 MHz, this reduced the elapsed
> > time for a 64 MByte read by 16%. ("dd" /dev/mtd0 to /dev/null, with
> > an 8-bit NAND using hardware ECC and 128KB blocksize.)
>
> Nice. Here are some numbers from my setup (256 MB, 8-bit, software ECC).
>
> Before:
> real 2m38.131s
> user 0m0.228s
> sys 2m37.740s
>
> After:
> real 2m27.404s
> user 0m0.180s
> sys 2m27.068s
>
> which is a 6.8% speedup. I guess hardware ECC helps...

The AVR32 versions of readsb/writesb didn't look to me as if they'd
be quite as fast as the ARM ones either. If AVR32 has some analogue
of "stmia r1!, {r3 - r6}" for burst 16 byte stores, it's not using
it right now. (What was the bug you found in its readsb?)

Yes, I'd think the win would be most visible with hardware ECC, since
without it you've still got a second manual scan of each block. (And
I see you observed this too, after applying a workaround for an ECC
erratum you just learned about...) My numbers for one pair of trials
(the "16%" was an average of 6 runs) had a *lot* less system time.
Which oddly enough went *up* after the switch to readsb/writesb:

Before:
real 0m24.199s
user 0m0.000s
sys 0m5.630s

After:
real 0m20.226s
user 0m0.010s
sys 0m6.000s

However, the fact that you got a win even with soft ECC (and, I'm
guessing, slower RAM and slower readsb) suggests that this speedup
should be pretty generally applicable!


> though I can't
> seem to get it to work properly. Is there anything I need to do besides
> flash_eraseall when changing the ECC layout?

I wouldn't know. Just be sure not to lose all your badblocks data
when you convert ...


> Also, I wonder if we can use the DMA engine framework to get rid of all
> that "sys" time...?

It's another one of those cases where the framework overhead has to be
low enough to make that practical. Last time I looked, the overhead to
set up and wait for a DMA of a couple KBytes was a significant chunk of
the cost to readsb()/writesb() the same data ... and that's even before
the data starts transferring.
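
As a back-of-the-envelope way to think about that trade-off, the comparison can be sketched as below. It only models elapsed time, not how much CPU is freed up, and all three cost parameters are hypothetical inputs that would have to be measured on real hardware.

```c
#include <assert.h>

/*
 * All parameters are hypothetical, to be measured rather than assumed:
 *   setup_us  - fixed cost to map the buffer, program the DMA, and wait
 *   pio_us_kb - time per KByte with readsb()/writesb()
 *   dma_us_kb - transfer time per KByte once the DMA is running
 * Returns nonzero when DMA would finish sooner than PIO for `kbytes`.
 */
int dma_is_faster(double setup_us, double pio_us_kb, double dma_us_kb,
		  double kbytes)
{
	assert(kbytes > 0.0);
	return setup_us + dma_us_kb * kbytes < pio_us_kb * kbytes;
}
```

For transfers of only a couple of KBytes the fixed setup term dominates, so the answer hinges on how small the framework overhead can be made.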

Plus, the MTD layer currently assumes DMA is never used. Some of the
buffers it passes are not suitable for dma_map_single() since they
come from vmalloc.


> > ...
> >
> > Signed-off-by: David Brownell <[email protected]>
> > ---
> > Yeah, this does make you wonder why the *default* nand r/w code isn't
> > using these primitives; this speedup shouldn't be platform-specific.
> >
> > Posting this now since I think this should either be incorporated into
> > the new atmel_nand.c code or into drivers/mtd/nand/nand_base.c ...
> > both arm and avr32 support these calls, I'm not sure whether or not
> > some platforms don't support them.
>
> I'll leave it up to the MTD people to decide whether or not to update
> nand_base.c. Below is your patch rebased onto my patchset. I'll include
> it in my next series after I figure out where to send it.

Sounds fair to me. Thanks; this has been sitting in my tree for many
months now, I finally made time to measure it and was pleasantly
surprised by the size of the win!

- Dave

2008-06-09 17:48:36

by Haavard Skinnemoen

Subject: Re: [patch 2.6.26-rc5-git] at91_nand speedup via {read,write}s{b,w}()

David Brownell <[email protected]> wrote:
> On Monday 09 June 2008, Haavard Skinnemoen wrote:
> > David Brownell <[email protected]> wrote:
> > > This uses __raw_{read,write}s{b,w}() primitives to access data on NAND
> > > chips for more efficient I/O.
> > >
> > > On an arm926 with memory clocked at 100 MHz, this reduced the elapsed
> > > time for a 64 MByte read by 16%. ("dd" /dev/mtd0 to /dev/null, with
> > > an 8-bit NAND using hardware ECC and 128KB blocksize.)
> >
> > Nice. Here are some numbers from my setup (256 MB, 8-bit, software ECC).
> >
> > Before:
> > real 2m38.131s
> > user 0m0.228s
> > sys 2m37.740s
> >
> > After:
> > real 2m27.404s
> > user 0m0.180s
> > sys 2m27.068s
> >
> > which is a 6.8% speedup. I guess hardware ECC helps...
>
> The AVR32 versions of readsb/writesb didn't look to me as if they'd
> be quite as fast as the ARM ones either. If AVR32 has some analogue
> of "stmia r1!, {r3 - r6}" for burst 16 byte stores, it's not using
> it right now. (What was the bug you found in its readsb?)

Note that I'm talking about the __raw_ versions of those, which are a
bit more optimized than the non-raw versions. They do

1:      ldins.b r8:t, r12[0]
        ldins.b r8:u, r12[0]
        ldins.b r8:l, r12[0]
        ldins.b r8:b, r12[0]
        st.w    r11++, r8
        sub     r10, 4
        brge    1b

I don't think we have an instruction that can store multiple registers
to the same address...it would of course be acceptable to store to
incrementing addresses when dealing with NAND flash, but I don't think
it's a good idea in a general __raw_readsb implementation.

Here's the bug I found, btw:

--- a/arch/avr32/lib/io-readsb.S
+++ b/arch/avr32/lib/io-readsb.S
@@ -41,7 +41,7 @@ __raw_readsb:
2: sub r10, -4
reteq r12

-3: ld.uh r8, r12[0]
+3: ld.ub r8, r12[0]
sub r10, 1
st.b r11++, r8
brne 3b

Not sure how easy it is to trigger since that code is only executed for
odd sizes.
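
In C terms, the shape of that routine is roughly the following. This is a user-space sketch with hypothetical names (the "port" just returns successive bytes from an array), and it ignores the alignment handling the real assembly has to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a data port: each read returns the next byte. */
struct sketch_port {
	const uint8_t *src;
	size_t pos;
};

uint8_t sketch_port_readb(struct sketch_port *port)
{
	return port->src[port->pos++];
}

/* Four byte-wide reads are packed into one word (first byte in the top
 * lane, as on big-endian AVR32) and written out together; leftover bytes
 * are read one at a time. */
void sketch_readsb_packed(struct sketch_port *port, uint8_t *buf, size_t len)
{
	assert(port != NULL && buf != NULL);

	while (len >= 4) {
		uint32_t word = 0;
		int lane;

		for (lane = 0; lane < 4; lane++)
			word = (word << 8) | sketch_port_readb(port);

		/* Store in big-endian byte order regardless of the host. */
		buf[0] = (uint8_t)(word >> 24);
		buf[1] = (uint8_t)(word >> 16);
		buf[2] = (uint8_t)(word >> 8);
		buf[3] = (uint8_t)word;
		buf += 4;
		len -= 4;
	}

	/* Tail: must be byte-wide reads, which is what the ld.ub fix restores. */
	while (len--)
		*buf++ = sketch_port_readb(port);
}
```

The tail loop is the part the one-line fix touches: a halfword load there would perform a 16-bit access where a single byte is wanted.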

> Yes, I'd think the win would be most visible with hardware ECC, since
> without it you've still got a second manual scan of each block. (And
> I see you observed this too, after applying a workaround for an ECC
> erratum you just learned about...) My numbers for one pair of trials
> (the "16%" was an average of 6 runs) had a *lot* less system time.
> Which oddly enough went *up* after the switch to readsb/writesb:
>
> Before:
> real 0m24.199s
> user 0m0.000s
> sys 0m5.630s
>
> After:
> real 0m20.226s
> user 0m0.010s
> sys 0m6.000s

Hmm, that's odd. What's the CPU doing during the remaining 14 seconds?
It can't possibly be sleeping?

Ah, it's I/O wait, isn't it? Because you're going through the block
layer?

> However, the fact that you got a win even with soft ECC (and, I'm
> guessing, slower RAM and slower readsb) suggests that this speedup
> should be pretty generally applicable!

Yes, I would think so...although I've seen gcc generate somewhat crappy
code for the I/O accessors, and we do some address mangling in the
non-raw I/O accessors on avr32 which might explain some of the
difference.

> > though I can't
> > seem to get it to work properly. Is there anything I need to do besides
> > flash_eraseall when changing the ECC layout?
>
> I wouldn't know. Just be sure not to lose all your badblocks data
> when you convert ...

Seems like flash_eraseall skips the bad blocks as it should.

> > Also, I wonder if we can use the DMA engine framework to get rid of all
> > that "sys" time...?
>
> It's another one of those cases where the framework overhead has to be
> low enough to make that practical. Last time I looked, the overhead to
> set up and wait for a DMA of a couple KBytes was a significant chunk of
> the cost to readsb()/writesb() the same data ... and that's even before
> the data starts transferring.

Right. I guess we should take a look at how to reduce that overhead at
some point...

> Plus, the MTD layer currently assumes DMA is never used. Some of the
> buffers it passes are not suitable for dma_map_single() since they
> come from vmalloc.

Aw...the MTD layer uses vmalloc() all over the place :-(

> Sounds fair to me. Thanks; this has been sitting in my tree for many
> months now, I finally made time to measure it and was pleasantly
> surprised by the size of the win!

Yeah...I'm still not sure where to send it though, since it touches
three different subsystems. I can set up a separate tree for it like
I've done a couple of times before...though I'm not sure if anyone ever
pulls it.

Haavard

2008-06-09 18:22:04

by David Brownell

Subject: Re: [patch 2.6.26-rc5-git] at91_nand speedup via {read,write}s{b,w}()

On Monday 09 June 2008, Haavard Skinnemoen wrote:
> > real 0m20.226s
> > user 0m0.010s
> > sys 0m6.000s
>
> Hmm, that's odd. What's the CPU doing during the remaining 14 seconds?
> It can't possibly be sleeping?
>
> Ah, it's I/O wait, isn't it? Because you're going through the block
> layer?

Some of it is surely data copying, but yes /dev/mtdblock0 might
have something to do with it. I was puzzled by this too, which
is part of why I quoted only elapsed time.


> Yeah...I'm still not sure where to send it though, since it touches
> three different subsystems. I can set up a separate tree for it like
> I've done a couple of times before...though I'm not sure if anyone ever
> pulls it.

Three subsystems ... you mean, ARM, AVR32, MTD? If MTD patches
merged more promptly, I'd suggest it goes through there. Else
maybe you should just get acks from the other maintainers and
push the rename+ directly to Linus once 2.6.27-rc0 starts.

- Dave