2011-03-21 14:21:34

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.

On Saturday 19 March 2011, Andrei Warkentin wrote:
> On Mon, Mar 14, 2011 at 2:40 AM, Andrei Warkentin <[email protected]> wrote:
>
> >>>
> >>> Revalidating the data now, along with some more tests, to get a better
> >>> picture. It seems the more data I get, the less it makes sense :(.
> >>
> >> I was already fearing that the change would only benefit low-level
> >> benchmarks. It certainly helps writing small chunks to the buffer
> >> that is meant for FAT32 directories, but at some point, the card
> >> will have to write back the entire logical erase block, so you
> >> might not be able to gain much in real-world workloads.
> >>
> >
>
> Attached is some data I have collected on the MMC32G part. I tried
> to make the collection process as controlled as possible, as well as
> use more-or-less a "real life" usage case that involves running a user
> application, so it's not just a purely synthetic test at block level.
>
> Attached file (I hope you don't mind PDFs) contains data collected for
> two possible optimizations. The second page of the document tests the
> vendor suggested optimization that is basically -
> if (request_blocks < 24) {
>         /* given request offset, calculate sectors remaining
>            on the 8K page containing the offset */
>         sectors = 16 - (request_offset % 16);
>         if (request_blocks > sectors) {
>                 request_blocks = sectors;
>         }
> }
> ...I'll call this optimization A.
>
> ...the first page of the document tests the optimization that floated
> up on the list when I first sent a patch with the vendor suggestions.
> That optimization being - align all unaligned accesses (either all
> completely, or under a certain size threshold) on flash page size.
> I'll call this optimization B.

I'm not sure if I really understand the difference between the two.
Do you mean optimization A makes sure that you don't have partial
pages at the start of a request, while optimization B also splits
small requests on page boundary if the first page in it is aligned?

> To test, I collect time info for 2000 small sqlite inserts into each
> of 20 separate tables. So that's 20 x 2000 sqlite inserts per test.
> The test is executed for ext2, ext3 and ext4 with a 4k block
> size. Every test begins with a flash discard and format operation on
> the partition where the tables are created and accessed, to ensure
> similar accesses to flash on every test. All other partitions are RO,
> and no processes other than those needed by the tests run. All power
> management is disabled. The results are thus repeatable, consistent
> and stable across reboots and power-on time...
>
> Each test consists of:
> 1) Unmount partition
> 2) Flash erase
> 3) Format with fs
> 4) Mount
> 5) Sync
> 6) echo 3 > /proc/sys/vm/drop_caches
> 7) run 20 x 2000 inserts as described above
> 8) unmount

Just to make sure: Did you properly align the partition start on an
erase block boundary of 4MB?
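
(One way to check this, with the device name only as an example:

	parted /dev/mmcblk0 unit s print

prints the partition start offsets in sectors; assuming 512-byte
sectors, a 4 MB boundary means every start value is a multiple
of 8192.)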

I would have loved to see results with nilfs2 and btrfs as well, but
I can understand that these were less relevant to you, especially
since you don't really want to compare the file systems as much as
your own changes.

One very surprising result to me is how much worse the ext4 numbers
are compared to ext2/ext3. I would have guessed that they should
be much better, given that the ext4 developers are specifically
trying to optimize for this case. I've taken the ext4 mailing
list on Cc here and will forward your test results there as
well.

> For optimization B testing, the alignment size and alignment access
> size threshold (same parameters as in my RFC patch) are exposed
> through debugfs. To get B test data, the flow was
>
> 1) Set alignment to none (no optimization)
> 2) Sql test on ext2
> 3) Sql test on ext3
> 4) Sql test on ext4
>
> 6) Set alignment to 8k, no threshold
> 7) Sql test on ext2
> 8) Sql test on ext3
> 9) Sql test on ext4
>
> 10) Set alignment to 8k, < 8k only
> 11) Sql test on ext2
> 12) Sql test on ext3
> 13) Sql test on ext4
>
> ...all the way up to 32K threshold.
>
> For optimization A testing, the optimization was turned off/on with a
> debugfs attribute, and the data collected with this flow:
>
> 1) Turn off optimization
> 2) Sql test on ext2
> 3) Sql test on ext3
> 4) Sql test on ext4
> 5) Turn on optimization
> 6) Sql test on ext2
> 7) Sql test on ext3
> 8) Sql test on ext4
>
> My interpretation of the results: Any kind of alignment-on-flash page
> optimization produced data that in all cases was either
> indistinguishable from control, or was worse. Do you agree with my
> interpretation?

I suppose when the result is total runtime in seconds, larger numbers
are always worse, so I agree.

One potential flaw in the measurement might be that running the test
a second time means that the card is already in a state that requires
garbage collection and therefore slower. Running the test in the opposite
order (optimized first, then unoptimized) might theoretically lead
to other results. It's not clear from your description whether your
test method has taken this into account (I would assume yes).

> So I guess that hexes the align optimization, at least until I can get
> data for MMC16G with the same controlled setup. Sorry about that. I'll
> work on the "reliability optimization" now, which I guess is pretty
> generic for cards with similar buffer schemes. It relies on reliable
> writes, so exposing that will be first for review here...
>
> Even though I'm rescinding the adjust/align patch, is there any chance
> for pulling in my quirks changes?

The quirks patch still looks fine to me, I'd just recommend that we
don't apply it before we have a need for it, i.e. at least a single
card specific quirk.

Arnd


2011-03-21 14:41:21

by Andrei Warkentin

[permalink] [raw]
Subject: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.

On Mon, Mar 21, 2011 at 9:21 AM, Arnd Bergmann <[email protected]> wrote:
>> Attached file (I hope you don't mind PDFs) contains data collected for
>> two possible optimizations. The second page of the document tests the
>> vendor suggested optimization that is basically -
>> if (request_blocks < 24) {
>>         /* given request offset, calculate sectors remaining
>>            on the 8K page containing the offset */
>>         sectors = 16 - (request_offset % 16);
>>         if (request_blocks > sectors) {
>>                 request_blocks = sectors;
>>         }
>> }
>> ...I'll call this optimization A.
>>
>> ...the first page of the document tests the optimization that floated
>> up on the list when I first sent a patch with the vendor suggestions.
>> That optimization being - align all unaligned accesses (either all
>> completely, or under a certain size threshold) on flash page size.
>> I'll call this optimization B.
>
> I'm not sure if I really understand the difference between the two.
> Do you mean optimization A makes sure that you don't have partial
> pages at the start of a request, while optimization B also splits
> small requests on page boundary if the first page in it is aligned?

The vendor optimization always splits accesses under 12k, even if they
are aligned. There are (still) some outstanding questions about how
that's supposed to improve anything, but that's the algorithm.

"our" optimization, suggested on this list, was to align accesses onto
flash page size, thus splitting each request into (small) unaligned
and aligned portions.
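
Roughly, as a sketch (same hypothetical variable names as the vendor
snippet above, 16 sectors per 8K flash page; the threshold variant
simply gates this on the request size):

if (request_offset % 16) {
        /* write only up to the next flash page boundary; the
           remainder gets resubmitted as a page-aligned request */
        sectors = 16 - (request_offset % 16);
        if (request_blocks > sectors)
                request_blocks = sectors;
}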

>
>> To test, I collect time info for 2000 small sqlite inserts into each
>> of 20 separate tables. So that's 20 x 2000 sqlite inserts per test.
>> The test is executed for ext2, ext3 and ext4 with a 4k block
>> size. Every test begins with a flash discard and format operation on
>> the partition where the tables are created and accessed, to ensure
>> similar accesses to flash on every test. All other partitions are RO,
>> and no processes other than those needed by the tests run. All power
>> management is disabled. The results are thus repeatable, consistent
>> and stable across reboots and power-on time...
>>
>> Each test consists of:
>> 1) Unmount partition
>> 2) Flash erase
>> 3) Format with fs
>> 4) Mount
>> 5) Sync
>> 6) echo 3 > /proc/sys/vm/drop_caches
>> 7) run 20 x 2000 inserts as described above
>> 8) unmount
>
> Just to make sure: Did you properly align the partition start on an
> erase block boundary of 4MB?
>

Yes, absolutely.

> I would have loved to see results with nilfs2 and btrfs as well, but
> I can understand that these were less relevant to you, especially
> since you don't really want to compare the file systems as much as
> your own changes.
>

In the context of looking at this anyway, I will try and get
comparison data for sqlite on different fs (and different fs tunables)
on flash.

> One very surprising result to me is how much worse the ext4 numbers
> are compared to ext2/ext3. I would have guessed that they should
> be much better, given that the ext4 developers are specifically
> trying to optimize for this case. I've taken the ext4 mailing
> list on Cc here and will forward your test results there as
> well.

I was surprised too.

> One potential flaw in the measurement might be that running the test
> a second time means that the card is already in a state that requires
> garbage collection and therefore slower. Running the test in the opposite
> order (optimized first, then unoptimized) might theoretically lead
> to other results. It's not clear from your description whether your
> test method has taken this into account (I would assume yes).
>

I've done tests across reboots that showed consistent results.
Additionally, repeating one test after another showed the same
results. At least on this flash medium, a block erase (using the
erase utility from flashbench, modified to erase everything if no
argument is provided) before formatting ahead of every test seemed
to make the results consistent.

>> So I guess that hexes the align optimization, at least until I can get
>> data for MMC16G with the same controlled setup. Sorry about that. I'll
>> work on the "reliability optimization" now, which I guess is pretty
>> generic for cards with similar buffer schemes. It relies on reliable
>> writes, so exposing that will be first for review here...
>>
>> Even though I'm rescinding the adjust/align patch, is there any chance
>> for pulling in my quirks changes?
>
> The quirks patch still looks fine to me, I'd just recommend that we
> don't apply it before we have a need for it, i.e. at least a single
> card specific quirk.
>

Ok. Sounds good. Back to reliable writes it is, so I can roll up the
second quirk...

A

2011-03-21 18:03:16

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.

On 2011-03-21, at 3:41 PM, Andrei Warkentin wrote:
> On Mon, Mar 21, 2011 at 9:21 AM, Arnd Bergmann <[email protected]> wrote:
>>> Attached file (I hope you don't mind PDFs) contains data collected for
>>> two possible optimizations. The second page of the document tests the
>>> vendor suggested optimization that is basically -
>>> if (request_blocks < 24) {
>>>         /* given request offset, calculate sectors remaining
>>>            on the 8K page containing the offset */
>>>         sectors = 16 - (request_offset % 16);
>>>         if (request_blocks > sectors) {
>>>                 request_blocks = sectors;
>>>         }
>>> }
>>> ...I'll call this optimization A.
>>>
>>> ...the first page of the document tests the optimization that floated
>>> up on the list when I first sent a patch with the vendor suggestions.
>>> That optimization being - align all unaligned accesses (either all
>>> completely, or under a certain size threshold) on flash page size.
>>> I'll call this optimization B.
>>
>> I'm not sure if I really understand the difference between the two.
>> Do you mean optimization A makes sure that you don't have partial
>> pages at the start of a request, while optimization B also splits
>> small requests on page boundary if the first page in it is aligned?
>
> The vendor optimization always splits accesses under 12k, even if they
> are aligned. There are (still) some outstanding questions about how
> that's supposed to improve anything, but that's the algorithm.
>
> "our" optimization, suggested on this list, was to align accesses onto
> flash page size, thus splitting each request into (small) unaligned
> and aligned portions.

Note that mballoc was specifically designed to handle allocation requests that are aligned on RAID stripe boundaries, so it should be able to handle this for MMC as well. What is needed is to tell the filesystem what the underlying alignment is. That can be done at format time with mke2fs or afterward with tune2fs by using the "-E stripe_width" option.
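
For example (device name and geometry are placeholders; with a 4 MB
erase block and 4k filesystem blocks that would be 1024 blocks):

	mke2fs -t ext4 -b 4096 -E stripe_width=1024 /dev/mmcblk0p2

or, on an existing filesystem:

	tune2fs -E stripe_width=1024 /dev/mmcblk0p2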

>>> To test, I collect time info for 2000 small sqlite inserts into each
>>> of 20 separate tables. So that's 20 x 2000 sqlite inserts per test.
>>> The test is executed for ext2, ext3 and ext4 with a 4k block
>>> size. Every test begins with a flash discard and format operation on
>>> the partition where the tables are created and accessed, to ensure
>>> similar accesses to flash on every test. All other partitions are RO,
>>> and no processes other than those needed by the tests run. All power
>>> management is disabled. The results are thus repeatable, consistent
>>> and stable across reboots and power-on time...
>>>
>>> Each test consists of:
>>> 1) Unmount partition
>>> 2) Flash erase
>>> 3) Format with fs
>>> 4) Mount
>>> 5) Sync
>>> 6) echo 3 > /proc/sys/vm/drop_caches
>>> 7) run 20 x 2000 inserts as described above
>>> 8) unmount
>>
>> Just to make sure: Did you properly align the partition start on an
>> erase block boundary of 4MB?
>>
>
> Yes, absolutely.
>
>> I would have loved to see results with nilfs2 and btrfs as well, but
>> I can understand that these were less relevant to you, especially
>> since you don't really want to compare the file systems as much as
>> your own changes.
>>
>
> In the context of looking at this anyway, I will try and get
> comparison data for sqlite on different fs (and different fs tunables)
> on flash.
>
>> One very surprising result to me is how much worse the ext4 numbers
>> are compared to ext2/ext3. I would have guessed that they should
>> be much better, given that the ext4 developers are specifically
>> trying to optimize for this case. I've taken the ext4 mailing
>> list on Cc here and will forward your test results there as
>> well.
>
> I was surprised too.
>
>> One potential flaw in the measurement might be that running the test
>> a second time means that the card is already in a state that requires
>> garbage collection and therefore slower. Running the test in the opposite
>> order (optimized first, then unoptimized) might theoretically lead
>> to other results. It's not clear from your description whether your
>> test method has taken this into account (I would assume yes).
>>
>
> I've done tests across reboots that showed consistent results.
> Additionally, repeating one test after another showed the same
> results. At least on this flash medium, a block erase (using the
> erase utility from flashbench, modified to erase everything if no
> argument is provided) before formatting ahead of every test seemed
> to make the results consistent.
>
>>> So I guess that hexes the align optimization, at least until I can get
>>> data for MMC16G with the same controlled setup. Sorry about that. I'll
>>> work on the "reliability optimization" now, which I guess is pretty
>>> generic for cards with similar buffer schemes. It relies on reliable
>>> writes, so exposing that will be first for review here...
>>>
>>> Even though I'm rescinding the adjust/align patch, is there any chance
>>> for pulling in my quirks changes?
>>
>> The quirks patch still looks fine to me, I'd just recommend that we
>> don't apply it before we have a need for it, i.e. at least a single
>> card specific quirk.
>>
>
> Ok. Sounds good. Back to reliable writes it is, so I can roll up the
> second quirk...
>
> A


Cheers, Andreas






2011-03-21 19:05:37

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.

On Monday 21 March 2011 19:03:09 Andreas Dilger wrote:
> Note that mballoc was specifically designed to handle allocation
> requests that are aligned on RAID stripe boundaries, so it should
> be able to handle this for MMC as well. What is needed is to tell
> the filesystem what the underlying alignment is. That can be done
> at format time with mke2fs or afterward with tune2fs by using the
> "-E stripe_width" option.

Ah, that sounds useful. So would I set the stripe_width to the
erase block size, and the block group size to a multiple of that?

Does this also work in (rare) cases where the erase block size is
not a power of two?

Arnd

2011-03-22 00:03:12

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.

On 2011-03-21, at 8:05 PM, Arnd Bergmann wrote:
> On Monday 21 March 2011 19:03:09 Andreas Dilger wrote:
>> Note that mballoc was specifically designed to handle allocation
>> requests that are aligned on RAID stripe boundaries, so it should
>> be able to handle this for MMC as well. What is needed is to tell
>> the filesystem what the underlying alignment is. That can be done
>> at format time with mke2fs or afterward with tune2fs by using the
>> "-E stripe_width" option.
>
> Ah, that sounds useful. So would I set the stripe_width to the
> erase block size, and the block group size to a multiple of that?

When you write "block group size" do you mean the ext4 block group? Then yes it would help. You could also consider setting the flex_bg size to a multiple of this, so that the bitmap blocks are grouped as a multiple of this size. However, they may not be aligned correctly, which needs extra effort that isn't obvious.

I think it would be nice to have mke2fs take the stripe_width and/or flex_bg factor into account when sizing/aligning the bitmaps, but it doesn't yet.
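
(For completeness: the flex_bg factor itself is chosen at format time
with the -G option, e.g.

	mke2fs -t ext4 -G 16 /dev/mmcblk0p2

to group the metadata of 16 block groups together; the value here is
only an example.)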

> Does this also work in (rare) cases where the erase block size is
> not a power of two?

It does (or is supposed to), but that isn't code that is exercised very much (most installations use a power-of-two size).

Cheers, Andreas






2011-03-22 13:56:39

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.

On Tuesday 22 March 2011, Andreas Dilger wrote:
> On 2011-03-21, at 8:05 PM, Arnd Bergmann wrote:
> > On Monday 21 March 2011 19:03:09 Andreas Dilger wrote:
> >> Note that mballoc was specifically designed to handle allocation
> >> requests that are aligned on RAID stripe boundaries, so it should
> >> be able to handle this for MMC as well. What is needed is to tell
> >> the filesystem what the underlying alignment is. That can be done
> >> at format time with mke2fs or afterward with tune2fs by using the
> >> "-E stripe_width" option.
> >
> > Ah, that sounds useful. So would I set the stripe_width to the
> > erase block size, and the block group size to a multiple of that?
>
> When you write "block group size" do you mean the ext4 block group?

Yes.

> Then yes it would help. You could also consider setting the flex_bg
> size to a multiple of this, so that the bitmap blocks are grouped as
> a multiple of this size. However, they may not be aligned correctly,
> which needs extra effort that isn't obvious.
>
> I think it would be nice to have mke2fs take the stripe_width and/or
> flex_bg factor into account when sizing/aligning the bitmaps, but it
> doesn't yet.

A few more questions:

* On cards that can only write to a single erase block at a time,
should I make the block group size the same as the erase
block? I suppose writing block bitmaps, inodes and data to
separate erase blocks would create multiple eraseblock
read-modify-write cycles for every single file otherwise.

* Is it guaranteed that inode bitmap, inode, block bitmap and
blocks are always written in low-to-high sector order within
one ext4 block group? A lot of the drives will do a garbage-collect
step (adding hundreds of milliseconds) every time you move back
inside of the eraseblock.

* Is there any way to make ext4 use effective blocks larger
than 4 KB? The most common size for a NAND flash page is 16
KB right now (effectively, ignoring what the hardware does), so
it would be good to never write anything smaller.

* Calling TRIM on SD cards is probably counterproductive unless
you trim entire erase blocks. Is that even possible with ext4,
assuming that we use block group == erase block?

* Is there a way to put the journal into specific parts of the
drive? Almost all SD cards have an area in the second 4 MB
(more for larger cards) that can be written using random access
without forcing garbage collection on other parts.

> > Does this also work in (rare) cases where the erase block size is
> > not a power of two?
>
> It does (or is supposed to), but that isn't code that is exercised
> very much (most installations use a power-of-two size).

Ok. Recently, cheap TLC (three-level cell, 3-bit MLC) NAND is
becoming popular. I've seen erase block sizes of 6 MiB, 1376 KiB
(4096 / 3, rounded up) and 4128 KiB (1376 * 3) because of this, in
place of the common 4096 KiB. The SD card standard specifies
values of 12 MB and 24 MB aside from the usual power-of-two values
up to 64 MB for large cards (>32GB), while smaller cards are allowed
only up to 4 MB erase blocks and need to be power-of-two. Many
cards do not use the size they claim in their registers.

Arnd

2011-03-22 15:09:10

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.

On 2011-03-22, at 2:56 PM, Arnd Bergmann wrote:
> On Tuesday 22 March 2011, Andreas Dilger wrote:
>> On 2011-03-21, at 8:05 PM, Arnd Bergmann wrote:
>>> So would I set the stripe_width to the erase block size, and the
>>> block group size to a multiple of that?
>>
>> When you write "block group size" do you mean the ext4 block group?
>
> Yes.
>
>> Then yes it would help. You could also consider setting the flex_bg
>> size to a multiple of this, so that the bitmap blocks are grouped as
>> a multiple of this size. However, they may not be aligned correctly,
>> which needs extra effort that isn't obvious.
>>
>> I think it would be nice to have mke2fs take the stripe_width and/or
>> flex_bg factor into account when sizing/aligning the bitmaps, but it
>> doesn't yet.
>
> A few more questions:
>
> * On cards that can only write to a single erase block at a time,
> should I make the block group size the same as the erase
> block? I suppose writing block bitmaps, inodes and data to
> separate erase blocks would create multiple eraseblock
> read-modify-write cycles for every single file otherwise.

That doesn't seem like a very good idea. It will significantly limit the size of the filesystem, and will cause a lot of overhead (two bitmaps per group for only a handful of blocks).

> * Is it guaranteed that inode bitmap, inode, block bitmap and
> blocks are always written in low-to-high sector order within
> one ext4 block group? A lot of the drives will do a garbage-collect
> step (adding hundreds of milliseconds) every time you move back
> inside of the eraseblock.

Generally, yes. I don't think there is a hard guarantee, but the block device elevator will sort the blocks.

> * Is there any way to make ext4 use effective blocks larger
> than 4 KB? The most common size for a NAND flash page is 16
> KB right now (effectively, ignoring what the hardware does), so
> it would be good to never write anything smaller.

You may be interested in Ted's bigalloc patchset. This will force block allocation to be at a power-of-two multiple of the blocksize, so it could be 16kB or whatever. However, this is inefficient if the average filesize is not large enough.
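
(A hypothetical invocation, assuming the patchset's mke2fs support:
the cluster size is a power-of-two multiple of the block size, so a
16kB cluster on a 4k-block filesystem would be

	mke2fs -t ext4 -O bigalloc -C 16384 /dev/mmcblk0p2

with device name and sizes as placeholders.)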

> * Calling TRIM on SD cards is probably counterproductive unless
> you trim entire erase blocks. Is that even possible with ext4,
> assuming that we use block group == erase block?

That is already the case, if the underlying storage reports the erase block size to the filesystem.

> * Is there a way to put the journal into specific parts of the
> drive? Almost all SD cards have an area in the second 4 MB
> (more for larger cards) that can be written using random access
> without forcing garbage collection on other parts.

That would need a small patch to mke2fs. I've been interested in this also for other reasons, but haven't had time to work on it. It will likely need only some small adjustments to ext2fs_add_journal_inode() to allow passing the goal block, and write_journal_inode() to use the goal block instead of its internal heuristic. The default location of the journal inode was previously moved from the beginning of the filesystem to the middle of the filesystem for performance reasons, so this is mostly already handled.

>>> Does this also work in (rare) cases where the erase block size is
>>> not a power of two?
>>
>> It does (or is supposed to), but that isn't code that is exercised
>> very much (most installations use a power-of-two size).
>
> Ok. Recently, cheap TLC (three-level cell, 3-bit MLC) NAND is
> becoming popular. I've seen erase block sizes of 6 MiB, 1376 KiB
> (4096 / 3, rounded up) and 4128 KiB (1376 * 3) because of this, in
> place of the common 4096 KiB. The SD card standard specifies
> values of 12 MB and 24 MB aside from the usual power-of-two values
> up to 64 MB for large cards (>32GB), while smaller cards are allowed
> only up to 4 MB erase blocks and need to be power-of-two. Many
> cards do not use the size they claim in their registers.

Well, the large erase block size is not in itself a problem, but if the devices do not use the reported erase block size internally, there is nothing much that ext4 or the rest of the kernel can do about it, since it has no other way of knowing what the real erase block size is.

Cheers, Andreas






2011-03-22 15:44:09

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.

On Tuesday 22 March 2011, Andreas Dilger wrote:
> On 2011-03-22, at 2:56 PM, Arnd Bergmann wrote:
> > On Tuesday 22 March 2011, Andreas Dilger wrote:
> >> On 2011-03-21, at 8:05 PM, Arnd Bergmann wrote:
> >
> > * On cards that can only write to a single erase block at a time,
> > should I make the block group size the same as the erase
> > block? I suppose writing block bitmaps, inodes and data to
> > separate erase blocks would create multiple eraseblock
> > read-modify-write cycles for every single file otherwise.
>
> That doesn't seem like a very good idea. It will significantly limit
> the size of the filesystem, and will cause a lot of overhead (two bitmaps
> per group for only a handful of blocks).

I'm willing to spend a little space overhead in return for one
or two orders of magnitude in performance and life expectancy
for the card ;-)

A typical case is that a single-page (16KB) write to the currently
open erase block takes 1ms, but since writing to another erase block
requires a garbage-collection (erase-rewrite 4 MB), it takes 500 ms,
just like the following access to the first erase block, which
has now been closed.

Every erase cycle ages the drive, and on some cheap ones, you only
have about 2000 guaranteed erases per erase block!

> > * Is it guaranteed that inode bitmap, inode, block bitmap and
> > blocks are always written in low-to-high sector order within
> > one ext4 block group? A lot of the drives will do a garbage-collect
> > step (adding hundreds of milliseconds) every time you move back
> > inside of the eraseblock.
>
> Generally, yes. I don't think there is a hard guarantee,
> but the block device elevator will sort the blocks.

Ok.

> > * Is there any way to make ext4 use effective blocks larger
> > than 4 KB? The most common size for a NAND flash page is 16
> > KB right now (effectively, ignoring what the hardware does), so
> > it would be good to never write anything smaller.
>
> You may be interested in Ted's bigalloc patchset. This will force
> block allocation to be at a power-of-two multiple of the blocksize,
> so it could be 16kB or whatever. However, this is inefficient if
> the average filesize is not large enough.

Is it just a space tradeoff, or is there also a performance
overhead in this?

> > * Calling TRIM on SD cards is probably counterproductive unless
> > you trim entire erase blocks. Is that even possible with ext4,
> > assuming that we use block group == erase block?
>
> That is already the case, if the underlying storage reports the
> erase block size to the filesystem.

Ok, I should try to find out how this is done on SD cards.
The hardware interface allows erasing 512 byte sectors, so
we might be reporting that instead.
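
(If I remember correctly, what the kernel currently reports can be
read from sysfs, e.g.

	cat /sys/block/mmcblk0/device/preferred_erase_size
	cat /sys/block/mmcblk0/queue/discard_granularity

though the device name is just an example and this is worth
double-checking.)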

> > * Is there a way to put the journal into specific parts of the
> > drive? Almost all SD cards have an area in the second 4 MB
> > (more for larger cards) that can be written using random access
> > without forcing garbage collection on other parts.
>
> That would need a small patch to mke2fs. I've been interested in
> this also for other reasons, but haven't had time to work on it.
> It will likely need only some small adjustments to
> ext2fs_add_journal_inode() to allow passing the goal block, and
> write_journal_inode() to use the goal block instead of its internal
> heuristic. The default location of the journal inode was previously
> moved from the beginning of the filesystem to the middle of the
> filesystem for performance reasons, so this is mostly already handled.

Ok. It was previously suggested to put an external journal on a
4 MB partition for experimenting with this. I hope I can back this
up with performance numbers soon.
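
Something like this should work for the experiment, with made-up
partition names (the journal device and the filesystem need the same
block size):

	mke2fs -O journal_dev -b 4096 /dev/mmcblk0p1
	mke2fs -t ext4 -b 4096 -J device=/dev/mmcblk0p1 /dev/mmcblk0p2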

> Well, the large erase block size is not in itself a problem, but
> if the devices do not use the reported erase block size internally,
> there is nothing much that ext4 or the rest of the kernel can do
> about it, since it has no other way of knowing what the real erase
> block size is.

For SDHC cards, the typical case is that they are reasonably efficient
when you use the reported size, because they are tested that way.
A lot of cards use 2MiB internally but report 4MiB, which is fine
as long as you write the 4MB consecutively and don't alternate
between the two halves.

The split into three erase blocks of (4 MiB / 3) is on low-end
SanDisk cards, and I believe it has mostly advantages and will
work well if we use the reported 4 MB.

The 4128 KiB erase blocks are on a USB stick, and those devices
do not report any erase block size at all.

I have written a tool to detect the actual erase block size,
and perhaps that could be integrated into mke2fs and similar
tools.
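
(This is the align test in the flashbench tool mentioned earlier; a
typical run, with the device name as a placeholder, looks like

	flashbench -a /dev/mmcblk0 --blocksize=1024

and erase block boundaries show up as jumps in the measured access
times.)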

Arnd