2008-10-08 16:39:17

by Stefan Monnier

Subject: Filesystem for block devices using flash storage?


Google finds some people asking this same question, but I couldn't find
any answer to it: what filesystem is recommended to use on a flash-based
disk that does not give access to the MTD layer (e.g. USB keys,
most SSDs, ...)?

Since they do their own wear-levelling, any filesystem should be "safe",
but I expect there is still a lot of variance in terms of performance,
wear, robustness, ...

Has anyone conducted serious experiments to try and find out what works
better? Also, since it appears that such devices are here to stay,
would there be a need to design a new filesystem or to tune existing
filesystems for this particular kind of device?

Or is there some hope for SSDs to provide access to the MTD layer in the
not too distant future?


Stefan


2008-10-08 20:51:25

by Chris Snook

Subject: Re: Filesystem for block devices using flash storage?

Stefan Monnier wrote:
> Google finds some people asking this same question, but I couldn't find
> any answer to it: what filesystem is recommended to use on a flash-based
> disk that does not give access to the MTD layer (e.g. USB keys,
> most SSDs, ...)?

Unless you really know what you're doing, you should use a general-purpose disk
filesystem. You probably also want to use the relatime mount option, which is
default on some distros.

> Since they do their own wear-levelling, any filesystem should be "safe",
> but I expect there is still a lot of variance in terms of performance,
> wear, robustness, ...

Writes to magnetic disks are functionally atomic at the sector level. With
SSDs, writing requires an erase followed by rewriting the sectors that aren't
changing. This means that an ill-timed power loss can corrupt an entire erase
block, which could be up to 256k on some MLC flash. Unless you have a RAID card
with a battery-backed write cache, your best bet is probably data journaling.
On ext3, you can enable this with the data=journal mount option or the
rootflags=data=journal kernel parameter for your root filesystem. It's entirely
possible that doing this will severely harm your performance, though it's also
possible that it may actually help it if you use a larger-than-default journal,
thanks to improved write coalescing.

> Has anyone conducted serious experiments to try and find out what works
> better? Also, since it appears that such devices are here to stay,
> would there be a need to design a new filesystem or to tune existing
> filesystems for this particular kind of device?

logfs tries to solve the write amplification problem by forcing all write
activity to be sequential. I'm not sure how mature it is.

> Or is there some hope for SSDs to provide access to the MTD layer in the
> not too distant future?

I hope not. The proper fix is to have the devices report their physical
topology via SCSI/ATA commands. This allows dumb software to function
correctly, albeit inefficiently, and allows smart software to optimize itself.
This technique also helps with RAID arrays, large-sector disks, etc.

I suspect that in the long run, the problem will go away. Erase blocks are a
relic of the days when flash was used primarily for low-power, read-mostly
applications. As the SSD market heats up, the flash vendors will move to
smaller erase blocks, possibly as small as the sector size. Intel is already
boasting that their new SSDs have a write amplification factor of only 1.1,
which leaves very little room for improvement with erase-block-aware filesystems.

-- Chris

2008-10-11 14:49:01

by Pavel Machek

Subject: Re: Filesystem for block devices using flash storage?

On Wed 2008-10-08 16:51:46, Chris Snook wrote:
> Stefan Monnier wrote:
>> Google finds some people asking this same question, but I couldn't find
>> any answer to it: what filesystem is recommended to use on a flash-based
>> disk that does not give access to the MTD layer (e.g. USB keys,
>> most SSDs, ...)?
>
> Unless you really know what you're doing, you should use a
> general-purpose disk filesystem. You probably also want to use the
> relatime mount option, which is default on some distros.
>
>> Since they do their own wear-levelling, any filesystem should be "safe",
>> but I expect there is still a lot of variance in terms of performance,
>> wear, robustness, ...
>
> Writes to magnetic disks are functionally atomic at the sector level.
> With SSDs, writing requires an erase followed by rewriting the sectors
> that aren't changing. This means that an ill-timed power loss can
> corrupt an entire erase block, which could be up to 256k on some MLC
> flash. Unless you have a RAID card with a battery-backed write cache,
> your best bet is probably data journaling. On ext3, you can enable this
> with the data=journal mount option or the rootflags=data=journal kernel
> parameter for your root filesystem. It's entirely possible that doing

I don't think ext3 is safe w.r.t. whole eraseblocks disappearing. So
if you write data 'nearby' root directory and power fails, bye bye
filesystem, and journal will not help.

Actually ext2 will at least detect damage...

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-10-11 16:29:46

by Arjan van de Ven

Subject: Re: Filesystem for block devices using flash storage?


> > Writes to magnetic disks are functionally atomic at the sector
> > level. With SSDs, writing requires an erase followed by rewriting
> > the sectors that aren't changing. This means that an ill-timed
> > power loss can corrupt an entire erase block, which could be up to
> > 256k on some MLC flash. Unless you have a RAID card with a
> > battery-backed write cache, your best bet is probably data
> > journaling. On ext3, you can enable this with the data=journal
> > mount option or the rootflags=data=journal kernel parameter for
> > your root filesystem. It's entirely possible that doing
>
> I don't think ext3 is safe w.r.t. whole eraseblocks disappearing. So
> if you write data 'nearby' root directory and power fails, bye bye
> filesystem, and journal will not help.
>
> Actually ext2 will at least detect damage...

SSDs generally (and ones that are even remotely worth their money for
sure) erase blocks that have no data in them.
They keep empty blocks around, and when they want to erase a block with
"half data", they first move that data to an empty block before erasing
the old block.

This moving around is just a natural part of wear leveling..


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-10-11 17:52:29

by Alan

Subject: Re: Filesystem for block devices using flash storage?

> > Writes to magnetic disks are functionally atomic at the sector level.

No they are not.

> I don't think ext3 is safe w.r.t. whole eraseblocks disappearing. So
> if you write data 'nearby' root directory and power fails, bye bye
> filesystem, and journal will not help.

You have similar problems on rotating media. A write to a block can
corrupt other blocks adjacent to the block you write, and on the latest
disks the physical block size can be greater than the logical one so
unless you laid your partitions out right you have read-modify-write
cycles going on.
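
To make the alignment point concrete, here is a minimal Python sketch. It assumes 512-byte logical sectors on a disk with 4096-byte physical blocks; both sizes and the function name are illustrative choices, not figures from this thread. It reports whether a write covers only whole physical blocks or forces the device into a read-modify-write cycle.

```python
def needs_read_modify_write(start_sector, sector_count,
                            logical_size=512, physical_size=4096):
    """Return True if the write does not cover whole physical blocks.

    The sector sizes are assumptions for illustration; real devices
    report their own values.
    """
    start_byte = start_sector * logical_size
    length = sector_count * logical_size
    # A write avoids read-modify-write only if it starts on a physical
    # block boundary and spans a whole number of physical blocks.
    return start_byte % physical_size != 0 or length % physical_size != 0


# A 4 KiB filesystem block at the start of a partition that begins at
# sector 63 straddles two physical blocks; at sector 64 it does not.
print(needs_read_modify_write(63, 8))  # True
print(needs_read_modify_write(64, 8))  # False
```

Under these assumptions, a partition starting at sector 63 misaligns every 4 KiB filesystem block, while one starting at sector 64 keeps them all aligned.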

Alan

2008-10-12 13:01:27

by Jörn Engel

Subject: Re: Filesystem for block devices using flash storage?

On Sat, 11 October 2008 16:35:52 +0200, Pavel Machek wrote:
>
> I don't think ext3 is safe w.r.t. whole eraseblocks disappearing. So
> if you write data 'nearby' root directory and power fails, bye bye
> filesystem, and journal will not help.

No device I've ever seen is that stupid.

Judging by the reports I have of actual corruptions, the main problem
appears to be write disturb. Writes tend to inject charge into
neighboring blocks. If any part of the device gets hammered with
writes, there is a good chance of corruption nearby.

Write performance is another issue, of course.

Jörn

--
You ain't got no problem, Jules. I'm on the motherfucker. Go back in
there, chill them niggers out and wait for the Wolf, who should be
coming directly.
-- Marsellus Wallace

2008-10-12 14:35:31

by Jörn Engel

Subject: Re: Filesystem for block devices using flash storage?

On Wed, 8 October 2008 16:51:46 -0400, Chris Snook wrote:
> Stefan Monnier wrote:
>
> Writes to magnetic disks are functionally atomic at the sector level. With
> SSDs, writing requires an erase followed by rewriting the sectors that
> aren't changing. This means that an ill-timed power loss can corrupt an
> entire erase block, which could be up to 256k on some MLC flash. Unless

What makes you think that? The standard mode of operation in El Cheapo
devices is to write to a new eraseblock first, then delete the old one.
An ill-timed power loss results in either the old or the new block being
valid as a whole. This has been the standard ever since you could buy
4MB compactflash cards.

> logfs tries to solve the write amplification problem by forcing all write
> activity to be sequential. I'm not sure how mature it is.

Still under development. What exactly do you mean by the write
amplification problem?

> >Or is there some hope for SSDs to provide access to the MTD layer in the
> >not too distant future?
>
> I hope not. The proper fix is to have the devices report their physical
> topology via SCSI/ATA commands. This allows dumb software to function
> correctly, albeit inefficiently, and allows smart software to optimize
> itself. This technique also helps with RAID arrays, large-sector disks, etc.

Having access to the actual flash would provide a large number of
benefits. It just isn't a safe default choice at the moment.

> I suspect that in the long run, the problem will go away. Erase blocks are
> a relic of the days when flash was used primarily for low-power,
> read-mostly applications. As the SSD market heats up, the flash vendors
> will move to smaller erase blocks, possibly as small as the sector size.

Do you have any information to back this claim? AFAICT smaller erase
blocks would require more chip area per bit, making devices more
expensive. If anything, I can see a trend towards bigger erase blocks.

Jörn

--
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
-- /usr/src/linux/init/main.c

2008-10-13 10:57:17

by Pavel Machek

Subject: Re: Filesystem for block devices using flash storage?

> On Sat, 11 October 2008 16:35:52 +0200, Pavel Machek wrote:
> >
> > I don't think ext3 is safe w.r.t. whole eraseblocks disappearing. So
> > if you write data 'nearby' root directory and power fails, bye bye
> > filesystem, and journal will not help.
>
> No device I've ever seen is that stupid.

I have adata SD flash cards that loose data even normal use, no power
fails needed.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-10-13 12:10:32

by Jörn Engel

Subject: Re: Filesystem for block devices using flash storage?

On Mon, 13 October 2008 12:57:00 +0200, Pavel Machek wrote:
> > On Sat, 11 October 2008 16:35:52 +0200, Pavel Machek wrote:
> > >
> > > I don't think ext3 is safe w.r.t. whole eraseblocks disappearing. So
> > > if you write data 'nearby' root directory and power fails, bye bye
> > > filesystem, and journal will not help.
> >
> > No device I've ever seen is that stupid.
>
> I have adata SD flash cards that loose data even normal use, no power
> fails needed.

I don't doubt that. There is a lot of crap for sale. But your
description above does not match any crap I have ever encountered or
heard about. Nor would it explain why you "loose data even normal use,
no power fails needed." ;)

Jörn

--
Schrödinger's cat is <BLINK>not</BLINK> dead.
-- Illiad

2008-10-13 17:30:57

by Chris Snook

Subject: Re: Filesystem for block devices using flash storage?

Jörn Engel wrote:
> On Wed, 8 October 2008 16:51:46 -0400, Chris Snook wrote:
>> Stefan Monnier wrote:
>>
>> Writes to magnetic disks are functionally atomic at the sector level. With
>> SSDs, writing requires an erase followed by rewriting the sectors that
>> aren't changing. This means that an ill-timed power loss can corrupt an
>> entire erase block, which could be up to 256k on some MLC flash. Unless
>
> What makes you think that? The standard mode of operation in El Cheapo
> devices is to write to a new eraseblock first, then delete the old one.
> An ill-timed power loss results in either the old or the new block being
> valid as a whole. This has been the standard ever since you could buy
> 4MB compactflash cards.
>
>> logfs tries to solve the write amplification problem by forcing all write
>> activity to be sequential. I'm not sure how mature it is.
>
> Still under development. What exactly do you mean by the write
> amplification problem?

Write amplification is where a 512 byte write turns into a 128k write,
due to erase block size.
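
As a rough sketch of that arithmetic in Python: the function below models only the naive worst case, where every host write forces a rewrite of each erase block it touches. It assumes the write starts on an erase block boundary, and the function name is made up for illustration.

```python
def write_amplification(host_bytes, erase_block_bytes):
    """Naive worst-case factor: flash bytes written per host byte written.

    Assumes the write is aligned to an erase block boundary and that the
    device rewrites every erase block the write touches.
    """
    blocks_touched = -(-host_bytes // erase_block_bytes)  # ceiling division
    flash_bytes = blocks_touched * erase_block_bytes
    return flash_bytes / host_bytes


print(write_amplification(512, 128 * 1024))         # 256.0
print(write_amplification(128 * 1024, 128 * 1024))  # 1.0
```

In this model a single 512-byte write costs 256 times its size on flash, while a write that fills a whole erase block costs nothing extra.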

>>> Or is there some hope for SSDs to provide access to the MTD layer in the
>>> not too distant future?
>> I hope not. The proper fix is to have the devices report their physical
>> topology via SCSI/ATA commands. This allows dumb software to function
>> correctly, albeit inefficiently, and allows smart software to optimize
>> itself. This technique also helps with RAID arrays, large-sector disks, etc.
>
> Having access to the actual flash would provide a large number of
> benefits. It just isn't a safe default choice at the moment.
>
>> I suspect that in the long run, the problem will go away. Erase blocks are
>> a relic of the days when flash was used primarily for low-power,
>> read-mostly applications. As the SSD market heats up, the flash vendors
>> will move to smaller erase blocks, possibly as small as the sector size.
>
> Do you have any information to back this claim? AFAICT smaller erase
> blocks would require more chip area per bit, making devices more
> expensive. If anything, I can see a trend towards bigger erase blocks.

Intel is claiming a write amplification factor of 1.1. Either they're
using very small erase blocks, or doing something very smart in the
controller.

-- Chris

2008-10-13 18:13:58

by Jörn Engel

Subject: Re: Filesystem for block devices using flash storage?

On Mon, 13 October 2008 13:30:29 -0400, Chris Snook wrote:
> >
> >>logfs tries to solve the write amplification problem by forcing all write
> >>activity to be sequential. I'm not sure how mature it is.
> >
> >Still under development. What exactly do you mean by the write
> >amplification problem?
>
> Write amplification is where a 512 byte write turns into a 128k write,
> due to erase block size.

Ah, yes. Current logfs still triggers that a bit too often. I'm
currently working on the format changes to avoid the amplification as
much as possible.

Another nasty side effect of this is that heuristics for wear leveling
are always imprecise. And wear leveling is still required for most
devices. See http://www.linuxconf.eu/2007/papers/Engel.pdf

> Intel is claiming a write amplification factor of 1.1. Either they're
> using very small erase blocks, or doing something very smart in the
> controller.

With very small erase blocks the factor should be either 1 or 2, not
1.1. Most likely they work very much like logfs does, essentially doing
the whole log-structured thing internally.

Jörn

--
Das Aufregende am Schreiben ist es, eine Ordnung zu schaffen, wo
vorher keine existiert hat.
-- Doris Lessing

2008-10-13 18:38:18

by Chris Snook

Subject: Re: Filesystem for block devices using flash storage?

Jörn Engel wrote:
> On Mon, 13 October 2008 13:30:29 -0400, Chris Snook wrote:
>>>> logfs tries to solve the write amplification problem by forcing all write
>>>> activity to be sequential. I'm not sure how mature it is.
>>> Still under development. What exactly do you mean by the write
>>> amplification problem?
>> Write amplification is where a 512 byte write turns into a 128k write,
>> due to erase block size.
>
> Ah, yes. Current logfs still triggers that a bit too often. I'm
> currently working on the format changes to avoid the amplification as
> much as possible.
>
> Another nasty side effect of this is that heuristics for wear leveling
> are always imprecise. And wear leveling is still required for most
> devices. See http://www.linuxconf.eu/2007/papers/Engel.pdf
>
>> Intel is claiming a write amplification factor of 1.1. Either they're
>> using very small erase blocks, or doing something very smart in the
>> controller.
>
> With very small erase blocks the factor should be either 1 or 2, not
> 1.1. Most likely they work very much like logfs does, essentially doing
> the whole log-structured thing internally.
>
> Jörn
>

As I understand it, they mean that in a real-world workload that writes 1x data,
a total of 1.1x is written on flash. Real-world writes are usually, but not
always, larger than a single sector. Of course, the validity of this number
depends greatly on the test.

If someone has more info on the Intel devices, please clue me in.

-- Chris

2008-10-14 11:18:20

by Jörn Engel

Subject: Re: Filesystem for block devices using flash storage?

Rereading the thread, you haven't received a good answer yet. Which is
understandable, given the diversity and secrecy of the subject. The
properties of flash are reasonably well understood. To create a block
device, you need to add an FTL. How the FTL works depends on the device
in question, and you will never receive any documentation with the
device. In short, you never know.

Unless the device comes from the cheap end. Practically everyone is
using the same FTL for cheap devices, with some minor tweaks. I've
written down the basics here:
http://www.linuxconf.eu/2007/papers/Engel.pdf

More expensive devices may still behave the same, may do something
better or may attempt to do something better and actually be worse. One
never knows, so I'll pretend that every device is cheap from now on.

On Wed, 8 October 2008 12:38:51 -0400, Stefan Monnier wrote:
>
> Google finds some people asking this same question, but I couldn't find
> any answer to it: what filesystem is recommended to use on a flash-based
> disk that does not give access to the MTD layer (e.g. USB keys,
> most SSDs, ...)?

Currently: Either fat or none at all.

> Since they do their own wear-levelling, any filesystem should be "safe",
> but I expect there is still a lot of variance in terms of performance,
> wear, robustness, ...

The wear leveling is not done for the complete device, only for a subset
of usually 1024 blocks. If you keep pounding the same (logical) block
over and over, the number of physical blocks you write to is either 25
or 1024, depending on whether the device does static wear leveling.

I have reports of people breaking their devices with a trivial script in
less than a day (an hour, iirc).
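
A back-of-the-envelope Python sketch shows why such a script can kill a device so quickly. The endurance figure (10,000 erase cycles per block) and the write rate (100 rewrites per second) are assumptions chosen for illustration, not measurements from this thread; only the 25 and 1024 block counts come from the text above. The model also assumes each rewrite costs exactly one erase.

```python
def seconds_until_worn_out(physical_blocks, erase_cycles_per_block,
                           writes_per_second):
    """Time until every block in the wear-leveling pool hits its erase limit.

    Assumes each rewrite of the logical block costs exactly one erase and
    that wear is spread perfectly evenly over the pool.
    """
    total_writes = physical_blocks * erase_cycles_per_block
    return total_writes / writes_per_second


ASSUMED_ENDURANCE = 10_000  # erase cycles per block (illustrative assumption)
ASSUMED_RATE = 100          # rewrites per second (illustrative assumption)

for blocks in (25, 1024):
    hours = seconds_until_worn_out(blocks, ASSUMED_ENDURANCE, ASSUMED_RATE) / 3600
    print(f"{blocks:5d} blocks: about {hours:.1f} hours")
```

With these assumed figures, a pool of 25 blocks is exhausted in well under an hour, and a pool of 1024 blocks in a little over a day.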

> Has anyone conducted serious experiments to try and find out what works
> better? Also, since it appears that such devices are here to stay,
> would there be a need to design a new filesystem or to tune existing
> filesystems for this particular kind of device?

Some expensive devices seem to work well with any filesystem. As for the
cheap stuff, a new design is needed. The shopping list includes:
1. vast majority of writes should be eraseblock-sized and -aligned
2. wear leveling
3. scrubbing
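
The first item on that list can be sketched in Python as a buffer that collects small writes and releases only whole, eraseblock-sized chunks. This is a minimal illustration of the idea, not how logfs or any other filesystem implements it; the class name is hypothetical, and alignment on the device only holds if the chunks are then written sequentially from an eraseblock boundary.

```python
class EraseblockBuffer:
    """Collect small appends and emit only whole, eraseblock-sized chunks."""

    def __init__(self, eraseblock_size):
        self.eraseblock_size = eraseblock_size
        self.pending = bytearray()

    def append(self, data):
        """Buffer data and return the full chunks that are ready to write."""
        self.pending += data
        chunks = []
        while len(self.pending) >= self.eraseblock_size:
            chunks.append(bytes(self.pending[:self.eraseblock_size]))
            del self.pending[:self.eraseblock_size]
        return chunks


buf = EraseblockBuffer(128 * 1024)
flushed = []
for _ in range(300):                  # 300 sector-sized writes
    flushed.extend(buf.append(bytes(512)))
print(len(flushed), len(buf.pending))
```

Here 300 writes of 512 bytes each turn into a single 128k write to flash, with the remainder held back until the next chunk fills. A real implementation would also need to flush the partial chunk on sync, which this sketch leaves out.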

And quite frankly, no filesystem currently fits the bill. Closest
contenders are btrfs, nilfs and logfs, all of which are still under
development. Of those, logfs is the only one designed explicitly for
flash and happens to be my brainchild. So naturally my opinion is
biased and I will refrain from any further arguments for or against. :)

Current status of logfs is that I'm currently fixing one design issue
that caused many small writes, then have to do some random minor changes
to the format and... it should be useable sometime this year.

> Or is there some hope for SSDs to provide access to the MTD layer in the
> not too distant future?

I've talked to manufacturers and not seen any enthusiasm for that idea.
Most actually have some undocumented commands for raw access - for
testing and QA. They simply see no benefit in exposing these to the
public. And it is trivial to brick a device with such commands - in the
sense that the FTL no longer works.

Jörn

--
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall

2008-10-14 13:15:28

by Stefan Monnier

Subject: Re: Filesystem for block devices using flash storage?

>> Or is there some hope for SSDs to provide access to the MTD layer in the
>> not too distant future?
> I've talked to manufacturers and not seen any enthusiasm for that idea.
> Most actually have some undocumented commands for raw access - for
> testing and QA. They simply see no benefit in exposing these to the
> public.

Maybe if we could get our hands on such commands for a few such devices
and publicize a comparison between UBIFS and ext2 on it, we might get
some traction (assuming the comparison shows a significant benefit,
obviously)?

> And it is trivial to brick a device with such commands - in the
> sense that the FTL no longer works.

But that just requires an additional "format" command, right?
Actually, they could even force the use of format not only before the
use of FTL, but also before the use of the MTD layer, so they could
completely hide the underlying FTL metadata from reverse
engineering efforts (since they seem to care so much about their stupid
secrecy).


Stefan

2008-10-14 18:04:59

by Lennart Sorensen

Subject: Re: Filesystem for block devices using flash storage?

On Sat, Oct 11, 2008 at 04:35:52PM +0200, Pavel Machek wrote:
> I don't think ext3 is safe w.r.t. whole eraseblocks disappearing. So
> if you write data 'nearby' root directory and power fails, bye bye
> filesystem, and journal will not help.
>
> Actually ext2 will at least detect damage...

I have never seen a flash device that worked that way. All the ones I
have seen have extra spare blocks and will copy an existing block to an
empty block changing the required bits while doing the copy to represent
the new data to be written. When done, they update the block map of the
device to point to the new block, then the old block is erased and added
to the spare block list.

This is also used as part of wear leveling, where better devices will
occasionally take a rarely written block, move it to a more used spare
block, and then add the previously rarely used block to the spare list
for more use.
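
The copy-then-remap scheme described above can be sketched as a toy model in Python. Everything here is a simplification for illustration: the class name, the tiny block size and the power_fails flag are made up, and real FTLs are undocumented and certainly more involved. The point is only that the old block stays valid until the block map is switched.

```python
BLOCK_SIZE = 16                   # tiny block size, for illustration only
ERASED = b"\xff" * BLOCK_SIZE     # erased flash reads back as all ones


class ToyFTL:
    """Toy flash translation layer with a block map and a spare list."""

    def __init__(self, logical_blocks, spare_blocks):
        total = logical_blocks + spare_blocks
        self.flash = [ERASED] * total                   # physical blocks
        self.block_map = list(range(logical_blocks))    # logical -> physical
        self.spares = list(range(logical_blocks, total))

    def read(self, logical):
        return self.flash[self.block_map[logical]]

    def write(self, logical, offset, payload, power_fails=False):
        assert offset + len(payload) <= BLOCK_SIZE
        old = self.block_map[logical]
        new = self.spares.pop(0)
        # 1. Copy the old block into a spare, changing the bytes being written.
        merged = bytearray(self.flash[old])
        merged[offset:offset + len(payload)] = payload
        self.flash[new] = bytes(merged)
        if power_fails:
            # Simulated power loss before the map update: the old block is
            # still the valid one, and the half-used spare is reclaimed.
            self.spares.insert(0, new)
            return
        # 2. Point the block map at the new block.
        self.block_map[logical] = new
        # 3. Erase the old block and return it to the spare list.
        self.flash[old] = ERASED
        self.spares.append(old)


ftl = ToyFTL(logical_blocks=4, spare_blocks=2)
ftl.write(0, 0, b"old")
ftl.write(0, 0, b"new", power_fails=True)
print(ftl.read(0)[:3])   # b'old'
ftl.write(0, 0, b"new")
print(ftl.read(0)[:3])   # b'new'
```

In this model an interrupted write leaves the previous contents readable, which matches the behaviour described for devices with spare blocks. It says nothing about devices that update blocks in place.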

In cases of decent devices like this, ext3 works great. I have never
had a chunk of the filesystem disappear yet, although perhaps 2000
compact flash using units isn't a large enough data set to say much.

--
Len Sorensen