Hi,
I would like some advice regarding ext4 features and options.
I am working with an embedded system equipped with an IDE Flash Disk and the ext4 filesystem. I have identified three problems that I would like to solve in our product. The power is abruptly turned off from time to time, and this has sometimes resulted in a broken superblock (inode 8) and in empty files with a size of 0 bytes. It also happens that file changes are not committed to disk even if minutes pass before a power loss. This is very undesirable and expensive in our case, and we are searching for a solution or a workaround for these problems.
The problems I would like to solve:
1. Broken superblock (inode 8).
2. Empty files, size 0.
3. Very long auto-commit times, several minutes with default settings.
Is ext4 a bad choice for an embedded system with a 1 GB IDE Flash Disk and Debian 2.6.32-5-686? Should we change filesystems?
I am planning to set the mount flags barrier=1, commit=1 and data=journal. Is this the way to go with an embedded system and ext4?
Are there more ext4 options/settings I can use to make an embedded system reliable when a power loss occurs?
Best Regards
Fredrik Ohlsson
Software Engineer
On Wed, Nov 14, 2012 at 11:41:59AM +0100, Ohlsson, Fredrik (GE Healthcare, consultant) wrote:
> I am working with an embedded system equipped with an IDE Flash Disk
> and the ext4 filesystem. I have identified three problems that I
> would like to solve in our product. The power is abruptly turned off
> from time to time, and this has sometimes resulted in a broken
> superblock (inode 8) and in empty files with a size of 0 bytes. It
> also happens that file changes are not committed to disk even if
> minutes pass before a power loss. This is very undesirable and
> expensive in our case, and we are searching for a solution or a
> workaround for these problems.
I'm not sure what you mean by "broken Superblock (inode 8)". Inode #8
is the journal superblock. I'm guessing you're seeing some kind of
corrupted journal superblock? It would be useful if you could send
kernel logs or e2fsck output so we can see exactly what is going on.
> The problems I would like to solve:
> 1. Broken superblock (inode 8).
> 2. Empty files, size 0.
> 3. Very long auto-commit times, several minutes with default settings.
The default auto commit time is 5 seconds. *However*, with delayed
allocation, writeback takes place after a 30 second timer, and
depending on how many dirty pages are outstanding, it might take a
while for all of the writeback to be completed. If you want to
simulate the behaviour you are used to with ext3, where at a journal
commit we force all writeback to complete before the commit is allowed
to proceed, you could use the nodelalloc mount option, but you will
see a corresponding hit in performance as a result. The better thing
to do is to make sure programs that care about data hitting stable
store use fsync(2) as appropriate, but unfortunately there are many
applications out there which don't do this, and I do understand that
fixing them all might be problematic. (On the other hand, for an
embedded system, it should be easier since you do control all of your
userspace applications.)
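For shell scripts, the rough equivalent is something like the following
(an illustrative sketch only; the file names are made up):

    # Write the new data, then force it to stable storage before
    # reporting success.  GNU dd's conv=fsync calls fsync(2) on the
    # output file before dd exits:
    dd if=settings.tmp of=/data/settings conv=fsync

    # Or, more bluntly, flush all dirty data system-wide:
    sync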
The other thing which may be going on is that there are crappy flash
devices out there which do not handle unexpected power failures
correctly. Hence, even if you have pushed data out to disk using a
CACHE FLUSH request (which is what barrier=1 does, and which is the
default BTW), there are flash devices which essentially lie and which
do not guarantee that data written before the CACHE FLUSH is stable by
the time the CACHE FLUSH command returns.
If you are seeing a corrupted journal superblock (which is what I
assume you meant by Broken Superblock inode 8), that's an indication
that the hardware is lying to us, and unfortunately, there's not much
any file system can do in that case. If the hardware is lying, you're
pretty much out of luck, and the only solution is to replace the
hardware with something which is competently engineered....
I would suggest trying to tackle these two problems separately. If
you want to make sure fsync is handled correctly, so that files are
flushed out when you need them to be, try doing a reset of the device
(without dropping power) and see if you can get rid of the zero-length
files. That should be relatively easy to handle.
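One easy way to do that kind of reset, assuming the magic SysRq key is
enabled in your kernel, is something like:

    # Reboot immediately *without* syncing or unmounting, which
    # simulates a crash while leaving power to the disk intact:
    echo 1 > /proc/sys/kernel/sysrq      # enable SysRq if necessary
    echo b > /proc/sysrq-trigger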
Then you can try to see what happens with a power drop.
Unfortunately, if it's what I suspect is going on, you have faulty
hardware, and there really is not anything we can do at the OS layer.
If I am correct that your IDE Flash Disk is some cheap piece of cr*p,
you can try using any file system you want, but you're probably going
to end up losing big time.
Regards,
- Ted
Hi, Fredrik,
On Wed, 2012-11-14 at 11:41 +0100, Ohlsson, Fredrik (GE Healthcare,
consultant) wrote:
> 2. Empty files, size 0.
Well, this is expected in some cases. If you create a file, start
appending data, and then have a power cut, you may end up with a
zero-sized file.
This is a Linux feature - the data is cached in RAM until write-back
kicks in or something like fsync() is called.
Ext4 has a feature whereby if you write to a file and then close it,
ext4 will initiate write-back for you right away. It was added a couple
of years ago to make ext4 more user-friendly.
You really should investigate what those files are and what was
happening to them just before the power cut. Zero-length files may be
perfectly normal, in general.
However, strictly _all_ files you care about should be explicitly
synced. This is just safer.
If you write serious medical software, you should take data integrity
seriously in your apps.
I wrote this section for UBIFS users a long time ago, and it is the
same (modulo UBIFS-specific details) for other Linux file-systems,
including ext4:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_writeback
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_sync_exceptions
> Is ext4 a bad choice for an embedded system with a 1 GB IDE Flash Disk and Debian 2.6.32-5-686? Should we change filesystems?
I do not know for sure, but I doubt there is serious power-cut testing
regularly conducted for ext4; people may correct me.
So if power-cut tolerance is important for you, you should conduct good
power-cut tests.
And remember, disk quality is very important for power-cut tolerance.
If you use something like badly managed flash (a bad SSD or eMMC), it
may lose recently written data on a power cut. So testing is
important.
Of course you should have barriers on as well.
We conducted some power-cut tests about 3 years ago. The results were
quite good for ext4 - in many cases it could recover without a need to
run fsck.ext4; sometimes it was not mountable, but fsck.ext4 helped.
By contrast, ext3 constantly required fsck.ext3, and sometimes died so
badly that even fsck.ext3 could not recover it.
--
Best Regards,
Artem Bityutskiy
On Thu, Nov 15, 2012 at 12:42 PM, Artem Bityutskiy <[email protected]> wrote:
> We conducted some power-cut tests about 3 years ago. The results were
> quite good for ext4 - in many cases it could recover without a need to
> run fsck.ext4; sometimes it was not mountable, but fsck.ext4 helped.
>
> By contrast, ext3 constantly required fsck.ext3, and sometimes died so
> badly that even fsck.ext3 could not recover it.
We ran about 6000 cycles of power resets with Linux 2.6.37. The test
was to run 3 tar processes unpacking a Linux kernel archive and cut
power after about 15 seconds. There were only 3 failures where the
file system couldn't be mounted, but those were due to HDD failure
(an unreadable sector in the journal area). e2fsck successfully
recovered those corruptions. As for the software itself, there was
not a single issue and we never needed to run fsck after a power
loss. So I'd say that ext4 is very tolerant of power losses, at
least in 2.6.37, assuming barriers and ordered data mode. I
understand, however, that this test is quite basic and results can
differ across kernels.
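For reference, the per-cycle logic was roughly the following (a
simplified sketch, not our actual harness; the device and archive
names are made up):

    # Runs on the device under test at each boot; an external
    # controller cuts the power ~15 seconds after the script starts.
    if ! mount /dev/sda1 /mnt; then
        e2fsck -fy /dev/sda1          # count this cycle as a failure
        mount /dev/sda1 /mnt
    fi
    for i in 1 2 3; do
        mkdir -p /mnt/work$i
        tar xf /root/linux-2.6.37.tar -C /mnt/work$i &
    done
    wait                              # normally never reached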
On Thu, 2012-11-15 at 13:01 +0300, Andrey Sidorov wrote:
> On Thu, Nov 15, 2012 at 12:42 PM, Artem Bityutskiy <[email protected]> wrote:
>
> > We conducted some power-cut tests about 3 years ago. The results were
> > quite good for ext4 - in many cases it could recover without a need to
> > run fsck.ext4; sometimes it was not mountable, but fsck.ext4 helped.
> >
> > By contrast, ext3 constantly required fsck.ext3, and sometimes died so
> > badly that even fsck.ext3 could not recover it.
>
> We ran about 6000 cycles of power resets with Linux 2.6.37. The test
> was to run 3 tar processes unpacking a Linux kernel archive and cut
> power after about 15 seconds. There were only 3 failures where the
> file system couldn't be mounted, but those were due to HDD failure
> (an unreadable sector in the journal area). e2fsck successfully
> recovered those corruptions. As for the software itself, there was
> not a single issue and we never needed to run fsck after a power
> loss. So I'd say that ext4 is very tolerant of power losses, at
> least in 2.6.37, assuming barriers and ordered data mode. I
> understand, however, that this test is quite basic and results can
> differ across kernels.
Very different experience indeed; it shows that everyone has to conduct
their own power-cut tests on their own system. I did not mention that
we were running on eMMC.
--
Best Regards,
Artem Bityutskiy
On 11/15/12 2:42 AM, Artem Bityutskiy wrote:
> Hi, Fredrik,
>
> On Wed, 2012-11-14 at 11:41 +0100, Ohlsson, Fredrik (GE Healthcare,
> consultant) wrote:
>> 2. Empty files, size 0.
>
> Well, this is expected in some cases. If you create a file, start
> appending data, and then have a power cut, you may end up with a
> zero-sized file.
>
> This is a Linux feature - the data is cached in RAM until write-back
> kicks in or something like fsync() is called.
>
> Ext4 has a feature whereby if you write to a file and then close it,
> ext4 will initiate write-back for you right away. It was added a couple
> of years ago to make ext4 more user-friendly.
>
> You really should investigate what those files are and what was
> happening to them just before the power cut. Zero-length files may be
> perfectly normal, in general.
>
> However, strictly _all_ files you care about should be explicitly
> synced. This is just safer.
>
> If you write serious medical software, you should take data integrity
> seriously in your apps.
>
> I wrote this section for UBIFS users a long time ago, and it is the
> same (modulo UBIFS-specific details) for other Linux file-systems,
> including ext4:
>
> http://www.linux-mtd.infradead.org/doc/ubifs.html#L_writeback
> http://www.linux-mtd.infradead.org/doc/ubifs.html#L_sync_exceptions
Jeff Moyer also has a very good article on this:
http://lwn.net/Articles/457667/
>> Is ext4 a bad choice for an embedded system with a 1 GB IDE Flash Disk and Debian 2.6.32-5-686? Should we change filesystems?
>
> I do not know for sure, but I doubt there is serious power-cut testing
> regularly conducted for ext4; people may correct me.
We do it here, though maybe not as regularly as we should.
I also periodically test journal replay, but not in a way that simulates
lost write caches or misbehaving hardware. OTOH, those things are out
of our control in the real world (if users disable barriers, or if the
hardware lies to us).
But you absolutely should test YOUR system, and audit YOUR software and
YOUR configuration, to be sure that they are behaving as you require.
It's absolutely possible to build & configure a system (software +
hardware) which correctly persists data even in the face of a power
loss (or, in the case of a crash before the data-integrity syscalls
complete, one where your software _knows_ that it was unable to
persist the data).
-Eric
> So if power-cut tolerance is important for you, you should conduct good
> power-cut tests.
>
> And remember, disk quality is very important for power-cut tolerance.
> If you use something like badly managed flash (a bad SSD or eMMC), it
> may lose recently written data on a power cut. So testing is
> important.
>
> Of course you should have barriers on as well.
>
> We conducted some power-cut tests about 3 years ago. The results were
> quite good for ext4 - in many cases it could recover without a need to
> run fsck.ext4; sometimes it was not mountable, but fsck.ext4 helped.
>
> By contrast, ext3 constantly required fsck.ext3, and sometimes died so
> badly that even fsck.ext3 could not recover it.
>
On 11/14/12 4:41 AM, Ohlsson, Fredrik (GE Healthcare, consultant) wrote:
> Hi,
> I would like to have advice regarding features and options.
> I am working with an embedded system equipped with an IDE Flash Disk
> and the ext4 filesystem. I have identified three problems that I
> would like to solve in our product. The power is abruptly turned off
> from time to time, and this has sometimes resulted in a broken
> superblock (inode 8) and in empty files with a size of 0 bytes. It
> also happens that file changes are not committed to disk even if
> minutes pass before a power loss. This is very undesirable and
> expensive in our case, and we are searching for a solution or a
> workaround for these problems.
> The problems I would like to solve:
> 1. Broken superblock (inode 8).
> 2. Empty files, size 0.
> 3. Very long auto-commit times, several minutes with default settings.
>
>
> Is ext4 a bad choice for an embedded system with a 1 GB IDE Flash Disk
> and Debian 2.6.32-5-686? Should we change filesystems?
>
> I am planning to set the mount flags barrier=1, commit=1 and
> data=journal. Is this the way to go with an embedded system and ext4?
When you say "with barrier=1", were you using barrier=0 (or "nobarrier")
before?
If you have disabled barriers explicitly, just start with re-enabling
them. If your software is doing all the right data integrity syscalls,
then that is probably enough.
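For example (a sketch; the device and mount point here are made up,
and barrier=1 is already the ext4 default on your kernel):

    # Re-enable write barriers at runtime:
    mount -o remount,barrier=1 /dev/sda1 /data

    # ...and spell it out in /etc/fstab so it cannot silently regress:
    # /dev/sda1  /data  ext4  defaults,barrier=1  0  2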
-Eric
> Are there more ext4 options/settings I can use to make an embedded
> system reliable when a power loss occurs?
>
> Best Regards
>
> Fredrik Ohlsson
> Software Engineer
Thank you very much for your helpful response and answers.
I would like to describe some background for our embedded system. The
system has recently been upgraded; before the upgrade we did not see
"filesystem"-related problems. Before the upgrade we used another
filesystem, kernel and IDE Flash Disk (a 2.6.12 kernel, reiserfs and a
smaller 256 MB IDE Flash Disk). Today we have kernel 2.6.32, ext4
(default options) and a 1 GB Transcend TS1GDOM44H-S IDE Flash Disk.
We have a very low I/O intensity towards the flash disk and low
requirements on filesystem performance.
Regarding our case with the file of size 0 bytes: we use a bash shell
script to upgrade our application. The shell script calls the program
"tar", and tar overwrites/recreates our application's bin file. Some
minutes later the power is cut, and our bin file had a size of 0 bytes
when the system came up again. This particular case was solved by
adding a sync at the end of our upgrade shell script. I still don't
understand why the data was not committed to the disk after several
minutes. Even if tar leaves the file truncated, tar must have closed
the file, and ext4 would have done an implicit write-back?
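Simplified, the upgrade script now looks something like this (a
sketch; the real paths and archive names differ):

    #!/bin/bash
    # Unpack the new application image over the old one.
    tar xzf /tmp/upgrade.tar.gz -C /opt/app

    # Flush all dirty data to disk before reporting success.
    # Without this, a power cut minutes after the upgrade still
    # left the new bin file truncated to 0 bytes.
    sync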
We are worried that this will happen again in places where we use
programs not written by ourselves. Will the nodelalloc option solve
this behavior?
If I understand you correctly, the corrupted journal superblock (inode
#8) is most probably the result of a problem in the IDE Flash Disk. I
have attached the dumpe2fs output from this problem. I booted the
system from a USB drive and the filesystem could be repaired by e2fsck,
but this is not something the customer can do; we have to replace units
like this. The Transcend TS1GDOM44H-S IDE Flash Disk is intended for
demanding embedded systems that require reliability, but I guess it
could still be the part creating this problem.
I think you are saying that ext4 should work fine in our setup, where
we have regular power cuts, if we use sync/fsync and apply the
following settings:
- barrier, which we already use by default (I can't see barrier in the
  attached dumpe2fs output, though).
- nodelalloc
You also advise us to implement our own power-cut tests.
Are there more settings that could be to our favor, like
journal_checksum, or setting data=journal via tune2fs?
Best Regards
Fredrik Ohlsson
Software Engineer
On 11/15/12 4:01 AM, Andrey Sidorov wrote:
> On Thu, Nov 15, 2012 at 12:42 PM, Artem Bityutskiy <[email protected]> wrote:
>
>> We conducted some power-cut tests about 3 years ago. The results were
>> quite good for ext4 - in many cases it could recover without a need to
>> run fsck.ext4; sometimes it was not mountable, but fsck.ext4 helped.
>>
>> By contrast, ext3 constantly required fsck.ext3, and sometimes died so
>> badly that even fsck.ext3 could not recover it.
If barriers were not enabled on your storage, this is expected. Write
caches that evaporate on a power cut do not play well with journaling
guarantees.
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/writebarr.html
Barriers were not made the default on ext3 until 2011, in kernel v3.1,
astonishingly. So it makes sense that ext3 fared worse than ext4.
Unplayable journals are not surprising with non-battery-backed
writeback caches, no explicit cache flushing, and power loss.
> We ran about 6000 cycles of power resets with Linux 2.6.37. The test
> was to run 3 tar processes unpacking a Linux kernel archive and cut
> power after about 15 seconds. There were only 3 failures where the
> file system couldn't be mounted, but those were due to HDD failure
> (an unreadable sector in the journal area). e2fsck successfully
> recovered those corruptions. As for the software itself, there was
> not a single issue and we never needed to run fsck after a power
> loss. So I'd say that ext4 is very tolerant of power losses, at
> least in 2.6.37, assuming barriers and ordered data mode. I
> understand, however, that this test is quite basic and results can
> differ across kernels.
Right - barriers.
Of course you probably did lose *file* data even if the fs metadata
was correct.
Remember that journaling ensures a consistent metadata structure, but
does not guarantee data integrity.
-Eric
On 11/16/12 10:18 AM, Ohlsson, Fredrik (GE Healthcare, consultant) wrote:
> Thank you very much for your helpful response and answers.
>
> I would like to describe some background for our embedded system. The
> system has recently been upgraded; before the upgrade we did not see
> "filesystem"-related problems. Before the upgrade we used another
> filesystem, kernel and IDE Flash Disk (a 2.6.12 kernel, reiserfs and a
> smaller 256 MB IDE Flash Disk). Today we have kernel 2.6.32, ext4
> (default options) and a 1 GB Transcend TS1GDOM44H-S IDE Flash Disk.
> We have a very low I/O intensity towards the flash disk and low
> requirements on filesystem performance.
>
> Regarding our case with the file of size 0 bytes: we use a bash shell
> script to upgrade our application. The shell script calls the program
> "tar", and tar overwrites/recreates our application's bin file. Some
> minutes later the power is cut, and our bin file had a size of 0 bytes
> when the system came up again. This particular case was solved by
> adding a sync at the end of our upgrade shell script. I still don't
> understand why the data was not committed to the disk after several
> minutes. Even if tar leaves the file truncated, tar must have closed
> the file, and ext4 would have done an implicit write-back?
I would have expected all data to make it out after several minutes,
yes (how many is several?). Background writeout kicks off every 30s
by default; I would have expected that after at most a couple of those
cycles you would have seen it all make it to disk. To investigate, it
might be worth tracing the system to see whether or not it is pushing
data out when the script completes (iostat might be simplest, or
blktrace).
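e.g. leave something like this running in a second terminal while the
upgrade script executes (a sketch; substitute your actual device):

    # Extended per-device statistics, refreshed every second; dirty
    # data being flushed shows up as a burst in the write columns.
    iostat -d -x 1 /dev/sda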
I'm curious, did the sync at the end of the script take a very long
time to complete?
> We are worried that this will happen again in places where we use
> programs not written by ourselves. Will the nodelalloc option solve
> this behavior?
Doubtful. For starters, I'd argue that nodelalloc is a less tested
and therefore potentially more bug-prone path in ext4.
> If I understand you correctly, the corrupted journal superblock (inode
> #8) is most probably the result of a problem in the IDE Flash Disk.
ok so you got:
> Superblock has an invalid journal (inode 8).
> Clear? yes
after this, of course, fsck is at a disadvantage for recovering the fs
and finds many more errors. However, things like bad resize inodes
& bad acl index inodes are a bit more surprising; those were written
at mkfs time.
What version of e2fsprogs was this?
You could get the journal message for a few reasons; unfortunately,
e2fsck doesn't say which one it was. An e2image (or maybe just a raw
dd) of the fs prior to the repair would offer some clues to someone
with time to investigate.
> I have
> attached the dumpe2fs output from this problem. I booted the system
> from a USB drive and the filesystem could be repaired by e2fsck, but
> this is not something the customer can do; we have to replace units
> like this. The Transcend TS1GDOM44H-S IDE Flash Disk is intended for
> demanding embedded systems that require reliability, but I guess it
> could still be the part creating this problem.
>
> I think you are saying that ext4 should work fine in our setup, where
> we have regular power cuts, if we use sync/fsync and apply the
> following settings:
> - barrier, which we already use by default (I can't see barrier in the
>   attached dumpe2fs output, though).
it's just runtime behavior, so you wouldn't see it there.
> - nodelalloc
I'm a little skeptical of that. I think you need to get to the root
cause before you start turning more knobs. I guess I am most skeptical
about the storage, for starters.
I can't tell if the Transcend device has a cache or not; presumably
the flash controller does. I have no idea if it responds to cache flush
requests, or not.
If you look at dmesg, you probably get something like:
[sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
but what the device actually does internally - who knows.
I wonder if a test like this would be interesting:
* Boot system from USB.
* Write a unique pattern directly to the Transcend block device, not through the fs.
* Use something like lmdd or xfs_io or some tool which will write a pattern, not
just 0s.
* Wait a minute or two, then cut power (watch iostat maybe to see when all IO is done)
* boot up again and check the pattern
If the pattern is bad, try it again (with a different pattern) and issue
sync prior to the power cut. See if that behaves differently.
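A possible sketch of that, with a made-up device name (and note this
overwrites whatever is on the device):

    # Write 1 MiB of a recognizable byte pattern straight to the
    # block device with O_DIRECT, bypassing the filesystem:
    xfs_io -d -c "pwrite -S 0xab 0 1m" /dev/sdb

    # ...wait, cut power, boot from USB again, and dump what survived:
    xfs_io -c "pread -v 0 1m" /dev/sdb | less

    # For the second run, issue a flush before cutting power:
    #   xfs_io -d -c "pwrite -S 0xcd 0 1m" -c fsync /dev/sdb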
fsync (and, I think, sys_sync) will issue cache flushes to the storage.
Simple writeback won't, AFAIK. So it'd be interesting to see if data
is being lost inside the device when it loses power; the above test might
be decent to check that.
I'm half tempted to find one of these devices & test it myself ;)
-Eric
> You also advise us to implement our own power-cut tests.
>
> Are there more settings that could be to our favor, like
> journal_checksum, or setting data=journal via tune2fs?
>
> Best Regards
>
> Fredrik Ohlsson
> Software Engineer
On Fri, 2012-11-16 at 09:40 -0600, Eric Sandeen wrote:
> Barriers were not made the default on ext3 until 2011, in kernel v3.1,
> astonishingly. So it makes sense that ext3 fared worse than ext4.
Very probable; I do not remember if we had them. But we were testing
on top of eMMC with no write cache.
Anyway, that was a long time ago.
While we are on the subject, one problem I remember from unrelated
testing of ext3 on top of eMMC was related to _read_ errors. Sometimes
after a power cut the eMMC returned an ECC error, so a sector could be
unreadable. But after writing to this sector it became fine again.
ext3 and the tools treated a read error as fatal. However, in the case
of eMMC that was something "normal".
I do not know how the situation has changed since then; this was
probably back in 2009.
--
Best Regards,
Artem Bityutskiy
On Fri, 2012-11-16 at 09:44 -0600, Eric Sandeen wrote:
> > I do not know for sure, but I doubt there is serious power-cut testing
> > regularly conducted for ext4; people may correct me.
>
> We do it here, though maybe not as regularly as we should.
> I also periodically test journal replay, but not in a way that simulates
> lost write caches or misbehaving hardware. OTOH, those things are out
> of our control in the real world (if users disable barriers, or if the
> hardware lies to us).
Good to know, thanks. I actually now recall Josef mentioning that Red
Hat does some power-cut testing.
> But you absolutely should test YOUR system, and audit YOUR software and
> YOUR configuration, to be sure that they are behaving as you require.
Sure.
--
Best Regards,
Artem Bityutskiy