Hello ext4 maintainer,
We have hit the error below several times on kernel 3.5.7.23, where we reproduced it more than four times.
We also hit this issue once on kernel 3.8.13.11, where it is harder to reproduce than on 3.5.7.23.
Our product is an embedded system whose main CPU is a Freescale i.MX6 (ARM Cortex-A9), and our storage device is an eMMC that follows the JEDEC 4.5 standard.
ERROR LOG:
EXT4-fs error (device mmcblk1p2): ext4_ext_check_inode:462: inode #2063: comm stability-1031.: bad header/extent: invalid extent entries - magic f30a, entries 1, max 4(4), depth 0(0)
EXT4-fs error (device mmcblk1p2): ext4_ext_check_inode:462: inode #2063: comm stability-1031.: bad header/extent: invalid extent entries - magic f30a, entries 1, max 4(4), depth 0(0)
open /mmc/test2nd//hp000002c8q2y6kRgcAy fail
File /mmc/test2nd//hp000002c8q2y6kRgcAy other ERROR(60)
When we used debugfs to analyze this issue in detail, we found it is caused by corrupted metadata. We have seen two typical kinds of corrupted metadata in our failure cases. Please see the annotated bytes in the logs below; the values in those areas should not all be ZERO.
CASE 1:
bash-3.2# dd if=/dev/mmcblk1p2 bs=4096 skip=569 count=4096 | hexdump -C
..
00000800 80 81 00 00 10 14 00 00 31 00 00 00 31 00 00 00 |........1...1...|
00000810 31 00 00 00 00 00 00 00 00 00 01 00 10 00 00 00 |1...............|
00000820 00 00 08 00 01 00 00 00 0a f3 01 00 04 00 00 00 |................|
00000830 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 |................| => 0x83a - 0x83f should not be all ZERO
00000840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000860 00 00 00 00 51 09 b8 14 00 00 00 00 00 00 00 00 |....Q...........|
00000870 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000880 1c 00 00 00 e4 45 c3 23 e4 45 c3 23 e4 45 c3 23 |.....E.#.E.#.E.#|
00000890 31 00 00 00 e4 45 c3 23 00 00 00 00 00 00 00 00 |1....E.#........|
000008a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
CASE 2:
bash-3.2# debugfs /dev/mmcblk1p1
debugfs 1.42.1 (17-Feb-2012)
debugfs: dump_extents <393968>
Level Entries Logical Physical Length Flags
0/ 0 1/ 1 0 - 4294967295 1705492 - 4296672787 0
00000f00 80 81 00 00 10 14 00 00 2f a0 01 00 2f a0 01 00 |......../.../...|
00000f10 2f a0 01 00 00 00 00 00 00 00 01 00 10 00 00 00 |/...............|
00000f20 00 00 08 00 01 00 00 00 0a f3 01 00 04 00 00 00 |................|
00000f30 00 00 00 00 00 00 00 00 00 00 00 00 14 06 1a 00 |................| => offset 0xf38-0xf39 should not be ZERO
00000f40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000f60 00 00 00 00 25 bb 10 cd 00 00 00 00 00 00 00 00 |....%...........|
00000f70 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000f80 1c 00 00 00 d8 95 6c cf d8 95 6c cf d8 95 6c cf |......l...l...l.|
00000f90 2f a0 01 00 d8 95 6c cf 00 00 00 00 00 00 00 00 |/.....l.........|
00000fa0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
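For reference when reading the dumps above: the extent data lives in the inode's i_block area, which starts at offset 0x28 into the on-disk inode (0x828 and 0xf28 in the dumps), and is laid out as below. This is a plain-C rendering of the structures in the kernel's fs/ext4/ext4_extents.h (which uses __le16/__le32 rather than the stdint types used here). The zeroed bytes flagged above correspond to ee_start_hi/ee_start_lo in CASE 1 and to ee_len in CASE 2.

#include <stdint.h>

/* On-disk extent structures, all fields little-endian. */
struct ext4_extent_header {
	uint16_t eh_magic;      /* 0xF30A -- the "0a f3" seen in the dumps */
	uint16_t eh_entries;    /* number of valid entries */
	uint16_t eh_max;        /* capacity of this node (4 when stored in the inode) */
	uint16_t eh_depth;      /* 0 == leaf node */
	uint32_t eh_generation;
};

struct ext4_extent {            /* follows the header, 12 bytes each */
	uint32_t ee_block;      /* first logical block covered by this extent */
	uint16_t ee_len;        /* number of blocks (zeroed in CASE 2) */
	uint16_t ee_start_hi;   /* high 16 bits of the physical start block */
	uint32_t ee_start_lo;   /* low 32 bits (zeroed together with _hi in CASE 1) */
};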
We did more tests in which we backed up the journal blocks before mounting the test partition.
Before we mount the test partition, we run fsck.ext4 with the -n option to verify whether any bad extent issue is already present. fsck.ext4 never found any such issue, so we can show that the bad extent issue appears after the journal replay.
We tried several different mount options, and even mounted the filesystem with journal_checksum, but the bad extent issue still happened.
The log below proves that the journal blocks contain the bad extent contents:
bash-3.2# debugfs -R "imap <2063>" /dev/mmcblk1p2
debugfs 1.42.1 (17-Feb-2012)
Inode 2063 is part of block group 0
located at block 525, offset 0x0e00
bash-3.2# debugfs -R "dump_extents <2063>" /dev/mmcblk1p2
debugfs 1.42.1 (17-Feb-2012)
Level Entries Logical Physical Length Flags
0/ 0 1/ 1 0 - 4294967295 1338882 - 4296306177 0
dd if=/dev/mmcblk1p2 bs=4096 skip=525 count=4096 | hexdump -C
00000e00 80 81 00 00 10 14 00 00 37 00 00 00 37 00 00 00 |........7...7...|
00000e10 37 00 00 00 00 00 00 00 00 00 01 00 10 00 00 00 |7...............|
00000e20 00 00 08 00 01 00 00 00 0a f3 01 00 04 00 00 00 |................|
00000e30 00 00 00 00 00 00 00 00 00 00 00 00 02 6e 14 00 |.............n..| => 0xe38-0xe39 are zero, which caused the bad extent error
00000e40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000e60 00 00 00 00 2b f2 c5 2b 00 00 00 00 00 00 00 00 |....+..+........|
00000e70 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000e80 1c 00 00 00 d8 e5 e8 49 d8 e5 e8 49 d8 e5 e8 49 |.......I...I...I|
00000e90 37 00 00 00 d8 e5 e8 49 00 00 00 00 00 00 00 00 |7......I........|
00000ea0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
== Search for the string "00 00 00 00 02 6e 14 00" in the journal image which was copied before the fs was mounted.
bash-3.2# hexdump -C journal2.img | grep "00 00 00 00 02 6e 14 00"
00adce30 00 00 00 00 00 00 00 00 00 00 00 00 02 6e 14 00 |.............n..|
== Found the same contents in a journal block; dump that block. The contents are the same as the bad block in the FS metadata.
bash-3.2# hexdump -C journal2.img -s 0xadce00 -n 1024
00adce00 80 81 00 00 10 14 00 00 37 00 00 00 37 00 00 00 |........7...7...|
00adce10 37 00 00 00 00 00 00 00 00 00 01 00 10 00 00 00 |7...............|
00adce20 00 00 08 00 01 00 00 00 0a f3 01 00 04 00 00 00 |................|
00adce30 00 00 00 00 00 00 00 00 00 00 00 00 02 6e 14 00 |.............n..|
00adce40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00adce60 00 00 00 00 2b f2 c5 2b 00 00 00 00 00 00 00 00 |....+..+........|
00adce70 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00adce80 1c 00 00 00 d8 e5 e8 49 d8 e5 e8 49 d8 e5 e8 49 |.......I...I...I|
00adce90 37 00 00 00 d8 e5 e8 49 00 00 00 00 00 00 00 00 |7......I........|
00adcea0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
== Check whether the address 0xadce00 is included in the valid journal blocks.
bash-3.2# hexdump -C journal2.img | grep "c0 3b 39 98"
00000000 c0 3b 39 98 00 00 00 04 00 00 00 00 00 00 10 00 |.;9.............|
....
009c7000 c0 3b 39 98 00 00 00 01 00 00 1e 27 00 00 02 34 |.;9........'...4|
00a38000 c0 3b 39 98 00 00 00 02 00 00 1e 27 00 00 00 00 |.;9........'....| => it is included in the valid journal blocks.
00a39000 c0 3b 39 98 00 00 00 01 00 00 1e 28 00 00 01 7d |.;9........(...}|
00b8e000 c0 3b 39 98 00 00 00 01 00 00 1e 28 00 00 02 a3 |.;9........(....|
00c6a000 c0 3b 39 98 00 00 00 02 00 00 1e 28 00 00 00 00 |.;9........(....|
We searched for this error on the internet; some other people also have this issue, but there is no solution.
This issue is maybe not a big issue, since it can easily be repaired by fsck.ext4. But we have the questions below:
1. Has this issue already been fixed in the latest kernel version?
2. Based on the information provided in this mail, can you help to solve this issue?
many thanks.
Huang weiliang
Software Engineer (CM/ESW1-CN)
Bosch Automotive Products
On Thu, Jan 02, 2014 at 12:59:52PM +0800, Huang Weller (CM/ESW12-CN) wrote:
>
> We did more test which we backup the journal blocks before we mount the test partition.
> Actually, before we mount the test partition, we use fsck.ext4 with -n option to verify whether there is any bad extents issues available. The fsck.ext4 never found any such kind issue. And we can prove that the bad extents issue is happened after journaling replay.
Ok, so that implies that the failure is almost certainly due to
corrupted blocks in the journal. Hence, when we replay the journal,
it causes the file system to become corrupted, because the "newer"
(and presumably, "more correct") metadata blocks found in the blocks
recorded in the journal are in fact corrupted.
BTW, you can use the logdump command in the debugfs program to look at
the journal. The debugfs man page documents it, but once you know the
block that was corrupted, which in your case appears to be block 525:
debugfs: logdump -b 525 -c
Or to see the contents of all of the blocks logged in the journal:
debugfs: logdump -ac
>
> We searched such error on internet, there are some one also has such issue. But there is no solution.
> This issue maybe not a big issue which it can be repaired by fsck.ext4 easily. But we have below questions:
> 1. whether this issue already been fixed in the latest kernel version?
> 2. based on the information I provided in this mail, can you help to solve this issue ?
Well, the question is how did the journal get corrupted? It's
possible that it's caused by a kernel bug, although I'm not aware of
any such bugs being reported.
In my mind, the most likely cause is that the SD card is ignoring the
CACHE FLUSH command, or is not properly saving the SD card's Flash
Translation Layer (FTL) metadata on a power drop. Here are some
examples of investigations into lousy SSD's that have this bug ---
and historically, SD cards have been **worse** than SSD's, because the
manufacturers have a much lower per-unit cost, so they tend to put in
even cheaper and crappier FTL systems on SD and eMMC flash.
http://lkcl.net/reports/ssd_analysis.html
https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault
What I tell people who are using flash devices is before they start
using any flash device, to do power drop testing on a raw device,
without any file system present. The simplest way to do this is to
write a program that writes consecutive 4k blocks that contain a
timestamp, a sequence number, some random data, and a CRC-32 checksum
over the contents of the timestamp, sequence number, a flags word, and
random data. As the program writes such 4k block, it rolls the dice
and once every 64 blocks or so (i.e., pick a random number, and see if
it is divisible by 64), then set a bit in the flags word indicating
that this block was forced out using a cache flush, and then when
writing this block, follow up the write with a CACHE FLUSH command.
It's also best if the test program prints the blocks which have been
written with CACHE FLUSH to the serial console, and that this is saved
by your test rig.
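A minimal sketch of such a writer might look like the following; the block layout, the use of zlib's crc32() (computed over the whole 4k block with the crc field zeroed), and the device path are illustrative assumptions only, and it relies on fsync() on the block device fd to issue the CACHE FLUSH. Build with something like "gcc -O2 -o blkwriter blkwriter.c -lz".

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <zlib.h>

#define BLKSZ 4096
#define FLAG_FLUSHED 0x1

struct blk {                    /* layout of every 4 KiB test block */
	uint64_t timestamp;
	uint64_t seqno;
	uint32_t flags;
	uint32_t crc;           /* CRC-32 over the whole block with crc == 0 */
	unsigned char random[BLKSZ - 24];
};

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/mmcblk1p2";  /* example device */
	int fd = open(dev, O_WRONLY);
	if (fd < 0) { perror("open"); return 1; }

	union { struct blk b; unsigned char raw[BLKSZ]; } u;
	srand((unsigned)time(NULL));

	for (uint64_t seq = 0; ; seq++) {          /* power is cut externally */
		memset(&u, 0, sizeof(u));
		u.b.timestamp = (uint64_t)time(NULL);
		u.b.seqno = seq;
		for (size_t i = 0; i < sizeof(u.b.random); i++)
			u.b.random[i] = (unsigned char)(rand() & 0xff);

		int flush = (rand() % 64) == 0;    /* roughly 1 in 64 blocks */
		if (flush)
			u.b.flags |= FLAG_FLUSHED;
		u.b.crc = (uint32_t)crc32(0L, u.raw, BLKSZ);  /* crc field still 0 here */

		if (pwrite(fd, u.raw, BLKSZ, (off_t)(seq * BLKSZ)) != BLKSZ) {
			perror("pwrite"); return 1;
		}
		if (flush) {
			if (fsync(fd)) { perror("fsync"); return 1; }  /* forces CACHE FLUSH */
			printf("flushed up to block %llu\n", (unsigned long long)seq);
			fflush(stdout);    /* this is the log your test rig should capture */
		}
	}
}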
(This is what ext4's journal does before and after writing the commit
block in the journal, and it guarantees that (a) all of the data in
the journal written up to the commit block will be available after a
power drop, and (b) that the commit block has been written to the
storage device and again, will be available after a power drop.)
Once you've written this program, set up a test rig which boots your
test board, runs the program, and then drops power to the test board
randomly. After the power drop, examine the flash device and make
sure that all of the blocks written up to the last "commit block" are
in fact valid.
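A matching check pass, using the same illustrative block layout as the writer sketch above and taking the last flushed block number from the serial console log, might look roughly like this:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <zlib.h>

#define BLKSZ 4096

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <device> <last-flushed-block>\n", argv[0]);
		return 2;
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 2; }

	uint64_t last = strtoull(argv[2], NULL, 0);  /* from the serial console log */
	unsigned char raw[BLKSZ];

	for (uint64_t seq = 0; seq <= last; seq++) {
		if (pread(fd, raw, BLKSZ, (off_t)(seq * BLKSZ)) != BLKSZ) {
			perror("pread"); return 2;
		}
		uint32_t stored;
		memcpy(&stored, raw + 20, sizeof(stored)); /* crc after 8+8+4 bytes, as in the writer */
		memset(raw + 20, 0, sizeof(stored));       /* CRC was computed with crc == 0 */
		if ((uint32_t)crc32(0L, raw, BLKSZ) != stored) {
			fprintf(stderr, "bad CRC at block %llu\n", (unsigned long long)seq);
			return 1;
		}
	}
	printf("blocks 0..%llu all valid\n", (unsigned long long)last);
	return 0;
}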
You will find that a surprising number of SD cards will fail this
test. In fact, the really lousy cards will become unreadable after a
power drop. (A fact many wedding photographers discover the hard way when
they drop their camera and the SD card flies out, and then they find
that all of their priceless, once-in-a-lifetime photos are lost forever.)
I ****strongly**** recommend that if you are not testing your SD cards
in this way from your parts supplier, you do so immediately, and
reject any model that is not able to guarantee that data survives a
power drop.
Good luck, and I hope this is helpful,
- Ted
P.S. If you do write such a program, please consider making it
available under an open source license. If more companies did this,
it would apply pressure to the flash manufacturers to stop making such
crappy products, and while it might raise the BOM cost of products by
a penny or two, the net result would be better for everyone in the
industry.
Hi Ted,
>What I tell people who are using flash devices is before they start
>using any flash device, to do power drop testing on a raw device,
>without any file system present. The simplest way to do this is to
>write a program that writes consecutive 4k blocks that contain a
>timestamp, a sequence number, some random data, and a CRC-32 checksum
>over the contents of the timestamp, sequence number, a flags word, and
>random data. As the program writes such 4k block, it rolls the dice
>and once every 64 blocks or so (i.e., pick a random number, and see if
>it is divisible by 64), .....
It sounds like a barrier test. We wrote such a test tool before; the test program used ioctl(fd, BLKFLSBUF, 0) to set a barrier before the next write operation.
Do you think this ioctl is enough? I ask because I saw ext4 use it. I will run the test with that tool and then let you know the result.
More information about the journal block which caused the bad extent error:
We enabled the journal_checksum mount option in our test. We reproduced the same problem, and the journal checksum was correct, since a journal block will not be replayed if its checksum is wrong.
Best Regards / Mit freundlichen Grüßen
Huang weiliang
On Fri, Jan 03, 2014 at 11:16:02AM +0800, Huang Weller (CM/ESW12-CN) wrote:
>
> It sounds like the barrier test. We wrote such kind test tool
> before, the test program used ioctl(fd, BLKFLSBUF, 0) to set a
> barrier before next write operation. Do you think this ioctl is
> enough ? Because I saw the ext4 use it. I will do the test with that
> tool and then let you know the result.
The BLKFLSBUF ioctl does __not__ send a CACHE FLUSH command to the
hardware device. It forces all of the dirty buffers in memory to the
storage device, and then it invalidates all the buffer cache, but it
does not send a CACHE FLUSH command to the hardware. Hence, the
hardware is free to write it to its on-disk cache, and not necessarily
guarantee that the data is written to stable store. (For an example
use case of BLKFLSBUF, we use it in e2fsck to drop the buffer cache
for benchmarking purposes.)
If you want to force a CACHE FLUSH (or barrier, depending on the
underlying transport different names may be given to this operation),
you need to call fsync() on the file descriptor open to the block
device.
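On a raw, unmounted block device the difference looks roughly like this (the device path is only an example):

/* Sketch: flushing a raw (unmounted) block device. The BLKFLSBUF ioctl
 * only writes out and invalidates the kernel's buffer cache; the fsync()
 * call is what actually gets a CACHE FLUSH sent to the hardware. */
#include <fcntl.h>
#include <linux/fs.h>     /* BLKFLSBUF */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/mmcblk1p2", O_WRONLY);
	if (fd < 0) { perror("open"); return 1; }

	/* ... pwrite() the test pattern here ... */

	if (ioctl(fd, BLKFLSBUF, 0))  /* buffer cache writeback + invalidate only */
		perror("BLKFLSBUF");

	if (fsync(fd))                /* writeback + CACHE FLUSH to the device */
		perror("fsync");

	close(fd);
	return 0;
}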
> More information about journal block which caused the bad extents
> error: We enabled the mount option journal_checksum in our test. We
> reproduced the same problem and the journal checksum is correct
> because the journal block will not be replayed if checksum is error.
How did you enable the journal_checksum option? Note that this is not
safe in general, which is why we don't enable it or the async_commit
mount option by default. The problem is that currently the journal
replay stops when it hits a bad checksum, and this can leave the file
system in a worse case than it currently is in. There is a way we
could fix it, by adding per-block checksums to the journal, so we can
skip just the bad block, and then force an e2fsck afterwards, but that
isn't something we've implemented yet.
That being said, if the journal checksum was valid, and so the
corrupted block was replayed, it does seem to argue against
hardware-induced corruption.
Hmm.... I'm stumped, for the moment. The journal layer is quite
stable, and we haven't had any problems like this reported in many,
many years.
Let's take this back to first principles. How reliably can you
reproduce the problem? How often does it fail? Is it something where
you can characterize the workload leading to this failure? Secondly,
is a power drop involved in the reproduction at all, or is this
something that can be reproduced by running some kind of workload, and
then doing a soft reset (i.e., force a kernel reboot, but _not_ do it
via a power drop)?
The other thing to ask is when did this problem first start appearing?
With a kernel upgrade? A compiler/toolchain upgrade? Or has it
always been there?
Regards,
- Ted
On Thu, Jan 02, 2014 at 19:42, Theodore Ts'o [mailto:[email protected]]
wrote:
> On Thu, Jan 02, 2014 at 12:59:52PM +0800, Huang Weller (CM/ESW12-CN)
> wrote:
> >
> > We did more test which we backup the journal blocks before we mount
> the test partition.
> > Actually, before we mount the test partition, we use fsck.ext4 with -
> n option to verify whether there is any bad extents issues available.
> The fsck.ext4 never found any such kind issue. And we can prove that
> the bad extents issue is happened after journaling replay.
>
> Ok, so that implies that the failure is almost certainly due to
> corrupted blocks in the journal. Hence, when we replay the journal, it
> causes the the file system to become corrupted, because the "newer"
> (and presumably, "more correct") metadata blocks found in the blocks
> recorded in the journal are in fact corrupted.
>
.....
> >
> > We searched such error on internet, there are some one also has such
> issue. But there is no solution.
> > This issue maybe not a big issue which it can be repaired by
> fsck.ext4 easily. But we have below questions:
> > 1. whether this issue already been fixed in the latest kernel version?
> > 2. based on the information I provided in this mail, can you help to
> solve this issue ?
>
> Well, the question is how did the journal get corrupted? It's possible
> that it's caused by a kernel bug, although I'm not aware of any such
> bugs being reported.
>
> In my mind, the most likely cause is that the SD card is ignoring the
> CACHE FLUSH command, or is not properly saving the SD card's Flash
> Translation Layer (FTL) metadata on a power drop.
Yes, this could be a possible reason, but we did exactly the same test
not only with power drops but also with doing only iMX watchdog resets.
In the latter case there was no power drop for the eMMC, but we
observed exactly the same kind of inode corruption.
During thousands of test loops with power drops or watchdog resets, while
creating thousands of files with multiple threads, we did not observe any
other kind of ext4 metadata damage or file content damage.
And in the error case so far we always found only a single damaged inode.
The other inodes before and after the damaged inode in the journal, in the
same logical 4096 bytes block, seem to be intact and valid (examined with
a hex editor). And in all the failure cases - as far as we can say based
on the ext4 disk layout documentation - only the ee_len or the ee_start_hi
and ee_start_lo entries are wrong (i.e. zeroed).
The eMMC has no "knowledge" about the logical meaning or the offset of
ee_len or ee_start. Thus, it does not seem very likely that whatever kind of
internal failure or bug in the eMMC controller/firmware always and only
damages these few bytes.
> What I tell people who are using flash devices is before they start
> using any flash device, to do power drop testing on a raw device,
> without any file system present. The simplest way to do this is to
> write a program that writes consecutive 4k blocks that contain a
> timestamp, a sequence number, some random data, and a CRC-32 checksum
> over the contents of the timestamp, sequence number, a flags word, and
> random data. As the program writes such 4k block, it rolls the dice
> and once every 64 blocks or so (i.e., pick a random number, and see if
> it is divisible by 64), then set a bit in the flags word indicating
> that this block was forced out using a cache flush, and then when
> writing this block, follow up the write with a CACHE FLUSH command.
> It's also best if the test program prints the blocks which have been
> written with CACHE FLUSH to the serial console, and that this is saved
> by your test rig.
We did similar tests in the past, but not yet with this particular type
of eMMC. I think we should repeat with this particular type.
>
> (This is what ext4's journal does before and after writing the commit
> block in the journal, and it guarantees that (a) all of the data in the
> journal written up to the commit block will be available after a power
> drop, and (b) that the commit block has been written to the storage
> device and again, will be available after a power drop.)
>
Well, we also did the same tests with journal_checksum enabled. We were
still able to reproduce the failure w/o any checksumming error. So we
believe that the respective transaction (as well as all others) was
complete and not corrupted by the eMMC.
Is this a valid assumption? If so, I would assume that the corrupted
inode was really written to the eMMC and not corrupted by the eMMC.
(BTW, we do know that journal_checksum is somehow critical and might make
things worse, but for test purpose and to exclude that the eMMC delivers
corrupted transactions when reading the data, it seemed to be a meaningful
approach)
So, I think there _might_ be a kernel bug, but it could be also a problem
related to the particular type of eMMC. We did not observe the same issue
in previous tests with another type of eMMC from another supplier, but this
was with an older kernel patch level and with another HW design.
Regarding a possible kernel bug: Is there any chance that the invalid
ee_len or ee_start are returned by, e.g., the block allocator ?
If so, can we try to instrument the code to get suitable traces ?
Just to see or to exclude that the corrupted inode is really written
to the eMMC ?
Mit freundlichen Grüßen / Best regards
Dirk Juergens
Robert Bosch Car Multimedia GmbH
On Thu, Jan 03, 2014 at 17:30, Theodore Ts'o [mailto:[email protected]]
wrote:
>
> On Fri, Jan 03, 2014 at 11:16:02AM +0800, Huang Weller (CM/ESW12-CN)
> wrote:
> >
> > It sounds like the barrier test. We wrote such kind test tool
> > before, the test program used ioctl(fd, BLKFLSBUF, 0) to set a
> > barrier before next write operation. Do you think this ioctl is
> > enough ? Because I saw the ext4 use it. I will do the test with that
> > tool and then let you know the result.
>
> The BLKFLSBUF ioctl does __not__ send a CACHE FLUSH command to the
> hardware device. It forces all of the dirty buffers in memory to the
> storage device, and then it invalidates all the buffer cache, but it
> does not send a CACHE FLUSH command to the hardware. Hence, the
> hardware is free to write it to its on-disk cache, and not necessarily
> guarantee that the data is written to stable store. (For an example
> use case of BLKFLSBUF, we use it in e2fsck to drop the buffer cache
> for benchmarking purposes.)
>
> If you want to force a CACHE FLUSH (or barrier, depending on the
> underlying transport different names may be given to this operation),
> you need to call fsync() on the file descriptor open to the block
> device.
>
> > More information about journal block which caused the bad extents
> > error: We enabled the mount option journal_checksum in our test. We
> > reproduced the same problem and the journal checksum is correct
> > because the journal block will not be replayed if checksum is error.
>
> How did you enable the journal_checksum option? Note that this is not
> safe in general, which is why we don't enable it or the async_commit
> mount option by default. The problem is that currently the journal
> replay stops when it hits a bad checksum, and this can leave the file
> system in a worse case than it currently is in. There is a way we
> could fix it, by adding per-block checksums to the journal, so we can
> skip just the bad block, and then force an efsck afterwards, but that
> isn't something we've implemented yet.
>
> That being said, if the journal checksum was valid, and so the
> corrupted block was replayed, it does seem to argue against
> hardware-induced corruption.
Yes, this was also our feeling. Please see my other mail just sent
some minutes ago. We know about the possible problems with
journal_checksum, but we thought that it is a good option in our case
to identify if this is a HW- or SW-induced issue.
>
> Hmm.... I'm stumped, for the moment. The journal layer is quite
> stable, and we haven't had any problems like this reported in many,
> many years.
>
> Let's take this back to first principles. How reliably can you
> reproduce the problem? How often does it fail?
With kernel 3.5.7.23 about once per overnight long term test.
> Is it something where
> you can characterize the workload leading to this failure? Secondly,
> is a power drop involved in the reproduction at all, or is this
> something that can be reproduced by running some kind of workload, and
> then doing a soft reset (i.e., force a kernel reboot, but _not_ do it
> via a power drop)?
As I stated in my other mail, it is also reproduced with soft resets.
Weller can give more details about the test setup.
>
> The other thing to ask is when did this problem first start appearing?
> With a kernel upgrade? A compiler/toolchain upgrade? Or has it
> always been there?
>
> Regards,
>
> - Ted
Mit freundlichen Grüßen / Best regards
Dr. rer. nat. Dirk Juergens
Robert Bosch Car Multimedia GmbH
On 1/3/14, 9:48 AM, Theodore Ts'o wrote:
> On Fri, Jan 03, 2014 at 11:16:02AM +0800, Huang Weller (CM/ESW12-CN) wrote:
>>
>> It sounds like the barrier test. We wrote such kind test tool
>> before, the test program used ioctl(fd, BLKFLSBUF, 0) to set a
>> barrier before next write operation. Do you think this ioctl is
>> enough ? Because I saw the ext4 use it. I will do the test with that
>> tool and then let you know the result.
>
> The BLKFLSBUF ioctl does __not__ send a CACHE FLUSH command to the
> hardware device. It forces all of the dirty buffers in memory to the
> storage device, and then it invalidates all the buffer cache, but it
> does not send a CACHE FLUSH command to the hardware. Hence, the
> hardware is free to write it to its on-disk cache, and not necessarily
> guarantee that the data is written to stable store. (For an example
> use case of BLKFLSBUF, we use it in e2fsck to drop the buffer cache
> for benchmarking purposes.)
Are you sure? for a bdev w/ ext4 on it:
BLKFLSBUF
fsync_bdev
sync_filesystem
sync_fs
ext4_sync_fs
blkdev_issue_flush
-Eric
On 1/3/14, 10:29 AM, Juergens Dirk (CM-AI/ECO2) wrote:
> So, I think there _might_ be a kernel bug, but it could be also a problem
> related to the particular type of eMMC. We did not observe the same issue
> in previous tests with another type of eMMC from another supplier, but this
> was with an older kernel patch level and with another HW design.
>
> Regarding a possible kernel bug: Is there any chance that the invalid
> ee_len or ee_start are returned by, e.g., the block allocator ?
> If so, can we try to instrument the code to get suitable traces ?
> Just to see or to exclude that the corrupted inode is really written
> to the eMMC ?
From your description it does sound possible that it's a kernel bug.
Adding testcases to the code to catch it before it hits the journal
might be helpful - but then maybe this is something getting overwritten
after the fact - hard to say.
Can you share more details of the test you are running? Or maybe even
the test itself?
I've used a test framework in the past to simulate resets w/o needing
to reset the box, and do many journal replays very quickly. It'd be
interesting to run it using your testcase.
Thanks,
-Eric
On Fri, Jan 03, 2014 at 11:23:54AM -0600, Eric Sandeen wrote:
> > The BLKFLSBUF ioctl does __not__ send a CACHE FLUSH command to the
> > hardware device. It forces all of the dirty buffers in memory to the
> > storage device, and then it invalidates all the buffer cache, but it
> > does not send a CACHE FLUSH command to the hardware. Hence, the
> > hardware is free to write it to its on-disk cache, and not necessarily
> > guarantee that the data is written to stable store. (For an example
> > use case of BLKFLSBUF, we use it in e2fsck to drop the buffer cache
> > for benchmarking purposes.)
>
> Are you sure? for a bdev w/ ext4 on it:
>
> BLKFLSBUF
> fsync_bdev
> sync_filesystem
> sync_fs
> ext4_sync_fs
> blkdev_issue_flush
This call chain only happens if the block device is mounted.
If you only have the block device opened, and doing read and writes
directly to the block device, then BLKFLSBUF will not result in
blkdev_issue_flush() being called.
Actually, BLKFLSBUF is really a bit of a mess, and it's because it
conflates multiple meanings of the word "flush" (which is ambiguous).
For ram disks, it actually destroys the ram disk (due to an
implementation detail about how the original ramdisk driver was
implemented). The original meaning of the ioctl was to safely remove
all of the buffers from the buffer cache --- for example, to deal with
a 5.25" floppy disk being replaced, since there's no way for the
hardware to signal this to the OS, or for benchmarking purposes.
Adding things like the call to sync_fs() has made the BLKFLSBUF ioctl
more and more confused, and arguably we should add some new ioctl's
which separate out some of these use cases. For example, there is
currently no way to force all dirty buffers for an unmounted block
device in the buffer cache to be written to disk, without actually
dropping all of the clean buffers from the buffer cache (as would be
the case with BLKFLSBUF), and without causing a forced CACHE_FLUSH
command (as would be the case if you called fsync).
The main reason why we haven't is that it's rare that people would
want to do these things in isolation, but the real problem is that
exactly what the semantics are for BLKFLSBUF are a bit confused, and
hence confusing. It's not even well documented --- I had to go diving
into the kernel sources to be sure, and even then, as you've pointed
out, what happens is variable depending on whether the block device is
mounted or not.
- Ted
On 1/3/14, 11:51 AM, Theodore Ts'o wrote:
> On Fri, Jan 03, 2014 at 11:23:54AM -0600, Eric Sandeen wrote:
>>> The BLKFLSBUF ioctl does __not__ send a CACHE FLUSH command to the
>>> hardware device. It forces all of the dirty buffers in memory to the
>>> storage device, and then it invalidates all the buffer cache, but it
>>> does not send a CACHE FLUSH command to the hardware. Hence, the
>>> hardware is free to write it to its on-disk cache, and not necessarily
>>> guarantee that the data is written to stable store. (For an example
>>> use case of BLKFLSBUF, we use it in e2fsck to drop the buffer cache
>>> for benchmarking purposes.)
>>
>> Are you sure? for a bdev w/ ext4 on it:
>>
>> BLKFLSBUF
>> fsync_bdev
>> sync_filesystem
>> sync_fs
>> ext4_sync_fs
>> blkdev_issue_flush
>
> This call chain only happens if the block device is mounted.
Sure, but I thought that's what they were doing. Maybe I misread.
-Eric
On Fri, Jan 03, 2014 at 11:54:12AM -0600, Eric Sandeen wrote:
> >
> > This call chain only happens if the block device is mounted.
>
> Sure, but I thought that's what they were doing. Maybe I misread.
>
I thought this was in relation to doing what they called a "barrier
test", where you are writing to flash device and then drop power, and
then see if the CACHE FLUSH request was actually honored. (And
whether or not the FTL got corrupted so badly that the device brick's
itself, as does happen for some of the crappier cheap flash out
there.)
But I'm not sure precisely how they implemented their test. It's
possible it was done with the file system mounted. My suggestion was
to make sure that the flash was proof against power drops by doing
this using a raw block device, to remove the variable of the file
system.
Given that they've since reported that they can repro the problem
using soft resets, it doesn't sound like the problem is related to
flash devices not handling power drops correctly --- although given
that I'm still getting reports of people who have had their SD card
get completely bricked after a power drop event, it's unfortunately
not a solved problem by the flash manufacturers yet.... or rather,
the few (many?) bad apples give all low-end flash a bad name.
- Ted
On Thu, Jan 03, 2014 at 19:07, Theodore Ts'o [mailto:[email protected]]
wrote:
>
> On Fri, Jan 03, 2014 at 11:54:12AM -0600, Eric Sandeen wrote:
> > >
> > > This call chain only happens if the block device is mounted.
> >
> > Sure, but I thought that's what they were doing. Maybe I misread.
> >
>
> I thought this was in relation to doing what they called a "barrier
> test", where you are writing to flash device and then drop power, and
> then see if the CACHE FLUSH request was actually honored. (And
> whether or not the FTL got corrupted so badly that the device brick's
> itself, as does happen for some of the crappier cheap flash out
> there.)
>
> But I'm not sure precisely how they implemented their test. It's
> possible it was done with the file system mounted. My suggestion was
> to make sure that the flash was proof against power drops by doing
> this using a raw block device, to remove the variable of the file
> system.
>
Just as a quick reply for today:
If I remember right, Weller has done the barrier test w/o file system
mounted. Weller can give more details when he is back in office.
However, these tests were done some while ago with another type of
eMMC.
> Given that they've since reported that they can repro the problem
> using soft resets, it doesn't sound like the problem is related to
> flash devices not handling powe drops correctly
I think so as well, for the same reason and also because our tests with
journal_checksum show the same problem w/o any checksum error.
> --- although given
> that I'm still getting reports of people who have had their SD card
> get completely bricked after a power drop event, it's unfortunately
> not a solved problem by the flash manufacturers yet.... or rather,
> the few (many?) bad apples give all low-end flash a bad name.
>
>
> - Ted
Mit freundlichen Grüßen / Best regards
Dirk Juergens
Robert Bosch Car Multimedia GmbH
On Thu, Jan 03, 2014 at 19:24, Eric Sandeen wrote
>
> On 1/3/14, 10:29 AM, Juergens Dirk (CM-AI/ECO2) wrote:
> > So, I think there _might_ be a kernel bug, but it could be also a
> problem
> > related to the particular type of eMMC. We did not observe the same
> issue
> > in previous tests with another type of eMMC from another supplier,
> but this
> > was with an older kernel patch level and with another HW design.
> >
> > Regarding a possible kernel bug: Is there any chance that the invalid
> > ee_len or ee_start are returned by, e.g., the block allocator ?
> > If so, can we try to instrument the code to get suitable traces ?
> > Just to see or to exclude that the corrupted inode is really written
> > to the eMMC ?
>
> From your description it does sound possible that it's a kernel bug.
> Adding testcases to the code to catch it before it hits the journal
> might be helpful - but then maybe this is something getting overwritten
> after the fact - hard to say.
>
> Can you share more details of the test you are running? Or maybe even
> the test itself?
Yes, for sure, we can. Weller, please provide additional details
or corrections.
In short:
Basically we use an automated cyclic test writing many small
(some kBytes) files with CRC checksums for easy consistency check
into a separate test partition. Files also contain meta information
like filename, sequence number and a random number, so that we can identify
from block device image dumps whether we are just seeing a fragment of an old
deleted file or a still valid one.
Each test loop looks like this:
1) Boot the device after power on or reset
2) Do fsck -n BEFORE mounting
2 a) (optional) binary dump of the journal
3) Mount test partition
4) File content check for all files from prev. loop
5) erase all files from previous loop
6) start writing hundreds/thousands of test files
in multiple directories with several threads
7) after random time cut the power or do soft reset
If 2), 3), 4) or 5) fails, stop test.
We usually run the test with a kind of transaction-safe
handling, i.e. using fsync/rename, to avoid zero-length files
or file fragments.
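(For clarity, the fsync/rename handling we mean is essentially the sketch below; the helper name, paths and payload are examples only, not our actual test code.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the file under a temporary name, fsync it, then rename it into
 * place and fsync the directory, so a crash leaves either the old file
 * or the complete new one, never a zero-length file or a fragment. */
static int write_file_atomically(const char *dir, const char *name,
                                 const void *buf, size_t len)
{
	char tmp[256], final_path[256];
	snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, name);
	snprintf(final_path, sizeof(final_path), "%s/%s", dir, name);

	int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	close(fd);

	if (rename(tmp, final_path) != 0)   /* atomically replaces any old file */
		return -1;

	int dfd = open(dir, O_RDONLY | O_DIRECTORY);  /* persist the rename itself */
	if (dfd >= 0) {
		fsync(dfd);
		close(dfd);
	}
	return 0;
}

int main(void)
{
	const char payload[] = "test file body: header, CRC, seqno, random data";
	return write_file_atomically("/mmc/test2nd", "hp_example.dat",
	                             payload, sizeof(payload)) ? 1 : 0;
}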
>
> I've used a test framework in the past to simulate resets w/o needing
> to reset the box, and do many journal replays very quickly. It'd be
> interesting to run it using your testcase.
>
> Thanks,
> -Eric
Mit freundlichen Grüßen / Best regards
Dirk Juergens
Robert Bosch Car Multimedia GmbH
On 1/3/14, 12:45 PM, Juergens Dirk (CM-AI/ECO2) wrote:
>
> On Thu, Jan 03, 2014 at 19:24, Eric Sandeen wrote
>>
>> On 1/3/14, 10:29 AM, Juergens Dirk (CM-AI/ECO2) wrote:
>>> So, I think there _might_ be a kernel bug, but it could be also a
>> problem
>>> related to the particular type of eMMC. We did not observe the same
>> issue
>>> in previous tests with another type of eMMC from another supplier,
>> but this
>>> was with an older kernel patch level and with another HW design.
>>>
>>> Regarding a possible kernel bug: Is there any chance that the invalid
>>> ee_len or ee_start are returned by, e.g., the block allocator ?
>>> If so, can we try to instrument the code to get suitable traces ?
>>> Just to see or to exclude that the corrupted inode is really written
>>> to the eMMC ?
>>
>> From your description it does sound possible that it's a kernel bug.
>> Adding testcases to the code to catch it before it hits the journal
>> might be helpful - but then maybe this is something getting overwritten
>> after the fact - hard to say.
>>
>> Can you share more details of the test you are running? Or maybe even
>> the test itself?
>
> Yes, for sure, we can. Weller, please provide additional details
> or corrections.
>
> In short:
> Basically we use an automated cyclic test writing many small
> (some kBytes) files with CRC checksums for easy consistency check
> into a separate test partition. Files also contain meta information
> like filename, sequence number and a random number to allow to identify
> from block device image dumps, if we just see a fragment of an old
> deleted file or a still valid one.
>
> Each test loop looks like this:
0) mkfs the filesystem - with what options? How big?
> 1) Boot the device after power on or reset
> 2) Do fsck -n BEFORE mounting
> 2 a) (optional) binary dump of the journal
> 3) Mount test partition
Again with what options, if any?
> 4) File content check for all files from prev. loop
> 5) erase all files from previous loop
> 6) start writing hundreds/thousands of test files
> in multiple directories with several threads
I guess this is where we might need more details in order,
to try to recreate the failure, but perhaps
this is not a case where you can simply share the IO
generation utility...?
Thanks,
-Eric
> 7) after random time cut the power or do soft reset
>
> If 2), 3), 4) or 5) fails, stop test.
>
> We are running the test usually with kind of transaction
> safe handling, i.e. use fsync/rename, to avoid zero length files
> or file fragments.
>
>>
>> I've used a test framework in the past to simulate resets w/o needing
>> to reset the box, and do many journal replays very quickly. It'd be
>> interesting to run it using your testcase.
>>
>> Thanks,
>> -Eric
>
> Mit freundlichen Grüßen / Best regards
>
> Dirk Juergens
>
> Robert Bosch Car Multimedia GmbH
>
On Thu, Jan 03, 2014 at 19:49, Eric Sandeen wrote
>
> On 1/3/14, 12:45 PM, Juergens Dirk (CM-AI/ECO2) wrote:
> >
> > On Thu, Jan 03, 2014 at 19:24, Eric Sandeen wrote
> >>
> >> On 1/3/14, 10:29 AM, Juergens Dirk (CM-AI/ECO2) wrote:
> >>> So, I think there _might_ be a kernel bug, but it could be also a
> >> problem
> >>> related to the particular type of eMMC. We did not observe the same
> >> issue
> >>> in previous tests with another type of eMMC from another supplier,
> >> but this
> >>> was with an older kernel patch level and with another HW design.
> >>>
> >>> Regarding a possible kernel bug: Is there any chance that the
> invalid
> >>> ee_len or ee_start are returned by, e.g., the block allocator ?
> >>> If so, can we try to instrument the code to get suitable traces ?
> >>> Just to see or to exclude that the corrupted inode is really
> written
> >>> to the eMMC ?
> >>
> >> From your description it does sound possible that it's a kernel bug.
> >> Adding testcases to the code to catch it before it hits the journal
> >> might be helpful - but then maybe this is something getting
> overwritten
> >> after the fact - hard to say.
> >>
> >> Can you share more details of the test you are running? Or maybe
> even
> >> the test itself?
> >
> > Yes, for sure, we can. Weller, please provide additional details
> > or corrections.
> >
> > In short:
> > Basically we use an automated cyclic test writing many small
> > (some kBytes) files with CRC checksums for easy consistency check
> > into a separate test partition. Files also contain meta information
> > like filename, sequence number and a random number to allow to
> identify
> > from block device image dumps, if we just see a fragment of an old
> > deleted file or a still valid one.
> >
> > Each test loop looks like this:
>
> 0) mkfs the filesystem - with what options? How big?
Here we do need the details from Weller, cause
he has done all this.
>
> > 1) Boot the device after power on or reset
> > 2) Do fsck -n BEFORE mounting
> > 2 a) (optional) binary dump of the journal
> > 3) Mount test partition
>
> Again with what options, if any?
Details again have to be given by Weller, sorry.
>
> > 4) File content check for all files from prev. loop
> > 5) erase all files from previous loop
> > 6) start writing hundreds/thousands of test files
> > in multiple directories with several threads
>
> I guess this is where we might need more details in order,
> to try to recreate the failure, but perhaps
> this is not a case where you can simply share the IO
> generation utility...?
I think we can share the code, please let me check on Monday.
>
> Thanks,
> -Eric
>
> > 7) after random time cut the power or do soft reset
> >
> > If 2), 3), 4) or 5) fails, stop test.
> >
> > We are running the test usually with kind of transaction
> > safe handling, i.e. use fsync/rename, to avoid zero length files
> > or file fragments.
> >
> >>
> >> I've used a test framework in the past to simulate resets w/o
> needing
> >> to reset the box, and do many journal replays very quickly. It'd be
> >> interesting to run it using your testcase.
> >>
> >> Thanks,
> >> -Eric
> >
> > Mit freundlichen Grüßen / Best regards
> >
> > Dirk Juergens
> >
> > Robert Bosch Car Multimedia GmbH
> >
Mit freundlichen Grüßen / Best regards
Dirk Juergens
Robert Bosch Car Multimedia GmbH
> On Thu, Jan 03, 2014 at 19:24, Eric Sandeen wrote
>>
>> On 1/3/14, 10:29 AM, Juergens Dirk (CM-AI/ECO2) wrote:
>>> So, I think there _might_ be a kernel bug, but it could be also a
>> problem
>>> related to the particular type of eMMC. We did not observe the same
>> issue
>>> in previous tests with another type of eMMC from another supplier,
>> but this
>>> was with an older kernel patch level and with another HW design.
>>>
>>> Regarding a possible kernel bug: Is there any chance that the invalid
>>> ee_len or ee_start are returned by, e.g., the block allocator ?
>>> If so, can we try to instrument the code to get suitable traces ?
>>> Just to see or to exclude that the corrupted inode is really written
>>> to the eMMC ?
>>
>> From your description it does sound possible that it's a kernel bug.
>> Adding testcases to the code to catch it before it hits the journal
>> might be helpful - but then maybe this is something getting overwritten
>> after the fact - hard to say.
>>
>> Can you share more details of the test you are running? Or maybe even
>> the test itself?
>
> Yes, for sure, we can. Weller, please provide additional details
> or corrections.
>
> In short:
> Basically we use an automated cyclic test writing many small
> (some kBytes) files with CRC checksums for easy consistency check
> into a separate test partition. Files also contain meta information
> like filename, sequence number and a random number to allow to identify
> from block device image dumps, if we just see a fragment of an old
> deleted file or a still valid one.
>
> Each test loop looks like this:
>0) mkfs the filesystem - with what options? How big?
I used default options like this: mkfs.ext4 -E nodiscard /dev/$PAR
We found that formatting the disk takes a long time if the "-E nodiscard" option is not given.
> 1) Boot the device after power on or reset
> 2) Do fsck -n BEFORE mounting
> 2 a) (optional) binary dump of the journal
> 3) Mount test partition
>Again with what options, if any?
Normally, I used the options below:
- ext4 default options: rw,relatime,data=ordered,barrier=1
- rw,relatime,data=ordered,barrier=1,journal_checksum
The test partition size is about 6G, but I filled the test partition so that only about 700M of empty space is left.
During the test, I use the stress tool to generate CPU load; as I remember, the CPU load is around 70%. Not all tests were run with stress; we also did the test without stress and reproduced the issue.
> 4) File content check for all files from prev. loop
> 5) erase all files from previous loop
> 6) start writing hundreds/thousands of test files
> in multiple directories with several threads
>I guess this is where we might need more details in order,
>to try to recreate the failure, but perhaps
>this is not a case where you can simply share the IO
>generation utility...?
I attached my test code, scripts and an introduction document to this mail. Please don't laugh at me if there is some ugly code :-)
>Thanks,
>-Eric
> 7) after random time cut the power or do soft reset
>
> If 2), 3), 4) or 5) fails, stop test.
>
> We are running the test usually with kind of transaction
> safe handling, i.e. use fsync/rename, to avoid zero length files
> or file fragments.
>
>>
>> I've used a test framework in the past to simulate resets w/o needing
>> to reset the box, and do many journal replays very quickly. It'd be
>> interesting to run it using your testcase.
>>
>> Thanks,
>> -Eric
>
> Mit freundlichen Grüßen / Best regards
>
> Dirk Juergens
>
> Robert Bosch Car Multimedia GmbH
>
>On Thu, Jan 03, 2014 at 17:30, Theodore Ts'o [mailto:[email protected]]
>wrote:
>>
>> On Fri, Jan 03, 2014 at 11:16:02AM +0800, Huang Weller (CM/ESW12-CN)
>> wrote:
>> >
>> > It sounds like the barrier test. We wrote such kind test tool
>> > before, the test program used ioctl(fd, BLKFLSBUF, 0) to set a
>> > barrier before next write operation. Do you think this ioctl is
>> > enough ? Because I saw the ext4 use it. I will do the test with that
>> > tool and then let you know the result.
>>
>> The BLKFLSBUF ioctl does __not__ send a CACHE FLUSH command to the
>> hardware device. It forces all of the dirty buffers in memory to the
>> storage device, and then it invalidates all the buffer cache, but it
>> does not send a CACHE FLUSH command to the hardware. Hence, the
>> hardware is free to write it to its on-disk cache, and not necessarily
>> guarantee that the data is written to stable store. (For an example
>> use case of BLKFLSBUF, we use it in e2fsck to drop the buffer cache
>> for benchmarking purposes.)
>>
>> If you want to force a CACHE FLUSH (or barrier, depending on the
>> underlying transport different names may be given to this operation),
>> you need to call fsync() on the file descriptor open to the block
>> device.
>>
>> > More information about journal block which caused the bad extents
>> > error: We enabled the mount option journal_checksum in our test. We
>> > reproduced the same problem and the journal checksum is correct
>> > because the journal block will not be replayed if checksum is error.
>>
>> How did you enable the journal_checksum option? Note that this is not
>> safe in general, which is why we don't enable it or the async_commit
>> mount option by default. The problem is that currently the journal
>> replay stops when it hits a bad checksum, and this can leave the file
>> system in a worse case than it currently is in. There is a way we
>> could fix it, by adding per-block checksums to the journal, so we can
>> skip just the bad block, and then force an efsck afterwards, but that
>> isn't something we've implemented yet.
>>
>> That being said, if the journal checksum was valid, and so the
>> corrupted block was replayed, it does seem to argue against
>> hardware-induced corruption.
>Yes, this was also our feeling. Please see my other mail just sent
>some minutes ago. We know about the possible problems with
>journal_checksum, but we thought that it is a good option in our case
>to identify if this is a HW- or SW-induced issue.
>>
>> Hmm.... I'm stumped, for the moment. The journal layer is quite
>> stable, and we haven't had any problems like this reported in many,
>> many years.
>>
>> Let's take this back to first principles. How reliably can you
>> reproduce the problem? How often does it fail?
>With kernel 3.5.7.23 about once per overnight long term test.
>> Is it something where
>> you can characterize the workload leading to this failure? Secondly,
>> is a power drop involved in the reproduction at all, or is this
>> something that can be reproduced by running some kind of workload, and
>> then doing a soft reset (i.e., force a kernel reboot, but _not_ do it
>> via a power drop)?
>As I stated in my other mail, it is also reproduced with soft resets.
>Weller can give more details about the test setup.
My test case is like this:
1. Leave about 700M of empty space for the test.
2. Run most tests with stress (we also reproduced the issue in some tests without stress).
3. Power loss and CPU WDT reset both happened during file write operations.
>
> The other thing to ask is when did this problem first start appearing?
> With a kernel upgrade? A compiler/toolchain upgrade? Or has it
> always been there?
>
> Regards,
>
> - Ted
Mit freundlichen Grüßen / Best regards
Dr. rer. nat. Dirk Juergens
Robert Bosch Car Multimedia GmbH
>On Thu, Jan 03, 2014 at 19:07, Theodore Ts'o [mailto:[email protected]]
>wrote:
>>
>>> On Fri, Jan 03, 2014 at 11:54:12AM -0600, Eric Sandeen wrote:
> > >
>> > > This call chain only happens if the block device is mounted.
>> >
>> > Sure, but I thought that's what they were doing. Maybe I misread.
>> >
>>
>> I thought this was in relation to doing what they called a "barrier
>> test", where you are writing to flash device and then drop power, and
>> then see if the CACHE FLUSH request was actually honored. (And
>> whether or not the FTL got corrupted so badly that the device brick's
>> itself, as does happen for some of the crappier cheap flash out
>> there.)
>>
>> But I'm not sure precisely how they implemented their test. It's
>> possible it was done with the file system mounted. My suggestion was
>> to make sure that the flash was proof against power drops by doing
>> this using a raw block device, to remove the variable of the file
>> system.
>>
>Just as a quick reply for today:
>If I remember right, Weller has done the barrier test w/o file system
>mounted. Weller can give more details when he is back in office.
>However, these tests were done some while ago with another type of
>eMMC.
My previous block device barrier test works like this:
0. Power on.
1. Run the test program: generate a map file on the local fs. This file includes a header and many random block numbers.
2. The test program picks a block number from the map file at offset N.
3. Generate a new buffer with a commit ID and a random string, and write this buffer to the block from the last step.
4. Set a barrier (previously using the ioctl BLKFLSBUF; this will be changed to fsync later).
5. Back up the buffer generated in step 3 by writing it to block 0.
6. Set a barrier again. N++.
7. Jump to step 2.
The power loss or SW reset happens randomly between steps 2 and 7.
Below are the steps to check the test results:
1. Load the map file. Load block number 0 to get the last commit ID and the last block number.
2. Search for that last block number in the map file, i.e. find it at map[N].
3. Get the block numbers from map[0] to map[N-1] and check the contents of those blocks. If any of these blocks has a content error, we can say there is a problem.
As I remember, I didn't see any problem in that test at the time. But I can run the same test on the same brand of eMMC on which we later found the bad extent issue.
Please let us know if there is any problem with our test concept.
Thanks.
-Huang weller
>> Given that they've since reported that they can repro the problem
>> using soft resets, it doesn't sound like the problem is related to
>> flash devices not handling power drops correctly.
>I think so as well, for the same reason and also because our tests with
>journal_checksum show the same problem w/o any checksum error.
The e-mail sent to you contained an attachment with a disallowed file type.
Please ask the sender to pack this type of attachment into a
password-protected ZIP archive.
Details:
Sender: [email protected]
Recipients: [email protected];[email protected];[email protected]
Subject: "RE: AW: ext4 filesystem bad extent error review"
Time: Mon Jan 6 06:10:55 2014
File: code_out_pc.tar
The cleaned message body is below this line or in the attached e-mail.
--------------------------------------------------------------------------------
>On 1/3/14, 10:29 AM, Juergens Dirk (CM-AI/ECO2) wrote:
>> So, I think there _might_ be a kernel bug, but it could be also a problem
>> related to the particular type of eMMC. We did not observe the same issue
>> in previous tests with another type of eMMC from another supplier, but this
>> was with an older kernel patch level and with another HW design.
>>
>> Regarding a possible kernel bug: Is there any chance that the invalid
>> ee_len or ee_start are returned by, e.g., the block allocator ?
>> If so, can we try to instrument the code to get suitable traces ?
>> Just to see or to exclude that the corrupted inode is really written
>> to the eMMC ?
>From your description it does sound possible that it's a kernel bug.
>Adding testcases to the code to catch it before it hits the journal
>might be helpful - but then maybe this is something getting overwritten
>after the fact - hard to say.
>Can you share more details of the test you are running? Or maybe even
>the test itself?
>I've used a test framework in the past to simulate resets w/o needing
>to reset the box, and do many journal replays very quickly. It'd be
>interesting to run it using your testcase.
Please get code_out.tar.gz from my other mail.
On the PC side, I wrote a Win32 application which can send commands via UART to the TOE power supply (the power supply has a remote control mode in which it accepts commands on its UART interface).
As for the Putty source, I only modified winser.c, which is included in the attachment of this mail. There is also a readme.txt introducing the package.
If you want to use my test environment, I think you just need to adapt putty_toe.bat to your setup. The commands in this script are easy to understand; you can replace the command that controls our power controller with the one for yours.
Please feel free to let me know if you have any issues with the environment setup.
Thanks
Huang weller
>On Thu, Jan 03, 2014 at 19:24, Eric Sandeen wrote
>>
>> On 1/3/14, 10:29 AM, Juergens Dirk (CM-AI/ECO2) wrote:
>> > So, I think there _might_ be a kernel bug, but it could be also a
>> problem
>> > related to the particular type of eMMC. We did not observe the same
>> issue
>> > in previous tests with another type of eMMC from another supplier,
>> but this
>> > was with an older kernel patch level and with another HW design.
>> >
>> > Regarding a possible kernel bug: Is there any chance that the invalid
>> > ee_len or ee_start are returned by, e.g., the block allocator ?
>> > If so, can we try to instrument the code to get suitable traces ?
>> > Just to see or to exclude that the corrupted inode is really written
>> > to the eMMC ?
>>
>> From your description it does sound possible that it's a kernel bug.
>> Adding testcases to the code to catch it before it hits the journal
>> might be helpful - but then maybe this is something getting overwritten
>> after the fact - hard to say.
>>
>> Can you share more details of the test you are running? Or maybe even
>> the test itself?
>Yes, for sure, we can. Weller, please provide additional details
>or corrections.
>In short:
>Basically we use an automated cyclic test writing many small
> (some kBytes) files with CRC checksums for easy consistency check
>into a separate test partition. Files also contain meta information
>like filename, sequence number and a random number to allow to identify
>from block device image dumps, if we just see a fragment of an old
>deleted file or a still valid one.
>Each test loop looks like this:
>1) Boot the device after power on or reset
>2) Do fsck -n BEFORE mounting
>2 a) (optional) binary dump of the journal
>3) Mount test partition
>4) File content check for all files from prev. loop
>5) erase all files from previous loop
>6) start writing hundreds/thousands of test files
> in multiple directories with several threads
>7) after random time cut the power or do soft reset
>If 2), 3), 4) or 5) fails, stop test.
>We are running the test usually with kind of transaction
>safe handling, i.e. use fsync/rename, to avoid zero length files
>or file fragments.
Yes, Dirk's description is right.
You can also find the details of my test in the package code_out.tar.gz from my other mail. It contains a document introducing my test tool and test case, and also the test scripts.
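To illustrate the write pattern Dirk describes, below is a rough, single-threaded C sketch of one transaction-safe file write (write to a temporary file, fsync it, rename it into place, then fsync the directory). The header layout, file naming and trivial checksum are simplifications for illustration only; the real test tool writes hundreds or thousands of files from several threads into multiple directories and also does the delete and verify phases.

/*
 * Rough sketch of one transaction-safe test-file write, assuming a
 * simple header (name, sequence number, random number, checksum)
 * followed by a few kB of payload. Not the exact format of our tool.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct file_hdr {
    char     name[64];      /* file name, to identify old fragments */
    uint32_t seq;           /* sequence number */
    uint32_t rnd;           /* random number */
    uint32_t crc;           /* checksum over the payload */
};

static int write_test_file(const char *dir, uint32_t seq)
{
    char tmp[320], final[256], payload[4096];
    struct file_hdr hdr = { .seq = seq, .rnd = (uint32_t)rand() };

    snprintf(final, sizeof(final), "%s/hp%08u", dir, (unsigned)seq);
    snprintf(tmp, sizeof(tmp), "%s.tmp", final);
    snprintf(hdr.name, sizeof(hdr.name), "hp%08u", (unsigned)seq);

    hdr.crc = 0;
    for (size_t i = 0; i < sizeof(payload); i++) {
        payload[i] = (char)(rand() & 0xff);
        hdr.crc += (unsigned char)payload[i];   /* trivial checksum */
    }

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, &hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr) ||
        write(fd, payload, sizeof(payload)) != (ssize_t)sizeof(payload)) {
        close(fd);
        return -1;
    }
    fsync(fd);                      /* data must be stable before rename */
    close(fd);

    if (rename(tmp, final) < 0)     /* atomic replace of the final name */
        return -1;

    /* fsync the directory so the rename itself survives a power cut */
    int dfd = open(dir, O_RDONLY);
    if (dfd >= 0) {
        fsync(dfd);
        close(dfd);
    }
    return 0;
}

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : "/mmc/test2nd";
    for (uint32_t seq = 0; ; seq++)
        if (write_test_file(dir, seq) < 0) {
            perror("write_test_file");
            return 1;
        }
}

The point of the fsync/rename sequence is that after a power cut we should only ever see either the complete old file or the complete new file, never a zero-length file or a fragment.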
Thanks.
Huang weller
> On Thu, Jan 03, 2014 at 19:49, Eric Sandeen wrote
> >
> > On 1/3/14, 12:45 PM, Juergens Dirk (CM-AI/ECO2) wrote:
> > >
> > > On Thu, Jan 03, 2014 at 19:24, Eric Sandeen wrote
> > >>
> > >> On 1/3/14, 10:29 AM, Juergens Dirk (CM-AI/ECO2) wrote:
> > >>> So, I think there _might_ be a kernel bug, but it could be also a
> > >> problem
> > >>> related to the particular type of eMMC. We did not observe the same
> > >> issue
> > >>> in previous tests with another type of eMMC from another supplier,
> > >> but this
> > >>> was with an older kernel patch level and with another HW design.
> > >>>
> > >>> Regarding a possible kernel bug: Is there any chance that the
> > invalid
> > >>> ee_len or ee_start are returned by, e.g., the block allocator ?
> > >>> If so, can we try to instrument the code to get suitable traces ?
> > >>> Just to see or to exclude that the corrupted inode is really
> > written
> > >>> to the eMMC ?
> > >>
> > >> From your description it does sound possible that it's a kernel bug.
> > >> Adding testcases to the code to catch it before it hits the journal
> > >> might be helpful - but then maybe this is something getting
> > overwritten
> > >> after the fact - hard to say.
> > >>
> > >> Can you share more details of the test you are running? Or maybe
> > even
> > >> the test itself?
> > >
> > > Yes, for sure, we can. Weller, please provide additional details
> > > or corrections.
> > >
> > > In short:
> > > Basically we use an automated cyclic test writing many small
> > > (some kBytes) files with CRC checksums for easy consistency check
> > > into a separate test partition. Files also contain meta information
> > > like filename, sequence number and a random number to allow to
> > identify
> > > from block device image dumps, if we just see a fragment of an old
> > > deleted file or a still valid one.
> > >
> > > Each test loop looks like this:
> >
> > 0) mkfs the filesystem - with what options? How big?
>
> Here we do need the details from Weller, cause
> he has done all this.
We use the default options plus the nodiscard option:
mkfs.ext4 -E nodiscard /dev/$PAR
The partition size is about 6 GB.
> >
> > > 1) Boot the device after power on or reset
> > > 2) Do fsck -n BEFORE mounting
> > > 2 a) (optional) binary dump of the journal
> > > 3) Mount test partition
> >
> > Again with what options, if any?
>
> Details again have to be given by Weller, sorry.
Mount options:
- ext4 default options: rw,relatime,data=ordered,barrier=1
- rw,relatime,data=ordered,barrier=1,journal_checksum
The test partition size is about 6 GB, but I fill the partition so that only about 700 MB of empty space is left.