2012-12-01 14:22:14

by Yafang Shao

[permalink] [raw]
Subject: help about ext3 read-only issue on ext3(2.6.16.30)

Hi all,

We have many x86 boards, and we've been using 2.6.16.60 for a long
time. Before time we occasionally found ext3 was switched to read-only
while services were running, and we took it for granted it must be
some hardware problems.
But recently this issue happens frequently, both in old boards and
new boards. We've analyzed logs, and in one board we did find
exceptional reboot (but ext3 error happened 9 days after), and in
another board we found mptbase recovery routine, but in all other
boards there's no suspicious output at all.
The only change with the system is some application updates, and
apps now put more IO burden on disks.
The error always happened in ext3_readdir, like this:

EXT3-fs error (device sda7): ext3_readdir: bad entry in directory
#6685458: rec_len is smaller than minimal - offset=3860, inode=0,
rec_len=0, name_len=0

Aborting journal on device sda7.

EXT3-fs error (device sda7) in start_transaction: Readonly filesystem

Aborting journal on device sda7.

ext3_abort called.

EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal

Remounting filesystem read-only

__journal_remove_journal_head: freeing b_committed_data



We highly doubt it's hardware failures with this frequency in mind, so
we're wondering regarding to this issue if there's some ext3 bug-fix
having merged into mainline but not in our old kernel?

if hardware problem cause it, what's exactly are these hardware problems?



Thanks & Regards!


2012-12-03 17:59:38

by Eric Sandeen

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On 12/1/12 8:22 AM, Yafang Shao wrote:
> Hi all,
>
> We have many x86 boards, and we've been using 2.6.16.60 for a long
> time. Before time we occasionally found ext3 was switched to read-only
> while services were running, and we took it for granted it must be
> some hardware problems.
> But recently this issue happens frequently, both in old boards and
> new boards. We've analyzed logs, and in one board we did find
> exceptional reboot (but ext3 error happened 9 days after), and in
> another board we found mptbase recovery routine, but in all other
> boards there's no suspicious output at all.
> The only change with the system is some application updates, and
> apps now put more IO burden on disks.
> The error always happened in ext3_readdir, like this:
>
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory
> #6685458: rec_len is smaller than minimal - offset=3860, inode=0,
> rec_len=0, name_len=0
>
> Aborting journal on device sda7.
>
> EXT3-fs error (device sda7) in start_transaction: Readonly filesystem
>
> Aborting journal on device sda7.
>
> ext3_abort called.
>
> EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
>
> Remounting filesystem read-only
>
> __journal_remove_journal_head: freeing b_committed_data
>
>
>
> We highly doubt it's hardware failures with this frequency in mind, so
> we're wondering regarding to this issue if there's some ext3 bug-fix
> having merged into mainline but not in our old kernel?

Absolutely there are. There have been 87 changes just to namei.c since 2.6.16.
You could look through git logs to see if anything looks applicable.

You might try:

ef2b02d3e617cb0400eedf2668f86215e1b0e6af ext34: ensure do_split leaves enough free space in both blocks

but to be honest, sticking with such an old kernel means you are largely on your own, or may need contract help if you can't resolve it.

-Eric

> if hardware problem cause it, what's exactly are these hardware problems?
>
>
>
> Thanks & Regards!
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2012-12-04 13:54:05

by Zefan Li

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

>> We have many x86 boards, and we've been using 2.6.16.60 for a long
>> time. Before time we occasionally found ext3 was switched to read-only
>> while services were running, and we took it for granted it must be
>> some hardware problems.
>> But recently this issue happens frequently, both in old boards and
>> new boards. We've analyzed logs, and in one board we did find
>> exceptional reboot (but ext3 error happened 9 days after), and in
>> another board we found mptbase recovery routine, but in all other
>> boards there's no suspicious output at all.
>> The only change with the system is some application updates, and
>> apps now put more IO burden on disks.
>> The error always happened in ext3_readdir, like this:
>>
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory#6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>> Aborting journal on device sda7.
>> EXT3-fs error (device sda7) in start_transaction: Readonly filesystem
>> Aborting journal on device sda7.
>> ext3_abort called.
>> EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
>> Remounting filesystem read-only
>> __journal_remove_journal_head: freeing b_committed_data
>>
>> We highly doubt it's hardware failures with this frequency in mind, so
>> we're wondering regarding to this issue if there's some ext3 bug-fix
>> having merged into mainline but not in our old kernel?
>
> Absolutely there are. There have been 87 changes just to namei.c since 2.6.16.
> You could look through git logs to see if anything looks applicable.
>
> You might try:
>
> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ext34: ensure do_split leaves enough free space in both blocks

I've been asked to investigate this issue. Thanks for the reply!

I found this fix while searching for similar bug reports, but I don't think it
worths trying as we don't use dir_index feature.

I've collected some logs in different machines, and the error was always
triggered in ext3_readdir:

EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0

The last two errors happened on the same machine, and the same inode! One
happened in 11/22 (I was told they had run fsck later on), and one in 12/01.

The offset is always a bit smaller than blocksize, and all the fields are 0.
I dumped one of the dirs, and only ~1.6K was used (fsck reported no error).

In some machines fsck reported no error at all, and in others filesystems
were corrupted though fixable.

I didn't see any other error messages before this error at all.

Does this remind you of some old ext3 bug?

I'll send you fsck output, dir contents and other logs if u'r interested.

>
> but to be honest, sticking with such an old kernel means you are largely on your own, or may need contract help if you can't resolve it.
>

There're numerous machines running old kernels, and many of them are hard to
change. :(

Yesterday they upgrade apps on ~30 machines, and soon after that 5 machines
had filesystem corrupted. However they won't stop upgrading other machines!

On the other hand, we can hardly reproduce this bug in the lab.

So this is critical and urgent. Any help is appreciated.

Regards
Li Zefan


2012-12-04 15:09:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Tue, Dec 04, 2012 at 09:54:05PM +0800, Li Zefan wrote:
>
> I've collected some logs in different machines, and the error was always
> triggered in ext3_readdir:
>
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0

This looks like the last part of the inode was zapped. It might be
worth adding a kernel patch which dumps out the entire directory block
as a hex dump when this triggers --- and then compare it to what you
get if you dump the directory back out after the machine reboot. That
might given you a hint if something is corrupting the directory block
in memory. (especially if you set the remount read-only option).

> The last two errors happened on the same machine, and the same inode! One
> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.

If it's always the same inode, you might want to correlate based on
the pathname. Is there any commonality accross multiple machines in
terms of the directory name, and what application(s) might be touching
that directory?

> Yesterday they upgrade apps on ~30 machines, and soon after that 5 machines
> had filesystem corrupted. However they won't stop upgrading other machines!
>
> On the other hand, we can hardly reproduce this bug in the lab.

This is why wise cloud companies have a (figurative) big red button to
stop upgrade rollouts (which are always done slowly and gradually),
and processes which make it relatively easy for engineers to be able
to push the "big red button". I seem to recall the operations
engineer at Facebook giving a talk where he mentioned this. :-)

Good luck! Sorry, the pattern of corruption really doesn't sound
familiar to me...

- Ted

2012-12-04 15:29:57

by Tao Ma

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

Hi zefan,
On 12/04/2012 09:54 PM, Li Zefan wrote:
>>> We have many x86 boards, and we've been using 2.6.16.60 for a long
>>> time. Before time we occasionally found ext3 was switched to read-only
>>> while services were running, and we took it for granted it must be
>>> some hardware problems.
>>> But recently this issue happens frequently, both in old boards and
>>> new boards. We've analyzed logs, and in one board we did find
>>> exceptional reboot (but ext3 error happened 9 days after), and in
>>> another board we found mptbase recovery routine, but in all other
>>> boards there's no suspicious output at all.
>>> The only change with the system is some application updates, and
>>> apps now put more IO burden on disks.
>>> The error always happened in ext3_readdir, like this:
>>>
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory#6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>>> Aborting journal on device sda7.
>>> EXT3-fs error (device sda7) in start_transaction: Readonly filesystem
>>> Aborting journal on device sda7.
>>> ext3_abort called.
>>> EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
>>> Remounting filesystem read-only
>>> __journal_remove_journal_head: freeing b_committed_data
>>>
>>> We highly doubt it's hardware failures with this frequency in mind, so
>>> we're wondering regarding to this issue if there's some ext3 bug-fix
>>> having merged into mainline but not in our old kernel?
>>
>> Absolutely there are. There have been 87 changes just to namei.c since 2.6.16.
>> You could look through git logs to see if anything looks applicable.
>>
>> You might try:
>>
>> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ext34: ensure do_split leaves enough free space in both blocks
>
> I've been asked to investigate this issue. Thanks for the reply!
>
> I found this fix while searching for similar bug reports, but I don't think it
> worths trying as we don't use dir_index feature.
>
> I've collected some logs in different machines, and the error was always
> triggered in ext3_readdir:
>
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>
> The last two errors happened on the same machine, and the same inode! One
> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
So now this directory has been fscked to be right? You can try by just
ls this directory and check whether there are any errors in dmesg.

Having said that, as this error happens 2 times for the same inode,
maybe there is a kernel bug. At least as Ted said in another mail, the
end of this buffer head seems to be cleared. So I guess next time when
you see this error, please do:
1. use debugfs to find the disk layout for this dir
2. read the blocks from the block device directly
3. check whether the end of a block(from offset to the end) is zeroed.
4. If yes, I guess there should be a kernel bug and we can go on to
investigate the code.

Thanks
Tao
>
> The offset is always a bit smaller than blocksize, and all the fields are 0.
> I dumped one of the dirs, and only ~1.6K was used (fsck reported no error).
>
> In some machines fsck reported no error at all, and in others filesystems
> were corrupted though fixable.
>
> I didn't see any other error messages before this error at all.
>
> Does this remind you of some old ext3 bug?
>
> I'll send you fsck output, dir contents and other logs if u'r interested.
>
>>
>> but to be honest, sticking with such an old kernel means you are largely on your own, or may need contract help if you can't resolve it.
>>
>
> There're numerous machines running old kernels, and many of them are hard to
> change. :(
>
> Yesterday they upgrade apps on ~30 machines, and soon after that 5 machines
> had filesystem corrupted. However they won't stop upgrading other machines!
>
> On the other hand, we can hardly reproduce this bug in the lab.
>
> So this is critical and urgent. Any help is appreciated.
>
> Regards
> Li Zefan
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2012-12-04 16:11:37

by Bernd Schubert

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On 12/04/2012 04:29 PM, Tao Ma wrote:
>> The last two errors happened on the same machine, and the same inode! One
>> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
> So now this directory has been fscked to be right? You can try by just
> ls this directory and check whether there are any errors in dmesg.
>
> Having said that, as this error happens 2 times for the same inode,
> maybe there is a kernel bug. At least as Ted said in another mail, the
> end of this buffer head seems to be cleared. So I guess next time when
> you see this error, please do:
> 1. use debugfs to find the disk layout for this dir
> 2. read the blocks from the block device directly
> 3. check whether the end of a block(from offset to the end) is zeroed.
> 4. If yes, I guess there should be a kernel bug and we can go on to
> investigate the code.

I still don't know if it is related to htree only, but e2fsck isn't
properly fixing directory issues without the "-D" option.
For example I have a VM here, where the kernel frequently reports
something like:

> [ 304.096059] EXT4-fs (vdb): error count: 4
> [ 304.096305] EXT4-fs (vdb): initial error at 1352366631: htree_dirblock_to_tree:861: inode 3146582: block 1641814
> [ 304.096857] EXT4-fs (vdb): last error at 1352381914: empty_dir:2334: inode 3146582: block 1641814
> [86807.520052] EXT4-fs (vdb): error count: 4

and e2fsck does not report anything. Also the dir_entry_type is for some
dirs wrongly reported by the kernel, but seen correctly by e2fsck
(https://bugzilla.kernel.org/show_bug.cgi?id=50261).

I hope to find some time to investigate that next week. I have seen it
several times already, but never had the chance to investigate or to
take an image.

So I would really recommend to run "e2fsck -D" for the issue reported here.

Cheers,
Bernd


Cheers,
Bernd

2012-12-04 16:16:53

by qixuan wu

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

Hi Tao, all,

I guess it's a memory(or ext3/kenrel) issue. Beause in one
machine, after report this issue, the partition is made to readonly,
we use debugfs to "ls dir", and it's fine. It can list all files
without error. If the disk has issue, when we using ls command, it
will give error also. (The dir name is also using debugfs to get by
issue inode ID.)

Is there the possibility: one thread(A) is read_dir(directly read
from buffer head), and another thread(B) is creating item, and fill
this buffer header at the same time. During create item, first modify
the last item's rec_len(let it point to next item which initially is
zero), then fill this added new item. Suppose the seq is below :
1) B: modify last item's rec_len
2) A: Read last item, rec_len is modified already by B, and it
identify next item is existing.
3) A: Read new item, all feilds are zero still.
3) B: fill new item with correct value.

This may cause problem. Sorry I am not still checking the code
properly. Raise this suppose is just hope ext3 experts can help to
think whether such concurring scenario has problem or not ?
Any idea or clue is welcome.

Regards & Thanks a lot.
Wuqx

On Tue, Dec 4, 2012 at 11:29 PM, Tao Ma <[email protected]> wrote:
> Hi zefan,
> On 12/04/2012 09:54 PM, Li Zefan wrote:
>>>> We have many x86 boards, and we've been using 2.6.16.60 for a long
>>>> time. Before time we occasionally found ext3 was switched to read-only
>>>> while services were running, and we took it for granted it must be
>>>> some hardware problems.
>>>> But recently this issue happens frequently, both in old boards and
>>>> new boards. We've analyzed logs, and in one board we did find
>>>> exceptional reboot (but ext3 error happened 9 days after), and in
>>>> another board we found mptbase recovery routine, but in all other
>>>> boards there's no suspicious output at all.
>>>> The only change with the system is some application updates, and
>>>> apps now put more IO burden on disks.
>>>> The error always happened in ext3_readdir, like this:
>>>>
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory#6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>>>> Aborting journal on device sda7.
>>>> EXT3-fs error (device sda7) in start_transaction: Readonly filesystem
>>>> Aborting journal on device sda7.
>>>> ext3_abort called.
>>>> EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
>>>> Remounting filesystem read-only
>>>> __journal_remove_journal_head: freeing b_committed_data
>>>>
>>>> We highly doubt it's hardware failures with this frequency in mind, so
>>>> we're wondering regarding to this issue if there's some ext3 bug-fix
>>>> having merged into mainline but not in our old kernel?
>>>
>>> Absolutely there are. There have been 87 changes just to namei.c since 2.6.16.
>>> You could look through git logs to see if anything looks applicable.
>>>
>>> You might try:
>>>
>>> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ext34: ensure do_split leaves enough free space in both blocks
>>
>> I've been asked to investigate this issue. Thanks for the reply!
>>
>> I found this fix while searching for similar bug reports, but I don't think it
>> worths trying as we don't use dir_index feature.
>>
>> I've collected some logs in different machines, and the error was always
>> triggered in ext3_readdir:
>>
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>>
>> The last two errors happened on the same machine, and the same inode! One
>> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
> So now this directory has been fscked to be right? You can try by just
> ls this directory and check whether there are any errors in dmesg.
>
> Having said that, as this error happens 2 times for the same inode,
> maybe there is a kernel bug. At least as Ted said in another mail, the
> end of this buffer head seems to be cleared. So I guess next time when
> you see this error, please do:
> 1. use debugfs to find the disk layout for this dir
> 2. read the blocks from the block device directly
> 3. check whether the end of a block(from offset to the end) is zeroed.
> 4. If yes, I guess there should be a kernel bug and we can go on to
> investigate the code.
>
> Thanks
> Tao
>>
>> The offset is always a bit smaller than blocksize, and all the fields are 0.
>> I dumped one of the dirs, and only ~1.6K was used (fsck reported no error).
>>
>> In some machines fsck reported no error at all, and in others filesystems
>> were corrupted though fixable.
>>
>> I didn't see any other error messages before this error at all.
>>
>> Does this remind you of some old ext3 bug?
>>
>> I'll send you fsck output, dir contents and other logs if u'r interested.
>>
>>>
>>> but to be honest, sticking with such an old kernel means you are largely on your own, or may need contract help if you can't resolve it.
>>>
>>
>> There're numerous machines running old kernels, and many of them are hard to
>> change. :(
>>
>> Yesterday they upgrade apps on ~30 machines, and soon after that 5 machines
>> had filesystem corrupted. However they won't stop upgrading other machines!
>>
>> On the other hand, we can hardly reproduce this bug in the lab.
>>
>> So this is critical and urgent. Any help is appreciated.
>>
>> Regards
>> Li Zefan
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>

2012-12-04 20:20:55

by Theodore Ts'o

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Tue, Dec 04, 2012 at 05:11:37PM +0100, Bernd Schubert wrote:
> I still don't know if it is related to htree only, but e2fsck isn't
> properly fixing directory issues without the "-D" option.
> For example I have a VM here, where the kernel frequently reports
> something like:
>
> > [ 304.096059] EXT4-fs (vdb): error count: 4
> > [ 304.096305] EXT4-fs (vdb): initial error at 1352366631: htree_dirblock_to_tree:861: inode 3146582: block 1641814
> > [ 304.096857] EXT4-fs (vdb): last error at 1352381914: empty_dir:2334: inode 3146582: block 1641814
> > [86807.520052] EXT4-fs (vdb): error count: 4
>
> and e2fsck does not report anything.

This is not an error; this is pointing out that there previously *was*
an error, with the first file system error happening at:

% date --date "@1352366631"
Thu Nov 8 04:23:51 EST 2012

and the most recent file system error happening at:

% date --date "@1352381914"
Thu Nov 8 08:38:34 EST 2012

(your results may differ depending on your local time zone :-)

Newer versions of e2fsck clear this information from the superblock
after the file system is successfully fixed. You upgraded your kernel
without also updating e2fsprogs, and the newer kernels will print this
message approximately once a day so that users who have their file
systems set up to use errors=continue, that they get some warning that
their file system has been corrupted, even if their log files have
been rotated away.

It's also useful when trying to debug user problems where they are
convinced it is an ext4 bug, but in fact it's because they've buggered
their init scripts not to run fsck, or they are using an external USB
disk with ext4 and aren't bothering to check it with e2fsck after an
error was reported, etc. We can now see that the file system had been
first corrupted $N weeks/months ago, and it isn't a file system
regression worthy of linkbait scares articles on some random website...


In any case, that's why you're seeing it. It's really not a problem,
only a cosmetic issue, which can be easily fixed by upgrading
e2fsprogs.

> Also the dir_entry_type is for some
> dirs wrongly reported by the kernel, but seen correctly by e2fsck
> (https://bugzilla.kernel.org/show_bug.cgi?id=50261).

I've looked at your bug report, and it seems pretty clear that it's
correct on disk. I'm not sure what might be going on, but this
doesn't look like an e2fsck problem, but either a problem in glibc,
the vfs kernel code or the ext4 kernel code.

If you re-run your program, is it consistent which directories
apparently have the wrong d_type information returned for them?

> I hope to find some time to investigate that next week. I have seen it
> several times already, but never had the chance to investigate or to
> take an image.
>
> So I would really recommend to run "e2fsck -D" for the issue reported here.

Did e2fsck -D really change what you saw? That will rewrite all of
the directories as part of optimizing them all, but it certainly won't
fix the error count/first error/last error series of informational
messages. For that you just need to get a newer version of e2fsck.

Regards,

- Ted

2012-12-04 20:45:08

by Theodore Ts'o

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Wed, Dec 05, 2012 at 12:16:50AM +0800, qixuan wu wrote:
>
> Is there the possibility: one thread(A) is read_dir(directly read
> from buffer head), and another thread(B) is creating item, and fill
> this buffer header at the same time. During create item, first modify
> the last item's rec_len(let it point to next item which initially is
> zero), then fill this added new item.

This should be handled by the VFS, which uses i_mutex to protect
directories before calling the filesystem-specific lookup, rename,
readdir, etc functions.

So absent some bug where we are missing taking i_mutex (a quick check
didn't show any obvious missing paths, although I didn't do an
exhaustive code audit), we should be OK.... if we had a bug here, I
would have expected that one of the fs stress tests that we run would
have shown them up pretty quickly.

Regards,

- Ted

2012-12-05 10:43:45

by Zefan Li

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On 2012/12/4 23:09, Theodore Ts'o wrote:
> On Tue, Dec 04, 2012 at 09:54:05PM +0800, Li Zefan wrote:
>>
>> I've collected some logs in different machines, and the error was always
>> triggered in ext3_readdir:
>>
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>
> This looks like the last part of the inode was zapped. It might be

I don't think so. See below...

> worth adding a kernel patch which dumps out the entire directory block
> as a hex dump when this triggers --- and then compare it to what you
> get if you dump the directory back out after the machine reboot. That
> might given you a hint if something is corrupting the directory block
> in memory. (especially if you set the remount read-only option).
>
>> The last two errors happened on the same machine, and the same inode! One
>> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
>
> If it's always the same inode, you might want to correlate based on
> the pathname. Is there any commonality accross multiple machines in
> terms of the directory name, and what application(s) might be touching
> that directory?
>

I found this in one log:

Nov 14 05:26:55 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
Nov 14 13:42:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
Nov 16 17:29:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
Nov 23 19:42:44 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0

Happend 4 times, the same inode, different offsets. Another log showed the
same pattern.

They said they ran fsck everytime this happened. Many machines got this problem,
but they remember most of the time fsck didn't report error.(*)

I've checked the pathname, and they all points to log dirs. There're 2 kinds
of log dirs with different loggers, but seems work similarly.

Except one bug report, all others point to exactly the same log dir.

There're two processes that will touch this dir. One is a monitor, it will
delete old logs if they occupy too much space, but normally this shouldn't
happen.

Another is the logger. When it wants to log sth, it scans the directory, if
there're more than 100 log files, it will delete the oldest one. After writting
to the current log file, if the file is larger than 8M, this file will be
renamed as a backup log. I haven't read the code yet. But sounds pretty
simple, right?

The length of the file name is 25. There were 35 logs dating from 2012/11/02
to 2012/11/23, and no pending deleted files. Thus the remaining ~2.8K of the
dir block is never used, so I don't think something zeroed it because it
has always been zero.

This log dir is new in this version, while the other one also exists in
old verison, with less IO.

(*) They have machines in different spots. In another spot, 5 out of ~30
machines met this problem after upgrading, and fsck reported errors in
all of them. However there were just a few errors, and they didn't seem to
relate to the directory, which means the directory seems intact. Adding
that the fs was created nearly 1 years ago and ever fscked, those errors
might have nothing to do with this bug?

btw, the version of e2fsprogsis: e2fsck 1.38 (30-Jun-2005)

Regards
Li Zefan


2012-12-05 10:46:21

by Zefan Li

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

>>>> We highly doubt it's hardware failures with this frequency in mind, so
>>>> we're wondering regarding to this issue if there's some ext3 bug-fix
>>>> having merged into mainline but not in our old kernel?
>>>
>>> Absolutely there are. There have been 87 changes just to namei.c since 2.6.16.
>>> You could look through git logs to see if anything looks applicable.
>>>
>>> You might try:
>>>
>>> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ext34: ensure do_split leaves enough free space in both blocks
>>
>> I've been asked to investigate this issue. Thanks for the reply!
>>
>> I found this fix while searching for similar bug reports, but I don't think it
>> worths trying as we don't use dir_index feature.
>>
>> I've collected some logs in different machines, and the error was always
>> triggered in ext3_readdir:
>>
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>>
>> The last two errors happened on the same machine, and the same inode! One
>> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
> So now this directory has been fscked to be right? You can try by just

right.

> ls this directory and check whether there are any errors in dmesg.
>

no error at all.

> Having said that, as this error happens 2 times for the same inode,
> maybe there is a kernel bug. At least as Ted said in another mail, the
> end of this buffer head seems to be cleared. So I guess next time when
> you see this error, please do:
> 1. use debugfs to find the disk layout for this dir
> 2. read the blocks from the block device directly
> 3. check whether the end of a block(from offset to the end) is zeroed.
> 4. If yes, I guess there should be a kernel bug and we can go on to
> investigate the code.
>

This may give us different output with that by dumping dir via debugfs?
If so I'll try next time.

Seeing from the output dumpped via debugfs in one machine, more than
harf of the dir block is all zero, but the offset is near 4K. I also
checked several other machines, no difference.

Regards
Li Zefan

2012-12-05 13:58:34

by Tao Ma

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

Hi qixuan,
On 12/05/2012 12:16 AM, qixuan wu wrote:
> Hi Tao, all,
>
> I guess it's a memory(or ext3/kenrel) issue. Beause in one
> machine, after report this issue, the partition is made to readonly,
> we use debugfs to "ls dir", and it's fine. It can list all files
> without error. If the disk has issue, when we using ls command, it
> will give error also. (The dir name is also using debugfs to get by
> issue inode ID.)
Are you sure the disk is good? I just checked the code in e2fsprogs, it
seems that it will not complain if rec_len = 0, and the dir iteration
just aborts. I guess the right way should be dd the corresponding block
out, decode and read it in binary format. :(

Thanks
Tao
>
> Is there the possibility: one thread(A) is read_dir(directly read
> from buffer head), and another thread(B) is creating item, and fill
> this buffer header at the same time. During create item, first modify
> the last item's rec_len(let it point to next item which initially is
> zero), then fill this added new item. Suppose the seq is below :
> 1) B: modify last item's rec_len
> 2) A: Read last item, rec_len is modified already by B, and it
> identify next item is existing.
> 3) A: Read new item, all feilds are zero still.
> 3) B: fill new item with correct value.
>
> This may cause problem. Sorry I am not still checking the code
> properly. Raise this suppose is just hope ext3 experts can help to
> think whether such concurring scenario has problem or not ?
> Any idea or clue is welcome.
>
> Regards & Thanks a lot.
> Wuqx
>
> On Tue, Dec 4, 2012 at 11:29 PM, Tao Ma <[email protected]> wrote:
>> Hi zefan,
>> On 12/04/2012 09:54 PM, Li Zefan wrote:
>>>>> We have many x86 boards, and we've been using 2.6.16.60 for a long
>>>>> time. Before time we occasionally found ext3 was switched to read-only
>>>>> while services were running, and we took it for granted it must be
>>>>> some hardware problems.
>>>>> But recently this issue happens frequently, both in old boards and
>>>>> new boards. We've analyzed logs, and in one board we did find
>>>>> exceptional reboot (but ext3 error happened 9 days after), and in
>>>>> another board we found mptbase recovery routine, but in all other
>>>>> boards there's no suspicious output at all.
>>>>> The only change with the system is some application updates, and
>>>>> apps now put more IO burden on disks.
>>>>> The error always happened in ext3_readdir, like this:
>>>>>
>>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory#6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>>>>> Aborting journal on device sda7.
>>>>> EXT3-fs error (device sda7) in start_transaction: Readonly filesystem
>>>>> Aborting journal on device sda7.
>>>>> ext3_abort called.
>>>>> EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
>>>>> Remounting filesystem read-only
>>>>> __journal_remove_journal_head: freeing b_committed_data
>>>>>
>>>>> We highly doubt it's hardware failures with this frequency in mind, so
>>>>> we're wondering regarding to this issue if there's some ext3 bug-fix
>>>>> having merged into mainline but not in our old kernel?
>>>>
>>>> Absolutely there are. There have been 87 changes just to namei.c since 2.6.16.
>>>> You could look through git logs to see if anything looks applicable.
>>>>
>>>> You might try:
>>>>
>>>> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ext34: ensure do_split leaves enough free space in both blocks
>>>
>>> I've been asked to investigate this issue. Thanks for the reply!
>>>
>>> I found this fix while searching for similar bug reports, but I don't think it
>>> worths trying as we don't use dir_index feature.
>>>
>>> I've collected some logs in different machines, and the error was always
>>> triggered in ext3_readdir:
>>>
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>>>
>>> The last two errors happened on the same machine, and the same inode! One
>>> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
>> So now this directory has been fscked to be right? You can try by just
>> ls this directory and check whether there are any errors in dmesg.
>>
>> Having said that, as this error happens 2 times for the same inode,
>> maybe there is a kernel bug. At least as Ted said in another mail, the
>> end of this buffer head seems to be cleared. So I guess next time when
>> you see this error, please do:
>> 1. use debugfs to find the disk layout for this dir
>> 2. read the blocks from the block device directly
>> 3. check whether the end of a block(from offset to the end) is zeroed.
>> 4. If yes, I guess there should be a kernel bug and we can go on to
>> investigate the code.
>>
>> Thanks
>> Tao
>>>
>>> The offset is always a bit smaller than blocksize, and all the fields are 0.
>>> I dumped one of the dirs, and only ~1.6K was used (fsck reported no error).
>>>
>>> In some machines fsck reported no error at all, and in others filesystems
>>> were corrupted though fixable.
>>>
>>> I didn't see any other error messages before this error at all.
>>>
>>> Does this remind you of some old ext3 bug?
>>>
>>> I'll send you fsck output, dir contents and other logs if u'r interested.
>>>
>>>>
>>>> but to be honest, sticking with such an old kernel means you are largely on your own, or may need contract help if you can't resolve it.
>>>>
>>>
>>> There're numerous machines running old kernels, and many of them are hard to
>>> change. :(
>>>
>>> Yesterday they upgrade apps on ~30 machines, and soon after that 5 machines
>>> had filesystem corrupted. However they won't stop upgrading other machines!
>>>
>>> On the other hand, we can hardly reproduce this bug in the lab.
>>>
>>> So this is critical and urgent. Any help is appreciated.
>>>
>>> Regards
>>> Li Zefan
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2012-12-05 15:05:39

by Theodore Ts'o

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Wed, Dec 05, 2012 at 09:58:17PM +0800, Tao Ma wrote:
> Are you sure the disk is good? I just checked the code in e2fsprogs, it
> seems that it will not complain if rec_len = 0, and the dir iteration
> just aborts.

Um, where are you seeing this? It should be complaining that the
directory entry is corrupt if rec_len < 12. I believe I see the
checks there, but I haven't tried creating some test file systems to
be sure.

- Ted

2012-12-05 15:46:03

by qixuan wu

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

Hi Tao,

On Wed, Dec 5, 2012 at 9:58 PM, Tao Ma <[email protected]> wrote:
> Hi qixuan,
> On 12/05/2012 12:16 AM, qixuan wu wrote:
>> Hi Tao, all,
>>
>> I guess it's a memory(or ext3/kenrel) issue. Beause in one
>> machine, after report this issue, the partition is made to readonly,
>> we use debugfs to "ls dir", and it's fine. It can list all files
>> without error. If the disk has issue, when we using ls command, it
>> will give error also. (The dir name is also using debugfs to get by
>> issue inode ID.)
> Are you sure the disk is good? I just checked the code in e2fsprogs, it
> seems that it will not complain if rec_len = 0, and the dir iteration
> just aborts. I guess the right way should be dd the corresponding block
> out, decode and read it in binary format. :(

Good idea, will do it.

Thanks
wuqixuan

> Thanks
> Tao

2012-12-05 15:51:09

by qixuan wu

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Wed, Dec 5, 2012 at 10:26 PM, Tao Ma <[email protected]> wrote:
> On 12/05/2012 06:43 PM, Li Zefan wrote:
>> On 2012/12/4 23:09, Theodore Ts'o wrote:
>>> On Tue, Dec 04, 2012 at 09:54:05PM +0800, Li Zefan wrote:
>>>>
>>>> I've collected some logs in different machines, and the error was always
>>>> triggered in ext3_readdir:
>>>>
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>>>
>>> This looks like the last part of the inode was zapped. It might be
>>
>> I don't think so. See below...
>>
>>> worth adding a kernel patch which dumps out the entire directory block
>>> as a hex dump when this triggers --- and then compare it to what you
>>> get if you dump the directory back out after the machine reboot. That
>>> might given you a hint if something is corrupting the directory block
>>> in memory. (especially if you set the remount read-only option).
>>>
>>>> The last two errors happened on the same machine, and the same inode! One
>>>> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
>>>
>>> If it's always the same inode, you might want to correlate based on
>>> the pathname. Is there any commonality accross multiple machines in
>>> terms of the directory name, and what application(s) might be touching
>>> that directory?
>>>
>>
>> I found this in one log:
>>
>> Nov 14 05:26:55 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
>> Nov 14 13:42:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>> Nov 16 17:29:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>> Nov 23 19:42:44 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
>>
>> Happend 4 times, the same inode, different offsets. Another log showed the
>> same pattern.
>>
>> They said they ran fsck everytime this happened. Many machines got this problem,
>> but they remember most of the time fsck didn't report error.(*)
>>
>> I've checked the pathname, and they all points to log dirs. There're 2 kinds
>> of log dirs with different loggers, but seems work similarly.
>>
>> Except one bug report, all others point to exactly the same log dir.
>>
>> There're two processes that will touch this dir. One is a monitor, it will
>> delete old logs if they occupy too much space, but normally this shouldn't
>> happen.
>>
>> Another is the logger. When it wants to log sth, it scans the directory, if
>> there're more than 100 log files, it will delete the oldest one. After writting
>> to the current log file, if the file is larger than 8M, this file will be
>> renamed as a backup log. I haven't read the code yet. But sounds pretty
>> simple, right?
>>
>> The length of the file name is 25. There were 35 logs dating from 2012/11/02
>> to 2012/11/23, and no pending deleted files. Thus the remaining ~2.8K of the
>> dir block is never used, so I don't think something zeroed it because it
>> has always been zero.
> Only 35 files? So there should be no rename. And the only possible
> action we do to this dir is "create a new log file", right? Then, I
> really don't think ext3 will error in such a simple test case. :(
>
>>
>> This log dir is new in this version, while the other one also exists in
>> old verison, with less IO.
> You mean the kernel version? Sorry, but what do you want to tell us here?

Here is the user-space app version. In the new user-space app version,
this file op model is used and the problem is coming.

Thanks
wuqixuan

> Thanks
> Tao
>>
>> (*) They have machines in different spots. In another spot, 5 out of ~30
>> machines met this problem after upgrading, and fsck reported errors in
>> all of them. However there were just a few errors, and they didn't seem to
>> relate to the directory, which means the directory seems intact. Adding
>> that the fs was created nearly 1 years ago and ever fscked, those errors
>> might have nothing to do with this bug?
>>
>> btw, the version of e2fsprogsis: e2fsck 1.38 (30-Jun-2005)
>>
>> Regards
>> Li Zefan
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>

2012-12-06 00:13:18

by Tao Ma

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On 12/05/2012 06:46 PM, Li Zefan wrote:
>>>>> We highly doubt it's hardware failures with this frequency in mind, so
>>>>> we're wondering regarding to this issue if there's some ext3 bug-fix
>>>>> having merged into mainline but not in our old kernel?
>>>>
>>>> Absolutely there are. There have been 87 changes just to namei.c since 2.6.16.
>>>> You could look through git logs to see if anything looks applicable.
>>>>
>>>> You might try:
>>>>
>>>> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ext34: ensure do_split leaves enough free space in both blocks
>>>
>>> I've been asked to investigate this issue. Thanks for the reply!
>>>
>>> I found this fix while searching for similar bug reports, but I don't think it
>>> worths trying as we don't use dir_index feature.
>>>
>>> I've collected some logs in different machines, and the error was always
>>> triggered in ext3_readdir:
>>>
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>>>
>>> The last two errors happened on the same machine, and the same inode! One
>>> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
>> So now this directory has been fscked to be right? You can try by just
>
> right.
>
>> ls this directory and check whether there are any errors in dmesg.
>>
>
> no error at all.
OK, so now it is fixed by e2fsck. hmm, is there any stress inode
creation/deletion in this dir? 2.6.16 is too older although I am not
sure whether this is a bug or not.
>
>> Having said that, as this error happens 2 times for the same inode,
>> maybe there is a kernel bug. At least as Ted said in another mail, the
>> end of this buffer head seems to be cleared. So I guess next time when
>> you see this error, please do:
>> 1. use debugfs to find the disk layout for this dir
>> 2. read the blocks from the block device directly
>> 3. check whether the end of a block(from offset to the end) is zeroed.
>> 4. If yes, I guess there should be a kernel bug and we can go on to
>> investigate the code.
>>
>
> This may give us different output with that by dumping dir via debugfs?
> If so I'll try next time.
In step 2, I mean dd out these blocks, decode and read them by
yourselves to check whether there are zeroes.

Thanks
Tao
>
> Seeing from the output dumpped via debugfs in one machine, more than
> harf of the dir block is all zero, but the offset is near 4K. I also
> checked several other machines, no difference.


Thanks
Tao


2012-12-05 14:26:05

by Tao Ma

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On 12/05/2012 06:43 PM, Li Zefan wrote:
> On 2012/12/4 23:09, Theodore Ts'o wrote:
>> On Tue, Dec 04, 2012 at 09:54:05PM +0800, Li Zefan wrote:
>>>
>>> I've collected some logs in different machines, and the error was always
>>> triggered in ext3_readdir:
>>>
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>>
>> This looks like the last part of the inode was zapped. It might be
>
> I don't think so. See below...
>
>> worth adding a kernel patch which dumps out the entire directory block
>> as a hex dump when this triggers --- and then compare it to what you
>> get if you dump the directory back out after the machine reboot. That
>> might given you a hint if something is corrupting the directory block
>> in memory. (especially if you set the remount read-only option).
>>
>>> The last two errors happened on the same machine, and the same inode! One
>>> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
>>
>> If it's always the same inode, you might want to correlate based on
>> the pathname. Is there any commonality accross multiple machines in
>> terms of the directory name, and what application(s) might be touching
>> that directory?
>>
>
> I found this in one log:
>
> Nov 14 05:26:55 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
> Nov 14 13:42:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
> Nov 16 17:29:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
> Nov 23 19:42:44 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
>
> Happend 4 times, the same inode, different offsets. Another log showed the
> same pattern.
>
> They said they ran fsck everytime this happened. Many machines got this problem,
> but they remember most of the time fsck didn't report error.(*)
>
> I've checked the pathname, and they all points to log dirs. There're 2 kinds
> of log dirs with different loggers, but seems work similarly.
>
> Except one bug report, all others point to exactly the same log dir.
>
> There're two processes that will touch this dir. One is a monitor, it will
> delete old logs if they occupy too much space, but normally this shouldn't
> happen.
>
> Another is the logger. When it wants to log sth, it scans the directory, if
> there're more than 100 log files, it will delete the oldest one. After writting
> to the current log file, if the file is larger than 8M, this file will be
> renamed as a backup log. I haven't read the code yet. But sounds pretty
> simple, right?
>
> The length of the file name is 25. There were 35 logs dating from 2012/11/02
> to 2012/11/23, and no pending deleted files. Thus the remaining ~2.8K of the
> dir block is never used, so I don't think something zeroed it because it
> has always been zero.
Only 35 files? So there should be no rename. And the only possible
action we do to this dir is "create a new log file", right? Then, I
really don't think ext3 will error in such a simple test case. :(

>
> This log dir is new in this version, while the other one also exists in
> old verison, with less IO.
You mean the kernel version? Sorry, but what do you want to tell us here?

Thanks
Tao
>
> (*) They have machines in different spots. In another spot, 5 out of ~30
> machines met this problem after upgrading, and fsck reported errors in
> all of them. However there were just a few errors, and they didn't seem to
> relate to the directory, which means the directory seems intact. Adding
> that the fs was created nearly 1 years ago and ever fscked, those errors
> might have nothing to do with this bug?
>
> btw, the version of e2fsprogsis: e2fsck 1.38 (30-Jun-2005)
>
> Regards
> Li Zefan
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2012-12-06 01:14:07

by Zefan Li

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

>> I found this in one log:
>>
>> Nov 14 05:26:55 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
>> Nov 14 13:42:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>> Nov 16 17:29:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>> Nov 23 19:42:44 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
>>
>> Happend 4 times, the same inode, different offsets. Another log showed the
>> same pattern.
>>
>> They said they ran fsck everytime this happened. Many machines got this problem,
>> but they remember most of the time fsck didn't report error.(*)
>>
>> I've checked the pathname, and they all points to log dirs. There're 2 kinds
>> of log dirs with different loggers, but seems work similarly.
>>
>> Except one bug report, all others point to exactly the same log dir.
>>
>> There're two processes that will touch this dir. One is a monitor, it will
>> delete old logs if they occupy too much space, but normally this shouldn't
>> happen.
>>
>> Another is the logger. When it wants to log sth, it scans the directory, if
>> there're more than 100 log files, it will delete the oldest one. After writting
>> to the current log file, if the file is larger than 8M, this file will be
>> renamed as a backup log. I haven't read the code yet. But sounds pretty
>> simple, right?
>>
>> The length of the file name is 25. There were 35 logs dating from 2012/11/02
>> to 2012/11/23, and no pending deleted files. Thus the remaining ~2.8K of the
>> dir block is never used, so I don't think something zeroed it because it
>> has always been zero.
> Only 35 files? So there should be no rename. And the only possible

Yes, there can be. The curren log will be renamed when it reaches 8M, and then
a new log is created as the current log.

> action we do to this dir is "create a new log file", right? Then, I
> really don't think ext3 will error in such a simple test case. :(
>
>>
>> This log dir is new in this version, while the other one also exists in
>> old verison, with less IO.
> You mean the kernel version? Sorry, but what do you want to tell us here?
>

The versions of the apps. One of the differences between them is the log system,
and the old apps won't trigger this ext3 error.


2012-12-06 01:17:49

by Zefan Li

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On 2012/12/5 22:02, Tao Ma wrote:
> On 12/05/2012 06:46 PM, Li Zefan wrote:
>>>>>> We highly doubt it's hardware failures with this frequency in mind, so
>>>>>> we're wondering regarding to this issue if there's some ext3 bug-fix
>>>>>> having merged into mainline but not in our old kernel?
>>>>>
>>>>> Absolutely there are. There have been 87 changes just to namei.c since 2.6.16.
>>>>> You could look through git logs to see if anything looks applicable.
>>>>>
>>>>> You might try:
>>>>>
>>>>> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ext34: ensure do_split leaves enough free space in both blocks
>>>>
>>>> I've been asked to investigate this issue. Thanks for the reply!
>>>>
>>>> I found this fix while searching for similar bug reports, but I don't think it
>>>> worths trying as we don't use dir_index feature.
>>>>
>>>> I've collected some logs in different machines, and the error was always
>>>> triggered in ext3_readdir:
>>>>
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>>>>
>>>> The last two errors happened on the same machine, and the same inode! One
>>>> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
>>> So now this directory has been fscked to be right? You can try by just
>>
>> right.
>>
>>> ls this directory and check whether there are any errors in dmesg.
>>>
>>
>> no error at all.
> OK, so now it is fixed by e2fsck. hmm, is there any stress inode

I'm inclined to believe the on-disk dir didn't get corrupted, and so fsck
found no errors.

> creation/deletion in this dir? 2.6.16 is too older although I am not
> sure whether this is a bug or not.

I don't think there're frequent file create/delete/rename ops in the log dir.


2012-12-06 01:54:35

by Tao Ma

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On 12/05/2012 11:05 PM, Theodore Ts'o wrote:
> On Wed, Dec 05, 2012 at 09:58:17PM +0800, Tao Ma wrote:
>> Are you sure the disk is good? I just checked the code in e2fsprogs, it
>> seems that it will not complain if rec_len = 0, and the dir iteration
>> just aborts.
>
> Um, where are you seeing this? It should be complaining that the
> directory entry is corrupt if rec_len < 12. I believe I see the
> checks there, but I haven't tried creating some test file systems to
> be sure.
Oh, sorry my fault. Yes, I see a final com_err in the end of do_list_dir
in debugfs. So we should complain about the dir corruption.

Thanks
Tao

2012-12-06 02:58:55

by Yongqiang Yang

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

Hi quxuan,

Could you have a test with ext3's journal disabled? In our
production, I encountered several cases where deleting on a deleted
entry happened in ext2, however it doesn''t bother apps, so we didn't
dig the cause.

Yongqiang.

On Wed, Dec 5, 2012 at 11:46 PM, qixuan wu <[email protected]> wrote:
> Hi Tao,
>
> On Wed, Dec 5, 2012 at 9:58 PM, Tao Ma <[email protected]> wrote:
>> Hi qixuan,
>> On 12/05/2012 12:16 AM, qixuan wu wrote:
>>> Hi Tao, all,
>>>
>>> I guess it's a memory(or ext3/kenrel) issue. Beause in one
>>> machine, after report this issue, the partition is made to readonly,
>>> we use debugfs to "ls dir", and it's fine. It can list all files
>>> without error. If the disk has issue, when we using ls command, it
>>> will give error also. (The dir name is also using debugfs to get by
>>> issue inode ID.)
>> Are you sure the disk is good? I just checked the code in e2fsprogs, it
>> seems that it will not complain if rec_len = 0, and the dir iteration
>> just aborts. I guess the right way should be dd the corresponding block
>> out, decode and read it in binary format. :(
>
> Good idea, will do it.
>
> Thanks
> wuqixuan
>
>> Thanks
>> Tao
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Best Wishes
Yongqiang Yang

2012-12-06 12:37:44

by Jan Kara

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Thu 06-12-12 09:13:45, Li Zefan wrote:
> >> I found this in one log:
> >>
> >> Nov 14 05:26:55 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
> >> Nov 14 13:42:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
> >> Nov 16 17:29:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
> >> Nov 23 19:42:44 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
Sorry for posting here in the thread but I got unsubscribed from the
list so I don't have the beginning of the thread in my inbox.

ext3 directory format is such that the last directory entry in the block
should have length to exactly fill up the whole block. Apparently, the
length got trimmed for some reason so we ended up before end of directory
block looked of another directory entry there and didn't find anything. I
will also make one observation regarding offsets. They are 3952, 4024, and
4084. If we subtract that from 4096 (block size), we get differences (in
binary) 10010000, 01001000, 00001100. Interestingly these have always two
bits set. Might be luck but need not...

Anyway it would be interesting to get the dump of the corrupted directory
before e2fsck is run. You can do that by running:
debugfs -R "dump_inode <7225391> /tmp/corrupted_dir" /dev/sda7

Then you can send the dump of the corrupted directory here.

> >> Happend 4 times, the same inode, different offsets. Another log showed the
> >> same pattern.
> >>
> >> They said they ran fsck everytime this happened. Many machines got this problem,
> >> but they remember most of the time fsck didn't report error.(*)
> >>
> >> I've checked the pathname, and they all points to log dirs. There're 2 kinds
> >> of log dirs with different loggers, but seems work similarly.
> >>
> >> Except one bug report, all others point to exactly the same log dir.
> >>
> >> There're two processes that will touch this dir. One is a monitor, it will
> >> delete old logs if they occupy too much space, but normally this shouldn't
> >> happen.
> >>
> >> Another is the logger. When it wants to log sth, it scans the directory, if
> >> there're more than 100 log files, it will delete the oldest one. After writting
> >> to the current log file, if the file is larger than 8M, this file will be
> >> renamed as a backup log. I haven't read the code yet. But sounds pretty
> >> simple, right?
> >>
> >> The length of the file name is 25. There were 35 logs dating from 2012/11/02
> >> to 2012/11/23, and no pending deleted files. Thus the remaining ~2.8K of the
> >> dir block is never used, so I don't think something zeroed it because it
> >> has always been zero.
> > Only 35 files? So there should be no rename. And the only possible
>
> Yes, there can be. The curren log will be renamed when it reaches 8M, and then
> a new log is created as the current log.
Yep. Still the load looks extremely simple. We use 2.6.16 based kernels
in our SLES distros and I never heard about such corruption. Strange.

> > action we do to this dir is "create a new log file", right? Then, I
> > really don't think ext3 will error in such a simple test case. :(
> >
> >>
> >> This log dir is new in this version, while the other one also exists in
> >> old verison, with less IO.
> > You mean the kernel version? Sorry, but what do you want to tell us here?
>
> The versions of the apps. One of the differences between them is the log system,
> and the old apps won't trigger this ext3 error.
Indeed, this is even stranger...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-12-06 15:48:22

by qixuan wu

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

Hi Tao,

On Thu, Dec 6, 2012 at 9:54 AM, Tao Ma <[email protected]> wrote:
> On 12/05/2012 11:05 PM, Theodore Ts'o wrote:
>> On Wed, Dec 05, 2012 at 09:58:17PM +0800, Tao Ma wrote:
>>> Are you sure the disk is good? I just checked the code in e2fsprogs, it
>>> seems that it will not complain if rec_len = 0, and the dir iteration
>>> just aborts.
>>
>> Um, where are you seeing this? It should be complaining that the
>> directory entry is corrupt if rec_len < 12. I believe I see the
>> checks there, but I haven't tried creating some test file systems to
>> be sure.
> Oh, sorry my fault. Yes, I see a final com_err in the end of do_list_dir
> in debugfs. So we should complain about the dir corruption.

I have already also checked the raw binary of the dir block in the
disk and the data is correct. So disk no issue, only data in mem is
corrupted.

> Thanks
> Tao

2012-12-06 16:21:25

by qixuan wu

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

Hi Kara,

On Thu, Dec 6, 2012 at 8:37 PM, Jan Kara <[email protected]> wrote:
> On Thu 06-12-12 09:13:45, Li Zefan wrote:
>> >> I found this in one log:
>> >>
>> >> Nov 14 05:26:55 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
>> >> Nov 14 13:42:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>> >> Nov 16 17:29:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>> >> Nov 23 19:42:44 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
> Sorry for posting here in the thread but I got unsubscribed from the
> list so I don't have the beginning of the thread in my inbox.
>
> ext3 directory format is such that the last directory entry in the block
> should have length to exactly fill up the whole block. Apparently, the
> length got trimmed for some reason so we ended up before end of directory
> block looked of another directory entry there and didn't find anything. I
> will also make one observation regarding offsets. They are 3952, 4024, and
> 4084. If we subtract that from 4096 (block size), we get differences (in
> binary) 10010000, 01001000, 00001100. Interestingly these have always two
> bits set. Might be luck but need not...

Yes, we also found the interesting things that the offset happen in
many boards are like below:
1) 3952
2) 3988( 3952+36)
3) 4024( 3988+36)
4) 4048(4042+24)
5) 4084(same as the rec_len of ".." file if there isn't any file).

I need introduce the rule of the files in the dir, for example:
.
..
current_log.txt (len is 15, rec_len is 24 when there is file after it,
the value "24" i think has relative with offset 4048)
20120526124556.865213.txt(len is 25, rec_len is 36 when there is file after it).
20120526124984.239475.txt(len is 25, rec_len is 36 when there is file after it).
....
Because the rec_len is 36, it has some relative with those offset
values( the diff of those values are multiple of 36).
I need tell another thing, customer's app invoke opendir/readdir very
frequently. There are more than 1000 times, every second(the value
need to be confirmed).

> Anyway it would be interesting to get the dump of the corrupted directory
> before e2fsck is run. You can do that by running:
> debugfs -R "dump_inode <7225391> /tmp/corrupted_dir" /dev/sda7
>
> Then you can send the dump of the corrupted directory here.

We have already dump of the data by debugfs. The data is very good
without error. But we just did it before fsck, even the fsck is not
giving any error. I want to know whether fsck will modify disk data
without reporting any error or not ?

>> >> Happend 4 times, the same inode, different offsets. Another log showed the
>> >> same pattern.
>> >>
>> >> They said they ran fsck everytime this happened. Many machines got this problem,
>> >> but they remember most of the time fsck didn't report error.(*)
>> >>
>> >> I've checked the pathname, and they all points to log dirs. There're 2 kinds
>> >> of log dirs with different loggers, but seems work similarly.
>> >>
>> >> Except one bug report, all others point to exactly the same log dir.
>> >>
>> >> There're two processes that will touch this dir. One is a monitor, it will
>> >> delete old logs if they occupy too much space, but normally this shouldn't
>> >> happen.
>> >>
>> >> Another is the logger. When it wants to log sth, it scans the directory, if
>> >> there're more than 100 log files, it will delete the oldest one. After writting
>> >> to the current log file, if the file is larger than 8M, this file will be
>> >> renamed as a backup log. I haven't read the code yet. But sounds pretty
>> >> simple, right?
>> >>
>> >> The length of the file name is 25. There were 35 logs dating from 2012/11/02
>> >> to 2012/11/23, and no pending deleted files. Thus the remaining ~2.8K of the
>> >> dir block is never used, so I don't think something zeroed it because it
>> >> has always been zero.
>> > Only 35 files? So there should be no rename. And the only possible
>>
>> Yes, there can be. The curren log will be renamed when it reaches 8M, and then
>> a new log is created as the current log.
> Yep. Still the load looks extremely simple. We use 2.6.16 based kernels
> in our SLES distros and I never heard about such corruption. Strange.
>
>> > action we do to this dir is "create a new log file", right? Then, I
>> > really don't think ext3 will error in such a simple test case. :(
>> >
>> >>
>> >> This log dir is new in this version, while the other one also exists in
>> >> old verison, with less IO.
>> > You mean the kernel version? Sorry, but what do you want to tell us here?
>>
>> The versions of the apps. One of the differences between them is the log system,
>> and the old apps won't trigger this ext3 error.
> Indeed, this is even stranger...
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

Thanks
wuqixuan.

2012-12-06 16:26:26

by qixuan wu

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

Hi yongqiang,

On Thu, Dec 6, 2012 at 10:58 AM, Yongqiang Yang <[email protected]> wrote:
> Hi quxuan,
>
> Could you have a test with ext3's journal disabled? In our
> production, I encountered several cases where deleting on a deleted
> entry happened in ext2, however it doesn''t bother apps, so we didn't
> dig the cause.

We did not test by disable journal. Our customer is usring default
journal mode. Because journal is concern only if the system power off.
If system is continously running and give readonly error, we thought
it's nothing about journal. I am not very clear what about the issue
"deleting on a deleted entry", seems not same as our issue. If you can
more detail about it, it will be welcomed. :)

> Yongqiang.
>
> On Wed, Dec 5, 2012 at 11:46 PM, qixuan wu <[email protected]> wrote:
>> Hi Tao,
>>
>> On Wed, Dec 5, 2012 at 9:58 PM, Tao Ma <[email protected]> wrote:
>>> Hi qixuan,
>>> On 12/05/2012 12:16 AM, qixuan wu wrote:
>>>> Hi Tao, all,
>>>>
>>>> I guess it's a memory(or ext3/kenrel) issue. Beause in one
>>>> machine, after report this issue, the partition is made to readonly,
>>>> we use debugfs to "ls dir", and it's fine. It can list all files
>>>> without error. If the disk has issue, when we using ls command, it
>>>> will give error also. (The dir name is also using debugfs to get by
>>>> issue inode ID.)
>>> Are you sure the disk is good? I just checked the code in e2fsprogs, it
>>> seems that it will not complain if rec_len = 0, and the dir iteration
>>> just aborts. I guess the right way should be dd the corresponding block
>>> out, decode and read it in binary format. :(
>>
>> Good idea, will do it.
>>
>> Thanks
>> wuqixuan
>>
>>> Thanks
>>> Tao
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Wishes
> Yongqiang Yang

Thanks
wuqixuan

2012-12-06 17:09:13

by Jan Kara

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Fri 07-12-12 00:21:25, qixuan wu wrote:
> Hi Kara,
>
> On Thu, Dec 6, 2012 at 8:37 PM, Jan Kara <[email protected]> wrote:
> > On Thu 06-12-12 09:13:45, Li Zefan wrote:
> >> >> I found this in one log:
> >> >>
> >> >> Nov 14 05:26:55 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
> >> >> Nov 14 13:42:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
> >> >> Nov 16 17:29:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
> >> >> Nov 23 19:42:44 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
> > Sorry for posting here in the thread but I got unsubscribed from the
> > list so I don't have the beginning of the thread in my inbox.
> >
> > ext3 directory format is such that the last directory entry in the block
> > should have length to exactly fill up the whole block. Apparently, the
> > length got trimmed for some reason so we ended up before end of directory
> > block looked of another directory entry there and didn't find anything. I
> > will also make one observation regarding offsets. They are 3952, 4024, and
> > 4084. If we subtract that from 4096 (block size), we get differences (in
> > binary) 10010000, 01001000, 00001100. Interestingly these have always two
> > bits set. Might be luck but need not...
>
> Yes, we also found the interesting things that the offset happen in
> many boards are like below:
> 1) 3952
> 2) 3988( 3952+36)
> 3) 4024( 3988+36)
> 4) 4048(4042+24)
> 5) 4084(same as the rec_len of ".." file if there isn't any file).
>
> I need introduce the rule of the files in the dir, for example:
> .
> ..
> current_log.txt (len is 15, rec_len is 24 when there is file after it,
> the value "24" i think has relative with offset 4048)
> 20120526124556.865213.txt(len is 25, rec_len is 36 when there is file after it).
> 20120526124984.239475.txt(len is 25, rec_len is 36 when there is file after it).
> ....
> Because the rec_len is 36, it has some relative with those offset
> values( the diff of those values are multiple of 36).
> I need tell another thing, customer's app invoke opendir/readdir very
> frequently. There are more than 1000 times, every second(the value
> need to be confirmed).
>
> > Anyway it would be interesting to get the dump of the corrupted directory
> > before e2fsck is run. You can do that by running:
> > debugfs -R "dump_inode <7225391> /tmp/corrupted_dir" /dev/sda7
> >
> > Then you can send the dump of the corrupted directory here.
>
> We have already dump of the data by debugfs. The data is very good
> without error. But we just did it before fsck, even the fsck is not
> giving any error. I want to know whether fsck will modify disk data
> without reporting any error or not ?
Ah, OK. So it seems that directory block is OK, just f_pos gets corrupted
somehow. There are guards in ext3_readdir() to rescan dir block when
directory is modified but maybe that's not working correctly. I don't want
to burn too much time on this since this is so ancient kernel but I'd be
looking in that direction...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-12-07 01:49:43

by Yongqiang Yang

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Fri, Dec 7, 2012 at 12:26 AM, qixuan wu <[email protected]> wrote:
> Our customer is usring default
> journal mode. Because journal is concern only if the system power off.
> If system is continously running and give readonly error, we thought
> it's nothing about journal.
nope. changes on directory are journaled, I am not sure if journal
would corrupt the data in memory, if we test the journal and without
journal we can be sure if it has something to do with journal.

'deleting on deleted entry ' means a file had been deleted, however it
is deleted again. That says a directory was corrupted or a disk error
happened. The situation happened several times.

Yongqiang Yang.



--
Best Wishes
Yongqiang Yang

2012-12-07 10:04:09

by Zefan Li

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On 2012/12/7 1:09, Jan Kara wrote:
> On Fri 07-12-12 00:21:25, qixuan wu wrote:
>> Hi Kara,
>>
>> On Thu, Dec 6, 2012 at 8:37 PM, Jan Kara <[email protected]> wrote:
>>> On Thu 06-12-12 09:13:45, Li Zefan wrote:
>>>>>> I found this in one log:
>>>>>>
>>>>>> Nov 14 05:26:55 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
>>>>>> Nov 14 13:42:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>>>>>> Nov 16 17:29:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>>>>>> Nov 23 19:42:44 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
>>> Sorry for posting here in the thread but I got unsubscribed from the
>>> list so I don't have the beginning of the thread in my inbox.
>>>
>>> ext3 directory format is such that the last directory entry in the block
>>> should have length to exactly fill up the whole block. Apparently, the
>>> length got trimmed for some reason so we ended up before end of directory
>>> block looked of another directory entry there and didn't find anything. I
>>> will also make one observation regarding offsets. They are 3952, 4024, and
>>> 4084. If we subtract that from 4096 (block size), we get differences (in
>>> binary) 10010000, 01001000, 00001100. Interestingly these have always two
>>> bits set. Might be luck but need not...
>>
>> Yes, we also found the interesting things that the offset happen in
>> many boards are like below:
>> 1) 3952
>> 2) 3988( 3952+36)
>> 3) 4024( 3988+36)
>> 4) 4048(4042+24)
>> 5) 4084(same as the rec_len of ".." file if there isn't any file).
>>
>> I need introduce the rule of the files in the dir, for example:
>> .
>> ..
>> current_log.txt (len is 15, rec_len is 24 when there is file after it,
>> the value "24" i think has relative with offset 4048)
>> 20120526124556.865213.txt(len is 25, rec_len is 36 when there is file after it).
>> 20120526124984.239475.txt(len is 25, rec_len is 36 when there is file after it).
>> ....
>> Because the rec_len is 36, it has some relative with those offset
>> values( the diff of those values are multiple of 36).
>> I need tell another thing, customer's app invoke opendir/readdir very
>> frequently. There are more than 1000 times, every second(the value
>> need to be confirmed).
>>
>>> Anyway it would be interesting to get the dump of the corrupted directory
>>> before e2fsck is run. You can do that by running:
>>> debugfs -R "dump_inode <7225391> /tmp/corrupted_dir" /dev/sda7
>>>
>>> Then you can send the dump of the corrupted directory here.
>>
>> We have already dump of the data by debugfs. The data is very good
>> without error. But we just did it before fsck, even the fsck is not
>> giving any error. I want to know whether fsck will modify disk data
>> without reporting any error or not ?
> Ah, OK. So it seems that directory block is OK, just f_pos gets corrupted
> somehow. There are guards in ext3_readdir() to rescan dir block when
> directory is modified but maybe that's not working correctly. I don't want
> to burn too much time on this since this is so ancient kernel but I'd be
> looking in that direction...
>

I've added some debug code into ext3, which does these things:
- dump the dir block
- print the current and last f_pos and offset
- dump_stack() to see which process triggers the bug

Hope we can trigger the bug in our labs (We did see this happened twice this week
in a lab), though we can't patch the kernel in the products.

I compared ext3_readdir() with latest ext3, and saw no difference except some
API changes. I'll dig deeper. Thansks for the suggestion!

Regards
Li Zefan


2012-12-11 08:01:51

by Zefan Li

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

>>> We have already dump of the data by debugfs. The data is very good
>>> without error. But we just did it before fsck, even the fsck is not
>>> giving any error. I want to know whether fsck will modify disk data
>>> without reporting any error or not ?
>> Ah, OK. So it seems that directory block is OK, just f_pos gets corrupted
>> somehow. There are guards in ext3_readdir() to rescan dir block when
>> directory is modified but maybe that's not working correctly. I don't want
>> to burn too much time on this since this is so ancient kernel but I'd be
>> looking in that direction...
>>
>
> I've added some debug code into ext3, which does these things:
> - dump the dir block
> - print the current and last f_pos and offset
> - dump_stack() to see which process triggers the bug
>
> Hope we can trigger the bug in our labs (We did see this happened twice this week
> in a lab), though we can't patch the kernel in the products.
>
> I compared ext3_readdir() with latest ext3, and saw no difference except some
> API changes. I'll dig deeper. Thansks for the suggestion!
>

We've managed to trigger the bug once, and collected some debug information. We
found the buffer head wasn't corrupted, but f_pos was set to 4024 and then ext3
reported error.

EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #12747345: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
Aborting journal on device sda7.
ext3_abort called.
EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only

00000000: 51 82 c2 00 0c 00 01 02 2e 00 00 00 04 80 c2 00 Q...............
00000010: 0c 00 02 02 2e 2e 00 00 d6 80 c2 00 10 00 06 02 ................
00000020: 62 61 63 6b 75 70 00 00 bb 82 c2 00 1c 00 11 01 backup..........
00000030: 4d 6f 6e 69 74 6f 72 53 65 72 76 69 63 65 2e 6f MonitorService.o
00000040: 70 00 00 00 be 82 c2 00 1c 00 13 01 43 6f 6d 70 p...........Comp
00000050: 6c 61 69 6e 74 50 72 6f 63 65 73 73 2e 6f 70 00 laintProcess.op.
00000060: c2 82 c2 00 20 00 15 01 4c 6f 63 61 74 69 6f 6e .... ...Location
00000070: 50 72 65 50 72 6f 63 65 73 73 2e 6f 70 00 00 00 PreProcess.op...
00000080: c9 82 c2 00 18 00 0f 01 4e 6f 72 74 68 50 72 6f ........NorthPro
00000090: 63 65 73 73 2e 6f 70 00 d4 82 c2 00 18 00 0d 01 cess.op.........
000000a0: 53 79 73 4d 6f 6e 69 74 6f 72 2e 6f 70 00 00 00 SysMonitor.op...
000000b0: db 82 c2 00 1c 00 13 01 56 56 49 50 4e 6f 72 74 ........VVIPNort
000000c0: 68 50 72 6f 63 65 73 73 2e 6f 70 00 e1 82 c2 00 hProcess.op.....
000000d0: 34 0f 09 01 72 61 6e 73 61 75 2e 6f 70 00 00 00 4...ransau.op...
000000e0: 4f 83 c2 00 20 0f 1e 01 72 61 6e 73 61 75 2e 6f O... ...ransau.o
000000f0: 70 2e 32 30 31 32 31 32 31 30 30 32 30 39 32 34 p.20121210020924
00000100: 34 35 31 33 39 34 00 00 79 83 c2 00 f8 0e 18 01 451394..y.......
00000110: 72 61 6e 73 61 75 2e 6f 70 2e 32 30 31 32 31 32 ransau.op.201212
00000120: 31 30 30 32 30 39 32 34 00 00 00 00 00 00 00 00 10020924........
...
00000ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

last_offset=-1, last_fpos=-1, f_pos=4024

-1 means we hit the bug in the first iteration in the insde while in ext3_readdir().

I've checked how ext3_readdir() works and how f_pos, f_version and i_version
get initialized and modified. Now I'm lost. I really can't see how f_pos got
corrupted. :(

A sample strace output:

10062 getdents64(46, /* 4 entries */, 4096) = 136
10062 stat64("/xxx/current_log.txt", {st_mode=S_IFREG|0600, st_size=4436494, ...}) = 0
10062 stat64("/xxx/20121205054350.593907.txt", {st_mode=S_IFREG|0600, st_size=8388846, ...}) = 0
10062 getdents64(46, /* 0 entries */, 4096) = 0

In the second call to getdents, f_pos should be 4096, but somehow it was
changed to 4024? but how...

Any hints?


2012-12-12 10:04:44

by Jan Kara

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Tue 11-12-12 16:01:51, Li Zefan wrote:
> >>> We have already dump of the data by debugfs. The data is very good
> >>> without error. But we just did it before fsck, even the fsck is not
> >>> giving any error. I want to know whether fsck will modify disk data
> >>> without reporting any error or not ?
> >> Ah, OK. So it seems that directory block is OK, just f_pos gets corrupted
> >> somehow. There are guards in ext3_readdir() to rescan dir block when
> >> directory is modified but maybe that's not working correctly. I don't want
> >> to burn too much time on this since this is so ancient kernel but I'd be
> >> looking in that direction...
> >>
> >
> > I've added some debug code into ext3, which does these things:
> > - dump the dir block
> > - print the current and last f_pos and offset
> > - dump_stack() to see which process triggers the bug
> >
> > Hope we can trigger the bug in our labs (We did see this happened twice this week
> > in a lab), though we can't patch the kernel in the products.
> >
> > I compared ext3_readdir() with latest ext3, and saw no difference except some
> > API changes. I'll dig deeper. Thansks for the suggestion!
> >
>
> We've managed to trigger the bug once, and collected some debug information. We
> found the buffer head wasn't corrupted, but f_pos was set to 4024 and then ext3
> reported error.
>
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #12747345: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
> Aborting journal on device sda7.
> ext3_abort called.
> EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
> Remounting filesystem read-only
>
> 00000000: 51 82 c2 00 0c 00 01 02 2e 00 00 00 04 80 c2 00 Q...............
> 00000010: 0c 00 02 02 2e 2e 00 00 d6 80 c2 00 10 00 06 02 ................
> 00000020: 62 61 63 6b 75 70 00 00 bb 82 c2 00 1c 00 11 01 backup..........
> 00000030: 4d 6f 6e 69 74 6f 72 53 65 72 76 69 63 65 2e 6f MonitorService.o
> 00000040: 70 00 00 00 be 82 c2 00 1c 00 13 01 43 6f 6d 70 p...........Comp
> 00000050: 6c 61 69 6e 74 50 72 6f 63 65 73 73 2e 6f 70 00 laintProcess.op.
> 00000060: c2 82 c2 00 20 00 15 01 4c 6f 63 61 74 69 6f 6e .... ...Location
> 00000070: 50 72 65 50 72 6f 63 65 73 73 2e 6f 70 00 00 00 PreProcess.op...
> 00000080: c9 82 c2 00 18 00 0f 01 4e 6f 72 74 68 50 72 6f ........NorthPro
> 00000090: 63 65 73 73 2e 6f 70 00 d4 82 c2 00 18 00 0d 01 cess.op.........
> 000000a0: 53 79 73 4d 6f 6e 69 74 6f 72 2e 6f 70 00 00 00 SysMonitor.op...
> 000000b0: db 82 c2 00 1c 00 13 01 56 56 49 50 4e 6f 72 74 ........VVIPNort
> 000000c0: 68 50 72 6f 63 65 73 73 2e 6f 70 00 e1 82 c2 00 hProcess.op.....
> 000000d0: 34 0f 09 01 72 61 6e 73 61 75 2e 6f 70 00 00 00 4...ransau.op...
> 000000e0: 4f 83 c2 00 20 0f 1e 01 72 61 6e 73 61 75 2e 6f O... ...ransau.o
> 000000f0: 70 2e 32 30 31 32 31 32 31 30 30 32 30 39 32 34 p.20121210020924
> 00000100: 34 35 31 33 39 34 00 00 79 83 c2 00 f8 0e 18 01 451394..y.......
> 00000110: 72 61 6e 73 61 75 2e 6f 70 2e 32 30 31 32 31 32 ransau.op.201212
> 00000120: 31 30 30 32 30 39 32 34 00 00 00 00 00 00 00 00 10020924........
> ...
> 00000ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
>
> last_offset=-1, last_fpos=-1, f_pos=4024
>
> -1 means we hit the bug in the first iteration in the insde while in
> ext3_readdir().
>
> I've checked how ext3_readdir() works and how f_pos, f_version and i_version
> get initialized and modified. Now I'm lost. I really can't see how f_pos got
> corrupted. :(
Hum, it looks really curious. So f_pos has been 4024 when we entered
ext3_readdir()? Do you know what it was when we last left ext3_readdir()
for that filp? You can store that value in some debug entry added to struct
file... Also any chance we ever hit:
if (version != filp->f_version)
goto revalidate;
I don't think it can ever happen since we hold i_mutex and
generic_file_llseek() takes i_mutex as well. But better be sure.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-12-12 11:31:08

by Zefan Li

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On 2012/12/12 18:04, Jan Kara wrote:
> On Tue 11-12-12 16:01:51, Li Zefan wrote:
>>>>> We have already dump of the data by debugfs. The data is very good
>>>>> without error. But we just did it before fsck, even the fsck is not
>>>>> giving any error. I want to know whether fsck will modify disk data
>>>>> without reporting any error or not ?
>>>> Ah, OK. So it seems that directory block is OK, just f_pos gets corrupted
>>>> somehow. There are guards in ext3_readdir() to rescan dir block when
>>>> directory is modified but maybe that's not working correctly. I don't want
>>>> to burn too much time on this since this is so ancient kernel but I'd be
>>>> looking in that direction...
>>>>
>>>
>>> I've added some debug code into ext3, which does these things:
>>> - dump the dir block
>>> - print the current and last f_pos and offset
>>> - dump_stack() to see which process triggers the bug
>>>
>>> Hope we can trigger the bug in our labs (We did see this happened twice this week
>>> in a lab), though we can't patch the kernel in the products.
>>>
>>> I compared ext3_readdir() with latest ext3, and saw no difference except some
>>> API changes. I'll dig deeper. Thansks for the suggestion!
>>>
>>
>> We've managed to trigger the bug once, and collected some debug information. We
>> found the buffer head wasn't corrupted, but f_pos was set to 4024 and then ext3
>> reported error.
>>
>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #12747345: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>> Aborting journal on device sda7.
>> ext3_abort called.
>> EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
>> Remounting filesystem read-only
>>
>> 00000000: 51 82 c2 00 0c 00 01 02 2e 00 00 00 04 80 c2 00 Q...............
>> 00000010: 0c 00 02 02 2e 2e 00 00 d6 80 c2 00 10 00 06 02 ................
>> 00000020: 62 61 63 6b 75 70 00 00 bb 82 c2 00 1c 00 11 01 backup..........
>> 00000030: 4d 6f 6e 69 74 6f 72 53 65 72 76 69 63 65 2e 6f MonitorService.o
>> 00000040: 70 00 00 00 be 82 c2 00 1c 00 13 01 43 6f 6d 70 p...........Comp
>> 00000050: 6c 61 69 6e 74 50 72 6f 63 65 73 73 2e 6f 70 00 laintProcess.op.
>> 00000060: c2 82 c2 00 20 00 15 01 4c 6f 63 61 74 69 6f 6e .... ...Location
>> 00000070: 50 72 65 50 72 6f 63 65 73 73 2e 6f 70 00 00 00 PreProcess.op...
>> 00000080: c9 82 c2 00 18 00 0f 01 4e 6f 72 74 68 50 72 6f ........NorthPro
>> 00000090: 63 65 73 73 2e 6f 70 00 d4 82 c2 00 18 00 0d 01 cess.op.........
>> 000000a0: 53 79 73 4d 6f 6e 69 74 6f 72 2e 6f 70 00 00 00 SysMonitor.op...
>> 000000b0: db 82 c2 00 1c 00 13 01 56 56 49 50 4e 6f 72 74 ........VVIPNort
>> 000000c0: 68 50 72 6f 63 65 73 73 2e 6f 70 00 e1 82 c2 00 hProcess.op.....
>> 000000d0: 34 0f 09 01 72 61 6e 73 61 75 2e 6f 70 00 00 00 4...ransau.op...
>> 000000e0: 4f 83 c2 00 20 0f 1e 01 72 61 6e 73 61 75 2e 6f O... ...ransau.o
>> 000000f0: 70 2e 32 30 31 32 31 32 31 30 30 32 30 39 32 34 p.20121210020924
>> 00000100: 34 35 31 33 39 34 00 00 79 83 c2 00 f8 0e 18 01 451394..y.......
>> 00000110: 72 61 6e 73 61 75 2e 6f 70 2e 32 30 31 32 31 32 ransau.op.201212
>> 00000120: 31 30 30 32 30 39 32 34 00 00 00 00 00 00 00 00 10020924........
>> ...
>> 00000ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
>>
>> last_offset=-1, last_fpos=-1, f_pos=4024
>>
>> -1 means we hit the bug in the first iteration in the insde while in
>> ext3_readdir().
>>
>> I've checked how ext3_readdir() works and how f_pos, f_version and i_version
>> get initialized and modified. Now I'm lost. I really can't see how f_pos got
>> corrupted. :(
> Hum, it looks really curious. So f_pos has been 4024 when we entered
> ext3_readdir()?

dunno. but what else can be

> Do you know what it was when we last left ext3_readdir()
> for that filp? You can store that value in some debug entry added to struct
> file... Also any chance we ever hit:
> if (version != filp->f_version)
> goto revalidate;
> I don't think it can ever happen since we hold i_mutex and
> generic_file_llseek() takes i_mutex as well. But better be sure.
>

Yesterday I've added more debug aids, which convers all the above information
mentioned. Actually the code tracks all the places that change f_pos, and
I think only lseek() and readdir() can change it.

Now I'm waiting for the bug to happen again, can be several days...


2012-12-14 03:32:32

by Peng, Tao

[permalink] [raw]
Subject: RE: help about ext3 read-only issue on ext3(2.6.16.30)

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Li
> Zefan
> Sent: Wednesday, December 12, 2012 7:31 PM
> To: Jan Kara
> Cc: qixuan wu; Tao Ma; Theodore Ts'o; Eric Sandeen; Yafang Shao; [email protected]; linux-
> [email protected]; [email protected]; [email protected]
> Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)
>
> On 2012/12/12 18:04, Jan Kara wrote:
> > On Tue 11-12-12 16:01:51, Li Zefan wrote:
> >>>>> We have already dump of the data by debugfs. The data is very good
> >>>>> without error. But we just did it before fsck, even the fsck is not
> >>>>> giving any error. I want to know whether fsck will modify disk data
> >>>>> without reporting any error or not ?
> >>>> Ah, OK. So it seems that directory block is OK, just f_pos gets corrupted
> >>>> somehow. There are guards in ext3_readdir() to rescan dir block when
> >>>> directory is modified but maybe that's not working correctly. I don't want
> >>>> to burn too much time on this since this is so ancient kernel but I'd be
> >>>> looking in that direction...
> >>>>
> >>>
> >>> I've added some debug code into ext3, which does these things:
> >>> - dump the dir block
> >>> - print the current and last f_pos and offset
> >>> - dump_stack() to see which process triggers the bug
> >>>
> >>> Hope we can trigger the bug in our labs (We did see this happened twice this week
> >>> in a lab), though we can't patch the kernel in the products.
> >>>
> >>> I compared ext3_readdir() with latest ext3, and saw no difference except some
> >>> API changes. I'll dig deeper. Thansks for the suggestion!
> >>>
> >>
> >> We've managed to trigger the bug once, and collected some debug information. We
> >> found the buffer head wasn't corrupted, but f_pos was set to 4024 and then ext3
> >> reported error.
> >>
> >> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #12747345: rec_len is smaller
> than minimal - offset=4024, inode=0, rec_len=0, name_len=0
> >> Aborting journal on device sda7.
> >> ext3_abort called.
> >> EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
> >> Remounting filesystem read-only
> >>
> >> 00000000: 51 82 c2 00 0c 00 01 02 2e 00 00 00 04 80 c2 00 Q...............
> >> 00000010: 0c 00 02 02 2e 2e 00 00 d6 80 c2 00 10 00 06 02 ................
> >> 00000020: 62 61 63 6b 75 70 00 00 bb 82 c2 00 1c 00 11 01 backup..........
> >> 00000030: 4d 6f 6e 69 74 6f 72 53 65 72 76 69 63 65 2e 6f MonitorService.o
> >> 00000040: 70 00 00 00 be 82 c2 00 1c 00 13 01 43 6f 6d 70 p...........Comp
> >> 00000050: 6c 61 69 6e 74 50 72 6f 63 65 73 73 2e 6f 70 00 laintProcess.op.
> >> 00000060: c2 82 c2 00 20 00 15 01 4c 6f 63 61 74 69 6f 6e .... ...Location
> >> 00000070: 50 72 65 50 72 6f 63 65 73 73 2e 6f 70 00 00 00 PreProcess.op...
> >> 00000080: c9 82 c2 00 18 00 0f 01 4e 6f 72 74 68 50 72 6f ........NorthPro
> >> 00000090: 63 65 73 73 2e 6f 70 00 d4 82 c2 00 18 00 0d 01 cess.op.........
> >> 000000a0: 53 79 73 4d 6f 6e 69 74 6f 72 2e 6f 70 00 00 00 SysMonitor.op...
> >> 000000b0: db 82 c2 00 1c 00 13 01 56 56 49 50 4e 6f 72 74 ........VVIPNort
> >> 000000c0: 68 50 72 6f 63 65 73 73 2e 6f 70 00 e1 82 c2 00 hProcess.op.....
> >> 000000d0: 34 0f 09 01 72 61 6e 73 61 75 2e 6f 70 00 00 00 4...ransau.op...
> >> 000000e0: 4f 83 c2 00 20 0f 1e 01 72 61 6e 73 61 75 2e 6f O... ...ransau.o
> >> 000000f0: 70 2e 32 30 31 32 31 32 31 30 30 32 30 39 32 34 p.20121210020924
> >> 00000100: 34 35 31 33 39 34 00 00 79 83 c2 00 f8 0e 18 01 451394..y.......
> >> 00000110: 72 61 6e 73 61 75 2e 6f 70 2e 32 30 31 32 31 32 ransau.op.201212
> >> 00000120: 31 30 30 32 30 39 32 34 00 00 00 00 00 00 00 00 10020924........
> >> ...
> >> 00000ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> >>
> >> last_offset=-1, last_fpos=-1, f_pos=4024
> >>
> >> -1 means we hit the bug in the first iteration in the insde while in
> >> ext3_readdir().
> >>
> >> I've checked how ext3_readdir() works and how f_pos, f_version and i_version
> >> get initialized and modified. Now I'm lost. I really can't see how f_pos got
> >> corrupted. :(
> > Hum, it looks really curious. So f_pos has been 4024 when we entered
> > ext3_readdir()?
>
> dunno. but what else can be
>
> > Do you know what it was when we last left ext3_readdir()
> > for that filp? You can store that value in some debug entry added to struct
> > file... Also any chance we ever hit:
> > if (version != filp->f_version)
> > goto revalidate;
> > I don't think it can ever happen since we hold i_mutex and
> > generic_file_llseek() takes i_mutex as well. But better be sure.
> >
>
> Yesterday I've added more debug aids, which convers all the above information
> mentioned. Actually the code tracks all the places that change f_pos, and
> I think only lseek() and readdir() can change it.
>
> Now I'm waiting for the bug to happen again, can be several days...
Not sure if related. I've seen crappy apps doing seekdir() randomly causing Lustre client oops. We fixed it by checking if f_version is 0 as well, which may pass the (filp->f_version != inode->i_version) check in Lustre client when a dir is lseek()ed before i_version changes. Not sure if ext3 has the same problem though. Maybe it will be helpful to print f_version and i_version when you reproduce it.

Thanks,
Tao

2012-12-17 10:52:24

by Zefan Li

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

>>> last_offset=-1, last_fpos=-1, f_pos=4024
>>>
>>> -1 means we hit the bug in the first iteration in the insde while in
>>> ext3_readdir().
>>>
>>> I've checked how ext3_readdir() works and how f_pos, f_version and i_version
>>> get initialized and modified. Now I'm lost. I really can't see how f_pos got
>>> corrupted. :(
>> Hum, it looks really curious. So f_pos has been 4024 when we entered
>> ext3_readdir()?
>
> dunno. but what else can be
>
>> Do you know what it was when we last left ext3_readdir()
>> for that filp? You can store that value in some debug entry added to struct
>> file... Also any chance we ever hit:
>> if (version != filp->f_version)
>> goto revalidate;
>> I don't think it can ever happen since we hold i_mutex and
>> generic_file_llseek() takes i_mutex as well. But better be sure.
>>
>
> Yesterday I've added more debug aids, which convers all the above information
> mentioned. Actually the code tracks all the places that change f_pos, and
> I think only lseek() and readdir() can change it.
>
> Now I'm waiting for the bug to happen again, can be several days...
>

The bug was triggered again:

EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9372013: rec_len is smaller than minimal - offset=4028, inode=0, rec_len=0, name_len=0

And I've confirmed f_pos=4028 when we entered ext3_readdir(), while it should be 4096.

I wrote a simple ring buffer to track operations on log dirs, and from the
ring buffer, we can see that there were no lseek, unlink, rename, etc.

This is correct:

dir=9372013, seq=1549, spot=readdir_1, f_pos=0, f_pos_delta=0
dir=9372013, seq=1550, spot=readdir_3, f_pos=0, f_pos_delta=0
dir=9372013, seq=1551, spot=readdir_5, f_pos=12, f_pos_delta=12
dir=9372013, seq=1552, spot=readdir_5, f_pos=24, f_pos_delta=12
...
dir=9372013, seq=1595, spot=readdir_5, f_pos=1488, f_pos_delta=28
dir=9372013, seq=1596, spot=readdir_5, f_pos=1516, f_pos_delta=28
dir=9372013, seq=1597, spot=readdir_5, f_pos=1556, f_pos_delta=40
dir=9372013, seq=1598, spot=readdir_5, f_pos=1584, f_pos_delta=28
...
dir=9372013, seq=1627, spot=readdir_5, f_pos=2392, f_pos_delta=28
dir=9372013, seq=1628, spot=readdir_5, f_pos=4096, f_pos_delta=1704
dir=9372013, seq=1629, spot=readdir_1, f_pos=4096, f_pos_delta=0

(readir_1 is the entry of readdir(), and readdir_3 is when we enter (f_version != i_version),
and readdir_5 is we iterate the dir block)

Then f_pos went wrong suddenly:

dir=9372013, seq=1676, spot=readdir_5, f_pos=1488, f_pos_delta=28
dir=9372013, seq=1677, spot=readdir_5, f_pos=1516, f_pos_delta=28
dir=9372013, seq=1678, spot=readdir_5, f_pos=1556, f_pos_delta=40
dir=9372013, seq=1679, spot=readdir_5, f_pos=1516, f_pos_delta=28 <-- !!!!!!!!
dir=9372013, seq=1680, spot=readdir_5, f_pos=1540, f_pos_delta=24
...
dir=9372013, seq=1708, spot=readdir_5, f_pos=2324, f_pos_delta=28
dir=9372013, seq=1709, spot=readdir_5, f_pos=4028, f_pos_delta=1704
dir=9372013, seq=1710, spot=readdir_1, f_pos=4028, f_pos_delta=0

This is odd...

While f_pos was wrong, offset is always correct, and this is not some
single-bit error in memory, so someone else changed f_pos? but we were
holding i_mutex, and we see nothing else except readdir in the ring
buffer...


2012-12-20 11:32:13

by Jan Kara

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

Hello,

On Mon 17-12-12 18:51:27, Li Zefan wrote:
> >>> last_offset=-1, last_fpos=-1, f_pos=4024
> >>>
> >>> -1 means we hit the bug in the first iteration in the insde while in
> >>> ext3_readdir().
> >>>
> >>> I've checked how ext3_readdir() works and how f_pos, f_version and i_version
> >>> get initialized and modified. Now I'm lost. I really can't see how f_pos got
> >>> corrupted. :(
> >> Hum, it looks really curious. So f_pos has been 4024 when we entered
> >> ext3_readdir()?
> >
> > dunno. but what else can be
> >
> >> Do you know what it was when we last left ext3_readdir()
> >> for that filp? You can store that value in some debug entry added to struct
> >> file... Also any chance we ever hit:
> >> if (version != filp->f_version)
> >> goto revalidate;
> >> I don't think it can ever happen since we hold i_mutex and
> >> generic_file_llseek() takes i_mutex as well. But better be sure.
> >>
> >
> > Yesterday I've added more debug aids, which convers all the above information
> > mentioned. Actually the code tracks all the places that change f_pos, and
> > I think only lseek() and readdir() can change it.
> >
> > Now I'm waiting for the bug to happen again, can be several days...
> >
>
> The bug was triggered again:
>
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9372013: rec_len is smaller than minimal - offset=4028, inode=0, rec_len=0, name_len=0
>
> And I've confirmed f_pos=4028 when we entered ext3_readdir(), while it
> should be 4096.
OK, interesting.

> I wrote a simple ring buffer to track operations on log dirs, and from the
> ring buffer, we can see that there were no lseek, unlink, rename, etc.
>
> This is correct:
>
> dir=9372013, seq=1549, spot=readdir_1, f_pos=0, f_pos_delta=0
> dir=9372013, seq=1550, spot=readdir_3, f_pos=0, f_pos_delta=0
> dir=9372013, seq=1551, spot=readdir_5, f_pos=12, f_pos_delta=12
> dir=9372013, seq=1552, spot=readdir_5, f_pos=24, f_pos_delta=12
> ...
> dir=9372013, seq=1595, spot=readdir_5, f_pos=1488, f_pos_delta=28
> dir=9372013, seq=1596, spot=readdir_5, f_pos=1516, f_pos_delta=28
> dir=9372013, seq=1597, spot=readdir_5, f_pos=1556, f_pos_delta=40
> dir=9372013, seq=1598, spot=readdir_5, f_pos=1584, f_pos_delta=28
> ...
> dir=9372013, seq=1627, spot=readdir_5, f_pos=2392, f_pos_delta=28
> dir=9372013, seq=1628, spot=readdir_5, f_pos=4096, f_pos_delta=1704
> dir=9372013, seq=1629, spot=readdir_1, f_pos=4096, f_pos_delta=0
>
> (readir_1 is the entry of readdir(), and readdir_3 is when we enter
> (f_version != i_version), and readdir_5 is we iterate the dir block)
Can you post the readdir() code you now run including your logging?
Thanks!

> Then f_pos went wrong suddenly:
>
> dir=9372013, seq=1676, spot=readdir_5, f_pos=1488, f_pos_delta=28
> dir=9372013, seq=1677, spot=readdir_5, f_pos=1516, f_pos_delta=28
> dir=9372013, seq=1678, spot=readdir_5, f_pos=1556, f_pos_delta=40
> dir=9372013, seq=1679, spot=readdir_5, f_pos=1516, f_pos_delta=28 <-- !!!!!!!!
> dir=9372013, seq=1680, spot=readdir_5, f_pos=1540, f_pos_delta=24
> ...
> dir=9372013, seq=1708, spot=readdir_5, f_pos=2324, f_pos_delta=28
> dir=9372013, seq=1709, spot=readdir_5, f_pos=4028, f_pos_delta=1704
> dir=9372013, seq=1710, spot=readdir_1, f_pos=4028, f_pos_delta=0
>
> This is odd...
>
> While f_pos was wrong, offset is always correct, and this is not some
> single-bit error in memory, so someone else changed f_pos? but we were
> holding i_mutex, and we see nothing else except readdir in the ring
> buffer...
How do you know 'offset' was correct? I don't see it in the ring
buffer... Anyway, this all points to us taking the:
if (version != filp->f_version)
goto revalidate;
branch when 'offset' is already advanced but f_pos isn't. Then we don't
enter:
if (filp->f_version != inode->i_version) {
branch and thus f_pos and 'offset' stay out of sync and problems happen.

But how these strange issues with f_version happen is a mystery for me so
far.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-02-12 12:19:58

by Jan Kara

[permalink] [raw]
Subject: Re: help about ext3 read-only issue on ext3(2.6.16.30)

On Thu 20-12-12 12:32:13, Jan Kara wrote:
> Hello,
>
> On Mon 17-12-12 18:51:27, Li Zefan wrote:
> > >>> last_offset=-1, last_fpos=-1, f_pos=4024
> > >>>
> > >>> -1 means we hit the bug in the first iteration in the insde while in
> > >>> ext3_readdir().
> > >>>
> > >>> I've checked how ext3_readdir() works and how f_pos, f_version and i_version
> > >>> get initialized and modified. Now I'm lost. I really can't see how f_pos got
> > >>> corrupted. :(
> > >> Hum, it looks really curious. So f_pos has been 4024 when we entered
> > >> ext3_readdir()?
> > >
> > > dunno. but what else can be
> > >
> > >> Do you know what it was when we last left ext3_readdir()
> > >> for that filp? You can store that value in some debug entry added to struct
> > >> file... Also any chance we ever hit:
> > >> if (version != filp->f_version)
> > >> goto revalidate;
> > >> I don't think it can ever happen since we hold i_mutex and
> > >> generic_file_llseek() takes i_mutex as well. But better be sure.
> > >>
> > >
> > > Yesterday I've added more debug aids, which convers all the above information
> > > mentioned. Actually the code tracks all the places that change f_pos, and
> > > I think only lseek() and readdir() can change it.
> > >
> > > Now I'm waiting for the bug to happen again, can be several days...
> > >
> >
> > The bug was triggered again:
> >
> > EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9372013: rec_len is smaller than minimal - offset=4028, inode=0, rec_len=0, name_len=0
> >
> > And I've confirmed f_pos=4028 when we entered ext3_readdir(), while it
> > should be 4096.
> OK, interesting.
>
> > I wrote a simple ring buffer to track operations on log dirs, and from the
> > ring buffer, we can see that there were no lseek, unlink, rename, etc.
> >
> > This is correct:
> >
> > dir=9372013, seq=1549, spot=readdir_1, f_pos=0, f_pos_delta=0
> > dir=9372013, seq=1550, spot=readdir_3, f_pos=0, f_pos_delta=0
> > dir=9372013, seq=1551, spot=readdir_5, f_pos=12, f_pos_delta=12
> > dir=9372013, seq=1552, spot=readdir_5, f_pos=24, f_pos_delta=12
> > ...
> > dir=9372013, seq=1595, spot=readdir_5, f_pos=1488, f_pos_delta=28
> > dir=9372013, seq=1596, spot=readdir_5, f_pos=1516, f_pos_delta=28
> > dir=9372013, seq=1597, spot=readdir_5, f_pos=1556, f_pos_delta=40
> > dir=9372013, seq=1598, spot=readdir_5, f_pos=1584, f_pos_delta=28
> > ...
> > dir=9372013, seq=1627, spot=readdir_5, f_pos=2392, f_pos_delta=28
> > dir=9372013, seq=1628, spot=readdir_5, f_pos=4096, f_pos_delta=1704
> > dir=9372013, seq=1629, spot=readdir_1, f_pos=4096, f_pos_delta=0
> >
> > (readir_1 is the entry of readdir(), and readdir_3 is when we enter
> > (f_version != i_version), and readdir_5 is we iterate the dir block)
> Can you post the readdir() code you now run including your logging?
> Thanks!
>
> > Then f_pos went wrong suddenly:
> >
> > dir=9372013, seq=1676, spot=readdir_5, f_pos=1488, f_pos_delta=28
> > dir=9372013, seq=1677, spot=readdir_5, f_pos=1516, f_pos_delta=28
> > dir=9372013, seq=1678, spot=readdir_5, f_pos=1556, f_pos_delta=40
> > dir=9372013, seq=1679, spot=readdir_5, f_pos=1516, f_pos_delta=28 <-- !!!!!!!!
> > dir=9372013, seq=1680, spot=readdir_5, f_pos=1540, f_pos_delta=24
> > ...
> > dir=9372013, seq=1708, spot=readdir_5, f_pos=2324, f_pos_delta=28
> > dir=9372013, seq=1709, spot=readdir_5, f_pos=4028, f_pos_delta=1704
> > dir=9372013, seq=1710, spot=readdir_1, f_pos=4028, f_pos_delta=0
> >
> > This is odd...
> >
> > While f_pos was wrong, offset is always correct, and this is not some
> > single-bit error in memory, so someone else changed f_pos? but we were
> > holding i_mutex, and we see nothing else except readdir in the ring
> > buffer...
> How do you know 'offset' was correct? I don't see it in the ring
> buffer... Anyway, this all points to us taking the:
> if (version != filp->f_version)
> goto revalidate;
> branch when 'offset' is already advanced but f_pos isn't. Then we don't
> enter:
> if (filp->f_version != inode->i_version) {
> branch and thus f_pos and 'offset' stay out of sync and problems happen.
>
> But how these strange issues with f_version happen is a mystery for me so
> far.
Were there any more test results regarding this? I'm just curious what
the problem is / was...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR