Hello,
I got a bunch of these in dmesg:
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323880: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323888: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323882: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
The kernel is 2.4.35 SMP, dual-processor. The scsi driver is Fusion MPT SCSI
Host driver 2.05.16.
The device is /dev/sda2, root fs.
One line per directory showed up in dmesg each night (I think
during updatedb) before I noticed.
The directories in question have not been written to for ages:
>debugfs /dev/sda2
debugfs: ncheck 323888
Inode Pathname
323888 /usr/share/doc/logcheck-1.1.1
debugfs: ncheck 323882
Inode Pathname
323882 /usr/share/doc/dev86-0.15.5
debugfs: ncheck 323880
Inode Pathname
323880 /usr/share/doc/mod_put-1.3
The hardware _should_ be solid, although I can never rule out disk-level
corruption with 100% certainty.
Does this ring any bells for anyone, short of block-level corruption?
-- v --
[email protected]
Hello,
> I got a bunch of these into dmesg:
>
> EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323880: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323888: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323882: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
>
> The kernel is 2.4.35 SMP, dual-processor. The scsi driver is Fusion MPT SCSI
> Host driver 2.05.16.
>
> The device is /dev/sda2, root fs.
>
> One line per each directory had dropped into dmesg each night (I think
> during updatedb) before I noticed.
Interesting. Can you look (using debugfs) at the contents of the
/usr/share/doc/ directory? It seems like parts of it have been zeroed
out...
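For reference, the values in the log (inode=0, rec_len=0, name_len=0 at
offset=0) are what the first directory entry of an all-zero block would
decode to. Below is a minimal Python sketch of that reasoning, loosely
modelled on the sanity checks the ext3 readdir path applies to each entry.
It is an illustration, not the kernel code: the function names and the
4096-byte block size are assumptions, and only the on-disk entry header
layout (le32 inode, le16 rec_len, u8 name_len, u8 file_type) is taken from
the ext2/ext3 format.

```python
import struct

# Header of an ext2/ext3 directory entry:
# le32 inode, le16 rec_len, u8 name_len, u8 file_type, then the name bytes.
DIRENT_HEADER = struct.Struct("<IHBB")


def dir_rec_len(name_len):
    """Smallest legal rec_len for a name of this length
    (8-byte header plus name, rounded up to a multiple of 4)."""
    return (name_len + 8 + 3) & ~3


def check_dir_entry(block, offset=0):
    """Decode one entry and return (inode, rec_len, name_len, error).
    error is None when the entry passes these basic checks."""
    inode, rec_len, name_len, _file_type = DIRENT_HEADER.unpack_from(block, offset)
    error = None
    if rec_len < dir_rec_len(1):
        error = "rec_len is smaller than minimal"
    elif rec_len % 4 != 0:
        error = "rec_len % 4 != 0"
    elif rec_len < dir_rec_len(name_len):
        error = "rec_len is too small for name_len"
    elif offset + rec_len > len(block):
        error = "directory entry across blocks"
    return inode, rec_len, name_len, error


# Assumed block size of 4096 bytes; a zeroed block fails the first check.
zeroed_block = bytes(4096)
print(check_dir_entry(zeroed_block))
```

Feeding it a block of zeros trips the first check with all three fields at
0, which matches the messages quoted above. A healthy directory block would
instead start with a "." entry whose rec_len is at least 12.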
> The directories in question have not been written to for ages:
>
> >debugfs /dev/sda2
> debugfs: ncheck 323888
> Inode Pathname
> 323888 /usr/share/doc/logcheck-1.1.1
> debugfs: ncheck 323882
> Inode Pathname
> 323882 /usr/share/doc/dev86-0.15.5
> debugfs: ncheck 323880
> Inode Pathname
> 323880 /usr/share/doc/mod_put-1.3
>
> The hardware _should_ be solid, although I can never 100% sure rule out disk
> level corruption.
Honza
On Tue, Sep 18, 2007 at 05:12:06PM +0200, you [Jan Kara] wrote:
> Hello,
>
> > I got a bunch of these into dmesg:
> >
> > EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323880: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> > EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323888: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> > EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323882: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> >
> > The kernel is 2.4.35 SMP, dual-processor. The scsi driver is Fusion MPT SCSI
> > Host driver 2.05.16.
> >
> > The device is /dev/sda2, root fs.
> >
> > One line per each directory had dropped into dmesg each night (I think
> > during updatedb) before I noticed.
> Interesting. Can you look (using debugfs) on the content of the
> /usr/share/doc/ directory? It seems like parts of it have been zeroed
> out...
Unfortunately, no. I removed those directories because they were the only
ones causing problems and I wasn't able to reboot for a proper fsck
immediately. The rm -rf command gave no errors (to stdout or dmesg), and a
read-only fsck right after that gave no errors on the directory structure.
Sorry for the sparse details, but when you have this kind of problem on
live servers, you tend to forget about debuggability...
-- v --
[email protected]
> On Tue, Sep 18, 2007 at 05:12:06PM +0200, you [Jan Kara] wrote:
> > Hello,
> >
> > > I got a bunch of these into dmesg:
> > >
> > > EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323880: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> > > EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323888: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> > > EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323882: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> > >
> > > The kernel is 2.4.35 SMP, dual-processor. The scsi driver is Fusion MPT SCSI
> > > Host driver 2.05.16.
> > >
> > > The device is /dev/sda2, root fs.
> > >
> > > One line per each directory had dropped into dmesg each night (I think
> > > during updatedb) before I noticed.
> > Interesting. Can you look (using debugfs) on the content of the
> > /usr/share/doc/ directory? It seems like parts of it have been zeroed
> > out...
>
> Unfortunately, no. I removed those directories because those were the only
> ones causing problems and wasn't able to reboot for a proper fsck
> immediately. The rm -rf command gave no errors (to stdout or dmesg), and a
> read-only fsck right after that gave no errors on the directory structure.
>
> Sorry for the sparse details, but when you have these kind of problems on
> live servers, you tend to forget the debuggability...
Yes, I can understand that :). It's just that now it's hard to find
out what has really happened. Anyway, thanks for your report.
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Tue, Sep 18, 2007 at 06:22:56PM +0200, you [Jan Kara] wrote:
> > Sorry for the sparse details, but when you have these kind of problems on
> > live servers, you tend to forget the debuggability...
> Yes, I can understand that :). It's just that now it's hard to find
> out what has really happened. Anyway, thanks for your report.
If we are really lucky or unlucky it will happen again.
A zeroed-out block just might be a kernel problem (SMP race, whatever) -
random corruption would more likely be a hardware problem. There were no I/O
errors either. But 2.4 ext3 has been pretty extensively tested, so I don't
suppose that's likely either. And judging from
http://www.kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.3[45] there
haven't been many changes to ext3 lately either.
-- v --
[email protected]
Hi Ville,
On Tue, Sep 18, 2007 at 07:33:26PM +0300, Ville Herva wrote:
> On Tue, Sep 18, 2007 at 06:22:56PM +0200, you [Jan Kara] wrote:
> > > Sorry for the sparse details, but when you have these kind of problems on
> > > live servers, you tend to forget the debuggability...
> > Yes, I can understand that :). It's just that now it's hard to find
> > out what has really happened. Anyway, thanks for your report.
>
> If we are really lucky or unlucky it will happen again.
>
> Zeroed-out block just might be a kernel problem (SMP race, whatever) -
> random corruption would more likely be a hardware problem. There were no IO
> error either. But, 2.4 ext3 has been pretty extensively tested, so I don't
> suppose that's likely either. And judging from
> http://www.kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.3[45] there
> haven't been many changes to ext3 lately either.
Thanks for your report. Unfortunately, I've rechecked the recent changelogs
and see nothing related either. At least, in order to keep a record of the
incident, would you please post some info about your config (CPU, RAM,
chipset, .config, gcc, and any patches you may have applied)?
Maybe some of this info will bring back old bad memories for some people.
Also, do you know if this server has ECC memory? I would more readily
bet on side effects of one random bit flip in memory than on some
massive block corruption.
I vaguely remember very old reports of people sometimes observing
zeroed-out blocks during writes, which were attributed to chipset bugs
if my memory serves me. But I would rule this out, as recent chipsets
look more stable than those of 5-10 years ago!
Regards,
Willy
On Tue, Sep 18, 2007 at 11:47:05PM +0200, you [Willy Tarreau] wrote:
> Thanks for your report. Unfortunately, I've rechecked the recent changelogs
> and see nothing related either. At least, in order to keep trace of the
> incident, would you please post some info about your config (CPU, RAM,
> chipset, .config, gcc, and any possible patches you may have applied) ?
> Maybe some of these info may remind old bad memories to some people.
>
> Also, do you know if this server has ECC memory ? I would more easily
> bet for side effects of one random bit flip in memory than for some
> massive block corruption.
>
> I vaguely remember about very old reports of people sometimes observing
> zeroed out blocks during writes, which were attributed to chipset bugs
> if my memory serves me. But I would rule this out as recent chipsets
> look more stable than 5-10 years ago !
Willy,
The machine is a virtual machine on a VMware ESX 3.0.1 host.
/proc/cpuinfo shows two of these:
Dual
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 8
cpu MHz : 2333.014
cache size : 64 KB
It has 864MB of memory.
.config is at:
http://v.iki.fi/~vherva/tmp/2.4.35-config
The kernel is plain vanilla 2.4.35 from kernel.org, no patches.
gcc 2.96-129:
cat /proc/version
Linux version 2.4.35 (root) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-129.7.2)) #1 SMP Thu Aug 9 10:35:37 EEST 2007
Memory is ECC.
The server is HP Proliant ML370 with 82801BA/CA/DB/EB chipset. I've had my
share of chipset bugs with older Via chipsets, but I think it's very likely
in this case.
This could very well be a VMware bug, but I wanted to know if this rings
bells for someone.
-- v --
[email protected]
Hi Ville,
On Thu, Sep 20, 2007 at 03:45:37PM +0300, Ville Herva wrote:
> On Tue, Sep 18, 2007 at 11:47:05PM +0200, you [Willy Tarreau] wrote:
> > Thanks for your report. Unfortunately, I've rechecked the recent changelogs
> > and see nothing related either. At least, in order to keep trace of the
> > incident, would you please post some info about your config (CPU, RAM,
> > chipset, .config, gcc, and any possible patches you may have applied) ?
> > Maybe some of these info may remind old bad memories to some people.
> >
> > Also, do you know if this server has ECC memory ? I would more easily
> > bet for side effects of one random bit flip in memory than for some
> > massive block corruption.
> >
> > I vaguely remember about very old reports of people sometimes observing
> > zeroed out blocks during writes, which were attributed to chipset bugs
> > if my memory serves me. But I would rule this out as recent chipsets
> > look more stable than 5-10 years ago !
>
> Willy,
>
> The machine is a virtual machine on an VMware ESX 3.0.1 host.
>
> /proc/cpuinfo shows two of these:
> Dual
> model : 15
> model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
> stepping : 8
> cpu MHz : 2333.014
> cache size : 64 KB
>
> It has 864MB of memory.
>
> .config is at:
> http://v.iki.fi/~vherva/tmp/2.4.35-config
> The kernel is plain vanilla 2.4.35 from kernel.org, no patches.
OK. And your config seems perfectly standard.
> gcc 2.96-129:
> cat /proc/version
> Linux version 2.4.35 (root) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-129.7.2)) #1 SMP Thu Aug 9 10:35:37 EEST 2007
I used not to trust 2.96, but I wouldn't accuse it now.
> Memory is ECC.
>
> The server is HP Proliant ML370 with 82801BA/CA/DB/EB chipset. I've had my
> share of chipset bugs with older Via chipsets, but I think it's very likely
> in this case.
I think you meant "unlikely".
> This could very well be a VMware bug, but I wanted to know if this rings
> bells for someone.
It could also be a problem with the host OS, drivers, hardware, etc...
Cheers,
Willy
On Thu, Sep 20, 2007 at 03:20:55PM +0200, you [Willy Tarreau] wrote:
> OK. And your config seems perfectly standard.
>
> > gcc 2.96-129:
> > cat /proc/version
> > Linux version 2.4.35 (root) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-129.7.2)) #1 SMP Thu Aug 9 10:35:37 EEST 2007
>
> I used not to trust 2.96, but I wouldn't accuse it now.
The box was virtualized a while ago; 2.4.32-rc1 and earlier 2.4 kernels
compiled with the same compiler ran very solidly for years. It was UP before
virtualization, though.
> > share of chipset bugs with older Via chipsets, but I think it's very likely
> > in this case.
>
> I think you meant "unlikely".
Yep, sorry for the typo.
> > This could very well be a VMware bug, but I wanted to know if this rings
> > bells for someone.
>
> It could also be a problem with the host OS, drivers, hardware, etc...
Yes, pretty much anything. There's no solid evidence of anything, only
guesses about what might be more likely and what might be less likely...
If it happens again, I'll try to debug more.
-- v --
[email protected]