2009-07-07 18:20:08

by Evan King

Subject: Strange disk failure...could ext4 be the culprit?

Hello all,

I'm administering a small computing cluster on new off-the-shelf hardware. The
configuration is a master-slaves setup, with the master serving NFS for data
synchronization and performing the data re-assembly process (as well as doing
some slave work).

The jobs produce a fairly steady I/O load, but nothing particularly heavy.
While I originally pushed for specialized storage hardware or configurations,
testing and benchmarking showed that the workload appeared quite manageable for
a single disk. I expected it might experience a short lifespan, but on the
order of several months at least. To spare the disk as much thrashing as
possible, I opted for ext4.

In the first week of active deployment (and while I was on vacation), the master
experienced a very strange form of catastrophic failure. A job had failed after
only a couple hours, and serious errors blocked further work. Several core GNU
tools in /bin were corrupted, such as: mv, rm, uname, hostname, pwd. A couple
0-byte files existed in / with scrambled filenames, and plenty of Unicode
characters splattered across the screen during reboot. The reboot itself
reached a login prompt, but wouldn't accept any input. But this is where things
get strange.

I used a liveCD to perform disk checks, and there were no filesystem errors of
*any* kind. The entire filesystem was and is in pristine condition. While I'm
aware of discussion and issues surrounding some of the design decisions made for
ext4 (such as delayed write allocation), it doesn't seem possible that those
issues could be related to this kind of failure (data written without permission
or any attempt to do so). The corrupted binaries were in fact corrupted on
disk, not just in memory (also unreadable by readelf), and larger than the
originals. The software I was using runs from a user-level account and has an
apache-served web interface with apache dropping permissions to that same user.
Nothing but the kernel itself had permission to write to the files that were
corrupted; however, the computing software does execute (I think) all of the
commands that were corrupted.
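For future incidents, on-disk corruption of binaries like these can be confirmed
without guesswork by checking them against a known-good checksum baseline. A
minimal sketch (the baseline path is hypothetical; it would be captured from a
fresh install or a backup):

```shell
#!/bin/sh
# Confirm on-disk corruption by comparing binaries against a saved baseline.
# /root/bin.md5 is a hypothetical file produced earlier with:
#   md5sum /bin/* > /root/bin.md5
baseline=/root/bin.md5
# md5sum -c re-hashes each listed file; print only the mismatches
md5sum -c "$baseline" 2>/dev/null | grep -v ': OK$'
```

On Debian/Ubuntu, debsums performs the same comparison against the checksums
shipped in each package, and `readelf -h` on a suspect binary fails loudly if
the ELF header itself is damaged.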

I have saved copies of several of the corrupted files, but neglected to save any
system logs before restoring a backup. There are still some strange messages
appearing during startup, but they fly by too quickly to read, and nothing seems
amiss in the logs except that /var/log/messages is extremely verbose during
startup and has many references to initializing ext4 (but nothing that sounds
like an error). I'm about to tell my users to start using it again and will be
expecting and watching for a repeat performance. The disk itself appears to be
fine.

_____

So my questions are these:

- How likely is it that some arcane bug in ext4 is responsible for the failure?
- If ext4 is exonerated, are there any possible explanations aside from disk
failure and newbie mistakes?
- What would be an appropriate way to stress-test the disk if I wanted to
intentionally induce the error, and what output should I be watching?
- What can I do to track the occurrence of this bug, its source, and/or the
conditions that may trigger it? (Note that iostat shows nothing of interest, as
the actual I/O load isn't particularly unusual.)
- Should I seriously consider using an SSD? (NFS will not share memory-mapped
directories, which thwarted the last of my 'better' plans, and the software's
scratch directory can potentially grow to several gigs over the span of a few
days/weeks.)


Thanks in advance for any light you may be able to shed on the issue.
- Evan



2009-07-07 22:29:06

by Andreas Dilger

Subject: Re: Strange disk failure...could ext4 be the culprit?

On Jul 07, 2009 18:16 +0000, Evan King wrote:
> So my questions are these:
>
> - How likely is it that some arcane bug in ext4 is responsible for the failure?

It is possible - there are still bugs being fixed in ext4.

> - What can I do to track the occurrence of this bug, its source, and/or the
> conditions that may trigger it? (Note that iostat shows nothing of interest,
> as the actual I/O load isn't particularly unusual.)

Reporting the actual kernel version you are using is critical. If you
are going to stick with ext4, I would follow the latest FC11 kernels,
since there is an active maintainer for ext4 at Red Hat.

Depending on how you formatted the filesystem, you may be able to revert
to ext3 if you want more stability. Providing the output of "dumpe2fs -h"
for the filesystem will tell whether that is possible (in particular the
"features" line).
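For reference, the decisive flags appear on the "Filesystem features:" line of
that output; a sketch of the check against a saved header (the capture path and
device name are assumptions):

```shell
#!/bin/sh
# Check a saved "dumpe2fs -h" header for ext4-only features.
# /tmp/sb.out is a hypothetical capture of: dumpe2fs -h /dev/sda1 > /tmp/sb.out
if grep '^Filesystem features:' /tmp/sb.out | grep -q 'extent'; then
    echo "extents in use: cannot cleanly revert to ext3"
else
    echo "no ext4-only extent feature found"
fi
```

The "extent" feature is the usual blocker: files written with extents cannot be
read by an ext3 driver, so a filesystem that has it enabled cannot simply be
remounted as ext3.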

> - Should I seriously consider using an SSD? (NFS will not share
> memory-mapped directories, which thwarted the last of my 'better' plans,
> and the software's scratch directory can potentially grow to several gigs
> over the span of a few days/weeks.)

That is an independent question from using ext4. If you are using NFS
without "async", then an SSD will almost certainly help performance,
but it is probably completely unrelated to the corruption issue.
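For context, the "async" Andreas refers to is a per-export option on the NFS
server, set in /etc/exports; a hypothetical export line (path and client
network are placeholders):

```
# /etc/exports -- "async" lets the server acknowledge writes before they
# reach disk (faster, but risks silent data loss if the server crashes);
# "sync" is the safer default in modern nfs-utils.
/export/data  192.168.1.0/24(rw,async,no_subtree_check)
```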


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-07-13 05:35:32

by Aneesh Kumar K.V

Subject: Re: Strange disk failure...could ext4 be the culprit?

On Tue, Jul 07, 2009 at 06:16:23PM +0000, Evan King wrote:
> Hello all,
>
> I'm administering a small computing cluster on new off-the-shelf hardware. The
> configuration is a master-slaves setup, with the master serving NFS for data
> synchronization and performing the data re-assembly process (as well as doing
> some slave work).
>
> The jobs produce a fairly steady I/O load, but nothing particularly heavy.
> While I originally pushed for specialized storage hardware or configurations,
> testing and benchmarking showed that the workload appeared quite manageable for
> a single disk. I expected it might experience a short lifespan, but on the
> order of several months at least. To spare the disk as much thrashing as
> possible, I opted for ext4.
>
> In the first week of active deployment (and while I was on vacation), the master
> experienced a very strange form of catastrophic failure. A job had failed after
> only a couple hours, and serious errors blocked further work. Several core GNU
> tools in /bin were corrupted, such as: mv, rm, uname, hostname, pwd. A couple
> 0-byte files existed in / with scrambled filenames, and plenty of Unicode
> characters splattered across the screen during reboot. The reboot itself
> reached a login prompt, but wouldn't accept any input. But this is where things
> get strange.
>
> I used a liveCD to perform disk checks, and there were no filesystem errors of
> *any* kind. The entire filesystem was and is in pristine condition. While I'm
> aware of discussion and issues surrounding some of the design decisions made for
> ext4 (such as delayed write allocation), it doesn't seem possible that those
> issues could be related to this kind of failure (data written without permission
> or any attempt to do so). The corrupted binaries were in fact corrupted on
> disk, not just in memory (also unreadable by readelf), and larger than the
> originals. The software I was using runs from a user-level account and has an
> apache-served web interface with apache dropping permissions to that same user.
> Nothing but the kernel itself had permission to write to the files that were
> corrupted; however, the computing software does execute (I think) all of the
> commands that were corrupted.
>
> I have saved copies of several of the corrupted files, but neglected to save any
> system logs before restoring a backup. There are still some strange messages
> appearing during startup, but they fly by too quickly to read, and nothing seems
> amiss in the logs except that /var/log/messages is extremely verbose during
> startup and has many references to initializing ext4 (but nothing that sounds
> like an error). I'm about to tell my users to start using it again and will be
> expecting and watching for a repeat performance. The disk itself appears to be
> fine.
>
> _____
>
> So my questions are these:
>
> - How likely is it that some arcane bug in ext4 is responsible for the failure?

Can you check whether your kernel has this patch:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2ec0ae3acec47f628179ee95fe2c4da01b5e9fc4

-aneesh
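One mechanical way to answer "does this tree contain commit X" is
`git merge-base --is-ancestor`, which exits 0 when the first commit is an
ancestor of the second. A self-contained demo of the check on a throwaway repo
(against a real clone of Linus's tree, you would substitute the commit id from
the URL above and a release tag such as v2.6.30):

```shell
#!/bin/sh
set -e
# Build a throwaway two-commit repo to demonstrate the ancestry test.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=t@example.com -c user.name=t commit -q --allow-empty -m "fix"
fix=$(git rev-parse HEAD)
git -c user.email=t@example.com -c user.name=t commit -q --allow-empty -m "release"
git tag v1.0
# Exit status 0 means the fix commit is contained in the tagged release.
if git merge-base --is-ancestor "$fix" v1.0; then
    echo "fix included in v1.0"
else
    echo "fix missing from v1.0"
fi
```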

2009-07-13 12:12:37

by Theodore Ts'o

Subject: Re: Strange disk failure...could ext4 be the culprit?

On Mon, Jul 13, 2009 at 11:05:20AM +0530, Aneesh Kumar K.V wrote:
> > So my questions are these:
> >
> > - How likely is it that some arcane bug in ext4 is responsible for the failure?
>
> Can you check whether your kernel has this patch:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2ec0ae3acec47f628179ee95fe2c4da01b5e9fc4

And did you have multiple CPUs on the system that suffered the problem?
(That's a requirement for this to have been the cause.)

- Ted

2009-07-13 16:20:22

by Evan King

Subject: Re: Strange disk failure...could ext4 be the culprit?

Aneesh Kumar K.V wrote:
> On Tue, Jul 07, 2009 at 06:16:23PM +0000, Evan King wrote:
>
>> Hello all,
>>
>> I'm administering a small computing cluster...
>>
>> _____
>>
>> So my questions are these:
>>
>> - How likely is it that some arcane bug in ext4 is responsible for the failure?
>>
>
> Can you check whether your kernel has this patch:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2ec0ae3acec47f628179ee95fe2c4da01b5e9fc4
>
> -aneesh
>
Thank you for that bit of sleuthing...what you've unearthed sounds like
a perfect match for what I experienced. The system is dual core, and
the kernel is the latest Ubuntu server (linux-image-2.6.28-13-server).
I've not been able to find the exact release date of that image (and am
surprised that release dates are not metadata in apt or on the package's
web page), but I believe it is too close to the date of this patch to be
downstream already--and I find no references to this bug in the changelog.

Since there are no launchpad entries referencing this either, I think my
next step will be to create one pushing for inclusion of this patch in
the next kernel update, and hopefully for that update to come soon. My
cluster has operated smoothly since restoration from backup, and it
would be nice not to have to reformat (ext partitions were freshly
created as ext4) or go "aftermarket modding" when a fix is already out.
At any rate, I have my answer, and it's nice to have a plausible
explanation--especially one that doesn't point to deeper concerns about
disk load.

Cheers,
- Evan