2007-07-19 15:39:34

by Ryoichi KATO

Subject: e2fsck bogus error report on orphan-list

Hi,
I hit a problem with ext3/e2fsck orphan-list handling.

The following sequence produces a bogus e2fsck error report:
"/dev/XXX: Inodes that were part of a corrupted orphan linked list found."

1. Delete a file on an ext3 filesystem while the system clock reads early 1970.
2. Set the RTC to 2007, then mount and write to the filesystem.
3. Run e2fsck (with -f).

This is because the i_dtime (deletion time) field is also used as the
next-pointer of the orphan list (where it stores an inode number rather
than a time), and e2fsck handles it improperly.
You will have the same problem if you run e2fsck on an ext3
filesystem with 1.2+ billion files in it. (Is it possible?)
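
To make the ambiguity concrete, here is a minimal sketch of the kind of test involved. It assumes the check simply asks whether a nonzero i_dtime is smaller than the filesystem's inode count. The function name and the sample values are illustrative only and are not taken from the e2fsck source.

```python
import calendar


def looks_like_orphan_link(i_dtime: int, s_inodes_count: int) -> bool:
    """Assumed heuristic: a nonzero dtime below the inode count could be
    an inode number (an orphan-list next-pointer) rather than a time."""
    return 0 < i_dtime < s_inodes_count


# A deletion time written one hour after the epoch (clock reading 1970).
early_1970 = 3600
# A deletion time written with a correct clock in July 2007.
now_2007 = calendar.timegm((2007, 7, 19, 0, 0, 0))
inodes = 1_000_000  # hypothetical filesystem size

print(now_2007)                                         # 1184803200
print(looks_like_orphan_link(early_1970, inodes))       # True: ambiguous
print(looks_like_orphan_link(now_2007, inodes))         # False: clearly a time
print(looks_like_orphan_link(now_2007, 1_300_000_000))  # True: huge inode count
```

The last line is where the "1.2+ billion" figure comes from: a 2007 timestamp is roughly 1.18 billion seconds after the epoch, so only a filesystem with more inodes than that would make a valid deletion time look like an inode number.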

For more detail, please take a look at the documents I wrote:
- http://tree.celinuxforum.org/CelfPubWiki/Ext3OrphanedInodeProblem
- http://tree.celinuxforum.org/CelfPubWiki/JapanTechnicalJamboree15?action=AttachFile&do=get&target=ext3orphaned-inode.ppt (Sorry for .PPT)


So, my questions are:

* Is this really a bug (or design defect)?

* Which of ext3 or e2fsck is responsible for the problem?
- I feel that e2fsck is, but it needs help from ext3 to solve it elegantly.

* How should I (we) deal with this problem?
- As a workaround, it can be avoided by just setting the RTC
to 2007 or so before doing any ext3 operation.

Thank you.
--
Ryoichi KATO <[email protected]>
Audio Development & Engineering Div.
Sony Corporation Audio Business Group
Tel +81-3-3599-3862 / Fax +81-3-3599-3859


2007-07-19 16:55:21

by Theodore Ts'o

Subject: Re: e2fsck bogus error report on orphan-list

On Fri, Jul 20, 2007 at 12:39:19AM +0900, [email protected] wrote:
> Hi,
> I hit a problem of ext3/e2fsck on orphan-list handling.

Wow, I'm rather impressed that this was sufficient for a presentation
at a conference. You could have just sent me e-mail. :-)

>
> The following sequence produces bogus e2fsck error report:
> "/dev/XXX: Inodes that were part of a corrupted orphan linked list found."
>
> 1. Delete a file in an ext3 filesystem in early 1970

Dare I ask *why* the system clock was set in the 1970's? Umm... don't
do that.

> 2. Set RTC to 2007, and then mount/write the filesystem.

There is code that detects when the time is set back in the 1970's
(normally due to a bad clock battery) and thus disables this
particular check. So it only triggers when the clock was previously
bad, and is now good.

> This is because i_dtime (deletion time) field is also used as a
> next-pointer of an orphan-list (stores inode number rather than time),
> and e2fsck handles it improperly.
> You will have the same problem if you run e2fsck on an ext3
> filesystem with 1.2+ billion files in it. (Is it possible?)

It's *possible* but in practice no one does it, because the fsck times
if the filesystem had that many inodes would be pretty scary --- and
there will always be times when you must run fsck --- for example, if
you have hardware induced corruption and you need to salvage the
filesystem because your backups had failed (or you weren't doing
backups :-).


The net is that the check is basically a sanity check to make sure any bugs
in the orphaned list handling would be discovered, although it can
also trigger if there is block device corruption where part of the
inode table is corrupted. I had added heuristics that for most people
meant that it never triggered, so I'm surprised that it actually did
in your environment. Still, if it did, the easiest thing to do is to
just turn it off.

We haven't had bugs in that area of the code for a long time, and if
it's actually causing you trouble, the simplest thing to do is to just
comment out the check. That, or just make sure that the time is
correct, which is generally a good idea anyway. Hmm, maybe I should
add an e2fsck configuration parameter:

[options]
unreliable_system_clock = 1

Which disables various heuristics that assume that the system clock
can be trusted.
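
As a rough illustration of how such a switch could gate the check, here is a sketch. The option name comes from the proposal above and is hypothetical at this point in the thread. Python's configparser stands in for e2fsck's own config reader, and the function name is made up for the example.

```python
import configparser

# The proposed (hypothetical) stanza from the message above.
CONF_TEXT = """
[options]
unreliable_system_clock = 1
"""


def low_dtime_check_enabled(conf_text: str) -> bool:
    """Return False when the config says the system clock cannot be trusted."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    unreliable = parser.getboolean(
        "options", "unreliable_system_clock", fallback=False
    )
    return not unreliable


print(low_dtime_check_enabled(CONF_TEXT))      # False: heuristics disabled
print(low_dtime_check_enabled("[options]\n"))  # True: default behaviour
```

Defaulting to "clock is trusted" keeps the sanity check active for everyone who does not opt out, which matches the intent described above.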

- Ted

2007-07-19 21:01:28

by Tim Bird

Subject: Re: e2fsck bogus error report on orphan-list

Theodore Tso wrote:
> On Fri, Jul 20, 2007 at 12:39:19AM +0900, [email protected] wrote:
>> The following sequence produces bogus e2fsck error report:
>> "/dev/XXX: Inodes that were part of a corrupted orphan linked list found."
>>
>> 1. Delete a file in an ext3 filesystem in early 1970
>
> Dare I ask *why* the system clock was set in the 1970's? Umm... don't
> do that.

It is not uncommon for embedded boards to omit battery backing
on the RTC, so they always boot with a bogus (start-of-epoch) time.

-- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================

2007-07-19 23:43:57

by Ryoichi KATO

Subject: Re: e2fsck bogus error report on orphan-list

At Thu, 19 Jul 2007 12:55:10 -0400,
Theodore Tso wrote:
> On Fri, Jul 20, 2007 at 12:39:19AM +0900, [email protected] wrote:
> > Hi,
> > I hit a problem of ext3/e2fsck on orphan-list handling.
>
> Wow, I'm rather impressed that this was sufficient for a presentation
> at a conference. You could have just sent me e-mail. :-)

I know it's a rare case for most people, and I'm not sure
it is a 'bug', but I thought it might happen more often for CE people.
So I asked for the opinions of CE people in a lightning session of the
"CELF Technical Jamboree."


> > 1. Delete a file in an ext3 filesystem in early 1970
>
> Dare I ask *why* the system clock was set in the 1970's? Umm... don't
> do that.

As Tim pointed out, embedded devices often omit the RTC battery.


> > 2. Set RTC to 2007, and then mount/write the filesystem.
>
> There is code that detects when the time is set back in the 1970's
> (normally due to a bad clock battery) and thus disables this
> particular check. So it only triggers when the clock was previously
> bad, and is now good.

Actually, it's a *real* problem that happened with my car navigation product.
Until a GPS signal is available, its clock reads 1970.
And for servers and PCs, it's possible that the RTC backup battery runs out,
and the clock then gets set correctly afterward by, say, NTP.


> The net is that the check is basically a sanity check to make sure any bugs
> in the orphaned list handling would be discovered, although it can
> also trigger if there is block device corruption where part of the
> inode table is corrupted. I had added heuristics that for most people
> meant that it never triggered, so I'm surprised that it actually did
> in your environment. Still, if it did, the easiest thing to do is to
> just turn it off.

Now that the cause of the problem is known, it's easy.
But let me point out that:

* It is very difficult to relate the RTC to the problem.
There is no clue without digging into the e2fsck source code.

* The -p (preen) option of e2fsck doesn't fix it automatically.
I'm not sure, but maybe it's safe to correct the
problem automatically?


Actually, it took me several weeks to solve, because it is rare.
My system resets the RTC only on a hardware reset or when the main battery
runs out, not on a software reset. But it can happen.


Thank you.
--
Ryoichi KATO <[email protected]>
System Design Dept. No4
Audio Development & Engineering Div.
Sony Corporation Audio Business Group
Tel +81-3-3599-3862 / Fax +81-3-3599-3859

2007-07-20 04:10:58

by Theodore Ts'o

Subject: Re: e2fsck bogus error report on orphan-list

On Fri, Jul 20, 2007 at 08:20:26AM +0900, Ryoichi KATO wrote:
> > > 1. Delete a file in an ext3 filesystem in early 1970
> >
> > Dare I ask *why* the system clock was set in the 1970's? Umm... don't
> > do that.
>
> As Tim pointed out, embedded devices often omit the RTC battery.

Yes, I added the busted_fs_clock specifically to handle this.

> > There is code that detects when the time is set back in the 1970's
> > (normally due to a bad clock battery) and thus disables this
> > particular check. So it only triggers when the clock was previously
> > bad, and is now good.
>
> Actually, it's a *real* problem that happened with my car navigation product.
> Until a GPS signal is available, its clock reads 1970.
> And for servers and PCs, it's possible that the RTC backup battery runs out,
> and the clock then gets set correctly afterward by, say, NTP.

Sure, but we have checks to detect if the last superblock write *or*
last mount time is in the 1970's, and if so, we declare the filesystem
as having a busted/insane system clock, and we disable the
dtime/orphaned inode checks. This in practice is plenty since it
means that the mount-time is in the 1970's, and then when NTP sets the
time, we're fine, since that's generally after the e2fsck and the
mount of the filesystem.
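
A minimal sketch of that gating logic might look like the following. It assumes "in the 1970's" is decided by comparing the superblock timestamps against the inode count, which is a guess at the threshold. The function names and sample values are illustrative and not copied from e2fsck.

```python
def clock_looks_busted(s_mtime: int, s_wtime: int, s_inodes_count: int) -> bool:
    """Assumed heuristic: a last-mount or last-write time this small means
    the filesystem was used while the clock read the early 1970's."""
    return s_mtime < s_inodes_count or s_wtime < s_inodes_count


def low_dtime_reported(i_dtime: int, s_mtime: int, s_wtime: int,
                       s_inodes_count: int) -> bool:
    """Report a suspicious dtime only when the superblock times look sane."""
    if clock_looks_busted(s_mtime, s_wtime, s_inodes_count):
        return False  # clock is known bad: skip the dtime/orphan check
    return 0 < i_dtime < s_inodes_count


INODES = 1_000_000     # hypothetical inode count
Y2007 = 1_184_803_200  # 2007-07-19 00:00:00 UTC

# Mounted while the clock read 1970: the check is suppressed.
print(low_dtime_reported(3600, s_mtime=7200, s_wtime=Y2007,
                         s_inodes_count=INODES))  # False
# Both superblock times sane, but an old 1970 dtime remains: reported.
print(low_dtime_reported(3600, s_mtime=Y2007, s_wtime=Y2007,
                         s_inodes_count=INODES))  # True
```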

So for it to trigger it requires a very strange set of modulations of
the time. You need to have time be correct at the time of the mount
(so s_mtime is sane, implying that the RTC backup battery is not
dead), and *then* reset to the 1970's, delete some files, then be
correct when the filesystem is unmounted (so s_wtime is sane). That's
pretty hard to accomplish; and I would submit, even on embedded
systems. The system clock must be crazily warping back and forth
between correct time and 1970's/insane time in order for this to be an
issue.

This has been true since e2fsprogs 1.38, released June 30, 2005;
before that point we only checked s_wtime for sanity, and we did have
a few cases slip through, but ever since I added the s_mtime check, I
haven't had anyone report a problem until now. (Although if people
don't e-mail me, and just do conference presentations, I'd have no way
of finding out unless I was lucky enough to attend the conference. :-)

> * It is very difficult to relate RTC to the problem.
> No clue without digging into e2fsck source code.

Yes. As I said, it might be a good idea to add an
unreliable_system_time config parameter to e2fsck in the future to
catch this case. That would also document the issue to keep other
people from running into this.

> * The -p (preen) option of e2fsck doesn't fix it automatically.
> Though I'm not sure but, maybe it's safe to correct the
> problem automatically?

Yes, but this was deliberate; if there was a bug in the kernel's
orphan handling code, I really wanted to know about it, and if it was
fixed under -p, most folk would never know. (Although if there were orphan
list handling bugs, some truncates might not be
reliably replayed, so it might cause even **harder** to diagnose bugs.
Life is always full of tradeoffs.)

- Ted

2007-07-20 09:46:31

by Ryoichi KATO

Subject: Re: e2fsck bogus error report on orphan-list

At Fri, 20 Jul 2007 00:10:52 -0400,
Theodore Tso wrote:
> So for it to trigger it requires a very strange set of modulations of
> the time. You need to have time be correct at the time of the mount
> (so s_mtime is sane, implying that the RTC backup battery is not
> dead), and *then* reset to the 1970's, delete some files, then be
> correct when the filesystem is unmounted (so s_wtime is sane). That's
> pretty hard to accomplish; and I would submit, even on embedded
> systems. The system clock must be crazily warping back and forth
> between correct time and 1970's/insane time in order for this to be an
> issue.

If I understand correctly, once you have deleted a file in 1970,
it might stay in the filesystem for some period of time, like a time bomb.
Then the clock doesn't have to jump back and forth.
It seems to me that even a typical PC can show the symptom
after two reboots, like this:

1. RTC backup battery runs out
2. hardware reboot; RTC is reset to 1970.
3. mount, delete a file (clock reads 1970)
4. umount
5. Set clock to 2007 (manually, or by NTP)
- - - -
6. reboot (software reset which doesn't reset the RTC, or replace battery.)
7. e2fsck (no problem this time)
8. mount (in 2007)
9. write (in 2007)
10. umount
- - - -
11. reboot
12. e2fsck, hit the problem.
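
The sequence above can be walked through with a toy model. This is only a sketch under stated assumptions: the ToyFs class, its methods, and the "timestamp below inode count" threshold are all invented for illustration, and the model ignores everything about ext3 except the three timestamps discussed in this thread.

```python
from dataclasses import dataclass, field

EARLY_1970 = 3600          # clock reading shortly after the epoch
YEAR_2007 = 1_184_803_200  # 2007-07-19 00:00:00 UTC


@dataclass
class ToyFs:
    """Toy filesystem tracking only s_mtime, s_wtime and inode dtimes."""
    s_inodes_count: int = 1_000_000
    s_mtime: int = 0
    s_wtime: int = 0
    dtimes: list = field(default_factory=list)

    def mount(self, now: int) -> None:
        self.s_mtime = now
        self.s_wtime = now

    def delete_file(self, now: int) -> None:
        self.dtimes.append(now)  # the deletion time is stored in the inode
        self.s_wtime = now

    def write(self, now: int) -> None:
        self.s_wtime = now

    def umount(self, now: int) -> None:
        self.s_wtime = now

    def fsck_reports_problem(self) -> bool:
        # Assumed heuristic: tiny superblock times mean a busted clock,
        # in which case the low-dtime check is skipped.
        busted = (self.s_mtime < self.s_inodes_count
                  or self.s_wtime < self.s_inodes_count)
        if busted:
            return False
        return any(0 < d < self.s_inodes_count for d in self.dtimes)


fs = ToyFs()
fs.mount(EARLY_1970)              # step 3: mount with the clock at 1970
fs.delete_file(EARLY_1970)        # step 3: delete a file
fs.umount(EARLY_1970)             # step 4
first = fs.fsck_reports_problem()   # step 7: check suppressed
fs.mount(YEAR_2007)               # step 8
fs.write(YEAR_2007)               # step 9
fs.umount(YEAR_2007)              # step 10
second = fs.fsck_reports_problem()  # step 12: stale 1970 dtime is flagged
print(first, second)              # False True
```

In this model the first fsck stays quiet because the 1970 mount time marks the clock as bad, and the second one fires because the superblock times have been refreshed while the old deletion time has not.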

There is no way to notice the real cause (the RTC) if the system is a server
that only reboots once a year.


> > * It is very difficult to relate RTC to the problem.
> > No clue without digging into e2fsck source code.
>
> Yes. As I said, it might be a good idea to add an
> unreliable_system_time config parameter to e2fsck in the future to
> catch this case. That would also document the issue to avoid future
> people from running into this.
And might it also be helpful to have some hint in the e2fsck message?


> > * The -p (preen) option of e2fsck doesn't fix it automatically.
> > Though I'm not sure but, maybe it's safe to correct the
> > problem automatically?
>
> Yes, but this was deliberate; if there was a bug in the kernel's
> orphan handling code, I really wanted to know about it, and if it was
> just -p, most folk would never know. (Although if there were orphan
> list handling bugs, it could cause some truncates would not be
> reliably replayed, so it might cause even **harder** to diagnose bugs.
> Life is always full of tradeoffs.)
OK, I agree. You have at least one example of such a person here :-)


Regards,
--
Ryoichi KATO <[email protected]>
Audio Development & Engineering Div.
Sony Corporation Audio Business Group
Tel +81-3-3599-3862 / Fax +81-3-3599-3859