2005-11-16 18:16:23

by Jan Niehusmann

Subject: Laptop mode causing writes to wrong sectors?

(Cc: to linux-kernel, in case somebody else is interested in this as well)

Hi Bart,

let me start by stating that the following is mainly guessed. I may be
completely wrong. Still I think you may be interested in my
observations, and perhaps you already got similar reports?

On my laptop, running 2.6.14, I'm observing some strange file- and
filesystem corruptions. First, I thought it may have been caused by an
ext3 bug because the first corruption I did observe happened shortly
after an ext3 journal replay.

I did report this to linux-kernel, but without any helpful response:
http://www.ussg.iu.edu/hypermail/linux/kernel/0511.0/0129.html
(Subject: ext3 corruption: "JBD: no valid journal superblock found")

But now, I got another hint pointing to a possible cause of this
problem: I found a file - /usr/lib/libatlas.so.3.0 - which was corrupted
by 4k of it being overwritten by a different file, which I recognized.
And that file happened to be an uncompressed manual page.

As usually the manual pages are only stored compressed, this must have
happened when I actually did look at that manual page, which causes the
uncompressed version to be written to a file in /tmp/. And the best is:
I actually remember when I did read that man page, and it was while the
notebook ran on battery power, which is quite seldom. On battery power,
I have laptop mode activated and the hard disk spun down after a short
idle time.

Why do I think this is related to the corruption? Well, on the one hand,
I'm compiling kernels quite often, tracking Linus' git repository, and
I'm regularly upgrading my system to Debian unstable, both involving
hundreds of megabytes of disk writes - and I never observed a single
problem while doing so. On the other hand, a simple look at a short
manual page did cause file system corruption. This would be rather
strange if the corruption happened at random disk writes. But kernel
compiles as well as system upgrades involve regular writes to the hard
disk, which therefore doesn't spin down. (And additionally, I usually
don't do such things while running on battery.) Reading the man page
happened while the system was quite idle, and it may have been the read
of the compressed image or the write to the temporary file which spun up
the hard drive. (To be exact, I looked at the man page more than once -
so the second time, the compressed image probably was cached and reading
it didn't require filesystem access, so it really could have been the
write triggering the spin up)

Well, quite a long mail for a little observation, and sorry if you think
that I wasted your time. Did I? Or may my suspicion be true and there is
some connection between laptop mode and the corruptions I observe?

Thanks for reading all this stuff ;-)

Jan



2005-11-16 20:08:01

by Bart Samwel

Subject: Re: Laptop mode causing writes to wrong sectors?

Jan Niehusmann wrote:
> let me start by stating that the following is mainly guessed. I may be
> completely wrong. Still I think you may be interested in my
> observations, and perhaps you already got similar reports?

Nope, no similar reports. But I'm listening. :)

> On my laptop, running 2.6.14, I'm observing some strange file- and
> filesystem corruptions. First, I thought it may have been caused by an
> ext3 bug because the first corruption I did observe happened shortly
> after an ext3 journal replay.
>
> I did report this to linux-kernel, but without any helpful response:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0511.0/0129.html
> (Subject: ext3 corruption: "JBD: no valid journal superblock found")

Quoting your message:

# There are two things I did with the filesystem which may be related to
# this: First, on Oct. 27 I did resize the filesystem (umount, lvextend,
# e2fsck -f, resize2fs, mount). But after that I did several reboots
# without any problems - this is my notebook and I turn it on and off
# several times a day.

First of all, you having resized your fs is a smoking gun, if you ask
me. Your fs is dead/dying, and you know you've recently been tinkering
with it. It's the most probable cause.

Secondly, I think that your resize sequence is missing an e2fsck -f
after resize2fs. Resizing filesystems is risky business, and I've ruined
many a filesystem by resizing them. Even when it came clean out of an
fsck. I'm also worried that there was apparently _never_ a full fsck
after the resize2fs -- seeing as all the subsequent fscks were probably
done by journal. That way, any existing problem can stay in existence
and slowly "creep" into more and more of your files as you modify them.
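The sequence under discussion, with the extra full check added after
resize2fs, might be sketched as follows. This is only an illustration:
the `step` and `resize_with_checks` helpers are invented here, and the
volume name, mount point and size in the usage note are hypothetical
placeholders. By default the sketch only prints the commands; setting
RUN=1 would execute them, which needs root and carries the usual risk.

```shell
#!/bin/sh
# step CMD...: print the command, or run it when RUN=1 is set.
# (Hypothetical helper, not part of e2fsprogs or LVM.)
step() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# resize_with_checks LV MOUNTPOINT EXTRA_SIZE
# Same order as the quoted sequence (umount, lvextend, e2fsck -f,
# resize2fs, mount), plus a second forced e2fsck after resize2fs.
resize_with_checks() {
    step umount "$2" &&
    step lvextend -L "+$3" "$1" &&
    step e2fsck -f "$1" &&
    step resize2fs "$1" &&
    step e2fsck -f "$1" &&
    step mount "$1" "$2"
}
```

For example, `resize_with_checks /dev/vg0/home /home 2G` prints the six
commands in order without touching any device.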

> But now, I got another hint pointing to a possible cause of this
> problem: I found a file - /usr/lib/libatlas.so.3.0 - which was corrupted
> by 4k of it being overwritten by a different file, which I recognized.
> And that file happened to be an uncompressed manual page.

This sounds like your filesystem's block bitmaps are "fscked up". These
problems can definitely cause "creeping corruption" when undetected,
because (a) new files overwrite existing files only part of the time
(especially if your filesystem has a relatively large amount of free
space, as it probably does because you just resized it), and (b) you
don't actually use most of your files very often, so you usually don't
really notice it until it's too late.

Also, AFAIK the journal is simply a special file as far as ext3 is
concerned, and perhaps the journal corruption you experienced has to do
with that special file's bits being marked free, and the beginning of
the journal being overwritten by other data.

DISCLAIMER: I'm biased. I almost lost a filesystem to this exact problem
once. It was ext2resize, not resize2fs. But still.

About the laptop mode hypothesis: I think it's just a coincidence. If
it's not, then it could be a "sync-time-only" problem (because what
laptop mode does before spindown is a sync), but not a specific laptop
mode problem -- laptop mode doesn't influence block numbers whatsoever.
But if it were a sync problem, we would be seeing a lot more reports
of corruption. For now my vote is with the resize. :)

--Bart

2005-11-16 21:42:26

by Jan Niehusmann

Subject: Re: Laptop mode causing writes to wrong sectors?

On Wed, Nov 16, 2005 at 09:06:03PM +0100, Bart Samwel wrote:
> First of all, you having resized your fs is a smoking gun, if you ask
> me. Your fs is dead/dying, and you know you've recently been tinkering
> with it. It's the most probable cause.

Well, it would be nice if the explanation was that easy - but the recent
corruption (with the man page) was on my root partition, which is not on
LVM and has never been resized.

> Secondly, I think that your resize sequence is missing an e2fsck -f
> after resize2fs.

Be assured that at least after the first corruption I observed, I ran a
forced e2fsck on all partitions, without any errors found.

> after the resize2fs -- seeing as all the subsequent fscks were probably
> done by journal.

What do you mean by 'by journal'? The filesystems were unmounted (or
remounted r/o), so the journal should have been committed and empty at
e2fsck time.

> >But now, I got another hint pointing to a possible cause of this
> >problem: I found a file - /usr/lib/libatlas.so.3.0 - which was corrupted
> >by 4k of it being overwritten by a different file, which I recognized.
> >And that file happened to be an uncompressed manual page.
>
> This sounds like your filesystem's block bitmaps are "fscked up". These
> problems can definitely cause "creeping corruption" when undetected,

But this should definitely have been detected by an fsck, right?

> (especially if your filesystem has a relatively large amount of free
> space, as it probably does because you just resized it)

Root fs is 94% full, and during apt-get upgrade sometimes becomes
completely full. (Which I don't like, because it probably causes bad
fragmentation...)

> About the laptop mode hypothesis: I think it's just a coincidence. If
> it's not, then it could be a "sync-time-only" problem (because what
> laptop mode does before spindown is a sync), but not a specific laptop
> mode problem -- laptop mode doesn't influence block numbers whatsoever.
> But if it were a sync problem, we would be seeing a lot more reports
> of corruption. For now my vote is with the resize. :)

I agree it's probably not a sync problem. And therefore, probably not
really a laptop-mode bug, even if laptop-mode triggered the corruption.
I suspected the hard drive to mess up write requests during spin-up. Or
perhaps giving some kind of error message, which could trigger a bug in
a rarely tested error-handling path in the kernel. But the fact that you
never got similar reports makes this less likely. In the end, I have to
consider there may be some bad hardware in my laptop. (Already did a
memtest86, of course.)

Jan




2005-11-17 09:27:28

by Bart Samwel

Subject: Re: Laptop mode causing writes to wrong sectors?

Jan Niehusmann wrote:
> On Wed, Nov 16, 2005 at 09:06:03PM +0100, Bart Samwel wrote:
>> First of all, you having resized your fs is a smoking gun, if you ask
>> me. Your fs is dead/dying, and you know you've recently been tinkering
>> with it. It's the most probable cause.
>
> Well, it would be nice if the explanation was that easy - but the recent
> corruption (with the man page) was on my root partition, which is not on
> LVM and has never been resized.

Youch. I assumed this was all the same fs! It is the same HD though?

> Be assured that at least after the first corruption I observed, I ran a
> forced e2fsck on all partitions, without any errors found.

ACK.

>> after the resize2fs -- seeing as all the subsequent fscks were probably
>> done by journal.
>
> What do you mean by 'by journal'? The filesystems were unmounted (or
> remounted r/o), so the journal should have been committed and empty at
> e2fsck time.

I mean, no full e2fsck, at most a journal replay. This doesn't check the
bitmaps, so if they are inconsistent, they stay inconsistent.

>>> But now, I got another hint pointing to a possible cause of this
>>> problem: I found a file - /usr/lib/libatlas.so.3.0 - which was corrupted
>>> by 4k of it being overwritten by a different file, which I recognized.
>>> And that file happened to be an uncompressed manual page.
>> This sounds like your filesystem's block bitmaps are "fscked up". These
>> problems can definitely cause "creeping corruption" when undetected,
>
> But this should definitely have been detected by an fsck, right?

Yes. And you've had this problem before, even. Googling for "e2fsck
block bitmap differences" shows me this as the third entry. :)

http://lkml.org/lkml/2003/8/3/166

If you didn't get those messages, then this is not the problem, apparently.

> I agree it's probably not a sync problem. And therefore, probably not
> really a laptop-mode bug, even if laptop-mode triggered the corruption.
> I suspected the hard drive to mess up write requests during spin-up. Or
> perhaps giving some kind of error message, which could trigger a bug in
> a rarely tested error-handling path in the kernel. But the fact that you
> never got similar reports makes this less likely. In the end, I have to
> consider there may be some bad hardware in my laptop. (Already did a
> memtest86, of course.)

There is a known problem with laptop mode where, often during
spindown/spinup, the kernel emits DMA timeout errors.

http://lkml.org/lkml/2005/8/21/48

According to Andrea Gelmini (the original reporter of this problem) this
can lead to system freezes on some kernels, and corruption on others.
Are you seeing these errors somewhere in your logs?
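A quick way to look for such messages could be a grep over the kernel
log. This is a sketch under assumptions: the `dma_errors` helper is
invented here, and the patterns are a guess at typical IDE driver
messages of that era (for example "hda: dma_timer_expiry: ..." or
"hda: DMA timeout error"), not an exhaustive list.

```shell
#!/bin/sh
# dma_errors LOGFILE: print log lines that look like IDE DMA trouble.
# The patterns below are assumptions about the message wording.
dma_errors() {
    grep -iE 'dma_timer_expiry|dma timeout|lost interrupt' "$1"
}
```

It would be called with whatever file holds the kernel messages, e.g.
`dma_errors /var/log/kern.log` (the path depends on the distribution).
No output and a non-zero exit status mean that no such line was found.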

What's your hardware? A Thinkpad perhaps?

--Bart

2005-11-17 10:33:59

by Jan Niehusmann

Subject: Re: Laptop mode causing writes to wrong sectors?

On Thu, Nov 17, 2005 at 10:25:31AM +0100, Bart Samwel wrote:
> Youch. I assumed this was all the same fs! It is the same HD though?

Yes, same HD, same notebook, no hardware changes.

> >But this should definitely have been detected by an fsck, right?
>
> Yes. And you've had this problem before, even. Googling for "e2fsck
> block bitmap differences" shows me this as the third entry. :)
>
> http://lkml.org/lkml/2003/8/3/166
>
> If you didn't get those messages, then this is not the problem, apparently.

Ok - this was on a different computer, so it's not really related, but
the important point is that fsck didn't report any errors so the
write into the middle of another file was not caused by a bad
filesystem.

> There is a known problem with laptop mode where, often during
> spindown/spinup, the kernel emits DMA timeout errors.
>
> http://lkml.org/lkml/2005/8/21/48
>
> According to Andrea Gelmini (the original reporter of this problem) this
> can lead to system freezes on some kernels, and corruption on others.
> Are you seeing these errors somewhere in your logs?

No ide errors at all in the kernel logs, which cover more than one
month.

> What's your hardware? A Thinkpad perhaps?

ASUS M2400N with a SAMSUNG MP0804H 80GB hard drive (this is not
the original hard drive - the notebook was delivered with a 60GB
drive). Centrino chipset, 1.6GHz Pentium M. Hard drive running in udma5
mode. 512MB RAM, Intel Pro Wireless 2100 replaced with an Intel Pro
Wireless 2200 wireless lan card, and a CardMan 4000 card reader in the
pccard slot - not that I think these have any influence on the hard drive,
but this is the complete hardware description.

Jan



2005-11-17 11:38:27

by Bart Samwel

Subject: Re: Laptop mode causing writes to wrong sectors?

Jan Niehusmann wrote:
> No ide errors at all in the kernel logs, which cover more than one
> month.
>
>> What's your hardware? A Thinkpad perhaps?
>
> ASUS M2400N with a SAMSUNG MP0804H 80GB hard drive (this is not
> the original hard drive - the notebook was delivered with a 60GB
> drive). Centrino chipset, 1.6GHz Pentium M. Hard drive running in udma5
> mode. 512MB RAM, Intel Pro Wireless 2100 replaced with an Intel Pro
> Wireless 2200 wireless lan card, and a CardMan 4000 card reader in the
> pccard slot - not that I think these have any influence on the hard drive,
> but this is the complete hardware description.

Thanks. The reason I asked whether it was a Thinkpad is that the DMA
problems usually occur on Thinkpads. You don't have a thinkpad, and you
don't see the log messages, so that's not it. The filesystem's block
bitmaps were OK, so that's not it. Quoting you in your previous message:

# I suspected the hard drive to mess up write requests during spin-up.
# Or perhaps giving some kind of error message, which could trigger a
# bug in a rarely tested error-handling path in the kernel. But the fact
# that you never got similar reports makes this less likely. In the end,
# I have to consider there may be some bad hardware in my laptop.

If there are no other reports on different hardware, then I'm afraid you
might be right about the hardware thing. :( I think problems like these
usually don't get noticed/fixed because modern Windows versions are
actually very lousy at spinning down disks -- in my experience it
doesn't really happen, even though you can set the timeouts.

I'm sorry I'm not able to help you out here.

--Bart

2005-11-17 13:22:57

by Bradley Chapman

Subject: Re: Laptop mode causing writes to wrong sectors?

I too have been experiencing some problems with laptop-mode that may
be similar to what was recently reported here.

I have a Centrino machine (Sager NP3760, aka Clevo M375E) with a 60GB
Hitachi TravelStar hard disk running in UDMA5 and 512MB RAM, and on
occasion I've had random files on my /usr partition overwritten and
both my /usr and /var filesystems quite thoroughly trashed - with
these events usually occurring right after I'd been on battery power
and my hard disk had been spinning up and down regularly.

All my filesystems are ext3 with journaling active, and none of them
have been messed with (i.e. resized).

Brad
--
SCREW THE ADS! http://adblock.mozdev.org/

2005-11-17 14:29:15

by Bart Samwel

Subject: Re: Laptop mode causing writes to wrong sectors?

Bradley, Jan,

Bradley Chapman wrote:
> I too have been experiencing some problems with laptop-mode that may
> be similar to what was recently reported here.
>
> I have a Centrino machine (Sager NP3760, aka Clevo M375E) with a 60GB
> Hitachi TravelStar hard disk running in UDMA5 and 512MB RAM, and on
> occasion I've had random files on my /usr partition overwritten and
> both my /usr and /var filesystems quite thoroughly trashed - with
> these events usually occurring right after I'd been on battery power
> and my hard disk had been spinning up and down regularly.
>
> All my filesystems are ext3 with journaling active, and none of them
> have been messed with (i.e. resized).

OK, that's the second report then. I'm beginning to worry. :/

Are you seeing any DMA timeout messages in your kernel log?

Also, both reports are on ext3, which might point to an ext3 problem
with long commit intervals.

Bradley, Jan, since when have these problems been happening? Kernel
version-wise, I mean?

--Bart

2005-11-17 15:41:48

by Jan Niehusmann

Subject: Re: Laptop mode causing writes to wrong sectors?

On Thu, Nov 17, 2005 at 03:27:00PM +0100, Bart Samwel wrote:
> OK, that's the second report then. I'm beginning to worry. :/

And I'm not feeling so lonely any more ;-)

> Bradley, Jan, since when have these problems been happening? Kernel
> version-wise, I mean?

I didn't notice these problems before 2.6.14. As these corruptions are
not happening very often, and as I usually do not run the notebook on
battery power, the problem may have existed for a while, though.

Today I did a simple test: I activated laptop mode with a 10s idle
timeout, and made a script write files with unique identifiers, followed
by a sync, every 60 seconds. After nearly an hour, I didn't see any
corruption, though at least some of these writes have triggered
a spin-up. When I have some spare time I'll do more intensive testing.
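A test of this kind might look roughly like the sketch below. It is not
the script actually used; the helper names, the "probe-N" file naming
and the choice of 512 lines per file (about one 4k block) are all
assumptions made for illustration.

```shell
#!/bin/sh
PROBE_LINES=512   # 512 lines of "probe-N" come to roughly 4k per file

# write_probe DIR N: create DIR/probe-N filled with its own name, then sync.
write_probe() {
    i=0
    : > "$1/probe-$2"
    while [ "$i" -lt "$PROBE_LINES" ]; do
        printf 'probe-%s\n' "$2" >> "$1/probe-$2"
        i=$((i + 1))
    done
    sync
}

# run_probes DIR COUNT INTERVAL: write COUNT probes, INTERVAL seconds apart.
run_probes() {
    n=1
    while [ "$n" -le "$2" ]; do
        write_probe "$1" "$n"
        sleep "$3"
        n=$((n + 1))
    done
}

# verify_probes DIR: report every probe file that contains a foreign line
# or has the wrong number of lines; returns non-zero if any was found.
verify_probes() {
    bad=0
    for f in "$1"/probe-*; do
        [ -e "$f" ] || continue
        id=${f##*/}
        if grep -qvxF -e "$id" "$f" || [ $(wc -l < "$f") -ne "$PROBE_LINES" ]; then
            echo "corrupt: $f"
            bad=1
        fi
    done
    return "$bad"
}
```

With laptop mode active, something like `run_probes /tmp/probes 60 60`
would approximate an hour of writes at one per minute, and
`verify_probes /tmp/probes` could be run afterwards. Because every file
carries its own name in every line, a stray block from a probe landing
elsewhere on the disk would also be easy to recognize.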

Additionally, I mounted more than half of the partitions on this
notebook read only, and made a 1:1 copy of these partitions to an
external hard drive. Therefore, I can check later if something
accidentally did write to these areas.
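The later check against the copy could be sketched like this. The
`differing_blocks` helper is invented here, and the 4096-byte default
is an assumption matching the 4k overwrite described earlier; it lists
the block indices (counted from 0) at which two images differ.

```shell
#!/bin/sh
# differing_blocks A B [BLOCKSIZE]: print the 0-based index of every
# block in which files (or block devices) A and B differ.
# cmp -l prints one line per differing byte, starting with its
# 1-based byte offset; awk maps offsets to blocks and drops repeats.
differing_blocks() {
    cmp -l "$1" "$2" 2>/dev/null |
        awk -v bs="${3:-4096}" '{
            b = int(($1 - 1) / bs)
            if (NR == 1 || b != last) print b
            last = b
        }'
}
```

A call such as `differing_blocks /dev/hda5 /mnt/backup/hda5.img` (both
names hypothetical) prints nothing as long as the read-only partition
still matches its copy.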

If you have any suggestions for additional tests, please tell me.

The random filesystem corruption had one positive effect: I never had
such a good backup of my data before. ;-)

Jan

2005-11-17 16:07:52

by Bart Samwel

Subject: Re: Laptop mode causing writes to wrong sectors?

Jan Niehusmann wrote:
>> Bradley, Jan, since when have these problems been happening? Kernel
>> version-wise, I mean?
>
> I didn't notice these problems before 2.6.14. As these corruptions are
> not happening very often, and as I usually do not run the notebook on
> battery power, the problem may have existed for a while, though.
>
> Today I did a simple test: I activated laptop mode with a 10s idle
> timeout, and made a script write files with unique identifiers, followed
> by a sync, every 60 seconds. After nearly an hour, I didn't see any
> corruption, though at least some of these writes have triggered
> a spin-up. When I have some spare time I'll do more intensive testing.

Well, the syncs should trigger a spinup every time. Laptop mode does not
influence syncs, really.

> Additionally, I mounted more than half of the partitions on this
> notebook read only, and made a 1:1 copy of these partitions to an
> external hard drive. Therefore, I can check later if something
> accidentally did write to these areas.

"more than half"? Exactly how many partitions *are* there on this
notebook? ;)

> If you have any suggestions for additional tests, please tell me.

Perhaps you could enable /proc/sys/vm/block_dump. This makes the kernel
output all disk activity, including block numbers. By looking up the
corrupted block numbers in your logs you can check later if the
corrupting write was done by the kernel (i.e., software fault) or not
(hardware fault).

Note that the output of block_dump may not go into your logs by default,
because it's output with KERN_DEBUG. You may need to change your log
settings.

You can add extra context on ext3's state by enabling JBD debugging
(CONFIG_JBD_DEBUG, IIRC).
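In practice this could look roughly like the sketch below. The helper
names are invented here, and the log line format (something like
"pdflush(88): WRITE block 123456 on hda1") is an assumption about what
block_dump prints; both the format and the unit of the number should be
checked against the kernel actually in use.

```shell
#!/bin/sh
# enable_block_dump [PATH]: switch block_dump on (needs root).
# PATH defaults to the /proc file named above.
enable_block_dump() {
    echo 1 > "${1:-/proc/sys/vm/block_dump}"
}

# writes_to_block LOGFILE NUMBER: print logged WRITE lines for NUMBER.
# Assumes lines of the form "comm(pid): WRITE block NUMBER on DEVICE".
writes_to_block() {
    grep -E "WRITE block $2 on " "$1"
}
```

If `writes_to_block` finds the corrupted location in the log, the
kernel itself issued that write; if it finds nothing although logging
was active at the time, that points toward the hardware.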

> The random filesystem corruption had one positive effect: I never had
> such a good backup of my data before. ;-)

It made me rethink my backup strategy as well. :)

--Bart

2005-11-17 16:22:26

by Bradley Chapman

Subject: Re: Laptop mode causing writes to wrong sectors?

Mr. Samwel,

On 11/17/05, Bart Samwel wrote:
>
> OK, that's the second report then. I'm beginning to worry. :/
>
> Are you seeing any DMA timeout messages in your kernel log?

Once, when my /var partition got trashed - about thirty or forty loud
and scary messages from the IDE core saying that various disk accesses
(i.e. normal read/writes) were failing. I do believe DMA was
mentioned.

Another time (i.e. just now), I got five Oopses in a row, most of them
in kmem_cache_alloc() but with one in generic_aio_file_read().
Unfortunately I am using fglrx right now so they are probably quite
meaningless...*

Most of the time though, I don't see anything.

Fortunately the number of errors encouraged the kernel to remount R/O,
so after a quick jump to single user mode and two e2fsck -f
invocations, it was healed.

>
> Also, both reports are on ext3, which might point to an ext3 problem
> with long commit intervals.
>
> Bradley, Jan, since when have these problems been happening? Kernel
> version-wise, I mean?

They started with 2.6.13. I can't remember ever experiencing random
partition trashing or random file corruption in 2.6.12. I tried
2.6.14.1 - that kernel did Bad Things as well.

So far though, as long as I stay on juice, 2.6.13 seems to behave.

>
> --Bart
>

Brad

* - http://216.55.161.203/theonekea/misc/oopsen.txt
--
SCREW THE ADS! http://adblock.mozdev.org/

2005-11-17 21:24:10

by Bart Samwel

Subject: Re: Laptop mode causing writes to wrong sectors?

Bradley Chapman wrote:
> Mr. Samwel,
>
> On 11/17/05, Bart Samwel wrote:
>> OK, that's the second report then. I'm beginning to worry. :/
>>
>> Are you seeing any DMA timeout messages in your kernel log?
>
> Once, when my /var partition got trashed - about thirty or forty loud
> and scary messages from the IDE core saying that various disk accesses
> (i.e. normal read/writes) were failing. I do believe DMA was
> mentioned.

This could be the problem I was talking about. Was this happening during
spindown or not?

> Another time (i.e. just now), I got five Oopses in a row, most of them
> in kmem_cache_alloc() but with one in generic_aio_file_read().
> Unfortunately I am using fglrx right now so they are probably quite
> meaningless...*

I guess so. They all oops on reading the same address (0x05c2a5bb), so
something is corrupted in the slab cache, cause unknown. Very
possibly fglrx.

> Most of the time though, I don't see anything.

...while still experiencing corruption?

>> Bradley, Jan, since when have these problems been happening? Kernel
>> version-wise, I mean?
>
> They started with 2.6.13. I can't remember ever experiencing random
> partition trashing or random file corruption in 2.6.12. I tried
> 2.6.14.1 - that kernel did Bad Things as well.
>
> So far though, as long as I stay on juice, 2.6.13 seems to behave.

Hmmmm. This means that you could still be experiencing the same thing
that Andrea Gelmini was reporting. Could you try the things he said made
it worse, and check if things go wrong? You are, of course, allowed to
decline because of the risk involved. :-) The things are:

Big activity:

* iozone -A
* unrar big file
* In order to make it happen faster:

    cd /proc/sys/vm
    echo 100 > dirty_background_ratio
    echo 1000000 > dirty_expire_centisecs
    echo 100 > dirty_ratio
    echo 1000000 > dirty_writeback_centisecs
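Since these settings deliberately make the system hold dirty data for a
very long time, it may be worth recording the current values first so
they can be put back afterwards. A minimal sketch, with invented helper
names and a VM_DIR variable that is an assumption added only so the
target directory can be overridden:

```shell
#!/bin/sh
VM_DIR=${VM_DIR:-/proc/sys/vm}
VM_KEYS="dirty_background_ratio dirty_expire_centisecs dirty_ratio dirty_writeback_centisecs"

# save_vm_settings FILE: write "name value" lines for the four settings.
save_vm_settings() {
    for k in $VM_KEYS; do
        printf '%s %s\n' "$k" "$(cat "$VM_DIR/$k")"
    done > "$1"
}

# restore_vm_settings FILE: write the saved values back (needs root
# when VM_DIR is the real /proc/sys/vm).
restore_vm_settings() {
    while read -r k v; do
        echo "$v" > "$VM_DIR/$k"
    done < "$1"
}
```

One would run `save_vm_settings /root/vm.saved` before the echo
commands and `restore_vm_settings /root/vm.saved` once the test is
over; the file name is only an example.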

--Bart

2005-11-17 22:33:59

by Pavel Machek

Subject: Re: Laptop mode causing writes to wrong sectors?

Hi!

> let me start by stating that the following is mainly guessed. I may be
> completely wrong. Still I think you may be interested in my
> observations, and perhaps you already got similar reports?
>
> On my laptop, running 2.6.14, I'm observing some strange file- and
> filesystem corruptions. First, I thought it may have been caused by an
> ext3 bug because the first corruption I did observe happened shortly
> after an ext3 journal replay.
>
> I did report this to linux-kernel, but without any helpful response:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0511.0/0129.html
> (Subject: ext3 corruption: "JBD: no valid journal superblock found")
>
> But now, I got another hint pointing to a possible cause of this
> problem: I found a file - /usr/lib/libatlas.so.3.0 - which was corrupted
> by 4k of it being overwritten by a different file, which I recognized.
> And that file happened to be an uncompressed manual page.
>
> As usually the manual pages are only stored compressed, this must have
> happened when I actually did look at that manual page, which causes the
> uncompressed version to be written to a file in /tmp/. And the best is:
> I actually remember when I did read that man page, and it was while the
> notebook ran on battery power, which is quite seldom. On battery power,
> I have laptop mode activated and the hard disk spun down after a short
> idle time.
>
> Why do I think this is related to the corruption? Well, on the one hand,
> I'm compiling kernels quite often, tracking linus' git repository,
> and

Can you try some filesystem test while forcing disk spindowns via
hdparm?

It may be a bug in laptop mode, or a bug in IDE (or something
related)... trying spindowns without laptopmode would be helpful.
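Such a test could be sketched as below, using `hdparm -y` to put the
drive into standby directly instead of relying on laptop mode. The
function name, the HDPARM override variable and the "round-N" files are
assumptions made for this illustration; it needs root and a real drive
to be meaningful.

```shell
#!/bin/sh
# spindown_write_test DEV DIR ROUNDS PAUSE
# Each round: force standby, wait PAUSE seconds, then write and sync a
# small file in DIR (which should live on DEV) so the write has to
# wake the disk. Afterwards every file is read back and compared.
# HDPARM can be set to substitute another command for hdparm.
spindown_write_test() {
    r=1
    while [ "$r" -le "$3" ]; do
        "${HDPARM:-hdparm}" -y "$1" || return 1
        sleep "$4"
        printf 'round-%s\n' "$r" > "$2/round-$r"
        sync
        r=$((r + 1))
    done
    # Note: an immediate read-back is likely served from the page
    # cache; a real test would remount or reboot before verifying.
    r=1
    while [ "$r" -le "$3" ]; do
        [ "$(cat "$2/round-$r")" = "round-$r" ] || {
            echo "mismatch in round $r"
            return 1
        }
        r=$((r + 1))
    done
}
```

For instance, `spindown_write_test /dev/hda /home/test 20 30` would
run twenty spin-down and wake-up cycles; device and directory are
placeholders. Checking the rest of the filesystem against a known-good
copy afterwards would still be needed to catch writes that landed in
the wrong place.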

Pavel
--
Thanks, Sharp!

2005-11-17 22:50:32

by Bradley Chapman

Subject: Re: Laptop mode causing writes to wrong sectors?

Mr. Samwel,

On 11/17/05, Bart Samwel wrote:
> Bradley Chapman wrote:
> > Mr. Samwel,
> >
> > On 11/17/05, Bart Samwel wrote:
> >> OK, that's the second report then. I'm beginning to worry. :/
> >>
> >> Are you seeing any DMA timeout messages in your kernel log?
> >
> > Once, when my /var partition got trashed - about thirty or forty loud
> > and scary messages from the IDE core saying that various disk accesses
> > (i.e. normal read/writes) were failing. I do believe DMA was
> > mentioned.
>
> This could be the problem I was talking about. Was this happening during
> spindown or not?

I believe so. I only noticed it when I checked my dmesg after the
ipw2200 driver crapped out yet again...

>
> > Another time (i.e. just now), I got five Oopses in a row, most of them
> > in kmem_cache_alloc() but with one in generic_aio_file_read().
> > Unfortunately I am using fglrx right now so they are probably quite
> > meaningless...*
>
> I guess so. They all oops on reading the same address (0x05c2a5bb), so
> something is corrupted in the slab cache, cause unknown. Very
> possibly fglrx.

Indeed. I expected as much.

>
> > Most of the time though, I don't see anything.
>
> ...while still experiencing corruption?
>
> >> Bradley, Jan, since when have these problems been happening? Kernel
> >> version-wise, I mean?
> >
> > They started with 2.6.13. I can't remember ever experiencing random
> > partition trashing or random file corruption in 2.6.12. I tried
> > 2.6.14.1 - that kernel did Bad Things as well.
> >
> > So far though, as long as I stay on juice, 2.6.13 seems to behave.
>
> Hmmmm. This means that you could still be experiencing the same thing
> that Andrea Gelmini was reporting. Could you try the things he said made
> it worse, and check if things go wrong? You are, of course, allowed to
> decline because of the risk involved. :-) The things are:
>
> Big activity:
>
> * iozone -A
> * unrar big file
> * In order to make it happen faster:
>
>     cd /proc/sys/vm
>     echo 100 > dirty_background_ratio
>     echo 1000000 > dirty_expire_centisecs
>     echo 100 > dirty_ratio
>     echo 1000000 > dirty_writeback_centisecs

No thanks. I'll stick to 'normal usage' triggers and see if I can
gather any data that way ;-)))

>
> --Bart
>

Brad
--
SCREW THE ADS! http://adblock.mozdev.org/

2005-11-18 18:44:05

by Bill Davidsen

Subject: Re: Laptop mode causing writes to wrong sectors?

Pavel Machek wrote:
> Hi!
>
>
>>let me start by stating that the following is mainly guessed. I may be
>>completely wrong. Still I think you may be interested in my
>>observations, and perhaps you already got similar reports?
>>
>>On my laptop, running 2.6.14, I'm observing some strange file- and
>>filesystem corruptions. First, I thought it may have been caused by an
>>ext3 bug because the first corruption I did observe happened shortly
>>after an ext3 journal replay.
>>
>>I did report this to linux-kernel, but without any helpful response:
>>http://www.ussg.iu.edu/hypermail/linux/kernel/0511.0/0129.html
>>(Subject: ext3 corruption: "JBD: no valid journal superblock found")
>>
>>But now, I got another hint pointing to a possible cause of this
>>problem: I found a file - /usr/lib/libatlas.so.3.0 - which was corrupted
>>by 4k of it being overwritten by a different file, which I recognized.
>>And that file happened to be an uncompressed manual page.
>>
>>As usually the manual pages are only stored compressed, this must have
>>happened when I actually did look at that manual page, which causes the
>>uncompressed version to be written to a file in /tmp/. And the best is:
>>I actually remember when I did read that man page, and it was while the
>>notebook ran on battery power, which is quite seldom. On battery power,
>>I have laptop mode activated and the hard disk spun down after a short
>>idle time.
>>
>>Why do I think this is related to the corruption? Well, on the one hand,
>>I'm compiling kernels quite often, tracking linus' git repository,
>>and
>
>
> Can you try some filesystem test while forcing disk spindowns via
> hdparm?
>
> It may be a bug in laptop mode, or a bug in IDE (or something
> related)... trying spindowns without laptopmode would be helpful.
>
I don't know if it would be helpful, but I run several servers with
multiple drives, usually 4-5, some of which are in RAID and some aren't,
and they all spin down and restart without problems many times a day.
The kernel is 2.6.14.? with one patch to get my unsupported VIA IDE working.

My laptop also has a spindown (five min from memory) and I have yet to
have a problem with it. Don't know if any of that is "spindowns without
laptopmode" in a useful sense.

--
-bill davidsen
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-18 23:39:06

by Pavel Machek

Subject: Re: Laptop mode causing writes to wrong sectors?

Hi!

> >Can you try some filesystem test while forcing disk spindowns via
> >hdparm?
> >
> >It may be a bug in laptop mode, or a bug in IDE (or something
> >related)... trying spindowns without laptopmode would be helpful.
> >
> I don't know if it would be helpful, but I run several servers with
> multiple drives, usually 4-5, some of which are in RAID and some aren't,
> and they all spin down and restart without problems many times a day.
> The kernel is 2.6.14.? with one patch to get my unsupported VIA IDE working.
>
> My laptop also has a spindown (five min from memory) and I have yet to
> have a problem with it. Don't know if any of that is "spindowns without
> laptopmode" in a useful sense.

Unless you can also reproduce the failure... no, probably does not help
much.
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2005-11-19 08:41:11

by Bart Samwel

Subject: Re: Laptop mode causing writes to wrong sectors?

Pavel Machek wrote:
>> My laptop also has a spindown (five min from memory) and I have yet to
>> have a problem with it. Don't know if any of that is "spindowns without
>> laptopmode" in a useful sense.
>
> Unless you can also reproduce the failure... no, probably does not help
> much.

Okay, let's recap.

* There are a lot of people who are not having problems. The people who
*are* having problems can usually reproduce them. My interpretation: the
problem is triggered by some hardware and/or kernel config settings.

* A significant proportion of the people who *do* have trouble see
messages about DMA timeouts. The problems do also occur on other
hardware, but seem to be most pronounced on Thinkpad T40s. On those
machines, the DMA timeout problems are triggered *especially* when the
madwifi drivers are loaded (see
http://bugzilla.ubuntu.com/show_bug.cgi?id=6108).

Perhaps I should start collecting kernel configs and hardware specs, to
see if there are any unexpected commonalities. The influence of the
madwifi drivers suggests that we could be looking for anything really.
What do you think?

--Bart

2005-11-19 09:26:30

by Vojtech Pavlik

Subject: Re: Laptop mode causing writes to wrong sectors?

On Sat, Nov 19, 2005 at 09:39:15AM +0100, Bart Samwel wrote:
> Pavel Machek wrote:
> >>My laptop also has a spindown (five min from memory) and I have yet to
> >>have a problem with it. Don't know if any of that is "spindowns without
> >>laptopmode" in a useful sense.
> >
> >Unless you can also reproduce the failure... no, probably does not help
> >much.
>
> Okay, let's recap.
>
> * There are a lot of people who are not having problems. The people who
> *are* having problems can usually reproduce them. My interpretation: the
> problem is triggered by some hardware and/or kernel config settings.
>
> * A significant proportion of the people who *do* have trouble see
> messages about DMA timeouts. The problems do also occur on other
> hardware, but seem to be most pronounced on Thinkpad T40s. On those
> machines, the DMA timeout problems are triggered *especially* when the
> madwifi drivers are loaded (see
> http://bugzilla.ubuntu.com/show_bug.cgi?id=6108).
>
> Perhaps I should start collecting kernel configs and hardware specs, to
> see if there are any unexpected commonalities. The influence of the
> madwifi drivers suggests that we could be looking for anything really.
> What do you think?

The issue might be that these people are using

hdparm -S xxx

or

hdparm -y / -Y

while a much better way to do it is

hdparm -B 63

The -S option should in theory be safe, but I remember some drives did
behave unpredictably if this was used. -y/-Y is much tougher and some
drives will not work reliably unless first woken up manually before
issuing a read/write request.

On the other hand, -B is pretty safe on drives that support it, and all
IBM notebook drives do.

--
Vojtech Pavlik
SuSE Labs, SuSE CR

2005-11-19 11:12:31

by Bart Samwel

Subject: Re: Laptop mode causing writes to wrong sectors?

Vojtech Pavlik wrote:

> The issue might be that these people are using
>
> hdparm -S xxx
>
> or
>
> hdparm -y / -Y
>
> while a much better way to do it is
>
> hdparm -B 63
>
> The -S option should in theory be safe, but I remember some drives did
> behave unpredictably if this was used.

Well, some drives have a specific lower limit on the -S values that are
supported. That's the only compatibility problem I've ever encountered
with -S. (And you can find out if a drive has a lower limit by checking
hdparm -i.)

> -y/-Y is much tougher and some
> drives will not work reliably unless first woken up manually before
> issuing a read/write request.

In fact, -Y is problematic but -y usually isn't. -Y puts the drive to
sleep and requires that Linux reset the complete IDE controller before
using it again, while -y simply puts the drive in standby mode, leaving
it up to the drive to decide when it spins up. I've never heard of any
problems with -y.

> On the other hand, -B is pretty safe on drives that support it, and all
> IBM notebook drives do.

Not true, unfortunately. In fact, I had to change the default config of
laptop-mode-tools a while ago so that it wouldn't use -B, as it seemed
to be one of the *causes* of hangup/corruption problems. This was also
an issue on Thinkpads, and I think it was also noted in the ubuntu bug I
linked to earlier.

An additional problem is that the values for -B are not really
standardized, while the values for -S are.

--Bart
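
As an illustration of the point that -S values are standardized, here is
a minimal Python sketch that decodes an -S value into a timeout in
seconds. The encoding follows the description in the hdparm manual page;
the function name is made up for this example and is not part of hdparm
or laptop-mode-tools.

```python
def standby_timeout_seconds(value):
    """Decode an `hdparm -S` value into a standby timeout in seconds.

    Returns 0 when the timeout is disabled and None for the two values
    that have no fixed meaning (253 is vendor-defined, 254 is reserved).
    """
    if not 0 <= value <= 255:
        raise ValueError("-S takes a value between 0 and 255")
    if value == 0:
        return 0                          # 0 disables the standby timer
    if value <= 240:
        return value * 5                  # 1..240: multiples of 5 seconds
    if value <= 251:
        return (value - 240) * 30 * 60    # 241..251: units of 30 minutes
    if value == 252:
        return 21 * 60                    # 252: 21 minutes
    if value == 255:
        return 21 * 60 + 15               # 255: 21 minutes plus 15 seconds
    return None                           # 253 vendor-defined, 254 reserved
```

For example, `standby_timeout_seconds(12)` gives 60, i.e. a one-minute
spindown timeout. No comparable table can be written for -B, since the
meaning of its values differs between drives.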

2005-11-19 14:05:29

by Jan Niehusmann

Subject: Re: Laptop mode causing writes to wrong sectors?

On Sat, Nov 19, 2005 at 09:39:15AM +0100, Bart Samwel wrote:
> * A significant proportion of the people who *do* have trouble see
> messages about DMA timeouts. The problems do also occur on other
> hardware, but seem to be most pronounced on Thinkpad T40s. On those
> machines, the DMA timeout problems are triggered *especially* when the
> madwifi drivers are loaded (see
> http://bugzilla.ubuntu.com/show_bug.cgi?id=6108).

That's interesting. Both Bradley and I are using ipw2200, and in the
madwifi thread, one person also mentions he is using this driver. I
don't know if madwifi and ipw2200 use common or very similar code. But
perhaps this problem really is caused by a combination of laptop
mode / disk spinup and certain wireless drivers?

As far as I remember all corruptions I observed happened while being
connected to wireless lan. But that alone never triggered the bug, I had
to enable laptop mode as well.

Unfortunately, I still have to find a way to reliably trigger the
problem. None of my test scripts, which try to trigger disk activity
while the drive is spun down, has caused any corruption yet.

Jan
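
A minimal sketch of the kind of test discussed in this thread, in
Python: put the drive into standby with hdparm -y, write files with
known contents (which forces a spinup), and check them afterwards. This
is not Jan's actual test script; all function and file names here are
invented for the example, and the spin_down step needs root.

```python
import hashlib
import os
import subprocess


def spin_down(device):
    """Put the drive into standby mode with `hdparm -y` (needs root)."""
    subprocess.run(["hdparm", "-y", device], check=True)


def write_test_files(directory, count=8, size=4096):
    """Write `count` files of random data; return {path: sha256 hex digest}."""
    digests = {}
    for i in range(count):
        path = os.path.join(directory, "spintest-%03d.bin" % i)
        data = os.urandom(size)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the write out to the drive now
        digests[path] = hashlib.sha256(data).hexdigest()
    return digests


def verify_files(digests):
    """Return the paths whose contents no longer match the recorded digest."""
    bad = []
    for path, expected in digests.items():
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != expected:
                bad.append(path)
    return bad
```

One round would call spin_down() on the disk, then write_test_files(),
then verify_files(). Two limitations: the read-back may be served from
the page cache rather than from the disk, so a remount or reboot before
verifying gives a more honest result; and a write that lands on the
wrong sectors would damage some *other* file, which this check cannot
see unless the rest of the filesystem is checksummed as well.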


2005-11-19 15:30:27

by Jan Niehusmann

Subject: Re: Laptop mode causing writes to wrong sectors?

On Sat, Nov 19, 2005 at 03:05:28PM +0100, Jan Niehusmann wrote:
> That's interesting. Both Bradley and I are using ipw2200, and in the
> madwifi thread, one person also mentions he is using this driver. I
> don't know if madwifi and ipw2200 use common or very similar code. But
> perhaps this problem really is caused by a combination of laptop
> mode / disk spinup and certain wireless drivers?

Perhaps this one is related?

http://bughost.org/bugzilla/show_bug.cgi?id=821

If the corruption caused by this bug could lead to the filesystem
corruption I observed, it would match my observations quite well:

- Corruption started shortly after I upgraded to a kernel with
ipw2200 driver version 1.0.8
- Happens when using wireless
- File system corruption happens with laptop mode (because then the
probability that dirty pages which will be written to disk later can
be overwritten before the write actually happens is much higher)
- Problem is difficult to reproduce, as it's not deterministic which
type of data structure gets overwritten, and it can happen only once
per reboot (or driver reload)

Question is, could this bug cause filesystem corruption without any Oops
visible in the kernel log? Cc: to Zhu Yi at Intel - can you answer this
question?

Jan
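
The laptop mode argument in the list above can be put into a toy model.
The Python sketch below assumes a single stray write that hits a
uniformly random page, and pages being dirtied at a constant rate; both
are simplifications (the real bug writes at a fixed offset from a driver
allocation), and the numbers in the usage note are illustrative, not
measured.

```python
def dirty_page_exposure(dirty_rate, writeback_delay, total_pages):
    """Chance that one stray write to a random page hits a dirty page.

    dirty_rate      -- pages dirtied per second (assumed constant)
    writeback_delay -- seconds a page stays dirty before it is written out
    total_pages     -- number of pages of RAM

    In steady state about dirty_rate * writeback_delay pages are dirty,
    so the chance of hitting one of them is that count over total_pages.
    """
    dirty_pages = min(dirty_rate * writeback_delay, total_pages)
    return dirty_pages / total_pages
```

With 131072 pages (512 MB of 4k pages) and 2 pages dirtied per second,
a 30 second writeback delay gives a much smaller exposure than a 600
second one: in this model the chance grows linearly with the delay, so
holding dirty pages for ten minutes makes a hit twenty times as likely.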

2005-11-19 23:31:04

by Bart Samwel

Subject: Re: Laptop mode causing writes to wrong sectors?

Jan Niehusmann wrote:
> Question is, could this bug cause filesystem corruption without any Oops
> visible in the kernel log? Cc: to Zhu Yi at Intel - can you answer this
> question?

OK, can this bug overwrite a page containing filesystem metadata? The
way it looks to me, it can: it writes at some fixed distance after a
block of memory allocated by the driver, and that memory could probably
be anything.

Now, can the stuff go to disk without oopsing? In the case of ext3, the
metadata writeback is handled by the JBD layer, which is block-based and
doesn't care about the actual page contents AFAIK -- that's handled by
the ext3 filesystem layer. That means a corrupted metadata page can go
to disk without oopsing. Youch. :/

Remaining issue: this bug is only triggered when the ipw2200 driver does
firmware restarts, which generates kernel output "ipw2200: Firmware
error detected. Restarting". Jan, Bradley, do you see any of these
messages in your logs near the time of corruption? That should be within
10 minutes before it; the corruption may happen anywhere during a
spun-down period. (Or does the ipw2200 driver only show this message in
debug mode?)

--Bart
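
For anyone who wants to check their logs for this, here is a rough
Python sketch that looks for the firmware restart message in the ten
minutes before a given time. It assumes classic syslog lines that start
with a "Nov 19 14:02:11" style timestamp (no year, English month names);
the function name and the year parameter are made up for this example.

```python
from datetime import datetime, timedelta

# Substring of the ipw2200 message quoted in this thread.
MARKER = "Firmware error detected"


def restarts_before(log_lines, corruption_time,
                    window=timedelta(minutes=10), year=2005):
    """Return timestamps of firmware restarts within `window` before
    `corruption_time`. Syslog lines carry no year, so one is supplied."""
    hits = []
    for line in log_lines:
        if MARKER not in line:
            continue
        # The first 15 characters of a syslog line are the timestamp.
        stamp = datetime.strptime("%d %s" % (year, line[:15]),
                                  "%Y %b %d %H:%M:%S")
        if timedelta(0) <= corruption_time - stamp <= window:
            hits.append(stamp)
    return hits
```

The hard part remains knowing the corruption time to pass in, since the
damage is usually only noticed much later.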

2005-11-19 23:45:38

by Jan Niehusmann

Subject: Re: Laptop mode causing writes to wrong sectors?

On Sun, Nov 20, 2005 at 12:29:01AM +0100, Bart Samwel wrote:
> Remaining issue: this bug is only triggered when the ipw2200 driver does
> firmware restarts, which generates kernel output "ipw2200: Firmware
> error detected. Restarting". Jan, Bradley, do you see any of these
> messages in your logs near the time of corruption? That should be within

The "Firmware error detected" message occurs regularly; fortunately, on
my system these restarts cause no significant performance problem. If I understand
the code correctly, writes to the wrong memory location only happen at
the first firmware restart. (Unless certain debugging flags are set,
which isn't the case here.) So, basically, it happens exactly once after
each reboot, after a few minutes of network activity.

It's difficult to know the exact time when the filesystem corruption
happened, as it usually isn't noticed immediately.

Jan

2005-11-21 00:39:15

by Pavel Machek

Subject: Re: Laptop mode causing writes to wrong sectors?

Hi!

> I too have been experiencing some problems with laptop-mode that may
> be similar to what was recently reported here.
>
> I have a Centrino machine (Sager NP3760, aka Clevo M375E) with a 60GB
> Hitachi TravelStar hard disk running in UDMA5 and 512MB RAM, and on
> occasions I've had random files on my /usr partition overwritten and
> both my /usr and /var filesystems quite thoroughly trashed - with
> these events usually occurring right after I'd been on battery power
> and my hard disk had been spinning up and down regularly.
>
> All my filesystems are ext3 with journaling active, and none of them
> have been messed with (i.e. resized).

Can you try ext2? Relying on the journal when you are experiencing
corruption is a bad idea, anyway.

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2005-11-21 01:59:25

by Bill Davidsen

Subject: Re: Laptop mode causing writes to wrong sectors?

Pavel Machek wrote:

>Hi!
>
>
>
>>>Can you try some filesystem test while forcing disk spindowns via
>>>hdparm?
>>>
>>>It may be bug in laptop mode, or a bug in ide (or something
>>>related)... trying spindowns without laptopmode would be helpful.
>>>
>>>
>>>
>>I don't know if it would be helpful, but I run several servers with
>>multiple drives, usually 4-5, some of which are in RAID and some aren't,
>>and they all spin down and restart without problems many times a day.
>>The kernel is 2.6.14.? with one patch to get my unsupported VIA IDE working.
>>
>>My laptop also has a spindown (five min from memory) and I have yet to
>>have a problem with it. Don't know if any of that is "spindowns without
>>laptopmode" in a useful sense.
>>
>>
>
>Unless you can also reproduce the failure... no, probably does not help
>much.
>
>
No, that was really the point: even on multiple systems using spindown,
I have no failures. I see four possible causes:
1 - spindown
2 - laptop mode
3 - 1 + 2
4 - bad hardware, kernel not at fault

Since I have a lot of (1) data I thought eliminating that as a cause by
itself might be helpful.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979