2004-09-12 15:28:45

by Gene Heskett

Subject: journal aborted, system read-only

Greetings;

I just got up, and found advisories on every shell open that the
journal had encountered an error and aborted, converting my /
partition to read-only.

Rebooting was a mess of course, and it didn't take long for it to
report corruption in /dev/hda7, my / partition, and to drop me to a
shell for manual intervention if I knew my password.

An e2fsck /dev/hda7 reported problems with about a dozen inodes, and
essentially I stood on the y key, but once that was done, the reboot
was clean. I've no idea what's missing, if anything, at this point.

The kernel is 2.6.9-rc1-mm4. .config available on request.

I had been playing with amanda, essentially restarting it from scratch
each time as I played with a virtual-tapes-on-disk configuration on
that new 200GB disk. The target disk wasn't trashed that I know of,
but the amanda run was aborted due to the read-only nature of its
holding disk, which is a dir on /. Nothing precious was lost there
because I'll probably have to restart it from scratch again to clean
up the mess of an aborted run anyway.

But it is inconvenient to lose a day's experimental data.

FWIW, I have a *large* UPS, and my local electrical power supply
hasn't been that great over the last month, averaging around one
2-second power outage per day at random times that don't seem to be
connected with the weather. I mention this because the Bulldog
monitoring program throws up advisory windows on every screen
advising that an automatic shutdown will start in 5 minutes, and then
uses that same advisory window to report that power has been restored.

There was one such advisory window open on every X screen.

Checking the logs, there is of course nothing between the read-only
event and the reboot. From the log:
=========
Sep 12 04:54:58 coyote su(pam_unix)[17131]: session closed for user
news
=========
The test amanda run was started by cron at 4:55 AM, and I played a few
games of solitaire before going back to bed. Also, my nighttime
'burglar alarm' mode of the X-10 stuff was put back in daytime mode
at 5:00 AM
=========
Sep 12 05:00:00 coyote heyu_relay: interrupt received
Sep 12 05:00:01 coyote heyu_relay: relay setting up-
=========
I shut down solitaire and went back to bed
=========
Sep 12 05:20:56 coyote gconfd (root-14600): GConf server is not in
use, shutting down.
Sep 12 05:20:57 coyote gconfd (root-14600): Exiting
Sep 12 10:58:17 coyote syslogd 1.4.1: restart.
=========

This is precious little info to go on, but basically I'm wondering if
anyone else has encountered this?

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.26% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.


2004-09-13 15:20:08

by Stephen C. Tweedie

Subject: Re: journal aborted, system read-only

Hi,

On Sun, 2004-09-12 at 16:28, Gene Heskett wrote:

> I just got up, and found advisories on every shell open that the
> journal had encountered an error and aborted, converting my /
> partition to read-only.
...
> The kernel is 2.6.9-rc1-mm4. .config available on request.

> This is precious little info to go on, but basicly I'm wondering if
> anyone else has encountered this?

Well, we really need to see _what_ error the journal had encountered to
be able to even begin to diagnose it. But 2.6.9-rc1-mm3 and -mm4 had a
bug in the journaling introduced by low-latency work on the checkpoint
code; can you try -mm5 or back out
"journal_clean_checkpoint_list-latency-fix.patch" and try again?

Cheers,
Stephen


2004-09-14 01:14:39

by Gene Heskett

Subject: Re: journal aborted, system read-only

On Monday 13 September 2004 11:12, Stephen C. Tweedie wrote:
>Hi,
>
>On Sun, 2004-09-12 at 16:28, Gene Heskett wrote:
>> I just got up, and found advisories on every shell open that the
>> journal had encountered an error and aborted, converting my /
>> partition to read-only.
>
>...
>
>> The kernel is 2.6.9-rc1-mm4. .config available on request.
>>
>> This is precious little info to go on, but basicly I'm wondering
>> if anyone else has encountered this?
>
>Well, we really need to see _what_ error the journal had encountered
> to be able to even begin to diagnose it. But 2.6.9-rc1-mm3 and
> -mm4 had a bug in the journaling introduced by low-latency work on
> the checkpoint code; can you try -mm5 or back out
>"journal_clean_checkpoint_list-latency-fix.patch" and try again?

Yes, I can try rc1-mm5, which I grabbed this morning. I also have -rc2
coming in right now, but from the messages I see so far this evening,
I'm beginning to think it's a 'to be skipped' version.

FWIW, I didn't have a problem last night during the amanda run, I'd
moved the run time back to 05 00 * * *. The one that barfed was
triggered at 55 4 * * * in cron-speak, and was a full level 0 on
everything as I'd nuked the data and restarted it from day 1.

>Cheers,
> Stephen

--
Cheers, Gene

2004-09-14 03:49:47

by Gene Heskett

Subject: Re: journal aborted, system read-only

On Monday 13 September 2004 11:12, Stephen C. Tweedie wrote:
>Hi,
>
>On Sun, 2004-09-12 at 16:28, Gene Heskett wrote:
>> I just got up, and found advisories on every shell open that the
>> journal had encountered an error and aborted, converting my /
>> partition to read-only.
>
>...
>
>> The kernel is 2.6.9-rc1-mm4. .config available on request.
>>
>> This is precious little info to go on, but basicly I'm wondering
>> if anyone else has encountered this?
>
>Well, we really need to see _what_ error the journal had encountered
> to be able to even begin to diagnose it. But 2.6.9-rc1-mm3 and
> -mm4 had a bug in the journaling introduced by low-latency work on
> the checkpoint code; can you try -mm5 or back out
>"journal_clean_checkpoint_list-latency-fix.patch" and try again?

Since -mm5 killed my usb2.0 stuffs, (all my printers disappeared) I'm
now building -mm4 after reverting this patch.

This must be a fairly rare occurrence in the real world; it has not
recurred. (yet, gotta keep Murphy happy you know) :-)

>Cheers,
> Stephen

--
Cheers & thanks Stephen, Gene

2004-09-14 09:38:56

by Stephen C. Tweedie

Subject: Re: journal aborted, system read-only

Hi,

On Tue, 2004-09-14 at 04:37, Gene Heskett wrote:

> Since -mm5 killed my usb2.0 stuffs, (all my printers disappeared) I'm
> now building -mm4 after reverting this patch.

OK, thanks for testing it.

--Stephen

2004-09-14 11:04:24

by Gene Heskett

Subject: Re: journal aborted, system read-only

On Tuesday 14 September 2004 05:37, Stephen C. Tweedie wrote:
>Hi,
>
>On Tue, 2004-09-14 at 04:37, Gene Heskett wrote:
>> Since -mm5 killed my usb2.0 stuffs, (all my printers disappeared)
>> I'm now building -mm4 after reverting this patch.
>
>OK, thanks for testing it.
>
>--Stephen

And I assume it worked, Stephen; it ran on it long enough to build the
-mm5 patch that fixed the borked hi-speed usb.

I have a samba problem, my rh7.3 firewall no longer smbmounts this FC2
box. Are you still doing samba?

--
Cheers, Gene

2004-09-16 05:03:41

by Gene Heskett

Subject: Re: journal aborted, system read-only

On Monday 13 September 2004 11:12, Stephen C. Tweedie wrote:
>Hi,
>
>On Sun, 2004-09-12 at 16:28, Gene Heskett wrote:
>> I just got up, and found advisories on every shell open that the
>> journal had encountered an error and aborted, converting my /
>> partition to read-only.
>
>...
>
>> The kernel is 2.6.9-rc1-mm4. .config available on request.
>>
>> This is precious little info to go on, but basicly I'm wondering
>> if anyone else has encountered this?
>
>Well, we really need to see _what_ error the journal had encountered
> to be able to even begin to diagnose it. But 2.6.9-rc1-mm3 and
> -mm4 had a bug in the journaling introduced by low-latency work on
> the checkpoint code; can you try -mm5 or back out
>"journal_clean_checkpoint_list-latency-fix.patch" and try again?
>
It just did it to me again, this time with 2.6.9-rc1-mm5.

This seems to coincide with the system being busier than that famous
cat on the equally famous tin roof as far as disk traffic is
concerned. This time amanda was running which makes the drives work
up a sweat, and I was trying to get checkinstall to install
xorg.6.8.1 that I had just built, so it was moving about 55 megs of
files around when things went splat.

So that run of amanda is kaput, and I have a mess to clean up
in /var/tmp and /usr/src/X6.8.1 from checkinstall.

And as usual in these cases, the logs are spotlessly clean
because /var is on /, which is on /dev/hda7, and syslog couldn't write
when it's read-only.

Has anyone any ideas?

--
Cheers, Gene

2004-09-16 10:50:05

by Stephen C. Tweedie

Subject: Re: journal aborted, system read-only

Hi,

On Thu, 2004-09-16 at 06:03, Gene Heskett wrote:

> >Well, we really need to see _what_ error the journal had encountered
...
> It just did it to me again, this time with 2.6.9-rc1-mm5.

> And as usual in these cases, the logs are spotlessly clean
> because /var is on /, which is on /dev/hda7, an syslog couldn't write
> when its read-only.

Possibility the first is to create a separate partition for /var;
possibility the second is to set up a serial console. Without access to
that log information, all we know is "there was an IO error," and that's
really not enough to narrow down the search. :-)

Thanks,
Stephen

2004-09-16 13:15:38

by Valdis Klētnieks

Subject: Re: journal aborted, system read-only

On Mon, 13 Sep 2004 16:12:59 BST, "Stephen C. Tweedie" said:

> Well, we really need to see _what_ error the journal had encountered to
> be able to even begin to diagnose it. But 2.6.9-rc1-mm3 and -mm4 had a
> bug in the journaling introduced by low-latency work on the checkpoint
> code; can you try -mm5 or back out
> "journal_clean_checkpoint_list-latency-fix.patch" and try again?

I just got bit by the 'journal aborted' problem under -rc1-mm5, so it
looks like that particular bug wasn't at fault here (also, I started seeing
the problem under -mm2, so that's another point against that theory...)

Sep 16 01:29:05 turing-police kernel: journal_bmap: journal block not found at offset 5132 on dm-8
Sep 16 01:29:05 turing-police kernel: Aborting journal on device dm-8.
Sep 16 01:29:05 turing-police kernel: ext3_abort called.

This happened about 4 minutes into a 'tar cf - | (cd && tar xf -)' pipeline
to clone a work copy of the -rc1-mm5 source tree (it got about 408M through the
543M before it blew up)....



2004-09-16 13:22:04

by Gene Heskett

Subject: Re: journal aborted, system read-only

On Thursday 16 September 2004 06:48, Stephen C. Tweedie wrote:
>Hi,
>
>On Thu, 2004-09-16 at 06:03, Gene Heskett wrote:
>> >Well, we really need to see _what_ error the journal had
>> > encountered
>
>...
>
>> It just did it to me again, this time with 2.6.9-rc1-mm5.
>>
>> And as usual in these cases, the logs are spotlessly clean
>> because /var is on /, which is on /dev/hda7, an syslog couldn't
>> write when its read-only.
>
>Possibility the first is to create a separate partition for /var;

That's now been done, but not w/o a minor disaster & an extra hour
sorting out something heyu seems to have done. NDI when, but its log
output in /var/tmp has been renamed from heyu.out to heyu.outttyS1,
and that's why xtend has been getting a tummy ache.

I did have 2 partitions on that 200Gigger, one an accidentally
way-too-big 16GB swap and the rest as /amandatapes. The minor disaster
was that I didn't wait till I had rebooted before I ran a
mke2fs -j /dev/hdd2 (the new /var). The amanda usage partition was
left exactly the same, but the kernel was still running on the old
partition table, so it formatted the amanda partition. My bad...
So amanda is back to square one tonight, but thinking it has a week's
backups to count on. But with a 7 day dumpcycle, it will be caught up
in a week if I expand the tapetypes set size to 60GB or so till it
gets in balance.

Anyway, I now have a 15GB /var to record this crap in.

>possibility the second is to set up a serial console.

Both of my serial ports are busy; one is watching the ups, and the
other is running x10 stuffs.

So we'll have to take our chances that we can catch it in the logs.
There was a single 'driver ready seek not complete' message in the
log several days ago according to logwatch. It's a roughly year-old
120GB Maxtor, and smartd is watching both of them now without sending
me any telegrams (so far, that knocking sound is me, knocking on wood).

>Without
> access to that log information, all we know is "there was an IO
> error," and that's really not enough to narrow down the search. :-)
>
>Thanks,
> Stephen

Anyway, now we wait, except I'm going to fire off the initial amdump
right now after telling it there is enough space on its 'tape' to do
a level 0 on everything. That might be interesting in itself.

--
Cheers, Gene
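
On the partition-table mishap described in this message: Linux does offer a way to ask the kernel to re-read a disk's partition table without rebooting, the BLKRRPART ioctl. The following is only a minimal sketch; the function name reread_partition_table is hypothetical, and the call is expected to fail (typically with EBUSY) while any partition on that disk is in use, for example mounted, in which case a reboot remains the safe route before running mke2fs.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKRRPART */

/* Ask the kernel to re-read the partition table of a whole-disk
 * device node (e.g. "/dev/hdd", not "/dev/hdd2").
 * Returns 0 on success, -1 on failure with errno set by open()/ioctl(). */
int reread_partition_table(const char *disk)
{
    int fd = open(disk, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = ioctl(fd, BLKRRPART);
    int saved_errno = errno;   /* close() may overwrite errno */

    close(fd);
    errno = saved_errno;
    return rc == 0 ? 0 : -1;
}
```

A caller would run this as root against the whole-disk node and only go on to format a new partition if it returns 0.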

2004-09-16 13:40:22

by Gene Heskett

Subject: Re: journal aborted, system read-only

On Thursday 16 September 2004 02:34, [email protected] wrote:
>On Mon, 13 Sep 2004 16:12:59 BST, "Stephen C. Tweedie" said:
>> Well, we really need to see _what_ error the journal had
>> encountered to be able to even begin to diagnose it. But
>> 2.6.9-rc1-mm3 and -mm4 had a bug in the journaling introduced by
>> low-latency work on the checkpoint code; can you try -mm5 or back
>> out
>> "journal_clean_checkpoint_list-latency-fix.patch" and try again?
>
>I just got bit by the 'journal aborted' problem under -rc1-mm5, so
> it looks like that particular bug wasn't at fault here (also, I
> started seeing the problem under -mm2, so that's another point
> against that theory...)
>
Thanks Valdis, now I don't feel quite so lonely in this camp. :-)

[...]

>This happened about 4 minutes into a 'tar cf - | (cd && tar xf -)'
> pipeline to clone a work copy of the -rc1-mm5 source tree (it got
> about 408M through the 543M before it blew up)....

Humm, it happened to me while amdump was running, and amdump uses tar.
My tar version is 1.13-25.

--
Cheers, Gene

2004-09-17 16:35:34

by Valdis Klētnieks

Subject: Re: journal aborted, system read-only

On Thu, 16 Sep 2004 09:36:01 EDT, Gene Heskett said:

> >This happened about 4 minutes into a 'tar cf - | (cd && tar xf -)'
> > pipeline to clone a work copy of the -rc1-mm5 source tree (it got
> > about 408M through the 543M before it blew up)....
>
> Humm, it happened to me while amdump was running, and amdump uses tar.
> My tar version is 1.13-25.

I don't think "tar" is anything more than an enabler here - it's just that on
my laptop it's one of the more abusive things I can do to the file system (especially
when source and dest directories are on the same file system). I've had the problem
pop up while reading down my e-mail, which is another "lots of little files" scenario
(500+ lines of procmailrc, passing stuff to/from spamassassin, and storing in the
MH "one message per file" format)....

I'm about to start building -rc2-mm1 - I'm probably going to liberally sprinkle some
strategic printk's so we have a chance of flushing out why it's failing...



2004-09-17 17:19:48

by Gene Heskett

Subject: Re: journal aborted, system read-only

On Friday 17 September 2004 12:30, [email protected] wrote:
>On Thu, 16 Sep 2004 09:36:01 EDT, Gene Heskett said:
>> >This happened about 4 minutes into a 'tar cf - | (cd && tar xf
>> > -)' pipeline to clone a work copy of the -rc1-mm5 source tree
>> > (it got about 408M through the 543M before it blew up)....
>>
>> Humm, it happened to me while amdump was running, and amdump uses
>> tar. My tar version is 1.13-25.
>
>I don't think "tar" is anything more than an enabler here - it's
> just that on my laptop it's one of the more abusive things I can do
> to the file system (especially when source and dest directories are
> on the same file system). I've had the problem pop up while
> reading down my e-mail, which is another "lots of little files"
> scenario (500+ lines of procmailrc, passing stuff to/from
> spamassassin, and storing in the MH "one message per file"
> format)....

That's the same conclusion I've since come to, Valdis. And, now that
I've moved my /var to its own partition on another disk, that may
have reduced the thrashing of that drive enough to stop the problem.
At least I've had no more such occurrences since I moved /var off
of /. About 30.5 hours now according to the uptime display in
gkrellm.

>I'm about to start building -rc2-mm1 - I'm probably going to
> liberally sprinkle some strategic printk's so we have a chance of
> flushing out why it's failing...

I've been thinking about it too, but I found a bug in one of the
amanda utilities that's had my attention the last day or so & just
fixed in the last half hour. If you can make a patch out of those
printk's against rc2-mm1, send it along and I'll taste-test for the
arsenic in the dessert too maybe. The More Eyeballs theory & all that
rot.

--
Cheers, Gene

2004-09-18 09:55:07

by Martin Diehl

Subject: Re: journal aborted, system read-only

On Fri, 17 Sep 2004, Gene Heskett wrote:

> Thats the same conclusion I've since come to, Valdis. And, now that
> I've moved my /var to its own partition on another disk, that may
> have reduced the thrashing of that drive enough to stop the problem.
> At least I've had no more such occurances since I moved /var off
> of /. About 30.5 hours now according to the uptime display in
> gkrellm.

I've just experienced a very similar situation: I/O error, journal abort,
partition went read-only. Maybe it's completely unrelated, but by chance,
do you have some clipping jumper set on this disk so you depend on the
ide-stroke feature?

In my case this was the culprit. It's a 120GB disk in a box with an old
crashing Award BIOS, so I've clipped it to 32GB. However, I'd missed the
fact that since about 2.6.7 one has to set the hda=stroke kernel parameter
in order to get the host protected area disabled. Therefore partitions
starting beyond 32GB failed to mount.

In contrast, the partition which crosses the 32GB barrier was mounted
successfully, but later when trying to access beyond the limit I was seeing:

ide0: reset: success
hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hda: read_intr: error=0x10 { SectorIdNotFound }, LBAsect=110291109, high=6, low=9627813, sector=110291109
ide: failed opcode was: unknown
end_request: I/O error, dev hda, sector 110291109
EXT3-fs error (device hda7): ext3_get_inode_loc: unable to read inode block - inode=3883009, block=7766022
Aborting journal on device hda7.
ext3_abort called.
EXT3-fs error (device hda7): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only

Adding the hda=stroke kernel parameter solved the issue - a forced fsck
showed no damage to the fs.

HTH,
Martin
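
The numbers in that read_intr error line can be checked by hand. As a rough sketch in plain user-space C (the helper names are hypothetical, and a 512-byte sector size is assumed), the high and low fields recombine into the reported LBA, and the resulting byte offset lands well past a 32GB clip:

```c
#include <stdint.h>

#define SECTOR_BYTES 512u   /* assumption: classic 512-byte IDE sectors */

/* Recombine the split fields from the log line: `low` holds the bottom
 * 24 bits of the sector number, `high` the bits above them. */
uint64_t lba_from_parts(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 24) | low;
}

/* Byte offset of a sector, counted from the start of the disk. */
uint64_t sector_to_bytes(uint64_t sector)
{
    return sector * SECTOR_BYTES;
}

/* Nonzero if the sector lies at or beyond a clip limit given in GiB. */
int beyond_clip(uint64_t sector, unsigned clip_gib)
{
    return sector_to_bytes(sector) >= ((uint64_t)clip_gib << 30);
}
```

With the values from the log, lba_from_parts(6, 9627813) gives 110291109, which matches LBAsect and sits roughly 56.5 GB into the disk, so a drive clipped to 32GB has no such sector to offer.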

2004-09-21 08:55:51

by Andrew Morton

Subject: Re: journal aborted, system read-only

"Stephen C. Tweedie" <[email protected]> wrote:
>
> Hi,
>
> On Sun, 2004-09-12 at 16:28, Gene Heskett wrote:
>
> > I just got up, and found advisories on every shell open that the
> > journal had encountered an error and aborted, converting my /
> > partition to read-only.
> ...
> > The kernel is 2.6.9-rc1-mm4. .config available on request.
>
> > This is precious little info to go on, but basicly I'm wondering if
> > anyone else has encountered this?
>
> Well, we really need to see _what_ error the journal had encountered to
> be able to even begin to diagnose it. But 2.6.9-rc1-mm3 and -mm4 had a
> bug in the journaling introduced by low-latency work on the checkpoint
> code; can you try -mm5 or back out
> "journal_clean_checkpoint_list-latency-fix.patch" and try again?
>

Turns out this is due to the reworked buffer/page sleep/wakeup code in
recent -mm's. If the journal timer wakes kjournald while kjournald is
waiting on a read of a journal indirect block, kjournald just plunges ahead
with a still-locked, non-uptodate buffer. Which it treats as an I/O error,
and things don't improve from there.

This should fix.

--- 25/kernel/wait.c~wait_on_bit-must-loop 2004-09-21 01:33:18.000000000 -0700
+++ 25-akpm/kernel/wait.c 2004-09-21 01:44:36.706435616 -0700
@@ -157,8 +157,9 @@ __wait_on_bit(wait_queue_head_t *wq, str
 	int ret = 0;

 	prepare_to_wait(wq, &q->wait, mode);
-	if (test_bit(q->key.bit_nr, q->key.flags))
+	do {
 		ret = (*action)(q->key.flags);
+	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
 	finish_wait(wq, &q->wait);
 	return ret;
 }
_

2004-09-21 09:56:09

by Andrew Morton

Subject: Re: journal aborted, system read-only

Andrew Morton <[email protected]> wrote:
>
> This should fix.

scrub that, it hangs. Third time lucky.

--- 25/kernel/wait.c~wait_on_bit-must-loop 2004-09-21 01:57:14.000000000 -0700
+++ 25-akpm/kernel/wait.c 2004-09-21 02:48:18.596420024 -0700
@@ -157,7 +157,7 @@ __wait_on_bit(wait_queue_head_t *wq, str
 	int ret = 0;

 	prepare_to_wait(wq, &q->wait, mode);
-	if (test_bit(q->key.bit_nr, q->key.flags))
+	while (test_bit(q->key.bit_nr, q->key.flags) && !ret)
 		ret = (*action)(q->key.flags);
 	finish_wait(wq, &q->wait);
 	return ret;
_
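
To illustrate the difference between the three variants of the wait, here is a minimal user-space sketch. It is not kernel code: fake_buffer, fake_sleep and the three wait_* functions are hypothetical stand-ins, with a boolean playing the role of the locked bit and each call to fake_sleep modelling one wakeup, whether spurious (a timer, say) or real (the I/O completing).

```c
#include <stdbool.h>

/* Hypothetical stand-in for a locked buffer. */
struct fake_buffer {
    bool locked;               /* plays the role of the bit waited on */
    int  wakeups_until_unlock; /* wakeups needed before I/O completes */
    int  sleeps;               /* how many times the waiter slept */
};

/* Stand-in for the `action` callback: sleep once, then get woken.
 * Only the last wakeup clears the lock; earlier ones are spurious. */
int fake_sleep(struct fake_buffer *b)
{
    b->sleeps++;
    if (--b->wakeups_until_unlock <= 0)
        b->locked = false;
    return 0;
}

/* Shape of the code before the fix: sleep at most once. */
int wait_once(struct fake_buffer *b)
{
    int ret = 0;
    if (b->locked)
        ret = fake_sleep(b);
    return ret;
}

/* Shape of the first patch: always sleeps at least once. */
int wait_do_while(struct fake_buffer *b)
{
    int ret = 0;
    do {
        ret = fake_sleep(b);
    } while (b->locked && !ret);
    return ret;
}

/* Shape of the second patch: sleep only while locked and no error. */
int wait_loop(struct fake_buffer *b)
{
    int ret = 0;
    while (b->locked && !ret)
        ret = fake_sleep(b);
    return ret;
}
```

With a buffer that needs two wakeups before it unlocks, wait_once returns while the buffer is still locked, which is the state kjournald then treated as an I/O error, whereas wait_loop returns only after the unlock. The do/while form sleeps once even when the bit is already clear, which is one plausible reading of why the first patch hung.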

2004-09-21 14:14:42

by Gene Heskett

Subject: Re: journal aborted, system read-only

On Tuesday 21 September 2004 04:50, Andrew Morton wrote:
>"Stephen C. Tweedie" <[email protected]> wrote:
>> Hi,
>>
>> On Sun, 2004-09-12 at 16:28, Gene Heskett wrote:
>> > I just got up, and found advisories on every shell open that the
>> > journal had encountered an error and aborted, converting my /
>> > partition to read-only.
>>
>> ...
>>
>> > The kernel is 2.6.9-rc1-mm4. .config available on request.
>> >
>> > This is precious little info to go on, but basicly I'm wondering
>> > if anyone else has encountered this?
>>
>> Well, we really need to see _what_ error the journal had
>> encountered to be able to even begin to diagnose it. But
>> 2.6.9-rc1-mm3 and -mm4 had a bug in the journaling introduced by
>> low-latency work on the checkpoint code; can you try -mm5 or back
>> out
>> "journal_clean_checkpoint_list-latency-fix.patch" and try again?
>
>Turns out this is due to the reworked buffer/page sleep/wakeup code
> in recent -mm's. If the journal timer wakes kjournald while
> kjournald is waiting on a read of a journal indirect block,
> kjournald just plunges ahead with a still-locked, non-uptodate
> buffer. Which it treats as an I/O error, and things don't improve
> from there.

Classic understatement :)

That said, I've not had a repeat of this scene since I moved /var to a
different drive a week or thereabouts ago. OTOH, this will be
building in a couple of minutes too, many thanks.

>This should fix.
>
>--- 25/kernel/wait.c~wait_on_bit-must-loop 2004-09-21
> 01:33:18.000000000 -0700 +++ 25-akpm/kernel/wait.c 2004-09-21
> 01:44:36.706435616 -0700 @@ -157,8 +157,9 @@
> __wait_on_bit(wait_queue_head_t *wq, str int ret = 0;
>
> prepare_to_wait(wq, &q->wait, mode);
>- if (test_bit(q->key.bit_nr, q->key.flags))
>+ do {
> ret = (*action)(q->key.flags);
>+ } while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
> finish_wait(wq, &q->wait);
> return ret;
> }
>_

--
Cheers, Gene

2004-09-21 14:37:22

by Gene Heskett

Subject: Re: journal aborted, system read-only

On Tuesday 21 September 2004 05:53, Andrew Morton wrote:
>Andrew Morton <[email protected]> wrote:
>> This should fix.
>
>scrub that, it hangs. Third time lucky.
>
Stopped "makeit", patched this up by hand, started again. Thanks
Andrew.

>--- 25/kernel/wait.c~wait_on_bit-must-loop 2004-09-21
> 01:57:14.000000000 -0700 +++ 25-akpm/kernel/wait.c 2004-09-21
> 02:48:18.596420024 -0700 @@ -157,7 +157,7 @@
> __wait_on_bit(wait_queue_head_t *wq, str int ret = 0;
>
> prepare_to_wait(wq, &q->wait, mode);
>- if (test_bit(q->key.bit_nr, q->key.flags))
>+ while (test_bit(q->key.bit_nr, q->key.flags) && !ret)
> ret = (*action)(q->key.flags);
> finish_wait(wq, &q->wait);
> return ret;
>_

--
Cheers, Gene

2004-09-21 15:00:32

by Gene Heskett

Subject: Re: journal aborted, system read-only

On Tuesday 21 September 2004 10:37, Gene Heskett wrote:
>On Tuesday 21 September 2004 05:53, Andrew Morton wrote:
>>Andrew Morton <[email protected]> wrote:
>>> This should fix.
>>
>>scrub that, it hangs. Third time lucky.
>
>Stopped "makeit", patch this up by hand, started again. Thanks
>Andrew.

And it would appear to be working, I'm booted to it now. Now we wait,
but so far I haven't heard the first shoe drop. :-)

--
Cheers, Gene