2004-09-02 18:16:08

by Valdis Klētnieks

[permalink] [raw]
Subject: 2.6.9-rc1-mm2 - 'journal block not found' - ext3 on crack?

I built 2.6.9-rc1-mm2 last night, and I've had this happen on 2 separate file systems today:

Sep 2 12:42:50 turing-police kernel: journal_bmap: journal block not found at offset 2316 on dm-6
Sep 2 12:42:52 turing-police kernel: Aborting journal on device dm-6.
Sep 2 12:42:52 turing-police kernel: ext3_abort called.
Sep 2 12:42:53 turing-police kernel: EXT3-fs error (device dm-6): ext3_journal_start: Detected aborted journal
Sep 2 12:42:53 turing-police kernel: Remounting filesystem read-only
Sep 2 12:42:54 turing-police kernel: EXT3-fs error (device dm-6) in start_transaction: Journal has aborted
Sep 2 12:42:54 turing-police kernel: EXT3-fs error (device dm-6) in start_transaction: Journal has aborted
Sep 2 12:42:54 turing-police kernel: journal commit I/O error
Sep 2 12:42:54 turing-police kernel: EXT3-fs error (device dm-6) in start_transaction: Journal has aborted
Sep 2 12:42:57 turing-police last message repeated 15 times

(This one was /home - I'd paste the other one, but it happened on /var so
it didn't get logged because /var/adm/messages went R/O..)

Of interest:

1) Didn't see this under 2.6.8-mm4.
2) Neither time had any actual disk I/O error messages...
3) I'm using ext3-on-LVM, if that matters...


Attachments:
(No filename) (226.00 B)

2004-09-03 23:44:44

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.9-rc1-mm2 - 'journal block not found' - ext3 on crack?

[email protected] wrote:
>
> I built 2.6.9-rc1-mm2 last night, and I've had this happen on 2 separate file systems today:
>
> Sep 2 12:42:50 turing-police kernel: journal_bmap: journal block not found at offset 2316 on dm-6
> Sep 2 12:42:52 turing-police kernel: Aborting journal on device dm-6.
> Sep 2 12:42:52 turing-police kernel: ext3_abort called.
> Sep 2 12:42:53 turing-police kernel: EXT3-fs error (device dm-6): ext3_journal_start: Detected aborted journal
> Sep 2 12:42:53 turing-police kernel: Remounting filesystem read-only
> Sep 2 12:42:54 turing-police kernel: EXT3-fs error (device dm-6) in start_transaction: Journal has aborted
> Sep 2 12:42:54 turing-police kernel: EXT3-fs error (device dm-6) in start_transaction: Journal has aborted
> Sep 2 12:42:54 turing-police kernel: journal commit I/O error
> Sep 2 12:42:54 turing-police kernel: EXT3-fs error (device dm-6) in start_transaction: Journal has aborted
> Sep 2 12:42:57 turing-police last message repeated 15 times
>
> (This one was /home - I'd paste the other one, but it happened on /var so
> it didn't get logged because /var/adm/messages went R/O..)

Probably caused by an I/O error (or data loss) performing metadata reads.

> Of interest:
>
> 1) Didn't see this under 2.6.8-mm4.
> 2) Neither time had any actual disk I/O error messages...
> 3) I'm using ext3-on-LVM, if that matters...

Let me guess: raid5?

We've had a steady stream of heisenbugreports for ext3-on-raid5
for several years.

2004-09-05 12:01:22

by Karl Vogel

[permalink] [raw]
Subject: Re: 2.6.9-rc1-mm2 - 'journal block not found' - ext3 on crack?

Andrew Morton <[email protected]> writes:

> [email protected] wrote:
>>
>> I built 2.6.9-rc1-mm2 last night, and I've had this happen on 2 separate file systems today:
>>
>> Sep 2 12:42:54 turing-police kernel: EXT3-fs error (device dm-6) in start_transaction: Journal has aborted
>
> Probably caused by an I/O error (or data loss) performing metadata reads.
>
>> 3) I'm using ext3-on-LVM, if that matters...
>
> Let me guess: raid5?
>
> We've had a steady stream of heisenbugreports for ext3-on-raid5
> for several years.

For the record, I just got a journal aborted on 2.6.9-rc1-mm3.
I'm also using ext3-on-LVM (no RAID stuff involved).

I'm afraid that I don't have more info.

2004-09-14 18:28:56

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: 2.6.9-rc1-mm2 - 'journal block not found' - ext3 on crack?

On Fri, 03 Sep 2004 16:48:24 PDT, Andrew Morton said:

(Sorry for the delay in following up...)

> > Sep 2 12:42:50 turing-police kernel: journal_bmap: journal block not found
at offset 2316 on dm-6
> > Sep 2 12:42:52 turing-police kernel: Aborting journal on device dm-6.

> Probably caused by an I/O error (or data loss) performing metadata reads.

I'd be willing to buy that, except that it's been hitting me off-and-on and
there's NEVER a dmesg entry indicating an actual I/O problem. Also, I never
hit it against 2.6.8.1-mm4 - I've only seen it against -rc1-mm2 and -rc1-mm4
(didn't try mm1 or mm3). I'm always leery of silent "must be I/O error"
problems that show up when you upgrade the software.. ;)

Here's one against -mm4 last night:

Sep 12 03:00:03 turing-police kernel: audit(1094972403.380:0): avc: denied { search } for pid=14808 ex
e=/usr/bin/whereis name=man dev=dm-1 ino=18869 scontext=user_u:user_r:user_crond_t tcontext=system_u:obje
ct_r:man_t tclass=dir
Sep 12 03:37:04 turing-police kernel: journal_bmap: journal block not found at offset 12 on dm-6
Sep 12 03:37:04 turing-police kernel: Aborting journal on device dm-6.
Sep 12 03:37:04 turing-police kernel: __journal_remove_journal_head: freeing b_committed_data
Sep 12 03:37:04 turing-police last message repeated 4 times
Sep 12 03:37:07 turing-police kernel: ext3_abort called.
Sep 12 03:37:07 turing-police kernel: EXT3-fs error (device dm-6): ext3_journal_start: Detected aborted journal
Sep 12 03:37:07 turing-police kernel: Remounting filesystem read-only
Sep 12 03:37:07 turing-police kernel: EXT3-fs error (device dm-6) in start_transaction: Journal has aborted


Absolutely nothing out of the kernel for 37 minutes straight, till suddenly it
decides that /home has a problem (Yes, there *was* plenty activity on the disk
at the time - the usual 3AM cron-based disk thrashers, and I was trying to
catch up on lkml, hit 'commit' for a bunch of pending 'delete messages', and
things just sort of sat there because the file system just evaporated out from
under it).

> > Of interest:
> >
> > 1) Didn't see this under 2.6.8-mm4.
> > 2) Neither time had any actual disk I/O error messages...
> > 3) I'm using ext3-on-LVM, if that matters...
>
> Let me guess: raid5?

Nope. ext3, lvm, and a single laptop IDE disk under it all....

journal_bmap() appears to eventually end up down in ext3_get_branch(). Is it
possible that there's some weird locking/race issue where we end up calling
__bread() for the block, but get a bh that has wrong/bogus uptodate/dirty
flags? (this only seems to bite under high (for a laptop) I/O - while updatedb
or tripwire are running in background, or reading down my mail in the morning,
or doing a large compile/make, etc) Is it possible for the journal to wrap
around and we come back to an "old" page that *still* hasn't made it out to
disk due to LVM/elevator/etc issues (even though enough other I/O requests have
been retired to keep things happy?)


Attachments:
(No filename) (226.00 B)