2004-03-04 06:50:40

by Daniel Fenert

Subject: Is there some bug in ext3 in 2.4.25?

Message from syslogd@lazy at Thu Mar 4 08:31:58 2004 ...
lazy kernel: Assertion failure in __journal_drop_transaction() at
checkpoint.c:587: "transaction->t_ilist == NULL"

Networking still works. I tried to log in, but no luck.
I had one ssh console open and tried to reboot, but nothing happened; it
looks like it lost connection with hda :(
Where should I look for the reason?
The machine is far away, and this is the second or third time it has hung
mysteriously; the only difference is that this time I've got some console output.

--
Daniel Fenert --==> [email protected] <==--
==-P o w e r e d--b y--S l a c k w a r e-=-ICQ #37739641-==
=======- http://daniel.fenert.net/ -=======< +48604628083 >


2004-03-04 07:03:31

by Daniel Fenert

Subject: Re: Is there some bug in ext3 in 2.4.25?

On Thu, Mar 04, 2004 at 07:50:38AM +0100, Daniel Fenert wrote:
>Message from syslogd@lazy at Thu Mar 4 08:31:58 2004 ...
>lazy kernel: Assertion failure in __journal_drop_transaction() at
>checkpoint.c:587: "transaction->t_ilist == NULL"

One more thing - it happened when /var got full.

--
Daniel Fenert --==> [email protected] <==--
==-P o w e r e d--b y--S l a c k w a r e-=-ICQ #37739641-==
Absurdity: a belief inconsistent with your own views - [Ambrose Bierce]
=======- http://daniel.fenert.net/ -=======< +48604628083 >

2004-03-05 14:07:30

by Marcelo Tosatti

Subject: Re: Is there some bug in ext3 in 2.4.25?


Hi,

This sounds like memory corruption (which could be caused by a misbehaving
driver or by flaky hardware) because transaction->t_ilist is not used at
all by the kernel code. Did this box run stable with other kernels?

I found a similar report from Michelle (CCed), which can be found at:
http://marc.theaimsgroup.com/?l=linux-kernel&m=107529754608448&w=2

Searching a bit more, I found another message from Michelle with
topic "[SOLVED] Kernel-Bug (at checkpoint.c 587)"
http://lists.debian.org/debian-user-german/2004/debian-user-german-200401/msg04404.html

Unfortunately the said message is in German, which I can't understand.
Michelle, can you clarify it for me?

Stephen, Andrew, any idea how transaction->t_ilist can become non-NULL?


On Thu, 4 Mar 2004, Daniel Fenert wrote:

> Message from syslogd@lazy at Thu Mar 4 08:31:58 2004 ...
> lazy kernel: Assertion failure in __journal_drop_transaction() at
> checkpoint.c:587: "transaction->t_ilist == NULL"
>
> Networking still works. I tried to log in, but no luck.
> I had one ssh console open and tried to reboot, but nothing happened; it
> looks like it lost connection with hda :(
> Where should I look for the reason?
> The machine is far away, and this is the second or third time it has hung
> mysteriously; the only difference is that this time I've got some console output.
>

>From [email protected] Fri Mar 5 10:48:26 2004
Date: Thu, 4 Mar 2004 08:03:29 +0100
From: Daniel Fenert <[email protected]>
To: [email protected]
Subject: Re: Is there some bug in ext3 in 2.4.25?

>Message from syslogd@lazy at Thu Mar 4 08:31:58 2004 ...
>lazy kernel: Assertion failure in __journal_drop_transaction() at
>checkpoint.c:587: "transaction->t_ilist == NULL"

One more thing - it has happened when /var got full.


2004-03-05 14:15:36

by Michael Frank

Subject: Re: Is there some bug in ext3 in 2.4.25?

On Fri, 5 Mar 2004 11:06:02 -0300 (BRT), Marcelo Tosatti <[email protected]> wrote:

>
> Hi,
>
> This sounds like memory corruption (which could be caused by a misbehaving
> driver or by flaky hardware) because transaction->t_ilist is not used at
> all by the kernel code. Did this box run stable with other kernels?
>
> I found a similar report from Michelle (CCed), which can be found at:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107529754608448&w=2
>
> Searching a bit more, I found another message from Michelle with
> topic "[SOLVED] Kernel-Bug (at checkpoint.c 587)"
> http://lists.debian.org/debian-user-german/2004/debian-user-german-200401/msg04404.html
>
> Unfortunately the said message is in German, which I can't understand.
> Michelle, can you clarify it for me?

> Hallo Leute,

Hello everyone,

> Auch wenn ich von der [email protected] keine Antwort erhalten habe,
> handelt es sich definitiv um einen echten Kernel-Bug in 2.4.22, der in
> 2.4.24 offensichtlich nicht mehr vorhanden ist.

Although I haven't received a reply from LKML, it is definitely
a real kernel bug in 2.4.22, which is apparently no longer present in 2.4.24.

> Ein weiterer Fehler trat mehrfach in 'exit.c' auf, der ebenfalls
> nach der Installation von Linux 2.4.24 verschwunden war.

A further bug, which occurred several times in 'exit.c', also vanished
after installation of 2.4.24.

>
> Stephen, Andrew, any idea how can transaction->t_ilist become not NULL?
>
>
> On Thu, 4 Mar 2004, Daniel Fenert wrote:
>
>> Message from syslogd@lazy at Thu Mar 4 08:31:58 2004 ...
>> lazy kernel: Assertion failure in __journal_drop_transaction() at
>> checkpoint.c:587: "transaction->t_ilist == NULL"
>>
>> Networking still works. I tried to log in, but no luck.
>> I had one ssh console open and tried to reboot, but nothing happened; it
>> looks like it lost connection with hda :(
>> Where should I look for the reason?
>> The machine is far away, and this is the second or third time it has hung
>> mysteriously; the only difference is that this time I've got some console output.
>>
>
>> From [email protected] Fri Mar 5 10:48:26 2004
> Date: Thu, 4 Mar 2004 08:03:29 +0100
> From: Daniel Fenert <[email protected]>
> To: [email protected]
> Subject: Re: Is there some bug in ext3 in 2.4.25?
>
>> Message from syslogd@lazy at Thu Mar 4 08:31:58 2004 ...
>> lazy kernel: Assertion failure in __journal_drop_transaction() at
>> checkpoint.c:587: "transaction->t_ilist == NULL"
>
> One more thing - it has happened when /var got full.
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>


2004-03-05 14:25:28

by Stephen C. Tweedie

Subject: Re: Is there some bug in ext3 in 2.4.25?

Hi,

On Fri, 2004-03-05 at 14:06, Marcelo Tosatti wrote:

> This sounds like memory corruption (which could be caused by a misbehaving
> driver or by flaky hardware) because transaction->t_ilist is not used at
> all by the kernel code. Did this box run stable with other kernels?

Sounds like bad memory to me. The only other report of this I've seen
was at

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=115935

and that machine didn't pass memtest86.

> Stephen, Andrew, any idea how can transaction->t_ilist become not NULL?

Bad hardware is about the only way I can think of. If it was a random
kernel memory scribble, you'd expect it to show up in other places too:
the transaction struct is a very, very short-lived struct, so you wouldn't
expect it to be the only place to show up slab corruptions.

Cheers,
Stephen
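
A user-space sketch of the reasoning above, not the kernel's code: the logged assertion checks that a field nobody assigns is still NULL, so a failure means something outside the normal code path wrote to the struct. The struct, the helper names and the bit position are hypothetical; only the condition "transaction->t_ilist == NULL" comes from the log message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the journal's transaction struct; only the
 * field named in the assertion message is modelled. */
struct mock_transaction {
    int   t_tid;     /* illustrative transaction id */
    void *t_ilist;   /* never assigned below, so it should stay NULL */
};

/* Mirrors the condition quoted in the log: "transaction->t_ilist == NULL".
 * Returns 1 when the invariant holds, 0 when it is violated. */
int drop_check_ok(const struct mock_transaction *t)
{
    return t->t_ilist == NULL;
}

/* Simulates a stray write (bad RAM or a misbehaving driver): flips one
 * bit in the struct's storage without going through any field access. */
void flip_bit(struct mock_transaction *t, size_t byte_offset, unsigned bit)
{
    unsigned char *raw = (unsigned char *)t;
    raw[byte_offset] ^= (unsigned char)(1u << bit);
}
```

A single flipped bit inside t_ilist is enough to make the check fail, even though no code in the sketch ever stores to that field.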

2004-03-05 14:27:25

by Stephen C. Tweedie

Subject: Re: Is there some bug in ext3 in 2.4.25?

Hi,

On Fri, 2004-03-05 at 14:14, Michael Frank wrote:

> Although I haven't received a reply from LKML, it is definitely
> a real kernel bug in 2.4.22, which is apparently no longer present in 2.4.24.
>
> Ein weiterer Fehler trat mehrfach in 'exit.c' auf, der ebenfalls
> nach der Installation von Linux 2.4.24 verschwunden war.
>
> A further bug, which occurred several times in 'exit.c', also vanished
> after installation of 2.4.24.

Sounds like bad memory. It's quite possible for a bad memory module
to show up a problem in one kernel but not in another, simply because
kernels store their active data in slightly different memory
locations from one release to another (or even from one compiler, or one
set of config options, to another).

I'd definitely be running memtest86 as the next step here.

Cheers,
Stephen
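
To illustrate the point about memory locations, here is a minimal sketch with invented numbers: a faulty cell sits at a fixed address, and which object it damages depends on where a given kernel build happens to place its data. The region names, offsets and sizes are assumptions made up for the example, not measurements from any kernel.

```c
#include <assert.h>
#include <stddef.h>

/* One named region of memory in a hypothetical kernel data layout. */
struct region {
    const char *name;
    size_t start;   /* first byte offset */
    size_t size;    /* length in bytes */
};

/* Returns the index of the region containing addr, or -1 if none does. */
int region_hit(const struct region *layout, int n, size_t addr)
{
    for (int i = 0; i < n; i++) {
        if (addr >= layout[i].start && addr - layout[i].start < layout[i].size)
            return i;
    }
    return -1;
}

/* Two made-up layouts: the same objects, shifted slightly, as might
 * happen between releases, compilers, or config options. */
static const struct region layout_a[] = {
    { "transaction", 0x1000, 0x80 },
    { "page cache",  0x1080, 0x400 },
};
static const struct region layout_b[] = {
    { "transaction", 0x0f00, 0x80 },
    { "page cache",  0x0f80, 0x400 },
};
```

With a bad cell at 0x1010, layout_a puts it inside the transaction struct, where an assertion can notice the damage, while layout_b puts it in cached data, where the same fault could go unreported.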

2004-03-08 13:44:12

by Daniel Fenert

Subject: Re: Is there some bug in ext3 in 2.4.25?

On Fri, Mar 05, 2004 at 02:25:13PM +0000, Stephen C. Tweedie wrote:
>> This sounds like memory corruption (which could be caused by a misbehaving
>> driver or by flaky hardware) because transaction->t_ilist is not used at
>> all by the kernel code. Did this box run stable with other kernels?
>
>Sounds like bad memory to me. The only other report of this I've seen
>was at
>
>https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=115935
>
>and that machine didn't pass memtest86.

I'll check it this week. BIG thanks for the replies.

(The machine was stable for a few years, AFAIR 3 years.)

--
Daniel Fenert --==> [email protected] <==--
==-P o w e r e d--b y--S l a c k w a r e-=-ICQ #37739641-==
Who does not love wine, women, and song, remains a fool his whole life long.
=======- http://daniel.fenert.net/ -=======< +48604628083 >

2004-04-02 10:20:15

by Daniel Fenert

Subject: Re: Is there some bug in ext3 in 2.4.25?

Old thread, but I've managed to test the machine.

>> This sounds like memory corruption (which could be caused by a misbehaving
>> driver or by flaky hardware) because transaction->t_ilist is not used at
>> all by the kernel code. Did this box run stable with other kernels?
>
>Sounds like bad memory to me. The only other report of this I've seen
>was at
>
>https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=115935
>
>and that machine didn't pass memtest86.

It passed memtest86 after 6 or 7 hours. Any further hints?

--
Daniel Fenert --==> [email protected] <==--
==-P o w e r e d--b y--S l a c k w a r e-=-ICQ #37739641-==
It is easiest to ask why, hardest to find the answer --J. Szczawiński
=======- http://daniel.fenert.net/ -=======< +48604628083 >

2004-04-02 10:38:10

by Stephen C. Tweedie

Subject: Re: Is there some bug in ext3 in 2.4.25?

Hi,

On Fri, 2004-04-02 at 11:20, Daniel Fenert wrote:

> >> This sounds like memory corruption (which could be caused by a misbehaving
> >> driver or by flaky hardware) because transaction->t_ilist is not used at
> >> all by the kernel code. Did this box run stable with other kernels?
> >
> >Sounds like bad memory to me. The only other report of this I've seen
> >was at
> >
> >https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=115935
> >
> >and that machine didn't pass memtest86.
>
> It passed memtest86, 6 or 7 hours, any further hints?

Well, 7 hours is often not enough for memtest86; I usually recommend 24
hours if there are signs of bad hardware. But other than that, I can't
think of anything ext3-related --- ext3 simply doesn't ever set that
field. If it's being set, something is stomping on ext3's transaction
struct. That _could_ be the kernel, but it could be just about anything
touching memory after it's freed; or it could be bad hardware.

What modules are you using? Is there anything unusual in common between
your machine or its use and that in #115935?

Rebuilding the kernel to enable slab debugging may well be useful if
there's something stomping on transaction structs.

Cheers,
Stephen
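
The idea behind slab debugging can be sketched in user space, under loud assumptions: freed objects are filled with a known pattern, and the pattern is verified before the memory is handed out again, so a write after free shows up as a mismatch. The single-slot "allocator", the function names and the 0xA5 pattern are all invented for this illustration and are not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OBJ_SIZE 32
#define POISON   0xA5   /* arbitrary pattern chosen for this sketch */

/* A one-object "slab": just enough state to show poisoning. */
static unsigned char slot[OBJ_SIZE];

/* On free, overwrite the whole object with the poison pattern. */
void toy_free(void)
{
    memset(slot, POISON, OBJ_SIZE);
}

/* Before reuse, verify the pattern. Returns the offset of the first
 * byte that was modified after the free, or -1 if the object is intact. */
int toy_check_poison(void)
{
    for (size_t i = 0; i < OBJ_SIZE; i++) {
        if (slot[i] != POISON)
            return (int)i;
    }
    return -1;
}
```

The reported offset narrows down which part of the freed object was overwritten, which is the kind of hint that helps identify what is stomping on a struct.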