2004-03-26 19:01:00

by Marcelo Tosatti

Subject: 2.4: kernel BUG at inode.c:334!

On Fri, Mar 26, 2004 at 04:40:00PM +0100, Fredrik Steen wrote:
> On [040326 16:20] Marcelo Tosatti <[email protected]> wrote:
> > On Thu, Mar 25, 2004 at 09:32:22AM -0800, Martin J. Bligh wrote:
> > > http://bugme.osdl.org/show_bug.cgi?id=2367
> >
> > This is the second bug report of "BUG at inode.c:334" I have seen.
> > The other one reported by Mika Fischer.
> >
> > It's indeed not valid for I_LOCK or I_FREEING inodes to be on the
> > superblock dirty list. I cannot see how this is happening.
> >
> > Martin, Mika, can you please apply the attached patch and rerun the tests?
> >
> > It might give a bit more clue. Thanks.
> >
> > --- fs/inode.c.orig 2004-03-26 12:30:01.961087616 -0300
> [...]
>
> I ran the patch and got this:
> inode->i_istate:f
> Kernel BUG at inode.c:340!
> [...]

Hi Fredrik,

It seems Trond already figured it out: we are erroneously moving
locked inodes to the dirty list. He attached the following patch in
the bugzilla to fix the problem. Can you please give it a try?

--- linux-2.4.26-up/fs/inode.c.orig 2004-03-19 17:12:46.000000000 -0500
+++ linux-2.4.26-up/fs/inode.c 2004-03-26 13:01:23.000000000 -0500
@@ -319,7 +319,8 @@ void refile_inode(struct inode *inode)
 	if (!inode)
 		return;
 	spin_lock(&inode_lock);
-	__refile_inode(inode);
+	if (!(inode->i_state & I_LOCK))
+		__refile_inode(inode);
 	spin_unlock(&inode_lock);
 }
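
To illustrate why the added I_LOCK check matters, here is a minimal user-space sketch of the idea, not the kernel code itself. The `model_*` names and the list enum are hypothetical, the list selection is a simplified stand-in for __refile_inode(), and the flag values are assumed from 2.4's include/linux/fs.h. If those values are right, the i_state of 0xf reported above is the three dirty bits plus I_LOCK, which is exactly the case the patch skips.

```c
#include <stddef.h>

/* Flag values assumed from 2.4's include/linux/fs.h; verify against your tree. */
#define I_DIRTY_SYNC     1
#define I_DIRTY_DATASYNC 2
#define I_DIRTY_PAGES    4
#define I_LOCK           8
#define I_FREEING        16
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

/* Hypothetical stand-ins for the lists an inode can sit on. */
enum model_list {
	MODEL_LIST_UNUSED,
	MODEL_LIST_IN_USE,
	MODEL_LIST_DIRTY,
	MODEL_LIST_LOCKED
};

struct model_inode {
	unsigned long i_state;
	int i_count;
	enum model_list list;
};

/* Simplified model of __refile_inode(): choose a list from state alone. */
static void model_refile_unchecked(struct model_inode *inode)
{
	if (inode->i_state & I_DIRTY)
		inode->list = MODEL_LIST_DIRTY;
	else if (inode->i_count > 0)
		inode->list = MODEL_LIST_IN_USE;
	else
		inode->list = MODEL_LIST_UNUSED;
}

/* Mirrors the patched refile_inode(): a locked inode is left where it is. */
static void model_refile_inode(struct model_inode *inode)
{
	if (!inode)
		return;
	if (!(inode->i_state & I_LOCK))
		model_refile_unchecked(inode);
}

/* The invariant from this thread: no I_LOCK or I_FREEING inode on the dirty list. */
static int model_dirty_list_ok(const struct model_inode *inode)
{
	if (inode->list != MODEL_LIST_DIRTY)
		return 1;
	return !(inode->i_state & (I_LOCK | I_FREEING));
}
```

In this model, an inode with i_state 0xf ends up on the dirty list with I_LOCK set if it goes through model_refile_unchecked(), which violates the invariant, while model_refile_inode() leaves it alone.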




2004-03-30 14:52:18

by Jaco Kroon

Subject: Re: 2.4: kernel BUG at inode.c:334!

Hello

We were having similar problems on two oldish machines (about 5 years
old, old prolines). Since the dpt_i2o driver isn't officially ported
yet, we were stuck with 2.4 for some time before we decided to just
stuff it and switch to 2.6 using a patch for dpt_i2o. We are still
having the same problem, but less regularly. It seems to die shortly
after we run an addusers script that causes intensive I/O on /, which
uses ext3. Unfortunately the stack traces don't get sent to a log file
(how can I quickly rig this?) and both machines are production machines,
i.e. when one goes down we run for all we are worth to hit that reset
button.

I do however have a small machine at home that seems to be giving similar
problems, but I'm not sure. I can't get stack traces in this case at
all (APM kicks in and I can't get it back out after it crashes). I've
now recompiled with full kernel debugging (everything under kernel
hacking) and the only thing I get in the kernel logs are ??? suppressed
messages from the kernel. It still dies. It also has periods where it
just slows down to a stop (doesn't respond to pings for up to a minute
at a time). Usually dies whilst compiling (heavy disk io).

One of the production machines and my machine at home currently run
2.6.4, and the other production machine runs 2.4.25.

So this seems to be a more general problem (My co-worker suspects ext3 -
since this bug report started with xfs that might not be the case). The
only pattern we are seeing between all of these is that they serve as
nfs servers (but on mine at home it still dies, even when not serving
nfs - it still is a nfs client when it dies though), are not the newest
and greatest machines and all of them use ext3 as their root file
system. Oh, also, it usually dies shortly after, or during, intensive
disk I/O, which matches up with what Mika mentioned. I've also tried
disabling IO-APIC (which we're not even sure is supported, but APIC is),
as well as preemption.

We don't suspect nfs on the production machines anymore, since we managed
to thrash the nfs exported dir (which uses reiserfs) for about an hour,
keeping the server at load average 8.5, without a crash - we might've
been lucky though. In almost all the cases these exports are relatively big
though, and I noticed there is a problem there as well (We don't get the
magical 1000 number quite yet).

Is there anything else I should/can take a look at? Is there any other
way in which I can help find the problem? If I can just get somewhere
to start ... (The patch below doesn't apply to 2.6 as far as I can see).

Apologies for the essay.

Jaco

Marcelo Tosatti wrote:

>On Fri, Mar 26, 2004 at 04:40:00PM +0100, Fredrik Steen wrote:
>
>
>>On [040326 16:20] Marcelo Tosatti <[email protected]> wrote:
>>
>>
>>> On Thu, Mar 25, 2004 at 09:32:22AM -0800, Martin J. Bligh wrote:
>>> > http://bugme.osdl.org/show_bug.cgi?id=2367
>>>
>>>This is the second bug report of "BUG at inode.c:334" I have seen.
>>>The other one reported by Mika Fischer.
>>>
>>>It's indeed not valid for I_LOCK or I_FREEING inodes to be on the
>>>superblock dirty list. I cannot see how this is happening.
>>>
>>>Martin, Mika, can you please apply the attached patch and rerun the tests?
>>>
>>>It might give a bit more clue. Thanks.
>>>
>>>--- fs/inode.c.orig 2004-03-26 12:30:01.961087616 -0300
>>>
>>>
>>[...]
>>
>>I ran the patch and got this:
>>inode->i_istate:f
>>Kernel BUG at inode.c:340!
>>[...]
>>
>>
>
>Hi Fredrik,
>
>It seems Trond already figured it out, we are erroneously moving
>locked inodes to the dirty list. He attached the following patch in
>the bugzilla to fix the problem. Can you please give it a try?
>
>--- linux-2.4.26-up/fs/inode.c.orig 2004-03-19 17:12:46.000000000 -0500
>+++ linux-2.4.26-up/fs/inode.c 2004-03-26 13:01:23.000000000 -0500
>@@ -319,7 +319,8 @@ void refile_inode(struct inode *inode)
> if (!inode)
> return;
> spin_lock(&inode_lock);
>- __refile_inode(inode);
>+ if (!(inode->i_state & I_LOCK))
>+ __refile_inode(inode);
> spin_unlock(&inode_lock);
> }
>
>

2004-03-30 17:04:11

by Marcelo Tosatti

Subject: Re: 2.4: kernel BUG at inode.c:334!

On Tue, Mar 30, 2004 at 04:51:57PM +0200, Jaco Kroon wrote:

> So this seems to be a more general problem (My co-worker suspects ext3 -
> since this bug report started with xfs that might not be the case). The
> only pattern we are seeing between all of these is that they serve as
> nfs servers (but on mine at home it still dies, even when not serving
> nfs - it still is a nfs client when it dies though), are not the newest
> and greatest machines and all of them use ext3 as their root file
> system. Oh, also, usually shortly after, or during, intensive disk io -
> which matches up with what Mika mentioned. I've also tried disabling
> IO-APIC (which we're not even sure is supported, but APIC is), as well
> as pre-empting.
>
> We don't suspect nfs on the production machines anymore since we managed
> to thrash the nfs exported dir for about an hour (keeping the server at
> load average 8.5) which makes use of reiserfs - we might've been lucky
> though. In almost all the cases these exports are relatively big
> though, and I noticed there is a problem there as well (We don't get the
> magical 1000 number quite yet).
>
> Is there anything else I should/can take a look at? Is there any other
> way in which I can help find the problem? If I can just get somewhere
> to start ... (The patch below doesn't apply to 2.6 as far as I can see).
>
> Apologies for the essay.

Jaco,

The "kernel BUG at inode.c:340" problem is fixed in 2.4.26-rc1.
If that was what you were hitting, can you try that on your servers?

About the other crashes, it's hard to help without more information. Try
attaching a serial cable to the box for a serial console.
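
As a rough sketch of the serial console setup: the port ttyS0 and the 115200 baud rate below are assumptions, so adjust them to the hardware, and the kernel needs serial console support built in. The kernel command line would gain something like:

```
console=ttyS0,115200n8 console=tty0
```

With a null-modem cable to a second machine running a terminal program at the same speed, oops output goes out the serial port even when nothing reaches the log files on disk.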