2007-04-16 10:48:01

by NeilBrown

Subject: Re: mountd randomly crash and panic the server

On Monday April 16, [email protected] wrote:
> Hello,
>
> I got this crash again last night. The call trace always ends in
> the same function (cache_clean), which makes me think there may be
> a race condition in it (it happens randomly, particularly under fairly
> heavy load).
>
> Apr 16 00:29:45 filer1 kernel: [<ffffffff8053ae60>] cache_flush+0xd/0x23
> Apr 16 00:29:45 filer1 kernel: [<ffffffff805386e0>] ip_map_parse+0x17c/0x18e
> Apr 16 00:29:45 filer1 kernel: [<ffffffff8053b4d1>] cache_write+0x90/0xac
> Apr 16 00:29:45 filer1 kernel: [<ffffffff80213f77>] vfs_write+0xaf/0x151
> Apr 16 00:29:45 filer1 kernel: [<ffffffff80214907>] sys_write+0x45/0x6e
> Apr 16 00:29:45 filer1 kernel: [<ffffffff8025311e>] system_call+0x7e/0x83
> Apr 16 00:29:45 filer1 kernel:
> Apr 16 00:29:45 filer1 kernel:
> Apr 16 00:29:45 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 0a 48 ff c0 48 89 82 80
> Apr 16 00:29:45 filer1 kernel: RIP [<ffffffff8053ad42>] cache_clean+0x11e/0x22f
>
> Is this related to the kernel cache_clean function in net/sunrpc/cache.c ?

Yes. It is crashing at:
    for (; ch; cp= & ch->next, ch= *cp) {
        if (current_detail->nextcheck > ch->expiry_time)
                                        ^^^^^^
            current_detail->nextcheck = ch->expiry_time+1;
        if (ch->expiry_time >= get_seconds()

ch has a garbage value (0001e71926010009), presumably because a
previous cp had been corrupted, most likely by being freed while still
in use.
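
To make the failure mode concrete, here is a minimal userspace sketch of the
same pointer-to-pointer walk. It is not the code from net/sunrpc/cache.c:
struct entry, clean_chain(), push() and chain_length() are hypothetical names
invented for illustration, and there is no locking or reference counting. It
only shows why an entry must be unlinked through cp before it is freed. If
something frees an entry that is still linked, the next walk loads ch from
freed memory and dereferences garbage, which matches the symptom above.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for a cache entry; not the kernel's struct. */
struct entry {
    struct entry *next;
    time_t expiry_time;
};

/*
 * Walk one chain. Entries whose expiry_time is before `now` are unlinked
 * and freed; for the survivors, *nextcheck is lowered to just past the
 * earliest expiry seen. Returns the number of entries freed.
 */
int clean_chain(struct entry **head, time_t now, time_t *nextcheck)
{
    struct entry **cp = head;   /* address of the link that points at ch */
    struct entry *ch;
    int freed = 0;

    while ((ch = *cp) != NULL) {
        if (ch->expiry_time < now) {
            *cp = ch->next;     /* unlink first, so nothing can reach ch */
            free(ch);           /* only then release the memory */
            freed++;
            continue;           /* cp already points at the next entry */
        }
        if (*nextcheck > ch->expiry_time)
            *nextcheck = ch->expiry_time + 1;
        cp = &ch->next;
    }
    return freed;
}

/* Prepend a new entry to the chain and return the new head. */
struct entry *push(struct entry *head, time_t expiry)
{
    struct entry *e = malloc(sizeof(*e));
    if (e == NULL)
        abort();
    e->next = head;
    e->expiry_time = expiry;
    return e;
}

/* Count the entries currently linked into the chain. */
int chain_length(const struct entry *e)
{
    int n = 0;
    for (; e != NULL; e = e->next)
        n++;
    return n;
}
```

In this sketch the walk is safe only because clean_chain() is the sole place
that frees entries. Any other path that frees an entry without unlinking it
would leave a dangling next pointer for the next walk to follow.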

Bother.

I'll see if I can figure out what is happening.

Thanks for the report.

NeilBrown

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2007-04-16 14:07:57

by Gabriel Barazer

Subject: Re: mountd randomly crash and panic the server

On 04/16/2007 12:47:32 +0200, Neil Brown <[email protected]> wrote:

> On Monday April 16, [email protected] wrote:

>>
>> Is this related to the kernel cache_clean function in net/sunrpc/cache.c ?
>
> Yes. It is crashing at:
>     for (; ch; cp= & ch->next, ch= *cp) {
>         if (current_detail->nextcheck > ch->expiry_time)
>                                         ^^^^^^
>             current_detail->nextcheck = ch->expiry_time+1;
>         if (ch->expiry_time >= get_seconds()
>
> ch has a garbage value (0001e71926010009), presumably because a
> previous cp had been corrupted, most likely by being freed while still
> in use.
>

Maybe this helps: I never encountered the problem with kernel
2.6.18, so that may narrow down where the corruption was
introduced.

> Bother.
>
> I'll see if I can figure out what is happening.
>
> Thanks for the report.
>
> NeilBrown
