2005-01-05 20:06:45

by Jan-Frode Myklebust

Subject: panic - Attempting to free lock with active block list

We have a couple of mail servers that first ran 2.6.9-1.681_FC3smp
and were later upgraded to the Fedora test kernel 2.6.10-1.727_FC3smp,
which I think is pretty plain 2.6.10 + ac2. But they both keep
crashing with the message:

Kernel panic - not syncing: Attempting to free lock with active block list

Any ideas how to attack this?

We're running Centos 3.3, ext3 for root-disks, ext2 on /boot,
XFS for mail-spools, lots of nfs-mounted directories..


-jf


2005-01-05 20:36:08

by Chris Wright

Subject: Re: panic - Attempting to free lock with active block list

* Jan-Frode Myklebust ([email protected]) wrote:
> We have a couple of mail servers that first ran 2.6.9-1.681_FC3smp
> and were later upgraded to the Fedora test kernel 2.6.10-1.727_FC3smp,
> which I think is pretty plain 2.6.10 + ac2. But they both keep
> crashing with the message:
>
> Kernel panic - not syncing: Attempting to free lock with active block list
>
> Any ideas how to attack this?
>
> We're running Centos 3.3, ext3 for root-disks, ext2 on /boot,
> XFS for mail-spools, lots of nfs-mounted directories..

It seems likely it's NFS related in this case, since NFS stresses the
fs/locks code differently than local filesystems do. I recall Steve French
reporting a similar issue with cifs last month.

Message-Id: <[email protected]>

Are those three cases really panic-worthy? Could we change to BUG_ON()
and try and get some useful debugging? Trond, Willy, any ideas?

thanks,
-chris

===== fs/locks.c 1.76 vs edited =====
--- 1.76/fs/locks.c 2005-01-04 18:48:28 -08:00
+++ edited/fs/locks.c 2005-01-05 12:31:34 -08:00
@@ -159,14 +159,20 @@ static inline void locks_free_lock(struc
 		BUG();
 		return;
 	}
-	if (waitqueue_active(&fl->fl_wait))
-		panic("Attempting to free lock with active wait queue");
+	if (waitqueue_active(&fl->fl_wait)) {
+		printk("Attempting to free lock with active wait queue");
+		BUG();
+	}

-	if (!list_empty(&fl->fl_block))
-		panic("Attempting to free lock with active block list");
+	if (!list_empty(&fl->fl_block)) {
+		printk("Attempting to free lock with active block list");
+		BUG();
+	}

-	if (!list_empty(&fl->fl_link))
-		panic("Attempting to free lock on active lock list");
+	if (!list_empty(&fl->fl_link)) {
+		printk("Attempting to free lock on active lock list");
+		BUG();
+	}

 	if (fl->fl_ops) {
 		if (fl->fl_ops->fl_release_private)
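
For readers without the kernel tree at hand, the two list checks in the patch above can be modelled in user space. This is only a rough sketch: `toy_lock`, `toy_lock_init()` and `toy_free_check()` are hypothetical names, the list helpers are simplified stand-ins for the kernel's `list_head` routines, and the wait-queue check is left out because it has no simple user-space equivalent.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	init_list_head(entry);
}

/* Hypothetical toy lock: only the two list fields tested by the patch. */
struct toy_lock {
	struct list_head fl_link;	/* membership in a list of active locks */
	struct list_head fl_block;	/* links a blocker and its waiters */
};

static void toy_lock_init(struct toy_lock *fl)
{
	assert(fl != NULL);
	init_list_head(&fl->fl_link);
	init_list_head(&fl->fl_block);
}

/*
 * Returns NULL when the lock is safe to free, otherwise the message
 * that the checks in the patch would print before calling BUG().
 */
static const char *toy_free_check(const struct toy_lock *fl)
{
	if (!list_empty(&fl->fl_block))
		return "Attempting to free lock with active block list";
	if (!list_empty(&fl->fl_link))
		return "Attempting to free lock on active lock list";
	return NULL;
}
```

In this model, linking a waiter onto a blocker's `fl_block` with `list_add()` makes `toy_free_check()` report the block-list message for both structures until `list_del_init()` unlinks the waiter again.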

2005-01-05 21:54:08

by Jan-Frode Myklebust

Subject: Re: panic - Attempting to free lock with active block list

On Wed, Jan 05, 2005 at 12:32:07PM -0800, Chris Wright wrote:
>
> It seems likely it's NFS related in this case, since NFS stresses the
> fs/locks code differently than local filesystems do. I recall Steve French
> reporting a similar issue with cifs last month.

Also found this on the linux-cifs-client list:

http://lists.samba.org/archive/linux-cifs-client/2004-December/000617.html

Is the suggested fix also relevant for fs/nfs/file.c ?


-jf

2005-01-05 21:58:40

by Trond Myklebust

Subject: Re: panic - Attempting to free lock with active block list

On Wed, 05.01.2005 at 12:32 (-0800), Chris Wright wrote:
> * Jan-Frode Myklebust ([email protected]) wrote:
> > We have a couple of mail servers that first ran 2.6.9-1.681_FC3smp
> > and were later upgraded to the Fedora test kernel 2.6.10-1.727_FC3smp,
> > which I think is pretty plain 2.6.10 + ac2. But they both keep
> > crashing with the message:
> >
> > Kernel panic - not syncing: Attempting to free lock with active block list
> >
> > Any ideas how to attack this?

Well, the prevailing theory tends to start along the lines of "find out
how to reproduce the problem...". ;-)

Looking at the NFS code, I can attempt a wild guess about what may be
happening: there may be a race when pressing ^C in the middle of a
blocking NFS lock RPC call, and if so, the following patch will fix it.

Try it, and see whether or not it fixes your problem, but if it doesn't,
then I agree with Chris' suggestion of replacing those "panic()" calls
with BUG_ON()s.

Cheers,
Trond

file.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.10/fs/nfs/file.c
===================================================================
--- linux-2.6.10.orig/fs/nfs/file.c
+++ linux-2.6.10/fs/nfs/file.c
@@ -374,7 +374,7 @@ static int do_setlk(struct file *filp, i
 		 * the process exits.
 		 */
 		if (status == -EINTR || status == -ERESTARTSYS)
-			posix_lock_file(filp, fl);
+			posix_lock_file_wait(filp, fl);
 	} else
 		status = posix_lock_file_wait(filp, fl);
 	unlock_kernel();


--
Trond Myklebust <[email protected]>
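
The race guessed at above can be illustrated with a small user-space model. It is a sketch under stated assumptions, not the kernel code: it assumes that a plain lock attempt which hits a conflict leaves the request queued behind the holder and returns -EAGAIN, and that the waiting variant removes the request from that queue again if the wait is interrupted. All names here (`toy_lock`, `toy_req`, `toy_try_lock()`, `toy_lock_wait()`, `toy_free_is_safe()`) are hypothetical, and the "wait" is simulated by a parameter rather than by real sleeping.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical lock holder: held != 0 means someone owns the lock. */
struct toy_lock {
	int held;
};

/* Hypothetical lock request: blocked_on != NULL means it is still queued. */
struct toy_req {
	struct toy_lock *blocked_on;
};

/* How the simulated wait ends. */
enum toy_wake { WOKEN_BY_UNLOCK, WOKEN_BY_SIGNAL };

/*
 * Single attempt (assumed behaviour of the non-waiting call): on a
 * conflict the request stays queued behind the holder and the caller
 * gets -EAGAIN. A caller that ignores the result leaves it queued.
 */
static int toy_try_lock(struct toy_lock *holder, struct toy_req *req)
{
	if (holder->held) {
		req->blocked_on = holder;
		return -EAGAIN;
	}
	holder->held = 1;
	return 0;
}

/*
 * Waiting variant (assumed behaviour): retry until the lock is taken,
 * and dequeue the request before returning if a signal ends the wait.
 */
static int toy_lock_wait(struct toy_lock *holder, struct toy_req *req,
			 enum toy_wake wake)
{
	assert(holder != NULL && req != NULL);
	for (;;) {
		int err = toy_try_lock(holder, req);

		if (err != -EAGAIN)
			return err;
		if (wake == WOKEN_BY_SIGNAL) {
			req->blocked_on = NULL;	/* clean up before bailing out */
			return -EINTR;
		}
		/* Simulated unlock by the holder, which wakes the waiter. */
		holder->held = 0;
		req->blocked_on = NULL;
	}
}

/* A request may only be freed once it is no longer queued anywhere. */
static int toy_free_is_safe(const struct toy_req *req)
{
	return req->blocked_on == NULL;
}
```

In this model, a bare `toy_try_lock()` against a held lock leaves the request queued, so `toy_free_is_safe()` reports 0, which corresponds to the state the panic message complains about. `toy_lock_wait()` returns with the request dequeued whether it is woken by an unlock or by a signal.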

2005-01-06 15:29:00

by Jan-Frode Myklebust

Subject: Re: panic - Attempting to free lock with active block list

On Wed, Jan 05, 2005 at 10:54:03PM +0100, Trond Myklebust wrote:
>
> Looking at the NFS code, I can attempt a wild guess about what may be
> happening: there may be a race when pressing ^C in the middle of a
> blocking NFS lock RPC call, and if so, the following patch will fix it.


A whopping 9 hours of uptime now :) So the one-liner patch seems to have
fixed it.

Thanks!

> - posix_lock_file(filp, fl);
> + posix_lock_file_wait(filp, fl);


-jf

2005-01-11 16:09:42

by Anders Saaby

Subject: Re: panic - Attempting to free lock with active block list

Hi Myklebust(s) :)

I have seen the exact same error on one of my web servers, which serves
from an NFS export under heavy load: about 2 hours of uptime before
panicking. I then tried Trond's patch, which seems to work: 14 hours of
uptime now. :)

Anyways, I have a couple of issues you might be able to clear up for me:

First issue:
New strange message in the kernel log:

"nlmclnt_lock: VFS is out of sync with lock manager!"

What does this mean? Is it bad? What can I do?


Second issue:
My fs/nfs/file.c (vanilla 2.6.10) doesn't look like yours:

<fs/nfs/file.c SNIP>
        status = NFS_PROTO(inode)->lock(filp, cmd, fl);
        /* If we were signalled we still need to ensure that
         * we clean up any state on the server. We therefore
         * record the lock call as having succeeded in order to
         * ensure that locks_remove_posix() cleans it out when
         * the process exits.
         */
        if (status == -EINTR || status == -ERESTARTSYS)
                posix_lock_file_wait(filp, fl);
        unlock_kernel();
        if (status < 0)
                return status;
        /*
         * Make sure we clear the cache whenever we try to get the lock.
         * This makes locking act as a cache coherency point.
         */
        filemap_fdatawrite(filp->f_mapping);
        down(&inode->i_sem);
        nfs_wb_all(inode);      /* we may have slept */
        up(&inode->i_sem);
        filemap_fdatawait(filp->f_mapping);
        nfs_zap_caches(inode);
        return 0;
</SNIP>

So... Am I missing another patch or something else?

--
Med venlig hilsen - Best regards - Meilleures salutations

Anders Saaby
Systems Engineer
------------------------------------------------
Cohaesio A/S - Maglebjergvej 5D - DK-2800 Lyngby
Phone: +45 45 880 888 - Fax: +45 45 880 777
Mail: [email protected] - http://www.cohaesio.com
------------------------------------------------