2005-03-10 20:44:23

by Olaf Kirch

[permalink] [raw]
Subject: deadlock in lockd

Hi all,

I just debugged a deadlock in lockd in SLES8 (i.e 2.4.21), but
I think the same problem exists in 2.6.

Here's the backtrace courtesy of sysrq:

lockd D 00000000 3648 791 1 792 795 783 (L-TLB)
Call Trace: [do_schedule+338/608] (36) [__down+131/224] (52) [__down_failed+8/12] (16)
[.text.lock.svclock+5/150] (04) [vsnprintf+519/1120] (24) [nlm_traverse_files+324/368] (32) [nlmsvc_mark_resources+32/64] (12)
[nlm_gc_hosts+69/384] (28) [nlm_lookup_host+139/800] (48) [nlmsvc_lookup_host+48/64] (20) [nlmsvc_create_block+145/352] (16)
[posix_test_lock+132/160] (20) [nlmsvc_lock+222/784] (28) [nlm4svc_retrieve_args+199/288] (40) [nlm4svc_proc_lock+172/256] (44)
[svc_process+827/1392] (56) [lockd+426/720] (40) [arch_kernel_thread+46/64] (08) [lockd+0/720] (04)

What happens is that nlmsvc_lock takes the f_sema on the file the client wishes to lock,
then calls posix_test_lock, which finds there's a blocking lock. So it calls
nlmsvc_create_block, which calls nlmsvc_lookup_host - and the host code decides
to do a garbage collection pass.

We call nlm_traverse_files, that hits the file we're just trying to lock, and
invokes nlmsvc_traverse_blocks, which will try to down f_sema once more.
And there it hangs...

The attched (untested) patch changes the way we do garbage collection
passes, instead of doing it inside nlm_lookup_host, nlm_gc_hosts is
called from the top-level service loop in lockd now, where we don'T hold any
locks. It looks saner anyway.

Comments?

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


Attachments:
(No filename) (1.63 kB)
lockd-gc-deadlock (2.37 kB)
Download all attachments

2005-03-10 21:49:57

by Daniel Forrest

[permalink] [raw]
Subject: Re: deadlock in lockd

Olaf,

> Hi all,
>
> I just debugged a deadlock in lockd in SLES8 (i.e 2.4.21), but
> I think the same problem exists in 2.6.

I found this ages ago.

Neil Brown submitted a patch on 12 Mar 2003:

Subject: [NFS] [PATCH] kNFSd - 3 of 6 - Fix deadlock problem in lockd.

Which fixes this. Apparently this made it into 2.5, but not 2.4.

He submitted the same patch on 29 Oct 2003:

Subject: [NFS] [PATCH] Fix deadlock problem in lockd.

With the comment: "(This was fixed in 2.5 8 months ago)".

You may want to check if this ever made it into the 2.4 source.

Here is that e-mail:

---

nlmsvc_lock calls nlmsvc_create_block with file->f_sema
held.
nlmsvc_create_block calls nlmclnt_lookup_host which might
call nlm_gc_hosts which might, eventually, try to claim
file->f_sema for the same file -> deadlock.

nlmsvc_create_block does not need any protection under
any lock as lockd is single-threaded and _create_block
only plays with internal data structures.

So we release the f_sema before calling in, and make sure
it gets claimed again afterwards.

(This was fixed in 2.5 8 months ago)


diff ./fs/lockd/svclock.c~current~ ./fs/lockd/svclock.c
--- ./fs/lockd/svclock.c~current~ 2003-10-29 09:07:10.000000000 +1100
+++ ./fs/lockd/svclock.c 2003-10-29 09:07:10.000000000 +1100
@@ -317,8 +317,6 @@ nlmsvc_lock(struct svc_rqst *rqstp, stru
(long long)lock->fl.fl_end,
wait);

- /* Lock file against concurrent access */
- down(&file->f_sema);

/* Get existing block (in case client is busy-waiting) */
block = nlmsvc_lookup_block(file, lock, 0);
@@ -326,6 +324,9 @@ nlmsvc_lock(struct svc_rqst *rqstp, stru
lock->fl.fl_flags |= FL_LOCKD;

again:
+ /* Lock file against concurrent access */
+ down(&file->f_sema);
+
if (!(conflock = posix_test_lock(&file->f_file, &lock->fl))) {
error = posix_lock_file(&file->f_file, &lock->fl, 0);

@@ -358,7 +359,10 @@ again:

/* If we don't have a block, create and initialize it. Then
* retry because we may have slept in kmalloc. */
+ /* We have to release f_sema as nlmsvc_create_block may try to
+ * claim it while doing host garbage collection */
if (block == NULL) {
+ up(&file->f_sema);
dprintk("lockd: blocking on this lock (allocating).\n");
if (!(block = nlmsvc_create_block(rqstp, file, lock, cookie)))
return nlm_lck_denied_nolocks;

--
Daniel K. Forrest Laboratory for Molecular and
[email protected] Computational Genomics
University of Wisconsin, Madison


-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2005-03-10 22:42:48

by Olaf Kirch

[permalink] [raw]
Subject: Re: deadlock in lockd

On Thu, Mar 10, 2005 at 03:49:44PM -0600, Daniel Forrest wrote:
> Subject: [NFS] [PATCH] kNFSd - 3 of 6 - Fix deadlock problem in lockd.

I see... I would have been surprised if I had been the first one
to run into this.

Still I think it's cleaner to move the nlm_gc_hosts calls out of
nlm_lookup_host.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs