LinuxLists.cc - NFS file locking for clustered filesystems

2004-07-13 21:32:51

Subject: NFS file locking for clustered filesystems

Hi

I am looking into an issue with locking that is faced by clustered filesystems
with exports to NFS clients. In the current NFS implementation, NFS advisory
locking is performed by lockd on the exporting node. It does local locking even
if the underlying filesystem provides its own locking routine.

There have been earlier attempts to fix this issue, and patches were submitted
that called the underlying filesystem lock routines. But they were rejected on
the grounds that lockd cannot sleep, although the users claim that the
underlying filesystem lock routines do not block.
I was told that there was a suggestion to use workqueues as a way around this
problem. I have been thinking about this and am laying down my initial thoughts
on how we can do this without blocking lockd. I would appreciate it if you
could review them and suggest how this can be done in a way that is
acceptable for inclusion in the mainline kernel.

Currently lockd is implemented as a kernel thread that receives and processes
lock requests in a sequential loop. svc_recv() gets the requests and
svc_process() processes them. So blocking while processing any particular
request blocks all subsequent requests too.

One simple way to avoid blocking new requests is to have lockd get the request
and schedule the processing of the request to a separate kernel thread. But
creating a new kernel thread or scheduling from a pool of threads for each
request may be expensive.

2.6 provides a workqueue mechanism that lets us defer work to worker
threads. Using this mechanism we can add the "work to process lock request"
to the workqueue and schedule it to be done in the worker threads. But as
I understand workqueues, the work requests are done in the order in which
they are added to the queue, and a new work request will not start until the
earlier ones are completed. So lockd is no longer blocked from accepting new
requests, but processing of new requests is still blocked by earlier requests.

Do you agree with either of these mechanisms that make lockd multi-threaded?
With these changes, will it be acceptable to call the file-system lock
operations in lockd? If not, please let me know of any other possible solutions
to this issue.

With NFSv4 we do not have lockd, and all locking requests are handled by
nfsd threads. So I hope calling the underlying filesystem lock routines is less
of an issue with v4, as there will be multiple nfsd threads.

Thanks
Sridhar
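
A minimal user-space sketch of the ordering concern raised in this message: with a single worker draining a FIFO queue, one slow lock call delays every request queued behind it. This is only a model under stated assumptions. The function name `fifo_completion_times` and the service times are hypothetical, and nothing here uses the kernel workqueue API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space model (not kernel code): a single worker drains
 * a FIFO queue, so request i cannot start before requests 0..i-1 finish.
 * service[i] is the time request i spends in the filesystem lock call;
 * done[i] receives the time at which request i completes.
 */
void fifo_completion_times(const unsigned *service, unsigned *done, size_t n)
{
    unsigned clock = 0;

    assert(n == 0 || (service != NULL && done != NULL));
    for (size_t i = 0; i < n; i++) {
        clock += service[i];   /* the worker is busy until this request ends */
        done[i] = clock;
    }
}
```

With service times of 5000, 1 and 1 units, the two cheap requests only complete after the slow one, which is the "processing is still blocked" effect described above.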

-------------------------------------------------------
This SF.Net email sponsored by Black Hat Briefings & Training.
Attend Black Hat Briefings & Training, Las Vegas July 24-29 -
digital self defense, top technical experts, no vendor pitches,
unmatched networking opportunities. Visit http://www.blackhat.com
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2004-08-11 00:16:43

by Sridhar Samudrala


Subject: Re: NFS file locking for clustered filesystems

Olaf, Trond,

Thanks for your responses.

I gathered the following from your suggestions for making the cluster
filesystem process the lock/unlock calls asynchronously:

An NLM LOCK/UNLOCK call on a file in a cluster filesystem exported from an NFS
server will always be answered with an NLM LOCK/UNLOCK reply with NLM_BLOCKED
as the status.

Once the cluster filesystem is done granting or denying the lock, the server
will send an NLM GRANTED_MSG/GRANTED_RES callback with the appropriate status.

Is this correct?

As Olaf suggested, can't we use the existing fl_notify callback instead of
setting up a new callback?

Thanks
Sridhar

On Mon, 2 Aug 2004, Trond Myklebust wrote:

> On Mon, 02/08/2004 at 03:51, Olaf Kirch wrote:
> > On Tue, Jul 13, 2004 at 02:32:12PM -0700, Sridhar Samudrala wrote:
> > > One simple way to avoid blocking new requests is to have lockd get the request
> > > and schedule the processing of the request to a separate kernel thread. But
> > > creating a new kernel thread or scheduling from a pool of threads for each
> > > request may be expensive.
> >
> > I think making lockd multithreaded and whatnot is going to be very
> > painful.
>
> I agree: I was thinking about this at the recent Redhat cluster
> filesystem summit.
>
> What we really want is an interface for asynchronous lock calls in which
> lockd gets called back once the cluster filesystem is ready with the
> results. This is pretty close to what we do already w.r.t. local locks.
>
> IOW for each cluster filesystem we want to set up something like
>
> typedef void (*lock_callback_t)(struct file_lock *fl, int result);
> int lock(struct file *filp, struct file_lock *fl, lock_callback_t callback);
>
> lockd would then call lock(), which would return immediately once the
> filesystem has set up whatever it needs to do an asynchronous call (it's
> up to the filesystem implementation to decide if that has to involve
> forking off a thread).
> Once the filesystem is done granting or denying the lock, the filesystem
> would use the callback method to notify lockd about the result.
>
> Cheers,
> Trond
>
>


2004-08-11 00:43:33

by Trond Myklebust


Subject: Re: NFS file locking for clustered filesystems

On Tue, 10/08/2004 at 17:15, Sridhar Samudrala wrote:

> I gathered the following from your suggestions to make the cluster filesystem
> process the lock/unlock calls asynchronously,
>
> NLM LOCK/UNLOCK Call on a file in a cluster filesystem exported from an NFS
> server will always respond with NLM LOCK/UNLOCK Reply with NLM_BLOCKED as the
> status.

No. That would break the spec. See:
http://www.opengroup.org/onlinepubs/9629799/NLM_LOCK.htm#tagcjh_11_11

NLM_BLOCKED is only a valid response if the client actually asked for
blocking behaviour.

In the case where the server is not able to reply in a timely fashion (I
would suggest a short timeout period of a few seconds) it should reply
with LCK_DENIED. (Note that when I talk about a "short timeout period",
I do not mean that lockd itself is allowed to block.)

> Once the cluster filesystem is done granting or denying the lock, the server
> will send a NLM GRANTED_MSG/GRANTED_RES callbacks with the appropriate status.
>
> Is this correct?

See above. That is only true if the client requested a blocking lock.

> As Olaf suggested, can't we use the existing fl_notify callback instead of
> setting up a new callback?

Possibly. As long as the design remains clean...

Cheers,
Trond
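
A minimal sketch of the reply rule described in this message, as one possible reading of it. The helper name `nlm_reply_while_pending` and its two flags are hypothetical. The numeric status values are those of the NLM protocol's `nlm_stats` enumeration; treat them as an assumption and check them against the specification linked above.

```c
#include <assert.h>

/* NLM status codes mentioned in this thread. */
enum nlm_stat {
    LCK_GRANTED = 0,
    LCK_DENIED  = 1,
    LCK_BLOCKED = 3
};

/*
 * Hypothetical helper: choose the reply while the filesystem has not yet
 * answered. client_wants_block mirrors the block flag of the NLM_LOCK
 * arguments; timed_out says whether lockd's own short timer has expired.
 * Returns -1 when no reply should be sent yet.
 */
int nlm_reply_while_pending(int client_wants_block, int timed_out)
{
    if (client_wants_block)
        return LCK_BLOCKED;   /* valid only because the client asked to block */
    if (timed_out)
        return LCK_DENIED;    /* non-blocking request that took too long */
    return -1;                /* still within the timeout: hold the reply */
}
```

Holding the reply does not mean lockd sleeps; the request would have to be parked and answered later from a timer or a callback.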


2004-08-11 18:22:53

by Goutham Kurra


Subject: force unmount for NFSv3 client

Hi,

I notice some strange problems while trying to unmount
an NFS filesystem after the remote NFS server has died
(while some applications are hung blocked accessing
it)

1. unmount fails with device busy - expected behaviour
2. fuser hangs, so I can't find out which processes
are hung blocked on the mount.
3. umount -f doesn't work - same error device busy.

I've mounted using -intr option.

Questions:

1. Why does fuser hang?
2. Why do some processes which are hung on that mount
not respond even to SIGKILL (despite the -intr option)?
3. Why doesn't force unmount work?

Essentially, under what circumstances is this sort of
behavior likely to happen? I see this very often.

I'm using NFSv3 client on the 2.4.22 kernel with
patches from Trond's 2.4.23-pre-8.

thanks,
goutham


2004-08-11 21:36:33

by Sridhar Samudrala


Subject: Re: NFS file locking for clustered filesystems

On Tue, 10 Aug 2004, Trond Myklebust wrote:

> On Tue, 10/08/2004 at 17:15, Sridhar Samudrala wrote:
>
> > I gathered the following from your suggestions to make the cluster filesystem
> > process the lock/unlock calls asynchronously,
> >
> > NLM LOCK/UNLOCK Call on a file in a cluster filesystem exported from an NFS
> > server will always respond with NLM LOCK/UNLOCK Reply with NLM_BLOCKED as the
> > status.
>
> No. That would break the spec. See:
> http://www.opengroup.org/onlinepubs/9629799/NLM_LOCK.htm#tagcjh_11_11
>
> NLM_BLOCKED is only a valid response if the client actually asked for
> blocking behaviour.
>
> In the case where the server is not able to reply in a timely fashion (I
> would suggest a short timeout period of a few seconds) it should reply
> with LCK_DENIED. (Note that when I talk about a "short timeout period",
> I do not mean that lockd itself is allowed to block.)

So, in the case of a client that doesn't allow blocking, are you suggesting
that the cluster filesystem return LCK_BLOCKED immediately, start the
asynchronous call, and return LCK_DENIED after a few seconds if it cannot
complete the operation?

But I am not sure we can handle this easily with the current design of
lockd. As soon as the lock operation returns LCK_BLOCKED, I think svc_send() is
called from svc_process(), which will send a reply with NLM_BLOCKED.

Thanks
Sridhar


2004-08-12 05:35:53

by Trond Myklebust


Subject: Re: NFS file locking for clustered filesystems

On Wed, 11/08/2004 at 14:35, Sridhar Samudrala wrote:

> So, in the case of a client which doesn't allow blocking, are you suggesting
> that the cluster filesystem to return LCK_BLOCKED immediately and start the
> asynchronous call and return LCK_DENIED in a few seconds if it cannot complete
> the operation?

Sort of.

Firstly, having the filesystem return NLM-specific errors is a no-no.
Let's keep the errors generic (-EAGAIN == -EWOULDBLOCK for instance).
NFSv4 does not understand LCK_BLOCKED.

Secondly, I suggest that lockd should manage its timeouts itself rather
than relying on the clustered filesystem to do so. Timers are cheap...

> But i am not sure if we can handle this easily with the current design of
> lockd. As soon as the lock operation returns LCK_BLOCKED, i think svc_send() is
> called from svc_process() which will send a reply with NLM_BLOCKED.

For NFSv4 this is not a problem: NFS4ERR_DELAY will make the client back
off and retry the request later.

For NLM, you are clearly going to have to figure out a way around the
above problem. (but that's why you get paid the big bucks right? ;-))
Note that the sunrpc server code already has a mechanism for deferring
requests when dealing with blocking behaviour (in case it needs to do an
upcall) and then replaying them later. Perhaps you could look into
reusing that here?

Cheers,
Trond
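
A minimal sketch of the "keep the errors generic" point: the filesystem returns a plain errno and each protocol front end translates it. This is an illustration under assumptions, not lockd or nfsd code. Both function names are hypothetical, -EAGAIN is read here as "result not available yet" as in this message, and the numeric NLM and NFSv4 status values should be checked against the respective protocol specifications.

```c
#include <assert.h>
#include <errno.h>

/* Wire status values assumed for the sketch. */
enum { LCK_GRANTED = 0, LCK_DENIED = 1, LCK_BLOCKED = 3 };
enum { NFS4_OK = 0, NFS4ERR_DELAY = 10008, NFS4ERR_DENIED = 10010 };

/* Hypothetical NLM front end: translate a generic result for an NLM reply. */
int nlm_status_from_errno(int err, int client_wants_block)
{
    if (err == 0)
        return LCK_GRANTED;
    if (err == -EAGAIN && client_wants_block)
        return LCK_BLOCKED;   /* only valid for a blocking request */
    return LCK_DENIED;
}

/* Hypothetical NFSv4 front end: same generic result, different wire status. */
int nfs4_status_from_errno(int err)
{
    if (err == 0)
        return NFS4_OK;
    if (err == -EAGAIN)
        return NFS4ERR_DELAY; /* the client backs off and retries later */
    return NFS4ERR_DENIED;
}
```

The filesystem never sees either protocol's status codes, so the same lock routine can serve both NLM and NFSv4.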


2004-08-17 00:52:51

by Sridhar Samudrala


Subject: Re: NFS file locking for clustered filesystems

On Wed, 11 Aug 2004, Trond Myklebust wrote:

> On Wed, 11/08/2004 at 14:35, Sridhar Samudrala wrote:
>
> > So, in the case of a client which doesn't allow blocking, are you suggesting
> > that the cluster filesystem to return LCK_BLOCKED immediately and start the
> > asynchronous call and return LCK_DENIED in a few seconds if it cannot complete
> > the operation?
>
> Sort of.
>
> Firstly, having the filesystem return NLM-specific errors is a no-no.
> Let's keep the errors generic (-EAGAIN == -EWOULDBLOCK for instance).
> NFSv4 does not understand LCK_BLOCKED.

I didn't mean for the filesystem to return NLM-specific errors. Just to keep it
simple, I used the return values from the nlmsvc routines after they are
converted from the generic errors returned by the filesystem calls. Sorry for
the confusion.

>
> Secondly, I suggest that lockd should manage its timeouts itself rather
> than relying on the clustered filesystem to do so. Timers are cheap...

An issue with using a timer to validate that the filesystem call is not
blocking too long is that it could generate false positives because the call
- might get pre-empted
- might take a long interrupt
- might encounter a hardware problem resulting in a machine-check exception, etc.

Instead, we can disable preemption before the call and validate that there
were no context switches after the filesystem call returns.

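preempt_disable();
/* Snapshot this CPU's quiescent-state counter before the call. */
nctx = RCU_qsctr(task_cpu(current));
file_system_lock();
/* A changed counter is taken to mean a context switch during the call. */
if (nctx != RCU_qsctr(task_cpu(current))) {
	printk(KERN_WARNING "filesystem call did not return immediately\n");
}
preempt_enable();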

Is this approach acceptable?

>
> > But i am not sure if we can handle this easily with the current design of
> > lockd. As soon as the lock operation returns LCK_BLOCKED, i think svc_send() is
> > called from svc_process() which will send a reply with NLM_BLOCKED.
>
> For NFSv4 this is not a problem: NFS4ERR_DELAY will make the client back
> off and retry the request later.
>
> For NLM, you are clearly going to have to figure out a way around the
> above problem. (but that's why you get paid the big bucks right? ;-))
> Note that the sunrpc server code already has a mechanism for deferring
> requests when dealing with blocking behaviour (in case it needs to do an
> upcall) and then replaying them later. Perhaps you could look into
> reusing that here?

I started looking into the sunrpc server code. I guess you meant svc_defer()
as the mechanism to defer the requests.
Is this mechanism currently used for any lockd RPC calls? I guess it is used
only for nfsd RPC calls.
Could you please give an example of when a request is deferred?

Thanks
Sridhar

>
> Cheers,
> Trond
>
>


2004-08-17 23:10:52

by Trond Myklebust


Subject: Re: NFS file locking for clustered filesystems

On Mon, 16/08/2004 at 20:46, Sridhar Samudrala wrote:

> An issue with using a timer to validate that the filesystem call is not
> blocking too long is that it could generate false positives because the call
> - might get pre-empted
> - might take a long interrupt
> - might encounter hw problem resulting in machine-check exception etc.

I don't understand. The above are all things that will occur on short
timescales (~ 1/100 seconds). If we're talking timeouts of the order of
seconds, none of them will be significant.

> Instead, we can disable preemption before the call and validate that there
> were no context switches after the filesystem call returns.

Huh? I think you misunderstand me. I'm not talking about timing the call
to file_system_lock(). That's not supposed to sleep anyway...

I'm talking about the case where the filesystem "forgets" to call us
back afterwards (because it is unable to contact the cluster lock
manager or something like that). For *that* case we need to be able to
time out after a few seconds and return an ENOLCK error (LCK_DENIED) to
the client.

> >
> > > But i am not sure if we can handle this easily with the current design of
> > > lockd. As soon as the lock operation returns LCK_BLOCKED, i think svc_send() is
> > > called from svc_process() which will send a reply with NLM_BLOCKED.
> >
> > For NFSv4 this is not a problem: NFS4ERR_DELAY will make the client back
> > off and retry the request later.
> >
> > For NLM, you are clearly going to have to figure out a way around the
> > above problem. (but that's why you get paid the big bucks right? ;-))
> > Note that the sunrpc server code already has a mechanism for deferring
> > requests when dealing with blocking behaviour (in case it needs to do an
> > upcall) and then replaying them later. Perhaps you could look into
> > reusing that here?
>
> I started looking into the sunrpc server code. I guess you meant svc_defer()
> as a mechanism to defer the requests.
> Is this mechanism currently used for any lockd RPC calls? I guess it is used
> only for nfsd RPC calls.
> Could you please give an example when a request is deferred?

It is for instance currently used in the case where auth_unix needs to
do an upcall to userland in order to find out whether or not the user is
authorized.

Cheers,
Trond
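
A minimal sketch of the timeout described in this message: lockd keeps its own deadline for each outstanding request and gives up if the filesystem never calls back. All names (`pending_req`, `pending_start`, `pending_callback`, `pending_tick`) are hypothetical, time is an abstract tick counter rather than a kernel timer, and -ENOLCK is assumed as the generic error that would then be translated to the wire status.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical lockd-side record of a request awaiting a callback. */
struct pending_req {
    unsigned long deadline;  /* tick at which lockd gives up */
    int done;                /* set once a result has been recorded */
    int result;              /* 0 or negative errno, valid when done */
};

void pending_start(struct pending_req *req, unsigned long now,
                   unsigned long timeout)
{
    req->deadline = now + timeout;
    req->done = 0;
    req->result = 0;
}

/* Filesystem callback: record the result unless the timer already fired. */
void pending_callback(struct pending_req *req, int result)
{
    if (!req->done) {
        req->result = result;
        req->done = 1;
    }
}

/* Timer tick: if the deadline passed without a callback, fail the request. */
void pending_tick(struct pending_req *req, unsigned long now)
{
    if (!req->done && now >= req->deadline) {
        req->result = -ENOLCK;
        req->done = 1;
    }
}
```

Whichever of the callback or the timer arrives first decides the result; the later one is ignored, so a late callback cannot overwrite a reply that has already been sent.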


2004-08-02 10:55:32

by Olaf Kirch


Subject: Re: NFS file locking for clustered filesystems

On Tue, Jul 13, 2004 at 02:32:12PM -0700, Sridhar Samudrala wrote:
> One simple way to avoid blocking new requests is to have lockd get the request
> and schedule the processing of the request to a separate kernel thread. But
> creating a new kernel thread or scheduling from a pool of threads for each
> request may be expensive.

I think making lockd multithreaded and whatnot is going to be very
painful.

If the file system wants to implement its own locking functions, why not
make it the file system's job to deal with blocking on (network) I/O?
You could have it return an error code indicating that the locking
operation would have blocked. Add the NLM request to the list of blocked
locks. When the file system is done, it calls us back via the normal
fl_notify callback, we retry the call, and get the real status code.

The only additional bit of lockd logic that would be needed for this
is dealing with blocking on unlock requests. Currently lockd only
expects lock requests to block.

Olaf
--
Olaf Kirch | The Hardware Gods hate me.
[email protected] |
---------------+
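
A minimal user-space sketch of the flow proposed in this message: the lock call returns "would block", the request is parked on a blocked list, and a later notification triggers a retry that yields the real status. All types and names here (`fake_fs`, `blocked_list`, `handle_lock`, `on_notify`) are hypothetical stand-ins. `on_notify` only plays the role that fl_notify has in the proposal and is not the kernel callback.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_BLOCKED 8

/* Hypothetical stand-in for a cluster filesystem lock with async state. */
struct fake_fs {
    int ready;    /* set once the cluster lock manager has answered */
    int status;   /* final status, valid only when ready is set */
};

/* Non-blocking lock attempt: -EAGAIN means "would have blocked". */
int fake_fs_try_lock(const struct fake_fs *fs)
{
    return fs->ready ? fs->status : -EAGAIN;
}

/* Hypothetical list of requests waiting for a filesystem notification. */
struct blocked_list {
    int ids[MAX_BLOCKED];
    size_t count;
};

/* Try the lock; if it would block, park the request and return at once. */
int handle_lock(struct blocked_list *bl, const struct fake_fs *fs, int id)
{
    int rc = fake_fs_try_lock(fs);

    if (rc == -EAGAIN) {
        assert(bl->count < MAX_BLOCKED);
        bl->ids[bl->count] = id;
        bl->count++;
    }
    return rc;
}

/* Notification path: unpark the request and retry for the real status. */
int on_notify(struct blocked_list *bl, const struct fake_fs *fs, int id)
{
    for (size_t i = 0; i < bl->count; i++) {
        if (bl->ids[i] == id) {
            bl->count--;
            bl->ids[i] = bl->ids[bl->count];  /* fill the hole with the last entry */
            return fake_fs_try_lock(fs);
        }
    }
    return -ENOENT;  /* no such parked request */
}
```

The sketch covers lock requests only; as noted above, unlock requests that block would need the same treatment.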


2004-08-02 15:39:35

by Trond Myklebust


Subject: Re: NFS file locking for clustered filesystems

On Mon, 02/08/2004 at 03:51, Olaf Kirch wrote:
> On Tue, Jul 13, 2004 at 02:32:12PM -0700, Sridhar Samudrala wrote:
> > One simple way to avoid blocking new requests is to have lockd get the request
> > and schedule the processing of the request to a separate kernel thread. But
> > creating a new kernel thread or scheduling from a pool of threads for each
> > request may be expensive.
>
> I think making lockd multithreaded and whatnot is going to be very
> painful.

I agree: I was thinking about this at the recent Redhat cluster
filesystem summit.

What we really want is an interface for asynchronous lock calls in which
lockd gets called back once the cluster filesystem is ready with the
results. This is pretty close to what we do already w.r.t. local locks.

IOW for each cluster filesystem we want to set up something like

typedef void (*lock_callback_t)(struct file_lock *fl, int result);
int lock(struct file *filp, struct file_lock *fl, lock_callback_t callback);

lockd would then call lock(), which would return immediately once the
filesystem has set up whatever it needs to do an asynchronous call (it's
up to the filesystem implementation to decide if that has to involve
forking off a thread).
Once the filesystem is done granting or denying the lock, the filesystem
would use the callback method to notify lockd about the result.

Cheers,
Trond
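
A minimal user-space sketch of the asynchronous interface outlined in this message, assuming the lock and file arguments are passed by pointer. `struct file_lock` here is a cut-down stand-in for the kernel structure, and `pending_lock`, `fake_fs_lock`, `fake_fs_complete` and `lockd_lock_done` are hypothetical names; the `struct file` argument is left out to keep the sketch small.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct file_lock; only what the sketch needs. */
struct file_lock {
    int owner;
};

/* Callback type following the shape proposed in the message. */
typedef void (*lock_callback_t)(struct file_lock *fl, int result);

/* Hypothetical filesystem-side record of one in-flight lock request. */
struct pending_lock {
    struct file_lock *fl;
    lock_callback_t callback;
    int in_flight;
};

/*
 * Hypothetical filesystem entry point: it only records the request and
 * returns at once, never waiting for the cluster lock manager.
 * Returns 0 if the request was queued, -1 if the slot is busy.
 */
int fake_fs_lock(struct pending_lock *slot, struct file_lock *fl,
                 lock_callback_t callback)
{
    if (slot->in_flight)
        return -1;
    slot->fl = fl;
    slot->callback = callback;
    slot->in_flight = 1;
    return 0;
}

/* Called later, once the cluster has granted (0) or denied (nonzero). */
void fake_fs_complete(struct pending_lock *slot, int result)
{
    assert(slot->in_flight);
    slot->in_flight = 0;
    slot->callback(slot->fl, result);
}

/* lockd-side callback for the sketch: remembers the last result. */
int last_result = -999;

void lockd_lock_done(struct file_lock *fl, int result)
{
    (void)fl;
    last_result = result;
}
```

The caller gets control back from `fake_fs_lock` before any result exists; the result only arrives through the callback, which is the property that keeps a single-threaded lockd from stalling.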
