2007-03-23 02:49:01

by Wendy Cheng

Subject: Question about f_count in struct nlm_file

I'm trying to finish the NLM lock failover work. The new NLM code in
the 2.6.21-rc4 kernel is kind of confusing. To make a long story short,
could someone please explain how the server side's f_count (in struct
nlm_file) is intended to be used? My test program simply takes a posix
lock from an NFS client without unlocking it (testing lock failover). It
looks like the server keeps the file in the nlm_files hash but the
f_count for that particular file is zero. The trace shows the following:

client does posix lock -->
    server calls nlm4svc_proc_lock() ->
        * server looks up the file (f_count++)
        * server locks the file
        * server calls nlm_release_host
        * server calls nlm_release_file (f_count--)
        * server returns to client with status 0
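
The trace above can be replayed in a small user-space model. This is only a sketch: struct trace_file and trace_lock_request() are hypothetical names, not lockd code, and held_locks is an invented field standing for the posix lock that is still held on the file once the request returns.

```c
#include <assert.h>

/* Hypothetical stand-in for the two facts the trace is about. */
struct trace_file {
	int f_count;     /* references held by in-flight requests */
	int held_locks;  /* posix locks currently held (model-only field) */
};

/* Replays the steps listed in the trace for a single LOCK request. */
static void trace_lock_request(struct trace_file *f)
{
	f->f_count++;     /* server looks up the file */
	f->held_locks++;  /* server locks the file */
	f->f_count--;     /* nlm_release_file() */
}
```

In this model, calling trace_lock_request() once on a zeroed struct leaves f_count at 0 while held_locks is 1, which is the state the question is about: f_count only tracks the request, not the lock it leaves behind.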

This will cause any call into nlm_traverse_files() to crash in the
following path if the file happens to be of "no interest" to the search
(for example, the "match" function returns FALSE in all cases). Is this
intentional or an oversight? Would 2.6.21-rc4 be a good base for NLM
development work?

/*
 * Loop over all files in the file table.
 */
static int
nlm_traverse_files(struct nlm_host *host, nlm_host_match_fn_t match)
{
	.............
	for (i = 0; i < FILE_NRHASH; i++) {
		hlist_for_each_entry_safe(file, pos, next, &nlm_files[i], f_list) {
			....
			file->f_count++;
			mutex_unlock(&nlm_file_mutex);

			/* Traverse locks, blocks and shares of this file
			 * and update file->f_locks count */
			if (nlm_inspect_file(host, file, match))
				ret = 1;

			mutex_lock(&nlm_file_mutex);
			file->f_count--;
			/* No more references to this file. Let go of it. */
			if (list_empty(&file->f_blocks) && !file->f_locks
			 && !file->f_shares && !file->f_count) {
				hlist_del(&file->f_list);
				nlmsvc_ops->fclose(file->f_file);
				kfree(file);

I can make the code loop back after nlm_inspect_file() (instead of
trying to clean up the hash) to avoid this crash. But somehow the f_count
logic sounds wrong to me. Why would a file that is still locked have an
f_count of zero in the hash?

-- Wendy

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2007-03-23 03:19:15

by Wendy Cheng

Subject: Re: Question about f_count in struct nlm_file

Wendy Cheng wrote:

>
>client does posix lock -->
> server calls nlm4svc_proc_lock() ->
> * server lookup file (f_count++)
> * server lock the file
> * server calls nlm_release_host
> * server calls nlm_release_file (f_count--)
> * server return to client with status 0
>
>This will cause any call into nlm_traverse_files() to crash in the
>following path, if the file happens to be of "no interest" of the search
>(for example, the "match" function returns FALSE in all cases). Is this
>intentional or oversight ? Would 2.6.21-rc4 be a good base to do NLM
>development work ?
>
> /*
>  * Loop over all files in the file table.
>  */
> static int
> nlm_traverse_files(struct nlm_host *host, nlm_host_match_fn_t match)
> {
> 	.............
> 	for (i = 0; i < FILE_NRHASH; i++) {
> 		hlist_for_each_entry_safe(file, pos, next, &nlm_files[i], f_list) {
> 			....
> 			file->f_count++;
> 			mutex_unlock(&nlm_file_mutex);
>
> 			/* Traverse locks, blocks and shares of this file
> 			 * and update file->f_locks count */
> 			if (nlm_inspect_file(host, file, match))
> 				ret = 1;
>
> 			mutex_lock(&nlm_file_mutex);
> 			file->f_count--;
> 			/* No more references to this file. Let go of it. */
> 			if (list_empty(&file->f_blocks) && !file->f_locks
> 			 && !file->f_shares && !file->f_count) {
> 				hlist_del(&file->f_list);
> 				nlmsvc_ops->fclose(file->f_file);
> 				kfree(file);
>
>I can make the nlm_inspect_file() loops back (instead of trying to clean
>up the hash) to avoid this crash. But somehow the f_count logic sounds
>wrong to me. Why would a file that is still locked has a f_count zero in
>the hash ?
>
>
>
I should have made it clear... after nlm_inspect_file(), the logic
unconditionally checks whether this file can be removed. Since the file
has no blocks, no shares, and f_count is zero, it gets removed from the
hash and fclose() is invoked (even though it still owns a plock). This
makes the VFS very unhappy, and it hits BUG() at fs/locks.c:1988 in the
middle of __fput -> locks_remove_flock.
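
A minimal user-space sketch of one pass of that loop over a file the match function ignores. It assumes f_locks is still zero when the removal test runs, which is what the reported crash implies; struct pass_file, pass_visit() and the held_locks and closed fields are hypothetical and only model the behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the fields tested by the removal condition. */
struct pass_file {
	int f_count;     /* transient references */
	int f_locks;     /* cached lock count (assumed stale at zero here) */
	int f_shares;    /* share reservations */
	int n_blocks;    /* stands in for the f_blocks list length */
	int held_locks;  /* locks really held on the file (model only) */
	bool closed;     /* set when the model "closes and frees" the file */
};

/* One loop iteration for a file of "no interest" to the search. */
static void pass_visit(struct pass_file *f)
{
	f->f_count++;   /* temporary reference while the mutex is dropped */
	/* inspection finds no lock, block or share that matches */
	f->f_count--;   /* temporary reference dropped again */

	/* Same shape as the removal test in the quoted snippet. */
	if (f->n_blocks == 0 && !f->f_locks && !f->f_shares && !f->f_count)
		f->closed = true;  /* stands in for hlist_del + fclose + kfree */
}
```

With held_locks set to 1 and every counter at zero, pass_visit() marks the file closed even though a lock is still held, which is the situation that trips the BUG() described above.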

On the other hand, the more I think about this issue, maybe just
looping back after nlm_inspect_file finds no match would be good enough.
Anyway, that's what I'm going to do. Any objections? Please let me know.

-- Wendy


2007-03-23 04:39:07

by NeilBrown

Subject: Re: Question about f_count in struct nlm_file

On Thursday March 22, [email protected] wrote:
> Wendy Cheng wrote:
>
> >
> >client does posix lock -->
> > server calls nlm4svc_proc_lock() ->
> > * server lookup file (f_count++)
> > * server lock the file
> > * server calls nlm_release_host
> > * server calls nlm_release_file (f_count--)
> > * server return to client with status 0
> >
> >This will cause any call into nlm_traverse_files() to crash in the
> >following path, if the file happens to be of "no interest" of the search
> >(for example, the "match" function returns FALSE in all cases). Is this
> >intentional or oversight ? Would 2.6.21-rc4 be a good base to do NLM
> >development work ?

Hmmm.... definitely seems to be a bug in there!!

The key to the issue seems to be to make sure we take account of all
the current locks.
One way might be to replace
	if (list_empty(&file->f_blocks) && !file->f_locks
	 && !file->f_shares && !file->f_count) {
in nlm_traverse_files with a call to nlm_file_inuse(file).
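
A rough user-space sketch of what such an in-use test could look like, assuming it consults the locks actually present on the file instead of the cached f_locks value. The types and the owned_by_lockd flag are hypothetical; this is not the kernel's nlm_file_inuse().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock record; owned_by_lockd stands in for whatever marks
 * a lock as having been taken on behalf of an NLM client. */
struct inuse_lock {
	bool owned_by_lockd;
	struct inuse_lock *next;
};

/* Hypothetical model of the file state the test looks at. */
struct inuse_file {
	int f_count;               /* transient references */
	int f_shares;              /* share reservations */
	int n_blocks;              /* stands in for the f_blocks list length */
	struct inuse_lock *locks;  /* locks actually present on the file */
};

/* Returns true while anything still depends on the file staying open. */
static bool model_file_inuse(const struct inuse_file *f)
{
	const struct inuse_lock *fl;

	if (f->f_count || f->n_blocks || f->f_shares)
		return true;

	/* Walk the real lock list rather than trusting a counter. */
	for (fl = f->locks; fl != NULL; fl = fl->next)
		if (fl->owned_by_lockd)
			return true;

	return false;
}
```

The point of the design is that the decision to close the file no longer depends on a counter that some code path may have forgotten to update.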

Another way would be to try a bit harder to keep f_locks up to date by
incrementing it in nlmsvc_lock:

diff .prev/fs/lockd/svclock.c ./fs/lockd/svclock.c
--- .prev/fs/lockd/svclock.c	2007-03-23 15:30:37.000000000 +1100
+++ ./fs/lockd/svclock.c	2007-03-23 15:33:41.000000000 +1100
@@ -370,6 +370,7 @@ again:
 
 	switch(error) {
 		case 0:
+			file->f_locks ++;
 			ret = nlm_granted;
 			goto out;
 		case -EAGAIN:

The former is probably a bit safer... but I wish I knew why we have
that f_locks count. It is not at all clear what it is needed for.
Maybe some legacy issue that doesn't exist any more...


> I should have made it clear... after nlm_inspect_file(), the logic
> unconditionally checks for possible removing of this file. Since the
> file is not blocked, nothing to do with shares, and f_count is zero, it
> will get removed from hash and fclose() invoked (even it still owns a
> plock). This will make VFS very unhappy and BUG() in fs/locks.c:1988 in
> the middle of __fput -> locks_remove_flock.
>
> On the other hand, the more I think (about this issue), maybe just
> looping back after nlm_inspect_file finds no match would be good enough.
> Anyway, that's what I'm going to do. Any objection ? Please let me know.

Could you explain this possible fix a bit more? I'm not sure what you
are proposing.
Thanks,
NeilBrown


2007-03-23 22:30:32

by Wendy Cheng

Subject: Re: Question about f_count in struct nlm_file

Neil Brown wrote:
> On Thursday March 22, [email protected] wrote:
>
>> Wendy Cheng wrote:
>>
>>
>>> client does posix lock -->
>>> server calls nlm4svc_proc_lock() ->
>>> * server lookup file (f_count++)
>>> * server lock the file
>>> * server calls nlm_release_host
>>> * server calls nlm_release_file (f_count--)
>>> * server return to client with status 0
>>>
>>> This will cause any call into nlm_traverse_files() to crash in the
>>> following path, if the file happens to be of "no interest" of the search
>>> (for example, the "match" function returns FALSE in all cases). Is this
>>> intentional or oversight ? Would 2.6.21-rc4 be a good base to do NLM
>>> development work ?
>>>
>
> Hmmm.... definitely seems to be a bug in there!!
>
> The key to the issue seems to be to make sure we take account of all
> the current locks.
> One way might be to replace
> if (list_empty(&file->f_blocks) && !file->f_locks
> && !file->f_shares && !file->f_count) {
> in nlm_traverse_files with a call to nlm_file_inuse(file).
>

Looks reasonable ... an easy and quick fix. This one has my vote.

> Another way would be to try a bit harder to keep f_locks up to date by
> incrementing it in nlmsvc_lock:
>
> diff .prev/fs/lockd/svclock.c ./fs/lockd/svclock.c
> --- .prev/fs/lockd/svclock.c	2007-03-23 15:30:37.000000000 +1100
> +++ ./fs/lockd/svclock.c	2007-03-23 15:33:41.000000000 +1100
> @@ -370,6 +370,7 @@ again:
>
>  	switch(error) {
>  		case 0:
> +			file->f_locks ++;
>  			ret = nlm_granted;
>  			goto out;
>  		case -EAGAIN:
>
> The former is probably bit safer... but I wish I know what we have
> that f_locks count. It is not at all clear what it is needed for.
> Maybe some legacy issue that doesn't exist any more...
>

This f_locks is definitely another awkward counter - better not to mess
with it. My vote is "no". How about removing it from nlm_file for good?
>
>
>> I should have made it clear... after nlm_inspect_file(), the logic
>> unconditionally checks for possible removing of this file. Since the
>> file is not blocked, nothing to do with shares, and f_count is zero, it
>> will get removed from hash and fclose() invoked (even it still owns a
>> plock). This will make VFS very unhappy and BUG() in fs/locks.c:1988 in
>> the middle of __fput -> locks_remove_flock.
>>
>> On the other hand, the more I think (about this issue), maybe just
>> looping back after nlm_inspect_file finds no match would be good enough.
>> Anyway, that's what I'm going to do. Any objection ? Please let me know.
>>
>
> Could you explain this possible fix a bit more. I'm not sure what you
> are proposing..
>
>
No, scratch what I said here (since it is not related) ... will submit
the first set of failover patches shortly after this mail.

Great thanks (as always) for looking into this.

-- Wendy

