2009-12-15 10:00:22

by Mi Jinlong

Subject: [RFC] server's statd and lockd will not sync after its nfslock restart

Hi,

While testing NLM on the latest kernel (2.6.32), I found a bug.
When a client holds locks and the server restarts its nfslock service,
the server's statd falls out of sync with lockd.
If the server restarts nfslock twice or more, the client's locks are lost.

Test process:

Step 1: the client opens an NFS file.
Step 2: the client takes a lock with fcntl.
Step 3: the server restarts its nfslock service.

After step 3, the server's lockd records that the client holds locks, but
statd's /var/lib/nfs/statd/sm/ directory is empty, which means statd and
lockd are out of sync. If the server restarts nfslock again, the client's
locks are lost.

The Primary Reason:

At step 3, when the client's lock reclaim request reaches the server, the
client's host (the host struct) is reused but is not re-monitored by the
server's lockd. From then on, statd and lockd are out of sync.

Question:

In my opinion, if lockd is allowed to reuse the client's host, it should
send an SM_MON to statd on reuse. If not, the client's host should be
destroyed immediately.

What should lockd do? Reuse? Destroy? Or some other action?


thanks,

Mi Jinlong



2009-12-17 16:19:49

by Chuck Lever III

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart

On Dec 17, 2009, at 5:07 AM, Mi Jinlong wrote:
> Chuck Lever :
>> On Dec 16, 2009, at 5:27 AM, Mi Jinlong wrote:
>>> Chuck Lever:
>>>> On Dec 15, 2009, at 5:02 AM, Mi Jinlong wrote:
>>>>> Hi,
>
> ...snip...
>
>>>>>
>>>>> The Primary Reason:
>>>>>
>>>>> At step 3, when the client's lock reclaim request reaches the
>>>>> server, the client's host (the host struct) is reused but is not
>>>>> re-monitored by the server's lockd. From then on, statd and lockd
>>>>> are out of sync.
>>>>
>>>> The kernel squashes SM_MON upcalls for hosts that it already
>>>> believes
>>>> are monitored. This is a scalability feature.
>>>
>>> When statd starts, it moves files from /var/lib/nfs/statd/sm/ to
>>> /var/lib/nfs/statd/sm.bak/.
>>
>> Well, it's really sm-notify that does this. sm-notify is run by
>> rpc.statd when it starts up.
>>
>> However, sm-notify should only retire the monitor list the first
>> time it
>> is run after a reboot. Simply restarting statd should not change the
>> on-disk monitor list in the slightest. If it does, there's some
>> kind of
>> problem with the way sm-notify's pid file is managed, or perhaps with
>> the nfslock script.
>
> When starting, statd calls the run_sm_notify() function to run
> sm-notify.
> Using the command "service nfslock restart" will cause statd to stop
> and start, so sm-notify will be run. If sm-notify runs, the on-disk
> monitor list will be changed.
>
>>
>>> If lockd doesn't send an SM_MON to statd,
>>> statd will not monitor those clients which were monitored before
>>> statd restarted.
>>>
>>>>> Question:
>>>>>
>>>>> In my opinion, if lockd is allowed to reuse the client's host, it
>>>>> should send an SM_MON to statd on reuse. If not, the client's
>>>>> host should be destroyed immediately.
>>>>>
>>>>> What should lockd do? Reuse? Destroy? Or some other action?
>>>>
>>>> I don't immediately see why lockd should change its behavior.
>>>> Perhaps statd/sm-notify were incorrect to delete the monitor list
>>>> when you restarted the nfslock service?
>>>
>>> Sorry, maybe I did not express it clearly.
>>> I mean that lockd reuses the host struct which was created before
>>> statd restarted.
>>>
>>> It seems the monitor list was deleted when nfslock restarted.
>>
>> lockd does not touch any user space files; the on-disk monitor list
>> is
>> managed by statd and sm-notify. A remote peer rebooting does not
>> clear
>> the "monitored" flag for that peer in the local kernel's lockd, so it
>> won't send another SM_MON request.
>
> Yes, that's right.
>
> But this case refers to the server's lockd, not the remote peer.
> I think that when the local system's nfslock restarts, the local
> kernel's lockd should clear the "monitored" flag in every client's
> host struct.
>
>>
>> Now, it may be the case that "service nfslock start" uses a command
>> line
>> option that forces a fresh sm-notify run, and that is what is
>> wiping the
>> on-disk monitor list. That would be the bug in this case -- sm-
>> notify
>> can and should be allowed to make its own determination of whether
>> the
>> monitor list gets retired. Notification should not normally be
>> forced
>> by command line options in the nfslock script.
>
> A fresh sm-notify run is caused by statd starting.
> I found it in the following code:
>
> utils/statd/statd.c
> ...
>     if (! (run_mode & MODE_NO_NOTIFY))
>         switch (pid = fork()) {
>         case 0:
>             run_sm_notify(out_port);
>             break;
>         case -1:
>             break;
>         default:
>             waitpid(pid, NULL, 0);
>         }
> ...
>
>
> I think that when statd restarts and calls sm-notify, the on-disk
> monitor list is deleted, so lockd should clear the "monitored" flag
> in every client's host struct.
> After that, a reused host struct will be re-monitored and an on-disk
> monitor record will be re-created. That way, lockd and statd stay in
> sync.

run_sm_notify() simply forks and execs the sm-notify program. This
program checks for the existence of a pid file. If the pid file
exists, then sm-notify exits. If it does not, then sm-notify retires
the records in /var/lib/nfs/statd/sm and posts reboot notifications.

Jeff Layton pointed out to me yesterday that Red Hat's nfslock script
unconditionally deletes sm-notify's pid file every time "service
nfslock start" is done, which effectively defeats sm-notify's reboot
detection.

sm-notify was written by a developer at SuSE. SuSE Linux uses a tmpfs
for /var/run, but Red Hat uses permanent storage for this directory.
Thus on SuSE, the pid file gets deleted automatically by a reboot, but
on Red Hat, the pid file must be deleted "by hand" or reboot
notification never occurs.

So the root cause of this problem is that the current mechanism sm-
notify uses to detect a reboot is not portable across distributions.

My new-statd prototype used a semaphor instead of a pid file to detect
reboots. A semaphor is shared (visible to other processes) and will
continue to exist until it is deleted or the system reboots. It is a
resource that is not destroyed automatically when the sm-notify
process exits. If creating the semaphor fails, sm-notify exits. If
creating it succeeds, it runs.
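
For illustration, a minimal sketch of that approach using a named POSIX
semaphore (the name is a placeholder; link with -lrt on older systems):

    #include <fcntl.h>
    #include <semaphore.h>
    #include <stdlib.h>

    #define SM_NOTIFY_SEM "/sm-notify"

    /* A named semaphore persists after the creating process exits, but
     * not across a reboot, so O_EXCL creation succeeds once per boot. */
    static void reboot_gate(void)
    {
        sem_t *sem = sem_open(SM_NOTIFY_SEM, O_CREAT | O_EXCL, 0600, 1);

        if (sem == SEM_FAILED)
            exit(0);        /* already ran since boot */

        sem_close(sem);     /* the name stays visible until reboot */
        /* ... proceed with reboot notification ... */
    }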

Would anyone strongly object to using a semaphor instead of a pid file
here? Is support for semaphors always built into kernels? Would
there be any problems with the small size of the semaphor name space?
Is there another similar facility that might be better?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2009-12-19 16:42:37

by Steve Dickson

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart



On 12/18/2009 10:18 AM, Chuck Lever wrote:
>
> On Dec 17, 2009, at 6:14 PM, Neil Brown wrote:
>
>> On Thu, 17 Dec 2009 11:18:53 -0500
>> Chuck Lever <[email protected]> wrote:
>>
>>> Jeff Layton pointed out to me yesterday that Red Hat's nfslock script
>>> unconditionally deletes sm-notify's pid file every time "service
>>> nfslock start" is done, which effectively defeats sm-notify's reboot
>>> detection.
>>>
>>> sm-notify was written by a developer at SuSE. SuSE Linux uses a tmpfs
>>> for /var/run, but Red Hat uses permanent storage for this directory.
>>> Thus on SuSE, the pid file gets deleted automatically by a reboot, but
>>> on Red Hat, the pid file must be deleted "by hand" or reboot
>>> notification never occurs.
>>
>> Just to make sure the facts are straight:
>> SuSE does not use tmpfs for /var/run (much as I personally think that
>> would be a very sensible approach for both /var/run and /var/locks).
>> It appears that Debian can use tmpfs for these, but doesn't by default.
>>
>> Both SuSE and Debian have boot time scripts that clean up /var/run and
>> other
>> directories. They remove all non-directories other than /var/run/utmp.
>>
>> If Redhat doesn't clean up /var/run at boot time, then I would think
>> that is
>> very odd. The files in there represent something that is running. At
>> boot,
>> nothing is running, so it should all be cleaned up. Are you sure Redhat
>> doesn't clean out /var/run???
>>
>> I just had a look at master.kernel.org (the only fedora machine I can
>> think
>> of that I have access to) and in /etc/rc.d/rc.sysinit I find
>>
>> find /var/lock /var/run ! -type d -exec rm -f {} \;
>>
>> So I'm thinking that if you just remove
>>
>> # Make sure locks are recovered
>> rm -f /var/run/sm-notify.pid
>>
>> from /etc/init.d/nfslock, then it will do the right thing.
>
> Makes sense. Steve, can you look into this for supported releases (like
> F12 and RHEL5)? Or, perhaps you can clarify why that "rm" is required.
I know that at the time I added that code, the pid file was not being
removed, and explicitly removing it caused sm-notify to *always* run,
which, at the time, seemed like the right thing to do... The change was
made in early January of '08, so let me take a look to see if things
have changed...

steved.

2009-12-15 12:41:25

by J. Bruce Fields

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart

On Tue, Dec 15, 2009 at 06:02:11PM +0800, Mi Jinlong wrote:
> Hi,
>
> While testing NLM on the latest kernel (2.6.32), I found a bug.
> When a client holds locks and the server restarts its nfslock service,
> the server's statd falls out of sync with lockd.
> If the server restarts nfslock twice or more, the client's locks are lost.
>
> Test process:
>
> Step 1: the client opens an NFS file.
> Step 2: the client takes a lock with fcntl.
> Step 3: the server restarts its nfslock service.

I don't know what you mean; what did you actually do in step 3?

--b.

>
> After step 3, the server's lockd records that the client holds locks, but
> statd's /var/lib/nfs/statd/sm/ directory is empty, which means statd and
> lockd are out of sync. If the server restarts nfslock again, the client's
> locks are lost.
>
> The Primary Reason:
>
> At step 3, when the client's lock reclaim request reaches the server, the
> client's host (the host struct) is reused but is not re-monitored by the
> server's lockd. From then on, statd and lockd are out of sync.
>
> Question:
>
> In my opinion, if lockd is allowed to reuse the client's host, it should
> send an SM_MON to statd on reuse. If not, the client's host should
> be destroyed immediately.
>
> What should lockd do? Reuse? Destroy? Or some other action?
>
>
> thanks,
>
> Mi Jinlong
>

2009-12-15 15:11:50

by Chuck Lever III

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart

On Dec 15, 2009, at 5:02 AM, Mi Jinlong wrote:
> Hi,
>
> While testing NLM on the latest kernel (2.6.32), I found a bug.
> When a client holds locks and the server restarts its nfslock service,
> the server's statd falls out of sync with lockd.
> If the server restarts nfslock twice or more, the client's locks are lost.
>
> Test process:
>
> Step 1: the client opens an NFS file.
> Step 2: the client takes a lock with fcntl.
> Step 3: the server restarts its nfslock service.

I'll assume here that you mean the equivalent of "service nfslock
restart". This restarts statd and possibly runs sm-notify, but it has
no effect on lockd.

Again, this test seems artificial to me. Is there a real world use
case where someone would deliberately restart statd while an NFS
server is serving files? I pose this question because I've worked on
statd only for a year or so, and I am quite likely ignorant of all the
ways it can be deployed.

> After step 3, the server's lockd records that the client holds locks,
> but statd's /var/lib/nfs/statd/sm/ directory is empty, which means
> statd and lockd are out of sync. If the server restarts nfslock again,
> the client's locks are lost.
>
> The Primary Reason:
>
> At step 3, when the client's lock reclaim request reaches the server,
> the client's host (the host struct) is reused but is not re-monitored
> by the server's lockd. From then on, statd and lockd are out of sync.

The kernel squashes SM_MON upcalls for hosts that it already believes
are monitored. This is a scalability feature.
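
For reference, a rough sketch of that squashing, patterned after the
2.6.32-era fs/lockd/mon.c but abbreviated rather than quoted verbatim:

    /* If the peer's NSM handle is already flagged as monitored, skip
     * the SM_MON upcall. Note that a restart of the *local* statd
     * never clears this flag, which is the behavior at issue here. */
    int nsm_monitor(const struct nlm_host *host)
    {
        struct nsm_handle *nsm = host->h_nsmhandle;

        if (nsm->sm_monitored)
            return 0;       /* squash the duplicate upcall */

        /* ... otherwise perform the SM_MON RPC, and on success: */
        nsm->sm_monitored = 1;
        return 0;
    }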

> Question:
>
> In my opinion, if lockd is allowed to reuse the client's host, it
> should send an SM_MON to statd on reuse. If not, the client's host
> should be destroyed immediately.
>
> What should lockd do? Reuse? Destroy? Or some other action?

I don't immediately see why lockd should change its behavior.
Perhaps statd/sm-notify were incorrect to delete the monitor list when
you restarted the nfslock service?

Can you show exactly how statd's state (i.e. its on-disk monitor list
in /var/lib/nfs/statd/sm) changed across the restart? Did sm-notify
run when you restarted statd? If so, why didn't the sm-notify pid
file stop it?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2009-12-16 09:44:55

by Mi Jinlong

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart



J. Bruce Fields :
> On Tue, Dec 15, 2009 at 06:02:11PM +0800, Mi Jinlong wrote:
>> Hi,
>>
>> While testing NLM on the latest kernel (2.6.32), I found a bug.
>> When a client holds locks and the server restarts its nfslock service,
>> the server's statd falls out of sync with lockd.
>> If the server restarts nfslock twice or more, the client's locks are lost.
>>
>> Test process:
>>
>> Step 1: the client opens an NFS file.
>> Step 2: the client takes a lock with fcntl.
>> Step 3: the server restarts its nfslock service.
>
> I don't know what you mean; what did you actually do in step 3?

I used the command "service nfslock restart" on the server.

I mean that after the server restarts the nfslock service, lockd and statd
are no longer synchronized. After the nfslock restart, the server enters the
grace period, and the client's lock reclaim requests are processed by lockd.
At that point, the client's host (the host struct) created before the
nfslock restart is reused, but lockd does not send an SM_MON to statd.

After the locks are reclaimed, the server's lockd records that the client
holds locks, but statd does not monitor the client.

thanks,
Mi Jinlong

>
> --b.
>
>> After step 3, the server's lockd records that the client holds locks, but
>> statd's /var/lib/nfs/statd/sm/ directory is empty, which means statd and
>> lockd are out of sync. If the server restarts nfslock again, the client's
>> locks are lost.
>>
>> The Primary Reason:
>>
>> At step 3, when the client's lock reclaim request reaches the server, the
>> client's host (the host struct) is reused but is not re-monitored by the
>> server's lockd. From then on, statd and lockd are out of sync.
>>
>> Question:
>>
>> In my opinion, if lockd is allowed to reuse the client's host, it should
>> send an SM_MON to statd on reuse. If not, the client's host should
>> be destroyed immediately.
>>
>> What should lockd do? Reuse? Destroy? Or some other action?
>>
>>
>> thanks,
>>
>> Mi Jinlong
>>
>
>

--
Regards
Mi Jinlong


2009-12-16 10:25:13

by Mi Jinlong

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart



Chuck Lever:
> On Dec 15, 2009, at 5:02 AM, Mi Jinlong wrote:
>> Hi,
>>
>> While testing NLM on the latest kernel (2.6.32), I found a bug.
>> When a client holds locks and the server restarts its nfslock service,
>> the server's statd falls out of sync with lockd.
>> If the server restarts nfslock twice or more, the client's locks are lost.
>>
>> Test process:
>>
>> Step 1: the client opens an NFS file.
>> Step 2: the client takes a lock with fcntl.
>> Step 3: the server restarts its nfslock service.
>
> I'll assume here that you mean the equivalent of "service nfslock
> restart". This restarts statd and possibly runs sm-notify, but it has
> no effect on lockd.

Yes, I used "service nfslock restart".

It affects lockd too: when the service stops, lockd gets a KILL signal.
Lockd then releases all of the clients' locks, enters the grace period,
and waits for the clients to reclaim their locks.
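
As a rough sketch (patterned after the 2.6.32-era main loop in
fs/lockd/svc.c, abbreviated rather than quoted), the signal handling
amounts to:

    /* In lockd()'s service loop: a SIGKILL from the init script drops
     * every client's locks and restarts the grace period so that
     * clients may reclaim them. */
    if (signalled()) {
        flush_signals(current);
        if (nlmsvc_ops) {
            nlmsvc_invalidate_all();    /* release all client locks */
            set_grace_period();         /* accept reclaims again */
        }
    }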

>
> Again, this test seems artificial to me. Is there a real world use case
> where someone would deliberately restart statd while an NFS server is
> serving files? I pose this question because I've worked on statd only
> for a year or so, and I am quite likely ignorant of all the ways it can
> be deployed.

^/^, but maybe someone will restart nfslock while an NFS server is serving
files. It is inevitable.

>
>> After step 3, the server's lockd records that the client holds locks,
>> but statd's /var/lib/nfs/statd/sm/ directory is empty, which means
>> statd and lockd are out of sync. If the server restarts nfslock again,
>> the client's locks are lost.
>>
>> The Primary Reason:
>>
>> At step 3, when the client's lock reclaim request reaches the server,
>> the client's host (the host struct) is reused but is not re-monitored
>> by the server's lockd. From then on, statd and lockd are out of sync.
>
> The kernel squashes SM_MON upcalls for hosts that it already believes
> are monitored. This is a scalability feature.

When statd starts, it moves files from /var/lib/nfs/statd/sm/ to
/var/lib/nfs/statd/sm.bak/. If lockd doesn't send an SM_MON to statd,
statd will not monitor those clients which were monitored before statd
restarted. I'm not sure; is that right?

>
>> Question:
>>
>> In my opinion, if lockd is allowed to reuse the client's host, it should
>> send an SM_MON to statd on reuse. If not, the client's host should
>> be destroyed immediately.
>>
>> What should lockd do? Reuse? Destroy? Or some other action?
>
> I don't immediately see why lockd should change its behavior. Perhaps
> statd/sm-notify were incorrect to delete the monitor list when you
> restarted the nfslock service?

Sorry, maybe I did not express it clearly.
I mean that lockd reuses the host struct which was created before statd
restarted.

It seems the monitor list was deleted when nfslock restarted.

>
> Can you show exactly how statd's state (i.e. its on-disk monitor list in
> /var/lib/nfs/statd/sm) changed across the restart? Did sm-notify run
> when you restarted statd? If so, why didn't the sm-notify pid file stop
> it?
>

The state of statd and lockd on the server across the nfslock restart:

  lockd                 statd       |
                                    |
  host(monitored = 1)   /sm/client  | client gets locks successfully at first
  (locks)                           |
                                    |
  host(monitored = 1)   /sm/client  | nfslock stop (lockd releases the client's locks)
  (no locks)                        |
                                    |
  host(monitored = 1)   /sm/        | nfslock start (client reclaims locks,
  (locks)                           |   but statd doesn't monitor it)

note: host(monitored = 1) means the client's host struct exists and is marked monitored.
      (locks) / (no locks) means the host struct holds locks, or does not.
      /sm/client means there is a file for the client under /var/lib/nfs/statd/sm.
      /sm/ means /var/lib/nfs/statd/sm is empty!


thanks,
Mi Jinlong


2009-12-16 13:49:20

by Jeff Layton

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart

On Wed, 16 Dec 2009 18:27:09 +0800
Mi Jinlong <[email protected]> wrote:

>
> The state of statd and lockd on the server across the nfslock restart:
>
>   lockd                 statd       |
>                                     |
>   host(monitored = 1)   /sm/client  | client gets locks successfully at first
>   (locks)                           |
>                                     |
>   host(monitored = 1)   /sm/client  | nfslock stop (lockd releases the client's locks)
>   (no locks)                        |
>                                     |
>   host(monitored = 1)   /sm/        | nfslock start (client reclaims locks,
>   (locks)                           |   but statd doesn't monitor it)
>
> note: host(monitored = 1) means the client's host struct exists and is marked monitored.
>       (locks) / (no locks) means the host struct holds locks, or does not.
>       /sm/client means there is a file for the client under /var/lib/nfs/statd/sm.
>       /sm/ means /var/lib/nfs/statd/sm is empty!
>
>

Perhaps we ought to clear the cached list of monitored hosts (i.e. set
them all to monitored = 0) when lockd gets a SIGKILL.
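
Something like the following illustrative sketch (names follow the
2.6.32-era fs/lockd/mon.c conventions, but this is not a tested patch):

    /* Walk lockd's cached NSM handles and clear the "monitored" flag,
     * so that the next reuse of a host triggers a fresh SM_MON upcall
     * to the freshly restarted statd. */
    static void nsm_clear_monitored(void)
    {
        struct nsm_handle *nsm;

        spin_lock(&nsm_lock);
        list_for_each_entry(nsm, &nsm_handles, sm_link)
            nsm->sm_monitored = 0;
        spin_unlock(&nsm_lock);
    }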

--
Jeff Layton <[email protected]>

2009-12-16 19:34:23

by Chuck Lever III

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart

On Dec 16, 2009, at 5:27 AM, Mi Jinlong wrote:
> Chuck Lever:
>> On Dec 15, 2009, at 5:02 AM, Mi Jinlong wrote:
>>> Hi,
>>>
>>> While testing NLM on the latest kernel (2.6.32), I found a bug.
>>> When a client holds locks and the server restarts its nfslock service,
>>> the server's statd falls out of sync with lockd.
>>> If the server restarts nfslock twice or more, the client's locks are lost.
>>>
>>> Test process:
>>>
>>> Step 1: the client opens an NFS file.
>>> Step 2: the client takes a lock with fcntl.
>>> Step 3: the server restarts its nfslock service.
>>
>> I'll assume here that you mean the equivalent of "service nfslock
>> restart". This restarts statd and possibly runs sm-notify, but it
>> has
>> no effect on lockd.
>
> Yes, I used "service nfslock restart".
>
> It affects lockd too: when the service stops, lockd gets a KILL signal.
> Lockd then releases all of the clients' locks, enters the grace period,
> and waits for the clients to reclaim their locks.
>
>>
>> Again, this test seems artificial to me. Is there a real world use
>> case
>> where someone would deliberately restart statd while an NFS server is
>> serving files? I pose this question because I've worked on statd
>> only
>> for a year or so, and I am quite likely ignorant of all the ways it
>> can
>> be deployed.
>
> ^/^, but maybe someone will restart nfslock while an NFS server is
> serving files. It is inevitable.
>
>>> After step 3, the server's lockd records that the client holds locks,
>>> but statd's /var/lib/nfs/statd/sm/ directory is empty, which means
>>> statd and lockd are out of sync. If the server restarts nfslock
>>> again, the client's locks are lost.
>>>
>>> The Primary Reason:
>>>
>>> At step 3, when the client's lock reclaim request reaches the server,
>>> the client's host (the host struct) is reused but is not re-monitored
>>> by the server's lockd. From then on, statd and lockd are out of sync.
>>
>> The kernel squashes SM_MON upcalls for hosts that it already believes
>> are monitored. This is a scalability feature.
>
> When statd starts, it moves files from /var/lib/nfs/statd/sm/ to
> /var/lib/nfs/statd/sm.bak/.

Well, it's really sm-notify that does this. sm-notify is run by
rpc.statd when it starts up.

However, sm-notify should only retire the monitor list the first time
it is run after a reboot. Simply restarting statd should not change
the on-disk monitor list in the slightest. If it does, there's some
kind of problem with the way sm-notify's pid file is managed, or
perhaps with the nfslock script.

> If lockd doesn't send an SM_MON to statd,
> statd will not monitor those clients which were monitored before statd
> restarted.
>
>>> Question:
>>>
>>> In my opinion, if lockd is allowed to reuse the client's host, it
>>> should send an SM_MON to statd on reuse. If not, the client's host
>>> should be destroyed immediately.
>>>
>>> What should lockd do? Reuse? Destroy? Or some other action?
>>
>> I don't immediately see why lockd should change its behavior.
>> Perhaps statd/sm-notify were incorrect to delete the monitor list
>> when you restarted the nfslock service?
>
> Sorry, maybe I did not express it clearly.
> I mean that lockd reuses the host struct which was created before statd
> restarted.
>
> It seems the monitor list was deleted when nfslock restarted.

lockd does not touch any user space files; the on-disk monitor list is
managed by statd and sm-notify. A remote peer rebooting does not
clear the "monitored" flag for that peer in the local kernel's lockd,
so it won't send another SM_MON request.

Now, it may be the case that "service nfslock start" uses a command
line option that forces a fresh sm-notify run, and that is what is
wiping the on-disk monitor list. That would be the bug in this case
-- sm-notify can and should be allowed to make its own determination
of whether the monitor list gets retired. Notification should not
normally be forced by command line options in the nfslock script.

>> Can you show exactly how statd's state (i.e. its on-disk monitor
>> list in /var/lib/nfs/statd/sm) changed across the restart? Did
>> sm-notify run when you restarted statd? If so, why didn't the
>> sm-notify pid file stop it?
>
> The state of statd and lockd on the server across the nfslock restart:
>
>   lockd                 statd       |
>                                     |
>   host(monitored = 1)   /sm/client  | client gets locks successfully at first
>   (locks)                           |
>                                     |
>   host(monitored = 1)   /sm/client  | nfslock stop (lockd releases the client's locks)
>   (no locks)                        |
>                                     |
>   host(monitored = 1)   /sm/        | nfslock start (client reclaims locks,
>   (locks)                           |   but statd doesn't monitor it)
>
> note: host(monitored = 1) means the client's host struct exists and is marked monitored.
>       (locks) / (no locks) means the host struct holds locks, or does not.
>       /sm/client means there is a file for the client under /var/lib/nfs/statd/sm.
>       /sm/ means /var/lib/nfs/statd/sm is empty!
>
>
> thanks,
> Mi Jinlong
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2009-12-17 09:32:22

by Mi Jinlong

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart



Jeff Layton :
> On Wed, 16 Dec 2009 18:27:09 +0800
> Mi Jinlong <[email protected]> wrote:
>
>> The state of statd and lockd on the server across the nfslock restart:
>>
>>   lockd                 statd       |
>>                                     |
>>   host(monitored = 1)   /sm/client  | client gets locks successfully at first
>>   (locks)                           |
>>                                     |
>>   host(monitored = 1)   /sm/client  | nfslock stop (lockd releases the client's locks)
>>   (no locks)                        |
>>                                     |
>>   host(monitored = 1)   /sm/        | nfslock start (client reclaims locks,
>>   (locks)                           |   but statd doesn't monitor it)
>>
>> note: host(monitored = 1) means the client's host struct exists and is marked monitored.
>>       (locks) / (no locks) means the host struct holds locks, or does not.
>>       /sm/client means there is a file for the client under /var/lib/nfs/statd/sm.
>>       /sm/ means /var/lib/nfs/statd/sm is empty!
>>
>>
>
> Perhaps we ought to clear the cached list of monitored hosts (i.e. set
> them all to monitored = 0) when lockd gets a SIGKILL.

Yes, if lockd can reuse the host struct, then clearing the cached list of
monitored hosts when lockd gets a SIGKILL is a good way to handle it.

thanks,
Mi Jinlong


2009-12-17 10:05:12

by Mi Jinlong

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart



Chuck Lever :
> On Dec 16, 2009, at 5:27 AM, Mi Jinlong wrote:
>> Chuck Lever:
>>> On Dec 15, 2009, at 5:02 AM, Mi Jinlong wrote:
>>>> Hi,

...snip...

>>>>
>>>> The Primary Reason:
>>>>
>>>> At step 3, when the client's lock reclaim request reaches the server,
>>>> the client's host (the host struct) is reused but is not re-monitored
>>>> by the server's lockd. From then on, statd and lockd are out of sync.
>>>
>>> The kernel squashes SM_MON upcalls for hosts that it already believes
>>> are monitored. This is a scalability feature.
>>
>> When statd starts, it moves files from /var/lib/nfs/statd/sm/ to
>> /var/lib/nfs/statd/sm.bak/.
>
> Well, it's really sm-notify that does this. sm-notify is run by
> rpc.statd when it starts up.
>
> However, sm-notify should only retire the monitor list the first time it
> is run after a reboot. Simply restarting statd should not change the
> on-disk monitor list in the slightest. If it does, there's some kind of
> problem with the way sm-notify's pid file is managed, or perhaps with
> the nfslock script.

When starting, statd calls the run_sm_notify() function to run sm-notify.
Using the command "service nfslock restart" will cause statd to stop and
start, so sm-notify will be run. If sm-notify runs, the on-disk monitor
list will be changed.

>
>> If lockd doesn't send an SM_MON to statd,
>> statd will not monitor those clients which were monitored before statd
>> restarted.
>>
>>>> Question:
>>>>
>>>> In my opinion, if lockd is allowed to reuse the client's host, it
>>>> should send an SM_MON to statd on reuse. If not, the client's host
>>>> should be destroyed immediately.
>>>>
>>>> What should lockd do? Reuse? Destroy? Or some other action?
>>>
>>> I don't immediately see why lockd should change its behavior. Perhaps
>>> statd/sm-notify were incorrect to delete the monitor list when you
>>> restarted the nfslock service?
>>
>> Sorry, maybe I did not express it clearly.
>> I mean that lockd reuses the host struct which was created before statd
>> restarted.
>>
>> It seems the monitor list was deleted when nfslock restarted.
>
> lockd does not touch any user space files; the on-disk monitor list is
> managed by statd and sm-notify. A remote peer rebooting does not clear
> the "monitored" flag for that peer in the local kernel's lockd, so it
> won't send another SM_MON request.

Yes, that's right.

But this case refers to the server's lockd, not the remote peer.
I think that when the local system's nfslock restarts, the local kernel's
lockd should clear the "monitored" flag in every client's host struct.

>
> Now, it may be the case that "service nfslock start" uses a command line
> option that forces a fresh sm-notify run, and that is what is wiping the
> on-disk monitor list. That would be the bug in this case -- sm-notify
> can and should be allowed to make its own determination of whether the
> monitor list gets retired. Notification should not normally be forced
> by command line options in the nfslock script.

A fresh sm-notify run is caused by statd starting.
I found it in the following code:

utils/statd/statd.c
...
    if (! (run_mode & MODE_NO_NOTIFY))
        switch (pid = fork()) {
        case 0:
            run_sm_notify(out_port);
            break;
        case -1:
            break;
        default:
            waitpid(pid, NULL, 0);
        }
...


I think that when statd restarts and calls sm-notify, the on-disk monitor
list is deleted, so lockd should clear the "monitored" flag in every
client's host struct. After that, a reused host struct will be
re-monitored and an on-disk monitor record will be re-created. That way,
lockd and statd stay in sync.


thanks,
Mi Jinlong


2009-12-17 20:14:34

by J. Bruce Fields

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart

On Thu, Dec 17, 2009 at 11:18:53AM -0500, Chuck Lever wrote:
> run_sm_notify() simply forks and execs the sm-notify program. This
> program checks for the existence of a pid file. If the pid file exists,
> then sm-notify exits. If it does not, then sm-notify retires the records
> in /var/lib/nfs/statd/sm and posts reboot notifications.
>
> Jeff Layton pointed out to me yesterday that Red Hat's nfslock script
> unconditionally deletes sm-notify's pid file every time "service nfslock
> start" is done, which effectively defeats sm-notify's reboot detection.
>
> sm-notify was written by a developer at SuSE. SuSE Linux uses a tmpfs
> for /var/run, but Red Hat uses permanent storage for this directory.
> Thus on SuSE, the pid file gets deleted automatically by a reboot, but
> on Red Hat, the pid file must be deleted "by hand" or reboot
> notification never occurs.
>
> So the root cause of this problem is that the current mechanism sm-
> notify uses to detect a reboot is not portable across distributions.
>
> My new-statd prototype used a semaphor instead of a pid file to detect
> reboots. A semaphor is shared (visible to other processes) and will
> continue to exist until it is deleted or the system reboots. It is a
> resource that is not destroyed automatically when the sm-notify process
> exits. If creating the semaphor fails, sm-notify exits. If creating it
> succeeds, it runs.
>
> Would anyone strongly object to using a semaphor instead of a pid file
> here? Is support for semaphors always built into kernels? Would there
> be any problems with the small size of the semaphor name space? Is there
> another similar facility that might be better?

I don't know much about those (except that I think there's an e at the
end); looks like sem_overview(7) is the place to start?

It says:

" Prior to kernel 2.6, Linux only supported unnamed,
thread-shared sema=E2=80=90 phores. On a system with Linux 2.6 and =
a
glibc that provides the NPTL threading implementation, a
complete implementation of POSIX semaphores is provided."

So would it mean dropping support for 2.4?

--b.

2009-12-17 20:27:56

by Trond Myklebust

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart

On Thu, 2009-12-17 at 11:18 -0500, Chuck Lever wrote:
> On Dec 17, 2009, at 5:07 AM, Mi Jinlong wrote:
> > Chuck Lever :
> >> On Dec 16, 2009, at 5:27 AM, Mi Jinlong wrote:
> >>> Chuck Lever:
> >>>> On Dec 15, 2009, at 5:02 AM, Mi Jinlong wrote:
> >>>>> Hi,
> >
> > ...snip...
> >
> >>>>>
> >>>>> The Primary Reason:
> >>>>>
> >>>>> At step 3, when the client's lock reclaim request reaches the
> >>>>> server, the client's host (the host struct) is reused but is not
> >>>>> re-monitored by the server's lockd. From then on, statd and lockd
> >>>>> are out of sync.
> >>>>
> >>>> The kernel squashes SM_MON upcalls for hosts that it already
> >>>> believes
> >>>> are monitored. This is a scalability feature.
> >>>
> >>> When statd starts, it moves files from /var/lib/nfs/statd/sm/ to
> >>> /var/lib/nfs/statd/sm.bak/.
> >>
> >> Well, it's really sm-notify that does this. sm-notify is run by
> >> rpc.statd when it starts up.
> >>
> >> However, sm-notify should only retire the monitor list the first
> >> time it
> >> is run after a reboot. Simply restarting statd should not change the
> >> on-disk monitor list in the slightest. If it does, there's some
> >> kind of
> >> problem with the way sm-notify's pid file is managed, or perhaps with
> >> the nfslock script.
> >
> > When starting, statd calls the run_sm_notify() function to run
> > sm-notify.
> > Using the command "service nfslock restart" will cause statd to stop
> > and start, so sm-notify will be run. If sm-notify runs, the on-disk
> > monitor list will be changed.
> >
> >>
> >>> If lockd doesn't send an SM_MON to statd,
> >>> statd will not monitor those clients which were monitored before
> >>> statd restarted.
> >>>
> >>>>> Question:
> >>>>>
> >>>>> In my opinion, if lockd is allowed to reuse the client's host, it
> >>>>> should send an SM_MON to statd on reuse. If not, the client's
> >>>>> host should be destroyed immediately.
> >>>>>
> >>>>> What should lockd do? Reuse? Destroy? Or some other action?
> >>>>
> >>>> I don't immediately see why lockd should change its behavior.
> >>>> Perhaps statd/sm-notify were incorrect to delete the monitor list
> >>>> when you restarted the nfslock service?
> >>>
> >>> Sorry, maybe I did not express it clearly.
> >>> I mean that lockd reuses the host struct which was created before
> >>> statd restarted.
> >>>
> >>> It seems the monitor list was deleted when nfslock restarted.
> >>
> >> lockd does not touch any user space files; the on-disk monitor list
> >> is
> >> managed by statd and sm-notify. A remote peer rebooting does not
> >> clear
> >> the "monitored" flag for that peer in the local kernel's lockd, so it
> >> won't send another SM_MON request.
> >
> > Yes, that's right.
> >
> > But this case refers to the server's lockd, not the remote peer.
> > I think that when the local system's nfslock restarts, the local
> > kernel's lockd should clear the "monitored" flag in every client's
> > host struct.
> >
> >>
> >> Now, it may be the case that "service nfslock start" uses a command
> >> line
> >> option that forces a fresh sm-notify run, and that is what is
> >> wiping the
> >> on-disk monitor list. That would be the bug in this case -- sm-
> >> notify
> >> can and should be allowed to make its own determination of whether
> >> the
> >> monitor list gets retired. Notification should not normally be
> >> forced
> >> by command line options in the nfslock script.
> >
> > A fresh sm-notify run is caused by statd starting.
> > I found it in the following code:
> >
> > utils/statd/statd.c
> > ...
> >     if (! (run_mode & MODE_NO_NOTIFY))
> >         switch (pid = fork()) {
> >         case 0:
> >             run_sm_notify(out_port);
> >             break;
> >         case -1:
> >             break;
> >         default:
> >             waitpid(pid, NULL, 0);
> >         }
> > ...
> >
> >
> > I think that when statd restarts and calls sm-notify, the on-disk
> > monitor list is deleted, so lockd should clear the "monitored" flag
> > in every client's host struct.
> > After that, a reused host struct will be re-monitored and an on-disk
> > monitor record will be re-created. That way, lockd and statd stay in
> > sync.
>
> run_sm_notify() simply forks and execs the sm-notify program. This
> program checks for the existence of a pid file. If the pid file
> exists, then sm-notify exits. If it does not, then sm-notify retires
> the records in /var/lib/nfs/statd/sm and posts reboot notifications.
>
> Jeff Layton pointed out to me yesterday that Red Hat's nfslock script
> unconditionally deletes sm-notify's pid file every time "service
> nfslock start" is done, which effectively defeats sm-notify's reboot
> detection.
>
> sm-notify was written by a developer at SuSE. SuSE Linux uses a tmpfs
> for /var/run, but Red Hat uses permanent storage for this directory.
> Thus on SuSE, the pid file gets deleted automatically by a reboot, but
> on Red Hat, the pid file must be deleted "by hand" or reboot
> notification never occurs.
>
> So the root cause of this problem is that the current mechanism sm-
> notify uses to detect a reboot is not portable across distributions.
>
> My new-statd prototype used a semaphor instead of a pid file to detect
> reboots. A semaphor is shared (visible to other processes) and will
> continue to exist until it is deleted or the system reboots. It is a
> resource that is not destroyed automatically when the sm-notify
> process exits. If creating the semaphor fails, sm-notify exits. If
> creating it succeeds, it runs.
>
> Would anyone strongly object to using a semaphor instead of a pid file
> here? Is support for semaphors always built into kernels? Would
> there be any problems with the small size of the semaphor name space?
> Is there another similar facility that might be better?
>

One alternative might be to just record the kernel's random boot_id in
the pid file. That gets regenerated on each boot, so should be unique.

Trond


2009-12-17 20:35:44

by Chuck Lever III

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart


On Dec 17, 2009, at 3:27 PM, Trond Myklebust wrote:

> On Thu, 2009-12-17 at 11:18 -0500, Chuck Lever wrote:
>> On Dec 17, 2009, at 5:07 AM, Mi Jinlong wrote:
>>> Chuck Lever :
>>>> On Dec 16, 2009, at 5:27 AM, Mi Jinlong wrote:
>>>>> Chuck Lever:
>>>>>> On Dec 15, 2009, at 5:02 AM, Mi Jinlong wrote:
>>>>>>> Hi,
>>>
>>> ...snip...
>>>
>>>>>>>
>>>>>>> The Primary Reason:
>>>>>>>
>>>>>>> At step 3, when the client's lock reclaim request reaches the
>>>>>>> server, the client's host (the host struct) is reused but is not
>>>>>>> re-monitored by the server's lockd. From then on, statd and
>>>>>>> lockd are out of sync.
>>>>>>
>>>>>> The kernel squashes SM_MON upcalls for hosts that it already
>>>>>> believes
>>>>>> are monitored. This is a scalability feature.
>>>>>
>>>>> When statd starts, it moves files from /var/lib/nfs/statd/sm/ to
>>>>> /var/lib/nfs/statd/sm.bak/.
>>>>
>>>> Well, it's really sm-notify that does this. sm-notify is run by
>>>> rpc.statd when it starts up.
>>>>
>>>> However, sm-notify should only retire the monitor list the first
>>>> time it
>>>> is run after a reboot. Simply restarting statd should not change
>>>> the
>>>> on-disk monitor list in the slightest. If it does, there's some
>>>> kind of
>>>> problem with the way sm-notify's pid file is managed, or perhaps
>>>> with
>>>> the nfslock script.
>>>
>>> When starting, statd calls the run_sm_notify() function to run
>>> sm-notify.
>>> Using the command "service nfslock restart" will cause statd to stop
>>> and start, so sm-notify will be run. If sm-notify runs, the on-disk
>>> monitor list will be changed.
>>>
>>>>
>>>>> If lockd doesn't send an SM_MON to statd,
>>>>> statd will not monitor those clients which were monitored before
>>>>> statd restarted.
>>>>>
>>>>>>> Question:
>>>>>>>
>>>>>>> In my opinion, if lockd is allowed to reuse the client's host,
>>>>>>> it should send an SM_MON to statd on reuse. If not, the
>>>>>>> client's host should be destroyed immediately.
>>>>>>>
>>>>>>> What should lockd do? Reuse? Destroy? Or some other action?
>>>>>>
>>>>>> I don't immediately see why lockd should change its behavior.
>>>>>> Perhaps statd/sm-notify were incorrect to delete the monitor
>>>>>> list when you restarted the nfslock service?
>>>>>
>>>>> Sorry, maybe I did not express it clearly.
>>>>> I mean that lockd reuses the host struct which was created before
>>>>> statd restarted.
>>>>>
>>>>> It seems the monitor list was deleted when nfslock restarted.
>>>>
>>>> lockd does not touch any user space files; the on-disk monitor list
>>>> is
>>>> managed by statd and sm-notify. A remote peer rebooting does not
>>>> clear
>>>> the "monitored" flag for that peer in the local kernel's lockd,
>>>> so it
>>>> won't send another SM_MON request.
>>>
>>> Yes, that's right.
>>>
>>> But this case refers to the server's lockd, not the remote peer.
>>> I think that when the local system's nfslock restarts, the local
>>> kernel's lockd should clear the "monitored" flag in every client's
>>> host struct.
>>>
>>>>
>>>> Now, it may be the case that "service nfslock start" uses a command
>>>> line
>>>> option that forces a fresh sm-notify run, and that is what is
>>>> wiping the
>>>> on-disk monitor list. That would be the bug in this case -- sm-
>>>> notify
>>>> can and should be allowed to make its own determination of whether
>>>> the
>>>> monitor list gets retired. Notification should not normally be
>>>> forced
>>>> by command line options in the nfslock script.
>>>
>>> A fresh sm-notify run is caused by statd starting.
>>> I found it in the following code:
>>>
>>> utils/statd/statd.c
>>> ...
>>>     if (! (run_mode & MODE_NO_NOTIFY))
>>>         switch (pid = fork()) {
>>>         case 0:
>>>             run_sm_notify(out_port);
>>>             break;
>>>         case -1:
>>>             break;
>>>         default:
>>>             waitpid(pid, NULL, 0);
>>>         }
>>> ...
>>>
>>>
>>> I think that when statd restarts and calls sm-notify, the on-disk
>>> monitor list is deleted, so lockd should clear the "monitored" flag
>>> in every client's host struct.
>>> After that, a reused host struct will be re-monitored and an on-disk
>>> monitor record will be re-created. That way, lockd and statd stay in
>>> sync.
>>
>> run_sm_notify() simply forks and execs the sm-notify program. This
>> program checks for the existence of a pid file. If the pid file
>> exists, then sm-notify exits. If it does not, then sm-notify retires
>> the records in /var/lib/nfs/statd/sm and posts reboot notifications.
>>
>> Jeff Layton pointed out to me yesterday that Red Hat's nfslock script
>> unconditionally deletes sm-notify's pid file every time "service
>> nfslock start" is done, which effectively defeats sm-notify's reboot
>> detection.
>>
>> sm-notify was written by a developer at SuSE. SuSE Linux uses a
>> tmpfs
>> for /var/run, but Red Hat uses permanent storage for this directory.
>> Thus on SuSE, the pid file gets deleted automatically by a reboot,
>> but
>> on Red Hat, the pid file must be deleted "by hand" or reboot
>> notification never occurs.
>>
>> So the root cause of this problem is that the current mechanism sm-
>> notify uses to detect a reboot is not portable across distributions.
>>
>> My new-statd prototype used a semaphor instead of a pid file to
>> detect
>> reboots. A semaphor is shared (visible to other processes) and will
>> continue to exist until it is deleted or the system reboots. It is a
>> resource that is not destroyed automatically when the sm-notify
>> process exits. If creating the semaphor fails, sm-notify exits. If
>> creating it succeeds, it runs.
>>
>> Would anyone strongly object to using a semaphor instead of a pid
>> file
>> here? Is support for semaphors always built into kernels? Would
>> there be any problems with the small size of the semaphor name space?
>> Is there another similar facility that might be better?
>
> One alternative might be to just record the kernel's random boot_id in
> the pid file. That gets regenerated on each boot, so should be unique.

Where do you get it in user space? Is it available on earlier
kernels? ("should be unique" -- I hope it doesn't have the same
problem we had with XID replay on diskless systems).

Fwiw, I tried using the boot time stamp at one point, but
unfortunately that's adjusted by the ntp offset, so it can take
different values over time. It was difficult to compare it to a time
stamp recorded in a file.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2009-12-17 20:37:12

by Chuck Lever III

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart

On Dec 17, 2009, at 3:14 PM, J. Bruce Fields wrote:
> On Thu, Dec 17, 2009 at 11:18:53AM -0500, Chuck Lever wrote:
>> run_sm_notify() simply forks and execs the sm-notify program. This
>> program checks for the existence of a pid file. If the pid file
>> exists, then sm-notify exits. If it does not, then sm-notify retires
>> the records in /var/lib/nfs/statd/sm and posts reboot notifications.
>>
>> Jeff Layton pointed out to me yesterday that Red Hat's nfslock script
>> unconditionally deletes sm-notify's pid file every time "service
>> nfslock start" is done, which effectively defeats sm-notify's reboot
>> detection.
>>
>> sm-notify was written by a developer at SuSE. SuSE Linux uses a tmpfs
>> for /var/run, but Red Hat uses permanent storage for this directory.
>> Thus on SuSE, the pid file gets deleted automatically by a reboot, but
>> on Red Hat, the pid file must be deleted "by hand" or reboot
>> notification never occurs.
>>
>> So the root cause of this problem is that the current mechanism sm-
>> notify uses to detect a reboot is not portable across distributions.
>>
>> My new-statd prototype used a semaphor instead of a pid file to detect
>> reboots. A semaphor is shared (visible to other processes) and will
>> continue to exist until it is deleted or the system reboots. It is a
>> resource that is not destroyed automatically when the sm-notify
>> process exits. If creating the semaphor fails, sm-notify exits. If
>> creating it succeeds, it runs.
>>
>> Would anyone strongly object to using a semaphor instead of a pid file
>> here? Is support for semaphors always built into kernels? Would there
>> be any problems with the small size of the semaphor name space? Is
>> there another similar facility that might be better?
>
> I don't know much about those (except that I think there's an e at the
> end); looks like sem_overview(7) is the place to start?
>
> It says:
>
>	" Prior to kernel 2.6, Linux only supported unnamed,
>	thread-shared semaphores. On a system with Linux 2.6 and a
>	glibc that provides the NPTL threading implementation, a
>	complete implementation of POSIX semaphores is provided."
>
> So would it mean dropping support for 2.4?

No, it would mean using them only on systems that supported shared
semaphores.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2009-12-17 20:48:13

by Trond Myklebust

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart

On Thu, 2009-12-17 at 15:34 -0500, Chuck Lever wrote:
> On Dec 17, 2009, at 3:27 PM, Trond Myklebust wrote:
> > One alternative might be to just record the kernel's random boot_id in
> > the pid file. That gets regenerated on each boot, so should be unique.
>
> Where do you get it in user space? Is it available on earlier
> kernels? ("should be unique" -- I hope it doesn't have the same
> problem we had with XID replay on diskless systems).

You can access it from userland as the 'kernel.random.boot_id' sysctl.

It is available on 2.4 kernels and newer.

It is based on the kernel random number generator, so should be
reasonably unique.
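
To sketch the comparison (illustrative only; the helper is invented for
this example, and sm-notify would store the boot_id alongside its pid):

    #include <stdio.h>
    #include <string.h>

    #define BOOT_ID_PATH "/proc/sys/kernel/random/boot_id"

    /* Returns 1 if the boot_id recorded at the last sm-notify run still
     * matches the running kernel's boot_id, i.e. no reboot happened. */
    static int same_boot(const char *recorded)
    {
        char current[64] = "";
        FILE *fp = fopen(BOOT_ID_PATH, "r");

        if (fp != NULL) {
            if (fgets(current, sizeof(current), fp) == NULL)
                current[0] = '\0';
            fclose(fp);
        }
        current[strcspn(current, "\n")] = '\0';
        return strcmp(current, recorded) == 0;
    }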

> Fwiw, I tried using the boot time stamp at one point, but
> unfortunately that's adjusted by the ntp offset, so it can take
> different values over time. It was difficult to compare it to a time
> stamp recorded in a file.

Agreed. You can't rely on time stamps.

Trond


2009-12-17 23:14:50

by NeilBrown

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart

On Thu, 17 Dec 2009 11:18:53 -0500
Chuck Lever <[email protected]> wrote:

> Jeff Layton pointed out to me yesterday that Red Hat's nfslock script
> unconditionally deletes sm-notify's pid file every time "service
> nfslock start" is done, which effectively defeats sm-notify's reboot
> detection.
>
> sm-notify was written by a developer at SuSE. SuSE Linux uses a tmpfs
> for /var/run, but Red Hat uses permanent storage for this directory.
> Thus on SuSE, the pid file gets deleted automatically by a reboot, but
> on Red Hat, the pid file must be deleted "by hand" or reboot
> notification never occurs.

Just to make sure the facts are straight:
SuSE does not use tmpfs for /var/run (much as I personally think that
would be a very sensible approach for both /var/run and /var/locks).
It appears that Debian can use tmpfs for these, but doesn't by default.

Both SuSE and Debian have boot time scripts that clean up /var/run and other
directories. They remove all non-directories other than /var/run/utmp.

If Redhat doesn't clean up /var/run at boot time, then I would think that is
very odd. The files in there represent something that is running. At boot,
nothing is running, so it should all be cleaned up. Are you sure Redhat
doesn't clean out /var/run???

I just had a look at master.kernel.org (the only fedora machine I can think
of that I have access to) and in /etc/rc.d/rc.sysinit I find

find /var/lock /var/run ! -type d -exec rm -f {} \;

So I'm thinking that if you just remove

# Make sure locks are recovered
rm -f /var/run/sm-notify.pid

from /etc/init.d/nfslock, then it will do the right thing.

NeilBrown


2009-12-18 15:19:44

by Chuck Lever III

Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart


On Dec 17, 2009, at 6:14 PM, Neil Brown wrote:

> On Thu, 17 Dec 2009 11:18:53 -0500
> Chuck Lever <[email protected]> wrote:
>
>> Jeff Layton pointed out to me yesterday that Red Hat's nfslock script
>> unconditionally deletes sm-notify's pid file every time "service
>> nfslock start" is done, which effectively defeats sm-notify's reboot
>> detection.
>>
>> sm-notify was written by a developer at SuSE. SuSE Linux uses a
>> tmpfs
>> for /var/run, but Red Hat uses permanent storage for this directory.
>> Thus on SuSE, the pid file gets deleted automatically by a reboot,
>> but
>> on Red Hat, the pid file must be deleted "by hand" or reboot
>> notification never occurs.
>
> Just to make sure the facts are straight:
> SuSE does not use tmpfs for /var/run (much as I personally think that
> would be a very sensible approach for both /var/run and /var/locks).
> It appears that Debian can use tmpfs for these, but doesn't by
> default.
>
> Both SuSE and Debian have boot time scripts that clean up /var/run
> and other
> directories. They remove all non-directories other than /var/run/
> utmp.
>
> If Redhat doesn't clean up /var/run at boot time, then I would think
> that is
> very odd. The files in there represent something that is running.
> At boot,
> nothing is running, so it should all be cleaned up. Are you sure
> Redhat
> doesn't clean out /var/run???
>
> I just had a look at master.kernel.org (the only fedora machine I
> can think
> of that I have access to) and in /etc/rc.d/rc.sysinit I find
>
> find /var/lock /var/run ! -type d -exec rm -f {} \;
>
> So I'm thinking that if you just remove
>
> # Make sure locks are recovered
> rm -f /var/run/sm-notify.pid
>
> from /etc/init.d/nfslock, then it will do the right thing.

Makes sense. Steve, can you look into this for supported releases
(like F12 and RHEL5)? Or, perhaps you can clarify why that "rm" is
required.

Meanwhile, I'm going to prototype a mechanism that tries to use the
kernel's boot_id, if present.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com