From: Trond Myklebust
Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart
Date: Thu, 17 Dec 2009 15:27:44 -0500
Message-ID: <1261081664.4080.18.camel@localhost>
References: <4B275EA3.9030603@cn.fujitsu.com> <4B28B5FD.5000103@cn.fujitsu.com> <4B2A02C6.6080501@cn.fujitsu.com> <35D45F43-D98F-460E-8060-F7C5F3ADFCFE@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Cc: "J. Bruce Fields", Neil Brown, Steve Dickson, NFSv3 list, Mi Jinlong
To: Chuck Lever
Received: from mail-out1.uio.no ([129.240.10.57]:44177 "EHLO mail-out1.uio.no" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1764915AbZLQU14; Thu, 17 Dec 2009 15:27:56 -0500
In-Reply-To: <35D45F43-D98F-460E-8060-F7C5F3ADFCFE@oracle.com>
Sender: linux-nfs-owner@vger.kernel.org

On Thu, 2009-12-17 at 11:18 -0500, Chuck Lever wrote:
> On Dec 17, 2009, at 5:07 AM, Mi Jinlong wrote:
> > Chuck Lever:
> >> On Dec 16, 2009, at 5:27 AM, Mi Jinlong wrote:
> >>> Chuck Lever:
> >>>> On Dec 15, 2009, at 5:02 AM, Mi Jinlong wrote:
> >>>>> Hi,
> >
> > ...snip...
> >
> >>>>>
> >>>>> The Primary Reason:
> >>>>>
> >>>>> At step3, when client's reclaimed lock request is sent to server,
> >>>>> client's host(the host struct) is reused but not be re-monitored
> >>>>> at server's lockd. After that, statd and lockd are not sync.
> >>>>
> >>>> The kernel squashes SM_MON upcalls for hosts that it already
> >>>> believes are monitored. This is a scalability feature.
> >>>
> >>> When statd start, it will move files from /var/lib/nfs/statd/sm/ to
> >>> /var/lib/nfs/statd/sm.bak/.
> >>
> >> Well, it's really sm-notify that does this. sm-notify is run by
> >> rpc.statd when it starts up.
> >>
> >> However, sm-notify should only retire the monitor list the first
> >> time it is run after a reboot. Simply restarting statd should not
> >> change the on-disk monitor list in the slightest.
> >> If it does, there's some kind of
> >> problem with the way sm-notify's pid file is managed, or perhaps with
> >> the nfslock script.
> >
> > When starting, statd will call the run_sm_notify() function to run
> > sm-notify.
> > Using the command "service nfslock restart" will cause statd to stop
> > and start, so sm-notify will be run. If sm-notify runs, the on-disk
> > monitor list will be changed.
> >
> >>
> >>> If lockd doesn't send a SM_MON to statd,
> >>> statd will not monitor those clients which were monitored before
> >>> statd restarted.
> >>>
> >>>>> Question:
> >>>>>
> >>>>> In my opinion, if lockd is allowed to reuse the client's host, it
> >>>>> should send a SM_MON to statd on reuse. If not allowed, the
> >>>>> client's host should be destroyed immediately.
> >>>>>
> >>>>> What should lockd do? Reuse? Destroy? Or some other action?
> >>>>
> >>>> I don't immediately see why lockd should change its behavior.
> >>>> Perhaps statd/sm-notify were incorrect to delete the monitor list
> >>>> when you restarted the nfslock service?
> >>>
> >>> Sorry, maybe I did not express this clearly.
> >>> I mean, lockd reuses the host struct which was created before statd
> >>> restarted.
> >>>
> >>> It seems the monitor list was deleted when nfslock restarted.
> >>
> >> lockd does not touch any user space files; the on-disk monitor list
> >> is managed by statd and sm-notify. A remote peer rebooting does not
> >> clear the "monitored" flag for that peer in the local kernel's
> >> lockd, so it won't send another SM_MON request.
> >
> > Yes, that's right.
> >
> > But, this case refers to server's lockd, not the remote peer.
> > I think, when local system's nfslock restart, local kernel's lockd
> > clear all other client's host struct's "monitored" flag.
> >
> >>
> >> Now, it may be the case that "service nfslock start" uses a command
> >> line option that forces a fresh sm-notify run, and that is what is
> >> wiping the on-disk monitor list. That would be the bug in this
> >> case -- sm-notify can and should be allowed to make its own
> >> determination of whether the monitor list gets retired.
> >> Notification should not normally be forced by command line options
> >> in the nfslock script.
> >
> > A fresh sm-notify run is caused by statd start.
> > I found it in the following code.
> >
> > utils/statd/statd.c
> > ...
> >         if (! (run_mode & MODE_NO_NOTIFY))
> >                 switch (pid = fork()) {
> >                 case 0:
> >                         run_sm_notify(out_port);
> >                         break;
> >                 case -1:
> >                         break;
> >                 default:
> >                         waitpid(pid, NULL, 0);
> >                 }
> > ....
> >
> >
> > I think, when statd restarts and calls sm-notify, the on-disk monitor
> > list will be deleted, so lockd should clear all other client's host
> > struct's "monitored" flag.
> > After that, a reused host struct will be re-monitored, and an on-disk
> > monitor will be re-created. Like that, lockd and statd will sync.
>
> run_sm_notify() simply forks and execs the sm-notify program. This
> program checks for the existence of a pid file. If the pid file
> exists, then sm-notify exits. If it does not, then sm-notify retires
> the records in /var/lib/nfs/statd/sm and posts reboot notifications.
>
> Jeff Layton pointed out to me yesterday that Red Hat's nfslock script
> unconditionally deletes sm-notify's pid file every time "service
> nfslock start" is done, which effectively defeats sm-notify's reboot
> detection.
>
> sm-notify was written by a developer at SuSE. SuSE Linux uses a tmpfs
> for /var/run, but Red Hat uses permanent storage for this directory.
> Thus on SuSE, the pid file gets deleted automatically by a reboot, but
> on Red Hat, the pid file must be deleted "by hand" or reboot
> notification never occurs.
>
> So the root cause of this problem is that the current mechanism
> sm-notify uses to detect a reboot is not portable across distributions.
>
> My new-statd prototype used a semaphore instead of a pid file to detect
> reboots. A semaphore is shared (visible to other processes) and will
> continue to exist until it is deleted or the system reboots. It is a
> resource that is not destroyed automatically when the sm-notify
> process exits. If creating the semaphore fails, sm-notify exits. If
> creating it succeeds, it runs.
>
> Would anyone strongly object to using a semaphore instead of a pid file
> here? Is support for semaphores always built into kernels? Would
> there be any problems with the small size of the semaphore name space?
> Is there another similar facility that might be better?
>

One alternative might be to just record the kernel's random boot_id in
the pid file. That gets regenerated on each boot, so should be unique.

Trond
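
A minimal sketch of the boot_id idea above, for illustration only. It is not taken from nfs-utils: the function names, the return-value convention, and the buffer size are all hypothetical. The only real interface assumed is the kernel's /proc/sys/kernel/random/boot_id file. The stamp file stands in for sm-notify's pid file. The sketch ignores locking, so two concurrent runs could both decide they are first.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define BOOT_ID_MAX 64  /* hypothetical; a boot_id is a 36-character UUID */

/* Read the first line of `path` into `buf`, dropping the newline. */
static bool read_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    bool ok;

    if (f == NULL)
        return false;
    ok = fgets(buf, (int)len, f) != NULL;
    fclose(f);
    if (ok)
        buf[strcspn(buf, "\n")] = '\0';
    return ok;
}

/* Record `id` in the stamp file, replacing any previous contents. */
static bool write_stamp(const char *stamp_path, const char *id)
{
    FILE *f = fopen(stamp_path, "w");
    bool ok;

    if (f == NULL)
        return false;
    ok = fprintf(f, "%s\n", id) > 0;
    if (fclose(f) != 0)
        ok = false;
    return ok;
}

/*
 * Hypothetical helper: returns 1 if this is the first run since boot
 * (the stamp is missing or holds a different boot_id), 0 if the stamp
 * already matches the current boot_id, and -1 on error.
 */
int first_run_since_boot(const char *boot_id_path, const char *stamp_path)
{
    char current[BOOT_ID_MAX];
    char saved[BOOT_ID_MAX];

    if (!read_line(boot_id_path, current, sizeof(current)))
        return -1;
    if (read_line(stamp_path, saved, sizeof(saved)) &&
        strcmp(saved, current) == 0)
        return 0;   /* same boot: skip retiring the monitor list */
    return write_stamp(stamp_path, current) ? 1 : -1;
}
```

A caller would pass "/proc/sys/kernel/random/boot_id" and the path of the stamp file, and send reboot notifications only when the result is 1. Because the decision depends on the stored value and not on whether the file exists, it should behave the same whether /var/run is a tmpfs or on permanent storage.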