From: Chuck Lever
Subject: Re: AutoFS+NFSv4 server down = LOOOOONG timeout.
Date: Thu, 27 Aug 2009 10:54:27 -0400
To: Trond Myklebust
Cc: Ian Kent, NFS list, Linux NFSv4 mailing list
In-Reply-To: <1251384739.5173.7.camel@heimdal.trondhjem.org>

On Aug 27, 2009, at 10:52 AM, Trond Myklebust wrote:

> On Thu, 2009-08-27 at 10:38 -0400, Chuck Lever wrote:
>> On Aug 27, 2009, at 4:54 AM, Ian Kent wrote:
>>> Ian Kent wrote:
>>>> Carlos André wrote:
>>>>> Hi Ian,
>>>>>
>>>>> Thanks for patch and sorry for delay (i'm expecting receive u reply on
>>>>> bug track, not here) :)
>>>>>
>>>>> But, this patch doesnt worked to me like expected... :(
>>>>>
>>>>>
>>>>> Firstly I've changed "#MOUNT_WAIT=-1" to "MOUNT_WAIT=10"
>>>>> and later changed "10" to "2" with same results...
>>>>> (always restarting service, of course :)
>>>>>
>>>>> Then, tried remove "sec=krb5p", and later removed "nfs4" but i got
>>>>> same results again.
>>>>>
>>>>> Or i'm doing something wrong?
>>>>>
>>>>>
>>>>> [root@KSTATION areas]# automount -V
>>>>>
>>>>> Linux automount version 5.0.1-0.rc2.131.bz517349.1
>>>>> [...]
>>>>>
>>>>> [root@KSTATION areas]# time ls -la testdown
>>>>> ls: testedown: No such file or directory
>>>>>
>>>>> real 3m9.006s
>>>>> user 0m0.002s
>>>>> sys 0m0.000s
>>>>
>>>> OK, that isn't behaving the way I expect, I'll have a look.
>>>>
>>>>>
>>>>> LOGGING:
>>>>> -----------------------------------------
>>>>> Aug 24 09:23:51 KSTATION automount[20803]: mount_mount: mount(nfs):
>>>>> calling mount -t nfs4 -s -o rw,acl,sec=krb5p 1.2.3.4:/areas/testdown
>>>>> /misc/areas/testdown
>>>>> Aug 24 09:27:00 KSTATION automount[20803]: mount(nfs): nfs: mount
>>>>> failure 1.2.3.4:/areas/testdown on /misc/areas/testdown
>>>>> Aug 24 09:27:00 KSTATION automount[20803]: ioctl_send_fail: token = 91
>>>>> Aug 24 09:27:00 KSTATION automount[20803]: failed to mount /misc/areas/testdown
>>>>> -----------------------------------------
>>>
>>> Having a look at this I suspect the reason it doesn't work as expected
>>> is the waitpid(2) we do after sending the TERM signal to the mount
>>> process (which we have to do) is not returning. This is likely because
>>> the mount process isn't giving up in a shorter time as it used to.
>>
>> You're thinking maybe mount(2) should be as interruptible as the
>> socket calls that the mount command used to do? That might be
>> reasonable, and I can take a look at that.
>
> In recent kernels, all those RPC calls should be using TASK_KILLABLE
> sleep states. SIGTERM should cause them to abort, provided that some
> process isn't blocking it.
>
> Perhaps TASK_KILLABLE could be backported to RHEL-5?

That's pretty extensive, with hooks in the page cache. I doubt RH
would go for that.

>> In the kernel, if the rpcbind for the MNT request is async, that would
>> be done by rpciod. That's a different process, so the signal wouldn't
>> have any effect on the mount. I have a patch that converts the MNT
>> client to use rpcb_getport_sync() which might help in this case.
>
> The client shouldn't be using rpcbind at all when doing a NFSv4 mount.

Yep, forgot this was NFSv4.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
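
For readers following the thread: the pattern under discussion (wait for the mount helper, send TERM once the wait limit passes, then wait for the child to be reaped) can be sketched in user space as below. This is only an illustration, not the autofs code. The function name `run_with_deadline`, the `term_grace_secs` parameter, and the outcome strings are hypothetical, and `wait_secs` merely plays the role that MOUNT_WAIT has in the thread. The sketch bounds every wait, so a child that does not react to TERM cannot stall the caller for minutes.

```python
import signal
import subprocess


def run_with_deadline(argv, wait_secs, term_grace_secs=1.0):
    """Run argv and return (exit_status_or_None, outcome).

    Hypothetical helper for illustration only; it is not how
    automount is implemented.
    """
    proc = subprocess.Popen(argv)

    # Step 1: give the child up to wait_secs to finish on its own.
    try:
        return proc.wait(timeout=wait_secs), "exited"
    except subprocess.TimeoutExpired:
        pass

    # Step 2: ask politely with SIGTERM, but bound the wait that
    # follows instead of blocking in an unbounded waitpid(2).
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=term_grace_secs), "terminated"
    except subprocess.TimeoutExpired:
        pass

    # Step 3: escalate to SIGKILL. A task in an uninterruptible
    # sleep will not react to this either, so the wait stays bounded.
    proc.kill()
    try:
        return proc.wait(timeout=term_grace_secs), "killed"
    except subprocess.TimeoutExpired:
        return None, "stuck"
```

With a child that honours TERM, the call returns shortly after `wait_secs` with the outcome "terminated". The "stuck" outcome covers the case described above, where the child is blocked in the kernel in a state that signals cannot interrupt; the caller then gets control back but the child is still not reaped.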