Date: Thu, 11 Jul 2019 12:49:37 -0400
From: Scott Mayhew
To: Indivar Nair
Cc: linux-nfs@vger.kernel.org
Subject: Re: rpc.statd dies because of pacemaker monitoring
Message-ID: <20190711164937.GA4131@coeurl.usersys.redhat.com>
On Thu, 11 Jul 2019, Indivar Nair wrote:

> Hi ...,
>
> I have a 2-node Pacemaker cluster built using CentOS 7.6.1810.
> It serves files using NFS and Samba.
>
> Every 15-20 minutes, the rpc.statd service fails, and the whole NFS
> service is restarted.
> After investigation, it was found that the service fails after a few
> rounds of monitoring by Pacemaker.
> Pacemaker's script runs the following commands to check whether all
> the services are running -
> ---------------------------------------------------------------------------------------------------------------------------------------
> rpcinfo > /dev/null 2>&1
> rpcinfo -t localhost 100005 > /dev/null 2>&1
> nfs_exec status nfs-idmapd > $fn 2>&1
> rpcinfo -t localhost 100024 > /dev/null 2>&1

I would check to make sure your DNS setup is working properly.
rpc.statd uses the canonical hostnames for comparison purposes whenever
it gets an SM_MON or SM_UNMON request from lockd and when it gets an
SM_NOTIFY from a rebooted NFS client.  That involves calls to
getaddrinfo() and getnameinfo(), which in turn could result in requests
to a DNS server.  rpc.statd is single-threaded, so if it's blocked
waiting on one of those requests, it's unable to respond to the RPC
ping (which has a timeout of 10 seconds) generated by the rpcinfo
program.

I ran into a similar scenario in the past where a client was launching
multiple instances of rpc.statd.  When the client does a v3 mount it
does a similar RPC ping (with a more aggressive timeout) to see if
rpc.statd is running... if not, it calls out to /usr/sbin/start-statd
(which in the past simply called 'exec rpc.statd --no-notify' but now
has additional checks).  Likewise, rpc.statd does its own RPC ping to
make sure there's not one already running.  It turned out that the user
had a flaky DNS server and requests were taking over 30 seconds to time
out, thwarting all those additional checks, and they wound up with
multiple copies of rpc.statd running.

You could be running into a similar scenario here, and pacemaker could
be deciding that rpc.statd is not running when it's actually fine.
(There's a rough shell sketch further down for checking whether slow
lookups are the problem.)

-Scott

> ---------------------------------------------------------------------------------------------------------------------------------------
> The script is scheduled to check every 20 seconds.
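Since that monitor fires every 20 seconds, even an occasional slow
lookup will eventually line up with one of its rpcinfo pings.  As a
rough sketch for checking this from a cluster node (the client hostname
and address below are placeholders; substitute ones that actually
appear in statd's monitor list, e.g. under /var/lib/nfs/statd/sm):

    # Placeholder client hostname/address -- substitute real ones.
    CLIENT=client01.example.com
    ADDR=192.0.2.10

    # Forward and reverse lookups, similar to the getaddrinfo()/
    # getnameinfo() calls rpc.statd makes.  Anything close to the
    # 10-second rpcinfo timeout is a problem.
    time getent ahosts "$CLIENT"
    time getent hosts "$ADDR"

    # The same RPC ping the resource agent does (100024 = status):
    time rpcinfo -t localhost 100024

    # And make sure only one rpc.statd is running:
    pgrep -a rpc.statd

If the getent calls hang or take several seconds, that points at the
resolver rather than at rpc.statd itself.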
>
> This is the message we get in the logs -
> -------------------------------------------------------------------------------------------------------------------------------------
> Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: check_default: access by
> 127.0.0.1 ALLOWED
> Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: Received NULL request
> from 127.0.0.1
> Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: check_default: access by
> 127.0.0.1 ALLOWED (cached)
> Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: Received NULL request
> from 127.0.0.1
> Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: check_default: access by
> 127.0.0.1 ALLOWED (cached)
> Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: Received NULL request
> from 127.0.0.1
> -------------------------------------------------------------------------------------------------------------------------------------
>
> After 10 seconds, we get this message -
> -------------------------------------------------------------------------------------------------------------------------------------
> Jul 09 07:34:09 virat-nd01 nfsserver(virat-nfs-daemon)[54087]: ERROR:
> rpc-statd is not running
> -------------------------------------------------------------------------------------------------------------------------------------
> Once we get this error, the NFS service is automatically restarted.
>
> The "ERROR: rpc-statd is not running" message comes from pacemaker's
> monitoring script; I have pasted that part of the script below.
>
> I disabled monitoring, and everything has been working fine since then.
> I can't keep the cluster monitoring disabled forever, though.
>
> Kindly help.
>
> Regards,
>
>
> Indivar Nair
>
> Part of the pacemaker script that does the monitoring
> (/usr/lib/ocf/resource.d/heartbeat/nfsserver)
> =======================================================================
> nfsserver_systemd_monitor()
> {
>     local threads_num
>     local rc
>     local fn
>
>     ocf_log debug "Status: rpcbind"
>     rpcinfo > /dev/null 2>&1
>     rc=$?
>     if [ "$rc" -ne "0" ]; then
>         ocf_exit_reason "rpcbind is not running"
>         return $OCF_NOT_RUNNING
>     fi
>
>     ocf_log debug "Status: nfs-mountd"
>     rpcinfo -t localhost 100005 > /dev/null 2>&1
>     rc=$?
>     if [ "$rc" -ne "0" ]; then
>         ocf_exit_reason "nfs-mountd is not running"
>         return $OCF_NOT_RUNNING
>     fi
>
>     ocf_log debug "Status: nfs-idmapd"
>     fn=`mktemp`
>     nfs_exec status nfs-idmapd > $fn 2>&1
>     rc=$?
>     ocf_log debug "$(cat $fn)"
>     rm -f $fn
>     if [ "$rc" -ne "0" ]; then
>         ocf_exit_reason "nfs-idmapd is not running"
>         return $OCF_NOT_RUNNING
>     fi
>
>     ocf_log debug "Status: rpc-statd"
>     rpcinfo -t localhost 100024 > /dev/null 2>&1
>     rc=$?
>     if [ "$rc" -ne "0" ]; then
>         ocf_exit_reason "rpc-statd is not running"
>         return $OCF_NOT_RUNNING
>     fi
>
>     nfs_exec is-active nfs-server
>     rc=$?
>
>     # systemctl is-active can't detect the failure of a kernel
>     # process like nfsd, so if it returns 0, also check the thread
>     # count to make sure the process is really running.
>     # /proc/fs/nfsd/threads has the number of nfsd threads.
>     if [ $rc -eq 0 ]; then
>         threads_num=`cat /proc/fs/nfsd/threads 2>/dev/null`
>         if [ $? -eq 0 ]; then
>             if [ $threads_num -gt 0 ]; then
>                 return $OCF_SUCCESS
>             else
>                 return 3
>             fi
>         else
>             return $OCF_ERR_GENERIC
>         fi
>     fi
>
>     return $rc
> }
> =======================================================================
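If slow or flaky DNS does turn out to be the cause, the real fix is on
the resolver side, but as a stopgap you might also relax the monitor
operation's timeout so that a single slow lookup doesn't immediately
mark the resource as failed.  Roughly something like the following,
assuming the resource id virat-nfs-daemon taken from the log message
above (check 'pcs status' for the actual name), with example values
only:

    # Keep the 20-second interval but give the monitor more time
    # before it is treated as failed.
    pcs resource update virat-nfs-daemon op monitor interval=20s timeout=60s

That only papers over the latency, though; the getaddrinfo() and
getnameinfo() delays would still be there for lock recovery and
SM_NOTIFY handling.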