Received: by 2002:a25:c593:0:0:0:0:0 with SMTP id v141csp6592664ybe; Wed, 18 Sep 2019 06:07:18 -0700 (PDT) X-Google-Smtp-Source: APXvYqww0CULmdhjv3H06uf5N17Kb+AmdkcWOHculivp89tKE/H6FRA6z2ZwNmFehDFO62wezFbp X-Received: by 2002:a7b:c045:: with SMTP id u5mr2818554wmc.139.1568812038505; Wed, 18 Sep 2019 06:07:18 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1568812038; cv=none; d=google.com; s=arc-20160816; b=bAeoXlIqO6PKWF7cwi9+LqlOj5Lxvj9jsOaMyZB6z8dAKVwm3331QsLj36DduL4IDp dhW2BhC37hv9EJDtoocYp8lampmCtoE3ESMO/jjDhrZrxkjGtBUJTqOsYINLzDxu96c0 mxse3Tspxmgw3DsnuwIpEFxm3urzd02kr7mIZwWsQFvC2jj4uTgoJuu+e9xMiAIiA3JR WMBxw9LJAt/ffk3n1Tfag4yxABPTrShzd0QUJutmlQ1FH8f4jd+m+13fO1oq1zb4kvpf czvTBfLvixbyicaGOuzDDZewHqk2ZV91K5jPAm48YI54xbc8BPrp5BNGOd9/K2XGtAZb vs1g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from; bh=g8LAVmB844tnZf79SyNFZoIkq2HbytOX6/pk94o7O2c=; b=aLbB7temEGX2+VRv9nx9jNOlkfCe/Fr2tuRbODNC47VpknZqPPnj2sAAl3+dKByj8M i1Em7C0IfnG7ma337G/rSOzOYoZUhH/U6UqBdcbOSnadH96as9D+xagrjFuKRWSI1JIQ Nz9PDorpDysei4N/g1D/Qhsq2Z0NFJprGCO2qUQylTcILBI5nM6LL0XSrxf3jQzLPGhZ oPUDI9s91J1zylNMP8w4N8k14KAARjRGu2deIVaRIimfOBr96+6pxw2ieOSG28ankK6Y iJb94U+auu098kL7BvWTcxRfmeUc6p8lX4UC6Hqrl5o6eqECSjuEkZig6PhQ3Y+qlz+1 fBpA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-nfs-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-nfs-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id u14si3216281edq.167.2019.09.18.06.06.34; Wed, 18 Sep 2019 06:07:18 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-nfs-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-nfs-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-nfs-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727569AbfIRLEH (ORCPT + 99 others); Wed, 18 Sep 2019 07:04:07 -0400 Received: from mx1.redhat.com ([209.132.183.28]:39568 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726077AbfIRLEG (ORCPT ); Wed, 18 Sep 2019 07:04:06 -0400 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 0C73310F2E8B; Wed, 18 Sep 2019 11:04:06 +0000 (UTC) Received: from [172.16.176.1] (ovpn-64-2.rdu2.redhat.com [10.10.64.2]) by smtp.corp.redhat.com (Postfix) with ESMTPS id B34A35D6A5; Wed, 18 Sep 2019 11:04:05 +0000 (UTC) From: "Benjamin Coddington" To: "Leon Kyneur" Cc: linux-nfs@vger.kernel.org Subject: Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID Date: Wed, 18 Sep 2019 07:04:05 -0400 Message-ID: <66D00B9D-16DC-4979-8400-457398DC4801@redhat.com> In-Reply-To: References: <8217416C-F3E5-4BEE-BD01-2BE19952425E@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; format=flowed X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.6.2 (mx1.redhat.com [10.5.110.66]); Wed, 18 Sep 2019 11:04:06 +0000 (UTC) Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org On 17 Sep 2019, at 22:20, Leon Kyneur wrote: > On Tue, Sep 17, 2019 at 7:28 PM Benjamin Coddington > wrote: >> >> On 12 Sep 2019, at 4:27, Leon Kyneur wrote: >> >>> Hi >>> >>> I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot call >>> fcntl >>> locks on any file on the share. The problem goes away if the share >>> is >>> umount && mount (mount -o remount does not resolve the issue) >>> >>> Client: >>> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64 >>> >>> Server: >>> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64 >>> >>> I can't figure this out but the client reports bad-sequence-id in >>> dupicate in the logs: >>> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad >>> sequence-id error on an unconfirmed sequence ffff881c52286220! >>> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad >>> sequence-id error on an unconfirmed sequence ffff881c52286220! >>> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad >>> sequence-id error on an unconfirmed sequence ffff8810889cb020! >>> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad >>> sequence-id error on an unconfirmed sequence ffff8810889cb020! >>> Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad >>> sequence-id error on an unconfirmed sequence ffff881b414b2620! >>> >>> wireshark capture shows only 1 BAD_SEQID reply from the server: >>> $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z >>> proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d' >>> tshark: -R without -2 is deprecated. For single-pass filtering use >>> -Y. >>> 141 93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK >>> FH: >>> 0x80589398 Offset: 0 Length: nfs.seqid == 0x0000004e >>> nfs.seqid == 0x00000002 rpc.xid == 0x9990c61d >>> 142 93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call >>> In 141) LOCK Status: NFS4ERR_BAD_SEQID rpc.xid == 0x9990c61d >>> >>> system call I have identified as triggering it is: >>> fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, >>> len=1}) = -1 EIO (Input/output error) >> >> Can you simplify the trigger into something repeatable? Can you >> determine >> if the client or the server has lost track of the sequence? >> > > I have tried, I wrote some code to perform the fcntl RDKLCK the same > way and ran it accross > thousands of machines without any success. I am quite sure this is a > symptom of something > not the cause. > > Is there a better way of tracking sequences other than monitoring the > network traffic? I think that's the best way, right now. We do have tracepoints for nfs4 open and close that show the sequence numbers on the client, but I'm not sure about how to get that from the server side. I don't think we have seqid for locks in tracepoints.. I could be missing something. Not only that, but you might not get tracepoint output showing the sequence numbers if you're in an error-handling path. If you have a wire capture of the event, you should be able to go backwards from the error and figure out what the sequence number on the state should be for the operation that received BAD_SEQID by finding the last sequence-mutating (OPEN,CLOSE,LOCK) operation for that stateid that did not return an error. Ben