Subject: Re: [bug report] task hang while testing xfstests generic/323
From: Jiufei Xue
To: Trond Myklebust, "aglo@umich.edu"
Cc: "bfields@fieldses.org", "Anna.Schumaker@netapp.com", "linux-nfs@vger.kernel.org", "joseph.qi@linux.alibaba.com"
Date: Fri, 1 Mar 2019 13:19:54 +0800
X-Mailing-List: linux-nfs@vger.kernel.org

On 2019/3/1 7:56 AM, Trond Myklebust wrote:
> On Thu, 2019-02-28 at 17:26 -0500, Olga Kornievskaia wrote:
>> On Thu, Feb 28, 2019 at 5:11 AM Jiufei Xue <jiufei.xue@linux.alibaba.com> wrote:
>>> Hi,
>>>
>>> when I tested xfstests/generic/323 with NFSv4.1 and v4.2, the task
>>> changed to zombie occasionally while a thread is hanging with the
>>> following stack:
>>>
>>> [<0>] rpc_wait_bit_killable+0x1e/0xa0 [sunrpc]
>>> [<0>] nfs4_do_close+0x21b/0x2c0 [nfsv4]
>>> [<0>] __put_nfs_open_context+0xa2/0x110 [nfs]
>>> [<0>] nfs_file_release+0x35/0x50 [nfs]
>>> [<0>] __fput+0xa2/0x1c0
>>> [<0>] task_work_run+0x82/0xa0
>>> [<0>] do_exit+0x2ac/0xc20
>>> [<0>] do_group_exit+0x39/0xa0
>>> [<0>] get_signal+0x1ce/0x5d0
>>> [<0>] do_signal+0x36/0x620
>>> [<0>] exit_to_usermode_loop+0x5e/0xc2
>>> [<0>] do_syscall_64+0x16c/0x190
>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [<0>] 0xffffffffffffffff
>>>
>>> Since commit 12f275cdd163(NFSv4: Retry CLOSE and DELEGRETURN on
>>> NFS4ERR_OLD_STATEID), the client will retry to close the file when
>>> stateid generation number in client is lower than server.
>>>
>>> The original intention of this commit is retrying the operation
>>> while racing with an OPEN. However, in this case the stateid
>>> generation remains mismatch forever.
>>>
>>> Any suggestions?
>>
>> Can you include a network trace of the failure? Is it possible that
>> the server has crashed on reply to the close and that's why the task
>> is hung? What server are you testing against?
>>
>> I have seen trace where close would get ERR_OLD_STATEID and would
>> still retry with the same open state until it got a reply to the OPEN
>> which changed the state and when the client received reply to that,
>> it'll retry the CLOSE with the updated stateid.
>
> I agree with Olga's assessment.
> The server is not allowed to randomly
> change the values of the seqid, and the client should be taking pains
> to replay any OPEN calls for which a reply is missed. The expectation
> is therefore that NFS4ERR_OLD_STATEID should always be a temporary
> state.
>

The server bumped the seqid because of a new OPEN from another thread.
I suspect that the task issuing the new OPEN exited on receiving a
signal, without updating the stateid on the client.

> If it is not, then the bugreport needs to explain why the server bumped
> the seqid without informing the client.
>
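
To make the suspected sequence concrete, here is a toy model of the race, not kernel code. ToyServer, close_with_retry and max_attempts are hypothetical names invented for this sketch; the real client has no retry cap, which is why the task hangs instead of giving up. The sketch only assumes what is described in this thread: an OPEN bumps the seqid on the server, and a CLOSE carrying an older seqid gets NFS4ERR_OLD_STATEID and is retried.

```python
from dataclasses import dataclass

NFS4_OK = "NFS4_OK"
NFS4ERR_OLD_STATEID = "NFS4ERR_OLD_STATEID"


@dataclass
class ToyServer:
    """Hypothetical stand-in for the server side of one open stateid."""
    seqid: int = 1

    def open(self) -> int:
        # Each OPEN on the same open state bumps the seqid; the new value
        # travels back to the client in the OPEN reply.
        self.seqid += 1
        return self.seqid

    def close(self, client_seqid: int) -> str:
        # A CLOSE that carries an older seqid is rejected.
        if client_seqid < self.seqid:
            return NFS4ERR_OLD_STATEID
        return NFS4_OK


def close_with_retry(server, client_seqid, open_replies, max_attempts=5):
    """Retry CLOSE on NFS4ERR_OLD_STATEID.

    open_replies lists the seqids of OPEN replies the client will still
    process. max_attempts exists only so that the sketch terminates.
    Returns (attempts_made, final_status).
    """
    replies = list(open_replies)
    status = None
    for attempt in range(1, max_attempts + 1):
        status = server.close(client_seqid)
        if status == NFS4_OK:
            return attempt, status
        # Between retries, an outstanding OPEN reply (if any) updates the
        # client's copy of the stateid.
        if replies:
            client_seqid = max(client_seqid, replies.pop(0))
    return max_attempts, status


# Expected case: the OPEN reply is processed, so the retried CLOSE succeeds.
server = ToyServer()
reply = server.open()
expected_case = close_with_retry(server, 1, [reply])

# Suspected case: the OPEN task was killed by a signal and its reply was
# never applied, so every retry sees the same stale seqid.
server = ToyServer()
server.open()
suspected_case = close_with_retry(server, 1, [])
```

In this model expected_case ends with NFS4_OK on the second attempt, while suspected_case is still at NFS4ERR_OLD_STATEID when the artificial cap is reached. Without that cap the loop would never exit, which matches the hang in nfs4_do_close in the stack above.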