From: Olga Kornievskaia
Date: Mon, 11 Mar 2019 11:07:21 -0400
Subject: Re: [bug report] task hang while testing xfstests generic/323
To: Trond Myklebust
Cc: "bfields@fieldses.org", "jiufei.xue@linux.alibaba.com", "Anna.Schumaker@netapp.com", "linux-nfs@vger.kernel.org", "joseph.qi@linux.alibaba.com"
X-Mailing-List: linux-nfs@vger.kernel.org

On Mon, Mar 11, 2019 at 10:30 AM Trond Myklebust wrote:
>
> Hi Olga,
>
> On Sun, 2019-03-10 at 18:20 -0400, Olga Kornievskaia wrote:
> > There are a bunch of cases where multiple operations are using the
> > same seqid and slot.
> >
> > Example of such weirdness is (nfs.seqid == 0x000002f4) && (nfs.slotid
> > == 0) and the one leading to the hang.
> >
> > In frame 415870, there is an OPEN using that seqid and slot for the
> > first time (but this slot will be re-used a bunch of times before it
> > gets a reply in frame 415908 with the open stateid seq=40). (also in
> > this packet there is an example of reuse slot=1+seqid=0x000128f7 by
> > both TEST_STATEID and OPEN but let's set that aside).
> >
> > In frame 415874 (in the same packet), client sends 5 opens on the
> > SAME seqid and slot (all have distinct xids). In a way that ends up
> > being alright since the opens are for the same file and thus are
> > replied to out of the cache and the reply is ERR_DELAY. But in frame
> > 415876, client again uses the same seqid and slot and in this case
> > it's used by 3 opens and a test_stateid.
> >
> > Client in all this mess never processes the open stateid seq=40 and
> > keeps on resending CLOSE with seq=37 (also to note client "missed"
> > processing seqid=38 and 39 as well. 39 probably because it was a
> > reply on the same kind of "Reused" slot=1 and seq=0x000128f7. I
> > haven't tracked 38 but I'm assuming it's the same). I don't know how
> > many times but after 5mins, I see a TEST_STATEID that again uses the
> > same seqid+slot (which gets a reply from the cache matching OPEN).
> > Also open + close (still with seq=37): the open is replied to, but
> > after this the client logs have
> > "nfs4_schedule_state_manager: kthread_run: -4" over and over again,
> > then a soft lockup.
> >
> > Looking back on slot 0. nfs.seqid=0x000002f3 was used in frame=415866
> > by the TEST_STATEID. This is replied to in frame 415877 (with an
> > ERR_DELAY). But before the client got a reply, it used the slot and
> > the seq by frame 415874. TEST_STATEID is a synchronous and
> > interruptible operation.
> > I'm suspecting that somehow it was
> > interrupted and that's how the slot was able to be re-used by the
> > frame 415874. But how the several opens were able to get the same
> > slot I don't know..
>
> Is this still true with the current linux-next? I would expect this
> patch
> http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=3453d5708b33efe76f40eca1c0ed60923094b971
> to change the Linux client behaviour in the above regard.

Yes. I reproduced it against your origin/testing (5.0.0-rc7+) commit
0d1bf3407c4ae88 ("SUNRPC: allow dynamic allocation of back channel
slots").

>
> Cheers
> Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>
>
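
For anyone who wants to look for the same pattern in their own capture,
here is a minimal sketch of the check described above: group requests by
(slot, seqid) and flag any pair that carries more than one XID. It is
only an illustration under assumptions. The record layout (dicts with
"frame", "xid", "slotid", "seqid", "op") is hypothetical, e.g. something
exported from a capture by hand; the sample rows are made up and are not
taken from the trace discussed in this thread; and it assumes a single
session, so the session id is not part of the key.

```python
from collections import defaultdict


def find_slot_reuse(requests):
    """Return {(slotid, seqid): {xid: (frame, op)}} for every (slot, seqid)
    pair that was used by more than one distinct XID.

    An RPC retransmission normally keeps its XID, so several XIDs on the
    same (slot, seqid) suggest that different requests shared one sequence
    position. Assumption: all records belong to the same session.
    """
    xids_by_key = defaultdict(dict)
    for req in requests:
        key = (req["slotid"], req["seqid"])
        # Keep only the first frame/operation seen for each XID.
        xids_by_key[key].setdefault(req["xid"], (req["frame"], req["op"]))
    return {key: xids for key, xids in xids_by_key.items() if len(xids) > 1}


# Made-up sample records (hypothetical frames and XIDs, for illustration only).
sample = [
    {"frame": 1, "xid": 0x1000, "slotid": 0, "seqid": 0x2F3, "op": "TEST_STATEID"},
    {"frame": 2, "xid": 0x1001, "slotid": 0, "seqid": 0x2F4, "op": "OPEN"},
    {"frame": 3, "xid": 0x1002, "slotid": 0, "seqid": 0x2F4, "op": "OPEN"},
    {"frame": 3, "xid": 0x1002, "slotid": 0, "seqid": 0x2F4, "op": "OPEN"},  # retransmit
    {"frame": 4, "xid": 0x1003, "slotid": 0, "seqid": 0x2F4, "op": "TEST_STATEID"},
    {"frame": 5, "xid": 0x1004, "slotid": 1, "seqid": 0x10, "op": "CLOSE"},
]

for (slotid, seqid), xids in sorted(find_slot_reuse(sample).items()):
    ops = ", ".join(f"{op}@{frame}" for frame, op in xids.values())
    print(f"slot={slotid} seqid={seqid:#010x}: {len(xids)} xids ({ops})")
```

Keying on the XID rather than on the frame number keeps plain
retransmissions from being reported as reuse, which seemed like the
relevant distinction here.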