From: Olga Kornievskaia
Date: Fri, 5 Jun 2020 11:30:29 -0400
Subject: Re: once again problems with interrupted slots
To: Tom Talpey
Cc: Trond Myklebust, linux-nfs
X-Mailing-List: linux-nfs@vger.kernel.org

On Fri, Jun 5, 2020 at 9:49 AM Tom Talpey wrote:
>
> On 6/5/2020 9:24 AM, Olga Kornievskaia wrote:
> > On Fri, Jun 5, 2020 at 8:06 AM Tom Talpey wrote:
> >>
> >> On 6/4/2020 5:21 PM, Olga Kornievskaia wrote:
> >>> Hi Trond,
> >>>
> >>> There is a problem with interrupted slots (yet again).
> >>>
> >>> We send an operation to the server and it gets interrupted by a signal.
> >>>
> >>> We used to send a sole SEQUENCE to remove the problem of having real
> >>> operation get an out of the cache reply and failing. Now we are not
> >>> doing it again (since 3453d5708 "NFSv4.1: Avoid false retries when RPC
> >>> calls are interrupted"). So the problem is
> >>>
> >>> We bump the sequence on the next use of the slot, and get SEQ_MISORDERED.
> >>
> >> Misordered?
> >> It sounds like the client isn't managing the sequence
> >> number, or perhaps the server never saw the original request, and
> >> is being overly strict.
> >
> > Well, both the client and the server are acting appropriately. I'm
> > not arguing against bumping the sequence. Client sent say REMOVE with
> > slot=1 seq=5 which got interrupted. So client doesn't know in what
> > state the slot is left. So it sends the next operation say READ with
> > slot=1 seq=6. Server acts appropriately too, as its version of the
> > slot has seq=4, this request with seq=6 gets SEQ_MISORDERED.
>
> Wait, if the client sent slot=1 seq=5, then unless the connection
> breaks, that slot is at seq=5, simple as that. If the operation was
> interrupted before sending the request, then the sequence should
> not be bumped.

Connection doesn't drop. We tried not bumping the sequence (I think
that was probably how it was originally done). Then you still get into
the same problem right away. REMOVE and READ will be using seq=5.

> >>> We decrement the number back to the interrupted operation. This gets
> >>> us a reply out of the cache. We again fail with REMOTE EIO error.
> >>
> >> Ew. The client *decrements* the sequence?
> >
> > Yes, as client then decides that server never received seq=5 operation
> > so it re-sends with seq=5. But in reality seq=5 operation also reached
> > the server so it has 2 requests REMOVE/READ both with seq=5 for
> > slot=1. This leads to READ failing with some error.
>
> But if the connection didn't break, it's reliable therefore the "resend"
> must not be performed. This is a new operation, not a retry. It cannot
> use the same slot+seq pair. And decrementing the slot is even sillier,
> it's reusing *two* seq's at that point.

When the slot gets interrupted we don't know when the interruption happened.
If we got SEQ_MISORDERED, it might be because the interruption happened
before the request was ever sent to the server, so it's valid for the
seq to stay the same (i.e. decrementing the seq). I don't see how
decrementing the seq is reusing 2 seq values: only one value is valid
and the client is trying to figure out which one.

> > We used to send a sole SEQUENCE when we have an interrupted
> > slot to sync up the seq numbers. But commit 3453d5708 changed that and
> > I would like to understand why. As I think we need to go back to
> > sending sole SEQUENCE.
>
> Sounds like a hack, frankly. What if the server responds the same
> way? The client will start burning the wire.

Sending the SEQUENCE on the same slot/seqid as an interrupted slot
doesn't lead to any operation failing.

> Closing the connection, or never using that slot again, seems to
> me the only correct option, given the other behavior described.

Not ever using an interrupted slot seems too drastic (we can end up
with a session where all slots are unusable; losing slots also means
losing the ability to send more requests in parallel). I thought that,
given a sequence of events and error codes, we should be able to
re-sync the slot.

>
> Tom.
> >
> >>> Going back to the commit's message. I don't see the logic that the
> >>> server can't tell if this is a new call or the old one. We used to
> >>> send a lone SEQUENCE as a way to protect reuse of slot by a normal
> >>> operation. An interrupted slot couldn't have been another SEQUENCE. So
> >>> I don't see how the server can't tell a difference between SEQUENCE
> >>> and any other operations.
> >>>
> >>>
> >
> >
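
To make the slot=1 scenario from the thread concrete, here is a minimal sketch of the server-side slot check as the discussion describes it: seq+1 is a new request, the same seq is answered from the reply cache, anything else is SEQ_MISORDERED. It is a toy model only, not the Linux client or any server's code. ServerSlot, handle() and resync_then_read() are hypothetical names, and it assumes the server does not detect the false retry (a real server may instead return a false-retry error).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

OK = "NFS4_OK"
SEQ_MISORDERED = "NFS4ERR_SEQ_MISORDERED"

Reply = Tuple[str, str]  # (operation that produced the reply, status)


@dataclass
class ServerSlot:
    """Hypothetical model of one server-side session slot."""
    seq: int                        # seq of the last request executed on the slot
    cached: Optional[Reply] = None  # cached reply for that request

    def handle(self, seq: int, op: str) -> Reply:
        if seq == self.seq + 1:
            # New request: execute it and remember the reply.
            self.seq = seq
            self.cached = (op, OK)
            return self.cached
        if seq == self.seq and self.cached is not None:
            # Same seq as the last request: treated as a retransmission and
            # answered from the cache, whatever operation was actually sent.
            return self.cached
        return (op, SEQ_MISORDERED)


def resync_then_read(slot: ServerSlot, interrupted_seq: int) -> Reply:
    """Probe with a lone SEQUENCE at the interrupted seq, then send READ."""
    # The probe's reply is discarded: it either executes or hits the cache,
    # and in both cases the slot ends up at interrupted_seq.
    slot.handle(interrupted_seq, "SEQUENCE")
    return slot.handle(interrupted_seq + 1, "READ")


if __name__ == "__main__":
    # Failure case from the thread: REMOVE (seq=5) is interrupted and only
    # reaches the server after the client has already tried READ with seq=6.
    slot = ServerSlot(seq=4)
    print(slot.handle(6, "READ"))    # server is still at seq=4
    print(slot.handle(5, "REMOVE"))  # the delayed REMOVE executes
    print(slot.handle(5, "READ"))    # READ is answered with REMOVE's cached reply

    # Lone-SEQUENCE resync, whether or not REMOVE reached the server.
    print(resync_then_read(ServerSlot(seq=4), 5))
    late = ServerSlot(seq=4)
    late.handle(5, "REMOVE")
    print(resync_then_read(late, 5))
```

In this model the probe is the only request that can receive the wrong cached reply, and the client throws that reply away, which is the argument made above for the lone SEQUENCE. The sketch does not model the race in which the interrupted REMOVE arrives after the probe, so it says nothing about that objection.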