Subject: Re: handling ERR_SERVERFAULT on RESTOREFH
To: Olga Kornievskaia, Trond Myklebust
Cc: "linux-nfs@vger.kernel.org"
References: <98410608e028cb4b53024c7669e0fb70fea98214.camel@hammerspace.com> <98a10c8775e4127419ac57630f839744bdf1063d.camel@hammerspace.com>
From: Tom Talpey
Date: Wed, 29 Apr 2020 10:50:52 -0400
X-Mailing-List: linux-nfs@vger.kernel.org

On 4/28/2020 10:06 PM, Olga Kornievskaia wrote:
> On Tue, Apr 28, 2020 at 7:42 PM Trond Myklebust wrote:
>>
>> On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
>>> On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>>>> On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
>>>>> On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>>>>>> Hi Olga,
>>>>>>
>>>>>> On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
>>>>>>> Hi folk,
>>>>>>>
>>>>>>> Looking for guidance on what folks think. A client is sending a
>>>>>>> LINK operation to the server. This compound after the LINK has
>>>>>>> RESTOREFH and GETATTR. Server returns SERVER_FAULT to on
>>>>>>> RESTOREFH. But LINK is done successfully. Client still fails the
>>>>>>> system call with EIO. We have a hardline and "ln" saying
>>>>>>> hardlink failed.
>>>>>>>
>>>>>>> Should the client not fail the system call in this case? The
>>>>>>> fact that we couldn't get up-to-date attributes don't seem like
>>>>>>> the reason to fail the system call?
>>>>>>>
>>>>>>> Thank you.
>>>>>>
>>>>>> I don't really see this as worth fixing on the client. It is very
>>>>>> clearly a server bug.
>>>>>
>>>>> Why is that a server bug? A server can legitimately have an issue
>>>>> trying to execute an operation (RESTOREFH) and legitimately
>>>>> returning an error.
>>>>
>>>> If it is happening consistently on the server, then it is a bug,
>>>> and it gets reported by the client in the same way we always report
>>>> NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
>>>
>>> Yes but the client doesn't retry so it can't assess if it's
>>> consistently happening or not. It can be a transient error (or
>>> ENOMEM) that's later resolved.
>>
>> If the server wants to signal a transient error, it should send
>> NFS4ERR_DELAY.
>
> ERR_DELAY not an allowed error for the RESTOREFH. But let's say, the
> server does return it, then client is not following the spec because
> if it'll get this error, it will retry the whole compound (causing a
> different error of redoing a non-idempotent operation). The spec says
> client is responsible for handling partially completed compound. The
> client should only retry the failed operations in a compound, I don't
> see that client does that.
>
>>>>> NFS client also ignores errors of the returning GETATTR after the
>>>>> RESTOREFH. So I'm not sure why we are then not ignoring errors
>>>>> (or some errors) of the RESTOREFH.
>>>>
>>>> We do need to check the value of RESTOREFH in order to figure out
>>>> if we can continue reading the XDR buffer to decode the file
>>>> attributes. We want to read those file attributes because we do
>>>> expect the change attribute, the ctime and the nlinks values to all
>>>> change as a result of the operation.
>>>
>>> I have nothing against decoding the error and using it in a decision
>>> to keep decoding. But the client doesn't have to propagate the
>>> RESTOREFH error to the application?
>>>
>>> In all other non-idempotent operations that have other operations (ie
>>> GETATTR) following them, the client ignores the errors. Btw I just
>>> noticed that on OPEN compound, since we ignore decode error from the
>>> GETATTR, it would continue decoding LAYOUTGET...
>>>
>>> CREATE has problem if the following GETFH will return EDELAY.
>>> Client doesn't deal with retrying a part of the compound. It retries
>>> the whole compound. It leads to an error (since non-idempotent
>>> operation is retried). But I guess that's a 2nd issue (or a 3rd if
>>> we could the decoding layoutget)....
>>>
>>> All this is under the umbrella of how to handle errors on
>>> non-idempotent operations in a compound....
>>
>> There is no point in trying to handle errors that make no sense. If the
>> server has a bug, then let's expose it instead of trying to hide it in
>> the sofa cushions.
>
> EDELAY on GETFH is a reasonable error for the server to return.

I don't disagree that this is a broken server behavior. But from the
protocol perspective, I want to make two observations.

1) The post-operation attributes are not atomic, therefore an
   attribute failure does not imply the operation was unsuccessful.

2) The application did not necessarily request the attributes, this
   was inserted by the client, right? So again, their success or
   failure is not actually relevant to the application.

Tom.
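
To make the two observations concrete, here is a minimal sketch of how a
client could separate "did the requested operation succeed" from "did the
trailing attribute refresh succeed". It is illustrative only and does not
describe what the Linux client does today. OpResult, compound_outcome and
the operation sequence in the usage lines are hypothetical names and
values chosen for the example. Only the status codes come from the NFSv4
protocol.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Status values as defined by the NFSv4 protocol.
NFS4_OK = 0
NFS4ERR_SERVERFAULT = 10006


@dataclass
class OpResult:
    """One decoded operation result from a COMPOUND reply (hypothetical type)."""
    name: str
    status: int


def compound_outcome(results: List[OpResult], requested_op: str) -> Tuple[int, bool]:
    """Return (status to report to the application, post-op attributes usable).

    `results` holds the reply in order. A server stops processing at the
    first failed operation, so only the last entry can carry an error.
    """
    reached_requested = False
    for op in results:
        if op.status != NFS4_OK:
            if reached_requested:
                # The requested operation already took effect; only the
                # client-inserted attribute refresh failed, so report
                # success and mark the attributes as not usable.
                return NFS4_OK, False
            # The failure is in, or before, the requested operation.
            return op.status, False
        if op.name == requested_op:
            reached_requested = True
    if not reached_requested:
        raise ValueError(f"{requested_op} missing from compound reply")
    return NFS4_OK, True


# Assumed reply shape for the case in this thread: LINK succeeds and the
# following RESTOREFH fails, so the final GETATTR is never executed.
reply = [
    OpResult("PUTFH", NFS4_OK),
    OpResult("SAVEFH", NFS4_OK),
    OpResult("PUTFH", NFS4_OK),
    OpResult("LINK", NFS4_OK),
    OpResult("RESTOREFH", NFS4ERR_SERVERFAULT),
]
print(compound_outcome(reply, "LINK"))  # (0, False)
```

The second element of the result is what the decoder would still need in
order to decide whether to keep reading attributes from the XDR buffer,
and a False value would be the cue to invalidate cached attributes rather
than to fail the system call.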