Subject: Re: [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status
From: Jeff Layton
To: dai.ngo@oracle.com, Chuck Lever III
Cc: Linux NFS Mailing List, Helen Chao, Anna Schumaker, Trond Myklebust
Date: Thu, 06 Apr 2023 14:10:55 -0400
In-Reply-To: <64c4e5c4e61962fd828bcbef79db1df6466a875d.camel@kernel.org>
References: <1678301132-24496-1-git-send-email-dai.ngo@oracle.com>
 <9D5A190A-333A-4470-8572-CF85EE9A8086@oracle.com>
 <182842b2-3de4-d64b-d729-f4f6c9c576d6@oracle.com>
 <64c4e5c4e61962fd828bcbef79db1df6466a875d.camel@kernel.org>
X-Mailing-List: linux-nfs@vger.kernel.org

On Thu, 2023-04-06 at 13:33 -0400, Jeff Layton wrote:
> On Tue, 2023-03-14 at 09:19 -0700, dai.ngo@oracle.com wrote:
> > On 3/8/23 11:03 AM, dai.ngo@oracle.com wrote:
> > > On 3/8/23 10:50 AM, Chuck Lever III wrote:
> > > >
> > > > > On Mar 8, 2023, at 1:45 PM, Dai Ngo wrote:
> > > > >
> > > > > Currently call_bind_status places a hard limit of 3 to the number of
> > > > > retries on EACCES error. This limit was done to accommodate the behavior
> > > > > of a buggy server that keeps returning garbage when the NLM daemon is
> > > > > killed on the NFS server.
> > > > > However this change causes problem for other
> > > > > servers that take a little longer than 9 seconds for the port mapper to
> > > > >
> > > > >

Actually, the EACCES error means that the host doesn't have the port
registered. That could happen if (e.g.) the host had an NFSv3 mount up
with an NLM connection and then crashed and rebooted and didn't remount
it.

> > > > > become ready when the NFS server is restarted.
> > > > >
> > > > > This patch removes this hard coded limit and let the RPC handles
> > > > > the retry according to whether the export is soft or hard mounted.
> > > > >
> > > > > To avoid the hang with buggy server, the client can use soft mount for
> > > > > the export.
> > > > >
> > > > > Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
> > > > > Reported-by: Helen Chao
> > > > > Tested-by: Helen Chao
> > > > > Signed-off-by: Dai Ngo
> > > > Helen is the royal queen of ^C ;-)
> > > >
> > > > Did you try ^C on a mount while it waits for a rebind?
> > >
> > > She uses a test script that restarts the NFS server while NLM lock test
> > > is running. The failure is random, sometimes it fails and sometimes it
> > > passes depending on when the LOCK/UNLOCK requests come in so I think
> > > it's hard to time it to do the ^C, but I will ask.
> >
> > We did the test with ^C and here is what we found.
> >
> > For synchronous RPC task the signal was delivered to the RPC task and
> > the task exit with -ERESTARTSYS from __rpc_execute as expected.
> >
> > For asynchronous RPC task the process that invokes the RPC task to send
> > the request detected the signal in rpc_wait_for_completion_task and exits
> > with -ERESTARTSYS. However the async RPC was allowed to continue to run
> > to completion. So if the async RPC task was retrying an operation and
> > the NFS server was down, it will retry forever if this is a hard mount
> > or until the NFS server comes back up.
> >
> > The question for the list is should we propagate the signal to the async
> > task via rpc_signal_task to stop its execution or just leave it alone as is.
> >
> >
>
> That is a good question.
>
> I like the patch overall, as it gets rid of a special one-off retry
> counter, but I too share some concerns about retrying indefinitely when
> an server goes missing.
>
> >
> Propagating a signal seems like the right thing to do. Looks like
> rpcb_getport_done would also need to grow a check for RPC_SIGNALLED ?
>
> It sounds pretty straightforward otherwise.

Erm, except that some of these xprts are in the context of the server.

For instance: server-side lockd sometimes has to send messages to the
clients (e.g. GRANTED callbacks). Suppose we're trying to send a message
to the client, but it has crashed and rebooted...or maybe the client's
address got handed to another host that isn't doing NFS at all, so the
NLM service will never come back.

With this patch, those RPCs would retry forever in that situation. I'm
not sure that's what we want. Maybe we need some way to distinguish
between "user-driven" RPCs and those that aren't?

As a simpler workaround, would it work to just increase the number of
retries here? There's nothing magical about 3 tries. ISTM that having it
retry enough times to cover at least a minute or two would be
acceptable.
-- 
Jeff Layton
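
The arithmetic behind the "minute or two" suggestion can be sketched in
C. This is a standalone illustration, not kernel code: it assumes a fixed
delay of 3 seconds per retry (inferred from the patch description, where
3 retries cover about 9 seconds), and retries_for_window and
window_for_retries are hypothetical helper names.

```c
/*
 * Assumption: every EACCES retry waits the same fixed delay. The patch
 * description's "3 retries ~ 9 seconds" implies about 3 seconds each.
 */
#define ASSUMED_RETRY_DELAY_SECS 3U

/*
 * Hypothetical helper: seconds covered by a given retry limit when each
 * retry waits delay_secs.
 */
static unsigned int window_for_retries(unsigned int retries,
                                       unsigned int delay_secs)
{
    return retries * delay_secs;
}

/*
 * Hypothetical helper: smallest retry limit whose total delay covers at
 * least window_secs (ceiling division). Returns 0 if delay_secs is 0,
 * since no finite limit is meaningful in that case.
 */
static unsigned int retries_for_window(unsigned int window_secs,
                                       unsigned int delay_secs)
{
    if (delay_secs == 0)
        return 0;
    return (window_secs + delay_secs - 1) / delay_secs;
}
```

Under that assumption, retries_for_window(60, ASSUMED_RETRY_DELAY_SECS)
gives 20 and retries_for_window(120, ASSUMED_RETRY_DELAY_SECS) gives 40,
so a limit somewhere in the 20 to 40 range would cover the one to two
minutes mentioned above.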