Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.2 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS,UNPARSEABLE_RELAY,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D2CB1C43387 for ; Mon, 17 Dec 2018 19:19:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 876B8206A2 for ; Mon, 17 Dec 2018 19:19:39 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=oracle.com header.i=@oracle.com header.b="0/d0NSef" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732524AbeLQTTi (ORCPT ); Mon, 17 Dec 2018 14:19:38 -0500 Received: from userp2120.oracle.com ([156.151.31.85]:46192 "EHLO userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726399AbeLQTTi (ORCPT ); Mon, 17 Dec 2018 14:19:38 -0500 Received: from pps.filterd (userp2120.oracle.com [127.0.0.1]) by userp2120.oracle.com (8.16.0.22/8.16.0.22) with SMTP id wBHJJU3h106998; Mon, 17 Dec 2018 19:19:34 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=content-type : mime-version : subject : from : in-reply-to : date : cc : content-transfer-encoding : message-id : references : to; s=corp-2018-07-02; bh=783imR7aerfyjCKXfy8q4WQovQnhXb4056sFMwpCDpk=; b=0/d0NSefQ8MTnurO73cJEqktqhvBJWfwalsiNzmSzlSy/RHusVmKm4D8qbN/rKLRWLkU niSmNiIHL0WBoP1ka7nSNGWLG4EOjXvX68YrqyUgS6yag2QyE9uKDQ2053QxHq+Ur2bA kyCr2Y1mY2N70J7QBERL1QLdNQitosmGPBrpERB+PE6Sa80IlUTebgW5yIfadcagfufW QMtRAYwz3K9hoZ0PbvV/85/mz1+WBLSQxj3mI4kC0h8fibZZIbcbMJ2QesMzddaAeMOk nJWQQoNmKSn8m7WaAiDiiWuPAWr2ejANTxZESQ6pJ1QN5mCcEzPoae/ggQW6W5Yvx3uV Pg== Received: from userv0021.oracle.com (userv0021.oracle.com [156.151.31.71]) by userp2120.oracle.com with ESMTP id 2pct8qqarc-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 17 Dec 2018 19:19:34 +0000 Received: from userv0121.oracle.com (userv0121.oracle.com [156.151.31.72]) by userv0021.oracle.com (8.14.4/8.14.4) with ESMTP id wBHJJXYe001928 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 17 Dec 2018 19:19:34 GMT Received: from abhmp0013.oracle.com (abhmp0013.oracle.com [141.146.116.19]) by userv0121.oracle.com (8.14.4/8.13.8) with ESMTP id wBHJJXS4023694; Mon, 17 Dec 2018 19:19:33 GMT Received: from anon-dhcp-171.1015granger.net (/68.61.232.219) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 17 Dec 2018 11:19:33 -0800 Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\)) Subject: Re: [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done From: Chuck Lever In-Reply-To: Date: Mon, 17 Dec 2018 14:19:31 -0500 Cc: "linux-rdma@vger.kernel.org" , Linux NFS Mailing List Content-Transfer-Encoding: quoted-printable Message-Id: <1490E575-2589-402B-B34A-4455CB6845F1@oracle.com> References: <20181217162406.24133.27356.stgit@manet.1015granger.net> <20181217163953.24133.29214.stgit@manet.1015granger.net> <6A4F0F5B-2D29-4A51-BC21-AF23B64D6ED7@oracle.com> To: Trond Myklebust X-Mailer: Apple Mail (2.3445.9.1) X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9110 signatures=668679 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1812170170 Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org > On Dec 17, 2018, at 2:09 PM, Trond Myklebust = wrote: >=20 > On Mon, 2018-12-17 at 14:00 -0500, Chuck Lever wrote: >>> On Dec 17, 2018, at 1:55 PM, Trond Myklebust < >>> trondmy@hammerspace.com> wrote: >>>=20 >>> On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote: >>>>> On Dec 17, 2018, at 12:28 PM, Trond Myklebust < >>>>> trondmy@hammerspace.com> wrote: >>>>>=20 >>>>> On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote: >>>>>> Transport disconnect processing does a "wake pending tasks" >>>>>> at >>>>>> various points. >>>>>>=20 >>>>>> Suppose an RPC Reply is being processed. The RPC task that >>>>>> Reply >>>>>> goes with is waiting on the pending queue. If a disconnect >>>>>> wake- >>>>>> up >>>>>> happens before reply processing is done, that reply, even if >>>>>> it >>>>>> is >>>>>> good, is thrown away, and the RPC has to be sent again. >>>>>>=20 >>>>>> This window apparently does not exist for socket transports >>>>>> because >>>>>> there is a lock held while a reply is being received which >>>>>> prevents >>>>>> the wake-up call until after reply processing is done. >>>>>>=20 >>>>>> To resolve this, all RPC replies being processed on an RPC- >>>>>> over- >>>>>> RDMA >>>>>> transport have to complete before pending tasks are awoken >>>>>> due to >>>>>> a >>>>>> transport disconnect. >>>>>>=20 >>>>>> Callers that already hold the transport write lock may invoke >>>>>> ->ops->close directly. Others use a generic helper that >>>>>> schedules >>>>>> a close when the write lock can be taken safely. >>>>>>=20 >>>>>> Signed-off-by: Chuck Lever >>>>>> --- >>>>>> include/linux/sunrpc/xprt.h | 1 + >>>>>> net/sunrpc/xprt.c | 19 >>>>>> +++++++++++++++++++ >>>>>> net/sunrpc/xprtrdma/backchannel.c | 13 +++++++-- >>>>>> ---- >>>>>> net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 8 +++++--- >>>>>> net/sunrpc/xprtrdma/transport.c | 16 ++++++++++- >>>>>> ---- >>>>>> - >>>>>> net/sunrpc/xprtrdma/verbs.c | 5 ++--- >>>>>> 6 files changed, 44 insertions(+), 18 deletions(-) >>>>>>=20 >>>>>> diff --git a/include/linux/sunrpc/xprt.h >>>>>> b/include/linux/sunrpc/xprt.h >>>>>> index a4ab4f8..ee94ed0 100644 >>>>>> --- a/include/linux/sunrpc/xprt.h >>>>>> +++ b/include/linux/sunrpc/xprt.h >>>>>> @@ -401,6 +401,7 @@ static inline __be32 >>>>>> *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 * >>>>>> bool xprt_request_get_cong(struct rpc_xprt >>>>>> *xprt, >>>>>> struct rpc_rqst *req); >>>>>> void xprt_disconnect_done(struct rpc_xprt >>>>>> *xprt); >>>>>> void xprt_force_disconnect(struct rpc_xprt >>>>>> *xprt); >>>>>> +void xprt_disconnect_nowake(struct rpc_xprt >>>>>> *xprt); >>>>>> void xprt_conditional_disconnect(struct >>>>>> rpc_xprt >>>>>> *xprt, unsigned int cookie); >>>>>>=20 >>>>>> bool xprt_lock_connect(struct rpc_xprt *, >>>>>> struct >>>>>> rpc_task *, void *); >>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c >>>>>> index ce92700..afe412e 100644 >>>>>> --- a/net/sunrpc/xprt.c >>>>>> +++ b/net/sunrpc/xprt.c >>>>>> @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct >>>>>> rpc_xprt >>>>>> *xprt) >>>>>> } >>>>>> EXPORT_SYMBOL_GPL(xprt_force_disconnect); >>>>>>=20 >>>>>> +/** >>>>>> + * xprt_disconnect_nowake - force a call to xprt->ops->close >>>>>> + * @xprt: transport to disconnect >>>>>> + * >>>>>> + * The caller must ensure that xprt_wake_pending_tasks() is >>>>>> + * called later. >>>>>> + */ >>>>>> +void xprt_disconnect_nowake(struct rpc_xprt *xprt) >>>>>> +{ >>>>>> + /* Don't race with the test_bit() in >>>>>> xprt_clear_locked() >>>>>> */ >>>>>> + spin_lock_bh(&xprt->transport_lock); >>>>>> + set_bit(XPRT_CLOSE_WAIT, &xprt->state); >>>>>> + /* Try to schedule an autoclose RPC call */ >>>>>> + if (test_and_set_bit(XPRT_LOCKED, &xprt->state) =3D=3D 0) >>>>>> + queue_work(xprtiod_workqueue, &xprt- >>>>>>> task_cleanup); >>>>>> + spin_unlock_bh(&xprt->transport_lock); >>>>>> +} >>>>>> +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake); >>>>>> + >>>>>=20 >>>>> We shouldn't need both xprt_disconnect_nowake() and >>>>> xprt_force_disconnect() to be exported given that you can build >>>>> the >>>>> latter from the former + xprt_wake_pending_tasks() (which is >>>>> also >>>>> already exported). >>>>=20 >>>> Thanks for your review! >>>>=20 >>>> I can get rid of xprt_disconnect_nowake. There are some >>>> variations, >>>> depending on why wake_pending_tasks is protected by xprt- >>>>> transport_lock. >>>=20 >>> I'm having some second thoughts about the patch that Scott sent out >>> last week to fix the issue that Dave and he were seeing. I think >>> that >>> what we really need to do to fix his issue is to call >>> xprt_disconnect_done() after we've released the TCP socket. >>>=20 >>> Given that realisation, I think that we can fix up >>> xprt_force_disconnect() to only wake up the task that holds the >>> XPRT_LOCKED instead of doing a thundering herd wakeup like we do >>> today. >>> That should (I think) obviate the need for a separate >>> xprt_disconnect_nowake(). >>=20 >> For RPC-over-RDMA, there really is a dangerous race between the >> waking >> task(s) and work being done by the deferred RPC completion handler. >> IMO >> the only safe thing RPC-over-RDMA can do is not wake anything until >> the >> deferred queue is well and truly drained. >=20 > The deferred RPC completion handler (and hence the close) cannot > execute if another task is holding XPRT_LOCKED, Just to be certain we are speaking of the same thing, rpcrdma_deferred_completion is queued by the Receive handler, and can indeed run independently of an rpc_task. It is always running outside the purview of XPRT_LOCKED. > so we do need to wake up that task (and only that one). >=20 > Note that in the new code, the only reason why a task would be holding > XPRT_LOCKED while sleeping is because >=20 > 1. It is waiting for a connection attempt to complete following a = call > to xprt_connect(). > 2. It is waiting for a write_space event following an attempt to > transmit. xprt_rdma_close can sleep in rpcrdma_ep_disconnect: -> ib_drain_{qp,sq,rq} can all sleep waiting for the last FLUSH -> drain_workqueue, added in this patch, can sleep waiting for the deferred RPC completion workqueue to drain >> As you observed when we last spoke, socket transports are already >> protected from this race by the socket lock.... RPC-over-RDMA is >> going >> to have to be more careful. >>=20 >>=20 >>> A patch is forthcoming later today. I'll make sure you are Cced so >>> you >>> can comment. >>>=20 >>> --=20 >>> Trond Myklebust >>> Linux NFS client maintainer, Hammerspace >>> trond.myklebust@hammerspace.com >>=20 >> -- >> Chuck Lever >>=20 >>=20 >>=20 > --=20 > Trond Myklebust > CTO, Hammerspace Inc > 4300 El Camino Real, Suite 105 > Los Altos, CA 94022 > www.hammer.space -- Chuck Lever