From: Sasha Levin
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Håkon Bugge, Jason Gunthorpe, Sasha Levin, linux-rdma@vger.kernel.org
Subject: [PATCH AUTOSEL 5.4 29/37] RDMA/core/sa_query: Retry SA queries
Date: Thu, 9 Sep 2021 20:21:34 -0400
Message-Id: <20210910002143.175731-29-sashal@kernel.org>
X-Mailer: git-send-email 2.30.2
In-Reply-To:
 <20210910002143.175731-1-sashal@kernel.org>
References: <20210910002143.175731-1-sashal@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
X-stable: review
X-Patchwork-Hint: Ignore
Content-Transfer-Encoding: 8bit
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Håkon Bugge

[ Upstream commit 5f5a650999d5718af766fc70a120230b04235a6f ]

A MAD packet is sent as an unreliable datagram (UD). SA requests are
sent as MAD packets. As such, SA requests or responses may be silently
dropped.

IB Core's MAD layer has a timeout and retry mechanism, which, among
others, is used by RDMA CM. But it is not used by SA queries. The lack
of retries for SA queries leads to long specified timeouts and to an
error being returned in case of packet loss. The ULP or user-land
process has to perform the retry.

Fix this by taking advantage of the MAD layer's retry mechanism.

First, a check against a zero timeout is added in
rdma_resolve_route(). In send_mad(), we set the MAD layer timeout to
one tenth of the specified timeout and the number of retries to 10.
The special case when timeout is less than 10 is handled.

With this fix:

 # ucmatose -c 1000 -S 1024 -C 1

runs stable on an InfiniBand fabric.
Without this fix, we see an intermittent behavior and it errors out
with:

cmatose: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -110

(110 is ETIMEDOUT)

Link: https://lore.kernel.org/r/1628784755-28316-1-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin
---
 drivers/infiniband/core/cma.c      | 3 +++
 drivers/infiniband/core/sa_query.c | 9 ++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index ec9e9598894f..261d284e5ff3 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2937,6 +2937,9 @@ int rdma_resolve_route(struct rdma_cm_id *id, unsigned long timeout_ms)
 	struct rdma_id_private *id_priv;
 	int ret;
 
+	if (!timeout_ms)
+		return -EINVAL;
+
 	id_priv = container_of(id, struct rdma_id_private, id);
 	if (!cma_comp_exch(id_priv, RDMA_CM_ADDR_RESOLVED, RDMA_CM_ROUTE_QUERY))
 		return -EINVAL;
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index d2d70c89193f..8c0ff50cbcfc 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -1360,6 +1360,7 @@ static int send_mad(struct ib_sa_query *query, unsigned long timeout_ms,
 {
 	unsigned long flags;
 	int ret, id;
+	const int nmbr_sa_query_retries = 10;
 
 	xa_lock_irqsave(&queries, flags);
 	ret = __xa_alloc(&queries, &id, query, xa_limit_32b, gfp_mask);
@@ -1367,7 +1368,13 @@ static int send_mad(struct ib_sa_query *query, unsigned long timeout_ms,
 	if (ret < 0)
 		return ret;
 
-	query->mad_buf->timeout_ms = timeout_ms;
+	query->mad_buf->timeout_ms = timeout_ms / nmbr_sa_query_retries;
+	query->mad_buf->retries = nmbr_sa_query_retries;
+	if (!query->mad_buf->timeout_ms) {
+		/* Special case, very small timeout_ms */
+		query->mad_buf->timeout_ms = 1;
+		query->mad_buf->retries = timeout_ms;
+	}
 	query->mad_buf->context[0] = query;
 	query->id = id;
 
-- 
2.30.2