Received: by 2002:a05:6a10:9e8c:0:0:0:0 with SMTP id y12csp1111441pxx; Tue, 27 Oct 2020 08:30:13 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzh1O9kYThC3fkj6x7K/irBFSQoPAfrBqdJO84EvbbwpqTuuJonhqVZGLgdPzfmUChd6CN4 X-Received: by 2002:a17:906:c0c8:: with SMTP id bn8mr2958528ejb.256.1603812612923; Tue, 27 Oct 2020 08:30:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1603812612; cv=none; d=google.com; s=arc-20160816; b=vJNAssFPlX565xO4bYCFBQlCGsGTfcpvY9IRVl1KXXJ+Ig2XRz36dt0pqMCPmAA2pt ep4osbTnX5dzFeMqqPe/b7WihBjZaKZGIyZRNSe2ceKg4nMRFzrR/YE1WEX8GdSrW+Fk 3PDi2kRDdidJbg5mIwqbRXUm/obz8w6CssbyoMpo/l3SncJhJIjfNaFhLExxuFqhrhE9 dSb1SXnJSeDTE/NmHLSlW271oZ1XUGnFCV8yC+cAF5XgEfqZs9ibiN7UpWwGguK1t9SL OoJpn0Dzz5xiS+nTAUFhWojxsMbWsgJ+H9zjTehirSEavZDTVztdTDWXN4Qu3w6xzdNl vjEA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature; bh=lF48Dh1PSPkNhr7HvEqi4y9oS71BT5ozfDymUGGr0CY=; b=pXPY3hKEHvJSONwKL2RAfoEA0I2QNjoH54gsE2EX9Jk7O9HocVLvJn4Tvy4ft802a4 RaMR0cGfhXTbhEdxlGl/U41WsZynqW9GSf4ZhVMb+87yo7LbiT5JoafMTkHfUPbAAsMy 7aKP8goU7mdP1Dwg0+BvOve3d5zt9XcY1HO7CiVu5Jptr6CwlqSbL1jJmsc0UTEs7cp/ HRKqCtEdXRmjrMyo9GiR5mbWIEP9NIDR0t5d/KFgm/YAICqFTefAw+AjBL8is3dqLjF0 a8gC6YAzp6B+Mle94kWZQSV5zY37js3KKizHYYNPgtH4H7D60VIJzt5gMpxLQMzfl/xP 7f6A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=t7IhXamD; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id a16si1523572ejk.22.2020.10.27.08.29.50; Tue, 27 Oct 2020 08:30:12 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=t7IhXamD; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1761001AbgJ0OhZ (ORCPT + 99 others); Tue, 27 Oct 2020 10:37:25 -0400 Received: from mail.kernel.org ([198.145.29.99]:36308 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1760984AbgJ0OhU (ORCPT ); Tue, 27 Oct 2020 10:37:20 -0400 Received: from localhost (83-86-74-64.cable.dynamic.v4.ziggo.nl [83.86.74.64]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 420982225E; Tue, 27 Oct 2020 14:37:19 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1603809439; bh=rkita8JVTkacNwvK/OJDXJh1Fy56x8p9EDC4ObrasD4=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=t7IhXamDWV2efADBuVEDaELYxWNCcH2N2wOjXDBy73XYsCHDFyqwLE3bOVEKCOtBo du1kdnO3IHF5zcqnGOen38Zp4Mw0IAMOnqfG/eqUUDswtsiN47GhmP0jC1CsNxQFd7 NG+23qUA1lKSQwkSAaSmhDD47zeCVbCLwN3NXJsE= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, =?UTF-8?q?H=C3=A5kon=20Bugge?= , Jason Gunthorpe , Sasha Levin Subject: [PATCH 5.4 197/408] IB/mlx4: Fix starvation in paravirt mux/demux Date: Tue, 27 Oct 2020 14:52:15 +0100 Message-Id: <20201027135504.234526711@linuxfoundation.org> X-Mailer: git-send-email 2.29.1 In-Reply-To: <20201027135455.027547757@linuxfoundation.org> References: <20201027135455.027547757@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Håkon Bugge [ Upstream commit 7fd1507df7cee9c533f38152fcd1dd769fcac6ce ] The mlx4 driver will proxy MAD packets through the PF driver. A VM or an instantiated VF will send its MAD packets to the PF driver using loop-back. The PF driver will be informed by an interrupt, but defer the handling and polling of CQEs to a worker thread running on an ordered work-queue. Consider the following scenario: the VMs will in short proximity in time, for example due to a network event, send many MAD packets to the PF driver. Lets say there are K VMs, each sending N packets. The interrupt from the first VM will start the worker thread, which will poll N CQEs. A common case here is where the PF driver will multiplex the packets received from the VMs out on the wire QP. But before the wire QP has returned a send CQE and associated interrupt, the other K - 1 VMs have sent their N packets as well. The PF driver has to multiplex K * N packets out on the wire QP. But the send-queue on the wire QP has a finite capacity. So, in this scenario, if K * N is larger than the send-queue capacity of the wire QP, we will get MAD packets dropped on the floor with this dynamic debug message: mlx4_ib_multiplex_mad: failed sending GSI to wire on behalf of slave 2 (-11) and this despite the fact that the wire send-queue could have capacity, but the PF driver isn't aware, because the wire send CQEs have not yet been polled. We can also have a similar scenario inbound, with a wire recv-queue larger than the tunnel QP's send-queue. If many remote peers send MAD packets to the very same VM, the tunnel send-queue destined to the VM could allegedly be construed to be full by the PF driver. This starvation is fixed by introducing separate work queues for the wire QPs vs. the tunnel QPs. With this fix, using a dual ported HCA, 8 VFs instantiated, we could run cmtime on each of the 18 interfaces towards a similar configured peer, each cmtime instance with 800 QPs (all in all 14400 QPs) without a single CM packet getting lost. Fixes: 3cf69cc8dbeb ("IB/mlx4: Add CM paravirtualization") Link: https://lore.kernel.org/r/20200803061941.1139994-5-haakon.bugge@oracle.com Signed-off-by: Håkon Bugge Signed-off-by: Jason Gunthorpe Signed-off-by: Sasha Levin --- drivers/infiniband/hw/mlx4/mad.c | 34 +++++++++++++++++++++++++--- drivers/infiniband/hw/mlx4/mlx4_ib.h | 2 ++ 2 files changed, 33 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c index 57079110af9b5..08eccf2b6967d 100644 --- a/drivers/infiniband/hw/mlx4/mad.c +++ b/drivers/infiniband/hw/mlx4/mad.c @@ -1307,6 +1307,18 @@ static void mlx4_ib_tunnel_comp_handler(struct ib_cq *cq, void *arg) spin_unlock_irqrestore(&dev->sriov.going_down_lock, flags); } +static void mlx4_ib_wire_comp_handler(struct ib_cq *cq, void *arg) +{ + unsigned long flags; + struct mlx4_ib_demux_pv_ctx *ctx = cq->cq_context; + struct mlx4_ib_dev *dev = to_mdev(ctx->ib_dev); + + spin_lock_irqsave(&dev->sriov.going_down_lock, flags); + if (!dev->sriov.is_going_down && ctx->state == DEMUX_PV_STATE_ACTIVE) + queue_work(ctx->wi_wq, &ctx->work); + spin_unlock_irqrestore(&dev->sriov.going_down_lock, flags); +} + static int mlx4_ib_post_pv_qp_buf(struct mlx4_ib_demux_pv_ctx *ctx, struct mlx4_ib_demux_pv_qp *tun_qp, int index) @@ -2009,7 +2021,8 @@ static int create_pv_resources(struct ib_device *ibdev, int slave, int port, cq_size *= 2; cq_attr.cqe = cq_size; - ctx->cq = ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler, + ctx->cq = ib_create_cq(ctx->ib_dev, + create_tun ? mlx4_ib_tunnel_comp_handler : mlx4_ib_wire_comp_handler, NULL, ctx, &cq_attr); if (IS_ERR(ctx->cq)) { ret = PTR_ERR(ctx->cq); @@ -2046,6 +2059,7 @@ static int create_pv_resources(struct ib_device *ibdev, int slave, int port, INIT_WORK(&ctx->work, mlx4_ib_sqp_comp_worker); ctx->wq = to_mdev(ibdev)->sriov.demux[port - 1].wq; + ctx->wi_wq = to_mdev(ibdev)->sriov.demux[port - 1].wi_wq; ret = ib_req_notify_cq(ctx->cq, IB_CQ_NEXT_COMP); if (ret) { @@ -2189,7 +2203,7 @@ static int mlx4_ib_alloc_demux_ctx(struct mlx4_ib_dev *dev, goto err_mcg; } - snprintf(name, sizeof name, "mlx4_ibt%d", port); + snprintf(name, sizeof(name), "mlx4_ibt%d", port); ctx->wq = alloc_ordered_workqueue(name, WQ_MEM_RECLAIM); if (!ctx->wq) { pr_err("Failed to create tunnelling WQ for port %d\n", port); @@ -2197,7 +2211,15 @@ static int mlx4_ib_alloc_demux_ctx(struct mlx4_ib_dev *dev, goto err_wq; } - snprintf(name, sizeof name, "mlx4_ibud%d", port); + snprintf(name, sizeof(name), "mlx4_ibwi%d", port); + ctx->wi_wq = alloc_ordered_workqueue(name, WQ_MEM_RECLAIM); + if (!ctx->wi_wq) { + pr_err("Failed to create wire WQ for port %d\n", port); + ret = -ENOMEM; + goto err_wiwq; + } + + snprintf(name, sizeof(name), "mlx4_ibud%d", port); ctx->ud_wq = alloc_ordered_workqueue(name, WQ_MEM_RECLAIM); if (!ctx->ud_wq) { pr_err("Failed to create up/down WQ for port %d\n", port); @@ -2208,6 +2230,10 @@ static int mlx4_ib_alloc_demux_ctx(struct mlx4_ib_dev *dev, return 0; err_udwq: + destroy_workqueue(ctx->wi_wq); + ctx->wi_wq = NULL; + +err_wiwq: destroy_workqueue(ctx->wq); ctx->wq = NULL; @@ -2255,12 +2281,14 @@ static void mlx4_ib_free_demux_ctx(struct mlx4_ib_demux_ctx *ctx) ctx->tun[i]->state = DEMUX_PV_STATE_DOWNING; } flush_workqueue(ctx->wq); + flush_workqueue(ctx->wi_wq); for (i = 0; i < dev->dev->caps.sqp_demux; i++) { destroy_pv_resources(dev, i, ctx->port, ctx->tun[i], 0); free_pv_object(dev, i, ctx->port); } kfree(ctx->tun); destroy_workqueue(ctx->ud_wq); + destroy_workqueue(ctx->wi_wq); destroy_workqueue(ctx->wq); } } diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index eb53bb4c0c91c..0173e3931cc7f 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -459,6 +459,7 @@ struct mlx4_ib_demux_pv_ctx { struct ib_pd *pd; struct work_struct work; struct workqueue_struct *wq; + struct workqueue_struct *wi_wq; struct mlx4_ib_demux_pv_qp qp[2]; }; @@ -466,6 +467,7 @@ struct mlx4_ib_demux_ctx { struct ib_device *ib_dev; int port; struct workqueue_struct *wq; + struct workqueue_struct *wi_wq; struct workqueue_struct *ud_wq; spinlock_t ud_lock; atomic64_t subnet_prefix; -- 2.25.1