From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Moshe Shemesh, Tariq Toukan, "David S.
 Miller"
Subject: [PATCH 4.14 020/242] net/mlx4_en: Handle TX error CQE
Date: Mon, 28 Dec 2020 13:47:05 +0100
Message-Id: <20201228124905.665889474@linuxfoundation.org>
X-Mailer: git-send-email 2.29.2
In-Reply-To: <20201228124904.654293249@linuxfoundation.org>
References: <20201228124904.654293249@linuxfoundation.org>
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

From: Moshe Shemesh

[ Upstream commit ba603d9d7b1215c72513d7c7aa02b6775fd4891b ]

In case error CQE was found while polling TX CQ, the QP is in error
state and all posted WQEs will generate error CQEs without any data
transmitted. Fix it by reopening the channels, via same method used for
TX timeout handling.

In addition add some more info on error CQE and WQE for debug.

Fixes: bd2f631d7c60 ("net/mlx4_en: Notify user when TX ring in error state")
Signed-off-by: Moshe Shemesh
Signed-off-by: Tariq Toukan
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |    1
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |   40 ++++++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |    5 +++
 3 files changed, 39 insertions(+), 7 deletions(-)

--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1746,6 +1746,7 @@ int mlx4_en_start_port(struct net_device
 				mlx4_en_deactivate_cq(priv, cq);
 				goto tx_err;
 			}
+			clear_bit(MLX4_EN_TX_RING_STATE_RECOVERING, &tx_ring->state);
 			if (t != TX_XDP) {
 				tx_ring->tx_queue = netdev_get_tx_queue(dev, i);
 				tx_ring->recycle_ring = NULL;
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -385,6 +385,35 @@ int mlx4_en_free_tx_buf(struct net_devic
 	return cnt;
 }
 
+static void mlx4_en_handle_err_cqe(struct mlx4_en_priv *priv, struct mlx4_err_cqe *err_cqe,
+				   u16 cqe_index, struct mlx4_en_tx_ring *ring)
+{
+	struct mlx4_en_dev *mdev = priv->mdev;
+	struct mlx4_en_tx_info *tx_info;
+	struct mlx4_en_tx_desc *tx_desc;
+	u16 wqe_index;
+	int desc_size;
+
+	en_err(priv, "CQE error - cqn 0x%x, ci 0x%x, vendor syndrome: 0x%x syndrome: 0x%x\n",
+	       ring->sp_cqn, cqe_index, err_cqe->vendor_err_syndrome, err_cqe->syndrome);
+	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET, 16, 1, err_cqe, sizeof(*err_cqe),
+		       false);
+
+	wqe_index = be16_to_cpu(err_cqe->wqe_index) & ring->size_mask;
+	tx_info = &ring->tx_info[wqe_index];
+	desc_size = tx_info->nr_txbb << LOG_TXBB_SIZE;
+	en_err(priv, "Related WQE - qpn 0x%x, wqe index 0x%x, wqe size 0x%x\n", ring->qpn,
+	       wqe_index, desc_size);
+	tx_desc = ring->buf + (wqe_index << LOG_TXBB_SIZE);
+	print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET, 16, 1, tx_desc, desc_size, false);
+
+	if (test_and_set_bit(MLX4_EN_STATE_FLAG_RESTARTING, &priv->state))
+		return;
+
+	en_err(priv, "Scheduling port restart\n");
+	queue_work(mdev->workqueue, &priv->restart_task);
+}
+
 bool mlx4_en_process_tx_cq(struct net_device *dev,
 			   struct mlx4_en_cq *cq, int napi_budget)
 {
@@ -431,13 +460,10 @@ bool mlx4_en_process_tx_cq(struct net_de
 		dma_rmb();
 
 		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
-			     MLX4_CQE_OPCODE_ERROR)) {
-			struct mlx4_err_cqe *cqe_err = (struct mlx4_err_cqe *)cqe;
-
-			en_err(priv, "CQE error - vendor syndrome: 0x%x syndrome: 0x%x\n",
-			       cqe_err->vendor_err_syndrome,
-			       cqe_err->syndrome);
-		}
+			     MLX4_CQE_OPCODE_ERROR))
+			if (!test_and_set_bit(MLX4_EN_TX_RING_STATE_RECOVERING, &ring->state))
+				mlx4_en_handle_err_cqe(priv, (struct mlx4_err_cqe *)cqe, index,
+						       ring);
 
 		/* Skip over last polled CQE */
 		new_index = be16_to_cpu(cqe->wqe_index) & size_mask;
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -267,6 +267,10 @@ struct mlx4_en_page_cache {
 	} buf[MLX4_EN_CACHE_SIZE];
 };
 
+enum {
+	MLX4_EN_TX_RING_STATE_RECOVERING,
+};
+
 struct mlx4_en_priv;
 
 struct mlx4_en_tx_ring {
@@ -313,6 +317,7 @@ struct mlx4_en_tx_ring {
 	 * Only queue_stopped might be used if BQL is not properly working.
 	 */
 	unsigned long queue_stopped;
+	unsigned long state;
 	struct mlx4_hwq_resources sp_wqres;
 	struct mlx4_qp sp_qp;
 	struct mlx4_qp_context sp_context;