From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Shay Drory, Moshe Shemesh, Saeed Mahameed, Sasha Levin
Subject: [PATCH 5.10 105/663] net/mlx5: Fix health error state handling
Date: Mon, 1 Mar 2021 17:05:53 +0100
Message-Id: <20210301161146.932110499@linuxfoundation.org>
In-Reply-To: <20210301161141.760350206@linuxfoundation.org>
From: Shay Drory

[ Upstream commit 51d138c2610a236c1ed0059d034ee4c74f452b86 ]

Currently, when we discover a fatal error, we queue a work item that
waits for a lock in order to move the device to the error state.
Meanwhile, FW commands are still being processed and time out. This
can block the driver for a few minutes before the work manages to take
the lock and move the device to the error state.

Set the device to the error state before queueing the health work, so
that FW commands are not processed while the work is waiting for the
lock.

Fixes: c1d4d2e92ad6 ("net/mlx5: Avoid calling sleeping function by the health poll thread")
Signed-off-by: Shay Drory
Reviewed-by: Moshe Shemesh
Signed-off-by: Saeed Mahameed
Signed-off-by: Sasha Levin
---
 .../net/ethernet/mellanox/mlx5/core/health.c | 22 ++++++++++++-------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index 54523bed16cd3..0c32c485eb588 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -190,6 +190,16 @@ static bool reset_fw_if_needed(struct mlx5_core_dev *dev)
 	return true;
 }
 
+static void enter_error_state(struct mlx5_core_dev *dev, bool force)
+{
+	if (mlx5_health_check_fatal_sensors(dev) || force) { /* protected state setting */
+		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
+		mlx5_cmd_flush(dev);
+	}
+
+	mlx5_notifier_call_chain(dev->priv.events, MLX5_DEV_EVENT_SYS_ERROR, (void *)1);
+}
+
 void mlx5_enter_error_state(struct mlx5_core_dev *dev, bool force)
 {
 	bool err_detected = false;
@@ -208,12 +218,7 @@ void mlx5_enter_error_state(struct mlx5_core_dev *dev, bool force)
 		goto unlock;
 	}
 
-	if (mlx5_health_check_fatal_sensors(dev) || force) { /* protected state setting */
-		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
-		mlx5_cmd_flush(dev);
-	}
-
-	mlx5_notifier_call_chain(dev->priv.events, MLX5_DEV_EVENT_SYS_ERROR, (void *)1);
+	enter_error_state(dev, force);
 unlock:
 	mutex_unlock(&dev->intf_state_mutex);
 }
@@ -613,7 +618,7 @@ static void mlx5_fw_fatal_reporter_err_work(struct work_struct *work)
 	priv = container_of(health, struct mlx5_priv, health);
 	dev = container_of(priv, struct mlx5_core_dev, priv);
 
-	mlx5_enter_error_state(dev, false);
+	enter_error_state(dev, false);
 	if (IS_ERR_OR_NULL(health->fw_fatal_reporter)) {
 		if (mlx5_health_try_recover(dev))
 			mlx5_core_err(dev, "health recovery failed\n");
@@ -707,8 +712,9 @@ static void poll_health(struct timer_list *t)
 		mlx5_core_err(dev, "Fatal error %u detected\n", fatal_error);
 		dev->priv.health.fatal_error = fatal_error;
 		print_health_info(dev);
+		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
 		mlx5_trigger_health_work(dev);
-		goto out;
+		return;
 	}
 
 	count = ioread32be(health->health_counter);
-- 
2.27.0
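The effect of the reordering in poll_health() can be illustrated with a small user-space model. This is a hypothetical, single-threaded sketch, not driver code: struct model_dev, model_cmd_exec(), model_poll_health(), model_health_work() and model_scenario() are invented names, the intf_state_mutex contention is reduced to "the work is queued but has not run yet", and the model assumes, as the commit message implies, that the command path refuses new commands once the device state is INTERNAL_ERROR.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical single-threaded model of the ordering change in this
 * patch. None of these names exist in the mlx5 driver.
 */
enum model_state {
	MODEL_STATE_UP,
	MODEL_STATE_INTERNAL_ERROR,
};

struct model_dev {
	enum model_state state;
	bool health_work_queued; /* queued but not yet run */
	int cmd_timeouts;        /* commands that waited on a dead FW */
	int sys_error_events;    /* stands in for the SYS_ERROR notifier */
};

/* Assumed command-path check: reject new commands once the device is in
 * the error state instead of sending them to the firmware. */
static int model_cmd_exec(struct model_dev *dev)
{
	if (dev->state == MODEL_STATE_INTERNAL_ERROR)
		return -EIO;       /* fails immediately */
	dev->cmd_timeouts++;       /* models a command that times out */
	return -ETIMEDOUT;
}

/* Health poll: after the patch the state is set before the work is
 * queued; set_state_first == false models the order before the patch. */
static void model_poll_health(struct model_dev *dev, bool set_state_first)
{
	if (set_state_first)
		dev->state = MODEL_STATE_INTERNAL_ERROR;
	dev->health_work_queued = true;
}

/* Deferred health work, standing in for enter_error_state(dev, false);
 * assumes the fatal sensors still report the error, so the state is set. */
static void model_health_work(struct model_dev *dev)
{
	assert(dev->health_work_queued);
	dev->health_work_queued = false;
	dev->state = MODEL_STATE_INTERNAL_ERROR;
	dev->sys_error_events++;
}

/* A fatal error is detected, pending_cmds commands arrive before the work
 * gets to run (in the driver: while it waits for a lock), then the work
 * runs. Returns how many of those commands timed out. */
static int model_scenario(bool set_state_first, int pending_cmds)
{
	struct model_dev dev = { .state = MODEL_STATE_UP };

	model_poll_health(&dev, set_state_first);
	for (int i = 0; i < pending_cmds; i++)
		model_cmd_exec(&dev);
	model_health_work(&dev);
	assert(dev.state == MODEL_STATE_INTERNAL_ERROR);
	return dev.cmd_timeouts;
}
```

With set_state_first false (the order before the patch), every command issued while the work is pending counts as a timeout; with it true (the order poll_health() uses after the patch), those commands fail fast with -EIO and none time out.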