From: Sasha Levin
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Jeffrey Hugo, Carl Vanderlip, Manivannan Sadhasivam, Sasha Levin,
 Julia.Lawall@inria.fr, quic_pkanojiy@quicinc.com, rostedt@goodmis.org,
 quic_qianyu@quicinc.com, quic_krichai@quicinc.com, mhi@lists.linux.dev,
 linux-arm-msm@vger.kernel.org
Subject: [PATCH AUTOSEL 6.8 04/28] bus: mhi: host: Add MHI_PM_SYS_ERR_FAIL state
Date: Wed, 3 Apr 2024 13:16:06 -0400
Message-ID: <20240403171656.335224-4-sashal@kernel.org>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240403171656.335224-1-sashal@kernel.org>
References: <20240403171656.335224-1-sashal@kernel.org>
X-stable: review
X-stable-base: Linux 6.8.3

From: Jeffrey Hugo

[ Upstream commit bce3f770684cc1d91ff9edab431b71ac991faf29 ]

When processing a SYSERR, if the device does not respond to the MHI_RESET
from the host, the host will be stuck in a difficult to recover state.
The host will remain in MHI_PM_SYS_ERR_PROCESS and not clean up the host
channels. Clients will not be notified of the SYSERR via the destruction
of their channel devices, which means clients may think that the device is
still up. Subsequent SYSERR events such as a device fatal error will not
be processed as the state machine cannot transition from PROCESS back to
DETECT. The only way to recover from this is to unload the mhi module
(wipe the state machine state) or for the mhi controller to initiate
SHUTDOWN.

This issue was discovered by stress testing soc_reset events on AIC100
via the sysfs node.

soc_reset is processed entirely in hardware. When the register write
hits the endpoint hardware, it causes the soc to reset without firmware
involvement. In stress testing, there is a rare race where soc_reset N
will cause the soc to reset and PBL to signal SYSERR (fatal error). If
soc_reset N+1 is triggered before PBL can process the MHI_RESET from the
host, then the soc will reset again, and re-run PBL from the beginning.
This will cause PBL to lose all state. PBL will be waiting for the host
to respond to the new syserr, but host will be stuck expecting the
previous MHI_RESET to be processed.

Additionally, the AMSS EE firmware (QSM) was hacked to synthetically
reproduce the issue by simulating a FW hang after the QSM issued a
SYSERR. In this case, soc_reset would not recover the device.

For this failure case, to recover the device, we need a state similar to
PROCESS, but one that can transition to DETECT. There is not a viable
existing state to use. POR has the needed transitions, but assumes the
device is in a good state and could allow the host to attempt to use the
device. Allowing PROCESS to transition to DETECT invites the possibility
of parallel SYSERR processing which could get the host and device out of
sync.

Thus, invent a new state - MHI_PM_SYS_ERR_FAIL.

This is essentially a holding state. It allows us to clean up the host
elements that are based on the old state of the device (channels), but
does not allow us to directly advance back to an operational state. It
does allow the detection and processing of another SYSERR which may
recover the device, or allows the controller to do a clean shutdown.

Signed-off-by: Jeffrey Hugo
Reviewed-by: Carl Vanderlip
Reviewed-by: Manivannan Sadhasivam
Link: https://lore.kernel.org/r/20240112180800.536733-1-quic_jhugo@quicinc.com
Signed-off-by: Manivannan Sadhasivam
Signed-off-by: Sasha Levin
---
 drivers/bus/mhi/host/init.c     |  1 +
 drivers/bus/mhi/host/internal.h |  9 ++++++---
 drivers/bus/mhi/host/pm.c       | 20 +++++++++++++++++---
 3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
index 65ceac1837f9a..8e5ec1a409b80 100644
--- a/drivers/bus/mhi/host/init.c
+++ b/drivers/bus/mhi/host/init.c
@@ -62,6 +62,7 @@ static const char * const mhi_pm_state_str[] = {
 	[MHI_PM_STATE_FW_DL_ERR] = "Firmware Download Error",
 	[MHI_PM_STATE_SYS_ERR_DETECT] = "SYS ERROR Detect",
 	[MHI_PM_STATE_SYS_ERR_PROCESS] = "SYS ERROR Process",
+	[MHI_PM_STATE_SYS_ERR_FAIL] = "SYS ERROR Failure",
 	[MHI_PM_STATE_SHUTDOWN_PROCESS] = "SHUTDOWN Process",
 	[MHI_PM_STATE_LD_ERR_FATAL_DETECT] = "Linkdown or Error Fatal Detect",
 };
diff --git a/drivers/bus/mhi/host/internal.h b/drivers/bus/mhi/host/internal.h
index 30ac415a3000f..4b6deea17bcd2 100644
--- a/drivers/bus/mhi/host/internal.h
+++ b/drivers/bus/mhi/host/internal.h
@@ -88,6 +88,7 @@ enum mhi_pm_state {
 	MHI_PM_STATE_FW_DL_ERR,
 	MHI_PM_STATE_SYS_ERR_DETECT,
 	MHI_PM_STATE_SYS_ERR_PROCESS,
+	MHI_PM_STATE_SYS_ERR_FAIL,
 	MHI_PM_STATE_SHUTDOWN_PROCESS,
 	MHI_PM_STATE_LD_ERR_FATAL_DETECT,
 	MHI_PM_STATE_MAX
@@ -104,14 +105,16 @@ enum mhi_pm_state {
 #define MHI_PM_FW_DL_ERR	BIT(7)
 #define MHI_PM_SYS_ERR_DETECT	BIT(8)
 #define MHI_PM_SYS_ERR_PROCESS	BIT(9)
-#define MHI_PM_SHUTDOWN_PROCESS	BIT(10)
+#define MHI_PM_SYS_ERR_FAIL	BIT(10)
+#define MHI_PM_SHUTDOWN_PROCESS	BIT(11)
 /* link not accessible */
-#define MHI_PM_LD_ERR_FATAL_DETECT	BIT(11)
+#define MHI_PM_LD_ERR_FATAL_DETECT	BIT(12)
 
 #define MHI_REG_ACCESS_VALID(pm_state)	((pm_state & (MHI_PM_POR | MHI_PM_M0 | \
 		MHI_PM_M2 | MHI_PM_M3_ENTER | MHI_PM_M3_EXIT | \
 		MHI_PM_SYS_ERR_DETECT | MHI_PM_SYS_ERR_PROCESS | \
-		MHI_PM_SHUTDOWN_PROCESS | MHI_PM_FW_DL_ERR)))
+		MHI_PM_SYS_ERR_FAIL | MHI_PM_SHUTDOWN_PROCESS | \
+		MHI_PM_FW_DL_ERR)))
 #define MHI_PM_IN_ERROR_STATE(pm_state)	(pm_state >= MHI_PM_FW_DL_ERR)
 #define MHI_PM_IN_FATAL_STATE(pm_state)	(pm_state == MHI_PM_LD_ERR_FATAL_DETECT)
 #define MHI_DB_ACCESS_VALID(mhi_cntrl)	(mhi_cntrl->pm_state & mhi_cntrl->db_access)
diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c
index a2f2feef14768..d0d033ce9984b 100644
--- a/drivers/bus/mhi/host/pm.c
+++ b/drivers/bus/mhi/host/pm.c
@@ -36,7 +36,10 @@
  *     M0 <--> M0
  *     M0 -> FW_DL_ERR
  *     M0 -> M3_ENTER -> M3 -> M3_EXIT --> M0
- * L1: SYS_ERR_DETECT -> SYS_ERR_PROCESS --> POR
+ * L1: SYS_ERR_DETECT -> SYS_ERR_PROCESS
+ *     SYS_ERR_PROCESS -> SYS_ERR_FAIL
+ *     SYS_ERR_FAIL -> SYS_ERR_DETECT
+ *     SYS_ERR_PROCESS --> POR
  * L2: SHUTDOWN_PROCESS -> LD_ERR_FATAL_DETECT
  *     SHUTDOWN_PROCESS -> DISABLE
  * L3: LD_ERR_FATAL_DETECT <--> LD_ERR_FATAL_DETECT
@@ -93,7 +96,12 @@ static const struct mhi_pm_transitions dev_state_transitions[] = {
 	},
 	{
 		MHI_PM_SYS_ERR_PROCESS,
-		MHI_PM_POR | MHI_PM_SHUTDOWN_PROCESS |
+		MHI_PM_POR | MHI_PM_SYS_ERR_FAIL | MHI_PM_SHUTDOWN_PROCESS |
+		MHI_PM_LD_ERR_FATAL_DETECT
+	},
+	{
+		MHI_PM_SYS_ERR_FAIL,
+		MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
 		MHI_PM_LD_ERR_FATAL_DETECT
 	},
 	/* L2 States */
@@ -629,7 +637,13 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
 							!in_reset, timeout);
 		if (!ret || in_reset) {
 			dev_err(dev, "Device failed to exit MHI Reset state\n");
-			goto exit_sys_error_transition;
+			write_lock_irq(&mhi_cntrl->pm_lock);
+			cur_state = mhi_tryset_pm_state(mhi_cntrl,
+							MHI_PM_SYS_ERR_FAIL);
+			write_unlock_irq(&mhi_cntrl->pm_lock);
+			/* Shutdown may have occurred, otherwise cleanup now */
+			if (cur_state != MHI_PM_SYS_ERR_FAIL)
+				goto exit_sys_error_transition;
 		}
 
 		/*
-- 
2.43.0
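
For readers who want to check the transition rules outside the kernel, here is a minimal user-space sketch of the L1 portion of the table after this patch. It is not the driver code: the demo_* names, the linear lookup, and the value chosen for DEMO_PM_POR are assumptions made for illustration. Only the bit positions 8 through 12 and the to_states masks for SYS_ERR_PROCESS and SYS_ERR_FAIL are taken from the diff, and the SYS_ERR_DETECT entry is simplified to the single edge shown in the state diagram comment.

```c
#include <assert.h>
#include <stddef.h>

#define BIT(n) (1u << (n))

/* Bits 8..12 follow the patched internal.h; DEMO_PM_POR is an assumed value. */
#define DEMO_PM_POR                 BIT(1)
#define DEMO_PM_SYS_ERR_DETECT      BIT(8)
#define DEMO_PM_SYS_ERR_PROCESS     BIT(9)
#define DEMO_PM_SYS_ERR_FAIL        BIT(10)
#define DEMO_PM_SHUTDOWN_PROCESS    BIT(11)
#define DEMO_PM_LD_ERR_FATAL_DETECT BIT(12)

/* Hypothetical stand-in for the driver's transition table entries. */
struct demo_transition {
	unsigned int from_state;
	unsigned int to_states; /* OR of every state reachable from from_state */
};

static const struct demo_transition demo_table[] = {
	/* Simplified: only the DETECT -> PROCESS edge from the diagram. */
	{ DEMO_PM_SYS_ERR_DETECT, DEMO_PM_SYS_ERR_PROCESS },
	{ DEMO_PM_SYS_ERR_PROCESS,
	  DEMO_PM_POR | DEMO_PM_SYS_ERR_FAIL | DEMO_PM_SHUTDOWN_PROCESS |
	  DEMO_PM_LD_ERR_FATAL_DETECT },
	{ DEMO_PM_SYS_ERR_FAIL,
	  DEMO_PM_SYS_ERR_DETECT | DEMO_PM_SHUTDOWN_PROCESS |
	  DEMO_PM_LD_ERR_FATAL_DETECT },
};

/*
 * Return the requested state if the table allows the move, otherwise
 * return the unchanged current state, so callers can compare the result
 * against the state they asked for.
 */
static unsigned int demo_tryset(unsigned int cur, unsigned int next)
{
	for (size_t i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
		if (demo_table[i].from_state == cur)
			return (demo_table[i].to_states & next) ? next : cur;
	}
	return cur;
}

/* Walk the recovery path described in the commit message. */
static void demo_selfcheck(void)
{
	/* PROCESS may not jump straight back to DETECT. */
	assert(demo_tryset(DEMO_PM_SYS_ERR_PROCESS, DEMO_PM_SYS_ERR_DETECT) ==
	       DEMO_PM_SYS_ERR_PROCESS);
	/* PROCESS may park in FAIL when the device ignores MHI_RESET. */
	assert(demo_tryset(DEMO_PM_SYS_ERR_PROCESS, DEMO_PM_SYS_ERR_FAIL) ==
	       DEMO_PM_SYS_ERR_FAIL);
	/* FAIL accepts a fresh SYSERR but cannot go directly to POR. */
	assert(demo_tryset(DEMO_PM_SYS_ERR_FAIL, DEMO_PM_SYS_ERR_DETECT) ==
	       DEMO_PM_SYS_ERR_DETECT);
	assert(demo_tryset(DEMO_PM_SYS_ERR_FAIL, DEMO_PM_POR) ==
	       DEMO_PM_SYS_ERR_FAIL);
}
```

Calling demo_selfcheck() from a small main() exercises the path PROCESS -> FAIL -> DETECT. The comparison of the returned state against the requested one mirrors the pattern in the last hunk, where the code only continues with cleanup if the returned cur_state equals MHI_PM_SYS_ERR_FAIL.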