Received: by 2002:a05:7412:a9a2:b0:e2:908c:2ebd with SMTP id o34csp265239rdh; Thu, 26 Oct 2023 01:14:54 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEV5yR3CYdzL9aeuUUETPZkHES1dnZ94+SL8lgYxxUnMR15OGYFckj3DoZnckhjCiZW0/AK X-Received: by 2002:a25:2443:0:b0:d9a:5c30:c1e9 with SMTP id k64-20020a252443000000b00d9a5c30c1e9mr17967312ybk.35.1698308094441; Thu, 26 Oct 2023 01:14:54 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1698308094; cv=none; d=google.com; s=arc-20160816; b=DdqGgB9y3Pv0GjKZ1t+BeK8JaeLlAzmlF5EDZuq/65hbt5ccfUM4WHBlyrLkQbG+mF n31yY1d1Onys0/wPrSFPaMQHwm+0EzcBn2N843SoVYNRowy8Mai5Qmp0CUxkCtHIg6MC va2Y7swm5Vh/PVTn7BdQnwW9IMq7Nx16Rq0LnU/9Ol9Rvl7zjZf1DfV+MiJlyogphVG/ va3VzGyPc3OFwxtV6Ltek62VHpgs4Q5p6Pf0+p2wOyM+0SlyzRhqd6QqxOsnnqGYlR7Y Z0nsimjKh7RjzLXz70IUi3pQqEWZ847GGI4lC9BpTV8vQB3GhJ+6IFMix8lS32+jx6xH XruQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to :organization:from:references:cc:to:content-language:subject :user-agent:mime-version:date:message-id:dkim-signature; bh=6y+1IFmDwZh+QZRyVWU9Qo+AOhWiC6i4WcC9O1pBh7o=; fh=NuG/llHmBjZer+ZirO1KqH/9hVDDzCZRnoZJ96qcER8=; b=B1JIZBjyPopBjwWNWXa3oBChyFCafn1XVPSYIq+rJECZ3pHaE9WHFoBR9fCJUWuLnc MiuMsSASIbj6gFxMV3y9g12IUdApFyv0GO/XzJkfRJ6iLcZMzVvJ8RV/GzWizEWGJiL/ ix6l/tRsSHbXWnP2rMR9vs1T49IbuO6bpOZ0ayIuB2MJyeOGiwEozMaTdrtznv5FTnxc pOT5VsT3lupz2GwrLzzoIKnvb45YRClbvoh6uAuPQB4HDlOy081Uqc5lOaRiYTuimssz SzMbtZVM3tYPtysXVX6tk89NPgot+YIcHYbkOdnl7PANYH31Wh04Jgr1K3DvAyXcZKLJ 4MiA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=YAnEEhXz; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from snail.vger.email (snail.vger.email. [2620:137:e000::3:7]) by mx.google.com with ESMTPS id q6-20020a056902150600b00d9a402dcd12si17745483ybu.80.2023.10.26.01.14.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 26 Oct 2023 01:14:54 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) client-ip=2620:137:e000::3:7; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=YAnEEhXz; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 11AC68206D4E; Thu, 26 Oct 2023 01:14:53 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344545AbjJZIOq (ORCPT + 99 others); Thu, 26 Oct 2023 04:14:46 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50504 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344444AbjJZIOl (ORCPT ); Thu, 26 Oct 2023 04:14:41 -0400 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.93]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AA442B8; Thu, 26 Oct 2023 01:14:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1698308078; x=1729844078; h=message-id:date:mime-version:subject:to:cc:references: from:in-reply-to:content-transfer-encoding; bh=MFaMMnglpoDnzneeiuDNtTRmMj3rXcOLeCMqZ4UP8s4=; b=YAnEEhXzVo7A+Dno6tUEBnE0R4LBZAH9Qap7ccBZjGQ572KqKjY/tlWW SaVsILfzULfri+ykymwGuLzzKiIPuxvKy35mGzYQAlLO7T6WRCZwJeYk7 1/tUDzBnMxh31waR+J6Xv2dSSM4bc3dsMDK5iL6SvvKsIDG+G9qBCi8SJ J2dAV/MFB3inAfnY/lRlUNNTiP1m7qQeuMoOQInhwuCCMwzfou/N8h2wJ MIYwb7z9rtiZzHIm1C8U+jJH4ZJ2Husdonk7J1o7uvodQCF0tyH/TUgpk TnaRZfoybiYhJuRF8IWlAQDdpI0kxBGG/keOIBb40P7/TuYLs07Gr3iAm A==; X-IronPort-AV: E=McAfee;i="6600,9927,10874"; a="384692660" X-IronPort-AV: E=Sophos;i="6.03,253,1694761200"; d="scan'208";a="384692660" Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Oct 2023 01:14:32 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10874"; a="708985716" X-IronPort-AV: E=Sophos;i="6.03,253,1694761200"; d="scan'208";a="708985716" Received: from ahunter6-mobl1.ger.corp.intel.com (HELO [10.0.2.15]) ([10.251.211.218]) by orsmga003-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Oct 2023 01:14:26 -0700 Message-ID: Date: Thu, 26 Oct 2023 11:14:22 +0300 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH] mmc: cqhci: Be more verbose in error irq handler Content-Language: en-US To: =?UTF-8?Q?Kornel_Dul=C4=99ba?= Cc: linux-mmc@vger.kernel.org, linux-kernel@vger.kernel.org, Asutosh Das , Ulf Hansson , Radoslaw Biernacki , Guenter Roeck , Gwendal Grignou , upstream@semihalf.com References: <20231016095610.1095084-1-korneld@chromium.org> <613c51f0-c32e-4de5-9627-525d92fb06ed@intel.com> <78bb4ad2-853a-4ed4-9998-c4e1122545b6@intel.com> From: Adrian Hunter Organization: Intel Finland Oy, Registered Address: PL 281, 00181 Helsinki, Business Identity Code: 0357606 - 4, Domiciled in Helsinki In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_NONE,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Thu, 26 Oct 2023 01:14:53 -0700 (PDT) On 26/10/23 10:56, Kornel Dulęba wrote: > On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter wrote: >> >> On 25/10/23 11:01, Kornel Dulęba wrote: >>> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter wrote: >>>> >>>> On 20/10/23 11:53, Kornel Dulęba wrote: >>>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter wrote: >>>>>> >>>>>> On 16/10/23 12:56, Kornel Dulęba wrote: >>>>>>> There are several reasons for controller to generate an error interrupt. >>>>>>> They include controller<->card timeout, and CRC mismatch error. >>>>>>> Right now we only get one line in the logs stating that CQE recovery was >>>>>>> triggered, but with no information about what caused it. >>>>>>> To figure out what happened be more verbose and dump the registers from >>>>>>> irq error handler logic. >>>>>>> This matches the behaviour of the software timeout logic, see >>>>>>> cqhci_timeout. >>>>>>> >>>>>>> Signed-off-by: Kornel Dulęba >>>>>>> --- >>>>>>> drivers/mmc/host/cqhci-core.c | 5 +++-- >>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c >>>>>>> index b3d7d6d8d654..33abb4bd53b5 100644 >>>>>>> --- a/drivers/mmc/host/cqhci-core.c >>>>>>> +++ b/drivers/mmc/host/cqhci-core.c >>>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error, >>>>>>> >>>>>>> terri = cqhci_readl(cq_host, CQHCI_TERRI); >>>>>>> >>>>>>> - pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n", >>>>>>> - mmc_hostname(mmc), status, cmd_error, data_error, terri); >>>>>>> + pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n", >>>>>>> + mmc_hostname(mmc), status, cmd_error, data_error); >>>>>>> + cqhci_dumpregs(cq_host); >>>>>> >>>>>> For debugging, isn't dynamic debug seems more appropriate? >>>>> >>>>> Dynamic debug is an option, but my personal preference would be to >>>>> just log more info in the error handler. >>>> >>>> Interrupt handlers can get called very rapidly, so some kind of rate >>>> limiting should be used if the message is unconditional. Also you need >>>> to provide actual reasons for your preference. >>>> >>>> For dynamic debug of the register dump, something like below is >>>> possible. >>>> >>>> #define cqhci_dynamic_dumpregs(cqhost) \ >>>> _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost) >>>> >>> Fair point. >>> The reason I'm not a fan of using dynamic debug for this is that my >>> goal here is to improve the warning/error logging information that we >>> get from systems running in production. >>> I.e. if we get a lot of "running CQE recovery" messages, at the very >>> least I'd like to know what is causing them. >> >> So you are saying you want to collect debug information from production >> systems, but don't want to use dynamic debug to do it? >> >>>>> To give you some background. >>>>> We're seeing some "running CQE recovery" lines in the logs, followed >>>>> by a dm_verity mismatch error. >>>>> The reports come from the field, with no feasible way to reproduce the >>>>> issue locally. >>>> >>>> If it is a software error, some kind of error injection may well >>>> reproduce it. Also if it is a hardware error that only happens >>>> during recovery, error injection could increase the likelihood of >>>> reproducing it. >>> >>> We tried software injection and it didn't yield any results. >>> We're currently looking into "injecting" hw errors by using a small >>> burst field generator to interfere with transfers on the data line >>> directly. >> >> I just tried instrumenting a driver to inject CRC errors and it >> revealed several CQE recovery issues, including spurious TCN for >> tag 31. I will send some patches when they are ready. > > Sorry, what I meant by it didn't yield results is that Ii didn't > trigger the dm-verity error that we're seeing on production. > With SW injection there are two potential issues that come to my mind: > > 1. In the cqhci_error_irq when TERRI is not valid only a single, > "random" task is marked as bad. > Then in cqhci_recover_mrq we're marking all pending requests as done. > For data transfers this is somewhat bening as it will return with > bytes_xfered=0. > IIUC this will then cause the upper layer to re-enqueue this request. Yes > The bigger problem is a CMD only mrq, which will be mistakenly marked > as completed successfully. I noticed that also. Notably the only non-data CMD is cache flush. There are several other issues, but patches will describe them better. > > 2. As for the spurious task completion warning. > I initially thought that it was bening. > The check for !mrq is done before checking if we're currently doing recovery. > So if it's called just right at the end of recovery, right after the > cqhci_recover_mrqs is executed that would explain it. > With that being said if that irq handler is run right after the > recovery is finished we'll end up with a race where a new request, > that was just enqueued, might be mistakenly marked as done. > This would explain the dm-verity errors we're seeing. > >> >>>> >>>>> >>>>> I'd argue that logging only the info that CQE recovery was executed is >>>>> not particularly helpful for someone looking into those logs. >>>> >>>> As the comment says, that message is there because recovery reduces >>>> performance, it is not to aid debugging per se. >>>> >>>>> Ideally we would have more data about the state the controller was in >>>>> when the error happened, or at least what caused the recovery to be >>>>> triggered. >>>>> The question here is how verbose should we be in this error scenario. >>>>> Looking at other error scenarios, in the case of a software timeout >>>>> we're dumping the controller registers. (cqhci_timeout) >>>> >>>> Timeout means something is broken - either the driver, the cq engine >>>> or the card. On the other hand, an error interrupt is most likely a >>>> CRC error which is not unexpected occasionally, due to thermal drift >>>> or perhaps interference. >>> >>> Right, but my point is that we don't know what triggered CQE recovery. >> >> True, although probably a CRC error. >> >>> >>>> >>>>> Hence I thought that I'd be appropriate to match that and do the same >>>>> in CQE recovery logic. >>>> >>>> It needs to be consistent. There are other pr_debugs, such as: >>>> >>>> pr_debug("%s: cqhci: Failed to clear tasks\n", >>>> pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc)); >>>> pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc)); >>>> >>>> which should perhaps be treated the same. >>>> >>>> And there are no messages for errors from the commands in >>>> mmc_cqe_recovery(). >>> >>> How about this. >>> As a compromise would it be okay to just do a single pr_warn directly >>> from cqhci_error_irq. >> >> Sure, printk_ratelimited() or __ratelimit() >> >>> We could simply promote the existing pr_debug to pr_warn at the >>> beginning of that function. >>> This would tell us what triggered the recovery. (controller timeout, >>> CRC mismatch) >>> We can also consider removing the "running CQE recovery" print for the >>> sake of brevity. >> >> No, that serves a different purpose. >> >>> The only downside of this that I can see is that we'd be running the >>> logic from the interrupt handler directly, but I can't see an easy way >>> around that. >>> What do you think? >> >> Should be OK with rate limiting. > > OK, I'll look into the rate limiting and will send a v2. > >> >>>> >>>>> >>>>>> >>>>>>> >>>>>>> /* Forget about errors when recovery has already been triggered */ >>>>>>> if (cq_host->recovery_halt) >>>>>> >>>> >>