Received: by 2002:a05:7412:a9a2:b0:e2:908c:2ebd with SMTP id o34csp219429rdh; Wed, 25 Oct 2023 23:25:39 -0700 (PDT) X-Google-Smtp-Source: AGHT+IHfsu4ZBJvcgJE8rQCyJlWDQCVn7XgEFbpShe+G+akAgnUL5JwWH4YMlO5IpjQRuZj88y4u X-Received: by 2002:a25:cc4e:0:b0:d9c:a7fa:dbcb with SMTP id l75-20020a25cc4e000000b00d9ca7fadbcbmr19015704ybf.22.1698301539281; Wed, 25 Oct 2023 23:25:39 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1698301539; cv=none; d=google.com; s=arc-20160816; b=ULynNS9LNedWHxY8M00os94buyISJV3bWXkZww+GZEhC/rg5UVTEzGR6d/NEngUn6w DWImCQUM/ZjXFjHaYVIKDcCs+njZHbmDlXHaPlJzgkYZSiZzSLEQqGiV661OFjUL6+6R 0ZrysJGW7IaW64x7uyg9wkAZMpTyPy3n/a6F8L0PaTQ9hy2va2MZ5hRvjXwqMysw6qMY o4grVBqqZYpwROZps3ZggDcwZovV1mu48MdcIXgLmXDZAFBOYO4tagOT5etrGYnBJg8N vSMVrS8oG1WHQ+duA0WZVo2AUCDpR0dkGdD97T0zUUx7F0b4uUAAH/M8Y7RW8gYGpMga YVnw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to :organization:from:content-language:references:cc:to:subject :user-agent:mime-version:date:message-id:dkim-signature; bh=D1gcMIfFqdKF3FT3Pv2ctPs1pha72c5hmdUgwWlnkIU=; fh=NuG/llHmBjZer+ZirO1KqH/9hVDDzCZRnoZJ96qcER8=; b=TtFHmZ2Uaa5XYKG7pt3jC0cgduWaUy+Tzx+MUIjY5Z3y7z1NVE/gbsRxsTXoeLyKlJ v4pB5jHxHCnuHgNrchw2WLawbUwt/GTTU9y3pt14lH0ZJSUdvLjijVF0Vac7kxOyRJiJ WrmbzPUG/J5nNVBpSbbhh6iip9ejCh5GUk52mVI5LArWDCvunx4eEewWsL4ct9M1X2pQ CQFnsSGVDFEMfOhMkEmct/6vzC8VkA4pqPi6DUa/kcbQ3l1mYT3LwgNGfTJfB+5REDoS GqO+0YWz2ejN7yGV7xw+WCHCKVnxtaj3xAkScX9+Af7v65RfoR2gOcdnNBiXooa5cK0E FvTg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=F9atq2u7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from groat.vger.email (groat.vger.email. [2620:137:e000::3:5]) by mx.google.com with ESMTPS id j9-20020a256e09000000b00d9a66ec5a28si13077554ybc.121.2023.10.25.23.25.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 25 Oct 2023 23:25:39 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) client-ip=2620:137:e000::3:5; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=F9atq2u7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by groat.vger.email (Postfix) with ESMTP id EE6D880D6AC1; Wed, 25 Oct 2023 23:25:34 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at groat.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1343564AbjJZGZY (ORCPT + 99 others); Thu, 26 Oct 2023 02:25:24 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34026 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229638AbjJZGZW (ORCPT ); Thu, 26 Oct 2023 02:25:22 -0400 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.88]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0CA72189; Wed, 25 Oct 2023 23:25:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1698301520; x=1729837520; h=message-id:date:mime-version:subject:to:cc:references: from:in-reply-to:content-transfer-encoding; bh=9dwbtHL/Eb13EI84Wvbp0wfhWl3liifdx/FKkioH6mA=; b=F9atq2u77tFrIZNDO7CQwDMVUcWaxn1MkT7a3dLWNV0PX4jBWeLVdU1c tZdrDVlYAVnFKEpEHQ0n7RPHCeWoVJWc1Q2aSE79k2PlKkI9QO9POOgAW rzba+wi4mKSZF4TgIri+rdX867n9dSzIqP2/t+b2GXzWr/qxJi4ybZQcb SZdyLh9QiAnMsy4qqfqs4oRG5Lq1DbUkkdUoCsUCtlyxG7tGLzCzfK/A8 eZM9AeLA8KzHALf7FeVskvacV+POc4jdoEd0jm3DvAgq/fZ2oPrQ+Biz4 B6iBTqKPu0eRTUPNxSUJOIqOdfWlF7KTXoQ+Gww3K+Ovr25c79kMi/4+Y Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10874"; a="418597908" X-IronPort-AV: E=Sophos;i="6.03,253,1694761200"; d="scan'208";a="418597908" Received: from fmviesa002.fm.intel.com ([10.60.135.142]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Oct 2023 23:25:19 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.03,253,1694761200"; d="scan'208";a="393267" Received: from ahunter6-mobl1.ger.corp.intel.com (HELO [10.0.2.15]) ([10.251.211.218]) by fmviesa002-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Oct 2023 23:25:06 -0700 Message-ID: Date: Thu, 26 Oct 2023 09:25:08 +0300 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH] mmc: cqhci: Be more verbose in error irq handler To: =?UTF-8?Q?Kornel_Dul=C4=99ba?= Cc: linux-mmc@vger.kernel.org, linux-kernel@vger.kernel.org, Asutosh Das , Ulf Hansson , Radoslaw Biernacki , Guenter Roeck , Gwendal Grignou , upstream@semihalf.com References: <20231016095610.1095084-1-korneld@chromium.org> <613c51f0-c32e-4de5-9627-525d92fb06ed@intel.com> <78bb4ad2-853a-4ed4-9998-c4e1122545b6@intel.com> Content-Language: en-US From: Adrian Hunter Organization: Intel Finland Oy, Registered Address: PL 281, 00181 Helsinki, Business Identity Code: 0357606 - 4, Domiciled in Helsinki In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on groat.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (groat.vger.email [0.0.0.0]); Wed, 25 Oct 2023 23:25:35 -0700 (PDT) On 25/10/23 11:01, Kornel Dulęba wrote: > On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter wrote: >> >> On 20/10/23 11:53, Kornel Dulęba wrote: >>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter wrote: >>>> >>>> On 16/10/23 12:56, Kornel Dulęba wrote: >>>>> There are several reasons for controller to generate an error interrupt. >>>>> They include controller<->card timeout, and CRC mismatch error. >>>>> Right now we only get one line in the logs stating that CQE recovery was >>>>> triggered, but with no information about what caused it. >>>>> To figure out what happened be more verbose and dump the registers from >>>>> irq error handler logic. >>>>> This matches the behaviour of the software timeout logic, see >>>>> cqhci_timeout. >>>>> >>>>> Signed-off-by: Kornel Dulęba >>>>> --- >>>>> drivers/mmc/host/cqhci-core.c | 5 +++-- >>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c >>>>> index b3d7d6d8d654..33abb4bd53b5 100644 >>>>> --- a/drivers/mmc/host/cqhci-core.c >>>>> +++ b/drivers/mmc/host/cqhci-core.c >>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error, >>>>> >>>>> terri = cqhci_readl(cq_host, CQHCI_TERRI); >>>>> >>>>> - pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n", >>>>> - mmc_hostname(mmc), status, cmd_error, data_error, terri); >>>>> + pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n", >>>>> + mmc_hostname(mmc), status, cmd_error, data_error); >>>>> + cqhci_dumpregs(cq_host); >>>> >>>> For debugging, isn't dynamic debug seems more appropriate? >>> >>> Dynamic debug is an option, but my personal preference would be to >>> just log more info in the error handler. >> >> Interrupt handlers can get called very rapidly, so some kind of rate >> limiting should be used if the message is unconditional. Also you need >> to provide actual reasons for your preference. >> >> For dynamic debug of the register dump, something like below is >> possible. >> >> #define cqhci_dynamic_dumpregs(cqhost) \ >> _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost) >> > Fair point. > The reason I'm not a fan of using dynamic debug for this is that my > goal here is to improve the warning/error logging information that we > get from systems running in production. > I.e. if we get a lot of "running CQE recovery" messages, at the very > least I'd like to know what is causing them. So you are saying you want to collect debug information from production systems, but don't want to use dynamic debug to do it? >>> To give you some background. >>> We're seeing some "running CQE recovery" lines in the logs, followed >>> by a dm_verity mismatch error. >>> The reports come from the field, with no feasible way to reproduce the >>> issue locally. >> >> If it is a software error, some kind of error injection may well >> reproduce it. Also if it is a hardware error that only happens >> during recovery, error injection could increase the likelihood of >> reproducing it. > > We tried software injection and it didn't yield any results. > We're currently looking into "injecting" hw errors by using a small > burst field generator to interfere with transfers on the data line > directly. I just tried instrumenting a driver to inject CRC errors and it revealed several CQE recovery issues, including spurious TCN for tag 31. I will send some patches when they are ready. >> >>> >>> I'd argue that logging only the info that CQE recovery was executed is >>> not particularly helpful for someone looking into those logs. >> >> As the comment says, that message is there because recovery reduces >> performance, it is not to aid debugging per se. >> >>> Ideally we would have more data about the state the controller was in >>> when the error happened, or at least what caused the recovery to be >>> triggered. >>> The question here is how verbose should we be in this error scenario. >>> Looking at other error scenarios, in the case of a software timeout >>> we're dumping the controller registers. (cqhci_timeout) >> >> Timeout means something is broken - either the driver, the cq engine >> or the card. On the other hand, an error interrupt is most likely a >> CRC error which is not unexpected occasionally, due to thermal drift >> or perhaps interference. > > Right, but my point is that we don't know what triggered CQE recovery. True, although probably a CRC error. > >> >>> Hence I thought that I'd be appropriate to match that and do the same >>> in CQE recovery logic. >> >> It needs to be consistent. There are other pr_debugs, such as: >> >> pr_debug("%s: cqhci: Failed to clear tasks\n", >> pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc)); >> pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc)); >> >> which should perhaps be treated the same. >> >> And there are no messages for errors from the commands in >> mmc_cqe_recovery(). > > How about this. > As a compromise would it be okay to just do a single pr_warn directly > from cqhci_error_irq. Sure, printk_ratelimited() or __ratelimit() > We could simply promote the existing pr_debug to pr_warn at the > beginning of that function. > This would tell us what triggered the recovery. (controller timeout, > CRC mismatch) > We can also consider removing the "running CQE recovery" print for the > sake of brevity. No, that serves a different purpose. > The only downside of this that I can see is that we'd be running the > logic from the interrupt handler directly, but I can't see an easy way > around that. > What do you think? Should be OK with rate limiting. >> >>> >>>> >>>>> >>>>> /* Forget about errors when recovery has already been triggered */ >>>>> if (cq_host->recovery_halt) >>>> >>