Date: Thu, 6 Oct 2022 17:41:40 +0100
Subject: Re: [PATCH
 v5 0/7] libsas and drivers: NCQ error handling
From: John Garry
To: Niklas Cassel
CC: Damien Le Moal, "jejb@linux.ibm.com", "martin.petersen@oracle.com",
 "jinpu.wang@cloud.ionos.com", "linux-scsi@vger.kernel.org",
 "linux-kernel@vger.kernel.org", Linuxarm, yangxingui, yanaijie
References: <1664262298-239952-1-git-send-email-john.garry@huawei.com>
 <27148ec5-d1ae-d9a2-1b00-a4c34d2da198@huawei.com>
 <5db6a7bc-dfeb-76e1-6899-7041daa934cf@opensource.wdc.com>
 <64ab35a7-f1ff-92ee-890e-89a5aee935a4@opensource.wdc.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 06/10/2022 15:45, Niklas Cassel wrote:
>> I think that it gets frozen when the internal command for read log ext times
>> out. More below about that timeout.
> ata_read_log_page() will first try to read using READ LOG DMA EXT.
> If that fails it will retry with READ LOG EXT.
>
> Your log has this:
> [ 350.257870] ata1.00: qc timeout (cmd 0x47)
>
> So it is definitely ATA_CMD_READ_LOG_DMA_EXT that times out.
>
> On timeout, ata_exec_internal_sg() will freeze the port:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c?h=v6.0#n1577
>
> When ata_read_log_page() retries with the port frozen,
> READ LOG EXT will obviously fail (since the port is frozen).
>
> Not sure why READ LOG DMA EXT would timeout for you...
> Perhaps your drive does not implement this command,
> and incorrectly reports supporting this command via
> ata_id_has_read_log_dma_ext().
>
> Perhaps you could try boot your kernel with libata.force=nodmalog
> on the kernel command line, so that ata_read_log_page() will use
> READ LOG EXT on the first try.
>

I tried that, and unfortunately it does not appear to help. I get this
log line, which shows that nodmalog was applied:

[ 15.757617] ata1.00: FORCE: horkage modified (nodmalog)

but the log page read still fails with a timeout:

[ 123.094430] ata1.00: qc timeout (cmd 0x2f)
[ 123.098637] pm80xx0:: mpi_sata_completion 2293: task null, freeing CCB tag 2
[ 123.105711] ata1.00: Read log 0x10 page 0x00 failed, Emask 0x5
[ 123.118081] ata1: failed to read log page 10h (errno=-5)

>
> Damien, it seems that there is no use in retrying if the port
> is frozen/we got a timeout, so perhaps:
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index e74ab6c0f1a0..1aa628332c8e 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -2035,7 +2035,8 @@ unsigned int ata_read_log_page(struct ata_device *dev, u8 log,
>         if (err_mask) {
>                 if (dma) {
>                         dev->horkage |= ATA_HORKAGE_NO_DMA_LOG;
> -                       goto retry;
> +                       if (err_mask != AC_ERR_TIMEOUT)
> +                               goto retry;
>                 }
>
> or:
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index e74ab6c0f1a0..2fa03b7573ac 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -2035,7 +2035,8 @@ unsigned int ata_read_log_page(struct ata_device *dev, u8 log,
>         if (err_mask) {
>                 if (dma) {
>                         dev->horkage |= ATA_HORKAGE_NO_DMA_LOG;
> -                       goto retry;
> +                       if (!(dev->link->ap->pflags & ATA_PFLAG_FROZEN))
> +                               goto retry;
>                 }
>
> would be in order, so that we actually print the real error, instead of a bogus
> AC_ERR_SYSTEM (returned by ata_exec_internal_sg()) when the port is frozen.
>
>>>> ata_do_link_abort() calls ata_eh_set_pending() without activating fast drain:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-eh.c?h=v6.0#n989
>>>>
>>>> So I'm not sure why your port is frozen.
>>>> (The fast drain timer does freeze the port, but it shouldn't be enabled.)
>>>> It might be worthwhile to see who freezes the port in your case.
>>> Might come from the command timeout. John has had many problems with the
>>> pm80xx HBA in his Arm machine from a while back. Likely not a driver issue
>>> but a hw one... No-one seems to be able to recreate the same problem.
>>>
>>> We need to try the HBA on our Arm board to see what happens.
>>>
>> Yeah, it just looks to be the longstanding issue of using this card on my
>> arm64 machine - that is that I get IO timeouts quite regularly. I should
>> have mentioned that yesterday. This just seems to be a driver issue.
> Out of curiosity, which arm64 SoC is this?

HiSilicon hi1620, which contains a custom Armv8 implementation. Note
that others have also seen the issue with this card on other Arm
implementations.

>
> While it is very unlikely that this is your problem, but I've encountered
> an issue on an ARM board before, where the PCIe controller was incorrectly
> configured in device tree, causing the controller to miss interrupts,
> which presented itself to the user as timeouts in the WiFi driver:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97131f85c08e024df49480ed499aae8fb754067f

Unlikely. Indeed, when I was checking this issue some time ago, I found
that not only was there no completion interrupt, but there was also no
completion visible when I manually examined the completion ring buffer
read and write pointers.

Here's where I discussed this issue a bit earlier:
https://lore.kernel.org/linux-scsi/PH0PR11MB511238B8FF7B44C375DDDFADEC519@PH0PR11MB5112.namprd11.prod.outlook.com/

Thanks,
John
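
For reference, the retry behaviour discussed in this thread can be sketched as a small user-space model. This is not kernel code: every model_* name, the struct layout, and the error-bit values are hypothetical stand-ins chosen for illustration. It only assumes what the thread states, namely that a timed-out internal command freezes the port and that an internal command issued on a frozen port fails immediately with a system error. The check_frozen flag corresponds to the second proposed diff (skip the retry when the port is frozen).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for libata error-mask bits (illustrative values). */
enum model_err {
	MODEL_ERR_NONE    = 0,
	MODEL_ERR_DEV     = 1 << 0,
	MODEL_ERR_TIMEOUT = 1 << 1,
	MODEL_ERR_SYSTEM  = 1 << 2,
};

/* Hypothetical model of a port plus the attached device's behaviour. */
struct model_port {
	bool frozen;           /* models the "port frozen" flag */
	bool dma_log_hangs;    /* DMA log read never completes */
	bool dma_log_rejected; /* DMA log read is aborted by the device */
};

/* Models an internal command as described in the thread. */
static unsigned int model_exec_internal(struct model_port *ap, bool dma)
{
	if (ap->frozen)
		return MODEL_ERR_SYSTEM;  /* fails at once on a frozen port */
	if (dma && ap->dma_log_hangs) {
		ap->frozen = true;        /* a timeout freezes the port */
		return MODEL_ERR_TIMEOUT;
	}
	if (dma && ap->dma_log_rejected)
		return MODEL_ERR_DEV;     /* plain failure, port stays usable */
	return MODEL_ERR_NONE;
}

/*
 * Models the retry logic: try DMA first, fall back to non-DMA on error.
 * With check_frozen set, the fallback is skipped once the port is frozen,
 * so the caller sees the original error instead of a system error.
 */
static unsigned int model_read_log_page(struct model_port *ap, bool dma,
					bool check_frozen)
{
	unsigned int err_mask;

retry:
	err_mask = model_exec_internal(ap, dma);
	if (err_mask && dma) {
		dma = false;              /* never try DMA again */
		if (!check_frozen || !ap->frozen)
			goto retry;
	}
	return err_mask;
}
```

In this model, an unconditional retry after a DMA timeout returns MODEL_ERR_SYSTEM, which hides the timeout, while the check_frozen variant returns MODEL_ERR_TIMEOUT. A DMA read that merely fails without freezing the port still falls back to the non-DMA read in both variants.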