Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices
From: Wenchao Hao
To: Hannes Reinecke, Mike Christie, Steffen Maier, "linux-kernel@vger.kernel.org", "James E.J. Bottomley", "Martin K. Petersen", Lee Duncan, John Garry
CC: Wu Bo, Feilong Lin
Date: Wed, 6 Apr 2022 18:32:43 +0800
Message-ID: <90e4af07-074c-1f60-e64a-e6dbe9a5c1bb@huawei.com>
In-Reply-To: <769bcd36-4818-8470-2daa-49ac5c05b33a@suse.de>

On 2022/4/4 13:28, Hannes Reinecke wrote:
> On 4/3/22 19:17, Mike Christie wrote:
>> On 4/3/22 12:14 PM, Mike Christie wrote:
>>> We could share code with scsi_ioctl_reset as well. Drivers that support
>>> TMFs via that ioctl already expect queuecommand to be possibly in the
>>> middle of a run and IO not yet timed out.
>>> For example, the code to
>>> block a queue and reset the device could be used for the new EH and
>>> SG_SCSI_RESET_DEVICE handling.
>>>
>>
>> Hannes or others,
>>
>> How do parallel SCSI drivers support scsi_ioctl_reset? Is is not fully
>> supported and more only used for controlled testing?
>
> That's actually a problem in scsi_ioctl_reset(); it really should wait for all I/O to quiesce. Currently it just sets the 'tmf' flag and calls into the various reset functions.
>
> But really, I'd rather get my EH rework in before we're start discussing modifying EH behaviour.
> Let me repost it ...
>
> Cheers,
>
> Hannes

Hi Hannes,

According to our statistics, the following scenarios cause an abort failure that can be handled by a LUN reset:

1. The disk firmware's task execution is abnormal;
2. Intermittent bit errors or intermittent disconnection;
3. The firmware does not respond to I/O.

The following scenarios cannot be handled by a LUN reset:

1. A disk hardware issue, which a LUN reset cannot fix;
2. A DDR UNC in the disk, which cannot be fixed; the only way out is to power the disk off and then on again;
3. The disk firmware is out of service, which cannot be fixed; the only way out is to power the disk off and then on again.

The statistics also show that most command abort failures can be handled by a LUN reset.
So we plan to design a lightweight timeout handling flow as follows:

                  if lightweight EH is disabled (default)
scsi_times_out =========================================> original EH flow
     ||
     || if lightweight EH is enabled
     ||
     \/
Do not use the current timeout flow; branch to another flow that
performs the following steps:

abort command
     ||
     || failed
     ||
     \/
stop the single LUN's I/O (need to wait until the LUN's failed command
                           count equals its busy command count)
     ||
     || failed (according to our statistics, 90% of LUN resets succeed)
     ||
     \/
reset the single LUN
     ||               if multiple LUNs of the host timed out
     || failed =========================================> perform host reset
     ||                                                        ||
     ||                                                        || failed
     ||                                                        ||
     || <=====================================================//
     \/
offline the disk

Since this is a lightweight EH, we prefer to offline the disk once the LUN reset fails. These changes would not affect the original EH flow. The advantage of this design is that it does not affect the other LUNs of the same host.
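
To make the escalation order easier to review, here is a minimal user-space model of the flow in Python. It is only a sketch of the proposal, not kernel code: the function names (lun_quiesced, lightweight_eh, handle_timeout), the step labels and the boolean callbacks are all hypothetical, and each callback stands in for a driver operation that reports success or failure.

```python
from typing import Callable, List, Tuple

# A step callback returns True on success, False on failure (assumption).
Step = Callable[[], bool]


def lun_quiesced(failed_cmds: int, busy_cmds: int) -> bool:
    """I/O on the LUN counts as stopped once every busy command has failed."""
    return failed_cmds == busy_cmds


def lightweight_eh(abort_cmd: Step, reset_lun: Step, reset_host: Step,
                   multi_lun_timeout: bool) -> Tuple[str, List[str]]:
    """Walk the proposed ladder and return (result, steps taken)."""
    steps = ["abort"]
    if abort_cmd():
        return "recovered", steps

    # Real code would block here until lun_quiesced() holds for this LUN.
    steps.append("stop_lun_io")

    steps.append("reset_lun")
    if reset_lun():
        return "recovered", steps

    # Host reset is attempted only when several LUNs of the host timed out.
    if multi_lun_timeout:
        steps.append("reset_host")
        if reset_host():
            return "recovered", steps

    steps.append("offline")
    return "offline", steps


def handle_timeout(lightweight_enabled: bool, abort_cmd: Step, reset_lun: Step,
                   reset_host: Step, multi_lun_timeout: bool) -> Tuple[str, List[str]]:
    """Entry point: with the option off (the default), the original EH runs."""
    if not lightweight_enabled:
        return "original_eh", []
    return lightweight_eh(abort_cmd, reset_lun, reset_host, multi_lun_timeout)
```

For example, handle_timeout(True, lambda: False, lambda: True, lambda: True, False) walks abort, stop_lun_io and reset_lun and then reports recovery without touching any other LUN of the host.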