From: Wenchao Hao
Date: Thu, 14 Sep 2023 14:20:20 +0800
Subject: Re: [RFC PATCH v2 00/18] scsi: scsi_error: Introduce new error handle mechanism
To: "James E . J . Bottomley", "Martin K . Petersen"
CC: Hannes Reinecke
In-Reply-To: <20230901094127.2010873-1-haowenchao2@huawei.com>
References: <20230901094127.2010873-1-haowenchao2@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2023/9/1 17:41, Wenchao Hao wrote:
> It is unacceptable for systems in which many SCSI devices share HBAs to
> block I/O to every device while error commands are handled; we need a
> new error handling mechanism to address this issue.
>
> I asked about this issue a year ago; the discussion link can be found in
> the references. Hannes kindly explained why we have to block the SCSI
> host before performing error recovery. I think it is unnecessary to block
> the SCSI host for all drivers; we can try a smaller-scope recovery (LUN
> based, for example) first to avoid blocking the SCSI host.
>
> The new error handling mechanism introduced in this patchset has been
> developed and tested with our self-developed hardware for about a year;
> now we would like this mechanism to be usable by more drivers.
>
> Drivers can decide whether to use the new error handling mechanism and
> how to handle error commands when scsi_device instances are scanned; the
> new mechanism makes SCSI error handling more flexible.
>
> The SCSI error recovery strategy after blocking the host's I/O mainly
> consists of the following steps:
>
> - LUN reset
> - Target reset
> - Bus reset
> - Host reset
>

Mike gave some suggestions and I found a bug in the fallback logic. I will
address these and resend in the next few days.

> Some drivers do not implement a callback for host reset, so it is
> unnecessary to block the host's I/O for these drivers. For example,
> smartpqi only registers device reset; if device reset fails, it is
> meaningless to fall back to target reset, bus reset or host reset,
> because these steps would also fail.
>
> Here are some drivers we are concerned with (there are too many kinds of
> drivers to cover, so I only list the ones I am familiar with):
>
> +-------------+--------------+--------------+-----------+------------+
> | drivers     | device_reset | target_reset | bus_reset | host_reset |
> +-------------+--------------+--------------+-----------+------------+
> | mpt3sas     | Y            | Y            | N         | Y          |
> +-------------+--------------+--------------+-----------+------------+
> | smartpqi    | Y            | N            | N         | N          |
> +-------------+--------------+--------------+-----------+------------+
> | megaraidsas | N            | Y            | N         | Y          |
> +-------------+--------------+--------------+-----------+------------+
> | virtioscsi  | Y            | N            | N         | N          |
> +-------------+--------------+--------------+-----------+------------+
> | iscsi_tcp   | Y            | Y            | N         | N          |
> +-------------+--------------+--------------+-----------+------------+
> | hisisas     | Y            | Y            | N         | N          |
> +-------------+--------------+--------------+-----------+------------+
>
> For LUN based error handling, when a SCSI command is classified as an
> error, we block the SCSI device's I/O and try to recover this SCSI
> device; if not all error commands can be recovered, it may fall back to
> target or host level recovery.
>
> Target based error handling works the same way, except that it blocks
> the SCSI target's I/O and then tries to recover the error commands of
> this target.
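
As an illustration of the escalation order and the callback table quoted
above, here is a minimal userspace sketch of a recovery ladder that skips
levels a driver does not implement. It is not code from the patchset: the
names eh_level, eh_ops, eh_escalate and the two stub callbacks are
hypothetical, and it only models the idea that a driver with a single
registered reset (as described for smartpqi) has nothing to escalate to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the recovery ladder; these names are
 * illustrative and are not taken from the patchset or the kernel. */
enum eh_level {
    EH_LUN_RESET,
    EH_TARGET_RESET,
    EH_BUS_RESET,
    EH_HOST_RESET,
    EH_LEVELS
};

typedef int (*eh_reset_fn)(void *ctx); /* returns 0 on success */

struct eh_ops {
    eh_reset_fn reset[EH_LEVELS]; /* NULL: the driver has no such callback */
};

/* Stub callbacks for demonstration only. */
int eh_stub_ok(void *ctx)   { (void)ctx; return 0; }
int eh_stub_fail(void *ctx) { (void)ctx; return -1; }

/*
 * Walk the ladder from LUN reset upwards, skipping levels without a callback.
 * Returns the level that succeeded, or -1 if every implemented level failed.
 * *attempts (if non-NULL) receives the number of callbacks actually invoked.
 */
int eh_escalate(const struct eh_ops *ops, void *ctx, int *attempts)
{
    int tried = 0;
    int result = -1;

    assert(ops != NULL);
    for (int lvl = EH_LUN_RESET; lvl < EH_LEVELS; lvl++) {
        if (ops->reset[lvl] == NULL)
            continue; /* nothing registered at this level */
        tried++;
        if (ops->reset[lvl](ctx) == 0) {
            result = lvl;
            break;
        }
    }
    if (attempts != NULL)
        *attempts = tried;
    return result;
}
```

With only the LUN level populated and failing, eh_escalate() invokes one
callback and returns -1. With a failing LUN reset and a working target
reset, it returns EH_TARGET_RESET after two attempts.
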
>
> The first patch defines the basic framework to support the LUN/target
> based error handling mechanism. Three key operations are abstracted:
> - add an error command
> - wake up the error handler
> - block I/O while an error command is added and being recovered
>
> Drivers can implement these three callbacks and register them with the
> SCSI middle level; I also add a general LUN/target based error handling
> strategy which drivers can call directly to implement LUN/target based
> error handling.
>
> The changes to the SCSI middle level's error handling are tested with
> scsi_debug, which supports single-LUN error injection; the scsi_debug
> patches can be found in the references. The following scenarios are
> tested.
>
> Scenario1: LUN based error handle is enabled:
> +-----------+---------+-------------------------------------------------------+
> | lun reset | TUR     | Desired result                                        |
> +-----------+---------+-------------------------------------------------------+
> | success   | success | retry or finish with EIO(may offline disk)            |
> +-----------+---------+-------------------------------------------------------+
> | success   | fail    | fallback to host recovery, retry or finish with       |
> |           |         | EIO(may offline disk)                                 |
> +-----------+---------+-------------------------------------------------------+
> | fail      | NA      | fallback to host recovery, retry or finish with       |
> |           |         | EIO(may offline disk)                                 |
> +-----------+---------+-------------------------------------------------------+
>
> Scenario2: target based error handle is enabled:
> +-----------+---------+--------------+---------+------------------------------+
> | lun reset | TUR     | target reset | TUR     | Desired result               |
> +-----------+---------+--------------+---------+------------------------------+
> | success   | success | NA           | NA      | retry or finish with         |
> |           |         |              |         | EIO(may offline disk)        |
> +-----------+---------+--------------+---------+------------------------------+
> | success   | fail    | success      | success | retry or finish with         |
> |           |         |              |         | EIO(may offline disk)        |
> +-----------+---------+--------------+---------+------------------------------+
> | fail      | NA      | success      | success | retry or finish with         |
> |           |         |              |         | EIO(may offline disk)        |
> +-----------+---------+--------------+---------+------------------------------+
> | fail      | NA      | success      | fail    | fallback to host recovery,   |
> |           |         |              |         | retry or finish with EIO(may |
> |           |         |              |         | offline disk)                |
> +-----------+---------+--------------+---------+------------------------------+
> | fail      | NA      | fail         | NA      | fallback to host recovery,   |
> |           |         |              |         | retry or finish with EIO(may |
> |           |         |              |         | offline disk)                |
> +-----------+---------+--------------+---------+------------------------------+
>
> Scenario3: both LUN and target based error handle are enabled:
> +-----------+---------+--------------+---------+------------------------------+
> | lun reset | TUR     | target reset | TUR     | Desired result               |
> +-----------+---------+--------------+---------+------------------------------+
> | success   | success | NA           | NA      | retry or finish with         |
> |           |         |              |         | EIO(may offline disk)        |
> +-----------+---------+--------------+---------+------------------------------+
> | success   | fail    | success      | success | lun recovery fallback to     |
> |           |         |              |         | target recovery, retry or    |
> |           |         |              |         | finish with EIO(may offline  |
> |           |         |              |         | disk)                        |
> +-----------+---------+--------------+---------+------------------------------+
> | fail      | NA      | success      | success | lun recovery fallback to     |
> |           |         |              |         | target recovery, retry or    |
> |           |         |              |         | finish with EIO(may offline  |
> |           |         |              |         | disk)                        |
> +-----------+---------+--------------+---------+------------------------------+
> | fail      | NA      | success      | fail    | lun recovery fallback to     |
> |           |         |              |         | target recovery, then fall   |
> |           |         |              |         | back to host recovery, retry |
> |           |         |              |         | or finish with EIO(may       |
> |           |         |              |         | offline disk)                |
> +-----------+---------+--------------+---------+------------------------------+
> | fail      | NA      | fail         | NA      | lun recovery fallback to     |
> |           |         |              |         | target recovery, then fall   |
> |           |         |              |         | back to host recovery, retry |
> |           |         |              |         | or finish with EIO(may       |
> |           |         |              |         | offline disk)                |
> +-----------+---------+--------------+---------+------------------------------+
>
> References: https://lore.kernel.org/linux-scsi/20230815122316.4129333-1-haowenchao2@huawei.com/
> References: https://lore.kernel.org/linux-scsi/71e09bb4-ff0a-23fe-38b4-fe6425670efa@huawei.com/
>
> Wenchao Hao (19):
>   scsi: scsi_error: Define framework for LUN/target based error handle
>   scsi: scsi_error: Move complete variable eh_action from shost to sdevice
>   scsi: scsi_error: Check if to do reset in scsi_try_xxx_reset
>   scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
>   scsi: scsi_error: Add helper scsi_eh_sdev_reset to do lun reset
>   scsi: scsi_error: Add flags to mark error handle steps has done
>   scsi: scsi_error: Add helper to handle scsi device's error command list
>   scsi: scsi_error: Add a general LUN based error handler
>   scsi: core: increase/decrease target_busy without check can_queue
>   scsi: scsi_error: Add helper to handle scsi target's error command list
>   scsi: scsi_error: Add a general target based error handler
>   scsi: scsi_debug: Add param to control LUN bassed error handler
>   scsi: scsi_debug: Add param to control target based error handle
>   scsi: mpt3sas: Add param to control LUN based error handle
>   scsi: mpt3sas: Add param to control target based error handle
>   scsi: smartpqi: Add param to control LUN based error handle
>   scsi: megaraid_sas: Add param to control target based error handle
>   scsi: virtio_scsi: Add param to control LUN based error handle
>   scsi: iscsi_tcp: Add param to control LUN based error handle
>
>  drivers/scsi/iscsi_tcp.c                  |  20 +
>  drivers/scsi/megaraid/megaraid_sas_base.c |  20 +
>  drivers/scsi/mpt3sas/mpt3sas_scsih.c      |  28 +
>  drivers/scsi/scsi_debug.c                 |  24 +
>  drivers/scsi/scsi_error.c                 | 756 ++++++++++++++++++++--
>  drivers/scsi/scsi_lib.c                   |  23 +-
>  drivers/scsi/scsi_priv.h                  |  18 +
>  drivers/scsi/smartpqi/smartpqi_init.c     |  14 +
>  drivers/scsi/virtio_scsi.c                |  16 +-
>  include/scsi/scsi_device.h                |  97 +++
>  include/scsi/scsi_eh.h                    |   8 +
>  include/scsi/scsi_host.h                  |   2 -
>  12 files changed, 963 insertions(+), 63 deletions(-)
>
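
The three scenario tables quoted above can be condensed into one small
decision function. This is only a sketch of the expected outcomes as the
tables state them, not of the patchset's implementation: eh_outcome,
eh_results and eh_expected_outcome are hypothetical names, and it assumes
the only difference between the scenarios that affects the outcome is
whether target based handling is enabled.

```c
#include <stdbool.h>

/* Hypothetical names; this models the quoted tables, not kernel code. */
enum eh_outcome {
    EH_DONE_AT_LUN,      /* retry or finish with EIO after LUN recovery */
    EH_DONE_AT_TARGET,   /* retry or finish with EIO after target recovery */
    EH_FALLBACK_TO_HOST  /* fall back to host recovery */
};

struct eh_results {
    bool lun_reset_ok;    /* LUN reset succeeded */
    bool lun_tur_ok;      /* TUR after LUN reset succeeded */
    bool target_reset_ok; /* target reset succeeded */
    bool target_tur_ok;   /* TUR after target reset succeeded */
};

/*
 * Expected result for one row of the scenario tables. Recovery stops at the
 * first level where both the reset and the following TUR succeed; the target
 * level is only considered when target based handling is enabled.
 */
enum eh_outcome eh_expected_outcome(bool target_eh_enabled,
                                    const struct eh_results *r)
{
    if (r->lun_reset_ok && r->lun_tur_ok)
        return EH_DONE_AT_LUN;
    if (target_eh_enabled && r->target_reset_ok && r->target_tur_ok)
        return EH_DONE_AT_TARGET;
    return EH_FALLBACK_TO_HOST;
}
```

For example, a failed LUN reset followed by a successful target reset and
TUR gives EH_DONE_AT_TARGET when target based handling is enabled
(Scenario2 and Scenario3), and EH_FALLBACK_TO_HOST when it is not
(Scenario1).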