Message-ID: <1dd69d03-b4f6-ab20-4923-0995b40f045d@suse.de>
Date: Wed, 30 Mar 2022 11:32:03 +0200
From: Hannes Reinecke
To: Wenchao Hao, Steffen Maier, linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org, James E.J. Bottomley, Martin K. Petersen, Mike Christie, Lee Duncan, John Garry
Cc: Wu Bo, Feilong Lin, zhangjian013@huawei.com
References: <71e09bb4-ff0a-23fe-38b4-fe6425670efa@huawei.com> <331aafe1-df9b-cae4-c958-9cf1800e389a@huawei.com> <64d5a997-a1bf-7747-072d-711a8248874d@suse.de>
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices
X-Mailing-List: linux-kernel@vger.kernel.org

On 3/30/22 11:11, Wenchao Hao wrote:
> On 2022/3/30 2:56, Hannes Reinecke wrote:
>> On 3/29/22 14:40, Wenchao Hao wrote:
>>> On 2022/3/29 18:56, Steffen Maier wrote:
>>>> On 3/29/22 11:06, Wenchao Hao wrote:
>>>>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
>>>>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
>>>>> devices in this host would not succeed until the scsi_error_handler() finished.
>>>>> The scsi_error_handler() might takes long time to be done, it's unbearable when
>>>>> host has massive devices.
>>>>>
>>>>> I want to ask is anyone applying another error handler flow to address this
>>>>> phenomenon?
>>>>>
>>>>> I think we can move some operations(like scsi get sense, scsi send startunit
>>>>> and scsi device reset) out of scsi_unjam_host(), to perform these operations
>>>>> without setting host to SHOST_RECOVERY? It would reduce the time of block the
>>>>> whole host.
>>>>>
>>>>> Waiting for your discussion.
>>>>
>>>> We already have "async" aborts before even entering scsi_eh.
>>>> So your use case seems to imply that those aborts fail and we enter scsi_eh?
>>>>
>>>
>>> Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is called.
>>>
>>>> There's eh_deadline for limiting the time spent in escalation of scsi_eh, and instead directly go to host reset. Would this help?
>>>>
>>>>
>>>
>>> The deadline seems not helpful. What we want to see is a single LUN's command error
>>> would not stop other LUNs which share the same host. So my plan is to move reset LUN out
>>> from scsi_unjam_host() which run with host set to SHOST_RECOVERY.
>>
>> Nope. One of the key points of scsi_unjam_host() is that it has to stop all I/O before proceeding. Without doing so basically all SCSI parallel HBAs will fail EH as they _require_ I/O to be stopped.
>>
>
> I still can not understand why we must stop all I/O. In my comprehension, stopping all I/O
> is because we might reset host during scsi_error_handler() and we must wait host's number of
> failed command equal to number of busy command then we can wake up scsi_error_handler().
>
> If move reset LUN out of scsi_error_handler(), and perform single LUN reset, we only need
> stop I/O of this single LUN, this would not affect other LUNs. If single LUN reset failed,
> we can then call in the large scale error handle.
>
I know the EH flow.
The problem here is the way parallel SCSI operates.
Remember, parallel SCSI is a _bus_, and there can be only one command at a time on the bus.
So if one command on the bus misfires and you have to start EH, you have to stop all I/O on the bus to ensure that your EH command is the only one active on the bus.

For modern HBAs we sure can devise other ways and means of error recovery, but I can't really see how we would do that on legacy HBAs.

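The bus argument can be made concrete with a small user-space model. This is only a sketch under stated assumptions: every name in it (toy_bus, toy_submit_io(), toy_eh_begin() and so on) is hypothetical, none of it is kernel code, and it merely mimics the idea that a host in a recovery state refuses normal I/O so that an EH command can be the only one active on the bus.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are illustrative and do not exist in the kernel. */
enum toy_host_state { TOY_HOST_RUNNING, TOY_HOST_RECOVERY };

struct toy_bus {
	enum toy_host_state state;
	int in_flight;	/* commands currently owning the bus: 0 or 1 */
};

/* Normal I/O is refused while recovering or while the bus is occupied. */
bool toy_submit_io(struct toy_bus *bus)
{
	if (bus->state == TOY_HOST_RECOVERY || bus->in_flight > 0)
		return false;
	bus->in_flight = 1;
	return true;
}

/* A normal command finished and released the bus. */
void toy_complete_io(struct toy_bus *bus)
{
	bus->in_flight = 0;
}

/*
 * Enter recovery: block new I/O. The model assumes the misfired command
 * is handed over to EH at this point, so it no longer counts as in flight.
 */
void toy_eh_begin(struct toy_bus *bus)
{
	bus->state = TOY_HOST_RECOVERY;
	bus->in_flight = 0;
}

/* An EH command (for example a reset) is only allowed on a quiesced bus. */
bool toy_eh_command(struct toy_bus *bus)
{
	if (bus->state != TOY_HOST_RECOVERY)
		return false;
	assert(bus->in_flight == 0);	/* EH command must be alone on the bus */
	return true;
}

/* Leave recovery and let normal I/O resume. */
void toy_eh_end(struct toy_bus *bus)
{
	bus->state = TOY_HOST_RUNNING;
}
```

In this model a per-LUN recovery path would need a second, finer-grained state per device, which is possible for an HBA that can keep several commands outstanding but has no equivalent on a bus with a single command slot.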
> Here is a brief flow:
>
> abort command
>    ||
>    || failed
>    ||
>    \/
> stop single LUN's I/O (need to wait LUN's failed command number equal to busy command number)
>    ||
>    || failed (according to our statistic, 90% reset LUN would succeed)
>    ||
>    \/

Interesting. This does not match up with my experience, where 99% of the errors were due to a command timeout.
So which errors do you see here? What are the causes?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer