Subject: Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
To: Keith Busch
Cc: axboe@fb.com, sagi@grimberg.me, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, hch@lst.de
References: <1517554849-7802-1-git-send-email-jianchao.w.wang@oracle.com> <1517554849-7802-3-git-send-email-jianchao.w.wang@oracle.com> <20180202182413.GH24417@localhost.localdomain> <20180205151314.GP24417@localhost.localdomain> <20180206151335.GE31110@localhost.localdomain> <20180207161345.GB1337@localhost.localdomain> <1826ebc1-d419-23da-12d4-dd7b1b3fe598@oracle.com>
From: "jianchao.wang"
Message-ID: <958cae59-1a01-d60f-822b-cf81cfa31b8f@oracle.com>
Date: Thu, 8 Feb 2018 22:17:00 +0800
In-Reply-To: <1826ebc1-d419-23da-12d4-dd7b1b3fe598@oracle.com>

Hi Keith

Sorry for bothering you again. ;)

There is a dangerous scenario caused by nvme_wait_freeze in nvme_reset_work; please consider it:

nvme_reset_work
  -> nvme_start_queues
  -> nvme_wait_freeze

If the controller gives no response, we have to rely on the timeout path, and then we hit the issues below.

nvme_dev_disable needs to be invoked; it will quiesce the queues and cancel and requeue the outstanding requests, so nvme_reset_work will hang at nvme_wait_freeze.

If we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED, as in the RESETTING case, nvme_reset_work will hang forever, because no one can complete the entered requests.

If we modify the state machine so that it can change from RESETTING to RECONNECTING and then invoke nvme_reset_ctrl to queue reset_work, we still cannot move things forward, because reset_work is already being executed.

If we use nvme_wait_freeze_timeout in nvme_reset_work and unfreeze and return when it expires, the timeout value is tricky to choose.

Maybe we could use blk_set_preempt_only instead, which blk_queue_enter also gates on. We needn't drain the queues for it, so it is lightweight, and nvme needn't worry about a full queue preventing admin requests from being submitted.
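Something like this is what I have in mind. It is just an untested sketch, modeled on the loop in nvme_start_freeze; the helper names are invented here:

/*
 * Untested sketch: mark every namespace queue preempt-only before a
 * reset instead of freezing it.  blk_queue_enter() then makes new
 * normal I/O wait without us having to drain the queues, so
 * nvme_reset_work() would no longer need nvme_wait_freeze().
 * The helper names are made up; the loop mirrors nvme_start_freeze()
 * in drivers/nvme/host/core.c.
 */
static void nvme_set_queues_preempt_only(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	mutex_lock(&ctrl->namespaces_mutex);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_set_preempt_only(ns->queue);
	mutex_unlock(&ctrl->namespaces_mutex);
}

/* Undo it once the new hctx map is established. */
static void nvme_clear_queues_preempt_only(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	mutex_lock(&ctrl->namespaces_mutex);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_clear_preempt_only(ns->queue);
	mutex_unlock(&ctrl->namespaces_mutex);
}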
Looking forward to your advice.

Sincerely
Jianchao

On 02/08/2018 09:40 AM, jianchao.wang wrote:
> Hi Keith
> 
> Really, thanks for your precious time and kind guidance.
> That's really appreciated. :)
> 
> On 02/08/2018 12:13 AM, Keith Busch wrote:
>> On Wed, Feb 07, 2018 at 10:13:51AM +0800, jianchao.wang wrote:
>>> What's the difference? Can you please point it out?
>>> I have shared my understanding below,
>>> but I don't get what difference you mean.
>>
>> It sounds like you have all the pieces. Just keep this in mind: we don't
>> want to fail IO if we can prevent it.
>>
> Yes, absolutely.
> 
>> A request is allocated from an hctx pool of tags. Once the request is
>> allocated, it is permanently tied to that hctx because that's where its
>> tag came from. If that hctx becomes invalid, the request has to be ended
>> with an error, and we can't do anything about that[*].
>>
>> Prior to a reset, we currently halt new requests from being allocated by
>> freezing the request queues. We unfreeze the queues after the new state
>> of the hctxs is established. This way all IO requests that were waiting
>> on the unfreeze are guaranteed to enter a valid context.
>>
>> You are proposing to skip the freeze on a reset. New requests will then be
>> allocated before we've established the hctx map. Any request allocated
>> will have to be terminated in failure if the hctx is no longer valid
>> once the reset completes.
> Yes. If any previous hctx doesn't come back, the requests on that hctx
> will be drained with BLK_STS_IOERR:
>   __blk_mq_update_nr_hw_queues
>     -> blk_mq_freeze_queue
>       -> blk_freeze_queue
>         -> blk_mq_freeze_queue_wait
> But the nvmeq's cq_vector is -1.
> 
>> Yes, it's entirely possible today a request allocated prior to the reset
>> may need to be terminated after the reset. There's nothing we can do
>> about those except end them in failure, but we can prevent new ones from
>> sharing the same fate. You are removing that prevention, and that's what
>> I am complaining about.
> 
> Thanks again for your precious time spent detailing this.
> So what you are concerned about is that this patch no longer freezes the
> queues for the reset case, so new requests may enter and will be failed if
> the associated hctx doesn't come back during the reset procedure, and this
> should be avoided.
> 
> I will change this in the next version, V3.
> 
>> * Future consideration: we recently obtained a way to "steal" bios that
>> looks like it may be used to back out certain types of requests and let
>> the bio create a new one.
>>
> Yeah, that would be a great way to reduce the loss when an hctx is gone.
> 
> Sincerely
> Jianchao
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme