Subject: Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
To: Keith Busch
Cc: axboe@fb.com, linux-kernel@vger.kernel.org, hch@lst.de, linux-nvme@lists.infradead.org, sagi@grimberg.me
From: "jianchao.wang"
Date: Wed, 7 Feb 2018 10:03:49 +0800
In-Reply-To: <20180206151335.GE31110@localhost.localdomain>
Hi Keith

Thanks for your time and kind response on this.

On 02/06/2018 11:13 PM, Keith Busch wrote:
> On Tue, Feb 06, 2018 at 09:46:36AM +0800, jianchao.wang wrote:
>> Hi Keith
>>
>> Thanks for your kind response.
>>
>> On 02/05/2018 11:13 PM, Keith Busch wrote:
>>> but how many requests are you letting enter to their demise by
>>> freezing on the wrong side of the reset?
>>
>> There are only two differences between this patch and the original one:
>> 1. Don't freeze the queue for the reset case. At that moment, the outstanding requests will be requeued back to the blk-mq queues.
>> Requests that newly enter during the reset will also stay in the blk-mq queues. None of these requests will enter the nvme driver layer,
>> because the request_queues are quiesced. They will be issued after the reset completes successfully.
>> 2. Drain the request queue before nvme_dev_disable. This is nearly the same as the previous rule, which also unquiesces the queue
>> and lets the requests drain. The only difference is that this patch invokes wait_freeze in nvme_dev_disable instead
>> of nvme_reset_work.
>>
>> We don't sacrifice any request. This patch does the same thing as the previous one and makes things clearer.
>
> No, what you're proposing is quite different.
>
> By "enter", I'm referring to blk_queue_enter.

When a request is allocated, it holds a request_queue->q_usage_counter reference until it is freed. Please refer to:
  blk_mq_get_request  -> blk_queue_enter_live
  blk_mq_free_request -> blk_queue_exit

Regarding 'a request enters into an hctx', I don't quite get the point. I think you mean it enters the nvme driver layer.

> Once a request enters
> into an hctx, it can not be backed out to re-enter a new hctx if the
> original one is invalidated.

I don't quite get the point here either. We certainly will not issue a request that has already been issued to another hctx.
What this patch, and also the original one, does is: disable/shutdown the controller, then cancel and requeue (or fail) the outstanding requests.
The requeue mechanism ensures the requests are inserted into the ctx that req->mq_ctx->cpu points to.

>
> Prior to a reset, all requests that have entered the queue are committed
> to that hctx,

A request could be on:
 - a blk-mq per-cpu ctx->rq_list or an IO scheduler list
 - the hctx->dispatch list, or
 - request_queue->requeue_list (from which it will be inserted back into the 1st case)

When requests are issued, they are dequeued from the 1st or 2nd case and submitted to the nvme driver layer. These requests are the _outstanding_ ones.

When the request queue is quiesced, requests stay on the blk-mq per-cpu ctx->rq_list, the IO scheduler list, or the hctx->dispatch list, and cannot be issued to the driver layer any more.
When the request queue is frozen, bios are gated out in generic_make_request, so new requests cannot enter the blk-mq layer any more, and certainly not the nvme driver layer.

For the reset case, the nvme controller will be back soon, so we needn't freeze the queue; quiescing is enough. The outstanding requests will be canceled and _requeued_ to request_queue->requeue_list, then inserted into the blk-mq layer again by requeue_work. When reset_work completes and starts the queues again, all of these requests will be issued. :)
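To make the reset-case flow concrete, here is a condensed sketch using the existing helpers from that era (nvme_stop_queues/nvme_start_queues quiesce and unquiesce every namespace queue, and nvme_cancel_request ends a request with an abort status that nvme_complete_rq then requeues). The function name reset_case_sketch and the arrangement of the steps are mine, just to show the shape; this is not the patch itself:

#include "nvme.h"	/* as if inside drivers/nvme/host/pci.c, 4.15-era APIs */

/*
 * Condensed sketch of the reset case described above: quiesce (don't
 * freeze), kick the outstanding requests back to blk-mq, reset the
 * controller, then let everything flow again.
 */
static void reset_case_sketch(struct nvme_dev *dev)
{
	/* Quiesce every namespace queue: new and requeued requests now
	 * park in the blk-mq lists instead of reaching the driver. */
	nvme_stop_queues(&dev->ctrl);

	/* End each outstanding request with an abort status; since the
	 * queues are not dying, nvme_complete_rq() requeues them onto
	 * request_queue->requeue_list for requeue_work to re-insert. */
	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request,
				&dev->ctrl);

	/* ... disable, reset and re-enable the controller here ... */

	/* Unquiesce: the parked and requeued requests are issued again. */
	nvme_start_queues(&dev->ctrl);
}

Note there is no freeze/unfreeze pair anywhere in this path; that is the whole point for the reset case.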
For the shutdown case, freezing plus quiescing is safer. Also, we will wait for the requests to complete if the controller is still alive. If it is dead, we need to fail them directly instead of requeuing them; otherwise I/O hangs will occur, because the controller will be offline for some time.

> and we can't do anything about that. The only thing we can
> do is prevent new requests from entering until we're sure that hctx is
> valid on the other side of the reset.
>

Yes, that is what this patch does.

Let me add some explanation here about the 4th patch, 'nvme-pci: break up nvme_timeout and nvme_dev_disable'. Also, thanks for your time looking into this. That's really appreciated!

The key point is blk_abort_request. It forces the request to expire so that it is handled in the timeout work context, which is safe against racing with the irq completion path. This is the most important reason to use blk_abort_request. We don't _complete_ the request or _rearm_ the timer, but just set a CANCELED flag, so the request will not be freed. These requests then cannot be taken away by the irq completion path, and the timeout path is avoided as well (they are no longer outstanding to it). So we say 'all the outstanding requests are grabbed'. When we have closed the controller completely, we can complete these requests safely. This is the core idea of the 4th patch; a rough sketch of it is appended below my sign-off.

Many thanks
Jianchao
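As mentioned above, here is a rough sketch of the 'grab' idea against the 4.15-era blk-mq timeout API. To be clear, this is only my condensation for illustration, not the actual 4th patch: the function names grab_one_request and grab_timeout are made up, while blk_abort_request, blk_mq_tagset_busy_iter, nvme_req and NVME_REQ_CANCELLED are the real helpers.

#include "nvme.h"	/* for nvme_req() and NVME_REQ_CANCELLED */

/* Force one outstanding request into the timeout path. */
static void grab_one_request(struct request *req, void *data, bool reserved)
{
	/*
	 * blk_abort_request() sets the deadline to 'now' and kicks the
	 * timeout work, so the request is handled in the timeout work
	 * context, which serializes safely against the irq completion
	 * path.
	 */
	blk_abort_request(req);
}

static enum blk_eh_timer_return grab_timeout(struct request *req, bool reserved)
{
	/*
	 * Neither complete the request nor rearm the timer: just mark it
	 * cancelled and leave it outstanding.  blk-mq does nothing more
	 * for BLK_EH_NOT_HANDLED, the irq path can no longer take the
	 * request away, and it cannot expire again, so it stays
	 * "grabbed" until the controller is fully disabled and we
	 * complete it ourselves.
	 */
	nvme_req(req)->flags |= NVME_REQ_CANCELLED;
	return BLK_EH_NOT_HANDLED;
}

Sweeping every outstanding request into this state would then just be blk_mq_tagset_busy_iter(&dev->tagset, grab_one_request, NULL).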