Subject: Re: [PATCH V4 1/2] nvme: add NVME_CTRL_RESET_PREPARE state
To: James Smart, Max Gurtovoy, keith.busch@intel.com, axboe@fb.com, hch@lst.de, sagi@grimberg.me
Cc: linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org
References: <1516164877-2170-1-git-send-email-jianchao.w.wang@oracle.com> <1516164877-2170-2-git-send-email-jianchao.w.wang@oracle.com> <11106a93-2e78-c853-e6eb-35c652dab3a9@mellanox.com> <50ca626e-7933-5a60-33f9-52e2ac8d0f25@broadcom.com>
From: "jianchao.wang"
Message-ID: <9a42cfac-642a-6838-625a-eeb51bec5c78@oracle.com>
Date: Thu, 18 Jan 2018 11:13:24 +0800
In-Reply-To: <50ca626e-7933-5a60-33f9-52e2ac8d0f25@broadcom.com>
Hi James,

Thank you for your kind and detailed response. That's really appreciated!
And sorry for my poor description.

On 01/18/2018 05:01 AM, James Smart wrote:
> I'm having a hard time following why this patch is being requested. Help me catch on.....

Actually, this patchset is to fix an issue in nvme_timeout. Please consider the following scenario:

nvme_reset_ctrl
    -> set state to RESETTING
    -> queue reset_work
       (scheduling gap)
nvme_reset_work
    -> nvme_dev_disable
        -> quiesce queues
        -> nvme_cancel_request on outstanding requests
------------------------------------------------------------ _boundary_
    -> nvme initializing (may issue requests on adminq)

Before the boundary, we not only quiesce the queues but also cancel all the outstanding requests.

A request could expire while the ctrl state is RESETTING:
- If the timeout occurs before the _boundary_, the expired requests were issued before the reset.
- Otherwise, the expired requests come from the controller initializing procedure, such as the cq/sq create commands sent on the adminq to set up the IO queues.

In the current implementation, nvme_timeout cannot identify the _boundary_, so it only handles the second case above.

This patchset makes the _boundary_ identifiable by introducing RESET_PREPARE. Please refer to the following diagram:

nvme_reset_ctrl
    -> set state to RESET_PREPARE
    -> queue reset_work
       (scheduling gap)
nvme_reset_work
    -> nvme_dev_disable
        -> quiesce queues
        -> nvme_cancel_request on outstanding requests
    -> set state to RESETTING  ----------> _boundary_
    -> nvme initializing (may issue requests on adminq)

Then we can identify the boundary by checking ctrl->state.

After Sagi introduced d5bf4b7 ("nvme-rdma: fix concurrent reset and reconnect"), the same thing is actually done in both nvme-rdma and nvme-fc:
- nvme_rdma_error_recovery_work
- nvme_rdma_reset_ctrl_work
- nvme_fc_reset_ctrl_work
but there the state goes from RESETTING to RECONNECTING. So in this patch, RESETTING in nvme-fc/rdma is changed to RESET_PREPARE, and we get:

nvme-fc/rdma:  RESET_PREPARE -> RECONNECTING -> LIVE
nvme-pci:      RESET_PREPARE -> RESETTING    -> LIVE

> On 1/16/2018 8:54 PM, Jianchao Wang wrote:
>> Currently, the ctrl->state will be changed to NVME_CTRL_RESETTING
>> before queue the reset work. This is not so strict. There could be
>> a big gap before the reset_work callback is invoked.
>
> ok.... so what you're saying is you want something to know that you've transitioned to RESETTING but not yet performed an action for the reset. What is that something and what is it to do?
>
> guessing - I assume it's in the transport xxx_is_ready() and/or nvmf_check_init_req(), which is called at the start of queue_rq(), that wants to do something?

This should be explained above. :)

>> In addition,
>> there is some disable work in the reset_work callback, strictly
>> speaking, not part of reset work, and could lead to some confusion.
>
> Can you explain this? what's the confusion?
>
> I assume by "disable" you mean it is quiescing queues?

Sorry for my poor description. It not only quiesces the queues, but also handles the outstanding requests.
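To make that concrete, here is a rough sketch (an illustration only, not the actual patch text) of how nvme_timeout() in nvme-pci could branch on the proposed state to tell the two cases apart:

    switch (dev->ctrl.state) {
    case NVME_CTRL_RESET_PREPARE:
            /*
             * Before the _boundary_: the request was already outstanding
             * when the reset was scheduled; nvme_dev_disable() will quiesce
             * the queues and cancel it together with the other outstanding
             * requests.
             */
            nvme_dev_disable(dev, false);
            return BLK_EH_HANDLED;
    case NVME_CTRL_RESETTING:
            /*
             * After the _boundary_: the request was issued by the
             * initializing procedure itself (e.g. the cq/sq create
             * commands on the adminq), so this reset attempt has failed
             * and needs its own error handling.
             */
            nvme_dev_disable(dev, false);
            return BLK_EH_HANDLED;
    default:
            break;
    }

The point is only that the two cases become distinguishable by ctrl->state; what each branch should then do is what the rest of this thread is discussing.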
>> In addition, after set state to RESETTING and disable procedure,
>> nvme-rdma/fc use NVME_CTRL_RECONNECTING to mark the setup and
>> reconnect procedure. The RESETTING state has been narrowed.
>
> I still don't follow. Yes, RECONNECTING is where we repetitively: try to create a link-side association again; if it fails, delay and try again; or if success, reinit the controller and unquiesce all queues - allowing full operation again, at which time we transition to LIVE.
>
> by "narrowed" what do you mean? what "narrowed"?

Currently, in nvme-pci, the RESETTING state includes both "shutdown" and "setup". But in nvme-fc/rdma, RESETTING only covers "shutdown"; "setup" is in RECONNECTING.

shutdown - tear down queues/connections, quiesce queues, cancel requests ...
setup    - set up new IO queues or connections and some other initializing work ...

> In FC, as we have a lot of work that must occur to terminate io as part of the reset, it can be a fairly long window. I don't know that any functionality in this path, regardless of time window, has narrowed.
>
>> This patch add NVME_CTRL_RESET_PREPARE state to mark the reset_work
>> or error recovery work, scheduling gap and disable procedure.
>> After that,
>>   - For nvme-pci, nvmet-loop, set state to RESETTING, start
>>     initialization.
>>   - For nvme-rdma, nvme-fc, set state to RECONNECTING, start
>>     initialization or reconnect.
>
> So I'm lost - so you've effectively renamed RESETTING to RESET_PREPARE for fc/rdma. What do you define are the actions in RESETTING that went away, and why is that different between pci and the other transports?

As Sagi said, we could merge RECONNECTING and RESETTING into a single state.

> Why doesn't nvme-pci need to go through RESET_PREPARE? doesn't it have the same scheduling window for a reset_work thread?

Sure, nvme-pci also needs to pass through RESET_PREPARE.

> On 1/17/2018 1:06 AM, Max Gurtovoy wrote:
>>> +
>>> +    case NVME_CTRL_RESETTING:
>>> +        switch (old_state) {
>>> +        case NVME_CTRL_RESET_PREPARE:
>>> +            changed = true;
>>> +            /* FALLTHRU */
>>> +        default:
>>> +            break;
>>> +        }
>>> +        break;
>>>       case NVME_CTRL_RECONNECTING:
>>>           switch (old_state) {
>>>           case NVME_CTRL_LIVE:
>>> -        case NVME_CTRL_RESETTING:
>>> +        case NVME_CTRL_RESET_PREPARE:
>>
>> As I suggested in V3, please don't allow this transition.
>> We'll move to NVME_CTRL_RECONNECTING from NVME_CTRL_RESETTING.
>>
>> I look on it like that:
>>
>> NVME_CTRL_RESET_PREPARE - "suspend" state
>> NVME_CTRL_RESETTING - "resume" state
>>
>> you don't reconnect from "suspend" state, you must "resume" before you reconnect.
>
> This makes no sense to me.
>
> I could use a definition of what "suspend" and "resume" mean to you....
>
> from what I've seen so far:
> NVME_CTRL_RESET_PREPARE: means I've decided to reset, changed state, but the actual work for reset hasn't started yet. As we haven't commonized who does the quiescing of the queues, the queues are still live at this state, although some nvme check routine may bounce them. In truth, the queues should be quiesced here.
>
> NVME_CTRL_RESETTING: I'm resetting the controller, tearing down queues/connections, the link-side association. AFAIK - pci and all the other transports have to do these things. Now is when the blk-mq queues get stopped.
> We have a variance on whether the queues are unquiesced or left quiesced (I think this is what you meant by "resume", where resume means unquiesce) at the end of this. The admin_q is unquiesced, meaning new admin cmds should fail. rdma also has io queues unquiesced, meaning new ios fail, while fc leaves them quiesced while background timers run - meaning no new ios are issued, nor do any fail back to a multipather. With the agreement that we would patch all of the transports to leave them quiesced, with fast-fail timeouts occurring to unquiesce them and start failing ios.
>
> NVME_RECONNECTING: transitioned to after the link-side association is terminated and the transport will now attempt to reconnect (perhaps several attempts) to create a new link-side association. Stays in this state until the controller is fully reconnected and it transitions to NVME_LIVE. Until the link-side association is active, queues do what they do (as left by RESETTING and/or updated per timeouts), except that once there is an active association, the queues will be unquiesced at the time of the LIVE transition. Note: we grandfathered PCI into not needing this state: as you (almost) can't fail the establishment of the link-side association - it's a PCI function and registers - unless the device has dropped off the bus, you can immediately talk to registers and start to reinit the controller. Given it was almost impossible not to succeed, and as the code path already existed, we didn't see a reason to make pci change.

Thanks for your guidance here. That's really appreciated.

> On 1/17/2018 1:19 AM, jianchao.wang wrote:
>> I responded to this in V3 and was waiting for your feedback.
>> Please refer to:
>> After Sagi's "nvme-rdma: fix concurrent reset and reconnect", the rdma ctrl state is changed to RECONNECTING
>> after some clearing and shutdown work, followed by the initializing procedure, no matter whether it is the reset work path or the error recovery path.
>> The fc reset work also does the same thing.
>> So if we define the ranges such that RESET_PREPARE includes the scheduling gap and the disable and clear work, and RESETTING includes the initializing
>> procedure, RECONNECTING is very similar to RESETTING.
>
> I get the impression we have very different understandings of what occurs under what states. I'd like to see a summary of what you think that is, as this is what we must agree upon.
>
> Of course, if you move the gap, disable, and clear work into RESET_PREPARE, you've taken most of the RESETTING state away. What was left?

This should be clarified by the explanation at the top. :)

> I don't believe RESETTING and RECONNECTING are that similar - unless - you are looking at that "reinit" part that we left grandfathered into PCI.

Yes, I get the point now. Thanks for your guidance.

Both nvme-pci and nvme-fc/rdma have a "shutdown" part that tears down the queues/connections, quiesces queues, and cancels outstanding requests... We could call this part, together with the scheduling gap, "shutdowning".
Because the initializing part differs between nvme-pci and nvme-fc/rdma, we could call it "reiniting" for nvme-pci and "reconnecting" for nvme-fc/rdma.

> On 1/17/2018 1:24 AM, jianchao.wang wrote:
>>> Maybe we could do it like this:
>>> In nvme fc/rdma
>>> - set state to RESET_PREPARE, queue reset_work/err_work
>>> - clear/shutdown works, set state to RECONNECTING
>>> - initialization, set state to LIVE
>>>
>>> In nvme pci
>>> - set state to RESET_PREPARE, queue reset_work
>>> - clear/shutdown works, set state to RESETTING
>>> - initialization, set state to LIVE
>>>
>>> Currently, RECONNECTING has overlapped with RESETTING.
>>> So I suggest to use RESET_PREPARE to mark the "suspend" part as you said,
>>> use RECONNECTING to mark the "resume" part in nvme-rdma/fc, and
>>> use RESETTING to mark the "resume" part in nvme-pci, nvmet-loop.
>>
>> I have added a comment above the definition of nvme_ctrl_state as below:
>>
>> +/*
>> + * RESET_PREPARE - mark the state of scheduling gap and disable procedure
>> + * RESETTING     - nvme-pci, nvmet loop use it to mark the state of setup
>> + *                 procedure
>> + * RECONNECTING  - nvme-fc, nvme-rdma use it to mark the state of setup
>> + *                 and reconnect procedure
>> + */
>>
>> Consequently, nvme-fc/rdma don't use the RESETTING state any more, only RESET_PREPARE.
>
> So now I'm confused again... The purpose of the new state was to have a state indicator to cover the transition into the state while the reset_work item was yet to be scheduled. If all you've done is change the state name on fc/rdma - don't you still have that window?

This should be clarified by the explanation at the top. :)

Many thanks again.

> -- james
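For reference, a rough sketch of the ordering proposed above for nvme-pci (the function names follow the existing driver, but RESET_PREPARE is the proposed state, so treat this as an illustration rather than the patch itself); for nvme-fc/rdma the second transition would be to RECONNECTING instead of RESETTING:

    int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
    {
            /* covers the scheduling gap and the disable procedure */
            if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESET_PREPARE))
                    return -EBUSY;
            if (!queue_work(nvme_wq, &ctrl->reset_work))
                    return -EBUSY;
            return 0;
    }

    static void nvme_reset_work(struct work_struct *work)
    {
            struct nvme_dev *dev =
                    container_of(work, struct nvme_dev, ctrl.reset_work);

            /* "shutdown" part: quiesce queues, cancel outstanding requests */
            nvme_dev_disable(dev, false);

            /* _boundary_: expired requests from here on belong to init */
            nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING);

            /*
             * "setup" part: re-enable the controller, create cq/sq pairs,
             * set up the IO queues, then go live.
             */
            nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE);
    }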