Hello
Please consider the following scenario.
nvme_reset_ctrl
-> set state to RESETTING
-> queue reset_work
(scheduling)
nvme_reset_work
-> nvme_dev_disable
-> quiesce queues
-> nvme_cancel_request
on outstanding requests
-------------------------------_boundary_
-> nvme initializing (issue request on adminq)
Before the _boundary_, we not only quiesce the queues but also cancel
all the outstanding requests.
A request could expire when the ctrl state is RESETTING.
- If the timeout occurs before the _boundary_, the expired requests
are from the previous work.
- Otherwise, the expired requests are from the controller initializing
procedure, such as sending cq/sq create commands to the adminq to set
up the io queues.
In the current implementation, nvme_timeout cannot identify the
_boundary_, so it only handles the second case above.
In fact, after Sagi's commit (nvme-rdma: fix concurrent reset and
reconnect), both nvme-fc/rdma have the following pattern:
RESETTING - quiesce blk-mq queues, teardown and delete queues/
connections, clear out outstanding IO requests...
RECONNECTING - establish new queues/connections and some other
initializing things.
Introduce RECONNECTING to the nvme-pci transport to mark the same thing.
Then we get a coherent state definition among the nvme pci/rdma/fc
transports, and nvme_timeout can identify the _boundary_.
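In code terms, the nvme_timeout side of this series boils down to
roughly the following (a sketch only; the actual hunk is in the second
patch below):

	switch (dev->ctrl.state) {
	case NVME_CTRL_RESETTING:
		/* expired request comes from the previous work */
		nvme_req(req)->status = NVME_SC_ABORT_REQ;
		return BLK_EH_HANDLED;
	case NVME_CTRL_RECONNECTING:
		/* expired request comes from the initializing procedure */
		WARN_ON_ONCE(nvmeq->qid);
		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
		return BLK_EH_HANDLED;
	default:
		break;
	}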
V5:
- discard RESET_PREPARE and introduce RESETTING into nvme-pci
- change the 1st patch's name and comment
- other misc changes
V4:
- rebase patches on Jens' for-next
- let RESETTING equal to RECONNECTING in terms of work procedure
- change the 1st patch's name and comment
- other misc changes
V3:
- fix wrong reference in loop.c
- other misc changes
V2:
- split NVME_CTRL_RESETTING into NVME_CTRL_RESET_PREPARE and
NVME_CTRL_RESETTING. Introduce new patch based on this.
- distinguish the requests based on the new state in nvme_timeout
- change comments of patch
drivers/nvme/host/core.c | 2 +-
drivers/nvme/host/pci.c | 43 ++++++++++++++++++++++++++++++++-----------
2 files changed, 33 insertions(+), 12 deletions(-)
Thanks
Jianchao
After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
both nvme-fc/rdma have the following pattern:
RESETTING - quiesce blk-mq queues, teardown and delete queues/
connections, clear out outstanding IO requests...
RECONNECTING - establish new queues/connections and some other
initializing things.
Introduce RECONNECTING to the nvme-pci transport to mark the same thing.
Then we get a coherent state definition among the nvme pci/rdma/fc
transports.
Suggested-by: James Smart <[email protected]>
Signed-off-by: Jianchao Wang <[email protected]>
---
drivers/nvme/host/core.c | 2 +-
drivers/nvme/host/pci.c | 19 +++++++++++++++++--
2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 230cc09..23b3e53 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -242,7 +242,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
switch (new_state) {
case NVME_CTRL_ADMIN_ONLY:
switch (old_state) {
- case NVME_CTRL_RESETTING:
+ case NVME_CTRL_RECONNECTING:
changed = true;
/* FALLTHRU */
default:
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 45f843d..05344be 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1138,9 +1138,14 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
*/
bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
- /* If there is a reset ongoing, we shouldn't reset again. */
- if (dev->ctrl.state == NVME_CTRL_RESETTING)
+ /* If there is a reset/reinit ongoing, we shouldn't reset again. */
+ switch (dev->ctrl.state) {
+ case NVME_CTRL_RESETTING:
+ case NVME_CTRL_RECONNECTING:
return false;
+ default:
+ break;
+ }
/* We shouldn't reset unless the controller is on fatal error state
* _or_ if we lost the communication with it.
@@ -2304,6 +2309,16 @@ static void nvme_reset_work(struct work_struct *work)
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
nvme_dev_disable(dev, false);
+ /*
+ * Introduce RECONNECTING state from nvme-fc/rdma transports to mark the
+ * initializing procedure here.
+ */
+ if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RECONNECTING)) {
+ dev_warn(dev->ctrl.device,
+ "failed to mark controller RECONNECTING\n");
+ goto out;
+ }
+
result = nvme_pci_enable(dev);
if (result)
goto out;
--
2.7.4
Look at the following scenario.
nvme_reset_ctrl
-> set state to RESETTING
-> queue reset_work
(scheduling)
nvme_reset_work
-> nvme_dev_disable
-> quiesce queues
-> nvme_cancel_request
on outstanding requests
-> set state to RECONNECTING
-> nvme initializing
(will issue request on adminq)
A request could expire after the ctrl state is set to RESETTING
and the reset_work is queued.
- If the timeout occurs before state RECONNECTING, the expired
requests are from the previous work.
- Otherwise, the expired requests are from the controller
initializing procedure, such as sending cq/sq create commands
to adminq to setup io queues.
In the current implementation, nvme_timeout only handles the second
case above.
To fix this, when the state is RESETTING, handle the expired
requests as nvme_cancel_request does; when the state is RECONNECTING,
handle them as the original code did, but discard the nvme_dev_disable
there, since nvme_reset_work will see the error and invoke itself.
Signed-off-by: Jianchao Wang <[email protected]>
---
drivers/nvme/host/pci.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 05344be..a9b2359 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1210,18 +1210,24 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
}
/*
- * Shutdown immediately if controller times out while starting. The
- * reset work will see the pci device disabled when it gets the forced
- * cancellation error. All outstanding requests are completed on
- * shutdown, so we return BLK_EH_HANDLED.
+ * - When the ctrl.state is NVME_CTRL_RESETTING, the expired
+ * request should come from the previous work and we handle
+ * it as nvme_cancel_request.
+ * - When the ctrl.state is NVME_CTRL_RECONNECTING, the expired
+ * request should come from the initializing procedure such as
+ * setup io queues, because all the previous outstanding
+ * requests should have been cancelled.
*/
- if (dev->ctrl.state == NVME_CTRL_RESETTING) {
- dev_warn(dev->ctrl.device,
- "I/O %d QID %d timeout, disable controller\n",
- req->tag, nvmeq->qid);
- nvme_dev_disable(dev, false);
+ switch (dev->ctrl.state) {
+ case NVME_CTRL_RESETTING:
+ nvme_req(req)->status = NVME_SC_ABORT_REQ;
+ return BLK_EH_HANDLED;
+ case NVME_CTRL_RECONNECTING:
+ WARN_ON_ONCE(nvmeq->qid);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_HANDLED;
+ default:
+ break;
}
/*
--
2.7.4
On 1/18/2018 12:10 PM, Jianchao Wang wrote:
> After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
> both nvme-fc/rdma have following pattern:
> RESETTING - quiesce blk-mq queues, teardown and delete queues/
> connections, clear out outstanding IO requests...
> RECONNECTING - establish new queues/connections and some other
> initializing things.
I guess we can call it NVME_CTRL_CONNECTING in a later patch (a more
suitable name for this state now).
> Introduce RECONNECTING to nvme-pci transport to do the same mark.
> Then we get a coherent state definition among nvme pci/rdma/fc
> transports.
>
> Suggested-by: James Smart <[email protected]>
> Signed-off-by: Jianchao Wang <[email protected]>
> ---
> drivers/nvme/host/core.c | 2 +-
> drivers/nvme/host/pci.c | 19 +++++++++++++++++--
> 2 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 230cc09..23b3e53 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -242,7 +242,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> switch (new_state) {
> case NVME_CTRL_ADMIN_ONLY:
> switch (old_state) {
> - case NVME_CTRL_RESETTING:
> + case NVME_CTRL_RECONNECTING:
> changed = true;
> /* FALLTHRU */
> default:
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 45f843d..05344be 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1138,9 +1138,14 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
> */
> bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
>
> - /* If there is a reset ongoing, we shouldn't reset again. */
> - if (dev->ctrl.state == NVME_CTRL_RESETTING)
> + /* If there is a reset/reinit ongoing, we shouldn't reset again. */
> + switch (dev->ctrl.state) {
> + case NVME_CTRL_RESETTING:
> + case NVME_CTRL_RECONNECTING:
> return false;
> + default:
> + break;
> + }
>
> /* We shouldn't reset unless the controller is on fatal error state
> * _or_ if we lost the communication with it.
> @@ -2304,6 +2309,16 @@ static void nvme_reset_work(struct work_struct *work)
> if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
> nvme_dev_disable(dev, false);
>
> + /*
> + * Introduce RECONNECTING state from nvme-fc/rdma transports to mark the
> + * initializing procedure here.
> + */
> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RECONNECTING)) {
> + dev_warn(dev->ctrl.device,
> + "failed to mark controller RECONNECTING\n");
> + goto out;
> + }
> +
> result = nvme_pci_enable(dev);
> if (result)
> goto out;
>
On 1/18/2018 2:10 AM, Jianchao Wang wrote:
> After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
> both nvme-fc/rdma have following pattern:
> RESETTING - quiesce blk-mq queues, teardown and delete queues/
> connections, clear out outstanding IO requests...
> RECONNECTING - establish new queues/connections and some other
> initializing things.
> Introduce RECONNECTING to nvme-pci transport to do the same mark.
> Then we get a coherent state definition among nvme pci/rdma/fc
> transports.
>
> Suggested-by: James Smart <[email protected]>
> Signed-off-by: Jianchao Wang <[email protected]>
> ---
> drivers/nvme/host/core.c | 2 +-
> drivers/nvme/host/pci.c | 19 +++++++++++++++++--
> 2 files changed, 18 insertions(+), 3 deletions(-)
>
Reviewed-by: James Smart <[email protected]>
Jianchao,
This looks very coherent to me. Thank You.
-- james
On 1/18/2018 2:10 AM, Jianchao Wang wrote:
> Hello
>
> Please consider the following scenario.
> nvme_reset_ctrl
> -> set state to RESETTING
> -> queue reset_work
> (scheduling)
> nvme_reset_work
> -> nvme_dev_disable
> -> quiesce queues
> -> nvme_cancel_request
> on outstanding requests
> -------------------------------_boundary_
> -> nvme initializing (issue request on adminq)
>
> Before the _boundary_, not only quiesce the queues, but only cancel
> all the outstanding requests.
>
> A request could expire when the ctrl state is RESETTING.
> - If the timeout occur before the _boundary_, the expired requests
> are from the previous work.
> - Otherwise, the expired requests are from the controller initializing
> procedure, such as sending cq/sq create commands to adminq to setup
> io queues.
> In current implementation, nvme_timeout cannot identify the _boundary_
> so only handles second case above.
>
> In fact, after Sagi's commit (nvme-rdma: fix concurrent reset and
> reconnect), both nvme-fc/rdma have following pattern:
> RESETTING - quiesce blk-mq queues, teardown and delete queues/
> connections, clear out outstanding IO requests...
> RECONNECTING - establish new queues/connections and some other
> initializing things.
> Introduce RECONNECTING to nvme-pci transport to do the same mark
> Then we get a coherent state definition among nvme pci/rdma/fc
> transports and nvme_timeout could identify the _boundary_.
>
> V5:
> - discard RESET_PREPARE and introduce RESETTING into nvme-pci
> - change the 1st patch's name and comment
> - other misc changes
>
> V4:
> - rebase patches on Jens' for-next
> - let RESETTING equal to RECONNECTING in terms of work procedure
> - change the 1st patch's name and comment
> - other misc changes
>
> V3:
> - fix wrong reference in loop.c
> - other misc changes
>
> V2:
> - split NVME_CTRL_RESETTING into NVME_CTRL_RESET_PREPARE and
> NVME_CTRL_RESETTING. Introduce new patch based on this.
> - distinguish the requests based on the new state in nvme_timeout
> - change comments of patch
>
> drivers/nvme/host/core.c | 2 +-
> drivers/nvme/host/pci.c | 43 ++++++++++++++++++++++++++++++++-----------
> 2 files changed, 33 insertions(+), 12 deletions(-)
>
> Thanks
> Jianchao
On Thu, Jan 18, 2018 at 06:10:02PM +0800, Jianchao Wang wrote:
> + * - When the ctrl.state is NVME_CTRL_RESETTING, the expired
> + * request should come from the previous work and we handle
> + * it as nvme_cancel_request.
> + * - When the ctrl.state is NVME_CTRL_RECONNECTING, the expired
> + * request should come from the initializing procedure such as
> + * setup io queues, because all the previous outstanding
> + * requests should have been cancelled.
> */
> - if (dev->ctrl.state == NVME_CTRL_RESETTING) {
> - dev_warn(dev->ctrl.device,
> - "I/O %d QID %d timeout, disable controller\n",
> - req->tag, nvmeq->qid);
> - nvme_dev_disable(dev, false);
> + switch (dev->ctrl.state) {
> + case NVME_CTRL_RESETTING:
> + nvme_req(req)->status = NVME_SC_ABORT_REQ;
> + return BLK_EH_HANDLED;
> + case NVME_CTRL_RECONNECTING:
> + WARN_ON_ONCE(nvmeq->qid);
> nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> return BLK_EH_HANDLED;
> + default:
> + break;
> }
The driver may be giving up on the command here, but that doesn't mean
the controller has. We can't just end the request like this because that
will release the memory the controller still owns. We must wait until
after nvme_dev_disable clears bus master because we can't say for sure
the controller isn't going to write to that address right after we end
the request.
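In other words, the request can only be completed after the device can
no longer DMA into it; roughly the ordering the original code enforced
(a sketch of the lines being removed above):

	nvme_dev_disable(dev, false);	/* shut the controller down first */
	nvme_req(req)->flags |= NVME_REQ_CANCELLED;
	return BLK_EH_HANDLED;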
Hi Keith
Thanks for your kind response and guidance.
On 01/19/2018 12:59 PM, Keith Busch wrote:
> On Thu, Jan 18, 2018 at 06:10:02PM +0800, Jianchao Wang wrote:
>> + * - When the ctrl.state is NVME_CTRL_RESETTING, the expired
>> + * request should come from the previous work and we handle
>> + * it as nvme_cancel_request.
>> + * - When the ctrl.state is NVME_CTRL_RECONNECTING, the expired
>> + * request should come from the initializing procedure such as
>> + * setup io queues, because all the previous outstanding
>> + * requests should have been cancelled.
>> */
>> - if (dev->ctrl.state == NVME_CTRL_RESETTING) {
>> - dev_warn(dev->ctrl.device,
>> - "I/O %d QID %d timeout, disable controller\n",
>> - req->tag, nvmeq->qid);
>> - nvme_dev_disable(dev, false);
>> + switch (dev->ctrl.state) {
>> + case NVME_CTRL_RESETTING:
>> + nvme_req(req)->status = NVME_SC_ABORT_REQ;
>> + return BLK_EH_HANDLED;
>> + case NVME_CTRL_RECONNECTING:
>> + WARN_ON_ONCE(nvmeq->qid);
>> nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>> return BLK_EH_HANDLED;
>> + default:
>> + break;
>> }
>
> The driver may be giving up on the command here, but that doesn't mean
> the controller has. We can't just end the request like this because that
> will release the memory the controller still owns. We must wait until
> after nvme_dev_disable clears bus master because we can't say for sure
> the controller isn't going to write to that address right after we end
> the request.
>
Yes, but the controller is going to be reset or shut down at that moment;
even if the controller accesses a bad address and goes wrong, everything
will be ok after the reset or shutdown. :)
Thanks
Jianchao
On Fri, Jan 19, 2018 at 01:55:29PM +0800, jianchao.wang wrote:
> On 01/19/2018 12:59 PM, Keith Busch wrote:
> > On Thu, Jan 18, 2018 at 06:10:02PM +0800, Jianchao Wang wrote:
> >> + * - When the ctrl.state is NVME_CTRL_RESETTING, the expired
> >> + * request should come from the previous work and we handle
> >> + * it as nvme_cancel_request.
> >> + * - When the ctrl.state is NVME_CTRL_RECONNECTING, the expired
> >> + * request should come from the initializing procedure such as
> >> + * setup io queues, because all the previous outstanding
> >> + * requests should have been cancelled.
> >> */
> >> - if (dev->ctrl.state == NVME_CTRL_RESETTING) {
> >> - dev_warn(dev->ctrl.device,
> >> - "I/O %d QID %d timeout, disable controller\n",
> >> - req->tag, nvmeq->qid);
> >> - nvme_dev_disable(dev, false);
> >> + switch (dev->ctrl.state) {
> >> + case NVME_CTRL_RESETTING:
> >> + nvme_req(req)->status = NVME_SC_ABORT_REQ;
> >> + return BLK_EH_HANDLED;
> >> + case NVME_CTRL_RECONNECTING:
> >> + WARN_ON_ONCE(nvmeq->qid);
> >> nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> >> return BLK_EH_HANDLED;
> >> + default:
> >> + break;
> >> }
> >
> > The driver may be giving up on the command here, but that doesn't mean
> > the controller has. We can't just end the request like this because that
> > will release the memory the controller still owns. We must wait until
> > after nvme_dev_disable clears bus master because we can't say for sure
> > the controller isn't going to write to that address right after we end
> > the request.
> >
> Yes, but the controller is going to be reseted or shutdown at the moment,
> even if the controller accesses a bad address and goes wrong, everything will
> be ok after reset or shutdown. :)
Hm, I don't follow. DMA access after free is never okay.
Hi Keith
Thanks for the kind reminder.
On 01/19/2018 02:05 PM, Keith Busch wrote:
>>> The driver may be giving up on the command here, but that doesn't mean
>>> the controller has. We can't just end the request like this because that
>>> will release the memory the controller still owns. We must wait until
>>> after nvme_dev_disable clears bus master because we can't say for sure
>>> the controller isn't going to write to that address right after we end
>>> the request.
>>>
>> Yes, but the controller is going to be reseted or shutdown at the moment,
>> even if the controller accesses a bad address and goes wrong, everything will
>> be ok after reset or shutdown. :)
> Hm, I don't follow. DMA access after free is never okay.
Yes, this may cause unexpected memory corruption.
Thanks
Jianchao
On Thu, Jan 18, 2018 at 06:10:00PM +0800, Jianchao Wang wrote:
> Hello
>
> Please consider the following scenario.
> nvme_reset_ctrl
> -> set state to RESETTING
> -> queue reset_work
> (scheduling)
> nvme_reset_work
> -> nvme_dev_disable
> -> quiesce queues
> -> nvme_cancel_request
> on outstanding requests
> -------------------------------_boundary_
> -> nvme initializing (issue request on adminq)
>
> Before the _boundary_, not only quiesce the queues, but only cancel
> all the outstanding requests.
>
> A request could expire when the ctrl state is RESETTING.
> - If the timeout occur before the _boundary_, the expired requests
> are from the previous work.
> - Otherwise, the expired requests are from the controller initializing
> procedure, such as sending cq/sq create commands to adminq to setup
> io queues.
> In current implementation, nvme_timeout cannot identify the _boundary_
> so only handles second case above.
Bear with me a moment, as I'm only just now getting a real chance to look
at this, and I'm not quite sure I follow what problem this is solving.
The nvme_dev_disable routine makes forward progress without depending on
timeout handling to complete expired commands. Once controller disabling
completes, there can't possibly be any started requests that can expire.
So we don't need nvme_timeout to do anything for requests above the
boundary.
Hi Keith
Thanks for taking the time to look into this.
On 01/19/2018 04:01 PM, Keith Busch wrote:
> On Thu, Jan 18, 2018 at 06:10:00PM +0800, Jianchao Wang wrote:
>> Hello
>>
>> Please consider the following scenario.
>> nvme_reset_ctrl
>> -> set state to RESETTING
>> -> queue reset_work
>> (scheduling)
>> nvme_reset_work
>> -> nvme_dev_disable
>> -> quiesce queues
>> -> nvme_cancel_request
>> on outstanding requests
>> -------------------------------_boundary_
>> -> nvme initializing (issue request on adminq)
>>
>> Before the _boundary_, not only quiesce the queues, but only cancel
>> all the outstanding requests.
>>
>> A request could expire when the ctrl state is RESETTING.
>> - If the timeout occur before the _boundary_, the expired requests
>> are from the previous work.
>> - Otherwise, the expired requests are from the controller initializing
>> procedure, such as sending cq/sq create commands to adminq to setup
>> io queues.
>> In current implementation, nvme_timeout cannot identify the _boundary_
>> so only handles second case above.
>
> Bare with me a moment, as I'm only just now getting a real chance to look
> at this, and I'm not quite sure I follow what problem this is solving.
>
> The nvme_dev_disable routine makes forward progress without depending on
> timeout handling to complete expired commands. Once controller disabling
> completes, there can't possibly be any started requests that can expire.
> So we don't need nvme_timeout to do anything for requests above the
> boundary.
>
Yes, once controller disabling completes, any started requests will be handled and cannot expire.
But before the _boundary_, a nvme_timeout context could run in parallel with nvme_dev_disable.
If a timeout path grabs a request, then nvme_dev_disable cannot get to it and cancel it.
So even though nvme_dev_disable completes, there could still be a request held in a nvme_timeout context.
The worst case is:
nvme_timeout                          nvme_reset_work
  if (ctrl->state == RESETTING)         nvme_dev_disable
    nvme_dev_disable                    initializing procedure
i.e. the nvme_dev_disable in the timeout path runs in parallel with the reinit procedure in nvme_reset_work.
Thanks
Jianchao
On Fri, Jan 19, 2018 at 04:14:02PM +0800, jianchao.wang wrote:
> On 01/19/2018 04:01 PM, Keith Busch wrote:
> > The nvme_dev_disable routine makes forward progress without depending on
> > timeout handling to complete expired commands. Once controller disabling
> > completes, there can't possibly be any started requests that can expire.
> > So we don't need nvme_timeout to do anything for requests above the
> > boundary.
> >
> Yes, once controller disabling completes, any started requests will be handled and cannot expire.
> But before the _boundary_, there could be a nvme_timeout context runs with nvme_dev_disable in parallel.
> If a timeout path grabs a request, then nvme_dev_disable cannot get and cancel it.
> So even though the nvme_dev_disable completes, there still could be a request in nvme_timeout context.
>
> The worst case is :
> nvme_timeout nvme_reset_work
> if (ctrl->state == RESETTING ) nvme_dev_disable
> nvme_dev_disable initializing procedure
>
> the nvme_dev_disable run with reinit procedure in nvme_reset_work in parallel.
Okay, I see what you're saying. That's a pretty obscure case, as normally
we enter nvme_reset_work with the queues already disabled, but there
are a few cases where we need nvme_reset_work to handle that.
But if we are in that case, I think we really just want to sync the
queues. What do you think of this?
---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fde6fd2e7eef..516383193416 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3520,6 +3520,17 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
}
EXPORT_SYMBOL_GPL(nvme_stop_queues);
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+ struct nvme_ns *ns;
+
+ mutex_lock(&ctrl->namespaces_mutex);
+ list_for_each_entry(ns, &ctrl->namespaces, list)
+ blk_sync_queue(ns->queue);
+ mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
void nvme_start_queues(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 8e7fc1b041b7..45b1b8ceddb6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -374,6 +374,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
void nvme_stop_queues(struct nvme_ctrl *ctrl);
void nvme_start_queues(struct nvme_ctrl *ctrl);
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
void nvme_kill_queues(struct nvme_ctrl *ctrl);
void nvme_unfreeze(struct nvme_ctrl *ctrl);
void nvme_wait_freeze(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a2ffb557b616..1fe00be22ad1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2289,8 +2289,10 @@ static void nvme_reset_work(struct work_struct *work)
* If we're called to reset a live controller first shut it down before
* moving on.
*/
- if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
+ if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) {
nvme_dev_disable(dev, false);
+ nvme_sync_queues(&dev->ctrl);
+ }
result = nvme_pci_enable(dev);
if (result)
--
Hi Keith
Thanks for your kind and detailed response and the patch.
On 01/19/2018 04:42 PM, Keith Busch wrote:
> On Fri, Jan 19, 2018 at 04:14:02PM +0800, jianchao.wang wrote:
>> On 01/19/2018 04:01 PM, Keith Busch wrote:
>>> The nvme_dev_disable routine makes forward progress without depending on
>>> timeout handling to complete expired commands. Once controller disabling
>>> completes, there can't possibly be any started requests that can expire.
>>> So we don't need nvme_timeout to do anything for requests above the
>>> boundary.
>>>
>> Yes, once controller disabling completes, any started requests will be handled and cannot expire.
>> But before the _boundary_, there could be a nvme_timeout context runs with nvme_dev_disable in parallel.
>> If a timeout path grabs a request, then nvme_dev_disable cannot get and cancel it.
>> So even though the nvme_dev_disable completes, there still could be a request in nvme_timeout context.
>>
>> The worst case is :
>> nvme_timeout nvme_reset_work
>> if (ctrl->state == RESETTING ) nvme_dev_disable
>> nvme_dev_disable initializing procedure
>>
>> the nvme_dev_disable run with reinit procedure in nvme_reset_work in parallel.
>
> Okay, I see what you're saying. That's a pretty obscure case, as normally
> we enter nvme_reset_work with the queues already disabled, but there
> are a few cases where we need nvme_reset_work to handle that.
>
> But if we are in that case, I think we really just want to sync the
> queues. What do you think of this?
>
> ---
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fde6fd2e7eef..516383193416 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3520,6 +3520,17 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
> }
> EXPORT_SYMBOL_GPL(nvme_stop_queues);
>
> +void nvme_sync_queues(struct nvme_ctrl *ctrl)
> +{
> + struct nvme_ns *ns;
> +
> + mutex_lock(&ctrl->namespaces_mutex);
> + list_for_each_entry(ns, &ctrl->namespaces, list)
> + blk_sync_queue(ns->queue);
> + mutex_unlock(&ctrl->namespaces_mutex);
> +}
> +EXPORT_SYMBOL_GPL(nvme_sync_queues);
> +
> void nvme_start_queues(struct nvme_ctrl *ctrl)
> {
> struct nvme_ns *ns;
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 8e7fc1b041b7..45b1b8ceddb6 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -374,6 +374,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
>
> void nvme_stop_queues(struct nvme_ctrl *ctrl);
> void nvme_start_queues(struct nvme_ctrl *ctrl);
> +void nvme_sync_queues(struct nvme_ctrl *ctrl)
> void nvme_kill_queues(struct nvme_ctrl *ctrl);
> void nvme_unfreeze(struct nvme_ctrl *ctrl);
> void nvme_wait_freeze(struct nvme_ctrl *ctrl);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a2ffb557b616..1fe00be22ad1 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2289,8 +2289,10 @@ static void nvme_reset_work(struct work_struct *work)
> * If we're called to reset a live controller first shut it down before
> * moving on.
> */
> - if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
> + if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) {
> nvme_dev_disable(dev, false);
> + nvme_sync_queues(&dev->ctrl);
> + }
>
> result = nvme_pci_enable(dev);
> if (result)
> --
>
We should not use blk_sync_queue here; the requeue_work and run_work would be canceled.
Just flush_work(&q->timeout_work) should be enough.
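A rough sketch of that variant (untested, reusing the namespaces_mutex
list walk from the nvme_sync_queues proposal above):

void nvme_sync_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	/* only wait for any in-flight timeout handling to finish */
	mutex_lock(&ctrl->namespaces_mutex);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		flush_work(&ns->queue->timeout_work);
	mutex_unlock(&ctrl->namespaces_mutex);
}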
In addition, we could check NVME_CC_ENABLE in nvme_dev_disable to avoid redundant invoking.
:)
Thanks
Jianchao
Hi Max
Thanks for your kind comment and response.
On 01/18/2018 06:17 PM, Max Gurtovoy wrote:
>
> On 1/18/2018 12:10 PM, Jianchao Wang wrote:
>> After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
>> both nvme-fc/rdma have following pattern:
>> RESETTING    - quiesce blk-mq queues, teardown and delete queues/
>>                connections, clear out outstanding IO requests...
>> RECONNECTING - establish new queues/connections and some other
>>                initializing things.
>
> I guess we can call it NVME_CTRL_CONNECTING in later patch (more suitable name for this state now).
I think we can change the name in another patchset that just deals with more suitable names for the states.
:)
Thanks
Jianchao
On Fri, Jan 19, 2018 at 05:02:06PM +0800, jianchao.wang wrote:
> We should not use blk_sync_queue here, the requeue_work and run_work will be canceled.
> Just flush_work(&q->timeout_work) should be ok.
I agree flushing timeout_work is sufficient. All the other work had
already better not be running either, so it doesn't hurt to call the
sync API.
> In addition, we could check NVME_CC_ENABLE in nvme_dev_disable to avoid redundant invoking.
> :)
That should already be inferred through reading back the CSTS register.
Hi Keith
Thanks for your kind response.
On 01/19/2018 07:52 PM, Keith Busch wrote:
> On Fri, Jan 19, 2018 at 05:02:06PM +0800, jianchao.wang wrote:
>> We should not use blk_sync_queue here, the requeue_work and run_work will be canceled.
>> Just flush_work(&q->timeout_work) should be ok.
>
> I agree flushing timeout_work is sufficient. All the other work had
> already better not be running either, so it doesn't hurt to call the
> sync API.
In nvme_dev_disable, the outstanding requests will eventually be requeued.
I'm afraid the requests requeued on q->requeue_list will be blocked until another requeue
occurs, if we cancel the requeue work before it gets scheduled.
>
>> In addition, we could check NVME_CC_ENABLE in nvme_dev_disable to avoid redundant invoking.
>> :)
>
> That should already be inferred through reading back the CSTS register.
>
Yes, the "dead" in nvme_dev_disable looks enough for these uncommon cases.
Thanks
Jianchao
On Fri, Jan 19, 2018 at 09:56:48PM +0800, jianchao.wang wrote:
> In nvme_dev_disable, the outstanding requests will be requeued finally.
> I'm afraid the requests requeued on the q->requeue_list will be blocked until another requeue
> occurs, if we cancel the requeue work before it get scheduled.
We should kick the request list in nvme_start_queues.
Hi Keith
Thanks for your kind response.
On 01/20/2018 10:11 AM, Keith Busch wrote:
> On Fri, Jan 19, 2018 at 09:56:48PM +0800, jianchao.wang wrote:
>> In nvme_dev_disable, the outstanding requests will be requeued finally.
>> I'm afraid the requests requeued on the q->requeue_list will be blocked until another requeue
>> occurs, if we cancel the requeue work before it get scheduled.
>
> We should kick the request list in nvme_start_queues.
>
Yes
@@ -3513,8 +3513,10 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
struct nvme_ns *ns;
mutex_lock(&ctrl->namespaces_mutex);
- list_for_each_entry(ns, &ctrl->namespaces, list)
+ list_for_each_entry(ns, &ctrl->namespaces, list) {
blk_mq_unquiesce_queue(ns->queue);
+ blk_mq_kick_requeue_list(ns->queue);
+ }
mutex_unlock(&ctrl->namespaces_mutex);
}
Then, nvme_sync_queues could be more generally usable.
Many thanks for your guidance.
Jianchao
On 01/20/2018 10:07 PM, jianchao.wang wrote:
> Hi Keith
>
> Thanks for you kindly response.
>
> On 01/20/2018 10:11 AM, Keith Busch wrote:
>> On Fri, Jan 19, 2018 at 09:56:48PM +0800, jianchao.wang wrote:
>>> In nvme_dev_disable, the outstanding requests will be requeued finally.
>>> I'm afraid the requests requeued on the q->requeue_list will be blocked until another requeue
>>> occurs, if we cancel the requeue work before it get scheduled.
>>
>> We should kick the request list in nvme_start_queues.
>>
> Yes
>
> @@ -3513,8 +3513,10 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
> struct nvme_ns *ns;
>
> mutex_lock(&ctrl->namespaces_mutex);
> - list_for_each_entry(ns, &ctrl->namespaces, list)
> + list_for_each_entry(ns, &ctrl->namespaces, list) {
> blk_mq_unquiesce_queue(ns->queue);
> + blk_mq_kick_requeue_list(ns->queue);
> + }
> mutex_unlock(&ctrl->namespaces_mutex);
> }
We also have to add blk_mq_kick_requeue_list in nvme_kill_queues, for the case of queue_count < 2; a rough sketch follows below.
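Something like this, as a sketch only (assuming the nvme_kill_queues
loop still unquiesces each namespace queue under namespaces_mutex as it
does today):

 	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		...
 		/* Forcibly unquiesce queues to avoid blocking dispatch */
 		blk_mq_unquiesce_queue(ns->queue);
+		/* also restart anything parked on the requeue list */
+		blk_mq_kick_requeue_list(ns->queue);
 	}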
>
> Then, nvme_sync_queues could be more universal.
>
> Many thanks for your directive.
>
> Jianchao
>