2018-01-29 03:09:26

by jianchao.wang

Subject: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem

nvme_set_host_mem invokes nvme_alloc_request without the NOWAIT
flag, which is unsafe for nvme_dev_disable. The adminq driver tags
may already be exhausted when the previously outstanding adminq
requests cannot be completed due to some hardware error, so we have
to depend on the timeout path to complete those outstanding adminq
requests and free their tags.
However, nvme_timeout invokes nvme_dev_disable and tries to take
the shutdown_lock, which is held by another context that is
sleeping while it waits for the tags to be freed by the timeout
path. The result is a deadlock.

To fix this, let nvme_set_host_mem use the NOWAIT flag.
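
For context, a minimal sketch of the allocation behaviour being changed
(this assumes the semantics of blk_mq_alloc_request() as used underneath
nvme_alloc_request(); the helper below is hypothetical and simplified,
not the actual driver source):
----
#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/*
 * Hypothetical helper for illustration: allocate an admin request
 * without sleeping for a tag.  Without BLK_MQ_REQ_NOWAIT the caller
 * would sleep until a tag is freed, which never happens when every tag
 * is held by a command that only the (currently blocked) timeout path
 * can reap.
 */
static struct request *alloc_admin_rq_nowait(struct request_queue *q)
{
	/* Fails fast with ERR_PTR(-EWOULDBLOCK) when no tag is free. */
	return blk_mq_alloc_request(q, REQ_OP_DRV_IN, BLK_MQ_REQ_NOWAIT);
}
----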

Signed-off-by: Jianchao Wang <[email protected]>
---
drivers/nvme/host/pci.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6fe7af0..9532529 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1736,7 +1736,8 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
 	c.features.dword14 = cpu_to_le32(upper_32_bits(dma_addr));
 	c.features.dword15 = cpu_to_le32(dev->nr_host_mem_descs);
 
-	ret = nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
+	ret = __nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, NULL, 0, 0,
+			NVME_QID_ANY, 0, BLK_MQ_REQ_NOWAIT);
 	if (ret) {
 		dev_warn(dev->ctrl.device,
 			 "failed to set host mem (err %d, flags %#x).\n",
--
2.7.4



2018-01-29 08:15:46

by jianchao.wang

Subject: Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem



On 01/29/2018 11:07 AM, Jianchao Wang wrote:
> nvme_set_host_mem invokes nvme_alloc_request without the NOWAIT
> flag, which is unsafe for nvme_dev_disable. The adminq driver tags
> may already be exhausted when the previously outstanding adminq
> requests cannot be completed due to some hardware error, so we have
> to depend on the timeout path to complete those outstanding adminq
> requests and free their tags.
> However, nvme_timeout invokes nvme_dev_disable and tries to take
> the shutdown_lock, which is held by another context that is
> sleeping while it waits for the tags to be freed by the timeout
> path. The result is a deadlock.
>
> To fix this, let nvme_set_host_mem use the NOWAIT flag.

In fact, the only problematic case is the nvme_set_host_mem call in
nvme_dev_disable.
Consider the following scenario:

An adminq request expires.

timeout_work context:
  nvme_timeout
    -> nvme_dev_disable
      -> nvme_set_host_mem

If the adminq request sent by nvme_set_host_mem expires as well, the
timeout_work cannot handle it, because it is still waiting for the
result of nvme_set_host_mem.

Thanks
Jianchao

2018-01-29 15:58:54

by Keith Busch

Subject: Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem

On Mon, Jan 29, 2018 at 11:07:35AM +0800, Jianchao Wang wrote:
> nvme_set_host_mem invokes nvme_alloc_request without the NOWAIT
> flag, which is unsafe for nvme_dev_disable. The adminq driver tags
> may already be exhausted when the previously outstanding adminq
> requests cannot be completed due to some hardware error, so we have
> to depend on the timeout path to complete those outstanding adminq
> requests and free their tags.
> However, nvme_timeout invokes nvme_dev_disable and tries to take
> the shutdown_lock, which is held by another context that is
> sleeping while it waits for the tags to be freed by the timeout
> path. The result is a deadlock.
>
> To fix this, let nvme_set_host_mem use the NOWAIT flag.
>
> Signed-off-by: Jianchao Wang <[email protected]>

Thanks for the fix. It looks like we still have a problem, though.
Commands submitted with the "shutdown_lock" held need to be able to make
forward progress without relying on a completion, but this one could
block indefinitely.
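
To illustrate: a rough sketch of the synchronous submit path (the shape
is assumed and simplified, not verbatim drivers/nvme/host/core.c, and
the function name is made up). BLK_MQ_REQ_NOWAIT only covers the tag
allocation; the wait for the command's completion afterwards is still
unbounded if the controller never answers.
----
static int sync_admin_cmd_sketch(struct request_queue *q,
				 struct nvme_command *cmd,
				 blk_mq_req_flags_t flags)
{
	struct request *req;
	int ret;

	/* (1) tag allocation: fails fast with BLK_MQ_REQ_NOWAIT */
	req = nvme_alloc_request(q, cmd, flags, NVME_QID_ANY);
	if (IS_ERR(req))
		return PTR_ERR(req);

	/* (2) execution: sleeps until the controller completes the
	 * command, which a dead controller may never do
	 */
	blk_execute_rq(req->q, NULL, req, false);

	ret = nvme_req(req)->status;
	blk_mq_free_request(req);
	return ret;
}
----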


> drivers/nvme/host/pci.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6fe7af0..9532529 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1736,7 +1736,8 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
>  	c.features.dword14 = cpu_to_le32(upper_32_bits(dma_addr));
>  	c.features.dword15 = cpu_to_le32(dev->nr_host_mem_descs);
> 
> -	ret = nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
> +	ret = __nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, NULL, 0, 0,
> +			NVME_QID_ANY, 0, BLK_MQ_REQ_NOWAIT);
>  	if (ret) {
>  		dev_warn(dev->ctrl.device,
>  			 "failed to set host mem (err %d, flags %#x).\n",
> --

2018-01-29 19:56:25

by Sagi Grimberg

Subject: Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem

Hey Keith,

> Thanks for the fix. It looks like we still have a problem, though.
> Commands submitted with the "shutdown_lock" held need to be able to make
> forward progress without relying on a completion, but this one could
> block indefinitely.

Can you explain to me why the shutdown_lock is needed to synchronize
nvme_dev_disable? More concretely, how is nvme_dev_disable different
from other places where we rely on the ctrl state to serialize stuff?

The only reason I see would be to protect against a
completion-after-abort scenario, but I think the block layer should
protect against that (it checks whether the request's timeout timer
has fired).

2018-01-29 20:37:38

by Keith Busch

Subject: Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem

On Mon, Jan 29, 2018 at 09:55:41PM +0200, Sagi Grimberg wrote:
> > Thanks for the fix. It looks like we still have a problem, though.
> > Commands submitted with the "shutdown_lock" held need to be able to make
> > forward progress without relying on a completion, but this one could
> > block indefinitely.
>
> Can you explain to me why the shutdown_lock is needed to synchronize
> nvme_dev_disable? More concretely, how is nvme_dev_disable different
> from other places where we rely on the ctrl state to serialize stuff?
>
> The only reason I see would be to protect against a
> completion-after-abort scenario, but I think the block layer should
> protect against that (it checks whether the request's timeout timer
> has fired).

We can probably find a way to use the state machine for this. Disabling
the controller pre-dates the state machine, and the mutex is there to
protect against two actors shutting the controller down at the same
time, like a hot removal racing with a timeout-handling reset.
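
A rough sketch of the serialization being described (a simplified
outline under stated assumptions, not the actual nvme_dev_disable()
body):
----
static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
{
	/*
	 * Only one actor may tear the controller down at a time, e.g. a
	 * hot removal racing with a timeout-handling reset.
	 */
	mutex_lock(&dev->shutdown_lock);

	/*
	 * ... quiesce the queues, optionally tell the controller to
	 * stop using the HMB, disable or shut down the controller, and
	 * reap any outstanding requests ...
	 */

	mutex_unlock(&dev->shutdown_lock);
}
----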

2018-01-30 03:42:28

by jianchao.wang

Subject: Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem

Hi Keith and Sagi

Thanks for your kind response. :)

On 01/30/2018 04:17 AM, Keith Busch wrote:
> On Mon, Jan 29, 2018 at 09:55:41PM +0200, Sagi Grimberg wrote:
>>> Thanks for the fix. It looks like we still have a problem, though.
>>> Commands submitted with the "shutdown_lock" held need to be able to make
>>> forward progress without relying on a completion, but this one could
>>> block indefinitely.
>>
>> Can you explain to me why the shutdown_lock is needed to synchronize
>> nvme_dev_disable? More concretely, how is nvme_dev_disable different
>> from other places where we rely on the ctrl state to serialize stuff?
>>
>> The only reason I see would be to protect against a
>> completion-after-abort scenario, but I think the block layer should
>> protect against that (it checks whether the request's timeout timer
>> has fired).
>
> We can probably find a way to use the state machine for this. Disabling
> the controller pre-dates the state machine, and the mutex is there to
> protect against two actors shutting the controller down at the same
> time, like a hot removal racing with a timeout-handling reset.
>

Another point that confuses me is whether nvme_set_host_mem is necessary
in nvme_dev_disable at all.
As the comment says:
----
	/*
	 * If the controller is still alive tell it to stop using the
	 * host memory buffer. In theory the shutdown / reset should
	 * make sure that it doesn't access the host memoery anymore,
	 * but I'd rather be safe than sorry..
	 */
	if (dev->host_mem_descs)
		nvme_set_host_mem(dev, 0);
----
It is there to prevent the controller from accessing host memory. But the
host memory is only freed in nvme_remove, so it looks like we only need
to do this in nvme_remove.
For example:
----
@@ -2553,6 +2545,14 @@ static void nvme_remove(struct pci_dev *pdev)
 	flush_work(&dev->ctrl.reset_work);
 	nvme_stop_ctrl(&dev->ctrl);
 	nvme_remove_namespaces(&dev->ctrl);
+	/*
+	 * If the controller is still alive tell it to stop using the
+	 * host memory buffer. In theory the shutdown / reset should
+	 * make sure that it doesn't access the host memoery anymore,
+	 * but I'd rather be safe than sorry..
+	 */
+	if (dev->host_mem_descs)
+		nvme_set_host_mem(dev, 0);
 	nvme_dev_disable(dev, true);
 	nvme_free_host_mem(dev);
----

If I have missed anything, please point it out.
That would be really appreciated.

Sincerely
Jianchao

2018-01-30 17:49:38

by Keith Busch

Subject: Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem

On Tue, Jan 30, 2018 at 11:41:07AM +0800, jianchao.wang wrote:
> Another point that confuses me is whether nvme_set_host_mem is necessary
> in nvme_dev_disable at all.
> As the comment says:
> ----
> 	/*
> 	 * If the controller is still alive tell it to stop using the
> 	 * host memory buffer. In theory the shutdown / reset should
> 	 * make sure that it doesn't access the host memoery anymore,
> 	 * but I'd rather be safe than sorry..
> 	 */
> 	if (dev->host_mem_descs)
> 		nvme_set_host_mem(dev, 0);
> ----
> It is there to prevent the controller from accessing host memory. But the
> host memory is only freed in nvme_remove, so it looks like we only need
> to do this in nvme_remove.
> For example:
> ----
> @@ -2553,6 +2545,14 @@ static void nvme_remove(struct pci_dev *pdev)
>  	flush_work(&dev->ctrl.reset_work);
>  	nvme_stop_ctrl(&dev->ctrl);
>  	nvme_remove_namespaces(&dev->ctrl);
> +	/*
> +	 * If the controller is still alive tell it to stop using the
> +	 * host memory buffer. In theory the shutdown / reset should
> +	 * make sure that it doesn't access the host memoery anymore,
> +	 * but I'd rather be safe than sorry..
> +	 */
> +	if (dev->host_mem_descs)
> +		nvme_set_host_mem(dev, 0);
>  	nvme_dev_disable(dev, true);
>  	nvme_free_host_mem(dev);
> ----
>
> If I have missed anything, please point it out.
> That would be really appreciated.

I don't make any such controller, so I can't speak to the necessity of
having this here. I just think having it in the controller disabling
path is for safety. We want the controller to acknowledge that it has
completed any use of the HMB before we make it so it can't access it.