From: Chunguang Xu <[email protected]>
In some scenarios, if too many commands are issued by nvme command in
the same time by user tasks, this may exhaust all tags of admin_q. If
a reset (nvme reset or IO timeout) occurs before these commands finish,
reconnect routine may fail to update nvme regs due to insufficient tags,
which will cause kernel hang forever. In order to workaround this issue,
maybe we can let reg_read32()/reg_read64()/reg_write32() use reserved
tags. This maybe safe for nvmf:
1. For the disable ctrl path, we will not issue connect command
2. For the enable ctrl / fw activate path, since connect and reg_xx()
are called serially.
So the reserved tags may still be enough while reg_xx() use reserved tags.
Signed-off-by: Chunguang Xu <[email protected]>
---
drivers/nvme/host/fabrics.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 1f0ea1f32d22..f6416f8553f0 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -180,7 +180,7 @@ int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
cmd.prop_get.offset = cpu_to_le32(off);
ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, NULL, 0,
- NVME_QID_ANY, 0);
+ NVME_QID_ANY, NVME_SUBMIT_RESERVED);
if (ret >= 0)
*val = le64_to_cpu(res.u64);
@@ -226,7 +226,7 @@ int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
cmd.prop_get.offset = cpu_to_le32(off);
ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, NULL, 0,
- NVME_QID_ANY, 0);
+ NVME_QID_ANY, NVME_SUBMIT_RESERVED);
if (ret >= 0)
*val = le64_to_cpu(res.u64);
@@ -271,7 +271,7 @@ int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val)
cmd.prop_set.value = cpu_to_le64(val);
ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, NULL, NULL, 0,
- NVME_QID_ANY, 0);
+ NVME_QID_ANY, NVME_SUBMIT_RESERVED);
if (unlikely(ret))
dev_err(ctrl->device,
"Property Set error: %d, offset %#x\n",
--
2.25.1
On 4/29/2024 7:17 PM, brookxu.cn wrote:
> From: Chunguang Xu <[email protected]>
>
> In some scenarios, if too many commands are issued by nvme command in
> the same time by user tasks, this may exhaust all tags of admin_q. If
> a reset (nvme reset or IO timeout) occurs before these commands finish,
> reconnect routine may fail to update nvme regs due to insufficient tags,
> which will cause kernel hang forever. In order to workaround this issue,
> maybe we can let reg_read32()/reg_read64()/reg_write32() use reserved
> tags. This maybe safe for nvmf:
>
> 1. For the disable ctrl path, we will not issue connect command
> 2. For the enable ctrl / fw activate path, since connect and reg_xx()
> are called serially.
>
Given the complexity of the scenario described above, is it possible to
write a script for this scenario that will trigger this and submit to
blktest ? not that this is a blocker to get this patch reviewed, but
believe it is needed in long run, WDYT ?
> So the reserved tags may still be enough while reg_xx() use reserved tags.
>
> Signed-off-by: Chunguang Xu <[email protected]>
> ---
-ck
Chaitanya Kulkarni <[email protected]> 于2024年4月30日周二 11:47写道:
>
> On 4/29/2024 7:17 PM, brookxu.cn wrote:
> > From: Chunguang Xu <[email protected]>
> >
> > In some scenarios, if too many commands are issued by nvme command in
> > the same time by user tasks, this may exhaust all tags of admin_q. If
> > a reset (nvme reset or IO timeout) occurs before these commands finish,
> > reconnect routine may fail to update nvme regs due to insufficient tags,
> > which will cause kernel hang forever. In order to workaround this issue,
> > maybe we can let reg_read32()/reg_read64()/reg_write32() use reserved
> > tags. This maybe safe for nvmf:
> >
> > 1. For the disable ctrl path, we will not issue connect command
> > 2. For the enable ctrl / fw activate path, since connect and reg_xx()
> > are called serially.
> >
>
> Given the complexity of the scenario described above, is it possible to
> write a script for this scenario that will trigger this and submit to
> blktest ? not that this is a blocker to get this patch reviewed, but
> believe it is needed in long run, WDYT ?
Thanks for you reply, I can easily reproduce it in my VMs by following steps:
STEP 1. In order to reduce the complexity of reproduction, I reduce
NVME_AQ_MQ_TAG_DEPTH from 31 to 8
STEP 2. Create a nvme device by NVMe over tcp, such as following command:
nvme connect -t tcp -a 192.168.122.20 -s 4420 -n
nqn.2014-08.org.nvmexpress.mytest
STEP 3. Buind and run the c++ program issues nvme commands as followed:
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <vector>
#include <set>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
const int concurrency = 64;
std::set<pid_t> chlds;
int __exit = 0;
void sigint_proc(int signo)
{
__exit = 1;
}
int main(int argc, char **argv)
{
signal(SIGINT, sigint_proc);
for (auto i = 0; i < concurrency; i++) {
auto pid = fork();
if (!pid) {
while (true) {
system("nvme list -o json 2>&1 > /dev/null");
}
}
chlds.insert(pid);
}
while (!__exit) {
if (chlds.empty())
break;
for (auto pid : chlds) {
int wstatus, ret;
ret = waitpid(pid, &wstatus, WNOWAIT);
if (ret > 0) {
chlds.erase(pid);
break;
}
}
usleep(1000);
}
// exit
for (auto pid : chlds)
kill(pid, SIGKILL);
return 0;
}
STEP 4. Open a new console, running the followed command:
while [ true ]; do nvme reset /dev/nvme0; sleep `echo "$RANDOM%1" | bc`; done
We will reproduce this issue soon.
>
> > So the reserved tags may still be enough while reg_xx() use reserved tags.
> >
> > Signed-off-by: Chunguang Xu <[email protected]>
> > ---
>
> -ck
>
>
On 4/29/2024 10:10 PM, 许春光 wrote:
> Chaitanya Kulkarni <[email protected]> 于2024年4月30日周二 11:47写道:
>>
>> On 4/29/2024 7:17 PM, brookxu.cn wrote:
>>> From: Chunguang Xu <[email protected]>
>>>
>>> In some scenarios, if too many commands are issued by nvme command in
>>> the same time by user tasks, this may exhaust all tags of admin_q. If
>>> a reset (nvme reset or IO timeout) occurs before these commands finish,
>>> reconnect routine may fail to update nvme regs due to insufficient tags,
>>> which will cause kernel hang forever. In order to workaround this issue,
>>> maybe we can let reg_read32()/reg_read64()/reg_write32() use reserved
>>> tags. This maybe safe for nvmf:
>>>
>>> 1. For the disable ctrl path, we will not issue connect command
>>> 2. For the enable ctrl / fw activate path, since connect and reg_xx()
>>> are called serially.
>>>
>>
>> Given the complexity of the scenario described above, is it possible to
>> write a script for this scenario that will trigger this and submit to
>> blktest ? not that this is a blocker to get this patch reviewed, but
>> believe it is needed in long run, WDYT ?
>
> Thanks for you reply, I can easily reproduce it in my VMs by following steps:
> STEP 1. In order to reduce the complexity of reproduction, I reduce
> NVME_AQ_MQ_TAG_DEPTH from 31 to 8
>
> STEP 2. Create a nvme device by NVMe over tcp, such as following command:
> nvme connect -t tcp -a 192.168.122.20 -s 4420 -n
> nqn.2014-08.org.nvmexpress.mytest
>
> STEP 3. Buind and run the c++ program issues nvme commands as followed:
> #include <sys/types.h>
> #include <signal.h>
> #include <unistd.h>
> #include <vector>
> #include <set>
> #include <stdlib.h>
> #include <sys/types.h>
> #include <sys/wait.h>
>
> const int concurrency = 64;
> std::set<pid_t> chlds;
>
> int __exit = 0;
> void sigint_proc(int signo)
> {
> __exit = 1;
> }
>
> int main(int argc, char **argv)
> {
> signal(SIGINT, sigint_proc);
>
> for (auto i = 0; i < concurrency; i++) {
> auto pid = fork();
> if (!pid) {
> while (true) {
> system("nvme list -o json 2>&1 > /dev/null");
> }
> }
>
> chlds.insert(pid);
> }
>
> while (!__exit) {
> if (chlds.empty())
> break;
>
> for (auto pid : chlds) {
> int wstatus, ret;
> ret = waitpid(pid, &wstatus, WNOWAIT);
> if (ret > 0) {
> chlds.erase(pid);
> break;
> }
> }
> usleep(1000);
> }
>
> // exit
> for (auto pid : chlds)
> kill(pid, SIGKILL);
>
> return 0;
> }
>
> STEP 4. Open a new console, running the followed command:
> while [ true ]; do nvme reset /dev/nvme0; sleep `echo "$RANDOM%1" | bc`; done
>
> We will reproduce this issue soon.
>>
cool, can you please submit a blktest [1] for this ? I'm not sure if we
have any coverage for this scenario ...
-ck
[1] https://github.com/osandov/blktests
Chaitanya Kulkarni <[email protected]> 于2024年4月30日周二 14:59写道:
>
> On 4/29/2024 10:10 PM, 许春光 wrote:
> > Chaitanya Kulkarni <[email protected]> 于2024年4月30日周二 11:47写道:
> >>
> >> On 4/29/2024 7:17 PM, brookxu.cn wrote:
> >>> From: Chunguang Xu <[email protected]>
> >>>
> >>> In some scenarios, if too many commands are issued by nvme command in
> >>> the same time by user tasks, this may exhaust all tags of admin_q. If
> >>> a reset (nvme reset or IO timeout) occurs before these commands finish,
> >>> reconnect routine may fail to update nvme regs due to insufficient tags,
> >>> which will cause kernel hang forever. In order to workaround this issue,
> >>> maybe we can let reg_read32()/reg_read64()/reg_write32() use reserved
> >>> tags. This maybe safe for nvmf:
> >>>
> >>> 1. For the disable ctrl path, we will not issue connect command
> >>> 2. For the enable ctrl / fw activate path, since connect and reg_xx()
> >>> are called serially.
> >>>
> >>
> >> Given the complexity of the scenario described above, is it possible to
> >> write a script for this scenario that will trigger this and submit to
> >> blktest ? not that this is a blocker to get this patch reviewed, but
> >> believe it is needed in long run, WDYT ?
> >
> > Thanks for you reply, I can easily reproduce it in my VMs by following steps:
> > STEP 1. In order to reduce the complexity of reproduction, I reduce
> > NVME_AQ_MQ_TAG_DEPTH from 31 to 8
> >
> > STEP 2. Create a nvme device by NVMe over tcp, such as following command:
> > nvme connect -t tcp -a 192.168.122.20 -s 4420 -n
> > nqn.2014-08.org.nvmexpress.mytest
> >
> > STEP 3. Buind and run the c++ program issues nvme commands as followed:
> > #include <sys/types.h>
> > #include <signal.h>
> > #include <unistd.h>
> > #include <vector>
> > #include <set>
> > #include <stdlib.h>
> > #include <sys/types.h>
> > #include <sys/wait.h>
> >
> > const int concurrency = 64;
> > std::set<pid_t> chlds;
> >
> > int __exit = 0;
> > void sigint_proc(int signo)
> > {
> > __exit = 1;
> > }
> >
> > int main(int argc, char **argv)
> > {
> > signal(SIGINT, sigint_proc);
> >
> > for (auto i = 0; i < concurrency; i++) {
> > auto pid = fork();
> > if (!pid) {
> > while (true) {
> > system("nvme list -o json 2>&1 > /dev/null");
> > }
> > }
> >
> > chlds.insert(pid);
> > }
> >
> > while (!__exit) {
> > if (chlds.empty())
> > break;
> >
> > for (auto pid : chlds) {
> > int wstatus, ret;
> > ret = waitpid(pid, &wstatus, WNOWAIT);
> > if (ret > 0) {
> > chlds.erase(pid);
> > break;
> > }
> > }
> > usleep(1000);
> > }
> >
> > // exit
> > for (auto pid : chlds)
> > kill(pid, SIGKILL);
> >
> > return 0;
> > }
> >
> > STEP 4. Open a new console, running the followed command:
> > while [ true ]; do nvme reset /dev/nvme0; sleep `echo "$RANDOM%1" | bc`; done
> >
> > We will reproduce this issue soon.
> >>
>
> cool, can you please submit a blktest [1] for this ? I'm not sure if we
> have any coverage for this scenario ...
>
My pleasure, after reviewed, if no more doubt, I will try to do it, thanks
> -ck
>
> [1] https://github.com/osandov/blktests
>
Reviewed-by: Sagi Grimberg <[email protected]>
On 4/29/24 19:17, brookxu.cn wrote:
> From: Chunguang Xu <[email protected]>
>
> In some scenarios, if too many commands are issued by nvme command in
> the same time by user tasks, this may exhaust all tags of admin_q. If
> a reset (nvme reset or IO timeout) occurs before these commands finish,
> reconnect routine may fail to update nvme regs due to insufficient tags,
> which will cause kernel hang forever. In order to workaround this issue,
> maybe we can let reg_read32()/reg_read64()/reg_write32() use reserved
> tags. This maybe safe for nvmf:
>
> 1. For the disable ctrl path, we will not issue connect command
> 2. For the enable ctrl / fw activate path, since connect and reg_xx()
> are called serially.
>
> So the reserved tags may still be enough while reg_xx() use reserved tags.
>
> Signed-off-by: Chunguang Xu <[email protected]>
> ---
>
Looks good.
Reviewed-by: Chaitanya Kulkarni <[email protected]>
-ck