2022-04-22 19:31:05

by Matthew Ruffell

Subject: [PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

Dear maintainers of the nbd subsystem,

A user has come across an issue which causes the nbd module to hang on
disconnect after a write has been made to a qemu qcow2 image file, with
qemu-nbd being the server.

The issue is easily reproducible with the following:

Ubuntu 20.04, 22.04 or Fedora 36
Linux 5.18-rc2 or earlier (have tested 5.18-rc2, 5.15, 5.4, 4.15)
QEMU 6.2 or earlier

Instructions to reproduce:
==========================

$ sudo apt install qemu-nbd

$ cat << EOF > reproducer.sh
#!/bin/bash

sudo modprobe nbd

while :
do
qemu-img create -f qcow2 foo.img 500M
sudo qemu-nbd --disconnect /dev/nbd15 || true
sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
sudo qemu-nbd --disconnect /dev/nbd15
done
EOF

$ chmod +x reproducer.sh
$ ./reproducer.sh

On Ubuntu, the loop hangs within a minute or two, and dmesg reports a flood of
I/O errors, followed by hung task timeouts. On Fedora it takes a little
longer, but it breaks in the same way within 10 minutes.

An example kernel log is below:

https://paste.ubuntu.com/p/5ZjC5b8MR7/

Debugging done:
===============

Looking at syslog, it seems systemd-udevd gets stuck and enters the D
(uninterruptible sleep) state.

systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time

$ ps aux
...
419 1194 root D 0.1 systemd-udevd -

I rebooted and disabled systemd-udevd and its sockets with:

$ sudo systemctl stop systemd-udevd.service
$ sudo systemctl stop systemd-udevd-control.socket
$ sudo systemctl stop systemd-udevd-kernel.socket

When running the reproducer again, everything works fine, and the nbd
subsystem does not hang.

Looking at udev rules, I compared those in Ubuntu 18.04, where the issue does
not occur, with those in 20.04, where it does, and came across:

/usr/lib/udev/rules.d/60-block.rules

In 18.04:
# watch metadata changes, caused by tools closing the device node which was opened for writing
ACTION!="remove", SUBSYSTEM=="block", KERNEL=="loop*|nvme*|sd*|vd*|xvd*|pmem*|mmcblk*", OPTIONS+="watch"

In 20.04:
# watch metadata changes, caused by tools closing the device node which was opened for writing
ACTION!="remove", SUBSYSTEM=="block", \
KERNEL=="loop*|mmcblk*[0-9]|msblk*[0-9]|mspblk*[0-9]|nvme*|sd*|vd*|xvd*|bcache*|cciss*|dasd*|ubd*|ubi*|scm*|pmem*|nbd*|zd*", \
OPTIONS+="watch"

The difference is that OPTIONS+="watch" is now applied to nbd* devices for
every event other than remove.

When I deleted nbd* from the rule and retried the reproducer, everything
worked smoothly.

Looking at the manpage for udev:

> watch
> Watch the device node with inotify; when the node is closed after being
> opened for writing, a change uevent is synthesized.
>
> nowatch
> Disable the watching of a device node with inotify.

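To see the mechanism in isolation, the watch option boils down to an inotify
watch for IN_CLOSE_WRITE on the device node. A minimal sketch of such a
watcher (illustrative only; run it against /dev/nbd15 while the reproducer
loops and it prints a line for every trigger):

/* watch.c - a minimal stand-in for udevd's "watch" option: print a line
 * whenever a writer closes the watched node, which is exactly the trigger
 * udevd turns into a synthesized "change" uevent. */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char buf[4096];
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s /dev/nbd15\n", argv[0]);
                return 1;
        }
        fd = inotify_init();
        if (fd < 0 || inotify_add_watch(fd, argv[1], IN_CLOSE_WRITE) < 0) {
                perror("inotify");
                return 1;
        }
        while (read(fd, buf, sizeof(buf)) > 0)
                printf("close-after-write on %s -> change uevent\n", argv[1]);
        return 0;
}
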
It appears that there is some sort of problem where systemd-udevd uses inotify
to watch for updates to the device metadata, and this blocks a subsequent
disconnect request, causing it to fail with -32 (EPIPE):

block nbd15: Send disconnect failed -32

After which we start seeing stuck requests:

block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 30 seconds

All userspace calls to the nbd module hang, and the system has to be rebooted.

Workaround:
===========

We can work around the issue by adding a higher priority udev rule that does
not watch nbd* devices.

$ cat << EOF | sudo tee /etc/udev/rules.d/97-nbd-device.rules
# Disable inotify watching of change events for NBD devices
ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
EOF

$ sudo udevadm control --reload-rules
$ sudo udevadm trigger

Help on debugging the problem:
==============================

I need some help debugging the problem, as I am not quite sure how to trace
the interactions between inotify and nbd.

I am happy to help debug the issue, or try any patches that gather debugging
data or any potential fixes.

Thanks,
Matthew


2022-04-22 22:51:47

by Josef Bacik

Subject: Re: [PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

On Fri, Apr 22, 2022 at 1:42 AM Matthew Ruffell
<[email protected]> wrote:
>
> Dear maintainers of the nbd subsystem,
>
> A user has come across an issue which causes the nbd module to hang on
> disconnect after a write has been made to a qemu qcow2 image file, with
> qemu-nbd being the server.
>

Ok, there are two problems here, but I want to make sure I have the right
fix for the hang first. Can you apply this patch

https://paste.centos.org/view/b1a2d01a

and make sure the hang goes away? Once that part is fixed I'll fix
the IO errors; this is just us racing with systemd while we tear down
the device and then we're triggering a partition read while the device
is going down and it's complaining loudly. Before we would
set_capacity to 0 whenever we disconnected, but that causes problems
with file systems that may still have the device open. However now we
only do this if the server does the CLEAR_SOCK ioctl, which clearly
can race with systemd poking the device, so I need to make it
set_capacity(0) when the last opener closes the device to prevent this
style of race.
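
Roughly what I have in mind, as a sketch only (nbd_last_opener() is a
placeholder for whatever open-count check this ends up using, not an
existing helper):

/* Sketch: defer the capacity reset from the CLEAR_SOCK ioctl, where it
 * races with systemd-udevd, to the block device release hook, and only
 * do it once the last opener is gone. nbd_last_opener() is a placeholder. */
static void nbd_release(struct gendisk *disk, fmode_t mode)
{
        struct nbd_device *nbd = disk->private_data;

        if (nbd_last_opener(nbd))
                set_capacity(disk, 0);

        nbd_put(nbd);
}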

Let me know if that patch fixes the hang, and then I'll work up
something for the capacity problem. Thanks,

Josef

2022-04-26 07:32:57

by Matthew Ruffell

Subject: Re: [PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

Hi Josef,

The pastebin link has expired, and I can't access your patch.
It seems to default to 1 day deletion.

Could you please create a new paste or send the patch inline in this
email thread?

I am more than happy to try the patch out.

Thank you for your analysis.
Matthew

On Sat, Apr 23, 2022 at 3:24 AM Josef Bacik <[email protected]> wrote:
>
> On Fri, Apr 22, 2022 at 1:42 AM Matthew Ruffell
> <[email protected]> wrote:
> >
> > Dear maintainers of the nbd subsystem,
> >
> > A user has come across an issue which causes the nbd module to hang on
> > disconnect after a write has been made to a qemu qcow2 image file, with
> > qemu-nbd being the server.
> >
>
> Ok, there are two problems here, but I want to make sure I have the right
> fix for the hang first. Can you apply this patch
>
> https://paste.centos.org/view/b1a2d01a
>
> and make sure the hang goes away? Once that part is fixed I'll fix
> the IO errors; this is just us racing with systemd while we tear down
> the device and then we're triggering a partition read while the device
> is going down and it's complaining loudly. Before we would
> set_capacity to 0 whenever we disconnected, but that causes problems
> with file systems that may still have the device open. However now we
> only do this if the server does the CLEAR_SOCK ioctl, which clearly
> can race with systemd poking the device, so I need to make it
> set_capacity(0) when the last opener closes the device to prevent this
> style of race.
>
> Let me know if that patch fixes the hang, and then I'll work up
> something for the capacity problem. Thanks,
>
> Josef

2022-05-13 08:58:20

by Yu Kuai

Subject: Re: [PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

On 2022/05/13 10:56, Matthew Ruffell wrote:
> Hi Josef,
>
> Just a friendly ping, I am more than happy to test a patch, if you send it
> inline in the email, since the pastebin you used expired after 1 day, and I
> couldn't access it.
>
> I came across and tested Yu Kuai's patches [1][2] which are for the same issue,
> and they indeed fix the hang. Thank you Yu.
Hi, Matthew

Thanks for your test.
>
> [1] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> https://lists.debian.org/nbd/2022/04/msg00212.html
>
> [2] nbd: fix io hung while disconnecting device
> https://lists.debian.org/nbd/2022/04/msg00207.html
>
> I am also happy to test any patches to fix the I/O errors.

Sorry that I missed this thread. IMO, if inflight requests are cleared by
the NBD_CLEAR_SOCK ioctl after my patch [2] (or by other callers of
nbd_clear_que()), such io will complete with an error. Thus I don't think
such io errors need to be fixed.
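
For reference, the path that fails those requests looks roughly like this
(abridged from drivers/block/nbd.c; locking and error handling trimmed):

/* nbd_clear_que() iterates the tag set and fails every request that is
 * still inflight once the socket is gone, so the io errors seen after a
 * disconnect are the expected completion, not a further bug. */
static bool nbd_clear_req(struct request *req, void *data, bool reserved)
{
        struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);

        cmd->status = BLK_STS_IOERR;
        blk_mq_complete_request(req);
        return true;
}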

Josef, do you have other suggestions?

Thanks,
Kuai

2022-05-14 02:05:17

by Josef Bacik

Subject: Re: [PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

On Fri, May 13, 2022 at 02:56:18PM +1200, Matthew Ruffell wrote:
> Hi Josef,
>
> Just a friendly ping, I am more than happy to test a patch, if you send it
> inline in the email, since the pastebin you used expired after 1 day, and I
> couldn't access it.
>
> I came across and tested Yu Kuai's patches [1][2] which are for the same issue,
> and they indeed fix the hang. Thank you Yu.
>
> [1] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> https://lists.debian.org/nbd/2022/04/msg00212.html
>
> [2] nbd: fix io hung while disconnecting device
> https://lists.debian.org/nbd/2022/04/msg00207.html
>
> I am also happy to test any patches to fix the I/O errors.
>

Sorry, you caught me on vacation before and I forgot to reply. Here's part one
of the patch I wanted you to try, which fixes the io hang part. Thanks,

Josef


From 0a6123520380cb84de8ccefcccc5f112bce5efb6 Mon Sep 17 00:00:00 2001
Message-Id: <0a6123520380cb84de8ccefcccc5f112bce5efb6.1652447517.git.josef@toxicpanda.com>
From: Josef Bacik <[email protected]>
Date: Sat, 23 Apr 2022 23:51:23 -0400
Subject: [PATCH] timeout thing

---
drivers/block/nbd.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 526389351784..ab365c0e9c04 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1314,7 +1314,10 @@ static void nbd_config_put(struct nbd_device *nbd)
kfree(nbd->config);
nbd->config = NULL;

- nbd->tag_set.timeout = 0;
+ /* Reset our timeout to something sane. */
+ nbd->tag_set.timeout = 30 * HZ;
+ blk_queue_rq_timeout(nbd->disk->queue, 30 * HZ);
+
nbd->disk->queue->limits.discard_granularity = 0;
nbd->disk->queue->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(nbd->disk->queue, 0);
--
2.26.3


2022-05-14 02:37:58

by Matthew Ruffell

Subject: Re: [PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

Hi Josef,

Just a friendly ping, I am more than happy to test a patch, if you send it
inline in the email, since the pastebin you used expired after 1 day, and I
couldn't access it.

I came across and tested Yu Kuai's patches [1][2] which are for the same issue,
and they indeed fix the hang. Thank you Yu.

[1] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
https://lists.debian.org/nbd/2022/04/msg00212.html

[2] nbd: fix io hung while disconnecting device
https://lists.debian.org/nbd/2022/04/msg00207.html

I am also happy to test any patches to fix the I/O errors.

Thanks,
Matthew

On Tue, Apr 26, 2022 at 9:47 AM Matthew Ruffell
<[email protected]> wrote:
>
> Hi Josef,
>
> The pastebin link has expired, and I can't access your patch.
> It seems to default to 1 day deletion.
>
> Could you please create a new paste or send the patch inline in this
> email thread?
>
> I am more than happy to try the patch out.
>
> Thank you for your analysis.
> Matthew
>
> On Sat, Apr 23, 2022 at 3:24 AM Josef Bacik <[email protected]> wrote:
> >
> > On Fri, Apr 22, 2022 at 1:42 AM Matthew Ruffell
> > <[email protected]> wrote:
> > >
> > > Dear maintainers of the nbd subsystem,
> > >
> > > A user has come across an issue which causes the nbd module to hang on
> > > disconnect after a write has been made to a qemu qcow2 image file, with
> > > qemu-nbd being the server.
> > >
> >
> > Ok, there are two problems here, but I want to make sure I have the right
> > fix for the hang first. Can you apply this patch
> >
> > https://paste.centos.org/view/b1a2d01a
> >
> > and make sure the hang goes away? Once that part is fixed I'll fix
> > the IO errors; this is just us racing with systemd while we tear down
> > the device and then we're triggering a partition read while the device
> > is going down and it's complaining loudly. Before we would
> > set_capacity to 0 whenever we disconnected, but that causes problems
> > with file systems that may still have the device open. However now we
> > only do this if the server does the CLEAR_SOCK ioctl, which clearly
> > can race with systemd poking the device, so I need to make it
> > set_capacity(0) when the last opener closes the device to prevent this
> > style of race.
> >
> > Let me know if that patch fixes the hang, and then I'll work up
> > something for the capacity problem. Thanks,
> >
> > Josef

2022-05-14 05:03:17

by Yu Kuai

Subject: Re: [PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

On 2022/05/13 21:13, Josef Bacik wrote:
> On Fri, May 13, 2022 at 02:56:18PM +1200, Matthew Ruffell wrote:
>> Hi Josef,
>>
>> Just a friendly ping, I am more than happy to test a patch, if you send it
>> inline in the email, since the pastebin you used expired after 1 day, and I
>> couldn't access it.
>>
>> I came across and tested Yu Kuai's patches [1][2] which are for the same issue,
>> and they indeed fix the hang. Thank you Yu.
>>
>> [1] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
>> https://lists.debian.org/nbd/2022/04/msg00212.html
>>
>> [2] nbd: fix io hung while disconnecting device
>> https://lists.debian.org/nbd/2022/04/msg00207.html
>>
>> I am also happy to test any patches to fix the I/O errors.
>>
>
> Sorry, you caught me on vacation before and I forgot to reply. Here's part one
> of the patch I wanted you to try, which fixes the io hang part. Thanks,
>
> Josef
>
>
> From 0a6123520380cb84de8ccefcccc5f112bce5efb6 Mon Sep 17 00:00:00 2001
> Message-Id: <0a6123520380cb84de8ccefcccc5f112bce5efb6.1652447517.git.josef@toxicpanda.com>
> From: Josef Bacik <[email protected]>
> Date: Sat, 23 Apr 2022 23:51:23 -0400
> Subject: [PATCH] timeout thing
>
> ---
> drivers/block/nbd.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 526389351784..ab365c0e9c04 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -1314,7 +1314,10 @@ static void nbd_config_put(struct nbd_device *nbd)
> kfree(nbd->config);
> nbd->config = NULL;
>
> - nbd->tag_set.timeout = 0;
> + /* Reset our timeout to something sane. */
> + nbd->tag_set.timeout = 30 * HZ;
> + blk_queue_rq_timeout(nbd->disk->queue, 30 * HZ);
> +
> nbd->disk->queue->limits.discard_granularity = 0;
> nbd->disk->queue->limits.discard_alignment = 0;
> blk_queue_max_discard_sectors(nbd->disk->queue, 0);
>
Hi, Josef

This seems to try to fix the same problem that I described here:

nbd: fix io hung while disconnecting device
https://lists.debian.org/nbd/2022/04/msg00207.html

There are still some ios that are stuck, which means the device is
probably still open, and thus nbd_config_put() can't reach this code.
I'm afraid this patch can't fix the io hang.
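
To be concrete about why, the teardown in question only runs when the last
config reference is dropped, roughly (abridged from drivers/block/nbd.c):

/* An opener of /dev/nbdX holds a config reference, so while requests are
 * stuck and the device stays open, refcount_dec_and_test() never fires
 * and the timeout reset added by the patch above is never executed. */
static void nbd_config_put(struct nbd_device *nbd)
{
        if (refcount_dec_and_test(&nbd->config_refs)) {
                /* ... free nbd->config and reset queue limits here ... */
        }
}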

Matthew, can you try a test with this patch together with my patch below
to confirm my thought?

nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
https://lists.debian.org/nbd/2022/04/msg00212.html

Thanks,
Kuai

2022-05-16 13:59:51

by Matthew Ruffell

Subject: Re: [PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

Hi Josef, Kuai,

Josef, thank you for attaching your patch. No worries about being on vacation,
I hope you enjoyed your time off.

Josef, I built your patch on top of 5.18-rc6 with no other patches applied, and
ran the testcase in my original message. After 3 loops, a hang occurred, and
we see the usual -32 error:

May 16 03:38:35 focal-nbd kernel: block nbd15: NBD_DISCONNECT
May 16 03:38:35 focal-nbd kernel: block nbd15: Send disconnect failed -32

The hang lasted 30 seconds, no doubt caused by the "30 * HZ" timeout in your
patch, and things started moving forward again:

May 16 03:39:05 focal-nbd kernel: block nbd15: Connection timed out,
retrying (0/1 alive)
May 16 03:39:05 focal-nbd kernel: block nbd15: Connection timed out,
retrying (0/1 alive)
May 16 03:39:05 focal-nbd kernel: blk_print_req_error: 128 callbacks suppressed
May 16 03:39:05 focal-nbd kernel: I/O error, dev nbd15, sector 1023488
op 0x0:(READ) flags 0x80700 phys_seg 14 prio class 0
May 16 03:39:05 focal-nbd kernel: I/O error, dev nbd15, sector 1023608
op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 0
May 16 03:39:05 focal-nbd kernel: block nbd15: Device being setup by
another task

Note the timestamp increment of 30s. There were a whole host of I/O errors,
and after a few more loops the hang occurred again, again lasting for 30
seconds, with a few more loops completing before it got stuck once more.

Pastebin of journalctl: https://paste.ubuntu.com/p/Cx6MBC8Vgj/

Unfortunately, your patch doesn't quite solve the issue.

Kuai, I tested your suspicions by building Josef's patch on top of 5.18-rc6
with your patch below applied:

nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
https://lists.debian.org/nbd/2022/04/msg00212.html

The behaviour this time was different from Josef's patch alone. On the second
iteration of the loop, I got a bunch of I/O errors, the nbd subsystem hung and
did not recover, and I started getting stuck request messages and the usual
hung task timeout oops messages.

Pastebin of journalctl here: https://paste.ubuntu.com/p/C9rjckrWtp/

I went back and did some more testing of Kuai's two commits:

nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
https://lists.debian.org/nbd/2022/04/msg00212.html

nbd: fix io hung while disconnecting device
https://lists.debian.org/nbd/2022/04/msg00207.html

I left the testcase running for about 20 minutes, and it never hung. It did
get a bit racy from time to time when acquiring the write lock for the qcow
image, where the disconnect completed after the call to mkfs.ext4 started,
but simply answering "y" let the loop run for another 5 minutes before the
race occurred again.

Formatting 'foo.img', fmt=qcow2 size=524288000 cluster_size=65536
lazy_refcounts=off refcount_bits=16
qemu-img: foo.img: Failed to get "write" lock
Is another process using the image [foo.img]?
/dev/nbd15 disconnected
mke2fs 1.45.5 (07-Jan-2020)
/dev/nbd15 contains a ext4 file system labelled 'root'
created on Mon May 16 05:23:01 2022
Proceed anyway? (y,N)

Throughout my testing of Kuai's fixes, I never saw a hang. The behaviour
matches the workaround of preventing systemd from watching nbd devices with
inotify. I think we should go with Kuai's patches.

So for Kuai's two patches:

Tested-by: Matthew Ruffell <[email protected]>

Thanks,
Matthew

On Sat, May 14, 2022 at 3:39 PM yukuai (C) <[email protected]> wrote:
>
> On 2022/05/13 21:13, Josef Bacik wrote:
> > On Fri, May 13, 2022 at 02:56:18PM +1200, Matthew Ruffell wrote:
> >> Hi Josef,
> >>
> >> Just a friendly ping, I am more than happy to test a patch, if you send it
> >> inline in the email, since the pastebin you used expired after 1 day, and I
> >> couldn't access it.
> >>
> >> I came across and tested Yu Kuai's patches [1][2] which are for the same issue,
> >> and they indeed fix the hang. Thank you Yu.
> >>
> >> [1] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> >> https://lists.debian.org/nbd/2022/04/msg00212.html
> >>
> >> [2] nbd: fix io hung while disconnecting device
> >> https://lists.debian.org/nbd/2022/04/msg00207.html
> >>
> >> I am also happy to test any patches to fix the I/O errors.
> >>
> >
> > Sorry, you caught me on vacation before and I forgot to reply. Here's part one
> > of the patch I wanted you to try, which fixes the io hang part. Thanks,
> >
> > Josef
> >
> >
> > From 0a6123520380cb84de8ccefcccc5f112bce5efb6 Mon Sep 17 00:00:00 2001
> > Message-Id: <0a6123520380cb84de8ccefcccc5f112bce5efb6.1652447517.git.josef@toxicpanda.com>
> > From: Josef Bacik <[email protected]>
> > Date: Sat, 23 Apr 2022 23:51:23 -0400
> > Subject: [PATCH] timeout thing
> >
> > ---
> > drivers/block/nbd.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> > index 526389351784..ab365c0e9c04 100644
> > --- a/drivers/block/nbd.c
> > +++ b/drivers/block/nbd.c
> > @@ -1314,7 +1314,10 @@ static void nbd_config_put(struct nbd_device *nbd)
> > kfree(nbd->config);
> > nbd->config = NULL;
> >
> > - nbd->tag_set.timeout = 0;
> > + /* Reset our timeout to something sane. */
> > + nbd->tag_set.timeout = 30 * HZ;
> > + blk_queue_rq_timeout(nbd->disk->queue, 30 * HZ);
> > +
> > nbd->disk->queue->limits.discard_granularity = 0;
> > nbd->disk->queue->limits.discard_alignment = 0;
> > blk_queue_max_discard_sectors(nbd->disk->queue, 0);
> >
> Hi, Josef
>
> This seems to try to fix the same problem that I described here:
>
> nbd: fix io hung while disconnecting device
> https://lists.debian.org/nbd/2022/04/msg00207.html
>
> There are still some ios that are stuck, which means the device is
> probably still open, and thus nbd_config_put() can't reach this code.
> I'm afraid this patch can't fix the io hang.
>
> Matthew, can you try a test with this patch together with my patch below
> to confirm my thought?
>
> nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> https://lists.debian.org/nbd/2022/04/msg00212.html
>
> Thanks,
> Kuai

2022-05-16 20:33:26

by Josef Bacik

Subject: Re: [PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

On Sat, May 14, 2022 at 11:39:25AM +0800, yukuai (C) wrote:
> On 2022/05/13 21:13, Josef Bacik wrote:
> > On Fri, May 13, 2022 at 02:56:18PM +1200, Matthew Ruffell wrote:
> > > Hi Josef,
> > >
> > > Just a friendly ping, I am more than happy to test a patch, if you send it
> > > inline in the email, since the pastebin you used expired after 1 day, and I
> > > couldn't access it.
> > >
> > > I came across and tested Yu Kuai's patches [1][2] which are for the same issue,
> > > and they indeed fix the hang. Thank you Yu.
> > >
> > > [1] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> > > https://lists.debian.org/nbd/2022/04/msg00212.html
> > >
> > > [2] nbd: fix io hung while disconnecting device
> > > https://lists.debian.org/nbd/2022/04/msg00207.html
> > >
> > > I am also happy to test any patches to fix the I/O errors.
> > >
> >
> > Sorry, you caught me on vacation before and I forgot to reply. Here's part one
> > of the patch I wanted you to try, which fixes the io hang part. Thanks,
> >
> > Josef
> >
> > From 0a6123520380cb84de8ccefcccc5f112bce5efb6 Mon Sep 17 00:00:00 2001
> > Message-Id: <0a6123520380cb84de8ccefcccc5f112bce5efb6.1652447517.git.josef@toxicpanda.com>
> > From: Josef Bacik <[email protected]>
> > Date: Sat, 23 Apr 2022 23:51:23 -0400
> > Subject: [PATCH] timeout thing
> >
> > ---
> > drivers/block/nbd.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> > index 526389351784..ab365c0e9c04 100644
> > --- a/drivers/block/nbd.c
> > +++ b/drivers/block/nbd.c
> > @@ -1314,7 +1314,10 @@ static void nbd_config_put(struct nbd_device *nbd)
> > kfree(nbd->config);
> > nbd->config = NULL;
> > - nbd->tag_set.timeout = 0;
> > + /* Reset our timeout to something sane. */
> > + nbd->tag_set.timeout = 30 * HZ;
> > + blk_queue_rq_timeout(nbd->disk->queue, 30 * HZ);
> > +
> > nbd->disk->queue->limits.discard_granularity = 0;
> > nbd->disk->queue->limits.discard_alignment = 0;
> > blk_queue_max_discard_sectors(nbd->disk->queue, 0);
> >
> Hi, Josef
>
> This seems to try to fix the same problem that I described here:
>
> nbd: fix io hung while disconnecting device
> https://lists.debian.org/nbd/2022/04/msg00207.html
>
> There are still some ios that are stuck, which means the device is
> probably still open, and thus nbd_config_put() can't reach this code.
> I'm afraid this patch can't fix the io hang.
>
> Matthew, can you try a test with this patch together with my patch below
> to confirm my thought?
>
> nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> https://lists.debian.org/nbd/2022/04/msg00212.html
>

Re-submit this one, but fix it so we just test the bit to see if we need to skip
it, and change it so we only CLEAR when we're sure we're going to complete the
request. Thanks,
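
A sketch of the shape I mean (illustrative only, not the resubmitted patch;
the flag and helpers are the existing ones in drivers/block/nbd.c):

/* In the reply path, first only *test* NBD_CMD_INFLIGHT to decide whether
 * the reply is still wanted ... */
if (!test_bit(NBD_CMD_INFLIGHT, &cmd->flags)) {
        /* Request already timed out or was cleared; ignore the reply. */
        ret = -ENOENT;
        goto out;
}

/* ... validate and copy the reply data ... */

/* ... and clear it only at the commit point, atomically, so the clearing
 * paths (e.g. nbd_clear_que()) cannot complete the same request twice. */
if (test_and_clear_bit(NBD_CMD_INFLIGHT, &cmd->flags))
        blk_mq_complete_request(blk_mq_rq_from_pdu(cmd));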

Josef

2022-05-17 01:00:34

by Yu Kuai

Subject: Re: [PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

On 2022/05/16 20:17, Josef Bacik wrote:

>> Hi, Josef
>>
>> This seems to try to fix the same problem that I described here:
>>
>> nbd: fix io hung while disconnecting device
>> https://lists.debian.org/nbd/2022/04/msg00207.html
>>
>> There are still some ios that are stuck, which means the device is
>> probably still open, and thus nbd_config_put() can't reach this code.
>> I'm afraid this patch can't fix the io hang.
>>
>> Matthew, can you try a test with this patch together with my patch below
>> to confirm my thought?
>>
>> nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
>> https://lists.debian.org/nbd/2022/04/msg00212.html
>>
>
> Re-submit this one, but fix it so we just test the bit to see if we need to skip
> it, and change it so we only CLEAR when we're sure we're going to complete the
> request. Thanks,

Ok, thanks for your advice. I'll send a new version.

BTW, do you have any suggestions on the other patches in the patchset?

https://lore.kernel.org/all/[email protected]/

Thanks,
Kuai
>
> Josef
>