After the surprise removal of a mounted NVMe disk the pciehp task
reliably hangs forever with a trace similar to this one:
INFO: task irq/43-pciehp:64 blocked for more than 120 seconds.
Call Trace:
<TASK>
__bio_queue_enter
blk_mq_submit_bio
submit_bio_noacct
submit_bio_wait
blkdev_issue_flush
ext4_sync_fs
sync_filesystem
fsync_bdev
delete_partition
blk_drop_partitions
del_gendisk
nvme_ns_remove
nvme_remove_namespaces
nvme_remove
pci_device_remove
__device_release_driver
device_release_driver
pci_stop_bus_device
pci_stop_and_remove_bus_device
pciehp_unconfigure_device
pciehp_disable_slot
pciehp_handle_presence_or_link_change
pciehp_ist
</TASK>
I observed this with 5.15.5 from debian bullseye-backports and confirmed
with 5.17.0-rc3 but previous kernels may be affected as well.
I read that del_gendisk() prevents any new I/O only after
flushing and dropping all partitions.
But in the case of a surprise removal, any new blocking I/O must be prevented
first. I assume that nvme_set_queue_dying() is supposed to do that.
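For reference, that helper currently looks roughly like this (5.17-era
drivers/nvme/host/core.c, trimmed to the lines relevant here):

```c
/* 5.17-era drivers/nvme/host/core.c, trimmed for illustration */
static void nvme_set_queue_dying(struct nvme_ns *ns)
{
	if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
		return;

	/* sets QUEUE_FLAG_DYING and starts a queue freeze via blk_queue_start_drain() */
	blk_set_queue_dying(ns->queue);
	/* unquiesce so already queued requests can be flushed out and failed */
	nvme_start_ns_queue(ns);

	set_capacity_and_notify(ns->disk, 0);
}
```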
Is there any other mechanism in place which should achieve this?
Unfortunately I am not very familiar with the blk_mq infrastructure, so
any comments and suggestions are very welcome.
Best regards,
Markus
Signed-off-by: Markus Blöchl <[email protected]>
---
drivers/nvme/host/core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 961a5f8a44d2..0654cbe9b80e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4573,6 +4573,8 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
return;
+ set_bit(GD_DEAD, &ns->disk->state);
+
blk_set_queue_dying(ns->queue);
nvme_start_ns_queue(ns);
base-commit: f1baf68e1383f6ed93eb9cff2866d46562607a43
--
2.35.1
On Tue, Feb 15, 2022 at 07:22:40AM -0800, Keith Busch wrote:
> I can't actually tell if not checking the DYING flag was intentional or
> not, since the comments in blk_queue_start_drain() say otherwise.
>
> Christoph, do you know the intention here? Should __bio_queue_enter()
> check the queue DYING flag, or do you prefer that drivers explicitly set
> the disk state like this? It looks to me like the queue flags should be
> checked since that's already tied to the freeze wait_queue_head_t.
It was intentional but maybe not fully thought out. Do you remember why
we're doing the manual setting of the dying flag instead of just calling
del_gendisk early on in nvme? Because calling del_gendisk is supposed
to be all that a driver needs to do.
On Tue, Feb 15, 2022 at 07:47:04PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 15, 2022 at 07:22:40AM -0800, Keith Busch wrote:
> > I can't actually tell if not checking the DYING flag was intentional or
> > not, since the comments in blk_queue_start_drain() say otherwise.
> >
> > Christoph, do you know the intention here? Should __bio_queue_enter()
> > check the queue DYING flag, or do you prefer that drivers explicitly set
> > the disk state like this? It looks to me like the queue flags should be
> > checked since that's already tied to the freeze wait_queue_head_t.
>
> It was intentional but maybe not fully thought out. Do you remember why
> we're doing the manual setting of the dying flag instead of just calling
> del_gendisk early on in nvme? Because calling del_gendisk is supposed
> to be all that a driver needs to do.
When the driver concludes new requests can't ever succeed, we had been
setting the queue to DYING first so that new requests can't enter, since
requests blocking there would prevent forward progress.
AFAICT, just calling del_gendisk() is fine for a graceful removal. It
calls fsync_bdev() to flush out pending writes before setting the disk
state to "DEAD".
Setting the queue to dying first will "freeze" the queue, which is why
fsync_bdev() is blocked. We were relying on the queue DYING flag to
prevent that from blocking.
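For reference, the path that starts that freeze looks roughly like this
(5.17-era block/blk-core.c, trimmed):

```c
/* 5.17-era block/blk-core.c, trimmed for illustration */
void blk_queue_start_drain(struct request_queue *q)
{
	/* starting the freeze is what later submitters end up sleeping on */
	blk_freeze_queue_start(q);
	if (queue_is_mq(q))
		blk_mq_wake_waiters(q);
	/* Make blk_queue_enter() reexamine the DYING flag. */
	wake_up_all(&q->mq_freeze_wq);
}

void blk_set_queue_dying(struct request_queue *q)
{
	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
	blk_queue_start_drain(q);
}
```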
Perhaps another way to do this would be to remove the queue DYING
setting and let the driver internally fail new requests instead? There
may be some issues with doing it that way IIRC, but blk-mq has evolved a
bit from where we started, so I'd need to test it out to confirm.
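Roughly what I mean, as a hypothetical sketch only (FOO_DEV_DEAD and all
foo_* names are made up, this is not an actual driver):

```c
/* Hypothetical sketch: fail new requests inside the driver instead of
 * relying on QUEUE_FLAG_DYING.  All foo_* names are made up. */
#define FOO_DEV_DEAD	0	/* bit in foo_dev->flags, set once the device is gone */

struct foo_dev {
	unsigned long	flags;
};

static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct foo_dev *dev = hctx->queue->queuedata;

	/* device is gone: the blk-mq core ends the request with this status */
	if (unlikely(test_bit(FOO_DEV_DEAD, &dev->flags)))
		return BLK_STS_IOERR;

	/* ... normal submission path ... */
	return BLK_STS_OK;
}
```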
On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
> After the surprise removal of a mounted NVMe disk the pciehp task
> reliably hangs forever with a trace similar to this one:
Do you have a specific reproducer? At least with doing a
echo 1 > /sys/.../remove
while running fsx on a file system I can't actually reproduce it.
So, I think in the short run setting GD_DEAD is the right thing as a
non-invasive fix, but I'd rather do it in blk_set_queue_dying to also
cover the other drivers with a similar pattern. blk_set_queue_dying
is never used for cases where q->disk isn't valid so that should be
fine.
In the long run I think we just need to remove the fsync_bdev in
del_gendisk or at least make it conditional.
On Tue, Feb 15, 2022 at 08:17:31PM +0100, Christoph Hellwig wrote:
> On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
> > After the surprise removal of a mounted NVMe disk the pciehp task
> > reliably hangs forever with a trace similar to this one:
>
> Do you have a specific reproducer? At least with doing a
>
> echo 1 > /sys/.../remove
>
> while running fsx on a file system I can't actually reproduce it.
That's a graceful removal. You need to do something to terminate the
connection without the driver knowing about it. If you don't have a
hotplug-capable system, you can do something slightly destructive to the
PCI link to force an ungraceful teardown, though you'll need to wait for
the I/O timeout before anything interesting will happen.
# setpci -s "${slot}" CAP_EXP+10.w=10:10
The "$slot" needs to be the B:D.f of the bridge connecting to your nvme
end device. An example getting it for a non-multipath PCIe attached nvme0n1:
# readlink -f /sys/block/nvme0n1/device | grep -Eo '[0-9a-f]{4,5}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | tail -2 | head -1
On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
> After the surprise removal of a mounted NVMe disk the pciehp task
> reliably hangs forever with a trace similar to this one:
>
> INFO: task irq/43-pciehp:64 blocked for more than 120 seconds.
> Call Trace:
> <TASK>
> __bio_queue_enter
> blk_mq_submit_bio
> submit_bio_noacct
> submit_bio_wait
> blkdev_issue_flush
> ext4_sync_fs
> sync_filesystem
> fsync_bdev
> delete_partition
> blk_drop_partitions
> del_gendisk
> nvme_ns_remove
> nvme_remove_namespaces
> nvme_remove
> pci_device_remove
> __device_release_driver
> device_release_driver
> pci_stop_bus_device
> pci_stop_and_remove_bus_device
> pciehp_unconfigure_device
> pciehp_disable_slot
> pciehp_handle_presence_or_link_change
> pciehp_ist
> </TASK>
>
> I observed this with 5.15.5 from debian bullseye-backports and confirmed
> with 5.17.0-rc3 but previous kernels may be affected as well.
Thanks for the patch.
Entering the queue used to fail if blk_queue_dying() was true. The
condition was changed in commit 8e141f9eb803e209714a80aa6ec073893f94c526
("block: drain file system I/O on del_gendisk").
I can't actually tell if not checking the DYING flag was intentional or
not, since the comments in blk_queue_start_drain() say otherwise.

Christoph, do you know the intention here? Should __bio_queue_enter()
check the queue DYING flag, or do you prefer that drivers explicitly set
the disk state like this? It looks to me like the queue flags should be
checked since that's already tied to the freeze wait_queue_head_t.
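For reference, after that commit the wait in __bio_queue_enter() looks
roughly like this (block/blk-core.c, trimmed); only GD_DEAD breaks a
sleeping submitter out of the wait, the queue DYING flag isn't looked at:

```c
/* block/blk-core.c after 8e141f9eb803, trimmed for illustration */
static int __bio_queue_enter(struct request_queue *q, struct bio *bio)
{
	while (!blk_try_enter_queue(q, false)) {
		struct gendisk *disk = bio->bi_bdev->bd_disk;

		/* ... REQ_NOWAIT handling and barriers trimmed ... */

		wait_event(q->mq_freeze_wq,
			   (!q->mq_freeze_depth &&
			    blk_pm_resume_queue(false, q)) ||
			   test_bit(GD_DEAD, &disk->state));
		if (test_bit(GD_DEAD, &disk->state))
			goto dead;
	}

	return 0;
dead:
	bio_io_error(bio);
	return -ENODEV;
}
```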
> @@ -4573,6 +4573,8 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
> if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
> return;
>
> + set_bit(GD_DEAD, &ns->disk->state);
> +
> blk_set_queue_dying(ns->queue);
> nvme_start_ns_queue(ns);
On 2/15/22 20:17, Christoph Hellwig wrote:
> On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
>> After the surprise removal of a mounted NVMe disk the pciehp task
>> reliably hangs forever with a trace similar to this one:
>
> Do you have a specific reproducer? At least with doing a
>
> echo 1 > /sys/.../remove
>
> while running fsx on a file system I can't actually reproduce it.
You should be able to reproduce it by doing a PCI hotplug from qemu.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
On Tue, Feb 15, 2022 at 08:17:31PM +0100, Christoph Hellwig wrote:
> On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
> > After the surprise removal of a mounted NVMe disk the pciehp task
> > reliably hangs forever with a trace similar to this one:
>
> Do you have a specific reproducer? At least with doing a
>
> echo 1 > /sys/.../remove
>
> while running fsx on a file system I can't actually reproduce it.
We built our own enclosures with a custom connector to plug the disks.
So an external enclosure for thunderbolt is probably very similar.
(or just ripping an unscrewed NVMe out of the M.2 ...)
But as already suggested, qemu might also be very useful here as it also
allows us to test multiple namespaces and multipath I/O, if you/someone
wants to check those too (hotplug with multipath I/O really scares me).
On 2/16/22 12:18, Markus Blöchl wrote:
> On Tue, Feb 15, 2022 at 08:17:31PM +0100, Christoph Hellwig wrote:
>> On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
>>> After the surprise removal of a mounted NVMe disk the pciehp task
>>> reliably hangs forever with a trace similar to this one:
>>
>> Do you have a specific reproducer? At least with doing a
>>
>> echo 1 > /sys/.../remove
>>
>> while running fsx on a file system I can't actually reproduce it.
>
> We built our own enclosures with a custom connector to plug the disks.
>
> So an external enclosure for thunderbolt is probably very similar.
> (or just ripping an unscrewed NVMe out of the M.2 ...)
>
> But as already suggested, qemu might also be very useful here as it also
> allows us to test multiple namespaces and multipath I/O, if you/someone
> wants to check those too (hotplug with multipath I/O really scares me).
>
Nothing to be scared of.
I've tested this extensively in the run-up to commit 5396fdac56d8
("nvme: fix refcounting imbalance when all paths are down"), which,
incidentally, is something you need if you want to test things.
Let me see if I can dig up the testbed.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
On Tue, Feb 15, 2022 at 09:17:38PM +0100, Christoph Hellwig wrote:
> So, I think in the short run setting GD_DEAD is the right thing as a
> non-invasive fix, but I'd rather do it in blk_set_queue_dying to also
> cover the other drivers with a similar pattern. blk_set_queue_dying
> is never used for cases where q->disk isn't valid so that should be
> fine.
blk_set_queue_dying() is called from blk_cleanup_queue(), which in turn
is called for all kinds of queues without associated disk, even ones
with failed allocations.
But according to disk_release() a disk which was once associated should
always outlive its queue, thus q->disk should indeed always be safe.
It's a little unfortunate that the existing doc on request queues was
so outdated that it had to be removed.
So I was trying to figure out if blk_queue_dying(q) should always imply
(!q->disk || test_bit(GD_DEAD, &q->disk->state)).
Assuming that there is a 1:1 relationship between q and q->disk and that
blk_queue_dying(q) was previously used where GD_DEAD is now, I do hope
so, since GD_DEAD should just be a weaker form of QUEUE_FLAG_DYING to
allow killing the disk way before the queue.
>
> In the long run I think we just need to remove the fsync_bdev in
> del_gendisk or at least make it conditional.
But wouldn't that require you to pass a flag like SURPRISE_REMOVAL into
del_gendisk and pass it along through delete_partition()?
You would still need GD_DEAD since someone might remove the disk while
you are syncing, so this only sounds like an error-prone optimization to
me. After all, syncing on graceful removal seems to be a sensible thing
to do.
Fiddling with qemu will probably take me too long unless Hannes already
has something up his sleeve. So I'll submit V2 once I've regained access
to the test setup.
It will probably just look something like this:
```
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -286,6 +286,8 @@ void blk_queue_start_drain(struct request_queue *q)
void blk_set_queue_dying(struct request_queue *q)
{
+ if (q->disk)
+ set_bit(GD_DEAD, &q->disk->state);
blk_queue_flag_set(QUEUE_FLAG_DYING, q);
blk_queue_start_drain(q);
}
```
Thanks for all your input,
Markus
I'd do something like this, which gets us a properly documented
interface (the del_gendisk change will be split into a separate
patch):
diff --git a/block/blk-core.c b/block/blk-core.c
index d93e3bb9a769b..15d5c5ba5bbe5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -284,12 +284,19 @@ void blk_queue_start_drain(struct request_queue *q)
wake_up_all(&q->mq_freeze_wq);
}
-void blk_set_queue_dying(struct request_queue *q)
+/**
+ * blk_mark_disk_dead - mark a disk as dead
+ * @disk: disk to mark as dead
+ *
+ * Mark the disk as dead (e.g. surprise removed) and don't accept any new I/O
+ * to this disk.
+ */
+void blk_mark_disk_dead(struct gendisk *disk)
{
- blk_queue_flag_set(QUEUE_FLAG_DYING, q);
- blk_queue_start_drain(q);
+ set_bit(GD_DEAD, &disk->state);
+ blk_queue_start_drain(disk->queue);
}
-EXPORT_SYMBOL_GPL(blk_set_queue_dying);
+EXPORT_SYMBOL_GPL(blk_mark_disk_dead);
/**
* blk_cleanup_queue - shutdown a request queue
@@ -308,7 +315,8 @@ void blk_cleanup_queue(struct request_queue *q)
WARN_ON_ONCE(blk_queue_registered(q));
/* mark @q DYING, no new request or merges will be allowed afterwards */
- blk_set_queue_dying(q);
+ blk_queue_flag_set(QUEUE_FLAG_DYING, q);
+ blk_queue_start_drain(q);
blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);
blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
diff --git a/block/genhd.c b/block/genhd.c
index 626c8406f21a6..35e49b7f1112a 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -584,7 +584,8 @@ void del_gendisk(struct gendisk *disk)
blk_drop_partitions(disk);
mutex_unlock(&disk->open_mutex);
- fsync_bdev(disk->part0);
+ if (!test_bit(GD_DEAD, &disk->state))
+ fsync_bdev(disk->part0);
__invalidate_device(disk->part0, true);
/*
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index e6005c2323281..2b588b62cbbb2 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -4112,7 +4112,7 @@ static void mtip_pci_remove(struct pci_dev *pdev)
"Completion workers still active!\n");
}
- blk_set_queue_dying(dd->queue);
+ blk_mark_disk_dead(dd->disk);
set_bit(MTIP_DDF_REMOVE_PENDING_BIT, &dd->dd_flag);
/* Clean up the block layer. */
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4203cdab8abfd..b844432bad20b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -7185,7 +7185,7 @@ static ssize_t do_rbd_remove(struct bus_type *bus,
* IO to complete/fail.
*/
blk_mq_freeze_queue(rbd_dev->disk->queue);
- blk_set_queue_dying(rbd_dev->disk->queue);
+ blk_mark_disk_dead(rbd_dev->disk);
}
del_gendisk(rbd_dev->disk);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ccd0dd0c6b83c..ca71a0585333f 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2126,7 +2126,7 @@ static void blkfront_closing(struct blkfront_info *info)
/* No more blkif_request(). */
blk_mq_stop_hw_queues(info->rq);
- blk_set_queue_dying(info->rq);
+ blk_mark_disk_dead(info->gd);
set_capacity(info->gd, 0);
for_each_rinfo(info, rinfo, i) {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dcbd6d201619d..997ace47bbd54 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2077,7 +2077,7 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
set_bit(DMF_FREEING, &md->flags);
spin_unlock(&_minor_lock);
- blk_set_queue_dying(md->queue);
+ blk_mark_disk_dead(md->disk);
/*
* Take suspend_lock so that presuspend and postsuspend methods
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 79005ea1a33e3..469f23186159c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4574,7 +4574,7 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
return;
- blk_set_queue_dying(ns->queue);
+ blk_mark_disk_dead(ns->disk);
nvme_start_ns_queue(ns);
set_capacity_and_notify(ns->disk, 0);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f8bf6606eb2fc..ff775235534cf 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -848,7 +848,7 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
{
if (!head->disk)
return;
- blk_set_queue_dying(head->disk->queue);
+ blk_mark_disk_dead(head->disk);
/* make sure all pending bios are cleaned up */
kblockd_schedule_work(&head->requeue_work);
flush_work(&head->requeue_work);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f35aea98bc351..16b47035e4b06 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -748,7 +748,8 @@ extern bool blk_queue_can_use_dma_map_merging(struct request_queue *q,
bool __must_check blk_get_queue(struct request_queue *);
extern void blk_put_queue(struct request_queue *);
-extern void blk_set_queue_dying(struct request_queue *);
+
+void blk_mark_disk_dead(struct gendisk *disk);
#ifdef CONFIG_BLOCK
/*