2020-08-11 13:45:29

by Xianting Tian

Subject: [PATCH] block: don't read block device if it's invalid

We found several processes stuck in 'D' state after an nvme device was
hot-removed. The call traces are below: process 848 acquired the lock
'bdev->bd_mutex' in blkdev_reread_part(), then was scheduled out waiting
for IO to complete. That IO will never complete because the device has
been hot-removed, so 'bdev->bd_mutex' is never unlocked. As a result,
every other process that needs the same lock blocks on it.

When an nvme device is hot-removed, the kernel starts a worker thread to
handle the removal, as the call trace of process 1111504 below shows.
The relevant part of the nvme_kill_queues() path is listed here:
'NVME_NS_DEAD' gets set, and when nvme_revalidate_disk() runs and finds
'NVME_NS_DEAD' set, it calls 'set_capacity(disk, 0)' to set the disk
capacity to 0.
nvme_kill_queues()
  if (test_and_set_bit(NVME_NS_DEAD, &ns->flags)) return;
  revalidate_disk(disk)
    disk->fops->revalidate_disk(disk)  <== for an nvme device this is nvme_revalidate_disk()
    mutex_lock(&bdev->bd_mutex)
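
For context, a simplified sketch of revalidate_disk() as it looked in the
4.x-era kernels these traces come from (trimmed from fs/block_dev.c, not
the exact upstream code). The driver callback that zeroes the capacity
runs first; only afterwards does the function take bd_mutex to propagate
the size change, which is where process 1111504 gets stuck:

/* Simplified sketch of revalidate_disk() (4.x era, trimmed): the driver's
 * ->revalidate_disk() callback -- nvme_revalidate_disk() here -- runs first
 * and may call set_capacity(disk, 0); only then is bd_mutex taken to push
 * the new size into the block device inode.  If another task already holds
 * bd_mutex while waiting for IO that will never complete, this blocks.
 */
int revalidate_disk(struct gendisk *disk)
{
        struct block_device *bdev;
        int ret = 0;

        if (disk->fops->revalidate_disk)
                ret = disk->fops->revalidate_disk(disk);

        bdev = bdget_disk(disk, 0);
        if (!bdev)
                return ret;

        mutex_lock(&bdev->bd_mutex);            /* <== process 1111504 waits here */
        check_disk_size_change(disk, bdev);     /* updates the bdev inode size */
        bdev->bd_invalidated = 0;
        mutex_unlock(&bdev->bd_mutex);
        bdput(bdev);
        return ret;
}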

This patch reduces the probability of hitting the problem. Before taking
the lock 'bdev->bd_mutex' in blkdev_reread_part(), check whether the
disk capacity is 0 and, if so, return immediately. That avoids the
scenario above: once the nvme device has been hot-removed and its
capacity set to 0, a process like 848 that tries to reread the device
returns early from blkdev_reread_part() instead of taking
'bdev->bd_mutex' and then being unable to release it because its IO can
never complete.
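
For reference, get_capacity()/set_capacity() are thin helpers over the
whole-disk partition's sector count (simplified here from
include/linux/genhd.h of this era), which is why a zero capacity is a
cheap, lock-free hint that the hot-removal path has already run:

/* Simplified from include/linux/genhd.h (kernels of this era): the disk
 * capacity is just the sector count of the whole-disk partition, which
 * the nvme removal path zeroes via set_capacity(disk, 0).
 */
static inline sector_t get_capacity(struct gendisk *disk)
{
        return disk->part0.nr_sects;
}

static inline void set_capacity(struct gendisk *disk, sector_t size)
{
        disk->part0.nr_sects = size;
}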

cat /proc/848/stack
[<ffffffffb14b9d56>] io_schedule+0x16/0x40
[<ffffffffb15b009e>] do_read_cache_page+0x3ee/0x5e0
[<ffffffffb15b02a5>] read_cache_page+0x15/0x20
[<ffffffffb17810ad>] read_dev_sector+0x2d/0xa0
[<ffffffffb17856e4>] read_lba+0x104/0x1c0
[<ffffffffb1785d1a>] find_valid_gpt+0xfa/0x720
[<ffffffffb17863c9>] efi_partition+0x89/0x430
[<ffffffffb1783280>] check_partition+0x100/0x1f0
[<ffffffffb17819e4>] rescan_partitions+0xb4/0x360
[<ffffffffb177d544>] __blkdev_reread_part+0x64/0x70
[<ffffffffb177d573>] blkdev_reread_part+0x23/0x40 <<==mutex_lock(&bdev->bd_mutex);
[<ffffffffb177e00b>] blkdev_ioctl+0x44b/0x8e0
[<ffffffffb1692dc1>] block_ioctl+0x41/0x50
[<ffffffffb1668d37>] do_vfs_ioctl+0xa7/0x5e0
[<ffffffffb16692e9>] SyS_ioctl+0x79/0x90
[<ffffffffb1bd6001>] entry_SYSCALL_64_fastpath+0x1f/0xb9
[<ffffffffffffffff>] 0xffffffffffffffff
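
As an aside (not part of the patch): the blkdev_ioctl()/blkdev_reread_part()
frames in this trace are reached via the BLKRRPART ioctl, so a minimal
userspace program that exercises the same entry path looks roughly like this:

/* Minimal illustration only (not from the patch): issue BLKRRPART against
 * a block device, which is the ioctl path that ends up in
 * blkdev_reread_part() -- the same route process 848 took above.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>

int main(int argc, char **argv)
{
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <block device>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (ioctl(fd, BLKRRPART) < 0)   /* ask the kernel to rescan partitions */
                perror("BLKRRPART");

        close(fd);
        return 0;
}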

cat /proc/1111504/stack
[<ffffffffb1692fb9>] revalidate_disk+0x49/0x80 <<==mutex_lock(&bdev->bd_mutex);
[<ffffffffc0112ff2>] nvme_kill_queues+0x52/0x80 [nvme_core]
[<ffffffffc0116304>] nvme_remove_namespaces+0x44/0x50 [nvme_core]
[<ffffffffc0050805>] nvme_remove+0x85/0x130 [nvme]
[<ffffffffb17e9a69>] pci_device_remove+0x39/0xc0
[<ffffffffb18f9f91>] device_release_driver_internal+0x141/0x210
[<ffffffffb18fa072>] device_release_driver+0x12/0x20
[<ffffffffb17e1b3c>] pci_stop_bus_device+0x8c/0xa0
[<ffffffffb17e1cd2>] pci_stop_and_remove_bus_device+0x12/0x20
[<ffffffffb17fdeaa>] pciehp_unconfigure_device+0x7a/0x1e0
[<ffffffffb17fd9b2>] pciehp_disable_slot+0x52/0xd0
[<ffffffffb17fdaba>] pciehp_power_thread+0x8a/0xb0
[<ffffffffb14a556e>] process_one_work+0x14e/0x370
[<ffffffffb14a601d>] worker_thread+0x4d/0x3f0
[<ffffffffb14abd09>] kthread+0x109/0x140
[<ffffffffb1bd627a>] ret_from_fork+0x2a/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

cat /proc/1197767/stack
[<ffffffffb169444e>] __blkdev_get+0x6e/0x450 <<==mutex_lock_nested(&bdev->bd_mutex, for_part);
[<ffffffffb16949d4>] blkdev_get+0x1a4/0x300
[<ffffffffb1694f1a>] blkdev_open+0x7a/0xa0
[<ffffffffb164f93f>] do_dentry_open+0x20f/0x330
[<ffffffffb1650e10>] vfs_open+0x50/0x70
[<ffffffffb16621f8>] path_openat+0x548/0x13b0
[<ffffffffb1664d71>] do_filp_open+0x91/0x100
[<ffffffffb16511e4>] do_sys_open+0x124/0x210
[<ffffffffb16512ee>] SyS_open+0x1e/0x20
[<ffffffffb140396c>] do_syscall_64+0x6c/0x1b0
[<ffffffffb1bd60c0>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

ps -eo pid,comm,state | grep ' D'
848 systemd-udevd D
1111504 kworker/10:1 D
1197767 isdct D
1198830 isdct D
1580322 xxd D
1616804 kworker/10:0 D
1626264 isdct D
1734726 kworker/10:2 D
2197993 isdct D
2662117 xxd D
3083718 xxd D
3189834 xxd D

Signed-off-by: Xianting Tian <[email protected]>
---
block/ioctl.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index bdb3bbb..159bceb 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -94,6 +94,9 @@ static int blkdev_reread_part(struct block_device *bdev)
{
int ret;

+ if (unlikely(!get_capacity(bdev->bd_disk)))
+ return -EIO;
+
if (!disk_part_scan_enabled(bdev->bd_disk) || bdev != bdev->bd_contains)
return -EINVAL;
if (!capable(CAP_SYS_ADMIN))
--
1.8.3.1


2020-08-17 06:46:04

by Christoph Hellwig

Subject: Re: [PATCH] block: don't read block device if it's invalid

On Tue, Aug 11, 2020 at 09:43:06AM -0400, Xianting Tian wrote:
> We found several processes stuck in 'D' state after an nvme device was
> hot-removed. The call traces are below: process 848 acquired the lock
> 'bdev->bd_mutex' in blkdev_reread_part(), then was scheduled out waiting
> for IO to complete. That IO will never complete because the device has
> been hot-removed, so 'bdev->bd_mutex' is never unlocked. As a result,
> every other process that needs the same lock blocks on it.
>
> When an nvme device is hot-removed, the kernel starts a worker thread to
> handle the removal, as the call trace of process 1111504 below shows.
> The relevant part of the nvme_kill_queues() path is listed here:
> 'NVME_NS_DEAD' gets set, and when nvme_revalidate_disk() runs and finds
> 'NVME_NS_DEAD' set, it calls 'set_capacity(disk, 0)' to set the disk
> capacity to 0.
> nvme_kill_queues()
>   if (test_and_set_bit(NVME_NS_DEAD, &ns->flags)) return;
>   revalidate_disk(disk)
>     disk->fops->revalidate_disk(disk)  <== for an nvme device this is nvme_revalidate_disk()
>     mutex_lock(&bdev->bd_mutex)
>
> This patch reduces the probability of hitting the problem. Before taking
> the lock 'bdev->bd_mutex' in blkdev_reread_part(), check whether the
> disk capacity is 0 and, if so, return immediately. That avoids the
> scenario above: once the nvme device has been hot-removed and its
> capacity set to 0, a process like 848 that tries to reread the device
> returns early from blkdev_reread_part() instead of taking
> 'bdev->bd_mutex' and then being unable to release it because its IO can
> never complete.

We need to fix this for real; as you stated, at best this reduces the
window in which the race can happen.

I think our main problem is that due to bd_mutex we can't update the
block device size from arbitrary context. If we instead add an irqsave
spinlock just for the size we'd get rid of the limitation and can stop
papering over the problem. Give me a little time to try to do that.
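
A rough illustration of the direction described here, with hypothetical
names (an assumed bd_size_lock field in struct block_device and a
bdev_set_nr_sectors() helper), not the actual implementation: a spinlock
dedicated to the size would let the capacity be updated from any context
without touching bd_mutex.

/* Hypothetical sketch of the idea above, not actual kernel code: assume a
 * new spinlock_t bd_size_lock in struct block_device that guards only the
 * device size.  The size update then no longer depends on bd_mutex and can
 * be done from any context, including the hot-removal path.
 */
static void bdev_set_nr_sectors(struct block_device *bdev, sector_t sectors)
{
        unsigned long flags;

        spin_lock_irqsave(&bdev->bd_size_lock, flags);
        i_size_write(bdev->bd_inode, (loff_t)sectors << SECTOR_SHIFT);
        spin_unlock_irqrestore(&bdev->bd_size_lock, flags);
}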