2023-04-27 02:46:52

by Ming Lei

[permalink] [raw]
Subject: [ext4 io hang] buffered write io hang in balance_dirty_pages

Hello Guys,

I got one report in which buffered write IO hangs in balance_dirty_pages,
after one nvme block device is unplugged physically, then umount can't
succeed.

Turns out it is a long-standing issue that can be triggered at least
from v5.14 up to the latest v6.3.

And the issue can be reproduced reliably in KVM guest:

1) run the following script inside guest:

mkfs.ext4 -F /dev/nvme0n1
mount /dev/nvme0n1 /mnt
dd if=/dev/zero of=/mnt/z.img&
sleep 10
echo 1 > /sys/block/nvme0n1/device/device/remove

2) dd hang is observed and /dev/nvme0n1 is gone actually

[root@ktest-09 ~]# ps -ax | grep dd
1348 pts/0 D 0:33 dd if=/dev/zero of=/mnt/z.img
1365 pts/0 S+ 0:00 grep --color=auto dd

[root@ktest-09 ~]# cat /proc/1348/stack
[<0>] balance_dirty_pages+0x649/0x2500
[<0>] balance_dirty_pages_ratelimited_flags+0x4c6/0x5d0
[<0>] generic_perform_write+0x310/0x4c0
[<0>] ext4_buffered_write_iter+0x130/0x2c0 [ext4]
[<0>] new_sync_write+0x28e/0x4a0
[<0>] vfs_write+0x62a/0x920
[<0>] ksys_write+0xf9/0x1d0
[<0>] do_syscall_64+0x59/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd


[root@ktest-09 ~]# lsblk | grep nvme
[root@ktest-09 ~]#

BTW, my VM has 2G of RAM, and the nvme disk size is 40GB.

So far this is only observed on ext4; I haven't seen it on XFS. I guess it isn't
related to the disk type, though I haven't tried the test on other types of disks
yet, but will do.

Seems like dirty pages aren't cleaned after the ext4 bio fails in this
situation?


Thanks,
Ming


2023-04-27 04:02:26

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
> Hello Guys,
>
> I got one report in which buffered write IO hangs in balance_dirty_pages,
> after one nvme block device is unplugged physically, then umount can't
> succeed.

That's a feature, not a bug ... the dd should continue indefinitely?

balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
the dd process should succeed.

2023-04-27 05:03:06

by Ming Lei

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

Hello Matthew,

On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
> On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
> > Hello Guys,
> >
> > I got one report in which buffered write IO hangs in balance_dirty_pages,
> > after one nvme block device is unplugged physically, then umount can't
> > succeed.
>
> That's a feature, not a bug ... the dd should continue indefinitely?

Can you explain what the feature is? And not see such 'issue' or 'feature'
on xfs.

The device has been gone, so IMO it is reasonable to see FS buffered write IO
failed. Actually dmesg has shown that 'EXT4-fs (nvme0n1): Remounting
filesystem read-only'. Seems these things may confuse user.

>
> balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
> the dd process should succeed.

Yeah, dd can be killed, however it may be any application(s), :-)

Fortunately it won't cause trouble during reboot/power off, given
userspace will be killed at that time.



Thanks,
Ming

2023-04-27 06:48:27

by Baokun Li

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On 2023/4/27 12:50, Ming Lei wrote:
> Hello Matthew,
>
> On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
>> On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
>>> Hello Guys,
>>>
>>> I got one report in which buffered write IO hangs in balance_dirty_pages,
>>> after one nvme block device is unplugged physically, then umount can't
>>> succeed.
>> That's a feature, not a bug ... the dd should continue indefinitely?
> Can you explain what the feature is? And not see such 'issue' or 'feature'
> on xfs.
>
> The device has been gone, so IMO it is reasonable to see FS buffered write IO
> failed. Actually dmesg has shown that 'EXT4-fs (nvme0n1): Remounting
> filesystem read-only'. Seems these things may confuse user.


The reason for this difference is that ext4 and xfs handle errors
differently.

ext4 remounts the filesystem as read-only or even just continues,
vfs_write does not check for these.

xfs shuts down the filesystem, so it returns a failure at
xfs_file_write_iter when it finds an error.


``` ext4
ksys_write
 vfs_write
  ext4_file_write_iter
   ext4_buffered_write_iter
    ext4_write_checks
     file_modified
      file_modified_flags
       __file_update_time
        inode_update_time
         generic_update_time
          __mark_inode_dirty
           ext4_dirty_inode ---> 2. void func, No propagating errors out
            __ext4_journal_start_sb
             ext4_journal_check_start ---> 1. Error found, remount-ro
    generic_perform_write ---> 3. No error sensed, continue
     balance_dirty_pages_ratelimited
      balance_dirty_pages_ratelimited_flags
       balance_dirty_pages
        // 4. Sleeping waiting for dirty pages to be freed
        __set_current_state(TASK_KILLABLE)
        io_schedule_timeout(pause);
```

``` xfs
ksys_write
 vfs_write
  xfs_file_write_iter
   if (xfs_is_shutdown(ip->i_mount))
     return -EIO;    ---> dd fail
```
>> balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
>> the dd process should succeed.
> Yeah, dd can be killed, however it may be any application(s), :-)
>
> Fortunately it won't cause trouble during reboot/power off, given
> userspace will be killed at that time.
>
>
>
> Thanks,
> Ming
>
Don't worry about that, we always set the current thread to TASK_KILLABLE
while waiting in balance_dirty_pages().


--
With Best Regards,
Baokun Li

2023-04-27 07:36:48

by Baokun Li

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On 2023/4/27 14:36, Baokun Li wrote:
> On 2023/4/27 12:50, Ming Lei wrote:
>> Hello Matthew,
>>
>> On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
>>> On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
>>>> Hello Guys,
>>>>
>>>> I got one report in which buffered write IO hangs in
>>>> balance_dirty_pages,
>>>> after one nvme block device is unplugged physically, then umount can't
>>>> succeed.
>>> That's a feature, not a bug ... the dd should continue indefinitely?
>> Can you explain what the feature is? And not see such 'issue' or
>> 'feature'
>> on xfs.
>>
>> The device has been gone, so IMO it is reasonable to see FS buffered
>> write IO
>> failed. Actually dmesg has shown that 'EXT4-fs (nvme0n1): Remounting
>> filesystem read-only'. Seems these things may confuse user.
>
>
> The reason for this difference is that ext4 and xfs handle errors
> differently.
>
> ext4 remounts the filesystem as read-only or even just continues,
> vfs_write does not check for these.
>
> xfs shuts down the filesystem, so it returns a failure at
> xfs_file_write_iter when it finds an error.
>
>
> ``` ext4
> ksys_write
>  vfs_write
>   ext4_file_write_iter
>    ext4_buffered_write_iter
>     ext4_write_checks
>      file_modified
>       file_modified_flags
>        __file_update_time
>         inode_update_time
>          generic_update_time
>           __mark_inode_dirty
>            ext4_dirty_inode ---> 2. void func, No propagating errors out
>             __ext4_journal_start_sb
>              ext4_journal_check_start ---> 1. Error found, remount-ro
>     generic_perform_write ---> 3. No error sensed, continue
>      balance_dirty_pages_ratelimited
>       balance_dirty_pages_ratelimited_flags
>        balance_dirty_pages
>         // 4. Sleeping waiting for dirty pages to be freed
>         __set_current_state(TASK_KILLABLE)
>         io_schedule_timeout(pause);
> ```
>
> ``` xfs
> ksys_write
>  vfs_write
>   xfs_file_write_iter
>    if (xfs_is_shutdown(ip->i_mount))
>      return -EIO;    ---> dd fail
> ```
>>> balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
>>> the dd process should succeed.
>> Yeah, dd can be killed, however it may be any application(s), :-)
>>
>> Fortunately it won't cause trouble during reboot/power off, given
>> userspace will be killed at that time.
>>
>>
>>
>> Thanks,
>> Ming
>>
> Don't worry about that, we always set the current thread to TASK_KILLABLE
>
> while waiting in balance_dirty_pages().
>
>
On second thought, we can check whether the file system has become read-only
when ext4_file_write_iter() is called for a write, even though the fs was not
read-only when the file was opened.

This would end the write early and free up resources, like xfs does.
The patch is below, does anyone have any other thoughts?


diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d101b3b0c7da..d2966268ee41 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -699,6 +699,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)

        if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
                return -EIO;
+       if (unlikely(sb_rdonly(inode->i_sb)))
+               return -EROFS;

 #ifdef CONFIG_FS_DAX
        if (IS_DAX(inode))


--
With Best Regards,
Baokun Li

2023-04-27 10:18:03

by Ming Lei

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Thu, Apr 27, 2023 at 02:36:51PM +0800, Baokun Li wrote:
> On 2023/4/27 12:50, Ming Lei wrote:
> > Hello Matthew,
> >
> > On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
> > > On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
> > > > Hello Guys,
> > > >
> > > > I got one report in which buffered write IO hangs in balance_dirty_pages,
> > > > after one nvme block device is unplugged physically, then umount can't
> > > > succeed.
> > > That's a feature, not a bug ... the dd should continue indefinitely?
> > Can you explain what the feature is? And not see such 'issue' or 'feature'
> > on xfs.
> >
> > The device has been gone, so IMO it is reasonable to see FS buffered write IO
> > failed. Actually dmesg has shown that 'EXT4-fs (nvme0n1): Remounting
> > filesystem read-only'. Seems these things may confuse user.
>
>
> The reason for this difference is that ext4 and xfs handle errors
> differently.
>
> ext4 remounts the filesystem as read-only or even just continues, vfs_write
> does not check for these.

vfs_write may not find anything wrong, but ext4 remount could see that
disk is gone, which might happen during or after remount, however.

>
> xfs shuts down the filesystem, so it returns a failure at
> xfs_file_write_iter when it finds an error.
>
>
> ``` ext4
> ksys_write
>  vfs_write
>   ext4_file_write_iter
>    ext4_buffered_write_iter
>     ext4_write_checks
>      file_modified
>       file_modified_flags
>        __file_update_time
>         inode_update_time
>          generic_update_time
>           __mark_inode_dirty
>            ext4_dirty_inode ---> 2. void func, No propagating errors out
>             __ext4_journal_start_sb
>              ext4_journal_check_start ---> 1. Error found, remount-ro
>     generic_perform_write ---> 3. No error sensed, continue
>      balance_dirty_pages_ratelimited
>       balance_dirty_pages_ratelimited_flags
>        balance_dirty_pages
>         // 4. Sleeping waiting for dirty pages to be freed
>         __set_current_state(TASK_KILLABLE)
>         io_schedule_timeout(pause);
> ```
>
> ``` xfs
> ksys_write
>  vfs_write
>   xfs_file_write_iter
>    if (xfs_is_shutdown(ip->i_mount))
>      return -EIO;    ---> dd fail
> ```

Thanks for the info which is really helpful for me to understand the
problem.

> > > balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
> > > the dd process should succeed.
> > Yeah, dd can be killed, however it may be any application(s), :-)
> >
> > Fortunately it won't cause trouble during reboot/power off, given
> > userspace will be killed at that time.
> >
> >
> >
> > Thanks,
> > Ming
> >
> Don't worry about that, we always set the current thread to TASK_KILLABLE
>
> while waiting in balance_dirty_pages().

I have another concern: if 'dd' isn't killed, dirty pages won't be cleaned, and
this (large amount of) memory becomes unusable; a typical scenario could be a
USB HDD being unplugged.


thanks,
Ming

2023-04-27 11:24:22

by Baokun Li

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On 2023/4/27 18:01, Ming Lei wrote:
> On Thu, Apr 27, 2023 at 02:36:51PM +0800, Baokun Li wrote:
>> On 2023/4/27 12:50, Ming Lei wrote:
>>> Hello Matthew,
>>>
>>> On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
>>>> On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
>>>>> Hello Guys,
>>>>>
>>>>> I got one report in which buffered write IO hangs in balance_dirty_pages,
>>>>> after one nvme block device is unplugged physically, then umount can't
>>>>> succeed.
>>>> That's a feature, not a bug ... the dd should continue indefinitely?
>>> Can you explain what the feature is? And not see such 'issue' or 'feature'
>>> on xfs.
>>>
>>> The device has been gone, so IMO it is reasonable to see FS buffered write IO
>>> failed. Actually dmesg has shown that 'EXT4-fs (nvme0n1): Remounting
>>> filesystem read-only'. Seems these things may confuse user.
>>
>> The reason for this difference is that ext4 and xfs handle errors
>> differently.
>>
>> ext4 remounts the filesystem as read-only or even just continues, vfs_write
>> does not check for these.
> vfs_write may not find anything wrong, but ext4 remount could see that
> disk is gone, which might happen during or after remount, however.
>
>> xfs shuts down the filesystem, so it returns a failure at
>> xfs_file_write_iter when it finds an error.
>>
>>
>> ``` ext4
>> ksys_write
>>  vfs_write
>>   ext4_file_write_iter
>>    ext4_buffered_write_iter
>>     ext4_write_checks
>>      file_modified
>>       file_modified_flags
>>        __file_update_time
>>         inode_update_time
>>          generic_update_time
>>           __mark_inode_dirty
>>            ext4_dirty_inode ---> 2. void func, No propagating errors out
>>             __ext4_journal_start_sb
>>              ext4_journal_check_start ---> 1. Error found, remount-ro
>>     generic_perform_write ---> 3. No error sensed, continue
>>      balance_dirty_pages_ratelimited
>>       balance_dirty_pages_ratelimited_flags
>>        balance_dirty_pages
>>         // 4. Sleeping waiting for dirty pages to be freed
>>         __set_current_state(TASK_KILLABLE)
>>         io_schedule_timeout(pause);
>> ```
>>
>> ``` xfs
>> ksys_write
>>  vfs_write
>>   xfs_file_write_iter
>>    if (xfs_is_shutdown(ip->i_mount))
>>      return -EIO;    ---> dd fail
>> ```
> Thanks for the info which is really helpful for me to understand the
> problem.
>
>>>> balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
>>>> the dd process should succeed.
>>> Yeah, dd can be killed, however it may be any application(s), :-)
>>>
>>> Fortunately it won't cause trouble during reboot/power off, given
>>> userspace will be killed at that time.
>>>
>>>
>>>
>>> Thanks,
>>> Ming
>>>
>> Don't worry about that, we always set the current thread to TASK_KILLABLE
>>
>> while waiting in balance_dirty_pages().
> I have another concern, if 'dd' isn't killed, dirty pages won't be cleaned, and
> these (big amount)memory becomes not usable, and typical scenario could be USB HDD
> unplugged.
>
>
> thanks,
> Ming
Yes, it is unreasonable to continue writing data with the previously opened fd
after the file system becomes read-only, resulting in dirty page accumulation.

I provided a patch in another reply.
Could you help test if it can solve your problem?
If it can indeed solve your problem, I will officially send it to the email list.

--
With Best Regards,
Baokun Li

2023-04-27 11:52:07

by Ming Lei

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Thu, Apr 27, 2023 at 07:19:35PM +0800, Baokun Li wrote:
> On 2023/4/27 18:01, Ming Lei wrote:
> > On Thu, Apr 27, 2023 at 02:36:51PM +0800, Baokun Li wrote:
> > > On 2023/4/27 12:50, Ming Lei wrote:
> > > > Hello Matthew,
> > > >
> > > > On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
> > > > > On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
> > > > > > Hello Guys,
> > > > > >
> > > > > > I got one report in which buffered write IO hangs in balance_dirty_pages,
> > > > > > after one nvme block device is unplugged physically, then umount can't
> > > > > > succeed.
> > > > > That's a feature, not a bug ... the dd should continue indefinitely?
> > > > Can you explain what the feature is? And not see such 'issue' or 'feature'
> > > > on xfs.
> > > >
> > > > The device has been gone, so IMO it is reasonable to see FS buffered write IO
> > > > failed. Actually dmesg has shown that 'EXT4-fs (nvme0n1): Remounting
> > > > filesystem read-only'. Seems these things may confuse user.
> > >
> > > The reason for this difference is that ext4 and xfs handle errors
> > > differently.
> > >
> > > ext4 remounts the filesystem as read-only or even just continues, vfs_write
> > > does not check for these.
> > vfs_write may not find anything wrong, but ext4 remount could see that
> > disk is gone, which might happen during or after remount, however.
> >
> > > xfs shuts down the filesystem, so it returns a failure at
> > > xfs_file_write_iter when it finds an error.
> > >
> > >
> > > ``` ext4
> > > ksys_write
> > >  vfs_write
> > >   ext4_file_write_iter
> > >    ext4_buffered_write_iter
> > >     ext4_write_checks
> > >      file_modified
> > >       file_modified_flags
> > >        __file_update_time
> > >         inode_update_time
> > >          generic_update_time
> > >           __mark_inode_dirty
> > >            ext4_dirty_inode ---> 2. void func, No propagating errors out
> > >             __ext4_journal_start_sb
> > >              ext4_journal_check_start ---> 1. Error found, remount-ro
> > >     generic_perform_write ---> 3. No error sensed, continue
> > >      balance_dirty_pages_ratelimited
> > >       balance_dirty_pages_ratelimited_flags
> > >        balance_dirty_pages
> > >         // 4. Sleeping waiting for dirty pages to be freed
> > >         __set_current_state(TASK_KILLABLE)
> > >         io_schedule_timeout(pause);
> > > ```
> > >
> > > ``` xfs
> > > ksys_write
> > >  vfs_write
> > >   xfs_file_write_iter
> > >    if (xfs_is_shutdown(ip->i_mount))
> > >      return -EIO;    ---> dd fail
> > > ```
> > Thanks for the info which is really helpful for me to understand the
> > problem.
> >
> > > > > balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
> > > > > the dd process should succeed.
> > > > Yeah, dd can be killed, however it may be any application(s), :-)
> > > >
> > > > Fortunately it won't cause trouble during reboot/power off, given
> > > > userspace will be killed at that time.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > > Ming
> > > >
> > > Don't worry about that, we always set the current thread to TASK_KILLABLE
> > >
> > > while waiting in balance_dirty_pages().
> > I have another concern, if 'dd' isn't killed, dirty pages won't be cleaned, and
> > these (big amount)memory becomes not usable, and typical scenario could be USB HDD
> > unplugged.
> >
> >
> > thanks,
> > Ming
> Yes, it is unreasonable to continue writing data with the previously opened
> fd after
> the file system becomes read-only, resulting in dirty page accumulation.
>
> I provided a patch in another reply.
> Could you help test if it can solve your problem?
> If it can indeed solve your problem, I will officially send it to the email
> list.

OK, I will test it tomorrow.

But I am afraid it may not avoid the issue completely, because the old write
task hanging in balance_dirty_pages() may still write/dirty pages if it is one
very large write IO.

Thanks,
Ming

2023-04-27 23:35:21

by Dave Chinner

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
> Hello Guys,
>
> I got one report in which buffered write IO hangs in balance_dirty_pages,
> after one nvme block device is unplugged physically, then umount can't
> succeed.

The bug here is that the device unplug code has not told the
filesystem that it's gone away permanently.

This is the same problem we've been having for the past 15 years -
when block device goes away permanently it leaves the filesystem and
everything else dependent on the block device completely unaware
that they are unable to function anymore. IOWs, the block
device remove path is relying on -unreliable side effects- of
filesystem IO error handling to produce what we'd call "correct
behaviour".

The block device needs to be shutting down the filesystem when it
has some sort of fatal, unrecoverable error like this (e.g. hot
unplug). We have the XFS_IOC_GOINGDOWN ioctl for telling the
filesystem it can't function anymore. This ioctl
(_IOR('X',125,__u32)) has also been replicated into ext4, f2fs and
CIFS and it gets exercised heavily by fstests. Hence this isn't XFS
specific functionality, nor is it untested functionality.

The ioctl should be lifted to the VFS as FS_IOC_SHUTDOWN and a
super_operations method added to trigger a filesystem shutdown.
That way the block device removal code could simply call
sb->s_ops->shutdown(sb, REASON) if it exists rather than
sync_filesystem(sb) if there's a superblock associated with the
block device. Then all these

This way we won't have to spend another two decades of people
complaining about how applications and filesystems hang when they
pull the storage device out from under them and the filesystem
didn't do something that made it notice before the system hung....
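
A rough sketch of what that could look like — the ->shutdown method and the
shutdown_filesystem_on_bdev() helper here are hypothetical illustrations, not
existing code; only get_super()/drop_super()/sync_filesystem() are existing
helpers:

``` c
#include <linux/fs.h>
#include <linux/blkdev.h>

/*
 * Hypothetical sketch only, not existing kernel code: assumes a new
 * super_operations method
 *
 *	void (*shutdown)(struct super_block *sb);
 *
 * (possibly with a REASON argument as suggested above), called by the
 * block layer when the device has gone away for good.
 */
static void shutdown_filesystem_on_bdev(struct block_device *bdev)
{
	struct super_block *sb = get_super(bdev);	/* NULL if nothing is mounted */

	if (!sb)
		return;
	if (sb->s_op->shutdown)
		sb->s_op->shutdown(sb);		/* fs marks itself dead, fails further I/O */
	else
		sync_filesystem(sb);		/* today's best-effort fallback */
	drop_super(sb);
}
```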

> So far only observed on ext4 FS, not see it on XFS.

Pure dumb luck - a journal IO failed on XFS (probably during the
sync_filesystem() call) and that shut the filesystem down.

> I guess it isn't
> related with disk type, and not tried such test on other type of disks yet,
> but will do.

It can happen on any block device based storage that gets pulled
from under any filesystem without warning.

> Seems like dirty pages aren't cleaned after ext4 bio is failed in this
> situation?

Yes, because the filesystem wasn't shut down on device removal to
tell it that it's allowed to toss away dirty pages as they cannot be
cleaned via the IO path....

-Dave.
--
Dave Chinner
[email protected]

2023-04-28 02:02:20

by Ming Lei

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Thu, Apr 27, 2023 at 07:27:04PM +0800, Ming Lei wrote:
> On Thu, Apr 27, 2023 at 07:19:35PM +0800, Baokun Li wrote:
> > On 2023/4/27 18:01, Ming Lei wrote:
> > > On Thu, Apr 27, 2023 at 02:36:51PM +0800, Baokun Li wrote:
> > > > On 2023/4/27 12:50, Ming Lei wrote:
> > > > > Hello Matthew,
> > > > >
> > > > > On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
> > > > > > On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
> > > > > > > Hello Guys,
> > > > > > >
> > > > > > > I got one report in which buffered write IO hangs in balance_dirty_pages,
> > > > > > > after one nvme block device is unplugged physically, then umount can't
> > > > > > > succeed.
> > > > > > That's a feature, not a bug ... the dd should continue indefinitely?
> > > > > Can you explain what the feature is? And not see such 'issue' or 'feature'
> > > > > on xfs.
> > > > >
> > > > > The device has been gone, so IMO it is reasonable to see FS buffered write IO
> > > > > failed. Actually dmesg has shown that 'EXT4-fs (nvme0n1): Remounting
> > > > > filesystem read-only'. Seems these things may confuse user.
> > > >
> > > > The reason for this difference is that ext4 and xfs handle errors
> > > > differently.
> > > >
> > > > ext4 remounts the filesystem as read-only or even just continues, vfs_write
> > > > does not check for these.
> > > vfs_write may not find anything wrong, but ext4 remount could see that
> > > disk is gone, which might happen during or after remount, however.
> > >
> > > > xfs shuts down the filesystem, so it returns a failure at
> > > > xfs_file_write_iter when it finds an error.
> > > >
> > > >
> > > > ``` ext4
> > > > ksys_write
> > > >  vfs_write
> > > >   ext4_file_write_iter
> > > >    ext4_buffered_write_iter
> > > >     ext4_write_checks
> > > >      file_modified
> > > >       file_modified_flags
> > > >        __file_update_time
> > > >         inode_update_time
> > > >          generic_update_time
> > > >           __mark_inode_dirty
> > > >            ext4_dirty_inode ---> 2. void func, No propagating errors out
> > > >             __ext4_journal_start_sb
> > > >              ext4_journal_check_start ---> 1. Error found, remount-ro
> > > >     generic_perform_write ---> 3. No error sensed, continue
> > > >      balance_dirty_pages_ratelimited
> > > >       balance_dirty_pages_ratelimited_flags
> > > >        balance_dirty_pages
> > > >         // 4. Sleeping waiting for dirty pages to be freed
> > > >         __set_current_state(TASK_KILLABLE)
> > > >         io_schedule_timeout(pause);
> > > > ```
> > > >
> > > > ``` xfs
> > > > ksys_write
> > > >  vfs_write
> > > >   xfs_file_write_iter
> > > >    if (xfs_is_shutdown(ip->i_mount))
> > > >      return -EIO;    ---> dd fail
> > > > ```
> > > Thanks for the info which is really helpful for me to understand the
> > > problem.
> > >
> > > > > > balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
> > > > > > the dd process should succeed.
> > > > > Yeah, dd can be killed, however it may be any application(s), :-)
> > > > >
> > > > > Fortunately it won't cause trouble during reboot/power off, given
> > > > > userspace will be killed at that time.
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Ming
> > > > >
> > > > Don't worry about that, we always set the current thread to TASK_KILLABLE
> > > >
> > > > while waiting in balance_dirty_pages().
> > > I have another concern, if 'dd' isn't killed, dirty pages won't be cleaned, and
> > > these (big amount)memory becomes not usable, and typical scenario could be USB HDD
> > > unplugged.
> > >
> > >
> > > thanks,
> > > Ming
> > Yes, it is unreasonable to continue writing data with the previously opened
> > fd after
> > the file system becomes read-only, resulting in dirty page accumulation.
> >
> > I provided a patch in another reply.
> > Could you help test if it can solve your problem?
> > If it can indeed solve your problem, I will officially send it to the email
> > list.
>
> OK, I will test it tomorrow.

Your patch can avoid dd hang when bs is 512 at default, but if bs is
increased to 1G and more 'dd' tasks are started, the dd hang issue
still can be observed.

The reason should be the next paragraph I posted.

Another thing is that if remount read-only makes sense on one dead
disk? Yeah, block layer doesn't export such interface for querying
if bdev is dead. However, I think it is reasonable to export such
interface if FS needs that.

>
> But I am afraid if it can avoid the issue completely because the
> old write task hang in balance_dirty_pages() may still write/dirty pages
> if it is one very big size write IO.


thanks,
Ming

2023-04-28 03:10:08

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Fri, Apr 28, 2023 at 09:33:20AM +1000, Dave Chinner wrote:
> The block device needs to be shutting down the filesystem when it
> has some sort of fatal, unrecoverable error like this (e.g. hot
> unplug). We have the XFS_IOC_GOINGDOWN ioctl for telling the
> filesystem it can't function anymore. This ioctl
> (_IOR('X',125,__u32)) has also been replicated into ext4, f2fs and
> CIFS and it gets exercised heavily by fstests. Hence this isn't XFS
> specific functionality, nor is it untested functionality.
>
> The ioctl should be lifted to the VFS as FS_IOC_SHUTDOWN and a
> super_operations method added to trigger a filesystem shutdown.
> That way the block device removal code could simply call
> sb->s_ops->shutdown(sb, REASON) if it exists rather than
> sync_filesystem(sb) if there's a superblock associated with the
> block device. Then all these

I think this is the wrong approach. Not that I've had any time to
work on my alternative approach:

https://www.infradead.org/~willy/banbury.html

2023-04-28 04:03:53

by Baokun Li

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On 2023/4/28 9:41, Ming Lei wrote:
> On Thu, Apr 27, 2023 at 07:27:04PM +0800, Ming Lei wrote:
>> On Thu, Apr 27, 2023 at 07:19:35PM +0800, Baokun Li wrote:
>>> On 2023/4/27 18:01, Ming Lei wrote:
>>>> On Thu, Apr 27, 2023 at 02:36:51PM +0800, Baokun Li wrote:
>>>>> On 2023/4/27 12:50, Ming Lei wrote:
>>>>>> Hello Matthew,
>>>>>>
>>>>>> On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
>>>>>>> On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
>>>>>>>> Hello Guys,
>>>>>>>>
>>>>>>>> I got one report in which buffered write IO hangs in balance_dirty_pages,
>>>>>>>> after one nvme block device is unplugged physically, then umount can't
>>>>>>>> succeed.
>>>>>>> That's a feature, not a bug ... the dd should continue indefinitely?
>>>>>> Can you explain what the feature is? And not see such 'issue' or 'feature'
>>>>>> on xfs.
>>>>>>
>>>>>> The device has been gone, so IMO it is reasonable to see FS buffered write IO
>>>>>> failed. Actually dmesg has shown that 'EXT4-fs (nvme0n1): Remounting
>>>>>> filesystem read-only'. Seems these things may confuse user.
>>>>> The reason for this difference is that ext4 and xfs handle errors
>>>>> differently.
>>>>>
>>>>> ext4 remounts the filesystem as read-only or even just continues, vfs_write
>>>>> does not check for these.
>>>> vfs_write may not find anything wrong, but ext4 remount could see that
>>>> disk is gone, which might happen during or after remount, however.
>>>>
>>>>> xfs shuts down the filesystem, so it returns a failure at
>>>>> xfs_file_write_iter when it finds an error.
>>>>>
>>>>>
>>>>> ``` ext4
>>>>> ksys_write
>>>>>  vfs_write
>>>>>   ext4_file_write_iter
>>>>>    ext4_buffered_write_iter
>>>>>     ext4_write_checks
>>>>>      file_modified
>>>>>       file_modified_flags
>>>>>        __file_update_time
>>>>>         inode_update_time
>>>>>          generic_update_time
>>>>>           __mark_inode_dirty
>>>>>            ext4_dirty_inode ---> 2. void func, No propagating errors out
>>>>>             __ext4_journal_start_sb
>>>>>              ext4_journal_check_start ---> 1. Error found, remount-ro
>>>>>     generic_perform_write ---> 3. No error sensed, continue
>>>>>      balance_dirty_pages_ratelimited
>>>>>       balance_dirty_pages_ratelimited_flags
>>>>>        balance_dirty_pages
>>>>>         // 4. Sleeping waiting for dirty pages to be freed
>>>>>         __set_current_state(TASK_KILLABLE)
>>>>>         io_schedule_timeout(pause);
>>>>> ```
>>>>>
>>>>> ``` xfs
>>>>> ksys_write
>>>>>  vfs_write
>>>>>   xfs_file_write_iter
>>>>>    if (xfs_is_shutdown(ip->i_mount))
>>>>>      return -EIO;    ---> dd fail
>>>>> ```
>>>> Thanks for the info which is really helpful for me to understand the
>>>> problem.
>>>>
>>>>>>> balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
>>>>>>> the dd process should succeed.
>>>>>> Yeah, dd can be killed, however it may be any application(s), :-)
>>>>>>
>>>>>> Fortunately it won't cause trouble during reboot/power off, given
>>>>>> userspace will be killed at that time.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Ming
>>>>>>
>>>>> Don't worry about that, we always set the current thread to TASK_KILLABLE
>>>>>
>>>>> while waiting in balance_dirty_pages().
>>>> I have another concern, if 'dd' isn't killed, dirty pages won't be cleaned, and
>>>> these (big amount)memory becomes not usable, and typical scenario could be USB HDD
>>>> unplugged.
>>>>
>>>>
>>>> thanks,
>>>> Ming
>>> Yes, it is unreasonable to continue writing data with the previously opened
>>> fd after
>>> the file system becomes read-only, resulting in dirty page accumulation.
>>>
>>> I provided a patch in another reply.
>>> Could you help test if it can solve your problem?
>>> If it can indeed solve your problem, I will officially send it to the email
>>> list.
>> OK, I will test it tomorrow.
> Your patch can avoid dd hang when bs is 512 at default, but if bs is
> increased to 1G and more 'dd' tasks are started, the dd hang issue
> still can be observed.

Thank you for your testing!

Yes, my patch only prevents the adding of new dirty pages, but it doesn't clear
the dirty pages that already exist. The reason why it doesn't work after bs
grows is that there are already enough dirty pages to trigger
balance_dirty_pages(). Executing drop_caches at this point may make dd fail and
exit, but the dirty pages are still not cleared, and ext4 does not implement a
shutdown. These dirty pages will not be cleared until the filesystem is
unmounted.

This is the result of my test at bs=512:

ext4 -- remount read-only
OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
313872 313872 100%    0.10K   8048   39 32192K buffer_head   ---> wait to max
82602  82465  99%     0.10K   2118   39  8472K buffer_head   ---> kill dd && drop_caches
897    741    82%     0.10K     23   39    92K buffer_head   ---> umount

patched:
25233  25051  99%    0.10K    647     39     2588K buffer_head
>
> The reason should be the next paragraph I posted.
>
> Another thing is that if remount read-only makes sense on one dead
> disk? Yeah, block layer doesn't export such interface for querying
> if bdev is dead. However, I think it is reasonable to export such
> interface if FS needs that.
Ext4 just detects the I/O error and remounts the filesystem read-only; it
doesn't know whether the current disk is dead or not.

I asked Yu Kuai and he said that disk_live() can be used to determine whether a
disk has been removed, based on the status of the inode corresponding to the
block device, but this is generally not done in file systems.
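
For illustration only, a hedged sketch of how a filesystem could consult
disk_live() — the ext4_bdev_alive() wrapper below is hypothetical and, as said
above, filesystems generally don't do this:

``` c
#include <linux/fs.h>
#include <linux/blkdev.h>

/*
 * Hypothetical helper, illustration only: ask the block layer whether
 * the disk backing this superblock is still present. disk_live() (added
 * in August 2021) reports whether the backing bdev inode is still hashed,
 * which stops being true once del_gendisk() has run.
 */
static bool ext4_bdev_alive(struct super_block *sb)
{
	return disk_live(sb->s_bdev->bd_disk);
}
```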
>
>> But I am afraid if it can avoid the issue completely because the
>> old write task hang in balance_dirty_pages() may still write/dirty pages
>> if it is one very big size write IO.
>
> thanks,
> Ming
>
Those dirty pages that are already there are piling up and can't be written
back, which I think is a real problem. Can the block layer clear those dirty
pages when it detects that the disk is deleted?


--
With Best Regards,
Baokun Li

2023-04-28 05:55:51

by Dave Chinner

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Fri, Apr 28, 2023 at 03:56:13AM +0100, Matthew Wilcox wrote:
> On Fri, Apr 28, 2023 at 09:33:20AM +1000, Dave Chinner wrote:
> > The block device needs to be shutting down the filesystem when it
> > has some sort of fatal, unrecoverable error like this (e.g. hot
> > unplug). We have the XFS_IOC_GOINGDOWN ioctl for telling the
> > filesystem it can't function anymore. This ioctl
> > (_IOR('X',125,__u32)) has also been replicated into ext4, f2fs and
> > CIFS and it gets exercised heavily by fstests. Hence this isn't XFS
> > specific functionality, nor is it untested functionality.
> >
> > The ioctl should be lifted to the VFS as FS_IOC_SHUTDOWN and a
> > super_operations method added to trigger a filesystem shutdown.
> > That way the block device removal code could simply call
> > sb->s_ops->shutdown(sb, REASON) if it exists rather than
> > sync_filesystem(sb) if there's a superblock associated with the
> > block device. Then all these
>
> I think this is the wrong approach. Not that I've had any time to
> work on my alternative approach:
>
> https://www.infradead.org/~willy/banbury.html

While that looks interesting, I'm going to say straight out that
re-attaching a storage device that was hot-unplugged from under a
running filesystem and then resuming service as if nothing happened
is going to be both a filesystem corruption vector and a major
security hole.

The moment a mounted device is unexpectedly unplugged, it becomes an
untrusted device. We cannot trust that it's contents when plugged
back in are identical to when it was unplugged. I can't wait for
syzbot to learn that it can mount a filesystem, hot-unplug the
device, fuzz the image on the device and then plug it back in and
expect the filesystem to handle the in-memory vs on-disk
inconsistencies that result from the fuzzing. it's basically no
different to allowing someone to write to the block device while the
filesystem is mounted. You do that, you get to keep all the broken
bits to yourself.

Even without image tampering considerations, there's no guarantee
that the device caches are flushed properly when someone just pulls
the plug on a storage device. Hence even without tampering, we
cannot trust the image on the device to match what the in-memory
state of the filesystem expects it to be.

So, yeah, I just don't see this ever being something we'd support
with filesystems like XFS - there's just too much risk associated
with untrusted devices...

-Dave.
--
Dave Chinner
[email protected]

2023-04-28 06:02:17

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Fri, Apr 28, 2023 at 11:47:26AM +0800, Baokun Li wrote:
> Ext4 just detects I/O Error and remounts it as read-only, it doesn't know
> if the current disk is dead or not.
>
> I asked Yu Kuai and he said that disk_live() can be used to determine
> whether
> a disk has been removed based on the status of the inode corresponding to
> the block device, but this is generally not done in file systems.

What really needs to happen is that del_gendisk() needs to inform file
systems that the disk is gone, so that the file system can shutdown
the file system and tear everything down.

disk_live() is relatively new; it was added in August 2021. Back in
2015, I had added the following in fs/ext4/super.c:

/*
* The del_gendisk() function uninitializes the disk-specific data
* structures, including the bdi structure, without telling anyone
* else. Once this happens, any attempt to call mark_buffer_dirty()
* (for example, by ext4_commit_super), will cause a kernel OOPS.
* This is a kludge to prevent these oops until we can put in a proper
* hook in del_gendisk() to inform the VFS and file system layers.
*/
static int block_device_ejected(struct super_block *sb)
{
struct inode *bd_inode = sb->s_bdev->bd_inode;
struct backing_dev_info *bdi = inode_to_bdi(bd_inode);

return bdi->dev == NULL;
}

As the comment states, it's rather awkward to have the file system
check to see if the block device is dead in various places; the real
problem is that the block device shouldn't just *vanish*, with the
block device structures getting partially de-initialized, without the
block layer being polite enough to let the file system know.

> Those dirty pages that are already there are piling up and can't be
> written back, which I think is a real problem. Can the block layer
> clear those dirty pages when it detects that the disk is deleted?

Well, the dirty pages belong to the file system, and so it needs to be
up to the file system to clear out the dirty pages. But I'll also note that
the right thing to do when a disk gets removed is not necessarily
obvious.

For example, suppose some process has a file mmap'ed into its address
space, and that file is on the disk which the user has rudely yanked
out from their laptop; what is the right thing to do? Do we kill the
process? Do we let the process write to the mmap'ed region, and
silently let the modified data go *poof* when the process exits? What
if there is an executable file on the removable disk, and there are
one or more processes running that executable when the device
disappears? Do we kill the process? Do we let the process run until
it tries to access a page which hasn't been paged in and then kill the
process?

We should design a proper solution for What Should Happen when a
removable disk gets removed unceremoniously without unmounting the
file system first. It's not just a matter of making some tests go
green....

- Ted

2023-04-29 04:43:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Sat, Apr 29, 2023 at 11:16:14AM +0800, Ming Lei wrote:
> OK, looks both Dave and you have same suggestion, and IMO, it isn't hard to
> add one interface for notifying FS, and it can be either one s_ops->shutdown()
> or shutdown_filesystem(struct super_block *sb).

It's not that simple. You need to be able to do that for any device used
by a file system, not just s_bdev. This means it needs to go into ops
passed by the bdev owner, which is also needed to propagate this through
stackable devices.

I have some work on that, but the way how blkdev_get is called in the
generic mount helpers is a such a mess that I've not been happy with
the result yet. Let me see if spending extra time with it will allow
me to come up with something that doesn't suck.

> But the main job should be how this interface is implemented in FS/VFS side,
> so it looks one more FS job, and block layer can call shutdown_filesystem()
> from del_gendisk() simply.

This needs to be called from blk_mark_disk_dead for drivers using that,
and from del_gendisk only if GD_DEAD isn't set yet.

2023-04-29 04:56:01

by Ming Lei

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Fri, Apr 28, 2023 at 01:47:22AM -0400, Theodore Ts'o wrote:
> On Fri, Apr 28, 2023 at 11:47:26AM +0800, Baokun Li wrote:
> > Ext4 just detects I/O Error and remounts it as read-only, it doesn't know
> > if the current disk is dead or not.
> >
> > I asked Yu Kuai and he said that disk_live() can be used to determine
> > whether
> > a disk has been removed based on the status of the inode corresponding to
> > the block device, but this is generally not done in file systems.
>
> What really needs to happen is that del_gendisk() needs to inform file
> systems that the disk is gone, so that the file system can shutdown
> the file system and tear everything down.

OK, looks both Dave and you have same suggestion, and IMO, it isn't hard to
add one interface for notifying FS, and it can be either one s_ops->shutdown()
or shutdown_filesystem(struct super_block *sb).

But the main job should be how this interface is implemented in FS/VFS side,
so it looks one more FS job, and block layer can call shutdown_filesystem()
from del_gendisk() simply.

>
> disk_live() is relatively new; it was added in August 2021. Back in

IO failure plus checking disk_live() could be one way for handling the
failure, but this kind of interface isn't friendly.

> 2015, I had added the following in fs/ext4/super.c:
>
> /*
> * The del_gendisk() function uninitializes the disk-specific data
> * structures, including the bdi structure, without telling anyone
> * else. Once this happens, any attempt to call mark_buffer_dirty()
> * (for example, by ext4_commit_super), will cause a kernel OOPS.
> * This is a kludge to prevent these oops until we can put in a proper
> * hook in del_gendisk() to inform the VFS and file system layers.
> */
> static int block_device_ejected(struct super_block *sb)
> {
> struct inode *bd_inode = sb->s_bdev->bd_inode;
> struct backing_dev_info *bdi = inode_to_bdi(bd_inode);
>
> return bdi->dev == NULL;
> }
>
> As the comment states, it's rather awkward to have the file system
> check to see if the block device is dead in various places; the real

I can understand the awkward, :-(

bdi_unregister() is called in del_gendisk(), since bdi_register() has
to be called in add_disk() where major/minor is figured out.

> problem is that the block device shouldn't just *vanish*, with the

That looks not realistic, removable disk can be gone any time, and device
driver error handler often deletes disk as the last straw, and it shouldn't
be hard to observe such error.

Also it is not realistic to wait until all openers close the bdev, given
it may wait forever.

> block device structures egetting partially de-initialized, without the
> block layer being polite enough to let the file system know.

The block device & gendisk instance won't be gone if the bdev is opened, and I
guess just a few fields are deinitialized, such as bdi->dev; bdi could be the
only one used by FS code.

>
> > Those dirty pages that are already there are piling up and can't be
> > written back, which I think is a real problem. Can the block layer
> > clear those dirty pages when it detects that the disk is deleted?
>
> Well, the dirty pages belong to the file system, and so it needs to be
> up to the file system to clear out the dirty pages. But I'll also
> what the right thing to do when a disk gets removed is not necessarily
> obvious.

Yeah, clearing dirty pages doesn't belong to block layer.

>
> For example, suppose some process has a file mmap'ed into its address
> space, and that file is on the disk which the user has rudely yanked
> out from their laptop; what is the right thing to do? Do we kill the
> process? Do we let the process write to the mmap'ed region, and
> silently let the modified data go *poof* when the process exits? What
> if there is an executable file on the removable disk, and there are
> one or more processes running that executable when the device
> disappears? Do we kill the process? Do we let the process run unti
> it tries to access a page which hasn't been paged in and then kill the
> process?
>
> We should design a proper solution for What Should Happen when a
> removable disk gets removed unceremoniously without unmounting the
> file system first. It's not just a matter of making some tests go
> green....

Agreed, the trouble is actually in how the FS handles the disk removal.


Thanks,
Ming

2023-04-29 05:22:41

by Ming Lei

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Sat, Apr 29, 2023 at 06:40:38AM +0200, Christoph Hellwig wrote:
> On Sat, Apr 29, 2023 at 11:16:14AM +0800, Ming Lei wrote:
> > OK, looks both Dave and you have same suggestion, and IMO, it isn't hard to
> > add one interface for notifying FS, and it can be either one s_ops->shutdown()
> > or shutdown_filesystem(struct super_block *sb).
>
> It's not that simple. You need to be able to do that for any device used
> by a file system, not just s_bdev. This means it needs go into ops
> passed by the bdev owner, which is also needed to propagate this through
> stackable devices.

Not sure if it is needed for non s_bdev , because FS is over stackable device
directly. Stackable device has its own logic for handling underlying disks dead
or deleted, then decide if its own disk needs to be deleted, such as, it is
fine for raid1 to work from user viewpoint if one underlying disk is deleted.

>
> I have some work on that, but the way how blkdev_get is called in the
> generic mount helpers is a such a mess that I've not been happy with
> the result yet. Let me see if spending extra time with it will allow
> me to come up with something that doesn't suck.
>
> > But the main job should be how this interface is implemented in FS/VFS side,
> > so it looks one more FS job, and block layer can call shutdown_filesystem()
> > from del_gendisk() simply.
>
> This needs to be called from blk_mark_disk_dead for drivers using that,
> and from del_gendisk only if GD_DEAD isn't set yet.

OK, it looks like you mean the API needs to be called before GD_DEAD is set,
but I am wondering if it makes a difference, given the device is already
dead.


Thanks,
Ming

2023-04-29 05:26:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Sat, Apr 29, 2023 at 11:16:14AM +0800, Ming Lei wrote:
>
> bdi_unregister() is called in del_gendisk(), since bdi_register() has
> to be called in add_disk() where major/minor is figured out.
>
> > problem is that the block device shouldn't just *vanish*, with the
>
> That looks not realistic, removable disk can be gone any time, and device
> driver error handler often deletes disk as the last straw, and it shouldn't
> be hard to observe such error.

It's not realistic to think that the file system can write back any
dirty pages, sure. At this point, the user has already yanked out the
thumb drive, and the physical device is gone. However, various fields
like bdi->dev shouldn't get deinitialized until after the
s_ops->shutdown() function has returned.

We need to give the file system a chance to shutdown any pending
writebacks; otherwise, we could be racing with writeback happening in
some other kernel thread, and while the I/O is certainly not going to
succeed, it would be nice if attempts to write to the block device
return an error, instead of potentially causing the kernel to crash.

The shutdown function might need to sleep while it waits for
workqueues or kernel threads to exit, or while it iterates over all
inodes and clears all of the dirty bits and/or drop all of the pages
associated with the file system on the disconnected block device. So
while this happens, I/O should just fail, and not result in a kernel
BUG or oops.

Once the s_ops->shutdown() has returned, then del_gendisk can shutdown
and/or deallocate anything it wants, and if the file system tries to
use the bdi after s_ops->shutdown() has returned, well, it deserves
anything it gets.

(Well, it would be nice if things didn't bug/oops in fs/buffer.c if
there is no s_ops->shutdown() function, since there are a lot of
legacy file systems that use the buffer cache and until we can add
some kind of generic shutdown function to fs/libfs.c and make sure
that all of the legacy file systems that are likely to be used on a
USB thumb drive are fixed, it would be nice if they were protected.
At the very least, we should make sure that things are no worse than they
currently are.)

- Ted

P.S. Note that the semantics I've described here for
s_ops->shutdown() are slightly different than what the FS_IOC_SHUTDOWN
ioctl currently does. For example, after FS_IOC_SHUTDOWN, writes to
files will fail, but reads of already open files will succeed. I know
this because the original ext4 shutdown implementation did actually
prevent reads from going through, but we got objections from those
that wanted ext4's FS_IOC_SHUTDOWN to work the same way as xfs's.

So we have an out of tree patch for ext4's FS_IOC_SHUTDOWN
implementation in our kernels at $WORK, because we were using it when
we knew that the back-end server providing the iSCSI or remote block
had died, and we wanted to make sure our borg (think Kubernetes) jobs
would fast fail when they tried reading from the dead file system, as
opposed to failing only after some timeout had elapsed.

To avoid confusion, we should probably either use a different name
than s_ops->shutdown(), or add a new mode to FS_IOC_SHUTDOWN which
corresponds to "the block device is gone, shut *everything* down:
reads, writes, everything." My preference would be the latter, since
it would mean we could stop carrying that out-of-tree patch in our
data center kernels...

2023-05-01 02:15:26

by Dave Chinner

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Sat, Apr 29, 2023 at 12:56:03AM -0400, Theodore Ts'o wrote:
> On Sat, Apr 29, 2023 at 11:16:14AM +0800, Ming Lei wrote:
> >
> > bdi_unregister() is called in del_gendisk(), since bdi_register() has
> > to be called in add_disk() where major/minor is figured out.
> >
> > > problem is that the block device shouldn't just *vanish*, with the
> >
> > That looks not realistic, removable disk can be gone any time, and device
> > driver error handler often deletes disk as the last straw, and it shouldn't
> > be hard to observe such error.
>
> It's not realistic to think that the file system can write back any
> dirty pages, sure. At this point, the user has already yanked out the
> thumb drive, and the physical device is gone. However, various fields
> like bdi->dev shouldn't get deinitialized until after the
> s_ops->shutdown() function has returned.
>
> We need to give the file system a chance to shutdown any pending
> writebacks; otherwise, we could be racing with writeback happening in
> some other kernel thread, and while the I/O is certainly not going to
> suceed, it would be nice if attempts to write to the block device
> return an error, intead potentially causing the kernel to crash.
>
> The shutdown function might need to sleep while it waits for
> workqueues or kernel threads to exit, or while it iterates over all
> inodes and clears all of the dirty bits and/or drop all of the pages
> associated with the file system on the disconnected block device. So
> while this happens, I/O should just fail, and not result in a kernel
> BUG or oops.
>
> Once the s_ops->shutdown() has returned, then del_gendisk can shutdown
> and/or deallocate anything it wants, and if the file system tries to
> use the bdi after s_ops->shutdown() has returned, well, it deserves
> anything it gets.
>
> (Well, it would be nice if things didn't bug/oops in fs/buffer.c if
> there is no s_ops->shutdown() function, since there are a lot of
> legacy file systems that use the buffer cache and until we can add
> some kind of generic shutdown function to fs/libfs.c and make sure
> that all of the legacy file systems that are likely to be used on a
> USB thumb drive are fixed, it would be nice if they were protected.
> At the very least, we should make that things are no worse than they
> currently are.)
>
> - Ted
>
> P.S. Note that the semantics I've described here for
> s_ops->shutdown() are slightly different than what the FS_IOC_SHUTDOWN
> ioctl currently does. For example, after FS_IOC_SHUTDOWN, writes to
> files will fail, but read to already open files will succeed. I know
> this because the original ext4 shutdown implementation did actually
> prevent reads from going through, but we got objections from those
> that wanted ext4's FS_IOC_SHUTDOWN to work the same way as xfs's.

<blink>

Wot?

The current XFS read IO path does this as its first check:

STATIC ssize_t
xfs_file_read_iter(
struct kiocb *iocb,
struct iov_iter *to)
{
struct inode *inode = file_inode(iocb->ki_filp);
struct xfs_mount *mp = XFS_I(inode)->i_mount;
ssize_t ret = 0;

XFS_STATS_INC(mp, xs_read_calls);

if (xfs_is_shutdown(mp))
return -EIO;
....

It's been this way since .... 1997 on Irix when forced shutdowns
were introduced with this commit:

commit a96958f0891133f2731094b455465e88c03a13fb
Author: Supriya Wickrematillake <[email protected]>
Date: Sat Jan 25 02:55:04 1997 +0000

First cut of XFS I/O error handling changes.

So, yeah, XFS *always* errors out user read IO after a shutdown.

/me wonders what ext4 does


static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct inode *inode = file_inode(iocb->ki_filp);

if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;

Huh. It does the same thing as XFS. I have no idea what
you are talking about now, Ted.

Ah:

https://lore.kernel.org/linux-ext4/[email protected]/

You explicitly mention readpage/readpages, which was the page
fault IO path (address space ops, not file ops). These days we have
xfs_vm_read_folio(), which goes through iomap, which then asks the
filesystem to map the folio to the underlying storage for IO, which
then calls xfs_read_iomap_begin(), and that does:

static int
xfs_read_iomap_begin(
struct inode *inode,
loff_t offset,
loff_t length,
unsigned flags,
struct iomap *iomap,
struct iomap *srcmap)
{
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
struct xfs_bmbt_irec imap;
xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, length);
int nimaps = 1, error = 0;
bool shared = false;
unsigned int lockmode = XFS_ILOCK_SHARED;
u64 seq;

ASSERT(!(flags & (IOMAP_WRITE | IOMAP_ZERO)));

if (xfs_is_shutdown(mp))
return -EIO;

.... a shutdown check as its first operation.

I'm pretty sure that this has always been the case - for reading
pages/folios through page faults, we've always errored those out in
the block mapping callback function, not in the high level VFS
interfacing functions. Yeah, looking at the old page based path
from 2008:


STATIC int
xfs_vm_readpage(
struct file *unused,
struct page *page)
{
return mpage_readpage(page, xfs_get_blocks);
}

Call chain:

mpage_readpage
xfs_get_blocks
__xfs_get_blocks
xfs_iomap

int
xfs_iomap(
....
{
xfs_mount_t *mp = ip->i_mount;
....
int iomap_flags = 0;

ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);

if (XFS_FORCED_SHUTDOWN(mp))
return XFS_ERROR(EIO);

Yup, there's the shutdown check way back in the 2008 code base for
the pagefault IO path. IOWs, we've always errored out any attempt to
do IO via page faults after a shutdown, too.

The XFS ->page_mkwrite() path also does a shutdown check as its
first operation, which leaves just read faults through filemap_fault()
skipping the shutdown check.

Yeah, so if the page is uptodate in the page cache, the fault still
succeeds. This was left in place so that root filesystems might
still be able to execute a system shutdown after the root filesystem
was shut down. Sometimes it works, sometimes it doesn't, but it
doesn't hurt anything to let read page faults for cached data
succeed after a shutdown...

That's trivial to change - just add a shutdown check before calling
filemap_fault(). I just don't see a need to change that for XFS, and
I don't care one way or another if other filesystems do something
different here, either, as long as they don't issue read IO to the
underlying device....
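
Purely as an untested sketch - and ignoring the DAX and write-fault
handling the real XFS fault path has - that check would look something
like:

static vm_fault_t
xfs_filemap_read_fault(struct vm_fault *vmf)
{
        struct inode *inode = file_inode(vmf->vma->vm_file);

        /* reject read faults once the filesystem has been shut down */
        if (xfs_is_shutdown(XFS_I(inode)->i_mount))
                return VM_FAULT_SIGBUS;

        return filemap_fault(vmf);
}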

> So we have an out of tree patch for ext4's FS_IOC_SHUTDOWN
> implementation in our kernels at $WORK, because we were using it when
> we knew that the back-end server providing the iSCSI or remote block
> had died, and we wanted to make sure our borg (think Kubernetes) jobs
> would fast fail when they tried reading from the dead file system, as
> opposed to failing only after some timeout had elapsed.

Well, yeah. That's pretty much why XFS has failed all physical I/O
attempts (read or write) after a shutdown for the past 25 years.
Shutdowns on root filesystems are a bit nasty, though, because if
you don't allow cached executables to run, the whole system dies
instantly with no warning or ability to safely shut down applications
at all.

> To avoid confusion, we should probably either use a different name
> than s_ops->shutdown(), or add a new mode to FS_IOC_SHUTDOWN which
> corresponds to "the block device is gone, shut *everything* down:
> reads, writes, everything."

Yup, that's pretty much what we already do for a shutdown, so I'm
not sure what you are advocating for, Ted. If you add the shutdown
check to the filemap_fault() path then even that last little "allow
cached executables to still run on a shutdown root fs" helper goes
away and then you have what you want...

-Dave.
--
Dave Chinner
[email protected]

2023-05-01 05:08:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Sat, Apr 29, 2023 at 01:10:49PM +0800, Ming Lei wrote:
> Not sure if it is needed for non s_bdev

So you don't want this to work at all for btrfs? Or the XFS log device,
or ..

> , because FS is over stackable device
> directly. Stackable device has its own logic for handling underlying disks dead
> or deleted, then decide if its own disk needs to be deleted, such as, it is
> fine for raid1 to work from user viewpoint if one underlying disk is deleted.

We still need to propagate the event that the device has been removed
upwards. Right now some file systems (especially XFS) are good at just
propagating it from an I/O error. An explicit call would be much better.

2023-05-02 01:02:36

by Ming Lei

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Mon, May 01, 2023 at 06:47:44AM +0200, Christoph Hellwig wrote:
> On Sat, Apr 29, 2023 at 01:10:49PM +0800, Ming Lei wrote:
> > Not sure if it is needed for non s_bdev
>
> So you don't want to work this at all for btrfs? Or the XFS log device,
> or ..

Basically the FS layer can provide one generic API, shutdown_filesystem(),
which shuts the FS down generically and meanwhile calls each fs's
->shutdown() to deal with fs-specific shutdown.

If there isn't a superblock attached to a bdev, can you explain a bit what
filesystem code can do? Same with a block-layer bdev.

The current bio->bi_status together with disk_live() (maybe bdev_live() is
needed) should be enough for FS code to handle non-s_bdev devices.
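
Something like this untested sketch is what I mean (the helper name is
made up) - the bio completion path can already tell 'device is gone'
apart from an ordinary I/O error:

static bool fs_bdev_gone(struct bio *bio)
{
        /* I/O failed and the disk backing this bio has been deleted */
        return bio->bi_status != BLK_STS_OK &&
               !disk_live(bio->bi_bdev->bd_disk);
}

and then the filesystem can decide to shut itself down instead of just
retrying or remounting read-only.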

>
> > , because FS is over stackable device
> > directly. Stackable device has its own logic for handling underlying disks dead
> > or deleted, then decide if its own disk needs to be deleted, such as, it is
> > fine for raid1 to work from user viewpoint if one underlying disk is deleted.
>
> We still need to propagate the even that device has been removed upwards.
> Right now some file systems (especially XFS) are good at just propagating
> it from an I/O error. And explicity call would be much better.

It depends on the above question about how FS code handles non-s_bdev
deletion/death.


Thanks,
Ming

2023-05-02 01:50:40

by Dave Chinner

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Tue, May 02, 2023 at 08:57:32AM +0800, Ming Lei wrote:
> On Mon, May 01, 2023 at 06:47:44AM +0200, Christoph Hellwig wrote:
> > On Sat, Apr 29, 2023 at 01:10:49PM +0800, Ming Lei wrote:
> > > Not sure if it is needed for non s_bdev
> >
> > So you don't want to work this at all for btrfs? Or the XFS log device,
> > or ..
>
> Basically FS can provide one generic API of shutdown_filesystem() which
> shutdown FS generically, meantime calls each fs's ->shutdown() for
> dealing with fs specific shutdown.
>
> If there isn't superblock attached for one bdev, can you explain a bit what
> filesystem code can do? Same with block layer bdev.
>
> The current bio->bi_status together disk_live()(maybe bdev_live() is
> needed) should be enough for FS code to handle non s_bdev.

maybe necessary for btrfs, but not for XFS....
>
> >
> > > , because FS is over stackable device
> > > directly. Stackable device has its own logic for handling underlying disks dead
> > > or deleted, then decide if its own disk needs to be deleted, such as, it is
> > > fine for raid1 to work from user viewpoint if one underlying disk is deleted.
> >
> > We still need to propagate the even that device has been removed upwards.
> > Right now some file systems (especially XFS) are good at just propagating
> > it from an I/O error. And explicity call would be much better.
>
> It depends on the above question about how FS code handle non s_bdev
> deletion/dead.

as XFS doesn't treat the individual devices differently. A
failure on an external log device is just as fatal as a failure on
a single device filesystem with an internal log. ext4 is
going to consider external journal device removal as fatal, too.

As for removal of realtime devices on XFS, all the user data has
gone away, so the filesystem will largely be useless for users and
applications. At this point, we'll probably want to shut down the
filesystem because we've had an unknown amount of user data loss and
so silently continuing on as if nothing happened is not the right
thing to do.

So as long as we can attach the superblock to each block device that
the filesystem opens (regardless of where sb->s_bdev points), device
removal calling sb_force_shutdown(sb, SB_SHUTDOWN_DEVICE_DEAD) will
do what we need. If we need anything different in future, then we
can worry about how to do that in the future.

I have attached a quick untested compendium patch to lift the
shutdown ioctl to the VFS, have it call sb_force_shutdown(sb), plumb
in sb->s_op->shutdown_fs(sb), hook XFS up to it and then remove the
XFS ioctl implementation for this functionality. I also included a
new shutdown reason called "SB_SHUTDOWN_DEVICE_DEAD" so that the
filesystem knows it's being called because the storage device is
dead instead of a user/admin asking it to die.

The code isn't that complex - we'll need to add ext4, f2fs and cifs
support for the shutdown_fs method, but otherwise there shouldn't be
much more to it...
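
For ext4, for example, the hookup could be as simple as the following
untested sketch, assuming ext4_force_shutdown() is made callable from
outside fs/ext4/ioctl.c:

static int ext4_shutdown_fs(struct super_block *sb,
                            enum sb_shutdown_reason reason)
{
        switch (reason) {
        case SB_SHUTDOWN_SYNC:
                return ext4_force_shutdown(sb, EXT4_GOING_FLAGS_DEFAULT);
        case SB_SHUTDOWN_METASYNC:
                return ext4_force_shutdown(sb, EXT4_GOING_FLAGS_LOGFLUSH);
        case SB_SHUTDOWN_IMMEDIATE:
        case SB_SHUTDOWN_DEVICE_DEAD:
                return ext4_force_shutdown(sb, EXT4_GOING_FLAGS_NOLOGFLUSH);
        default:
                return -EINVAL;
        }
}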

-Dave.
--
Dave Chinner
[email protected]


fs: add superblock shutdown infrastructure

From: Dave Chinner <[email protected]>

So we can lift the shutdown ioctl to the VFS and also provide a
mechanism for block device failure to shut down the filesystem for
correct error handling.

Signed-off-by: Dave Chinner <[email protected]>
---
Documentation/filesystems/vfs.rst | 5 +++++
fs/ioctl.c | 27 +++++++++++++++++++++++++++
fs/super.c | 19 +++++++++++++++++++
fs/xfs/libxfs/xfs_fs.h | 27 +++++++++++++++++++--------
fs/xfs/xfs_fsops.c | 30 +++---------------------------
fs/xfs/xfs_ioctl.c | 12 ------------
fs/xfs/xfs_mount.h | 7 +++++--
fs/xfs/xfs_super.c | 32 ++++++++++++++++++++++++++++++++
include/linux/fs.h | 16 ++++++++++++++++
include/uapi/linux/fs.h | 13 +++++++++++++
10 files changed, 139 insertions(+), 49 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 769be5230210..6e0855cad74a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -266,6 +266,7 @@ filesystem. The following members are defined:
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
+ int (*shutdown_fs) (struct super_block *sb, enum sb_shutdown_reason reason);
void (*umount_begin) (struct super_block *);

int (*show_options)(struct seq_file *, struct dentry *);
@@ -376,6 +377,10 @@ or bottom half).
called when the filesystem is remounted. This is called with
the kernel lock held

+``shutdown_fs``
+ called when the filesystem is to be shut down in response to an admin
+ command or device failure. Optional.
+
``umount_begin``
called when the VFS is unmounting a filesystem.

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5b2481cd4750..4e68b3a15fc1 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -413,6 +413,30 @@ static int ioctl_fsthaw(struct file *filp)
return thaw_super(sb);
}

+static int ioctl_fsshutdown(struct file *filp, u32 __user *argp)
+{
+ struct super_block *sb = file_inode(filp)->i_sb;
+ u32 inflags;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (get_user(inflags, argp))
+ return -EFAULT;
+
+ switch (inflags) {
+ case FS_SHUTDOWN_SYNC:
+ return sb_force_shutdown(sb, SB_SHUTDOWN_SYNC);
+ case FS_SHUTDOWN_METASYNC:
+ return sb_force_shutdown(sb, SB_SHUTDOWN_METASYNC);
+ case FS_SHUTDOWN_IMMEDIATE:
+ return sb_force_shutdown(sb, SB_SHUTDOWN_IMMEDIATE);
+ default:
+ break;
+ }
+ return -EINVAL;
+}
+
static int ioctl_file_dedupe_range(struct file *file,
struct file_dedupe_range __user *argp)
{
@@ -844,6 +868,9 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
case FS_IOC_FSSETXATTR:
return ioctl_fssetxattr(filp, argp);

+ case FS_IOC_SHUTDOWN:
+ return ioctl_fsshutdown(filp, argp);
+
default:
if (S_ISREG(inode->i_mode))
return file_ioctl(filp, cmd, argp);
diff --git a/fs/super.c b/fs/super.c
index 34afe411cf2b..9e9af48e2d38 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1809,3 +1809,22 @@ int sb_init_dio_done_wq(struct super_block *sb)
destroy_workqueue(wq);
return 0;
}
+
+/**
+ * sb_force_shutdown -- force a shutdown of the given filesystem
+ * @sb: the super to shut down
+ * @flags: shutdown behaviour required
+ *
+ * This locks the superblock and performs a shutdown of the filesystem such that
+ * it no longer will perform any user operations successfully or issue any IO to
+ * the underlying backing store.
+ *
+ * If the filesystem does not support shutdowns, just sync the filesystem and
+ * hope that it generates sufficient IO errors for applications to notice.
+ */
+int sb_force_shutdown(struct super_block *sb, enum sb_shutdown_reason reason)
+{
+ if (!sb->s_op->shutdown_fs)
+ return sync_filesystem(sb);
+ return sb->s_op->shutdown_fs(sb, reason);
+}
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 1cfd5bc6520a..533b016681f0 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -660,13 +660,6 @@ typedef struct xfs_swapext
struct xfs_bstat sx_stat; /* stat of target b4 copy */
} xfs_swapext_t;

-/*
- * Flags for going down operation
- */
-#define XFS_FSOP_GOING_FLAGS_DEFAULT 0x0 /* going down */
-#define XFS_FSOP_GOING_FLAGS_LOGFLUSH 0x1 /* flush log but not data */
-#define XFS_FSOP_GOING_FLAGS_NOLOGFLUSH 0x2 /* don't flush log nor data */
-
/* metadata scrubbing */
struct xfs_scrub_metadata {
__u32 sm_type; /* What to check? */
@@ -827,12 +820,30 @@ struct xfs_scrub_metadata {
#define XFS_IOC_ATTRLIST_BY_HANDLE _IOW ('X', 122, struct xfs_fsop_attrlist_handlereq)
#define XFS_IOC_ATTRMULTI_BY_HANDLE _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
#define XFS_IOC_FSGEOMETRY_V4 _IOR ('X', 124, struct xfs_fsop_geom_v4)
-#define XFS_IOC_GOINGDOWN _IOR ('X', 125, uint32_t)
#define XFS_IOC_FSGEOMETRY _IOR ('X', 126, struct xfs_fsop_geom)
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */

+#ifndef FS_IOC_SHUTDOWN
+/*
+ * Flags for going down operation
+ */
+#define XFS_FSOP_GOING_FLAGS_DEFAULT 0x0 /* going down */
+#define XFS_FSOP_GOING_FLAGS_LOGFLUSH 0x1 /* flush log but not data */
+#define XFS_FSOP_GOING_FLAGS_NOLOGFLUSH 0x2 /* don't flush log nor data */
+
+#define XFS_IOC_GOINGDOWN _IOR ('X', 125, uint32_t)
+#else /* !FS_IOC_SHUTDOWN */
+
+/* Use VFS provided definitions */
+#define XFS_IOC_GOINGDOWN FS_IOC_SHUTDOWN
+
+#define XFS_FSOP_GOING_FLAGS_DEFAULT FS_SHUTDOWN_SYNC
+#define XFS_FSOP_GOING_FLAGS_LOGFLUSH FS_SHUTDOWN_METASYNC
+#define XFS_FSOP_GOING_FLAGS_NOLOGFLUSH FS_SHUTDOWN_IMMEDIATE
+
+#endif /* FS_IOC_SHUTDOWN */

#ifndef HAVE_BBMACROS
/*
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 13851c0d640b..11bc4270e373 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -466,33 +466,6 @@ xfs_reserve_blocks(
return error;
}

-int
-xfs_fs_goingdown(
- xfs_mount_t *mp,
- uint32_t inflags)
-{
- switch (inflags) {
- case XFS_FSOP_GOING_FLAGS_DEFAULT: {
- if (!freeze_bdev(mp->m_super->s_bdev)) {
- xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
- thaw_bdev(mp->m_super->s_bdev);
- }
- break;
- }
- case XFS_FSOP_GOING_FLAGS_LOGFLUSH:
- xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
- break;
- case XFS_FSOP_GOING_FLAGS_NOLOGFLUSH:
- xfs_force_shutdown(mp,
- SHUTDOWN_FORCE_UMOUNT | SHUTDOWN_LOG_IO_ERROR);
- break;
- default:
- return -EINVAL;
- }
-
- return 0;
-}
-
/*
* Force a shutdown of the filesystem instantly while keeping the filesystem
* consistent. We don't do an unmount here; just shutdown the shop, make sure
@@ -525,6 +498,9 @@ xfs_do_force_shutdown(
if (flags & SHUTDOWN_FORCE_UMOUNT)
xfs_alert(mp, "User initiated shutdown received.");

+ if (flags & SHUTDOWN_DEVICE_DEAD)
+ xfs_alert(mp, "Storage device initiated shutdown received.");
+
if (xlog_force_shutdown(mp->m_log, flags)) {
tag = XFS_PTAG_SHUTDOWN_LOGERROR;
why = "Log I/O Error";
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 55bb01173cde..695bfc25213d 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -2094,18 +2094,6 @@ xfs_file_ioctl(
return error;
}

- case XFS_IOC_GOINGDOWN: {
- uint32_t in;
-
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
-
- if (get_user(in, (uint32_t __user *)arg))
- return -EFAULT;
-
- return xfs_fs_goingdown(mp, in);
- }
-
case XFS_IOC_ERROR_INJECTION: {
xfs_error_injection_t in;

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index f3269c0626f0..0b3f829333fe 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -453,13 +453,16 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, uint32_t flags, char *fname,
#define SHUTDOWN_LOG_IO_ERROR (1u << 1) /* write attempt to the log failed */
#define SHUTDOWN_FORCE_UMOUNT (1u << 2) /* shutdown from a forced unmount */
#define SHUTDOWN_CORRUPT_INCORE (1u << 3) /* corrupt in-memory structures */
-#define SHUTDOWN_CORRUPT_ONDISK (1u << 4) /* corrupt metadata on device */
+#define SHUTDOWN_CORRUPT_ONDISK (1u << 4) /* corrupt metadata on device */
+#define SHUTDOWN_DEVICE_DEAD (1u << 5) /* block device is dead */

#define XFS_SHUTDOWN_STRINGS \
{ SHUTDOWN_META_IO_ERROR, "metadata_io" }, \
{ SHUTDOWN_LOG_IO_ERROR, "log_io" }, \
{ SHUTDOWN_FORCE_UMOUNT, "force_umount" }, \
- { SHUTDOWN_CORRUPT_INCORE, "corruption" }
+ { SHUTDOWN_CORRUPT_INCORE, "corruption" }, \
+ { SHUTDOWN_CORRUPT_ONDISK, "on-disk corruption" }, \
+ { SHUTDOWN_DEVICE_DEAD, "device dead" }

/*
* Flags for xfs_mountfs
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 4d2e87462ac4..4d62d1cee8dc 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -963,6 +963,37 @@ xfs_fs_unfreeze(
return 0;
}

+static int
+xfs_fs_shutdown(
+ struct super_block *sb,
+ enum sb_shutdown_reason reason)
+{
+ struct xfs_mount *mp = XFS_M(sb);
+
+ switch (reason) {
+ case SB_SHUTDOWN_SYNC:
+ if (!freeze_bdev(sb->s_bdev)) {
+ xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+ thaw_bdev(mp->m_super->s_bdev);
+ }
+ break;
+ case SB_SHUTDOWN_METASYNC:
+ xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+ break;
+ case SB_SHUTDOWN_IMMEDIATE:
+ xfs_force_shutdown(mp,
+ SHUTDOWN_FORCE_UMOUNT | SHUTDOWN_LOG_IO_ERROR);
+ break;
+ case SB_SHUTDOWN_DEVICE_DEAD:
+ xfs_force_shutdown(mp,
+ SHUTDOWN_DEVICE_DEAD | SHUTDOWN_LOG_IO_ERROR);
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
/*
* This function fills in xfs_mount_t fields based on mount args.
* Note: the superblock _has_ now been read in.
@@ -1165,6 +1196,7 @@ static const struct super_operations xfs_super_operations = {
.sync_fs = xfs_fs_sync_fs,
.freeze_fs = xfs_fs_freeze,
.unfreeze_fs = xfs_fs_unfreeze,
+ .shutdown_fs = xfs_fs_shutdown,
.statfs = xfs_fs_statfs,
.show_options = xfs_fs_show_options,
.nr_cached_objects = xfs_fs_nr_cached_objects,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a2dfbe2fb639..a8f3b39cd976 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1900,6 +1900,21 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
struct file *dst_file, loff_t dst_pos,
loff_t len, unsigned int remap_flags);

+/*
+ * Shutdown reason - indicate the reason for and/or the actions to take while
+ * processing the shutdown of the filesystem
+ */
+enum sb_shutdown_reason {
+ /* user driven shutdowns */
+ SB_SHUTDOWN_SYNC, /* sync whole fs before shutdown */
+ SB_SHUTDOWN_METASYNC, /* only sync metadata before shutdown */
+ SB_SHUTDOWN_IMMEDIATE, /* immediately terminate all operations */
+
+ /* internal event driven shutdowns */
+ SB_SHUTDOWN_DEVICE_DEAD,/* shutdown initiated by device removal */
+};
+
+int sb_force_shutdown(struct super_block *sb, enum sb_shutdown_reason reason);

struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
@@ -1918,6 +1933,7 @@ struct super_operations {
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
+ int (*shutdown_fs) (struct super_block *sb, enum sb_shutdown_reason reason);
void (*umount_begin) (struct super_block *);

int (*show_options)(struct seq_file *, struct dentry *);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..ce346944cd3d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -216,6 +216,19 @@ struct fsxattr {
#define FS_IOC_GETFSLABEL _IOR(0x94, 49, char[FSLABEL_MAX])
#define FS_IOC_SETFSLABEL _IOW(0x94, 50, char[FSLABEL_MAX])

+#define FS_IOC_SHUTDOWN _IOR('X', 125, u32)
+
+/*
+ * Flags for FS_IOC_SHUTDOWN operation.
+ *
+ * This ioctl has been lifted from XFS, so the user API must support the first
+ * three types of user driven shutdown. If no shutdown flag value is passed,
+ * we treat it as a "sync the entire filesystem before shutdown" command.
+ */
+#define FS_SHUTDOWN_SYNC (0U) /* sync entire fs before shutdown */
+#define FS_SHUTDOWN_METASYNC (1U << 0) /* sync all metadata before shutdown */
+#define FS_SHUTDOWN_IMMEDIATE (1U << 1) /* terminate all operations immediately */
+
/*
* Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
*

2023-05-02 03:12:33

by Ming Lei

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Sat, Apr 29, 2023 at 12:56:03AM -0400, Theodore Ts'o wrote:
> On Sat, Apr 29, 2023 at 11:16:14AM +0800, Ming Lei wrote:
> >
> > bdi_unregister() is called in del_gendisk(), since bdi_register() has
> > to be called in add_disk() where major/minor is figured out.
> >
> > > problem is that the block device shouldn't just *vanish*, with the
> >
> > That looks not realistic, removable disk can be gone any time, and device
> > driver error handler often deletes disk as the last straw, and it shouldn't
> > be hard to observe such error.
>
> It's not realistic to think that the file system can write back any
> dirty pages, sure. At this point, the user has already yanked out the
> thumb drive, and the physical device is gone. However, various fields
> like bdi->dev shouldn't get deinitialized until after the
> s_ops->shutdown() function has returned.

Yeah, it makes sense.

>
> We need to give the file system a chance to shutdown any pending
> writebacks; otherwise, we could be racing with writeback happening in
> some other kernel thread, and while the I/O is certainly not going to
> succeed, it would be nice if attempts to write to the block device
> return an error, instead of potentially causing the kernel to crash.
>
> The shutdown function might need to sleep while it waits for
> workqueues or kernel threads to exit, or while it iterates over all
> inodes and clears all of the dirty bits and/or drop all of the pages
> associated with the file system on the disconnected block device. So
> while this happens, I/O should just fail, and not result in a kernel
> BUG or oops.
>
> Once the s_ops->shutdown() has returned, then del_gendisk can shutdown
> and/or deallocate anything it wants, and if the file system tries to
> use the bdi after s_ops->shutdown() has returned, well, it deserves
> anything it gets.
>
> (Well, it would be nice if things didn't bug/oops in fs/buffer.c if
> there is no s_ops->shutdown() function, since there are a lot of
> legacy file systems that use the buffer cache and until we can add
> some kind of generic shutdown function to fs/libfs.c and make sure

One generic shutdown API would be pretty nice, such as the
sb_force_shutdown() posted by Dave.

> that all of the legacy file systems that are likely to be used on a
> USB thumb drive are fixed, it would be nice if they were protected.
> At the very least, we should make sure that things are no worse than they
> currently are.)

Yeah, things won't be worse than now for legacy filesystems, given that the
generic FS API could cover the more generic FS cleanup, and the block layer
would always call it before removing a disk.
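
i.e. something like this untested sketch on the block layer side
(assuming get_super() is an acceptable way to resolve the superblock
from del_gendisk()):

static void disk_shutdown_filesystem(struct gendisk *disk)
{
        struct super_block *sb = get_super(disk->part0);

        if (!sb)
                return;
        /* tell the FS the device is dead before tearing the disk down */
        sb_force_shutdown(sb, SB_SHUTDOWN_DEVICE_DEAD);
        drop_super(sb);
}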


Thanks,
Ming

2023-05-02 15:46:56

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Tue, May 02, 2023 at 11:35:57AM +1000, Dave Chinner wrote:
> On Tue, May 02, 2023 at 08:57:32AM +0800, Ming Lei wrote:
> > On Mon, May 01, 2023 at 06:47:44AM +0200, Christoph Hellwig wrote:
> > > On Sat, Apr 29, 2023 at 01:10:49PM +0800, Ming Lei wrote:
> > > > Not sure if it is needed for non s_bdev
> > >
> > > So you don't want to work this at all for btrfs? Or the XFS log device,
> > > or ..
> >
> > Basically FS can provide one generic API of shutdown_filesystem() which
> > shutdown FS generically, meantime calls each fs's ->shutdown() for
> > dealing with fs specific shutdown.
> >
> > If there isn't superblock attached for one bdev, can you explain a bit what
> > filesystem code can do? Same with block layer bdev.
> >
> > The current bio->bi_status together disk_live()(maybe bdev_live() is
> > needed) should be enough for FS code to handle non s_bdev.
>
> maybe necessary for btrfs, but not for XFS....
> >
> > >
> > > > , because FS is over stackable device
> > > > directly. Stackable device has its own logic for handling underlying disks dead
> > > > or deleted, then decide if its own disk needs to be deleted, such as, it is
> > > > fine for raid1 to work from user viewpoint if one underlying disk is deleted.
> > >
> > > We still need to propagate the even that device has been removed upwards.
> > > Right now some file systems (especially XFS) are good at just propagating
> > > it from an I/O error. And explicity call would be much better.
> >
> > It depends on the above question about how FS code handle non s_bdev
> > deletion/dead.
>
> as XFS doesn't treat the individual devices differently. A
> failure on an external log device is just as fatal as a failure on
> a single device filesystem with an internal log. ext4 is
> going to consider external journal device removal as fatal, too.
>
> As for removal of realtime devices on XFS, all the user data has
> gone away, so the filesystem will largely be useless for users and
> applications. At this point, we'll probably want to shut down the
> filesystem because we've had an unknown amount of user data loss and
> so silently continuing on as if nothing happened is not the right
> thing to do.
>
> So as long as we can attach the superblock to each block device that
> the filesystem opens (regardless of where sb->s_bdev points), device
> removal calling sb_force_shutdown(sb, SB_SHUTDOWN_DEVICE_DEAD) will
> do what we need. If we need anything different in future, then we
> can worry about how to do that in the future.

Shiyang spent a lot of time hooking up pmem failure notifications so
that xfs can kill processes that have pmem in their mapping. I wonder
if we could reuse some of that infrastructure here? That MF_MEM_REMOVE
patchset he's been trying to get us to merge would be a good starting
point for building something similar for block devices. AFAICT it does
the right thing if you hand it a subrange of the dax device or if you
pass it the customary (0, -1ULL) to mean "the entire device".

The block device version of that could be a lot simpler-- imagine if
"echo 0 > /sys/block/fd0/device/delete" resulted in the block layer
first sending us a notification that the device is about to be removed.
We could then flush the fs and try to freeze it. After the device
actually goes away, the block layer would send us a second notification
about DEVICE_DEAD and we could shut down the incore filesystem objects.
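
Roughly, and with completely made-up hook names, something like:

/* phase 1: "device is about to go away" */
static int fs_notify_remove_prepare(struct super_block *sb)
{
        sync_filesystem(sb);            /* flush whatever we still can */
        return freeze_super(sb);        /* block new modifications */
}

/* phase 2: "device is dead" */
static void fs_notify_device_dead(struct super_block *sb)
{
        thaw_super(sb);
        sb_force_shutdown(sb, SB_SHUTDOWN_DEVICE_DEAD);
}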

I've also been wondering if we should rather hook XFS (and all the other
block filesystems) into the device model as a child "device" of the
bdevs that they attach to. Suspend and remove events then simply map to
->freeze_fs, like Luis has been talking about. But I am not
sufficiently familiar with the device model to know if it supports
a device being reachable through multiple paths through the device tree?

--D

2023-05-02 22:39:38

by Dave Chinner

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Tue, May 02, 2023 at 08:35:16AM -0700, Darrick J. Wong wrote:
> On Tue, May 02, 2023 at 11:35:57AM +1000, Dave Chinner wrote:
> > On Tue, May 02, 2023 at 08:57:32AM +0800, Ming Lei wrote:
> > > On Mon, May 01, 2023 at 06:47:44AM +0200, Christoph Hellwig wrote:
> > > > On Sat, Apr 29, 2023 at 01:10:49PM +0800, Ming Lei wrote:
> > > > > Not sure if it is needed for non s_bdev
> > > >
> > > > So you don't want to work this at all for btrfs? Or the XFS log device,
> > > > or ..
> > >
> > > Basically FS can provide one generic API of shutdown_filesystem() which
> > > shutdown FS generically, meantime calls each fs's ->shutdown() for
> > > dealing with fs specific shutdown.
> > >
> > > If there isn't superblock attached for one bdev, can you explain a bit what
> > > filesystem code can do? Same with block layer bdev.
> > >
> > > The current bio->bi_status together disk_live()(maybe bdev_live() is
> > > needed) should be enough for FS code to handle non s_bdev.
> >
> > maybe necessary for btrfs, but not for XFS....
> > >
> > > >
> > > > > , because FS is over stackable device
> > > > > directly. Stackable device has its own logic for handling underlying disks dead
> > > > > or deleted, then decide if its own disk needs to be deleted, such as, it is
> > > > > fine for raid1 to work from user viewpoint if one underlying disk is deleted.
> > > >
> > > > We still need to propagate the even that device has been removed upwards.
> > > > Right now some file systems (especially XFS) are good at just propagating
> > > > it from an I/O error. And explicity call would be much better.
> > >
> > > It depends on the above question about how FS code handle non s_bdev
> > > deletion/dead.
> >
> > as XFS doesn't treat the individual devices differently. A
> > failure on an external log device is just as fatal as a failure on
> > a single device filesystem with an internal log. ext4 is
> > going to consider external journal device removal as fatal, too.
> >
> > As for removal of realtime devices on XFS, all the user data has
> > gone away, so the filesystem will largely be useless for users and
> > applications. At this point, we'll probably want to shut down the
> > filesystem because we've had an unknown amount of user data loss and
> > so silently continuing on as if nothing happened is not the right
> > thing to do.
> >
> > So as long as we can attach the superblock to each block device that
> > the filesystem opens (regardless of where sb->s_bdev points), device
> > removal calling sb_force_shutdown(sb, SB_SHUTDOWN_DEVICE_DEAD) will
> > do what we need. If we need anything different in future, then we
> > can worry about how to do that in the future.
>
> Shiyang spent a lot of time hooking up pmem failure notifications so
> that xfs can kill processes that have pmem in their mapping. I wonder
> if we could reuse some of that infrastructure here?

ISTR that the generic mechanism for "device failure ranges" (I think
I called the mechanism ->corrupt_range()) that we came up with in
the first instance for this functionality got shouted down by some
block layer devs because they saw it as unnecessary complexity to
push device range failure notifications through block devices up
to the filesystem.

The whole point of starting from that point was so that any type of
block device could report a failure to the filesystem and have the
filesystem deal with it appropriately:

This is where we started:

https://lore.kernel.org/linux-xfs/[email protected]/

".....
The call trace is like this:
memory_failure()
pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
gendisk->fops->corrupted_range() => - pmem_corrupted_range()
- md_blk_corrupted_range()
sb->s_ops->currupted_range() => xfs_fs_corrupted_range()
xfs_rmap_query_range()
xfs_currupt_helper()
* corrupted on metadata
try to recover data, call xfs_force_shutdown()
* corrupted on file data
try to recover data, call mf_dax_mapping_kill_procs()
...."

> That MF_MEM_REMOVE
> patchset he's been trying to get us to merge would be a good starting
> point for building something similar for block devices. AFAICT it does
> the right thing if you hand it a subrange of the dax device or if you
> pass it the customary (0, -1ULL) to mean "the entire device".

*nod*

That was exactly how I originally envisaged that whole "bad device
range" stack being used.

> The block device version of that could be a lot simpler-- imagine if
> "echo 0 > /sys/block/fd0/device/delete" resulted in the block layer
> first sending us a notification that the device is about to be removed.
> We could then flush the fs and try to freeze it. After the device
> actually goes away, the blocy layer would send us a second notification
> about DEVICE_DEAD and we could shut down the incore filesystem objects.

*nod*

But seeing as this mechanism has already been shot down by the block
layer devs, let's be a little less ambitious and just start with
a simple, pre-existing "kill the filesystem" mechanism. Once we've
got that in place and working, we can then expand on the error
handling mechanism to perform notification of more fine-grained
storage errors...

-Dave.
--
Dave Chinner
[email protected]

2023-05-02 23:38:36

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Wed, May 03, 2023 at 08:33:23AM +1000, Dave Chinner wrote:
> On Tue, May 02, 2023 at 08:35:16AM -0700, Darrick J. Wong wrote:
> > On Tue, May 02, 2023 at 11:35:57AM +1000, Dave Chinner wrote:
> > > On Tue, May 02, 2023 at 08:57:32AM +0800, Ming Lei wrote:
> > > > On Mon, May 01, 2023 at 06:47:44AM +0200, Christoph Hellwig wrote:
> > > > > On Sat, Apr 29, 2023 at 01:10:49PM +0800, Ming Lei wrote:
> > > > > > Not sure if it is needed for non s_bdev
> > > > >
> > > > > So you don't want to work this at all for btrfs? Or the XFS log device,
> > > > > or ..
> > > >
> > > > Basically FS can provide one generic API of shutdown_filesystem() which
> > > > shutdown FS generically, meantime calls each fs's ->shutdown() for
> > > > dealing with fs specific shutdown.
> > > >
> > > > If there isn't superblock attached for one bdev, can you explain a bit what
> > > > filesystem code can do? Same with block layer bdev.
> > > >
> > > > The current bio->bi_status together disk_live()(maybe bdev_live() is
> > > > needed) should be enough for FS code to handle non s_bdev.
> > >
> > > maybe necessary for btrfs, but not for XFS....
> > > >
> > > > >
> > > > > > , because FS is over stackable device
> > > > > > directly. Stackable device has its own logic for handling underlying disks dead
> > > > > > or deleted, then decide if its own disk needs to be deleted, such as, it is
> > > > > > fine for raid1 to work from user viewpoint if one underlying disk is deleted.
> > > > >
> > > > > We still need to propagate the even that device has been removed upwards.
> > > > > Right now some file systems (especially XFS) are good at just propagating
> > > > > it from an I/O error. And explicity call would be much better.
> > > >
> > > > It depends on the above question about how FS code handle non s_bdev
> > > > deletion/dead.
> > >
> > > as XFS doesn't treat the individual devices differently. A
> > > failure on an external log device is just as fatal as a failure on
> > > a single device filesystem with an internal log. ext4 is
> > > going to consider external journal device removal as fatal, too.
> > >
> > > As for removal of realtime devices on XFS, all the user data has
> > > gone away, so the filesystem will largely be useless for users and
> > > applications. At this point, we'll probably want to shut down the
> > > filesystem because we've had an unknown amount of user data loss and
> > > so silently continuing on as if nothing happened is not the right
> > > thing to do.
> > >
> > > So as long as we can attach the superblock to each block device that
> > > the filesystem opens (regardless of where sb->s_bdev points), device
> > > removal calling sb_force_shutdown(sb, SB_SHUTDOWN_DEVICE_DEAD) will
> > > do what we need. If we need anything different in future, then we
> > > can worry about how to do that in the future.
> >
> > Shiyang spent a lot of time hooking up pmem failure notifications so
> > that xfs can kill processes that have pmem in their mapping. I wonder
> > if we could reuse some of that infrastructure here?
>
> ISTR that the generic mechanism for "device failure ranges" (I think
> I called the mechanism ->corrupt_range()) that we came up with in
> the first instance for this functionality got shouted down by some
> block layer devs because they saw it as unnecessary complexity to
> push device range failure notifications through block devices up
> to the filesystem.
>
> The whole point of starting from that point was so that any type of
> block device could report a failure to the filesystem and have the
> filesystem deal with it appropriately:
>
> This is where we started:
>
> https://lore.kernel.org/linux-xfs/[email protected]/
>
> ".....
> The call trace is like this:
> memory_failure()
> pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
> gendisk->fops->corrupted_range() => - pmem_corrupted_range()
> - md_blk_corrupted_range()
> sb->s_ops->currupted_range() => xfs_fs_corrupted_range()
> xfs_rmap_query_range()
> xfs_currupt_helper()
> * corrupted on metadata
> try to recover data, call xfs_force_shutdown()
> * corrupted on file data
> try to recover data, call mf_dax_mapping_kill_procs()
> ...."

<nod> I dug up
https://lore.kernel.org/linux-xfs/[email protected]/

which I interpreted as Christoph asking Shiyang not to make the dax
device code go swerving through the block layer to call
->corrupted_range, since he was trying to separate the two entirely.
I didn't think he was shutting down the idea of block devices being able
to call ->corrupted_range to tell the filesystem that the user's $2
NVME<->STL<->USB bridge caught on fire.

> > That MF_MEM_REMOVE
> > patchset he's been trying to get us to merge would be a good starting
> > point for building something similar for block devices. AFAICT it does
> > the right thing if you hand it a subrange of the dax device or if you
> > pass it the customary (0, -1ULL) to mean "the entire device".
>
> *nod*
>
> That was exactly how I originally envisiaged that whole "bad device
> range" stack being used.
>
> > The block device version of that could be a lot simpler-- imagine if
> > "echo 0 > /sys/block/fd0/device/delete" resulted in the block layer
> > first sending us a notification that the device is about to be removed.
> > We could then flush the fs and try to freeze it. After the device
> > actually goes away, the blocy layer would send us a second notification
> > about DEVICE_DEAD and we could shut down the incore filesystem objects.
>
> *nod*
>
> But seeing this mechanism has already been shot down by the block
> layer devs, let's be a little less ambitious and just start with
> a simple, pre-existing "kill the filesystem" mechanism. Once we've
> got that in place and working, we can then expand on the error
> handling mechanism to perform notification of on more fine-grained
> storage errors...

<shrug> Seeing as LSF is next week, I'll ask the room about this when
I'm there.

--D

> -Dave.
> --
> Dave Chinner
> [email protected]

2023-05-04 03:11:00

by Baokun Li

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On 2023/4/28 13:47, Theodore Ts'o wrote:
> On Fri, Apr 28, 2023 at 11:47:26AM +0800, Baokun Li wrote:
>> Ext4 just detects I/O Error and remounts it as read-only, it doesn't know
>> if the current disk is dead or not.
>>
>> I asked Yu Kuai and he said that disk_live() can be used to determine
>> whether
>> a disk has been removed based on the status of the inode corresponding to
>> the block device, but this is generally not done in file systems.
> What really needs to happen is that del_gendisk() needs to inform file
> systems that the disk is gone, so that the file system can shutdown
> the file system and tear everything down.

Yes, first of all, we need to be able to detect whether the current disk has
been removed. Right now we only detect an I/O error, so we just make
the file system read-only.
>
> disk_live() is relatively new; it was added in August 2021. Back in
> 2015, I had added the following in fs/ext4/super.c:
>
> /*
> * The del_gendisk() function uninitializes the disk-specific data
> * structures, including the bdi structure, without telling anyone
> * else. Once this happens, any attempt to call mark_buffer_dirty()
> * (for example, by ext4_commit_super), will cause a kernel OOPS.
> * This is a kludge to prevent these oops until we can put in a proper
> * hook in del_gendisk() to inform the VFS and file system layers.
> */
> static int block_device_ejected(struct super_block *sb)
> {
> struct inode *bd_inode = sb->s_bdev->bd_inode;
> struct backing_dev_info *bdi = inode_to_bdi(bd_inode);
>
> return bdi->dev == NULL;
> }
>
> As the comment states, it's rather awkward to have the file system
> check to see if the block device is dead in various places; the real
> problem is that the block device shouldn't just *vanish*, with the
> block device structures getting partially de-initialized, without the
> block layer being polite enough to let the file system know.

I didn't notice the block_device_ejected() function, and it's really
awkward for the file system to detect whether the current disk has been
removed.

>> Those dirty pages that are already there are piling up and can't be
>> written back, which I think is a real problem. Can the block layer
>> clear those dirty pages when it detects that the disk is deleted?
> Well, the dirty pages belong to the file system, and so it needs to be
> up to the file system to clear out the dirty pages. But I'll also
> note that the right thing to do when a disk gets removed is not necessarily
> obvious.
Yes, I know that! If the block layer can find and clear these dirty
pages in a unified manner, there is no need to do this for each file system.

The subsequent solution is to declare the interface at the VFS layer and
have each file system implement it. When the block layer detects that the
disk has been deleted, it invokes that common interface at the VFS layer.
This also sounds good.
>
> For example, suppose some process has a file mmap'ed into its address
> space, and that file is on the disk which the user has rudely yanked
> out from their laptop; what is the right thing to do? Do we kill the
> process? Do we let the process write to the mmap'ed region, and
> silently let the modified data go *poof* when the process exits? What
> if there is an executable file on the removable disk, and there are
> one or more processes running that executable when the device
> disappears? Do we kill the process? Do we let the process run until
> it tries to access a page which hasn't been paged in and then kill the
> process?
>
> We should design a proper solution for What Should Happen when a
> removable disk gets removed unceremoniously without unmounting the
> file system first. It's not just a matter of making some tests go
> green....
>
> - Ted
>
>
Yes, we need to consider a variety of scenarios, which is not a simple
matter.
Thank you very much for your detailed explanation.

--
With Best Regards,
Baokun Li
.

2023-05-04 16:13:15

by Keith Busch

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
> Hello Guys,
>
> I got one report in which buffered write IO hangs in balance_dirty_pages,
> after one nvme block device is unplugged physically, then umount can't
> succeed.
>
> Turns out it is one long-term issue, and it can be triggered at least
> since v5.14 until the latest v6.3.
>
> And the issue can be reproduced reliably in KVM guest:
>
> 1) run the following script inside guest:
>
> mkfs.ext4 -F /dev/nvme0n1
> mount /dev/nvme0n1 /mnt
> dd if=/dev/zero of=/mnt/z.img&
> sleep 10
> echo 1 > /sys/block/nvme0n1/device/device/remove
>
> 2) dd hang is observed and /dev/nvme0n1 is gone actually

Sorry to jump in so late.

For an ungraceful nvme removal, like a surprise hot unplug, the driver
sets the capacity to 0 and that effectively ends all dirty page writers
that could stall forward progress on the removal. And that 0 capacity
should also cause 'dd' to exit.

But this is not an ungraceful removal, so we're not getting that forced
behavior. Could we use the same capacity trick here after flushing any
outstanding dirty pages?
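
i.e. something like this untested sketch, with a made-up helper name,
run after we've flushed whatever dirty pages we still can, which is
roughly what the surprise-removal path already does:

static void nvme_zap_namespace(struct nvme_ns *ns)
{
        /* a 0-sized device stops writers from generating new dirty pages */
        set_capacity_and_notify(ns->disk, 0);
        /* and any further I/O submitted to the disk fails immediately */
        blk_mark_disk_dead(ns->disk);
}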

2023-05-04 16:23:36

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Thu, May 04, 2023 at 09:59:52AM -0600, Keith Busch wrote:
> On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
> > Hello Guys,
> >
> > I got one report in which buffered write IO hangs in balance_dirty_pages,
> > after one nvme block device is unplugged physically, then umount can't
> > succeed.
> >
> > Turns out it is one long-term issue, and it can be triggered at least
> > since v5.14 until the latest v6.3.
> >
> > And the issue can be reproduced reliably in KVM guest:
> >
> > 1) run the following script inside guest:
> >
> > mkfs.ext4 -F /dev/nvme0n1
> > mount /dev/nvme0n1 /mnt
> > dd if=/dev/zero of=/mnt/z.img&
> > sleep 10
> > echo 1 > /sys/block/nvme0n1/device/device/remove
> >
> > 2) dd hang is observed and /dev/nvme0n1 is gone actually
>
> Sorry to jump in so late.
>
> For an ungraceful nvme removal, like a surpirse hot unplug, the driver
> sets the capacity to 0 and that effectively ends all dirty page writers
> that could stall forward progress on the removal. And that 0 capacity
> should also cause 'dd' to exit.
>
> But this is not an ungraceful removal, so we're not getting that forced
> behavior. Could we use the same capacity trick here after flushing any
> outstanding dirty pages?

There's a filesystem mounted on that block device, though. I don't
think the filesystem is going to notice the underlying block device
capacity change and break out of any of these functions.

2023-05-05 02:18:51

by Ming Lei

[permalink] [raw]
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Thu, May 04, 2023 at 09:59:52AM -0600, Keith Busch wrote:
> On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
> > Hello Guys,
> >
> > I got one report in which buffered write IO hangs in balance_dirty_pages,
> > after one nvme block device is unplugged physically, then umount can't
> > succeed.
> >
> > Turns out it is one long-term issue, and it can be triggered at least
> > since v5.14 until the latest v6.3.
> >
> > And the issue can be reproduced reliably in KVM guest:
> >
> > 1) run the following script inside guest:
> >
> > mkfs.ext4 -F /dev/nvme0n1
> > mount /dev/nvme0n1 /mnt
> > dd if=/dev/zero of=/mnt/z.img&
> > sleep 10
> > echo 1 > /sys/block/nvme0n1/device/device/remove
> >
> > 2) dd hang is observed and /dev/nvme0n1 is gone actually
>
> Sorry to jump in so late.
>
> For an ungraceful nvme removal, like a surpirse hot unplug, the driver
> sets the capacity to 0 and that effectively ends all dirty page writers
> that could stall forward progress on the removal. And that 0 capacity
> should also cause 'dd' to exit.

Actually the nvme device is already gone, and the hang just happens in
balance_dirty_pages() from generic_perform_write().

The issue should be triggered on all kinds of disks that can be hot-unplugged,
and it can be reproduced on both ublk and nvme easily.

>
> But this is not an ungraceful removal, so we're not getting that forced
> behavior. Could we use the same capacity trick here after flushing any
> outstanding dirty pages?

set_capacity(0) has been called in del_gendisk() after fsync_bdev() &
__invalidate_device(), but I understand FS code just tries its best to flush
dirty pages. And when the bdev is gone, these un-flushed dirty pages need
cleanup, otherwise they can't be used any more.


Thanks,
Ming