For buffered (async) writes to a block device, the VFS does not find out when the device
is removed, so it keeps accepting writes.
Patch 1 sets the size of the block device inode to zero when the disk is removed, so that
the VFS knows the disk has changed.
Patch 2 adds a size check to blk_aio_write: if the write position is larger than the inode
size, it returns zero, so the user can check the disk state.
V2:
patch1:
- handle the case where the disk has partitions
- move i_size_write into invalidate_partition()
patch2: No change
Jianpeng Ma (2):
block: Set inode of block_device size to zero when delete gendisk.
block_dev: Add size check before doing async write on block device.
block/genhd.c | 6 ++++++
fs/block_dev.c | 4 ++++
2 files changed, 10 insertions(+)
--
1.8.4-rc0
majianpeng <[email protected]> writes:
> For async-write on block device,if device removed,but the vfs don't know it.
> It will continue to do.
> Patch1 set size of inode of block device to zero when removed disk.By this,vfs know
> disk changed.
> Path2 add size-check on blk_aio_write.If pos of write larger than size of inode,it will
> return zero.So the user can check disk state.
OK, so the basic problem is that __generic_file_aio_write will always
return 0 after device removal, yes? I'm not sure why that's a real
issue, can you explain exactly why you're trying to change this?
As for your patches, I don't think that putting the i_size_write into
invalidate_partitions is a good idea. Consider the case of rescanning
partitions: you will always detect a size change now, which is not good.
Cheers,
Jeff
>majianpeng <[email protected]> writes:
>
>> For async-write on block device,if device removed,but the vfs don't know it.
>> It will continue to do.
>> Patch1 set size of inode of block device to zero when removed disk.By this,vfs know
>> disk changed.
>> Path2 add size-check on blk_aio_write.If pos of write larger than size of inode,it will
>> return zero.So the user can check disk state.
>
>OK, so the basic problem is that __generic_file_aio_write will always
>return 0 after device removal, yes? I'm not sure why that's a real
>issue, can you explain exactly why you're trying to change this?
>
At present, __generic_file_aio_write does not return zero; it returns the requested size.
So the user cannot tell that the disk has been removed.
For example:
dd if=/dev/zero of=usb-disk bs=64k
If usb-disk is removed, dd does not stop until it reaches the end of usb-disk.
With this patch, aio-write returns zero once the disk has been removed. I think the caller will check for that
(or, if the size of the block device is zero, we could return -ENOSPC).
>As for your patches, I don't think that putting the i_size_write into
>invalidate_partitions is a good idea. Consider the case of rescanning
>partitions: you will always detect a size change now, which is not good.
>
Yes. But in rescan_partitions(), after invalidate_partitions, it calls check_disk_size_change to set the size of the block_device again.
Thanks!
Jianpeng Ma
>Cheers,
>Jeff
majianpeng <[email protected]> writes:
>>majianpeng <[email protected]> writes:
>>
>>> For async-write on block device,if device removed,but the vfs don't know it.
>>> It will continue to do.
>>> Patch1 set size of inode of block device to zero when removed disk.By this,vfs know
>>> disk changed.
>>> Path2 add size-check on blk_aio_write.If pos of write larger than size of inode,it will
>>> return zero.So the user can check disk state.
>>
>>OK, so the basic problem is that __generic_file_aio_write will always
>>return 0 after device removal, yes? I'm not sure why that's a real
>>issue, can you explain exactly why you're trying to change this?
>>
> At prenset, the __generic_file_aio_write don't return zero rather that the wanted size.
> So the user can't know the disk removed.
> For example:
> dd if=/dev/zero of=usb-disk bs=64k
> When removed usb-disk, dd stoped until reached the endof usb-disk.
Ah, right, it's just writing to the page cache. I think the only reason
you get more timely errors when doing the same thing to a file on a file
system is that there is some synchronous metadata or journal I/O that
will get EIO and result in the file system being set read-only.
The bigger question is whether we want to change this long-standing
behaviour of how our write-back cache works. I don't know that it's
really worth it, honestly. If you want to ensure data is on disk, you
open the file O_SYNC or you issue an fsync, and those calls will return
an error for a removed block device. So, I guess I'll ask the same
question again: why are you looking at this? Is there some application
you care about that does buffered I/O to the block device and never does
an fsync?
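
As a rough illustration of the point above, a userspace writer can pair its buffered writes with an fsync and check both return values. This is only a sketch: the helper name write_and_sync is hypothetical, and mapping a zero-byte write to -ENOSPC is an assumption made for the example, not something the kernel prescribes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from buf to fd, then fsync.
 * Returns 0 on success or a negative errno value on failure.
 */
static int write_and_sync(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	assert(buf != NULL);

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -errno;
		}
		if (n == 0)
			return -ENOSPC;	/* no progress; error code is an assumption */
		p += n;
		len -= (size_t)n;
	}

	/* Buffered writes may succeed even if the device is gone; fsync reports it. */
	if (fsync(fd) < 0)
		return -errno;
	return 0;
}
```

A caller would invoke this after each batch of buffered writes and treat any negative return as a sign that the data may not have reached the device.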
> Using this patch, after removed disk, the aio-write will return zero.I
> think the upper user will check. (or if the size of block is zero, we
> return -ENOSPC).
>
>>As for your patches, I don't think that putting the i_size_write into
>>invalidate_partitions is a good idea. Consider the case of rescanning
>>partitions: you will always detect a size change now, which is not good.
>>
> Yes.But in func rescan_partitions, after invalidate_partitions it will
> call check_disk_size_change to set size of block_device.
The problem with doing an i_size_write of 0 inside of
invalidate_partitions is that it isn't just called for the case where a
device is removed. A user can initiate a rescan of partitions. In such
a case, we don't want to evict all of the cached data for unchanged
partitions.
The call chain is like this:
blkdev_ioctl
  blkdev_reread_part
    rescan_partitions
      check_disk_size_change
Now look and see what check_disk_size_change will do when it finds out
that the size has changed:
void check_disk_size_change(struct gendisk *disk, struct block_device *bdev)
{
	loff_t disk_size, bdev_size;

	disk_size = (loff_t)get_capacity(disk) << 9;
	bdev_size = i_size_read(bdev->bd_inode);
	if (disk_size != bdev_size) {
		char name[BDEVNAME_SIZE];

		disk_name(disk, 0, name);
		printk(KERN_INFO
		       "%s: detected capacity change from %lld to %lld\n",
		       name, bdev_size, disk_size);
		i_size_write(bdev->bd_inode, disk_size);
		flush_disk(bdev, false);	<=============
	}
}
That will invalidate all of the metadata for any mounted file systems on
the device. Also, you'll get a big nasty warning if any files are dirty:
	printk(KERN_WARNING "VFS: busy inodes on changed media or "
	       "resized disk %s\n", name);
And the reality is that we haven't changed anything, so there's no need
for this.
After looking at the code further, why do you even need to add the
second patch? generic_write_checks will check for a write past the end
of the block device.
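
To make that concrete, here is a simplified userspace model of the size check as I recall it from the block-device branch of generic_write_checks in kernels of this era. It is a sketch from memory, not a copy of the kernel source: the function name bdev_write_check and the plain integer types are hypothetical, and the read-only check and overflow handling are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: isize is the block device inode size, pos the write
 * offset, *count the requested length. Returns 0 and possibly shortens
 * *count, or returns -ENOSPC when nothing can be written.
 */
static int bdev_write_check(int64_t isize, int64_t pos, size_t *count)
{
	assert(count != NULL);
	assert(isize >= 0 && pos >= 0);

	/* A non-empty write at or past the end of the device fails. */
	if (pos >= isize) {
		if (*count || pos > isize)
			return -ENOSPC;
	}

	/* A write that crosses the end is truncated to fit. */
	if (pos + (int64_t)*count > isize)
		*count = (size_t)(isize - pos);
	return 0;
}
```

Under this model, once the inode size has been set to zero, any non-empty write fails with -ENOSPC before it reaches the page cache, which is why a separate check in patch 2 looks redundant.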
Cheers,
Jeff
>majianpeng <[email protected]> writes:
>
>>>majianpeng <[email protected]> writes:
>>>
>>>> For async-write on block device,if device removed,but the vfs don't know it.
>>>> It will continue to do.
>>>> Patch1 set size of inode of block device to zero when removed disk.By this,vfs know
>>>> disk changed.
>>>> Path2 add size-check on blk_aio_write.If pos of write larger than size of inode,it will
>>>> return zero.So the user can check disk state.
>>>
>>>OK, so the basic problem is that __generic_file_aio_write will always
>>>return 0 after device removal, yes? I'm not sure why that's a real
>>>issue, can you explain exactly why you're trying to change this?
>>>
>> At prenset, the __generic_file_aio_write don't return zero rather that the wanted size.
>> So the user can't know the disk removed.
>> For example:
>> dd if=/dev/zero of=usb-disk bs=64k
>> When removed usb-disk, dd stoped until reached the endof usb-disk.
>
>Ah, right, it's just writing to the page cache. I think the only reason
>you get more timely errors when doing the same thing to a file on a file
>system is that there is some synchronous metadata or journal I/O that
>will get EIO and result in the file system being set read-only.
>
Yes
>The bigger question is whether we want to change this long-standing
>behaviour of how our write-back cache works. I don't know that it's
>really worth it, honestly. If you want to ensure data is on disk, you
>open the file O_SYNC or you issue an fsync, and those calls will return
>an error for a removed block device. So, I guess I'll ask the same
>question again: why are you looking at this? Is there some application
>you care about that does buffered I/O to the block device and never does
>an fsync?
>
Yes. At my company, we run our own userspace filesystem on top of a block device.
For performance, we use buffered writes rather than synchronous writes.
In our workload, users are allowed to remove a disk at any time, whether or not it is in use.
At present, we check the state of the disk by polling /proc/partitions at a fixed interval.
This patchset does not change how the write-back cache works. It only lets the VFS know the state of the underlying device.
I think that makes sense.
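
For reference, the polling check described above could look roughly like the following. This is only a sketch: the helper name partition_listed is hypothetical, and it assumes the usual four-column layout of /proc/partitions (major, minor, #blocks, name) with a header line and a blank line in front.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if a device called name is listed in a
 * /proc/partitions-style stream, 0 otherwise.
 */
static int partition_listed(FILE *fp, const char *name)
{
	char line[256];

	assert(fp != NULL && name != NULL);

	while (fgets(line, sizeof(line), fp)) {
		unsigned int major, minor;
		unsigned long long blocks;
		char dev[64];

		/* The header and blank lines do not match this format and are skipped. */
		if (sscanf(line, "%u %u %llu %63s",
			   &major, &minor, &blocks, dev) == 4 &&
		    strcmp(dev, name) == 0)
			return 1;
	}
	return 0;
}
```

A monitoring thread would open /proc/partitions at each interval, call this helper with the device name, and treat a 0 result as "disk removed".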
Thanks!
Jianpeng Ma
majianpeng <[email protected]> writes:
>>The bigger question is whether we want to change this long-standing
>>behaviour of how our write-back cache works. I don't know that it's
>>really worth it, honestly. If you want to ensure data is on disk, you
>>open the file O_SYNC or you issue an fsync, and those calls will return
>>an error for a removed block device. So, I guess I'll ask the same
>>question again: why are you looking at this? Is there some application
>>you care about that does buffered I/O to the block device and never does
>>an fsync?
>>
> Yes, for my company, we used our filesystem in userspace on block-device.
> For the performance, we used buffer-wrtite not sync-write.
> For my workload, we allow user to remove disk whether disk working or not.
> Now, we check the state of disk from /proc/partitions at the same interval.
>
> This patchset don't change write-back cache works.It only let vfs know
> the state of lower-device. I think it make a sense.
I'm still curious to know how you maintain a consistent file system
without the use of fsync, but that's an unrelated issue.
I looked at the rescan partition code path more closely, and it will
only really trigger if the partitions themselves aren't open. So, I
don't think there is a problem in your approach.
I'll ack patch 1. I still think patch 2 is not necessary. Please
correct me if I'm wrong.
Cheers,
Jeff
>majianpeng <[email protected]> writes:
>
>>>majianpeng <[email protected]> writes:
>>>
>>>> For async-write on block device,if device removed,but the vfs don't know it.
>>>> It will continue to do.
>>>> Patch1 set size of inode of block device to zero when removed disk.By this,vfs know
>>>> disk changed.
>>>> Path2 add size-check on blk_aio_write.If pos of write larger than size of inode,it will
>>>> return zero.So the user can check disk state.
>>>
>>>OK, so the basic problem is that __generic_file_aio_write will always
>>>return 0 after device removal, yes? I'm not sure why that's a real
>>>issue, can you explain exactly why you're trying to change this?
>>>
>> At prenset, the __generic_file_aio_write don't return zero rather that the wanted size.
>> So the user can't know the disk removed.
>> For example:
>> dd if=/dev/zero of=usb-disk bs=64k
>> When removed usb-disk, dd stoped until reached the endof usb-disk.
>
>Ah, right, it's just writing to the page cache. I think the only reason
>you get more timely errors when doing the same thing to a file on a file
>system is that there is some synchronous metadata or journal I/O that
>will get EIO and result in the file system being set read-only.
>
>The bigger question is whether we want to change this long-standing
>behaviour of how our write-back cache works. I don't know that it's
>really worth it, honestly. If you want to ensure data is on disk, you
>open the file O_SYNC or you issue an fsync, and those calls will return
>an error for a removed block device. So, I guess I'll ask the same
>question again: why are you looking at this? Is there some application
>you care about that does buffered I/O to the block device and never does
>an fsync?
>
>> Using this patch, after removed disk, the aio-write will return zero.I
>> think the upper user will check. (or if the size of block is zero, we
>> return -ENOSPC).
>>
>>>As for your patches, I don't think that putting the i_size_write into
>>>invalidate_partitions is a good idea. Consider the case of rescanning
>>>partitions: you will always detect a size change now, which is not good.
>>>
>> Yes.But in func rescan_partitions, after invalidate_partitions it will
>> call check_disk_size_change to set size of block_device.
>
>The problem with doing an i_size_write of 0 inside of
>invalidate_partitions is that it isn't just called for the case where a
>device is removed. A user can initiate a rescan of partitions. In such
>a case, we don't want to evict all of the cached data for unchanged
>partitions.
>
>The call chain is like this:
>
>blkdev_ioctl
>blkdev_reread_part
>rescan_partitions
>check_disk_size_change
>
>Now look and see what check_disk_size_change will do when it finds out
>that the size has changed:
>
>void check_disk_size_change(struct gendisk *disk, struct block_device
>*bdev)
>{
> loff_t disk_size, bdev_size;
>
> disk_size = (loff_t)get_capacity(disk) << 9;
> bdev_size = i_size_read(bdev->bd_inode);
> if (disk_size != bdev_size) {
> char name[BDEVNAME_SIZE];
>
> disk_name(disk, 0, name);
> printk(KERN_INFO
> "%s: detected capacity change from %lld to
> %lld\n",
> name, bdev_size, disk_size);
> i_size_write(bdev->bd_inode, disk_size);
> flush_disk(bdev, false); <=============
> }
>}
>
>That will invalidate all of the metadata for any mounted file systems on
>the device. Also, you'll get a big nasty warning if any files are dirty:
>
> printk(KERN_WARNING "VFS: busy inodes on changed media or "
> "resized disk %s\n", name);
>
>And the reality is that we haven't changed anything, so there's no need
>for this.
Yes. How about this code instead:
diff --git a/block/genhd.c b/block/genhd.c
index 791f419..c279b34 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -634,6 +634,7 @@ void del_gendisk(struct gendisk *disk)
 {
 	struct disk_part_iter piter;
 	struct hd_struct *part;
+	struct block_device *bdev;

 	disk_del_events(disk);

@@ -642,12 +643,25 @@ void del_gendisk(struct gendisk *disk)
 			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
 	while ((part = disk_part_iter_next(&piter))) {
 		invalidate_partition(disk, part->partno);
+		bdev = bdget_disk(disk, part->partno);
+		if (bdev) {
+			i_size_write(bdev->bd_inode, 0);
+			bdput(bdev);
+		}
+
 		delete_partition(disk, part->partno);
 	}
 	disk_part_iter_exit(&piter);

 	invalidate_partition(disk, 0);
 	set_capacity(disk, 0);
+
+	bdev = bdget_disk(disk, 0);
+	if (bdev) {
+		i_size_write(bdev->bd_inode, 0);
+		bdput(bdev);
+	}
+
 	disk->flags &= ~GENHD_FL_UP;

 	sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
This way, the inode size is only set to zero in del_gendisk.
>
>After looking at the code further, why do you even need to add the
>second patch? generic_write_checks will check for a write past the end
>of the block device.
>
Yes, generic_write_checks already checks the size, so patch 2 is not needed.
Thanks!
Jianpeng Ma