If you saw a command "fallocate -k -l 1G /dev/vda" you probably think
that user want to preallocate space on thin-provision blkdev. Right?
What possibly can go wrong? Unfortunately you may destroy your filesystem
and kernel panic. The reason is the bug in blkdev_fallocate() which
unconditionally truncate bdev cache. But even if we fix this particular bug
there are other places where we still have to truncate blkdev cache even
if FS is mounted and holds some bh's
1) nbd: If server disconnected we call kill_bdev() which destroy bdev cache
2) bdev falloc{ FALLOC_FL_ZERO_RANGE, FALLOC_FL_PUNCH_HOLE } definitely
expect bdev cache to be truncated.
3) ioctl: BLKDISCARD also must truncate bdev cache
There is a discussion whenever we have to permit (2) and (3) on bdev with
active filesytem, why shouldn't we force bd_claim for this? But this is
advisory user-space interface, because by historical reasons we allow
direct_io to blkdev while fs is mounted.
I prefer to treat all three cases while FS is mounted as runtime errors.
Fs may be corrupted, but we should not panic.
This patchset guard fs/blk layer from panic in case of such runtime errors.
0001-bh-Prevent-panic-on-invalid-BHs
0002-block-protect-bdevname-from-null-pointer-bdev
0003-bio-Protect-submit_bio-from-bdevless-bio-s
0004-jbd2-use-stable-bdev-pointer
# Finally fix the bug with unconditional cache truncate on bdev
0005-block-truncate-page-cache-only-when-necessary-on-falloc
Testcases:
xfstests: ./check blockdev/004 blockdev/005
https://github.com/dmonakhov/xfstests/tree/blkdev-falloc-tests-v1
TODO: Prepare patch for util-linux fallocate(2) should claim bdev.
Signed-off-by: Dmitry Monakhov <[email protected]>
---
fs/block_dev.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00e..f4b13e1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2075,7 +2075,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
{
struct block_device *bdev = I_BDEV(bdev_file_inode(file));
struct request_queue *q = bdev_get_queue(bdev);
- struct address_space *mapping;
+ struct address_space *mapping = bdev->bd_inode->i_mapping;
loff_t end = start + len - 1;
loff_t isize;
int error;
@@ -2102,13 +2102,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
if ((start | len) & (bdev_logical_block_size(bdev) - 1))
return -EINVAL;
- /* Invalidate the page cache, including dirty pages. */
- mapping = bdev->bd_inode->i_mapping;
- truncate_inode_pages_range(mapping, start, end);
-
switch (mode) {
case FALLOC_FL_ZERO_RANGE:
case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
+ truncate_inode_pages_range(mapping, start, end);
error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
GFP_KERNEL, false);
break;
@@ -2116,12 +2113,14 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
/* Only punch if the device can do zeroing discard. */
if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
return -EOPNOTSUPP;
+ truncate_inode_pages_range(mapping, start, end);
error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
GFP_KERNEL, 0);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
if (!blk_queue_discard(q))
return -EOPNOTSUPP;
+ truncate_inode_pages_range(mapping, start, end);
error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
GFP_KERNEL, 0);
break;
--
2.9.3
This prevent us from panic if someone invalidate bh under us.
Signed-off-by: Dmitry Monakhov <[email protected]>
---
fs/jbd2/revoke.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index f9aefcd..e3b791d 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -459,7 +459,8 @@ int jbd2_journal_cancel_revoke(handle_t *handle, struct journal_head *jh)
* state machine will get very upset later on. */
if (need_cancel) {
struct buffer_head *bh2;
- bh2 = __find_get_block(bh->b_bdev, bh->b_blocknr, bh->b_size);
+ bh2 = __find_get_block(journal->j_dev, bh->b_blocknr,
+ bh->b_size);
if (bh2) {
if (bh2 != bh)
clear_buffer_revoked(bh2);
--
2.9.3
Idially this type of check should be handled at generic_make_request_checks
, but it is too late for bdevless bios, bad pointer was already
dereferenced at that point.
Signed-off-by: Dmitry Monakhov <[email protected]>
---
block/blk-core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/block/blk-core.c b/block/blk-core.c
index 275c1e4..e583ded 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2065,6 +2065,11 @@ EXPORT_SYMBOL(generic_make_request);
*/
blk_qc_t submit_bio(struct bio *bio)
{
+ if (WARN_ON_ONCE(!bio->bi_bdev)) {
+ bio_io_error(bio);
+ return BLK_QC_T_NONE;
+ }
+
/*
* If it's a regular read/write or a barrier with data attached,
* go through the normal accounting stuff before submission.
--
2.9.3
Some callers (usually error paths) call bdevname with null bdev.
Signed-off-by: Dmitry Monakhov <[email protected]>
---
block/partition-generic.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 7afb990..284de18 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -46,6 +46,8 @@ char *disk_name(struct gendisk *hd, int partno, char *buf)
const char *bdevname(struct block_device *bdev, char *buf)
{
+ if (unlikely(!bdev))
+ return snprintf(buf, BDEVNAME_SIZE, "unknown-block(null)");
return disk_name(bdev->bd_disk, bdev->bd_part->partno, buf);
}
--
2.9.3
- Convert BUG_ON to WARN_ON+EIO on submit_bh.
Leave BUG_ON(!bh->b_end_io) as is because this is static bug in
submission logic which can not be handled at runtime anyway.
unmapped BH is also special case which signal about user
misbehavior, so just dump error messageŃ
- guard __find_get_block from null pointer bdev.
Signed-off-by: Dmitry Monakhov <[email protected]>
---
fs/buffer.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 9196f2a..4c8ce74 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1355,8 +1355,12 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
struct buffer_head *
__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
{
- struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
+ struct buffer_head *bh;
+
+ if (WARN_ON_ONCE(!bdev))
+ return NULL;
+ bh = lookup_bh_lru(bdev, block, size);
if (bh == NULL) {
/* __find_get_block_slow will mark the page accessed */
bh = __find_get_block_slow(bdev, block);
@@ -3099,11 +3103,18 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
{
struct bio *bio;
- BUG_ON(!buffer_locked(bh));
- BUG_ON(!buffer_mapped(bh));
BUG_ON(!bh->b_end_io);
- BUG_ON(buffer_delay(bh));
- BUG_ON(buffer_unwritten(bh));
+
+ if (WARN_ON_ONCE(!buffer_locked(bh)))
+ goto bad_bh;
+ if (WARN_ON_ONCE(buffer_delay(bh)))
+ goto bad_bh;
+ if (WARN_ON_ONCE(buffer_unwritten(bh)))
+ goto bad_bh;
+ if (unlikely(!buffer_mapped(bh))) {
+ buffer_io_error(bh, ", bh not mapped");
+ goto bad_bh;
+ }
/*
* Only clear out a write error when rewriting
@@ -3143,6 +3154,9 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
submit_bio(bio);
return 0;
+bad_bh:
+ bh->b_end_io(bh, 0);
+ return -EIO;
}
int _submit_bh(int op, int op_flags, struct buffer_head *bh,
--
2.9.3
This look ok, but how did you manage to trigger this case? I think
we might have a deeper problem here.
why?
On Thu, Apr 06, 2017 at 04:02:49PM +0400, Dmitry Monakhov wrote:
> Signed-off-by: Dmitry Monakhov <[email protected]>
> ---
> fs/block_dev.c | 9 ++++-----
> 1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 2eca00e..f4b13e1 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -2075,7 +2075,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> {
> struct block_device *bdev = I_BDEV(bdev_file_inode(file));
> struct request_queue *q = bdev_get_queue(bdev);
> - struct address_space *mapping;
> + struct address_space *mapping = bdev->bd_inode->i_mapping;
> loff_t end = start + len - 1;
> loff_t isize;
> int error;
> @@ -2102,13 +2102,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> if ((start | len) & (bdev_logical_block_size(bdev) - 1))
> return -EINVAL;
>
> - /* Invalidate the page cache, including dirty pages. */
> - mapping = bdev->bd_inode->i_mapping;
> - truncate_inode_pages_range(mapping, start, end);
> -
> switch (mode) {
> case FALLOC_FL_ZERO_RANGE:
> case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
> + truncate_inode_pages_range(mapping, start, end);
> error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
> GFP_KERNEL, false);
> break;
> @@ -2116,12 +2113,14 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> /* Only punch if the device can do zeroing discard. */
> if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
> return -EOPNOTSUPP;
> + truncate_inode_pages_range(mapping, start, end);
> error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> GFP_KERNEL, 0);
> break;
> case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
> if (!blk_queue_discard(q))
> return -EOPNOTSUPP;
> + truncate_inode_pages_range(mapping, start, end);
> error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> GFP_KERNEL, 0);
> break;
> --
> 2.9.3
>
---end quoted text---
Christoph Hellwig <[email protected]> writes:
> why?
because it is not good thing to truncate page cache and fiew lines later
realize that feature is not supported !blk_queue_discard(q) and return ENOTSUPP.
Event more: if mode == FALLOC_FL_KEEP_SIZE then we do nothing and
return ENOTSUPP unconditionally.
IMHO (mode == FALLOC_FL_KEEP_SIZE) is sane API for thin-provision blkdevs
to preallocate space in advance. Nobody use it at the moment, but it may
be usefull in future.
>
> On Thu, Apr 06, 2017 at 04:02:49PM +0400, Dmitry Monakhov wrote:
>> Signed-off-by: Dmitry Monakhov <[email protected]>
>> ---
>> fs/block_dev.c | 9 ++++-----
>> 1 file changed, 4 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 2eca00e..f4b13e1 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -2075,7 +2075,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>> {
>> struct block_device *bdev = I_BDEV(bdev_file_inode(file));
>> struct request_queue *q = bdev_get_queue(bdev);
>> - struct address_space *mapping;
>> + struct address_space *mapping = bdev->bd_inode->i_mapping;
>> loff_t end = start + len - 1;
>> loff_t isize;
>> int error;
>> @@ -2102,13 +2102,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>> if ((start | len) & (bdev_logical_block_size(bdev) - 1))
>> return -EINVAL;
>>
>> - /* Invalidate the page cache, including dirty pages. */
>> - mapping = bdev->bd_inode->i_mapping;
>> - truncate_inode_pages_range(mapping, start, end);
>> -
>> switch (mode) {
>> case FALLOC_FL_ZERO_RANGE:
>> case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
>> + truncate_inode_pages_range(mapping, start, end);
>> error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
>> GFP_KERNEL, false);
>> break;
>> @@ -2116,12 +2113,14 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>> /* Only punch if the device can do zeroing discard. */
>> if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
>> return -EOPNOTSUPP;
>> + truncate_inode_pages_range(mapping, start, end);
>> error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
>> GFP_KERNEL, 0);
>> break;
>> case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
>> if (!blk_queue_discard(q))
>> return -EOPNOTSUPP;
>> + truncate_inode_pages_range(mapping, start, end);
>> error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
>> GFP_KERNEL, 0);
>> break;
>> --
>> 2.9.3
>>
> ---end quoted text---
Christoph Hellwig <[email protected]> writes:
> This look ok, but how did you manage to trigger this case?
# testcases
# TEST1
# Via bug in fallocate
truncate -l 1G img
losetup /dev/loop img
mkfs.ext4 -qF /dev/loop0
mkdir m
mount /dev/loop0 m
# command above truncate bdevs pagecache
xfs_io -c "falloc -k 0 32G" -d /dev/loop0
for ((i=0;i<100;i++));do
xfs_io -c "pwrite 0 4k" -d m/test-$i;
done
sync
# TEST2: NBD close_sock -> kill_bdev
mkdir -p a/mnt
cd a
truncate -s 1G img
mkfs.ext4 -qF img
qemu-nbd -c /dev/nbd0 img
mount /dev/nbd0 /mnt
cp -r /bin/ /mnt&
# Disconnect nbd while cp is active
qemu-nbd -d /dev/nbd0
sync
> I think
> we might have a deeper problem here.
Probably. It seems that !buffer_locked(bh) case should stay BUG_ON
because it is hard to make semi-correct decesion here.