2015-12-07 18:05:19

by Suzuki K Poulose

Subject: [PATCH] blkdev: Fix blkdev_open to release the bdev on error

blkdev_open() doesn't release the bdev it attached to a given
inode if blkdev_get() fails (e.g., due to the absence of a device).
This can cause kernel crashes when the original filesystem
tries to flush the data during evict_inode.

This can be triggered easily with virtio-9p fs using the following
simple steps.

root@localhost:~# mknod disk b 9 1
root@localhost:~# cat disk
Unable to handle kernel NULL pointer dereference at virtual address 00000214
pgd = bea40000
[00000214] *pgd=be9eb831, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 1094 Comm: cat Not tainted 4.3.0 #3
Hardware name: Generic DT based system
task: bf186600 ti: be822000 task.ti: be822000
PC is at blk_get_backing_dev_info+0x4/0x10
LR is at __filemap_fdatawrite_range+0x88/0x94
pc : [<80317e00>] lr : [<801995d4>] psr: 60010013
sp : be823db0 ip : 00000000 fp : 00000024
r10: fffffffa r9 : be86a240 r8 : bec87e58
r7 : 00000001 r6 : 80615640 r5 : 7fffffff r4 : bec03354
r3 : 00000000 r2 : bf006c00 r1 : 7fffffff r0 : bec03200
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5383d Table: bea4006a DAC: 00000051
Process cat (pid: 1094, stack limit = 0xbe822210)

[... stack contents trimmed ...]

[<80317e00>] (blk_get_backing_dev_info) from [<801995d4>] (__filemap_fdatawrite_range+0x88/0x94)
[<801995d4>] (__filemap_fdatawrite_range) from [<80199608>] (filemap_fdatawrite+0x28/0x30)
[<80199608>] (filemap_fdatawrite) from [<802fa830>] (v9fs_evict_inode+0x20/0x3c)
[<802fa830>] (v9fs_evict_inode) from [<801ef4fc>] (evict+0xb0/0x188)
[<801ef4fc>] (evict) from [<801eb998>] (__dentry_kill+0x1ec/0x250)
[<801eb998>] (__dentry_kill) from [<801ec2d8>] (dput+0x188/0x28c)
[<801ec2d8>] (dput) from [<801e0858>] (path_put+0x10/0x1c)
[<801e0858>] (path_put) from [<801e08a0>] (terminate_walk+0x3c/0x98)
[<801e08a0>] (terminate_walk) from [<801e3d54>] (path_openat+0x1ec/0xeac)
[<801e3d54>] (path_openat) from [<801e56c8>] (do_filp_open+0x60/0xb4)
[<801e56c8>] (do_filp_open) from [<801d7850>] (do_sys_open+0x124/0x1d0)
[<801d7850>] (do_sys_open) from [<80107340>] (ret_fast_syscall+0x0/0x3c)
Code: 806d3ca0 80941b7c 807175d8 e590305c (e5930214)
---[ end trace b61b160a3217ae29 ]---

Fixes: e525fd89d380c4a94c0d63913a1dd1a593ed25e7
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: Al Viro <[email protected]>
Signed-off-by: Suzuki K. Poulose <[email protected]>
---
fs/block_dev.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c25639e..7d7f322 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1484,6 +1484,7 @@ EXPORT_SYMBOL(blkdev_get_by_dev);

 static int blkdev_open(struct inode * inode, struct file * filp)
 {
+	int rc;
 	struct block_device *bdev;

 	/*
@@ -1507,7 +1508,11 @@ static int blkdev_open(struct inode * inode, struct file * filp)

 	filp->f_mapping = bdev->bd_inode->i_mapping;

-	return blkdev_get(bdev, filp->f_mode, filp);
+	rc = blkdev_get(bdev, filp->f_mode, filp);
+	if (rc)
+		bd_forget(inode);
+
+	return rc;
 }

 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
--
1.7.9.5


2015-12-07 18:49:10

by Linus Torvalds

Subject: Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error

On Mon, Dec 7, 2015 at 10:05 AM, Suzuki K. Poulose
<[email protected]> wrote:
> blkdev_open() doesn't release the bdev, it attached to a given
> inode, if blkdev_get() fails (e.g, due to absence of a device).
> This can cause kernel crashes when the original filesystem
> tries to flush the data during evict_inode.

Ugh. This code is a mess. Al, can you please comment?

So what happens is that when "blkdev_get()" fails, it will do a
bdput() on the bdev.

But blkdev_open() hasn't done a bdget(). It's done a bd_acquire().
Which will do the whole "add inodes to bd_inodes". And yes,
bd_forget() will undo that.

HOWEVER.

bd_forget() will undo that unconditionally, but bd_acquire() has *not*
unconditionally done that bd_inodes list operation. It might already
have been there.

So as far as I can tell, the patch here undoes things potentially too much.
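
A minimal user-space sketch of the asymmetry described above, not kernel code: the `toy_*` names and the `on_list` flag are hypothetical stand-ins for the bd_inodes list operation, and the model assumes only what the paragraphs above state (acquire links conditionally, forget unlinks unconditionally).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; the struct and function names are hypothetical. */
struct toy_inode {
	bool on_list;		/* is this inode linked to the bdev? */
};

/* Conditional: links the inode only if it is not linked already. */
bool toy_acquire(struct toy_inode *inode)
{
	if (inode->on_list)
		return false;	/* association predates this call */
	inode->on_list = true;
	return true;		/* this call created the association */
}

/* Unconditional: unlinks no matter who created the association. */
void toy_forget(struct toy_inode *inode)
{
	inode->on_list = false;
}

/*
 * Simulate a failing open on an inode that may have been linked by an
 * earlier open, followed by an unconditional forget on the error path.
 * Returns true if the error path removed a link it did not create.
 */
bool toy_failed_open_overshoots(bool linked_before)
{
	struct toy_inode inode = { .on_list = linked_before };
	bool created = toy_acquire(&inode);

	toy_forget(&inode);	/* error path of the proposed patch */
	assert(!inode.on_list);
	return !created;
}
```

In this model the error path only "undoes too much" when the inode was already linked before the failing open.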

Shouldn't the last bdput() already end up doing a bd_forget()? We'd have

bdput -> iput -> iput_final -> evict -> bd_forget.

but the fact that Suzuki shows an oops clearly shows that something is
badly wrong.

IOW, the patch looks simple and apparently fixes an oops, but I'd like
much more of an explanation for what happens, because it all feels
wrong to me. Why doesn't the bdput() end up undoing the bd_acquire()
properly?

Linus

2015-12-08 07:25:16

by Al Viro

Subject: Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error

On Mon, Dec 07, 2015 at 06:05:03PM +0000, Suzuki K. Poulose wrote:
> blkdev_open() doesn't release the bdev, it attached to a given
> inode, if blkdev_get() fails (e.g, due to absence of a device).
> This can cause kernel crashes when the original filesystem
> tries to flush the data during evict_inode.
>
> This can be triggered easily with virtio-9p fs using the following
> simple steps.

???
How can filesystem type affect the behaviour of block devices?

Having mknod /tmp/splat b 8 1; rm /tmp/splat try to evict the pagecache
of /dev/sda1 is simply wrong, no matter what type /tmp happens to have.
And they must share pagecache, or you'll get one hell of cache coherency
problems. As it is, that pagecache belongs to inode on bdevfs (see
fs/block_dev.c; not mountable anywhere visible, the one and only mount is
internal). That inode is tied to struct bdev, ditto for its lifetime.

Block device inodes on anything else have their ->i_mapping pointing to
the corresponding (unique for given major/minor) inode on bdevfs; that
gives us the coherency, but that also means that their *own* pagecache
(->i_data) is empty. Which is just fine, since inode eviction should
get rid of everything in its embedded struct address_space. In case of
block device inodes on ext2, 9p, etc. that amounts to no pages at all.
In case of bdevfs, it contains the page cache of block device.
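
The relationship between ->i_mapping and ->i_data described above can be sketched in user space roughly as follows. This is a toy model under stated assumptions, not the kernel's definitions: the two structs keep only the fields mentioned in the text, `nrpages` stands in for the page cache contents, and every `toy_*` function is hypothetical.

```c
#include <assert.h>

/* Toy user-space model; these are NOT the kernel's definitions. */
struct address_space {
	int nrpages;				/* pages currently cached */
};

struct inode {
	struct address_space *i_mapping;	/* cache used for I/O */
	struct address_space i_data;		/* cache embedded in this inode */
};

/* Every inode starts out pointing at its own embedded cache. */
void toy_inode_init(struct inode *inode)
{
	inode->i_data.nrpages = 0;
	inode->i_mapping = &inode->i_data;
}

/* Opening a block device node redirects the alias to the bdevfs cache. */
void toy_alias_open(struct inode *alias, struct inode *bdevfs)
{
	alias->i_mapping = bdevfs->i_mapping;
}

/* Stand-in for dropping every page held in one address_space. */
void toy_drop_pages(struct address_space *mapping)
{
	mapping->nrpages = 0;
}

/* Returns the pages left in the shared cache after evicting the alias. */
int toy_evict_alias(int use_i_mapping)
{
	struct inode bdevfs, alias;

	toy_inode_init(&bdevfs);
	toy_inode_init(&alias);
	bdevfs.i_data.nrpages = 8;		/* the device has cached pages */
	toy_alias_open(&alias, &bdevfs);
	assert(alias.i_data.nrpages == 0);	/* alias's own cache is empty */

	if (use_i_mapping)
		toy_drop_pages(alias.i_mapping);	/* hits the shared cache */
	else
		toy_drop_pages(&alias.i_data);		/* touches only its own cache */

	return bdevfs.i_data.nrpages;
}
```

Evicting through ->i_mapping empties the cache shared by every alias of the device, while evicting through &->i_data leaves it alone.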

<looks>
Aha...
	truncate_inode_pages_final(inode->i_mapping);
	clear_inode(inode);
	filemap_fdatawrite(inode->i_mapping);

in there is obviously wrong - it should be

	truncate_inode_pages_final(&inode->i_data);
	clear_inode(inode);
	filemap_fdatawrite(&inode->i_data);

and if you check other filesystems' ->evict_inode() you'll see the same thing
there.

We should not do bd_forget() upon failing open() - what for? As long as
->i_rdev remains the same, the pointer to struct bdev is valid. It
doesn't pin bdev down; having it (or any other alias) opened does. When
we decide to evict bdev, *all* aliasing inodes are dissociated from it;
none of them is open at that point, so we are OK. When an aliasing inode
gets evicted, we have it dissociated from its ->i_bdev (if any). Since we
only access the ->i_mapping of aliasing inode while it's open, those places
are fine and anything that wants ->i_data of alias will simply find it empty.

AFAICS, the cause of your oopsen is that 9p evict_inode is accessing the
object it has no business to touch.

Could you confirm that the patch below fixes your problem?

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 699941e..5110785 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -451,9 +451,9 @@ void v9fs_evict_inode(struct inode *inode)
 {
 	struct v9fs_inode *v9inode = V9FS_I(inode);

-	truncate_inode_pages_final(inode->i_mapping);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
-	filemap_fdatawrite(inode->i_mapping);
+	filemap_fdatawrite(&inode->i_data);

 	v9fs_cache_inode_put_cookie(inode);
 	/* clunk the fid stashed in writeback_fid */

2015-12-08 07:59:00

by Al Viro

Subject: Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error

On Mon, Dec 07, 2015 at 10:49:05AM -0800, Linus Torvalds wrote:
> On Mon, Dec 7, 2015 at 10:05 AM, Suzuki K. Poulose
> <[email protected]> wrote:
> > blkdev_open() doesn't release the bdev, it attached to a given
> > inode, if blkdev_get() fails (e.g, due to absence of a device).
> > This can cause kernel crashes when the original filesystem
> > tries to flush the data during evict_inode.
>
> Ugh. This code is a mess. Al, can you please comment?
>
> So what happens is that when "blkdev_get()" fails, it will do a
> bdput() on the bdev.

Yes.

> But blkdev_open() hasn't done a bdget(). It's done a bd_acquire().
> Which will do the whole "add inodes to bd_inodes".

Yes.

> And yes,
> bd_forget() will undo that.

It would, but there's no reason to drop the cached pointer to bdev.

> IOW, the path looks simple and apparently fixes an oops, but I'd like
> much more of an explanation for what happens, because it all feels
> wrong to me. Why doesn't the bdput() end up undoing the bd_acquire()
> properly?

Because it doesn't work that way. ->i_bdev is just a cached result of
lookup by device number. Once found, it stays there for as long as
neither the struct inode nor struct block_device are freed.

It does *NOT* pin struct block_device. Note that we have two kinds of block
device inodes - ones coallocated with struct block_device (those are unique
per major/minor, live on bdevfs and can't be seen directly) and ones aliasing
the first kind. Those live on normal filesystems.

Pagecache lives in ->i_data of the bdevfs inode; aliasing ones have their
->i_data empty and ->i_mapping pointing to ->i_data of the bdevfs inode.
That guarantees the cache coherency between those guys.

Now, simply having ->i_bdev point to struct block_device does not affect
the lifetime of the latter in any way. All aliases are dissociated from
block_device when bdevfs inode is evicted; block_device is dissociated
from aliasing inode when that aliasing inode is evicted. bdev_lock
provides the atomicity there.

_Opening_ an alias (any of them) does pin block_device down. So when an
aliasing inode is open, we can safely use its ->i_mapping in normal
pagecache-related code and have everything work correctly. Accessing
->i_mapping when inode isn't open is valid only if filesystem code is
sure it's pointing to its own ->i_data (and pointless in any case).
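
The distinction between caching the pointer and pinning the device can be sketched as a toy user-space model. Everything here is an assumption made for illustration: the `toy_*` types and functions are hypothetical, `openers` stands in for whatever pins the block_device, and -ENXIO is just a plausible error for an absent device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model; all names are hypothetical, not kernel definitions. */
struct toy_bdev {
	int openers;		/* pins the device while non-zero */
	bool present;		/* is there a driver behind it? */
};

struct toy_alias {
	struct toy_bdev *i_bdev;	/* cached lookup result, no pin */
	bool is_open;
};

/* Lookup by device number: caches the pointer, takes no pin. */
void toy_lookup(struct toy_alias *alias, struct toy_bdev *bdev)
{
	alias->i_bdev = bdev;
}

/* Only a successful open pins the device. */
int toy_open(struct toy_alias *alias)
{
	if (!alias->i_bdev->present)
		return -ENXIO;	/* cached pointer is left in place */
	alias->i_bdev->openers++;
	alias->is_open = true;
	return 0;
}

/* The shared mapping may be used only while the alias is open. */
bool toy_shared_mapping_usable(const struct toy_alias *alias)
{
	return alias->i_bdev != NULL && alias->is_open;
}

/* Look up, try to open, and report whether the mapping may be used. */
bool toy_usable_after_open(bool device_present)
{
	struct toy_bdev bdev = { .openers = 0, .present = device_present };
	struct toy_alias alias = { .i_bdev = NULL, .is_open = false };

	toy_lookup(&alias, &bdev);
	(void)toy_open(&alias);
	assert(alias.i_bdev == &bdev);	/* cache survives either way */
	return toy_shared_mapping_usable(&alias);
}
```

After a failed open the model still holds the cached pointer, but nothing is pinned, so code running later (such as eviction) has no basis for touching the shared mapping.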

And that's what 9p ->evict_inode() is doing - it's trying to evict not
the pages in its ->i_data (which would be empty for block device), but
the pages in its ->i_mapping. IOW, the pagecache shared by all aliasing
inodes. Which is obviously bogus, regardless of lifetime rules violation -
mknod /tmp/foo b 8 1 && dd count=1 </tmp/foo >/dev/null && rm /tmp/foo
should not blow the cache of /dev/sda1, no matter which fs type we happen to
use for /tmp. And 9p will try to do just that.

Fortunately, no other ->evict_inode() instance is doing anything of that sort,
so we just need to fix that bogosity in 9p one.

As for the bdev eviction, bdput() acts exactly like iput(). In fact, it is
iput() of the coallocated bdevfs inode. It can stay around with zero
refcount; same as any other inode, memory pressure would eventually push
them out. It does *NOT* pin the driver when not opened, BTW.

Anyway, the fix for 9p bogosity follows; it definitely fixes a bug there,
and I'm fairly sure that it fixes the bug that had been reported.
A confirmation would be nice, of course...

Signed-off-by: Al Viro <[email protected]>
---
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 699941e..5110785 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -451,9 +451,9 @@ void v9fs_evict_inode(struct inode *inode)
 {
 	struct v9fs_inode *v9inode = V9FS_I(inode);

-	truncate_inode_pages_final(inode->i_mapping);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
-	filemap_fdatawrite(inode->i_mapping);
+	filemap_fdatawrite(&inode->i_data);

 	v9fs_cache_inode_put_cookie(inode);
 	/* clunk the fid stashed in writeback_fid */

2015-12-08 10:08:04

by Suzuki K Poulose

Subject: Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error

On 08/12/15 07:25, Al Viro wrote:
> On Mon, Dec 07, 2015 at 06:05:03PM +0000, Suzuki K. Poulose wrote:
>> blkdev_open() doesn't release the bdev, it attached to a given
>> inode, if blkdev_get() fails (e.g, due to absence of a device).
>> This can cause kernel crashes when the original filesystem
>> tries to flush the data during evict_inode.
>>
>> This can be triggered easily with virtio-9p fs using the following
>> simple steps.
>
> ???

> How can filesystem type affect the behaviour of block devices?
>

...

>
> We should not do bd_forget() upon failing open() - what for? As long as
> ->i_rdev remains the same, the pointer to struct bdev is valid. It
> doesn't pin bdev down; having it (or any other alias) opened does. When
> we decide to evict bdev, *all* aliasing inodes are dissociated from it;
> none of them is open at that point, so we are OK. When an aliasing inode
> gets evicted, we have it dissociated from its ->i_bdev (if any). Since we
> only access the ->i_mapping of aliasing inode while its open, those places
> are fine and anything that wants ->i_data of alias will simply find it empty.

Thanks for the detailed explanation. Surely my patch was not cooked up
on the full understanding of the bdev fs. Things are much more clear now.

> Could you confirm that the patch below fixes your problem?


Yes, it does solve the issue.

Thanks
Suzuki

2015-12-08 10:09:29

by Suzuki K Poulose

Subject: Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error

On 08/12/15 07:58, Al Viro wrote:
> On Mon, Dec 07, 2015 at 10:49:05AM -0800, Linus Torvalds wrote:
>> On Mon, Dec 7, 2015 at 10:05 AM, Suzuki K. Poulose
>> <[email protected]> wrote:

...

> Anyway, the fix for 9p bogosity follows; it definitely fixes a bug there,
> and I'm fairly sure that it fixes the bug that had been reported.
> A confirmation would be nice, of course...
>
> Signed-off-by: Al Viro <[email protected]>
> ---
> diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
> index 699941e..5110785 100644
> --- a/fs/9p/vfs_inode.c
> +++ b/fs/9p/vfs_inode.c
> @@ -451,9 +451,9 @@ void v9fs_evict_inode(struct inode *inode)
> {
> struct v9fs_inode *v9inode = V9FS_I(inode);
>
> - truncate_inode_pages_final(inode->i_mapping);
> + truncate_inode_pages_final(&inode->i_data);
> clear_inode(inode);
> - filemap_fdatawrite(inode->i_mapping);
> + filemap_fdatawrite(&inode->i_data);
>
> v9fs_cache_inode_put_cookie(inode);
> /* clunk the fid stashed in writeback_fid */
>

This patch fixes the problem:

Tested-by: Suzuki K. Poulose <[email protected]>

Thanks
Suzuki

2015-12-08 11:56:25

by Vegard Nossum

Subject: Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error

On 8 December 2015 at 11:08, Suzuki K. Poulose <[email protected]> wrote:
> On 08/12/15 07:58, Al Viro wrote:
>>
>> On Mon, Dec 07, 2015 at 10:49:05AM -0800, Linus Torvalds wrote:
>>>
>>> On Mon, Dec 7, 2015 at 10:05 AM, Suzuki K. Poulose
>>> <[email protected]> wrote:
>
>
> ...
>
>> Anyway, the fix for 9p bogosity follows; it definitely fixes a bug there,
>> and I'm fairly sure that it fixes the bug that had been reported.
>> A confirmation would be nice, of course...
>>
>> Signed-off-by: Al Viro <[email protected]>
>> ---
>> diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
>> index 699941e..5110785 100644
>> --- a/fs/9p/vfs_inode.c
>> +++ b/fs/9p/vfs_inode.c
>> @@ -451,9 +451,9 @@ void v9fs_evict_inode(struct inode *inode)
>> {
>> struct v9fs_inode *v9inode = V9FS_I(inode);
>>
>> - truncate_inode_pages_final(inode->i_mapping);
>> + truncate_inode_pages_final(&inode->i_data);
>> clear_inode(inode);
>> - filemap_fdatawrite(inode->i_mapping);
>> + filemap_fdatawrite(&inode->i_data);
>>
>> v9fs_cache_inode_put_cookie(inode);
>> /* clunk the fid stashed in writeback_fid */
>>
>
> This patch fixes the problem :
>
> Tested-by: Suzuki K. Poulose <[email protected]>
>
> Thanks
> Suzuki

FWIW, I think I reported the same issue here:

http://sourceforge.net/p/v9fs/mailman/message/34661239/

And Al's patch fixed it here too. Thanks,


Vegard