2022-11-03 02:50:24

by Yu Kuai

Subject: [PATCH 0/2] block: bugfix for bd_link_disk_holder()

From: Yu Kuai <[email protected]>

Yu Kuai (2):
block: don't allow a disk link holder to itself
block: fix use after free for bd_holder_dir

block/holder.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)

--
2.31.1



2022-11-03 02:51:07

by Yu Kuai

Subject: [PATCH 1/2] block: don't allow a disk link holder to itself

From: Yu Kuai <[email protected]>

Test procedures:

1) dmsetup create test --table "xxx sda", assume dm-0 is created
2) dmsetup suspend test
3) dmsetup reload test --table "xxx dm-0"
4) dmsetup resume test

Test result:

BUG: TASK stack guard page was hit at 00000000736a261f (stack is 000000008d12c88d..00000000c8dd82d5)
stack guard page: 0000 [#1] PREEMPT SMP
CPU: 29 PID: 946 Comm: systemd-udevd Not tainted 6.1.0-rc3-next-20221101-00006-g17640ca3b0ee #1295
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
RIP: 0010:dm_prepare_ioctl+0xf/0x1e0
Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9
RSP: 0018:ffffc90002090000 EFLAGS: 00010206
RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000
RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000
RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331
R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000
FS: 00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
dm_blk_ioctl+0x50/0x1c0
? dm_prepare_ioctl+0xe0/0x1e0
dm_blk_ioctl+0x88/0x1c0
dm_blk_ioctl+0x88/0x1c0
...... (same frame repeated many times)
dm_blk_ioctl+0x88/0x1c0
dm_blk_ioctl+0x88/0x1c0
blkdev_ioctl+0x184/0x3e0
__x64_sys_ioctl+0xa3/0x110
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fddf7306577
Code: b3 66 90 48 8b 05 11 89 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 88 8
RSP: 002b:00007ffd0b2ec318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00005634ef478320 RCX: 00007fddf7306577
RDX: 0000000000000000 RSI: 0000000000005331 RDI: 0000000000000007
RBP: 0000000000000007 R08: 00005634ef4843e0 R09: 0000000000000080
R10: 00007fddf75cfb38 R11: 0000000000000246 R12: 00000000030d4000
R13: 0000000000000000 R14: 0000000000000000 R15: 00005634ef48b800
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:dm_prepare_ioctl+0xf/0x1e0
Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9
RSP: 0018:ffffc90002090000 EFLAGS: 00010206
RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000
RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000
RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331
R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000
FS: 00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

After creating a dm device, the user can reload the dm device with a
table that references the device itself. This triggers infinite
recursion, because dm keeps looking up itself until the kernel stack
overflows.

Fix the problem by forbidding a disk from creating a holder link to
itself.

Signed-off-by: Yu Kuai <[email protected]>
---
block/holder.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/block/holder.c b/block/holder.c
index 5283bc804cc1..5fc68238ce3a 100644
--- a/block/holder.c
+++ b/block/holder.c
@@ -75,6 +75,9 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
struct bd_holder_disk *holder;
int ret = 0;

+ if (bdev->bd_disk == disk)
+ return -EINVAL;
+
mutex_lock(&disk->open_mutex);

WARN_ON_ONCE(!bdev->bd_holder);
--
2.31.1


2022-11-03 02:51:22

by Yu Kuai

Subject: [PATCH 2/2] block: fix use after free for bd_holder_dir

From: Yu Kuai <[email protected]>

Currently, the caller of bd_link_disk_holder() gets 'bdev' from
blkdev_get_by_dev(), which looks up 'bdev' by the device number 'dev'.
However, del_gendisk() can be called concurrently, so 'bd_holder_dir' can
be freed before bd_link_disk_holder() accesses it, and a use after free
is triggered.

t1: t2:
bdev = blkdev_get_by_dev
del_gendisk
kobject_put(bd_holder_dir)
kobject_free()
bd_link_disk_holder

Fix the problem by checking that the disk is still live and grabbing a
reference to 'bd_holder_dir' first in bd_link_disk_holder().

Signed-off-by: Yu Kuai <[email protected]>
---
block/holder.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/block/holder.c b/block/holder.c
index 5fc68238ce3a..1c6c5b132a92 100644
--- a/block/holder.c
+++ b/block/holder.c
@@ -78,19 +78,32 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
if (bdev->bd_disk == disk)
return -EINVAL;

- mutex_lock(&disk->open_mutex);
+ /*
+ * del_gendisk drops the initial reference to bd_holder_dir, so we
+ * need to keep our own here to allow for cleanup past that point.
+ */
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ return -ENODEV;
+ }

- WARN_ON_ONCE(!bdev->bd_holder);
+ kobject_get(bdev->bd_holder_dir);
+ mutex_unlock(&bdev->bd_disk->open_mutex);

+ mutex_lock(&disk->open_mutex);
+ WARN_ON_ONCE(!bdev->bd_holder);
holder = bd_find_holder_disk(bdev, disk);
if (holder) {
holder->refcnt++;
+ kobject_put(bdev->bd_holder_dir);
goto out_unlock;
}

holder = kzalloc(sizeof(*holder), GFP_KERNEL);
if (!holder) {
ret = -ENOMEM;
+ kobject_put(bdev->bd_holder_dir);
goto out_unlock;
}

@@ -101,16 +114,12 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
ret = __link_disk_holder(bdev, disk);
if (ret) {
kfree(holder);
+ kobject_put(bdev->bd_holder_dir);
goto out_unlock;
}
}

list_add(&holder->list, &disk->slave_bdevs);
- /*
- * del_gendisk drops the initial reference to bd_holder_dir, so we need
- * to keep our own here to allow for cleanup past that point.
- */
- kobject_get(bdev->bd_holder_dir);

out_unlock:
mutex_unlock(&disk->open_mutex);
--
2.31.1


2022-11-03 08:12:44

by Christoph Hellwig

Subject: Re: [PATCH 1/2] block: don't allow a disk link holder to itself

Yes, this is a good one. Please add a single-sentence blurb explaining
the issue before the test procedure?

With that this looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

On Thu, Nov 03, 2022 at 10:55:40AM +0800, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Test procedures:
>
> 1) dmsetup create test --table "xxx sda", assume dm-0 is created
> 2) dmsetup suspend test
> 3) dmsetup reload test --table "xxx dm-0"
> 4) dmsetup resume test

Can you wire this up for blocktests?

I've also been wondering how we could wire your other two test cases
up as I think they'd be very useful. Hopefully I can find some time
to inject delays with eBPF or something like it.

2022-11-03 09:08:02

by Christoph Hellwig

Subject: Re: [PATCH 2/2] block: fix use after free for bd_holder_dir

On Thu, Nov 03, 2022 at 10:55:41AM +0800, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Currently, the caller of bd_link_disk_holder() gets 'bdev' from
> blkdev_get_by_dev(), which looks up 'bdev' by the device number 'dev'.
> However, del_gendisk() can be called concurrently, so 'bd_holder_dir' can
> be freed before bd_link_disk_holder() accesses it, and a use after free
> is triggered.
>
> t1: t2:
> bdev = blkdev_get_by_dev
> del_gendisk
> kobject_put(bd_holder_dir)
> kobject_free()
> bd_link_disk_holder
>
> Fix the problem by checking that the disk is still live and grabbing a
> reference to 'bd_holder_dir' first in bd_link_disk_holder().

Looks good with some minor stylistic nitpicks:

> + if (!disk_live(bdev->bd_disk)) {
> + mutex_unlock(&bdev->bd_disk->open_mutex);
> + return -ENODEV;
> + }

This can use a goto out_unlock;

> holder->refcnt++;
> + kobject_put(bdev->bd_holder_dir);
> goto out_unlock;
> }
>
> holder = kzalloc(sizeof(*holder), GFP_KERNEL);
> if (!holder) {
> ret = -ENOMEM;
> + kobject_put(bdev->bd_holder_dir);
> goto out_unlock;
> }
>
> @@ -101,16 +114,12 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
> ret = __link_disk_holder(bdev, disk);
> if (ret) {
> kfree(holder);
> + kobject_put(bdev->bd_holder_dir);

And I think a goto out_put_holder and out_free_holder would clean this up
nicely.

> list_add(&holder->list, &disk->slave_bdevs);
> - /*
> - * del_gendisk drops the initial reference to bd_holder_dir, so we need
> - * to keep our own here to allow for cleanup past that point.
> - */
> - kobject_get(bdev->bd_holder_dir);

.. with this then jumping straight to out_unlock.
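
For illustration, the suggested error-path layout might look like this
(an untested sketch of the idea, not an actual patch):

```c
	holder = bd_find_holder_disk(bdev, disk);
	if (holder) {
		holder->refcnt++;
		goto out_put_holder;
	}

	holder = kzalloc(sizeof(*holder), GFP_KERNEL);
	if (!holder) {
		ret = -ENOMEM;
		goto out_put_holder;
	}
	...
	ret = __link_disk_holder(bdev, disk);
	if (ret)
		goto out_free_holder;
	...
	list_add(&holder->list, &disk->slave_bdevs);
	goto out_unlock;

out_free_holder:
	kfree(holder);
out_put_holder:
	kobject_put(bdev->bd_holder_dir);
out_unlock:
	mutex_unlock(&disk->open_mutex);
	return ret;
```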


We should repost a series with my first 7 patches and your two. I can do
that, but it might take some time as I just got through (minor) knee
surgery and am still at the hospital, so if you have spare cycles feel
free to do it.

2022-11-03 09:34:01

by Yu Kuai

Subject: Re: [PATCH 1/2] block: don't allow a disk link holder to itself

Hi,

On 2022/11/03 16:06, Christoph Hellwig wrote:
> Yes, this is a good one. Please add a single-sentence blurb explaining
> the issue before the test procedure?
Ok, I'll do that.
>
> With that this looks good:
>
> Reviewed-by: Christoph Hellwig <[email protected]>
>
> On Thu, Nov 03, 2022 at 10:55:40AM +0800, Yu Kuai wrote:
>> From: Yu Kuai <[email protected]>
>>
>> Test procedures:
>>
>> 1) dmsetup create test --table "xxx sda", assume dm-0 is created
>> 2) dmsetup suspend test
>> 3) dmsetup reload test --table "xxx dm-0"
>> 4) dmsetup resume test
>
> Can you wire this up for blocktests?

Of course I can. I'm not very familiar with how to add a new test to
blktests, so it might take me some time.
>
> I've also been wondering how we could wire your other two test cases
> up as I think they'd be very useful. Hopefully I can find some time
> to inject delays with eBPF or something like it.

The problem was originally found without the delay; the delay just makes
the reproduction 100% reliable, but it can still trigger without it.
Perhaps we can add the test without the delay for now.

Thanks,
Kuai


2022-11-03 10:09:35

by Christoph Hellwig

Subject: Re: [PATCH 2/2] block: fix use after free for bd_holder_dir

On Thu, Nov 03, 2022 at 05:45:25PM +0800, Yu Kuai wrote:
> This lock is different from the one at the current 'out_unlock';
> adding a new label would make the code more complex, I think.

Of course, same mistake as last time..

2022-11-03 10:34:22

by Yu Kuai

Subject: Re: [PATCH 2/2] block: fix use after free for bd_holder_dir

Hi,

On 2022/11/03 16:12, Christoph Hellwig wrote:
> On Thu, Nov 03, 2022 at 10:55:41AM +0800, Yu Kuai wrote:
>> From: Yu Kuai <[email protected]>
>>
>> Currently, the caller of bd_link_disk_holder() gets 'bdev' from
>> blkdev_get_by_dev(), which looks up 'bdev' by the device number 'dev'.
>> However, del_gendisk() can be called concurrently, so 'bd_holder_dir' can
>> be freed before bd_link_disk_holder() accesses it, and a use after free
>> is triggered.
>>
>> t1: t2:
>> bdev = blkdev_get_by_dev
>> del_gendisk
>> kobject_put(bd_holder_dir)
>> kobject_free()
>> bd_link_disk_holder
>>
>> Fix the problem by checking that the disk is still live and grabbing a
>> reference to 'bd_holder_dir' first in bd_link_disk_holder().
>
> Looks good with some minor stylistic nitpicks:
>
>> + if (!disk_live(bdev->bd_disk)) {
>> + mutex_unlock(&bdev->bd_disk->open_mutex);
>> + return -ENODEV;
>> + }
>
> This can use a goto out_unlock;

This lock is different from the one at the current 'out_unlock';
adding a new label would make the code more complex, I think.
>
>> holder->refcnt++;
>> + kobject_put(bdev->bd_holder_dir);
>> goto out_unlock;
>> }
>>
>> holder = kzalloc(sizeof(*holder), GFP_KERNEL);
>> if (!holder) {
>> ret = -ENOMEM;
>> + kobject_put(bdev->bd_holder_dir);
>> goto out_unlock;
>> }
>>
>> @@ -101,16 +114,12 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
>> ret = __link_disk_holder(bdev, disk);
>> if (ret) {
>> kfree(holder);
>> + kobject_put(bdev->bd_holder_dir);
>
> And I think a goto out_put_holder and out_free_holder would clean this up
> nicely.

Yes, you're right.
>
>> list_add(&holder->list, &disk->slave_bdevs);
>> - /*
>> - * del_gendisk drops the initial reference to bd_holder_dir, so we need
>> - * to keep our own here to allow for cleanup past that point.
>> - */
>> - kobject_get(bdev->bd_holder_dir);
>
> .. with this then jumping straight to out_unlock.

Ok, I'll do that in next version.
>
>
> We should repost a series with my first 7 patches and your two. I can do
> that, but it might take some time as I just got through (minor) knee
> surgery and am still at the hospital, so if you have spare cycles feel
> free to do it.

I'm glad to do that, and have a good rest!

Thanks,
Kuai