2022-12-17 03:23:25

by Yu Kuai

Subject: [PATCH -next 0/4] blk-cgroup: synchronize del_gendisk() with configuring cgroup policy

From: Yu Kuai <[email protected]>

iocost is initialized the first time it is configured, and that
initialization can race with del_gendisk(), which will cause a null
pointer dereference:

t1                            t2
ioc_qos_write
 blk_iocost_init
  rq_qos_add
                              del_gendisk
                               rq_qos_exit
                               //iocost is removed from q->rq_qos
  blkcg_activate_policy
   pd_init_fn
    ioc_pd_init
     ioc = q_to_ioc(blkg->q)
     //can't find iocost and return null

And iolatency is about to switch to the same lazy initialization.

This patchset fixes the problem by synchronizing rq_qos_add() and
blkcg_activate_policy() with rq_qos_exit().

Yu Kuai (4):
block/rq_qos: protect 'q->rq_qos' with queue_lock in rq_qos_exit()
block/rq_qos: fail rq_qos_add() after del_gendisk()
blk-cgroup: add a new interface blkcg_conf_close_bdev()
blk-cgroup: synchronize del_gendisk() with configuring cgroup policy

block/blk-cgroup.c | 12 ++++++++++--
block/blk-cgroup.h | 1 +
block/blk-iocost.c | 8 ++++----
block/blk-rq-qos.c | 25 ++++++++++++++++++++-----
block/blk-rq-qos.h | 17 +++++++++++++----
include/linux/blkdev.h | 1 +
6 files changed, 49 insertions(+), 15 deletions(-)

--
2.31.1


2022-12-17 03:23:32

by Yu Kuai

Subject: [PATCH -next 4/4] blk-cgroup: synchronize del_gendisk() with configuring cgroup policy

From: Yu Kuai <[email protected]>

iocost is initialized the first time it is configured, and that
initialization can race with del_gendisk(), which will cause a null
pointer dereference:

t1                            t2
ioc_qos_write
 blk_iocost_init
  rq_qos_add
                              del_gendisk
                               rq_qos_exit
                               //iocost is removed from q->rq_qos
  blkcg_activate_policy
   pd_init_fn
    ioc_pd_init
     ioc = q_to_ioc(blkg->q)
     //can't find iocost and return null

Fix the problem by adding a new mutex to request_queue and using it to
synchronize rq_qos_exit(), called from del_gendisk(), with configuring
cgroup policies.

Signed-off-by: Yu Kuai <[email protected]>
---
block/blk-cgroup.c | 3 +++
block/blk-rq-qos.c | 8 ++++++++
include/linux/blkdev.h | 1 +
3 files changed, 12 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ad612148cf3b..8dcdaacb52a1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -658,12 +658,14 @@ struct block_device *blkcg_conf_open_bdev(char **inputp)
 		return ERR_PTR(-ENODEV);
 	}
 
+	mutex_lock(&bdev->bd_queue->blkcg_pols_lock);
 	*inputp = input;
 	return bdev;
 }
 
 void blkcg_conf_close_bdev(struct block_device *bdev)
 {
+	mutex_unlock(&bdev->bd_queue->blkcg_pols_lock);
 	blkdev_put_no_open(bdev);
 }
 
@@ -1277,6 +1279,7 @@ int blkcg_init_disk(struct gendisk *disk)
 	int ret;
 
 	INIT_LIST_HEAD(&q->blkg_list);
+	mutex_init(&q->blkcg_pols_lock);
 
 	new_blkg = blkg_alloc(&blkcg_root, disk, GFP_KERNEL);
 	if (!new_blkg)
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index efffc6fa55db..86bccdfa1a43 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -290,6 +290,10 @@ void rq_qos_exit(struct request_queue *q)
 {
 	struct rq_qos *rqos;
 
+#ifdef CONFIG_BLK_CGROUP
+	mutex_lock(&q->blkcg_pols_lock);
+#endif
+
 	spin_lock_irq(&q->queue_lock);
 	rqos = q->rq_qos;
 	q->rq_qos = NULL;
@@ -300,4 +304,8 @@
 			rqos->ops->exit(rqos);
 		rqos = rqos->next;
 	} while (rqos);
+
+#ifdef CONFIG_BLK_CGROUP
+	mutex_unlock(&q->blkcg_pols_lock);
+#endif
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 301cf1cf4f2f..824d68a41a83 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -484,6 +484,7 @@ struct request_queue {
 	DECLARE_BITMAP		(blkcg_pols, BLKCG_MAX_POLS);
 	struct blkcg_gq		*root_blkg;
 	struct list_head	blkg_list;
+	struct mutex		blkcg_pols_lock;
 #endif
 
 	struct queue_limits	limits;
--
2.31.1

2022-12-17 03:23:46

by Yu Kuai

Subject: [PATCH -next 1/4] block/rq_qos: protect 'q->rq_qos' with queue_lock in rq_qos_exit()

From: Yu Kuai <[email protected]>

queue_lock is held to protect 'q->rq_qos' in rq_qos_add() and
rq_qos_del(); however, it's not held in rq_qos_exit(), while both
can operate on the list concurrently:

t1                      t2
// configure iocost     // remove device
blk_iocost_init         del_gendisk
 rq_qos_add              rq_qos_exit

'rqos->ops->exit' can't be called under a spinlock because it may
sleep; hence fix the problem by holding queue_lock only while
fetching the list and resetting q->rq_qos first.

Signed-off-by: Yu Kuai <[email protected]>
---
block/blk-rq-qos.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 88f0fe7dcf54..efffc6fa55db 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -288,9 +288,16 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
 
 void rq_qos_exit(struct request_queue *q)
 {
-	while (q->rq_qos) {
-		struct rq_qos *rqos = q->rq_qos;
-		q->rq_qos = rqos->next;
-		rqos->ops->exit(rqos);
-	}
+	struct rq_qos *rqos;
+
+	spin_lock_irq(&q->queue_lock);
+	rqos = q->rq_qos;
+	q->rq_qos = NULL;
+	spin_unlock_irq(&q->queue_lock);
+
+	do {
+		if (rqos->ops->exit)
+			rqos->ops->exit(rqos);
+		rqos = rqos->next;
+	} while (rqos);
 }
--
2.31.1

2022-12-19 22:04:09

by Tejun Heo

Subject: Re: [PATCH -next 0/4] blk-cgroup: synchronize del_gendisk() with configuring cgroup policy

Hello,

On Sat, Dec 17, 2022 at 11:09:04AM +0800, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> iocost is initialized when it's configured the first time, and iocost
> initializing can race with del_gendisk(), which will cause null pointer
> dereference:
>
> t1                            t2
> ioc_qos_write
>  blk_iocost_init
>   rq_qos_add
>                               del_gendisk
>                                rq_qos_exit
>                                //iocost is removed from q->rq_qos
>   blkcg_activate_policy
>    pd_init_fn
>     ioc_pd_init
>      ioc = q_to_ioc(blkg->q)
>      //can't find iocost and return null
>
> And iolatency is about to switch to the same lazy initialization.
>
> This patchset fixes the problem by synchronizing rq_qos_add() and
> blkcg_activate_policy() with rq_qos_exit().

So, the patchset seems a bit overly complicated to me. What do you think
about the following?

* These init/exit paths are super cold path, just protecting them with a
global mutex is probably enough. If we encounter a scalability problem,
it's easy to fix down the line.

* If we're synchronizing this with a mutex anyway, no need to grab the
queue_lock, right? rq_qos_add/del/exit() can all just hold the mutex.

* And we can keep the state tracking within rq_qos. When rq_qos_exit() is
called, mark it so that future adds will fail - be that a special ->next
value a queue flag or whatever.

Thanks.

--
tejun

2022-12-20 09:58:54

by Yu Kuai

Subject: Re: [PATCH -next 0/4] blk-cgroup: synchronize del_gendisk() with configuring cgroup policy

Hi,

On 2022/12/20 4:55, Tejun Heo wrote:
> Hello,
>
> On Sat, Dec 17, 2022 at 11:09:04AM +0800, Yu Kuai wrote:
>> From: Yu Kuai <[email protected]>
>>
>> iocost is initialized when it's configured the first time, and iocost
>> initializing can race with del_gendisk(), which will cause null pointer
>> dereference:
>>
>> t1                            t2
>> ioc_qos_write
>>  blk_iocost_init
>>   rq_qos_add
>>                               del_gendisk
>>                                rq_qos_exit
>>                                //iocost is removed from q->rq_qos
>>   blkcg_activate_policy
>>    pd_init_fn
>>     ioc_pd_init
>>      ioc = q_to_ioc(blkg->q)
>>      //can't find iocost and return null
>>
>> And iolatency is about to switch to the same lazy initialization.
>>
>> This patchset fixes the problem by synchronizing rq_qos_add() and
>> blkcg_activate_policy() with rq_qos_exit().
>
> So, the patchset seems a bit overly complicated to me. What do you think
> about the following?
>
> * These init/exit paths are super cold path, just protecting them with a
> global mutex is probably enough. If we encounter a scalability problem,
> it's easy to fix down the line.
>
> * If we're synchronizing this with a mutex anyway, no need to grab the
> queue_lock, right? rq_qos_add/del/exit() can all just hold the mutex.
>
> * And we can keep the state tracking within rq_qos. When rq_qos_exit() is
> called, mark it so that future adds will fail - be that a special ->next
> value a queue flag or whatever.

Yes, that sounds good. BTW, queue_lock is also used to protect
pd_alloc_fn/pd_init_fn, and we found that blkcg_activate_policy() is
problematic:

blkcg_activate_policy
 spin_lock_irq(&q->queue_lock);
 list_for_each_entry_reverse(blkg, &q->blkg_list
  pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN,...) -> failed

  spin_unlock_irq(&q->queue_lock);
  // releasing queue_lock here is problematic: this can cause
  // pd_offline_fn to be called without pd_init_fn
  pd_alloc_fn(__GFP_NOWARN,...)

If we are using a mutex to protect the rq_qos ops, it seems the right
thing to do is to also use the mutex to protect the blkcg_policy ops,
and this problem can be fixed because the mutex can be held while
allocating memory with GFP_KERNEL. What do you think?

Thanks,
Kuai
>
> Thanks.
>

2022-12-20 16:39:16

by Tejun Heo

Subject: Re: [PATCH -next 0/4] blk-cgroup: synchronize del_gendisk() with configuring cgroup policy

Hello,

On Tue, Dec 20, 2022 at 05:19:12PM +0800, Yu Kuai wrote:
> Yes, that sounds good. BTW, queue_lock is also used to protect
> pd_alloc_fn/pd_init_fn,and we found that blkcg_activate_policy() is
> problematic:
>
> blkcg_activate_policy
>  spin_lock_irq(&q->queue_lock);
>  list_for_each_entry_reverse(blkg, &q->blkg_list
>   pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN,...) -> failed
>
>   spin_unlock_irq(&q->queue_lock);
>   // releasing queue_lock here is problematic: this can cause
>   // pd_offline_fn to be called without pd_init_fn
>   pd_alloc_fn(__GFP_NOWARN,...)

So, if a blkg is destroyed while a policy is being activated, right?

> If we are using a mutex to protect the rq_qos ops, it seems the right
> thing to do is to also use the mutex to protect the blkcg_policy ops,
> and this problem can be fixed because the mutex can be held while
> allocating memory with GFP_KERNEL. What do you think?

One worry is that switching to a mutex can be more of a headache due to
destroy path synchronization. Another approach would be using a per-blkg
flag to track whether a blkg has been initialized.

Thanks.

--
tejun

2022-12-21 01:43:10

by Yu Kuai

Subject: Re: [PATCH -next 0/4] blk-cgroup: synchronize del_gendisk() with configuring cgroup policy

Hi,

On 2022/12/21 0:01, Tejun Heo wrote:
> Hello,
>
> On Tue, Dec 20, 2022 at 05:19:12PM +0800, Yu Kuai wrote:
>> Yes, that sounds good. BTW, queue_lock is also used to protect
>> pd_alloc_fn/pd_init_fn,and we found that blkcg_activate_policy() is
>> problematic:
>>
>> blkcg_activate_policy
>>  spin_lock_irq(&q->queue_lock);
>>  list_for_each_entry_reverse(blkg, &q->blkg_list
>>   pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN,...) -> failed
>>
>>   spin_unlock_irq(&q->queue_lock);
>>   // releasing queue_lock here is problematic: this can cause
>>   // pd_offline_fn to be called without pd_init_fn
>>   pd_alloc_fn(__GFP_NOWARN,...)
>
> So, if a blkg is destroyed while a policy is being activated, right?
Yes, removing a cgroup can race with this; for bfq, a null pointer
dereference will be triggered in bfq_pd_offline().

>
>> If we are using a mutex to protect the rq_qos ops, it seems the right
>> thing to do is to also use the mutex to protect the blkcg_policy ops,
>> and this problem can be fixed because the mutex can be held while
>> allocating memory with GFP_KERNEL. What do you think?
>
> One worry is that switching to mutex can be more headache due to destroy
> path synchronization. Another approach would be using a per-blkg flag to
> track whether a blkg has been initialized.
I think perhaps you mean a per-blkg_policy_data flag? A per-blkg flag
would not work in this case.

Thanks,
Kuai
>
> Thanks.
>

2022-12-21 03:45:28

by Yu Kuai

Subject: Re: [PATCH -next 0/4] blk-cgroup: synchronize del_gendisk() with configuring cgroup policy

Hi,

On 2022/12/21 9:10, Yu Kuai wrote:
> Hi,
>
> On 2022/12/21 0:01, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Dec 20, 2022 at 05:19:12PM +0800, Yu Kuai wrote:
>>> Yes, that sounds good. BTW, queue_lock is also used to protect
>>> pd_alloc_fn/pd_init_fn,and we found that blkcg_activate_policy() is
>>> problematic:
>>>
>>> blkcg_activate_policy
>>>  spin_lock_irq(&q->queue_lock);
>>>  list_for_each_entry_reverse(blkg, &q->blkg_list
>>>   pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN,...) -> failed
>>>
>>>   spin_unlock_irq(&q->queue_lock);
>>>   // releasing queue_lock here is problematic: this can cause
>>>   // pd_offline_fn to be called without pd_init_fn
>>>   pd_alloc_fn(__GFP_NOWARN,...)
>>
>> So, if a blkg is destroyed while a policy is being activated, right?
> Yes, removing a cgroup can race with this; for bfq, a null pointer
> dereference will be triggered in bfq_pd_offline().

BTW, we just found that pd_online_fn() is missed in
blkcg_activate_policy()... Currently this is not a problem because only
blk-throttle implements it, and blk-throttle is activated while creating
the blkg.

Thanks,
Kuai
>
>>
>>> If we are using a mutex to protect the rq_qos ops, it seems the right
>>> thing to do is to also use the mutex to protect the blkcg_policy ops,
>>> and this problem can be fixed because the mutex can be held while
>>> allocating memory with GFP_KERNEL. What do you think?
>>
>> One worry is that switching to mutex can be more headache due to destroy
>> path synchronization. Another approach would be using a per-blkg flag to
>> track whether a blkg has been initialized.
> I think perhaps you mean a per-blkg_policy_data flag? A per-blkg flag
> would not work in this case.
>
> Thanks,
> Kuai
>>
>> Thanks.
>>
>
> .
>

2022-12-21 11:09:53

by Christoph Hellwig

Subject: Re: [PATCH -next 0/4] blk-cgroup: synchronize del_gendisk() with configuring cgroup policy

On Tue, Dec 20, 2022 at 05:19:12PM +0800, Yu Kuai wrote:
> If we are using a mutex to protect the rq_qos ops, it seems the right
> thing to do is to also use the mutex to protect the blkcg_policy ops,
> and this problem can be fixed because the mutex can be held while
> allocating memory with GFP_KERNEL. What do you think?

Getting rid of the atomic allocations would be awesome.

FYI, I'm also in favor of everything that moves things out of
queue_lock into more dedicated locks. queue_lock is such an undocumented
mess of untargeted things that aren't related to each other that
splitting it up and removing it is becoming more and more important.