Removing a block device while iocost is initializing may cause a
null-pointer dereference:
CPU1                                  CPU2
ioc_qos_write
  blkcg_conf_open_bdev
    blkdev_get_no_open
      kobject_get_unless_zero
  blk_iocost_init
    rq_qos_add
                                      del_gendisk
                                        rq_qos_exit
                                          q->rq_qos = rqos->next
                                          //iocost is removed from q->rq_qos
    blkcg_activate_policy
      pd_init_fn
        ioc_pd_init
          ioc = q_to_ioc(blkg->q)
          //can't find iocost and return null
Fix the problem by moving rq_qos_exit() to disk_release(). ioc_qos_write()
gets bd_device.kobj in blkcg_conf_open_bdev(), so disk_release() will not
be invoked until iocost initialization has completed.
Signed-off-by: Li Nan <[email protected]>
---
block/genhd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/genhd.c b/block/genhd.c
index dcf200bcbd3e..0db440bbfefb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -656,7 +656,6 @@ void del_gendisk(struct gendisk *disk)
 		elevator_exit(q);
 		mutex_unlock(&q->sysfs_lock);
 	}
-	rq_qos_exit(q);
 	blk_mq_unquiesce_queue(q);
/*
@@ -1168,6 +1167,7 @@ static void disk_release(struct device *dev)
 	    !test_bit(GD_ADDED, &disk->state))
 		blk_mq_exit_queue(disk->queue);
+	rq_qos_exit(disk->queue);
 	blkcg_exit_disk(disk);
 	bioset_exit(&disk->bio_split);
--
2.31.1
On Wed, Nov 30, 2022 at 09:21:55PM +0800, Li Nan wrote:
> Removing a block device while iocost is initializing may cause a
> null-pointer dereference:
>
> CPU1                                  CPU2
> ioc_qos_write
>   blkcg_conf_open_bdev
>     blkdev_get_no_open
>       kobject_get_unless_zero
>   blk_iocost_init
>     rq_qos_add
>                                       del_gendisk
>                                         rq_qos_exit
>                                           q->rq_qos = rqos->next
>                                           //iocost is removed from q->rq_qos
>     blkcg_activate_policy
>       pd_init_fn
>         ioc_pd_init
>           ioc = q_to_ioc(blkg->q)
>           //can't find iocost and return null
>
> Fix the problem by moving rq_qos_exit() to disk_release(). ioc_qos_write()
> gets bd_device.kobj in blkcg_conf_open_bdev(), so disk_release() will not
> be invoked until iocost initialization has completed.
I think it'd be better to make del_gendisk() wait for these in-flight
operations: this might fix the above particular issue, but it exposes all
the policies to a request_queue in a state they never expected before.
del_gendisk() is quiescing the queue around rq_qos_exit(), so maybe we can
piggyback on that and update blkcg_conf_open_bdev() to provide such
exclusion?
Thanks.
--
tejun
Hi,
On 2022/12/01 4:50, Tejun Heo wrote:
> On Wed, Nov 30, 2022 at 09:21:55PM +0800, Li Nan wrote:
>> Removing a block device while iocost is initializing may cause a
>> null-pointer dereference:
>>
>> CPU1                                  CPU2
>> ioc_qos_write
>>   blkcg_conf_open_bdev
>>     blkdev_get_no_open
>>       kobject_get_unless_zero
>>   blk_iocost_init
>>     rq_qos_add
>>                                       del_gendisk
>>                                         rq_qos_exit
>>                                           q->rq_qos = rqos->next
>>                                           //iocost is removed from q->rq_qos
>>     blkcg_activate_policy
>>       pd_init_fn
>>         ioc_pd_init
>>           ioc = q_to_ioc(blkg->q)
>>           //can't find iocost and return null
>>
>> Fix the problem by moving rq_qos_exit() to disk_release(). ioc_qos_write()
>> gets bd_device.kobj in blkcg_conf_open_bdev(), so disk_release() will not
>> be invoked until iocost initialization has completed.
>
> I think it'd be better to make del_gendisk() wait for these in-flight
> operations: this might fix the above particular issue, but it exposes all
> the policies to a request_queue in a state they never expected before.
>
> del_gendisk() is quiescing the queue around rq_qos_exit(), so maybe we can
> piggyback on that and update blkcg_conf_open_bdev() to provide such
> exclusion?
Letting del_gendisk() wait for that sounds good, but I'm a little
confused about how to do that. Here is what I'm thinking about:
1) By mentioning that "del_gendisk() is quiescing the queue", do you
suggest adding rcu_read_lock()? This seems wrong because blk_iocost_init()
requires memory allocation.
2) Hold the gendisk open_mutex.
3) Use q_usage_counter; in the meantime, rq_qos_add() and
blkcg_activate_policy() will need refactoring to factor out queue
freezing.
4) Define a new mutex (see the sketch below).
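For 4), a purely illustrative sketch of where such a lock would sit. A
single static mutex is used only to keep the sketch self-contained (a real
change would want a per-queue lock), and the *_serialized() helpers are
hypothetical names, not existing interfaces:

static DEFINE_MUTEX(rq_qos_conf_mutex);

/* configuration side, e.g. called from ioc_qos_write() */
static int iocost_init_serialized(struct gendisk *disk)
{
	int ret;

	mutex_lock(&rq_qos_conf_mutex);
	if (!blk_queue_registered(disk->queue)) {
		/* blk_unregister_queue() already ran, removal has started */
		mutex_unlock(&rq_qos_conf_mutex);
		return -ENODEV;
	}
	/* rq_qos_add() + blkcg_activate_policy() run under the lock */
	ret = blk_iocost_init(disk);
	mutex_unlock(&rq_qos_conf_mutex);
	return ret;
}

/* removal side, e.g. called from del_gendisk() */
static void rq_qos_exit_serialized(struct request_queue *q)
{
	mutex_lock(&rq_qos_conf_mutex);	/* waits for any in-flight init */
	rq_qos_exit(q);
	mutex_unlock(&rq_qos_conf_mutex);
}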
Thanks,
Kuai
>
> Thanks.
>
On Thu, Dec 01, 2022 at 10:12:13AM +0800, Yu Kuai wrote:
> 1) By mentioning that "del_gendisk() is quiescing the queue", do you
> suggest adding rcu_read_lock()? This seems wrong because blk_iocost_init()
> requires memory allocation.
Quiescing uses SRCU so that should be fine but I'm not sure whether this is
the right one to piggyback on. Jens should have a better idea.
Thanks.
--
tejun
Hi,
On 2022/12/01 18:11, Tejun Heo wrote:
> On Thu, Dec 01, 2022 at 10:12:13AM +0800, Yu Kuai wrote:
>> 1) By mentioning that "del_gendisk() is quiescing the queue", do you
>> suggest adding rcu_read_lock()? This seems wrong because blk_iocost_init()
>> requires memory allocation.
>
> Quiescing uses SRCU so that should be fine but I'm not sure whether this is
> the right one to piggyback on. Jens should have a better idea.
>
> Thanks.
>
Currently, SRCU is used if BLK_MQ_F_BLOCKING is set; otherwise RCU is used.
dispatch:
#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
do {									\
	if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) {			\
		int srcu_idx;						\
									\
		might_sleep_if(check_sleep);				\
		srcu_idx = srcu_read_lock((q)->tag_set->srcu);		\
		(dispatch_ops);						\
		srcu_read_unlock((q)->tag_set->srcu, srcu_idx);		\
	} else {							\
		rcu_read_lock();					\
		(dispatch_ops);						\
		rcu_read_unlock();					\
	}								\
} while (0)
quiesce:
void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set)
{
	if (set->flags & BLK_MQ_F_BLOCKING)
		synchronize_srcu(set->srcu);
	else
		synchronize_rcu();
}
Thanks,
Kuai
On Thu, Dec 01, 2022 at 06:23:16PM +0800, Yu Kuai wrote:
> Hi,
>
> On 2022/12/01 18:11, Tejun Heo wrote:
> > On Thu, Dec 01, 2022 at 10:12:13AM +0800, Yu Kuai wrote:
> > > 1) By mentioning that "del_gendisk() is quiescing the queue", do you
> > > suggest adding rcu_read_lock()? This seems wrong because blk_iocost_init()
> > > requires memory allocation.
> >
> > Quiescing uses SRCU so that should be fine but I'm not sure whether this is
> > the right one to piggyback on. Jens should have a better idea.
> >
> > Thanks.
> >
>
> Currently, SRCU is used if BLK_MQ_F_BLOCKING is set; otherwise RCU is used.
>
> dispatch:
> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
> do {									\
> 	if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) {			\
> 		int srcu_idx;						\
> 									\
> 		might_sleep_if(check_sleep);				\
> 		srcu_idx = srcu_read_lock((q)->tag_set->srcu);		\
> 		(dispatch_ops);						\
> 		srcu_read_unlock((q)->tag_set->srcu, srcu_idx);		\
> 	} else {							\
> 		rcu_read_lock();					\
> 		(dispatch_ops);						\
> 		rcu_read_unlock();					\
> 	}								\
> } while (0)
>
> quiesce:
> void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set)
> {
> 	if (set->flags & BLK_MQ_F_BLOCKING)
> 		synchronize_srcu(set->srcu);
> 	else
> 		synchronize_rcu();
> }
Oh I see. Unfortunately, I don't know what to do off the top of my head.
Thanks.
--
tejun
Hi, Tejun!
While reviewing the rq_qos code, I found some other possible defects:
1) queue_lock is held to protect rq_qos_add() and rq_qos_del(), while
it's not held to protect rq_qos_exit(), which is absolutely not safe
because they can be called concurrently when configuring iocost and
removing the device.
I'm thinking about holding the lock to fetch the list and reset
q->rq_qos first:
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 88f0fe7dcf54..271ad65eebd9 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -288,9 +288,15 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
 
 void rq_qos_exit(struct request_queue *q)
 {
-	while (q->rq_qos) {
-		struct rq_qos *rqos = q->rq_qos;
-		q->rq_qos = rqos->next;
+	struct rq_qos *rqos;
+
+	spin_lock_irq(&q->queue_lock);
+	rqos = q->rq_qos;
+	q->rq_qos = NULL;
+	spin_unlock_irq(&q->queue_lock);
+
+	while (rqos) {
 		rqos->ops->exit(rqos);
+		rqos = rqos->next;
 	}
 }
2) rq_qos_add() can still succeed after rq_qos_exit() is done, which
will cause a memory leak. Hence a check is required before adding to
q->rq_qos. I'm thinking about the flag QUEUE_FLAG_DYING first, but that
flag will not be set if the disk is not marked GD_OWNS_QUEUE. Since
blk_unregister_queue() is called before rq_qos_exit(), using the queue
flag QUEUE_FLAG_REGISTERED should be OK.
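A reduced sketch of that check, for illustration only: the freeze,
duplicate-id and debugfs handling of the real rq_qos_add() is omitted,
and the _checked name is made up here:

static int rq_qos_add_checked(struct request_queue *q, struct rq_qos *rqos)
{
	int ret = 0;

	spin_lock_irq(&q->queue_lock);
	if (!blk_queue_registered(q)) {
		/* blk_unregister_queue() already ran, removal in progress */
		ret = -ENODEV;
	} else {
		rqos->next = q->rq_qos;
		q->rq_qos = rqos;
	}
	spin_unlock_irq(&q->queue_lock);

	return ret;
}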
For the current problem that the device can be removed while iocost is
initializing, I'm thinking about some possible solutions:
Since bfq is initialized during elevator initialization, and the others
during queue initialization, such a problem is only possible in iocost,
hence it makes sense to fix it in iocost:
1) Use the open mutex to prevent concurrency; however, configuring
iocost will then block some other operations that rely on open_mutex.
@@ -2889,7 +2889,15 @@ static int blk_iocost_init(struct gendisk *disk)
 	if (ret)
 		goto err_free_ioc;
+	mutex_lock(&disk->open_mutex);
+	if (!disk_live(disk)) {
+		mutex_unlock(&disk->open_mutex);
+		ret = -ENODEV;
+		goto err_del_qos;
+	}
+
 	ret = blkcg_activate_policy(q, &blkcg_policy_iocost);
+	mutex_unlock(&disk->open_mutex);
 	if (ret)
 		goto err_del_qos;
2) Like 1), but define a new mutex just in iocost.
3) Or is it better to fix it at a higher level? For example:
add a new restriction that blkcg_deactivate_policy() should be called
in pairs with blkcg_activate_policy(), and blkcg_deactivate_policy()
will wait for blkcg_activate_policy() to finish. Something like:
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ef4fef1af909..6266f702157f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1410,7 +1410,7 @@ int blkcg_activate_policy(struct request_queue *q,
 	struct blkcg_gq *blkg, *pinned_blkg = NULL;
 	int ret;
 
-	if (blkcg_policy_enabled(q, pol))
+	if (WARN_ON_ONCE(blkcg_policy_enabled(q, pol)))
 		return 0;
 
 	if (queue_is_mq(q))
@@ -1477,6 +1477,8 @@ int blkcg_activate_policy(struct request_queue *q,
 	blkg_put(pinned_blkg);
 	if (pd_prealloc)
 		pol->pd_free_fn(pd_prealloc);
+	if (!ret)
+		wake_up(q->policy_waitq);
 	return ret;
 
 enomem:
@@ -1512,7 +1514,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
 	struct blkcg_gq *blkg;
 
 	if (!blkcg_policy_enabled(q, pol))
-		return;
+		wait_event(q->policy_waitq, blkcg_policy_enabled(q, pol));
 	wait_event(q->xxx, blkcg_policy_enabled(q, pol));
Hello,
On Mon, Dec 05, 2022 at 05:32:17PM +0800, Yu Kuai wrote:
> 1) queue_lock is held to protect rq_qos_add() and rq_qos_del(), while
> it's not held to protect rq_qos_exit(), which is absolutely not safe
> because they can be called concurrently when configuring iocost and
> removing the device.
> I'm thinking about holding the lock to fetch the list and reset
> q->rq_qos first:
>
> diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
> index 88f0fe7dcf54..271ad65eebd9 100644
> --- a/block/blk-rq-qos.c
> +++ b/block/blk-rq-qos.c
> @@ -288,9 +288,15 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
>
>  void rq_qos_exit(struct request_queue *q)
>  {
> -	while (q->rq_qos) {
> -		struct rq_qos *rqos = q->rq_qos;
> -		q->rq_qos = rqos->next;
> +	struct rq_qos *rqos;
> +
> +	spin_lock_irq(&q->queue_lock);
> +	rqos = q->rq_qos;
> +	q->rq_qos = NULL;
> +	spin_unlock_irq(&q->queue_lock);
> +
> +	while (rqos) {
>  		rqos->ops->exit(rqos);
> +		rqos = rqos->next;
>  	}
>  }
>
> 2) rq_qos_add() can still succeed after rq_qos_exit() is done, which
> will cause a memory leak. Hence a check is required before adding to
> q->rq_qos. I'm thinking about the flag QUEUE_FLAG_DYING first, but that
> flag will not be set if the disk is not marked GD_OWNS_QUEUE. Since
> blk_unregister_queue() is called before rq_qos_exit(), using the queue
> flag QUEUE_FLAG_REGISTERED should be OK.
>
> For the current problem that the device can be removed while iocost is
> initializing, I'm thinking about some possible solutions:
>
> Since bfq is initialized during elevator initialization, and the others
> during queue initialization, such a problem is only possible in iocost,
> hence it makes sense to fix it in iocost:
So, iolatency is likely to switch to a similar lazy init scheme, so we'd
better fix it in the rq_qos / core block layer.
...
> 3) Or is it better to fix it at a higher level? For example:
> add a new restriction that blkcg_deactivate_policy() should be called
> in pairs with blkcg_activate_policy(), and blkcg_deactivate_policy()
> will wait for blkcg_activate_policy() to finish. Something like:
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index ef4fef1af909..6266f702157f 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1410,7 +1410,7 @@ int blkcg_activate_policy(struct request_queue *q,
>  	struct blkcg_gq *blkg, *pinned_blkg = NULL;
>  	int ret;
>
> -	if (blkcg_policy_enabled(q, pol))
> +	if (WARN_ON_ONCE(blkcg_policy_enabled(q, pol)))
>  		return 0;
>
>  	if (queue_is_mq(q))
> @@ -1477,6 +1477,8 @@ int blkcg_activate_policy(struct request_queue *q,
>  	blkg_put(pinned_blkg);
>  	if (pd_prealloc)
>  		pol->pd_free_fn(pd_prealloc);
> +	if (!ret)
> +		wake_up(q->policy_waitq);
>  	return ret;
>
>  enomem:
> @@ -1512,7 +1514,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
>  	struct blkcg_gq *blkg;
>
>  	if (!blkcg_policy_enabled(q, pol))
> -		return;
> +		wait_event(q->policy_waitq, blkcg_policy_enabled(q, pol));
>  	wait_event(q->xxx, blkcg_policy_enabled(q, pol));
Yeah, along this line but hopefully something simpler like a mutex.
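For illustration only, one way to read "something simpler like a mutex";
the lock and the *_locked() wrappers below are hypothetical names, and a
real change would presumably use a per-queue lock rather than a global one:

static DEFINE_MUTEX(blkcg_pol_conf_mutex);

int blkcg_activate_policy_locked(struct request_queue *q,
				 const struct blkcg_policy *pol)
{
	int ret;

	mutex_lock(&blkcg_pol_conf_mutex);
	ret = blkcg_activate_policy(q, pol);
	mutex_unlock(&blkcg_pol_conf_mutex);
	return ret;
}

void blkcg_deactivate_policy_locked(struct request_queue *q,
				    const struct blkcg_policy *pol)
{
	/* waits for any in-flight activation before deactivating */
	mutex_lock(&blkcg_pol_conf_mutex);
	blkcg_deactivate_policy(q, pol);
	mutex_unlock(&blkcg_pol_conf_mutex);
}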
Thanks.
--
tejun