A problem was found in stable 5.10; the root cause is as below.

In the use of q_usage_counter of request_queue, blk_cleanup_queue uses
"wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter))"
to wait for q_usage_counter to become zero. However, if q_usage_counter
becomes zero quickly, percpu_ref_exit() will run and ref->data will be
freed, and another process may then hit a null-deref problem like below:
CPU0                                    CPU1
blk_mq_destroy_queue
  blk_freeze_queue
    blk_mq_freeze_queue_wait
                                        scsi_end_request
                                          percpu_ref_get
                                          ...
                                          percpu_ref_put
                                            atomic_long_sub_and_test
blk_put_queue
  kobject_put
    kref_put
      blk_release_queue
        percpu_ref_exit
          ref->data -> NULL
                                        ref->data->release(ref) -> null-deref
As suggested by Ming Lei, fix it by fetching the release method before
the reference count drops to zero.
Suggested-by: Ming Lei <[email protected]>
Signed-off-by: Zhong Jinghua <[email protected]>
---
include/linux/percpu-refcount.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index d73a1c08c3e3..11e717c95acb 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -331,8 +331,11 @@ static inline void percpu_ref_put_many(struct percpu_ref *ref, unsigned long nr)
 
 	if (__ref_is_percpu(ref, &percpu_count))
 		this_cpu_sub(*percpu_count, nr);
-	else if (unlikely(atomic_long_sub_and_test(nr, &ref->data->count)))
-		ref->data->release(ref);
+	else {
+		percpu_ref_func_t *release = ref->data->release;
+		if (unlikely(atomic_long_sub_and_test(nr, &ref->data->count)))
+			release(ref);
+	}
 
 	rcu_read_unlock();
 }
--
2.31.1
Hello,
On Tue, Dec 06, 2022 at 05:09:39PM +0800, Zhong Jinghua wrote:
> A problem was found in stable 5.10; the root cause is as below.
>
> In the use of q_usage_counter of request_queue, blk_cleanup_queue uses
> "wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter))"
> to wait for q_usage_counter to become zero. However, if q_usage_counter
> becomes zero quickly, percpu_ref_exit() will run and ref->data will be
> freed, and another process may then hit a null-deref problem like below:
>
> CPU0                                    CPU1
> blk_mq_destroy_queue
>   blk_freeze_queue
>     blk_mq_freeze_queue_wait
>                                         scsi_end_request
>                                           percpu_ref_get
>                                           ...
>                                           percpu_ref_put
>                                             atomic_long_sub_and_test
> blk_put_queue
>   kobject_put
>     kref_put
>       blk_release_queue
>         percpu_ref_exit
>           ref->data -> NULL
>                                         ref->data->release(ref) -> null-deref
>
I remember thinking about this a while ago. I don't think this fix works
as nicely as it may seem. Please correct me if I'm wrong.
q->q_usage_counter has the oddity that the lifetime of the percpu_ref
object isn't managed by the release function. The freeing is handled by
a separate path where it depends on the percpu_ref hitting 0. So here we
have 2 concurrent paths racing to run with 1 destroying the object. We
probably need blk_release_queue() to wait on percpu_ref's release
finishing, not starting.
I think the above works in this specific case because there is a
call_rcu() in blk_release_queue(). If there wasn't a call_rcu(),
then by the same logic we could delay ref->data->release(ref) further
and that could potentially lead to a use after free.
Ideally, I think fixing the race in q->q_usage_counter's pattern is
better than masking it here as I think we're being saved by the
call_rcu() call further down the object release path.
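To make that concrete, here is a minimal sketch of waiting on the release
finishing rather than on the count reaching zero. The type, field, and
function names below are hypothetical, not the actual block-layer code:

#include <linux/kernel.h>
#include <linux/completion.h>
#include <linux/percpu-refcount.h>

/* Hypothetical queue type -- the real struct request_queue differs. */
struct example_queue {
	struct percpu_ref	q_usage_counter;
	struct completion	ref_released;	/* assumed new field */
};

/* Release callback: runs once the refcount truly hits zero. */
static void example_usage_counter_release(struct percpu_ref *ref)
{
	struct example_queue *q =
		container_of(ref, struct example_queue, q_usage_counter);

	/* ... existing release work ... */

	/* Signal that ->release() has finished, not merely started. */
	complete(&q->ref_released);
}

/*
 * Teardown path: wait for the release callback itself to complete,
 * so no racing putter can still dereference ref->data below.
 */
static void example_release_queue(struct example_queue *q)
{
	wait_for_completion(&q->ref_released);
	percpu_ref_exit(&q->q_usage_counter);
}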
Thanks,
Dennis
> As suggested by Ming Lei, fix it by fetching the release method before
> the reference count drops to zero.
>
> Suggested-by: Ming Lei <[email protected]>
> Signed-off-by: Zhong Jinghua <[email protected]>
> ---
> include/linux/percpu-refcount.h | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
> index d73a1c08c3e3..11e717c95acb 100644
> --- a/include/linux/percpu-refcount.h
> +++ b/include/linux/percpu-refcount.h
> @@ -331,8 +331,11 @@ static inline void percpu_ref_put_many(struct percpu_ref *ref, unsigned long nr)
>
>  	if (__ref_is_percpu(ref, &percpu_count))
>  		this_cpu_sub(*percpu_count, nr);
> -	else if (unlikely(atomic_long_sub_and_test(nr, &ref->data->count)))
> -		ref->data->release(ref);
> +	else {
> +		percpu_ref_func_t *release = ref->data->release;
> +		if (unlikely(atomic_long_sub_and_test(nr, &ref->data->count)))
> +			release(ref);
> +	}
>
> rcu_read_unlock();
> }
> --
> 2.31.1
>
Hi,
On 2022/12/07 9:05, Dennis Zhou wrote:
> Hello,
>
> On Tue, Dec 06, 2022 at 05:09:39PM +0800, Zhong Jinghua wrote:
>> A problem was found in stable 5.10; the root cause is as below.
>>
>> In the use of q_usage_counter of request_queue, blk_cleanup_queue uses
>> "wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter))"
>> to wait for q_usage_counter to become zero. However, if q_usage_counter
>> becomes zero quickly, percpu_ref_exit() will run and ref->data will be
>> freed, and another process may then hit a null-deref problem like below:
>>
>> CPU0                                    CPU1
>> blk_mq_destroy_queue
>>   blk_freeze_queue
>>     blk_mq_freeze_queue_wait
>>                                         scsi_end_request
>>                                           percpu_ref_get
>>                                           ...
>>                                           percpu_ref_put
>>                                             atomic_long_sub_and_test
>> blk_put_queue
>>   kobject_put
>>     kref_put
>>       blk_release_queue
>>         percpu_ref_exit
>>           ref->data -> NULL
>>                                         ref->data->release(ref) -> null-deref
>>
>
> I remember thinking about this a while ago. I don't think this fix works
> as nicely as it may seem. Please correct me if I'm wrong.
>
> q->q_usage_counter has the oddity that the lifetime of the percpu_ref
> object isn't managed by the release function. The freeing is handled by
> a separate path where it depends on the percpu_ref hitting 0. So here we
> have 2 concurrent paths racing to run with 1 destroying the object. We
> probably need blk_release_queue() to wait on percpu_ref's release
> finishing, not starting.
>
> I think the above works in this specific case because there is a
> call_rcu() in blk_release_queue(). If there wasn't a call_rcu(),
> then by the same logic we could delay ref->data->release(ref) further
> and that could potentially lead to a use after free.
>
> Ideally, I think fixing the race in q->q_usage_counter's pattern is
> better than masking it here as I think we're being saved by the
> call_rcu() call further down the object release path.
Agree.
BTW, Wensheng previously sent a patch to fix this in the block layer:
https://www.spinics.net/lists/kernel/msg4615696.html.
Thanks,
Kuai
On Wed, Dec 7, 2022 at 9:08 AM Dennis Zhou <[email protected]> wrote:
>
> Hello,
>
> On Tue, Dec 06, 2022 at 05:09:39PM +0800, Zhong Jinghua wrote:
> > A problem was found in stable 5.10; the root cause is as below.
> >
> > In the use of q_usage_counter of request_queue, blk_cleanup_queue uses
> > "wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter))"
> > to wait for q_usage_counter to become zero. However, if q_usage_counter
> > becomes zero quickly, percpu_ref_exit() will run and ref->data will be
> > freed, and another process may then hit a null-deref problem like below:
> >
> > CPU0                                    CPU1
> > blk_mq_destroy_queue
> >   blk_freeze_queue
> >     blk_mq_freeze_queue_wait
> >                                         scsi_end_request
> >                                           percpu_ref_get
> >                                           ...
> >                                           percpu_ref_put
> >                                             atomic_long_sub_and_test
> > blk_put_queue
> >   kobject_put
> >     kref_put
> >       blk_release_queue
> >         percpu_ref_exit
> >           ref->data -> NULL
> >                                         ref->data->release(ref) -> null-deref
> >
>
> I remember thinking about this a while ago. I don't think this fix works
> as nicely as it may seem. Please correct me if I'm wrong.
>
> q->q_usage_counter has the oddity that the lifetime of the percpu_ref
> object isn't managed by the release function. The freeing is handled by
> a separate path where it depends on the percpu_ref hitting 0. So here we
> have 2 concurrent paths racing to run with 1 destroying the object. We
> probably need blk_release_queue() to wait on percpu_ref's release
> finishing, not starting.
>
> I think the above works in this specific case because there is a
> call_rcu() in blk_release_queue(). If there wasn't a call_rcu(),
> then by the same logic we could delay ref->data->release(ref) further
> and that could potentially lead to a use after free.
>
> Ideally, I think fixing the race in q->q_usage_counter's pattern is
> better than masking it here as I think we're being saved by the
> call_rcu() call further down the object release path.
The problem is actually in percpu_ref_is_zero(), which can return true
before ->release() is called, so any wait_event(percpu_ref_is_zero())
pattern carries this risk.
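For reference, percpu_ref_is_zero() around v5.10 boils down to roughly the
following (paraphrased from lib/percpu-refcount.c, not verbatim), which makes
the window visible: the count already reads as zero once
atomic_long_sub_and_test() succeeds, before the winning putter has invoked
->release():

/* Paraphrased from lib/percpu-refcount.c (~v5.10); not verbatim. */
bool percpu_ref_is_zero(struct percpu_ref *ref)
{
	unsigned long __percpu *percpu_count;
	unsigned long count, flags;

	if (__ref_is_percpu(ref, &percpu_count))
		return false;

	/* protect us from being destroyed */
	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
	if (ref->data)
		count = atomic_long_read(&ref->data->count);
	else
		count = ref->percpu_count_ptr >> __PERCPU_REF_FLAG_BITS;
	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);

	/*
	 * count can already be zero here while another CPU is still
	 * between atomic_long_sub_and_test() and ->release().
	 */
	return count == 0;
}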
It may not be easy to fix the issue cleanly in the block layer, but it can
be solved simply in percpu-refcount by adding a ->release_lock (spinlock)
to the counter, used to drain atomic_long_sub_and_test() & ->release()
from percpu_ref_exit(). Or simply reuse percpu_ref_switch_lock, as sketched
below.
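A rough sketch of that direction, reusing percpu_ref_switch_lock; this is
illustrative only and untested, and the function names are made up:

/*
 * Slow path of a put: make "count drops to zero" plus ->release()
 * one critical section w.r.t. percpu_ref_exit().
 */
static void example_put_slowpath(struct percpu_ref *ref, unsigned long nr)
{
	unsigned long flags;

	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
	if (unlikely(atomic_long_sub_and_test(nr, &ref->data->count)))
		ref->data->release(ref);
	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
}

static void example_exit_drain(struct percpu_ref *ref)
{
	unsigned long flags;

	/*
	 * Acquiring the same lock drains any putter that has already
	 * dropped the count to zero but not yet returned from
	 * ->release(); only then is it safe to free ref->data.
	 */
	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);

	/* ... existing percpu_ref_exit() teardown frees ref->data ... */
}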
Thanks,
Ming