by Teng Qi

[permalink] [raw]

Subject: Re: [bug] kernel: bpf: syscall: a possible sleep-in-atomic bug in __bpf_prog_put()

Hello!

> It would be great if you also print out in_interrupt() value, so we know
> whether softirq or nmi is enabled or not.

After adding the in_interrupt(), the interesting output cases are as follows:
[ 38.596580] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 256,
preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 0
[ 62.300608] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 256,
preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
[ 62.301179] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 0,
preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1

Based on these cases, the current code is safe for the first two cases, because
in_interrupt() in vfree() prevents sleeping.
However, the rcu_read_lock_held() is not reliable, so we cannot rely on it.
Considering all the discussions so far, I think the best plan now is to change
the condition in __bpf_prog_put() to ‘irqs_disabled() || in_atomic()’ and
provide examples for possible issues. This plan effectively addresses
more possible atomic contexts of __bpf_prog_put() without incurring
any additional cost.
-- Teng Qi

On Mon, Jun 12, 2023 at 8:02 AM Yonghong Song <[email protected]> wrote:
>
>
>
> On 6/11/23 6:02 AM, Teng Qi wrote:
> > Hello!
> >> BTW, please do create a test case, e.g, sockmap test case which
> >> can show the problem with existing code base.
> >
> > I add a printk in bpf_prog_put_deferred():
> > static void bpf_prog_put_deferred(struct work_struct *work)
> > {
> > // . . .
> > int inIrq = in_irq();
> > int irqsDisabled = irqs_disabled();
> > int preemptBits = preempt_count();
> > int inAtomic = in_atomic();
> > int rcuHeld = rcu_read_lock_held();
> > printk("bpf_prog_put: in_irq() %d, irqs_disabled() %d, preempt_count()
> > %d, in_atomic() %d, rcu_read_lock_held() %d",
> > inIrq, irqsDisabled, preemptBits, inAtomic, rcuHeld);
> > // . . .
> > }
> >
> > When running the selftest, I see the following output:
> > [255340.388339] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
> > preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
> > [255393.237632] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
> > preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
>
> It would be great if you also print out in_interrupt() value, so we know
> whether softirq or nmi is enabled or not.
>
> We cannot really WARN with !rcu_read_lock_held() since the
> __bpf_prog_put funciton is called in different contexts.
>
> Also, note that rcu_read_lock_held() may not be reliable. rcu subsystem
> will return 1 if not tracked or not sure about the result.
>
> >
> > Based on this output, I believe it is sufficient to construct a self-test case
> > for bpf_prog_put_deferred() called under preempt disabled or rcu read lock
> > region. However, I'm a bit confused about what I should do to build the
> > self-test case. Are we looking to create a checker that verifies the
> > context of bpf_prog_put_deferred() is valid? Or do we need a test case that
> > can trigger this bug?
> >
> > Could you show me more ideas to construct a self test case? I am not familiar
> > with it and have no idea.
>
> Okay, I see. It seems hard to create a test case with warnings since
> bpf_prog_put_deferred is called in different context. So some
> examples for possible issues (through code analysis) should be good enough.
>
> >
> > -- Teng Qi
> >
> > On Thu, May 25, 2023 at 3:34 AM Yonghong Song <[email protected]> wrote:
> >>
> >>
> >>
> >> On 5/24/23 5:42 AM, Teng Qi wrote:
> >>> Thank you.
> >>>
> >>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> >>>> value rcu_read_lock_held() could be 1 for some configurations regardless
> >>>> whether rcu_read_lock() is really held or not. In most cases,
> >>>> rcu_read_lock_held() is used in issuing potential warnings.
> >>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
> >>>
> >>> Sorry. I was not aware of the dependency of configurations of
> >>> rcu_read_lock_held().
> >>>
> >>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> >>>> can be !in_interrupt(), so any process-context will go to a workqueue.
> >>>
> >>> I agree that using !in_interrupt() as a condition is an acceptable solution.
> >>
> >> This should work although it could be conservative.
> >>
> >>>
> >>>> Alternatively, we could have another solution. We could add another
> >>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> >>>> will be done in rcu context.
> >>>
> >>> Implementing a new function like bpf_prog_put_rcu() is a solution that involves
> >>> more significant changes.
> >>
> >> Maybe we can change signature of bpf_prog_put instead? Like
> >> void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
> >> and inside bpf_prog_put we can add
> >> WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
> >>
> >>>
> >>>> So if in_interrupt(), do kvfree, otherwise,
> >>>> put into a workqueue.
> >>>
> >>> Shall we proceed with submitting a patch following this approach?
> >>
> >> You could choose either of the above although I think with newer
> >> bpf_prog_put() is better.
> >>
> >> BTW, please do create a test case, e.g, sockmap test case which
> >> can show the problem with existing code base.
> >>
> >>>
> >>> I would like to mention something unrelated to the possible bug. At this
> >>> moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
> >>> but not safe under other atomic contexts.
> >>> This disorder challenges our conventional belief, a monotonic incrementation
> >>> of limitations of the hierarchical atomic contexts, that programer needs
> >>> to be more and more careful to write code under rcu read lock, spin lock,
> >>> bh disable, interrupt...
> >>> This disorder can lead to unexpected consequences, such as code being safe
> >>> under interrupts but not safe under spin locks.
> >>> The disorder makes kernel programming more complex and may result in more bugs.
> >>> Even though we find a way to resolve the possible bug about the bpf_prog_put(),
> >>> I feel sad for undermining of kernel`s maintainability and disorder of
> >>> hierarchy of atomic contexts.
> >>>
> >>> -- Teng Qi
> >>>
> >>> On Tue, May 23, 2023 at 12:33 PM Yonghong Song <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 5/21/23 6:39 AM, Teng Qi wrote:
> >>>>> Thank you.
> >>>>>
> >>>>> > Your above analysis makes sense if indeed that kvfree cannot appear
> >>>>> > inside a spin lock region or RCU read lock region. But is it true?
> >>>>> > I checked a few code paths in kvfree/kfree. It is either guarded
> >>>>> > with local_irq_save/restore or by
> >>>>> > spin_lock_irqsave/spin_unlock_
> >>>>> > irqrestore, etc. Did I miss
> >>>>> > anything? Are you talking about RT kernel here?
> >>>>>
> >>>>> To see the sleepable possibility of kvfree, it is important to analyze the
> >>>>> following calling stack:
> >>>>> mm/util.c: 645 kvfree()
> >>>>> mm/vmalloc.c: 2763 vfree()
> >>>>>
> >>>>> In kvfree(), to call vfree, if the pointer addr points to memory
> >>>>> allocated by
> >>>>> vmalloc(), it calls vfree().
> >>>>> void kvfree(const void *addr)
> >>>>> {
> >>>>> if (is_vmalloc_addr(addr))
> >>>>> vfree(addr);
> >>>>> else
> >>>>> kfree(addr);
> >>>>> }
> >>>>>
> >>>>> In vfree(), in_interrupt() and might_sleep() need to be considered.
> >>>>> void vfree(const void *addr)
> >>>>> {
> >>>>> // ...
> >>>>> if (unlikely(in_interrupt()))
> >>>>> {
> >>>>> vfree_atomic(addr);
> >>>>> return;
> >>>>> }
> >>>>> // ...
> >>>>> might_sleep();
> >>>>> // ...
> >>>>> }
> >>>>
> >>>> Sorry. I didn't check vfree path. So it does look like that
> >>>> we need to pay special attention to non interrupt part.
> >>>>
> >>>>>
> >>>>> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
> >>>>> could have in_interrupt() == false and spin lock region which only disables
> >>>>> preemption also has in_interrupt() == false. So the kvfree() cannot appear
> >>>>> inside a spin lock region or RCU read lock region if the pointer addr points
> >>>>> to memory allocated by vmalloc().
> >>>>>
> >>>>> > > Therefore, we propose modifying the condition to include
> >>>>> > > in_atomic(). Could we
> >>>>> > > update the condition as follows: "in_irq() || irqs_disabled() ||
> >>>>> > > in_atomic()"?
> >>>>> > Thank you! We look forward to your feedback.
> >>>>>
> >>>>> We now think that ‘irqs_disabled() || in_atomic() ||
> >>>>> rcu_read_lock_held()’ is
> >>>>> more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
> >>>>> preempt count and rcu_read_lock_held() is for RCU read lock region.
> >>>>
> >>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> >>>> value rcu_read_lock_held() could be 1 for some configuraitons regardless
> >>>> whether rcu_read_lock() is really held or not. In most cases,
> >>>> rcu_read_lock_held() is used in issuing potential warnings.
> >>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
> >>>>
> >>>> I agree with your that 'irqs_disabled() || in_atomic()' makes sense
> >>>> since it covers process context local_irq_save() and spin_lock() cases.
> >>>>
> >>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> >>>> can be !in_interrupt(), so any process-context will go to a workqueue.
> >>>>
> >>>> Alternatively, we could have another solution. We could add another
> >>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> >>>> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
> >>>> put into a workqueue.
> >>>>
> >>>>
> >>>>>
> >>>>> -- Teng Qi
> >>>>>
> >>>>> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <[email protected]
> >>>>> <mailto:[email protected]>> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 5/19/23 7:18 AM, Teng Qi wrote:
> >>>>> > Thank you for your response.
> >>>>> > > Looks like you only have suspicion here. Could you find a real
> >>>>> violation
> >>>>> > > here where __bpf_prog_put() is called with !in_irq() &&
> >>>>> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
> >>>>> have not seen
> >>>>> > > things like that.
> >>>>> >
> >>>>> > For the complex conditions to call bpf_prog_put() with 1 refcnt,
> >>>>> we have
> >>>>> > been
> >>>>> > unable to really trigger this atomic violation after trying to
> >>>>> construct
> >>>>> > test cases manually. But we found that it is possible to show
> >>>>> cases with
> >>>>> > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
> >>>>> > For example, even a failed case, one of selftest cases of bpf,
> >>>>> netns_cookie,
> >>>>> > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
> >>>>> > only inside rcu read lock: The possible call stack is:
> >>>>> > net/core/sock_map.c: 615 bpf_sock_map_update()
> >>>>> > net/core/sock_map.c: 468 sock_map_update_common()
> >>>>> > net/core/sock_map.c: 217 sock_map_link()
> >>>>> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
> >>>>> >
> >>>>> > The files about netns_cookie include
> >>>>> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
> >>>>> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
> >>>>> inserted the
> >>>>> > following code in
> >>>>> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
> >>>>> > static int sock_map_update_common(..)
> >>>>> > {
> >>>>> > int inIrq = in_irq();
> >>>>> > int irqsDisabled = irqs_disabled();
> >>>>> > int preemptBits = preempt_count();
> >>>>> > int inAtomic = in_atomic();
> >>>>> > int rcuHeld = rcu_read_lock_held();
> >>>>> > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
> >>>>> > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
> >>>>> irqsDisabled,
> >>>>> > preemptBits, inAtomic, rcuHeld);
> >>>>> > }
> >>>>> >
> >>>>> > The output message is as follows:
> >>>>> > root@(none):/root/bpf# ./test_progs -t netns_cookie
> >>>>> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
> >>>>> > in_atomic() 0,
> >>>>> > rcu_read_lock_held() 1
> >>>>> > #113 netns_cookie:OK
> >>>>> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> >>>>> >
> >>>>> > We notice that there are numerous callers in kernel/, net/ and
> >>>>> drivers/,
> >>>>> > so we
> >>>>> > highly suggest modifying __bpf_prog_put() to address this gap.
> >>>>> The gap
> >>>>> > exists
> >>>>> > because __bpf_prog_put() is only safe under in_irq() ||
> >>>>> irqs_disabled()
> >>>>> > but not in_atomic() || rcu_read_lock_held(). The following code
> >>>>> snippet may
> >>>>> > mislead developers into thinking that bpf_prog_put() is safe in all
> >>>>> > contexts.
> >>>>> > if (in_irq() || irqs_disabled()) {
> >>>>> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> >>>>> > schedule_work(&aux->work);
> >>>>> > } else {
> >>>>> > bpf_prog_put_deferred(&aux->work);
> >>>>> > }
> >>>>> >
> >>>>> > Implicit dependency may lead to issues.
> >>>>> >
> >>>>> > > Any problem here?
> >>>>> > We mentioned it to demonstrate the possibility of kvfree() being
> >>>>> > called by __bpf_prog_put_noref().
> >>>>> >
> >>>>> > Thanks.
> >>>>> > -- Teng Qi
> >>>>> >
> >>>>> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <[email protected]
> >>>>> <mailto:[email protected]>
> >>>>> > <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > On 5/16/23 4:18 AM, [email protected]
> >>>>> <mailto:[email protected]>
> >>>>> > <mailto:[email protected]
> >>>>> <mailto:[email protected]>> wrote:
> >>>>> > > From: Teng Qi <[email protected]
> >>>>> <mailto:[email protected]>
> >>>>> > <mailto:[email protected]
> >>>>> <mailto:[email protected]>>>
> >>>>> > >
> >>>>> > > Hi, bpf developers,
> >>>>> > >
> >>>>> > > We are developing a static tool to check the matching between
> >>>>> > helpers and the
> >>>>> > > context of hooks. During our analysis, we have discovered some
> >>>>> > important
> >>>>> > > findings that we would like to report.
> >>>>> > >
> >>>>> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
> >>>>> function
> >>>>> > > bpf_prog_put_deferred() won`t be called in the condition of
> >>>>> > > ‘in_irq() || irqs_disabled()’.
> >>>>> > > if (in_irq() || irqs_disabled()) {
> >>>>> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> >>>>> > > schedule_work(&aux->work);
> >>>>> > > } else {
> >>>>> > >
> >>>>> > > bpf_prog_put_deferred(&aux->work);
> >>>>> > > }
> >>>>> > >
> >>>>> > > We suspect this condition exists because there might be
> >>>>> sleepable
> >>>>> > operations
> >>>>> > > in the callees of the bpf_prog_put_deferred() function:
> >>>>> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
> >>>>> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
> >>>>> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
> >>>>> > > kvfree(prog->aux->jited_linfo);
> >>>>> > > kvfree(prog->aux->linfo);
> >>>>> >
> >>>>> > Looks like you only have suspicion here. Could you find a real
> >>>>> > violation
> >>>>> > here where __bpf_prog_put() is called with !in_irq() &&
> >>>>> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
> >>>>> have not seen
> >>>>> > things like that.
> >>>>> >
> >>>>> > >
> >>>>> > > Additionally, we found that array prog->aux->jited_linfo is
> >>>>> > initialized in
> >>>>> > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
> >>>>> > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
> >>>>> > > sizeof(*prog->aux->jited_linfo),
> >>>>> bpf_memcg_flags(GFP_KERNEL |
> >>>>> > __GFP_NOWARN));
> >>>>> >
> >>>>> > Any problem here?
> >>>>> >
> >>>>> > >
> >>>>> > > Our question is whether the condition 'in_irq() ||
> >>>>> > irqs_disabled() == false' is
> >>>>> > > sufficient for calling 'kvfree'. We are aware that calling
> >>>>> > 'kvfree' within the
> >>>>> > > context of a spin lock or an RCU lock is unsafe.
> >>>>>
> >>>>> Your above analysis makes sense if indeed that kvfree cannot appear
> >>>>> inside a spin lock region or RCU read lock region. But is it true?
> >>>>> I checked a few code paths in kvfree/kfree. It is either guarded
> >>>>> with local_irq_save/restore or by
> >>>>> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
> >>>>> anything? Are you talking about RT kernel here?
> >>>>>
> >>>>>
> >>>>> > >
> >>>>> > > Therefore, we propose modifying the condition to include
> >>>>> > in_atomic(). Could we
> >>>>> > > update the condition as follows: "in_irq() ||
> >>>>> irqs_disabled() ||
> >>>>> > in_atomic()"?
> >>>>> > >
> >>>>> > > Thank you! We look forward to your feedback.
> >>>>> > >
> >>>>> > > Signed-off-by: Teng Qi <[email protected]
> >>>>> <mailto:[email protected]>
> >>>>> > <mailto:[email protected]
> >>>>> <mailto:[email protected]>>>
> >>>>> >
> >>>>>

2023-06-19 19:38:57

by Yonghong Song

[permalink] [raw]

Subject: Re: [bug] kernel: bpf: syscall: a possible sleep-in-atomic bug in __bpf_prog_put()

On 6/19/23 2:05 AM, Teng Qi wrote:
> Hello!
>
>> It would be great if you also print out in_interrupt() value, so we know
>> whether softirq or nmi is enabled or not.
>
> After adding the in_interrupt(), the interesting output cases are as follows:
> [ 38.596580] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 256,
> preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 0
> [ 62.300608] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 256,
> preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
> [ 62.301179] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 0,
> preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
>
> Based on these cases, the current code is safe for the first two cases, because
> in_interrupt() in vfree() prevents sleeping.
> However, the rcu_read_lock_held() is not reliable, so we cannot rely on it.
> Considering all the discussions so far, I think the best plan now is to change
> the condition in __bpf_prog_put() to ‘irqs_disabled() || in_atomic()’ and
> provide examples for possible issues. This plan effectively addresses
> more possible atomic contexts of __bpf_prog_put() without incurring
> any additional cost.

Thanks for analysis. In the above, in_atomic()=1 is due to
preempt_count()=256 which implies a softirq. Actually in_atomic() will
be true if preempt_disabled or in_interrupt. So I guess your previous
change

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a75c54b6f8a3..11df562e481b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2147,7 +2147,7 @@ static void __bpf_prog_put(struct bpf_prog *prog)
struct bpf_prog_aux *aux = prog->aux;

if (atomic64_dec_and_test(&aux->refcnt)) {
- if (in_irq() || irqs_disabled()) {
+ if (!in_interrupt()) {
INIT_WORK(&aux->work, bpf_prog_put_deferred);
schedule_work(&aux->work);
} else {

should be okay. Just need to explain in the commit message
- why with 'in_interrupt()' it is okay to call
bpf_prog_put_deferred() directly, and
- why with '!in_interrupt()' it is not okay to call
pf_prog_put_deferred() directly

Thanks!

> -- Teng Qi
>
> On Mon, Jun 12, 2023 at 8:02 AM Yonghong Song <[email protected]> wrote:
>>
>>
>>
>> On 6/11/23 6:02 AM, Teng Qi wrote:
>>> Hello!
>>>> BTW, please do create a test case, e.g, sockmap test case which
>>>> can show the problem with existing code base.
>>>
>>> I add a printk in bpf_prog_put_deferred():
>>> static void bpf_prog_put_deferred(struct work_struct *work)
>>> {
>>> // . . .
>>> int inIrq = in_irq();
>>> int irqsDisabled = irqs_disabled();
>>> int preemptBits = preempt_count();
>>> int inAtomic = in_atomic();
>>> int rcuHeld = rcu_read_lock_held();
>>> printk("bpf_prog_put: in_irq() %d, irqs_disabled() %d, preempt_count()
>>> %d, in_atomic() %d, rcu_read_lock_held() %d",
>>> inIrq, irqsDisabled, preemptBits, inAtomic, rcuHeld);
>>> // . . .
>>> }
>>>
>>> When running the selftest, I see the following output:
>>> [255340.388339] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
>>> preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
>>> [255393.237632] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
>>> preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
>>
>> It would be great if you also print out in_interrupt() value, so we know
>> whether softirq or nmi is enabled or not.
>>
>> We cannot really WARN with !rcu_read_lock_held() since the
>> __bpf_prog_put funciton is called in different contexts.
>>
>> Also, note that rcu_read_lock_held() may not be reliable. rcu subsystem
>> will return 1 if not tracked or not sure about the result.
>>
>>>
>>> Based on this output, I believe it is sufficient to construct a self-test case
>>> for bpf_prog_put_deferred() called under preempt disabled or rcu read lock
>>> region. However, I'm a bit confused about what I should do to build the
>>> self-test case. Are we looking to create a checker that verifies the
>>> context of bpf_prog_put_deferred() is valid? Or do we need a test case that
>>> can trigger this bug?
>>>
>>> Could you show me more ideas to construct a self test case? I am not familiar
>>> with it and have no idea.
>>
>> Okay, I see. It seems hard to create a test case with warnings since
>> bpf_prog_put_deferred is called in different context. So some
>> examples for possible issues (through code analysis) should be good enough.
>>
>>>
>>> -- Teng Qi
>>>
>>> On Thu, May 25, 2023 at 3:34 AM Yonghong Song <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 5/24/23 5:42 AM, Teng Qi wrote:
>>>>> Thank you.
>>>>>
>>>>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>>>>>> value rcu_read_lock_held() could be 1 for some configurations regardless
>>>>>> whether rcu_read_lock() is really held or not. In most cases,
>>>>>> rcu_read_lock_held() is used in issuing potential warnings.
>>>>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>>>>
>>>>> Sorry. I was not aware of the dependency of configurations of
>>>>> rcu_read_lock_held().
>>>>>
>>>>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>>>>>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>>>>
>>>>> I agree that using !in_interrupt() as a condition is an acceptable solution.
>>>>
>>>> This should work although it could be conservative.
>>>>
>>>>>
>>>>>> Alternatively, we could have another solution. We could add another
>>>>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>>>>>> will be done in rcu context.
>>>>>
>>>>> Implementing a new function like bpf_prog_put_rcu() is a solution that involves
>>>>> more significant changes.
>>>>
>>>> Maybe we can change signature of bpf_prog_put instead? Like
>>>> void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
>>>> and inside bpf_prog_put we can add
>>>> WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
>>>>
>>>>>
>>>>>> So if in_interrupt(), do kvfree, otherwise,
>>>>>> put into a workqueue.
>>>>>
>>>>> Shall we proceed with submitting a patch following this approach?
>>>>
>>>> You could choose either of the above although I think with newer
>>>> bpf_prog_put() is better.
>>>>
>>>> BTW, please do create a test case, e.g, sockmap test case which
>>>> can show the problem with existing code base.
>>>>
>>>>>
>>>>> I would like to mention something unrelated to the possible bug. At this
>>>>> moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
>>>>> but not safe under other atomic contexts.
>>>>> This disorder challenges our conventional belief, a monotonic incrementation
>>>>> of limitations of the hierarchical atomic contexts, that programer needs
>>>>> to be more and more careful to write code under rcu read lock, spin lock,
>>>>> bh disable, interrupt...
>>>>> This disorder can lead to unexpected consequences, such as code being safe
>>>>> under interrupts but not safe under spin locks.
>>>>> The disorder makes kernel programming more complex and may result in more bugs.
>>>>> Even though we find a way to resolve the possible bug about the bpf_prog_put(),
>>>>> I feel sad for undermining of kernel`s maintainability and disorder of
>>>>> hierarchy of atomic contexts.
>>>>>
>>>>> -- Teng Qi
>>>>>
>>>>> On Tue, May 23, 2023 at 12:33 PM Yonghong Song <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 5/21/23 6:39 AM, Teng Qi wrote:
>>>>>>> Thank you.
>>>>>>>
>>>>>>> > Your above analysis makes sense if indeed that kvfree cannot appear
>>>>>>> > inside a spin lock region or RCU read lock region. But is it true?
>>>>>>> > I checked a few code paths in kvfree/kfree. It is either guarded
>>>>>>> > with local_irq_save/restore or by
>>>>>>> > spin_lock_irqsave/spin_unlock_
>>>>>>> > irqrestore, etc. Did I miss
>>>>>>> > anything? Are you talking about RT kernel here?
>>>>>>>
>>>>>>> To see the sleepable possibility of kvfree, it is important to analyze the
>>>>>>> following calling stack:
>>>>>>> mm/util.c: 645 kvfree()
>>>>>>> mm/vmalloc.c: 2763 vfree()
>>>>>>>
>>>>>>> In kvfree(), to call vfree, if the pointer addr points to memory
>>>>>>> allocated by
>>>>>>> vmalloc(), it calls vfree().
>>>>>>> void kvfree(const void *addr)
>>>>>>> {
>>>>>>> if (is_vmalloc_addr(addr))
>>>>>>> vfree(addr);
>>>>>>> else
>>>>>>> kfree(addr);
>>>>>>> }
>>>>>>>
>>>>>>> In vfree(), in_interrupt() and might_sleep() need to be considered.
>>>>>>> void vfree(const void *addr)
>>>>>>> {
>>>>>>> // ...
>>>>>>> if (unlikely(in_interrupt()))
>>>>>>> {
>>>>>>> vfree_atomic(addr);
>>>>>>> return;
>>>>>>> }
>>>>>>> // ...
>>>>>>> might_sleep();
>>>>>>> // ...
>>>>>>> }
>>>>>>
>>>>>> Sorry. I didn't check vfree path. So it does look like that
>>>>>> we need to pay special attention to non interrupt part.
>>>>>>
>>>>>>>
>>>>>>> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
>>>>>>> could have in_interrupt() == false and spin lock region which only disables
>>>>>>> preemption also has in_interrupt() == false. So the kvfree() cannot appear
>>>>>>> inside a spin lock region or RCU read lock region if the pointer addr points
>>>>>>> to memory allocated by vmalloc().
>>>>>>>
>>>>>>> > > Therefore, we propose modifying the condition to include
>>>>>>> > > in_atomic(). Could we
>>>>>>> > > update the condition as follows: "in_irq() || irqs_disabled() ||
>>>>>>> > > in_atomic()"?
>>>>>>> > Thank you! We look forward to your feedback.
>>>>>>>
>>>>>>> We now think that ‘irqs_disabled() || in_atomic() ||
>>>>>>> rcu_read_lock_held()’ is
>>>>>>> more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
>>>>>>> preempt count and rcu_read_lock_held() is for RCU read lock region.
>>>>>>
>>>>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>>>>>> value rcu_read_lock_held() could be 1 for some configuraitons regardless
>>>>>> whether rcu_read_lock() is really held or not. In most cases,
>>>>>> rcu_read_lock_held() is used in issuing potential warnings.
>>>>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>>>>>
>>>>>> I agree with your that 'irqs_disabled() || in_atomic()' makes sense
>>>>>> since it covers process context local_irq_save() and spin_lock() cases.
>>>>>>
>>>>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>>>>>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>>>>>
>>>>>> Alternatively, we could have another solution. We could add another
>>>>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>>>>>> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
>>>>>> put into a workqueue.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> -- Teng Qi
>>>>>>>
>>>>>>> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <[email protected]
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 5/19/23 7:18 AM, Teng Qi wrote:
>>>>>>> > Thank you for your response.
>>>>>>> > > Looks like you only have suspicion here. Could you find a real
>>>>>>> violation
>>>>>>> > > here where __bpf_prog_put() is called with !in_irq() &&
>>>>>>> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>>>>>> have not seen
>>>>>>> > > things like that.
>>>>>>> >
>>>>>>> > For the complex conditions to call bpf_prog_put() with 1 refcnt,
>>>>>>> we have
>>>>>>> > been
>>>>>>> > unable to really trigger this atomic violation after trying to
>>>>>>> construct
>>>>>>> > test cases manually. But we found that it is possible to show
>>>>>>> cases with
>>>>>>> > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
>>>>>>> > For example, even a failed case, one of selftest cases of bpf,
>>>>>>> netns_cookie,
>>>>>>> > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
>>>>>>> > only inside rcu read lock: The possible call stack is:
>>>>>>> > net/core/sock_map.c: 615 bpf_sock_map_update()
>>>>>>> > net/core/sock_map.c: 468 sock_map_update_common()
>>>>>>> > net/core/sock_map.c: 217 sock_map_link()
>>>>>>> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
>>>>>>> >
>>>>>>> > The files about netns_cookie include
>>>>>>> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
>>>>>>> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
>>>>>>> inserted the
>>>>>>> > following code in
>>>>>>> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
>>>>>>> > static int sock_map_update_common(..)
>>>>>>> > {
>>>>>>> > int inIrq = in_irq();
>>>>>>> > int irqsDisabled = irqs_disabled();
>>>>>>> > int preemptBits = preempt_count();
>>>>>>> > int inAtomic = in_atomic();
>>>>>>> > int rcuHeld = rcu_read_lock_held();
>>>>>>> > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
>>>>>>> > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
>>>>>>> irqsDisabled,
>>>>>>> > preemptBits, inAtomic, rcuHeld);
>>>>>>> > }
>>>>>>> >
>>>>>>> > The output message is as follows:
>>>>>>> > root@(none):/root/bpf# ./test_progs -t netns_cookie
>>>>>>> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
>>>>>>> > in_atomic() 0,
>>>>>>> > rcu_read_lock_held() 1
>>>>>>> > #113 netns_cookie:OK
>>>>>>> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>>>>>>> >
>>>>>>> > We notice that there are numerous callers in kernel/, net/ and
>>>>>>> drivers/,
>>>>>>> > so we
>>>>>>> > highly suggest modifying __bpf_prog_put() to address this gap.
>>>>>>> The gap
>>>>>>> > exists
>>>>>>> > because __bpf_prog_put() is only safe under in_irq() ||
>>>>>>> irqs_disabled()
>>>>>>> > but not in_atomic() || rcu_read_lock_held(). The following code
>>>>>>> snippet may
>>>>>>> > mislead developers into thinking that bpf_prog_put() is safe in all
>>>>>>> > contexts.
>>>>>>> > if (in_irq() || irqs_disabled()) {
>>>>>>> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>>>>>> > schedule_work(&aux->work);
>>>>>>> > } else {
>>>>>>> > bpf_prog_put_deferred(&aux->work);
>>>>>>> > }
>>>>>>> >
>>>>>>> > Implicit dependency may lead to issues.
>>>>>>> >
>>>>>>> > > Any problem here?
>>>>>>> > We mentioned it to demonstrate the possibility of kvfree() being
>>>>>>> > called by __bpf_prog_put_noref().
>>>>>>> >
>>>>>>> > Thanks.
>>>>>>> > -- Teng Qi
>>>>>>> >
>>>>>>> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <[email protected]
>>>>>>> <mailto:[email protected]>
>>>>>>> > <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On 5/16/23 4:18 AM, [email protected]
>>>>>>> <mailto:[email protected]>
>>>>>>> > <mailto:[email protected]
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> > > From: Teng Qi <[email protected]
>>>>>>> <mailto:[email protected]>
>>>>>>> > <mailto:[email protected]
>>>>>>> <mailto:[email protected]>>>
>>>>>>> > >
>>>>>>> > > Hi, bpf developers,
>>>>>>> > >
>>>>>>> > > We are developing a static tool to check the matching between
>>>>>>> > helpers and the
>>>>>>> > > context of hooks. During our analysis, we have discovered some
>>>>>>> > important
>>>>>>> > > findings that we would like to report.
>>>>>>> > >
>>>>>>> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
>>>>>>> function
>>>>>>> > > bpf_prog_put_deferred() won`t be called in the condition of
>>>>>>> > > ‘in_irq() || irqs_disabled()’.
>>>>>>> > > if (in_irq() || irqs_disabled()) {
>>>>>>> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>>>>>> > > schedule_work(&aux->work);
>>>>>>> > > } else {
>>>>>>> > >
>>>>>>> > > bpf_prog_put_deferred(&aux->work);
>>>>>>> > > }
>>>>>>> > >
>>>>>>> > > We suspect this condition exists because there might be
>>>>>>> sleepable
>>>>>>> > operations
>>>>>>> > > in the callees of the bpf_prog_put_deferred() function:
>>>>>>> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
>>>>>>> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
>>>>>>> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
>>>>>>> > > kvfree(prog->aux->jited_linfo);
>>>>>>> > > kvfree(prog->aux->linfo);
>>>>>>> >
>>>>>>> > Looks like you only have suspicion here. Could you find a real
>>>>>>> > violation
>>>>>>> > here where __bpf_prog_put() is called with !in_irq() &&
>>>>>>> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>>>>>> have not seen
>>>>>>> > things like that.
>>>>>>> >
>>>>>>> > >
>>>>>>> > > Additionally, we found that array prog->aux->jited_linfo is
>>>>>>> > initialized in
>>>>>>> > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
>>>>>>> > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
>>>>>>> > > sizeof(*prog->aux->jited_linfo),
>>>>>>> bpf_memcg_flags(GFP_KERNEL |
>>>>>>> > __GFP_NOWARN));
>>>>>>> >
>>>>>>> > Any problem here?
>>>>>>> >
>>>>>>> > >
>>>>>>> > > Our question is whether the condition 'in_irq() ||
>>>>>>> > irqs_disabled() == false' is
>>>>>>> > > sufficient for calling 'kvfree'. We are aware that calling
>>>>>>> > 'kvfree' within the
>>>>>>> > > context of a spin lock or an RCU lock is unsafe.
>>>>>>>
>>>>>>> Your above analysis makes sense if indeed that kvfree cannot appear
>>>>>>> inside a spin lock region or RCU read lock region. But is it true?
>>>>>>> I checked a few code paths in kvfree/kfree. It is either guarded
>>>>>>> with local_irq_save/restore or by
>>>>>>> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
>>>>>>> anything? Are you talking about RT kernel here?
>>>>>>>
>>>>>>>
>>>>>>> > >
>>>>>>> > > Therefore, we propose modifying the condition to include
>>>>>>> > in_atomic(). Could we
>>>>>>> > > update the condition as follows: "in_irq() ||
>>>>>>> irqs_disabled() ||
>>>>>>> > in_atomic()"?
>>>>>>> > >
>>>>>>> > > Thank you! We look forward to your feedback.
>>>>>>> > >
>>>>>>> > > Signed-off-by: Teng Qi <[email protected]
>>>>>>> <mailto:[email protected]>
>>>>>>> > <mailto:[email protected]
>>>>>>> <mailto:[email protected]>>>
>>>>>>> >
>>>>>>>