From: Teng Qi <[email protected]>
Hi, bpf developers,
We are developing a static tool to check the matching between helpers and the
context of hooks. During our analysis, we have discovered some important
findings that we would like to report.
‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that function
bpf_prog_put_deferred() won`t be called in the condition of
‘in_irq() || irqs_disabled()’.
if (in_irq() || irqs_disabled()) {
INIT_WORK(&aux->work, bpf_prog_put_deferred);
schedule_work(&aux->work);
} else {
bpf_prog_put_deferred(&aux->work);
}
We suspect this condition exists because there might be sleepable operations
in the callees of the bpf_prog_put_deferred() function:
kernel/bpf/syscall.c: 2097 __bpf_prog_put()
kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
kvfree(prog->aux->jited_linfo);
kvfree(prog->aux->linfo);
Additionally, we found that array prog->aux->jited_linfo is initialized in
‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
sizeof(*prog->aux->jited_linfo), bpf_memcg_flags(GFP_KERNEL | __GFP_NOWARN));
Our question is whether the condition 'in_irq() || irqs_disabled() == false' is
sufficient for calling 'kvfree'. We are aware that calling 'kvfree' within the
context of a spin lock or an RCU lock is unsafe.
Therefore, we propose modifying the condition to include in_atomic(). Could we
update the condition as follows: "in_irq() || irqs_disabled() || in_atomic()"?
Thank you! We look forward to your feedback.
Signed-off-by: Teng Qi <[email protected]>
On 5/16/23 4:18 AM, [email protected] wrote:
> From: Teng Qi <[email protected]>
>
> Hi, bpf developers,
>
> We are developing a static tool to check the matching between helpers and the
> context of hooks. During our analysis, we have discovered some important
> findings that we would like to report.
>
> ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that function
> bpf_prog_put_deferred() won`t be called in the condition of
> ‘in_irq() || irqs_disabled()’.
> if (in_irq() || irqs_disabled()) {
> INIT_WORK(&aux->work, bpf_prog_put_deferred);
> schedule_work(&aux->work);
> } else {
>
> bpf_prog_put_deferred(&aux->work);
> }
>
> We suspect this condition exists because there might be sleepable operations
> in the callees of the bpf_prog_put_deferred() function:
> kernel/bpf/syscall.c: 2097 __bpf_prog_put()
> kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
> kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
> kvfree(prog->aux->jited_linfo);
> kvfree(prog->aux->linfo);
Looks like you only have suspicion here. Could you find a real violation
here where __bpf_prog_put() is called with !in_irq() &&
!irqs_disabled(), but inside spin_lock or rcu read lock? I have not seen
things like that.
>
> Additionally, we found that array prog->aux->jited_linfo is initialized in
> ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
> prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
> sizeof(*prog->aux->jited_linfo), bpf_memcg_flags(GFP_KERNEL | __GFP_NOWARN));
Any problem here?
>
> Our question is whether the condition 'in_irq() || irqs_disabled() == false' is
> sufficient for calling 'kvfree'. We are aware that calling 'kvfree' within the
> context of a spin lock or an RCU lock is unsafe.
>
> Therefore, we propose modifying the condition to include in_atomic(). Could we
> update the condition as follows: "in_irq() || irqs_disabled() || in_atomic()"?
>
> Thank you! We look forward to your feedback.
>
> Signed-off-by: Teng Qi <[email protected]>
On 5/19/23 7:18 AM, Teng Qi wrote:
> Thank you for your response.
> > Looks like you only have suspicion here. Could you find a real violation
> > here where __bpf_prog_put() is called with !in_irq() &&
> > !irqs_disabled(), but inside spin_lock or rcu read lock? I have not seen
> > things like that.
>
> For the complex conditions to call bpf_prog_put() with 1 refcnt, we have
> been
> unable to really trigger this atomic violation after trying to construct
> test cases manually. But we found that it is possible to show cases with
> !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
> For example, even a failed case, one of selftest cases of bpf, netns_cookie,
> calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
> only inside rcu read lock: The possible call stack is:
> net/core/sock_map.c: 615 bpf_sock_map_update()
> net/core/sock_map.c: 468 sock_map_update_common()
> net/core/sock_map.c: 217 sock_map_link()
> kernel/bpf/syscall.c: 2111 bpf_prog_put()
>
> The files about netns_cookie include
> tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
> tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We inserted the
> following code in
> ‘net/core/sock_map.c: 468 sock_map_update_common()’:
> static int sock_map_update_common(..)
> {
> int inIrq = in_irq();
> int irqsDisabled = irqs_disabled();
> int preemptBits = preempt_count();
> int inAtomic = in_atomic();
> int rcuHeld = rcu_read_lock_held();
> printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
> in_atomic() %d, rcu_read_lock_held() %d", inIrq, irqsDisabled,
> preemptBits, inAtomic, rcuHeld);
> }
>
> The output message is as follows:
> root@(none):/root/bpf# ./test_progs -t netns_cookie
> [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
> in_atomic() 0,
> rcu_read_lock_held() 1
> #113 netns_cookie:OK
> Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>
> We notice that there are numerous callers in kernel/, net/ and drivers/,
> so we
> highly suggest modifying __bpf_prog_put() to address this gap. The gap
> exists
> because __bpf_prog_put() is only safe under in_irq() || irqs_disabled()
> but not in_atomic() || rcu_read_lock_held(). The following code snippet may
> mislead developers into thinking that bpf_prog_put() is safe in all
> contexts.
> if (in_irq() || irqs_disabled()) {
> INIT_WORK(&aux->work, bpf_prog_put_deferred);
> schedule_work(&aux->work);
> } else {
> bpf_prog_put_deferred(&aux->work);
> }
>
> Implicit dependency may lead to issues.
>
> > Any problem here?
> We mentioned it to demonstrate the possibility of kvfree() being
> called by __bpf_prog_put_noref().
>
> Thanks.
> -- Teng Qi
>
> On Wed, May 17, 2023 at 1:08 AM Yonghong Song <[email protected]
> <mailto:[email protected]>> wrote:
>
>
>
> On 5/16/23 4:18 AM, [email protected]
> <mailto:[email protected]> wrote:
> > From: Teng Qi <[email protected]
> <mailto:[email protected]>>
> >
> > Hi, bpf developers,
> >
> > We are developing a static tool to check the matching between
> helpers and the
> > context of hooks. During our analysis, we have discovered some
> important
> > findings that we would like to report.
> >
> > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that function
> > bpf_prog_put_deferred() won`t be called in the condition of
> > ‘in_irq() || irqs_disabled()’.
> > if (in_irq() || irqs_disabled()) {
> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> > schedule_work(&aux->work);
> > } else {
> >
> > bpf_prog_put_deferred(&aux->work);
> > }
> >
> > We suspect this condition exists because there might be sleepable
> operations
> > in the callees of the bpf_prog_put_deferred() function:
> > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
> > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
> > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
> > kvfree(prog->aux->jited_linfo);
> > kvfree(prog->aux->linfo);
>
> Looks like you only have suspicion here. Could you find a real
> violation
> here where __bpf_prog_put() is called with !in_irq() &&
> !irqs_disabled(), but inside spin_lock or rcu read lock? I have not seen
> things like that.
>
> >
> > Additionally, we found that array prog->aux->jited_linfo is
> initialized in
> > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
> > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
> > sizeof(*prog->aux->jited_linfo), bpf_memcg_flags(GFP_KERNEL |
> __GFP_NOWARN));
>
> Any problem here?
>
> >
> > Our question is whether the condition 'in_irq() ||
> irqs_disabled() == false' is
> > sufficient for calling 'kvfree'. We are aware that calling
> 'kvfree' within the
> > context of a spin lock or an RCU lock is unsafe.
Your above analysis makes sense if indeed that kvfree cannot appear
inside a spin lock region or RCU read lock region. But is it true?
I checked a few code paths in kvfree/kfree. It is either guarded
with local_irq_save/restore or by
spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
anything? Are you talking about RT kernel here?
> >
> > Therefore, we propose modifying the condition to include
> in_atomic(). Could we
> > update the condition as follows: "in_irq() || irqs_disabled() ||
> in_atomic()"?
> >
> > Thank you! We look forward to your feedback.
> >
> > Signed-off-by: Teng Qi <[email protected]
> <mailto:[email protected]>>
>
On 5/21/23 6:39 AM, Teng Qi wrote:
> Thank you.
>
> > Your above analysis makes sense if indeed that kvfree cannot appear
> > inside a spin lock region or RCU read lock region. But is it true?
> > I checked a few code paths in kvfree/kfree. It is either guarded
> > with local_irq_save/restore or by
> > spin_lock_irqsave/spin_unlock_
> > irqrestore, etc. Did I miss
> > anything? Are you talking about RT kernel here?
>
> To see the sleepable possibility of kvfree, it is important to analyze the
> following calling stack:
> mm/util.c: 645 kvfree()
> mm/vmalloc.c: 2763 vfree()
>
> In kvfree(), to call vfree, if the pointer addr points to memory
> allocated by
> vmalloc(), it calls vfree().
> void kvfree(const void *addr)
> {
> if (is_vmalloc_addr(addr))
> vfree(addr);
> else
> kfree(addr);
> }
>
> In vfree(), in_interrupt() and might_sleep() need to be considered.
> void vfree(const void *addr)
> {
> // ...
> if (unlikely(in_interrupt()))
> {
> vfree_atomic(addr);
> return;
> }
> // ...
> might_sleep();
> // ...
> }
Sorry. I didn't check vfree path. So it does look like that
we need to pay special attention to non interrupt part.
>
> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
> could have in_interrupt() == false and spin lock region which only disables
> preemption also has in_interrupt() == false. So the kvfree() cannot appear
> inside a spin lock region or RCU read lock region if the pointer addr points
> to memory allocated by vmalloc().
>
> > > Therefore, we propose modifying the condition to include
> > > in_atomic(). Could we
> > > update the condition as follows: "in_irq() || irqs_disabled() ||
> > > in_atomic()"?
> > Thank you! We look forward to your feedback.
>
> We now think that ‘irqs_disabled() || in_atomic() ||
> rcu_read_lock_held()’ is
> more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
> preempt count and rcu_read_lock_held() is for RCU read lock region.
We cannot use rcu_read_lock_held() in the 'if' statement. The return
value rcu_read_lock_held() could be 1 for some configuraitons regardless
whether rcu_read_lock() is really held or not. In most cases,
rcu_read_lock_held() is used in issuing potential warnings.
Maybe there are other ways to record whether rcu_read_lock() is held or not?
I agree with your that 'irqs_disabled() || in_atomic()' makes sense
since it covers process context local_irq_save() and spin_lock() cases.
If we cannot resolve rcu_read_lock() presence issue, maybe the condition
can be !in_interrupt(), so any process-context will go to a workqueue.
Alternatively, we could have another solution. We could add another
function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
put into a workqueue.
>
> -- Teng Qi
>
> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <[email protected]
> <mailto:[email protected]>> wrote:
>
>
>
> On 5/19/23 7:18 AM, Teng Qi wrote:
> > Thank you for your response.
> > > Looks like you only have suspicion here. Could you find a real
> violation
> > > here where __bpf_prog_put() is called with !in_irq() &&
> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
> have not seen
> > > things like that.
> >
> > For the complex conditions to call bpf_prog_put() with 1 refcnt,
> we have
> > been
> > unable to really trigger this atomic violation after trying to
> construct
> > test cases manually. But we found that it is possible to show
> cases with
> > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
> > For example, even a failed case, one of selftest cases of bpf,
> netns_cookie,
> > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
> > only inside rcu read lock: The possible call stack is:
> > net/core/sock_map.c: 615 bpf_sock_map_update()
> > net/core/sock_map.c: 468 sock_map_update_common()
> > net/core/sock_map.c: 217 sock_map_link()
> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
> >
> > The files about netns_cookie include
> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
> inserted the
> > following code in
> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
> > static int sock_map_update_common(..)
> > {
> > int inIrq = in_irq();
> > int irqsDisabled = irqs_disabled();
> > int preemptBits = preempt_count();
> > int inAtomic = in_atomic();
> > int rcuHeld = rcu_read_lock_held();
> > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
> > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
> irqsDisabled,
> > preemptBits, inAtomic, rcuHeld);
> > }
> >
> > The output message is as follows:
> > root@(none):/root/bpf# ./test_progs -t netns_cookie
> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
> > in_atomic() 0,
> > rcu_read_lock_held() 1
> > #113 netns_cookie:OK
> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> >
> > We notice that there are numerous callers in kernel/, net/ and
> drivers/,
> > so we
> > highly suggest modifying __bpf_prog_put() to address this gap.
> The gap
> > exists
> > because __bpf_prog_put() is only safe under in_irq() ||
> irqs_disabled()
> > but not in_atomic() || rcu_read_lock_held(). The following code
> snippet may
> > mislead developers into thinking that bpf_prog_put() is safe in all
> > contexts.
> > if (in_irq() || irqs_disabled()) {
> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> > schedule_work(&aux->work);
> > } else {
> > bpf_prog_put_deferred(&aux->work);
> > }
> >
> > Implicit dependency may lead to issues.
> >
> > > Any problem here?
> > We mentioned it to demonstrate the possibility of kvfree() being
> > called by __bpf_prog_put_noref().
> >
> > Thanks.
> > -- Teng Qi
> >
> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <[email protected]
> <mailto:[email protected]>
> > <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >
> >
> >
> > On 5/16/23 4:18 AM, [email protected]
> <mailto:[email protected]>
> > <mailto:[email protected]
> <mailto:[email protected]>> wrote:
> > > From: Teng Qi <[email protected]
> <mailto:[email protected]>
> > <mailto:[email protected]
> <mailto:[email protected]>>>
> > >
> > > Hi, bpf developers,
> > >
> > > We are developing a static tool to check the matching between
> > helpers and the
> > > context of hooks. During our analysis, we have discovered some
> > important
> > > findings that we would like to report.
> > >
> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
> function
> > > bpf_prog_put_deferred() won`t be called in the condition of
> > > ‘in_irq() || irqs_disabled()’.
> > > if (in_irq() || irqs_disabled()) {
> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> > > schedule_work(&aux->work);
> > > } else {
> > >
> > > bpf_prog_put_deferred(&aux->work);
> > > }
> > >
> > > We suspect this condition exists because there might be
> sleepable
> > operations
> > > in the callees of the bpf_prog_put_deferred() function:
> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
> > > kvfree(prog->aux->jited_linfo);
> > > kvfree(prog->aux->linfo);
> >
> > Looks like you only have suspicion here. Could you find a real
> > violation
> > here where __bpf_prog_put() is called with !in_irq() &&
> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
> have not seen
> > things like that.
> >
> > >
> > > Additionally, we found that array prog->aux->jited_linfo is
> > initialized in
> > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
> > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
> > > sizeof(*prog->aux->jited_linfo),
> bpf_memcg_flags(GFP_KERNEL |
> > __GFP_NOWARN));
> >
> > Any problem here?
> >
> > >
> > > Our question is whether the condition 'in_irq() ||
> > irqs_disabled() == false' is
> > > sufficient for calling 'kvfree'. We are aware that calling
> > 'kvfree' within the
> > > context of a spin lock or an RCU lock is unsafe.
>
> Your above analysis makes sense if indeed that kvfree cannot appear
> inside a spin lock region or RCU read lock region. But is it true?
> I checked a few code paths in kvfree/kfree. It is either guarded
> with local_irq_save/restore or by
> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
> anything? Are you talking about RT kernel here?
>
>
> > >
> > > Therefore, we propose modifying the condition to include
> > in_atomic(). Could we
> > > update the condition as follows: "in_irq() ||
> irqs_disabled() ||
> > in_atomic()"?
> > >
> > > Thank you! We look forward to your feedback.
> > >
> > > Signed-off-by: Teng Qi <[email protected]
> <mailto:[email protected]>
> > <mailto:[email protected]
> <mailto:[email protected]>>>
> >
>
Thank you.
> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> value rcu_read_lock_held() could be 1 for some configurations regardless
> whether rcu_read_lock() is really held or not. In most cases,
> rcu_read_lock_held() is used in issuing potential warnings.
> Maybe there are other ways to record whether rcu_read_lock() is held or not?
Sorry. I was not aware of the dependency of configurations of
rcu_read_lock_held().
> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> can be !in_interrupt(), so any process-context will go to a workqueue.
I agree that using !in_interrupt() as a condition is an acceptable solution.
> Alternatively, we could have another solution. We could add another
> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> will be done in rcu context.
Implementing a new function like bpf_prog_put_rcu() is a solution that involves
more significant changes.
> So if in_interrupt(), do kvfree, otherwise,
> put into a workqueue.
Shall we proceed with submitting a patch following this approach?
I would like to mention something unrelated to the possible bug. At this
moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
but not safe under other atomic contexts.
This disorder challenges our conventional belief, a monotonic incrementation
of limitations of the hierarchical atomic contexts, that programer needs
to be more and more careful to write code under rcu read lock, spin lock,
bh disable, interrupt...
This disorder can lead to unexpected consequences, such as code being safe
under interrupts but not safe under spin locks.
The disorder makes kernel programming more complex and may result in more bugs.
Even though we find a way to resolve the possible bug about the bpf_prog_put(),
I feel sad for undermining of kernel`s maintainability and disorder of
hierarchy of atomic contexts.
-- Teng Qi
On Tue, May 23, 2023 at 12:33 PM Yonghong Song <[email protected]> wrote:
>
>
>
> On 5/21/23 6:39 AM, Teng Qi wrote:
> > Thank you.
> >
> > > Your above analysis makes sense if indeed that kvfree cannot appear
> > > inside a spin lock region or RCU read lock region. But is it true?
> > > I checked a few code paths in kvfree/kfree. It is either guarded
> > > with local_irq_save/restore or by
> > > spin_lock_irqsave/spin_unlock_
> > > irqrestore, etc. Did I miss
> > > anything? Are you talking about RT kernel here?
> >
> > To see the sleepable possibility of kvfree, it is important to analyze the
> > following calling stack:
> > mm/util.c: 645 kvfree()
> > mm/vmalloc.c: 2763 vfree()
> >
> > In kvfree(), to call vfree, if the pointer addr points to memory
> > allocated by
> > vmalloc(), it calls vfree().
> > void kvfree(const void *addr)
> > {
> > if (is_vmalloc_addr(addr))
> > vfree(addr);
> > else
> > kfree(addr);
> > }
> >
> > In vfree(), in_interrupt() and might_sleep() need to be considered.
> > void vfree(const void *addr)
> > {
> > // ...
> > if (unlikely(in_interrupt()))
> > {
> > vfree_atomic(addr);
> > return;
> > }
> > // ...
> > might_sleep();
> > // ...
> > }
>
> Sorry. I didn't check vfree path. So it does look like that
> we need to pay special attention to non interrupt part.
>
> >
> > The vfree() may sleep if in_interrupt() == false. The RCU read lock region
> > could have in_interrupt() == false and spin lock region which only disables
> > preemption also has in_interrupt() == false. So the kvfree() cannot appear
> > inside a spin lock region or RCU read lock region if the pointer addr points
> > to memory allocated by vmalloc().
> >
> > > > Therefore, we propose modifying the condition to include
> > > > in_atomic(). Could we
> > > > update the condition as follows: "in_irq() || irqs_disabled() ||
> > > > in_atomic()"?
> > > Thank you! We look forward to your feedback.
> >
> > We now think that ‘irqs_disabled() || in_atomic() ||
> > rcu_read_lock_held()’ is
> > more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
> > preempt count and rcu_read_lock_held() is for RCU read lock region.
>
> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> value rcu_read_lock_held() could be 1 for some configuraitons regardless
> whether rcu_read_lock() is really held or not. In most cases,
> rcu_read_lock_held() is used in issuing potential warnings.
> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>
> I agree with your that 'irqs_disabled() || in_atomic()' makes sense
> since it covers process context local_irq_save() and spin_lock() cases.
>
> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> can be !in_interrupt(), so any process-context will go to a workqueue.
>
> Alternatively, we could have another solution. We could add another
> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
> put into a workqueue.
>
>
> >
> > -- Teng Qi
> >
> > On Sun, May 21, 2023 at 11:45 AM Yonghong Song <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >
> >
> > On 5/19/23 7:18 AM, Teng Qi wrote:
> > > Thank you for your response.
> > > > Looks like you only have suspicion here. Could you find a real
> > violation
> > > > here where __bpf_prog_put() is called with !in_irq() &&
> > > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
> > have not seen
> > > > things like that.
> > >
> > > For the complex conditions to call bpf_prog_put() with 1 refcnt,
> > we have
> > > been
> > > unable to really trigger this atomic violation after trying to
> > construct
> > > test cases manually. But we found that it is possible to show
> > cases with
> > > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
> > > For example, even a failed case, one of selftest cases of bpf,
> > netns_cookie,
> > > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
> > > only inside rcu read lock: The possible call stack is:
> > > net/core/sock_map.c: 615 bpf_sock_map_update()
> > > net/core/sock_map.c: 468 sock_map_update_common()
> > > net/core/sock_map.c: 217 sock_map_link()
> > > kernel/bpf/syscall.c: 2111 bpf_prog_put()
> > >
> > > The files about netns_cookie include
> > > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
> > > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
> > inserted the
> > > following code in
> > > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
> > > static int sock_map_update_common(..)
> > > {
> > > int inIrq = in_irq();
> > > int irqsDisabled = irqs_disabled();
> > > int preemptBits = preempt_count();
> > > int inAtomic = in_atomic();
> > > int rcuHeld = rcu_read_lock_held();
> > > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
> > > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
> > irqsDisabled,
> > > preemptBits, inAtomic, rcuHeld);
> > > }
> > >
> > > The output message is as follows:
> > > root@(none):/root/bpf# ./test_progs -t netns_cookie
> > > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
> > > in_atomic() 0,
> > > rcu_read_lock_held() 1
> > > #113 netns_cookie:OK
> > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > >
> > > We notice that there are numerous callers in kernel/, net/ and
> > drivers/,
> > > so we
> > > highly suggest modifying __bpf_prog_put() to address this gap.
> > The gap
> > > exists
> > > because __bpf_prog_put() is only safe under in_irq() ||
> > irqs_disabled()
> > > but not in_atomic() || rcu_read_lock_held(). The following code
> > snippet may
> > > mislead developers into thinking that bpf_prog_put() is safe in all
> > > contexts.
> > > if (in_irq() || irqs_disabled()) {
> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> > > schedule_work(&aux->work);
> > > } else {
> > > bpf_prog_put_deferred(&aux->work);
> > > }
> > >
> > > Implicit dependency may lead to issues.
> > >
> > > > Any problem here?
> > > We mentioned it to demonstrate the possibility of kvfree() being
> > > called by __bpf_prog_put_noref().
> > >
> > > Thanks.
> > > -- Teng Qi
> > >
> > > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <[email protected]
> > <mailto:[email protected]>
> > > <mailto:[email protected] <mailto:[email protected]>>> wrote:
> > >
> > >
> > >
> > > On 5/16/23 4:18 AM, [email protected]
> > <mailto:[email protected]>
> > > <mailto:[email protected]
> > <mailto:[email protected]>> wrote:
> > > > From: Teng Qi <[email protected]
> > <mailto:[email protected]>
> > > <mailto:[email protected]
> > <mailto:[email protected]>>>
> > > >
> > > > Hi, bpf developers,
> > > >
> > > > We are developing a static tool to check the matching between
> > > helpers and the
> > > > context of hooks. During our analysis, we have discovered some
> > > important
> > > > findings that we would like to report.
> > > >
> > > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
> > function
> > > > bpf_prog_put_deferred() won`t be called in the condition of
> > > > ‘in_irq() || irqs_disabled()’.
> > > > if (in_irq() || irqs_disabled()) {
> > > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> > > > schedule_work(&aux->work);
> > > > } else {
> > > >
> > > > bpf_prog_put_deferred(&aux->work);
> > > > }
> > > >
> > > > We suspect this condition exists because there might be
> > sleepable
> > > operations
> > > > in the callees of the bpf_prog_put_deferred() function:
> > > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
> > > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
> > > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
> > > > kvfree(prog->aux->jited_linfo);
> > > > kvfree(prog->aux->linfo);
> > >
> > > Looks like you only have suspicion here. Could you find a real
> > > violation
> > > here where __bpf_prog_put() is called with !in_irq() &&
> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
> > have not seen
> > > things like that.
> > >
> > > >
> > > > Additionally, we found that array prog->aux->jited_linfo is
> > > initialized in
> > > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
> > > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
> > > > sizeof(*prog->aux->jited_linfo),
> > bpf_memcg_flags(GFP_KERNEL |
> > > __GFP_NOWARN));
> > >
> > > Any problem here?
> > >
> > > >
> > > > Our question is whether the condition 'in_irq() ||
> > > irqs_disabled() == false' is
> > > > sufficient for calling 'kvfree'. We are aware that calling
> > > 'kvfree' within the
> > > > context of a spin lock or an RCU lock is unsafe.
> >
> > Your above analysis makes sense if indeed that kvfree cannot appear
> > inside a spin lock region or RCU read lock region. But is it true?
> > I checked a few code paths in kvfree/kfree. It is either guarded
> > with local_irq_save/restore or by
> > spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
> > anything? Are you talking about RT kernel here?
> >
> >
> > > >
> > > > Therefore, we propose modifying the condition to include
> > > in_atomic(). Could we
> > > > update the condition as follows: "in_irq() ||
> > irqs_disabled() ||
> > > in_atomic()"?
> > > >
> > > > Thank you! We look forward to your feedback.
> > > >
> > > > Signed-off-by: Teng Qi <[email protected]
> > <mailto:[email protected]>
> > > <mailto:[email protected]
> > <mailto:[email protected]>>>
> > >
> >
On Wed, May 24, 2023 at 12:34 PM Yonghong Song <[email protected]> wrote:
>
>
>
> On 5/24/23 5:42 AM, Teng Qi wrote:
> > Thank you.
> >
> >> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> >> value rcu_read_lock_held() could be 1 for some configurations regardless
> >> whether rcu_read_lock() is really held or not. In most cases,
> >> rcu_read_lock_held() is used in issuing potential warnings.
> >> Maybe there are other ways to record whether rcu_read_lock() is held or not?
> >
> > Sorry. I was not aware of the dependency of configurations of
> > rcu_read_lock_held().
> >
> >> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> >> can be !in_interrupt(), so any process-context will go to a workqueue.
> >
> > I agree that using !in_interrupt() as a condition is an acceptable solution.
>
> This should work although it could be conservative.
>
> >
> >> Alternatively, we could have another solution. We could add another
> >> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> >> will be done in rcu context.
> >
> > Implementing a new function like bpf_prog_put_rcu() is a solution that involves
> > more significant changes.
>
> Maybe we can change signature of bpf_prog_put instead? Like
> void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
> and inside bpf_prog_put we can add
> WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
bpf_rcu_lock_held() is used for different cases.
Here s/in_irq/in_interrupt/ inside bpf_prog_put() is enough
to address this theoretical issue.
On 5/24/23 5:42 AM, Teng Qi wrote:
> Thank you.
>
>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>> value rcu_read_lock_held() could be 1 for some configurations regardless
>> whether rcu_read_lock() is really held or not. In most cases,
>> rcu_read_lock_held() is used in issuing potential warnings.
>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>
> Sorry. I was not aware of the dependency of configurations of
> rcu_read_lock_held().
>
>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>> can be !in_interrupt(), so any process-context will go to a workqueue.
>
> I agree that using !in_interrupt() as a condition is an acceptable solution.
This should work although it could be conservative.
>
>> Alternatively, we could have another solution. We could add another
>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>> will be done in rcu context.
>
> Implementing a new function like bpf_prog_put_rcu() is a solution that involves
> more significant changes.
Maybe we can change signature of bpf_prog_put instead? Like
void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
and inside bpf_prog_put we can add
WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
>
>> So if in_interrupt(), do kvfree, otherwise,
>> put into a workqueue.
>
> Shall we proceed with submitting a patch following this approach?
You could choose either of the above although I think with newer
bpf_prog_put() is better.
BTW, please do create a test case, e.g, sockmap test case which
can show the problem with existing code base.
>
> I would like to mention something unrelated to the possible bug. At this
> moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
> but not safe under other atomic contexts.
> This disorder challenges our conventional belief, a monotonic incrementation
> of limitations of the hierarchical atomic contexts, that programer needs
> to be more and more careful to write code under rcu read lock, spin lock,
> bh disable, interrupt...
> This disorder can lead to unexpected consequences, such as code being safe
> under interrupts but not safe under spin locks.
> The disorder makes kernel programming more complex and may result in more bugs.
> Even though we find a way to resolve the possible bug about the bpf_prog_put(),
> I feel sad for undermining of kernel`s maintainability and disorder of
> hierarchy of atomic contexts.
>
> -- Teng Qi
>
> On Tue, May 23, 2023 at 12:33 PM Yonghong Song <[email protected]> wrote:
>>
>>
>>
>> On 5/21/23 6:39 AM, Teng Qi wrote:
>>> Thank you.
>>>
>>> > Your above analysis makes sense if indeed that kvfree cannot appear
>>> > inside a spin lock region or RCU read lock region. But is it true?
>>> > I checked a few code paths in kvfree/kfree. It is either guarded
>>> > with local_irq_save/restore or by
>>> > spin_lock_irqsave/spin_unlock_
>>> > irqrestore, etc. Did I miss
>>> > anything? Are you talking about RT kernel here?
>>>
>>> To see the sleepable possibility of kvfree, it is important to analyze the
>>> following calling stack:
>>> mm/util.c: 645 kvfree()
>>> mm/vmalloc.c: 2763 vfree()
>>>
>>> In kvfree(), to call vfree, if the pointer addr points to memory
>>> allocated by
>>> vmalloc(), it calls vfree().
>>> void kvfree(const void *addr)
>>> {
>>> if (is_vmalloc_addr(addr))
>>> vfree(addr);
>>> else
>>> kfree(addr);
>>> }
>>>
>>> In vfree(), in_interrupt() and might_sleep() need to be considered.
>>> void vfree(const void *addr)
>>> {
>>> // ...
>>> if (unlikely(in_interrupt()))
>>> {
>>> vfree_atomic(addr);
>>> return;
>>> }
>>> // ...
>>> might_sleep();
>>> // ...
>>> }
>>
>> Sorry. I didn't check vfree path. So it does look like that
>> we need to pay special attention to non interrupt part.
>>
>>>
>>> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
>>> could have in_interrupt() == false and spin lock region which only disables
>>> preemption also has in_interrupt() == false. So the kvfree() cannot appear
>>> inside a spin lock region or RCU read lock region if the pointer addr points
>>> to memory allocated by vmalloc().
>>>
>>> > > Therefore, we propose modifying the condition to include
>>> > > in_atomic(). Could we
>>> > > update the condition as follows: "in_irq() || irqs_disabled() ||
>>> > > in_atomic()"?
>>> > Thank you! We look forward to your feedback.
>>>
>>> We now think that ‘irqs_disabled() || in_atomic() ||
>>> rcu_read_lock_held()’ is
>>> more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
>>> preempt count and rcu_read_lock_held() is for RCU read lock region.
>>
>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>> value rcu_read_lock_held() could be 1 for some configuraitons regardless
>> whether rcu_read_lock() is really held or not. In most cases,
>> rcu_read_lock_held() is used in issuing potential warnings.
>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>
>> I agree with your that 'irqs_disabled() || in_atomic()' makes sense
>> since it covers process context local_irq_save() and spin_lock() cases.
>>
>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>
>> Alternatively, we could have another solution. We could add another
>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
>> put into a workqueue.
>>
>>
>>>
>>> -- Teng Qi
>>>
>>> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>>
>>>
>>> On 5/19/23 7:18 AM, Teng Qi wrote:
>>> > Thank you for your response.
>>> > > Looks like you only have suspicion here. Could you find a real
>>> violation
>>> > > here where __bpf_prog_put() is called with !in_irq() &&
>>> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>> have not seen
>>> > > things like that.
>>> >
>>> > For the complex conditions to call bpf_prog_put() with 1 refcnt,
>>> we have
>>> > been
>>> > unable to really trigger this atomic violation after trying to
>>> construct
>>> > test cases manually. But we found that it is possible to show
>>> cases with
>>> > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
>>> > For example, even a failed case, one of selftest cases of bpf,
>>> netns_cookie,
>>> > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
>>> > only inside rcu read lock: The possible call stack is:
>>> > net/core/sock_map.c: 615 bpf_sock_map_update()
>>> > net/core/sock_map.c: 468 sock_map_update_common()
>>> > net/core/sock_map.c: 217 sock_map_link()
>>> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
>>> >
>>> > The files about netns_cookie include
>>> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
>>> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
>>> inserted the
>>> > following code in
>>> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
>>> > static int sock_map_update_common(..)
>>> > {
>>> > int inIrq = in_irq();
>>> > int irqsDisabled = irqs_disabled();
>>> > int preemptBits = preempt_count();
>>> > int inAtomic = in_atomic();
>>> > int rcuHeld = rcu_read_lock_held();
>>> > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
>>> > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
>>> irqsDisabled,
>>> > preemptBits, inAtomic, rcuHeld);
>>> > }
>>> >
>>> > The output message is as follows:
>>> > root@(none):/root/bpf# ./test_progs -t netns_cookie
>>> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
>>> > in_atomic() 0,
>>> > rcu_read_lock_held() 1
>>> > #113 netns_cookie:OK
>>> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>>> >
>>> > We notice that there are numerous callers in kernel/, net/ and
>>> drivers/,
>>> > so we
>>> > highly suggest modifying __bpf_prog_put() to address this gap.
>>> The gap
>>> > exists
>>> > because __bpf_prog_put() is only safe under in_irq() ||
>>> irqs_disabled()
>>> > but not in_atomic() || rcu_read_lock_held(). The following code
>>> snippet may
>>> > mislead developers into thinking that bpf_prog_put() is safe in all
>>> > contexts.
>>> > if (in_irq() || irqs_disabled()) {
>>> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>> > schedule_work(&aux->work);
>>> > } else {
>>> > bpf_prog_put_deferred(&aux->work);
>>> > }
>>> >
>>> > Implicit dependency may lead to issues.
>>> >
>>> > > Any problem here?
>>> > We mentioned it to demonstrate the possibility of kvfree() being
>>> > called by __bpf_prog_put_noref().
>>> >
>>> > Thanks.
>>> > -- Teng Qi
>>> >
>>> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <[email protected]
>>> <mailto:[email protected]>
>>> > <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>> >
>>> >
>>> >
>>> > On 5/16/23 4:18 AM, [email protected]
>>> <mailto:[email protected]>
>>> > <mailto:[email protected]
>>> <mailto:[email protected]>> wrote:
>>> > > From: Teng Qi <[email protected]
>>> <mailto:[email protected]>
>>> > <mailto:[email protected]
>>> <mailto:[email protected]>>>
>>> > >
>>> > > Hi, bpf developers,
>>> > >
>>> > > We are developing a static tool to check the matching between
>>> > helpers and the
>>> > > context of hooks. During our analysis, we have discovered some
>>> > important
>>> > > findings that we would like to report.
>>> > >
>>> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
>>> function
>>> > > bpf_prog_put_deferred() won`t be called in the condition of
>>> > > ‘in_irq() || irqs_disabled()’.
>>> > > if (in_irq() || irqs_disabled()) {
>>> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>> > > schedule_work(&aux->work);
>>> > > } else {
>>> > >
>>> > > bpf_prog_put_deferred(&aux->work);
>>> > > }
>>> > >
>>> > > We suspect this condition exists because there might be
>>> sleepable
>>> > operations
>>> > > in the callees of the bpf_prog_put_deferred() function:
>>> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
>>> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
>>> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
>>> > > kvfree(prog->aux->jited_linfo);
>>> > > kvfree(prog->aux->linfo);
>>> >
>>> > Looks like you only have suspicion here. Could you find a real
>>> > violation
>>> > here where __bpf_prog_put() is called with !in_irq() &&
>>> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>> have not seen
>>> > things like that.
>>> >
>>> > >
>>> > > Additionally, we found that array prog->aux->jited_linfo is
>>> > initialized in
>>> > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
>>> > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
>>> > > sizeof(*prog->aux->jited_linfo),
>>> bpf_memcg_flags(GFP_KERNEL |
>>> > __GFP_NOWARN));
>>> >
>>> > Any problem here?
>>> >
>>> > >
>>> > > Our question is whether the condition 'in_irq() ||
>>> > irqs_disabled() == false' is
>>> > > sufficient for calling 'kvfree'. We are aware that calling
>>> > 'kvfree' within the
>>> > > context of a spin lock or an RCU lock is unsafe.
>>>
>>> Your above analysis makes sense if indeed that kvfree cannot appear
>>> inside a spin lock region or RCU read lock region. But is it true?
>>> I checked a few code paths in kvfree/kfree. It is either guarded
>>> with local_irq_save/restore or by
>>> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
>>> anything? Are you talking about RT kernel here?
>>>
>>>
>>> > >
>>> > > Therefore, we propose modifying the condition to include
>>> > in_atomic(). Could we
>>> > > update the condition as follows: "in_irq() ||
>>> irqs_disabled() ||
>>> > in_atomic()"?
>>> > >
>>> > > Thank you! We look forward to your feedback.
>>> > >
>>> > > Signed-off-by: Teng Qi <[email protected]
>>> <mailto:[email protected]>
>>> > <mailto:[email protected]
>>> <mailto:[email protected]>>>
>>> >
>>>
On 5/24/23 12:44 PM, Alexei Starovoitov wrote:
> On Wed, May 24, 2023 at 12:34 PM Yonghong Song <[email protected]> wrote:
>>
>>
>>
>> On 5/24/23 5:42 AM, Teng Qi wrote:
>>> Thank you.
>>>
>>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>>>> value rcu_read_lock_held() could be 1 for some configurations regardless
>>>> whether rcu_read_lock() is really held or not. In most cases,
>>>> rcu_read_lock_held() is used in issuing potential warnings.
>>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>>
>>> Sorry. I was not aware of the dependency of configurations of
>>> rcu_read_lock_held().
>>>
>>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>>>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>>
>>> I agree that using !in_interrupt() as a condition is an acceptable solution.
>>
>> This should work although it could be conservative.
>>
>>>
>>>> Alternatively, we could have another solution. We could add another
>>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>>>> will be done in rcu context.
>>>
>>> Implementing a new function like bpf_prog_put_rcu() is a solution that involves
>>> more significant changes.
>>
>> Maybe we can change signature of bpf_prog_put instead? Like
>> void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
>> and inside bpf_prog_put we can add
>> WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
>
> bpf_rcu_lock_held() is used for different cases.
Sorry, I actually mean rcu_read_lock_held() ...
>
> Here s/in_irq/in_interrupt/ inside bpf_prog_put() is enough
> to address this theoretical issue.
Maybe
if (!in_interrupt()) {
INIT_WORK(&aux->work, bpf_prog_put_deferred);
schedule_work(&aux->work);
} else {
bpf_prog_put_deferred(&aux->work);
}
?
Basically for any process context, use a work queue since
we have no idea whether rcu_read_lock() is held or not.
In process context, is_atmoc() and irqs_disabled() should
already use the work queue.
As we discussed in the above, if in_interrupt() is true,
kvfree seems okay, so can directly call
bpf_prog_put_deferred().
Does this sound reasonable?
Hello!
> BTW, please do create a test case, e.g, sockmap test case which
> can show the problem with existing code base.
I add a printk in bpf_prog_put_deferred():
static void bpf_prog_put_deferred(struct work_struct *work)
{
// . . .
int inIrq = in_irq();
int irqsDisabled = irqs_disabled();
int preemptBits = preempt_count();
int inAtomic = in_atomic();
int rcuHeld = rcu_read_lock_held();
printk("bpf_prog_put: in_irq() %d, irqs_disabled() %d, preempt_count()
%d, in_atomic() %d, rcu_read_lock_held() %d",
inIrq, irqsDisabled, preemptBits, inAtomic, rcuHeld);
// . . .
}
When running the selftest, I see the following output:
[255340.388339] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
[255393.237632] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
Based on this output, I believe it is sufficient to construct a self-test case
for bpf_prog_put_deferred() called under preempt disabled or rcu read lock
region. However, I'm a bit confused about what I should do to build the
self-test case. Are we looking to create a checker that verifies the
context of bpf_prog_put_deferred() is valid? Or do we need a test case that
can trigger this bug?
Could you show me more ideas to construct a self test case? I am not familiar
with it and have no idea.
-- Teng Qi
On Thu, May 25, 2023 at 3:34 AM Yonghong Song <[email protected]> wrote:
>
>
>
> On 5/24/23 5:42 AM, Teng Qi wrote:
> > Thank you.
> >
> >> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> >> value rcu_read_lock_held() could be 1 for some configurations regardless
> >> whether rcu_read_lock() is really held or not. In most cases,
> >> rcu_read_lock_held() is used in issuing potential warnings.
> >> Maybe there are other ways to record whether rcu_read_lock() is held or not?
> >
> > Sorry. I was not aware of the dependency of configurations of
> > rcu_read_lock_held().
> >
> >> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> >> can be !in_interrupt(), so any process-context will go to a workqueue.
> >
> > I agree that using !in_interrupt() as a condition is an acceptable solution.
>
> This should work although it could be conservative.
>
> >
> >> Alternatively, we could have another solution. We could add another
> >> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> >> will be done in rcu context.
> >
> > Implementing a new function like bpf_prog_put_rcu() is a solution that involves
> > more significant changes.
>
> Maybe we can change signature of bpf_prog_put instead? Like
> void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
> and inside bpf_prog_put we can add
> WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
>
> >
> >> So if in_interrupt(), do kvfree, otherwise,
> >> put into a workqueue.
> >
> > Shall we proceed with submitting a patch following this approach?
>
> You could choose either of the above although I think with newer
> bpf_prog_put() is better.
>
> BTW, please do create a test case, e.g, sockmap test case which
> can show the problem with existing code base.
>
> >
> > I would like to mention something unrelated to the possible bug. At this
> > moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
> > but not safe under other atomic contexts.
> > This disorder challenges our conventional belief, a monotonic incrementation
> > of limitations of the hierarchical atomic contexts, that programer needs
> > to be more and more careful to write code under rcu read lock, spin lock,
> > bh disable, interrupt...
> > This disorder can lead to unexpected consequences, such as code being safe
> > under interrupts but not safe under spin locks.
> > The disorder makes kernel programming more complex and may result in more bugs.
> > Even though we find a way to resolve the possible bug about the bpf_prog_put(),
> > I feel sad for undermining of kernel`s maintainability and disorder of
> > hierarchy of atomic contexts.
> >
> > -- Teng Qi
> >
> > On Tue, May 23, 2023 at 12:33 PM Yonghong Song <[email protected]> wrote:
> >>
> >>
> >>
> >> On 5/21/23 6:39 AM, Teng Qi wrote:
> >>> Thank you.
> >>>
> >>> > Your above analysis makes sense if indeed that kvfree cannot appear
> >>> > inside a spin lock region or RCU read lock region. But is it true?
> >>> > I checked a few code paths in kvfree/kfree. It is either guarded
> >>> > with local_irq_save/restore or by
> >>> > spin_lock_irqsave/spin_unlock_
> >>> > irqrestore, etc. Did I miss
> >>> > anything? Are you talking about RT kernel here?
> >>>
> >>> To see the sleepable possibility of kvfree, it is important to analyze the
> >>> following calling stack:
> >>> mm/util.c: 645 kvfree()
> >>> mm/vmalloc.c: 2763 vfree()
> >>>
> >>> In kvfree(), to call vfree, if the pointer addr points to memory
> >>> allocated by
> >>> vmalloc(), it calls vfree().
> >>> void kvfree(const void *addr)
> >>> {
> >>> if (is_vmalloc_addr(addr))
> >>> vfree(addr);
> >>> else
> >>> kfree(addr);
> >>> }
> >>>
> >>> In vfree(), in_interrupt() and might_sleep() need to be considered.
> >>> void vfree(const void *addr)
> >>> {
> >>> // ...
> >>> if (unlikely(in_interrupt()))
> >>> {
> >>> vfree_atomic(addr);
> >>> return;
> >>> }
> >>> // ...
> >>> might_sleep();
> >>> // ...
> >>> }
> >>
> >> Sorry. I didn't check vfree path. So it does look like that
> >> we need to pay special attention to non interrupt part.
> >>
> >>>
> >>> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
> >>> could have in_interrupt() == false and spin lock region which only disables
> >>> preemption also has in_interrupt() == false. So the kvfree() cannot appear
> >>> inside a spin lock region or RCU read lock region if the pointer addr points
> >>> to memory allocated by vmalloc().
> >>>
> >>> > > Therefore, we propose modifying the condition to include
> >>> > > in_atomic(). Could we
> >>> > > update the condition as follows: "in_irq() || irqs_disabled() ||
> >>> > > in_atomic()"?
> >>> > Thank you! We look forward to your feedback.
> >>>
> >>> We now think that ‘irqs_disabled() || in_atomic() ||
> >>> rcu_read_lock_held()’ is
> >>> more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
> >>> preempt count and rcu_read_lock_held() is for RCU read lock region.
> >>
> >> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> >> value rcu_read_lock_held() could be 1 for some configuraitons regardless
> >> whether rcu_read_lock() is really held or not. In most cases,
> >> rcu_read_lock_held() is used in issuing potential warnings.
> >> Maybe there are other ways to record whether rcu_read_lock() is held or not?
> >>
> >> I agree with your that 'irqs_disabled() || in_atomic()' makes sense
> >> since it covers process context local_irq_save() and spin_lock() cases.
> >>
> >> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> >> can be !in_interrupt(), so any process-context will go to a workqueue.
> >>
> >> Alternatively, we could have another solution. We could add another
> >> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> >> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
> >> put into a workqueue.
> >>
> >>
> >>>
> >>> -- Teng Qi
> >>>
> >>> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <[email protected]
> >>> <mailto:[email protected]>> wrote:
> >>>
> >>>
> >>>
> >>> On 5/19/23 7:18 AM, Teng Qi wrote:
> >>> > Thank you for your response.
> >>> > > Looks like you only have suspicion here. Could you find a real
> >>> violation
> >>> > > here where __bpf_prog_put() is called with !in_irq() &&
> >>> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
> >>> have not seen
> >>> > > things like that.
> >>> >
> >>> > For the complex conditions to call bpf_prog_put() with 1 refcnt,
> >>> we have
> >>> > been
> >>> > unable to really trigger this atomic violation after trying to
> >>> construct
> >>> > test cases manually. But we found that it is possible to show
> >>> cases with
> >>> > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
> >>> > For example, even a failed case, one of selftest cases of bpf,
> >>> netns_cookie,
> >>> > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
> >>> > only inside rcu read lock: The possible call stack is:
> >>> > net/core/sock_map.c: 615 bpf_sock_map_update()
> >>> > net/core/sock_map.c: 468 sock_map_update_common()
> >>> > net/core/sock_map.c: 217 sock_map_link()
> >>> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
> >>> >
> >>> > The files about netns_cookie include
> >>> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
> >>> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
> >>> inserted the
> >>> > following code in
> >>> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
> >>> > static int sock_map_update_common(..)
> >>> > {
> >>> > int inIrq = in_irq();
> >>> > int irqsDisabled = irqs_disabled();
> >>> > int preemptBits = preempt_count();
> >>> > int inAtomic = in_atomic();
> >>> > int rcuHeld = rcu_read_lock_held();
> >>> > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
> >>> > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
> >>> irqsDisabled,
> >>> > preemptBits, inAtomic, rcuHeld);
> >>> > }
> >>> >
> >>> > The output message is as follows:
> >>> > root@(none):/root/bpf# ./test_progs -t netns_cookie
> >>> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
> >>> > in_atomic() 0,
> >>> > rcu_read_lock_held() 1
> >>> > #113 netns_cookie:OK
> >>> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> >>> >
> >>> > We notice that there are numerous callers in kernel/, net/ and
> >>> drivers/,
> >>> > so we
> >>> > highly suggest modifying __bpf_prog_put() to address this gap.
> >>> The gap
> >>> > exists
> >>> > because __bpf_prog_put() is only safe under in_irq() ||
> >>> irqs_disabled()
> >>> > but not in_atomic() || rcu_read_lock_held(). The following code
> >>> snippet may
> >>> > mislead developers into thinking that bpf_prog_put() is safe in all
> >>> > contexts.
> >>> > if (in_irq() || irqs_disabled()) {
> >>> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> >>> > schedule_work(&aux->work);
> >>> > } else {
> >>> > bpf_prog_put_deferred(&aux->work);
> >>> > }
> >>> >
> >>> > Implicit dependency may lead to issues.
> >>> >
> >>> > > Any problem here?
> >>> > We mentioned it to demonstrate the possibility of kvfree() being
> >>> > called by __bpf_prog_put_noref().
> >>> >
> >>> > Thanks.
> >>> > -- Teng Qi
> >>> >
> >>> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <[email protected]
> >>> <mailto:[email protected]>
> >>> > <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>> >
> >>> >
> >>> >
> >>> > On 5/16/23 4:18 AM, [email protected]
> >>> <mailto:[email protected]>
> >>> > <mailto:[email protected]
> >>> <mailto:[email protected]>> wrote:
> >>> > > From: Teng Qi <[email protected]
> >>> <mailto:[email protected]>
> >>> > <mailto:[email protected]
> >>> <mailto:[email protected]>>>
> >>> > >
> >>> > > Hi, bpf developers,
> >>> > >
> >>> > > We are developing a static tool to check the matching between
> >>> > helpers and the
> >>> > > context of hooks. During our analysis, we have discovered some
> >>> > important
> >>> > > findings that we would like to report.
> >>> > >
> >>> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
> >>> function
> >>> > > bpf_prog_put_deferred() won`t be called in the condition of
> >>> > > ‘in_irq() || irqs_disabled()’.
> >>> > > if (in_irq() || irqs_disabled()) {
> >>> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> >>> > > schedule_work(&aux->work);
> >>> > > } else {
> >>> > >
> >>> > > bpf_prog_put_deferred(&aux->work);
> >>> > > }
> >>> > >
> >>> > > We suspect this condition exists because there might be
> >>> sleepable
> >>> > operations
> >>> > > in the callees of the bpf_prog_put_deferred() function:
> >>> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
> >>> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
> >>> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
> >>> > > kvfree(prog->aux->jited_linfo);
> >>> > > kvfree(prog->aux->linfo);
> >>> >
> >>> > Looks like you only have suspicion here. Could you find a real
> >>> > violation
> >>> > here where __bpf_prog_put() is called with !in_irq() &&
> >>> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
> >>> have not seen
> >>> > things like that.
> >>> >
> >>> > >
> >>> > > Additionally, we found that array prog->aux->jited_linfo is
> >>> > initialized in
> >>> > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
> >>> > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
> >>> > > sizeof(*prog->aux->jited_linfo),
> >>> bpf_memcg_flags(GFP_KERNEL |
> >>> > __GFP_NOWARN));
> >>> >
> >>> > Any problem here?
> >>> >
> >>> > >
> >>> > > Our question is whether the condition 'in_irq() ||
> >>> > irqs_disabled() == false' is
> >>> > > sufficient for calling 'kvfree'. We are aware that calling
> >>> > 'kvfree' within the
> >>> > > context of a spin lock or an RCU lock is unsafe.
> >>>
> >>> Your above analysis makes sense if indeed that kvfree cannot appear
> >>> inside a spin lock region or RCU read lock region. But is it true?
> >>> I checked a few code paths in kvfree/kfree. It is either guarded
> >>> with local_irq_save/restore or by
> >>> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
> >>> anything? Are you talking about RT kernel here?
> >>>
> >>>
> >>> > >
> >>> > > Therefore, we propose modifying the condition to include
> >>> > in_atomic(). Could we
> >>> > > update the condition as follows: "in_irq() ||
> >>> irqs_disabled() ||
> >>> > in_atomic()"?
> >>> > >
> >>> > > Thank you! We look forward to your feedback.
> >>> > >
> >>> > > Signed-off-by: Teng Qi <[email protected]
> >>> <mailto:[email protected]>
> >>> > <mailto:[email protected]
> >>> <mailto:[email protected]>>>
> >>> >
> >>>
On 6/11/23 6:02 AM, Teng Qi wrote:
> Hello!
>> BTW, please do create a test case, e.g, sockmap test case which
>> can show the problem with existing code base.
>
> I add a printk in bpf_prog_put_deferred():
> static void bpf_prog_put_deferred(struct work_struct *work)
> {
> // . . .
> int inIrq = in_irq();
> int irqsDisabled = irqs_disabled();
> int preemptBits = preempt_count();
> int inAtomic = in_atomic();
> int rcuHeld = rcu_read_lock_held();
> printk("bpf_prog_put: in_irq() %d, irqs_disabled() %d, preempt_count()
> %d, in_atomic() %d, rcu_read_lock_held() %d",
> inIrq, irqsDisabled, preemptBits, inAtomic, rcuHeld);
> // . . .
> }
>
> When running the selftest, I see the following output:
> [255340.388339] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
> preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
> [255393.237632] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
> preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
It would be great if you also print out in_interrupt() value, so we know
whether softirq or nmi is enabled or not.
We cannot really WARN with !rcu_read_lock_held() since the
__bpf_prog_put funciton is called in different contexts.
Also, note that rcu_read_lock_held() may not be reliable. rcu subsystem
will return 1 if not tracked or not sure about the result.
>
> Based on this output, I believe it is sufficient to construct a self-test case
> for bpf_prog_put_deferred() called under preempt disabled or rcu read lock
> region. However, I'm a bit confused about what I should do to build the
> self-test case. Are we looking to create a checker that verifies the
> context of bpf_prog_put_deferred() is valid? Or do we need a test case that
> can trigger this bug?
>
> Could you show me more ideas to construct a self test case? I am not familiar
> with it and have no idea.
Okay, I see. It seems hard to create a test case with warnings since
bpf_prog_put_deferred is called in different context. So some
examples for possible issues (through code analysis) should be good enough.
>
> -- Teng Qi
>
> On Thu, May 25, 2023 at 3:34 AM Yonghong Song <[email protected]> wrote:
>>
>>
>>
>> On 5/24/23 5:42 AM, Teng Qi wrote:
>>> Thank you.
>>>
>>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>>>> value rcu_read_lock_held() could be 1 for some configurations regardless
>>>> whether rcu_read_lock() is really held or not. In most cases,
>>>> rcu_read_lock_held() is used in issuing potential warnings.
>>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>>
>>> Sorry. I was not aware of the dependency of configurations of
>>> rcu_read_lock_held().
>>>
>>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>>>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>>
>>> I agree that using !in_interrupt() as a condition is an acceptable solution.
>>
>> This should work although it could be conservative.
>>
>>>
>>>> Alternatively, we could have another solution. We could add another
>>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>>>> will be done in rcu context.
>>>
>>> Implementing a new function like bpf_prog_put_rcu() is a solution that involves
>>> more significant changes.
>>
>> Maybe we can change signature of bpf_prog_put instead? Like
>> void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
>> and inside bpf_prog_put we can add
>> WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
>>
>>>
>>>> So if in_interrupt(), do kvfree, otherwise,
>>>> put into a workqueue.
>>>
>>> Shall we proceed with submitting a patch following this approach?
>>
>> You could choose either of the above although I think with newer
>> bpf_prog_put() is better.
>>
>> BTW, please do create a test case, e.g, sockmap test case which
>> can show the problem with existing code base.
>>
>>>
>>> I would like to mention something unrelated to the possible bug. At this
>>> moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
>>> but not safe under other atomic contexts.
>>> This disorder challenges our conventional belief, a monotonic incrementation
>>> of limitations of the hierarchical atomic contexts, that programer needs
>>> to be more and more careful to write code under rcu read lock, spin lock,
>>> bh disable, interrupt...
>>> This disorder can lead to unexpected consequences, such as code being safe
>>> under interrupts but not safe under spin locks.
>>> The disorder makes kernel programming more complex and may result in more bugs.
>>> Even though we find a way to resolve the possible bug about the bpf_prog_put(),
>>> I feel sad for undermining of kernel`s maintainability and disorder of
>>> hierarchy of atomic contexts.
>>>
>>> -- Teng Qi
>>>
>>> On Tue, May 23, 2023 at 12:33 PM Yonghong Song <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 5/21/23 6:39 AM, Teng Qi wrote:
>>>>> Thank you.
>>>>>
>>>>> > Your above analysis makes sense if indeed that kvfree cannot appear
>>>>> > inside a spin lock region or RCU read lock region. But is it true?
>>>>> > I checked a few code paths in kvfree/kfree. It is either guarded
>>>>> > with local_irq_save/restore or by
>>>>> > spin_lock_irqsave/spin_unlock_
>>>>> > irqrestore, etc. Did I miss
>>>>> > anything? Are you talking about RT kernel here?
>>>>>
>>>>> To see the sleepable possibility of kvfree, it is important to analyze the
>>>>> following calling stack:
>>>>> mm/util.c: 645 kvfree()
>>>>> mm/vmalloc.c: 2763 vfree()
>>>>>
>>>>> In kvfree(), to call vfree, if the pointer addr points to memory
>>>>> allocated by
>>>>> vmalloc(), it calls vfree().
>>>>> void kvfree(const void *addr)
>>>>> {
>>>>> if (is_vmalloc_addr(addr))
>>>>> vfree(addr);
>>>>> else
>>>>> kfree(addr);
>>>>> }
>>>>>
>>>>> In vfree(), in_interrupt() and might_sleep() need to be considered.
>>>>> void vfree(const void *addr)
>>>>> {
>>>>> // ...
>>>>> if (unlikely(in_interrupt()))
>>>>> {
>>>>> vfree_atomic(addr);
>>>>> return;
>>>>> }
>>>>> // ...
>>>>> might_sleep();
>>>>> // ...
>>>>> }
>>>>
>>>> Sorry. I didn't check vfree path. So it does look like that
>>>> we need to pay special attention to non interrupt part.
>>>>
>>>>>
>>>>> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
>>>>> could have in_interrupt() == false and spin lock region which only disables
>>>>> preemption also has in_interrupt() == false. So the kvfree() cannot appear
>>>>> inside a spin lock region or RCU read lock region if the pointer addr points
>>>>> to memory allocated by vmalloc().
>>>>>
>>>>> > > Therefore, we propose modifying the condition to include
>>>>> > > in_atomic(). Could we
>>>>> > > update the condition as follows: "in_irq() || irqs_disabled() ||
>>>>> > > in_atomic()"?
>>>>> > Thank you! We look forward to your feedback.
>>>>>
>>>>> We now think that ‘irqs_disabled() || in_atomic() ||
>>>>> rcu_read_lock_held()’ is
>>>>> more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
>>>>> preempt count and rcu_read_lock_held() is for RCU read lock region.
>>>>
>>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>>>> value rcu_read_lock_held() could be 1 for some configuraitons regardless
>>>> whether rcu_read_lock() is really held or not. In most cases,
>>>> rcu_read_lock_held() is used in issuing potential warnings.
>>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>>>
>>>> I agree with your that 'irqs_disabled() || in_atomic()' makes sense
>>>> since it covers process context local_irq_save() and spin_lock() cases.
>>>>
>>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>>>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>>>
>>>> Alternatively, we could have another solution. We could add another
>>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>>>> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
>>>> put into a workqueue.
>>>>
>>>>
>>>>>
>>>>> -- Teng Qi
>>>>>
>>>>> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 5/19/23 7:18 AM, Teng Qi wrote:
>>>>> > Thank you for your response.
>>>>> > > Looks like you only have suspicion here. Could you find a real
>>>>> violation
>>>>> > > here where __bpf_prog_put() is called with !in_irq() &&
>>>>> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>>>> have not seen
>>>>> > > things like that.
>>>>> >
>>>>> > For the complex conditions to call bpf_prog_put() with 1 refcnt,
>>>>> we have
>>>>> > been
>>>>> > unable to really trigger this atomic violation after trying to
>>>>> construct
>>>>> > test cases manually. But we found that it is possible to show
>>>>> cases with
>>>>> > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
>>>>> > For example, even a failed case, one of selftest cases of bpf,
>>>>> netns_cookie,
>>>>> > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
>>>>> > only inside rcu read lock: The possible call stack is:
>>>>> > net/core/sock_map.c: 615 bpf_sock_map_update()
>>>>> > net/core/sock_map.c: 468 sock_map_update_common()
>>>>> > net/core/sock_map.c: 217 sock_map_link()
>>>>> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
>>>>> >
>>>>> > The files about netns_cookie include
>>>>> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
>>>>> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
>>>>> inserted the
>>>>> > following code in
>>>>> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
>>>>> > static int sock_map_update_common(..)
>>>>> > {
>>>>> > int inIrq = in_irq();
>>>>> > int irqsDisabled = irqs_disabled();
>>>>> > int preemptBits = preempt_count();
>>>>> > int inAtomic = in_atomic();
>>>>> > int rcuHeld = rcu_read_lock_held();
>>>>> > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
>>>>> > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
>>>>> irqsDisabled,
>>>>> > preemptBits, inAtomic, rcuHeld);
>>>>> > }
>>>>> >
>>>>> > The output message is as follows:
>>>>> > root@(none):/root/bpf# ./test_progs -t netns_cookie
>>>>> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
>>>>> > in_atomic() 0,
>>>>> > rcu_read_lock_held() 1
>>>>> > #113 netns_cookie:OK
>>>>> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>>>>> >
>>>>> > We notice that there are numerous callers in kernel/, net/ and
>>>>> drivers/,
>>>>> > so we
>>>>> > highly suggest modifying __bpf_prog_put() to address this gap.
>>>>> The gap
>>>>> > exists
>>>>> > because __bpf_prog_put() is only safe under in_irq() ||
>>>>> irqs_disabled()
>>>>> > but not in_atomic() || rcu_read_lock_held(). The following code
>>>>> snippet may
>>>>> > mislead developers into thinking that bpf_prog_put() is safe in all
>>>>> > contexts.
>>>>> > if (in_irq() || irqs_disabled()) {
>>>>> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>>>> > schedule_work(&aux->work);
>>>>> > } else {
>>>>> > bpf_prog_put_deferred(&aux->work);
>>>>> > }
>>>>> >
>>>>> > Implicit dependency may lead to issues.
>>>>> >
>>>>> > > Any problem here?
>>>>> > We mentioned it to demonstrate the possibility of kvfree() being
>>>>> > called by __bpf_prog_put_noref().
>>>>> >
>>>>> > Thanks.
>>>>> > -- Teng Qi
>>>>> >
>>>>> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <[email protected]
>>>>> <mailto:[email protected]>
>>>>> > <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> > On 5/16/23 4:18 AM, [email protected]
>>>>> <mailto:[email protected]>
>>>>> > <mailto:[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>> > > From: Teng Qi <[email protected]
>>>>> <mailto:[email protected]>
>>>>> > <mailto:[email protected]
>>>>> <mailto:[email protected]>>>
>>>>> > >
>>>>> > > Hi, bpf developers,
>>>>> > >
>>>>> > > We are developing a static tool to check the matching between
>>>>> > helpers and the
>>>>> > > context of hooks. During our analysis, we have discovered some
>>>>> > important
>>>>> > > findings that we would like to report.
>>>>> > >
>>>>> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
>>>>> function
>>>>> > > bpf_prog_put_deferred() won`t be called in the condition of
>>>>> > > ‘in_irq() || irqs_disabled()’.
>>>>> > > if (in_irq() || irqs_disabled()) {
>>>>> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>>>> > > schedule_work(&aux->work);
>>>>> > > } else {
>>>>> > >
>>>>> > > bpf_prog_put_deferred(&aux->work);
>>>>> > > }
>>>>> > >
>>>>> > > We suspect this condition exists because there might be
>>>>> sleepable
>>>>> > operations
>>>>> > > in the callees of the bpf_prog_put_deferred() function:
>>>>> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
>>>>> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
>>>>> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
>>>>> > > kvfree(prog->aux->jited_linfo);
>>>>> > > kvfree(prog->aux->linfo);
>>>>> >
>>>>> > Looks like you only have suspicion here. Could you find a real
>>>>> > violation
>>>>> > here where __bpf_prog_put() is called with !in_irq() &&
>>>>> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>>>> have not seen
>>>>> > things like that.
>>>>> >
>>>>> > >
>>>>> > > Additionally, we found that array prog->aux->jited_linfo is
>>>>> > initialized in
>>>>> > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
>>>>> > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
>>>>> > > sizeof(*prog->aux->jited_linfo),
>>>>> bpf_memcg_flags(GFP_KERNEL |
>>>>> > __GFP_NOWARN));
>>>>> >
>>>>> > Any problem here?
>>>>> >
>>>>> > >
>>>>> > > Our question is whether the condition 'in_irq() ||
>>>>> > irqs_disabled() == false' is
>>>>> > > sufficient for calling 'kvfree'. We are aware that calling
>>>>> > 'kvfree' within the
>>>>> > > context of a spin lock or an RCU lock is unsafe.
>>>>>
>>>>> Your above analysis makes sense if indeed that kvfree cannot appear
>>>>> inside a spin lock region or RCU read lock region. But is it true?
>>>>> I checked a few code paths in kvfree/kfree. It is either guarded
>>>>> with local_irq_save/restore or by
>>>>> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
>>>>> anything? Are you talking about RT kernel here?
>>>>>
>>>>>
>>>>> > >
>>>>> > > Therefore, we propose modifying the condition to include
>>>>> > in_atomic(). Could we
>>>>> > > update the condition as follows: "in_irq() ||
>>>>> irqs_disabled() ||
>>>>> > in_atomic()"?
>>>>> > >
>>>>> > > Thank you! We look forward to your feedback.
>>>>> > >
>>>>> > > Signed-off-by: Teng Qi <[email protected]
>>>>> <mailto:[email protected]>
>>>>> > <mailto:[email protected]
>>>>> <mailto:[email protected]>>>
>>>>> >
>>>>>
Hello!
> It would be great if you also print out in_interrupt() value, so we know
> whether softirq or nmi is enabled or not.
After adding the in_interrupt(), the interesting output cases are as follows:
[ 38.596580] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 256,
preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 0
[ 62.300608] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 256,
preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
[ 62.301179] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 0,
preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
Based on these cases, the current code is safe for the first two cases, because
in_interrupt() in vfree() prevents sleeping.
However, the rcu_read_lock_held() is not reliable, so we cannot rely on it.
Considering all the discussions so far, I think the best plan now is to change
the condition in __bpf_prog_put() to ‘irqs_disabled() || in_atomic()’ and
provide examples for possible issues. This plan effectively addresses
more possible atomic contexts of __bpf_prog_put() without incurring
any additional cost.
-- Teng Qi
On Mon, Jun 12, 2023 at 8:02 AM Yonghong Song <[email protected]> wrote:
>
>
>
> On 6/11/23 6:02 AM, Teng Qi wrote:
> > Hello!
> >> BTW, please do create a test case, e.g, sockmap test case which
> >> can show the problem with existing code base.
> >
> > I add a printk in bpf_prog_put_deferred():
> > static void bpf_prog_put_deferred(struct work_struct *work)
> > {
> > // . . .
> > int inIrq = in_irq();
> > int irqsDisabled = irqs_disabled();
> > int preemptBits = preempt_count();
> > int inAtomic = in_atomic();
> > int rcuHeld = rcu_read_lock_held();
> > printk("bpf_prog_put: in_irq() %d, irqs_disabled() %d, preempt_count()
> > %d, in_atomic() %d, rcu_read_lock_held() %d",
> > inIrq, irqsDisabled, preemptBits, inAtomic, rcuHeld);
> > // . . .
> > }
> >
> > When running the selftest, I see the following output:
> > [255340.388339] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
> > preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
> > [255393.237632] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
> > preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
>
> It would be great if you also print out in_interrupt() value, so we know
> whether softirq or nmi is enabled or not.
>
> We cannot really WARN with !rcu_read_lock_held() since the
> __bpf_prog_put funciton is called in different contexts.
>
> Also, note that rcu_read_lock_held() may not be reliable. rcu subsystem
> will return 1 if not tracked or not sure about the result.
>
> >
> > Based on this output, I believe it is sufficient to construct a self-test case
> > for bpf_prog_put_deferred() called under preempt disabled or rcu read lock
> > region. However, I'm a bit confused about what I should do to build the
> > self-test case. Are we looking to create a checker that verifies the
> > context of bpf_prog_put_deferred() is valid? Or do we need a test case that
> > can trigger this bug?
> >
> > Could you show me more ideas to construct a self test case? I am not familiar
> > with it and have no idea.
>
> Okay, I see. It seems hard to create a test case with warnings since
> bpf_prog_put_deferred is called in different context. So some
> examples for possible issues (through code analysis) should be good enough.
>
> >
> > -- Teng Qi
> >
> > On Thu, May 25, 2023 at 3:34 AM Yonghong Song <[email protected]> wrote:
> >>
> >>
> >>
> >> On 5/24/23 5:42 AM, Teng Qi wrote:
> >>> Thank you.
> >>>
> >>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> >>>> value rcu_read_lock_held() could be 1 for some configurations regardless
> >>>> whether rcu_read_lock() is really held or not. In most cases,
> >>>> rcu_read_lock_held() is used in issuing potential warnings.
> >>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
> >>>
> >>> Sorry. I was not aware of the dependency of configurations of
> >>> rcu_read_lock_held().
> >>>
> >>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> >>>> can be !in_interrupt(), so any process-context will go to a workqueue.
> >>>
> >>> I agree that using !in_interrupt() as a condition is an acceptable solution.
> >>
> >> This should work although it could be conservative.
> >>
> >>>
> >>>> Alternatively, we could have another solution. We could add another
> >>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> >>>> will be done in rcu context.
> >>>
> >>> Implementing a new function like bpf_prog_put_rcu() is a solution that involves
> >>> more significant changes.
> >>
> >> Maybe we can change signature of bpf_prog_put instead? Like
> >> void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
> >> and inside bpf_prog_put we can add
> >> WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
> >>
> >>>
> >>>> So if in_interrupt(), do kvfree, otherwise,
> >>>> put into a workqueue.
> >>>
> >>> Shall we proceed with submitting a patch following this approach?
> >>
> >> You could choose either of the above although I think with newer
> >> bpf_prog_put() is better.
> >>
> >> BTW, please do create a test case, e.g, sockmap test case which
> >> can show the problem with existing code base.
> >>
> >>>
> >>> I would like to mention something unrelated to the possible bug. At this
> >>> moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
> >>> but not safe under other atomic contexts.
> >>> This disorder challenges our conventional belief, a monotonic incrementation
> >>> of limitations of the hierarchical atomic contexts, that programer needs
> >>> to be more and more careful to write code under rcu read lock, spin lock,
> >>> bh disable, interrupt...
> >>> This disorder can lead to unexpected consequences, such as code being safe
> >>> under interrupts but not safe under spin locks.
> >>> The disorder makes kernel programming more complex and may result in more bugs.
> >>> Even though we find a way to resolve the possible bug about the bpf_prog_put(),
> >>> I feel sad for undermining of kernel`s maintainability and disorder of
> >>> hierarchy of atomic contexts.
> >>>
> >>> -- Teng Qi
> >>>
> >>> On Tue, May 23, 2023 at 12:33 PM Yonghong Song <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 5/21/23 6:39 AM, Teng Qi wrote:
> >>>>> Thank you.
> >>>>>
> >>>>> > Your above analysis makes sense if indeed that kvfree cannot appear
> >>>>> > inside a spin lock region or RCU read lock region. But is it true?
> >>>>> > I checked a few code paths in kvfree/kfree. It is either guarded
> >>>>> > with local_irq_save/restore or by
> >>>>> > spin_lock_irqsave/spin_unlock_
> >>>>> > irqrestore, etc. Did I miss
> >>>>> > anything? Are you talking about RT kernel here?
> >>>>>
> >>>>> To see the sleepable possibility of kvfree, it is important to analyze the
> >>>>> following calling stack:
> >>>>> mm/util.c: 645 kvfree()
> >>>>> mm/vmalloc.c: 2763 vfree()
> >>>>>
> >>>>> In kvfree(), to call vfree, if the pointer addr points to memory
> >>>>> allocated by
> >>>>> vmalloc(), it calls vfree().
> >>>>> void kvfree(const void *addr)
> >>>>> {
> >>>>> if (is_vmalloc_addr(addr))
> >>>>> vfree(addr);
> >>>>> else
> >>>>> kfree(addr);
> >>>>> }
> >>>>>
> >>>>> In vfree(), in_interrupt() and might_sleep() need to be considered.
> >>>>> void vfree(const void *addr)
> >>>>> {
> >>>>> // ...
> >>>>> if (unlikely(in_interrupt()))
> >>>>> {
> >>>>> vfree_atomic(addr);
> >>>>> return;
> >>>>> }
> >>>>> // ...
> >>>>> might_sleep();
> >>>>> // ...
> >>>>> }
> >>>>
> >>>> Sorry. I didn't check vfree path. So it does look like that
> >>>> we need to pay special attention to non interrupt part.
> >>>>
> >>>>>
> >>>>> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
> >>>>> could have in_interrupt() == false and spin lock region which only disables
> >>>>> preemption also has in_interrupt() == false. So the kvfree() cannot appear
> >>>>> inside a spin lock region or RCU read lock region if the pointer addr points
> >>>>> to memory allocated by vmalloc().
> >>>>>
> >>>>> > > Therefore, we propose modifying the condition to include
> >>>>> > > in_atomic(). Could we
> >>>>> > > update the condition as follows: "in_irq() || irqs_disabled() ||
> >>>>> > > in_atomic()"?
> >>>>> > Thank you! We look forward to your feedback.
> >>>>>
> >>>>> We now think that ‘irqs_disabled() || in_atomic() ||
> >>>>> rcu_read_lock_held()’ is
> >>>>> more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
> >>>>> preempt count and rcu_read_lock_held() is for RCU read lock region.
> >>>>
> >>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> >>>> value rcu_read_lock_held() could be 1 for some configuraitons regardless
> >>>> whether rcu_read_lock() is really held or not. In most cases,
> >>>> rcu_read_lock_held() is used in issuing potential warnings.
> >>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
> >>>>
> >>>> I agree with your that 'irqs_disabled() || in_atomic()' makes sense
> >>>> since it covers process context local_irq_save() and spin_lock() cases.
> >>>>
> >>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> >>>> can be !in_interrupt(), so any process-context will go to a workqueue.
> >>>>
> >>>> Alternatively, we could have another solution. We could add another
> >>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> >>>> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
> >>>> put into a workqueue.
> >>>>
> >>>>
> >>>>>
> >>>>> -- Teng Qi
> >>>>>
> >>>>> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <[email protected]
> >>>>> <mailto:[email protected]>> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 5/19/23 7:18 AM, Teng Qi wrote:
> >>>>> > Thank you for your response.
> >>>>> > > Looks like you only have suspicion here. Could you find a real
> >>>>> violation
> >>>>> > > here where __bpf_prog_put() is called with !in_irq() &&
> >>>>> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
> >>>>> have not seen
> >>>>> > > things like that.
> >>>>> >
> >>>>> > For the complex conditions to call bpf_prog_put() with 1 refcnt,
> >>>>> we have
> >>>>> > been
> >>>>> > unable to really trigger this atomic violation after trying to
> >>>>> construct
> >>>>> > test cases manually. But we found that it is possible to show
> >>>>> cases with
> >>>>> > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
> >>>>> > For example, even a failed case, one of selftest cases of bpf,
> >>>>> netns_cookie,
> >>>>> > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
> >>>>> > only inside rcu read lock: The possible call stack is:
> >>>>> > net/core/sock_map.c: 615 bpf_sock_map_update()
> >>>>> > net/core/sock_map.c: 468 sock_map_update_common()
> >>>>> > net/core/sock_map.c: 217 sock_map_link()
> >>>>> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
> >>>>> >
> >>>>> > The files about netns_cookie include
> >>>>> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
> >>>>> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
> >>>>> inserted the
> >>>>> > following code in
> >>>>> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
> >>>>> > static int sock_map_update_common(..)
> >>>>> > {
> >>>>> > int inIrq = in_irq();
> >>>>> > int irqsDisabled = irqs_disabled();
> >>>>> > int preemptBits = preempt_count();
> >>>>> > int inAtomic = in_atomic();
> >>>>> > int rcuHeld = rcu_read_lock_held();
> >>>>> > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
> >>>>> > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
> >>>>> irqsDisabled,
> >>>>> > preemptBits, inAtomic, rcuHeld);
> >>>>> > }
> >>>>> >
> >>>>> > The output message is as follows:
> >>>>> > root@(none):/root/bpf# ./test_progs -t netns_cookie
> >>>>> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
> >>>>> > in_atomic() 0,
> >>>>> > rcu_read_lock_held() 1
> >>>>> > #113 netns_cookie:OK
> >>>>> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> >>>>> >
> >>>>> > We notice that there are numerous callers in kernel/, net/ and
> >>>>> drivers/,
> >>>>> > so we
> >>>>> > highly suggest modifying __bpf_prog_put() to address this gap.
> >>>>> The gap
> >>>>> > exists
> >>>>> > because __bpf_prog_put() is only safe under in_irq() ||
> >>>>> irqs_disabled()
> >>>>> > but not in_atomic() || rcu_read_lock_held(). The following code
> >>>>> snippet may
> >>>>> > mislead developers into thinking that bpf_prog_put() is safe in all
> >>>>> > contexts.
> >>>>> > if (in_irq() || irqs_disabled()) {
> >>>>> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> >>>>> > schedule_work(&aux->work);
> >>>>> > } else {
> >>>>> > bpf_prog_put_deferred(&aux->work);
> >>>>> > }
> >>>>> >
> >>>>> > Implicit dependency may lead to issues.
> >>>>> >
> >>>>> > > Any problem here?
> >>>>> > We mentioned it to demonstrate the possibility of kvfree() being
> >>>>> > called by __bpf_prog_put_noref().
> >>>>> >
> >>>>> > Thanks.
> >>>>> > -- Teng Qi
> >>>>> >
> >>>>> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <[email protected]
> >>>>> <mailto:[email protected]>
> >>>>> > <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > On 5/16/23 4:18 AM, [email protected]
> >>>>> <mailto:[email protected]>
> >>>>> > <mailto:[email protected]
> >>>>> <mailto:[email protected]>> wrote:
> >>>>> > > From: Teng Qi <[email protected]
> >>>>> <mailto:[email protected]>
> >>>>> > <mailto:[email protected]
> >>>>> <mailto:[email protected]>>>
> >>>>> > >
> >>>>> > > Hi, bpf developers,
> >>>>> > >
> >>>>> > > We are developing a static tool to check the matching between
> >>>>> > helpers and the
> >>>>> > > context of hooks. During our analysis, we have discovered some
> >>>>> > important
> >>>>> > > findings that we would like to report.
> >>>>> > >
> >>>>> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
> >>>>> function
> >>>>> > > bpf_prog_put_deferred() won`t be called in the condition of
> >>>>> > > ‘in_irq() || irqs_disabled()’.
> >>>>> > > if (in_irq() || irqs_disabled()) {
> >>>>> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> >>>>> > > schedule_work(&aux->work);
> >>>>> > > } else {
> >>>>> > >
> >>>>> > > bpf_prog_put_deferred(&aux->work);
> >>>>> > > }
> >>>>> > >
> >>>>> > > We suspect this condition exists because there might be
> >>>>> sleepable
> >>>>> > operations
> >>>>> > > in the callees of the bpf_prog_put_deferred() function:
> >>>>> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
> >>>>> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
> >>>>> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
> >>>>> > > kvfree(prog->aux->jited_linfo);
> >>>>> > > kvfree(prog->aux->linfo);
> >>>>> >
> >>>>> > Looks like you only have suspicion here. Could you find a real
> >>>>> > violation
> >>>>> > here where __bpf_prog_put() is called with !in_irq() &&
> >>>>> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
> >>>>> have not seen
> >>>>> > things like that.
> >>>>> >
> >>>>> > >
> >>>>> > > Additionally, we found that array prog->aux->jited_linfo is
> >>>>> > initialized in
> >>>>> > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
> >>>>> > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
> >>>>> > > sizeof(*prog->aux->jited_linfo),
> >>>>> bpf_memcg_flags(GFP_KERNEL |
> >>>>> > __GFP_NOWARN));
> >>>>> >
> >>>>> > Any problem here?
> >>>>> >
> >>>>> > >
> >>>>> > > Our question is whether the condition 'in_irq() ||
> >>>>> > irqs_disabled() == false' is
> >>>>> > > sufficient for calling 'kvfree'. We are aware that calling
> >>>>> > 'kvfree' within the
> >>>>> > > context of a spin lock or an RCU lock is unsafe.
> >>>>>
> >>>>> Your above analysis makes sense if indeed that kvfree cannot appear
> >>>>> inside a spin lock region or RCU read lock region. But is it true?
> >>>>> I checked a few code paths in kvfree/kfree. It is either guarded
> >>>>> with local_irq_save/restore or by
> >>>>> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
> >>>>> anything? Are you talking about RT kernel here?
> >>>>>
> >>>>>
> >>>>> > >
> >>>>> > > Therefore, we propose modifying the condition to include
> >>>>> > in_atomic(). Could we
> >>>>> > > update the condition as follows: "in_irq() ||
> >>>>> irqs_disabled() ||
> >>>>> > in_atomic()"?
> >>>>> > >
> >>>>> > > Thank you! We look forward to your feedback.
> >>>>> > >
> >>>>> > > Signed-off-by: Teng Qi <[email protected]
> >>>>> <mailto:[email protected]>
> >>>>> > <mailto:[email protected]
> >>>>> <mailto:[email protected]>>>
> >>>>> >
> >>>>>
On 6/19/23 2:05 AM, Teng Qi wrote:
> Hello!
>
>> It would be great if you also print out in_interrupt() value, so we know
>> whether softirq or nmi is enabled or not.
>
> After adding the in_interrupt(), the interesting output cases are as follows:
> [ 38.596580] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 256,
> preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 0
> [ 62.300608] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 256,
> preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
> [ 62.301179] bpf_prog_put: in_irq() 0, irqs_disabled() 0, in_interrupt() 0,
> preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
>
> Based on these cases, the current code is safe for the first two cases, because
> in_interrupt() in vfree() prevents sleeping.
> However, the rcu_read_lock_held() is not reliable, so we cannot rely on it.
> Considering all the discussions so far, I think the best plan now is to change
> the condition in __bpf_prog_put() to ‘irqs_disabled() || in_atomic()’ and
> provide examples for possible issues. This plan effectively addresses
> more possible atomic contexts of __bpf_prog_put() without incurring
> any additional cost.
Thanks for analysis. In the above, in_atomic()=1 is due to
preempt_count()=256 which implies a softirq. Actually in_atomic() will
be true if preempt_disabled or in_interrupt. So I guess your previous
change
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a75c54b6f8a3..11df562e481b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2147,7 +2147,7 @@ static void __bpf_prog_put(struct bpf_prog *prog)
struct bpf_prog_aux *aux = prog->aux;
if (atomic64_dec_and_test(&aux->refcnt)) {
- if (in_irq() || irqs_disabled()) {
+ if (!in_interrupt()) {
INIT_WORK(&aux->work, bpf_prog_put_deferred);
schedule_work(&aux->work);
} else {
should be okay. Just need to explain in the commit message
- why with 'in_interrupt()' it is okay to call
bpf_prog_put_deferred() directly, and
- why with '!in_interrupt()' it is not okay to call
pf_prog_put_deferred() directly
Thanks!
> -- Teng Qi
>
> On Mon, Jun 12, 2023 at 8:02 AM Yonghong Song <[email protected]> wrote:
>>
>>
>>
>> On 6/11/23 6:02 AM, Teng Qi wrote:
>>> Hello!
>>>> BTW, please do create a test case, e.g, sockmap test case which
>>>> can show the problem with existing code base.
>>>
>>> I add a printk in bpf_prog_put_deferred():
>>> static void bpf_prog_put_deferred(struct work_struct *work)
>>> {
>>> // . . .
>>> int inIrq = in_irq();
>>> int irqsDisabled = irqs_disabled();
>>> int preemptBits = preempt_count();
>>> int inAtomic = in_atomic();
>>> int rcuHeld = rcu_read_lock_held();
>>> printk("bpf_prog_put: in_irq() %d, irqs_disabled() %d, preempt_count()
>>> %d, in_atomic() %d, rcu_read_lock_held() %d",
>>> inIrq, irqsDisabled, preemptBits, inAtomic, rcuHeld);
>>> // . . .
>>> }
>>>
>>> When running the selftest, I see the following output:
>>> [255340.388339] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
>>> preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
>>> [255393.237632] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
>>> preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
>>
>> It would be great if you also print out in_interrupt() value, so we know
>> whether softirq or nmi is enabled or not.
>>
>> We cannot really WARN with !rcu_read_lock_held() since the
>> __bpf_prog_put funciton is called in different contexts.
>>
>> Also, note that rcu_read_lock_held() may not be reliable. rcu subsystem
>> will return 1 if not tracked or not sure about the result.
>>
>>>
>>> Based on this output, I believe it is sufficient to construct a self-test case
>>> for bpf_prog_put_deferred() called under preempt disabled or rcu read lock
>>> region. However, I'm a bit confused about what I should do to build the
>>> self-test case. Are we looking to create a checker that verifies the
>>> context of bpf_prog_put_deferred() is valid? Or do we need a test case that
>>> can trigger this bug?
>>>
>>> Could you show me more ideas to construct a self test case? I am not familiar
>>> with it and have no idea.
>>
>> Okay, I see. It seems hard to create a test case with warnings since
>> bpf_prog_put_deferred is called in different context. So some
>> examples for possible issues (through code analysis) should be good enough.
>>
>>>
>>> -- Teng Qi
>>>
>>> On Thu, May 25, 2023 at 3:34 AM Yonghong Song <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 5/24/23 5:42 AM, Teng Qi wrote:
>>>>> Thank you.
>>>>>
>>>>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>>>>>> value rcu_read_lock_held() could be 1 for some configurations regardless
>>>>>> whether rcu_read_lock() is really held or not. In most cases,
>>>>>> rcu_read_lock_held() is used in issuing potential warnings.
>>>>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>>>>
>>>>> Sorry. I was not aware of the dependency of configurations of
>>>>> rcu_read_lock_held().
>>>>>
>>>>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>>>>>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>>>>
>>>>> I agree that using !in_interrupt() as a condition is an acceptable solution.
>>>>
>>>> This should work although it could be conservative.
>>>>
>>>>>
>>>>>> Alternatively, we could have another solution. We could add another
>>>>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>>>>>> will be done in rcu context.
>>>>>
>>>>> Implementing a new function like bpf_prog_put_rcu() is a solution that involves
>>>>> more significant changes.
>>>>
>>>> Maybe we can change signature of bpf_prog_put instead? Like
>>>> void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
>>>> and inside bpf_prog_put we can add
>>>> WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
>>>>
>>>>>
>>>>>> So if in_interrupt(), do kvfree, otherwise,
>>>>>> put into a workqueue.
>>>>>
>>>>> Shall we proceed with submitting a patch following this approach?
>>>>
>>>> You could choose either of the above although I think with newer
>>>> bpf_prog_put() is better.
>>>>
>>>> BTW, please do create a test case, e.g, sockmap test case which
>>>> can show the problem with existing code base.
>>>>
>>>>>
>>>>> I would like to mention something unrelated to the possible bug. At this
>>>>> moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
>>>>> but not safe under other atomic contexts.
>>>>> This disorder challenges our conventional belief, a monotonic incrementation
>>>>> of limitations of the hierarchical atomic contexts, that programer needs
>>>>> to be more and more careful to write code under rcu read lock, spin lock,
>>>>> bh disable, interrupt...
>>>>> This disorder can lead to unexpected consequences, such as code being safe
>>>>> under interrupts but not safe under spin locks.
>>>>> The disorder makes kernel programming more complex and may result in more bugs.
>>>>> Even though we find a way to resolve the possible bug about the bpf_prog_put(),
>>>>> I feel sad for undermining of kernel`s maintainability and disorder of
>>>>> hierarchy of atomic contexts.
>>>>>
>>>>> -- Teng Qi
>>>>>
>>>>> On Tue, May 23, 2023 at 12:33 PM Yonghong Song <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 5/21/23 6:39 AM, Teng Qi wrote:
>>>>>>> Thank you.
>>>>>>>
>>>>>>> > Your above analysis makes sense if indeed that kvfree cannot appear
>>>>>>> > inside a spin lock region or RCU read lock region. But is it true?
>>>>>>> > I checked a few code paths in kvfree/kfree. It is either guarded
>>>>>>> > with local_irq_save/restore or by
>>>>>>> > spin_lock_irqsave/spin_unlock_
>>>>>>> > irqrestore, etc. Did I miss
>>>>>>> > anything? Are you talking about RT kernel here?
>>>>>>>
>>>>>>> To see the sleepable possibility of kvfree, it is important to analyze the
>>>>>>> following calling stack:
>>>>>>> mm/util.c: 645 kvfree()
>>>>>>> mm/vmalloc.c: 2763 vfree()
>>>>>>>
>>>>>>> In kvfree(), to call vfree, if the pointer addr points to memory
>>>>>>> allocated by
>>>>>>> vmalloc(), it calls vfree().
>>>>>>> void kvfree(const void *addr)
>>>>>>> {
>>>>>>> if (is_vmalloc_addr(addr))
>>>>>>> vfree(addr);
>>>>>>> else
>>>>>>> kfree(addr);
>>>>>>> }
>>>>>>>
>>>>>>> In vfree(), in_interrupt() and might_sleep() need to be considered.
>>>>>>> void vfree(const void *addr)
>>>>>>> {
>>>>>>> // ...
>>>>>>> if (unlikely(in_interrupt()))
>>>>>>> {
>>>>>>> vfree_atomic(addr);
>>>>>>> return;
>>>>>>> }
>>>>>>> // ...
>>>>>>> might_sleep();
>>>>>>> // ...
>>>>>>> }
>>>>>>
>>>>>> Sorry. I didn't check vfree path. So it does look like that
>>>>>> we need to pay special attention to non interrupt part.
>>>>>>
>>>>>>>
>>>>>>> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
>>>>>>> could have in_interrupt() == false and spin lock region which only disables
>>>>>>> preemption also has in_interrupt() == false. So the kvfree() cannot appear
>>>>>>> inside a spin lock region or RCU read lock region if the pointer addr points
>>>>>>> to memory allocated by vmalloc().
>>>>>>>
>>>>>>> > > Therefore, we propose modifying the condition to include
>>>>>>> > > in_atomic(). Could we
>>>>>>> > > update the condition as follows: "in_irq() || irqs_disabled() ||
>>>>>>> > > in_atomic()"?
>>>>>>> > Thank you! We look forward to your feedback.
>>>>>>>
>>>>>>> We now think that ‘irqs_disabled() || in_atomic() ||
>>>>>>> rcu_read_lock_held()’ is
>>>>>>> more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
>>>>>>> preempt count and rcu_read_lock_held() is for RCU read lock region.
>>>>>>
>>>>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>>>>>> value rcu_read_lock_held() could be 1 for some configuraitons regardless
>>>>>> whether rcu_read_lock() is really held or not. In most cases,
>>>>>> rcu_read_lock_held() is used in issuing potential warnings.
>>>>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>>>>>
>>>>>> I agree with your that 'irqs_disabled() || in_atomic()' makes sense
>>>>>> since it covers process context local_irq_save() and spin_lock() cases.
>>>>>>
>>>>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>>>>>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>>>>>
>>>>>> Alternatively, we could have another solution. We could add another
>>>>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>>>>>> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
>>>>>> put into a workqueue.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> -- Teng Qi
>>>>>>>
>>>>>>> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <[email protected]
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 5/19/23 7:18 AM, Teng Qi wrote:
>>>>>>> > Thank you for your response.
>>>>>>> > > Looks like you only have suspicion here. Could you find a real
>>>>>>> violation
>>>>>>> > > here where __bpf_prog_put() is called with !in_irq() &&
>>>>>>> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>>>>>> have not seen
>>>>>>> > > things like that.
>>>>>>> >
>>>>>>> > For the complex conditions to call bpf_prog_put() with 1 refcnt,
>>>>>>> we have
>>>>>>> > been
>>>>>>> > unable to really trigger this atomic violation after trying to
>>>>>>> construct
>>>>>>> > test cases manually. But we found that it is possible to show
>>>>>>> cases with
>>>>>>> > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
>>>>>>> > For example, even a failed case, one of selftest cases of bpf,
>>>>>>> netns_cookie,
>>>>>>> > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
>>>>>>> > only inside rcu read lock: The possible call stack is:
>>>>>>> > net/core/sock_map.c: 615 bpf_sock_map_update()
>>>>>>> > net/core/sock_map.c: 468 sock_map_update_common()
>>>>>>> > net/core/sock_map.c: 217 sock_map_link()
>>>>>>> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
>>>>>>> >
>>>>>>> > The files about netns_cookie include
>>>>>>> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
>>>>>>> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
>>>>>>> inserted the
>>>>>>> > following code in
>>>>>>> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
>>>>>>> > static int sock_map_update_common(..)
>>>>>>> > {
>>>>>>> > int inIrq = in_irq();
>>>>>>> > int irqsDisabled = irqs_disabled();
>>>>>>> > int preemptBits = preempt_count();
>>>>>>> > int inAtomic = in_atomic();
>>>>>>> > int rcuHeld = rcu_read_lock_held();
>>>>>>> > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
>>>>>>> > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
>>>>>>> irqsDisabled,
>>>>>>> > preemptBits, inAtomic, rcuHeld);
>>>>>>> > }
>>>>>>> >
>>>>>>> > The output message is as follows:
>>>>>>> > root@(none):/root/bpf# ./test_progs -t netns_cookie
>>>>>>> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
>>>>>>> > in_atomic() 0,
>>>>>>> > rcu_read_lock_held() 1
>>>>>>> > #113 netns_cookie:OK
>>>>>>> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>>>>>>> >
>>>>>>> > We notice that there are numerous callers in kernel/, net/ and
>>>>>>> drivers/,
>>>>>>> > so we
>>>>>>> > highly suggest modifying __bpf_prog_put() to address this gap.
>>>>>>> The gap
>>>>>>> > exists
>>>>>>> > because __bpf_prog_put() is only safe under in_irq() ||
>>>>>>> irqs_disabled()
>>>>>>> > but not in_atomic() || rcu_read_lock_held(). The following code
>>>>>>> snippet may
>>>>>>> > mislead developers into thinking that bpf_prog_put() is safe in all
>>>>>>> > contexts.
>>>>>>> > if (in_irq() || irqs_disabled()) {
>>>>>>> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>>>>>> > schedule_work(&aux->work);
>>>>>>> > } else {
>>>>>>> > bpf_prog_put_deferred(&aux->work);
>>>>>>> > }
>>>>>>> >
>>>>>>> > Implicit dependency may lead to issues.
>>>>>>> >
>>>>>>> > > Any problem here?
>>>>>>> > We mentioned it to demonstrate the possibility of kvfree() being
>>>>>>> > called by __bpf_prog_put_noref().
>>>>>>> >
>>>>>>> > Thanks.
>>>>>>> > -- Teng Qi
>>>>>>> >
>>>>>>> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <[email protected]
>>>>>>> <mailto:[email protected]>
>>>>>>> > <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On 5/16/23 4:18 AM, [email protected]
>>>>>>> <mailto:[email protected]>
>>>>>>> > <mailto:[email protected]
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> > > From: Teng Qi <[email protected]
>>>>>>> <mailto:[email protected]>
>>>>>>> > <mailto:[email protected]
>>>>>>> <mailto:[email protected]>>>
>>>>>>> > >
>>>>>>> > > Hi, bpf developers,
>>>>>>> > >
>>>>>>> > > We are developing a static tool to check the matching between
>>>>>>> > helpers and the
>>>>>>> > > context of hooks. During our analysis, we have discovered some
>>>>>>> > important
>>>>>>> > > findings that we would like to report.
>>>>>>> > >
>>>>>>> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
>>>>>>> function
>>>>>>> > > bpf_prog_put_deferred() won`t be called in the condition of
>>>>>>> > > ‘in_irq() || irqs_disabled()’.
>>>>>>> > > if (in_irq() || irqs_disabled()) {
>>>>>>> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>>>>>> > > schedule_work(&aux->work);
>>>>>>> > > } else {
>>>>>>> > >
>>>>>>> > > bpf_prog_put_deferred(&aux->work);
>>>>>>> > > }
>>>>>>> > >
>>>>>>> > > We suspect this condition exists because there might be
>>>>>>> sleepable
>>>>>>> > operations
>>>>>>> > > in the callees of the bpf_prog_put_deferred() function:
>>>>>>> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
>>>>>>> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
>>>>>>> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
>>>>>>> > > kvfree(prog->aux->jited_linfo);
>>>>>>> > > kvfree(prog->aux->linfo);
>>>>>>> >
>>>>>>> > Looks like you only have suspicion here. Could you find a real
>>>>>>> > violation
>>>>>>> > here where __bpf_prog_put() is called with !in_irq() &&
>>>>>>> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>>>>>> have not seen
>>>>>>> > things like that.
>>>>>>> >
>>>>>>> > >
>>>>>>> > > Additionally, we found that array prog->aux->jited_linfo is
>>>>>>> > initialized in
>>>>>>> > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
>>>>>>> > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
>>>>>>> > > sizeof(*prog->aux->jited_linfo),
>>>>>>> bpf_memcg_flags(GFP_KERNEL |
>>>>>>> > __GFP_NOWARN));
>>>>>>> >
>>>>>>> > Any problem here?
>>>>>>> >
>>>>>>> > >
>>>>>>> > > Our question is whether the condition 'in_irq() ||
>>>>>>> > irqs_disabled() == false' is
>>>>>>> > > sufficient for calling 'kvfree'. We are aware that calling
>>>>>>> > 'kvfree' within the
>>>>>>> > > context of a spin lock or an RCU lock is unsafe.
>>>>>>>
>>>>>>> Your above analysis makes sense if indeed that kvfree cannot appear
>>>>>>> inside a spin lock region or RCU read lock region. But is it true?
>>>>>>> I checked a few code paths in kvfree/kfree. It is either guarded
>>>>>>> with local_irq_save/restore or by
>>>>>>> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
>>>>>>> anything? Are you talking about RT kernel here?
>>>>>>>
>>>>>>>
>>>>>>> > >
>>>>>>> > > Therefore, we propose modifying the condition to include
>>>>>>> > in_atomic(). Could we
>>>>>>> > > update the condition as follows: "in_irq() ||
>>>>>>> irqs_disabled() ||
>>>>>>> > in_atomic()"?
>>>>>>> > >
>>>>>>> > > Thank you! We look forward to your feedback.
>>>>>>> > >
>>>>>>> > > Signed-off-by: Teng Qi <[email protected]
>>>>>>> <mailto:[email protected]>
>>>>>>> > <mailto:[email protected]
>>>>>>> <mailto:[email protected]>>>
>>>>>>> >
>>>>>>>