2022-11-29 13:25:00

by Hou Tao

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi,

On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
> On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <[email protected]> wrote:
>> Hi,
>>
>> On 11/29/2022 5:55 AM, Hao Luo wrote:
>>> On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <[email protected]> wrote:
>>> Hi Tonghao,
>>>
>>> With a quick look at the htab_lock_bucket() and your problem
>>> statement, I agree with Hou Tao that using hash &
>>> min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
>>> to fix the potential deadlock. Can you actually send your changes as
>>> v2 so we can take a look and better help you? Also, can you explain
>>> your solution in your commit message? Right now, your commit message
>>> has only a problem statement and is not very clear. Please include
>>> more details on what you do to fix the issue.
>>>
>>> Hao
>> It would be better if the test case below could be rewritten as a bpf selftest.
>> Please see comments below on how to improve it and reproduce the deadlock.
>>>> Hi
>>>> only a warning from lockdep.
>> Thanks for your detailed instructions. I can reproduce the warning by using your
>> setup. I am not a lockdep expert; it seems that fixing such a warning requires
>> assigning a different lockdep class to each bucket. Because we use map_locked to
>> protect the acquisition of the bucket lock, I think we can define a lock_class_key
>> array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
>> bucket locks accordingly.
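
Roughly, the idea would look like this (a hypothetical reconstruction for
illustration, not the actual patch):

	struct bpf_htab {
		struct bpf_map map;
		struct bucket *buckets;
		/* hypothetical: one lockdep class per map_locked slot */
		struct lock_class_key lockdep_key[HASHTAB_MAP_LOCK_COUNT];
		/* ... other fields ... */
	};

	/* at map creation time, for each bucket i: */
	raw_spin_lock_init(&htab->buckets[i].raw_lock);
	lockdep_set_class(&htab->buckets[i].raw_lock,
			  &htab->lockdep_key[i % HASHTAB_MAP_LOCK_COUNT]);
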
The proposed lockdep solution doesn't work. I still got a lockdep warning after
that, so cc +locking experts +lkml.org for lockdep help.

Hi lockdep experts,

We are trying to fix the following lockdep warning from bpf subsystem:

[   36.092222] ================================
[   36.092230] WARNING: inconsistent lock state
[   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
[   36.092236] --------------------------------
[   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
[   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
htab_lock_bucket+0x4d/0x58
[   36.092253] {INITIAL USE} state was registered at:
[   36.092255]   mark_usage+0x1d/0x11d
[   36.092262]   __lock_acquire+0x3c9/0x6ed
[   36.092266]   lock_acquire+0x23d/0x29a
[   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
[   36.092274]   htab_lock_bucket+0x4d/0x58
[   36.092276]   htab_map_delete_elem+0x82/0xfb
[   36.092278]   map_delete_elem+0x156/0x1ac
[   36.092282]   __sys_bpf+0x138/0xb71
[   36.092285]   __do_sys_bpf+0xd/0x15
[   36.092288]   do_syscall_64+0x6d/0x84
[   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   36.092295] irq event stamp: 120346
[   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
_raw_spin_unlock_irq+0x24/0x39
[   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
generic_exec_single+0x40/0xb9
[   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
__do_softirq+0x347/0x387
[   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
__irq_exit_rcu+0x67/0xc6
[   36.092311]
[   36.092311] other info that might help us debug this:
[   36.092312]  Possible unsafe locking scenario:
[   36.092312]
[   36.092313]        CPU0
[   36.092313]        ----
[   36.092314]   lock(&htab->lockdep_key);
[   36.092315]   <Interrupt>
[   36.092316]     lock(&htab->lockdep_key);
[   36.092318]
[   36.092318]  *** DEADLOCK ***
[   36.092318]
[   36.092318] 3 locks held by perf/1515:
[   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
perf_event_ctx_lock_nested+0x8e/0xba
[   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
perf_event_for_each_child+0x35/0x76
[   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
perf_ctx_lock+0x12/0x27
[   36.092339]
[   36.092339] stack backtrace:
[   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E     
6.1.0-rc5+ #81
[   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   36.092349] Call Trace:
[   36.092351]  <NMI>
[   36.092354]  dump_stack_lvl+0x57/0x81
[   36.092359]  lock_acquire+0x1f4/0x29a
[   36.092363]  ? handle_pmi_common+0x13f/0x1f0
[   36.092366]  ? htab_lock_bucket+0x4d/0x58
[   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
[   36.092374]  ? htab_lock_bucket+0x4d/0x58
[   36.092377]  htab_lock_bucket+0x4d/0x58
[   36.092379]  htab_map_update_elem+0x11e/0x220
[   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
[   36.092392]  trace_call_bpf+0x177/0x215
[   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
[   36.092403]  ? x86_pmu_stop+0x97/0x97
[   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
[   36.092415]  nmi_handle+0x116/0x254
[   36.092418]  ? x86_pmu_stop+0x97/0x97
[   36.092423]  default_do_nmi+0x3d/0xf6
[   36.092428]  exc_nmi+0xa1/0x109
[   36.092432]  end_repeat_nmi+0x16/0x67
[   36.092436] RIP: 0010:wrmsrl+0xd/0x1b
[   36.092441] Code: 04 01 00 00 c6 84 07 48 01 00 00 01 5b e9 46 15 80 00 5b c3
cc cc cc cc c3 cc cc cc cc 48 89 f2 89 f9 89 f0 48 c1 ea 20 0f 30 <66> 90 c3 cc
cc cc cc 31 d2 e9 2f 04 49 00 0f 1f 44 00 00 40 0f6
[   36.092443] RSP: 0018:ffffc900043dfc48 EFLAGS: 00000002
[   36.092445] RAX: 000000000000000f RBX: ffff8881b96153e0 RCX: 000000000000038f
[   36.092447] RDX: 0000000000000007 RSI: 000000070000000f RDI: 000000000000038f
[   36.092449] RBP: 000000070000000f R08: ffffffffffffffff R09: ffff8881053bdaa8
[   36.092451] R10: ffff8881b9805d40 R11: 0000000000000005 R12: ffff8881b9805c00
[   36.092452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881075ec970
[   36.092460]  ? wrmsrl+0xd/0x1b
[   36.092465]  ? wrmsrl+0xd/0x1b
[   36.092469]  </NMI>
[   36.092469]  <TASK>
[   36.092470]  __intel_pmu_enable_all.constprop.0+0x7c/0xaf
[   36.092475]  event_function+0xb6/0xd3
[   36.092478]  ? cpu_to_node+0x1a/0x1a
[   36.092482]  ? cpu_to_node+0x1a/0x1a
[   36.092485]  remote_function+0x1e/0x4c
[   36.092489]  generic_exec_single+0x48/0xb9
[   36.092492]  ? __lock_acquire+0x666/0x6ed
[   36.092497]  smp_call_function_single+0xbf/0x106
[   36.092499]  ? cpu_to_node+0x1a/0x1a
[   36.092504]  ? kvm_sched_clock_read+0x5/0x11
[   36.092508]  ? __perf_event_task_sched_in+0x13d/0x13d
[   36.092513]  cpu_function_call+0x47/0x69
[   36.092516]  ? perf_event_update_time+0x52/0x52
[   36.092519]  event_function_call+0x89/0x117
[   36.092521]  ? __perf_event_task_sched_in+0x13d/0x13d
[   36.092526]  ? _perf_event_disable+0x4a/0x4a
[   36.092528]  perf_event_for_each_child+0x3d/0x76
[   36.092532]  ? _perf_event_disable+0x4a/0x4a
[   36.092533]  _perf_ioctl+0x564/0x590
[   36.092537]  ? __lock_release+0xd5/0x1b0
[   36.092543]  ? perf_event_ctx_lock_nested+0x8e/0xba
[   36.092547]  perf_ioctl+0x42/0x5f
[   36.092551]  vfs_ioctl+0x1e/0x2f
[   36.092554]  __do_sys_ioctl+0x66/0x89
[   36.092559]  do_syscall_64+0x6d/0x84
[   36.092563]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   36.092566] RIP: 0033:0x7fe7110f362b
[   36.092569] Code: 0f 1e fa 48 8b 05 5d b8 2c 00 64 c7 00 26 00 00 00 48 c7 c0
ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0
ff ff 73 01 c3 48 8b 0d 2d b8 2c 00 f7 d8 64 89 018
[   36.092570] RSP: 002b:00007ffebb8e4b08 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[   36.092573] RAX: ffffffffffffffda RBX: 0000000000002400 RCX: 00007fe7110f362b
[   36.092575] RDX: 0000000000000000 RSI: 0000000000002400 RDI: 0000000000000013
[   36.092576] RBP: 00007ffebb8e4b40 R08: 0000000000000001 R09: 000055c1db4a5b40
[   36.092577] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[   36.092579] R13: 000055c1db3b2a30 R14: 0000000000000000 R15: 0000000000000000
[   36.092586]  </TASK>

The lockdep warning is a false alarm, because per-cpu map_locked must be zero
before b->raw_lock is acquired. If b->raw_lock has already been acquired by a
process in normal context through htab_map_update_elem(), and an NMI then
interrupts the process and tries to acquire the same b->raw_lock, the
acquisition will fail because the per-cpu map_locked count has already been
increased by the process.
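
For reference, a condensed sketch of the htab_lock_bucket() logic in question
(simplified from kernel/bpf/hashtab.c around v6.1):

	static inline int htab_lock_bucket(const struct bpf_htab *htab,
					   struct bucket *b, u32 hash,
					   unsigned long *pflags)
	{
		unsigned long flags;

		hash = hash & HASHTAB_MAP_LOCK_MASK;

		preempt_disable();
		/* per-cpu map_locked is the guard: if this CPU is already in
		 * the critical section (e.g., reentered from NMI), bail out
		 * instead of spinning on the raw lock */
		if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
			__this_cpu_dec(*(htab->map_locked[hash]));
			preempt_enable();
			return -EBUSY;
		}

		raw_spin_lock_irqsave(&b->raw_lock, flags);
		*pflags = flags;

		return 0;
	}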

So besides using lockdep_off() and lockdep_on() to temporarily disable/enable
lockdep in htab_lock_bucket() and htab_unlock_bucket(), are there other ways
to fix the lockdep warning?

Thanks,
Tao




> Hi
> Thanks for your reply. Defining the lock_class_key array looks good.
> Last question: how about using raw_spin_trylock_irqsave()? If the
> bucket is locked on the same or another cpu,
> raw_spin_trylock_irqsave() will return false, and we should return
> -EBUSY in htab_lock_bucket:
>
> static inline int htab_lock_bucket(struct bucket *b,
>                                    unsigned long *pflags)
> {
>         unsigned long flags;
>
>         if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>                 return -EBUSY;
>
>         *pflags = flags;
>         return 0;
> }
>
>>>> 1. the kernel .config
>>>> #
>>>> # Debug Oops, Lockups and Hangs
>>>> #
>>>> CONFIG_PANIC_ON_OOPS=y
>>>> CONFIG_PANIC_ON_OOPS_VALUE=1
>>>> CONFIG_PANIC_TIMEOUT=0
>>>> CONFIG_LOCKUP_DETECTOR=y
>>>> CONFIG_SOFTLOCKUP_DETECTOR=y
>>>> # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
>>>> CONFIG_HARDLOCKUP_DETECTOR_PERF=y
>>>> CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
>>>> CONFIG_HARDLOCKUP_DETECTOR=y
>>>> CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
>>>> CONFIG_DETECT_HUNG_TASK=y
>>>> CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
>>>> # CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
>>>> # CONFIG_WQ_WATCHDOG is not set
>>>> # CONFIG_TEST_LOCKUP is not set
>>>> # end of Debug Oops, Lockups and Hangs
>>>>
>>>> 2. bpf.c, the map size is 2.
>>>> struct {
>>>>         __uint(type, BPF_MAP_TYPE_HASH);
>> Add __uint(map_flags, BPF_F_ZERO_SEED); to ensure there is no seed for the
>> hash calculation, so we can use key=4 and key=20 to construct a case where
>> the two keys have the same bucket index but different map_locked indices.
>>>>         __uint(max_entries, 2);
>>>>         __uint(key_size, sizeof(unsigned int));
>>>>         __uint(value_size, sizeof(unsigned int));
>>>> } map1 SEC(".maps");
>>>>
>>>> static int bpf_update_data()
>>>> {
>>>>         unsigned int val = 1, key = 0;
>> Use key = 20 here.
>>>>         return bpf_map_update_elem(&map1, &key, &val, BPF_ANY);
>>>> }
>>>>
>>>> SEC("kprobe/ip_rcv")
>>>> int bpf_prog1(struct pt_regs *regs)
>>>> {
>>>>         bpf_update_data();
>>>>         return 0;
>>>> }
>> The kprobe on ip_rcv is unnecessary; you can just remove it.
>>>> SEC("tracepoint/nmi/nmi_handler")
>>>> int bpf_prog2(struct pt_regs *regs)
>>>> {
>>>>         bpf_update_data();
>>>>         return 0;
>>>> }
>> Please use SEC("fentry/nmi_handle") instead of SEC("tracepoint") and unfold
>> bpf_update_data(), because the running of a bpf program on a tracepoint is
>> blocked by bpf_prog_active, which is increased by bpf_map_update_elem()
>> through bpf_disable_instrumentation().
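>>
>> A hypothetical sketch of that fentry variant (assuming <bpf/bpf_tracing.h>
>> is included for BPF_PROG), with bpf_update_data() unfolded into the
>> program body:
>>
>> SEC("fentry/nmi_handle")
>> int BPF_PROG(bpf_prog2)
>> {
>>         unsigned int val = 1, key = 20;
>>
>>         bpf_map_update_elem(&map1, &key, &val, BPF_ANY);
>>         return 0;
>> }
>>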
>>>> char _license[] SEC("license") = "GPL";
>>>> unsigned int _version SEC("version") = LINUX_VERSION_CODE;
>>>>
>>>> 3. bpf loader.
>>>> #include "kprobe-example.skel.h"
>>>>
>>>> #include <stdio.h>
>>>> #include <unistd.h>
>>>> #include <errno.h>
>>>>
>>>> #include <bpf/bpf.h>
>>>>
>>>> int main()
>>>> {
>>>>         struct kprobe_example *skel;
>>>>         int map_fd, prog_fd;
>>>>         int i;
>>>>         int err = 0;
>>>>
>>>>         skel = kprobe_example__open_and_load();
>>>>         if (!skel)
>>>>                 return -1;
>>>>
>>>>         err = kprobe_example__attach(skel);
>>>>         if (err)
>>>>                 goto cleanup;
>>>>
>>>>         /* all libbpf APIs are usable */
>>>>         prog_fd = bpf_program__fd(skel->progs.bpf_prog1);
>>>>         map_fd = bpf_map__fd(skel->maps.map1);
>>>>
>>>>         printf("map_fd: %d\n", map_fd);
>>>>
>>>>         unsigned int val = 0, key = 0;
>>>>
>>>>         while (1) {
>>>>                 bpf_map_delete_elem(map_fd, &key);
>> Not needed either; only doing bpf_map_update_elem() is OK. Also change key
>> from 0 to 4, so it has the same bucket index as key=20 but a different
>> map_locked index.
>>>>                 bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
>>>>         }
>> You also need to pin the process to a specific CPU (e.g., CPU 0).
>>>> cleanup:
>>>>         kprobe_example__destroy(skel);
>>>>         return err;
>>>> }
>>>>
>>>> 4. run the bpf loader and perf record for nmi interrupts. the warning occurs
>> For the perf event, you can reference prog_tests/find_vma.c for how to use
>> perf_event_open() to trigger a perf NMI interrupt. The perf event also needs
>> to be pinned to the same CPU as the caller of bpf_map_update_elem().
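>>
>> For illustration, a minimal user-space sketch (the helper name is
>> hypothetical) that pins itself to CPU 0 and opens a sampling hardware event
>> there, so PMU NMIs fire on the same CPU that runs bpf_map_update_elem():
>>
>> #define _GNU_SOURCE
>> #include <linux/perf_event.h>
>> #include <sys/syscall.h>
>> #include <sched.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> static int open_perf_nmi_on_cpu0(void)
>> {
>>         struct perf_event_attr attr;
>>         cpu_set_t set;
>>
>>         /* pin the current process to CPU 0 first */
>>         CPU_ZERO(&set);
>>         CPU_SET(0, &set);
>>         sched_setaffinity(0, sizeof(set), &set);
>>
>>         memset(&attr, 0, sizeof(attr));
>>         attr.type = PERF_TYPE_HARDWARE;
>>         attr.size = sizeof(attr);
>>         attr.config = PERF_COUNT_HW_CPU_CYCLES;
>>         attr.sample_period = 100000;    /* frequent samples -> frequent NMIs */
>>
>>         /* pid = -1, cpu = 0: count events on CPU 0 regardless of task */
>>         return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
>> }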
>>
>>>> --
>>>> Best regards, Tonghao
>>> .
>


2022-11-29 16:27:05

by Waiman Long

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On 11/29/22 07:45, Hou Tao wrote:
> Hi,
>
> On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
>> On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <[email protected]> wrote:
>>> Hi,
>>>
>>> On 11/29/2022 5:55 AM, Hao Luo wrote:
>>>> On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <[email protected]> wrote:
>>>> Hi Tonghao,
>>>>
>>>> With a quick look at the htab_lock_bucket() and your problem
>>>> statement, I agree with Hou Tao that using hash &
>>>> min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
>>>> to fix the potential deadlock. Can you actually send your changes as
>>>> v2 so we can take a look and better help you? Also, can you explain
>>>> your solution in your commit message? Right now, your commit message
>>>> has only a problem statement and is not very clear. Please include
>>>> more details on what you do to fix the issue.
>>>>
>>>> Hao
>>> It would be better if the test case below can be rewritten as a bpf selftests.
>>> Please see comments below on how to improve it and reproduce the deadlock.
>>>>> Hi
>>>>> only a warning from lockdep.
>>> Thanks for your detailed instructions. I can reproduce the warning by using your
>>> setup. I am not a lockdep expert; it seems that fixing such a warning requires
>>> assigning a different lockdep class to each bucket. Because we use map_locked to
>>> protect the acquisition of the bucket lock, I think we can define a lock_class_key
>>> array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
>>> bucket locks accordingly.
> The proposed lockdep solution doesn't work. Still got lockdep warning after
> that, so cc +locking expert +lkml.org for lockdep help.
>
> Hi lockdep experts,
>
> We are trying to fix the following lockdep warning from bpf subsystem:
>
> [   36.092222] ================================
> [   36.092230] WARNING: inconsistent lock state
> [   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
> [   36.092236] --------------------------------
> [   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> [   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
> [   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
> htab_lock_bucket+0x4d/0x58
> [   36.092253] {INITIAL USE} state was registered at:
> [   36.092255]   mark_usage+0x1d/0x11d
> [   36.092262]   __lock_acquire+0x3c9/0x6ed
> [   36.092266]   lock_acquire+0x23d/0x29a
> [   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
> [   36.092274]   htab_lock_bucket+0x4d/0x58
> [   36.092276]   htab_map_delete_elem+0x82/0xfb
> [   36.092278]   map_delete_elem+0x156/0x1ac
> [   36.092282]   __sys_bpf+0x138/0xb71
> [   36.092285]   __do_sys_bpf+0xd/0x15
> [   36.092288]   do_syscall_64+0x6d/0x84
> [   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [   36.092295] irq event stamp: 120346
> [   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
> _raw_spin_unlock_irq+0x24/0x39
> [   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
> generic_exec_single+0x40/0xb9
> [   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
> __do_softirq+0x347/0x387
> [   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
> __irq_exit_rcu+0x67/0xc6
> [   36.092311]
> [   36.092311] other info that might help us debug this:
> [   36.092312]  Possible unsafe locking scenario:
> [   36.092312]
> [   36.092313]        CPU0
> [   36.092313]        ----
> [   36.092314]   lock(&htab->lockdep_key);
> [   36.092315]   <Interrupt>
> [   36.092316]     lock(&htab->lockdep_key);
> [   36.092318]
> [   36.092318]  *** DEADLOCK ***
> [   36.092318]
> [   36.092318] 3 locks held by perf/1515:
> [   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
> perf_event_ctx_lock_nested+0x8e/0xba
> [   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
> perf_event_for_each_child+0x35/0x76
> [   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
> perf_ctx_lock+0x12/0x27
> [   36.092339]
> [   36.092339] stack backtrace:
> [   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E
> 6.1.0-rc5+ #81
> [   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [   36.092349] Call Trace:
> [   36.092351]  <NMI>
> [   36.092354]  dump_stack_lvl+0x57/0x81
> [   36.092359]  lock_acquire+0x1f4/0x29a
> [   36.092363]  ? handle_pmi_common+0x13f/0x1f0
> [   36.092366]  ? htab_lock_bucket+0x4d/0x58
> [   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
> [   36.092374]  ? htab_lock_bucket+0x4d/0x58
> [   36.092377]  htab_lock_bucket+0x4d/0x58
> [   36.092379]  htab_map_update_elem+0x11e/0x220
> [   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
> [   36.092392]  trace_call_bpf+0x177/0x215
> [   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
> [   36.092403]  ? x86_pmu_stop+0x97/0x97
> [   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
> [   36.092415]  nmi_handle+0x116/0x254
> [   36.092418]  ? x86_pmu_stop+0x97/0x97
> [   36.092423]  default_do_nmi+0x3d/0xf6
> [   36.092428]  exc_nmi+0xa1/0x109
> [   36.092432]  end_repeat_nmi+0x16/0x67
> [   36.092436] RIP: 0010:wrmsrl+0xd/0x1b

So the lock is really taken in an NMI context. In general, we advise
against using a lock in NMI context unless it is a lock that is used only
in that context. Otherwise, deadlock is certainly a possibility, as there
is no way to mask off an NMI again.

Cheers,
Longman

2022-11-29 18:03:52

by Boqun Feng

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Tue, Nov 29, 2022 at 11:06:51AM -0500, Waiman Long wrote:
> On 11/29/22 07:45, Hou Tao wrote:
> > Hi,
> >
> > On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
> > > On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > On 11/29/2022 5:55 AM, Hao Luo wrote:
> > > > > On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <[email protected]> wrote:
> > > > > Hi Tonghao,
> > > > >
> > > > > With a quick look at the htab_lock_bucket() and your problem
> > > > > statement, I agree with Hou Tao that using hash &
> > > > > min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
> > > > > to fix the potential deadlock. Can you actually send your changes as
> > > > > v2 so we can take a look and better help you? Also, can you explain
> > > > > your solution in your commit message? Right now, your commit message
> > > > > has only a problem statement and is not very clear. Please include
> > > > > more details on what you do to fix the issue.
> > > > >
> > > > > Hao
> > > > It would be better if the test case below can be rewritten as a bpf selftests.
> > > > Please see comments below on how to improve it and reproduce the deadlock.
> > > > > > Hi
> > > > > > only a warning from lockdep.
> > > > Thanks for your detailed instructions. I can reproduce the warning by using your
> > > > setup. I am not a lockdep expert; it seems that fixing such a warning requires
> > > > assigning a different lockdep class to each bucket. Because we use map_locked to
> > > > protect the acquisition of the bucket lock, I think we can define a lock_class_key
> > > > array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
> > > > bucket locks accordingly.
> > The proposed lockdep solution doesn't work. Still got lockdep warning after
> > that, so cc +locking expert +lkml.org for lockdep help.
> >
> > Hi lockdep experts,
> >
> > We are trying to fix the following lockdep warning from bpf subsystem:
> >
> > [   36.092222] ================================
> > [   36.092230] WARNING: inconsistent lock state
> > [   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
> > [   36.092236] --------------------------------
> > [   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> > [   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
> > [   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
> > htab_lock_bucket+0x4d/0x58
> > [   36.092253] {INITIAL USE} state was registered at:
> > [   36.092255]   mark_usage+0x1d/0x11d
> > [   36.092262]   __lock_acquire+0x3c9/0x6ed
> > [   36.092266]   lock_acquire+0x23d/0x29a
> > [   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
> > [   36.092274]   htab_lock_bucket+0x4d/0x58
> > [   36.092276]   htab_map_delete_elem+0x82/0xfb
> > [   36.092278]   map_delete_elem+0x156/0x1ac
> > [   36.092282]   __sys_bpf+0x138/0xb71
> > [   36.092285]   __do_sys_bpf+0xd/0x15
> > [   36.092288]   do_syscall_64+0x6d/0x84
> > [   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [   36.092295] irq event stamp: 120346
> > [   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
> > _raw_spin_unlock_irq+0x24/0x39
> > [   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
> > generic_exec_single+0x40/0xb9
> > [   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
> > __do_softirq+0x347/0x387
> > [   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
> > __irq_exit_rcu+0x67/0xc6
> > [   36.092311]
> > [   36.092311] other info that might help us debug this:
> > [   36.092312]  Possible unsafe locking scenario:
> > [   36.092312]
> > [   36.092313]        CPU0
> > [   36.092313]        ----
> > [   36.092314]   lock(&htab->lockdep_key);
> > [   36.092315]   <Interrupt>
> > [   36.092316]     lock(&htab->lockdep_key);
> > [   36.092318]
> > [   36.092318]  *** DEADLOCK ***
> > [   36.092318]
> > [   36.092318] 3 locks held by perf/1515:
> > [   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
> > perf_event_ctx_lock_nested+0x8e/0xba
> > [   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
> > perf_event_for_each_child+0x35/0x76
> > [   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
> > perf_ctx_lock+0x12/0x27
> > [   36.092339]
> > [   36.092339] stack backtrace:
> > [   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E
> > 6.1.0-rc5+ #81
> > [   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > [   36.092349] Call Trace:
> > [   36.092351]  <NMI>
> > [   36.092354]  dump_stack_lvl+0x57/0x81
> > [   36.092359]  lock_acquire+0x1f4/0x29a
> > [   36.092363]  ? handle_pmi_common+0x13f/0x1f0
> > [   36.092366]  ? htab_lock_bucket+0x4d/0x58
> > [   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
> > [   36.092374]  ? htab_lock_bucket+0x4d/0x58
> > [   36.092377]  htab_lock_bucket+0x4d/0x58
> > [   36.092379]  htab_map_update_elem+0x11e/0x220
> > [   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
> > [   36.092392]  trace_call_bpf+0x177/0x215
> > [   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
> > [   36.092403]  ? x86_pmu_stop+0x97/0x97
> > [   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
> > [   36.092415]  nmi_handle+0x116/0x254
> > [   36.092418]  ? x86_pmu_stop+0x97/0x97
> > [   36.092423]  default_do_nmi+0x3d/0xf6
> > [   36.092428]  exc_nmi+0xa1/0x109
> > [   36.092432]  end_repeat_nmi+0x16/0x67
> > [   36.092436] RIP: 0010:wrmsrl+0xd/0x1b
>
> So the lock is really taken in an NMI context. In general, we advise
> against using a lock in NMI context unless it is a lock that is used only
> in that context. Otherwise, deadlock is certainly a possibility, as there
> is no way to mask off an NMI again.
>

I think here they use a percpu counter as an "outer lock" to make the
accesses to the real lock exclusive:

	preempt_disable();
	a = __this_cpu_inc_return(->map_locked);
	if (a != 1) {
		__this_cpu_dec(->map_locked);
		preempt_enable();
		return -EBUSY;
	}

	raw_spin_lock_irqsave(->raw_lock);

and lockdep is not aware that ->map_locked acts as a lock.

However, I feel this may be just a reinvented try_lock pattern, Hou Tao,
could you see if this can be refactored with a try_lock? Otherwise, you
may need to introduce a virtual lockclass for ->map_locked.
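
For concreteness, a hypothetical sketch of such an annotation (the field name
and placement are assumptions): treat the map_locked increment as a trylock
acquire and the decrement as a release, so lockdep can track it:

	/* hypothetical field in struct bpf_htab */
	struct lockdep_map map_locked_dep_map;

	/* at map creation time: */
	static struct lock_class_key map_locked_key;
	lockdep_init_map(&htab->map_locked_dep_map, "map_locked",
			 &map_locked_key, 0);

	/* in htab_lock_bucket(), once __this_cpu_inc_return(...) == 1: */
	lock_map_acquire_try(&htab->map_locked_dep_map);

	/* in htab_unlock_bucket(), just before __this_cpu_dec(...): */
	lock_map_release(&htab->map_locked_dep_map);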

Regards,
Boqun

> Cheers,
> Longman
>

2022-11-29 18:04:14

by Boqun Feng

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Tue, Nov 29, 2022 at 09:23:18AM -0800, Boqun Feng wrote:
> On Tue, Nov 29, 2022 at 11:06:51AM -0500, Waiman Long wrote:
> > On 11/29/22 07:45, Hou Tao wrote:
> > > Hi,
> > >
> > > On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
> > > > On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > On 11/29/2022 5:55 AM, Hao Luo wrote:
> > > > > > On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <[email protected]> wrote:
> > > > > > Hi Tonghao,
> > > > > >
> > > > > > With a quick look at the htab_lock_bucket() and your problem
> > > > > > statement, I agree with Hou Tao that using hash &
> > > > > > min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
> > > > > > to fix the potential deadlock. Can you actually send your changes as
> > > > > > v2 so we can take a look and better help you? Also, can you explain
> > > > > > your solution in your commit message? Right now, your commit message
> > > > > > has only a problem statement and is not very clear. Please include
> > > > > > more details on what you do to fix the issue.
> > > > > >
> > > > > > Hao
> > > > > It would be better if the test case below can be rewritten as a bpf selftests.
> > > > > Please see comments below on how to improve it and reproduce the deadlock.
> > > > > > > Hi
> > > > > > > only a warning from lockdep.
> > > > > Thanks for your detailed instructions. I can reproduce the warning by using your
> > > > > setup. I am not a lockdep expert; it seems that fixing such a warning requires
> > > > > assigning a different lockdep class to each bucket. Because we use map_locked to
> > > > > protect the acquisition of the bucket lock, I think we can define a lock_class_key
> > > > > array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
> > > > > bucket locks accordingly.
> > > The proposed lockdep solution doesn't work. Still got lockdep warning after
> > > that, so cc +locking expert +lkml.org for lockdep help.
> > >
> > > Hi lockdep experts,
> > >
> > > We are trying to fix the following lockdep warning from bpf subsystem:
> > >
> > > [   36.092222] ================================
> > > [   36.092230] WARNING: inconsistent lock state
> > > [   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
> > > [   36.092236] --------------------------------
> > > [   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> > > [   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
> > > [   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
> > > htab_lock_bucket+0x4d/0x58
> > > [   36.092253] {INITIAL USE} state was registered at:
> > > [   36.092255]   mark_usage+0x1d/0x11d
> > > [   36.092262]   __lock_acquire+0x3c9/0x6ed
> > > [   36.092266]   lock_acquire+0x23d/0x29a
> > > [   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
> > > [   36.092274]   htab_lock_bucket+0x4d/0x58
> > > [   36.092276]   htab_map_delete_elem+0x82/0xfb
> > > [   36.092278]   map_delete_elem+0x156/0x1ac
> > > [   36.092282]   __sys_bpf+0x138/0xb71
> > > [   36.092285]   __do_sys_bpf+0xd/0x15
> > > [   36.092288]   do_syscall_64+0x6d/0x84
> > > [   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > [   36.092295] irq event stamp: 120346
> > > [   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
> > > _raw_spin_unlock_irq+0x24/0x39
> > > [   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
> > > generic_exec_single+0x40/0xb9
> > > [   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
> > > __do_softirq+0x347/0x387
> > > [   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
> > > __irq_exit_rcu+0x67/0xc6
> > > [   36.092311]
> > > [   36.092311] other info that might help us debug this:
> > > [   36.092312]  Possible unsafe locking scenario:
> > > [   36.092312]
> > > [   36.092313]        CPU0
> > > [   36.092313]        ----
> > > [   36.092314]   lock(&htab->lockdep_key);
> > > [   36.092315]   <Interrupt>
> > > [   36.092316]     lock(&htab->lockdep_key);
> > > [   36.092318]
> > > [   36.092318]  *** DEADLOCK ***
> > > [   36.092318]
> > > [   36.092318] 3 locks held by perf/1515:
> > > [   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
> > > perf_event_ctx_lock_nested+0x8e/0xba
> > > [   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
> > > perf_event_for_each_child+0x35/0x76
> > > [   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
> > > perf_ctx_lock+0x12/0x27
> > > [   36.092339]
> > > [   36.092339] stack backtrace:
> > > [   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E
> > > 6.1.0-rc5+ #81
> > > [   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > > [   36.092349] Call Trace:
> > > [   36.092351]  <NMI>
> > > [   36.092354]  dump_stack_lvl+0x57/0x81
> > > [   36.092359]  lock_acquire+0x1f4/0x29a
> > > [   36.092363]  ? handle_pmi_common+0x13f/0x1f0
> > > [   36.092366]  ? htab_lock_bucket+0x4d/0x58
> > > [   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
> > > [   36.092374]  ? htab_lock_bucket+0x4d/0x58
> > > [   36.092377]  htab_lock_bucket+0x4d/0x58
> > > [   36.092379]  htab_map_update_elem+0x11e/0x220
> > > [   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
> > > [   36.092392]  trace_call_bpf+0x177/0x215
> > > [   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
> > > [   36.092403]  ? x86_pmu_stop+0x97/0x97
> > > [   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
> > > [   36.092415]  nmi_handle+0x116/0x254
> > > [   36.092418]  ? x86_pmu_stop+0x97/0x97
> > > [   36.092423]  default_do_nmi+0x3d/0xf6
> > > [   36.092428]  exc_nmi+0xa1/0x109
> > > [   36.092432]  end_repeat_nmi+0x16/0x67
> > > [   36.092436] RIP: 0010:wrmsrl+0xd/0x1b
> >
> > So the lock is really taken in an NMI context. In general, we advise
> > against using a lock in NMI context unless it is a lock that is used only
> > in that context. Otherwise, deadlock is certainly a possibility, as there
> > is no way to mask off an NMI again.
> >
>
> I think here they use a percpu counter as an "outer lock" to make the
> accesses to the real lock exclusive:
>
> 	preempt_disable();
> 	a = __this_cpu_inc_return(->map_locked);
> 	if (a != 1) {
> 		__this_cpu_dec(->map_locked);
> 		preempt_enable();
> 		return -EBUSY;
> 	}
>
> 	raw_spin_lock_irqsave(->raw_lock);
>
> and lockdep is not aware that ->map_locked acts as a lock.
>
> However, I feel this may be just a reinvented try_lock pattern, Hou Tao,
> could you see if this can be refactored with a try_lock? Otherwise, you

Just to be clear, I meant to refactor htab_lock_bucket() into a try
lock pattern. Also after a second thought, the below suggestion doesn't
work. I think the proper way is to make htab_lock_bucket() as a
raw_spin_trylock_irqsave().

Regards,
Boqun

> may need to introduce a virtual lockclass for ->map_locked.
>
> Regards,
> Boqun
>
> > Cheers,
> > Longman
> >

2022-11-29 19:49:19

by Hao Luo

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>
> Just to be clear, I meant to refactor htab_lock_bucket() into a try
> lock pattern. Also after a second thought, the below suggestion doesn't
> work. I think the proper way is to make htab_lock_bucket() as a
> raw_spin_trylock_irqsave().
>
> Regards,
> Boqun
>

The potential deadlock happens when the lock is contended from the
same cpu. When the lock is contended from a remote cpu, we would like
the remote cpu to spin and wait, instead of giving up immediately, as
this gives better throughput. So replacing the current
raw_spin_lock_irqsave() with trylock sacrifices this performance gain.

I suspect the source of the problem is the 'hash' that we use in
htab_lock_bucket(). The 'hash' is derived from the 'key', and I wonder
whether we should use a hash derived from the 'bucket' rather than from
the 'key', for example from the memory address of the 'bucket'. Because
different keys may fall into the same bucket but yield different
hashes, if the same bucket can never have two different 'hashes' here,
the map_locked check should behave as intended. Also, because
->map_locked is per-cpu, execution flows from two different cpus can
both pass.
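
A minimal sketch of that idea, assuming the map_locked index were derived
from the bucket itself instead of from the key's hash (hypothetical helper,
not existing code):

	/* Two keys that fall into the same bucket always map to the same
	 * map_locked slot, because the index comes from the bucket. */
	static inline u32 map_locked_index(const struct bpf_htab *htab,
					   const struct bucket *b)
	{
		return (u32)(b - htab->buckets) & HASHTAB_MAP_LOCK_MASK;
	}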

Hao

2022-11-29 21:37:50

by Waiman Long

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock


On 11/29/22 14:36, Hao Luo wrote:
> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>> lock pattern. Also after a second thought, the below suggestion doesn't
>> work. I think the proper way is to make htab_lock_bucket() as a
>> raw_spin_trylock_irqsave().
>>
>> Regards,
>> Boqun
>>
> The potential deadlock happens when the lock is contended from the
> same cpu. When the lock is contended from a remote cpu, we would like
> the remote cpu to spin and wait, instead of giving up immediately. As
> this gives better throughput. So replacing the current
> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>
> I suspect the source of the problem is the 'hash' that we used in
> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
> whether we should use a hash derived from 'bucket' rather than from
> 'key'. For example, from the memory address of the 'bucket'. Because,
> different keys may fall into the same bucket, but yield different
> hashes. If the same bucket can never have two different 'hashes' here,
> the map_locked check should behave as intended. Also because
> ->map_locked is per-cpu, execution flows from two different cpus can
> both pass.

I would suggest that you add an in_nmi() check and, if true, use trylock to
get the lock. You can continue to use raw_spin_lock_irqsave() in all
other cases.

Cheers,
Longman

2022-11-30 02:15:28

by Hou Tao

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi Hao,

On 11/30/2022 3:36 AM, Hao Luo wrote:
> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>> lock pattern. Also after a second thought, the below suggestion doesn't
>> work. I think the proper way is to make htab_lock_bucket() as a
>> raw_spin_trylock_irqsave().
>>
>> Regards,
>> Boqun
>>
> The potential deadlock happens when the lock is contended from the
> same cpu. When the lock is contended from a remote cpu, we would like
> the remote cpu to spin and wait, instead of giving up immediately. As
> this gives better throughput. So replacing the current
> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>
> I suspect the source of the problem is the 'hash' that we used in
> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
> whether we should use a hash derived from 'bucket' rather than from
> 'key'. For example, from the memory address of the 'bucket'. Because,
> different keys may fall into the same bucket, but yield different
> hashes. If the same bucket can never have two different 'hashes' here,
> the map_locked check should behave as intended. Also because
> ->map_locked is per-cpu, execution flows from two different cpus can
> both pass.
The warning from lockdep arises because bucket lock A is first used in a
non-NMI context, and then the same bucket lock is used in an NMI context, so
lockdep deduces that this may be a dead-lock. I have already tried to use the
same map_locked index for keys within the same bucket; the dead-lock is gone,
but I still got the lockdep warning.
>
> Hao
> .

2022-11-30 02:15:32

by Hou Tao

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi,

On 11/30/2022 1:23 AM, Boqun Feng wrote:
> On Tue, Nov 29, 2022 at 11:06:51AM -0500, Waiman Long wrote:
>> On 11/29/22 07:45, Hou Tao wrote:
>>> Hi,
>>>
>>> On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
>>>> On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> On 11/29/2022 5:55 AM, Hao Luo wrote:
>>>>>> On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <[email protected]> wrote:
>>>>>> Hi Tonghao,
>>>>>>
>>>>>> With a quick look at the htab_lock_bucket() and your problem
>>>>>> statement, I agree with Hou Tao that using hash &
>>>>>> min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
>>>>>> to fix the potential deadlock. Can you actually send your changes as
>>>>>> v2 so we can take a look and better help you? Also, can you explain
>>>>>> your solution in your commit message? Right now, your commit message
>>>>>> has only a problem statement and is not very clear. Please include
>>>>>> more details on what you do to fix the issue.
>>>>>>
>>>>>> Hao
>>>>> It would be better if the test case below can be rewritten as a bpf selftests.
>>>>> Please see comments below on how to improve it and reproduce the deadlock.
>>>>>>> Hi
>>>>>>> only a warning from lockdep.
>>>>> Thanks for your detailed instructions. I can reproduce the warning by using your
>>>>> setup. I am not a lockdep expert; it seems that fixing such a warning requires
>>>>> assigning a different lockdep class to each bucket. Because we use map_locked to
>>>>> protect the acquisition of the bucket lock, I think we can define a lock_class_key
>>>>> array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
>>>>> bucket locks accordingly.
>>> The proposed lockdep solution doesn't work. Still got lockdep warning after
>>> that, so cc +locking expert +lkml.org for lockdep help.
>>>
>>> Hi lockdep experts,
>>>
>>> We are trying to fix the following lockdep warning from bpf subsystem:
>>>
>>> [   36.092222] ================================
>>> [   36.092230] WARNING: inconsistent lock state
>>> [   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
>>> [   36.092236] --------------------------------
>>> [   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
>>> [   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
>>> [   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
>>> htab_lock_bucket+0x4d/0x58
>>> [   36.092253] {INITIAL USE} state was registered at:
>>> [   36.092255]   mark_usage+0x1d/0x11d
>>> [   36.092262]   __lock_acquire+0x3c9/0x6ed
>>> [   36.092266]   lock_acquire+0x23d/0x29a
>>> [   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
>>> [   36.092274]   htab_lock_bucket+0x4d/0x58
>>> [   36.092276]   htab_map_delete_elem+0x82/0xfb
>>> [   36.092278]   map_delete_elem+0x156/0x1ac
>>> [   36.092282]   __sys_bpf+0x138/0xb71
>>> [   36.092285]   __do_sys_bpf+0xd/0x15
>>> [   36.092288]   do_syscall_64+0x6d/0x84
>>> [   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> [   36.092295] irq event stamp: 120346
>>> [   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
>>> _raw_spin_unlock_irq+0x24/0x39
>>> [   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
>>> generic_exec_single+0x40/0xb9
>>> [   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
>>> __do_softirq+0x347/0x387
>>> [   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
>>> __irq_exit_rcu+0x67/0xc6
>>> [   36.092311]
>>> [   36.092311] other info that might help us debug this:
>>> [   36.092312]  Possible unsafe locking scenario:
>>> [   36.092312]
>>> [   36.092313]        CPU0
>>> [   36.092313]        ----
>>> [   36.092314]   lock(&htab->lockdep_key);
>>> [   36.092315]   <Interrupt>
>>> [   36.092316]     lock(&htab->lockdep_key);
>>> [   36.092318]
>>> [   36.092318]  *** DEADLOCK ***
>>> [   36.092318]
>>> [   36.092318] 3 locks held by perf/1515:
>>> [   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
>>> perf_event_ctx_lock_nested+0x8e/0xba
>>> [   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
>>> perf_event_for_each_child+0x35/0x76
>>> [   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
>>> perf_ctx_lock+0x12/0x27
>>> [   36.092339]
>>> [   36.092339] stack backtrace:
>>> [   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E
>>> 6.1.0-rc5+ #81
>>> [   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>>> [   36.092349] Call Trace:
>>> [   36.092351]  <NMI>
>>> [   36.092354]  dump_stack_lvl+0x57/0x81
>>> [   36.092359]  lock_acquire+0x1f4/0x29a
>>> [   36.092363]  ? handle_pmi_common+0x13f/0x1f0
>>> [   36.092366]  ? htab_lock_bucket+0x4d/0x58
>>> [   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
>>> [   36.092374]  ? htab_lock_bucket+0x4d/0x58
>>> [   36.092377]  htab_lock_bucket+0x4d/0x58
>>> [   36.092379]  htab_map_update_elem+0x11e/0x220
>>> [   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
>>> [   36.092392]  trace_call_bpf+0x177/0x215
>>> [   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
>>> [   36.092403]  ? x86_pmu_stop+0x97/0x97
>>> [   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
>>> [   36.092415]  nmi_handle+0x116/0x254
>>> [   36.092418]  ? x86_pmu_stop+0x97/0x97
>>> [   36.092423]  default_do_nmi+0x3d/0xf6
>>> [   36.092428]  exc_nmi+0xa1/0x109
>>> [   36.092432]  end_repeat_nmi+0x16/0x67
>>> [   36.092436] RIP: 0010:wrmsrl+0xd/0x1b
>> So the lock is really taken in an NMI context. In general, we advise
>> against using a lock in NMI context unless it is a lock that is used only
>> in that context. Otherwise, deadlock is certainly a possibility, as there
>> is no way to mask off an NMI again.
>>
> I think here they use a percpu counter as an "outer lock" to make the
> accesses to the real lock exclusive:
>
> 	preempt_disable();
> 	a = __this_cpu_inc_return(->map_locked);
> 	if (a != 1) {
> 		__this_cpu_dec(->map_locked);
> 		preempt_enable();
> 		return -EBUSY;
> 	}
>
> 	raw_spin_lock_irqsave(->raw_lock);
>
> and lockdep is not aware that ->map_locked acts as a lock.
>
> However, I feel this may be just a reinvented try_lock pattern, Hou Tao,
> could you see if this can be refactored with a try_lock? Otherwise, you
> may need to introduce a virtual lockclass for ->map_locked.
As said by Hao Luo, the problem with using trylock in NMI context is that it
cannot distinguish between a dead-lock and a lock under high contention. And
map_locked is still needed even if trylock is used in NMI context, because
htab_map_update_elem() may be reentered in a normal context by attaching a bpf
program to a function called after the lock has been taken. So introducing a
virtual lockclass for ->map_locked is a better idea.

Thanks,
Tao
> Regards,
> Boqun
>
>> Cheers,
>> Longman
>>
> .

2022-11-30 03:02:42

by Tonghao Zhang

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
>
> Hi Hao,
>
> On 11/30/2022 3:36 AM, Hao Luo wrote:
> > On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
> >> Just to be clear, I meant to refactor htab_lock_bucket() into a try
> >> lock pattern. Also after a second thought, the below suggestion doesn't
> >> work. I think the proper way is to make htab_lock_bucket() as a
> >> raw_spin_trylock_irqsave().
> >>
> >> Regards,
> >> Boqun
> >>
> > The potential deadlock happens when the lock is contended from the
> > same cpu. When the lock is contended from a remote cpu, we would like
> > the remote cpu to spin and wait, instead of giving up immediately. As
> > this gives better throughput. So replacing the current
> > raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
> >
> > I suspect the source of the problem is the 'hash' that we used in
> > htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
> > whether we should use a hash derived from 'bucket' rather than from
> > 'key'. For example, from the memory address of the 'bucket'. Because,
> > different keys may fall into the same bucket, but yield different
> > hashes. If the same bucket can never have two different 'hashes' here,
> > the map_locked check should behave as intended. Also because
> > ->map_locked is per-cpu, execution flows from two different cpus can
> > both pass.
> The warning from lockdep is due to the reason the bucket lock A is used in a
> no-NMI context firstly, then the same bucke lock is used a NMI context, so
Yes, I tested lockdep too. We can't use the lock in NMI context (only
try_lock works fine there) if we also use it in non-NMI context; otherwise
lockdep prints the warning.
* for the dead-lock case, we can use:
  1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
  2. the hash bucket address.

* for the lockdep warning, we should use an in_nmi() check with map_locked.

BTW, the patch doesn't work, so we can remove the lock_key
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9

static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;

	hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);

	preempt_disable();
	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
		__this_cpu_dec(*(htab->map_locked[hash]));
		preempt_enable();
		return -EBUSY;
	}

	if (in_nmi()) {
		if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
			return -EBUSY;
	} else {
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	}

	*pflags = flags;
	return 0;
}


> lockdep deduces that may be a dead-lock. I have already tried to use the same
> map_locked for keys with the same bucket, the dead-lock is gone, but still got
> lockdep warning.
> >
> > Hao
> > .
>


--
Best regards, Tonghao

2022-11-30 03:30:21

by Waiman Long

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On 11/29/22 21:47, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
>> Hi Hao,
>>
>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>> work. I think the proper way is to make htab_lock_bucket() as a
>>>> raw_spin_trylock_irqsave().
>>>>
>>>> Regards,
>>>> Boqun
>>>>
>>> The potential deadlock happens when the lock is contended from the
>>> same cpu. When the lock is contended from a remote cpu, we would like
>>> the remote cpu to spin and wait, instead of giving up immediately. As
>>> this gives better throughput. So replacing the current
>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>
>>> I suspect the source of the problem is the 'hash' that we used in
>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>> whether we should use a hash derived from 'bucket' rather than from
>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>> different keys may fall into the same bucket, but yield different
>>> hashes. If the same bucket can never have two different 'hashes' here,
>>> the map_locked check should behave as intended. Also because
>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>> both pass.
>> The warning from lockdep is due to the reason the bucket lock A is used in a
>> no-NMI context firstly, then the same bucke lock is used a NMI context, so
> Yes, I tested lockdep too, we can't use the lock in NMI(but only
> try_lock work fine) context if we use them no-NMI context. otherwise
> the lockdep prints the warning.
> * for the dead-lock case: we can use the
> 1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1)
> 2. or hash bucket address.
>
> * for lockdep warning, we should use in_nmi check with map_locked.
>
> BTW, the patch doesn't work, so we can remove the lock_key
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>
> static inline int htab_lock_bucket(const struct bpf_htab *htab,
> struct bucket *b, u32 hash,
> unsigned long *pflags)
> {
> unsigned long flags;
>
> hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>
> preempt_disable();
> if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> __this_cpu_dec(*(htab->map_locked[hash]));
> preempt_enable();
> return -EBUSY;
> }
>
> if (in_nmi()) {
> if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
> return -EBUSY;
That is not right. You have to do the same steps as above: decrement the
percpu count and enable preemption. So you may want to put all these
busy_out steps after the "return 0" and use "goto busy_out;" to jump there.
> } else {
> raw_spin_lock_irqsave(&b->raw_lock, flags);
> }
>
> *pflags = flags;
> return 0;
> }
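
Concretely, a hypothetical corrected version of the function with a busy_out
label (sketch only; still keeping the percpu map_locked count) might look
like:

	static inline int htab_lock_bucket(const struct bpf_htab *htab,
					   struct bucket *b, u32 hash,
					   unsigned long *pflags)
	{
		unsigned long flags;

		hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);

		preempt_disable();
		if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1))
			goto busy_out;

		if (in_nmi()) {
			if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
				goto busy_out;
		} else {
			raw_spin_lock_irqsave(&b->raw_lock, flags);
		}

		*pflags = flags;
		return 0;

	busy_out:
		/* undo the count and re-enable preemption on every failure */
		__this_cpu_dec(*(htab->map_locked[hash]));
		preempt_enable();
		return -EBUSY;
	}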

BTW, with that change, I believe you can actually remove all the percpu
map_locked count code.

Cheers,
Longman

2022-11-30 03:59:36

by Tonghao Zhang

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Wed, Nov 30, 2022 at 11:07 AM Waiman Long <[email protected]> wrote:
>
> On 11/29/22 21:47, Tonghao Zhang wrote:
> > On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
> >> Hi Hao,
> >>
> >> On 11/30/2022 3:36 AM, Hao Luo wrote:
> >>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
> >>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
> >>>> lock pattern. Also after a second thought, the below suggestion doesn't
> >>>> work. I think the proper way is to make htab_lock_bucket() as a
> >>>> raw_spin_trylock_irqsave().
> >>>>
> >>>> Regards,
> >>>> Boqun
> >>>>
> >>> The potential deadlock happens when the lock is contended from the
> >>> same cpu. When the lock is contended from a remote cpu, we would like
> >>> the remote cpu to spin and wait, instead of giving up immediately. As
> >>> this gives better throughput. So replacing the current
> >>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
> >>>
> >>> I suspect the source of the problem is the 'hash' that we used in
> >>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
> >>> whether we should use a hash derived from 'bucket' rather than from
> >>> 'key'. For example, from the memory address of the 'bucket'. Because,
> >>> different keys may fall into the same bucket, but yield different
> >>> hashes. If the same bucket can never have two different 'hashes' here,
> >>> the map_locked check should behave as intended. Also because
> >>> ->map_locked is per-cpu, execution flows from two different cpus can
> >>> both pass.
> >> The warning from lockdep is due to the reason the bucket lock A is used in a
> >> no-NMI context firstly, then the same bucke lock is used a NMI context, so
> > Yes, I tested lockdep too, we can't use the lock in NMI(but only
> > try_lock work fine) context if we use them no-NMI context. otherwise
> > the lockdep prints the warning.
> > * for the dead-lock case: we can use the
> > 1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1)
> > 2. or hash bucket address.
> >
> > * for lockdep warning, we should use in_nmi check with map_locked.
> >
> > BTW, the patch doesn't work, so we can remove the lock_key
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
> >
> > static inline int htab_lock_bucket(const struct bpf_htab *htab,
> > struct bucket *b, u32 hash,
> > unsigned long *pflags)
> > {
> > unsigned long flags;
> >
> > hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
> >
> > preempt_disable();
> > if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> > __this_cpu_dec(*(htab->map_locked[hash]));
> > preempt_enable();
> > return -EBUSY;
> > }
> >
> > if (in_nmi()) {
> > if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
> > return -EBUSY;
> That is not right. You have to do the same step as above by decrementing
> the percpu count and enable preemption. So you may want to put all these
> busy_out steps after the return 0 and use "goto busy_out;" to jump there.
Yes, thanks Waiman, I should add the busy_out label.
> > } else {
> > raw_spin_lock_irqsave(&b->raw_lock, flags);
> > }
> >
> > *pflags = flags;
> > return 0;
> > }
>
> BTW, with that change, I believe you can actually remove all the percpu
> map_locked count code.
There are some cases, for example, when we run bpf_prog A and B in task
context on the same cpu:

bpf_prog A
  update map X
    htab_lock_bucket
      raw_spin_lock_irqsave()
    lookup_elem_raw()
      // bpf prog B is attached on lookup_elem_raw()
      bpf_prog B
        update map X again and update the element
          htab_lock_bucket()
            // dead-lock
            raw_spin_lock_irqsave()
> Cheers,
> Longman
>


--
Best regards, Tonghao

2022-11-30 04:23:03

by Waiman Long

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On 11/29/22 22:32, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 11:07 AM Waiman Long <[email protected]> wrote:
>> On 11/29/22 21:47, Tonghao Zhang wrote:
>>> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
>>>> Hi Hao,
>>>>
>>>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>>>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>>>> work. I think the proper way is to make htab_lock_bucket() as a
>>>>>> raw_spin_trylock_irqsave().
>>>>>>
>>>>>> Regards,
>>>>>> Boqun
>>>>>>
>>>>> The potential deadlock happens when the lock is contended from the
>>>>> same cpu. When the lock is contended from a remote cpu, we would like
>>>>> the remote cpu to spin and wait, instead of giving up immediately. As
>>>>> this gives better throughput. So replacing the current
>>>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>>>
>>>>> I suspect the source of the problem is the 'hash' that we used in
>>>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>>>> whether we should use a hash derived from 'bucket' rather than from
>>>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>>>> different keys may fall into the same bucket, but yield different
>>>>> hashes. If the same bucket can never have two different 'hashes' here,
>>>>> the map_locked check should behave as intended. Also because
>>>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>>>> both pass.
>>>> The warning from lockdep is due to the reason the bucket lock A is used in a
>>>> no-NMI context firstly, then the same bucke lock is used a NMI context, so
>>> Yes, I tested lockdep too, we can't use the lock in NMI(but only
>>> try_lock work fine) context if we use them no-NMI context. otherwise
>>> the lockdep prints the warning.
>>> * for the dead-lock case: we can use the
>>> 1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1)
>>> 2. or hash bucket address.
>>>
>>> * for lockdep warning, we should use in_nmi check with map_locked.
>>>
>>> BTW, the patch doesn't work, so we can remove the lock_key
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>>>
>>> static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>> struct bucket *b, u32 hash,
>>> unsigned long *pflags)
>>> {
>>> unsigned long flags;
>>>
>>> hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>>>
>>> preempt_disable();
>>> if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>> __this_cpu_dec(*(htab->map_locked[hash]));
>>> preempt_enable();
>>> return -EBUSY;
>>> }
>>>
>>> if (in_nmi()) {
>>> if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>>> return -EBUSY;
>> That is not right. You have to do the same step as above by decrementing
>> the percpu count and enable preemption. So you may want to put all these
>> busy_out steps after the return 0 and use "goto busy_out;" to jump there.
> Yes, thanks Waiman, I should add the busy_out label.
>>> } else {
>>> raw_spin_lock_irqsave(&b->raw_lock, flags);
>>> }
>>>
>>> *pflags = flags;
>>> return 0;
>>> }
>> BTW, with that change, I believe you can actually remove all the percpu
>> map_locked count code.
> There are some cases, for example, when we run bpf_prog A and bpf_prog B in
> task context on the same cpu:
>
> bpf_prog A
>   update map X
>     htab_lock_bucket()
>       raw_spin_lock_irqsave()
>     lookup_elem_raw()
>       // bpf_prog B is attached on lookup_elem_raw()
>       bpf_prog B
>         update map X again and update the element
>           htab_lock_bucket()
>             // dead-lock
>             raw_spin_lock_irqsave()

I see, so nested locking is possible in this case. Besides using the percpu
map_locked count, another way is to associate a cpumask with each bucket lock
and use each bit in the cpumask to control access with test_and_set_bit() for
each cpu. That will allow more concurrency, and you can actually find out how
contended the lock is. Anyway, it is just a thought.
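A rough sketch of that idea, assuming a hypothetical lock_owners bitmap (e.g.
DECLARE_BITMAP(lock_owners, NR_CPUS)) added to struct bucket; illustrative
only, not a tested or proposed patch:

static inline int htab_lock_bucket_cpumask(struct bucket *b,
                                           unsigned long *pflags)
{
        /* pin the task to this cpu while its bit is held */
        preempt_disable();

        /* an already-set bit means this cpu is re-entering the same
         * bucket lock (e.g. from a nested bpf program), which would
         * deadlock, so back off instead of spinning
         */
        if (test_and_set_bit(smp_processor_id(), b->lock_owners)) {
                preempt_enable();
                return -EBUSY;
        }

        raw_spin_lock_irqsave(&b->raw_lock, *pflags);
        return 0;
}

static inline void htab_unlock_bucket_cpumask(struct bucket *b,
                                              unsigned long flags)
{
        raw_spin_unlock_irqrestore(&b->raw_lock, flags);
        clear_bit(smp_processor_id(), b->lock_owners);
        preempt_enable();
}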

Cheers,
Longman


2022-11-30 04:35:13

by Hou Tao

[permalink] [raw]
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi,

On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
>> Hi Hao,
>>
>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>> work. I think the proper way is to make htab_lock_bucket() use
>>>> raw_spin_trylock_irqsave().
>>>>
>>>> Regards,
>>>> Boqun
>>>>
>>> The potential deadlock happens when the lock is contended from the
>>> same cpu. When the lock is contended from a remote cpu, we would like
>>> the remote cpu to spin and wait instead of giving up immediately, as
>>> this gives better throughput. So replacing the current
>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>
>>> I suspect the source of the problem is the 'hash' that we used in
>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>> whether we should use a hash derived from 'bucket' rather than from
>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>> different keys may fall into the same bucket, but yield different
>>> hashes. If the same bucket can never have two different 'hashes' here,
>>> the map_locked check should behave as intended. Also because
>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>> both pass.
>> The warning from lockdep is due to the fact that bucket lock A is first used in a
>> non-NMI context, and then the same bucket lock is used in an NMI context, so
> Yes, I tested with lockdep too: we can't take the lock in NMI context (only
> trylock works fine there) if we also take it in non-NMI context; otherwise
> lockdep prints the warning.
> * For the deadlock case, we can use either:
>   1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
>   2. the hash bucket address.
Using the computed hash will be better than using the hash bucket address, because
the hash buckets are allocated sequentially.
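For illustration, a sketch of the two derivations being compared (hypothetical
helper names; HASHTAB_MAP_LOCK_MASK, n_buckets and struct bucket are the
existing hashtab definitions; min_t() is used here only to avoid the type
mismatch in the snippet's min()):

/* option 1: index map_locked from the already-computed hash */
static inline u32 lock_idx_from_hash(const struct bpf_htab *htab, u32 hash)
{
        return hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
}

/* option 2: index map_locked from the bucket's memory address; the
 * buckets are allocated sequentially, so the masked low bits are
 * clustered and less evenly distributed than the hash
 */
static inline u32 lock_idx_from_bucket(const struct bucket *b)
{
        return ((uintptr_t)b / sizeof(*b)) & HASHTAB_MAP_LOCK_MASK;
}

Either way, two keys that land in the same bucket always map to the same
map_locked slot, which is what makes the per-cpu reentrancy check reliable.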
>
> * For the lockdep warning, we should use an in_nmi() check together with map_locked.
>
> BTW, the patch doesn't work, so we can remove the lockdep_key:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>
> static inline int htab_lock_bucket(const struct bpf_htab *htab,
>                                    struct bucket *b, u32 hash,
>                                    unsigned long *pflags)
> {
>         unsigned long flags;
>
>         hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>
>         preempt_disable();
>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>                 __this_cpu_dec(*(htab->map_locked[hash]));
>                 preempt_enable();
>                 return -EBUSY;
>         }
>
>         if (in_nmi()) {
>                 if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>                         return -EBUSY;
The only purpose of trylock here is to make lockdep happy, and it may lead to
unnecessary -EBUSY errors for htab operations in NMI context. I still prefer
adding a virtual lock-class for map_locked to fix the lockdep warning. So could
you use separate patches to fix the potential deadlock and the lockdep warning?
It would also be better if you could add a bpf selftest for the deadlock
problem, as said before.

Thanks,
Tao
>         } else {
>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>         }
>
>         *pflags = flags;
>         return 0;
> }
>
>
>> lockdep deduces that it may be a deadlock. I have already tried to use the same
>> map_locked for keys within the same bucket; the deadlock is gone, but I still got
>> the lockdep warning.
>>> Hao
>>> .
>

2022-11-30 05:11:38

by Hao Luo

[permalink] [raw]
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Tue, Nov 29, 2022 at 8:13 PM Hou Tao <[email protected]> wrote:
>
> On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
<...>
> > if (in_nmi()) {
> >         if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
> >                 return -EBUSY;
>
> The only purpose of trylock here is to make lockdep happy, and it may lead to
> unnecessary -EBUSY errors for htab operations in NMI context. I still prefer
> adding a virtual lock-class for map_locked to fix the lockdep warning. So could
> you use separate patches to fix the potential deadlock and the lockdep warning?
> It would also be better if you could add a bpf selftest for the deadlock
> problem, as said before.
>

Agree with Tao here. Tonghao, could you send another version which:

- separates the fix for the deadlock from the fix for the lockdep warning
- includes a bpf selftest to verify the deadlock fix
- carries the bpf-specific tag: [PATCH bpf-next]

There are multiple ideas in flight in this thread, so it's easy to lose
track of what has been proposed and what change you intend to make.

Thanks,
Hao

2022-11-30 06:21:26

by Tonghao Zhang

[permalink] [raw]
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Wed, Nov 30, 2022 at 1:02 PM Hao Luo <[email protected]> wrote:
>
> On Tue, Nov 29, 2022 at 8:13 PM Hou Tao <[email protected]> wrote:
> >
> > On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
> <...>
> > > if (in_nmi()) {
> > >         if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
> > >                 return -EBUSY;
> >
> > The only purpose of trylock here is to make lockdep happy, and it may lead to
> > unnecessary -EBUSY errors for htab operations in NMI context. I still prefer
> > adding a virtual lock-class for map_locked to fix the lockdep warning. So could
> > you use separate patches to fix the potential deadlock and the lockdep warning?
> > It would also be better if you could add a bpf selftest for the deadlock
> > problem, as said before.
> >
>
> Agree with Tao here. Tonghao, could you send another version which:
>
> - separates the fix for the deadlock from the fix for the lockdep warning
> - includes a bpf selftest to verify the deadlock fix
> - carries the bpf-specific tag: [PATCH bpf-next]
>
> There are multiple ideas in flight in this thread, so it's easy to lose
> track of what has been proposed and what change you intend to make.
Hi, I will send v2 soon. Thanks.
> Thanks,
> Hao



--
Best regards, Tonghao

2022-11-30 06:33:38

by Tonghao Zhang

[permalink] [raw]
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Wed, Nov 30, 2022 at 12:13 PM Hou Tao <[email protected]> wrote:
>
> Hi,
>
> On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
> > On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
> >> Hi Hao,
> >>
> >> On 11/30/2022 3:36 AM, Hao Luo wrote:
> >>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
> >>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
> >>>> lock pattern. Also after a second thought, the below suggestion doesn't
> >>>> work. I think the proper way is to make htab_lock_bucket() use
> >>>> raw_spin_trylock_irqsave().
> >>>>
> >>>> Regards,
> >>>> Boqun
> >>>>
> >>> The potential deadlock happens when the lock is contended from the
> >>> same cpu. When the lock is contended from a remote cpu, we would like
> >>> the remote cpu to spin and wait instead of giving up immediately, as
> >>> this gives better throughput. So replacing the current
> >>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
> >>>
> >>> I suspect the source of the problem is the 'hash' that we used in
> >>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
> >>> whether we should use a hash derived from 'bucket' rather than from
> >>> 'key'. For example, from the memory address of the 'bucket'. Because,
> >>> different keys may fall into the same bucket, but yield different
> >>> hashes. If the same bucket can never have two different 'hashes' here,
> >>> the map_locked check should behave as intended. Also because
> >>> ->map_locked is per-cpu, execution flows from two different cpus can
> >>> both pass.
> >> The warning from lockdep is due to the fact that bucket lock A is first used in a
> >> non-NMI context, and then the same bucket lock is used in an NMI context, so
> > Yes, I tested with lockdep too: we can't take the lock in NMI context (only
> > trylock works fine there) if we also take it in non-NMI context; otherwise
> > lockdep prints the warning.
> > * For the deadlock case, we can use either:
> >   1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
> >   2. the hash bucket address.
> Using the computed hash will be better than using the hash bucket address, because
> the hash buckets are allocated sequentially.
> >
> > * For the lockdep warning, we should use an in_nmi() check together with map_locked.
> >
> > BTW, the patch doesn't work, so we can remove the lockdep_key:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
> >
> > static inline int htab_lock_bucket(const struct bpf_htab *htab,
> >                                    struct bucket *b, u32 hash,
> >                                    unsigned long *pflags)
> > {
> >         unsigned long flags;
> >
> >         hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
> >
> >         preempt_disable();
> >         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> >                 __this_cpu_dec(*(htab->map_locked[hash]));
> >                 preempt_enable();
> >                 return -EBUSY;
> >         }
> >
> >         if (in_nmi()) {
> >                 if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
> >                         return -EBUSY;
> The only purpose of trylock here is to make lockdep happy, and it may lead to
> unnecessary -EBUSY errors for htab operations in NMI context. I still prefer
> adding a virtual lock-class for map_locked to fix the lockdep warning. So could you use
Hi, what is a virtual lock-class? Can you give me an example of what you mean?
> separate patches to fix the potential deadlock and the lockdep warning? It
> would also be better if you could add a bpf selftest for the deadlock problem, as said before.
>
> Thanks,
> Tao
> >         } else {
> >                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> >         }
> >
> >         *pflags = flags;
> >         return 0;
> > }
> >
> >
> >> lockdep deduces that it may be a deadlock. I have already tried to use the same
> >> map_locked for keys within the same bucket; the deadlock is gone, but I still got
> >> the lockdep warning.
> >>> Hao
> >>> .
> >
>


--
Best regards, Tonghao

2022-12-01 03:47:13

by Hou Tao

[permalink] [raw]
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi,

On 11/30/2022 1:55 PM, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 12:13 PM Hou Tao <[email protected]> wrote:
>> Hi,
>>
>> On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
>>> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
>>>> Hi Hao,
>>>>
>>>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>>>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>>>> work. I think the proper way is to make htab_lock_bucket() use
>>>>>> raw_spin_trylock_irqsave().
>>>>>>
>>>>>> Regards,
>>>>>> Boqun
>>>>>>
>>>>> The potential deadlock happens when the lock is contended from the
>>>>> same cpu. When the lock is contended from a remote cpu, we would like
>>>>> the remote cpu to spin and wait instead of giving up immediately, as
>>>>> this gives better throughput. So replacing the current
>>>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>>>
>>>>> I suspect the source of the problem is the 'hash' that we used in
>>>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>>>> whether we should use a hash derived from 'bucket' rather than from
>>>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>>>> different keys may fall into the same bucket, but yield different
>>>>> hashes. If the same bucket can never have two different 'hashes' here,
>>>>> the map_locked check should behave as intended. Also because
>>>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>>>> both pass.
>>>> The warning from lockdep is due to the fact that bucket lock A is first used in a
>>>> non-NMI context, and then the same bucket lock is used in an NMI context, so
>>> Yes, I tested with lockdep too: we can't take the lock in NMI context (only
>>> trylock works fine there) if we also take it in non-NMI context; otherwise
>>> lockdep prints the warning.
>>> * For the deadlock case, we can use either:
>>>   1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
>>>   2. the hash bucket address.
>> Using the computed hash will be better than using the hash bucket address, because
>> the hash buckets are allocated sequentially.
>>> * For the lockdep warning, we should use an in_nmi() check together with map_locked.
>>>
>>> BTW, the patch doesn't work, so we can remove the lockdep_key:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>>>
>>> static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>                                    struct bucket *b, u32 hash,
>>>                                    unsigned long *pflags)
>>> {
>>>         unsigned long flags;
>>>
>>>         hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>>>
>>>         preempt_disable();
>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
>>>                 preempt_enable();
>>>                 return -EBUSY;
>>>         }
>>>
>>>         if (in_nmi()) {
>>>                 if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>>>                         return -EBUSY;
>> The only purpose of trylock here is to make lockdep happy, and it may lead to
>> unnecessary -EBUSY errors for htab operations in NMI context. I still prefer
>> adding a virtual lock-class for map_locked to fix the lockdep warning. So could you use
> Hi, what is a virtual lock-class? Can you give me an example of what you mean?
If LOCKDEP is enabled, raw_spinlock gains a dep_map field in its definition and
also calls lock_acquire() and lock_release() to assist the deadlock check. Now
map_locked is not a lock, but it acts like a raw_spin_trylock(), so we need to
add a dep_map to it manually and then call lock_acquire(trylock=1) and
lock_release() when increasing and decreasing map_locked. You can reference the
implementations of raw_spin_trylock() and raw_spin_unlock() for more details.
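A rough sketch of that manual annotation, assuming a hypothetical
map_locked_dep_map field added to bpf_htab (illustrative only, not a posted
patch):

/* at map creation time, register a virtual lock class for map_locked */
static struct lock_class_key htab_map_locked_key;

lockdep_init_map(&htab->map_locked_dep_map, "htab->map_locked",
                 &htab_map_locked_key, 0);

/* in htab_lock_bucket(), once the percpu counter has been taken:
 * tell lockdep we "acquired" the virtual lock as a trylock
 * (trylock=1), mirroring what raw_spin_trylock() does
 */
lock_acquire(&htab->map_locked_dep_map, 0, 1, 0, 1, NULL, _RET_IP_);

/* in htab_unlock_bucket(), before dropping the percpu counter */
lock_release(&htab->map_locked_dep_map, _RET_IP_);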
>> separate patches to fix the potential deadlock and the lockdep warning? It
>> would also be better if you could add a bpf selftest for the deadlock problem, as said before.
>>
>> Thanks,
>> Tao
>>>         } else {
>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>         }
>>>
>>>         *pflags = flags;
>>>         return 0;
>>> }
>>>
>>>
>>>> lockdep deduces that it may be a deadlock. I have already tried to use the same
>>>> map_locked for keys within the same bucket; the deadlock is gone, but I still got
>>>> the lockdep warning.
>>>>> Hao
>>>>> .
>