2022-11-29 13:25:00

by Hou Tao

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi,

On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
> On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <[email protected]> wrote:
>> Hi,
>>
>> On 11/29/2022 5:55 AM, Hao Luo wrote:
>>> On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <[email protected]> wrote:
>>> Hi Tonghao,
>>>
>>> With a quick look at the htab_lock_bucket() and your problem
>>> statement, I agree with Hou Tao that using hash &
>>> min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
>>> to fix the potential deadlock. Can you actually send your changes as
>>> v2 so we can take a look and better help you? Also, can you explain
>>> your solution in your commit message? Right now, your commit message
>>> has only a problem statement and is not very clear. Please include
>>> more details on what you do to fix the issue.
>>>
>>> Hao
>> It would be better if the test case below could be rewritten as a bpf selftest.
>> Please see comments below on how to improve it and reproduce the deadlock.
>>>> Hi
>>>> only a warning from lockdep.
>> Thanks for your detailed instructions. I can reproduce the warning by using your
>> setup. I am not a lockdep expert; it seems that fixing such a warning requires
>> assigning a different lockdep class to each bucket. Because we use map_locked to
>> protect the acquisition of the bucket lock, I think we can define a lock_class_key
>> array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
>> bucket locks accordingly.
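
Roughly, the idea would look like this (a hypothetical reconstruction for
illustration, not the actual patch):

	struct bpf_htab {
		struct bpf_map map;
		struct bucket *buckets;
		/* hypothetical: one lockdep class per map_locked slot */
		struct lock_class_key lockdep_key[HASHTAB_MAP_LOCK_COUNT];
		/* ... other fields ... */
	};

	/* at map creation time, for each bucket i: */
	raw_spin_lock_init(&htab->buckets[i].raw_lock);
	lockdep_set_class(&htab->buckets[i].raw_lock,
			  &htab->lockdep_key[i % HASHTAB_MAP_LOCK_COUNT]);
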
The proposed lockdep solution doesn't work. I still got a lockdep warning after
that, so cc +locking experts +lkml.org for lockdep help.

Hi lockdep experts,

We are trying to fix the following lockdep warning from bpf subsystem:

[   36.092222] ================================
[   36.092230] WARNING: inconsistent lock state
[   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
[   36.092236] --------------------------------
[   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
[   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
htab_lock_bucket+0x4d/0x58
[   36.092253] {INITIAL USE} state was registered at:
[   36.092255]   mark_usage+0x1d/0x11d
[   36.092262]   __lock_acquire+0x3c9/0x6ed
[   36.092266]   lock_acquire+0x23d/0x29a
[   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
[   36.092274]   htab_lock_bucket+0x4d/0x58
[   36.092276]   htab_map_delete_elem+0x82/0xfb
[   36.092278]   map_delete_elem+0x156/0x1ac
[   36.092282]   __sys_bpf+0x138/0xb71
[   36.092285]   __do_sys_bpf+0xd/0x15
[   36.092288]   do_syscall_64+0x6d/0x84
[   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   36.092295] irq event stamp: 120346
[   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
_raw_spin_unlock_irq+0x24/0x39
[   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
generic_exec_single+0x40/0xb9
[   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
__do_softirq+0x347/0x387
[   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
__irq_exit_rcu+0x67/0xc6
[   36.092311]
[   36.092311] other info that might help us debug this:
[   36.092312]  Possible unsafe locking scenario:
[   36.092312]
[   36.092313]        CPU0
[   36.092313]        ----
[   36.092314]   lock(&htab->lockdep_key);
[   36.092315]   <Interrupt>
[   36.092316]     lock(&htab->lockdep_key);
[   36.092318]
[   36.092318]  *** DEADLOCK ***
[   36.092318]
[   36.092318] 3 locks held by perf/1515:
[   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
perf_event_ctx_lock_nested+0x8e/0xba
[   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
perf_event_for_each_child+0x35/0x76
[   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
perf_ctx_lock+0x12/0x27
[   36.092339]
[   36.092339] stack backtrace:
[   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E     
6.1.0-rc5+ #81
[   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   36.092349] Call Trace:
[   36.092351]  <NMI>
[   36.092354]  dump_stack_lvl+0x57/0x81
[   36.092359]  lock_acquire+0x1f4/0x29a
[   36.092363]  ? handle_pmi_common+0x13f/0x1f0
[   36.092366]  ? htab_lock_bucket+0x4d/0x58
[   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
[   36.092374]  ? htab_lock_bucket+0x4d/0x58
[   36.092377]  htab_lock_bucket+0x4d/0x58
[   36.092379]  htab_map_update_elem+0x11e/0x220
[   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
[   36.092392]  trace_call_bpf+0x177/0x215
[   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
[   36.092403]  ? x86_pmu_stop+0x97/0x97
[   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
[   36.092415]  nmi_handle+0x116/0x254
[   36.092418]  ? x86_pmu_stop+0x97/0x97
[   36.092423]  default_do_nmi+0x3d/0xf6
[   36.092428]  exc_nmi+0xa1/0x109
[   36.092432]  end_repeat_nmi+0x16/0x67
[   36.092436] RIP: 0010:wrmsrl+0xd/0x1b
[   36.092441] Code: 04 01 00 00 c6 84 07 48 01 00 00 01 5b e9 46 15 80 00 5b c3
cc cc cc cc c3 cc cc cc cc 48 89 f2 89 f9 89 f0 48 c1 ea 20 0f 30 <66> 90 c3 cc
cc cc cc 31 d2 e9 2f 04 49 00 0f 1f 44 00 00 40 0f6
[   36.092443] RSP: 0018:ffffc900043dfc48 EFLAGS: 00000002
[   36.092445] RAX: 000000000000000f RBX: ffff8881b96153e0 RCX: 000000000000038f
[   36.092447] RDX: 0000000000000007 RSI: 000000070000000f RDI: 000000000000038f
[   36.092449] RBP: 000000070000000f R08: ffffffffffffffff R09: ffff8881053bdaa8
[   36.092451] R10: ffff8881b9805d40 R11: 0000000000000005 R12: ffff8881b9805c00
[   36.092452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881075ec970
[   36.092460]  ? wrmsrl+0xd/0x1b
[   36.092465]  ? wrmsrl+0xd/0x1b
[   36.092469]  </NMI>
[   36.092469]  <TASK>
[   36.092470]  __intel_pmu_enable_all.constprop.0+0x7c/0xaf
[   36.092475]  event_function+0xb6/0xd3
[   36.092478]  ? cpu_to_node+0x1a/0x1a
[   36.092482]  ? cpu_to_node+0x1a/0x1a
[   36.092485]  remote_function+0x1e/0x4c
[   36.092489]  generic_exec_single+0x48/0xb9
[   36.092492]  ? __lock_acquire+0x666/0x6ed
[   36.092497]  smp_call_function_single+0xbf/0x106
[   36.092499]  ? cpu_to_node+0x1a/0x1a
[   36.092504]  ? kvm_sched_clock_read+0x5/0x11
[   36.092508]  ? __perf_event_task_sched_in+0x13d/0x13d
[   36.092513]  cpu_function_call+0x47/0x69
[   36.092516]  ? perf_event_update_time+0x52/0x52
[   36.092519]  event_function_call+0x89/0x117
[   36.092521]  ? __perf_event_task_sched_in+0x13d/0x13d
[   36.092526]  ? _perf_event_disable+0x4a/0x4a
[   36.092528]  perf_event_for_each_child+0x3d/0x76
[   36.092532]  ? _perf_event_disable+0x4a/0x4a
[   36.092533]  _perf_ioctl+0x564/0x590
[   36.092537]  ? __lock_release+0xd5/0x1b0
[   36.092543]  ? perf_event_ctx_lock_nested+0x8e/0xba
[   36.092547]  perf_ioctl+0x42/0x5f
[   36.092551]  vfs_ioctl+0x1e/0x2f
[   36.092554]  __do_sys_ioctl+0x66/0x89
[   36.092559]  do_syscall_64+0x6d/0x84
[   36.092563]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   36.092566] RIP: 0033:0x7fe7110f362b
[   36.092569] Code: 0f 1e fa 48 8b 05 5d b8 2c 00 64 c7 00 26 00 00 00 48 c7 c0
ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0
ff ff 73 01 c3 48 8b 0d 2d b8 2c 00 f7 d8 64 89 018
[   36.092570] RSP: 002b:00007ffebb8e4b08 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[   36.092573] RAX: ffffffffffffffda RBX: 0000000000002400 RCX: 00007fe7110f362b
[   36.092575] RDX: 0000000000000000 RSI: 0000000000002400 RDI: 0000000000000013
[   36.092576] RBP: 00007ffebb8e4b40 R08: 0000000000000001 R09: 000055c1db4a5b40
[   36.092577] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[   36.092579] R13: 000055c1db3b2a30 R14: 0000000000000000 R15: 0000000000000000
[   36.092586]  </TASK>

The lockdep warning is a false alarm, because per-cpu map_locked must be zero
before b->raw_lock is acquired. If b->raw_lock has already been acquired by a
process in normal context through htab_map_update_elem(), and an NMI then
interrupts the process and tries to acquire the same b->raw_lock, the
acquisition will fail because the per-cpu map_locked count has already been
increased by the process.
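
For reference, a condensed sketch of the htab_lock_bucket() logic in question
(simplified from kernel/bpf/hashtab.c around v6.1):

	static inline int htab_lock_bucket(const struct bpf_htab *htab,
					   struct bucket *b, u32 hash,
					   unsigned long *pflags)
	{
		unsigned long flags;

		hash = hash & HASHTAB_MAP_LOCK_MASK;

		preempt_disable();
		/* per-cpu map_locked is the guard: if this CPU is already in
		 * the critical section (e.g., reentered from NMI), bail out
		 * instead of spinning on the raw lock */
		if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
			__this_cpu_dec(*(htab->map_locked[hash]));
			preempt_enable();
			return -EBUSY;
		}

		raw_spin_lock_irqsave(&b->raw_lock, flags);
		*pflags = flags;

		return 0;
	}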

So besides using lockdep_off() and lockdep_on() to temporarily disable/enable
lockdep in htab_lock_bucket() and htab_unlock_bucket(), are there other ways
to fix the lockdep warning?

Thanks,
Tao




> Hi
> Thanks for your reply. Defining the lock_class_key array looks good.
> Last question: how about using raw_spin_trylock_irqsave()? If the
> bucket is locked on the same or another cpu,
> raw_spin_trylock_irqsave() will return false, and we should return
> -EBUSY in htab_lock_bucket:
>
> static inline int htab_lock_bucket(struct bucket *b,
>                                    unsigned long *pflags)
> {
>         unsigned long flags;
>
>         if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>                 return -EBUSY;
>
>         *pflags = flags;
>         return 0;
> }
>
>>>> 1. the kernel .config
>>>> #
>>>> # Debug Oops, Lockups and Hangs
>>>> #
>>>> CONFIG_PANIC_ON_OOPS=y
>>>> CONFIG_PANIC_ON_OOPS_VALUE=1
>>>> CONFIG_PANIC_TIMEOUT=0
>>>> CONFIG_LOCKUP_DETECTOR=y
>>>> CONFIG_SOFTLOCKUP_DETECTOR=y
>>>> # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
>>>> CONFIG_HARDLOCKUP_DETECTOR_PERF=y
>>>> CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
>>>> CONFIG_HARDLOCKUP_DETECTOR=y
>>>> CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
>>>> CONFIG_DETECT_HUNG_TASK=y
>>>> CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
>>>> # CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
>>>> # CONFIG_WQ_WATCHDOG is not set
>>>> # CONFIG_TEST_LOCKUP is not set
>>>> # end of Debug Oops, Lockups and Hangs
>>>>
>>>> 2. bpf.c, the map size is 2.
>>>> struct {
>>>>         __uint(type, BPF_MAP_TYPE_HASH);
>> Add __uint(map_flags, BPF_F_ZERO_SEED); to ensure there is no seed for the
>> hash calculation, so we can use key=4 and key=20 to construct a case where
>> the two keys have the same bucket index but different map_locked indices.
>>>>         __uint(max_entries, 2);
>>>>         __uint(key_size, sizeof(unsigned int));
>>>>         __uint(value_size, sizeof(unsigned int));
>>>> } map1 SEC(".maps");
>>>>
>>>> static int bpf_update_data()
>>>> {
>>>>         unsigned int val = 1, key = 0;
>> Use key = 20 here.
>>>>         return bpf_map_update_elem(&map1, &key, &val, BPF_ANY);
>>>> }
>>>>
>>>> SEC("kprobe/ip_rcv")
>>>> int bpf_prog1(struct pt_regs *regs)
>>>> {
>>>>         bpf_update_data();
>>>>         return 0;
>>>> }
>> The kprobe on ip_rcv is unnecessary; you can just remove it.
>>>> SEC("tracepoint/nmi/nmi_handler")
>>>> int bpf_prog2(struct pt_regs *regs)
>>>> {
>>>>         bpf_update_data();
>>>>         return 0;
>>>> }
>> Please use SEC("fentry/nmi_handle") instead of SEC("tracepoint") and unfold
>> bpf_update_data(), because the running of a bpf program on a tracepoint is
>> blocked by bpf_prog_active, which is increased by bpf_map_update_elem()
>> through bpf_disable_instrumentation().
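>>
>> A hypothetical sketch of that fentry variant (assuming <bpf/bpf_tracing.h>
>> is included for BPF_PROG), with bpf_update_data() unfolded into the
>> program body:
>>
>> SEC("fentry/nmi_handle")
>> int BPF_PROG(bpf_prog2)
>> {
>>         unsigned int val = 1, key = 20;
>>
>>         bpf_map_update_elem(&map1, &key, &val, BPF_ANY);
>>         return 0;
>> }
>>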
>>>> char _license[] SEC("license") = "GPL";
>>>> unsigned int _version SEC("version") = LINUX_VERSION_CODE;
>>>>
>>>> 3. bpf loader.
>>>> #include "kprobe-example.skel.h"
>>>>
>>>> #include <stdio.h>
>>>> #include <unistd.h>
>>>> #include <errno.h>
>>>>
>>>> #include <bpf/bpf.h>
>>>>
>>>> int main()
>>>> {
>>>>         struct kprobe_example *skel;
>>>>         int map_fd, prog_fd;
>>>>         int i;
>>>>         int err = 0;
>>>>
>>>>         skel = kprobe_example__open_and_load();
>>>>         if (!skel)
>>>>                 return -1;
>>>>
>>>>         err = kprobe_example__attach(skel);
>>>>         if (err)
>>>>                 goto cleanup;
>>>>
>>>>         /* all libbpf APIs are usable */
>>>>         prog_fd = bpf_program__fd(skel->progs.bpf_prog1);
>>>>         map_fd = bpf_map__fd(skel->maps.map1);
>>>>
>>>>         printf("map_fd: %d\n", map_fd);
>>>>
>>>>         unsigned int val = 0, key = 0;
>>>>
>>>>         while (1) {
>>>>                 bpf_map_delete_elem(map_fd, &key);
>> Not needed either; only doing bpf_map_update_elem() is OK. Also change key
>> from 0 to 4, so it has the same bucket index as key=20 but a different
>> map_locked index.
>>>>                 bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
>>>>         }
>> You also need to pin the process to a specific CPU (e.g., CPU 0).
>>>> cleanup:
>>>>         kprobe_example__destroy(skel);
>>>>         return err;
>>>> }
>>>>
>>>> 4. run the bpf loader and perf record for nmi interrupts. the warning occurs
>> For the perf event, you can reference prog_tests/find_vma.c for how to use
>> perf_event_open() to trigger a perf NMI interrupt. The perf event also needs
>> to be pinned to the same CPU as the caller of bpf_map_update_elem().
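>>
>> For illustration, a minimal user-space sketch (the helper name is
>> hypothetical) that pins itself to CPU 0 and opens a sampling hardware event
>> there, so PMU NMIs fire on the same CPU that runs bpf_map_update_elem():
>>
>> #define _GNU_SOURCE
>> #include <linux/perf_event.h>
>> #include <sys/syscall.h>
>> #include <sched.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> static int open_perf_nmi_on_cpu0(void)
>> {
>>         struct perf_event_attr attr;
>>         cpu_set_t set;
>>
>>         /* pin the current process to CPU 0 first */
>>         CPU_ZERO(&set);
>>         CPU_SET(0, &set);
>>         sched_setaffinity(0, sizeof(set), &set);
>>
>>         memset(&attr, 0, sizeof(attr));
>>         attr.type = PERF_TYPE_HARDWARE;
>>         attr.size = sizeof(attr);
>>         attr.config = PERF_COUNT_HW_CPU_CYCLES;
>>         attr.sample_period = 100000;    /* frequent samples -> frequent NMIs */
>>
>>         /* pid = -1, cpu = 0: count events on CPU 0 regardless of task */
>>         return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
>> }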
>>
>>>> --
>>>> Best regards, Tonghao
>>> .
>


2022-11-29 16:27:05

by Waiman Long

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On 11/29/22 07:45, Hou Tao wrote:
> Hi,
>
> On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
>> On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <[email protected]> wrote:
>>> Hi,
>>>
>>> On 11/29/2022 5:55 AM, Hao Luo wrote:
>>>> On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <[email protected]> wrote:
>>>> Hi Tonghao,
>>>>
>>>> With a quick look at the htab_lock_bucket() and your problem
>>>> statement, I agree with Hou Tao that using hash &
>>>> min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
>>>> to fix the potential deadlock. Can you actually send your changes as
>>>> v2 so we can take a look and better help you? Also, can you explain
>>>> your solution in your commit message? Right now, your commit message
>>>> has only a problem statement and is not very clear. Please include
>>>> more details on what you do to fix the issue.
>>>>
>>>> Hao
>>> It would be better if the test case below can be rewritten as a bpf selftests.
>>> Please see comments below on how to improve it and reproduce the deadlock.
>>>>> Hi
>>>>> only a warning from lockdep.
>>> Thanks for your detailed instructions. I can reproduce the warning by using your
>>> setup. I am not a lockdep expert; it seems that fixing such a warning requires
>>> assigning a different lockdep class to each bucket. Because we use map_locked to
>>> protect the acquisition of the bucket lock, I think we can define a lock_class_key
>>> array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
>>> bucket locks accordingly.
> The proposed lockdep solution doesn't work. Still got lockdep warning after
> that, so cc +locking expert +lkml.org for lockdep help.
>
> Hi lockdep experts,
>
> We are trying to fix the following lockdep warning from bpf subsystem:
>
> [   36.092222] ================================
> [   36.092230] WARNING: inconsistent lock state
> [   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
> [   36.092236] --------------------------------
> [   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> [   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
> [   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
> htab_lock_bucket+0x4d/0x58
> [   36.092253] {INITIAL USE} state was registered at:
> [   36.092255]   mark_usage+0x1d/0x11d
> [   36.092262]   __lock_acquire+0x3c9/0x6ed
> [   36.092266]   lock_acquire+0x23d/0x29a
> [   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
> [   36.092274]   htab_lock_bucket+0x4d/0x58
> [   36.092276]   htab_map_delete_elem+0x82/0xfb
> [   36.092278]   map_delete_elem+0x156/0x1ac
> [   36.092282]   __sys_bpf+0x138/0xb71
> [   36.092285]   __do_sys_bpf+0xd/0x15
> [   36.092288]   do_syscall_64+0x6d/0x84
> [   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [   36.092295] irq event stamp: 120346
> [   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
> _raw_spin_unlock_irq+0x24/0x39
> [   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
> generic_exec_single+0x40/0xb9
> [   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
> __do_softirq+0x347/0x387
> [   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
> __irq_exit_rcu+0x67/0xc6
> [   36.092311]
> [   36.092311] other info that might help us debug this:
> [   36.092312]  Possible unsafe locking scenario:
> [   36.092312]
> [   36.092313]        CPU0
> [   36.092313]        ----
> [   36.092314]   lock(&htab->lockdep_key);
> [   36.092315]   <Interrupt>
> [   36.092316]     lock(&htab->lockdep_key);
> [   36.092318]
> [   36.092318]  *** DEADLOCK ***
> [   36.092318]
> [   36.092318] 3 locks held by perf/1515:
> [   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
> perf_event_ctx_lock_nested+0x8e/0xba
> [   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
> perf_event_for_each_child+0x35/0x76
> [   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
> perf_ctx_lock+0x12/0x27
> [   36.092339]
> [   36.092339] stack backtrace:
> [   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E
> 6.1.0-rc5+ #81
> [   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [   36.092349] Call Trace:
> [   36.092351]  <NMI>
> [   36.092354]  dump_stack_lvl+0x57/0x81
> [   36.092359]  lock_acquire+0x1f4/0x29a
> [   36.092363]  ? handle_pmi_common+0x13f/0x1f0
> [   36.092366]  ? htab_lock_bucket+0x4d/0x58
> [   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
> [   36.092374]  ? htab_lock_bucket+0x4d/0x58
> [   36.092377]  htab_lock_bucket+0x4d/0x58
> [   36.092379]  htab_map_update_elem+0x11e/0x220
> [   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
> [   36.092392]  trace_call_bpf+0x177/0x215
> [   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
> [   36.092403]  ? x86_pmu_stop+0x97/0x97
> [   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
> [   36.092415]  nmi_handle+0x116/0x254
> [   36.092418]  ? x86_pmu_stop+0x97/0x97
> [   36.092423]  default_do_nmi+0x3d/0xf6
> [   36.092428]  exc_nmi+0xa1/0x109
> [   36.092432]  end_repeat_nmi+0x16/0x67
> [   36.092436] RIP: 0010:wrmsrl+0xd/0x1b

So the lock is really taken in an NMI context. In general, we advise
against using a lock in NMI context unless it is a lock that is used only
in that context. Otherwise, deadlock is certainly a possibility, as there
is no way to mask off an NMI again.

Cheers,
Longman

2022-11-29 18:03:52

by Boqun Feng

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Tue, Nov 29, 2022 at 11:06:51AM -0500, Waiman Long wrote:
> On 11/29/22 07:45, Hou Tao wrote:
> > Hi,
> >
> > On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
> > > On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > On 11/29/2022 5:55 AM, Hao Luo wrote:
> > > > > On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <[email protected]> wrote:
> > > > > Hi Tonghao,
> > > > >
> > > > > With a quick look at the htab_lock_bucket() and your problem
> > > > > statement, I agree with Hou Tao that using hash &
> > > > > min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
> > > > > to fix the potential deadlock. Can you actually send your changes as
> > > > > v2 so we can take a look and better help you? Also, can you explain
> > > > > your solution in your commit message? Right now, your commit message
> > > > > has only a problem statement and is not very clear. Please include
> > > > > more details on what you do to fix the issue.
> > > > >
> > > > > Hao
> > > > It would be better if the test case below can be rewritten as a bpf selftests.
> > > > Please see comments below on how to improve it and reproduce the deadlock.
> > > > > > Hi
> > > > > > only a warning from lockdep.
> > > > Thanks for your detailed instructions. I can reproduce the warning by using your
> > > > setup. I am not a lockdep expert; it seems that fixing such a warning requires
> > > > assigning a different lockdep class to each bucket. Because we use map_locked to
> > > > protect the acquisition of the bucket lock, I think we can define a lock_class_key
> > > > array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
> > > > bucket locks accordingly.
> > The proposed lockdep solution doesn't work. Still got lockdep warning after
> > that, so cc +locking expert +lkml.org for lockdep help.
> >
> > Hi lockdep experts,
> >
> > We are trying to fix the following lockdep warning from bpf subsystem:
> >
> > [   36.092222] ================================
> > [   36.092230] WARNING: inconsistent lock state
> > [   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
> > [   36.092236] --------------------------------
> > [   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> > [   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
> > [   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
> > htab_lock_bucket+0x4d/0x58
> > [   36.092253] {INITIAL USE} state was registered at:
> > [   36.092255]   mark_usage+0x1d/0x11d
> > [   36.092262]   __lock_acquire+0x3c9/0x6ed
> > [   36.092266]   lock_acquire+0x23d/0x29a
> > [   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
> > [   36.092274]   htab_lock_bucket+0x4d/0x58
> > [   36.092276]   htab_map_delete_elem+0x82/0xfb
> > [   36.092278]   map_delete_elem+0x156/0x1ac
> > [   36.092282]   __sys_bpf+0x138/0xb71
> > [   36.092285]   __do_sys_bpf+0xd/0x15
> > [   36.092288]   do_syscall_64+0x6d/0x84
> > [   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [   36.092295] irq event stamp: 120346
> > [   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
> > _raw_spin_unlock_irq+0x24/0x39
> > [   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
> > generic_exec_single+0x40/0xb9
> > [   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
> > __do_softirq+0x347/0x387
> > [   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
> > __irq_exit_rcu+0x67/0xc6
> > [   36.092311]
> > [   36.092311] other info that might help us debug this:
> > [   36.092312]  Possible unsafe locking scenario:
> > [   36.092312]
> > [   36.092313]        CPU0
> > [   36.092313]        ----
> > [   36.092314]   lock(&htab->lockdep_key);
> > [   36.092315]   <Interrupt>
> > [   36.092316]     lock(&htab->lockdep_key);
> > [   36.092318]
> > [   36.092318]  *** DEADLOCK ***
> > [   36.092318]
> > [   36.092318] 3 locks held by perf/1515:
> > [   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
> > perf_event_ctx_lock_nested+0x8e/0xba
> > [   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
> > perf_event_for_each_child+0x35/0x76
> > [   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
> > perf_ctx_lock+0x12/0x27
> > [   36.092339]
> > [   36.092339] stack backtrace:
> > [   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E
> > 6.1.0-rc5+ #81
> > [   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > [   36.092349] Call Trace:
> > [   36.092351]  <NMI>
> > [   36.092354]  dump_stack_lvl+0x57/0x81
> > [   36.092359]  lock_acquire+0x1f4/0x29a
> > [   36.092363]  ? handle_pmi_common+0x13f/0x1f0
> > [   36.092366]  ? htab_lock_bucket+0x4d/0x58
> > [   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
> > [   36.092374]  ? htab_lock_bucket+0x4d/0x58
> > [   36.092377]  htab_lock_bucket+0x4d/0x58
> > [   36.092379]  htab_map_update_elem+0x11e/0x220
> > [   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
> > [   36.092392]  trace_call_bpf+0x177/0x215
> > [   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
> > [   36.092403]  ? x86_pmu_stop+0x97/0x97
> > [   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
> > [   36.092415]  nmi_handle+0x116/0x254
> > [   36.092418]  ? x86_pmu_stop+0x97/0x97
> > [   36.092423]  default_do_nmi+0x3d/0xf6
> > [   36.092428]  exc_nmi+0xa1/0x109
> > [   36.092432]  end_repeat_nmi+0x16/0x67
> > [   36.092436] RIP: 0010:wrmsrl+0xd/0x1b
>
> So the lock is really taken in an NMI context. In general, we advise
> against using a lock in NMI context unless it is a lock that is used only
> in that context. Otherwise, deadlock is certainly a possibility, as there
> is no way to mask off an NMI again.
>

I think here they use a percpu counter as an "outer lock" to make the
accesses to the real lock exclusive:

	preempt_disable();
	a = __this_cpu_inc_return(->map_locked);
	if (a != 1) {
		__this_cpu_dec(->map_locked);
		preempt_enable();
		return -EBUSY;
	}

	raw_spin_lock_irqsave(->raw_lock);

and lockdep is not aware that ->map_locked acts as a lock.

However, I feel this may be just a reinvented try_lock pattern, Hou Tao,
could you see if this can be refactored with a try_lock? Otherwise, you
may need to introduce a virtual lockclass for ->map_locked.
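
For concreteness, a hypothetical sketch of such an annotation (the field name
and placement are assumptions): treat the map_locked increment as a trylock
acquire and the decrement as a release, so lockdep can track it:

	/* hypothetical field in struct bpf_htab */
	struct lockdep_map map_locked_dep_map;

	/* at map creation time: */
	static struct lock_class_key map_locked_key;
	lockdep_init_map(&htab->map_locked_dep_map, "map_locked",
			 &map_locked_key, 0);

	/* in htab_lock_bucket(), once __this_cpu_inc_return(...) == 1: */
	lock_map_acquire_try(&htab->map_locked_dep_map);

	/* in htab_unlock_bucket(), just before __this_cpu_dec(...): */
	lock_map_release(&htab->map_locked_dep_map);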

Regards,
Boqun

> Cheers,
> Longman
>

2022-11-29 18:04:14

by Boqun Feng

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Tue, Nov 29, 2022 at 09:23:18AM -0800, Boqun Feng wrote:
> On Tue, Nov 29, 2022 at 11:06:51AM -0500, Waiman Long wrote:
> > On 11/29/22 07:45, Hou Tao wrote:
> > > Hi,
> > >
> > > On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
> > > > On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > On 11/29/2022 5:55 AM, Hao Luo wrote:
> > > > > > On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <[email protected]> wrote:
> > > > > > Hi Tonghao,
> > > > > >
> > > > > > With a quick look at the htab_lock_bucket() and your problem
> > > > > > statement, I agree with Hou Tao that using hash &
> > > > > > min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
> > > > > > to fix the potential deadlock. Can you actually send your changes as
> > > > > > v2 so we can take a look and better help you? Also, can you explain
> > > > > > your solution in your commit message? Right now, your commit message
> > > > > > has only a problem statement and is not very clear. Please include
> > > > > > more details on what you do to fix the issue.
> > > > > >
> > > > > > Hao
> > > > > It would be better if the test case below can be rewritten as a bpf selftests.
> > > > > Please see comments below on how to improve it and reproduce the deadlock.
> > > > > > > Hi
> > > > > > > only a warning from lockdep.
> > > > > Thanks for your detailed instructions. I can reproduce the warning by using your
> > > > > setup. I am not a lockdep expert; it seems that fixing such a warning requires
> > > > > assigning a different lockdep class to each bucket. Because we use map_locked to
> > > > > protect the acquisition of the bucket lock, I think we can define a lock_class_key
> > > > > array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
> > > > > bucket locks accordingly.
> > > The proposed lockdep solution doesn't work. Still got lockdep warning after
> > > that, so cc +locking expert +lkml.org for lockdep help.
> > >
> > > Hi lockdep experts,
> > >
> > > We are trying to fix the following lockdep warning from bpf subsystem:
> > >
> > > [   36.092222] ================================
> > > [   36.092230] WARNING: inconsistent lock state
> > > [   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
> > > [   36.092236] --------------------------------
> > > [   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> > > [   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
> > > [   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
> > > htab_lock_bucket+0x4d/0x58
> > > [   36.092253] {INITIAL USE} state was registered at:
> > > [   36.092255]   mark_usage+0x1d/0x11d
> > > [   36.092262]   __lock_acquire+0x3c9/0x6ed
> > > [   36.092266]   lock_acquire+0x23d/0x29a
> > > [   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
> > > [   36.092274]   htab_lock_bucket+0x4d/0x58
> > > [   36.092276]   htab_map_delete_elem+0x82/0xfb
> > > [   36.092278]   map_delete_elem+0x156/0x1ac
> > > [   36.092282]   __sys_bpf+0x138/0xb71
> > > [   36.092285]   __do_sys_bpf+0xd/0x15
> > > [   36.092288]   do_syscall_64+0x6d/0x84
> > > [   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > [   36.092295] irq event stamp: 120346
> > > [   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
> > > _raw_spin_unlock_irq+0x24/0x39
> > > [   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
> > > generic_exec_single+0x40/0xb9
> > > [   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
> > > __do_softirq+0x347/0x387
> > > [   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
> > > __irq_exit_rcu+0x67/0xc6
> > > [   36.092311]
> > > [   36.092311] other info that might help us debug this:
> > > [   36.092312]  Possible unsafe locking scenario:
> > > [   36.092312]
> > > [   36.092313]        CPU0
> > > [   36.092313]        ----
> > > [   36.092314]   lock(&htab->lockdep_key);
> > > [   36.092315]   <Interrupt>
> > > [   36.092316]     lock(&htab->lockdep_key);
> > > [   36.092318]
> > > [   36.092318]  *** DEADLOCK ***
> > > [   36.092318]
> > > [   36.092318] 3 locks held by perf/1515:
> > > [   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
> > > perf_event_ctx_lock_nested+0x8e/0xba
> > > [   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
> > > perf_event_for_each_child+0x35/0x76
> > > [   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
> > > perf_ctx_lock+0x12/0x27
> > > [   36.092339]
> > > [   36.092339] stack backtrace:
> > > [   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E
> > > 6.1.0-rc5+ #81
> > > [   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > > [   36.092349] Call Trace:
> > > [   36.092351]  <NMI>
> > > [   36.092354]  dump_stack_lvl+0x57/0x81
> > > [   36.092359]  lock_acquire+0x1f4/0x29a
> > > [   36.092363]  ? handle_pmi_common+0x13f/0x1f0
> > > [   36.092366]  ? htab_lock_bucket+0x4d/0x58
> > > [   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
> > > [   36.092374]  ? htab_lock_bucket+0x4d/0x58
> > > [   36.092377]  htab_lock_bucket+0x4d/0x58
> > > [   36.092379]  htab_map_update_elem+0x11e/0x220
> > > [   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
> > > [   36.092392]  trace_call_bpf+0x177/0x215
> > > [   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
> > > [   36.092403]  ? x86_pmu_stop+0x97/0x97
> > > [   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
> > > [   36.092415]  nmi_handle+0x116/0x254
> > > [   36.092418]  ? x86_pmu_stop+0x97/0x97
> > > [   36.092423]  default_do_nmi+0x3d/0xf6
> > > [   36.092428]  exc_nmi+0xa1/0x109
> > > [   36.092432]  end_repeat_nmi+0x16/0x67
> > > [   36.092436] RIP: 0010:wrmsrl+0xd/0x1b
> >
> > So the lock is really taken in an NMI context. In general, we advise
> > against using a lock in NMI context unless it is a lock that is used only
> > in that context. Otherwise, deadlock is certainly a possibility, as there
> > is no way to mask off an NMI again.
> >
>
> I think here they use a percpu counter as an "outer lock" to make the
> accesses to the real lock exclusive:
>
> 	preempt_disable();
> 	a = __this_cpu_inc_return(->map_locked);
> 	if (a != 1) {
> 		__this_cpu_dec(->map_locked);
> 		preempt_enable();
> 		return -EBUSY;
> 	}
>
> 	raw_spin_lock_irqsave(->raw_lock);
>
> and lockdep is not aware that ->map_locked acts as a lock.
>
> However, I feel this may be just a reinvented try_lock pattern, Hou Tao,
> could you see if this can be refactored with a try_lock? Otherwise, you

Just to be clear, I meant to refactor htab_lock_bucket() into a try
lock pattern. Also after a second thought, the below suggestion doesn't
work. I think the proper way is to make htab_lock_bucket() as a
raw_spin_trylock_irqsave().

Regards,
Boqun

> may need to introduce a virtual lockclass for ->map_locked.
>
> Regards,
> Boqun
>
> > Cheers,
> > Longman
> >

2022-11-29 19:49:19

by Hao Luo

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>
> Just to be clear, I meant to refactor htab_lock_bucket() into a try
> lock pattern. Also after a second thought, the below suggestion doesn't
> work. I think the proper way is to make htab_lock_bucket() as a
> raw_spin_trylock_irqsave().
>
> Regards,
> Boqun
>

The potential deadlock happens when the lock is contended from the
same cpu. When the lock is contended from a remote cpu, we would like
the remote cpu to spin and wait, instead of giving up immediately, as
this gives better throughput. So replacing the current
raw_spin_lock_irqsave() with trylock sacrifices this performance gain.

I suspect the source of the problem is the 'hash' that we use in
htab_lock_bucket(). The 'hash' is derived from the 'key', and I wonder
whether we should use a hash derived from the 'bucket' rather than from
the 'key', for example from the memory address of the 'bucket'. Because
different keys may fall into the same bucket but yield different
hashes, if the same bucket can never have two different 'hashes' here,
the map_locked check should behave as intended. Also, because
->map_locked is per-cpu, execution flows from two different cpus can
both pass.
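
A minimal sketch of that idea, assuming the map_locked index were derived
from the bucket itself instead of from the key's hash (hypothetical helper,
not existing code):

	/* Two keys that fall into the same bucket always map to the same
	 * map_locked slot, because the index comes from the bucket. */
	static inline u32 map_locked_index(const struct bpf_htab *htab,
					   const struct bucket *b)
	{
		return (u32)(b - htab->buckets) & HASHTAB_MAP_LOCK_MASK;
	}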

Hao

2022-11-29 21:37:50

by Waiman Long

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock


On 11/29/22 14:36, Hao Luo wrote:
> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>> lock pattern. Also after a second thought, the below suggestion doesn't
>> work. I think the proper way is to make htab_lock_bucket() as a
>> raw_spin_trylock_irqsave().
>>
>> Regards,
>> Boqun
>>
> The potential deadlock happens when the lock is contended from the
> same cpu. When the lock is contended from a remote cpu, we would like
> the remote cpu to spin and wait, instead of giving up immediately. As
> this gives better throughput. So replacing the current
> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>
> I suspect the source of the problem is the 'hash' that we used in
> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
> whether we should use a hash derived from 'bucket' rather than from
> 'key'. For example, from the memory address of the 'bucket'. Because,
> different keys may fall into the same bucket, but yield different
> hashes. If the same bucket can never have two different 'hashes' here,
> the map_locked check should behave as intended. Also because
> ->map_locked is per-cpu, execution flows from two different cpus can
> both pass.

I would suggest that you add an in_nmi() check and, if true, use trylock to
get the lock. You can continue to use raw_spin_lock_irqsave() in all
other cases.

Cheers,
Longman

2022-11-30 02:15:28

by Hou Tao

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi Hao,

On 11/30/2022 3:36 AM, Hao Luo wrote:
> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>> lock pattern. Also after a second thought, the below suggestion doesn't
>> work. I think the proper way is to make htab_lock_bucket() as a
>> raw_spin_trylock_irqsave().
>>
>> Regards,
>> Boqun
>>
> The potential deadlock happens when the lock is contended from the
> same cpu. When the lock is contended from a remote cpu, we would like
> the remote cpu to spin and wait, instead of giving up immediately. As
> this gives better throughput. So replacing the current
> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>
> I suspect the source of the problem is the 'hash' that we used in
> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
> whether we should use a hash derived from 'bucket' rather than from
> 'key'. For example, from the memory address of the 'bucket'. Because,
> different keys may fall into the same bucket, but yield different
> hashes. If the same bucket can never have two different 'hashes' here,
> the map_locked check should behave as intended. Also because
> ->map_locked is per-cpu, execution flows from two different cpus can
> both pass.
The warning from lockdep arises because bucket lock A is first used in a
non-NMI context, and then the same bucket lock is used in an NMI context, so
lockdep deduces that this may be a dead-lock. I have already tried to use the
same map_locked index for keys within the same bucket; the dead-lock is gone,
but I still got the lockdep warning.
>
> Hao
> .

2022-11-30 02:15:32

by Hou Tao

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi,

On 11/30/2022 1:23 AM, Boqun Feng wrote:
> On Tue, Nov 29, 2022 at 11:06:51AM -0500, Waiman Long wrote:
>> On 11/29/22 07:45, Hou Tao wrote:
>>> Hi,
>>>
>>> On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
>>>> On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> On 11/29/2022 5:55 AM, Hao Luo wrote:
>>>>>> On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <[email protected]> wrote:
>>>>>> Hi Tonghao,
>>>>>>
>>>>>> With a quick look at the htab_lock_bucket() and your problem
>>>>>> statement, I agree with Hou Tao that using hash &
>>>>>> min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
>>>>>> to fix the potential deadlock. Can you actually send your changes as
>>>>>> v2 so we can take a look and better help you? Also, can you explain
>>>>>> your solution in your commit message? Right now, your commit message
>>>>>> has only a problem statement and is not very clear. Please include
>>>>>> more details on what you do to fix the issue.
>>>>>>
>>>>>> Hao
>>>>> It would be better if the test case below can be rewritten as a bpf selftests.
>>>>> Please see comments below on how to improve it and reproduce the deadlock.
>>>>>>> Hi
>>>>>>> only a warning from lockdep.
>>>>> Thanks for your detailed instructions. I can reproduce the warning by using your
>>>>> setup. I am not a lockdep expert; it seems that fixing such a warning requires
>>>>> assigning a different lockdep class to each bucket. Because we use map_locked to
>>>>> protect the acquisition of the bucket lock, I think we can define a lock_class_key
>>>>> array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
>>>>> bucket locks accordingly.
>>> The proposed lockdep solution doesn't work. Still got lockdep warning after
>>> that, so cc +locking expert +lkml.org for lockdep help.
>>>
>>> Hi lockdep experts,
>>>
>>> We are trying to fix the following lockdep warning from bpf subsystem:
>>>
>>> [   36.092222] ================================
>>> [   36.092230] WARNING: inconsistent lock state
>>> [   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
>>> [   36.092236] --------------------------------
>>> [   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
>>> [   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
>>> [   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
>>> htab_lock_bucket+0x4d/0x58
>>> [   36.092253] {INITIAL USE} state was registered at:
>>> [   36.092255]   mark_usage+0x1d/0x11d
>>> [   36.092262]   __lock_acquire+0x3c9/0x6ed
>>> [   36.092266]   lock_acquire+0x23d/0x29a
>>> [   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
>>> [   36.092274]   htab_lock_bucket+0x4d/0x58
>>> [   36.092276]   htab_map_delete_elem+0x82/0xfb
>>> [   36.092278]   map_delete_elem+0x156/0x1ac
>>> [   36.092282]   __sys_bpf+0x138/0xb71
>>> [   36.092285]   __do_sys_bpf+0xd/0x15
>>> [   36.092288]   do_syscall_64+0x6d/0x84
>>> [   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> [   36.092295] irq event stamp: 120346
>>> [   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
>>> _raw_spin_unlock_irq+0x24/0x39
>>> [   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
>>> generic_exec_single+0x40/0xb9
>>> [   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
>>> __do_softirq+0x347/0x387
>>> [   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
>>> __irq_exit_rcu+0x67/0xc6
>>> [   36.092311]
>>> [   36.092311] other info that might help us debug this:
>>> [   36.092312]  Possible unsafe locking scenario:
>>> [   36.092312]
>>> [   36.092313]        CPU0
>>> [   36.092313]        ----
>>> [   36.092314]   lock(&htab->lockdep_key);
>>> [   36.092315]   <Interrupt>
>>> [   36.092316]     lock(&htab->lockdep_key);
>>> [   36.092318]
>>> [   36.092318]  *** DEADLOCK ***
>>> [   36.092318]
>>> [   36.092318] 3 locks held by perf/1515:
>>> [   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
>>> perf_event_ctx_lock_nested+0x8e/0xba
>>> [   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
>>> perf_event_for_each_child+0x35/0x76
>>> [   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
>>> perf_ctx_lock+0x12/0x27
>>> [   36.092339]
>>> [   36.092339] stack backtrace:
>>> [   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E
>>> 6.1.0-rc5+ #81
>>> [   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>>> [   36.092349] Call Trace:
>>> [   36.092351]  <NMI>
>>> [   36.092354]  dump_stack_lvl+0x57/0x81
>>> [   36.092359]  lock_acquire+0x1f4/0x29a
>>> [   36.092363]  ? handle_pmi_common+0x13f/0x1f0
>>> [   36.092366]  ? htab_lock_bucket+0x4d/0x58
>>> [   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
>>> [   36.092374]  ? htab_lock_bucket+0x4d/0x58
>>> [   36.092377]  htab_lock_bucket+0x4d/0x58
>>> [   36.092379]  htab_map_update_elem+0x11e/0x220
>>> [   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
>>> [   36.092392]  trace_call_bpf+0x177/0x215
>>> [   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
>>> [   36.092403]  ? x86_pmu_stop+0x97/0x97
>>> [   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
>>> [   36.092415]  nmi_handle+0x116/0x254
>>> [   36.092418]  ? x86_pmu_stop+0x97/0x97
>>> [   36.092423]  default_do_nmi+0x3d/0xf6
>>> [   36.092428]  exc_nmi+0xa1/0x109
>>> [   36.092432]  end_repeat_nmi+0x16/0x67
>>> [   36.092436] RIP: 0010:wrmsrl+0xd/0x1b
>> So the lock is really taken in an NMI context. In general, we advise
>> against using a lock in NMI context unless it is a lock that is used only
>> in that context. Otherwise, deadlock is certainly a possibility, as there
>> is no way to mask off an NMI again.
>>
> I think here they use a percpu counter as an "outer lock" to make the
> accesses to the real lock exclusive:
>
> 	preempt_disable();
> 	a = __this_cpu_inc_return(->map_locked);
> 	if (a != 1) {
> 		__this_cpu_dec(->map_locked);
> 		preempt_enable();
> 		return -EBUSY;
> 	}
>
> 	raw_spin_lock_irqsave(->raw_lock);
>
> and lockdep is not aware that ->map_locked acts as a lock.
>
> However, I feel this may be just a reinvented try_lock pattern, Hou Tao,
> could you see if this can be refactored with a try_lock? Otherwise, you
> may need to introduce a virtual lockclass for ->map_locked.
As said by Hao Luo, the problem with using trylock in NMI context is that it
cannot distinguish between a dead-lock and a lock under high contention. And
map_locked is still needed even if trylock is used in NMI context, because
htab_map_update_elem() may be reentered in a normal context by attaching a bpf
program to a function called after the lock has been taken. So introducing a
virtual lockclass for ->map_locked is a better idea.

Thanks,
Tao
> Regards,
> Boqun
>
>> Cheers,
>> Longman
>>
> .

2022-11-30 03:02:42

by Tonghao Zhang

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
>
> Hi Hao,
>
> On 11/30/2022 3:36 AM, Hao Luo wrote:
> > On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
> >> Just to be clear, I meant to refactor htab_lock_bucket() into a try
> >> lock pattern. Also after a second thought, the below suggestion doesn't
> >> work. I think the proper way is to make htab_lock_bucket() as a
> >> raw_spin_trylock_irqsave().
> >>
> >> Regards,
> >> Boqun
> >>
> > The potential deadlock happens when the lock is contended from the
> > same cpu. When the lock is contended from a remote cpu, we would like
> > the remote cpu to spin and wait, instead of giving up immediately. As
> > this gives better throughput. So replacing the current
> > raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
> >
> > I suspect the source of the problem is the 'hash' that we used in
> > htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
> > whether we should use a hash derived from 'bucket' rather than from
> > 'key'. For example, from the memory address of the 'bucket'. Because,
> > different keys may fall into the same bucket, but yield different
> > hashes. If the same bucket can never have two different 'hashes' here,
> > the map_locked check should behave as intended. Also because
> > ->map_locked is per-cpu, execution flows from two different cpus can
> > both pass.
> The warning from lockdep is due to the reason the bucket lock A is used in a
> no-NMI context firstly, then the same bucke lock is used a NMI context, so
Yes, I tested lockdep too. We can't use the lock in NMI context (only
try_lock works fine there) if we also use it in non-NMI context; otherwise
lockdep prints the warning.
* for the dead-lock case, we can use:
  1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
  2. the hash bucket address.

* for the lockdep warning, we should use an in_nmi() check with map_locked.

BTW, the patch doesn't work, so we can remove the lock_key
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9

static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;

	hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);

	preempt_disable();
	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
		__this_cpu_dec(*(htab->map_locked[hash]));
		preempt_enable();
		return -EBUSY;
	}

	if (in_nmi()) {
		if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
			return -EBUSY;
	} else {
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	}

	*pflags = flags;
	return 0;
}


> lockdep deduces that may be a dead-lock. I have already tried to use the same
> map_locked for keys with the same bucket, the dead-lock is gone, but still got
> lockdep warning.
> >
> > Hao
> > .
>


--
Best regards, Tonghao

2022-11-30 03:30:21

by Waiman Long

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On 11/29/22 21:47, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
>> Hi Hao,
>>
>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>> work. I think the proper way is to make htab_lock_bucket() as a
>>>> raw_spin_trylock_irqsave().
>>>>
>>>> Regards,
>>>> Boqun
>>>>
>>> The potential deadlock happens when the lock is contended from the
>>> same cpu. When the lock is contended from a remote cpu, we would like
>>> the remote cpu to spin and wait, instead of giving up immediately. As
>>> this gives better throughput. So replacing the current
>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>
>>> I suspect the source of the problem is the 'hash' that we used in
>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>> whether we should use a hash derived from 'bucket' rather than from
>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>> different keys may fall into the same bucket, but yield different
>>> hashes. If the same bucket can never have two different 'hashes' here,
>>> the map_locked check should behave as intended. Also because
>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>> both pass.
>> The warning from lockdep is due to the reason the bucket lock A is used in a
>> no-NMI context firstly, then the same bucke lock is used a NMI context, so
> Yes, I tested lockdep too, we can't use the lock in NMI(but only
> try_lock work fine) context if we use them no-NMI context. otherwise
> the lockdep prints the warning.
> * for the dead-lock case: we can use the
> 1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1)
> 2. or hash bucket address.
>
> * for lockdep warning, we should use in_nmi check with map_locked.
>
> BTW, the patch doesn't work, so we can remove the lock_key
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>
> static inline int htab_lock_bucket(const struct bpf_htab *htab,
> struct bucket *b, u32 hash,
> unsigned long *pflags)
> {
> unsigned long flags;
>
> hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>
> preempt_disable();
> if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> __this_cpu_dec(*(htab->map_locked[hash]));
> preempt_enable();
> return -EBUSY;
> }
>
> if (in_nmi()) {
> if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
> return -EBUSY;
That is not right. You have to do the same steps as above: decrement the
percpu count and enable preemption. So you may want to put all these
busy_out steps after the "return 0" and use "goto busy_out;" to jump there.
> } else {
> raw_spin_lock_irqsave(&b->raw_lock, flags);
> }
>
> *pflags = flags;
> return 0;
> }
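
Concretely, a hypothetical corrected version of the function with a busy_out
label (sketch only; still keeping the percpu map_locked count) might look
like:

	static inline int htab_lock_bucket(const struct bpf_htab *htab,
					   struct bucket *b, u32 hash,
					   unsigned long *pflags)
	{
		unsigned long flags;

		hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);

		preempt_disable();
		if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1))
			goto busy_out;

		if (in_nmi()) {
			if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
				goto busy_out;
		} else {
			raw_spin_lock_irqsave(&b->raw_lock, flags);
		}

		*pflags = flags;
		return 0;

	busy_out:
		/* undo the count and re-enable preemption on every failure */
		__this_cpu_dec(*(htab->map_locked[hash]));
		preempt_enable();
		return -EBUSY;
	}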

BTW, with that change, I believe you can actually remove all the percpu
map_locked count code.

Cheers,
Longman

2022-11-30 03:59:36

by Tonghao Zhang

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Wed, Nov 30, 2022 at 11:07 AM Waiman Long <[email protected]> wrote:
>
> On 11/29/22 21:47, Tonghao Zhang wrote:
> > On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
> >> Hi Hao,
> >>
> >> On 11/30/2022 3:36 AM, Hao Luo wrote:
> >>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
> >>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
> >>>> lock pattern. Also after a second thought, the below suggestion doesn't
> >>>> work. I think the proper way is to make htab_lock_bucket() as a
> >>>> raw_spin_trylock_irqsave().
> >>>>
> >>>> Regards,
> >>>> Boqun
> >>>>
> >>> The potential deadlock happens when the lock is contended from the
> >>> same cpu. When the lock is contended from a remote cpu, we would like
> >>> the remote cpu to spin and wait, instead of giving up immediately. As
> >>> this gives better throughput. So replacing the current
> >>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
> >>>
> >>> I suspect the source of the problem is the 'hash' that we used in
> >>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
> >>> whether we should use a hash derived from 'bucket' rather than from
> >>> 'key'. For example, from the memory address of the 'bucket'. Because,
> >>> different keys may fall into the same bucket, but yield different
> >>> hashes. If the same bucket can never have two different 'hashes' here,
> >>> the map_locked check should behave as intended. Also because
> >>> ->map_locked is per-cpu, execution flows from two different cpus can
> >>> both pass.
> >> The warning from lockdep is due to the reason the bucket lock A is used in a
> >> no-NMI context firstly, then the same bucke lock is used a NMI context, so
> > Yes, I tested lockdep too, we can't use the lock in NMI(but only
> > try_lock work fine) context if we use them no-NMI context. otherwise
> > the lockdep prints the warning.
> > * for the dead-lock case: we can use the
> > 1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1)
> > 2. or hash bucket address.
> >
> > * for lockdep warning, we should use in_nmi check with map_locked.
> >
> > BTW, the patch doesn't work, so we can remove the lock_key
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
> >
> > static inline int htab_lock_bucket(const struct bpf_htab *htab,
> > struct bucket *b, u32 hash,
> > unsigned long *pflags)
> > {
> > unsigned long flags;
> >
> > hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
> >
> > preempt_disable();
> > if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> > __this_cpu_dec(*(htab->map_locked[hash]));
> > preempt_enable();
> > return -EBUSY;
> > }
> >
> > if (in_nmi()) {
> > if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
> > return -EBUSY;
> That is not right. You have to do the same step as above by decrementing
> the percpu count and enable preemption. So you may want to put all these
> busy_out steps after the return 0 and use "goto busy_out;" to jump there.
Yes, thanks Waiman, I should add the busy_out label.
> > } else {
> > raw_spin_lock_irqsave(&b->raw_lock, flags);
> > }
> >
> > *pflags = flags;
> > return 0;
> > }
>
> BTW, with that change, I believe you can actually remove all the percpu
> map_locked count code.
There are some cases, for example, when we run bpf_prog A and B in task
context on the same cpu:

bpf_prog A
  update map X
    htab_lock_bucket
      raw_spin_lock_irqsave()
    lookup_elem_raw()
      // bpf prog B is attached on lookup_elem_raw()
      bpf_prog B
        update map X again and update the element
          htab_lock_bucket()
            // dead-lock
            raw_spin_lock_irqsave()
> Cheers,
> Longman
>


--
Best regards, Tonghao

2022-11-30 04:23:03

by Waiman Long

Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On 11/29/22 22:32, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 11:07 AM Waiman Long <[email protected]> wrote:
>> On 11/29/22 21:47, Tonghao Zhang wrote:
>>> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
>>>> Hi Hao,
>>>>
>>>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>>>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>>>> work. I think the proper way is to make htab_lock_bucket() as a
>>>>>> raw_spin_trylock_irqsave().
>>>>>>
>>>>>> Regards,
>>>>>> Boqun
>>>>>>
>>>>> The potential deadlock happens when the lock is contended from the
>>>>> same cpu. When the lock is contended from a remote cpu, we would like
>>>>> the remote cpu to spin and wait, instead of giving up immediately. As
>>>>> this gives better throughput. So replacing the current
>>>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>>>
>>>>> I suspect the source of the problem is the 'hash' that we used in
>>>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>>>> whether we should use a hash derived from 'bucket' rather than from
>>>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>>>> different keys may fall into the same bucket, but yield different
>>>>> hashes. If the same bucket can never have two different 'hashes' here,
>>>>> the map_locked check should behave as intended. Also because
>>>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>>>> both pass.
>>>> The warning from lockdep is due to the reason the bucket lock A is used in a
>>>> no-NMI context firstly, then the same bucke lock is used a NMI context, so
>>> Yes, I tested lockdep too, we can't use the lock in NMI(but only
>>> try_lock work fine) context if we use them no-NMI context. otherwise
>>> the lockdep prints the warning.
>>> * for the dead-lock case: we can use the
>>> 1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1)
>>> 2. or hash bucket address.
>>>
>>> * for lockdep warning, we should use in_nmi check with map_locked.
>>>
>>> BTW, the patch doesn't work, so we can remove the lock_key
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>>>
>>> static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>> struct bucket *b, u32 hash,
>>> unsigned long *pflags)
>>> {
>>> unsigned long flags;
>>>
>>> hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>>>
>>> preempt_disable();
>>> if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>> __this_cpu_dec(*(htab->map_locked[hash]));
>>> preempt_enable();
>>> return -EBUSY;
>>> }
>>>
>>> if (in_nmi()) {
>>> if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>>> return -EBUSY;
>> That is not right. You have to do the same step as above by decrementing
>> the percpu count and enable preemption. So you may want to put all these
>> busy_out steps after the return 0 and use "goto busy_out;" to jump there.
> Yes, thanks Waiman, I should add the busy_out label.
>>> } else {
>>> raw_spin_lock_irqsave(&b->raw_lock, flags);
>>> }
>>>
>>> *pflags = flags;
>>> return 0;
>>> }
>> BTW, with that change, I believe you can actually remove all the percpu
>> map_locked count code.
> There are some cases, for example, when we run bpf_prog A and bpf_prog B in
> task context on the same cpu:
>
> bpf_prog A
>   update map X
>     htab_lock_bucket()
>       raw_spin_lock_irqsave()
>     lookup_elem_raw()
>       // bpf_prog B is attached on lookup_elem_raw()
>       bpf_prog B
>         update map X again and update the element
>           htab_lock_bucket()
>             // dead-lock
>             raw_spin_lock_irqsave()

I see, so nested locking is possible in this case. Besides using the percpu
map_locked count, another way is to associate a cpumask with each bucket lock
and use each bit in the cpumask to control access with test_and_set_bit() for
each cpu. That will allow more concurrency, and you can actually find out how
contended the lock is. Anyway, it is just a thought.
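A rough sketch of that idea, assuming a hypothetical lock_owners bitmap (e.g.
DECLARE_BITMAP(lock_owners, NR_CPUS)) added to struct bucket; illustrative
only, not a tested or proposed patch:

static inline int htab_lock_bucket_cpumask(struct bucket *b,
                                           unsigned long *pflags)
{
        /* pin the task to this cpu while its bit is held */
        preempt_disable();

        /* an already-set bit means this cpu is re-entering the same
         * bucket lock (e.g. from a nested bpf program), which would
         * deadlock, so back off instead of spinning
         */
        if (test_and_set_bit(smp_processor_id(), b->lock_owners)) {
                preempt_enable();
                return -EBUSY;
        }

        raw_spin_lock_irqsave(&b->raw_lock, *pflags);
        return 0;
}

static inline void htab_unlock_bucket_cpumask(struct bucket *b,
                                              unsigned long flags)
{
        raw_spin_unlock_irqrestore(&b->raw_lock, flags);
        clear_bit(smp_processor_id(), b->lock_owners);
        preempt_enable();
}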

Cheers,
Longman


2022-11-30 04:35:13

by Hou Tao

[permalink] [raw]
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi,

On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
>> Hi Hao,
>>
>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>> work. I think the proper way is to make htab_lock_bucket() use
>>>> raw_spin_trylock_irqsave().
>>>>
>>>> Regards,
>>>> Boqun
>>>>
>>> The potential deadlock happens when the lock is contended from the
>>> same cpu. When the lock is contended from a remote cpu, we would like
>>> the remote cpu to spin and wait instead of giving up immediately, as
>>> this gives better throughput. So replacing the current
>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>
>>> I suspect the source of the problem is the 'hash' that we used in
>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>> whether we should use a hash derived from 'bucket' rather than from
>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>> different keys may fall into the same bucket, but yield different
>>> hashes. If the same bucket can never have two different 'hashes' here,
>>> the map_locked check should behave as intended. Also because
>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>> both pass.
>> The warning from lockdep is due to the fact that bucket lock A is first used in a
>> non-NMI context, and then the same bucket lock is used in an NMI context, so
> Yes, I tested with lockdep too: we can't take the lock in NMI context (only
> trylock works fine there) if we also take it in non-NMI context; otherwise
> lockdep prints the warning.
> * For the deadlock case, we can use either:
>   1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
>   2. the hash bucket address.
Using the computed hash will be better than using the hash bucket address, because
the hash buckets are allocated sequentially.
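For illustration, a sketch of the two derivations being compared (hypothetical
helper names; HASHTAB_MAP_LOCK_MASK, n_buckets and struct bucket are the
existing hashtab definitions; min_t() is used here only to avoid the type
mismatch in the snippet's min()):

/* option 1: index map_locked from the already-computed hash */
static inline u32 lock_idx_from_hash(const struct bpf_htab *htab, u32 hash)
{
        return hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
}

/* option 2: index map_locked from the bucket's memory address; the
 * buckets are allocated sequentially, so the masked low bits are
 * clustered and less evenly distributed than the hash
 */
static inline u32 lock_idx_from_bucket(const struct bucket *b)
{
        return ((uintptr_t)b / sizeof(*b)) & HASHTAB_MAP_LOCK_MASK;
}

Either way, two keys that land in the same bucket always map to the same
map_locked slot, which is what makes the per-cpu reentrancy check reliable.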
>
> * For the lockdep warning, we should use an in_nmi() check together with map_locked.
>
> BTW, the patch doesn't work, so we can remove the lockdep_key:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>
> static inline int htab_lock_bucket(const struct bpf_htab *htab,
>                                    struct bucket *b, u32 hash,
>                                    unsigned long *pflags)
> {
>         unsigned long flags;
>
>         hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>
>         preempt_disable();
>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>                 __this_cpu_dec(*(htab->map_locked[hash]));
>                 preempt_enable();
>                 return -EBUSY;
>         }
>
>         if (in_nmi()) {
>                 if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>                         return -EBUSY;
The only purpose of trylock here is to make lockdep happy, and it may lead to
unnecessary -EBUSY errors for htab operations in NMI context. I still prefer
adding a virtual lock-class for map_locked to fix the lockdep warning. So could
you use separate patches to fix the potential deadlock and the lockdep warning?
It would also be better if you could add a bpf selftest for the deadlock
problem, as said before.

Thanks,
Tao
>         } else {
>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>         }
>
>         *pflags = flags;
>         return 0;
> }
>
>
>> lockdep deduces that it may be a deadlock. I have already tried to use the same
>> map_locked for keys within the same bucket; the deadlock is gone, but I still got
>> the lockdep warning.
>>> Hao
>>> .
>

2022-11-30 05:11:38

by Hao Luo

[permalink] [raw]
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Tue, Nov 29, 2022 at 8:13 PM Hou Tao <[email protected]> wrote:
>
> On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
<...>
> > if (in_nmi()) {
> >         if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
> >                 return -EBUSY;
>
> The only purpose of trylock here is to make lockdep happy, and it may lead to
> unnecessary -EBUSY errors for htab operations in NMI context. I still prefer
> adding a virtual lock-class for map_locked to fix the lockdep warning. So could
> you use separate patches to fix the potential deadlock and the lockdep warning?
> It would also be better if you could add a bpf selftest for the deadlock
> problem, as said before.
>

Agree with Tao here. Tonghao, could you send another version which:

- separates the fix for the deadlock from the fix for the lockdep warning
- includes a bpf selftest to verify the deadlock fix
- carries the bpf-specific tag: [PATCH bpf-next]

There are multiple ideas in flight in this thread, so it's easy to lose
track of what has been proposed and what change you intend to make.

Thanks,
Hao

2022-11-30 06:21:26

by Tonghao Zhang

[permalink] [raw]
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Wed, Nov 30, 2022 at 1:02 PM Hao Luo <[email protected]> wrote:
>
> On Tue, Nov 29, 2022 at 8:13 PM Hou Tao <[email protected]> wrote:
> >
> > On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
> <...>
> > > if (in_nmi()) {
> > >         if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
> > >                 return -EBUSY;
> >
> > The only purpose of trylock here is to make lockdep happy, and it may lead to
> > unnecessary -EBUSY errors for htab operations in NMI context. I still prefer
> > adding a virtual lock-class for map_locked to fix the lockdep warning. So could
> > you use separate patches to fix the potential deadlock and the lockdep warning?
> > It would also be better if you could add a bpf selftest for the deadlock
> > problem, as said before.
> >
>
> Agree with Tao here. Tonghao, could you send another version which:
>
> - separates the fix for the deadlock from the fix for the lockdep warning
> - includes a bpf selftest to verify the deadlock fix
> - carries the bpf-specific tag: [PATCH bpf-next]
>
> There are multiple ideas in flight in this thread, so it's easy to lose
> track of what has been proposed and what change you intend to make.
Hi, I will send v2 soon. Thanks.
> Thanks,
> Hao



--
Best regards, Tonghao

2022-11-30 06:33:38

by Tonghao Zhang

[permalink] [raw]
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On Wed, Nov 30, 2022 at 12:13 PM Hou Tao <[email protected]> wrote:
>
> Hi,
>
> On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
> > On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
> >> Hi Hao,
> >>
> >> On 11/30/2022 3:36 AM, Hao Luo wrote:
> >>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
> >>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
> >>>> lock pattern. Also after a second thought, the below suggestion doesn't
> >>>> work. I think the proper way is to make htab_lock_bucket() use
> >>>> raw_spin_trylock_irqsave().
> >>>>
> >>>> Regards,
> >>>> Boqun
> >>>>
> >>> The potential deadlock happens when the lock is contended from the
> >>> same cpu. When the lock is contended from a remote cpu, we would like
> >>> the remote cpu to spin and wait instead of giving up immediately, as
> >>> this gives better throughput. So replacing the current
> >>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
> >>>
> >>> I suspect the source of the problem is the 'hash' that we used in
> >>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
> >>> whether we should use a hash derived from 'bucket' rather than from
> >>> 'key'. For example, from the memory address of the 'bucket'. Because,
> >>> different keys may fall into the same bucket, but yield different
> >>> hashes. If the same bucket can never have two different 'hashes' here,
> >>> the map_locked check should behave as intended. Also because
> >>> ->map_locked is per-cpu, execution flows from two different cpus can
> >>> both pass.
> >> The warning from lockdep is due to the fact that bucket lock A is first used in a
> >> non-NMI context, and then the same bucket lock is used in an NMI context, so
> > Yes, I tested with lockdep too: we can't take the lock in NMI context (only
> > trylock works fine there) if we also take it in non-NMI context; otherwise
> > lockdep prints the warning.
> > * For the deadlock case, we can use either:
> >   1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
> >   2. the hash bucket address.
> Using the computed hash will be better than using the hash bucket address, because
> the hash buckets are allocated sequentially.
> >
> > * For the lockdep warning, we should use an in_nmi() check together with map_locked.
> >
> > BTW, the patch doesn't work, so we can remove the lockdep_key:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
> >
> > static inline int htab_lock_bucket(const struct bpf_htab *htab,
> >                                    struct bucket *b, u32 hash,
> >                                    unsigned long *pflags)
> > {
> >         unsigned long flags;
> >
> >         hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
> >
> >         preempt_disable();
> >         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> >                 __this_cpu_dec(*(htab->map_locked[hash]));
> >                 preempt_enable();
> >                 return -EBUSY;
> >         }
> >
> >         if (in_nmi()) {
> >                 if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
> >                         return -EBUSY;
> The only purpose of trylock here is to make lockdep happy, and it may lead to
> unnecessary -EBUSY errors for htab operations in NMI context. I still prefer
> adding a virtual lock-class for map_locked to fix the lockdep warning. So could you use
Hi, what is a virtual lock-class? Can you give me an example of what you mean?
> separate patches to fix the potential deadlock and the lockdep warning? It
> would also be better if you could add a bpf selftest for the deadlock problem, as said before.
>
> Thanks,
> Tao
> >         } else {
> >                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> >         }
> >
> >         *pflags = flags;
> >         return 0;
> > }
> >
> >
> >> lockdep deduces that it may be a deadlock. I have already tried to use the same
> >> map_locked for keys within the same bucket; the deadlock is gone, but I still got
> >> the lockdep warning.
> >>> Hao
> >>> .
> >
>


--
Best regards, Tonghao

2022-12-01 03:47:13

by Hou Tao

[permalink] [raw]
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi,

On 11/30/2022 1:55 PM, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 12:13 PM Hou Tao <[email protected]> wrote:
>> Hi,
>>
>> On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
>>> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <[email protected]> wrote:
>>>> Hi Hao,
>>>>
>>>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <[email protected]> wrote:
>>>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>>>> work. I think the proper way is to make htab_lock_bucket() use
>>>>>> raw_spin_trylock_irqsave().
>>>>>>
>>>>>> Regards,
>>>>>> Boqun
>>>>>>
>>>>> The potential deadlock happens when the lock is contended from the
>>>>> same cpu. When the lock is contended from a remote cpu, we would like
>>>>> the remote cpu to spin and wait instead of giving up immediately, as
>>>>> this gives better throughput. So replacing the current
>>>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>>>
>>>>> I suspect the source of the problem is the 'hash' that we used in
>>>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>>>> whether we should use a hash derived from 'bucket' rather than from
>>>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>>>> different keys may fall into the same bucket, but yield different
>>>>> hashes. If the same bucket can never have two different 'hashes' here,
>>>>> the map_locked check should behave as intended. Also because
>>>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>>>> both pass.
>>>> The warning from lockdep is due to the fact that bucket lock A is first used in a
>>>> non-NMI context, and then the same bucket lock is used in an NMI context, so
>>> Yes, I tested with lockdep too: we can't take the lock in NMI context (only
>>> trylock works fine there) if we also take it in non-NMI context; otherwise
>>> lockdep prints the warning.
>>> * For the deadlock case, we can use either:
>>>   1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
>>>   2. the hash bucket address.
>> Using the computed hash will be better than using the hash bucket address, because
>> the hash buckets are allocated sequentially.
>>> * For the lockdep warning, we should use an in_nmi() check together with map_locked.
>>>
>>> BTW, the patch doesn't work, so we can remove the lockdep_key:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>>>
>>> static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>                                    struct bucket *b, u32 hash,
>>>                                    unsigned long *pflags)
>>> {
>>>         unsigned long flags;
>>>
>>>         hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>>>
>>>         preempt_disable();
>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
>>>                 preempt_enable();
>>>                 return -EBUSY;
>>>         }
>>>
>>>         if (in_nmi()) {
>>>                 if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>>>                         return -EBUSY;
>> The only purpose of trylock here is to make lockdep happy, and it may lead to
>> unnecessary -EBUSY errors for htab operations in NMI context. I still prefer
>> adding a virtual lock-class for map_locked to fix the lockdep warning. So could you use
> Hi, what is a virtual lock-class? Can you give me an example of what you mean?
If LOCKDEP is enabled, raw_spinlock gains a dep_map field in its definition and
also calls lock_acquire() and lock_release() to assist the deadlock check. Now
map_locked is not a lock, but it acts like a raw_spin_trylock(), so we need to
add a dep_map to it manually and then call lock_acquire(trylock=1) and
lock_release() when increasing and decreasing map_locked. You can reference the
implementations of raw_spin_trylock() and raw_spin_unlock() for more details.
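A rough sketch of that manual annotation, assuming a hypothetical
map_locked_dep_map field added to bpf_htab (illustrative only, not a posted
patch):

/* at map creation time, register a virtual lock class for map_locked */
static struct lock_class_key htab_map_locked_key;

lockdep_init_map(&htab->map_locked_dep_map, "htab->map_locked",
                 &htab_map_locked_key, 0);

/* in htab_lock_bucket(), once the percpu counter has been taken:
 * tell lockdep we "acquired" the virtual lock as a trylock
 * (trylock=1), mirroring what raw_spin_trylock() does
 */
lock_acquire(&htab->map_locked_dep_map, 0, 1, 0, 1, NULL, _RET_IP_);

/* in htab_unlock_bucket(), before dropping the percpu counter */
lock_release(&htab->map_locked_dep_map, _RET_IP_);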
>> separate patches to fix the potential deadlock and the lockdep warning? It
>> would also be better if you could add a bpf selftest for the deadlock problem, as said before.
>>
>> Thanks,
>> Tao
>>>         } else {
>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>         }
>>>
>>>         *pflags = flags;
>>>         return 0;
>>> }
>>>
>>>
>>>> lockdep deduces that it may be a deadlock. I have already tried to use the same
>>>> map_locked for keys within the same bucket; the deadlock is gone, but I still got
>>>> the lockdep warning.
>>>>> Hao
>>>>> .
>