2020-02-14 22:43:45

by Brian Vazquez

Subject: [PATCH bpf] bpf: Do not grab the bucket spinlock by default on htab batch ops

Grabbing the spinlock for every bucket, even if it's empty, was causing
a significant performance cost when traversing htab maps that have only
a few entries. This patch addresses the issue by checking bucket_cnt
first: only if the bucket has entries do we grab the spinlock and
proceed with the batching.
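
For illustration only, here is a self-contained userspace analogue of
the pattern (struct node/bucket and dump_bucket below are made-up names
for the sketch; the actual change to __htab_map_lookup_and_delete_batch
is in the diff below, and the kernel's unlocked walk is additionally
protected by RCU, which the sketch omits):

  #include <pthread.h>
  #include <stdbool.h>

  struct node { struct node *next; };

  struct bucket {
          struct node *head;
          pthread_spinlock_t lock;
  };

  static int dump_bucket(struct bucket *b)
  {
          bool locked = false;
          struct node *p;
          int cnt;

  again:
          if (locked)
                  pthread_spin_lock(&b->lock);

          /* Cheap count first; done unlocked on the first pass. */
          cnt = 0;
          for (p = b->head; p; p = p->next)
                  cnt++;

          /* Non-empty bucket: redo the count with the lock held so
           * the copy-out below sees a stable bucket. */
          if (cnt && !locked) {
                  locked = true;
                  goto again;
          }

          /* ... copy out cnt entries here ... */

          if (locked)
                  pthread_spin_unlock(&b->lock);
          return cnt;
  }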

Tested with an htab of size 50K and different numbers of populated entries.

Before:
  Benchmark             Time(ns)        CPU(ns)
  ---------------------------------------------
  BM_DumpHashMap/1       2759655        2752033
  BM_DumpHashMap/10      2933722        2930825
  BM_DumpHashMap/200     3171680        3170265
  BM_DumpHashMap/500     3639607        3635511
  BM_DumpHashMap/1000    4369008        4364981
  BM_DumpHashMap/5k     11171919       11134028
  BM_DumpHashMap/20k    69150080       69033496
  BM_DumpHashMap/39k   190501036      190226162

After:
  Benchmark             Time(ns)        CPU(ns)
  ---------------------------------------------
  BM_DumpHashMap/1        202707         200109
  BM_DumpHashMap/10       213441         210569
  BM_DumpHashMap/200      478641         472350
  BM_DumpHashMap/500      980061         967102
  BM_DumpHashMap/1000    1863835        1839575
  BM_DumpHashMap/5k      8961836        8902540
  BM_DumpHashMap/20k    69761497       69322756
  BM_DumpHashMap/39k   187437830      186551111

Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
Cc: Yonghong Song <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
kernel/bpf/hashtab.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 2d182c4ee9d99..fdbde28b0fe06 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1260,6 +1260,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
struct hlist_nulls_head *head;
struct hlist_nulls_node *n;
unsigned long flags;
+ bool locked = false;
struct htab_elem *l;
struct bucket *b;
int ret = 0;
@@ -1319,15 +1320,25 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
dst_val = values;
b = &htab->buckets[batch];
head = &b->head;
- raw_spin_lock_irqsave(&b->lock, flags);
+ /* Do not grab the lock unless we need it (bucket_cnt > 0). */
+ if (locked)
+ raw_spin_lock_irqsave(&b->lock, flags);

bucket_cnt = 0;
hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
bucket_cnt++;

+ if (bucket_cnt && !locked) {
+ locked = true;
+ goto again_nocopy;
+ }
+
if (bucket_cnt > (max_count - total)) {
if (total == 0)
ret = -ENOSPC;
+ /* Note that since bucket_cnt > 0 here, it is implicit
+ * that the lock was grabbed, so release it.
+ */
raw_spin_unlock_irqrestore(&b->lock, flags);
rcu_read_unlock();
this_cpu_dec(bpf_prog_active);
@@ -1337,6 +1348,9 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,

if (bucket_cnt > bucket_size) {
bucket_size = bucket_cnt;
+ /* Note that since bucket_cnt > 0 here, it is implicit
+ * that the lock was grabbed, so release it.
+ */
raw_spin_unlock_irqrestore(&b->lock, flags);
rcu_read_unlock();
this_cpu_dec(bpf_prog_active);
@@ -1379,7 +1393,10 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
dst_val += value_size;
}

- raw_spin_unlock_irqrestore(&b->lock, flags);
+ if (locked) {
+ raw_spin_unlock_irqrestore(&b->lock, flags);
+ locked = false;
+ }
/* If we are not copying data, we can go to next bucket and avoid
* unlocking the rcu.
*/
--
2.25.0.265.gbab2e86ba0-goog


2020-02-18 15:44:45

by Yonghong Song

Subject: Re: [PATCH bpf] bpf: Do not grab the bucket spinlock by default on htab batch ops



On 2/14/20 2:43 PM, Brian Vazquez wrote:
> Grabbing the spinlock for every bucket, even if it's empty, was causing
> a significant performance cost when traversing htab maps that have only
> a few entries. This patch addresses the issue by checking bucket_cnt
> first: only if the bucket has entries do we grab the spinlock and
> proceed with the batching.
>
> Tested with an htab of size 50K and different numbers of populated entries.
>
> Before:
>    Benchmark             Time(ns)        CPU(ns)
>    ---------------------------------------------
>    BM_DumpHashMap/1       2759655        2752033
>    BM_DumpHashMap/10      2933722        2930825
>    BM_DumpHashMap/200     3171680        3170265
>    BM_DumpHashMap/500     3639607        3635511
>    BM_DumpHashMap/1000    4369008        4364981
>    BM_DumpHashMap/5k     11171919       11134028
>    BM_DumpHashMap/20k    69150080       69033496
>    BM_DumpHashMap/39k   190501036      190226162
>
> After:
>    Benchmark             Time(ns)        CPU(ns)
>    ---------------------------------------------
>    BM_DumpHashMap/1        202707         200109
>    BM_DumpHashMap/10       213441         210569
>    BM_DumpHashMap/200      478641         472350
>    BM_DumpHashMap/500      980061         967102
>    BM_DumpHashMap/1000    1863835        1839575
>    BM_DumpHashMap/5k      8961836        8902540
>    BM_DumpHashMap/20k    69761497       69322756
>    BM_DumpHashMap/39k   187437830      186551111
>
> Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
> Cc: Yonghong Song <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>

Acked-by: Yonghong Song <[email protected]>

2020-02-18 15:57:03

by Daniel Borkmann

Subject: Re: [PATCH bpf] bpf: Do not grab the bucket spinlock by default on htab batch ops

On 2/18/20 4:43 PM, Yonghong Song wrote:
> On 2/14/20 2:43 PM, Brian Vazquez wrote:
>> Grabbing the spinlock for every bucket, even if it's empty, was causing
>> a significant performance cost when traversing htab maps that have only
>> a few entries. This patch addresses the issue by checking bucket_cnt
>> first: only if the bucket has entries do we grab the spinlock and
>> proceed with the batching.
>>
>> Tested with an htab of size 50K and different numbers of populated entries.
>>
>> Before:
>>    Benchmark             Time(ns)        CPU(ns)
>>    ---------------------------------------------
>>    BM_DumpHashMap/1       2759655        2752033
>>    BM_DumpHashMap/10      2933722        2930825
>>    BM_DumpHashMap/200     3171680        3170265
>>    BM_DumpHashMap/500     3639607        3635511
>>    BM_DumpHashMap/1000    4369008        4364981
>>    BM_DumpHashMap/5k     11171919       11134028
>>    BM_DumpHashMap/20k    69150080       69033496
>>    BM_DumpHashMap/39k   190501036      190226162
>>
>> After:
>>    Benchmark             Time(ns)        CPU(ns)
>>    ---------------------------------------------
>>    BM_DumpHashMap/1        202707         200109
>>    BM_DumpHashMap/10       213441         210569
>>    BM_DumpHashMap/200      478641         472350
>>    BM_DumpHashMap/500      980061         967102
>>    BM_DumpHashMap/1000    1863835        1839575
>>    BM_DumpHashMap/5k      8961836        8902540
>>    BM_DumpHashMap/20k    69761497       69322756
>>    BM_DumpHashMap/39k   187437830      186551111
>>
>> Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
>> Cc: Yonghong Song <[email protected]>
>> Signed-off-by: Brian Vazquez <[email protected]>
>
> Acked-by: Yonghong Song <[email protected]>

I'm probably missing something, but how is this safe? Presume we
traverse the bucket and see bucket_cnt = 0. Meanwhile a different CPU
adds entries to this bucket, since it is not locked. The reader that
saw bucket_cnt = 0 then starts to traverse the second
hlist_nulls_for_each_entry_safe() unlocked, e.g. deleting entries?
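
IOW, the interleaving I am worried about is something like this
(hypothetical timeline, not runnable code):

    CPU 0 (batch op)                       CPU 1 (updater)
    ----------------                       ---------------
    counts bucket entries unlocked,
    sees bucket_cnt = 0, skips b->lock
                                           takes b->lock
                                           adds elements to the bucket
                                           releases b->lock
    runs hlist_nulls_for_each_entry_safe()
    over the now non-empty bucket without
    holding b->lock, e.g. deleting entries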

2020-02-18 16:36:12

by Yonghong Song

Subject: Re: [PATCH bpf] bpf: Do not grab the bucket spinlock by default on htab batch ops



On 2/18/20 7:56 AM, Daniel Borkmann wrote:
> On 2/18/20 4:43 PM, Yonghong Song wrote:
>> On 2/14/20 2:43 PM, Brian Vazquez wrote:
>>> Grabbing the spinlock for every bucket, even if it's empty, was causing
>>> a significant performance cost when traversing htab maps that have only
>>> a few entries. This patch addresses the issue by checking bucket_cnt
>>> first: only if the bucket has entries do we grab the spinlock and
>>> proceed with the batching.
>>>
>>> Tested with an htab of size 50K and different numbers of populated entries.
>>>
>>> Before:
>>>    Benchmark             Time(ns)        CPU(ns)
>>>    ---------------------------------------------
>>>    BM_DumpHashMap/1       2759655        2752033
>>>    BM_DumpHashMap/10      2933722        2930825
>>>    BM_DumpHashMap/200     3171680        3170265
>>>    BM_DumpHashMap/500     3639607        3635511
>>>    BM_DumpHashMap/1000    4369008        4364981
>>>    BM_DumpHashMap/5k     11171919       11134028
>>>    BM_DumpHashMap/20k    69150080       69033496
>>>    BM_DumpHashMap/39k   190501036      190226162
>>>
>>> After:
>>>    Benchmark             Time(ns)        CPU(ns)
>>>    ---------------------------------------------
>>>    BM_DumpHashMap/1        202707         200109
>>>    BM_DumpHashMap/10       213441         210569
>>>    BM_DumpHashMap/200      478641         472350
>>>    BM_DumpHashMap/500      980061         967102
>>>    BM_DumpHashMap/1000    1863835        1839575
>>>    BM_DumpHashMap/5k      8961836        8902540
>>>    BM_DumpHashMap/20k    69761497       69322756
>>>    BM_DumpHashMap/39k   187437830      186551111
>>>
>>> Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
>>> Cc: Yonghong Song <[email protected]>
>>> Signed-off-by: Brian Vazquez <[email protected]>
>>
>> Acked-by: Yonghong Song <[email protected]>
>
> I'm probably missing something, but how is this safe? Presume we
> traverse the bucket and see bucket_cnt = 0. Meanwhile a different CPU
> adds entries to this bucket, since it is not locked. The reader that
> saw bucket_cnt = 0 then starts to traverse the second
> hlist_nulls_for_each_entry_safe() unlocked, e.g. deleting entries?

Thanks for pointing this out. Yes, you are correct. If bucket_cnt is 0
and b->lock is not held, we should skip the
hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
...
}
loop entirely, as another CPU may modify the bucket in parallel by
adding/deleting elements.
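
Something like the below (untested, just to illustrate) on top of this
patch, so the walk and the unlock only happen with the lock held:

  /* If we never grabbed b->lock (bucket_cnt was 0 when we checked),
   * another CPU may be modifying the bucket concurrently, so do not
   * walk the list at all.
   */
  if (locked) {
          hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
          ...
          }
          raw_spin_unlock_irqrestore(&b->lock, flags);
          locked = false;
  }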

2020-02-18 17:17:58

by Brian Vazquez

Subject: Re: [PATCH bpf] bpf: Do not grab the bucket spinlock by default on htab batch ops

On Tue, Feb 18, 2020 at 8:34 AM Yonghong Song <[email protected]> wrote:
>
>
>
> On 2/18/20 7:56 AM, Daniel Borkmann wrote:
> > On 2/18/20 4:43 PM, Yonghong Song wrote:
> >> On 2/14/20 2:43 PM, Brian Vazquez wrote:
> >>> Grabbing the spinlock for every bucket, even if it's empty, was causing
> >>> a significant performance cost when traversing htab maps that have only
> >>> a few entries. This patch addresses the issue by checking bucket_cnt
> >>> first: only if the bucket has entries do we grab the spinlock and
> >>> proceed with the batching.
> >>>
> >>> Tested with an htab of size 50K and different numbers of populated entries.
> >>>
> >>> Before:
> >>>    Benchmark             Time(ns)        CPU(ns)
> >>>    ---------------------------------------------
> >>>    BM_DumpHashMap/1       2759655        2752033
> >>>    BM_DumpHashMap/10      2933722        2930825
> >>>    BM_DumpHashMap/200     3171680        3170265
> >>>    BM_DumpHashMap/500     3639607        3635511
> >>>    BM_DumpHashMap/1000    4369008        4364981
> >>>    BM_DumpHashMap/5k     11171919       11134028
> >>>    BM_DumpHashMap/20k    69150080       69033496
> >>>    BM_DumpHashMap/39k   190501036      190226162
> >>>
> >>> After:
> >>>    Benchmark             Time(ns)        CPU(ns)
> >>>    ---------------------------------------------
> >>>    BM_DumpHashMap/1        202707         200109
> >>>    BM_DumpHashMap/10       213441         210569
> >>>    BM_DumpHashMap/200      478641         472350
> >>>    BM_DumpHashMap/500      980061         967102
> >>>    BM_DumpHashMap/1000    1863835        1839575
> >>>    BM_DumpHashMap/5k      8961836        8902540
> >>>    BM_DumpHashMap/20k    69761497       69322756
> >>>    BM_DumpHashMap/39k   187437830      186551111
> >>>
> >>> Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
> >>> Cc: Yonghong Song <[email protected]>
> >>> Signed-off-by: Brian Vazquez <[email protected]>
> >>
> >> Acked-by: Yonghong Song <[email protected]>
> >
> > > I'm probably missing something, but how is this safe? Presume we
> > > traverse the bucket and see bucket_cnt = 0. Meanwhile a different CPU
> > > adds entries to this bucket, since it is not locked. The reader that
> > > saw bucket_cnt = 0 then starts to traverse the second
> > > hlist_nulls_for_each_entry_safe() unlocked, e.g. deleting entries?
>
> Thanks for pointing this out. Yes, you are correct. If bucket_cnt is 0
> and b->lock is not held, we should skip the
> hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> ...
> }
> loop entirely, as another CPU may modify the bucket in parallel by
> adding/deleting elements.

Makes sense. Let me fix it in the next version, thanks for reviewing it!