LinuxLists.cc - [PATCH V2 bpf-next] bpf: make __htab_lookup_and_delete

2020-08-01 18:12:48

Subject: [PATCH V2 bpf-next] bpf: make __htab_lookup_and_delete_batch faster when map is almost empty

While running some experiments it was observed that map_lookup_batch was 2x
slower than get_next_key + lookup when the syscall overhead is minimal.
This was because the map_lookup_batch implementation was more expensive
traversing empty buckets, this can be really costly when the pre-allocated
map is too big.

This patch optimizes the case when the bucket is empty so we can move
quickly to next bucket.

The Benchmark was generated using the google/benchmark library[1]. When
the benckmark is executed the number of iterations is governed by the
amount of time the benckmarks takes, the number of iterations is at
least 1 and not more than 1e9, until CPU time(of the entire binary, not
just the part to measure), is greater than 0.5s. Time and CPU reported
are the average of a single iteration over the iteration runs.

The experiments to exercise the empty buckets are as follows:

-The map was populated with a single entry to make sure that the syscall
overhead is not helping the map_batch_lookup.
-The size of the preallocated map was increased to show the effect of
traversing empty buckets.

To interpret the results, Benchmark is the name of the experiment where
the first number correspond to the number of elements in the map, and
the next one correspond to the size of the pre-allocated map. Time and
CPU are average and correspond to the time elapsed per iteration and the
system time consumtion per iteration.

Results:

Using get_next_key + lookup:

Benchmark Time(ns) CPU(ns) Iteration
---------------------------------------------------------------
BM_DumpHashMap/1/1k 3593 3586 192680
BM_DumpHashMap/1/4k 6004 5972 100000
BM_DumpHashMap/1/16k 15755 15710 44341
BM_DumpHashMap/1/64k 59525 59376 10000

Using htab_lookup_batch before this patch:
Benchmark Time(ns) CPU(ns) Iterations
---------------------------------------------------------------
BM_DumpHashMap/1/1k 3933 3927 177978
BM_DumpHashMap/1/4k 9192 9177 73951
BM_DumpHashMap/1/16k 42011 41970 16789
BM_DumpHashMap/1/64k 117895 117661 6135

Using htab_lookup_batch with this patch:
Benchmark Time(ns) CPU(ns) Iterations
---------------------------------------------------------------
BM_DumpHashMap/1/1k 2809 2803 249212
BM_DumpHashMap/1/4k 5318 5316 100000
BM_DumpHashMap/1/16k 14925 14895 47448
BM_DumpHashMap/1/64k 58870 58674 10000

[1] https://github.com/google/benchmark.git

Changelog:

v1 -> v2:
- Add more information about how to interpret the results

Suggested-by: Luigi Rizzo <[email protected]>
Cc: Yonghong Song <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
kernel/bpf/hashtab.c | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 024276787055..b6d28bd6345b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1349,7 +1349,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
struct hlist_nulls_head *head;
struct hlist_nulls_node *n;
unsigned long flags = 0;
- bool locked = false;
struct htab_elem *l;
struct bucket *b;
int ret = 0;
@@ -1408,19 +1407,19 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
dst_val = values;
b = &htab->buckets[batch];
head = &b->head;
- /* do not grab the lock unless need it (bucket_cnt > 0). */
- if (locked)
- flags = htab_lock_bucket(htab, b);

+ l = hlist_nulls_entry_safe(rcu_dereference_raw(hlist_nulls_first_rcu(head)),
+ struct htab_elem, hash_node);
+ if (!l && (batch + 1 < htab->n_buckets)) {
+ batch++;
+ goto again_nocopy;
+ }
+
+ flags = htab_lock_bucket(htab, b);
bucket_cnt = 0;
hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
bucket_cnt++;

- if (bucket_cnt && !locked) {
- locked = true;
- goto again_nocopy;
- }
-
if (bucket_cnt > (max_count - total)) {
if (total == 0)
ret = -ENOSPC;
@@ -1446,10 +1445,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
goto alloc;
}

- /* Next block is only safe to run if you have grabbed the lock */
- if (!locked)
- goto next_batch;
-
hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
memcpy(dst_key, l->key, key_size);

@@ -1492,7 +1487,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
}

htab_unlock_bucket(htab, b, flags);
- locked = false;

while (node_to_free) {
l = node_to_free;
@@ -1500,7 +1494,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
bpf_lru_push_free(&htab->lru, &l->lru_node);
}

-next_batch:
/* If we are not copying data, we can go to next bucket and avoid
* unlocking the rcu.
*/
--
2.28.0.163.g6104cc2f0b6-goog

2020-08-02 05:26:40

by Yonghong Song

[permalink] [raw]

Subject: Re: [PATCH V2 bpf-next] bpf: make __htab_lookup_and_delete_batch faster when map is almost empty

On 8/1/20 11:09 AM, Brian Vazquez wrote:
> While running some experiments it was observed that map_lookup_batch was 2x
> slower than get_next_key + lookup when the syscall overhead is minimal.
> This was because the map_lookup_batch implementation was more expensive
> traversing empty buckets, this can be really costly when the pre-allocated
> map is too big.
>
> This patch optimizes the case when the bucket is empty so we can move
> quickly to next bucket.
>
> The Benchmark was generated using the google/benchmark library[1]. When
> the benckmark is executed the number of iterations is governed by the
> amount of time the benckmarks takes, the number of iterations is at
> least 1 and not more than 1e9, until CPU time(of the entire binary, not
> just the part to measure), is greater than 0.5s. Time and CPU reported
> are the average of a single iteration over the iteration runs.
>
> The experiments to exercise the empty buckets are as follows:
>
> -The map was populated with a single entry to make sure that the syscall
> overhead is not helping the map_batch_lookup.
> -The size of the preallocated map was increased to show the effect of
> traversing empty buckets.
>
> To interpret the results, Benchmark is the name of the experiment where
> the first number correspond to the number of elements in the map, and
> the next one correspond to the size of the pre-allocated map. Time and
> CPU are average and correspond to the time elapsed per iteration and the
> system time consumtion per iteration.

thanks for explanation!

>
> Results:
>
> Using get_next_key + lookup:
>
> Benchmark Time(ns) CPU(ns) Iteration
> ---------------------------------------------------------------
> BM_DumpHashMap/1/1k 3593 3586 192680
> BM_DumpHashMap/1/4k 6004 5972 100000
> BM_DumpHashMap/1/16k 15755 15710 44341
> BM_DumpHashMap/1/64k 59525 59376 10000
>
> Using htab_lookup_batch before this patch:
> Benchmark Time(ns) CPU(ns) Iterations
> ---------------------------------------------------------------
> BM_DumpHashMap/1/1k 3933 3927 177978
> BM_DumpHashMap/1/4k 9192 9177 73951
> BM_DumpHashMap/1/16k 42011 41970 16789
> BM_DumpHashMap/1/64k 117895 117661 6135
>
> Using htab_lookup_batch with this patch:
> Benchmark Time(ns) CPU(ns) Iterations
> ---------------------------------------------------------------
> BM_DumpHashMap/1/1k 2809 2803 249212
> BM_DumpHashMap/1/4k 5318 5316 100000
> BM_DumpHashMap/1/16k 14925 14895 47448
> BM_DumpHashMap/1/64k 58870 58674 10000
>
> [1] https://github.com/google/benchmark.git
>
> Changelog:
>
> v1 -> v2:
> - Add more information about how to interpret the results
>
> Suggested-by: Luigi Rizzo <[email protected]>
> Cc: Yonghong Song <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
> ---
> kernel/bpf/hashtab.c | 23 ++++++++---------------
> 1 file changed, 8 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 024276787055..b6d28bd6345b 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1349,7 +1349,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> struct hlist_nulls_head *head;
> struct hlist_nulls_node *n;
> unsigned long flags = 0;
> - bool locked = false;
> struct htab_elem *l;
> struct bucket *b;
> int ret = 0;
> @@ -1408,19 +1407,19 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> dst_val = values;
> b = &htab->buckets[batch];
> head = &b->head;
> - /* do not grab the lock unless need it (bucket_cnt > 0). */
> - if (locked)
> - flags = htab_lock_bucket(htab, b);
>
> + l = hlist_nulls_entry_safe(rcu_dereference_raw(hlist_nulls_first_rcu(head)),
> + struct htab_elem, hash_node);
> + if (!l && (batch + 1 < htab->n_buckets)) {
> + batch++;
> + goto again_nocopy;
> + }

In this case, if batch + 1 == htab->n_buckets, we still go through
htab_lock_bucket/htab_unlock_bucket which is really not needed.

So since we are trying to optimize for performance, let us handle
the above case as well. We can do
if (!l) {
if (batch + 1 < htab->n_buckets) {
batch++;
goto again_nocopy;
}
bucket_cnt = 0;
goto done_bucket;
}

...
done_bucket:
rcu_read_unlock();
bpf_enable_instrumentation();
...

what do you think?

> +
> + flags = htab_lock_bucket(htab, b);
> bucket_cnt = 0;
> hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
> bucket_cnt++;
>
> - if (bucket_cnt && !locked) {
> - locked = true;
> - goto again_nocopy;
> - }
> -
> if (bucket_cnt > (max_count - total)) {
> if (total == 0)
> ret = -ENOSPC;
> @@ -1446,10 +1445,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> goto alloc;
> }
>
> - /* Next block is only safe to run if you have grabbed the lock */
> - if (!locked)
> - goto next_batch;
> -
> hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> memcpy(dst_key, l->key, key_size);
>
> @@ -1492,7 +1487,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> }
>
> htab_unlock_bucket(htab, b, flags);
> - locked = false;
>
> while (node_to_free) {
> l = node_to_free;
> @@ -1500,7 +1494,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> bpf_lru_push_free(&htab->lru, &l->lru_node);
> }
>
> -next_batch:
> /* If we are not copying data, we can go to next bucket and avoid
> * unlocking the rcu.
> */
>