This patch introduces the use of a module parameter n_prefetch
which enables prefetching within the bpf_map_lookup_batch function
for a faster lookup. Benefits depend on the platform, relative
density of the map, and the setting of the module parameter as
described below.
For multiprocessor machines, for a particular key in a bpf map,
each cpu has a value associated with that key. This patch’s
change is as follows: when copying each of these values to
userspace in bpf_map_lookup_batch, the value for a cpu
n_prefetch ahead is prefetched.
MEASUREMENTS:
The benchmark test added in this patch series was used to
measure the effect of prefetching as well as determine the
optimal setting of n_prefetch given the different parameters:
the test was run on many different platforms (with varying
number of cpus), with a range of settings of n_prefetch, and with
saturated, dense, and sparse maps (num_entries/capacity_of_map).
The benchmark test measures the average time for a single entry
lookup (t = num_entries_looked_up/total_time) given the varied
factors as mentioned above. The overhead of the
bpf_map_lookup_batch syscall introduces some error.
Here are the experimental results:
amd machine with 256 cores (rome zen 2)
Density of map n_prefetch single entry lookup time (ns/op)
--------------------------------------------------------------------
40k / 40k 0 16176.471
1 13095.238
5 7432.432
12 5188.679
20 9482.759
10k / 40k 0 13253.012
5 7482.993
12 5164.319
20 9649.123
2.5k / 40k 0 7394.958
5 7201.309
13 4721.030
20 8118.081
For denser maps, the experiments suggest that as n_prefetch
increases, there is a significant time benefit (~66% decrease)
until a certain point after which the time benefit begins to
decrease. For sparser maps, there is a less pronounced speedup
from prefetching. Additionally, this experiment seems to suggest
the optimal n_prefetch range on this particular machine is 12-13,
but a setting of n_prefetch = 5 can still improve the single
entry lookup time.
intel-skylake (with 112 cores)
Density of map n_prefetch single entry lookup time (ns/op)
------------------------------------------------------------------
40k / 40k 0 5729.167
1 5092.593
5 3395.062
20 6875.000
10k / 40k 0 2029.520
5 2989.130
20 5820.106
2.5k / 40k 0 1598.256
5 2935.290
20 4867.257
For this particular machine, the experimental results suggest that
there is only a significant benefit in prefetching with denser maps.
Prefetching within bpf_map_lookup_batch can provide significant
benefit depending on the use case. Across the many different
platforms experiments were performed on, a setting of n_prefetch = 5,
although not the optimal setting, significantly decreased the single
entry lookup time for denser maps.
Signed-off-by: Sagarika Sharma <[email protected]>
---
kernel/bpf/hashtab.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8392f7f8a8ac..eb70c4bbe246 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -15,6 +15,9 @@
#include "bpf_lru_list.h"
#include "map_in_map.h"
+static uint n_prefetch;
+module_param(n_prefetch, uint, 0644);
+
#define HTAB_CREATE_FLAG_MASK \
(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
@@ -1743,9 +1746,13 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
if (is_percpu) {
int off = 0, cpu;
void __percpu *pptr;
+ int num_cpus = num_possible_cpus();
pptr = htab_elem_get_ptr(l, map->key_size);
for_each_possible_cpu(cpu) {
+ if (n_prefetch > 0 && (cpu + n_prefetch) <= num_cpus)
+ prefetch(per_cpu_ptr(pptr, cpu + n_prefetch));
+
bpf_long_memcpy(dst_val + off,
per_cpu_ptr(pptr, cpu), size);
off += size;
--
2.37.1.595.g718a3a8f04-goog
On Tue, Aug 16, 2022 at 8:35 AM Sagarika Sharma
<[email protected]> wrote:
>
> This patch introduces the use of a module parameter n_prefetch
> which enables prefetching within the bpf_map_lookup_batch function
> for a faster lookup. Benefits depend on the platform, relative
> density of the map, and the setting of the module parameter as
> described below.
>
> For multiprocessor machines, for a particular key in a bpf map,
> each cpu has a value associated with that key. This patch’s
> change is as follows: when copying each of these values to
> userspace in bpf_map_lookup_batch, the value for a cpu
> n_prefetch ahead is prefetched.
>
> MEASUREMENTS:
> The benchmark test added in this patch series was used to
> measure the effect of prefetching as well as determine the
> optimal setting of n_prefetch given the different parameters:
> the test was run on many different platforms (with varying
> number of cpus), with a range of settings of n_prefetch, and with
> saturated, dense, and sparse maps (num_entries/capacity_of_map).
> The benchmark test measures the average time for a single entry
> lookup (t = num_entries_looked_up/total_time) given the varied
> factors as mentioned above. The overhead of the
> bpf_map_lookup_batch syscall introduces some error.
>
> Here are the experimental results:
>
> amd machine with 256 cores (rome zen 2)
> Density of map n_prefetch single entry lookup time (ns/op)
> --------------------------------------------------------------------
> 40k / 40k 0 16176.471
> 1 13095.238
> 5 7432.432
> 12 5188.679
> 20 9482.759
>
> 10k / 40k 0 13253.012
> 5 7482.993
> 12 5164.319
> 20 9649.123
>
> 2.5k / 40k 0 7394.958
> 5 7201.309
> 13 4721.030
> 20 8118.081
>
> For denser maps, the experiments suggest that as n_prefetch
> increases, there is a significant time benefit (~66% decrease)
> until a certain point after which the time benefit begins to
> decrease. For sparser maps, there is a less pronounced speedup
> from prefetching. Additionally, this experiment seems to suggest
> the optimal n_prefetch range on this particular machine is 12-13,
> but a setting of n_prefetch = 5 can still improve the single
> entry lookup time.
>
> intel-skylake (with 112 cores)
> Density of map n_prefetch single entry lookup time (ns/op)
> ------------------------------------------------------------------
> 40k / 40k 0 5729.167
> 1 5092.593
> 5 3395.062
> 20 6875.000
>
> 10k / 40k 0 2029.520
> 5 2989.130
> 20 5820.106
>
> 2.5k / 40k 0 1598.256
> 5 2935.290
> 20 4867.257
>
> For this particular machine, the experimental results suggest that
> there is only a significant benefit in prefetching with denser maps.
> Prefetching within bpf_map_lookup_batch can provide significant
> benefit depending on the use case. Across the many different
> platforms experiments were performed on, a setting of n_prefetch = 5,
> although not the optimal setting, significantly decreased the single
> entry lookup time for denser maps.
>
> Signed-off-by: Sagarika Sharma <[email protected]>
> ---
> kernel/bpf/hashtab.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 8392f7f8a8ac..eb70c4bbe246 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -15,6 +15,9 @@
> #include "bpf_lru_list.h"
> #include "map_in_map.h"
>
> +static uint n_prefetch;
> +module_param(n_prefetch, uint, 0644);
module_param is no go. sorry.
> +
> #define HTAB_CREATE_FLAG_MASK \
> (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
> BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
> @@ -1743,9 +1746,13 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> if (is_percpu) {
> int off = 0, cpu;
> void __percpu *pptr;
> + int num_cpus = num_possible_cpus();
>
> pptr = htab_elem_get_ptr(l, map->key_size);
> for_each_possible_cpu(cpu) {
> + if (n_prefetch > 0 && (cpu + n_prefetch) <= num_cpus)
> + prefetch(per_cpu_ptr(pptr, cpu + n_prefetch));
> +
prefetch is a decent technique, but doesn't look like it helps
in all cases. Your numbers suggest that it may hurt too.
Whatever you're doing with map lookups you need a different
strategy than micro-optimizing this loop with prefetch.
Have you considering using for_each bpf side helper or user
space side map iterator to aggregate and copy values?
Or iterate over all map elements on one cpu first and
then on other cpus ? Essentially swapping
hlist_nulls_for_each_entry_safe and for_each_possible_cpu ?