2019-07-03 17:03:03

by Brian Vazquez

Subject: [PATCH bpf-next RFC v3 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call

This introduces a new command to retrieve a variable number of entries
from a bpf map.

This new command can be executed from the existing BPF syscall as
follows:

err = bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf and
attr->dump.buf_len.
It returns zero or a negative error, and populates buf and buf_len on
success.
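
For illustration, a minimal userspace sketch of a single call (assuming
this series is applied, so BPF_MAP_DUMP and the dump attr struct exist
in the installed headers; the hashmap with u64 keys/values and its fd
are placeholders):

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Illustrative only: dump up to 128 key/value pairs of a hashmap with
 * u64 keys and u64 values, starting from the beginning of the map.
 */
static int dump_once(int map_fd)
{
	union bpf_attr attr;
	uint64_t buf[2 * 128];			/* packed key/value pairs */
	uint32_t buf_len = sizeof(buf);		/* in: capacity, out: bytes copied */

	memset(&attr, 0, sizeof(attr));
	attr.dump.map_fd   = map_fd;
	attr.dump.prev_key = 0;			/* NULL: start from the beginning */
	attr.dump.buf      = (uint64_t)(unsigned long)buf;
	attr.dump.buf_len  = (uint64_t)(unsigned long)&buf_len;

	return syscall(__NR_bpf, BPF_MAP_DUMP, &attr, sizeof(attr));
}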

This implementation wraps the existing bpf methods map_get_next_key and
map_lookup_elem. The results show that even with a 1-elem_size buffer
it runs ~40% faster than the current entry-by-entry approach, and
improvements of ~85% are reported when the buffer size is increased;
however, once the buffer holds around 5% of the total number of entries
there is no significant gain from increasing it further.

Tested:
Tried buffers of different sizes to cover both the case where the bulk
is bigger than the map and the case where the elements to retrieve are
fewer than the existing ones; all runs read a map of 100K entries.
Below are the results (in ns) from the different runs:

buf_len_1:      69038725  entry-by-entry: 112384424  improvement: 38.569134%
buf_len_2:      40897447  entry-by-entry: 111030546  improvement: 63.165590%
buf_len_230:    13652714  entry-by-entry: 111694058  improvement: 87.776687%
buf_len_5000:   13576271  entry-by-entry: 111101169  improvement: 87.780263%
buf_len_73000:  14694343  entry-by-entry: 111740162  improvement: 86.849542%
buf_len_100000: 13745969  entry-by-entry: 114151991  improvement: 87.958187%
buf_len_234567: 14329834  entry-by-entry: 114427589  improvement: 87.476941%

Changelog:

v3:
- add explanation of the API in the commit message
- fix masked errors and return them to user
- copy last_key from return buf into prev_key if it was provided
- run test with kpti and retpoline mitigations

v2:
- use proper bpf-next tag

Brian Vazquez (6):
bpf: add bpf_map_value_size and bpf_map_copy_value helper functions
bpf: add BPF_MAP_DUMP command to dump more than one entry per call
bpf: keep bpf.h in sync with tools/
libbpf: support BPF_MAP_DUMP command
selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap
selftests/bpf: add test to measure performance of BPF_MAP_DUMP

include/uapi/linux/bpf.h | 9 +
kernel/bpf/syscall.c | 252 ++++++++++++++++++------
tools/include/uapi/linux/bpf.h | 9 +
tools/lib/bpf/bpf.c | 28 +++
tools/lib/bpf/bpf.h | 4 +
tools/lib/bpf/libbpf.map | 2 +
tools/testing/selftests/bpf/test_maps.c | 147 +++++++++++++-
7 files changed, 388 insertions(+), 63 deletions(-)

--
2.22.0.410.gd8fdbe21b5-goog


2019-07-03 17:03:22

by Brian Vazquez

Subject: [PATCH bpf-next RFC v3 4/6] libbpf: support BPF_MAP_DUMP command

Make libbpf aware of the new BPF_MAP_DUMP command and add bpf_map_dump
and bpf_map_dump_flags so it can be used from the library.
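
For illustration, a sketch of how a caller might use the new helper to
walk an entire hashmap in bulk (assumes u64 keys/values and a valid map
fd; not part of this patch):

#include <stdbool.h>
#include <stdint.h>
#include <bpf/bpf.h>

/* Illustrative only: bulk-walk a hashmap with u64 keys and u64 values. */
static void walk_map(int fd)
{
	uint64_t buf[2 * 64];		/* room for 64 packed key/value pairs */
	uint64_t prev_key;
	uint32_t buf_len;
	bool first = true;

	for (;;) {
		buf_len = sizeof(buf);	/* in: capacity, out: bytes copied */
		if (bpf_map_dump(fd, first ? NULL : &prev_key, buf, &buf_len))
			break;		/* errno == ENOENT: map fully walked */
		/* process buf_len bytes of packed (key, value) pairs here */
		if (first)
			/* the kernel only writes the last key back into
			 * prev_key on later calls, so grab it from buf once
			 */
			prev_key = buf[buf_len / sizeof(uint64_t) - 2];
		first = false;
	}
}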

Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
tools/lib/bpf/bpf.c | 28 ++++++++++++++++++++++++++++
tools/lib/bpf/bpf.h | 4 ++++
tools/lib/bpf/libbpf.map | 2 ++
3 files changed, 34 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index c7d7993c44bb..c1139b7db756 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -368,6 +368,34 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

+int bpf_map_dump(int fd, const void *prev_key, void *buf, void *buf_len)
+{
+ union bpf_attr attr;
+
+ memset(&attr, 0, sizeof(attr));
+ attr.dump.map_fd = fd;
+ attr.dump.prev_key = ptr_to_u64(prev_key);
+ attr.dump.buf = ptr_to_u64(buf);
+ attr.dump.buf_len = ptr_to_u64(buf_len);
+
+ return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
+}
+
+int bpf_map_dump_flags(int fd, const void *prev_key, void *buf, void *buf_len,
+ __u64 flags)
+{
+ union bpf_attr attr;
+
+ memset(&attr, 0, sizeof(attr));
+ attr.dump.map_fd = fd;
+ attr.dump.prev_key = ptr_to_u64(prev_key);
+ attr.dump.buf = ptr_to_u64(buf);
+ attr.dump.buf_len = ptr_to_u64(buf_len);
+ attr.dump.flags = flags;
+
+ return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
+}
+
int bpf_map_lookup_elem(int fd, const void *key, void *value)
{
union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index ff42ca043dc8..86496443440e 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -112,6 +112,10 @@ LIBBPF_API int bpf_verify_program(enum bpf_prog_type type,
LIBBPF_API int bpf_map_update_elem(int fd, const void *key, const void *value,
__u64 flags);

+LIBBPF_API int bpf_map_dump(int fd, const void *prev_key, void *buf,
+ void *buf_len);
+LIBBPF_API int bpf_map_dump_flags(int fd, const void *prev_key, void *buf,
+ void *buf_len, __u64 flags);
LIBBPF_API int bpf_map_lookup_elem(int fd, const void *key, void *value);
LIBBPF_API int bpf_map_lookup_elem_flags(int fd, const void *key, void *value,
__u64 flags);
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 2c6d835620d2..e7641773cfb0 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -173,4 +173,6 @@ LIBBPF_0.0.4 {
btf__parse_elf;
bpf_object__load_xattr;
libbpf_num_possible_cpus;
+ bpf_map_dump;
+ bpf_map_dump_flags;
} LIBBPF_0.0.3;
--
2.22.0.410.gd8fdbe21b5-goog

2019-07-03 17:03:39

by Brian Vazquez

Subject: [PATCH bpf-next RFC v3 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP

This test compares the time it takes to read an entire table of 100K
elements from a bpf hashmap using BPF_MAP_DUMP versus
BPF_MAP_GET_NEXT_KEY + BPF_MAP_LOOKUP_ELEM.

Signed-off-by: Brian Vazquez <[email protected]>
---
tools/testing/selftests/bpf/test_maps.c | 65 +++++++++++++++++++++++++
1 file changed, 65 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index b19ba6aa8e36..786d0e340aed 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -18,6 +18,7 @@
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/bpf.h>
+#include <linux/time64.h>

#include <bpf/bpf.h>
#include <bpf/libbpf.h>
@@ -388,6 +389,69 @@ static void test_hashmap_dump(void)
close(fd);
}

+static void test_hashmap_dump_perf(void)
+{
+ int fd, i, max_entries = 100000;
+ uint64_t key, value, next_key;
+ bool next_key_valid = true;
+ void *buf;
+ u32 buf_len, entries;
+ int j = 0;
+ int clk_id = CLOCK_MONOTONIC;
+ struct timespec begin, end;
+ long long time_spent, dump_time_spent;
+ double res;
+ int tests[] = {1, 2, 230, 5000, 73000, 100000, 234567};
+ int test_len = ARRAY_SIZE(tests);
+ const int elem_size = sizeof(key) + sizeof(value);
+
+ fd = helper_fill_hashmap(max_entries);
+ // Alloc memory considering the largest buffer
+ buf = malloc(elem_size * tests[test_len-1]);
+ assert(buf != NULL);
+
+test:
+ entries = tests[j];
+ buf_len = elem_size*tests[j];
+ j++;
+ clock_gettime(clk_id, &begin);
+ errno = 0;
+ i = 0;
+ while (errno == 0) {
+ bpf_map_dump(fd, !i ? NULL : &key,
+ buf, &buf_len);
+ if (errno)
+ break;
+ if (!i)
+ key = *((uint64_t *)(buf + buf_len - elem_size));
+ i += buf_len / elem_size;
+ }
+ clock_gettime(clk_id, &end);
+ assert(i == max_entries);
+ dump_time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
+ end.tv_nsec - begin.tv_nsec;
+ next_key_valid = true;
+ clock_gettime(clk_id, &begin);
+ assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
+ for (i = 0; next_key_valid; i++) {
+ next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
+ assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
+ key = next_key;
+ }
+ clock_gettime(clk_id, &end);
+ time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
+ end.tv_nsec - begin.tv_nsec;
+ res = (1-((double)dump_time_spent/time_spent))*100;
+ printf("buf_len_%u:\t %llu entry-by-entry: %llu improvement %lf\n",
+ entries, dump_time_spent, time_spent, res);
+ assert(i == max_entries);
+
+ if (j < test_len)
+ goto test;
+ free(buf);
+ close(fd);
+}
+
static void test_hashmap_zero_seed(void)
{
int i, first, second, old_flags;
@@ -1748,6 +1812,7 @@ static void run_all_tests(void)
test_hashmap_walk(0, NULL);
test_hashmap_zero_seed();
test_hashmap_dump();
+ test_hashmap_dump_perf();

test_arraymap(0, NULL);
test_arraymap_percpu(0, NULL);
--
2.22.0.410.gd8fdbe21b5-goog

2019-07-03 17:03:58

by Brian Vazquez

Subject: [PATCH bpf-next RFC v3 1/6] bpf: add bpf_map_value_size and bpf_map_copy_value helper functions

Move reusable code from map_lookup_elem to helper functions to avoid code
duplication in kernel/bpf/syscall.c

Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
kernel/bpf/syscall.c | 134 +++++++++++++++++++++++--------------------
1 file changed, 73 insertions(+), 61 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b0f545e07425..d200d2837ade 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -126,6 +126,76 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
return map;
}

+static u32 bpf_map_value_size(struct bpf_map *map)
+{
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
+ map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
+ return round_up(map->value_size, 8) * num_possible_cpus();
+ else if (IS_FD_MAP(map))
+ return sizeof(u32);
+ else
+ return map->value_size;
+}
+
+static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
+ __u64 flags)
+{
+ void *ptr;
+ int err;
+
+ if (bpf_map_is_dev_bound(map))
+ return bpf_map_offload_lookup_elem(map, key, value);
+
+ preempt_disable();
+ this_cpu_inc(bpf_prog_active);
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
+ err = bpf_percpu_hash_copy(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+ err = bpf_percpu_array_copy(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+ err = bpf_percpu_cgroup_storage_copy(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
+ err = bpf_stackmap_copy(map, key, value);
+ } else if (IS_FD_ARRAY(map)) {
+ err = bpf_fd_array_map_lookup_elem(map, key, value);
+ } else if (IS_FD_HASH(map)) {
+ err = bpf_fd_htab_map_lookup_elem(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+ err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+ map->map_type == BPF_MAP_TYPE_STACK) {
+ err = map->ops->map_peek_elem(map, value);
+ } else {
+ rcu_read_lock();
+ if (map->ops->map_lookup_elem_sys_only)
+ ptr = map->ops->map_lookup_elem_sys_only(map, key);
+ else
+ ptr = map->ops->map_lookup_elem(map, key);
+ if (IS_ERR(ptr)) {
+ err = PTR_ERR(ptr);
+ } else if (!ptr) {
+ err = -ENOENT;
+ } else {
+ err = 0;
+ if (flags & BPF_F_LOCK)
+ /* lock 'ptr' and copy everything but lock */
+ copy_map_value_locked(map, value, ptr, true);
+ else
+ copy_map_value(map, value, ptr);
+ /* mask lock, since value wasn't zero inited */
+ check_and_init_map_lock(map, value);
+ }
+ rcu_read_unlock();
+ }
+ this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+
+ return err;
+}
+
void *bpf_map_area_alloc(size_t size, int numa_node)
{
/* We really just want to fail instead of triggering OOM killer
@@ -729,7 +799,7 @@ static int map_lookup_elem(union bpf_attr *attr)
void __user *uvalue = u64_to_user_ptr(attr->value);
int ufd = attr->map_fd;
struct bpf_map *map;
- void *key, *value, *ptr;
+ void *key, *value;
u32 value_size;
struct fd f;
int err;
@@ -761,72 +831,14 @@ static int map_lookup_elem(union bpf_attr *attr)
goto err_put;
}

- if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
- map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
- value_size = round_up(map->value_size, 8) * num_possible_cpus();
- else if (IS_FD_MAP(map))
- value_size = sizeof(u32);
- else
- value_size = map->value_size;
+ value_size = bpf_map_value_size(map);

err = -ENOMEM;
value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
if (!value)
goto free_key;

- if (bpf_map_is_dev_bound(map)) {
- err = bpf_map_offload_lookup_elem(map, key, value);
- goto done;
- }
-
- preempt_disable();
- this_cpu_inc(bpf_prog_active);
- if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
- err = bpf_percpu_hash_copy(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
- err = bpf_percpu_array_copy(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
- err = bpf_percpu_cgroup_storage_copy(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
- err = bpf_stackmap_copy(map, key, value);
- } else if (IS_FD_ARRAY(map)) {
- err = bpf_fd_array_map_lookup_elem(map, key, value);
- } else if (IS_FD_HASH(map)) {
- err = bpf_fd_htab_map_lookup_elem(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
- err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
- map->map_type == BPF_MAP_TYPE_STACK) {
- err = map->ops->map_peek_elem(map, value);
- } else {
- rcu_read_lock();
- if (map->ops->map_lookup_elem_sys_only)
- ptr = map->ops->map_lookup_elem_sys_only(map, key);
- else
- ptr = map->ops->map_lookup_elem(map, key);
- if (IS_ERR(ptr)) {
- err = PTR_ERR(ptr);
- } else if (!ptr) {
- err = -ENOENT;
- } else {
- err = 0;
- if (attr->flags & BPF_F_LOCK)
- /* lock 'ptr' and copy everything but lock */
- copy_map_value_locked(map, value, ptr, true);
- else
- copy_map_value(map, value, ptr);
- /* mask lock, since value wasn't zero inited */
- check_and_init_map_lock(map, value);
- }
- rcu_read_unlock();
- }
- this_cpu_dec(bpf_prog_active);
- preempt_enable();
-
-done:
+ err = bpf_map_copy_value(map, key, value, attr->flags);
if (err)
goto free_value;

--
2.22.0.410.gd8fdbe21b5-goog

2019-07-03 17:04:32

by Brian Vazquez

Subject: [PATCH bpf-next RFC v3 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call

This introduces a new command to retrieve a variable number of entries
from a bpf map, wrapping the existing bpf methods map_get_next_key and
map_lookup_elem.

To start dumping the map from the beginning you must specify NULL as
the prev_key.

The new API returns 0 when it successfully copied all the elements
requested, or when it copied fewer because there were no more elements
to retrieve (err == -ENOENT). In the latter scenario err is masked to 0.

On a successful call buf and buf_len will contain correct data, and if
prev_key was provided (i.e. not on the first walk, where prev_key is
NULL) it will contain the last key copied into buf, which simplifies the
next call.

Only when it can't find a single element will it return -ENOENT, meaning
that the map has been entirely walked. When an error is returned, buf,
buf_len and prev_key shouldn't be read nor used.

Because maps can be modified from both userspace and kernel code, this
function can hit a scenario where the next_key was found but by the time
we try to retrieve the value the element is no longer there; in this
case the function continues and tries to get a new next_key value,
skipping the deleted key. If at some point the function finds itself
trapped in a loop, it will return -EINTR.

The function will try to fit as many elements as possible into the
provided buf and will return -EINVAL if buf_len is smaller than
elem_size.

QUEUE and STACK maps are not supported.

Note that map_dump doesn't guarantee that reading the entire table is
consistent, since this function is always racing with kernel and user
code, but the same behaviour is found when the entire table is walked
using the current interfaces: map_get_next_key + map_lookup_elem.
It is also important to note that with a locked map the lock is grabbed
for one entry at a time, meaning that the returned buf might or might
not be consistent.
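
To make the returned layout concrete: buf holds each key immediately
followed by its value, packed back to back (elem_size = key_size +
value_size). A hedged sketch of how userspace could walk such a buffer
(helper name and parameters are illustrative, not part of the patch):

#include <linux/types.h>

/* Illustrative only: iterate the packed (key, value) pairs in buf.
 * buf_len is the value written back by the kernel (bytes actually
 * copied); key_size/value_size must match the map definition.
 */
static void for_each_dumped_elem(const char *buf, __u32 buf_len,
				 __u32 key_size, __u32 value_size)
{
	__u32 elem_size = key_size + value_size;
	__u32 off;

	for (off = 0; off + elem_size <= buf_len; off += elem_size) {
		const void *key   = buf + off;
		const void *value = buf + off + key_size;
		/* consume key/value here */
		(void)key;
		(void)value;
	}
}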

Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
include/uapi/linux/bpf.h | 9 +++
kernel/bpf/syscall.c | 118 +++++++++++++++++++++++++++++++++++++++
2 files changed, 127 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index cffea1826a1f..cc589570a639 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
BPF_TASK_FD_QUERY,
BPF_MAP_LOOKUP_AND_DELETE_ELEM,
BPF_MAP_FREEZE,
+ BPF_MAP_DUMP,
};

enum bpf_map_type {
@@ -388,6 +389,14 @@ union bpf_attr {
__u64 flags;
};

+ struct { /* struct used by BPF_MAP_DUMP command */
+ __u32 map_fd;
+ __aligned_u64 prev_key;
+ __aligned_u64 buf;
+ __aligned_u64 buf_len; /* input/output: len of buf */
+ __u64 flags;
+ } dump;
+
struct { /* anonymous struct used by BPF_PROG_LOAD command */
__u32 prog_type; /* one of enum bpf_prog_type */
__u32 insn_cnt;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d200d2837ade..78d55463fc76 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1097,6 +1097,121 @@ static int map_get_next_key(union bpf_attr *attr)
return err;
}

+/* last field in 'union bpf_attr' used by this command */
+#define BPF_MAP_DUMP_LAST_FIELD dump.buf_len
+
+static int map_dump(union bpf_attr *attr)
+{
+ void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
+ void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
+ u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
+ int ufd = attr->dump.map_fd;
+ struct bpf_map *map;
+ void *buf, *prev_key, *key, *value;
+ u32 value_size, elem_size, buf_len, cp_len;
+ struct fd f;
+ int err;
+ bool first_key = false;
+
+ if (CHECK_ATTR(BPF_MAP_DUMP))
+ return -EINVAL;
+
+ attr->flags = 0;
+ if (attr->dump.flags & ~BPF_F_LOCK)
+ return -EINVAL;
+
+ f = fdget(ufd);
+ map = __bpf_map_get(f);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+ if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
+ err = -EPERM;
+ goto err_put;
+ }
+
+ if ((attr->dump.flags & BPF_F_LOCK) &&
+ !map_value_has_spin_lock(map)) {
+ err = -EINVAL;
+ goto err_put;
+ }
+
+ if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+ map->map_type == BPF_MAP_TYPE_STACK) {
+ err = -ENOTSUPP;
+ goto err_put;
+ }
+
+ value_size = bpf_map_value_size(map);
+
+ err = get_user(buf_len, ubuf_len);
+ if (err)
+ goto err_put;
+
+ elem_size = map->key_size + value_size;
+ if (buf_len < elem_size) {
+ err = -EINVAL;
+ goto err_put;
+ }
+
+ if (ukey) {
+ prev_key = __bpf_copy_key(ukey, map->key_size);
+ if (IS_ERR(prev_key)) {
+ err = PTR_ERR(prev_key);
+ goto err_put;
+ }
+ } else {
+ prev_key = NULL;
+ first_key = true;
+ }
+
+ err = -ENOMEM;
+ buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
+ if (!buf)
+ goto err_put;
+
+ key = buf;
+ value = key + map->key_size;
+ for (cp_len = 0; cp_len + elem_size <= buf_len;) {
+ if (signal_pending(current)) {
+ err = -EINTR;
+ break;
+ }
+
+ rcu_read_lock();
+ err = map->ops->map_get_next_key(map, prev_key, key);
+ rcu_read_unlock();
+
+ if (err)
+ break;
+
+ err = bpf_map_copy_value(map, key, value, attr->dump.flags);
+
+ if (err == -ENOENT)
+ continue;
+ if (err)
+ goto free_buf;
+
+ if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
+ err = -EFAULT;
+ goto free_buf;
+ }
+
+ prev_key = key;
+ cp_len += elem_size;
+ }
+
+ if (err == -ENOENT && cp_len)
+ err = 0;
+ if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
+ (!first_key && copy_to_user(ukey, key, map->key_size))))
+ err = -EFAULT;
+free_buf:
+ kfree(buf);
+err_put:
+ fdput(f);
+ return err;
+}
+
#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value

static int map_lookup_and_delete_elem(union bpf_attr *attr)
@@ -2910,6 +3025,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
err = map_lookup_and_delete_elem(&attr);
break;
+ case BPF_MAP_DUMP:
+ err = map_dump(&attr);
+ break;
default:
err = -EINVAL;
break;
--
2.22.0.410.gd8fdbe21b5-goog

2019-07-03 17:04:51

by Brian Vazquez

Subject: [PATCH bpf-next RFC v3 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap

This test exercises the new command on a bpf hashmap and makes sure it
works as expected.

Signed-off-by: Brian Vazquez <[email protected]>
---
tools/testing/selftests/bpf/test_maps.c | 82 ++++++++++++++++++++++++-
1 file changed, 80 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index a3fbc571280a..b19ba6aa8e36 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -309,6 +309,85 @@ static void test_hashmap_walk(unsigned int task, void *data)
close(fd);
}

+static void test_hashmap_dump(void)
+{
+ int fd, i, max_entries = 5;
+ uint64_t keys[max_entries], values[max_entries];
+ uint64_t key, value, next_key, prev_key;
+ bool next_key_valid = true;
+ void *buf, *elem;
+ u32 buf_len;
+ const int elem_size = sizeof(key) + sizeof(value);
+
+ fd = helper_fill_hashmap(max_entries);
+
+ // Get the elements in the hashmap, and store them in that order
+ assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
+ i = 0;
+ keys[i] = key;
+ for (i = 1; next_key_valid; i++) {
+ next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
+ assert(bpf_map_lookup_elem(fd, &key, &values[i - 1]) == 0);
+ keys[i-1] = key;
+ key = next_key;
+ }
+
+ // Alloc memory for the whole table
+ buf = malloc(elem_size * max_entries);
+ assert(buf != NULL);
+
+ // Check that buf_len < elem_size returns EINVAL
+ buf_len = elem_size-1;
+ errno = 0;
+ assert(bpf_map_dump(fd, NULL, buf, &buf_len) == -1 && errno == EINVAL);
+
+ // Check that it returns the first two elements
+ errno = 0;
+ buf_len = elem_size * 2;
+ i = 0;
+ assert(bpf_map_dump(fd, NULL, buf, &buf_len) == 0 &&
+ buf_len == 2*elem_size);
+ elem = buf;
+ assert((*(uint64_t *)elem) == keys[i] &&
+ (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+ elem = buf + elem_size;
+ i++;
+ assert((*(uint64_t *)elem) == keys[i] &&
+ (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+ i++;
+
+ /* Check that prev_key contains key from last_elem retrieved in previous
+ * call
+ */
+ prev_key = *((uint64_t *)elem);
+ assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
+ buf_len == elem_size*2);
+ elem = buf;
+ assert((*(uint64_t *)elem) == keys[i] &&
+ (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+ elem = buf + elem_size;
+ i++;
+ assert((*(uint64_t *)elem) == keys[i] &&
+ (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+ i++;
+ assert(prev_key == (*(uint64_t *)elem));
+
+ /* Continue reading from map and verify buf_len only contains 1 element
+ * even though buf_len is 2 elem_size.
+ */
+ assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
+ buf_len == elem_size);
+ elem = buf;
+ assert((*(uint64_t *)elem) == keys[i] &&
+ (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+
+ assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == -1 &&
+ errno == ENOENT);
+
+ free(buf);
+ close(fd);
+}
+
static void test_hashmap_zero_seed(void)
{
int i, first, second, old_flags;
@@ -1668,6 +1747,7 @@ static void run_all_tests(void)
test_hashmap_percpu(0, NULL);
test_hashmap_walk(0, NULL);
test_hashmap_zero_seed();
+ test_hashmap_dump();

test_arraymap(0, NULL);
test_arraymap_percpu(0, NULL);
@@ -1705,11 +1785,9 @@ int main(void)

map_flags = BPF_F_NO_PREALLOC;
run_all_tests();
-
#define CALL
#include <map_tests/tests.h>
#undef CALL
-
printf("test_maps: OK, %d SKIPPED\n", skips);
return 0;
}
--
2.22.0.410.gd8fdbe21b5-goog

2019-07-03 17:04:53

by Brian Vazquez

Subject: [PATCH bpf-next RFC v3 3/6] bpf: keep bpf.h in sync with tools/

Add the bpf_attr.dump structure and the BPF_MAP_DUMP command to the
tools/ copy of bpf.h so it stays in sync with the kernel header.

Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
tools/include/uapi/linux/bpf.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a396b516a2b2..db6785a03538 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
BPF_TASK_FD_QUERY,
BPF_MAP_LOOKUP_AND_DELETE_ELEM,
BPF_MAP_FREEZE,
+ BPF_MAP_DUMP,
};

enum bpf_map_type {
@@ -388,6 +389,14 @@ union bpf_attr {
__u64 flags;
};

+ struct { /* struct used by BPF_MAP_DUMP command */
+ __u32 map_fd;
+ __aligned_u64 prev_key;
+ __aligned_u64 buf;
+ __aligned_u64 buf_len; /* input/output: len of buf */
+ __u64 flags;
+ } dump;
+
struct { /* anonymous struct used by BPF_PROG_LOAD command */
__u32 prog_type; /* one of enum bpf_prog_type */
__u32 insn_cnt;
--
2.22.0.410.gd8fdbe21b5-goog

2019-07-05 16:45:38

by Y Song

Subject: Re: [PATCH bpf-next RFC v3 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call

On Wed, Jul 3, 2019 at 10:03 AM Brian Vazquez <[email protected]> wrote:
>
> This introduces a new command to retrieve a variable number of entries
> from a bpf map wrapping the existing bpf methods:
> map_get_next_key and map_lookup_elem
>
> To start dumping the map from the beginning you must specify NULL as
> the prev_key.
>
> The new API returns 0 when it successfully copied all the elements
> requested or it copied less because there weren't more elements to
> retrieved (err == -ENOENT). In last scenario err will be masked to 0.
>
> On a successful call buf and buf_len will contain correct data and in
> case prev_key was provided (not for the first walk, since prev_key is
> NULL) it will contain the last_key copied into the buf which will simplify
> next call.
>
> Only when it can't find a single element it will return -ENOENT meaning
> that the map has been entirely walked. When an error is return buf,
> buf_len and prev_key shouldn't be read nor used.
>
> Because maps can be called from userspace and kernel code, this function
> can have a scenario where the next_key was found but by the time we
> try to retrieve the value the element is not there, in this case the
> function continues and tries to get a new next_key value, skipping the
> deleted key. If at some point the function find itself trap in a loop,
> it will return -EINTR.
>
> The function will try to fit as much as possible in the buf provided and
> will return -EINVAL if buf_len is smaller than elem_size.
>
> QUEUE and STACK maps are not supported.
>
> Note that map_dump doesn't guarantee that reading the entire table is
> consistent since this function is always racing with kernel and user code
> but the same behaviour is found when the entire table is walked using
> the current interfaces: map_get_next_key + map_lookup_elem.
> It is also important to note that when a locked map the lock is grabbed for
> 1 entry at the time, meaning that the buf returned might or might not be
> consistent.

First, thanks for the RFC. I do think there are use cases where
batch dumping helps.
Some comments below.

>
> Suggested-by: Stanislav Fomichev <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
> ---
> include/uapi/linux/bpf.h | 9 +++
> kernel/bpf/syscall.c | 118 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 127 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index cffea1826a1f..cc589570a639 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -106,6 +106,7 @@ enum bpf_cmd {
> BPF_TASK_FD_QUERY,
> BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> BPF_MAP_FREEZE,
> + BPF_MAP_DUMP,
> };
>
> enum bpf_map_type {
> @@ -388,6 +389,14 @@ union bpf_attr {
> __u64 flags;
> };
>
> + struct { /* struct used by BPF_MAP_DUMP command */
> + __u32 map_fd;
> + __aligned_u64 prev_key;
> + __aligned_u64 buf;
> + __aligned_u64 buf_len; /* input/output: len of buf */
> + __u64 flags;
> + } dump;

Maybe you can swap map_fd and flags?
This way, you won't have a hole right after map_fd?
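
(Illustrative only, one possible layout along those lines; not part of
the patch:)

	struct { /* hypothetical reordering to avoid the 4-byte hole */
		__u64 flags;
		__aligned_u64 prev_key;
		__aligned_u64 buf;
		__aligned_u64 buf_len;	/* input/output: len of buf */
		__u32 map_fd;		/* leaves only tail padding */
	} dump;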

> +
> struct { /* anonymous struct used by BPF_PROG_LOAD command */
> __u32 prog_type; /* one of enum bpf_prog_type */
> __u32 insn_cnt;
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d200d2837ade..78d55463fc76 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1097,6 +1097,121 @@ static int map_get_next_key(union bpf_attr *attr)
> return err;
> }
>
> +/* last field in 'union bpf_attr' used by this command */
> +#define BPF_MAP_DUMP_LAST_FIELD dump.buf_len
> +
> +static int map_dump(union bpf_attr *attr)
> +{
> + void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> + void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> + u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> + int ufd = attr->dump.map_fd;
> + struct bpf_map *map;
> + void *buf, *prev_key, *key, *value;
> + u32 value_size, elem_size, buf_len, cp_len;
> + struct fd f;
> + int err;
> + bool first_key = false;
> +
> + if (CHECK_ATTR(BPF_MAP_DUMP))
> + return -EINVAL;
> +
> + attr->flags = 0;

Why do you want to set attr->flags? This modifies the anonymous struct
used by the BPF_MAP_*_ELEM commands.

> + if (attr->dump.flags & ~BPF_F_LOCK)
> + return -EINVAL;
> +
> + f = fdget(ufd);
> + map = __bpf_map_get(f);
> + if (IS_ERR(map))
> + return PTR_ERR(map);
> + if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> + err = -EPERM;
> + goto err_put;
> + }
> +
> + if ((attr->dump.flags & BPF_F_LOCK) &&
> + !map_value_has_spin_lock(map)) {
> + err = -EINVAL;
> + goto err_put;
> + }
> +
> + if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> + map->map_type == BPF_MAP_TYPE_STACK) {
> + err = -ENOTSUPP;
> + goto err_put;
> + }
> +
> + value_size = bpf_map_value_size(map);
> +
> + err = get_user(buf_len, ubuf_len);
> + if (err)
> + goto err_put;
> +
> + elem_size = map->key_size + value_size;
> + if (buf_len < elem_size) {
> + err = -EINVAL;
> + goto err_put;
> + }
> +
> + if (ukey) {
> + prev_key = __bpf_copy_key(ukey, map->key_size);
> + if (IS_ERR(prev_key)) {
> + err = PTR_ERR(prev_key);
> + goto err_put;
> + }
> + } else {
> + prev_key = NULL;
> + first_key = true;
> + }
> +
> + err = -ENOMEM;
> + buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> + if (!buf)
> + goto err_put;
> +
> + key = buf;
> + value = key + map->key_size;
> + for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> + if (signal_pending(current)) {
> + err = -EINTR;
> + break;
> + }
> +
> + rcu_read_lock();
> + err = map->ops->map_get_next_key(map, prev_key, key);
> + rcu_read_unlock();
> +
> + if (err)
> + break;
> +
> + err = bpf_map_copy_value(map, key, value, attr->dump.flags);
> +
> + if (err == -ENOENT)
> + continue;
> + if (err)
> + goto free_buf;
> +
> + if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
> + err = -EFAULT;
> + goto free_buf;
> + }
> +
> + prev_key = key;
> + cp_len += elem_size;
> + }
> +
> + if (err == -ENOENT && cp_len)
> + err = 0;
> + if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
> + (!first_key && copy_to_user(ukey, key, map->key_size))))
> + err = -EFAULT;
> +free_buf:
> + kfree(buf);
> +err_put:
> + fdput(f);
> + return err;
> +}
> +
> #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
>
> static int map_lookup_and_delete_elem(union bpf_attr *attr)
> @@ -2910,6 +3025,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
> err = map_lookup_and_delete_elem(&attr);
> break;
> + case BPF_MAP_DUMP:
> + err = map_dump(&attr);
> + break;

In bcc, we have use cases like this: at a certain time interval (e.g.
every 2 seconds), we get all key/value pairs for a map, format and
print the key/values on the screen, and then delete all the key/value
pairs we retrieved earlier.

Currently, bpf_get_next_key() is used to get all key/value pairs, and
deletion also happens at each key level.
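
(Roughly, the current pattern looks like the sketch below; illustrative
only, assuming a hashmap with u64 keys and values:)

#include <stdint.h>
#include <stdio.h>
#include <bpf/bpf.h>

/* Illustrative only: per-key walk, print and delete. */
static void dump_and_clear(int fd)
{
	uint64_t key, next_key, value;
	int has_next = !bpf_map_get_next_key(fd, NULL, &key);

	while (has_next) {
		has_next = !bpf_map_get_next_key(fd, &key, &next_key);
		if (!bpf_map_lookup_elem(fd, &key, &value))
			printf("%llu: %llu\n",
			       (unsigned long long)key,
			       (unsigned long long)value);
		bpf_map_delete_elem(fd, &key);
		key = next_key;
	}
}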

Your batch dump command should help with retrieving map key/value pairs.
What do you think about deleting the just-retrieved map entries?
With an additional flag folded into BPF_MAP_DUMP?
Or by implementing a new BPF_MAP_DUMP_AND_DELETE?

I mention this so that we can start the discussion now.
You do not need to implement the batch deletion part, but let us
have a design that is extensible for it.

Thanks.

> default:
> err = -EINVAL;
> break;
> --
> 2.22.0.410.gd8fdbe21b5-goog
>

2019-07-09 15:35:29

by Brian Vazquez

Subject: Re: [PATCH bpf-next RFC v3 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call

> Maybe you can swap map_fd and flags?
> This way, you won't have hole right after map_fd?

Makes sense.

> > + attr->flags = 0;
> Why do you want attr->flags? This is to modify anonumous struct used by
> BPF_MAP_*_ELEM commands.

Nice catch! This was a mistake; I forgot to delete that line.

> In bcc, we have use cases like this. At a certain time interval (e.g.,
> every 2 seconds),
> we get all key/value pairs for a map, we format and print out map
> key/values on the screen,
> and then delete all key/value pairs we retrieved earlier.
>
> Currently, bpf_get_next_key() is used to get all key/value pairs, and
> deletion also happened
> at each key level.
>
> Your batch dump command should help retrieving map key/value pairs.
> What do you think deletions of those just retrieved map entries?
> With an additional flag and fold into BPF_MAP_DUMP?
> or implement a new BPF_MAP_DUMP_AND_DELETE?
>
> I mentioned this so that we can start discussion now.
> You do not need to implement batch deletion part, but let us
> have a design extensible for that.
>
> Thanks.

With an additional flag, the code could be racy: you might copy an old
value and delete a newer one.
So maybe we could implement BPF_MAP_DUMP_AND_DELETE as a wrapper of
map_get_next_key + map_lookup_and_delete_elem. The latter function
already exists, but it has not been implemented for maps other than
stack and queue.

Thanks for reviewing it!