2019-12-11 22:34:59

by Brian Vazquez

Subject: [PATCH v3 bpf-next 00/11] add bpf batch ops to process more than 1 elem

This patch series introduces batch ops that can be added to bpf maps to
lookup/lookup_and_delete/update/delete more than one element at a time.
This is especially useful when syscall overhead is a problem, and in the
case of hmaps it provides a reliable way of traversing them.

The implementation includes a generic approach that could potentially be
used by any bpf map and adds it to arraymap; it also includes a specific
implementation for hashmaps, which are traversed by bucket instead
of by key.

The bpf syscall subcommands introduced are:

BPF_MAP_LOOKUP_BATCH
BPF_MAP_LOOKUP_AND_DELETE_BATCH
BPF_MAP_UPDATE_BATCH
BPF_MAP_DELETE_BATCH

The UAPI attribute is:

struct { /* struct used by BPF_MAP_*_BATCH commands */
__aligned_u64 in_batch; /* start batch,
* NULL to start from beginning
*/
__aligned_u64 out_batch; /* output: next start batch */
__aligned_u64 keys;
__aligned_u64 values;
__u32 count; /* input/output:
* input: # of key/value
* elements
* output: # of filled elements
*/
__u32 map_fd;
__u64 elem_flags;
__u64 flags;
} batch;


in_batch and out_batch are only used for lookup and lookup_and_delete,
since those are the only two operations that traverse the map.

The update/delete batch ops should be given the keys/values that the
user wants to modify.

Here are the previous discussions on the batch processing:
- https://lore.kernel.org/bpf/[email protected]/
- https://lore.kernel.org/bpf/[email protected]/
- https://lore.kernel.org/bpf/[email protected]/

Changelog since v2:
- Add generic batch support for lpm_trie and test it (Yonghong Song)
- Use define MAP_LOOKUP_RETRIES for retries (John Fastabend)
- Return errors directly and remove labels (Yonghong Song)
- Insert new API functions into libbpf alphabetically (Yonghong Song)
- Change hlist_nulls_for_each_entry_rcu to
hlist_nulls_for_each_entry_safe in htab batch ops (Yonghong Song)

Changelog since v1:
- Fix SOB ordering and remove Co-authored-by tag (Alexei Starovoitov)

Changelog since RFC:
- Change batch to in_batch and out_batch to support more flexible opaque
values to iterate the bpf maps.
- Remove update/delete specific batch ops for htab and use the generic
implementations instead.

Brian Vazquez (7):
bpf: add bpf_map_{value_size,update_value,map_copy_value} functions
bpf: add generic support for lookup and lookup_and_delete batch ops
bpf: add generic support for update and delete batch ops
bpf: add lookup and update batch ops to arraymap
bpf: add generic_batch_ops to lpm_trie map
selftests/bpf: add batch ops testing to array bpf map
selftests/bpf: add batch ops testing to lpm_trie bpf map

Yonghong Song (4):
bpf: add batch ops to all htab bpf map
tools/bpf: sync uapi header bpf.h
libbpf: add libbpf support to batch ops
selftests/bpf: add batch ops testing for htab and htab_percpu map

include/linux/bpf.h | 21 +
include/uapi/linux/bpf.h | 21 +
kernel/bpf/arraymap.c | 2 +
kernel/bpf/hashtab.c | 242 ++++++++
kernel/bpf/lpm_trie.c | 4 +
kernel/bpf/syscall.c | 562 ++++++++++++++----
tools/include/uapi/linux/bpf.h | 21 +
tools/lib/bpf/bpf.c | 61 ++
tools/lib/bpf/bpf.h | 14 +
tools/lib/bpf/libbpf.map | 4 +
.../bpf/map_tests/array_map_batch_ops.c | 119 ++++
.../bpf/map_tests/htab_map_batch_ops.c | 269 +++++++++
.../bpf/map_tests/trie_map_batch_ops.c | 235 ++++++++
13 files changed, 1451 insertions(+), 124 deletions(-)
create mode 100644 tools/testing/selftests/bpf/map_tests/array_map_batch_ops.c
create mode 100644 tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
create mode 100644 tools/testing/selftests/bpf/map_tests/trie_map_batch_ops.c

--
2.24.1.735.g03f4e72817-goog


2019-12-11 22:35:13

by Brian Vazquez

Subject: [PATCH v3 bpf-next 01/11] bpf: add bpf_map_{value_size,update_value,map_copy_value} functions

This commit moves reusable code from map_lookup_elem() and
map_update_elem() into helper functions to avoid code duplication in
kernel/bpf/syscall.c.

Signed-off-by: Brian Vazquez <[email protected]>
Acked-by: John Fastabend <[email protected]>
---
kernel/bpf/syscall.c | 275 ++++++++++++++++++++++++-------------------
1 file changed, 151 insertions(+), 124 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 66b90eaf99fe8..2530266fa6477 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -129,6 +129,153 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
return map;
}

+static u32 bpf_map_value_size(struct bpf_map *map)
+{
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
+ map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
+ return round_up(map->value_size, 8) * num_possible_cpus();
+ else if (IS_FD_MAP(map))
+ return sizeof(u32);
+ else
+ return map->value_size;
+}
+
+static void maybe_wait_bpf_programs(struct bpf_map *map)
+{
+ /* Wait for any running BPF programs to complete so that
+ * userspace, when we return to it, knows that all programs
+ * that could be running use the new map value.
+ */
+ if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
+ map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
+ synchronize_rcu();
+}
+
+static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key,
+ void *value, __u64 flags)
+{
+ int err;
+ /* Need to create a kthread, thus must support schedule */
+ if (bpf_map_is_dev_bound(map)) {
+ return bpf_map_offload_update_elem(map, key, value, flags);
+ } else if (map->map_type == BPF_MAP_TYPE_CPUMAP ||
+ map->map_type == BPF_MAP_TYPE_SOCKHASH ||
+ map->map_type == BPF_MAP_TYPE_SOCKMAP) {
+ return map->ops->map_update_elem(map, key, value, flags);
+ }
+
+ /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
+ * inside bpf map update or delete otherwise deadlocks are possible
+ */
+ preempt_disable();
+ __this_cpu_inc(bpf_prog_active);
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
+ err = bpf_percpu_hash_update(map, key, value, flags);
+ } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+ err = bpf_percpu_array_update(map, key, value, flags);
+ } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+ err = bpf_percpu_cgroup_storage_update(map, key, value,
+ flags);
+ } else if (IS_FD_ARRAY(map)) {
+ rcu_read_lock();
+ err = bpf_fd_array_map_update_elem(map, f.file, key, value,
+ flags);
+ rcu_read_unlock();
+ } else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
+ rcu_read_lock();
+ err = bpf_fd_htab_map_update_elem(map, f.file, key, value,
+ flags);
+ rcu_read_unlock();
+ } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+ /* rcu_read_lock() is not needed */
+ err = bpf_fd_reuseport_array_update_elem(map, key, value,
+ flags);
+ } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+ map->map_type == BPF_MAP_TYPE_STACK) {
+ err = map->ops->map_push_elem(map, value, flags);
+ } else {
+ rcu_read_lock();
+ err = map->ops->map_update_elem(map, key, value, flags);
+ rcu_read_unlock();
+ }
+ __this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+ maybe_wait_bpf_programs(map);
+
+ return err;
+}
+
+static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
+ __u64 flags, bool do_delete)
+{
+ void *ptr;
+ int err;
+
+
+ if (bpf_map_is_dev_bound(map)) {
+ err = bpf_map_offload_lookup_elem(map, key, value);
+
+ if (!err && do_delete)
+ err = bpf_map_offload_delete_elem(map, key);
+
+ return err;
+ }
+
+ preempt_disable();
+ this_cpu_inc(bpf_prog_active);
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
+ err = bpf_percpu_hash_copy(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+ err = bpf_percpu_array_copy(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+ err = bpf_percpu_cgroup_storage_copy(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
+ err = bpf_stackmap_copy(map, key, value);
+ } else if (IS_FD_ARRAY(map)) {
+ err = bpf_fd_array_map_lookup_elem(map, key, value);
+ } else if (IS_FD_HASH(map)) {
+ err = bpf_fd_htab_map_lookup_elem(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+ err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+ map->map_type == BPF_MAP_TYPE_STACK) {
+ err = map->ops->map_peek_elem(map, value);
+ } else {
+ rcu_read_lock();
+ if (map->ops->map_lookup_elem_sys_only)
+ ptr = map->ops->map_lookup_elem_sys_only(map, key);
+ else
+ ptr = map->ops->map_lookup_elem(map, key);
+ if (IS_ERR(ptr)) {
+ err = PTR_ERR(ptr);
+ } else if (!ptr) {
+ err = -ENOENT;
+ } else {
+ err = 0;
+ if (flags & BPF_F_LOCK)
+ /* lock 'ptr' and copy everything but lock */
+ copy_map_value_locked(map, value, ptr, true);
+ else
+ copy_map_value(map, value, ptr);
+ /* mask lock, since value wasn't zero inited */
+ check_and_init_map_lock(map, value);
+ }
+ rcu_read_unlock();
+ }
+ if (do_delete)
+ err = err ? err : map->ops->map_delete_elem(map, key);
+
+ this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+ maybe_wait_bpf_programs(map);
+
+ return err;
+}
+
static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
{
/* We really just want to fail instead of triggering OOM killer
@@ -816,7 +963,7 @@ static int map_lookup_elem(union bpf_attr *attr)
void __user *uvalue = u64_to_user_ptr(attr->value);
int ufd = attr->map_fd;
struct bpf_map *map;
- void *key, *value, *ptr;
+ void *key, *value;
u32 value_size;
struct fd f;
int err;
@@ -848,72 +995,14 @@ static int map_lookup_elem(union bpf_attr *attr)
goto err_put;
}

- if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
- map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
- value_size = round_up(map->value_size, 8) * num_possible_cpus();
- else if (IS_FD_MAP(map))
- value_size = sizeof(u32);
- else
- value_size = map->value_size;
+ value_size = bpf_map_value_size(map);

err = -ENOMEM;
value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
if (!value)
goto free_key;

- if (bpf_map_is_dev_bound(map)) {
- err = bpf_map_offload_lookup_elem(map, key, value);
- goto done;
- }
-
- preempt_disable();
- this_cpu_inc(bpf_prog_active);
- if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
- err = bpf_percpu_hash_copy(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
- err = bpf_percpu_array_copy(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
- err = bpf_percpu_cgroup_storage_copy(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
- err = bpf_stackmap_copy(map, key, value);
- } else if (IS_FD_ARRAY(map) || IS_FD_PROG_ARRAY(map)) {
- err = bpf_fd_array_map_lookup_elem(map, key, value);
- } else if (IS_FD_HASH(map)) {
- err = bpf_fd_htab_map_lookup_elem(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
- err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
- map->map_type == BPF_MAP_TYPE_STACK) {
- err = map->ops->map_peek_elem(map, value);
- } else {
- rcu_read_lock();
- if (map->ops->map_lookup_elem_sys_only)
- ptr = map->ops->map_lookup_elem_sys_only(map, key);
- else
- ptr = map->ops->map_lookup_elem(map, key);
- if (IS_ERR(ptr)) {
- err = PTR_ERR(ptr);
- } else if (!ptr) {
- err = -ENOENT;
- } else {
- err = 0;
- if (attr->flags & BPF_F_LOCK)
- /* lock 'ptr' and copy everything but lock */
- copy_map_value_locked(map, value, ptr, true);
- else
- copy_map_value(map, value, ptr);
- /* mask lock, since value wasn't zero inited */
- check_and_init_map_lock(map, value);
- }
- rcu_read_unlock();
- }
- this_cpu_dec(bpf_prog_active);
- preempt_enable();
-
-done:
+ err = bpf_map_copy_value(map, key, value, attr->flags, false);
if (err)
goto free_value;

@@ -932,16 +1021,6 @@ static int map_lookup_elem(union bpf_attr *attr)
return err;
}

-static void maybe_wait_bpf_programs(struct bpf_map *map)
-{
- /* Wait for any running BPF programs to complete so that
- * userspace, when we return to it, knows that all programs
- * that could be running use the new map value.
- */
- if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
- map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
- synchronize_rcu();
-}

#define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags

@@ -997,60 +1076,8 @@ static int map_update_elem(union bpf_attr *attr)
if (copy_from_user(value, uvalue, value_size) != 0)
goto free_value;

- /* Need to create a kthread, thus must support schedule */
- if (bpf_map_is_dev_bound(map)) {
- err = bpf_map_offload_update_elem(map, key, value, attr->flags);
- goto out;
- } else if (map->map_type == BPF_MAP_TYPE_CPUMAP ||
- map->map_type == BPF_MAP_TYPE_SOCKHASH ||
- map->map_type == BPF_MAP_TYPE_SOCKMAP) {
- err = map->ops->map_update_elem(map, key, value, attr->flags);
- goto out;
- } else if (IS_FD_PROG_ARRAY(map)) {
- err = bpf_fd_array_map_update_elem(map, f.file, key, value,
- attr->flags);
- goto out;
- }
+ err = bpf_map_update_value(map, f, key, value, attr->flags);

- /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
- * inside bpf map update or delete otherwise deadlocks are possible
- */
- preempt_disable();
- __this_cpu_inc(bpf_prog_active);
- if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
- err = bpf_percpu_hash_update(map, key, value, attr->flags);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
- err = bpf_percpu_array_update(map, key, value, attr->flags);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
- err = bpf_percpu_cgroup_storage_update(map, key, value,
- attr->flags);
- } else if (IS_FD_ARRAY(map)) {
- rcu_read_lock();
- err = bpf_fd_array_map_update_elem(map, f.file, key, value,
- attr->flags);
- rcu_read_unlock();
- } else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
- rcu_read_lock();
- err = bpf_fd_htab_map_update_elem(map, f.file, key, value,
- attr->flags);
- rcu_read_unlock();
- } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
- /* rcu_read_lock() is not needed */
- err = bpf_fd_reuseport_array_update_elem(map, key, value,
- attr->flags);
- } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
- map->map_type == BPF_MAP_TYPE_STACK) {
- err = map->ops->map_push_elem(map, value, attr->flags);
- } else {
- rcu_read_lock();
- err = map->ops->map_update_elem(map, key, value, attr->flags);
- rcu_read_unlock();
- }
- __this_cpu_dec(bpf_prog_active);
- preempt_enable();
- maybe_wait_bpf_programs(map);
-out:
free_value:
kfree(value);
free_key:
--
2.24.1.735.g03f4e72817-goog

2019-12-11 22:35:29

by Brian Vazquez

Subject: [PATCH v3 bpf-next 03/11] bpf: add generic support for update and delete batch ops

This commit adds generic support for update and delete batch ops that
can be used by almost all bpf maps. These commands share the same
UAPI attr that the lookup and lookup_and_delete batch ops use, and the
syscall commands are:

BPF_MAP_UPDATE_BATCH
BPF_MAP_DELETE_BATCH

The main difference between update/delete and lookup/lookup_and_delete
batch ops is that for update/delete the keys/values must be specified
by userspace, and because of that neither in_batch nor out_batch is used.

Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
---
include/linux/bpf.h | 10 ++++
include/uapi/linux/bpf.h | 2 +
kernel/bpf/syscall.c | 117 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a16f209255a59..851fb3ff084b0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -48,6 +48,10 @@ struct bpf_map_ops {
int (*map_lookup_and_delete_batch)(struct bpf_map *map,
const union bpf_attr *attr,
union bpf_attr __user *uattr);
+ int (*map_update_batch)(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+ int (*map_delete_batch)(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr);

/* funcs callable from userspace and from eBPF programs */
void *(*map_lookup_elem)(struct bpf_map *map, void *key);
@@ -849,6 +853,12 @@ int generic_map_lookup_batch(struct bpf_map *map,
int generic_map_lookup_and_delete_batch(struct bpf_map *map,
const union bpf_attr *attr,
union bpf_attr __user *uattr);
+int generic_map_update_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+int generic_map_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);

extern int sysctl_unprivileged_bpf_disabled;

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 36d3b885ddedd..dab24a763e4bb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -109,6 +109,8 @@ enum bpf_cmd {
BPF_BTF_GET_NEXT_ID,
BPF_MAP_LOOKUP_BATCH,
BPF_MAP_LOOKUP_AND_DELETE_BATCH,
+ BPF_MAP_UPDATE_BATCH,
+ BPF_MAP_DELETE_BATCH,
};

enum bpf_map_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 708aa89fe2308..8272e76183068 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1206,6 +1206,111 @@ static int map_get_next_key(union bpf_attr *attr)
return err;
}

+int generic_map_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ void __user *keys = u64_to_user_ptr(attr->batch.keys);
+ u32 cp, max_count;
+ int err = 0;
+ void *key;
+
+ if (attr->batch.elem_flags & ~BPF_F_LOCK)
+ return -EINVAL;
+
+ if ((attr->batch.elem_flags & BPF_F_LOCK) &&
+ !map_value_has_spin_lock(map)) {
+ return -EINVAL;
+ }
+
+ max_count = attr->batch.count;
+ if (!max_count)
+ return -EINVAL;
+
+ for (cp = 0; cp < max_count; cp++) {
+ key = __bpf_copy_key(keys + cp * map->key_size, map->key_size);
+ if (IS_ERR(key)) {
+ err = PTR_ERR(key);
+ break;
+ }
+
+ if (bpf_map_is_dev_bound(map)) {
+ err = bpf_map_offload_delete_elem(map, key);
+ break;
+ }
+
+ preempt_disable();
+ __this_cpu_inc(bpf_prog_active);
+ rcu_read_lock();
+ err = map->ops->map_delete_elem(map, key);
+ rcu_read_unlock();
+ __this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+ maybe_wait_bpf_programs(map);
+ if (err)
+ break;
+ }
+ if (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)))
+ err = -EFAULT;
+ return err;
+}
+
+int generic_map_update_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ void __user *values = u64_to_user_ptr(attr->batch.values);
+ void __user *keys = u64_to_user_ptr(attr->batch.keys);
+ u32 value_size, cp, max_count;
+ int ufd = attr->map_fd;
+ void *key, *value;
+ struct fd f;
+ int err = 0;
+
+ f = fdget(ufd);
+ if (attr->batch.elem_flags & ~BPF_F_LOCK)
+ return -EINVAL;
+
+ if ((attr->batch.elem_flags & BPF_F_LOCK) &&
+ !map_value_has_spin_lock(map)) {
+ return -EINVAL;
+ }
+
+ value_size = bpf_map_value_size(map);
+
+ max_count = attr->batch.count;
+ if (!max_count)
+ return 0;
+
+ value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
+ if (!value)
+ return -ENOMEM;
+
+ for (cp = 0; cp < max_count; cp++) {
+ key = __bpf_copy_key(keys + cp * map->key_size, map->key_size);
+ if (IS_ERR(key)) {
+ err = PTR_ERR(key);
+ break;
+ }
+ err = -EFAULT;
+ if (copy_from_user(value, values + cp * value_size, value_size))
+ break;
+
+ err = bpf_map_update_value(map, f, key, value,
+ attr->batch.elem_flags);
+
+ if (err)
+ break;
+ }
+
+ if (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)))
+ err = -EFAULT;
+
+ kfree(value);
+ kfree(key);
+ return err;
+}
+
#define MAP_LOOKUP_RETRIES 3

static int __generic_map_lookup_batch(struct bpf_map *map,
@@ -3203,8 +3308,12 @@ static int bpf_map_do_batch(const union bpf_attr *attr,

if (cmd == BPF_MAP_LOOKUP_BATCH)
BPF_DO_BATCH(map->ops->map_lookup_batch);
- else
+ else if (cmd == BPF_MAP_LOOKUP_AND_DELETE_BATCH)
BPF_DO_BATCH(map->ops->map_lookup_and_delete_batch);
+ else if (cmd == BPF_MAP_UPDATE_BATCH)
+ BPF_DO_BATCH(map->ops->map_update_batch);
+ else
+ BPF_DO_BATCH(map->ops->map_delete_batch);

err_put:
fdput(f);
@@ -3315,6 +3424,12 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
err = bpf_map_do_batch(&attr, uattr,
BPF_MAP_LOOKUP_AND_DELETE_BATCH);
break;
+ case BPF_MAP_UPDATE_BATCH:
+ err = bpf_map_do_batch(&attr, uattr, BPF_MAP_UPDATE_BATCH);
+ break;
+ case BPF_MAP_DELETE_BATCH:
+ err = bpf_map_do_batch(&attr, uattr, BPF_MAP_DELETE_BATCH);
+ break;
default:
err = -EINVAL;
break;
--
2.24.1.735.g03f4e72817-goog

2019-12-11 22:35:38

by Brian Vazquez

Subject: [PATCH v3 bpf-next 05/11] bpf: add generic_batch_ops to lpm_trie map

This adds the generic batch ops functionality to bpf lpm_trie.

Signed-off-by: Brian Vazquez <[email protected]>
---
kernel/bpf/lpm_trie.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 56e6c75d354d9..92c47b4f03337 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -743,4 +743,8 @@ const struct bpf_map_ops trie_map_ops = {
.map_update_elem = trie_update_elem,
.map_delete_elem = trie_delete_elem,
.map_check_btf = trie_check_btf,
+ .map_lookup_batch = generic_map_lookup_batch,
+ .map_lookup_and_delete_batch = generic_map_lookup_and_delete_batch,
+ .map_delete_batch = generic_map_delete_batch,
+ .map_update_batch = generic_map_update_batch,
};
--
2.24.1.735.g03f4e72817-goog

2019-12-11 22:35:38

by Brian Vazquez

Subject: [PATCH v3 bpf-next 06/11] bpf: add batch ops to all htab bpf map

From: Yonghong Song <[email protected]>

htab can't use the generic batch support due to some problematic
behaviours inherent to the data structure, e.g. while iterating the bpf
map a concurrent program might delete the next entry that the batch was
about to use, in which case there is no easy way to retrieve the next
entry. The issue has been discussed multiple times (see [1] and [2]).

The only way an hmap can be traversed without the problem described
above is by making sure that the map is traversed entire buckets at a
time. This commit implements those strict requirements for hmap; the
implementation follows the same interaction as the generic support,
with some exceptions:

- If the keys/values buffers are not big enough to hold a whole bucket,
-ENOSPC will be returned.
- out_batch contains the index of the next bucket in the iteration, not
the next key, but this is transparent to the user since the user
should never use out_batch for anything other than the bpf batch
syscalls.

Note that only the lookup and lookup_and_delete batch ops require the
hmap-specific implementation; the update/delete batch ops can use the
generic ones.

[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://lore.kernel.org/bpf/[email protected]/

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
kernel/bpf/hashtab.c | 242 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 242 insertions(+)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a62c8c97..fac107bdaf9ec 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -17,6 +17,17 @@
(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)

+#define BATCH_OPS(_name) \
+ .map_lookup_batch = \
+ _name##_map_lookup_batch, \
+ .map_lookup_and_delete_batch = \
+ _name##_map_lookup_and_delete_batch, \
+ .map_update_batch = \
+ generic_map_update_batch, \
+ .map_delete_batch = \
+ generic_map_delete_batch
+
+
struct bucket {
struct hlist_nulls_head head;
raw_spinlock_t lock;
@@ -1232,6 +1243,233 @@ static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
rcu_read_unlock();
}

+static int
+__htab_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr,
+ bool do_delete, bool is_lru_map,
+ bool is_percpu)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
+ void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
+ void __user *uvalues = u64_to_user_ptr(attr->batch.values);
+ void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
+ void *ubatch = u64_to_user_ptr(attr->batch.in_batch);
+ u64 elem_map_flags, map_flags;
+ struct hlist_nulls_head *head;
+ u32 batch, max_count, size;
+ struct hlist_nulls_node *n;
+ unsigned long flags;
+ struct htab_elem *l;
+ struct bucket *b;
+ int ret = 0;
+
+ max_count = attr->batch.count;
+ if (!max_count)
+ return 0;
+
+ elem_map_flags = attr->batch.elem_flags;
+ if ((elem_map_flags & ~BPF_F_LOCK) ||
+ ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
+ return -EINVAL;
+
+ map_flags = attr->batch.flags;
+ if (map_flags)
+ return -EINVAL;
+
+ batch = 0;
+ if (ubatch && copy_from_user(&batch, ubatch, sizeof(batch)))
+ return -EFAULT;
+
+ if (batch >= htab->n_buckets)
+ return -ENOENT;
+
+ /* We cannot do copy_from_user or copy_to_user inside
+ * the rcu_read_lock. Allocate enough space here.
+ */
+ key_size = htab->map.key_size;
+ roundup_key_size = round_up(htab->map.key_size, 8);
+ value_size = htab->map.value_size;
+ size = round_up(value_size, 8);
+ if (is_percpu)
+ value_size = size * num_possible_cpus();
+ keys = kvmalloc(key_size, GFP_USER | __GFP_NOWARN);
+ values = kvmalloc(value_size, GFP_USER | __GFP_NOWARN);
+ if (!keys || !values) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ dst_key = keys;
+ dst_val = values;
+ total = 0;
+
+ preempt_disable();
+ this_cpu_inc(bpf_prog_active);
+ rcu_read_lock();
+
+again:
+ b = &htab->buckets[batch];
+ head = &b->head;
+ raw_spin_lock_irqsave(&b->lock, flags);
+
+ bucket_cnt = 0;
+ hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
+ bucket_cnt++;
+
+ if (bucket_cnt > (max_count - total)) {
+ if (total == 0)
+ ret = -ENOSPC;
+ goto after_loop;
+ }
+
+ hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
+ memcpy(dst_key, l->key, key_size);
+
+ if (is_percpu) {
+ int off = 0, cpu;
+ void __percpu *pptr;
+
+ pptr = htab_elem_get_ptr(l, map->key_size);
+ for_each_possible_cpu(cpu) {
+ bpf_long_memcpy(dst_val + off,
+ per_cpu_ptr(pptr, cpu), size);
+ off += size;
+ }
+ } else {
+ value = l->key + roundup_key_size;
+ if (elem_map_flags & BPF_F_LOCK)
+ copy_map_value_locked(map, dst_val, value,
+ true);
+ else
+ copy_map_value(map, dst_val, value);
+ check_and_init_map_lock(map, dst_val);
+ }
+ if (do_delete) {
+ hlist_nulls_del_rcu(&l->hash_node);
+ if (is_lru_map)
+ bpf_lru_push_free(&htab->lru, &l->lru_node);
+ else
+ free_htab_elem(htab, l);
+ }
+ if (copy_to_user(ukeys + total * key_size, keys, key_size) ||
+ copy_to_user(uvalues + total * value_size, values,
+ value_size)) {
+ ret = -EFAULT;
+ goto after_loop;
+ }
+ total++;
+ }
+
+ batch++;
+ if (batch >= htab->n_buckets) {
+ ret = -ENOENT;
+ goto after_loop;
+ }
+
+ raw_spin_unlock_irqrestore(&b->lock, flags);
+ goto again;
+
+after_loop:
+ raw_spin_unlock_irqrestore(&b->lock, flags);
+
+ rcu_read_unlock();
+ this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+
+ if (ret && ret != -ENOENT)
+ goto out;
+
+ /* copy data back to user */
+ ubatch = u64_to_user_ptr(attr->batch.out_batch);
+ if (copy_to_user(ubatch, &batch, sizeof(batch)) ||
+ put_user(total, &uattr->batch.count))
+ ret = -EFAULT;
+
+out:
+ kvfree(keys);
+ kvfree(values);
+ return ret;
+}
+
+static int
+htab_percpu_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
+ false, true);
+}
+
+static int
+htab_percpu_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
+ false, true);
+}
+
+static int
+htab_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
+ false, false);
+}
+
+static int
+htab_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
+ false, false);
+}
+
+static int
+htab_map_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return generic_map_delete_batch(map, attr, uattr);
+}
+
+static int
+htab_lru_percpu_map_lookup_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
+ true, true);
+}
+
+static int
+htab_lru_percpu_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
+ true, true);
+}
+
+static int
+htab_lru_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
+ true, false);
+}
+
+static int
+htab_lru_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
+ true, false);
+}
+
const struct bpf_map_ops htab_map_ops = {
.map_alloc_check = htab_map_alloc_check,
.map_alloc = htab_map_alloc,
@@ -1242,6 +1480,7 @@ const struct bpf_map_ops htab_map_ops = {
.map_delete_elem = htab_map_delete_elem,
.map_gen_lookup = htab_map_gen_lookup,
.map_seq_show_elem = htab_map_seq_show_elem,
+ BATCH_OPS(htab),
};

const struct bpf_map_ops htab_lru_map_ops = {
@@ -1255,6 +1494,7 @@ const struct bpf_map_ops htab_lru_map_ops = {
.map_delete_elem = htab_lru_map_delete_elem,
.map_gen_lookup = htab_lru_map_gen_lookup,
.map_seq_show_elem = htab_map_seq_show_elem,
+ BATCH_OPS(htab_lru),
};

/* Called from eBPF program */
@@ -1368,6 +1608,7 @@ const struct bpf_map_ops htab_percpu_map_ops = {
.map_update_elem = htab_percpu_map_update_elem,
.map_delete_elem = htab_map_delete_elem,
.map_seq_show_elem = htab_percpu_map_seq_show_elem,
+ BATCH_OPS(htab_percpu),
};

const struct bpf_map_ops htab_lru_percpu_map_ops = {
@@ -1379,6 +1620,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
.map_update_elem = htab_lru_percpu_map_update_elem,
.map_delete_elem = htab_lru_map_delete_elem,
.map_seq_show_elem = htab_percpu_map_seq_show_elem,
+ BATCH_OPS(htab_lru_percpu),
};

static int fd_htab_map_alloc_check(union bpf_attr *attr)
--
2.24.1.735.g03f4e72817-goog

2019-12-11 22:35:44

by Brian Vazquez

Subject: [PATCH v3 bpf-next 07/11] tools/bpf: sync uapi header bpf.h

From: Yonghong Song <[email protected]>

Sync the uapi header include/uapi/linux/bpf.h to
tools/include/uapi/linux/bpf.h.

Signed-off-by: Yonghong Song <[email protected]>
---
tools/include/uapi/linux/bpf.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index dbbcf0b02970b..dab24a763e4bb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -107,6 +107,10 @@ enum bpf_cmd {
BPF_MAP_LOOKUP_AND_DELETE_ELEM,
BPF_MAP_FREEZE,
BPF_BTF_GET_NEXT_ID,
+ BPF_MAP_LOOKUP_BATCH,
+ BPF_MAP_LOOKUP_AND_DELETE_BATCH,
+ BPF_MAP_UPDATE_BATCH,
+ BPF_MAP_DELETE_BATCH,
};

enum bpf_map_type {
@@ -403,6 +407,23 @@ union bpf_attr {
__u64 flags;
};

+ struct { /* struct used by BPF_MAP_*_BATCH commands */
+ __aligned_u64 in_batch; /* start batch,
+ * NULL to start from beginning
+ */
+ __aligned_u64 out_batch; /* output: next start batch */
+ __aligned_u64 keys;
+ __aligned_u64 values;
+ __u32 count; /* input/output:
+ * input: # of key/value
+ * elements
+ * output: # of filled elements
+ */
+ __u32 map_fd;
+ __u64 elem_flags;
+ __u64 flags;
+ } batch;
+
struct { /* anonymous struct used by BPF_PROG_LOAD command */
__u32 prog_type; /* one of enum bpf_prog_type */
__u32 insn_cnt;
--
2.24.1.735.g03f4e72817-goog

2019-12-11 22:36:03

by Brian Vazquez

Subject: [PATCH v3 bpf-next 11/11] selftests/bpf: add batch ops testing to lpm_trie bpf map

Tested bpf_map_batch_ops functionality.

$ ./test_maps
...
test_trie_map_batch_ops:PASS
...

Signed-off-by: Brian Vazquez <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
---
.../bpf/map_tests/trie_map_batch_ops.c | 235 ++++++++++++++++++
1 file changed, 235 insertions(+)
create mode 100644 tools/testing/selftests/bpf/map_tests/trie_map_batch_ops.c

diff --git a/tools/testing/selftests/bpf/map_tests/trie_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/trie_map_batch_ops.c
new file mode 100644
index 0000000000000..927e898bea2d0
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/trie_map_batch_ops.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+#include <arpa/inet.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <bpf_util.h>
+#include <test_maps.h>
+
+static void map_batch_update(int map_fd, __u32 max_entries)
+{
+ int i, err;
+ struct bpf_lpm_trie_key *key_helper;
+ void *key, *values;
+ size_t key_size;
+
+ key_size = sizeof(*key_helper) + sizeof(__u32);
+ key = alloca(max_entries * key_size);
+ values = alloca(max_entries * sizeof(int));
+
+ for (i = 0; i < max_entries; i++) {
+ key_helper = (struct bpf_lpm_trie_key *)(key + i * key_size);
+ inet_pton(AF_INET, "192.168.0.0", key_helper->data);
+ key_helper->data[3] = i;
+ key_helper->prefixlen = 16 + i;
+ ((int *)values)[i] = i + 1;
+ }
+
+ err = bpf_map_update_batch(map_fd, key, values, &max_entries, 0, 0);
+ CHECK(err, "bpf_map_update_batch()", "error:%s\n", strerror(errno));
+}
+
+static void map_batch_verify(int *visited, __u32 max_entries,
+ void *keys, void *values)
+{
+ struct bpf_lpm_trie_key *key;
+ size_t key_size;
+ int i;
+
+ memset(visited, 0, max_entries * sizeof(*visited));
+ key_size = sizeof(*key) + sizeof(__u32);
+
+ for (i = 0; i < max_entries; i++) {
+ key = keys + i * key_size;
+ CHECK(key->data[3] + 1 != ((int *)values)[i],
+ "key/value checking",
+ "error: i %d key %d.%d.%d.%d value %d\n", i, key->data[0],
+ key->data[1], key->data[2], key->data[3],
+ ((int *)values)[i]);
+
+ visited[i] = 1;
+
+ }
+ for (i = 0; i < max_entries; i++) {
+ CHECK(visited[i] != 1, "visited checking",
+ "error: keys array at index %d missing\n", i);
+ }
+}
+
+void __test_trie_map_lookup_and_delete_batch(void)
+{
+ struct bpf_lpm_trie_key *batch, *key_p;
+ size_t key_size = sizeof(*key_p) + sizeof(__u32);
+ struct bpf_create_map_attr xattr = {
+ .name = "lpm_trie_map",
+ .map_type = BPF_MAP_TYPE_LPM_TRIE,
+ .key_size = key_size,
+ .value_size = sizeof(int),
+ };
+ __u32 count, total, total_success;
+ const __u32 max_entries = 10;
+ int map_fd, *visited, key;
+ int err, i, step;
+ int *values;
+ void *keys;
+
+ xattr.max_entries = max_entries;
+ xattr.map_flags = BPF_F_NO_PREALLOC;
+ map_fd = bpf_create_map_xattr(&xattr);
+ CHECK(map_fd == -1,
+ "bpf_create_map_xattr()", "error:%s\n", strerror(errno));
+
+ keys = malloc(max_entries * key_size);
+ batch = malloc(key_size);
+ values = malloc(max_entries * sizeof(int));
+ visited = malloc(max_entries * sizeof(int));
+ CHECK(!keys || !values || !visited, "malloc()",
+ "error:%s\n", strerror(errno));
+
+ /* test 1: lookup/delete on an empty LPM trie, -ENOENT */
+ count = max_entries;
+ err = bpf_map_lookup_and_delete_batch(map_fd, NULL, batch, keys,
+ values, &count, 0, 0);
+ CHECK((err && errno != ENOENT), "empty map",
+ "error: %s\n", strerror(errno));
+
+ /* populate elements to the map */
+ map_batch_update(map_fd, max_entries);
+
+ /* test 2: lookup/delete with count = 0, success */
+ count = 0;
+ err = bpf_map_lookup_and_delete_batch(map_fd, NULL, batch, keys,
+ values, &count, 0, 0);
+ CHECK(err, "count = 0", "error: %s\n", strerror(errno));
+
+ /* test 3: lookup/delete with count = max_entries, success */
+ memset(keys, 0, max_entries * key_size);
+ memset(values, 0, max_entries * sizeof(int));
+ count = max_entries;
+ err = bpf_map_lookup_and_delete_batch(map_fd, NULL, batch, keys,
+ values, &count, 0, 0);
+ CHECK((err && errno != ENOENT), "count = max_entries",
+ "error: %s\n", strerror(errno));
+ CHECK(count != max_entries, "count = max_entries",
+ "count = %u, max_entries = %u\n", count, max_entries);
+ map_batch_verify(visited, max_entries, keys, values);
+
+ /* bpf_map_get_next_key() should return -ENOENT for an empty map. */
+ err = bpf_map_get_next_key(map_fd, NULL, &key);
+ CHECK(!err, "bpf_map_get_next_key()", "error: %s\n", strerror(errno));
+
+ /* test 4: lookup/delete in a loop with various steps. */
+ total_success = 0;
+ for (step = 4; step < max_entries; step++) {
+ map_batch_update(map_fd, max_entries);
+ memset(keys, 0, max_entries * key_size);
+ memset(values, 0, max_entries * sizeof(int));
+ total = 0;
+ /* iteratively lookup elements with 'step'
+ * elements each
+ */
+ count = step;
+ while (true) {
+ err = bpf_map_lookup_batch(map_fd,
+ total ? batch : NULL,
+ batch,
+ keys + total * key_size,
+ values + total,
+ &count, 0, 0);
+ /* It is possible that we are failing due to buffer size
+ * not big enough. In such cases, let us just exit and
+ * go with larger steps. Note that a buffer size with
+ * max_entries should always work.
+ */
+ CHECK((err && errno != ENOENT), "lookup with steps",
+ "error: %s\n", strerror(errno));
+
+ total += count;
+ if (err)
+ break;
+ }
+
+ CHECK(total != max_entries, "lookup with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+ map_batch_verify(visited, max_entries, keys, values);
+
+ total = 0;
+ while (true) {
+ err = bpf_map_delete_batch(map_fd,
+ keys + total * key_size,
+ &count, 0, 0);
+ /* It is possible that we are failing due to buffer size
+ * not big enough. In such cases, let us just exit and
+ * go with larger steps. Note that a buffer size with
+ * max_entries should always work.
+ */
+ CHECK((err && errno != ENOENT), "delete with steps",
+ "error: %s\n", strerror(errno));
+ total += count;
+ if (err)
+ break;
+ }
+
+ CHECK(total != max_entries, "delete with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+
+ /* check map is empty, errno == ENOENT */
+ err = bpf_map_get_next_key(map_fd, NULL, keys);
+ CHECK(!err || errno != ENOENT, "bpf_map_get_next_key()",
+ "error: %s\n", strerror(errno));
+
+ map_batch_update(map_fd, max_entries);
+ memset(keys, 0, max_entries * key_size);
+ memset(values, 0, max_entries * sizeof(int));
+ total = 0;
+ i = 0;
+ /* iteratively lookup/delete elements with 'step'
+ * elements each
+ */
+ count = step;
+ while (true) {
+ err = bpf_map_lookup_and_delete_batch(map_fd,
+ total ? batch : NULL,
+ batch,
+ keys + total * key_size,
+ values + total,
+ &count, 0, 0);
+
+ CHECK((err && errno != ENOENT), "lookup/delete with steps",
+ "error: %s\n", strerror(errno));
+
+ total += count;
+ if (err)
+ break;
+ i++;
+ }
+
+ CHECK(total != max_entries, "lookup/delete with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+
+ map_batch_verify(visited, max_entries, keys, values);
+ err = bpf_map_get_next_key(map_fd, NULL, &key);
+ CHECK(!err, "bpf_map_get_next_key()", "error: %s\n",
+ strerror(errno));
+
+ total_success++;
+ }
+
+ CHECK(total_success == 0, "check total_success",
+ "unexpected failure\n");
+}
+
+void trie_map_batch_ops(void)
+{
+ __test_trie_map_lookup_and_delete_batch();
+ printf("test_%s:PASS\n", __func__);
+}
+
+void test_trie_map_batch_ops(void)
+{
+ trie_map_batch_ops();
+}
--
2.24.1.735.g03f4e72817-goog

2019-12-11 22:36:33

by Brian Vazquez

Subject: [PATCH v3 bpf-next 02/11] bpf: add generic support for lookup and lookup_and_delete batch ops

This commit introduces generic support for the bpf_map_lookup_batch and
bpf_map_lookup_and_delete_batch ops. This implementation can be used by
almost all bpf maps since its core relies on the existing
map_get_next_key, map_lookup_elem and map_delete_elem
functions. The bpf syscall subcommands introduced are:

BPF_MAP_LOOKUP_BATCH
BPF_MAP_LOOKUP_AND_DELETE_BATCH

The UAPI attribute is:

struct { /* struct used by BPF_MAP_*_BATCH commands */
__aligned_u64 in_batch; /* start batch,
* NULL to start from beginning
*/
__aligned_u64 out_batch; /* output: next start batch */
__aligned_u64 keys;
__aligned_u64 values;
__u32 count; /* input/output:
* input: # of key/value
* elements
* output: # of filled elements
*/
__u32 map_fd;
__u64 elem_flags;
__u64 flags;
} batch;

in_batch/out_batch are opaque values used to communicate between
user and kernel space; both must be key_size bytes long.

To start iterating from the beginning, in_batch must be NULL;
count is the number of key/value elements to retrieve. Note that the 'keys'
buffer must be key_size * count bytes and the 'values' buffer
must be value_size * count bytes, where value_size must be aligned to 8 bytes
by userspace if it's dealing with percpu maps. 'count' will contain the
number of keys/values successfully retrieved. Note that 'count' is an
input/output variable and may contain a lower value after a call.

If there are no more entries to retrieve, ENOENT will be returned. Even
when the error is ENOENT, count may be > 0 if some values were copied
before the entries ran out.

Note that if the return code is an error and not -EFAULT,
count indicates the number of elements successfully processed.

Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
---
include/linux/bpf.h | 11 +++
include/uapi/linux/bpf.h | 19 +++++
kernel/bpf/syscall.c | 172 +++++++++++++++++++++++++++++++++++++++
3 files changed, 202 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 35903f148be59..a16f209255a59 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -43,6 +43,11 @@ struct bpf_map_ops {
int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
void (*map_release_uref)(struct bpf_map *map);
void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
+ int (*map_lookup_batch)(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+ int (*map_lookup_and_delete_batch)(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);

/* funcs callable from userspace and from eBPF programs */
void *(*map_lookup_elem)(struct bpf_map *map, void *key);
@@ -838,6 +843,12 @@ void *bpf_map_area_alloc(u64 size, int numa_node);
void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
void bpf_map_area_free(void *base);
void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
+int generic_map_lookup_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+int generic_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);

extern int sysctl_unprivileged_bpf_disabled;

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index dbbcf0b02970b..36d3b885ddedd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -107,6 +107,8 @@ enum bpf_cmd {
BPF_MAP_LOOKUP_AND_DELETE_ELEM,
BPF_MAP_FREEZE,
BPF_BTF_GET_NEXT_ID,
+ BPF_MAP_LOOKUP_BATCH,
+ BPF_MAP_LOOKUP_AND_DELETE_BATCH,
};

enum bpf_map_type {
@@ -403,6 +405,23 @@ union bpf_attr {
__u64 flags;
};

+ struct { /* struct used by BPF_MAP_*_BATCH commands */
+ __aligned_u64 in_batch; /* start batch,
+ * NULL to start from beginning
+ */
+ __aligned_u64 out_batch; /* output: next start batch */
+ __aligned_u64 keys;
+ __aligned_u64 values;
+ __u32 count; /* input/output:
+ * input: # of key/value
+ * elements
+ * output: # of filled elements
+ */
+ __u32 map_fd;
+ __u64 elem_flags;
+ __u64 flags;
+ } batch;
+
struct { /* anonymous struct used by BPF_PROG_LOAD command */
__u32 prog_type; /* one of enum bpf_prog_type */
__u32 insn_cnt;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2530266fa6477..708aa89fe2308 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1206,6 +1206,120 @@ static int map_get_next_key(union bpf_attr *attr)
return err;
}

+#define MAP_LOOKUP_RETRIES 3
+
+static int __generic_map_lookup_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr,
+ bool do_delete)
+{
+ void __user *ubatch = u64_to_user_ptr(attr->batch.in_batch);
+ void __user *uobatch = u64_to_user_ptr(attr->batch.out_batch);
+ void __user *values = u64_to_user_ptr(attr->batch.values);
+ void __user *keys = u64_to_user_ptr(attr->batch.keys);
+ void *buf, *prev_key, *key, *value;
+ u32 value_size, cp, max_count;
+ bool first_key = false;
+ int err, retry = MAP_LOOKUP_RETRIES;
+
+ if (attr->batch.elem_flags & ~BPF_F_LOCK)
+ return -EINVAL;
+
+ if ((attr->batch.elem_flags & BPF_F_LOCK) &&
+ !map_value_has_spin_lock(map))
+ return -EINVAL;
+
+ value_size = bpf_map_value_size(map);
+
+ max_count = attr->batch.count;
+ if (!max_count)
+ return 0;
+
+ buf = kmalloc(map->key_size + value_size, GFP_USER | __GFP_NOWARN);
+ if (!buf)
+ return -ENOMEM;
+
+ err = -EFAULT;
+ first_key = false;
+ if (ubatch && copy_from_user(buf, ubatch, map->key_size))
+ goto free_buf;
+ key = buf;
+ value = key + map->key_size;
+ if (!ubatch) {
+ prev_key = NULL;
+ first_key = true;
+ }
+
+ for (cp = 0; cp < max_count;) {
+ if (cp || first_key) {
+ rcu_read_lock();
+ err = map->ops->map_get_next_key(map, prev_key, key);
+ rcu_read_unlock();
+ if (err)
+ break;
+ }
+ err = bpf_map_copy_value(map, key, value,
+ attr->batch.elem_flags, do_delete);
+
+ if (err == -ENOENT) {
+ if (retry) {
+ retry--;
+ continue;
+ }
+ err = -EINTR;
+ break;
+ }
+
+ if (err)
+ goto free_buf;
+
+ if (copy_to_user(keys + cp * map->key_size, key,
+ map->key_size)) {
+ err = -EFAULT;
+ goto free_buf;
+ }
+ if (copy_to_user(values + cp * value_size, value, value_size)) {
+ err = -EFAULT;
+ goto free_buf;
+ }
+
+ prev_key = key;
+ retry = MAP_LOOKUP_RETRIES;
+ cp++;
+ }
+
+ if (!err) {
+ rcu_read_lock();
+ err = map->ops->map_get_next_key(map, prev_key, key);
+ rcu_read_unlock();
+ }
+
+ if (err)
+ memset(key, 0, map->key_size);
+
+ if ((copy_to_user(&uattr->batch.count, &cp, sizeof(cp)) ||
+ (copy_to_user(uobatch, key, map->key_size))))
+ err = -EFAULT;
+
+free_buf:
+ kfree(buf);
+ return err;
+}
+
+int generic_map_lookup_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __generic_map_lookup_batch(map, attr, uattr, false);
+}
+
+int generic_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __generic_map_lookup_batch(map, attr, uattr, true);
+}
+
#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value

static int map_lookup_and_delete_elem(union bpf_attr *attr)
@@ -3046,6 +3160,57 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
return err;
}

+#define BPF_MAP_BATCH_LAST_FIELD batch.flags
+
+#define BPF_DO_BATCH(fn) \
+ do { \
+ if (!fn) { \
+ err = -ENOTSUPP; \
+ goto err_put; \
+ } \
+ err = fn(map, attr, uattr); \
+ } while (0)
+
+static int bpf_map_do_batch(const union bpf_attr *attr,
+ union bpf_attr __user *uattr,
+ int cmd)
+{
+ struct bpf_map *map;
+ int err, ufd;
+ struct fd f;
+
+ if (CHECK_ATTR(BPF_MAP_BATCH))
+ return -EINVAL;
+
+ ufd = attr->batch.map_fd;
+ f = fdget(ufd);
+ map = __bpf_map_get(f);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+
+ if ((cmd == BPF_MAP_LOOKUP_BATCH ||
+ cmd == BPF_MAP_LOOKUP_AND_DELETE_BATCH) &&
+ !(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
+ err = -EPERM;
+ goto err_put;
+ }
+
+ if (cmd != BPF_MAP_LOOKUP_BATCH &&
+ !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
+ err = -EPERM;
+ goto err_put;
+ }
+
+ if (cmd == BPF_MAP_LOOKUP_BATCH)
+ BPF_DO_BATCH(map->ops->map_lookup_batch);
+ else
+ BPF_DO_BATCH(map->ops->map_lookup_and_delete_batch);
+
+err_put:
+ fdput(f);
+ return err;
+}
+
SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
{
union bpf_attr attr = {};
@@ -3143,6 +3308,13 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
err = map_lookup_and_delete_elem(&attr);
break;
+ case BPF_MAP_LOOKUP_BATCH:
+ err = bpf_map_do_batch(&attr, uattr, BPF_MAP_LOOKUP_BATCH);
+ break;
+ case BPF_MAP_LOOKUP_AND_DELETE_BATCH:
+ err = bpf_map_do_batch(&attr, uattr,
+ BPF_MAP_LOOKUP_AND_DELETE_BATCH);
+ break;
default:
err = -EINVAL;
break;
--
2.24.1.735.g03f4e72817-goog

2019-12-11 22:36:40

by Brian Vazquez

Subject: [PATCH v3 bpf-next 08/11] libbpf: add libbpf support to batch ops

From: Yonghong Song <[email protected]>

Added four libbpf API functions to support map batch operations:
. int bpf_map_delete_batch( ... )
. int bpf_map_lookup_batch( ... )
. int bpf_map_lookup_and_delete_batch( ... )
. int bpf_map_update_batch( ... )

Signed-off-by: Yonghong Song <[email protected]>
---
tools/lib/bpf/bpf.c | 61 ++++++++++++++++++++++++++++++++++++++++
tools/lib/bpf/bpf.h | 14 +++++++++
tools/lib/bpf/libbpf.map | 4 +++
3 files changed, 79 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 98596e15390fb..933a36a33d5a0 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -443,6 +443,67 @@ int bpf_map_freeze(int fd)
return sys_bpf(BPF_MAP_FREEZE, &attr, sizeof(attr));
}

+static int bpf_map_batch_common(int cmd, int fd, void *in_batch,
+ void *out_batch, void *keys, void *values,
+ __u32 *count, __u64 elem_flags,
+ __u64 flags)
+{
+ union bpf_attr attr = {};
+ int ret;
+
+ memset(&attr, 0, sizeof(attr));
+ attr.batch.map_fd = fd;
+ attr.batch.in_batch = ptr_to_u64(in_batch);
+ attr.batch.out_batch = ptr_to_u64(out_batch);
+ attr.batch.keys = ptr_to_u64(keys);
+ attr.batch.values = ptr_to_u64(values);
+ if (count)
+ attr.batch.count = *count;
+ attr.batch.elem_flags = elem_flags;
+ attr.batch.flags = flags;
+
+ ret = sys_bpf(cmd, &attr, sizeof(attr));
+ if (count)
+ *count = attr.batch.count;
+
+ return ret;
+}
+
+int bpf_map_delete_batch(int fd, void *keys, __u32 *count,
+ __u64 elem_flags, __u64 flags)
+{
+ return bpf_map_batch_common(BPF_MAP_DELETE_BATCH, fd, NULL,
+ NULL, keys, NULL, count,
+ elem_flags, flags);
+}
+
+int bpf_map_lookup_batch(int fd, void *in_batch, void *out_batch, void *keys,
+ void *values, __u32 *count,
+ __u64 elem_flags, __u64 flags)
+{
+ return bpf_map_batch_common(BPF_MAP_LOOKUP_BATCH, fd, in_batch,
+ out_batch, keys, values, count,
+ elem_flags, flags);
+}
+
+int bpf_map_lookup_and_delete_batch(int fd, void *in_batch, void *out_batch,
+ void *keys, void *values,
+ __u32 *count, __u64 elem_flags,
+ __u64 flags)
+{
+ return bpf_map_batch_common(BPF_MAP_LOOKUP_AND_DELETE_BATCH,
+ fd, in_batch, out_batch, keys, values,
+ count, elem_flags, flags);
+}
+
+int bpf_map_update_batch(int fd, void *keys, void *values, __u32 *count,
+ __u64 elem_flags, __u64 flags)
+{
+ return bpf_map_batch_common(BPF_MAP_UPDATE_BATCH,
+ fd, NULL, NULL, keys, values,
+ count, elem_flags, flags);
+}
+
int bpf_obj_pin(int fd, const char *pathname)
{
union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 3c791fa8e68e8..51c577393ec48 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -126,6 +126,20 @@ LIBBPF_API int bpf_map_lookup_and_delete_elem(int fd, const void *key,
LIBBPF_API int bpf_map_delete_elem(int fd, const void *key);
LIBBPF_API int bpf_map_get_next_key(int fd, const void *key, void *next_key);
LIBBPF_API int bpf_map_freeze(int fd);
+LIBBPF_API int bpf_map_delete_batch(int fd, void *keys,
+ __u32 *count, __u64 elem_flags,
+ __u64 flags);
+LIBBPF_API int bpf_map_lookup_batch(int fd, void *in_batch, void *out_batch,
+ void *keys, void *values, __u32 *count,
+ __u64 elem_flags, __u64 flags);
+LIBBPF_API int bpf_map_lookup_and_delete_batch(int fd, void *in_batch,
+ void *out_batch, void *keys,
+ void *values, __u32 *count,
+ __u64 elem_flags, __u64 flags);
+LIBBPF_API int bpf_map_update_batch(int fd, void *keys, void *values,
+ __u32 *count, __u64 elem_flags,
+ __u64 flags);
+
LIBBPF_API int bpf_obj_pin(int fd, const char *pathname);
LIBBPF_API int bpf_obj_get(const char *pathname);
LIBBPF_API int bpf_prog_attach(int prog_fd, int attachable_fd,
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 495df575f87f8..4efbf25888eb0 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -210,4 +210,8 @@ LIBBPF_0.0.6 {
} LIBBPF_0.0.5;

LIBBPF_0.0.7 {
+ bpf_map_delete_batch;
+ bpf_map_lookup_and_delete_batch;
+ bpf_map_lookup_batch;
+ bpf_map_update_batch;
} LIBBPF_0.0.6;
--
2.24.1.735.g03f4e72817-goog

2019-12-11 22:36:59

by Brian Vazquez

Subject: [PATCH v3 bpf-next 04/11] bpf: add lookup and update batch ops to arraymap

This adds the generic batch ops functionality to the bpf arraymap. Note
that since deletion is not a valid operation for an arraymap, only the
lookup and update batch ops are added.

Signed-off-by: Brian Vazquez <[email protected]>
---
kernel/bpf/arraymap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index f0d19bbb9211e..95d77770353c9 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -503,6 +503,8 @@ const struct bpf_map_ops array_map_ops = {
.map_mmap = array_map_mmap,
.map_seq_show_elem = array_map_seq_show_elem,
.map_check_btf = array_map_check_btf,
+ .map_lookup_batch = generic_map_lookup_batch,
+ .map_update_batch = generic_map_update_batch,
};

const struct bpf_map_ops percpu_array_map_ops = {
--
2.24.1.735.g03f4e72817-goog

2019-12-11 22:37:00

by Brian Vazquez

Subject: [PATCH v3 bpf-next 09/11] selftests/bpf: add batch ops testing for htab and htab_percpu map

From: Yonghong Song <[email protected]>

Tested bpf_map_lookup_batch(), bpf_map_lookup_and_delete_batch(),
bpf_map_update_batch(), and bpf_map_delete_batch() functionality.

$ ./test_maps
...
test_htab_map_batch_ops:PASS
test_htab_percpu_map_batch_ops:PASS
...

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
.../bpf/map_tests/htab_map_batch_ops.c | 269 ++++++++++++++++++
1 file changed, 269 insertions(+)
create mode 100644 tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c

diff --git a/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
new file mode 100644
index 0000000000000..dabc4d420a10e
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
@@ -0,0 +1,269 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <bpf_util.h>
+#include <test_maps.h>
+
+static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
+ void *values, bool is_pcpu)
+{
+ typedef BPF_DECLARE_PERCPU(int, value);
+ int i, j, err;
+ value *v;
+
+ if (is_pcpu)
+ v = (value *)values;
+
+ for (i = 0; i < max_entries; i++) {
+ keys[i] = i + 1;
+ if (is_pcpu)
+ for (j = 0; j < bpf_num_possible_cpus(); j++)
+ bpf_percpu(v[i], j) = i + 2 + j;
+ else
+ ((int *)values)[i] = i + 2;
+ }
+
+ err = bpf_map_update_batch(map_fd, keys, values, &max_entries, 0, 0);
+ CHECK(err, "bpf_map_update_batch()", "error:%s\n", strerror(errno));
+}
+
+static void map_batch_verify(int *visited, __u32 max_entries,
+ int *keys, void *values, bool is_pcpu)
+{
+ typedef BPF_DECLARE_PERCPU(int, value);
+ value *v;
+ int i, j;
+
+ if (is_pcpu)
+ v = (value *)values;
+
+ memset(visited, 0, max_entries * sizeof(*visited));
+ for (i = 0; i < max_entries; i++) {
+
+ if (is_pcpu) {
+ for (j = 0; j < bpf_num_possible_cpus(); j++) {
+ CHECK(keys[i] + 1 + j != bpf_percpu(v[i], j),
+ "key/value checking",
+ "error: i %d j %d key %d value %d\n",
+ i, j, keys[i], bpf_percpu(v[i], j));
+ }
+ } else {
+ CHECK(keys[i] + 1 != ((int *)values)[i],
+ "key/value checking",
+ "error: i %d key %d value %d\n", i, keys[i],
+ ((int *)values)[i]);
+ }
+
+ visited[i] = 1;
+
+ }
+ for (i = 0; i < max_entries; i++) {
+ CHECK(visited[i] != 1, "visited checking",
+ "error: keys array at index %d missing\n", i);
+ }
+}
+
+void __test_map_lookup_and_delete_batch(bool is_pcpu)
+{
+ int map_type = is_pcpu ? BPF_MAP_TYPE_PERCPU_HASH : BPF_MAP_TYPE_HASH;
+ struct bpf_create_map_attr xattr = {
+ .name = "hash_map",
+ .map_type = map_type,
+ .key_size = sizeof(int),
+ .value_size = sizeof(int),
+ };
+ __u32 batch, count, total, total_success;
+ typedef BPF_DECLARE_PERCPU(int, value);
+ int map_fd, *keys, *visited, key;
+ const __u32 max_entries = 10;
+ int err, step, value_size;
+ value pcpu_values[10];
+ bool nospace_err;
+ void *values;
+
+ xattr.max_entries = max_entries;
+ map_fd = bpf_create_map_xattr(&xattr);
+ CHECK(map_fd == -1,
+ "bpf_create_map_xattr()", "error:%s\n", strerror(errno));
+
+ value_size = is_pcpu ? sizeof(value) : sizeof(int);
+ keys = malloc(max_entries * sizeof(int));
+ if (is_pcpu)
+ values = pcpu_values;
+ else
+ values = malloc(max_entries * sizeof(int));
+ visited = malloc(max_entries * sizeof(int));
+ CHECK(!keys || !values || !visited, "malloc()",
+ "error:%s\n", strerror(errno));
+
+ /* test 1: lookup/delete an empty hash table, -ENOENT */
+ count = max_entries;
+ err = bpf_map_lookup_and_delete_batch(map_fd, NULL, &batch, keys,
+ values, &count, 0, 0);
+ CHECK((err && errno != ENOENT), "empty map",
+ "error: %s\n", strerror(errno));
+
+ /* populate elements to the map */
+ map_batch_update(map_fd, max_entries, keys, values, is_pcpu);
+
+ /* test 2: lookup/delete with count = 0, success */
+ count = 0;
+ err = bpf_map_lookup_and_delete_batch(map_fd, NULL, &batch, keys,
+ values, &count, 0, 0);
+ CHECK(err, "count = 0", "error: %s\n", strerror(errno));
+
+ /* test 3: lookup/delete with count = max_entries, success */
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * value_size);
+ count = max_entries;
+ err = bpf_map_lookup_and_delete_batch(map_fd, NULL, &batch, keys,
+ values, &count, 0, 0);
+ CHECK((err && errno != ENOENT), "count = max_entries",
+ "error: %s\n", strerror(errno));
+ CHECK(count != max_entries, "count = max_entries",
+ "count = %u, max_entries = %u\n", count, max_entries);
+ map_batch_verify(visited, max_entries, keys, values, is_pcpu);
+
+ /* bpf_map_get_next_key() should return -ENOENT for an empty map. */
+ err = bpf_map_get_next_key(map_fd, NULL, &key);
+ CHECK(!err, "bpf_map_get_next_key()", "error: %s\n", strerror(errno));
+
+ /* test 4: lookup/delete in a loop with various steps. */
+ total_success = 0;
+ for (step = 1; step < max_entries; step++) {
+ map_batch_update(map_fd, max_entries, keys, values, is_pcpu);
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * value_size);
+ total = 0;
+ /* iteratively lookup/delete elements with 'step'
+ * elements each
+ */
+ count = step;
+ nospace_err = false;
+ while (true) {
+ err = bpf_map_lookup_batch(map_fd,
+ total ? &batch : NULL,
+ &batch, keys + total,
+ values +
+ total * value_size,
+ &count, 0, 0);
+ /* It is possible that we are failing due to buffer size
+ * not big enough. In such cases, let us just exit and
+ * go with larger steps. Note that a buffer size with
+ * max_entries should always work.
+ */
+ if (err && errno == ENOSPC) {
+ nospace_err = true;
+ break;
+ }
+
+ CHECK((err && errno != ENOENT), "lookup with steps",
+ "error: %s\n", strerror(errno));
+
+ total += count;
+ if (err)
+ break;
+
+ }
+ if (nospace_err == true)
+ continue;
+
+ CHECK(total != max_entries, "lookup with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+ map_batch_verify(visited, max_entries, keys, values, is_pcpu);
+
+ total = 0;
+ count = step;
+ while (true) {
+ err = bpf_map_delete_batch(map_fd,
+ keys + total,
+ &count, 0, 0);
+ CHECK((err && errno != ENOENT), "delete with steps",
+ "error: %s\n", strerror(errno));
+ total += count;
+ if (err)
+ break;
+ }
+ CHECK(total != max_entries, "delete with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+
+ /* check map is empty, errno == ENOENT */
+ err = bpf_map_get_next_key(map_fd, NULL, &key);
+ CHECK(!err || errno != ENOENT, "bpf_map_get_next_key()", "error: %s\n",
+ strerror(errno));
+
+ /* iteratively lookup/delete elements with 'step'
+ * elements each
+ */
+ map_batch_update(map_fd, max_entries, keys, values, is_pcpu);
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * value_size);
+ total = 0;
+ count = step;
+ nospace_err = false;
+ while (true) {
+ err = bpf_map_lookup_and_delete_batch(map_fd,
+ total ? &batch : NULL,
+ &batch, keys + total,
+ values +
+ total * value_size,
+ &count, 0, 0);
+ /* It is possible that we are failing due to buffer size
+ * not big enough. In such cases, let us just exit and
+ * go with larger steps. Note that a buffer size with
+ * max_entries should always work.
+ */
+ if (err && errno == ENOSPC) {
+ nospace_err = true;
+ break;
+ }
+
+ CHECK((err && errno != ENOENT), "lookup/delete with steps",
+ "error: %s\n", strerror(errno));
+
+ total += count;
+ if (err)
+ break;
+ }
+
+ if (nospace_err == true)
+ continue;
+
+ CHECK(total != max_entries, "lookup/delete with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+
+ map_batch_verify(visited, max_entries, keys, values, is_pcpu);
+ err = bpf_map_get_next_key(map_fd, NULL, &key);
+ CHECK(!err, "bpf_map_get_next_key()", "error: %s\n",
+ strerror(errno));
+
+ total_success++;
+ }
+
+ CHECK(total_success == 0, "check total_success",
+ "unexpected failure\n");
+}
+
+void htab_map_batch_ops(void)
+{
+ __test_map_lookup_and_delete_batch(false);
+ printf("test_%s:PASS\n", __func__);
+}
+
+void htab_percpu_map_batch_ops(void)
+{
+ __test_map_lookup_and_delete_batch(true);
+ printf("test_%s:PASS\n", __func__);
+}
+
+void test_htab_map_batch_ops(void)
+{
+ htab_map_batch_ops();
+ htab_percpu_map_batch_ops();
+}
--
2.24.1.735.g03f4e72817-goog

2019-12-11 22:37:25

by Brian Vazquez

Subject: [PATCH v3 bpf-next 10/11] selftests/bpf: add batch ops testing to array bpf map

Tested bpf_map_lookup_batch() and bpf_map_update_batch()
functionality.

$ ./test_maps
...
test_array_map_batch_ops:PASS
...

Signed-off-by: Brian Vazquez <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
---
.../bpf/map_tests/array_map_batch_ops.c | 119 ++++++++++++++++++
1 file changed, 119 insertions(+)
create mode 100644 tools/testing/selftests/bpf/map_tests/array_map_batch_ops.c

diff --git a/tools/testing/selftests/bpf/map_tests/array_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/array_map_batch_ops.c
new file mode 100644
index 0000000000000..ed90bfd03d762
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/array_map_batch_ops.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <test_maps.h>
+
+static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
+ int *values)
+{
+ int i, err;
+
+ for (i = 0; i < max_entries; i++) {
+ keys[i] = i;
+ values[i] = i + 1;
+ }
+
+ err = bpf_map_update_batch(map_fd, keys, values, &max_entries, 0, 0);
+ CHECK(err, "bpf_map_update_batch()", "error:%s\n", strerror(errno));
+}
+
+static void map_batch_verify(int *visited, __u32 max_entries,
+ int *keys, int *values)
+{
+ int i;
+
+ memset(visited, 0, max_entries * sizeof(*visited));
+ for (i = 0; i < max_entries; i++) {
+ CHECK(keys[i] + 1 != values[i], "key/value checking",
+ "error: i %d key %d value %d\n", i, keys[i], values[i]);
+ visited[i] = 1;
+ }
+ for (i = 0; i < max_entries; i++) {
+ CHECK(visited[i] != 1, "visited checking",
+ "error: keys array at index %d missing\n", i);
+ }
+}
+
+void test_array_map_batch_ops(void)
+{
+ struct bpf_create_map_attr xattr = {
+ .name = "array_map",
+ .map_type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(int),
+ };
+ int map_fd, *keys, *values, *visited;
+ __u32 count, total, total_success;
+ const __u32 max_entries = 10;
+ int err, i, step;
+ bool nospace_err;
+ __u64 batch = 0;
+
+ xattr.max_entries = max_entries;
+ map_fd = bpf_create_map_xattr(&xattr);
+ CHECK(map_fd == -1,
+ "bpf_create_map_xattr()", "error:%s\n", strerror(errno));
+
+ keys = malloc(max_entries * sizeof(int));
+ values = malloc(max_entries * sizeof(int));
+ visited = malloc(max_entries * sizeof(int));
+ CHECK(!keys || !values || !visited, "malloc()", "error:%s\n",
+ strerror(errno));
+
+ /* populate elements to the map */
+ map_batch_update(map_fd, max_entries, keys, values);
+
+ /* test 1: lookup in a loop with various steps. */
+ total_success = 0;
+ for (step = 1; step < max_entries; step++) {
+ map_batch_update(map_fd, max_entries, keys, values);
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * sizeof(*values));
+ batch = 0;
+ total = 0;
+ i = 0;
+ /* iteratively lookup/delete elements with 'step'
+ * elements each.
+ */
+ count = step;
+ nospace_err = false;
+ while (true) {
+ err = bpf_map_lookup_batch(map_fd,
+ total ? &batch : NULL, &batch,
+ keys + total,
+ values + total,
+ &count, 0, 0);
+
+ CHECK((err && errno != ENOENT), "lookup with steps",
+ "error: %s\n", strerror(errno));
+
+ total += count;
+
+ if (err)
+ break;
+
+ i++;
+ }
+
+ if (nospace_err == true)
+ continue;
+
+ CHECK(total != max_entries, "lookup with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+
+ map_batch_verify(visited, max_entries, keys, values);
+
+ total_success++;
+ }
+
+ CHECK(total_success == 0, "check total_success",
+ "unexpected failure\n");
+
+ printf("%s:PASS\n", __func__);
+}
--
2.24.1.735.g03f4e72817-goog

2019-12-13 05:54:35

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 00/11] add bpf batch ops to process more than 1 elem

On Wed, Dec 11, 2019 at 2:34 PM Brian Vazquez <[email protected]> wrote:
>
> This patch series introduces batch ops that can be added to bpf maps to
> lookup/lookup_and_delete/update/delete more than 1 element at a time;
> this is especially useful when syscall overhead is a problem, and in the
> case of hmaps it provides a reliable way of traversing them.
>
> The implementation includes a generic approach that could potentially be
> used by any bpf map and adds it to arraymap; it also includes the specific
> implementation for hashmaps, which are traversed using buckets instead
> of keys.
>
> The bpf syscall subcommands introduced are:
>
> BPF_MAP_LOOKUP_BATCH
> BPF_MAP_LOOKUP_AND_DELETE_BATCH
> BPF_MAP_UPDATE_BATCH
> BPF_MAP_DELETE_BATCH
>
> The UAPI attribute is:
>
> struct { /* struct used by BPF_MAP_*_BATCH commands */
> __aligned_u64 in_batch; /* start batch,
> * NULL to start from beginning
> */
> __aligned_u64 out_batch; /* output: next start batch */
> __aligned_u64 keys;
> __aligned_u64 values;
> __u32 count; /* input/output:
> * input: # of key/value
> * elements
> * output: # of filled elements
> */
> __u32 map_fd;
> __u64 elem_flags;
> __u64 flags;
> } batch;
>
>
> in_batch and out_batch are only used for lookup and lookup_and_delete since
> those are the only two operations that attempt to traverse the map.
>
> update/delete batch ops should provide the keys/values that the user
> wants to modify.
>
> Here are the previous discussions on the batch processing:
> - https://lore.kernel.org/bpf/[email protected]/
> - https://lore.kernel.org/bpf/[email protected]/
> - https://lore.kernel.org/bpf/[email protected]/
>
> Changelog since v2:
> - Add generic batch support for lpm_trie and test it (Yonghong Song)
> - Use define MAP_LOOKUP_RETRIES for retries (John Fastabend)
> - Return errors directly and remove labels (Yonghong Song)
> - Insert new API functions into libbpf alphabetically (Yonghong Song)
> - Change hlist_nulls_for_each_entry_rcu to
> hlist_nulls_for_each_entry_safe in htab batch ops (Yonghong Song)

Yonghong,
please review.

2019-12-13 17:09:24

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 01/11] bpf: add bpf_map_{value_size,update_value,map_copy_value} functions



On 12/11/19 2:33 PM, Brian Vazquez wrote:
> This commit moves reusable code from map_lookup_elem and map_update_elem
> to avoid code duplication in kernel/bpf/syscall.c.
>
> Signed-off-by: Brian Vazquez <[email protected]>
> Acked-by: John Fastabend <[email protected]>

Acked-by: Yonghong Song <[email protected]>

> ---
> kernel/bpf/syscall.c | 275 ++++++++++++++++++++++++-------------------
> 1 file changed, 151 insertions(+), 124 deletions(-)
>

2019-12-13 17:29:58

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 02/11] bpf: add generic support for lookup and lookup_and_delete batch ops



On 12/11/19 2:33 PM, Brian Vazquez wrote:
> This commit introduces generic support for the bpf_map_lookup_batch and
> bpf_map_lookup_and_delete_batch ops. This implementation can be used by
> almost all the bpf maps since its core implementation relies on the
> existing map_get_next_key, map_lookup_elem and map_delete_elem
> functions. The bpf syscall subcommands introduced are:
>
> BPF_MAP_LOOKUP_BATCH
> BPF_MAP_LOOKUP_AND_DELETE_BATCH
>
> The UAPI attribute is:
>
> struct { /* struct used by BPF_MAP_*_BATCH commands */
> __aligned_u64 in_batch; /* start batch,
> * NULL to start from beginning
> */
> __aligned_u64 out_batch; /* output: next start batch */
> __aligned_u64 keys;
> __aligned_u64 values;
> __u32 count; /* input/output:
> * input: # of key/value
> * elements
> * output: # of filled elements
> */
> __u32 map_fd;
> __u64 elem_flags;
> __u64 flags;
> } batch;
>
> in_batch/out_batch are opaque values used to communicate between
> user/kernel space, in_batch/out_batch must be of key_size length.
>
> To start iterating from the beginning in_batch must be null,
> count is the # of key/value elements to retrieve. Note that the 'keys'
> buffer must be a buffer of key_size * count size and the 'values' buffer
> must be value_size * count, where value_size must be aligned to 8 bytes
> by userspace if it's dealing with percpu maps. 'count' will contain the
> number of keys/values successfully retrieved. Note that 'count' is an
> input/output variable and it can contain a lower value after a call.
>
> If there are no more entries to retrieve, ENOENT will be returned. If the
> error is ENOENT, count might be > 0 in case it copied some values but
> there were no more entries to retrieve.
>
> Note that if the return code is an error and not -EFAULT,
> count indicates the number of elements successfully processed.
>
> Suggested-by: Stanislav Fomichev <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
> Signed-off-by: Yonghong Song <[email protected]>
> ---
> include/linux/bpf.h | 11 +++
> include/uapi/linux/bpf.h | 19 +++++
> kernel/bpf/syscall.c | 172 +++++++++++++++++++++++++++++++++++++++
> 3 files changed, 202 insertions(+)
[...]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 2530266fa6477..708aa89fe2308 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1206,6 +1206,120 @@ static int map_get_next_key(union bpf_attr *attr)
> return err;
> }
>
> +#define MAP_LOOKUP_RETRIES 3
> +
> +static int __generic_map_lookup_batch(struct bpf_map *map,
> + const union bpf_attr *attr,
> + union bpf_attr __user *uattr,
> + bool do_delete)
> +{
> + void __user *ubatch = u64_to_user_ptr(attr->batch.in_batch);
> + void __user *uobatch = u64_to_user_ptr(attr->batch.out_batch);
> + void __user *values = u64_to_user_ptr(attr->batch.values);
> + void __user *keys = u64_to_user_ptr(attr->batch.keys);
> + void *buf, *prev_key, *key, *value;
> + u32 value_size, cp, max_count;
> + bool first_key = false;
> + int err, retry = MAP_LOOKUP_RETRIES;

Could you try to use reverse Christmas tree style declaration here?

> +
> + if (attr->batch.elem_flags & ~BPF_F_LOCK)
> + return -EINVAL;
> +
> + if ((attr->batch.elem_flags & BPF_F_LOCK) &&
> + !map_value_has_spin_lock(map))
> + return -EINVAL;
> +
> + value_size = bpf_map_value_size(map);
> +
> + max_count = attr->batch.count;
> + if (!max_count)
> + return 0;
> +
> + buf = kmalloc(map->key_size + value_size, GFP_USER | __GFP_NOWARN);
> + if (!buf)
> + return -ENOMEM;
> +
> + err = -EFAULT;
> + first_key = false;
> + if (ubatch && copy_from_user(buf, ubatch, map->key_size))
> + goto free_buf;
> + key = buf;
> + value = key + map->key_size;
> + if (!ubatch) {
> + prev_key = NULL;
> + first_key = true;
> + }
> +
> + for (cp = 0; cp < max_count;) {
> + if (cp || first_key) {
> + rcu_read_lock();
> + err = map->ops->map_get_next_key(map, prev_key, key);
> + rcu_read_unlock();
> + if (err)
> + break;
> + }
> + err = bpf_map_copy_value(map, key, value,
> + attr->batch.elem_flags, do_delete);
> +
> + if (err == -ENOENT) {
> + if (retry) {
> + retry--;
> + continue;
> + }
> + err = -EINTR;
> + break;
> + }
> +
> + if (err)
> + goto free_buf;
> +
> + if (copy_to_user(keys + cp * map->key_size, key,
> + map->key_size)) {
> + err = -EFAULT;
> + goto free_buf;
> + }
> + if (copy_to_user(values + cp * value_size, value, value_size)) {
> + err = -EFAULT;
> + goto free_buf;
> + }
> +
> + prev_key = key;
> + retry = MAP_LOOKUP_RETRIES;
> + cp++;
> + }
> +
> + if (!err) {
> + rcu_read_lock();
> + err = map->ops->map_get_next_key(map, prev_key, key);
> + rcu_read_unlock();
> + }
> +
> + if (err)
> + memset(key, 0, map->key_size);

So if any error happens due to the above map_get_next_key() or an earlier
error, the next "batch" returned to user space could be "0". How should
user space handle this? Ultimately, does user space need to start
from the beginning again?

What I mean here is: how could we design an interface so that user
space, absent an -EFAULT error, can successfully get all elements
without duplication?

One way to do it here is to just return -EFAULT if we cannot get a
proper next key. But maybe we can come up with a better mechanism
once we work out what the user-space code will look like.

> +
> + if ((copy_to_user(&uattr->batch.count, &cp, sizeof(cp)) ||
> + (copy_to_user(uobatch, key, map->key_size))))
> + err = -EFAULT;
> +
> +free_buf:
> + kfree(buf);
> + return err;
> +}
> +
[...]

2019-12-13 17:42:01

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 03/11] bpf: add generic support for update and delete batch ops



On 12/11/19 2:33 PM, Brian Vazquez wrote:
> This commit adds generic support for update and delete batch ops that
> can be used for almost all the bpf maps. These commands share the same
> UAPI attr that lookup and lookup_and_delete batch ops use and the
> syscall commands are:
>
> BPF_MAP_UPDATE_BATCH
> BPF_MAP_DELETE_BATCH
>
> The main difference between update/delete and lookup/lookup_and_delete
> batch ops is that for update/delete the keys/values must be specified by
> userspace and, because of that, neither in_batch nor out_batch is used.
>
> Suggested-by: Stanislav Fomichev <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
> Signed-off-by: Yonghong Song <[email protected]>
> ---
> include/linux/bpf.h | 10 ++++
> include/uapi/linux/bpf.h | 2 +
> kernel/bpf/syscall.c | 117 ++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 128 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index a16f209255a59..851fb3ff084b0 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -48,6 +48,10 @@ struct bpf_map_ops {
> int (*map_lookup_and_delete_batch)(struct bpf_map *map,
> const union bpf_attr *attr,
> union bpf_attr __user *uattr);
> + int (*map_update_batch)(struct bpf_map *map, const union bpf_attr *attr,
> + union bpf_attr __user *uattr);
> + int (*map_delete_batch)(struct bpf_map *map, const union bpf_attr *attr,
> + union bpf_attr __user *uattr);
>
> /* funcs callable from userspace and from eBPF programs */
> void *(*map_lookup_elem)(struct bpf_map *map, void *key);
> @@ -849,6 +853,12 @@ int generic_map_lookup_batch(struct bpf_map *map,
> int generic_map_lookup_and_delete_batch(struct bpf_map *map,
> const union bpf_attr *attr,
> union bpf_attr __user *uattr);
> +int generic_map_update_batch(struct bpf_map *map,
> + const union bpf_attr *attr,
> + union bpf_attr __user *uattr);
> +int generic_map_delete_batch(struct bpf_map *map,
> + const union bpf_attr *attr,
> + union bpf_attr __user *uattr);
>
> extern int sysctl_unprivileged_bpf_disabled;
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 36d3b885ddedd..dab24a763e4bb 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -109,6 +109,8 @@ enum bpf_cmd {
> BPF_BTF_GET_NEXT_ID,
> BPF_MAP_LOOKUP_BATCH,
> BPF_MAP_LOOKUP_AND_DELETE_BATCH,
> + BPF_MAP_UPDATE_BATCH,
> + BPF_MAP_DELETE_BATCH,
> };
>
> enum bpf_map_type {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 708aa89fe2308..8272e76183068 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1206,6 +1206,111 @@ static int map_get_next_key(union bpf_attr *attr)
> return err;
> }
>
> +int generic_map_delete_batch(struct bpf_map *map,
> + const union bpf_attr *attr,
> + union bpf_attr __user *uattr)
> +{
> + void __user *keys = u64_to_user_ptr(attr->batch.keys);
> + u32 cp, max_count;
> + int err = 0;
> + void *key;
> +
> + if (attr->batch.elem_flags & ~BPF_F_LOCK)
> + return -EINVAL;
> +
> + if ((attr->batch.elem_flags & BPF_F_LOCK) &&
> + !map_value_has_spin_lock(map)) {
> + return -EINVAL;
> + }
> +
> + max_count = attr->batch.count;
> + if (!max_count)
> + return -EINVAL;

To be consistent with lookup and lookup_and_delete, if max_count = 0,
we can just return 0 instead of -EINVAL.

> +
> + for (cp = 0; cp < max_count; cp++) {
> + key = __bpf_copy_key(keys + cp * map->key_size, map->key_size);
> + if (IS_ERR(key)) {
> + err = PTR_ERR(key);
> + break;
> + }
> +
> + if (bpf_map_is_dev_bound(map)) {
> + err = bpf_map_offload_delete_elem(map, key);
> + break;
> + }
> +
> + preempt_disable();
> + __this_cpu_inc(bpf_prog_active);
> + rcu_read_lock();
> + err = map->ops->map_delete_elem(map, key);
> + rcu_read_unlock();
> + __this_cpu_dec(bpf_prog_active);
> + preempt_enable();
> + maybe_wait_bpf_programs(map);
> + if (err)
> + break;
> + }
> + if (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)))
> + err = -EFAULT;
> + return err;
> +}
> +
> +int generic_map_update_batch(struct bpf_map *map,
> + const union bpf_attr *attr,
> + union bpf_attr __user *uattr)
> +{
> + void __user *values = u64_to_user_ptr(attr->batch.values);
> + void __user *keys = u64_to_user_ptr(attr->batch.keys);
> + u32 value_size, cp, max_count;
> + int ufd = attr->map_fd;
> + void *key, *value;
> + struct fd f;
> + int err = 0;
> +
> + f = fdget(ufd);

I did not find the matching fdput(). Also,
the variable 'f' is not used until later in the loop, so
you can do the fdget() later.

> + if (attr->batch.elem_flags & ~BPF_F_LOCK)
> + return -EINVAL;
> +
> + if ((attr->batch.elem_flags & BPF_F_LOCK) &&
> + !map_value_has_spin_lock(map)) {
> + return -EINVAL;
> + }
> +
> + value_size = bpf_map_value_size(map);
> +
> + max_count = attr->batch.count;
> + if (!max_count)
> + return 0;
> +
> + value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
> + if (!value)
> + return -ENOMEM;
> +
> + for (cp = 0; cp < max_count; cp++) {
> + key = __bpf_copy_key(keys + cp * map->key_size, map->key_size);
> + if (IS_ERR(key)) {
> + err = PTR_ERR(key);
> + break;
> + }
> + err = -EFAULT;
> + if (copy_from_user(value, values + cp * value_size, value_size))
> + break;
> +
> + err = bpf_map_update_value(map, f, key, value,
> + attr->batch.elem_flags);
> +
> + if (err)
> + break;
> + }
> +
> + if (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)))
> + err = -EFAULT;
> +
> + kfree(value);
> + kfree(key);
> + return err;
> +}
> +
[...]

2019-12-13 17:42:56

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 04/11] bpf: add lookup and updated batch ops to arraymap



On 12/11/19 2:33 PM, Brian Vazquez wrote:
> This adds the generic batch ops functionality to bpf arraymap, note that
> since deletion is not a valid operation for arraymap, only batch and
> lookup are added.
>
> Signed-off-by: Brian Vazquez <[email protected]>

Acked-by: Yonghong Song <[email protected]>

2019-12-13 17:49:19

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 05/11] bpf: add generic_batch_ops to lpm_trie map



On 12/11/19 2:33 PM, Brian Vazquez wrote:
> This adds the generic batch ops functionality to bpf lpm_trie.
>
> Signed-off-by: Brian Vazquez <[email protected]>
> ---
> kernel/bpf/lpm_trie.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 56e6c75d354d9..92c47b4f03337 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -743,4 +743,8 @@ const struct bpf_map_ops trie_map_ops = {
> .map_update_elem = trie_update_elem,
> .map_delete_elem = trie_delete_elem,
> .map_check_btf = trie_check_btf,
> + .map_lookup_batch = generic_map_lookup_batch,
> + .map_lookup_and_delete_batch = generic_map_lookup_and_delete_batch,

Not 100% sure whether trie should use generic map
lookup/lookup_and_delete or not. If the key is not available,
the get_next_key will return the 'leftmost' node, which roughly
corresponds to the first node in the hash table.

> + .map_delete_batch = generic_map_delete_batch,
> + .map_update_batch = generic_map_update_batch,
> };
>

2019-12-13 18:18:11

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 06/11] bpf: add batch ops to all htab bpf map



On 12/11/19 2:33 PM, Brian Vazquez wrote:
> From: Yonghong Song <[email protected]>
>
> htab can't use the generic batch support due to some problematic
> behaviours inherent to the data structure, i.e. while iterating the bpf
> map a concurrent program might delete the next entry that the batch was
> about to use; in that case there's no easy way to retrieve the next
> entry. The issue has been discussed multiple times (see [1] and [2]).
>
> The only way the hmap can be traversed without the problem previously
> described is by making sure the traversal covers entire buckets. This
> commit implements those strict requirements for hmap; the implementation
> follows the same interface as the generic support, with some exceptions:
>
> - If the keys/values buffers are not big enough to traverse a bucket,
> ENOSPC will be returned.
> - out_batch contains the value of the next bucket in the iteration, not
> the next key, but this is transparent to the user since the user
> should never use out_batch for anything other than bpf batch syscalls.
>
> Note that only lookup and lookup_and_delete batch ops require the hmap
> specific implementation, update/delete batch ops can be the generic
> ones.
>
> [1] https://lore.kernel.org/bpf/[email protected]/
> [2] https://lore.kernel.org/bpf/[email protected]/
>
> Signed-off-by: Yonghong Song <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
> ---
> kernel/bpf/hashtab.c | 242 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 242 insertions(+)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 22066a62c8c97..fac107bdaf9ec 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -17,6 +17,17 @@
> (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
> BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
>
> +#define BATCH_OPS(_name) \
> + .map_lookup_batch = \
> + _name##_map_lookup_batch, \
> + .map_lookup_and_delete_batch = \
> + _name##_map_lookup_and_delete_batch, \
> + .map_update_batch = \
> + generic_map_update_batch, \
> + .map_delete_batch = \
> + generic_map_delete_batch
> +
> +
> struct bucket {
> struct hlist_nulls_head head;
> raw_spinlock_t lock;
> @@ -1232,6 +1243,233 @@ static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
> rcu_read_unlock();
> }
>
> +static int
> +__htab_map_lookup_and_delete_batch(struct bpf_map *map,
> + const union bpf_attr *attr,
> + union bpf_attr __user *uattr,
> + bool do_delete, bool is_lru_map,
> + bool is_percpu)
> +{
> + struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> + u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
> + void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
> + void __user *uvalues = u64_to_user_ptr(attr->batch.values);
> + void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
> + void *ubatch = u64_to_user_ptr(attr->batch.in_batch);
> + u64 elem_map_flags, map_flags;
> + struct hlist_nulls_head *head;
> + u32 batch, max_count, size;
> + struct hlist_nulls_node *n;
> + unsigned long flags;
> + struct htab_elem *l;
> + struct bucket *b;
> + int ret = 0;
> +
> + max_count = attr->batch.count;
> + if (!max_count)
> + return 0;

In all previous implementations, we did sanity checks for the flags and
then checked max_count. To be consistent, we should move this after the
map_flags check.

> +
> + elem_map_flags = attr->batch.elem_flags;
> + if ((elem_map_flags & ~BPF_F_LOCK) ||
> + ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
> + return -EINVAL;
> +
> + map_flags = attr->batch.flags;
> + if (map_flags)
> + return -EINVAL;
> +
> + batch = 0;
> + if (ubatch && copy_from_user(&batch, ubatch, sizeof(batch)))
> + return -EFAULT;
> +
> + if (batch >= htab->n_buckets)
> + return -ENOENT;
> +
> + /* We cannot do copy_from_user or copy_to_user inside
> + * the rcu_read_lock. Allocate enough space here.
> + */
> + key_size = htab->map.key_size;
> + roundup_key_size = round_up(htab->map.key_size, 8);
> + value_size = htab->map.value_size;
> + size = round_up(value_size, 8);
> + if (is_percpu)
> + value_size = size * num_possible_cpus();
> + keys = kvmalloc(key_size, GFP_USER | __GFP_NOWARN);
> + values = kvmalloc(value_size, GFP_USER | __GFP_NOWARN);
> + if (!keys || !values) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + dst_key = keys;
> + dst_val = values;
> + total = 0;
> +
> + preempt_disable();
> + this_cpu_inc(bpf_prog_active);
> + rcu_read_lock();
> +
> +again:
> + b = &htab->buckets[batch];
> + head = &b->head;
> + raw_spin_lock_irqsave(&b->lock, flags);
> +
> + bucket_cnt = 0;
> + hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
> + bucket_cnt++;
> +
> + if (bucket_cnt > (max_count - total)) {
> + if (total == 0)
> + ret = -ENOSPC;
> + goto after_loop;
> + }
> +
> + hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> + memcpy(dst_key, l->key, key_size);
> +
> + if (is_percpu) {
> + int off = 0, cpu;
> + void __percpu *pptr;
> +
> + pptr = htab_elem_get_ptr(l, map->key_size);
> + for_each_possible_cpu(cpu) {
> + bpf_long_memcpy(dst_val + off,
> + per_cpu_ptr(pptr, cpu), size);
> + off += size;
> + }
> + } else {
> + value = l->key + roundup_key_size;
> + if (elem_map_flags & BPF_F_LOCK)
> + copy_map_value_locked(map, dst_val, value,
> + true);
> + else
> + copy_map_value(map, dst_val, value);
> + check_and_init_map_lock(map, dst_val);
> + }
> + if (do_delete) {
> + hlist_nulls_del_rcu(&l->hash_node);
> + if (is_lru_map)
> + bpf_lru_push_free(&htab->lru, &l->lru_node);
> + else
> + free_htab_elem(htab, l);
> + }
> + if (copy_to_user(ukeys + total * key_size, keys, key_size) ||
> + copy_to_user(uvalues + total * value_size, values,
> + value_size)) {
> + ret = -EFAULT;
> + goto after_loop;
> + }
> + total++;
> + }
> +
> + batch++;
> + if (batch >= htab->n_buckets) {
> + ret = -ENOENT;
> + goto after_loop;
> + }
> +
> + raw_spin_unlock_irqrestore(&b->lock, flags);
> + goto again;
> +
> +after_loop:
> + raw_spin_unlock_irqrestore(&b->lock, flags);
> +
> + rcu_read_unlock();
> + this_cpu_dec(bpf_prog_active);
> + preempt_enable();
> +

When reaching here, only the following values are possible
for 'ret':
0 <=== everything is okay, still have some buckets left
-ENOENT <=== everything is okay, not all user buffer filled as
we reach the end of hash table
-EFAULT <=== fault during copy data to user space
-ENOSPC <=== not enough buffer space to copy the 'batch'.

> + if (ret && ret != -ENOENT)
> + goto out;

Maybe we should do:
0 <=== continue to copy data back to user
-ENOSPC <=== continue to copy data back to user
user needs to check -ENOSPC error code
-EFAULT <=== we can do the copy below or go out
since -EFAULT will be returned any way
-ENOENT <=== we reached the end, so we actually have
no error. we should still copy data
back to user. The user can check -ENOENT
to signal end of the traversal, similar to
get_next_key().

Does this make sense?

> +
> + /* copy data back to user */
> + ubatch = u64_to_user_ptr(attr->batch.out_batch);
> + if (copy_to_user(ubatch, &batch, sizeof(batch)) ||
> + put_user(total, &uattr->batch.count))
> + ret = -EFAULT;
> +
> +out:
> + kvfree(keys);
> + kvfree(values);
> + return ret;
> +}
> +
[...]

2019-12-13 18:37:47

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 09/11] selftests/bpf: add batch ops testing for htab and htab_percpu map



On 12/11/19 2:33 PM, Brian Vazquez wrote:
> From: Yonghong Song <[email protected]>
>
> Tested bpf_map_lookup_batch(), bpf_map_lookup_and_delete_batch(),
> bpf_map_update_batch(), and bpf_map_delete_batch() functionality.
> $ ./test_maps
> ...
> test_htab_map_batch_ops:PASS
> test_htab_percpu_map_batch_ops:PASS
> ...
>
> Signed-off-by: Yonghong Song <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
> ---
> .../bpf/map_tests/htab_map_batch_ops.c | 269 ++++++++++++++++++
> 1 file changed, 269 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
>
> diff --git a/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
> new file mode 100644
> index 0000000000000..dabc4d420a10e
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
> @@ -0,0 +1,269 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2019 Facebook */
> +#include <stdio.h>
> +#include <errno.h>
> +#include <string.h>
> +
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#include <bpf_util.h>
> +#include <test_maps.h>
> +
> +static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
> + void *values, bool is_pcpu)
> +{
> + typedef BPF_DECLARE_PERCPU(int, value);
> + int i, j, err;
> + value *v;
> +
> + if (is_pcpu)
> + v = (value *)values;
> +
> + for (i = 0; i < max_entries; i++) {
> + keys[i] = i + 1;
> + if (is_pcpu)
> + for (j = 0; j < bpf_num_possible_cpus(); j++)
> + bpf_percpu(v[i], j) = i + 2 + j;
> + else
> + ((int *)values)[i] = i + 2;
> + }
> +
> + err = bpf_map_update_batch(map_fd, keys, values, &max_entries, 0, 0);
> + CHECK(err, "bpf_map_update_batch()", "error:%s\n", strerror(errno));
> +}
> +
> +static void map_batch_verify(int *visited, __u32 max_entries,
> + int *keys, void *values, bool is_pcpu)
> +{
> + typedef BPF_DECLARE_PERCPU(int, value);
> + value *v;
> + int i, j;
> +
> + if (is_pcpu)
> + v = (value *)values;
> +
> + memset(visited, 0, max_entries * sizeof(*visited));
> + for (i = 0; i < max_entries; i++) {
> +
> + if (is_pcpu) {
> + for (j = 0; j < bpf_num_possible_cpus(); j++) {
> + CHECK(keys[i] + 1 + j != bpf_percpu(v[i], j),
> + "key/value checking",
> + "error: i %d j %d key %d value %d\n",
> + i, j, keys[i], bpf_percpu(v[i], j));
> + }
> + } else {
> + CHECK(keys[i] + 1 != ((int *)values)[i],
> + "key/value checking",
> + "error: i %d key %d value %d\n", i, keys[i],
> + ((int *)values)[i]);
> + }
> +
> + visited[i] = 1;
> +
> + }
> + for (i = 0; i < max_entries; i++) {
> + CHECK(visited[i] != 1, "visited checking",
> + "error: keys array at index %d missing\n", i);
> + }
> +}
> +
> +void __test_map_lookup_and_delete_batch(bool is_pcpu)
> +{
> + int map_type = is_pcpu ? BPF_MAP_TYPE_PERCPU_HASH : BPF_MAP_TYPE_HASH;
> + struct bpf_create_map_attr xattr = {
> + .name = "hash_map",
> + .map_type = map_type,
> + .key_size = sizeof(int),
> + .value_size = sizeof(int),
> + };
> + __u32 batch, count, total, total_success;
> + typedef BPF_DECLARE_PERCPU(int, value);
> + int map_fd, *keys, *visited, key;
> + const __u32 max_entries = 10;
> + int err, step, value_size;
> + value pcpu_values[10];
> + bool nospace_err;
> + void *values;
> +
> + xattr.max_entries = max_entries;
> + map_fd = bpf_create_map_xattr(&xattr);
> + CHECK(map_fd == -1,
> + "bpf_create_map_xattr()", "error:%s\n", strerror(errno));
> +
> + value_size = is_pcpu ? sizeof(value) : sizeof(int);
> + keys = malloc(max_entries * sizeof(int));
> + if (is_pcpu)
> + values = pcpu_values;
> + else
> + values = malloc(max_entries * sizeof(int));
> + visited = malloc(max_entries * sizeof(int));
> + CHECK(!keys || !values || !visited, "malloc()",
> + "error:%s\n", strerror(errno));

Sorry, I missed it. The malloc()ed keys/visited memory regions should
be freed at the end of the test. The same goes for the other tests.

> +
> + /* test 1: lookup/delete an empty hash table, -ENOENT */
> + count = max_entries;
> + err = bpf_map_lookup_and_delete_batch(map_fd, NULL, &batch, keys,
> + values, &count, 0, 0);
> + CHECK((err && errno != ENOENT), "empty map",
> + "error: %s\n", strerror(errno));
> +
> + /* populate elements to the map */
> + map_batch_update(map_fd, max_entries, keys, values, is_pcpu);
> +
> + /* test 2: lookup/delete with count = 0, success */
> + count = 0;
> + err = bpf_map_lookup_and_delete_batch(map_fd, NULL, &batch, keys,
> + values, &count, 0, 0);
> + CHECK(err, "count = 0", "error: %s\n", strerror(errno));
> +
> + /* test 3: lookup/delete with count = max_entries, success */
> + memset(keys, 0, max_entries * sizeof(*keys));
> + memset(values, 0, max_entries * value_size);
> + count = max_entries;
> + err = bpf_map_lookup_and_delete_batch(map_fd, NULL, &batch, keys,
> + values, &count, 0, 0);
> + CHECK((err && errno != ENOENT), "count = max_entries",
> + "error: %s\n", strerror(errno));
> + CHECK(count != max_entries, "count = max_entries",
> + "count = %u, max_entries = %u\n", count, max_entries);
> + map_batch_verify(visited, max_entries, keys, values, is_pcpu);
> +
> + /* bpf_map_get_next_key() should return -ENOENT for an empty map. */
> + err = bpf_map_get_next_key(map_fd, NULL, &key);
> + CHECK(!err, "bpf_map_get_next_key()", "error: %s\n", strerror(errno));
> +
[...]

2019-12-13 18:44:14

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 09/11] selftests/bpf: add batch ops testing for htab and htab_percpu map



On 12/11/19 2:33 PM, Brian Vazquez wrote:
> From: Yonghong Song <[email protected]>
>
> Tested bpf_map_lookup_batch(), bpf_map_lookup_and_delete_batch(),
> bpf_map_update_batch(), and bpf_map_delete_batch() functionality.
> $ ./test_maps
> ...
> test_htab_map_batch_ops:PASS
> test_htab_percpu_map_batch_ops:PASS
> ...
>
> Signed-off-by: Yonghong Song <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
> ---
> .../bpf/map_tests/htab_map_batch_ops.c | 269 ++++++++++++++++++
> 1 file changed, 269 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
>
> diff --git a/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
> new file mode 100644
> index 0000000000000..dabc4d420a10e
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
> @@ -0,0 +1,269 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2019 Facebook */
> +#include <stdio.h>
> +#include <errno.h>
> +#include <string.h>
> +
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#include <bpf_util.h>
> +#include <test_maps.h>
> +
> +static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
> + void *values, bool is_pcpu)
> +{
> + typedef BPF_DECLARE_PERCPU(int, value);
> + int i, j, err;
> + value *v;

We can initialize the above 'v' to 'v = NULL'.
Some compilers may not be smart enough to figure out that 'v' is
always initialized before use.

> +
> + if (is_pcpu)
> + v = (value *)values;
> +
> + for (i = 0; i < max_entries; i++) {
> + keys[i] = i + 1;
> + if (is_pcpu)
> + for (j = 0; j < bpf_num_possible_cpus(); j++)
> + bpf_percpu(v[i], j) = i + 2 + j;
> + else
> + ((int *)values)[i] = i + 2;
> + }
> +
> + err = bpf_map_update_batch(map_fd, keys, values, &max_entries, 0, 0);
> + CHECK(err, "bpf_map_update_batch()", "error:%s\n", strerror(errno));
> +}
> +
[...]

2019-12-13 18:59:55

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 06/11] bpf: add batch ops to all htab bpf map



On 12/11/19 2:33 PM, Brian Vazquez wrote:
> From: Yonghong Song <[email protected]>
>
> htab can't use generic batch support due to some problematic behaviours
> inherent to the data structure, i.e. while iterating the bpf map a
> concurrent program might delete the next entry that the batch was about
> to use; in that case there's no easy solution to retrieve the next
> entry. The issue has been discussed multiple times (see [1] and [2]).
>
> The only way hmap can be traversed without the problem described
> above is by making sure that the traversal covers entire buckets.
> This commit implements those strict requirements for hmap; the
> implementation follows the same interface as the generic support, with
> some exceptions:
>
> - If the keys/values buffers are not big enough to traverse a bucket,
> ENOSPC will be returned.
> - out_batch contains the value of the next bucket in the iteration, not
> the next key, but this is transparent to the user since the user
> should never use out_batch for anything other than bpf batch syscalls.
>
> Note that only lookup and lookup_and_delete batch ops require the hmap
> specific implementation, update/delete batch ops can be the generic
> ones.
>
> [1] https://lore.kernel.org/bpf/[email protected]/
> [2] https://lore.kernel.org/bpf/[email protected]/
>
> Signed-off-by: Yonghong Song <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
> ---
> kernel/bpf/hashtab.c | 242 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 242 insertions(+)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 22066a62c8c97..fac107bdaf9ec 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -17,6 +17,17 @@
> (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
> BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
>
> +#define BATCH_OPS(_name) \
> + .map_lookup_batch = \
> + _name##_map_lookup_batch, \
> + .map_lookup_and_delete_batch = \
> + _name##_map_lookup_and_delete_batch, \
> + .map_update_batch = \
> + generic_map_update_batch, \
> + .map_delete_batch = \
> + generic_map_delete_batch
> +
> +
> struct bucket {
> struct hlist_nulls_head head;
> raw_spinlock_t lock;
> @@ -1232,6 +1243,233 @@ static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
> rcu_read_unlock();
> }
>
> +static int
> +__htab_map_lookup_and_delete_batch(struct bpf_map *map,
> + const union bpf_attr *attr,
> + union bpf_attr __user *uattr,
> + bool do_delete, bool is_lru_map,
> + bool is_percpu)
> +{
> + struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> + u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
> + void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
> + void __user *uvalues = u64_to_user_ptr(attr->batch.values);
> + void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
> + void *ubatch = u64_to_user_ptr(attr->batch.in_batch);
> + u64 elem_map_flags, map_flags;
> + struct hlist_nulls_head *head;
> + u32 batch, max_count, size;
> + struct hlist_nulls_node *n;
> + unsigned long flags;
> + struct htab_elem *l;
> + struct bucket *b;
> + int ret = 0;
> +
> + max_count = attr->batch.count;
> + if (!max_count)
> + return 0;
> +
> + elem_map_flags = attr->batch.elem_flags;
> + if ((elem_map_flags & ~BPF_F_LOCK) ||
> + ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
> + return -EINVAL;
> +
> + map_flags = attr->batch.flags;
> + if (map_flags)
> + return -EINVAL;
> +
> + batch = 0;
> + if (ubatch && copy_from_user(&batch, ubatch, sizeof(batch)))
> + return -EFAULT;
> +
> + if (batch >= htab->n_buckets)
> + return -ENOENT;
> +
> + /* We cannot do copy_from_user or copy_to_user inside
> + * the rcu_read_lock. Allocate enough space here.
> + */
> + key_size = htab->map.key_size;
> + roundup_key_size = round_up(htab->map.key_size, 8);
> + value_size = htab->map.value_size;
> + size = round_up(value_size, 8);
> + if (is_percpu)
> + value_size = size * num_possible_cpus();
> + keys = kvmalloc(key_size, GFP_USER | __GFP_NOWARN);
> + values = kvmalloc(value_size, GFP_USER | __GFP_NOWARN);
> + if (!keys || !values) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + dst_key = keys;
> + dst_val = values;
> + total = 0;
> +
> + preempt_disable();
> + this_cpu_inc(bpf_prog_active);
> + rcu_read_lock();
> +
> +again:
> + b = &htab->buckets[batch];
> + head = &b->head;
> + raw_spin_lock_irqsave(&b->lock, flags);
> +
> + bucket_cnt = 0;
> + hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
> + bucket_cnt++;
> +
> + if (bucket_cnt > (max_count - total)) {
> + if (total == 0)
> + ret = -ENOSPC;
> + goto after_loop;
> + }
> +
> + hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> + memcpy(dst_key, l->key, key_size);
> +
> + if (is_percpu) {
> + int off = 0, cpu;
> + void __percpu *pptr;
> +
> + pptr = htab_elem_get_ptr(l, map->key_size);
> + for_each_possible_cpu(cpu) {
> + bpf_long_memcpy(dst_val + off,
> + per_cpu_ptr(pptr, cpu), size);
> + off += size;
> + }
> + } else {
> + value = l->key + roundup_key_size;
> + if (elem_map_flags & BPF_F_LOCK)
> + copy_map_value_locked(map, dst_val, value,
> + true);
> + else
> + copy_map_value(map, dst_val, value);
> + check_and_init_map_lock(map, dst_val);
> + }
> + if (do_delete) {
> + hlist_nulls_del_rcu(&l->hash_node);
> + if (is_lru_map)
> + bpf_lru_push_free(&htab->lru, &l->lru_node);
> + else
> + free_htab_elem(htab, l);
> + }
> + if (copy_to_user(ukeys + total * key_size, keys, key_size) ||
> + copy_to_user(uvalues + total * value_size, values,
> + value_size)) {

We cannot do copy_to_user inside an atomic region where irqs are
disabled with raw_spin_lock_irqsave(). We could do the following:
. we kmalloc memory before preempt_disable() with the current
bucket size as the count.
. inside the raw_spin_lock_irqsave() region, we copy to kernel
memory only.
. inside the raw_spin_lock_irqsave() region, if the bucket size
has changed, we can have a few retries to increase the allocation size
before giving up.
Do you think this may work?

> + ret = -EFAULT;
> + goto after_loop;
> + }
> + total++;
> + }
> +
> + batch++;
> + if (batch >= htab->n_buckets) {
> + ret = -ENOENT;
> + goto after_loop;
> + }
> +
> + raw_spin_unlock_irqrestore(&b->lock, flags);
> + goto again;
> +
> +after_loop:
> + raw_spin_unlock_irqrestore(&b->lock, flags);
> +
> + rcu_read_unlock();
> + this_cpu_dec(bpf_prog_active);
> + preempt_enable();
> +
> + if (ret && ret != -ENOENT)
> + goto out;
> +
> + /* copy data back to user */
> + ubatch = u64_to_user_ptr(attr->batch.out_batch);
> + if (copy_to_user(ubatch, &batch, sizeof(batch)) ||
> + put_user(total, &uattr->batch.count))
> + ret = -EFAULT;
> +
> +out:
> + kvfree(keys);
> + kvfree(values);
> + return ret;
> +}
> +
[...]

2019-12-19 00:56:23

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 08/11] libbpf: add libbpf support to batch ops

On Wed, Dec 11, 2019 at 2:35 PM Brian Vazquez <[email protected]> wrote:
>
> From: Yonghong Song <[email protected]>
>
> Added four libbpf API functions to support map batch operations:
> . int bpf_map_delete_batch( ... )
> . int bpf_map_lookup_batch( ... )
> . int bpf_map_lookup_and_delete_batch( ... )
> . int bpf_map_update_batch( ... )
>
> Signed-off-by: Yonghong Song <[email protected]>
> ---

These libbpf APIs should use _opts approach from the get go to make
them extensible, but preserving backwards/forward compatibility.
Please take a look at one of few that are already using them (or
follow Andrey's bpf_prog_attach work, as he's adding opts-based one at
the moment).

> tools/lib/bpf/bpf.c | 61 ++++++++++++++++++++++++++++++++++++++++
> tools/lib/bpf/bpf.h | 14 +++++++++
> tools/lib/bpf/libbpf.map | 4 +++
> 3 files changed, 79 insertions(+)
>

[...]

2020-01-07 06:40:14

by Brian Vazquez

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 05/11] bpf: add generic_batch_ops to lpm_trie map

Hi Yonghong, thanks for reviewing it, and sorry for the late reply; I
had been traveling.

On Fri, Dec 13, 2019 at 11:46 AM Yonghong Song <[email protected]> wrote:
>
>
>
> On 12/11/19 2:33 PM, Brian Vazquez wrote:
> > This adds the generic batch ops functionality to bpf lpm_trie.
> >
> > Signed-off-by: Brian Vazquez <[email protected]>
> > ---
> > kernel/bpf/lpm_trie.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> > index 56e6c75d354d9..92c47b4f03337 100644
> > --- a/kernel/bpf/lpm_trie.c
> > +++ b/kernel/bpf/lpm_trie.c
> > @@ -743,4 +743,8 @@ const struct bpf_map_ops trie_map_ops = {
> > .map_update_elem = trie_update_elem,
> > .map_delete_elem = trie_delete_elem,
> > .map_check_btf = trie_check_btf,
> > + .map_lookup_batch = generic_map_lookup_batch,
> > + .map_lookup_and_delete_batch = generic_map_lookup_and_delete_batch,
>
> Not 100% sure whether trie should use generic map
> lookup/lookup_and_delete or not. If the key is not available,
> the get_next_key will return the 'leftmost' node which roughly
> corresponding to the first node in the hash table.
>

I think you're right, we shouldn't use generic
lookup/lookup_and_delete for lpm_trie. That being said, would you be
ok, if we don't add lpm_trie support in this patch series? Also we can
drop the generic_map_lookup_and_delete implementation in this patch
series and add it in the future, if needed. What do you think?

> > + .map_delete_batch = generic_map_delete_batch,
> > + .map_update_batch = generic_map_update_batch,
> > };
> >

2020-01-07 06:51:22

by Brian Vazquez

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 02/11] bpf: add generic support for lookup and lookup_and_delete batch ops

On Fri, Dec 13, 2019 at 11:26 AM Yonghong Song <[email protected]> wrote:
>
>
>
> On 12/11/19 2:33 PM, Brian Vazquez wrote:
> > This commit introduces generic support for the bpf_map_lookup_batch and
> > bpf_map_lookup_and_delete_batch ops. This implementation can be used by
> > almost all the bpf maps since its core implementation relies on the
> > existing map_get_next_key, map_lookup_elem and map_delete_elem
> > functions. The bpf syscall subcommands introduced are:
> >
> > BPF_MAP_LOOKUP_BATCH
> > BPF_MAP_LOOKUP_AND_DELETE_BATCH
> >
> > The UAPI attribute is:
> >
> > struct { /* struct used by BPF_MAP_*_BATCH commands */
> > __aligned_u64 in_batch; /* start batch,
> > * NULL to start from beginning
> > */
> > __aligned_u64 out_batch; /* output: next start batch */
> > __aligned_u64 keys;
> > __aligned_u64 values;
> > __u32 count; /* input/output:
> > * input: # of key/value
> > * elements
> > * output: # of filled elements
> > */
> > __u32 map_fd;
> > __u64 elem_flags;
> > __u64 flags;
> > } batch;
> >
> > in_batch/out_batch are opaque values used to communicate between
> > user/kernel space; in_batch/out_batch must be of key_size length.
> >
> > To start iterating from the beginning in_batch must be null,
> > count is the # of key/value elements to retrieve. Note that the 'keys'
> > buffer must be a buffer of key_size * count size and the 'values' buffer
> > must be value_size * count, where value_size must be aligned to 8 bytes
> > by userspace if it's dealing with percpu maps. 'count' will contain the
> > number of keys/values successfully retrieved. Note that 'count' is an
> > input/output variable and it can contain a lower value after a call.
> >
> > If there are no more entries to retrieve, ENOENT will be returned. If the
> > error is ENOENT, count might be > 0 in case it copied some values but there
> > were no more entries to retrieve.
> >
> > Note that if the return code is an error and not -EFAULT,
> > count indicates the number of elements successfully processed.
> >
> > Suggested-by: Stanislav Fomichev <[email protected]>
> > Signed-off-by: Brian Vazquez <[email protected]>
> > Signed-off-by: Yonghong Song <[email protected]>
> > ---
> > include/linux/bpf.h | 11 +++
> > include/uapi/linux/bpf.h | 19 +++++
> > kernel/bpf/syscall.c | 172 +++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 202 insertions(+)
> [...]
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 2530266fa6477..708aa89fe2308 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1206,6 +1206,120 @@ static int map_get_next_key(union bpf_attr *attr)
> > return err;
> > }
> >
> > +#define MAP_LOOKUP_RETRIES 3
> > +
> > +static int __generic_map_lookup_batch(struct bpf_map *map,
> > + const union bpf_attr *attr,
> > + union bpf_attr __user *uattr,
> > + bool do_delete)
> > +{
> > + void __user *ubatch = u64_to_user_ptr(attr->batch.in_batch);
> > + void __user *uobatch = u64_to_user_ptr(attr->batch.out_batch);
> > + void __user *values = u64_to_user_ptr(attr->batch.values);
> > + void __user *keys = u64_to_user_ptr(attr->batch.keys);
> > + void *buf, *prev_key, *key, *value;
> > + u32 value_size, cp, max_count;
> > + bool first_key = false;
> > + int err, retry = MAP_LOOKUP_RETRIES;
>
> Could you try to use reverse Christmas tree style declaration here?

ACK
>
> > +
> > + if (attr->batch.elem_flags & ~BPF_F_LOCK)
> > + return -EINVAL;
> > +
> > + if ((attr->batch.elem_flags & BPF_F_LOCK) &&
> > + !map_value_has_spin_lock(map))
> > + return -EINVAL;
> > +
> > + value_size = bpf_map_value_size(map);
> > +
> > + max_count = attr->batch.count;
> > + if (!max_count)
> > + return 0;
> > +
> > + buf = kmalloc(map->key_size + value_size, GFP_USER | __GFP_NOWARN);
> > + if (!buf)
> > + return -ENOMEM;
> > +
> > + err = -EFAULT;
> > + first_key = false;
> > + if (ubatch && copy_from_user(buf, ubatch, map->key_size))
> > + goto free_buf;
> > + key = buf;
> > + value = key + map->key_size;
> > + if (!ubatch) {
> > + prev_key = NULL;
> > + first_key = true;
> > + }
> > +
> > + for (cp = 0; cp < max_count;) {
> > + if (cp || first_key) {
> > + rcu_read_lock();
> > + err = map->ops->map_get_next_key(map, prev_key, key);
> > + rcu_read_unlock();
> > + if (err)
> > + break;
> > + }
> > + err = bpf_map_copy_value(map, key, value,
> > + attr->batch.elem_flags, do_delete);
> > +
> > + if (err == -ENOENT) {
> > + if (retry) {
> > + retry--;
> > + continue;
> > + }
> > + err = -EINTR;
> > + break;
> > + }
> > +
> > + if (err)
> > + goto free_buf;
> > +
> > + if (copy_to_user(keys + cp * map->key_size, key,
> > + map->key_size)) {
> > + err = -EFAULT;
> > + goto free_buf;
> > + }
> > + if (copy_to_user(values + cp * value_size, value, value_size)) {
> > + err = -EFAULT;
> > + goto free_buf;
> > + }
> > +
> > + prev_key = key;
> > + retry = MAP_LOOKUP_RETRIES;
> > + cp++;
> > + }
> > +
> > + if (!err) {
> > + rcu_read_lock();
> > + err = map->ops->map_get_next_key(map, prev_key, key);
> > + rcu_read_unlock();
> > + }
> > +
> > + if (err)
> > + memset(key, 0, map->key_size);
>
> So if any error happens due to the above map_get_next_key() or an
> earlier error, the next "batch" returned to user space could be "0".
> How should user space handle this? Ultimately, does user space need
> to start from the beginning again?
>
> What I mean here is how we could design an interface so that user
> space, barring an -EFAULT error, can successfully get all elements
> without duplication.
>
> One way to do this is to just return -EFAULT if we cannot get a
> proper next key. But maybe we could come up with a better mechanism
> when we try to implement what the user space code will look like.

I was thinking that instead of using the "next key" as a token we
could use the last value successfully copied as the token; that way
user space code would always be able to start/retry from the last
processed entry. Do you think this would work?
>
> > +
> > + if ((copy_to_user(&uattr->batch.count, &cp, sizeof(cp)) ||
> > + (copy_to_user(uobatch, key, map->key_size))))
> > + err = -EFAULT;
> > +
> > +free_buf:
> > + kfree(buf);
> > + return err;
> > +}
> > +
> [...]

2020-01-07 06:52:53

by Brian Vazquez

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 08/11] libbpf: add libbpf support to batch ops

Thanks for reviewing it, Andrii!

On Wed, Dec 18, 2019 at 6:54 PM Andrii Nakryiko
<[email protected]> wrote:
>
> On Wed, Dec 11, 2019 at 2:35 PM Brian Vazquez <[email protected]> wrote:
> >
> > From: Yonghong Song <[email protected]>
> >
> > Added four libbpf API functions to support map batch operations:
> > . int bpf_map_delete_batch( ... )
> > . int bpf_map_lookup_batch( ... )
> > . int bpf_map_lookup_and_delete_batch( ... )
> > . int bpf_map_update_batch( ... )
> >
> > Signed-off-by: Yonghong Song <[email protected]>
> > ---
>
> These libbpf APIs should use _opts approach from the get go to make
> them extensible, but preserving backwards/forward compatibility.
> Please take a look at one of few that are already using them (or
> follow Andrey's bpf_prog_attach work, as he's adding opts-based one at
> the moment).

I will add this to next version.
>
> > tools/lib/bpf/bpf.c | 61 ++++++++++++++++++++++++++++++++++++++++
> > tools/lib/bpf/bpf.h | 14 +++++++++
> > tools/lib/bpf/libbpf.map | 4 +++
> > 3 files changed, 79 insertions(+)
> >
>
> [...]

2020-01-07 07:03:15

by Brian Vazquez

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 06/11] bpf: add batch ops to all htab bpf map

On Fri, Dec 13, 2019 at 12:58 PM Yonghong Song <[email protected]> wrote:
>
>
>
> On 12/11/19 2:33 PM, Brian Vazquez wrote:
> > From: Yonghong Song <[email protected]>
> >
> > htab can't use generic batch support due to some problematic behaviours
> > inherent to the data structure, i.e. while iterating the bpf map a
> > concurrent program might delete the next entry that the batch was about
> > to use; in that case there's no easy solution to retrieve the next
> > entry. The issue has been discussed multiple times (see [1] and [2]).
> >
> > The only way hmap can be traversed without the problem described
> > above is by making sure that the traversal covers entire buckets.
> > This commit implements those strict requirements for hmap; the
> > implementation follows the same interface as the generic support, with
> > some exceptions:
> >
> > - If the keys/values buffers are not big enough to traverse a bucket,
> > ENOSPC will be returned.
> > - out_batch contains the value of the next bucket in the iteration, not
> > the next key, but this is transparent to the user since the user
> > should never use out_batch for anything other than bpf batch syscalls.
> >
> > Note that only lookup and lookup_and_delete batch ops require the hmap
> > specific implementation, update/delete batch ops can be the generic
> > ones.
> >
> > [1] https://lore.kernel.org/bpf/[email protected]/
> > [2] https://lore.kernel.org/bpf/[email protected]/
> >
> > Signed-off-by: Yonghong Song <[email protected]>
> > Signed-off-by: Brian Vazquez <[email protected]>
> > ---
> > kernel/bpf/hashtab.c | 242 +++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 242 insertions(+)
> >
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index 22066a62c8c97..fac107bdaf9ec 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -17,6 +17,17 @@
> > (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
> > BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
> >
> > +#define BATCH_OPS(_name) \
> > + .map_lookup_batch = \
> > + _name##_map_lookup_batch, \
> > + .map_lookup_and_delete_batch = \
> > + _name##_map_lookup_and_delete_batch, \
> > + .map_update_batch = \
> > + generic_map_update_batch, \
> > + .map_delete_batch = \
> > + generic_map_delete_batch
> > +
> > +
> > struct bucket {
> > struct hlist_nulls_head head;
> > raw_spinlock_t lock;
> > @@ -1232,6 +1243,233 @@ static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
> > rcu_read_unlock();
> > }
> >
> > +static int
> > +__htab_map_lookup_and_delete_batch(struct bpf_map *map,
> > + const union bpf_attr *attr,
> > + union bpf_attr __user *uattr,
> > + bool do_delete, bool is_lru_map,
> > + bool is_percpu)
> > +{
> > + struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > + u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
> > + void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
> > + void __user *uvalues = u64_to_user_ptr(attr->batch.values);
> > + void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
> > + void *ubatch = u64_to_user_ptr(attr->batch.in_batch);
> > + u64 elem_map_flags, map_flags;
> > + struct hlist_nulls_head *head;
> > + u32 batch, max_count, size;
> > + struct hlist_nulls_node *n;
> > + unsigned long flags;
> > + struct htab_elem *l;
> > + struct bucket *b;
> > + int ret = 0;
> > +
> > + max_count = attr->batch.count;
> > + if (!max_count)
> > + return 0;
> > +
> > + elem_map_flags = attr->batch.elem_flags;
> > + if ((elem_map_flags & ~BPF_F_LOCK) ||
> > + ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
> > + return -EINVAL;
> > +
> > + map_flags = attr->batch.flags;
> > + if (map_flags)
> > + return -EINVAL;
> > +
> > + batch = 0;
> > + if (ubatch && copy_from_user(&batch, ubatch, sizeof(batch)))
> > + return -EFAULT;
> > +
> > + if (batch >= htab->n_buckets)
> > + return -ENOENT;
> > +
> > + /* We cannot do copy_from_user or copy_to_user inside
> > + * the rcu_read_lock. Allocate enough space here.
> > + */
> > + key_size = htab->map.key_size;
> > + roundup_key_size = round_up(htab->map.key_size, 8);
> > + value_size = htab->map.value_size;
> > + size = round_up(value_size, 8);
> > + if (is_percpu)
> > + value_size = size * num_possible_cpus();
> > + keys = kvmalloc(key_size, GFP_USER | __GFP_NOWARN);
> > + values = kvmalloc(value_size, GFP_USER | __GFP_NOWARN);
> > + if (!keys || !values) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + dst_key = keys;
> > + dst_val = values;
> > + total = 0;
> > +
> > + preempt_disable();
> > + this_cpu_inc(bpf_prog_active);
> > + rcu_read_lock();
> > +
> > +again:
> > + b = &htab->buckets[batch];
> > + head = &b->head;
> > + raw_spin_lock_irqsave(&b->lock, flags);
> > +
> > + bucket_cnt = 0;
> > + hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
> > + bucket_cnt++;
> > +
> > + if (bucket_cnt > (max_count - total)) {
> > + if (total == 0)
> > + ret = -ENOSPC;
> > + goto after_loop;
> > + }
> > +
> > + hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> > + memcpy(dst_key, l->key, key_size);
> > +
> > + if (is_percpu) {
> > + int off = 0, cpu;
> > + void __percpu *pptr;
> > +
> > + pptr = htab_elem_get_ptr(l, map->key_size);
> > + for_each_possible_cpu(cpu) {
> > + bpf_long_memcpy(dst_val + off,
> > + per_cpu_ptr(pptr, cpu), size);
> > + off += size;
> > + }
> > + } else {
> > + value = l->key + roundup_key_size;
> > + if (elem_map_flags & BPF_F_LOCK)
> > + copy_map_value_locked(map, dst_val, value,
> > + true);
> > + else
> > + copy_map_value(map, dst_val, value);
> > + check_and_init_map_lock(map, dst_val);
> > + }
> > + if (do_delete) {
> > + hlist_nulls_del_rcu(&l->hash_node);
> > + if (is_lru_map)
> > + bpf_lru_push_free(&htab->lru, &l->lru_node);
> > + else
> > + free_htab_elem(htab, l);
> > + }
> > + if (copy_to_user(ukeys + total * key_size, keys, key_size) ||
> > + copy_to_user(uvalues + total * value_size, values,
> > + value_size)) {
>
> We cannot do copy_to_user inside an atomic region where irqs are
> disabled with raw_spin_lock_irqsave(). We could do the following:
> . we kmalloc memory before preempt_disable() with the current
> bucket size as the count.
> . inside the raw_spin_lock_irqsave() region, we copy to kernel
> memory only.
> . inside the raw_spin_lock_irqsave() region, if the bucket size
> has changed, we can have a few retries to increase the allocation size
> before giving up.
> Do you think this may work?

Yes, it does.

What should the initial value for the allocated memory be,
max_entries/2? Do you see any issue if we just kmalloc the entire
buffer?

>
> > + ret = -EFAULT;
> > + goto after_loop;
> > + }
> > + total++;
> > + }
> > +
> > + batch++;
> > + if (batch >= htab->n_buckets) {
> > + ret = -ENOENT;
> > + goto after_loop;
> > + }
> > +
> > + raw_spin_unlock_irqrestore(&b->lock, flags);
> > + goto again;
> > +
> > +after_loop:
> > + raw_spin_unlock_irqrestore(&b->lock, flags);
> > +
> > + rcu_read_unlock();
> > + this_cpu_dec(bpf_prog_active);
> > + preempt_enable();
> > +
> > + if (ret && ret != -ENOENT)
> > + goto out;
> > +
> > + /* copy data back to user */
> > + ubatch = u64_to_user_ptr(attr->batch.out_batch);
> > + if (copy_to_user(ubatch, &batch, sizeof(batch)) ||
> > + put_user(total, &uattr->batch.count))
> > + ret = -EFAULT;
> > +
> > +out:
> > + kvfree(keys);
> > + kvfree(values);
> > + return ret;
> > +}
> > +
> [...]

2020-01-07 17:59:09

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 05/11] bpf: add generic_batch_ops to lpm_trie map



On 1/6/20 10:39 PM, Brian Vazquez wrote:
> Hi Yonghong, thanks for reviewing it, and sorry for the late reply; I
> had been traveling.
>
> On Fri, Dec 13, 2019 at 11:46 AM Yonghong Song <[email protected]> wrote:
>>
>>
>>
>> On 12/11/19 2:33 PM, Brian Vazquez wrote:
>>> This adds the generic batch ops functionality to bpf lpm_trie.
>>>
>>> Signed-off-by: Brian Vazquez <[email protected]>
>>> ---
>>> kernel/bpf/lpm_trie.c | 4 ++++
>>> 1 file changed, 4 insertions(+)
>>>
>>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>>> index 56e6c75d354d9..92c47b4f03337 100644
>>> --- a/kernel/bpf/lpm_trie.c
>>> +++ b/kernel/bpf/lpm_trie.c
>>> @@ -743,4 +743,8 @@ const struct bpf_map_ops trie_map_ops = {
>>> .map_update_elem = trie_update_elem,
>>> .map_delete_elem = trie_delete_elem,
>>> .map_check_btf = trie_check_btf,
>>> + .map_lookup_batch = generic_map_lookup_batch,
>>> + .map_lookup_and_delete_batch = generic_map_lookup_and_delete_batch,
>>
>> Not 100% sure whether trie should use generic map
>> lookup/lookup_and_delete or not. If the key is not available,
>> the get_next_key will return the 'leftmost' node which roughly
>> corresponding to the first node in the hash table.
>>
>
> I think you're right, we shouldn't use generic
> lookup/lookup_and_delete for lpm_trie. That being said, would you be
> ok, if we don't add lpm_trie support in this patch series? Also we can
> drop the generic_map_lookup_and_delete implementation in this patch
> series and add it in the future, if needed. What do you think?

Yes, we can drop generic_map_lookup_and_delete(); it probably will not
be used a lot. In a normal array map, you cannot delete elements,
and fd_array maps tend to be small.

>
>>> + .map_delete_batch = generic_map_delete_batch,
>>> + .map_update_batch = generic_map_update_batch,
>>> };
>>>

2020-01-07 18:08:34

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 02/11] bpf: add generic support for lookup and lookup_and_delete batch ops



On 1/6/20 10:50 PM, Brian Vazquez wrote:
> On Fri, Dec 13, 2019 at 11:26 AM Yonghong Song <[email protected]> wrote:
>>
>>
>>
>> On 12/11/19 2:33 PM, Brian Vazquez wrote:
>>> This commit introduces generic support for the bpf_map_lookup_batch and
>>> bpf_map_lookup_and_delete_batch ops. This implementation can be used by
>>> almost all the bpf maps since its core implementation relies on the
>>> existing map_get_next_key, map_lookup_elem and map_delete_elem
>>> functions. The bpf syscall subcommands introduced are:
>>>
>>> BPF_MAP_LOOKUP_BATCH
>>> BPF_MAP_LOOKUP_AND_DELETE_BATCH
>>>
>>> The UAPI attribute is:
>>>
>>> struct { /* struct used by BPF_MAP_*_BATCH commands */
>>> __aligned_u64 in_batch; /* start batch,
>>> * NULL to start from beginning
>>> */
>>> __aligned_u64 out_batch; /* output: next start batch */
>>> __aligned_u64 keys;
>>> __aligned_u64 values;
>>> __u32 count; /* input/output:
>>> * input: # of key/value
>>> * elements
>>> * output: # of filled elements
>>> */
>>> __u32 map_fd;
>>> __u64 elem_flags;
>>> __u64 flags;
>>> } batch;
>>>
>>> in_batch/out_batch are opaque values used to communicate between
>>> user/kernel space; in_batch/out_batch must be of key_size length.
>>>
>>> To start iterating from the beginning in_batch must be null,
>>> count is the # of key/value elements to retrieve. Note that the 'keys'
>>> buffer must be a buffer of key_size * count size and the 'values' buffer
>>> must be value_size * count, where value_size must be aligned to 8 bytes
>>> by userspace if it's dealing with percpu maps. 'count' will contain the
>>> number of keys/values successfully retrieved. Note that 'count' is an
>>> input/output variable and it can contain a lower value after a call.
>>>
>>> If there are no more entries to retrieve, ENOENT will be returned. If the
>>> error is ENOENT, count might be > 0 in case it copied some values but there
>>> were no more entries to retrieve.
>>>
>>> Note that if the return code is an error and not -EFAULT,
>>> count indicates the number of elements successfully processed.
>>>
>>> Suggested-by: Stanislav Fomichev <[email protected]>
>>> Signed-off-by: Brian Vazquez <[email protected]>
>>> Signed-off-by: Yonghong Song <[email protected]>
>>> ---
>>> include/linux/bpf.h | 11 +++
>>> include/uapi/linux/bpf.h | 19 +++++
>>> kernel/bpf/syscall.c | 172 +++++++++++++++++++++++++++++++++++++++
>>> 3 files changed, 202 insertions(+)
>> [...]
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 2530266fa6477..708aa89fe2308 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -1206,6 +1206,120 @@ static int map_get_next_key(union bpf_attr *attr)
>>> return err;
>>> }
>>>
>>> +#define MAP_LOOKUP_RETRIES 3
>>> +
>>> +static int __generic_map_lookup_batch(struct bpf_map *map,
>>> + const union bpf_attr *attr,
>>> + union bpf_attr __user *uattr,
>>> + bool do_delete)
>>> +{
>>> + void __user *ubatch = u64_to_user_ptr(attr->batch.in_batch);
>>> + void __user *uobatch = u64_to_user_ptr(attr->batch.out_batch);
>>> + void __user *values = u64_to_user_ptr(attr->batch.values);
>>> + void __user *keys = u64_to_user_ptr(attr->batch.keys);
>>> + void *buf, *prev_key, *key, *value;
>>> + u32 value_size, cp, max_count;
>>> + bool first_key = false;
>>> + int err, retry = MAP_LOOKUP_RETRIES;
>>
>> Could you try to use reverse Christmas tree style declaration here?
>
> ACK
>>
>>> +
>>> + if (attr->batch.elem_flags & ~BPF_F_LOCK)
>>> + return -EINVAL;
>>> +
>>> + if ((attr->batch.elem_flags & BPF_F_LOCK) &&
>>> + !map_value_has_spin_lock(map))
>>> + return -EINVAL;
>>> +
>>> + value_size = bpf_map_value_size(map);
>>> +
>>> + max_count = attr->batch.count;
>>> + if (!max_count)
>>> + return 0;
>>> +
>>> + buf = kmalloc(map->key_size + value_size, GFP_USER | __GFP_NOWARN);
>>> + if (!buf)
>>> + return -ENOMEM;
>>> +
>>> + err = -EFAULT;
>>> + first_key = false;
>>> + if (ubatch && copy_from_user(buf, ubatch, map->key_size))
>>> + goto free_buf;
>>> + key = buf;
>>> + value = key + map->key_size;
>>> + if (!ubatch) {
>>> + prev_key = NULL;
>>> + first_key = true;
>>> + }
>>> +
>>> + for (cp = 0; cp < max_count;) {
>>> + if (cp || first_key) {
>>> + rcu_read_lock();
>>> + err = map->ops->map_get_next_key(map, prev_key, key);
>>> + rcu_read_unlock();
>>> + if (err)
>>> + break;
>>> + }
>>> + err = bpf_map_copy_value(map, key, value,
>>> + attr->batch.elem_flags, do_delete);
>>> +
>>> + if (err == -ENOENT) {
>>> + if (retry) {
>>> + retry--;
>>> + continue;
>>> + }
>>> + err = -EINTR;
>>> + break;
>>> + }
>>> +
>>> + if (err)
>>> + goto free_buf;
>>> +
>>> + if (copy_to_user(keys + cp * map->key_size, key,
>>> + map->key_size)) {
>>> + err = -EFAULT;
>>> + goto free_buf;
>>> + }
>>> + if (copy_to_user(values + cp * value_size, value, value_size)) {
>>> + err = -EFAULT;
>>> + goto free_buf;
>>> + }
>>> +
>>> + prev_key = key;
>>> + retry = MAP_LOOKUP_RETRIES;
>>> + cp++;
>>> + }
>>> +
>>> + if (!err) {
>>> + rcu_read_lock();
>>> + err = map->ops->map_get_next_key(map, prev_key, key);
>>> + rcu_read_unlock();
>>> + }
>>> +
>>> + if (err)
>>> + memset(key, 0, map->key_size);
>>
>> So if any error happens due to the above map_get_next_key() or an
>> earlier error, the next "batch" returned to user space could be "0".
>> How should user space handle this? Ultimately, does user space need
>> to start from the beginning again?
>>
>> What I mean here is how we could design an interface so that user
>> space, barring an -EFAULT error, can successfully get all elements
>> without duplication.
>>
>> One way to do that here is to just return -EFAULT if we cannot get a
>> proper next key. But maybe we can come up with a better mechanism
>> once we work out what the user space code will look like.
>
> I was thinking that instead of using the "next key" as a token we
> could use the last value successfully copied as the token; that way
> user space code would always be able to start/retry from the last
> processed entry. Do you think this would work?

Yes, this should work.


>>
>>> +
>>> + if ((copy_to_user(&uattr->batch.count, &cp, sizeof(cp)) ||
>>> + (copy_to_user(uobatch, key, map->key_size))))
>>> + err = -EFAULT;
>>> +
>>> +free_buf:
>>> + kfree(buf);
>>> + return err;
>>> +}
>>> +
>> [...]
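The in_batch/out_batch resumption contract discussed above can be sketched with a small userspace-style model. Everything below is invented for illustration (the in-memory "map", the function names, the sizes); the real interface is the BPF_MAP_LOOKUP_BATCH syscall command operating on an actual bpf map:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy in-memory model of the in_batch/out_batch cursor contract:
 * each call fills up to *count elements starting at the cursor,
 * writes the next cursor to *out_batch, and returns -ENOENT once
 * the map is exhausted. All names here are illustrative. */
#define MODEL_NELEMS 10

static int model_lookup_batch(const unsigned *in_batch, unsigned *out_batch,
			      unsigned *keys, unsigned *values, unsigned *count)
{
	unsigned start = in_batch ? *in_batch : 0;
	unsigned cp = 0;

	while (cp < *count && start + cp < MODEL_NELEMS) {
		keys[cp] = start + cp;
		values[cp] = (start + cp) * 2;
		cp++;
	}
	*out_batch = start + cp;
	*count = cp;
	return start + cp >= MODEL_NELEMS ? -ENOENT : 0;
}

/* Userspace-style loop: pass NULL in_batch on the first call, then
 * resume from the returned out_batch until -ENOENT. */
static unsigned iterate_all(void)
{
	unsigned keys[4], values[4], batch = 0, total = 0, count;
	const unsigned *in = NULL;
	int err;

	do {
		count = 4;
		err = model_lookup_batch(in, &batch, keys, values, &count);
		total += count;	/* count is the number actually filled */
		in = &batch;	/* resume from the returned cursor */
	} while (!err);
	return total;
}
```

Note that the final call can both fill a partial batch and return -ENOENT, so the caller must consume `count` before checking the error, which is why the review above worries about what cursor is returned on error paths.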

2020-01-07 18:21:08

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 06/11] bpf: add batch ops to all htab bpf map



On 1/6/20 11:02 PM, Brian Vazquez wrote:
> On Fri, Dec 13, 2019 at 12:58 PM Yonghong Song <[email protected]> wrote:
>>
>>
>>
>> On 12/11/19 2:33 PM, Brian Vazquez wrote:
>>> From: Yonghong Song <[email protected]>
>>>
>>> htab can't use generic batch support due to some problematic behaviours
>>> inherent to the data structure, i.e. while iterating the bpf map a
>>> concurrent program might delete the next entry that the batch was about
>>> to use; in that case there's no easy way to retrieve the next entry.
>>> The issue has been discussed multiple times (see [1] and [2]).
>>>
>>> The only way hmap can be traversed without the problem described
>>> above is by making sure that the map is traversed in entire buckets.
>>> This commit implements those strict requirements for hmap; the
>>> implementation follows the same interaction as the generic support,
>>> with some exceptions:
>>>
>>> - If the keys/values buffers are not big enough to traverse a bucket,
>>> ENOSPC will be returned.
>>> - out_batch contains the value of the next bucket in the iteration, not
>>> the next key, but this is transparent to the user since the user
>>> should never use out_batch for anything other than bpf batch syscalls.
>>>
>>> Note that only lookup and lookup_and_delete batch ops require the hmap
>>> specific implementation, update/delete batch ops can be the generic
>>> ones.
>>>
>>> [1] https://lore.kernel.org/bpf/[email protected]/
>>> [2] https://lore.kernel.org/bpf/[email protected]/
>>>
>>> Signed-off-by: Yonghong Song <[email protected]>
>>> Signed-off-by: Brian Vazquez <[email protected]>
>>> ---
>>> kernel/bpf/hashtab.c | 242 +++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 242 insertions(+)
>>>
>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>> index 22066a62c8c97..fac107bdaf9ec 100644
>>> --- a/kernel/bpf/hashtab.c
>>> +++ b/kernel/bpf/hashtab.c
>>> @@ -17,6 +17,17 @@
>>> (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
>>> BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
>>>
>>> +#define BATCH_OPS(_name) \
>>> + .map_lookup_batch = \
>>> + _name##_map_lookup_batch, \
>>> + .map_lookup_and_delete_batch = \
>>> + _name##_map_lookup_and_delete_batch, \
>>> + .map_update_batch = \
>>> + generic_map_update_batch, \
>>> + .map_delete_batch = \
>>> + generic_map_delete_batch
>>> +
>>> +
>>> struct bucket {
>>> struct hlist_nulls_head head;
>>> raw_spinlock_t lock;
>>> @@ -1232,6 +1243,233 @@ static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
>>> rcu_read_unlock();
>>> }
>>>
>>> +static int
>>> +__htab_map_lookup_and_delete_batch(struct bpf_map *map,
>>> + const union bpf_attr *attr,
>>> + union bpf_attr __user *uattr,
>>> + bool do_delete, bool is_lru_map,
>>> + bool is_percpu)
>>> +{
>>> + struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>>> + u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
>>> + void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
>>> + void __user *uvalues = u64_to_user_ptr(attr->batch.values);
>>> + void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
>>> + void *ubatch = u64_to_user_ptr(attr->batch.in_batch);
>>> + u64 elem_map_flags, map_flags;
>>> + struct hlist_nulls_head *head;
>>> + u32 batch, max_count, size;
>>> + struct hlist_nulls_node *n;
>>> + unsigned long flags;
>>> + struct htab_elem *l;
>>> + struct bucket *b;
>>> + int ret = 0;
>>> +
>>> + max_count = attr->batch.count;
>>> + if (!max_count)
>>> + return 0;
>>> +
>>> + elem_map_flags = attr->batch.elem_flags;
>>> + if ((elem_map_flags & ~BPF_F_LOCK) ||
>>> + ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
>>> + return -EINVAL;
>>> +
>>> + map_flags = attr->batch.flags;
>>> + if (map_flags)
>>> + return -EINVAL;
>>> +
>>> + batch = 0;
>>> + if (ubatch && copy_from_user(&batch, ubatch, sizeof(batch)))
>>> + return -EFAULT;
>>> +
>>> + if (batch >= htab->n_buckets)
>>> + return -ENOENT;
>>> +
>>> + /* We cannot do copy_from_user or copy_to_user inside
>>> + * the rcu_read_lock. Allocate enough space here.
>>> + */
>>> + key_size = htab->map.key_size;
>>> + roundup_key_size = round_up(htab->map.key_size, 8);
>>> + value_size = htab->map.value_size;
>>> + size = round_up(value_size, 8);
>>> + if (is_percpu)
>>> + value_size = size * num_possible_cpus();
>>> + keys = kvmalloc(key_size, GFP_USER | __GFP_NOWARN);
>>> + values = kvmalloc(value_size, GFP_USER | __GFP_NOWARN);
>>> + if (!keys || !values) {
>>> + ret = -ENOMEM;
>>> + goto out;
>>> + }
>>> +
>>> + dst_key = keys;
>>> + dst_val = values;
>>> + total = 0;
>>> +
>>> + preempt_disable();
>>> + this_cpu_inc(bpf_prog_active);
>>> + rcu_read_lock();
>>> +
>>> +again:
>>> + b = &htab->buckets[batch];
>>> + head = &b->head;
>>> + raw_spin_lock_irqsave(&b->lock, flags);
>>> +
>>> + bucket_cnt = 0;
>>> + hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
>>> + bucket_cnt++;
>>> +
>>> + if (bucket_cnt > (max_count - total)) {
>>> + if (total == 0)
>>> + ret = -ENOSPC;
>>> + goto after_loop;
>>> + }
>>> +
>>> + hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>>> + memcpy(dst_key, l->key, key_size);
>>> +
>>> + if (is_percpu) {
>>> + int off = 0, cpu;
>>> + void __percpu *pptr;
>>> +
>>> + pptr = htab_elem_get_ptr(l, map->key_size);
>>> + for_each_possible_cpu(cpu) {
>>> + bpf_long_memcpy(dst_val + off,
>>> + per_cpu_ptr(pptr, cpu), size);
>>> + off += size;
>>> + }
>>> + } else {
>>> + value = l->key + roundup_key_size;
>>> + if (elem_map_flags & BPF_F_LOCK)
>>> + copy_map_value_locked(map, dst_val, value,
>>> + true);
>>> + else
>>> + copy_map_value(map, dst_val, value);
>>> + check_and_init_map_lock(map, dst_val);
>>> + }
>>> + if (do_delete) {
>>> + hlist_nulls_del_rcu(&l->hash_node);
>>> + if (is_lru_map)
>>> + bpf_lru_push_free(&htab->lru, &l->lru_node);
>>> + else
>>> + free_htab_elem(htab, l);
>>> + }
>>> + if (copy_to_user(ukeys + total * key_size, keys, key_size) ||
>>> + copy_to_user(uvalues + total * value_size, values,
>>> + value_size)) {
>>
>> We cannot do copy_to_user() inside an atomic region where irqs are
>> disabled with raw_spin_lock_irqsave(). We could do the following:
>> . kmalloc memory before preempt_disable() based on the current
>> bucket size.
>> . inside the raw_spin_lock_irqsave() region, copy to the kernel
>> memory.
>> . inside the raw_spin_lock_irqsave() region, if the bucket size
>> changed, retry a few times with a larger allocation size before
>> giving up.
>> Do you think this may work?
>
> Yes, it does.
>
> What should be the initial size of the allocated memory,
> max_entries/2? Do you see any issue if we just kmalloc the entire
> buffer?

Allocating max_entries/2 or the entire buffer risks allocating too much
memory from the system, which may not be a good thing in a production
system. That is why I proposed allocating memory at the bucket level.
For a reasonably balanced hash table, this should not cause large
memory pressure on the host. What do you think?

>
>>
>>> + ret = -EFAULT;
>>> + goto after_loop;
>>> + }
>>> + total++;
>>> + }
>>> +
>>> + batch++;
>>> + if (batch >= htab->n_buckets) {
>>> + ret = -ENOENT;
>>> + goto after_loop;
>>> + }
>>> +
>>> + raw_spin_unlock_irqrestore(&b->lock, flags);
>>> + goto again;
>>> +
>>> +after_loop:
>>> + raw_spin_unlock_irqrestore(&b->lock, flags);
>>> +
>>> + rcu_read_unlock();
>>> + this_cpu_dec(bpf_prog_active);
>>> + preempt_enable();
>>> +
>>> + if (ret && ret != -ENOENT)
>>> + goto out;
>>> +
>>> + /* copy data back to user */
>>> + ubatch = u64_to_user_ptr(attr->batch.out_batch);
>>> + if (copy_to_user(ubatch, &batch, sizeof(batch)) ||
>>> + put_user(total, &uattr->batch.count))
>>> + ret = -EFAULT;
>>> +
>>> +out:
>>> + kvfree(keys);
>>> + kvfree(values);
>>> + return ret;
>>> +}
>>> +
>> [...]
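The bucket-granularity contract in this patch (only whole buckets are copied out, the in_batch/out_batch cursor is a bucket index, and -ENOSPC is returned when even the first bucket does not fit in the remaining user buffer) can be sketched with a toy model. The bucket sizes and names below are invented for illustration:

```c
#include <assert.h>
#include <errno.h>

/* Toy model of the htab batch contract: buckets are copied out whole,
 * the cursor is a bucket index, -ENOSPC is returned if even the first
 * bucket does not fit, and -ENOENT signals the whole map was traversed.
 * Sizes and names are illustrative only. */
#define NBUCKETS 4
static const unsigned bucket_cnt[NBUCKETS] = { 3, 1, 5, 2 };

static int model_htab_lookup_batch(unsigned in_batch, unsigned *out_batch,
				   unsigned max_count, unsigned *copied)
{
	unsigned batch = in_batch, total = 0;

	while (batch < NBUCKETS) {
		if (bucket_cnt[batch] > max_count - total) {
			/* bucket doesn't fit; stop before it */
			*out_batch = batch;
			*copied = total;
			return total == 0 ? -ENOSPC : 0;
		}
		total += bucket_cnt[batch++];
	}
	*out_batch = batch;
	*copied = total;
	return -ENOENT;	/* whole map traversed */
}
```

On -ENOSPC the caller is expected to retry the same cursor with larger keys/values buffers, which is why the commit message calls the requirement "strict".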

2020-01-08 00:35:31

by Brian Vazquez

[permalink] [raw]
Subject: Re: [PATCH v3 bpf-next 06/11] bpf: add batch ops to all htab bpf map

On Tue, Jan 7, 2020 at 12:18 PM Yonghong Song <[email protected]> wrote:
>
> On 1/6/20 11:02 PM, Brian Vazquez wrote:
> > On Fri, Dec 13, 2019 at 12:58 PM Yonghong Song <[email protected]> wrote:
> > [...]
> >> We cannot do copy_to_user() inside an atomic region where irqs are
> >> disabled with raw_spin_lock_irqsave(). We could do the following:
> >> . kmalloc memory before preempt_disable() based on the current
> >> bucket size.
> >> . inside the raw_spin_lock_irqsave() region, copy to the kernel
> >> memory.
> >> . inside the raw_spin_lock_irqsave() region, if the bucket size
> >> changed, retry a few times with a larger allocation size before
> >> giving up.
> >> Do you think this may work?
> >
> > Yes, it does.
> >
> > What should be the initial size of the allocated memory,
> > max_entries/2? Do you see any issue if we just kmalloc the entire
> > buffer?
>
> Allocating max_entries/2 or the entire buffer risks allocating too much
> memory from the system, which may not be a good thing in a production
> system. That is why I proposed allocating memory at the bucket level.
> For a reasonably balanced hash table, this should not cause large
> memory pressure on the host. What do you think?

Sounds reasonable, I'll do that! Thanks for the feedback!
> [...]
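The allocate/recount/retry scheme Brian agrees to implement above can be sketched as follows. This is a userspace illustration only: the lock and copy_to_user() points are marked as comments, and all names, the retry limit, and the demo counter are invented:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Illustrative sketch (not kernel code): size a kernel-side buffer from
 * the bucket count, re-check the count under the bucket lock, and if
 * the bucket grew in the meantime, drop the lock and retry with a
 * larger allocation, giving up after a few rounds. */
enum { RETRIES = 3 };

static int fill_from_bucket(unsigned (*count_bucket)(void),
			    unsigned **buf, unsigned *nelems)
{
	unsigned cnt = count_bucket();	/* pre-lock estimate */
	int retries = RETRIES;

	for (;;) {
		unsigned now;
		unsigned *p = realloc(*buf, cnt * sizeof(**buf));

		if (!p)
			return -ENOMEM;
		*buf = p;
		/* raw_spin_lock_irqsave(&b->lock, flags); */
		now = count_bucket();	/* recount under the lock */
		if (now <= cnt) {
			*nelems = now;	/* copy elements into *buf here */
			/* raw_spin_unlock_irqrestore(&b->lock, flags);
			 * copy_to_user() happens only after the lock is
			 * dropped */
			return 0;
		}
		/* raw_spin_unlock_irqrestore(&b->lock, flags);
		 * bucket grew while unlocked: retry with a bigger buffer */
		if (retries-- == 0)
			return -EBUSY;
		cnt = now;
	}
}

/* Demo counter: the bucket appears to grow from 2 to 4 between the
 * pre-lock estimate and the locked recount, then stays stable. */
static unsigned demo_calls;
static unsigned demo_count(void)
{
	return demo_calls++ ? 4 : 2;
}
```

This keeps the allocation at bucket granularity, matching Yonghong's point that sizing the buffer from max_entries would risk excessive memory use on a production host.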