2019-11-07 21:22:28

by Brian Vazquez

[permalink] [raw]
Subject: [RFC bpf-next 0/3] bpf: adding map batch processing support

This is a follow up in the effort to batch bpf map operations to reduce
the syscall overhead with the map_ops. I initially proposed the idea and
the discussion is here:
https://lore.kernel.org/bpf/[email protected]/

Yonghong talked at the LPC about this and also proposed and idea that
handles the special weird case of hashtables by doing traversing using
the buckets as a reference instead of a key. Discussion is here:
https://lore.kernel.org/bpf/[email protected]/

This RFC proposes a way to extend batch operations for more data
structures by creating generic batch functions that can be used instead
of implementing the operations for each individual data structure,
reducing the code that needs to be maintained. The series contains the
patches used in Yonghong's RFC and the patch that adds the generic
implementation of the operations plus some testing with pcpu hashmaps
and arrays. Note that pcpu hashmap shouldn't use the generic
implementation and it either should have its own implementation or share
the one introduced by Yonghong, I added that just as an example to show
that the generic implementation can be easily added to a data structure.

What I want to achieve with this RFC is to collect early feedback and see if
there's any major concern about this before I move forward. I do plan
to better separate this into different patches and explain them properly
in the commit messages.

Current known issues where I would like to discuss are the followings:

- Because Yonghong's UAPI definition was done specifically for
iterating buckets, the batch field is u64 and is treated as an u64
instead of an opaque pointer, this won't work for other data structures
that are going to use a key as a batch token with a size greater than
64. Although I think at this point the only key that couldn't be
treated as a u64 is the key of a hashmap, and the hashmap won't use
the generic interface.
- Not all the data structures use delete (because it's not a valid
operation) i.e. arrays. So maybe lookup_and_delete_batch command is
not needed and we can handle that operation with a lookup_batch and a
flag.
- For delete_batch (not just the lookup_and_delete_batch). Is this
operation really needed? If so, shouldn't it be better if the
behaviour is delete the keys provided? I did that with my generic
implementation but Yonghong's delete_batch for a hashmap deletes
buckets.

Brian Vazquez (1):
bpf: add generic batch support

Yonghong Song (2):
bpf: adding map batch processing support
tools/bpf: test bpf_map_lookup_and_delete_batch()

include/linux/bpf.h | 21 +
include/uapi/linux/bpf.h | 22 +
kernel/bpf/arraymap.c | 4 +
kernel/bpf/hashtab.c | 331 ++++++++++
kernel/bpf/syscall.c | 573 ++++++++++++++----
tools/include/uapi/linux/bpf.h | 22 +
tools/lib/bpf/bpf.c | 59 ++
tools/lib/bpf/bpf.h | 13 +
tools/lib/bpf/libbpf.map | 4 +
.../map_tests/map_lookup_and_delete_batch.c | 245 ++++++++
.../map_lookup_and_delete_batch_array.c | 118 ++++
11 files changed, 1292 insertions(+), 120 deletions(-)
create mode 100644 tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c
create mode 100644 tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch_array.c

--
2.24.0.432.g9d3f5f5b63-goog


2019-11-07 21:22:36

by Brian Vazquez

[permalink] [raw]
Subject: [RFC bpf-next 1/3] bpf: adding map batch processing support

From: Yonghong Song <[email protected]>

Brian Vazquez has proposed BPF_MAP_DUMP command to look up more than one
map entries per syscall.
https://lore.kernel.org/bpf/CABCgpaU3xxX6CMMxD+1knApivtc2jLBHysDXw-0E9bQEL0qC3A@mail.gmail.com/T/#t

During discussion, we found more use cases can be supported in a similar
map operation batching framework. For example, batched map lookup and delete,
which can be really helpful for bcc.
https://github.com/iovisor/bcc/blob/master/tools/tcptop.py#L233-L243
https://github.com/iovisor/bcc/blob/master/tools/slabratetop.py#L129-L138

Also, in bcc, we have API to delete all entries in a map.
https://github.com/iovisor/bcc/blob/master/src/cc/api/BPFTable.h#L257-L264

For map update, batched operations also useful as sometimes applications need
to populate initial maps with more than one entry. For example, the below
example is from kernel/samples/bpf/xdp_redirect_cpu_user.c:
https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu_user.c#L543-L550

This patch addresses all the above use cases. To make uapi stable, it also
covers other potential use cases. For bpf syscall subcommands are introduced:
BPF_MAP_LOOKUP_BATCH
BPF_MAP_LOOKUP_AND_DELETE_BATCH
BPF_MAP_UPDATE_BATCH
BPF_MAP_DELETE_BATCH

The UAPI attribute structure looks like:

struct { /* struct used by BPF_MAP_*_BATCH commands */
__u64 batch; /* input/output:
* input: start batch,
* 0 to start from beginning.
* output: next start batch,
* 0 to end batching.
*/
__aligned_u64 keys;
__aligned_u64 values;
__u32 count; /* input/output:
* input: # of elements keys/values.
* output: # of filled elements.
*/
__u32 map_fd;
__u64 elem_flags;
__u64 flags;
} batch;

An opaque value 'batch' is used for user/kernel space communication
for where in the map to start the operation for lookup/lookup_and_delete/delete.
input 'batch' = 0: to start the operation from the beginning of the map.
output 'batch': if not 0, the next input for batch operation.

For lookup/lookup_and_delete:
operation: lookup/lookup_and_delete starting from a particular 'batch'.
return:
'batch' 'count' return code meaning
0 0 0 Done. Nothing left
0 0 -ENOSPC no space to handle batch 0
> 0 0 -ENOSPC no space to handle 'batch'
> 0 > 0 0 stopped right before 'batch'
Note that:
(1). Even if return code is 0 and return 'count' > 0, the return 'count' may
not be equal to input 'count'. This happens when there is no enough space
to handle a batch.
(2). If the return code is an error and not -EFAULT,
'batch' indicates the batch has issues and 'count' indicates the number
of elements successfully processed.

For delete:
operation: deletion starting from a particular 'batch'.
return: 0 means everything is deleted from 'batch'.
error code means something deletion not happening.

For update:
operation: update 'count' number of elements in 'keys'/'values'.
return: 0 means successful updates for all elements.
error code, if not -EFAULT, 'count' is the number of successful updates.

Signed-off-by: Yonghong Song <[email protected]>
---
include/linux/bpf.h | 9 ++
include/uapi/linux/bpf.h | 22 +++
kernel/bpf/hashtab.c | 327 +++++++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 67 ++++++++
4 files changed, 425 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7c7f518811a66..66df540ee2473 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -39,6 +39,15 @@ struct bpf_map_ops {
int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
void (*map_release_uref)(struct bpf_map *map);
void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
+ int (*map_lookup_batch)(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+ int (*map_lookup_and_delete_batch)(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+ int (*map_update_batch)(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+ int (*map_delete_batch)(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr);

/* funcs callable from userspace and from eBPF programs */
void *(*map_lookup_elem)(struct bpf_map *map, void *key);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index df6809a764046..2d647fa6476cb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -107,6 +107,10 @@ enum bpf_cmd {
BPF_MAP_LOOKUP_AND_DELETE_ELEM,
BPF_MAP_FREEZE,
BPF_BTF_GET_NEXT_ID,
+ BPF_MAP_LOOKUP_BATCH,
+ BPF_MAP_LOOKUP_AND_DELETE_BATCH,
+ BPF_MAP_UPDATE_BATCH,
+ BPF_MAP_DELETE_BATCH,
};

enum bpf_map_type {
@@ -398,6 +402,24 @@ union bpf_attr {
__u64 flags;
};

+ struct { /* struct used by BPF_MAP_*_BATCH commands */
+ __u64 batch; /* input/output:
+ * input: start batch,
+ * 0 to start from beginning.
+ * output: next start batch,
+ * 0 to end batching.
+ */
+ __aligned_u64 keys;
+ __aligned_u64 values;
+ __u32 count; /* input/output:
+ * input: # of elements keys/values.
+ * output: # of filled elements.
+ */
+ __u32 map_fd;
+ __u64 elem_flags;
+ __u64 flags;
+ } batch;
+
struct { /* anonymous struct used by BPF_PROG_LOAD command */
__u32 prog_type; /* one of enum bpf_prog_type */
__u32 insn_cnt;
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a62c8c97..10977cb321862 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1232,6 +1232,325 @@ static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
rcu_read_unlock();
}

+static int
+__htab_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr,
+ bool do_delete, bool is_lru_map)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
+ void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
+ u64 elem_map_flags, map_flags;
+ struct hlist_nulls_head *head;
+ void __user *ukeys, *uvalues;
+ struct hlist_nulls_node *n;
+ u32 batch, max_count;
+ unsigned long flags;
+ struct htab_elem *l;
+ struct bucket *b;
+ int ret = 0;
+
+ max_count = attr->batch.count;
+ if (!max_count)
+ return 0;
+
+ elem_map_flags = attr->batch.elem_flags;
+ if ((elem_map_flags & ~BPF_F_LOCK) ||
+ ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
+ return -EINVAL;
+
+ map_flags = attr->batch.flags;
+ if (map_flags)
+ return -EINVAL;
+
+ batch = (u32)attr->batch.batch;
+ if (batch >= htab->n_buckets)
+ return -EINVAL;
+
+ /* We cannot do copy_from_user or copy_to_user inside
+ * the rcu_read_lock. Allocate enough space here.
+ */
+ key_size = htab->map.key_size;
+ roundup_key_size = round_up(htab->map.key_size, 8);
+ value_size = htab->map.value_size;
+ keys = kvmalloc(key_size * max_count, GFP_USER | __GFP_NOWARN);
+ values = kvmalloc(value_size * max_count, GFP_USER | __GFP_NOWARN);
+ if (!keys || !values) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ dst_key = keys;
+ dst_val = values;
+ total = 0;
+
+ preempt_disable();
+ this_cpu_inc(bpf_prog_active);
+ rcu_read_lock();
+
+again:
+ b = &htab->buckets[batch];
+ head = &b->head;
+ raw_spin_lock_irqsave(&b->lock, flags);
+
+ bucket_cnt = 0;
+ hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
+ bucket_cnt++;
+
+ if (bucket_cnt > (max_count - total)) {
+ if (total == 0)
+ ret = -ENOSPC;
+ goto after_loop;
+ }
+
+ hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) {
+ memcpy(dst_key, l->key, key_size);
+
+ value = l->key + roundup_key_size;
+ if (elem_map_flags & BPF_F_LOCK)
+ copy_map_value_locked(map, dst_val, value, true);
+ else
+ copy_map_value(map, dst_val, value);
+ check_and_init_map_lock(map, dst_val);
+
+ dst_key += key_size;
+ dst_val += value_size;
+ total++;
+ }
+
+ if (do_delete) {
+ hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) {
+ hlist_nulls_del_rcu(&l->hash_node);
+ if (is_lru_map)
+ bpf_lru_push_free(&htab->lru, &l->lru_node);
+ else
+ free_htab_elem(htab, l);
+ }
+ }
+
+ batch++;
+ if (batch >= htab->n_buckets) {
+ batch = 0;
+ goto after_loop;
+ }
+
+ raw_spin_unlock_irqrestore(&b->lock, flags);
+ goto again;
+
+after_loop:
+ raw_spin_unlock_irqrestore(&b->lock, flags);
+
+ rcu_read_unlock();
+ this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+
+ /* copy data back to user */
+ ukeys = u64_to_user_ptr(attr->batch.keys);
+ uvalues = u64_to_user_ptr(attr->batch.values);
+ if (put_user(batch, &uattr->batch.batch) ||
+ copy_to_user(ukeys, keys, total * key_size) ||
+ copy_to_user(uvalues, values, total * value_size) ||
+ put_user(total, &uattr->batch.count))
+ ret = -EFAULT;
+
+out:
+ kvfree(keys);
+ kvfree(values);
+ return ret;
+}
+
+static int
+__htab_map_update_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr, bool is_lru_map)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ u32 count, max_count, key_size, roundup_key_size, value_size;
+ u64 elem_map_flags, map_flags;
+ void __user *ukey, *uvalue;
+ void *key, *value;
+ int ret = 0;
+
+ max_count = attr->batch.count;
+ if (!max_count)
+ return 0;
+
+ elem_map_flags = attr->batch.elem_flags;
+ if ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map))
+ return -EINVAL;
+
+ map_flags = attr->batch.flags;
+ if (map_flags)
+ return -EINVAL;
+
+ key_size = htab->map.key_size;
+ roundup_key_size = round_up(htab->map.key_size, 8);
+ value_size = htab->map.value_size;
+ key = kmalloc(key_size, GFP_USER | __GFP_NOWARN);
+ value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
+ if (!key || !value) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ukey = u64_to_user_ptr(attr->batch.keys);
+ uvalue = u64_to_user_ptr(attr->batch.values);
+ for (count = 0; count < max_count; count++) {
+ if (copy_from_user(key, ukey + count * key_size, key_size) ||
+ copy_from_user(value, uvalue + count * value_size,
+ value_size)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ preempt_disable();
+ __this_cpu_inc(bpf_prog_active);
+ rcu_read_lock();
+ if (is_lru_map)
+ ret = htab_lru_map_update_elem(map, key, value,
+ elem_map_flags);
+ else
+ ret = htab_map_update_elem(map, key, value,
+ elem_map_flags);
+ rcu_read_unlock();
+ __this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+
+ if (ret) {
+ if (put_user(count, &uattr->batch.count))
+ ret = -EFAULT;
+ break;
+ }
+ }
+
+out:
+ kfree(key);
+ kfree(value);
+ return ret;
+}
+
+static int
+__htab_map_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr,
+ bool is_lru_map)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ u64 elem_map_flags, map_flags;
+ struct hlist_nulls_head *head;
+ struct hlist_nulls_node *n;
+ u32 batch, max_count;
+ unsigned long flags;
+ struct htab_elem *l;
+ struct bucket *b;
+
+ elem_map_flags = attr->batch.elem_flags;
+ map_flags = attr->batch.flags;
+ if (elem_map_flags || map_flags)
+ return -EINVAL;
+
+ max_count = attr->batch.count;
+ batch = (u32)attr->batch.batch;
+ if (max_count || batch >= htab->n_buckets)
+ return -EINVAL;
+
+ preempt_disable();
+ __this_cpu_inc(bpf_prog_active);
+ rcu_read_lock();
+
+again:
+ b = &htab->buckets[batch];
+ head = &b->head;
+ raw_spin_lock_irqsave(&b->lock, flags);
+
+ hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) {
+ hlist_nulls_del_rcu(&l->hash_node);
+ if (is_lru_map)
+ bpf_lru_push_free(&htab->lru, &l->lru_node);
+ else
+ free_htab_elem(htab, l);
+ }
+
+ batch++;
+ if (batch >= htab->n_buckets)
+ goto out;
+
+ raw_spin_unlock_irqrestore(&b->lock, flags);
+ goto again;
+
+out:
+ raw_spin_unlock_irqrestore(&b->lock, flags);
+ rcu_read_unlock();
+ __this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+
+ return 0;
+}
+
+static int
+htab_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
+ false);
+}
+
+static int
+htab_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
+ false);
+}
+
+static int
+htab_map_update_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_update_batch(map, attr, uattr, false);
+}
+
+static int
+htab_map_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_delete_batch(map, attr, uattr, false);
+}
+
+static int
+htab_lru_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
+ true);
+}
+
+static int
+htab_lru_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
+ true);
+}
+
+static int
+htab_lru_map_update_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_update_batch(map, attr, uattr, true);
+}
+
+static int
+htab_lru_map_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_delete_batch(map, attr, uattr, true);
+}
+
const struct bpf_map_ops htab_map_ops = {
.map_alloc_check = htab_map_alloc_check,
.map_alloc = htab_map_alloc,
@@ -1242,6 +1561,10 @@ const struct bpf_map_ops htab_map_ops = {
.map_delete_elem = htab_map_delete_elem,
.map_gen_lookup = htab_map_gen_lookup,
.map_seq_show_elem = htab_map_seq_show_elem,
+ .map_lookup_batch = htab_map_lookup_batch,
+ .map_lookup_and_delete_batch = htab_map_lookup_and_delete_batch,
+ .map_update_batch = htab_map_update_batch,
+ .map_delete_batch = htab_map_delete_batch,
};

const struct bpf_map_ops htab_lru_map_ops = {
@@ -1255,6 +1578,10 @@ const struct bpf_map_ops htab_lru_map_ops = {
.map_delete_elem = htab_lru_map_delete_elem,
.map_gen_lookup = htab_lru_map_gen_lookup,
.map_seq_show_elem = htab_map_seq_show_elem,
+ .map_lookup_batch = htab_lru_map_lookup_batch,
+ .map_lookup_and_delete_batch = htab_lru_map_lookup_and_delete_batch,
+ .map_update_batch = htab_lru_map_update_batch,
+ .map_delete_batch = htab_lru_map_delete_batch,
};

/* Called from eBPF program */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6d9ce95e5a8da..c9e5f928d85b0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2868,6 +2868,61 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
return err;
}

+#define BPF_MAP_BATCH_LAST_FIELD batch.flags
+
+#define BPF_DO_BATCH(fn) \
+ do { \
+ if (!fn) { \
+ err = -ENOTSUPP; \
+ goto err_put; \
+ } \
+ err = fn(map, attr, uattr); \
+ } while (0)
+
+static int bpf_map_do_batch(const union bpf_attr *attr,
+ union bpf_attr __user *uattr,
+ int cmd)
+{
+ struct bpf_map *map;
+ int err, ufd;
+ struct fd f;
+
+ if (CHECK_ATTR(BPF_MAP_BATCH))
+ return -EINVAL;
+
+ ufd = attr->batch.map_fd;
+ f = fdget(ufd);
+ map = __bpf_map_get(f);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+
+ if ((cmd == BPF_MAP_LOOKUP_BATCH ||
+ cmd == BPF_MAP_LOOKUP_AND_DELETE_BATCH) &&
+ !(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
+ err = -EPERM;
+ goto err_put;
+ }
+
+ if (cmd != BPF_MAP_LOOKUP_BATCH &&
+ !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
+ err = -EPERM;
+ goto err_put;
+ }
+
+ if (cmd == BPF_MAP_LOOKUP_BATCH)
+ BPF_DO_BATCH(map->ops->map_lookup_batch);
+ else if (cmd == BPF_MAP_LOOKUP_AND_DELETE_BATCH)
+ BPF_DO_BATCH(map->ops->map_lookup_and_delete_batch);
+ else if (cmd == BPF_MAP_UPDATE_BATCH)
+ BPF_DO_BATCH(map->ops->map_update_batch);
+ else
+ BPF_DO_BATCH(map->ops->map_delete_batch);
+
+err_put:
+ fdput(f);
+ return err;
+}
+
SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
{
union bpf_attr attr = {};
@@ -2965,6 +3020,18 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
err = map_lookup_and_delete_elem(&attr);
break;
+ case BPF_MAP_LOOKUP_BATCH:
+ err = bpf_map_do_batch(&attr, uattr, BPF_MAP_LOOKUP_BATCH);
+ break;
+ case BPF_MAP_LOOKUP_AND_DELETE_BATCH:
+ err = bpf_map_do_batch(&attr, uattr, BPF_MAP_LOOKUP_AND_DELETE_BATCH);
+ break;
+ case BPF_MAP_UPDATE_BATCH:
+ err = bpf_map_do_batch(&attr, uattr, BPF_MAP_UPDATE_BATCH);
+ break;
+ case BPF_MAP_DELETE_BATCH:
+ err = bpf_map_do_batch(&attr, uattr, BPF_MAP_DELETE_BATCH);
+ break;
default:
err = -EINVAL;
break;
--
2.24.0.432.g9d3f5f5b63-goog

2019-11-07 21:22:58

by Brian Vazquez

[permalink] [raw]
Subject: [RFC bpf-next 2/3] tools/bpf: test bpf_map_lookup_and_delete_batch()

From: Yonghong Song <[email protected]>

Added four libbpf API functions to support map batch operations:
. int bpf_map_delete_batch( ... )
. int bpf_map_lookup_batch( ... )
. int bpf_map_lookup_and_delete_batch( ... )
. int bpf_map_update_batch( ... )

Tested bpf_map_lookup_and_delete_batch() and bpf_map_update_batch()
functionality.
$ ./test_maps
...
test_map_lookup_and_delete_batch:PASS
...

Note that I clumped uapi header sync patch, libbpf patch
and tests patch together considering this is a RFC patch.
Will do proper formating once it is out of RFC stage.

Signed-off-by: Yonghong Song <[email protected]>
---
tools/include/uapi/linux/bpf.h | 22 +++
tools/lib/bpf/bpf.c | 59 +++++++
tools/lib/bpf/bpf.h | 13 ++
tools/lib/bpf/libbpf.map | 4 +
.../map_tests/map_lookup_and_delete_batch.c | 155 ++++++++++++++++++
5 files changed, 253 insertions(+)
create mode 100644 tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index df6809a764046..2d647fa6476cb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -107,6 +107,10 @@ enum bpf_cmd {
BPF_MAP_LOOKUP_AND_DELETE_ELEM,
BPF_MAP_FREEZE,
BPF_BTF_GET_NEXT_ID,
+ BPF_MAP_LOOKUP_BATCH,
+ BPF_MAP_LOOKUP_AND_DELETE_BATCH,
+ BPF_MAP_UPDATE_BATCH,
+ BPF_MAP_DELETE_BATCH,
};

enum bpf_map_type {
@@ -398,6 +402,24 @@ union bpf_attr {
__u64 flags;
};

+ struct { /* struct used by BPF_MAP_*_BATCH commands */
+ __u64 batch; /* input/output:
+ * input: start batch,
+ * 0 to start from beginning.
+ * output: next start batch,
+ * 0 to end batching.
+ */
+ __aligned_u64 keys;
+ __aligned_u64 values;
+ __u32 count; /* input/output:
+ * input: # of elements keys/values.
+ * output: # of filled elements.
+ */
+ __u32 map_fd;
+ __u64 elem_flags;
+ __u64 flags;
+ } batch;
+
struct { /* anonymous struct used by BPF_PROG_LOAD command */
__u32 prog_type; /* one of enum bpf_prog_type */
__u32 insn_cnt;
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index ca0d635b1d5ea..8b7e773003ddd 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -441,6 +441,65 @@ int bpf_map_freeze(int fd)
return sys_bpf(BPF_MAP_FREEZE, &attr, sizeof(attr));
}

+static int bpf_map_batch_common(int cmd, int fd, __u64 *batch,
+ void *keys, void *values,
+ __u32 *count, __u64 elem_flags,
+ __u64 flags)
+{
+ union bpf_attr attr = {};
+ int ret;
+
+ attr.batch.map_fd = fd;
+ if (batch)
+ attr.batch.batch = *batch;
+ attr.batch.keys = ptr_to_u64(keys);
+ attr.batch.values = ptr_to_u64(values);
+ if (count)
+ attr.batch.count = *count;
+ attr.batch.elem_flags = elem_flags;
+ attr.batch.flags = flags;
+
+ ret = sys_bpf(cmd, &attr, sizeof(attr));
+ if (batch)
+ *batch = attr.batch.batch;
+ if (count)
+ *count = attr.batch.count;
+
+ return ret;
+}
+
+int bpf_map_delete_batch(int fd, __u64 *batch, __u32 *count, __u64 elem_flags,
+ __u64 flags)
+{
+ return bpf_map_batch_common(BPF_MAP_DELETE_BATCH, fd, batch,
+ NULL, NULL, count, elem_flags, flags);
+}
+
+int bpf_map_lookup_batch(int fd, __u64 *batch, void *keys, void *values,
+ __u32 *count, __u64 elem_flags, __u64 flags)
+{
+ return bpf_map_batch_common(BPF_MAP_LOOKUP_BATCH, fd, batch,
+ keys, values, count, elem_flags, flags);
+}
+
+int bpf_map_lookup_and_delete_batch(int fd, __u64 *batch,
+ void *keys, void *values,
+ __u32 *count, __u64 elem_flags,
+ __u64 flags)
+{
+ return bpf_map_batch_common(BPF_MAP_LOOKUP_AND_DELETE_BATCH,
+ fd, batch, keys, values,
+ count, elem_flags, flags);
+}
+
+int bpf_map_update_batch(int fd, void *keys, void *values, __u32 *count,
+ __u64 elem_flags, __u64 flags)
+{
+ return bpf_map_batch_common(BPF_MAP_UPDATE_BATCH,
+ fd, NULL, keys, values,
+ count, elem_flags, flags);
+}
+
int bpf_obj_pin(int fd, const char *pathname)
{
union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 1c53bc5b4b3c7..e61da7a92a414 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -123,6 +123,19 @@ LIBBPF_API int bpf_map_lookup_and_delete_elem(int fd, const void *key,
LIBBPF_API int bpf_map_delete_elem(int fd, const void *key);
LIBBPF_API int bpf_map_get_next_key(int fd, const void *key, void *next_key);
LIBBPF_API int bpf_map_freeze(int fd);
+LIBBPF_API int bpf_map_delete_batch(int fd, __u64 *batch, __u32 *count,
+ __u64 elem_flags, __u64 flags);
+LIBBPF_API int bpf_map_lookup_batch(int fd, __u64 *batch, void *keys,
+ void *values, __u32 *count,
+ __u64 elem_flags, __u64 flags);
+LIBBPF_API int bpf_map_lookup_and_delete_batch(int fd, __u64 *batch,
+ void *keys, void *values,
+ __u32 *count, __u64 elem_flags,
+ __u64 flags);
+LIBBPF_API int bpf_map_update_batch(int fd, void *keys, void *values,
+ __u32 *count, __u64 elem_flags,
+ __u64 flags);
+
LIBBPF_API int bpf_obj_pin(int fd, const char *pathname);
LIBBPF_API int bpf_obj_get(const char *pathname);
LIBBPF_API int bpf_prog_attach(int prog_fd, int attachable_fd,
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 86173cbb159d3..0529a770a04eb 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -189,6 +189,10 @@ LIBBPF_0.0.4 {
LIBBPF_0.0.5 {
global:
bpf_btf_get_next_id;
+ bpf_map_delete_batch;
+ bpf_map_lookup_and_delete_batch;
+ bpf_map_lookup_batch;
+ bpf_map_update_batch;
} LIBBPF_0.0.4;

LIBBPF_0.0.6 {
diff --git a/tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c b/tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c
new file mode 100644
index 0000000000000..dd906b1de5950
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <test_maps.h>
+
+static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
+ int *values)
+{
+ int i, err;
+
+ for (i = 0; i < max_entries; i++) {
+ keys[i] = i + 1;
+ values[i] = i + 2;
+ }
+
+ err = bpf_map_update_batch(map_fd, keys, values, &max_entries, 0, 0);
+ CHECK(err, "bpf_map_update_batch()", "error:%s\n", strerror(errno));
+}
+
+static void map_batch_verify(int *visited, __u32 max_entries,
+ int *keys, int *values)
+{
+ int i;
+
+ memset(visited, 0, max_entries * sizeof(*visited));
+ for (i = 0; i < max_entries; i++) {
+ CHECK(keys[i] + 1 != values[i], "key/value checking",
+ "error: i %d key %d value %d\n", i, keys[i], values[i]);
+ visited[i] = 1;
+ }
+ for (i = 0; i < max_entries; i++) {
+ CHECK(visited[i] != 1, "visited checking",
+ "error: keys array at index %d missing\n", i);
+ }
+}
+
+void test_map_lookup_and_delete_batch(void)
+{
+ struct bpf_create_map_attr xattr = {
+ .name = "hash_map",
+ .map_type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(int),
+ .value_size = sizeof(int),
+ };
+ int map_fd, *keys, *values, *visited, key;
+ __u32 count, total, total_success;
+ const __u32 max_entries = 10;
+ int err, i, step;
+ bool nospace_err;
+ __u64 batch = 0;
+
+ xattr.max_entries = max_entries;
+ map_fd = bpf_create_map_xattr(&xattr);
+ CHECK(map_fd == -1,
+ "bpf_create_map_xattr()", "error:%s\n", strerror(errno));
+
+ keys = malloc(max_entries * sizeof(int));
+ values = malloc(max_entries * sizeof(int));
+ visited = malloc(max_entries * sizeof(int));
+ CHECK(!keys || !values || !visited, "malloc()", "error:%s\n", strerror(errno));
+
+ /* test 1: lookup/delete an empty hash table, success */
+ count = max_entries;
+ err = bpf_map_lookup_and_delete_batch(map_fd, &batch, keys, values,
+ &count, 0, 0);
+ CHECK(err, "empty map", "error: %s\n", strerror(errno));
+ CHECK(batch || count, "empty map", "batch = %lld, count = %u\n", batch, count);
+
+ /* populate elements to the map */
+ map_batch_update(map_fd, max_entries, keys, values);
+
+ /* test 2: lookup/delete with count = 0, success */
+ batch = 0;
+ count = 0;
+ err = bpf_map_lookup_and_delete_batch(map_fd, &batch, keys, values,
+ &count, 0, 0);
+ CHECK(err, "count = 0", "error: %s\n", strerror(errno));
+
+ /* test 3: lookup/delete with count = max_entries, success */
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * sizeof(*values));
+ count = max_entries;
+ batch = 0;
+ err = bpf_map_lookup_and_delete_batch(map_fd, &batch, keys,
+ values, &count, 0, 0);
+ CHECK(err, "count = max_entries", "error: %s\n", strerror(errno));
+ CHECK(count != max_entries || batch != 0, "count = max_entries",
+ "count = %u, max_entries = %u, batch = %lld\n",
+ count, max_entries, batch);
+ map_batch_verify(visited, max_entries, keys, values);
+
+ /* bpf_map_get_next_key() should return -ENOENT for an empty map. */
+ err = bpf_map_get_next_key(map_fd, NULL, &key);
+ CHECK(!err, "bpf_map_get_next_key()", "error: %s\n", strerror(errno));
+
+ /* test 4: lookup/delete in a loop with various steps. */
+ total_success = 0;
+ for (step = 1; step < max_entries; step++) {
+ map_batch_update(map_fd, max_entries, keys, values);
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * sizeof(*values));
+ batch = 0;
+ total = 0;
+ i = 0;
+ /* iteratively lookup/delete elements with 'step' elements each */
+ count = step;
+ nospace_err = false;
+ while (true) {
+ err = bpf_map_lookup_and_delete_batch(map_fd, &batch,
+ keys + total,
+ values + total,
+ &count, 0, 0);
+ /* It is possible that we are failing due to buffer size
+ * not big enough. In such cases, let us just exit and
+ * go with large steps. Not that a buffer size with
+ * max_entries should always work.
+ */
+ if (err && errno == ENOSPC) {
+ nospace_err = true;
+ break;
+ }
+
+ CHECK(err, "lookup/delete with steps", "error: %s\n",
+ strerror(errno));
+
+ total += count;
+ if (batch == 0)
+ break;
+
+ i++;
+ }
+
+ if (nospace_err == true)
+ continue;
+
+ CHECK(total != max_entries, "lookup/delete with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+
+ map_batch_verify(visited, max_entries, keys, values);
+ err = bpf_map_get_next_key(map_fd, NULL, &key);
+ CHECK(!err, "bpf_map_get_next_key()", "error: %s\n", strerror(errno));
+
+ total_success++;
+ }
+
+ CHECK(total_success == 0, "check total_success", "unexpected failure\n");
+
+ printf("%s:PASS\n", __func__);
+}
--
2.24.0.432.g9d3f5f5b63-goog

2019-11-07 21:23:12

by Brian Vazquez

[permalink] [raw]
Subject: [RFC bpf-next 3/3] bpf: add generic batch support

This commits add generic batch support, this coulde b used by data
structures that don't have specific constrains or inconsistencies like
hashmap. This implementation also allow us to incremenmtally add the
support for the remaining data structures keeping less lines of code.

This commit also tests the generic batch operations with pcpu hashmaps
and arrays. Note that pcpu hashmaps support was only added in this
commit for demonstration purposes and the specific implementation should
be used instead.

Cc: Yonghong Song <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: Petar Penkov <[email protected]>
Cc: Willem de Bruijn <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
include/linux/bpf.h | 12 +
kernel/bpf/arraymap.c | 4 +
kernel/bpf/hashtab.c | 4 +
kernel/bpf/syscall.c | 506 +++++++++++++-----
.../map_tests/map_lookup_and_delete_batch.c | 132 ++++-
.../map_lookup_and_delete_batch_array.c | 118 ++++
6 files changed, 635 insertions(+), 141 deletions(-)
create mode 100644 tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch_array.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 66df540ee2473..211f5d04748cc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -700,6 +700,18 @@ void bpf_map_charge_move(struct bpf_map_memory *dst,
void *bpf_map_area_alloc(size_t size, int numa_node);
void bpf_map_area_free(void *base);
void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
+int generic_map_lookup_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+int generic_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+int generic_map_update_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+int generic_map_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);

extern int sysctl_unprivileged_bpf_disabled;

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1c65ce0098a95..5afab2c36a3ba 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -457,6 +457,10 @@ const struct bpf_map_ops array_map_ops = {
.map_direct_value_meta = array_map_direct_value_meta,
.map_seq_show_elem = array_map_seq_show_elem,
.map_check_btf = array_map_check_btf,
+ .map_lookup_batch = generic_map_lookup_batch,
+ .map_lookup_and_delete_batch = generic_map_lookup_and_delete_batch,
+ .map_update_batch = generic_map_update_batch,
+ .map_delete_batch = generic_map_lookup_batch,
};

const struct bpf_map_ops percpu_array_map_ops = {
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 10977cb321862..d68374b1cd83e 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1695,6 +1695,10 @@ const struct bpf_map_ops htab_percpu_map_ops = {
.map_update_elem = htab_percpu_map_update_elem,
.map_delete_elem = htab_map_delete_elem,
.map_seq_show_elem = htab_percpu_map_seq_show_elem,
+ .map_lookup_batch = generic_map_lookup_batch,
+ .map_lookup_and_delete_batch = generic_map_lookup_and_delete_batch,
+ .map_update_batch = generic_map_update_batch,
+ .map_delete_batch = generic_map_lookup_batch,
};

const struct bpf_map_ops htab_lru_percpu_map_ops = {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c9e5f928d85b0..1a42ccab32113 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -127,6 +127,153 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
return map;
}

+static u32 bpf_map_value_size(struct bpf_map *map)
+{
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
+ map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
+ return round_up(map->value_size, 8) * num_possible_cpus();
+ else if (IS_FD_MAP(map))
+ return sizeof(u32);
+ else
+ return map->value_size;
+}
+
+static void maybe_wait_bpf_programs(struct bpf_map *map)
+{
+ /* Wait for any running BPF programs to complete so that
+ * userspace, when we return to it, knows that all programs
+ * that could be running use the new map value.
+ */
+ if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
+ map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
+ synchronize_rcu();
+}
+
+static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key,
+ void *value, __u64 flags)
+{
+ int err;
+ /* Need to create a kthread, thus must support schedule */
+ if (bpf_map_is_dev_bound(map)) {
+ return bpf_map_offload_update_elem(map, key, value, flags);
+ } else if (map->map_type == BPF_MAP_TYPE_CPUMAP ||
+ map->map_type == BPF_MAP_TYPE_SOCKHASH ||
+ map->map_type == BPF_MAP_TYPE_SOCKMAP) {
+ return map->ops->map_update_elem(map, key, value, flags);
+ }
+
+ /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
+ * inside bpf map update or delete otherwise deadlocks are possible
+ */
+ preempt_disable();
+ __this_cpu_inc(bpf_prog_active);
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
+ err = bpf_percpu_hash_update(map, key, value, flags);
+ } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+ err = bpf_percpu_array_update(map, key, value, flags);
+ } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+ err = bpf_percpu_cgroup_storage_update(map, key, value,
+ flags);
+ } else if (IS_FD_ARRAY(map)) {
+ rcu_read_lock();
+ err = bpf_fd_array_map_update_elem(map, f.file, key, value,
+ flags);
+ rcu_read_unlock();
+ } else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
+ rcu_read_lock();
+ err = bpf_fd_htab_map_update_elem(map, f.file, key, value,
+ flags);
+ rcu_read_unlock();
+ } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+ /* rcu_read_lock() is not needed */
+ err = bpf_fd_reuseport_array_update_elem(map, key, value,
+ flags);
+ } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+ map->map_type == BPF_MAP_TYPE_STACK) {
+ err = map->ops->map_push_elem(map, value, flags);
+ } else {
+ rcu_read_lock();
+ err = map->ops->map_update_elem(map, key, value, flags);
+ rcu_read_unlock();
+ }
+ __this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+ maybe_wait_bpf_programs(map);
+
+ return err;
+}
+
+static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
+ __u64 flags, bool do_delete)
+{
+ void *ptr;
+ int err;
+
+
+ if (bpf_map_is_dev_bound(map)) {
+ err = bpf_map_offload_lookup_elem(map, key, value);
+
+ if (!err && do_delete)
+ err = bpf_map_offload_delete_elem(map, key);
+
+ return err;
+ }
+
+ preempt_disable();
+ this_cpu_inc(bpf_prog_active);
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
+ err = bpf_percpu_hash_copy(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+ err = bpf_percpu_array_copy(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+ err = bpf_percpu_cgroup_storage_copy(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
+ err = bpf_stackmap_copy(map, key, value);
+ } else if (IS_FD_ARRAY(map)) {
+ err = bpf_fd_array_map_lookup_elem(map, key, value);
+ } else if (IS_FD_HASH(map)) {
+ err = bpf_fd_htab_map_lookup_elem(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+ err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
+ } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+ map->map_type == BPF_MAP_TYPE_STACK) {
+ err = map->ops->map_peek_elem(map, value);
+ } else {
+ rcu_read_lock();
+ if (map->ops->map_lookup_elem_sys_only)
+ ptr = map->ops->map_lookup_elem_sys_only(map, key);
+ else
+ ptr = map->ops->map_lookup_elem(map, key);
+ if (IS_ERR(ptr)) {
+ err = PTR_ERR(ptr);
+ } else if (!ptr) {
+ err = -ENOENT;
+ } else {
+ err = 0;
+ if (flags & BPF_F_LOCK)
+ /* lock 'ptr' and copy everything but lock */
+ copy_map_value_locked(map, value, ptr, true);
+ else
+ copy_map_value(map, value, ptr);
+ /* mask lock, since value wasn't zero inited */
+ check_and_init_map_lock(map, value);
+ }
+ rcu_read_unlock();
+ }
+ if (do_delete)
+ err = err ? err : map->ops->map_delete_elem(map, key);
+
+ this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+ maybe_wait_bpf_programs(map);
+
+ return err;
+}
+
void *bpf_map_area_alloc(size_t size, int numa_node)
{
/* We really just want to fail instead of triggering OOM killer
@@ -740,7 +887,7 @@ static int map_lookup_elem(union bpf_attr *attr)
void __user *uvalue = u64_to_user_ptr(attr->value);
int ufd = attr->map_fd;
struct bpf_map *map;
- void *key, *value, *ptr;
+ void *key, *value;
u32 value_size;
struct fd f;
int err;
@@ -772,72 +919,14 @@ static int map_lookup_elem(union bpf_attr *attr)
goto err_put;
}

- if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
- map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
- value_size = round_up(map->value_size, 8) * num_possible_cpus();
- else if (IS_FD_MAP(map))
- value_size = sizeof(u32);
- else
- value_size = map->value_size;
+ value_size = bpf_map_value_size(map);

err = -ENOMEM;
value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
if (!value)
goto free_key;

- if (bpf_map_is_dev_bound(map)) {
- err = bpf_map_offload_lookup_elem(map, key, value);
- goto done;
- }
-
- preempt_disable();
- this_cpu_inc(bpf_prog_active);
- if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
- err = bpf_percpu_hash_copy(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
- err = bpf_percpu_array_copy(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
- err = bpf_percpu_cgroup_storage_copy(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
- err = bpf_stackmap_copy(map, key, value);
- } else if (IS_FD_ARRAY(map)) {
- err = bpf_fd_array_map_lookup_elem(map, key, value);
- } else if (IS_FD_HASH(map)) {
- err = bpf_fd_htab_map_lookup_elem(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
- err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
- map->map_type == BPF_MAP_TYPE_STACK) {
- err = map->ops->map_peek_elem(map, value);
- } else {
- rcu_read_lock();
- if (map->ops->map_lookup_elem_sys_only)
- ptr = map->ops->map_lookup_elem_sys_only(map, key);
- else
- ptr = map->ops->map_lookup_elem(map, key);
- if (IS_ERR(ptr)) {
- err = PTR_ERR(ptr);
- } else if (!ptr) {
- err = -ENOENT;
- } else {
- err = 0;
- if (attr->flags & BPF_F_LOCK)
- /* lock 'ptr' and copy everything but lock */
- copy_map_value_locked(map, value, ptr, true);
- else
- copy_map_value(map, value, ptr);
- /* mask lock, since value wasn't zero inited */
- check_and_init_map_lock(map, value);
- }
- rcu_read_unlock();
- }
- this_cpu_dec(bpf_prog_active);
- preempt_enable();
-
-done:
+ err = bpf_map_copy_value(map, key, value, attr->flags, false);
if (err)
goto free_value;

@@ -856,16 +945,6 @@ static int map_lookup_elem(union bpf_attr *attr)
return err;
}

-static void maybe_wait_bpf_programs(struct bpf_map *map)
-{
- /* Wait for any running BPF programs to complete so that
- * userspace, when we return to it, knows that all programs
- * that could be running use the new map value.
- */
- if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
- map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
- synchronize_rcu();
-}

#define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags

@@ -921,56 +1000,8 @@ static int map_update_elem(union bpf_attr *attr)
if (copy_from_user(value, uvalue, value_size) != 0)
goto free_value;

- /* Need to create a kthread, thus must support schedule */
- if (bpf_map_is_dev_bound(map)) {
- err = bpf_map_offload_update_elem(map, key, value, attr->flags);
- goto out;
- } else if (map->map_type == BPF_MAP_TYPE_CPUMAP ||
- map->map_type == BPF_MAP_TYPE_SOCKHASH ||
- map->map_type == BPF_MAP_TYPE_SOCKMAP) {
- err = map->ops->map_update_elem(map, key, value, attr->flags);
- goto out;
- }
+ err = bpf_map_update_value(map, f, key, value, attr->flags);

- /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
- * inside bpf map update or delete otherwise deadlocks are possible
- */
- preempt_disable();
- __this_cpu_inc(bpf_prog_active);
- if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
- err = bpf_percpu_hash_update(map, key, value, attr->flags);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
- err = bpf_percpu_array_update(map, key, value, attr->flags);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
- err = bpf_percpu_cgroup_storage_update(map, key, value,
- attr->flags);
- } else if (IS_FD_ARRAY(map)) {
- rcu_read_lock();
- err = bpf_fd_array_map_update_elem(map, f.file, key, value,
- attr->flags);
- rcu_read_unlock();
- } else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
- rcu_read_lock();
- err = bpf_fd_htab_map_update_elem(map, f.file, key, value,
- attr->flags);
- rcu_read_unlock();
- } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
- /* rcu_read_lock() is not needed */
- err = bpf_fd_reuseport_array_update_elem(map, key, value,
- attr->flags);
- } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
- map->map_type == BPF_MAP_TYPE_STACK) {
- err = map->ops->map_push_elem(map, value, attr->flags);
- } else {
- rcu_read_lock();
- err = map->ops->map_update_elem(map, key, value, attr->flags);
- rcu_read_unlock();
- }
- __this_cpu_dec(bpf_prog_active);
- preempt_enable();
- maybe_wait_bpf_programs(map);
-out:
free_value:
kfree(value);
free_key:
@@ -1096,6 +1127,241 @@ static int map_get_next_key(union bpf_attr *attr)
return err;
}

+int generic_map_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ void __user *keys = u64_to_user_ptr(attr->batch.keys);
+ int ufd = attr->map_fd;
+ u32 cp, max_count;
+ struct fd f;
+ void *key;
+ int err;
+
+ f = fdget(ufd);
+ if (attr->batch.elem_flags & ~BPF_F_LOCK)
+ return -EINVAL;
+
+ if ((attr->batch.elem_flags & BPF_F_LOCK) &&
+ !map_value_has_spin_lock(map)) {
+ err = -EINVAL;
+ goto err_put;
+ }
+
+ max_count = attr->batch.count;
+ if (!max_count)
+ return 0;
+
+ err = -ENOMEM;
+ for (cp = 0; cp < max_count; cp++) {
+ key = __bpf_copy_key(keys + cp * map->key_size, map->key_size);
+ if (IS_ERR(key)) {
+ err = PTR_ERR(key);
+ break;
+ }
+
+ if (err)
+ break;
+ if (bpf_map_is_dev_bound(map)) {
+ err = bpf_map_offload_delete_elem(map, key);
+ break;
+ }
+
+ preempt_disable();
+ __this_cpu_inc(bpf_prog_active);
+ rcu_read_lock();
+ err = map->ops->map_delete_elem(map, key);
+ rcu_read_unlock();
+ __this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+ maybe_wait_bpf_programs(map);
+ if (err)
+ break;
+ }
+ if (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)))
+ err = -EFAULT;
+err_put:
+ return err;
+}
+int generic_map_update_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ void __user *values = u64_to_user_ptr(attr->batch.values);
+ void __user *keys = u64_to_user_ptr(attr->batch.keys);
+ u32 value_size, cp, max_count;
+ int ufd = attr->map_fd;
+ void *key, *value;
+ struct fd f;
+ int err;
+
+ f = fdget(ufd);
+ if (attr->batch.elem_flags & ~BPF_F_LOCK)
+ return -EINVAL;
+
+ if ((attr->batch.elem_flags & BPF_F_LOCK) &&
+ !map_value_has_spin_lock(map)) {
+ err = -EINVAL;
+ goto err_put;
+ }
+
+ value_size = bpf_map_value_size(map);
+
+ max_count = attr->batch.count;
+ if (!max_count)
+ return 0;
+
+ err = -ENOMEM;
+ value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
+ if (!value)
+ goto err_put;
+
+ for (cp = 0; cp < max_count; cp++) {
+ key = __bpf_copy_key(keys + cp * map->key_size, map->key_size);
+ if (IS_ERR(key)) {
+ err = PTR_ERR(key);
+ break;
+ }
+ err = -EFAULT;
+ if (copy_from_user(value, values + cp * value_size, value_size))
+ break;
+
+ err = bpf_map_update_value(map, f, key, value,
+ attr->batch.elem_flags);
+
+ if (err)
+ break;
+ }
+
+ if (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)))
+ err = -EFAULT;
+
+ kfree(value);
+err_put:
+ return err;
+}
+
+static int __generic_map_lookup_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr,
+ bool do_delete)
+{
+ void __user *values = u64_to_user_ptr(attr->batch.values);
+ void __user *keys = u64_to_user_ptr(attr->batch.keys);
+ void *buf, *prev_key, *key, *value;
+ u32 value_size, cp, max_count;
+ bool first_key = false;
+ u64 batch;
+ int err, retry = 3;
+
+ if (attr->batch.elem_flags & ~BPF_F_LOCK)
+ return -EINVAL;
+
+ if ((attr->batch.elem_flags & BPF_F_LOCK) &&
+ !map_value_has_spin_lock(map)) {
+ err = -EINVAL;
+ goto err_put;
+ }
+
+ if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+ map->map_type == BPF_MAP_TYPE_STACK) {
+ err = -ENOTSUPP;
+ goto err_put;
+ }
+
+ value_size = bpf_map_value_size(map);
+
+ max_count = attr->batch.count;
+ if (!max_count)
+ return 0;
+
+ batch = attr->batch.batch;
+
+ err = -ENOMEM;
+ buf = kmalloc(map->key_size + value_size, GFP_USER | __GFP_NOWARN);
+ if (!buf)
+ goto err_put;
+
+ key = buf;
+ value = key + map->key_size;
+ if (batch) {
+ memcpy(key, &batch, map->key_size);
+ } else {
+ prev_key = NULL;
+ first_key = true;
+ }
+
+
+ for (cp = 0; cp < max_count; cp++) {
+ if (cp || !batch) {
+ rcu_read_lock();
+ err = map->ops->map_get_next_key(map, prev_key, key);
+ rcu_read_unlock();
+ if (err)
+ break;
+ }
+ err = bpf_map_copy_value(map, key, value,
+ attr->batch.elem_flags, do_delete);
+
+ if (err == -ENOENT) {
+ if (retry) {
+ retry--;
+ continue;
+ }
+ err = -EINTR;
+ break;
+ }
+
+ if (err)
+ goto free_buf;
+
+ if (copy_to_user(keys + cp * map->key_size, key,
+ map->key_size)) {
+ err = -EFAULT;
+ goto free_buf;
+ }
+ if (copy_to_user(values + cp * value_size, value, value_size)) {
+ err = -EFAULT;
+ goto free_buf;
+ }
+
+ prev_key = key;
+ retry = 3;
+ }
+ if (!err) {
+ rcu_read_lock();
+ err = map->ops->map_get_next_key(map, prev_key, key);
+ rcu_read_unlock();
+ }
+
+ if (err == -ENOENT && (cp || do_delete)) {
+ err = 0;
+ memset(key, 0, map->key_size);
+ }
+ if (!err && (copy_to_user(&uattr->batch.count, &cp, sizeof(cp)) ||
+ (copy_to_user(&uattr->batch.batch, key, map->key_size))))
+ err = -EFAULT;
+
+free_buf:
+ kfree(buf);
+err_put:
+ return err;
+}
+
+int generic_map_lookup_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __generic_map_lookup_batch(map, attr, uattr, false);
+}
+
+int generic_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __generic_map_lookup_batch(map, attr, uattr, true);
+}
+
#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value

static int map_lookup_and_delete_elem(union bpf_attr *attr)
diff --git a/tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c b/tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c
index dd906b1de5950..60fe30793fa22 100644
--- a/tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c
+++ b/tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c
@@ -7,16 +7,26 @@
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

+#include <bpf_util.h>
#include <test_maps.h>

static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
- int *values)
+ void *values, bool is_pcpu)
{
- int i, err;
+ typedef BPF_DECLARE_PERCPU(int, value);
+ int i, j, err;
+ value *v;
+
+ if(is_pcpu)
+ v = (value *)values;

for (i = 0; i < max_entries; i++) {
keys[i] = i + 1;
- values[i] = i + 2;
+ if (is_pcpu)
+ for (j = 0; j < bpf_num_possible_cpus(); j++)
+ bpf_percpu(v[i], j) = i + 2 + j;
+ else
+ ((int *)values)[i] = i + 2;
}

err = bpf_map_update_batch(map_fd, keys, values, &max_entries, 0, 0);
@@ -24,15 +34,32 @@ static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
}

static void map_batch_verify(int *visited, __u32 max_entries,
- int *keys, int *values)
+ int *keys, void *values, bool is_pcpu)
{
- int i;
+ typedef BPF_DECLARE_PERCPU(int, value);
+ value *v;
+ int i, j;
+
+ if(is_pcpu)
+ v = (value *)values;

memset(visited, 0, max_entries * sizeof(*visited));
for (i = 0; i < max_entries; i++) {
- CHECK(keys[i] + 1 != values[i], "key/value checking",
- "error: i %d key %d value %d\n", i, keys[i], values[i]);
+
+ if (is_pcpu) {
+ for (j = 0; j < bpf_num_possible_cpus(); j++) {
+ CHECK(keys[i] + 1 + j != bpf_percpu(v[i], j),
+ "key/value checking",
+ "error: i %d j %d key %d value %d\n",
+ i, j, keys[i], bpf_percpu(v[i], j));
+ }
+ } else {
+ CHECK(keys[i] + 1 != ((int *)values)[i], "key/value checking",
+ "error: i %d key %d value %d\n", i, keys[i], ((int *)values)[i]);
+ }
+
visited[i] = 1;
+
}
for (i = 0; i < max_entries; i++) {
CHECK(visited[i] != 1, "visited checking",
@@ -40,18 +67,21 @@ static void map_batch_verify(int *visited, __u32 max_entries,
}
}

-void test_map_lookup_and_delete_batch(void)
-{
+void __test_map_lookup_and_delete_batch(bool is_pcpu){
+ int map_type = is_pcpu ? BPF_MAP_TYPE_PERCPU_HASH : BPF_MAP_TYPE_HASH;
struct bpf_create_map_attr xattr = {
.name = "hash_map",
- .map_type = BPF_MAP_TYPE_HASH,
+ .map_type = map_type,
.key_size = sizeof(int),
.value_size = sizeof(int),
};
- int map_fd, *keys, *values, *visited, key;
+ typedef BPF_DECLARE_PERCPU(int, value);
+ int map_fd, *keys, *visited, key;
+ void *values;
__u32 count, total, total_success;
const __u32 max_entries = 10;
- int err, i, step;
+ int err, i, step, value_size;
+ value pcpu_values [10];
bool nospace_err;
__u64 batch = 0;

@@ -60,8 +90,12 @@ void test_map_lookup_and_delete_batch(void)
CHECK(map_fd == -1,
"bpf_create_map_xattr()", "error:%s\n", strerror(errno));

- keys = malloc(max_entries * sizeof(int));
- values = malloc(max_entries * sizeof(int));
+ value_size = is_pcpu ? sizeof(value) : sizeof(int);
+ keys = malloc(max_entries * sizeof(int));
+ if (is_pcpu)
+ values = pcpu_values;
+ else
+ values = malloc(max_entries * sizeof(int));
visited = malloc(max_entries * sizeof(int));
CHECK(!keys || !values || !visited, "malloc()", "error:%s\n", strerror(errno));

@@ -73,7 +107,7 @@ void test_map_lookup_and_delete_batch(void)
CHECK(batch || count, "empty map", "batch = %lld, count = %u\n", batch, count);

/* populate elements to the map */
- map_batch_update(map_fd, max_entries, keys, values);
+ map_batch_update(map_fd, max_entries, keys, values, is_pcpu);

/* test 2: lookup/delete with count = 0, success */
batch = 0;
@@ -84,7 +118,7 @@ void test_map_lookup_and_delete_batch(void)

/* test 3: lookup/delete with count = max_entries, success */
memset(keys, 0, max_entries * sizeof(*keys));
- memset(values, 0, max_entries * sizeof(*values));
+ memset(values, 0, max_entries * value_size);
count = max_entries;
batch = 0;
err = bpf_map_lookup_and_delete_batch(map_fd, &batch, keys,
@@ -93,7 +127,7 @@ void test_map_lookup_and_delete_batch(void)
CHECK(count != max_entries || batch != 0, "count = max_entries",
"count = %u, max_entries = %u, batch = %lld\n",
count, max_entries, batch);
- map_batch_verify(visited, max_entries, keys, values);
+ map_batch_verify(visited, max_entries, keys, values, is_pcpu);

/* bpf_map_get_next_key() should return -ENOENT for an empty map. */
err = bpf_map_get_next_key(map_fd, NULL, &key);
@@ -102,9 +136,48 @@ void test_map_lookup_and_delete_batch(void)
/* test 4: lookup/delete in a loop with various steps. */
total_success = 0;
for (step = 1; step < max_entries; step++) {
- map_batch_update(map_fd, max_entries, keys, values);
+ map_batch_update(map_fd, max_entries, keys, values, is_pcpu);
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * value_size);
+ batch = 0;
+ total = 0;
+ i = 0;
+ /* iteratively lookup/delete elements with 'step' elements each */
+ count = step;
+ nospace_err = false;
+ while (true) {
+ err = bpf_map_lookup_batch(map_fd, &batch,
+ keys + total,
+ values + total * value_size,
+ &count, 0, 0);
+ /* It is possible that we are failing due to buffer size
+ * not big enough. In such cases, let us just exit and
+ * go with large steps. Not that a buffer size with
+ * max_entries should always work.
+ */
+ if (err && errno == ENOSPC) {
+ nospace_err = true;
+ break;
+ }
+
+ CHECK(err, "lookup with steps", "error: %s\n",
+ strerror(errno));
+
+ total += count;
+ if (err || batch == 0)
+ break;
+
+ i++;
+ }
+ if (nospace_err == true)
+ continue;
+
+ CHECK(total != max_entries, "lookup with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+ map_batch_verify(visited, max_entries, keys, values, is_pcpu);
+
memset(keys, 0, max_entries * sizeof(*keys));
- memset(values, 0, max_entries * sizeof(*values));
+ memset(values, 0, max_entries * value_size);
batch = 0;
total = 0;
i = 0;
@@ -114,7 +187,7 @@ void test_map_lookup_and_delete_batch(void)
while (true) {
err = bpf_map_lookup_and_delete_batch(map_fd, &batch,
keys + total,
- values + total,
+ values + total * value_size,
&count, 0, 0);
/* It is possible that we are failing due to buffer size
* not big enough. In such cases, let us just exit and
@@ -142,7 +215,7 @@ void test_map_lookup_and_delete_batch(void)
CHECK(total != max_entries, "lookup/delete with steps",
"total = %u, max_entries = %u\n", total, max_entries);

- map_batch_verify(visited, max_entries, keys, values);
+ map_batch_verify(visited, max_entries, keys, values, is_pcpu);
err = bpf_map_get_next_key(map_fd, NULL, &key);
CHECK(!err, "bpf_map_get_next_key()", "error: %s\n", strerror(errno));

@@ -150,6 +223,23 @@ void test_map_lookup_and_delete_batch(void)
}

CHECK(total_success == 0, "check total_success", "unexpected failure\n");
+}
+
+void test_hmap_lookup_and_delete_batch(void)
+{
+ __test_map_lookup_and_delete_batch(false);
+ printf("%s:PASS\n", __func__);
+}

+void test_pcpu_hmap_lookup_and_delete_batch(void)
+{
+ __test_map_lookup_and_delete_batch(true);
printf("%s:PASS\n", __func__);
}
+
+void test_map_lookup_and_delete_batch(void)
+{
+ test_hmap_lookup_and_delete_batch();
+ test_pcpu_hmap_lookup_and_delete_batch();
+}
+
diff --git a/tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch_array.c b/tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch_array.c
new file mode 100644
index 0000000000000..e4ea6e45f038c
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch_array.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <test_maps.h>
+
+static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
+ int *values)
+{
+ int i, err;
+
+ for (i = 0; i < max_entries; i++) {
+ keys[i] = i;
+ values[i] = i + 1;
+ }
+
+ err = bpf_map_update_batch(map_fd, keys, values, &max_entries, 0, 0);
+ CHECK(err, "bpf_map_update_batch()", "error:%s\n", strerror(errno));
+}
+
+static void map_batch_verify(int *visited, __u32 max_entries,
+ int *keys, int *values)
+{
+ int i;
+
+ memset(visited, 0, max_entries * sizeof(*visited));
+ for (i = 0; i < max_entries; i++) {
+ CHECK(keys[i] + 1 != values[i], "key/value checking",
+ "error: i %d key %d value %d\n", i, keys[i], values[i]);
+ visited[i] = 1;
+ }
+ for (i = 0; i < max_entries; i++) {
+ CHECK(visited[i] != 1, "visited checking",
+ "error: keys array at index %d missing\n", i);
+ }
+}
+
+void test_map_lookup_and_delete_batch_array(void)
+{
+ struct bpf_create_map_attr xattr = {
+ .name = "hash_map",
+ .map_type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(int),
+ };
+ int map_fd, *keys, *values, *visited;
+ __u32 count, total, total_success;
+ const __u32 max_entries = 10;
+ int err, i, step;
+ bool nospace_err;
+ __u64 batch = 0;
+
+ xattr.max_entries = max_entries;
+ map_fd = bpf_create_map_xattr(&xattr);
+ CHECK(map_fd == -1,
+ "bpf_create_map_xattr()", "error:%s\n", strerror(errno));
+
+ keys = malloc(max_entries * sizeof(int));
+ values = malloc(max_entries * sizeof(int));
+ visited = malloc(max_entries * sizeof(int));
+ CHECK(!keys || !values || !visited, "malloc()", "error:%s\n",
+ strerror(errno));
+
+ /* populate elements to the map */
+ map_batch_update(map_fd, max_entries, keys, values);
+
+ /* test 1: lookup in a loop with various steps. */
+ total_success = 0;
+ for (step = 1; step < max_entries; step++) {
+ map_batch_update(map_fd, max_entries, keys, values);
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * sizeof(*values));
+ batch = 0;
+ total = 0;
+ i = 0;
+ /* iteratively lookup/delete elements with 'step'
+ * elements each.
+ */
+ count = step;
+ nospace_err = false;
+ while (true) {
+ err = bpf_map_lookup_batch(map_fd, &batch,
+ keys + total,
+ values + total,
+ &count, 0, 0);
+
+ CHECK(err, "lookup with steps", "error: %s\n",
+ strerror(errno));
+
+ total += count;
+
+ if (err || batch == 0)
+ break;
+
+ i++;
+ }
+
+ if (nospace_err == true)
+ continue;
+
+ CHECK(total != max_entries, "lookup with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+
+ map_batch_verify(visited, max_entries, keys, values);
+
+ total_success++;
+ }
+
+ CHECK(total_success == 0, "check total_success",
+ "unexpected failure\n");
+
+ printf("%s:PASS\n", __func__);
+}
--
2.24.0.432.g9d3f5f5b63-goog

2019-11-13 02:37:01

by Yonghong Song

[permalink] [raw]
Subject: Re: [RFC bpf-next 0/3] bpf: adding map batch processing support



On 11/7/19 1:20 PM, Brian Vazquez wrote:
> This is a follow up in the effort to batch bpf map operations to reduce
> the syscall overhead with the map_ops. I initially proposed the idea and
> the discussion is here:
> https://lore.kernel.org/bpf/[email protected]/
>
> Yonghong talked at the LPC about this and also proposed and idea that
> handles the special weird case of hashtables by doing traversing using
> the buckets as a reference instead of a key. Discussion is here:
> https://lore.kernel.org/bpf/[email protected]/
>
> This RFC proposes a way to extend batch operations for more data
> structures by creating generic batch functions that can be used instead
> of implementing the operations for each individual data structure,
> reducing the code that needs to be maintained. The series contains the
> patches used in Yonghong's RFC and the patch that adds the generic
> implementation of the operations plus some testing with pcpu hashmaps
> and arrays. Note that pcpu hashmap shouldn't use the generic
> implementation and it either should have its own implementation or share
> the one introduced by Yonghong, I added that just as an example to show
> that the generic implementation can be easily added to a data structure.
>
> What I want to achieve with this RFC is to collect early feedback and see if
> there's any major concern about this before I move forward. I do plan
> to better separate this into different patches and explain them properly
> in the commit messages.

Thanks Brian for working on this. The general approach described here
is good to me. Having a generic implementation for batch operations
looks good for maps (not hash table, queue/stack, etc.)

>
> Current known issues where I would like to discuss are the followings:
>
> - Because Yonghong's UAPI definition was done specifically for
> iterating buckets, the batch field is u64 and is treated as an u64
> instead of an opaque pointer, this won't work for other data structures
> that are going to use a key as a batch token with a size greater than
> 64. Although I think at this point the only key that couldn't be
> treated as a u64 is the key of a hashmap, and the hashmap won't use
> the generic interface.

The u64 can be changed with a __aligned_u64 opaque value. This way,
it can represent a pointer or a 64bit value.

> - Not all the data structures use delete (because it's not a valid
> operation) i.e. arrays. So maybe lookup_and_delete_batch command is
> not needed and we can handle that operation with a lookup_batch and a
> flag.

This make sense.

> - For delete_batch (not just the lookup_and_delete_batch). Is this
> operation really needed? If so, shouldn't it be better if the
> behaviour is delete the keys provided? I did that with my generic
> implementation but Yonghong's delete_batch for a hashmap deletes
> buckets.

We need batched delete in bcc. lookup_and_delete_batch is better as
it can preserves more new map entries. Alternatively, deleting
all entries after lookup is another option. But this may remove
more new map entries. Statistically this may or may not matter though.

bcc does have a clear_table (clear_map) API, but not clear who is using it.

So, I did not have a concrete use case for delete_batch yet.
I tend to think we should have delete_batch for API completeness,
but maybe other people can comment on this as well.

Maybe initial patch, we can skip it. But we should still ensure
user interface data structure can handle batch delete if it is
needed later. The current data structure should handle this
as far as I know.

>
> Brian Vazquez (1):
> bpf: add generic batch support
>
> Yonghong Song (2):
> bpf: adding map batch processing support
> tools/bpf: test bpf_map_lookup_and_delete_batch()
>
> include/linux/bpf.h | 21 +
> include/uapi/linux/bpf.h | 22 +
> kernel/bpf/arraymap.c | 4 +
> kernel/bpf/hashtab.c | 331 ++++++++++
> kernel/bpf/syscall.c | 573 ++++++++++++++----
> tools/include/uapi/linux/bpf.h | 22 +
> tools/lib/bpf/bpf.c | 59 ++
> tools/lib/bpf/bpf.h | 13 +
> tools/lib/bpf/libbpf.map | 4 +
> .../map_tests/map_lookup_and_delete_batch.c | 245 ++++++++
> .../map_lookup_and_delete_batch_array.c | 118 ++++
> 11 files changed, 1292 insertions(+), 120 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c
> create mode 100644 tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch_array.c
>

2019-11-13 22:27:39

by Yonghong Song

[permalink] [raw]
Subject: Re: [RFC bpf-next 0/3] bpf: adding map batch processing support



On 11/13/19 2:07 PM, Brian Vazquez wrote:
> Hi Yonghong,
>
> Thanks for reviewing it! I'm preparing the changes and will submit them
> sooner.
>
> As for the right way to manage author rights, does anyone know what the
> correct approach is? Should I use Yonghong's patch and apply the
> extended support in different patches (i.e. support per_cpu maps, change
> batch from u64 to __aligned_u64, etc) or it is fine to apply the changes
> in place and write both sign-offs?

The logic flow of the patch set is most important.
You can add me as co-signoff if you reuse significant portion of my code.

>
> Thanks,
> Brian
>
> On Tue, Nov 12, 2019 at 6:34 PM Yonghong Song <[email protected]
> <mailto:[email protected]>> wrote:
>
>
>
> On 11/7/19 1:20 PM, Brian Vazquez wrote:
> > This is a follow up in the effort to batch bpf map operations to
> reduce
> > the syscall overhead with the map_ops. I initially proposed the
> idea and
> > the discussion is here:
> >
> https://lore.kernel.org/bpf/[email protected]/
> >
> > Yonghong talked at the LPC about this and also proposed and idea that
> > handles the special weird case of hashtables by doing traversing
> using
> > the buckets as a reference instead of a key. Discussion is here:
> > https://lore.kernel.org/bpf/[email protected]/
> >
> > This RFC proposes a way to extend batch operations for more data
> > structures by creating generic batch functions that can be used
> instead
> > of implementing the operations for each individual data structure,
> > reducing the code that needs to be maintained. The series
> contains the
> > patches used in Yonghong's RFC and the patch that adds the generic
> > implementation of the operations plus some testing with pcpu hashmaps
> > and arrays. Note that pcpu hashmap shouldn't use the generic
> > implementation and it either should have its own implementation
> or share
> > the one introduced by Yonghong, I added that just as an example
> to show
> > that the generic implementation can be easily added to a data
> structure.
> >
> > What I want to achieve with this RFC is to collect early feedback
> and see if
> > there's any major concern about this before I move forward. I do plan
> > to better separate this into different patches and explain them
> properly
> > in the commit messages.
>
> Thanks Brian for working on this. The general approach described here
> is good to me. Having a generic implementation for batch operations
> looks good for maps (not hash table, queue/stack, etc.)
>
> >
> > Current known issues where I would like to discuss are the
> followings:
> >
> > - Because Yonghong's UAPI definition was done specifically for
> >    iterating buckets, the batch field is u64 and is treated as an u64
> >    instead of an opaque pointer, this won't work for other data
> structures
> >    that are going to use a key as a batch token with a size
> greater than
> >    64. Although I think at this point the only key that couldn't be
> >    treated as a u64 is the key of a hashmap, and the hashmap
> won't use
> >    the generic interface.
>
> The u64 can be changed with a __aligned_u64 opaque value. This way,
> it can represent a pointer or a 64bit value.
>
> > - Not all the data structures use delete (because it's not a valid
> >    operation) i.e. arrays. So maybe lookup_and_delete_batch
> command is
> >    not needed and we can handle that operation with a
> lookup_batch and a
> >    flag.
>
> This make sense.
>
> > - For delete_batch (not just the lookup_and_delete_batch). Is this
> >    operation really needed? If so, shouldn't it be better if the
> >    behaviour is delete the keys provided? I did that with my generic
> >    implementation but Yonghong's delete_batch for a hashmap deletes
> >    buckets.
>
> We need batched delete in bcc. lookup_and_delete_batch is better as
> it can preserves more new map entries. Alternatively, deleting
> all entries after lookup is another option. But this may remove
> more new map entries. Statistically this may or may not matter though.
>
> bcc does have a clear_table (clear_map) API, but not clear who is
> using it.
>
> So, I did not have a concrete use case for delete_batch yet.
> I tend to think we should have delete_batch for API completeness,
> but maybe other people can comment on this as well.
>
> Maybe initial patch, we can skip it. But we should still ensure
> user interface data structure can handle batch delete if it is
> needed later. The current data structure should handle this
> as far as I know.
>
> >
> > Brian Vazquez (1):
> >    bpf: add generic batch support
> >
> > Yonghong Song (2):
> >    bpf: adding map batch processing support
> >    tools/bpf: test bpf_map_lookup_and_delete_batch()
> >
> >   include/linux/bpf.h                           |  21 +
> >   include/uapi/linux/bpf.h                      |  22 +
> >   kernel/bpf/arraymap.c                         |   4 +
> >   kernel/bpf/hashtab.c                          | 331 ++++++++++
> >   kernel/bpf/syscall.c                          | 573
> ++++++++++++++----
> >   tools/include/uapi/linux/bpf.h                |  22 +
> >   tools/lib/bpf/bpf.c                           |  59 ++
> >   tools/lib/bpf/bpf.h                           |  13 +
> >   tools/lib/bpf/libbpf.map                      |   4 +
> >   .../map_tests/map_lookup_and_delete_batch.c   | 245 ++++++++
> >   .../map_lookup_and_delete_batch_array.c       | 118 ++++
> >   11 files changed, 1292 insertions(+), 120 deletions(-)
> >   create mode 100644
> tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch.c
> >   create mode 100644
> tools/testing/selftests/bpf/map_tests/map_lookup_and_delete_batch_array.c
> >
>

2019-11-15 22:44:35

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [RFC bpf-next 2/3] tools/bpf: test bpf_map_lookup_and_delete_batch()

On Thu, Nov 7, 2019 at 1:23 PM Brian Vazquez <[email protected]> wrote:
>
> From: Yonghong Song <[email protected]>
>
> Added four libbpf API functions to support map batch operations:
> . int bpf_map_delete_batch( ... )
> . int bpf_map_lookup_batch( ... )
> . int bpf_map_lookup_and_delete_batch( ... )
> . int bpf_map_update_batch( ... )
>
> Tested bpf_map_lookup_and_delete_batch() and bpf_map_update_batch()
> functionality.
> $ ./test_maps
> ...
> test_map_lookup_and_delete_batch:PASS
> ...
>
> Note that I clumped uapi header sync patch, libbpf patch
> and tests patch together considering this is a RFC patch.
> Will do proper formating once it is out of RFC stage.
>
> Signed-off-by: Yonghong Song <[email protected]>
> ---

[...]

>
> + struct { /* struct used by BPF_MAP_*_BATCH commands */
> + __u64 batch; /* input/output:
> + * input: start batch,
> + * 0 to start from beginning.
> + * output: next start batch,
> + * 0 to end batching.
> + */
> + __aligned_u64 keys;
> + __aligned_u64 values;
> + __u32 count; /* input/output:
> + * input: # of elements keys/values.
> + * output: # of filled elements.
> + */
> + __u32 map_fd;
> + __u64 elem_flags;
> + __u64 flags;
> + } batch;
> +

Describe what elem_flags and flags are here?

[...]

> +LIBBPF_API int bpf_map_delete_batch(int fd, __u64 *batch, __u32 *count,
> + __u64 elem_flags, __u64 flags);
> +LIBBPF_API int bpf_map_lookup_batch(int fd, __u64 *batch, void *keys,
> + void *values, __u32 *count,
> + __u64 elem_flags, __u64 flags);
> +LIBBPF_API int bpf_map_lookup_and_delete_batch(int fd, __u64 *batch,
> + void *keys, void *values,
> + __u32 *count, __u64 elem_flags,
> + __u64 flags);
> +LIBBPF_API int bpf_map_update_batch(int fd, void *keys, void *values,
> + __u32 *count, __u64 elem_flags,
> + __u64 flags);

Should we start using the same approach as with bpf_object__open_file
(see LIBBPF_OPTS), so that we can keep adding extra fields without
breaking ABI? The gist is: use function arguments for mandatory fields
(as of right now, at least), and put all the optional fields into a
xxx_opts struct, which can be NULL. Please see
bpf_object__open_{file,mem} for details.

> +
> LIBBPF_API int bpf_obj_pin(int fd, const char *pathname);
> LIBBPF_API int bpf_obj_get(const char *pathname);
> LIBBPF_API int bpf_prog_attach(int prog_fd, int attachable_fd,
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 86173cbb159d3..0529a770a04eb 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -189,6 +189,10 @@ LIBBPF_0.0.4 {
> LIBBPF_0.0.5 {
> global:
> bpf_btf_get_next_id;
> + bpf_map_delete_batch;
> + bpf_map_lookup_and_delete_batch;
> + bpf_map_lookup_batch;
> + bpf_map_update_batch;
> } LIBBPF_0.0.4;

This should be in 0.0.6 section now.

>

[...]