2020-01-14 16:49:15

by Brian Vazquez

[permalink] [raw]
Subject: [PATCH v4 bpf-next 0/9] add bpf batch ops to process more than 1 elem

This patch series introduce batch ops that can be added to bpf maps to
lookup/lookup_and_delete/update/delete more than 1 element at the time,
this is specially useful when syscall overhead is a problem and in case
of hmap it will provide a reliable way of traversing them.

The implementation inclues a generic approach that could potentially be
used by any bpf map and adds it to arraymap, it also includes the specific
implementation of hashmaps which are traversed using buckets instead
of keys.

The bpf syscall subcommands introduced are:

BPF_MAP_LOOKUP_BATCH
BPF_MAP_LOOKUP_AND_DELETE_BATCH
BPF_MAP_UPDATE_BATCH
BPF_MAP_DELETE_BATCH

The UAPI attribute is:

struct { /* struct used by BPF_MAP_*_BATCH commands */
__aligned_u64 in_batch; /* start batch,
* NULL to start from beginning
*/
__aligned_u64 out_batch; /* output: next start batch */
__aligned_u64 keys;
__aligned_u64 values;
__u32 count; /* input/output:
* input: # of key/value
* elements
* output: # of filled elements
*/
__u32 map_fd;
__u64 elem_flags;
__u64 flags;
} batch;


in_batch and out_batch are only used for lookup and lookup_and_delete since
those are the only two operations that attempt to traverse the map.

update/delete batch ops should provide the keys/values that user wants
to modify.

Here are the previous discussions on the batch processing:
- https://lore.kernel.org/bpf/[email protected]/
- https://lore.kernel.org/bpf/[email protected]/
- https://lore.kernel.org/bpf/[email protected]/

Changelog sinve v3:
- Do not use copy_to_user inside atomic region (Yonghong Song)
- Use _opts approach on libbpf APIs (Andrii Nakryiko)
- Drop generic_map_lookup_and_delete_batch support
- Free malloc-ed memory in tests (Yonghong Song)
- Reverse christmas tree (Yonghong Song)
- Add acked labels

Changelog sinve v2:
- Add generic batch support for lpm_trie and test it (Yonghong Song)
- Use define MAP_LOOKUP_RETRIES for retries (John Fastabend)
- Return errors directly and remove labels (Yonghong Song)
- Insert new API functions into libbpf alphabetically (Yonghong Song)
- Change hlist_nulls_for_each_entry_rcu to
hlist_nulls_for_each_entry_safe in htab batch ops (Yonghong Song)

Changelog since v1:
- Fix SOB ordering and remove Co-authored-by tag (Alexei Starovoitov)

Changelog since RFC:
- Change batch to in_batch and out_batch to support more flexible opaque
values to iterate the bpf maps.
- Remove update/delete specific batch ops for htab and use the generic
implementations instead.

Brian Vazquez (5):
bpf: add bpf_map_{value_size,update_value,map_copy_value} functions
bpf: add generic support for lookup batch op
bpf: add generic support for update and delete batch ops
bpf: add lookup and update batch ops to arraymap
selftests/bpf: add batch ops testing to array bpf map

Yonghong Song (4):
bpf: add batch ops to all htab bpf map
tools/bpf: sync uapi header bpf.h
libbpf: add libbpf support to batch ops
selftests/bpf: add batch ops testing for htab and htab_percpu map

include/linux/bpf.h | 18 +
include/uapi/linux/bpf.h | 21 +
kernel/bpf/arraymap.c | 2 +
kernel/bpf/hashtab.c | 258 +++++++++
kernel/bpf/syscall.c | 548 ++++++++++++++----
tools/include/uapi/linux/bpf.h | 21 +
tools/lib/bpf/bpf.c | 60 ++
tools/lib/bpf/bpf.h | 22 +
tools/lib/bpf/libbpf.map | 4 +
.../bpf/map_tests/array_map_batch_ops.c | 131 +++++
.../bpf/map_tests/htab_map_batch_ops.c | 285 +++++++++
11 files changed, 1242 insertions(+), 128 deletions(-)
create mode 100644 tools/testing/selftests/bpf/map_tests/array_map_batch_ops.c
create mode 100644 tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c

--
2.25.0.rc1.283.g88dfdc4193-goog


2020-01-14 16:49:23

by Brian Vazquez

[permalink] [raw]
Subject: [PATCH v4 bpf-next 8/9] selftests/bpf: add batch ops testing for htab and htab_percpu map

From: Yonghong Song <[email protected]>

Tested bpf_map_lookup_batch(), bpf_map_lookup_and_delete_batch(),
bpf_map_update_batch(), and bpf_map_delete_batch() functionality.
$ ./test_maps
...
test_htab_map_batch_ops:PASS
test_htab_percpu_map_batch_ops:PASS
...

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
.../bpf/map_tests/htab_map_batch_ops.c | 285 ++++++++++++++++++
1 file changed, 285 insertions(+)
create mode 100644 tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c

diff --git a/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
new file mode 100644
index 0000000000000..670d7e6574899
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/htab_map_batch_ops.c
@@ -0,0 +1,285 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <bpf_util.h>
+#include <test_maps.h>
+
+static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
+ void *values, bool is_pcpu)
+{
+ typedef BPF_DECLARE_PERCPU(int, value);
+ value *v = NULL;
+ int i, j, err;
+
+ DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts,
+ .elem_flags = 0,
+ .flags = 0,
+ );
+
+ if (is_pcpu)
+ v = (value *)values;
+
+ for (i = 0; i < max_entries; i++) {
+ keys[i] = i + 1;
+ if (is_pcpu)
+ for (j = 0; j < bpf_num_possible_cpus(); j++)
+ bpf_percpu(v[i], j) = i + 2 + j;
+ else
+ ((int *)values)[i] = i + 2;
+ }
+
+ err = bpf_map_update_batch(map_fd, keys, values, &max_entries, &opts);
+ CHECK(err, "bpf_map_update_batch()", "error:%s\n", strerror(errno));
+}
+
+static void map_batch_verify(int *visited, __u32 max_entries,
+ int *keys, void *values, bool is_pcpu)
+{
+ typedef BPF_DECLARE_PERCPU(int, value);
+ value *v = NULL;
+ int i, j;
+
+ if (is_pcpu)
+ v = (value *)values;
+
+ memset(visited, 0, max_entries * sizeof(*visited));
+ for (i = 0; i < max_entries; i++) {
+
+ if (is_pcpu) {
+ for (j = 0; j < bpf_num_possible_cpus(); j++) {
+ CHECK(keys[i] + 1 + j != bpf_percpu(v[i], j),
+ "key/value checking",
+ "error: i %d j %d key %d value %d\n",
+ i, j, keys[i], bpf_percpu(v[i], j));
+ }
+ } else {
+ CHECK(keys[i] + 1 != ((int *)values)[i],
+ "key/value checking",
+ "error: i %d key %d value %d\n", i, keys[i],
+ ((int *)values)[i]);
+ }
+
+ visited[i] = 1;
+
+ }
+ for (i = 0; i < max_entries; i++) {
+ CHECK(visited[i] != 1, "visited checking",
+ "error: keys array at index %d missing\n", i);
+ }
+}
+
+void __test_map_lookup_and_delete_batch(bool is_pcpu)
+{
+ int map_type = is_pcpu ? BPF_MAP_TYPE_PERCPU_HASH : BPF_MAP_TYPE_HASH;
+ struct bpf_create_map_attr xattr = {
+ .name = "hash_map",
+ .map_type = map_type,
+ .key_size = sizeof(int),
+ .value_size = sizeof(int),
+ };
+ __u32 batch, count, total, total_success;
+ typedef BPF_DECLARE_PERCPU(int, value);
+ int map_fd, *keys, *visited, key;
+ const __u32 max_entries = 10;
+ int err, step, value_size;
+ value pcpu_values[max_entries];
+ bool nospace_err;
+ void *values;
+
+ DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts,
+ .elem_flags = 0,
+ .flags = 0,
+ );
+
+ xattr.max_entries = max_entries;
+ map_fd = bpf_create_map_xattr(&xattr);
+ CHECK(map_fd == -1,
+ "bpf_create_map_xattr()", "error:%s\n", strerror(errno));
+
+ value_size = is_pcpu ? sizeof(value) : sizeof(int);
+ keys = malloc(max_entries * sizeof(int));
+ if (is_pcpu)
+ values = pcpu_values;
+ else
+ values = malloc(max_entries * sizeof(int));
+ visited = malloc(max_entries * sizeof(int));
+ CHECK(!keys || !values || !visited, "malloc()",
+ "error:%s\n", strerror(errno));
+
+ /* test 1: lookup/delete an empty hash table, -ENOENT */
+ count = max_entries;
+ err = bpf_map_lookup_and_delete_batch(map_fd, NULL, &batch, keys,
+ values, &count, &opts);
+ CHECK((err && errno != ENOENT), "empty map",
+ "error: %s\n", strerror(errno));
+
+ /* populate elements to the map */
+ map_batch_update(map_fd, max_entries, keys, values, is_pcpu);
+
+ /* test 2: lookup/delete with count = 0, success */
+ count = 0;
+ err = bpf_map_lookup_and_delete_batch(map_fd, NULL, &batch, keys,
+ values, &count, &opts);
+ CHECK(err, "count = 0", "error: %s\n", strerror(errno));
+
+ /* test 3: lookup/delete with count = max_entries, success */
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * value_size);
+ count = max_entries;
+ err = bpf_map_lookup_and_delete_batch(map_fd, NULL, &batch, keys,
+ values, &count, &opts);
+ CHECK((err && errno != ENOENT), "count = max_entries",
+ "error: %s\n", strerror(errno));
+ CHECK(count != max_entries, "count = max_entries",
+ "count = %u, max_entries = %u\n", count, max_entries);
+ map_batch_verify(visited, max_entries, keys, values, is_pcpu);
+
+ /* bpf_map_get_next_key() should return -ENOENT for an empty map. */
+ err = bpf_map_get_next_key(map_fd, NULL, &key);
+ CHECK(!err, "bpf_map_get_next_key()", "error: %s\n", strerror(errno));
+
+ /* test 4: lookup/delete in a loop with various steps. */
+ total_success = 0;
+ for (step = 1; step < max_entries; step++) {
+ map_batch_update(map_fd, max_entries, keys, values, is_pcpu);
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * value_size);
+ total = 0;
+ /* iteratively lookup/delete elements with 'step'
+ * elements each
+ */
+ count = step;
+ nospace_err = false;
+ while (true) {
+ err = bpf_map_lookup_batch(map_fd,
+ total ? &batch : NULL,
+ &batch, keys + total,
+ values +
+ total * value_size,
+ &count, &opts);
+ /* It is possible that we are failing due to buffer size
+ * not big enough. In such cases, let us just exit and
+ * go with large steps. Not that a buffer size with
+ * max_entries should always work.
+ */
+ if (err && errno == ENOSPC) {
+ nospace_err = true;
+ break;
+ }
+
+ CHECK((err && errno != ENOENT), "lookup with steps",
+ "error: %s\n", strerror(errno));
+
+ total += count;
+ if (err)
+ break;
+
+ }
+ if (nospace_err == true)
+ continue;
+
+ CHECK(total != max_entries, "lookup with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+ map_batch_verify(visited, max_entries, keys, values, is_pcpu);
+
+ total = 0;
+ count = step;
+ while (total < max_entries) {
+ if (max_entries - total < step)
+ count = max_entries - total;
+ err = bpf_map_delete_batch(map_fd,
+ keys + total,
+ &count, &opts);
+ CHECK((err && errno != ENOENT), "delete batch",
+ "error: %s\n", strerror(errno));
+ total += count;
+ if (err)
+ break;
+ }
+ CHECK(total != max_entries, "delete with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+
+ /* check map is empty, errono == ENOENT */
+ err = bpf_map_get_next_key(map_fd, NULL, &key);
+ CHECK(!err || errno != ENOENT, "bpf_map_get_next_key()",
+ "error: %s\n", strerror(errno));
+
+ /* iteratively lookup/delete elements with 'step'
+ * elements each
+ */
+ map_batch_update(map_fd, max_entries, keys, values, is_pcpu);
+ memset(keys, 0, max_entries * sizeof(*keys));
+ memset(values, 0, max_entries * value_size);
+ total = 0;
+ count = step;
+ nospace_err = false;
+ while (true) {
+ err = bpf_map_lookup_and_delete_batch(map_fd,
+ total ? &batch : NULL,
+ &batch, keys + total,
+ values +
+ total * value_size,
+ &count, &opts);
+ /* It is possible that we are failing due to buffer size
+ * not big enough. In such cases, let us just exit and
+ * go with large steps. Not that a buffer size with
+ * max_entries should always work.
+ */
+ if (err && errno == ENOSPC) {
+ nospace_err = true;
+ break;
+ }
+
+ CHECK((err && errno != ENOENT), "lookup with steps",
+ "error: %s\n", strerror(errno));
+
+ total += count;
+ if (err)
+ break;
+ }
+
+ if (nospace_err == true)
+ continue;
+
+ CHECK(total != max_entries, "lookup/delete with steps",
+ "total = %u, max_entries = %u\n", total, max_entries);
+
+ map_batch_verify(visited, max_entries, keys, values, is_pcpu);
+ err = bpf_map_get_next_key(map_fd, NULL, &key);
+ CHECK(!err, "bpf_map_get_next_key()", "error: %s\n",
+ strerror(errno));
+
+ total_success++;
+ }
+
+ CHECK(total_success == 0, "check total_success",
+ "unexpected failure\n");
+ free(keys);
+ free(visited);
+ if (!is_pcpu)
+ free(values);
+}
+
+void htab_map_batch_ops(void)
+{
+ __test_map_lookup_and_delete_batch(false);
+ printf("test_%s:PASS\n", __func__);
+}
+
+void htab_percpu_map_batch_ops(void)
+{
+ __test_map_lookup_and_delete_batch(true);
+ printf("test_%s:PASS\n", __func__);
+}
+
+void test_htab_map_batch_ops(void)
+{
+ htab_map_batch_ops();
+ htab_percpu_map_batch_ops();
+}
--
2.25.0.rc1.283.g88dfdc4193-goog

2020-01-14 16:49:32

by Brian Vazquez

[permalink] [raw]
Subject: [PATCH v4 bpf-next 5/9] bpf: add batch ops to all htab bpf map

From: Yonghong Song <[email protected]>

htab can't use generic batch support due some problematic behaviours
inherent to the data structre, i.e. while iterating the bpf map a
concurrent program might delete the next entry that batch was about to
use, in that case there's no easy solution to retrieve the next entry,
the issue has been discussed multiple times (see [1] and [2]).

The only way hmap can be traversed without the problem previously
exposed is by making sure that the map is traversing entire buckets.
This commit implements those strict requirements for hmap, the
implementation follows the same interaction that generic support with
some exceptions:

- If keys/values buffer are not big enough to traverse a bucket,
ENOSPC will be returned.
- out_batch contains the value of the next bucket in the iteration, not
the next key, but this is transparent for the user since the user
should never use out_batch for other than bpf batch syscalls.

This commits implements BPF_MAP_LOOKUP_BATCH and adds support for new
command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
batch ops it is possible to use the generic implementations.

[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://lore.kernel.org/bpf/[email protected]/

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
include/linux/bpf.h | 3 +
include/uapi/linux/bpf.h | 1 +
kernel/bpf/hashtab.c | 258 +++++++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 9 +-
4 files changed, 270 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 05466ad6cf1c5..3517e32149a4f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -46,6 +46,9 @@ struct bpf_map_ops {
void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
int (*map_lookup_batch)(struct bpf_map *map, const union bpf_attr *attr,
union bpf_attr __user *uattr);
+ int (*map_lookup_and_delete_batch)(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
int (*map_update_batch)(struct bpf_map *map, const union bpf_attr *attr,
union bpf_attr __user *uattr);
int (*map_delete_batch)(struct bpf_map *map, const union bpf_attr *attr,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e8df9ca680e0c..9536729a03d57 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -108,6 +108,7 @@ enum bpf_cmd {
BPF_MAP_FREEZE,
BPF_BTF_GET_NEXT_ID,
BPF_MAP_LOOKUP_BATCH,
+ BPF_MAP_LOOKUP_AND_DELETE_BATCH,
BPF_MAP_UPDATE_BATCH,
BPF_MAP_DELETE_BATCH,
};
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a62c8c97..d9888acfd632b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -17,6 +17,16 @@
(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)

+#define BATCH_OPS(_name) \
+ .map_lookup_batch = \
+ _name##_map_lookup_batch, \
+ .map_lookup_and_delete_batch = \
+ _name##_map_lookup_and_delete_batch, \
+ .map_update_batch = \
+ generic_map_update_batch, \
+ .map_delete_batch = \
+ generic_map_delete_batch
+
struct bucket {
struct hlist_nulls_head head;
raw_spinlock_t lock;
@@ -1232,6 +1242,250 @@ static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
rcu_read_unlock();
}

+static int
+__htab_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr,
+ bool do_delete, bool is_lru_map,
+ bool is_percpu)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
+ void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
+ void __user *uvalues = u64_to_user_ptr(attr->batch.values);
+ void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
+ void *ubatch = u64_to_user_ptr(attr->batch.in_batch);
+ u32 batch, max_count, size, bucket_size;
+ u64 elem_map_flags, map_flags;
+ struct hlist_nulls_head *head;
+ struct hlist_nulls_node *n;
+ unsigned long flags;
+ struct htab_elem *l;
+ struct bucket *b;
+ int ret = 0;
+
+ elem_map_flags = attr->batch.elem_flags;
+ if ((elem_map_flags & ~BPF_F_LOCK) ||
+ ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
+ return -EINVAL;
+
+ map_flags = attr->batch.flags;
+ if (map_flags)
+ return -EINVAL;
+
+ max_count = attr->batch.count;
+ if (!max_count)
+ return 0;
+
+ batch = 0;
+ if (ubatch && copy_from_user(&batch, ubatch, sizeof(batch)))
+ return -EFAULT;
+
+ if (batch >= htab->n_buckets)
+ return -ENOENT;
+
+ key_size = htab->map.key_size;
+ roundup_key_size = round_up(htab->map.key_size, 8);
+ value_size = htab->map.value_size;
+ size = round_up(value_size, 8);
+ if (is_percpu)
+ value_size = size * num_possible_cpus();
+ total = 0;
+ bucket_size = 1;
+
+alloc:
+ /* We cannot do copy_from_user or copy_to_user inside
+ * the rcu_read_lock. Allocate enough space here.
+ */
+ keys = kvmalloc(key_size * bucket_size, GFP_USER | __GFP_NOWARN);
+ values = kvmalloc(value_size * bucket_size, GFP_USER | __GFP_NOWARN);
+ if (!keys || !values) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+again:
+ preempt_disable();
+ this_cpu_inc(bpf_prog_active);
+ rcu_read_lock();
+again_nocopy:
+ dst_key = keys;
+ dst_val = values;
+ b = &htab->buckets[batch];
+ head = &b->head;
+ raw_spin_lock_irqsave(&b->lock, flags);
+
+ bucket_cnt = 0;
+ hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
+ bucket_cnt++;
+
+ if (bucket_cnt > (max_count - total)) {
+ if (total == 0)
+ ret = -ENOSPC;
+ raw_spin_unlock_irqrestore(&b->lock, flags);
+ rcu_read_unlock();
+ this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+ goto after_loop;
+ }
+
+ if (bucket_cnt > bucket_size) {
+ bucket_size = bucket_cnt;
+ raw_spin_unlock_irqrestore(&b->lock, flags);
+ rcu_read_unlock();
+ this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+ kvfree(keys);
+ kvfree(values);
+ goto alloc;
+ }
+
+ hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
+ memcpy(dst_key, l->key, key_size);
+
+ if (is_percpu) {
+ int off = 0, cpu;
+ void __percpu *pptr;
+
+ pptr = htab_elem_get_ptr(l, map->key_size);
+ for_each_possible_cpu(cpu) {
+ bpf_long_memcpy(dst_val + off,
+ per_cpu_ptr(pptr, cpu), size);
+ off += size;
+ }
+ } else {
+ value = l->key + roundup_key_size;
+ if (elem_map_flags & BPF_F_LOCK)
+ copy_map_value_locked(map, dst_val, value,
+ true);
+ else
+ copy_map_value(map, dst_val, value);
+ check_and_init_map_lock(map, dst_val);
+ }
+ if (do_delete) {
+ hlist_nulls_del_rcu(&l->hash_node);
+ if (is_lru_map)
+ bpf_lru_push_free(&htab->lru, &l->lru_node);
+ else
+ free_htab_elem(htab, l);
+ }
+ dst_key += key_size;
+ dst_val += value_size;
+ }
+
+ raw_spin_unlock_irqrestore(&b->lock, flags);
+ /* If we are not copying data, we can go to next bucket and avoid
+ * unlocking the rcu.
+ */
+ if (!bucket_cnt && (batch + 1 < htab->n_buckets)) {
+ batch++;
+ goto again_nocopy;
+ }
+
+ rcu_read_unlock();
+ this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+ if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
+ key_size * bucket_cnt) ||
+ copy_to_user(uvalues + total * value_size, values,
+ value_size * bucket_cnt))) {
+ ret = -EFAULT;
+ goto after_loop;
+ }
+
+ total += bucket_cnt;
+ batch++;
+ if (batch >= htab->n_buckets) {
+ ret = -ENOENT;
+ goto after_loop;
+ }
+ goto again;
+
+after_loop:
+ if (ret && (ret != -ENOENT && ret != -EFAULT))
+ goto out;
+
+ /* copy # of entries and next batch */
+ ubatch = u64_to_user_ptr(attr->batch.out_batch);
+ if (copy_to_user(ubatch, &batch, sizeof(batch)) ||
+ put_user(total, &uattr->batch.count))
+ ret = -EFAULT;
+
+out:
+ kvfree(keys);
+ kvfree(values);
+ return ret;
+}
+
+static int
+htab_percpu_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
+ false, true);
+}
+
+static int
+htab_percpu_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
+ false, true);
+}
+
+static int
+htab_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
+ false, false);
+}
+
+static int
+htab_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
+ false, false);
+}
+
+static int
+htab_lru_percpu_map_lookup_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
+ true, true);
+}
+
+static int
+htab_lru_percpu_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
+ true, true);
+}
+
+static int
+htab_lru_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
+ true, false);
+}
+
+static int
+htab_lru_map_lookup_and_delete_batch(struct bpf_map *map,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
+ true, false);
+}
+
const struct bpf_map_ops htab_map_ops = {
.map_alloc_check = htab_map_alloc_check,
.map_alloc = htab_map_alloc,
@@ -1242,6 +1496,7 @@ const struct bpf_map_ops htab_map_ops = {
.map_delete_elem = htab_map_delete_elem,
.map_gen_lookup = htab_map_gen_lookup,
.map_seq_show_elem = htab_map_seq_show_elem,
+ BATCH_OPS(htab),
};

const struct bpf_map_ops htab_lru_map_ops = {
@@ -1255,6 +1510,7 @@ const struct bpf_map_ops htab_lru_map_ops = {
.map_delete_elem = htab_lru_map_delete_elem,
.map_gen_lookup = htab_lru_map_gen_lookup,
.map_seq_show_elem = htab_map_seq_show_elem,
+ BATCH_OPS(htab_lru),
};

/* Called from eBPF program */
@@ -1368,6 +1624,7 @@ const struct bpf_map_ops htab_percpu_map_ops = {
.map_update_elem = htab_percpu_map_update_elem,
.map_delete_elem = htab_map_delete_elem,
.map_seq_show_elem = htab_percpu_map_seq_show_elem,
+ BATCH_OPS(htab_percpu),
};

const struct bpf_map_ops htab_lru_percpu_map_ops = {
@@ -1379,6 +1636,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
.map_update_elem = htab_lru_percpu_map_update_elem,
.map_delete_elem = htab_lru_map_delete_elem,
.map_seq_show_elem = htab_percpu_map_seq_show_elem,
+ BATCH_OPS(htab_lru_percpu),
};

static int fd_htab_map_alloc_check(union bpf_attr *attr)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2f631eb67d00c..7e4f40657450f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3304,7 +3304,8 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
if (IS_ERR(map))
return PTR_ERR(map);

- if (cmd == BPF_MAP_LOOKUP_BATCH &&
+ if ((cmd == BPF_MAP_LOOKUP_BATCH ||
+ cmd == BPF_MAP_LOOKUP_AND_DELETE_BATCH) &&
!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
err = -EPERM;
goto err_put;
@@ -3318,6 +3319,8 @@ static int bpf_map_do_batch(const union bpf_attr *attr,

if (cmd == BPF_MAP_LOOKUP_BATCH)
BPF_DO_BATCH(map->ops->map_lookup_batch);
+ else if (cmd == BPF_MAP_LOOKUP_AND_DELETE_BATCH)
+ BPF_DO_BATCH(map->ops->map_lookup_and_delete_batch);
else if (cmd == BPF_MAP_UPDATE_BATCH)
BPF_DO_BATCH(map->ops->map_update_batch);
else
@@ -3428,6 +3431,10 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_MAP_LOOKUP_BATCH:
err = bpf_map_do_batch(&attr, uattr, BPF_MAP_LOOKUP_BATCH);
break;
+ case BPF_MAP_LOOKUP_AND_DELETE_BATCH:
+ err = bpf_map_do_batch(&attr, uattr,
+ BPF_MAP_LOOKUP_AND_DELETE_BATCH);
+ break;
case BPF_MAP_UPDATE_BATCH:
err = bpf_map_do_batch(&attr, uattr, BPF_MAP_UPDATE_BATCH);
break;
--
2.25.0.rc1.283.g88dfdc4193-goog

2020-01-14 16:49:37

by Brian Vazquez

[permalink] [raw]
Subject: [PATCH v4 bpf-next 4/9] bpf: add lookup and updated batch ops to arraymap

This adds the generic batch ops functionality to bpf arraymap, note that
since deletion is not a valid operation for arraymap, only batch and
lookup are added.

Signed-off-by: Brian Vazquez <[email protected]>
Acked-by: Yonghong Song <[email protected]>
---
kernel/bpf/arraymap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index f0d19bbb9211e..95d77770353c9 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -503,6 +503,8 @@ const struct bpf_map_ops array_map_ops = {
.map_mmap = array_map_mmap,
.map_seq_show_elem = array_map_seq_show_elem,
.map_check_btf = array_map_check_btf,
+ .map_lookup_batch = generic_map_lookup_batch,
+ .map_update_batch = generic_map_update_batch,
};

const struct bpf_map_ops percpu_array_map_ops = {
--
2.25.0.rc1.283.g88dfdc4193-goog

2020-01-14 16:49:59

by Brian Vazquez

[permalink] [raw]
Subject: [PATCH v4 bpf-next 4/9] bpf: add lookup and update batch ops to arraymap

This adds the generic batch ops functionality to bpf arraymap, note that
since deletion is not a valid operation for arraymap, only batch and
lookup are added.

Signed-off-by: Brian Vazquez <[email protected]>
Acked-by: Yonghong Song <[email protected]>
---
kernel/bpf/arraymap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index f0d19bbb9211e..95d77770353c9 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -503,6 +503,8 @@ const struct bpf_map_ops array_map_ops = {
.map_mmap = array_map_mmap,
.map_seq_show_elem = array_map_seq_show_elem,
.map_check_btf = array_map_check_btf,
+ .map_lookup_batch = generic_map_lookup_batch,
+ .map_update_batch = generic_map_update_batch,
};

const struct bpf_map_ops percpu_array_map_ops = {
--
2.25.0.rc1.283.g88dfdc4193-goog

2020-01-14 22:57:53

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v4 bpf-next 5/9] bpf: add batch ops to all htab bpf map



On 1/14/20 8:46 AM, Brian Vazquez wrote:
> From: Yonghong Song <[email protected]>
>
> htab can't use generic batch support due some problematic behaviours
> inherent to the data structre, i.e. while iterating the bpf map a
> concurrent program might delete the next entry that batch was about to
> use, in that case there's no easy solution to retrieve the next entry,
> the issue has been discussed multiple times (see [1] and [2]).
>
> The only way hmap can be traversed without the problem previously
> exposed is by making sure that the map is traversing entire buckets.
> This commit implements those strict requirements for hmap, the
> implementation follows the same interaction that generic support with
> some exceptions:
>
> - If keys/values buffer are not big enough to traverse a bucket,
> ENOSPC will be returned.
> - out_batch contains the value of the next bucket in the iteration, not
> the next key, but this is transparent for the user since the user
> should never use out_batch for other than bpf batch syscalls.
>
> This commits implements BPF_MAP_LOOKUP_BATCH and adds support for new
> command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
> batch ops it is possible to use the generic implementations.
>
> [1] https://lore.kernel.org/bpf/[email protected]/
> [2] https://lore.kernel.org/bpf/[email protected]/
>
> Signed-off-by: Yonghong Song <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
> ---
> include/linux/bpf.h | 3 +
> include/uapi/linux/bpf.h | 1 +
> kernel/bpf/hashtab.c | 258 +++++++++++++++++++++++++++++++++++++++
> kernel/bpf/syscall.c | 9 +-
> 4 files changed, 270 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 05466ad6cf1c5..3517e32149a4f 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -46,6 +46,9 @@ struct bpf_map_ops {
> void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
> int (*map_lookup_batch)(struct bpf_map *map, const union bpf_attr *attr,
> union bpf_attr __user *uattr);
> + int (*map_lookup_and_delete_batch)(struct bpf_map *map,
> + const union bpf_attr *attr,
> + union bpf_attr __user *uattr);
> int (*map_update_batch)(struct bpf_map *map, const union bpf_attr *attr,
> union bpf_attr __user *uattr);
> int (*map_delete_batch)(struct bpf_map *map, const union bpf_attr *attr,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e8df9ca680e0c..9536729a03d57 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -108,6 +108,7 @@ enum bpf_cmd {
> BPF_MAP_FREEZE,
> BPF_BTF_GET_NEXT_ID,
> BPF_MAP_LOOKUP_BATCH,
> + BPF_MAP_LOOKUP_AND_DELETE_BATCH,
> BPF_MAP_UPDATE_BATCH,
> BPF_MAP_DELETE_BATCH,
> };
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 22066a62c8c97..d9888acfd632b 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -17,6 +17,16 @@
> (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
> BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
>
> +#define BATCH_OPS(_name) \
> + .map_lookup_batch = \
> + _name##_map_lookup_batch, \
> + .map_lookup_and_delete_batch = \
> + _name##_map_lookup_and_delete_batch, \
> + .map_update_batch = \
> + generic_map_update_batch, \
> + .map_delete_batch = \
> + generic_map_delete_batch
> +
> struct bucket {
> struct hlist_nulls_head head;
> raw_spinlock_t lock;
> @@ -1232,6 +1242,250 @@ static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
> rcu_read_unlock();
> }
>
> +static int
> +__htab_map_lookup_and_delete_batch(struct bpf_map *map,
> + const union bpf_attr *attr,
> + union bpf_attr __user *uattr,
> + bool do_delete, bool is_lru_map,
> + bool is_percpu)
> +{
> + struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> + u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
> + void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
> + void __user *uvalues = u64_to_user_ptr(attr->batch.values);
> + void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
> + void *ubatch = u64_to_user_ptr(attr->batch.in_batch);
> + u32 batch, max_count, size, bucket_size;
> + u64 elem_map_flags, map_flags;
> + struct hlist_nulls_head *head;
> + struct hlist_nulls_node *n;
> + unsigned long flags;
> + struct htab_elem *l;
> + struct bucket *b;
> + int ret = 0;
> +
> + elem_map_flags = attr->batch.elem_flags;
> + if ((elem_map_flags & ~BPF_F_LOCK) ||
> + ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
> + return -EINVAL;
> +
> + map_flags = attr->batch.flags;
> + if (map_flags)
> + return -EINVAL;
> +
> + max_count = attr->batch.count;
> + if (!max_count)
> + return 0;
> +
> + batch = 0;
> + if (ubatch && copy_from_user(&batch, ubatch, sizeof(batch)))
> + return -EFAULT;
> +
> + if (batch >= htab->n_buckets)
> + return -ENOENT;
> +
> + key_size = htab->map.key_size;
> + roundup_key_size = round_up(htab->map.key_size, 8);
> + value_size = htab->map.value_size;
> + size = round_up(value_size, 8);
> + if (is_percpu)
> + value_size = size * num_possible_cpus();
> + total = 0;
> + bucket_size = 1;

Have you checked typical hash table linklist length?
Maybe initial value bucket_size = 2 is able to cover most common cases?

> +
> +alloc:
> + /* We cannot do copy_from_user or copy_to_user inside
> + * the rcu_read_lock. Allocate enough space here.
> + */
> + keys = kvmalloc(key_size * bucket_size, GFP_USER | __GFP_NOWARN);
> + values = kvmalloc(value_size * bucket_size, GFP_USER | __GFP_NOWARN);
> + if (!keys || !values) {
> + ret = -ENOMEM;
> + goto out;

In this case, we won't copy batch and total to user buffer. Maybe we
should do that?


> + }
> +
> +again:
> + preempt_disable();
> + this_cpu_inc(bpf_prog_active);
> + rcu_read_lock();
> +again_nocopy:
> + dst_key = keys;
> + dst_val = values;
> + b = &htab->buckets[batch];
> + head = &b->head;
> + raw_spin_lock_irqsave(&b->lock, flags);
> +
> + bucket_cnt = 0;
> + hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
> + bucket_cnt++;
> +
> + if (bucket_cnt > (max_count - total)) {
> + if (total == 0)
> + ret = -ENOSPC;
> + raw_spin_unlock_irqrestore(&b->lock, flags);
> + rcu_read_unlock();
> + this_cpu_dec(bpf_prog_active);
> + preempt_enable();
> + goto after_loop;
> + }
> +
> + if (bucket_cnt > bucket_size) {
> + bucket_size = bucket_cnt;
> + raw_spin_unlock_irqrestore(&b->lock, flags);
> + rcu_read_unlock();
> + this_cpu_dec(bpf_prog_active);
> + preempt_enable();
> + kvfree(keys);
> + kvfree(values);
> + goto alloc;
> + }
> +
> + hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> + memcpy(dst_key, l->key, key_size);
> +
> + if (is_percpu) {
> + int off = 0, cpu;
> + void __percpu *pptr;
> +
> + pptr = htab_elem_get_ptr(l, map->key_size);
> + for_each_possible_cpu(cpu) {
> + bpf_long_memcpy(dst_val + off,
> + per_cpu_ptr(pptr, cpu), size);
> + off += size;
> + }
> + } else {
> + value = l->key + roundup_key_size;
> + if (elem_map_flags & BPF_F_LOCK)
> + copy_map_value_locked(map, dst_val, value,
> + true);
> + else
> + copy_map_value(map, dst_val, value);
> + check_and_init_map_lock(map, dst_val);
> + }
> + if (do_delete) {
> + hlist_nulls_del_rcu(&l->hash_node);
> + if (is_lru_map)
> + bpf_lru_push_free(&htab->lru, &l->lru_node);
> + else
> + free_htab_elem(htab, l);
> + }
> + dst_key += key_size;
> + dst_val += value_size;
> + }
> +
> + raw_spin_unlock_irqrestore(&b->lock, flags);
> + /* If we are not copying data, we can go to next bucket and avoid
> + * unlocking the rcu.
> + */
> + if (!bucket_cnt && (batch + 1 < htab->n_buckets)) {
> + batch++;
> + goto again_nocopy;
> + }
> +
> + rcu_read_unlock();
> + this_cpu_dec(bpf_prog_active);
> + preempt_enable();
> + if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
> + key_size * bucket_cnt) ||
> + copy_to_user(uvalues + total * value_size, values,
> + value_size * bucket_cnt))) {
> + ret = -EFAULT;
> + goto after_loop;
> + }
> +
> + total += bucket_cnt;
> + batch++;
> + if (batch >= htab->n_buckets) {
> + ret = -ENOENT;
> + goto after_loop;
> + }
> + goto again;
> +
> +after_loop:
> + if (ret && (ret != -ENOENT && ret != -EFAULT))
> + goto out;

We won't have many error codes reaching here, -ENOENT, -EFAULT, -ENOSPC,
and possibly -ENOMEM.
Maybe just
if (ret == -EFAULT)
goto out;

> +
> + /* copy # of entries and next batch */
> + ubatch = u64_to_user_ptr(attr->batch.out_batch);
> + if (copy_to_user(ubatch, &batch, sizeof(batch)) ||
> + put_user(total, &uattr->batch.count))
> + ret = -EFAULT;
> +
> +out:
> + kvfree(keys);
> + kvfree(values);
> + return ret;
> +}
> +
[...]

2020-01-14 23:13:51

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v4 bpf-next 0/9] add bpf batch ops to process more than 1 elem



On 1/14/20 8:46 AM, Brian Vazquez wrote:
> This patch series introduce batch ops that can be added to bpf maps to
> lookup/lookup_and_delete/update/delete more than 1 element at the time,
> this is specially useful when syscall overhead is a problem and in case
> of hmap it will provide a reliable way of traversing them.
>
> The implementation inclues a generic approach that could potentially be
> used by any bpf map and adds it to arraymap, it also includes the specific
> implementation of hashmaps which are traversed using buckets instead
> of keys.
>
> The bpf syscall subcommands introduced are:
>
> BPF_MAP_LOOKUP_BATCH
> BPF_MAP_LOOKUP_AND_DELETE_BATCH
> BPF_MAP_UPDATE_BATCH
> BPF_MAP_DELETE_BATCH
>
> The UAPI attribute is:
>
> struct { /* struct used by BPF_MAP_*_BATCH commands */
> __aligned_u64 in_batch; /* start batch,
> * NULL to start from beginning
> */
> __aligned_u64 out_batch; /* output: next start batch */
> __aligned_u64 keys;
> __aligned_u64 values;
> __u32 count; /* input/output:
> * input: # of key/value
> * elements
> * output: # of filled elements
> */
> __u32 map_fd;
> __u64 elem_flags;
> __u64 flags;
> } batch;
>
>
> in_batch and out_batch are only used for lookup and lookup_and_delete since
> those are the only two operations that attempt to traverse the map.
>
> update/delete batch ops should provide the keys/values that user wants
> to modify.
>
> Here are the previous discussions on the batch processing:
> - https://lore.kernel.org/bpf/[email protected]/
> - https://lore.kernel.org/bpf/[email protected]/
> - https://lore.kernel.org/bpf/[email protected]/
>
> Changelog sinve v3:
> - Do not use copy_to_user inside atomic region (Yonghong Song)
> - Use _opts approach on libbpf APIs (Andrii Nakryiko)
> - Drop generic_map_lookup_and_delete_batch support
> - Free malloc-ed memory in tests (Yonghong Song)
> - Reverse christmas tree (Yonghong Song)
> - Add acked labels

Thanks for the new revision! Overall looks good. Only have a few minor
comments. Also tested in my environment and everything works as expected.

2020-01-14 23:50:36

by Brian Vazquez

[permalink] [raw]
Subject: Re: [PATCH v4 bpf-next 5/9] bpf: add batch ops to all htab bpf map

Hi Yonghong, thanks for reviewing it!

On Tue, Jan 14, 2020 at 2:56 PM Yonghong Song <[email protected]> wrote:
>
>
>
> On 1/14/20 8:46 AM, Brian Vazquez wrote:
> > From: Yonghong Song <[email protected]>
> >
> > htab can't use generic batch support due some problematic behaviours
> > inherent to the data structre, i.e. while iterating the bpf map a
> > concurrent program might delete the next entry that batch was about to
> > use, in that case there's no easy solution to retrieve the next entry,
> > the issue has been discussed multiple times (see [1] and [2]).
> >
> > The only way hmap can be traversed without the problem previously
> > exposed is by making sure that the map is traversing entire buckets.
> > This commit implements those strict requirements for hmap, the
> > implementation follows the same interaction that generic support with
> > some exceptions:
> >
> > - If keys/values buffer are not big enough to traverse a bucket,
> > ENOSPC will be returned.
> > - out_batch contains the value of the next bucket in the iteration, not
> > the next key, but this is transparent for the user since the user
> > should never use out_batch for other than bpf batch syscalls.
> >
> > This commits implements BPF_MAP_LOOKUP_BATCH and adds support for new
> > command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
> > batch ops it is possible to use the generic implementations.
> >
> > [1] https://lore.kernel.org/bpf/[email protected]/
> > [2] https://lore.kernel.org/bpf/[email protected]/
> >
> > Signed-off-by: Yonghong Song <[email protected]>
> > Signed-off-by: Brian Vazquez <[email protected]>
> > ---
> > include/linux/bpf.h | 3 +
> > include/uapi/linux/bpf.h | 1 +
> > kernel/bpf/hashtab.c | 258 +++++++++++++++++++++++++++++++++++++++
> > kernel/bpf/syscall.c | 9 +-
> > 4 files changed, 270 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 05466ad6cf1c5..3517e32149a4f 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -46,6 +46,9 @@ struct bpf_map_ops {
> > void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
> > int (*map_lookup_batch)(struct bpf_map *map, const union bpf_attr *attr,
> > union bpf_attr __user *uattr);
> > + int (*map_lookup_and_delete_batch)(struct bpf_map *map,
> > + const union bpf_attr *attr,
> > + union bpf_attr __user *uattr);
> > int (*map_update_batch)(struct bpf_map *map, const union bpf_attr *attr,
> > union bpf_attr __user *uattr);
> > int (*map_delete_batch)(struct bpf_map *map, const union bpf_attr *attr,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index e8df9ca680e0c..9536729a03d57 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -108,6 +108,7 @@ enum bpf_cmd {
> > BPF_MAP_FREEZE,
> > BPF_BTF_GET_NEXT_ID,
> > BPF_MAP_LOOKUP_BATCH,
> > + BPF_MAP_LOOKUP_AND_DELETE_BATCH,
> > BPF_MAP_UPDATE_BATCH,
> > BPF_MAP_DELETE_BATCH,
> > };
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index 22066a62c8c97..d9888acfd632b 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -17,6 +17,16 @@
> > (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
> > BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
> >
> > +#define BATCH_OPS(_name) \
> > + .map_lookup_batch = \
> > + _name##_map_lookup_batch, \
> > + .map_lookup_and_delete_batch = \
> > + _name##_map_lookup_and_delete_batch, \
> > + .map_update_batch = \
> > + generic_map_update_batch, \
> > + .map_delete_batch = \
> > + generic_map_delete_batch
> > +
> > struct bucket {
> > struct hlist_nulls_head head;
> > raw_spinlock_t lock;
> > @@ -1232,6 +1242,250 @@ static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
> > rcu_read_unlock();
> > }
> >
> > +static int
> > +__htab_map_lookup_and_delete_batch(struct bpf_map *map,
> > + const union bpf_attr *attr,
> > + union bpf_attr __user *uattr,
> > + bool do_delete, bool is_lru_map,
> > + bool is_percpu)
> > +{
> > + struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > + u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
> > + void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
> > + void __user *uvalues = u64_to_user_ptr(attr->batch.values);
> > + void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
> > + void *ubatch = u64_to_user_ptr(attr->batch.in_batch);
> > + u32 batch, max_count, size, bucket_size;
> > + u64 elem_map_flags, map_flags;
> > + struct hlist_nulls_head *head;
> > + struct hlist_nulls_node *n;
> > + unsigned long flags;
> > + struct htab_elem *l;
> > + struct bucket *b;
> > + int ret = 0;
> > +
> > + elem_map_flags = attr->batch.elem_flags;
> > + if ((elem_map_flags & ~BPF_F_LOCK) ||
> > + ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
> > + return -EINVAL;
> > +
> > + map_flags = attr->batch.flags;
> > + if (map_flags)
> > + return -EINVAL;
> > +
> > + max_count = attr->batch.count;
> > + if (!max_count)
> > + return 0;
> > +
> > + batch = 0;
> > + if (ubatch && copy_from_user(&batch, ubatch, sizeof(batch)))
> > + return -EFAULT;
> > +
> > + if (batch >= htab->n_buckets)
> > + return -ENOENT;
> > +
> > + key_size = htab->map.key_size;
> > + roundup_key_size = round_up(htab->map.key_size, 8);
> > + value_size = htab->map.value_size;
> > + size = round_up(value_size, 8);
> > + if (is_percpu)
> > + value_size = size * num_possible_cpus();
> > + total = 0;
> > + bucket_size = 1;
>
> Have you checked typical hash table linklist length?
While testing with hash tables ranging from 10 to 1000 entries I saw
linked lists of upto 5 entries.
> Maybe initial value bucket_size = 2 is able to cover most common cases?
I think 4-5 is still a reasonable number, what do you think?
>
> > +
> > +alloc:
> > + /* We cannot do copy_from_user or copy_to_user inside
> > + * the rcu_read_lock. Allocate enough space here.
> > + */
> > + keys = kvmalloc(key_size * bucket_size, GFP_USER | __GFP_NOWARN);
> > + values = kvmalloc(value_size * bucket_size, GFP_USER | __GFP_NOWARN);
> > + if (!keys || !values) {
> > + ret = -ENOMEM;
> > + goto out;
>
> In this case, we won't copy batch and total to user buffer. Maybe we
> should do that?
Yes, I think last line should be: goto after_loop;
>
>
> > + }
> > +
> > +again:
> > + preempt_disable();
> > + this_cpu_inc(bpf_prog_active);
> > + rcu_read_lock();
> > +again_nocopy:
> > + dst_key = keys;
> > + dst_val = values;
> > + b = &htab->buckets[batch];
> > + head = &b->head;
> > + raw_spin_lock_irqsave(&b->lock, flags);
> > +
> > + bucket_cnt = 0;
> > + hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
> > + bucket_cnt++;
> > +
> > + if (bucket_cnt > (max_count - total)) {
> > + if (total == 0)
> > + ret = -ENOSPC;
> > + raw_spin_unlock_irqrestore(&b->lock, flags);
> > + rcu_read_unlock();
> > + this_cpu_dec(bpf_prog_active);
> > + preempt_enable();
> > + goto after_loop;
> > + }
> > +
> > + if (bucket_cnt > bucket_size) {
> > + bucket_size = bucket_cnt;
> > + raw_spin_unlock_irqrestore(&b->lock, flags);
> > + rcu_read_unlock();
> > + this_cpu_dec(bpf_prog_active);
> > + preempt_enable();
> > + kvfree(keys);
> > + kvfree(values);
> > + goto alloc;
> > + }
> > +
> > + hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> > + memcpy(dst_key, l->key, key_size);
> > +
> > + if (is_percpu) {
> > + int off = 0, cpu;
> > + void __percpu *pptr;
> > +
> > + pptr = htab_elem_get_ptr(l, map->key_size);
> > + for_each_possible_cpu(cpu) {
> > + bpf_long_memcpy(dst_val + off,
> > + per_cpu_ptr(pptr, cpu), size);
> > + off += size;
> > + }
> > + } else {
> > + value = l->key + roundup_key_size;
> > + if (elem_map_flags & BPF_F_LOCK)
> > + copy_map_value_locked(map, dst_val, value,
> > + true);
> > + else
> > + copy_map_value(map, dst_val, value);
> > + check_and_init_map_lock(map, dst_val);
> > + }
> > + if (do_delete) {
> > + hlist_nulls_del_rcu(&l->hash_node);
> > + if (is_lru_map)
> > + bpf_lru_push_free(&htab->lru, &l->lru_node);
> > + else
> > + free_htab_elem(htab, l);
> > + }
> > + dst_key += key_size;
> > + dst_val += value_size;
> > + }
> > +
> > + raw_spin_unlock_irqrestore(&b->lock, flags);
> > + /* If we are not copying data, we can go to next bucket and avoid
> > + * unlocking the rcu.
> > + */
> > + if (!bucket_cnt && (batch + 1 < htab->n_buckets)) {
> > + batch++;
> > + goto again_nocopy;
> > + }
> > +
> > + rcu_read_unlock();
> > + this_cpu_dec(bpf_prog_active);
> > + preempt_enable();
> > + if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
> > + key_size * bucket_cnt) ||
> > + copy_to_user(uvalues + total * value_size, values,
> > + value_size * bucket_cnt))) {
> > + ret = -EFAULT;
> > + goto after_loop;
> > + }
> > +
> > + total += bucket_cnt;
> > + batch++;
> > + if (batch >= htab->n_buckets) {
> > + ret = -ENOENT;
> > + goto after_loop;
> > + }
> > + goto again;
> > +
> > +after_loop:
> > + if (ret && (ret != -ENOENT && ret != -EFAULT))
> > + goto out;
>
> We won't have many error codes reaching here, -ENOENT, -EFAULT, -ENOSPC,
> and possibly -ENOMEM.
> Maybe just
> if (ret == -EFAULT)
> goto out;
>
Yes I think that make senses, I only need to add

if (put_user(0, &uattr->batch.count))
return -EFAULT;

before traversing the map to make sure that if there is an error,
batch.count doesn't miss report entries since that variable is used as
input/output. Does this make sense?

> > +
> > + /* copy # of entries and next batch */
> > + ubatch = u64_to_user_ptr(attr->batch.out_batch);
> > + if (copy_to_user(ubatch, &batch, sizeof(batch)) ||
> > + put_user(total, &uattr->batch.count))
> > + ret = -EFAULT;
> > +
> > +out:
> > + kvfree(keys);
> > + kvfree(values);
> > + return ret;
> > +}
> > +
> [...]

2020-01-15 01:04:55

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH v4 bpf-next 5/9] bpf: add batch ops to all htab bpf map



On 1/14/20 3:49 PM, Brian Vazquez wrote:
> Hi Yonghong, thanks for reviewing it!
>
> On Tue, Jan 14, 2020 at 2:56 PM Yonghong Song <[email protected]> wrote:
>>
>>
>>
>> On 1/14/20 8:46 AM, Brian Vazquez wrote:
>>> From: Yonghong Song <[email protected]>
>>>
>>> htab can't use generic batch support due some problematic behaviours
>>> inherent to the data structre, i.e. while iterating the bpf map a
>>> concurrent program might delete the next entry that batch was about to
>>> use, in that case there's no easy solution to retrieve the next entry,
>>> the issue has been discussed multiple times (see [1] and [2]).
>>>
>>> The only way hmap can be traversed without the problem previously
>>> exposed is by making sure that the map is traversing entire buckets.
>>> This commit implements those strict requirements for hmap, the
>>> implementation follows the same interaction that generic support with
>>> some exceptions:
>>>
>>> - If keys/values buffer are not big enough to traverse a bucket,
>>> ENOSPC will be returned.
>>> - out_batch contains the value of the next bucket in the iteration, not
>>> the next key, but this is transparent for the user since the user
>>> should never use out_batch for other than bpf batch syscalls.
>>>
>>> This commits implements BPF_MAP_LOOKUP_BATCH and adds support for new
>>> command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
>>> batch ops it is possible to use the generic implementations.
>>>
>>> [1] https://lore.kernel.org/bpf/[email protected]/
>>> [2] https://lore.kernel.org/bpf/[email protected]/
>>>
>>> Signed-off-by: Yonghong Song <[email protected]>
>>> Signed-off-by: Brian Vazquez <[email protected]>
>>> ---
>>> include/linux/bpf.h | 3 +
>>> include/uapi/linux/bpf.h | 1 +
>>> kernel/bpf/hashtab.c | 258 +++++++++++++++++++++++++++++++++++++++
>>> kernel/bpf/syscall.c | 9 +-
>>> 4 files changed, 270 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 05466ad6cf1c5..3517e32149a4f 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -46,6 +46,9 @@ struct bpf_map_ops {
>>> void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
>>> int (*map_lookup_batch)(struct bpf_map *map, const union bpf_attr *attr,
>>> union bpf_attr __user *uattr);
>>> + int (*map_lookup_and_delete_batch)(struct bpf_map *map,
>>> + const union bpf_attr *attr,
>>> + union bpf_attr __user *uattr);
>>> int (*map_update_batch)(struct bpf_map *map, const union bpf_attr *attr,
>>> union bpf_attr __user *uattr);
>>> int (*map_delete_batch)(struct bpf_map *map, const union bpf_attr *attr,
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index e8df9ca680e0c..9536729a03d57 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -108,6 +108,7 @@ enum bpf_cmd {
>>> BPF_MAP_FREEZE,
>>> BPF_BTF_GET_NEXT_ID,
>>> BPF_MAP_LOOKUP_BATCH,
>>> + BPF_MAP_LOOKUP_AND_DELETE_BATCH,
>>> BPF_MAP_UPDATE_BATCH,
>>> BPF_MAP_DELETE_BATCH,
>>> };
>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>> index 22066a62c8c97..d9888acfd632b 100644
>>> --- a/kernel/bpf/hashtab.c
>>> +++ b/kernel/bpf/hashtab.c
>>> @@ -17,6 +17,16 @@
>>> (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
>>> BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
>>>
>>> +#define BATCH_OPS(_name) \
>>> + .map_lookup_batch = \
>>> + _name##_map_lookup_batch, \
>>> + .map_lookup_and_delete_batch = \
>>> + _name##_map_lookup_and_delete_batch, \
>>> + .map_update_batch = \
>>> + generic_map_update_batch, \
>>> + .map_delete_batch = \
>>> + generic_map_delete_batch
>>> +
>>> struct bucket {
>>> struct hlist_nulls_head head;
>>> raw_spinlock_t lock;
>>> @@ -1232,6 +1242,250 @@ static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
>>> rcu_read_unlock();
>>> }
>>>
>>> +static int
>>> +__htab_map_lookup_and_delete_batch(struct bpf_map *map,
>>> + const union bpf_attr *attr,
>>> + union bpf_attr __user *uattr,
>>> + bool do_delete, bool is_lru_map,
>>> + bool is_percpu)
>>> +{
>>> + struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>>> + u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
>>> + void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
>>> + void __user *uvalues = u64_to_user_ptr(attr->batch.values);
>>> + void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
>>> + void *ubatch = u64_to_user_ptr(attr->batch.in_batch);
>>> + u32 batch, max_count, size, bucket_size;
>>> + u64 elem_map_flags, map_flags;
>>> + struct hlist_nulls_head *head;
>>> + struct hlist_nulls_node *n;
>>> + unsigned long flags;
>>> + struct htab_elem *l;
>>> + struct bucket *b;
>>> + int ret = 0;
>>> +
>>> + elem_map_flags = attr->batch.elem_flags;
>>> + if ((elem_map_flags & ~BPF_F_LOCK) ||
>>> + ((elem_map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
>>> + return -EINVAL;
>>> +
>>> + map_flags = attr->batch.flags;
>>> + if (map_flags)
>>> + return -EINVAL;
>>> +
>>> + max_count = attr->batch.count;
>>> + if (!max_count)
>>> + return 0;
>>> +
>>> + batch = 0;
>>> + if (ubatch && copy_from_user(&batch, ubatch, sizeof(batch)))
>>> + return -EFAULT;
>>> +
>>> + if (batch >= htab->n_buckets)
>>> + return -ENOENT;
>>> +
>>> + key_size = htab->map.key_size;
>>> + roundup_key_size = round_up(htab->map.key_size, 8);
>>> + value_size = htab->map.value_size;
>>> + size = round_up(value_size, 8);
>>> + if (is_percpu)
>>> + value_size = size * num_possible_cpus();
>>> + total = 0;
>>> + bucket_size = 1;
>>
>> Have you checked typical hash table linklist length?
> While testing with hash tables ranging from 10 to 1000 entries I saw
> linked lists of upto 5 entries.
>> Maybe initial value bucket_size = 2 is able to cover most common cases?
> I think 4-5 is still a reasonable number, what do you think?

5 should be okay. You can add some comments to explain why "5".

>>
>>> +
>>> +alloc:
>>> + /* We cannot do copy_from_user or copy_to_user inside
>>> + * the rcu_read_lock. Allocate enough space here.
>>> + */
>>> + keys = kvmalloc(key_size * bucket_size, GFP_USER | __GFP_NOWARN);
>>> + values = kvmalloc(value_size * bucket_size, GFP_USER | __GFP_NOWARN);
>>> + if (!keys || !values) {
>>> + ret = -ENOMEM;
>>> + goto out;
>>
>> In this case, we won't copy batch and total to user buffer. Maybe we
>> should do that?
> Yes, I think last line should be: goto after_loop;
>>
>>
>>> + }
>>> +
>>> +again:
>>> + preempt_disable();
>>> + this_cpu_inc(bpf_prog_active);
>>> + rcu_read_lock();
>>> +again_nocopy:
>>> + dst_key = keys;
>>> + dst_val = values;
>>> + b = &htab->buckets[batch];
>>> + head = &b->head;
>>> + raw_spin_lock_irqsave(&b->lock, flags);
>>> +
>>> + bucket_cnt = 0;
>>> + hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
>>> + bucket_cnt++;
>>> +
>>> + if (bucket_cnt > (max_count - total)) {
>>> + if (total == 0)
>>> + ret = -ENOSPC;
>>> + raw_spin_unlock_irqrestore(&b->lock, flags);
>>> + rcu_read_unlock();
>>> + this_cpu_dec(bpf_prog_active);
>>> + preempt_enable();
>>> + goto after_loop;
>>> + }
>>> +
>>> + if (bucket_cnt > bucket_size) {
>>> + bucket_size = bucket_cnt;
>>> + raw_spin_unlock_irqrestore(&b->lock, flags);
>>> + rcu_read_unlock();
>>> + this_cpu_dec(bpf_prog_active);
>>> + preempt_enable();
>>> + kvfree(keys);
>>> + kvfree(values);
>>> + goto alloc;
>>> + }
>>> +
>>> + hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>>> + memcpy(dst_key, l->key, key_size);
>>> +
>>> + if (is_percpu) {
>>> + int off = 0, cpu;
>>> + void __percpu *pptr;
>>> +
>>> + pptr = htab_elem_get_ptr(l, map->key_size);
>>> + for_each_possible_cpu(cpu) {
>>> + bpf_long_memcpy(dst_val + off,
>>> + per_cpu_ptr(pptr, cpu), size);
>>> + off += size;
>>> + }
>>> + } else {
>>> + value = l->key + roundup_key_size;
>>> + if (elem_map_flags & BPF_F_LOCK)
>>> + copy_map_value_locked(map, dst_val, value,
>>> + true);
>>> + else
>>> + copy_map_value(map, dst_val, value);
>>> + check_and_init_map_lock(map, dst_val);
>>> + }
>>> + if (do_delete) {
>>> + hlist_nulls_del_rcu(&l->hash_node);
>>> + if (is_lru_map)
>>> + bpf_lru_push_free(&htab->lru, &l->lru_node);
>>> + else
>>> + free_htab_elem(htab, l);
>>> + }
>>> + dst_key += key_size;
>>> + dst_val += value_size;
>>> + }
>>> +
>>> + raw_spin_unlock_irqrestore(&b->lock, flags);
>>> + /* If we are not copying data, we can go to next bucket and avoid
>>> + * unlocking the rcu.
>>> + */
>>> + if (!bucket_cnt && (batch + 1 < htab->n_buckets)) {
>>> + batch++;
>>> + goto again_nocopy;
>>> + }
>>> +
>>> + rcu_read_unlock();
>>> + this_cpu_dec(bpf_prog_active);
>>> + preempt_enable();
>>> + if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
>>> + key_size * bucket_cnt) ||
>>> + copy_to_user(uvalues + total * value_size, values,
>>> + value_size * bucket_cnt))) {
>>> + ret = -EFAULT;
>>> + goto after_loop;
>>> + }
>>> +
>>> + total += bucket_cnt;
>>> + batch++;
>>> + if (batch >= htab->n_buckets) {
>>> + ret = -ENOENT;
>>> + goto after_loop;
>>> + }
>>> + goto again;
>>> +
>>> +after_loop:
>>> + if (ret && (ret != -ENOENT && ret != -EFAULT))
>>> + goto out;
>>
>> We won't have many error codes reaching here, -ENOENT, -EFAULT, -ENOSPC,
>> and possibly -ENOMEM.
>> Maybe just
>> if (ret == -EFAULT)
>> goto out;
>>
> Yes I think that make senses, I only need to add
>
> if (put_user(0, &uattr->batch.count))
> return -EFAULT;
>
> before traversing the map to make sure that if there is an error,
> batch.count doesn't miss report entries since that variable is used as
> input/output. Does this make sense?

This does make sense. You can put the above checking right before
the "alloc" label. Everything after that will go through copying
"count".

>
>>> +
>>> + /* copy # of entries and next batch */
>>> + ubatch = u64_to_user_ptr(attr->batch.out_batch);
>>> + if (copy_to_user(ubatch, &batch, sizeof(batch)) ||
>>> + put_user(total, &uattr->batch.count))
>>> + ret = -EFAULT;
>>> +
>>> +out:
>>> + kvfree(keys);
>>> + kvfree(values);
>>> + return ret;
>>> +}
>>> +
>> [...]