This introduces a new command to retrieve multiple entries
from a bpf map.
This new command can be executed from the existing BPF syscall as
follows:
err = bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf,
attr->dump.buf_len
returns zero or negative error, and populates buf and buf_len on
success.
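For illustration, walking a whole map with the new command could look
like the sketch below. This is untested and assumes the series is
applied; sys_bpf() and dump_whole_map() are my own names, and the map is
assumed, for simplicity, to have u64 keys and u64 values:

#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/bpf.h>

static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

/* Read a whole map in chunks of up to 100 key/value pairs. */
static int dump_whole_map(int map_fd)
{
	const unsigned int elem_size = 2 * sizeof(__u64); /* key + value */
	char data[100 * 16];		/* room for 100 elements */
	__u64 prev_key;
	void *prev = NULL;		/* NULL: start from the beginning */
	__u32 buf_len;

	for (;;) {
		union bpf_attr attr = {};

		buf_len = sizeof(data);
		attr.dump.map_fd = map_fd;
		attr.dump.prev_key = (__u64)(unsigned long)prev;
		attr.dump.buf = (__u64)(unsigned long)data;
		attr.dump.buf_len = (__u64)(unsigned long)&buf_len;

		if (sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr)))
			return errno == ENOENT ? 0 : -1; /* ENOENT: done */

		/* ... consume buf_len bytes of key/value pairs ... */

		/* the key of the last element copied seeds the next
		 * call (the kernel also writes it back to prev_key
		 * when one is supplied)
		 */
		memcpy(&prev_key, data + buf_len - elem_size,
		       sizeof(prev_key));
		prev = &prev_key;
	}
}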
This implementation wraps the existing bpf methods
map_get_next_key and map_lookup_elem.
Note that this implementation can be extended later to do dump and
delete by extending map_lookup_and_delete_elem (currently it only works
for bpf queue/stack maps), and either using a new flag in map_dump or a
new command, map_dump_and_delete.
Results show that even with a 1-elem_size buffer, it runs ~40% faster
than the current implementation; improvements of ~85% are reported when
the buffer size is increased, although once the buffer size is around
5% of the total number of entries there is no huge difference in
increasing it further.
Tested:
Tried different buffer sizes to handle the cases where the bulk is
bigger than, or the elements to retrieve are fewer than, the existing
ones; all runs read a map of 100K entries. Below are the results (in
ns) from the different runs:
buf_len_1:      69038725  entry-by-entry: 112384424  improvement: 38.569134%
buf_len_2:      40897447  entry-by-entry: 111030546  improvement: 63.165590%
buf_len_230:    13652714  entry-by-entry: 111694058  improvement: 87.776687%
buf_len_5000:   13576271  entry-by-entry: 111101169  improvement: 87.780263%
buf_len_73000:  14694343  entry-by-entry: 111740162  improvement: 86.849542%
buf_len_100000: 13745969  entry-by-entry: 114151991  improvement: 87.958187%
buf_len_234567: 14329834  entry-by-entry: 114427589  improvement: 87.476941%
The series of patches is split as follows:
- First patch moves some map_lookup_elem logic into 2 functions to
deduplicate code: bpf_map_value_size and bpf_map_copy_value
- Second patch introduces the map_dump function
- Third patch syncs the tools linux headers
- Fourth patch adds libbpf support
- Last two patches add tests
RFC Changelog:
- remove wrong usage of attr.flags
- move map_fd to remove hole after it
v3:
- add explanation of the API in the commit message
- fix masked errors and return them to user
- copy last_key from return buf into prev_key if it was provided
- run perf test with kpti and retpoline mitigations
v2:
- use proper bpf-next tag
Brian Vazquez (6):
bpf: add bpf_map_value_size and bpf_map_copy_value helper functions
bpf: add BPF_MAP_DUMP command to dump more than one entry per call
bpf: keep bpf.h in sync with tools/
libbpf: support BPF_MAP_DUMP command
selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap
selftests/bpf: add test to measure performance of BPF_MAP_DUMP
include/uapi/linux/bpf.h | 9 +
kernel/bpf/syscall.c | 251 ++++++++++++++++++------
tools/include/uapi/linux/bpf.h | 9 +
tools/lib/bpf/bpf.c | 28 +++
tools/lib/bpf/bpf.h | 4 +
tools/lib/bpf/libbpf.map | 2 +
tools/testing/selftests/bpf/test_maps.c | 148 +++++++++++++-
7 files changed, 388 insertions(+), 63 deletions(-)
--
2.22.0.657.g960e92d24f-goog
This introduces a new command to retrieve multiple entries
from a bpf map, wrapping the existing bpf methods
map_get_next_key and map_lookup_elem.
To start dumping the map from the beginning you must specify NULL as
the prev_key.
The new API returns 0 when it successfully copied all the elements
requested, or when it copied fewer because there were no more elements
to retrieve (i.e., err == -ENOENT). In the last scenario, err will be
masked to 0.
On a successful call, buf and buf_len will contain correct data and, in
case prev_key was provided (not on the first walk, since prev_key is
NULL then), the last key copied will be written back into prev_key,
which simplifies the next call.
Only when it can't find a single element will it return -ENOENT,
meaning that the map has been entirely walked. When an error is
returned, buf, buf_len and prev_key shouldn't be read or used.
Because maps can be accessed from userspace and kernel code, this
function can hit a scenario where the next_key was found but, by the
time we try to retrieve the value, the element is not there anymore; in
this case the function continues and tries to get a new next_key value,
skipping the deleted key. If at some point the function finds itself
trapped in a loop, it will return -EINTR.
The function will try to fit as much as possible in the buf provided and
will return -EINVAL if buf_len is smaller than elem_size.
QUEUE and STACK maps are not supported.
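To summarize the contract above in one place (my restatement):

    prev_key == NULL    start the walk from the first key in the map
    return 0            buf/buf_len hold the copied elements; prev_key,
                        if provided, is updated to the last key copied
    return -ENOENT      nothing was copied: the map was fully walked
    return -EINVAL      buf_len is smaller than a single key+value pair
    return -EINTR       trapped in a loop while skipping deleted keys
    return -ENOTSUPP    the map is a QUEUE or STACK map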
Note that map_dump doesn't guarantee that reading the entire table is
consistent since this function is always racing with kernel and user code
but the same behaviour is found when the entire table is walked using
the current interfaces: map_get_next_key + map_lookup_elem.
It is also important to note that with a locked map, the lock is
grabbed for one entry at a time, meaning that the returned buf might or
might not be consistent.
Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
---
include/uapi/linux/bpf.h | 9 +++
kernel/bpf/syscall.c | 117 +++++++++++++++++++++++++++++++++++++++
2 files changed, 126 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa1c753dcdbc7..66dab5385170d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
BPF_TASK_FD_QUERY,
BPF_MAP_LOOKUP_AND_DELETE_ELEM,
BPF_MAP_FREEZE,
+ BPF_MAP_DUMP,
};
enum bpf_map_type {
@@ -388,6 +389,14 @@ union bpf_attr {
__u64 flags;
};
+ struct { /* struct used by BPF_MAP_DUMP command */
+ __aligned_u64 prev_key;
+ __aligned_u64 buf;
+ __aligned_u64 buf_len; /* input/output: len of buf */
+ __u64 flags;
+ __u32 map_fd;
+ } dump;
+
struct { /* anonymous struct used by BPF_PROG_LOAD command */
__u32 prog_type; /* one of enum bpf_prog_type */
__u32 insn_cnt;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 86cdc2f7bb56e..0c35505aa219f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
return err;
}
+/* last field in 'union bpf_attr' used by this command */
+#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
+
+static int map_dump(union bpf_attr *attr)
+{
+ void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
+ void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
+ u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
+ int ufd = attr->dump.map_fd;
+ struct bpf_map *map;
+ void *buf, *prev_key, *key, *value;
+ u32 value_size, elem_size, buf_len, cp_len;
+ struct fd f;
+ int err;
+ bool first_key = false;
+
+ if (CHECK_ATTR(BPF_MAP_DUMP))
+ return -EINVAL;
+
+ if (attr->dump.flags & ~BPF_F_LOCK)
+ return -EINVAL;
+
+ f = fdget(ufd);
+ map = __bpf_map_get(f);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+ if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
+ err = -EPERM;
+ goto err_put;
+ }
+
+ if ((attr->dump.flags & BPF_F_LOCK) &&
+ !map_value_has_spin_lock(map)) {
+ err = -EINVAL;
+ goto err_put;
+ }
+
+ if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+ map->map_type == BPF_MAP_TYPE_STACK) {
+ err = -ENOTSUPP;
+ goto err_put;
+ }
+
+ value_size = bpf_map_value_size(map);
+
+ err = get_user(buf_len, ubuf_len);
+ if (err)
+ goto err_put;
+
+ elem_size = map->key_size + value_size;
+ if (buf_len < elem_size) {
+ err = -EINVAL;
+ goto err_put;
+ }
+
+ if (ukey) {
+ prev_key = __bpf_copy_key(ukey, map->key_size);
+ if (IS_ERR(prev_key)) {
+ err = PTR_ERR(prev_key);
+ goto err_put;
+ }
+ } else {
+ prev_key = NULL;
+ first_key = true;
+ }
+
+ err = -ENOMEM;
+ buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
+ if (!buf)
+ goto err_put;
+
+ key = buf;
+ value = key + map->key_size;
+ for (cp_len = 0; cp_len + elem_size <= buf_len;) {
+ if (signal_pending(current)) {
+ err = -EINTR;
+ break;
+ }
+
+ rcu_read_lock();
+ err = map->ops->map_get_next_key(map, prev_key, key);
+ rcu_read_unlock();
+
+ if (err)
+ break;
+
+ err = bpf_map_copy_value(map, key, value, attr->dump.flags);
+
+ if (err == -ENOENT)
+ continue;
+ if (err)
+ goto free_buf;
+
+ if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
+ err = -EFAULT;
+ goto free_buf;
+ }
+
+ prev_key = key;
+ cp_len += elem_size;
+ }
+
+ if (err == -ENOENT && cp_len)
+ err = 0;
+ if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
+ (!first_key && copy_to_user(ukey, key, map->key_size))))
+ err = -EFAULT;
+free_buf:
+ kfree(buf);
+err_put:
+ fdput(f);
+ return err;
+}
+
#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
static int map_lookup_and_delete_elem(union bpf_attr *attr)
@@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
err = map_lookup_and_delete_elem(&attr);
break;
+ case BPF_MAP_DUMP:
+ err = map_dump(&attr);
+ break;
default:
err = -EINVAL;
break;
--
2.22.0.657.g960e92d24f-goog
On Wed, Jul 24, 2019 at 10:09 AM Brian Vazquez <[email protected]> wrote:
>
> This introduces a new command to retrieve multiple entries
> from a bpf map.
>
> This new command can be executed from the existing BPF syscall as
> follows:
>
> err = bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
> using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf,
> attr->dump.buf_len
> returns zero or negative error, and populates buf and buf_len on
> success
>
> This implementation wraps the existing bpf methods
> map_get_next_key and map_lookup_elem
>
> Note that this implementation can be extended later to do dump and
> delete by extending map_lookup_and_delete_elem (currently it only works
> for bpf queue/stack maps), and either using a new flag in map_dump or a
> new command, map_dump_and_delete.
>
> Results show that even with a 1-elem_size buffer, it runs ~40% faster
Why is the new command 40% faster with 1-elem_size buffer?
> than the current implementation; improvements of ~85% are reported when
> the buffer size is increased, although once the buffer size is around
> 5% of the total number of entries there is no huge difference in
> increasing it further.
>
> Tested:
> Tried different buffer sizes to handle the cases where the bulk is
> bigger than, or the elements to retrieve are fewer than, the existing
> ones; all runs read a map of 100K entries. Below are the results (in
> ns) from the different runs:
>
> buf_len_1:      69038725  entry-by-entry: 112384424  improvement: 38.569134%
> buf_len_2:      40897447  entry-by-entry: 111030546  improvement: 63.165590%
> buf_len_230:    13652714  entry-by-entry: 111694058  improvement: 87.776687%
> buf_len_5000:   13576271  entry-by-entry: 111101169  improvement: 87.780263%
> buf_len_73000:  14694343  entry-by-entry: 111740162  improvement: 86.849542%
> buf_len_100000: 13745969  entry-by-entry: 114151991  improvement: 87.958187%
> buf_len_234567: 14329834  entry-by-entry: 114427589  improvement: 87.476941%
It took me a while to figure out the meaning of 87.476941. It is probably
a good idea to say 87.5% instead.
Thanks,
Song
On Wed, Jul 24, 2019 at 12:55 PM Willem de Bruijn
<[email protected]> wrote:
>
> On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <[email protected]> wrote:
> >
> > This introduces a new command to retrieve multiple entries
> > from a bpf map, wrapping the existing bpf methods
> > map_get_next_key and map_lookup_elem
> >
> > To start dumping the map from the beginning you must specify NULL as
> > the prev_key.
> >
> > The new API returns 0 when it successfully copied all the elements
> > requested, or when it copied fewer because there were no more elements
> > to retrieve (i.e., err == -ENOENT). In the last scenario, err will be masked to 0.
>
> I think I understand this, but perhaps it can be explained a bit more
> concisely without reference to ENOENT and error masking. The function
> returns the min of the number of requested elements and the number of
> remaining elements in the map from the given starting point
> (prev_key).
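(Stated as a formula, paraphrasing the above: a successful call copies
n = min(buf_len / elem_size, number of elements after prev_key)
elements and returns 0 whenever n > 0.)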
That sounds better to me. Thanks!
> > On a successful call, buf and buf_len will contain correct data and, in
> > case prev_key was provided (not on the first walk, since prev_key is
> > NULL then), the last key copied will be written back into prev_key,
> > which simplifies the next call.
> >
> > Only when it can't find a single element will it return -ENOENT,
> > meaning that the map has been entirely walked. When an error is
> > returned, buf, buf_len and prev_key shouldn't be read or used.
>
> That's common for error handling. No need to state explicitly.
Ack
>
> > Because maps can be accessed from userspace and kernel code, this
> > function can hit a scenario where the next_key was found but, by the
> > time we try to retrieve the value, the element is not there anymore; in
> > this case the function continues and tries to get a new next_key value,
> > skipping the deleted key. If at some point the function finds itself
> > trapped in a loop, it will return -EINTR.
>
> Good to point this out! I don't think that unbounded continue;
> statements until an interrupt happens are sufficient. Please bound the
> number of retries to a low number.
And what would be a good number? Maybe 3 attempts? And in that case,
what error should be reported?
>
> > The function will try to fit as much as possible in the buf provided and
> > will return -EINVAL if buf_len is smaller than elem_size.
> >
> > QUEUE and STACK maps are not supported.
> >
> > Note that map_dump doesn't guarantee that reading the entire table is
> > consistent since this function is always racing with kernel and user code
> > but the same behaviour is found when the entire table is walked using
> > the current interfaces: map_get_next_key + map_lookup_elem.
>
> > It is also important to note that with a locked map, the lock is
> > grabbed for one entry at a time, meaning that the returned buf might or
> > might not be consistent.
>
> Would it be informative to signal to the caller if the read was
> complete and consistent (because the entire table was read while the
> lock was held)?
Mmm.. not sure how we could signal that to the caller. But I don't
think there's a way to know it was consistent (i.e. one element was
removed in bucket 20 and you are copying the keys in bucket 15; when
you get to bucket 20 there's no way to know that some entries were
removed while you traversed them). The lock is held for just one
single entry, not the entire table.
Maybe I should clarify that more in the commit message?
Thanks for reviewing!
> > > Because maps can be accessed from userspace and kernel code, this
> > > function can hit a scenario where the next_key was found but, by the
> > > time we try to retrieve the value, the element is not there anymore;
> > > in this case the function continues and tries to get a new next_key
> > > value, skipping the deleted key. If at some point the function finds
> > > itself trapped in a loop, it will return -EINTR.
> >
> > Good to point this out! I don't think that unbounded continue;
> > statements until an interrupt happens are sufficient. Please bound the
> > number of retries to a low number.
>
> And what would be a good number? Maybe 3 attempts?
3 sounds good to me.
> And in that case what error should be reported?
One that's unambiguous and somewhat intuitive for the given issue.
Perhaps EBUSY?
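Something along these lines, perhaps (a rough sketch against the loop
in this patch; the budget of 3 retries and the -EBUSY value are only
the numbers floated in this thread, nothing settled):

	int retries = 0;	/* declared before the copy loop */
	...
	err = bpf_map_copy_value(map, key, value, attr->dump.flags);

	if (err == -ENOENT) {
		/* the key vanished between get_next_key and the
		 * lookup: skip it, but give up after a few attempts
		 * instead of spinning indefinitely
		 */
		if (++retries >= 3) {
			err = -EBUSY;
			break;
		}
		continue;
	}
	retries = 0;	/* made progress, reset the budget */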
> > > The function will try to fit as much as possible in the buf provided and
> > > will return -EINVAL if buf_len is smaller than elem_size.
> > >
> > > QUEUE and STACK maps are not supported.
> > >
> > > Note that map_dump doesn't guarantee that reading the entire table is
> > > consistent since this function is always racing with kernel and user code
> > > but the same behaviour is found when the entire table is walked using
> > > the current interfaces: map_get_next_key + map_lookup_elem.
> >
> > > It is also important to note that with a locked map, the lock is
> > > grabbed for one entry at a time, meaning that the returned buf might
> > > or might not be consistent.
> >
> > Would it be informative to signal to the caller if the read was
> > complete and consistent (because the entire table was read while the
> > lock was held)?
>
> Mmm.. not sure how we could signal that to the caller. But I don't
> think there's a way to know it was consistent
Okay, that makes for a simple answer :) No need to try to add a signal, then.
On Wed, Jul 24, 2019 at 3:44 PM Brian Vazquez <[email protected]> wrote:
>
> On Wed, Jul 24, 2019 at 2:40 PM Song Liu <[email protected]> wrote:
> >
> > On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <[email protected]> wrote:
> > >
> > > This introduces a new command to retrieve multiple entries
> > > from a bpf map, wrapping the existing bpf methods
> > > map_get_next_key and map_lookup_elem
> > >
> > > To start dumping the map from the beginning you must specify NULL as
> > > the prev_key.
> > >
> > > The new API returns 0 when it successfully copied all the elements
> > > requested, or when it copied fewer because there were no more elements
> > > to retrieve (i.e., err == -ENOENT). In the last scenario, err will be masked to 0.
> > >
> > > On a successful call, buf and buf_len will contain correct data and, in
> > > case prev_key was provided (not on the first walk, since prev_key is
> > > NULL then), the last key copied will be written back into prev_key,
> > > which simplifies the next call.
> > >
> > > Only when it can't find a single element will it return -ENOENT,
> > > meaning that the map has been entirely walked. When an error is
> > > returned, buf, buf_len and prev_key shouldn't be read or used.
> > >
> > > Because maps can be accessed from userspace and kernel code, this
> > > function can hit a scenario where the next_key was found but, by the
> > > time we try to retrieve the value, the element is not there anymore;
> > > in this case the function continues and tries to get a new next_key
> > > value, skipping the deleted key. If at some point the function finds
> > > itself trapped in a loop, it will return -EINTR.
> > >
> > > The function will try to fit as much as possible in the buf provided and
> > > will return -EINVAL if buf_len is smaller than elem_size.
> > >
> > > QUEUE and STACK maps are not supported.
> > >
> > > Note that map_dump doesn't guarantee that reading the entire table is
> > > consistent since this function is always racing with kernel and user code
> > > but the same behaviour is found when the entire table is walked using
> > > the current interfaces: map_get_next_key + map_lookup_elem.
> > > It is also important to note that with a locked map, the lock is
> > > grabbed for one entry at a time, meaning that the returned buf might
> > > or might not be consistent.
> > >
> > > Suggested-by: Stanislav Fomichev <[email protected]>
> > > Signed-off-by: Brian Vazquez <[email protected]>
> > > ---
> > > include/uapi/linux/bpf.h | 9 +++
> > > kernel/bpf/syscall.c | 117 +++++++++++++++++++++++++++++++++++++++
> > > 2 files changed, 126 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index fa1c753dcdbc7..66dab5385170d 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -106,6 +106,7 @@ enum bpf_cmd {
> > > BPF_TASK_FD_QUERY,
> > > BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> > > BPF_MAP_FREEZE,
> > > + BPF_MAP_DUMP,
> > > };
> > >
> > > enum bpf_map_type {
> > > @@ -388,6 +389,14 @@ union bpf_attr {
> > > __u64 flags;
> > > };
> > >
> > > + struct { /* struct used by BPF_MAP_DUMP command */
> > > + __aligned_u64 prev_key;
> > > + __aligned_u64 buf;
> > > + __aligned_u64 buf_len; /* input/output: len of buf */
> > > + __u64 flags;
> >
> > Please add explanation of flags here.
>
> got it!
>
> > Also, we need to update the
> > comments of BPF_F_LOCK for BPF_MAP_DUMP.
>
> What do you mean? I didn't get this part.
I meant, the current comment says BPF_F_LOCK is for BPF_MAP_UPDATE_ELEM.
But it is also used by BPF_MAP_LOOKUP_ELEM and BPF_MAP_DUMP, so the
current comment is not accurate either.
Maybe fix it while you are at it?
>
> >
> > > + __u32 map_fd;
> > > + } dump;
> > > +
> > > struct { /* anonymous struct used by BPF_PROG_LOAD command */
> > > __u32 prog_type; /* one of enum bpf_prog_type */
> > > __u32 insn_cnt;
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 86cdc2f7bb56e..0c35505aa219f 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
> > > return err;
> > > }
> > >
> > > +/* last field in 'union bpf_attr' used by this command */
> > > +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> > > +
> > > +static int map_dump(union bpf_attr *attr)
> > > +{
> > > + void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> > > + void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> > > + u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> > > + int ufd = attr->dump.map_fd;
> > > + struct bpf_map *map;
> > > + void *buf, *prev_key, *key, *value;
> > > + u32 value_size, elem_size, buf_len, cp_len;
> > > + struct fd f;
> > > + int err;
> > > + bool first_key = false;
> > > +
> > > + if (CHECK_ATTR(BPF_MAP_DUMP))
> > > + return -EINVAL;
> > > +
> > > + if (attr->dump.flags & ~BPF_F_LOCK)
> > > + return -EINVAL;
> > > +
> > > + f = fdget(ufd);
> > > + map = __bpf_map_get(f);
> > > + if (IS_ERR(map))
> > > + return PTR_ERR(map);
> > > + if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> > > + err = -EPERM;
> > > + goto err_put;
> > > + }
> > > +
> > > + if ((attr->dump.flags & BPF_F_LOCK) &&
> > > + !map_value_has_spin_lock(map)) {
> > > + err = -EINVAL;
> > > + goto err_put;
> > > + }
> >
> > We can share these lines with map_lookup_elem(). Maybe
> > add another helper function?
>
> Which lines are you referring to? The dump.flags checks? It makes
> sense; that way, when a new flag is added, you only need to modify
> them in one spot.
I think I misread it. attr->dump.flags is not the same as attr->flags.
So never mind.
>
> > > +
> > > + if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> > > + map->map_type == BPF_MAP_TYPE_STACK) {
> > > + err = -ENOTSUPP;
> > > + goto err_put;
> > > + }
> > > +
> > > + value_size = bpf_map_value_size(map);
> > > +
> > > + err = get_user(buf_len, ubuf_len);
> > > + if (err)
> > > + goto err_put;
> > > +
> > > + elem_size = map->key_size + value_size;
> > > + if (buf_len < elem_size) {
> > > + err = -EINVAL;
> > > + goto err_put;
> > > + }
> > > +
> > > + if (ukey) {
> > > + prev_key = __bpf_copy_key(ukey, map->key_size);
> > > + if (IS_ERR(prev_key)) {
> > > + err = PTR_ERR(prev_key);
> > > + goto err_put;
> > > + }
> > > + } else {
> > > + prev_key = NULL;
> > > + first_key = true;
> > > + }
> > > +
> > > + err = -ENOMEM;
> > > + buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> > > + if (!buf)
> > > + goto err_put;
> > > +
> > > + key = buf;
> > > + value = key + map->key_size;
> > > + for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> > > + if (signal_pending(current)) {
> > > + err = -EINTR;
> > > + break;
> > > + }
> > > +
> > > + rcu_read_lock();
> > > + err = map->ops->map_get_next_key(map, prev_key, key);
> >
> > If prev_key is deleted before map_get_next_key(), we get the first key
> > again. This is pretty weird.
>
> Yes, I know. But note that the current scenario happens even for the
> old interface (imagine you are walking a map from userspace and you
> call get_next_key after the prev_key was removed; you will start again
> from the beginning without noticing it).
> I tried to send a patch in the past but I was missing some context:
> before NULL was used to get the very first_key, the interface relied on
> a random (non-existent) key to retrieve the first_key in the map, and
> I was told that we still have to support that scenario.
BPF_MAP_DUMP is slightly different, as you may return the first key
multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
don't have to support legacy scenarios.
Since BPF_MAP_DUMP keeps a list of elements, it is possible to try
to look up previous keys. Would something in this direction work?
Thanks,
Song
On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <[email protected]> wrote:
>
> This introduces a new command to retrieve multiple entries
> from a bpf map, wrapping the existing bpf methods
> map_get_next_key and map_lookup_elem
>
> To start dumping the map from the beginning you must specify NULL as
> the prev_key.
>
> The new API returns 0 when it successfully copied all the elements
> requested, or when it copied fewer because there were no more elements
> to retrieve (i.e., err == -ENOENT). In the last scenario, err will be masked to 0.
I think I understand this, but perhaps it can be explained a bit more
concisely without reference to ENOENT and error masking. The function
returns the min of the number of requested elements and the number of
remaining elements in the map from the given starting point
(prev_key).
> On a successful call, buf and buf_len will contain correct data and, in
> case prev_key was provided (not on the first walk, since prev_key is
> NULL then), the last key copied will be written back into prev_key,
> which simplifies the next call.
>
> Only when it can't find a single element will it return -ENOENT,
> meaning that the map has been entirely walked. When an error is
> returned, buf, buf_len and prev_key shouldn't be read or used.
That's common for error handling. No need to state explicitly.
> Because maps can be accessed from userspace and kernel code, this
> function can hit a scenario where the next_key was found but, by the
> time we try to retrieve the value, the element is not there anymore; in
> this case the function continues and tries to get a new next_key value,
> skipping the deleted key. If at some point the function finds itself
> trapped in a loop, it will return -EINTR.
Good to point this out! I don't think that unbounded continue;
statements until an interrupt happens are sufficient. Please bound the
number of retries to a low number.
> The function will try to fit as much as possible in the buf provided and
> will return -EINVAL if buf_len is smaller than elem_size.
>
> QUEUE and STACK maps are not supported.
>
> Note that map_dump doesn't guarantee that reading the entire table is
> consistent since this function is always racing with kernel and user code
> but the same behaviour is found when the entire table is walked using
> the current interfaces: map_get_next_key + map_lookup_elem.
> It is also important to note that with a locked map, the lock is
> grabbed for one entry at a time, meaning that the returned buf might or
> might not be consistent.
Would it be informative to signal to the caller if the read was
complete and consistent (because the entire table was read while the
lock was held)?
>
> Suggested-by: Stanislav Fomichev <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <[email protected]> wrote:
>
> This introduces a new command to retrieve multiple entries
> from a bpf map, wrapping the existing bpf methods
> map_get_next_key and map_lookup_elem
>
> To start dumping the map from the beginning you must specify NULL as
> the prev_key.
>
> The new API returns 0 when it successfully copied all the elements
> requested, or when it copied fewer because there were no more elements
> to retrieve (i.e., err == -ENOENT). In the last scenario, err will be masked to 0.
>
> On a successful call, buf and buf_len will contain correct data and, in
> case prev_key was provided (not on the first walk, since prev_key is
> NULL then), the last key copied will be written back into prev_key,
> which simplifies the next call.
>
> Only when it can't find a single element will it return -ENOENT,
> meaning that the map has been entirely walked. When an error is
> returned, buf, buf_len and prev_key shouldn't be read or used.
>
> Because maps can be accessed from userspace and kernel code, this
> function can hit a scenario where the next_key was found but, by the
> time we try to retrieve the value, the element is not there anymore; in
> this case the function continues and tries to get a new next_key value,
> skipping the deleted key. If at some point the function finds itself
> trapped in a loop, it will return -EINTR.
>
> The function will try to fit as much as possible in the buf provided and
> will return -EINVAL if buf_len is smaller than elem_size.
>
> QUEUE and STACK maps are not supported.
>
> Note that map_dump doesn't guarantee that reading the entire table is
> consistent since this function is always racing with kernel and user code
> but the same behaviour is found when the entire table is walked using
> the current interfaces: map_get_next_key + map_lookup_elem.
> It is also important to note that with a locked map, the lock is
> grabbed for one entry at a time, meaning that the returned buf might or
> might not be consistent.
>
> Suggested-by: Stanislav Fomichev <[email protected]>
> Signed-off-by: Brian Vazquez <[email protected]>
> ---
> include/uapi/linux/bpf.h | 9 +++
> kernel/bpf/syscall.c | 117 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 126 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index fa1c753dcdbc7..66dab5385170d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -106,6 +106,7 @@ enum bpf_cmd {
> BPF_TASK_FD_QUERY,
> BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> BPF_MAP_FREEZE,
> + BPF_MAP_DUMP,
> };
>
> enum bpf_map_type {
> @@ -388,6 +389,14 @@ union bpf_attr {
> __u64 flags;
> };
>
> + struct { /* struct used by BPF_MAP_DUMP command */
> + __aligned_u64 prev_key;
> + __aligned_u64 buf;
> + __aligned_u64 buf_len; /* input/output: len of buf */
> + __u64 flags;
Please add explanation of flags here. Also, we need to update the
comments of BPF_F_LOCK for BPF_MAP_DUMP.
> + __u32 map_fd;
> + } dump;
> +
> struct { /* anonymous struct used by BPF_PROG_LOAD command */
> __u32 prog_type; /* one of enum bpf_prog_type */
> __u32 insn_cnt;
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 86cdc2f7bb56e..0c35505aa219f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
> return err;
> }
>
> +/* last field in 'union bpf_attr' used by this command */
> +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> +
> +static int map_dump(union bpf_attr *attr)
> +{
> + void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> + void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> + u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> + int ufd = attr->dump.map_fd;
> + struct bpf_map *map;
> + void *buf, *prev_key, *key, *value;
> + u32 value_size, elem_size, buf_len, cp_len;
> + struct fd f;
> + int err;
> + bool first_key = false;
> +
> + if (CHECK_ATTR(BPF_MAP_DUMP))
> + return -EINVAL;
> +
> + if (attr->dump.flags & ~BPF_F_LOCK)
> + return -EINVAL;
> +
> + f = fdget(ufd);
> + map = __bpf_map_get(f);
> + if (IS_ERR(map))
> + return PTR_ERR(map);
> + if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> + err = -EPERM;
> + goto err_put;
> + }
> +
> + if ((attr->dump.flags & BPF_F_LOCK) &&
> + !map_value_has_spin_lock(map)) {
> + err = -EINVAL;
> + goto err_put;
> + }
We can share these lines with map_lookup_elem(). Maybe
add another helper function?
> +
> + if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> + map->map_type == BPF_MAP_TYPE_STACK) {
> + err = -ENOTSUPP;
> + goto err_put;
> + }
> +
> + value_size = bpf_map_value_size(map);
> +
> + err = get_user(buf_len, ubuf_len);
> + if (err)
> + goto err_put;
> +
> + elem_size = map->key_size + value_size;
> + if (buf_len < elem_size) {
> + err = -EINVAL;
> + goto err_put;
> + }
> +
> + if (ukey) {
> + prev_key = __bpf_copy_key(ukey, map->key_size);
> + if (IS_ERR(prev_key)) {
> + err = PTR_ERR(prev_key);
> + goto err_put;
> + }
> + } else {
> + prev_key = NULL;
> + first_key = true;
> + }
> +
> + err = -ENOMEM;
> + buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> + if (!buf)
> + goto err_put;
> +
> + key = buf;
> + value = key + map->key_size;
> + for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> + if (signal_pending(current)) {
> + err = -EINTR;
> + break;
> + }
> +
> + rcu_read_lock();
> + err = map->ops->map_get_next_key(map, prev_key, key);
If prev_key is deleted before map_get_next_key(), we get the first key
again. This is pretty weird.
> + rcu_read_unlock();
> +
> + if (err)
> + break;
> +
> + err = bpf_map_copy_value(map, key, value, attr->dump.flags);
> +
> + if (err == -ENOENT)
> + continue;
> + if (err)
> + goto free_buf;
> +
> + if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
> + err = -EFAULT;
> + goto free_buf;
> + }
> +
> + prev_key = key;
> + cp_len += elem_size;
> + }
> +
> + if (err == -ENOENT && cp_len)
> + err = 0;
> + if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
> + (!first_key && copy_to_user(ukey, key, map->key_size))))
> + err = -EFAULT;
> +free_buf:
> + kfree(buf);
> +err_put:
> + fdput(f);
> + return err;
> +}
> +
> #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
>
> static int map_lookup_and_delete_elem(union bpf_attr *attr)
> @@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
> err = map_lookup_and_delete_elem(&attr);
> break;
> + case BPF_MAP_DUMP:
> + err = map_dump(&attr);
> + break;
> default:
> err = -EINVAL;
> break;
> --
> 2.22.0.657.g960e92d24f-goog
>
On Wed, Jul 24, 2019 at 2:40 PM Song Liu <[email protected]> wrote:
>
> On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <[email protected]> wrote:
> >
> > This introduces a new command to retrieve multiple entries
> > from a bpf map, wrapping the existing bpf methods
> > map_get_next_key and map_lookup_elem
> >
> > To start dumping the map from the beginning you must specify NULL as
> > the prev_key.
> >
> > The new API returns 0 when it successfully copied all the elements
> > requested, or when it copied fewer because there were no more elements
> > to retrieve (i.e., err == -ENOENT). In the last scenario, err will be masked to 0.
> >
> > On a successful call, buf and buf_len will contain correct data and, in
> > case prev_key was provided (not on the first walk, since prev_key is
> > NULL then), the last key copied will be written back into prev_key,
> > which simplifies the next call.
> >
> > Only when it can't find a single element will it return -ENOENT,
> > meaning that the map has been entirely walked. When an error is
> > returned, buf, buf_len and prev_key shouldn't be read or used.
> >
> > Because maps can be accessed from userspace and kernel code, this
> > function can hit a scenario where the next_key was found but, by the
> > time we try to retrieve the value, the element is not there anymore; in
> > this case the function continues and tries to get a new next_key value,
> > skipping the deleted key. If at some point the function finds itself
> > trapped in a loop, it will return -EINTR.
> >
> > The function will try to fit as much as possible in the buf provided and
> > will return -EINVAL if buf_len is smaller than elem_size.
> >
> > QUEUE and STACK maps are not supported.
> >
> > Note that map_dump doesn't guarantee that reading the entire table is
> > consistent since this function is always racing with kernel and user code
> > but the same behaviour is found when the entire table is walked using
> > the current interfaces: map_get_next_key + map_lookup_elem.
> > It is also important to note that with a locked map, the lock is
> > grabbed for one entry at a time, meaning that the returned buf might or
> > might not be consistent.
> >
> > Suggested-by: Stanislav Fomichev <[email protected]>
> > Signed-off-by: Brian Vazquez <[email protected]>
> > ---
> > include/uapi/linux/bpf.h | 9 +++
> > kernel/bpf/syscall.c | 117 +++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 126 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index fa1c753dcdbc7..66dab5385170d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -106,6 +106,7 @@ enum bpf_cmd {
> > BPF_TASK_FD_QUERY,
> > BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> > BPF_MAP_FREEZE,
> > + BPF_MAP_DUMP,
> > };
> >
> > enum bpf_map_type {
> > @@ -388,6 +389,14 @@ union bpf_attr {
> > __u64 flags;
> > };
> >
> > + struct { /* struct used by BPF_MAP_DUMP command */
> > + __aligned_u64 prev_key;
> > + __aligned_u64 buf;
> > + __aligned_u64 buf_len; /* input/output: len of buf */
> > + __u64 flags;
>
> Please add explanation of flags here.
got it!
> Also, we need to update the
> comments of BPF_F_LOCK for BPF_MAP_DUMP.
What do you mean? I didn't get this part.
>
> > + __u32 map_fd;
> > + } dump;
> > +
> > struct { /* anonymous struct used by BPF_PROG_LOAD command */
> > __u32 prog_type; /* one of enum bpf_prog_type */
> > __u32 insn_cnt;
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 86cdc2f7bb56e..0c35505aa219f 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
> > return err;
> > }
> >
> > +/* last field in 'union bpf_attr' used by this command */
> > +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> > +
> > +static int map_dump(union bpf_attr *attr)
> > +{
> > + void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> > + void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> > + u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> > + int ufd = attr->dump.map_fd;
> > + struct bpf_map *map;
> > + void *buf, *prev_key, *key, *value;
> > + u32 value_size, elem_size, buf_len, cp_len;
> > + struct fd f;
> > + int err;
> > + bool first_key = false;
> > +
> > + if (CHECK_ATTR(BPF_MAP_DUMP))
> > + return -EINVAL;
> > +
> > + if (attr->dump.flags & ~BPF_F_LOCK)
> > + return -EINVAL;
> > +
> > + f = fdget(ufd);
> > + map = __bpf_map_get(f);
> > + if (IS_ERR(map))
> > + return PTR_ERR(map);
> > + if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> > + err = -EPERM;
> > + goto err_put;
> > + }
> > +
> > + if ((attr->dump.flags & BPF_F_LOCK) &&
> > + !map_value_has_spin_lock(map)) {
> > + err = -EINVAL;
> > + goto err_put;
> > + }
>
> We can share these lines with map_lookup_elem(). Maybe
> add another helper function?
Which lines are you referring to? The dump.flags checks? It makes
sense; that way, when a new flag is added, you only need to modify
them in one spot.
> > +
> > + if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> > + map->map_type == BPF_MAP_TYPE_STACK) {
> > + err = -ENOTSUPP;
> > + goto err_put;
> > + }
> > +
> > + value_size = bpf_map_value_size(map);
> > +
> > + err = get_user(buf_len, ubuf_len);
> > + if (err)
> > + goto err_put;
> > +
> > + elem_size = map->key_size + value_size;
> > + if (buf_len < elem_size) {
> > + err = -EINVAL;
> > + goto err_put;
> > + }
> > +
> > + if (ukey) {
> > + prev_key = __bpf_copy_key(ukey, map->key_size);
> > + if (IS_ERR(prev_key)) {
> > + err = PTR_ERR(prev_key);
> > + goto err_put;
> > + }
> > + } else {
> > + prev_key = NULL;
> > + first_key = true;
> > + }
> > +
> > + err = -ENOMEM;
> > + buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> > + if (!buf)
> > + goto err_put;
> > +
> > + key = buf;
> > + value = key + map->key_size;
> > + for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> > + if (signal_pending(current)) {
> > + err = -EINTR;
> > + break;
> > + }
> > +
> > + rcu_read_lock();
> > + err = map->ops->map_get_next_key(map, prev_key, key);
>
> If prev_key is deleted before map_get_next_key(), we get the first key
> again. This is pretty weird.
Yes, I know. But note that the current scenario happens even for the
old interface (imagine you are walking a map from userspace and you
call get_next_key after the prev_key was removed; you will start again
from the beginning without noticing it).
I tried to send a patch in the past but I was missing some context:
before NULL was used to get the very first_key, the interface relied on
a random (non-existent) key to retrieve the first_key in the map, and
I was told that we still have to support that scenario.
>
> > + rcu_read_unlock();
> > +
> > + if (err)
> > + break;
> > +
> > + err = bpf_map_copy_value(map, key, value, attr->dump.flags);
> > +
> > + if (err == -ENOENT)
> > + continue;
> > + if (err)
> > + goto free_buf;
> > +
> > + if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
> > + err = -EFAULT;
> > + goto free_buf;
> > + }
> > +
> > + prev_key = key;
> > + cp_len += elem_size;
> > + }
> > +
> > + if (err == -ENOENT && cp_len)
> > + err = 0;
> > + if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
> > + (!first_key && copy_to_user(ukey, key, map->key_size))))
> > + err = -EFAULT;
> > +free_buf:
> > + kfree(buf);
> > +err_put:
> > + fdput(f);
> > + return err;
> > +}
> > +
> > #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
> >
> > static int map_lookup_and_delete_elem(union bpf_attr *attr)
> > @@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> > case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
> > err = map_lookup_and_delete_elem(&attr);
> > break;
> > + case BPF_MAP_DUMP:
> > + err = map_dump(&attr);
> > + break;
> > default:
> > err = -EINVAL;
> > break;
> > --
> > 2.22.0.657.g960e92d24f-goog
> >
Thanks for reviewing it!!
On Wed, Jul 24, 2019 at 12:20 PM Song Liu <[email protected]> wrote:
>
> On Wed, Jul 24, 2019 at 10:09 AM Brian Vazquez <[email protected]> wrote:
> >
> > This introduces a new command to retrieve multiple entries
> > from a bpf map.
> >
> > This new command can be executed from the existing BPF syscall as
> > follows:
> >
> > err = bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
> > using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf,
> > attr->dump.buf_len
> > returns zero or negative error, and populates buf and buf_len on
> > success
> >
> > This implementation wraps the existing bpf methods
> > map_get_next_key and map_lookup_elem
> >
> > Note that this implementation can be extended later to do dump and
> > delete by extending map_lookup_and_delete_elem (currently it only works
> > for bpf queue/stack maps), and either using a new flag in map_dump or a
> > new command, map_dump_and_delete.
> >
> > Results show that even with a 1-elem_size buffer, it runs ~40% faster
>
> Why is the new command 40% faster with 1-elem_size buffer?
The test is using a really simple map structure: uint64_t key and val,
which makes the lookup_elem logic faster since it doesn't spend too
much time copying the value. My conclusion with the 40% was that this
new implementation only needs 1 syscall per element, compared to the 2
syscalls needed with the previous implementation, so in this particular
case the number of ops we are doing is almost halved.
I did one experiment increasing the value_size (448*64B) and it was
only 14% faster with a 1-elem_size buffer.
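(To make the almost-halved claim concrete, my arithmetic from the
numbers above: walking the 100K-entry map entry-by-entry costs one
get_next_key plus one lookup_elem per element, ~200K syscalls in total,
while BPF_MAP_DUMP with a 1-element buffer costs ~100K.)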
> > than the current implementation; improvements of ~85% are reported when
> > the buffer size is increased, although once the buffer size is around
> > 5% of the total number of entries there is no huge difference in
> > increasing it further.
> >
> > Tested:
> > Tried different buffer sizes to handle the cases where the bulk is
> > bigger than, or the elements to retrieve are fewer than, the existing
> > ones; all runs read a map of 100K entries. Below are the results (in
> > ns) from the different runs:
> >
> > buf_len_1:      69038725  entry-by-entry: 112384424  improvement: 38.569134%
> > buf_len_2:      40897447  entry-by-entry: 111030546  improvement: 63.165590%
> > buf_len_230:    13652714  entry-by-entry: 111694058  improvement: 87.776687%
> > buf_len_5000:   13576271  entry-by-entry: 111101169  improvement: 87.780263%
> > buf_len_73000:  14694343  entry-by-entry: 111740162  improvement: 86.849542%
> > buf_len_100000: 13745969  entry-by-entry: 114151991  improvement: 87.958187%
> > buf_len_234567: 14329834  entry-by-entry: 114427589  improvement: 87.476941%
>
> It took me a while to figure out the meaning of 87.476941. It is probably
> a good idea to say 87.5% instead.
right, will change it in next version.
> > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > again. This is pretty weird.
> >
> > Yes, I know. But note that the current scenario happens even for the
> > old interface (imagine you are walking a map from userspace and you
> > call get_next_key after the prev_key was removed; you will start again
> > from the beginning without noticing it).
> > I tried to send a patch in the past but I was missing some context:
> > before NULL was used to get the very first_key, the interface relied on
> > a random (non-existent) key to retrieve the first_key in the map, and
> > I was told that we still have to support that scenario.
>
> BPF_MAP_DUMP is slightly different, as you may return the first key
> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> don't have to support legacy scenarios.
>
> Since BPF_MAP_DUMP keeps a list of elements, it is possible to try
> to look up previous keys. Would something in this direction work?
I've been thinking about it and I think first we need a way to detect
that, because the key was not present, we got the first_key instead:
- One solution I had in mind was to explicitly ask for the first key
with map_get_next_key(map, NULL, first_key) and, while walking the map,
check that map_get_next_key(map, prev_key, key) doesn't return the
same key. This could be done using memcmp (see the sketch below).
- Discussing with Stan, he mentioned that another option is to support
a flag in map_get_next_key to let it know that we want an error
instead of the first_key.
After detecting the problem, we also need to define what we want to do;
here are some options:
a) Return the error to the caller
b) Try with previous keys, if any (which would be limited to the keys
that we have traversed so far in this dump call)
c) Continue with the next entries in the map. Array is easy: just get
the next valid key (starting at i+1), but hmap might be difficult,
since starting at the next bucket could potentially skip some keys that
were concurrently added to the same bucket where the key used to be,
and starting at the same bucket could lead us to return repeated
elements.
Or maybe we could support those 3 cases via flags and let the caller
decide which one to use?
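For the first option, the detection could look roughly like this (an
untested sketch against map_dump()'s loop; first_key would be fetched
once with map_get_next_key(map, NULL, first_key) before the walk):

	rcu_read_lock();
	err = map->ops->map_get_next_key(map, prev_key, key);
	rcu_read_unlock();

	/* A non-NULL prev_key is never legitimately followed by the
	 * map's first key, so matching it means prev_key was deleted
	 * and the walk restarted. What to do next is options a/b/c
	 * above; -EBUSY here is just a placeholder for option a).
	 */
	if (!err && prev_key && !memcmp(key, first_key, map->key_size))
		err = -EBUSY;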
Let me know your thoughts.
Thanks,
Brian
On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > again. This is pretty weird.
> > >
> > > Yes, I know. But note that the current scenario happens even for the
> > > old interface (imagine you are walking a map from userspace and you
> > > call get_next_key after the prev_key was removed; you will start again
> > > from the beginning without noticing it).
> > > I tried to send a patch in the past but I was missing some context:
> > > before NULL was used to get the very first_key, the interface relied
> > > on a random (non-existent) key to retrieve the first_key in the map,
> > > and I was told that we still have to support that scenario.
> >
> > BPF_MAP_DUMP is slightly different, as you may return the first key
> > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > don't have to support legacy scenarios.
> >
> > Since BPF_MAP_DUMP keeps a list of elements, it is possible to try
> > to look up previous keys. Would something in this direction work?
>
> I've been thinking about it and I think first we need a way to detect
> that, because the key was not present, we got the first_key instead:
>
> - One solution I had in mind was to explicitly ask for the first key
> with map_get_next_key(map, NULL, first_key) and, while walking the map,
> check that map_get_next_key(map, prev_key, key) doesn't return the
> same key. This could be done using memcmp.
> - Discussing with Stan, he mentioned that another option is to support
> a flag in map_get_next_key to let it know that we want an error
> instead of the first_key.
>
> After detecting the problem, we also need to define what we want to do;
> here are some options:
>
> a) Return the error to the caller
> b) Try with previous keys, if any (which would be limited to the keys
> that we have traversed so far in this dump call)
> c) Continue with the next entries in the map. Array is easy: just get
> the next valid key (starting at i+1), but hmap might be difficult,
> since starting at the next bucket could potentially skip some keys that
> were concurrently added to the same bucket where the key used to be,
> and starting at the same bucket could lead us to return repeated
> elements.
>
> Or maybe we could support those 3 cases via flags and let the caller
> decide which one to use?
this type of indecision is the reason why I wasn't excited about
batch dumping in the first place and gave a 'soft yes' when Stan
mentioned it during the lsf/mm/bpf uconf.
We probably shouldn't do it.
It feels like this map_dump makes the api more complex and doesn't
really give much benefit to the user other than making large map dumps
faster. I think we gotta solve this problem differently.
On Thu, Jul 25, 2019 at 7:54 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > > again. This is pretty weird.
> > > >
> > > > Yes, I know. But note that the current scenario happens even for the
> > > > old interface (imagine you are walking a map from userspace and you
> > > > call get_next_key after the prev_key was removed; you will start
> > > > again from the beginning without noticing it).
> > > > I tried to send a patch in the past but I was missing some context:
> > > > before NULL was used to get the very first_key, the interface relied
> > > > on a random (non-existent) key to retrieve the first_key in the map,
> > > > and I was told that we still have to support that scenario.
> > >
> > > BPF_MAP_DUMP is slightly different, as you may return the first key
> > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > > don't have to support legacy scenarios.
> > >
> > > Since BPF_MAP_DUMP keeps a list of elements, it is possible to try
> > > to look up previous keys. Would something in this direction work?
> >
> > I've been thinking about it and I think first we need a way to detect
> > that, because the key was not present, we got the first_key instead:
> >
> > - One solution I had in mind was to explicitly ask for the first key
> > with map_get_next_key(map, NULL, first_key) and, while walking the map,
> > check that map_get_next_key(map, prev_key, key) doesn't return the
> > same key. This could be done using memcmp.
> > - Discussing with Stan, he mentioned that another option is to support
> > a flag in map_get_next_key to let it know that we want an error
> > instead of the first_key.
> >
> > After detecting the problem, we also need to define what we want to do;
> > here are some options:
> >
> > a) Return the error to the caller
> > b) Try with previous keys, if any (which would be limited to the keys
> > that we have traversed so far in this dump call)
> > c) Continue with the next entries in the map. Array is easy: just get
> > the next valid key (starting at i+1), but hmap might be difficult,
> > since starting at the next bucket could potentially skip some keys
> > that were concurrently added to the same bucket where the key used to
> > be, and starting at the same bucket could lead us to return repeated
> > elements.
> >
> > Or maybe we could support those 3 cases via flags and let the caller
> > decide which one to use?
>
> this type of indecision is the reason why I wasn't excited about
> batch dumping in the first place and gave a 'soft yes' when Stan
> mentioned it during the lsf/mm/bpf uconf.
> We probably shouldn't do it.
> It feels like this map_dump makes the api more complex and doesn't
> really give much benefit to the user other than making large map dumps
> faster. I think we gotta solve this problem differently.
Multiple variants with flags indeed make the API complex. I think the
kernel should expose only the simplest, most obvious behavior that
allows the application to recover. In this case, that sounds like
option (a) and restart.
In practice, the common use case is to allocate enough user memory to
read an entire table in one go, in which case the entire issue is
moot.
The cycle savings of dump are significant for large tables. I'm not
sure how we could achieve that differently and even more simply. We
originally looked at shared memory, but that is obviously much more complex.
On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > > again. This is pretty weird.
> > > >
> > > > Yes, I know. But note that the current scenario happens even for the
> > > > old interface (imagine you are walking a map from userspace and you
> > > > call get_next_key after the prev_key was removed; you will start
> > > > again from the beginning without noticing it).
> > > > I tried to send a patch in the past but I was missing some context:
> > > > before NULL was used to get the very first_key, the interface relied
> > > > on a random (non-existent) key to retrieve the first_key in the map,
> > > > and I was told that we still have to support that scenario.
> > >
> > > BPF_MAP_DUMP is slightly different, as you may return the first key
> > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > > don't have to support legacy scenarios.
> > >
> > > Since BPF_MAP_DUMP keeps a list of elements, it is possible to try
> > > to look up previous keys. Would something in this direction work?
> >
> > I've been thinking about it and I think first we need a way to detect
> > that, because the key was not present, we got the first_key instead:
> >
> > - One solution I had in mind was to explicitly ask for the first key
> > with map_get_next_key(map, NULL, first_key) and, while walking the map,
> > check that map_get_next_key(map, prev_key, key) doesn't return the
> > same key. This could be done using memcmp.
> > - Discussing with Stan, he mentioned that another option is to support
> > a flag in map_get_next_key to let it know that we want an error
> > instead of the first_key.
> >
> > After detecting the problem, we also need to define what we want to do;
> > here are some options:
> >
> > a) Return the error to the caller
> > b) Try with previous keys, if any (which would be limited to the keys
> > that we have traversed so far in this dump call)
> > c) Continue with the next entries in the map. Array is easy: just get
> > the next valid key (starting at i+1), but hmap might be difficult,
> > since starting at the next bucket could potentially skip some keys
> > that were concurrently added to the same bucket where the key used to
> > be, and starting at the same bucket could lead us to return repeated
> > elements.
> >
> > Or maybe we could support those 3 cases via flags and let the caller
> > decide which one to use?
>
> this type of indecision is the reason why I wasn't excited about
> batch dumping in the first place and gave a 'soft yes' when Stan
> mentioned it during the lsf/mm/bpf uconf.
> We probably shouldn't do it.
> It feels like this map_dump makes the api more complex and doesn't
> really give much benefit to the user other than making large map dumps
> faster. I think we gotta solve this problem differently.
Some users are working around the dumping problems with the existing
api by creating a bpf_map_get_next_key_and_delete userspace function
(see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/),
which in my opinion is actually a good idea. The only problem is that
calling bpf_map_get_next_key(fd, key, next_key) and then
bpf_map_delete_elem(fd, key) from userspace races with kernel code and
might lose some information when deleting.
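For reference, that userspace workaround is roughly the following (an
untested sketch using the existing libbpf wrappers; the function name
comes from the blog post):

	/* Advance the walk, then delete the key we just moved past.
	 * Inherently racy against concurrent kernel-side updates.
	 */
	static int bpf_map_get_next_key_and_delete(int fd, const void *key,
						   void *next_key)
	{
		int err = bpf_map_get_next_key(fd, key, next_key);

		if (!err && key)
			err = bpf_map_delete_elem(fd, key);
		return err;
	}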
We could then do map_dump_and_delete using that idea, but in the
kernel, where we could better handle the race condition. In that
scenario, even if we retrieve the same key, it will contain different
info (the delta between the old and new values). Would that work?