LinuxLists.cc - [RFC] Add BPF_SYNCHRONIZE bpf(2) command

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> RCU data structure operations with respect to active programs. For
> example, userspace can update a map->map entry to point to a new map,
> use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> complete, and then drain the old map without fear that BPF programs
> may still be updating it.
>
> Signed-off-by: Daniel Colascione <[email protected]>
> ---
> include/uapi/linux/bpf.h | 1 +
> kernel/bpf/syscall.c | 14 ++++++++++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b7db3261c62d..4365c50e8055 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -98,6 +98,7 @@ enum bpf_cmd {
> BPF_BTF_LOAD,
> BPF_BTF_GET_FD_BY_ID,
> BPF_TASK_FD_QUERY,
> + BPF_SYNCHRONIZE,
> };
>
> enum bpf_map_type {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d10ecd78105f..60ec7811846e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> return -EPERM;
>
> + if (cmd == BPF_SYNCHRONIZE) {
> + if (uattr != NULL || size != 0)
> + return -EINVAL;
> + err = security_bpf(cmd, NULL, 0);
> + if (err < 0)
> + return err;
> + /* BPF programs are run with preempt disabled, so
> + * synchronize_sched is sufficient even with
> + * RCU_PREEMPT.
> + */
> + synchronize_sched();
> + return 0;

I don't think it's necessary. sys_membarrier() can do this already
and some folks use it exactly for this use case.

2018-07-07 03:23:48

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Fri, Jul 6, 2018 at 7:54 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
>> BPF_SYNCHRONIZE waits for any BPF programs active at the time of
>> BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
>> RCU data structure operations with respect to active programs. For
>> example, userspace can update a map->map entry to point to a new map,
>> use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>> complete, and then drain the old map without fear that BPF programs
>> may still be updating it.
>>
>> Signed-off-by: Daniel Colascione <[email protected]>
>> ---
>> include/uapi/linux/bpf.h | 1 +
>> kernel/bpf/syscall.c | 14 ++++++++++++++
>> 2 files changed, 15 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index b7db3261c62d..4365c50e8055 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -98,6 +98,7 @@ enum bpf_cmd {
>> BPF_BTF_LOAD,
>> BPF_BTF_GET_FD_BY_ID,
>> BPF_TASK_FD_QUERY,
>> + BPF_SYNCHRONIZE,
>> };
>>
>> enum bpf_map_type {
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index d10ecd78105f..60ec7811846e 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>> if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>> return -EPERM;
>>
>> + if (cmd == BPF_SYNCHRONIZE) {
>> + if (uattr != NULL || size != 0)
>> + return -EINVAL;
>> + err = security_bpf(cmd, NULL, 0);
>> + if (err < 0)
>> + return err;
>> + /* BPF programs are run with preempt disabled, so
>> + * synchronize_sched is sufficient even with
>> + * RCU_PREEMPT.
>> + */
>> + synchronize_sched();
>> + return 0;
>
> I don't think it's necessary. sys_membarrier() can do this already
> and some folks use it exactly for this use case.

True --- implementation-wise, today. The point of the patch is to sort
out contractual guarantees. membarrier isn't documented to have the
BPF_SYNCHRONIZE effect right now, although it does. If we want to
extend membarrier's contract, great, but that might make it more
expensive than necessary for callers who, well, just want a memory
barrier. It's worthwhile to have a dedicated BPF command for this
task, especially considering the simplicity of the implementation.

2018-07-07 20:36:13

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > RCU data structure operations with respect to active programs. For
> > example, userspace can update a map->map entry to point to a new map,
> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > complete, and then drain the old map without fear that BPF programs
> > may still be updating it.
> >
> > Signed-off-by: Daniel Colascione <[email protected]>
> > ---
> > include/uapi/linux/bpf.h | 1 +
> > kernel/bpf/syscall.c | 14 ++++++++++++++
> > 2 files changed, 15 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index b7db3261c62d..4365c50e8055 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > BPF_BTF_LOAD,
> > BPF_BTF_GET_FD_BY_ID,
> > BPF_TASK_FD_QUERY,
> > + BPF_SYNCHRONIZE,
> > };
> >
> > enum bpf_map_type {
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index d10ecd78105f..60ec7811846e 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > return -EPERM;
> >
> > + if (cmd == BPF_SYNCHRONIZE) {
> > + if (uattr != NULL || size != 0)
> > + return -EINVAL;
> > + err = security_bpf(cmd, NULL, 0);
> > + if (err < 0)
> > + return err;
> > + /* BPF programs are run with preempt disabled, so
> > + * synchronize_sched is sufficient even with
> > + * RCU_PREEMPT.
> > + */
> > + synchronize_sched();
> > + return 0;
>
> I don't think it's necessary. sys_membarrier() can do this already
> and some folks use it exactly for this use case.

Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
though. No where does the manpage say membarrier should be implemented this
way so what happens if the implementation changes?

Further, membarrier manpage says that a memory barrier should be matched with
a matching barrier. In this use case there is no matching barrier, so it
makes it weirder.

Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
fragile to depend on it for this?

case MEMBARRIER_CMD_GLOBAL:
/* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
if (tick_nohz_full_enabled())
return -EINVAL;
if (num_online_cpus() > 1)
synchronize_sched();
return 0;

Adding Mathieu as well who I believe is author/maintainer of membarrier.

thanks!

- Joel

2018-07-08 20:56:23

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:

> On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
>> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
>> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
>> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
>> > RCU data structure operations with respect to active programs. For
>> > example, userspace can update a map->map entry to point to a new map,
>> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>> > complete, and then drain the old map without fear that BPF programs
>> > may still be updating it.
>> >
>> > Signed-off-by: Daniel Colascione <[email protected]>
>> > ---
>> > include/uapi/linux/bpf.h | 1 +
>> > kernel/bpf/syscall.c | 14 ++++++++++++++
>> > 2 files changed, 15 insertions(+)
>> >
>> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> > index b7db3261c62d..4365c50e8055 100644
>> > --- a/include/uapi/linux/bpf.h
>> > +++ b/include/uapi/linux/bpf.h
>> > @@ -98,6 +98,7 @@ enum bpf_cmd {
>> > BPF_BTF_LOAD,
>> > BPF_BTF_GET_FD_BY_ID,
>> > BPF_TASK_FD_QUERY,
>> > + BPF_SYNCHRONIZE,
>> > };
>> >
>> > enum bpf_map_type {
>> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> > index d10ecd78105f..60ec7811846e 100644
>> > --- a/kernel/bpf/syscall.c
>> > +++ b/kernel/bpf/syscall.c
>> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
>> > uattr, unsigned int, siz
>> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>> > return -EPERM;
>> >
>> > + if (cmd == BPF_SYNCHRONIZE) {
>> > + if (uattr != NULL || size != 0)
>> > + return -EINVAL;
>> > + err = security_bpf(cmd, NULL, 0);
>> > + if (err < 0)
>> > + return err;
>> > + /* BPF programs are run with preempt disabled, so
>> > + * synchronize_sched is sufficient even with
>> > + * RCU_PREEMPT.
>> > + */
>> > + synchronize_sched();
>> > + return 0;
>>
>> I don't think it's necessary. sys_membarrier() can do this already
>> and some folks use it exactly for this use case.
>
> Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> though. No where does the manpage say membarrier should be implemented this
> way so what happens if the implementation changes?
>
> Further, membarrier manpage says that a memory barrier should be matched with
> a matching barrier. In this use case there is no matching barrier, so it
> makes it weirder.
>
> Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> fragile to depend on it for this?
>
> case MEMBARRIER_CMD_GLOBAL:
> /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> if (tick_nohz_full_enabled())
> return -EINVAL;
> if (num_online_cpus() > 1)
> synchronize_sched();
> return 0;
>
>
> Adding Mathieu as well who I believe is author/maintainer of membarrier.

See commit 907565337
"Fix: Disable sys_membarrier when nohz_full is enabled"

"Userspace applications should be allowed to expect the membarrier system
call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
nohz_full CPUs, but synchronize_sched() does not take those into
account."

So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
only care about kernel preempt off critical sections.

Clearly bpf code does not run in user-space, so it would "work".

But the guarantees provided by membarrier are not to synchronize against
preempt off per se. It's just that the current implementation happens to
do that. The point of membarrier is to turn user-space memory barriers
into compiler barriers.

If what you need is to wait for a RCU grace period for whatever RCU flavor
ebpf is using, I would against using membarrier for this. I would rather
recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
implementation details to user-space, *and* you can eventually change you
RCU implementation for e.g. SRCU in the future if needed.

Thanks,

Mathieu

>
> thanks!
>
> - Joel

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-07-09 21:10:48

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
>
> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> > RCU data structure operations with respect to active programs. For
> >> > example, userspace can update a map->map entry to point to a new map,
> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> > complete, and then drain the old map without fear that BPF programs
> >> > may still be updating it.
> >> >
> >> > Signed-off-by: Daniel Colascione <[email protected]>
> >> > ---
> >> > include/uapi/linux/bpf.h | 1 +
> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> >> > 2 files changed, 15 insertions(+)
> >> >
> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> > index b7db3261c62d..4365c50e8055 100644
> >> > --- a/include/uapi/linux/bpf.h
> >> > +++ b/include/uapi/linux/bpf.h
> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> > BPF_BTF_LOAD,
> >> > BPF_BTF_GET_FD_BY_ID,
> >> > BPF_TASK_FD_QUERY,
> >> > + BPF_SYNCHRONIZE,
> >> > };
> >> >
> >> > enum bpf_map_type {
> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> > index d10ecd78105f..60ec7811846e 100644
> >> > --- a/kernel/bpf/syscall.c
> >> > +++ b/kernel/bpf/syscall.c
> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> > uattr, unsigned int, siz
> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> > return -EPERM;
> >> >
> >> > + if (cmd == BPF_SYNCHRONIZE) {
> >> > + if (uattr != NULL || size != 0)
> >> > + return -EINVAL;
> >> > + err = security_bpf(cmd, NULL, 0);
> >> > + if (err < 0)
> >> > + return err;
> >> > + /* BPF programs are run with preempt disabled, so
> >> > + * synchronize_sched is sufficient even with
> >> > + * RCU_PREEMPT.
> >> > + */
> >> > + synchronize_sched();
> >> > + return 0;
> >>
> >> I don't think it's necessary. sys_membarrier() can do this already
> >> and some folks use it exactly for this use case.
> >
> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > though. No where does the manpage say membarrier should be implemented this
> > way so what happens if the implementation changes?
> >
> > Further, membarrier manpage says that a memory barrier should be matched with
> > a matching barrier. In this use case there is no matching barrier, so it
> > makes it weirder.
> >
> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> > fragile to depend on it for this?
> >
> > case MEMBARRIER_CMD_GLOBAL:
> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > if (tick_nohz_full_enabled())
> > return -EINVAL;
> > if (num_online_cpus() > 1)
> > synchronize_sched();
> > return 0;
> >
> >
> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
>
> See commit 907565337
> "Fix: Disable sys_membarrier when nohz_full is enabled"
>
> "Userspace applications should be allowed to expect the membarrier system
> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> nohz_full CPUs, but synchronize_sched() does not take those into
> account."
>
> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> only care about kernel preempt off critical sections.
>
> Clearly bpf code does not run in user-space, so it would "work".
>
> But the guarantees provided by membarrier are not to synchronize against
> preempt off per se. It's just that the current implementation happens to
> do that. The point of membarrier is to turn user-space memory barriers
> into compiler barriers.
>
> If what you need is to wait for a RCU grace period for whatever RCU flavor
> ebpf is using, I would against using membarrier for this. I would rather
> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> implementation details to user-space, *and* you can eventually change you
> RCU implementation for e.g. SRCU in the future if needed.

The point about future changes to underlying bpf mechanisms is valid.
There is work already on the way to reduce the scope of preempt_off+rcu_lock
that currently lasts the whole prog. We will have new prog types that won't
have such wrappers and will do rcu_lock/unlock and preempt on/off only
when necessary.
So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
also won't make sense for the same reason.
What we can do it instead is to define synchronization barrier for
programs accessing maps. May be call it something like:
BPF_SYNC_MAP_ACCESS ?
uapi/bpf.h would need to have extensive comment what this barrier is doing.
Implementation should probably call synchronize_rcu() and not play games
with synchronize_sched(), since that's going too much into implementation.
Also should such sys_bpf command be root only?
I'm not sure whether dos attack can be made by spamming synchronize_rcu()
and synchronize_sched() for that matter.

2018-07-09 21:37:03

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

----- On Jul 9, 2018, at 5:09 PM, Alexei Starovoitov [email protected] wrote:

> On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
>>
>> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
>> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
>> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
>> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
>> >> > RCU data structure operations with respect to active programs. For
>> >> > example, userspace can update a map->map entry to point to a new map,
>> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>> >> > complete, and then drain the old map without fear that BPF programs
>> >> > may still be updating it.
>> >> >
>> >> > Signed-off-by: Daniel Colascione <[email protected]>
>> >> > ---
>> >> > include/uapi/linux/bpf.h | 1 +
>> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
>> >> > 2 files changed, 15 insertions(+)
>> >> >
>> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> >> > index b7db3261c62d..4365c50e8055 100644
>> >> > --- a/include/uapi/linux/bpf.h
>> >> > +++ b/include/uapi/linux/bpf.h
>> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
>> >> > BPF_BTF_LOAD,
>> >> > BPF_BTF_GET_FD_BY_ID,
>> >> > BPF_TASK_FD_QUERY,
>> >> > + BPF_SYNCHRONIZE,
>> >> > };
>> >> >
>> >> > enum bpf_map_type {
>> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> >> > index d10ecd78105f..60ec7811846e 100644
>> >> > --- a/kernel/bpf/syscall.c
>> >> > +++ b/kernel/bpf/syscall.c
>> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
>> >> > uattr, unsigned int, siz
>> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>> >> > return -EPERM;
>> >> >
>> >> > + if (cmd == BPF_SYNCHRONIZE) {
>> >> > + if (uattr != NULL || size != 0)
>> >> > + return -EINVAL;
>> >> > + err = security_bpf(cmd, NULL, 0);
>> >> > + if (err < 0)
>> >> > + return err;
>> >> > + /* BPF programs are run with preempt disabled, so
>> >> > + * synchronize_sched is sufficient even with
>> >> > + * RCU_PREEMPT.
>> >> > + */
>> >> > + synchronize_sched();
>> >> > + return 0;
>> >>
>> >> I don't think it's necessary. sys_membarrier() can do this already
>> >> and some folks use it exactly for this use case.
>> >
>> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
>> > though. No where does the manpage say membarrier should be implemented this
>> > way so what happens if the implementation changes?
>> >
>> > Further, membarrier manpage says that a memory barrier should be matched with
>> > a matching barrier. In this use case there is no matching barrier, so it
>> > makes it weirder.
>> >
>> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
>> > fragile to depend on it for this?
>> >
>> > case MEMBARRIER_CMD_GLOBAL:
>> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
>> > if (tick_nohz_full_enabled())
>> > return -EINVAL;
>> > if (num_online_cpus() > 1)
>> > synchronize_sched();
>> > return 0;
>> >
>> >
>> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
>>
>> See commit 907565337
>> "Fix: Disable sys_membarrier when nohz_full is enabled"
>>
>> "Userspace applications should be allowed to expect the membarrier system
>> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
>> nohz_full CPUs, but synchronize_sched() does not take those into
>> account."
>>
>> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
>> only care about kernel preempt off critical sections.
>>
>> Clearly bpf code does not run in user-space, so it would "work".
>>
>> But the guarantees provided by membarrier are not to synchronize against
>> preempt off per se. It's just that the current implementation happens to
>> do that. The point of membarrier is to turn user-space memory barriers
>> into compiler barriers.
>>
>> If what you need is to wait for a RCU grace period for whatever RCU flavor
>> ebpf is using, I would against using membarrier for this. I would rather
>> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
>> implementation details to user-space, *and* you can eventually change you
>> RCU implementation for e.g. SRCU in the future if needed.
>
> The point about future changes to underlying bpf mechanisms is valid.
> There is work already on the way to reduce the scope of preempt_off+rcu_lock
> that currently lasts the whole prog. We will have new prog types that won't
> have such wrappers and will do rcu_lock/unlock and preempt on/off only
> when necessary.
> So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> also won't make sense for the same reason.
> What we can do it instead is to define synchronization barrier for
> programs accessing maps. May be call it something like:
> BPF_SYNC_MAP_ACCESS ?
> uapi/bpf.h would need to have extensive comment what this barrier is doing.
> Implementation should probably call synchronize_rcu() and not play games
> with synchronize_sched(), since that's going too much into implementation.
> Also should such sys_bpf command be root only?
> I'm not sure whether dos attack can be made by spamming synchronize_rcu()
> and synchronize_sched() for that matter.

Adding Paul E. McKenney in CC. He may want to share his thoughts on the matter.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-07-09 21:37:45

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
>>
>> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
>> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
>> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
>> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
>> >> > RCU data structure operations with respect to active programs. For
>> >> > example, userspace can update a map->map entry to point to a new map,
>> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>> >> > complete, and then drain the old map without fear that BPF programs
>> >> > may still be updating it.
>> >> >
>> >> > Signed-off-by: Daniel Colascione <[email protected]>
>> >> > ---
>> >> > include/uapi/linux/bpf.h | 1 +
>> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
>> >> > 2 files changed, 15 insertions(+)
>> >> >
>> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> >> > index b7db3261c62d..4365c50e8055 100644
>> >> > --- a/include/uapi/linux/bpf.h
>> >> > +++ b/include/uapi/linux/bpf.h
>> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
>> >> > BPF_BTF_LOAD,
>> >> > BPF_BTF_GET_FD_BY_ID,
>> >> > BPF_TASK_FD_QUERY,
>> >> > + BPF_SYNCHRONIZE,
>> >> > };
>> >> >
>> >> > enum bpf_map_type {
>> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> >> > index d10ecd78105f..60ec7811846e 100644
>> >> > --- a/kernel/bpf/syscall.c
>> >> > +++ b/kernel/bpf/syscall.c
>> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
>> >> > uattr, unsigned int, siz
>> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>> >> > return -EPERM;
>> >> >
>> >> > + if (cmd == BPF_SYNCHRONIZE) {
>> >> > + if (uattr != NULL || size != 0)
>> >> > + return -EINVAL;
>> >> > + err = security_bpf(cmd, NULL, 0);
>> >> > + if (err < 0)
>> >> > + return err;
>> >> > + /* BPF programs are run with preempt disabled, so
>> >> > + * synchronize_sched is sufficient even with
>> >> > + * RCU_PREEMPT.
>> >> > + */
>> >> > + synchronize_sched();
>> >> > + return 0;
>> >>
>> >> I don't think it's necessary. sys_membarrier() can do this already
>> >> and some folks use it exactly for this use case.
>> >
>> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
>> > though. No where does the manpage say membarrier should be implemented this
>> > way so what happens if the implementation changes?
>> >
>> > Further, membarrier manpage says that a memory barrier should be matched with
>> > a matching barrier. In this use case there is no matching barrier, so it
>> > makes it weirder.
>> >
>> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
>> > fragile to depend on it for this?
>> >
>> > case MEMBARRIER_CMD_GLOBAL:
>> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
>> > if (tick_nohz_full_enabled())
>> > return -EINVAL;
>> > if (num_online_cpus() > 1)
>> > synchronize_sched();
>> > return 0;
>> >
>> >
>> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
>>
>> See commit 907565337
>> "Fix: Disable sys_membarrier when nohz_full is enabled"
>>
>> "Userspace applications should be allowed to expect the membarrier system
>> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
>> nohz_full CPUs, but synchronize_sched() does not take those into
>> account."
>>
>> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
>> only care about kernel preempt off critical sections.
>>
>> Clearly bpf code does not run in user-space, so it would "work".
>>
>> But the guarantees provided by membarrier are not to synchronize against
>> preempt off per se. It's just that the current implementation happens to
>> do that. The point of membarrier is to turn user-space memory barriers
>> into compiler barriers.
>>
>> If what you need is to wait for a RCU grace period for whatever RCU flavor
>> ebpf is using, I would against using membarrier for this. I would rather
>> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
>> implementation details to user-space, *and* you can eventually change you
>> RCU implementation for e.g. SRCU in the future if needed.
>
> The point about future changes to underlying bpf mechanisms is valid.
> There is work already on the way to reduce the scope of preempt_off+rcu_lock
> that currently lasts the whole prog. We will have new prog types that won't
> have such wrappers and will do rcu_lock/unlock and preempt on/off only
> when necessary.
> So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> also won't make sense for the same reason.
> What we can do it instead is to define synchronization barrier for
> programs accessing maps. May be call it something like:
> BPF_SYNC_MAP_ACCESS ?

I'm not sure what you're proposing. In the case the commit message
describes, a user-space program that wants to "drain" a map needs to
be confident that the map won't change under it, even across multiple
bpf system calls on that map. One way of doing that is to ensure that
nothing that could possibly hold a reference to that map is still
running. Are you proposing some kind of refcount-draining approach?
Simple locking won't work, since BPF programs can't block, and I don't
see right now how a simple barrier would help.

> Also should such sys_bpf command be root only?
> I'm not sure whether dos attack can be made by spamming synchronize_rcu()
> and synchronize_sched() for that matter.

membarrier being globally available seems not to have caused any kind
of DoS catastrophe.

2018-07-09 22:12:08

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 09, 2018 at 02:36:37PM -0700, Daniel Colascione wrote:
> On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
> <[email protected]> wrote:
> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
> >>
> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> >> > RCU data structure operations with respect to active programs. For
> >> >> > example, userspace can update a map->map entry to point to a new map,
> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> >> > complete, and then drain the old map without fear that BPF programs
> >> >> > may still be updating it.
> >> >> >
> >> >> > Signed-off-by: Daniel Colascione <[email protected]>
> >> >> > ---
> >> >> > include/uapi/linux/bpf.h | 1 +
> >> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> >> >> > 2 files changed, 15 insertions(+)
> >> >> >
> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> >> > index b7db3261c62d..4365c50e8055 100644
> >> >> > --- a/include/uapi/linux/bpf.h
> >> >> > +++ b/include/uapi/linux/bpf.h
> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> >> > BPF_BTF_LOAD,
> >> >> > BPF_BTF_GET_FD_BY_ID,
> >> >> > BPF_TASK_FD_QUERY,
> >> >> > + BPF_SYNCHRONIZE,
> >> >> > };
> >> >> >
> >> >> > enum bpf_map_type {
> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> >> > index d10ecd78105f..60ec7811846e 100644
> >> >> > --- a/kernel/bpf/syscall.c
> >> >> > +++ b/kernel/bpf/syscall.c
> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> >> > uattr, unsigned int, siz
> >> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> >> > return -EPERM;
> >> >> >
> >> >> > + if (cmd == BPF_SYNCHRONIZE) {
> >> >> > + if (uattr != NULL || size != 0)
> >> >> > + return -EINVAL;
> >> >> > + err = security_bpf(cmd, NULL, 0);
> >> >> > + if (err < 0)
> >> >> > + return err;
> >> >> > + /* BPF programs are run with preempt disabled, so
> >> >> > + * synchronize_sched is sufficient even with
> >> >> > + * RCU_PREEMPT.
> >> >> > + */
> >> >> > + synchronize_sched();
> >> >> > + return 0;
> >> >>
> >> >> I don't think it's necessary. sys_membarrier() can do this already
> >> >> and some folks use it exactly for this use case.
> >> >
> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> >> > though. No where does the manpage say membarrier should be implemented this
> >> > way so what happens if the implementation changes?
> >> >
> >> > Further, membarrier manpage says that a memory barrier should be matched with
> >> > a matching barrier. In this use case there is no matching barrier, so it
> >> > makes it weirder.
> >> >
> >> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> >> > fragile to depend on it for this?
> >> >
> >> > case MEMBARRIER_CMD_GLOBAL:
> >> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> >> > if (tick_nohz_full_enabled())
> >> > return -EINVAL;
> >> > if (num_online_cpus() > 1)
> >> > synchronize_sched();
> >> > return 0;
> >> >
> >> >
> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> >>
> >> See commit 907565337
> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> >>
> >> "Userspace applications should be allowed to expect the membarrier system
> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> >> nohz_full CPUs, but synchronize_sched() does not take those into
> >> account."
> >>
> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> >> only care about kernel preempt off critical sections.
> >>
> >> Clearly bpf code does not run in user-space, so it would "work".
> >>
> >> But the guarantees provided by membarrier are not to synchronize against
> >> preempt off per se. It's just that the current implementation happens to
> >> do that. The point of membarrier is to turn user-space memory barriers
> >> into compiler barriers.
> >>
> >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> >> ebpf is using, I would against using membarrier for this. I would rather
> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> >> implementation details to user-space, *and* you can eventually change you
> >> RCU implementation for e.g. SRCU in the future if needed.
> >
> > The point about future changes to underlying bpf mechanisms is valid.
> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> > that currently lasts the whole prog. We will have new prog types that won't
> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > when necessary.
> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > also won't make sense for the same reason.
> > What we can do it instead is to define synchronization barrier for
> > programs accessing maps. May be call it something like:
> > BPF_SYNC_MAP_ACCESS ?
>
> I'm not sure what you're proposing. In the case the commit message
> describes, a user-space program that wants to "drain" a map needs to
> be confident that the map won't change under it, even across multiple
> bpf system calls on that map. One way of doing that is to ensure that
> nothing that could possibly hold a reference to that map is still
> running. Are you proposing some kind of refcount-draining approach?
> Simple locking won't work, since BPF programs can't block, and I don't
> see right now how a simple barrier would help.

I'm proposing few changes for your patch:
s/BPF_SYNCHRONIZE/BPF_SYNC_MAP_ACCESS/
and s/synchronize_sched/synchronize_rcu/
with detailed comment in uapi/bpf.h that has an example why folks
would want to use this new cmd.
I think the bpf maps will be rcu protected for foreseeable future
even when rcu_read_lock/unlock will be done by the programs instead of
kernel wrappers.

2018-07-09 22:17:54

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 09, 2018 at 05:35:34PM -0400, Mathieu Desnoyers wrote:
>
>
> ----- On Jul 9, 2018, at 5:09 PM, Alexei Starovoitov [email protected] wrote:
>
> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
> >>
> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> >> > RCU data structure operations with respect to active programs. For
> >> >> > example, userspace can update a map->map entry to point to a new map,
> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> >> > complete, and then drain the old map without fear that BPF programs
> >> >> > may still be updating it.
> >> >> >
> >> >> > Signed-off-by: Daniel Colascione <[email protected]>
> >> >> > ---
> >> >> > include/uapi/linux/bpf.h | 1 +
> >> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> >> >> > 2 files changed, 15 insertions(+)
> >> >> >
> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> >> > index b7db3261c62d..4365c50e8055 100644
> >> >> > --- a/include/uapi/linux/bpf.h
> >> >> > +++ b/include/uapi/linux/bpf.h
> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> >> > BPF_BTF_LOAD,
> >> >> > BPF_BTF_GET_FD_BY_ID,
> >> >> > BPF_TASK_FD_QUERY,
> >> >> > + BPF_SYNCHRONIZE,
> >> >> > };
> >> >> >
> >> >> > enum bpf_map_type {
> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> >> > index d10ecd78105f..60ec7811846e 100644
> >> >> > --- a/kernel/bpf/syscall.c
> >> >> > +++ b/kernel/bpf/syscall.c
> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> >> > uattr, unsigned int, siz
> >> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> >> > return -EPERM;
> >> >> >
> >> >> > + if (cmd == BPF_SYNCHRONIZE) {
> >> >> > + if (uattr != NULL || size != 0)
> >> >> > + return -EINVAL;
> >> >> > + err = security_bpf(cmd, NULL, 0);
> >> >> > + if (err < 0)
> >> >> > + return err;
> >> >> > + /* BPF programs are run with preempt disabled, so
> >> >> > + * synchronize_sched is sufficient even with
> >> >> > + * RCU_PREEMPT.
> >> >> > + */
> >> >> > + synchronize_sched();
> >> >> > + return 0;
> >> >>
> >> >> I don't think it's necessary. sys_membarrier() can do this already
> >> >> and some folks use it exactly for this use case.
> >> >
> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> >> > though. No where does the manpage say membarrier should be implemented this
> >> > way so what happens if the implementation changes?
> >> >
> >> > Further, membarrier manpage says that a memory barrier should be matched with
> >> > a matching barrier. In this use case there is no matching barrier, so it
> >> > makes it weirder.
> >> >
> >> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> >> > fragile to depend on it for this?
> >> >
> >> > case MEMBARRIER_CMD_GLOBAL:
> >> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> >> > if (tick_nohz_full_enabled())
> >> > return -EINVAL;
> >> > if (num_online_cpus() > 1)
> >> > synchronize_sched();
> >> > return 0;
> >> >
> >> >
> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> >>
> >> See commit 907565337
> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> >>
> >> "Userspace applications should be allowed to expect the membarrier system
> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> >> nohz_full CPUs, but synchronize_sched() does not take those into
> >> account."
> >>
> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> >> only care about kernel preempt off critical sections.
> >>
> >> Clearly bpf code does not run in user-space, so it would "work".
> >>
> >> But the guarantees provided by membarrier are not to synchronize against
> >> preempt off per se. It's just that the current implementation happens to
> >> do that. The point of membarrier is to turn user-space memory barriers
> >> into compiler barriers.
> >>
> >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> >> ebpf is using, I would against using membarrier for this. I would rather
> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> >> implementation details to user-space, *and* you can eventually change you
> >> RCU implementation for e.g. SRCU in the future if needed.
> >
> > The point about future changes to underlying bpf mechanisms is valid.
> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> > that currently lasts the whole prog. We will have new prog types that won't
> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > when necessary.
> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > also won't make sense for the same reason.
> > What we can do it instead is to define synchronization barrier for
> > programs accessing maps. May be call it something like:
> > BPF_SYNC_MAP_ACCESS ?
> > uapi/bpf.h would need to have extensive comment what this barrier is doing.
> > Implementation should probably call synchronize_rcu() and not play games
> > with synchronize_sched(), since that's going too much into implementation.
> > Also should such sys_bpf command be root only?
> > I'm not sure whether dos attack can be made by spamming synchronize_rcu()
> > and synchronize_sched() for that matter.
>
> Adding Paul E. McKenney in CC. He may want to share his thoughts on the matter.

Let's see...

Spamming synchronize_rcu() and synchronize_sched() should be a non-event,
at least aside from the CPUs doing the spamming. The reason for this
is that a given task can only fire off a single synchronize_sched or
synchronize_rcu() per few milliseconds, so you need a -lot- of tasks
to have much effect, at which point the sheer number of tasks is much
more a problem than the large number of outstanding synchronize_rcu()
or synchronize_sched() invocations.

I very strongly agree that usermode should have a operation that
synchronizes with whatever eBPF uses, rather than something that forces
a specific type of RCU grace period.

Finally, in a few releases, synchronize_sched() will be retiring in favor
of synchronize_rcu(), which will wait on preemption-disabled regions of
code in addition to waiting on RCU read-side critical sections. Not a
big deal, as I expect to enlist Coccinelle's aid in this.

Did I manage to hit all the high points?

Thanx, Paul

2018-07-09 22:21:55

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 09, 2018 at 03:19:03PM -0700, Paul E. McKenney wrote:
> On Mon, Jul 09, 2018 at 05:35:34PM -0400, Mathieu Desnoyers wrote:
> >
> >
> > ----- On Jul 9, 2018, at 5:09 PM, Alexei Starovoitov [email protected] wrote:
> >
> > > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
> > >>
> > >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > >> >> > RCU data structure operations with respect to active programs. For
> > >> >> > example, userspace can update a map->map entry to point to a new map,
> > >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > >> >> > complete, and then drain the old map without fear that BPF programs
> > >> >> > may still be updating it.
> > >> >> >
> > >> >> > Signed-off-by: Daniel Colascione <[email protected]>
> > >> >> > ---
> > >> >> > include/uapi/linux/bpf.h | 1 +
> > >> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> > >> >> > 2 files changed, 15 insertions(+)
> > >> >> >
> > >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> >> > index b7db3261c62d..4365c50e8055 100644
> > >> >> > --- a/include/uapi/linux/bpf.h
> > >> >> > +++ b/include/uapi/linux/bpf.h
> > >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > >> >> > BPF_BTF_LOAD,
> > >> >> > BPF_BTF_GET_FD_BY_ID,
> > >> >> > BPF_TASK_FD_QUERY,
> > >> >> > + BPF_SYNCHRONIZE,
> > >> >> > };
> > >> >> >
> > >> >> > enum bpf_map_type {
> > >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > >> >> > index d10ecd78105f..60ec7811846e 100644
> > >> >> > --- a/kernel/bpf/syscall.c
> > >> >> > +++ b/kernel/bpf/syscall.c
> > >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > >> >> > uattr, unsigned int, siz
> > >> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > >> >> > return -EPERM;
> > >> >> >
> > >> >> > + if (cmd == BPF_SYNCHRONIZE) {
> > >> >> > + if (uattr != NULL || size != 0)
> > >> >> > + return -EINVAL;
> > >> >> > + err = security_bpf(cmd, NULL, 0);
> > >> >> > + if (err < 0)
> > >> >> > + return err;
> > >> >> > + /* BPF programs are run with preempt disabled, so
> > >> >> > + * synchronize_sched is sufficient even with
> > >> >> > + * RCU_PREEMPT.
> > >> >> > + */
> > >> >> > + synchronize_sched();
> > >> >> > + return 0;
> > >> >>
> > >> >> I don't think it's necessary. sys_membarrier() can do this already
> > >> >> and some folks use it exactly for this use case.
> > >> >
> > >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > >> > though. No where does the manpage say membarrier should be implemented this
> > >> > way so what happens if the implementation changes?
> > >> >
> > >> > Further, membarrier manpage says that a memory barrier should be matched with
> > >> > a matching barrier. In this use case there is no matching barrier, so it
> > >> > makes it weirder.
> > >> >
> > >> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> > >> > fragile to depend on it for this?
> > >> >
> > >> > case MEMBARRIER_CMD_GLOBAL:
> > >> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > >> > if (tick_nohz_full_enabled())
> > >> > return -EINVAL;
> > >> > if (num_online_cpus() > 1)
> > >> > synchronize_sched();
> > >> > return 0;
> > >> >
> > >> >
> > >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > >>
> > >> See commit 907565337
> > >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> > >>
> > >> "Userspace applications should be allowed to expect the membarrier system
> > >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > >> nohz_full CPUs, but synchronize_sched() does not take those into
> > >> account."
> > >>
> > >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > >> only care about kernel preempt off critical sections.
> > >>
> > >> Clearly bpf code does not run in user-space, so it would "work".
> > >>
> > >> But the guarantees provided by membarrier are not to synchronize against
> > >> preempt off per se. It's just that the current implementation happens to
> > >> do that. The point of membarrier is to turn user-space memory barriers
> > >> into compiler barriers.
> > >>
> > >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> > >> ebpf is using, I would against using membarrier for this. I would rather
> > >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> > >> implementation details to user-space, *and* you can eventually change you
> > >> RCU implementation for e.g. SRCU in the future if needed.
> > >
> > > The point about future changes to underlying bpf mechanisms is valid.
> > > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> > > that currently lasts the whole prog. We will have new prog types that won't
> > > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > > when necessary.
> > > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> > > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > > also won't make sense for the same reason.
> > > What we can do it instead is to define synchronization barrier for
> > > programs accessing maps. May be call it something like:
> > > BPF_SYNC_MAP_ACCESS ?
> > > uapi/bpf.h would need to have extensive comment what this barrier is doing.
> > > Implementation should probably call synchronize_rcu() and not play games
> > > with synchronize_sched(), since that's going too much into implementation.
> > > Also should such sys_bpf command be root only?
> > > I'm not sure whether dos attack can be made by spamming synchronize_rcu()
> > > and synchronize_sched() for that matter.
> >
> > Adding Paul E. McKenney in CC. He may want to share his thoughts on the matter.
>
> Let's see...
>
> Spamming synchronize_rcu() and synchronize_sched() should be a non-event,
> at least aside from the CPUs doing the spamming. The reason for this
> is that a given task can only fire off a single synchronize_sched or
> synchronize_rcu() per few milliseconds, so you need a -lot- of tasks
> to have much effect, at which point the sheer number of tasks is much
> more a problem than the large number of outstanding synchronize_rcu()
> or synchronize_sched() invocations.
>
> I very strongly agree that usermode should have a operation that
> synchronizes with whatever eBPF uses, rather than something that forces
> a specific type of RCU grace period.
>
> Finally, in a few releases, synchronize_sched() will be retiring in favor
> of synchronize_rcu(), which will wait on preemption-disabled regions of
> code in addition to waiting on RCU read-side critical sections. Not a
> big deal, as I expect to enlist Coccinelle's aid in this.
>
> Did I manage to hit all the high points?

Thanks. It's ok for new cmd being unpriv then.

2018-07-09 22:23:01

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 9, 2018 at 3:10 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Mon, Jul 09, 2018 at 02:36:37PM -0700, Daniel Colascione wrote:
>> On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
>> <[email protected]> wrote:
>> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
>> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
>> >>
>> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
>> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
>> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
>> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
>> >> >> > RCU data structure operations with respect to active programs. For
>> >> >> > example, userspace can update a map->map entry to point to a new map,
>> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>> >> >> > complete, and then drain the old map without fear that BPF programs
>> >> >> > may still be updating it.
>> >> >> >
>> >> >> > Signed-off-by: Daniel Colascione <[email protected]>
>> >> >> > ---
>> >> >> > include/uapi/linux/bpf.h | 1 +
>> >> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
>> >> >> > 2 files changed, 15 insertions(+)
>> >> >> >
>> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> >> >> > index b7db3261c62d..4365c50e8055 100644
>> >> >> > --- a/include/uapi/linux/bpf.h
>> >> >> > +++ b/include/uapi/linux/bpf.h
>> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
>> >> >> > BPF_BTF_LOAD,
>> >> >> > BPF_BTF_GET_FD_BY_ID,
>> >> >> > BPF_TASK_FD_QUERY,
>> >> >> > + BPF_SYNCHRONIZE,
>> >> >> > };
>> >> >> >
>> >> >> > enum bpf_map_type {
>> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> >> >> > index d10ecd78105f..60ec7811846e 100644
>> >> >> > --- a/kernel/bpf/syscall.c
>> >> >> > +++ b/kernel/bpf/syscall.c
>> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
>> >> >> > uattr, unsigned int, siz
>> >> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>> >> >> > return -EPERM;
>> >> >> >
>> >> >> > + if (cmd == BPF_SYNCHRONIZE) {
>> >> >> > + if (uattr != NULL || size != 0)
>> >> >> > + return -EINVAL;
>> >> >> > + err = security_bpf(cmd, NULL, 0);
>> >> >> > + if (err < 0)
>> >> >> > + return err;
>> >> >> > + /* BPF programs are run with preempt disabled, so
>> >> >> > + * synchronize_sched is sufficient even with
>> >> >> > + * RCU_PREEMPT.
>> >> >> > + */
>> >> >> > + synchronize_sched();
>> >> >> > + return 0;
>> >> >>
>> >> >> I don't think it's necessary. sys_membarrier() can do this already
>> >> >> and some folks use it exactly for this use case.
>> >> >
>> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
>> >> > though. No where does the manpage say membarrier should be implemented this
>> >> > way so what happens if the implementation changes?
>> >> >
>> >> > Further, membarrier manpage says that a memory barrier should be matched with
>> >> > a matching barrier. In this use case there is no matching barrier, so it
>> >> > makes it weirder.
>> >> >
>> >> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
>> >> > fragile to depend on it for this?
>> >> >
>> >> > case MEMBARRIER_CMD_GLOBAL:
>> >> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
>> >> > if (tick_nohz_full_enabled())
>> >> > return -EINVAL;
>> >> > if (num_online_cpus() > 1)
>> >> > synchronize_sched();
>> >> > return 0;
>> >> >
>> >> >
>> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
>> >>
>> >> See commit 907565337
>> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
>> >>
>> >> "Userspace applications should be allowed to expect the membarrier system
>> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
>> >> nohz_full CPUs, but synchronize_sched() does not take those into
>> >> account."
>> >>
>> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
>> >> only care about kernel preempt off critical sections.
>> >>
>> >> Clearly bpf code does not run in user-space, so it would "work".
>> >>
>> >> But the guarantees provided by membarrier are not to synchronize against
>> >> preempt off per se. It's just that the current implementation happens to
>> >> do that. The point of membarrier is to turn user-space memory barriers
>> >> into compiler barriers.
>> >>
>> >> If what you need is to wait for a RCU grace period for whatever RCU flavor
>> >> ebpf is using, I would against using membarrier for this. I would rather
>> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
>> >> implementation details to user-space, *and* you can eventually change you
>> >> RCU implementation for e.g. SRCU in the future if needed.
>> >
>> > The point about future changes to underlying bpf mechanisms is valid.
>> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
>> > that currently lasts the whole prog. We will have new prog types that won't
>> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
>> > when necessary.
>> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
>> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
>> > also won't make sense for the same reason.
>> > What we can do it instead is to define synchronization barrier for
>> > programs accessing maps. May be call it something like:
>> > BPF_SYNC_MAP_ACCESS ?
>>
>> I'm not sure what you're proposing. In the case the commit message
>> describes, a user-space program that wants to "drain" a map needs to
>> be confident that the map won't change under it, even across multiple
>> bpf system calls on that map. One way of doing that is to ensure that
>> nothing that could possibly hold a reference to that map is still
>> running. Are you proposing some kind of refcount-draining approach?
>> Simple locking won't work, since BPF programs can't block, and I don't
>> see right now how a simple barrier would help.
>
> I'm proposing few changes for your patch:
> s/BPF_SYNCHRONIZE/BPF_SYNC_MAP_ACCESS/
> and s/synchronize_sched/synchronize_rcu/
> with detailed comment in uapi/bpf.h that has an example why folks
> would want to use this new cmd.

Thanks for clarifying.

> I think the bpf maps will be rcu protected for foreseeable future
> even when rcu_read_lock/unlock will be done by the programs instead of
> kernel wrappers.

Can we guarantee that we always obtain a map reference and dispose of
that reference inside the same critical section? If so, can BPF
programs then disable preemption for as long as they'd like?

2018-07-09 22:35:40

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 09, 2018 at 03:21:43PM -0700, Daniel Colascione wrote:
> On Mon, Jul 9, 2018 at 3:10 PM, Alexei Starovoitov
> <[email protected]> wrote:
> > On Mon, Jul 09, 2018 at 02:36:37PM -0700, Daniel Colascione wrote:
> >> On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
> >> <[email protected]> wrote:
> >> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> >> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
> >> >>
> >> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> >> >> > RCU data structure operations with respect to active programs. For
> >> >> >> > example, userspace can update a map->map entry to point to a new map,
> >> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> >> >> > complete, and then drain the old map without fear that BPF programs
> >> >> >> > may still be updating it.
> >> >> >> >
> >> >> >> > Signed-off-by: Daniel Colascione <[email protected]>
> >> >> >> > ---
> >> >> >> > include/uapi/linux/bpf.h | 1 +
> >> >> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> >> >> >> > 2 files changed, 15 insertions(+)
> >> >> >> >
> >> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> >> >> > index b7db3261c62d..4365c50e8055 100644
> >> >> >> > --- a/include/uapi/linux/bpf.h
> >> >> >> > +++ b/include/uapi/linux/bpf.h
> >> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> >> >> > BPF_BTF_LOAD,
> >> >> >> > BPF_BTF_GET_FD_BY_ID,
> >> >> >> > BPF_TASK_FD_QUERY,
> >> >> >> > + BPF_SYNCHRONIZE,
> >> >> >> > };
> >> >> >> >
> >> >> >> > enum bpf_map_type {
> >> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> >> >> > index d10ecd78105f..60ec7811846e 100644
> >> >> >> > --- a/kernel/bpf/syscall.c
> >> >> >> > +++ b/kernel/bpf/syscall.c
> >> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> >> >> > uattr, unsigned int, siz
> >> >> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> >> >> > return -EPERM;
> >> >> >> >
> >> >> >> > + if (cmd == BPF_SYNCHRONIZE) {
> >> >> >> > + if (uattr != NULL || size != 0)
> >> >> >> > + return -EINVAL;
> >> >> >> > + err = security_bpf(cmd, NULL, 0);
> >> >> >> > + if (err < 0)
> >> >> >> > + return err;
> >> >> >> > + /* BPF programs are run with preempt disabled, so
> >> >> >> > + * synchronize_sched is sufficient even with
> >> >> >> > + * RCU_PREEMPT.
> >> >> >> > + */
> >> >> >> > + synchronize_sched();
> >> >> >> > + return 0;
> >> >> >>
> >> >> >> I don't think it's necessary. sys_membarrier() can do this already
> >> >> >> and some folks use it exactly for this use case.
> >> >> >
> >> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> >> >> > though. No where does the manpage say membarrier should be implemented this
> >> >> > way so what happens if the implementation changes?
> >> >> >
> >> >> > Further, membarrier manpage says that a memory barrier should be matched with
> >> >> > a matching barrier. In this use case there is no matching barrier, so it
> >> >> > makes it weirder.
> >> >> >
> >> >> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> >> >> > fragile to depend on it for this?
> >> >> >
> >> >> > case MEMBARRIER_CMD_GLOBAL:
> >> >> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> >> >> > if (tick_nohz_full_enabled())
> >> >> > return -EINVAL;
> >> >> > if (num_online_cpus() > 1)
> >> >> > synchronize_sched();
> >> >> > return 0;
> >> >> >
> >> >> >
> >> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> >> >>
> >> >> See commit 907565337
> >> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> >> >>
> >> >> "Userspace applications should be allowed to expect the membarrier system
> >> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> >> >> nohz_full CPUs, but synchronize_sched() does not take those into
> >> >> account."
> >> >>
> >> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> >> >> only care about kernel preempt off critical sections.
> >> >>
> >> >> Clearly bpf code does not run in user-space, so it would "work".
> >> >>
> >> >> But the guarantees provided by membarrier are not to synchronize against
> >> >> preempt off per se. It's just that the current implementation happens to
> >> >> do that. The point of membarrier is to turn user-space memory barriers
> >> >> into compiler barriers.
> >> >>
> >> >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> >> >> ebpf is using, I would against using membarrier for this. I would rather
> >> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> >> >> implementation details to user-space, *and* you can eventually change you
> >> >> RCU implementation for e.g. SRCU in the future if needed.
> >> >
> >> > The point about future changes to underlying bpf mechanisms is valid.
> >> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> >> > that currently lasts the whole prog. We will have new prog types that won't
> >> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> >> > when necessary.
> >> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> >> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> >> > also won't make sense for the same reason.
> >> > What we can do it instead is to define synchronization barrier for
> >> > programs accessing maps. May be call it something like:
> >> > BPF_SYNC_MAP_ACCESS ?
> >>
> >> I'm not sure what you're proposing. In the case the commit message
> >> describes, a user-space program that wants to "drain" a map needs to
> >> be confident that the map won't change under it, even across multiple
> >> bpf system calls on that map. One way of doing that is to ensure that
> >> nothing that could possibly hold a reference to that map is still
> >> running. Are you proposing some kind of refcount-draining approach?
> >> Simple locking won't work, since BPF programs can't block, and I don't
> >> see right now how a simple barrier would help.
> >
> > I'm proposing few changes for your patch:
> > s/BPF_SYNCHRONIZE/BPF_SYNC_MAP_ACCESS/
> > and s/synchronize_sched/synchronize_rcu/
> > with detailed comment in uapi/bpf.h that has an example why folks
> > would want to use this new cmd.
>
> Thanks for clarifying.
>
> > I think the bpf maps will be rcu protected for foreseeable future
> > even when rcu_read_lock/unlock will be done by the programs instead of
> > kernel wrappers.
>
> Can we guarantee that we always obtain a map reference and dispose of
> that reference inside the same critical section?

yep. the verifier will guarantee that.

> If so, can BPF
> programs then disable preemption for as long as they'd like?

you mean after the finish? no. only while running.
The verifier will match things like lookup/release, lock/unlock, preempt on/off
and will make sure there is no dangling preempt disable after program returns.

2018-07-09 22:47:29

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 09, 2018 at 03:19:55PM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 09, 2018 at 03:19:03PM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 09, 2018 at 05:35:34PM -0400, Mathieu Desnoyers wrote:
> > >
> > >
> > > ----- On Jul 9, 2018, at 5:09 PM, Alexei Starovoitov [email protected] wrote:
> > >
> > > > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > > >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
> > > >>
> > > >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > > >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > > >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > > >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > > >> >> > RCU data structure operations with respect to active programs. For
> > > >> >> > example, userspace can update a map->map entry to point to a new map,
> > > >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > > >> >> > complete, and then drain the old map without fear that BPF programs
> > > >> >> > may still be updating it.
> > > >> >> >
> > > >> >> > Signed-off-by: Daniel Colascione <[email protected]>
> > > >> >> > ---
> > > >> >> > include/uapi/linux/bpf.h | 1 +
> > > >> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> > > >> >> > 2 files changed, 15 insertions(+)
> > > >> >> >
> > > >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > >> >> > index b7db3261c62d..4365c50e8055 100644
> > > >> >> > --- a/include/uapi/linux/bpf.h
> > > >> >> > +++ b/include/uapi/linux/bpf.h
> > > >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > > >> >> > BPF_BTF_LOAD,
> > > >> >> > BPF_BTF_GET_FD_BY_ID,
> > > >> >> > BPF_TASK_FD_QUERY,
> > > >> >> > + BPF_SYNCHRONIZE,
> > > >> >> > };
> > > >> >> >
> > > >> >> > enum bpf_map_type {
> > > >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > >> >> > index d10ecd78105f..60ec7811846e 100644
> > > >> >> > --- a/kernel/bpf/syscall.c
> > > >> >> > +++ b/kernel/bpf/syscall.c
> > > >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > > >> >> > uattr, unsigned int, siz
> > > >> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > >> >> > return -EPERM;
> > > >> >> >
> > > >> >> > + if (cmd == BPF_SYNCHRONIZE) {
> > > >> >> > + if (uattr != NULL || size != 0)
> > > >> >> > + return -EINVAL;
> > > >> >> > + err = security_bpf(cmd, NULL, 0);
> > > >> >> > + if (err < 0)
> > > >> >> > + return err;
> > > >> >> > + /* BPF programs are run with preempt disabled, so
> > > >> >> > + * synchronize_sched is sufficient even with
> > > >> >> > + * RCU_PREEMPT.
> > > >> >> > + */
> > > >> >> > + synchronize_sched();
> > > >> >> > + return 0;
> > > >> >>
> > > >> >> I don't think it's necessary. sys_membarrier() can do this already
> > > >> >> and some folks use it exactly for this use case.
> > > >> >
> > > >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > > >> > though. No where does the manpage say membarrier should be implemented this
> > > >> > way so what happens if the implementation changes?
> > > >> >
> > > >> > Further, membarrier manpage says that a memory barrier should be matched with
> > > >> > a matching barrier. In this use case there is no matching barrier, so it
> > > >> > makes it weirder.
> > > >> >
> > > >> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> > > >> > fragile to depend on it for this?
> > > >> >
> > > >> > case MEMBARRIER_CMD_GLOBAL:
> > > >> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > > >> > if (tick_nohz_full_enabled())
> > > >> > return -EINVAL;
> > > >> > if (num_online_cpus() > 1)
> > > >> > synchronize_sched();
> > > >> > return 0;
> > > >> >
> > > >> >
> > > >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > > >>
> > > >> See commit 907565337
> > > >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> > > >>
> > > >> "Userspace applications should be allowed to expect the membarrier system
> > > >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > > >> nohz_full CPUs, but synchronize_sched() does not take those into
> > > >> account."
> > > >>
> > > >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > > >> only care about kernel preempt off critical sections.
> > > >>
> > > >> Clearly bpf code does not run in user-space, so it would "work".
> > > >>
> > > >> But the guarantees provided by membarrier are not to synchronize against
> > > >> preempt off per se. It's just that the current implementation happens to
> > > >> do that. The point of membarrier is to turn user-space memory barriers
> > > >> into compiler barriers.
> > > >>
> > > >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> > > >> ebpf is using, I would against using membarrier for this. I would rather
> > > >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> > > >> implementation details to user-space, *and* you can eventually change you
> > > >> RCU implementation for e.g. SRCU in the future if needed.
> > > >
> > > > The point about future changes to underlying bpf mechanisms is valid.
> > > > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> > > > that currently lasts the whole prog. We will have new prog types that won't
> > > > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > > > when necessary.
> > > > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> > > > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > > > also won't make sense for the same reason.
> > > > What we can do it instead is to define synchronization barrier for
> > > > programs accessing maps. May be call it something like:
> > > > BPF_SYNC_MAP_ACCESS ?
> > > > uapi/bpf.h would need to have extensive comment what this barrier is doing.
> > > > Implementation should probably call synchronize_rcu() and not play games
> > > > with synchronize_sched(), since that's going too much into implementation.
> > > > Also should such sys_bpf command be root only?
> > > > I'm not sure whether dos attack can be made by spamming synchronize_rcu()
> > > > and synchronize_sched() for that matter.
> > >
> > > Adding Paul E. McKenney in CC. He may want to share his thoughts on the matter.
> >
> > Let's see...
> >
> > Spamming synchronize_rcu() and synchronize_sched() should be a non-event,
> > at least aside from the CPUs doing the spamming. The reason for this
> > is that a given task can only fire off a single synchronize_sched or
> > synchronize_rcu() per few milliseconds, so you need a -lot- of tasks
> > to have much effect, at which point the sheer number of tasks is much
> > more a problem than the large number of outstanding synchronize_rcu()
> > or synchronize_sched() invocations.
> >
> > I very strongly agree that usermode should have a operation that
> > synchronizes with whatever eBPF uses, rather than something that forces
> > a specific type of RCU grace period.
> >
> > Finally, in a few releases, synchronize_sched() will be retiring in favor
> > of synchronize_rcu(), which will wait on preemption-disabled regions of
> > code in addition to waiting on RCU read-side critical sections. Not a
> > big deal, as I expect to enlist Coccinelle's aid in this.
> >
> > Did I manage to hit all the high points?
>
> Thanks. It's ok for new cmd being unpriv then.

Yes. After all, users can cause RCU far more pain by doing close(open())
in a tight loop. ;-)

Thanx, Paul

2018-07-10 05:22:14

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
>
> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> > RCU data structure operations with respect to active programs. For
> >> > example, userspace can update a map->map entry to point to a new map,
> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> > complete, and then drain the old map without fear that BPF programs
> >> > may still be updating it.
> >> >
> >> > Signed-off-by: Daniel Colascione <[email protected]>
> >> > ---
> >> > include/uapi/linux/bpf.h | 1 +
> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> >> > 2 files changed, 15 insertions(+)
> >> >
> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> > index b7db3261c62d..4365c50e8055 100644
> >> > --- a/include/uapi/linux/bpf.h
> >> > +++ b/include/uapi/linux/bpf.h
> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> > BPF_BTF_LOAD,
> >> > BPF_BTF_GET_FD_BY_ID,
> >> > BPF_TASK_FD_QUERY,
> >> > + BPF_SYNCHRONIZE,
> >> > };
> >> >
> >> > enum bpf_map_type {
> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> > index d10ecd78105f..60ec7811846e 100644
> >> > --- a/kernel/bpf/syscall.c
> >> > +++ b/kernel/bpf/syscall.c
> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> > uattr, unsigned int, siz
> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> > return -EPERM;
> >> >
> >> > + if (cmd == BPF_SYNCHRONIZE) {
> >> > + if (uattr != NULL || size != 0)
> >> > + return -EINVAL;
> >> > + err = security_bpf(cmd, NULL, 0);
> >> > + if (err < 0)
> >> > + return err;
> >> > + /* BPF programs are run with preempt disabled, so
> >> > + * synchronize_sched is sufficient even with
> >> > + * RCU_PREEMPT.
> >> > + */
> >> > + synchronize_sched();
> >> > + return 0;
> >>
> >> I don't think it's necessary. sys_membarrier() can do this already
> >> and some folks use it exactly for this use case.
> >
> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > though. No where does the manpage say membarrier should be implemented this
> > way so what happens if the implementation changes?
> >
> > Further, membarrier manpage says that a memory barrier should be matched with
> > a matching barrier. In this use case there is no matching barrier, so it
> > makes it weirder.
> >
> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> > fragile to depend on it for this?
> >
> > case MEMBARRIER_CMD_GLOBAL:
> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > if (tick_nohz_full_enabled())
> > return -EINVAL;
> > if (num_online_cpus() > 1)
> > synchronize_sched();
> > return 0;
> >
> >
> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
>
> See commit 907565337
> "Fix: Disable sys_membarrier when nohz_full is enabled"
>
> "Userspace applications should be allowed to expect the membarrier system
> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> nohz_full CPUs, but synchronize_sched() does not take those into
> account."
>
> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> only care about kernel preempt off critical sections.

Mathieu, Thanks a lot for your reply. I understand what you said and agree
with you. Slight OT, but I tried to go back to first principles and
understand how membarrier() uses synchronize_sched() for the "slow path" and
it didn't make immediate sense to me. Let me clarify my dillema..

My understanding is membarrier's MEMBARRIER_CMD_GLOBAL will employ
synchronize_sched to make sure all other CPUs aren't executing anymore in an
section of usercode that happen to be accessing memory that was written to
before the membarrier call was made. To do this, the system call will use
synchronize_sched to try to guarantee that all user-mode execution that
started before the membarrier call would be completed when the membarrier
call returns. This guarantees that without using a real memory barrier on the
"fast path", things work just fine and everyone wins.

But, going through RCU code, I see that a "RCU-sched quiecent state" on a CPU
may be reached when the CPU receives a timer tick while executing in user
mode:

void rcu_check_callbacks(int user)
{
trace_rcu_utilization(TPS("Start scheduler-tick"));
increment_cpu_stall_ticks();
if (user || rcu_is_cpu_rrupt_from_idle()) {
[...]
rcu_sched_qs();
rcu_bh_qs();

The problem I see is the CPU could be executing usermode code at the time of
the RCU sched-QS. This IMO is enough reason for synchronize_sched() to
return, because the CPU in question just reported a QS (assuming all other
CPUs also happen to do so if they needed to).

Then I am wondering how does the membarrier call even work, the tick could
very well have interrupted the CPU while it was executing usermode code in
the middle of a set of instructions performing memory accesses. Reporting a
quiescent state at such an inopportune time would cause the membarrier call
to prematurely return, no? Sorry if I missed something.

The other question I have is about the whole "nohz-full doesn't work" thing.
I didn't fully understand why. RCU is already tracking the state of nohz-full
CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
transitions to and from usermode even if the timer tick is turned off. So why
would it not work?

thanks a lot!

- Joel

2018-07-10 05:26:27

by Chenbo Feng

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 9, 2018 at 3:34 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Mon, Jul 09, 2018 at 03:21:43PM -0700, Daniel Colascione wrote:
> > On Mon, Jul 9, 2018 at 3:10 PM, Alexei Starovoitov
> > <[email protected]> wrote:
> > > On Mon, Jul 09, 2018 at 02:36:37PM -0700, Daniel Colascione wrote:
> > >> On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
> > >> <[email protected]> wrote:
> > >> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > >> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
> > >> >>
> > >> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > >> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > >> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > >> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > >> >> >> > RCU data structure operations with respect to active programs. For
> > >> >> >> > example, userspace can update a map->map entry to point to a new map,
> > >> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > >> >> >> > complete, and then drain the old map without fear that BPF programs
> > >> >> >> > may still be updating it.
> > >> >> >> >
> > >> >> >> > Signed-off-by: Daniel Colascione <[email protected]>
> > >> >> >> > ---
> > >> >> >> > include/uapi/linux/bpf.h | 1 +
> > >> >> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> > >> >> >> > 2 files changed, 15 insertions(+)
> > >> >> >> >
> > >> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> >> >> > index b7db3261c62d..4365c50e8055 100644
> > >> >> >> > --- a/include/uapi/linux/bpf.h
> > >> >> >> > +++ b/include/uapi/linux/bpf.h
> > >> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > >> >> >> > BPF_BTF_LOAD,
> > >> >> >> > BPF_BTF_GET_FD_BY_ID,
> > >> >> >> > BPF_TASK_FD_QUERY,
> > >> >> >> > + BPF_SYNCHRONIZE,
> > >> >> >> > };
> > >> >> >> >
> > >> >> >> > enum bpf_map_type {
> > >> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > >> >> >> > index d10ecd78105f..60ec7811846e 100644
> > >> >> >> > --- a/kernel/bpf/syscall.c
> > >> >> >> > +++ b/kernel/bpf/syscall.c
> > >> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > >> >> >> > uattr, unsigned int, siz
> > >> >> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > >> >> >> > return -EPERM;
> > >> >> >> >
> > >> >> >> > + if (cmd == BPF_SYNCHRONIZE) {
> > >> >> >> > + if (uattr != NULL || size != 0)
> > >> >> >> > + return -EINVAL;
> > >> >> >> > + err = security_bpf(cmd, NULL, 0);
> > >> >> >> > + if (err < 0)
> > >> >> >> > + return err;
> > >> >> >> > + /* BPF programs are run with preempt disabled, so
> > >> >> >> > + * synchronize_sched is sufficient even with
> > >> >> >> > + * RCU_PREEMPT.
> > >> >> >> > + */
> > >> >> >> > + synchronize_sched();
> > >> >> >> > + return 0;
> > >> >> >>
> > >> >> >> I don't think it's necessary. sys_membarrier() can do this already
> > >> >> >> and some folks use it exactly for this use case.
> > >> >> >
> > >> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > >> >> > though. No where does the manpage say membarrier should be implemented this
> > >> >> > way so what happens if the implementation changes?
> > >> >> >
> > >> >> > Further, membarrier manpage says that a memory barrier should be matched with
> > >> >> > a matching barrier. In this use case there is no matching barrier, so it
> > >> >> > makes it weirder.
> > >> >> >
> > >> >> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> > >> >> > fragile to depend on it for this?
> > >> >> >
> > >> >> > case MEMBARRIER_CMD_GLOBAL:
> > >> >> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > >> >> > if (tick_nohz_full_enabled())
> > >> >> > return -EINVAL;
> > >> >> > if (num_online_cpus() > 1)
> > >> >> > synchronize_sched();
> > >> >> > return 0;
> > >> >> >
> > >> >> >
> > >> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > >> >>
> > >> >> See commit 907565337
> > >> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> > >> >>
> > >> >> "Userspace applications should be allowed to expect the membarrier system
> > >> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > >> >> nohz_full CPUs, but synchronize_sched() does not take those into
> > >> >> account."
> > >> >>
> > >> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > >> >> only care about kernel preempt off critical sections.
> > >> >>
> > >> >> Clearly bpf code does not run in user-space, so it would "work".
> > >> >>
> > >> >> But the guarantees provided by membarrier are not to synchronize against
> > >> >> preempt off per se. It's just that the current implementation happens to
> > >> >> do that. The point of membarrier is to turn user-space memory barriers
> > >> >> into compiler barriers.
> > >> >>
> > >> >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> > >> >> ebpf is using, I would against using membarrier for this. I would rather
> > >> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> > >> >> implementation details to user-space, *and* you can eventually change you
> > >> >> RCU implementation for e.g. SRCU in the future if needed.
> > >> >
> > >> > The point about future changes to underlying bpf mechanisms is valid.
> > >> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> > >> > that currently lasts the whole prog. We will have new prog types that won't
> > >> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > >> > when necessary.
> > >> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> > >> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > >> > also won't make sense for the same reason.
> > >> > What we can do it instead is to define synchronization barrier for
> > >> > programs accessing maps. May be call it something like:
> > >> > BPF_SYNC_MAP_ACCESS ?
> > >>
> > >> I'm not sure what you're proposing. In the case the commit message
> > >> describes, a user-space program that wants to "drain" a map needs to
> > >> be confident that the map won't change under it, even across multiple
> > >> bpf system calls on that map. One way of doing that is to ensure that
> > >> nothing that could possibly hold a reference to that map is still
> > >> running. Are you proposing some kind of refcount-draining approach?
> > >> Simple locking won't work, since BPF programs can't block, and I don't
> > >> see right now how a simple barrier would help.
> > >
> > > I'm proposing few changes for your patch:
> > > s/BPF_SYNCHRONIZE/BPF_SYNC_MAP_ACCESS/
> > > and s/synchronize_sched/synchronize_rcu/
> > > with detailed comment in uapi/bpf.h that has an example why folks
> > > would want to use this new cmd.
> >
> > Thanks for clarifying.
> >
> > > I think the bpf maps will be rcu protected for foreseeable future
> > > even when rcu_read_lock/unlock will be done by the programs instead of
> > > kernel wrappers.
> >
> > Can we guarantee that we always obtain a map reference and dispose of
> > that reference inside the same critical section?
>
> yep. the verifier will guarantee that.
>
> > If so, can BPF
> > programs then disable preemption for as long as they'd like?
>
> you mean after the finish? no. only while running.
> The verifier will match things like lookup/release, lock/unlock, preempt on/off
> and will make sure there is no dangling preempt disable after program returns.
>
It might be a silly question but does this new bpf program type
provide a way to customize where to hold the rcu_lock/unlock and
enable/disable preemption when creating the program? Since the
BPF_SYNC_MAP_ACCESS cmd sounds not very useful if it only protect each
single map access instead of the whole program. For example, sometimes
we have to read the same map multiple times when running a eBPF
program and if the userspace changed the map value between two reads,
it may cause troubles. It would be great if we can have a RCU lock on
a block of eBPF program and in that way I think BPF_SYNC_MAP_ACCESS
cmd should be enough in handling userspace kernel racing problems.

Thanks
Chenbo Feng

2018-07-10 17:02:04

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 09, 2018 at 10:13:47PM -0700, Joel Fernandes wrote:
> On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
> >
> > > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > >> > RCU data structure operations with respect to active programs. For
> > >> > example, userspace can update a map->map entry to point to a new map,
> > >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > >> > complete, and then drain the old map without fear that BPF programs
> > >> > may still be updating it.
> > >> >
> > >> > Signed-off-by: Daniel Colascione <[email protected]>
> > >> > ---
> > >> > include/uapi/linux/bpf.h | 1 +
> > >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> > >> > 2 files changed, 15 insertions(+)
> > >> >
> > >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> > index b7db3261c62d..4365c50e8055 100644
> > >> > --- a/include/uapi/linux/bpf.h
> > >> > +++ b/include/uapi/linux/bpf.h
> > >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > >> > BPF_BTF_LOAD,
> > >> > BPF_BTF_GET_FD_BY_ID,
> > >> > BPF_TASK_FD_QUERY,
> > >> > + BPF_SYNCHRONIZE,
> > >> > };
> > >> >
> > >> > enum bpf_map_type {
> > >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > >> > index d10ecd78105f..60ec7811846e 100644
> > >> > --- a/kernel/bpf/syscall.c
> > >> > +++ b/kernel/bpf/syscall.c
> > >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > >> > uattr, unsigned int, siz
> > >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > >> > return -EPERM;
> > >> >
> > >> > + if (cmd == BPF_SYNCHRONIZE) {
> > >> > + if (uattr != NULL || size != 0)
> > >> > + return -EINVAL;
> > >> > + err = security_bpf(cmd, NULL, 0);
> > >> > + if (err < 0)
> > >> > + return err;
> > >> > + /* BPF programs are run with preempt disabled, so
> > >> > + * synchronize_sched is sufficient even with
> > >> > + * RCU_PREEMPT.
> > >> > + */
> > >> > + synchronize_sched();
> > >> > + return 0;
> > >>
> > >> I don't think it's necessary. sys_membarrier() can do this already
> > >> and some folks use it exactly for this use case.
> > >
> > > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > > though. No where does the manpage say membarrier should be implemented this
> > > way so what happens if the implementation changes?
> > >
> > > Further, membarrier manpage says that a memory barrier should be matched with
> > > a matching barrier. In this use case there is no matching barrier, so it
> > > makes it weirder.
> > >
> > > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> > > fragile to depend on it for this?
> > >
> > > case MEMBARRIER_CMD_GLOBAL:
> > > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > > if (tick_nohz_full_enabled())
> > > return -EINVAL;
> > > if (num_online_cpus() > 1)
> > > synchronize_sched();
> > > return 0;
> > >
> > >
> > > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> >
> > See commit 907565337
> > "Fix: Disable sys_membarrier when nohz_full is enabled"
> >
> > "Userspace applications should be allowed to expect the membarrier system
> > call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > nohz_full CPUs, but synchronize_sched() does not take those into
> > account."
> >
> > So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > only care about kernel preempt off critical sections.
>
> Mathieu, Thanks a lot for your reply. I understand what you said and agree
> with you. Slight OT, but I tried to go back to first principles and
> understand how membarrier() uses synchronize_sched() for the "slow path" and
> it didn't make immediate sense to me. Let me clarify my dillema..
>
> My understanding is membarrier's MEMBARRIER_CMD_GLOBAL will employ
> synchronize_sched to make sure all other CPUs aren't executing anymore in an
> section of usercode that happen to be accessing memory that was written to
> before the membarrier call was made. To do this, the system call will use
> synchronize_sched to try to guarantee that all user-mode execution that
> started before the membarrier call would be completed when the membarrier
> call returns. This guarantees that without using a real memory barrier on the
> "fast path", things work just fine and everyone wins.
>
> But, going through RCU code, I see that a "RCU-sched quiecent state" on a CPU
> may be reached when the CPU receives a timer tick while executing in user
> mode:
>
> void rcu_check_callbacks(int user)
> {
> trace_rcu_utilization(TPS("Start scheduler-tick"));
> increment_cpu_stall_ticks();
> if (user || rcu_is_cpu_rrupt_from_idle()) {
> [...]
> rcu_sched_qs();
> rcu_bh_qs();
>
> The problem I see is the CPU could be executing usermode code at the time of
> the RCU sched-QS. This IMO is enough reason for synchronize_sched() to
> return, because the CPU in question just reported a QS (assuming all other
> CPUs also happen to do so if they needed to).

This scenario will have inserted the needed smp_mb() into the userspace
instruction execution stream, as is required by the sys_membarrier
use cases.

> Then I am wondering how does the membarrier call even work, the tick could
> very well have interrupted the CPU while it was executing usermode code in
> the middle of a set of instructions performing memory accesses. Reporting a
> quiescent state at such an inopportune time would cause the membarrier call
> to prematurely return, no? Sorry if I missed something.

One way to think of sys_membarrier() is as something that promotes a
barrier() to an smp_mb(). This barrier then separates the target CPU's
accesses that the caller saw before the sys_membarrier() from that same
CPU's accesses that the caller will see after the sys_membarrier().

> The other question I have is about the whole "nohz-full doesn't work" thing.
> I didn't fully understand why. RCU is already tracking the state of nohz-full
> CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
> transitions to and from usermode even if the timer tick is turned off. So why
> would it not work?

In the nohz_full case, there is no need for sys_membarrier()'s call to
synchronize_sched() to interact directly with the nohz_full CPU. It
can instead look at the target CPU's dyntick-idle state, and that state
would potentially have been set in the dim distant past, thus having
no effect on the target CPU's current execution.

Thanx, Paul

2018-07-10 17:04:52

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Tue, Jul 10, 2018 at 09:42:12AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 09, 2018 at 10:13:47PM -0700, Joel Fernandes wrote:
> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > > ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
> > >
> > > > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > > >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > > >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > > >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > > >> > RCU data structure operations with respect to active programs. For
> > > >> > example, userspace can update a map->map entry to point to a new map,
> > > >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > > >> > complete, and then drain the old map without fear that BPF programs
> > > >> > may still be updating it.
> > > >> >
> > > >> > Signed-off-by: Daniel Colascione <[email protected]>
> > > >> > ---
> > > >> > include/uapi/linux/bpf.h | 1 +
> > > >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> > > >> > 2 files changed, 15 insertions(+)
> > > >> >
> > > >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > >> > index b7db3261c62d..4365c50e8055 100644
> > > >> > --- a/include/uapi/linux/bpf.h
> > > >> > +++ b/include/uapi/linux/bpf.h
> > > >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > > >> > BPF_BTF_LOAD,
> > > >> > BPF_BTF_GET_FD_BY_ID,
> > > >> > BPF_TASK_FD_QUERY,
> > > >> > + BPF_SYNCHRONIZE,
> > > >> > };
> > > >> >
> > > >> > enum bpf_map_type {
> > > >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > >> > index d10ecd78105f..60ec7811846e 100644
> > > >> > --- a/kernel/bpf/syscall.c
> > > >> > +++ b/kernel/bpf/syscall.c
> > > >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > > >> > uattr, unsigned int, siz
> > > >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > >> > return -EPERM;
> > > >> >
> > > >> > + if (cmd == BPF_SYNCHRONIZE) {
> > > >> > + if (uattr != NULL || size != 0)
> > > >> > + return -EINVAL;
> > > >> > + err = security_bpf(cmd, NULL, 0);
> > > >> > + if (err < 0)
> > > >> > + return err;
> > > >> > + /* BPF programs are run with preempt disabled, so
> > > >> > + * synchronize_sched is sufficient even with
> > > >> > + * RCU_PREEMPT.
> > > >> > + */
> > > >> > + synchronize_sched();
> > > >> > + return 0;
> > > >>
> > > >> I don't think it's necessary. sys_membarrier() can do this already
> > > >> and some folks use it exactly for this use case.
> > > >
> > > > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > > > though. No where does the manpage say membarrier should be implemented this
> > > > way so what happens if the implementation changes?
> > > >
> > > > Further, membarrier manpage says that a memory barrier should be matched with
> > > > a matching barrier. In this use case there is no matching barrier, so it
> > > > makes it weirder.
> > > >
> > > > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> > > > fragile to depend on it for this?
> > > >
> > > > case MEMBARRIER_CMD_GLOBAL:
> > > > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > > > if (tick_nohz_full_enabled())
> > > > return -EINVAL;
> > > > if (num_online_cpus() > 1)
> > > > synchronize_sched();
> > > > return 0;
> > > >
> > > >
> > > > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > >
> > > See commit 907565337
> > > "Fix: Disable sys_membarrier when nohz_full is enabled"
> > >
> > > "Userspace applications should be allowed to expect the membarrier system
> > > call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > > nohz_full CPUs, but synchronize_sched() does not take those into
> > > account."
> > >
> > > So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > > only care about kernel preempt off critical sections.
> >
> > Mathieu, Thanks a lot for your reply. I understand what you said and agree
> > with you. Slight OT, but I tried to go back to first principles and
> > understand how membarrier() uses synchronize_sched() for the "slow path" and
> > it didn't make immediate sense to me. Let me clarify my dillema..
> >
> > My understanding is membarrier's MEMBARRIER_CMD_GLOBAL will employ
> > synchronize_sched to make sure all other CPUs aren't executing anymore in an
> > section of usercode that happen to be accessing memory that was written to
> > before the membarrier call was made. To do this, the system call will use
> > synchronize_sched to try to guarantee that all user-mode execution that
> > started before the membarrier call would be completed when the membarrier
> > call returns. This guarantees that without using a real memory barrier on the
> > "fast path", things work just fine and everyone wins.
> >
> > But, going through RCU code, I see that a "RCU-sched quiecent state" on a CPU
> > may be reached when the CPU receives a timer tick while executing in user
> > mode:
> >
> > void rcu_check_callbacks(int user)
> > {
> > trace_rcu_utilization(TPS("Start scheduler-tick"));
> > increment_cpu_stall_ticks();
> > if (user || rcu_is_cpu_rrupt_from_idle()) {
> > [...]
> > rcu_sched_qs();
> > rcu_bh_qs();
> >
> > The problem I see is the CPU could be executing usermode code at the time of
> > the RCU sched-QS. This IMO is enough reason for synchronize_sched() to
> > return, because the CPU in question just reported a QS (assuming all other
> > CPUs also happen to do so if they needed to).
>
> This scenario will have inserted the needed smp_mb() into the userspace
> instruction execution stream, as is required by the sys_membarrier
> use cases.

Oh ok, that makes sense!

> > Then I am wondering how does the membarrier call even work, the tick could
> > very well have interrupted the CPU while it was executing usermode code in
> > the middle of a set of instructions performing memory accesses. Reporting a
> > quiescent state at such an inopportune time would cause the membarrier call
> > to prematurely return, no? Sorry if I missed something.
>
> One way to think of sys_membarrier() is as something that promotes a
> barrier() to an smp_mb(). This barrier then separates the target CPU's
> accesses that the caller saw before the sys_membarrier() from that same
> CPU's accesses that the caller will see after the sys_membarrier().

Got it!

> > The other question I have is about the whole "nohz-full doesn't work" thing.
> > I didn't fully understand why. RCU is already tracking the state of nohz-full
> > CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
> > transitions to and from usermode even if the timer tick is turned off. So why
> > would it not work?
>
> In the nohz_full case, there is no need for sys_membarrier()'s call to
> synchronize_sched() to interact directly with the nohz_full CPU. It
> can instead look at the target CPU's dyntick-idle state, and that state
> would potentially have been set in the dim distant past, thus having
> no effect on the target CPU's current execution.

In nohz-idle case though, there's nothing to promote the barrier() to
smp_mb() if you were to purely look at the dynticks-idle state on the
nohz-full CPU executing in user mode?

So then it makes sense to me now that nohz-full needs something to IPI that
CPU inorder to enforce the needed memory barrier and pure synchronize_sched()
wouldn't work. So then makes me think the expedited versions of
synchronize_sched should be able to do the job but I could off on a different
track..

Thanks a lot,

-Joel

2018-07-10 17:12:53

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Tue, Jul 10, 2018 at 09:57:44AM -0700, Joel Fernandes wrote:
> On Tue, Jul 10, 2018 at 09:42:12AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 09, 2018 at 10:13:47PM -0700, Joel Fernandes wrote:
> > > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > > > ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
> > > >
> > > > > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > > > >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > > > >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > > > >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > > > >> > RCU data structure operations with respect to active programs. For
> > > > >> > example, userspace can update a map->map entry to point to a new map,
> > > > >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > > > >> > complete, and then drain the old map without fear that BPF programs
> > > > >> > may still be updating it.
> > > > >> >
> > > > >> > Signed-off-by: Daniel Colascione <[email protected]>
> > > > >> > ---
> > > > >> > include/uapi/linux/bpf.h | 1 +
> > > > >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> > > > >> > 2 files changed, 15 insertions(+)
> > > > >> >
> > > > >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > >> > index b7db3261c62d..4365c50e8055 100644
> > > > >> > --- a/include/uapi/linux/bpf.h
> > > > >> > +++ b/include/uapi/linux/bpf.h
> > > > >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > > > >> > BPF_BTF_LOAD,
> > > > >> > BPF_BTF_GET_FD_BY_ID,
> > > > >> > BPF_TASK_FD_QUERY,
> > > > >> > + BPF_SYNCHRONIZE,
> > > > >> > };
> > > > >> >
> > > > >> > enum bpf_map_type {
> > > > >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > >> > index d10ecd78105f..60ec7811846e 100644
> > > > >> > --- a/kernel/bpf/syscall.c
> > > > >> > +++ b/kernel/bpf/syscall.c
> > > > >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > > > >> > uattr, unsigned int, siz
> > > > >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > > >> > return -EPERM;
> > > > >> >
> > > > >> > + if (cmd == BPF_SYNCHRONIZE) {
> > > > >> > + if (uattr != NULL || size != 0)
> > > > >> > + return -EINVAL;
> > > > >> > + err = security_bpf(cmd, NULL, 0);
> > > > >> > + if (err < 0)
> > > > >> > + return err;
> > > > >> > + /* BPF programs are run with preempt disabled, so
> > > > >> > + * synchronize_sched is sufficient even with
> > > > >> > + * RCU_PREEMPT.
> > > > >> > + */
> > > > >> > + synchronize_sched();
> > > > >> > + return 0;
> > > > >>
> > > > >> I don't think it's necessary. sys_membarrier() can do this already
> > > > >> and some folks use it exactly for this use case.
> > > > >
> > > > > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > > > > though. No where does the manpage say membarrier should be implemented this
> > > > > way so what happens if the implementation changes?
> > > > >
> > > > > Further, membarrier manpage says that a memory barrier should be matched with
> > > > > a matching barrier. In this use case there is no matching barrier, so it
> > > > > makes it weirder.
> > > > >
> > > > > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> > > > > fragile to depend on it for this?
> > > > >
> > > > > case MEMBARRIER_CMD_GLOBAL:
> > > > > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > > > > if (tick_nohz_full_enabled())
> > > > > return -EINVAL;
> > > > > if (num_online_cpus() > 1)
> > > > > synchronize_sched();
> > > > > return 0;
> > > > >
> > > > >
> > > > > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > > >
> > > > See commit 907565337
> > > > "Fix: Disable sys_membarrier when nohz_full is enabled"
> > > >
> > > > "Userspace applications should be allowed to expect the membarrier system
> > > > call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > > > nohz_full CPUs, but synchronize_sched() does not take those into
> > > > account."
> > > >
> > > > So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > > > only care about kernel preempt off critical sections.
> > >
> > > Mathieu, Thanks a lot for your reply. I understand what you said and agree
> > > with you. Slight OT, but I tried to go back to first principles and
> > > understand how membarrier() uses synchronize_sched() for the "slow path" and
> > > it didn't make immediate sense to me. Let me clarify my dillema..
> > >
> > > My understanding is membarrier's MEMBARRIER_CMD_GLOBAL will employ
> > > synchronize_sched to make sure all other CPUs aren't executing anymore in an
> > > section of usercode that happen to be accessing memory that was written to
> > > before the membarrier call was made. To do this, the system call will use
> > > synchronize_sched to try to guarantee that all user-mode execution that
> > > started before the membarrier call would be completed when the membarrier
> > > call returns. This guarantees that without using a real memory barrier on the
> > > "fast path", things work just fine and everyone wins.
> > >
> > > But, going through RCU code, I see that a "RCU-sched quiecent state" on a CPU
> > > may be reached when the CPU receives a timer tick while executing in user
> > > mode:
> > >
> > > void rcu_check_callbacks(int user)
> > > {
> > > trace_rcu_utilization(TPS("Start scheduler-tick"));
> > > increment_cpu_stall_ticks();
> > > if (user || rcu_is_cpu_rrupt_from_idle()) {
> > > [...]
> > > rcu_sched_qs();
> > > rcu_bh_qs();
> > >
> > > The problem I see is the CPU could be executing usermode code at the time of
> > > the RCU sched-QS. This IMO is enough reason for synchronize_sched() to
> > > return, because the CPU in question just reported a QS (assuming all other
> > > CPUs also happen to do so if they needed to).
> >
> > This scenario will have inserted the needed smp_mb() into the userspace
> > instruction execution stream, as is required by the sys_membarrier
> > use cases.
>
> Oh ok, that makes sense!
>
> > > Then I am wondering how does the membarrier call even work, the tick could
> > > very well have interrupted the CPU while it was executing usermode code in
> > > the middle of a set of instructions performing memory accesses. Reporting a
> > > quiescent state at such an inopportune time would cause the membarrier call
> > > to prematurely return, no? Sorry if I missed something.
> >
> > One way to think of sys_membarrier() is as something that promotes a
> > barrier() to an smp_mb(). This barrier then separates the target CPU's
> > accesses that the caller saw before the sys_membarrier() from that same
> > CPU's accesses that the caller will see after the sys_membarrier().
>
> Got it!
>
> > > The other question I have is about the whole "nohz-full doesn't work" thing.
> > > I didn't fully understand why. RCU is already tracking the state of nohz-full
> > > CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
> > > transitions to and from usermode even if the timer tick is turned off. So why
> > > would it not work?
> >
> > In the nohz_full case, there is no need for sys_membarrier()'s call to
> > synchronize_sched() to interact directly with the nohz_full CPU. It
> > can instead look at the target CPU's dyntick-idle state, and that state
> > would potentially have been set in the dim distant past, thus having
> > no effect on the target CPU's current execution.
>
> In nohz-idle case though, there's nothing to promote the barrier() to
> smp_mb() if you were to purely look at the dynticks-idle state on the
> nohz-full CPU executing in user mode?
>
> So then it makes sense to me now that nohz-full needs something to IPI that
> CPU inorder to enforce the needed memory barrier and pure synchronize_sched()
> wouldn't work. So then makes me think the expedited versions of
> synchronize_sched should be able to do the job but I could off on a different
> track..

The problem is that the expedited versions also check the dyntick-idle
state and don't touch idle (or nohz_full usermode) CPUs. This is by
design for the battery-powered embedded use case. ;-)

Thanx, Paul

> Thanks a lot,
>
> -Joel
>
>

2018-07-10 18:46:36

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Tue, Jul 10, 2018 at 10:12:29AM -0700, Paul E. McKenney wrote:
[..]
> > > > The other question I have is about the whole "nohz-full doesn't work" thing.
> > > > I didn't fully understand why. RCU is already tracking the state of nohz-full
> > > > CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
> > > > transitions to and from usermode even if the timer tick is turned off. So why
> > > > would it not work?
> > >
> > > In the nohz_full case, there is no need for sys_membarrier()'s call to
> > > synchronize_sched() to interact directly with the nohz_full CPU. It
> > > can instead look at the target CPU's dyntick-idle state, and that state
> > > would potentially have been set in the dim distant past, thus having
> > > no effect on the target CPU's current execution.
> >
> > In nohz-idle case though, there's nothing to promote the barrier() to
> > smp_mb() if you were to purely look at the dynticks-idle state on the
> > nohz-full CPU executing in user mode?
> >
> > So then it makes sense to me now that nohz-full needs something to IPI that
> > CPU inorder to enforce the needed memory barrier and pure synchronize_sched()
> > wouldn't work. So then makes me think the expedited versions of
> > synchronize_sched should be able to do the job but I could off on a different
> > track..
>
> The problem is that the expedited versions also check the dyntick-idle
> state and don't touch idle (or nohz_full usermode) CPUs. This is by
> design for the battery-powered embedded use case. ;-)

Oh ok! ;)

I guess there's also a MEMBARRIER_CMD_GLOBAL_EXPEDITED which seems to IPI
CPUs (I'm guessing regardless of dynticks state) and execute smp_mb within
the IPI so userspace can fallback to using that incase MEMBARRIER_CMD_GLOBAL
returns -EINVAL.

thanks!

-Joel

2018-07-10 18:51:32

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Tue, Jul 10, 2018 at 10:29:57AM -0700, Joel Fernandes wrote:
> On Tue, Jul 10, 2018 at 10:12:29AM -0700, Paul E. McKenney wrote:
> [..]
> > > > > The other question I have is about the whole "nohz-full doesn't work" thing.
> > > > > I didn't fully understand why. RCU is already tracking the state of nohz-full
> > > > > CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
> > > > > transitions to and from usermode even if the timer tick is turned off. So why
> > > > > would it not work?
> > > >
> > > > In the nohz_full case, there is no need for sys_membarrier()'s call to
> > > > synchronize_sched() to interact directly with the nohz_full CPU. It
> > > > can instead look at the target CPU's dyntick-idle state, and that state
> > > > would potentially have been set in the dim distant past, thus having
> > > > no effect on the target CPU's current execution.
> > >
> > > In nohz-idle case though, there's nothing to promote the barrier() to
> > > smp_mb() if you were to purely look at the dynticks-idle state on the
> > > nohz-full CPU executing in user mode?
> > >
> > > So then it makes sense to me now that nohz-full needs something to IPI that
> > > CPU inorder to enforce the needed memory barrier and pure synchronize_sched()
> > > wouldn't work. So then makes me think the expedited versions of
> > > synchronize_sched should be able to do the job but I could off on a different
> > > track..
> >
> > The problem is that the expedited versions also check the dyntick-idle
> > state and don't touch idle (or nohz_full usermode) CPUs. This is by
> > design for the battery-powered embedded use case. ;-)
>
> Oh ok! ;)
>
> I guess there's also a MEMBARRIER_CMD_GLOBAL_EXPEDITED which seems to IPI
> CPUs (I'm guessing regardless of dynticks state) and execute smp_mb within
> the IPI so userspace can fallback to using that incase MEMBARRIER_CMD_GLOBAL
> returns -EINVAL.

Yes, and this avoids IPIing idle CPUs via the ->mm checks. But it will
IPI nohz_full CPUs in that same process, as it must for correctness.

Thanx, Paul

2018-07-10 23:53:52

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 09, 2018 at 10:25:04PM -0700, Chenbo Feng wrote:
> On Mon, Jul 9, 2018 at 3:34 PM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Mon, Jul 09, 2018 at 03:21:43PM -0700, Daniel Colascione wrote:
> > > On Mon, Jul 9, 2018 at 3:10 PM, Alexei Starovoitov
> > > <[email protected]> wrote:
> > > > On Mon, Jul 09, 2018 at 02:36:37PM -0700, Daniel Colascione wrote:
> > > >> On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
> > > >> <[email protected]> wrote:
> > > >> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > > >> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes [email protected] wrote:
> > > >> >>
> > > >> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > > >> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > > >> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > > >> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > > >> >> >> > RCU data structure operations with respect to active programs. For
> > > >> >> >> > example, userspace can update a map->map entry to point to a new map,
> > > >> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > > >> >> >> > complete, and then drain the old map without fear that BPF programs
> > > >> >> >> > may still be updating it.
> > > >> >> >> >
> > > >> >> >> > Signed-off-by: Daniel Colascione <[email protected]>
> > > >> >> >> > ---
> > > >> >> >> > include/uapi/linux/bpf.h | 1 +
> > > >> >> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> > > >> >> >> > 2 files changed, 15 insertions(+)
> > > >> >> >> >
> > > >> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > >> >> >> > index b7db3261c62d..4365c50e8055 100644
> > > >> >> >> > --- a/include/uapi/linux/bpf.h
> > > >> >> >> > +++ b/include/uapi/linux/bpf.h
> > > >> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > > >> >> >> > BPF_BTF_LOAD,
> > > >> >> >> > BPF_BTF_GET_FD_BY_ID,
> > > >> >> >> > BPF_TASK_FD_QUERY,
> > > >> >> >> > + BPF_SYNCHRONIZE,
> > > >> >> >> > };
> > > >> >> >> >
> > > >> >> >> > enum bpf_map_type {
> > > >> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > >> >> >> > index d10ecd78105f..60ec7811846e 100644
> > > >> >> >> > --- a/kernel/bpf/syscall.c
> > > >> >> >> > +++ b/kernel/bpf/syscall.c
> > > >> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > > >> >> >> > uattr, unsigned int, siz
> > > >> >> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > >> >> >> > return -EPERM;
> > > >> >> >> >
> > > >> >> >> > + if (cmd == BPF_SYNCHRONIZE) {
> > > >> >> >> > + if (uattr != NULL || size != 0)
> > > >> >> >> > + return -EINVAL;
> > > >> >> >> > + err = security_bpf(cmd, NULL, 0);
> > > >> >> >> > + if (err < 0)
> > > >> >> >> > + return err;
> > > >> >> >> > + /* BPF programs are run with preempt disabled, so
> > > >> >> >> > + * synchronize_sched is sufficient even with
> > > >> >> >> > + * RCU_PREEMPT.
> > > >> >> >> > + */
> > > >> >> >> > + synchronize_sched();
> > > >> >> >> > + return 0;
> > > >> >> >>
> > > >> >> >> I don't think it's necessary. sys_membarrier() can do this already
> > > >> >> >> and some folks use it exactly for this use case.
> > > >> >> >
> > > >> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > > >> >> > though. No where does the manpage say membarrier should be implemented this
> > > >> >> > way so what happens if the implementation changes?
> > > >> >> >
> > > >> >> > Further, membarrier manpage says that a memory barrier should be matched with
> > > >> >> > a matching barrier. In this use case there is no matching barrier, so it
> > > >> >> > makes it weirder.
> > > >> >> >
> > > >> >> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> > > >> >> > fragile to depend on it for this?
> > > >> >> >
> > > >> >> > case MEMBARRIER_CMD_GLOBAL:
> > > >> >> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > > >> >> > if (tick_nohz_full_enabled())
> > > >> >> > return -EINVAL;
> > > >> >> > if (num_online_cpus() > 1)
> > > >> >> > synchronize_sched();
> > > >> >> > return 0;
> > > >> >> >
> > > >> >> >
> > > >> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > > >> >>
> > > >> >> See commit 907565337
> > > >> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> > > >> >>
> > > >> >> "Userspace applications should be allowed to expect the membarrier system
> > > >> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > > >> >> nohz_full CPUs, but synchronize_sched() does not take those into
> > > >> >> account."
> > > >> >>
> > > >> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > > >> >> only care about kernel preempt off critical sections.
> > > >> >>
> > > >> >> Clearly bpf code does not run in user-space, so it would "work".
> > > >> >>
> > > >> >> But the guarantees provided by membarrier are not to synchronize against
> > > >> >> preempt off per se. It's just that the current implementation happens to
> > > >> >> do that. The point of membarrier is to turn user-space memory barriers
> > > >> >> into compiler barriers.
> > > >> >>
> > > >> >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> > > >> >> ebpf is using, I would against using membarrier for this. I would rather
> > > >> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> > > >> >> implementation details to user-space, *and* you can eventually change you
> > > >> >> RCU implementation for e.g. SRCU in the future if needed.
> > > >> >
> > > >> > The point about future changes to underlying bpf mechanisms is valid.
> > > >> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> > > >> > that currently lasts the whole prog. We will have new prog types that won't
> > > >> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > > >> > when necessary.
> > > >> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> > > >> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > > >> > also won't make sense for the same reason.
> > > >> > What we can do it instead is to define synchronization barrier for
> > > >> > programs accessing maps. May be call it something like:
> > > >> > BPF_SYNC_MAP_ACCESS ?
> > > >>
> > > >> I'm not sure what you're proposing. In the case the commit message
> > > >> describes, a user-space program that wants to "drain" a map needs to
> > > >> be confident that the map won't change under it, even across multiple
> > > >> bpf system calls on that map. One way of doing that is to ensure that
> > > >> nothing that could possibly hold a reference to that map is still
> > > >> running. Are you proposing some kind of refcount-draining approach?
> > > >> Simple locking won't work, since BPF programs can't block, and I don't
> > > >> see right now how a simple barrier would help.
> > > >
> > > > I'm proposing few changes for your patch:
> > > > s/BPF_SYNCHRONIZE/BPF_SYNC_MAP_ACCESS/
> > > > and s/synchronize_sched/synchronize_rcu/
> > > > with detailed comment in uapi/bpf.h that has an example why folks
> > > > would want to use this new cmd.
> > >
> > > Thanks for clarifying.
> > >
> > > > I think the bpf maps will be rcu protected for foreseeable future
> > > > even when rcu_read_lock/unlock will be done by the programs instead of
> > > > kernel wrappers.
> > >
> > > Can we guarantee that we always obtain a map reference and dispose of
> > > that reference inside the same critical section?
> >
> > yep. the verifier will guarantee that.
> >
> > > If so, can BPF
> > > programs then disable preemption for as long as they'd like?
> >
> > you mean after the finish? no. only while running.
> > The verifier will match things like lookup/release, lock/unlock, preempt on/off
> > and will make sure there is no dangling preempt disable after program returns.
> >
> It might be a silly question but does this new bpf program type
> provide a way to customize where to hold the rcu_lock/unlock and
> enable/disable preemption when creating the program? Since the
> BPF_SYNC_MAP_ACCESS cmd sounds not very useful if it only protect each
> single map access instead of the whole program. For example, sometimes
> we have to read the same map multiple times when running a eBPF
> program and if the userspace changed the map value between two reads,
> it may cause troubles. It would be great if we can have a RCU lock on
> a block of eBPF program and in that way I think BPF_SYNC_MAP_ACCESS
> cmd should be enough in handling userspace kernel racing problems.

we need to make sure we have detailed description of BPF_SYNC_MAP_ACCESS
in uapi/bpf.h, since I feel the confusion regarding its usage is starting already.
This new cmd will only make sense for map-in-map type of maps.
Expecting that BPF_SYNC_MAP_ACCESS is somehow implies the end of
the program or doing some other map synchronization is not correct.
Commit log of this patch got it right:
"""
For example, userspace can update a map->map entry to point to a new map,
use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
complete, and then drain the old map without fear that BPF programs
may still be updating it.
"""
Now consider two cases:
1. single program does two lookups (inner and outer map) and the program is called
twice (as trigger on two events or it was tail-called or through prog_array)
2. future single program without rcu wrappers does:
rcu_lock + outer lookup + inner lookup + rcu_unlock + .. some other bpf stuff .. +
rcu_lock + outer lookup + inner lookup + rcu_unlock + .. yet another bpf stuff ..
All of the above as part of single program that is called once.
These two cases are the same from the point of view of BPF_SYNC_MAP_ACCESS
and completion of this cmd does not guarantee that program finished.
In both cases the user space will be correct to start draining the inner map
it replaced after BPF_SYNC_MAP_ACCESS is done
(assuming that implementation does synchronize_rcu I mentioned earlier)

Preemption on/off as done today by prog wrappers or in the future explicitly
by the prog code is orthogonal to this new cmd and map-in-map draining.

2018-07-11 02:47:39

by Lorenzo Colitti

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Wed, Jul 11, 2018 at 8:52 AM Alexei Starovoitov
<[email protected]> wrote:
>
> we need to make sure we have detailed description of BPF_SYNC_MAP_ACCESS
> in uapi/bpf.h, since I feel the confusion regarding its usage is starting already.
> This new cmd will only make sense for map-in-map type of maps.
> Expecting that BPF_SYNC_MAP_ACCESS is somehow implies the end of
> the program or doing some other map synchronization is not correct.
> Commit log of this patch got it right:
> """
> For example, userspace can update a map->map entry to point to a new map,
> use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> complete, and then drain the old map without fear that BPF programs
> may still be updating it.
> """

+1 for detailed documentation. For example, consider what happens if
we have two map fds, one active and one standby, and a map-in-map with
one element that contains a pointer to the currently-active map fd.
The kernel program might do:

=====
const int current_map_key = 1;
void *current_map = bpf_map_lookup_elem(outer_map, &current_map_key);

int stats_key = 42;
uint64_t *stats_value = bpf_map_lookup_elem(current_map, &stats_key);
__sync_fetch_and_add(&stats_value, 1);
=====

If a userspace does:

1. Write new fd to outer_map[1].
2. Call BPF_SYNC_MAP_ACCESS.
3. Start deleting everything in the old map.

How can we guarantee that the __sync_fetch_and_add will not add to the
old map? If it does, we'll lose data. Will the verifier automatically
hold the RCU lock for as long as a pointer to an inner map is valid?

2018-07-11 03:42:30

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Wed, Jul 11, 2018 at 11:46:19AM +0900, Lorenzo Colitti wrote:
> On Wed, Jul 11, 2018 at 8:52 AM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > we need to make sure we have detailed description of BPF_SYNC_MAP_ACCESS
> > in uapi/bpf.h, since I feel the confusion regarding its usage is starting already.
> > This new cmd will only make sense for map-in-map type of maps.
> > Expecting that BPF_SYNC_MAP_ACCESS is somehow implies the end of
> > the program or doing some other map synchronization is not correct.
> > Commit log of this patch got it right:
> > """
> > For example, userspace can update a map->map entry to point to a new map,
> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > complete, and then drain the old map without fear that BPF programs
> > may still be updating it.
> > """
>
> +1 for detailed documentation. For example, consider what happens if
> we have two map fds, one active and one standby, and a map-in-map with
> one element that contains a pointer to the currently-active map fd.

yes. that's exactly the use case that folks use.

> The kernel program might do:
>
> =====
> const int current_map_key = 1;
> void *current_map = bpf_map_lookup_elem(outer_map, &current_map_key);
>
> int stats_key = 42;
> uint64_t *stats_value = bpf_map_lookup_elem(current_map, &stats_key);
> __sync_fetch_and_add(&stats_value, 1);
> =====
>
> If a userspace does:
>
> 1. Write new fd to outer_map[1].
> 2. Call BPF_SYNC_MAP_ACCESS.
> 3. Start deleting everything in the old map.
>
> How can we guarantee that the __sync_fetch_and_add will not add to the
> old map?

without any changes to the kernel sys_membarrier will work.
And that's what folks use already.
BPF_SYNC_MAP_ACCESS implemented via synchronize_rcu() will work
as well whether in the current implementation where rcu_lock/unlock
is done outside of the program and in the future when
rcu_lock/unlock are called by the program itself.

> Will the verifier automatically
> hold the RCU lock for as long as a pointer to an inner map is valid?

the verifier will guarantee the equivalency of future explicit
lock/unlock by the program vs current situation of implicit
lock/unlock by the kernel.
The verifier will track that bpf_map_lookup_elem() is done
after rcu_lock and that the value returned by this helper is
not accessed after rcu_unlock. Baby steps of dataflow analysis.

2018-07-14 18:19:43

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Tue, Jul 10, 2018 at 08:40:19PM -0700, Alexei Starovoitov wrote:
[..]
> > The kernel program might do:
> >
> > =====
> > const int current_map_key = 1;
> > void *current_map = bpf_map_lookup_elem(outer_map, &current_map_key);
> >
> > int stats_key = 42;
> > uint64_t *stats_value = bpf_map_lookup_elem(current_map, &stats_key);
> > __sync_fetch_and_add(&stats_value, 1);
> > =====
> >
> > If a userspace does:
> >
> > 1. Write new fd to outer_map[1].
> > 2. Call BPF_SYNC_MAP_ACCESS.
> > 3. Start deleting everything in the old map.
> >
> > How can we guarantee that the __sync_fetch_and_add will not add to the
> > old map?
>
> without any changes to the kernel sys_membarrier will work.
> And that's what folks use already.
> BPF_SYNC_MAP_ACCESS implemented via synchronize_rcu() will work
> as well whether in the current implementation where rcu_lock/unlock
> is done outside of the program and in the future when
> rcu_lock/unlock are called by the program itself.

Cool Alexei and Lorenzo, sounds great to me. Daniel want to send a follow up
patch with BPF_SYNC_MAP_ACCESS changes then?

> > Will the verifier automatically
> > hold the RCU lock for as long as a pointer to an inner map is valid?
>
> the verifier will guarantee the equivalency of future explicit
> lock/unlock by the program vs current situation of implicit
> lock/unlock by the kernel.
> The verifier will track that bpf_map_lookup_elem() is done
> after rcu_lock and that the value returned by this helper is
> not accessed after rcu_unlock. Baby steps of dataflow analysis.

Nice!

By the way just curious I was briefly going through kernel/bpf/arraymap.c.
How are you protecting against load-store tearing of values of array map
updates/lookups?

For example, if userspace reads an array map at a particular index, while
another CPU is updating it, then userspace can read partial values /
half-updated values right? Since rcu_read_lock is in use, I was hoping to
find something like rcu_assign_pointer there to protect readers against
concurrent updates. Thanks for any clarification.

Regards,

- Joel

2018-07-16 15:30:58

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Sat, Jul 14, 2018 at 11:18 AM, Joel Fernandes <[email protected]> wrote:
> On Tue, Jul 10, 2018 at 08:40:19PM -0700, Alexei Starovoitov wrote:
> [..]
>> > The kernel program might do:
>> >
>> > =====
>> > const int current_map_key = 1;
>> > void *current_map = bpf_map_lookup_elem(outer_map, &current_map_key);
>> >
>> > int stats_key = 42;
>> > uint64_t *stats_value = bpf_map_lookup_elem(current_map, &stats_key);
>> > __sync_fetch_and_add(&stats_value, 1);
>> > =====
>> >
>> > If a userspace does:
>> >
>> > 1. Write new fd to outer_map[1].
>> > 2. Call BPF_SYNC_MAP_ACCESS.
>> > 3. Start deleting everything in the old map.
>> >
>> > How can we guarantee that the __sync_fetch_and_add will not add to the
>> > old map?
>>
>> without any changes to the kernel sys_membarrier will work.
>> And that's what folks use already.
>> BPF_SYNC_MAP_ACCESS implemented via synchronize_rcu() will work
>> as well whether in the current implementation where rcu_lock/unlock
>> is done outside of the program and in the future when
>> rcu_lock/unlock are called by the program itself.
>
> Cool Alexei and Lorenzo, sounds great to me. Daniel want to send a follow up
> patch with BPF_SYNC_MAP_ACCESS changes then?

Will do. Mind if I just mine this thread for the doc comment?

2018-07-16 20:24:14

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Mon, Jul 16, 2018 at 08:29:47AM -0700, Daniel Colascione wrote:
> On Sat, Jul 14, 2018 at 11:18 AM, Joel Fernandes <[email protected]> wrote:
> > On Tue, Jul 10, 2018 at 08:40:19PM -0700, Alexei Starovoitov wrote:
> > [..]
> >> > The kernel program might do:
> >> >
> >> > =====
> >> > const int current_map_key = 1;
> >> > void *current_map = bpf_map_lookup_elem(outer_map, &current_map_key);
> >> >
> >> > int stats_key = 42;
> >> > uint64_t *stats_value = bpf_map_lookup_elem(current_map, &stats_key);
> >> > __sync_fetch_and_add(&stats_value, 1);
> >> > =====
> >> >
> >> > If a userspace does:
> >> >
> >> > 1. Write new fd to outer_map[1].
> >> > 2. Call BPF_SYNC_MAP_ACCESS.
> >> > 3. Start deleting everything in the old map.
> >> >
> >> > How can we guarantee that the __sync_fetch_and_add will not add to the
> >> > old map?
> >>
> >> without any changes to the kernel sys_membarrier will work.
> >> And that's what folks use already.
> >> BPF_SYNC_MAP_ACCESS implemented via synchronize_rcu() will work
> >> as well whether in the current implementation where rcu_lock/unlock
> >> is done outside of the program and in the future when
> >> rcu_lock/unlock are called by the program itself.
> >
> > Cool Alexei and Lorenzo, sounds great to me. Daniel want to send a follow up
> > patch with BPF_SYNC_MAP_ACCESS changes then?
>
> Will do. Mind if I just mine this thread for the doc comment?

Do you mean the changelog? Yes I believe you could use the discussion in this
thread for the rationale as Alexei described.

thanks,

- Joel

2018-07-26 16:53:57

[permalink] [raw]

Subject: [PATCH v2] Add BPF_SYNCHRONIZE_MAPS bpf(2) command

BPF_SYNCHRONIZE_MAPS waits for the release of any references to a BPF
map made by a BPF program that is running at the time the
BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command is
to provide a means for userspace to replace a BPF map with another,
newer version, then ensure that no component is still using the "old"
map before manipulating the "old" map in some way.

Signed-off-by: Daniel Colascione <[email protected]>
---
include/uapi/linux/bpf.h | 9 +++++++++
kernel/bpf/syscall.c | 13 +++++++++++++
2 files changed, 22 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b7db3261c62d..5b27e9117d3e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -75,6 +75,14 @@ struct bpf_lpm_trie_key {
__u8 data[0]; /* Arbitrary size */
};

+/* BPF_SYNCHRONIZE_MAPS waits for the release of any references to a
+ * BPF map made by a BPF program that is running at the time the
+ * BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command
+ * is to provide a means for userspace to replace a BPF map with
+ * another, newer version, then ensure that no component is still
+ * using the "old" map before manipulating the "old" map in some way.
+ */
+
/* BPF syscall commands, see bpf(2) man-page for details. */
enum bpf_cmd {
BPF_MAP_CREATE,
@@ -98,6 +106,7 @@ enum bpf_cmd {
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
+ BPF_SYNCHRONIZE_MAPS,
};

enum bpf_map_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a31a1ba0f8ea..8bbd1a5d01d1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2274,6 +2274,19 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
return -EPERM;

+ if (cmd == BPF_SYNCHRONIZE_MAPS) {
+ if (uattr != NULL || size != 0)
+ return -EINVAL;
+ err = security_bpf(cmd, NULL, 0);
+ if (err < 0)
+ return err;
+ /* BPF programs always enter a critical section while
+ * they have a map reference outstanding.
+ */
+ synchronize_rcu();
+ return 0;
+ }
+
err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
if (err)
return err;
--
2.18.0.233.g985f88cf7e-goog

2018-07-27 19:19:26

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Sat, Jul 14, 2018 at 11:18 AM, Joel Fernandes <[email protected]>
wrote:

> By the way just curious I was briefly going through kernel/bpf/arraymap.c.
> How are you protecting against load-store tearing of values of array map
> updates/lookups?
>
> For example, if userspace reads an array map at a particular index, while
> another CPU is updating it, then userspace can read partial values /
> half-updated values right? Since rcu_read_lock is in use, I was hoping to
> find something like rcu_assign_pointer there to protect readers against
> concurrent updates. Thanks for any clarification.

I'm also curious about the answer to this question.

2018-07-29 15:53:29

[permalink] [raw]

Subject: Re: [PATCH v2] Add BPF_SYNCHRONIZE_MAPS bpf(2) command

On Thu, Jul 26, 2018 at 7:51 PM, Daniel Colascione <[email protected]> wrote:
> BPF_SYNCHRONIZE_MAPS waits for the release of any references to a BPF
> map made by a BPF program that is running at the time the
> BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command is
> to provide a means for userspace to replace a BPF map with another,
> newer version, then ensure that no component is still using the "old"
> map before manipulating the "old" map in some way.
>
> Signed-off-by: Daniel Colascione <[email protected]>
> ---
> include/uapi/linux/bpf.h | 9 +++++++++
> kernel/bpf/syscall.c | 13 +++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b7db3261c62d..5b27e9117d3e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -75,6 +75,14 @@ struct bpf_lpm_trie_key {
> __u8 data[0]; /* Arbitrary size */
> };
>
> +/* BPF_SYNCHRONIZE_MAPS waits for the release of any references to a
> + * BPF map made by a BPF program that is running at the time the
> + * BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command

that doesn't sound right to me.
such command won't wait for the release of the references.
in case of map-in-map the program does not hold
the references to inner map (only to outer map).

> + * is to provide a means for userspace to replace a BPF map with
> + * another, newer version, then ensure that no component is still
> + * using the "old" map before manipulating the "old" map in some way.
> + */

that's correct, but the whole paragraph still reads too
'generic' which will lead to confusion,
whereas the use case is map-in-map only.
I think bpf_sync_inner_map or
bpf_sync_map_in_map would be better
choices for the name.

btw i don't have reliable internet access at the moment,
so pls excuse the delays.

2018-07-29 15:58:30

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Fri, Jul 27, 2018 at 10:17 PM, Daniel Colascione <[email protected]> wrote:
> On Sat, Jul 14, 2018 at 11:18 AM, Joel Fernandes <[email protected]>
> wrote:
>>
>> By the way just curious I was briefly going through kernel/bpf/arraymap.c.
>> How are you protecting against load-store tearing of values of array map
>> updates/lookups?
>>
>> For example, if userspace reads an array map at a particular index, while
>> another CPU is updating it, then userspace can read partial values /
>> half-updated values right? Since rcu_read_lock is in use, I was hoping to
>> find something like rcu_assign_pointer there to protect readers against
>> concurrent updates. Thanks for any clarification.
>
>
> I'm also curious about the answer to this question.

i'm not sure I understand the question.
bpf_map_type_array is a C-like array.
There is no locking of elements.
If single program executing on two cpus
and accesses the same value it will collide.
Same goes for user space vs prog parallel access.
bpf long_memcpy is an attempt to provide minimal
level of automicity when values are aligned and
size==long.

2018-07-29 20:54:12

[permalink] [raw]

Subject: Re: [PATCH v2] Add BPF_SYNCHRONIZE_MAPS bpf(2) command

On Sun, Jul 29, 2018 at 8:51 AM, Alexei Starovoitov <
[email protected]> wrote:

> On Thu, Jul 26, 2018 at 7:51 PM, Daniel Colascione <[email protected]>
> wrote:
> > BPF_SYNCHRONIZE_MAPS waits for the release of any references to a BPF
> > map made by a BPF program that is running at the time the
> > BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command is
> > to provide a means for userspace to replace a BPF map with another,
> > newer version, then ensure that no component is still using the "old"
> > map before manipulating the "old" map in some way.
> >
> > Signed-off-by: Daniel Colascione <[email protected]>
> > ---
> > include/uapi/linux/bpf.h | 9 +++++++++
> > kernel/bpf/syscall.c | 13 +++++++++++++
> > 2 files changed, 22 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index b7db3261c62d..5b27e9117d3e 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -75,6 +75,14 @@ struct bpf_lpm_trie_key {
> > __u8 data[0]; /* Arbitrary size */
> > };
> >
> > +/* BPF_SYNCHRONIZE_MAPS waits for the release of any references to a
> > + * BPF map made by a BPF program that is running at the time the
> > + * BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command
>
> that doesn't sound right to me.
> such command won't wait for the release of the references.
> in case of map-in-map the program does not hold
> the references to inner map (only to outer map).
>

Well, today that's the guarantee it provides. :-) I figured it'd be simpler
to explain the guarantee this way.

How about "waits for the release of any reference to a map obtained from
another map"?

>
> > + * is to provide a means for userspace to replace a BPF map with
> > + * another, newer version, then ensure that no component is still
> > + * using the "old" map before manipulating the "old" map in some way.
> > + */
>
> that's correct, but the whole paragraph still reads too
> 'generic' which will lead to confusion,
> whereas the use case is map-in-map only.
> I think bpf_sync_inner_map or
> bpf_sync_map_in_map would be better
> choices for the name.

I worry that a name like that would be too specific.

2018-07-29 21:01:00

[permalink] [raw]

Subject: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES waits for the release of all
references to maps active at the instant the
BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES is
issued. BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES waits only for the
expiration of map references obtained by BPF programs from other maps.

The purpose of this command is to provide a means for userspace to
replace a BPF map with another, newer version, then ensure that no
component is still using the "old" map before manipulating the "old"
map in some way.

Signed-off-by: Daniel Colascione <[email protected]>
---
include/uapi/linux/bpf.h | 14 ++++++++++++++
kernel/bpf/syscall.c | 13 +++++++++++++
2 files changed, 27 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b7db3261c62d..ca3cfca76edc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -75,6 +75,19 @@ struct bpf_lpm_trie_key {
__u8 data[0]; /* Arbitrary size */
};

+/* BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES waits for the release of all
+ * references to maps active at the instant the
+ * BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES is
+ * issued. BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES waits only for the
+ * expiration of map references obtained by BPF programs from
+ * other maps.
+ *
+ * The purpose of this command is to provide a means for userspace to
+ * replace a BPF map with another, newer version, then ensure that no
+ * component is still using the "old" map before manipulating the
+ * "old" map in some way.
+ */
+
/* BPF syscall commands, see bpf(2) man-page for details. */
enum bpf_cmd {
BPF_MAP_CREATE,
@@ -98,6 +111,7 @@ enum bpf_cmd {
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
+ BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES,
};

enum bpf_map_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a31a1ba0f8ea..bc9a0713f47d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2274,6 +2274,19 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
return -EPERM;

+ if (cmd == BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES) {
+ if (uattr != NULL || size != 0)
+ return -EINVAL;
+ err = security_bpf(cmd, NULL, 0);
+ if (err < 0)
+ return err;
+ /* BPF programs always enter a critical section while
+ * they have a map reference outstanding.
+ */
+ synchronize_rcu();
+ return 0;
+ }
+
err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
if (err)
return err;
--
2.18.0.345.g5c9ce644c3-goog

2018-07-30 10:26:51

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Mon, Jul 30, 2018 at 3:04 AM, Daniel Borkmann <[email protected]> wrote:
> Hmm, I don't think such UAPI as above is future-proof. In case we would want
> a similar mechanism in future for other maps, we would need a whole new bpf
> command or reuse BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES as a workaround though
> the underlying map may not even be a map-to-map. Additionally, we don't have
> any map object at hand in the above, so we couldn't make any finer grained
> decisions either. Something like below would be more suitable and leaves room
> for extending this further in future.

YAGNI. Your proposed mechanism doesn't add anything under the current
implementation. It's also not clear how a map-specific synchronization
command is supposed to work in cases where we swap multiple map
references. Do we synchronize_rcu multiple times? Why would we impose
that inefficiency just for the sake of some non-specific future
extensibility? Add some kind of batching layer? The current approach
works for the anticipated use cases.

While my preference is not to talk about map-to-maps at all in the
user API and instead spec the thing as talking about map references in
general, I'd rather have something that talks about
references-to-maps-acquired-from-maps than this interface.

2018-07-30 10:34:09

by Daniel Borkmann

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On 07/29/2018 10:58 PM, Daniel Colascione wrote:
> BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES waits for the release of all
> references to maps active at the instant the
> BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES is
> issued. BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES waits only for the
> expiration of map references obtained by BPF programs from other maps.
>
> The purpose of this command is to provide a means for userspace to
> replace a BPF map with another, newer version, then ensure that no
> component is still using the "old" map before manipulating the "old"
> map in some way.
>
> Signed-off-by: Daniel Colascione <[email protected]>
> ---
> include/uapi/linux/bpf.h | 14 ++++++++++++++
> kernel/bpf/syscall.c | 13 +++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b7db3261c62d..ca3cfca76edc 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -75,6 +75,19 @@ struct bpf_lpm_trie_key {
> __u8 data[0]; /* Arbitrary size */
> };
>
> +/* BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES waits for the release of all
> + * references to maps active at the instant the
> + * BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES is
> + * issued. BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES waits only for the
> + * expiration of map references obtained by BPF programs from
> + * other maps.
> + *
> + * The purpose of this command is to provide a means for userspace to
> + * replace a BPF map with another, newer version, then ensure that no
> + * component is still using the "old" map before manipulating the
> + * "old" map in some way.
> + */
> +
> /* BPF syscall commands, see bpf(2) man-page for details. */
> enum bpf_cmd {
> BPF_MAP_CREATE,
> @@ -98,6 +111,7 @@ enum bpf_cmd {
> BPF_BTF_LOAD,
> BPF_BTF_GET_FD_BY_ID,
> BPF_TASK_FD_QUERY,
> + BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES,
> };
>
> enum bpf_map_type {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index a31a1ba0f8ea..bc9a0713f47d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2274,6 +2274,19 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> return -EPERM;
>
> + if (cmd == BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES) {
> + if (uattr != NULL || size != 0)
> + return -EINVAL;
> + err = security_bpf(cmd, NULL, 0);
> + if (err < 0)
> + return err;
> + /* BPF programs always enter a critical section while
> + * they have a map reference outstanding.
> + */
> + synchronize_rcu();
> + return 0;
> + }
> +
> err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
> if (err)
> return err;
>

Hmm, I don't think such UAPI as above is future-proof. In case we would want
a similar mechanism in future for other maps, we would need a whole new bpf
command or reuse BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES as a workaround though
the underlying map may not even be a map-to-map. Additionally, we don't have
any map object at hand in the above, so we couldn't make any finer grained
decisions either. Something like below would be more suitable and leaves room
for extending this further in future.

Thanks,
Daniel

From 8dfea71b73fa0d402633b76f78c106e82a7a5007 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <[email protected]>
Date: Mon, 30 Jul 2018 11:47:37 +0200
Subject: [PATCH] sync map refs

Signed-off-by: Daniel Borkmann <[email protected]>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 1 +
kernel/bpf/arraymap.c | 1 +
kernel/bpf/hashtab.c | 1 +
kernel/bpf/map_in_map.c | 6 ++++++
kernel/bpf/map_in_map.h | 1 +
kernel/bpf/syscall.c | 24 ++++++++++++++++++++++++
7 files changed, 35 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5b5ad95..7b51f86 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -34,6 +34,7 @@ struct bpf_map_ops {
void (*map_free)(struct bpf_map *map);
int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
void (*map_release_uref)(struct bpf_map *map);
+ int (*map_sync_refs)(struct bpf_map *map);

/* funcs callable from userspace and from eBPF programs */
void *(*map_lookup_elem)(struct bpf_map *map, void *key);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8701139..e6ec1de 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -98,6 +98,7 @@ enum bpf_cmd {
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
+ BPF_MAP_SYNC_REFS,
};

enum bpf_map_type {
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 544e58f..ddaf42a 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -748,5 +748,6 @@ const struct bpf_map_ops array_of_maps_map_ops = {
.map_fd_get_ptr = bpf_map_fd_get_ptr,
.map_fd_put_ptr = bpf_map_fd_put_ptr,
.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
+ .map_sync_refs = bpf_map_sync_refs,
.map_gen_lookup = array_of_map_gen_lookup,
};
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 513d9df..05380ea 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1407,5 +1407,6 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
.map_fd_get_ptr = bpf_map_fd_get_ptr,
.map_fd_put_ptr = bpf_map_fd_put_ptr,
.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
+ .map_sync_refs = bpf_map_sync_refs,
.map_gen_lookup = htab_of_map_gen_lookup,
};
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 1da5746..698a50f 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -96,6 +96,12 @@ void bpf_map_fd_put_ptr(void *ptr)
bpf_map_put(ptr);
}

+int bpf_map_sync_refs(struct bpf_map *map)
+{
+ synchronize_rcu();
+ return 0;
+}
+
u32 bpf_map_fd_sys_lookup_elem(void *ptr)
{
return ((struct bpf_map *)ptr)->id;
diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
index 6183db9..ac02456 100644
--- a/kernel/bpf/map_in_map.h
+++ b/kernel/bpf/map_in_map.h
@@ -19,6 +19,7 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
int ufd);
void bpf_map_fd_put_ptr(void *ptr);
+int bpf_map_sync_refs(struct bpf_map *map);
u32 bpf_map_fd_sys_lookup_elem(void *ptr);

#endif
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a31a1ba..b1286cc 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -896,6 +896,27 @@ static int map_get_next_key(union bpf_attr *attr)
return err;
}

+#define BPF_MAP_SYNC_REFS_LAST_FIELD map_fd
+
+static int map_sync_refs(union bpf_attr *attr)
+{
+ int err = -ENOTSUPP, ufd = attr->map_fd;
+ struct bpf_map *map;
+ struct fd f;
+
+ if (CHECK_ATTR(BPF_MAP_SYNC_REFS))
+ return -EINVAL;
+
+ f = fdget(ufd);
+ map = __bpf_map_get(f);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+ if (map->ops->map_sync_refs)
+ err = map->ops->map_sync_refs(map);
+ fdput(f);
+ return err;
+}
+
static const struct bpf_prog_ops * const bpf_prog_types[] = {
#define BPF_PROG_TYPE(_id, _name) \
[_id] = & _name ## _prog_ops,
@@ -2303,6 +2324,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_MAP_GET_NEXT_KEY:
err = map_get_next_key(&attr);
break;
+ case BPF_MAP_SYNC_REFS:
+ err = map_sync_refs(&attr);
+ break;
case BPF_PROG_LOAD:
err = bpf_prog_load(&attr);
break;
--
2.9.5

2018-07-30 22:28:09

[permalink] [raw]

Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command

On Sun, Jul 29, 2018 at 06:57:06PM +0300, Alexei Starovoitov wrote:
> On Fri, Jul 27, 2018 at 10:17 PM, Daniel Colascione <[email protected]> wrote:
> > On Sat, Jul 14, 2018 at 11:18 AM, Joel Fernandes <[email protected]>
> > wrote:
> >>
> >> By the way just curious I was briefly going through kernel/bpf/arraymap.c.
> >> How are you protecting against load-store tearing of values of array map
> >> updates/lookups?
> >>
> >> For example, if userspace reads an array map at a particular index, while
> >> another CPU is updating it, then userspace can read partial values /
> >> half-updated values right? Since rcu_read_lock is in use, I was hoping to
> >> find something like rcu_assign_pointer there to protect readers against
> >> concurrent updates. Thanks for any clarification.
> >
> >
> > I'm also curious about the answer to this question.
>
> i'm not sure I understand the question.
> bpf_map_type_array is a C-like array.
> There is no locking of elements.
> If single program executing on two cpus
> and accesses the same value it will collide.
> Same goes for user space vs prog parallel access.
> bpf long_memcpy is an attempt to provide minimal
> level of automicity when values are aligned and
> size==long.

Thanks for the answer. I think you answered the question.

Regards,

- Joel

2018-07-31 00:27:50

by Jakub Kicinski

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Mon, 30 Jul 2018 03:25:43 -0700, Daniel Colascione wrote:
> On Mon, Jul 30, 2018 at 3:04 AM, Daniel Borkmann <[email protected]> wrote:
> > Hmm, I don't think such UAPI as above is future-proof. In case we would want
> > a similar mechanism in future for other maps, we would need a whole new bpf
> > command or reuse BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES as a workaround though
> > the underlying map may not even be a map-to-map. Additionally, we don't have
> > any map object at hand in the above, so we couldn't make any finer grained
> > decisions either. Something like below would be more suitable and leaves room
> > for extending this further in future.
>
> YAGNI. Your proposed mechanism doesn't add anything under the current
> implementation.

FWIW in case of HW offload targeting a particular map may allow users
to avoid a potentially slow sync with all the devices on the system.

> It's also not clear how a map-specific synchronization
> command is supposed to work in cases where we swap multiple map
> references. Do we synchronize_rcu multiple times? Why would we impose
> that inefficiency just for the sake of some non-specific future
> extensibility? Add some kind of batching layer? The current approach
> works for the anticipated use cases.
>
> While my preference is not to talk about map-to-maps at all in the
> user API and instead spec the thing as talking about map references in
> general, I'd rather have something that talks about
> references-to-maps-acquired-from-maps than this interface.

2018-07-31 00:34:53

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Mon, Jul 30, 2018 at 5:26 PM, Jakub Kicinski
<[email protected]> wrote:
>
> On Mon, 30 Jul 2018 03:25:43 -0700, Daniel Colascione wrote:
> > On Mon, Jul 30, 2018 at 3:04 AM, Daniel Borkmann <[email protected]> wrote:
> > > Hmm, I don't think such UAPI as above is future-proof. In case we would want
> > > a similar mechanism in future for other maps, we would need a whole new bpf
> > > command or reuse BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES as a workaround though
> > > the underlying map may not even be a map-to-map. Additionally, we don't have
> > > any map object at hand in the above, so we couldn't make any finer grained
> > > decisions either. Something like below would be more suitable and leaves room
> > > for extending this further in future.
> >
> > YAGNI. Your proposed mechanism doesn't add anything under the current
> > implementation.
>
> FWIW in case of HW offload targeting a particular map may allow users
> to avoid a potentially slow sync with all the devices on the system.

Sure. But such a thing doesn't exist right now (right?), and we can
add that more-efficient-in-that-one-case BPF interface when it lands.
I'd rather keep things simple for now.

2018-07-31 00:46:57

by Jakub Kicinski

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Mon, 30 Jul 2018 17:33:39 -0700, Daniel Colascione wrote:
> On Mon, Jul 30, 2018 at 5:26 PM, Jakub Kicinski wrote:
> > On Mon, 30 Jul 2018 03:25:43 -0700, Daniel Colascione wrote:
> > > On Mon, Jul 30, 2018 at 3:04 AM, Daniel Borkmann <[email protected]> wrote:
> > > > Hmm, I don't think such UAPI as above is future-proof. In case we would want
> > > > a similar mechanism in future for other maps, we would need a whole new bpf
> > > > command or reuse BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES as a workaround though
> > > > the underlying map may not even be a map-to-map. Additionally, we don't have
> > > > any map object at hand in the above, so we couldn't make any finer grained
> > > > decisions either. Something like below would be more suitable and leaves room
> > > > for extending this further in future.
> > >
> > > YAGNI. Your proposed mechanism doesn't add anything under the current
> > > implementation.
> >
> > FWIW in case of HW offload targeting a particular map may allow users
> > to avoid a potentially slow sync with all the devices on the system.
>
> Sure. But such a thing doesn't exist right now (right?), and we can
> add that more-efficient-in-that-one-case BPF interface when it lands.
> I'd rather keep things simple for now.

You mean map-in-map offload doesn't exist today? True, but it's on the
roadmap for Netronome.

Tangentially it would be really useful for us to have a "the map has
actually been freed" notification/barrier. We have complaints of users
creating huge maps and then trying to free and recreate them quickly,
and the creation part failing with -ENOMEM, because the free from a
workqueue didn't run, yet :( It'd probably require a different API to
solve than what's discussed here, but since we're talking about syncing
things I thought I'd put it out there...

2018-07-31 00:51:23

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Mon, Jul 30, 2018 at 5:45 PM, Jakub Kicinski
<[email protected]> wrote:
> On Mon, 30 Jul 2018 17:33:39 -0700, Daniel Colascione wrote:
>> On Mon, Jul 30, 2018 at 5:26 PM, Jakub Kicinski wrote:
>> > On Mon, 30 Jul 2018 03:25:43 -0700, Daniel Colascione wrote:
>> > > On Mon, Jul 30, 2018 at 3:04 AM, Daniel Borkmann <[email protected]> wrote:
>> > > > Hmm, I don't think such UAPI as above is future-proof. In case we would want
>> > > > a similar mechanism in future for other maps, we would need a whole new bpf
>> > > > command or reuse BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES as a workaround though
>> > > > the underlying map may not even be a map-to-map. Additionally, we don't have
>> > > > any map object at hand in the above, so we couldn't make any finer grained
>> > > > decisions either. Something like below would be more suitable and leaves room
>> > > > for extending this further in future.
>> > >
>> > > YAGNI. Your proposed mechanism doesn't add anything under the current
>> > > implementation.
>> >
>> > FWIW in case of HW offload targeting a particular map may allow users
>> > to avoid a potentially slow sync with all the devices on the system.
>>
>> Sure. But such a thing doesn't exist right now (right?), and we can
>> add that more-efficient-in-that-one-case BPF interface when it lands.
>> I'd rather keep things simple for now.
>
> You mean map-in-map offload doesn't exist today? True, but it's on the
> roadmap for Netronome.

Sounds cool. I'd still wait until map-in-map offload lands until
adding the map-specific API though.

> Tangentially it would be really useful for us to have a "the map has
> actually been freed" notification/barrier. We have complaints of users
> creating huge maps and then trying to free and recreate them quickly,
> and the creation part failing with -ENOMEM, because the free from a
> workqueue didn't run, yet :( It'd probably require a different API to
> solve than what's discussed here, but since we're talking about syncing
> things I thought I'd put it out there...

Yeah. What you're talking about is what I meant upthread when I
briefly mentioned a "refcount draining approach" --- you'd block until
all references except your own to a map disappeared.

2018-07-31 01:16:42

by Jakub Kicinski

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Mon, 30 Jul 2018 17:50:15 -0700, Daniel Colascione wrote:
> > Tangentially it would be really useful for us to have a "the map has
> > actually been freed" notification/barrier. We have complaints of users
> > creating huge maps and then trying to free and recreate them quickly,
> > and the creation part failing with -ENOMEM, because the free from a
> > workqueue didn't run, yet :( It'd probably require a different API to
> > solve than what's discussed here, but since we're talking about syncing
> > things I thought I'd put it out there...
>
> Yeah. What you're talking about is what I meant upthread when I
> briefly mentioned a "refcount draining approach" --- you'd block until
> all references except your own to a map disappeared.

I don't think so. The ref count goes to 0, work gets scheduled to call
free, work runs, free gets called, device memory gets freed. Now the
memory can be reused. As long as there are any refs we can't free
memory, unless we want to make the semantics really ugly.

But as I said, only superficially related to this patch. Perhaps
solution will naturally fall out of an API to query the device
capabilities and resources, if we ever get there.

2018-07-31 02:03:15

[permalink] [raw]

Subject: Re: [PATCH v2] Add BPF_SYNCHRONIZE_MAPS bpf(2) command

On Sun, Jul 29, 2018 at 06:51:18PM +0300, Alexei Starovoitov wrote:
> On Thu, Jul 26, 2018 at 7:51 PM, Daniel Colascione <[email protected]> wrote:
> > BPF_SYNCHRONIZE_MAPS waits for the release of any references to a BPF
> > map made by a BPF program that is running at the time the
> > BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command is
> > to provide a means for userspace to replace a BPF map with another,
> > newer version, then ensure that no component is still using the "old"
> > map before manipulating the "old" map in some way.
> >
> > Signed-off-by: Daniel Colascione <[email protected]>
> > ---
> > include/uapi/linux/bpf.h | 9 +++++++++
> > kernel/bpf/syscall.c | 13 +++++++++++++
> > 2 files changed, 22 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index b7db3261c62d..5b27e9117d3e 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -75,6 +75,14 @@ struct bpf_lpm_trie_key {
> > __u8 data[0]; /* Arbitrary size */
> > };
> >
> > +/* BPF_SYNCHRONIZE_MAPS waits for the release of any references to a
> > + * BPF map made by a BPF program that is running at the time the
> > + * BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command
>
> that doesn't sound right to me.
> such command won't wait for the release of the references.
> in case of map-in-map the program does not hold
> the references to inner map (only to outer map).

I didn't follow this completely.

The userspace program is using the inner map per your description of the
algorithm for using map-in-map to solve the race conditions that this patch
is trying to address:

If you don't mind, I copy-pasted it below from your netdev post:

if you use map-in-map you don't need extra boolean map.
0. bpf prog can do
inner_map = lookup(map_in_map, key=0);
lookup(inner_map, your_real_key);
1. user space writes into map_in_map[0] <- FD of new map
2. some cpus are using old inner map and some a new
3. user space does sys_membarrier(CMD_GLOBAL) which will do synchronize_sched()
which in CONFIG_PREEMPT_NONE=y servers is the same as synchronize_rcu()
which will guarantee that progs finished.
4. scan old inner map

In step 2, as you mentioned there are CPUs using different inner maps. So
could you clarify how the synchronize_rcu mechanism will even work if you're
now saying "program does not hold references to the inner maps"?

Thanks!

- Joel

2018-07-31 02:08:29

[permalink] [raw]

Subject: Re: [PATCH v2] Add BPF_SYNCHRONIZE_MAPS bpf(2) command

On Mon, Jul 30, 2018 at 07:01:22PM -0700, Joel Fernandes wrote:
> On Sun, Jul 29, 2018 at 06:51:18PM +0300, Alexei Starovoitov wrote:
> > On Thu, Jul 26, 2018 at 7:51 PM, Daniel Colascione <[email protected]> wrote:
> > > BPF_SYNCHRONIZE_MAPS waits for the release of any references to a BPF
> > > map made by a BPF program that is running at the time the
> > > BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command is
> > > to provide a means for userspace to replace a BPF map with another,
> > > newer version, then ensure that no component is still using the "old"
> > > map before manipulating the "old" map in some way.
> > >
> > > Signed-off-by: Daniel Colascione <[email protected]>
> > > ---
> > > include/uapi/linux/bpf.h | 9 +++++++++
> > > kernel/bpf/syscall.c | 13 +++++++++++++
> > > 2 files changed, 22 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index b7db3261c62d..5b27e9117d3e 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -75,6 +75,14 @@ struct bpf_lpm_trie_key {
> > > __u8 data[0]; /* Arbitrary size */
> > > };
> > >
> > > +/* BPF_SYNCHRONIZE_MAPS waits for the release of any references to a
> > > + * BPF map made by a BPF program that is running at the time the
> > > + * BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command
> >
> > that doesn't sound right to me.
> > such command won't wait for the release of the references.
> > in case of map-in-map the program does not hold
> > the references to inner map (only to outer map).
>
> I didn't follow this completely.
>
> The userspace program is using the inner map per your description of the

Sorry just to correct myself, here I meant "The kernel eBPF program is using
the inner map on multiple CPUs" instead of "userspace".

thanks,

- Joel

> algorithm for using map-in-map to solve the race conditions that this patch
> is trying to address:
>
> If you don't mind, I copy-pasted it below from your netdev post:
>
> if you use map-in-map you don't need extra boolean map.
> 0. bpf prog can do
> inner_map = lookup(map_in_map, key=0);
> lookup(inner_map, your_real_key);
> 1. user space writes into map_in_map[0] <- FD of new map
> 2. some cpus are using old inner map and some a new
> 3. user space does sys_membarrier(CMD_GLOBAL) which will do synchronize_sched()
> which in CONFIG_PREEMPT_NONE=y servers is the same as synchronize_rcu()
> which will guarantee that progs finished.
> 4. scan old inner map
>
> In step 2, as you mentioned there are CPUs using different inner maps. So
> could you clarify how the synchronize_rcu mechanism will even work if you're
> now saying "program does not hold references to the inner maps"?
>
> Thanks!
>
> - Joel
>

2018-07-31 04:05:15

by Y Song

[permalink] [raw]

Subject: Re: [PATCH v2] Add BPF_SYNCHRONIZE_MAPS bpf(2) command

On Mon, Jul 30, 2018 at 7:06 PM, Joel Fernandes <[email protected]> wrote:
> On Mon, Jul 30, 2018 at 07:01:22PM -0700, Joel Fernandes wrote:
>> On Sun, Jul 29, 2018 at 06:51:18PM +0300, Alexei Starovoitov wrote:
>> > On Thu, Jul 26, 2018 at 7:51 PM, Daniel Colascione <[email protected]> wrote:
>> > > BPF_SYNCHRONIZE_MAPS waits for the release of any references to a BPF
>> > > map made by a BPF program that is running at the time the
>> > > BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command is
>> > > to provide a means for userspace to replace a BPF map with another,
>> > > newer version, then ensure that no component is still using the "old"
>> > > map before manipulating the "old" map in some way.
>> > >
>> > > Signed-off-by: Daniel Colascione <[email protected]>
>> > > ---
>> > > include/uapi/linux/bpf.h | 9 +++++++++
>> > > kernel/bpf/syscall.c | 13 +++++++++++++
>> > > 2 files changed, 22 insertions(+)
>> > >
>> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> > > index b7db3261c62d..5b27e9117d3e 100644
>> > > --- a/include/uapi/linux/bpf.h
>> > > +++ b/include/uapi/linux/bpf.h
>> > > @@ -75,6 +75,14 @@ struct bpf_lpm_trie_key {
>> > > __u8 data[0]; /* Arbitrary size */
>> > > };
>> > >
>> > > +/* BPF_SYNCHRONIZE_MAPS waits for the release of any references to a
>> > > + * BPF map made by a BPF program that is running at the time the
>> > > + * BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command
>> >
>> > that doesn't sound right to me.
>> > such command won't wait for the release of the references.
>> > in case of map-in-map the program does not hold
>> > the references to inner map (only to outer map).
>>
>> I didn't follow this completely.
>>
>> The userspace program is using the inner map per your description of the
>
> Sorry just to correct myself, here I meant "The kernel eBPF program is using
> the inner map on multiple CPUs" instead of "userspace".
>
> thanks,
>
> - Joel
>
>
>
>
>
>> algorithm for using map-in-map to solve the race conditions that this patch
>> is trying to address:
>>
>> If you don't mind, I copy-pasted it below from your netdev post:
>>
>> if you use map-in-map you don't need extra boolean map.
>> 0. bpf prog can do
>> inner_map = lookup(map_in_map, key=0);
>> lookup(inner_map, your_real_key);
>> 1. user space writes into map_in_map[0] <- FD of new map
>> 2. some cpus are using old inner map and some a new
>> 3. user space does sys_membarrier(CMD_GLOBAL) which will do synchronize_sched()
>> which in CONFIG_PREEMPT_NONE=y servers is the same as synchronize_rcu()
>> which will guarantee that progs finished.
>> 4. scan old inner map
>>
>> In step 2, as you mentioned there are CPUs using different inner maps. So
>> could you clarify how the synchronize_rcu mechanism will even work if you're
>> now saying "program does not hold references to the inner maps"?

The program only held references to the outer maps, and the outer map
held references to the inner maps. The user space program can add/remove
the inner map for a particular outer map while the prog <-> outer-map
relationship is not changed.

>>
>> Thanks!
>>
>> - Joel
>>

2018-07-31 08:35:46

by Daniel Borkmann

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On 07/31/2018 02:33 AM, Daniel Colascione wrote:
> On Mon, Jul 30, 2018 at 5:26 PM, Jakub Kicinski
> <[email protected]> wrote:
>> On Mon, 30 Jul 2018 03:25:43 -0700, Daniel Colascione wrote:
>>> On Mon, Jul 30, 2018 at 3:04 AM, Daniel Borkmann <[email protected]> wrote:
>>>> Hmm, I don't think such UAPI as above is future-proof. In case we would want
>>>> a similar mechanism in future for other maps, we would need a whole new bpf
>>>> command or reuse BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES as a workaround though
>>>> the underlying map may not even be a map-to-map. Additionally, we don't have
>>>> any map object at hand in the above, so we couldn't make any finer grained
>>>> decisions either. Something like below would be more suitable and leaves room
>>>> for extending this further in future.
>>>
>>> YAGNI. Your proposed mechanism doesn't add anything under the current
>>> implementation.
>>
>> FWIW in case of HW offload targeting a particular map may allow users
>> to avoid a potentially slow sync with all the devices on the system.
>
> Sure. But such a thing doesn't exist right now (right?), and we can
> add that more-efficient-in-that-one-case BPF interface when it lands.
> I'd rather keep things simple for now.

I don't see a reason why that is even more complicated. An API command name
such as BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES is simply non-generic, and
exposes specific map details (here: map-in-map) into the UAPI whereas it
should reside within a specific implementation instead similar to other ops
we have for maps. If in future other maps would be added that would have
similar mechanisms of inner objects they return to the BPF program, we'll
be adding yet another command just for this. Also, union bpf_attr is extensible,
e.g. additional members could be added in future whenever needed for this
subcommand instead of forcing it to NULL as done here. E.g. having a high-level
one like bpf(BPF_MAP_WAIT, { .map_fd = fd }) could later on also cover the
use-case from Jakub by adding an 'event' member into union bpf_attr where we
could add other map specific wakeups for user space while the current default
of 0 would be BPF_MAP_WAIT_PENDING_REFS (or the like). All I'm saying is to
keep it generic so it can be extended later.

2018-07-31 09:37:45

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Tue, Jul 31, 2018 at 1:34 AM, Daniel Borkmann <[email protected]> wrote:
> On 07/31/2018 02:33 AM, Daniel Colascione wrote:
>> On Mon, Jul 30, 2018 at 5:26 PM, Jakub Kicinski
>> <[email protected]> wrote:
>>> On Mon, 30 Jul 2018 03:25:43 -0700, Daniel Colascione wrote:
>>>> On Mon, Jul 30, 2018 at 3:04 AM, Daniel Borkmann <[email protected]> wrote:
>>>>> Hmm, I don't think such UAPI as above is future-proof. In case we would want
>>>>> a similar mechanism in future for other maps, we would need a whole new bpf
>>>>> command or reuse BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES as a workaround though
>>>>> the underlying map may not even be a map-to-map. Additionally, we don't have
>>>>> any map object at hand in the above, so we couldn't make any finer grained
>>>>> decisions either. Something like below would be more suitable and leaves room
>>>>> for extending this further in future.
>>>>
>>>> YAGNI. Your proposed mechanism doesn't add anything under the current
>>>> implementation.
>>>
>>> FWIW in case of HW offload targeting a particular map may allow users
>>> to avoid a potentially slow sync with all the devices on the system.
>>
>> Sure. But such a thing doesn't exist right now (right?), and we can
>> add that more-efficient-in-that-one-case BPF interface when it lands.
>> I'd rather keep things simple for now.
>
> I don't see a reason why that is even more complicated.

Both the API and the implementation are much more complicated in the
per-map ops version: just look at the patch size. The size argument
isn't necessarily a dealbreaker, but I still don't see what the extra
code size and complexity is buying.

> An API command name
> such as BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES is simply non-generic, and
> exposes specific map details (here: map-in-map) into the UAPI whereas it
> should reside within a specific implementation instead similar to other ops
> we have for maps.

But synchronize isn't conceptually a command that applies to a
specific map. It waits on all references. Did you address my point
about your proposed map-specific interface requiring redundant
synchronize_rcu calls in the case where we swap multiple maps and want
to wait for all the references to drain? Under my proposal, you'd just
BPF_SYNCHRONIZE_WHATEVER and call schedule_rcu once. Under your
proposal, we'd make it a per-map operation, so we'd issue one
synchronize_rcu per map.

> If in future other maps would be added that would have
> similar mechanisms of inner objects they return to the BPF program, we'll
> be adding yet another command just for this.

And that's why my personal preference is to just calling this thing
BPF_SYNCHRONIZE, which I'd define to wait for all such "inner
objects". Alexei is the one who asked for the very specific naming, I
believe.

Anyway, we have a very simple patch that we could apply today. It
addresses a real need, and it doesn't preclude adding something more
specific later, when we know we need it. Besides, it's not as if
adding a BPF command is particularly expensive.

> Also, union bpf_attr is extensible,
> e.g. additional members could be added in future whenever needed for this
> subcommand instead of forcing it to NULL as done here.

We fail with EINVAL when attr != NULL now, which means that we can
safely accept a non-NULL attr-based subcommand later without breaking
anyone. The interface is already extensible.

> All I'm saying is to
> keep it generic so it can be extended later.

Sure, but no more extensible than it has to be. Prematurely-added
extension points tend to cause trouble later.

2018-07-31 21:58:07

[permalink] [raw]

Subject: Re: [PATCH v2] Add BPF_SYNCHRONIZE_MAPS bpf(2) command

On Mon, Jul 30, 2018 at 09:03:18PM -0700, Y Song wrote:
> On Mon, Jul 30, 2018 at 7:06 PM, Joel Fernandes <[email protected]> wrote:
> > On Mon, Jul 30, 2018 at 07:01:22PM -0700, Joel Fernandes wrote:
> >> On Sun, Jul 29, 2018 at 06:51:18PM +0300, Alexei Starovoitov wrote:
> >> > On Thu, Jul 26, 2018 at 7:51 PM, Daniel Colascione <[email protected]> wrote:
> >> > > BPF_SYNCHRONIZE_MAPS waits for the release of any references to a BPF
> >> > > map made by a BPF program that is running at the time the
> >> > > BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command is
> >> > > to provide a means for userspace to replace a BPF map with another,
> >> > > newer version, then ensure that no component is still using the "old"
> >> > > map before manipulating the "old" map in some way.
> >> > >
> >> > > Signed-off-by: Daniel Colascione <[email protected]>
> >> > > ---
> >> > > include/uapi/linux/bpf.h | 9 +++++++++
> >> > > kernel/bpf/syscall.c | 13 +++++++++++++
> >> > > 2 files changed, 22 insertions(+)
> >> > >
> >> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> > > index b7db3261c62d..5b27e9117d3e 100644
> >> > > --- a/include/uapi/linux/bpf.h
> >> > > +++ b/include/uapi/linux/bpf.h
> >> > > @@ -75,6 +75,14 @@ struct bpf_lpm_trie_key {
> >> > > __u8 data[0]; /* Arbitrary size */
> >> > > };
> >> > >
> >> > > +/* BPF_SYNCHRONIZE_MAPS waits for the release of any references to a
> >> > > + * BPF map made by a BPF program that is running at the time the
> >> > > + * BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command
> >> >
> >> > that doesn't sound right to me.
> >> > such command won't wait for the release of the references.
> >> > in case of map-in-map the program does not hold
> >> > the references to inner map (only to outer map).
> >>
> >> I didn't follow this completely.
> >>
> >> The userspace program is using the inner map per your description of the
> >> algorithm for using map-in-map to solve the race conditions that this patch
> >> is trying to address:
> >>
> >> If you don't mind, I copy-pasted it below from your netdev post:
> >>
> >> if you use map-in-map you don't need extra boolean map.
> >> 0. bpf prog can do
> >> inner_map = lookup(map_in_map, key=0);
> >> lookup(inner_map, your_real_key);
> >> 1. user space writes into map_in_map[0] <- FD of new map
> >> 2. some cpus are using old inner map and some a new
> >> 3. user space does sys_membarrier(CMD_GLOBAL) which will do synchronize_sched()
> >> which in CONFIG_PREEMPT_NONE=y servers is the same as synchronize_rcu()
> >> which will guarantee that progs finished.
> >> 4. scan old inner map
> >>
> >> In step 2, as you mentioned there are CPUs using different inner maps. So
> >> could you clarify how the synchronize_rcu mechanism will even work if you're
> >> now saying "program does not hold references to the inner maps"?
>
> The program only held references to the outer maps, and the outer map
> held references to the inner maps. The user space program can add/remove
> the inner map for a particular outer map while the prog <-> outer-map
> relationship is not changed.

My definition of "reference" in this context is protection by rcu_read_lock.

So I was concerned the above map-in-map access isn't protected as such when
Alexei said "program doesn't have reference on inner map" in the above steps.
Maybe I misunderstood what is the meaning of reference here.

To make the map-in-map thing to work for Chenbo/Lorenzo's usecase, both the
access of outer map at key=0 and the inner map have to protect by
rcu_read_lock so that the membarrier call will work.

So basically step 0 in the steps above should be rcu_read_lock protected to
satisfy Chenbo/Lorenzo's usecase.

I know today the entire program is run as preempt disabled (unless something
changed) so this shouldn't be a problem, but in the future if the verifier is
doing similar things at a finer grainer level, then the above has to be
taken into consideration.

Does that make sense or am I missing something?

thanks,

- Joel

2018-07-31 22:32:05

by Daniel Borkmann

[permalink] [raw]

Subject: Re: [PATCH v2] Add BPF_SYNCHRONIZE_MAPS bpf(2) command

On 07/31/2018 11:56 PM, Joel Fernandes wrote:
> On Mon, Jul 30, 2018 at 09:03:18PM -0700, Y Song wrote:
>> On Mon, Jul 30, 2018 at 7:06 PM, Joel Fernandes <[email protected]> wrote:
>>> On Mon, Jul 30, 2018 at 07:01:22PM -0700, Joel Fernandes wrote:
>>>> On Sun, Jul 29, 2018 at 06:51:18PM +0300, Alexei Starovoitov wrote:
>>>>> On Thu, Jul 26, 2018 at 7:51 PM, Daniel Colascione <[email protected]> wrote:
>>>>>> BPF_SYNCHRONIZE_MAPS waits for the release of any references to a BPF
>>>>>> map made by a BPF program that is running at the time the
>>>>>> BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command is
>>>>>> to provide a means for userspace to replace a BPF map with another,
>>>>>> newer version, then ensure that no component is still using the "old"
>>>>>> map before manipulating the "old" map in some way.
>>>>>>
>>>>>> Signed-off-by: Daniel Colascione <[email protected]>
>>>>>> ---
>>>>>> include/uapi/linux/bpf.h | 9 +++++++++
>>>>>> kernel/bpf/syscall.c | 13 +++++++++++++
>>>>>> 2 files changed, 22 insertions(+)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>> index b7db3261c62d..5b27e9117d3e 100644
>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>> @@ -75,6 +75,14 @@ struct bpf_lpm_trie_key {
>>>>>> __u8 data[0]; /* Arbitrary size */
>>>>>> };
>>>>>>
>>>>>> +/* BPF_SYNCHRONIZE_MAPS waits for the release of any references to a
>>>>>> + * BPF map made by a BPF program that is running at the time the
>>>>>> + * BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command
>>>>>
>>>>> that doesn't sound right to me.
>>>>> such command won't wait for the release of the references.
>>>>> in case of map-in-map the program does not hold
>>>>> the references to inner map (only to outer map).
>>>>
>>>> I didn't follow this completely.
>>>>
>>>> The userspace program is using the inner map per your description of the
>>>> algorithm for using map-in-map to solve the race conditions that this patch
>>>> is trying to address:
>>>>
>>>> If you don't mind, I copy-pasted it below from your netdev post:
>>>>
>>>> if you use map-in-map you don't need extra boolean map.
>>>> 0. bpf prog can do
>>>> inner_map = lookup(map_in_map, key=0);
>>>> lookup(inner_map, your_real_key);
>>>> 1. user space writes into map_in_map[0] <- FD of new map
>>>> 2. some cpus are using old inner map and some a new
>>>> 3. user space does sys_membarrier(CMD_GLOBAL) which will do synchronize_sched()
>>>> which in CONFIG_PREEMPT_NONE=y servers is the same as synchronize_rcu()
>>>> which will guarantee that progs finished.
>>>> 4. scan old inner map
>>>>
>>>> In step 2, as you mentioned there are CPUs using different inner maps. So
>>>> could you clarify how the synchronize_rcu mechanism will even work if you're
>>>> now saying "program does not hold references to the inner maps"?
>>
>> The program only held references to the outer maps, and the outer map
>> held references to the inner maps. The user space program can add/remove
>> the inner map for a particular outer map while the prog <-> outer-map
>> relationship is not changed.
>
> My definition of "reference" in this context is protection by rcu_read_lock.
>
> So I was concerned the above map-in-map access isn't protected as such when
> Alexei said "program doesn't have reference on inner map" in the above steps.
> Maybe I misunderstood what is the meaning of reference here.
>
> To make the map-in-map thing to work for Chenbo/Lorenzo's usecase, both the
> access of outer map at key=0 and the inner map have to protect by
> rcu_read_lock so that the membarrier call will work.
>
> So basically step 0 in the steps above should be rcu_read_lock protected to
> satisfy Chenbo/Lorenzo's usecase.
>
> I know today the entire program is run as preempt disabled (unless something
> changed) so this shouldn't be a problem, but in the future if the verifier is
> doing similar things at a finer grainer level, then the above has to be
> taken into consideration.
>
> Does that make sense or am I missing something?

All BPF programs are required to run under rcu_read_lock today, so that
assumption holds. Should this ever change in future, then this constraint
of course needs to be taken into consideration.

Thanks,
Daniel

2018-08-10 22:31:00

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Thu, Aug 09, 2018 at 10:17:50AM -0700, Daniel Colascione wrote:
> Ping.

sorry for delay. have been off grid. let's continue in the other thread for full context

2018-08-10 22:53:54

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Tue, Jul 31, 2018 at 02:36:39AM -0700, Daniel Colascione wrote:
>
> > An API command name
> > such as BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES is simply non-generic, and
> > exposes specific map details (here: map-in-map) into the UAPI whereas it
> > should reside within a specific implementation instead similar to other ops
> > we have for maps.
>
> But synchronize isn't conceptually a command that applies to a
> specific map. It waits on all references. Did you address my point
> about your proposed map-specific interface requiring redundant
> synchronize_rcu calls in the case where we swap multiple maps and want
> to wait for all the references to drain? Under my proposal, you'd just
> BPF_SYNCHRONIZE_WHATEVER and call schedule_rcu once. Under your
> proposal, we'd make it a per-map operation, so we'd issue one
> synchronize_rcu per map.

optimizing for multi-map sync sounds like premature optimization.
In general I'd prefer DanielB proposal to make the sync logic map and fd specific,
but before we argue about implementation further let's agree on
the problem we're trying to solve.
I believe the only issue being discussed is user space doesn't know
when it's ok to start draining the inner map when it was replaced
by bpf_map_update syscall command with another map, right?
If we agree on that, should bpf_map_update handle it then?
Wouldn't it be much easier to understand and use from user pov?
No new commands to learn. map_update syscall replaced the map
and old map is no longer accessed by the program via this given map-in-map.
But if the replaced map is used directly or it sits in some other
map-in-map slot the progs can still access it.

My issue with DanielC SYNC cmd that it exposes implementation details
and introduces complex 'synchronization' semantics. To majority of
the users it won't be obvious what is being synchronized.

My issue with DanielB WAIT_REF map_fd cmd that it needs to wait for all refs
to this map to be dropped. I think combination of usercnt and refcnt
can answer that, but feels dangerous to sleep potentially forever
in a syscall until all prog->map references are gone, though such
cmd is useful beyond map-in-map use case.

2018-08-14 20:39:37

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Fri, Aug 10, 2018 at 3:52 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Tue, Jul 31, 2018 at 02:36:39AM -0700, Daniel Colascione wrote:
>>
>> > An API command name
>> > such as BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES is simply non-generic, and
>> > exposes specific map details (here: map-in-map) into the UAPI whereas it
>> > should reside within a specific implementation instead similar to other ops
>> > we have for maps.
>>
>> But synchronize isn't conceptually a command that applies to a
>> specific map. It waits on all references. Did you address my point
>> about your proposed map-specific interface requiring redundant
>> synchronize_rcu calls in the case where we swap multiple maps and want
>> to wait for all the references to drain? Under my proposal, you'd just
>> BPF_SYNCHRONIZE_WHATEVER and call schedule_rcu once. Under your
>> proposal, we'd make it a per-map operation, so we'd issue one
>> synchronize_rcu per map.
>
> optimizing for multi-map sync sounds like premature optimization.

Maybe, but the per-map proposal is less efficient *and* more
complicated! I don't want to spend more code just to go slower.

> I believe the only issue being discussed is user space doesn't know
> when it's ok to start draining the inner map when it was replaced
> by bpf_map_update syscall command with another map, right?

Yes.

> If we agree on that, should bpf_map_update handle it then?
> Wouldn't it be much easier to understand and use from user pov?
> No new commands to learn. map_update syscall replaced the map
> and old map is no longer accessed by the program via this given map-in-map.

Maybe with a new BPF_SYNCHRONIZE flag for BPF_MAP_UPDATE_ELEM and
BPF_MAP_DELETE_ELEM. Otherwise, it seems wrong to make every user of
these commands pay for synchronization that only a few will need.

> But if the replaced map is used directly or it sits in some other
> map-in-map slot the progs can still access it.
>
> My issue with DanielC SYNC cmd that it exposes implementation details

What implementation details? The command semantics are defined
entirely in terms of existing user-visible primitives.

> and introduces complex 'synchronization' semantics. To majority of
> the users it won't be obvious what is being synchronized.
>
> My issue with DanielB WAIT_REF map_fd cmd that it needs to wait for all refs
> to this map to be dropped. I think combination of usercnt and refcnt
> can answer that, but feels dangerous to sleep potentially forever
> in a syscall until all prog->map references are gone, though such
> cmd is useful beyond map-in-map use case.

In what scenarios?

In any case, can we submit _something_?

2018-08-16 07:45:22

[permalink] [raw]

Subject: Re: [PATCH v3] Add BPF_SYNCHRONIZE_MAP_TO_MAP_REFERENCES bpf(2) command

On Tue, Aug 14, 2018 at 01:37:12PM -0700, Daniel Colascione wrote:
>
> > If we agree on that, should bpf_map_update handle it then?
> > Wouldn't it be much easier to understand and use from user pov?
> > No new commands to learn. map_update syscall replaced the map
> > and old map is no longer accessed by the program via this given map-in-map.
>
> Maybe with a new BPF_SYNCHRONIZE flag for BPF_MAP_UPDATE_ELEM and
> BPF_MAP_DELETE_ELEM. Otherwise, it seems wrong to make every user of
> these commands pay for synchronization that only a few will need.

I don't think extra flag is needed. Extra sync_rcu() for map-in-map
is useful for all users. I would consider it a bugfix,
since users that examine deleted map have this race today
and removing the race is always a good thing especially since the cost
is small.

2018-10-12 10:55:15

[permalink] [raw]

Subject: [PATCH v4] Wait for running BPF programs when updating map-in-map

The map-in-map frequently serves as a mechanism for atomic
snapshotting of state that a BPF program might record. The current
implementation is dangerous to use in this way, however, since
userspace has no way of knowing when all programs that might have
retrieved the "old" value of the map may have completed.

This change ensures that map update operations on map-in-map map types
always wait for all references to the old map to drop before returning
to userspace.

Signed-off-by: Daniel Colascione <[email protected]>
---
kernel/bpf/syscall.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8339d81cba1d..d7c16ae1e85a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -741,6 +741,18 @@ static int map_lookup_elem(union bpf_attr *attr)
return err;
}

+static void maybe_wait_bpf_programs(struct bpf_map *map)
+{
+ /* Wait for any running BPF programs to complete so that
+ * userspace, when we return to it, knows that all programs
+ * that could be running use the new map value.
+ */
+ if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
+ map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
+ synchronize_rcu();
+ }
+}
+
#define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags

static int map_update_elem(union bpf_attr *attr)
@@ -831,6 +843,7 @@ static int map_update_elem(union bpf_attr *attr)
}
__this_cpu_dec(bpf_prog_active);
preempt_enable();
+ maybe_wait_bpf_programs(map);
out:
free_value:
kfree(value);
@@ -883,6 +896,7 @@ static int map_delete_elem(union bpf_attr *attr)
rcu_read_unlock();
__this_cpu_dec(bpf_prog_active);
preempt_enable();
+ maybe_wait_bpf_programs(map);
out:
kfree(key);
err_put:
--
2.19.0.605.g01d371f741-goog

2018-10-12 20:55:08

[permalink] [raw]

Subject: Re: [PATCH v4] Wait for running BPF programs when updating map-in-map

On Fri, Oct 12, 2018 at 03:54:27AM -0700, Daniel Colascione wrote:
> The map-in-map frequently serves as a mechanism for atomic
> snapshotting of state that a BPF program might record. The current
> implementation is dangerous to use in this way, however, since
> userspace has no way of knowing when all programs that might have
> retrieved the "old" value of the map may have completed.
>
> This change ensures that map update operations on map-in-map map types
> always wait for all references to the old map to drop before returning
> to userspace.
>
> Signed-off-by: Daniel Colascione <[email protected]>
> ---
> kernel/bpf/syscall.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8339d81cba1d..d7c16ae1e85a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -741,6 +741,18 @@ static int map_lookup_elem(union bpf_attr *attr)
> return err;
> }
>
> +static void maybe_wait_bpf_programs(struct bpf_map *map)
> +{
> + /* Wait for any running BPF programs to complete so that
> + * userspace, when we return to it, knows that all programs
> + * that could be running use the new map value.
> + */
> + if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
> + map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
> + synchronize_rcu();
> + }
> +}
> +
> #define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags
>
> static int map_update_elem(union bpf_attr *attr)
> @@ -831,6 +843,7 @@ static int map_update_elem(union bpf_attr *attr)
> }
> __this_cpu_dec(bpf_prog_active);
> preempt_enable();
> + maybe_wait_bpf_programs(map);
> out:
> free_value:
> kfree(value);
> @@ -883,6 +896,7 @@ static int map_delete_elem(union bpf_attr *attr)
> rcu_read_unlock();
> __this_cpu_dec(bpf_prog_active);
> preempt_enable();
> + maybe_wait_bpf_programs(map);

Looks good to me,

Reviewed-by: Joel Fernandes (Google) <[email protected]>

Also I believe that those rcu_read_lock() and rcu_read_unlock() calls in the
existing code are useless. preempt_disable()d code is already an RCU
read-side section, and synchronize_rcu and friends work on those type of
read-side sections as well (as of recent kernel releases) however removing it
may make lockdep unhappy, unless we also replace all rcu_dereference() usages
with rcu_dereference_sched(), so lets leave that alone for now I guess.

thanks,

- Joel

2018-10-13 02:32:22

[permalink] [raw]

Subject: Re: [PATCH v4] Wait for running BPF programs when updating map-in-map

2018-10-16 17:41:55

[permalink] [raw]

Subject: Re: [PATCH v4] Wait for running BPF programs when updating map-in-map

On Fri, Oct 12, 2018 at 7:31 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Fri, Oct 12, 2018 at 03:54:27AM -0700, Daniel Colascione wrote:
>> The map-in-map frequently serves as a mechanism for atomic
>> snapshotting of state that a BPF program might record. The current
>> implementation is dangerous to use in this way, however, since
>> userspace has no way of knowing when all programs that might have
>> retrieved the "old" value of the map may have completed.
>>
>> This change ensures that map update operations on map-in-map map types
>> always wait for all references to the old map to drop before returning
>> to userspace.
>>
>> Signed-off-by: Daniel Colascione <[email protected]>
>> ---
>> kernel/bpf/syscall.c | 14 ++++++++++++++
>> 1 file changed, 14 insertions(+)
>>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 8339d81cba1d..d7c16ae1e85a 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -741,6 +741,18 @@ static int map_lookup_elem(union bpf_attr *attr)
>> return err;
>> }
>>
>> +static void maybe_wait_bpf_programs(struct bpf_map *map)
>> +{
>> + /* Wait for any running BPF programs to complete so that
>> + * userspace, when we return to it, knows that all programs
>> + * that could be running use the new map value.
>> + */
>> + if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
>> + map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
>> + synchronize_rcu();
>> + }
>
> extra {} were not necessary. I removed them while applying to bpf-next.
> Please run checkpatch.pl next time.
> Thanks

Thanks Alexei for taking it. Me and Lorenzo were discussing that not
having this causes incorrect behavior for apps using map-in-map for
this. So I CC'd stable as well.

-Joel

2018-10-18 15:47:41

[permalink] [raw]

Subject: Re: [PATCH v4] Wait for running BPF programs when updating map-in-map

On Tue, Oct 16, 2018 at 10:39:57AM -0700, Joel Fernandes wrote:
> On Fri, Oct 12, 2018 at 7:31 PM, Alexei Starovoitov
> <[email protected]> wrote:
> > On Fri, Oct 12, 2018 at 03:54:27AM -0700, Daniel Colascione wrote:
> >> The map-in-map frequently serves as a mechanism for atomic
> >> snapshotting of state that a BPF program might record. The current
> >> implementation is dangerous to use in this way, however, since
> >> userspace has no way of knowing when all programs that might have
> >> retrieved the "old" value of the map may have completed.
> >>
> >> This change ensures that map update operations on map-in-map map types
> >> always wait for all references to the old map to drop before returning
> >> to userspace.
> >>
> >> Signed-off-by: Daniel Colascione <[email protected]>
> >> ---
> >> kernel/bpf/syscall.c | 14 ++++++++++++++
> >> 1 file changed, 14 insertions(+)
> >>
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index 8339d81cba1d..d7c16ae1e85a 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -741,6 +741,18 @@ static int map_lookup_elem(union bpf_attr *attr)
> >> return err;
> >> }
> >>
> >> +static void maybe_wait_bpf_programs(struct bpf_map *map)
> >> +{
> >> + /* Wait for any running BPF programs to complete so that
> >> + * userspace, when we return to it, knows that all programs
> >> + * that could be running use the new map value.
> >> + */
> >> + if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
> >> + map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
> >> + synchronize_rcu();
> >> + }
> >
> > extra {} were not necessary. I removed them while applying to bpf-next.
> > Please run checkpatch.pl next time.
> > Thanks
>
> Thanks Alexei for taking it. Me and Lorenzo were discussing that not
> having this causes incorrect behavior for apps using map-in-map for
> this. So I CC'd stable as well.

It is too late in the release cycle.
We can submit it to stable releases after the merge window.

2018-10-18 23:38:04