2016-12-17 18:19:14

by Andy Lutomirski

[permalink] [raw]
Subject: Potential issues (security and otherwise) with the current cgroup-bpf API

Hi all-

I apologize for being rather late with this. I didn't realize that
cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
it on the linux-api list, so I missed the discussion.

I think that the inet ingress, egress etc filters are a neat feature,
but I think the API has some issues that will bite us down the road
if it becomes stable in its current form.

Most of the problems I see are summarized in this transcript:

# mkdir cg2
# mount -t cgroup2 none cg2
# mkdir cg2/nosockets
# strace cgrp_socket_rule cg2/nosockets/ 0
...
open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3

^^^^ You can modify a cgroup after opening it O_RDONLY?

bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
log_buf=0x6020c0, kern_version=0}, 48) = 4

^^^^ This is fine. The bpf() syscall manipulates bpf objects.

bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0

^^^^ This is not so good:
^^^^
^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This
^^^^ is manipulating a cgroup. There's no reason that a socket creation
^^^^ filter couldn't be written in a different language (new iptables
^^^^ table? Simple list of address families?), but if that happened,
^^^^ then using bpf() to install it would be entirely nonsensical.
^^^^
^^^^ b) This is starting to be an excessively ugly multiplexer. Among
^^^^ other things, it's very unfriendly to seccomp.

# echo $$ >cg2/nosockets/cgroup.procs
# ping 127.0.0.1
ping: socket: Operation not permitted
# ls cg2/nosockets/
cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control
# cat cg2/nosockets/cgroup.controllers

^^^^ Something in cgroupfs should give an indication that this cgroup
^^^^ filters socket creation, but there's nothing there. You should also
^^^^ be able to turn the filter off from cgroupfs.

# mkdir cg2/nosockets/sockets
# /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1

^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
^^^^ then we're stuck with its semantics. If it returned -EINVAL instead,
^^^^ there would be a chance to refine it.

# echo $$ >cg2/nosockets/sockets/cgroup.procs
# ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
^C
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms

^^^^ Bash was inside a cgroup that disallowed socket creation, but socket
^^^^ creation wasn't disallowed. This means that the obvious use of socket
^^^^ creation filters in nestable constainers fails insecurely.


There's also a subtle but nasty potential security problem here.
In 4.9 and before, cgroups has only one real effect in the kernel:
resource control. A process in a malicious cgroup could be DoSed,
but that was about the extent of the damage that a malicious cgroup
could do.

In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
programs attached that can do things if various events occur. (Right
now, this means socket operations, but there are plans in the works
to do this for LSM hooks too.) These bpf programs can say yes or no,
but they can also read out various data (including socket payloads!)
and save them away where an attacker can find them. This sounds a
lot like seccomp with a narrower scope but a much stronger ability to
exfiltrate private information.

Unfortunately, while seccomp is very, very careful to prevent
injection of a privileged victim into a malicious sandbox, the
CGROUP_BPF mechanism appears to have no real security model. There
is nothing to prevent a program that's in a malicious cgroup from
running a setuid binary, and there is nothing to prevent a program
that has the ability to move itself or another program into a
malicious cgroup from doing so and then, if needed for exploitation,
exec a setuid binary.

This isn't much of a problem yet because you currently need
CAP_NET_ADMIN to create a malicious sandbox in the first place. I'm
sure that, in the near future, someone will want to make this stuff
work in containers with delegated cgroup hierarchies, and then there
may be a real problem here.


I've included a few security people on this thread. The current API
looks abusable, and it would be nice to find all the holes before
4.10 comes out.


(The cgrp_socket_rule source is attached. You can build it by sticking it
in samples/bpf and doing:

$ make headers_install
$ cd samples/bpf
$ gcc -o cgrp_socket_rule cgrp_socket_rule.c libbpf.c -I../../usr/include
)

--Andy


Attachments:
cgrp_socket_rule.c (1.51 kB)

2016-12-17 19:35:28

by Mickaël Salaün

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API


On 17/12/2016 19:18, Andy Lutomirski wrote:
> Hi all-
>
> I apologize for being rather late with this. I didn't realize that
> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> it on the linux-api list, so I missed the discussion.
>
> I think that the inet ingress, egress etc filters are a neat feature,
> but I think the API has some issues that will bite us down the road
> if it becomes stable in its current form.
>
> Most of the problems I see are summarized in this transcript:
>
> # mkdir cg2
> # mount -t cgroup2 none cg2
> # mkdir cg2/nosockets
> # strace cgrp_socket_rule cg2/nosockets/ 0
> ...
> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>
> ^^^^ You can modify a cgroup after opening it O_RDONLY?

I sent a patch to check the cgroup.procs permission before attaching a
BPF program to it [1], but it was not merged because not part of the
current security model (which may not be crystal clear). The thing is
that the current socket/BPF/cgroup feature is only available to a
process with the *global CAP_NET_ADMIN* and such a process can already
modify the network for every processes, so it doesn't make much sense to
check if it can modify the network for a subset of this processes.

[1] https://lkml.org/lkml/2016/9/19/854

However, needing a process to open a cgroup *directory* in write mode
may not make sense because the process does not modify the content of
the cgroup but only use it as a *reference* in the network stack.
Forcing an open with write mode may forbid to use this kind of
network-filtering feature in a read-only file-system but not necessarily
read-only *network configuration*.

Another point of view is that the CAP_NET_ADMIN may be an unneeded
privilege if the cgroup migration is using a no_new_privs-like feature
as I proposed with Landlock [2] (with an extra ptrace_may_access() check).
The new capability proposition for cgroup may be interesting too.

[2] https://lkml.org/lkml/2016/9/14/82

>
> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> log_buf=0x6020c0, kern_version=0}, 48) = 4
>
> ^^^^ This is fine. The bpf() syscall manipulates bpf objects.
>
> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>
> ^^^^ This is not so good:
> ^^^^
> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This
> ^^^^ is manipulating a cgroup. There's no reason that a socket creation
> ^^^^ filter couldn't be written in a different language (new iptables
> ^^^^ table? Simple list of address families?), but if that happened,
> ^^^^ then using bpf() to install it would be entirely nonsensical.

Another point of view is to say that the BPF program (called by the
network stack) is using a reference to a set of processes thanks to a
cgroup.

> ^^^^
> ^^^^ b) This is starting to be an excessively ugly multiplexer. Among
> ^^^^ other things, it's very unfriendly to seccomp.

FWIW, Landlock will have the capability to filter this kind of action.

>
> # echo $$ >cg2/nosockets/cgroup.procs
> # ping 127.0.0.1
> ping: socket: Operation not permitted
> # ls cg2/nosockets/
> cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control
> # cat cg2/nosockets/cgroup.controllers
>
> ^^^^ Something in cgroupfs should give an indication that this cgroup
> ^^^^ filters socket creation, but there's nothing there. You should also
> ^^^^ be able to turn the filter off from cgroupfs.

Right. Everybody was OK at LPC to add such an information but it is not
there yet.

>
> # mkdir cg2/nosockets/sockets
> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
>
> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
> ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead,
> ^^^^ there would be a chance to refine it.

This is indeed unfortunate.

>
> # echo $$ >cg2/nosockets/sockets/cgroup.procs
> # ping 127.0.0.1
> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
> ^C
> --- 127.0.0.1 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms
>
> ^^^^ Bash was inside a cgroup that disallowed socket creation, but socket
> ^^^^ creation wasn't disallowed. This means that the obvious use of socket
> ^^^^ creation filters in nestable constainers fails insecurely.
>
>
> There's also a subtle but nasty potential security problem here.
> In 4.9 and before, cgroups has only one real effect in the kernel:
> resource control. A process in a malicious cgroup could be DoSed,
> but that was about the extent of the damage that a malicious cgroup
> could do.
>
> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
> programs attached that can do things if various events occur. (Right
> now, this means socket operations, but there are plans in the works
> to do this for LSM hooks too.) These bpf programs can say yes or no,
> but they can also read out various data (including socket payloads!)
> and save them away where an attacker can find them. This sounds a
> lot like seccomp with a narrower scope but a much stronger ability to
> exfiltrate private information.
>
> Unfortunately, while seccomp is very, very careful to prevent
> injection of a privileged victim into a malicious sandbox, the
> CGROUP_BPF mechanism appears to have no real security model. There
> is nothing to prevent a program that's in a malicious cgroup from
> running a setuid binary, and there is nothing to prevent a program
> that has the ability to move itself or another program into a
> malicious cgroup from doing so and then, if needed for exploitation,
> exec a setuid binary.
>
> This isn't much of a problem yet because you currently need
> CAP_NET_ADMIN to create a malicious sandbox in the first place. I'm
> sure that, in the near future, someone will want to make this stuff
> work in containers with delegated cgroup hierarchies, and then there
> may be a real problem here.
>
>
> I've included a few security people on this thread. The current API
> looks abusable, and it would be nice to find all the holes before
> 4.10 comes out.
>
>
> (The cgrp_socket_rule source is attached. You can build it by sticking it
> in samples/bpf and doing:
>
> $ make headers_install
> $ cd samples/bpf
> $ gcc -o cgrp_socket_rule cgrp_socket_rule.c libbpf.c -I../../usr/include
> )
>
> --Andy
>

Right. Because this feature doesn't handle namespaces (only global
CAP_NET_ADMIN), nested containers should not be allowed to use it at all.
If we want this kind of feature to be usable by something other than a
process with a global capability, then we need an inheritance mechanism
similar to what I designed for Landlock. I think it could be added later.

Regards,
Micka?l


Attachments:
signature.asc (488.00 B)
OpenPGP digital signature

2016-12-17 20:03:01

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Sat, Dec 17, 2016 at 11:26 AM, Mickaël Salaün <[email protected]> wrote:
>
> On 17/12/2016 19:18, Andy Lutomirski wrote:
>> Hi all-
>>
>> I apologize for being rather late with this. I didn't realize that
>> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> it on the linux-api list, so I missed the discussion.
>>
>> I think that the inet ingress, egress etc filters are a neat feature,
>> but I think the API has some issues that will bite us down the road
>> if it becomes stable in its current form.
>>
>> Most of the problems I see are summarized in this transcript:
>>
>> # mkdir cg2
>> # mount -t cgroup2 none cg2
>> # mkdir cg2/nosockets
>> # strace cgrp_socket_rule cg2/nosockets/ 0
>> ...
>> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>>
>> ^^^^ You can modify a cgroup after opening it O_RDONLY?
>
> I sent a patch to check the cgroup.procs permission before attaching a
> BPF program to it [1], but it was not merged because not part of the
> current security model (which may not be crystal clear). The thing is
> that the current socket/BPF/cgroup feature is only available to a
> process with the *global CAP_NET_ADMIN* and such a process can already
> modify the network for every processes, so it doesn't make much sense to
> check if it can modify the network for a subset of this processes.
>
> [1] https://lkml.org/lkml/2016/9/19/854

I don't think that's quite the right approach. First, checking
permissions on an fd that's not the fd passed to the syscall is weird
and IMO shouldn't be done without a good reason. Second,
cgroups.procs seems like the wrong file.

Why not create "net.socket_create_filter" and require a writable fd to
*that* to be passed to the syscall.

>
> However, needing a process to open a cgroup *directory* in write mode
> may not make sense because the process does not modify the content of
> the cgroup but only use it as a *reference* in the network stack.
> Forcing an open with write mode may forbid to use this kind of
> network-filtering feature in a read-only file-system but not necessarily
> read-only *network configuration*.

It does modify the cgroup. Logically it changes the cgroup behavior.
As an implementation matter, it modifies the struct cgroup.

>
> Another point of view is that the CAP_NET_ADMIN may be an unneeded
> privilege if the cgroup migration is using a no_new_privs-like feature
> as I proposed with Landlock [2] (with an extra ptrace_may_access() check).
> The new capability proposition for cgroup may be interesting too.
>
> [2] https://lkml.org/lkml/2016/9/14/82
>
>>
>> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> log_buf=0x6020c0, kern_version=0}, 48) = 4
>>
>> ^^^^ This is fine. The bpf() syscall manipulates bpf objects.
>>
>> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>>
>> ^^^^ This is not so good:
>> ^^^^
>> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This
>> ^^^^ is manipulating a cgroup. There's no reason that a socket creation
>> ^^^^ filter couldn't be written in a different language (new iptables
>> ^^^^ table? Simple list of address families?), but if that happened,
>> ^^^^ then using bpf() to install it would be entirely nonsensical.
>
> Another point of view is to say that the BPF program (called by the
> network stack) is using a reference to a set of processes thanks to a
> cgroup.

I strongly disagree with this point of view. From a user's
perspective, nothing about the bpf program changed -- the cgroup
changes. Even in the code, the bpf program doesn't have a reference
to anything. The cgroup references the bpf program.

>>
>> # echo $$ >cg2/nosockets/cgroup.procs
>> # ping 127.0.0.1
>> ping: socket: Operation not permitted
>> # ls cg2/nosockets/
>> cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control
>> # cat cg2/nosockets/cgroup.controllers
>>
>> ^^^^ Something in cgroupfs should give an indication that this cgroup
>> ^^^^ filters socket creation, but there's nothing there. You should also
>> ^^^^ be able to turn the filter off from cgroupfs.
>
> Right. Everybody was OK at LPC to add such an information but it is not
> there yet.

Then let's do it before CONFIG_CGROUP_BPF=y becomes possible in a
released kernel.

> Right. Because this feature doesn't handle namespaces (only global
> CAP_NET_ADMIN), nested containers should not be allowed to use it at all.
> If we want this kind of feature to be usable by something other than a
> process with a global capability, then we need an inheritance mechanism
> similar to what I designed for Landlock. I think it could be added later.

I think it's okay to have a new kernel feature that doesn't support
namespacing yet. I don't think it's okay to have a new kernel feature
that will become problematic when namespacing is added, and I think
cgroup-bpf is in the latter class.


Anyway, here's a half-baked proposal to clean this all up:

Make these new fiters actually be cgroup controllers. Fix whatever
needs to be fixed to make that work. This will mean that they can't
be the "subtree-control" type of controllers. (Maybe the same set of
fixes will help with the cpu controllers, too.) Make them stack
properly. Make them configurable using cgroupfs.

Add a "dangerous" flag on cgroups. Cgroups are "dangerous" if they
have dangerous controllers enabled. Controllers like "inet ingress"
are dangerous. You can only configure dangerous controllers if all
tasks in the cgroup have no_new_privs set and you have appropriate
permission over the tasks or if the cgroup is empty. If trying to bind
an inet ingress filter would make a cgroup dangerous and you don't
have the relevent privilege, then the operation fails. You cannot move
any task into a dangerous cgroup unless that task has no_new_privs set
*and* you have privilege over that task or if the task is in a
namespace that you have CAP_SYS_ADMIN on. (This is kind of like your
proposal, and maybe they should be merged. I do think that
*something* is needed and needs fleshing out.

Keep in mind, though, that strictly requiring no_new_privs is
excessive for namespaced applications. It might be okay to require
that a cgroup only ever contain tasks in a given namespace or perhaps
that a cgroup only contain tasks that are either no_new_privs *or* are
in a given namespace or its descendents. (In fact, unshare() can
*clear* no_new_privs because being in a fresh userns has more or less
the same effect.)

--Andy

2016-12-19 20:56:44

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> Hi all-
>
> I apologize for being rather late with this. I didn't realize that
> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> it on the linux-api list, so I missed the discussion.
>
> I think that the inet ingress, egress etc filters are a neat feature,
> but I think the API has some issues that will bite us down the road
> if it becomes stable in its current form.
>
> Most of the problems I see are summarized in this transcript:
>
> # mkdir cg2
> # mount -t cgroup2 none cg2
> # mkdir cg2/nosockets
> # strace cgrp_socket_rule cg2/nosockets/ 0
> ...
> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>
> ^^^^ You can modify a cgroup after opening it O_RDONLY?
>
> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> log_buf=0x6020c0, kern_version=0}, 48) = 4
>
> ^^^^ This is fine. The bpf() syscall manipulates bpf objects.
>
> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>
> ^^^^ This is not so good:
> ^^^^
> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This
> ^^^^ is manipulating a cgroup. There's no reason that a socket creation
> ^^^^ filter couldn't be written in a different language (new iptables
> ^^^^ table? Simple list of address families?), but if that happened,
> ^^^^ then using bpf() to install it would be entirely nonsensical.

I don't see why it's _modifing_ the cgroup. I'm looking at it as
network stack is using cgroup as an application group that should
invoke bpf program at the certain point in the stack.
imo cgroup management is orthogonal.
I certainly don't mind adding 'cat cgroup.bpf' control file that will
print the program digest for debugging, just like we do for fdinfo,
but cgroup file interface shouldn't be used to control networking.
If there was something like networking-wide ioctl, I think it could
have been an alternative way to attach a program to a hook to a cgroup.
Adding a control file to a cgroup for the purpose of attaching bpf,
just doesn't make sense to me, since it's dragging bpf and networking
details into cgroup land. Imagine in the future somebody would want
to have nft to perform similar BPF_CGROUP_INET_INGRESS functionality.
I'd think that the position of the hook within the networking stack
would remain the same, but enum id will be different, since
other ids in that enum (like BPF_CGROUP_INET_SOCK_CREATE) are not
applicable to nft. Similarly the concept of program_fd doesn't exist
there either. So it would be using its own netlink based mechanism
and its own way of enumerating hook id. And cgroup could be passed
as a string into netlink message instead of fd.
So if somebody adds a control file to a cgroup api that takes bpf's hookid
and bpf's program_fd, such control file will only be applicable to bpf
and not usable for anything else. Hence I see no reason to go that route.

> ^^^^
> ^^^^ b) This is starting to be an excessively ugly multiplexer. Among
> ^^^^ other things, it's very unfriendly to seccomp.
>
> # echo $$ >cg2/nosockets/cgroup.procs
> # ping 127.0.0.1
> ping: socket: Operation not permitted
> # ls cg2/nosockets/
> cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control
> # cat cg2/nosockets/cgroup.controllers
>
> ^^^^ Something in cgroupfs should give an indication that this cgroup
> ^^^^ filters socket creation, but there's nothing there. You should also
> ^^^^ be able to turn the filter off from cgroupfs.

Doing 'cat cg2/nosockets/cgroup.bpf' here would be useful for debugging
to see which program is attached to which hook in this cgroup.
Detaching programs via cgroup control file is a lot less clear,
since it will be confusing to a running application that attached the program
in the first place, since it won't have any indication that external entity
detached the program.
One can argue that it could be useful for curious human/admin
to detach all bpf progs from all hooks for this cgroup, but
then it probably needs dmesg warning and only usable in extreme cases
as debugging and not something control plane should ever use.

> # mkdir cg2/nosockets/sockets
> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
>
> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
> ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead,
> ^^^^ there would be a chance to refine it.

i don't see the problem with this behavior. bpf and cgroup are indepedent.
Imange that socket accounting program is attached to cg2/nosockets.
The program is readonly and carry no security meaning.
Why cgroup should prevent creation of cg2/nosockets/foo directory ?

> # echo $$ >cg2/nosockets/sockets/cgroup.procs
> # ping 127.0.0.1
> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms

hmm. in this case the admin created a subgroup with a group that allows
ping, so? how's that a problem? In case of vrf I can imagine
a set of application auto-binding to one vrf and smaller
set of applications binding to a different vrf.
Or broader set of applications monitoring all tcp traffic
and subset of them monitoring udp instead.
Those are valid use cases.
I guess you're advocating to run a link list of programs?
That won't be useful for the above use cases, where there is no
reason to run more than one program that control plane
assigned to run for this cgroup.
At this stage we decided to allow only one program per cgroup per hook
and later can extend it if necessary.
As you're pointing out, in case of security, we probably
want to preserve original bpf program that should always be
run first and only after it returned 'ok' (we'd need to define
what 'ok' means in secruity context) run program attached to sub-hierarchy.
Another alternative is to disallow attaching programs in sub-hierarchy
if parent has something already attached, but it's not useful
for general case.
All of these are possible future extensions.

> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
> programs attached that can do things if various events occur. (Right
> now, this means socket operations, but there are plans in the works
> to do this for LSM hooks too.) These bpf programs can say yes or no,
> but they can also read out various data (including socket payloads!)
> and save them away where an attacker can find them. This sounds a
> lot like seccomp with a narrower scope but a much stronger ability to
> exfiltrate private information.

that sounds like a future problem to solve when bpf+lsm+cgroup are
used for security.

> This isn't much of a problem yet because you currently need
> CAP_NET_ADMIN to create a malicious sandbox in the first place. I'm
> sure that, in the near future, someone will want to make this stuff
> work in containers with delegated cgroup hierarchies, and then there
> may be a real problem here.

I agree. Currently cgroup+bpf+hooks are for root only and have to
go long way before they can be used for security and sandboxing.

> I've included a few security people on this thread. The current API
> looks abusable, and it would be nice to find all the holes before
> 4.10 comes out.

I disagree with the urgency. I see nothing that needs immediate action.
bpf+lsm+cgroup is not in the tree yet.

2016-12-19 21:24:26

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
>> Hi all-
>>
>> I apologize for being rather late with this. I didn't realize that
>> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> it on the linux-api list, so I missed the discussion.
>>
>> I think that the inet ingress, egress etc filters are a neat feature,
>> but I think the API has some issues that will bite us down the road
>> if it becomes stable in its current form.
>>
>> Most of the problems I see are summarized in this transcript:
>>
>> # mkdir cg2
>> # mount -t cgroup2 none cg2
>> # mkdir cg2/nosockets
>> # strace cgrp_socket_rule cg2/nosockets/ 0
>> ...
>> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>>
>> ^^^^ You can modify a cgroup after opening it O_RDONLY?
>>
>> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> log_buf=0x6020c0, kern_version=0}, 48) = 4
>>
>> ^^^^ This is fine. The bpf() syscall manipulates bpf objects.
>>
>> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>>
>> ^^^^ This is not so good:
>> ^^^^
>> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This
>> ^^^^ is manipulating a cgroup. There's no reason that a socket creation
>> ^^^^ filter couldn't be written in a different language (new iptables
>> ^^^^ table? Simple list of address families?), but if that happened,
>> ^^^^ then using bpf() to install it would be entirely nonsensical.
>
> I don't see why it's _modifing_ the cgroup. I'm looking at it as
> network stack is using cgroup as an application group that should
> invoke bpf program at the certain point in the stack.
> imo cgroup management is orthogonal.

It is literally modifying the struct cgroup, and, as a practical
matter, it's causing membership in the cgroup to have a certain
effect. But rather than pointless arguing, let me propose an
alternative API that I think solves most of the problems here.

In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
Instead, the cgroup gets three new control files:
"net.ingress_filter", "net.egress_filter", and
"net.socket_create_filter". Initially, if you read these files, you
see "none\n".

To attach a bpf filter, you open the file for write and do an ioctl on
it. After doing the ioctl, if you read the file, you'll see
"bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
the bpf program.

To detach any type of filter, bpf or otherwise, you open the file for
write and write "none\n" (or just "none").

If you write anything else to the file, you get -EINVAL. But, if
someone writes a new type of filter (perhaps a simple list of address
families), maybe you can enable the filter by writing something
appropriate to the file.

Now the API matches the effect. A cgroup with something other than
"none" in one of its net.*_filter files is a cgroup that filters
network activity. And you get CRIU compatibility for free: CRIU can
read the control file and use whatever mechanism is uses for BPF in
general to restore the cgroup filter state. As an added bonus, you
get ACLs for free and the ugly multiplexer goes away.

>> # mkdir cg2/nosockets/sockets
>> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
>>
>> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
>> ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead,
>> ^^^^ there would be a chance to refine it.
>
> i don't see the problem with this behavior. bpf and cgroup are indepedent.
> Imange that socket accounting program is attached to cg2/nosockets.
> The program is readonly and carry no security meaning.
> Why cgroup should prevent creation of cg2/nosockets/foo directory ?

I think you're misunderstanding me. What I'm saying is that, if you
allow a cgroup and one of its descendents to both enable the same type
of filter, you have just committed to some particular semantics for
what happens. And I think that the current semantics are the *wrong*
semantics for default long-term use, so you should either fix the
semantics or disable the problematic case.

>
>> # echo $$ >cg2/nosockets/sockets/cgroup.procs
>> # ping 127.0.0.1
>> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
>> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
>
> hmm. in this case the admin created a subgroup with a group that allows
> ping, so? how's that a problem?

I think you're forgetting about namespaces. There are two different
admins here. There's the admin who created the outer container and
said "no sockets here". Then there's the admin inside the container
who said "muahaha, I can create a sub-container and allow sockets
*there*".

> In case of vrf I can imagine
> a set of application auto-binding to one vrf and smaller
> set of applications binding to a different vrf.
> Or broader set of applications monitoring all tcp traffic
> and subset of them monitoring udp instead.
> Those are valid use cases.

> I guess you're advocating to run a link list of programs?
> That won't be useful for the above use cases, where there is no
> reason to run more than one program that control plane
> assigned to run for this cgroup.

Yes there is, for both monitoring and policy. If I want to monitor
all activity in a cgroup, I probably want to monitor descendents as
well. If I want to restrict a cgroup, I want to restrict its
descendents.

In the case where I actually don't want to hook the descendents, I'd
be find with having an option to turn off recursion. Maybe
net.egress_filter could also say "bpf(overridable):[hash]" or
"bpf(nonrecursive):[hash]". But you should have to opt in to allowing
your filter to be overridden.

> At this stage we decided to allow only one program per cgroup per hook
> and later can extend it if necessary.

No you can't. Since you allow the problematic case and you gave it
the surprising "one program" semantics, you can't change it later.

> As you're pointing out, in case of security, we probably
> want to preserve original bpf program that should always be
> run first and only after it returned 'ok' (we'd need to define
> what 'ok' means in secruity context) run program attached to sub-hierarchy.

It's already defined AFAICT. 1 means okay. 0 means not okay.

> Another alternative is to disallow attaching programs in sub-hierarchy
> if parent has something already attached, but it's not useful
> for general case.
> All of these are possible future extensions.

I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You
have to do it now (or disable the feature for 4.10). This is why I'm
bringing this whole thing up now.

>
>> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
>> programs attached that can do things if various events occur. (Right
>> now, this means socket operations, but there are plans in the works
>> to do this for LSM hooks too.) These bpf programs can say yes or no,
>> but they can also read out various data (including socket payloads!)
>> and save them away where an attacker can find them. This sounds a
>> lot like seccomp with a narrower scope but a much stronger ability to
>> exfiltrate private information.
>
> that sounds like a future problem to solve when bpf+lsm+cgroup are
> used for security.

[...]

>
> I disagree with the urgency. I see nothing that needs immediate action.
> bpf+lsm+cgroup is not in the tree yet.
>

I disagree very strongly here. The API in 4.10 is IMO quite ugly, but
the result if bpf+lsm+cgroup works *differently* will be far, far
uglier.

I think the right solution here is to clean up the API so that it'll
work for future extensions that people are already imagining. If this
can happen for 4.10, great. If not, then postpone this stuff
entirely. I've had code I've written for Linux postponed extensively
until I've gotten the API right, and it's not so bad. (Actually, I've
even had API changes I've made reverted in -stable, IIRC. This is
much worse than postponing before a release.)

--Andy

2016-12-20 00:03:07

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
> <[email protected]> wrote:
> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> >> Hi all-
> >>
> >> I apologize for being rather late with this. I didn't realize that
> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> >> it on the linux-api list, so I missed the discussion.
> >>
> >> I think that the inet ingress, egress etc filters are a neat feature,
> >> but I think the API has some issues that will bite us down the road
> >> if it becomes stable in its current form.
> >>
> >> Most of the problems I see are summarized in this transcript:
> >>
> >> # mkdir cg2
> >> # mount -t cgroup2 none cg2
> >> # mkdir cg2/nosockets
> >> # strace cgrp_socket_rule cg2/nosockets/ 0
> >> ...
> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> >>
> >> ^^^^ You can modify a cgroup after opening it O_RDONLY?
> >>
> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
> >>
> >> ^^^^ This is fine. The bpf() syscall manipulates bpf objects.
> >>
> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> >>
> >> ^^^^ This is not so good:
> >> ^^^^
> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This
> >> ^^^^ is manipulating a cgroup. There's no reason that a socket creation
> >> ^^^^ filter couldn't be written in a different language (new iptables
> >> ^^^^ table? Simple list of address families?), but if that happened,
> >> ^^^^ then using bpf() to install it would be entirely nonsensical.
> >
> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
> > network stack is using cgroup as an application group that should
> > invoke bpf program at the certain point in the stack.
> > imo cgroup management is orthogonal.
>
> It is literally modifying the struct cgroup, and, as a practical
> matter, it's causing membership in the cgroup to have a certain
> effect. But rather than pointless arguing, let me propose an
> alternative API that I think solves most of the problems here.
>
> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
> Instead, the cgroup gets three new control files:
> "net.ingress_filter", "net.egress_filter", and
> "net.socket_create_filter". Initially, if you read these files, you
> see "none\n".
>
> To attach a bpf filter, you open the file for write and do an ioctl on
> it. After doing the ioctl, if you read the file, you'll see
> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
> the bpf program.
>
> To detach any type of filter, bpf or otherwise, you open the file for
> write and write "none\n" (or just "none").
>
> If you write anything else to the file, you get -EINVAL. But, if
> someone writes a new type of filter (perhaps a simple list of address
> families), maybe you can enable the filter by writing something
> appropriate to the file.

I see no difference in what you're proposing vs what is already implemented
from feature set point of view, but the file approach is very ugly, since
it's a mismatch to FD style access that bpf is using everywhere.
In your proposal you'd also need to add bpf prefix everywhere.
So the control file names should be bpf_inet_ingress, bpf_inet_egress
and bpf_socket_create.
If you want to prepare such patch for them, I don't mind,
but you cannot kill syscall command, since it's more flexible
and your control-file approach _will_ be obsolete pretty quickly.

> Now the API matches the effect. A cgroup with something other than
> "none" in one of its net.*_filter files is a cgroup that filters
> network activity. And you get CRIU compatibility for free: CRIU can
> read the control file and use whatever mechanism is uses for BPF in
> general to restore the cgroup filter state. As an added bonus, you
> get ACLs for free and the ugly multiplexer goes away.

extended bpf is not supported by criu. only classic, so having
control_file-style attachment doesn't buy us anything.

> >> # mkdir cg2/nosockets/sockets
> >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
> >>
> >> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
> >> ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead,
> >> ^^^^ there would be a chance to refine it.
> >
> > i don't see the problem with this behavior. bpf and cgroup are indepedent.
> > Imange that socket accounting program is attached to cg2/nosockets.
> > The program is readonly and carry no security meaning.
> > Why cgroup should prevent creation of cg2/nosockets/foo directory ?
>
> I think you're misunderstanding me. What I'm saying is that, if you
> allow a cgroup and one of its descendents to both enable the same type
> of filter, you have just committed to some particular semantics for
> what happens. And I think that the current semantics are the *wrong*
> semantics for default long-term use, so you should either fix the
> semantics or disable the problematic case.

Are you saying that use cases I provided are also 'wrong'?
If you insist on that we won't be able to make any forward progress.
The current semantics is fine for what it's designed for.

> >> # echo $$ >cg2/nosockets/sockets/cgroup.procs
> >> # ping 127.0.0.1
> >> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
> >> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
> >
> > hmm. in this case the admin created a subgroup with a group that allows
> > ping, so? how's that a problem?
>
> I think you're forgetting about namespaces. There are two different
> admins here. There's the admin who created the outer container and
> said "no sockets here". Then there's the admin inside the container
> who said "muahaha, I can create a sub-container and allow sockets
> *there*".

container management frameworks should be doing sensible
things. With any infra in the kernel there are many ways to
create non-sensical setups. It's not a job of the kernel
to interfere with user space.

> > In case of vrf I can imagine
> > a set of application auto-binding to one vrf and smaller
> > set of applications binding to a different vrf.
> > Or broader set of applications monitoring all tcp traffic
> > and subset of them monitoring udp instead.
> > Those are valid use cases.
>
> > I guess you're advocating to run a link list of programs?
> > That won't be useful for the above use cases, where there is no
> > reason to run more than one program that control plane
> > assigned to run for this cgroup.
>
> Yes there is, for both monitoring and policy. If I want to monitor
> all activity in a cgroup, I probably want to monitor descendents as
> well. If I want to restrict a cgroup, I want to restrict its
> descendents.

you're ignoring use cases I described earlier.
In vrf case there is only one ifindex it needs to bind to.

> In the case where I actually don't want to hook the descendents, I'd
> be find with having an option to turn off recursion. Maybe
> net.egress_filter could also say "bpf(overridable):[hash]" or
> "bpf(nonrecursive):[hash]". But you should have to opt in to allowing
> your filter to be overridden.

'opt in' only makes sense if you're implementing sandboxing environment.
It doesn't make sense as the default.

> > At this stage we decided to allow only one program per cgroup per hook
> > and later can extend it if necessary.
>
> No you can't. Since you allow the problematic case and you gave it
> the surprising "one program" semantics, you can't change it later.

please show me why we cannot? As far as I can see nothing prevents
that in the future. We can add any number of new fields to
BPF_PROG_ATTACH command just like we did in the past with
other commands, whereas control file interface is not extensible.

> > As you're pointing out, in case of security, we probably
> > want to preserve original bpf program that should always be
> > run first and only after it returned 'ok' (we'd need to define
> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
>
> It's already defined AFAICT. 1 means okay. 0 means not okay.

sorry that doesn't make any sense. For seccomp we have a set of
ranges that mean different things. Here you're proposing to
hastily assign 1 and 0 ? How is that extensible?
We need to carefully think through what should be the semantics
of attaching multiple programs, consider performance implications,
return codes and so on.

> > Another alternative is to disallow attaching programs in sub-hierarchy
> > if parent has something already attached, but it's not useful
> > for general case.
> > All of these are possible future extensions.
>
> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You
> have to do it now (or disable the feature for 4.10). This is why I'm
> bringing this whole thing up now.

We don't have to touch user visible api here, so extensions are fine.

> >> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
> >> programs attached that can do things if various events occur. (Right
> >> now, this means socket operations, but there are plans in the works
> >> to do this for LSM hooks too.) These bpf programs can say yes or no,
> >> but they can also read out various data (including socket payloads!)
> >> and save them away where an attacker can find them. This sounds a
> >> lot like seccomp with a narrower scope but a much stronger ability to
> >> exfiltrate private information.
> >
> > that sounds like a future problem to solve when bpf+lsm+cgroup are
> > used for security.
>
> [...]
>
> >
> > I disagree with the urgency. I see nothing that needs immediate action.
> > bpf+lsm+cgroup is not in the tree yet.
> >
>
> I disagree very strongly here. The API in 4.10 is IMO quite ugly, but
> the result if bpf+lsm+cgroup works *differently* will be far, far
> uglier.

again we're talking about the future here and 'ugly' is the matter of taste.
I hear all the time that people say that netlink api is ugly, so?

> I think the right solution here is to clean up the API so that it'll
> work for future extensions that people are already imagining. If this
> can happen for 4.10, great. If not, then postpone this stuff
> entirely. I've had code I've written for Linux postponed extensively
> until I've gotten the API right, and it's not so bad. (Actually, I've
> even had API changes I've made reverted in -stable, IIRC. This is
> much worse than postponing before a release.)

huh? 'not right api' because it's using bpf syscall instead
of cgroup control-file? I think the opposite is the truth.
syscall style is more extensible whereas control-file and text
based interfaces have proven to be huge pain to extend and maintain.
Again, I don't mind if you want to have both available.

2016-12-20 00:25:58

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
>> <[email protected]> wrote:
>> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
>> >> Hi all-
>> >>
>> >> I apologize for being rather late with this. I didn't realize that
>> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> >> it on the linux-api list, so I missed the discussion.
>> >>
>> >> I think that the inet ingress, egress etc filters are a neat feature,
>> >> but I think the API has some issues that will bite us down the road
>> >> if it becomes stable in its current form.
>> >>
>> >> Most of the problems I see are summarized in this transcript:
>> >>
>> >> # mkdir cg2
>> >> # mount -t cgroup2 none cg2
>> >> # mkdir cg2/nosockets
>> >> # strace cgrp_socket_rule cg2/nosockets/ 0
>> >> ...
>> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>> >>
>> >> ^^^^ You can modify a cgroup after opening it O_RDONLY?
>> >>
>> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
>> >>
>> >> ^^^^ This is fine. The bpf() syscall manipulates bpf objects.
>> >>
>> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>> >>
>> >> ^^^^ This is not so good:
>> >> ^^^^
>> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This
>> >> ^^^^ is manipulating a cgroup. There's no reason that a socket creation
>> >> ^^^^ filter couldn't be written in a different language (new iptables
>> >> ^^^^ table? Simple list of address families?), but if that happened,
>> >> ^^^^ then using bpf() to install it would be entirely nonsensical.
>> >
>> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
>> > network stack is using cgroup as an application group that should
>> > invoke bpf program at the certain point in the stack.
>> > imo cgroup management is orthogonal.
>>
>> It is literally modifying the struct cgroup, and, as a practical
>> matter, it's causing membership in the cgroup to have a certain
>> effect. But rather than pointless arguing, let me propose an
>> alternative API that I think solves most of the problems here.
>>
>> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
>> Instead, the cgroup gets three new control files:
>> "net.ingress_filter", "net.egress_filter", and
>> "net.socket_create_filter". Initially, if you read these files, you
>> see "none\n".
>>
>> To attach a bpf filter, you open the file for write and do an ioctl on
>> it. After doing the ioctl, if you read the file, you'll see
>> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
>> the bpf program.
>>
>> To detach any type of filter, bpf or otherwise, you open the file for
>> write and write "none\n" (or just "none").
>>
>> If you write anything else to the file, you get -EINVAL. But, if
>> someone writes a new type of filter (perhaps a simple list of address
>> families), maybe you can enable the filter by writing something
>> appropriate to the file.
>
> I see no difference in what you're proposing vs what is already implemented
> from feature set point of view, but the file approach is very ugly, since
> it's a mismatch to FD style access that bpf is using everywhere.
> In your proposal you'd also need to add bpf prefix everywhere.
> So the control file names should be bpf_inet_ingress, bpf_inet_egress
> and bpf_socket_create.

I think we're still talking past each other. A big part of the point
of changing it is that none of this is specific to bpf. You could (in
theory -- I'm not proposing implementing these until there's demand)
have:

net.socket_create_filter = "none": no filter
net.socket_create_filter = "bpf:baadf00d": bpf filter
net.socket_create_filter = "disallow": no sockets created period
net.socket_create_filter = "iptables:foobar": some iptables thingy
net.socket_create_filter = "nft:blahblahblah": some nft thingy
net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

See? This API is not bpf-specific. It's an API for filtering. The
fact that struct cgroup currently contains a member called "bpf" is
purely an artifact of the fact that it currently only supports bpf.
Someone will want to rename it to "filters" some day, and
BPF_PROG_DETACH makes no sense whatsoever as a generic API to detach a
filter.

> If you want to prepare such patch for them, I don't mind,
> but you cannot kill syscall command, since it's more flexible
> and your control-file approach _will_ be obsolete pretty quickly.

BPF_PROG_ATTACH and BPF_PROC_DETACH should be removed regardless IMO.
If you really really want a syscall, make it a new syscall.

>
>> Now the API matches the effect. A cgroup with something other than
>> "none" in one of its net.*_filter files is a cgroup that filters
>> network activity. And you get CRIU compatibility for free: CRIU can
>> read the control file and use whatever mechanism is uses for BPF in
>> general to restore the cgroup filter state. As an added bonus, you
>> get ACLs for free and the ugly multiplexer goes away.
>
> extended bpf is not supported by criu. only classic, so having
> control_file-style attachment doesn't buy us anything.

CRIU will support it some day. We might as well put fewer obstacles
in their way.

>
>> >> # mkdir cg2/nosockets/sockets
>> >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
>> >>
>> >> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
>> >> ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead,
>> >> ^^^^ there would be a chance to refine it.
>> >
>> > i don't see the problem with this behavior. bpf and cgroup are indepedent.
>> > Imange that socket accounting program is attached to cg2/nosockets.
>> > The program is readonly and carry no security meaning.
>> > Why cgroup should prevent creation of cg2/nosockets/foo directory ?
>>
>> I think you're misunderstanding me. What I'm saying is that, if you
>> allow a cgroup and one of its descendents to both enable the same type
>> of filter, you have just committed to some particular semantics for
>> what happens. And I think that the current semantics are the *wrong*
>> semantics for default long-term use, so you should either fix the
>> semantics or disable the problematic case.
>
> Are you saying that use cases I provided are also 'wrong'?
> If you insist on that we won't be able to make any forward progress.
> The current semantics is fine for what it's designed for.

Can you explain your use case more clearly?

>
>> >> # echo $$ >cg2/nosockets/sockets/cgroup.procs
>> >> # ping 127.0.0.1
>> >> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
>> >> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
>> >
>> > hmm. in this case the admin created a subgroup with a group that allows
>> > ping, so? how's that a problem?
>>
>> I think you're forgetting about namespaces. There are two different
>> admins here. There's the admin who created the outer container and
>> said "no sockets here". Then there's the admin inside the container
>> who said "muahaha, I can create a sub-container and allow sockets
>> *there*".
>
> container management frameworks should be doing sensible
> things. With any infra in the kernel there are many ways to
> create non-sensical setups. It's not a job of the kernel
> to interfere with user space.

What exactly isn't sensible about using cgroup_bpf for containers?

>
>> > In case of vrf I can imagine
>> > a set of application auto-binding to one vrf and smaller
>> > set of applications binding to a different vrf.
>> > Or broader set of applications monitoring all tcp traffic
>> > and subset of them monitoring udp instead.
>> > Those are valid use cases.
>>
>> > I guess you're advocating to run a link list of programs?
>> > That won't be useful for the above use cases, where there is no
>> > reason to run more than one program that control plane
>> > assigned to run for this cgroup.
>>
>> Yes there is, for both monitoring and policy. If I want to monitor
>> all activity in a cgroup, I probably want to monitor descendents as
>> well. If I want to restrict a cgroup, I want to restrict its
>> descendents.
>
> you're ignoring use cases I described earlier.
> In vrf case there is only one ifindex it needs to bind to.

I'm totally lost. Can you explain what this has to do with the cgroup
hierarchy?

>> > At this stage we decided to allow only one program per cgroup per hook
>> > and later can extend it if necessary.
>>
>> No you can't. Since you allow the problematic case and you gave it
>> the surprising "one program" semantics, you can't change it later.
>
> please show me why we cannot? As far as I can see nothing prevents
> that in the future. We can add any number of new fields to
> BPF_PROG_ATTACH command just like we did in the past with
> other commands, whereas control file interface is not extensible.

Because people are going to start using the old API, tools won't be
aware of the new semantics, you have no usable introspection
mechanism, and everyone is going to screw it up. But it's even worse:
because global privilege is currently needed to set up these filters,
containers really can use it today, but once you switch to ns_capable,
then it suddenly becomes insecure. And *that* is something that you
can't do.

>
>> > As you're pointing out, in case of security, we probably
>> > want to preserve original bpf program that should always be
>> > run first and only after it returned 'ok' (we'd need to define
>> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
>>
>> It's already defined AFAICT. 1 means okay. 0 means not okay.
>
> sorry that doesn't make any sense. For seccomp we have a set of
> ranges that mean different things. Here you're proposing to
> hastily assign 1 and 0 ? How is that extensible?
> We need to carefully think through what should be the semantics
> of attaching multiple programs, consider performance implications,
> return codes and so on.

You already assigned it. The return value of the bpf program, loaded
in Linus' tree today, tells the kernel whether to accept or reject.

>
>> > Another alternative is to disallow attaching programs in sub-hierarchy
>> > if parent has something already attached, but it's not useful
>> > for general case.
>> > All of these are possible future extensions.
>>
>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You
>> have to do it now (or disable the feature for 4.10). This is why I'm
>> bringing this whole thing up now.
>
> We don't have to touch user visible api here, so extensions are fine.

Huh? My example in the original email attaches a program in a
sub-hierarchy. Are you saying that 4.11 could make that example stop
working?

>
>> >> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
>> >> programs attached that can do things if various events occur. (Right
>> >> now, this means socket operations, but there are plans in the works
>> >> to do this for LSM hooks too.) These bpf programs can say yes or no,
>> >> but they can also read out various data (including socket payloads!)
>> >> and save them away where an attacker can find them. This sounds a
>> >> lot like seccomp with a narrower scope but a much stronger ability to
>> >> exfiltrate private information.
>> >
>> > that sounds like a future problem to solve when bpf+lsm+cgroup are
>> > used for security.
>>
>> [...]
>>
>> >
>> > I disagree with the urgency. I see nothing that needs immediate action.
>> > bpf+lsm+cgroup is not in the tree yet.
>> >
>>
>> I disagree very strongly here. The API in 4.10 is IMO quite ugly, but
>> the result if bpf+lsm+cgroup works *differently* will be far, far
>> uglier.
>
> again we're talking about the future here and 'ugly' is the matter of taste.
> I hear all the time that people say that netlink api is ugly, so?

Yeah, it's ugly, but that ship already sailed. This ship hasn't.

2016-12-20 01:34:33

by David Miller

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

From: Alexei Starovoitov <[email protected]>
Date: Mon, 19 Dec 2016 16:02:56 -0800

> huh? 'not right api' because it's using bpf syscall instead
> of cgroup control-file? I think the opposite is the truth.

I completely agree with Alexei on this.

2016-12-20 01:41:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 5:34 PM, David Miller <[email protected]> wrote:
> From: Alexei Starovoitov <[email protected]>
> Date: Mon, 19 Dec 2016 16:02:56 -0800
>
>> huh? 'not right api' because it's using bpf syscall instead
>> of cgroup control-file? I think the opposite is the truth.
>
> I completely agree with Alexei on this.

So what happens when someone adds another type of filter? Let's say
there's a simple, no-privilege-required list of allowed address
families that can hook up to the socket creation hook for a cgroup.
Does BPF_PROG_DETACH still detach it? Or would both the bpf *and* the
list of allowed address families be in force? If the latter, why
wouldn't two BPF programs on the same hook be allowed?

Concretely:

# mkdir /cgroup/a
# set_up_bpf_socket_rule /cgroup/a
# set_up_list_of_address_families /cgroup/a
# cat /cgroup/a/some_new_file [what gets displayed?]
# BPF_PROG_DETACH: what happens

By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
even take a reference to a BPF program as an argument. What is it
supposed to do if this mechanism ever gets extended?

--Andy

2016-12-20 01:44:06

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 4:25 PM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
> <[email protected]> wrote:
>> you're ignoring use cases I described earlier.
>> In vrf case there is only one ifindex it needs to bind to.
>
> I'm totally lost. Can you explain what this has to do with the cgroup
> hierarchy?
>

Okay, I figured out what you mean, I think. You have a handful of vrf
devices. Let's say they have ifindexes 1 and 2 (and maybe more).

The interesting case happens when you set up /cgroup/a with a bpf
program that binds new sockets to ifindex 1 and /cgroup/a/b with a bpf
program that binds new sockets to ifindex 2. The question is: what
should happen if you're in /cgroup/a/b? Presumably, if you do this,
you wanted to end up with ifindex 2.

I think the way it should actually work is that the kernel evaluates
/cgroup/a/b's hook and then /cgroup/a's hook. Then /cgroup/a (which
is the more privileged hook) gets to make the choice. If it wants
ifindex 2 to win, it can do (pseudocode):

if (!sk->sk_bound_if)
sk->sk_bound_if = 1;

2016-12-20 01:44:22

by David Ahern

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On 12/19/16 5:25 PM, Andy Lutomirski wrote:
> net.socket_create_filter = "none": no filter
> net.socket_create_filter = "bpf:baadf00d": bpf filter
> net.socket_create_filter = "disallow": no sockets created period
> net.socket_create_filter = "iptables:foobar": some iptables thingy
> net.socket_create_filter = "nft:blahblahblah": some nft thingy
> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters.

...

>> you're ignoring use cases I described earlier.
>> In vrf case there is only one ifindex it needs to bind to.
>
> I'm totally lost. Can you explain what this has to do with the cgroup
> hierarchy?

I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is

cgrp2/vrf/NAME

where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense.

...

>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You
>>> have to do it now (or disable the feature for 4.10). This is why I'm
>>> bringing this whole thing up now.
>>
>> We don't have to touch user visible api here, so extensions are fine.
>
> Huh? My example in the original email attaches a program in a
> sub-hierarchy. Are you saying that 4.11 could make that example stop
> working?

Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?


2016-12-20 01:56:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <[email protected]> wrote:
> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>> net.socket_create_filter = "none": no filter
>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>> net.socket_create_filter = "disallow": no sockets created period
>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>
> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters.

Can you elaborate on what goes wrong? (Obviously the
"address_family_list" example makes no sense in that context.)

>
> ...
>
>>> you're ignoring use cases I described earlier.
>>> In vrf case there is only one ifindex it needs to bind to.
>>
>> I'm totally lost. Can you explain what this has to do with the cgroup
>> hierarchy?
>
> I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is
>
> cgrp2/vrf/NAME
>
> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense.

I tend to agree. I still think that the mechanism as it stands is
broken in other respects and should be fixed before it goes live. I
have no desire to cause problems for the vrf use case.

But keep in mind that the vrf use case is, in Linus' tree, a bit
broken right now in its interactions with other users of the same
mechanism. Suppose I create a container and want to trace all of its
created sockets. I'll set up cgrp2/container and load my tracer as a
socket creation hook. Then a container sets up
cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding
filter. Now the tracing stops working -- oops.

>
> ...
>
>>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You
>>>> have to do it now (or disable the feature for 4.10). This is why I'm
>>>> bringing this whole thing up now.
>>>
>>> We don't have to touch user visible api here, so extensions are fine.
>>
>> Huh? My example in the original email attaches a program in a
>> sub-hierarchy. Are you saying that 4.11 could make that example stop
>> working?
>
> Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?

Yes, exactly. I think there are two sensible behaviors:

a) sub-cgroups cannot have a filter at all of the parent has a filter.
(This is the "punt" approach -- it lets different semantics be
assigned later without breaking userspace.)

b) sub-cgroups can have a filter if a parent does, too. The semantics
are that the sub-cgroup filter runs first and all side-effects occur.
If that filter says "reject" then ancestor filters are skipped. If
that filter says "accept", then the ancestor filter is run and its
side-effects happen as well. (And so on, all the way up to the root.)

--Andy

2016-12-20 02:52:25

by David Ahern

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On 12/19/16 6:56 PM, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <[email protected]> wrote:
>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>> net.socket_create_filter = "none": no filter
>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>> net.socket_create_filter = "disallow": no sockets created period
>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>
>> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters.
>
> Can you elaborate on what goes wrong? (Obviously the
> "address_family_list" example makes no sense in that context.)

Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability.

>>
>> ...
>>
>>>> you're ignoring use cases I described earlier.
>>>> In vrf case there is only one ifindex it needs to bind to.
>>>
>>> I'm totally lost. Can you explain what this has to do with the cgroup
>>> hierarchy?
>>
>> I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is
>>
>> cgrp2/vrf/NAME
>>
>> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense.
>
> I tend to agree. I still think that the mechanism as it stands is
> broken in other respects and should be fixed before it goes live. I
> have no desire to cause problems for the vrf use case.
>
> But keep in mind that the vrf use case is, in Linus' tree, a bit
> broken right now in its interactions with other users of the same
> mechanism. Suppose I create a container and want to trace all of its
> created sockets. I'll set up cgrp2/container and load my tracer as a
> socket creation hook. Then a container sets up
> cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding
> filter. Now the tracing stops working -- oops.

There are other ways to achieve socket tracing, but I get your point -- nested cases do not work as users may want.


>>>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You
>>>>> have to do it now (or disable the feature for 4.10). This is why I'm
>>>>> bringing this whole thing up now.
>>>>
>>>> We don't have to touch user visible api here, so extensions are fine.
>>>
>>> Huh? My example in the original email attaches a program in a
>>> sub-hierarchy. Are you saying that 4.11 could make that example stop
>>> working?
>>
>> Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?
>
> Yes, exactly. I think there are two sensible behaviors:
>
> a) sub-cgroups cannot have a filter at all of the parent has a filter.
> (This is the "punt" approach -- it lets different semantics be
> assigned later without breaking userspace.)
>
> b) sub-cgroups can have a filter if a parent does, too. The semantics
> are that the sub-cgroup filter runs first and all side-effects occur.
> If that filter says "reject" then ancestor filters are skipped. If
> that filter says "accept", then the ancestor filter is run and its
> side-effects happen as well. (And so on, all the way up to the root.)

That comes with a big performance hit for skb / data path cases.

I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.

2016-12-20 03:13:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 6:52 PM, David Ahern <[email protected]> wrote:
> On 12/19/16 6:56 PM, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <[email protected]> wrote:
>>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>>> net.socket_create_filter = "none": no filter
>>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>>> net.socket_create_filter = "disallow": no sockets created period
>>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>>
>>> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters.
>>
>> Can you elaborate on what goes wrong? (Obviously the
>> "address_family_list" example makes no sense in that context.)
>
> Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability.

Oh -- I'm not proposing that at all. Let me clarify. For the bpf
case, if you *read* the file, you'd see "bpf:baadf00d". But writing
"bpf:baadf00d" is nonsense and would give you -EINVAL. Instead you
install a bpf filter by opening the file for write (O_RDWR or
O_WRONLY) and doing ioctl(cgroup_socket_create_file_fd,
CGROUP_IOCTL_SET_BPF_FILTER, bpf_fd); It's kind of like
BPF_PROG_ATTACH except that it respects filesystem permissions, it
isn't in the bpf() multiplexer, the filter being set is implied (by
the fd in use), and everything is nicely seccompable.

To remove the filter, you write "none" or "none\n" to the file.

As a future extension, if someone wanted more than one filter to be
able to coexist in the cgroup socket_create_filter slot, you could
plausibly do 'echo disallow >>net.socket_create_filter' or use a new
ioctl CGROUP_IOCTL_APPEND_BPF_FILTER, and then you'd read the file and
see more than one line. But this would be a future extension and may
never be needed.

>>
>> a) sub-cgroups cannot have a filter at all of the parent has a filter.
>> (This is the "punt" approach -- it lets different semantics be
>> assigned later without breaking userspace.)
>>
>> b) sub-cgroups can have a filter if a parent does, too. The semantics
>> are that the sub-cgroup filter runs first and all side-effects occur.
>> If that filter says "reject" then ancestor filters are skipped. If
>> that filter says "accept", then the ancestor filter is run and its
>> side-effects happen as well. (And so on, all the way up to the root.)
>
> That comes with a big performance hit for skb / data path cases.
>
> I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.

I'm not sure I buy the performance hit. If you do it naively, then
performance will indeed suck. But there's already a bunch of code in
there to pre-populate a filter list for faster use. Currently, we
have:

struct cgroup_bpf {
/*
* Store two sets of bpf_prog pointers, one for programs that are
* pinned directly to this cgroup, and one for those that are effective
* when this cgroup is accessed.
*/
struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
};

in struct cgroup, there's a 'struct cgroup_bpf bpf;'.

This would change to something like:

struct cgroup_filter_slot {
struct bpf_prog *effective;
struct cgroup_filter_slot *next;
struct bpf_prog *local;
}

local is NULL unless *this* cgroup has a filter. effective points to
the bpf_prog that's active in this cgroup or the nearest ancestor that
has a filter. next is NULL if there are no filters higher in the
chain or points to the next slot that has a filter. struct cgroup
has:

struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];

To evaluate it, you do:

struct cgroup_filter_slot *slot = &cgroup->slot[the index];

if (!slot->effective)
return;

do {
evaluate(slot->effective);
slot = slot->next;
} while (unlikely(slot));

The old code was one branch per evaluation. The new code is n+1
branches where n is the number of filters, so it's one extra branch in
the worst case. But the extra branch is cache-hot (the variable is
right next to slot->effective, which is needed regardless) and highly
predictable (in the case where nesting isn't used, the branch is not
taken, and it's a backward branch which most CPUs will predict as not
taken).

I expect that, on x86, this adds at most a cycle or two and quite
possibly adds no measurable overhead at all.

2016-12-20 03:18:14

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
> <[email protected]> wrote:
> > On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
> >> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
> >> <[email protected]> wrote:
> >> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> >> >> Hi all-
> >> >>
> >> >> I apologize for being rather late with this. I didn't realize that
> >> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> >> >> it on the linux-api list, so I missed the discussion.
> >> >>
> >> >> I think that the inet ingress, egress etc filters are a neat feature,
> >> >> but I think the API has some issues that will bite us down the road
> >> >> if it becomes stable in its current form.
> >> >>
> >> >> Most of the problems I see are summarized in this transcript:
> >> >>
> >> >> # mkdir cg2
> >> >> # mount -t cgroup2 none cg2
> >> >> # mkdir cg2/nosockets
> >> >> # strace cgrp_socket_rule cg2/nosockets/ 0
> >> >> ...
> >> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> >> >>
> >> >> ^^^^ You can modify a cgroup after opening it O_RDONLY?
> >> >>
> >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> >> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> >> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
> >> >>
> >> >> ^^^^ This is fine. The bpf() syscall manipulates bpf objects.
> >> >>
> >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> >> >>
> >> >> ^^^^ This is not so good:
> >> >> ^^^^
> >> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This
> >> >> ^^^^ is manipulating a cgroup. There's no reason that a socket creation
> >> >> ^^^^ filter couldn't be written in a different language (new iptables
> >> >> ^^^^ table? Simple list of address families?), but if that happened,
> >> >> ^^^^ then using bpf() to install it would be entirely nonsensical.
> >> >
> >> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
> >> > network stack is using cgroup as an application group that should
> >> > invoke bpf program at the certain point in the stack.
> >> > imo cgroup management is orthogonal.
> >>
> >> It is literally modifying the struct cgroup, and, as a practical
> >> matter, it's causing membership in the cgroup to have a certain
> >> effect. But rather than pointless arguing, let me propose an
> >> alternative API that I think solves most of the problems here.
> >>
> >> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
> >> Instead, the cgroup gets three new control files:
> >> "net.ingress_filter", "net.egress_filter", and
> >> "net.socket_create_filter". Initially, if you read these files, you
> >> see "none\n".
> >>
> >> To attach a bpf filter, you open the file for write and do an ioctl on
> >> it. After doing the ioctl, if you read the file, you'll see
> >> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
> >> the bpf program.
> >>
> >> To detach any type of filter, bpf or otherwise, you open the file for
> >> write and write "none\n" (or just "none").
> >>
> >> If you write anything else to the file, you get -EINVAL. But, if
> >> someone writes a new type of filter (perhaps a simple list of address
> >> families), maybe you can enable the filter by writing something
> >> appropriate to the file.
> >
> > I see no difference in what you're proposing vs what is already implemented
> > from feature set point of view, but the file approach is very ugly, since
> > it's a mismatch to FD style access that bpf is using everywhere.
> > In your proposal you'd also need to add bpf prefix everywhere.
> > So the control file names should be bpf_inet_ingress, bpf_inet_egress
> > and bpf_socket_create.
>
> I think we're still talking past each other. A big part of the point
> of changing it is that none of this is specific to bpf. You could (in

the hooks and context passed into the program is very much bpf specific.
That's what I've been trying to convey all along.

> theory -- I'm not proposing implementing these until there's demand)
> have:
>
> net.socket_create_filter = "none": no filter
> net.socket_create_filter = "bpf:baadf00d": bpf filter

i'm assuming 'baadf00d' is bpf program fd expressed a text string?
and kernel needs to parse above? will you allow capital and lower
case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
can program fd expressed as decimal or hex or both?
how do you return the error? as a text string for user space
to parse?

> net.socket_create_filter = "iptables:foobar": some iptables thingy
> net.socket_create_filter = "nft:blahblahblah": some nft thingy

iptables/nft are not applicable to 'bpf_socket_create' which
looks into:
struct bpf_sock {
__u32 bound_dev_if;
__u32 family;
__u32 type;
__u32 protocol;
};
so don't fit as an example.

> net.socket_create_filter = "disallow": no sockets created period
> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

so you're proposing to add a bunch of hard coded logic to the kernel.
First to parse such text into some sort of syntax tree or list/set
and then have hard coded logic specifically for these two use cases?
While above two can be implemented as trivial bpf programs already?!
That goes 180% degree vs bpf philosophy. bpf is about moving
the specific code out of the kernel and keeping kernel generic that
it can solve as many use cases as possible by being programmable.

> See? This API is not bpf-specific. It's an API for filtering. The

no. I don't see it. BPF_CGROUP_INET_SOCK_CREATE is very much bpf specific
and we just discussed it to the last the detail.

> Can you explain your use case more clearly?
...
> What exactly isn't sensible about using cgroup_bpf for containers?

my use case is close to zero overhead application network monitoring.

> >> > As you're pointing out, in case of security, we probably
> >> > want to preserve original bpf program that should always be
> >> > run first and only after it returned 'ok' (we'd need to define
> >> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
> >>
> >> It's already defined AFAICT. 1 means okay. 0 means not okay.
> >
> > sorry that doesn't make any sense. For seccomp we have a set of
> > ranges that mean different things. Here you're proposing to
> > hastily assign 1 and 0 ? How is that extensible?
> > We need to carefully think through what should be the semantics
> > of attaching multiple programs, consider performance implications,
> > return codes and so on.
>
> You already assigned it. The return value of the bpf program, loaded
> in Linus' tree today, tells the kernel whether to accept or reject.

yes. that's what the program tells the hook.
I'm saying that whenever we have a link list of the programs
interaction between them may or may not be expressible with 1/0
and I don't want to rush such decision.

> >
> >> > Another alternative is to disallow attaching programs in sub-hierarchy
> >> > if parent has something already attached, but it's not useful
> >> > for general case.
> >> > All of these are possible future extensions.
> >>
> >> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You
> >> have to do it now (or disable the feature for 4.10). This is why I'm
> >> bringing this whole thing up now.
> >
> > We don't have to touch user visible api here, so extensions are fine.
>
> Huh? My example in the original email attaches a program in a
> sub-hierarchy. Are you saying that 4.11 could make that example stop
> working?

No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
type programs and only root can attach them.
and this root semantics obviously have to be preserved from now on,
but that doesn't mean that non-root combinations have to follow the same.
For example, if for some bizarre reason you want to do
net.socket_create_filter = "disallow": no sockets created period
in the hard coded way without using bpf at all
(I would certainly oppose that as a waste of kernel .text,
but I'm not going to nack it), so you can do whatever semantics you like.
Similarly for bpf+lsm+cgroup. In the next round of Mickael's patches
we'll keep debating the right security model for it.

2016-12-20 03:50:26

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
>> I think we're still talking past each other. A big part of the point
>> of changing it is that none of this is specific to bpf. You could (in
>
> the hooks and context passed into the program is very much bpf specific.
> That's what I've been trying to convey all along.

You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)? There is nothing bpf
specfic about the hook except that the name of this macro has "BPF" in
it. There is nothing whatsoever that's bpf-specific about the context
-- sk is not bpf-specific at all.

The only thing bpf-specific about it is that it currently only invokes
bpf programs. That could easily change.

>
>> theory -- I'm not proposing implementing these until there's demand)
>> have:
>>
>> net.socket_create_filter = "none": no filter
>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>
> i'm assuming 'baadf00d' is bpf program fd expressed a text string?
> and kernel needs to parse above? will you allow capital and lower
> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
> can program fd expressed as decimal or hex or both?
> how do you return the error? as a text string for user space
> to parse?

No. The kernel does not parse it because you cannot write this to the
file. You set a bpf filter with ioctl and pass an fd. If you *read*
the file, you get the same bpf program hash that fdinfo on the bpf
object would show -- this is for debugging and (eventually) CRIU.

>
>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>
> iptables/nft are not applicable to 'bpf_socket_create' which
> looks into:
> struct bpf_sock {
> __u32 bound_dev_if;
> __u32 family;
> __u32 type;
> __u32 protocol;
> };
> so don't fit as an example.

The code that takes a 'struct sock' and sets up bpf_sock is
bpf-specific and would obviously not be used for non-bpf filter. But
if you had a filter that was just a like of address families, that
filter would look at struct sock and do its own thing. iptables
wouldn't make sense for a socket creation filter, but it would make
perfect sense for an ingress filter. Obviously the bpf-specific state
object wouldn't be used, but it would still be a hook, invoked from
the same network code, guarded by the same static key, looking at the
same skb.

>
>> net.socket_create_filter = "disallow": no sockets created period
>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>
> so you're proposing to add a bunch of hard coded logic to the kernel.
> First to parse such text into some sort of syntax tree or list/set
> and then have hard coded logic specifically for these two use cases?
> While above two can be implemented as trivial bpf programs already?!
> That goes 180% degree vs bpf philosophy. bpf is about moving
> the specific code out of the kernel and keeping kernel generic that
> it can solve as many use cases as possible by being programmable.

I'm not seriously proposing implementing these. My point is that
*bpf*, while wonderful, is not the be-all-and-end-all of kernel
configurability, and other types of hooks might want to be hooked in
here.

> ...
>> What exactly isn't sensible about using cgroup_bpf for containers?
>
> my use case is close to zero overhead application network monitoring.

So if I set up a cgroup that's monitored and call it /cgroup/a and
enable delegation and if the program running there wants to do its own
monitoring in /cgroup/a/b (via delegation), then you really want the
outer monitor to silently drop events coming from /cgroup/a/b?

>
>> >> > As you're pointing out, in case of security, we probably
>> >> > want to preserve original bpf program that should always be
>> >> > run first and only after it returned 'ok' (we'd need to define
>> >> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
>> >>
>> >> It's already defined AFAICT. 1 means okay. 0 means not okay.
>> >
>> > sorry that doesn't make any sense. For seccomp we have a set of
>> > ranges that mean different things. Here you're proposing to
>> > hastily assign 1 and 0 ? How is that extensible?
>> > We need to carefully think through what should be the semantics
>> > of attaching multiple programs, consider performance implications,
>> > return codes and so on.
>>
>> You already assigned it. The return value of the bpf program, loaded
>> in Linus' tree today, tells the kernel whether to accept or reject.
>
> yes. that's what the program tells the hook.
> I'm saying that whenever we have a link list of the programs
> interaction between them may or may not be expressible with 1/0
> and I don't want to rush such decision.

Then disallow nesting. You're welcome to not rush the decision, but
that's not what you've done. If 4.10 is released as is, you've made
the decision and you're going to have a hard time changing it.

>
>> >
>> >> > Another alternative is to disallow attaching programs in sub-hierarchy
>> >> > if parent has something already attached, but it's not useful
>> >> > for general case.
>> >> > All of these are possible future extensions.
>> >>
>> >> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You
>> >> have to do it now (or disable the feature for 4.10). This is why I'm
>> >> bringing this whole thing up now.
>> >
>> > We don't have to touch user visible api here, so extensions are fine.
>>
>> Huh? My example in the original email attaches a program in a
>> sub-hierarchy. Are you saying that 4.11 could make that example stop
>> working?
>
> No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
> type programs and only root can attach them.

Why? It really seems to me that you expect that future namespaceable
bpf hooks will use a totally different API. At KS, I sat in a room
full of people arguing about cgroup v2 and a lot of them pointed out
that there are valid, paying use cases that want to stick cgroup v1 in
a container, because the code that uses cgroup v1 in the container is
called systemd and the contained OS is called RHEL (or SuSE or Ubuntu
or Gentoo or whatever), and that code is *already written* and needs
to be contained.

The current approach to bpf hooks will bite you down the road. David
Ahern is already proposing using it for something that is not tracing
at all, and someone will want that in a container, and there will be a
problem.

> Similarly for bpf+lsm+cgroup. In the next round of Mickael's patches
> we'll keep debating the right security model for it.
>

How about slowing down a wee bit and trying to come up with cgroup
hook semantics that work for all of these use cases? I think my
proposal is quite close to workable.

--Andy

2016-12-20 04:44:49

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
>
> struct cgroup_bpf {
> /*
> * Store two sets of bpf_prog pointers, one for programs that are
> * pinned directly to this cgroup, and one for those that are effective
> * when this cgroup is accessed.
> */
> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
> struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
> };
>
> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
>
> This would change to something like:
>
> struct cgroup_filter_slot {
> struct bpf_prog *effective;
> struct cgroup_filter_slot *next;
> struct bpf_prog *local;
> }
>
> local is NULL unless *this* cgroup has a filter. effective points to
> the bpf_prog that's active in this cgroup or the nearest ancestor that
> has a filter. next is NULL if there are no filters higher in the
> chain or points to the next slot that has a filter. struct cgroup
> has:
>
> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
>
> To evaluate it, you do:
>
> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
>
> if (!slot->effective)
> return;
>
> do {
> evaluate(slot->effective);
> slot = slot->next;
> } while (unlikely(slot));

yes. something like this can work as a future extension
to support multiple programs for security use case.
Please propose a patch.
Again, it's not needed today and there is no rush to implement it.

2016-12-20 04:49:23

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 07:50:01PM -0800, Andy Lutomirski wrote:
> >>
> >> net.socket_create_filter = "none": no filter
> >> net.socket_create_filter = "bpf:baadf00d": bpf filter
> >
> > i'm assuming 'baadf00d' is bpf program fd expressed a text string?
> > and kernel needs to parse above? will you allow capital and lower
> > case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
> > can program fd expressed as decimal or hex or both?
> > how do you return the error? as a text string for user space
> > to parse?
>
> No. The kernel does not parse it because you cannot write this to the
> file. You set a bpf filter with ioctl and pass an fd. If you *read*
> the file, you get the same bpf program hash that fdinfo on the bpf
> object would show -- this is for debugging and (eventually) CRIU.

my understanding that cgroup is based on kernfs and both don't support
ioctl, so you'd need quite some hacking to introduce such concepts
and buy-in from a bunch of people first.

> >> net.socket_create_filter = "iptables:foobar": some iptables thingy
> >> net.socket_create_filter = "nft:blahblahblah": some nft thingy
> >
> > iptables/nft are not applicable to 'bpf_socket_create' which
> > looks into:
> > struct bpf_sock {
> > __u32 bound_dev_if;
> > __u32 family;
> > __u32 type;
> > __u32 protocol;
> > };
> > so don't fit as an example.
>
> The code that takes a 'struct sock' and sets up bpf_sock is
> bpf-specific and would obviously not be used for non-bpf filter. But
> if you had a filter that was just a like of address families, that
> filter would look at struct sock and do its own thing. iptables
> wouldn't make sense for a socket creation filter, but it would make
> perfect sense for an ingress filter. Obviously the bpf-specific state
> object wouldn't be used, but it would still be a hook, invoked from
> the same network code, guarded by the same static key, looking at the
> same skb.

I strongly suggest to go back and read my first reply where
I think I explained well enough that something like iptables
will not able to reuse the ioctl mechanism you're proposing here,
hook ids will be different, attachment mechanism will be different too.
So your proposed cgroup ioctl is already dead as a reusable interface.

> >> net.socket_create_filter = "disallow": no sockets created period
> >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
> >
> > so you're proposing to add a bunch of hard coded logic to the kernel.
> > First to parse such text into some sort of syntax tree or list/set
> > and then have hard coded logic specifically for these two use cases?
> > While above two can be implemented as trivial bpf programs already?!
> > That goes 180% degree vs bpf philosophy. bpf is about moving
> > the specific code out of the kernel and keeping kernel generic that
> > it can solve as many use cases as possible by being programmable.
>
> I'm not seriously proposing implementing these. My point is that
> *bpf*, while wonderful, is not the be-all-and-end-all of kernel
> configurability, and other types of hooks might want to be hooked in
> here.

Then please let's talk about real use cases.
This daydreaming of some future llvm in the kernel that you were
bringing up during LPC doesn't help the discussion.
Just like these artificial examples.

> > ...
> >> What exactly isn't sensible about using cgroup_bpf for containers?
> >
> > my use case is close to zero overhead application network monitoring.
>
> So if I set up a cgroup that's monitored and call it /cgroup/a and
> enable delegation and if the program running there wants to do its own
> monitoring in /cgroup/a/b (via delegation), then you really want the
> outer monitor to silently drop events coming from /cgroup/a/b?

yes. both are root and must talk to each other if they want
to co-exist. When root process is asking kernel to do X, this X has
to happen.

> Then disallow nesting. You're welcome to not rush the decision, but
> that's not what you've done. If 4.10 is released as is, you've made
> the decision and you're going to have a hard time changing it.

Nothing needs to be changed.

> > No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
> > type programs and only root can attach them.
>
> Why? It really seems to me that you expect that future namespaceable
> bpf hooks will use a totally different API. At KS, I sat in a room
> full of people arguing about cgroup v2 and a lot of them pointed out
> that there are valid, paying use cases that want to stick cgroup v1 in
> a container, because the code that uses cgroup v1 in the container is
> called systemd and the contained OS is called RHEL (or SuSE or Ubuntu
> or Gentoo or whatever), and that code is *already written* and needs
> to be contained.

bpf in general is not namespace aware. It's global and this cgroup scoping
of bpf programs is the first of this kind. Namespacing of bpf is completely
different topic.

> The current approach to bpf hooks will bite you down the road. David
> Ahern is already proposing using it for something that is not tracing
> at all, and someone will want that in a container, and there will be a
> problem.

vrf use case already supported by existing code.

> How about slowing down a wee bit and trying to come up with cgroup
> hook semantics that work for all of these use cases? I think my
> proposal is quite close to workable.

you've started the topic by claiming that things are broken
and non-extensible. At the course of this thread it was explained
that the interface is extensible and not broken for the use case
it was designed for. The 'security' use case like lsm+bpf+cgroup
is not supported by the current model yet and that's what we need
to discuss in the future. So, yes, please slow down.

2016-12-20 04:52:11

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote:
>
> By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
> even take a reference to a BPF program as an argument. What is it
> supposed to do if this mechanism ever gets extended?

we just add another field to that anonymous union just like
we did for other commands and everything is backwards compatible.
It's the basics of bpf syscall that we've been relying on for some
time now and it worked just fine.

2016-12-20 05:26:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 8:51 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote:
>>
>> By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
>> even take a reference to a BPF program as an argument. What is it
>> supposed to do if this mechanism ever gets extended?
>
> we just add another field to that anonymous union just like
> we did for other commands and everything is backwards compatible.
> It's the basics of bpf syscall that we've been relying on for some
> time now and it worked just fine.

And what happens if you don't specify that member and two programs are attached?

--Andy

2016-12-20 05:27:42

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
>>
>> struct cgroup_bpf {
>> /*
>> * Store two sets of bpf_prog pointers, one for programs that are
>> * pinned directly to this cgroup, and one for those that are effective
>> * when this cgroup is accessed.
>> */
>> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
>> struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
>> };
>>
>> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
>>
>> This would change to something like:
>>
>> struct cgroup_filter_slot {
>> struct bpf_prog *effective;
>> struct cgroup_filter_slot *next;
>> struct bpf_prog *local;
>> }
>>
>> local is NULL unless *this* cgroup has a filter. effective points to
>> the bpf_prog that's active in this cgroup or the nearest ancestor that
>> has a filter. next is NULL if there are no filters higher in the
>> chain or points to the next slot that has a filter. struct cgroup
>> has:
>>
>> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
>>
>> To evaluate it, you do:
>>
>> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
>>
>> if (!slot->effective)
>> return;
>>
>> do {
>> evaluate(slot->effective);
>> slot = slot->next;
>> } while (unlikely(slot));
>
> yes. something like this can work as a future extension
> to support multiple programs for security use case.
> Please propose a patch.
> Again, it's not needed today and there is no rush to implement it.
>

If this happens after 4.10 and 4.10 is released as is, then this
change would be an ABI break.

--Andy

2016-12-20 05:33:04

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 09:27:18PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov
> <[email protected]> wrote:
> > On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
> >>
> >> struct cgroup_bpf {
> >> /*
> >> * Store two sets of bpf_prog pointers, one for programs that are
> >> * pinned directly to this cgroup, and one for those that are effective
> >> * when this cgroup is accessed.
> >> */
> >> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
> >> struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
> >> };
> >>
> >> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
> >>
> >> This would change to something like:
> >>
> >> struct cgroup_filter_slot {
> >> struct bpf_prog *effective;
> >> struct cgroup_filter_slot *next;
> >> struct bpf_prog *local;
> >> }
> >>
> >> local is NULL unless *this* cgroup has a filter. effective points to
> >> the bpf_prog that's active in this cgroup or the nearest ancestor that
> >> has a filter. next is NULL if there are no filters higher in the
> >> chain or points to the next slot that has a filter. struct cgroup
> >> has:
> >>
> >> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
> >>
> >> To evaluate it, you do:
> >>
> >> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
> >>
> >> if (!slot->effective)
> >> return;
> >>
> >> do {
> >> evaluate(slot->effective);
> >> slot = slot->next;
> >> } while (unlikely(slot));
> >
> > yes. something like this can work as a future extension
> > to support multiple programs for security use case.
> > Please propose a patch.
> > Again, it's not needed today and there is no rush to implement it.
> >
>
> If this happens after 4.10 and 4.10 is released as is, then this
> change would be an ABI break.

it won't break existing apps.
please study how bpf syscall was extended in the past without
breaking anything.
Same thing here. The default is one program per hook per cgroup.
Everything else is future extensions.

2016-12-20 09:12:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Mon, Dec 19, 2016 at 05:56:24PM -0800, Andy Lutomirski wrote:
> >> Huh? My example in the original email attaches a program in a
> >> sub-hierarchy. Are you saying that 4.11 could make that example stop
> >> working?
> >
> > Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?
>
> Yes, exactly. I think there are two sensible behaviors:
>
> a) sub-cgroups cannot have a filter at all of the parent has a filter.
> (This is the "punt" approach -- it lets different semantics be
> assigned later without breaking userspace.)
>
> b) sub-cgroups can have a filter if a parent does, too. The semantics
> are that the sub-cgroup filter runs first and all side-effects occur.
> If that filter says "reject" then ancestor filters are skipped. If
> that filter says "accept", then the ancestor filter is run and its
> side-effects happen as well. (And so on, all the way up to the root.)

So from what I understand the proposed cgroup is not in fact
hierarchical at all.

@TJ, I thought you were enforcing all new cgroups to be properly
hierarchical, that would very much include this one.

2016-12-20 10:21:24

by Daniel Mack

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

Hi,

On 12/20/2016 04:50 AM, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov
> <[email protected]> wrote:
>> On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
>>> I think we're still talking past each other. A big part of the point
>>> of changing it is that none of this is specific to bpf. You could (in
>>
>> the hooks and context passed into the program is very much bpf specific.
>> That's what I've been trying to convey all along.
>
> You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)? There is nothing bpf
> specfic about the hook except that the name of this macro has "BPF" in
> it. There is nothing whatsoever that's bpf-specific about the context
> -- sk is not bpf-specific at all.
>
> The only thing bpf-specific about it is that it currently only invokes
> bpf programs. That could easily change.

I'm not sure if I follow. The code as it currently stands only supports
attaching bpf programs to cgroups which have been created using
BPF_PROG_LOAD. If cgroups would support other program types in the
future, then they would need to be stored in different data types
anyway, and the bpf syscall multiplexer would be the wrong entry point
to access them anyway.

Whether we add bpf-specific code to the cgroup file parsers or
cgroup-specific code to the bpf layer does not make much of a semantic
difference, does it? As a matter of fact, my very first implementation
of this patch set implemented a cgroup controller that would allow
writing strings like "ingress 5" to its control file, where 5 is the fd
number that came out of BPF_PROG_LOAD. The main reason we decided to
ditch that was that echoing fd numbers into a text file seemed way worse
than going through a proper syscall layer with it, and ioctls are
unavailable on pseudo-fs.

The idea was rather to allow attaching bpf programs to other things than
just cgroups as well, which is why we called the member of 'union
bpf_attr' 'target_fd', and a cgroup is just one type a target here.

>> i'm assuming 'baadf00d' is bpf program fd expressed a text string?
>> and kernel needs to parse above? will you allow capital and lower
>> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
>> can program fd expressed as decimal or hex or both?
>> how do you return the error? as a text string for user space
>> to parse?
>
> No. The kernel does not parse it because you cannot write this to the
> file. You set a bpf filter with ioctl and pass an fd.

An ioctl on what file, exactly?

> If you *read*
> the file, you get the same bpf program hash that fdinfo on the bpf
> object would show -- this is for debugging and (eventually) CRIU.

We need a debugging facility at some point, I agree to that. As the code
currently stands, that would rather need to go into the bpf(2) syscall
though, as setting a program through bpf(2) and reading it through
cgroupfs is really nasty.

>> so you're proposing to add a bunch of hard coded logic to the kernel.
>> First to parse such text into some sort of syntax tree or list/set
>> and then have hard coded logic specifically for these two use cases?
>> While above two can be implemented as trivial bpf programs already?!
>> That goes 180% degree vs bpf philosophy. bpf is about moving
>> the specific code out of the kernel and keeping kernel generic that
>> it can solve as many use cases as possible by being programmable.
>
> I'm not seriously proposing implementing these. My point is that
> *bpf*, while wonderful, is not the be-all-and-end-all of kernel
> configurability, and other types of hooks might want to be hooked in
> here.

Sure, but nobody claimed it to be that be-all-and-end-all thing. It's
just one thing that a cgroup is now able to accommodate, and because
that new feature is specific to bpf, we decided to hook up the uapi to
the bpf syscall.

> So if I set up a cgroup that's monitored and call it /cgroup/a and
> enable delegation and if the program running there wants to do its own
> monitoring in /cgroup/a/b (via delegation), then you really want the
> outer monitor to silently drop events coming from /cgroup/a/b?

That's a fair point, and we've discussed it as well. The issue is, as
Alexei already pointed out, that we do not want to traverse the tree up
to the root for nested cgroups due to the runtime costs in the
networking fast-path. After all, we're running the bpf program for each
packet in flight. Hence, we opted for the approach to only look at the
leaf node for now, with the ability to open it up further in the future
using flags during attach etc.

> The current approach to bpf hooks will bite you down the road. David
> Ahern is already proposing using it for something that is not tracing
> at all, and someone will want that in a container, and there will be a
> problem.

Hmm, I thought we've sorted out the concerns about that by making sure
that we

a) lock-down the API sufficiently so it doesn't cause any security
issues in its current form, and

b) make it possible to extend the functionality in the future by adding
flags to the command struct etc.

And I hoped we achieved that after discussing it for so long.

> How about slowing down a wee bit and trying to come up with cgroup
> hook semantics that work for all of these use cases?

I'm all for discussing things, but I don't this was done in a rush.

I do agree though that adding functionality to cgroups that is not
limited to resource control is a delicate thing to do, which is why I
cc'ed cgroups@ in my patches. I should have also added linux-api@ I
guess, sorry I missed that.

> I think my proposal is quite close to workable.

So let's talk about how to proceed. I've seen different bits of your
proposal in different mails, and I think a summary of it would help the
discussion.


Thanks,
Daniel

2016-12-20 17:24:37

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack <[email protected]> wrote:
> Hi,
>
> On 12/20/2016 04:50 AM, Andy Lutomirski wrote:
>> You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)? There is nothing bpf
>> specfic about the hook except that the name of this macro has "BPF" in
>> it. There is nothing whatsoever that's bpf-specific about the context
>> -- sk is not bpf-specific at all.
>>
>> The only thing bpf-specific about it is that it currently only invokes
>> bpf programs. That could easily change.
>
> I'm not sure if I follow. The code as it currently stands only supports
> attaching bpf programs to cgroups which have been created using
> BPF_PROG_LOAD. If cgroups would support other program types in the
> future, then they would need to be stored in different data types
> anyway, and the bpf syscall multiplexer would be the wrong entry point
> to access them anyway.

To clarify, since this thread has gotten excessively long and twisted,
I think it's important that, for hooks attached to a cgroup, you be
able to tell in a generic way whether something is plugged into the
hook. The natural way to see a cgroup's configuration is to read from
cgroupfs, so I think that reading from cgroupfs should show you that a
BPF program is attached and also give enough information that, once
bpf programs become dumpable, you can dump the program (using the
bpf() syscall or whatever).

Obviously the interface to *attach* a BPF program to a hook will need
to be at least a little bit BPF-specific. But there's nothing
particularly BPF-specific about detaching, and if a control file were
to exist, writing "detach" or "none" to it seems natural.

>
> Whether we add bpf-specific code to the cgroup file parsers or
> cgroup-specific code to the bpf layer does not make much of a semantic
> difference, does it? As a matter of fact, my very first implementation
> of this patch set implemented a cgroup controller that would allow
> writing strings like "ing
>
> b) make it possible to extend the functionality in the future by adding
> flags to the command struct etc.
>
> And I hoped we achieved that after discussing it for so long.
>
>> How about slowing down a wee bit and trying to come up with cgroup
>> hook semantics that work for all of these use cases?
>
> I'm all for discussing things, but I don't this was done in a rush.
>
> I do agree though that adding functionality to cgroups that is not
> limited to resource control is a delicate thing to do, which is why I
> cc'ed cgroups@ in my patches. I should have also added linux-api@ I
> guess, sorry I missed that.
>ress 5" to its control file, where 5 is the fd
> number that came out of BPF_PROG_LOAD. The main reason we decided to
> ditch that was that echoing fd numbers into a text file seemed way worse
> than going through a proper syscall layer with it, and ioctls are
> unavailable on pseudo-fs.

There isn't a big semantic difference between
'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(...,
CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file",
O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'. There is, however, a semantic
difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY)
because the permission check is much weaker.

The reason I suggest ioctl() and not write() is that write() MUST NOT
make its behavior depend on the caller's credentials, file table, etc.
Imagine what would happen if you did 'sudo -u eviltext
>/cgroup/NAME/control.file'. (This particular mistake has been
repeated many times in the kernel, in drivers, networking, namespaces,
core code, etc, and it's resulted in a big pile of privilege
escalation bugs.) So write("bpf:<actual BPF instructions>") is safe
(but unusably awkward, I think), whereas write("bpf:fd 5") is unsafe.

>
> The idea was rather to allow attaching bpf programs to other things than
> just cgroups as well, which is why we called the member of 'union
> bpf_attr' 'target_fd', and a cgroup is just one type a target here.

I would make that a separate operation. If someone adds the ability
to attach an ebpf program to, say, seccomp (I'm quite sure this will
happen eventually), it should be attached using seccomp(), not bpf(),
for example). The people writing seccomp filters will thank you for
making the syscall in question reflect what object (the cgroup, for
example) is being modified.

>
>>> i'm assuming 'baadf00d' is bpf program fd expressed a text string?
>>> and kernel needs to parse above? will you allow capital and lower
>>> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
>>> can program fd expressed as decimal or hex or both?
>>> how do you return the error? as a text string for user space
>>> to parse?
>>
>> No. The kernel does not parse it because you cannot write this to the
>> file. You set a bpf filter with ioctl and pass an fd.
>
> An ioctl on what file, exactly?

There are at least two plausible models.

My preference would be to do an ioctl on a new
/cgroup/NAME/network_hooks.inet_ingress file. Reading that file tells
you whether something is attached and hopefully also gives enough
information (a hash of the BPF program, perhaps) to dump the actual
program using future bpf() interfaces. write() and ioctl() can be
used to configure it as appropriate.

Another option that I like less would be to have a
/cgroup/NAME/cgroup.bpf that lists all the active hooks along with
their contents. You would do an ioctl() on that to program a hook and
you could read it to see what's there.

FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
It doesn't make a semantic difference, except that I dislike
BPF_PROG_DETACH because that particular command isn't BPF-specific at
all.

>
>> If you *read*
>> the file, you get the same bpf program hash that fdinfo on the bpf
>> object would show -- this is for debugging and (eventually) CRIU.
>
> We need a debugging facility at some point, I agree to that. As the code
> currently stands, that would rather need to go into the bpf(2) syscall
> though, as setting a program through bpf(2) and reading it through
> cgroupfs is really nasty.

But knowing that you have to call bpf() to tell whether bpf hooks are
installed in a cgroup is nasty. Everything else uses ls and cat --
why should BPF be special here?

>> So if I set up a cgroup that's monitored and call it /cgroup/a and
>> enable delegation and if the program running there wants to do its own
>> monitoring in /cgroup/a/b (via delegation), then you really want the
>> outer monitor to silently drop events coming from /cgroup/a/b?
>
> That's a fair point, and we've discussed it as well. The issue is, as
> Alexei already pointed out, that we do not want to traverse the tree up
> to the root for nested cgroups due to the runtime costs in the
> networking fast-path. After all, we're running the bpf program for each
> packet in flight. Hence, we opted for the approach to only look at the
> leaf node for now, with the ability to open it up further in the future
> using flags during attach etc.

Careful here! You don't look only at the leaf node for now. You do a
fancy traversal and choose the nearest node that has a hook set up.
This gives you almost all the complexity of evaluating all of the
installed hooks with none of the benefit. It also is, IMO, much more
dangerous than only looking at the leaf node. Consider:

mkdir /cgroup/foo
BPF_PROG_ATTACH(some program to foo)
mkdir /cgroup/foo/bar
chown -R some_user /cgroup/foo/bar

If the kernel only looked at the leaf, then the program that did the
above would not expect that the program would constrain
/cgroup/foo/bar's activity. But, as it stands, the program *would*
expect /cgroup/foo/bar to be constrained, except that, whenever the
capable() check changes to ns_capable() (which will happen eventually
one way or another), then the bad guy can create /cgroup/foo/bar/baz,
install a new no-op hook there, and break the security assumption.

IOW, I think that totally non-recursive hooks are okay from a security
perspective, albeit rather strange, but the current design is not okay
from a security perspective.

>
>> The current approach to bpf hooks will bite you down the road. David
>> Ahern is already proposing using it for something that is not tracing
>> at all, and someone will want that in a container, and there will be a
>> problem.
>
> Hmm, I thought we've sorted out the concerns about that by making sure
> that we
>
> a) lock-down the API sufficiently so it doesn't cause any security
> issues in its current form, and

This argument is why iptables + userns has become a security mess, for
example. Designing an API assuming that the bad guys will never be
permitted to use it causes quite a bit of pain when, a few years
later, bad guys are permitted to use it.

>> I think my proposal is quite close to workable.
>
> So let's talk about how to proceed. I've seen different bits of your
> proposal in different mails, and I think a summary of it would help the
> discussion.

So here's a fleshed-out possible version that's a bit of a compromise
after sleeping on this. There's plenty of room to tweak this.

Each cgroup gets a new file cgroup.hooks. Reading it shows a list of
active hooks. (A hook can be a string like "network.inet_ingress".)

You can write a command like "-network.inet_ingress off" to it to
disable network.inet_ingress. You can write a command like
"+network.inet_ingress" to it to enable the network.inet_ingress hook.

When a hook (e.g. network.inet_ingress) is enabled, a new file appears
in the cgroup called "hooks.network.inet_ingress"). You can read it
to get an indication of what is currently installed in that slot. You
can write "none" to it to cause nothing to be installed in that slot.
(This replaces BPF_PROG_DETACH.). You can open it for write and use
bpf() or perhaps ioctl() to attach a bpf program. Maybe you can also
use bpf() to dump the bpf program, but, regardless, if a bpf program
is there, read() will return some string that contains "bpf" and maybe
some other useful information.

If a SELinux policy wants to lock down the netowrk.inet_ingress hook,
it uses existing mechanisms to label the hooks.network.inet_ingress
file when it appears and to restrict opening it for write.

I think this is all quite straightforward to implement and will result
in clean code. I could probably make some decent progress toward it
over the next couple days.

--Andy

2016-12-20 18:36:22

by Daniel Mack

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

Hi,

On 12/20/2016 06:23 PM, Andy Lutomirski wrote:
> On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack <[email protected]> wrote:

> To clarify, since this thread has gotten excessively long and twisted,
> I think it's important that, for hooks attached to a cgroup, you be
> able to tell in a generic way whether something is plugged into the
> hook. The natural way to see a cgroup's configuration is to read from
> cgroupfs, so I think that reading from cgroupfs should show you that a
> BPF program is attached and also give enough information that, once
> bpf programs become dumpable, you can dump the program (using the
> bpf() syscall or whatever).

[...]

> There isn't a big semantic difference between
> 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(...,
> CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file",
> O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'. There is, however, a semantic
> difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY)
> because the permission check is much weaker.

Okay, if you have such a control file, you can of course do something
like that. When we discussed things back then with Tejun however, we
concluded that a controller that is not completely controllable through
control knobs that can be written and read via cat is meaningless.
That's why this has become a 'hidden' cgroup feature.

With your proposed API, you'd first go to the bpf(2) syscall in order to
get a prog fd, and then come back to some sort of cgroup API to put the
fd in there. That's quite a mix and match, which is why we considered
the API cleaner in its current form, as everything that is related to
bpf is encapsulated behind a single syscall.

> My preference would be to do an ioctl on a new
> /cgroup/NAME/network_hooks.inet_ingress file. Reading that file tells
> you whether something is attached and hopefully also gives enough
> information (a hash of the BPF program, perhaps) to dump the actual
> program using future bpf() interfaces. write() and ioctl() can be
> used to configure it as appropriate.

So am I reading this right? You're proposing to add ioctl() hooks to
kernfs/cgroupfs? That would open more possibilities of course, but I'm
not sure where that rabbit hole leads us eventually.

> Another option that I like less would be to have a
> /cgroup/NAME/cgroup.bpf that lists all the active hooks along with
> their contents. You would do an ioctl() on that to program a hook and
> you could read it to see what's there.

Yes, read() could, in theory, give you similar information than ioctl(),
but in human-readable form.

> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
> It doesn't make a semantic difference, except that I dislike
> BPF_PROG_DETACH because that particular command isn't BPF-specific at
> all.

Well, I think it is; it pops the bpf program from a target and drops the
reference on it. It's not much code, but it's certainly bpf-specific.

>>> So if I set up a cgroup that's monitored and call it /cgroup/a and
>>> enable delegation and if the program running there wants to do its own
>>> monitoring in /cgroup/a/b (via delegation), then you really want the
>>> outer monitor to silently drop events coming from /cgroup/a/b?
>>
>> That's a fair point, and we've discussed it as well. The issue is, as
>> Alexei already pointed out, that we do not want to traverse the tree up
>> to the root for nested cgroups due to the runtime costs in the
>> networking fast-path. After all, we're running the bpf program for each
>> packet in flight. Hence, we opted for the approach to only look at the
>> leaf node for now, with the ability to open it up further in the future
>> using flags during attach etc.
>
> Careful here! You don't look only at the leaf node for now. You do a
> fancy traversal and choose the nearest node that has a hook set up.

But we do the 'complex' operation at attach time or when a cgroup is
created, both of which are slow-path operations. In the fast-path, we
only look at the leaf, which may or may not have an effective program
installed. And that's of course much cheaper then doing the traversing
for each packet.

> mkdir /cgroup/foo
> BPF_PROG_ATTACH(some program to foo)
> mkdir /cgroup/foo/bar
> chown -R some_user /cgroup/foo/bar
>
> If the kernel only looked at the leaf, then the program that did the
> above would not expect that the program would constrain
> /cgroup/foo/bar's activity. But, as it stands, the program *would*
> expect /cgroup/foo/bar to be constrained, except that, whenever the
> capable() check changes to ns_capable() (which will happen eventually
> one way or another), then the bad guy can create /cgroup/foo/bar/baz,
> install a new no-op hook there, and break the security assumption.
>
> IOW, I think that totally non-recursive hooks are okay from a security
> perspective, albeit rather strange, but the current design is not okay
> from a security perspective.

We locked down the ability to override any of these programs with
CAP_NET_ADMIN, which is also what it takes to flush iptables, right?
What's the difference?

> So here's a fleshed-out possible version that's a bit of a compromise
> after sleeping on this. There's plenty of room to tweak this.
>
> Each cgroup gets a new file cgroup.hooks. Reading it shows a list of
> active hooks. (A hook can be a string like "network.inet_ingress".)
>
> You can write a command like "-network.inet_ingress off" to it to
> disable network.inet_ingress. You can write a command like
> "+network.inet_ingress" to it to enable the network.inet_ingress hook.
>
> When a hook (e.g. network.inet_ingress) is enabled, a new file appears
> in the cgroup called "hooks.network.inet_ingress"). You can read it
> to get an indication of what is currently installed in that slot. You
> can write "none" to it to cause nothing to be installed in that slot.
> (This replaces BPF_PROG_DETACH.). You can open it for write and use
> bpf() or perhaps ioctl() to attach a bpf program. Maybe you can also
> use bpf() to dump the bpf program, but, regardless, if a bpf program
> is there, read() will return some string that contains "bpf" and maybe
> some other useful information.

I can see where you're going, but I don't know yet if if I like this
approach better, given that you would still need a binary interface at
least at attach time, and that such an interface would use a resource
returned from bpf(2). The ability to read from control files in order to
see what's going on is nice though.

I'd like to have Tejun's and Alexei's opinion on this - as I said, I had
something like that (albeit much simpler) in one of my very early
drafts, but we consented to do the hookup the other way around, for
stated reasons.


Thanks,
Daniel

2016-12-20 18:49:50

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Tue, Dec 20, 2016 at 10:36 AM, Daniel Mack <[email protected]> wrote:
> Hi,
>
> On 12/20/2016 06:23 PM, Andy Lutomirski wrote:
>> On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack <[email protected]> wrote:
>
>> To clarify, since this thread has gotten excessively long and twisted,
>> I think it's important that, for hooks attached to a cgroup, you be
>> able to tell in a generic way whether something is plugged into the
>> hook. The natural way to see a cgroup's configuration is to read from
>> cgroupfs, so I think that reading from cgroupfs should show you that a
>> BPF program is attached and also give enough information that, once
>> bpf programs become dumpable, you can dump the program (using the
>> bpf() syscall or whatever).
>
> [...]
>
>> There isn't a big semantic difference between
>> 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(...,
>> CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file",
>> O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'. There is, however, a semantic
>> difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY)
>> because the permission check is much weaker.
>
> Okay, if you have such a control file, you can of course do something
> like that. When we discussed things back then with Tejun however, we
> concluded that a controller that is not completely controllable through
> control knobs that can be written and read via cat is meaningless.
> That's why this has become a 'hidden' cgroup feature.
>
> With your proposed API, you'd first go to the bpf(2) syscall in order to
> get a prog fd, and then come back to some sort of cgroup API to put the
> fd in there. That's quite a mix and match, which is why we considered
> the API cleaner in its current form, as everything that is related to
> bpf is encapsulated behind a single syscall.

You already have to do bpf() to get a prog fd, then open() to get a
cgroup fd, then bpf() or ioctl() to attach, so this isn't much
different, and its exactly the same number of syscalls.

>
>> My preference would be to do an ioctl on a new
>> /cgroup/NAME/network_hooks.inet_ingress file. Reading that file tells
>> you whether something is attached and hopefully also gives enough
>> information (a hash of the BPF program, perhaps) to dump the actual
>> program using future bpf() interfaces. write() and ioctl() can be
>> used to configure it as appropriate.
>
> So am I reading this right? You're proposing to add ioctl() hooks to
> kernfs/cgroupfs? That would open more possibilities of course, but I'm
> not sure where that rabbit hole leads us eventually.

Indeed. I already have a test patch to add ioctl() to kernfs. Adding
it to cgroupfs shouldn't be much more complicated.

>
>> Another option that I like less would be to have a
>> /cgroup/NAME/cgroup.bpf that lists all the active hooks along with
>> their contents. You would do an ioctl() on that to program a hook and
>> you could read it to see what's there.
>
> Yes, read() could, in theory, give you similar information than ioctl(),
> but in human-readable form.
>
>> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
>> It doesn't make a semantic difference, except that I dislike
>> BPF_PROG_DETACH because that particular command isn't BPF-specific at
>> all.
>
> Well, I think it is; it pops the bpf program from a target and drops the
> reference on it. It's not much code, but it's certainly bpf-specific.

I mean the interface isn't bpf-specific. If there was something that
wasn't bpf attached to the target, you'd still want an API to detach
it.

>
>>>> So if I set up a cgroup that's monitored and call it /cgroup/a and
>>>> enable delegation and if the program running there wants to do its own
>>>> monitoring in /cgroup/a/b (via delegation), then you really want the
>>>> outer monitor to silently drop events coming from /cgroup/a/b?
>>>
>>> That's a fair point, and we've discussed it as well. The issue is, as
>>> Alexei already pointed out, that we do not want to traverse the tree up
>>> to the root for nested cgroups due to the runtime costs in the
>>> networking fast-path. After all, we're running the bpf program for each
>>> packet in flight. Hence, we opted for the approach to only look at the
>>> leaf node for now, with the ability to open it up further in the future
>>> using flags during attach etc.
>>
>> Careful here! You don't look only at the leaf node for now. You do a
>> fancy traversal and choose the nearest node that has a hook set up.
>
> But we do the 'complex' operation at attach time or when a cgroup is
> created, both of which are slow-path operations. In the fast-path, we
> only look at the leaf, which may or may not have an effective program
> installed. And that's of course much cheaper then doing the traversing
> for each packet.

You would never traverse the full hierarchy for each packet. You'd
have a linked list of programs that are attached, kind of like how
there's an "effective" array right now. I sent out pseudocode earlier
in the thread.

>
>> mkdir /cgroup/foo
>> BPF_PROG_ATTACH(some program to foo)
>> mkdir /cgroup/foo/bar
>> chown -R some_user /cgroup/foo/bar
>>
>> If the kernel only looked at the leaf, then the program that did the
>> above would not expect that the program would constrain
>> /cgroup/foo/bar's activity. But, as it stands, the program *would*
>> expect /cgroup/foo/bar to be constrained, except that, whenever the
>> capable() check changes to ns_capable() (which will happen eventually
>> one way or another), then the bad guy can create /cgroup/foo/bar/baz,
>> install a new no-op hook there, and break the security assumption.
>>
>> IOW, I think that totally non-recursive hooks are okay from a security
>> perspective, albeit rather strange, but the current design is not okay
>> from a security perspective.
>
> We locked down the ability to override any of these programs with
> CAP_NET_ADMIN, which is also what it takes to flush iptables, right?
> What's the difference?

For iptables, it's ns_capable() now, and there have been a number of
holes in it. For cgroup, it's going to turn in to ns_capable() sooner
or later, and it would be nice to be ready for it.

--Andy

2016-12-21 04:01:57

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Tue, Dec 20, 2016 at 10:49:25AM -0800, Andy Lutomirski wrote:
> >> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
> >> It doesn't make a semantic difference, except that I dislike
> >> BPF_PROG_DETACH because that particular command isn't BPF-specific at
> >> all.
> >
> > Well, I think it is; it pops the bpf program from a target and drops the
> > reference on it. It's not much code, but it's certainly bpf-specific.
>
> I mean the interface isn't bpf-specific. If there was something that
> wasn't bpf attached to the target, you'd still want an API to detach
> it.

This discussion won't go anywhere while you keep thinking that this api
has to be generalized. As I explained several times earlier
BPF_CGROUP_INET_SOCK_CREATE hook is bpf specific. There is nothing
in the kernel that can take advantage of it today, so by definition
the hook is bpf specific. Period. Saying that something in the future
may come along that would want to use that is like saying I want
to design the generic steering wheel for any car that will ever need it.

Hence if you want to change 'target_fd' in BPF_PROG_ATTACH/DETACH cmds
from being fd of open("cgroupdir") to fd of open("cgroupdir/cgroup.bpf")
file inside it then I'm ok with that.
All other proposals with non-extensible ioctls() and crazy text based
per-hook permissions is nack.