LinuxLists.cc - Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-17 18:19:14

Subject: Potential issues (security and otherwise) with the current cgroup-bpf API

Hi all-

I apologize for being rather late with this. I didn't realize that
cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
it on the linux-api list, so I missed the discussion.

I think that the inet ingress, egress etc filters are a neat feature,
but I think the API has some issues that will bite us down the road
if it becomes stable in its current form.

Most of the problems I see are summarized in this transcript:

# mkdir cg2
# mount -t cgroup2 none cg2
# mkdir cg2/nosockets
# strace cgrp_socket_rule cg2/nosockets/ 0
...
open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3

^^^^ You can modify a cgroup after opening it O_RDONLY?

bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
log_buf=0x6020c0, kern_version=0}, 48) = 4

^^^^ This is fine. The bpf() syscall manipulates bpf objects.

bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0

^^^^ This is not so good:
^^^^
^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This
^^^^ is manipulating a cgroup. There's no reason that a socket creation
^^^^ filter couldn't be written in a different language (new iptables
^^^^ table? Simple list of address families?), but if that happened,
^^^^ then using bpf() to install it would be entirely nonsensical.
^^^^
^^^^ b) This is starting to be an excessively ugly multiplexer. Among
^^^^ other things, it's very unfriendly to seccomp.

# echo $$ >cg2/nosockets/cgroup.procs
# ping 127.0.0.1
ping: socket: Operation not permitted
# ls cg2/nosockets/
cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control
# cat cg2/nosockets/cgroup.controllers

^^^^ Something in cgroupfs should give an indication that this cgroup
^^^^ filters socket creation, but there's nothing there. You should also
^^^^ be able to turn the filter off from cgroupfs.

# mkdir cg2/nosockets/sockets
# /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1

^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
^^^^ then we're stuck with its semantics. If it returned -EINVAL instead,
^^^^ there would be a chance to refine it.

# echo $$ >cg2/nosockets/sockets/cgroup.procs
# ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
^C
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms

^^^^ Bash was inside a cgroup that disallowed socket creation, but socket
^^^^ creation wasn't disallowed. This means that the obvious use of socket
^^^^ creation filters in nestable constainers fails insecurely.

There's also a subtle but nasty potential security problem here.
In 4.9 and before, cgroups has only one real effect in the kernel:
resource control. A process in a malicious cgroup could be DoSed,
but that was about the extent of the damage that a malicious cgroup
could do.

In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
programs attached that can do things if various events occur. (Right
now, this means socket operations, but there are plans in the works
to do this for LSM hooks too.) These bpf programs can say yes or no,
but they can also read out various data (including socket payloads!)
and save them away where an attacker can find them. This sounds a
lot like seccomp with a narrower scope but a much stronger ability to
exfiltrate private information.

Unfortunately, while seccomp is very, very careful to prevent
injection of a privileged victim into a malicious sandbox, the
CGROUP_BPF mechanism appears to have no real security model. There
is nothing to prevent a program that's in a malicious cgroup from
running a setuid binary, and there is nothing to prevent a program
that has the ability to move itself or another program into a
malicious cgroup from doing so and then, if needed for exploitation,
exec a setuid binary.

This isn't much of a problem yet because you currently need
CAP_NET_ADMIN to create a malicious sandbox in the first place. I'm
sure that, in the near future, someone will want to make this stuff
work in containers with delegated cgroup hierarchies, and then there
may be a real problem here.

I've included a few security people on this thread. The current API
looks abusable, and it would be nice to find all the holes before
4.10 comes out.

(The cgrp_socket_rule source is attached. You can build it by sticking it
in samples/bpf and doing:

$ make headers_install
$ cd samples/bpf
$ gcc -o cgrp_socket_rule cgrp_socket_rule.c libbpf.c -I../../usr/include
)

--Andy

Attachments:

cgrp_socket_rule.c (1.51 kB)

2016-12-17 19:35:28

On Tue, Dec 20, 2016 at 10:36 AM, Daniel Mack <[email protected]> wrote:
> Hi,
>
> On 12/20/2016 06:23 PM, Andy Lutomirski wrote:
>> On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack <[email protected]> wrote:
>
>> To clarify, since this thread has gotten excessively long and twisted,
>> I think it's important that, for hooks attached to a cgroup, you be
>> able to tell in a generic way whether something is plugged into the
>> hook. The natural way to see a cgroup's configuration is to read from
>> cgroupfs, so I think that reading from cgroupfs should show you that a
>> BPF program is attached and also give enough information that, once
>> bpf programs become dumpable, you can dump the program (using the
>> bpf() syscall or whatever).
>
> [...]
>
>> There isn't a big semantic difference between
>> 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(...,
>> CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file",
>> O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'. There is, however, a semantic
>> difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY)
>> because the permission check is much weaker.
>
> Okay, if you have such a control file, you can of course do something
> like that. When we discussed things back then with Tejun however, we
> concluded that a controller that is not completely controllable through
> control knobs that can be written and read via cat is meaningless.
> That's why this has become a 'hidden' cgroup feature.
>
> With your proposed API, you'd first go to the bpf(2) syscall in order to
> get a prog fd, and then come back to some sort of cgroup API to put the
> fd in there. That's quite a mix and match, which is why we considered
> the API cleaner in its current form, as everything that is related to
> bpf is encapsulated behind a single syscall.

You already have to do bpf() to get a prog fd, then open() to get a
cgroup fd, then bpf() or ioctl() to attach, so this isn't much
different, and its exactly the same number of syscalls.

>
>> My preference would be to do an ioctl on a new
>> /cgroup/NAME/network_hooks.inet_ingress file. Reading that file tells
>> you whether something is attached and hopefully also gives enough
>> information (a hash of the BPF program, perhaps) to dump the actual
>> program using future bpf() interfaces. write() and ioctl() can be
>> used to configure it as appropriate.
>
> So am I reading this right? You're proposing to add ioctl() hooks to
> kernfs/cgroupfs? That would open more possibilities of course, but I'm
> not sure where that rabbit hole leads us eventually.

Indeed. I already have a test patch to add ioctl() to kernfs. Adding
it to cgroupfs shouldn't be much more complicated.

>
>> Another option that I like less would be to have a
>> /cgroup/NAME/cgroup.bpf that lists all the active hooks along with
>> their contents. You would do an ioctl() on that to program a hook and
>> you could read it to see what's there.
>
> Yes, read() could, in theory, give you similar information than ioctl(),
> but in human-readable form.
>
>> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
>> It doesn't make a semantic difference, except that I dislike
>> BPF_PROG_DETACH because that particular command isn't BPF-specific at
>> all.
>
> Well, I think it is; it pops the bpf program from a target and drops the
> reference on it. It's not much code, but it's certainly bpf-specific.

I mean the interface isn't bpf-specific. If there was something that
wasn't bpf attached to the target, you'd still want an API to detach
it.

>
>>>> So if I set up a cgroup that's monitored and call it /cgroup/a and
>>>> enable delegation and if the program running there wants to do its own
>>>> monitoring in /cgroup/a/b (via delegation), then you really want the
>>>> outer monitor to silently drop events coming from /cgroup/a/b?
>>>
>>> That's a fair point, and we've discussed it as well. The issue is, as
>>> Alexei already pointed out, that we do not want to traverse the tree up
>>> to the root for nested cgroups due to the runtime costs in the
>>> networking fast-path. After all, we're running the bpf program for each
>>> packet in flight. Hence, we opted for the approach to only look at the
>>> leaf node for now, with the ability to open it up further in the future
>>> using flags during attach etc.
>>
>> Careful here! You don't look only at the leaf node for now. You do a
>> fancy traversal and choose the nearest node that has a hook set up.
>
> But we do the 'complex' operation at attach time or when a cgroup is
> created, both of which are slow-path operations. In the fast-path, we
> only look at the leaf, which may or may not have an effective program
> installed. And that's of course much cheaper then doing the traversing
> for each packet.

You would never traverse the full hierarchy for each packet. You'd
have a linked list of programs that are attached, kind of like how
there's an "effective" array right now. I sent out pseudocode earlier
in the thread.

>
>> mkdir /cgroup/foo
>> BPF_PROG_ATTACH(some program to foo)
>> mkdir /cgroup/foo/bar
>> chown -R some_user /cgroup/foo/bar
>>
>> If the kernel only looked at the leaf, then the program that did the
>> above would not expect that the program would constrain
>> /cgroup/foo/bar's activity. But, as it stands, the program *would*
>> expect /cgroup/foo/bar to be constrained, except that, whenever the
>> capable() check changes to ns_capable() (which will happen eventually
>> one way or another), then the bad guy can create /cgroup/foo/bar/baz,
>> install a new no-op hook there, and break the security assumption.
>>
>> IOW, I think that totally non-recursive hooks are okay from a security
>> perspective, albeit rather strange, but the current design is not okay
>> from a security perspective.
>
> We locked down the ability to override any of these programs with
> CAP_NET_ADMIN, which is also what it takes to flush iptables, right?
> What's the difference?

For iptables, it's ns_capable() now, and there have been a number of
holes in it. For cgroup, it's going to turn in to ns_capable() sooner
or later, and it would be nice to be ready for it.

--Andy

2016-12-21 04:01:57

by Alexei Starovoitov

[permalink] [raw]

Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API

On Tue, Dec 20, 2016 at 10:49:25AM -0800, Andy Lutomirski wrote:
> >> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
> >> It doesn't make a semantic difference, except that I dislike
> >> BPF_PROG_DETACH because that particular command isn't BPF-specific at
> >> all.
> >
> > Well, I think it is; it pops the bpf program from a target and drops the
> > reference on it. It's not much code, but it's certainly bpf-specific.
>
> I mean the interface isn't bpf-specific. If there was something that
> wasn't bpf attached to the target, you'd still want an API to detach
> it.

This discussion won't go anywhere while you keep thinking that this api
has to be generalized. As I explained several times earlier
BPF_CGROUP_INET_SOCK_CREATE hook is bpf specific. There is nothing
in the kernel that can take advantage of it today, so by definition
the hook is bpf specific. Period. Saying that something in the future
may come along that would want to use that is like saying I want
to design the generic steering wheel for any car that will ever need it.

Hence if you want to change 'target_fd' in BPF_PROG_ATTACH/DETACH cmds
from being fd of open("cgroupdir") to fd of open("cgroupdir/cgroup.bpf")
file inside it then I'm ok with that.
All other proposals with non-extensible ioctls() and crazy text based
per-hook permissions is nack.