Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1756585AbcLTEtX (ORCPT ); Mon, 19 Dec 2016 23:49:23 -0500
Received: from mail-pg0-f66.google.com ([74.125.83.66]:34785
        "EHLO mail-pg0-f66.google.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1751932AbcLTEtU (ORCPT );
        Mon, 19 Dec 2016 23:49:20 -0500
Date: Mon, 19 Dec 2016 20:41:37 -0800
From: Alexei Starovoitov
To: Andy Lutomirski
Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn,
        Tejun Heo, David Ahern, "David S. Miller", Thomas Graf,
        Michael Kerrisk, Peter Zijlstra, Linux API,
        "linux-kernel@vger.kernel.org", Network Development
Subject: Re: Potential issues (security and otherwise) with the current
        cgroup-bpf API
Message-ID: <20161220044135.GA86803@ast-mbp.thefacebook.com>
References: <20161219205631.GA31242@ast-mbp.thefacebook.com>
        <20161220000254.GA58895@ast-mbp.thefacebook.com>
        <20161220031802.GA77838@ast-mbp.thefacebook.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 5843
Lines: 124

On Mon, Dec 19, 2016 at 07:50:01PM -0800, Andy Lutomirski wrote:
> >>
> >> net.socket_create_filter = "none": no filter
> >> net.socket_create_filter = "bpf:baadf00d": bpf filter
> >
> > i'm assuming 'baadf00d' is bpf program fd expressed as a text string?
> > and kernel needs to parse above? will you allow capital and lower
> > case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
> > can program fd be expressed as decimal or hex or both?
> > how do you return the error? as a text string for user space
> > to parse?
>
> No. The kernel does not parse it because you cannot write this to the
> file. You set a bpf filter with ioctl and pass an fd.
> If you *read* the file, you get the same bpf program hash that fdinfo
> on the bpf object would show -- this is for debugging and (eventually)
> CRIU.

my understanding is that cgroup is based on kernfs and neither supports
ioctl, so you'd need quite some hacking to introduce such a concept, and
buy-in from a bunch of people first.

> >> net.socket_create_filter = "iptables:foobar": some iptables thingy
> >> net.socket_create_filter = "nft:blahblahblah": some nft thingy
> >
> > iptables/nft are not applicable to 'bpf_socket_create' which
> > looks into:
> > struct bpf_sock {
> >         __u32 bound_dev_if;
> >         __u32 family;
> >         __u32 type;
> >         __u32 protocol;
> > };
> > so don't fit as an example.
>
> The code that takes a 'struct sock' and sets up bpf_sock is
> bpf-specific and would obviously not be used for a non-bpf filter. But
> if you had a filter that was just a list of address families, that
> filter would look at struct sock and do its own thing. iptables
> wouldn't make sense for a socket creation filter, but it would make
> perfect sense for an ingress filter. Obviously the bpf-specific state
> object wouldn't be used, but it would still be a hook, invoked from
> the same network code, guarded by the same static key, looking at the
> same skb.

I strongly suggest going back and reading my first reply, where I think
I explained well enough that something like iptables will not be able to
reuse the ioctl mechanism you're proposing here: hook ids will be
different, and the attachment mechanism will be different too.
So your proposed cgroup ioctl is already dead as a reusable interface.

> >> net.socket_create_filter = "disallow": no sockets created period
> >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
> >
> > so you're proposing to add a bunch of hard coded logic to the kernel.
> > First to parse such text into some sort of syntax tree or list/set
> > and then have hard coded logic specifically for these two use cases?
> > While the above two can be implemented as trivial bpf programs already?!
> > That goes 180 degrees against the bpf philosophy. bpf is about moving
> > the specific code out of the kernel and keeping the kernel generic so
> > that it can solve as many use cases as possible by being programmable.
>
> I'm not seriously proposing implementing these. My point is that
> *bpf*, while wonderful, is not the be-all-and-end-all of kernel
> configurability, and other types of hooks might want to be hooked in
> here.

Then please let's talk about real use cases.
This daydreaming about some future llvm in the kernel that you were
bringing up during LPC doesn't help the discussion, and neither do
these artificial examples.

> > ...
> >> What exactly isn't sensible about using cgroup_bpf for containers?
> >
> > my use case is close to zero overhead application network monitoring.
>
> So if I set up a cgroup that's monitored and call it /cgroup/a and
> enable delegation and if the program running there wants to do its own
> monitoring in /cgroup/a/b (via delegation), then you really want the
> outer monitor to silently drop events coming from /cgroup/a/b?

yes. both are root and must talk to each other if they want to co-exist.
When a root process asks the kernel to do X, X has to happen.

> Then disallow nesting. You're welcome to not rush the decision, but
> that's not what you've done. If 4.10 is released as is, you've made
> the decision and you're going to have a hard time changing it.

Nothing needs to be changed.

> > No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
> > type programs and only root can attach them.
>
> Why? It really seems to me that you expect that future namespaceable
> bpf hooks will use a totally different API.
> At KS, I sat in a room full of people arguing about cgroup v2 and a
> lot of them pointed out that there are valid, paying use cases that
> want to stick cgroup v1 in a container, because the code that uses
> cgroup v1 in the container is called systemd and the contained OS is
> called RHEL (or SuSE or Ubuntu or Gentoo or whatever), and that code
> is *already written* and needs to be contained.

bpf in general is not namespace aware. It's global, and this cgroup
scoping of bpf programs is the first of its kind.
Namespacing of bpf is a completely different topic.

> The current approach to bpf hooks will bite you down the road. David
> Ahern is already proposing using it for something that is not tracing
> at all, and someone will want that in a container, and there will be a
> problem.

the vrf use case is already supported by the existing code.

> How about slowing down a wee bit and trying to come up with cgroup
> hook semantics that work for all of these use cases? I think my
> proposal is quite close to workable.

you started the topic by claiming that things are broken and
non-extensible. In the course of this thread it was explained that the
interface is extensible and not broken for the use case it was designed
for. The 'security' use case like lsm+bpf+cgroup is not supported by the
current model yet, and that's what we need to discuss in the future.
So, yes, please slow down.