Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757383AbcLTDNM (ORCPT ); Mon, 19 Dec 2016 22:13:12 -0500 Received: from mail-ua0-f171.google.com ([209.85.217.171]:33317 "EHLO mail-ua0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755128AbcLTDNK (ORCPT ); Mon, 19 Dec 2016 22:13:10 -0500 MIME-Version: 1.0 In-Reply-To: <80574175-3692-0278-a74e-23b752d44f73@gmail.com> References: <20161219205631.GA31242@ast-mbp.thefacebook.com> <20161220000254.GA58895@ast-mbp.thefacebook.com> <2dbec775-6304-e44c-19c5-fbf07877e7b1@gmail.com> <80574175-3692-0278-a74e-23b752d44f73@gmail.com> From: Andy Lutomirski Date: Mon, 19 Dec 2016 19:12:48 -0800 Message-ID: Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API To: David Ahern Cc: Alexei Starovoitov , Andy Lutomirski , Daniel Mack , =?UTF-8?B?TWlja2HDq2wgU2FsYcO8bg==?= , Kees Cook , Jann Horn , Tejun Heo , "David S. Miller" , Thomas Graf , Michael Kerrisk , Peter Zijlstra , Linux API , "linux-kernel@vger.kernel.org" , Network Development Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by mail.home.local id uBK3DGHW013886 Content-Length: 4963 Lines: 107 On Mon, Dec 19, 2016 at 6:52 PM, David Ahern wrote: > On 12/19/16 6:56 PM, Andy Lutomirski wrote: >> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern wrote: >>> On 12/19/16 5:25 PM, Andy Lutomirski wrote: >>>> net.socket_create_filter = "none": no filter >>>> net.socket_create_filter = "bpf:baadf00d": bpf filter >>>> net.socket_create_filter = "disallow": no sockets created period >>>> net.socket_create_filter = "iptables:foobar": some iptables thingy >>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy >>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 >>> >>> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters. >> >> Can you elaborate on what goes wrong? (Obviously the >> "address_family_list" example makes no sense in that context.) > > Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability. Oh -- I'm not proposing that at all. Let me clarify. For the bpf case, if you *read* the file, you'd see "bpf:baadf00d". But writing "bpf:baadf00d" is nonsense and would give you -EINVAL. Instead you install a bpf filter by opening the file for write (O_RDWR or O_WRONLY) and doing ioctl(cgroup_socket_create_file_fd, CGROUP_IOCTL_SET_BPF_FILTER, bpf_fd); It's kind of like BPF_PROG_ATTACH except that it respects filesystem permissions, it isn't in the bpf() multiplexer, the filter being set is implied (by the fd in use), and everything is nicely seccompable. To remove the filter, you write "none" or "none\n" to the file. As a future extension, if someone wanted more than one filter to be able to coexist in the cgroup socket_create_filter slot, you could plausibly do 'echo disallow >>net.socket_create_filter' or use a new ioctl CGROUP_IOCTL_APPEND_BPF_FILTER, and then you'd read the file and see more than one line. But this would be a future extension and may never be needed. >> >> a) sub-cgroups cannot have a filter at all of the parent has a filter. >> (This is the "punt" approach -- it lets different semantics be >> assigned later without breaking userspace.) >> >> b) sub-cgroups can have a filter if a parent does, too. The semantics >> are that the sub-cgroup filter runs first and all side-effects occur. >> If that filter says "reject" then ancestor filters are skipped. If >> that filter says "accept", then the ancestor filter is run and its >> side-effects happen as well. (And so on, all the way up to the root.) > > That comes with a big performance hit for skb / data path cases. > > I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part. I'm not sure I buy the performance hit. If you do it naively, then performance will indeed suck. But there's already a bunch of code in there to pre-populate a filter list for faster use. Currently, we have: struct cgroup_bpf { /* * Store two sets of bpf_prog pointers, one for programs that are * pinned directly to this cgroup, and one for those that are effective * when this cgroup is accessed. */ struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE]; }; in struct cgroup, there's a 'struct cgroup_bpf bpf;'. This would change to something like: struct cgroup_filter_slot { struct bpf_prog *effective; struct cgroup_filter_slot *next; struct bpf_prog *local; } local is NULL unless *this* cgroup has a filter. effective points to the bpf_prog that's active in this cgroup or the nearest ancestor that has a filter. next is NULL if there are no filters higher in the chain or points to the next slot that has a filter. struct cgroup has: struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE]; To evaluate it, you do: struct cgroup_filter_slot *slot = &cgroup->slot[the index]; if (!slot->effective) return; do { evaluate(slot->effective); slot = slot->next; } while (unlikely(slot)); The old code was one branch per evaluation. The new code is n+1 branches where n is the number of filters, so it's one extra branch in the worst case. But the extra branch is cache-hot (the variable is right next to slot->effective, which is needed regardless) and highly predictable (in the case where nesting isn't used, the branch is not taken, and it's a backward branch which most CPUs will predict as not taken). I expect that, on x86, this adds at most a cycle or two and quite possibly adds no measurable overhead at all.