Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API
To: Andy Lutomirski
References: <20161219205631.GA31242@ast-mbp.thefacebook.com> <20161220000254.GA58895@ast-mbp.thefacebook.com> <2dbec775-6304-e44c-19c5-fbf07877e7b1@gmail.com>
Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, "David S. Miller", Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, "linux-kernel@vger.kernel.org", Network Development
From: David Ahern
Message-ID: <80574175-3692-0278-a74e-23b752d44f73@gmail.com>
Date: Mon, 19 Dec 2016 19:52:15 -0700
X-Mailing-List: linux-kernel@vger.kernel.org

On 12/19/16 6:56 PM, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern wrote:
>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>> net.socket_create_filter = "none": no filter
>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>> net.socket_create_filter = "disallow": no sockets created period
>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>
>> Such a scheme works for the socket create filter b/c it is a very simple use case.
>> It does not work for the ingress and egress which allow generic bpf filters.
>
> Can you elaborate on what goes wrong?  (Obviously the
> "address_family_list" example makes no sense in that context.)

Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are special cases of the moment, which goes against the intent of eBPF and generic programmability.

>>
>> ...
>>
>>>> you're ignoring use cases I described earlier.
>>>> In vrf case there is only one ifindex it needs to bind to.
>>>
>>> I'm totally lost.  Can you explain what this has to do with the cgroup
>>> hierarchy?
>>
>> I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is
>>
>>     cgrp2/vrf/NAME
>>
>> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense.
>
> I tend to agree.  I still think that the mechanism as it stands is
> broken in other respects and should be fixed before it goes live.  I
> have no desire to cause problems for the vrf use case.
>
> But keep in mind that the vrf use case is, in Linus' tree, a bit
> broken right now in its interactions with other users of the same
> mechanism.  Suppose I create a container and want to trace all of its
> created sockets.  I'll set up cgrp2/container and load my tracer as a
> socket creation hook.  Then a container sets up
> cgrp2/container/vrf/NAME (using delegation) and loads your vrf binding
> filter.  Now the tracing stops working -- oops.

There are other ways to achieve socket tracing, but I get your point -- nested cases do not work as users may want.

>>>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>>>>> have to do it now (or disable the feature for 4.10).  This is why I'm
>>>>> bringing this whole thing up now.
>>>>
>>>> We don't have to touch user visible api here, so extensions are fine.
>>>
>>> Huh?  My example in the original email attaches a program in a
>>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>>> working?
>>
>> Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?
>
> Yes, exactly.  I think there are two sensible behaviors:
>
> a) sub-cgroups cannot have a filter at all if the parent has a filter.
> (This is the "punt" approach -- it lets different semantics be
> assigned later without breaking userspace.)
>
> b) sub-cgroups can have a filter if a parent does, too.  The semantics
> are that the sub-cgroup filter runs first and all side-effects occur.
> If that filter says "reject" then ancestor filters are skipped.  If
> that filter says "accept", then the ancestor filter is run and its
> side-effects happen as well.  (And so on, all the way up to the root.)

That comes with a big performance hit for skb / data path cases.

I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.
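
For anyone following the thread, the two behaviors Andy lists can be sketched as a small user-space model in Python. This is only an illustration of the semantics as described above, not kernel code and not the actual cgroup-bpf implementation. The Cgroup class, attach_option_a(), run_option_b(), the tracer and vrf_bind filters, and the ifindex value 7 are all hypothetical names and values made up for this sketch. It also assumes option (a) checks every ancestor rather than only the immediate parent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical model: a filter sees an event dict and returns
# True ("accept") or False ("reject"). It may mutate the event
# as a side effect, much like a binding filter would.
Filter = Callable[[dict], bool]


@dataclass
class Cgroup:
    name: str
    parent: Optional["Cgroup"] = None
    filter: Optional[Filter] = None


def has_filtered_ancestor(cg: Cgroup) -> bool:
    """Return True if any ancestor of cg already carries a filter."""
    node = cg.parent
    while node is not None:
        if node.filter is not None:
            return True
        node = node.parent
    return False


def attach_option_a(cg: Cgroup, flt: Filter) -> None:
    """Option (a): refuse to attach below a cgroup that has a filter."""
    if has_filtered_ancestor(cg):
        raise PermissionError(f"{cg.name}: an ancestor already has a filter")
    cg.filter = flt


def run_option_b(cg: Cgroup, event: dict, trace: List[str]) -> bool:
    """Option (b): run filters from cg up to the root.

    A "reject" stops the walk, so ancestor filters are skipped.
    An "accept" continues to the next ancestor that has a filter.
    trace records which cgroups actually ran a filter.
    """
    node: Optional[Cgroup] = cg
    while node is not None:
        if node.filter is not None:
            trace.append(node.name)
            if not node.filter(event):
                return False
        node = node.parent
    return True


def demo():
    """Model the container tracer plus nested vrf filter from the thread."""
    root = Cgroup("cgrp2")
    container = Cgroup("cgrp2/container", parent=root)
    vrf = Cgroup("cgrp2/container/vrf/NAME", parent=container)

    seen: List[str] = []

    def tracer(event: dict) -> bool:
        seen.append(event["family"])  # side effect: record the socket
        return True

    def vrf_bind(event: dict) -> bool:
        event["bound_ifindex"] = 7  # made-up device index
        return True

    container.filter = tracer
    vrf.filter = vrf_bind

    trace: List[str] = []
    event = {"family": "AF_INET"}
    verdict = run_option_b(vrf, event, trace)
    return verdict, trace, seen, event
```

Under option (b) the tracer in cgrp2/container still sees sockets created in the nested vrf cgroup, which is the case Andy says breaks today. The walk in run_option_b() also shows where the cost comes from: every event pays for one filter run per filtered ancestor, which is the data path concern raised above.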