MIME-Version: 1.0
In-Reply-To: <20161220000254.GA58895@ast-mbp.thefacebook.com>
References: <CALCETrV81oFwq2AgeRsN54HA1jR=b5cOZfAgve8H8zhx83DTyA@mail.gmail.com>
 <20161219205631.GA31242@ast-mbp.thefacebook.com> <CALCETrWr5XMkexdGp7HdkiLkQV=P9ycj+sNO7xWSRoCVxihVZA@mail.gmail.com>
 <20161220000254.GA58895@ast-mbp.thefacebook.com>
From: Andy Lutomirski <luto@amacapital.net>
Date: Mon, 19 Dec 2016 16:25:32 -0800
Message-ID: <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw@mail.gmail.com>
Subject: Re: Potential issues (security and otherwise) with the current
 cgroup-bpf API
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>, Daniel Mack <daniel@zonque.org>,
        =?UTF-8?B?TWlja2HDq2wgU2FsYcO8bg==?= <mic@digikod.net>,
        Kees Cook <keescook@chromium.org>, Jann Horn <jann@thejh.net>,
        Tejun Heo <tj@kernel.org>, David Ahern <dsahern@gmail.com>,
        "David S. Miller" <davem@davemloft.net>, Thomas Graf <tgraf@suug.ch>,
        Michael Kerrisk <mtk.manpages@gmail.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Linux API <linux-api@vger.kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Network Development <netdev@vger.kernel.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 12217
Lines: 272

On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
>> >> Hi all-
>> >>
>> >> I apologize for being rather late with this.  I didn't realize that
>> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> >> it on the linux-api list, so I missed the discussion.
>> >>
>> >> I think that the inet ingress, egress etc filters are a neat feature,
>> >> but I think the API has some issues that will bite us down the road
>> >> if it becomes stable in its current form.
>> >>
>> >> Most of the problems I see are summarized in this transcript:
>> >>
>> >> # mkdir cg2
>> >> # mount -t cgroup2 none cg2
>> >> # mkdir cg2/nosockets
>> >> # strace cgrp_socket_rule cg2/nosockets/ 0
>> >> ...
>> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>> >>
>> >> ^^^^ You can modify a cgroup after opening it O_RDONLY?
>> >>
>> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
>> >>
>> >> ^^^^ This is fine.  The bpf() syscall manipulates bpf objects.
>> >>
>> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>> >>
>> >> ^^^^ This is not so good:
>> >> ^^^^
>> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects.  This
>> >> ^^^^    is manipulating a cgroup.  There's no reason that a socket creation
>> >> ^^^^    filter couldn't be written in a different language (new iptables
>> >> ^^^^    table?  Simple list of address families?), but if that happened,
>> >> ^^^^    then using bpf() to install it would be entirely nonsensical.
>> >
>> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
>> > network stack is using cgroup as an application group that should
>> > invoke bpf program at the certain point in the stack.
>> > imo cgroup management is orthogonal.
>>
>> It is literally modifying the struct cgroup, and, as a practical
>> matter, it's causing membership in the cgroup to have a certain
>> effect.  But rather than pointless arguing, let me propose an
>> alternative API that I think solves most of the problems here.
>>
>> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
>> Instead, the cgroup gets three new control files:
>> "net.ingress_filter", "net.egress_filter", and
>> "net.socket_create_filter".  Initially, if you read these files, you
>> see "none\n".
>>
>> To attach a bpf filter, you open the file for write and do an ioctl on
>> it.  After doing the ioctl, if you read the file, you'll see
>> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
>> the bpf program.
>>
>> To detach any type of filter, bpf or otherwise, you open the file for
>> write and write "none\n" (or just "none").
>>
>> If you write anything else to the file, you get -EINVAL.  But, if
>> someone writes a new type of filter (perhaps a simple list of address
>> families), maybe you can enable the filter by writing something
>> appropriate to the file.
>
> I see no difference in what you're proposing vs what is already implemented
> from feature set point of view, but the file approach is very ugly, since
> it's a mismatch to FD style access that bpf is using everywhere.
> In your proposal you'd also need to add bpf prefix everywhere.
> So the control file names should be bpf_inet_ingress, bpf_inet_egress
> and bpf_socket_create.

I think we're still talking past each other.  A big part of the point
of changing it is that none of this is specific to bpf.  You could (in
theory -- I'm not proposing implementing these until there's demand)
have:

net.socket_create_filter = "none": no filter
net.socket_create_filter = "bpf:baadf00d": bpf filter
net.socket_create_filter = "disallow": no sockets created period
net.socket_create_filter = "iptables:foobar": some iptables thingy
net.socket_create_filter = "nft:blahblahblah": some nft thingy
net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

See?  This API is not bpf-specific.  It's an API for filtering.  The
fact that struct cgroup currently contains a member called "bpf" is
purely an artifact of the fact that it currently only supports bpf.
Someone will want to rename it to "filters" some day, and
BPF_PROG_DETACH makes no sense whatsoever as a generic API to detach a
filter.

> If you want to prepare such patch for them, I don't mind,
> but you cannot kill syscall command, since it's more flexible
> and your control-file approach _will_ be obsolete pretty quickly.

BPF_PROG_ATTACH and BPF_PROC_DETACH should be removed regardless IMO.
If you really really want a syscall, make it a new syscall.

>
>> Now the API matches the effect.  A cgroup with something other than
>> "none" in one of its net.*_filter files is a cgroup that filters
>> network activity.  And you get CRIU compatibility for free: CRIU can
>> read the control file and use whatever mechanism is uses for BPF in
>> general to restore the cgroup filter state.   As an added bonus, you
>> get ACLs for free and the ugly multiplexer goes away.
>
> extended bpf is not supported by criu. only classic, so having
> control_file-style attachment doesn't buy us anything.

CRIU will support it some day.  We might as well put fewer obstacles
in their way.

>
>> >> # mkdir cg2/nosockets/sockets
>> >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
>> >>
>> >> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
>> >> ^^^^ then we're stuck with its semantics.  If it returned -EINVAL instead,
>> >> ^^^^ there would be a chance to refine it.
>> >
>> > i don't see the problem with this behavior. bpf and cgroup are indepedent.
>> > Imange that socket accounting program is attached to cg2/nosockets.
>> > The program is readonly and carry no security meaning.
>> > Why cgroup should prevent creation of cg2/nosockets/foo directory ?
>>
>> I think you're misunderstanding me.  What I'm saying is that, if you
>> allow a cgroup and one of its descendents to both enable the same type
>> of filter, you have just committed to some particular semantics for
>> what happens.  And I think that the current semantics are the *wrong*
>> semantics for default long-term use, so you should either fix the
>> semantics or disable the problematic case.
>
> Are you saying that use cases I provided are also 'wrong'?
> If you insist on that we won't be able to make any forward progress.
> The current semantics is fine for what it's designed for.

Can you explain your use case more clearly?

>
>> >> # echo $$ >cg2/nosockets/sockets/cgroup.procs
>> >> # ping 127.0.0.1
>> >> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
>> >> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
>> >
>> > hmm. in this case the admin created a subgroup with a group that allows
>> > ping, so? how's that a problem?
>>
>> I think you're forgetting about namespaces.  There are two different
>> admins here.  There's the admin who created the outer container and
>> said "no sockets here".  Then there's the admin inside the container
>> who said "muahaha, I can create a sub-container and allow sockets
>> *there*".
>
> container management frameworks should be doing sensible
> things. With any infra in the kernel there are many ways to
> create non-sensical setups. It's not a job of the kernel
> to interfere with user space.

What exactly isn't sensible about using cgroup_bpf for containers?

>
>> > In case of vrf I can imagine
>> > a set of application auto-binding to one vrf and smaller
>> > set of applications binding to a different vrf.
>> > Or broader set of applications monitoring all tcp traffic
>> > and subset of them monitoring udp instead.
>> > Those are valid use cases.
>>
>> > I guess you're advocating to run a link list of programs?
>> > That won't be useful for the above use cases, where there is no
>> > reason to run more than one program that control plane
>> > assigned to run for this cgroup.
>>
>> Yes there is, for both monitoring and policy.  If I want to monitor
>> all activity in a cgroup, I probably want to monitor descendents as
>> well.  If I want to restrict a cgroup, I want to restrict its
>> descendents.
>
> you're ignoring use cases I described earlier.
> In vrf case there is only one ifindex it needs to bind to.

I'm totally lost.  Can you explain what this has to do with the cgroup
hierarchy?

>> > At this stage we decided to allow only one program per cgroup per hook
>> > and later can extend it if necessary.
>>
>> No you can't.  Since you allow the problematic case and you gave it
>> the surprising "one program" semantics, you can't change it later.
>
> please show me why we cannot? As far as I can see nothing prevents
> that in the future. We can add any number of new fields to
> BPF_PROG_ATTACH command just like we did in the past with
> other commands, whereas control file interface is not extensible.

Because people are going to start using the old API, tools won't be
aware of the new semantics, you have no usable introspection
mechanism, and everyone is going to screw it up.  But it's even worse:
because global privilege is currently needed to set up these filters,
containers really can use it today, but once you switch to ns_capable,
then it suddenly becomes insecure.  And *that* is something that you
can't do.

>
>> > As you're pointing out, in case of security, we probably
>> > want to preserve original bpf program that should always be
>> > run first and only after it returned 'ok' (we'd need to define
>> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
>>
>> It's already defined AFAICT.  1 means okay.  0 means not okay.
>
> sorry that doesn't make any sense. For seccomp we have a set of
> ranges that mean different things. Here you're proposing to
> hastily assign 1 and 0 ? How is that extensible?
> We need to carefully think through what should be the semantics
> of attaching multiple programs, consider performance implications,
> return codes and so on.

You already assigned it.  The return value of the bpf program, loaded
in Linus' tree today, tells the kernel whether to accept or reject.

>
>> > Another alternative is to disallow attaching programs in sub-hierarchy
>> > if parent has something already attached, but it's not useful
>> > for general case.
>> > All of these are possible future extensions.
>>
>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>> have to do it now (or disable the feature for 4.10).  This is why I'm
>> bringing this whole thing up now.
>
> We don't have to touch user visible api here, so extensions are fine.

Huh?  My example in the original email attaches a program in a
sub-hierarchy.  Are you saying that 4.11 could make that example stop
working?

>
>> >> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
>> >> programs attached that can do things if various events occur. (Right
>> >> now, this means socket operations, but there are plans in the works
>> >> to do this for LSM hooks too.) These bpf programs can say yes or no,
>> >> but they can also read out various data (including socket payloads!)
>> >> and save them away where an attacker can find them. This sounds a
>> >> lot like seccomp with a narrower scope but a much stronger ability to
>> >> exfiltrate private information.
>> >
>> > that sounds like a future problem to solve when bpf+lsm+cgroup are
>> > used for security.
>>
>> [...]
>>
>> >
>> > I disagree with the urgency. I see nothing that needs immediate action.
>> > bpf+lsm+cgroup is not in the tree yet.
>> >
>>
>> I disagree very strongly here.  The API in 4.10 is IMO quite ugly, but
>> the result if bpf+lsm+cgroup works *differently* will be far, far
>> uglier.
>
> again we're talking about the future here and 'ugly' is the matter of taste.
> I hear all the time that people say that netlink api is ugly, so?

Yeah, it's ugly, but that ship already sailed.  This ship hasn't.