From: Jann Horn
Date: Wed, 10 Apr 2019 01:22:00 +0200
Subject: Re: [PATCH v3 bpf-next 00/21] bpf: Sysctl hook
To: Andrey Ignatov
Cc: Network Development, Alexei Starovoitov, Daniel Borkmann,
    Roman Gushchin, Kernel Team, Luis Chamberlain, Kees Cook,
    Alexey Dobriyan, kernel list, linux-fsdevel, linux-security-module
In-Reply-To: <20190409230432.GA59615@rdna-mbp>

On Wed, Apr 10, 2019 at 1:04 AM Andrey Ignatov wrote:
> Jann Horn [Tue, 2019-04-09 13:42 -0700]:
> > On Tue, Apr 9, 2019 at 10:26 PM Andrey Ignatov wrote:
> > > The patch set introduces new BPF hook for sysctl.
> > >
> > > It adds new program type BPF_PROG_TYPE_CGROUP_SYSCTL and attach type
> > > BPF_CGROUP_SYSCTL.
> > >
> > > BPF_CGROUP_SYSCTL hook is placed before calling to sysctl's proc_handler so
> > > that accesses (read/write) to sysctl can be controlled for specific cgroup
> > > and either allowed or denied, or traced.
> >
> > Don't look at the credentials of "current" in a read or write handler.
> > Consider what happens if, for example, someone inside a cgroup opens a
> > sysctl file and passes the file descriptor to another process outside
> > the cgroup over a unix domain socket, and that other process then
> > writes to it.
> > Either do your access check on open, or use the
> > credentials that were saved during open() in the read/write handler.
>
> This way this someone inside cgroup should already have control over
> something running as root [1] outside of this cgroup, i.e. the game is
> already lost, even without this hook.
>
> [1] Since proc_sys_read() / proc_sys_write() check sysctl_perm() before
> execution reaches the hook.

You don't need to have _control_ over something running as root. You
only need to be able to communicate with something that expects to be
passed in file descriptors for some purpose.

> This patch set doesn't look at credentials at all and relies on what
> checks were already done at sys_open time or in proc_sys_call_handler()
> before execution reaches the hook.

You're looking at the cgroup though.

> > > The hook has access to sysctl name, current sysctl value and (on write
> > > only) to new sysctl value via corresponding helpers. New sysctl value can
> > > be overridden by program. Both name and values (current/new) are
> > > represented as strings same way they're visible in /proc/sys/. It is up to
> > > program to parse these strings.
> >
> > But even if a filter is installed that prevents all access to a
> > sysctl, you can still read it by installing your own filter that, when
> > a read is attempted the next time, dumps the value into a map or
> > something like that, right?
>
> No. This can be controlled by cgroup hierarchy and appropriate attach
> flags, same way as with any other cgroup-bpf hook.
>
> E.g. imagine there is a cgroup hierarchy:
> root/slice/container/
>
> and container application runs in root/slice/container/ in a cgroup
> namespace (CLONE_NEWCGROUP) that makes visible only "container/" part of
> the hierarchy, i.e. from inside container application can't even see
> "root/slice/".
>
> Administrator can then attach sysctl hook to "root/slice/" with attach
> flag NONE (bpf_attr.attach_flags = 0) what means nobody down the
> hierarchy can override the program attached by administrator.

Ah, okay.

> > > To help with parsing the most common kind of sysctl value, vector of
> > > integers, two new helpers are provided: bpf_strtol and bpf_strtoul with
> > > semantic similar to user space strtol(3) and strtoul(3).
> > >
> > > The hook also provides bpf_sysctl context with two fields:
> > > * @write indicates whether sysctl is being read (= 0) or written (= 1);
> > > * @file_pos is sysctl file position to read from or write to, can be
> > >   overridden.
> > >
> > > The hook allows to make better isolation for containerized applications
> > > that are run as root so that one container can't change a sysctl and affect
> > > all other containers on a host, make changes to allowed sysctl in a safer
> > > way and simplify sysctl tracing for cgroups.
> >
> > Why can't you use a user namespace and isolate things properly that
> > way? That would be much cleaner, wouldn't it?
>
> I'm not sure I understand how user namespace helps here. From my
> understanding it can only completely deny access to sysctl and can't do
> fine-grained control for specific sysctl knobs. It also can't make
> allow/deny decision based on sysctl value being written.
>
> Basically user namespace is all or nothing. This sysctl hook provides a
> way to implement fine-grained access control for sysctl knobs based on
> sysctl name or value being written or whatever else policy administrator
> can come up with.

But there's a reason why user namespaces are all-or-nothing on these
things. If the kernel does not explicitly make a sysctl available to a
container, the sysctl has global effects, and therefore probably
shouldn't be exposed to anything other than someone with
administrative privileges across the whole system.
If the kernel does make it available to a container, the sysctl's
effects are limited to the container (or otherwise it's a kernel bug).
Can you give examples of sysctls that you want to permit using from
containers, that wouldn't be accessible in a user namespace?
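
A minimal sketch of the file-descriptor-passing scenario described earlier in this message, written in Python for brevity. It is an illustration only and not part of the patch set: an ordinary temporary file stands in for a /proc/sys entry, and the helper names (send_fd, recv_fd, pass_fd_demo) are hypothetical. One process opens the file and hands the descriptor over a unix domain socket; a different process performs the write, so a check against "current" at write time would see the writer rather than the opener.

```python
import array
import os
import socket
import tempfile


def send_fd(sock, fd):
    """Send one file descriptor over a unix socket as SCM_RIGHTS ancillary data."""
    payload = array.array("i", [fd])
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, payload)])


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing bytes that do not form a whole int.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]


def pass_fd_demo():
    """Opener passes an fd to a separate writer process, which writes through it.

    Returns (opener_pid, writer_pid, file_content).
    """
    opener_sock, writer_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    tmp_fd, path = tempfile.mkstemp()  # stand-in for a sysctl file (assumption)
    os.close(tmp_fd)

    pid = os.fork()
    if pid == 0:
        # Writer process: it never opened the file itself.
        status = 1
        try:
            opener_sock.close()
            fd = recv_fd(writer_sock)
            os.write(fd, b"written by pid %d" % os.getpid())
            status = 0
        finally:
            os._exit(status)

    # Opener process: open only after fork, so the fd is not simply inherited.
    writer_sock.close()
    fd = os.open(path, os.O_WRONLY)
    send_fd(opener_sock, fd)
    os.close(fd)
    os.waitpid(pid, 0)

    with open(path, "rb") as f:
        content = f.read()
    os.unlink(path)
    return os.getpid(), pid, content


if __name__ == "__main__":
    opener, writer, content = pass_fd_demo()
    print("opened by pid", opener, "written by pid", writer, "content:", content)
```

The file ends up containing the writer's pid even though only the opener ever called open() on it, which is why the two options above are checking at open time or using the credentials saved at open().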