Received: by 10.192.165.148 with SMTP id m20csp4095929imm; Mon, 23 Apr 2018 19:09:42 -0700 (PDT) X-Google-Smtp-Source: AB8JxZq09rLe85RbHhCKydaycjifw01+4jiFT5GZlOToVR9kReytw8djHPCFW9DsUm72hjZOzaw0 X-Received: by 10.99.161.25 with SMTP id b25mr62299pgf.3.1524535782033; Mon, 23 Apr 2018 19:09:42 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524535781; cv=none; d=google.com; s=arc-20160816; b=Y4Snktus38Z1sCXh32V5PSMbG4TBBg8tC44YIvMKqCHoZXDnhMXtGsACL4kGIQJCs8 YIHxl5RClTcqh/GHFoE8YHdhztqDaKEsQlj+viUSf1qIUHRDs/E2VkuzP6ANOpxhOBSD XI8I0vDQ4S3b49xIHv9Q4hBUoJ6X+LA7DxBRMGI4lz3Mg7G0mMiaKZiMXFObcY+s+Vvx CkOmyZ7y2jCi5SdOMS+1z16VzD7Hng2JbZf1ZhHjSjawdOxU+89KjBh2XV0973SO+Npp LVTCBv0ZLZuTZcZGShfC4SVvf1djKe0JdmnjbBvMxBtmbSDy29xbflibV0HlsfodzccY Sxig== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:arc-authentication-results; bh=zPq9MU1HQYOLGrmmaw3nwgTF0k5tYew7+qZMTWIeZYo=; b=Ega85E10A5296BKg5GyoIh5XOnJ6l4u00n7j+sP0UECLA5iU2eUd8LlEIDRnYyA7Xd r1Q0D1mSQx67+xyGN3jyB77SfFg4O1qFXa1e8eiTn23WNnwzO71MqkQ9o/joWYTtuzgY U1LTh2UMYuPe/MVMqcX3dpxe9r942pqy9mZKv100Yg9ipsiknZQP/5o8GzyhTd4IkTKZ FzIj5ikJ96D9YP91J/LZmPiBkSdV2nc+q9+Ev3JiT5yim2meQx5QrlmOD7yT3qFuHiTe PaHw4u2nJNK97aDy/qFkUJGMrh+KBoKNA2L1DxP6Lhral0whC6bsE4/FQndmFLmeL7ke 5t+Q== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id 38-v6si13225250pln.390.2018.04.23.19.09.27; Mon, 23 Apr 2018 19:09:41 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932756AbeDXCIR (ORCPT + 99 others); Mon, 23 Apr 2018 22:08:17 -0400 Received: from mx3-rdu2.redhat.com ([66.187.233.73]:37830 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S932670AbeDXCIP (ORCPT ); Mon, 23 Apr 2018 22:08:15 -0400 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.rdu2.redhat.com [10.11.54.3]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 8D19CFB650; Tue, 24 Apr 2018 02:08:14 +0000 (UTC) Received: from madcap2.tricolour.ca (ovpn-112-12.rdu2.redhat.com [10.10.112.12]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 968511208F70; Tue, 24 Apr 2018 02:08:03 +0000 (UTC) Date: Mon, 23 Apr 2018 22:02:00 -0400 From: Richard Guy Briggs To: Paul Moore Cc: cgroups@vger.kernel.org, containers@lists.linux-foundation.org, linux-api@vger.kernel.org, Linux-Audit Mailing List , linux-fsdevel@vger.kernel.org, LKML , netdev@vger.kernel.org, ebiederm@xmission.com, luto@kernel.org, jlayton@redhat.com, carlos@redhat.com, dhowells@redhat.com, viro@zeniv.linux.org.uk, simo@redhat.com, Eric Paris , serge@hallyn.com Subject: Re: [RFC PATCH ghak32 V2 01/13] audit: add container id Message-ID: <20180424020200.imonhbkwtb73luxl@madcap2.tricolour.ca> References: <20180421143443.faaput5g2rn6ul7p@madcap2.tricolour.ca> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: NeoMutt/20171027 X-Scanned-By: MIMEDefang 2.78 on 10.11.54.3 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.1]); Tue, 24 Apr 2018 02:08:14 +0000 (UTC) X-Greylist: inspected by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.1]); Tue, 24 Apr 2018 02:08:14 +0000 (UTC) for IP:'10.11.54.3' DOMAIN:'int-mx03.intmail.prod.int.rdu2.redhat.com' HELO:'smtp.corp.redhat.com' FROM:'rgb@redhat.com' RCPT:'' Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2018-04-23 19:15, Paul Moore wrote: > On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs wrote: > > On 2018-04-18 19:47, Paul Moore wrote: > >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs wrote: > >> > Implement the proc fs write to set the audit container ID of a process, > >> > emitting an AUDIT_CONTAINER record to document the event. > >> > > >> > This is a write from the container orchestrator task to a proc entry of > >> > the form /proc/PID/containerid where PID is the process ID of the newly > >> > created task that is to become the first task in a container, or an > >> > additional task added to a container. > >> > > >> > The write expects up to a u64 value (unset: 18446744073709551615). > >> > > >> > This will produce a record such as this: > >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0 > >> > > >> > The "op" field indicates an initial set. The "pid" to "ses" fields are > >> > the orchestrator while the "opid" field is the object's PID, the process > >> > being "contained". Old and new container ID values are given in the > >> > "contid" fields, while res indicates its success. > >> > > >> > It is not permitted to self-set, unset or re-set the container ID. A > >> > child inherits its parent's container ID, but then can be set only once > >> > after. > >> > > >> > See: https://github.com/linux-audit/audit-kernel/issues/32 > >> > > >> > Signed-off-by: Richard Guy Briggs > >> > --- > >> > fs/proc/base.c | 37 ++++++++++++++++++++ > >> > include/linux/audit.h | 16 +++++++++ > >> > include/linux/init_task.h | 4 ++- > >> > include/linux/sched.h | 1 + > >> > include/uapi/linux/audit.h | 2 ++ > >> > kernel/auditsc.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++ > >> > 6 files changed, 143 insertions(+), 1 deletion(-) > > ... > > >> > diff --git a/include/linux/sched.h b/include/linux/sched.h > >> > index d258826..1b82191 100644 > >> > --- a/include/linux/sched.h > >> > +++ b/include/linux/sched.h > >> > @@ -796,6 +796,7 @@ struct task_struct { > >> > #ifdef CONFIG_AUDITSYSCALL > >> > kuid_t loginuid; > >> > unsigned int sessionid; > >> > + u64 containerid; > >> > >> This one line addition to the task_struct scares me the most of > >> anything in this patchset. Why? It's a field named "containerid" in > >> a perhaps one of the most widely used core kernel structures; the > >> possibilities for abuse are endless, and it's foolish to think we > >> would ever be able to adequately police this. > > > > Fair enough. > > > >> Unfortunately, we can't add the field to audit_context as things > >> currently stand because we don't always allocate an audit_context, > >> it's dependent on the system's configuration, and we need to track the > >> audit container ID for a given process, regardless of the audit > >> configuration. Pretty much the same reason why loginuid and sessionid > >> are located directly in task_struct now. As I stressed during the > >> design phase, I really want to keep this as an *audit* container ID > >> and not a general purpose kernel wide container ID. If the kernel > >> ever grows a general purpose container ID token, I'll be the first in > >> line to convert the audit code, but I don't want audit to be that > >> general purpose mechanism ... audit is hated enough as-is ;) > > > > When would we need an audit container ID when audit is not enabled > > enough to have an audit_context? > > I'm thinking of the audit_alloc() case where audit_filter_task() > returns AUDIT_DISABLED. Ok, so a task could be marked as filtered but its children would still be auditable and inheriting its parent containerid (as well at its loginuid and sessionid)... > I believe this is the same reason why loginuid and sessionid live > directly in the task_struct and not in the audit_context; they need to > persist for the lifetime of the task. Yes, probably. > > If it is only used for audit, and audit is the only consumer, and audit > > can only use it when it is enabled, then we can just return success to > > any write to the proc filehandle, or not even present it. Nothing will > > be able to know that value wasn't used. > > > > When are loginuid and sessionid used now when audit is not enabled (or > > should I say, explicitly disabled)? > > See above. I think that should answer these questions. Ok. > >> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h > >> > index 4e61a9e..921a71f 100644 > >> > --- a/include/uapi/linux/audit.h > >> > +++ b/include/uapi/linux/audit.h > >> > @@ -71,6 +71,7 @@ > >> > #define AUDIT_TTY_SET 1017 /* Set TTY auditing status */ > >> > #define AUDIT_SET_FEATURE 1018 /* Turn an audit feature on or off */ > >> > #define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */ > >> > +#define AUDIT_CONTAINER 1020 /* Define the container id and information */ > >> > > >> > #define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */ > >> > #define AUDIT_USER_AVC 1107 /* We filter this differently */ > >> > @@ -465,6 +466,7 @@ struct audit_tty_status { > >> > }; > >> > > >> > #define AUDIT_UID_UNSET (unsigned int)-1 > >> > +#define AUDIT_CID_UNSET ((u64)-1) > >> > >> I think we need to decide if we want to distinguish between the "host" > >> (e.g. init ns) and "unset". Looking at this patch (I've only quickly > >> skimmed the others so far) it would appear that you don't think we > >> need to worry about this distinction; that's fine, but let's make it > >> explicit with a comment in the code that AUDIT_CID_UNSET means "unset" > >> as well as "host". > > > > I don't see any reason to distinguish between "host" and "unset". Since > > a container doesn't have a concrete definition based in namespaces, the > > initial namespace set is meaningless here. > > Okay, that sounds reasonable. > > > Is there value in having a container orchestrator process have a > > reserved container ID that has a policy distinct from any other > > container? > > I'm open to arguments for this idea, but I don't see a point to it right now. > > > If so, then I could see the value in making the distinction. > > For example, I've heard of interest in systemd acting as a container > > orchestrator, so if it took on that role as PID 1, then every process in > > the system would inherit that ID and none would be unset. > > > > I can't picture how having seperate "host" and "unset" values helps us. > > I don't have a strong feeling either way, I just wanted to ask the question. > > >> > /* audit_rule_data supports filter rules with both integer and string > >> > * fields. It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and > >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > >> > index 4e0a4ac..29c8482 100644 > >> > --- a/kernel/auditsc.c > >> > +++ b/kernel/auditsc.c > >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid) > >> > return rc; > >> > } > >> > > >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid) > >> > +{ > >> > + struct task_struct *parent; > >> > + u64 pcontainerid, ccontainerid; > >> > + > >> > + /* Don't allow to set our own containerid */ > >> > + if (current == task) > >> > + return -EPERM; > >> > >> Why not? Is there some obvious security concern that I missing? > > > > We then lose the distinction in the AUDIT_CONTAINER record between the > > initiating PID and the target PID. This was outlined in the proposal. > > I just went back and reread the v3 proposal and I still don't see a > good explanation of this. Why is this bad? What's the security > concern? I don't remember, specifically. Maybe this has been addressed by the check for children/threads or identical parent container ID. So, I'm reluctantly willing to remove that check for now. > > Having said that, I'm still not sure we have protected sufficiently from > > a child turning around and setting it's parent's as yet unset or > > inherited audit container ID. > > Yes, I believe we only want to let a task set the audit container for > it's children (or itself/threads if we decide to allow that, see > above). There *has* to be a function to check to see if a task if a > child of a given task ... right? ... although this is likely to be a > pointer traversal and locking nightmare ... hmmm. Isn't that just (struct task_struct)parent == (struct task_struct)child->parent (or ->real_parent)? And now that I say that, it is covered by the following patch's child check, so as long as we keep that, we should be fine. > >> I ask because I suppose it might be possible for some container > >> runtime to do a fork, setup some of the environment and them exec the > >> container (before you answer the obvious "namespaces!" please remember > >> we're not trying to define containers). > > > > I don't think namespaces have any bearing on this concern since none are > > required. > > > >> > + /* Don't allow the containerid to be unset */ > >> > + if (!cid_valid(containerid)) > >> > + return -EINVAL; > >> > + /* if we don't have caps, reject */ > >> > + if (!capable(CAP_AUDIT_CONTROL)) > >> > + return -EPERM; > >> > + /* if containerid is unset, allow */ > >> > + if (!audit_containerid_set(task)) > >> > + return 0; > >> > + /* it is already set, and not inherited from the parent, reject */ > >> > + ccontainerid = audit_get_containerid(task); > >> > + rcu_read_lock(); > >> > + parent = rcu_dereference(task->real_parent); > >> > + rcu_read_unlock(); > >> > + task_lock(parent); > >> > + pcontainerid = audit_get_containerid(parent); > >> > + task_unlock(parent); > >> > + if (ccontainerid != pcontainerid) > >> > + return -EPERM; > >> > + return 0; > >> > +} > >> > + > >> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid, > >> > + u64 containerid, int rc) > >> > +{ > >> > + struct audit_buffer *ab; > >> > + uid_t uid; > >> > + struct tty_struct *tty; > >> > + > >> > + if (!audit_enabled) > >> > + return; > >> > + > >> > + ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER); > >> > + if (!ab) > >> > + return; > >> > + > >> > + uid = from_kuid(&init_user_ns, task_uid(current)); > >> > + tty = audit_get_tty(current); > >> > + > >> > + audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid); > >> > + audit_log_task_context(ab); > >> > + audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d", > >> > + from_kuid(&init_user_ns, audit_get_loginuid(current)), > >> > + tty ? tty_name(tty) : "(none)", audit_get_sessionid(current), > >> > + task_tgid_nr(task), oldcontainerid, containerid, !rc); > >> > + > >> > + audit_put_tty(tty); > >> > + audit_log_end(ab); > >> > +} > >> > + > >> > +/** > >> > + * audit_set_containerid - set current task's audit_context containerid > >> > + * @containerid: containerid value > >> > + * > >> > + * Returns 0 on success, -EPERM on permission failure. > >> > + * > >> > + * Called (set) from fs/proc/base.c::proc_containerid_write(). > >> > + */ > >> > +int audit_set_containerid(struct task_struct *task, u64 containerid) > >> > +{ > >> > + u64 oldcontainerid; > >> > + int rc; > >> > + > >> > + oldcontainerid = audit_get_containerid(task); > >> > + > >> > + rc = audit_set_containerid_perm(task, containerid); > >> > + if (!rc) { > >> > + task_lock(task); > >> > + task->containerid = containerid; > >> > + task_unlock(task); > >> > + } > >> > + > >> > + audit_log_set_containerid(task, oldcontainerid, containerid, rc); > >> > + return rc; > >> > >> Why are audit_set_containerid_perm() and audit_log_containerid() > >> separate functions? > > > > (I assume you mean audit_log_set_containerid()?) > > Yep. My fingers got tired typing in that function name and decided a > shortcut was necessary. > > > It seemed clearer that all the permission checking was in one function > > and its return code could be used to report the outcome when logging the > > (attempted) action. This is the same structure as audit_set_loginuid() > > and it made sense. > > When possible I really like it when the permission checks are in the > same function as the code which does the work; it's less likely to get > abused that way (you have to willfully bypass the access checks). The > exceptions might be if you wanted to reuse the access control code, or > insert a modular access mechanism (e.g. LSMs). I don't follow how it could be abused. The return code from the perm check gates setting the value and is used in the success field in the log. > I'm less concerned about audit_log_set_containerid(), but the usual > idea of avoiding single-use function within the same scope applies > here. > > > This would be the time to connect it to a syscall if that seems like a > > good idea and remove pid, uid, auid, tty, ses fields. > > Ah yes, I missed that. You know my stance on connecting records by > now (hint: yes, connect them) so I think that would be a good thing to > do for the next round. Ok... > paul moore - RGB -- Richard Guy Briggs Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635