Message-ID: <1517609946.13097.161.camel@redhat.com>
Subject: Re: RFC(V3): Audit Kernel Container IDs
From: Simo Sorce
To: Paul Moore, Richard Guy Briggs
Cc: David Howells, cgroups@vger.kernel.org, jlayton@redhat.com, trondmy@primarydata.com, "Serge E. Hallyn", mszeredi@redhat.com, Al Viro, Andy Lutomirski, Eric Paris, Carlos O'Donell, Linux API, Linux Containers, Linux Kernel, Linux Audit, "Eric W.
 Biederman", Linux Network Development, Linux FS Devel
Date: Fri, 02 Feb 2018 17:19:06 -0500
References: <20180109121620.wi7dq2423ugsraqv@madcap2.tricolour.ca> <1515514736.3239.10.camel@redhat.com> <20180110070011.l4rcdcwb27witfem@madcap2.tricolour.ca>
Organization: Red Hat, Inc.
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2018-02-02 at 16:24 -0500, Paul Moore wrote:
> On Wed, Jan 10, 2018 at 2:00 AM, Richard Guy Briggs wrote:
> > On 2018-01-09 11:18, Simo Sorce wrote:
> > > On Tue, 2018-01-09 at 07:16 -0500, Richard Guy Briggs wrote:
> > > > Containers are a userspace concept. The kernel knows nothing of them.
> > > >
> > > > The Linux audit system needs a way to be able to track the container
> > > > provenance of events and actions. Audit needs the kernel's help to do
> > > > this.
> > > >
> > > > Since the concept of a container is entirely a userspace concept, a
> > > > registration from the userspace container orchestration system initiates
> > > > this. This will define a point in time and a set of resources
> > > > associated with a particular container with an audit container
> > > > identifier.
> > > >
> > > > The registration is a u64 representing the audit container identifier
> > > > written to a special file in a pseudo filesystem (proc, since PID tree
> > > > already exists) representing a process that will become a parent process
> > > > in that container.
> > > > This write might place restrictions on mount
> > > > namespaces required to define a container, or at least careful checking
> > > > of namespaces in the kernel to verify permissions of the orchestrator so
> > > > it can't change its own container ID. A bind mount of nsfs may be
> > > > necessary in the container orchestrator's mount namespace. This write
> > > > can only happen once per process.
> > > >
> > > > Note: The justification for using a u64 is that it minimizes the
> > > > information printed in every audit record, reducing bandwidth, and limits
> > > > comparisons to a single u64 which will be faster and less error-prone.
> > > >
> > > > Require CAP_AUDIT_CONTROL to be able to carry out the registration. At
> > > > that time, record the target container's user-supplied audit container
> > > > identifier along with a target container's parent process (which may
> > > > become the target container's "init" process) process ID (referenced
> > > > from the initial PID namespace) in a new record AUDIT_CONTAINER with a
> > > > qualifying op=$action field.
> > > >
> > > > Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
> > > > container ID present on an auditable action or event.
> > > >
> > > > Forked and cloned processes inherit their parent's audit container
> > > > identifier, referenced in the process' task_struct. Since the audit
> > > > container identifier is inherited rather than written, it can still be
> > > > written once. This will prevent tampering while allowing nesting.
> > > > (This can be implemented with an internal settable flag upon
> > > > registration that does not get copied across a fork/clone.)
> > > >
> > > > Mimic setns(2) and return an error if the process has already initiated
> > > > threading or forked since this registration should happen before the
> > > > process execution is started by the orchestrator and hence should not
> > > > yet have any threads or children.
> > > > If this is deemed overly restrictive,
> > > > switch all of the target's threads and children to the new containerID.
> > > >
> > > > Trust the orchestrator to judiciously use and restrict CAP_AUDIT_CONTROL.
> > > >
> > > > When a container ceases to exist because the last process in that
> > > > container has exited, log the fact to balance the registration action.
> > > > (This is likely needed for certification accountability.)
> > > >
> > > > At this point it appears unnecessary to add a container session
> > > > identifier since this is all tracked from loginuid and sessionid to
> > > > communicate with the container orchestrator to spawn an additional
> > > > session into an existing container which would be logged. It can be
> > > > added at a later date without breaking API should it be deemed
> > > > necessary.
> > > >
> > > > The following namespace logging actions are not needed for certification
> > > > purposes at this point, but are helpful for tracking namespace activity.
> > > > These are auxiliary records that are associated with namespace
> > > > manipulation syscalls unshare(2), clone(2) and setns(2), so the records
> > > > will only show up if explicit syscall rules have been added to document
> > > > this activity.
> > > >
> > > > Log the creation of every namespace, inheriting/adding its spawning
> > > > process' audit container identifier(s), if applicable. Include the
> > > > spawning and spawned namespace IDs (device and inode number tuples).
> > > > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> > > > Note: At this point it appears only network namespaces may need to track
> > > > container IDs apart from processes since incoming packets may cause an
> > > > auditable event before being associated with a process. Since a
> > > > namespace can be shared by processes in different containers, the
> > > > namespace will need to track all containers to which it has been
> > > > assigned.
> > > >
> > > > Upon registration, the target process' namespace IDs (in the form of a
> > > > nsfs device number and inode number tuple) will be recorded in an
> > > > AUDIT_NS_INFO auxiliary record.
> > > >
> > > > Log the destruction of every namespace that is no longer used by any
> > > > process, including the namespace IDs (device and inode number tuples).
> > > > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
> > > >
> > > > Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> > > > the parent and child namespace IDs for any changes to a process'
> > > > namespaces. [setns(2)]
> > > > Note: It may be possible to combine AUDIT_NS_* record formats and
> > > > distinguish them with an op=$action field depending on the fields
> > > > required for each message type.
> > > >
> > > > The audit container identifier will need to be reaped from all
> > > > implicated namespaces upon the destruction of a container.
> > > >
> > > > This namespace information adds supporting information for tracking
> > > > events not attributable to specific processes.
> > > >
> > > > Changelog:
> > > >
> > > > (Upstream V3)
> > > > - switch back to u64 (from pmoore, can be expanded to u128 in future if
> > > >   need arises without breaking API. u32 was originally proposed, up to
> > > >   c36 discussed)
> > > > - write-once, but children inherit audit container identifier and can
> > > >   then still be written once
> > > > - switch to CAP_AUDIT_CONTROL
> > > > - group namespace actions together, auxiliary records to namespace
> > > >   operations.
> > > >
> > > > (Upstream V2)
> > > > - switch from u64 to u128 UUID
> > > > - switch from "signal" and "trigger" to "register"
> > > > - restrict registration to single process or force all threads and
> > > >   children into same container
> > > >
> > > I am trying to understand the back and forth on the ID size.
> > >
> > > From an orchestrator POV anything that requires tracking a node
> > > specific ID is not ideal.
> > >
> > > Orchestrators tend to span many nodes, and containers tend to have IDs
> > > that are either UUID or have a Hash (like SHA256) as identifier.
> > >
> > > The problem here is two-fold:
> > >
> > > a) Your auditing requires some mapping to be useful outside of the
> > > system.
> > > If you aggregate audit logs outside of the system or you want to
> > > correlate the system audit logs with other components dealing with
> > > containers, now you need a place where you provide a mapping from your
> > > audit u64 to the ID a container has in the rest of the system.
> > >
> > > b) Now you need a mapping of some sort. The simplest way a container
> > > orchestrator can go about this is to just use the UUID or Hash
> > > representing their view of the container, truncate it to a u64 and use
> > > that for Audit. This means there are some chances there will be a
> > > collision and a duplicate u64 ID will be used by the orchestrator as
> > > the container ID. What happens in that case?
> >
> > Paul, can you justify this somewhat larger inconvenience for some
> > relatively minor convenience on our part?
>
> Done in direct response to Simo.

Sorry, but your response sounds more like waving the concerns away than
addressing them, the excuse being: we can't please everyone, so we are
going to please no one.

> But to be clear Richard, we've talked about this a few times, it's not
> a "minor convenience" on our part, it's a pretty big convenience once
> we start having to route audit events and make decisions based on
> the audit container ID information. Audit performance is less than
> awesome now, I'm working hard to not make it worse.

Sounds like a security vs performance trade-off to me.

> > u64 vs u128 is easy for us to
> > accommodate in terms of scalar comparisons. It doubles the information
> > in every container id field we print in audit records.
>
> ... and slows down audit container ID checks.

Are you saying a cmp on a u128 is slower than a comparison on a u64 and
that this is something that will be noticeable?

> > A c36 is a bigger step.
>
> Yeah, we're not doing that, no way.

Ok, I can see your point though I do not agree with it.

I can see why you do not want to have arbitrary length strings, but a
u128 sounded like a reasonable compromise to me, as it has enough room
to hold unique cluster-wide IDs, which a u64 definitely makes a lot
harder to provide w/o tight coordination.

Simo.

-- 
Simo Sorce
Sr. Principal Software Engineer
Red Hat, Inc
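
To put rough numbers on the truncation concern in point (b), here is a
minimal sketch. It assumes the orchestrator derives the audit ID by
hashing a container name with SHA-256 and keeping the first 8 bytes, and
that the resulting IDs are uniformly distributed. Both are assumptions
made for illustration; the function names and the container name are
hypothetical and nothing here is part of the proposal.

```python
import hashlib
import math


def truncate_to_u64(container_name: str) -> int:
    """Hash a (hypothetical) container name and keep the first 8 bytes."""
    digest = hashlib.sha256(container_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def collision_probability(num_ids: int, bits: int) -> float:
    """Birthday approximation: P ~= 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes IDs are drawn uniformly at random from a space of 2^bits.
    """
    exponent = num_ids * (num_ids - 1) / 2.0 ** (bits + 1)
    # expm1 keeps precision when the exponent is tiny (the 128-bit case).
    return -math.expm1(-exponent)


if __name__ == "__main__":
    print(hex(truncate_to_u64("example-container")))
    for n in (10**6, 10**9):
        print(n, collision_probability(n, 64), collision_probability(n, 128))
```

Under these assumptions the 64-bit collision probability stays tiny for
millions of IDs and reaches a few percent only around a billion, while
the 128-bit figure remains negligible throughout. The harder part with a
u64 is arguably not the raw probability but that a collision, however
rare, has no defined handling without cluster-wide coordination.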