2019-12-31 19:51:40

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 00/16] audit: implement container identifier

Implement kernel audit container identifier.

This is the eighth version of the patchset, based on the proposal document (V4) posted at:
https://www.redhat.com/archives/linux-audit/2019-September/msg00052.html

The first patch was the last patch from ghak81 that was absorbed into
this patchset since its primary justification is the rest of this
patchset.

The second patch implements the proc fs write to set the audit container
identifier of a process, emitting an AUDIT_CONTAINER_OP record to
announce the registration of that audit container identifier on that
process. This patch requires userspace support for record acceptance
and proper type display.

The third implements reading the audit container identifier from the
proc filesystem for debugging. This patch wasn't planned for upstream
inclusion, but its inclusion is starting to look more likely.

The fourth converts over from a simple u64 to a list member that includes
owner information to check for descendancy, allow process injection into
a container and prevent id reuse by other orchestrators.

The fifth logs the drop of an audit container identifier once all tasks
using that audit container identifier have exited.

The 6th implements the auxiliary record AUDIT_CONTAINER_ID if an audit
container identifier is associated with an event. This patch requires
userspace support for proper type display.

The 7th adds audit daemon signalling provenance through audit_sig_info2.

The 8th creates a local audit context to be able to bind a standalone
record with a locally created auxiliary record.

The 9th patch adds audit container identifier records to the user
standalone records.

The 10th adds audit container identifier filtering to the exit,
exclude and user lists. This patch adds the AUDIT_CONTID field and
requires auditctl userspace support for the --contid option.

The 11th adds network namespace audit container identifier labelling
based on member tasks' audit container identifier labels which supports
standalone netfilter records that don't have a task context and lists
each container to which that net namespace belongs.

The 12th checks that the target is a descendant for nesting and
refactors to avoid a duplicate of the copied function.

The 13th adds tracking and reporting for container nesting.
This enables kernel filtering and userspace searches of nested audit
container identifiers.

The 14th checks and clamps the nesting depth of containers while the
15th checks and clamps the total number of audit container identifiers
sharing one network namespace. The combination of these two parameters
prevents the overflow of the contid field in CONTAINER_* records.

The 16th adds a mechanism to allow a process to be designated as a
container orchestrator/engine in non-init user namespaces.
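
To illustrate why the two limits from the 14th and 15th patches bound the
contid field, here is a rough userspace sketch of the worst-case field
length. It is not taken from the patchset: the helper name is hypothetical,
and it assumes each listed identifier costs at most 20 decimal digits plus
one separator character and that a record lists at most depth * netns_count
identifiers. The real field format and limit values are defined by the
patches themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Longest decimal rendering of a u64: 18446744073709551615 (20 digits). */
#define U64_DEC_DIGITS 20

/*
 * Hypothetical helper, not from the patchset: worst-case length in bytes
 * of a contid field value, ASSUMING every listed identifier costs at most
 * 20 digits plus one separator and that a record can list up to
 * depth * netns_count identifiers.
 */
static size_t contid_field_worst_len(size_t depth, size_t netns_count)
{
	size_t ids = depth * netns_count;

	assert(ids > 0);
	/* n identifiers need n - 1 separators between them. */
	return ids * U64_DEC_DIGITS + (ids - 1);
}
```

Under these assumptions the field length grows with the product of the two
limits, which is why clamping only one of them would not be enough.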


Example: Set an audit container identifier of 123456 on the "sleep" task:

sleep 2&
child=$!
echo 123456 > /proc/$child/audit_containerid; echo $?
ausearch -ts recent -m container_op
echo child:$child contid:$( cat /proc/$child/audit_containerid)

This should produce a record such as:

type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615
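
A tool consuming this record needs to tell "contid=" apart from
"old-contid=", since the former is a substring of the latter. The following
is a minimal userspace sketch, assuming the interpreted record layout shown
above with space-separated key=value fields; record_get_u64() is a
hypothetical helper and not part of the audit userspace library.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: find " <key>=" in an audit record line and parse
 * the decimal u64 that follows. The leading space in the pattern keeps
 * "contid" from matching inside "old-contid".
 * Returns 0 on success, -1 if the field is missing or malformed.
 */
static int record_get_u64(const char *rec, const char *key, uint64_t *out)
{
	char pat[64];
	const char *p;
	char *end;
	unsigned long long v;
	int n;

	assert(rec != NULL && key != NULL && out != NULL);
	n = snprintf(pat, sizeof(pat), " %s=", key);
	if (n < 0 || (size_t)n >= sizeof(pat))
		return -1;
	p = strstr(rec, pat);
	if (!p)
		return -1;
	p += n;
	if (*p < '0' || *p > '9')
		return -1;	/* reject signs, spaces and empty values */
	errno = 0;
	v = strtoull(p, &end, 10);
	if (errno == ERANGE)
		return -1;	/* does not fit in 64 bits */
	*out = (uint64_t)v;
	return 0;
}
```

Called with the keys "contid" and "old-contid" on the record above, this
yields the new value and the unset value respectively.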


Example: Set a filter on an audit container identifier 123459 on /tmp/tmpcontainerid:

contid=123459
key=tmpcontainerid
auditctl -a exit,always -F dir=/tmp -F perm=wa -F contid=$contid -F key=$key
perl -e "sleep 1; open(my \$tmpfile, '>', \"/tmp/$key\"); close(\$tmpfile);" &
child=$!
echo $contid > /proc/$child/audit_containerid
sleep 2
ausearch -i -ts recent -k $key
auditctl -d exit,always -F dir=/tmp -F perm=wa -F contid=$contid -F key=$key
rm -f /tmp/$key

This should produce an event such as:

type=CONTAINER_ID msg=audit(2018-06-06 12:46:31.707:26953) : contid=123459
type=PROCTITLE msg=audit(2018-06-06 12:46:31.707:26953) : proctitle=perl -e sleep 1; open(my $tmpfile, '>', "/tmp/tmpcontainerid"); close($tmpfile);
type=PATH msg=audit(2018-06-06 12:46:31.707:26953) : item=1 name=/tmp/tmpcontainerid inode=25656 dev=00:26 mode=file,644 ouid=root ogid=root rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=PATH msg=audit(2018-06-06 12:46:31.707:26953) : item=0 name=/tmp/ inode=8985 dev=00:26 mode=dir,sticky,777 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=CWD msg=audit(2018-06-06 12:46:31.707:26953) : cwd=/root
type=SYSCALL msg=audit(2018-06-06 12:46:31.707:26953) : arch=x86_64 syscall=openat success=yes exit=3 a0=0xffffffffffffff9c a1=0x5621f2b81900 a2=O_WRONLY|O_CREAT|O_TRUNC a3=0x1b6 items=2 ppid=628 pid=2232 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=ttyS0 ses=1 comm=perl exe=/usr/bin/perl subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=tmpcontainerid

Example: Test multiple containers on one netns:

sleep 5 &
child1=$!
containerid1=123451
echo $containerid1 > /proc/$child1/audit_containerid
sleep 5 &
child2=$!
containerid2=123452
echo $containerid2 > /proc/$child2/audit_containerid
iptables -I INPUT -i lo -p icmp --icmp-type echo-request -j AUDIT --type accept
iptables -I INPUT -t mangle -i lo -p icmp --icmp-type echo-request -j MARK --set-mark 0x12345555
sleep 1;
bash -c "ping -q -c 1 127.0.0.1 >/dev/null 2>&1"
sleep 1;
ausearch -i -m NETFILTER_PKT -ts boot|grep mark=0x12345555
ausearch -i -m NETFILTER_PKT -ts boot|grep contid=|grep $containerid1|grep $containerid2

This should produce an event such as:

type=NETFILTER_PKT msg=audit(03/15/2019 14:16:13.369:244) : mark=0x12345555 saddr=127.0.0.1 daddr=127.0.0.1 proto=icmp
type=CONTAINER_ID msg=audit(03/15/2019 14:16:13.369:244) : contid=123452,123451
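
The grep pipeline above matches substrings, so a search for 12345 would also
hit 123451. A stricter check can be sketched in userspace as below, assuming
the contid field value is a plain comma-separated list of decimal
identifiers as in this example; contid_list_contains() is a hypothetical
helper, not an existing API.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return 1 if the comma-separated decimal list in
 * csv contains exactly the identifier want, 0 otherwise. Parsing stops at
 * the first character that is neither a digit nor a separating comma.
 */
static int contid_list_contains(const char *csv, uint64_t want)
{
	const char *p = csv;

	assert(csv != NULL);
	while (*p >= '0' && *p <= '9') {
		char *end;
		unsigned long long v;

		errno = 0;
		v = strtoull(p, &end, 10);
		if (errno == ERANGE)
			return 0;	/* entry does not fit in 64 bits */
		if (v == want)
			return 1;
		if (*end != ',')
			break;		/* end of the list */
		p = end + 1;
	}
	return 0;
}
```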


Includes the last patch of https://github.com/linux-audit/audit-kernel/issues/81
Please see the github audit kernel issue for the main feature:
https://github.com/linux-audit/audit-kernel/issues/90
and the kernel filter code:
https://github.com/linux-audit/audit-kernel/issues/91
and the network support:
https://github.com/linux-audit/audit-kernel/issues/92
Please see the github audit userspace issue for supporting record types:
https://github.com/linux-audit/audit-userspace/issues/51
and filter code:
https://github.com/linux-audit/audit-userspace/issues/40
Please see the github audit testsuite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
https://github.com/rgbriggs/audit-testsuite/tree/ghat64-contid
https://github.com/linux-audit/audit-testsuite/pull/91
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

The code is also posted at:
git://toccata2.tricolour.ca/linux-2.6-rgb.git ghak90-audit-containerID.v8

Changelog:
v8
- rebase on v5.5-rc1 audit/next
- remove subject attrs in CONTAINER_OP record
- group audit_contid_list_lock with audit_contid_hash
- in audit_{set,log}_contid(), break out of loop after finding target
- use target var to size kmalloc
- rework audit_cont_owner() to bool audit_contid_isowner() and move to where used
- create static void audit_cont_hold(struct audit_contobj *cont) { refcount_inc(&cont->refcount); }
- rename audit_cont{,_*} refs to audit_contobj{,_*}
- prefix special local functions with _ [audit_contobj*()]
- protect contid list traversals with rcu_read_lock() and updates with audit_contid_list_lock
- protect real_parent in audit_contid_depth() with rcu_dereference
- give new contid field nesting format in patch description
- squash task_is_descendant()
- squash support for NETFILTER_PKT into network namespaces
- limit nesting depth based on record length overflow, bandwidth and storage
- implement control for audit container identifier nesting depth limit
- make room for audit_bpf patches (bump CONTAINER_ID to 1335)
- squash proc interface into capcontid
- remove netlink access to loginuid/sessionid/contid/capcontid
- delete 32k contid limit patch
- document potential overlap between signal delivery and contid reuse
- document audit_contobj_list_lock coverage
- document disappearing orch task injection limitation
- limit the number of containers that can be associated with a network namespace
- implement control for audit container identifier netns count limit

v7
- remove BUG() in audit_comparator64()
- rebase on v5.2-rc1 audit/next
- resolve merge conflict with ghak111 (signal_info regardless syscall)
- resolve merge conflict with ghak73 (audit_field_valid)
- resolve merge conflict with ghak64 (saddr_fam filter)
- resolve merge conflict with ghak10 (ntp audit) change AUDIT_CONTAINER_ID from 1332 to 1334
- rebase on v5.3-rc1 audit/next
- track container owner
- only permit setting contid of descendants for nesting
- track drop of contid and permit reuse
- track and report container nesting
- permit filtering on any nested contid
- set/get contid and loginuid/sessionid via netlink
- implement capcontid to enable orchestrators in non-init user
namespaces
- limit number of containers
- limit depth of container nesting

v6
- change TMPBUFLEN from 11 to 21 to cover the decimal value of contid
u64 (nhorman)
- fix bug overwriting ctx in struct audit_sig_info, move cid above
ctx[0] (nhorman)
- fix bug skipping remaining fields and not advancing bufp when copying
out contid in audit_krule_to_data (omosnacec)
- add acks, tidy commit descriptions, other formatting fixes (checkpatch
wrong on audit_log_lost)
- cast ull for u64 prints
- target_cid tracking was moved from the ptrace/signal patch to
container_op
- target ptrace and signal records were moved from the ptrace/signal
patch to container_id
- auditd signaller tracking was moved to a new AUDIT_SIGNAL_INFO2
request and record
- ditch unnecessary list_empty() checks
- check for null net and aunet in audit_netns_contid_add()
- swap CONTAINER_OP contid/old-contid order to ease parsing

v5
- address loginuid and sessionid syscall scope in ghak104
- address audit_context in CONFIG_AUDIT vs CONFIG_AUDITSYSCALL in ghak105
- remove tty patch, addressed in ghak106
- rebase on audit/next v5.0-rc1
w/ghak59/ghak104/ghak103/ghak100/ghak107/ghak105/ghak106/ghak105sup
- update CONTAINER_ID to CONTAINER_OP in patch description
- move audit_context in audit_task_info to CONFIG_AUDITSYSCALL
- move audit_alloc() and audit_free() out of CONFIG_AUDITSYSCALL and into
CONFIG_AUDIT and create audit_{alloc,free}_syscall
- use plain kmem_cache_alloc() rather than kmem_cache_zalloc() in audit_alloc()
- fix audit_get_contid() declaration type error
- move audit_set_contid() from auditsc.c to audit.c
- audit_log_contid() returns void
- audit_log_contid() handed contid rather than tsk
- switch from AUDIT_CONTAINER to AUDIT_CONTAINER_ID for aux record
- move audit_log_contid(tsk/contid) & audit_contid_set(tsk)/audit_contid_valid(contid)
- switch from tsk to current
- audit_alloc_local() calls audit_log_lost() on failure to allocate a context
- add AUDIT_USER* non-syscall contid record
- cosmetic cleanup double parens, goto out on err
- ditch audit_get_ns_contid_list_lock(), fix aunet lock race
- switch from all-cpu read spinlock to rcu, keep spinlock for write
- update audit_alloc_local() to use ktime_get_coarse_real_ts64()
- add nft_log support
- add call from do_exit() in audit_free() to remove contid from netns
- relegate AUDIT_CONTAINER ref= field (was op=) to debug patch

v4
- preface set with ghak81:"collect audit task parameters"
- add shallyn and sgrubb acks
- rename feature bitmap macro
- rename cid_valid() to audit_contid_valid()
- rename AUDIT_CONTAINER_ID to AUDIT_CONTAINER_OP
- delete audit_get_contid_list() from headers
- move work into inner if, delete "found"
- change netns contid list function names
- move exports for audit_log_contid audit_alloc_local audit_free_context to non-syscall patch
- list contids CSV
- pass in gfp flags to audit_alloc_local() (fix audit_alloc_context callers)
- use "local" in lieu of abusing in_syscall for auditsc_get_stamp()
- read_lock(&tasklist_lock) around children and thread check
- task_lock(tsk) should be taken before first check of tsk->audit
- add spin lock to contid list in aunet
- restrict /proc read to CAP_AUDIT_CONTROL
- remove set again prohibition and inherited flag
- delete contidion spelling fix from patchset, send to netdev/linux-wireless

v3
- switched from containerid in task_struct to audit_task_info (depends on ghak81)
- drop INVALID_CID in favour of only AUDIT_CID_UNSET
- check for !audit_task_info, throw -ENOPROTOOPT on set
- changed -EPERM to -EEXIST for parent check
- return AUDIT_CID_UNSET if !audit_enabled
- squash child/thread check patch into AUDIT_CONTAINER_ID patch
- changed -EPERM to -EBUSY for child check
- separate child and thread checks, use -EALREADY for latter
- move addition of op= from ptrace/signal patch to AUDIT_CONTAINER patch
- fix && to || bashism in ptrace/signal patch
- uninline and export function for audit_free_context()
- drop CONFIG_CHANGE, FEATURE_CHANGE, ANOM_ABEND, ANOM_SECCOMP patches
- move audit_enabled check (xt_AUDIT)
- switched from containerid list in struct net to net_generic's struct audit_net
- move containerid list iteration into audit (xt_AUDIT)
- create function to move namespace switch into audit
- switched /proc/PID/ entry from containerid to audit_containerid
- call kzalloc with GFP_ATOMIC on in_atomic() in audit_alloc_context()
- call kzalloc with GFP_ATOMIC on in_atomic() in audit_log_container_info()
- use xt_net(par) instead of sock_net(skb->sk) to get net
- switched record and field names: initial CONTAINER_ID, aux CONTAINER, field CONTID
- allow to set own contid
- open code audit_set_containerid
- add contid inherited flag
- ccontainerid and pcontainerid eliminated due to inherited flag
- change name of container list functions
- rename containerid to contid
- convert initial container record to syscall aux
- fix spelling mistake of contidion in net/rfkill/core.c to avoid contid name collision

v2
- add check for children and threads
- add network namespace container identifier list
- add NETFILTER_PKT audit container identifier logging
- patch description and documentation clean-up and example
- reap unused ppid

Richard Guy Briggs (16):
audit: collect audit task parameters
audit: add container id
audit: read container ID of a process
audit: convert to contid list to check for orch/engine ownership
audit: log drop of contid on exit of last task
audit: log container info of syscalls
audit: add contid support for signalling the audit daemon
audit: add support for non-syscall auxiliary records
audit: add containerid support for user records
audit: add containerid filtering
audit: add support for containerid to network namespaces
audit: contid check descendancy and nesting
audit: track container nesting
audit: check contid depth and add limit config param
audit: check contid count per netns and add config param limit
audit: add capcontid to set contid outside init_user_ns

fs/proc/base.c | 112 +++++++-
include/linux/audit.h | 140 +++++++++-
include/linux/nsproxy.h | 2 +-
include/linux/sched.h | 10 +-
include/uapi/linux/audit.h | 14 +-
init/init_task.c | 3 +-
init/main.c | 2 +
kernel/audit.c | 626 +++++++++++++++++++++++++++++++++++++++++++-
kernel/audit.h | 29 ++
kernel/auditfilter.c | 61 +++++
kernel/auditsc.c | 91 +++++--
kernel/fork.c | 11 +-
kernel/nsproxy.c | 27 +-
kernel/sched/core.c | 33 +++
net/netfilter/nft_log.c | 11 +-
net/netfilter/xt_AUDIT.c | 11 +-
security/selinux/nlmsgtab.c | 1 +
security/yama/yama_lsm.c | 33 ---
18 files changed, 1115 insertions(+), 102 deletions(-)

--
1.8.3.1


2019-12-31 19:51:41

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 01/16] audit: collect audit task parameters

The audit-related parameters in struct task_struct should ideally be
collected together and accessed through a standard audit API.

Collect the existing loginuid, sessionid and audit_context together in a
new struct audit_task_info called "audit" in struct task_struct.

Use kmem_cache to manage this pool of memory.
Un-inline audit_free() to be able to always recover that memory.

Please see the upstream github issue
https://github.com/linux-audit/audit-kernel/issues/81

Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Neil Horman <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
---
include/linux/audit.h | 49 +++++++++++++++++++++++------------
include/linux/sched.h | 7 +----
init/init_task.c | 3 +--
init/main.c | 2 ++
kernel/audit.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++--
kernel/audit.h | 5 ++++
kernel/auditsc.c | 26 ++++++++++---------
kernel/fork.c | 1 -
8 files changed, 124 insertions(+), 40 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index f9ceae57ca8d..96deb28942e3 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -94,6 +94,16 @@ struct audit_ntp_data {
struct audit_ntp_data {};
#endif

+struct audit_task_info {
+ kuid_t loginuid;
+ unsigned int sessionid;
+#ifdef CONFIG_AUDITSYSCALL
+ struct audit_context *ctx;
+#endif
+};
+
+extern struct audit_task_info init_struct_audit;
+
extern int is_audit_feature_set(int which);

extern int __init audit_register_class(int class, unsigned *list);
@@ -130,6 +140,9 @@ struct audit_ntp_data {
#ifdef CONFIG_AUDIT
/* These are defined in audit.c */
/* Public API */
+extern int audit_alloc(struct task_struct *task);
+extern void audit_free(struct task_struct *task);
+extern void __init audit_task_init(void);
extern __printf(4, 5)
void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type,
const char *fmt, ...);
@@ -173,12 +186,16 @@ extern void audit_log_path_denied(int type,

static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
{
- return tsk->loginuid;
+ if (!tsk->audit)
+ return INVALID_UID;
+ return tsk->audit->loginuid;
}

static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
{
- return tsk->sessionid;
+ if (!tsk->audit)
+ return AUDIT_SID_UNSET;
+ return tsk->audit->sessionid;
}

extern u32 audit_enabled;
@@ -186,6 +203,14 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
extern int audit_signal_info(int sig, struct task_struct *t);

#else /* CONFIG_AUDIT */
+static inline int audit_alloc(struct task_struct *task)
+{
+ return 0;
+}
+static inline void audit_free(struct task_struct *task)
+{ }
+static inline void __init audit_task_init(void)
+{ }
static inline __printf(4, 5)
void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type,
const char *fmt, ...)
@@ -261,8 +286,6 @@ static inline int audit_signal_info(int sig, struct task_struct *t)

/* These are defined in auditsc.c */
/* Public API */
-extern int audit_alloc(struct task_struct *task);
-extern void __audit_free(struct task_struct *task);
extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
unsigned long a2, unsigned long a3);
extern void __audit_syscall_exit(int ret_success, long ret_value);
@@ -282,12 +305,14 @@ extern void audit_seccomp_actions_logged(const char *names,

static inline void audit_set_context(struct task_struct *task, struct audit_context *ctx)
{
- task->audit_context = ctx;
+ task->audit->ctx = ctx;
}

static inline struct audit_context *audit_context(void)
{
- return current->audit_context;
+ if (!current->audit)
+ return NULL;
+ return current->audit->ctx;
}

static inline bool audit_dummy_context(void)
@@ -295,11 +320,7 @@ static inline bool audit_dummy_context(void)
void *p = audit_context();
return !p || *(int *)p;
}
-static inline void audit_free(struct task_struct *task)
-{
- if (unlikely(task->audit_context))
- __audit_free(task);
-}
+
static inline void audit_syscall_entry(int major, unsigned long a0,
unsigned long a1, unsigned long a2,
unsigned long a3)
@@ -517,12 +538,6 @@ static inline void audit_ntp_log(const struct audit_ntp_data *ad)
extern int audit_n_rules;
extern int audit_signals;
#else /* CONFIG_AUDITSYSCALL */
-static inline int audit_alloc(struct task_struct *task)
-{
- return 0;
-}
-static inline void audit_free(struct task_struct *task)
-{ }
static inline void audit_syscall_entry(int major, unsigned long a0,
unsigned long a1, unsigned long a2,
unsigned long a3)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 467d26046416..aebe24192b23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -33,7 +33,6 @@
#include <linux/rseq.h>

/* task_struct member predeclarations (sorted alphabetically): */
-struct audit_context;
struct backing_dev_info;
struct bio_list;
struct blk_plug;
@@ -930,11 +929,7 @@ struct task_struct {
struct callback_head *task_works;

#ifdef CONFIG_AUDIT
-#ifdef CONFIG_AUDITSYSCALL
- struct audit_context *audit_context;
-#endif
- kuid_t loginuid;
- unsigned int sessionid;
+ struct audit_task_info *audit;
#endif
struct seccomp seccomp;

diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..5204578d6e7e 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -122,8 +122,7 @@ struct task_struct init_task
.thread_group = LIST_HEAD_INIT(init_task.thread_group),
.thread_node = LIST_HEAD_INIT(init_signals.thread_head),
#ifdef CONFIG_AUDIT
- .loginuid = INVALID_UID,
- .sessionid = AUDIT_SID_UNSET,
+ .audit = &init_struct_audit,
#endif
#ifdef CONFIG_PERF_EVENTS
.perf_event_mutex = __MUTEX_INITIALIZER(init_task.perf_event_mutex),
diff --git a/init/main.c b/init/main.c
index 91f6ebb30ef0..f6687f40ba90 100644
--- a/init/main.c
+++ b/init/main.c
@@ -93,6 +93,7 @@
#include <linux/rodata_test.h>
#include <linux/jump_label.h>
#include <linux/mem_encrypt.h>
+#include <linux/audit.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -770,6 +771,7 @@ asmlinkage __visible void __init start_kernel(void)
nsfs_init();
cpuset_init();
cgroup_init();
+ audit_task_init();
taskstats_init_early();
delayacct_init();

diff --git a/kernel/audit.c b/kernel/audit.c
index 17b0d523afb3..397f8fb4836a 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -203,6 +203,73 @@ struct audit_reply {
struct sk_buff *skb;
};

+static struct kmem_cache *audit_task_cache;
+
+void __init audit_task_init(void)
+{
+ audit_task_cache = kmem_cache_create("audit_task",
+ sizeof(struct audit_task_info),
+ 0, SLAB_PANIC, NULL);
+}
+
+/**
+ * audit_alloc - allocate an audit info block for a task
+ * @tsk: task
+ *
+ * Call audit_alloc_syscall to filter on the task information and
+ * allocate a per-task audit context if necessary. This is called from
+ * copy_process, so no lock is needed.
+ */
+int audit_alloc(struct task_struct *tsk)
+{
+ int ret = 0;
+ struct audit_task_info *info;
+
+ info = kmem_cache_alloc(audit_task_cache, GFP_KERNEL);
+ if (!info) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ info->loginuid = audit_get_loginuid(current);
+ info->sessionid = audit_get_sessionid(current);
+ tsk->audit = info;
+
+ ret = audit_alloc_syscall(tsk);
+ if (ret) {
+ tsk->audit = NULL;
+ kmem_cache_free(audit_task_cache, info);
+ }
+out:
+ return ret;
+}
+
+struct audit_task_info init_struct_audit = {
+ .loginuid = INVALID_UID,
+ .sessionid = AUDIT_SID_UNSET,
+#ifdef CONFIG_AUDITSYSCALL
+ .ctx = NULL,
+#endif
+};
+
+/**
+ * audit_free - free per-task audit info
+ * @tsk: task whose audit info block to free
+ *
+ * Called from copy_process and do_exit
+ */
+void audit_free(struct task_struct *tsk)
+{
+ struct audit_task_info *info = tsk->audit;
+
+ audit_free_syscall(tsk);
+ /* Freeing the audit_task_info struct must be performed after
+ * audit_log_exit() due to need for loginuid and sessionid.
+ */
+ info = tsk->audit;
+ tsk->audit = NULL;
+ kmem_cache_free(audit_task_cache, info);
+}
+
/**
* auditd_test_task - Check to see if a given task is an audit daemon
* @task: the task to check
@@ -2255,8 +2322,8 @@ int audit_set_loginuid(kuid_t loginuid)
sessionid = (unsigned int)atomic_inc_return(&session_id);
}

- current->sessionid = sessionid;
- current->loginuid = loginuid;
+ current->audit->sessionid = sessionid;
+ current->audit->loginuid = loginuid;
out:
audit_log_set_loginuid(oldloginuid, loginuid, oldsessionid, sessionid, rc);
return rc;
diff --git a/kernel/audit.h b/kernel/audit.h
index 6fb7160412d4..7f623ef216e6 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -251,6 +251,8 @@ extern void audit_log_d_path_exe(struct audit_buffer *ab,
extern unsigned int audit_serial(void);
extern int auditsc_get_stamp(struct audit_context *ctx,
struct timespec64 *t, unsigned int *serial);
+extern int audit_alloc_syscall(struct task_struct *tsk);
+extern void audit_free_syscall(struct task_struct *tsk);

extern void audit_put_watch(struct audit_watch *watch);
extern void audit_get_watch(struct audit_watch *watch);
@@ -292,6 +294,9 @@ extern void audit_filter_inodes(struct task_struct *tsk,
extern struct list_head *audit_killed_trees(void);
#else /* CONFIG_AUDITSYSCALL */
#define auditsc_get_stamp(c, t, s) 0
+#define audit_alloc_syscall(t) 0
+#define audit_free_syscall(t) {}
+
#define audit_put_watch(w) {}
#define audit_get_watch(w) {}
#define audit_to_watch(k, p, l, o) (-EINVAL)
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 4effe01ebbe2..10679da36bb6 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -903,23 +903,25 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state)
return context;
}

-/**
- * audit_alloc - allocate an audit context block for a task
+/*
+ * audit_alloc_syscall - allocate an audit context block for a task
* @tsk: task
*
* Filter on the task information and allocate a per-task audit context
* if necessary. Doing so turns on system call auditing for the
- * specified task. This is called from copy_process, so no lock is
- * needed.
+ * specified task. This is called from copy_process via audit_alloc, so
+ * no lock is needed.
*/
-int audit_alloc(struct task_struct *tsk)
+int audit_alloc_syscall(struct task_struct *tsk)
{
struct audit_context *context;
enum audit_state state;
char *key = NULL;

- if (likely(!audit_ever_enabled))
+ if (likely(!audit_ever_enabled)) {
+ audit_set_context(tsk, NULL);
return 0; /* Return if not auditing. */
+ }

state = audit_filter_task(tsk, &key);
if (state == AUDIT_DISABLED) {
@@ -929,7 +931,7 @@ int audit_alloc(struct task_struct *tsk)

if (!(context = audit_alloc_context(state))) {
kfree(key);
- audit_log_lost("out of memory in audit_alloc");
+ audit_log_lost("out of memory in audit_alloc_syscall");
return -ENOMEM;
}
context->filterkey = key;
@@ -1574,14 +1576,15 @@ static void audit_log_exit(void)
}

/**
- * __audit_free - free a per-task audit context
+ * audit_free_syscall - free per-task audit context info
* @tsk: task whose audit context block to free
*
- * Called from copy_process and do_exit
+ * Called from audit_free
*/
-void __audit_free(struct task_struct *tsk)
+void audit_free_syscall(struct task_struct *tsk)
{
- struct audit_context *context = tsk->audit_context;
+ struct audit_task_info *info = tsk->audit;
+ struct audit_context *context = info->ctx;

if (!context)
return;
@@ -1604,7 +1607,6 @@ void __audit_free(struct task_struct *tsk)
if (context->current_state == AUDIT_RECORD_CONTEXT)
audit_log_exit();
}
-
audit_set_context(tsk, NULL);
audit_free_context(context);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 2508a4f238a3..edf034e5cbb4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1993,7 +1993,6 @@ static __latent_entropy struct task_struct *copy_process(
posix_cputimers_init(&p->posix_cputimers);

p->io_context = NULL;
- audit_set_context(p, NULL);
cgroup_fork(p);
#ifdef CONFIG_NUMA
p->mempolicy = mpol_dup(p->mempolicy);
--
1.8.3.1

2019-12-31 19:51:48

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 02/16] audit: add container id

Implement the proc fs write to set the audit container identifier of a
process, emitting an AUDIT_CONTAINER_OP record to document the event.

This is a write from the container orchestrator task to a proc entry of
the form /proc/PID/audit_containerid where PID is the process ID of the
newly created task that is to become the first task in a container, or
an additional task added to a container.

The write expects a decimal value that fits in a u64 (unset:
18446744073709551615, i.e. (u64)-1).

The writer must have capability CAP_AUDIT_CONTROL.
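
From the orchestrator's side, the write could look roughly like the sketch
below. This is an illustration only, assuming a kernel with this patch
applied and a caller holding CAP_AUDIT_CONTROL; contid_proc_path() and
set_audit_contid() are hypothetical userspace helpers, not an existing
library API. The value is sent as decimal text in a single write at offset
0, since the proc handler parses base 10 and rejects a non-zero file
position.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: build /proc/PID/audit_containerid into buf. */
static int contid_proc_path(char *buf, size_t len, pid_t pid)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "/proc/%ld/audit_containerid", (long)pid);
	return (n < 0 || (size_t)n >= len) ? -ENAMETOOLONG : 0;
}

/*
 * Hypothetical helper: write contid as decimal text to the proc entry of
 * pid. Returns 0 on success or a negative errno value.
 */
static int set_audit_contid(pid_t pid, uint64_t contid)
{
	char path[64], val[32];
	ssize_t written;
	size_t len;
	int fd, ret = 0;

	if (contid_proc_path(path, sizeof(path), pid))
		return -ENAMETOOLONG;
	snprintf(val, sizeof(val), "%" PRIu64, contid);
	len = strlen(val);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;
	/* One write at offset 0; the handler does not accept partial writes. */
	written = write(fd, val, len);
	if (written < 0)
		ret = -errno;
	else if ((size_t)written != len)
		ret = -EIO;
	close(fd);
	return ret;
}
```

On a kernel without this patch the proc entry does not exist, so the open
fails and the helper returns a negative errno value.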

This will produce a record such as this:
type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615

The "op" field indicates an initial set. The "opid" field is the
object's PID, the process being "contained". The new and old audit
container identifier values are given in the "contid" and "old-contid"
fields respectively.

It is not permitted to unset the audit container identifier.
A child inherits its parent's audit container identifier.

Please see the github audit kernel issue for the main feature:
https://github.com/linux-audit/audit-kernel/issues/90
Please see the github audit userspace issue for supporting additions:
https://github.com/linux-audit/audit-userspace/issues/51
Please see the github audit testsuite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Acked-by: Steve Grubb <[email protected]>
Acked-by: Neil Horman <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
---
fs/proc/base.c | 36 ++++++++++++++++++++++++++++
include/linux/audit.h | 25 ++++++++++++++++++++
include/uapi/linux/audit.h | 2 ++
kernel/audit.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/audit.h | 1 +
kernel/auditsc.c | 4 ++++
6 files changed, 126 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..e2e7c9f4702f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1307,6 +1307,40 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
.read = proc_sessionid_read,
.llseek = generic_file_llseek,
};
+
+static ssize_t proc_contid_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file_inode(file);
+ u64 contid;
+ int rv;
+ struct task_struct *task = get_proc_task(inode);
+
+ if (!task)
+ return -ESRCH;
+ if (*ppos != 0) {
+ /* No partial writes. */
+ put_task_struct(task);
+ return -EINVAL;
+ }
+
+ rv = kstrtou64_from_user(buf, count, 10, &contid);
+ if (rv < 0) {
+ put_task_struct(task);
+ return rv;
+ }
+
+ rv = audit_set_contid(task, contid);
+ put_task_struct(task);
+ if (rv < 0)
+ return rv;
+ return count;
+}
+
+static const struct file_operations proc_contid_operations = {
+ .write = proc_contid_write,
+ .llseek = generic_file_llseek,
+};
#endif

#ifdef CONFIG_FAULT_INJECTION
@@ -3067,6 +3101,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
#ifdef CONFIG_AUDIT
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
+ REG("audit_containerid", S_IWUSR, proc_contid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
@@ -3467,6 +3502,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
#ifdef CONFIG_AUDIT
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
+ REG("audit_containerid", S_IWUSR, proc_contid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 96deb28942e3..a045b34ecf44 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -97,6 +97,7 @@ struct audit_ntp_data {
struct audit_task_info {
kuid_t loginuid;
unsigned int sessionid;
+ u64 contid;
#ifdef CONFIG_AUDITSYSCALL
struct audit_context *ctx;
#endif
@@ -198,6 +199,15 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
return tsk->audit->sessionid;
}

+extern int audit_set_contid(struct task_struct *tsk, u64 contid);
+
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return AUDIT_CID_UNSET;
+ return tsk->audit->contid;
+}
+
extern u32 audit_enabled;

extern int audit_signal_info(int sig, struct task_struct *t);
@@ -262,6 +272,11 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
return AUDIT_SID_UNSET;
}

+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+ return AUDIT_CID_UNSET;
+}
+
#define audit_enabled AUDIT_OFF

static inline int audit_signal_info(int sig, struct task_struct *t)
@@ -670,6 +685,16 @@ static inline bool audit_loginuid_set(struct task_struct *tsk)
return uid_valid(audit_get_loginuid(tsk));
}

+static inline bool audit_contid_valid(u64 contid)
+{
+ return contid != AUDIT_CID_UNSET;
+}
+
+static inline bool audit_contid_set(struct task_struct *tsk)
+{
+ return audit_contid_valid(audit_get_contid(tsk));
+}
+
static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
{
audit_log_n_string(ab, buf, strlen(buf));
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 3ad935527177..866e1606c4ae 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -71,6 +71,7 @@
#define AUDIT_TTY_SET 1017 /* Set TTY auditing status */
#define AUDIT_SET_FEATURE 1018 /* Turn an audit feature on or off */
#define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */
+#define AUDIT_CONTAINER_OP 1020 /* Define the container id and info */

#define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */
#define AUDIT_USER_AVC 1107 /* We filter this differently */
@@ -489,6 +490,7 @@ struct audit_tty_status {

#define AUDIT_UID_UNSET (unsigned int)-1
#define AUDIT_SID_UNSET ((unsigned int)-1)
+#define AUDIT_CID_UNSET ((u64)-1)

/* audit_rule_data supports filter rules with both integer and string
* fields. It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
diff --git a/kernel/audit.c b/kernel/audit.c
index 397f8fb4836a..2d7707426b7d 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -232,6 +232,7 @@ int audit_alloc(struct task_struct *tsk)
}
info->loginuid = audit_get_loginuid(current);
info->sessionid = audit_get_sessionid(current);
+ info->contid = audit_get_contid(current);
tsk->audit = info;

ret = audit_alloc_syscall(tsk);
@@ -246,6 +247,7 @@ int audit_alloc(struct task_struct *tsk)
struct audit_task_info init_struct_audit = {
.loginuid = INVALID_UID,
.sessionid = AUDIT_SID_UNSET,
+ .contid = AUDIT_CID_UNSET,
#ifdef CONFIG_AUDITSYSCALL
.ctx = NULL,
#endif
@@ -2356,6 +2358,62 @@ int audit_signal_info(int sig, struct task_struct *t)
return audit_signal_info_syscall(t);
}

+/*
+ * audit_set_contid - set the target task's audit contid
+ * @task: target task
+ * @contid: contid value
+ *
+ * Returns 0 on success, -EPERM on permission failure.
+ *
+ * Called (set) from fs/proc/base.c::proc_contid_write().
+ */
+int audit_set_contid(struct task_struct *task, u64 contid)
+{
+ u64 oldcontid;
+ int rc = 0;
+ struct audit_buffer *ab;
+
+ task_lock(task);
+ /* Can't set if audit disabled */
+ if (!task->audit) {
+ task_unlock(task);
+ return -ENOPROTOOPT;
+ }
+ oldcontid = audit_get_contid(task);
+ read_lock(&tasklist_lock);
+ /* Don't allow the audit containerid to be unset */
+ if (!audit_contid_valid(contid))
+ rc = -EINVAL;
+ /* if we don't have caps, reject */
+ else if (!capable(CAP_AUDIT_CONTROL))
+ rc = -EPERM;
+ /* if task has children or is not single-threaded, deny */
+ else if (!list_empty(&task->children))
+ rc = -EBUSY;
+ else if (!(thread_group_leader(task) && thread_group_empty(task)))
+ rc = -EALREADY;
+ /* if contid is already set, deny */
+ else if (audit_contid_set(task))
+ rc = -ECHILD;
+ read_unlock(&tasklist_lock);
+ if (!rc)
+ task->audit->contid = contid;
+ task_unlock(task);
+
+ if (!audit_enabled)
+ return rc;
+
+ ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
+ if (!ab)
+ return rc;
+
+ audit_log_format(ab,
+ "op=set opid=%d contid=%llu old-contid=%llu",
+ task_tgid_nr(task), contid, oldcontid);
+ audit_log_end(ab);
+ return rc;
+}
+
/**
* audit_log_end - end one audit record
* @ab: the audit_buffer
diff --git a/kernel/audit.h b/kernel/audit.h
index 7f623ef216e6..16bd03b88e0d 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -135,6 +135,7 @@ struct audit_context {
kuid_t target_uid;
unsigned int target_sessionid;
u32 target_sid;
+ u64 target_cid;
char target_comm[TASK_COMM_LEN];

struct audit_tree_refs *trees, *first_trees;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 10679da36bb6..0e2d50533959 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -113,6 +113,7 @@ struct audit_aux_data_pids {
kuid_t target_uid[AUDIT_AUX_PIDS];
unsigned int target_sessionid[AUDIT_AUX_PIDS];
u32 target_sid[AUDIT_AUX_PIDS];
+ u64 target_cid[AUDIT_AUX_PIDS];
char target_comm[AUDIT_AUX_PIDS][TASK_COMM_LEN];
int pid_count;
};
@@ -2375,6 +2376,7 @@ void __audit_ptrace(struct task_struct *t)
context->target_uid = task_uid(t);
context->target_sessionid = audit_get_sessionid(t);
security_task_getsecid(t, &context->target_sid);
+ context->target_cid = audit_get_contid(t);
memcpy(context->target_comm, t->comm, TASK_COMM_LEN);
}

@@ -2402,6 +2404,7 @@ int audit_signal_info_syscall(struct task_struct *t)
ctx->target_uid = t_uid;
ctx->target_sessionid = audit_get_sessionid(t);
security_task_getsecid(t, &ctx->target_sid);
+ ctx->target_cid = audit_get_contid(t);
memcpy(ctx->target_comm, t->comm, TASK_COMM_LEN);
return 0;
}
@@ -2423,6 +2426,7 @@ int audit_signal_info_syscall(struct task_struct *t)
axp->target_uid[axp->pid_count] = t_uid;
axp->target_sessionid[axp->pid_count] = audit_get_sessionid(t);
security_task_getsecid(t, &axp->target_sid[axp->pid_count]);
+ axp->target_cid[axp->pid_count] = audit_get_contid(t);
memcpy(axp->target_comm[axp->pid_count], t->comm, TASK_COMM_LEN);
axp->pid_count++;

--
1.8.3.1

2019-12-31 19:52:19

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 03/16] audit: read container ID of a process

Add support for reading the audit container identifier from the proc
filesystem.

This is a read from the proc entry of the form
/proc/PID/audit_containerid where PID is the process ID of the task
whose audit container identifier is sought.

The read returns the identifier as a decimal u64 value (unset: 18446744073709551615).

This read requires CAP_AUDIT_CONTROL.
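
As an illustration only (not part of this patch), a userspace consumer
could parse the text read from the proc entry roughly as follows. The
helper names parse_contid() and contid_is_set() and the CONTID_UNSET
macro are hypothetical; only the decimal format and the all-ones unset
value come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors AUDIT_CID_UNSET from the patch: all bits set. */
#define CONTID_UNSET UINT64_MAX

/*
 * Hypothetical helper: parse the decimal text read from
 * /proc/PID/audit_containerid. Returns 0 on success, -EINVAL on
 * malformed or out-of-range input.
 */
static int parse_contid(const char *text, uint64_t *contid)
{
	char *end;
	unsigned long long val;

	if (!text || *text < '0' || *text > '9')
		return -EINVAL;	/* reject empty, signed, or padded input */
	errno = 0;
	val = strtoull(text, &end, 10);
	if (errno == ERANGE)
		return -EINVAL;
	if (*end == '\n')
		end++;		/* tolerate one trailing newline */
	if (*end != '\0')
		return -EINVAL;
	*contid = (uint64_t)val;
	return 0;
}

/* True when the parsed value is a real identifier, not the unset marker. */
static bool contid_is_set(uint64_t contid)
{
	return contid != CONTID_UNSET;
}
```

A caller would read the file into a NUL-terminated buffer, pass it to
parse_contid(), and then use contid_is_set() to tell an unset task from
one that has been assigned an identifier.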

Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Acked-by: Neil Horman <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
---
fs/proc/base.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e2e7c9f4702f..26091800180c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1224,7 +1224,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
};

#ifdef CONFIG_AUDIT
-#define TMPBUFLEN 11
+#define TMPBUFLEN 21
static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
size_t count, loff_t *ppos)
{
@@ -1308,6 +1308,24 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
.llseek = generic_file_llseek,
};

+static ssize_t proc_contid_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file_inode(file);
+ struct task_struct *task;
+ ssize_t length;
+ char tmpbuf[TMPBUFLEN];
+
+ if (!capable(CAP_AUDIT_CONTROL))
+ return -EPERM;
+ task = get_proc_task(inode);
+ if (!task)
+ return -ESRCH;
+ length = scnprintf(tmpbuf, TMPBUFLEN, "%llu", audit_get_contid(task));
+ put_task_struct(task);
+ return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
+}
+
static ssize_t proc_contid_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -1338,6 +1356,7 @@ static ssize_t proc_contid_write(struct file *file, const char __user *buf,
}

static const struct file_operations proc_contid_operations = {
+ .read = proc_contid_read,
.write = proc_contid_write,
.llseek = generic_file_llseek,
};
@@ -3101,7 +3120,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
#ifdef CONFIG_AUDIT
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
- REG("audit_containerid", S_IWUSR, proc_contid_operations),
+ REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
@@ -3502,7 +3521,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
#ifdef CONFIG_AUDIT
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
- REG("audit_containerid", S_IWUSR, proc_contid_operations),
+ REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
--
1.8.3.1

2019-12-31 19:52:40

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 05/16] audit: log drop of contid on exit of last task

Since we are tracking the life of each audit container identifier, we
can match the creation event with the destruction event. Log the
destruction of the audit container identifier when the last process in
that container exits.
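
The "last process" test can be pictured with a small userspace model.
This is a sketch only: struct contobj_model and should_log_drop() are
hypothetical stand-ins, and the real code reads a refcount_t on the
task's container object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's audit_contobj; not the real layout. */
struct contobj_model {
	uint64_t id;
	unsigned int refcount;	/* number of users still holding this contid */
};

/*
 * Mirrors the early-return test in audit_log_container_drop(): a drop
 * record is only warranted when the exiting task holds the last reference.
 */
static bool should_log_drop(const struct contobj_model *cont)
{
	if (!cont)
		return false;		/* task never joined a container */
	return cont->refcount <= 1;	/* we are the last user */
}
```

With two users of a container, the first to exit produces no drop
record; the second one does.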

Signed-off-by: Richard Guy Briggs <[email protected]>
---
kernel/audit.c | 17 +++++++++++++++++
kernel/audit.h | 2 ++
kernel/auditsc.c | 2 ++
3 files changed, 21 insertions(+)

diff --git a/kernel/audit.c b/kernel/audit.c
index 4bab20f5f781..fa8f1aa3a605 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -2502,6 +2502,23 @@ int audit_set_contid(struct task_struct *task, u64 contid)
return rc;
}

+void audit_log_container_drop(void)
+{
+ struct audit_buffer *ab;
+
+ if (!current->audit || !current->audit->cont ||
+ refcount_read(&current->audit->cont->refcount) > 1)
+ return;
+ ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
+ if (!ab)
+ return;
+
+ audit_log_format(ab, "op=drop opid=%d contid=%llu old-contid=%llu",
+ task_tgid_nr(current), audit_get_contid(current),
+ audit_get_contid(current));
+ audit_log_end(ab);
+}
+
/**
* audit_log_end - end one audit record
* @ab: the audit_buffer
diff --git a/kernel/audit.h b/kernel/audit.h
index e4a31aa92dfe..162de8366b32 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -255,6 +255,8 @@ extern void audit_log_d_path_exe(struct audit_buffer *ab,
extern struct tty_struct *audit_get_tty(void);
extern void audit_put_tty(struct tty_struct *tty);

+extern void audit_log_container_drop(void);
+
/* audit watch/mark/tree functions */
#ifdef CONFIG_AUDITSYSCALL
extern unsigned int audit_serial(void);
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 0e2d50533959..bd855794ad26 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1568,6 +1568,8 @@ static void audit_log_exit(void)

audit_log_proctitle();

+ audit_log_container_drop();
+
/* Send end of event record to help user space know we are finished */
ab = audit_log_start(context, GFP_KERNEL, AUDIT_EOE);
if (ab)
--
1.8.3.1

2019-12-31 19:52:59

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

Add audit container identifier support to the action of signalling the
audit daemon.

Since this would need to add an element to the audit_sig_info struct,
a new record type AUDIT_SIGNAL_INFO2 was created with a new
audit_sig_info2 struct. Corresponding support is required in the
userspace code to reflect the new record request and reply type.
An older userspace won't break since it won't know to request this
record type.
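
For illustration, a userspace mirror of the new reply layout might look
like the sketch below. The names struct sig_info2_model and
sig_info2_ctx_len() are hypothetical, and the fixed-width types assume
that uid_t and pid_t are 32 bits wide, as they are on Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace mirror of the audit_sig_info2 layout in the patch, using
 * fixed-width types in place of uid_t/pid_t/u64 (an assumption, see
 * the note above).
 */
struct sig_info2_model {
	uint32_t uid;
	int32_t pid;
	uint64_t cid;
	char ctx[];	/* security context, length implied by the payload */
};

/*
 * Hypothetical helper: given the payload length of an AUDIT_SIGNAL_INFO2
 * reply, return the number of context bytes, or -1 if the payload is too
 * short to hold the fixed header.
 */
static long sig_info2_ctx_len(size_t payload_len)
{
	if (payload_len < sizeof(struct sig_info2_model))
		return -1;
	return (long)(payload_len - sizeof(struct sig_info2_model));
}
```

The kernel side sends sizeof(*sig_data2) + len bytes, so the context
length is whatever remains after the fixed header.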

Signed-off-by: Richard Guy Briggs <[email protected]>
---
include/linux/audit.h | 7 +++++++
include/uapi/linux/audit.h | 1 +
kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++
kernel/audit.h | 1 +
security/selinux/nlmsgtab.c | 1 +
5 files changed, 45 insertions(+)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 2636b0ad0011..6929a02080f7 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -22,6 +22,13 @@ struct audit_sig_info {
char ctx[0];
};

+struct audit_sig_info2 {
+ uid_t uid;
+ pid_t pid;
+ u64 cid;
+ char ctx[0];
+};
+
struct audit_buffer;
struct audit_context;
struct inode;
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 93417a8af9d0..4f87b06f0acd 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -72,6 +72,7 @@
#define AUDIT_SET_FEATURE 1018 /* Turn an audit feature on or off */
#define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */
#define AUDIT_CONTAINER_OP 1020 /* Define the container id and info */
+#define AUDIT_SIGNAL_INFO2 1021 /* Get info auditd signal sender */

#define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */
#define AUDIT_USER_AVC 1107 /* We filter this differently */
diff --git a/kernel/audit.c b/kernel/audit.c
index 0871c3e5d6df..51159c94041c 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -126,6 +126,14 @@ struct auditd_connection {
kuid_t audit_sig_uid = INVALID_UID;
pid_t audit_sig_pid = -1;
u32 audit_sig_sid = 0;
+/* Since the signal information is stored in the record buffer at the
+ * time of the signal, but not retrieved until later, there is a chance
+ * that the last process in the container could terminate before the
+ * signal record is delivered. In this circumstance, there is a chance
+ * the orchestrator could reuse the audit container identifier, causing
+ * an overlap of audit records that refer to the same audit container
+ * identifier, but a different container instance. */
+u64 audit_sig_cid = AUDIT_CID_UNSET;

/* Records can be lost in several ways:
0) [suppressed in audit_alloc]
@@ -1123,6 +1131,7 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type)
case AUDIT_ADD_RULE:
case AUDIT_DEL_RULE:
case AUDIT_SIGNAL_INFO:
+ case AUDIT_SIGNAL_INFO2:
case AUDIT_TTY_GET:
case AUDIT_TTY_SET:
case AUDIT_TRIM:
@@ -1286,6 +1295,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
struct audit_buffer *ab;
u16 msg_type = nlh->nlmsg_type;
struct audit_sig_info *sig_data;
+ struct audit_sig_info2 *sig_data2;
char *ctx = NULL;
u32 len;

@@ -1545,6 +1555,30 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
sig_data, sizeof(*sig_data) + len);
kfree(sig_data);
break;
+ case AUDIT_SIGNAL_INFO2:
+ len = 0;
+ if (audit_sig_sid) {
+ err = security_secid_to_secctx(audit_sig_sid, &ctx, &len);
+ if (err)
+ return err;
+ }
+ sig_data2 = kmalloc(sizeof(*sig_data2) + len, GFP_KERNEL);
+ if (!sig_data2) {
+ if (audit_sig_sid)
+ security_release_secctx(ctx, len);
+ return -ENOMEM;
+ }
+ sig_data2->uid = from_kuid(&init_user_ns, audit_sig_uid);
+ sig_data2->pid = audit_sig_pid;
+ if (audit_sig_sid) {
+ memcpy(sig_data2->ctx, ctx, len);
+ security_release_secctx(ctx, len);
+ }
+ sig_data2->cid = audit_sig_cid;
+ audit_send_reply(skb, seq, AUDIT_SIGNAL_INFO2, 0, 0,
+ sig_data2, sizeof(*sig_data2) + len);
+ kfree(sig_data2);
+ break;
case AUDIT_TTY_GET: {
struct audit_tty_status s;
unsigned int t;
@@ -2414,6 +2448,7 @@ int audit_signal_info(int sig, struct task_struct *t)
else
audit_sig_uid = uid;
security_task_getsecid(current, &audit_sig_sid);
+ audit_sig_cid = audit_get_contid(current);
}

return audit_signal_info_syscall(t);
diff --git a/kernel/audit.h b/kernel/audit.h
index 162de8366b32..de358ac61587 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -346,6 +346,7 @@ static inline int audit_signal_info_syscall(struct task_struct *t)
extern pid_t audit_sig_pid;
extern kuid_t audit_sig_uid;
extern u32 audit_sig_sid;
+extern u64 audit_sig_cid;

extern int audit_filter(int msgtype, unsigned int listtype);

diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index c97fdae8f71b..f006d8b70b65 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -134,6 +134,7 @@ struct nlmsg_perm {
{ AUDIT_DEL_RULE, NETLINK_AUDIT_SOCKET__NLMSG_WRITE },
{ AUDIT_USER, NETLINK_AUDIT_SOCKET__NLMSG_RELAY },
{ AUDIT_SIGNAL_INFO, NETLINK_AUDIT_SOCKET__NLMSG_READ },
+ { AUDIT_SIGNAL_INFO2, NETLINK_AUDIT_SOCKET__NLMSG_READ },
{ AUDIT_TRIM, NETLINK_AUDIT_SOCKET__NLMSG_WRITE },
{ AUDIT_MAKE_EQUIV, NETLINK_AUDIT_SOCKET__NLMSG_WRITE },
{ AUDIT_TTY_GET, NETLINK_AUDIT_SOCKET__NLMSG_READ },
--
1.8.3.1

2019-12-31 19:53:16

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 08/16] audit: add support for non-syscall auxiliary records

Standalone audit records have the timestamp and serial number generated
on the fly and as such are unique, making them standalone. This new
function audit_alloc_local() generates a local audit context that will
be used only for a standalone record and its auxiliary record(s). The
context is discarded immediately after the local associated records are
produced.
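
The effect on record stamping can be modelled in userspace as below.
This is a simplified sketch: struct ctx_model, ctx_init_local(),
ctx_get_stamp() and the next_serial counter are hypothetical stand-ins
for audit_context, audit_alloc_local(), auditsc_get_stamp() and
audit_serial().

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the fields of audit_context used here. */
struct ctx_model {
	bool in_syscall;
	bool local;
	unsigned int serial;
};

static unsigned int next_serial = 1;	/* stand-in for audit_serial() */

/* Mirrors audit_alloc_local(): stamp the context at creation time. */
static void ctx_init_local(struct ctx_model *ctx)
{
	ctx->in_syscall = false;
	ctx->local = true;
	ctx->serial = next_serial++;
}

/*
 * Mirrors the patched check in auditsc_get_stamp(): a context supplies
 * the shared serial number if it is in a syscall or is a local context.
 * Returns 1 and fills *serial when the context stamp is used, else 0.
 */
static int ctx_get_stamp(struct ctx_model *ctx, unsigned int *serial)
{
	if (!ctx->in_syscall && !ctx->local)
		return 0;
	if (!ctx->serial)
		ctx->serial = next_serial++;
	*serial = ctx->serial;
	return 1;
}
```

Every record started against the same local context therefore carries
the same serial number, which is what ties the standalone record to its
auxiliary record(s).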

Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Acked-by: Neil Horman <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
---
include/linux/audit.h | 8 ++++++++
kernel/audit.h | 1 +
kernel/auditsc.c | 35 ++++++++++++++++++++++++++++++-----
3 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 6929a02080f7..29b81cc43f8d 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -321,6 +321,8 @@ static inline int audit_signal_info(int sig, struct task_struct *t)

/* These are defined in auditsc.c */
/* Public API */
+extern struct audit_context *audit_alloc_local(gfp_t gfpflags);
+extern void audit_free_context(struct audit_context *context);
extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
unsigned long a2, unsigned long a3);
extern void __audit_syscall_exit(int ret_success, long ret_value);
@@ -573,6 +575,12 @@ static inline void audit_ntp_log(const struct audit_ntp_data *ad)
extern int audit_n_rules;
extern int audit_signals;
#else /* CONFIG_AUDITSYSCALL */
+static inline struct audit_context *audit_alloc_local(gfp_t gfpflags)
+{
+ return NULL;
+}
+static inline void audit_free_context(struct audit_context *context)
+{ }
static inline void audit_syscall_entry(int major, unsigned long a0,
unsigned long a1, unsigned long a2,
unsigned long a3)
diff --git a/kernel/audit.h b/kernel/audit.h
index de358ac61587..000ca7c89f6d 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -98,6 +98,7 @@ struct audit_proctitle {
struct audit_context {
int dummy; /* must be the first element */
int in_syscall; /* 1 if task is in a syscall */
+ bool local; /* local context needed */
enum audit_state state, current_state;
unsigned int serial; /* serial number for record */
int major; /* syscall number */
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index ac438fcff807..3138c88887c7 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -890,11 +890,13 @@ static inline void audit_free_aux(struct audit_context *context)
}
}

-static inline struct audit_context *audit_alloc_context(enum audit_state state)
+static inline struct audit_context *audit_alloc_context(enum audit_state state,
+ gfp_t gfpflags)
{
struct audit_context *context;

- context = kzalloc(sizeof(*context), GFP_KERNEL);
+ /* We can be called in atomic context via audit_tg() */
+ context = kzalloc(sizeof(*context), gfpflags);
if (!context)
return NULL;
context->state = state;
@@ -930,7 +932,8 @@ int audit_alloc_syscall(struct task_struct *tsk)
return 0;
}

- if (!(context = audit_alloc_context(state))) {
+ context = audit_alloc_context(state, GFP_KERNEL);
+ if (!context) {
kfree(key);
audit_log_lost("out of memory in audit_alloc_syscall");
return -ENOMEM;
@@ -942,8 +945,29 @@ int audit_alloc_syscall(struct task_struct *tsk)
return 0;
}

-static inline void audit_free_context(struct audit_context *context)
+struct audit_context *audit_alloc_local(gfp_t gfpflags)
{
+ struct audit_context *context = NULL;
+
+ if (!audit_ever_enabled)
+ goto out; /* Return if not auditing. */
+ context = audit_alloc_context(AUDIT_RECORD_CONTEXT, gfpflags);
+ if (!context) {
+ audit_log_lost("out of memory in audit_alloc_local");
+ goto out;
+ }
+ context->serial = audit_serial();
+ ktime_get_coarse_real_ts64(&context->ctime);
+ context->local = true;
+out:
+ return context;
+}
+EXPORT_SYMBOL(audit_alloc_local);
+
+void audit_free_context(struct audit_context *context)
+{
+ if (!context)
+ return;
audit_free_module(context);
audit_free_names(context);
unroll_tree_refs(context, NULL, 0);
@@ -954,6 +978,7 @@ static inline void audit_free_context(struct audit_context *context)
audit_proctitle_free(context);
kfree(context);
}
+EXPORT_SYMBOL(audit_free_context);

static int audit_log_pid_context(struct audit_context *context, pid_t pid,
kuid_t auid, kuid_t uid, unsigned int sessionid,
@@ -2182,7 +2207,7 @@ void __audit_inode_child(struct inode *parent,
int auditsc_get_stamp(struct audit_context *ctx,
struct timespec64 *t, unsigned int *serial)
{
- if (!ctx->in_syscall)
+ if (!ctx->in_syscall && !ctx->local)
return 0;
if (!ctx->serial)
ctx->serial = audit_serial();
--
1.8.3.1

2019-12-31 19:53:40

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 10/16] audit: add containerid filtering

Implement audit container identifier filtering using the AUDIT_CONTID
field name to send an 8-byte string holding a u64, since the value
field is only u32.

Sending it as two u32 was considered, but gathering and comparing two
fields was more complex.
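
As a sketch of the encoding (not the actual auditctl code), userspace
could pack the identifier as shown below. contid_pack() and
contid_unpack() are hypothetical names; the 8-byte length check and the
host byte order follow what the kernel side of this patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace helper: append a contid to a rule's string
 * buffer as 8 raw bytes in host byte order. Returns the number of
 * bytes written, to be stored in the field's 32-bit value slot.
 */
static uint32_t contid_pack(unsigned char *buf, uint64_t contid)
{
	memcpy(buf, &contid, sizeof(contid));
	return (uint32_t)sizeof(contid);
}

/* Inverse of contid_pack(); memcpy avoids unaligned or aliased loads. */
static int contid_unpack(const unsigned char *buf, uint32_t len,
			 uint64_t *contid)
{
	if (len != sizeof(*contid))
		return -1;	/* the patch rejects any other length */
	memcpy(contid, buf, sizeof(*contid));
	return 0;
}
```

The value slot thus carries only the length (always 8), while the
identifier itself travels in the rule's string buffer.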

The feature indicator is AUDIT_FEATURE_BITMAP_CONTAINERID.

Please see the github audit kernel issue for the contid filter feature:
https://github.com/linux-audit/audit-kernel/issues/91
Please see the github audit userspace issue for filter additions:
https://github.com/linux-audit/audit-userspace/issues/40
Please see the github audit testsuite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Acked-by: Neil Horman <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
---
include/linux/audit.h | 1 +
include/uapi/linux/audit.h | 5 ++++-
kernel/audit.h | 1 +
kernel/auditfilter.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/auditsc.c | 4 ++++
5 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 29b81cc43f8d..5531d37a4226 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -68,6 +68,7 @@ struct audit_field {
u32 type;
union {
u32 val;
+ u64 val64;
kuid_t uid;
kgid_t gid;
struct {
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 4f87b06f0acd..ea6638bb914b 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -269,6 +269,7 @@
#define AUDIT_LOGINUID_SET 24
#define AUDIT_SESSIONID 25 /* Session ID */
#define AUDIT_FSTYPE 26 /* FileSystem Type */
+#define AUDIT_CONTID 27 /* Container ID */

/* These are ONLY useful when checking
* at syscall exit time (AUDIT_AT_EXIT). */
@@ -350,6 +351,7 @@ enum {
#define AUDIT_FEATURE_BITMAP_SESSIONID_FILTER 0x00000010
#define AUDIT_FEATURE_BITMAP_LOST_RESET 0x00000020
#define AUDIT_FEATURE_BITMAP_FILTER_FS 0x00000040
+#define AUDIT_FEATURE_BITMAP_CONTAINERID 0x00000080

#define AUDIT_FEATURE_BITMAP_ALL (AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT | \
AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME | \
@@ -357,7 +359,8 @@ enum {
AUDIT_FEATURE_BITMAP_EXCLUDE_EXTEND | \
AUDIT_FEATURE_BITMAP_SESSIONID_FILTER | \
AUDIT_FEATURE_BITMAP_LOST_RESET | \
- AUDIT_FEATURE_BITMAP_FILTER_FS)
+ AUDIT_FEATURE_BITMAP_FILTER_FS | \
+ AUDIT_FEATURE_BITMAP_CONTAINERID)

/* deprecated: AUDIT_VERSION_* */
#define AUDIT_VERSION_LATEST AUDIT_FEATURE_BITMAP_ALL
diff --git a/kernel/audit.h b/kernel/audit.h
index 000ca7c89f6d..5e2f5c9820d8 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -225,6 +225,7 @@ static inline int audit_hash_contid(u64 contid)

extern int audit_match_class(int class, unsigned syscall);
extern int audit_comparator(const u32 left, const u32 op, const u32 right);
+extern int audit_comparator64(const u64 left, const u32 op, const u64 right);
extern int audit_uid_comparator(kuid_t left, u32 op, kuid_t right);
extern int audit_gid_comparator(kgid_t left, u32 op, kgid_t right);
extern int parent_len(const char *path);
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index b0126e9c0743..9606f973fe33 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -399,6 +399,7 @@ static int audit_field_valid(struct audit_entry *entry, struct audit_field *f)
case AUDIT_FILETYPE:
case AUDIT_FIELD_COMPARE:
case AUDIT_EXE:
+ case AUDIT_CONTID:
/* only equal and not equal valid ops */
if (f->op != Audit_not_equal && f->op != Audit_equal)
return -EINVAL;
@@ -586,6 +587,14 @@ static struct audit_entry *audit_data_to_entry(struct audit_rule_data *data,
}
entry->rule.exe = audit_mark;
break;
+ case AUDIT_CONTID:
+ if (f->val != sizeof(u64))
+ goto exit_free;
+ str = audit_unpack_string(&bufp, &remain, f->val);
+ if (IS_ERR(str))
+ goto exit_free;
+ f->val64 = ((u64 *)str)[0];
+ break;
}
}

@@ -668,6 +677,11 @@ static struct audit_rule_data *audit_krule_to_data(struct audit_krule *krule)
data->buflen += data->values[i] =
audit_pack_string(&bufp, audit_mark_path(krule->exe));
break;
+ case AUDIT_CONTID:
+ data->buflen += data->values[i] = sizeof(u64);
+ memcpy(bufp, &f->val64, sizeof(u64));
+ bufp += sizeof(u64);
+ break;
case AUDIT_LOGINUID_SET:
if (krule->pflags & AUDIT_LOGINUID_LEGACY && !f->val) {
data->fields[i] = AUDIT_LOGINUID;
@@ -754,6 +768,10 @@ static int audit_compare_rule(struct audit_krule *a, struct audit_krule *b)
if (!gid_eq(a->fields[i].gid, b->fields[i].gid))
return 1;
break;
+ case AUDIT_CONTID:
+ if (a->fields[i].val64 != b->fields[i].val64)
+ return 1;
+ break;
default:
if (a->fields[i].val != b->fields[i].val)
return 1;
@@ -1211,6 +1229,30 @@ int audit_comparator(u32 left, u32 op, u32 right)
}
}

+int audit_comparator64(u64 left, u32 op, u64 right)
+{
+ switch (op) {
+ case Audit_equal:
+ return (left == right);
+ case Audit_not_equal:
+ return (left != right);
+ case Audit_lt:
+ return (left < right);
+ case Audit_le:
+ return (left <= right);
+ case Audit_gt:
+ return (left > right);
+ case Audit_ge:
+ return (left >= right);
+ case Audit_bitmask:
+ return (left & right);
+ case Audit_bittest:
+ return ((left & right) == right);
+ default:
+ return 0;
+ }
+}
+
int audit_uid_comparator(kuid_t left, u32 op, kuid_t right)
{
switch (op) {
@@ -1345,6 +1387,10 @@ int audit_filter(int msgtype, unsigned int listtype)
result = audit_comparator(audit_loginuid_set(current),
f->op, f->val);
break;
+ case AUDIT_CONTID:
+ result = audit_comparator64(audit_get_contid(current),
+ f->op, f->val64);
+ break;
case AUDIT_MSGTYPE:
result = audit_comparator(msgtype, f->op, f->val);
break;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 3138c88887c7..a658fe775b86 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -629,6 +629,10 @@ static int audit_filter_rules(struct task_struct *tsk,
result = audit_comparator(ctx->sockaddr->ss_family,
f->op, f->val);
break;
+ case AUDIT_CONTID:
+ result = audit_comparator64(audit_get_contid(tsk),
+ f->op, f->val64);
+ break;
case AUDIT_SUBJ_USER:
case AUDIT_SUBJ_ROLE:
case AUDIT_SUBJ_TYPE:
--
1.8.3.1

2019-12-31 19:53:50

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 13/16] audit: track container nesting

Track the parent container of a container to be able to filter and
report nesting.

Now that we have a way to track and check the parent container of a
container, modify the contid field format to report that nesting,
using a caret ("^") separator between a container and its parent. The
original field format was "contid=<contid>" for task-associated records
and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
records. The new field format is
"contid=<contid>[^<contid>[...]][,<contid>[...]]".

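The nested part of the new format can be sketched in userspace as
follows. struct cont_model and format_contid_chain() are hypothetical
stand-ins for audit_contobj and audit_log_contid(); the kernel version
writes into an audit buffer rather than a char array.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for audit_contobj: an id plus a parent link. */
struct cont_model {
	uint64_t id;
	const struct cont_model *parent;
};

/*
 * Format a container and its ancestors as "<contid>[^<contid>[...]]",
 * innermost first. Returns 0 on success or -1 if the buffer is too
 * small to hold the whole chain.
 */
static int format_contid_chain(const struct cont_model *cont, char *buf,
			       size_t size)
{
	size_t used = 0;

	if (size == 0)
		return -1;
	buf[0] = '\0';
	for (; cont; cont = cont->parent) {
		int n = snprintf(buf + used, size - used, "%" PRIu64 "%s",
				 cont->id, cont->parent ? "^" : "");
		if (n < 0 || (size_t)n >= size - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

A container 333 nested in 22, itself nested in 1, is reported as
"contid=333^22^1"; a container with no parent keeps the old
"contid=<contid>" form.
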
Signed-off-by: Richard Guy Briggs <[email protected]>
---
include/linux/audit.h | 1 +
kernel/audit.c | 53 +++++++++++++++++++++++++++++++++++++++++++--------
kernel/audit.h | 1 +
kernel/auditfilter.c | 17 ++++++++++++++++-
kernel/auditsc.c | 2 +-
5 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index ed8d5b74758d..4272b468417a 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -109,6 +109,7 @@ struct audit_contobj {
struct task_struct *owner;
refcount_t refcount;
struct rcu_head rcu;
+ struct audit_contobj *parent;
};

struct audit_task_info {
diff --git a/kernel/audit.c b/kernel/audit.c
index ef8e07524c46..68be59d1a89b 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -251,6 +251,7 @@ static void _audit_contobj_put(struct audit_contobj *cont)
return;
if (refcount_dec_and_test(&cont->refcount)) {
put_task_struct(cont->owner);
+ _audit_contobj_put(cont->parent);
list_del_rcu(&cont->list);
kfree_rcu(cont, rcu);
}
@@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
audit_netns_contid_add(new->net_ns, contid);
}

+void audit_log_contid(struct audit_buffer *ab, u64 contid);
/**
* audit_log_netns_contid_list - List contids for the given network namespace
* @net: the network namespace of interest
@@ -523,7 +525,7 @@ void audit_log_netns_contid_list(struct net *net, struct audit_context *context)
audit_log_format(ab, "contid=");
} else
audit_log_format(ab, ",");
- audit_log_format(ab, "%llu", cont->id);
+ audit_log_contid(ab, cont->id);
}
audit_log_end(ab);
out:
@@ -2311,6 +2313,36 @@ void audit_log_session_info(struct audit_buffer *ab)
audit_log_format(ab, "auid=%u ses=%u", auid, sessionid);
}

+void audit_log_contid(struct audit_buffer *ab, u64 contid)
+{
+ struct audit_contobj *cont = NULL, *prcont = NULL;
+ int h;
+
+ if (!audit_contid_valid(contid)) {
+ audit_log_format(ab, "%llu", contid);
+ return;
+ }
+ h = audit_hash_contid(contid);
+ rcu_read_lock();
+ list_for_each_entry_rcu(cont, &audit_contid_hash[h], list)
+ if (cont->id == contid) {
+ prcont = cont;
+ break;
+ }
+ if (!prcont) {
+ audit_log_format(ab, "%llu", contid);
+ goto out;
+ }
+ while (prcont) {
+ audit_log_format(ab, "%llu", prcont->id);
+ prcont = prcont->parent;
+ if (prcont)
+ audit_log_format(ab, "^");
+ }
+out:
+ rcu_read_unlock();
+}
+
/*
* audit_log_container_id - report container info
* @context: task or local context for record
@@ -2326,7 +2358,8 @@ void audit_log_container_id(struct audit_context *context, u64 contid)
ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER_ID);
if (!ab)
return;
- audit_log_format(ab, "contid=%llu", contid);
+ audit_log_format(ab, "contid=");
+ audit_log_contid(ab, contid);
audit_log_end(ab);
}
EXPORT_SYMBOL(audit_log_container_id);
@@ -2675,6 +2708,9 @@ int audit_set_contid(struct task_struct *task, u64 contid)
newcont->id = contid;
get_task_struct(current);
newcont->owner = current;
+ newcont->parent = _audit_contobj(newcont->owner);
+ if (newcont->parent)
+ _audit_contobj_hold(newcont->parent);
refcount_set(&newcont->refcount, 1);
spin_lock(&audit_contobj_list_lock);
list_add_rcu(&newcont->list, &audit_contid_hash[h]);
@@ -2705,9 +2741,10 @@ int audit_set_contid(struct task_struct *task, u64 contid)
if (!ab)
return rc;

- audit_log_format(ab,
- "op=set opid=%d contid=%llu old-contid=%llu",
- task_tgid_nr(task), contid, oldcontid);
+ audit_log_format(ab, "op=set opid=%d contid=", task_tgid_nr(task));
+ audit_log_contid(ab, contid);
+ audit_log_format(ab, " old-contid=");
+ audit_log_contid(ab, oldcontid);
audit_log_end(ab);
return rc;
}
@@ -2723,9 +2760,9 @@ void audit_log_container_drop(void)
if (!ab)
return;

- audit_log_format(ab, "op=drop opid=%d contid=%llu old-contid=%llu",
- task_tgid_nr(current), audit_get_contid(current),
- audit_get_contid(current));
+ audit_log_format(ab, "op=drop opid=%d contid=%llu old-contid=",
+ task_tgid_nr(current), AUDIT_CID_UNSET);
+ audit_log_contid(ab, audit_get_contid(current));
audit_log_end(ab);
}

diff --git a/kernel/audit.h b/kernel/audit.h
index 5e2f5c9820d8..de814fcbb38c 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -226,6 +226,7 @@ static inline int audit_hash_contid(u64 contid)
extern int audit_match_class(int class, unsigned syscall);
extern int audit_comparator(const u32 left, const u32 op, const u32 right);
extern int audit_comparator64(const u64 left, const u32 op, const u64 right);
+extern int audit_contid_comparator(const u64 left, const u32 op, const u64 right);
extern int audit_uid_comparator(kuid_t left, u32 op, kuid_t right);
extern int audit_gid_comparator(kgid_t left, u32 op, kgid_t right);
extern int parent_len(const char *path);
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index 9606f973fe33..1757896740e8 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -1297,6 +1297,21 @@ int audit_gid_comparator(kgid_t left, u32 op, kgid_t right)
}
}

+int audit_contid_comparator(u64 left, u32 op, u64 right)
+{
+ struct audit_contobj *cont = NULL;
+ int h;
+ int result = 0;
+
+ h = audit_hash_contid(left);
+ list_for_each_entry_rcu(cont, &audit_contid_hash[h], list) {
+ result = audit_comparator64(cont->id, op, right);
+ if (result)
+ break;
+ }
+ return result;
+}
+
/**
* parent_len - find the length of the parent portion of a pathname
* @path: pathname of which to determine length
@@ -1388,7 +1403,7 @@ int audit_filter(int msgtype, unsigned int listtype)
f->op, f->val);
break;
case AUDIT_CONTID:
- result = audit_comparator64(audit_get_contid(current),
+ result = audit_contid_comparator(audit_get_contid(current),
f->op, f->val64);
break;
case AUDIT_MSGTYPE:
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index a658fe775b86..6bf6d8b9dfd1 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -630,7 +630,7 @@ static int audit_filter_rules(struct task_struct *tsk,
f->op, f->val);
break;
case AUDIT_CONTID:
- result = audit_comparator64(audit_get_contid(tsk),
+ result = audit_contid_comparator(audit_get_contid(tsk),
f->op, f->val64);
break;
case AUDIT_SUBJ_USER:
--
1.8.3.1

2019-12-31 19:54:07

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 14/16] audit: check contid depth and add limit config param

Clamp the depth of audit container identifier nesting to limit the
netlink and disk bandwidth used and to prevent losing information from
record text size overflow in the contid field.

Add a configuration parameter AUDIT_STATUS_CONTID_DEPTH_LIMIT (0x80) to
set the audit container identifier depth limit. This can be used to
keep the contid field in CONTAINER_OP and CONTAINER_ID messages from
overflowing and losing information, and to limit the bandwidth used by
these messages.
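
As a rough illustration of the clamp described above, here is a minimal
userspace sketch. The struct and function names (contobj, contid_depth,
contid_depth_check) are hypothetical stand-ins, not the kernel symbols;
only the depth counting and the -EMLINK result follow the patch below.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's container object. */
struct contobj {
	unsigned long long id;
	struct contobj *parent;	/* NULL for a top-level container */
};

/* Nesting depth: 0 for no container, 1 for a top-level container. */
static int contid_depth(const struct contobj *cont)
{
	int depth = 0;

	while (cont) {
		depth++;
		cont = cont->parent;
	}
	return depth;
}

/*
 * Decide whether a new container may be nested under @parent.
 * A @limit of 0 disables the check, as in the patch.
 */
static int contid_depth_check(const struct contobj *parent,
			      unsigned int limit)
{
	if (limit != 0 && (unsigned int)contid_depth(parent) >= limit)
		return -EMLINK;
	return 0;
}
```

With the default limit of 4, a parent that is already four levels deep
cannot have a further container nested under it.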

Signed-off-by: Richard Guy Briggs <[email protected]>
---
include/uapi/linux/audit.h | 2 ++
kernel/audit.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/audit.h | 2 ++
3 files changed, 50 insertions(+)

diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index ea6638bb914b..dcb076b0d2e1 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -343,6 +343,7 @@ enum {
#define AUDIT_STATUS_BACKLOG_LIMIT 0x0010
#define AUDIT_STATUS_BACKLOG_WAIT_TIME 0x0020
#define AUDIT_STATUS_LOST 0x0040
+#define AUDIT_STATUS_CONTID_DEPTH_LIMIT 0x0080

#define AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT 0x00000001
#define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME 0x00000002
@@ -471,6 +472,7 @@ struct audit_status {
__u32 feature_bitmap; /* bitmap of kernel audit features */
};
__u32 backlog_wait_time;/* message queue wait timeout */
+ __u32 contid_depth_limit;/* container depth limit */
};

struct audit_features {
diff --git a/kernel/audit.c b/kernel/audit.c
index 68be59d1a89b..e5e39aedaf86 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -157,6 +157,7 @@ struct auditd_connection {
* of container objects to tasks and refcount changes. There should be
* no need for interaction with tasklist_lock */
static DEFINE_SPINLOCK(audit_contobj_list_lock);
+static u32 audit_contid_depth_limit = AUDIT_CONTID_DEPTH_LIMIT;

static struct kmem_cache *audit_buffer_cache;

@@ -678,6 +679,20 @@ static int audit_set_backlog_wait_time(u32 timeout)
&audit_backlog_wait_time, timeout);
}

+static int audit_set_contid_depth_limit(u32 limit)
+{
+ int rc = 0;
+
+ if (limit > 20 * AUDIT_CONTID_DEPTH_LIMIT) {
+ rc = -ENOSPC;
+ audit_log_config_change("audit_contid_depth_limit",
+ limit, audit_contid_depth_limit, 0);
+ return rc;
+ }
+ return audit_do_config_change("audit_contid_depth_limit",
+ &audit_contid_depth_limit, limit);
+}
+
static int audit_set_enabled(u32 state)
{
int rc;
@@ -1439,6 +1454,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
s.backlog = skb_queue_len(&audit_queue);
s.feature_bitmap = AUDIT_FEATURE_BITMAP_ALL;
s.backlog_wait_time = audit_backlog_wait_time;
+ s.contid_depth_limit = audit_contid_depth_limit;
audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &s, sizeof(s));
break;
}
@@ -1542,6 +1558,13 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
audit_log_config_change("lost", 0, lost, 1);
return lost;
}
+ if (s.mask & AUDIT_STATUS_CONTID_DEPTH_LIMIT) {
+ if (sizeof(s) > (size_t)nlh->nlmsg_len)
+ return -EINVAL;
+ err = audit_set_contid_depth_limit(s.contid_depth_limit);
+ if (err < 0)
+ return err;
+ }
break;
}
case AUDIT_GET_FEATURE:
@@ -2608,6 +2631,22 @@ int audit_signal_info(int sig, struct task_struct *t)
return audit_signal_info_syscall(t);
}

+static int audit_contid_depth(struct audit_contobj *cont)
+{
+ struct audit_contobj *parent;
+ int depth = 1;
+
+ if (!cont)
+ return 0;
+
+ parent = cont->parent;
+ while (parent) {
+ depth++;
+ parent = parent->parent;
+ }
+ return depth;
+}
+
static bool audit_contid_isowner(struct task_struct *tsk)
{
if (tsk->audit && tsk->audit->cont)
@@ -2701,6 +2740,13 @@ int audit_set_contid(struct task_struct *task, u64 contid)
}
break;
}
+ /* Clamp max container id depth */
+ if (audit_contid_depth_limit != 0 &&
+ audit_contid_depth(_audit_contobj(rcu_dereference(current->real_parent)))
+ >= audit_contid_depth_limit) {
+ rc = -EMLINK;
+ goto conterror;
+ }
if (!newcont) {
newcont = kmalloc(sizeof(*newcont), GFP_ATOMIC);
if (newcont) {
diff --git a/kernel/audit.h b/kernel/audit.h
index de814fcbb38c..fbca07a49c03 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -220,6 +220,8 @@ static inline int audit_hash_contid(u64 contid)
return (contid & (AUDIT_CONTID_BUCKETS-1));
}

+#define AUDIT_CONTID_DEPTH_LIMIT 4
+
/* Indicates that audit should log the full pathname. */
#define AUDIT_NAME_FULL -1

--
1.8.3.1

2019-12-31 19:54:13

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 11/16] audit: add support for containerid to network namespaces

This also adds support to qualify NETFILTER_PKT records.

Audit events could happen in a network namespace outside of a task
context due to packets received from the net that trigger an auditing
rule prior to being associated with a running task. The network
namespace could be in use by multiple containers by association to the
tasks in that network namespace. We still want a way to attribute
these events to any potential containers. Keep a list per network
namespace to track these audit container identifiers.

Add/increment the audit container identifier on:
- initial setting of the audit container identifier via /proc
- clone/fork call that inherits an audit container identifier
- unshare call that inherits an audit container identifier
- setns call that inherits an audit container identifier
Delete/decrement the audit container identifier on:
- an inherited audit container identifier dropped when child set
- process exit
- unshare call that drops a net namespace
- setns call that drops a net namespace

Add audit container identifier auxiliary record(s) to NETFILTER_PKT
event standalone records. Iterate through all potential audit container
identifiers associated with a network namespace.
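
The add/increment and delete/decrement rules above amount to a
refcounted list of identifiers per network namespace. A minimal
userspace sketch follows, assuming a plain singly linked list with no
locking or RCU; all names in it (netns_contids, netns_contid_add,
netns_contid_del, netns_contid_refs) are hypothetical and chosen only
for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One entry per audit container identifier seen in a namespace. */
struct netns_contid {
	struct netns_contid *next;
	uint64_t id;
	unsigned int refcount;	/* number of tasks using this id here */
};

/* Hypothetical stand-in for the per-netns audit data. */
struct netns_contids {
	struct netns_contid *head;
};

/* Add a reference to @id, creating the entry on first use. */
static int netns_contid_add(struct netns_contids *ns, uint64_t id)
{
	struct netns_contid *c;

	assert(ns);
	for (c = ns->head; c; c = c->next) {
		if (c->id == id) {
			c->refcount++;
			return 0;
		}
	}
	c = malloc(sizeof(*c));
	if (!c)
		return -1;
	c->id = id;
	c->refcount = 1;
	c->next = ns->head;
	ns->head = c;
	return 0;
}

/* Drop a reference to @id, freeing the entry when it reaches zero. */
static void netns_contid_del(struct netns_contids *ns, uint64_t id)
{
	struct netns_contid **pp;

	assert(ns);
	for (pp = &ns->head; *pp; pp = &(*pp)->next) {
		struct netns_contid *c = *pp;

		if (c->id != id)
			continue;
		if (--c->refcount == 0) {
			*pp = c->next;
			free(c);
		}
		return;
	}
}

/* Report how many references @id currently holds (0 if absent). */
static unsigned int netns_contid_refs(const struct netns_contids *ns,
				      uint64_t id)
{
	const struct netns_contid *c;

	for (c = ns->head; c; c = c->next)
		if (c->id == id)
			return c->refcount;
	return 0;
}
```

A setns or unshare that changes the net namespace then maps to one
netns_contid_del() on the old namespace and one netns_contid_add() on
the new one.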

Please see the github audit kernel issue for contid net support:
https://github.com/linux-audit/audit-kernel/issues/92
Please see the github audit testsuite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Neil Horman <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
---
include/linux/audit.h | 24 +++++++++
kernel/audit.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++-
kernel/nsproxy.c | 4 ++
net/netfilter/nft_log.c | 11 +++-
net/netfilter/xt_AUDIT.c | 11 +++-
5 files changed, 176 insertions(+), 6 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 5531d37a4226..ed8d5b74758d 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -12,6 +12,7 @@
#include <linux/sched.h>
#include <linux/ptrace.h>
#include <uapi/linux/audit.h>
+#include <linux/refcount.h>

#define AUDIT_INO_UNSET ((unsigned long)-1)
#define AUDIT_DEV_UNSET ((dev_t)-1)
@@ -121,6 +122,13 @@ struct audit_task_info {

extern struct audit_task_info init_struct_audit;

+struct audit_contobj_netns {
+ struct list_head list;
+ u64 id;
+ refcount_t refcount;
+ struct rcu_head rcu;
+};
+
extern int is_audit_feature_set(int which);

extern int __init audit_register_class(int class, unsigned *list);
@@ -225,6 +233,12 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
}

extern void audit_log_container_id(struct audit_context *context, u64 contid);
+extern void audit_netns_contid_add(struct net *net, u64 contid);
+extern void audit_netns_contid_del(struct net *net, u64 contid);
+extern void audit_switch_task_namespaces(struct nsproxy *ns,
+ struct task_struct *p);
+extern void audit_log_netns_contid_list(struct net *net,
+ struct audit_context *context);

extern u32 audit_enabled;

@@ -297,6 +311,16 @@ static inline u64 audit_get_contid(struct task_struct *tsk)

static inline void audit_log_container_id(struct audit_context *context, u64 contid)
{ }
+static inline void audit_netns_contid_add(struct net *net, u64 contid)
+{ }
+static inline void audit_netns_contid_del(struct net *net, u64 contid)
+{ }
+static inline void audit_switch_task_namespaces(struct nsproxy *ns,
+ struct task_struct *p)
+{ }
+static inline void audit_log_netns_contid_list(struct net *net,
+ struct audit_context *context)
+{ }

#define audit_enabled AUDIT_OFF

diff --git a/kernel/audit.c b/kernel/audit.c
index d4e6eafe5644..f7a8d3288ca0 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -59,6 +59,7 @@
#include <linux/freezer.h>
#include <linux/pid_namespace.h>
#include <net/netns/generic.h>
+#include <net/net_namespace.h>

#include "audit.h"

@@ -86,9 +87,13 @@
/**
* struct audit_net - audit private network namespace data
* @sk: communication socket
+ * @contid_list: audit container identifier list
+ * @contid_list_lock audit container identifier list lock
*/
struct audit_net {
struct sock *sk;
+ struct list_head contid_list;
+ spinlock_t contid_list_lock;
};

/**
@@ -305,8 +310,11 @@ struct audit_task_info init_struct_audit = {
void audit_free(struct task_struct *tsk)
{
struct audit_task_info *info = tsk->audit;
+ struct nsproxy *ns = tsk->nsproxy;

audit_free_syscall(tsk);
+ if (ns)
+ audit_netns_contid_del(ns->net_ns, audit_get_contid(tsk));
/* Freeing the audit_task_info struct must be performed after
* audit_log_exit() due to need for loginuid and sessionid.
*/
@@ -409,6 +417,120 @@ static struct sock *audit_get_sk(const struct net *net)
return aunet->sk;
}

+void audit_netns_contid_add(struct net *net, u64 contid)
+{
+ struct audit_net *aunet;
+ struct list_head *contid_list;
+ struct audit_contobj_netns *cont;
+
+ if (!net)
+ return;
+ if (!audit_contid_valid(contid))
+ return;
+ aunet = net_generic(net, audit_net_id);
+ if (!aunet)
+ return;
+ contid_list = &aunet->contid_list;
+ rcu_read_lock();
+ list_for_each_entry_rcu(cont, contid_list, list)
+ if (cont->id == contid) {
+ spin_lock(&aunet->contid_list_lock);
+ refcount_inc(&cont->refcount);
+ spin_unlock(&aunet->contid_list_lock);
+ goto out;
+ }
+ cont = kmalloc(sizeof(*cont), GFP_ATOMIC);
+ if (cont) {
+ INIT_LIST_HEAD(&cont->list);
+ cont->id = contid;
+ refcount_set(&cont->refcount, 1);
+ spin_lock(&aunet->contid_list_lock);
+ list_add_rcu(&cont->list, contid_list);
+ spin_unlock(&aunet->contid_list_lock);
+ }
+out:
+ rcu_read_unlock();
+}
+
+void audit_netns_contid_del(struct net *net, u64 contid)
+{
+ struct audit_net *aunet;
+ struct list_head *contid_list;
+ struct audit_contobj_netns *cont = NULL;
+
+ if (!net)
+ return;
+ if (!audit_contid_valid(contid))
+ return;
+ aunet = net_generic(net, audit_net_id);
+ if (!aunet)
+ return;
+ contid_list = &aunet->contid_list;
+ rcu_read_lock();
+ list_for_each_entry_rcu(cont, contid_list, list)
+ if (cont->id == contid) {
+ spin_lock(&aunet->contid_list_lock);
+ if (refcount_dec_and_test(&cont->refcount)) {
+ list_del_rcu(&cont->list);
+ kfree_rcu(cont, rcu);
+ }
+ spin_unlock(&aunet->contid_list_lock);
+ break;
+ }
+ rcu_read_unlock();
+}
+
+void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
+{
+ u64 contid = audit_get_contid(p);
+ struct nsproxy *new = p->nsproxy;
+
+ if (!audit_contid_valid(contid))
+ return;
+ audit_netns_contid_del(ns->net_ns, contid);
+ if (new)
+ audit_netns_contid_add(new->net_ns, contid);
+}
+
+/**
+ * audit_log_netns_contid_list - List contids for the given network namespace
+ * @net: the network namespace of interest
+ * @context: the audit context to use
+ *
+ * Description:
+ * Issues a CONTAINER_ID record with a CSV list of contids associated
+ * with a network namespace to accompany a NETFILTER_PKT record.
+ */
+void audit_log_netns_contid_list(struct net *net, struct audit_context *context)
+{
+ struct audit_buffer *ab = NULL;
+ struct audit_contobj_netns *cont;
+ struct audit_net *aunet;
+
+ /* Generate AUDIT_CONTAINER_ID record with container ID CSV list */
+ rcu_read_lock();
+ aunet = net_generic(net, audit_net_id);
+ if (!aunet)
+ goto out;
+ list_for_each_entry_rcu(cont, &aunet->contid_list, list) {
+ if (!ab) {
+ ab = audit_log_start(context, GFP_ATOMIC,
+ AUDIT_CONTAINER_ID);
+ if (!ab) {
+ audit_log_lost("out of memory in audit_log_netns_contid_list");
+ goto out;
+ }
+ audit_log_format(ab, "contid=");
+ } else
+ audit_log_format(ab, ",");
+ audit_log_format(ab, "%llu", cont->id);
+ }
+ audit_log_end(ab);
+out:
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL(audit_log_netns_contid_list);
+
void audit_panic(const char *message)
{
switch (audit_failure) {
@@ -1677,7 +1799,6 @@ static int __net_init audit_net_init(struct net *net)
.flags = NL_CFG_F_NONROOT_RECV,
.groups = AUDIT_NLGRP_MAX,
};
-
struct audit_net *aunet = net_generic(net, audit_net_id);

aunet->sk = netlink_kernel_create(net, NETLINK_AUDIT, &cfg);
@@ -1686,7 +1807,8 @@ static int __net_init audit_net_init(struct net *net)
return -ENOMEM;
}
aunet->sk->sk_sndtimeo = MAX_SCHEDULE_TIMEOUT;
-
+ INIT_LIST_HEAD(&aunet->contid_list);
+ spin_lock_init(&aunet->contid_list_lock);
return 0;
}

@@ -2470,6 +2592,7 @@ int audit_set_contid(struct task_struct *task, u64 contid)
u64 oldcontid;
int rc = 0;
struct audit_buffer *ab;
+ struct net *net = task->nsproxy->net_ns;

task_lock(task);
/* Can't set if audit disabled */
@@ -2540,6 +2663,11 @@ int audit_set_contid(struct task_struct *task, u64 contid)
conterror:
rcu_read_unlock();
}
+ if (!rc) {
+ if (audit_contid_valid(oldcontid))
+ audit_netns_contid_del(net, oldcontid);
+ audit_netns_contid_add(net, contid);
+ }
task_unlock(task);

if (!audit_enabled)
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index c815f58e6bc0..bbdb5bbf5446 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -23,6 +23,7 @@
#include <linux/syscalls.h>
#include <linux/cgroup.h>
#include <linux/perf_event.h>
+#include <linux/audit.h>

static struct kmem_cache *nsproxy_cachep;

@@ -136,6 +137,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *old_ns = tsk->nsproxy;
struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
struct nsproxy *new_ns;
+ u64 contid = audit_get_contid(tsk);

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
@@ -163,6 +165,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
return PTR_ERR(new_ns);

tsk->nsproxy = new_ns;
+ audit_netns_contid_add(new_ns->net_ns, contid);
return 0;
}

@@ -220,6 +223,7 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
ns = p->nsproxy;
p->nsproxy = new;
task_unlock(p);
+ audit_switch_task_namespaces(ns, p);

if (ns && atomic_dec_and_test(&ns->count))
free_nsproxy(ns);
diff --git a/net/netfilter/nft_log.c b/net/netfilter/nft_log.c
index fe4831f2258f..98d1e7e1a83c 100644
--- a/net/netfilter/nft_log.c
+++ b/net/netfilter/nft_log.c
@@ -66,13 +66,16 @@ static void nft_log_eval_audit(const struct nft_pktinfo *pkt)
struct sk_buff *skb = pkt->skb;
struct audit_buffer *ab;
int fam = -1;
+ struct audit_context *context;
+ struct net *net;

if (!audit_enabled)
return;

- ab = audit_log_start(NULL, GFP_ATOMIC, AUDIT_NETFILTER_PKT);
+ context = audit_alloc_local(GFP_ATOMIC);
+ ab = audit_log_start(context, GFP_ATOMIC, AUDIT_NETFILTER_PKT);
if (!ab)
- return;
+ goto errout;

audit_log_format(ab, "mark=%#x", skb->mark);

@@ -99,6 +102,10 @@ static void nft_log_eval_audit(const struct nft_pktinfo *pkt)
audit_log_format(ab, " saddr=? daddr=? proto=-1");

audit_log_end(ab);
+ net = xt_net(&pkt->xt);
+ audit_log_netns_contid_list(net, context);
+errout:
+ audit_free_context(context);
}

static void nft_log_eval(const struct nft_expr *expr,
diff --git a/net/netfilter/xt_AUDIT.c b/net/netfilter/xt_AUDIT.c
index 9cdc16b0d0d8..ecf868a1abde 100644
--- a/net/netfilter/xt_AUDIT.c
+++ b/net/netfilter/xt_AUDIT.c
@@ -68,10 +68,13 @@ static bool audit_ip6(struct audit_buffer *ab, struct sk_buff *skb)
{
struct audit_buffer *ab;
int fam = -1;
+ struct audit_context *context;
+ struct net *net;

if (audit_enabled == AUDIT_OFF)
- goto errout;
- ab = audit_log_start(NULL, GFP_ATOMIC, AUDIT_NETFILTER_PKT);
+ goto out;
+ context = audit_alloc_local(GFP_ATOMIC);
+ ab = audit_log_start(context, GFP_ATOMIC, AUDIT_NETFILTER_PKT);
if (ab == NULL)
goto errout;

@@ -101,7 +104,11 @@ static bool audit_ip6(struct audit_buffer *ab, struct sk_buff *skb)

audit_log_end(ab);

+ net = xt_net(par);
+ audit_log_netns_contid_list(net, context);
errout:
+ audit_free_context(context);
+out:
return XT_CONTINUE;
}

--
1.8.3.1

2019-12-31 19:54:22

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 15/16] audit: check contid count per netns and add config param limit

Clamp the number of audit container identifiers associated with a
network namespace to limit the netlink and disk bandwidth used and to
prevent losing information from record text size overflow in the contid
field.

Add a configuration parameter AUDIT_STATUS_CONTID_NETNS_LIMIT (0x100)
to set the audit container identifier netns limit. This is used to
keep the contid field in CONTAINER_OP and CONTAINER_ID messages from
overflowing and losing information, and to limit the bandwidth used by
these messages.

The product of this value and the audit container identifier nesting
depth limit must not exceed 400. That bound is the total audit message
length less message overhead, divided by the length of the text
representation of an audit container identifier.
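
The arithmetic behind the 400 bound can be worked through in a few
lines. This is a sketch only: the macro and function names are
hypothetical, and the constants (8560, 128, 21, 400) are the ones
quoted in the kernel/audit.h comment in this patch. The widening to 64
bits is a choice made for the sketch so that the product of two 32-bit
limits cannot wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSG_TEXT_MAX	 8560u	/* AUDIT_MESSAGE_TEXT_MAX, per the patch */
#define MSG_OVERHEAD	 128u	/* overhead quoted in the patch comment */
#define U64_TEXT_LEN	 21u	/* max text length of a u64, per the patch */
#define CONTID_MSG_LIMIT 400u	/* value of AUDIT_CONTID_MSG_LIMIT */

/* How many full-width identifiers fit: (8560 - 128) / 21 = 401. */
static uint32_t contid_msg_capacity(void)
{
	return (MSG_TEXT_MAX - MSG_OVERHEAD) / U64_TEXT_LEN;
}

/* True when the two limits multiply out to no more than the bound. */
static bool contid_limits_ok(uint32_t depth_limit, uint32_t netns_limit)
{
	return (uint64_t)depth_limit * netns_limit <= CONTID_MSG_LIMIT;
}
```

The defaults of depth 4 and netns count 100 multiply out to exactly
400, so raising either limit requires lowering the other first.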

Signed-off-by: Richard Guy Briggs <[email protected]>
---
include/linux/audit.h | 16 +++++++----
include/linux/nsproxy.h | 2 +-
include/uapi/linux/audit.h | 2 ++
kernel/audit.c | 68 ++++++++++++++++++++++++++++++++++++++--------
kernel/audit.h | 7 +++++
kernel/fork.c | 10 +++++--
kernel/nsproxy.c | 27 +++++++++++++++---
7 files changed, 107 insertions(+), 25 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 4272b468417a..28b9c7cd86a6 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -234,9 +234,9 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
}

extern void audit_log_container_id(struct audit_context *context, u64 contid);
-extern void audit_netns_contid_add(struct net *net, u64 contid);
+extern int audit_netns_contid_add(struct net *net, u64 contid);
extern void audit_netns_contid_del(struct net *net, u64 contid);
-extern void audit_switch_task_namespaces(struct nsproxy *ns,
+extern int audit_switch_task_namespaces(struct nsproxy *ns,
struct task_struct *p);
extern void audit_log_netns_contid_list(struct net *net,
struct audit_context *context);
@@ -312,13 +312,17 @@ static inline u64 audit_get_contid(struct task_struct *tsk)

static inline void audit_log_container_id(struct audit_context *context, u64 contid)
{ }
-static inline void audit_netns_contid_add(struct net *net, u64 contid)
-{ }
+static inline int audit_netns_contid_add(struct net *net, u64 contid)
+{
+ return 0;
+}
static inline void audit_netns_contid_del(struct net *net, u64 contid)
{ }
-static inline void audit_switch_task_namespaces(struct nsproxy *ns,
+static inline int audit_switch_task_namespaces(struct nsproxy *ns,
struct task_struct *p)
-{ }
+{
+ return 0;
+}
static inline void audit_log_netns_contid_list(struct net *net,
struct audit_context *context)
{ }
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 2ae1b1a4d84d..3ca35cbf2cd8 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -67,7 +67,7 @@ struct nsproxy {

int copy_namespaces(unsigned long flags, struct task_struct *tsk);
void exit_task_namespaces(struct task_struct *tsk);
-void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
+int switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
void free_nsproxy(struct nsproxy *ns);
int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
struct cred *, struct fs_struct *);
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index dcb076b0d2e1..2844d78cd7af 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -344,6 +344,7 @@ enum {
#define AUDIT_STATUS_BACKLOG_WAIT_TIME 0x0020
#define AUDIT_STATUS_LOST 0x0040
#define AUDIT_STATUS_CONTID_DEPTH_LIMIT 0x0080
+#define AUDIT_STATUS_CONTID_NETNS_LIMIT 0x0100

#define AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT 0x00000001
#define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME 0x00000002
@@ -473,6 +474,7 @@ struct audit_status {
};
__u32 backlog_wait_time;/* message queue wait timeout */
__u32 contid_depth_limit;/* container depth limit */
+ __u32 contid_netns_limit;/* container netns limit */
};

struct audit_features {
diff --git a/kernel/audit.c b/kernel/audit.c
index e5e39aedaf86..1287f0b63757 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -89,11 +89,13 @@
* @sk: communication socket
* @contid_list: audit container identifier list
* @contid_list_lock audit container identifier list lock
+ * @contid_count count of audit container identifiers using this netns
*/
struct audit_net {
struct sock *sk;
struct list_head contid_list;
spinlock_t contid_list_lock;
+ int contid_count;
};

/**
@@ -158,6 +160,7 @@ struct auditd_connection {
* no need for interaction with tasklist_lock */
static DEFINE_SPINLOCK(audit_contobj_list_lock);
static u32 audit_contid_depth_limit = AUDIT_CONTID_DEPTH_LIMIT;
+static u32 audit_contid_netns_limit = AUDIT_CONTID_NETNS_LIMIT;

static struct kmem_cache *audit_buffer_cache;

@@ -419,19 +422,20 @@ static struct sock *audit_get_sk(const struct net *net)
return aunet->sk;
}

-void audit_netns_contid_add(struct net *net, u64 contid)
+int audit_netns_contid_add(struct net *net, u64 contid)
{
struct audit_net *aunet;
struct list_head *contid_list;
struct audit_contobj_netns *cont;
+ int rc = 0;

if (!net)
- return;
+ return 0;
if (!audit_contid_valid(contid))
- return;
+ return 0;
aunet = net_generic(net, audit_net_id);
if (!aunet)
- return;
+ return 0;
contid_list = &aunet->contid_list;
rcu_read_lock();
list_for_each_entry_rcu(cont, contid_list, list)
@@ -447,11 +451,22 @@ void audit_netns_contid_add(struct net *net, u64 contid)
cont->id = contid;
refcount_set(&cont->refcount, 1);
spin_lock(&aunet->contid_list_lock);
- list_add_rcu(&cont->list, contid_list);
+ if (audit_contid_netns_limit != 0 &&
+ aunet->contid_count < audit_contid_netns_limit) {
+ list_add_rcu(&cont->list, contid_list);
+ aunet->contid_count++;
+ } else {
+ rc = -ENOSR;
+ }
spin_unlock(&aunet->contid_list_lock);
+ if (rc)
+ kfree(cont);
+ } else {
+ rc = -ENOMEM;
}
out:
rcu_read_unlock();
+ return rc;
}

void audit_netns_contid_del(struct net *net, u64 contid)
@@ -475,6 +490,7 @@ void audit_netns_contid_del(struct net *net, u64 contid)
if (refcount_dec_and_test(&cont->refcount)) {
list_del_rcu(&cont->list);
kfree_rcu(cont, rcu);
+ aunet->contid_count--;
}
spin_unlock(&aunet->contid_list_lock);
break;
@@ -482,16 +498,21 @@ void audit_netns_contid_del(struct net *net, u64 contid)
rcu_read_unlock();
}

-void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
+int audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
{
u64 contid = audit_get_contid(p);
struct nsproxy *new = p->nsproxy;
+ int rc = 0;

if (!audit_contid_valid(contid))
- return;
+ return 0;
audit_netns_contid_del(ns->net_ns, contid);
- if (new)
- audit_netns_contid_add(new->net_ns, contid);
+ if (new) {
+ rc = audit_netns_contid_add(new->net_ns, contid);
+ if (rc)
+ audit_netns_contid_add(ns->net_ns, contid);
+ }
+ return rc;
}

void audit_log_contid(struct audit_buffer *ab, u64 contid);
@@ -683,7 +704,7 @@ static int audit_set_contid_depth_limit(u32 limit)
{
int rc = 0;

- if (limit > 20 * AUDIT_CONTID_DEPTH_LIMIT) {
+ if (limit * audit_contid_netns_limit > AUDIT_CONTID_MSG_LIMIT) {
rc = -ENOSPC;
audit_log_config_change("audit_contid_depth_limit",
limit, audit_contid_depth_limit, 0);
@@ -693,6 +714,20 @@ static int audit_set_contid_depth_limit(u32 limit)
&audit_contid_depth_limit, limit);
}

+static int audit_set_contid_netns_limit(u32 limit)
+{
+ int rc = 0;
+
+ if (limit * audit_contid_depth_limit > AUDIT_CONTID_MSG_LIMIT) {
+ rc = -ENOSPC;
+ audit_log_config_change("audit_contid_netns_limit",
+ limit, audit_contid_netns_limit, 0);
+ return rc;
+ }
+ return audit_do_config_change("audit_contid_netns_limit",
+ &audit_contid_netns_limit, limit);
+}
+
static int audit_set_enabled(u32 state)
{
int rc;
@@ -1455,6 +1490,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
s.feature_bitmap = AUDIT_FEATURE_BITMAP_ALL;
s.backlog_wait_time = audit_backlog_wait_time;
s.contid_depth_limit = audit_contid_depth_limit;
+ s.contid_netns_limit = audit_contid_netns_limit;
audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &s, sizeof(s));
break;
}
@@ -1565,6 +1601,13 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
if (err < 0)
return err;
}
+ if (s.mask & AUDIT_STATUS_CONTID_NETNS_LIMIT) {
+ if (sizeof(s) > (size_t)nlh->nlmsg_len)
+ return -EINVAL;
+ err = audit_set_contid_netns_limit(s.contid_netns_limit);
+ if (err < 0)
+ return err;
+ }
break;
}
case AUDIT_GET_FEATURE:
@@ -1834,6 +1877,7 @@ static int __net_init audit_net_init(struct net *net)
aunet->sk->sk_sndtimeo = MAX_SCHEDULE_TIMEOUT;
INIT_LIST_HEAD(&aunet->contid_list);
spin_lock_init(&aunet->contid_list_lock);
+ aunet->contid_count = 0;
return 0;
}

@@ -2776,7 +2820,9 @@ int audit_set_contid(struct task_struct *task, u64 contid)
if (!rc) {
if (audit_contid_valid(oldcontid))
audit_netns_contid_del(net, oldcontid);
- audit_netns_contid_add(net, contid);
+ rc = audit_netns_contid_add(net, contid);
+ if (rc && audit_contid_valid(oldcontid))
+ audit_netns_contid_add(net, oldcontid);
}
task_unlock(task);

diff --git a/kernel/audit.h b/kernel/audit.h
index fbca07a49c03..5701a42e564f 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -222,6 +222,13 @@ static inline int audit_hash_contid(u64 contid)

#define AUDIT_CONTID_DEPTH_LIMIT 4

+#define AUDIT_CONTID_NETNS_LIMIT 100
+
+/* this value is determined by AUDIT_MESSAGE_TEXT_MAX (8560) minus
+ * overhead (128) all divided by the max text representation of a full
+ * u64 (21) */
+#define AUDIT_CONTID_MSG_LIMIT 400
+
/* Indicates that audit should log the full pathname. */
#define AUDIT_NAME_FULL -1

diff --git a/kernel/fork.c b/kernel/fork.c
index edf034e5cbb4..7431649d6a5a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2935,6 +2935,13 @@ int ksys_unshare(unsigned long unshare_flags)
new_cred, new_fs);
if (err)
goto bad_unshare_cleanup_cred;
+ if (new_nsproxy) {
+ err = switch_task_namespaces(current, new_nsproxy);
+ if (err) {
+ free_nsproxy(new_nsproxy);
+ goto bad_unshare_cleanup_cred;
+ }
+ }

if (new_fs || new_fd || do_sysvsem || new_cred || new_nsproxy) {
if (do_sysvsem) {
@@ -2949,9 +2956,6 @@ int ksys_unshare(unsigned long unshare_flags)
shm_init_task(current);
}

- if (new_nsproxy)
- switch_task_namespaces(current, new_nsproxy);
-
task_lock(current);

if (new_fs) {
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index bbdb5bbf5446..5181a41172a8 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -138,6 +138,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
struct nsproxy *new_ns;
u64 contid = audit_get_contid(tsk);
+ int rc;

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
@@ -165,7 +166,12 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
return PTR_ERR(new_ns);

tsk->nsproxy = new_ns;
- audit_netns_contid_add(new_ns->net_ns, contid);
+ rc = audit_netns_contid_add(new_ns->net_ns, contid);
+ if (rc) {
+ tsk->nsproxy = old_ns;
+ free_nsproxy(new_ns);
+ return rc;
+ }
return 0;
}

@@ -213,9 +219,10 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
return err;
}

-void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
+int switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
{
struct nsproxy *ns;
+ int rc;

might_sleep();

@@ -223,10 +230,17 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
ns = p->nsproxy;
p->nsproxy = new;
task_unlock(p);
- audit_switch_task_namespaces(ns, p);
+ rc = audit_switch_task_namespaces(ns, p);
+ if (rc) {
+ task_lock(p);
+ p->nsproxy = ns;
+ task_unlock(p);
+ return rc;
+ }

if (ns && atomic_dec_and_test(&ns->count))
free_nsproxy(ns);
+ return 0;
}

void exit_task_namespaces(struct task_struct *p)
@@ -262,7 +276,12 @@ void exit_task_namespaces(struct task_struct *p)
free_nsproxy(new_nsproxy);
goto out;
}
- switch_task_namespaces(tsk, new_nsproxy);
+ err = switch_task_namespaces(tsk, new_nsproxy);
+ if (err) {
+ ns->ops->install(tsk->nsproxy, ns);
+ free_nsproxy(new_nsproxy);
+ goto out;
+ }

perf_event_namespaces(tsk);
out:
--
1.8.3.1

2019-12-31 19:54:29

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 04/16] audit: convert to contid list to check for orch/engine ownership

Store the audit container identifier in a refcounted kernel object that
is added to the master list of audit container identifiers. This will
allow multiple container orchestrators/engines to work on the same
machine without danger of inadvertently re-using an existing identifier.
It will also allow an orchestrator to inject a process into an existing
container by checking if the original container owner is the one
injecting the task. A hash table list is used to optimize searches.
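
The ownership check described above can be sketched in userspace as a
hashed lookup followed by an owner comparison. Everything here is an
assumption made for illustration: the bucket count of 32, the use of an
int to identify the owner, the function names, and the -EEXIST return
value are not taken from the patch. Only the mask-based hash mirrors
audit_hash_contid().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define CONTID_BUCKETS 32	/* hypothetical; must be a power of two */

struct contobj {
	struct contobj *next;
	uint64_t id;
	int owner;		/* stand-in for the owning task */
	unsigned int refcount;
};

static struct contobj *contid_hash[CONTID_BUCKETS];

/* Same shape as audit_hash_contid(): mask off the low bits. */
static unsigned int hash_contid(uint64_t id)
{
	return (unsigned int)(id & (CONTID_BUCKETS - 1));
}

/* Search only the bucket that @id hashes to. */
static struct contobj *contid_find(uint64_t id)
{
	struct contobj *c;

	for (c = contid_hash[hash_contid(id)]; c; c = c->next)
		if (c->id == id)
			return c;
	return NULL;
}

/*
 * Register @id for @caller, or take another reference when @caller
 * already owns it (the "inject into existing container" case).
 * An id owned by someone else is refused.
 */
static int contid_join(uint64_t id, int caller)
{
	struct contobj *c = contid_find(id);
	unsigned int h = hash_contid(id);

	if (c) {
		if (c->owner != caller)
			return -EEXIST;	/* arbitrary errno for the sketch */
		c->refcount++;
		return 0;
	}
	c = malloc(sizeof(*c));
	if (!c)
		return -ENOMEM;
	c->id = id;
	c->owner = caller;
	c->refcount = 1;
	c->next = contid_hash[h];
	contid_hash[h] = c;
	return 0;
}
```

Two orchestrators that pick the same identifier collide on the owner
check rather than silently sharing a container object.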

Signed-off-by: Richard Guy Briggs <[email protected]>
---
include/linux/audit.h | 14 ++++++--
kernel/audit.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++---
kernel/audit.h | 8 +++++
3 files changed, 112 insertions(+), 8 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index a045b34ecf44..0e6dbe943ae4 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -94,10 +94,18 @@ struct audit_ntp_data {
struct audit_ntp_data {};
#endif

+struct audit_contobj {
+ struct list_head list;
+ u64 id;
+ struct task_struct *owner;
+ refcount_t refcount;
+ struct rcu_head rcu;
+};
+
struct audit_task_info {
kuid_t loginuid;
unsigned int sessionid;
- u64 contid;
+ struct audit_contobj *cont;
#ifdef CONFIG_AUDITSYSCALL
struct audit_context *ctx;
#endif
@@ -203,9 +211,9 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)

static inline u64 audit_get_contid(struct task_struct *tsk)
{
- if (!tsk->audit)
+ if (!tsk->audit || !tsk->audit->cont)
return AUDIT_CID_UNSET;
- return tsk->audit->contid;
+ return tsk->audit->cont->id;
}

extern u32 audit_enabled;
diff --git a/kernel/audit.c b/kernel/audit.c
index 2d7707426b7d..4bab20f5f781 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -138,6 +138,12 @@ struct auditd_connection {

/* Hash for inode-based rules */
struct list_head audit_inode_hash[AUDIT_INODE_BUCKETS];
+/* Hash for contid object lists */
+struct list_head audit_contid_hash[AUDIT_CONTID_BUCKETS];
+/* Lock all additions and deletions to the contid hash lists, assignment
+ * of container objects to tasks and refcount changes. There should be
+ * no need for interaction with tasklist_lock */
+static DEFINE_SPINLOCK(audit_contobj_list_lock);

static struct kmem_cache *audit_buffer_cache;

@@ -212,6 +218,31 @@ void __init audit_task_init(void)
0, SLAB_PANIC, NULL);
}

+static struct audit_contobj *_audit_contobj(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return NULL;
+ return tsk->audit->cont;
+}
+
+/* audit_contobj_list_lock must be held by caller unless new */
+static void _audit_contobj_hold(struct audit_contobj *cont)
+{
+ refcount_inc(&cont->refcount);
+}
+
+/* audit_contobj_list_lock must be held by caller */
+static void _audit_contobj_put(struct audit_contobj *cont)
+{
+ if (!cont)
+ return;
+ if (refcount_dec_and_test(&cont->refcount)) {
+ put_task_struct(cont->owner);
+ list_del_rcu(&cont->list);
+ kfree_rcu(cont, rcu);
+ }
+}
+
/**
* audit_alloc - allocate an audit info block for a task
* @tsk: task
@@ -232,7 +263,11 @@ int audit_alloc(struct task_struct *tsk)
}
info->loginuid = audit_get_loginuid(current);
info->sessionid = audit_get_sessionid(current);
- info->contid = audit_get_contid(current);
+ spin_lock(&audit_contobj_list_lock);
+ info->cont = _audit_contobj(current);
+ if (info->cont)
+ _audit_contobj_hold(info->cont);
+ spin_unlock(&audit_contobj_list_lock);
tsk->audit = info;

ret = audit_alloc_syscall(tsk);
@@ -247,7 +282,7 @@ int audit_alloc(struct task_struct *tsk)
struct audit_task_info init_struct_audit = {
.loginuid = INVALID_UID,
.sessionid = AUDIT_SID_UNSET,
- .contid = AUDIT_CID_UNSET,
+ .cont = NULL,
#ifdef CONFIG_AUDITSYSCALL
.ctx = NULL,
#endif
@@ -267,6 +302,9 @@ void audit_free(struct task_struct *tsk)
/* Freeing the audit_task_info struct must be performed after
* audit_log_exit() due to need for loginuid and sessionid.
*/
+ spin_lock(&audit_contobj_list_lock);
+ _audit_contobj_put(tsk->audit->cont);
+ spin_unlock(&audit_contobj_list_lock);
info = tsk->audit;
tsk->audit = NULL;
kmem_cache_free(audit_task_cache, info);
@@ -1658,6 +1696,9 @@ static int __init audit_init(void)
for (i = 0; i < AUDIT_INODE_BUCKETS; i++)
INIT_LIST_HEAD(&audit_inode_hash[i]);

+ for (i = 0; i < AUDIT_CONTID_BUCKETS; i++)
+ INIT_LIST_HEAD(&audit_contid_hash[i]);
+
mutex_init(&audit_cmd_mutex.lock);
audit_cmd_mutex.owner = NULL;

@@ -2365,6 +2406,9 @@ int audit_signal_info(int sig, struct task_struct *t)
*
* Returns 0 on success, -EPERM on permission failure.
*
+ * If the original container owner goes away, no task injection is
+ * possible to an existing container.
+ *
* Called (set) from fs/proc/base.c::proc_contid_write().
*/
int audit_set_contid(struct task_struct *task, u64 contid)
@@ -2381,9 +2425,12 @@ int audit_set_contid(struct task_struct *task, u64 contid)
}
oldcontid = audit_get_contid(task);
read_lock(&tasklist_lock);
- /* Don't allow the audit containerid to be unset */
+ /* Don't allow the contid to be unset */
if (!audit_contid_valid(contid))
rc = -EINVAL;
+ /* Don't allow the contid to be set to the same value again */
+ else if (contid == oldcontid) {
+ rc = -EADDRINUSE;
/* if we don't have caps, reject */
else if (!capable(CAP_AUDIT_CONTROL))
rc = -EPERM;
@@ -2396,8 +2443,49 @@ int audit_set_contid(struct task_struct *task, u64 contid)
else if (audit_contid_set(task))
rc = -ECHILD;
read_unlock(&tasklist_lock);
- if (!rc)
- task->audit->contid = contid;
+ if (!rc) {
+ struct audit_contobj *oldcont = _audit_contobj(task);
+ struct audit_contobj *cont = NULL, *newcont = NULL;
+ int h = audit_hash_contid(contid);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(cont, &audit_contid_hash[h], list)
+ if (cont->id == contid) {
+ /* task injection to existing container */
+ if (current == cont->owner) {
+ spin_lock(&audit_contobj_list_lock);
+ _audit_contobj_hold(cont);
+ spin_unlock(&audit_contobj_list_lock);
+ newcont = cont;
+ } else {
+ rc = -ENOTUNIQ;
+ goto conterror;
+ }
+ break;
+ }
+ if (!newcont) {
+ newcont = kmalloc(sizeof(*newcont), GFP_ATOMIC);
+ if (newcont) {
+ INIT_LIST_HEAD(&newcont->list);
+ newcont->id = contid;
+ get_task_struct(current);
+ newcont->owner = current;
+ refcount_set(&newcont->refcount, 1);
+ spin_lock(&audit_contobj_list_lock);
+ list_add_rcu(&newcont->list, &audit_contid_hash[h]);
+ spin_unlock(&audit_contobj_list_lock);
+ } else {
+ rc = -ENOMEM;
+ goto conterror;
+ }
+ }
+ task->audit->cont = newcont;
+ spin_lock(&audit_contobj_list_lock);
+ _audit_contobj_put(oldcont);
+ spin_unlock(&audit_contobj_list_lock);
+conterror:
+ rcu_read_unlock();
+ }
task_unlock(task);

if (!audit_enabled)
diff --git a/kernel/audit.h b/kernel/audit.h
index 16bd03b88e0d..e4a31aa92dfe 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -211,6 +211,14 @@ static inline int audit_hash_ino(u32 ino)
return (ino & (AUDIT_INODE_BUCKETS-1));
}

+#define AUDIT_CONTID_BUCKETS 32
+extern struct list_head audit_contid_hash[AUDIT_CONTID_BUCKETS];
+
+static inline int audit_hash_contid(u64 contid)
+{
+ return (contid & (AUDIT_CONTID_BUCKETS-1));
+}
+
/* Indicates that audit should log the full pathname. */
#define AUDIT_NAME_FULL -1

--
1.8.3.1

2019-12-31 19:54:43

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 16/16] audit: add capcontid to set contid outside init_user_ns

Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
process in a non-init user namespace the capability to set audit
container identifiers.

Provide /proc/$PID/audit_capcontid interface to capcontid.
Valid values are: 1==enabled, 0==disabled

Report this action in message type AUDIT_SET_CAPCONTID 1022 with fields
opid= capcontid= old-capcontid=
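
The order of the permission checks in audit_set_capcontid() in this patch
can be modelled in user space roughly as follows. This is only a sketch:
the struct and its boolean fields are hypothetical stand-ins for the
kernel predicates (task == current, task_is_descendant(), current_user_ns()
== &init_user_ns, capable(CAP_AUDIT_CONTROL), audit_get_capcontid(current)),
and it assumes a Linux <errno.h> that defines EBADSLT.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the facts the kernel function consults. */
struct capcontid_request {
	bool target_is_self;        /* task == current */
	bool target_is_descendant;  /* task_is_descendant(current, task) */
	bool in_init_user_ns;       /* current_user_ns() == &init_user_ns */
	bool has_cap_audit_control; /* capable(CAP_AUDIT_CONTROL) */
	bool caller_capcontid;      /* audit_get_capcontid(current) */
};

/* Returns 0 if the write would be accepted, otherwise a negative errno. */
int capcontid_check(const struct capcontid_request *r)
{
	if (r->target_is_self)
		return -EBADSLT;
	if (!r->target_is_descendant)
		return -EXDEV;
	/* As posted, the capability test only runs in the init user ns. */
	if (r->in_init_user_ns &&
	    !r->has_cap_audit_control && !r->caller_capcontid)
		return -EPERM;
	return 0;
}
```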

Signed-off-by: Richard Guy Briggs <[email protected]>
---
fs/proc/base.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/audit.h | 14 ++++++++++++
include/uapi/linux/audit.h | 1 +
kernel/audit.c | 35 +++++++++++++++++++++++++++++
4 files changed, 105 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 26091800180c..283ef8e006e7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1360,6 +1360,59 @@ static ssize_t proc_contid_write(struct file *file, const char __user *buf,
.write = proc_contid_write,
.llseek = generic_file_llseek,
};
+
+static ssize_t proc_capcontid_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file_inode(file);
+ struct task_struct *task = get_proc_task(inode);
+ ssize_t length;
+ char tmpbuf[TMPBUFLEN];
+
+ if (!task)
+ return -ESRCH;
+ /* if we don't have caps, reject */
+ if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
+ return -EPERM;
+ length = scnprintf(tmpbuf, TMPBUFLEN, "%u", audit_get_capcontid(task));
+ put_task_struct(task);
+ return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
+}
+
+static ssize_t proc_capcontid_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file_inode(file);
+ u32 capcontid;
+ int rv;
+ struct task_struct *task = get_proc_task(inode);
+
+ if (!task)
+ return -ESRCH;
+ if (*ppos != 0) {
+ /* No partial writes. */
+ put_task_struct(task);
+ return -EINVAL;
+ }
+
+ rv = kstrtou32_from_user(buf, count, 10, &capcontid);
+ if (rv < 0) {
+ put_task_struct(task);
+ return rv;
+ }
+
+ rv = audit_set_capcontid(task, capcontid);
+ put_task_struct(task);
+ if (rv < 0)
+ return rv;
+ return count;
+}
+
+static const struct file_operations proc_capcontid_operations = {
+ .read = proc_capcontid_read,
+ .write = proc_capcontid_write,
+ .llseek = generic_file_llseek,
+};
#endif

#ifdef CONFIG_FAULT_INJECTION
@@ -3121,6 +3174,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
+ REG("audit_capcontainerid", S_IWUSR|S_IRUSR, proc_capcontid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
@@ -3522,6 +3576,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
+ REG("audit_capcontainerid", S_IWUSR|S_IRUSR, proc_capcontid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 28b9c7cd86a6..62c453306c2a 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -116,6 +116,7 @@ struct audit_task_info {
kuid_t loginuid;
unsigned int sessionid;
struct audit_contobj *cont;
+ u32 capcontid;
#ifdef CONFIG_AUDITSYSCALL
struct audit_context *ctx;
#endif
@@ -224,6 +225,14 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
return tsk->audit->sessionid;
}

+static inline u32 audit_get_capcontid(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return 0;
+ return tsk->audit->capcontid;
+}
+
+extern int audit_set_capcontid(struct task_struct *tsk, u32 enable);
extern int audit_set_contid(struct task_struct *tsk, u64 contid);

static inline u64 audit_get_contid(struct task_struct *tsk)
@@ -305,6 +314,11 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
return AUDIT_SID_UNSET;
}

+static inline u32 audit_get_capcontid(struct task_struct *tsk)
+{
+ return 0;
+}
+
static inline u64 audit_get_contid(struct task_struct *tsk)
{
return AUDIT_CID_UNSET;
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 2844d78cd7af..01251e6dcec0 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -73,6 +73,7 @@
#define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */
#define AUDIT_CONTAINER_OP 1020 /* Define the container id and info */
#define AUDIT_SIGNAL_INFO2 1021 /* Get info auditd signal sender */
+#define AUDIT_SET_CAPCONTID 1022 /* Set cap_contid of a task */

#define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */
#define AUDIT_USER_AVC 1107 /* We filter this differently */
diff --git a/kernel/audit.c b/kernel/audit.c
index 1287f0b63757..1c22dd084ae8 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -2698,6 +2698,41 @@ static bool audit_contid_isowner(struct task_struct *tsk)
return false;
}

+int audit_set_capcontid(struct task_struct *task, u32 enable)
+{
+ u32 oldcapcontid;
+ int rc = 0;
+ struct audit_buffer *ab;
+
+ if (!task->audit)
+ return -ENOPROTOOPT;
+ oldcapcontid = audit_get_capcontid(task);
+ /* if task is not descendant, block */
+ if (task == current)
+ rc = -EBADSLT;
+ else if (!task_is_descendant(current, task))
+ rc = -EXDEV;
+ else if (current_user_ns() == &init_user_ns) {
+ if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
+ rc = -EPERM;
+ }
+ if (!rc)
+ task->audit->capcontid = enable;
+
+ if (!audit_enabled)
+ return rc;
+
+ ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_SET_CAPCONTID);
+ if (!ab)
+ return rc;
+
+ audit_log_format(ab,
+ "opid=%d capcontid=%u old-capcontid=%u",
+ task_tgid_nr(task), enable, oldcapcontid);
+ audit_log_end(ab);
+ return rc;
+}
+
/*
* audit_set_contid - set current task's audit contid
* @task: target task
--
1.8.3.1

2019-12-31 19:54:53

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 06/16] audit: log container info of syscalls

Create a new audit record AUDIT_CONTAINER_ID to document the audit
container identifier of a process if it is present.

Since it is called from audit_log_exit(), syscalls are covered.

A sample raw event:
type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="tmpcontainerid"
type=CWD msg=audit(1519924845.499:257): cwd="/root"
type=PATH msg=audit(1519924845.499:257): item=0 name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PATH msg=audit(1519924845.499:257): item=1 name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1519924845.499:257): proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
type=CONTAINER_ID msg=audit(1519924845.499:257): contid=123458
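
For anyone post-processing raw logs, the contid field of a record like the
one above could be pulled out with something along these lines. This is a
hedged user-space sketch, not part of the patch or of the audit userspace
tools. parse_contid() is a hypothetical helper, and it assumes the field is
preceded by a space so that it does not match "old-contid=" in
CONTAINER_OP records.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the decimal value of the " contid=" field from one raw record.
 * Returns 0 and stores the value on success, -1 if the field is missing
 * or not a valid unsigned 64-bit decimal number.
 */
int parse_contid(const char *record, uint64_t *contid)
{
	static const char key[] = " contid=";
	const char *p = strstr(record, key);
	char *end;
	unsigned long long v;

	if (!p)
		return -1;
	p += strlen(key);
	if (*p < '0' || *p > '9')       /* reject signs and whitespace */
		return -1;
	errno = 0;
	v = strtoull(p, &end, 10);
	if (errno || end == p)
		return -1;
	*contid = (uint64_t)v;
	return 0;
}
```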

Please see the github audit kernel issue for the main feature:
https://github.com/linux-audit/audit-kernel/issues/90
Please see the github audit userspace issue for supporting additions:
https://github.com/linux-audit/audit-userspace/issues/51
Please see the github audit testsuite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Acked-by: Steve Grubb <[email protected]>
Acked-by: Neil Horman <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
---
include/linux/audit.h | 5 +++++
include/uapi/linux/audit.h | 1 +
kernel/audit.c | 20 ++++++++++++++++++++
kernel/auditsc.c | 20 ++++++++++++++------
4 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 0e6dbe943ae4..2636b0ad0011 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -216,6 +216,8 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
return tsk->audit->cont->id;
}

+extern void audit_log_container_id(struct audit_context *context, u64 contid);
+
extern u32 audit_enabled;

extern int audit_signal_info(int sig, struct task_struct *t);
@@ -285,6 +287,9 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
return AUDIT_CID_UNSET;
}

+static inline void audit_log_container_id(struct audit_context *context, u64 contid)
+{ }
+
#define audit_enabled AUDIT_OFF

static inline int audit_signal_info(int sig, struct task_struct *t)
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 866e1606c4ae..93417a8af9d0 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -117,6 +117,7 @@
#define AUDIT_FANOTIFY 1331 /* Fanotify access decision */
#define AUDIT_TIME_INJOFFSET 1332 /* Timekeeping offset injected */
#define AUDIT_TIME_ADJNTPVAL 1333 /* NTP value adjustment */
+#define AUDIT_CONTAINER_ID 1335 /* Container ID */

#define AUDIT_AVC 1400 /* SE Linux avc denial or grant */
#define AUDIT_SELINUX_ERR 1401 /* Internal SE Linux Errors */
diff --git a/kernel/audit.c b/kernel/audit.c
index fa8f1aa3a605..0871c3e5d6df 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -2156,6 +2156,26 @@ void audit_log_session_info(struct audit_buffer *ab)
audit_log_format(ab, "auid=%u ses=%u", auid, sessionid);
}

+/*
+ * audit_log_container_id - report container info
+ * @context: task or local context for record
+ * @contid: container ID to report
+ */
+void audit_log_container_id(struct audit_context *context, u64 contid)
+{
+ struct audit_buffer *ab;
+
+ if (!audit_contid_valid(contid))
+ return;
+ /* Generate AUDIT_CONTAINER_ID record with container ID */
+ ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER_ID);
+ if (!ab)
+ return;
+ audit_log_format(ab, "contid=%llu", contid);
+ audit_log_end(ab);
+}
+EXPORT_SYMBOL(audit_log_container_id);
+
void audit_log_key(struct audit_buffer *ab, char *key)
{
audit_log_format(ab, " key=");
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index bd855794ad26..ac438fcff807 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1534,7 +1534,7 @@ static void audit_log_exit(void)
for (aux = context->aux_pids; aux; aux = aux->next) {
struct audit_aux_data_pids *axs = (void *)aux;

- for (i = 0; i < axs->pid_count; i++)
+ for (i = 0; i < axs->pid_count; i++) {
if (audit_log_pid_context(context, axs->target_pid[i],
axs->target_auid[i],
axs->target_uid[i],
@@ -1542,14 +1542,20 @@ static void audit_log_exit(void)
axs->target_sid[i],
axs->target_comm[i]))
call_panic = 1;
+ audit_log_container_id(context, axs->target_cid[i]);
+ }
}

- if (context->target_pid &&
- audit_log_pid_context(context, context->target_pid,
- context->target_auid, context->target_uid,
- context->target_sessionid,
- context->target_sid, context->target_comm))
+ if (context->target_pid) {
+ if (audit_log_pid_context(context, context->target_pid,
+ context->target_auid,
+ context->target_uid,
+ context->target_sessionid,
+ context->target_sid,
+ context->target_comm))
call_panic = 1;
+ audit_log_container_id(context, context->target_cid);
+ }

if (context->pwd.dentry && context->pwd.mnt) {
ab = audit_log_start(context, GFP_KERNEL, AUDIT_CWD);
@@ -1568,6 +1574,8 @@ static void audit_log_exit(void)

audit_log_proctitle();

+ audit_log_container_id(context, audit_get_contid(current));
+
audit_log_container_drop();

/* Send end of event record to help user space know we are finished */
--
1.8.3.1

2019-12-31 19:55:03

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 12/16] audit: contid check descendancy and nesting

Require the target task to be a descendant of the container
orchestrator/engine.

You would only change the audit container ID from one set or inherited
value to another if you were nesting containers.

If changing the contid, the container orchestrator/engine must be a
descendant and not the same orchestrator as the one that set it, so that
it is not possible to change the contid of another orchestrator's
container.

Since the task_is_descendant() function is used in YAMA and in audit,
remove the duplication and pull the function into kernel/sched/core.c.
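
The ancestry walk that task_is_descendant() performs can be illustrated
with a small user-space model. This is a sketch under simplifying
assumptions: struct proc is a hypothetical stand-in for task_struct with
only a pid and a parent pointer, and thread groups and RCU are left out
entirely.

```c
#include <stddef.h>

/* Hypothetical miniature of a task: pid plus a pointer to its parent. */
struct proc {
	int pid;
	const struct proc *parent;  /* plays the role of real_parent */
};

/*
 * Walk up from child until parent is found or the root (pid 0) is
 * reached. Returns 1 if child is a descendant of parent, 0 if not.
 * As in the kernel helper, a task compared with itself also returns 1,
 * which is why the audit code tests task == current separately first.
 */
int is_descendant(const struct proc *parent, const struct proc *child)
{
	const struct proc *walker = child;

	if (!parent || !child)
		return 0;
	while (walker && walker->pid > 0) {
		if (walker == parent)
			return 1;
		walker = walker->parent;
	}
	return 0;
}
```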

Signed-off-by: Richard Guy Briggs <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/audit.c | 44 ++++++++++++++++++++++++++++++++++++--------
kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++++
security/yama/yama_lsm.c | 33 ---------------------------------
4 files changed, 72 insertions(+), 41 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index aebe24192b23..009d2cb2e2bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2006,4 +2006,7 @@ static inline void rseq_syscall(struct pt_regs *regs)

const struct cpumask *sched_trace_rd_span(struct root_domain *rd);

+extern int task_is_descendant(struct task_struct *parent,
+ struct task_struct *child);
+
#endif
diff --git a/kernel/audit.c b/kernel/audit.c
index f7a8d3288ca0..ef8e07524c46 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -2575,6 +2575,13 @@ int audit_signal_info(int sig, struct task_struct *t)
return audit_signal_info_syscall(t);
}

+static bool audit_contid_isowner(struct task_struct *tsk)
+{
+ if (tsk->audit && tsk->audit->cont)
+ return current == tsk->audit->cont->owner;
+ return false;
+}
+
/*
* audit_set_contid - set current task's audit contid
* @task: target task
@@ -2603,22 +2610,43 @@ int audit_set_contid(struct task_struct *task, u64 contid)
oldcontid = audit_get_contid(task);
read_lock(&tasklist_lock);
/* Don't allow the contid to be unset */
- if (!audit_contid_valid(contid))
+ if (!audit_contid_valid(contid)) {
rc = -EINVAL;
+ goto unlock;
+ }
/* Don't allow the contid to be set to the same value again */
- else if (contid == oldcontid) {
+ if (contid == oldcontid) {
rc = -EADDRINUSE;
+ goto unlock;
+ }
/* if we don't have caps, reject */
- else if (!capable(CAP_AUDIT_CONTROL))
+ if (!capable(CAP_AUDIT_CONTROL)) {
rc = -EPERM;
- /* if task has children or is not single-threaded, deny */
- else if (!list_empty(&task->children))
+ goto unlock;
+ }
+ /* if task has children, deny */
+ if (!list_empty(&task->children)) {
rc = -EBUSY;
- else if (!(thread_group_leader(task) && thread_group_empty(task)))
+ goto unlock;
+ }
+ /* if task is not single-threaded, deny */
+ if (!(thread_group_leader(task) && thread_group_empty(task))) {
rc = -EALREADY;
- /* if contid is already set, deny */
- else if (audit_contid_set(task))
+ goto unlock;
+ }
+ /* if task is not descendant, block */
+ if (task == current) {
+ rc = -EBADSLT;
+ goto unlock;
+ }
+ if (!task_is_descendant(current, task)) {
+ rc = -EXDEV;
+ goto unlock;
+ }
+ /* only allow contid setting again if nesting */
+ if (audit_contid_set(task) && audit_contid_isowner(task))
rc = -ECHILD;
+unlock:
read_unlock(&tasklist_lock);
if (!rc) {
struct audit_contobj *oldcont = _audit_contobj(task);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90e4b00ace89..7d8145285eb9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7916,6 +7916,39 @@ void dump_cpu_task(int cpu)
}

/*
+ * task_is_descendant - walk up a process family tree looking for a match
+ * @parent: the process to compare against while walking up from child
+ * @child: the process to start from while looking upwards for parent
+ *
+ * Returns 1 if child is a descendant of parent, 0 if not.
+ */
+int task_is_descendant(struct task_struct *parent,
+ struct task_struct *child)
+{
+ int rc = 0;
+ struct task_struct *walker = child;
+
+ if (!parent || !child)
+ return 0;
+
+ rcu_read_lock();
+ if (!thread_group_leader(parent))
+ parent = rcu_dereference(parent->group_leader);
+ while (walker->pid > 0) {
+ if (!thread_group_leader(walker))
+ walker = rcu_dereference(walker->group_leader);
+ if (walker == parent) {
+ rc = 1;
+ break;
+ }
+ walker = rcu_dereference(walker->real_parent);
+ }
+ rcu_read_unlock();
+
+ return rc;
+}
+
+/*
* Nice levels are multiplicative, with a gentle 10% change for every
* nice level changed. I.e. when a CPU-bound task goes from nice 0 to
* nice 1, it will get ~10% less CPU time than another CPU-bound task
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index 94dc346370b1..25eae205eae8 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -263,39 +263,6 @@ static int yama_task_prctl(int option, unsigned long arg2, unsigned long arg3,
}

/**
- * task_is_descendant - walk up a process family tree looking for a match
- * @parent: the process to compare against while walking up from child
- * @child: the process to start from while looking upwards for parent
- *
- * Returns 1 if child is a descendant of parent, 0 if not.
- */
-static int task_is_descendant(struct task_struct *parent,
- struct task_struct *child)
-{
- int rc = 0;
- struct task_struct *walker = child;
-
- if (!parent || !child)
- return 0;
-
- rcu_read_lock();
- if (!thread_group_leader(parent))
- parent = rcu_dereference(parent->group_leader);
- while (walker->pid > 0) {
- if (!thread_group_leader(walker))
- walker = rcu_dereference(walker->group_leader);
- if (walker == parent) {
- rc = 1;
- break;
- }
- walker = rcu_dereference(walker->real_parent);
- }
- rcu_read_unlock();
-
- return rc;
-}
-
-/**
* ptracer_exception_found - tracer registered as exception for this tracee
* @tracer: the task_struct of the process attempting ptrace
* @tracee: the task_struct of the process to be ptraced
--
1.8.3.1

2019-12-31 19:55:19

by Richard Guy Briggs

Subject: [PATCH ghak90 V8 09/16] audit: add containerid support for user records

Add audit container identifier auxiliary record to user event standalone
records.

Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Neil Horman <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
---
kernel/audit.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 51159c94041c..d4e6eafe5644 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1176,12 +1176,6 @@ static void audit_log_common_recv_msg(struct audit_context *context,
audit_log_task_context(*ab);
}

-static inline void audit_log_user_recv_msg(struct audit_buffer **ab,
- u16 msg_type)
-{
- audit_log_common_recv_msg(NULL, ab, msg_type);
-}
-
int is_audit_feature_set(int i)
{
return af.features & AUDIT_FEATURE_TO_MASK(i);
@@ -1444,13 +1438,16 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)

err = audit_filter(msg_type, AUDIT_FILTER_USER);
if (err == 1) { /* match or error */
+ struct audit_context *context;
+
err = 0;
if (msg_type == AUDIT_USER_TTY) {
err = tty_audit_push();
if (err)
break;
}
- audit_log_user_recv_msg(&ab, msg_type);
+ context = audit_alloc_local(GFP_KERNEL);
+ audit_log_common_recv_msg(context, &ab, msg_type);
if (msg_type != AUDIT_USER_TTY)
audit_log_format(ab, " msg='%.*s'",
AUDIT_MESSAGE_TEXT_MAX,
@@ -1466,6 +1463,8 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
audit_log_n_untrustedstring(ab, data, size);
}
audit_log_end(ab);
+ audit_log_container_id(context, audit_get_contid(current));
+ audit_free_context(context);
}
break;
case AUDIT_ADD_RULE:
--
1.8.3.1

2020-01-22 21:30:21

by Paul Moore

Subject: Re: [PATCH ghak90 V8 05/16] audit: log drop of contid on exit of last task

On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
>
> Since we are tracking the life of each audit container identifier, we
> can match the creation event with the destruction event. Log the
> destruction of the audit container identifier when the last process in
> that container exits.
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> kernel/audit.c | 17 +++++++++++++++++
> kernel/audit.h | 2 ++
> kernel/auditsc.c | 2 ++
> 3 files changed, 21 insertions(+)
>
> diff --git a/kernel/audit.c b/kernel/audit.c
> index 4bab20f5f781..fa8f1aa3a605 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -2502,6 +2502,23 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> return rc;
> }
>
> +void audit_log_container_drop(void)
> +{
> + struct audit_buffer *ab;
> +
> + if (!current->audit || !current->audit->cont ||
> + refcount_read(&current->audit->cont->refcount) > 1)
> + return;
> + ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
> + if (!ab)
> + return;
> +
> + audit_log_format(ab, "op=drop opid=%d contid=%llu old-contid=%llu",
> + task_tgid_nr(current), audit_get_contid(current),
> + audit_get_contid(current));
> + audit_log_end(ab);
> +}

Assuming we are careful about where we call it in audit_free(...), you
are confident we can't do this as part of _audit_contobj_put(...),
yes?


> /**
> * audit_log_end - end one audit record
> * @ab: the audit_buffer
> diff --git a/kernel/audit.h b/kernel/audit.h
> index e4a31aa92dfe..162de8366b32 100644
> --- a/kernel/audit.h
> +++ b/kernel/audit.h
> @@ -255,6 +255,8 @@ extern void audit_log_d_path_exe(struct audit_buffer *ab,
> extern struct tty_struct *audit_get_tty(void);
> extern void audit_put_tty(struct tty_struct *tty);
>
> +extern void audit_log_container_drop(void);
> +
> /* audit watch/mark/tree functions */
> #ifdef CONFIG_AUDITSYSCALL
> extern unsigned int audit_serial(void);
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 0e2d50533959..bd855794ad26 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -1568,6 +1568,8 @@ static void audit_log_exit(void)
>
> audit_log_proctitle();
>
> + audit_log_container_drop();
> +
> /* Send end of event record to help user space know we are finished */
> ab = audit_log_start(context, GFP_KERNEL, AUDIT_EOE);
> if (ab)
> --
> 1.8.3.1
>

--
paul moore
http://www.paul-moore.com

2020-01-22 21:30:32

by Paul Moore

Subject: Re: [PATCH ghak90 V8 02/16] audit: add container id

On Tue, Dec 31, 2019 at 2:49 PM Richard Guy Briggs <[email protected]> wrote:
>
> Implement the proc fs write to set the audit container identifier of a
> process, emitting an AUDIT_CONTAINER_OP record to document the event.
>
> This is a write from the container orchestrator task to a proc entry of
> the form /proc/PID/audit_containerid where PID is the process ID of the
> newly created task that is to become the first task in a container, or
> an additional task added to a container.
>
> The write expects up to a u64 value (unset: 18446744073709551615).
>
> The writer must have capability CAP_AUDIT_CONTROL.
>
> This will produce a record such as this:
> type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615
>
> The "op" field indicates an initial set. The "opid" field is the
> object's PID, the process being "contained". New and old audit
> container identifier values are given in the "contid" fields.
>
> It is not permitted to unset the audit container identifier.
> A child inherits its parent's audit container identifier.
>
> Please see the github audit kernel issue for the main feature:
> https://github.com/linux-audit/audit-kernel/issues/90
> Please see the github audit userspace issue for supporting additions:
> https://github.com/linux-audit/audit-userspace/issues/51
> Please see the github audit testsuite issue for the test case:
> https://github.com/linux-audit/audit-testsuite/issues/64
> Please see the github audit wiki for the feature overview:
> https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> Acked-by: Serge Hallyn <[email protected]>
> Acked-by: Steve Grubb <[email protected]>
> Acked-by: Neil Horman <[email protected]>
> Reviewed-by: Ondrej Mosnacek <[email protected]>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> fs/proc/base.c | 36 ++++++++++++++++++++++++++++
> include/linux/audit.h | 25 ++++++++++++++++++++
> include/uapi/linux/audit.h | 2 ++
> kernel/audit.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++
> kernel/audit.h | 1 +
> kernel/auditsc.c | 4 ++++
> 6 files changed, 126 insertions(+)

...

> diff --git a/kernel/audit.c b/kernel/audit.c
> index 397f8fb4836a..2d7707426b7d 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -2356,6 +2358,62 @@ int audit_signal_info(int sig, struct task_struct *t)
> return audit_signal_info_syscall(t);
> }
>
> +/*
> + * audit_set_contid - set current task's audit contid
> + * @task: target task
> + * @contid: contid value
> + *
> + * Returns 0 on success, -EPERM on permission failure.
> + *
> + * Called (set) from fs/proc/base.c::proc_contid_write().
> + */
> +int audit_set_contid(struct task_struct *task, u64 contid)
> +{
> + u64 oldcontid;
> + int rc = 0;
> + struct audit_buffer *ab;
> +
> + task_lock(task);
> + /* Can't set if audit disabled */
> + if (!task->audit) {
> + task_unlock(task);
> + return -ENOPROTOOPT;
> + }
> + oldcontid = audit_get_contid(task);
> + read_lock(&tasklist_lock);
> + /* Don't allow the audit containerid to be unset */
> + if (!audit_contid_valid(contid))
> + rc = -EINVAL;
> + /* if we don't have caps, reject */
> + else if (!capable(CAP_AUDIT_CONTROL))
> + rc = -EPERM;
> + /* if task has children or is not single-threaded, deny */
> + else if (!list_empty(&task->children))
> + rc = -EBUSY;
> + else if (!(thread_group_leader(task) && thread_group_empty(task)))
> + rc = -EALREADY;

[NOTE: there is a bigger issue below which I think is going to require
a respin/fixup of this patch so I'm going to take the opportunity to
do a bit more bikeshedding ;)]

It seems like we could combine both the thread/children checks under a
single -EBUSY return value. In both cases the caller should be able
to determine if the target process is multi-threaded or has spawned
children, yes? FWIW, my motivation for this question is that
-EALREADY seems like a poor choice here.

> + /* if contid is already set, deny */
> + else if (audit_contid_set(task))
> + rc = -ECHILD;

Does -EEXIST make more sense here?
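
As a rough illustration of the two error-code suggestions above, here is a
minimal userspace sketch. The struct task_model type and the
contid_precheck() helper are hypothetical stand-ins for the kernel checks,
not code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the target task's state. */
struct task_model {
	bool has_children;    /* models !list_empty(&task->children) */
	bool single_threaded; /* models thread_group_leader() && thread_group_empty() */
	bool contid_set;      /* models audit_contid_set() */
};

/* Returns 0 if the contid may be set, or a negative errno otherwise. */
static int contid_precheck(const struct task_model *t)
{
	assert(t != NULL);
	/* children and extra threads both mean "the target is busy" */
	if (t->has_children || !t->single_threaded)
		return -EBUSY;
	/* a contid is already present on the task */
	if (t->contid_set)
		return -EEXIST;
	return 0;
}
```

With this shape the caller sees a single -EBUSY for both the threads and the
children case, and -EEXIST when an ID is already present.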

> + read_unlock(&tasklist_lock);
> + if (!rc)
> + task->audit->contid = contid;
> + task_unlock(task);
> +
> + if (!audit_enabled)
> + return rc;
> +
> + ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
> + if (!ab)
> + return rc;
> +
> + audit_log_format(ab,
> + "op=set opid=%d contid=%llu old-contid=%llu",
> + task_tgid_nr(task), contid, oldcontid);
> + audit_log_end(ab);

Assuming audit is enabled we always emit the record above, even if we
were not actually able to set the Audit Container ID (ACID); this
seems wrong to me. I think the proper behavior would be to either add
a "res=" field to indicate success/failure or only emit the record
when we actually change a task's ACID. Considering the impact that
the ACID value will potentially have on the audit stream, it seems
like always logging the record and including a "res=" field may be the
safer choice.
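
To make the "res=" option concrete, here is a small userspace sketch that
formats the record body with snprintf(). The format_contid_op() helper is
hypothetical, and it assumes the usual audit convention of res=1 for success
and res=0 for failure.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a CONTAINER_OP-style record body into buf.
 * rc is the return code of the set operation (0 on success).
 * Returns the number of characters snprintf() would have written.
 */
static int format_contid_op(char *buf, size_t len, int opid,
			    unsigned long long contid,
			    unsigned long long oldcontid, int rc)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len,
			"op=set opid=%d contid=%llu old-contid=%llu res=%d",
			opid, contid, oldcontid, rc == 0);
}
```

A failed attempt would then still show up in the audit stream, but with
res=0, so a log reader can tell it apart from a real change.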


> + return rc;
> +}
> +
> /**
> * audit_log_end - end one audit record
> * @ab: the audit_buffer

--
paul moore
http://www.paul-moore.com

2020-01-22 21:30:44

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
>
> Add audit container identifier support to the action of signalling the
> audit daemon.
>
> Since this would need to add an element to the audit_sig_info struct,
> a new record type AUDIT_SIGNAL_INFO2 was created with a new
> audit_sig_info2 struct. Corresponding support is required in the
> userspace code to reflect the new record request and reply type.
> An older userspace won't break since it won't know to request this
> record type.
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> include/linux/audit.h | 7 +++++++
> include/uapi/linux/audit.h | 1 +
> kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++
> kernel/audit.h | 1 +
> security/selinux/nlmsgtab.c | 1 +
> 5 files changed, 45 insertions(+)

...

> diff --git a/kernel/audit.c b/kernel/audit.c
> index 0871c3e5d6df..51159c94041c 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -126,6 +126,14 @@ struct auditd_connection {
> kuid_t audit_sig_uid = INVALID_UID;
> pid_t audit_sig_pid = -1;
> u32 audit_sig_sid = 0;
> +/* Since the signal information is stored in the record buffer at the
> + * time of the signal, but not retrieved until later, there is a chance
> + * that the last process in the container could terminate before the
> + * signal record is delivered. In this circumstance, there is a chance
> + * the orchestrator could reuse the audit container identifier, causing
> + * an overlap of audit records that refer to the same audit container
> + * identifier, but a different container instance. */
> +u64 audit_sig_cid = AUDIT_CID_UNSET;

I believe we could prevent the case mentioned above by taking an
additional reference to the audit container ID object when the signal
information is collected, dropping it only after the signal
information is collected by userspace or another process signals the
audit daemon. Yes, it would block that audit container ID from being
reused immediately, but since we are talking about one number out of
2^64 that seems like a reasonable tradeoff.
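
The idea above can be sketched in userspace as follows. All names here
(struct contobj, sig_info_record(), sig_info_collect()) are hypothetical, the
refcount is a plain int because the sketch is single-threaded, and the
"freed" pointer exists only so the lifetime can be observed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the audit container ID object. */
struct contobj {
	uint64_t id;
	int refcount;
	int *freed; /* set to 1 when the object is released */
};

static struct contobj *contobj_get(struct contobj *c)
{
	if (c)
		c->refcount++;
	return c;
}

static void contobj_put(struct contobj *c)
{
	if (!c)
		return;
	assert(c->refcount > 0);
	if (--c->refcount == 0) {
		if (c->freed)
			*c->freed = 1;
		free(c);
	}
}

/* Pending signal info pins the object until userspace collects it. */
static struct contobj *sig_cont;

static void sig_info_record(struct contobj *sender_cont)
{
	/* take the new reference first in case both pointers are equal */
	struct contobj *newc = contobj_get(sender_cont);

	contobj_put(sig_cont); /* a newer signal replaces the old one */
	sig_cont = newc;
}

static uint64_t sig_info_collect(void)
{
	uint64_t id = sig_cont ? sig_cont->id : UINT64_MAX; /* (u64)-1 = unset */

	contobj_put(sig_cont);
	sig_cont = NULL;
	return id;
}
```

While the signal information is pending, the ID cannot be released and
therefore cannot be reused, even if the last task in the container exits.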

--
paul moore
http://www.paul-moore.com

2020-01-22 21:30:55

by Paul Moore

Subject: Re: [PATCH ghak90 V8 04/16] audit: convert to contid list to check for orch/engine ownership

On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
>
> Store the audit container identifier in a refcounted kernel object that
> is added to the master list of audit container identifiers. This will
> allow multiple container orchestrators/engines to work on the same
> machine without danger of inadvertently re-using an existing identifier.
> It will also allow an orchestrator to inject a process into an existing
> container by checking if the original container owner is the one
> injecting the task. A hash table list is used to optimize searches.
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> include/linux/audit.h | 14 ++++++--
> kernel/audit.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++---
> kernel/audit.h | 8 +++++
> 3 files changed, 112 insertions(+), 8 deletions(-)

...

> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index a045b34ecf44..0e6dbe943ae4 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -94,10 +94,18 @@ struct audit_ntp_data {
> struct audit_ntp_data {};
> #endif
>
> +struct audit_contobj {
> + struct list_head list;
> + u64 id;
> + struct task_struct *owner;
> + refcount_t refcount;
> + struct rcu_head rcu;
> +};
> +
> struct audit_task_info {
> kuid_t loginuid;
> unsigned int sessionid;
> - u64 contid;
> + struct audit_contobj *cont;
> #ifdef CONFIG_AUDITSYSCALL
> struct audit_context *ctx;
> #endif
> @@ -203,9 +211,9 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
>
> static inline u64 audit_get_contid(struct task_struct *tsk)
> {
> - if (!tsk->audit)
> + if (!tsk->audit || !tsk->audit->cont)
> return AUDIT_CID_UNSET;
> - return tsk->audit->contid;
> + return tsk->audit->cont->id;
> }
>
> extern u32 audit_enabled;
> diff --git a/kernel/audit.c b/kernel/audit.c
> index 2d7707426b7d..4bab20f5f781 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -212,6 +218,31 @@ void __init audit_task_init(void)
> 0, SLAB_PANIC, NULL);
> }
>
> +static struct audit_contobj *_audit_contobj(struct task_struct *tsk)
> +{
> + if (!tsk->audit)
> + return NULL;
> + return tsk->audit->cont;

It seems like it would be safer to grab a reference here (e.g.
_audit_contobj_hold(...)), yes? Or are you confident we will never
have tsk go away while the caller is holding on to the returned audit
container ID object?

> +}
> +
> +/* audit_contobj_list_lock must be held by caller unless new */
> +static void _audit_contobj_hold(struct audit_contobj *cont)
> +{
> + refcount_inc(&cont->refcount);
> +}

If we are going to call the matching decrement function "_put" it
seems like we might want to call the function above "_get". Further,
we can also have it return an audit_contobj pointer in case the caller
needs to do an assignment as well (which seems typical if you need to
bump the refcount):

static struct audit_contobj *_audit_contobj_get(struct audit_contobj *cont)
{
	if (cont)
		refcount_inc(&cont->refcount);
	return cont;
}

> +/* audit_contobj_list_lock must be held by caller */
> +static void _audit_contobj_put(struct audit_contobj *cont)
> +{
> + if (!cont)
> + return;
> + if (refcount_dec_and_test(&cont->refcount)) {
> + put_task_struct(cont->owner);
> + list_del_rcu(&cont->list);
> + kfree_rcu(cont, rcu);
> + }
> +}
> +
> /**
> * audit_alloc - allocate an audit info block for a task
> * @tsk: task
> @@ -232,7 +263,11 @@ int audit_alloc(struct task_struct *tsk)
> }
> info->loginuid = audit_get_loginuid(current);
> info->sessionid = audit_get_sessionid(current);
> - info->contid = audit_get_contid(current);
> + spin_lock(&audit_contobj_list_lock);
> + info->cont = _audit_contobj(current);
> + if (info->cont)
> + _audit_contobj_hold(info->cont);
> + spin_unlock(&audit_contobj_list_lock);

If we are taking a spinlock in order to bump the refcount, does it
really need to be a refcount_t or can we just use a normal integer?
In RCU protected lists a spinlock is usually used to protect
adds/removes to the list, not the content of individual list items.

My guess is you probably want to use the spinlock as described above
(list add/remove protection) and manipulate the refcount_t inside a
RCU read lock protected region.
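
One way to take a reference from a lockless reader is an "increment unless
zero" operation, which is the idea behind the kernel's
refcount_inc_not_zero(). The sketch below is a userspace analogue using C11
atomics; struct contobj and contobj_get_unless_zero() are hypothetical, and
there is no real RCU here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object found by a lockless (RCU-style) list walk. */
struct contobj {
	unsigned long long id;
	atomic_uint refcount;
};

/*
 * Take a reference only if the object is still live. A reader that found
 * the object without holding the list lock must not resurrect an object
 * whose count has already dropped to zero.
 */
static bool contobj_get_unless_zero(struct contobj *c)
{
	unsigned int old;

	assert(c != NULL);
	old = atomic_load(&c->refcount);
	while (old != 0) {
		/* on failure, old is reloaded with the current value */
		if (atomic_compare_exchange_weak(&c->refcount, &old, old + 1))
			return true;
	}
	return false;
}
```

With this pattern the spinlock only needs to cover list add and remove, and
the count itself is safe to touch from the read side.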

> tsk->audit = info;
>
> ret = audit_alloc_syscall(tsk);
> @@ -267,6 +302,9 @@ void audit_free(struct task_struct *tsk)
> /* Freeing the audit_task_info struct must be performed after
> * audit_log_exit() due to need for loginuid and sessionid.
> */
> + spin_lock(&audit_contobj_list_lock);
> + _audit_contobj_put(tsk->audit->cont);
> + spin_unlock(&audit_contobj_list_lock);

This is another case of refcount_t vs normal integer in a spinlock
protected region.

> info = tsk->audit;
> tsk->audit = NULL;
> kmem_cache_free(audit_task_cache, info);
> @@ -2365,6 +2406,9 @@ int audit_signal_info(int sig, struct task_struct *t)
> *
> * Returns 0 on success, -EPERM on permission failure.
> *
> + * If the original container owner goes away, no task injection is
> + * possible to an existing container.
> + *
> * Called (set) from fs/proc/base.c::proc_contid_write().
> */
> int audit_set_contid(struct task_struct *task, u64 contid)
> @@ -2381,9 +2425,12 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> }
> oldcontid = audit_get_contid(task);
> read_lock(&tasklist_lock);
> - /* Don't allow the audit containerid to be unset */
> + /* Don't allow the contid to be unset */
> if (!audit_contid_valid(contid))
> rc = -EINVAL;
> + /* Don't allow the contid to be set to the same value again */
> + else if (contid == oldcontid) {
> + rc = -EADDRINUSE;

First, is that brace a typo? It looks like it. Did this compile?

Second, can you explain why (re)setting the audit container ID to the
same value is something we need to prohibit? I'm guessing it has
something to do with explicitly set vs inherited, but I don't want to
assume too much about your thinking behind this.

If it is "set vs inherited", would allowing an orchestrator to
explicitly "set" an inherited audit container ID provide some level of
protection against moving the task?

> /* if we don't have caps, reject */
> else if (!capable(CAP_AUDIT_CONTROL))
> rc = -EPERM;
> @@ -2396,8 +2443,49 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> else if (audit_contid_set(task))
> rc = -ECHILD;
> read_unlock(&tasklist_lock);
> - if (!rc)
> - task->audit->contid = contid;
> + if (!rc) {
> + struct audit_contobj *oldcont = _audit_contobj(task);
> + struct audit_contobj *cont = NULL, *newcont = NULL;
> + int h = audit_hash_contid(contid);
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(cont, &audit_contid_hash[h], list)
> + if (cont->id == contid) {
> + /* task injection to existing container */
> + if (current == cont->owner) {

Do we have any protection against the task pointed to by cont->owner
going away and a new task with the same current pointer value (no
longer the legitimate audit container ID owner) manipulating the
target task's audit container ID?

> + spin_lock(&audit_contobj_list_lock);
> + _audit_contobj_hold(cont);
> + spin_unlock(&audit_contobj_list_lock);

More of the refcount_t vs integer/spinlock question.

> + newcont = cont;
> + } else {
> + rc = -ENOTUNIQ;
> + goto conterror;
> + }
> + break;
> + }
> + if (!newcont) {
> + newcont = kmalloc(sizeof(*newcont), GFP_ATOMIC);
> + if (newcont) {
> + INIT_LIST_HEAD(&newcont->list);
> + newcont->id = contid;
> + get_task_struct(current);
> + newcont->owner = current;

newcont->owner = get_task_struct(current);

(This is what I was talking about above with returning the struct
pointer in the _get/_hold function)

> + refcount_set(&newcont->refcount, 1);
> + spin_lock(&audit_contobj_list_lock);
> + list_add_rcu(&newcont->list, &audit_contid_hash[h]);
> + spin_unlock(&audit_contobj_list_lock);

I think we might have a problem where multiple tasks could race adding
the same audit container ID and since there is no check inside the
spinlock protected region we could end up adding multiple instances of
the same audit container ID, yes?
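
A common way to close this kind of race is to repeat the lookup inside the
locked region and only insert if the ID is still absent. The userspace sketch
below shows that shape with a C11 atomic_flag as a stand-in spinlock and a
single list instead of the hash table; every name in it is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct contobj {
	uint64_t id;
	struct contobj *next;
};

static struct contobj *cont_list; /* hypothetical single bucket */
static atomic_flag cont_list_lock = ATOMIC_FLAG_INIT;

static void list_lock(void)
{
	while (atomic_flag_test_and_set(&cont_list_lock))
		; /* spin */
}

static void list_unlock(void)
{
	atomic_flag_clear(&cont_list_lock);
}

/*
 * Return the entry for id, creating it if needed. The new entry is
 * allocated before the lock is taken, and the duplicate check happens
 * inside the lock so two racing callers cannot both insert.
 */
static struct contobj *contobj_find_or_add(uint64_t id, int *created)
{
	struct contobj *c, *newc = malloc(sizeof(*newc));

	assert(created != NULL);
	*created = 0;
	list_lock();
	for (c = cont_list; c; c = c->next)
		if (c->id == id)
			break;
	if (!c && newc) {
		newc->id = id;
		newc->next = cont_list;
		cont_list = newc;
		c = newc;
		newc = NULL;
		*created = 1;
	}
	list_unlock();
	free(newc); /* unused preallocation, or NULL */
	return c;   /* NULL only if the ID was absent and allocation failed */
}
```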

> + } else {
> + rc = -ENOMEM;
> + goto conterror;
> + }
> + }
> + task->audit->cont = newcont;
> + spin_lock(&audit_contobj_list_lock);
> + _audit_contobj_put(oldcont);
> + spin_unlock(&audit_contobj_list_lock);

More of the refcount_t/integer/spinlock question.


> +conterror:
> + rcu_read_unlock();
> + }
> task_unlock(task);
>
> if (!audit_enabled)

--
paul moore
http://www.paul-moore.com

2020-01-22 21:30:58

by Paul Moore

Subject: Re: [PATCH ghak90 V8 12/16] audit: contid check descendancy and nesting

On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
>
> Require the target task to be a descendant of the container
> orchestrator/engine.
>
> You would only change the audit container ID from one set or inherited
> value to another if you were nesting containers.
>
> If changing the contid, the container orchestrator/engine must be a
> descendant and not same orchestrator as the one that set it so it is not
> possible to change the contid of another orchestrator's container.
>
> Since the task_is_descendant() function is used in YAMA and in audit,
> remove the duplication and pull the function into kernel/sched/core.c
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> include/linux/sched.h | 3 +++
> kernel/audit.c | 44 ++++++++++++++++++++++++++++++++++++--------
> kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++++
> security/yama/yama_lsm.c | 33 ---------------------------------
> 4 files changed, 72 insertions(+), 41 deletions(-)

...

> diff --git a/kernel/audit.c b/kernel/audit.c
> index f7a8d3288ca0..ef8e07524c46 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -2603,22 +2610,43 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> oldcontid = audit_get_contid(task);
> read_lock(&tasklist_lock);
> /* Don't allow the contid to be unset */
> - if (!audit_contid_valid(contid))
> + if (!audit_contid_valid(contid)) {
> rc = -EINVAL;
> + goto unlock;
> + }
> /* Don't allow the contid to be set to the same value again */
> - else if (contid == oldcontid) {
> + if (contid == oldcontid) {
> rc = -EADDRINUSE;
> + goto unlock;
> + }
> /* if we don't have caps, reject */
> - else if (!capable(CAP_AUDIT_CONTROL))
> + if (!capable(CAP_AUDIT_CONTROL)) {
> rc = -EPERM;
> - /* if task has children or is not single-threaded, deny */
> - else if (!list_empty(&task->children))
> + goto unlock;
> + }
> + /* if task has children, deny */
> + if (!list_empty(&task->children)) {
> rc = -EBUSY;
> - else if (!(thread_group_leader(task) && thread_group_empty(task)))
> + goto unlock;
> + }
> + /* if task is not single-threaded, deny */
> + if (!(thread_group_leader(task) && thread_group_empty(task))) {
> rc = -EALREADY;
> - /* if contid is already set, deny */
> - else if (audit_contid_set(task))
> + goto unlock;
> + }

It seems like the if/else-if conversion above should be part of an
earlier patchset.

> + /* if task is not descendant, block */
> + if (task == current) {
> + rc = -EBADSLT;
> + goto unlock;
> + }
> + if (!task_is_descendant(current, task)) {
> + rc = -EXDEV;
> + goto unlock;
> + }

I understand you are trying to provide a unique error code for each
failure case, but this is getting silly. Let's group the descendant
checks under the same error code.
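
A minimal userspace sketch of the grouped check follows. The task model and
both helpers are hypothetical, and the choice of -EXDEV as the shared code is
only an assumption for illustration; the review does not name one.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task model: only the parent link matters here. */
struct task_model {
	struct task_model *parent;
};

/* True if child has ancestor somewhere above it in the parent chain. */
static bool is_descendant(const struct task_model *ancestor,
			  const struct task_model *child)
{
	const struct task_model *t;

	for (t = child->parent; t; t = t->parent)
		if (t == ancestor)
			return true;
	return false;
}

static int contid_check_descendant(const struct task_model *current_task,
				   const struct task_model *target)
{
	assert(current_task != NULL && target != NULL);
	/* the caller itself and any non-descendant share one error code */
	if (target == current_task || !is_descendant(current_task, target))
		return -EXDEV;
	return 0;
}
```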

> + /* only allow contid setting again if nesting */
> + if (audit_contid_set(task) && audit_contid_isowner(task))
> rc = -ECHILD;

Should that be "!audit_contid_isowner()"?

--
paul moore
http://www.paul-moore.com

2020-01-22 21:31:38

by Paul Moore

Subject: Re: [PATCH ghak90 V8 15/16] audit: check contid count per netns and add config param limit

On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
>
> Clamp the number of audit container identifiers associated with a
> network namespace to limit the netlink and disk bandwidth used and to
> prevent losing information from record text size overflow in the contid
> field.
>
> Add a configuration parameter AUDIT_STATUS_CONTID_NETNS_LIMIT (0x100)
> to set the audit container identifier netns limit. This is used to
> prevent overflow of the contid field in CONTAINER_OP and CONTAINER_ID
> messages, losing information, and to limit bandwidth used by these
> messages.
>
> This value must be balanced with the audit container identifier nesting
> depth limit to multiply out to no more than 400. This is determined by
> the total audit message length less message overhead divided by the
> length of the text representation of an audit container identifier.
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> include/linux/audit.h | 16 +++++++----
> include/linux/nsproxy.h | 2 +-
> include/uapi/linux/audit.h | 2 ++
> kernel/audit.c | 68 ++++++++++++++++++++++++++++++++++++++--------
> kernel/audit.h | 7 +++++
> kernel/fork.c | 10 +++++--
> kernel/nsproxy.c | 27 +++++++++++++++---
> 7 files changed, 107 insertions(+), 25 deletions(-)

Similar to my comments in patch 14, let's defer this to a later time
if we need to do this at all.

--
paul moore
http://www.paul-moore.com

2020-01-22 21:32:07

by Paul Moore

Subject: Re: [PATCH ghak90 V8 11/16] audit: add support for containerid to network namespaces

On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
>
> This also adds support to qualify NETFILTER_PKT records.
>
> Audit events could happen in a network namespace outside of a task
> context due to packets received from the net that trigger an auditing
> rule prior to being associated with a running task. The network
> namespace could be in use by multiple containers by association to the
> tasks in that network namespace. We still want a way to attribute
> these events to any potential containers. Keep a list per network
> namespace to track these audit container identifiers.
>
> Add/increment the audit container identifier on:
> - initial setting of the audit container identifier via /proc
> - clone/fork call that inherits an audit container identifier
> - unshare call that inherits an audit container identifier
> - setns call that inherits an audit container identifier
> Delete/decrement the audit container identifier on:
> - an inherited audit container identifier dropped when child set
> - process exit
> - unshare call that drops a net namespace
> - setns call that drops a net namespace
>
> Add audit container identifier auxiliary record(s) to NETFILTER_PKT
> event standalone records. Iterate through all potential audit container
> identifiers associated with a network namespace.
>
> Please see the github audit kernel issue for contid net support:
> https://github.com/linux-audit/audit-kernel/issues/92
> Please see the github audit testsuite issue for the test case:
> https://github.com/linux-audit/audit-testsuite/issues/64
> Please see the github audit wiki for the feature overview:
> https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> Signed-off-by: Richard Guy Briggs <[email protected]>
> Acked-by: Neil Horman <[email protected]>
> Reviewed-by: Ondrej Mosnacek <[email protected]>
> ---
> include/linux/audit.h | 24 +++++++++
> kernel/audit.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++-
> kernel/nsproxy.c | 4 ++
> net/netfilter/nft_log.c | 11 +++-
> net/netfilter/xt_AUDIT.c | 11 +++-
> 5 files changed, 176 insertions(+), 6 deletions(-)

...

> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index 5531d37a4226..ed8d5b74758d 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -12,6 +12,7 @@
> #include <linux/sched.h>
> #include <linux/ptrace.h>
> #include <uapi/linux/audit.h>
> +#include <linux/refcount.h>
>
> #define AUDIT_INO_UNSET ((unsigned long)-1)
> #define AUDIT_DEV_UNSET ((dev_t)-1)
> @@ -121,6 +122,13 @@ struct audit_task_info {
>
> extern struct audit_task_info init_struct_audit;
>
> +struct audit_contobj_netns {
> + struct list_head list;
> + u64 id;

Since we now track audit container IDs in their own structure, why not
link directly to the audit container ID object (and bump the
refcount)?

> + refcount_t refcount;
> + struct rcu_head rcu;
> +};
> +
> extern int is_audit_feature_set(int which);
>
> extern int __init audit_register_class(int class, unsigned *list);
> @@ -225,6 +233,12 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
> }
>
> extern void audit_log_container_id(struct audit_context *context, u64 contid);
> +extern void audit_netns_contid_add(struct net *net, u64 contid);
> +extern void audit_netns_contid_del(struct net *net, u64 contid);
> +extern void audit_switch_task_namespaces(struct nsproxy *ns,
> + struct task_struct *p);
> +extern void audit_log_netns_contid_list(struct net *net,
> + struct audit_context *context);
>
> extern u32 audit_enabled;
>
> @@ -297,6 +311,16 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
>
> static inline void audit_log_container_id(struct audit_context *context, u64 contid)
> { }
> +static inline void audit_netns_contid_add(struct net *net, u64 contid)
> +{ }
> +static inline void audit_netns_contid_del(struct net *net, u64 contid)
> +{ }
> +static inline void audit_switch_task_namespaces(struct nsproxy *ns,
> + struct task_struct *p)
> +{ }
> +static inline void audit_log_netns_contid_list(struct net *net,
> + struct audit_context *context)
> +{ }
>
> #define audit_enabled AUDIT_OFF
>
> diff --git a/kernel/audit.c b/kernel/audit.c
> index d4e6eafe5644..f7a8d3288ca0 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -59,6 +59,7 @@
> #include <linux/freezer.h>
> #include <linux/pid_namespace.h>
> #include <net/netns/generic.h>
> +#include <net/net_namespace.h>
>
> #include "audit.h"
>
> @@ -86,9 +87,13 @@
> /**
> * struct audit_net - audit private network namespace data
> * @sk: communication socket
> + * @contid_list: audit container identifier list
> * @contid_list_lock: audit container identifier list lock
> */
> struct audit_net {
> struct sock *sk;
> + struct list_head contid_list;
> + spinlock_t contid_list_lock;
> };
>
> /**
> @@ -305,8 +310,11 @@ struct audit_task_info init_struct_audit = {
> void audit_free(struct task_struct *tsk)
> {
> struct audit_task_info *info = tsk->audit;
> + struct nsproxy *ns = tsk->nsproxy;
>
> audit_free_syscall(tsk);
> + if (ns)
> + audit_netns_contid_del(ns->net_ns, audit_get_contid(tsk));
> /* Freeing the audit_task_info struct must be performed after
> * audit_log_exit() due to need for loginuid and sessionid.
> */
> @@ -409,6 +417,120 @@ static struct sock *audit_get_sk(const struct net *net)
> return aunet->sk;
> }
>
> +void audit_netns_contid_add(struct net *net, u64 contid)
> +{
> + struct audit_net *aunet;
> + struct list_head *contid_list;
> + struct audit_contobj_netns *cont;
> +
> + if (!net)
> + return;
> + if (!audit_contid_valid(contid))
> + return;
> + aunet = net_generic(net, audit_net_id);
> + if (!aunet)
> + return;
> + contid_list = &aunet->contid_list;
> + rcu_read_lock();
> + list_for_each_entry_rcu(cont, contid_list, list)
> + if (cont->id == contid) {
> + spin_lock(&aunet->contid_list_lock);
> + refcount_inc(&cont->refcount);
> + spin_unlock(&aunet->contid_list_lock);
> + goto out;
> + }
> + cont = kmalloc(sizeof(*cont), GFP_ATOMIC);
> + if (cont) {
> + INIT_LIST_HEAD(&cont->list);
> + cont->id = contid;
> + refcount_set(&cont->refcount, 1);
> + spin_lock(&aunet->contid_list_lock);
> + list_add_rcu(&cont->list, contid_list);
> + spin_unlock(&aunet->contid_list_lock);
> + }
> +out:
> + rcu_read_unlock();
> +}

See my comments about refcount_t, spinlocks, and list manipulation
races from earlier in the patchset; the same thing applies to the
function above.


--
paul moore
http://www.paul-moore.com

2020-01-22 21:32:25

by Paul Moore

Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
>
> Track the parent container of a container to be able to filter and
> report nesting.
>
> Now that we have a way to track and check the parent container of a
> container, modify the contid field format to be able to report that
> nesting using a caret ("^") separator to indicate nesting. The
> original field format was "contid=<contid>" for task-associated records
> and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> records. The new field format is
> "contid=<contid>[^<contid>[...]][,<contid>[...]]".

Let's make sure we always use a comma as a separator, even when
recording the parent information, for example:
"contid=<contid>[,^<contid>[...]][,<contid>[...]]"

> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> include/linux/audit.h | 1 +
> kernel/audit.c | 53 +++++++++++++++++++++++++++++++++++++++++++--------
> kernel/audit.h | 1 +
> kernel/auditfilter.c | 17 ++++++++++++++++-
> kernel/auditsc.c | 2 +-
> 5 files changed, 64 insertions(+), 10 deletions(-)

...

> diff --git a/kernel/audit.c b/kernel/audit.c
> index ef8e07524c46..68be59d1a89b 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c

> @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> audit_netns_contid_add(new->net_ns, contid);
> }
>
> +void audit_log_contid(struct audit_buffer *ab, u64 contid);

If we need a forward declaration, might as well just move it up near
the top of the file with the rest of the declarations.

> +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> +{
> + struct audit_contobj *cont = NULL, *prcont = NULL;
> + int h;

It seems safer to pass the audit container ID object and not the u64.

> + if (!audit_contid_valid(contid)) {
> + audit_log_format(ab, "%llu", contid);

Do we really want to print (u64)-1 here? Since this is a known
invalid number, would "?" be a better choice?
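
For illustration, a userspace sketch of the "?" option. CID_UNSET models
AUDIT_CID_UNSET as (u64)-1, and format_contid() is a hypothetical helper, not
a kernel function.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define CID_UNSET UINT64_MAX /* models AUDIT_CID_UNSET, i.e. (u64)-1 */

/* Write the text form of contid into buf; unset IDs become "?". */
static int format_contid(char *buf, size_t len, uint64_t contid)
{
	assert(buf != NULL && len > 0);
	if (contid == CID_UNSET)
		return snprintf(buf, len, "?");
	return snprintf(buf, len, "%" PRIu64, contid);
}
```

The unset case then costs one character in the record instead of the twenty
digits of 18446744073709551615.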

> + return;
> + }
> + h = audit_hash_contid(contid);
> + rcu_read_lock();
> + list_for_each_entry_rcu(cont, &audit_contid_hash[h], list)
> + if (cont->id == contid) {
> + prcont = cont;

Why not just pull the code below into the body of this if statement?
It all needs to be done under the RCU read lock anyway and the code
would read much better this way.

> + break;
> + }
> + if (!prcont) {
> + audit_log_format(ab, "%llu", contid);
> + goto out;
> + }
> + while (prcont) {
> + audit_log_format(ab, "%llu", prcont->id);
> + prcont = prcont->parent;
> + if (prcont)
> + audit_log_format(ab, "^");

In the interest of limiting the number of calls to audit_log_format(),
how about something like the following:

audit_log_format(ab, "%llu", cont->id);
iter = cont->parent;
while (iter) {
	if (iter->parent)
		audit_log_format(ab, "^%llu,", iter->id);
	else
		audit_log_format(ab, "^%llu", iter->id);
iter = iter->parent;
}
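
Read together with the comma-separated format requested earlier in this
review, one formatting call per ID is enough if each parent is written as
",^<contid>". The userspace sketch below uses snprintf() in place of
audit_log_format(); format_contid_chain() and its array argument are
hypothetical, and the exact separator layout is one reading of the proposed
format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "contid=<id>[,^<parent>[,^<grandparent>...]]" into buf.
 * ids[0] is the task's own ID; ids[1..n-1] are successive parents.
 * Returns 0 on success, -1 if buf is too small.
 */
static int format_contid_chain(char *buf, size_t len,
			       const unsigned long long *ids, size_t n)
{
	size_t used, i;
	int w;

	assert(buf != NULL && ids != NULL && n > 0);
	w = snprintf(buf, len, "contid=%llu", ids[0]);
	if (w < 0 || (size_t)w >= len)
		return -1;
	for (i = 1; i < n; i++) {
		used = strlen(buf);
		/* one call per parent, separator and caret included */
		w = snprintf(buf + used, len - used, ",^%llu", ids[i]);
		if (w < 0 || (size_t)w >= len - used)
			return -1;
	}
	return 0;
}
```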

> + }
> +out:
> + rcu_read_unlock();
> +}
> +
> /*
> * audit_log_container_id - report container info
> * @context: task or local context for record

...

> @@ -2705,9 +2741,10 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> if (!ab)
> return rc;
>
> - audit_log_format(ab,
> - "op=set opid=%d contid=%llu old-contid=%llu",
> - task_tgid_nr(task), contid, oldcontid);
> + audit_log_format(ab, "op=set opid=%d contid=", task_tgid_nr(task));
> + audit_log_contid(ab, contid);
> + audit_log_format(ab, " old-contid=");
> + audit_log_contid(ab, oldcontid);

This is an interesting case where contid and old-contid are going to
be largely the same, only the first (current) ID is going to be
different; do we want to duplicate all of those IDs?


> audit_log_end(ab);
> return rc;
> }
> @@ -2723,9 +2760,9 @@ void audit_log_container_drop(void)

--
paul moore
http://www.paul-moore.com

2020-01-22 21:32:37

by Paul Moore

Subject: Re: [PATCH ghak90 V8 14/16] audit: check contid depth and add limit config param

On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
>
> Clamp the depth of audit container identifier nesting to limit the
> netlink and disk bandwidth used and to prevent losing information from
> record text size overflow in the contid field.
>
> Add a configuration parameter AUDIT_STATUS_CONTID_DEPTH_LIMIT (0x80) to
> set the audit container identifier depth limit. This can be used to
> prevent overflow of the contid field in CONTAINER_OP and CONTAINER_ID
> messages, losing information, and to limit bandwidth used by these
> messages.
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> include/uapi/linux/audit.h | 2 ++
> kernel/audit.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> kernel/audit.h | 2 ++
> 3 files changed, 50 insertions(+)

Since setting an audit container ID, and hence acting as an
orchestrator and creating a new nested level of audit container IDs,
is a privileged operation I think we can equate this to the infamous
"shooting oneself in the foot" problem. Let's leave this limitation
out of the patchset for now, if it becomes a problem in the future we
can consider restricting the nesting depth.

--
paul moore
http://www.paul-moore.com

2020-01-22 21:32:51

by Paul Moore

Subject: Re: [PATCH ghak90 V8 16/16] audit: add capcontid to set contid outside init_user_ns

On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
>
> Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> process in a non-init user namespace the capability to set audit
> container identifiers.
>
> Provide /proc/$PID/audit_capcontid interface to capcontid.
> Valid values are: 1==enabled, 0==disabled

It would be good to be more explicit about "enabled" and "disabled" in
the commit description. For example, which setting allows the target
task to set audit container IDs of its child processes?

> Report this action in message type AUDIT_SET_CAPCONTID 1022 with fields
> opid= capcontid= old-capcontid=
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> fs/proc/base.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/audit.h | 14 ++++++++++++
> include/uapi/linux/audit.h | 1 +
> kernel/audit.c | 35 +++++++++++++++++++++++++++++
> 4 files changed, 105 insertions(+)

...

> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 26091800180c..283ef8e006e7 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1360,6 +1360,59 @@ static ssize_t proc_contid_write(struct file *file, const char __user *buf,
> .write = proc_contid_write,
> .llseek = generic_file_llseek,
> };
> +
> +static ssize_t proc_capcontid_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct inode *inode = file_inode(file);
> + struct task_struct *task = get_proc_task(inode);
> + ssize_t length;
> + char tmpbuf[TMPBUFLEN];
> +
> + if (!task)
> + return -ESRCH;
> + /* if we don't have caps, reject */
> + if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
> + return -EPERM;
> + length = scnprintf(tmpbuf, TMPBUFLEN, "%u", audit_get_capcontid(task));
> + put_task_struct(task);
> + return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
> +}
> +
> +static ssize_t proc_capcontid_write(struct file *file, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct inode *inode = file_inode(file);
> + u32 capcontid;
> + int rv;
> + struct task_struct *task = get_proc_task(inode);
> +
> + if (!task)
> + return -ESRCH;
> + if (*ppos != 0) {
> + /* No partial writes. */
> + put_task_struct(task);
> + return -EINVAL;
> + }
> +
> + rv = kstrtou32_from_user(buf, count, 10, &capcontid);
> + if (rv < 0) {
> + put_task_struct(task);
> + return rv;
> + }
> +
> + rv = audit_set_capcontid(task, capcontid);
> + put_task_struct(task);
> + if (rv < 0)
> + return rv;
> + return count;
> +}
> +
> +static const struct file_operations proc_capcontid_operations = {
> + .read = proc_capcontid_read,
> + .write = proc_capcontid_write,
> + .llseek = generic_file_llseek,
> +};
> #endif
>
> #ifdef CONFIG_FAULT_INJECTION
> @@ -3121,6 +3174,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
> REG("sessionid", S_IRUGO, proc_sessionid_operations),
> REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
> + REG("audit_capcontainerid", S_IWUSR|S_IRUSR|S_IRUSR, proc_capcontid_operations),
> #endif
> #ifdef CONFIG_FAULT_INJECTION
> REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> @@ -3522,6 +3576,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
> REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
> REG("sessionid", S_IRUGO, proc_sessionid_operations),
> REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
> + REG("audit_capcontainerid", S_IWUSR|S_IRUSR|S_IRUSR, proc_capcontid_operations),
> #endif
> #ifdef CONFIG_FAULT_INJECTION
> REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index 28b9c7cd86a6..62c453306c2a 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -116,6 +116,7 @@ struct audit_task_info {
> kuid_t loginuid;
> unsigned int sessionid;
> struct audit_contobj *cont;
> + u32 capcontid;

Where is the code change that actually uses this to enforce the
described policy on setting an audit container ID?

> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index 2844d78cd7af..01251e6dcec0 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -73,6 +73,7 @@
> #define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */
> #define AUDIT_CONTAINER_OP 1020 /* Define the container id and info */
> #define AUDIT_SIGNAL_INFO2 1021 /* Get info auditd signal sender */
> +#define AUDIT_SET_CAPCONTID 1022 /* Set cap_contid of a task */
>
> #define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */
> #define AUDIT_USER_AVC 1107 /* We filter this differently */
> diff --git a/kernel/audit.c b/kernel/audit.c
> index 1287f0b63757..1c22dd084ae8 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -2698,6 +2698,41 @@ static bool audit_contid_isowner(struct task_struct *tsk)
> return false;
> }
>
> +int audit_set_capcontid(struct task_struct *task, u32 enable)
> +{
> + u32 oldcapcontid;
> + int rc = 0;
> + struct audit_buffer *ab;
> +
> + if (!task->audit)
> + return -ENOPROTOOPT;
> + oldcapcontid = audit_get_capcontid(task);
> + /* if task is not descendant, block */
> + if (task == current)
> + rc = -EBADSLT;
> + else if (!task_is_descendant(current, task))
> + rc = -EXDEV;

See my previous comments about error code sanity.

> + else if (current_user_ns() == &init_user_ns) {
> + if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
> + rc = -EPERM;

I think we just want to use ns_capable() in the context of the current
userns to check CAP_AUDIT_CONTROL, yes? Something like this ...

  if (current_user_ns() != &init_user_ns) {
    if (!ns_capable(CAP_AUDIT_CONTROL) || !audit_get_capcontid())
      rc = -EPERM;
  } else if (!capable(CAP_AUDIT_CONTROL))
    rc = -EPERM;
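
For readers following along, the decision described above can be modelled in plain userspace C. This is only a sketch: struct caller_state and capcontid_may_set() are hypothetical names, and the four booleans stand in for what the kernel would obtain from current_user_ns(), capable(), ns_capable() and audit_get_capcontid(current). It illustrates the shape of the check, not the code that was eventually merged.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the calling task's state. */
struct caller_state {
	bool in_init_userns; /* caller lives in the initial user namespace */
	bool cap_init;       /* CAP_AUDIT_CONTROL in the initial user namespace */
	bool cap_ns;         /* CAP_AUDIT_CONTROL in the caller's own user namespace */
	bool capcontid;      /* caller's own capcontid flag is set */
};

/* Returns 0 if the caller may change a task's capcontid, -EPERM otherwise. */
static int capcontid_may_set(const struct caller_state *c)
{
	if (!c->in_init_userns) {
		/* Outside init_user_ns: need the namespaced capability
		 * and a prior capcontid grant. */
		if (!c->cap_ns || !c->capcontid)
			return -EPERM;
	} else if (!c->cap_init) {
		/* Inside init_user_ns: the plain capability decides. */
		return -EPERM;
	}
	return 0;
}
```

Under this model a task in a child user namespace needs both the namespaced capability and the capcontid grant, while a task in the initial namespace is governed by CAP_AUDIT_CONTROL alone.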

> + }
> + if (!rc)
> + task->audit->capcontid = enable;
> +
> + if (!audit_enabled)
> + return rc;
> +
> + ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_SET_CAPCONTID);
> + if (!ab)
> + return rc;
> +
> + audit_log_format(ab,
> + "opid=%d capcontid=%u old-capcontid=%u",
> + task_tgid_nr(task), enable, oldcapcontid);
> + audit_log_end(ab);

My prior comments about recording the success/failure, or not emitting
the record on failure, seem relevant here too.

> + return rc;
> +}

--
paul moore
http://www.paul-moore.com

2020-01-23 16:31:03

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-01-22 16:28, Paul Moore wrote:
> On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
> >
> > Add audit container identifier support to the action of signalling the
> > audit daemon.
> >
> > Since this would need to add an element to the audit_sig_info struct,
> > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > audit_sig_info2 struct. Corresponding support is required in the
> > userspace code to reflect the new record request and reply type.
> > An older userspace won't break since it won't know to request this
> > record type.
> >
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > ---
> > include/linux/audit.h | 7 +++++++
> > include/uapi/linux/audit.h | 1 +
> > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++
> > kernel/audit.h | 1 +
> > security/selinux/nlmsgtab.c | 1 +
> > 5 files changed, 45 insertions(+)
>
> ...
>
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index 0871c3e5d6df..51159c94041c 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -126,6 +126,14 @@ struct auditd_connection {
> > kuid_t audit_sig_uid = INVALID_UID;
> > pid_t audit_sig_pid = -1;
> > u32 audit_sig_sid = 0;
> > +/* Since the signal information is stored in the record buffer at the
> > + * time of the signal, but not retrieved until later, there is a chance
> > + * that the last process in the container could terminate before the
> > + * signal record is delivered. In this circumstance, there is a chance
> > + * the orchestrator could reuse the audit container identifier, causing
> > + * an overlap of audit records that refer to the same audit container
> > + * identifier, but a different container instance. */
> > +u64 audit_sig_cid = AUDIT_CID_UNSET;
>
> I believe we could prevent the case mentioned above by taking an
> additional reference to the audit container ID object when the signal
> information is collected, dropping it only after the signal
> information is collected by userspace or another process signals the
> audit daemon. Yes, it would block that audit container ID from being
> reused immediately, but since we are talking about one number out of
> 2^64 that seems like a reasonable tradeoff.

I had thought that through and should have been more explicit about that
situation when I documented it. We could do that, but then the syscall
records would be connected with the call from auditd on shutdown to
request that signal information, rather than the exit of that last
process that was using that container. This strikes me as misleading.
Is that really what we want?

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-01-23 17:54:51

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <[email protected]> wrote:
> On 2020-01-22 16:28, Paul Moore wrote:
> > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
> > >
> > > Add audit container identifier support to the action of signalling the
> > > audit daemon.
> > >
> > > Since this would need to add an element to the audit_sig_info struct,
> > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > audit_sig_info2 struct. Corresponding support is required in the
> > > userspace code to reflect the new record request and reply type.
> > > An older userspace won't break since it won't know to request this
> > > record type.
> > >
> > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > ---
> > > include/linux/audit.h | 7 +++++++
> > > include/uapi/linux/audit.h | 1 +
> > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++
> > > kernel/audit.h | 1 +
> > > security/selinux/nlmsgtab.c | 1 +
> > > 5 files changed, 45 insertions(+)
> >
> > ...
> >
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index 0871c3e5d6df..51159c94041c 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > > @@ -126,6 +126,14 @@ struct auditd_connection {
> > > kuid_t audit_sig_uid = INVALID_UID;
> > > pid_t audit_sig_pid = -1;
> > > u32 audit_sig_sid = 0;
> > > +/* Since the signal information is stored in the record buffer at the
> > > + * time of the signal, but not retrieved until later, there is a chance
> > > + * that the last process in the container could terminate before the
> > > + * signal record is delivered. In this circumstance, there is a chance
> > > + * the orchestrator could reuse the audit container identifier, causing
> > > + * an overlap of audit records that refer to the same audit container
> > > + * identifier, but a different container instance. */
> > > +u64 audit_sig_cid = AUDIT_CID_UNSET;
> >
> > I believe we could prevent the case mentioned above by taking an
> > additional reference to the audit container ID object when the signal
> > information is collected, dropping it only after the signal
> > information is collected by userspace or another process signals the
> > audit daemon. Yes, it would block that audit container ID from being
> > reused immediately, but since we are talking about one number out of
> > 2^64 that seems like a reasonable tradeoff.
>
> I had thought that through and should have been more explicit about that
> situation when I documented it. We could do that, but then the syscall
> records would be connected with the call from auditd on shutdown to
> request that signal information, rather than the exit of that last
> process that was using that container. This strikes me as misleading.
> Is that really what we want?

???

I think one of us is not understanding the other; maybe it's me, maybe
it's you, maybe it's both of us.

Anyway, here is what I was trying to convey with my original comment
... When we record the audit container ID in audit_signal_info() we
take an extra reference to the audit container ID object so that it
will not disappear (and get reused) until after we respond with an
AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the
AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in
audit_signal_info(). Unless I'm missing some other change you made,
this *shouldn't* affect the syscall records, all it does is preserve
the audit container ID object in the kernel's ACID store so it doesn't
get reused.

(We do need to do some extra housekeeping in audit_signal_info() to
deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 -
basically if audit_sig_cid is not NULL we should drop a reference
before assigning it a new object pointer, and of course we would need
to set audit_sig_cid to NULL in audit_receive_msg() after sending it
up to userspace and dropping the extra ref.)
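
The housekeeping described above can be sketched as a small userspace model. Everything here is an assumption made for illustration: struct contobj, the plain integer refcount, CID_UNSET and the model_*() helpers are hypothetical stand-ins, and a real implementation would need locking and the kernel's own refcount primitives.

```c
#include <stddef.h>

#define CID_UNSET (~0ULL) /* mirrors the "unset" value 18446744073709551615 */

/* Hypothetical stand-in for the audit container ID object. */
struct contobj {
	unsigned long long id;
	int refcount;
};

/* Object pinned on behalf of the last signal sent to auditd, or NULL. */
static struct contobj *audit_sig_cid;

static void contobj_get(struct contobj *c) { if (c) c->refcount++; }
static void contobj_put(struct contobj *c) { if (c) c->refcount--; }

/* Models audit_signal_info(): pin the sender's container object. */
static void model_signal_info(struct contobj *sender_cont)
{
	contobj_get(sender_cont);   /* take the new ref first (same-object case) */
	contobj_put(audit_sig_cid); /* previous signal info was never fetched */
	audit_sig_cid = sender_cont;
}

/* Models the AUDIT_SIGNAL_INFO2 reply path in audit_receive_msg(). */
static unsigned long long model_receive_siginfo2(void)
{
	unsigned long long id = audit_sig_cid ? audit_sig_cid->id : CID_UNSET;

	contobj_put(audit_sig_cid); /* drop the extra ref taken at signal time */
	audit_sig_cid = NULL;
	return id;
}
```

The point of the model is that the slot never holds more than one extra reference: a second signal releases the first pin, and a fetch releases whatever is pinned and clears the slot.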

--
paul moore
http://www.paul-moore.com

2020-01-23 20:11:06

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-01-23 12:09, Paul Moore wrote:
> On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-01-22 16:28, Paul Moore wrote:
> > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
> > > >
> > > > Add audit container identifier support to the action of signalling the
> > > > audit daemon.
> > > >
> > > > Since this would need to add an element to the audit_sig_info struct,
> > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > > audit_sig_info2 struct. Corresponding support is required in the
> > > > userspace code to reflect the new record request and reply type.
> > > > An older userspace won't break since it won't know to request this
> > > > record type.
> > > >
> > > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > > ---
> > > > include/linux/audit.h | 7 +++++++
> > > > include/uapi/linux/audit.h | 1 +
> > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++
> > > > kernel/audit.h | 1 +
> > > > security/selinux/nlmsgtab.c | 1 +
> > > > 5 files changed, 45 insertions(+)
> > >
> > > ...
> > >
> > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > index 0871c3e5d6df..51159c94041c 100644
> > > > --- a/kernel/audit.c
> > > > +++ b/kernel/audit.c
> > > > @@ -126,6 +126,14 @@ struct auditd_connection {
> > > > kuid_t audit_sig_uid = INVALID_UID;
> > > > pid_t audit_sig_pid = -1;
> > > > u32 audit_sig_sid = 0;
> > > > +/* Since the signal information is stored in the record buffer at the
> > > > + * time of the signal, but not retrieved until later, there is a chance
> > > > + * that the last process in the container could terminate before the
> > > > + * signal record is delivered. In this circumstance, there is a chance
> > > > + * the orchestrator could reuse the audit container identifier, causing
> > > > + * an overlap of audit records that refer to the same audit container
> > > > + * identifier, but a different container instance. */
> > > > +u64 audit_sig_cid = AUDIT_CID_UNSET;
> > >
> > > I believe we could prevent the case mentioned above by taking an
> > > additional reference to the audit container ID object when the signal
> > > information is collected, dropping it only after the signal
> > > information is collected by userspace or another process signals the
> > > audit daemon. Yes, it would block that audit container ID from being
> > > reused immediately, but since we are talking about one number out of
> > > 2^64 that seems like a reasonable tradeoff.
> >
> > I had thought that through and should have been more explicit about that
> > situation when I documented it. We could do that, but then the syscall
> > records would be connected with the call from auditd on shutdown to
> > request that signal information, rather than the exit of that last
> > process that was using that container. This strikes me as misleading.
> > Is that really what we want?
>
> ???
>
> I think one of us is not understanding the other; maybe it's me, maybe
> it's you, maybe it's both of us.
>
> Anyway, here is what I was trying to convey with my original comment
> ... When we record the audit container ID in audit_signal_info() we
> take an extra reference to the audit container ID object so that it
> will not disappear (and get reused) until after we respond with an
> AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the
> AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in
> audit_signal_info(). Unless I'm missing some other change you made,
> this *shouldn't* affect the syscall records, all it does is preserve
> the audit container ID object in the kernel's ACID store so it doesn't
> get reused.

This is exactly what I had understood. I hadn't looked closely at the
extra housekeeping below because of my original syscall concern, but it
makes sense.

The syscall I refer to is the one connected with the drop of the
audit container identifier by the last process that was in that
container in patch 5/16. The production of this record is contingent on
the last ref on a contobj being dropped. So if that last ref is the one
held by audit_signal_info() until the AUDIT_SIGNAL_INFO2 record is
fetched, then it will appear that the fetch action closed the container
rather than the exit of the last process in the container.

Does this make sense?
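
To make the concern concrete, here is a tiny userspace model of the attribution problem. The names (enum actor, struct contobj_model, model_put()) are hypothetical; the only behaviour taken from the discussion is that the death record is produced by whoever drops the last reference.

```c
/* Who performed the put that released the last reference. */
enum actor { ACTOR_NONE, ACTOR_TASK_EXIT, ACTOR_SIGINFO_FETCH };

struct contobj_model {
	int refcount;
	enum actor death_reported_by; /* ACTOR_NONE while the object is alive */
};

/* Drop one reference; the actor dropping the last one "owns" the death record. */
static void model_put(struct contobj_model *c, enum actor who)
{
	if (--c->refcount == 0)
		c->death_reported_by = who;
}
```

With one reference held by the last task and one pinned for the signal information, the task's exit leaves the object alive, and the later fetch is what ends up associated with the container's death.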

> (We do need to do some extra housekeeping in audit_signal_info() to
> deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 -
> basically if audit_sig_cid is not NULL we should drop a reference
> before assigning it a new object pointer, and of course we would need
> to set audit_sig_cid to NULL in audit_receive_msg() after sending it
> up to userspace and dropping the extra ref.)
>
> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-01-23 21:26:47

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 12/16] audit: contid check descendancy and nesting

On 2020-01-22 16:29, Paul Moore wrote:
> On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> >
> > Require the target task to be a descendant of the container
> > orchestrator/engine.
> >
> > You would only change the audit container ID from one set or inherited
> > value to another if you were nesting containers.
> >
> > If changing the contid, the container orchestrator/engine must be a
> > descendant and not same orchestrator as the one that set it so it is not
> > possible to change the contid of another orchestrator's container.
> >
> > Since the task_is_descendant() function is used in YAMA and in audit,
> > remove the duplication and pull the function into kernel/sched/core.c
> >
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > ---
> > include/linux/sched.h | 3 +++
> > kernel/audit.c | 44 ++++++++++++++++++++++++++++++++++++--------
> > kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++++
> > security/yama/yama_lsm.c | 33 ---------------------------------
> > 4 files changed, 72 insertions(+), 41 deletions(-)
>
> ...
>
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index f7a8d3288ca0..ef8e07524c46 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -2603,22 +2610,43 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > oldcontid = audit_get_contid(task);
> > read_lock(&tasklist_lock);
> > /* Don't allow the contid to be unset */
> > - if (!audit_contid_valid(contid))
> > + if (!audit_contid_valid(contid)) {
> > rc = -EINVAL;
> > + goto unlock;
> > + }
> > /* Don't allow the contid to be set to the same value again */
> > - else if (contid == oldcontid) {
> > + if (contid == oldcontid) {
> > rc = -EADDRINUSE;
> > + goto unlock;
> > + }
> > /* if we don't have caps, reject */
> > - else if (!capable(CAP_AUDIT_CONTROL))
> > + if (!capable(CAP_AUDIT_CONTROL)) {
> > rc = -EPERM;
> > - /* if task has children or is not single-threaded, deny */
> > - else if (!list_empty(&task->children))
> > + goto unlock;
> > + }
> > + /* if task has children, deny */
> > + if (!list_empty(&task->children)) {
> > rc = -EBUSY;
> > - else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > + goto unlock;
> > + }
> > + /* if task is not single-threaded, deny */
> > + if (!(thread_group_leader(task) && thread_group_empty(task))) {
> > rc = -EALREADY;
> > - /* if contid is already set, deny */
> > - else if (audit_contid_set(task))
> > + goto unlock;
> > + }
>
> It seems like the if/else-if conversion above should be part of an
> earlier patchset.

I had considered that, but it wasn't obvious where that conversion
should happen since it wasn't necessary earlier and is now. I can move
it earlier if you feel strongly about it.

> > + /* if task is not descendant, block */
> > + if (task == current) {
> > + rc = -EBADSLT;
> > + goto unlock;
> > + }
> > + if (!task_is_descendant(current, task)) {
> > + rc = -EXDEV;
> > + goto unlock;
> > + }
>
> I understand you are trying to provide a unique error code for each
> failure case, but this is getting silly. Let's group the descendent
> checks under the same error code.

Ok. I was trying to provide more information for debugging for me and
for users.

> > + /* only allow contid setting again if nesting */
> > + if (audit_contid_set(task) && audit_contid_isowner(task))
> > rc = -ECHILD;
>
> Should that be "!audit_contid_isowner()"?

No. If the contid is already set on this task and if it is the same
orchestrator that already owns this one, then block it since the same
orchestrator is not allowed to set it again. Another orchestrator that
has been shown by previous tests to be a descendant of the orchestrator
that already owns this one would be permitted.

Now that I say this explicitly, it appears I need another test to check:

  /* only allow contid setting again if nesting */
  if (audit_contid_set(task) &&
      (audit_contid_isowner(task) ||
       !task_is_descendant(_audit_contobj(task)->owner, current)))
    rc = -ECHILD;

So we're back to audit_contobj_owner() like in the previous patchset
that would make this cleaner.
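
The rule stated above can be modelled in userspace roughly as follows. This is a sketch under assumptions: struct task, is_descendant() and nesting_check() are hypothetical, the ancestry walk is a simplified strict-descendant test over parent pointers (the kernel's task_is_descendant() also deals with thread group leaders and RCU), and a NULL cont_owner stands for "contid not set".

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
	struct task *parent;     /* NULL at the root of the process tree */
	struct task *cont_owner; /* orchestrator that set the contid, or NULL */
};

/* True if child is a strict descendant of ancestor. */
static bool is_descendant(const struct task *ancestor, const struct task *child)
{
	for (const struct task *t = child->parent; t; t = t->parent)
		if (t == ancestor)
			return true;
	return false;
}

/* Returns 0 if setter may (re)set target's contid, -ECHILD otherwise. */
static int nesting_check(const struct task *setter, const struct task *target)
{
	if (!target->cont_owner)
		return 0; /* first set: this rule does not apply */
	/* Same orchestrator may not set it again, and an unrelated
	 * orchestrator may not take over another one's container. */
	if (target->cont_owner == setter ||
	    !is_descendant(target->cont_owner, setter))
		return -ECHILD;
	return 0;
}
```

In this model only an orchestrator nested below the current owner gets through, which matches the "only allow contid setting again if nesting" comment.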

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-01-23 21:37:17

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Thu, Jan 23, 2020 at 3:04 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-01-23 12:09, Paul Moore wrote:
> > On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-01-22 16:28, Paul Moore wrote:
> > > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
> > > > >
> > > > > Add audit container identifier support to the action of signalling the
> > > > > audit daemon.
> > > > >
> > > > > Since this would need to add an element to the audit_sig_info struct,
> > > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > > > audit_sig_info2 struct. Corresponding support is required in the
> > > > > userspace code to reflect the new record request and reply type.
> > > > > An older userspace won't break since it won't know to request this
> > > > > record type.
> > > > >
> > > > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > > > ---
> > > > > include/linux/audit.h | 7 +++++++
> > > > > include/uapi/linux/audit.h | 1 +
> > > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++
> > > > > kernel/audit.h | 1 +
> > > > > security/selinux/nlmsgtab.c | 1 +
> > > > > 5 files changed, 45 insertions(+)
> > > >
> > > > ...
> > > >
> > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > index 0871c3e5d6df..51159c94041c 100644
> > > > > --- a/kernel/audit.c
> > > > > +++ b/kernel/audit.c
> > > > > @@ -126,6 +126,14 @@ struct auditd_connection {
> > > > > kuid_t audit_sig_uid = INVALID_UID;
> > > > > pid_t audit_sig_pid = -1;
> > > > > u32 audit_sig_sid = 0;
> > > > > +/* Since the signal information is stored in the record buffer at the
> > > > > + * time of the signal, but not retrieved until later, there is a chance
> > > > > + * that the last process in the container could terminate before the
> > > > > + * signal record is delivered. In this circumstance, there is a chance
> > > > > + * the orchestrator could reuse the audit container identifier, causing
> > > > > + * an overlap of audit records that refer to the same audit container
> > > > > + * identifier, but a different container instance. */
> > > > > +u64 audit_sig_cid = AUDIT_CID_UNSET;
> > > >
> > > > I believe we could prevent the case mentioned above by taking an
> > > > additional reference to the audit container ID object when the signal
> > > > information is collected, dropping it only after the signal
> > > > information is collected by userspace or another process signals the
> > > > audit daemon. Yes, it would block that audit container ID from being
> > > > reused immediately, but since we are talking about one number out of
> > > > 2^64 that seems like a reasonable tradeoff.
> > >
> > > I had thought that through and should have been more explicit about that
> > > situation when I documented it. We could do that, but then the syscall
> > > records would be connected with the call from auditd on shutdown to
> > > request that signal information, rather than the exit of that last
> > > process that was using that container. This strikes me as misleading.
> > > Is that really what we want?
> >
> > ???
> >
> > I think one of us is not understanding the other; maybe it's me, maybe
> > it's you, maybe it's both of us.
> >
> > Anyway, here is what I was trying to convey with my original comment
> > ... When we record the audit container ID in audit_signal_info() we
> > take an extra reference to the audit container ID object so that it
> > will not disappear (and get reused) until after we respond with an
> > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the
> > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in
> > audit_signal_info(). Unless I'm missing some other change you made,
> > this *shouldn't* affect the syscall records, all it does is preserve
> > the audit container ID object in the kernel's ACID store so it doesn't
> > get reused.
>
> This is exactly what I had understood. I hadn't considered the extra
> details below in detail due to my original syscall concern, but they
> make sense.
>
> The syscall I refer to is the one connected with the drop of the
> audit container identifier by the last process that was in that
> container in patch 5/16. The production of this record is contingent on
> the last ref in a contobj being dropped. So if it is due to that ref
> being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> record it fetched, then it will appear that the fetch action closed the
> container rather than the last process in the container to exit.
>
> Does this make sense?

More so than your original reply, at least to me anyway.

It makes sense that the audit container ID wouldn't be marked as
"dead" since it would still be very much alive and available for use
by the orchestrator, the question is if that is desirable or not. I
think the answer to this comes down to preserving the correctness of
the audit log.

If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
reused then I think there is a legitimate concern that the audit log
is not correct, and could be misleading. If we solve that by grabbing
an extra reference, then there could also be some confusion as
userspace considers a container to be "dead" while the audit container
ID still exists in the kernel, and the kernel generated audit
container ID death record will not be generated until much later (and
possibly be associated with a different event, but that could be
solved by unassociating the container death record). Of the two
approaches, I think the latter is safer in that it preserves the
correctness of the audit log, even though it could result in a delay
of the container death record.

Neither way is perfect, so if you have any other ideas I'm all ears.

> > (We do need to do some extra housekeeping in audit_signal_info() to
> > deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 -
> > basically if audit_sig_cid is not NULL we should drop a reference
> > before assigning it a new object pointer, and of course we would need
> > to set audit_sig_cid to NULL in audit_receive_msg() after sending it
> > up to userspace and dropping the extra ref.)

--
paul moore
http://www.paul-moore.com

2020-01-23 22:22:21

by Paul Moore

Subject: Re: [PATCH ghak90 V8 12/16] audit: contid check descendancy and nesting

On Thu, Jan 23, 2020 at 4:03 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-01-22 16:29, Paul Moore wrote:
> > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > >
> > > Require the target task to be a descendant of the container
> > > orchestrator/engine.
> > >
> > > You would only change the audit container ID from one set or inherited
> > > value to another if you were nesting containers.
> > >
> > > If changing the contid, the container orchestrator/engine must be a
> > > descendant and not same orchestrator as the one that set it so it is not
> > > possible to change the contid of another orchestrator's container.
> > >
> > > Since the task_is_descendant() function is used in YAMA and in audit,
> > > remove the duplication and pull the function into kernel/sched/core.c
> > >
> > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > ---
> > > include/linux/sched.h | 3 +++
> > > kernel/audit.c | 44 ++++++++++++++++++++++++++++++++++++--------
> > > kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++++
> > > security/yama/yama_lsm.c | 33 ---------------------------------
> > > 4 files changed, 72 insertions(+), 41 deletions(-)
> >
> > ...
> >
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index f7a8d3288ca0..ef8e07524c46 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > > @@ -2603,22 +2610,43 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > > oldcontid = audit_get_contid(task);
> > > read_lock(&tasklist_lock);
> > > /* Don't allow the contid to be unset */
> > > - if (!audit_contid_valid(contid))
> > > + if (!audit_contid_valid(contid)) {
> > > rc = -EINVAL;
> > > + goto unlock;
> > > + }
> > > /* Don't allow the contid to be set to the same value again */
> > > - else if (contid == oldcontid) {
> > > + if (contid == oldcontid) {
> > > rc = -EADDRINUSE;
> > > + goto unlock;
> > > + }
> > > /* if we don't have caps, reject */
> > > - else if (!capable(CAP_AUDIT_CONTROL))
> > > + if (!capable(CAP_AUDIT_CONTROL)) {
> > > rc = -EPERM;
> > > - /* if task has children or is not single-threaded, deny */
> > > - else if (!list_empty(&task->children))
> > > + goto unlock;
> > > + }
> > > + /* if task has children, deny */
> > > + if (!list_empty(&task->children)) {
> > > rc = -EBUSY;
> > > - else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > + goto unlock;
> > > + }
> > > + /* if task is not single-threaded, deny */
> > > + if (!(thread_group_leader(task) && thread_group_empty(task))) {
> > > rc = -EALREADY;
> > > - /* if contid is already set, deny */
> > > - else if (audit_contid_set(task))
> > > + goto unlock;
> > > + }
> >
> > It seems like the if/else-if conversion above should be part of an
> > earlier patchset.
>
> I had considered that, but it wasn't obvious where that conversion
> should happen since it wasn't necessary earlier and is now. I can move
> it earlier if you feel strongly about it.

Not particularly.

--
paul moore
http://www.paul-moore.com

2020-01-30 17:55:22

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 02/16] audit: add container id

On 2020-01-22 16:28, Paul Moore wrote:
> On Tue, Dec 31, 2019 at 2:49 PM Richard Guy Briggs <[email protected]> wrote:
> >
> > Implement the proc fs write to set the audit container identifier of a
> > process, emitting an AUDIT_CONTAINER_OP record to document the event.
> >
> > This is a write from the container orchestrator task to a proc entry of
> > the form /proc/PID/audit_containerid where PID is the process ID of the
> > newly created task that is to become the first task in a container, or
> > an additional task added to a container.
> >
> > The write expects up to a u64 value (unset: 18446744073709551615).
> >
> > The writer must have capability CAP_AUDIT_CONTROL.
> >
> > This will produce a record such as this:
> > type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615
> >
> > The "op" field indicates an initial set. The "opid" field is the
> > object's PID, the process being "contained". New and old audit
> > container identifier values are given in the "contid" fields.
> >
> > It is not permitted to unset the audit container identifier.
> > A child inherits its parent's audit container identifier.
> >
> > Please see the github audit kernel issue for the main feature:
> > https://github.com/linux-audit/audit-kernel/issues/90
> > Please see the github audit userspace issue for supporting additions:
> > https://github.com/linux-audit/audit-userspace/issues/51
> > Please see the github audit testsuite issue for the test case:
> > https://github.com/linux-audit/audit-testsuite/issues/64
> > Please see the github audit wiki for the feature overview:
> > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> >
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > Acked-by: Serge Hallyn <[email protected]>
> > Acked-by: Steve Grubb <[email protected]>
> > Acked-by: Neil Horman <[email protected]>
> > Reviewed-by: Ondrej Mosnacek <[email protected]>
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > ---
> > fs/proc/base.c | 36 ++++++++++++++++++++++++++++
> > include/linux/audit.h | 25 ++++++++++++++++++++
> > include/uapi/linux/audit.h | 2 ++
> > kernel/audit.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++
> > kernel/audit.h | 1 +
> > kernel/auditsc.c | 4 ++++
> > 6 files changed, 126 insertions(+)
>
> ...
>
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index 397f8fb4836a..2d7707426b7d 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -2356,6 +2358,62 @@ int audit_signal_info(int sig, struct task_struct *t)
> > return audit_signal_info_syscall(t);
> > }
> >
> > +/*
> > + * audit_set_contid - set current task's audit contid
> > + * @task: target task
> > + * @contid: contid value
> > + *
> > + * Returns 0 on success, -EPERM on permission failure.
> > + *
> > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > + */
> > +int audit_set_contid(struct task_struct *task, u64 contid)
> > +{
> > + u64 oldcontid;
> > + int rc = 0;
> > + struct audit_buffer *ab;
> > +
> > + task_lock(task);
> > + /* Can't set if audit disabled */
> > + if (!task->audit) {
> > + task_unlock(task);
> > + return -ENOPROTOOPT;
> > + }
> > + oldcontid = audit_get_contid(task);
> > + read_lock(&tasklist_lock);
> > + /* Don't allow the audit containerid to be unset */
> > + if (!audit_contid_valid(contid))
> > + rc = -EINVAL;
> > + /* if we don't have caps, reject */
> > + else if (!capable(CAP_AUDIT_CONTROL))
> > + rc = -EPERM;
> > + /* if task has children or is not single-threaded, deny */
> > + else if (!list_empty(&task->children))
> > + rc = -EBUSY;
> > + else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > + rc = -EALREADY;
>
> [NOTE: there is a bigger issue below which I think is going to require
> a respin/fixup of this patch so I'm going to take the opportunity to
> do a bit more bikeshedding ;)]
>
> It seems like we could combine both the thread/children checks under a
> single -EBUSY return value. In both cases the caller should be able
> to determine if the target process is multi-threaded or has spawned
> children, yes? FWIW, my motivation for this question is that
> -EALREADY seems like a poor choice here.

Fair enough.

> > + /* if contid is already set, deny */
> > + else if (audit_contid_set(task))
> > + rc = -ECHILD;
>
> Does -EEXIST make more sense here?

Perhaps. I don't feel strongly about it, but none of these error codes
were intended for this use and should not overlap with other errors from
writing to /proc.

> > + read_unlock(&tasklist_lock);
> > + if (!rc)
> > + task->audit->contid = contid;
> > + task_unlock(task);
> > +
> > + if (!audit_enabled)
> > + return rc;
> > +
> > + ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
> > + if (!ab)
> > + return rc;
> > +
> > + audit_log_format(ab,
> > + "op=set opid=%d contid=%llu old-contid=%llu",
> > + task_tgid_nr(task), contid, oldcontid);
> > + audit_log_end(ab);
>
> Assuming audit is enabled we always emit the record above, even if we
> were not actually able to set the Audit Container ID (ACID); this
> seems wrong to me. I think the proper behavior would be to either add
> a "res=" field to indicate success/failure or only emit the record
> when we actually change a task's ACID. Considering the impact that
> the ACID value will potentially have on the audit stream, it seems
> like always logging the record and including a "res=" field may be the
> safer choice.

This record should be accompanied by a syscall record (and eventually
possibly a CONTAINER_ID record of the orchestrator, if it is already in
a container). The syscall record has a res= field that already gives
this result.

> > + return rc;
> > +}
> > +
> > /**
> > * audit_log_end - end one audit record
> > * @ab: the audit_buffer
>
> --
> paul moore
> http://www.paul-moore.com
>

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-01-30 19:30:19

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On 2020-01-22 16:29, Paul Moore wrote:
> On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> >
> > Track the parent container of a container to be able to filter and
> > report nesting.
> >
> > Now that we have a way to track and check the parent container of a
> > container, modify the contid field format to be able to report that
> > nesting using a caret ("^") separator to indicate nesting. The
> > original field format was "contid=<contid>" for task-associated records
> > and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> > records. The new field format is
> > "contid=<contid>[^<contid>[...]][,<contid>[...]]".
>
> Let's make sure we always use a comma as a separator, even when
> recording the parent information, for example:
> "contid=<contid>[,^<contid>[...]][,<contid>[...]]"

The intent here is to clearly indicate and separate nesting from
parallel use of several containers by one netns. If we do away with
that distinction, then we lose that inheritance accountability and
should really run the list through a "uniq" function to remove the
produced redundancies. This clear inheritance is something Steve was
looking for since tracking down individual events/records to show that
inheritance was not always feasible due to rolled logs or search effort.
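
As an illustration of the caret-separated form only, a userspace sketch
(not the kernel implementation) could walk the parent chain like this. The
struct contobj layout and format_contid_chain() are hypothetical names, and
snprintf() stands in for audit_log_format().

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel object: an id plus a parent link. */
struct contobj {
	uint64_t id;
	const struct contobj *parent;
};

/*
 * Write "<contid>[^<contid>[...]]" for cont and its ancestors into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int format_contid_chain(char *buf, size_t len, const struct contobj *cont)
{
	size_t used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (; cont != NULL; cont = cont->parent) {
		/* Emit the id, followed by "^" only when a parent follows. */
		int n = snprintf(buf + used, len - used, "%" PRIu64 "%s",
				 cont->id, cont->parent ? "^" : "");
		if (n < 0 || (size_t)n >= len - used)
			return -1;	/* output would be truncated */
		used += (size_t)n;
	}
	return (int)used;
}
```

For a container 3 nested in 2 nested in 1, this yields "3^2^1", so the
nesting stays distinguishable from a comma-separated list of unrelated
containers sharing a netns.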

> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > ---
> > include/linux/audit.h | 1 +
> > kernel/audit.c | 53 +++++++++++++++++++++++++++++++++++++++++++--------
> > kernel/audit.h | 1 +
> > kernel/auditfilter.c | 17 ++++++++++++++++-
> > kernel/auditsc.c | 2 +-
> > 5 files changed, 64 insertions(+), 10 deletions(-)
>
> ...
>
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index ef8e07524c46..68be59d1a89b 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
>
> > @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> > audit_netns_contid_add(new->net_ns, contid);
> > }
> >
> > +void audit_log_contid(struct audit_buffer *ab, u64 contid);
>
> If we need a forward declaration, might as well just move it up near
> the top of the file with the rest of the declarations.

Ok.

> > +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> > +{
> > + struct audit_contobj *cont = NULL, *prcont = NULL;
> > + int h;
>
> It seems safer to pass the audit container ID object and not the u64.

It would also be faster, but in some places it isn't available such as
for ptrace and signal targets. This also links back to the drop record
refcounts to hold onto the contobj until process exit, or signal
delivery.

What we could do is to supply two potential parameters, a contobj and/or
a contid, and have it use the contobj if it is valid, otherwise, use the
contid, as is done for names and paths supplied to audit_log_name().

> > + if (!audit_contid_valid(contid)) {
> > + audit_log_format(ab, "%llu", contid);
>
> Do we really want to print (u64)-1 here? Since this is a known
> invalid number, would "?" be a better choice?

I'll defer to Steve here. "?" would be one character vs 20 for (u64)-1.
I don't expect there to be that many records containing (u64)-1, but it
would also make them visually easier to pick out if that is a factor.

> > + return;
> > + }
> > + h = audit_hash_contid(contid);
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(cont, &audit_contid_hash[h], list)
> > + if (cont->id == contid) {
> > + prcont = cont;
>
> Why not just pull the code below into the body of this if statement?
> It all needs to be done under the RCU read lock anyway and the code
> would read much better this way.

Ok.

> > + break;
> > + }
> > + if (!prcont) {
> > + audit_log_format(ab, "%llu", contid);
> > + goto out;
> > + }
> > + while (prcont) {
> > + audit_log_format(ab, "%llu", prcont->id);
> > + prcont = prcont->parent;
> > + if (prcont)
> > + audit_log_format(ab, "^");
>
> In the interest of limiting the number of calls to audit_log_format(),
> how about something like the following:
>
> audit_log_format(ab, "%llu", cont->id);
> iter = cont->parent;
> while (iter) {
> if (iter->parent)
> audit_log_format(ab, "^%llu,", iter->id);
> else
> audit_log_format(ab, "^%llu", iter->id);
> iter = iter->parent;
> }

Ok.

> > + }
> > +out:
> > + rcu_read_unlock();
> > +}
> > +
> > /*
> > * audit_log_container_id - report container info
> > * @context: task or local context for record
>
> ...
>
> > @@ -2705,9 +2741,10 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > if (!ab)
> > return rc;
> >
> > - audit_log_format(ab,
> > - "op=set opid=%d contid=%llu old-contid=%llu",
> > - task_tgid_nr(task), contid, oldcontid);
> > + audit_log_format(ab, "op=set opid=%d contid=", task_tgid_nr(task));
> > + audit_log_contid(ab, contid);
> > + audit_log_format(ab, " old-contid=");
> > + audit_log_contid(ab, oldcontid);
>
> This is an interesting case where contid and old-contid are going to
> be largely the same, only the first (current) ID is going to be
> different; do we want to duplicate all of those IDs?

At first when I read your comment, I thought we could just take contid
and drop oldcontid, but if it fails, we still want all the information,
so given the way I've set up the search code in userspace, listing only
the newest contid in the contid field and all the rest in oldcontid
could be a good compromise.

> > audit_log_end(ab);
> > return rc;
> > }
> > @@ -2723,9 +2760,9 @@ void audit_log_container_drop(void)
>
> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-01-31 14:52:35

by Steve Grubb

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On Wednesday, January 22, 2020 4:29:12 PM EST Paul Moore wrote:
> On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > Track the parent container of a container to be able to filter and
> > report nesting.
> >
> > Now that we have a way to track and check the parent container of a
> > container, modify the contid field format to be able to report that
> > nesting using a caret ("^") separator to indicate nesting. The
> > original field format was "contid=<contid>" for task-associated records
> > and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> > records. The new field format is
> > "contid=<contid>[^<contid>[...]][,<contid>[...]]".
>
> Let's make sure we always use a comma as a separator, even when
> recording the parent information, for example:
> "contid=<contid>[,^<contid>[...]][,<contid>[...]]"
>
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > ---
> >
> > include/linux/audit.h | 1 +
> > kernel/audit.c | 53 +++++++++++++++++++++++++++++++++++++++++++--------
> > kernel/audit.h | 1 +
> > kernel/auditfilter.c | 17 ++++++++++++++++-
> > kernel/auditsc.c | 2 +-
> > 5 files changed, 64 insertions(+), 10 deletions(-)
>
> ...
>
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index ef8e07524c46..68be59d1a89b 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> >
> > @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> > audit_netns_contid_add(new->net_ns, contid);
> >
> > }
> >
> > +void audit_log_contid(struct audit_buffer *ab, u64 contid);
>
> If we need a forward declaration, might as well just move it up near
> the top of the file with the rest of the declarations.
>
> > +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> > +{
> > + struct audit_contobj *cont = NULL, *prcont = NULL;
> > + int h;
>
> It seems safer to pass the audit container ID object and not the u64.
>
> > + if (!audit_contid_valid(contid)) {
> > + audit_log_format(ab, "%llu", contid);
>
> Do we really want to print (u64)-1 here? Since this is a known
> invalid number, would "?" be a better choice?

The established pattern is that we print -1 when it's unset and "?" when it's
totally missing. So, how could this be invalid? It should be set or not.
That is unless it's totally missing just like when we do not run with selinux
enabled and a context just doesn't exist.

-Steve


> > + return;
> > + }
> > + h = audit_hash_contid(contid);
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(cont, &audit_contid_hash[h], list)
> > + if (cont->id == contid) {
> > + prcont = cont;
>
> Why not just pull the code below into the body of this if statement?
> It all needs to be done under the RCU read lock anyway and the code
> would read much better this way.
>
> > + break;
> > + }
> > + if (!prcont) {
> > + audit_log_format(ab, "%llu", contid);
> > + goto out;
> > + }
> > + while (prcont) {
> > + audit_log_format(ab, "%llu", prcont->id);
> > + prcont = prcont->parent;
> > + if (prcont)
> > + audit_log_format(ab, "^");
>
> In the interest of limiting the number of calls to audit_log_format(),
> how about something like the following:
>
> audit_log_format(ab, "%llu", cont->id);
> iter = cont->parent;
> while (iter) {
> if (iter->parent)
> audit_log_format(ab, "^%llu,", iter->id);
> else
> audit_log_format(ab, "^%llu", iter->id);
> iter = iter->parent;
> }
>
> > + }
> > +out:
> > + rcu_read_unlock();
> > +}
> > +
> >
> > /*
> >
> > * audit_log_container_id - report container info
> > * @context: task or local context for record
>
> ...
>
> > @@ -2705,9 +2741,10 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > if (!ab)
> >
> > return rc;
> >
> > - audit_log_format(ab,
> > - "op=set opid=%d contid=%llu old-contid=%llu",
> > - task_tgid_nr(task), contid, oldcontid);
> > + audit_log_format(ab, "op=set opid=%d contid=", task_tgid_nr(task));
> > + audit_log_contid(ab, contid);
> > + audit_log_format(ab, " old-contid=");
> > + audit_log_contid(ab, oldcontid);
>
> This is an interesting case where contid and old-contid are going to
> be largely the same, only the first (current) ID is going to be
> different; do we want to duplicate all of those IDs?
>
> > audit_log_end(ab);
> > return rc;
> >
> > }
> >
> > @@ -2723,9 +2760,9 @@ void audit_log_container_drop(void)
>
> --
> paul moore
> http://www.paul-moore.com




2020-02-04 13:21:31

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On 2020-01-31 09:50, Steve Grubb wrote:
> On Wednesday, January 22, 2020 4:29:12 PM EST Paul Moore wrote:
> > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > Track the parent container of a container to be able to filter and
> > > report nesting.
> > >
> > > Now that we have a way to track and check the parent container of a
> > > container, modify the contid field format to be able to report that
> > > > nesting using a caret ("^") separator to indicate nesting. The
> > > original field format was "contid=<contid>" for task-associated records
> > > and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> > > records. The new field format is
> > > "contid=<contid>[^<contid>[...]][,<contid>[...]]".
> >
> > Let's make sure we always use a comma as a separator, even when
> > recording the parent information, for example:
> > "contid=<contid>[,^<contid>[...]][,<contid>[...]]"
> >
> > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > ---
> > >
> > > include/linux/audit.h | 1 +
> > > > kernel/audit.c | 53 +++++++++++++++++++++++++++++++++++++++++++--------
> > > > kernel/audit.h | 1 +
> > > kernel/auditfilter.c | 17 ++++++++++++++++-
> > > kernel/auditsc.c | 2 +-
> > > 5 files changed, 64 insertions(+), 10 deletions(-)
> >
> > ...
> >
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index ef8e07524c46..68be59d1a89b 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > >
> > > > @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> > > audit_netns_contid_add(new->net_ns, contid);
> > >
> > > }
> > >
> > > +void audit_log_contid(struct audit_buffer *ab, u64 contid);
> >
> > If we need a forward declaration, might as well just move it up near
> > the top of the file with the rest of the declarations.
> >
> > > +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> > > +{
> > > + struct audit_contobj *cont = NULL, *prcont = NULL;
> > > + int h;
> >
> > It seems safer to pass the audit container ID object and not the u64.
> >
> > > + if (!audit_contid_valid(contid)) {
> > > + audit_log_format(ab, "%llu", contid);
> >
> > Do we really want to print (u64)-1 here? Since this is a known
> > invalid number, would "?" be a better choice?
>
> The established pattern is that we print -1 when it's unset and "?" when it's
> totally missing. So, how could this be invalid? It should be set or not.
> That is unless it's totally missing just like when we do not run with selinux
> enabled and a context just doesn't exist.

Ok, so in this case it is clearly unset, so should be -1, which will be a
20-digit number when represented as an unsigned long long int.

Thank you for that clarification Steve.

> -Steve
>
> > > + return;
> > > + }
> > > + h = audit_hash_contid(contid);
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(cont, &audit_contid_hash[h], list)
> > > + if (cont->id == contid) {
> > > + prcont = cont;
> >
> > Why not just pull the code below into the body of this if statement?
> > It all needs to be done under the RCU read lock anyway and the code
> > would read much better this way.
> >
> > > + break;
> > > + }
> > > + if (!prcont) {
> > > + audit_log_format(ab, "%llu", contid);
> > > + goto out;
> > > + }
> > > + while (prcont) {
> > > + audit_log_format(ab, "%llu", prcont->id);
> > > + prcont = prcont->parent;
> > > + if (prcont)
> > > + audit_log_format(ab, "^");
> >
> > In the interest of limiting the number of calls to audit_log_format(),
> > how about something like the following:
> >
> > > audit_log_format(ab, "%llu", cont->id);
> > > iter = cont->parent;
> > > while (iter) {
> > > if (iter->parent)
> > > audit_log_format(ab, "^%llu,", iter->id);
> > > else
> > > audit_log_format(ab, "^%llu", iter->id);
> > > iter = iter->parent;
> > > }
> >
> > > + }
> > > +out:
> > > + rcu_read_unlock();
> > > +}
> > > +
> > >
> > > /*
> > >
> > > * audit_log_container_id - report container info
> > > * @context: task or local context for record
> >
> > ...
> >
> > > > @@ -2705,9 +2741,10 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > > if (!ab)
> > >
> > > return rc;
> > >
> > > - audit_log_format(ab,
> > > - "op=set opid=%d contid=%llu old-contid=%llu",
> > > - task_tgid_nr(task), contid, oldcontid);
> > > > + audit_log_format(ab, "op=set opid=%d contid=", task_tgid_nr(task));
> > > > + audit_log_contid(ab, contid);
> > > + audit_log_format(ab, " old-contid=");
> > > + audit_log_contid(ab, oldcontid);
> >
> > This is an interesting case where contid and old-contid are going to
> > be largely the same, only the first (current) ID is going to be
> > different; do we want to duplicate all of those IDs?
> >
> > > audit_log_end(ab);
> > > return rc;
> > >
> > > }
> > >
> > > @@ -2723,9 +2760,9 @@ void audit_log_container_drop(void)
> >
> > paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-02-04 15:48:54

by Steve Grubb

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On Tuesday, February 4, 2020 8:19:44 AM EST Richard Guy Briggs wrote:
> > The established pattern is that we print -1 when it's unset and "?" when
> > it's totally missing. So, how could this be invalid? It should be set
> > or not. That is unless it's totally missing just like when we do not run
> > with selinux enabled and a context just doesn't exist.
>
> Ok, so in this case it is clearly unset, so should be -1, which will be a
> 20-digit number when represented as an unsigned long long int.
>
> Thank you for that clarification Steve.

It is literally a -1. ( 2 characters)

-Steve


2020-02-04 15:54:01

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On Tue, Feb 4, 2020 at 10:47 AM Steve Grubb <[email protected]> wrote:
> On Tuesday, February 4, 2020 8:19:44 AM EST Richard Guy Briggs wrote:
> > > The established pattern is that we print -1 when it's unset and "?" when
> > > it's totally missing. So, how could this be invalid? It should be set
> > > or not. That is unless it's totally missing just like when we do not run
> > > with selinux enabled and a context just doesn't exist.
> >
> > Ok, so in this case it is clearly unset, so should be -1, which will be a
> > 20-digit number when represented as an unsigned long long int.
> >
> > Thank you for that clarification Steve.
>
> It is literally a -1. ( 2 characters)

Well, not as Richard has currently written the code, it is a "%llu".
This was why I asked the question I did; if we want the "-1" here we
probably want to special case that as I don't think we want to display
audit container IDs as signed numbers in general.
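
To show the difference in question, here is a small userspace sketch (an
illustration, not the kernel code). The format_contid() helper and its
unset_as_minus_one flag are hypothetical; snprintf() stands in for
audit_log_format().

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* (u64)-1, the "unset" audit container ID. */
#define CONTID_UNSET UINT64_MAX

/*
 * Format a contid into buf. When unset_as_minus_one is true the unset
 * value is special-cased to the two-character string "-1"; otherwise it
 * is printed as an unsigned 64-bit number like every other ID.
 * Returns what snprintf() returns: the length the output needs.
 */
static int format_contid(char *buf, size_t len, uint64_t contid,
			 bool unset_as_minus_one)
{
	assert(buf != NULL && len > 0);
	if (unset_as_minus_one && contid == CONTID_UNSET)
		return snprintf(buf, len, "-1");
	return snprintf(buf, len, "%" PRIu64, contid);
}
```

Without the special case the unset value comes out as the 20-digit number
18446744073709551615; with it, only that one value is shown as "-1" and all
other IDs stay unsigned.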

--
paul moore
http://www.paul-moore.com

2020-02-04 18:14:22

by Steve Grubb

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On Tuesday, February 4, 2020 10:52:36 AM EST Paul Moore wrote:
> On Tue, Feb 4, 2020 at 10:47 AM Steve Grubb <[email protected]> wrote:
> > On Tuesday, February 4, 2020 8:19:44 AM EST Richard Guy Briggs wrote:
> > > > The established pattern is that we print -1 when it's unset and "?" when
> > > > it's totally missing. So, how could this be invalid? It should be set
> > > > or not. That is unless it's totally missing just like when we do not run
> > > > with selinux enabled and a context just doesn't exist.
> > >
> > > Ok, so in this case it is clearly unset, so should be -1, which will be a
> > > 20-digit number when represented as an unsigned long long int.
> > >
> > > Thank you for that clarification Steve.
> >
> > It is literally a -1. ( 2 characters)
>
> Well, not as Richard has currently written the code, it is a "%llu".
> This was why I asked the question I did; if we want the "-1" here we
> probably want to special case that as I don't think we want to display
> audit container IDs as signed numbers in general.

OK, then go with the long number, we'll fix it in the interpretation. I guess
we do the same thing for auid.

-Steve


2020-02-04 22:53:31

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 04/16] audit: convert to contid list to check for orch/engine ownership

On 2020-01-22 16:28, Paul Moore wrote:
> On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
> >
> > Store the audit container identifier in a refcounted kernel object that
> > is added to the master list of audit container identifiers. This will
> > allow multiple container orchestrators/engines to work on the same
> > machine without danger of inadvertently re-using an existing identifier.
> > It will also allow an orchestrator to inject a process into an existing
> > container by checking if the original container owner is the one
> > injecting the task. A hash table list is used to optimize searches.
> >
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > ---
> > include/linux/audit.h | 14 ++++++--
> > kernel/audit.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > kernel/audit.h | 8 +++++
> > 3 files changed, 112 insertions(+), 8 deletions(-)
>
> ...
>
> > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > index a045b34ecf44..0e6dbe943ae4 100644
> > --- a/include/linux/audit.h
> > +++ b/include/linux/audit.h
> > @@ -94,10 +94,18 @@ struct audit_ntp_data {
> > struct audit_ntp_data {};
> > #endif
> >
> > +struct audit_contobj {
> > + struct list_head list;
> > + u64 id;
> > + struct task_struct *owner;
> > + refcount_t refcount;
> > + struct rcu_head rcu;
> > +};
> > +
> > struct audit_task_info {
> > kuid_t loginuid;
> > unsigned int sessionid;
> > - u64 contid;
> > + struct audit_contobj *cont;
> > #ifdef CONFIG_AUDITSYSCALL
> > struct audit_context *ctx;
> > #endif
> > @@ -203,9 +211,9 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
> >
> > static inline u64 audit_get_contid(struct task_struct *tsk)
> > {
> > - if (!tsk->audit)
> > + if (!tsk->audit || !tsk->audit->cont)
> > return AUDIT_CID_UNSET;
> > - return tsk->audit->contid;
> > + return tsk->audit->cont->id;
> > }
> >
> > extern u32 audit_enabled;
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index 2d7707426b7d..4bab20f5f781 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -212,6 +218,31 @@ void __init audit_task_init(void)
> > 0, SLAB_PANIC, NULL);
> > }
> >
> > +static struct audit_contobj *_audit_contobj(struct task_struct *tsk)
> > +{
> > + if (!tsk->audit)
> > + return NULL;
> > + return tsk->audit->cont;
>
> It seems like it would be safer to grab a reference here (e.g.
> _audit_contobj_hold(...)), yes? Or are you confident we will never
> have tsk go away while the caller is holding on to the returned audit
> container ID object?

I'll switch to _get that calls _hold and just call _put immediately if I
don't want to keep the increase to the refcount.

> > +}
> > +
> > +/* audit_contobj_list_lock must be held by caller unless new */
> > +static void _audit_contobj_hold(struct audit_contobj *cont)
> > +{
> > + refcount_inc(&cont->refcount);
> > +}
>
> If we are going to call the matching decrement function "_put" it
> seems like we might want to call the function above "_get". Further,
> we can also have it return an audit_contobj pointer in case the caller
> needs to do an assignment as well (which seems typical if you need to
> bump the refcount):

Sure, I'll do that.
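
For reference, the get/put pairing being discussed could look roughly like
the following in userspace. This is only a sketch: C11 atomics stand in for
refcount_t, free() stands in for kfree_rcu(), and the contobj_new(),
contobj_get() and contobj_put() names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down container object. */
struct contobj {
	uint64_t id;
	atomic_uint refcount;
};

/* Allocate an object holding one reference for the caller. */
static struct contobj *contobj_new(uint64_t id)
{
	struct contobj *cont = calloc(1, sizeof(*cont));

	if (cont) {
		cont->id = id;
		atomic_init(&cont->refcount, 1);
	}
	return cont;
}

/* Take a reference and hand the pointer back so callers can assign it. */
static struct contobj *contobj_get(struct contobj *cont)
{
	if (cont)
		atomic_fetch_add(&cont->refcount, 1);
	return cont;
}

/* Drop a reference; returns true if this was the last one and cont was freed. */
static bool contobj_put(struct contobj *cont)
{
	if (!cont)
		return false;
	if (atomic_fetch_sub(&cont->refcount, 1) == 1) {
		free(cont);
		return true;
	}
	return false;
}
```

Returning the pointer from the get side allows an assignment such as
info->cont = contobj_get(parent) in a single statement.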

> > +/* audit_contobj_list_lock must be held by caller */
> > +static void _audit_contobj_put(struct audit_contobj *cont)
> > +{
> > + if (!cont)
> > + return;
> > + if (refcount_dec_and_test(&cont->refcount)) {
> > + put_task_struct(cont->owner);
> > + list_del_rcu(&cont->list);
> > + kfree_rcu(cont, rcu);
> > + }
> > +}
> > +
> > /**
> > * audit_alloc - allocate an audit info block for a task
> > * @tsk: task
> > @@ -232,7 +263,11 @@ int audit_alloc(struct task_struct *tsk)
> > }
> > info->loginuid = audit_get_loginuid(current);
> > info->sessionid = audit_get_sessionid(current);
> > - info->contid = audit_get_contid(current);
> > + spin_lock(&audit_contobj_list_lock);
> > + info->cont = _audit_contobj(current);
> > + if (info->cont)
> > + _audit_contobj_hold(info->cont);
> > + spin_unlock(&audit_contobj_list_lock);
>
> If we are taking a spinlock in order to bump the refcount, does it
> really need to be a refcount_t or can we just use a normal integer?
> In RCU protected lists a spinlock is usually used to protect
> adds/removes to the list, not the content of individual list items.
>
> My guess is you probably want to use the spinlock as described above
> (list add/remove protection) and manipulate the refcount_t inside a
> RCU read lock protected region.

Ok, I guess it could be an integer if it were protected by the spinlock,
but I think you've guessed my intent, so let us keep it as a refcount
and tighten the spinlock scope and use rcu read locking to protect _get
and _put in _alloc, _free, and later on when protecting the network
namespace contobj lists. This should reduce potential contention for
the spinlock to one location over fewer lines of code in that place
while speeding up updates and slightly simplifying code in the others.

> > tsk->audit = info;
> >
> > ret = audit_alloc_syscall(tsk);
> > @@ -267,6 +302,9 @@ void audit_free(struct task_struct *tsk)
> > /* Freeing the audit_task_info struct must be performed after
> > * audit_log_exit() due to need for loginuid and sessionid.
> > */
> > + spin_lock(&audit_contobj_list_lock);
> > + _audit_contobj_put(tsk->audit->cont);
> > + spin_unlock(&audit_contobj_list_lock);
>
> This is another case of refcount_t vs normal integer in a spinlock
> protected region.
>
> > info = tsk->audit;
> > tsk->audit = NULL;
> > kmem_cache_free(audit_task_cache, info);
> > @@ -2365,6 +2406,9 @@ int audit_signal_info(int sig, struct task_struct *t)
> > *
> > * Returns 0 on success, -EPERM on permission failure.
> > *
> > + * If the original container owner goes away, no task injection is
> > + * possible to an existing container.
> > + *
> > * Called (set) from fs/proc/base.c::proc_contid_write().
> > */
> > int audit_set_contid(struct task_struct *task, u64 contid)
> > @@ -2381,9 +2425,12 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > }
> > oldcontid = audit_get_contid(task);
> > read_lock(&tasklist_lock);
> > - /* Don't allow the audit containerid to be unset */
> > + /* Don't allow the contid to be unset */
> > if (!audit_contid_valid(contid))
> > rc = -EINVAL;
> > + /* Don't allow the contid to be set to the same value again */
> > + else if (contid == oldcontid) {
> > + rc = -EADDRINUSE;
>
> First, is that brace a typo? It looks like it. Did this compile?

Yes, it was fixed in the later patch that restructured the if
statements.

> Second, can you explain why (re)setting the audit container ID to the
> same value is something we need to prohibit? I'm guessing it has
> something to do with explicitly set vs inherited, but I don't want to
> assume too much about your thinking behind this.

It made the refcounting more complicated later, and besides, the
prohibition on setting the contid again if it is already set would catch
this case, so I'll remove it in this patch and ensure this action
doesn't cause a problem in later patches.

> If it is "set vs inherited", would allowing an orchestrator to
> explicitly "set" an inherited audit container ID provide some level of
> protection against moving the task?

I can't see it helping prevent this since later descendancy checks will
stop this move anyways.

> > /* if we don't have caps, reject */
> > else if (!capable(CAP_AUDIT_CONTROL))
> > rc = -EPERM;
> > @@ -2396,8 +2443,49 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > else if (audit_contid_set(task))
> > rc = -ECHILD;
> > read_unlock(&tasklist_lock);
> > - if (!rc)
> > - task->audit->contid = contid;
> > + if (!rc) {
> > + struct audit_contobj *oldcont = _audit_contobj(task);
> > + struct audit_contobj *cont = NULL, *newcont = NULL;
> > + int h = audit_hash_contid(contid);
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(cont, &audit_contid_hash[h], list)
> > + if (cont->id == contid) {
> > + /* task injection to existing container */
> > + if (current == cont->owner) {
>
> Do we have any protection against the task pointed to by cont->owner
> going away and a new task with the same current pointer value (no
> longer the legitimate audit container ID owner) manipulating the
> target task's audit container ID?

Yes, the get_task_struct() call below.

> > + spin_lock(&audit_contobj_list_lock);
> > + _audit_contobj_hold(cont);
> > + spin_unlock(&audit_contobj_list_lock);
>
> More of the refcount_t vs integer/spinlock question.
>
> > + newcont = cont;
> > + } else {
> > + rc = -ENOTUNIQ;
> > + goto conterror;
> > + }
> > + break;
> > + }
> > + if (!newcont) {
> > + newcont = kmalloc(sizeof(*newcont), GFP_ATOMIC);
> > + if (newcont) {
> > + INIT_LIST_HEAD(&newcont->list);
> > + newcont->id = contid;
> > + get_task_struct(current);
> > + newcont->owner = current;
>
> newcont->owner = get_task_struct(current);
>
> (This is what I was talking about above with returning the struct
> pointer in the _get/_hold function)

No problem.

> > + refcount_set(&newcont->refcount, 1);
> > + spin_lock(&audit_contobj_list_lock);
> > + list_add_rcu(&newcont->list, &audit_contid_hash[h]);
> > + spin_unlock(&audit_contobj_list_lock);
>
> I think we might have a problem where multiple tasks could race adding
> the same audit container ID and since there is no check inside the
> spinlock protected region we could end up adding multiple instances of
> the same audit container ID, yes?

Yes, you are right. It was properly protected in v7. I'll go back to
the coverage from v7.
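One common way to close this kind of race is to do the lookup and the insertion inside the same locked region. The following is only a userspace sketch under stated assumptions, not the v7 code: struct contobj, contobj_find_or_add() and the toy spinlock helpers are hypothetical stand-ins, and the refcount is a plain integer that is only touched with the lock held.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's struct audit_contobj. */
struct contobj {
	struct contobj *next;
	uint64_t id;
	const void *owner;	/* stands in for the owning task */
	unsigned int refcount;	/* only touched with contobj_lock held */
};

static struct contobj *contobj_list;
static atomic_flag contobj_lock = ATOMIC_FLAG_INIT;

/* Toy spinlock, assumed here only to keep the sketch self-contained. */
static void toy_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin */
}

static void toy_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Look up @id and add it if absent, all under one lock, so two callers
 * cannot both miss the lookup and insert duplicate entries.
 * Returns the object with a reference held, or NULL if @id already
 * exists under a different owner or the allocation fails.
 */
static struct contobj *contobj_find_or_add(uint64_t id, const void *owner)
{
	struct contobj *c, *ret = NULL;
	/* Allocate before locking so the critical section stays short. */
	struct contobj *newc = malloc(sizeof(*newc));

	if (!newc)
		return NULL;
	newc->id = id;
	newc->owner = owner;
	newc->refcount = 1;

	toy_spin_lock(&contobj_lock);
	for (c = contobj_list; c; c = c->next)
		if (c->id == id)
			break;
	if (!c) {
		/* Not present: publish the preallocated object. */
		newc->next = contobj_list;
		contobj_list = newc;
		ret = newc;
		newc = NULL;
	} else if (c->owner == owner) {
		/* Injection into an existing container by its owner. */
		c->refcount++;
		ret = c;
	}
	toy_spin_unlock(&contobj_lock);

	free(newc);	/* unused preallocation; free(NULL) is a no-op */
	return ret;
}
```

Preallocating outside the lock trades an occasional wasted allocation for never allocating while the lock is held.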

> > + } else {
> > + rc = -ENOMEM;
> > + goto conterror;
> > + }
> > + }
> > + task->audit->cont = newcont;
> > + spin_lock(&audit_contobj_list_lock);
> > + _audit_contobj_put(oldcont);
> > + spin_unlock(&audit_contobj_list_lock);
>
> More of the refcount_t/integer/spinlock question.
>
>
> > +conterror:
> > + rcu_read_unlock();
> > + }
> > task_unlock(task);
> >
> > if (!audit_enabled)
>
> --
> paul moore
> http://www.paul-moore.com
>

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-02-04 23:04:07

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 05/16] audit: log drop of contid on exit of last task

On 2020-01-22 16:28, Paul Moore wrote:
> On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
> >
> > Since we are tracking the life of each audit container identifier, we
> > can match the creation event with the destruction event. Log the
> > destruction of the audit container identifier when the last process in
> > that container exits.
> >
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > ---
> > kernel/audit.c | 17 +++++++++++++++++
> > kernel/audit.h | 2 ++
> > kernel/auditsc.c | 2 ++
> > 3 files changed, 21 insertions(+)
> >
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index 4bab20f5f781..fa8f1aa3a605 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -2502,6 +2502,23 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > return rc;
> > }
> >
> > +void audit_log_container_drop(void)
> > +{
> > + struct audit_buffer *ab;
> > +
> > + if (!current->audit || !current->audit->cont ||
> > + refcount_read(&current->audit->cont->refcount) > 1)
> > + return;
> > + ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
> > + if (!ab)
> > + return;
> > +
> > + audit_log_format(ab, "op=drop opid=%d contid=%llu old-contid=%llu",
> > + task_tgid_nr(current), audit_get_contid(current),
> > + audit_get_contid(current));
> > + audit_log_end(ab);
> > +}
>
> Assuming we are careful about where we call it in audit_free(...), you
> are confident we can't do this as part of _audit_contobj_put(...),
> yes?

We need audit_log_container_drop in audit_free_syscall() due to needing
context, which gets freed in audit_free_syscall() called from
audit_free().

We need audit_log_container_drop in audit_log_exit() due to having that
record included before the EOE record at the end of audit_log_exit().

We could put it in _audit_contobj_put() if we drop the context and any
attempt to connect it with a syscall record, which I strongly discourage.

The syscall record contains info about the subject; the container_id
record only contains info about the container object, other than the
subject's pid.
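The condition under which the quoted audit_log_container_drop() produces a record can be sketched in userspace as follows. This is an illustrative model only: struct contobj and format_container_drop() are hypothetical names, the refcount is a plain integer, and the record is formatted into a caller-supplied buffer instead of an audit_buffer. The field layout mirrors the format string in the patch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the container object tracked per task. */
struct contobj {
	uint64_t id;
	unsigned int refcount;
};

/*
 * Format the "drop" record only when the caller holds the last
 * reference, mirroring the refcount check in the quoted patch.
 * Returns the number of characters written, or 0 if no record is due.
 */
static int format_container_drop(char *buf, size_t len, int opid,
				 const struct contobj *cont)
{
	if (!cont || cont->refcount > 1)
		return 0;
	return snprintf(buf, len,
			"op=drop opid=%d contid=%" PRIu64 " old-contid=%" PRIu64,
			opid, cont->id, cont->id);
}
```

Because the check depends on who holds the last reference, whichever code path drops that reference is the one the record ends up attributed to.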

> > /**
> > * audit_log_end - end one audit record
> > * @ab: the audit_buffer
> > diff --git a/kernel/audit.h b/kernel/audit.h
> > index e4a31aa92dfe..162de8366b32 100644
> > --- a/kernel/audit.h
> > +++ b/kernel/audit.h
> > @@ -255,6 +255,8 @@ extern void audit_log_d_path_exe(struct audit_buffer *ab,
> > extern struct tty_struct *audit_get_tty(void);
> > extern void audit_put_tty(struct tty_struct *tty);
> >
> > +extern void audit_log_container_drop(void);
> > +
> > /* audit watch/mark/tree functions */
> > #ifdef CONFIG_AUDITSYSCALL
> > extern unsigned int audit_serial(void);
> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index 0e2d50533959..bd855794ad26 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -1568,6 +1568,8 @@ static void audit_log_exit(void)
> >
> > audit_log_proctitle();
> >
> > + audit_log_container_drop();
> > +
> > /* Send end of event record to help user space know we are finished */
> > ab = audit_log_start(context, GFP_KERNEL, AUDIT_EOE);
> > if (ab)
> > --
> > 1.8.3.1
> >
>
> --
> paul moore
> http://www.paul-moore.com
>

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-02-04 23:23:46

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-01-23 16:35, Paul Moore wrote:
> On Thu, Jan 23, 2020 at 3:04 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-01-23 12:09, Paul Moore wrote:
> > > On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-01-22 16:28, Paul Moore wrote:
> > > > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > >
> > > > > > Add audit container identifier support to the action of signalling the
> > > > > > audit daemon.
> > > > > >
> > > > > > Since this would need to add an element to the audit_sig_info struct,
> > > > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > > > > audit_sig_info2 struct. Corresponding support is required in the
> > > > > > userspace code to reflect the new record request and reply type.
> > > > > > An older userspace won't break since it won't know to request this
> > > > > > record type.
> > > > > >
> > > > > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > > > > ---
> > > > > > include/linux/audit.h | 7 +++++++
> > > > > > include/uapi/linux/audit.h | 1 +
> > > > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++
> > > > > > kernel/audit.h | 1 +
> > > > > > security/selinux/nlmsgtab.c | 1 +
> > > > > > 5 files changed, 45 insertions(+)
> > > > >
> > > > > ...
> > > > >
> > > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > > index 0871c3e5d6df..51159c94041c 100644
> > > > > > --- a/kernel/audit.c
> > > > > > +++ b/kernel/audit.c
> > > > > > @@ -126,6 +126,14 @@ struct auditd_connection {
> > > > > > kuid_t audit_sig_uid = INVALID_UID;
> > > > > > pid_t audit_sig_pid = -1;
> > > > > > u32 audit_sig_sid = 0;
> > > > > > +/* Since the signal information is stored in the record buffer at the
> > > > > > + * time of the signal, but not retrieved until later, there is a chance
> > > > > > + * that the last process in the container could terminate before the
> > > > > > + * signal record is delivered. In this circumstance, there is a chance
> > > > > > + * the orchestrator could reuse the audit container identifier, causing
> > > > > > + * an overlap of audit records that refer to the same audit container
> > > > > > + * identifier, but a different container instance. */
> > > > > > +u64 audit_sig_cid = AUDIT_CID_UNSET;
> > > > >
> > > > > I believe we could prevent the case mentioned above by taking an
> > > > > additional reference to the audit container ID object when the signal
> > > > > information is collected, dropping it only after the signal
> > > > > information is collected by userspace or another process signals the
> > > > > audit daemon. Yes, it would block that audit container ID from being
> > > > > reused immediately, but since we are talking about one number out of
> > > > > 2^64 that seems like a reasonable tradeoff.
> > > >
> > > > I had thought that through and should have been more explicit about that
> > > > situation when I documented it. We could do that, but then the syscall
> > > > records would be connected with the call from auditd on shutdown to
> > > > request that signal information, rather than the exit of that last
> > > > process that was using that container. This strikes me as misleading.
> > > > Is that really what we want?
> > >
> > > ???
> > >
> > > I think one of us is not understanding the other; maybe it's me, maybe
> > > it's you, maybe it's both of us.
> > >
> > > Anyway, here is what I was trying to convey with my original comment
> > > ... When we record the audit container ID in audit_signal_info() we
> > > take an extra reference to the audit container ID object so that it
> > > will not disappear (and get reused) until after we respond with an
> > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the
> > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in
> > > audit_signal_info(). Unless I'm missing some other change you made,
> > > this *shouldn't* affect the syscall records, all it does is preserve
> > > the audit container ID object in the kernel's ACID store so it doesn't
> > > get reused.
> >
> > This is exactly what I had understood. I hadn't considered the extra
> > details below in detail due to my original syscall concern, but they
> > make sense.
> >
> > The syscall I refer to is the one connected with the drop of the
> > audit container identifier by the last process that was in that
> > container in patch 5/16. The production of this record is contingent on
> > the last ref in a contobj being dropped. So if it is due to that ref
> > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > record is fetched, then it will appear that the fetch action closed the
> > container rather than the last process in the container to exit.
> >
> > Does this make sense?
>
> More so than your original reply, at least to me anyway.
>
> It makes sense that the audit container ID wouldn't be marked as
> "dead" since it would still be very much alive and available for use
> by the orchestrator, the question is if that is desirable or not. I
> think the answer to this comes down to preserving the correctness of
> the audit log.
>
> If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> reused then I think there is a legitimate concern that the audit log
> is not correct, and could be misleading. If we solve that by grabbing
> an extra reference, then there could also be some confusion as
> userspace considers a container to be "dead" while the audit container
> ID still exists in the kernel, and the kernel generated audit
> container ID death record will not be generated until much later (and
> possibly be associated with a different event, but that could be
> solved by unassociating the container death record).

How does syscall association of the death record with AUDIT_SIGNAL_INFO2
possibly get associated with another event? Or is the syscall
association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?

Another idea might be to bump the refcount in audit_signal_info() but
mark that contid as dead so it can't be reused, if we are concerned
about the dead contid being reused?

There is still the problem later that the reported contid is incomplete
compared to the rest of the contid reporting cycle wrt nesting since
AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
fields to accommodate a nested contid list.

> Of the two
> approaches, I think the latter is safer in that it preserves the
> correctness of the audit log, even though it could result in a delay
> of the container death record.

I prefer the former since it strongly indicates last task in the
container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
attributes and the contid to strongly link the responsible party.
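To make the trade-off concrete, here is a small single-threaded userspace model of the extra-reference scheme Paul describes, including the housekeeping for a signal whose info is never fetched. All names (struct contobj, sig_cont, record_signal(), fetch_signal_info(), CONTID_UNSET) are hypothetical, and there is no locking; it is a sketch of the reference flow, not proposed kernel code.

```c
#include <stdint.h>
#include <stdlib.h>

#define CONTID_UNSET UINT64_MAX	/* assumed sentinel for "no contid" */

/* Hypothetical stand-in for the refcounted container object. */
struct contobj {
	uint64_t id;
	unsigned int refcount;
};

/* Object pinned on behalf of the pending signal info, if any. */
static struct contobj *sig_cont;

static struct contobj *contobj_hold(struct contobj *c)
{
	if (c)
		c->refcount++;
	return c;
}

/* Returns 1 when the last reference was dropped (the ID becomes reusable). */
static int contobj_put(struct contobj *c)
{
	if (!c)
		return 0;
	return --c->refcount == 0;
}

/*
 * Signal time: drop a stale pin left by an unfetched earlier signal,
 * then pin the sender's container. Returns 1 if the stale pin was the
 * last reference to its object.
 */
static int record_signal(struct contobj *sender_cont)
{
	int released = contobj_put(sig_cont);

	sig_cont = contobj_hold(sender_cont);
	return released;
}

/*
 * Fetch time: report the pinned ID and drop the extra reference.
 * Returns 1 if the fetch itself dropped the last reference.
 */
static int fetch_signal_info(uint64_t *contid)
{
	int released;

	*contid = sig_cont ? sig_cont->id : CONTID_UNSET;
	released = contobj_put(sig_cont);
	sig_cont = NULL;
	return released;
}
```

If the last task in the container exits between record_signal() and fetch_signal_info(), the final put happens inside the fetch, which is the attribution concern raised above.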

> Neither way is perfect, so if you have any other ideas I'm all ears.
>
> > > (We do need to do some extra housekeeping in audit_signal_info() to
> > > deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 -
> > > basically if audit_sig_cid is not NULL we should drop a reference
> > > before assigning it a new object pointer, and of course we would need
> > > to set audit_sig_cid to NULL in audit_receive_msg() after sending it
> > > up to userspace and dropping the extra ref.)
>
> --
> paul moore
> http://www.paul-moore.com
>

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-02-04 23:45:25

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 11/16] audit: add support for containerid to network namespaces

On 2020-01-22 16:28, Paul Moore wrote:
> On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> >
> > This also adds support to qualify NETFILTER_PKT records.
> >
> > Audit events could happen in a network namespace outside of a task
> > context due to packets received from the net that trigger an auditing
> > rule prior to being associated with a running task. The network
> > namespace could be in use by multiple containers by association to the
> > tasks in that network namespace. We still want a way to attribute
> > these events to any potential containers. Keep a list per network
> > namespace to track these audit container identifiers.
> >
> > Add/increment the audit container identifier on:
> > - initial setting of the audit container identifier via /proc
> > - clone/fork call that inherits an audit container identifier
> > - unshare call that inherits an audit container identifier
> > - setns call that inherits an audit container identifier
> > Delete/decrement the audit container identifier on:
> > - an inherited audit container identifier dropped when child set
> > - process exit
> > - unshare call that drops a net namespace
> > - setns call that drops a net namespace
> >
> > Add audit container identifier auxiliary record(s) to NETFILTER_PKT
> > event standalone records. Iterate through all potential audit container
> > identifiers associated with a network namespace.
> >
> > Please see the github audit kernel issue for contid net support:
> > https://github.com/linux-audit/audit-kernel/issues/92
> > Please see the github audit testsuite issue for the test case:
> > https://github.com/linux-audit/audit-testsuite/issues/64
> > Please see the github audit wiki for the feature overview:
> > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > Acked-by: Neil Horman <[email protected]>
> > Reviewed-by: Ondrej Mosnacek <[email protected]>
> > ---
> > include/linux/audit.h | 24 +++++++++
> > kernel/audit.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++-
> > kernel/nsproxy.c | 4 ++
> > net/netfilter/nft_log.c | 11 +++-
> > net/netfilter/xt_AUDIT.c | 11 +++-
> > 5 files changed, 176 insertions(+), 6 deletions(-)
>
> ...
>
> > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > index 5531d37a4226..ed8d5b74758d 100644
> > --- a/include/linux/audit.h
> > +++ b/include/linux/audit.h
> > @@ -12,6 +12,7 @@
> > #include <linux/sched.h>
> > #include <linux/ptrace.h>
> > #include <uapi/linux/audit.h>
> > +#include <linux/refcount.h>
> >
> > #define AUDIT_INO_UNSET ((unsigned long)-1)
> > #define AUDIT_DEV_UNSET ((dev_t)-1)
> > @@ -121,6 +122,13 @@ struct audit_task_info {
> >
> > extern struct audit_task_info init_struct_audit;
> >
> > +struct audit_contobj_netns {
> > + struct list_head list;
> > + u64 id;
>
> Since we now track audit container IDs in their own structure, why not
> link directly to the audit container ID object (and bump the
> refcount)?

Ok, I've done this but at first I had doubts about the complexity.

> > + refcount_t refcount;
> > + struct rcu_head rcu;
> > +};
> > +
> > extern int is_audit_feature_set(int which);
> >
> > extern int __init audit_register_class(int class, unsigned *list);
> > @@ -225,6 +233,12 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
> > }
> >
> > extern void audit_log_container_id(struct audit_context *context, u64 contid);
> > +extern void audit_netns_contid_add(struct net *net, u64 contid);
> > +extern void audit_netns_contid_del(struct net *net, u64 contid);
> > +extern void audit_switch_task_namespaces(struct nsproxy *ns,
> > + struct task_struct *p);
> > +extern void audit_log_netns_contid_list(struct net *net,
> > + struct audit_context *context);
> >
> > extern u32 audit_enabled;
> >
> > @@ -297,6 +311,16 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
> >
> > static inline void audit_log_container_id(struct audit_context *context, u64 contid)
> > { }
> > +static inline void audit_netns_contid_add(struct net *net, u64 contid)
> > +{ }
> > +static inline void audit_netns_contid_del(struct net *net, u64 contid)
> > +{ }
> > +static inline void audit_switch_task_namespaces(struct nsproxy *ns,
> > + struct task_struct *p)
> > +{ }
> > +static inline void audit_log_netns_contid_list(struct net *net,
> > + struct audit_context *context)
> > +{ }
> >
> > #define audit_enabled AUDIT_OFF
> >
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index d4e6eafe5644..f7a8d3288ca0 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -59,6 +59,7 @@
> > #include <linux/freezer.h>
> > #include <linux/pid_namespace.h>
> > #include <net/netns/generic.h>
> > +#include <net/net_namespace.h>
> >
> > #include "audit.h"
> >
> > @@ -86,9 +87,13 @@
> > /**
> > * struct audit_net - audit private network namespace data
> > * @sk: communication socket
> > + * @contid_list: audit container identifier list
> > + * @contid_list_lock audit container identifier list lock
> > */
> > struct audit_net {
> > struct sock *sk;
> > + struct list_head contid_list;
> > + spinlock_t contid_list_lock;
> > };
> >
> > /**
> > @@ -305,8 +310,11 @@ struct audit_task_info init_struct_audit = {
> > void audit_free(struct task_struct *tsk)
> > {
> > struct audit_task_info *info = tsk->audit;
> > + struct nsproxy *ns = tsk->nsproxy;
> >
> > audit_free_syscall(tsk);
> > + if (ns)
> > + audit_netns_contid_del(ns->net_ns, audit_get_contid(tsk));
> > /* Freeing the audit_task_info struct must be performed after
> > * audit_log_exit() due to need for loginuid and sessionid.
> > */
> > @@ -409,6 +417,120 @@ static struct sock *audit_get_sk(const struct net *net)
> > return aunet->sk;
> > }
> >
> > +void audit_netns_contid_add(struct net *net, u64 contid)
> > +{
> > + struct audit_net *aunet;
> > + struct list_head *contid_list;
> > + struct audit_contobj_netns *cont;
> > +
> > + if (!net)
> > + return;
> > + if (!audit_contid_valid(contid))
> > + return;
> > + aunet = net_generic(net, audit_net_id);
> > + if (!aunet)
> > + return;
> > + contid_list = &aunet->contid_list;
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(cont, contid_list, list)
> > + if (cont->id == contid) {
> > + spin_lock(&aunet->contid_list_lock);
> > + refcount_inc(&cont->refcount);
> > + spin_unlock(&aunet->contid_list_lock);
> > + goto out;
> > + }
> > + cont = kmalloc(sizeof(*cont), GFP_ATOMIC);
> > + if (cont) {
> > + INIT_LIST_HEAD(&cont->list);
> > + cont->id = contid;
> > + refcount_set(&cont->refcount, 1);
> > + spin_lock(&aunet->contid_list_lock);
> > + list_add_rcu(&cont->list, contid_list);
> > + spin_unlock(&aunet->contid_list_lock);
> > + }
> > +out:
> > + rcu_read_unlock();
> > +}
>
> See my comments about refcount_t, spinlocks, and list manipulation
> races from earlier in the patchset; the same thing applies to the
> function above.

This was some of the complexity that concerned me, but switching to rcu
read locks helped. In this case, since a stale list would cause an
update issue and these counts aren't used or updated anywhere else,
switching to an int makes sense.
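A rough userspace sketch of that arrangement, with a plain int count that is only ever read or written under the per-namespace lock. The names struct toy_net, struct netns_cont, netns_contid_add() and netns_contid_del() are hypothetical, the toy spinlock stands in for the kernel one, and RCU readers are not modelled.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-namespace entry: one per contid seen in this netns. */
struct netns_cont {
	struct netns_cont *next;
	uint64_t id;
	int count;		/* plain int, protected by toy_net.lock */
};

/* Hypothetical stand-in for the audit data hung off a network namespace. */
struct toy_net {
	struct netns_cont *list;
	atomic_flag lock;
};

static void toy_net_init(struct toy_net *net)
{
	net->list = NULL;
	atomic_flag_clear(&net->lock);
}

static void toy_lock(struct toy_net *net)
{
	while (atomic_flag_test_and_set_explicit(&net->lock,
						 memory_order_acquire))
		;	/* spin */
}

static void toy_unlock(struct toy_net *net)
{
	atomic_flag_clear_explicit(&net->lock, memory_order_release);
}

/* Returns the new use count for @id, or -1 if allocation failed. */
static int netns_contid_add(struct toy_net *net, uint64_t id)
{
	struct netns_cont *c, *newc = malloc(sizeof(*newc));
	int count;

	if (!newc)
		return -1;
	toy_lock(net);
	for (c = net->list; c; c = c->next)
		if (c->id == id)
			break;
	if (c) {
		count = ++c->count;
	} else {
		newc->id = id;
		newc->count = count = 1;
		newc->next = net->list;
		net->list = newc;
		newc = NULL;
	}
	toy_unlock(net);
	free(newc);	/* unused preallocation; free(NULL) is a no-op */
	return count;
}

/* Returns the remaining use count for @id, or -1 if @id is not listed. */
static int netns_contid_del(struct toy_net *net, uint64_t id)
{
	struct netns_cont **pp, *c = NULL;
	int count = -1;

	toy_lock(net);
	for (pp = &net->list; *pp; pp = &(*pp)->next) {
		if ((*pp)->id != id)
			continue;
		c = *pp;
		count = --c->count;
		if (count == 0)
			*pp = c->next;	/* last user: unlink the entry */
		else
			c = NULL;	/* still in use: keep the entry */
		break;
	}
	toy_unlock(net);
	free(c);
	return count;
}
```

Since lookup, count update and list change all happen in one locked region, an atomic counter type would add nothing in this model.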

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-02-05 00:41:35

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 16/16] audit: add capcontid to set contid outside init_user_ns

On 2020-01-22 16:29, Paul Moore wrote:
> On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> >
> > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > process in a non-init user namespace the capability to set audit
> > container identifiers.
> >
> > Provide /proc/$PID/audit_capcontid interface to capcontid.
> > Valid values are: 1==enabled, 0==disabled
>
> It would be good to be more explicit about "enabled" and "disabled" in
> the commit description. For example, which setting allows the target
> task to set audit container IDs of its child processes?

Ok...

> > Report this action in message type AUDIT_SET_CAPCONTID 1022 with fields
> > opid= capcontid= old-capcontid=
> >
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > ---
> > fs/proc/base.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/audit.h | 14 ++++++++++++
> > include/uapi/linux/audit.h | 1 +
> > kernel/audit.c | 35 +++++++++++++++++++++++++++++
> > 4 files changed, 105 insertions(+)
>
> ...
>
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 26091800180c..283ef8e006e7 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1360,6 +1360,59 @@ static ssize_t proc_contid_write(struct file *file, const char __user *buf,
> > .write = proc_contid_write,
> > .llseek = generic_file_llseek,
> > };
> > +
> > +static ssize_t proc_capcontid_read(struct file *file, char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct inode *inode = file_inode(file);
> > + struct task_struct *task = get_proc_task(inode);
> > + ssize_t length;
> > + char tmpbuf[TMPBUFLEN];
> > +
> > + if (!task)
> > + return -ESRCH;
> > + /* if we don't have caps, reject */
> > + if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
> > + return -EPERM;
> > + length = scnprintf(tmpbuf, TMPBUFLEN, "%u", audit_get_capcontid(task));
> > + put_task_struct(task);
> > + return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
> > +}
> > +
> > +static ssize_t proc_capcontid_write(struct file *file, const char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct inode *inode = file_inode(file);
> > + u32 capcontid;
> > + int rv;
> > + struct task_struct *task = get_proc_task(inode);
> > +
> > + if (!task)
> > + return -ESRCH;
> > + if (*ppos != 0) {
> > + /* No partial writes. */
> > + put_task_struct(task);
> > + return -EINVAL;
> > + }
> > +
> > + rv = kstrtou32_from_user(buf, count, 10, &capcontid);
> > + if (rv < 0) {
> > + put_task_struct(task);
> > + return rv;
> > + }
> > +
> > + rv = audit_set_capcontid(task, capcontid);
> > + put_task_struct(task);
> > + if (rv < 0)
> > + return rv;
> > + return count;
> > +}
> > +
> > +static const struct file_operations proc_capcontid_operations = {
> > + .read = proc_capcontid_read,
> > + .write = proc_capcontid_write,
> > + .llseek = generic_file_llseek,
> > +};
> > #endif
> >
> > #ifdef CONFIG_FAULT_INJECTION
> > @@ -3121,6 +3174,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> > REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
> > REG("sessionid", S_IRUGO, proc_sessionid_operations),
> > REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
> > + REG("audit_capcontainerid", S_IWUSR|S_IRUSR|S_IRUSR, proc_capcontid_operations),
> > #endif
> > #ifdef CONFIG_FAULT_INJECTION
> > REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> > @@ -3522,6 +3576,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
> > REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
> > REG("sessionid", S_IRUGO, proc_sessionid_operations),
> > REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
> > + REG("audit_capcontainerid", S_IWUSR|S_IRUSR|S_IRUSR, proc_capcontid_operations),
> > #endif
> > #ifdef CONFIG_FAULT_INJECTION
> > REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > index 28b9c7cd86a6..62c453306c2a 100644
> > --- a/include/linux/audit.h
> > +++ b/include/linux/audit.h
> > @@ -116,6 +116,7 @@ struct audit_task_info {
> > kuid_t loginuid;
> > unsigned int sessionid;
> > struct audit_contobj *cont;
> > + u32 capcontid;
>
> Where is the code change that actually uses this to enforce the
> described policy on setting an audit container ID?

Oops, lost in shuffle of refactorisation when dumping the netlink code in
favour of /proc.

> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> > index 2844d78cd7af..01251e6dcec0 100644
> > --- a/include/uapi/linux/audit.h
> > +++ b/include/uapi/linux/audit.h
> > @@ -73,6 +73,7 @@
> > #define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */
> > #define AUDIT_CONTAINER_OP 1020 /* Define the container id and info */
> > #define AUDIT_SIGNAL_INFO2 1021 /* Get info auditd signal sender */
> > +#define AUDIT_SET_CAPCONTID 1022 /* Set cap_contid of a task */
> >
> > #define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */
> > #define AUDIT_USER_AVC 1107 /* We filter this differently */
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index 1287f0b63757..1c22dd084ae8 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -2698,6 +2698,41 @@ static bool audit_contid_isowner(struct task_struct *tsk)
> > return false;
> > }
> >
> > +int audit_set_capcontid(struct task_struct *task, u32 enable)
> > +{
> > + u32 oldcapcontid;
> > + int rc = 0;
> > + struct audit_buffer *ab;
> > +
> > + if (!task->audit)
> > + return -ENOPROTOOPT;
> > + oldcapcontid = audit_get_capcontid(task);
> > + /* if task is not descendant, block */
> > + if (task == current)
> > + rc = -EBADSLT;
> > + else if (!task_is_descendant(current, task))
> > + rc = -EXDEV;
>
> See my previous comments about error code sanity.

I'll go with EXDEV.

> > + else if (current_user_ns() == &init_user_ns) {
> > + if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
> > + rc = -EPERM;
>
> I think we just want to use ns_capable() in the context of the current
> userns to check CAP_AUDIT_CONTROL, yes? Something like this ...

I thought we had firmly established in previous discussion that
CAP_AUDIT_CONTROL in anything other than init_user_ns was completely irrelevant
and untrustable.
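For reference, the check Paul suggests (quoted just below) can be modelled as a pure function of three inputs. This is only a userspace truth-table sketch with hypothetical names; whether CAP_AUDIT_CONTROL should count at all outside init_user_ns is exactly the point in dispute here, so this models the suggestion, not an agreed policy.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Model of the suggested permission check (hypothetical helper).
 * @in_init_userns:        caller runs in the initial user namespace
 * @has_cap_audit_control: result of the capability check that applies
 *                         to the caller's user namespace
 * @caller_capcontid:      caller's own capcontid flag
 * Returns 0 if the caller may proceed, -EPERM otherwise.
 */
static int capcontid_check_suggested(bool in_init_userns,
				     bool has_cap_audit_control,
				     bool caller_capcontid)
{
	if (!in_init_userns) {
		/* Outside init_user_ns both conditions are required. */
		if (!has_cap_audit_control || !caller_capcontid)
			return -EPERM;
	} else if (!has_cap_audit_control) {
		/* Inside init_user_ns the capability alone decides. */
		return -EPERM;
	}
	return 0;
}
```

Under this model the capcontid flag never grants anything inside init_user_ns, and the capability alone never suffices outside it.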

> if (current_user_ns() != &init_user_ns) {
> if (!ns_capable(current_user_ns(), CAP_AUDIT_CONTROL) ||
> !audit_get_capcontid(current))
> rc = -EPERM;
> } else if (!capable(CAP_AUDIT_CONTROL))
> rc = -EPERM;
>
> > + }
> > + if (!rc)
> > + task->audit->capcontid = enable;
> > +
> > + if (!audit_enabled)
> > + return rc;
> > +
> > + ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_SET_CAPCONTID);
> > + if (!ab)
> > + return rc;
> > +
> > + audit_log_format(ab,
> > + "opid=%d capcontid=%u old-capcontid=%u",
> > + task_tgid_nr(task), enable, oldcapcontid);
> > + audit_log_end(ab);
>
> My prior comments about recording the success/failure, or not emitting
> the record on failure, seem relevant here too.

It should be recorded in the syscall record.

> > + return rc;
> > +}
>
> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-02-05 22:42:52

by Paul Moore

Subject: Re: [PATCH ghak90 V8 04/16] audit: convert to contid list to check for orch/engine ownership

On Tue, Feb 4, 2020 at 5:52 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-01-22 16:28, Paul Moore wrote:
> > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
> > >
> > > Store the audit container identifier in a refcounted kernel object that
> > > is added to the master list of audit container identifiers. This will
> > > allow multiple container orchestrators/engines to work on the same
> > > machine without danger of inadvertently re-using an existing identifier.
> > > It will also allow an orchestrator to inject a process into an existing
> > > container by checking if the original container owner is the one
> > > injecting the task. A hash table list is used to optimize searches.
> > >
> > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > ---
> > > include/linux/audit.h | 14 ++++++--
> > > kernel/audit.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > > kernel/audit.h | 8 +++++
> > > 3 files changed, 112 insertions(+), 8 deletions(-)

...

> > > @@ -232,7 +263,11 @@ int audit_alloc(struct task_struct *tsk)
> > > }
> > > info->loginuid = audit_get_loginuid(current);
> > > info->sessionid = audit_get_sessionid(current);
> > > - info->contid = audit_get_contid(current);
> > > + spin_lock(&audit_contobj_list_lock);
> > > + info->cont = _audit_contobj(current);
> > > + if (info->cont)
> > > + _audit_contobj_hold(info->cont);
> > > + spin_unlock(&audit_contobj_list_lock);
> >
> > If we are taking a spinlock in order to bump the refcount, does it
> > really need to be a refcount_t or can we just use a normal integer?
> > In RCU protected lists a spinlock is usually used to protect
> > adds/removes to the list, not the content of individual list items.
> >
> > My guess is you probably want to use the spinlock as described above
> > (list add/remove protection) and manipulate the refcount_t inside a
> > RCU read lock protected region.
>
> Ok, I guess it could be an integer if it were protected by the spinlock,
> but I think you've guessed my intent, so let us keep it as a refcount
> and tighten the spinlock scope and use rcu read locking to protect _get
> and _put in _alloc, _free, and later on when protecting the network
> namespace contobj lists. This should reduce potential contention for
> the spinlock to one location over fewer lines of code in that place
> while speeding up updates and slightly simplifying code in the others.

If it helps, you should be able to find plenty of rcu/spinlock
protected list examples in the kernel code. It might be a good idea
if you spent some time looking at those implementations first to get
an idea of how it is usually done.
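The split described above (spinlock for list add/remove, refcount manipulated by readers without the spinlock) depends on being able to take a reference only if the object is still live. A minimal userspace sketch with C11 atomics follows, loosely analogous to the kernel's refcount_inc_not_zero() and refcount_dec_and_test(). The names struct contobj, contobj_tryhold() and contobj_put() are hypothetical, and RCU itself is not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a list entry found by a lockless reader. */
struct contobj {
	uint64_t id;
	atomic_uint refcount;
};

/*
 * Take a reference only if the object is still live (count > 0).
 * A reader that found the object on the list uses this instead of a
 * blind increment, so it can never revive an object being torn down.
 */
static bool contobj_tryhold(struct contobj *c)
{
	unsigned int old = atomic_load(&c->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&c->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Drop a reference. Returns true when the caller dropped the last one
 * and is therefore responsible for unlinking the object (under the
 * list lock) and freeing it.
 */
static bool contobj_put(struct contobj *c)
{
	return atomic_fetch_sub(&c->refcount, 1) == 1;
}
```

With this shape, the list lock only needs to cover the unlink performed by whoever sees contobj_put() return true, plus insertions.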

> > > @@ -2381,9 +2425,12 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > > }
> > > oldcontid = audit_get_contid(task);
> > > read_lock(&tasklist_lock);
> > > - /* Don't allow the audit containerid to be unset */
> > > + /* Don't allow the contid to be unset */
> > > if (!audit_contid_valid(contid))
> > > rc = -EINVAL;
> > > + /* Don't allow the contid to be set to the same value again */
> > > + else if (contid == oldcontid) {
> > > + rc = -EADDRINUSE;
> >
> > First, is that brace a typo? It looks like it. Did this compile?
>
> Yes, it was fixed in the later patch that restructured the if
> statements.

Generic reminder that each patch should compile and function on its
own without the need for any follow-up patches. I know Richard is
already aware of that, and this was a mistake that slipped through the
cracks; this reminder is more for those who may be lurking on the
list.

> > Second, can you explain why (re)setting the audit container ID to the
> > same value is something we need to prohibit? I'm guessing it has
> > something to do with explicitly set vs inherited, but I don't want to
> > assume too much about your thinking behind this.
>
> It made the refcounting more complicated later, and besides, the
> prohibition on setting the contid again if it is already set would catch
> this case, so I'll remove it in this patch and ensure this action
> doesn't cause a problem in later patches.
>
> > If it is "set vs inherited", would allowing an orchestrator to
> > explicitly "set" an inherited audit container ID provide some level of
> > protection against moving the task?
>
> I can't see it helping prevent this since later descendancy checks will
> stop this move anyways.

That's what I thought, but I was just trying to think of any reason
why you felt this might have been useful since it was in the patch.
If it's in the patch I tend to fall back on the idea that it must have
served a purpose ;)

> > > @@ -2396,8 +2443,49 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > > else if (audit_contid_set(task))
> > > rc = -ECHILD;
> > > read_unlock(&tasklist_lock);
> > > - if (!rc)
> > > - task->audit->contid = contid;
> > > + if (!rc) {
> > > + struct audit_contobj *oldcont = _audit_contobj(task);
> > > + struct audit_contobj *cont = NULL, *newcont = NULL;
> > > + int h = audit_hash_contid(contid);
> > > +
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(cont, &audit_contid_hash[h], list)
> > > + if (cont->id == contid) {
> > > + /* task injection to existing container */
> > > + if (current == cont->owner) {
> >
> > Do we have any protection against the task pointed to by cont->owner
> > going away and a new task with the same current pointer value (no
> > longer the legitimate audit container ID owner) manipulating the
> > target task's audit container ID?
>
> Yes, the get_task_struct() call below.

Gotcha.

--
paul moore
http://www.paul-moore.com

2020-02-05 22:52:51

by Paul Moore

Subject: Re: [PATCH ghak90 V8 11/16] audit: add support for containerid to network namespaces

On Tue, Feb 4, 2020 at 6:43 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-01-22 16:28, Paul Moore wrote:
> > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > >
> > > This also adds support to qualify NETFILTER_PKT records.
> > >
> > > Audit events could happen in a network namespace outside of a task
> > > context due to packets received from the net that trigger an auditing
> > > rule prior to being associated with a running task. The network
> > > namespace could be in use by multiple containers by association to the
> > > tasks in that network namespace. We still want a way to attribute
> > > these events to any potential containers. Keep a list per network
> > > namespace to track these audit container identifiers.
> > >
> > > Add/increment the audit container identifier on:
> > > - initial setting of the audit container identifier via /proc
> > > - clone/fork call that inherits an audit container identifier
> > > - unshare call that inherits an audit container identifier
> > > - setns call that inherits an audit container identifier
> > > Delete/decrement the audit container identifier on:
> > > - an inherited audit container identifier dropped when child set
> > > - process exit
> > > - unshare call that drops a net namespace
> > > - setns call that drops a net namespace
> > >
> > > Add audit container identifier auxiliary record(s) to NETFILTER_PKT
> > > event standalone records. Iterate through all potential audit container
> > > identifiers associated with a network namespace.
> > >
> > > Please see the github audit kernel issue for contid net support:
> > > https://github.com/linux-audit/audit-kernel/issues/92
> > > Please see the github audit testsuite issue for the test case:
> > > https://github.com/linux-audit/audit-testsuite/issues/64
> > > Please see the github audit wiki for the feature overview:
> > > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > Acked-by: Neil Horman <[email protected]>
> > > Reviewed-by: Ondrej Mosnacek <[email protected]>
> > > ---
> > > include/linux/audit.h | 24 +++++++++
> > > kernel/audit.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++-
> > > kernel/nsproxy.c | 4 ++
> > > net/netfilter/nft_log.c | 11 +++-
> > > net/netfilter/xt_AUDIT.c | 11 +++-
> > > 5 files changed, 176 insertions(+), 6 deletions(-)
> >
> > ...
> >
> > > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > > index 5531d37a4226..ed8d5b74758d 100644
> > > --- a/include/linux/audit.h
> > > +++ b/include/linux/audit.h
> > > @@ -12,6 +12,7 @@
> > > #include <linux/sched.h>
> > > #include <linux/ptrace.h>
> > > #include <uapi/linux/audit.h>
> > > +#include <linux/refcount.h>
> > >
> > > #define AUDIT_INO_UNSET ((unsigned long)-1)
> > > #define AUDIT_DEV_UNSET ((dev_t)-1)
> > > @@ -121,6 +122,13 @@ struct audit_task_info {
> > >
> > > extern struct audit_task_info init_struct_audit;
> > >
> > > +struct audit_contobj_netns {
> > > + struct list_head list;
> > > + u64 id;
> >
> > Since we now track audit container IDs in their own structure, why not
> > link directly to the audit container ID object (and bump the
> > refcount)?
>
> Ok, I've done this but at first I had doubts about the complexity.

Yes, it will be more complex, but it should be much safer.
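
As a rough illustration of the suggestion above, the per-namespace list
entry could hold a counted pointer to the shared object instead of a copy
of the u64. This is a userspace sketch only: contobj, netns_entry,
contobj_get(), contobj_put() and netns_add() are hypothetical stand-ins
rather than the names used in the patch, and a plain int stands in for
the kernel's refcount_t.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the audit container ID object. */
struct contobj {
	uint64_t id;
	int refcount;		/* the kernel would use refcount_t */
};

/* Hypothetical per-netns list entry: points at the object, no id copy. */
struct netns_entry {
	struct netns_entry *next;
	struct contobj *obj;
};

static struct contobj *contobj_get(struct contobj *obj)
{
	obj->refcount++;
	return obj;
}

/* Returns 1 if this call dropped the last reference and freed the object. */
static int contobj_put(struct contobj *obj)
{
	if (--obj->refcount > 0)
		return 0;
	free(obj);
	return 1;
}

/* Add an object to a namespace list; the new entry owns one reference. */
static struct netns_entry *netns_add(struct netns_entry *head,
				     struct contobj *obj)
{
	struct netns_entry *e = malloc(sizeof(*e));

	if (!e)
		return head;
	e->obj = contobj_get(obj);
	e->next = head;
	return e;
}
```

While the namespace entry holds its reference, the object (and so its ID)
stays alive even after the task that created it has dropped its own
reference, which is the safety property being asked for.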

--
paul moore
http://www.paul-moore.com

2020-02-05 22:53:06

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Tue, Feb 4, 2020 at 6:15 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-01-23 16:35, Paul Moore wrote:
> > On Thu, Jan 23, 2020 at 3:04 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-01-23 12:09, Paul Moore wrote:
> > > > On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2020-01-22 16:28, Paul Moore wrote:
> > > > > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > >
> > > > > > > Add audit container identifier support to the action of signalling the
> > > > > > > audit daemon.
> > > > > > >
> > > > > > > Since this would need to add an element to the audit_sig_info struct,
> > > > > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > > > > > audit_sig_info2 struct. Corresponding support is required in the
> > > > > > > userspace code to reflect the new record request and reply type.
> > > > > > > An older userspace won't break since it won't know to request this
> > > > > > > record type.
> > > > > > >
> > > > > > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > > > > > ---
> > > > > > > include/linux/audit.h | 7 +++++++
> > > > > > > include/uapi/linux/audit.h | 1 +
> > > > > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++
> > > > > > > kernel/audit.h | 1 +
> > > > > > > security/selinux/nlmsgtab.c | 1 +
> > > > > > > 5 files changed, 45 insertions(+)
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > > > index 0871c3e5d6df..51159c94041c 100644
> > > > > > > --- a/kernel/audit.c
> > > > > > > +++ b/kernel/audit.c
> > > > > > > @@ -126,6 +126,14 @@ struct auditd_connection {
> > > > > > > kuid_t audit_sig_uid = INVALID_UID;
> > > > > > > pid_t audit_sig_pid = -1;
> > > > > > > u32 audit_sig_sid = 0;
> > > > > > > +/* Since the signal information is stored in the record buffer at the
> > > > > > > + * time of the signal, but not retrieved until later, there is a chance
> > > > > > > + * that the last process in the container could terminate before the
> > > > > > > + * signal record is delivered. In this circumstance, there is a chance
> > > > > > > + * the orchestrator could reuse the audit container identifier, causing
> > > > > > > + * an overlap of audit records that refer to the same audit container
> > > > > > > + * identifier, but a different container instance. */
> > > > > > > +u64 audit_sig_cid = AUDIT_CID_UNSET;
> > > > > >
> > > > > > I believe we could prevent the case mentioned above by taking an
> > > > > > additional reference to the audit container ID object when the signal
> > > > > > information is collected, dropping it only after the signal
> > > > > > information is collected by userspace or another process signals the
> > > > > > audit daemon. Yes, it would block that audit container ID from being
> > > > > > reused immediately, but since we are talking about one number out of
> > > > > > 2^64 that seems like a reasonable tradeoff.
> > > > >
> > > > > I had thought that through and should have been more explicit about that
> > > > > situation when I documented it. We could do that, but then the syscall
> > > > > records would be connected with the call from auditd on shutdown to
> > > > > request that signal information, rather than the exit of that last
> > > > > process that was using that container. This strikes me as misleading.
> > > > > Is that really what we want?
> > > >
> > > > ???
> > > >
> > > > I think one of us is not understanding the other; maybe it's me, maybe
> > > > it's you, maybe it's both of us.
> > > >
> > > > Anyway, here is what I was trying to convey with my original comment
> > > > ... When we record the audit container ID in audit_signal_info() we
> > > > take an extra reference to the audit container ID object so that it
> > > > will not disappear (and get reused) until after we respond with an
> > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the
> > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in
> > > > audit_signal_info(). Unless I'm missing some other change you made,
> > > > this *shouldn't* affect the syscall records, all it does is preserve
> > > > the audit container ID object in the kernel's ACID store so it doesn't
> > > > get reused.
> > >
> > > This is exactly what I had understood. I hadn't considered the extra
> > > details below in detail due to my original syscall concern, but they
> > > make sense.
> > >
> > > The syscall I refer to is the one connected with the drop of the
> > > audit container identifier by the last process that was in that
> > > container in patch 5/16. The production of this record is contingent on
> > > the last ref in a contobj being dropped. So if it is due to that ref
> > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > record is fetched, then it will appear that the fetch action closed the
> > > container rather than the last process in the container to exit.
> > >
> > > Does this make sense?
> >
> > More so than your original reply, at least to me anyway.
> >
> > It makes sense that the audit container ID wouldn't be marked as
> > "dead" since it would still be very much alive and available for use
> > by the orchestrator, the question is if that is desirable or not. I
> > think the answer to this comes down to preserving the correctness of
> > the audit log.
> >
> > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > reused then I think there is a legitimate concern that the audit log
> > is not correct, and could be misleading. If we solve that by grabbing
> > an extra reference, then there could also be some confusion as
> > userspace considers a container to be "dead" while the audit container
> > ID still exists in the kernel, and the kernel generated audit
> > container ID death record will not be generated until much later (and
> > possibly be associated with a different event, but that could be
> > solved by unassociating the container death record).
>
> How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> possibly get associated with another event? Or is the syscall
> association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?

The issue is when does the audit container ID "die". If it is when
the last task in the container exits, then the death record will be
associated with the task's exit. If the audit container ID lives on
until the last reference to it in the audit logs, including the
SIGNAL_INFO2 message, the death record will be associated with the
related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
the details of the syscalls/netlink.
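
The tradeoff can be made concrete with a small userspace model. This is
only a sketch under stated assumptions: contobj, contobj_hold() and
contobj_put() are hypothetical names, the printed line is not the real
record format, and the context strings merely label which caller dropped
the last reference.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the audit container ID object. */
struct contobj {
	uint64_t id;
	int refcount;
	const char *died_in;	/* context that dropped the last reference */
};

static void contobj_hold(struct contobj *c)
{
	c->refcount++;
}

/* Drop a reference; the "death record" lands in the caller's context. */
static void contobj_put(struct contobj *c, const char *context)
{
	if (--c->refcount == 0) {
		c->died_in = context;
		printf("contid %llu dropped in: %s\n",
		       (unsigned long long)c->id, context);
	}
}
```

Without the extra hold, the only put comes from the exiting task and the
record is tied to that exit. With a hold taken when the signal
information is stored, the task's put leaves the count at one, and the
record only appears when the SIGNAL_INFO2 reply path does the final put.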

> Another idea might be to bump the refcount in audit_signal_info() but
> mark that contid as dead so it can't be reused if we are concerned that
> the dead contid be reused?

Ooof. Yes, maybe, but that would be ugly.

> There is still the problem later that the reported contid is incomplete
> compared to the rest of the contid reporting cycle wrt nesting since
> AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> fields to accommodate a nested contid list.

Do we really care about the full nested audit container ID list in the
SIGNAL_INFO2 record?

> > Of the two
> > approaches, I think the latter is safer in that it preserves the
> > correctness of the audit log, even though it could result in a delay
> > of the container death record.
>
> I prefer the former since it strongly indicates last task in the
> container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> attributes and the contid to strongly link the responsible party.

Steve is the only one who really tracks the security certifications
that are relevant to audit; see what the certification requirements
have to say and we can revisit this.

--
paul moore
http://www.paul-moore.com

2020-02-05 22:59:28

by Paul Moore

Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On Tue, Feb 4, 2020 at 1:12 PM Steve Grubb <[email protected]> wrote:
> On Tuesday, February 4, 2020 10:52:36 AM EST Paul Moore wrote:
> > On Tue, Feb 4, 2020 at 10:47 AM Steve Grubb <[email protected]> wrote:
> > > On Tuesday, February 4, 2020 8:19:44 AM EST Richard Guy Briggs wrote:
> > > > > The established pattern is that we print -1 when it's unset and "?"
> > > > > when it's totally missing. So, how could this be invalid? It should
> > > > > be set or not. That is unless it's totally missing just like when we
> > > > > do not run with selinux enabled and a context just doesn't exist.
> > > >
> > > > Ok, so in this case it is clearly unset, so should be -1, which will be
> > > > a 20-digit number when represented as an unsigned long long int.
> > > >
> > > > Thank you for that clarification Steve.
> > >
> > > It is literally a -1. ( 2 characters)
> >
> > Well, not as Richard has currently written the code, it is a "%llu".
> > This was why I asked the question I did; if we want the "-1" here we
> > probably want to special case that as I don't think we want to display
> > audit container IDs as signed numbers in general.
>
> OK, then go with the long number, we'll fix it in the interpretation. I guess
> we do the same thing for auid.

As I said above, I'm okay with a special case handling for unset/"-1"
in this case.
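
A minimal userspace sketch of that special case, assuming the unset value
is the all-ones u64 as the discussion above implies. CONTID_UNSET and
format_contid() are hypothetical names used for illustration, not
identifiers from the patch.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical name for the all-ones "unset" value. */
#define CONTID_UNSET UINT64_MAX

/* Write the contid into buf: "-1" when unset, unsigned decimal otherwise. */
static int format_contid(char *buf, size_t len, uint64_t contid)
{
	if (contid == CONTID_UNSET)
		return snprintf(buf, len, "-1");
	return snprintf(buf, len, "%llu", (unsigned long long)contid);
}
```

Every other value still goes through "%llu", so valid audit container IDs
are never shown as signed numbers; only the unset case collapses from the
20-digit form to the two-character "-1".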

--
paul moore
http://www.paul-moore.com

2020-02-05 22:59:28

by Paul Moore

Subject: Re: [PATCH ghak90 V8 16/16] audit: add capcontid to set contid outside init_user_ns

On Tue, Feb 4, 2020 at 7:39 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-01-22 16:29, Paul Moore wrote:
> > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > >
> > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > process in a non-init user namespace the capability to set audit
> > > container identifiers.
> > >
> > > Provide /proc/$PID/audit_capcontid interface to capcontid.
> > > Valid values are: 1==enabled, 0==disabled
> >
> > It would be good to be more explicit about "enabled" and "disabled" in
> > the commit description. For example, which setting allows the target
> > task to set audit container IDs of its child processes?
>
> Ok...
>
> > > Report this action in message type AUDIT_SET_CAPCONTID 1022 with fields
> > > opid= capcontid= old-capcontid=
> > >
> > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > ---
> > > fs/proc/base.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
> > > include/linux/audit.h | 14 ++++++++++++
> > > include/uapi/linux/audit.h | 1 +
> > > kernel/audit.c | 35 +++++++++++++++++++++++++++++
> > > 4 files changed, 105 insertions(+)

...

> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index 1287f0b63757..1c22dd084ae8 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > > @@ -2698,6 +2698,41 @@ static bool audit_contid_isowner(struct task_struct *tsk)
> > > return false;
> > > }
> > >
> > > +int audit_set_capcontid(struct task_struct *task, u32 enable)
> > > +{
> > > + u32 oldcapcontid;
> > > + int rc = 0;
> > > + struct audit_buffer *ab;
> > > +
> > > + if (!task->audit)
> > > + return -ENOPROTOOPT;
> > > + oldcapcontid = audit_get_capcontid(task);
> > > + /* if task is not descendant, block */
> > > + if (task == current)
> > > + rc = -EBADSLT;
> > > + else if (!task_is_descendant(current, task))
> > > + rc = -EXDEV;
> >
> > See my previous comments about error code sanity.
>
> I'll go with EXDEV.
>
> > > + else if (current_user_ns() == &init_user_ns) {
> > > + if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
> > > + rc = -EPERM;
> >
> > I think we just want to use ns_capable() in the context of the current
> > userns to check CAP_AUDIT_CONTROL, yes? Something like this ...
>
> I thought we had firmly established in previous discussion that
> CAP_AUDIT_CONTROL in anything other than init_user_ns was completely irrelevant
> and untrustable.

In the case of a container with multiple users, and multiple
applications, one being a nested orchestrator, it seems relevant to
allow that container to control which of its processes are able to
exercise CAP_AUDIT_CONTROL. Granted, we still want to control it
within the overall host, e.g. the container in question must be
allowed to run a nested orchestrator, but allowing the container
itself to provide its own granularity seems like the right thing to
do.

> > if (current_user_ns() != &init_user_ns) {
> > if (!ns_capable(CAP_AUDIT_CONTROL) || !audit_get_capcontid())
> > rc = -EPERM;
> > } else if (!capable(CAP_AUDIT_CONTROL))
> > rc = -EPERM;
> >
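
The decision logic in the quoted snippet can be modelled in userspace to
see which combinations it admits. This is a sketch, not kernel code: the
caller structure and may_set_capcontid() are hypothetical, and each
boolean stands in for the result of the corresponding check. The quoted
snippet also abbreviates the arguments; the in-kernel ns_capable() helper
takes the user namespace as well as the capability.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the caller's state; not a kernel structure. */
struct caller {
	bool in_init_userns;	/* current_user_ns() == &init_user_ns */
	bool cap_init;		/* capable(CAP_AUDIT_CONTROL) */
	bool cap_ns;		/* CAP_AUDIT_CONTROL in the caller's own userns */
	bool capcontid;		/* capcontid flag granted to the caller */
};

static int may_set_capcontid(const struct caller *c)
{
	if (!c->in_init_userns) {
		/* Nested case: needs both the ns capability and capcontid. */
		if (!c->cap_ns || !c->capcontid)
			return -EPERM;
	} else if (!c->cap_init) {
		/* Host case: CAP_AUDIT_CONTROL alone decides. */
		return -EPERM;
	}
	return 0;
}
```

Under this reading, a task outside init_user_ns that only has the
namespaced capability, or only has capcontid, is refused; it needs both.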

--
paul moore
http://www.paul-moore.com

2020-02-05 23:10:05

by Paul Moore

Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On Thu, Jan 30, 2020 at 2:28 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-01-22 16:29, Paul Moore wrote:
> > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > >
> > > Track the parent container of a container to be able to filter and
> > > report nesting.
> > >
> > > Now that we have a way to track and check the parent container of a
> > > container, modify the contid field format to be able to report that
> > > nesting using a caret ("^") separator to indicate nesting. The
> > > original field format was "contid=<contid>" for task-associated records
> > > and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> > > records. The new field format is
> > > "contid=<contid>[^<contid>[...]][,<contid>[...]]".
> >
> > Let's make sure we always use a comma as a separator, even when
> > recording the parent information, for example:
> > "contid=<contid>[,^<contid>[...]][,<contid>[...]]"
>
> The intent here is to clearly indicate and separate nesting from
> parallel use of several containers by one netns. If we do away with
> that distinction, then we lose that inheritance accountability and
> should really run the list through a "uniq" function to remove the
> produced redundancies. This clear inheritance is something Steve was
> looking for since tracking down individual events/records to show that
> inheritance was not always feasible due to rolled logs or search effort.

Perhaps my example wasn't clear. I'm not opposed to the little
caret/hat character indicating a container's parent, I just think it
would be good to also include a comma *in*addition* to the caret/hat.
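
To make the two candidate formats concrete, here is a small userspace
sketch that renders one nesting chain both ways. format_chain() is a
hypothetical helper, and the ",^" output reflects one reading of the
"contid=<contid>[,^<contid>[...]]" example rather than an agreed format.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format a nesting chain, innermost container first, into buf.
 * sep is written before every ancestor: "^" for the format in the patch,
 * ",^" for the comma-plus-caret variant.
 */
static void format_chain(char *buf, size_t len, const uint64_t *chain,
			 size_t n, const char *sep)
{
	size_t off = 0;

	if (len)
		buf[0] = '\0';
	for (size_t i = 0; i < n && off < len; i++) {
		int w = snprintf(buf + off, len - off, "%s%llu",
				 i ? sep : "", (unsigned long long)chain[i]);
		if (w < 0)
			break;
		off += (size_t)w;
	}
}
```

For a container 3 whose parent is 2 and grandparent is 1, the "^"
separator yields "3^2^1" and the ",^" separator yields "3,^2,^1". The
second form can be split on commas alone, with the caret marking which
entries are ancestors.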

> > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > ---
> > > include/linux/audit.h | 1 +
> > > kernel/audit.c | 53 +++++++++++++++++++++++++++++++++++++++++++--------
> > > kernel/audit.h | 1 +
> > > kernel/auditfilter.c | 17 ++++++++++++++++-
> > > kernel/auditsc.c | 2 +-
> > > 5 files changed, 64 insertions(+), 10 deletions(-)
> >
> > ...
> >
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index ef8e07524c46..68be59d1a89b 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> >
> > > @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> > > audit_netns_contid_add(new->net_ns, contid);
> > > }
> > >
> > > +void audit_log_contid(struct audit_buffer *ab, u64 contid);
> >
> > If we need a forward declaration, might as well just move it up near
> > the top of the file with the rest of the declarations.
>
> Ok.
>
> > > +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> > > +{
> > > + struct audit_contobj *cont = NULL, *prcont = NULL;
> > > + int h;
> >
> > It seems safer to pass the audit container ID object and not the u64.
>
> It would also be faster, but in some places it isn't available such as
> for ptrace and signal targets. This also links back to the drop record
> refcounts to hold onto the contobj until process exit, or signal
> delivery.
>
> What we could do is to supply two potential parameters, a contobj and/or
> a contid, and have it use the contobj if it is valid, otherwise, use the
> contid, as is done for names and paths supplied to audit_log_name().

Let's not do multiple parameters, that begs for misuse, let's take the
wrapper function route:

func a(int id) {
// important stuff
}

func ao(struct obj) {
a(obj.id);
}

... and we can add a comment that you *really* should be using the
variant that passes an object.
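
Spelled out in C, the wrapper route might look roughly like the following
userspace sketch. The names log_contid() and log_contobj() and the
buffer-based signature are assumptions made for illustration; the patch
itself logs into a struct audit_buffer.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the audit container ID object. */
struct contobj {
	uint64_t id;
};

/* Worker: all formatting lives here and only needs the raw id. */
static int log_contid(char *buf, size_t len, uint64_t contid)
{
	return snprintf(buf, len, "contid=%llu", (unsigned long long)contid);
}

/*
 * Preferred entry point: callers that hold an object should use this
 * variant rather than dereferencing the id themselves.
 */
static int log_contobj(char *buf, size_t len, const struct contobj *cont)
{
	return log_contid(buf, len, cont->id);
}
```

Each function takes exactly one way of naming the container, so a caller
cannot pass an object and an ID that disagree with each other.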

> > > @@ -2705,9 +2741,10 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > > if (!ab)
> > > return rc;
> > >
> > > - audit_log_format(ab,
> > > - "op=set opid=%d contid=%llu old-contid=%llu",
> > > - task_tgid_nr(task), contid, oldcontid);
> > > + audit_log_format(ab, "op=set opid=%d contid=", task_tgid_nr(task));
> > > + audit_log_contid(ab, contid);
> > > + audit_log_format(ab, " old-contid=");
> > > + audit_log_contid(ab, oldcontid);
> >
> > This is an interesting case where contid and old-contid are going to
> > be largely the same, only the first (current) ID is going to be
> > different; do we want to duplicate all of those IDs?
>
> At first when I read your comment, I thought we could just take contid
> and drop oldcontid, but if it fails, we still want all the information,
> so given the way I've set up the search code in userspace, listing only
> the newest contid in the contid field and all the rest in oldcontid
> could be a good compromise.

This is along the lines of what I was thinking.

--
paul moore
http://www.paul-moore.com

2020-02-05 23:53:27

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On 2020-02-05 18:05, Paul Moore wrote:
> On Thu, Jan 30, 2020 at 2:28 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-01-22 16:29, Paul Moore wrote:
> > > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > >
> > > > Track the parent container of a container to be able to filter and
> > > > report nesting.
> > > >
> > > > Now that we have a way to track and check the parent container of a
> > > > container, modify the contid field format to be able to report that
> > > > nesting using a caret ("^") separator to indicate nesting. The
> > > > original field format was "contid=<contid>" for task-associated records
> > > > and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> > > > records. The new field format is
> > > > "contid=<contid>[^<contid>[...]][,<contid>[...]]".
> > >
> > > Let's make sure we always use a comma as a separator, even when
> > > recording the parent information, for example:
> > > "contid=<contid>[,^<contid>[...]][,<contid>[...]]"
> >
> > The intent here is to clearly indicate and separate nesting from
> > parallel use of several containers by one netns. If we do away with
> > that distinction, then we lose that inheritance accountability and
> > should really run the list through a "uniq" function to remove the
> > produced redundancies. This clear inheritance is something Steve was
> > looking for since tracking down individual events/records to show that
> > inheritance was not always feasible due to rolled logs or search effort.
>
> Perhaps my example wasn't clear. I'm not opposed to the little
> caret/hat character indicating a container's parent, I just think it
> would be good to also include a comma *in*addition* to the caret/hat.

Ah, ok. Well, I'd offer that the caret alone would be slightly shorter
and slightly less cluttered and, having already written the parser in
userspace, I think the parser would be slightly simpler.

I must admit, I was a bit puzzled by your snippet of code that was used
as a prefix to the next item rather than as a postfix to the given item.

Can you say why you prefer the comma in addition?

> > > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > > ---
> > > > include/linux/audit.h | 1 +
> > > > kernel/audit.c | 53 +++++++++++++++++++++++++++++++++++++++++++--------
> > > > kernel/audit.h | 1 +
> > > > kernel/auditfilter.c | 17 ++++++++++++++++-
> > > > kernel/auditsc.c | 2 +-
> > > > 5 files changed, 64 insertions(+), 10 deletions(-)
> > >
> > > ...
> > >
> > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > index ef8e07524c46..68be59d1a89b 100644
> > > > --- a/kernel/audit.c
> > > > +++ b/kernel/audit.c
> > >
> > > > @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> > > > audit_netns_contid_add(new->net_ns, contid);
> > > > }
> > > >
> > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid);
> > >
> > > If we need a forward declaration, might as well just move it up near
> > > the top of the file with the rest of the declarations.
> >
> > Ok.
> >
> > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> > > > +{
> > > > + struct audit_contobj *cont = NULL, *prcont = NULL;
> > > > + int h;
> > >
> > > It seems safer to pass the audit container ID object and not the u64.
> >
> > It would also be faster, but in some places it isn't available such as
> > for ptrace and signal targets. This also links back to the drop record
> > refcounts to hold onto the contobj until process exit, or signal
> > delivery.
> >
> > What we could do is to supply two potential parameters, a contobj and/or
> > a contid, and have it use the contobj if it is valid, otherwise, use the
> > contid, as is done for names and paths supplied to audit_log_name().
>
> Let's not do multiple parameters, that begs for misuse, let's take the
> wrapper function route:
>
> func a(int id) {
> // important stuff
> }
>
> func ao(struct obj) {
> a(obj.id);
> }
>
> ... and we can add a comment that you *really* should be using the
> variant that passes an object.

I was already doing that where it is available, and dereferencing the id
for the call. But I see an advantage to having both parameters supplied
to the function, since it saves us the trouble of dereferencing it,
searching for the id in the hash list and re-locating the object if the
object is already available.

> > > > @@ -2705,9 +2741,10 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > > > if (!ab)
> > > > return rc;
> > > >
> > > > - audit_log_format(ab,
> > > > - "op=set opid=%d contid=%llu old-contid=%llu",
> > > > - task_tgid_nr(task), contid, oldcontid);
> > > > + audit_log_format(ab, "op=set opid=%d contid=", task_tgid_nr(task));
> > > > + audit_log_contid(ab, contid);
> > > > + audit_log_format(ab, " old-contid=");
> > > > + audit_log_contid(ab, oldcontid);
> > >
> > > This is an interesting case where contid and old-contid are going to
> > > be largely the same, only the first (current) ID is going to be
> > > different; do we want to duplicate all of those IDs?
> >
> > At first when I read your comment, I thought we could just take contid
> > and drop oldcontid, but if it fails, we still want all the information,
> > so given the way I've set up the search code in userspace, listing only
> > the newest contid in the contid field and all the rest in oldcontid
> > could be a good compromise.
>
> This is along the lines of what I was thinking.

Good.

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-02-06 13:02:07

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 16/16] audit: add capcontid to set contid outside init_user_ns

On 2020-02-05 17:56, Paul Moore wrote:
> On Tue, Feb 4, 2020 at 7:39 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-01-22 16:29, Paul Moore wrote:
> > > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > >
> > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > process in a non-init user namespace the capability to set audit
> > > > container identifiers.
> > > >
> > > > Provide /proc/$PID/audit_capcontid interface to capcontid.
> > > > Valid values are: 1==enabled, 0==disabled
> > >
> > > It would be good to be more explicit about "enabled" and "disabled" in
> > > the commit description. For example, which setting allows the target
> > > task to set audit container IDs of its child processes?
> >
> > Ok...
> >
> > > > Report this action in message type AUDIT_SET_CAPCONTID 1022 with fields
> > > > opid= capcontid= old-capcontid=
> > > >
> > > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > > ---
> > > > fs/proc/base.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > include/linux/audit.h | 14 ++++++++++++
> > > > include/uapi/linux/audit.h | 1 +
> > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++
> > > > 4 files changed, 105 insertions(+)
>
> ...
>
> > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > index 1287f0b63757..1c22dd084ae8 100644
> > > > --- a/kernel/audit.c
> > > > +++ b/kernel/audit.c
> > > > @@ -2698,6 +2698,41 @@ static bool audit_contid_isowner(struct task_struct *tsk)
> > > > return false;
> > > > }
> > > >
> > > > +int audit_set_capcontid(struct task_struct *task, u32 enable)
> > > > +{
> > > > + u32 oldcapcontid;
> > > > + int rc = 0;
> > > > + struct audit_buffer *ab;
> > > > +
> > > > + if (!task->audit)
> > > > + return -ENOPROTOOPT;
> > > > + oldcapcontid = audit_get_capcontid(task);
> > > > + /* if task is not descendant, block */
> > > > + if (task == current)
> > > > + rc = -EBADSLT;
> > > > + else if (!task_is_descendant(current, task))
> > > > + rc = -EXDEV;
> > >
> > > See my previous comments about error code sanity.
> >
> > I'll go with EXDEV.
> >
> > > > + else if (current_user_ns() == &init_user_ns) {
> > > > + if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
> > > > + rc = -EPERM;
> > >
> > > I think we just want to use ns_capable() in the context of the current
> > > userns to check CAP_AUDIT_CONTROL, yes? Something like this ...
> >
> > I thought we had firmly established in previous discussion that
> > CAP_AUDIT_CONTROL in anything other than init_user_ns was completely irrelevant
> > and untrustable.
>
> In the case of a container with multiple users, and multiple
> applications, one being a nested orchestrator, it seems relevant to
> allow that container to control which of its processes are able to
> exercise CAP_AUDIT_CONTROL. Granted, we still want to control it
> within the overall host, e.g. the container in question must be
> allowed to run a nested orchestrator, but allowing the container
> itself to provide its own granularity seems like the right thing to
> do.

Looking back to discussion on the v6 patch 2/10 (2019-05-30 15:29 Paul
Moore[1], 2019-07-08 14:05 RGB[2]), it occurs to me that the
ns_capable(CAP_AUDIT_CONTROL) application was dangerous since there was
no parental accountability in storage or reporting. Now that is in
place, it does seem a bit more reasonable to allow it, but I'm still not
clear on why we would want both mechanisms now. I don't understand what
the last line in that email meant: "We would probably still want a
ns_capable(CAP_AUDIT_CONTROL) restriction in this case." Allow
ns_capable(CAP_AUDIT_CONTROL) to govern these actions, or restrict
ns_capable(CAP_AUDIT_CONTROL) from being used to govern these actions?

If an unprivileged user has been given capcontid to be able to run their
own container orchestrator/engine and spawns a user namespace with
CAP_AUDIT_CONTROL, what matters is capcontid, and not CAP_AUDIT_CONTROL.
I could see needing CAP_AUDIT_CONTROL *in addition* to capcontid to give
it finer grained control, but since capcontid would have to be given to
each process explicitly anyway, I don't see the point.

If that unprivileged user had not been given capcontid,
giving itself or one of its descendants CAP_AUDIT_CONTROL should not let
it jump into the game all of a sudden unless the now chained audit
container identifiers are deemed accountable enough. And then we would
need those hard limits on container depth and network namespace
container membership.

> > > if (current_user_ns() != &init_user_ns) {
> > >         if (!ns_capable(CAP_AUDIT_CONTROL) || !audit_get_capcontid())
> > >                 rc = -EPERM;
> > > } else if (!capable(CAP_AUDIT_CONTROL))
> > >         rc = -EPERM;
> > >
>
> paul moore

[1] https://www.redhat.com/archives/linux-audit/2019-May/msg00085.html
https://lkml.org/lkml/2019/5/30/1380
[2] https://www.redhat.com/archives/linux-audit/2019-July/msg00003.html
https://lkml.org/lkml/2019/7/8/1051

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-02-12 22:40:41

by Steve Grubb

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote:
> > > > > ... When we record the audit container ID in audit_signal_info() we
> > > > > take an extra reference to the audit container ID object so that it
> > > > > will not disappear (and get reused) until after we respond with an
> > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the
> > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took
> > > > > in
> > > > > audit_signal_info(). Unless I'm missing some other change you
> > > > > made,
> > > > > this *shouldn't* affect the syscall records, all it does is
> > > > > preserve
> > > > > the audit container ID object in the kernel's ACID store so it
> > > > > doesn't
> > > > > get reused.
> > > >
> > > > This is exactly what I had understood. I hadn't considered the extra
> > > > details below in detail due to my original syscall concern, but they
> > > > make sense.
> > > >
> > > > The syscall I refer to is the one connected with the drop of the
> > > > audit container identifier by the last process that was in that
> > > > container in patch 5/16. The production of this record is contingent
> > > > on
> > > > the last ref in a contobj being dropped. So if it is due to that ref
> > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > > record it fetched, then it will appear that the fetch action closed
> > > > the
> > > > container rather than the last process in the container to exit.
> > > >
> > > > Does this make sense?
> > >
> > > More so than your original reply, at least to me anyway.
> > >
> > > It makes sense that the audit container ID wouldn't be marked as
> > > "dead" since it would still be very much alive and available for use
> > > by the orchestrator, the question is if that is desirable or not. I
> > > think the answer to this comes down to preserving the correctness of
> > > the audit log.
> > >
> > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > > reused then I think there is a legitimate concern that the audit log
> > > is not correct, and could be misleading. If we solve that by grabbing
> > > an extra reference, then there could also be some confusion as
> > > userspace considers a container to be "dead" while the audit container
> > > ID still exists in the kernel, and the kernel generated audit
> > > container ID death record will not be generated until much later (and
> > > possibly be associated with a different event, but that could be
> > > solved by unassociating the container death record).
> >
> > How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> > possibly get associated with another event? Or is the syscall
> > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?
>
> The issue is when does the audit container ID "die". If it is when
> the last task in the container exits, then the death record will be
> associated with the task's exit. If the audit container ID lives on
> until the last reference of it in the audit logs, including the
> SIGNAL_INFO2 message, the death record will be associated with the
> related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
> the details of the syscalls/netlink.
>
> > Another idea might be to bump the refcount in audit_signal_info() but
> > mark that contid as dead so it can't be reused if we are concerned that
> > the dead contid be reused?
>
> Ooof. Yes, maybe, but that would be ugly.
>
> > There is still the problem later that the reported contid is incomplete
> > compared to the rest of the contid reporting cycle wrt nesting since
> > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> > fields to accommodate a nested contid list.
>
> Do we really care about the full nested audit container ID list in the
> SIGNAL_INFO2 record?
>
> > > Of the two
> > > approaches, I think the latter is safer in that it preserves the
> > > correctness of the audit log, even though it could result in a delay
> > > of the container death record.
> >
> > I prefer the former since it strongly indicates last task in the
> > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> > attributes and the contid to strongly link the responsible party.
>
> Steve is the only one who really tracks the security certifications
> that are relevant to audit, see what the certification requirements
> have to say and we can revisit this.

Server Virtualization Protection Profile is the closest applicable standard:

https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408

It is silent on audit requirements for the lifecycle of a VM. I assume that
all that is needed is what the orchestrator says it's doing at the high level.
So, if an orchestrator wants to shut down a container, the orchestrator must
log that intent and its results. In a similar fashion, systemd logs that it's
killing a service and we don't actually hook the exit syscall of the service
to record that.

Now, if a container was being used as a VPS, and it had a fully functioning
userspace, its own services, and its very own audit daemon, then in this
case it would care who sent a signal to its auditd. The tenant of that
container may have to comply with PCI-DSS or something else. It would log that
the audit service is being terminated and systemd would record that it's
tearing down the environment. The OS doesn't need to do anything.

-Steve


2020-02-13 00:11:28

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <[email protected]> wrote:
> On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote:
> > > > > > ... When we record the audit container ID in audit_signal_info() we
> > > > > > take an extra reference to the audit container ID object so that it
> > > > > > will not disappear (and get reused) until after we respond with an
> > > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the
> > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took
> > > > > > in
> > > > > > audit_signal_info(). Unless I'm missing some other change you
> > > > > > made,
> > > > > > this *shouldn't* affect the syscall records, all it does is
> > > > > > preserve
> > > > > > the audit container ID object in the kernel's ACID store so it
> > > > > > doesn't
> > > > > > get reused.
> > > > >
> > > > > This is exactly what I had understood. I hadn't considered the extra
> > > > > details below in detail due to my original syscall concern, but they
> > > > > make sense.
> > > > >
> > > > > The syscall I refer to is the one connected with the drop of the
> > > > > audit container identifier by the last process that was in that
> > > > > container in patch 5/16. The production of this record is contingent
> > > > > on
> > > > > the last ref in a contobj being dropped. So if it is due to that ref
> > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > > > record it fetched, then it will appear that the fetch action closed
> > > > > the
> > > > > container rather than the last process in the container to exit.
> > > > >
> > > > > Does this make sense?
> > > >
> > > > More so than your original reply, at least to me anyway.
> > > >
> > > > It makes sense that the audit container ID wouldn't be marked as
> > > > "dead" since it would still be very much alive and available for use
> > > > by the orchestrator, the question is if that is desirable or not. I
> > > > think the answer to this comes down to preserving the correctness of
> > > > the audit log.
> > > >
> > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > > > reused then I think there is a legitimate concern that the audit log
> > > > is not correct, and could be misleading. If we solve that by grabbing
> > > > an extra reference, then there could also be some confusion as
> > > > userspace considers a container to be "dead" while the audit container
> > > > ID still exists in the kernel, and the kernel generated audit
> > > > container ID death record will not be generated until much later (and
> > > > possibly be associated with a different event, but that could be
> > > > solved by unassociating the container death record).
> > >
> > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> > > possibly get associated with another event? Or is the syscall
> > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?
> >
> > The issue is when does the audit container ID "die". If it is when
> > the last task in the container exits, then the death record will be
> > associated with the task's exit. If the audit container ID lives on
> > until the last reference of it in the audit logs, including the
> > SIGNAL_INFO2 message, the death record will be associated with the
> > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
> > the details of the syscalls/netlink.
> >
> > > Another idea might be to bump the refcount in audit_signal_info() but
> > > mark that contid as dead so it can't be reused if we are concerned that
> > > the dead contid be reused?
> >
> > Ooof. Yes, maybe, but that would be ugly.
> >
> > > There is still the problem later that the reported contid is incomplete
> > > compared to the rest of the contid reporting cycle wrt nesting since
> > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> > > fields to accommodate a nested contid list.
> >
> > Do we really care about the full nested audit container ID list in the
> > SIGNAL_INFO2 record?
> >
> > > > Of the two
> > > > approaches, I think the latter is safer in that it preserves the
> > > > correctness of the audit log, even though it could result in a delay
> > > > of the container death record.
> > >
> > > I prefer the former since it strongly indicates last task in the
> > > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> > > attributes and the contid to strongly link the responsible party.
> >
> > Steve is the only one who really tracks the security certifications
> > that are relevant to audit, see what the certification requirements
> > have to say and we can revisit this.
>
> Server Virtualization Protection Profile is the closest applicable standard:
>
> https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408
>
> It is silent on audit requirements for the lifecycle of a VM. I assume that
> all that is needed is what the orchestrator says it's doing at the high level.
> So, if an orchestrator wants to shut down a container, the orchestrator must
> log that intent and its results. In a similar fashion, systemd logs that it's
> killing a service and we don't actually hook the exit syscall of the service
> to record that.
>
> Now, if a container was being used as a VPS, and it had a fully functioning
> userspace, its own services, and its very own audit daemon, then in this
> case it would care who sent a signal to its auditd. The tenant of that
> container may have to comply with PCI-DSS or something else. It would log that
> the audit service is being terminated and systemd would record that it's
> tearing down the environment. The OS doesn't need to do anything.

This latter case is the case of interest here, since the host auditd
should only be killed from a process on the host itself, not a process
running in a container. If we work under the assumption (and this may
be a break in our approach to not defining "container") that an auditd
instance is only ever signaled by a process with the same audit
container ID (ACID), is this really even an issue? Right now it isn't
as even with this patchset we will still really only support one
auditd instance, presumably on the host, so this isn't a significant
concern. Moving forward, once we add support for multiple auditd
instances we will likely need to move the signal info into
(potentially) a per-ACID struct, a struct whose lifetime would match
that of the associated container by definition; as the auditd
container died, the struct would die, the refcounts dropped, and any
ACID held only by the signal info refcount would be dropped/killed.

However, making this assumption would mean that we are expecting a
"container" to provide some level of isolation such that processes
with a different audit container ID do not signal each other. From a
practical perspective I think that fits with most (all?)
definitions of "container", but I can't say that for certain. In
those cases where the assumption is not correct and processes can
signal each other across audit container ID boundaries, perhaps it is
enough to explain that an audit container ID may not fully disappear
until it has been fetched with a SIGNAL_INFO2 message.
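
The lifetime question above can be pictured with a small userspace
model. This is only a sketch: struct acid_model, acid_hold() and
acid_put() are hypothetical names for illustration, not the kernel's
audit container ID object or its helpers. It shows how one extra
reference taken for the signal info delays the "death" of an ACID until
that reference is dropped, even after the last task has exited.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's audit container ID object. */
struct acid_model {
	uint64_t id;
	unsigned int refs;	/* tasks in the container plus any extra holders */
	bool dead;		/* set when the last reference is dropped */
};

/* Take a reference, e.g. when a task joins or signal info is recorded. */
void acid_hold(struct acid_model *a)
{
	assert(a && !a->dead);
	a->refs++;
}

/* Drop a reference; returns true if this drop killed the ACID. */
bool acid_put(struct acid_model *a)
{
	assert(a && a->refs > 0);
	if (--a->refs == 0) {
		a->dead = true;	/* a death record would be emitted here */
		return true;
	}
	return false;
}
```

With one reference held for the last task and one for the signal info,
the put on task exit returns false and the ACID only dies on the second
put, which in this model corresponds to the SIGNAL_INFO2 fetch.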

--
paul moore
http://www.paul-moore.com

2020-02-13 21:49:15

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

This is a bit of a thread-hijack, and for that I apologize, but
another thought crossed my mind while thinking about this issue
further ... Once we support multiple auditd instances, including the
necessary record routing and duplication/multiple-sends (the host
always sees *everything*), we will likely need to find a way to "trim"
the audit container ID (ACID) lists we send in the records. The
auditd instance running on the host/initns will always see everything,
so it will want the full container ACID list; however an auditd
instance running inside a container really should only see the ACIDs
of any child containers.

For example, imagine a system where the host has containers 1 and 2,
each running an auditd instance. Inside container 1 there are
containers A and B. Inside container 2 there are containers Y and Z.
If an audit event is generated in container B, I would expect the
host's auditd to see an ACID list of "1,B" but container 1's auditd
should only see an ACID list of "B". The auditd running in container
2 should not see the record at all (that will be relatively
straightforward). Does that make sense? Do we have the record
formats properly designed to handle this without too much problem (I'm
not entirely sure we do)?
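
The trimming rule can be sketched in userspace C as below, taking a
container B nested inside container 1 as the case to check. Everything
here is an assumption made for illustration: struct cont_model,
acid_list_for() and MAX_DEPTH are hypothetical, and the plain
comma-separated output only mirrors the example in this message, not a
settled record format. An observer of NULL stands for the host's auditd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_DEPTH 16	/* arbitrary nesting limit for this sketch */

/* Hypothetical model of a container and its parent link. */
struct cont_model {
	const char *name;
	const struct cont_model *parent;	/* NULL: directly on the host */
};

/*
 * Build the ACID list for an event in "ev" as seen by the auditd running
 * in "observer" (NULL for the host).  Returns 0 on success, or -1 if the
 * observer is not an ancestor of "ev" or the buffer is too small.
 */
int acid_list_for(const struct cont_model *ev, const struct cont_model *observer,
		  char *buf, size_t len)
{
	const struct cont_model *chain[MAX_DEPTH];
	const struct cont_model *c;
	size_t depth = 0, used = 0;

	assert(ev && buf);
	if (len == 0)
		return -1;
	buf[0] = '\0';

	/* Walk up from the event's container, stopping below the observer. */
	for (c = ev; c && c != observer; c = c->parent) {
		if (depth == MAX_DEPTH)
			return -1;
		chain[depth++] = c;
	}
	/* Reached the host without meeting the observer: not an ancestor. */
	if (c != observer)
		return -1;

	/* Emit the outermost container first, comma separated. */
	while (depth-- > 0) {
		int n = snprintf(buf + used, len - used, "%s%s",
				 used ? "," : "", chain[depth]->name);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

In this model the host gets the full chain, an ancestor container gets
only the part of the chain below itself, and an unrelated container gets
-1, meaning the record would not be routed to it at all.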

--
paul moore
http://www.paul-moore.com

2020-02-13 21:52:16

by Paul Moore

Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On Wed, Feb 5, 2020 at 6:51 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-02-05 18:05, Paul Moore wrote:
> > On Thu, Jan 30, 2020 at 2:28 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-01-22 16:29, Paul Moore wrote:
> > > > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > > >
> > > > > Track the parent container of a container to be able to filter and
> > > > > report nesting.
> > > > >
> > > > > Now that we have a way to track and check the parent container of a
> > > > > container, modify the contid field format to be able to report that
> > > > > nesting using a caret ("^") separator to indicate nesting. The
> > > > > original field format was "contid=<contid>" for task-associated records
> > > > > and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> > > > > records. The new field format is
> > > > > "contid=<contid>[^<contid>[...]][,<contid>[...]]".
> > > >
> > > > Let's make sure we always use a comma as a separator, even when
> > > > recording the parent information, for example:
> > > > "contid=<contid>[,^<contid>[...]][,<contid>[...]]"
> > >
> > > The intent here is to clearly indicate and separate nesting from
> > > parallel use of several containers by one netns. If we do away with
> > > that distinction, then we lose that inheritance accountability and
> > > should really run the list through a "uniq" function to remove the
> > > produced redundancies. This clear inheritance is something Steve was
> > > looking for since tracking down individual events/records to show that
> > > inheritance was not always feasible due to rolled logs or search effort.
> >
> > Perhaps my example wasn't clear. I'm not opposed to the little
> > caret/hat character indicating a container's parent, I just think it
> > would be good to also include a comma *in*addition* to the caret/hat.
>
> Ah, ok. Well, I'd offer that it would be slightly shorter, slightly
> less cluttered and having already written the parser in userspace, I
> think the parser would be slightly simpler.
>
> I must admit, I was a bit puzzled by your snippet of code that was used
> as a prefix to the next item rather than as a postfix to the given item.
>
> Can you say why you prefer the comma in addition?

Generally speaking, I believe that a single delimiter is both easier
for the eyes to parse, and easier/safer for machines to parse as well.
In this particular case I think of the comma as a delimiter and the
caret as a modifier; reusing the caret as a delimiter seems like a bad
idea to me.
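
As a rough illustration of why a single delimiter keeps the parser
simple, here is a userspace sketch for the value part of a field such as
"contid=123,^456,789", where the comma is the only delimiter and a
leading caret marks a token as the parent of the entry before it. The
names struct contid_tok and parse_contid_list() are hypothetical and
this is not the existing userspace parser.

```c
#include <assert.h>
#include <stdlib.h>

/* One parsed token; is_parent is set when the token carried '^'. */
struct contid_tok {
	unsigned long long id;
	int is_parent;
};

/*
 * Parse a value such as "123,^456,789".  Returns the number of tokens
 * stored in "out", or -1 on malformed input or if "max" is exceeded.
 */
int parse_contid_list(const char *val, struct contid_tok *out, int max)
{
	const char *p = val;
	int n = 0;

	assert(val && out);
	while (*p) {
		char *end;

		if (n == max)
			return -1;
		/* The caret is a modifier on the token, not a delimiter. */
		out[n].is_parent = (*p == '^');
		if (out[n].is_parent)
			p++;
		if (*p < '0' || *p > '9')
			return -1;
		out[n].id = strtoull(p, &end, 10);
		n++;
		if (*end == ',') {
			p = end + 1;
			if (!*p)
				return -1;	/* trailing comma */
		} else if (*end == '\0') {
			p = end;
		} else {
			return -1;	/* anything but a comma between tokens */
		}
	}
	/* A parent marker needs a preceding child entry. */
	if (n > 0 && out[0].is_parent)
		return -1;
	return n;
}
```

Splitting happens on one character only, and the caret is inspected per
token, so a string like "1^2" is rejected instead of being silently
read as two entries.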

> > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > index ef8e07524c46..68be59d1a89b 100644
> > > > > --- a/kernel/audit.c
> > > > > +++ b/kernel/audit.c
> > > >
> > > > > @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> > > > > audit_netns_contid_add(new->net_ns, contid);
> > > > > }
> > > > >
> > > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid);
> > > >
> > > > If we need a forward declaration, might as well just move it up near
> > > > the top of the file with the rest of the declarations.
> > >
> > > Ok.
> > >
> > > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> > > > > +{
> > > > > + struct audit_contobj *cont = NULL, *prcont = NULL;
> > > > > + int h;
> > > >
> > > > It seems safer to pass the audit container ID object and not the u64.
> > >
> > > It would also be faster, but in some places it isn't available such as
> > > for ptrace and signal targets. This also links back to the drop record
> > > refcounts to hold onto the contobj until process exit, or signal
> > > delivery.
> > >
> > > What we could do is to supply two potential parameters, a contobj and/or
> > > a contid, and have it use the contobj if it is valid, otherwise, use the
> > > contid, as is done for names and paths supplied to audit_log_name().
> >
> > Let's not do multiple parameters, that begs for misuse, let's take the
> > wrapper function route:
> >
> > func a(int id) {
> > // important stuff
> > }
> >
> > func ao(struct obj) {
> > a(obj.id);
> > }
> >
> > ... and we can add a comment that you *really* should be using the
> > variant that passes an object.
>
> I was already doing that where it is available, and dereferencing the id
> for the call. But I see an advantage to having both parameters supplied
> to the function, since it saves us the trouble of dereferencing it,
> searching for the id in the hash list and re-locating the object if the
> object is already available.

I strongly prefer we not do multiple parameters for the same "thing";
I would much rather do the wrapper approach as described above. I
would also like to see us use the audit container ID object as much as
possible; using a bare integer should be a last resort.
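
A compilable version of the wrapper idea might look like the sketch
below. The names struct contobj_model, log_contid() and log_contobj()
are placeholders for illustration, and writing into a caller-supplied
buffer stands in for whatever the real function does with an
audit_buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the audit container ID object. */
struct contobj_model {
	uint64_t id;
};

/*
 * Last-resort variant taking a bare ID.  Callers that have an object
 * should really use log_contobj() below.
 */
int log_contid(char *buf, size_t len, uint64_t contid)
{
	int n;

	assert(buf);
	n = snprintf(buf, len, "contid=%llu", (unsigned long long)contid);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Preferred variant: takes the object and forwards its ID. */
int log_contobj(char *buf, size_t len, const struct contobj_model *cont)
{
	if (!cont)
		return -1;
	return log_contid(buf, len, cont->id);
}
```

Each function takes exactly one representation of the container, so a
caller cannot pass an object and an ID that disagree with each other.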

--
paul moore
http://www.paul-moore.com

2020-02-13 22:05:32

by Paul Moore

Subject: Re: [PATCH ghak90 V8 16/16] audit: add capcontid to set contid outside init_user_ns

On Thu, Feb 6, 2020 at 7:52 AM Richard Guy Briggs <[email protected]> wrote:
> On 2020-02-05 17:56, Paul Moore wrote:
> > On Tue, Feb 4, 2020 at 7:39 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-01-22 16:29, Paul Moore wrote:
> > > > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > > >
> > > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > > process in a non-init user namespace the capability to set audit
> > > > > container identifiers.
> > > > >
> > > > > Provide /proc/$PID/audit_capcontid interface to capcontid.
> > > > > Valid values are: 1==enabled, 0==disabled
> > > >
> > > > It would be good to be more explicit about "enabled" and "disabled" in
> > > > the commit description. For example, which setting allows the target
> > > > task to set audit container IDs of its child processes?
> > >
> > > Ok...
> > >
> > > > > Report this action in message type AUDIT_SET_CAPCONTID 1022 with fields
> > > > > opid= capcontid= old-capcontid=
> > > > >
> > > > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > > > ---
> > > > > fs/proc/base.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > include/linux/audit.h | 14 ++++++++++++
> > > > > include/uapi/linux/audit.h | 1 +
> > > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++
> > > > > 4 files changed, 105 insertions(+)
> >
> > ...
> >
> > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > index 1287f0b63757..1c22dd084ae8 100644
> > > > > --- a/kernel/audit.c
> > > > > +++ b/kernel/audit.c
> > > > > @@ -2698,6 +2698,41 @@ static bool audit_contid_isowner(struct task_struct *tsk)
> > > > > return false;
> > > > > }
> > > > >
> > > > > +int audit_set_capcontid(struct task_struct *task, u32 enable)
> > > > > +{
> > > > > + u32 oldcapcontid;
> > > > > + int rc = 0;
> > > > > + struct audit_buffer *ab;
> > > > > +
> > > > > + if (!task->audit)
> > > > > + return -ENOPROTOOPT;
> > > > > + oldcapcontid = audit_get_capcontid(task);
> > > > > + /* if task is not descendant, block */
> > > > > + if (task == current)
> > > > > + rc = -EBADSLT;
> > > > > + else if (!task_is_descendant(current, task))
> > > > > + rc = -EXDEV;
> > > >
> > > > See my previous comments about error code sanity.
> > >
> > > I'll go with EXDEV.
> > >
> > > > > + else if (current_user_ns() == &init_user_ns) {
> > > > > + if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
> > > > > + rc = -EPERM;
> > > >
> > > > I think we just want to use ns_capable() in the context of the current
> > > > userns to check CAP_AUDIT_CONTROL, yes? Something like this ...
> > >
> > > I thought we had firmly established in previous discussion that
> > > CAP_AUDIT_CONTROL in anything other than init_user_ns was completely irrelevant
> > > and untrustable.
> >
> > In the case of a container with multiple users, and multiple
> > applications, one being a nested orchestrator, it seems relevant to
> > allow that container to control which of its processes are able to
> > exercise CAP_AUDIT_CONTROL. Granted, we still want to control it
> > within the overall host, e.g. the container in question must be
> > allowed to run a nested orchestrator, but allowing the container
> > itself to provide its own granularity seems like the right thing to
> > do.
>
> Looking back to discussion on the v6 patch 2/10 (2019-05-30 15:29 Paul
> Moore[1], 2019-07-08 14:05 RGB[2]), it occurs to me that the
> ns_capable(CAP_AUDIT_CONTROL) application was dangerous since there was
> no parental accountability in storage or reporting. Now that this is in
> place, it does seem a bit more reasonable to allow it, but I'm still not
> clear on why we would want both mechanisms now. I don't understand what
> the last line in that email meant: "We would probably still want a
> ns_capable(CAP_AUDIT_CONTROL) restriction in this case." Allow
> ns_capable(CAP_AUDIT_CONTROL) to govern these actions, or restrict
> ns_capable(CAP_AUDIT_CONTROL) from being used to govern these actions?
>
> If an unprivileged user has been given capcontid to be able to run their
> own container orchestrator/engine and spawns a user namespace with
> CAP_AUDIT_CONTROL, what matters is capcontid, and not CAP_AUDIT_CONTROL.
> I could see needing CAP_AUDIT_CONTROL *in addition* to capcontid to give
> it finer grained control, but since capcontid would have to be given to
> each process explicitly anyway, I don't see the point.
>
> If that unprivileged user had not been given capcontid,
> giving itself or one of its descendants CAP_AUDIT_CONTROL should not let
> it jump into the game all of a sudden unless the now chained audit
> container identifiers are deemed accountable enough. And then we would
> need those hard limits on container depth and network namespace
> container membership.

Perhaps I'm not correctly understanding what you are trying to do with
this patchset, but my current understanding is that you are trying to
use capcontid to control which child audit container IDs (ACIDs) are
allowed to manage their own ACIDs. Further, I believe that the
capcontid setting operates at a per-ACID level, meaning there is no
provision for the associated container to further restrict that
ability, i.e. no access control granularity below the ACID level. My
thinking is that ns_capable(CAP_AUDIT_CONTROL) could be used within an
ACID to increase the granularity of the access controls so that only
privileged processes running inside the ACID would be able to manage
the ACIDs. Does that make sense?
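
To make the combined check concrete, here is a userspace model of the
logic suggested earlier in this thread: outside the initial user
namespace both the namespace-local CAP_AUDIT_CONTROL and capcontid are
required, while in the initial user namespace CAP_AUDIT_CONTROL alone
decides. The struct task_model and may_manage_contid() names are
hypothetical and the booleans replace the real ns_capable(), capable()
and audit_get_capcontid() calls, so this only illustrates the decision
table.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the facts the check depends on. */
struct task_model {
	bool in_init_userns;	/* task runs in the initial user namespace */
	bool cap_audit_control;	/* CAP_AUDIT_CONTROL in the task's own userns */
	bool capcontid;		/* capcontid has been granted to the task */
};

/* Returns 0 if the task may manage audit container IDs, else -EPERM. */
int may_manage_contid(const struct task_model *t)
{
	assert(t);
	if (!t->in_init_userns) {
		/* Nested case: need the host's grant and the local capability. */
		if (!t->cap_audit_control || !t->capcontid)
			return -EPERM;
	} else if (!t->cap_audit_control) {
		return -EPERM;
	}
	return 0;
}
```

In this model capcontid is the host's grant to the container, and the
namespace-local capability is how the container narrows that grant down
to its own privileged processes.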

--
paul moore
http://www.paul-moore.com

2020-03-12 19:32:51

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-02-13 16:44, Paul Moore wrote:
> This is a bit of a thread-hijack, and for that I apologize, but
> another thought crossed my mind while thinking about this issue
> further ... Once we support multiple auditd instances, including the
> necessary record routing and duplication/multiple-sends (the host
> always sees *everything*), we will likely need to find a way to "trim"
> the audit container ID (ACID) lists we send in the records. The
> auditd instance running on the host/initns will always see everything,
> so it will want the full container ACID list; however an auditd
> instance running inside a container really should only see the ACIDs
> of any child containers.

Agreed. This should be easy to check and limit, preventing an auditd
from seeing any contid that is a parent of its own contid.

> For example, imagine a system where the host has containers 1 and 2,
> each running an auditd instance. Inside container 1 there are
> containers A and B. Inside container 2 there are containers Y and Z.
> If an audit event is generated in container B, I would expect the
> host's auditd to see an ACID list of "1,B" but container 1's auditd
> should only see an ACID list of "B". The auditd running in container
> 2 should not see the record at all (that will be relatively
> straightforward). Does that make sense? Do we have the record
> formats properly designed to handle this without too much problem (I'm
> not entirely sure we do)?

I completely agree and I believe we have record formats that are able to
handle this already.

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-12 20:29:36

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-02-12 19:09, Paul Moore wrote:
> On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <[email protected]> wrote:
> > On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote:
> > > > > > > ... When we record the audit container ID in audit_signal_info() we
> > > > > > > take an extra reference to the audit container ID object so that it
> > > > > > > will not disappear (and get reused) until after we respond with an
> > > > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the
> > > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took
> > > > > > > in
> > > > > > > audit_signal_info(). Unless I'm missing some other change you
> > > > > > > made,
> > > > > > > this *shouldn't* affect the syscall records, all it does is
> > > > > > > preserve
> > > > > > > the audit container ID object in the kernel's ACID store so it
> > > > > > > doesn't
> > > > > > > get reused.
> > > > > >
> > > > > > This is exactly what I had understood. I hadn't considered the extra
> > > > > > details below in detail due to my original syscall concern, but they
> > > > > > make sense.
> > > > > >
> > > > > > The syscall I refer to is the one connected with the drop of the
> > > > > > audit container identifier by the last process that was in that
> > > > > > container in patch 5/16. The production of this record is contingent
> > > > > > on
> > > > > > the last ref in a contobj being dropped. So if it is due to that ref
> > > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > > > > record it fetched, then it will appear that the fetch action closed
> > > > > > the
> > > > > > container rather than the last process in the container to exit.
> > > > > >
> > > > > > Does this make sense?
> > > > >
> > > > > More so than your original reply, at least to me anyway.
> > > > >
> > > > > It makes sense that the audit container ID wouldn't be marked as
> > > > > "dead" since it would still be very much alive and available for use
> > > > > by the orchestrator, the question is if that is desirable or not. I
> > > > > > think the answer to this comes down to preserving the correctness of
> > > > > the audit log.
> > > > >
> > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > > > > reused then I think there is a legitimate concern that the audit log
> > > > > is not correct, and could be misleading. If we solve that by grabbing
> > > > > an extra reference, then there could also be some confusion as
> > > > > userspace considers a container to be "dead" while the audit container
> > > > > ID still exists in the kernel, and the kernel generated audit
> > > > > container ID death record will not be generated until much later (and
> > > > > possibly be associated with a different event, but that could be
> > > > > solved by unassociating the container death record).
> > > >
> > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> > > > possibly get associated with another event? Or is the syscall
> > > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?
> > >
> > > The issue is when does the audit container ID "die". If it is when
> > > the last task in the container exits, then the death record will be
> > > associated when the task's exit. If the audit container ID lives on
> > > until the last reference of it in the audit logs, including the
> > > SIGNAL_INFO2 message, the death record will be associated with the
> > > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
> > > the details of the syscalls/netlink.
> > >
> > > > Another idea might be to bump the refcount in audit_signal_info() but
> > > > > mark that contid as dead so it can't be reused if we are concerned that
> > > > the dead contid be reused?
> > >
> > > Ooof. Yes, maybe, but that would be ugly.
> > >
> > > > There is still the problem later that the reported contid is incomplete
> > > > compared to the rest of the contid reporting cycle wrt nesting since
> > > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> > > > fields to accommodate a nested contid list.
> > >
> > > Do we really care about the full nested audit container ID list in the
> > > SIGNAL_INFO2 record?

I'm inclined to hand-wave it away as an inconvenience that can be looked up
more carefully if it is really needed. Maybe the block above would be
safer and more complete even though it is ugly.

> > > > > Of the two
> > > > > approaches, I think the latter is safer in that it preserves the
> > > > > correctness of the audit log, even though it could result in a delay
> > > > > of the container death record.
> > > >
> > > > I prefer the former since it strongly indicates last task in the
> > > > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> > > > attributes and the contid to strongly link the responsible party.
> > >
> > > Steve is the only one who really tracks the security certifications
> > > that are relevant to audit, see what the certification requirements
> > > have to say and we can revisit this.
> >
> > Server Virtualization Protection Profile is the closest applicable standard
> >
> > https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408
> >
> > It is silent on audit requirements for the lifecycle of a VM. I assume that
> > all that is needed is what the orchestrator says its doing at the high level.
> > So, if an orchestrator wants to shutdown a container, the orchestrator must
> > log that intent and its results. In a similar fashion, systemd logs that it's
> > killing a service and we don't actually hook the exit syscall of the service
> > to record that.
> >
> > Now, if a container was being used as a VPS, and it had a fully functioning
> > userspace, it's own services, and its very own audit daemon, then in this
> > case it would care who sent a signal to its auditd. The tenant of that
> > container may have to comply with PCI-DSS or something else. It would log the
> > audit service is being terminated and systemd would record that its tearing
> > down the environment. The OS doesn't need to do anything.
>
> This latter case is the case of interest here, since the host auditd
> should only be killed from a process on the host itself, not a process
> running in a container. If we work under the assumption (and this may
> be a break in our approach to not defining "container") that an auditd
> instance is only ever signaled by a process with the same audit
> container ID (ACID), is this really even an issue? Right now it isn't
> as even with this patchset we will still really only support one
> auditd instance, presumably on the host, so this isn't a significant
> concern. Moving forward, once we add support for multiple auditd
> instances we will likely need to move the signal info into
> (potentially) a per-ACID struct, a struct whose lifetime would match
> that of the associated container by definition; as the auditd
> container died, the struct would die, the refcounts dropped, and any
> ACID held only by the signal info refcount would be dropped/killed.

Any process could signal auditd if it can see it based on namespace
relationships, never mind container placement. Some container
architectures would not have a namespace configuration that would block
this (combination of PID/user/IPC?).

> However, making this assumption would mean that we are expecting a
> "container" to provide some level of isolation such that processes
> with a different audit container ID do not signal each other. From a
> practical perspective I think that fits with the most (all?)
> definitions of "container", but I can't say that for certain. In
> those cases where the assumption is not correct and processes can
> signal each other across audit container ID boundaries, perhaps it is
> enough to explain that an audit container ID may not fully disappear
> until it has been fetched with a SIGNAL_INFO2 message.

I think that, more and more, more complete isolation is being done,
taking advantage of each type of namespace as it becomes available, but
I know a number of them haven't yet found it important to use IPC, PID or
user namespaces, which would be the only namespaces I can think of that
would provide that isolation.

It isn't entirely clear to me which side you fall on this issue, Paul.
Can you pronounce on your strong preference one way or the other: should the
death of a container coincide with the exit of the last process in that
namespace, or with the fetch of any signal info related to it? I have a bias
to the former since the code already does that and I feel the exit of
the last process is much more relevant, supported by the syscall record,
but I could change it to the latter if you feel strongly enough about it
to block upstream acceptance.

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-12 20:52:51

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On 2020-02-13 16:49, Paul Moore wrote:
> On Wed, Feb 5, 2020 at 6:51 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-02-05 18:05, Paul Moore wrote:
> > > On Thu, Jan 30, 2020 at 2:28 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-01-22 16:29, Paul Moore wrote:
> > > > > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > >
> > > > > > Track the parent container of a container to be able to filter and
> > > > > > report nesting.
> > > > > >
> > > > > > Now that we have a way to track and check the parent container of a
> > > > > > container, modify the contid field format to be able to report that
> > > > > > nesting using a carrat ("^") separator to indicate nesting. The
> > > > > > original field format was "contid=<contid>" for task-associated records
> > > > > > and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> > > > > > records. The new field format is
> > > > > > "contid=<contid>[^<contid>[...]][,<contid>[...]]".
> > > > >
> > > > > Let's make sure we always use a comma as a separator, even when
> > > > > recording the parent information, for example:
> > > > > "contid=<contid>[,^<contid>[...]][,<contid>[...]]"
> > > >
> > > > The intent here is to clearly indicate and separate nesting from
> > > > parallel use of several containers by one netns. If we do away with
> > > > that distinction, then we lose that inheritance accountability and
> > > > should really run the list through a "uniq" function to remove the
> > > > produced redundancies. This clear inheritance is something Steve was
> > > > looking for since tracking down individual events/records to show that
> > > > inheritance was not always feasible due to rolled logs or search effort.
> > >
> > > Perhaps my example wasn't clear. I'm not opposed to the little
> > > carat/hat character indicating a container's parent, I just think it
> > > would be good to also include a comma *in*addition* to the carat/hat.
> >
> > Ah, ok. Well, I'd offer that it would be slightly shorter, slightly
> > less cluttered and having already written the parser in userspace, I
> > think the parser would be slightly simpler.
> >
> > I must admit, I was a bit puzzled by your snippet of code that was used
> > as a prefix to the next item rather than as a postfix to the given item.
> >
> > Can you say why you prefer the comma in addition?
>
> Generally speaking, I believe that a single delimiter is both easier
> for the eyes to parse, and easier/safer for machines to parse as well.
> In this particular case I think of the comma as a delimiter and the
> carat as a modifier, reusing the carat as a delimiter seems like a bad
> idea to me.

I'm not crazy about this idea, but I'll have a look at how much work it
is to recode the userspace search tools. It also adds extra characters
and noise to the string format, which seems counterproductive.
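
To make the two candidate formats concrete, here is a minimal userspace
sketch (not the kernel code from the patch) that formats a nesting chain
with either the bare "^" separator or the ",^" comma-plus-modifier form.
The helper name format_contid_chain() and the innermost-first ordering of
the chain are assumptions made for illustration only.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "contid=" followed by the chain of IDs
 * (innermost container first, then each parent) into buf.
 * sep is "^" for the format in the patch, or ",^" for the format
 * that keeps the comma as the only delimiter.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int format_contid_chain(char *buf, size_t len,
			       const uint64_t *chain, size_t n,
			       const char *sep)
{
	size_t off;
	int rc;

	rc = snprintf(buf, len, "contid=");
	if (rc < 0 || (size_t)rc >= len)
		return -1;
	off = (size_t)rc;

	for (size_t i = 0; i < n; i++) {
		/* No separator before the first (innermost) ID. */
		rc = snprintf(buf + off, len - off, "%s%" PRIu64,
			      i ? sep : "", chain[i]);
		if (rc < 0 || (size_t)rc >= len - off)
			return -1;
		off += (size_t)rc;
	}
	return (int)off;
}
```

For a chain of 3, 2, 1 the first form yields "contid=3^2^1" and the second
"contid=3,^2,^1", which is two extra characters per nesting level.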

> > > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > > index ef8e07524c46..68be59d1a89b 100644
> > > > > > --- a/kernel/audit.c
> > > > > > +++ b/kernel/audit.c
> > > > >
> > > > > > @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> > > > > > audit_netns_contid_add(new->net_ns, contid);
> > > > > > }
> > > > > >
> > > > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid);
> > > > >
> > > > > If we need a forward declaration, might as well just move it up near
> > > > > the top of the file with the rest of the declarations.
> > > >
> > > > Ok.
> > > >
> > > > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> > > > > > +{
> > > > > > + struct audit_contobj *cont = NULL, *prcont = NULL;
> > > > > > + int h;
> > > > >
> > > > > It seems safer to pass the audit container ID object and not the u64.
> > > >
> > > > It would also be faster, but in some places it isn't available such as
> > > > for ptrace and signal targets. This also links back to the drop record
> > > > refcounts to hold onto the contobj until process exit, or signal
> > > > delivery.
> > > >
> > > > What we could do is to supply two potential parameters, a contobj and/or
> > > > a contid, and have it use the contobj if it is valid, otherwise, use the
> > > > contid, as is done for names and paths supplied to audit_log_name().
> > >
> > > Let's not do multiple parameters, that begs for misuse, let's take the
> > > wrapper function route:
> > >
> > > func a(int id) {
> > > // important stuff
> > > }
> > >
> > > func ao(struct obj) {
> > > a(obj.id);
> > > }
> > >
> > > ... and we can add a comment that you *really* should be using the
> > > variant that passes an object.
> >
> > I was already doing that where it is available, and dereferencing the id
> > for the call. But I see an advantage to having both parameters supplied
> > to the function, since it saves us the trouble of dereferencing it,
> > searching for the id in the hash list and re-locating the object if the
> > object is already available.
>
> I strongly prefer we not do multiple parameters for the same "thing";

So do I, ideally. However...

> I would much rather do the wrapper approach as described above. I
> would also like to see us use the audit container ID object as much as
> possible, using a bare integer should be a last resort.

It is not clear to me that you understood what I wrote above. I can't
use the object pointer where preferable because there are a few cases
where only the ID is available. If only the ID is available, I would
have to make a best effort to look up the object pointer and am not
guaranteed to find it (invalid, stale, signal info...). If I am forced
to use only one, it becomes the ID that is used, and I no longer have
the benefit of already having the object pointer for certainty and
saving work. For all cases where I have the object pointer, which is
most cases, and most frequently used cases, I will have to dereference
the object pointer to an ID, then go through the work again to re-locate
the object pointer. This is less certain, and more work. Reluctantly,
the only practical solution I see here is to supply both, favouring the
object pointer if it is valid, then falling back on the ID from the next
parameter.
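
The trade-off can be sketched in userspace C as follows. This is only an
illustration under stated assumptions: struct contobj, contobj_table,
contobj_find() and the three log_*() helpers are hypothetical stand-ins,
not the functions in the patch, and the linear table replaces the real
hash list. The lookup counter is there purely to show the repeated work.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's audit container ID object. */
struct contobj {
	uint64_t id;
	struct contobj *parent;
};

#define NOBJ 4
static struct contobj *contobj_table[NOBJ];	/* toy replacement for the hash list */
static unsigned int lookups;			/* counts ID-to-object searches */

/* Look an object up by ID; fails for an invalid or stale ID. */
static struct contobj *contobj_find(uint64_t id)
{
	lookups++;
	for (size_t i = 0; i < NOBJ; i++)
		if (contobj_table[i] && contobj_table[i]->id == id)
			return contobj_table[i];
	return NULL;
}

/* Wrapper approach: the ID variant does the work ... */
static struct contobj *log_contid(uint64_t contid)
{
	return contobj_find(contid);
}

/* ... and the object variant dereferences the ID, forcing a re-lookup. */
static struct contobj *log_contobj(struct contobj *cont)
{
	return log_contid(cont->id);
}

/* Two-parameter approach: use the object if valid, else fall back to the ID. */
static struct contobj *log_contid_both(struct contobj *cont, uint64_t contid)
{
	return cont ? cont : contobj_find(contid);
}
```

With an object in hand, log_contobj() costs one search of the table while
log_contid_both() costs none; with only an ID, both forms do the same
search and can both fail.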

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-12 21:59:08

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 16/16] audit: add capcontid to set contid outside init_user_ns

On 2020-02-13 16:58, Paul Moore wrote:
> On Thu, Feb 6, 2020 at 7:52 AM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-02-05 17:56, Paul Moore wrote:
> > > On Tue, Feb 4, 2020 at 7:39 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-01-22 16:29, Paul Moore wrote:
> > > > > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > >
> > > > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > > > process in a non-init user namespace the capability to set audit
> > > > > > container identifiers.
> > > > > >
> > > > > > Provide /proc/$PID/audit_capcontid interface to capcontid.
> > > > > > Valid values are: 1==enabled, 0==disabled
> > > > >
> > > > > It would be good to be more explicit about "enabled" and "disabled" in
> > > > > the commit description. For example, which setting allows the target
> > > > > task to set audit container IDs of it's children processes?
> > > >
> > > > Ok...
> > > >
> > > > > > Report this action in message type AUDIT_SET_CAPCONTID 1022 with fields
> > > > > > opid= capcontid= old-capcontid=
> > > > > >
> > > > > > Signed-off-by: Richard Guy Briggs <[email protected]>
> > > > > > ---
> > > > > > fs/proc/base.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > include/linux/audit.h | 14 ++++++++++++
> > > > > > include/uapi/linux/audit.h | 1 +
> > > > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++
> > > > > > 4 files changed, 105 insertions(+)
> > >
> > > ...
> > >
> > > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > > index 1287f0b63757..1c22dd084ae8 100644
> > > > > > --- a/kernel/audit.c
> > > > > > +++ b/kernel/audit.c
> > > > > > @@ -2698,6 +2698,41 @@ static bool audit_contid_isowner(struct task_struct *tsk)
> > > > > > return false;
> > > > > > }
> > > > > >
> > > > > > +int audit_set_capcontid(struct task_struct *task, u32 enable)
> > > > > > +{
> > > > > > + u32 oldcapcontid;
> > > > > > + int rc = 0;
> > > > > > + struct audit_buffer *ab;
> > > > > > +
> > > > > > + if (!task->audit)
> > > > > > + return -ENOPROTOOPT;
> > > > > > + oldcapcontid = audit_get_capcontid(task);
> > > > > > + /* if task is not descendant, block */
> > > > > > + if (task == current)
> > > > > > + rc = -EBADSLT;
> > > > > > + else if (!task_is_descendant(current, task))
> > > > > > + rc = -EXDEV;
> > > > >
> > > > > See my previous comments about error code sanity.
> > > >
> > > > I'll go with EXDEV.
> > > >
> > > > > > + else if (current_user_ns() == &init_user_ns) {
> > > > > > + if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
> > > > > > + rc = -EPERM;
> > > > >
> > > > > I think we just want to use ns_capable() in the context of the current
> > > > > userns to check CAP_AUDIT_CONTROL, yes? Something like this ...
> > > >
> > > > I thought we had firmly established in previous discussion that
> > > > CAP_AUDIT_CONTROL in anything other than init_user_ns was completely irrelevant
> > > > and untrustable.
> > >
> > > In the case of a container with multiple users, and multiple
> > > applications, one being a nested orchestrator, it seems relevant to
> > > allow that container to control which of it's processes are able to
> > > exercise CAP_AUDIT_CONTROL. Granted, we still want to control it
> > > within the overall host, e.g. the container in question must be
> > > allowed to run a nested orchestrator, but allowing the container
> > > itself to provide it's own granularity seems like the right thing to
> > > do.
> >
> > Looking back to discussion on the v6 patch 2/10 (2019-05-30 15:29 Paul
> > Moore[1], 2019-07-08 14:05 RGB[2]) , it occurs to me that the
> > ns_capable(CAP_AUDIT_CONTROL) application was dangerous since there was
> > no parental accountability in storage or reporting. Now that is in
> > place, it does seem a bit more reasonable to allow it, but I'm still not
> > clear on why we would want both mechanisms now. I don't understand what
> > the last line in that email meant: "We would probably still want a
> > ns_capable(CAP_AUDIT_CONTROL) restriction in this case." Allow
> > ns_capable(CAP_AUDIT_CONTROL) to govern these actions, or restrict
> > ns_capable(CAP_AUDIT_CONTROL) from being used to govern these actions?
> >
> > If an unprivileged user has been given capcontid to be able run their
> > own container orchestrator/engine and spawns a user namespace with
> > CAP_AUDIT_CONTROL, what matters is capcontid, and not CAP_AUDIT_CONTROL.
> > I could see needing CAP_AUDIT_CONTROL *in addition* to capcontid to give
> > it finer grained control, but since capcontid would have to be given to
> > each process explicitly anways, I don't see the point.
> >
> > If that unprivileged user had not been given capcontid,
> > giving itself or one of its descendants CAP_AUDIT_CONTROL should not let
> > it jump into the game all of a sudden unless the now chained audit
> > container identifiers are deemed accountable enough. And then now we
> > need those hard limits on container depth and network namespace
> > container membership.
>
> Perhaps I'm not correctly understanding what you are trying to do with
> this patchset, but my current understanding is that you are trying to
> use capcontid to control which child audit container IDs (ACIDs) are
> allowed to manage their own ACIDs. Further, I believe that the
> capcontid setting operates at a per-ACID level, meaning there is no
> provision for the associated container to further restrict that
> ability, i.e. no access control granularity below the ACID level. My
> thinking is that ns_capable(CAP_AUDIT_CONTROL) could be used within an
> ACID to increase the granularity of the access controls so that only
> privileged processes running inside the ACID would be able to manage
> the ACIDs. Does that make sense?

The capcontid is not inherited like the contid (or contobj) in
audit_alloc(), so it stops at the process that was granted capcontid.
That process can then explicitly grant capcontid to any of its
children should it deem that necessary.

Since it is a boolean, it defaults to unset in init_struct_audit, which
isn't relevant anyway since that is in the initial user namespace.
It isn't set in audit_alloc() and so defaults to false there too.
I can explicitly set both to false if that makes things clearer and
more certain.
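
As a rough userspace model of that behaviour (a sketch only, not the patch
code): struct task_audit, audit_alloc_sketch() and set_capcontid_sketch()
are hypothetical names, the boolean cur_has_audit_control stands in for
the capable(CAP_AUDIT_CONTROL) check, and the descendancy and
self-targeting checks from audit_set_capcontid() are deliberately left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task audit state. */
struct task_audit {
	uint64_t contid;
	bool capcontid;
};

/* On fork: contid is inherited from the parent, capcontid never is. */
static void audit_alloc_sketch(struct task_audit *child,
			       const struct task_audit *parent)
{
	child->contid = parent->contid;
	child->capcontid = false;	/* must be granted explicitly */
}

/*
 * Grant or revoke capcontid on target. The caller needs either the
 * privilege (modelled by cur_has_audit_control) or capcontid itself.
 */
static int set_capcontid_sketch(const struct task_audit *cur,
				bool cur_has_audit_control,
				struct task_audit *target, bool enable)
{
	if (!cur_has_audit_control && !cur->capcontid)
		return -EPERM;
	target->capcontid = enable;
	return 0;
}
```

In this model a child of a capcontid holder starts without capcontid and
cannot pass it on to its own children until the holder grants it.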

I still believe ns_capable() is irrelevant here.

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-13 16:31:00

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-02-13 16:44, Paul Moore wrote:
> > This is a bit of a thread-hijack, and for that I apologize, but
> > another thought crossed my mind while thinking about this issue
> > further ... Once we support multiple auditd instances, including the
> > necessary record routing and duplication/multiple-sends (the host
> > always sees *everything*), we will likely need to find a way to "trim"
> > the audit container ID (ACID) lists we send in the records. The
> > auditd instance running on the host/initns will always see everything,
> > so it will want the full container ACID list; however an auditd
> > instance running inside a container really should only see the ACIDs
> > of any child containers.
>
> Agreed. This should be easy to check and limit, preventing an auditd
> from seeing any contid that is a parent of its own contid.
>
> > For example, imagine a system where the host has containers 1 and 2,
> > each running an auditd instance. Inside container 1 there are
> > containers A and B. Inside container 2 there are containers Y and Z.
> > If an audit event is generated in container Z, I would expect the
> > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > should only see an ACID list of "Z". The auditd running in container
> > 2 should not see the record at all (that will be relatively
> > straightforward). Does that make sense? Do we have the record
> > formats properly designed to handle this without too much problem (I'm
> > not entirely sure we do)?
>
> I completely agree and I believe we have record formats that are able to
> handle this already.

I'm not convinced we do. What about the cases where we have a field
with a list of audit container IDs? How do we handle that?

--
paul moore
http://www.paul-moore.com

2020-03-13 16:44:12

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Thu, Mar 12, 2020 at 4:27 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-02-12 19:09, Paul Moore wrote:
> > On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <[email protected]> wrote:
> > > On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote:
> > > > > > > > ... When we record the audit container ID in audit_signal_info() we
> > > > > > > > take an extra reference to the audit container ID object so that it
> > > > > > > > will not disappear (and get reused) until after we respond with an
> > > > > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the
> > > > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took
> > > > > > > > in audit_signal_info(). Unless I'm missing some other change you
> > > > > > > > made, this *shouldn't* affect the syscall records, all it does is
> > > > > > > > preserve the audit container ID object in the kernel's ACID store so it
> > > > > > > > doesn't get reused.
> > > > > > >
> > > > > > > This is exactly what I had understood. I hadn't considered the extra
> > > > > > > details below in detail due to my original syscall concern, but they
> > > > > > > make sense.
> > > > > > >
> > > > > > > The syscall I refer to is the one connected with the drop of the
> > > > > > > audit container identifier by the last process that was in that
> > > > > > > container in patch 5/16. The production of this record is contingent
> > > > > > > on the last ref in a contobj being dropped. So if it is due to that ref
> > > > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > > > > > record is fetched, then it will appear that the fetch action closed
> > > > > > > the container rather than the last process in the container to exit.
> > > > > > >
> > > > > > > Does this make sense?
> > > > > >
> > > > > > More so than your original reply, at least to me anyway.
> > > > > >
> > > > > > It makes sense that the audit container ID wouldn't be marked as
> > > > > > "dead" since it would still be very much alive and available for use
> > > > > > by the orchestrator, the question is if that is desirable or not. I
> > > > > > think the answer to this comes down the preserving the correctness of
> > > > > > the audit log.
> > > > > >
> > > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > > > > > reused then I think there is a legitimate concern that the audit log
> > > > > > is not correct, and could be misleading. If we solve that by grabbing
> > > > > > an extra reference, then there could also be some confusion as
> > > > > > userspace considers a container to be "dead" while the audit container
> > > > > > ID still exists in the kernel, and the kernel generated audit
> > > > > > container ID death record will not be generated until much later (and
> > > > > > possibly be associated with a different event, but that could be
> > > > > > solved by unassociating the container death record).
> > > > >
> > > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> > > > > possibly get associated with another event? Or is the syscall
> > > > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?
> > > >
> > > > The issue is when does the audit container ID "die". If it is when
> > > > the last task in the container exits, then the death record will be
> > > > associated when the task's exit. If the audit container ID lives on
> > > > until the last reference of it in the audit logs, including the
> > > > SIGNAL_INFO2 message, the death record will be associated with the
> > > > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
> > > > the details of the syscalls/netlink.
> > > >
> > > > > Another idea might be to bump the refcount in audit_signal_info() but
> > > > > > mark that contid as dead so it can't be reused if we are concerned that
> > > > > the dead contid be reused?
> > > >
> > > > Ooof. Yes, maybe, but that would be ugly.
> > > >
> > > > > There is still the problem later that the reported contid is incomplete
> > > > > compared to the rest of the contid reporting cycle wrt nesting since
> > > > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> > > > > fields to accommodate a nested contid list.
> > > >
> > > > Do we really care about the full nested audit container ID list in the
> > > > SIGNAL_INFO2 record?
>
> I'm inclined to hand-wave it away as inconvenient that can be looked up
> more carefully if it is really needed. Maybe the block above would be
> safer and more complete even though it is ugly.
>
> > > > > > Of the two
> > > > > > approaches, I think the latter is safer in that it preserves the
> > > > > > correctness of the audit log, even though it could result in a delay
> > > > > > of the container death record.
> > > > >
> > > > > I prefer the former since it strongly indicates last task in the
> > > > > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> > > > > attributes and the contid to strongly link the responsible party.
> > > >
> > > > Steve is the only one who really tracks the security certifications
> > > > that are relevant to audit, see what the certification requirements
> > > > have to say and we can revisit this.
> > >
> > > Server Virtualization Protection Profile is the closest applicable standard
> > >
> > > https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408
> > >
> > > It is silent on audit requirements for the lifecycle of a VM. I assume that
> > > all that is needed is what the orchestrator says its doing at the high level.
> > > So, if an orchestrator wants to shutdown a container, the orchestrator must
> > > log that intent and its results. In a similar fashion, systemd logs that it's
> > > killing a service and we don't actually hook the exit syscall of the service
> > > to record that.
> > >
> > > Now, if a container was being used as a VPS, and it had a fully functioning
> > > userspace, it's own services, and its very own audit daemon, then in this
> > > case it would care who sent a signal to its auditd. The tenant of that
> > > container may have to comply with PCI-DSS or something else. It would log the
> > > audit service is being terminated and systemd would record that its tearing
> > > down the environment. The OS doesn't need to do anything.
> >
> > This latter case is the case of interest here, since the host auditd
> > should only be killed from a process on the host itself, not a process
> > running in a container. If we work under the assumption (and this may
> > be a break in our approach to not defining "container") that an auditd
> > instance is only ever signaled by a process with the same audit
> > container ID (ACID), is this really even an issue? Right now it isn't
> > as even with this patchset we will still really only support one
> > auditd instance, presumably on the host, so this isn't a significant
> > concern. Moving forward, once we add support for multiple auditd
> > instances we will likely need to move the signal info into
> > (potentially) a per-ACID struct, a struct whose lifetime would match
> > that of the associated container by definition; as the auditd
> > container died, the struct would die, the refcounts dropped, and any
> > ACID held only by the signal info refcount would be dropped/killed.
>
> Any process could signal auditd if it can see it based on namespace
> relationships, nevermind container placement. Some container
> architectures would not have a namespace configuration that would block
> this (combination of PID/user/IPC?).
>
> > However, making this assumption would mean that we are expecting a
> > "container" to provide some level of isolation such that processes
> > with a different audit container ID do not signal each other. From a
> > practical perspective I think that fits with the most (all?)
> > definitions of "container", but I can't say that for certain. In
> > those cases where the assumption is not correct and processes can
> > signal each other across audit container ID boundaries, perhaps it is
> > enough to explain that an audit container ID may not fully disappear
> > until it has been fetched with a SIGNAL_INFO2 message.
>
> I think more and more, that more complete isolation is being done,
> taking advantage of each type of namespace as they become available, but
> I know a number of them didn't find it important yet to use IPC, PID or
> user namespaces which would be the only namespaces I can think of that
> would provide that isolation.
>
> It isn't entirely clear to me which side you fall on this issue, Paul.

That's mostly because I was hoping for some clarification in the
discussion, especially the relevant certification requirements, but it
looks like there is still plenty of room for interpretation there (as
usual). I'd much rather us arrive at decisions based on requirements
and not gut feelings, which is where I think we are at right now.

> Can you pronounce on your strong preference one way or the other if the
> death of a container coincide with the exit of the last process in that
> namespace, or the fetch of any signal info related to it?

"pronounce on your strong preference"? I've seen you use "pronounce"
a few times now, and I'd suggest a different word in the future; the
connotation is not well received on my end.

> I have a bias
> to the former since the code already does that and I feel the exit of
> the last process is much more relevant supported by the syscall record,
> but could change it to the latter if you feel strongly enough about it
> to block upstream acceptance.

At this point in time I believe the right thing to do is to preserve
the audit container ID as "dead but still in existence" so that there
is no confusion (due to reuse) if/when it finally reappears in the
audit record stream.

The thread has had a lot of starts/stops, so I may be repeating a
previous suggestion, but one idea would be to still emit a "death
record" when the final task in the audit container ID does die, but
block the particular audit container ID from reuse until the
SIGNAL2 info has been reported. This gives us the timely ACID death
notification while still preventing confusion and ambiguity caused by
potentially reusing the ACID before the SIGNAL2 record has been sent;
there is a small nit about the ACID being present in the SIGNAL2
*after* its death, but I think that can be easily explained and
understood by admins.
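
A minimal sketch of that idea, in userspace C and under stated
assumptions: the struct layout and the helpers contobj_task_exit(),
contobj_sig_reported() and contobj_id_reusable() are hypothetical, and
plain counters stand in for whatever refcounting and locking the kernel
would actually use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical audit container ID object. */
struct contobj {
	uint64_t id;
	unsigned int task_refs;	/* tasks still in the container */
	unsigned int sig_refs;	/* pending SIGNAL2 reports naming this ID */
	bool dead;		/* death record has been emitted */
};

/*
 * A task in the container exits. Returns true when this was the last
 * task, i.e. when the death record should be emitted right away.
 */
static bool contobj_task_exit(struct contobj *c)
{
	if (c->task_refs == 0)
		return false;
	if (--c->task_refs > 0)
		return false;
	c->dead = true;
	return true;
}

/* The SIGNAL2 info naming this ID has been reported; drop the hold. */
static void contobj_sig_reported(struct contobj *c)
{
	if (c->sig_refs)
		c->sig_refs--;
}

/* The ID may be handed out again only once dead and no longer held. */
static bool contobj_id_reusable(const struct contobj *c)
{
	return c->dead && c->sig_refs == 0;
}
```

The death record is tied to the exit of the last task, while reuse of the
ID is gated separately on the pending signal info having been reported.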

--
paul moore
http://www.paul-moore.com

2020-03-13 16:46:44

by Steve Grubb

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Friday, March 13, 2020 12:42:15 PM EDT Paul Moore wrote:
> > I think more and more, that more complete isolation is being done,
> > taking advantage of each type of namespace as they become available, but
> > I know a number of them didn't find it important yet to use IPC, PID or
> > user namespaces which would be the only namespaces I can think of that
> > would provide that isolation.
> >
> > It isn't entirely clear to me which side you fall on this issue, Paul.
>
> That's mostly because I was hoping for some clarification in the
> discussion, especially the relevant certification requirements, but it
> looks like there is still plenty of room for interpretation there (as
> usual). I'd much rather us arrive at decisions based on requirements
> and not gut feelings, which is where I think we are at right now.

Certification requirements are that we need the identity of anyone attempting
to modify the audit configuration including shutting it down.

-Steve


2020-03-13 16:48:44

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On Thu, Mar 12, 2020 at 4:52 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-02-13 16:49, Paul Moore wrote:
> > On Wed, Feb 5, 2020 at 6:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-02-05 18:05, Paul Moore wrote:
> > > > On Thu, Jan 30, 2020 at 2:28 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2020-01-22 16:29, Paul Moore wrote:
> > > > > > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > >
> > > > > > > Track the parent container of a container to be able to filter and
> > > > > > > report nesting.
> > > > > > >
> > > > > > > Now that we have a way to track and check the parent container of a
> > > > > > > container, modify the contid field format to be able to report that
> > > > > > > > nesting using a caret ("^") separator to indicate nesting. The
> > > > > > > original field format was "contid=<contid>" for task-associated records
> > > > > > > and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> > > > > > > records. The new field format is
> > > > > > > "contid=<contid>[^<contid>[...]][,<contid>[...]]".
> > > > > >
> > > > > > Let's make sure we always use a comma as a separator, even when
> > > > > > recording the parent information, for example:
> > > > > > "contid=<contid>[,^<contid>[...]][,<contid>[...]]"
> > > > >
> > > > > The intent here is to clearly indicate and separate nesting from
> > > > > parallel use of several containers by one netns. If we do away with
> > > > > that distinction, then we lose that inheritance accountability and
> > > > > should really run the list through a "uniq" function to remove the
> > > > > produced redundancies. This clear inheritance is something Steve was
> > > > > looking for since tracking down individual events/records to show that
> > > > > > inheritance was not always feasible due to rolled logs or search effort.
> > > >
> > > > Perhaps my example wasn't clear. I'm not opposed to the little
> > > > > caret/hat character indicating a container's parent, I just think it
> > > > > would be good to also include a comma *in*addition* to the caret/hat.
> > >
> > > Ah, ok. Well, I'd offer that it would be slightly shorter, slightly
> > > less cluttered and having already written the parser in userspace, I
> > > think the parser would be slightly simpler.
> > >
> > > I must admit, I was a bit puzzled by your snippet of code that was used
> > > as a prefix to the next item rather than as a postfix to the given item.
> > >
> > > Can you say why you prefer the comma in addition?
> >
> > Generally speaking, I believe that a single delimiter is both easier
> > for the eyes to parse, and easier/safer for machines to parse as well.
> > In this particular case I think of the comma as a delimiter and the
> > > caret as a modifier, reusing the caret as a delimiter seems like a bad
> > idea to me.
>
> I'm not crazy about this idea, but I'll have a look at how much work it
> is to recode the userspace search tools. It also adds extra characters
> and noise into the string format that seems counterproductive.

If anything the parser should be *easier* (although both parsers
should fall into the "trivial" category). The comma is the one and
only delimiter, and if the ACID starts with a caret then it is a
parent of the preceding ACID.
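
To illustrate the rule just described, here is a minimal userspace sketch
of such a parser. It is not the code from the patchset or from the audit
userspace tools; the struct and function names are hypothetical, and it
assumes container IDs are printed as decimal numbers and ignores overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record for one parsed entry; not a kernel or auditd type. */
struct contid_entry {
	unsigned long long id;
	int is_parent;	/* 1 if the token began with '^' */
};

/*
 * Parse "id[,^id...][,id...]" into out[].  The comma is the only
 * delimiter; a leading '^' marks the token as the parent of the
 * preceding entry.  Returns the number of entries, or -1 on malformed
 * input or when more than max entries are present.
 */
static int parse_contid_list(const char *s, struct contid_entry *out,
			     size_t max)
{
	size_t n = 0;

	assert(s != NULL && out != NULL);

	while (*s) {
		char *end;
		int parent = 0;

		if (*s == '^') {
			if (n == 0)	/* a parent needs a preceding ACID */
				return -1;
			parent = 1;
			s++;
		}
		if (*s < '0' || *s > '9')	/* token must start with a digit */
			return -1;
		if (n == max)
			return -1;
		out[n].id = strtoull(s, &end, 10);
		out[n].is_parent = parent;
		n++;
		s = end;
		if (*s == ',') {
			s++;
			if (!*s)	/* reject a trailing comma */
				return -1;
		} else if (*s) {
			return -1;	/* anything else is not a delimiter */
		}
	}
	return (int)n;
}
```

With this rule a string such as "10,^2,11,^2" yields four entries, the
second and fourth flagged as parents, while the older "10^2" form is
rejected because '^' is never treated as a delimiter.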

> > > > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > > > index ef8e07524c46..68be59d1a89b 100644
> > > > > > > --- a/kernel/audit.c
> > > > > > > +++ b/kernel/audit.c
> > > > > >
> > > > > > > @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> > > > > > > audit_netns_contid_add(new->net_ns, contid);
> > > > > > > }
> > > > > > >
> > > > > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid);
> > > > > >
> > > > > > If we need a forward declaration, might as well just move it up near
> > > > > > the top of the file with the rest of the declarations.
> > > > >
> > > > > Ok.
> > > > >
> > > > > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> > > > > > > +{
> > > > > > > + struct audit_contobj *cont = NULL, *prcont = NULL;
> > > > > > > + int h;
> > > > > >
> > > > > > It seems safer to pass the audit container ID object and not the u64.
> > > > >
> > > > > It would also be faster, but in some places it isn't available such as
> > > > > for ptrace and signal targets. This also links back to the drop record
> > > > > refcounts to hold onto the contobj until process exit, or signal
> > > > > delivery.
> > > > >
> > > > > What we could do is to supply two potential parameters, a contobj and/or
> > > > > a contid, and have it use the contobj if it is valid, otherwise, use the
> > > > > contid, as is done for names and paths supplied to audit_log_name().
> > > >
> > > > Let's not do multiple parameters, that begs for misuse, let's take the
> > > > wrapper function route:
> > > >
> > > > func a(int id) {
> > > > // important stuff
> > > > }
> > > >
> > > > func ao(struct obj) {
> > > > a(obj.id);
> > > > }
> > > >
> > > > ... and we can add a comment that you *really* should be using the
> > > > variant that passes an object.
> > >
> > > I was already doing that where it is available, and dereferencing the id
> > > for the call. But I see an advantage to having both parameters supplied
> > > to the function, since it saves us the trouble of dereferencing it,
> > > searching for the id in the hash list and re-locating the object if the
> > > object is already available.
> >
> > I strongly prefer we not do multiple parameters for the same "thing";
>
> So do I, ideally. However...
>
> > I would much rather do the wrapper approach as described above. I
> > would also like to see us use the audit container ID object as much as
> > possible, using a bare integer should be a last resort.
>
> It is not clear to me that you understood what I wrote above. I can't
> use the object pointer where preferable because there are a few cases
> where only the ID is available. If only the ID is available, I would
> have to make a best effort to look up the object pointer and am not
> guaranteed to find it (invalid, stale, signal info...). If I am forced
> to use only one, it becomes the ID that is used, and I no longer have
> the benefit of already having the object pointer for certainty and
> saving work. For all cases where I have the object pointer, which is
> most cases, and most frequently used cases, I will have to dereference
> the object pointer to an ID, then go through the work again to re-locate
> the object pointer. This is less certain, and more work. Reluctantly,
> the only practical solution I see here is to supply both, favouring the
> object pointer if it is valid, then falling back on the ID from the next
> parameter.

It has been a while since I last looked at the patchset, but my
concern over the preferred use of the ACID number vs the ACID object is
that the number offers no reuse protection where the object does. I
really would like us to use the object everywhere it is possible.
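
For reference, the wrapper approach quoted above could look roughly like
the following in plain C. This is only a userspace sketch: the struct and
function names are hypothetical stand-ins, not the audit_contobj or
audit_log_contid() from the patchset, and a character buffer stands in
for the audit buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the audit container ID object. */
struct contobj {
	uint64_t id;
};

/* Last-resort variant: only a bare number is known to the caller. */
static int log_contid(char *buf, size_t len, uint64_t contid)
{
	assert(buf != NULL);
	return snprintf(buf, len, "contid=%llu", (unsigned long long)contid);
}

/*
 * Preferred variant: callers that hold the object pass the object.
 * As pointed out in the thread, forwarding only cont->id discards the
 * object, so the callee would have to look it up again.
 */
static int log_contobj(char *buf, size_t len, const struct contobj *cont)
{
	assert(cont != NULL);
	return log_contid(buf, len, cont->id);
}
```

The sketch keeps a single parameter per function, which is the point of
the wrapper suggestion; it does not by itself provide the reuse
protection that holding a reference to the object gives.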

--
paul moore
http://www.paul-moore.com

2020-03-13 16:50:56

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Fri, Mar 13, 2020 at 12:45 PM Steve Grubb <[email protected]> wrote:
> On Friday, March 13, 2020 12:42:15 PM EDT Paul Moore wrote:
> > > I think more and more, that more complete isolation is being done,
> > > taking advantage of each type of namespace as they become available, but
> > > I know a number of them didn't find it important yet to use IPC, PID or
> > > user namespaces which would be the only namespaces I can think of that
> > > would provide that isolation.
> > >
> > > It isn't entirely clear to me which side you fall on this issue, Paul.
> >
> > That's mostly because I was hoping for some clarification in the
> > discussion, especially the relevant certification requirements, but it
> > looks like there is still plenty of room for interpretation there (as
> > usual). I'd much rather us arrive at decisions based on requirements
> > and not gut feelings, which is where I think we are at right now.
>
> Certification requirements are that we need the identity of anyone attempting
> to modify the audit configuration including shutting it down.

Yep, got it. Unfortunately that doesn't really help with what we are
talking about. Although preventing the reuse of the ACID before the
SIGNAL2 record does help preserve the sanity of the audit stream which
I believe to be very important, regardless.

--
paul moore
http://www.paul-moore.com

2020-03-13 19:01:36

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-13 12:29, Paul Moore wrote:
> On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-02-13 16:44, Paul Moore wrote:
> > > This is a bit of a thread-hijack, and for that I apologize, but
> > > another thought crossed my mind while thinking about this issue
> > > further ... Once we support multiple auditd instances, including the
> > > necessary record routing and duplication/multiple-sends (the host
> > > always sees *everything*), we will likely need to find a way to "trim"
> > > the audit container ID (ACID) lists we send in the records. The
> > > auditd instance running on the host/initns will always see everything,
> > > so it will want the full container ACID list; however an auditd
> > > instance running inside a container really should only see the ACIDs
> > > of any child containers.
> >
> > Agreed. This should be easy to check and limit, preventing an auditd
> > from seeing any contid that is a parent of its own contid.
> >
> > > For example, imagine a system where the host has containers 1 and 2,
> > > each running an auditd instance. Inside container 1 there are
> > > containers A and B. Inside container 2 there are containers Y and Z.
> > > If an audit event is generated in container Z, I would expect the
> > > host's auditd to see an ACID list of "1,Z" but container 1's auditd
> > > should only see an ACID list of "Z". The auditd running in container
> > > 2 should not see the record at all (that will be relatively
> > > straightforward). Does that make sense? Do we have the record
> > > formats properly designed to handle this without too much problem (I'm
> > > not entirely sure we do)?
> >
> > I completely agree and I believe we have record formats that are able to
> > handle this already.
>
> I'm not convinced we do. What about the cases where we have a field
> with a list of audit container IDs? How do we handle that?

I don't understand the problem. (I think you crossed your 1/2 vs
A/B/Y/Z in your example.) Clarifying the example above, if as you
suggest an event happens in container Z, the host's auditd would report
Z,^2
and the auditd in container 2 would report
Z,^2
but if there were another auditd running in container Z it would report
Z
while the auditd in container 1 or A/B would see nothing.

The format I had proposed already handles that:
contid^contid,contid^contid but you'd like to see it changed to
contid,^contid,contid,^contid and both formats handle it though I find
the former much easier to read. For the example above we'd have:
A,^1
B,^1
Y,^2
Z,^2
and for a shared network namespace potentially:
A,^1,B,^1,Y,^2,Z,^2
and if there were an event reported by an auditd in container Z it would
report only:
Z

Now, I could see an argument for restricting the visibility of the
contid to the container containing an auditd so that an auditd cannot
see its own contid, but that wasn't my design intent. This can still be
addressed after the initial code is committed without breaking the API.
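
The visibility rule in the examples above can be sketched as follows,
assuming the nesting chain is available as an array ordered from the
event's container up to its outermost parent. The function name, the
HOST_VIEWER marker and the numeric IDs are hypothetical; this is a
userspace illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical marker for the auditd running on the host/initns. */
#define HOST_VIEWER UINT64_MAX

/*
 * chain[0] is the container in which the event happened, chain[1] its
 * parent, and so on.  The host sees the whole chain.  An auditd inside
 * container "viewer" sees the chain up to and including its own contid,
 * and sees nothing if it is not in the chain.  Returns the string
 * length, or -1 if the record is not visible or does not fit in buf.
 */
static int format_visible_chain(const uint64_t *chain, size_t n,
				uint64_t viewer, char *buf, size_t len)
{
	size_t emit = n, off = 0, i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	if (viewer != HOST_VIEWER) {
		for (i = 0; i < n && chain[i] != viewer; i++)
			;
		if (i == n)
			return -1;	/* viewer is not an ancestor: no record */
		emit = i + 1;
	}

	for (i = 0; i < emit; i++) {
		int w = snprintf(buf + off, len - off, "%s%llu",
				 i ? ",^" : "", (unsigned long long)chain[i]);

		if (w < 0 || (size_t)w >= len - off)
			return -1;
		off += (size_t)w;
	}
	return (int)off;
}
```

Using 26 for Z and 2 for its parent, the host and the auditd in
container 2 both get "26,^2", an auditd in Z gets "26", and an auditd in
container 1 gets no record.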

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-13 19:24:16

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-13 12:42, Paul Moore wrote:
> On Thu, Mar 12, 2020 at 4:27 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-02-12 19:09, Paul Moore wrote:
> > > On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <[email protected]> wrote:
> > > > On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote:
> > > > > > > > > ... When we record the audit container ID in audit_signal_info() we
> > > > > > > > > take an extra reference to the audit container ID object so that it
> > > > > > > > > will not disappear (and get reused) until after we respond with an
> > > > > > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the
> > > > > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took
> > > > > > > > > in
> > > > > > > > > audit_signal_info(). Unless I'm missing some other change you
> > > > > > > > > made,
> > > > > > > > > this *shouldn't* affect the syscall records, all it does is
> > > > > > > > > preserve
> > > > > > > > > the audit container ID object in the kernel's ACID store so it
> > > > > > > > > doesn't
> > > > > > > > > get reused.
> > > > > > > >
> > > > > > > > This is exactly what I had understood. I hadn't considered the extra
> > > > > > > > details below in detail due to my original syscall concern, but they
> > > > > > > > make sense.
> > > > > > > >
> > > > > > > > The syscall I refer to is the one connected with the drop of the
> > > > > > > > audit container identifier by the last process that was in that
> > > > > > > > container in patch 5/16. The production of this record is contingent
> > > > > > > > on
> > > > > > > > the last ref in a contobj being dropped. So if it is due to that ref
> > > > > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > > > > > > > record is fetched, then it will appear that the fetch action closed
> > > > > > > > the
> > > > > > > > container rather than the last process in the container to exit.
> > > > > > > >
> > > > > > > > Does this make sense?
> > > > > > >
> > > > > > > More so than your original reply, at least to me anyway.
> > > > > > >
> > > > > > > It makes sense that the audit container ID wouldn't be marked as
> > > > > > > "dead" since it would still be very much alive and available for use
> > > > > > > by the orchestrator, the question is if that is desirable or not. I
> > > > > > > > think the answer to this comes down to preserving the correctness of
> > > > > > > the audit log.
> > > > > > >
> > > > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > > > > > > reused then I think there is a legitimate concern that the audit log
> > > > > > > is not correct, and could be misleading. If we solve that by grabbing
> > > > > > > an extra reference, then there could also be some confusion as
> > > > > > > userspace considers a container to be "dead" while the audit container
> > > > > > > ID still exists in the kernel, and the kernel generated audit
> > > > > > > container ID death record will not be generated until much later (and
> > > > > > > possibly be associated with a different event, but that could be
> > > > > > > solved by unassociating the container death record).
> > > > > >
> > > > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> > > > > > possibly get associated with another event? Or is the syscall
> > > > > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?
> > > > >
> > > > > The issue is when does the audit container ID "die". If it is when
> > > > > the last task in the container exits, then the death record will be
> > > > > > associated with the task's exit. If the audit container ID lives on
> > > > > until the last reference of it in the audit logs, including the
> > > > > SIGNAL_INFO2 message, the death record will be associated with the
> > > > > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
> > > > > the details of the syscalls/netlink.
> > > > >
> > > > > > Another idea might be to bump the refcount in audit_signal_info() but
> > > > > > > mark that contid as dead so it can't be reused if we are concerned that
> > > > > > the dead contid be reused?
> > > > >
> > > > > Ooof. Yes, maybe, but that would be ugly.
> > > > >
> > > > > > There is still the problem later that the reported contid is incomplete
> > > > > > compared to the rest of the contid reporting cycle wrt nesting since
> > > > > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> > > > > > fields to accommodate a nested contid list.
> > > > >
> > > > > Do we really care about the full nested audit container ID list in the
> > > > > SIGNAL_INFO2 record?
> >
> > I'm inclined to hand-wave it away as an inconvenience that can be looked up
> > more carefully if it is really needed. Maybe the block above would be
> > safer and more complete even though it is ugly.
> >
> > > > > > > Of the two
> > > > > > > approaches, I think the latter is safer in that it preserves the
> > > > > > > correctness of the audit log, even though it could result in a delay
> > > > > > > of the container death record.
> > > > > >
> > > > > > I prefer the former since it strongly indicates last task in the
> > > > > > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> > > > > > attributes and the contid to strongly link the responsible party.
> > > > >
> > > > > Steve is the only one who really tracks the security certifications
> > > > > that are relevant to audit, see what the certification requirements
> > > > > have to say and we can revisit this.
> > > >
> > > > > Server Virtualization Protection Profile is the closest applicable standard
> > > >
> > > > https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408
> > > >
> > > > It is silent on audit requirements for the lifecycle of a VM. I assume that
> > > > all that is needed is what the orchestrator says it's doing at the high level.
> > > > So, if an orchestrator wants to shutdown a container, the orchestrator must
> > > > log that intent and its results. In a similar fashion, systemd logs that it's
> > > > killing a service and we don't actually hook the exit syscall of the service
> > > > to record that.
> > > >
> > > > Now, if a container was being used as a VPS, and it had a fully functioning
> > > > userspace, its own services, and its very own audit daemon, then in this
> > > > case it would care who sent a signal to its auditd. The tenant of that
> > > > container may have to comply with PCI-DSS or something else. It would log the
> > > > audit service is being terminated and systemd would record that it's tearing
> > > > down the environment. The OS doesn't need to do anything.
> > >
> > > This latter case is the case of interest here, since the host auditd
> > > should only be killed from a process on the host itself, not a process
> > > running in a container. If we work under the assumption (and this may
> > > be a break in our approach to not defining "container") that an auditd
> > > instance is only ever signaled by a process with the same audit
> > > container ID (ACID), is this really even an issue? Right now it isn't
> > > as even with this patchset we will still really only support one
> > > auditd instance, presumably on the host, so this isn't a significant
> > > concern. Moving forward, once we add support for multiple auditd
> > > instances we will likely need to move the signal info into
> > > (potentially) a per-ACID struct, a struct whose lifetime would match
> > > that of the associated container by definition; as the auditd
> > > container died, the struct would die, the refcounts dropped, and any
> > > ACID held only by the signal info refcount would be dropped/killed.
> >
> > Any process could signal auditd if it can see it based on namespace
> > relationships, nevermind container placement. Some container
> > architectures would not have a namespace configuration that would block
> > this (combination of PID/user/IPC?).
> >
> > > However, making this assumption would mean that we are expecting a
> > > "container" to provide some level of isolation such that processes
> > > with a different audit container ID do not signal each other. From a
> > > practical perspective I think that fits with most (all?)
> > > definitions of "container", but I can't say that for certain. In
> > > those cases where the assumption is not correct and processes can
> > > signal each other across audit container ID boundaries, perhaps it is
> > > enough to explain that an audit container ID may not fully disappear
> > > until it has been fetched with a SIGNAL_INFO2 message.
> >
> > I think more and more, that more complete isolation is being done,
> > taking advantage of each type of namespace as they become available, but
> > I know a number of them didn't find it important yet to use IPC, PID or
> > user namespaces which would be the only namespaces I can think of that
> > would provide that isolation.
> >
> > It isn't entirely clear to me which side you fall on this issue, Paul.
>
> That's mostly because I was hoping for some clarification in the
> discussion, especially the relevant certification requirements, but it
> looks like there is still plenty of room for interpretation there (as
> usual). I'd much rather us arrive at decisions based on requirements
> and not gut feelings, which is where I think we are at right now.

I don't disagree.

> > Can you pronounce on your strong preference one way or the other if the
> > death of a container should coincide with the exit of the last process in that
> > namespace, or the fetch of any signal info related to it?
>
> "pronounce on your strong preference"? I've seen you use "pronounce"
> a few times now, and suggest a different word in the future; the
> connotation is not well received on my end.

I'm sorry. I don't have any particular attachment to that word, but
I'll try to be conscious to avoid it since you've expressed your aversion
to it. I don't mean to load it down with any negative connotations, I'm
simply seeking clarity on your preferred technical style so I may follow
it.

> > I have a bias
> > to the former since the code already does that and I feel the exit of
> > the last process is much more relevant supported by the syscall record,
> > but could change it to the latter if you feel strongly enough about it
> > to block upstream acceptance.
>
> At this point in time I believe the right thing to do is to preserve
> the audit container ID as "dead but still in existence" so that there
> is no confusion (due to reuse) if/when it finally reappears in the
> audit record stream.

I agree this seems safest.

> The thread has had a lot of starts/stops, so I may be repeating a
> previous suggestion, but one idea would be to still emit a "death
> record" when the final task in the audit container ID does die, but
> block the particular audit container ID from reuse until the
> SIGNAL2 info has been reported. This gives us the timely ACID death
> notification while still preventing confusion and ambiguity caused by
> potentially reusing the ACID before the SIGNAL2 record has been sent;
> there is a small nit about the ACID being present in the SIGNAL2
> *after* its death, but I think that can be easily explained and
> understood by admins.

Thinking quickly about possible technical solutions to this, maybe it
makes sense to have two counters on a contobj so that we know when the
last process in that container exits and can issue the death
certificate, but we still block reuse of it until all further references
to it have been resolved. This will likely also make it possible to
report the full contid chain in SIGNAL2 records. This will eliminate
some of the issues we are discussing with regards to passing a contobj
vs a contid to the audit_log_contid function, but won't eliminate them
all because there are still some contids that won't have an object
associated with them, which makes it impossible to look them up in the
contobj lists.
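
A rough sketch of the two-counter idea, in userspace C and without any
locking. The field and function names are hypothetical and do not come
from the patchset; a kernel version would presumably use refcount_t and
the existing contobj list locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container object carrying two counters. */
struct contobj {
	uint64_t id;
	unsigned int tasks;	/* live processes in the container */
	unsigned int refs;	/* every holder: tasks plus signal info etc. */
	bool dead;		/* death record has been emitted */
};

/* A process enters the container: it counts as a task and as a holder. */
static void contobj_task_join(struct contobj *c)
{
	c->tasks++;
	c->refs++;
}

/* Returns true when the caller should emit the death record. */
static bool contobj_task_exit(struct contobj *c)
{
	assert(c->tasks > 0 && c->refs > 0);
	c->refs--;
	if (--c->tasks == 0) {
		c->dead = true;
		return true;
	}
	return false;
}

/* Extra reference, e.g. taken when recording signal info. */
static void contobj_hold(struct contobj *c)
{
	c->refs++;
}

/* Drop an extra reference, e.g. after SIGNAL2 info has been reported. */
static void contobj_put(struct contobj *c)
{
	assert(c->refs > 0);
	c->refs--;
}

/* The ID may be reused only once nothing refers to the object. */
static bool contobj_reusable(const struct contobj *c)
{
	return c->refs == 0;
}
```

In this arrangement the death record is issued as soon as the last task
exits, while the ID stays blocked from reuse until the remaining
references, such as the one held for the signal info, are dropped.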

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-15 01:43:51

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On 2020-03-13 12:47, Paul Moore wrote:
> On Thu, Mar 12, 2020 at 4:52 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-02-13 16:49, Paul Moore wrote:
> > > On Wed, Feb 5, 2020 at 6:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-02-05 18:05, Paul Moore wrote:
> > > > > On Thu, Jan 30, 2020 at 2:28 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > On 2020-01-22 16:29, Paul Moore wrote:
> > > > > > > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Track the parent container of a container to be able to filter and
> > > > > > > > report nesting.
> > > > > > > >
> > > > > > > > Now that we have a way to track and check the parent container of a
> > > > > > > > container, modify the contid field format to be able to report that
> > > > > > > > > nesting using a caret ("^") separator to indicate nesting. The
> > > > > > > > original field format was "contid=<contid>" for task-associated records
> > > > > > > > and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> > > > > > > > records. The new field format is
> > > > > > > > "contid=<contid>[^<contid>[...]][,<contid>[...]]".
> > > > > > >
> > > > > > > Let's make sure we always use a comma as a separator, even when
> > > > > > > recording the parent information, for example:
> > > > > > > "contid=<contid>[,^<contid>[...]][,<contid>[...]]"
> > > > > >
> > > > > > The intent here is to clearly indicate and separate nesting from
> > > > > > parallel use of several containers by one netns. If we do away with
> > > > > > that distinction, then we lose that inheritance accountability and
> > > > > > should really run the list through a "uniq" function to remove the
> > > > > > produced redundancies. This clear inheritance is something Steve was
> > > > > > looking for since tracking down individual events/records to show that
> > > > > > > inheritance was not always feasible due to rolled logs or search effort.
> > > > >
> > > > > Perhaps my example wasn't clear. I'm not opposed to the little
> > > > > > caret/hat character indicating a container's parent, I just think it
> > > > > > would be good to also include a comma *in*addition* to the caret/hat.
> > > >
> > > > Ah, ok. Well, I'd offer that it would be slightly shorter, slightly
> > > > less cluttered and having already written the parser in userspace, I
> > > > think the parser would be slightly simpler.
> > > >
> > > > I must admit, I was a bit puzzled by your snippet of code that was used
> > > > as a prefix to the next item rather than as a postfix to the given item.
> > > >
> > > > Can you say why you prefer the comma in addition?
> > >
> > > Generally speaking, I believe that a single delimiter is both easier
> > > for the eyes to parse, and easier/safer for machines to parse as well.
> > > In this particular case I think of the comma as a delimiter and the
> > > > caret as a modifier, reusing the caret as a delimiter seems like a bad
> > > idea to me.
> >
> > I'm not crazy about this idea, but I'll have a look at how much work it
> > is to recode the userspace search tools. It also adds extra characters
> > and noise into the string format that seems counterproductive.
>
> If anything the parser should be *easier* (although both parsers
> should fall into the "trivial" category). The comma is the one and
> only delimiter, and if the ACID starts with a caret then it is a
> parent of the preceding ACID.

Ok, after a day of staring at the code and getting nowhere due to
multiple distractions, I was able to rework this code fairly easily and
it turned out simpler which should not surprise you. Both kernel and
userspace code are now in the format you recommended.

> > > > > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > > > > index ef8e07524c46..68be59d1a89b 100644
> > > > > > > > --- a/kernel/audit.c
> > > > > > > > +++ b/kernel/audit.c
> > > > > > >
> > > > > > > > @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> > > > > > > > audit_netns_contid_add(new->net_ns, contid);
> > > > > > > > }
> > > > > > > >
> > > > > > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid);
> > > > > > >
> > > > > > > If we need a forward declaration, might as well just move it up near
> > > > > > > the top of the file with the rest of the declarations.
> > > > > >
> > > > > > Ok.
> > > > > >
> > > > > > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> > > > > > > > +{
> > > > > > > > + struct audit_contobj *cont = NULL, *prcont = NULL;
> > > > > > > > + int h;
> > > > > > >
> > > > > > > It seems safer to pass the audit container ID object and not the u64.
> > > > > >
> > > > > > It would also be faster, but in some places it isn't available such as
> > > > > > for ptrace and signal targets. This also links back to the drop record
> > > > > > refcounts to hold onto the contobj until process exit, or signal
> > > > > > delivery.
> > > > > >
> > > > > > What we could do is to supply two potential parameters, a contobj and/or
> > > > > > a contid, and have it use the contobj if it is valid, otherwise, use the
> > > > > > contid, as is done for names and paths supplied to audit_log_name().
> > > > >
> > > > > Let's not do multiple parameters, that begs for misuse, let's take the
> > > > > wrapper function route:
> > > > >
> > > > > func a(int id) {
> > > > > // important stuff
> > > > > }
> > > > >
> > > > > func ao(struct obj) {
> > > > > a(obj.id);
> > > > > }
> > > > >
> > > > > ... and we can add a comment that you *really* should be using the
> > > > > variant that passes an object.
> > > >
> > > > I was already doing that where it is available, and dereferencing the id
> > > > for the call. But I see an advantage to having both parameters supplied
> > > > to the function, since it saves us the trouble of dereferencing it,
> > > > searching for the id in the hash list and re-locating the object if the
> > > > object is already available.
> > >
> > > I strongly prefer we not do multiple parameters for the same "thing";
> >
> > So do I, ideally. However...
> >
> > > I would much rather do the wrapper approach as described above. I
> > > would also like to see us use the audit container ID object as much as
> > > possible, using a bare integer should be a last resort.
> >
> > It is not clear to me that you understood what I wrote above. I can't
> > use the object pointer where preferable because there are a few cases
> > where only the ID is available. If only the ID is available, I would
> > have to make a best effort to look up the object pointer and am not
> > guaranteed to find it (invalid, stale, signal info...). If I am forced
> > to use only one, it becomes the ID that is used, and I no longer have
> > the benefit of already having the object pointer for certainty and
> > saving work. For all cases where I have the object pointer, which is
> > most cases, and most frequently used cases, I will have to dereference
> > the object pointer to an ID, then go through the work again to re-locate
> > the object pointer. This is less certain, and more work. Reluctantly,
> > the only practical solution I see here is to supply both, favouring the
> > object pointer if it is valid, then falling back on the ID from the next
> > parameter.
>
> It has been a while since I last looked at the patchset, but my
> concern over the preferred use of the ACID number vs the ACID object is
> that the number offers no reuse protection where the object does. I
> really would like us to use the object everywhere it is possible.

Ok, so I take it from this that I go ahead with the dual format since
the wrapper function to convert from object to ID strips away object
information negating any benefit of favouring the object pointer. I'll
look at the remaining calls that use a contid (rather than contobj) and
convert all that I can over to storing an object using the dual counters
that track process exits versus signal2 and trace references.
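
As a rough illustration of the trade-off discussed above, here is a
userspace sketch (not the patchset's code) in which the object-taking
function does the real work and the ID-taking variant is a last resort
that must first find the object and can fail for an unknown or stale
ID. The struct, the hash table and every function name below are
hypothetical stand-ins, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CONT_HASH_SIZE 16

/* Hypothetical stand-in for the patchset's struct audit_contobj. */
struct contobj {
	uint64_t id;
	struct contobj *next;	/* hash bucket chain */
};

static struct contobj *cont_hash[CONT_HASH_SIZE];

static void contobj_insert(struct contobj *c)
{
	size_t h = c->id % CONT_HASH_SIZE;

	c->next = cont_hash[h];
	cont_hash[h] = c;
}

/* Best-effort lookup: returns NULL for an unknown or stale id. */
static struct contobj *contobj_find(uint64_t id)
{
	struct contobj *c;

	for (c = cont_hash[id % CONT_HASH_SIZE]; c; c = c->next)
		if (c->id == id)
			return c;
	return NULL;
}

/* Preferred entry point: the caller already holds the object. */
static int log_contobj(char *buf, size_t len, const struct contobj *c)
{
	assert(buf != NULL && c != NULL);
	return snprintf(buf, len, "contid=%llu", (unsigned long long)c->id);
}

/* Last resort: only the bare id is known (e.g. a signal target). */
static int log_contid(char *buf, size_t len, uint64_t id)
{
	const struct contobj *c = contobj_find(id);

	if (!c)
		return -1;	/* no object found, nothing trustworthy to log */
	return log_contobj(buf, len, c);
}
```

Callers that already hold the object never pay for the hash walk, and
the ID path makes the "not guaranteed to find it" case explicit.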

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-17 18:29:56

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On 2020-03-14 18:42, Richard Guy Briggs wrote:
> On 2020-03-13 12:47, Paul Moore wrote:
> > On Thu, Mar 12, 2020 at 4:52 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-02-13 16:49, Paul Moore wrote:
> > > > On Wed, Feb 5, 2020 at 6:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2020-02-05 18:05, Paul Moore wrote:
> > > > > > On Thu, Jan 30, 2020 at 2:28 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > On 2020-01-22 16:29, Paul Moore wrote:
> > > > > > > > On Tue, Dec 31, 2019 at 2:51 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Track the parent container of a container to be able to filter and
> > > > > > > > > report nesting.
> > > > > > > > >
> > > > > > > > > Now that we have a way to track and check the parent container of a
> > > > > > > > > container, modify the contid field format to be able to report that
> > > > > > > > > > nesting using a caret ("^") separator to indicate nesting. The
> > > > > > > > > original field format was "contid=<contid>" for task-associated records
> > > > > > > > > and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> > > > > > > > > records. The new field format is
> > > > > > > > > "contid=<contid>[^<contid>[...]][,<contid>[...]]".
> > > > > > > >
> > > > > > > > Let's make sure we always use a comma as a separator, even when
> > > > > > > > recording the parent information, for example:
> > > > > > > > "contid=<contid>[,^<contid>[...]][,<contid>[...]]"
> > > > > > >
> > > > > > > The intent here is to clearly indicate and separate nesting from
> > > > > > > parallel use of several containers by one netns. If we do away with
> > > > > > > that distinction, then we lose that inheritance accountability and
> > > > > > > should really run the list through a "uniq" function to remove the
> > > > > > > produced redundancies. This clear inheritance is something Steve was
> > > > > > > looking for since tracking down individual events/records to show that
> > > > > > > > inheritance was not always feasible due to rolled logs or search effort.
> > > > > >
> > > > > > Perhaps my example wasn't clear. I'm not opposed to the little
> > > > > > caret/hat character indicating a container's parent, I just think it
> > > > > > would be good to also include a comma *in*addition* to the caret/hat.
> > > > >
> > > > > Ah, ok. Well, I'd offer that it would be slightly shorter, slightly
> > > > > less cluttered and having already written the parser in userspace, I
> > > > > think the parser would be slightly simpler.
> > > > >
> > > > > I must admit, I was a bit puzzled by your snippet of code that was used
> > > > > as a prefix to the next item rather than as a postfix to the given item.
> > > > >
> > > > > Can you say why you prefer the comma in addition?
> > > >
> > > > Generally speaking, I believe that a single delimiter is both easier
> > > > for the eyes to parse, and easier/safer for machines to parse as well.
> > > > In this particular case I think of the comma as a delimiter and the
> > > > caret as a modifier, reusing the caret as a delimiter seems like a bad
> > > > idea to me.
> > >
> > > I'm not crazy about this idea, but I'll have a look at how much work it
> > > is to recode the userspace search tools. It also adds extra characters
> > > and noise into the string format that seems counterproductive.
> >
> > If anything the parser should be *easier* (although both parsers
> > should fall into the "trivial" category). The comma is the one and
> > only delimiter, and if the ACID starts with a caret then it is a
> > parent of the preceding ACID.
>
> Ok, after a day of staring at the code and getting nowhere due to
> multiple distractions, I was able to rework this code fairly easily and
> it turned out simpler which should not surprise you. Both kernel and
> userspace code are now in the format you recommended.
>
> > > > > > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > > > > > index ef8e07524c46..68be59d1a89b 100644
> > > > > > > > > --- a/kernel/audit.c
> > > > > > > > > +++ b/kernel/audit.c
> > > > > > > >
> > > > > > > > > @@ -492,6 +493,7 @@ void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> > > > > > > > > audit_netns_contid_add(new->net_ns, contid);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid);
> > > > > > > >
> > > > > > > > If we need a forward declaration, might as well just move it up near
> > > > > > > > the top of the file with the rest of the declarations.
> > > > > > >
> > > > > > > Ok.
> > > > > > >
> > > > > > > > > +void audit_log_contid(struct audit_buffer *ab, u64 contid)
> > > > > > > > > +{
> > > > > > > > > + struct audit_contobj *cont = NULL, *prcont = NULL;
> > > > > > > > > + int h;
> > > > > > > >
> > > > > > > > It seems safer to pass the audit container ID object and not the u64.
> > > > > > >
> > > > > > > It would also be faster, but in some places it isn't available such as
> > > > > > > for ptrace and signal targets. This also links back to the drop record
> > > > > > > refcounts to hold onto the contobj until process exit, or signal
> > > > > > > delivery.
> > > > > > >
> > > > > > > What we could do is to supply two potential parameters, a contobj and/or
> > > > > > > a contid, and have it use the contobj if it is valid, otherwise, use the
> > > > > > > contid, as is done for names and paths supplied to audit_log_name().
> > > > > >
> > > > > > Let's not do multiple parameters, that begs for misuse, let's take the
> > > > > > wrapper function route:
> > > > > >
> > > > > > func a(int id) {
> > > > > > // important stuff
> > > > > > }
> > > > > >
> > > > > > func ao(struct obj) {
> > > > > > a(obj.id);
> > > > > > }
> > > > > >
> > > > > > ... and we can add a comment that you *really* should be using the
> > > > > > variant that passes an object.
> > > > >
> > > > > > I was already doing that where it is available, and dereferencing the id
> > > > > for the call. But I see an advantage to having both parameters supplied
> > > > > to the function, since it saves us the trouble of dereferencing it,
> > > > > searching for the id in the hash list and re-locating the object if the
> > > > > object is already available.
> > > >
> > > > I strongly prefer we not do multiple parameters for the same "thing";
> > >
> > > So do I, ideally. However...
> > >
> > > > I would much rather do the wrapper approach as described above. I
> > > > would also like to see us use the audit container ID object as much as
> > > > possible, using a bare integer should be a last resort.
> > >
> > > It is not clear to me that you understood what I wrote above. I can't
> > > use the object pointer where preferable because there are a few cases
> > > where only the ID is available. If only the ID is available, I would
> > > have to make a best effort to look up the object pointer and am not
> > > guaranteed to find it (invalid, stale, signal info...). If I am forced
> > > to use only one, it becomes the ID that is used, and I no longer have
> > > the benefit of already having the object pointer for certainty and
> > > saving work. For all cases where I have the object pointer, which is
> > > most cases, and most frequently used cases, I will have to dereference
> > > the object pointer to an ID, then go through the work again to re-locate
> > > the object pointer. This is less certain, and more work. Reluctantly,
> > > the only practical solution I see here is to supply both, favouring the
> > > object pointer if it is valid, then falling back on the ID from the next
> > > parameter.
> >
> > It has been a while since I last looked at the patchset, but my
> > concern over the preferred use of the ACID number vs the ACID object is
> > that the number offers no reuse protection where the object does. I
> > really would like us to use the object everywhere it is possible.
>
> Ok, so I take it from this that I go ahead with the dual format since
> the wrapper function to convert from object to ID strips away object
> information negating any benefit of favouring the object pointer. I'll
> look at the remaining calls that use a contid (rather than contobj) and
> convert all that I can over to storing an object using the dual counters
> that track process exits versus signal2 and trace references.

After reworking all the signal code to use the contobj and open coding
unnested single contid appearances, I was able to stick with just
passing a contobj to audit_container_id() and audit_log_contid(), so
the dual format conundrum went away.
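
For readers following the format discussion, here is a minimal
userspace sketch of emitting the agreed field value, where the comma is
the only delimiter and a leading "^" marks a parent of the preceding
ID. It assumes each container object carries a pointer to its parent;
the struct and the function name are hypothetical and are not taken
from the patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct audit_contobj; only what the sketch needs. */
struct contobj {
	uint64_t id;
	const struct contobj *parent;	/* NULL for a top-level container */
};

/*
 * Write "<contid>[,^<contid>[...]]" for cont and its ancestors into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int format_contid_chain(char *buf, size_t len, const struct contobj *cont)
{
	const struct contobj *c;
	size_t off = 0;

	assert(buf != NULL && cont != NULL);
	for (c = cont; c; c = c->parent) {
		/* The first entry is bare; every ancestor is prefixed with ",^". */
		int n = snprintf(buf + off, len - off, "%s%llu",
				 c == cont ? "" : ",^",
				 (unsigned long long)c->id);

		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

With a container 26 nested in container 2, this produces "26,^2". A
parser only has to split on commas and check the first character of
each token.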

The kernel now issues the death certificate on process exit, and will
issue an error indicating that the contid is dead and cannot be reused
until it is reaped by a sig2 call.

> > paul moore
>
> - RGB

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-18 21:00:11

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-13 12:29, Paul Moore wrote:
> > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > another thought crossed my mind while thinking about this issue
> > > > further ... Once we support multiple auditd instances, including the
> > > > necessary record routing and duplication/multiple-sends (the host
> > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > the audit container ID (ACID) lists we send in the records. The
> > > > auditd instance running on the host/initns will always see everything,
> > > > so it will want the full container ACID list; however an auditd
> > > > instance running inside a container really should only see the ACIDs
> > > > of any child containers.
> > >
> > > Agreed. This should be easy to check and limit, preventing an auditd
> > > from seeing any contid that is a parent of its own contid.
> > >
> > > > For example, imagine a system where the host has containers 1 and 2,
> > > > each running an auditd instance. Inside container 1 there are
> > > > containers A and B. Inside container 2 there are containers Y and Z.
> > > > If an audit event is generated in container Z, I would expect the
> > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > > > should only see an ACID list of "Z". The auditd running in container
> > > > 2 should not see the record at all (that will be relatively
> > > > straightforward). Does that make sense? Do we have the record
> > > > formats properly designed to handle this without too much problem (I'm
> > > > not entirely sure we do)?
> > >
> > > I completely agree and I believe we have record formats that are able to
> > > handle this already.
> >
> > I'm not convinced we do. What about the cases where we have a field
> > with a list of audit container IDs? How do we handle that?
>
> I don't understand the problem. (I think you crossed your 1/2 vs
> A/B/Y/Z in your example.) ...

It looks like I did, sorry about that.

> ... Clarifying the example above, if as you
> suggest an event happens in container Z, the host's auditd would report
> Z,^2
> and the auditd in container 2 would report
> Z,^2
> but if there were another auditd running in container Z it would report
> Z
> while the auditd in container 1 or A/B would see nothing.

Yes. My concern is how do we handle this to minimize duplicating and
rewriting the records? It isn't so much about the format, although
the format is a side effect.
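
To make the per-auditd views in the clarified example concrete, here is
a small userspace sketch. It assumes a parent pointer on each container
object and treats a NULL viewer as the host auditd; all names are
hypothetical. It follows the clarified example (an auditd sees the
chain up to and including its own container, and nothing if it is not
on the chain). It rebuilds the string for each viewer, so it does
nothing to address the duplication and rewriting cost raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct audit_contobj. */
struct contobj {
	uint64_t id;
	const struct contobj *parent;	/* NULL for a top-level container */
};

/*
 * Format the part of cont's chain that an auditd running in `viewer`
 * may see. viewer == NULL means the host/initns auditd.
 * Returns characters written, 0 if the record is not visible to this
 * viewer, or -1 if buf is too small.
 */
static int format_visible_chain(char *buf, size_t len,
				const struct contobj *cont,
				const struct contobj *viewer)
{
	const struct contobj *c;
	size_t off = 0;

	assert(buf != NULL && len > 0 && cont != NULL);
	buf[0] = '\0';

	/* A contained auditd sees the event only if it sits on the chain. */
	if (viewer) {
		for (c = cont; c && c != viewer; c = c->parent)
			;
		if (!c)
			return 0;
	}

	for (c = cont; c; c = c->parent) {
		int n = snprintf(buf + off, len - off, "%s%llu",
				 c == cont ? "" : ",^",
				 (unsigned long long)c->id);

		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
		if (c == viewer)
			break;	/* do not reveal the viewer's own ancestors */
	}
	return (int)off;
}
```

Using 26 for Z and 2 for its parent, the host and the auditd in
container 2 both get "26,^2", an auditd in Z gets "26", and an auditd
in container 1 gets nothing.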

--
paul moore
http://www.paul-moore.com

2020-03-18 21:03:17

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-13 12:42, Paul Moore wrote:

...

> > The thread has had a lot of starts/stops, so I may be repeating a
> > previous suggestion, but one idea would be to still emit a "death
> > record" when the final task in the audit container ID does die, but
> > block the particular audit container ID from reuse until the
> > SIGNAL2 info has been reported. This gives us the timely ACID death
> > notification while still preventing confusion and ambiguity caused by
> > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > there is a small nit about the ACID being present in the SIGNAL2
> > *after* its death, but I think that can be easily explained and
> > understood by admins.
>
> Thinking quickly about possible technical solutions to this, maybe it
> makes sense to have two counters on a contobj so that we know when the
> last process in that container exits and can issue the death
> certificate, but we still block reuse of it until all further references
> to it have been resolved. This will likely also make it possible to
> report the full contid chain in SIGNAL2 records. This will eliminate
> some of the issues we are discussing with regards to passing a contobj
> vs a contid to the audit_log_contid function, but won't eliminate them
> all because there are still some contids that won't have an object
> associated with them to make it impossible to look them up in the
> contobj lists.

I'm not sure you need a full second counter, I imagine a simple flag
would be okay. I think you just need something to indicate that this
ACID object is marked as "dead" but is still being held for sanity
reasons and should not be reused.
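
One possible reading of the "one counter plus a flag" idea, as a
userspace sketch only: the counter tracks live tasks, and the flag is
set while uncollected signal info still names the container. The field
and function names are hypothetical, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct audit_contobj. */
struct contobj {
	uint64_t id;
	unsigned int tasks;	/* live tasks in the container */
	bool held;		/* uncollected signal info still names this id */
};

/*
 * Called when a task in the container exits.
 * Returns true when the last task is gone, i.e. the death record
 * should be emitted now.
 */
static bool contobj_task_exit(struct contobj *c)
{
	assert(c->tasks > 0);
	return --c->tasks == 0;
}

/* Called once the SIGNAL2 info naming this container has been reported. */
static void contobj_sig_collected(struct contobj *c)
{
	c->held = false;
}

/* The id may be handed out again only when it is dead and no longer held. */
static bool contobj_reusable(const struct contobj *c)
{
	return c->tasks == 0 && !c->held;
}
```

The death record goes out as soon as the last task exits, while reuse
stays blocked until the signal info has been collected.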

--
paul moore
http://www.paul-moore.com

2020-03-18 21:09:53

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 13/16] audit: track container nesting

On Sat, Mar 14, 2020 at 6:42 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-13 12:47, Paul Moore wrote:

...

> > It has been a while since I last looked at the patchset, but my
> > concern over the preferred use of the ACID number vs the ACID object is
> > that the number offers no reuse protection where the object does. I
> > really would like us to use the object everywhere it is possible.
>
> Ok, so I take it from this that I go ahead with the dual format since
> the wrapper function to convert from object to ID strips away object
> information negating any benefit of favouring the object pointer. I'll
> look at the remaining calls that use a contid (rather than contobj) and
> convert all that I can over to storing an object using the dual counters
> that track process exits versus signal2 and trace references.

Well, as I said in the other thread, I'm not sure we need a full two
counters; I think one counter and a simple flag should suffice.
Otherwise that sounds good for the next iteration.

--
paul moore
http://www.paul-moore.com

2020-03-18 21:28:49

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-18 16:56, Paul Moore wrote:
> On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-13 12:29, Paul Moore wrote:
> > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > > another thought crossed my mind while thinking about this issue
> > > > > further ... Once we support multiple auditd instances, including the
> > > > > necessary record routing and duplication/multiple-sends (the host
> > > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > > the audit container ID (ACID) lists we send in the records. The
> > > > > auditd instance running on the host/initns will always see everything,
> > > > > so it will want the full container ACID list; however an auditd
> > > > > instance running inside a container really should only see the ACIDs
> > > > > of any child containers.
> > > >
> > > > Agreed. This should be easy to check and limit, preventing an auditd
> > > > from seeing any contid that is a parent of its own contid.
> > > >
> > > > > For example, imagine a system where the host has containers 1 and 2,
> > > > > each running an auditd instance. Inside container 1 there are
> > > > > containers A and B. Inside container 2 there are containers Y and Z.
> > > > > If an audit event is generated in container Z, I would expect the
> > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > > > > should only see an ACID list of "Z". The auditd running in container
> > > > > 2 should not see the record at all (that will be relatively
> > > > > straightforward). Does that make sense? Do we have the record
> > > > > formats properly designed to handle this without too much problem (I'm
> > > > > not entirely sure we do)?
> > > >
> > > > I completely agree and I believe we have record formats that are able to
> > > > handle this already.
> > >
> > > I'm not convinced we do. What about the cases where we have a field
> > > with a list of audit container IDs? How do we handle that?
> >
> > I don't understand the problem. (I think you crossed your 1/2 vs
> > A/B/Y/Z in your example.) ...
>
> It looks like I did, sorry about that.
>
> > ... Clarifying the example above, if as you
> > suggest an event happens in container Z, the host's auditd would report
> > Z,^2
> > and the auditd in container 2 would report
> > Z,^2
> > but if there were another auditd running in container Z it would report
> > Z
> > while the auditd in container 1 or A/B would see nothing.
>
> Yes. My concern is how do we handle this to minimize duplicating and
> rewriting the records? It isn't so much about the format, although
> the format is a side effect.

Are you talking about caching, or about divulging more information than
necessary or even information leaks? Or even noticing that records that
need to be generated to two audit daemons share the same contid field
values and should be generated at the same time or information shared
between them? I'd see any of these as optimizations that don't affect
the API.

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-18 21:43:31

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-18 17:01, Paul Moore wrote:
> On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-13 12:42, Paul Moore wrote:
>
> ...
>
> > > The thread has had a lot of starts/stops, so I may be repeating a
> > > previous suggestion, but one idea would be to still emit a "death
> > > record" when the final task in the audit container ID does die, but
> > > block the particular audit container ID from reuse until the
> > > SIGNAL2 info has been reported. This gives us the timely ACID death
> > > notification while still preventing confusion and ambiguity caused by
> > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > there is a small nit about the ACID being present in the SIGNAL2
> > > *after* its death, but I think that can be easily explained and
> > > understood by admins.
> >
> > Thinking quickly about possible technical solutions to this, maybe it
> > makes sense to have two counters on a contobj so that we know when the
> > last process in that container exits and can issue the death
> > certificate, but we still block reuse of it until all further references
> > to it have been resolved. This will likely also make it possible to
> > report the full contid chain in SIGNAL2 records. This will eliminate
> > some of the issues we are discussing with regards to passing a contobj
> > vs a contid to the audit_log_contid function, but won't eliminate them
> > all because there are still some contids that won't have an object
> > associated with them to make it impossible to look them up in the
> > contobj lists.
>
> I'm not sure you need a full second counter, I imagine a simple flag
> would be okay. I think you just need something to indicate that this ACID
> object is marked as "dead" but is still being held for sanity reasons
> and should not be reused.

Ok, I see your point. This refcount can be changed to a flag easily
enough without change to the API if we can be sure that more than one
signal can't be delivered to the audit daemon *and* collected by sig2.
I'll have a more careful look at the audit daemon code to see if I can
determine this.

Steve, can you have a look and tell us if it is possible for the audit
daemon to make more than one signal_info (or signal_info2) record
request from the kernel after receiving a signal?


Another question that occurs to me: what if the audit daemon is sent a
signal and it cannot or will not collect the sig2 information from the
kernel (SIGKILL?)? Does that audit container identifier remain dead
until reboot, or do we institute some other form of reaping, possibly
time-based?


> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-18 21:44:40

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-18 16:56, Paul Moore wrote:
> > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-03-13 12:29, Paul Moore wrote:
> > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > > > another thought crossed my mind while thinking about this issue
> > > > > > further ... Once we support multiple auditd instances, including the
> > > > > > necessary record routing and duplication/multiple-sends (the host
> > > > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > > > the audit container ID (ACID) lists we send in the records. The
> > > > > > auditd instance running on the host/initns will always see everything,
> > > > > > so it will want the full container ACID list; however an auditd
> > > > > > instance running inside a container really should only see the ACIDs
> > > > > > of any child containers.
> > > > >
> > > > > Agreed. This should be easy to check and limit, preventing an auditd
> > > > > from seeing any contid that is a parent of its own contid.
> > > > >
> > > > > > For example, imagine a system where the host has containers 1 and 2,
> > > > > > each running an auditd instance. Inside container 1 there are
> > > > > > containers A and B. Inside container 2 there are containers Y and Z.
> > > > > > If an audit event is generated in container Z, I would expect the
> > > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > > > > > should only see an ACID list of "Z". The auditd running in container
> > > > > > 2 should not see the record at all (that will be relatively
> > > > > > straightforward). Does that make sense? Do we have the record
> > > > > > formats properly designed to handle this without too much problem (I'm
> > > > > > not entirely sure we do)?
> > > > >
> > > > > I completely agree and I believe we have record formats that are able to
> > > > > handle this already.
> > > >
> > > > I'm not convinced we do. What about the cases where we have a field
> > > > with a list of audit container IDs? How do we handle that?
> > >
> > > I don't understand the problem. (I think you crossed your 1/2 vs
> > > A/B/Y/Z in your example.) ...
> >
> > It looks like I did, sorry about that.
> >
> > > ... Clarifying the example above, if as you
> > > suggest an event happens in container Z, the host's auditd would report
> > > Z,^2
> > > and the auditd in container 2 would report
> > > Z,^2
> > > but if there were another auditd running in container Z it would report
> > > Z
> > > while the auditd in container 1 or A/B would see nothing.
> >
> > Yes. My concern is how do we handle this to minimize duplicating and
> > rewriting the records? It isn't so much about the format, although
> > the format is a side effect.
>
> Are you talking about caching, or about divulging more information than
> necessary or even information leaks? Or even noticing that records that
> need to be generated to two audit daemons share the same contid field
> values and should be generated at the same time or information shared
> between them? I'd see any of these as optimizations that don't affect
> the api.

Imagine a record is generated in a container which has more than one
auditd in its ancestry that should receive this record, how do we
handle that without completely killing performance? That's my
concern. If you've already thought up a plan for this - excellent,
please share :)

--
paul moore
http://www.paul-moore.com

2020-03-18 21:51:12

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-18 17:01, Paul Moore wrote:
> > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-03-13 12:42, Paul Moore wrote:
> >
> > ...
> >
> > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > previous suggestion, but one idea would be to still emit a "death
> > > > record" when the final task in the audit container ID does die, but
> > > > block the particular audit container ID from reuse until the
> > > > SIGNAL2 info has been reported. This gives us the timely ACID death
> > > > notification while still preventing confusion and ambiguity caused by
> > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > *after* its death, but I think that can be easily explained and
> > > > understood by admins.
> > >
> > > Thinking quickly about possible technical solutions to this, maybe it
> > > makes sense to have two counters on a contobj so that we know when the
> > > last process in that container exits and can issue the death
> > > certificate, but we still block reuse of it until all further references
> > > to it have been resolved. This will likely also make it possible to
> > > report the full contid chain in SIGNAL2 records. This will eliminate
> > > some of the issues we are discussing with regards to passing a contobj
> > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > all because there are still some contids that won't have an object
> > > associated with them to make it impossible to look them up in the
> > > contobj lists.
> >
> > I'm not sure you need a full second counter, I imagine a simple flag
> > would be okay. I think you just need something to indicate that this ACID
> > object is marked as "dead" but is still being held for sanity reasons
> > and should not be reused.
>
> Ok, I see your point. This refcount can be changed to a flag easily
> enough without change to the api if we can be sure that more than one
> signal can't be delivered to the audit daemon *and* collected by sig2.
> I'll have a more careful look at the audit daemon code to see if I can
> determine this.

Maybe I'm not understanding your concern, but this isn't really
different than any of the other things we track for the auditd signal
sender, right? If we are worried about multiple signals being sent
then it applies to everything, not just the audit container ID.

> Another question that occurs to me: what if the audit daemon is sent a
> signal and it cannot or will not collect the sig2 information from the
> kernel (SIGKILL?)? Does that audit container identifier remain dead
> until reboot, or do we institute some other form of reaping, possibly
> time-based?

In order to preserve the integrity of the audit log that ACID value
would need to remain unavailable until the ACID which contains the
associated auditd is "dead" (no one can request the signal sender's
info if that container is dead).

--
paul moore
http://www.paul-moore.com

2020-03-18 21:57:07

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-18 17:42, Paul Moore wrote:
> On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-18 16:56, Paul Moore wrote:
> > > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-03-13 12:29, Paul Moore wrote:
> > > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > > > > another thought crossed my mind while thinking about this issue
> > > > > > > further ... Once we support multiple auditd instances, including the
> > > > > > > necessary record routing and duplication/multiple-sends (the host
> > > > > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > > > > the audit container ID (ACID) lists we send in the records. The
> > > > > > > auditd instance running on the host/initns will always see everything,
> > > > > > > so it will want the full container ACID list; however an auditd
> > > > > > > instance running inside a container really should only see the ACIDs
> > > > > > > of any child containers.
> > > > > >
> > > > > > Agreed. This should be easy to check and limit, preventing an auditd
> > > > > > from seeing any contid that is a parent of its own contid.
> > > > > >
> > > > > > > For example, imagine a system where the host has containers 1 and 2,
> > > > > > > each running an auditd instance. Inside container 1 there are
> > > > > > > containers A and B. Inside container 2 there are containers Y and Z.
> > > > > > > If an audit event is generated in container Z, I would expect the
> > > > > > > host's auditd to see an ACID list of "1,Z" but container 1's auditd
> > > > > > > should only see an ACID list of "Z". The auditd running in container
> > > > > > > 2 should not see the record at all (that will be relatively
> > > > > > > straightforward). Does that make sense? Do we have the record
> > > > > > > formats properly designed to handle this without too much problem (I'm
> > > > > > > not entirely sure we do)?
> > > > > >
> > > > > > I completely agree and I believe we have record formats that are able to
> > > > > > handle this already.
> > > > >
> > > > > I'm not convinced we do. What about the cases where we have a field
> > > > > with a list of audit container IDs? How do we handle that?
> > > >
> > > > I don't understand the problem. (I think you crossed your 1/2 vs
> > > > A/B/Y/Z in your example.) ...
> > >
> > > It looks like I did, sorry about that.
> > >
> > > > ... Clarifying the example above, if as you
> > > > suggest an event happens in container Z, the host's auditd would report
> > > > Z,^2
> > > > and the auditd in container 2 would report
> > > > Z,^2
> > > > but if there were another auditd running in container Z it would report
> > > > Z
> > > > while the auditd in container 1 or A/B would see nothing.
> > >
> > > Yes. My concern is how do we handle this to minimize duplicating and
> > > rewriting the records? It isn't so much about the format, although
> > > the format is a side effect.
> >
> > Are you talking about caching, or about divulging more information than
> > necessary or even information leaks? Or even noticing that records that
> > need to be generated to two audit daemons share the same contid field
> > values and should be generated at the same time or information shared
> > between them? I'd see any of these as optimizations that don't affect
> > the api.
>
> Imagine a record is generated in a container which has more than one
> auditd in its ancestry that should receive this record, how do we
> handle that without completely killing performance? That's my
> concern. If you've already thought up a plan for this - excellent,
> please share :)

No, I haven't given that much thought other than the correctness and
security issues of making sure that each audit daemon is sufficiently
isolated to do its job but not jeopardize another audit domain. Audit
already kills performance, according to some...

We currently won't have that problem since there can only be one so far.
Fixing and optimizing this is part of the next phase of the challenge of
adding a second audit daemon.

Let's work on correctness and reasonable efficiency for this phase and
not focus on a problem we don't yet have. I wouldn't consider this
incurring technical debt at this point.

I could see caching a contid string from one starting point, but it may
be more work to search that cached string to truncate it or add to it
when another audit daemon requests a copy of a similar string. I
suppose every full contid string could be generated the first time it is
used and parts of it used (start/finish) as needed but that
search/indexing may not be worth it.

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-18 22:07:19

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Wed, Mar 18, 2020 at 5:56 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-18 17:42, Paul Moore wrote:
> > On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-03-18 16:56, Paul Moore wrote:
> > > > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2020-03-13 12:29, Paul Moore wrote:
> > > > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > > > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > > > > > another thought crossed my mind while thinking about this issue
> > > > > > > > further ... Once we support multiple auditd instances, including the
> > > > > > > > necessary record routing and duplication/multiple-sends (the host
> > > > > > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > > > > > the audit container ID (ACID) lists we send in the records. The
> > > > > > > > auditd instance running on the host/initns will always see everything,
> > > > > > > > so it will want the full container ACID list; however an auditd
> > > > > > > > instance running inside a container really should only see the ACIDs
> > > > > > > > of any child containers.
> > > > > > >
> > > > > > > Agreed. This should be easy to check and limit, preventing an auditd
> > > > > > > from seeing any contid that is a parent of its own contid.
> > > > > > >
> > > > > > > > For example, imagine a system where the host has containers 1 and 2,
> > > > > > > > each running an auditd instance. Inside container 1 there are
> > > > > > > > containers A and B. Inside container 2 there are containers Y and Z.
> > > > > > > > If an audit event is generated in container Z, I would expect the
> > > > > > > > host's auditd to see an ACID list of "1,Z" but container 1's auditd
> > > > > > > > should only see an ACID list of "Z". The auditd running in container
> > > > > > > > 2 should not see the record at all (that will be relatively
> > > > > > > > straightforward). Does that make sense? Do we have the record
> > > > > > > > formats properly designed to handle this without too much problem (I'm
> > > > > > > > not entirely sure we do)?
> > > > > > >
> > > > > > > I completely agree and I believe we have record formats that are able to
> > > > > > > handle this already.
> > > > > >
> > > > > > I'm not convinced we do. What about the cases where we have a field
> > > > > > with a list of audit container IDs? How do we handle that?
> > > > >
> > > > > I don't understand the problem. (I think you crossed your 1/2 vs
> > > > > A/B/Y/Z in your example.) ...
> > > >
> > > > It looks like I did, sorry about that.
> > > >
> > > > > ... Clarifying the example above, if as you
> > > > > suggest an event happens in container Z, the host's auditd would report
> > > > > Z,^2
> > > > > and the auditd in container 2 would report
> > > > > Z,^2
> > > > > but if there were another auditd running in container Z it would report
> > > > > Z
> > > > > while the auditd in container 1 or A/B would see nothing.
> > > >
> > > > Yes. My concern is how do we handle this to minimize duplicating and
> > > > rewriting the records? It isn't so much about the format, although
> > > > the format is a side effect.
> > >
> > > Are you talking about caching, or about divulging more information than
> > > necessary or even information leaks? Or even noticing that records that
> > > need to be generated to two audit daemons share the same contid field
> > > values and should be generated at the same time or information shared
> > > between them? I'd see any of these as optimizations that don't affect
> > > the api.
> >
> > Imagine a record is generated in a container which has more than one
> > auditd in its ancestry that should receive this record, how do we
> > handle that without completely killing performance? That's my
> > concern. If you've already thought up a plan for this - excellent,
> > please share :)
>
> No, I haven't given that much thought other than the correctness and
> security issues of making sure that each audit daemon is sufficiently
> isolated to do its job but not jeopardize another audit domain. Audit
> already kills performance, according to some...
>
> We currently won't have that problem since there can only be one so far.
> Fixing and optimizing this is part of the next phase of the challenge of
> adding a second audit daemon.
>
> Let's work on correctness and reasonable efficiency for this phase and
> not focus on a problem we don't yet have. I wouldn't consider this
> incurring technical debt at this point.

I agree, one stage at a time, but the choice we make here is going to
have a significant impact on what we can do later. We need to get
this as "right" as possible; this isn't something we should dismiss
with a hand-wave as a problem for the next stage. We don't need an
implementation, but I would like to see a rough design of how we would
address this problem.

> I could see caching a contid string from one starting point, but it may
> be more work to search that cached string to truncate it or add to it
> when another audit daemon requests a copy of a similar string. I
> suppose every full contid string could be generated the first time it is
> used and parts of it used (start/finish) as needed but that
> search/indexing may not be worth it.

I hope we can do better than string manipulations in the kernel. I'd
much rather defer generating the ACID list (if possible), than
generating a list only to keep copying and editing it as the record is
sent.

--
paul moore
http://www.paul-moore.com

2020-03-19 22:24:08

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-18 17:47, Paul Moore wrote:
> On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-18 17:01, Paul Moore wrote:
> > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-03-13 12:42, Paul Moore wrote:
> > >
> > > ...
> > >
> > > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > > previous suggestion, but one idea would be to still emit a "death
> > > > > record" when the final task in the audit container ID does die, but
> > > > > block the particular audit container ID from reuse until the
> > > > > SIGNAL2 info has been reported. This gives us the timely ACID death
> > > > > notification while still preventing confusion and ambiguity caused by
> > > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > > *after* its death, but I think that can be easily explained and
> > > > > understood by admins.
> > > >
> > > > Thinking quickly about possible technical solutions to this, maybe it
> > > > makes sense to have two counters on a contobj so that we know when the
> > > > last process in that container exits and can issue the death
> > > > certificate, but we still block reuse of it until all further references
> > > > to it have been resolved. This will likely also make it possible to
> > > > report the full contid chain in SIGNAL2 records. This will eliminate
> > > > some of the issues we are discussing with regards to passing a contobj
> > > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > > all because there are still some contids that won't have an object
> > > > associated with them to make it impossible to look them up in the
> > > > contobj lists.
> > >
> > > I'm not sure you need a full second counter, I imagine a simple flag
> > > would be okay. I think you just need something to indicate that this ACID
> > > object is marked as "dead" but is still being held for sanity reasons
> > > and should not be reused.
> >
> > Ok, I see your point. This refcount can be changed to a flag easily
> > enough without change to the api if we can be sure that more than one
> > signal can't be delivered to the audit daemon *and* collected by sig2.
> > I'll have a more careful look at the audit daemon code to see if I can
> > determine this.
>
> Maybe I'm not understanding your concern, but this isn't really
> different than any of the other things we track for the auditd signal
> sender, right? If we are worried about multiple signals being sent
> then it applies to everything, not just the audit container ID.

Yes, you are right. In all other cases the information is simply
overwritten. In the case of the audit container identifier any
previous value is put before a new one is referenced, so only the last
signal is kept. So, we only need a flag. Does a flag implemented with
an RCU-protected refcount sound reasonable to you?

> > Another question occurs to me is that what if the audit daemon is sent a
> > signal and it cannot or will not collect the sig2 information from the
> > kernel (SIGKILL?)? Does that audit container identifier remain dead
> > until reboot, or do we institute some other form of reaping, possibly
> > time-based?
>
> In order to preserve the integrity of the audit log that ACID value
> would need to remain unavailable until the ACID which contains the
> associated auditd is "dead" (no one can request the signal sender's
> info if that container is dead).

I don't understand why it would be associated with the contid of the
audit daemon process rather than with the audit daemon process itself.
How does the signal collection somehow get transferred or delegated to
another member of that audit daemon's container?

Thinking aloud here, the audit daemon's exit when it calls audit_free()
needs to ..._put_sig and cancel that audit_sig_cid (which in the future
will be allocated per auditd rather than the global it is now since
there is only one audit daemon).

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-19 22:28:22

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-18 18:06, Paul Moore wrote:
> On Wed, Mar 18, 2020 at 5:56 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-18 17:42, Paul Moore wrote:
> > > On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-03-18 16:56, Paul Moore wrote:
> > > > > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > On 2020-03-13 12:29, Paul Moore wrote:
> > > > > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > > > > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > > > > > > another thought crossed my mind while thinking about this issue
> > > > > > > > > further ... Once we support multiple auditd instances, including the
> > > > > > > > > necessary record routing and duplication/multiple-sends (the host
> > > > > > > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > > > > > > the audit container ID (ACID) lists we send in the records. The
> > > > > > > > > auditd instance running on the host/initns will always see everything,
> > > > > > > > > so it will want the full container ACID list; however an auditd
> > > > > > > > > instance running inside a container really should only see the ACIDs
> > > > > > > > > of any child containers.
> > > > > > > >
> > > > > > > > Agreed. This should be easy to check and limit, preventing an auditd
> > > > > > > > from seeing any contid that is a parent of its own contid.
> > > > > > > >
> > > > > > > > > For example, imagine a system where the host has containers 1 and 2,
> > > > > > > > > each running an auditd instance. Inside container 1 there are
> > > > > > > > > containers A and B. Inside container 2 there are containers Y and Z.
> > > > > > > > > If an audit event is generated in container Z, I would expect the
> > > > > > > > > host's auditd to see an ACID list of "1,Z" but container 1's auditd
> > > > > > > > > should only see an ACID list of "Z". The auditd running in container
> > > > > > > > > 2 should not see the record at all (that will be relatively
> > > > > > > > > straightforward). Does that make sense? Do we have the record
> > > > > > > > > formats properly designed to handle this without too much problem (I'm
> > > > > > > > > not entirely sure we do)?
> > > > > > > >
> > > > > > > > I completely agree and I believe we have record formats that are able to
> > > > > > > > handle this already.
> > > > > > >
> > > > > > > I'm not convinced we do. What about the cases where we have a field
> > > > > > > with a list of audit container IDs? How do we handle that?
> > > > > >
> > > > > > I don't understand the problem. (I think you crossed your 1/2 vs
> > > > > > A/B/Y/Z in your example.) ...
> > > > >
> > > > > It looks like I did, sorry about that.
> > > > >
> > > > > > ... Clarifying the example above, if as you
> > > > > > suggest an event happens in container Z, the host's auditd would report
> > > > > > Z,^2
> > > > > > and the auditd in container 2 would report
> > > > > > Z,^2
> > > > > > but if there were another auditd running in container Z it would report
> > > > > > Z
> > > > > > while the auditd in container 1 or A/B would see nothing.
> > > > >
> > > > > Yes. My concern is how do we handle this to minimize duplicating and
> > > > > rewriting the records? It isn't so much about the format, although
> > > > > the format is a side effect.
> > > >
> > > > Are you talking about caching, or about divulging more information than
> > > > necessary or even information leaks? Or even noticing that records that
> > > > need to be generated to two audit daemons share the same contid field
> > > > values and should be generated at the same time or information shared
> > > > between them? I'd see any of these as optimizations that don't affect
> > > > the api.
> > >
> > > Imagine a record is generated in a container which has more than one
> > > auditd in its ancestry that should receive this record, how do we
> > > handle that without completely killing performance? That's my
> > > concern. If you've already thought up a plan for this - excellent,
> > > please share :)
> >
> > No, I haven't given that much thought other than the correctness and
> > security issues of making sure that each audit daemon is sufficiently
> > isolated to do its job but not jeopardize another audit domain. Audit
> > already kills performance, according to some...
> >
> > We currently won't have that problem since there can only be one so far.
> > Fixing and optimizing this is part of the next phase of the challenge of
> > adding a second audit daemon.
> >
> > Let's work on correctness and reasonable efficiency for this phase and
> > not focus on a problem we don't yet have. I wouldn't consider this
> > incurring technical debt at this point.
>
> I agree, one stage at a time, but the choice we make here is going to
> have a significant impact on what we can do later. We need to get
> this as "right" as possible; this isn't something we should dismiss
> with a hand-wave as a problem for the next stage. We don't need an
> implementation, but I would like to see a rough design of how we would
> address this problem.
>
> > I could see caching a contid string from one starting point, but it may
> > be more work to search that cached string to truncate it or add to it
> > when another audit daemon requests a copy of a similar string. I
> > suppose every full contid string could be generated the first time it is
> > used and parts of it used (start/finish) as needed but that
> > search/indexing may not be worth it.
>
> I hope we can do better than string manipulations in the kernel. I'd
> much rather defer generating the ACID list (if possible), than
> generating a list only to keep copying and editing it as the record is
> sent.

At the moment we are stuck with a string-only format. The contid list
only exists in the kernel. When do you suggest generating the contid
list? It sounds like you are hinting at userspace generating that list
from multiple records over the span of audit logs since boot of the
machine.

Even if we had a binary format, the current design would require
generating that list at the time of record generation since it could be
any contiguous subset of a full nested contid list.
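
To make the "contiguous subset" point concrete, here is a minimal userspace
sketch in C. It is not code from the patchset. It assumes a hypothetical
`struct contobj` reduced to an id and a parent pointer, a hypothetical helper
`contid_list_for()`, and the "Z,^2" nesting notation used earlier in this
thread. A NULL observer stands for the auditd on the host.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a contobj; only the fields needed here. */
struct contobj {
	uint64_t id;                  /* audit container identifier */
	const struct contobj *parent; /* NULL when the parent is the host */
};

/*
 * Format the contid list that one auditd may see for an event.
 * event:    container in which the event occurred
 * observer: container holding the receiving auditd, NULL for the host
 * Returns the string length, or -1 if the record must not be delivered
 * (or if the buffer is too small).
 */
static int contid_list_for(const struct contobj *event,
			   const struct contobj *observer,
			   char *buf, size_t len)
{
	const struct contobj *c;
	size_t off = 0;
	int n;

	assert(event && buf && len > 0);

	/* An auditd only sees events from its own container or a descendant. */
	if (observer) {
		for (c = event; c && c != observer; c = c->parent)
			;
		if (!c)
			return -1;
	}

	buf[0] = '\0';
	for (c = event; c; c = c->parent) {
		/* Nested entries are prefixed with '^', as in "Z,^2". */
		n = snprintf(buf + off, len - off, "%s%llu",
			     c == event ? "" : ",^",
			     (unsigned long long)c->id);
		if (n < 0 || (size_t)n >= len - off)
			return -1;	/* truncated: treat as an error */
		off += (size_t)n;
		if (c == observer)
			break;		/* never expose the observer's ancestors */
	}
	return (int)off;
}
```

With container 26 nested in container 2, the host and the auditd in
container 2 would both get "26,^2", an auditd in container 26 would get "26",
and an auditd in an unrelated container would get nothing. The walk is
repeated for each receiving auditd, which is the cost being debated above.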

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-20 21:57:50

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-18 17:47, Paul Moore wrote:
> > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-03-18 17:01, Paul Moore wrote:
> > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2020-03-13 12:42, Paul Moore wrote:
> > > >
> > > > ...
> > > >
> > > > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > > > previous suggestion, but one idea would be to still emit a "death
> > > > > > record" when the final task in the audit container ID does die, but
> > > > > > block the particular audit container ID from reuse until the
> > > > > > SIGNAL2 info has been reported. This gives us the timely ACID death
> > > > > > notification while still preventing confusion and ambiguity caused by
> > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > > > *after* its death, but I think that can be easily explained and
> > > > > > understood by admins.
> > > > >
> > > > > Thinking quickly about possible technical solutions to this, maybe it
> > > > > makes sense to have two counters on a contobj so that we know when the
> > > > > last process in that container exits and can issue the death
> > > > > certificate, but we still block reuse of it until all further references
> > > > > to it have been resolved. This will likely also make it possible to
> > > > > report the full contid chain in SIGNAL2 records. This will eliminate
> > > > > some of the issues we are discussing with regards to passing a contobj
> > > > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > > > all because there are still some contids that won't have an object
> > > > > associated with them to make it impossible to look them up in the
> > > > > contobj lists.
> > > >
> > > > I'm not sure you need a full second counter, I imagine a simple flag
> > > > would be okay. I think you just need something to indicate that this ACID
> > > > object is marked as "dead" but is still being held for sanity reasons
> > > > and should not be reused.
> > >
> > > Ok, I see your point. This refcount can be changed to a flag easily
> > > enough without change to the api if we can be sure that more than one
> > > signal can't be delivered to the audit daemon *and* collected by sig2.
> > > I'll have a more careful look at the audit daemon code to see if I can
> > > determine this.
> >
> > Maybe I'm not understanding your concern, but this isn't really
> > different than any of the other things we track for the auditd signal
> > sender, right? If we are worried about multiple signals being sent
> > then it applies to everything, not just the audit container ID.
>
> Yes, you are right. In all other cases the information is simply
> overwritten. In the case of the audit container identifier any
> previous value is put before a new one is referenced, so only the last
> signal is kept. So, we only need a flag. Does a flag implemented with
> an RCU-protected refcount sound reasonable to you?

Well, if I recall correctly you still need to fix the locking in this
patchset so until we see what that looks like it is hard to say for
certain. Just make sure that the flag is somehow protected from
races; it is probably a lot like the "valid" flags you sometimes see
with RCU protected lists.
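
As a rough illustration of the "dead but still held" idea, here is a
single-threaded userspace sketch in C. It is not taken from the patchset:
the table, the `sig_held` flag and every function name are hypothetical, and
the locking that the paragraph above asks for is deliberately left out. In
the kernel the flag and the count would have to be read and written under
whatever lock or RCU scheme protects the contobj list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONT 8

/* Hypothetical, simplified contobj; locking is deliberately left out. */
struct contobj {
	uint64_t id;
	unsigned int tasks;	/* live tasks using this contid */
	bool sig_held;		/* still referenced by unreported signal info */
	bool in_use;		/* table slot is occupied */
};

static struct contobj table[MAX_CONT];

static struct contobj *contobj_find(uint64_t id)
{
	for (size_t i = 0; i < MAX_CONT; i++)
		if (table[i].in_use && table[i].id == id)
			return &table[i];
	return NULL;
}

/* Lookup for normal use: a dead-but-held entry is treated as not valid. */
static struct contobj *contobj_lookup_live(uint64_t id)
{
	struct contobj *c = contobj_find(id);

	return (c && c->tasks > 0) ? c : NULL;
}

/* Register a contid for a first task; refused while any entry remains. */
static struct contobj *contid_register(uint64_t id)
{
	if (contobj_find(id))
		return NULL;
	for (size_t i = 0; i < MAX_CONT; i++) {
		if (!table[i].in_use) {
			table[i] = (struct contobj){ .id = id, .tasks = 1,
						     .in_use = true };
			return &table[i];
		}
	}
	return NULL;	/* table full */
}

/* Free the slot once the entry is both dead and no longer held. */
static void contobj_reap(struct contobj *c)
{
	if (c->tasks == 0 && !c->sig_held)
		c->in_use = false;
}

/* Returns true when the last task left, i.e. when the death record is due. */
static bool contobj_task_exit(struct contobj *c)
{
	assert(c->tasks > 0);
	c->tasks--;
	if (c->tasks > 0)
		return false;
	contobj_reap(c);
	return true;
}

static void contobj_sig_hold(struct contobj *c)
{
	c->sig_held = true;
}

static void contobj_sig_release(struct contobj *c)
{
	c->sig_held = false;
	contobj_reap(c);
}
```

The death record can be emitted as soon as `contobj_task_exit()` returns
true, while `contid_register()` keeps refusing the same id until the signal
information has been released.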

> > > Another question occurs to me is that what if the audit daemon is sent a
> > > signal and it cannot or will not collect the sig2 information from the
> > > kernel (SIGKILL?)? Does that audit container identifier remain dead
> > > until reboot, or do we institute some other form of reaping, possibly
> > > time-based?
> >
> > In order to preserve the integrity of the audit log that ACID value
> > would need to remain unavailable until the ACID which contains the
> > associated auditd is "dead" (no one can request the signal sender's
> > info if that container is dead).
>
> I don't understand why it would be associated with the contid of the
> audit daemon process rather than with the audit daemon process itself.
> How does the signal collection somehow get transferred or delegated to
> another member of that audit daemon's container?

Presumably once we support multiple audit daemons we will need a
struct to contain the associated connection state, with at most one
struct (and one auditd) allowed for a given ACID. I would expect that
the signal sender info would be part of that state included in that
struct. If a task sent a signal to its associated auditd, and no one
ever queried the signal information stored in the per-ACID state
struct, I would expect that the refcount/flag/whatever would remain
held for the signal sender's ACID until the auditd state's ACID died
(the struct would be reaped as part of the ACID death). In cases
where the container orchestrator blocks sending signals across ACID
boundaries this really isn't an issue as it will all be the same ACID,
but since we don't want to impose any restrictions on what a container
*could* be it is important to make sure we handle the case where the
signal sender's ACID may be different from the associated auditd's
ACID.
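
A minimal userspace sketch of that per-ACID state, in C, might look like the
following. Everything here is an assumption made for illustration:
`struct auditd_state`, its fields and the three helpers are hypothetical
names, and the refcount is a plain integer with no locking. The sketch takes
the new sender reference before dropping the old one so that a repeated
signal from the same sender cannot momentarily drop the count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified contobj. */
struct contobj {
	uint64_t id;
	unsigned int refcount;	/* the contid may be reused only at zero */
};

/* Hypothetical per-ACID connection state, at most one per container. */
struct auditd_state {
	struct contobj *owner;		/* ACID that contains this auditd */
	struct contobj *sig_sender;	/* ACID of the last signal sender */
};

static struct contobj *contobj_get(struct contobj *c)
{
	if (c)
		c->refcount++;
	return c;
}

static void contobj_put(struct contobj *c)
{
	if (c) {
		assert(c->refcount > 0);
		c->refcount--;
	}
}

/* A signal reached this auditd: remember only the latest sender. */
static void auditd_sig_record(struct auditd_state *s, struct contobj *sender)
{
	struct contobj *old = s->sig_sender;

	s->sig_sender = contobj_get(sender);	/* new reference first */
	contobj_put(old);			/* then drop the previous one */
}

/* The auditd asked for the signal info: report it and release the sender. */
static int auditd_sig_collect(struct auditd_state *s, uint64_t *id)
{
	if (!s->sig_sender)
		return -1;
	*id = s->sig_sender->id;
	contobj_put(s->sig_sender);
	s->sig_sender = NULL;
	return 0;
}

/* The auditd's own ACID died: nobody can collect any more, so let go. */
static void auditd_state_reap(struct auditd_state *s)
{
	contobj_put(s->sig_sender);
	s->sig_sender = NULL;
}
```

In this model a sender's ACID stays unavailable for reuse for at most as
long as the auditd's own ACID lives, which also covers the SIGKILL case
raised earlier.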

> Thinking aloud here, the audit daemon's exit when it calls audit_free()
> needs to ..._put_sig and cancel that audit_sig_cid (which in the future
> will be allocated per auditd rather than the global it is now since
> there is only one audit daemon).
>
> > paul moore
>
> - RGB
>
> --
> Richard Guy Briggs <[email protected]>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635

--
paul moore
http://www.paul-moore.com

2020-03-24 00:17:23

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-18 18:06, Paul Moore wrote:

...

> > I hope we can do better than string manipulations in the kernel. I'd
> > much rather defer generating the ACID list (if possible), than
> > generating a list only to keep copying and editing it as the record is
> > sent.
>
> At the moment we are stuck with a string-only format.

Yes, we are. That is another topic, and another set of changes I've
been deferring so as to not disrupt the audit container ID work.

I was thinking of what we do inside the kernel between when the record
triggering event happens and when we actually emit the record to
userspace. Perhaps we collect the ACID information while the event is
occurring, but we defer generating the record until later when we have
a better understanding of what should be included in the ACID list.
It is somewhat similar (but obviously different) to what we do for
PATH records (we collect the pathname info when the path is being
resolved).

--
paul moore
http://www.paul-moore.com

2020-03-24 21:03:44

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-23 20:16, Paul Moore wrote:
> On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-18 18:06, Paul Moore wrote:
>
> ...
>
> > > I hope we can do better than string manipulations in the kernel. I'd
> > > much rather defer generating the ACID list (if possible), than
> > > generating a list only to keep copying and editing it as the record is
> > > sent.
> >
> > At the moment we are stuck with a string-only format.
>
> Yes, we are. That is another topic, and another set of changes I've
> been deferring so as to not disrupt the audit container ID work.
>
> I was thinking of what we do inside the kernel between when the record
> triggering event happens and when we actually emit the record to
> userspace. Perhaps we collect the ACID information while the event is
> occurring, but we defer generating the record until later when we have
> a better understanding of what should be included in the ACID list.
> It is somewhat similar (but obviously different) to what we do for
> PATH records (we collect the pathname info when the path is being
> resolved).

Ok, now I understand your concern.

In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
only other possible record and they are generated at the same time with
a local context.

In the case of any event involving a syscall, that CONTAINER_ID record
is generated at the time of the rest of the event record generation at
syscall exit.

The others are only generated when needed, such as the sig2 reply.

We generally just store the contobj pointer until we actually generate
the CONTAINER_ID (or CONTAINER_OP) record.
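
The pattern of holding a pointer and formatting late can be sketched in
userspace C as below. This is an illustration only: `struct audit_ctx_sketch`
and both helpers are hypothetical, taking a reference at collection time is
an assumption, and the "contid=" output is a simplified stand-in rather than
the real record layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified contobj. */
struct contobj {
	uint64_t id;
	unsigned int refcount;
};

/* Hypothetical per-event context: nothing is formatted at collection time. */
struct audit_ctx_sketch {
	struct contobj *cont;
};

/* Called while the event is happening: only pin and remember the object. */
static void ctx_collect(struct audit_ctx_sketch *ctx, struct contobj *c)
{
	assert(ctx->cont == NULL);
	c->refcount++;
	ctx->cont = c;
}

/*
 * Called when the record is finally generated (e.g. at syscall exit).
 * Returns the string length, 0 if nothing was collected, -1 on truncation.
 */
static int ctx_emit(struct audit_ctx_sketch *ctx, char *buf, size_t len)
{
	int n;

	if (!ctx->cont)
		return 0;
	n = snprintf(buf, len, "contid=%llu",
		     (unsigned long long)ctx->cont->id);
	ctx->cont->refcount--;
	ctx->cont = NULL;
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Because only the pointer is kept, the decision about how much of the contid
chain to print can be left until the record is generated.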

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-25 12:30:11

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-20 17:56, Paul Moore wrote:
> On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-18 17:47, Paul Moore wrote:
> > > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-03-18 17:01, Paul Moore wrote:
> > > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > On 2020-03-13 12:42, Paul Moore wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > > > > previous suggestion, but one idea would be to still emit a "death
> > > > > > > record" when the final task in the audit container ID does die, but
> > > > > > > block the particular audit container ID from reuse until the
> > > > > > > SIGNAL2 info has been reported. This gives us the timely ACID death
> > > > > > > notification while still preventing confusion and ambiguity caused by
> > > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > > > > *after* its death, but I think that can be easily explained and
> > > > > > > understood by admins.
> > > > > >
> > > > > > Thinking quickly about possible technical solutions to this, maybe it
> > > > > > makes sense to have two counters on a contobj so that we know when the
> > > > > > last process in that container exits and can issue the death
> > > > > > certificate, but we still block reuse of it until all further references
> > > > > > to it have been resolved. This will likely also make it possible to
> > > > > > report the full contid chain in SIGNAL2 records. This will eliminate
> > > > > > some of the issues we are discussing with regards to passing a contobj
> > > > > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > > > > all because there are still some contids that won't have an object
> > > > > > associated with them to make it impossible to look them up in the
> > > > > > contobj lists.
> > > > >
> > > > > I'm not sure you need a full second counter, I imagine a simple flag
> > > > > would be okay. I think you just need something to indicate that this ACID
> > > > > object is marked as "dead" but is still being held for sanity reasons
> > > > > and should not be reused.
> > > >
> > > > Ok, I see your point. This refcount can be changed to a flag easily
> > > > enough without change to the api if we can be sure that more than one
> > > > signal can't be delivered to the audit daemon *and* collected by sig2.
> > > > I'll have a more careful look at the audit daemon code to see if I can
> > > > determine this.
> > >
> > > Maybe I'm not understanding your concern, but this isn't really
> > > different than any of the other things we track for the auditd signal
> > > sender, right? If we are worried about multiple signals being sent
> > > then it applies to everything, not just the audit container ID.
> >
> > Yes, you are right. In all other cases the information is simply
> > overwritten. In the case of the audit container identifier any
> > previous value is put before a new one is referenced, so only the last
> > signal is kept. So, we only need a flag. Does a flag implemented with
> > a rcu-protected refcount sound reasonable to you?
>
> Well, if I recall correctly you still need to fix the locking in this
> patchset so until we see what that looks like it is hard to say for
> certain. Just make sure that the flag is somehow protected from
> races; it is probably a lot like the "valid" flags you sometimes see
> with RCU protected lists.

This is like looking for a needle in a haystack. Can you point me to
some code that does "valid" flags with RCU protected lists?

> > > > Another question occurs to me is that what if the audit daemon is sent a
> > > > signal and it cannot or will not collect the sig2 information from the
> > > > kernel (SIGKILL?)? Does that audit container identifier remain dead
> > > > until reboot, or do we institute some other form of reaping, possibly
> > > > time-based?
> > >
> > > In order to preserve the integrity of the audit log that ACID value
> > > would need to remain unavailable until the ACID which contains the
> > > associated auditd is "dead" (no one can request the signal sender's
> > > info if that container is dead).
> >
> > I don't understand why it would be associated with the contid of the
> > audit daemon process rather than with the audit daemon process itself.
> > How does the signal collection somehow get transferred or delegated to
> > another member of that audit daemon's container?
>
> Presumably once we support multiple audit daemons we will need a
> struct to contain the associated connection state, with at most one
> struct (and one auditd) allowed for a given ACID. I would expect that
> the signal sender info would be part of that state included in that
> struct. If a task sent a signal to its associated auditd, and no one
> ever queried the signal information stored in the per-ACID state
> struct, I would expect that the refcount/flag/whatever would remain
> held for the signal sender's ACID until the auditd state's ACID died
> (the struct would be reaped as part of the ACID death). In cases
> where the container orchestrator blocks sending signals across ACID
> boundaries this really isn't an issue as it will all be the same ACID,
> but since we don't want to impose any restrictions on what a container
> *could* be it is important to make sure we handle the case where the
> signal sender's ACID may be different from the associated auditd's
> ACID.
>
> > Thinking aloud here, the audit daemon's exit when it calls audit_free()
> > needs to ..._put_sig and cancel that audit_sig_cid (which in the future
> > will be allocated per auditd rather than the global it is now since
> > there is only one audit daemon).
> >
> > > paul moore
> >
> > - RGB
>
> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-29 03:18:16

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Wed, Mar 25, 2020 at 8:29 AM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-20 17:56, Paul Moore wrote:
> > On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-03-18 17:47, Paul Moore wrote:
> > > > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2020-03-18 17:01, Paul Moore wrote:
> > > > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > On 2020-03-13 12:42, Paul Moore wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > > > > > previous suggestion, but one idea would be to still emit a "death
> > > > > > > > record" when the final task in the audit container ID does die, but
> > > > > > > > block the particular audit container ID from reuse until the
> > > > > > > > SIGNAL2 info has been reported. This gives us the timely ACID death
> > > > > > > > notification while still preventing confusion and ambiguity caused by
> > > > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > > > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > > > > > *after* its death, but I think that can be easily explained and
> > > > > > > > understood by admins.
> > > > > > >
> > > > > > > Thinking quickly about possible technical solutions to this, maybe it
> > > > > > > makes sense to have two counters on a contobj so that we know when the
> > > > > > > last process in that container exits and can issue the death
> > > > > > > certificate, but we still block reuse of it until all further references
> > > > > > > to it have been resolved. This will likely also make it possible to
> > > > > > > report the full contid chain in SIGNAL2 records. This will eliminate
> > > > > > > some of the issues we are discussing with regards to passing a contobj
> > > > > > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > > > > > all because there are still some contids that won't have an object
> > > > > > > associated with them to make it impossible to look them up in the
> > > > > > > contobj lists.
> > > > > >
> > > > > > I'm not sure you need a full second counter, I imagine a simple flag
> > > > > > would be okay. I think you just need something to indicate that this ACID
> > > > > > object is marked as "dead" but is still being held for sanity reasons
> > > > > > and should not be reused.
> > > > >
> > > > > Ok, I see your point. This refcount can be changed to a flag easily
> > > > > enough without change to the api if we can be sure that more than one
> > > > > signal can't be delivered to the audit daemon *and* collected by sig2.
> > > > > I'll have a more careful look at the audit daemon code to see if I can
> > > > > determine this.
> > > >
> > > > Maybe I'm not understanding your concern, but this isn't really
> > > > different than any of the other things we track for the auditd signal
> > > > sender, right? If we are worried about multiple signals being sent
> > > > then it applies to everything, not just the audit container ID.
> > >
> > > Yes, you are right. In all other cases the information is simply
> > > overwritten. In the case of the audit container identifier any
> > > previous value is put before a new one is referenced, so only the last
> > > signal is kept. So, we only need a flag. Does a flag implemented with
> > > a rcu-protected refcount sound reasonable to you?
> >
> > Well, if I recall correctly you still need to fix the locking in this
> > patchset so until we see what that looks like it is hard to say for
> > certain. Just make sure that the flag is somehow protected from
> > races; it is probably a lot like the "valid" flags you sometimes see
> > with RCU protected lists.
>
> This is like looking for a needle in a haystack. Can you point me to
> some code that does "valid" flags with RCU protected lists?

Sigh. Come on Richard, you've been playing in the kernel for some
time now. I can't think of one off the top of my head as I write
this, but there are several resources that deal with RCU protected
lists in the kernel, Google is your friend and Documentation/RCU is
your friend.

Spending time to learn how RCU works and how to use it properly is not
time wasted. It's a tricky thing to get right (I have to refresh my
memory on some of the more subtle details each time I write/review RCU
code), but it's very cool when done correctly.
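
For readers following the thread, here is a minimal userspace sketch of
the "dead but still held" flag idea discussed above. It is not code from
the patchset: the struct layout, the fixed-size table and every function
name are hypothetical, and it is deliberately single-threaded. A kernel
version would need the RCU/locking protection this thread is asking
about, which the sketch does not attempt to show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a contobj; field names are illustrative only. */
struct contobj {
	uint64_t id;
	unsigned int tasks;	/* live tasks using this audit container ID */
	bool dead;		/* last task exited, death record emitted */
	bool sig_held;		/* unreported signal info still refers to this ID */
	bool in_use;		/* table slot is occupied */
};

#define NOBJ 8
static struct contobj table[NOBJ];

/* Drop the slot only once the ID is dead and nothing holds it any more. */
void contobj_try_free(struct contobj *o)
{
	assert(o != NULL);
	if (o->dead && !o->sig_held)
		o->in_use = false;
}

/*
 * Register an ID with one initial task.  Returns NULL if the ID is still
 * present (alive, or dead but held) or if the table is full.
 */
struct contobj *contobj_register(uint64_t id)
{
	struct contobj *slot = NULL;

	for (size_t i = 0; i < NOBJ; i++) {
		if (table[i].in_use && table[i].id == id)
			return NULL;	/* reuse blocked */
		if (!table[i].in_use && !slot)
			slot = &table[i];
	}
	if (!slot)
		return NULL;
	*slot = (struct contobj){ .id = id, .tasks = 1, .in_use = true };
	return slot;
}

/* A task leaves; on the last one, mark the ID dead (death record goes here). */
void contobj_task_exit(struct contobj *o)
{
	assert(o != NULL);
	if (o->tasks > 0 && --o->tasks == 0) {
		o->dead = true;
		contobj_try_free(o);
	}
}

/* Signal info now refers to this ID. */
void contobj_sig_hold(struct contobj *o)
{
	assert(o != NULL);
	o->sig_held = true;
}

/* Signal info was reported (or discarded); the ID may become reusable. */
void contobj_sig_put(struct contobj *o)
{
	assert(o != NULL);
	o->sig_held = false;
	contobj_try_free(o);
}
```

In this model the death notification is still timely (it happens in
contobj_task_exit()), while contobj_register() keeps refusing the ID
until contobj_sig_put() runs.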

--
paul moore
http://www.paul-moore.com

2020-03-29 03:18:35

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-23 20:16, Paul Moore wrote:
> > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-03-18 18:06, Paul Moore wrote:
> >
> > ...
> >
> > > > I hope we can do better than string manipulations in the kernel. I'd
> > > > much rather defer generating the ACID list (if possible), than
> > > > generating a list only to keep copying and editing it as the record is
> > > > sent.
> > >
> > > At the moment we are stuck with a string-only format.
> >
> > Yes, we are. That is another topic, and another set of changes I've
> > been deferring so as to not disrupt the audit container ID work.
> >
> > I was thinking of what we do inside the kernel between when the record
> > triggering event happens and when we actually emit the record to
> > userspace. Perhaps we collect the ACID information while the event is
> > occurring, but we defer generating the record until later when we have
> > a better understanding of what should be included in the ACID list.
> > It is somewhat similar (but obviously different) to what we do for
> > PATH records (we collect the pathname info when the path is being
> > resolved).
>
> Ok, now I understand your concern.
>
> In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> only other possible record and they are generated at the same time with
> a local context.
>
> In the case of any event involving a syscall, that CONTAINER_ID record
> is generated at the time of the rest of the event record generation at
> syscall exit.
>
> The others are only generated when needed, such as the sig2 reply.
>
> We generally just store the contobj pointer until we actually generate
> the CONTAINER_ID (or CONTAINER_OP) record.

Perhaps I'm remembering your latest spin of these patches incorrectly,
but there is still a big gap between when the record is generated and
when it is sent up to the audit daemon. Most importantly in that gap
is the whole big queue/multicast/unicast mess.

You don't need to show me code, but I would like to see some sort of
plan for dealing with multiple nested audit daemons. Basically I just
want to make sure we aren't painting ourselves into a corner with this
approach; and if for some horrible reason we are, I at least want us
to be aware of what we are getting ourselves into.

--
paul moore
http://www.paul-moore.com

2020-03-30 13:48:28

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-28 23:11, Paul Moore wrote:
> On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-23 20:16, Paul Moore wrote:
> > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-03-18 18:06, Paul Moore wrote:
> > >
> > > ...
> > >
> > > > > I hope we can do better than string manipulations in the kernel. I'd
> > > > > much rather defer generating the ACID list (if possible), than
> > > > > generating a list only to keep copying and editing it as the record is
> > > > > sent.
> > > >
> > > > At the moment we are stuck with a string-only format.
> > >
> > > Yes, we are. That is another topic, and another set of changes I've
> > > been deferring so as to not disrupt the audit container ID work.
> > >
> > > I was thinking of what we do inside the kernel between when the record
> > > triggering event happens and when we actually emit the record to
> > > userspace. Perhaps we collect the ACID information while the event is
> > > occurring, but we defer generating the record until later when we have
> > > a better understanding of what should be included in the ACID list.
> > > It is somewhat similar (but obviously different) to what we do for
> > > PATH records (we collect the pathname info when the path is being
> > > resolved).
> >
> > Ok, now I understand your concern.
> >
> > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> > only other possible record and they are generated at the same time with
> > a local context.
> >
> > In the case of any event involving a syscall, that CONTAINER_ID record
> > is generated at the time of the rest of the event record generation at
> > syscall exit.
> >
> > The others are only generated when needed, such as the sig2 reply.
> >
> > We generally just store the contobj pointer until we actually generate
> > the CONTAINER_ID (or CONTAINER_OP) record.
>
> Perhaps I'm remembering your latest spin of these patches incorrectly,
> but there is still a big gap between when the record is generated and
> when it is sent up to the audit daemon. Most importantly in that gap
> is the whole big queue/multicast/unicast mess.

So you suggest generating that record on the fly once it reaches the end
of the audit_queue just before being sent? That sounds... disruptive.
Each audit daemon is going to have its own queues, so by the time it
ends up in a particular queue, we'll already know its scope and would
have the right list of contids to print in that record.

I don't see the point in deferring the generation of the contid list
beyond the point of submitting that record to the relevant audit_queue.

> You don't need to show me code, but I would like to see some sort of
> plan for dealing with multiple nested audit daemons. Basically I just
> want to make sure we aren't painting ourselves into a corner with this
> approach; and if for some horrible reason we are, I at least want us
> to be aware of what we are getting ourselves into.

It wouldn't be significantly different from what we have, but, as would
have to happen for *all* records generated to a particular auditd/queue,
it would have to take the scope of that auditd into account: getting
references to PIDs right for that PID namespace, along with other
similar scope views, including the contid list range.

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-30 14:42:56

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-28 23:11, Paul Moore wrote:
> > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-03-23 20:16, Paul Moore wrote:
> > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > > >
> > > > ...
> > > >
> > > > > > I hope we can do better than string manipulations in the kernel. I'd
> > > > > > much rather defer generating the ACID list (if possible), than
> > > > > > generating a list only to keep copying and editing it as the record is
> > > > > > sent.
> > > > >
> > > > > At the moment we are stuck with a string-only format.
> > > >
> > > > Yes, we are. That is another topic, and another set of changes I've
> > > > been deferring so as to not disrupt the audit container ID work.
> > > >
> > > > I was thinking of what we do inside the kernel between when the record
> > > > triggering event happens and when we actually emit the record to
> > > > userspace. Perhaps we collect the ACID information while the event is
> > > > occurring, but we defer generating the record until later when we have
> > > > a better understanding of what should be included in the ACID list.
> > > > It is somewhat similar (but obviously different) to what we do for
> > > > PATH records (we collect the pathname info when the path is being
> > > > resolved).
> > >
> > > Ok, now I understand your concern.
> > >
> > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> > > only other possible record and they are generated at the same time with
> > > a local context.
> > >
> > > In the case of any event involving a syscall, that CONTAINER_ID record
> > > is generated at the time of the rest of the event record generation at
> > > syscall exit.
> > >
> > > The others are only generated when needed, such as the sig2 reply.
> > >
> > > We generally just store the contobj pointer until we actually generate
> > > the CONTAINER_ID (or CONTAINER_OP) record.
> >
> > Perhaps I'm remembering your latest spin of these patches incorrectly,
> > but there is still a big gap between when the record is generated and
> > when it is sent up to the audit daemon. Most importantly in that gap
> > is the whole big queue/multicast/unicast mess.
>
> So you suggest generating that record on the fly once it reaches the end
> of the audit_queue just before being sent? That sounds... disruptive.
> Each audit daemon is going to have its own queues, so by the time it
> ends up in a particular queue, we'll already know its scope and would
> have the right list of contids to print in that record.

I'm not suggesting any particular solution, I'm just pointing out a
potential problem. It isn't clear to me that you've thought about how
we generate multiple records, each with the correct ACID list
intended for a specific audit daemon, based on a single audit event.
Explain to me how you intend that to work and we are good. Be
specific because I'm not convinced we are talking on the same plane
here.

--
paul moore
http://www.paul-moore.com

2020-03-30 15:26:32

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-28 23:17, Paul Moore wrote:
> On Wed, Mar 25, 2020 at 8:29 AM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-20 17:56, Paul Moore wrote:
> > > On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-03-18 17:47, Paul Moore wrote:
> > > > > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > On 2020-03-18 17:01, Paul Moore wrote:
> > > > > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > > On 2020-03-13 12:42, Paul Moore wrote:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > > > > > > previous suggestion, but one idea would be to still emit a "death
> > > > > > > > > record" when the final task in the audit container ID does die, but
> > > > > > > > > block the particular audit container ID from reuse until the
> > > > > > > > > SIGNAL2 info has been reported. This gives us the timely ACID death
> > > > > > > > > notification while still preventing confusion and ambiguity caused by
> > > > > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > > > > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > > > > > > *after* its death, but I think that can be easily explained and
> > > > > > > > > understood by admins.
> > > > > > > >
> > > > > > > > Thinking quickly about possible technical solutions to this, maybe it
> > > > > > > > makes sense to have two counters on a contobj so that we know when the
> > > > > > > > last process in that container exits and can issue the death
> > > > > > > > certificate, but we still block reuse of it until all further references
> > > > > > > > to it have been resolved. This will likely also make it possible to
> > > > > > > > report the full contid chain in SIGNAL2 records. This will eliminate
> > > > > > > > some of the issues we are discussing with regards to passing a contobj
> > > > > > > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > > > > > > all because there are still some contids that won't have an object
> > > > > > > > associated with them to make it impossible to look them up in the
> > > > > > > > contobj lists.
> > > > > > >
> > > > > > > I'm not sure you need a full second counter, I imagine a simple flag
> > > > > > > would be okay. I think you just need something to indicate that this ACID
> > > > > > > object is marked as "dead" but is still being held for sanity reasons
> > > > > > > and should not be reused.
> > > > > >
> > > > > > Ok, I see your point. This refcount can be changed to a flag easily
> > > > > > enough without change to the api if we can be sure that more than one
> > > > > > signal can't be delivered to the audit daemon *and* collected by sig2.
> > > > > > I'll have a more careful look at the audit daemon code to see if I can
> > > > > > determine this.
> > > > >
> > > > > Maybe I'm not understanding your concern, but this isn't really
> > > > > different than any of the other things we track for the auditd signal
> > > > > sender, right? If we are worried about multiple signals being sent
> > > > > then it applies to everything, not just the audit container ID.
> > > >
> > > > Yes, you are right. In all other cases the information is simply
> > > > overwritten. In the case of the audit container identifier any
> > > > previous value is put before a new one is referenced, so only the last
> > > > signal is kept. So, we only need a flag. Does a flag implemented with
> > > > a rcu-protected refcount sound reasonable to you?
> > >
> > > Well, if I recall correctly you still need to fix the locking in this
> > > patchset so until we see what that looks like it is hard to say for
> > > certain. Just make sure that the flag is somehow protected from
> > > races; it is probably a lot like the "valid" flags you sometimes see
> > > with RCU protected lists.
> >
> > This is like looking for a needle in a haystack. Can you point me to
> > some code that does "valid" flags with RCU protected lists?
>
> Sigh. Come on Richard, you've been playing in the kernel for some
> time now. I can't think of one off the top of my head as I write
> this, but there are several resources that deal with RCU protected
> lists in the kernel, Google is your friend and Documentation/RCU is
> your friend.

Ok, I thought you were talking about a specific piece of code...

> Spending time to learn how RCU works and how to use it properly is not
> time wasted. It's a tricky thing to get right (I have to refresh my
> memory on some of the more subtle details each time I write/review RCU
> code), but it's very cool when done correctly.

I review Documentation/RCU almost every time I work on RCU...

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-30 16:23:31

by Richard Guy Briggs

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-30 10:26, Paul Moore wrote:
> On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-28 23:11, Paul Moore wrote:
> > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-03-23 20:16, Paul Moore wrote:
> > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > > > I hope we can do better than string manipulations in the kernel. I'd
> > > > > > > much rather defer generating the ACID list (if possible), than
> > > > > > > generating a list only to keep copying and editing it as the record is
> > > > > > > sent.
> > > > > >
> > > > > > At the moment we are stuck with a string-only format.
> > > > >
> > > > > Yes, we are. That is another topic, and another set of changes I've
> > > > > been deferring so as to not disrupt the audit container ID work.
> > > > >
> > > > > I was thinking of what we do inside the kernel between when the record
> > > > > triggering event happens and when we actually emit the record to
> > > > > userspace. Perhaps we collect the ACID information while the event is
> > > > > occurring, but we defer generating the record until later when we have
> > > > > a better understanding of what should be included in the ACID list.
> > > > > It is somewhat similar (but obviously different) to what we do for
> > > > > PATH records (we collect the pathname info when the path is being
> > > > > resolved).
> > > >
> > > > Ok, now I understand your concern.
> > > >
> > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> > > > only other possible record and they are generated at the same time with
> > > > a local context.
> > > >
> > > > In the case of any event involving a syscall, that CONTAINER_ID record
> > > > is generated at the time of the rest of the event record generation at
> > > > syscall exit.
> > > >
> > > > The others are only generated when needed, such as the sig2 reply.
> > > >
> > > > We generally just store the contobj pointer until we actually generate
> > > > the CONTAINER_ID (or CONTAINER_OP) record.
> > >
> > > Perhaps I'm remembering your latest spin of these patches incorrectly,
> > > but there is still a big gap between when the record is generated and
> > > when it is sent up to the audit daemon. Most importantly in that gap
> > > is the whole big queue/multicast/unicast mess.
> >
> > So you suggest generating that record on the fly once it reaches the end
> > of the audit_queue just before being sent? That sounds... disruptive.
> > Each audit daemon is going to have its own queues, so by the time it
> > ends up in a particular queue, we'll already know its scope and would
> > have the right list of contids to print in that record.
>
> I'm not suggesting any particular solution, I'm just pointing out a
> potential problem. It isn't clear to me that you've thought about how
> we generate multiple records, each with the correct ACID list
> intended for a specific audit daemon, based on a single audit event.
> Explain to me how you intend that to work and we are good. Be
> specific because I'm not convinced we are talking on the same plane
> here.

Well, every time a record gets generated, *any* record gets generated,
we'll need to check for which audit daemons this record is in scope and
generate a different one for each depending on the content and whether
or not the content is influenced by the scope. Some events will be
generated for some of the auditd/queues and not for others. Some fields
in some of the records will need to be tailored for that specific
auditd/queue for either contid scope or PID namespace base reference or
other scope differences.

Every auditd/queue will need its own serial number per event, and maybe
even its own timestamp, depending on whether that auditd is in a
different time namespace. Beyond that, PID and contid fields, and maybe
others, will need to be customized per auditd/queue. So, it may make
sense to generate the contents of each field for a generic record and
then either reuse content that is unchanged or generate new content for
a field that will be different in a different auditd/queue scope, then
render the final record per auditd/queue and enqueue it.
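
As a rough illustration of rendering one field per auditd/queue scope
from shared event data, here is a small userspace sketch. Everything in
it is an assumption made for the example: the chain is taken to be
ordered innermost container first, a daemon's scope is modelled as a
single contid (with HOST_SCOPE meaning "sees the whole chain"), and the
comma-separated output is a placeholder, not the field format proposed
in the patchset.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical: a daemon outside any container sees the full chain. */
#define HOST_SCOPE 0

/*
 * Render the part of a contid chain visible to one daemon.
 *
 * chain[0] is the innermost contid, chain[n - 1] the outermost.  A daemon
 * whose scope is contid S sees the entries from the innermost one up to
 * and including S.  Returns the number of characters written, or -1 if
 * the event is out of scope for this daemon or the buffer is too small.
 */
int render_contid_list(const uint64_t *chain, size_t n, uint64_t scope,
		       char *buf, size_t len)
{
	size_t end = n;		/* one past the last visible entry */
	size_t off = 0;

	assert(chain != NULL || n == 0);
	assert(buf != NULL);

	if (scope != HOST_SCOPE) {
		size_t i;

		for (i = 0; i < n; i++)
			if (chain[i] == scope)
				break;
		if (i == n)
			return -1;	/* event not in this daemon's scope */
		end = i + 1;
	}

	if (len == 0)
		return -1;
	buf[0] = '\0';

	for (size_t i = 0; i < end; i++) {
		int w = snprintf(buf + off, len - off, "%s%" PRIu64,
				 i ? "," : "", chain[i]);

		if (w < 0 || (size_t)w >= len - off)
			return -1;	/* truncated */
		off += (size_t)w;
	}
	return (int)off;
}
```

With a chain of {30, 20, 10}, a host-scope daemon would get all three
contids, a daemon scoped to contid 20 would get only the first two, and
a daemon scoped to an unrelated contid would get no record at all. The
chain itself is collected once; only the rendering differs per queue.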

I see this as the primary work of ghak93 ("RFE: run multiple audit
daemons on one machine"). I don't see how our proposed contid field
value format changes with this path above.

This is getting closer and closer to a netlink binary format too...

This is also an argument for spreading fields out over more record types
rather than cramming as much information as we can into one record type
(subject attributes in particular).

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-03-30 17:36:03

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-30 10:26, Paul Moore wrote:
> > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-03-28 23:11, Paul Moore wrote:
> > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2020-03-23 20:16, Paul Moore wrote:
> > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > > I hope we can do better than string manipulations in the kernel. I'd
> > > > > > > > much rather defer generating the ACID list (if possible), than
> > > > > > > > generating a list only to keep copying and editing it as the record is
> > > > > > > > sent.
> > > > > > >
> > > > > > > At the moment we are stuck with a string-only format.
> > > > > >
> > > > > > Yes, we are. That is another topic, and another set of changes I've
> > > > > > been deferring so as to not disrupt the audit container ID work.
> > > > > >
> > > > > > I was thinking of what we do inside the kernel between when the record
> > > > > > triggering event happens and when we actually emit the record to
> > > > > > userspace. Perhaps we collect the ACID information while the event is
> > > > > > occurring, but we defer generating the record until later when we have
> > > > > > a better understanding of what should be included in the ACID list.
> > > > > > It is somewhat similar (but obviously different) to what we do for
> > > > > > PATH records (we collect the pathname info when the path is being
> > > > > > resolved).
> > > > >
> > > > > Ok, now I understand your concern.
> > > > >
> > > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> > > > > only other possible record and they are generated at the same time with
> > > > > a local context.
> > > > >
> > > > > In the case of any event involving a syscall, that CONTAINER_ID record
> > > > > is generated at the time of the rest of the event record generation at
> > > > > syscall exit.
> > > > >
> > > > > The others are only generated when needed, such as the sig2 reply.
> > > > >
> > > > > We generally just store the contobj pointer until we actually generate
> > > > > the CONTAINER_ID (or CONTAINER_OP) record.
> > > >
> > > > Perhaps I'm remembering your latest spin of these patches incorrectly,
> > > > but there is still a big gap between when the record is generated and
> > > > when it is sent up to the audit daemon. Most importantly in that gap
> > > > is the whole big queue/multicast/unicast mess.
> > >
> > > So you suggest generating that record on the fly once it reaches the end
> > > of the audit_queue just before being sent? That sounds... disruptive.
> > > Each audit daemon is going to have its own queues, so by the time it
> > > ends up in a particular queue, we'll already know its scope and would
> > > have the right list of contids to print in that record.
> >
> > I'm not suggesting any particular solution, I'm just pointing out a
> > potential problem. It isn't clear to me that you've thought about how
> > we generate a multiple records, each with the correct ACID list
> > intended for a specific audit daemon, based on a single audit event.
> > Explain to me how you intend that to work and we are good. Be
> > specific because I'm not convinced we are talking on the same plane
> > here.
>
> Well, every time a record gets generated, *any* record gets generated,
> we'll need to check for which audit daemons this record is in scope and
> generate a different one for each depending on the content and whether
> or not the content is influenced by the scope.

That's the problem right there - we don't want to have to generate a
unique record for *each* auditd on *every* record. That is a recipe
for disaster.

Solving this for all of the known audit records is not something we
need to worry about in depth at the moment (although giving it some
casual thought is not a bad thing), but solving this for the audit
container ID information *is* something we need to worry about right
now.

--
paul moore
http://www.paul-moore.com

2020-03-30 17:51:19

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-03-30 13:34, Paul Moore wrote:
> On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-03-30 10:26, Paul Moore wrote:
> > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2020-03-28 23:11, Paul Moore wrote:
> > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > > I hope we can do better than string manipulations in the kernel. I'd
> > > > > > > > > much rather defer generating the ACID list (if possible), than
> > > > > > > > > generating a list only to keep copying and editing it as the record is
> > > > > > > > > sent.
> > > > > > > >
> > > > > > > > At the moment we are stuck with a string-only format.
> > > > > > >
> > > > > > > Yes, we are. That is another topic, and another set of changes I've
> > > > > > > been deferring so as to not disrupt the audit container ID work.
> > > > > > >
> > > > > > > I was thinking of what we do inside the kernel between when the record
> > > > > > > triggering event happens and when we actually emit the record to
> > > > > > > userspace. Perhaps we collect the ACID information while the event is
> > > > > > > occurring, but we defer generating the record until later when we have
> > > > > > > a better understanding of what should be included in the ACID list.
> > > > > > > It is somewhat similar (but obviously different) to what we do for
> > > > > > > PATH records (we collect the pathname info when the path is being
> > > > > > > resolved).
> > > > > >
> > > > > > Ok, now I understand your concern.
> > > > > >
> > > > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> > > > > > only other possible record and they are generated at the same time with
> > > > > > a local context.
> > > > > >
> > > > > > In the case of any event involving a syscall, that CONTAINER_ID record
> > > > > > is generated at the time of the rest of the event record generation at
> > > > > > syscall exit.
> > > > > >
> > > > > > The others are only generated when needed, such as the sig2 reply.
> > > > > >
> > > > > > We generally just store the contobj pointer until we actually generate
> > > > > > the CONTAINER_ID (or CONTAINER_OP) record.
> > > > >
> > > > > Perhaps I'm remembering your latest spin of these patches incorrectly,
> > > > > but there is still a big gap between when the record is generated and
> > > > > when it is sent up to the audit daemon. Most importantly in that gap
> > > > > is the whole big queue/multicast/unicast mess.
> > > >
> > > > So you suggest generating that record on the fly once it reaches the end
> > > > of the audit_queue just before being sent? That sounds... disruptive.
> > > > Each audit daemon is going to have its own queues, so by the time it
> > > > ends up in a particular queue, we'll already know its scope and would
> > > > have the right list of contids to print in that record.
> > >
> > > I'm not suggesting any particular solution, I'm just pointing out a
> > > potential problem. It isn't clear to me that you've thought about how
> > > we generate a multiple records, each with the correct ACID list
> > > intended for a specific audit daemon, based on a single audit event.
> > > Explain to me how you intend that to work and we are good. Be
> > > specific because I'm not convinced we are talking on the same plane
> > > here.
> >
> > Well, every time a record gets generated, *any* record gets generated,
> > we'll need to check for which audit daemons this record is in scope and
> > generate a different one for each depending on the content and whether
> > or not the content is influenced by the scope.
>
> That's the problem right there - we don't want to have to generate a
> unique record for *each* auditd on *every* record. That is a recipe
> for disaster.

I don't see how we can get around this.

We will already have that problem for PIDs in different PID namespaces.

We already need to use a different serial number in each auditd/queue,
or else we serialize *all* audit events on the machine and either leak
information to the nested daemons that there are other events happening
on the machine, or confuse the host daemon because it now thinks that we
are losing events due to serial numbers missing because some nested
daemon issued an event that was not relevant to the host daemon,
consuming a globally serial audit message sequence number.

> Solving this for all of the known audit records is not something we
> need to worry about in depth at the moment (although giving it some
> casual thought is not a bad thing), but solving this for the audit
> container ID information *is* something we need to worry about right
> now.

If you think that a different nested contid value string per daemon is
not acceptable, then we are back to issuing a record that has only *one*
contid listed without any nesting information. This brings us back to
the original problem of keeping *all* audit log history since the boot
of the machine to be able to track the nesting of any particular contid.

What am I missing? What do you suggest?

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
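
The serial number concern raised in this message can be illustrated with
a small userspace sketch. It is not code from the patchset: struct
auditd_inst, serial_global() and serial_for() are hypothetical names, and
"contid 0 means the host daemon" is an assumption made only for this
example. With one machine-wide counter, an event that is relevant only to
a nested daemon leaves a gap in what the host daemon sees; with one
counter per daemon, each daemon sees a dense sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-daemon state; not a structure from the patchset. */
struct auditd_inst {
	uint64_t contid;	/* container the daemon runs in, 0 for the host (assumed) */
	uint64_t last_serial;	/* private sequence counter for this daemon */
};

/* One machine-wide counter: every event consumes a number from it. */
static uint64_t global_serial;

static uint64_t serial_global(void)
{
	return ++global_serial;
}

/* Per-daemon counter: only events queued to this daemon consume numbers. */
static uint64_t serial_for(struct auditd_inst *d)
{
	assert(d);
	return ++d->last_serial;
}
```

If the host takes a serial from serial_global(), a nested-only event
takes the next one, and the host then takes a third, the host observes
two serials that differ by 2 and cannot tell a skipped number from a lost
event. With serial_for() the host and the nested daemon each count 1, 2,
3 independently.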

2020-03-30 19:56:42

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-03-30 13:34, Paul Moore wrote:
> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2020-03-30 10:26, Paul Moore wrote:
> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:

...

> > > Well, every time a record gets generated, *any* record gets generated,
> > > we'll need to check for which audit daemons this record is in scope and
> > > generate a different one for each depending on the content and whether
> > > or not the content is influenced by the scope.
> >
> > That's the problem right there - we don't want to have to generate a
> > unique record for *each* auditd on *every* record. That is a recipe
> > for disaster.
>
> I don't see how we can get around this.
>
> We will already have that problem for PIDs in different PID namespaces.

As I said below, let's not worry about this for all of the
known/current audit records, let's just think about how we solve this
for the ACID related information.

One of the bigger problems with translating namespace info (e.g. PIDs)
across ACIDs is that an ACID - by definition - has no understanding of
namespaces (both the concept as well as any given instance).

> We already need to use a different serial number in each auditd/queue,
> or else we serialize *all* audit events on the machine and either leak
> information to the nested daemons that there are other events happenning
> on the machine, or confuse the host daemon because it now thinks that we
> are losing events due to serial numbers missing because some nested
> daemon issued an event that was not relevant to the host daemon,
> consuming a globally serial audit message sequence number.

This isn't really relevant to the ACID lists, but sure.

> > Solving this for all of the known audit records is not something we
> > need to worry about in depth at the moment (although giving it some
> > casual thought is not a bad thing), but solving this for the audit
> > container ID information *is* something we need to worry about right
> > now.
>
> If you think that a different nested contid value string per daemon is
> not acceptable, then we are back to issuing a record that has only *one*
> contid listed without any nesting information. This brings us back to
> the original problem of keeping *all* audit log history since the boot
> of the machine to be able to track the nesting of any particular contid.

I'm not ruling anything out, except for the "let's just completely
regenerate every record for each auditd instance".

> What am I missing? What do you suggest?

I'm missing a solution in this thread, since you are the person
driving this effort I'm asking you to get creative and present us with
some solutions. :)


--
paul moore
http://www.paul-moore.com

2020-04-16 20:49:46

by Eric W. Biederman

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

Paul Moore <[email protected]> writes:

> On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <[email protected]> wrote:
>> On 2020-03-30 13:34, Paul Moore wrote:
>> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <[email protected]> wrote:
>> > > On 2020-03-30 10:26, Paul Moore wrote:
>> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
>> > > > > On 2020-03-28 23:11, Paul Moore wrote:
>> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
>> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
>> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
>> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
>
> ...
>
>> > > Well, every time a record gets generated, *any* record gets generated,
>> > > we'll need to check for which audit daemons this record is in scope and
>> > > generate a different one for each depending on the content and whether
>> > > or not the content is influenced by the scope.
>> >
>> > That's the problem right there - we don't want to have to generate a
>> > unique record for *each* auditd on *every* record. That is a recipe
>> > for disaster.
>> >
>> > Solving this for all of the known audit records is not something we
>> > need to worry about in depth at the moment (although giving it some
>> > casual thought is not a bad thing), but solving this for the audit
>> > container ID information *is* something we need to worry about right
>> > now.
>>
>> If you think that a different nested contid value string per daemon is
>> not acceptable, then we are back to issuing a record that has only *one*
>> contid listed without any nesting information. This brings us back to
>> the original problem of keeping *all* audit log history since the boot
>> of the machine to be able to track the nesting of any particular contid.
>
> I'm not ruling anything out, except for the "let's just completely
> regenerate every record for each auditd instance".

Paul I am a bit confused about what you are referring to when you say
regenerate every record.

Are you saying that you don't want to repeat the sequence:
audit_log_start(...);
audit_log_format(...);
audit_log_end(...);
for every nested audit daemon?

Or are you saying that you would literally like to send the same
skb to each of the nested audit daemons?

Or are you thinking of something else?

Eric

2020-04-16 21:56:36

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <[email protected]> wrote:
> Paul Moore <[email protected]> writes:
> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <[email protected]> wrote:
> >> On 2020-03-30 13:34, Paul Moore wrote:
> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <[email protected]> wrote:
> >> > > On 2020-03-30 10:26, Paul Moore wrote:
> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> >
> > ...
> >
> >> > > Well, every time a record gets generated, *any* record gets generated,
> >> > > we'll need to check for which audit daemons this record is in scope and
> >> > > generate a different one for each depending on the content and whether
> >> > > or not the content is influenced by the scope.
> >> >
> >> > That's the problem right there - we don't want to have to generate a
> >> > unique record for *each* auditd on *every* record. That is a recipe
> >> > for disaster.
> >> >
> >> > Solving this for all of the known audit records is not something we
> >> > need to worry about in depth at the moment (although giving it some
> >> > casual thought is not a bad thing), but solving this for the audit
> >> > container ID information *is* something we need to worry about right
> >> > now.
> >>
> >> If you think that a different nested contid value string per daemon is
> >> not acceptable, then we are back to issuing a record that has only *one*
> >> contid listed without any nesting information. This brings us back to
> >> the original problem of keeping *all* audit log history since the boot
> >> of the machine to be able to track the nesting of any particular contid.
> >
> > I'm not ruling anything out, except for the "let's just completely
> > regenerate every record for each auditd instance".
>
> Paul I am a bit confused about what you are referring to when you say
> regenerate every record.
>
> Are you saying that you don't want to repeat the sequence:
> audit_log_start(...);
> audit_log_format(...);
> audit_log_end(...);
> for every nested audit daemon?

If it can be avoided yes. Audit performance is already not-awesome,
this would make it even worse.

> Or are you saying that you would like to literraly want to send the same
> skb to each of the nested audit daemons?

Ideally we would reuse the generated audit messages as much as
possible. Less work is better. That's really my main concern here,
let's make sure we aren't going to totally tank performance when we
have a bunch of nested audit daemons.

> Or are you thinking of something else?

As mentioned above, I'm not thinking of anything specific, other than
let's please not have to regenerate *all* of the audit record strings
for each instance of an audit daemon, that's going to be a killer.

Maybe we have to regenerate some, if we do, what would that look like
in code? How do we handle the regeneration aspect? I worry that is
going to be really ugly.

Maybe we finally burn down the audit_log_format(...) function and pass
structs/TLVs to the audit subsystem and the audit subsystem generates
the strings in the auditd connection thread. Some of the record
strings could likely be shared, others would need to be ACID/auditd
dependent.

I'm open to any ideas people may have. We have a problem, let's solve it.

--
paul moore
http://www.paul-moore.com
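
The structs/TLVs idea in this message might look roughly like the
following userspace sketch. Nothing here is from the patchset or the
kernel: the field types, struct fld, chain_visible() and render() are
hypothetical, and the visibility policy (the host, contid 0, sees the
whole chain; a nested daemon sees entries up to and including its own
container) is an assumption chosen for illustration. The point is only
that callers hand over typed fields once, and the string is produced
later, per daemon.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical field types; the thread does not define a TLV layout. */
enum fld_type { FLD_STR, FLD_U64, FLD_CONTID_CHAIN };

struct fld {
	enum fld_type type;
	const char *key;
	union {
		const char *str;
		uint64_t u64;
		struct { const uint64_t *ids; size_t n; } chain; /* innermost first */
	} v;
};

/* Assumed policy: host (0) sees all entries; a nested daemon sees entries
 * up to and including its own container, or none if it is not listed. */
static size_t chain_visible(const uint64_t *ids, size_t n, uint64_t daemon)
{
	size_t i;

	if (daemon == 0)
		return n;
	for (i = 0; i < n; i++)
		if (ids[i] == daemon)
			return i + 1;
	return 0;
}

/* Bounded append; on truncation *off stays at the terminating NUL. */
static void append(char *buf, size_t len, size_t *off, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (*off >= len)
		return;
	va_start(ap, fmt);
	n = vsnprintf(buf + *off, len - *off, fmt, ap);
	va_end(ap);
	if (n < 0)
		return;
	*off += (size_t)n;
	if (*off >= len)
		*off = len - 1;
}

/* Render one record for one daemon; returns the string length written. */
static size_t render(const struct fld *f, size_t nf, uint64_t daemon,
		     char *buf, size_t len)
{
	size_t off = 0, i, j, vis;

	assert(buf && len > 0);
	buf[0] = '\0';
	for (i = 0; i < nf; i++) {
		append(buf, len, &off, "%s%s=", i ? " " : "", f[i].key);
		switch (f[i].type) {
		case FLD_STR:
			append(buf, len, &off, "%s", f[i].v.str);
			break;
		case FLD_U64:
			append(buf, len, &off, "%" PRIu64, f[i].v.u64);
			break;
		case FLD_CONTID_CHAIN:
			vis = chain_visible(f[i].v.chain.ids, f[i].v.chain.n, daemon);
			for (j = 0; j < vis; j++)
				append(buf, len, &off, "%s%" PRIu64,
				       j ? "," : "", f[i].v.chain.ids[j]);
			break;
		}
	}
	return off;
}
```

In this sketch only the FLD_CONTID_CHAIN case depends on the daemon, so
the FLD_STR and FLD_U64 output could in principle be rendered once and
shared, which is the split between shared and ACID/auditd dependent
strings that the message describes.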

2020-04-17 22:27:48

by Eric W. Biederman

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

Paul Moore <[email protected]> writes:

> On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <[email protected]> wrote:
>> Paul Moore <[email protected]> writes:
>> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <[email protected]> wrote:
>> >> On 2020-03-30 13:34, Paul Moore wrote:
>> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <[email protected]> wrote:
>> >> > > On 2020-03-30 10:26, Paul Moore wrote:
>> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
>> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
>> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
>> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
>> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
>> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
>> >
>> > ...
>> >
>> >> > > Well, every time a record gets generated, *any* record gets generated,
>> >> > > we'll need to check for which audit daemons this record is in scope and
>> >> > > generate a different one for each depending on the content and whether
>> >> > > or not the content is influenced by the scope.
>> >> >
>> >> > That's the problem right there - we don't want to have to generate a
>> >> > unique record for *each* auditd on *every* record. That is a recipe
>> >> > for disaster.
>> >> >
>> >> > Solving this for all of the known audit records is not something we
>> >> > need to worry about in depth at the moment (although giving it some
>> >> > casual thought is not a bad thing), but solving this for the audit
>> >> > container ID information *is* something we need to worry about right
>> >> > now.
>> >>
>> >> If you think that a different nested contid value string per daemon is
>> >> not acceptable, then we are back to issuing a record that has only *one*
>> >> contid listed without any nesting information. This brings us back to
>> >> the original problem of keeping *all* audit log history since the boot
>> >> of the machine to be able to track the nesting of any particular contid.
>> >
>> > I'm not ruling anything out, except for the "let's just completely
>> > regenerate every record for each auditd instance".
>>
>> Paul I am a bit confused about what you are referring to when you say
>> regenerate every record.
>>
>> Are you saying that you don't want to repeat the sequence:
>> audit_log_start(...);
>> audit_log_format(...);
>> audit_log_end(...);
>> for every nested audit daemon?
>
> If it can be avoided yes. Audit performance is already not-awesome,
> this would make it even worse.

As far as I can see, not repeating sequences like that is fundamental
to making this work at all, simply because only the audit subsystem
should know whether there are one or multiple audit daemons. Nothing
else should care.

>> Or are you saying that you would like to literraly want to send the same
>> skb to each of the nested audit daemons?
>
> Ideally we would reuse the generated audit messages as much as
> possible. Less work is better. That's really my main concern here,
> let's make sure we aren't going to totally tank performance when we
> have a bunch of nested audit daemons.

So I think there are two parts of this answer. Assuming we are talking
about nesting audit daemons in containers we will have different
rulesets and I expect most of the events for a nested audit daemon won't
be of interest to the outer audit daemon.

Beyond that it should be very straightforward to keep a pointer and
leave the buffer as a scatter gather list until audit_log_end,
and to translate pids and rewrite ACID attributes in audit_log_end
when we build the final packet. That can be done either through
collaboration with audit_log_format or with a special audit_log command
that carefully sets up the handful of things that need that information.

Hmm. I am seeing that we send skbs to kauditd and then kauditd
sends those skbs to userspace. I presume that is primarily so that
sending messages to userspace does not block the process being audited.

Plus a little bit so that the retry logic will work.

I think the naive implementation would be to simply have 1 kauditd
per auditd (strictly, an audit context/namespace). Although that can be
optimized if that is a problem.

Beyond that I think we would need to look at profiles to really
understand where the bottlenecks are.

>> Or are you thinking of something else?
>
> As mentioned above, I'm not thinking of anything specific, other than
> let's please not have to regenerate *all* of the audit record strings
> for each instance of an audit daemon, that's going to be a killer.
>
> Maybe we have to regenerate some, if we do, what would that look like
> in code? How do we handle the regeneration aspect? I worry that is
> going to be really ugly.
>
> Maybe we finally burn down the audit_log_format(...) function and pass
> structs/TLVs to the audit subsystem and the audit subsystem generates
> the strings in the auditd connection thread. Some of the record
> strings could likely be shared, others would need to be ACID/auditd
> dependent.

I think with just a very limited number of structs/TLVs for the cases
that matter, and one-to-one auditd and kauditd implementations, we should
still be able to do everything in audit_log_end. Plus doing as much work
as possible in audit_log_end where things are still cache hot is desirable.

> I'm open to any ideas people may have. We have a problem, let's solve
> it.

It definitely makes sense to look ahead to having audit daemons running
in containers, but in the grand scheme of things that is a nice to have.
Probably something we will and should get to, but we have lived a long
time without auditd running in containers so I expect we can live a
while longer.

As I understand Richard's patchset, for the specific case of the ACID we
are only talking about taking a subset of an existing string, and one
string at that. Not hard at all. Especially when looking at the
fundamental fact that we will need to send a different skb to
userspace, for each audit daemon.

Eric
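
The scatter gather idea, combined with "taking a subset of an existing
string", can be sketched in userspace as follows. This is an illustration
under stated assumptions, not kernel code: struct seg, chain_prefix_len()
and assemble() are hypothetical, plain pointers stand in for whatever skb
fragment mechanism a real implementation would use, and the number of
chain entries a daemon may see (keep) is assumed to have been decided
elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical segment descriptor for one piece of a pending record. */
struct seg {
	const char *data;
	size_t len;
	bool is_contid_chain;	/* segment needs per-daemon trimming */
};

/* Length of the leading part of a chain such as "777,666,555" that holds
 * the first `keep` comma-separated entries. */
static size_t chain_prefix_len(const char *s, size_t len, size_t keep)
{
	size_t i, seen = 0;

	if (keep == 0)
		return 0;
	for (i = 0; i < len; i++)
		if (s[i] == ',' && ++seen == keep)
			return i;
	return len;	/* fewer than keep+1 entries: keep everything */
}

/* Build the final linear string for one daemon. Unmarked segments are
 * copied verbatim; the marked chain segment is cut to `keep` entries.
 * Output is truncated rather than overflowed. Returns the length. */
static size_t assemble(const struct seg *sg, size_t n, size_t keep,
		       char *out, size_t outlen)
{
	size_t off = 0, i, l;

	assert(out && outlen > 0);
	for (i = 0; i < n; i++) {
		l = sg[i].is_contid_chain ?
			chain_prefix_len(sg[i].data, sg[i].len, keep) :
			sg[i].len;
		if (l > outlen - 1 - off)
			l = outlen - 1 - off;
		memcpy(out + off, sg[i].data, l);
		off += l;
	}
	out[off] = '\0';
	return off;
}
```

Here the segment list is built once per event, and only the final
assembly step runs per daemon, touching just the one marked segment.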

2020-04-22 17:28:03

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Fri, Apr 17, 2020 at 6:26 PM Eric W. Biederman <[email protected]> wrote:
> Paul Moore <[email protected]> writes:
> > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <[email protected]> wrote:
> >> Paul Moore <[email protected]> writes:
> >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <[email protected]> wrote:
> >> >> On 2020-03-30 13:34, Paul Moore wrote:
> >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <[email protected]> wrote:
> >> >> > > On 2020-03-30 10:26, Paul Moore wrote:
> >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
> >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> >> >
> >> > ...
> >> >
> >> >> > > Well, every time a record gets generated, *any* record gets generated,
> >> >> > > we'll need to check for which audit daemons this record is in scope and
> >> >> > > generate a different one for each depending on the content and whether
> >> >> > > or not the content is influenced by the scope.
> >> >> >
> >> >> > That's the problem right there - we don't want to have to generate a
> >> >> > unique record for *each* auditd on *every* record. That is a recipe
> >> >> > for disaster.
> >> >> >
> >> >> > Solving this for all of the known audit records is not something we
> >> >> > need to worry about in depth at the moment (although giving it some
> >> >> > casual thought is not a bad thing), but solving this for the audit
> >> >> > container ID information *is* something we need to worry about right
> >> >> > now.
> >> >>
> >> >> If you think that a different nested contid value string per daemon is
> >> >> not acceptable, then we are back to issuing a record that has only *one*
> >> >> contid listed without any nesting information. This brings us back to
> >> >> the original problem of keeping *all* audit log history since the boot
> >> >> of the machine to be able to track the nesting of any particular contid.
> >> >
> >> > I'm not ruling anything out, except for the "let's just completely
> >> > regenerate every record for each auditd instance".
> >>
> >> Paul I am a bit confused about what you are referring to when you say
> >> regenerate every record.
> >>
> >> Are you saying that you don't want to repeat the sequence:
> >> audit_log_start(...);
> >> audit_log_format(...);
> >> audit_log_end(...);
> >> for every nested audit daemon?
> >
> > If it can be avoided yes. Audit performance is already not-awesome,
> > this would make it even worse.
>
> As far as I can see not repeating sequences like that is fundamental
> for making this work at all. Just because only the audit subsystem
> should know about one or multiple audit daemons. Nothing else should
> care.

Yes, exactly, this has been mentioned in the past. Both the
performance hit and the code complication in the caller are things we
must avoid.

> >> Or are you saying that you would like to literraly want to send the same
> >> skb to each of the nested audit daemons?
> >
> > Ideally we would reuse the generated audit messages as much as
> > possible. Less work is better. That's really my main concern here,
> > let's make sure we aren't going to totally tank performance when we
> > have a bunch of nested audit daemons.
>
> So I think there are two parts of this answer. Assuming we are talking
> about nesting audit daemons in containers we will have different
> rulesets and I expect most of the events for a nested audit daemon won't
> be of interest to the outer audit daemon.

Yes, this is another thing that Richard and I have discussed in the
past. We will basically need to create per-daemon queues, rules,
tracking state, etc.; that is easy enough. What will be slightly more
tricky is the part where we apply the filters to the individual
records and decide if that record is valid/desired for a given daemon.
I think it can be done without too much pain, and any changes to the
callers, but it will require a bit of work to make sure it is done
well and that records are not needlessly duplicated in the kernel.

> Beyond that it should be very straight forward to keep a pointer and
> leave the buffer as a scatter gather list until audit_log_end
> and translate pids, and rewrite ACIDs attributes in audit_log_end
> when we build the final packet. Either through collaboration with
> audit_log_format or a special audit_log command that carefully sets
> up the handful of things that need that information.

In order to maximize record re-use I think we will want to hold off on
assembling the final packet until it is sent to the daemons in the
kauditd thread. We'll also likely need to create special
audit_log_XXX functions to capture fields which we know will need
translation, e.g. ACID information. (the reason for the new
audit_log_XXX functions would be to mark the new sg element and ensure
the buffer is handled correctly)

Regardless of the details, I think the scatter gather approach is the
key here - that seems like the best design idea I've seen thus far.
It enables us to replace portions of the record as needed ... and
possibly use the existing skb cow stuff ... it has been a while, but
do the skb cow functions handle scatter gather skbs or do they need
to be linear?

> Hmm. I am seeing that we send skbs to kauditd and then kauditd
> sends those skbs to userspace. I presume that is primary so that
> sending messages to userspace does not block the process being audited.
> Plus a little bit so that the retry logic will work.

Long story short, it's a poor design. I'm not sure who came up with
it, but I have about 1000 questions that are variations on "why did
this seem like a good idea?".

I expect the audit_buffer definition to change significantly during
the nested auditd work.

> I think the naive implementation would be to simply have 1 kauditd
> per auditd (strictly and audit context/namespace). Although that can be
> optimized if that is a problem.
>
> Beyond that I think we would need to look at profiles to really
> understand where the bottlenecks are.

Agreed. This is a hidden implementation detail that doesn't affect
the userspace API or the in-kernel callers. The first approach can be
simple and we can complicate it as needed in future versions.

> > I'm open to any ideas people may have. We have a problem, let's solve
> > it.
>
> It definitely makes sense to look ahead to having audit daemons running
> in containers, but in the grand scheme of things that is a nice to have.
> Probably something we will and should get to, but we have lived a long
> time without auditd running in containers so I expect we can live a
> while longer.

It looks like you are misreading my concern. I'm not pushing Richard
to implement support for this in the current patchset, I'm pushing
Richard to consider the design aspect of having multiple audit daemons
so that we don't code ourselves into a corner with the audit record
changes he is proposing. The audit record format is part of the
kernel/userspace API and as a result requires great care when
modifying/extending/etc.

--
paul moore
http://www.paul-moore.com
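
The per-daemon queues and the "is this record desired for a given
daemon" decision discussed in this message could be sketched in userspace
like this. Everything named here is hypothetical (struct rec, struct
auditd_inst, in_scope(), fan_out(), MAX_QUEUE), and the scope policy is
an assumption: the host daemon (contid 0) sees every record, and a nested
daemon sees a record only if its container appears in the task's contid
chain. Per-daemon rule filtering is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_QUEUE 8

/* Hypothetical shared record: one copy, referenced by many queues. */
struct rec {
	const uint64_t *chain;	/* contid chain of the task, innermost first */
	size_t chain_len;
	unsigned int refs;	/* number of daemon queues holding this record */
};

/* Hypothetical per-daemon state with its own queue. */
struct auditd_inst {
	uint64_t contid;	/* 0 for the host daemon (assumed) */
	struct rec *queue[MAX_QUEUE];
	size_t queued;
};

/* Assumed scope policy: host sees everything; a nested daemon sees a
 * record only if its container is in the task's contid chain. */
static bool in_scope(const struct rec *r, const struct auditd_inst *d)
{
	size_t i;

	if (d->contid == 0)
		return true;
	for (i = 0; i < r->chain_len; i++)
		if (r->chain[i] == d->contid)
			return true;
	return false;
}

/* Queue one record to every interested daemon by reference, not by copy.
 * Daemons with a full queue are skipped. Returns the record's reference
 * count afterwards. */
static unsigned int fan_out(struct rec *r, struct auditd_inst *d, size_t nd)
{
	size_t i;

	assert(r && d);
	for (i = 0; i < nd; i++) {
		if (!in_scope(r, &d[i]) || d[i].queued == MAX_QUEUE)
			continue;
		d[i].queue[d[i].queued++] = r;
		r->refs++;
	}
	return r->refs;
}
```

Because the queues hold pointers to a single record, any per-daemon
rewriting would have to happen when the record is finally sent, which is
consistent with deferring packet assembly to the sending thread.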

2020-06-08 18:08:21

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-04-22 13:24, Paul Moore wrote:
> On Fri, Apr 17, 2020 at 6:26 PM Eric W. Biederman <[email protected]> wrote:
> > Paul Moore <[email protected]> writes:
> > > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <[email protected]> wrote:
> > >> Paul Moore <[email protected]> writes:
> > >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <[email protected]> wrote:
> > >> >> On 2020-03-30 13:34, Paul Moore wrote:
> > >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <[email protected]> wrote:
> > >> >> > > On 2020-03-30 10:26, Paul Moore wrote:
> > >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
> > >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> > >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> > >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> > >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> > >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > >> >
> > >> > ...
> > >> >
> > >> >> > > Well, every time a record gets generated, *any* record gets generated,
> > >> >> > > we'll need to check for which audit daemons this record is in scope and
> > >> >> > > generate a different one for each depending on the content and whether
> > >> >> > > or not the content is influenced by the scope.
> > >> >> >
> > >> >> > That's the problem right there - we don't want to have to generate a
> > >> >> > unique record for *each* auditd on *every* record. That is a recipe
> > >> >> > for disaster.
> > >> >> >
> > >> >> > Solving this for all of the known audit records is not something we
> > >> >> > need to worry about in depth at the moment (although giving it some
> > >> >> > casual thought is not a bad thing), but solving this for the audit
> > >> >> > container ID information *is* something we need to worry about right
> > >> >> > now.
> > >> >>
> > >> >> If you think that a different nested contid value string per daemon is
> > >> >> not acceptable, then we are back to issuing a record that has only *one*
> > >> >> contid listed without any nesting information. This brings us back to
> > >> >> the original problem of keeping *all* audit log history since the boot
> > >> >> of the machine to be able to track the nesting of any particular contid.
> > >> >
> > >> > I'm not ruling anything out, except for the "let's just completely
> > >> > regenerate every record for each auditd instance".
> > >>
> > >> Paul I am a bit confused about what you are referring to when you say
> > >> regenerate every record.
> > >>
> > >> Are you saying that you don't want to repeat the sequence:
> > >> audit_log_start(...);
> > >> audit_log_format(...);
> > >> audit_log_end(...);
> > >> for every nested audit daemon?
> > >
> > > If it can be avoided yes. Audit performance is already not-awesome,
> > > this would make it even worse.
> >
> > As far as I can see not repeating sequences like that is fundamental
> > for making this work at all. Just because only the audit subsystem
> > should know about one or multiple audit daemons. Nothing else should
> > care.
>
> Yes, exactly, this has been mentioned in the past. Both the
> performance hit and the code complication in the caller are things we
> must avoid.
>
> > >> Or are you saying that you literally want to send the same
> > >> skb to each of the nested audit daemons?
> > >
> > > Ideally we would reuse the generated audit messages as much as
> > > possible. Less work is better. That's really my main concern here,
> > > let's make sure we aren't going to totally tank performance when we
> > > have a bunch of nested audit daemons.
> >
> > So I think there are two parts of this answer. Assuming we are talking
> > about nesting audit daemons in containers we will have different
> > rulesets and I expect most of the events for a nested audit daemon won't
> > be of interest to the outer audit daemon.
>
> Yes, this is another thing that Richard and I have discussed in the
> past. We will basically need to create per-daemon queues, rules,
> tracking state, etc.; that is easy enough. What will be slightly more
> tricky is the part where we apply the filters to the individual
> records and decide if that record is valid/desired for a given daemon.
> I think it can be done without too much pain, and any changes to the
> callers, but it will require a bit of work to make sure it is done
> well and that records are not needlessly duplicated in the kernel.
>
> > Beyond that it should be very straight forward to keep a pointer and
> > leave the buffer as a scatter gather list until audit_log_end
> > and translate pids, and rewrite ACIDs attributes in audit_log_end
> > when we build the final packet. Either through collaboration with
> > audit_log_format or a special audit_log command that carefully sets
> > up the handful of things that need that information.
>
> In order to maximize record re-use I think we will want to hold off on
> assembling the final packet until it is sent to the daemons in the
> kauditd thread. We'll also likely need to create special
> audit_log_XXX functions to capture fields which we know will need
> translation, e.g. ACID information. (the reason for the new
> audit_log_XXX functions would be to mark the new sg element and ensure
> the buffer is handled correctly)
>
> Regardless of the details, I think the scatter gather approach is the
> key here - that seems like the best design idea I've seen thus far.
> It enables us to replace portions of the record as needed ... and
> possibly use the existing skb cow stuff ... it has been a while, but
> do the skb cow functions handle scatter gather skbs or do they need
> to be linear?

How does the selection of this data management technique affect our
choice of field format? Does this lock the field value to a fixed
length? Does the use of scatter/gather techniques or structures allow
the use of different lengths of data for each destination (auditd)? I
could see different target audit daemons triggering or switching to a
different chunk of data and length. This does raise a concern related
to the previous sig_info2 discussion: the struct contobj that exists
at the time audit_log_exit() is called could have been reaped by the time
the buffer is pulled from the queue for transmission to auditd, but we
could hold a reference to it as is done for sig_info2.

Looking through the kernel scatter/gather possibilities, I see struct
iovec, which is used by the readv/writev/preadv/pwritev syscalls, but my
understanding is that this is a kernel-internal implementation detail
that will not be visible to user space. So would struct scatterlist be
the right choice?

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

2020-06-17 21:35:50

by Paul Moore

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On Mon, Jun 8, 2020 at 2:04 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-04-22 13:24, Paul Moore wrote:
> > On Fri, Apr 17, 2020 at 6:26 PM Eric W. Biederman <[email protected]> wrote:
> > > Paul Moore <[email protected]> writes:
> > > > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <[email protected]> wrote:
> > > >> Paul Moore <[email protected]> writes:
> > > >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <[email protected]> wrote:
> > > >> >> On 2020-03-30 13:34, Paul Moore wrote:
> > > >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <[email protected]> wrote:
> > > >> >> > > On 2020-03-30 10:26, Paul Moore wrote:
> > > >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
> > > >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> > > >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> > > >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> > > >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> > > >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > > >> >
> > > >> > ...
> > > >> >
> > > >> >> > > Well, every time a record gets generated, *any* record gets generated,
> > > >> >> > > we'll need to check for which audit daemons this record is in scope and
> > > >> >> > > generate a different one for each depending on the content and whether
> > > >> >> > > or not the content is influenced by the scope.
> > > >> >> >
> > > >> >> > That's the problem right there - we don't want to have to generate a
> > > >> >> > unique record for *each* auditd on *every* record. That is a recipe
> > > >> >> > for disaster.
> > > >> >> >
> > > >> >> > Solving this for all of the known audit records is not something we
> > > >> >> > need to worry about in depth at the moment (although giving it some
> > > >> >> > casual thought is not a bad thing), but solving this for the audit
> > > >> >> > container ID information *is* something we need to worry about right
> > > >> >> > now.
> > > >> >>
> > > >> >> If you think that a different nested contid value string per daemon is
> > > >> >> not acceptable, then we are back to issuing a record that has only *one*
> > > >> >> contid listed without any nesting information. This brings us back to
> > > >> >> the original problem of keeping *all* audit log history since the boot
> > > >> >> of the machine to be able to track the nesting of any particular contid.
> > > >> >
> > > >> > I'm not ruling anything out, except for the "let's just completely
> > > >> > regenerate every record for each auditd instance".
> > > >>
> > > >> Paul I am a bit confused about what you are referring to when you say
> > > >> regenerate every record.
> > > >>
> > > >> Are you saying that you don't want to repeat the sequence:
> > > >> audit_log_start(...);
> > > >> audit_log_format(...);
> > > >> audit_log_end(...);
> > > >> for every nested audit daemon?
> > > >
> > > > If it can be avoided yes. Audit performance is already not-awesome,
> > > > this would make it even worse.
> > >
> > > As far as I can see not repeating sequences like that is fundamental
> > > for making this work at all. Just because only the audit subsystem
> > > should know about one or multiple audit daemons. Nothing else should
> > > care.
> >
> > Yes, exactly, this has been mentioned in the past. Both the
> > performance hit and the code complication in the caller are things we
> > must avoid.
> >
> > > >> Or are you saying that you literally want to send the same
> > > >> skb to each of the nested audit daemons?
> > > >
> > > > Ideally we would reuse the generated audit messages as much as
> > > > possible. Less work is better. That's really my main concern here,
> > > > let's make sure we aren't going to totally tank performance when we
> > > > have a bunch of nested audit daemons.
> > >
> > > So I think there are two parts of this answer. Assuming we are talking
> > > about nesting audit daemons in containers we will have different
> > > rulesets and I expect most of the events for a nested audit daemon won't
> > > be of interest to the outer audit daemon.
> >
> > Yes, this is another thing that Richard and I have discussed in the
> > past. We will basically need to create per-daemon queues, rules,
> > tracking state, etc.; that is easy enough. What will be slightly more
> > tricky is the part where we apply the filters to the individual
> > records and decide if that record is valid/desired for a given daemon.
> > I think it can be done without too much pain, and any changes to the
> > callers, but it will require a bit of work to make sure it is done
> > well and that records are not needlessly duplicated in the kernel.
> >
> > > Beyond that it should be very straight forward to keep a pointer and
> > > leave the buffer as a scatter gather list until audit_log_end
> > > and translate pids, and rewrite ACIDs attributes in audit_log_end
> > > when we build the final packet. Either through collaboration with
> > > audit_log_format or a special audit_log command that carefully sets
> > > up the handful of things that need that information.
> >
> > In order to maximize record re-use I think we will want to hold off on
> > assembling the final packet until it is sent to the daemons in the
> > kauditd thread. We'll also likely need to create special
> > audit_log_XXX functions to capture fields which we know will need
> > translation, e.g. ACID information. (the reason for the new
> > audit_log_XXX functions would be to mark the new sg element and ensure
> > the buffer is handled correctly)
> >
> > Regardless of the details, I think the scatter gather approach is the
> > key here - that seems like the best design idea I've seen thus far.
> > It enables us to replace portions of the record as needed ... and
> > possibly use the existing skb cow stuff ... it has been a while, but
> > do the skb cow functions handle scatter gather skbs or do they need
> > to be linear?
>
> How does the selection of this data management technique affect our
> choice of field format?

I'm not sure it affects the record string, but it might affect the
in-kernel API as we would likely want to have a special function for
logging the audit container ID that does the scatter-gather management
for the record. There might also need to be some changes to how we
allocate the records.

However, since you're the one working on these patches I would expect
you to be the one to look into how this would work and what the
impacts might be to the code, record format, etc.

> Does this lock the field value to a fixed length?

I wouldn't think so. In fact, if it did, it wouldn't really be a good solution.

Once again, this is something I would expect you to look into.

> Does the use of scatter/gather techniques or structures allow
> the use of different lengths of data for each destination (auditd)?

This is related to the above ... but yes, the reason why Eric and I
were discussing a scatter/gather approach is that it would presumably
allow one to break the single record string into pieces which could be
managed and manipulated much more easily than the monolithic record string.
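
To make the "pieces" idea concrete, here is a minimal userspace sketch,
not kernel code. It assumes a record is held as an array of segments,
one of which is a placeholder for the container ID, and that the final
linear string is only gathered once the destination daemon is known.
Every name in it (struct rec_seg, struct daemon_view, rec_assemble) and
the "contid=" field text are hypothetical and chosen for illustration
only; the real code might use struct scatterlist, skb fragments, or
something else entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical segment kinds: literal text shared by every daemon, or a
 * placeholder whose text depends on the destination daemon. */
enum seg_kind { SEG_LITERAL, SEG_CONTID };

struct rec_seg {
	enum seg_kind kind;
	const char *text;	/* used for SEG_LITERAL only */
};

/* Hypothetical per-daemon view: the contid string this daemon should see. */
struct daemon_view {
	const char *contid;
};

/*
 * Gather the segments into one linear, NUL-terminated buffer for a given
 * daemon. Returns the record length, or -1 if it does not fit in out[].
 */
static int rec_assemble(const struct rec_seg *segs, size_t nsegs,
			const struct daemon_view *d, char *out, size_t outlen)
{
	size_t used = 0;

	assert(outlen > 0);
	for (size_t i = 0; i < nsegs; i++) {
		/* Only the placeholder segment differs between daemons. */
		const char *src = segs[i].kind == SEG_CONTID ? d->contid
							      : segs[i].text;
		size_t n = strlen(src);

		if (used + n >= outlen)
			return -1;
		memcpy(out + used, src, n);
		used += n;
	}
	out[used] = '\0';
	return (int)used;
}
```

The literal segments are built once and shared; only the placeholder
is substituted per daemon, and nothing in the sketch forces that
substitution to have a fixed length.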

> I could see different target audit daemons triggering or switching to a
> different chunk of data and length. This does raise a concern related
> to the previous sig_info2 discussion that the struct contobj that exists
> at the time of audit_log_exit called could have been reaped by the time
> the buffer is pulled from the queue for transmission to auditd, but we
> could hold a reference to it as is done for sig_info2.

Yes.
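
As a rough illustration of the lifetime issue, here is a single-threaded
userspace sketch of a queued record pinning the container object until
it is sent. The struct layout and all function names are hypothetical
stand-ins; in the kernel this would presumably use refcount_t or a
similar primitive, with locking that is omitted here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the container object discussed above. */
struct contobj {
	uint64_t id;
	unsigned int refcount;
};

/* Hypothetical queued record that may outlive the task that created it. */
struct queued_rec {
	struct contobj *cont;
};

/* Allocate an object holding one reference (owned by the task). */
static struct contobj *contobj_alloc(uint64_t id)
{
	struct contobj *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->id = id;
	c->refcount = 1;
	return c;
}

static struct contobj *contobj_get(struct contobj *c)
{
	assert(c->refcount > 0);
	c->refcount++;
	return c;
}

/* Drop one reference; returns 1 if this call freed the object. */
static int contobj_put(struct contobj *c)
{
	assert(c->refcount > 0);
	if (--c->refcount == 0) {
		free(c);
		return 1;
	}
	return 0;
}

/* Queueing the record takes its own reference on the object. */
static void rec_enqueue(struct queued_rec *r, struct contobj *c)
{
	r->cont = contobj_get(c);
}

/* Sending reads the id, then drops the record's reference. */
static int rec_send(struct queued_rec *r, uint64_t *id_out)
{
	int freed;

	*id_out = r->cont->id;
	freed = contobj_put(r->cont);
	r->cont = NULL;
	return freed;
}
```

With this shape, the last task in the container can exit between
rec_enqueue() and rec_send() without the record losing access to the
object.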

> Looking through the kernel scatter/gather possibilities, I see struct
> iovec which is used by the readv/writev/preadv/pwritev syscalls, but I'm
> understanding that this is a kernel implementation that will be not
> visible to user space. So would the struct scatterlist be the right
> choice?

It has been so long since I've looked at the scatter-gather code that
I can't really say with any confidence at this point. All I can say
is that the scatter-gather code really should just be an
implementation detail in the kernel and should not be visible to
userspace; userspace should get the same awful, improperly generated
netlink message it always has received from the kernel ;)

--
paul moore
http://www.paul-moore.com

2020-06-19 22:58:07

by Richard Guy Briggs

Subject: Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

On 2020-04-17 17:23, Eric W. Biederman wrote:
> Paul Moore <[email protected]> writes:
>
> > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <[email protected]> wrote:
> >> Paul Moore <[email protected]> writes:
> >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <[email protected]> wrote:
> >> >> On 2020-03-30 13:34, Paul Moore wrote:
> >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <[email protected]> wrote:
> >> >> > > On 2020-03-30 10:26, Paul Moore wrote:
> >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <[email protected]> wrote:
> >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <[email protected]> wrote:
> >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <[email protected]> wrote:
> >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> >> >
> >> > ...
> >> >
> >> >> > > Well, every time a record gets generated, *any* record gets generated,
> >> >> > > we'll need to check for which audit daemons this record is in scope and
> >> >> > > generate a different one for each depending on the content and whether
> >> >> > > or not the content is influenced by the scope.
> >> >> >
> >> >> > That's the problem right there - we don't want to have to generate a
> >> >> > unique record for *each* auditd on *every* record. That is a recipe
> >> >> > for disaster.
> >> >> >
> >> >> > Solving this for all of the known audit records is not something we
> >> >> > need to worry about in depth at the moment (although giving it some
> >> >> > casual thought is not a bad thing), but solving this for the audit
> >> >> > container ID information *is* something we need to worry about right
> >> >> > now.
> >> >>
> >> >> If you think that a different nested contid value string per daemon is
> >> >> not acceptable, then we are back to issuing a record that has only *one*
> >> >> contid listed without any nesting information. This brings us back to
> >> >> the original problem of keeping *all* audit log history since the boot
> >> >> of the machine to be able to track the nesting of any particular contid.
> >> >
> >> > I'm not ruling anything out, except for the "let's just completely
> >> > regenerate every record for each auditd instance".
> >>
> >> Paul I am a bit confused about what you are referring to when you say
> >> regenerate every record.
> >>
> >> Are you saying that you don't want to repeat the sequence:
> >> audit_log_start(...);
> >> audit_log_format(...);
> >> audit_log_end(...);
> >> for every nested audit daemon?
> >
> > If it can be avoided yes. Audit performance is already not-awesome,
> > this would make it even worse.
>
> As far as I can see not repeating sequences like that is fundamental
> for making this work at all. Just because only the audit subsystem
> should know about one or multiple audit daemons. Nothing else should
> care.
>
> >> Or are you saying that you literally want to send the same
> >> skb to each of the nested audit daemons?
> >
> > Ideally we would reuse the generated audit messages as much as
> > possible. Less work is better. That's really my main concern here,
> > let's make sure we aren't going to totally tank performance when we
> > have a bunch of nested audit daemons.
>
> So I think there are two parts of this answer. Assuming we are talking
> about nesting audit daemons in containers we will have different
> rulesets and I expect most of the events for a nested audit daemon won't
> be of interest to the outer audit daemon.
>
> Beyond that it should be very straight forward to keep a pointer and
> leave the buffer as a scatter gather list until audit_log_end
> and translate pids, and rewrite ACIDs attributes in audit_log_end
> when we build the final packet. Either through collaboration with
> audit_log_format or a special audit_log command that carefully sets
> up the handful of things that need that information.
>
> Hmm. I am seeing that we send skbs to kauditd and then kauditd
> sends those skbs to userspace. I presume that is primarily so that
> sending messages to userspace does not block the process being audited.
>
> Plus a little bit so that the retry logic will work.
>
> I think the naive implementation would be to simply have 1 kauditd
> per auditd (strictly, an audit context/namespace). Although that can be
> optimized if that is a problem.
>
> Beyond that I think we would need to look at profiles to really
> understand where the bottlenecks are.
>
> >> Or are you thinking of something else?
> >
> > As mentioned above, I'm not thinking of anything specific, other than
> > let's please not have to regenerate *all* of the audit record strings
> > for each instance of an audit daemon, that's going to be a killer.
> >
> > Maybe we have to regenerate some, if we do, what would that look like
> > in code? How do we handle the regeneration aspect? I worry that is
> > going to be really ugly.
> >
> > Maybe we finally burn down the audit_log_format(...) function and pass
> > structs/TLVs to the audit subsystem and the audit subsystem generates
> > the strings in the auditd connection thread. Some of the record
> > strings could likely be shared, others would need to be ACID/auditd
> > dependent.
>
> I think with just a very limited number of structs/TLVs for the cases that
> matter and one-to-one auditd and kauditd implementations we should still
> be able to do everything in audit_log_end. Plus doing as much work as
> possible in audit_log_end where things are still cache hot is desirable.

So in the end, perf may show us that moving things around a bit and
knowing to which queue(s) we send an skb will help maintain performance:
we would write out the field contents in audit_log_end(), while things
are still cache hot, and send to the correct queue, rather than deferring
that formatting to the kauditd thread. In any case, it makes sense to
delay that formatting work until just after the daemon routing decision
is made.
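
A rough userspace sketch of "route first, then format", under several
assumptions of mine: each record carries the task's contid nesting chain
from outermost to innermost, each daemon has a scope contid (0 standing
for the host daemon), and a nested daemon sees the chain from its own
contid inward. The names (struct rec, struct daemon, rec_route,
rec_format_contid) and the comma-separated output are hypothetical and
not the proposed record format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_DEPTH 8

/* Hypothetical record: contid chain from outermost to innermost. */
struct rec {
	uint64_t chain[MAX_DEPTH];
	size_t depth;
};

/* Hypothetical daemon: scope == 0 means the host daemon, which sees all. */
struct daemon {
	uint64_t scope;
};

/* Routing: index of the first chain entry visible to d, or -1 if the
 * record is out of scope for that daemon. */
static int rec_route(const struct rec *r, const struct daemon *d)
{
	if (d->scope == 0)
		return 0;
	for (size_t i = 0; i < r->depth; i++)
		if (r->chain[i] == d->scope)
			return (int)i;
	return -1;
}

/* Formatting happens only after routing accepts the record. Returns the
 * string length, or -1 if out of scope or the buffer is too small. */
static int rec_format_contid(const struct rec *r, const struct daemon *d,
			     char *out, size_t outlen)
{
	int start = rec_route(r, d);
	size_t used = 0;

	assert(outlen > 0);
	out[0] = '\0';
	if (start < 0)
		return -1;
	for (size_t i = (size_t)start; i < r->depth; i++) {
		int n = snprintf(out + used, outlen - used, "%s%llu",
				 i > (size_t)start ? "," : "",
				 (unsigned long long)r->chain[i]);

		if (n < 0 || (size_t)n >= outlen - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

A daemon that is out of scope costs one chain walk and no formatting at
all, and each in-scope daemon gets a subset of the same chain.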

> > I'm open to any ideas people may have. We have a problem, let's solve
> > it.
>
> It definitely makes sense to look ahead to having audit daemons running
> in containers, but in the grand scheme of things that is a nice to have.
> Probably something we will and should get to, but we have lived a long
> time without auditd running in containers so I expect we can live a
> while longer.
>
> As I understand Richard's patchset for the specific case of the ACID we
> are only talking about taking a subset of an existing string, and one
> string at that. Not hard at all. Especially when looking at the
> fundamental fact that we will need to send a different skb to
> userspace, for each audit daemon.
>
> Eric

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635