Implement kernel audit container identifier.
This patchset is the ninth revision, based on the proposal document (V4) posted:
https://www.redhat.com/archives/linux-audit/2019-September/msg00052.html
The first patch was the last patch from ghak81 that was absorbed into
this patchset since its primary justification is the rest of this
patchset.
The second patch implements the proc fs write to set the audit container
identifier of a process, emitting an AUDIT_CONTAINER_OP record to
announce the registration of that audit container identifier on that
process. This patch requires userspace support for record acceptance
and proper type display. This patch now includes the conversion
over from a simple u64 to a list member that includes owner information
to check for descendancy, allow process injection into a container and
prevent id reuse by other orchestrators.
The third implements reading the audit container identifier from the
proc filesystem for debugging. This patch wasn't planned for upstream
inclusion but is starting to become more likely.
The fourth logs the drop of an audit container identifier once all tasks
using that audit container identifier have exited.
The 5th implements the auxiliary record AUDIT_CONTAINER_ID if an audit
container identifier is associated with an event. This patch requires
userspace support for proper type display.
The 6th adds audit daemon signalling provenance through audit_sig_info2.
The 7th creates a local audit context to be able to bind a standalone
record with a locally created auxiliary record.
The 8th patch adds audit container identifier records to the user
standalone records.
The 9th adds audit container identifier filtering to the exit,
exclude and user lists. This patch adds the AUDIT_CONTID field and
requires auditctl userspace support for the --contid option.
The 10th adds network namespace audit container identifier labelling
based on member tasks' audit container identifier labels which supports
standalone netfilter records that don't have a task context and lists
each container to which that net namespace belongs.
The 11th checks that the target is a descendant for nesting and
refactors to avoid a duplicate of the copied function.
The 12th adds tracking and reporting for container nesting.
This enables kernel filtering and userspace searches of nested audit
container identifiers.
The 13th adds a mechanism to allow a process to be designated as a
container orchestrator/engine in non-init user namespaces.
Example: Set an audit container identifier of 123456 to the "sleep" task:
sleep 2&
child=$!
echo 123456 > /proc/$child/audit_containerid; echo $?
ausearch -ts recent -m container_op
echo child:$child contid:$( cat /proc/$child/audit_containerid)
This should produce a record such as:
type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615
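The old-contid value in the record above is the unsigned 64-bit rendering of
-1, which this series uses as the "unset" marker (AUDIT_CID_UNSET in the
changelog). A minimal userspace sketch of that relationship follows, assuming
the marker is all bits set in a u64. CONTID_UNSET, contid_is_set() and
contid_format() are hypothetical names for illustration, not kernel API:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: mirrors AUDIT_CID_UNSET, i.e. all bits set in a u64. */
#define CONTID_UNSET UINT64_MAX

/* Hypothetical helper: true when a contid holds a real value. */
static int contid_is_set(uint64_t contid)
{
	return contid != CONTID_UNSET;
}

/*
 * Hypothetical helper: format a contid as an unsigned decimal, the way
 * it appears in the CONTAINER_OP record above. Returns the number of
 * characters that the full value needs, excluding the trailing NUL.
 */
static int contid_format(char *buf, size_t len, uint64_t contid)
{
	return snprintf(buf, len, "%" PRIu64, contid);
}
```

The largest u64 needs 20 decimal digits plus a NUL, which is consistent with
the v6 change of TMPBUFLEN from 11 to 21.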
Example: Set a filter on an audit container identifier 123459 on /tmp/tmpcontainerid:
contid=123459
key=tmpcontainerid
auditctl -a exit,always -F dir=/tmp -F perm=wa -F contid=$contid -F key=$key
perl -e "sleep 1; open(my \$tmpfile, '>', \"/tmp/$key\"); close(\$tmpfile);" &
child=$!
echo $contid > /proc/$child/audit_containerid
sleep 2
ausearch -i -ts recent -k $key
auditctl -d exit,always -F dir=/tmp -F perm=wa -F contid=$contid -F key=$key
rm -f /tmp/$key
This should produce an event such as:
type=CONTAINER_ID msg=audit(2018-06-06 12:46:31.707:26953) : contid=123459
type=PROCTITLE msg=audit(2018-06-06 12:46:31.707:26953) : proctitle=perl -e sleep 1; open(my $tmpfile, '>', "/tmp/tmpcontainerid"); close($tmpfile);
type=PATH msg=audit(2018-06-06 12:46:31.707:26953) : item=1 name=/tmp/tmpcontainerid inode=25656 dev=00:26 mode=file,644 ouid=root ogid=root rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=PATH msg=audit(2018-06-06 12:46:31.707:26953) : item=0 name=/tmp/ inode=8985 dev=00:26 mode=dir,sticky,777 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=CWD msg=audit(2018-06-06 12:46:31.707:26953) : cwd=/root
type=SYSCALL msg=audit(2018-06-06 12:46:31.707:26953) : arch=x86_64 syscall=openat success=yes exit=3 a0=0xffffffffffffff9c a1=0x5621f2b81900 a2=O_WRONLY|O_CREAT|O_TRUNC a3=0x1b6 items=2 ppid=628 pid=2232 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=ttyS0 ses=1 comm=perl exe=/usr/bin/perl subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=tmpcontainerid
Example: Test multiple containers on one netns:
sleep 5 &
child1=$!
containerid1=123451
echo $containerid1 > /proc/$child1/audit_containerid
sleep 5 &
child2=$!
containerid2=123452
echo $containerid2 > /proc/$child2/audit_containerid
iptables -I INPUT -i lo -p icmp --icmp-type echo-request -j AUDIT --type accept
iptables -I INPUT -t mangle -i lo -p icmp --icmp-type echo-request -j MARK --set-mark 0x12345555
sleep 1;
bash -c "ping -q -c 1 127.0.0.1 >/dev/null 2>&1"
sleep 1;
ausearch -i -m NETFILTER_PKT -ts boot|grep mark=0x12345555
ausearch -i -m NETFILTER_PKT -ts boot|grep contid=|grep $containerid1|grep $containerid2
This would produce an event such as:
type=NETFILTER_PKT msg=audit(03/15/2019 14:16:13.369:244) : mark=0x12345555 saddr=127.0.0.1 daddr=127.0.0.1 proto=icmp
type=CONTAINER_ID msg=audit(03/15/2019 14:16:13.369:244) : contid=123452,123451
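A consumer of the CONTAINER_ID record above has to split the contid= value on
commas. The following is a rough userspace sketch of that step, assuming a
plain comma-separated list of decimal identifiers (it does not handle the
caret modifier mentioned in the v9 changelog). contid_list_parse() is a
hypothetical helper, not part of audit userspace:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the value of a contid= field such as
 * "123452,123451" into out[]. Returns the number of identifiers
 * parsed, or -1 on malformed input or if more than max are present.
 */
static int contid_list_parse(const char *val, uint64_t *out, int max)
{
	int n = 0;

	while (*val) {
		char *end;
		unsigned long long id;

		if (n == max)
			return -1;
		/* Reject signs and whitespace that strtoull() would accept. */
		if (*val < '0' || *val > '9')
			return -1;
		errno = 0;
		id = strtoull(val, &end, 10);
		if (errno == ERANGE)
			return -1;
		out[n++] = id;
		if (*end == ',') {
			val = end + 1;
			/* A trailing comma is malformed. */
			if (*val == '\0')
				return -1;
		} else if (*end == '\0') {
			val = end;
		} else {
			return -1;
		}
	}
	return n;
}
```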
Includes the last patch of https://github.com/linux-audit/audit-kernel/issues/81
Please see the github audit kernel issue for the main feature:
https://github.com/linux-audit/audit-kernel/issues/90
and the kernel filter code:
https://github.com/linux-audit/audit-kernel/issues/91
and the network support:
https://github.com/linux-audit/audit-kernel/issues/92
Please see the github audit userspace issue for supporting record types:
https://github.com/linux-audit/audit-userspace/issues/51
and filter code:
https://github.com/linux-audit/audit-userspace/issues/40
Please see the github audit testsuite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
https://github.com/rgbriggs/audit-testsuite/tree/ghat64-contid
https://github.com/linux-audit/audit-testsuite/pull/91
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
The code is also posted at:
git://toccata2.tricolour.ca/linux-2.6-rgb.git ghak90-audit-containerID.v9
Changelog:
v9
- rebase on v5.8-rc1
- fix whitespace and oversize lines where practicable
- remove harmless duplicate S_IRUSR in capcontid
- return -EBUSY for both threading and children (drop -EALREADY)
- return -EEXIST if already set and not nesting (drop -ECHILD)
- fix unbalanced brace and remove elseif ladder
- drop check for same contid set again as redundant (drop -EADDRINUSE)
- get reference to contobj's parent taskstruct
- protect all contid list updates with audit_contobj_list_lock
- protect refcounts with rcu read lock
- convert _audit_contobj to _audit_contobj_get, which calls _audit_contobj_hold
- convert audit_log_container_id() and audit_log_contid() from u64 to contobj, simplifying
- issue death certificate on contid after exit of last task
- keep contobj ref to block reuse with -ESHUTDOWN until auditd exit or signal info
- report all contids nested
- rework sig_info2 format to accommodate contid list
- fix zero-length array in include/linux/audit.h struct audit_sig_info2 data[]
- found bug in audit_alloc_local, don't check audit_ever_enabled, since all callers check audit_enabled
- remove warning at declaration of audit_sig_cid of reuse since reuse is now blocked
- report descendancy checking errcodes under -EXDEV (drop -EBADSLT)
- add missed check, replace audit_contid_isowner with audit_contid_isnesting
- limit calls to audit_log_format() with if(iter->parent) ...
- list only one contid in contid, nested in old-contid to avoid duplication
- switch to comma delimiter, caret modifier in nested contid list
- special case -1 for AUDIT_CID_UNSET printing
- drop contid depth limit and netns contid limit patches
- enforce capcontid policy on contid write and read
- squash conversion to contobj into contid intro patch
v8
- rebase on v5.5-rc1 audit/next
- remove subject attrs in CONTAINER_OP record
- group audit_contid_list_lock with audit_contid_hash
- in audit_{set,log}_contid(), break out of loop after finding target
- use target var to size kmalloc
- rework audit_cont_owner() to bool audit_contid_isowner() and move to where used
- create static void audit_cont_hold(struct audit_contobj *cont) { refcount_inc(&cont->refcount); }
- rename audit_cont{,_*} refs to audit_contobj{,_*}
- prefix special local functions with _ [audit_contobj*()]
- protect contid list traversals with rcu_read_lock() and updates with audit_contid_list_lock
- protect real_parent in audit_contid_depth() with rcu_dereference
- give new contid field nesting format in patch description
- squash task_is_descendant()
- squash support for NETFILTER_PKT into network namespaces
- limit nesting depth based on record length overflow, bandwidth and storage
- implement control for audit container identifier nesting depth limit
- make room for audit_bpf patches (bump CONTAINER_ID to 1335)
- squash proc interface into capcontid
- remove netlink access to loginuid/sessionid/contid/capcontid
- delete 32k contid limit patch
- document potential overlap between signal delivery and contid reuse
- document audit_contobj_list_lock coverage
- document disappearing orch task injection limitation
- limit the number of containers that can be associated with a network namespace
- implement control for audit container identifier netns count limit
v7
- remove BUG() in audit_comparator64()
- rebase on v5.2-rc1 audit/next
- resolve merge conflict with ghak111 (signal_info regardless syscall)
- resolve merge conflict with ghak73 (audit_field_valid)
- resolve merge conflict with ghak64 (saddr_fam filter)
- resolve merge conflict with ghak10 (ntp audit) change AUDIT_CONTAINER_ID from 1332 to 1334
- rebase on v5.3-rc1 audit/next
- track container owner
- only permit setting contid of descendants for nesting
- track drop of contid and permit reuse
- track and report container nesting
- permit filtering on any nested contid
- set/get contid and loginuid/sessionid via netlink
- implement capcontid to enable orchestrators in non-init user
namespaces
- limit number of containers
- limit depth of container nesting
v6
- change TMPBUFLEN from 11 to 21 to cover the decimal value of contid
u64 (nhorman)
- fix bug overwriting ctx in struct audit_sig_info, move cid above
ctx[0] (nhorman)
- fix bug skipping remaining fields and not advancing bufp when copying
out contid in audit_krule_to_data (omosnace)
- add acks, tidy commit descriptions, other formatting fixes (checkpatch
wrong on audit_log_lost)
- cast ull for u64 prints
- target_cid tracking was moved from the ptrace/signal patch to
container_op
- target ptrace and signal records were moved from the ptrace/signal
patch to container_id
- auditd signaller tracking was moved to a new AUDIT_SIGNAL_INFO2
request and record
- ditch unnecessary list_empty() checks
- check for null net and aunet in audit_netns_contid_add()
- swap CONTAINER_OP contid/old-contid order to ease parsing
v5
- address loginuid and sessionid syscall scope in ghak104
- address audit_context in CONFIG_AUDIT vs CONFIG_AUDITSYSCALL in ghak105
- remove tty patch, addressed in ghak106
- rebase on audit/next v5.0-rc1
w/ghak59/ghak104/ghak103/ghak100/ghak107/ghak105/ghak106/ghak105sup
- update CONTAINER_ID to CONTAINER_OP in patch description
- move audit_context in audit_task_info to CONFIG_AUDITSYSCALL
- move audit_alloc() and audit_free() out of CONFIG_AUDITSYSCALL and into
CONFIG_AUDIT and create audit_{alloc,free}_syscall
- use plain kmem_cache_alloc() rather than kmem_cache_zalloc() in audit_alloc()
- fix audit_get_contid() declaration type error
- move audit_set_contid() from auditsc.c to audit.c
- audit_log_contid() returns void
- audit_log_contid() handed contid rather than tsk
- switch from AUDIT_CONTAINER to AUDIT_CONTAINER_ID for aux record
- move audit_log_contid(tsk/contid) & audit_contid_set(tsk)/audit_contid_valid(contid)
- switch from tsk to current
- audit_alloc_local() calls audit_log_lost() on failure to allocate a context
- add AUDIT_USER* non-syscall contid record
- cosmetic cleanup double parens, goto out on err
- ditch audit_get_ns_contid_list_lock(), fix aunet lock race
- switch from all-cpu read spinlock to rcu, keep spinlock for write
- update audit_alloc_local() to use ktime_get_coarse_real_ts64()
- add nft_log support
- add call from do_exit() in audit_free() to remove contid from netns
- relegate AUDIT_CONTAINER ref= field (was op=) to debug patch
v4
- preface set with ghak81:"collect audit task parameters"
- add shallyn and sgrubb acks
- rename feature bitmap macro
- rename cid_valid() to audit_contid_valid()
- rename AUDIT_CONTAINER_ID to AUDIT_CONTAINER_OP
- delete audit_get_contid_list() from headers
- move work into inner if, delete "found"
- change netns contid list function names
- move exports for audit_log_contid audit_alloc_local audit_free_context to non-syscall patch
- list contids CSV
- pass in gfp flags to audit_alloc_local() (fix audit_alloc_context callers)
- use "local" in lieu of abusing in_syscall for auditsc_get_stamp()
- read_lock(&tasklist_lock) around children and thread check
- task_lock(tsk) should be taken before first check of tsk->audit
- add spin lock to contid list in aunet
- restrict /proc read to CAP_AUDIT_CONTROL
- remove set again prohibition and inherited flag
- delete contidion spelling fix from patchset, send to netdev/linux-wireless
v3
- switched from containerid in task_struct to audit_task_info (depends on ghak81)
- drop INVALID_CID in favour of only AUDIT_CID_UNSET
- check for !audit_task_info, throw -ENOPROTOOPT on set
- changed -EPERM to -EEXIST for parent check
- return AUDIT_CID_UNSET if !audit_enabled
- squash child/thread check patch into AUDIT_CONTAINER_ID patch
- changed -EPERM to -EBUSY for child check
- separate child and thread checks, use -EALREADY for latter
- move addition of op= from ptrace/signal patch to AUDIT_CONTAINER patch
- fix && to || bashism in ptrace/signal patch
- uninline and export function for audit_free_context()
- drop CONFIG_CHANGE, FEATURE_CHANGE, ANOM_ABEND, ANOM_SECCOMP patches
- move audit_enabled check (xt_AUDIT)
- switched from containerid list in struct net to net_generic's struct audit_net
- move containerid list iteration into audit (xt_AUDIT)
- create function to move namespace switch into audit
- switched /proc/PID/ entry from containerid to audit_containerid
- call kzalloc with GFP_ATOMIC on in_atomic() in audit_alloc_context()
- call kzalloc with GFP_ATOMIC on in_atomic() in audit_log_container_info()
- use xt_net(par) instead of sock_net(skb->sk) to get net
- switched record and field names: initial CONTAINER_ID, aux CONTAINER, field CONTID
- allow to set own contid
- open code audit_set_containerid
- add contid inherited flag
- ccontainerid and pcontainerid eliminated due to inherited flag
- change name of container list functions
- rename containerid to contid
- convert initial container record to syscall aux
- fix spelling mistake of contidion in net/rfkill/core.c to avoid contid name collision
v2
- add check for children and threads
- add network namespace container identifier list
- add NETFILTER_PKT audit container identifier logging
- patch description and documentation clean-up and example
- reap unused ppid
Richard Guy Briggs (13):
audit: collect audit task parameters
audit: add container id
audit: read container ID of a process
audit: log drop of contid on exit of last task
audit: log container info of syscalls
audit: add contid support for signalling the audit daemon
audit: add support for non-syscall auxiliary records
audit: add containerid support for user records
audit: add containerid filtering
audit: add support for containerid to network namespaces
audit: contid check descendancy and nesting
audit: track container nesting
audit: add capcontid to set contid outside init_user_ns
fs/proc/base.c | 112 +++++++-
include/linux/audit.h | 135 +++++++++-
include/linux/sched.h | 10 +-
include/uapi/linux/audit.h | 10 +-
init/init_task.c | 3 +-
init/main.c | 2 +
kernel/audit.c | 621 +++++++++++++++++++++++++++++++++++++++++++-
kernel/audit.h | 23 ++
kernel/auditfilter.c | 61 +++++
kernel/auditsc.c | 110 ++++++--
kernel/fork.c | 1 -
kernel/nsproxy.c | 4 +
kernel/sched/core.c | 33 +++
net/netfilter/nft_log.c | 11 +-
net/netfilter/xt_AUDIT.c | 11 +-
security/selinux/nlmsgtab.c | 1 +
security/yama/yama_lsm.c | 33 ---
17 files changed, 1085 insertions(+), 96 deletions(-)
--
1.8.3.1
Implement audit container identifier filtering using the AUDIT_CONTID
field name to send an 8-byte string representing a u64 since the
value field is only u32.
Sending it as two u32 was considered, but gathering and comparing two
fields was more complex.
The feature indicator is AUDIT_FEATURE_BITMAP_CONTAINERID.
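The packing scheme described above can be sketched in userspace as follows.
This is only an illustration under stated assumptions: struct packed_field,
contid_pack() and contid_unpack() are hypothetical stand-ins for the rule
data handling, with the u32 slot carrying the payload length and the bytes
travelling in a separate buffer in host byte order:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one packed rule field: length + payload. */
struct packed_field {
	uint32_t len;		/* plays the role of the u32 value slot */
	unsigned char buf[8];	/* raw bytes of the u64, host byte order */
};

/* Store the u64 as an 8-byte string and record its length. */
static void contid_pack(struct packed_field *f, uint64_t contid)
{
	f->len = sizeof(contid);
	memcpy(f->buf, &contid, sizeof(contid));
}

/* Recover the u64; reject any payload that is not exactly 8 bytes. */
static int contid_unpack(const struct packed_field *f, uint64_t *contid)
{
	if (f->len != sizeof(*contid))
		return -1;
	/* memcpy() avoids any alignment assumption about the buffer. */
	memcpy(contid, f->buf, sizeof(*contid));
	return 0;
}
```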
Please see the github audit kernel issue for the contid filter feature:
https://github.com/linux-audit/audit-kernel/issues/91
Please see the github audit userspace issue for filter additions:
https://github.com/linux-audit/audit-userspace/issues/40
Please see the github audit testsuite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Acked-by: Neil Horman <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
---
include/linux/audit.h | 1 +
include/uapi/linux/audit.h | 5 ++++-
kernel/audit.h | 1 +
kernel/auditfilter.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/auditsc.c | 4 ++++
5 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 15d0defc5193..c4a755ae0d61 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -68,6 +68,7 @@ struct audit_field {
u32 type;
union {
u32 val;
+ u64 val64;
kuid_t uid;
kgid_t gid;
struct {
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index a56ad77069b9..831c12bdd235 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -271,6 +271,7 @@
#define AUDIT_LOGINUID_SET 24
#define AUDIT_SESSIONID 25 /* Session ID */
#define AUDIT_FSTYPE 26 /* FileSystem Type */
+#define AUDIT_CONTID 27 /* Container ID */
/* These are ONLY useful when checking
* at syscall exit time (AUDIT_AT_EXIT). */
@@ -352,6 +353,7 @@ enum {
#define AUDIT_FEATURE_BITMAP_SESSIONID_FILTER 0x00000010
#define AUDIT_FEATURE_BITMAP_LOST_RESET 0x00000020
#define AUDIT_FEATURE_BITMAP_FILTER_FS 0x00000040
+#define AUDIT_FEATURE_BITMAP_CONTAINERID 0x00000080
#define AUDIT_FEATURE_BITMAP_ALL (AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT | \
AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME | \
@@ -359,7 +361,8 @@ enum {
AUDIT_FEATURE_BITMAP_EXCLUDE_EXTEND | \
AUDIT_FEATURE_BITMAP_SESSIONID_FILTER | \
AUDIT_FEATURE_BITMAP_LOST_RESET | \
- AUDIT_FEATURE_BITMAP_FILTER_FS)
+ AUDIT_FEATURE_BITMAP_FILTER_FS | \
+ AUDIT_FEATURE_BITMAP_CONTAINERID)
/* deprecated: AUDIT_VERSION_* */
#define AUDIT_VERSION_LATEST AUDIT_FEATURE_BITMAP_ALL
diff --git a/kernel/audit.h b/kernel/audit.h
index a7f88d76163f..34d8ec4bc6ef 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -228,6 +228,7 @@ static inline int audit_hash_contid(u64 contid)
extern int audit_match_class(int class, unsigned syscall);
extern int audit_comparator(const u32 left, const u32 op, const u32 right);
+extern int audit_comparator64(const u64 left, const u32 op, const u64 right);
extern int audit_uid_comparator(kuid_t left, u32 op, kuid_t right);
extern int audit_gid_comparator(kgid_t left, u32 op, kgid_t right);
extern int parent_len(const char *path);
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index a10e2997aa6c..d812698efc1d 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -399,6 +399,7 @@ static int audit_field_valid(struct audit_entry *entry, struct audit_field *f)
case AUDIT_FILETYPE:
case AUDIT_FIELD_COMPARE:
case AUDIT_EXE:
+ case AUDIT_CONTID:
/* only equal and not equal valid ops */
if (f->op != Audit_not_equal && f->op != Audit_equal)
return -EINVAL;
@@ -590,6 +591,14 @@ static struct audit_entry *audit_data_to_entry(struct audit_rule_data *data,
entry->rule.buflen += f_val;
entry->rule.exe = audit_mark;
break;
+ case AUDIT_CONTID:
+ if (f_val != sizeof(u64))
+ goto exit_free;
+ str = audit_unpack_string(&bufp, &remain, f_val);
+ if (IS_ERR(str))
+ goto exit_free;
+ f->val64 = ((u64 *)str)[0];
+ break;
default:
f->val = f_val;
break;
@@ -675,6 +684,11 @@ static struct audit_rule_data *audit_krule_to_data(struct audit_krule *krule)
data->buflen += data->values[i] =
audit_pack_string(&bufp, audit_mark_path(krule->exe));
break;
+ case AUDIT_CONTID:
+ data->buflen += data->values[i] = sizeof(u64);
+ memcpy(bufp, &f->val64, sizeof(u64));
+ bufp += sizeof(u64);
+ break;
case AUDIT_LOGINUID_SET:
if (krule->pflags & AUDIT_LOGINUID_LEGACY && !f->val) {
data->fields[i] = AUDIT_LOGINUID;
@@ -761,6 +775,10 @@ static int audit_compare_rule(struct audit_krule *a, struct audit_krule *b)
if (!gid_eq(a->fields[i].gid, b->fields[i].gid))
return 1;
break;
+ case AUDIT_CONTID:
+ if (a->fields[i].val64 != b->fields[i].val64)
+ return 1;
+ break;
default:
if (a->fields[i].val != b->fields[i].val)
return 1;
@@ -1216,6 +1234,30 @@ int audit_comparator(u32 left, u32 op, u32 right)
}
}
+int audit_comparator64(u64 left, u32 op, u64 right)
+{
+ switch (op) {
+ case Audit_equal:
+ return (left == right);
+ case Audit_not_equal:
+ return (left != right);
+ case Audit_lt:
+ return (left < right);
+ case Audit_le:
+ return (left <= right);
+ case Audit_gt:
+ return (left > right);
+ case Audit_ge:
+ return (left >= right);
+ case Audit_bitmask:
+ return !!(left & right);
+ case Audit_bittest:
+ return ((left & right) == right);
+ default:
+ return 0;
+ }
+}
+
int audit_uid_comparator(kuid_t left, u32 op, kuid_t right)
{
switch (op) {
@@ -1350,6 +1392,10 @@ int audit_filter(int msgtype, unsigned int listtype)
result = audit_comparator(audit_loginuid_set(current),
f->op, f->val);
break;
+ case AUDIT_CONTID:
+ result = audit_comparator64(audit_get_contid(current),
+ f->op, f->val64);
+ break;
case AUDIT_MSGTYPE:
result = audit_comparator(msgtype, f->op, f->val);
break;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 935eb3d2cde9..baa5709590b4 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -640,6 +640,10 @@ static int audit_filter_rules(struct task_struct *tsk,
result = audit_comparator(ctx->sockaddr->ss_family,
f->op, f->val);
break;
+ case AUDIT_CONTID:
+ result = audit_comparator64(audit_get_contid(tsk),
+ f->op, f->val64);
+ break;
case AUDIT_SUBJ_USER:
case AUDIT_SUBJ_ROLE:
case AUDIT_SUBJ_TYPE:
--
1.8.3.1
Require the target task to be a descendant of the container
orchestrator/engine.
You would only change the audit container ID from one set or inherited
value to another if you were nesting containers.
If changing the contid, the container orchestrator/engine must be a
descendant of, and not the same orchestrator as, the one that set it, so
that it is not possible to change the contid of another orchestrator's
container.
Since the task_is_descendant() function is used in YAMA and in audit,
remove the duplication and pull the function into kernel/sched/core.c.
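The rules above can be modelled outside the kernel. The sketch below is a toy
under stated assumptions, not the kernel implementation: struct toy_task, the
TOY_* codes and both functions are hypothetical, there is no locking or
thread-group handling, and a non-NULL cont_owner stands in for "contid
already set":

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a task: only the parent link and contid owner matter. */
struct toy_task {
	struct toy_task *parent;	/* NULL at the root of the tree */
	struct toy_task *cont_owner;	/* who set this task's contid, or NULL */
};

enum { TOY_OK = 0, TOY_EXDEV = -1, TOY_EEXIST = -2 };

/* Walk up from child; true if parent is child itself or an ancestor. */
static bool toy_is_descendant(const struct toy_task *parent,
			      const struct toy_task *child)
{
	for (; parent && child; child = child->parent)
		if (child == parent)
			return true;
	return false;
}

/* Returns TOY_OK if orch may set target's contid, else a toy error. */
static int toy_may_set_contid(const struct toy_task *orch,
			      const struct toy_task *target)
{
	const struct toy_task *owner = target->cont_owner;

	/* The target must be a strict descendant of the caller. */
	if (target == orch || !toy_is_descendant(orch, target))
		return TOY_EXDEV;
	/*
	 * Already set: only nesting is allowed, i.e. the caller is not
	 * the owner but is descended from the owner.
	 */
	if (owner && (owner == orch || !toy_is_descendant(owner, orch)))
		return TOY_EEXIST;
	return TOY_OK;
}
```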
Signed-off-by: Richard Guy Briggs <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/audit.c | 23 +++++++++++++++++++++--
kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++++
security/yama/yama_lsm.c | 33 ---------------------------------
4 files changed, 57 insertions(+), 35 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2213ac670386..06938d0b9e0c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2047,4 +2047,7 @@ static inline void rseq_syscall(struct pt_regs *regs)
const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
+extern int task_is_descendant(struct task_struct *parent,
+ struct task_struct *child);
+
#endif
diff --git a/kernel/audit.c b/kernel/audit.c
index a862721dfd9b..efa65ec01239 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -2713,6 +2713,20 @@ int audit_signal_info(int sig, struct task_struct *t)
return audit_signal_info_syscall(t);
}
+static bool audit_contid_isnesting(struct task_struct *tsk)
+{
+ bool isowner = false;
+ bool ownerisparent = false;
+
+ rcu_read_lock();
+ if (tsk->audit && tsk->audit->cont) {
+ isowner = current == tsk->audit->cont->owner;
+ ownerisparent = task_is_descendant(tsk->audit->cont->owner, current);
+ }
+ rcu_read_unlock();
+ return !isowner && ownerisparent;
+}
+
/*
* audit_set_contid - set current task's audit contid
* @task: target task
@@ -2755,8 +2769,13 @@ int audit_set_contid(struct task_struct *task, u64 contid)
rc = -EBUSY;
goto unlock;
}
- /* if contid is already set, deny */
- if (audit_contid_set(task))
+ /* if task is not descendant, block */
+ if (task == current || !task_is_descendant(current, task)) {
+ rc = -EXDEV;
+ goto unlock;
+ }
+ /* only allow contid setting again if nesting */
+ if (audit_contid_set(task) && !audit_contid_isnesting(task))
rc = -EEXIST;
unlock:
read_unlock(&tasklist_lock);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f360326861e..e6b24c52b3c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8012,6 +8012,39 @@ void dump_cpu_task(int cpu)
}
/*
+ * task_is_descendant - walk up a process family tree looking for a match
+ * @parent: the process to compare against while walking up from child
+ * @child: the process to start from while looking upwards for parent
+ *
+ * Returns 1 if child is a descendant of parent, 0 if not.
+ */
+int task_is_descendant(struct task_struct *parent,
+ struct task_struct *child)
+{
+ int rc = 0;
+ struct task_struct *walker = child;
+
+ if (!parent || !child)
+ return 0;
+
+ rcu_read_lock();
+ if (!thread_group_leader(parent))
+ parent = rcu_dereference(parent->group_leader);
+ while (walker->pid > 0) {
+ if (!thread_group_leader(walker))
+ walker = rcu_dereference(walker->group_leader);
+ if (walker == parent) {
+ rc = 1;
+ break;
+ }
+ walker = rcu_dereference(walker->real_parent);
+ }
+ rcu_read_unlock();
+
+ return rc;
+}
+
+/*
* Nice levels are multiplicative, with a gentle 10% change for every
* nice level changed. I.e. when a CPU-bound task goes from nice 0 to
* nice 1, it will get ~10% less CPU time than another CPU-bound task
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index 536c99646f6a..24939f765df5 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -263,39 +263,6 @@ static int yama_task_prctl(int option, unsigned long arg2, unsigned long arg3,
}
/**
- * task_is_descendant - walk up a process family tree looking for a match
- * @parent: the process to compare against while walking up from child
- * @child: the process to start from while looking upwards for parent
- *
- * Returns 1 if child is a descendant of parent, 0 if not.
- */
-static int task_is_descendant(struct task_struct *parent,
- struct task_struct *child)
-{
- int rc = 0;
- struct task_struct *walker = child;
-
- if (!parent || !child)
- return 0;
-
- rcu_read_lock();
- if (!thread_group_leader(parent))
- parent = rcu_dereference(parent->group_leader);
- while (walker->pid > 0) {
- if (!thread_group_leader(walker))
- walker = rcu_dereference(walker->group_leader);
- if (walker == parent) {
- rc = 1;
- break;
- }
- walker = rcu_dereference(walker->real_parent);
- }
- rcu_read_unlock();
-
- return rc;
-}
-
-/**
* ptracer_exception_found - tracer registered as exception for this tracee
* @tracer: the task_struct of the process attempting ptrace
* @tracee: the task_struct of the process to be ptraced
--
1.8.3.1
This also adds support to qualify NETFILTER_PKT records.
Audit events could happen in a network namespace outside of a task
context due to packets received from the net that trigger an auditing
rule prior to being associated with a running task. The network
namespace could be in use by multiple containers by association to the
tasks in that network namespace. We still want a way to attribute
these events to any potential containers. Keep a list per network
namespace to track these audit container identifiers.
Add/increment the audit container identifier on:
- initial setting of the audit container identifier via /proc
- clone/fork call that inherits an audit container identifier
- unshare call that inherits an audit container identifier
- setns call that inherits an audit container identifier
Delete/decrement the audit container identifier on:
- an inherited audit container identifier dropped when child set
- process exit
- unshare call that drops a net namespace
- setns call that drops a net namespace
Add audit container identifier auxiliary record(s) to NETFILTER_PKT
event standalone records. Iterate through all potential audit container
identifiers associated with a network namespace.
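The add/increment and delete/decrement bookkeeping described above amounts to
a use-counted list per network namespace. A simplified userspace sketch
follows, assuming an entry is removed once its count drops to zero. The toy_*
names are hypothetical, and the RCU and spinlock protection used by the
kernel code is deliberately left out:

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy per-namespace entry: one node per contid with a use count. */
struct toy_netns_contid {
	struct toy_netns_contid *next;
	uint64_t contid;
	int count;
};

/* Add a contid at the head of the list, or bump its count if present. */
static int toy_netns_contid_add(struct toy_netns_contid **head,
				uint64_t contid)
{
	struct toy_netns_contid *n;

	for (n = *head; n; n = n->next)
		if (n->contid == contid) {
			n->count++;
			return 0;
		}
	n = malloc(sizeof(*n));
	if (!n)
		return -1;
	n->contid = contid;
	n->count = 1;
	n->next = *head;
	*head = n;
	return 0;
}

/* Drop one use; unlink and free the node when the count reaches zero. */
static void toy_netns_contid_del(struct toy_netns_contid **head,
				 uint64_t contid)
{
	struct toy_netns_contid **pp, *n;

	for (pp = head; (n = *pp); pp = &n->next)
		if (n->contid == contid) {
			if (--n->count == 0) {
				*pp = n->next;
				free(n);
			}
			return;
		}
}
```

Logging the NETFILTER_PKT auxiliary record then reduces to walking this list
and emitting each contid that is still present.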
Please see the github audit kernel issue for contid net support:
https://github.com/linux-audit/audit-kernel/issues/92
Please see the github audit testsuite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Neil Horman <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
---
include/linux/audit.h | 20 ++++++
kernel/audit.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++-
kernel/nsproxy.c | 4 ++
net/netfilter/nft_log.c | 11 +++-
net/netfilter/xt_AUDIT.c | 11 +++-
5 files changed, 195 insertions(+), 7 deletions(-)
diff --git a/include/linux/audit.h b/include/linux/audit.h
index c4a755ae0d61..304fbb7c3c5b 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -128,6 +128,13 @@ struct audit_task_info {
extern struct audit_task_info init_struct_audit;
+struct audit_contobj_netns {
+ struct list_head list;
+ struct audit_contobj *obj;
+ int count;
+ struct rcu_head rcu;
+};
+
extern int is_audit_feature_set(int which);
extern int __init audit_register_class(int class, unsigned *list);
@@ -233,6 +240,11 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
extern void audit_log_container_id(struct audit_context *context,
struct audit_contobj *cont);
+extern void audit_copy_namespaces(struct net *net, struct task_struct *tsk);
+extern void audit_switch_task_namespaces(struct nsproxy *ns,
+ struct task_struct *p);
+extern void audit_log_netns_contid_list(struct net *net,
+ struct audit_context *context);
extern u32 audit_enabled;
@@ -306,6 +318,14 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
static inline void audit_log_container_id(struct audit_context *context,
struct audit_contobj *cont)
{ }
+static inline void audit_copy_namespaces(struct net *net, struct task_struct *tsk)
+{ }
+static inline void audit_switch_task_namespaces(struct nsproxy *ns,
+ struct task_struct *p)
+{ }
+static inline void audit_log_netns_contid_list(struct net *net,
+ struct audit_context *context)
+{ }
#define audit_enabled AUDIT_OFF
diff --git a/kernel/audit.c b/kernel/audit.c
index 997c34178ee8..a862721dfd9b 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -59,6 +59,7 @@
#include <linux/freezer.h>
#include <linux/pid_namespace.h>
#include <net/netns/generic.h>
+#include <net/net_namespace.h>
#include "audit.h"
@@ -86,9 +87,13 @@
/**
* struct audit_net - audit private network namespace data
* @sk: communication socket
+ * @contobj_list: audit container identifier list
+ * @contobj_list_lock: audit container identifier list lock
*/
struct audit_net {
struct sock *sk;
+ struct list_head contobj_list;
+ spinlock_t contobj_list_lock;
};
/**
@@ -214,6 +219,9 @@ struct audit_reply {
static struct kmem_cache *audit_task_cache;
+void audit_netns_contid_add(struct net *net, struct audit_contobj *cont);
+void audit_netns_contid_del(struct net *net, struct audit_contobj *cont);
+
void __init audit_task_init(void)
{
audit_task_cache = kmem_cache_create("audit_task",
@@ -326,10 +334,17 @@ struct audit_task_info init_struct_audit = {
void audit_free(struct task_struct *tsk)
{
struct audit_task_info *info = tsk->audit;
+ struct nsproxy *ns = tsk->nsproxy;
+ struct audit_contobj *cont;
audit_free_syscall(tsk);
rcu_read_lock();
- _audit_contobj_put(tsk->audit->cont);
+ cont = _audit_contobj_get(tsk);
+ if (ns) {
+ audit_netns_contid_del(ns->net_ns, cont);
+ _audit_contobj_put(cont);
+ }
+ _audit_contobj_put(cont);
rcu_read_unlock();
/* Freeing the audit_task_info struct must be performed after
* audit_log_exit() due to need for loginuid and sessionid.
@@ -437,6 +452,136 @@ static struct sock *audit_get_sk(const struct net *net)
return aunet->sk;
}
+void audit_netns_contid_add(struct net *net, struct audit_contobj *cont)
+{
+ struct audit_net *aunet;
+ struct list_head *contobj_list;
+ struct audit_contobj_netns *contns;
+
+ if (!net)
+ return;
+ if (!cont)
+ return;
+ aunet = net_generic(net, audit_net_id);
+ if (!aunet)
+ return;
+ contobj_list = &aunet->contobj_list;
+ rcu_read_lock();
+ spin_lock(&aunet->contobj_list_lock);
+ list_for_each_entry_rcu(contns, contobj_list, list)
+ if (contns->obj == cont) {
+ contns->count++;
+ goto out;
+ }
+ contns = kmalloc(sizeof(*contns), GFP_ATOMIC);
+ if (contns) {
+ INIT_LIST_HEAD(&contns->list);
+ contns->obj = cont;
+ contns->count = 1;
+ list_add_rcu(&contns->list, contobj_list);
+ }
+out:
+ spin_unlock(&aunet->contobj_list_lock);
+ rcu_read_unlock();
+}
+
+void audit_netns_contid_del(struct net *net, struct audit_contobj *cont)
+{
+ struct audit_net *aunet;
+ struct list_head *contobj_list;
+ struct audit_contobj_netns *contns = NULL;
+
+ if (!net)
+ return;
+ if (!cont)
+ return;
+ aunet = net_generic(net, audit_net_id);
+ if (!aunet)
+ return;
+ contobj_list = &aunet->contobj_list;
+ rcu_read_lock();
+ spin_lock(&aunet->contobj_list_lock);
+ list_for_each_entry_rcu(contns, contobj_list, list)
+ if (contns->obj == cont) {
+ contns->count--;
+ if (contns->count < 1) {
+ list_del_rcu(&contns->list);
+ kfree_rcu(contns, rcu);
+ }
+ break;
+ }
+ spin_unlock(&aunet->contobj_list_lock);
+ rcu_read_unlock();
+}
+
+void audit_copy_namespaces(struct net *net, struct task_struct *tsk)
+{
+ struct audit_contobj *cont;
+
+ rcu_read_lock();
+ cont = _audit_contobj_get(tsk);
+ audit_netns_contid_add(net, cont);
+ rcu_read_unlock();
+}
+
+void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
+{
+ struct audit_contobj *cont;
+ struct nsproxy *new = p->nsproxy;
+
+ rcu_read_lock();
+ cont = _audit_contobj_get(p);
+ if (!cont)
+ goto out;
+ audit_netns_contid_del(ns->net_ns, cont);
+ if (new)
+ audit_netns_contid_add(new->net_ns, cont);
+ else
+ _audit_contobj_put(cont);
+ _audit_contobj_put(cont);
+out:
+ rcu_read_unlock();
+}
+
+/**
+ * audit_log_netns_contid_list - List contids for the given network namespace
+ * @net: the network namespace of interest
+ * @context: the audit context to use
+ *
+ * Description:
+ * Issues a CONTAINER_ID record with a CSV list of contids associated
+ * with a network namespace to accompany a NETFILTER_PKT record.
+ */
+void audit_log_netns_contid_list(struct net *net, struct audit_context *context)
+{
+ struct audit_buffer *ab = NULL;
+ struct audit_contobj_netns *cont;
+ struct audit_net *aunet;
+
+ /* Generate AUDIT_CONTAINER_ID record with container ID CSV list */
+ rcu_read_lock();
+ aunet = net_generic(net, audit_net_id);
+ if (!aunet)
+ goto out;
+ list_for_each_entry_rcu(cont, &aunet->contobj_list, list) {
+ if (!ab) {
+ ab = audit_log_start(context, GFP_ATOMIC,
+ AUDIT_CONTAINER_ID);
+ if (!ab) {
+ audit_log_lost("out of memory in audit_log_netns_contid_list");
+ goto out;
+ }
+ audit_log_format(ab, "contid=");
+ } else
+ audit_log_format(ab, ",");
+ audit_log_format(ab, "%llu", cont->obj->id);
+ }
+ audit_log_end(ab);
+out:
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL(audit_log_netns_contid_list);
+
void audit_panic(const char *message)
{
switch (audit_failure) {
@@ -1786,7 +1931,6 @@ static int __net_init audit_net_init(struct net *net)
.flags = NL_CFG_F_NONROOT_RECV,
.groups = AUDIT_NLGRP_MAX,
};
-
struct audit_net *aunet = net_generic(net, audit_net_id);
aunet->sk = netlink_kernel_create(net, NETLINK_AUDIT, &cfg);
@@ -1795,7 +1939,8 @@ static int __net_init audit_net_init(struct net *net)
return -ENOMEM;
}
aunet->sk->sk_sndtimeo = MAX_SCHEDULE_TIMEOUT;
-
+ INIT_LIST_HEAD(&aunet->contobj_list);
+ spin_lock_init(&aunet->contobj_list_lock);
return 0;
}
@@ -2585,6 +2730,7 @@ int audit_set_contid(struct task_struct *task, u64 contid)
int rc = 0;
struct audit_buffer *ab;
struct audit_contobj *oldcont = NULL;
+ struct net *net = task->nsproxy->net_ns;
task_lock(task);
/* Can't set if audit disabled */
@@ -2657,6 +2803,10 @@ int audit_set_contid(struct task_struct *task, u64 contid)
spin_unlock(&audit_contobj_list_lock);
task->audit->cont = newcont;
_audit_contobj_put(oldcont);
+ audit_netns_contid_del(net, oldcont);
+ _audit_contobj_put(oldcont);
+ _audit_contobj_hold(newcont);
+ audit_netns_contid_add(net, newcont);
}
conterror:
task_unlock(task);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b03df67621d0..5eddb3377049 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -26,6 +26,7 @@
#include <linux/syscalls.h>
#include <linux/cgroup.h>
#include <linux/perf_event.h>
+#include <linux/audit.h>
static struct kmem_cache *nsproxy_cachep;
@@ -187,6 +188,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
}
tsk->nsproxy = new_ns;
+ if (flags & CLONE_NEWNET)
+ audit_copy_namespaces(new_ns->net_ns, tsk);
return 0;
}
@@ -249,6 +252,7 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
ns = p->nsproxy;
p->nsproxy = new;
task_unlock(p);
+ audit_switch_task_namespaces(ns, p);
if (ns && atomic_dec_and_test(&ns->count))
free_nsproxy(ns);
diff --git a/net/netfilter/nft_log.c b/net/netfilter/nft_log.c
index fe4831f2258f..98d1e7e1a83c 100644
--- a/net/netfilter/nft_log.c
+++ b/net/netfilter/nft_log.c
@@ -66,13 +66,16 @@ static void nft_log_eval_audit(const struct nft_pktinfo *pkt)
struct sk_buff *skb = pkt->skb;
struct audit_buffer *ab;
int fam = -1;
+ struct audit_context *context;
+ struct net *net;
if (!audit_enabled)
return;
- ab = audit_log_start(NULL, GFP_ATOMIC, AUDIT_NETFILTER_PKT);
+ context = audit_alloc_local(GFP_ATOMIC);
+ ab = audit_log_start(context, GFP_ATOMIC, AUDIT_NETFILTER_PKT);
if (!ab)
- return;
+ goto errout;
audit_log_format(ab, "mark=%#x", skb->mark);
@@ -99,6 +102,10 @@ static void nft_log_eval_audit(const struct nft_pktinfo *pkt)
audit_log_format(ab, " saddr=? daddr=? proto=-1");
audit_log_end(ab);
+ net = xt_net(&pkt->xt);
+ audit_log_netns_contid_list(net, context);
+errout:
+ audit_free_context(context);
}
static void nft_log_eval(const struct nft_expr *expr,
diff --git a/net/netfilter/xt_AUDIT.c b/net/netfilter/xt_AUDIT.c
index 9cdc16b0d0d8..ecf868a1abde 100644
--- a/net/netfilter/xt_AUDIT.c
+++ b/net/netfilter/xt_AUDIT.c
@@ -68,10 +68,13 @@ static bool audit_ip6(struct audit_buffer *ab, struct sk_buff *skb)
{
struct audit_buffer *ab;
int fam = -1;
+ struct audit_context *context;
+ struct net *net;
if (audit_enabled == AUDIT_OFF)
- goto errout;
- ab = audit_log_start(NULL, GFP_ATOMIC, AUDIT_NETFILTER_PKT);
+ goto out;
+ context = audit_alloc_local(GFP_ATOMIC);
+ ab = audit_log_start(context, GFP_ATOMIC, AUDIT_NETFILTER_PKT);
if (ab == NULL)
goto errout;
@@ -101,7 +104,11 @@ static bool audit_ip6(struct audit_buffer *ab, struct sk_buff *skb)
audit_log_end(ab);
+ net = xt_net(par);
+ audit_log_netns_contid_list(net, context);
errout:
+ audit_free_context(context);
+out:
return XT_CONTINUE;
}
--
1.8.3.1
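The per-network-namespace bookkeeping in the patch above (one entry per audit container identifier, with a count of the tasks using it) can be modeled in userspace roughly as follows. This is only a sketch: struct contns and the netns_contid_*() helpers are hypothetical stand-ins for struct audit_contobj_netns and audit_netns_contid_add()/audit_netns_contid_del(), and the spinlock and RCU protection used in the kernel are deliberately left out.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct audit_contobj_netns. */
struct contns {
	unsigned long long id;	/* audit container identifier */
	int count;		/* tasks in this netns using the id */
	struct contns *next;
};

/* Add a reference for id; create the entry on first use. Returns 0 or -1. */
static int netns_contid_add(struct contns **head, unsigned long long id)
{
	struct contns *c;

	for (c = *head; c; c = c->next)
		if (c->id == id) {
			c->count++;
			return 0;
		}
	c = malloc(sizeof(*c));
	if (!c)
		return -1;
	c->id = id;
	c->count = 1;
	c->next = *head;
	*head = c;
	return 0;
}

/* Drop a reference for id; unlink and free the entry when it hits zero. */
static void netns_contid_del(struct contns **head, unsigned long long id)
{
	struct contns **pp, *c;

	for (pp = head; (c = *pp); pp = &c->next)
		if (c->id == id) {
			if (--c->count < 1) {
				*pp = c->next;
				free(c);
			}
			return;
		}
}

/* Number of distinct identifiers currently tracked. */
static int netns_contid_entries(const struct contns *head)
{
	int n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}
```

An identifier stays on the list until the last task using it in that namespace is removed, which is what lets a NETFILTER_PKT record list every container that might be associated with the namespace.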
Track the parent container of a container to be able to filter and
report nesting.
Now that we have a way to track and check the parent container of a
container, modify the contid field format to report that nesting
with a caret ("^") modifier. The
original field format was "contid=<contid>" for task-associated records
and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
records. The new field format is
"contid=<contid>[,^<contid>[...]][,<contid>[...]]".
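As a rough illustration of the nested format, here is a minimal userspace sketch that walks a parent chain and emits "<contid>,^<parent>". struct contobj and format_contid() are hypothetical names standing in for struct audit_contobj and audit_log_contid(); the sketch writes into a plain buffer rather than an audit_buffer.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct audit_contobj. */
struct contobj {
	unsigned long long id;
	const struct contobj *parent;	/* NULL for a top-level container */
};

/*
 * Write the contid chain for cont into buf as "<id>[,^<parent>[,^...]]",
 * or "-1" when cont is NULL.  Returns the number of characters written,
 * excluding the terminating NUL; output is truncated to fit size.
 */
static size_t format_contid(char *buf, size_t size, const struct contobj *cont)
{
	size_t len = 0;
	int n;

	if (size == 0)
		return 0;
	buf[0] = '\0';
	if (!cont) {
		n = snprintf(buf, size, "-1");
		if (n < 0)
			return 0;
		return ((size_t)n >= size) ? size - 1 : (size_t)n;
	}
	while (cont && len < size - 1) {
		/* Append at buf + len so earlier ids are not overwritten. */
		n = snprintf(buf + len, size - len, "%llu%s", cont->id,
			     cont->parent ? ",^" : "");
		if (n < 0)
			break;
		len += ((size_t)n >= size - len) ? size - len - 1 : (size_t)n;
		cont = cont->parent;
	}
	return len;
}
```

For a container 123 nested inside container 456 this produces "123,^456"; a top-level container produces just its own id.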
Signed-off-by: Richard Guy Briggs <[email protected]>
---
include/linux/audit.h | 1 +
kernel/audit.c | 60 ++++++++++++++++++++++++++++++++++++++++++---------
kernel/audit.h | 2 ++
kernel/auditfilter.c | 17 ++++++++++++++-
kernel/auditsc.c | 2 +-
5 files changed, 70 insertions(+), 12 deletions(-)
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 304fbb7c3c5b..025b52ae8422 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -115,6 +115,7 @@ struct audit_contobj {
refcount_t refcount;
refcount_t sigflag;
struct rcu_head rcu;
+ struct audit_contobj *parent;
};
struct audit_task_info {
diff --git a/kernel/audit.c b/kernel/audit.c
index efa65ec01239..aaf74702e993 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -221,6 +221,7 @@ struct audit_reply {
void audit_netns_contid_add(struct net *net, struct audit_contobj *cont);
void audit_netns_contid_del(struct net *net, struct audit_contobj *cont);
+void audit_log_contid(struct audit_buffer *ab, struct audit_contobj *cont);
void __init audit_task_init(void)
{
@@ -277,6 +278,7 @@ static void _audit_contobj_put_sig(struct audit_contobj *cont)
refcount_set(&cont->sigflag, 0);
if (!refcount_read(&cont->refcount)) {
put_task_struct(cont->owner);
+ _audit_contobj_put(cont->parent);
list_del_rcu(&cont->list);
kfree_rcu(cont, rcu);
}
@@ -574,7 +576,7 @@ void audit_log_netns_contid_list(struct net *net, struct audit_context *context)
audit_log_format(ab, "contid=");
} else
audit_log_format(ab, ",");
- audit_log_format(ab, "%llu", cont->obj->id);
+ audit_log_contid(ab, cont->obj);
}
audit_log_end(ab);
out:
@@ -1747,7 +1749,9 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
kfree(sig_data);
break;
case AUDIT_SIGNAL_INFO2: {
+ char *contidstr = NULL;
unsigned int contidstrlen = 0;
+ struct audit_contobj *cont = audit_sig_cid;
len = 0;
if (audit_sig_sid) {
@@ -1757,13 +1761,27 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
return err;
}
if (audit_sig_cid) {
- contidstr = kmalloc(21, GFP_KERNEL);
+ contidstr = kmalloc(AUDIT_MESSAGE_TEXT_MAX, GFP_KERNEL);
if (!contidstr) {
if (audit_sig_sid)
security_release_secctx(ctx, len);
return -ENOMEM;
}
- contidstrlen = scnprintf(contidstr, 20, "%llu", audit_sig_cid->id);
+ rcu_read_lock();
+ while (cont) {
+ if (cont->parent)
+ contidstrlen += scnprintf(contidstr + contidstrlen,
+ AUDIT_MESSAGE_TEXT_MAX -
+ contidstrlen,
+ "%llu,^", cont->id);
+ else
+ contidstrlen += scnprintf(contidstr + contidstrlen,
+ AUDIT_MESSAGE_TEXT_MAX -
+ contidstrlen,
+ "%llu", cont->id);
+ cont = cont->parent;
+ }
+ rcu_read_unlock();
}
sig_data2 = kmalloc(sizeof(*sig_data2) + contidstrlen + len, GFP_KERNEL);
if (!sig_data2) {
@@ -2444,6 +2462,23 @@ void audit_log_session_info(struct audit_buffer *ab)
audit_log_format(ab, "auid=%u ses=%u", auid, sessionid);
}
+void audit_log_contid(struct audit_buffer *ab, struct audit_contobj *cont)
+{
+ if (!cont) {
+ audit_log_format(ab, "-1");
+ return;
+ }
+ rcu_read_lock();
+ while (cont) {
+ if (cont->parent)
+ audit_log_format(ab, "%llu,^", cont->id);
+ else
+ audit_log_format(ab, "%llu", cont->id);
+ cont = cont->parent;
+ }
+ rcu_read_unlock();
+}
+
/*
* audit_log_container_id - report container info
* @context: task or local context for record
@@ -2460,7 +2495,8 @@ void audit_log_container_id(struct audit_context *context,
ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER_ID);
if (!ab)
return;
- audit_log_format(ab, "contid=%llu", contid);
+ audit_log_format(ab, "contid=");
+ audit_log_contid(ab, cont);
audit_log_end(ab);
}
EXPORT_SYMBOL(audit_log_container_id);
@@ -2810,6 +2846,7 @@ int audit_set_contid(struct task_struct *task, u64 contid)
INIT_LIST_HEAD(&newcont->list);
newcont->id = contid;
newcont->owner = get_task_struct(current);
+ newcont->parent = _audit_contobj_get(newcont->owner);
refcount_set(&newcont->refcount, 1);
list_add_rcu(&newcont->list,
&audit_contid_hash[h]);
@@ -2828,6 +2865,7 @@ int audit_set_contid(struct task_struct *task, u64 contid)
audit_netns_contid_add(net, newcont);
}
conterror:
+ rcu_read_unlock();
task_unlock(task);
if (!audit_enabled)
@@ -2837,12 +2875,13 @@ int audit_set_contid(struct task_struct *task, u64 contid)
if (!ab)
return rc;
- audit_log_format(ab,
- "op=set opid=%d contid=%llu old-contid=%llu",
- task_tgid_nr(task), contid, oldcont ? oldcont->id : -1);
+ audit_log_format(ab, "op=set opid=%d contid=%llu old-contid=",
+ task_tgid_nr(task), contid);
+ audit_log_contid(ab, oldcont);
+ audit_log_end(ab);
+ rcu_read_lock();
_audit_contobj_put(oldcont);
rcu_read_unlock();
- audit_log_end(ab);
return rc;
}
@@ -2859,8 +2898,9 @@ void audit_log_container_drop(void)
ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
if (!ab)
goto out;
- audit_log_format(ab, "op=drop opid=%d contid=%llu old-contid=%llu",
- task_tgid_nr(current), cont->id, cont->id);
+ audit_log_format(ab, "op=drop opid=%d contid=%llu old-contid=",
+ task_tgid_nr(current), AUDIT_CID_UNSET);
+ audit_log_contid(ab, cont);
audit_log_end(ab);
out:
rcu_read_unlock();
diff --git a/kernel/audit.h b/kernel/audit.h
index 34d8ec4bc6ef..7bea5b51124b 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -229,6 +229,8 @@ static inline int audit_hash_contid(u64 contid)
extern int audit_match_class(int class, unsigned syscall);
extern int audit_comparator(const u32 left, const u32 op, const u32 right);
extern int audit_comparator64(const u64 left, const u32 op, const u64 right);
+extern int audit_contid_comparator(const u64 left, const u32 op,
+ const u64 right);
extern int audit_uid_comparator(kuid_t left, u32 op, kuid_t right);
extern int audit_gid_comparator(kgid_t left, u32 op, kgid_t right);
extern int parent_len(const char *path);
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index d812698efc1d..981c72a8b863 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -1302,6 +1302,21 @@ int audit_gid_comparator(kgid_t left, u32 op, kgid_t right)
}
}
+int audit_contid_comparator(u64 left, u32 op, u64 right)
+{
+ struct audit_contobj *cont = NULL;
+ int h;
+ int result = 0;
+
+ h = audit_hash_contid(left);
+ list_for_each_entry_rcu(cont, &audit_contid_hash[h], list) {
+ result = audit_comparator64(cont->id, op, right);
+ if (result)
+ break;
+ }
+ return result;
+}
+
/**
* parent_len - find the length of the parent portion of a pathname
* @path: pathname of which to determine length
@@ -1393,7 +1408,7 @@ int audit_filter(int msgtype, unsigned int listtype)
f->op, f->val);
break;
case AUDIT_CONTID:
- result = audit_comparator64(audit_get_contid(current),
+ result = audit_contid_comparator(audit_get_contid(current),
f->op, f->val64);
break;
case AUDIT_MSGTYPE:
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index baa5709590b4..9198857ac721 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -641,7 +641,7 @@ static int audit_filter_rules(struct task_struct *tsk,
f->op, f->val);
break;
case AUDIT_CONTID:
- result = audit_comparator64(audit_get_contid(tsk),
+ result = audit_contid_comparator(audit_get_contid(tsk),
f->op, f->val64);
break;
case AUDIT_SUBJ_USER:
--
1.8.3.1
Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
process in a non-init user namespace the capability to set audit
container identifiers of individual children.
Provide the /proc/$PID/audit_capcontainerid interface to capcontid.
Valid values are: 1==enabled, 0==disabled
Writing a "1" to this special file for the target process $PID will
enable the target process to set audit container identifiers of its
descendants.
A process must already have CAP_AUDIT_CONTROL in the initial user
namespace or have had audit_capcontid enabled by a previous use of this
feature by its parent on this process in order to be able to enable it
for another process. The target process must be a descendant of the
calling process.
Report this action in new message type AUDIT_SET_CAPCONTID 1022 with
fields opid= capcontid= old-capcontid=
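The checks as they appear in this patch's audit_set_capcontid() can be restated as a small pure function. This is a sketch of how the posted code reads, not a statement of intended policy; struct capcontid_req and capcontid_check() are hypothetical names, and each boolean stands in for the kernel call named in its comment.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical inputs describing the caller and the target task. */
struct capcontid_req {
	bool target_is_self;		/* task == current */
	bool target_is_descendant;	/* task_is_descendant(current, task) */
	bool caller_in_init_userns;	/* current_user_ns() == &init_user_ns */
	bool caller_has_cap;		/* capable(CAP_AUDIT_CONTROL) */
	bool caller_capcontid;		/* audit_get_capcontid(current) */
};

/* Returns 0 if the write would be accepted, or a negative errno. */
static int capcontid_check(const struct capcontid_req *r)
{
	/* The target must be a strict descendant of the caller. */
	if (r->target_is_self || !r->target_is_descendant)
		return -EXDEV;
	/* As posted, the capability test is applied only in the init userns. */
	if (r->caller_in_init_userns &&
	    !r->caller_has_cap && !r->caller_capcontid)
		return -EPERM;
	return 0;
}
```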
Signed-off-by: Richard Guy Briggs <[email protected]>
---
fs/proc/base.c | 57 +++++++++++++++++++++++++++++++++++++++++++++-
include/linux/audit.h | 14 ++++++++++++
include/uapi/linux/audit.h | 1 +
kernel/audit.c | 38 ++++++++++++++++++++++++++++++-
4 files changed, 108 insertions(+), 2 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 794474cd8f35..1083db2ce345 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1329,7 +1329,7 @@ static ssize_t proc_contid_read(struct file *file, char __user *buf,
if (!task)
return -ESRCH;
/* if we don't have caps, reject */
- if (!capable(CAP_AUDIT_CONTROL))
+ if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
return -EPERM;
length = scnprintf(tmpbuf, TMPBUFLEN, "%llu", audit_get_contid(task));
put_task_struct(task);
@@ -1370,6 +1370,59 @@ static ssize_t proc_contid_write(struct file *file, const char __user *buf,
.write = proc_contid_write,
.llseek = generic_file_llseek,
};
+
+static ssize_t proc_capcontid_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file_inode(file);
+ struct task_struct *task = get_proc_task(inode);
+ ssize_t length;
+ char tmpbuf[TMPBUFLEN];
+
+ if (!task)
+ return -ESRCH;
+ /* if we don't have caps, reject */
+ if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
+ return -EPERM;
+ length = scnprintf(tmpbuf, TMPBUFLEN, "%u", audit_get_capcontid(task));
+ put_task_struct(task);
+ return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
+}
+
+static ssize_t proc_capcontid_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file_inode(file);
+ u32 capcontid;
+ int rv;
+ struct task_struct *task = get_proc_task(inode);
+
+ if (!task)
+ return -ESRCH;
+ if (*ppos != 0) {
+ /* No partial writes. */
+ put_task_struct(task);
+ return -EINVAL;
+ }
+
+ rv = kstrtou32_from_user(buf, count, 10, &capcontid);
+ if (rv < 0) {
+ put_task_struct(task);
+ return rv;
+ }
+
+ rv = audit_set_capcontid(task, capcontid);
+ put_task_struct(task);
+ if (rv < 0)
+ return rv;
+ return count;
+}
+
+static const struct file_operations proc_capcontid_operations = {
+ .read = proc_capcontid_read,
+ .write = proc_capcontid_write,
+ .llseek = generic_file_llseek,
+};
#endif
#ifdef CONFIG_FAULT_INJECTION
@@ -3273,6 +3326,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
+ REG("audit_capcontainerid", S_IWUSR|S_IRUSR, proc_capcontid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
@@ -3613,6 +3667,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
+ REG("audit_capcontainerid", S_IWUSR|S_IRUSR, proc_capcontid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 025b52ae8422..2b3a2b6020ed 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -122,6 +122,7 @@ struct audit_task_info {
kuid_t loginuid;
unsigned int sessionid;
struct audit_contobj *cont;
+ u32 capcontid;
#ifdef CONFIG_AUDITSYSCALL
struct audit_context *ctx;
#endif
@@ -230,6 +231,14 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
return tsk->audit->sessionid;
}
+static inline u32 audit_get_capcontid(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return 0;
+ return tsk->audit->capcontid;
+}
+
+extern int audit_set_capcontid(struct task_struct *tsk, u32 enable);
extern int audit_set_contid(struct task_struct *tsk, u64 contid);
static inline u64 audit_get_contid(struct task_struct *tsk)
@@ -311,6 +320,11 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
return AUDIT_SID_UNSET;
}
+static inline u32 audit_get_capcontid(struct task_struct *tsk)
+{
+ return 0;
+}
+
static inline u64 audit_get_contid(struct task_struct *tsk)
{
return AUDIT_CID_UNSET;
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 831c12bdd235..5e30f4c95dc2 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -73,6 +73,7 @@
#define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */
#define AUDIT_CONTAINER_OP 1020 /* Define the container id and info */
#define AUDIT_SIGNAL_INFO2 1021 /* Get info auditd signal sender */
+#define AUDIT_SET_CAPCONTID 1022 /* Set cap_contid of a task */
#define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */
#define AUDIT_USER_AVC 1107 /* We filter this differently */
diff --git a/kernel/audit.c b/kernel/audit.c
index aaf74702e993..454473f2e193 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -307,6 +307,7 @@ int audit_alloc(struct task_struct *tsk)
rcu_read_lock();
info->cont = _audit_contobj_get(current);
rcu_read_unlock();
+ info->capcontid = 0;
tsk->audit = info;
ret = audit_alloc_syscall(tsk);
@@ -322,6 +323,7 @@ struct audit_task_info init_struct_audit = {
.loginuid = INVALID_UID,
.sessionid = AUDIT_SID_UNSET,
.cont = NULL,
+ .capcontid = 0,
#ifdef CONFIG_AUDITSYSCALL
.ctx = NULL,
#endif
@@ -2763,6 +2765,40 @@ static bool audit_contid_isnesting(struct task_struct *tsk)
return !isowner && ownerisparent;
}
+int audit_set_capcontid(struct task_struct *task, u32 enable)
+{
+ u32 oldcapcontid;
+ int rc = 0;
+ struct audit_buffer *ab;
+
+ if (!task->audit)
+ return -ENOPROTOOPT;
+ oldcapcontid = audit_get_capcontid(task);
+ /* if task is not descendant, block */
+ if (task == current || !task_is_descendant(current, task))
+ rc = -EXDEV;
+ else if (current_user_ns() == &init_user_ns) {
+ if (!capable(CAP_AUDIT_CONTROL) &&
+ !audit_get_capcontid(current))
+ rc = -EPERM;
+ }
+ if (!rc)
+ task->audit->capcontid = enable;
+
+ if (!audit_enabled)
+ return rc;
+
+ ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_SET_CAPCONTID);
+ if (!ab)
+ return rc;
+
+ audit_log_format(ab,
+ "opid=%d capcontid=%u old-capcontid=%u",
+ task_tgid_nr(task), enable, oldcapcontid);
+ audit_log_end(ab);
+ return rc;
+}
+
/*
* audit_set_contid - set current task's audit contid
* @task: target task
@@ -2795,7 +2831,7 @@ int audit_set_contid(struct task_struct *task, u64 contid)
goto unlock;
}
/* if we don't have caps, reject */
- if (!capable(CAP_AUDIT_CONTROL)) {
+ if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current)) {
rc = -EPERM;
goto unlock;
}
--
1.8.3.1
On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <[email protected]> wrote:
>
> Require the target task to be a descendant of the container
> orchestrator/engine.
>
> You would only change the audit container ID from one set or inherited
> value to another if you were nesting containers.
>
> If changing the contid, the container orchestrator/engine must be a
> descendant and not same orchestrator as the one that set it so it is not
> possible to change the contid of another orchestrator's container.
>
> Since the task_is_descendant() function is used in YAMA and in audit,
> remove the duplication and pull the function into kernel/sched/core.c
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> include/linux/sched.h | 3 +++
> kernel/audit.c | 23 +++++++++++++++++++++--
> kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++++
> security/yama/yama_lsm.c | 33 ---------------------------------
> 4 files changed, 57 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2213ac670386..06938d0b9e0c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2047,4 +2047,7 @@ static inline void rseq_syscall(struct pt_regs *regs)
>
> const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
>
> +extern int task_is_descendant(struct task_struct *parent,
> + struct task_struct *child);
> +
> #endif
> diff --git a/kernel/audit.c b/kernel/audit.c
> index a862721dfd9b..efa65ec01239 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -2713,6 +2713,20 @@ int audit_signal_info(int sig, struct task_struct *t)
> return audit_signal_info_syscall(t);
> }
>
> +static bool audit_contid_isnesting(struct task_struct *tsk)
> +{
> + bool isowner = false;
> + bool ownerisparent = false;
> +
> + rcu_read_lock();
> + if (tsk->audit && tsk->audit->cont) {
> + isowner = current == tsk->audit->cont->owner;
> + ownerisparent = task_is_descendant(tsk->audit->cont->owner, current);
I want to make sure I'm understanding this correctly and I keep
mentally tripping over something: it seems like for a given audit
container ID a task is either the owner or a descendant; there is no
third state, is that correct?
Assuming that is true, can the descendant check simply be a negative
owner check given they both have the same audit container ID?
> + }
> + rcu_read_unlock();
> + return !isowner && ownerisparent;
> +}
> +
> /*
> * audit_set_contid - set current task's audit contid
> * @task: target task
> @@ -2755,8 +2769,13 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> rc = -EBUSY;
> goto unlock;
> }
> - /* if contid is already set, deny */
> - if (audit_contid_set(task))
> + /* if task is not descendant, block */
> + if (task == current || !task_is_descendant(current, task)) {
I'm also still fuzzy on why we can't let a task set its own audit
container ID, assuming it meets all the criteria established in patch
2/13. It somewhat made sense when you were tracking inherited vs
explicitly set audit container IDs, but that doesn't appear to be the
case so far in this patchset, yes?
> + rc = -EXDEV;
I'm fairly confident we had a discussion about not using all these
different error codes, but that may be a moot point given my next
comment.
> + goto unlock;
> + }
> + /* only allow contid setting again if nesting */
> + if (audit_contid_set(task) && !audit_contid_isnesting(task))
> rc = -EEXIST;
It seems like what we need in audit_set_contid() is a check to ensure
that the task being modified is only modified by the owner of the
audit container ID, yes? If so, I would think we could do this quite
easily with the following, or similar logic, (NOTE: assumes both
current and tsk are properly set up):
if ((current->audit->cont != tsk->audit->cont) ||
(current->audit->cont->owner != current))
return -EACCES;
This is somewhat independent of the above issue, but we may also want
to add to the capability check. Patch 2 adds a
"capable(CAP_AUDIT_CONTROL)" which is good, but perhaps we also need a
"ns_capable(CAP_AUDIT_CONTROL)" to allow a given audit container ID
orchestrator/owner the ability to control which of its descendants
can change their audit container ID, for example:
if (!capable(CAP_AUDIT_CONTROL) ||
!ns_capable(current->nsproxy->user_ns, CAP_AUDIT_CONTROL))
return -EPERM;
--
paul moore
http://www.paul-moore.com
On Sat, Jun 27, 2020 at 9:24 AM Richard Guy Briggs <[email protected]> wrote:
>
> Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> process in a non-init user namespace the capability to set audit
> container identifiers of individual children.
>
> Provide the /proc/$PID/audit_capcontainerid interface to capcontid.
> Valid values are: 1==enabled, 0==disabled
>
> Writing a "1" to this special file for the target process $PID will
> enable the target process to set audit container identifiers of its
> descendants.
>
> A process must already have CAP_AUDIT_CONTROL in the initial user
> namespace or have had audit_capcontid enabled by a previous use of this
> feature by its parent on this process in order to be able to enable it
> for another process. The target process must be a descendant of the
> calling process.
>
> Report this action in new message type AUDIT_SET_CAPCONTID 1022 with
> fields opid= capcontid= old-capcontid=
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> fs/proc/base.c | 57 +++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/audit.h | 14 ++++++++++++
> include/uapi/linux/audit.h | 1 +
> kernel/audit.c | 38 ++++++++++++++++++++++++++++++-
> 4 files changed, 108 insertions(+), 2 deletions(-)
This seems very similar to the capable/ns_capable combination I
mentioned in patch 11/13; any reasons why you feel that this might be
a better approach? My current thinking is that the capable/ns_capable
approach is preferable as it leverages existing kernel mechanisms and
doesn't require us to reinvent the wheel in the audit subsystem.
--
paul moore
http://www.paul-moore.com
On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <[email protected]> wrote:
>
> This also adds support to qualify NETFILTER_PKT records.
>
> Audit events could happen in a network namespace outside of a task
> context due to packets received from the net that trigger an auditing
> rule prior to being associated with a running task. The network
> namespace could be in use by multiple containers by association to the
> tasks in that network namespace. We still want a way to attribute
> these events to any potential containers. Keep a list per network
> namespace to track these audit container identifiers.
>
> Add/increment the audit container identifier on:
> - initial setting of the audit container identifier via /proc
> - clone/fork call that inherits an audit container identifier
> - unshare call that inherits an audit container identifier
> - setns call that inherits an audit container identifier
> Delete/decrement the audit container identifier on:
> - an inherited audit container identifier dropped when child set
> - process exit
> - unshare call that drops a net namespace
> - setns call that drops a net namespace
>
> Add audit container identifier auxiliary record(s) to NETFILTER_PKT
> event standalone records. Iterate through all potential audit container
> identifiers associated with a network namespace.
>
> Please see the github audit kernel issue for contid net support:
> https://github.com/linux-audit/audit-kernel/issues/92
> Please see the github audit testsuite issue for the test case:
> https://github.com/linux-audit/audit-testsuite/issues/64
> Please see the github audit wiki for the feature overview:
> https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> Signed-off-by: Richard Guy Briggs <[email protected]>
> Acked-by: Neil Horman <[email protected]>
> Reviewed-by: Ondrej Mosnacek <[email protected]>
> ---
> include/linux/audit.h | 20 ++++++
> kernel/audit.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++-
> kernel/nsproxy.c | 4 ++
> net/netfilter/nft_log.c | 11 +++-
> net/netfilter/xt_AUDIT.c | 11 +++-
> 5 files changed, 195 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index c4a755ae0d61..304fbb7c3c5b 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -128,6 +128,13 @@ struct audit_task_info {
>
> extern struct audit_task_info init_struct_audit;
>
> +struct audit_contobj_netns {
> + struct list_head list;
> + struct audit_contobj *obj;
> + int count;
This seems like it might be a good candidate for refcount_t, yes?
> + struct rcu_head rcu;
> +};
...
> diff --git a/kernel/audit.c b/kernel/audit.c
> index 997c34178ee8..a862721dfd9b 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -437,6 +452,136 @@ static struct sock *audit_get_sk(const struct net *net)
> return aunet->sk;
> }
>
> +void audit_netns_contid_add(struct net *net, struct audit_contobj *cont)
> +{
> + struct audit_net *aunet;
> + struct list_head *contobj_list;
> + struct audit_contobj_netns *contns;
> +
> + if (!net)
> + return;
> + if (!cont)
> + return;
> + aunet = net_generic(net, audit_net_id);
> + if (!aunet)
> + return;
> + contobj_list = &aunet->contobj_list;
> + rcu_read_lock();
> + spin_lock(&aunet->contobj_list_lock);
> + list_for_each_entry_rcu(contns, contobj_list, list)
> + if (contns->obj == cont) {
> + contns->count++;
> + goto out;
> + }
> + contns = kmalloc(sizeof(*contns), GFP_ATOMIC);
> + if (contns) {
> + INIT_LIST_HEAD(&contns->list);
> + contns->obj = cont;
> + contns->count = 1;
> + list_add_rcu(&contns->list, contobj_list);
> + }
> +out:
> + spin_unlock(&aunet->contobj_list_lock);
> + rcu_read_unlock();
> +}
> +
> +void audit_netns_contid_del(struct net *net, struct audit_contobj *cont)
> +{
> + struct audit_net *aunet;
> + struct list_head *contobj_list;
> + struct audit_contobj_netns *contns = NULL;
> +
> + if (!net)
> + return;
> + if (!cont)
> + return;
> + aunet = net_generic(net, audit_net_id);
> + if (!aunet)
> + return;
> + contobj_list = &aunet->contobj_list;
> + rcu_read_lock();
> + spin_lock(&aunet->contobj_list_lock);
> + list_for_each_entry_rcu(contns, contobj_list, list)
> + if (contns->obj == cont) {
> + contns->count--;
> + if (contns->count < 1) {
One could simplify this with "(--contns->count) < 1", although if it
is changed to a refcount_t (which seems like a smart thing), the
normal decrement/test would be the best choice.
> + list_del_rcu(&contns->list);
> + kfree_rcu(contns, rcu);
> + }
> + break;
> + }
> + spin_unlock(&aunet->contobj_list_lock);
> + rcu_read_unlock();
> +}
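To make the decrement/test idea concrete, here is a minimal userspace
sketch using C11 atomics. It is not kernel code and not part of the
patch: struct contns_model and the contns_model_*() helpers are
hypothetical stand-ins, and the atomic_int only approximates what a
refcount_t with refcount_dec_and_test() would provide in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct audit_contobj_netns. */
struct contns_model {
	atomic_int count;	/* stands in for a refcount_t member */
};

/* Take the first reference on a freshly allocated entry. */
static void contns_model_init(struct contns_model *c)
{
	atomic_init(&c->count, 1);
}

/* Take an additional reference on an entry that is already listed. */
static void contns_model_get(struct contns_model *c)
{
	int old = atomic_fetch_add(&c->count, 1);

	assert(old > 0);	/* incrementing from zero would be a bug */
}

/*
 * Drop a reference. Returns true only for the caller that dropped the
 * last one; that caller would then unlink and free the entry.
 */
static bool contns_model_put(struct contns_model *c)
{
	int old = atomic_fetch_sub(&c->count, 1);

	assert(old > 0);	/* underflow check */
	return old == 1;
}
```

With this shape the _del path reduces to "if (put(entry)) unlink and
free", and the decrement and the test can no longer be separated.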
--
paul moore
http://www.paul-moore.com
On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <[email protected]> wrote:
>
> Track the parent container of a container to be able to filter and
> report nesting.
>
> Now that we have a way to track and check the parent container of a
> container, modify the contid field format to be able to report that
> nesting using a caret ("^") modifier to indicate nesting. The
> original field format was "contid=<contid>" for task-associated records
> and "contid=<contid>[,<contid>[...]]" for network-namespace-associated
> records. The new field format is
> "contid=<contid>[,^<contid>[...]][,<contid>[...]]".
I feel like this is a case which could really benefit from an example
in the commit description showing multiple levels of nesting, with
some leaf audit container IDs at each level. This way we have a
canonical example for people who want to understand how to parse the
list and properly sort out the inheritance.
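As a starting point for that kind of example, the grammar quoted above
can at least be read mechanically. The following is a minimal userspace
sketch, not kernel or auditd code: struct contid_entry and
contid_field_parse() are hypothetical names, and a value such as
"10,^20,30" is invented for illustration. The sketch only records which
entries carry the "^" modifier; which entry is the parent of which is
not spelled out in the description above, so it deliberately does not
guess at the inheritance.

```c
#include <stdbool.h>
#include <stdlib.h>

struct contid_entry {
	unsigned long long contid;
	bool nested;	/* true if the entry carried the "^" modifier */
};

/*
 * Parse the value part of a "contid=" field, e.g. "10,^20,30".
 * Returns the number of entries stored in out[], or -1 on malformed
 * input or if more than max entries are present.
 */
static int contid_field_parse(const char *value, struct contid_entry *out,
			      int max)
{
	int n = 0;
	const char *p = value;

	while (*p) {
		char *end;
		bool nested = false;

		if (*p == '^') {
			nested = true;
			p++;
		}
		if (*p < '0' || *p > '9')
			return -1;	/* strtoull() would accept signs and spaces */
		if (n == max)
			return -1;
		out[n].contid = strtoull(p, &end, 10);
		out[n].nested = nested;
		n++;
		if (*end == ',') {
			if (end[1] == '\0')
				return -1;	/* trailing comma */
			p = end + 1;
		} else if (*end == '\0') {
			p = end;
		} else {
			return -1;
		}
	}
	if (n > 0 && out[0].nested)
		return -1;	/* the grammar starts with an unmodified contid */
	return n;
}
```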
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> include/linux/audit.h | 1 +
> kernel/audit.c | 60 ++++++++++++++++++++++++++++++++++++++++++---------
> kernel/audit.h | 2 ++
> kernel/auditfilter.c | 17 ++++++++++++++-
> kernel/auditsc.c | 2 +-
> 5 files changed, 70 insertions(+), 12 deletions(-)
--
paul moore
http://www.paul-moore.com
On 2020-07-05 11:11, Paul Moore wrote:
> On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <[email protected]> wrote:
> >
> > This also adds support to qualify NETFILTER_PKT records.
> >
> > Audit events could happen in a network namespace outside of a task
> > context due to packets received from the net that trigger an auditing
> > rule prior to being associated with a running task. The network
> > namespace could be in use by multiple containers by association to the
> > tasks in that network namespace. We still want a way to attribute
> > these events to any potential containers. Keep a list per network
> > namespace to track these audit container identifiers.
> >
> > Add/increment the audit container identifier on:
> > - initial setting of the audit container identifier via /proc
> > - clone/fork call that inherits an audit container identifier
> > - unshare call that inherits an audit container identifier
> > - setns call that inherits an audit container identifier
> > Delete/decrement the audit container identifier on:
> > - an inherited audit container identifier dropped when child set
> > - process exit
> > - unshare call that drops a net namespace
> > - setns call that drops a net namespace
> >
> > Add audit container identifier auxiliary record(s) to NETFILTER_PKT
> > event standalone records. Iterate through all potential audit container
> > identifiers associated with a network namespace.
> >
> > Please see the github audit kernel issue for contid net support:
> > https://github.com/linux-audit/audit-kernel/issues/92
> > Please see the github audit testsuite issue for the test case:
> > https://github.com/linux-audit/audit-testsuite/issues/64
> > Please see the github audit wiki for the feature overview:
> > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > Acked-by: Neil Horman <[email protected]>
> > Reviewed-by: Ondrej Mosnacek <[email protected]>
> > ---
> > include/linux/audit.h | 20 ++++++
> > kernel/audit.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++-
> > kernel/nsproxy.c | 4 ++
> > net/netfilter/nft_log.c | 11 +++-
> > net/netfilter/xt_AUDIT.c | 11 +++-
> > 5 files changed, 195 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > index c4a755ae0d61..304fbb7c3c5b 100644
> > --- a/include/linux/audit.h
> > +++ b/include/linux/audit.h
> > @@ -128,6 +128,13 @@ struct audit_task_info {
> >
> > extern struct audit_task_info init_struct_audit;
> >
> > +struct audit_contobj_netns {
> > + struct list_head list;
> > + struct audit_contobj *obj;
> > + int count;
>
> This seems like it might be a good candidate for refcount_t, yes?
I considered this before when converting the struct audit_contobj to
refcount_t, but decided against it since any updates are in the context
of a list traversal where it could be added to the list and so the
spinlock is already held anyways.
Is there a more efficient or elegant way of doing the locking around the
two list traversals below (_add and _del)?
I wonder about converting the count to refcount_t and only holding the
spinlock for the list_add_rcu() in the _add case. And for the _del case
holding the spinlock only for the list_del_rcu().
These are the only two locations items are added or deleted from the
lists.
Somewhat related to this: does the list order matter? Items are
currently added at the end of the list, which likely makes locking
simpler, though adding at the start of the list is a simple change.
However, unless we understand the read profile of these lists for
reporting contid use in audit_log_netns_contid_list(), I don't think
order matters significantly. It could be that reporting of a contid
goes down in frequency over its lifetime, in which case inserting new
entries at the beginning of the list would be best. This is not a
visible implementation detail, so later optimization should pose no
problem.
> > + struct rcu_head rcu;
> > +};
>
> ...
>
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index 997c34178ee8..a862721dfd9b 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -437,6 +452,136 @@ static struct sock *audit_get_sk(const struct net *net)
> > return aunet->sk;
> > }
> >
> > +void audit_netns_contid_add(struct net *net, struct audit_contobj *cont)
> > +{
> > + struct audit_net *aunet;
> > + struct list_head *contobj_list;
> > + struct audit_contobj_netns *contns;
> > +
> > + if (!net)
> > + return;
> > + if (!cont)
> > + return;
> > + aunet = net_generic(net, audit_net_id);
> > + if (!aunet)
> > + return;
> > + contobj_list = &aunet->contobj_list;
> > + rcu_read_lock();
> > + spin_lock(&aunet->contobj_list_lock);
> > + list_for_each_entry_rcu(contns, contobj_list, list)
> > + if (contns->obj == cont) {
> > + contns->count++;
> > + goto out;
> > + }
> > + contns = kmalloc(sizeof(*contns), GFP_ATOMIC);
> > + if (contns) {
> > + INIT_LIST_HEAD(&contns->list);
> > + contns->obj = cont;
> > + contns->count = 1;
> > + list_add_rcu(&contns->list, contobj_list);
> > + }
> > +out:
> > + spin_unlock(&aunet->contobj_list_lock);
> > + rcu_read_unlock();
> > +}
> > +
> > +void audit_netns_contid_del(struct net *net, struct audit_contobj *cont)
> > +{
> > + struct audit_net *aunet;
> > + struct list_head *contobj_list;
> > + struct audit_contobj_netns *contns = NULL;
> > +
> > + if (!net)
> > + return;
> > + if (!cont)
> > + return;
> > + aunet = net_generic(net, audit_net_id);
> > + if (!aunet)
> > + return;
> > + contobj_list = &aunet->contobj_list;
> > + rcu_read_lock();
> > + spin_lock(&aunet->contobj_list_lock);
> > + list_for_each_entry_rcu(contns, contobj_list, list)
> > + if (contns->obj == cont) {
> > + contns->count--;
> > + if (contns->count < 1) {
>
> One could simplify this with "(--contns->count) < 1", although if it
> is changed to a refcount_t (which seems like a smart thing), the
> normal decrement/test would be the best choice.
Agreed.
> > + list_del_rcu(&contns->list);
> > + kfree_rcu(contns, rcu);
> > + }
> > + break;
> > + }
> > + spin_unlock(&aunet->contobj_list_lock);
> > + rcu_read_unlock();
> > +}
>
> paul moore
- RGB
--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
On 2020-07-05 11:11, Paul Moore wrote:
> On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <[email protected]> wrote:
> > Require the target task to be a descendant of the container
> > orchestrator/engine.
> >
> > You would only change the audit container ID from one set or inherited
> > value to another if you were nesting containers.
> >
> > If changing the contid, the container orchestrator/engine must be a
> > descendant and not same orchestrator as the one that set it so it is not
> > possible to change the contid of another orchestrator's container.
Are we able to agree on the premises above? Is anything asserted that
should not be and is there anything missing?
I've been sitting on my response below for more than a week, trying to
understand the issues raised and to give the reply proper attention.
Please excuse my tardiness in replying on this issue; I'm still having
trouble thinking through all the scenarios for nesting.
> > Since the task_is_descendant() function is used in YAMA and in audit,
> > remove the duplication and pull the function into kernel/sched/core.c
> >
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > ---
> > include/linux/sched.h | 3 +++
> > kernel/audit.c | 23 +++++++++++++++++++++--
> > kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++++
> > security/yama/yama_lsm.c | 33 ---------------------------------
> > 4 files changed, 57 insertions(+), 35 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 2213ac670386..06938d0b9e0c 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2047,4 +2047,7 @@ static inline void rseq_syscall(struct pt_regs *regs)
> >
> > const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
> >
> > +extern int task_is_descendant(struct task_struct *parent,
> > + struct task_struct *child);
> > +
> > #endif
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index a862721dfd9b..efa65ec01239 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -2713,6 +2713,20 @@ int audit_signal_info(int sig, struct task_struct *t)
> > return audit_signal_info_syscall(t);
> > }
> >
> > +static bool audit_contid_isnesting(struct task_struct *tsk)
> > +{
> > + bool isowner = false;
> > + bool ownerisparent = false;
> > +
> > + rcu_read_lock();
> > + if (tsk->audit && tsk->audit->cont) {
> > + isowner = current == tsk->audit->cont->owner;
> > + ownerisparent = task_is_descendant(tsk->audit->cont->owner, current);
>
> I want to make sure I'm understanding this correctly and I keep
> mentally tripping over something: it seems like for a given audit
> container ID a task is either the owner or a descendent, there is no
> third state, is that correct?
Sure there is. It could be another owner (which is addressed when we
search for an existing contobj match), or in the next patch, the
owner's parent if nested or a peer.
> Assuming that is true, can the descendent check simply be a negative
> owner check given they both have the same audit container ID?
There isn't actually a check in my code for the orchestrator contid and
task contid being the same. Maybe I was making this check more
complicated than necessary, and still incomplete, but see below for more...
> > + }
> > + rcu_read_unlock();
> > + return !isowner && ownerisparent;
> > +}
> > +
> > /*
> > * audit_set_contid - set current task's audit contid
> > * @task: target task
> > @@ -2755,8 +2769,13 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > rc = -EBUSY;
> > goto unlock;
> > }
> > - /* if contid is already set, deny */
> > - if (audit_contid_set(task))
> > + /* if task is not descendant, block */
> > + if (task == current || !task_is_descendant(current, task)) {
>
> I'm also still fuzzy on why we can't let a task set its own audit
> container ID, assuming it meets all the criteria established in patch
> 2/13. It somewhat made sense when you were tracking inherited vs
> explicitly set audit container IDs, but that doesn't appear to be the
> case so far in this patchset, yes?
I'm still having a strong reluctance to permit this but can't come up
with a solid technical reason right now, but it feels like a layer
violation. If we forbid it and discover it necessary and harmless, then
permitting it won't break the API. If we permit it and later discover a
reason it causes a problem, then blocking it will break the API. I have
heard that there are cases where there is no orchestrator/engine, so in
those cases I conclude that a process would need to set its own contid
but I'm having trouble recalling what those circumstances are.
I also was seriously considering blocking any contid set on the initial
user or PID namespaces to avoid polluting them, and even had a tested
patch to implement it, but this starts making assumptions about the
definition of a container with respect to namespaces which we have been
deliberately avoiding.
> > + rc = -EXDEV;
>
> I'm fairly confident we had a discussion about not using all these
> different error codes, but that may be a moot point given my next
> comment.
Yes, we did. I reduced both circumstances down to what you requested,
shedding two along the way. Given the number of different ways
orchestrators, contids and tasks can be related, I'd rather have more,
not fewer diagnostics to understand what it thinks is happening. This
is a relatively minor detail in the context of the rest of the
discussion in this thread.
> > + goto unlock;
> > + }
> > + /* only allow contid setting again if nesting */
> > + if (audit_contid_set(task) && !audit_contid_isnesting(task))
> > rc = -EEXIST;
>
> It seems like what we need in audit_set_contid() is a check to ensure
> that the task being modified is only modified by the owner of the
> audit container ID, yes? If so, I would think we could do this quite
> easily with the following, or similar logic, (NOTE: assumes both
> current and tsk are properly setup):
>
> if ((current->audit->cont != tsk->audit->cont) || (current->audit->cont->owner != current))
> return -EACCES;
Not necessarily.
If we start from the premise that once set, a contid on a task cannot be
unset, and then that it cannot be set to another value, then the oldest
ancestor in any container must not be able to change contid. That
leaves any descendant (that hasn't threaded or parented) free to nest.
If we allow a task to modify its own contid (from the potential change
above), then if it inherited its contid, it could set its own. This
still looks like a layer violation to me. Going back to some
discussions with Eric Biederman from a number of years ago, it seems
wrong to me that a task should be able to see its own contid, let alone
be able to set it. This came out of a CRIU concern about serial nsIDs
based on proc inode numbers not being portable. Is it still a
consideration?
Another scenario comes to mind. Should an orchestrator be able to set
the contid of a descendant of one of the former's child orchestrators?
Leaping generations like this doesn't sound like a good idea, and I
can't come up with a valid use case.
> This is somewhat independent of the above issue, but we may also want
> to add to the capability check. Patch 2 adds a
> "capable(CAP_AUDIT_CONTROL)" which is good, but perhaps we also need a
> "ns_capable(CAP_AUDIT_CONTROL)" to allow a given audit container ID
> orchestrator/owner the ability to control which of its descendants
> can change their audit container ID, for example:
>
> if (!capable(CAP_AUDIT_CONTROL) ||
> !ns_capable(current_user_ns(), CAP_AUDIT_CONTROL))
> return -EPERM;
Why does ns_capable keep being raised? The last patch, capcontid, was
developed to solve this previously raised issue. The issue was an
unprivileged user creating a user namespace with full capabilities,
circumventing capable() and being able to change the main audit
configuration. It was already discussed in v8 and before that and my
last posting in the thread was left dangling with an unanswered
question:
https://lkml.org/lkml/2020/2/6/333
I only see this being potentially useful with audit namespaces in
conjunction with unprivileged user namespaces in the future with the
implementation of multiple audit daemons for the ability of an
unprivileged user to run their own distro container without influencing
the master audit configuration.
> paul moore
- RGB
--
Richard Guy Briggs <[email protected]>
On Fri, Aug 7, 2020 at 1:10 PM Richard Guy Briggs <[email protected]> wrote:
> On 2020-07-05 11:11, Paul Moore wrote:
> > On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <[email protected]> wrote:
> > > Require the target task to be a descendant of the container
> > > orchestrator/engine.
If you want to get formal about this, you need to define "target" in
the sentence above. Target of what?
FWIW, I read the above to basically mean that a task can only set the
audit container ID of processes which are beneath it in the "process
tree" where the "process tree" is defined as the relationship between
a parent and children processes such that the children processes are
branches below the parent process.
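That reading of "beneath it in the process tree" can be sketched as a
walk up the parent pointers. This is a simplified userspace model, not
the kernel's task_is_descendant(): struct task_model and
task_model_is_descendant() are hypothetical names, and threads, RCU and
reparenting are all ignored. In this model a task is not counted as its
own descendant, which mirrors the explicit "task == current" exclusion
in the quoted patch.

```c
#include <stdbool.h>
#include <stddef.h>

struct task_model {
	const struct task_model *parent;	/* NULL for the root of the tree */
};

/* Return true if child sits strictly below ancestor in the process tree. */
static bool task_model_is_descendant(const struct task_model *ancestor,
				     const struct task_model *child)
{
	const struct task_model *walker;

	if (!ancestor || !child)
		return false;
	/* Start at the child's parent so a task never matches itself. */
	for (walker = child->parent; walker; walker = walker->parent)
		if (walker == ancestor)
			return true;
	return false;
}
```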
I have no problem with that, with the understanding that nesting
complicates it somewhat. For example, this isn't true when one of the
children is a nested orchestrator, is it?
> > > You would only change the audit container ID from one set or inherited
> > > value to another if you were nesting containers.
I thought we decided we were going to allow an orchestrator to move a
process between audit container IDs, yes? no?
> > > If changing the contid, the container orchestrator/engine must be a
> > > descendant and not same orchestrator as the one that set it so it is not
> > > possible to change the contid of another orchestrator's container.
Try rephrasing the above please, it isn't clear to me what you are
trying to say.
> Are we able to agree on the premises above? Is anything asserted that
> should not be and is there anything missing?
See above.
If you want to go back to the definitions/assumptions stage, it
probably isn't worth worrying about the other comments until we get
the above sorted.
--
paul moore
http://www.paul-moore.com
On 2020-08-21 16:13, Paul Moore wrote:
> On Fri, Aug 7, 2020 at 1:10 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2020-07-05 11:11, Paul Moore wrote:
> > > On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <[email protected]> wrote:
> > > > Require the target task to be a descendant of the container
> > > > orchestrator/engine.
>
> If you want to get formal about this, you need to define "target" in
> the sentence above. Target of what?
The target is the task whose audit container identifier is being
modified by the orchestrator (the current task).
> FWIW, I read the above to basically mean that a task can only set the
> audit container ID of processes which are beneath it in the "process
> tree" where the "process tree" is defined as the relationship between
> a parent and children processes such that the children processes are
> branches below the parent process.
Yes.
> I have no problem with that, with the understanding that nesting
> complicates it somewhat. For example, this isn't true when one of the
> children is a nested orchestrator, is it?
It should still be true if that child is a nested orchestrator that has
not yet spawned any children or threads (or they have all died off).
It does get more complicated when we consider the scenario outlined
below about perceived layer violations...
> > > > You would only change the audit container ID from one set or inherited
> > > > value to another if you were nesting containers.
>
> I thought we decided we were going to allow an orchestrator to move a
> process between audit container IDs, yes? no?
We did? I don't remember anything about that. Has this been requested?
This seems to violate the rule that we can't change the audit container
identifier once it has been set (other than nesting). Can you suggest a
use case?
> > > > If changing the contid, the container orchestrator/engine must be a
> > > > descendant and not same orchestrator as the one that set it so it is not
> > > > possible to change the contid of another orchestrator's container.
>
> Try rephrasing the above please, it isn't clear to me what you are
> trying to say.
This is harder than I expected to rephrase... It also makes it clear
that there are some scenarios that have not been considered that may
need to be restricted.
Orchestrator A spawned task B which is itself an orchestrator without
children yet. Orchestrator A sets the audit container identifier of B.
Neither A, nor B, nor any other child of A (or any of their
descendants), nor any orchestrator outside the tree of A (uncles, aunts
and cousins are outside), can change the audit container identifier of
B.
Orchestrator B spawns task C. Here's where it gets tricky. It seems
like a layer violation for B to spawn a child C and have A reach over B
to set the audit container identifier of C, especially if B is also an
orchestrator. This all will be especially hard to police if we don't
limit the ability of an orchestrator task to set an audit container
identifier to that orchestrator's immediate children, only once.
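The "immediate children, only once" restriction floated above can be
modelled in a few lines. This is a userspace sketch of that possible
rule only, not of what the patch currently implements: struct orch_task
and model_set_contid() are hypothetical names, the return values are
placeholders rather than proposed error codes, and capabilities,
threads and locking are left out.

```c
#include <stdbool.h>
#include <stddef.h>

struct orch_task {
	const struct orch_task *parent;	/* NULL for the top of the tree */
	bool contid_set;
	unsigned long long contid;
};

/* Returns 0 on success, -1 if the hypothetical rule refuses the request. */
static int model_set_contid(const struct orch_task *orch,
			    struct orch_task *target,
			    unsigned long long contid)
{
	if (!orch || !target)
		return -1;
	if (target->parent != orch)
		return -1;	/* not an immediate child; also covers target == orch */
	if (target->contid_set)
		return -1;	/* only once */
	target->contid = contid;
	target->contid_set = true;
	return 0;
}
```

Applied to the scenario above: A may set B once, nobody may set B
again, A reaching over B to set C is refused, and B may set C.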
> > Are we able to agree on the premises above? Is anything asserted that
> > should not be and is there anything missing?
>
> See above.
>
> If you want to go back to the definitions/assumptions stage, it
> probably isn't worth worrying about the other comments until we get
> the above sorted.
I don't want to. I'm trying to confirm that we are on the same page.
> paul moore
- RGB
--
Richard Guy Briggs <[email protected]>