2024-06-09 10:40:09

by Jonathan Calmels

[permalink] [raw]
Subject: [PATCH v2 0/4] Introduce user namespace capabilities

This patch series introduces a new user namespace capability set, as
well as some plumbing around it (i.e. sysctl, secbit, lsm support).

First patch goes over the motivations for this as well as prior art.

In summary, while user namespaces are a great success today in that they
avoid running a lot of code as root, they also expand the attack surface
of the kernel substantially which is often abused by attackers.
Methods exist to limit the creation of such namespaces [1], however,
application developers often need to assume that user namespaces are
available for various tasks such as sandboxing. Thus, instead of
restricting the creation of user namespaces, we offer ways for userspace
to limit the capabilities granted to them.

Why a new capability set and not something specific to the userns (e.g.
ioctl_ns)?

1. We can't really expect userspace to patch every single callsite
and opt-in this new security mechanism.

2. We don't necessarily want policies enforced at said callsites.
For example a service like systemd-machined or a PAM session need to
be able to place restrictions on any namespace spawned under it.

3. We would need to come up with inheritance rules, querying
capabilities, etc. At this point we're just reinventing capability
sets.

4. We can easily define interactions between capability sets, thus
helping with adoption (patch 2 is an example of this)

Some examples of how this could be leveraged in userspace:

- Prevent user from getting CAP_NET_ADMIN in user namespaces under SSH:
echo "auth optional pam_cap.so" >> /etc/pam.d/sshd
echo "!cap_net_admin $USER" >> /etc/security/capability.conf
capsh --secbits=$((1 << 8)) -- -c /usr/sbin/sshd

- Prevent containers from ever getting CAP_DAC_OVERRIDE:
systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \
-p SecureBits=userns-strict-caps \
/usr/bin/dockerd
systemd-run -p UserNSCapabilities=~CAP_DAC_OVERRIDE \
/usr/bin/incusd

- Kernel could be vulnerable to CAP_SYS_RAWIO exploits, prevent it:
sysctl -w cap_bound_userns_mask=0x1fffffdffff

- Drop CAP_SYS_ADMIN for this shell and all the user namespaces below it:
bwrap --unshare-user --cap-drop CAP_SYS_ADMIN /bin/sh

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cd4c5c2101cb092db00f61f69d24380cf7a0ee8

---
Changes since v1:
- Add documentation
- Change commit wording
- Cleanup various aspects of the code based on feedback
- Add new CAP_SYS_CONTROL capability for sysctl check
- Add BPF-LSM support for modifying userns capabilities
---
Jonathan Calmels (4):
capabilities: Add user namespace capabilities
capabilities: Add securebit to restrict userns caps
capabilities: Add sysctl to mask off userns caps
bpf,lsm: Allow editing capabilities in BPF-LSM hooks

Documentation/filesystems/proc.rst | 1 +
Documentation/security/credentials.rst | 6 ++
fs/proc/array.c | 9 +++
include/linux/cred.h | 3 +
include/linux/lsm_hook_defs.h | 2 +-
include/linux/securebits.h | 1 +
include/linux/security.h | 4 +-
include/linux/user_namespace.h | 7 ++
include/uapi/linux/capability.h | 6 +-
include/uapi/linux/prctl.h | 7 ++
include/uapi/linux/securebits.h | 11 ++-
kernel/bpf/bpf_lsm.c | 55 +++++++++++++
kernel/cred.c | 3 +
kernel/sysctl.c | 10 +++
kernel/umh.c | 15 ++++
kernel/user_namespace.c | 80 +++++++++++++++++--
security/apparmor/lsm.c | 2 +-
security/commoncap.c | 62 +++++++++++++-
security/keys/process_keys.c | 3 +
security/security.c | 6 +-
security/selinux/hooks.c | 2 +-
security/selinux/include/classmap.h | 5 +-
.../selftests/bpf/prog_tests/deny_namespace.c | 12 ++-
.../selftests/bpf/progs/test_deny_namespace.c | 7 +-
24 files changed, 291 insertions(+), 28 deletions(-)

--
2.45.2



2024-06-09 10:40:54

by Jonathan Calmels

[permalink] [raw]
Subject: [PATCH v2 1/4] capabilities: Add user namespace capabilities

Attackers often rely on user namespaces to get elevated (yet confined)
privileges in order to target specific subsystems (e.g. [1]). Distributions
have been pretty adamant that they need a way to configure these, most of
them carry out-of-tree patches to do so, or plainly refuse to enable them.
As a result, there have been multiple efforts over the years to introduce
various knobs to control and/or disable user namespaces (e.g. [2][3][4]).

While we acknowledge that there are already ways to control the creation of
such namespaces (the most recent being a LSM hook), there are inherent
issues with these approaches. Preventing the user namespace creation is not
fine-grained enough, and in some cases, incompatible with various userspace
expectations (e.g. container runtimes, browser sandboxing, service
isolation)

This patch addresses these limitations by introducing an additional
capability set used to restrict the permissions granted when creating user
namespaces. This way, processes can apply the principle of least privilege
by configuring only the capabilities they need for their namespaces.

For compatibility reasons, processes always start with a full userns
capability set.

On namespace creation, the userns capability set (pU) is assigned to the
new effective (pE), permitted (pP) and bounding set (X) of the task:

pU = pE = pP = X

The userns capability set obeys the invariant that no bit can ever be set
if it is not already part of the task’s bounding set. This ensures that
no namespace can ever gain more privileges than its predecessors.
Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
in the userns set requires its corresponding bit to be set in the permitted
set. This effectively mimics the inheritable set rules and means that, by
default, only root in the user namespace can regain userns capabilities
previously dropped:

p’U = (pE & CAP_SETPCAP) ? X : (X & pP)

Note that since userns capabilities are strictly hierarchical, policies can
be enforced at various levels (e.g. init, pam_cap) and inherited by every
child namespace.

Here is a sample program that can be used to verify the functionality:

/*
* Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
*
* ./cap_userns_test unshare -r grep Cap /proc/self/status
* CapInh: 0000000000000000
* CapPrm: 000001fffffdffff
* CapEff: 000001fffffdffff
* CapBnd: 000001fffffdffff
* CapAmb: 0000000000000000
* CapUNs: 000001fffffdffff
*/

int main(int argc, char *argv[])
{
if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0)
err(1, "cannot drop userns cap");

execvp(argv[1], argv + 1);
err(1, "cannot exec");
}

[1] https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
[2] https://lore.kernel.org/lkml/[email protected]
[3] https://lore.kernel.org/lkml/[email protected]
[4] https://lore.kernel.org/containers/[email protected]

Signed-off-by: Jonathan Calmels <[email protected]>
---
Documentation/filesystems/proc.rst | 1 +
Documentation/security/credentials.rst | 6 +++
fs/proc/array.c | 9 ++++
include/linux/cred.h | 3 ++
include/uapi/linux/prctl.h | 7 +++
kernel/cred.c | 3 ++
kernel/umh.c | 15 +++++++
kernel/user_namespace.c | 12 +++--
security/commoncap.c | 62 ++++++++++++++++++++++++--
security/keys/process_keys.c | 3 ++
10 files changed, 111 insertions(+), 10 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 7c3a565ffbef..b5de4eaf1b7b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -294,6 +294,7 @@ It's slow but very precise.
CapEff bitmap of effective capabilities
CapBnd bitmap of capabilities bounding set
CapAmb bitmap of ambient capabilities
+ CapUns bitmap of user namespace capabilities
NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...)
Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...)
Speculation_Store_Bypass speculative store bypass mitigation status
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 357328d566c8..7ee904237023 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -148,6 +148,7 @@ The Linux kernel supports the following types of credentials:
- Set of permitted capabilities
- Set of inheritable capabilities
- Set of effective capabilities
+ - Set of user namespace capabilities
- Capability bounding set

These are only carried by tasks. They indicate superior capabilities
@@ -170,6 +171,11 @@ The Linux kernel supports the following types of credentials:
``execve()``, especially when a binary is executed that will execute as
UID 0.

+ The user namespace set limits the capabilities granted to user namespaces.
+ It defines what capabilities will be available in the other sets after
+ creating a new user namespace, such as when calling ``clone()`` or
+ ``unshare()`` with ``CLONE_NEWUSER``.
+
3. Secure management flags (securebits).

These are only carried by tasks. These govern the way the above
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 34a47fb0c57f..364e8bb19f9d 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
const struct cred *cred;
kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
cap_bset, cap_ambient;
+#ifdef CONFIG_USER_NS
+ kernel_cap_t cap_userns;
+#endif

rcu_read_lock();
cred = __task_cred(p);
@@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
cap_effective = cred->cap_effective;
cap_bset = cred->cap_bset;
cap_ambient = cred->cap_ambient;
+#ifdef CONFIG_USER_NS
+ cap_userns = cred->cap_userns;
+#endif
rcu_read_unlock();

render_cap_t(m, "CapInh:\t", &cap_inheritable);
@@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
render_cap_t(m, "CapEff:\t", &cap_effective);
render_cap_t(m, "CapBnd:\t", &cap_bset);
render_cap_t(m, "CapAmb:\t", &cap_ambient);
+#ifdef CONFIG_USER_NS
+ render_cap_t(m, "CapUNs:\t", &cap_userns);
+#endif
}

static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 2976f534a7a3..adab0031443e 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -124,6 +124,9 @@ struct cred {
kernel_cap_t cap_effective; /* caps we can actually use */
kernel_cap_t cap_bset; /* capability bounding set */
kernel_cap_t cap_ambient; /* Ambient capability set */
+#ifdef CONFIG_USER_NS
+ kernel_cap_t cap_userns; /* User namespace capability set */
+#endif
#ifdef CONFIG_KEYS
unsigned char jit_keyring; /* default keyring to attach requested
* keys to */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 35791791a879..b58325ebdc9e 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -198,6 +198,13 @@ struct prctl_mm_map {
# define PR_CAP_AMBIENT_LOWER 3
# define PR_CAP_AMBIENT_CLEAR_ALL 4

+/* Control the userns capability set */
+#define PR_CAP_USERNS 48
+# define PR_CAP_USERNS_IS_SET 1
+# define PR_CAP_USERNS_RAISE 2
+# define PR_CAP_USERNS_LOWER 3
+# define PR_CAP_USERNS_CLEAR_ALL 4
+
/* arm64 Scalable Vector Extension controls */
/* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
#define PR_SVE_SET_VL 50 /* set task vector length */
diff --git a/kernel/cred.c b/kernel/cred.c
index 075cfa7c896f..9912c6f3bc6b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -56,6 +56,9 @@ struct cred init_cred = {
.cap_permitted = CAP_FULL_SET,
.cap_effective = CAP_FULL_SET,
.cap_bset = CAP_FULL_SET,
+#ifdef CONFIG_USER_NS
+ .cap_userns = CAP_FULL_SET,
+#endif
.user = INIT_USER,
.user_ns = &init_user_ns,
.group_info = &init_groups,
diff --git a/kernel/umh.c b/kernel/umh.c
index 598b3ffe1522..0a5a9cf10d83 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -32,6 +32,9 @@

#include <trace/events/module.h>

+#ifdef CONFIG_USER_NS
+static kernel_cap_t usermodehelper_userns = CAP_FULL_SET;
+#endif
static kernel_cap_t usermodehelper_bset = CAP_FULL_SET;
static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET;
static DEFINE_SPINLOCK(umh_sysctl_lock);
@@ -94,6 +97,9 @@ static int call_usermodehelper_exec_async(void *data)
new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
new->cap_inheritable = cap_intersect(usermodehelper_inheritable,
new->cap_inheritable);
+#ifdef CONFIG_USER_NS
+ new->cap_userns = cap_intersect(usermodehelper_userns, new->cap_userns);
+#endif
spin_unlock(&umh_sysctl_lock);

if (sub_info->init) {
@@ -560,6 +566,15 @@ static struct ctl_table usermodehelper_table[] = {
.mode = 0600,
.proc_handler = proc_cap_handler,
},
+#ifdef CONFIG_USER_NS
+ {
+ .procname = "userns",
+ .data = &usermodehelper_userns,
+ .maxlen = 2 * sizeof(unsigned long),
+ .mode = 0600,
+ .proc_handler = proc_cap_handler,
+ },
+#endif
};

static int __init init_umh_sysctls(void)
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 0b0b95418b16..7e624607330b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts)

static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
{
- /* Start with the same capabilities as init but useless for doing
- * anything as the capabilities are bound to the new user namespace.
- */
- cred->securebits = SECUREBITS_DEFAULT;
+ /* Start with the capabilities defined in the userns set. */
+ cred->cap_bset = cred->cap_userns;
+ cred->cap_permitted = cred->cap_userns;
+ cred->cap_effective = cred->cap_userns;
cred->cap_inheritable = CAP_EMPTY_SET;
- cred->cap_permitted = CAP_FULL_SET;
- cred->cap_effective = CAP_FULL_SET;
cred->cap_ambient = CAP_EMPTY_SET;
- cred->cap_bset = CAP_FULL_SET;
+ cred->securebits = SECUREBITS_DEFAULT;
#ifdef CONFIG_KEYS
key_put(cred->request_key_auth);
cred->request_key_auth = NULL;
diff --git a/security/commoncap.c b/security/commoncap.c
index 162d96b3a676..59fafbfcfc5e 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -214,10 +214,10 @@ int cap_capget(const struct task_struct *target, kernel_cap_t *effective,
}

/*
- * Determine whether the inheritable capabilities are limited to the old
+ * Determine whether the capabilities are limited to the old
* permitted set. Returns 1 if they are limited, 0 if they are not.
*/
-static inline int cap_inh_is_capped(void)
+static inline int cap_is_capped(void)
{
/* they are so limited unless the current task has the CAP_SETPCAP
* capability
@@ -228,6 +228,29 @@ static inline int cap_inh_is_capped(void)
return 1;
}

+/*
+ * Determine whether a userns capability can be raised.
+ * Returns 1 if it can, 0 otherwise.
+ */
+#ifdef CONFIG_USER_NS
+static inline int cap_uns_is_raiseable(unsigned long cap)
+{
+ if (!!cap_raised(current_cred()->cap_userns, cap))
+ return 1;
+
+ /*
+ * A capability cannot be raised unless the current task has it in
+ * its bounding set and, without CAP_SETPCAP, its permitted set.
+ */
+ if (!cap_raised(current_cred()->cap_bset, cap))
+ return 0;
+ if (cap_is_capped() && !cap_raised(current_cred()->cap_permitted, cap))
+ return 0;
+
+ return 1;
+}
+#endif
+
/**
* cap_capset - Validate and apply proposed changes to current's capabilities
* @new: The proposed new credentials; alterations should be made here
@@ -246,7 +269,7 @@ int cap_capset(struct cred *new,
const kernel_cap_t *inheritable,
const kernel_cap_t *permitted)
{
- if (cap_inh_is_capped() &&
+ if (cap_is_capped() &&
!cap_issubset(*inheritable,
cap_combine(old->cap_inheritable,
old->cap_permitted)))
@@ -1382,6 +1405,39 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
return commit_creds(new);
}

+#ifdef CONFIG_USER_NS
+ case PR_CAP_USERNS:
+ if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
+ if (arg3 | arg4 | arg5)
+ return -EINVAL;
+
+ new = prepare_creds();
+ if (!new)
+ return -ENOMEM;
+ cap_clear(new->cap_userns);
+ return commit_creds(new);
+ }
+
+ if (((!cap_valid(arg3)) | arg4 | arg5))
+ return -EINVAL;
+
+ if (arg2 == PR_CAP_USERNS_IS_SET)
+ return !!cap_raised(current_cred()->cap_userns, arg3);
+ if (arg2 != PR_CAP_USERNS_RAISE && arg2 != PR_CAP_USERNS_LOWER)
+ return -EINVAL;
+ if (arg2 == PR_CAP_USERNS_RAISE && !cap_uns_is_raiseable(arg3))
+ return -EPERM;
+
+ new = prepare_creds();
+ if (!new)
+ return -ENOMEM;
+ if (arg2 == PR_CAP_USERNS_RAISE)
+ cap_raise(new->cap_userns, arg3);
+ else
+ cap_lower(new->cap_userns, arg3);
+ return commit_creds(new);
+#endif
+
default:
/* No functionality available - continue with default */
return -ENOSYS;
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index b5d5333ab330..e3670d815435 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork)
new->cap_effective = old->cap_effective;
new->cap_ambient = old->cap_ambient;
new->cap_bset = old->cap_bset;
+#ifdef CONFIG_USER_NS
+ new->cap_userns = old->cap_userns;
+#endif

new->jit_keyring = old->jit_keyring;
new->thread_keyring = key_get(old->thread_keyring);
--
2.45.2


2024-06-09 10:41:03

by Jonathan Calmels

[permalink] [raw]
Subject: [PATCH v2 2/4] capabilities: Add securebit to restrict userns caps

This patch adds a new capability security bit designed to constrain a
task’s userns capability set to its bounding set. The reason for this is
twofold:

- This serves as a quick and easy way to lock down a set of capabilities
for a task, thus ensuring that any namespace it creates will never be
more privileged than itself is.
- This helps userspace transition to more secure defaults by not requiring
specific logic for the userns capability set, or libcap support.

Example:

# capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \
-c 'unshare -r grep Cap /proc/self/status'
CapInh: 0000000000000000
CapPrm: 000001fffffdffff
CapEff: 000001fffffdffff
CapBnd: 000001fffffdffff
CapAmb: 0000000000000000
CapUNs: 000001fffffdffff

Signed-off-by: Jonathan Calmels <[email protected]>
---
include/linux/securebits.h | 1 +
include/uapi/linux/securebits.h | 11 ++++++++++-
kernel/user_namespace.c | 5 +++++
3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/securebits.h b/include/linux/securebits.h
index 656528673983..5f9d85cd69c3 100644
--- a/include/linux/securebits.h
+++ b/include/linux/securebits.h
@@ -5,4 +5,5 @@
#include <uapi/linux/securebits.h>

#define issecure(X) (issecure_mask(X) & current_cred_xxx(securebits))
+#define iscredsecure(cred, X) (issecure_mask(X) & cred->securebits)
#endif /* !_LINUX_SECUREBITS_H */
diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h
index d6d98877ff1a..2da3f4be4531 100644
--- a/include/uapi/linux/securebits.h
+++ b/include/uapi/linux/securebits.h
@@ -52,10 +52,19 @@
#define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \
(issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_LOCKED))

+/* When set, user namespace capabilities are restricted to their parent's bounding set. */
+#define SECURE_USERNS_STRICT_CAPS 8
+#define SECURE_USERNS_STRICT_CAPS_LOCKED 9 /* make bit-8 immutable */
+
+#define SECBIT_USERNS_STRICT_CAPS (issecure_mask(SECURE_USERNS_STRICT_CAPS))
+#define SECBIT_USERNS_STRICT_CAPS_LOCKED \
+ (issecure_mask(SECURE_USERNS_STRICT_CAPS_LOCKED))
+
#define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \
issecure_mask(SECURE_NO_SETUID_FIXUP) | \
issecure_mask(SECURE_KEEP_CAPS) | \
- issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE))
+ issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \
+ issecure_mask(SECURE_USERNS_STRICT_CAPS))
#define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1)

#endif /* _UAPI_LINUX_SECUREBITS_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 7e624607330b..53848e2b68cd 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -10,6 +10,7 @@
#include <linux/cred.h>
#include <linux/securebits.h>
#include <linux/security.h>
+#include <linux/capability.h>
#include <linux/keyctl.h>
#include <linux/key-type.h>
#include <keys/user-type.h>
@@ -42,6 +43,10 @@ static void dec_user_namespaces(struct ucounts *ucounts)

static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
{
+ /* Limit userns capabilities to our parent's bounding set. */
+ if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS))
+ cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset);
+
/* Start with the capabilities defined in the userns set. */
cred->cap_bset = cred->cap_userns;
cred->cap_permitted = cred->cap_userns;
--
2.45.2


2024-06-09 10:41:30

by Jonathan Calmels

[permalink] [raw]
Subject: [PATCH v2 3/4] capabilities: Add sysctl to mask off userns caps

This patch adds a new system-wide userns capability mask designed to mask
off capabilities in user namespaces.

This mask is controlled through a sysctl and can be set early in the boot
process or on the kernel command line to exclude known capabilities from
ever being gained in namespaces. Once set, it can be further restricted to
exert dynamic policies on the system (e.g. ward off a potential exploit).

Changing this mask requires privileges in the initial user namespace
over the newly introduced CAP_SYS_CONTROL.

Example:

# sysctl -qw kernel.cap_userns_mask=0x1fffffdffff && \
unshare -r grep Cap /proc/self/status
CapInh: 0000000000000000
CapPrm: 000001fffffdffff
CapEff: 000001fffffdffff
CapBnd: 000001fffffdffff
CapAmb: 0000000000000000
CapUNs: 000001fffffdffff

Signed-off-by: Jonathan Calmels <[email protected]>
---
include/linux/user_namespace.h | 7 ++++
include/uapi/linux/capability.h | 6 ++-
kernel/sysctl.c | 10 +++++
kernel/user_namespace.c | 63 +++++++++++++++++++++++++++++
security/selinux/include/classmap.h | 5 ++-
5 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6030a8235617..d958d4819608 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -2,6 +2,7 @@
#ifndef _LINUX_USER_NAMESPACE_H
#define _LINUX_USER_NAMESPACE_H

+#include <linux/capability.h>
#include <linux/kref.h>
#include <linux/nsproxy.h>
#include <linux/ns_common.h>
@@ -14,6 +15,12 @@
#define UID_GID_MAP_MAX_BASE_EXTENTS 5
#define UID_GID_MAP_MAX_EXTENTS 340

+#ifdef CONFIG_SYSCTL
+extern kernel_cap_t cap_userns_mask;
+int cap_userns_sysctl_handler(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos);
+#endif
+
struct uid_gid_extent {
u32 first;
u32 lower_first;
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 5bb906098697..e2c5e4bb2eb0 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -418,7 +418,11 @@ struct vfs_ns_cap_data {

#define CAP_CHECKPOINT_RESTORE 40

-#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE
+/* Allow setting the system userns capability mask. */
+
+#define CAP_SYS_CONTROL 41
+
+#define CAP_LAST_CAP CAP_SYS_CONTROL

#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e0b917328cf9..95b27a92c63c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -62,6 +62,7 @@
#include <linux/sched/sysctl.h>
#include <linux/mount.h>
#include <linux/userfaultfd_k.h>
+#include <linux/user_namespace.h>
#include <linux/pid.h>

#include "../lib/kstrtox.h"
@@ -1846,6 +1847,15 @@ static struct ctl_table kern_table[] = {
.mode = 0444,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_USER_NS
+ {
+ .procname = "cap_userns_mask",
+ .data = &cap_userns_mask,
+ .maxlen = sizeof(kernel_cap_t),
+ .mode = 0644,
+ .proc_handler = cap_userns_sysctl_handler,
+ },
+#endif
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
{
.procname = "unknown_nmi_panic",
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 53848e2b68cd..e513d87ed102 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -26,6 +26,63 @@
static struct kmem_cache *user_ns_cachep __ro_after_init;
static DEFINE_MUTEX(userns_state_mutex);

+#ifdef CONFIG_SYSCTL
+static DEFINE_SPINLOCK(cap_userns_lock);
+kernel_cap_t cap_userns_mask = CAP_FULL_SET;
+
+int cap_userns_sysctl_handler(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table t;
+ unsigned long mask_array[2];
+ kernel_cap_t new_mask, *mask;
+ int err;
+
+ if (write && !capable(CAP_SYS_CONTROL))
+ return -EPERM;
+
+ /*
+ * convert from the global kernel_cap_t to the ulong array to print to
+ * userspace if this is a read.
+ *
+ * capabilities are exposed as one 64-bit value or two 32-bit values
+ * depending on the architecture
+ */
+ mask = table->data;
+ spin_lock(&cap_userns_lock);
+ mask_array[0] = (unsigned long) mask->val;
+ if (BITS_PER_LONG != 64)
+ mask_array[1] = mask->val >> BITS_PER_LONG;
+ spin_unlock(&cap_userns_lock);
+
+ t = *table;
+ t.data = &mask_array;
+
+ /*
+ * actually read or write and array of ulongs from userspace. Remember
+ * these are least significant bits first
+ */
+ err = proc_doulongvec_minmax(&t, write, buffer, lenp, ppos);
+ if (err < 0)
+ return err;
+
+ new_mask.val = mask_array[0];
+ if (BITS_PER_LONG != 64)
+ new_mask.val += (u64)mask_array[1] << BITS_PER_LONG;
+
+ /*
+ * Drop everything not in the new_mask (but don't add things)
+ */
+ if (write) {
+ spin_lock(&cap_userns_lock);
+ *mask = cap_intersect(*mask, new_mask);
+ spin_unlock(&cap_userns_lock);
+ }
+
+ return 0;
+}
+#endif
+
static bool new_idmap_permitted(const struct file *file,
struct user_namespace *ns, int cap_setid,
struct uid_gid_map *map);
@@ -46,6 +103,12 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
/* Limit userns capabilities to our parent's bounding set. */
if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS))
cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset);
+#ifdef CONFIG_SYSCTL
+ /* Mask off userns capabilities that are not permitted by the system-wide mask. */
+ spin_lock(&cap_userns_lock);
+ cred->cap_userns = cap_intersect(cred->cap_userns, cap_userns_mask);
+ spin_unlock(&cap_userns_lock);
+#endif

/* Start with the capabilities defined in the userns set. */
cred->cap_bset = cred->cap_userns;
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 7229c9bf6c27..8f3ede7aac92 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -34,9 +34,10 @@

#define COMMON_CAP2_PERMS \
"mac_override", "mac_admin", "syslog", "wake_alarm", "block_suspend", \
- "audit_read", "perfmon", "bpf", "checkpoint_restore"
+ "audit_read", "perfmon", "bpf", "checkpoint_restore", \
+ "sys_control"

-#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
+#if CAP_LAST_CAP > CAP_SYS_CONTROL
#error New capability defined, please update COMMON_CAP2_PERMS.
#endif

--
2.45.2


2024-06-09 10:42:46

by Jonathan Calmels

[permalink] [raw]
Subject: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

This patch allows modifying the various capabilities of the struct cred
in BPF-LSM hooks. More specifically, the userns_create hook called
prior to creating a new user namespace.

With the introduction of userns capabilities, this effectively provides
a simple way for LSMs to control the capabilities granted to a user
namespace and all its descendants.

Update the selftests accordingly by dropping CAP_SYS_ADMIN in
namespaces and checking the resulting task's bounding set.

Signed-off-by: Jonathan Calmels <[email protected]>
---
include/linux/lsm_hook_defs.h | 2 +-
include/linux/security.h | 4 +-
kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++
security/apparmor/lsm.c | 2 +-
security/security.c | 6 +-
security/selinux/hooks.c | 2 +-
.../selftests/bpf/prog_tests/deny_namespace.c | 12 ++--
.../selftests/bpf/progs/test_deny_namespace.c | 7 ++-
8 files changed, 76 insertions(+), 14 deletions(-)

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index f804b76cde44..58d6d8f2511f 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -250,7 +250,7 @@ LSM_HOOK(int, -ENOSYS, task_prctl, int option, unsigned long arg2,
unsigned long arg3, unsigned long arg4, unsigned long arg5)
LSM_HOOK(void, LSM_RET_VOID, task_to_inode, struct task_struct *p,
struct inode *inode)
-LSM_HOOK(int, 0, userns_create, const struct cred *cred)
+LSM_HOOK(int, 0, userns_create, struct cred *cred)
LSM_HOOK(int, 0, ipc_permission, struct kern_ipc_perm *ipcp, short flag)
LSM_HOOK(void, LSM_RET_VOID, ipc_getsecid, struct kern_ipc_perm *ipcp,
u32 *secid)
diff --git a/include/linux/security.h b/include/linux/security.h
index 21cf70346b33..ffb1b0dd2aef 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -465,7 +465,7 @@ int security_task_kill(struct task_struct *p, struct kernel_siginfo *info,
int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5);
void security_task_to_inode(struct task_struct *p, struct inode *inode);
-int security_create_user_ns(const struct cred *cred);
+int security_create_user_ns(struct cred *cred);
int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag);
void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid);
int security_msg_msg_alloc(struct msg_msg *msg);
@@ -1294,7 +1294,7 @@ static inline int security_task_prctl(int option, unsigned long arg2,
static inline void security_task_to_inode(struct task_struct *p, struct inode *inode)
{ }

-static inline int security_create_user_ns(const struct cred *cred)
+static inline int security_create_user_ns(struct cred *cred)
{
return 0;
}
diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index 68240c3c6e7d..6edba93ff883 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c
@@ -382,10 +382,65 @@ bool bpf_lsm_is_trusted(const struct bpf_prog *prog)
return !btf_id_set_contains(&untrusted_lsm_hooks, prog->aux->attach_btf_id);
}

+static int bpf_lsm_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg,
+ int off, int size)
+{
+ const struct btf_type *cred;
+ const struct btf_type *t;
+ s32 type_id;
+ size_t end;
+
+ type_id = btf_find_by_name_kind(reg->btf, "cred", BTF_KIND_STRUCT);
+ if (type_id < 0)
+ return -EINVAL;
+
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ cred = btf_type_by_id(reg->btf, type_id);
+ if (t != cred) {
+ bpf_log(log, "only read is supported\n");
+ return -EACCES;
+ }
+
+ switch (off) {
+ case offsetof(struct cred, cap_inheritable):
+ end = offsetofend(struct cred, cap_inheritable);
+ break;
+ case offsetof(struct cred, cap_permitted):
+ end = offsetofend(struct cred, cap_permitted);
+ break;
+ case offsetof(struct cred, cap_effective):
+ end = offsetofend(struct cred, cap_effective);
+ break;
+ case offsetof(struct cred, cap_bset):
+ end = offsetofend(struct cred, cap_bset);
+ break;
+ case offsetof(struct cred, cap_ambient):
+ end = offsetofend(struct cred, cap_ambient);
+ break;
+ case offsetof(struct cred, cap_userns):
+ end = offsetofend(struct cred, cap_userns);
+ break;
+ default:
+ bpf_log(log, "no write support to cred at off %d\n", off);
+ return -EACCES;
+ }
+
+ if (off + size > end) {
+ bpf_log(log,
+ "write access at off %d with size %d beyond the member of cred ended at %zu\n",
+ off, size, end);
+ return -EACCES;
+ }
+
+ return 0;
+}
+
const struct bpf_prog_ops lsm_prog_ops = {
};

const struct bpf_verifier_ops lsm_verifier_ops = {
.get_func_proto = bpf_lsm_func_proto,
.is_valid_access = btf_ctx_access,
+ .btf_struct_access = bpf_lsm_btf_struct_access,
};
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 6239777090c4..310c9fa3d4b4 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -1036,7 +1036,7 @@ static int apparmor_task_kill(struct task_struct *target, struct kernel_siginfo
return error;
}

-static int apparmor_userns_create(const struct cred *cred)
+static int apparmor_userns_create(struct cred *cred)
{
struct aa_label *label;
struct aa_profile *profile;
diff --git a/security/security.c b/security/security.c
index e5da848c50b9..83cf2025c58e 100644
--- a/security/security.c
+++ b/security/security.c
@@ -3558,14 +3558,14 @@ void security_task_to_inode(struct task_struct *p, struct inode *inode)
}

/**
- * security_create_user_ns() - Check if creating a new userns is allowed
+ * security_create_user_ns() - Review permissions prior to userns creation
* @cred: prepared creds
*
- * Check permission prior to creating a new user namespace.
+ * Check and/or modify permissions prior to creating a new user namespace.
*
* Return: Returns 0 if successful, otherwise < 0 error code.
*/
-int security_create_user_ns(const struct cred *cred)
+int security_create_user_ns(struct cred *cred)
{
return call_int_hook(userns_create, cred);
}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 7eed331e90f0..28deb9510d8e 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4263,7 +4263,7 @@ static void selinux_task_to_inode(struct task_struct *p,
spin_unlock(&isec->lock);
}

-static int selinux_userns_create(const struct cred *cred)
+static int selinux_userns_create(struct cred *cred)
{
u32 sid = current_sid();

diff --git a/tools/testing/selftests/bpf/prog_tests/deny_namespace.c b/tools/testing/selftests/bpf/prog_tests/deny_namespace.c
index 1bc6241b755b..1500578f9a30 100644
--- a/tools/testing/selftests/bpf/prog_tests/deny_namespace.c
+++ b/tools/testing/selftests/bpf/prog_tests/deny_namespace.c
@@ -5,6 +5,8 @@
#include <sched.h>
#include "cap_helpers.h"
#include <stdio.h>
+#include <stdbool.h>
+#include <sys/prctl.h>

static int wait_for_pid(pid_t pid)
{
@@ -29,7 +31,7 @@ static int wait_for_pid(pid_t pid)
* positive return value -> userns creation failed
* 0 -> userns creation succeeded
*/
-static int create_user_ns(void)
+static int create_user_ns(bool bpf)
{
pid_t pid;

@@ -40,6 +42,8 @@ static int create_user_ns(void)
if (pid == 0) {
if (unshare(CLONE_NEWUSER))
_exit(EXIT_FAILURE);
+ if (bpf && prctl(PR_CAPBSET_READ, CAP_SYS_ADMIN))
+ _exit(EXIT_FAILURE);
_exit(EXIT_SUCCESS);
}

@@ -53,11 +57,11 @@ static void test_userns_create_bpf(void)

cap_enable_effective(cap_mask, &old_caps);

- ASSERT_OK(create_user_ns(), "priv new user ns");
+ ASSERT_OK(create_user_ns(true), "priv new user ns");

cap_disable_effective(cap_mask, &old_caps);

- ASSERT_EQ(create_user_ns(), EPERM, "unpriv new user ns");
+ ASSERT_EQ(create_user_ns(true), EPERM, "unpriv new user ns");

if (cap_mask & old_caps)
cap_enable_effective(cap_mask, NULL);
@@ -70,7 +74,7 @@ static void test_unpriv_userns_create_no_bpf(void)

cap_disable_effective(cap_mask, &old_caps);

- ASSERT_OK(create_user_ns(), "no-bpf unpriv new user ns");
+ ASSERT_OK(create_user_ns(false), "no-bpf unpriv new user ns");

if (cap_mask & old_caps)
cap_enable_effective(cap_mask, NULL);
diff --git a/tools/testing/selftests/bpf/progs/test_deny_namespace.c b/tools/testing/selftests/bpf/progs/test_deny_namespace.c
index e96b901a733c..051906f80f4c 100644
--- a/tools/testing/selftests/bpf/progs/test_deny_namespace.c
+++ b/tools/testing/selftests/bpf/progs/test_deny_namespace.c
@@ -9,12 +9,13 @@ typedef struct { unsigned long long val; } kernel_cap_t;

struct cred {
kernel_cap_t cap_effective;
+ kernel_cap_t cap_userns;
} __attribute__((preserve_access_index));

char _license[] SEC("license") = "GPL";

SEC("lsm.s/userns_create")
-int BPF_PROG(test_userns_create, const struct cred *cred, int ret)
+int BPF_PROG(test_userns_create, struct cred *cred, int ret)
{
kernel_cap_t caps = cred->cap_effective;
__u64 cap_mask = 1ULL << CAP_SYS_ADMIN;
@@ -23,8 +24,10 @@ int BPF_PROG(test_userns_create, const struct cred *cred, int ret)
return 0;

ret = -EPERM;
- if (caps.val & cap_mask)
+ if (caps.val & cap_mask) {
+ cred->cap_userns.val &= ~cap_mask;
return 0;
+ }

return -EPERM;
}
--
2.45.2


2024-06-10 00:19:21

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels <[email protected]> wrote:
>
> This patch allows modifying the various capabilities of the struct cred
> in BPF-LSM hooks. More specifically, the userns_create hook called
> prior to creating a new user namespace.
>
> With the introduction of userns capabilities, this effectively provides
> a simple way for LSMs to control the capabilities granted to a user
> namespace and all its descendants.
>
> Update the selftests accordingly by dropping CAP_SYS_ADMIN in
> namespaces and checking the resulting task's bounding set.
>
> Signed-off-by: Jonathan Calmels <[email protected]>
> ---
> include/linux/lsm_hook_defs.h | 2 +-
> include/linux/security.h | 4 +-
> kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++
> security/apparmor/lsm.c | 2 +-
> security/security.c | 6 +-
> security/selinux/hooks.c | 2 +-
> .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++--
> .../selftests/bpf/progs/test_deny_namespace.c | 7 ++-
> 8 files changed, 76 insertions(+), 14 deletions(-)

I'm not sure we want to go down the path of a LSM modifying the POSIX
capabilities of a task, other than the capabilities/commoncap LSM. It
sets a bad precedent and could further complicate issues around LSM
ordering.

--
paul-moore.com

2024-06-10 01:57:01

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] capabilities: Add user namespace capabilities

On Sun, Jun 09, 2024 at 03:43:34AM -0700, Jonathan Calmels wrote:

(Adding amorgan as he doesn't seem to be on cc list)

> Attackers often rely on user namespaces to get elevated (yet confined)
> privileges in order to target specific subsystems (e.g. [1]). Distributions

I'd modify this to say "in order to target *bugs* in specific subsystems" :)

> have been pretty adamant that they need a way to configure these, most of
> them carry out-of-tree patches to do so, or plainly refuse to enable them.
> As a result, there have been multiple efforts over the years to introduce
> various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
>
> While we acknowledge that there are already ways to control the creation of
> such namespaces (the most recent being a LSM hook), there are inherent
> issues with these approaches. Preventing the user namespace creation is not
> fine-grained enough, and in some cases, incompatible with various userspace
> expectations (e.g. container runtimes, browser sandboxing, service
> isolation)
>
> This patch addresses these limitations by introducing an additional
> capability set used to restrict the permissions granted when creating user
> namespaces. This way, processes can apply the principle of least privilege
> by configuring only the capabilities they need for their namespaces.

I think this is precisely the right thing to do. In each of these cases,
there is a kernel bug in code which is only reachable with some
CAP_X, regardless of which user namespace the CAP_X is targeted towards.
So there's really no better way to resist this (apart from not having the
bugs in the first place) than to allow CAP_X to be denied in unprivileged
user namespaces.

> For compatibility reasons, processes always start with a full userns
> capability set.
>
> On namespace creation, the userns capability set (pU) is assigned to the
> new effective (pE), permitted (pP) and bounding set (X) of the task:
>
> pU = pE = pP = X
>
> The userns capability set obeys the invariant that no bit can ever be set
> if it is not already part of the task’s bounding set. This ensures that
> no namespace can ever gain more privileges than its predecessors.
> Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
> in the userns set requires its corresponding bit to be set in the permitted
> set. This effectively mimics the inheritable set rules and means that, by
> default, only root in the user namespace can regain userns capabilities
> previously dropped:

Something about this last sentence feels wrong, but I'm not sure what
the best alternative would be. As is, though, it makes it sound as though
root in the userns can always regain previously dropped capabilities, but
that's not true if dropped in ancestor ns, or if root also dropped the
bits from its bounding set (right?).

> p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
>
> Note that since userns capabilities are strictly hierarchical, policies can
> be enforced at various levels (e.g. init, pam_cap) and inherited by every
> child namespace.
>
> Here is a sample program that can be used to verify the functionality:
>
> /*
> * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
> *
> * ./cap_userns_test unshare -r grep Cap /proc/self/status
> * CapInh: 0000000000000000
> * CapPrm: 000001fffffdffff
> * CapEff: 000001fffffdffff
> * CapBnd: 000001fffffdffff
> * CapAmb: 0000000000000000
> * CapUNs: 000001fffffdffff
> */
>
> int main(int argc, char *argv[])
> {
> if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0)
> err(1, "cannot drop userns cap");
>
> execvp(argv[1], argv + 1);
> err(1, "cannot exec");
> }
>
> [1] https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
> [2] https://lore.kernel.org/lkml/[email protected]
> [3] https://lore.kernel.org/lkml/[email protected]
> [4] https://lore.kernel.org/containers/[email protected]
>
> Signed-off-by: Jonathan Calmels <[email protected]>

Thanks.

Reviewed-by: Serge Hallyn <[email protected]>

> ---
> Documentation/filesystems/proc.rst | 1 +
> Documentation/security/credentials.rst | 6 +++
> fs/proc/array.c | 9 ++++
> include/linux/cred.h | 3 ++
> include/uapi/linux/prctl.h | 7 +++
> kernel/cred.c | 3 ++
> kernel/umh.c | 15 +++++++
> kernel/user_namespace.c | 12 +++--
> security/commoncap.c | 62 ++++++++++++++++++++++++--
> security/keys/process_keys.c | 3 ++
> 10 files changed, 111 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 7c3a565ffbef..b5de4eaf1b7b 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -294,6 +294,7 @@ It's slow but very precise.
> CapEff bitmap of effective capabilities
> CapBnd bitmap of capabilities bounding set
> CapAmb bitmap of ambient capabilities
> + CapUns bitmap of user namespace capabilities
> NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...)
> Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...)
> Speculation_Store_Bypass speculative store bypass mitigation status
> diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
> index 357328d566c8..7ee904237023 100644
> --- a/Documentation/security/credentials.rst
> +++ b/Documentation/security/credentials.rst
> @@ -148,6 +148,7 @@ The Linux kernel supports the following types of credentials:
> - Set of permitted capabilities
> - Set of inheritable capabilities
> - Set of effective capabilities
> + - Set of user namespace capabilities
> - Capability bounding set
>
> These are only carried by tasks. They indicate superior capabilities
> @@ -170,6 +171,11 @@ The Linux kernel supports the following types of credentials:
> ``execve()``, especially when a binary is executed that will execute as
> UID 0.
>
> + The user namespace set limits the capabilities granted to user namespaces.
> + It defines what capabilities will be available in the other sets after
> + creating a new user namespace, such as when calling ``clone()`` or
> + ``unshare()`` with ``CLONE_NEWUSER``.
> +
> 3. Secure management flags (securebits).
>
> These are only carried by tasks. These govern the way the above
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 34a47fb0c57f..364e8bb19f9d 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
> const struct cred *cred;
> kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
> cap_bset, cap_ambient;
> +#ifdef CONFIG_USER_NS
> + kernel_cap_t cap_userns;
> +#endif
>
> rcu_read_lock();
> cred = __task_cred(p);
> @@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
> cap_effective = cred->cap_effective;
> cap_bset = cred->cap_bset;
> cap_ambient = cred->cap_ambient;
> +#ifdef CONFIG_USER_NS
> + cap_userns = cred->cap_userns;
> +#endif
> rcu_read_unlock();
>
> render_cap_t(m, "CapInh:\t", &cap_inheritable);
> @@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
> render_cap_t(m, "CapEff:\t", &cap_effective);
> render_cap_t(m, "CapBnd:\t", &cap_bset);
> render_cap_t(m, "CapAmb:\t", &cap_ambient);
> +#ifdef CONFIG_USER_NS
> + render_cap_t(m, "CapUNs:\t", &cap_userns);
> +#endif
> }
>
> static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 2976f534a7a3..adab0031443e 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -124,6 +124,9 @@ struct cred {
> kernel_cap_t cap_effective; /* caps we can actually use */
> kernel_cap_t cap_bset; /* capability bounding set */
> kernel_cap_t cap_ambient; /* Ambient capability set */
> +#ifdef CONFIG_USER_NS
> + kernel_cap_t cap_userns; /* User namespace capability set */
> +#endif
> #ifdef CONFIG_KEYS
> unsigned char jit_keyring; /* default keyring to attach requested
> * keys to */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 35791791a879..b58325ebdc9e 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -198,6 +198,13 @@ struct prctl_mm_map {
> # define PR_CAP_AMBIENT_LOWER 3
> # define PR_CAP_AMBIENT_CLEAR_ALL 4
>
> +/* Control the userns capability set */
> +#define PR_CAP_USERNS 48
> +# define PR_CAP_USERNS_IS_SET 1
> +# define PR_CAP_USERNS_RAISE 2
> +# define PR_CAP_USERNS_LOWER 3
> +# define PR_CAP_USERNS_CLEAR_ALL 4
> +
> /* arm64 Scalable Vector Extension controls */
> /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
> #define PR_SVE_SET_VL 50 /* set task vector length */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 075cfa7c896f..9912c6f3bc6b 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -56,6 +56,9 @@ struct cred init_cred = {
> .cap_permitted = CAP_FULL_SET,
> .cap_effective = CAP_FULL_SET,
> .cap_bset = CAP_FULL_SET,
> +#ifdef CONFIG_USER_NS
> + .cap_userns = CAP_FULL_SET,
> +#endif
> .user = INIT_USER,
> .user_ns = &init_user_ns,
> .group_info = &init_groups,
> diff --git a/kernel/umh.c b/kernel/umh.c
> index 598b3ffe1522..0a5a9cf10d83 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -32,6 +32,9 @@
>
> #include <trace/events/module.h>
>
> +#ifdef CONFIG_USER_NS
> +static kernel_cap_t usermodehelper_userns = CAP_FULL_SET;
> +#endif
> static kernel_cap_t usermodehelper_bset = CAP_FULL_SET;
> static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET;
> static DEFINE_SPINLOCK(umh_sysctl_lock);
> @@ -94,6 +97,9 @@ static int call_usermodehelper_exec_async(void *data)
> new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
> new->cap_inheritable = cap_intersect(usermodehelper_inheritable,
> new->cap_inheritable);
> +#ifdef CONFIG_USER_NS
> + new->cap_userns = cap_intersect(usermodehelper_userns, new->cap_userns);
> +#endif
> spin_unlock(&umh_sysctl_lock);
>
> if (sub_info->init) {
> @@ -560,6 +566,15 @@ static struct ctl_table usermodehelper_table[] = {
> .mode = 0600,
> .proc_handler = proc_cap_handler,
> },
> +#ifdef CONFIG_USER_NS
> + {
> + .procname = "userns",
> + .data = &usermodehelper_userns,
> + .maxlen = 2 * sizeof(unsigned long),
> + .mode = 0600,
> + .proc_handler = proc_cap_handler,
> + },
> +#endif
> };
>
> static int __init init_umh_sysctls(void)
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 0b0b95418b16..7e624607330b 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts)
>
> static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
> {
> - /* Start with the same capabilities as init but useless for doing
> - * anything as the capabilities are bound to the new user namespace.
> - */
> - cred->securebits = SECUREBITS_DEFAULT;
> + /* Start with the capabilities defined in the userns set. */
> + cred->cap_bset = cred->cap_userns;
> + cred->cap_permitted = cred->cap_userns;
> + cred->cap_effective = cred->cap_userns;
> cred->cap_inheritable = CAP_EMPTY_SET;
> - cred->cap_permitted = CAP_FULL_SET;
> - cred->cap_effective = CAP_FULL_SET;
> cred->cap_ambient = CAP_EMPTY_SET;
> - cred->cap_bset = CAP_FULL_SET;
> + cred->securebits = SECUREBITS_DEFAULT;
> #ifdef CONFIG_KEYS
> key_put(cred->request_key_auth);
> cred->request_key_auth = NULL;
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 162d96b3a676..59fafbfcfc5e 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -214,10 +214,10 @@ int cap_capget(const struct task_struct *target, kernel_cap_t *effective,
> }
>
> /*
> - * Determine whether the inheritable capabilities are limited to the old
> + * Determine whether the capabilities are limited to the old
> * permitted set. Returns 1 if they are limited, 0 if they are not.
> */
> -static inline int cap_inh_is_capped(void)
> +static inline int cap_is_capped(void)
> {
> /* they are so limited unless the current task has the CAP_SETPCAP
> * capability
> @@ -228,6 +228,29 @@ static inline int cap_inh_is_capped(void)
> return 1;
> }
>
> +/*
> + * Determine whether a userns capability can be raised.
> + * Returns 1 if it can, 0 otherwise.
> + */
> +#ifdef CONFIG_USER_NS
> +static inline int cap_uns_is_raiseable(unsigned long cap)
> +{
> + if (!!cap_raised(current_cred()->cap_userns, cap))
> + return 1;
> +
> + /*
> + * A capability cannot be raised unless the current task has it in
> + * its bounding set and, without CAP_SETPCAP, its permitted set.
> + */
> + if (!cap_raised(current_cred()->cap_bset, cap))
> + return 0;
> + if (cap_is_capped() && !cap_raised(current_cred()->cap_permitted, cap))
> + return 0;
> +
> + return 1;
> +}
> +#endif
> +
> /**
> * cap_capset - Validate and apply proposed changes to current's capabilities
> * @new: The proposed new credentials; alterations should be made here
> @@ -246,7 +269,7 @@ int cap_capset(struct cred *new,
> const kernel_cap_t *inheritable,
> const kernel_cap_t *permitted)
> {
> - if (cap_inh_is_capped() &&
> + if (cap_is_capped() &&
> !cap_issubset(*inheritable,
> cap_combine(old->cap_inheritable,
> old->cap_permitted)))
> @@ -1382,6 +1405,39 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
> return commit_creds(new);
> }
>
> +#ifdef CONFIG_USER_NS
> + case PR_CAP_USERNS:
> + if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
> + if (arg3 | arg4 | arg5)
> + return -EINVAL;
> +
> + new = prepare_creds();
> + if (!new)
> + return -ENOMEM;
> + cap_clear(new->cap_userns);
> + return commit_creds(new);
> + }
> +
> + if (((!cap_valid(arg3)) | arg4 | arg5))
> + return -EINVAL;
> +
> + if (arg2 == PR_CAP_USERNS_IS_SET)
> + return !!cap_raised(current_cred()->cap_userns, arg3);
> + if (arg2 != PR_CAP_USERNS_RAISE && arg2 != PR_CAP_USERNS_LOWER)
> + return -EINVAL;
> + if (arg2 == PR_CAP_USERNS_RAISE && !cap_uns_is_raiseable(arg3))
> + return -EPERM;
> +
> + new = prepare_creds();
> + if (!new)
> + return -ENOMEM;
> + if (arg2 == PR_CAP_USERNS_RAISE)
> + cap_raise(new->cap_userns, arg3);
> + else
> + cap_lower(new->cap_userns, arg3);
> + return commit_creds(new);
> +#endif
> +
> default:
> /* No functionality available - continue with default */
> return -ENOSYS;
> diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
> index b5d5333ab330..e3670d815435 100644
> --- a/security/keys/process_keys.c
> +++ b/security/keys/process_keys.c
> @@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork)
> new->cap_effective = old->cap_effective;
> new->cap_ambient = old->cap_ambient;
> new->cap_bset = old->cap_bset;
> +#ifdef CONFIG_USER_NS
> + new->cap_userns = old->cap_userns;
> +#endif
>
> new->jit_keyring = old->jit_keyring;
> new->thread_keyring = key_get(old->thread_keyring);
> --
> 2.45.2

2024-06-10 02:33:25

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] capabilities: Add securebit to restrict userns caps

On Sun, Jun 09, 2024 at 03:43:35AM -0700, Jonathan Calmels wrote:
> This patch adds a new capability security bit designed to constrain a
> task’s userns capability set to its bounding set. The reason for this is
> twofold:
>
> - This serves as a quick and easy way to lock down a set of capabilities
> for a task, thus ensuring that any namespace it creates will never be
> more privileged than itself is.
> - This helps userspace transition to more secure defaults by not requiring
> specific logic for the userns capability set, or libcap support.
>
> Example:
>
> # capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \
> -c 'unshare -r grep Cap /proc/self/status'
> CapInh: 0000000000000000
> CapPrm: 000001fffffdffff
> CapEff: 000001fffffdffff
> CapBnd: 000001fffffdffff
> CapAmb: 0000000000000000
> CapUNs: 000001fffffdffff

But you are not (that I can see, in this or the previous patch)
keeping SECURE_USERNS_STRICT_CAPS in securebits on the next
level unshare. Though I think it's ok, because by then both
cap_userns and cap_bset are reduced and cap_userns can't be
expanded. (Sorry, just thinking aloud here)

> Signed-off-by: Jonathan Calmels <[email protected]>
> ---
> include/linux/securebits.h | 1 +
> include/uapi/linux/securebits.h | 11 ++++++++++-
> kernel/user_namespace.c | 5 +++++
> 3 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/securebits.h b/include/linux/securebits.h
> index 656528673983..5f9d85cd69c3 100644
> --- a/include/linux/securebits.h
> +++ b/include/linux/securebits.h
> @@ -5,4 +5,5 @@
> #include <uapi/linux/securebits.h>
>
> #define issecure(X) (issecure_mask(X) & current_cred_xxx(securebits))
> +#define iscredsecure(cred, X) (issecure_mask(X) & cred->securebits)
> #endif /* !_LINUX_SECUREBITS_H */
> diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h
> index d6d98877ff1a..2da3f4be4531 100644
> --- a/include/uapi/linux/securebits.h
> +++ b/include/uapi/linux/securebits.h
> @@ -52,10 +52,19 @@
> #define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \
> (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_LOCKED))
>
> +/* When set, user namespace capabilities are restricted to their parent's bounding set. */
> +#define SECURE_USERNS_STRICT_CAPS 8
> +#define SECURE_USERNS_STRICT_CAPS_LOCKED 9 /* make bit-8 immutable */
> +
> +#define SECBIT_USERNS_STRICT_CAPS (issecure_mask(SECURE_USERNS_STRICT_CAPS))
> +#define SECBIT_USERNS_STRICT_CAPS_LOCKED \
> + (issecure_mask(SECURE_USERNS_STRICT_CAPS_LOCKED))
> +
> #define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \
> issecure_mask(SECURE_NO_SETUID_FIXUP) | \
> issecure_mask(SECURE_KEEP_CAPS) | \
> - issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE))
> + issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \
> + issecure_mask(SECURE_USERNS_STRICT_CAPS))
> #define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1)
>
> #endif /* _UAPI_LINUX_SECUREBITS_H */
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 7e624607330b..53848e2b68cd 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -10,6 +10,7 @@
> #include <linux/cred.h>
> #include <linux/securebits.h>
> #include <linux/security.h>
> +#include <linux/capability.h>
> #include <linux/keyctl.h>
> #include <linux/key-type.h>
> #include <keys/user-type.h>
> @@ -42,6 +43,10 @@ static void dec_user_namespaces(struct ucounts *ucounts)
>
> static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
> {
> + /* Limit userns capabilities to our parent's bounding set. */

In the case of userns_install(), it will be the target user namespace
creator's bounding set, right? Not "our parent's"?

> + if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS))
> + cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset);
> +
> /* Start with the capabilities defined in the userns set. */
> cred->cap_bset = cred->cap_userns;
> cred->cap_permitted = cred->cap_userns;
> --
> 2.45.2

2024-06-10 08:43:11

by Jonathan Calmels

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] capabilities: Add user namespace capabilities

On Sun, Jun 09, 2024 at 08:50:24PM GMT, Serge E. Hallyn wrote:
> On Sun, Jun 09, 2024 at 03:43:34AM -0700, Jonathan Calmels wrote:
> > Attackers often rely on user namespaces to get elevated (yet confined)
> > privileges in order to target specific subsystems (e.g. [1]). Distributions
>
> I'd modify this to say "in order to target *bugs* in specific subsystems" :)

Ack

> > This effectively mimics the inheritable set rules and means that, by
> > default, only root in the user namespace can regain userns capabilities
> > previously dropped:
>
> Something about this last sentence feels wrong, but I'm not sure what
> the best alternative would be. As is, though, it makes it sound as though
> root in the userns can always regain previously dropped capabilities, but
> that's not true if dropped in ancestor ns, or if root also dropped the
> bits from its bounding set (right?).

Right, the wording is a little bit confusing here I admit.
What I meant to say is that if a cap is dropped in a *given* namespace,
then it can only be regained by root there. But yes, caps can never be
regained from ancestors ns. I'll try to rephrase it.

BTW, this is rather strict, but I think that's what we want right,
something simple? Alternative would be to have a new cap masked off by
default, but if granted to a userns, allows you to regain ancestors
caps.

2024-06-10 10:11:24

by Jonathan Calmels

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] capabilities: Add securebit to restrict userns caps

On Sun, Jun 09, 2024 at 09:33:01PM GMT, Serge E. Hallyn wrote:
> On Sun, Jun 09, 2024 at 03:43:35AM -0700, Jonathan Calmels wrote:
> > This patch adds a new capability security bit designed to constrain a
> > task’s userns capability set to its bounding set. The reason for this is
> > twofold:
> >
> > - This serves as a quick and easy way to lock down a set of capabilities
> > for a task, thus ensuring that any namespace it creates will never be
> > more privileged than itself is.
> > - This helps userspace transition to more secure defaults by not requiring
> > specific logic for the userns capability set, or libcap support.
> >
> > Example:
> >
> > # capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \
> > -c 'unshare -r grep Cap /proc/self/status'
> > CapInh: 0000000000000000
> > CapPrm: 000001fffffdffff
> > CapEff: 000001fffffdffff
> > CapBnd: 000001fffffdffff
> > CapAmb: 0000000000000000
> > CapUNs: 000001fffffdffff
>
> But you are not (that I can see, in this or the previous patch)
> keeping SECURE_USERNS_STRICT_CAPS in securebits on the next
> level unshare. Though I think it's ok, because by then both
> cap_userns and cap_bset are reduced and cap_userns can't be
> expanded. (Sorry, just thinking aloud here)

Right this is safe to reset, but maybe we do keep it if the secbit is
locked? This is kind of a special case compared to the other bits.

> > + /* Limit userns capabilities to our parent's bounding set. */
>
> In the case of userns_install(), it will be the target user namespace
> creator's bounding set, right? Not "our parent's"?

Good point, I should reword this comment.

2024-06-10 12:49:29

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] capabilities: Add user namespace capabilities

On Mon, Jun 10, 2024 at 01:47:13AM -0700, Jonathan Calmels wrote:
> On Sun, Jun 09, 2024 at 08:50:24PM GMT, Serge E. Hallyn wrote:
> > On Sun, Jun 09, 2024 at 03:43:34AM -0700, Jonathan Calmels wrote:
> > > Attackers often rely on user namespaces to get elevated (yet confined)
> > > privileges in order to target specific subsystems (e.g. [1]). Distributions
> >
> > I'd modify this to say "in order to target *bugs* in specific subsystems" :)
>
> Ack
>
> > > This effectively mimics the inheritable set rules and means that, by
> > > default, only root in the user namespace can regain userns capabilities
> > > previously dropped:
> >
> > Something about this last sentence feels wrong, but I'm not sure what
> > the best alternative would be. As is, though, it makes it sound as though
> > root in the userns can always regain previously dropped capabilities, but
> > that's not true if dropped in ancestor ns, or if root also dropped the
> > bits from its bounding set (right?).
>
> Right, the wording is a little bit confusing here I admit.
> What I meant to say is that if a cap is dropped in a *given* namespace,
> then it can only be regained by root there. But yes, caps can never be
> regained from ancestors ns. I'll try to rephrase it.
>
> BTW, this is rather strict, but I think that's what we want right,

Yes,

> something simple? Alternative would be to have a new cap masked off by
> default, but if granted to a userns, allows you to regain ancestors
> caps.

we absolutely do not want to allow regaining caps dropped in an
ancestor namespace.

thanks,
-serge

2024-06-10 13:01:36

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] capabilities: Add user namespace capabilities

On Sun, Jun 09, 2024 at 03:43:34AM -0700, Jonathan Calmels wrote:
> Attackers often rely on user namespaces to get elevated (yet confined)
> privileges in order to target specific subsystems (e.g. [1]). Distributions
> have been pretty adamant that they need a way to configure these, most of
> them carry out-of-tree patches to do so, or plainly refuse to enable them.
> As a result, there have been multiple efforts over the years to introduce
> various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
>
> While we acknowledge that there are already ways to control the creation of
> such namespaces (the most recent being a LSM hook), there are inherent
> issues with these approaches. Preventing the user namespace creation is not
> fine-grained enough, and in some cases, incompatible with various userspace
> expectations (e.g. container runtimes, browser sandboxing, service
> isolation)
>
> This patch addresses these limitations by introducing an additional
> capability set used to restrict the permissions granted when creating user
> namespaces. This way, processes can apply the principle of least privilege
> by configuring only the capabilities they need for their namespaces.
>
> For compatibility reasons, processes always start with a full userns
> capability set.
>
> On namespace creation, the userns capability set (pU) is assigned to the
> new effective (pE), permitted (pP) and bounding set (X) of the task:
>
> pU = pE = pP = X
>
> The userns capability set obeys the invariant that no bit can ever be set
> if it is not already part of the task’s bounding set. This ensures that
> no namespace can ever gain more privileges than its predecessors.
> Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
> in the userns set requires its corresponding bit to be set in the permitted
> set. This effectively mimics the inheritable set rules and means that, by
> default, only root in the user namespace can regain userns capabilities
> previously dropped:
>
> p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
>
> Note that since userns capabilities are strictly hierarchical, policies can
> be enforced at various levels (e.g. init, pam_cap) and inherited by every
> child namespace.
>
> Here is a sample program that can be used to verify the functionality:
>
> /*
> * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
> *
> * ./cap_userns_test unshare -r grep Cap /proc/self/status
> * CapInh: 0000000000000000
> * CapPrm: 000001fffffdffff
> * CapEff: 000001fffffdffff
> * CapBnd: 000001fffffdffff
> * CapAmb: 0000000000000000
> * CapUNs: 000001fffffdffff
> */

...

> +#ifdef CONFIG_USER_NS
> + case PR_CAP_USERNS:
> + if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
> + if (arg3 | arg4 | arg5)
> + return -EINVAL;
> +
> + new = prepare_creds();
> + if (!new)
> + return -ENOMEM;
> + cap_clear(new->cap_userns);
> + return commit_creds(new);
> + }
> +
> + if (((!cap_valid(arg3)) | arg4 | arg5))
> + return -EINVAL;
> +
> + if (arg2 == PR_CAP_USERNS_IS_SET)
> + return !!cap_raised(current_cred()->cap_userns, arg3);
> + if (arg2 != PR_CAP_USERNS_RAISE && arg2 != PR_CAP_USERNS_LOWER)
> + return -EINVAL;
> + if (arg2 == PR_CAP_USERNS_RAISE && !cap_uns_is_raiseable(arg3))
> + return -EPERM;
> +
> + new = prepare_creds();
> + if (!new)
> + return -ENOMEM;
> + if (arg2 == PR_CAP_USERNS_RAISE)
> + cap_raise(new->cap_userns, arg3);
> + else
> + cap_lower(new->cap_userns, arg3);

Now, one thing that does occur to me here is that there is a
very mild form of sendmail-capabilities vulnerability that
could happen here. Unpriv user joe can drop CAP_SYS_ADMIN
from cap_userns, then run a setuid-root program which starts
a container which expects CAP_SYS_ADMIN. This could be a
shared container, and so joe could be breaking expected
behavior there.

I *think* we want to say we don't care about this case, but
if we did, I suppose we could say that the normal cap raise
rules on setuid should apply to cap_userns?


2024-06-10 13:25:32

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] capabilities: Add securebit to restrict userns caps

On Mon, Jun 10, 2024 at 02:46:06AM -0700, Jonathan Calmels wrote:
> On Sun, Jun 09, 2024 at 09:33:01PM GMT, Serge E. Hallyn wrote:
> > On Sun, Jun 09, 2024 at 03:43:35AM -0700, Jonathan Calmels wrote:
> > > This patch adds a new capability security bit designed to constrain a
> > > task’s userns capability set to its bounding set. The reason for this is
> > > twofold:
> > >
> > > - This serves as a quick and easy way to lock down a set of capabilities
> > > for a task, thus ensuring that any namespace it creates will never be
> > > more privileged than itself is.
> > > - This helps userspace transition to more secure defaults by not requiring
> > > specific logic for the userns capability set, or libcap support.
> > >
> > > Example:
> > >
> > > # capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \
> > > -c 'unshare -r grep Cap /proc/self/status'
> > > CapInh: 0000000000000000
> > > CapPrm: 000001fffffdffff
> > > CapEff: 000001fffffdffff
> > > CapBnd: 000001fffffdffff
> > > CapAmb: 0000000000000000
> > > CapUNs: 000001fffffdffff
> >
> > But you are not (that I can see, in this or the previous patch)
> > keeping SECURE_USERNS_STRICT_CAPS in securebits on the next
> > level unshare. Though I think it's ok, because by then both
> > cap_userns and cap_bset are reduced and cap_userns can't be
> > expanded. (Sorry, just thinking aloud here)
>
> Right this is safe to reset, but maybe we do keep it if the secbit is
> locked? This is kind of a special case compared to the other bits.

I don't think it would be worth the extra complication in the
secbits code, and it's semantically very different from the
cap_userns.

> > > + /* Limit userns capabilities to our parent's bounding set. */
> >
> > In the case of userns_install(), it will be the target user namespace
> > creator's bounding set, right? Not "our parent's"?
>
> Good point, I should reword this comment.

2024-06-11 00:17:09

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Introduce user namespace capabilities

On Sun, Jun 09, 2024 at 03:43:33AM -0700, Jonathan Calmels wrote:
> This patch series introduces a new user namespace capability set, as
> well as some plumbing around it (i.e. sysctl, secbit, lsm support).
>
> First patch goes over the motivations for this as well as prior art.
>
> In summary, while user namespaces are a great success today in that they
> avoid running a lot of code as root, they also expand the attack surface
> of the kernel substantially which is often abused by attackers.
> Methods exist to limit the creation of such namespaces [1], however,
> application developers often need to assume that user namespaces are
> available for various tasks such as sandboxing. Thus, instead of
> restricting the creation of user namespaces, we offer ways for userspace
> to limit the capabilities granted to them.
>
> Why a new capability set and not something specific to the userns (e.g.
> ioctl_ns)?
>
> 1. We can't really expect userspace to patch every single callsite
> and opt-in this new security mechanism.
>
> 2. We don't necessarily want policies enforced at said callsites.
> For example a service like systemd-machined or a PAM session need to
> be able to place restrictions on any namespace spawned under it.
>
> 3. We would need to come up with inheritance rules, querying
> capabilities, etc. At this point we're just reinventing capability
> sets.
>
> 4. We can easily define interactions between capability sets, thus
> helping with adoption (patch 2 is an example of this)
>
> Some examples of how this could be leveraged in userspace:
>
> - Prevent user from getting CAP_NET_ADMIN in user namespaces under SSH:
> echo "auth optional pam_cap.so" >> /etc/pam.d/sshd
> echo "!cap_net_admin $USER" >> /etc/security/capability.conf
> capsh --secbits=$((1 << 8)) -- -c /usr/sbin/sshd
>
> - Prevent containers from ever getting CAP_DAC_OVERRIDE:
> systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \
> -p SecureBits=userns-strict-caps \
> /usr/bin/dockerd
> systemd-run -p UserNSCapabilities=~CAP_DAC_OVERRIDE \
> /usr/bin/incusd
>
> - Kernel could be vulnerable to CAP_SYS_RAWIO exploits, prevent it:
> sysctl -w cap_bound_userns_mask=0x1fffffdffff
>
> - Drop CAP_SYS_ADMIN for this shell and all the user namespaces below it:
> bwrap --unshare-user --cap-drop CAP_SYS_ADMIN /bin/sh
>

Where are the tests for this patchset? I see you updated the bpf tests for the
bpf lsm bits, but there's nothing to validate this new behavior or exercise the
new ioctl you've added. Thanks,

Josef

2024-06-11 08:04:50

by Jonathan Calmels

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
> On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels <[email protected]> wrote:
> >
> > This patch allows modifying the various capabilities of the struct cred
> > in BPF-LSM hooks. More specifically, the userns_create hook called
> > prior to creating a new user namespace.
> >
> > With the introduction of userns capabilities, this effectively provides
> > a simple way for LSMs to control the capabilities granted to a user
> > namespace and all its descendants.
> >
> > Update the selftests accordingly by dropping CAP_SYS_ADMIN in
> > namespaces and checking the resulting task's bounding set.
> >
> > Signed-off-by: Jonathan Calmels <[email protected]>
> > ---
> > include/linux/lsm_hook_defs.h | 2 +-
> > include/linux/security.h | 4 +-
> > kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++
> > security/apparmor/lsm.c | 2 +-
> > security/security.c | 6 +-
> > security/selinux/hooks.c | 2 +-
> > .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++--
> > .../selftests/bpf/progs/test_deny_namespace.c | 7 ++-
> > 8 files changed, 76 insertions(+), 14 deletions(-)
>
> I'm not sure we want to go down the path of a LSM modifying the POSIX
> capabilities of a task, other than the capabilities/commoncap LSM. It
> sets a bad precedent and could further complicate issues around LSM
> ordering.

Well unless I'm misunderstanding, this does allow modifying the
capabilities/commoncap LSM through BTF. The reason for allowing
`userns_create` to be modified is that it is functionally very similar
to `cred_prepare` in that it operates with new creds (but specific to
user namespaces because of reasons detailed in [1]).

There were some concerns in previous threads that the userns caps by
themselves wouldn't be granular enough, hence the LSM integration.
Ubuntu for example, currently has to resort to a hardcoded profile
transition to achieve this [2].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cd4c5c2101cb092db00f61f69d24380cf7a0ee8
[2] https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/commit/?id=43a6c29532f517179fea8c94949d657d71f4fc13

2024-06-11 08:16:37

by Jonathan Calmels

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] capabilities: Add user namespace capabilities

On Mon, Jun 10, 2024 at 08:00:57AM GMT, Serge E. Hallyn wrote:
>
> Now, one thing that does occur to me here is that there is a
> very mild form of sendmail-capabilities vulnerability that
> could happen here. Unpriv user joe can drop CAP_SYS_ADMIN
> from cap_userns, then run a setuid-root program which starts
> a container which expects CAP_SYS_ADMIN. This could be a
> shared container, and so joe could be breaking expected
> behavior there.
>
> I *think* we want to say we don't care about this case, but
> if we did, I suppose we could say that the normal cap raise
> rules on setuid should apply to cap_userns?
>

Right, good catch. If we do want to fix it, we could just check for
setuid no? Or do we want to follow the normal root inheritance rules
too? Essentially something like this:

pU' = is_suid(root) ? X : pU

2024-06-11 09:01:25

by Jonathan Calmels

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Introduce user namespace capabilities

On Mon, Jun 10, 2024 at 04:12:27PM GMT, Josef Bacik wrote:
> Where are the tests for this patchset? I see you updated the bpf tests for the
> bpf lsm bits, but there's nothing to validate this new behavior or exercise the
> new ioctl you've added. Thanks,

Apologies, I haven't had much time to spend on it so I prioritized the
rest. But yes, we should certainly update the capabilities selftests
once we agreed on the different behaviors.

2024-06-11 10:33:11

by John Johansen

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On 6/11/24 01:09, Jonathan Calmels wrote:
> On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
>> On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels <[email protected]> wrote:
>>>
>>> This patch allows modifying the various capabilities of the struct cred
>>> in BPF-LSM hooks. More specifically, the userns_create hook called
>>> prior to creating a new user namespace.
>>>
>>> With the introduction of userns capabilities, this effectively provides
>>> a simple way for LSMs to control the capabilities granted to a user
>>> namespace and all its descendants.
>>>
>>> Update the selftests accordingly by dropping CAP_SYS_ADMIN in
>>> namespaces and checking the resulting task's bounding set.
>>>
>>> Signed-off-by: Jonathan Calmels <[email protected]>
>>> ---
>>> include/linux/lsm_hook_defs.h | 2 +-
>>> include/linux/security.h | 4 +-
>>> kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++
>>> security/apparmor/lsm.c | 2 +-
>>> security/security.c | 6 +-
>>> security/selinux/hooks.c | 2 +-
>>> .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++--
>>> .../selftests/bpf/progs/test_deny_namespace.c | 7 ++-
>>> 8 files changed, 76 insertions(+), 14 deletions(-)
>>
>> I'm not sure we want to go down the path of a LSM modifying the POSIX
>> capabilities of a task, other than the capabilities/commoncap LSM. It
>> sets a bad precedent and could further complicate issues around LSM
>> ordering.
>
> Well unless I'm misunderstanding, this does allow modifying the
> capabilities/commoncap LSM through BTF. The reason for allowing
> `userns_create` to be modified is that it is functionally very similar
> to `cred_prepare` in that it operates with new creds (but specific to
> user namespaces because of reasons detailed in [1]).
>
yes

> There were some concerns in previous threads that the userns caps by
> themselves wouldn't be granular enough, hence the LSM integration.

> Ubuntu for example, currently has to resort to a hardcoded profile
> transition to achieve this [2].
>

The hard coded profile transition, is because the more generic solution
as part of policy just wasn't ready. The hard coding will go away before
it is upstreamed.

But yes, updating the cred really is necessary for the flexibility needed
whether it is modifying the POSIX capabilities of the task or the LSM
modifying its own security blob.

I do share some of Paul's concerns about the LSM modifying the POSIX
capabilities of the task, but also thing the LSM here needs to be
able to modify its own blob.

I have a very similar patch I was planning on posting once the
work to fix the hard coded profile transition is done.


> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cd4c5c2101cb092db00f61f69d24380cf7a0ee8
> [2] https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/commit/?id=43a6c29532f517179fea8c94949d657d71f4fc13


2024-06-11 19:01:34

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Tue, Jun 11, 2024 at 6:32 AM John Johansen
<[email protected]> wrote:
>
> On 6/11/24 01:09, Jonathan Calmels wrote:
> > On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
> >> On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels <[email protected]> wrote:
> >>>
> >>> This patch allows modifying the various capabilities of the struct cred
> >>> in BPF-LSM hooks. More specifically, the userns_create hook called
> >>> prior to creating a new user namespace.
> >>>
> >>> With the introduction of userns capabilities, this effectively provides
> >>> a simple way for LSMs to control the capabilities granted to a user
> >>> namespace and all its descendants.
> >>>
> >>> Update the selftests accordingly by dropping CAP_SYS_ADMIN in
> >>> namespaces and checking the resulting task's bounding set.
> >>>
> >>> Signed-off-by: Jonathan Calmels <[email protected]>
> >>> ---
> >>> include/linux/lsm_hook_defs.h | 2 +-
> >>> include/linux/security.h | 4 +-
> >>> kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++
> >>> security/apparmor/lsm.c | 2 +-
> >>> security/security.c | 6 +-
> >>> security/selinux/hooks.c | 2 +-
> >>> .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++--
> >>> .../selftests/bpf/progs/test_deny_namespace.c | 7 ++-
> >>> 8 files changed, 76 insertions(+), 14 deletions(-)
> >>
> >> I'm not sure we want to go down the path of a LSM modifying the POSIX
> >> capabilities of a task, other than the capabilities/commoncap LSM. It
> >> sets a bad precedent and could further complicate issues around LSM
> >> ordering.
> >
> > Well unless I'm misunderstanding, this does allow modifying the
> > capabilities/commoncap LSM through BTF. The reason for allowing
> > `userns_create` to be modified is that it is functionally very similar
> > to `cred_prepare` in that it operates with new creds (but specific to
> > user namespaces because of reasons detailed in [1]).
>
> yes
>
> > There were some concerns in previous threads that the userns caps by
> > themselves wouldn't be granular enough, hence the LSM integration.
>
> > Ubuntu for example, currently has to resort to a hardcoded profile
> > transition to achieve this [2].
> >
>
> The hard coded profile transition, is because the more generic solution
> as part of policy just wasn't ready. The hard coding will go away before
> it is upstreamed.
>
> But yes, updating the cred really is necessary for the flexibility needed
> whether it is modifying the POSIX capabilities of the task or the LSM
> modifying its own security blob.
>
> I do share some of Paul's concerns about the LSM modifying the POSIX
> capabilities of the task, but also thing the LSM here needs to be
> able to modify its own blob.

To be clear, this isn't about a generic LSM needing to update its own
blob (LSM state), it is about the BPF LSM updating the capability
sets. While we obviously must support a LSM updating its own state,
I'm currently of the opinion that allowing one LSM to update the state
of another LSM is only going to lead to problems. We wouldn't want to
allow Smack to update AppArmor state, and from my current perspective
allowing the BPF LSM to update the capability state is no different.

It's also important to keep in mind that if we allow one LSM to do
something, we need to allow all LSMs to do something. If we allow
multiple LSMs to manipulate the capability sets, how do we reconcile
differences in the desired capability state? Does that resolution
change depending on what LSMs are enabled at build time? Enabled at
boot? Similarly, what about custom LSM ordering?

What about those LSMs that use a task's capabilities as an input to an
access control decision? If those LSMs allow an access based on a
given capability set only to have a LSM later in the ordering modify
that capability set to something which would have resulted in an
access denial, do we risk a security regression?

Our current approach to handling multiple LSMs is that each LSM is
limited to modifying its own state, and I'm pretty confident that we
stick to this model if we have any hope of preserving the sanity of
the LSM layer as a whole. If you want to modify the capability set
you need to do so within the confines of the capability LSM and/or
modify the other related kernel subsystems (which I'm guessing will
likely necessitate a change in the LSMs, but that avenue is very
unclear if such an option even exists).

--
paul-moore.com

2024-06-11 22:39:11

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels <[email protected]> wrote:
> On Tue, Jun 11, 2024 at 03:01:01PM GMT, Paul Moore wrote:
> > On Tue, Jun 11, 2024 at 6:32 AM John Johansen
> > <[email protected]> wrote:
> > >
> > > On 6/11/24 01:09, Jonathan Calmels wrote:
> > > > On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
> > > >> On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels <[email protected]> wrote:
> > > >>>
> > > >>> This patch allows modifying the various capabilities of the struct cred
> > > >>> in BPF-LSM hooks. More specifically, the userns_create hook called
> > > >>> prior to creating a new user namespace.
> > > >>>
> > > >>> With the introduction of userns capabilities, this effectively provides
> > > >>> a simple way for LSMs to control the capabilities granted to a user
> > > >>> namespace and all its descendants.
> > > >>>
> > > >>> Update the selftests accordingly by dropping CAP_SYS_ADMIN in
> > > >>> namespaces and checking the resulting task's bounding set.
> > > >>>
> > > >>> Signed-off-by: Jonathan Calmels <[email protected]>
> > > >>> ---
> > > >>> include/linux/lsm_hook_defs.h | 2 +-
> > > >>> include/linux/security.h | 4 +-
> > > >>> kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++
> > > >>> security/apparmor/lsm.c | 2 +-
> > > >>> security/security.c | 6 +-
> > > >>> security/selinux/hooks.c | 2 +-
> > > >>> .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++--
> > > >>> .../selftests/bpf/progs/test_deny_namespace.c | 7 ++-
> > > >>> 8 files changed, 76 insertions(+), 14 deletions(-)
> > > >>
> > > >> I'm not sure we want to go down the path of a LSM modifying the POSIX
> > > >> capabilities of a task, other than the capabilities/commoncap LSM. It
> > > >> sets a bad precedent and could further complicate issues around LSM
> > > >> ordering.
> > > >
> > > > Well unless I'm misunderstanding, this does allow modifying the
> > > > capabilities/commoncap LSM through BTF. The reason for allowing
> > > > `userns_create` to be modified is that it is functionally very similar
> > > > to `cred_prepare` in that it operates with new creds (but specific to
> > > > user namespaces because of reasons detailed in [1]).
> > >
> > > yes
> > >
> > > > There were some concerns in previous threads that the userns caps by
> > > > themselves wouldn't be granular enough, hence the LSM integration.
> > >
> > > > Ubuntu for example, currently has to resort to a hardcoded profile
> > > > transition to achieve this [2].
> > > >
> > >
> > > The hard coded profile transition, is because the more generic solution
> > > as part of policy just wasn't ready. The hard coding will go away before
> > > it is upstreamed.
> > >
> > > But yes, updating the cred really is necessary for the flexibility needed
> > > whether it is modifying the POSIX capabilities of the task or the LSM
> > > modifying its own security blob.
> > >
> > > I do share some of Paul's concerns about the LSM modifying the POSIX
> > > capabilities of the task, but also thing the LSM here needs to be
> > > able to modify its own blob.
> >
> > To be clear, this isn't about a generic LSM needing to update its own
> > blob (LSM state), it is about the BPF LSM updating the capability
> > sets. While we obviously must support a LSM updating its own state,
> > I'm currently of the opinion that allowing one LSM to update the state
> > of another LSM is only going to lead to problems. We wouldn't want to
> > allow Smack to update AppArmor state, and from my current perspective
> > allowing the BPF LSM to update the capability state is no different.
> >
> > It's also important to keep in mind that if we allow one LSM to do
> > something, we need to allow all LSMs to do something. If we allow
> > multiple LSMs to manipulate the capability sets, how do we reconcile
> > differences in the desired capability state? Does that resolution
> > change depending on what LSMs are enabled at build time? Enabled at
> > boot? Similarly, what about custom LSM ordering?
> >
> > What about those LSMs that use a task's capabilities as an input to an
> > access control decision? If those LSMs allow an access based on a
> > given capability set only to have a LSM later in the ordering modify
> > that capability set to something which would have resulted in an
> > access denial, do we risk a security regression?
>
> I understand the concerns, what I fail to understand however, is how is
> it any different from say the `cred_prepare` hook today?

The existing cred_prepare hooks only operate on their own small
portion of the cred::security blob. What you are proposing would be
the BPF LSM operating on the capability sets that it does not "own"
(they belong to the capability LSM).

If you see that as a minor difference, please understand that if you
skip past that you have all the issues I mentioned in my previous
message to deal with.

> > Our current approach to handling multiple LSMs is that each LSM is
> > limited to modifying its own state, and I'm pretty confident that we
> > stick to this model if we have any hope of preserving the sanity of
> > the LSM layer as a whole. If you want to modify the capability set
> > you need to do so within the confines of the capability LSM and/or
> > modify the other related kernel subsystems (which I'm guessing will
> > likely necessitate a change in the LSMs, but that avenue is very
> > unclear if such an option even exists).
>
> What do you mean by "within the confines of the capability LSM" here?

Basically security/commoncap.c. One could make a lot of arguments
about if it is, or isn't, a LSM, but commoncap.c registers LSM hooks
which is pretty much the definition of a LSM from an implementation
point of view.

> Arguably, if we do want fine-grained userns policies, we need LSMs to
> influence the userns capset at some point.

One could always use, or develop, a LSM that offers additional
controls around exercising capabilities. There are currently four
in-tree LSMs, including the capabilities LSM, which supply a
security_capable() hook that is used by the capability-based access
controls in the kernel; all of these hook implementations work
together within the LSM framework and provide an additional level of
control/granularity beyond the existing capabilities.

--
paul-moore.com

2024-06-11 23:31:48

by Jonathan Calmels

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Tue, Jun 11, 2024 at 03:01:01PM GMT, Paul Moore wrote:
> On Tue, Jun 11, 2024 at 6:32 AM John Johansen
> <[email protected]> wrote:
> >
> > On 6/11/24 01:09, Jonathan Calmels wrote:
> > > On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
> > >> On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels <[email protected]> wrote:
> > >>>
> > >>> This patch allows modifying the various capabilities of the struct cred
> > >>> in BPF-LSM hooks. More specifically, the userns_create hook called
> > >>> prior to creating a new user namespace.
> > >>>
> > >>> With the introduction of userns capabilities, this effectively provides
> > >>> a simple way for LSMs to control the capabilities granted to a user
> > >>> namespace and all its descendants.
> > >>>
> > >>> Update the selftests accordingly by dropping CAP_SYS_ADMIN in
> > >>> namespaces and checking the resulting task's bounding set.
> > >>>
> > >>> Signed-off-by: Jonathan Calmels <[email protected]>
> > >>> ---
> > >>> include/linux/lsm_hook_defs.h | 2 +-
> > >>> include/linux/security.h | 4 +-
> > >>> kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++
> > >>> security/apparmor/lsm.c | 2 +-
> > >>> security/security.c | 6 +-
> > >>> security/selinux/hooks.c | 2 +-
> > >>> .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++--
> > >>> .../selftests/bpf/progs/test_deny_namespace.c | 7 ++-
> > >>> 8 files changed, 76 insertions(+), 14 deletions(-)
> > >>
> > >> I'm not sure we want to go down the path of a LSM modifying the POSIX
> > >> capabilities of a task, other than the capabilities/commoncap LSM. It
> > >> sets a bad precedent and could further complicate issues around LSM
> > >> ordering.
> > >
> > > Well unless I'm misunderstanding, this does allow modifying the
> > > capabilities/commoncap LSM through BTF. The reason for allowing
> > > `userns_create` to be modified is that it is functionally very similar
> > > to `cred_prepare` in that it operates with new creds (but specific to
> > > user namespaces because of reasons detailed in [1]).
> >
> > yes
> >
> > > There were some concerns in previous threads that the userns caps by
> > > themselves wouldn't be granular enough, hence the LSM integration.
> >
> > > Ubuntu for example, currently has to resort to a hardcoded profile
> > > transition to achieve this [2].
> > >
> >
> > The hard coded profile transition, is because the more generic solution
> > as part of policy just wasn't ready. The hard coding will go away before
> > it is upstreamed.
> >
> > But yes, updating the cred really is necessary for the flexibility needed
> > whether it is modifying the POSIX capabilities of the task or the LSM
> > modifying its own security blob.
> >
> > I do share some of Paul's concerns about the LSM modifying the POSIX
> > capabilities of the task, but also thing the LSM here needs to be
> > able to modify its own blob.
>
> To be clear, this isn't about a generic LSM needing to update its own
> blob (LSM state), it is about the BPF LSM updating the capability
> sets. While we obviously must support a LSM updating its own state,
> I'm currently of the opinion that allowing one LSM to update the state
> of another LSM is only going to lead to problems. We wouldn't want to
> allow Smack to update AppArmor state, and from my current perspective
> allowing the BPF LSM to update the capability state is no different.
>
> It's also important to keep in mind that if we allow one LSM to do
> something, we need to allow all LSMs to do something. If we allow
> multiple LSMs to manipulate the capability sets, how do we reconcile
> differences in the desired capability state? Does that resolution
> change depending on what LSMs are enabled at build time? Enabled at
> boot? Similarly, what about custom LSM ordering?
>
> What about those LSMs that use a task's capabilities as an input to an
> access control decision? If those LSMs allow an access based on a
> given capability set only to have a LSM later in the ordering modify
> that capability set to something which would have resulted in an
> access denial, do we risk a security regression?

I understand the concerns, what I fail to understand however, is how is
it any different from say the `cred_prepare` hook today?

> Our current approach to handling multiple LSMs is that each LSM is
> limited to modifying its own state, and I'm pretty confident that we
> stick to this model if we have any hope of preserving the sanity of
> the LSM layer as a whole. If you want to modify the capability set
> you need to do so within the confines of the capability LSM and/or
> modify the other related kernel subsystems (which I'm guessing will
> likely necessitate a change in the LSMs, but that avenue is very
> unclear if such an option even exists).

What do you mean by "within the confines of the capability LSM" here?

Arguably, if we do want fine-grained userns policies, we need LSMs to
influence the userns capset at some point. Regardless of how or where we
do this, it will always be subject to some sort of ordering. We could
come up with some rules to limit surprises (e.g. caps can only be
dropped, only possible through some hooks, etc), but at the end of the
day, something needs to have the final word when it comes to deciding
what the creds should be.

2024-06-12 08:16:19

by Jonathan Calmels

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
> On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels <[email protected]> wrote:
> > On Tue, Jun 11, 2024 at 03:01:01PM GMT, Paul Moore wrote:
> > > On Tue, Jun 11, 2024 at 6:32 AM John Johansen
> > > <[email protected]> wrote:
> > > >
> > > > On 6/11/24 01:09, Jonathan Calmels wrote:
> > > > > On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
> > > > >> On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels <[email protected]> wrote:
> > > > >>>
> > > > >>> This patch allows modifying the various capabilities of the struct cred
> > > > >>> in BPF-LSM hooks. More specifically, the userns_create hook called
> > > > >>> prior to creating a new user namespace.
> > > > >>>
> > > > >>> With the introduction of userns capabilities, this effectively provides
> > > > >>> a simple way for LSMs to control the capabilities granted to a user
> > > > >>> namespace and all its descendants.
> > > > >>>
> > > > >>> Update the selftests accordingly by dropping CAP_SYS_ADMIN in
> > > > >>> namespaces and checking the resulting task's bounding set.
> > > > >>>
> > > > >>> Signed-off-by: Jonathan Calmels <[email protected]>
> > > > >>> ---
> > > > >>> include/linux/lsm_hook_defs.h | 2 +-
> > > > >>> include/linux/security.h | 4 +-
> > > > >>> kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++
> > > > >>> security/apparmor/lsm.c | 2 +-
> > > > >>> security/security.c | 6 +-
> > > > >>> security/selinux/hooks.c | 2 +-
> > > > >>> .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++--
> > > > >>> .../selftests/bpf/progs/test_deny_namespace.c | 7 ++-
> > > > >>> 8 files changed, 76 insertions(+), 14 deletions(-)
> > > > >>
> > > > >> I'm not sure we want to go down the path of a LSM modifying the POSIX
> > > > >> capabilities of a task, other than the capabilities/commoncap LSM. It
> > > > >> sets a bad precedent and could further complicate issues around LSM
> > > > >> ordering.
> > > > >
> > > > > Well unless I'm misunderstanding, this does allow modifying the
> > > > > capabilities/commoncap LSM through BTF. The reason for allowing
> > > > > `userns_create` to be modified is that it is functionally very similar
> > > > > to `cred_prepare` in that it operates with new creds (but specific to
> > > > > user namespaces because of reasons detailed in [1]).
> > > >
> > > > yes
> > > >
> > > > > There were some concerns in previous threads that the userns caps by
> > > > > themselves wouldn't be granular enough, hence the LSM integration.
> > > >
> > > > > Ubuntu for example, currently has to resort to a hardcoded profile
> > > > > transition to achieve this [2].
> > > > >
> > > >
> > > > The hard coded profile transition, is because the more generic solution
> > > > as part of policy just wasn't ready. The hard coding will go away before
> > > > it is upstreamed.
> > > >
> > > > But yes, updating the cred really is necessary for the flexibility needed
> > > > whether it is modifying the POSIX capabilities of the task or the LSM
> > > > modifying its own security blob.
> > > >
> > > > I do share some of Paul's concerns about the LSM modifying the POSIX
> > > > capabilities of the task, but also thing the LSM here needs to be
> > > > able to modify its own blob.
> > >
> > > To be clear, this isn't about a generic LSM needing to update its own
> > > blob (LSM state), it is about the BPF LSM updating the capability
> > > sets. While we obviously must support a LSM updating its own state,
> > > I'm currently of the opinion that allowing one LSM to update the state
> > > of another LSM is only going to lead to problems. We wouldn't want to
> > > allow Smack to update AppArmor state, and from my current perspective
> > > allowing the BPF LSM to update the capability state is no different.
> > >
> > > It's also important to keep in mind that if we allow one LSM to do
> > > something, we need to allow all LSMs to do something. If we allow
> > > multiple LSMs to manipulate the capability sets, how do we reconcile
> > > differences in the desired capability state? Does that resolution
> > > change depending on what LSMs are enabled at build time? Enabled at
> > > boot? Similarly, what about custom LSM ordering?
> > >
> > > What about those LSMs that use a task's capabilities as an input to an
> > > access control decision? If those LSMs allow an access based on a
> > > given capability set only to have a LSM later in the ordering modify
> > > that capability set to something which would have resulted in an
> > > access denial, do we risk a security regression?
> >
> > I understand the concerns, what I fail to understand however, is how is
> > it any different from say the `cred_prepare` hook today?
>
> The existing cred_prepare hooks only operate on their own small
> portion of the cred::security blob. What you are proposing would be
> the BPF LSM operating on the capability sets that it does not "own"
> (they belong to the capability LSM).
>
> If you see that as a minor difference, please understand that if you
> skip past that you have all the issues I mentioned in my previous
> message to deal with.
>
> > > Our current approach to handling multiple LSMs is that each LSM is
> > > limited to modifying its own state, and I'm pretty confident that we
> > > stick to this model if we have any hope of preserving the sanity of
> > > the LSM layer as a whole. If you want to modify the capability set
> > > you need to do so within the confines of the capability LSM and/or
> > > modify the other related kernel subsystems (which I'm guessing will
> > > likely necessitate a change in the LSMs, but that avenue is very
> > > unclear if such an option even exists).
> >
> > What do you mean by "within the confines of the capability LSM" here?
>
> Basically security/commoncap.c. One could make a lot of arguments
> about if it is, or isn't, a LSM, but commoncap.c registers LSM hooks
> which is pretty much the definition of a LSM from an implementation
> point of view.

Yes, hence the proposal to give it more fine-grained controls than
what's currently available. But to your point, unlike the others,
its own state (i.e. capsets) is shared, so this gets questionable.

> > Arguably, if we do want fine-grained userns policies, we need LSMs to
> > influence the userns capset at some point.
>
> One could always use, or develop, a LSM that offers additional
> controls around exercising capabilities. There are currently four
> in-tree LSMs, including the capabilities LSM, which supply a
> security_capable() hook that is used by the capability-based access
> controls in the kernel; all of these hook implementations work
> together within the LSM framework and provide an additional level of
> control/granularity beyond the existing capabilities.

Right, but the idea was to have a simple and easy way to reuse/trigger
as much of the commoncap one as possible from BPF. If we're saying we
need to reimplement and/or use a whole new framework, then there is
little value.

TBH, I don't feel strongly about this, which is why it is absent from
v1. However, as John pointed out, we should at least be able to modify
the blob if we want flexible userns caps policies down the road.

2024-06-12 17:30:23

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels <[email protected]> wrote:
> On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
> > On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels <[email protected]> wrote:

...

> > > Arguably, if we do want fine-grained userns policies, we need LSMs to
> > > influence the userns capset at some point.
> >
> > One could always use, or develop, a LSM that offers additional
> > controls around exercising capabilities. There are currently four
> > in-tree LSMs, including the capabilities LSM, which supply a
> > security_capable() hook that is used by the capability-based access
> > controls in the kernel; all of these hook implementations work
> > together within the LSM framework and provide an additional level of
> > control/granularity beyond the existing capabilities.
>
> Right, but the idea was to have a simple and easy way to reuse/trigger
> as much of the commoncap one as possible from BPF. If we're saying we
> need to reimplement and/or use a whole new framework, then there is
> little value.

I can appreciate how allowing direct manipulation of capability bits
from a BPF LSM looks attractive, but my hope is that our discussion
here revealed that as you look deeper into making it work there are a
number of pitfalls which prevent this from being a safe option for
generalized systems.

> TBH, I don't feel strongly about this, which is why it is absent from
> v1. However, as John pointed out, we should at least be able to modify
> the blob if we want flexible userns caps policies down the road.

As discussed in this thread, there are existing ways to provide fine
grained control over exercising capabilities that can be safely used
within the LSM framework. I don't want to speak to what John is
envisioning, but he should be aware of these mechanisms, and if I
recall he did voice a level of concern about the same worries I
mentioned.

I'm happy to discuss ways in which we can adjust the LSM hooks/layer
to support different approaches to capability controls, but one LSM
directly manipulating the state of another is going to be a no vote
from me.

--
paul-moore.com

2024-06-13 03:55:10

by John Johansen

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On 6/12/24 10:29, Paul Moore wrote:
> On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels <[email protected]> wrote:
>> On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
>>> On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels <[email protected]> wrote:
>
> ...
>
>>>> Arguably, if we do want fine-grained userns policies, we need LSMs to
>>>> influence the userns capset at some point.
>>>
>>> One could always use, or develop, a LSM that offers additional
>>> controls around exercising capabilities. There are currently four
>>> in-tree LSMs, including the capabilities LSM, which supply a
>>> security_capable() hook that is used by the capability-based access
>>> controls in the kernel; all of these hook implementations work
>>> together within the LSM framework and provide an additional level of
>>> control/granularity beyond the existing capabilities.
>>
>> Right, but the idea was to have a simple and easy way to reuse/trigger
>> as much of the commoncap one as possible from BPF. If we're saying we
>> need to reimplement and/or use a whole new framework, then there is
>> little value.
>
> I can appreciate how allowing direct manipulation of capability bits
> from a BPF LSM looks attractive, but my hope is that our discussion
> here revealed that as you look deeper into making it work there are a
> number of pitfalls which prevent this from being a safe option for
> generalized systems.
>
>> TBH, I don't feel strongly about this, which is why it is absent from
>> v1. However, as John pointed out, we should at least be able to modify
>> the blob if we want flexible userns caps policies down the road.
>
> As discussed in this thread, there are existing ways to provide fine
> grained control over exercising capabilities that can be safely used
> within the LSM framework. I don't want to speak to what John is
> envisioning, but he should be aware of these mechanisms, and if I
> recall he did voice a level of concern about the same worries I
> mentioned.
>

sorry, I should have been more clear. I envision LSMs being able to
update their own state in the userns hook.

Basically the portion of the patch that removes const from the
userns hook.

An LSM updating the capset is worrysome for all the reasons you
pointed out, and I think a few more. I haven't had a chance to really
look at v2 yet, so I didn't want to speak directly on the bpf part of
the patch without first giving a good once over.

> I'm happy to discuss ways in which we can adjust the LSM hooks/layer
> to support different approaches to capability controls, but one LSM
> directly manipulating the state of another is going to be a no vote
> from me.
>
I might not be as hard no as Paul here, I am always willing to listen
to arguments, but it would have to be a really good argument to
modify the capset, when there are multiple LSMs in play on a system.



2024-06-13 09:09:20

by Jonathan Calmels

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Wed, Jun 12, 2024 at 08:54:28PM GMT, John Johansen wrote:
> On 6/12/24 10:29, Paul Moore wrote:
> > On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels <[email protected]> wrote:
> > > On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
> > > > On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels <[email protected]> wrote:
> >
> > ...
> >
> > > > > Arguably, if we do want fine-grained userns policies, we need LSMs to
> > > > > influence the userns capset at some point.
> > > >
> > > > One could always use, or develop, a LSM that offers additional
> > > > controls around exercising capabilities. There are currently four
> > > > in-tree LSMs, including the capabilities LSM, which supply a
> > > > security_capable() hook that is used by the capability-based access
> > > > controls in the kernel; all of these hook implementations work
> > > > together within the LSM framework and provide an additional level of
> > > > control/granularity beyond the existing capabilities.
> > >
> > > Right, but the idea was to have a simple and easy way to reuse/trigger
> > > as much of the commoncap one as possible from BPF. If we're saying we
> > > need to reimplement and/or use a whole new framework, then there is
> > > little value.
> >
> > I can appreciate how allowing direct manipulation of capability bits
> > from a BPF LSM looks attractive, but my hope is that our discussion
> > here revealed that as you look deeper into making it work there are a
> > number of pitfalls which prevent this from being a safe option for
> > generalized systems.
> >
> > > TBH, I don't feel strongly about this, which is why it is absent from
> > > v1. However, as John pointed out, we should at least be able to modify
> > > the blob if we want flexible userns caps policies down the road.
> >
> > As discussed in this thread, there are existing ways to provide fine
> > grained control over exercising capabilities that can be safely used
> > within the LSM framework. I don't want to speak to what John is
> > envisioning, but he should be aware of these mechanisms, and if I
> > recall he did voice a level of concern about the same worries I
> > mentioned.
> >
>
> sorry, I should have been more clear. I envision LSMs being able to
> update their own state in the userns hook.
>
> Basically the portion of the patch that removes const from the
> userns hook.

Yes, pretty sure we'll need this regardless.

> An LSM updating the capset is worrysome for all the reasons you
> pointed out, and I think a few more. I haven't had a chance to really
> look at v2 yet, so I didn't want to speak directly on the bpf part of
> the patch without first giving a good once over.
>
> > I'm happy to discuss ways in which we can adjust the LSM hooks/layer
> > to support different approaches to capability controls, but one LSM
> > directly manipulating the state of another is going to be a no vote
> > from me.
> >
> I might not be as hard no as Paul here, I am always willing to listen
> to arguments, but it would have to be a really good argument to
> modify the capset, when there are multiple LSMs in play on a system.

The way I see it, it's more about enhancing the capability LSM with BPF
hooks and have it modify its own state dynamically, not so much
crosstalk between two distinct LSM frameworks (say one where the BPF
LSM implements a lot of things like capable()).

In this context and with enough safeguards (say we only allow dropping
caps) this could be a net positive. Sure, ordering could come into play
in very specific scenarios, but at this point I would expect the
admin/LSM author to be conscious about it.

If we think there is no way we can come up with something that's safe
enough, and that the risks outweigh the benefits, fine by me, we can
drop this patch from the series.

2024-06-13 10:46:56

by Dr. Greg

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Wed, Jun 12, 2024 at 08:54:28PM -0700, John Johansen wrote:

Good morning, I hope the day is going well for everyone.

> On 6/12/24 10:29, Paul Moore wrote:
> >On Wed, Jun 12, 2024 at 4:15???AM Jonathan Calmels <[email protected]>
> >wrote:
> >>On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
> >>>On Tue, Jun 11, 2024 at 6:15???PM Jonathan Calmels <[email protected]>
> >>>wrote:
> >
> >...
> >
> >>>>Arguably, if we do want fine-grained userns policies, we need LSMs to
> >>>>influence the userns capset at some point.
> >>>
> >>>One could always use, or develop, a LSM that offers additional
> >>>controls around exercising capabilities. There are currently four
> >>>in-tree LSMs, including the capabilities LSM, which supply a
> >>>security_capable() hook that is used by the capability-based access
> >>>controls in the kernel; all of these hook implementations work
> >>>together within the LSM framework and provide an additional level of
> >>>control/granularity beyond the existing capabilities.
> >>
> >>Right, but the idea was to have a simple and easy way to reuse/trigger
> >>as much of the commoncap one as possible from BPF. If we're saying we
> >>need to reimplement and/or use a whole new framework, then there is
> >>little value.
> >
> >I can appreciate how allowing direct manipulation of capability bits
> >from a BPF LSM looks attractive, but my hope is that our discussion
> >here revealed that as you look deeper into making it work there are a
> >number of pitfalls which prevent this from being a safe option for
> >generalized systems.
> >
> >>TBH, I don't feel strongly about this, which is why it is absent from
> >>v1. However, as John pointed out, we should at least be able to modify
> >>the blob if we want flexible userns caps policies down the road.
> >
> >As discussed in this thread, there are existing ways to provide fine
> >grained control over exercising capabilities that can be safely used
> >within the LSM framework. I don't want to speak to what John is
> >envisioning, but he should be aware of these mechanisms, and if I
> >recall he did voice a level of concern about the same worries I
> >mentioned.
> >
>
> sorry, I should have been more clear. I envision LSMs being able to
> update their own state in the userns hook.
>
> Basically the portion of the patch that removes const from the
> userns hook.
>
> An LSM updating the capset is worrysome for all the reasons you
> pointed out, and I think a few more. I haven't had a chance to really
> look at v2 yet, so I didn't want to speak directly on the bpf part of
> the patch without first giving a good once over.
>
> >I'm happy to discuss ways in which we can adjust the LSM hooks/layer
> >to support different approaches to capability controls, but one LSM
> >directly manipulating the state of another is going to be a no vote
> >from me.
> >
> I might not be as hard no as Paul here, I am always willing to listen
> to arguments, but it would have to be a really good argument to
> modify the capset, when there are multiple LSMs in play on a system.

Putting my pragmatic operations hat on, it isn't just the impact on
multiple LSM's.

The security vendors, CrowdStrike's Falcon comes immediately to mind,
are installing BPF hooks as part of their agent systems.

Given that the issue of signing BPF programs is still an open
question, allowing the ability of a BPF program to modify the security
capabilities of a process opens the door to supply chain attacks that
would seem unbounded in scope.

On the other side of the fence, installing a BPF program is a
privileged operation. If a decision is made to allow that kind of
privilege on a system the argument can be made that you get to keep
both pieces.

Of course that needs to be paired against the fact that system's
administrators are not given any choice as to the wisdom of that type
of permission being afforded to security applications.

Best wishes for a productive remainder of the week.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
https://github.com/Quixote-Project

2024-06-13 20:43:35

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Wed, Jun 12, 2024 at 11:54 PM John Johansen
<[email protected]> wrote:
> On 6/12/24 10:29, Paul Moore wrote:
> > On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels <[email protected]> wrote:
> >> On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
> >>> On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels <[email protected]> wrote:
> >
> > ...
> >
> >>>> Arguably, if we do want fine-grained userns policies, we need LSMs to
> >>>> influence the userns capset at some point.
> >>>
> >>> One could always use, or develop, a LSM that offers additional
> >>> controls around exercising capabilities. There are currently four
> >>> in-tree LSMs, including the capabilities LSM, which supply a
> >>> security_capable() hook that is used by the capability-based access
> >>> controls in the kernel; all of these hook implementations work
> >>> together within the LSM framework and provide an additional level of
> >>> control/granularity beyond the existing capabilities.
> >>
> >> Right, but the idea was to have a simple and easy way to reuse/trigger
> >> as much of the commoncap one as possible from BPF. If we're saying we
> >> need to reimplement and/or use a whole new framework, then there is
> >> little value.
> >
> > I can appreciate how allowing direct manipulation of capability bits
> > from a BPF LSM looks attractive, but my hope is that our discussion
> > here revealed that as you look deeper into making it work there are a
> > number of pitfalls which prevent this from being a safe option for
> > generalized systems.
> >
> >> TBH, I don't feel strongly about this, which is why it is absent from
> >> v1. However, as John pointed out, we should at least be able to modify
> >> the blob if we want flexible userns caps policies down the road.
> >
> > As discussed in this thread, there are existing ways to provide fine
> > grained control over exercising capabilities that can be safely used
> > within the LSM framework. I don't want to speak to what John is
> > envisioning, but he should be aware of these mechanisms, and if I
> > recall he did voice a level of concern about the same worries I
> > mentioned.
> >
>
> sorry, I should have been more clear. I envision LSMs being able to
> update their own state in the userns hook.

Ah, okay, yes, that seems reasonable; although like any other change,
until we have an in-tree user we should just leave it as-is.

--
paul-moore.com

2024-06-13 20:55:34

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Thu, Jun 13, 2024 at 4:45 AM Jonathan Calmels <[email protected]> wrote:
> On Wed, Jun 12, 2024 at 08:54:28PM GMT, John Johansen wrote:
> > On 6/12/24 10:29, Paul Moore wrote:
> > > On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels <[email protected]> wrote:
> > > > On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
> > > > > On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels <[email protected]> wrote:
> > >
> > > ...
> > >
> > > > > > Arguably, if we do want fine-grained userns policies, we need LSMs to
> > > > > > influence the userns capset at some point.
> > > > >
> > > > > One could always use, or develop, a LSM that offers additional
> > > > > controls around exercising capabilities. There are currently four
> > > > > in-tree LSMs, including the capabilities LSM, which supply a
> > > > > security_capable() hook that is used by the capability-based access
> > > > > controls in the kernel; all of these hook implementations work
> > > > > together within the LSM framework and provide an additional level of
> > > > > control/granularity beyond the existing capabilities.
> > > >
> > > > Right, but the idea was to have a simple and easy way to reuse/trigger
> > > > as much of the commoncap one as possible from BPF. If we're saying we
> > > > need to reimplement and/or use a whole new framework, then there is
> > > > little value.
> > >
> > > I can appreciate how allowing direct manipulation of capability bits
> > > from a BPF LSM looks attractive, but my hope is that our discussion
> > > here revealed that as you look deeper into making it work there are a
> > > number of pitfalls which prevent this from being a safe option for
> > > generalized systems.
> > >
> > > > TBH, I don't feel strongly about this, which is why it is absent from
> > > > v1. However, as John pointed out, we should at least be able to modify
> > > > the blob if we want flexible userns caps policies down the road.
> > >
> > > As discussed in this thread, there are existing ways to provide fine
> > > grained control over exercising capabilities that can be safely used
> > > within the LSM framework. I don't want to speak to what John is
> > > envisioning, but he should be aware of these mechanisms, and if I
> > > recall he did voice a level of concern about the same worries I
> > > mentioned.
> > >
> >
> > sorry, I should have been more clear. I envision LSMs being able to
> > update their own state in the userns hook.
> >
> > Basically the portion of the patch that removes const from the
> > userns hook.
>
> Yes, pretty sure we'll need this regardless.
>
> > An LSM updating the capset is worrysome for all the reasons you
> > pointed out, and I think a few more. I haven't had a chance to really
> > look at v2 yet, so I didn't want to speak directly on the bpf part of
> > the patch without first giving a good once over.


> > > I'm happy to discuss ways in which we can adjust the LSM hooks/layer
> > > to support different approaches to capability controls, but one LSM
> > > directly manipulating the state of another is going to be a no vote
> > > from me.
> >
> > I might not be as hard no as Paul here, I am always willing to listen
> > to arguments, but it would have to be a really good argument to
> > modify the capset, when there are multiple LSMs in play on a system.
>
> The way I see it, it's more about enhancing the capability LSM with BPF
> hooks and have it modify its own state dynamically, not so much
> crosstalk between two distinct LSM frameworks (say one where the BPF
> LSM implements a lot of things like capable()).

As I mentioned previously, if you want to do something with the
capability sets you simply need to do it within the confines of
security/commoncap.c. If you're really set on the "MUST BE BPF!" way
of life, and you can convince Serge (capabilities maintainer) that it
would be a good idea, you could propose a dedicated BPF hook within
the capabilities LSM. I'm not sure how wise that would be, but it
would resolve a lot of the LSM ordering/stacking issues that we've
discussed.

> If we think there is no way we can come up with something that's safe
> enough, and that the risks outweigh the benefits, fine by me, we can
> drop this patch from the series.

To be clear, this patch is not acceptable at this point in time. With
the understanding that I haven't looked that closely at the rest of
the patchset, it looks fairly well contained to the capabilities code
which means it is largely up to Serge, not me.

I will mention that you should update the audit code to recognize the
new capability set, look at kernel/auditsc.c for more information.

--
paul-moore.com

2024-06-15 15:19:35

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] capabilities: Add user namespace capabilities

On Tue, Jun 11, 2024 at 01:20:40AM -0700, Jonathan Calmels wrote:
> On Mon, Jun 10, 2024 at 08:00:57AM GMT, Serge E. Hallyn wrote:
> >
> > Now, one thing that does occur to me here is that there is a
> > very mild form of sendmail-capabilities vulnerability that
> > could happen here. Unpriv user joe can drop CAP_SYS_ADMIN
> > from cap_userns, then run a setuid-root program which starts
> > a container which expects CAP_SYS_ADMIN. This could be a
> > shared container, and so joe could be breaking expected
> > behavior there.
> >
> > I *think* we want to say we don't care about this case, but
> > if we did, I suppose we could say that the normal cap raise
> > rules on setuid should apply to cap_userns?
> >
>
> Right, good catch. If we do want to fix it, we could just check for
> setuid no? Or do we want to follow the normal root inheritance rules
> too? Essentially something like this:
>
> pU' = is_suid(root) ? X : pU

Yeah, I think that makes sense. Thanks.

-serge

2024-06-15 15:20:39

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] bpf,lsm: Allow editing capabilities in BPF-LSM hooks

On Thu, Jun 13, 2024 at 01:50:29AM -0700, Jonathan Calmels wrote:
> On Wed, Jun 12, 2024 at 08:54:28PM GMT, John Johansen wrote:
> > On 6/12/24 10:29, Paul Moore wrote:
> > > On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels <[email protected]> wrote:
> > > > On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
> > > > > On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels <[email protected]> wrote:
> > >
> > > ...
> > >
> > > > > > Arguably, if we do want fine-grained userns policies, we need LSMs to
> > > > > > influence the userns capset at some point.
> > > > >
> > > > > One could always use, or develop, a LSM that offers additional
> > > > > controls around exercising capabilities. There are currently four
> > > > > in-tree LSMs, including the capabilities LSM, which supply a
> > > > > security_capable() hook that is used by the capability-based access
> > > > > controls in the kernel; all of these hook implementations work
> > > > > together within the LSM framework and provide an additional level of
> > > > > control/granularity beyond the existing capabilities.
> > > >
> > > > Right, but the idea was to have a simple and easy way to reuse/trigger
> > > > as much of the commoncap one as possible from BPF. If we're saying we
> > > > need to reimplement and/or use a whole new framework, then there is
> > > > little value.
> > >
> > > I can appreciate how allowing direct manipulation of capability bits
> > > from a BPF LSM looks attractive, but my hope is that our discussion
> > > here revealed that as you look deeper into making it work there are a
> > > number of pitfalls which prevent this from being a safe option for
> > > generalized systems.
> > >
> > > > TBH, I don't feel strongly about this, which is why it is absent from
> > > > v1. However, as John pointed out, we should at least be able to modify
> > > > the blob if we want flexible userns caps policies down the road.
> > >
> > > As discussed in this thread, there are existing ways to provide fine
> > > grained control over exercising capabilities that can be safely used
> > > within the LSM framework. I don't want to speak to what John is
> > > envisioning, but he should be aware of these mechanisms, and if I
> > > recall he did voice a level of concern about the same worries I
> > > mentioned.
> > >
> >
> > sorry, I should have been more clear. I envision LSMs being able to
> > update their own state in the userns hook.
> >
> > Basically the portion of the patch that removes const from the
> > userns hook.
>
> Yes, pretty sure we'll need this regardless.
>
> > An LSM updating the capset is worrysome for all the reasons you
> > pointed out, and I think a few more. I haven't had a chance to really
> > look at v2 yet, so I didn't want to speak directly on the bpf part of
> > the patch without first giving a good once over.
> >
> > > I'm happy to discuss ways in which we can adjust the LSM hooks/layer
> > > to support different approaches to capability controls, but one LSM
> > > directly manipulating the state of another is going to be a no vote
> > > from me.
> > >
> > I might not be as hard no as Paul here, I am always willing to listen
> > to arguments, but it would have to be a really good argument to
> > modify the capset, when there are multiple LSMs in play on a system.
>
> The way I see it, it's more about enhancing the capability LSM with BPF
> hooks and have it modify its own state dynamically, not so much
> crosstalk between two distinct LSM frameworks (say one where the BPF
> LSM implements a lot of things like capable()).
>
> In this context and with enough safeguards (say we only allow dropping
> caps) this could be a net positive. Sure, ordering could come into play
> in very specific scenarios, but at this point I would expect the
> admin/LSM author to be conscious about it.
>
> If we think there is no way we can come up with something that's safe
> enough, and that the risks outweigh the benefits, fine by me, we can
> drop this patch from the series.

I think pursuing patches 1-3 now, and punting on 4 until later, would
be great.