V1->V2:
- Fix up the processing of the caps bits after discussions
  with Andy and Serge. Make the patch less intrusive.

Ambient caps are something like restricted root privileges.
A process has a set of additional capabilities, and those
are inherited without having to set capabilities on the other
binaries involved. This allows the partial use of root-like
features in a controlled way. It is often useful
to do this for user space device drivers or software that
needs increased privileges for networking or to control
its own scheduling. Ambient caps allow one to avoid
having to run these with full root privileges.

Control over this feature is available via a new
prctl option called PR_CAP_AMBIENT. The second argument to prctl
is the capability number and the third is the desired state:
0 for off, any other value for on.
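
The on/off semantics of the third argument can be modeled in user space
roughly as follows. This is only a sketch: ambient_update() is a
hypothetical helper that mimics the bit manipulation on a plain 64-bit
mask. It does not call prctl(), it omits the permission checks done on
the kernel side, and MODEL_LAST_CAP is an assumed bound chosen for the
illustration. The capability numbers are the ones from
linux/capability.h.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MODEL_CAP_NET_RAW	13	/* CAP_NET_RAW in <linux/capability.h> */
#define MODEL_CAP_SYS_NICE	23	/* CAP_SYS_NICE in <linux/capability.h> */
#define MODEL_LAST_CAP		63	/* assumption: all caps fit in one 64-bit mask */

/*
 * Hypothetical model of the PR_CAP_AMBIENT state argument:
 * state == 0 clears the bit, any other value sets it.
 * Returns 0 on success, -1 if cap is out of range.
 */
static int ambient_update(uint64_t *ambient, unsigned long cap,
			  unsigned long state)
{
	if (cap > MODEL_LAST_CAP)
		return -1;

	if (state == 0)
		*ambient &= ~(UINT64_C(1) << cap);
	else
		*ambient |= UINT64_C(1) << cap;
	return 0;
}

#ifdef DEMO_MAIN
int main(void)
{
	uint64_t ambient = 0;

	/* Any nonzero state switches the bit on. */
	assert(ambient_update(&ambient, MODEL_CAP_NET_RAW, 1) == 0);
	assert(ambient_update(&ambient, MODEL_CAP_SYS_NICE, 42) == 0);
	printf("ambient mask: %#llx\n", (unsigned long long)ambient);

	/* State 0 switches it off again. */
	assert(ambient_update(&ambient, MODEL_CAP_NET_RAW, 0) == 0);
	printf("ambient mask: %#llx\n", (unsigned long long)ambient);
	return 0;
}
#endif
```

Building with -DDEMO_MAIN gives a small stand-alone demo; without it the
helper can be dropped into a test harness.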

Ambient bits are enabled regardless of the inheritance
mask of the target binary. They are only restricted
by the bounding set.

History:

Linux capabilities have suffered from the problem that they are not
inheritable like regular process characteristics under Unix. This
behavior is counterintuitive given how processes are expected to
behave under Unix.

In particular, there has recently been software that controls NICs from user
space and also provides IP-stack-like behavior in user space (DPDK and RDMA
kernel API based implementations). Such software typically either needs
capabilities that allow raw network access or has to be run setuid. Scripting,
LD_PRELOAD, etc. are involved, and arbitrary binaries may be run from those
scripts, including ones that set additional capabilities or require root access.

That does not go well with setting file capabilities to enable
the capabilities. Maybe it would work if one set up capabilities on
all executables, but that would also defeat a secure design since these
binaries may need those caps only in certain situations. OK, setting the
inheritable flags on everything may also get one there (if it were not
for the issues with LD_PRELOAD, debugging, etc.).

The easy solution is to allow some capabilities to be inherited the way
setuid is. We really prefer to use capabilities instead of setuid (we want to
limit what damage someone can do, after all!). Therefore we have been
running a patch like this in production for the last 6 years. At some
point it becomes tedious to run your own custom kernel, so we would like
to have this functionality upstream.

See some of the earlier related discussions on the problems with capability
inheritance:

0. Recent surprise:
	https://lkml.org/lkml/2014/1/21/175

1. Attempt to revise caps
	http://www.madore.org/~david/linux/newcaps/

2. Problems of passing caps through exec
	http://unix.stackexchange.com/questions/128394/passing-capabilities-through-exec

3. Problems of binding to privileged ports
	http://stackoverflow.com/questions/413807/is-there-a-way-for-non-root-processes-to-bind-to-privileged-ports-1024-on-l

4. Reviving capabilities
	http://lwn.net/Articles/199004/

There does not seem to be an alternative on the horizon. Some people involved
in security development under Linux have even stated that they want to
rip out the whole thing and replace it. It has been a couple of years now
and we are still suffering from the capabilities mess. Let us just
fix it. Others, such as Nokia for the N900, have already done
implementations like this.

This patch does not change the default behavior, but it allows setting up
a list of capabilities via prctl that enables regular
Unix inheritance for only the selected group of capabilities.

With that it is then possible to do something trivial like setting
CAP_NET_RAW on an executable that can then allow that capability to
be inherited by others.

Let's have a look at a coding example of a wrapper that enables
a couple of capabilities:

------------------------------ ambient_test.c
/*
 * Test program for the ambient capabilities
 *
 * Compile using:
 *
 *	gcc -o ambient_test ambient_test.c
 *
 * This program must have the following capabilities to run properly:
 * CAP_SETPCAP, CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
 *
 * A command to equip this with the right caps is:
 *
 *	setcap cap_setpcap,cap_net_raw,cap_net_admin,cap_sys_nice+eip ambient_test
 *
 * To get a shell with additional caps that can be inherited do:
 *
 *	./ambient_test /bin/bash
 *
 */

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/capability.h>

/* Definition to be updated in the user space include files */
#define PR_CAP_AMBIENT 45

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "Usage: %s <program> [args...]\n", argv[0]);
		return 1;
	}

	/* The third argument is the desired state: 0 = off, nonzero = on */
	if (prctl(PR_CAP_AMBIENT, CAP_NET_RAW, 1))
		perror("Cannot set CAP_NET_RAW");

	if (prctl(PR_CAP_AMBIENT, CAP_NET_ADMIN, 1))
		perror("Cannot set CAP_NET_ADMIN");

	if (prctl(PR_CAP_AMBIENT, CAP_SYS_NICE, 1))
		perror("Cannot set CAP_SYS_NICE");

	printf("Ambient_test executing %s\n", argv[1]);

	/* execv() only returns on failure */
	execv(argv[1], argv + 1);
	perror("Cannot exec");
	return 1;
}
-------------------------------- ambient_test.c

This allows the inheritance of CAP_SYS_NICE, CAP_NET_RAW and CAP_NET_ADMIN.
With that, raw device access is possible and real time priorities
can also be set from user space. This is a frequently needed set of
privileged operations in HPC and HFT applications. User space
processes need to be able to directly access devices as well as
have full control over scheduling.

Signed-off-by: Christoph Lameter <[email protected]>
Index: linux/security/commoncap.c
===================================================================
--- linux.orig/security/commoncap.c 2015-02-25 13:43:06.929973954 -0600
+++ linux/security/commoncap.c 2015-02-26 16:10:02.347913397 -0600
@@ -347,15 +347,17 @@ static inline int bprm_caps_from_vfs_cap
*has_cap = true;
CAP_FOR_EACH_U32(i) {
+ __u32 ambient = current_cred()->cap_ambient.cap[i];
__u32 permitted = caps->permitted.cap[i];
__u32 inheritable = caps->inheritable.cap[i];
/*
- * pP' = (X & fP) | (pI & fI)
+ * pP' = (X & fP) | (pI & (fI | pA))
*/
new->cap_permitted.cap[i] =
(new->cap_bset.cap[i] & permitted) |
- (new->cap_inheritable.cap[i] & inheritable);
+ (new->cap_inheritable.cap[i] &
+ (inheritable | ambient));
if (permitted & ~new->cap_permitted.cap[i])
/* insufficient to execute correctly */
@@ -453,8 +455,18 @@ static int get_file_caps(struct linux_bi
if (rc == -EINVAL)
printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
__func__, rc, bprm->filename);
- else if (rc == -ENODATA)
+ else if (rc == -ENODATA) {
rc = 0;
+ if (!cap_isclear(current_cred()->cap_ambient)) {
+ /*
+ * The ambient caps are permitted for
+ * files that have no caps
+ */
+ bprm->cred->cap_permitted =
+ current_cred()->cap_ambient;
+ *effective = true;
+ }
+ }
goto out;
}
@@ -549,9 +561,20 @@ skip:
new->sgid = new->fsgid = new->egid;
if (effective)
+ /*
+ * pE' = pP' & (fE | pA)
+ *
+ * fE is implicitly all set if effective == true.
+ * Therefore the above reduces to
+ *
+ * pE' = pP'
+ */
new->cap_effective = new->cap_permitted;
else
cap_clear(new->cap_effective);
+
+ /* pA' = pA */
+ new->cap_ambient = old->cap_ambient;
bprm->cap_effective = effective;
/*
@@ -566,7 +589,7 @@ skip:
* Number 1 above might fail if you don't have a full bset, but I think
* that is interesting information to audit.
*/
- if (!cap_isclear(new->cap_effective)) {
+ if (!cap_issubset(new->cap_effective, new->cap_ambient)) {
if (!cap_issubset(CAP_FULL_SET, new->cap_effective) ||
!uid_eq(new->euid, root_uid) || !uid_eq(new->uid, root_uid) ||
issecure(SECURE_NOROOT)) {
@@ -598,7 +621,7 @@ int cap_bprm_secureexec(struct linux_bin
if (!uid_eq(cred->uid, root_uid)) {
if (bprm->cap_effective)
return 1;
- if (!cap_isclear(cred->cap_permitted))
+ if (!cap_issubset(cred->cap_permitted, cred->cap_ambient))
return 1;
}
@@ -933,6 +956,23 @@ int cap_task_prctl(int option, unsigned
new->securebits &= ~issecure_mask(SECURE_KEEP_CAPS);
return commit_creds(new);
+ case PR_CAP_AMBIENT:
+ if (!ns_capable(current_user_ns(), CAP_SETPCAP))
+ return -EPERM;
+
+ if (!cap_valid(arg2))
+ return -EINVAL;
+
+ if (!ns_capable(current_user_ns(), arg2))
+ return -EPERM;
+
+ new = prepare_creds();
+ if (arg3 == 0)
+ cap_lower(new->cap_ambient, arg2);
+ else
+ cap_raise(new->cap_ambient, arg2);
+ return commit_creds(new);
+
default:
/* No functionality available - continue with default */
return -ENOSYS;
Index: linux/include/linux/cred.h
===================================================================
--- linux.orig/include/linux/cred.h 2015-02-25 13:43:06.929973954 -0600
+++ linux/include/linux/cred.h 2015-02-25 13:43:06.925972078 -0600
@@ -122,6 +122,7 @@ struct cred {
kernel_cap_t cap_permitted; /* caps we're permitted */
kernel_cap_t cap_effective; /* caps we can actually use */
kernel_cap_t cap_bset; /* capability bounding set */
+ kernel_cap_t cap_ambient; /* Ambient capability set */
#ifdef CONFIG_KEYS
unsigned char jit_keyring; /* default keyring to attach requested
* keys to */
Index: linux/include/uapi/linux/prctl.h
===================================================================
--- linux.orig/include/uapi/linux/prctl.h 2015-02-25 13:43:06.929973954 -0600
+++ linux/include/uapi/linux/prctl.h 2015-02-25 13:43:06.925972078 -0600
@@ -185,4 +185,7 @@ struct prctl_mm_map {
#define PR_MPX_ENABLE_MANAGEMENT 43
#define PR_MPX_DISABLE_MANAGEMENT 44
+/* Control the ambient capability set */
+#define PR_CAP_AMBIENT 45
+
#endif /* _LINUX_PRCTL_H */
Index: linux/fs/proc/array.c
===================================================================
--- linux.orig/fs/proc/array.c 2015-02-25 13:43:06.929973954 -0600
+++ linux/fs/proc/array.c 2015-02-25 13:43:06.925972078 -0600
@@ -302,7 +302,8 @@ static void render_cap_t(struct seq_file
static inline void task_cap(struct seq_file *m, struct task_struct *p)
{
const struct cred *cred;
- kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bset;
+ kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
+ cap_bset, cap_ambient;
rcu_read_lock();
cred = __task_cred(p);
@@ -310,12 +311,14 @@ static inline void task_cap(struct seq_f
cap_permitted = cred->cap_permitted;
cap_effective = cred->cap_effective;
cap_bset = cred->cap_bset;
+ cap_ambient = cred->cap_ambient;
rcu_read_unlock();
render_cap_t(m, "CapInh:\t", &cap_inheritable);
render_cap_t(m, "CapPrm:\t", &cap_permitted);
render_cap_t(m, "CapEff:\t", &cap_effective);
render_cap_t(m, "CapBnd:\t", &cap_bset);
+ render_cap_t(m, "CapAmb:\t", &cap_ambient);
}
static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
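
The fs/proc/array.c hunk above adds a CapAmb line to /proc/<pid>/status
next to the existing CapInh/CapPrm/CapEff/CapBnd lines. A user space
tool could pick the mask out of that file along these lines. This is a
sketch only: parse_cap_line() is a hypothetical helper, it assumes the
value follows the key as a single hexadecimal number (as the existing
Cap* lines are rendered), and it works on text already read into memory
so that it can be exercised without a patched kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/*
 * Find "key" (e.g. "CapAmb:") at the start of a line in status-file
 * text and parse the hexadecimal mask that follows it.
 * Returns 0 on success, -1 if the key is absent or has no hex value.
 */
static int parse_cap_line(const char *text, const char *key,
			  unsigned long long *mask)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0) {
			const char *p = line + klen;

			/* skip the blank or tab separating key and value */
			while (*p == ' ' || *p == '\t')
				p++;
			if (!isxdigit((unsigned char)*p))
				return -1;
			*mask = strtoull(p, NULL, 16);
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline */
	}
	return -1;
}

#ifdef DEMO_MAIN
int main(void)
{
	char buf[8192];
	unsigned long long amb;
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	assert(n < sizeof(buf));
	buf[n] = '\0';

	if (parse_cap_line(buf, "CapAmb:", &amb) == 0)
		printf("ambient mask: %#llx\n", amb);
	else
		printf("no CapAmb line found\n");
	return 0;
}
#endif
```

Built with -DDEMO_MAIN, the program reports the calling process's own
ambient mask, or says that no CapAmb line is present when run on a
kernel without this change.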
On Thu, Feb 26, 2015 at 04:14:33PM -0600, Christoph Lameter wrote:
> ------------------------------ ambient_test.c
> /*
> * Test program for the ambient capabilities
> *
> *
> * Compile using:
> * gcc -o ambient_test ambient_test.o
> *
> * This program must have the following capabilities to run properly:
> * CAP_SETPCAP, CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
> *
> * A command to equip this with the right caps is:
> *
> * setcap cap_setpcap,cap_net_raw,cap_net_admin,cap_sys_nice+eip ambient_test
> *
> * To get a shell with additional caps that can be inherited do:
> *
> * ./ambient_test /bin/bash
> *
> */
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <errno.h>
> #include <sys/prctl.h>
> #include <linux/capability.h>
>
> /* Defintion to be updated in the user space include files */
> #define PR_CAP_AMBIENT 45
>
> int main(int argc, char **argv)
> {
> int rc;
>
> if (prctl(PR_CAP_AMBIENT, CAP_NET_RAW))
> perror("Cannot set CAP_NET_RAW");
>
> if (prctl(PR_CAP_AMBIENT, CAP_NET_ADMIN))
> perror("Cannot set CAP_NET_ADMIN");
>
> if (prctl(PR_CAP_AMBIENT, CAP_SYS_NICE))
> perror("Cannot set CAP_SYS_NICE");
>

Your example program is not filling in pI though?

Ah, I see why. In get_file_caps() you are still assigning

	fP = pA

if the file has no file capabilities. So then you are actually
doing

	pP' = (X & (fP | pA)) | (pI & (fI | pA))

rather than

	pP' = (X & fP) | (pI & (fI | pA))

Other than that, the patch is looking good to me. We should
consider emitting an audit record when a task fills in its
pA, and I do still wonder whether we should be requiring
CAP_SETFCAP (unsure how best to think of it). But assuming the
fP = pA was not intended, I think this largely does the right
thing.

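To make the difference between the two formulas concrete, here is a
small user space sketch. It is an illustration only, not kernel code:
struct caps_in and both function names are made up for this example, and
each capability set is reduced to a single 64-bit word. With an empty pI
(as in the example program) and a file without capabilities, the
documented rule yields no permitted caps, while the fP = pA variant
hands out pA.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One 64-bit word stands in for each capability set (a simplification). */
struct caps_in {
	uint64_t X;	/* bounding set */
	uint64_t pI;	/* process inheritable */
	uint64_t pA;	/* process ambient */
	uint64_t fP;	/* file permitted */
	uint64_t fI;	/* file inheritable */
};

/* pP' = (X & fP) | (pI & (fI | pA)) */
static uint64_t pP_documented(const struct caps_in *c)
{
	return (c->X & c->fP) | (c->pI & (c->fI | c->pA));
}

/* pP' = (X & (fP | pA)) | (pI & (fI | pA)) */
static uint64_t pP_with_fP_eq_pA(const struct caps_in *c)
{
	return (c->X & (c->fP | c->pA)) | (c->pI & (c->fI | c->pA));
}

#ifdef DEMO_MAIN
int main(void)
{
	/* Full bounding set, empty pI, CAP_NET_RAW (13) ambient, no file caps. */
	struct caps_in c = {
		.X = ~UINT64_C(0),
		.pI = 0,
		.pA = UINT64_C(1) << 13,
		.fP = 0,
		.fI = 0,
	};

	/* The two rules disagree as long as pI is empty. */
	assert(pP_documented(&c) != pP_with_fP_eq_pA(&c));
	printf("documented: %#llx\n", (unsigned long long)pP_documented(&c));
	printf("fP = pA:    %#llx\n", (unsigned long long)pP_with_fP_eq_pA(&c));
	return 0;
}
#endif
```

Once the caller also raises the same bits in pI, both rules give the
same result, which is why the example program only works as posted
because of the fP = pA assignment.
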
> printf("Ambient_test forking shell\n");
> if (execv(argv[1], argv + 1))
> perror("Cannot exec");
>
> return 0;
> }
> -------------------------------- ambient_test.c