Resending the RFC with participants of previous discussions
in the list.
Following patch which is a variation of a solution discussed
in https://lwn.net/Articles/736330/ provides the users of
pid namespace, the functionality of pid translation between
namespaces using a namespace identifier. The topic of
pid translation has been discussed in the community few times
but there has always been a resistance to adding new solution
for this problem.
I will outline the planned usecase of pid namespace by oracle
database and explain why any of the existing solution cannot
be used to solve their problem.
Consider a system in which several PID namespaces with multiple
nested levels exists in parallel with monitor processes managing
all the namespaces. PID translation is required for controlling
and accessing information about the processes by the monitors
and other processes down the hierarchy of namespaces. Controlling
primarily involves sending signals or using ptrace by a process in
parent namespace on any of the processes in its child namespace.
Accessing information deals with the reading /proc/<pid>/* files
of processes in child namespace. None of the processes have
root/CAP_SYS_ADMIN privileges.
Existing methods:
*) SCM_CREDENTIALS message:
Ref: http://man7.org/linux/man-pages/man7/unix.7.html
The sender can translate its pid from its own namespace
to a pid in the target namespace by sending and receiving the
SCM_CREDENTIALS message. The drawback of this method is the
requirement of a socket communication channel to pid translation
which adds to the management overhead. This method does not enable
the sender to translate the pid of other process unless it is root
or it has CAP_SYS_ADMIN.
*) /proc/<pid>/status file
Ref: http://man7.org/linux/man-pages/man5/proc.5.html
Ref: https://patchwork.kernel.org/patch/5861791/
/proc/<pid>/status file provides a way to find the pids
associated with a process in different namespaces. Pid translation
from child namespace to parent namespace from parent namespace would
require searching all the status file in the parent namespace to find
the desired pid at desired level which is inefficient.
Method under review:
*) pidns: introduce syscall translate_pid
Ref: https://lwn.net/Articles/736330/
This solution though not integrated yet, when used would require
monitor to have open file descriptor for every level of namespace it
has to monitor which results in explosion of number of file descriptors
the monitor has to keep open.
Following patch uses a 64-bit ID for namespace exported by procfs
for pid translation through a new file /proc/<pid>/ns/pidns_id.
[root@brm-x4600-01 ~]# ls -l /proc/4690/ns
total 0
lrwxrwxrwx 1 root root 0 Mar 1 15:55 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Mar 1 15:55 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Mar 1 15:55 mnt -> mnt:[4026532324]
lrwxrwxrwx 1 root root 0 Mar 1 15:55 net -> net:[4026531993]
lrwxrwxrwx 1 root root 0 Mar 1 15:55 pid -> pid:[4026532325]
lrwxrwxrwx 1 root root 0 Mar 1 15:55 pid_for_children -> pid:[4026532325]
lrwxrwxrwx 1 root root 0 Mar 1 15:55 pidns_id -> [14994344851325148137]
lrwxrwxrwx 1 root root 0 Mar 1 15:55 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Mar 1 15:55 uts -> uts:[4026531838]
Though there is a problem of ID being recycled in longterm, managing
an ID per namespace is easier than having open file descriptors per
pid namespace. Existing namespace inode numbers recycles faster and hence
their usability as ID for this API is less.
Signed-off-by: Nagarathnam Muthusamy <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
fs/nsfs.c | 9 +-
fs/proc/namespaces.c | 1 +
include/linux/ns_common.h | 1 +
include/linux/pid_namespace.h | 3 +
include/linux/proc_ns.h | 1 +
include/linux/syscalls.h | 1 +
kernel/pid_namespace.c | 190 ++++++++++++++++++++++++++++++++-
kernel/sys_ni.c | 4 +
10 files changed, 208 insertions(+), 5 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac21..31bf798 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
382 i386 pkey_free sys_pkey_free
383 i386 statx sys_statx
384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
+385 i386 translate_pid sys_translate_pid compat_sys_translate_pid
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183..89196c3 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
330 common pkey_alloc sys_pkey_alloc
331 common pkey_free sys_pkey_free
332 common statx sys_statx
+333 64 translate_pid sys_translate_pid
#
# x32-specific system call numbers start at 512 to avoid cache impact
@@ -380,3 +381,4 @@
545 x32 execveat compat_sys_execveat/ptregs
546 x32 preadv2 compat_sys_preadv64v2
547 x32 pwritev2 compat_sys_pwritev64v2
+548 x32 translate_pid compat_sys_translate_pid
diff --git a/fs/nsfs.c b/fs/nsfs.c
index 36b0772..c635465 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -222,8 +222,13 @@ int ns_get_name(char *buf, size_t size, struct task_struct *task,
const char *name;
ns = ns_ops->get(task);
if (ns) {
- name = ns_ops->real_ns_name ? : ns_ops->name;
- res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
+ if (!strcmp(ns_ops->name, "pidns_id")) {
+ res = snprintf(buf, size, "[%llu]",
+ (unsigned long long)ns->ns_id);
+ } else {
+ name = ns_ops->real_ns_name ? : ns_ops->name;
+ res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
+ }
ns_ops->put(ns);
}
return res;
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 59b17e5..ac823ce 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -24,6 +24,7 @@
#endif
#ifdef CONFIG_PID_NS
&pidns_operations,
+ &pidns_id_operations,
&pidns_for_children_operations,
#endif
#ifdef CONFIG_USER_NS
diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 5fbc400..6ca3d43 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -8,6 +8,7 @@ struct ns_common {
atomic_long_t stashed;
const struct proc_ns_operations *ops;
unsigned int inum;
+ u64 ns_id;
};
#endif
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b1..11d1d57 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -11,6 +11,7 @@
#include <linux/kref.h>
#include <linux/ns_common.h>
#include <linux/idr.h>
+#include <linux/list_bl.h>
struct fs_pin;
@@ -44,6 +45,8 @@ struct pid_namespace {
kgid_t pid_gid;
int hide_pid;
int reboot; /* group exit code if this pidns was rebooted */
+ struct hlist_bl_node node;
+ atomic_t lookups_pending;
struct ns_common ns;
} __randomize_layout;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb62..861e38bd 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -28,6 +28,7 @@ struct proc_ns_operations {
extern const struct proc_ns_operations utsns_operations;
extern const struct proc_ns_operations ipcns_operations;
extern const struct proc_ns_operations pidns_operations;
+extern const struct proc_ns_operations pidns_id_operations;
extern const struct proc_ns_operations pidns_for_children_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a78186d..574349a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -901,6 +901,7 @@ asmlinkage long sys_open_by_handle_at(int mountdirfd,
struct file_handle __user *handle,
int flags);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_translate_pid(pid_t pid, u64 source, u64 target);
asmlinkage long sys_process_vm_readv(pid_t pid,
const struct iovec __user *lvec,
unsigned long liovcnt,
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0b53eef..ff83aa8 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -22,6 +22,12 @@
#include <linux/sched/task.h>
#include <linux/sched/signal.h>
#include <linux/idr.h>
+#include <linux/random.h>
+#include <linux/compat.h>
+
+#define PID_NS_ID_HASH_BITS 9
+
+struct hlist_bl_head *pid_ns_hash;
struct pid_cache {
int nr_ids;
@@ -34,6 +40,13 @@ struct pid_cache {
static DEFINE_MUTEX(pid_caches_mutex);
static struct kmem_cache *pid_ns_cachep;
+static inline struct hlist_bl_head *
+ pid_ns_hash_head(struct hlist_bl_head *hash,
+ uint64_t key)
+{
+ return &hash[hash_64(key, PID_NS_ID_HASH_BITS)];
+}
+
/*
* creates the kmem cache to allocate pids from.
* @nr_ids: the number of numerical ids this pid will have to carry
@@ -93,12 +106,24 @@ static void dec_pid_namespaces(struct ucounts *ucounts)
dec_ucount(ucounts, UCOUNT_PID_NAMESPACES);
}
+static inline u64 get_namespace_id(void)
+{
+ u64 id = 0;
+
+ while (!id)
+ get_random_bytes(&id, sizeof(id));
+
+ return id;
+}
+
static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
struct pid_namespace *parent_pid_ns)
{
- struct pid_namespace *ns;
- unsigned int level = parent_pid_ns->level + 1;
+ struct pid_namespace *ns, *pt;
struct ucounts *ucounts;
+ struct hlist_bl_head *head;
+ struct hlist_bl_node *dup_node;
+ unsigned int level = parent_pid_ns->level + 1;
int err;
err = -EINVAL;
@@ -135,7 +160,24 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
ns->ucounts = ucounts;
ns->pid_allocated = PIDNS_ADDING;
INIT_WORK(&ns->proc_work, proc_cleanup_work);
-
+ ns->ns.ns_id = get_namespace_id();
+ while (1) {
+ head = pid_ns_hash_head(pid_ns_hash, ns->ns.ns_id);
+ hlist_bl_lock(head);
+ hlist_bl_for_each_entry(pt, dup_node, head, node) {
+ if (ns->ns.ns_id == pt->ns.ns_id) {
+ /*
+ * ID is taken. Move to next ID;
+ */
+ ns->ns.ns_id = get_namespace_id();
+ hlist_bl_unlock(head);
+ continue;
+ }
+ }
+ break;
+ }
+ hlist_bl_add_head(&ns->node, head);
+ hlist_bl_unlock(head);
return ns;
out_free_idr:
@@ -159,6 +201,30 @@ static void delayed_free_pidns(struct rcu_head *p)
static void destroy_pid_namespace(struct pid_namespace *ns)
{
+ struct pid_namespace *ph;
+ struct hlist_bl_head *head;
+ struct hlist_bl_node *dup_node;
+
+ /*
+ * Remove the namespace structure from hash table so
+ * now new lookups can start on it.
+ */
+ if (ns->ns.ns_id) {
+ head = pid_ns_hash_head(pid_ns_hash, ns->ns.ns_id);
+ hlist_bl_lock(head);
+ hlist_bl_for_each_entry(ph, dup_node, head, node) {
+ if (ns->ns.ns_id == ph->ns.ns_id) {
+ hlist_bl_del_init(&ph->node);
+ break;
+ }
+ }
+ hlist_bl_unlock(head);
+ }
+ /*
+ * Wait for pending lookups to complete.
+ */
+ while (atomic_read(&ns->lookups_pending))
+ cpu_relax();
ns_free_inum(&ns->ns);
idr_destroy(&ns->idr);
@@ -463,6 +529,17 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
.get_parent = pidns_get_parent,
};
+const struct proc_ns_operations pidns_id_operations = {
+ .name = "pidns_id",
+ .real_ns_name = "pid",
+ .type = CLONE_NEWPID,
+ .get = pidns_get,
+ .put = pidns_put,
+ .install = pidns_install,
+ .owner = pidns_owner,
+ .get_parent = pidns_get_parent,
+};
+
const struct proc_ns_operations pidns_for_children_operations = {
.name = "pid_for_children",
.real_ns_name = "pid",
@@ -474,9 +551,116 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
.get_parent = pidns_get_parent,
};
+/*
+ * translate_pid - convert pid in source pid-ns into target pid-ns.
+ * @pid: pid for translation
+ * @source: pid-ns id
+ * @target: pid-ns id
+ *
+ * Return pid in @target pid-ns, zero if task have no pid there,
+ * or -ESRCH of task with @pid is not found in @source pid-ns.
+ */
+SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
+ u64, target)
+{
+ struct pid_namespace *source_ns = NULL, *target_ns = NULL;
+ struct pid *struct_pid;
+ struct pid_namespace *ph;
+ struct hlist_bl_head *shead = NULL;
+ struct hlist_bl_head *thead = NULL;
+ struct hlist_bl_node *dup_node;
+ pid_t result;
+
+ if (!source) {
+ source_ns = &init_pid_ns;
+ } else {
+ shead = pid_ns_hash_head(pid_ns_hash, source);
+ hlist_bl_lock(shead);
+ hlist_bl_for_each_entry(ph, dup_node, shead, node) {
+ if (source == ph->ns.ns_id) {
+ source_ns = ph;
+ break;
+ }
+ }
+ if (!source_ns) {
+ hlist_bl_unlock(shead);
+ return -EINVAL;
+ }
+ }
+ if (!ptrace_may_access(source_ns->child_reaper,
+ PTRACE_MODE_READ_FSCREDS)) {
+ if (shead)
+ hlist_bl_unlock(shead);
+ return -EPERM;
+ }
+
+ atomic_inc(&source_ns->lookups_pending);
+ if (shead)
+ hlist_bl_unlock(shead);
+
+ if (!target) {
+ target_ns = &init_pid_ns;
+ } else {
+ thead = pid_ns_hash_head(pid_ns_hash, target);
+ hlist_bl_lock(thead);
+ hlist_bl_for_each_entry(ph, dup_node, thead, node) {
+ if (target == ph->ns.ns_id) {
+ target_ns = ph;
+ break;
+ }
+ }
+ if (!target_ns) {
+ atomic_dec(&source_ns->lookups_pending);
+ hlist_bl_unlock(thead);
+ return -EINVAL;
+ }
+ }
+ if (!ptrace_may_access(target_ns->child_reaper,
+ PTRACE_MODE_READ_FSCREDS)) {
+ atomic_dec(&source_ns->lookups_pending);
+ if (thead)
+ hlist_bl_unlock(thead);
+ return -EPERM;
+ }
+ atomic_inc(&target_ns->lookups_pending);
+ if (thead)
+ hlist_bl_unlock(thead);
+
+ rcu_read_lock();
+ struct_pid = find_pid_ns(pid, source_ns);
+ result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
+ rcu_read_unlock();
+ atomic_dec(&source_ns->lookups_pending);
+ atomic_dec(&target_ns->lookups_pending);
+ return result;
+}
+
+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE5(translate_pid, pid_t, pid, u32, s0, u32, s1,
+ u32, t0, u32, t1)
+{
+#ifdef __BIG_ENDIAN
+ return sys_translate_pid(pid, ((u64)s0 << 32) | s1,
+ ((u64)t0 << 32) | t1);
+#else
+ return sys_translate_pid(pid, ((u64)s1 << 32) | s0,
+ ((u64)t1 << 32) | t0);
+#endif
+}
+#endif
+
static __init int pid_namespaces_init(void)
{
+ unsigned long bucket_count;
+ int i;
+
+ bucket_count = (1UL << PID_NS_ID_HASH_BITS);
pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+ pid_ns_hash = kmalloc_array(bucket_count, sizeof(struct hlist_bl_head),
+ GFP_KERNEL);
+
+ for (i = 0; i < bucket_count; i++)
+ INIT_HLIST_BL_HEAD(&pid_ns_hash[i]);
#ifdef CONFIG_CHECKPOINT_RESTORE
register_sysctl_paths(kern_path, pid_ns_ctl_table);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index b518976..467255f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -259,3 +259,7 @@ asmlinkage long sys_ni_syscall(void)
cond_syscall(sys_pkey_mprotect);
cond_syscall(sys_pkey_alloc);
cond_syscall(sys_pkey_free);
+
+/* pid translation */
+cond_syscall(sys_translate_pid);
+cond_syscall(compat_sys_translate_pid);
--
1.8.3.1
On Mon, Mar 12, 2018 at 10:18 AM, <[email protected]> wrote:
> Resending the RFC with participants of previous discussions
> in the list.
>
> Following patch which is a variation of a solution discussed
> in https://lwn.net/Articles/736330/ provides the users of
> pid namespace, the functionality of pid translation between
> namespaces using a namespace identifier. The topic of
> pid translation has been discussed in the community few times
> but there has always been a resistance to adding new solution
> for this problem.
> I will outline the planned usecase of pid namespace by oracle
> database and explain why any of the existing solution cannot
> be used to solve their problem.
>
> Consider a system in which several PID namespaces with multiple
> nested levels exists in parallel with monitor processes managing
> all the namespaces. PID translation is required for controlling
> and accessing information about the processes by the monitors
> and other processes down the hierarchy of namespaces. Controlling
> primarily involves sending signals or using ptrace by a process in
> parent namespace on any of the processes in its child namespace.
> Accessing information deals with the reading /proc/<pid>/* files
> of processes in child namespace. None of the processes have
> root/CAP_SYS_ADMIN privileges.
How are you dealing with PID reuse?
[...]
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index 36b0772..c635465 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -222,8 +222,13 @@ int ns_get_name(char *buf, size_t size, struct task_struct *task,
> const char *name;
> ns = ns_ops->get(task);
> if (ns) {
> - name = ns_ops->real_ns_name ? : ns_ops->name;
> - res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
> + if (!strcmp(ns_ops->name, "pidns_id")) {
Wouldn't it be cleaner to check for "ns_ops==&pidns_id_operations"?
> + res = snprintf(buf, size, "[%llu]",
> + (unsigned long long)ns->ns_id);
> + } else {
> + name = ns_ops->real_ns_name ? : ns_ops->name;
> + res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
> + }
> ns_ops->put(ns);
> }
> return res;
[...]
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 49538b1..11d1d57 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -11,6 +11,7 @@
> #include <linux/kref.h>
> #include <linux/ns_common.h>
> #include <linux/idr.h>
> +#include <linux/list_bl.h>
>
>
> struct fs_pin;
> @@ -44,6 +45,8 @@ struct pid_namespace {
> kgid_t pid_gid;
> int hide_pid;
> int reboot; /* group exit code if this pidns was rebooted */
> + struct hlist_bl_node node;
> + atomic_t lookups_pending;
> struct ns_common ns;
> } __randomize_layout;
>
[...]
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0b53eef..ff83aa8 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
[...]
> @@ -159,6 +201,30 @@ static void delayed_free_pidns(struct rcu_head *p)
>
> static void destroy_pid_namespace(struct pid_namespace *ns)
> {
> + struct pid_namespace *ph;
> + struct hlist_bl_head *head;
> + struct hlist_bl_node *dup_node;
> +
> + /*
> + * Remove the namespace structure from hash table so
> + * now new lookups can start on it.
s/now new/no new/
[...]
> @@ -474,9 +551,116 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
> .get_parent = pidns_get_parent,
> };
>
> +/*
> + * translate_pid - convert pid in source pid-ns into target pid-ns.
> + * @pid: pid for translation
> + * @source: pid-ns id
> + * @target: pid-ns id
> + *
> + * Return pid in @target pid-ns, zero if task have no pid there,
> + * or -ESRCH of task with @pid is not found in @source pid-ns.
s/of/if/
> + */
> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
> + u64, target)
> +{
> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> + struct pid *struct_pid;
> + struct pid_namespace *ph;
> + struct hlist_bl_head *shead = NULL;
> + struct hlist_bl_head *thead = NULL;
> + struct hlist_bl_node *dup_node;
> + pid_t result;
> +
> + if (!source) {
> + source_ns = &init_pid_ns;
> + } else {
> + shead = pid_ns_hash_head(pid_ns_hash, source);
> + hlist_bl_lock(shead);
> + hlist_bl_for_each_entry(ph, dup_node, shead, node) {
> + if (source == ph->ns.ns_id) {
> + source_ns = ph;
> + break;
> + }
> + }
> + if (!source_ns) {
> + hlist_bl_unlock(shead);
> + return -EINVAL;
> + }
> + }
> + if (!ptrace_may_access(source_ns->child_reaper,
> + PTRACE_MODE_READ_FSCREDS)) {
AFAICS this proposal breaks the visibility restrictions that
namespaces normally create. If there are two namespaces-based
containers that use the same UID range, I don't think they should be
able to learn information about each other, such as which PIDs are in
use in the other container; but as far as I can tell, your proposal
makes it possible to do that (unless an LSM or so is interfering). I
would prefer it if this API required visibility of the targeted PID
namespaces in the caller's PID namespace.
When doing ptrace access checks, please use the real creds in syscalls
like this one, not the fs creds. The fs creds are for filesystem
syscalls (in particular sys_open()), not for specialized syscalls like
ptrace() or this one.
On 03/13/2018 01:47 PM, Jann Horn wrote:
> On Mon, Mar 12, 2018 at 10:18 AM, <[email protected]> wrote:
>> Resending the RFC with participants of previous discussions
>> in the list.
>>
>> Following patch which is a variation of a solution discussed
>> in https://lwn.net/Articles/736330/ provides the users of
>> pid namespace, the functionality of pid translation between
>> namespaces using a namespace identifier. The topic of
>> pid translation has been discussed in the community few times
>> but there has always been a resistance to adding new solution
>> for this problem.
>> I will outline the planned usecase of pid namespace by oracle
>> database and explain why any of the existing solution cannot
>> be used to solve their problem.
>>
>> Consider a system in which several PID namespaces with multiple
>> nested levels exists in parallel with monitor processes managing
>> all the namespaces. PID translation is required for controlling
>> and accessing information about the processes by the monitors
>> and other processes down the hierarchy of namespaces. Controlling
>> primarily involves sending signals or using ptrace by a process in
>> parent namespace on any of the processes in its child namespace.
>> Accessing information deals with the reading /proc/<pid>/* files
>> of processes in child namespace. None of the processes have
>> root/CAP_SYS_ADMIN privileges.
> How are you dealing with PID reuse?
We have a monitor process which keeps track of the aliveness of
important processes. When a process dies, monitor makes a note of
it and hence detects if pid is reused.
>
> [...]
>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>> index 36b0772..c635465 100644
>> --- a/fs/nsfs.c
>> +++ b/fs/nsfs.c
>> @@ -222,8 +222,13 @@ int ns_get_name(char *buf, size_t size, struct task_struct *task,
>> const char *name;
>> ns = ns_ops->get(task);
>> if (ns) {
>> - name = ns_ops->real_ns_name ? : ns_ops->name;
>> - res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
>> + if (!strcmp(ns_ops->name, "pidns_id")) {
> Wouldn't it be cleaner to check for "ns_ops==&pidns_id_operations"?
Yup. Will fix it.
>
>> + res = snprintf(buf, size, "[%llu]",
>> + (unsigned long long)ns->ns_id);
>> + } else {
>> + name = ns_ops->real_ns_name ? : ns_ops->name;
>> + res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
>> + }
>> ns_ops->put(ns);
>> }
>> return res;
> [...]
>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>> index 49538b1..11d1d57 100644
>> --- a/include/linux/pid_namespace.h
>> +++ b/include/linux/pid_namespace.h
>> @@ -11,6 +11,7 @@
>> #include <linux/kref.h>
>> #include <linux/ns_common.h>
>> #include <linux/idr.h>
>> +#include <linux/list_bl.h>
>>
>>
>> struct fs_pin;
>> @@ -44,6 +45,8 @@ struct pid_namespace {
>> kgid_t pid_gid;
>> int hide_pid;
>> int reboot; /* group exit code if this pidns was rebooted */
>> + struct hlist_bl_node node;
>> + atomic_t lookups_pending;
>> struct ns_common ns;
>> } __randomize_layout;
>>
> [...]
>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>> index 0b53eef..ff83aa8 100644
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
> [...]
>> @@ -159,6 +201,30 @@ static void delayed_free_pidns(struct rcu_head *p)
>>
>> static void destroy_pid_namespace(struct pid_namespace *ns)
>> {
>> + struct pid_namespace *ph;
>> + struct hlist_bl_head *head;
>> + struct hlist_bl_node *dup_node;
>> +
>> + /*
>> + * Remove the namespace structure from hash table so
>> + * now new lookups can start on it.
> s/now new/no new/
Will fix it.
>
> [...]
>> @@ -474,9 +551,116 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
>> .get_parent = pidns_get_parent,
>> };
>>
>> +/*
>> + * translate_pid - convert pid in source pid-ns into target pid-ns.
>> + * @pid: pid for translation
>> + * @source: pid-ns id
>> + * @target: pid-ns id
>> + *
>> + * Return pid in @target pid-ns, zero if task have no pid there,
>> + * or -ESRCH of task with @pid is not found in @source pid-ns.
> s/of/if/
Will fix it.
>
>> + */
>> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
>> + u64, target)
>> +{
>> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
>> + struct pid *struct_pid;
>> + struct pid_namespace *ph;
>> + struct hlist_bl_head *shead = NULL;
>> + struct hlist_bl_head *thead = NULL;
>> + struct hlist_bl_node *dup_node;
>> + pid_t result;
>> +
>> + if (!source) {
>> + source_ns = &init_pid_ns;
>> + } else {
>> + shead = pid_ns_hash_head(pid_ns_hash, source);
>> + hlist_bl_lock(shead);
>> + hlist_bl_for_each_entry(ph, dup_node, shead, node) {
>> + if (source == ph->ns.ns_id) {
>> + source_ns = ph;
>> + break;
>> + }
>> + }
>> + if (!source_ns) {
>> + hlist_bl_unlock(shead);
>> + return -EINVAL;
>> + }
>> + }
>> + if (!ptrace_may_access(source_ns->child_reaper,
>> + PTRACE_MODE_READ_FSCREDS)) {
> AFAICS this proposal breaks the visibility restrictions that
> namespaces normally create. If there are two namespaces-based
> containers that use the same UID range, I don't think they should be
> able to learn information about each other, such as which PIDs are in
> use in the other container; but as far as I can tell, your proposal
> makes it possible to do that (unless an LSM or so is interfering). I
> would prefer it if this API required visibility of the targeted PID
> namespaces in the caller's PID namespace.
I am trying to simulate the same access restrictions allowed
on a process's /proc/<pid>/ns/pid file. If the translator has
access to /proc/<pid>/ns/pid file of both source and destination
namespaces, shouldn't it be allowed to translate the pid between
them?
>
> When doing ptrace access checks, please use the real creds in syscalls
> like this one, not the fs creds. The fs creds are for filesystem
> syscalls (in particular sys_open()), not for specialized syscalls like
> ptrace() or this one.
Will fix this.
Thanks,
Nagarathnam.
On Tue, Mar 13, 2018 at 2:20 PM, Nagarathnam Muthusamy
<[email protected]> wrote:
> On 03/13/2018 01:47 PM, Jann Horn wrote:
>> On Mon, Mar 12, 2018 at 10:18 AM, <[email protected]>
>> wrote:
>>>
>>> Resending the RFC with participants of previous discussions
>>> in the list.
>>>
>>> Following patch which is a variation of a solution discussed
>>> in https://lwn.net/Articles/736330/ provides the users of
>>> pid namespace, the functionality of pid translation between
>>> namespaces using a namespace identifier. The topic of
>>> pid translation has been discussed in the community few times
>>> but there has always been a resistance to adding new solution
>>> for this problem.
>>> I will outline the planned usecase of pid namespace by oracle
>>> database and explain why any of the existing solution cannot
>>> be used to solve their problem.
>>>
>>> Consider a system in which several PID namespaces with multiple
>>> nested levels exists in parallel with monitor processes managing
>>> all the namespaces. PID translation is required for controlling
>>> and accessing information about the processes by the monitors
>>> and other processes down the hierarchy of namespaces. Controlling
>>> primarily involves sending signals or using ptrace by a process in
>>> parent namespace on any of the processes in its child namespace.
>>> Accessing information deals with the reading /proc/<pid>/* files
>>> of processes in child namespace. None of the processes have
>>> root/CAP_SYS_ADMIN privileges.
>>
>> How are you dealing with PID reuse?
>
>
> We have a monitor process which keeps track of the aliveness of
> important processes. When a process dies, monitor makes a note of
> it and hence detects if pid is reused.
How do you do that in a race-free manner?
>>> + */
>>> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
>>> + u64, target)
>>> +{
>>> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
>>> + struct pid *struct_pid;
>>> + struct pid_namespace *ph;
>>> + struct hlist_bl_head *shead = NULL;
>>> + struct hlist_bl_head *thead = NULL;
>>> + struct hlist_bl_node *dup_node;
>>> + pid_t result;
>>> +
>>> + if (!source) {
>>> + source_ns = &init_pid_ns;
>>> + } else {
>>> + shead = pid_ns_hash_head(pid_ns_hash, source);
>>> + hlist_bl_lock(shead);
>>> + hlist_bl_for_each_entry(ph, dup_node, shead, node) {
>>> + if (source == ph->ns.ns_id) {
>>> + source_ns = ph;
>>> + break;
>>> + }
>>> + }
>>> + if (!source_ns) {
>>> + hlist_bl_unlock(shead);
>>> + return -EINVAL;
>>> + }
>>> + }
>>> + if (!ptrace_may_access(source_ns->child_reaper,
>>> + PTRACE_MODE_READ_FSCREDS)) {
>>
>> AFAICS this proposal breaks the visibility restrictions that
>> namespaces normally create. If there are two namespaces-based
>> containers that use the same UID range, I don't think they should be
>> able to learn information about each other, such as which PIDs are in
>> use in the other container; but as far as I can tell, your proposal
>> makes it possible to do that (unless an LSM or so is interfering). I
>> would prefer it if this API required visibility of the targeted PID
>> namespaces in the caller's PID namespace.
>
>
> I am trying to simulate the same access restrictions allowed
> on a process's /proc/<pid>/ns/pid file. If the translator has
> access to /proc/<pid>/ns/pid file of both source and destination
> namespaces, shouldn't it be allowed to translate the pid between
> them?
But the translator doesn't actually need to have access to those
procfs files, right?
On 03/13/2018 02:28 PM, Jann Horn wrote:
> On Tue, Mar 13, 2018 at 2:20 PM, Nagarathnam Muthusamy
> <[email protected]> wrote:
>> On 03/13/2018 01:47 PM, Jann Horn wrote:
>>> On Mon, Mar 12, 2018 at 10:18 AM, <[email protected]>
>>> wrote:
>>>> Resending the RFC with participants of previous discussions
>>>> in the list.
>>>>
>>>> Following patch which is a variation of a solution discussed
>>>> in https://lwn.net/Articles/736330/ provides the users of
>>>> pid namespace, the functionality of pid translation between
>>>> namespaces using a namespace identifier. The topic of
>>>> pid translation has been discussed in the community few times
>>>> but there has always been a resistance to adding new solution
>>>> for this problem.
>>>> I will outline the planned usecase of pid namespace by oracle
>>>> database and explain why any of the existing solution cannot
>>>> be used to solve their problem.
>>>>
>>>> Consider a system in which several PID namespaces with multiple
>>>> nested levels exists in parallel with monitor processes managing
>>>> all the namespaces. PID translation is required for controlling
>>>> and accessing information about the processes by the monitors
>>>> and other processes down the hierarchy of namespaces. Controlling
>>>> primarily involves sending signals or using ptrace by a process in
>>>> parent namespace on any of the processes in its child namespace.
>>>> Accessing information deals with the reading /proc/<pid>/* files
>>>> of processes in child namespace. None of the processes have
>>>> root/CAP_SYS_ADMIN privileges.
>>> How are you dealing with PID reuse?
>>
>> We have a monitor process which keeps track of the aliveness of
>> important processes. When a process dies, monitor makes a note of
>> it and hence detects if pid is reused.
> How do you do that in a race-free manner?
AFAIK, the monitor runs periodically to check the aliveness of the processes
and this period is too short for pids to recycle. I will get back with
more information
on this if any other mechanisms are in place.
>
>
>>>> + */
>>>> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
>>>> + u64, target)
>>>> +{
>>>> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
>>>> + struct pid *struct_pid;
>>>> + struct pid_namespace *ph;
>>>> + struct hlist_bl_head *shead = NULL;
>>>> + struct hlist_bl_head *thead = NULL;
>>>> + struct hlist_bl_node *dup_node;
>>>> + pid_t result;
>>>> +
>>>> + if (!source) {
>>>> + source_ns = &init_pid_ns;
>>>> + } else {
>>>> + shead = pid_ns_hash_head(pid_ns_hash, source);
>>>> + hlist_bl_lock(shead);
>>>> + hlist_bl_for_each_entry(ph, dup_node, shead, node) {
>>>> + if (source == ph->ns.ns_id) {
>>>> + source_ns = ph;
>>>> + break;
>>>> + }
>>>> + }
>>>> + if (!source_ns) {
>>>> + hlist_bl_unlock(shead);
>>>> + return -EINVAL;
>>>> + }
>>>> + }
>>>> + if (!ptrace_may_access(source_ns->child_reaper,
>>>> + PTRACE_MODE_READ_FSCREDS)) {
>>> AFAICS this proposal breaks the visibility restrictions that
>>> namespaces normally create. If there are two namespaces-based
>>> containers that use the same UID range, I don't think they should be
>>> able to learn information about each other, such as which PIDs are in
>>> use in the other container; but as far as I can tell, your proposal
>>> makes it possible to do that (unless an LSM or so is interfering). I
>>> would prefer it if this API required visibility of the targeted PID
>>> namespaces in the caller's PID namespace.
>>
>> I am trying to simulate the same access restrictions allowed
>> on a process's /proc/<pid>/ns/pid file. If the translator has
>> access to /proc/<pid>/ns/pid file of both source and destination
>> namespaces, shouldn't it be allowed to translate the pid between
>> them?
> But the translator doesn't actually need to have access to those
> procfs files, right?
I thought it should have access to those procfs files to satisfy the
visibility constraint that targeted PID namespaces should be visible
in caller's PID namespace and ptrace_may_access checks that
constraint.
Thanks,
Nagarathnam.
On Tue, Mar 13, 2018 at 2:44 PM, Nagarathnam Muthusamy
<[email protected]> wrote:
>
>
> On 03/13/2018 02:28 PM, Jann Horn wrote:
>>
>> On Tue, Mar 13, 2018 at 2:20 PM, Nagarathnam Muthusamy
>> <[email protected]> wrote:
>>>
>>> On 03/13/2018 01:47 PM, Jann Horn wrote:
>>>>
>>>> On Mon, Mar 12, 2018 at 10:18 AM, <[email protected]>
>>>> wrote:
>>>>>
>>>>> Resending the RFC with participants of previous discussions
>>>>> in the list.
>>>>>
>>>>> Following patch which is a variation of a solution discussed
>>>>> in https://lwn.net/Articles/736330/ provides the users of
>>>>> pid namespace, the functionality of pid translation between
>>>>> namespaces using a namespace identifier. The topic of
>>>>> pid translation has been discussed in the community few times
>>>>> but there has always been a resistance to adding new solution
>>>>> for this problem.
>>>>> I will outline the planned usecase of pid namespace by oracle
>>>>> database and explain why any of the existing solution cannot
>>>>> be used to solve their problem.
>>>>>
>>>>> Consider a system in which several PID namespaces with multiple
>>>>> nested levels exists in parallel with monitor processes managing
>>>>> all the namespaces. PID translation is required for controlling
>>>>> and accessing information about the processes by the monitors
>>>>> and other processes down the hierarchy of namespaces. Controlling
>>>>> primarily involves sending signals or using ptrace by a process in
>>>>> parent namespace on any of the processes in its child namespace.
>>>>> Accessing information deals with the reading /proc/<pid>/* files
>>>>> of processes in child namespace. None of the processes have
>>>>> root/CAP_SYS_ADMIN privileges.
>>>>
>>>> How are you dealing with PID reuse?
>>>
>>>
>>> We have a monitor process which keeps track of the aliveness of
>>> important processes. When a process dies, monitor makes a note of
>>> it and hence detects if pid is reused.
>>
>> How do you do that in a race-free manner?
>
>
> AFAIK, the monitor runs periodically to check the aliveness of the processes
> and this period is too short for pids to recycle. I will get back with more
> information
> on this if any other mechanisms are in place.
>
>
>>
>>
>>>>> + */
>>>>> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
>>>>> + u64, target)
>>>>> +{
>>>>> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
>>>>> + struct pid *struct_pid;
>>>>> + struct pid_namespace *ph;
>>>>> + struct hlist_bl_head *shead = NULL;
>>>>> + struct hlist_bl_head *thead = NULL;
>>>>> + struct hlist_bl_node *dup_node;
>>>>> + pid_t result;
>>>>> +
>>>>> + if (!source) {
>>>>> + source_ns = &init_pid_ns;
>>>>> + } else {
>>>>> + shead = pid_ns_hash_head(pid_ns_hash, source);
>>>>> + hlist_bl_lock(shead);
>>>>> + hlist_bl_for_each_entry(ph, dup_node, shead, node) {
>>>>> + if (source == ph->ns.ns_id) {
>>>>> + source_ns = ph;
>>>>> + break;
>>>>> + }
>>>>> + }
>>>>> + if (!source_ns) {
>>>>> + hlist_bl_unlock(shead);
>>>>> + return -EINVAL;
>>>>> + }
>>>>> + }
>>>>> + if (!ptrace_may_access(source_ns->child_reaper,
>>>>> + PTRACE_MODE_READ_FSCREDS)) {
>>>>
>>>> AFAICS this proposal breaks the visibility restrictions that
>>>> namespaces normally create. If there are two namespaces-based
>>>> containers that use the same UID range, I don't think they should be
>>>> able to learn information about each other, such as which PIDs are in
>>>> use in the other container; but as far as I can tell, your proposal
>>>> makes it possible to do that (unless an LSM or so is interfering). I
>>>> would prefer it if this API required visibility of the targeted PID
>>>> namespaces in the caller's PID namespace.
>>>
>>>
>>> I am trying to simulate the same access restrictions allowed
>>> on a process's /proc/<pid>/ns/pid file. If the translator has
>>> access to /proc/<pid>/ns/pid file of both source and destination
>>> namespaces, shouldn't it be allowed to translate the pid between
>>> them?
>>
>> But the translator doesn't actually need to have access to those
>> procfs files, right?
>
> I thought it should have access to those procfs files to satisfy the
> visibility constraint that targeted PID namespaces should be visible
> in caller's PID namespace and ptrace_may_access checks that
> constraint.
If there are two containers that use the same UID range,
ptrace_may_access() checks from a process in one container on a
process in another container can pass. Normally, you just can't even
reach the ptrace_may_access() checks because you can't reference
processes in another container in any way.
By the way, a related concern: The use of global identifiers will
probably also negatively affect Checkpoint/Restore In Userspace?
On 03/13/2018 03:00 PM, Jann Horn wrote:
> On Tue, Mar 13, 2018 at 2:44 PM, Nagarathnam Muthusamy
> <[email protected]> wrote:
>>
>> On 03/13/2018 02:28 PM, Jann Horn wrote:
>>> On Tue, Mar 13, 2018 at 2:20 PM, Nagarathnam Muthusamy
>>> <[email protected]> wrote:
>>>> On 03/13/2018 01:47 PM, Jann Horn wrote:
>>>>> On Mon, Mar 12, 2018 at 10:18 AM, <[email protected]>
>>>>> wrote:
>>>>>> Resending the RFC with participants of previous discussions
>>>>>> in the list.
>>>>>>
>>>>>> Following patch which is a variation of a solution discussed
>>>>>> in https://lwn.net/Articles/736330/ provides the users of
>>>>>> pid namespace, the functionality of pid translation between
>>>>>> namespaces using a namespace identifier. The topic of
>>>>>> pid translation has been discussed in the community few times
>>>>>> but there has always been a resistance to adding new solution
>>>>>> for this problem.
>>>>>> I will outline the planned usecase of pid namespace by oracle
>>>>>> database and explain why any of the existing solution cannot
>>>>>> be used to solve their problem.
>>>>>>
>>>>>> Consider a system in which several PID namespaces with multiple
>>>>>> nested levels exists in parallel with monitor processes managing
>>>>>> all the namespaces. PID translation is required for controlling
>>>>>> and accessing information about the processes by the monitors
>>>>>> and other processes down the hierarchy of namespaces. Controlling
>>>>>> primarily involves sending signals or using ptrace by a process in
>>>>>> parent namespace on any of the processes in its child namespace.
>>>>>> Accessing information deals with the reading /proc/<pid>/* files
>>>>>> of processes in child namespace. None of the processes have
>>>>>> root/CAP_SYS_ADMIN privileges.
>>>>> How are you dealing with PID reuse?
>>>>
>>>> We have a monitor process which keeps track of the aliveness of
>>>> important processes. When a process dies, monitor makes a note of
>>>> it and hence detects if pid is reused.
>>> How do you do that in a race-free manner?
>>
>> AFAIK, the monitor runs periodically to check the aliveness of the processes
>> and this period is too short for pids to recycle. I will get back with more
>> information
>> on this if any other mechanisms are in place.
>>
>>
>>>
>>>>>> + */
>>>>>> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
>>>>>> + u64, target)
>>>>>> +{
>>>>>> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
>>>>>> + struct pid *struct_pid;
>>>>>> + struct pid_namespace *ph;
>>>>>> + struct hlist_bl_head *shead = NULL;
>>>>>> + struct hlist_bl_head *thead = NULL;
>>>>>> + struct hlist_bl_node *dup_node;
>>>>>> + pid_t result;
>>>>>> +
>>>>>> + if (!source) {
>>>>>> + source_ns = &init_pid_ns;
>>>>>> + } else {
>>>>>> + shead = pid_ns_hash_head(pid_ns_hash, source);
>>>>>> + hlist_bl_lock(shead);
>>>>>> + hlist_bl_for_each_entry(ph, dup_node, shead, node) {
>>>>>> + if (source == ph->ns.ns_id) {
>>>>>> + source_ns = ph;
>>>>>> + break;
>>>>>> + }
>>>>>> + }
>>>>>> + if (!source_ns) {
>>>>>> + hlist_bl_unlock(shead);
>>>>>> + return -EINVAL;
>>>>>> + }
>>>>>> + }
>>>>>> + if (!ptrace_may_access(source_ns->child_reaper,
>>>>>> + PTRACE_MODE_READ_FSCREDS)) {
>>>>> AFAICS this proposal breaks the visibility restrictions that
>>>>> namespaces normally create. If there are two namespaces-based
>>>>> containers that use the same UID range, I don't think they should be
>>>>> able to learn information about each other, such as which PIDs are in
>>>>> use in the other container; but as far as I can tell, your proposal
>>>>> makes it possible to do that (unless an LSM or so is interfering). I
>>>>> would prefer it if this API required visibility of the targeted PID
>>>>> namespaces in the caller's PID namespace.
>>>>
>>>> I am trying to simulate the same access restrictions allowed
>>>> on a process's /proc/<pid>/ns/pid file. If the translator has
>>>> access to /proc/<pid>/ns/pid file of both source and destination
>>>> namespaces, shouldn't it be allowed to translate the pid between
>>>> them?
>>> But the translator doesn't actually need to have access to those
>>> procfs files, right?
>> I thought it should have access to those procfs files to satisfy the
>> visibility constraint that targeted PID namespaces should be visible
>> in caller's PID namespace and ptrace_may_access checks that
>> constraint.
> If there are two containers that use the same UID range,
> ptrace_may_access() checks from a process in one container on a
> process in another container can pass. Normally, you just can't even
> reach the ptrace_may_access() checks because you can't reference
> processes in another container in any way.
If there is no way to reference the process in another container,
there is no way to get to the /proc/<pid>/ns/pidns_id file which
exports the ID of that container right? So, a translator has to
first guess the container ID then try translate. Even after translation,
unless the translator has proper privileges, I believe it cant do
anything with just the pid right?
>
> By the way, a related concern: The use of global identifiers will
> probably also negatively affect Checkpoint/Restore In Userspace?
Will look into this. Can you point me to the specifics of the
usecase which could be negatively affected?
Thanks,
Nagarathnam.
On Tue, Mar 13, 2018 at 3:45 PM, Nagarathnam Muthusamy
<[email protected]> wrote:
> On 03/13/2018 03:00 PM, Jann Horn wrote:
>> On Tue, Mar 13, 2018 at 2:44 PM, Nagarathnam Muthusamy
>> <[email protected]> wrote:
>>> On 03/13/2018 02:28 PM, Jann Horn wrote:
>>>> On Tue, Mar 13, 2018 at 2:20 PM, Nagarathnam Muthusamy
>>>> <[email protected]> wrote:
>>>>> On 03/13/2018 01:47 PM, Jann Horn wrote:
>>>>>> On Mon, Mar 12, 2018 at 10:18 AM, <[email protected]>
>>>>>> wrote:
[...]
>>>>>>> + */
>>>>>>> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
>>>>>>> + u64, target)
>>>>>>> +{
>>>>>>> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
>>>>>>> + struct pid *struct_pid;
>>>>>>> + struct pid_namespace *ph;
>>>>>>> + struct hlist_bl_head *shead = NULL;
>>>>>>> + struct hlist_bl_head *thead = NULL;
>>>>>>> + struct hlist_bl_node *dup_node;
>>>>>>> + pid_t result;
>>>>>>> +
>>>>>>> + if (!source) {
>>>>>>> + source_ns = &init_pid_ns;
>>>>>>> + } else {
>>>>>>> + shead = pid_ns_hash_head(pid_ns_hash, source);
>>>>>>> + hlist_bl_lock(shead);
>>>>>>> + hlist_bl_for_each_entry(ph, dup_node, shead, node) {
>>>>>>> + if (source == ph->ns.ns_id) {
>>>>>>> + source_ns = ph;
>>>>>>> + break;
>>>>>>> + }
>>>>>>> + }
>>>>>>> + if (!source_ns) {
>>>>>>> + hlist_bl_unlock(shead);
>>>>>>> + return -EINVAL;
>>>>>>> + }
>>>>>>> + }
>>>>>>> + if (!ptrace_may_access(source_ns->child_reaper,
>>>>>>> + PTRACE_MODE_READ_FSCREDS)) {
>>>>>>
>>>>>> AFAICS this proposal breaks the visibility restrictions that
>>>>>> namespaces normally create. If there are two namespaces-based
>>>>>> containers that use the same UID range, I don't think they should be
>>>>>> able to learn information about each other, such as which PIDs are in
>>>>>> use in the other container; but as far as I can tell, your proposal
>>>>>> makes it possible to do that (unless an LSM or so is interfering). I
>>>>>> would prefer it if this API required visibility of the targeted PID
>>>>>> namespaces in the caller's PID namespace.
>>>>>
>>>>>
>>>>> I am trying to simulate the same access restrictions allowed
>>>>> on a process's /proc/<pid>/ns/pid file. If the translator has
>>>>> access to /proc/<pid>/ns/pid file of both source and destination
>>>>> namespaces, shouldn't it be allowed to translate the pid between
>>>>> them?
>>>>
>>>> But the translator doesn't actually need to have access to those
>>>> procfs files, right?
>>>
>>> I thought it should have access to those procfs files to satisfy the
>>> visibility constraint that targeted PID namespaces should be visible
>>> in caller's PID namespace and ptrace_may_access checks that
>>> constraint.
>>
>> If there are two containers that use the same UID range,
>> ptrace_may_access() checks from a process in one container on a
>> process in another container can pass. Normally, you just can't even
>> reach the ptrace_may_access() checks because you can't reference
>> processes in another container in any way.
>
>
> If there is no way to reference the process in another container,
> there is no way to get to the /proc/<pid>/ns/pidns_id file which
> exports the ID of that container right? So, a translator has to
> first guess the container ID then try translate. Even after translation,
> unless the translator has proper privileges, I believe it cant do
> anything with just the pid right?
Well, yes to both. You'd have to guess the ID of the container, and
you wouldn't be able to do much with it, apart from finding valid PIDs
and their mapping between namespaces.
>> By the way, a related concern: The use of global identifiers will
>> probably also negatively affect Checkpoint/Restore In Userspace?
>
> Will look into this. Can you point me to the specifics of the
> usecase which could be negatively affected?
AFAICS you won't be able to reliably recreate namespace IDs when a
process is checkpointed and resumed, meaning that checkpoint/resume
won't work on processes that use these namespace IDs.
On 03/13/2018 04:10 PM, Jann Horn wrote:
> On Tue, Mar 13, 2018 at 3:45 PM, Nagarathnam Muthusamy
> <[email protected]> wrote:
>> On 03/13/2018 03:00 PM, Jann Horn wrote:
>>> On Tue, Mar 13, 2018 at 2:44 PM, Nagarathnam Muthusamy
>>> <[email protected]> wrote:
>>>> On 03/13/2018 02:28 PM, Jann Horn wrote:
>>>>> On Tue, Mar 13, 2018 at 2:20 PM, Nagarathnam Muthusamy
>>>>> <[email protected]> wrote:
>>>>>> On 03/13/2018 01:47 PM, Jann Horn wrote:
>>>>>>> On Mon, Mar 12, 2018 at 10:18 AM, <[email protected]>
>>>>>>> wrote:
> [...]
>>>>>>>> + */
>>>>>>>> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
>>>>>>>> + u64, target)
>>>>>>>> +{
>>>>>>>> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
>>>>>>>> + struct pid *struct_pid;
>>>>>>>> + struct pid_namespace *ph;
>>>>>>>> + struct hlist_bl_head *shead = NULL;
>>>>>>>> + struct hlist_bl_head *thead = NULL;
>>>>>>>> + struct hlist_bl_node *dup_node;
>>>>>>>> + pid_t result;
>>>>>>>> +
>>>>>>>> + if (!source) {
>>>>>>>> + source_ns = &init_pid_ns;
>>>>>>>> + } else {
>>>>>>>> + shead = pid_ns_hash_head(pid_ns_hash, source);
>>>>>>>> + hlist_bl_lock(shead);
>>>>>>>> + hlist_bl_for_each_entry(ph, dup_node, shead, node) {
>>>>>>>> + if (source == ph->ns.ns_id) {
>>>>>>>> + source_ns = ph;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> + }
>>>>>>>> + if (!source_ns) {
>>>>>>>> + hlist_bl_unlock(shead);
>>>>>>>> + return -EINVAL;
>>>>>>>> + }
>>>>>>>> + }
>>>>>>>> + if (!ptrace_may_access(source_ns->child_reaper,
>>>>>>>> + PTRACE_MODE_READ_FSCREDS)) {
>>>>>>> AFAICS this proposal breaks the visibility restrictions that
>>>>>>> namespaces normally create. If there are two namespaces-based
>>>>>>> containers that use the same UID range, I don't think they should be
>>>>>>> able to learn information about each other, such as which PIDs are in
>>>>>>> use in the other container; but as far as I can tell, your proposal
>>>>>>> makes it possible to do that (unless an LSM or so is interfering). I
>>>>>>> would prefer it if this API required visibility of the targeted PID
>>>>>>> namespaces in the caller's PID namespace.
>>>>>>
>>>>>> I am trying to simulate the same access restrictions allowed
>>>>>> on a process's /proc/<pid>/ns/pid file. If the translator has
>>>>>> access to /proc/<pid>/ns/pid file of both source and destination
>>>>>> namespaces, shouldn't it be allowed to translate the pid between
>>>>>> them?
>>>>> But the translator doesn't actually need to have access to those
>>>>> procfs files, right?
>>>> I thought it should have access to those procfs files to satisfy the
>>>> visibility constraint that targeted PID namespaces should be visible
>>>> in caller's PID namespace and ptrace_may_access checks that
>>>> constraint.
>>> If there are two containers that use the same UID range,
>>> ptrace_may_access() checks from a process in one container on a
>>> process in another container can pass. Normally, you just can't even
>>> reach the ptrace_may_access() checks because you can't reference
>>> processes in another container in any way.
>>
>> If there is no way to reference the process in another container,
>> there is no way to get to the /proc/<pid>/ns/pidns_id file which
>> exports the ID of that container right? So, a translator has to
>> first guess the container ID then try translate. Even after translation,
>> unless the translator has proper privileges, I believe it cant do
>> anything with just the pid right?
> Well, yes to both. You'd have to guess the ID of the container, and
> you wouldn't be able to do much with it, apart from finding valid PIDs
> and their mapping between namespaces.
>
>>> By the way, a related concern: The use of global identifiers will
>>> probably also negatively affect Checkpoint/Restore In Userspace?
>> Will look into this. Can you point me to the specifics of the
>> usecase which could be negatively affected?
> AFAICS you won't be able to reliably recreate namespace IDs when a
> process is checkpointed and resumed, meaning that checkpoint/resume
> won't work on processes that use these namespace IDs.
I agree. When the process is resumed, the namespace IDs
might be obsolete.
Thanks,
Nagarathnam.
[email protected] writes:
> Resending the RFC with participants of previous discussions
> in the list.
Nacked-by: "Eric W. Biederman" <[email protected]>
Please read below to see why.
> Following patch which is a variation of a solution discussed
> in https://lwn.net/Articles/736330/ provides the users of
> pid namespace, the functionality of pid translation between
> namespaces using a namespace identifier. The topic of
> pid translation has been discussed in the community few times
> but there has always been a resistance to adding new solution
> for this problem.
So far there has been no clear cut case that is worth the maintenance
challenge. As I recall the conversations they have always resulted
in a better way discovered.
> I will outline the planned usecase of pid namespace by oracle
> database and explain why any of the existing solution cannot
> be used to solve their problem.
>
> Consider a system in which several PID namespaces with multiple
> nested levels exists in parallel with monitor processes managing
> all the namespaces. PID translation is required for controlling
> and accessing information about the processes by the monitors
> and other processes down the hierarchy of namespaces. Controlling
> primarily involves sending signals or using ptrace by a process in
> parent namespace on any of the processes in its child namespace.
> Accessing information deals with the reading /proc/<pid>/* files
> of processes in child namespace. None of the processes have
> root/CAP_SYS_ADMIN privileges.
So one process inside of a pid namespace one process outside of
the pid namespace and they are talking about some other process
inside the pid namespace.
Agreed that is a usecase that likely needs pid translation.
> Existing methods:
>
> *) SCM_CREDENTIALS message:
> Ref: http://man7.org/linux/man-pages/man7/unix.7.html
> The sender can translate its pid from its own namespace
> to a pid in the target namespace by sending and receiving the
> SCM_CREDENTIALS message. The drawback of this method is the
> requirement of a socket communication channel to pid translation
> which adds to the management overhead. This method does not enable
> the sender to translate the pid of other process unless it is root
> or it has CAP_SYS_ADMIN.
Or it has CAP_SYS_ADMIN (in the sending pid namespace).
That is fair.
>
> *) /proc/<pid>/status file
> Ref: http://man7.org/linux/man-pages/man5/proc.5.html
> Ref: https://patchwork.kernel.org/patch/5861791/
> /proc/<pid>/status file provides a way to find the pids
> associated with a process in different namespaces. Pid translation
> from child namespace to parent namespace from parent namespace would
> require searching all the status file in the parent namespace to find
> the desired pid at desired level which is inefficient.
That is fair.
>
> Method under review:
>
> *) pidns: introduce syscall translate_pid
> Ref: https://lwn.net/Articles/736330/
> This solution though not integrated yet, when used would require
> monitor to have open file descriptor for every level of namespace it
> has to monitor which results in explosion of number of file descriptors
> the monitor has to keep open.
You know what tough nuggies.
File descriptors aren't or at least shouldn't be that expensive. And
you are already committing to tracking the pid namespaces anyway. An
int for a file descriptor is cheaper than a u64 for the userspace
process.
The cost of that ``cheaper'' u64 that is not in any namespace is that
you now have to go and implement a namespace of namespaces. You haven't
even attempted it. So just no. Anything that brings us to needing
a namespace of namespaces is a bad design.
> Following patch uses a 64-bit ID for namespace exported by procfs
> for pid translation through a new file /proc/<pid>/ns/pidns_id.
And this design detail is what brings the automatic nack.
Use file descriptros and it sounds like your use case justifies what you
are trying to do.
> [root@brm-x4600-01 ~]# ls -l /proc/4690/ns
> total 0
> lrwxrwxrwx 1 root root 0 Mar 1 15:55 cgroup -> cgroup:[4026531835]
> lrwxrwxrwx 1 root root 0 Mar 1 15:55 ipc -> ipc:[4026531839]
> lrwxrwxrwx 1 root root 0 Mar 1 15:55 mnt -> mnt:[4026532324]
> lrwxrwxrwx 1 root root 0 Mar 1 15:55 net -> net:[4026531993]
> lrwxrwxrwx 1 root root 0 Mar 1 15:55 pid -> pid:[4026532325]
> lrwxrwxrwx 1 root root 0 Mar 1 15:55 pid_for_children -> pid:[4026532325]
> lrwxrwxrwx 1 root root 0 Mar 1 15:55 pidns_id -> [14994344851325148137]
> lrwxrwxrwx 1 root root 0 Mar 1 15:55 user -> user:[4026531837]
> lrwxrwxrwx 1 root root 0 Mar 1 15:55 uts -> uts:[4026531838]
>
> Though there is a problem of ID being recycled in longterm, managing
> an ID per namespace is easier than having open file descriptors per
> pid namespace. Existing namespace inode numbers recycles faster and hence
> their usability as ID for this API is less.
No.
A namespace of namespaces is a much harder problem. We aren't going
there.
It sounds like you may be struggling with lifetime issues and when
to close/get rid of your pid namespace identifier. That could deserve
a conversation. Idenitifiers that are not namespaced are absolutely not
the solution.
Eric
> Signed-off-by: Nagarathnam Muthusamy <[email protected]>
> ---
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 2 +
> fs/nsfs.c | 9 +-
> fs/proc/namespaces.c | 1 +
> include/linux/ns_common.h | 1 +
> include/linux/pid_namespace.h | 3 +
> include/linux/proc_ns.h | 1 +
> include/linux/syscalls.h | 1 +
> kernel/pid_namespace.c | 190 ++++++++++++++++++++++++++++++++-
> kernel/sys_ni.c | 4 +
> 10 files changed, 208 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 448ac21..31bf798 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -391,3 +391,4 @@
> 382 i386 pkey_free sys_pkey_free
> 383 i386 statx sys_statx
> 384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
> +385 i386 translate_pid sys_translate_pid compat_sys_translate_pid
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183..89196c3 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
> 330 common pkey_alloc sys_pkey_alloc
> 331 common pkey_free sys_pkey_free
> 332 common statx sys_statx
> +333 64 translate_pid sys_translate_pid
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -380,3 +381,4 @@
> 545 x32 execveat compat_sys_execveat/ptregs
> 546 x32 preadv2 compat_sys_preadv64v2
> 547 x32 pwritev2 compat_sys_pwritev64v2
> +548 x32 translate_pid compat_sys_translate_pid
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index 36b0772..c635465 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -222,8 +222,13 @@ int ns_get_name(char *buf, size_t size, struct task_struct *task,
> const char *name;
> ns = ns_ops->get(task);
> if (ns) {
> - name = ns_ops->real_ns_name ? : ns_ops->name;
> - res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
> + if (!strcmp(ns_ops->name, "pidns_id")) {
> + res = snprintf(buf, size, "[%llu]",
> + (unsigned long long)ns->ns_id);
> + } else {
> + name = ns_ops->real_ns_name ? : ns_ops->name;
> + res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
> + }
> ns_ops->put(ns);
> }
> return res;
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 59b17e5..ac823ce 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -24,6 +24,7 @@
> #endif
> #ifdef CONFIG_PID_NS
> &pidns_operations,
> + &pidns_id_operations,
> &pidns_for_children_operations,
> #endif
> #ifdef CONFIG_USER_NS
> diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
> index 5fbc400..6ca3d43 100644
> --- a/include/linux/ns_common.h
> +++ b/include/linux/ns_common.h
> @@ -8,6 +8,7 @@ struct ns_common {
> atomic_long_t stashed;
> const struct proc_ns_operations *ops;
> unsigned int inum;
> + u64 ns_id;
> };
>
> #endif
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 49538b1..11d1d57 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -11,6 +11,7 @@
> #include <linux/kref.h>
> #include <linux/ns_common.h>
> #include <linux/idr.h>
> +#include <linux/list_bl.h>
>
>
> struct fs_pin;
> @@ -44,6 +45,8 @@ struct pid_namespace {
> kgid_t pid_gid;
> int hide_pid;
> int reboot; /* group exit code if this pidns was rebooted */
> + struct hlist_bl_node node;
> + atomic_t lookups_pending;
> struct ns_common ns;
> } __randomize_layout;
>
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index d31cb62..861e38bd 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -28,6 +28,7 @@ struct proc_ns_operations {
> extern const struct proc_ns_operations utsns_operations;
> extern const struct proc_ns_operations ipcns_operations;
> extern const struct proc_ns_operations pidns_operations;
> +extern const struct proc_ns_operations pidns_id_operations;
> extern const struct proc_ns_operations pidns_for_children_operations;
> extern const struct proc_ns_operations userns_operations;
> extern const struct proc_ns_operations mntns_operations;
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a78186d..574349a 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -901,6 +901,7 @@ asmlinkage long sys_open_by_handle_at(int mountdirfd,
> struct file_handle __user *handle,
> int flags);
> asmlinkage long sys_setns(int fd, int nstype);
> +asmlinkage long sys_translate_pid(pid_t pid, u64 source, u64 target);
> asmlinkage long sys_process_vm_readv(pid_t pid,
> const struct iovec __user *lvec,
> unsigned long liovcnt,
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0b53eef..ff83aa8 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -22,6 +22,12 @@
> #include <linux/sched/task.h>
> #include <linux/sched/signal.h>
> #include <linux/idr.h>
> +#include <linux/random.h>
> +#include <linux/compat.h>
> +
> +#define PID_NS_ID_HASH_BITS 9
> +
> +struct hlist_bl_head *pid_ns_hash;
>
> struct pid_cache {
> int nr_ids;
> @@ -34,6 +40,13 @@ struct pid_cache {
> static DEFINE_MUTEX(pid_caches_mutex);
> static struct kmem_cache *pid_ns_cachep;
>
> +static inline struct hlist_bl_head *
> + pid_ns_hash_head(struct hlist_bl_head *hash,
> + uint64_t key)
> +{
> + return &hash[hash_64(key, PID_NS_ID_HASH_BITS)];
> +}
> +
> /*
> * creates the kmem cache to allocate pids from.
> * @nr_ids: the number of numerical ids this pid will have to carry
> @@ -93,12 +106,24 @@ static void dec_pid_namespaces(struct ucounts *ucounts)
> dec_ucount(ucounts, UCOUNT_PID_NAMESPACES);
> }
>
> +static inline u64 get_namespace_id(void)
> +{
> + u64 id = 0;
> +
> + while (!id)
> + get_random_bytes(&id, sizeof(id));
> +
> + return id;
> +}
> +
> static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
> struct pid_namespace *parent_pid_ns)
> {
> - struct pid_namespace *ns;
> - unsigned int level = parent_pid_ns->level + 1;
> + struct pid_namespace *ns, *pt;
> struct ucounts *ucounts;
> + struct hlist_bl_head *head;
> + struct hlist_bl_node *dup_node;
> + unsigned int level = parent_pid_ns->level + 1;
> int err;
>
> err = -EINVAL;
> @@ -135,7 +160,24 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
> ns->ucounts = ucounts;
> ns->pid_allocated = PIDNS_ADDING;
> INIT_WORK(&ns->proc_work, proc_cleanup_work);
> -
> + ns->ns.ns_id = get_namespace_id();
> + while (1) {
> + head = pid_ns_hash_head(pid_ns_hash, ns->ns.ns_id);
> + hlist_bl_lock(head);
> + hlist_bl_for_each_entry(pt, dup_node, head, node) {
> + if (ns->ns.ns_id == pt->ns.ns_id) {
> + /*
> + * ID is taken. Move to next ID;
> + */
> + ns->ns.ns_id = get_namespace_id();
> + hlist_bl_unlock(head);
> + continue;
> + }
> + }
> + break;
> + }
> + hlist_bl_add_head(&ns->node, head);
> + hlist_bl_unlock(head);
> return ns;
>
> out_free_idr:
> @@ -159,6 +201,30 @@ static void delayed_free_pidns(struct rcu_head *p)
>
> static void destroy_pid_namespace(struct pid_namespace *ns)
> {
> + struct pid_namespace *ph;
> + struct hlist_bl_head *head;
> + struct hlist_bl_node *dup_node;
> +
> + /*
> + * Remove the namespace structure from hash table so
> + * now new lookups can start on it.
> + */
> + if (ns->ns.ns_id) {
> + head = pid_ns_hash_head(pid_ns_hash, ns->ns.ns_id);
> + hlist_bl_lock(head);
> + hlist_bl_for_each_entry(ph, dup_node, head, node) {
> + if (ns->ns.ns_id == ph->ns.ns_id) {
> + hlist_bl_del_init(&ph->node);
> + break;
> + }
> + }
> + hlist_bl_unlock(head);
> + }
> + /*
> + * Wait for pending lookups to complete.
> + */
> + while (atomic_read(&ns->lookups_pending))
> + cpu_relax();
> ns_free_inum(&ns->ns);
>
> idr_destroy(&ns->idr);
> @@ -463,6 +529,17 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
> .get_parent = pidns_get_parent,
> };
>
> +const struct proc_ns_operations pidns_id_operations = {
> + .name = "pidns_id",
> + .real_ns_name = "pid",
> + .type = CLONE_NEWPID,
> + .get = pidns_get,
> + .put = pidns_put,
> + .install = pidns_install,
> + .owner = pidns_owner,
> + .get_parent = pidns_get_parent,
> +};
> +
> const struct proc_ns_operations pidns_for_children_operations = {
> .name = "pid_for_children",
> .real_ns_name = "pid",
> @@ -474,9 +551,116 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
> .get_parent = pidns_get_parent,
> };
>
> +/*
> + * translate_pid - convert pid in source pid-ns into target pid-ns.
> + * @pid: pid for translation
> + * @source: pid-ns id
> + * @target: pid-ns id
> + *
> + * Return pid in @target pid-ns, zero if task have no pid there,
> + * or -ESRCH of task with @pid is not found in @source pid-ns.
> + */
> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
> + u64, target)
> +{
> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> + struct pid *struct_pid;
> + struct pid_namespace *ph;
> + struct hlist_bl_head *shead = NULL;
> + struct hlist_bl_head *thead = NULL;
> + struct hlist_bl_node *dup_node;
> + pid_t result;
> +
> + if (!source) {
> + source_ns = &init_pid_ns;
> + } else {
> + shead = pid_ns_hash_head(pid_ns_hash, source);
> + hlist_bl_lock(shead);
> + hlist_bl_for_each_entry(ph, dup_node, shead, node) {
> + if (source == ph->ns.ns_id) {
> + source_ns = ph;
> + break;
> + }
> + }
> + if (!source_ns) {
> + hlist_bl_unlock(shead);
> + return -EINVAL;
> + }
> + }
> + if (!ptrace_may_access(source_ns->child_reaper,
> + PTRACE_MODE_READ_FSCREDS)) {
> + if (shead)
> + hlist_bl_unlock(shead);
> + return -EPERM;
> + }
> +
> + atomic_inc(&source_ns->lookups_pending);
> + if (shead)
> + hlist_bl_unlock(shead);
> +
> + if (!target) {
> + target_ns = &init_pid_ns;
> + } else {
> + thead = pid_ns_hash_head(pid_ns_hash, target);
> + hlist_bl_lock(thead);
> + hlist_bl_for_each_entry(ph, dup_node, thead, node) {
> + if (target == ph->ns.ns_id) {
> + target_ns = ph;
> + break;
> + }
> + }
> + if (!target_ns) {
> + atomic_dec(&source_ns->lookups_pending);
> + hlist_bl_unlock(thead);
> + return -EINVAL;
> + }
> + }
> + if (!ptrace_may_access(target_ns->child_reaper,
> + PTRACE_MODE_READ_FSCREDS)) {
> + atomic_dec(&source_ns->lookups_pending);
> + if (thead)
> + hlist_bl_unlock(thead);
> + return -EPERM;
> + }
> + atomic_inc(&target_ns->lookups_pending);
> + if (thead)
> + hlist_bl_unlock(thead);
> +
> + rcu_read_lock();
> + struct_pid = find_pid_ns(pid, source_ns);
> + result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
> + rcu_read_unlock();
> + atomic_dec(&source_ns->lookups_pending);
> + atomic_dec(&target_ns->lookups_pending);
> + return result;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +COMPAT_SYSCALL_DEFINE5(translate_pid, pid_t, pid, u32, s0, u32, s1,
> + u32, t0, u32, t1)
> +{
> +#ifdef __BIG_ENDIAN
> + return sys_translate_pid(pid, ((u64)s0 << 32) | s1,
> + ((u64)t0 << 32) | t1);
> +#else
> + return sys_translate_pid(pid, ((u64)s1 << 32) | s0,
> + ((u64)t1 << 32) | t0);
> +#endif
> +}
> +#endif
> +
> static __init int pid_namespaces_init(void)
> {
> + unsigned long bucket_count;
> + int i;
> +
> + bucket_count = (1UL << PID_NS_ID_HASH_BITS);
> pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> + pid_ns_hash = kmalloc_array(bucket_count, sizeof(struct hlist_bl_head),
> + GFP_KERNEL);
> +
> + for (i = 0; i < bucket_count; i++)
> + INIT_HLIST_BL_HEAD(&pid_ns_hash[i]);
>
> #ifdef CONFIG_CHECKPOINT_RESTORE
> register_sysctl_paths(kern_path, pid_ns_ctl_table);
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index b518976..467255f 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -259,3 +259,7 @@ asmlinkage long sys_ni_syscall(void)
> cond_syscall(sys_pkey_mprotect);
> cond_syscall(sys_pkey_alloc);
> cond_syscall(sys_pkey_free);
> +
> +/* pid translation */
> +cond_syscall(sys_translate_pid);
> +cond_syscall(compat_sys_translate_pid);
On 03/13/2018 08:29 PM, [email protected] wrote:
> [email protected] writes:
>
>> Resending the RFC with participants of previous discussions
>> in the list.
> Nacked-by: "Eric W. Biederman" <[email protected]>
>
> Please read below to see why.
>
>> Following patch which is a variation of a solution discussed
>> in https://lwn.net/Articles/736330/ provides the users of
>> pid namespace, the functionality of pid translation between
>> namespaces using a namespace identifier. The topic of
>> pid translation has been discussed in the community few times
>> but there has always been a resistance to adding new solution
>> for this problem.
> So far there has been no clear cut case that is worth the maintenance
> challenge. As I recall the conversations they have always resulted
> in a better way discovered.
>
>> I will outline the planned usecase of pid namespace by oracle
>> database and explain why any of the existing solution cannot
>> be used to solve their problem.
>>
>> Consider a system in which several PID namespaces with multiple
>> nested levels exists in parallel with monitor processes managing
>> all the namespaces. PID translation is required for controlling
>> and accessing information about the processes by the monitors
>> and other processes down the hierarchy of namespaces. Controlling
>> primarily involves sending signals or using ptrace by a process in
>> parent namespace on any of the processes in its child namespace.
>> Accessing information deals with the reading /proc/<pid>/* files
>> of processes in child namespace. None of the processes have
>> root/CAP_SYS_ADMIN privileges.
> So one process inside of a pid namespace one process outside of
> the pid namespace and they are talking about some other process
> inside the pid namespace.
>
> Agreed that is a usecase that likely needs pid translation.
>
>> Existing methods:
>>
>> *) SCM_CREDENTIALS message:
>> Ref: http://man7.org/linux/man-pages/man7/unix.7.html
>> The sender can translate its pid from its own namespace
>> to a pid in the target namespace by sending and receiving the
>> SCM_CREDENTIALS message. The drawback of this method is the
>> requirement of a socket communication channel to pid translation
>> which adds to the management overhead. This method does not enable
>> the sender to translate the pid of other process unless it is root
>> or it has CAP_SYS_ADMIN.
> Or it has CAP_SYS_ADMIN (in the sending pid namespace).
> That is fair.
>
>> *) /proc/<pid>/status file
>> Ref: http://man7.org/linux/man-pages/man5/proc.5.html
>> Ref: https://patchwork.kernel.org/patch/5861791/
>> /proc/<pid>/status file provides a way to find the pids
>> associated with a process in different namespaces. Pid translation
>> from child namespace to parent namespace from parent namespace would
>> require searching all the status file in the parent namespace to find
>> the desired pid at desired level which is inefficient.
> That is fair.
>> Method under review:
>>
>> *) pidns: introduce syscall translate_pid
>> Ref: https://lwn.net/Articles/736330/
>> This solution though not integrated yet, when used would require
>> monitor to have open file descriptor for every level of namespace it
>> has to monitor which results in explosion of number of file descriptors
>> the monitor has to keep open.
> You know what tough nuggies.
>
> File descriptors aren't or at least shouldn't be that expensive. And
> you are already committing to tracking the pid namespaces anyway. An
> int for a file descriptor is cheaper than a u64 for the userspace
> process.
The size of individual file descriptor was not the issue but the
number of file descriptors to track for each level in a nested
PID namespace is the issue.
> The cost of that ``cheaper'' u64 that is not in any namespace is that
> you now have to go and implement a namespace of namespaces. You haven't
> even attempted it. So just no. Anything that brings us to needing
> a namespace of namespaces is a bad design.
I am not trying to implement a namespace of namespaces. The ID
exported through /proc/<pid>/ns/pidns_id is associated with the
pid namespace of the process. The function of this ID is very similar to
that of the inode number exported by /proc/<pid>/ns/pid file (ie) uniquely
identify a pid namespace in the system.
Inode numbers can be recycled very fast and hence I proposed a
separate ID. Instead of exporting inode number through /proc/<pid>/ns/pid
file we could directly export this new ID but I instead moved it to
a separate file so that any existing application which depends on this
number being 32-bit wont break.
>> Following patch uses a 64-bit ID for namespace exported by procfs
>> for pid translation through a new file /proc/<pid>/ns/pidns_id.
> And this design detail is what brings the automatic nack.
>
> Use file descriptros and it sounds like your use case justifies what you
> are trying to do.
File descriptors are problematic for following reasons.
1) I need to open a couple of file descriptors for every pid translation
request.
2) In case of nested PID namespaces, say a new pid namespace is created
at level 20,
with unique ID, I could just record this ID in a shared memory for
interested process
to use. In case of file descriptors, every level has to figure out
the process ID of the
newly created namespace's init process and open a file descriptor
to track it.
Thanks,
Nagarathnam.
>
>> [root@brm-x4600-01 ~]# ls -l /proc/4690/ns
>> total 0
>> lrwxrwxrwx 1 root root 0 Mar 1 15:55 cgroup -> cgroup:[4026531835]
>> lrwxrwxrwx 1 root root 0 Mar 1 15:55 ipc -> ipc:[4026531839]
>> lrwxrwxrwx 1 root root 0 Mar 1 15:55 mnt -> mnt:[4026532324]
>> lrwxrwxrwx 1 root root 0 Mar 1 15:55 net -> net:[4026531993]
>> lrwxrwxrwx 1 root root 0 Mar 1 15:55 pid -> pid:[4026532325]
>> lrwxrwxrwx 1 root root 0 Mar 1 15:55 pid_for_children -> pid:[4026532325]
>> lrwxrwxrwx 1 root root 0 Mar 1 15:55 pidns_id -> [14994344851325148137]
>> lrwxrwxrwx 1 root root 0 Mar 1 15:55 user -> user:[4026531837]
>> lrwxrwxrwx 1 root root 0 Mar 1 15:55 uts -> uts:[4026531838]
>>
>> Though there is a problem of ID being recycled in longterm, managing
>> an ID per namespace is easier than having open file descriptors per
>> pid namespace. Existing namespace inode numbers recycles faster and hence
>> their usability as ID for this API is less.
> No.
>
> A namespace of namespaces is a much harder problem. We aren't going
> there.
>
> It sounds like you may be struggling with lifetime issues and when
> to close/get rid of your pid namespace identifier. That could deserve
> a conversation. Idenitifiers that are not namespaced are absolutely not
> the solution.
>
> Eric
>
>
>> Signed-off-by: Nagarathnam Muthusamy <[email protected]>
>> ---
>> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
>> arch/x86/entry/syscalls/syscall_64.tbl | 2 +
>> fs/nsfs.c | 9 +-
>> fs/proc/namespaces.c | 1 +
>> include/linux/ns_common.h | 1 +
>> include/linux/pid_namespace.h | 3 +
>> include/linux/proc_ns.h | 1 +
>> include/linux/syscalls.h | 1 +
>> kernel/pid_namespace.c | 190 ++++++++++++++++++++++++++++++++-
>> kernel/sys_ni.c | 4 +
>> 10 files changed, 208 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
>> index 448ac21..31bf798 100644
>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>> @@ -391,3 +391,4 @@
>> 382 i386 pkey_free sys_pkey_free
>> 383 i386 statx sys_statx
>> 384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
>> +385 i386 translate_pid sys_translate_pid compat_sys_translate_pid
>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>> index 5aef183..89196c3 100644
>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>> @@ -339,6 +339,7 @@
>> 330 common pkey_alloc sys_pkey_alloc
>> 331 common pkey_free sys_pkey_free
>> 332 common statx sys_statx
>> +333 64 translate_pid sys_translate_pid
>>
>> #
>> # x32-specific system call numbers start at 512 to avoid cache impact
>> @@ -380,3 +381,4 @@
>> 545 x32 execveat compat_sys_execveat/ptregs
>> 546 x32 preadv2 compat_sys_preadv64v2
>> 547 x32 pwritev2 compat_sys_pwritev64v2
>> +548 x32 translate_pid compat_sys_translate_pid
>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>> index 36b0772..c635465 100644
>> --- a/fs/nsfs.c
>> +++ b/fs/nsfs.c
>> @@ -222,8 +222,13 @@ int ns_get_name(char *buf, size_t size, struct task_struct *task,
>> const char *name;
>> ns = ns_ops->get(task);
>> if (ns) {
>> - name = ns_ops->real_ns_name ? : ns_ops->name;
>> - res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
>> + if (!strcmp(ns_ops->name, "pidns_id")) {
>> + res = snprintf(buf, size, "[%llu]",
>> + (unsigned long long)ns->ns_id);
>> + } else {
>> + name = ns_ops->real_ns_name ? : ns_ops->name;
>> + res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
>> + }
>> ns_ops->put(ns);
>> }
>> return res;
>> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
>> index 59b17e5..ac823ce 100644
>> --- a/fs/proc/namespaces.c
>> +++ b/fs/proc/namespaces.c
>> @@ -24,6 +24,7 @@
>> #endif
>> #ifdef CONFIG_PID_NS
>> &pidns_operations,
>> + &pidns_id_operations,
>> &pidns_for_children_operations,
>> #endif
>> #ifdef CONFIG_USER_NS
>> diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
>> index 5fbc400..6ca3d43 100644
>> --- a/include/linux/ns_common.h
>> +++ b/include/linux/ns_common.h
>> @@ -8,6 +8,7 @@ struct ns_common {
>> atomic_long_t stashed;
>> const struct proc_ns_operations *ops;
>> unsigned int inum;
>> + u64 ns_id;
>> };
>>
>> #endif
>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>> index 49538b1..11d1d57 100644
>> --- a/include/linux/pid_namespace.h
>> +++ b/include/linux/pid_namespace.h
>> @@ -11,6 +11,7 @@
>> #include <linux/kref.h>
>> #include <linux/ns_common.h>
>> #include <linux/idr.h>
>> +#include <linux/list_bl.h>
>>
>>
>> struct fs_pin;
>> @@ -44,6 +45,8 @@ struct pid_namespace {
>> kgid_t pid_gid;
>> int hide_pid;
>> int reboot; /* group exit code if this pidns was rebooted */
>> + struct hlist_bl_node node;
>> + atomic_t lookups_pending;
>> struct ns_common ns;
>> } __randomize_layout;
>>
>> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
>> index d31cb62..861e38bd 100644
>> --- a/include/linux/proc_ns.h
>> +++ b/include/linux/proc_ns.h
>> @@ -28,6 +28,7 @@ struct proc_ns_operations {
>> extern const struct proc_ns_operations utsns_operations;
>> extern const struct proc_ns_operations ipcns_operations;
>> extern const struct proc_ns_operations pidns_operations;
>> +extern const struct proc_ns_operations pidns_id_operations;
>> extern const struct proc_ns_operations pidns_for_children_operations;
>> extern const struct proc_ns_operations userns_operations;
>> extern const struct proc_ns_operations mntns_operations;
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index a78186d..574349a 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -901,6 +901,7 @@ asmlinkage long sys_open_by_handle_at(int mountdirfd,
>> struct file_handle __user *handle,
>> int flags);
>> asmlinkage long sys_setns(int fd, int nstype);
>> +asmlinkage long sys_translate_pid(pid_t pid, u64 source, u64 target);
>> asmlinkage long sys_process_vm_readv(pid_t pid,
>> const struct iovec __user *lvec,
>> unsigned long liovcnt,
>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>> index 0b53eef..ff83aa8 100644
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -22,6 +22,12 @@
>> #include <linux/sched/task.h>
>> #include <linux/sched/signal.h>
>> #include <linux/idr.h>
>> +#include <linux/random.h>
>> +#include <linux/compat.h>
>> +
>> +#define PID_NS_ID_HASH_BITS 9
>> +
>> +struct hlist_bl_head *pid_ns_hash;
>>
>> struct pid_cache {
>> int nr_ids;
>> @@ -34,6 +40,13 @@ struct pid_cache {
>> static DEFINE_MUTEX(pid_caches_mutex);
>> static struct kmem_cache *pid_ns_cachep;
>>
>> +static inline struct hlist_bl_head *
>> + pid_ns_hash_head(struct hlist_bl_head *hash,
>> + uint64_t key)
>> +{
>> + return &hash[hash_64(key, PID_NS_ID_HASH_BITS)];
>> +}
>> +
>> /*
>> * creates the kmem cache to allocate pids from.
>> * @nr_ids: the number of numerical ids this pid will have to carry
>> @@ -93,12 +106,24 @@ static void dec_pid_namespaces(struct ucounts *ucounts)
>> dec_ucount(ucounts, UCOUNT_PID_NAMESPACES);
>> }
>>
>> +static inline u64 get_namespace_id(void)
>> +{
>> + u64 id = 0;
>> +
>> + while (!id)
>> + get_random_bytes(&id, sizeof(id));
>> +
>> + return id;
>> +}
>> +
>> static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
>> struct pid_namespace *parent_pid_ns)
>> {
>> - struct pid_namespace *ns;
>> - unsigned int level = parent_pid_ns->level + 1;
>> + struct pid_namespace *ns, *pt;
>> struct ucounts *ucounts;
>> + struct hlist_bl_head *head;
>> + struct hlist_bl_node *dup_node;
>> + unsigned int level = parent_pid_ns->level + 1;
>> int err;
>>
>> err = -EINVAL;
>> @@ -135,7 +160,24 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>> ns->ucounts = ucounts;
>> ns->pid_allocated = PIDNS_ADDING;
>> INIT_WORK(&ns->proc_work, proc_cleanup_work);
>> -
>> + ns->ns.ns_id = get_namespace_id();
>> + while (1) {
>> + head = pid_ns_hash_head(pid_ns_hash, ns->ns.ns_id);
>> + hlist_bl_lock(head);
>> + hlist_bl_for_each_entry(pt, dup_node, head, node) {
>> + if (ns->ns.ns_id == pt->ns.ns_id) {
>> + /*
>> + * ID is taken. Move to next ID;
>> + */
>> + ns->ns.ns_id = get_namespace_id();
>> + hlist_bl_unlock(head);
>> + continue;
>> + }
>> + }
>> + break;
>> + }
>> + hlist_bl_add_head(&ns->node, head);
>> + hlist_bl_unlock(head);
>> return ns;
>>
>> out_free_idr:
>> @@ -159,6 +201,30 @@ static void delayed_free_pidns(struct rcu_head *p)
>>
>> static void destroy_pid_namespace(struct pid_namespace *ns)
>> {
>> + struct pid_namespace *ph;
>> + struct hlist_bl_head *head;
>> + struct hlist_bl_node *dup_node;
>> +
>> + /*
>> + * Remove the namespace structure from hash table so
>> + * now new lookups can start on it.
>> + */
>> + if (ns->ns.ns_id) {
>> + head = pid_ns_hash_head(pid_ns_hash, ns->ns.ns_id);
>> + hlist_bl_lock(head);
>> + hlist_bl_for_each_entry(ph, dup_node, head, node) {
>> + if (ns->ns.ns_id == ph->ns.ns_id) {
>> + hlist_bl_del_init(&ph->node);
>> + break;
>> + }
>> + }
>> + hlist_bl_unlock(head);
>> + }
>> + /*
>> + * Wait for pending lookups to complete.
>> + */
>> + while (atomic_read(&ns->lookups_pending))
>> + cpu_relax();
>> ns_free_inum(&ns->ns);
>>
>> idr_destroy(&ns->idr);
>> @@ -463,6 +529,17 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
>> .get_parent = pidns_get_parent,
>> };
>>
>> +const struct proc_ns_operations pidns_id_operations = {
>> + .name = "pidns_id",
>> + .real_ns_name = "pid",
>> + .type = CLONE_NEWPID,
>> + .get = pidns_get,
>> + .put = pidns_put,
>> + .install = pidns_install,
>> + .owner = pidns_owner,
>> + .get_parent = pidns_get_parent,
>> +};
>> +
>> const struct proc_ns_operations pidns_for_children_operations = {
>> .name = "pid_for_children",
>> .real_ns_name = "pid",
>> @@ -474,9 +551,116 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
>> .get_parent = pidns_get_parent,
>> };
>>
>> +/*
>> + * translate_pid - convert pid in source pid-ns into target pid-ns.
>> + * @pid: pid for translation
>> + * @source: pid-ns id
>> + * @target: pid-ns id
>> + *
>> + * Return pid in @target pid-ns, zero if task have no pid there,
>> + * or -ESRCH of task with @pid is not found in @source pid-ns.
>> + */
>> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
>> + u64, target)
>> +{
>> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
>> + struct pid *struct_pid;
>> + struct pid_namespace *ph;
>> + struct hlist_bl_head *shead = NULL;
>> + struct hlist_bl_head *thead = NULL;
>> + struct hlist_bl_node *dup_node;
>> + pid_t result;
>> +
>> + if (!source) {
>> + source_ns = &init_pid_ns;
>> + } else {
>> + shead = pid_ns_hash_head(pid_ns_hash, source);
>> + hlist_bl_lock(shead);
>> + hlist_bl_for_each_entry(ph, dup_node, shead, node) {
>> + if (source == ph->ns.ns_id) {
>> + source_ns = ph;
>> + break;
>> + }
>> + }
>> + if (!source_ns) {
>> + hlist_bl_unlock(shead);
>> + return -EINVAL;
>> + }
>> + }
>> + if (!ptrace_may_access(source_ns->child_reaper,
>> + PTRACE_MODE_READ_FSCREDS)) {
>> + if (shead)
>> + hlist_bl_unlock(shead);
>> + return -EPERM;
>> + }
>> +
>> + atomic_inc(&source_ns->lookups_pending);
>> + if (shead)
>> + hlist_bl_unlock(shead);
>> +
>> + if (!target) {
>> + target_ns = &init_pid_ns;
>> + } else {
>> + thead = pid_ns_hash_head(pid_ns_hash, target);
>> + hlist_bl_lock(thead);
>> + hlist_bl_for_each_entry(ph, dup_node, thead, node) {
>> + if (target == ph->ns.ns_id) {
>> + target_ns = ph;
>> + break;
>> + }
>> + }
>> + if (!target_ns) {
>> + atomic_dec(&source_ns->lookups_pending);
>> + hlist_bl_unlock(thead);
>> + return -EINVAL;
>> + }
>> + }
>> + if (!ptrace_may_access(target_ns->child_reaper,
>> + PTRACE_MODE_READ_FSCREDS)) {
>> + atomic_dec(&source_ns->lookups_pending);
>> + if (thead)
>> + hlist_bl_unlock(thead);
>> + return -EPERM;
>> + }
>> + atomic_inc(&target_ns->lookups_pending);
>> + if (thead)
>> + hlist_bl_unlock(thead);
>> +
>> + rcu_read_lock();
>> + struct_pid = find_pid_ns(pid, source_ns);
>> + result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
>> + rcu_read_unlock();
>> + atomic_dec(&source_ns->lookups_pending);
>> + atomic_dec(&target_ns->lookups_pending);
>> + return result;
>> +}
>> +
>> +#ifdef CONFIG_COMPAT
>> +COMPAT_SYSCALL_DEFINE5(translate_pid, pid_t, pid, u32, s0, u32, s1,
>> + u32, t0, u32, t1)
>> +{
>> +#ifdef __BIG_ENDIAN
>> + return sys_translate_pid(pid, ((u64)s0 << 32) | s1,
>> + ((u64)t0 << 32) | t1);
>> +#else
>> + return sys_translate_pid(pid, ((u64)s1 << 32) | s0,
>> + ((u64)t1 << 32) | t0);
>> +#endif
>> +}
>> +#endif
>> +
>> static __init int pid_namespaces_init(void)
>> {
>> + unsigned long bucket_count;
>> + int i;
>> +
>> + bucket_count = (1UL << PID_NS_ID_HASH_BITS);
>> pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
>> + pid_ns_hash = kmalloc_array(bucket_count, sizeof(struct hlist_bl_head),
>> + GFP_KERNEL);
>> +
>> + for (i = 0; i < bucket_count; i++)
>> + INIT_HLIST_BL_HEAD(&pid_ns_hash[i]);
>>
>> #ifdef CONFIG_CHECKPOINT_RESTORE
>> register_sysctl_paths(kern_path, pid_ns_ctl_table);
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index b518976..467255f 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -259,3 +259,7 @@ asmlinkage long sys_ni_syscall(void)
>> cond_syscall(sys_pkey_mprotect);
>> cond_syscall(sys_pkey_alloc);
>> cond_syscall(sys_pkey_free);
>> +
>> +/* pid translation */
>> +cond_syscall(sys_translate_pid);
>> +cond_syscall(compat_sys_translate_pid);
Nagarathnam Muthusamy <[email protected]> writes:
> On 03/13/2018 08:29 PM, [email protected] wrote:
>> The cost of that ``cheaper'' u64 that is not in any namespace is that
>> you now have to go and implement a namespace of namespaces. You haven't
>> even attempted it. So just no. Anything that brings us to needing
>> a namespace of namespaces is a bad design.
>
> I am not trying to implement a namespace of namespaces.
No you are using a design that will require a namespace of namespaces
to be implemented to support CRIU (checkpoint/restart in userspace).
So when I see your patch I see a patch that only implements the easy
half of the work that needs to be done.
>>> Following patch uses a 64-bit ID for namespace exported by procfs
>>> for pid translation through a new file /proc/<pid>/ns/pidns_id.
>> And this design detail is what brings the automatic nack.
>>
>> Use file descriptros and it sounds like your use case justifies what you
>> are trying to do.
>
> File descriptors are problematic for following reasons.
> 1) I need to open a couple of file descriptors for every pid
> translation request.
You can cache descriptors across requests. I suspect simply
by tracking the origin of the shared memory segment you can figure
out it's pid namespace.
> 2) In case of nested PID namespaces, say a new pid namespace is
> created at level 20,
> with unique ID, I could just record this ID in a shared memory for
> interested process
> to use. In case of file descriptors, every level has to figure out
> the process ID of the
> newly created namespace's init process and open a file descriptor
> to track it.
Toss in a bind mount of the file in some filesystem if that helps.
But if I understand what you are talking about you are talking about
having a shared memory segment shared between processes in different
pid namespaces.
In that shared memory segment for a processes in different namespaces
you are talking about having the conversation structured as having
information structured as pid-namespace pid.
And crucuially you want anyone in any pid namespace to be able to read
that shared memory segment and to make sense of what is going on,
by just reading the pid namespace id.
Namespaces are all about making identifiers relative to their namespace.
The only way I can see you gain an advantage with your shared memory
design is by making identifiers that are not relative to their pid
namespace. As such identifiers will completely defeat the ability
to implement CRIU support.
The closest I have to such identifiers today are bind mounts of the
namespace files. So if you also have a common mount namespace you could
use that.
In theory a name in some other namespace is possible. However anyone in
a container will only be able to see the names in their container or in
nested sub containers. Which is what you have already with pids. So I
don't think that will help.
Eric
(Resending the reply as there was a reject due to HTML in email)
On 03/14/2018 03:03 PM, [email protected] wrote:
> Nagarathnam Muthusamy <[email protected]> writes:
>
>> On 03/13/2018 08:29 PM, [email protected] wrote:
>>> The cost of that ``cheaper'' u64 that is not in any namespace is that
>>> you now have to go and implement a namespace of namespaces. You haven't
>>> even attempted it. So just no. Anything that brings us to needing
>>> a namespace of namespaces is a bad design.
>> I am not trying to implement a namespace of namespaces.
> No you are using a design that will require a namespace of namespaces
> to be implemented to support CRIU (checkpoint/restart in userspace).
>
> So when I see your patch I see a patch that only implements the easy
> half of the work that needs to be done.
>
>>>> Following patch uses a 64-bit ID for namespace exported by procfs
>>>> for pid translation through a new file /proc/<pid>/ns/pidns_id.
>>> And this design detail is what brings the automatic nack.
>>>
>>> Use file descriptros and it sounds like your use case justifies what you
>>> are trying to do.
>> File descriptors are problematic for following reasons.
>> 1) I need to open a couple of file descriptors for every pid
>> translation request.
> You can cache descriptors across requests. I suspect simply
> by tracking the origin of the shared memory segment you can figure
> out it's pid namespace.
>
>> 2) In case of nested PID namespaces, say a new pid namespace is
>> created at level 20,
>> with unique ID, I could just record this ID in a shared memory for
>> interested process
>> to use. In case of file descriptors, every level has to figure out
>> the process ID of the
>> newly created namespace's init process and open a file descriptor
>> to track it.
> Toss in a bind mount of the file in some filesystem if that helps.
>
> But if I understand what you are talking about you are talking about
> having a shared memory segment shared between processes in different
> pid namespaces.
>
> In that shared memory segment for a processes in different namespaces
> you are talking about having the conversation structured as having
> information structured as pid-namespace pid.
>
> And crucuially you want anyone in any pid namespace to be able to read
> that shared memory segment and to make sense of what is going on,
> by just reading the pid namespace id.
This captures the usecase. Adding to that, every level is made up of
a combination of User, pid and mount namespace.
>
>
> Namespaces are all about making identifiers relative to their namespace.
>
> The only way I can see you gain an advantage with your shared memory
> design is by making identifiers that are not relative to their pid
> namespace. As such identifiers will completely defeat the ability
> to implement CRIU support.
>
> The closest I have to such identifiers today are bind mounts of the
> namespace files. So if you also have a common mount namespace you could
> use that.
We don't have common mount namespace. Each nested level will have
a new mount namespace. When a new nested level (User + pid + mnt) is
created, init process of new level cannot bind mount the namespace
directory,
as the effects wont be visible to the other levels.
On other hand, the new init process could send SCM_CREDENTIALS message
to a centralized listener running outside of the whole setup which does only
bind mounts. Here, we have a single point of failure for the whole
system and
this listener has to run as root to be able to do bind mounts. Apart
from these,
I am not able to see the bind mount by listener being propagated to child
namespaces in my setup. Not sure if I am missing anything or this is the
expected behavior.
Is it possible to have application provide the ID to be associated with
the namespace? During dump, we can save the ID and during restore,
we can assign the ID using the same API. There is a possibility of
collision during restore. Is it ok to fail the restore during such scenario?
Thanks,
Nagarathnam.
>
> In theory a name in some other namespace is possible. However anyone in
> a container will only be able to see the names in their container or in
> nested sub containers. Which is what you have already with pids. So I
> don't think that will help.
>
> Eric
Nagarathnam Muthusamy <[email protected]> writes:
> (Resending the reply as there was a reject due to HTML in email)
>
> On 03/14/2018 03:03 PM, [email protected] wrote:
>> Nagarathnam Muthusamy <[email protected]> writes:
>>
>>> On 03/13/2018 08:29 PM, [email protected] wrote:
>>>> The cost of that ``cheaper'' u64 that is not in any namespace is that
>>>> you now have to go and implement a namespace of namespaces. You haven't
>>>> even attempted it. So just no. Anything that brings us to needing
>>>> a namespace of namespaces is a bad design.
>>> I am not trying to implement a namespace of namespaces.
>> No you are using a design that will require a namespace of namespaces
>> to be implemented to support CRIU (checkpoint/restart in userspace).
>>
>> So when I see your patch I see a patch that only implements the easy
>> half of the work that needs to be done.
>>
>>>>> Following patch uses a 64-bit ID for namespace exported by procfs
>>>>> for pid translation through a new file /proc/<pid>/ns/pidns_id.
>>>> And this design detail is what brings the automatic nack.
>>>>
>>>> Use file descriptros and it sounds like your use case justifies what you
>>>> are trying to do.
>>> File descriptors are problematic for following reasons.
>>> 1) I need to open a couple of file descriptors for every pid
>>> translation request.
>> You can cache descriptors across requests. I suspect simply
>> by tracking the origin of the shared memory segment you can figure
>> out it's pid namespace.
>>
>>> 2) In case of nested PID namespaces, say a new pid namespace is
>>> created at level 20,
>>> with unique ID, I could just record this ID in a shared memory for
>>> interested process
>>> to use. In case of file descriptors, every level has to figure out
>>> the process ID of the
>>> newly created namespace's init process and open a file descriptor
>>> to track it.
>> Toss in a bind mount of the file in some filesystem if that helps.
>>
>> But if I understand what you are talking about you are talking about
>> having a shared memory segment shared between processes in different
>> pid namespaces.
>>
>> In that shared memory segment for a processes in different namespaces
>> you are talking about having the conversation structured as having
>> information structured as pid-namespace pid.
>>
>> And crucuially you want anyone in any pid namespace to be able to read
>> that shared memory segment and to make sense of what is going on,
>> by just reading the pid namespace id.
>
> This captures the usecase. Adding to that, every level is made up of
> a combination of User, pid and mount namespace.
You must be using sysvipc shared memory segments not posix shared memory
segments if you are sharing them in that scenario.
>> Namespaces are all about making identifiers relative to their namespace.
>>
>> The only way I can see you gain an advantage with your shared memory
>> design is by making identifiers that are not relative to their pid
>> namespace. As such identifiers will completely defeat the ability
>> to implement CRIU support.
>>
>> The closest I have to such identifiers today are bind mounts of the
>> namespace files. So if you also have a common mount namespace you could
>> use that.
>
> We don't have common mount namespace. Each nested level will have
> a new mount namespace. When a new nested level (User + pid + mnt) is
> created, init process of new level cannot bind mount the namespace directory,
> as the effects wont be visible to the other levels.
Do you have a ipc shared memory segment.
I just looked and realized there is a rather significant bug in ipc
shared memory segments when shared between pid namespaces, and I believe
fixing that bug will resolve your issue.
shmctl(IPC_STAT, ...) will return a struct shmid_ds.
The struct shmid_ds has a field shm_cpid.
That field is currently reported as the pid in the pid namespace that
created the segment. Which is nonsense if you are not in the pid
namespace.
However if we were to fix that to properly return the pid in the
pid namespace of the caller of IPC_STAT. You could find the
pid of the creator of the ipc shared memory segment by just
calling shmctl(IPC_STAT, ...).
> On other hand, the new init process could send SCM_CREDENTIALS message
> to a centralized listener running outside of the whole setup which does only
> bind mounts. Here, we have a single point of failure for the whole system and
> this listener has to run as root to be able to do bind mounts. Apart from these,
> I am not able to see the bind mount by listener being propagated to child
> namespaces in my setup. Not sure if I am missing anything or this is the
> expected behavior.
It depends on how mount propgation is configured. You can configure
mount propgation so that these bind mounts propagate to every mount
namespace. Which would definitely be a possible solution.
> Is it possible to have application provide the ID to be associated with
> the namespace? During dump, we can save the ID and during restore,
> we can assign the ID using the same API. There is a possibility of
> collision during restore. Is it ok to fail the restore during such
> scenario?
No. It is absolutely not ok to fail the restore during such a scenario.
For CRIU that is the entire point of having namespaces. Not having
to worry about an id not being available during restore.
Eric
While discussing a proposal by Nagarathnam Muthusamy I realized that
the ipc namespace has never been given proper pid namespace support.
In particular if objects in a single ipc namespace are accessed from
different pid namespaces they will return the wrong pids.
Then when trying to test this I discovered that definitions that are
only used in msg.c, sem.c, and shm.c are included in linux/sched.h
resulting in what should be simple local modifications requring
nearly a full kernel rebuild.
So this patchset does several things.
- Changes the types that are passed into the security hooks to struct
kern_ipc_perm because that is all the security hooks use.
- Moves definitions from include/{msg,sem,shm}.h into ipc/{msg,sem,shm}.c
So the code can be modified without excessive development time.
- Instead of storing pids as intergers stores struct pid * instead.
I took a careful look to see if it seems likely the performance
regression in credential passing that af_unix experienced after
a similar conversion would be likely, but I don't see it.
So I think the biggest concern is if someone in the last 10 years
has come to depend on the buggy behavior. If either the performance
is problematic or the there are regression caused by the change
in behavior we can revert.
Still I would like to see this fixed and I plan on merging this code.
Eric W. Biederman (11):
sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks
shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks
msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks
sem: Move struct sem and struct sem_array into ipc/sem.c
shm: Move struct shmid_kernel into ipc/shm.c
msg: Move struct msg_queue into ipc/msg.c
ipc: Move IPCMNI from include/ipc.h into ipc/util.h
ipc/util: Helpers for making the sysvipc operations pid namespace aware
ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces.
ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces
ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces
include/linux/ipc.h | 2 -
include/linux/lsm_hooks.h | 32 ++++++++--------
include/linux/msg.h | 18 ---------
include/linux/security.h | 67 ++++++++++++++++-----------------
include/linux/sem.h | 40 +-------------------
include/linux/shm.h | 23 ------------
ipc/msg.c | 54 ++++++++++++++++++---------
ipc/sem.c | 73 ++++++++++++++++++++++++++----------
ipc/shm.c | 60 +++++++++++++++++++++---------
ipc/util.c | 9 +++++
ipc/util.h | 12 ++++++
security/security.c | 32 ++++++++--------
security/selinux/hooks.c | 92 +++++++++++++++++++++++-----------------------
security/smack/smack_lsm.c | 68 +++++++++++++++++-----------------
14 files changed, 297 insertions(+), 285 deletions(-)
Eric
All of the implementations of security hooks that take msg_queue only
access q_perm the struct kern_ipc_perm member. This means the
dependencies of the msg_queue security hooks can be simplified by
passing the kern_ipc_perm member of msg_queue.
Making this change will allow struct msg_queue to become private to
ipc/msg.c.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/lsm_hooks.h | 12 ++++++------
include/linux/security.h | 25 ++++++++++++-------------
ipc/msg.c | 18 ++++++++----------
security/security.c | 12 ++++++------
security/selinux/hooks.c | 36 ++++++++++++++++++------------------
security/smack/smack_lsm.c | 24 ++++++++++++------------
6 files changed, 62 insertions(+), 65 deletions(-)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index cac7a8082c43..bde167fa2c51 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1575,13 +1575,13 @@ union security_list_options {
int (*msg_msg_alloc_security)(struct msg_msg *msg);
void (*msg_msg_free_security)(struct msg_msg *msg);
- int (*msg_queue_alloc_security)(struct msg_queue *msq);
- void (*msg_queue_free_security)(struct msg_queue *msq);
- int (*msg_queue_associate)(struct msg_queue *msq, int msqflg);
- int (*msg_queue_msgctl)(struct msg_queue *msq, int cmd);
- int (*msg_queue_msgsnd)(struct msg_queue *msq, struct msg_msg *msg,
+ int (*msg_queue_alloc_security)(struct kern_ipc_perm *msq);
+ void (*msg_queue_free_security)(struct kern_ipc_perm *msq);
+ int (*msg_queue_associate)(struct kern_ipc_perm *msq, int msqflg);
+ int (*msg_queue_msgctl)(struct kern_ipc_perm *msq, int cmd);
+ int (*msg_queue_msgsnd)(struct kern_ipc_perm *msq, struct msg_msg *msg,
int msqflg);
- int (*msg_queue_msgrcv)(struct msg_queue *msq, struct msg_msg *msg,
+ int (*msg_queue_msgrcv)(struct kern_ipc_perm *msq, struct msg_msg *msg,
struct task_struct *target, long type,
int mode);
diff --git a/include/linux/security.h b/include/linux/security.h
index f390755808ea..128e1e4a5346 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -50,7 +50,6 @@ struct iattr;
struct fown_struct;
struct file_operations;
struct msg_msg;
-struct msg_queue;
struct xattr;
struct xfrm_sec_ctx;
struct mm_struct;
@@ -353,13 +352,13 @@ int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag);
void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid);
int security_msg_msg_alloc(struct msg_msg *msg);
void security_msg_msg_free(struct msg_msg *msg);
-int security_msg_queue_alloc(struct msg_queue *msq);
-void security_msg_queue_free(struct msg_queue *msq);
-int security_msg_queue_associate(struct msg_queue *msq, int msqflg);
-int security_msg_queue_msgctl(struct msg_queue *msq, int cmd);
-int security_msg_queue_msgsnd(struct msg_queue *msq,
+int security_msg_queue_alloc(struct kern_ipc_perm *msq);
+void security_msg_queue_free(struct kern_ipc_perm *msq);
+int security_msg_queue_associate(struct kern_ipc_perm *msq, int msqflg);
+int security_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd);
+int security_msg_queue_msgsnd(struct kern_ipc_perm *msq,
struct msg_msg *msg, int msqflg);
-int security_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
+int security_msg_queue_msgrcv(struct kern_ipc_perm *msq, struct msg_msg *msg,
struct task_struct *target, long type, int mode);
int security_shm_alloc(struct kern_ipc_perm *shp);
void security_shm_free(struct kern_ipc_perm *shp);
@@ -1043,32 +1042,32 @@ static inline int security_msg_msg_alloc(struct msg_msg *msg)
static inline void security_msg_msg_free(struct msg_msg *msg)
{ }
-static inline int security_msg_queue_alloc(struct msg_queue *msq)
+static inline int security_msg_queue_alloc(struct kern_ipc_perm *msq)
{
return 0;
}
-static inline void security_msg_queue_free(struct msg_queue *msq)
+static inline void security_msg_queue_free(struct kern_ipc_perm *msq)
{ }
-static inline int security_msg_queue_associate(struct msg_queue *msq,
+static inline int security_msg_queue_associate(struct kern_ipc_perm *msq,
int msqflg)
{
return 0;
}
-static inline int security_msg_queue_msgctl(struct msg_queue *msq, int cmd)
+static inline int security_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
{
return 0;
}
-static inline int security_msg_queue_msgsnd(struct msg_queue *msq,
+static inline int security_msg_queue_msgsnd(struct kern_ipc_perm *msq,
struct msg_msg *msg, int msqflg)
{
return 0;
}
-static inline int security_msg_queue_msgrcv(struct msg_queue *msq,
+static inline int security_msg_queue_msgrcv(struct kern_ipc_perm *msq,
struct msg_msg *msg,
struct task_struct *target,
long type, int mode)
diff --git a/ipc/msg.c b/ipc/msg.c
index 0dcc6699dc53..cdfab0825fce 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -101,7 +101,7 @@ static void msg_rcu_free(struct rcu_head *head)
struct kern_ipc_perm *p = container_of(head, struct kern_ipc_perm, rcu);
struct msg_queue *msq = container_of(p, struct msg_queue, q_perm);
- security_msg_queue_free(msq);
+ security_msg_queue_free(&msq->q_perm);
kvfree(msq);
}
@@ -127,7 +127,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
msq->q_perm.key = key;
msq->q_perm.security = NULL;
- retval = security_msg_queue_alloc(msq);
+ retval = security_msg_queue_alloc(&msq->q_perm);
if (retval) {
kvfree(msq);
return retval;
@@ -258,9 +258,7 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
*/
static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
{
- struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
-
- return security_msg_queue_associate(msq, msgflg);
+ return security_msg_queue_associate(ipcp, msgflg);
}
SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
@@ -380,7 +378,7 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
msq = container_of(ipcp, struct msg_queue, q_perm);
- err = security_msg_queue_msgctl(msq, cmd);
+ err = security_msg_queue_msgctl(&msq->q_perm, cmd);
if (err)
goto out_unlock1;
@@ -502,7 +500,7 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
if (ipcperms(ns, &msq->q_perm, S_IRUGO))
goto out_unlock;
- err = security_msg_queue_msgctl(msq, cmd);
+ err = security_msg_queue_msgctl(&msq->q_perm, cmd);
if (err)
goto out_unlock;
@@ -718,7 +716,7 @@ static inline int pipelined_send(struct msg_queue *msq, struct msg_msg *msg,
list_for_each_entry_safe(msr, t, &msq->q_receivers, r_list) {
if (testmsg(msg, msr->r_msgtype, msr->r_mode) &&
- !security_msg_queue_msgrcv(msq, msg, msr->r_tsk,
+ !security_msg_queue_msgrcv(&msq->q_perm, msg, msr->r_tsk,
msr->r_msgtype, msr->r_mode)) {
list_del(&msr->r_list);
@@ -784,7 +782,7 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
goto out_unlock0;
}
- err = security_msg_queue_msgsnd(msq, msg, msgflg);
+ err = security_msg_queue_msgsnd(&msq->q_perm, msg, msgflg);
if (err)
goto out_unlock0;
@@ -960,7 +958,7 @@ static struct msg_msg *find_msg(struct msg_queue *msq, long *msgtyp, int mode)
list_for_each_entry(msg, &msq->q_messages, m_list) {
if (testmsg(msg, *msgtyp, mode) &&
- !security_msg_queue_msgrcv(msq, msg, current,
+ !security_msg_queue_msgrcv(&msq->q_perm, msg, current,
*msgtyp, mode)) {
if (mode == SEARCH_LESSEQUAL && msg->m_type != 1) {
*msgtyp = msg->m_type - 1;
diff --git a/security/security.c b/security/security.c
index 77b69bd6f234..02d734e69955 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1163,33 +1163,33 @@ void security_msg_msg_free(struct msg_msg *msg)
call_void_hook(msg_msg_free_security, msg);
}
-int security_msg_queue_alloc(struct msg_queue *msq)
+int security_msg_queue_alloc(struct kern_ipc_perm *msq)
{
return call_int_hook(msg_queue_alloc_security, 0, msq);
}
-void security_msg_queue_free(struct msg_queue *msq)
+void security_msg_queue_free(struct kern_ipc_perm *msq)
{
call_void_hook(msg_queue_free_security, msq);
}
-int security_msg_queue_associate(struct msg_queue *msq, int msqflg)
+int security_msg_queue_associate(struct kern_ipc_perm *msq, int msqflg)
{
return call_int_hook(msg_queue_associate, 0, msq, msqflg);
}
-int security_msg_queue_msgctl(struct msg_queue *msq, int cmd)
+int security_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
{
return call_int_hook(msg_queue_msgctl, 0, msq, cmd);
}
-int security_msg_queue_msgsnd(struct msg_queue *msq,
+int security_msg_queue_msgsnd(struct kern_ipc_perm *msq,
struct msg_msg *msg, int msqflg)
{
return call_int_hook(msg_queue_msgsnd, 0, msq, msg, msqflg);
}
-int security_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
+int security_msg_queue_msgrcv(struct kern_ipc_perm *msq, struct msg_msg *msg,
struct task_struct *target, long type, int mode)
{
return call_int_hook(msg_queue_msgrcv, 0, msq, msg, target, type, mode);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 14f9e6c08273..925e546b5a87 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -5532,52 +5532,52 @@ static void selinux_msg_msg_free_security(struct msg_msg *msg)
}
/* message queue security operations */
-static int selinux_msg_queue_alloc_security(struct msg_queue *msq)
+static int selinux_msg_queue_alloc_security(struct kern_ipc_perm *msq)
{
struct ipc_security_struct *isec;
struct common_audit_data ad;
u32 sid = current_sid();
int rc;
- rc = ipc_alloc_security(&msq->q_perm, SECCLASS_MSGQ);
+ rc = ipc_alloc_security(msq, SECCLASS_MSGQ);
if (rc)
return rc;
- isec = msq->q_perm.security;
+ isec = msq->security;
ad.type = LSM_AUDIT_DATA_IPC;
- ad.u.ipc_id = msq->q_perm.key;
+ ad.u.ipc_id = msq->key;
rc = avc_has_perm(sid, isec->sid, SECCLASS_MSGQ,
MSGQ__CREATE, &ad);
if (rc) {
- ipc_free_security(&msq->q_perm);
+ ipc_free_security(msq);
return rc;
}
return 0;
}
-static void selinux_msg_queue_free_security(struct msg_queue *msq)
+static void selinux_msg_queue_free_security(struct kern_ipc_perm *msq)
{
- ipc_free_security(&msq->q_perm);
+ ipc_free_security(msq);
}
-static int selinux_msg_queue_associate(struct msg_queue *msq, int msqflg)
+static int selinux_msg_queue_associate(struct kern_ipc_perm *msq, int msqflg)
{
struct ipc_security_struct *isec;
struct common_audit_data ad;
u32 sid = current_sid();
- isec = msq->q_perm.security;
+ isec = msq->security;
ad.type = LSM_AUDIT_DATA_IPC;
- ad.u.ipc_id = msq->q_perm.key;
+ ad.u.ipc_id = msq->key;
return avc_has_perm(sid, isec->sid, SECCLASS_MSGQ,
MSGQ__ASSOCIATE, &ad);
}
-static int selinux_msg_queue_msgctl(struct msg_queue *msq, int cmd)
+static int selinux_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
{
int err;
int perms;
@@ -5602,11 +5602,11 @@ static int selinux_msg_queue_msgctl(struct msg_queue *msq, int cmd)
return 0;
}
- err = ipc_has_perm(&msq->q_perm, perms);
+ err = ipc_has_perm(msq, perms);
return err;
}
-static int selinux_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg, int msqflg)
+static int selinux_msg_queue_msgsnd(struct kern_ipc_perm *msq, struct msg_msg *msg, int msqflg)
{
struct ipc_security_struct *isec;
struct msg_security_struct *msec;
@@ -5614,7 +5614,7 @@ static int selinux_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
u32 sid = current_sid();
int rc;
- isec = msq->q_perm.security;
+ isec = msq->security;
msec = msg->security;
/*
@@ -5632,7 +5632,7 @@ static int selinux_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
}
ad.type = LSM_AUDIT_DATA_IPC;
- ad.u.ipc_id = msq->q_perm.key;
+ ad.u.ipc_id = msq->key;
/* Can this process write to the queue? */
rc = avc_has_perm(sid, isec->sid, SECCLASS_MSGQ,
@@ -5649,7 +5649,7 @@ static int selinux_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
return rc;
}
-static int selinux_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
+static int selinux_msg_queue_msgrcv(struct kern_ipc_perm *msq, struct msg_msg *msg,
struct task_struct *target,
long type, int mode)
{
@@ -5659,11 +5659,11 @@ static int selinux_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
u32 sid = task_sid(target);
int rc;
- isec = msq->q_perm.security;
+ isec = msq->security;
msec = msg->security;
ad.type = LSM_AUDIT_DATA_IPC;
- ad.u.ipc_id = msq->q_perm.key;
+ ad.u.ipc_id = msq->key;
rc = avc_has_perm(sid, isec->sid,
SECCLASS_MSGQ, MSGQ__READ, &ad);
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a3398c7f32c9..d960c2ea8d79 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -3210,9 +3210,9 @@ static int smack_sem_semop(struct kern_ipc_perm *sma, struct sembuf *sops,
*
* Returns 0
*/
-static int smack_msg_queue_alloc_security(struct msg_queue *msq)
+static int smack_msg_queue_alloc_security(struct kern_ipc_perm *msq)
{
- struct kern_ipc_perm *kisp = &msq->q_perm;
+ struct kern_ipc_perm *kisp = msq;
struct smack_known *skp = smk_of_current();
kisp->security = skp;
@@ -3225,9 +3225,9 @@ static int smack_msg_queue_alloc_security(struct msg_queue *msq)
*
* Clears the blob pointer
*/
-static void smack_msg_queue_free_security(struct msg_queue *msq)
+static void smack_msg_queue_free_security(struct kern_ipc_perm *msq)
{
- struct kern_ipc_perm *kisp = &msq->q_perm;
+ struct kern_ipc_perm *kisp = msq;
kisp->security = NULL;
}
@@ -3238,9 +3238,9 @@ static void smack_msg_queue_free_security(struct msg_queue *msq)
*
* Returns a pointer to the smack label entry
*/
-static struct smack_known *smack_of_msq(struct msg_queue *msq)
+static struct smack_known *smack_of_msq(struct kern_ipc_perm *msq)
{
- return (struct smack_known *)msq->q_perm.security;
+ return (struct smack_known *)msq->security;
}
/**
@@ -3250,7 +3250,7 @@ static struct smack_known *smack_of_msq(struct msg_queue *msq)
*
* return 0 if current has access, error otherwise
*/
-static int smk_curacc_msq(struct msg_queue *msq, int access)
+static int smk_curacc_msq(struct kern_ipc_perm *msq, int access)
{
struct smack_known *msp = smack_of_msq(msq);
struct smk_audit_info ad;
@@ -3258,7 +3258,7 @@ static int smk_curacc_msq(struct msg_queue *msq, int access)
#ifdef CONFIG_AUDIT
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
- ad.a.u.ipc_id = msq->q_perm.id;
+ ad.a.u.ipc_id = msq->id;
#endif
rc = smk_curacc(msp, access, &ad);
rc = smk_bu_current("msq", msp, access, rc);
@@ -3272,7 +3272,7 @@ static int smk_curacc_msq(struct msg_queue *msq, int access)
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_msg_queue_associate(struct msg_queue *msq, int msqflg)
+static int smack_msg_queue_associate(struct kern_ipc_perm *msq, int msqflg)
{
int may;
@@ -3287,7 +3287,7 @@ static int smack_msg_queue_associate(struct msg_queue *msq, int msqflg)
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_msg_queue_msgctl(struct msg_queue *msq, int cmd)
+static int smack_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
{
int may;
@@ -3321,7 +3321,7 @@ static int smack_msg_queue_msgctl(struct msg_queue *msq, int cmd)
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
+static int smack_msg_queue_msgsnd(struct kern_ipc_perm *msq, struct msg_msg *msg,
int msqflg)
{
int may;
@@ -3340,7 +3340,7 @@ static int smack_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
*
* Returns 0 if current has read and write access, error code otherwise
*/
-static int smack_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
+static int smack_msg_queue_msgrcv(struct kern_ipc_perm *msq, struct msg_msg *msg,
struct task_struct *target, long type, int mode)
{
return smk_curacc_msq(msq, MAY_READWRITE);
--
2.14.1
Today shm_cpid and shm_lpid are remembered in the pid namespace of the
creator and the processes that last touched a sysvipc shared memory
segment. If you have processes in multiple pid namespaces that
is just wrong, and I don't know how this has been over-looked for
so long.
As only creation and shared memory attach and shared memory detach
update the pids I do not expect there to be a repeat of the issues
when struct pid was attached to each af_unix skb, which in some
notable cases cut the performance in half. The problem was threads of
the same process updating same struct pid from different cpus causing
the cache line to be highly contended and bounce between cpus.
As creation, attach, and detach are expected to be rare operations for
sysvipc shared memory segments I do not expect that kind of cache line
ping pong to cause probems. In addition because the pid is at a fixed
location in the structure instead of being dynamic on a skb, the
reference count of the pid does not need to be updated on each
operation if the pid is the same. This ability to simply skip the pid
reference count changes if the pid is unchanging further reduces the
likelihood of the a cache line holding a pid reference count
ping-ponging between cpus.
Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
ipc/shm.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/ipc/shm.c b/ipc/shm.c
index 0565669ebe5c..932b7e411c6c 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -57,8 +57,8 @@ struct shmid_kernel /* private to the kernel */
time64_t shm_atim;
time64_t shm_dtim;
time64_t shm_ctim;
- pid_t shm_cprid;
- pid_t shm_lprid;
+ struct pid *shm_cprid;
+ struct pid *shm_lprid;
struct user_struct *mlock_user;
/* The task created the shm object. NULL if the task is dead. */
@@ -226,7 +226,7 @@ static int __shm_open(struct vm_area_struct *vma)
return PTR_ERR(shp);
shp->shm_atim = ktime_get_real_seconds();
- shp->shm_lprid = task_tgid_vnr(current);
+ ipc_update_pid(&shp->shm_lprid, task_tgid(current));
shp->shm_nattch++;
shm_unlock(shp);
return 0;
@@ -267,6 +267,8 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
user_shm_unlock(i_size_read(file_inode(shm_file)),
shp->mlock_user);
fput(shm_file);
+ ipc_update_pid(&shp->shm_cprid, NULL);
+ ipc_update_pid(&shp->shm_lprid, NULL);
ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
}
@@ -311,7 +313,7 @@ static void shm_close(struct vm_area_struct *vma)
if (WARN_ON_ONCE(IS_ERR(shp)))
goto done; /* no-op */
- shp->shm_lprid = task_tgid_vnr(current);
+ ipc_update_pid(&shp->shm_lprid, task_tgid(current));
shp->shm_dtim = ktime_get_real_seconds();
shp->shm_nattch--;
if (shm_may_destroy(ns, shp))
@@ -614,8 +616,8 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
if (IS_ERR(file))
goto no_file;
- shp->shm_cprid = task_tgid_vnr(current);
- shp->shm_lprid = 0;
+ shp->shm_cprid = get_pid(task_tgid(current));
+ shp->shm_lprid = NULL;
shp->shm_atim = shp->shm_dtim = 0;
shp->shm_ctim = ktime_get_real_seconds();
shp->shm_segsz = size;
@@ -648,6 +650,8 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
user_shm_unlock(size, shp->mlock_user);
fput(file);
no_file:
+ ipc_update_pid(&shp->shm_cprid, NULL);
+ ipc_update_pid(&shp->shm_lprid, NULL);
call_rcu(&shp->shm_perm.rcu, shm_rcu_free);
return error;
}
@@ -970,8 +974,8 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
tbuf->shm_atime = shp->shm_atim;
tbuf->shm_dtime = shp->shm_dtim;
tbuf->shm_ctime = shp->shm_ctim;
- tbuf->shm_cpid = shp->shm_cprid;
- tbuf->shm_lpid = shp->shm_lprid;
+ tbuf->shm_cpid = pid_vnr(shp->shm_cprid);
+ tbuf->shm_lpid = pid_vnr(shp->shm_lprid);
tbuf->shm_nattch = shp->shm_nattch;
ipc_unlock_object(&shp->shm_perm);
@@ -1605,6 +1609,7 @@ SYSCALL_DEFINE1(shmdt, char __user *, shmaddr)
#ifdef CONFIG_PROC_FS
static int sysvipc_shm_proc_show(struct seq_file *s, void *it)
{
+ struct pid_namespace *pid_ns = ipc_seq_pid_ns(s);
struct user_namespace *user_ns = seq_user_ns(s);
struct kern_ipc_perm *ipcp = it;
struct shmid_kernel *shp;
@@ -1627,8 +1632,8 @@ static int sysvipc_shm_proc_show(struct seq_file *s, void *it)
shp->shm_perm.id,
shp->shm_perm.mode,
shp->shm_segsz,
- shp->shm_cprid,
- shp->shm_lprid,
+ pid_nr_ns(shp->shm_cprid, pid_ns),
+ pid_nr_ns(shp->shm_lprid, pid_ns),
shp->shm_nattch,
from_kuid_munged(user_ns, shp->shm_perm.uid),
from_kgid_munged(user_ns, shp->shm_perm.gid),
--
2.14.1
The definition IPCMNI is only used in ipc/util.h and ipc/util.c. So
there is no reason to keep it in a header file that the whole kernel
can see. Move it into util.h to simplify future maintenance.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/ipc.h | 2 --
ipc/util.h | 1 +
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/linux/ipc.h b/include/linux/ipc.h
index 821b2f260992..6cc2df7f7ac9 100644
--- a/include/linux/ipc.h
+++ b/include/linux/ipc.h
@@ -8,8 +8,6 @@
#include <uapi/linux/ipc.h>
#include <linux/refcount.h>
-#define IPCMNI 32768 /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
-
/* used by in-kernel data structures */
struct kern_ipc_perm {
spinlock_t lock;
diff --git a/ipc/util.h b/ipc/util.h
index 89b8ec176fc4..959c10eb9cc1 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -15,6 +15,7 @@
#include <linux/err.h>
#include <linux/ipc_namespace.h>
+#define IPCMNI 32768 /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
#define SEQ_MULTIPLIER (IPCMNI)
int sem_init(void);
--
2.14.1
Capture the pid namespace when /proc/sysvipc/msg /proc/sysvipc/shm
and /proc/sysvipc/sem are opened, and make it available through
the new helper ipc_seq_pid_ns.
This makes it possible to report the pids in these files in the
pid namespace of the opener of the files.
Implement ipc_update_pid. A simple impline helper that will only update
a struct pid pointer if the new value does not equal the old value. This
removes the need for wordy code sequences like:
old = object->pid;
object->pid = new;
put_pid(old);
and
old = object->pid;
if (old != new) {
object->pid = new;
put_pid(old);
}
Allowing the following to be written instead:
ipc_update_pid(&object->pid, new);
Which is easier to read and ensures that the pid reference count is
not touched the old and the new values are the same. Not touching
the reference count in this case is important to help avoid issues
like af_unix experienced, where multiple threads of the same
process managed to bounce the struct pid between cpu cache lines,
but updating the pids reference count.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
ipc/util.c | 9 +++++++++
ipc/util.h | 11 +++++++++++
2 files changed, 20 insertions(+)
diff --git a/ipc/util.c b/ipc/util.c
index 4ed5a17dd06f..3783b7991cc7 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -747,9 +747,16 @@ int ipc_parse_version(int *cmd)
#ifdef CONFIG_PROC_FS
struct ipc_proc_iter {
struct ipc_namespace *ns;
+ struct pid_namespace *pid_ns;
struct ipc_proc_iface *iface;
};
+struct pid_namespace *ipc_seq_pid_ns(struct seq_file *s)
+{
+ struct ipc_proc_iter *iter = s->private;
+ return iter->pid_ns;
+}
+
/*
* This routine locks the ipc structure found at least at position pos.
*/
@@ -872,6 +879,7 @@ static int sysvipc_proc_open(struct inode *inode, struct file *file)
iter->iface = PDE_DATA(inode);
iter->ns = get_ipc_ns(current->nsproxy->ipc_ns);
+ iter->pid_ns = get_pid_ns(task_active_pid_ns(current));
return 0;
}
@@ -881,6 +889,7 @@ static int sysvipc_proc_release(struct inode *inode, struct file *file)
struct seq_file *seq = file->private_data;
struct ipc_proc_iter *iter = seq->private;
put_ipc_ns(iter->ns);
+ put_pid_ns(iter->pid_ns);
return seq_release_private(inode, file);
}
diff --git a/ipc/util.h b/ipc/util.h
index 959c10eb9cc1..e39ed9705f99 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -23,6 +23,7 @@ int msg_init(void);
void shm_init(void);
struct ipc_namespace;
+struct pid_namespace;
#ifdef CONFIG_POSIX_MQUEUE
extern void mq_clear_sbinfo(struct ipc_namespace *ns);
@@ -86,6 +87,7 @@ int ipc_init_ids(struct ipc_ids *);
#ifdef CONFIG_PROC_FS
void __init ipc_init_proc_interface(const char *path, const char *header,
int ids, int (*show)(struct seq_file *, void *));
+struct pid_namespace *ipc_seq_pid_ns(struct seq_file *);
#else
#define ipc_init_proc_interface(path, header, ids, show) do {} while (0)
#endif
@@ -150,6 +152,15 @@ struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
struct ipc_ids *ids, int id, int cmd,
struct ipc64_perm *perm, int extra_perm);
+static inline void ipc_update_pid(struct pid **pos, struct pid *pid)
+{
+ struct pid *old = *pos;
+ if (old != pid) {
+ *pos = get_pid(pid);
+ put_pid(old);
+ }
+}
+
#ifndef CONFIG_ARCH_WANT_IPC_PARSE_VERSION
/* On IA-64, we always use the "64-bit version" of the IPC structures. */
# define ipc_parse_version(cmd) IPC_64
--
2.14.1
All of the users are now in ipc/shm.c so make the definition local to
that file to make code maintenance easier. AKA to prevent rebuilding
the entire kernel when struct shmid_kernel changes.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/shm.h | 23 -----------------------
ipc/shm.c | 22 ++++++++++++++++++++++
2 files changed, 22 insertions(+), 23 deletions(-)
diff --git a/include/linux/shm.h b/include/linux/shm.h
index 2bbafacfbfc9..c7cb18405ad7 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -3,32 +3,9 @@
#define _LINUX_SHM_H_
#include <linux/list.h>
-#include <asm/page.h>
#include <uapi/linux/shm.h>
#include <asm/shmparam.h>
-struct shmid_kernel /* private to the kernel */
-{
- struct kern_ipc_perm shm_perm;
- struct file *shm_file;
- unsigned long shm_nattch;
- unsigned long shm_segsz;
- time64_t shm_atim;
- time64_t shm_dtim;
- time64_t shm_ctim;
- pid_t shm_cprid;
- pid_t shm_lprid;
- struct user_struct *mlock_user;
-
- /* The task created the shm object. NULL if the task is dead. */
- struct task_struct *shm_creator;
- struct list_head shm_clist; /* list by creator */
-} __randomize_layout;
-
-/* shm_mode upper byte flags */
-#define SHM_DEST 01000 /* segment will be destroyed on last detach */
-#define SHM_LOCKED 02000 /* segment will not be swapped */
-
#ifdef CONFIG_SYSVIPC
struct sysv_shm {
struct list_head shm_clist;
diff --git a/ipc/shm.c b/ipc/shm.c
index 387a786e7be1..0565669ebe5c 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -48,6 +48,28 @@
#include "util.h"
+struct shmid_kernel /* private to the kernel */
+{
+ struct kern_ipc_perm shm_perm;
+ struct file *shm_file;
+ unsigned long shm_nattch;
+ unsigned long shm_segsz;
+ time64_t shm_atim;
+ time64_t shm_dtim;
+ time64_t shm_ctim;
+ pid_t shm_cprid;
+ pid_t shm_lprid;
+ struct user_struct *mlock_user;
+
+ /* The task created the shm object. NULL if the task is dead. */
+ struct task_struct *shm_creator;
+ struct list_head shm_clist; /* list by creator */
+} __randomize_layout;
+
+/* shm_mode upper byte flags */
+#define SHM_DEST 01000 /* segment will be destroyed on last detach */
+#define SHM_LOCKED 02000 /* segment will not be swapped */
+
struct shm_file_data {
int id;
struct ipc_namespace *ns;
--
2.14.1
All of the users are now in ipc/msg.c so make the definition local to
that file to make code maintenance easier. AKA to prevent rebuilding
the entire kernel when struct msg_queue changes.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/msg.h | 18 ------------------
ipc/msg.c | 17 +++++++++++++++++
2 files changed, 17 insertions(+), 18 deletions(-)
diff --git a/include/linux/msg.h b/include/linux/msg.h
index 0a7eefeee0d1..9a972a296b95 100644
--- a/include/linux/msg.h
+++ b/include/linux/msg.h
@@ -3,7 +3,6 @@
#define _LINUX_MSG_H
#include <linux/list.h>
-#include <linux/time64.h>
#include <uapi/linux/msg.h>
/* one msg_msg structure for each message */
@@ -16,21 +15,4 @@ struct msg_msg {
/* the actual message follows immediately */
};
-/* one msq_queue structure for each present queue on the system */
-struct msg_queue {
- struct kern_ipc_perm q_perm;
- time64_t q_stime; /* last msgsnd time */
- time64_t q_rtime; /* last msgrcv time */
- time64_t q_ctime; /* last change time */
- unsigned long q_cbytes; /* current number of bytes on queue */
- unsigned long q_qnum; /* number of messages in queue */
- unsigned long q_qbytes; /* max number of bytes on queue */
- pid_t q_lspid; /* pid of last msgsnd */
- pid_t q_lrpid; /* last receive pid */
-
- struct list_head q_messages;
- struct list_head q_receivers;
- struct list_head q_senders;
-} __randomize_layout;
-
#endif /* _LINUX_MSG_H */
diff --git a/ipc/msg.c b/ipc/msg.c
index cdfab0825fce..af5a963306c4 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -43,6 +43,23 @@
#include <linux/uaccess.h>
#include "util.h"
+/* one msq_queue structure for each present queue on the system */
+struct msg_queue {
+ struct kern_ipc_perm q_perm;
+ time64_t q_stime; /* last msgsnd time */
+ time64_t q_rtime; /* last msgrcv time */
+ time64_t q_ctime; /* last change time */
+ unsigned long q_cbytes; /* current number of bytes on queue */
+ unsigned long q_qnum; /* number of messages in queue */
+ unsigned long q_qbytes; /* max number of bytes on queue */
+ pid_t q_lspid; /* pid of last msgsnd */
+ pid_t q_lrpid; /* last receive pid */
+
+ struct list_head q_messages;
+ struct list_head q_receivers;
+ struct list_head q_senders;
+} __randomize_layout;
+
/* one msg_receiver structure for each sleeping receiver */
struct msg_receiver {
struct list_head r_list;
--
2.14.1
Today msg_lspid and msg_lrpid are remembered in the pid namespace of
the creator and the processes that last send or received a sysvipc
message. If you have processes in multiple pid namespaces that is
just wrong. The process ids reported will not make the least bit of
sense.
This fix is slightly more susceptible to a performance problem than
the related fix for System V shared memory. By definition the pids
are updated by msgsnd and msgrcv, the fast path of System V message
queues. The only concern over the previous implementation is the
incrementing and decrementing of the pid reference count. As that is
the only difference and multiple updates by of the task_tgid by
threads in the same process have been shown in af_unix sockets to
create a cache line ping-pong between cpus of the same processor.
In this case I don't expect cache lines holding pid reference counts
to ping pong between cpus. As senders and receivers update different
pids there is a natural separation there. Further if multiple threads
of the same process either send or receive messages the pid will be
updated to the same value and ipc_update_pid will avoid the reference
count update.
Which means in the common case I expect msg_lspid and msg_lrpid to
remain constant, and reference counts not to be updated when messages
are sent.
In rare cases it may be possible to trigger the issue which was
observed for af_unix sockets, but it will require multiple processes
with multiple threads to be either sending or receiving messages. It
just does not feel likely that anyone would do that in practice.
This change updates msgctl(..., IPC_STAT, ...) to return msg_lspid and
msg_lrpid in the pid namespace of the process calling stat.
This change also updates cat /proc/sysvipc/msg to return print msg_lspid
and msg_lrpid in the pid namespace of the process that opened the proc
file.
Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
ipc/msg.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/ipc/msg.c b/ipc/msg.c
index af5a963306c4..825ad585a6ff 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -52,8 +52,8 @@ struct msg_queue {
unsigned long q_cbytes; /* current number of bytes on queue */
unsigned long q_qnum; /* number of messages in queue */
unsigned long q_qbytes; /* max number of bytes on queue */
- pid_t q_lspid; /* pid of last msgsnd */
- pid_t q_lrpid; /* last receive pid */
+ struct pid *q_lspid; /* pid of last msgsnd */
+ struct pid *q_lrpid; /* last receive pid */
struct list_head q_messages;
struct list_head q_receivers;
@@ -154,7 +154,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
msq->q_ctime = ktime_get_real_seconds();
msq->q_cbytes = msq->q_qnum = 0;
msq->q_qbytes = ns->msg_ctlmnb;
- msq->q_lspid = msq->q_lrpid = 0;
+ msq->q_lspid = msq->q_lrpid = NULL;
INIT_LIST_HEAD(&msq->q_messages);
INIT_LIST_HEAD(&msq->q_receivers);
INIT_LIST_HEAD(&msq->q_senders);
@@ -267,6 +267,8 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
free_msg(msg);
}
atomic_sub(msq->q_cbytes, &ns->msg_bytes);
+ ipc_update_pid(&msq->q_lspid, NULL);
+ ipc_update_pid(&msq->q_lrpid, NULL);
ipc_rcu_putref(&msq->q_perm, msg_rcu_free);
}
@@ -536,8 +538,8 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
p->msg_cbytes = msq->q_cbytes;
p->msg_qnum = msq->q_qnum;
p->msg_qbytes = msq->q_qbytes;
- p->msg_lspid = msq->q_lspid;
- p->msg_lrpid = msq->q_lrpid;
+ p->msg_lspid = pid_vnr(msq->q_lspid);
+ p->msg_lrpid = pid_vnr(msq->q_lrpid);
ipc_unlock_object(&msq->q_perm);
rcu_read_unlock();
@@ -741,7 +743,7 @@ static inline int pipelined_send(struct msg_queue *msq, struct msg_msg *msg,
wake_q_add(wake_q, msr->r_tsk);
WRITE_ONCE(msr->r_msg, ERR_PTR(-E2BIG));
} else {
- msq->q_lrpid = task_pid_vnr(msr->r_tsk);
+ ipc_update_pid(&msq->q_lrpid, task_pid(msr->r_tsk));
msq->q_rtime = get_seconds();
wake_q_add(wake_q, msr->r_tsk);
@@ -842,7 +844,7 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
}
- msq->q_lspid = task_tgid_vnr(current);
+ ipc_update_pid(&msq->q_lspid, task_tgid(current));
msq->q_stime = get_seconds();
if (!pipelined_send(msq, msg, &wake_q)) {
@@ -1060,7 +1062,7 @@ static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, in
list_del(&msg->m_list);
msq->q_qnum--;
msq->q_rtime = get_seconds();
- msq->q_lrpid = task_tgid_vnr(current);
+ ipc_update_pid(&msq->q_lrpid, task_tgid(current));
msq->q_cbytes -= msg->m_ts;
atomic_sub(msg->m_ts, &ns->msg_bytes);
atomic_dec(&ns->msg_hdrs);
@@ -1202,6 +1204,7 @@ void msg_exit_ns(struct ipc_namespace *ns)
#ifdef CONFIG_PROC_FS
static int sysvipc_msg_proc_show(struct seq_file *s, void *it)
{
+ struct pid_namespace *pid_ns = ipc_seq_pid_ns(s);
struct user_namespace *user_ns = seq_user_ns(s);
struct kern_ipc_perm *ipcp = it;
struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
@@ -1213,8 +1216,8 @@ static int sysvipc_msg_proc_show(struct seq_file *s, void *it)
msq->q_perm.mode,
msq->q_cbytes,
msq->q_qnum,
- msq->q_lspid,
- msq->q_lrpid,
+ pid_nr_ns(msq->q_lspid, pid_ns),
+ pid_nr_ns(msq->q_lrpid, pid_ns),
from_kuid_munged(user_ns, msq->q_perm.uid),
from_kgid_munged(user_ns, msq->q_perm.gid),
from_kuid_munged(user_ns, msq->q_perm.cuid),
--
2.14.1
All of the users are now in ipc/sem.c so make the definitions
local to that file to make code maintenance easier. AKA
to prevent rebuilding the entire kernel when one of these
files is changed.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sem.h | 40 +---------------------------------------
ipc/sem.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+), 39 deletions(-)
diff --git a/include/linux/sem.h b/include/linux/sem.h
index 9badd322dcee..5608a500c43e 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -2,48 +2,10 @@
#ifndef _LINUX_SEM_H
#define _LINUX_SEM_H
-#include <linux/atomic.h>
-#include <linux/rcupdate.h>
-#include <linux/cache.h>
-#include <linux/time64.h>
#include <uapi/linux/sem.h>
struct task_struct;
-
-/* One semaphore structure for each semaphore in the system. */
-struct sem {
- int semval; /* current value */
- /*
- * PID of the process that last modified the semaphore. For
- * Linux, specifically these are:
- * - semop
- * - semctl, via SETVAL and SETALL.
- * - at task exit when performing undo adjustments (see exit_sem).
- */
- int sempid;
- spinlock_t lock; /* spinlock for fine-grained semtimedop */
- struct list_head pending_alter; /* pending single-sop operations */
- /* that alter the semaphore */
- struct list_head pending_const; /* pending single-sop operations */
- /* that do not alter the semaphore*/
- time_t sem_otime; /* candidate for sem_otime */
-} ____cacheline_aligned_in_smp;
-
-/* One sem_array data structure for each set of semaphores in the system. */
-struct sem_array {
- struct kern_ipc_perm sem_perm; /* permissions .. see ipc.h */
- time64_t sem_ctime; /* create/last semctl() time */
- struct list_head pending_alter; /* pending operations */
- /* that alter the array */
- struct list_head pending_const; /* pending complex operations */
- /* that do not alter semvals */
- struct list_head list_id; /* undo requests on this array */
- int sem_nsems; /* no. of semaphores in array */
- int complex_count; /* pending complex operations */
- unsigned int use_global_lock;/* >0: global lock required */
-
- struct sem sems[];
-} __randomize_layout;
+struct sem_undo_list;
#ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index 01f5c63670ae..d661c491b0a5 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -88,6 +88,40 @@
#include <linux/uaccess.h>
#include "util.h"
+/* One semaphore structure for each semaphore in the system. */
+struct sem {
+ int semval; /* current value */
+ /*
+ * PID of the process that last modified the semaphore. For
+ * Linux, specifically these are:
+ * - semop
+ * - semctl, via SETVAL and SETALL.
+ * - at task exit when performing undo adjustments (see exit_sem).
+ */
+ int sempid;
+ spinlock_t lock; /* spinlock for fine-grained semtimedop */
+ struct list_head pending_alter; /* pending single-sop operations */
+ /* that alter the semaphore */
+ struct list_head pending_const; /* pending single-sop operations */
+ /* that do not alter the semaphore*/
+ time_t sem_otime; /* candidate for sem_otime */
+} ____cacheline_aligned_in_smp;
+
+/* One sem_array data structure for each set of semaphores in the system. */
+struct sem_array {
+ struct kern_ipc_perm sem_perm; /* permissions .. see ipc.h */
+ time64_t sem_ctime; /* create/last semctl() time */
+ struct list_head pending_alter; /* pending operations */
+ /* that alter the array */
+ struct list_head pending_const; /* pending complex operations */
+ /* that do not alter semvals */
+ struct list_head list_id; /* undo requests on this array */
+ int sem_nsems; /* no. of semaphores in array */
+ int complex_count; /* pending complex operations */
+ unsigned int use_global_lock;/* >0: global lock required */
+
+ struct sem sems[];
+} __randomize_layout;
/* One queue for each sleeping process in the system. */
struct sem_queue {
--
2.14.1
All of the implementations of security hooks that take shmid_kernel only
access shm_perm the struct kern_ipc_perm member. This means the
dependencies of the shm security hooks can be simplified by passing
the kern_ipc_perm member of shmid_kernel..
Making this change will allow struct shmid_kernel to become private to ipc/shm.c.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/lsm_hooks.h | 10 +++++-----
include/linux/security.h | 21 ++++++++++-----------
ipc/shm.c | 17 +++++++----------
security/security.c | 10 +++++-----
security/selinux/hooks.c | 28 ++++++++++++++--------------
security/smack/smack_lsm.c | 22 +++++++++++-----------
6 files changed, 52 insertions(+), 56 deletions(-)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index e4a94863a88c..cac7a8082c43 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1585,11 +1585,11 @@ union security_list_options {
struct task_struct *target, long type,
int mode);
- int (*shm_alloc_security)(struct shmid_kernel *shp);
- void (*shm_free_security)(struct shmid_kernel *shp);
- int (*shm_associate)(struct shmid_kernel *shp, int shmflg);
- int (*shm_shmctl)(struct shmid_kernel *shp, int cmd);
- int (*shm_shmat)(struct shmid_kernel *shp, char __user *shmaddr,
+ int (*shm_alloc_security)(struct kern_ipc_perm *shp);
+ void (*shm_free_security)(struct kern_ipc_perm *shp);
+ int (*shm_associate)(struct kern_ipc_perm *shp, int shmflg);
+ int (*shm_shmctl)(struct kern_ipc_perm *shp, int cmd);
+ int (*shm_shmat)(struct kern_ipc_perm *shp, char __user *shmaddr,
int shmflg);
int (*sem_alloc_security)(struct kern_ipc_perm *sma);
diff --git a/include/linux/security.h b/include/linux/security.h
index fa7adac4b99a..f390755808ea 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -49,7 +49,6 @@ struct qstr;
struct iattr;
struct fown_struct;
struct file_operations;
-struct shmid_kernel;
struct msg_msg;
struct msg_queue;
struct xattr;
@@ -362,11 +361,11 @@ int security_msg_queue_msgsnd(struct msg_queue *msq,
struct msg_msg *msg, int msqflg);
int security_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
struct task_struct *target, long type, int mode);
-int security_shm_alloc(struct shmid_kernel *shp);
-void security_shm_free(struct shmid_kernel *shp);
-int security_shm_associate(struct shmid_kernel *shp, int shmflg);
-int security_shm_shmctl(struct shmid_kernel *shp, int cmd);
-int security_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr, int shmflg);
+int security_shm_alloc(struct kern_ipc_perm *shp);
+void security_shm_free(struct kern_ipc_perm *shp);
+int security_shm_associate(struct kern_ipc_perm *shp, int shmflg);
+int security_shm_shmctl(struct kern_ipc_perm *shp, int cmd);
+int security_shm_shmat(struct kern_ipc_perm *shp, char __user *shmaddr, int shmflg);
int security_sem_alloc(struct kern_ipc_perm *sma);
void security_sem_free(struct kern_ipc_perm *sma);
int security_sem_associate(struct kern_ipc_perm *sma, int semflg);
@@ -1077,26 +1076,26 @@ static inline int security_msg_queue_msgrcv(struct msg_queue *msq,
return 0;
}
-static inline int security_shm_alloc(struct shmid_kernel *shp)
+static inline int security_shm_alloc(struct kern_ipc_perm *shp)
{
return 0;
}
-static inline void security_shm_free(struct shmid_kernel *shp)
+static inline void security_shm_free(struct kern_ipc_perm *shp)
{ }
-static inline int security_shm_associate(struct shmid_kernel *shp,
+static inline int security_shm_associate(struct kern_ipc_perm *shp,
int shmflg)
{
return 0;
}
-static inline int security_shm_shmctl(struct shmid_kernel *shp, int cmd)
+static inline int security_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
{
return 0;
}
-static inline int security_shm_shmat(struct shmid_kernel *shp,
+static inline int security_shm_shmat(struct kern_ipc_perm *shp,
char __user *shmaddr, int shmflg)
{
return 0;
diff --git a/ipc/shm.c b/ipc/shm.c
index 4643865e9171..387a786e7be1 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -181,7 +181,7 @@ static void shm_rcu_free(struct rcu_head *head)
rcu);
struct shmid_kernel *shp = container_of(ptr, struct shmid_kernel,
shm_perm);
- security_shm_free(shp);
+ security_shm_free(&shp->shm_perm);
kvfree(shp);
}
@@ -554,7 +554,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
shp->mlock_user = NULL;
shp->shm_perm.security = NULL;
- error = security_shm_alloc(shp);
+ error = security_shm_alloc(&shp->shm_perm);
if (error) {
kvfree(shp);
return error;
@@ -635,10 +635,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
*/
static inline int shm_security(struct kern_ipc_perm *ipcp, int shmflg)
{
- struct shmid_kernel *shp;
-
- shp = container_of(ipcp, struct shmid_kernel, shm_perm);
- return security_shm_associate(shp, shmflg);
+ return security_shm_associate(ipcp, shmflg);
}
/*
@@ -835,7 +832,7 @@ static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
shp = container_of(ipcp, struct shmid_kernel, shm_perm);
- err = security_shm_shmctl(shp, cmd);
+ err = security_shm_shmctl(&shp->shm_perm, cmd);
if (err)
goto out_unlock1;
@@ -934,7 +931,7 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
if (ipcperms(ns, &shp->shm_perm, S_IRUGO))
goto out_unlock;
- err = security_shm_shmctl(shp, cmd);
+ err = security_shm_shmctl(&shp->shm_perm, cmd);
if (err)
goto out_unlock;
@@ -978,7 +975,7 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
}
audit_ipc_obj(&(shp->shm_perm));
- err = security_shm_shmctl(shp, cmd);
+ err = security_shm_shmctl(&shp->shm_perm, cmd);
if (err)
goto out_unlock1;
@@ -1348,7 +1345,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
if (ipcperms(ns, &shp->shm_perm, acc_mode))
goto out_unlock;
- err = security_shm_shmat(shp, shmaddr, shmflg);
+ err = security_shm_shmat(&shp->shm_perm, shmaddr, shmflg);
if (err)
goto out_unlock;
diff --git a/security/security.c b/security/security.c
index d3b9aeb6b73b..77b69bd6f234 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1195,27 +1195,27 @@ int security_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
return call_int_hook(msg_queue_msgrcv, 0, msq, msg, target, type, mode);
}
-int security_shm_alloc(struct shmid_kernel *shp)
+int security_shm_alloc(struct kern_ipc_perm *shp)
{
return call_int_hook(shm_alloc_security, 0, shp);
}
-void security_shm_free(struct shmid_kernel *shp)
+void security_shm_free(struct kern_ipc_perm *shp)
{
call_void_hook(shm_free_security, shp);
}
-int security_shm_associate(struct shmid_kernel *shp, int shmflg)
+int security_shm_associate(struct kern_ipc_perm *shp, int shmflg)
{
return call_int_hook(shm_associate, 0, shp, shmflg);
}
-int security_shm_shmctl(struct shmid_kernel *shp, int cmd)
+int security_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
{
return call_int_hook(shm_shmctl, 0, shp, cmd);
}
-int security_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr, int shmflg)
+int security_shm_shmat(struct kern_ipc_perm *shp, char __user *shmaddr, int shmflg)
{
return call_int_hook(shm_shmat, 0, shp, shmaddr, shmflg);
}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index cce994e9fc0a..14f9e6c08273 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -5674,53 +5674,53 @@ static int selinux_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
}
/* Shared Memory security operations */
-static int selinux_shm_alloc_security(struct shmid_kernel *shp)
+static int selinux_shm_alloc_security(struct kern_ipc_perm *shp)
{
struct ipc_security_struct *isec;
struct common_audit_data ad;
u32 sid = current_sid();
int rc;
- rc = ipc_alloc_security(&shp->shm_perm, SECCLASS_SHM);
+ rc = ipc_alloc_security(shp, SECCLASS_SHM);
if (rc)
return rc;
- isec = shp->shm_perm.security;
+ isec = shp->security;
ad.type = LSM_AUDIT_DATA_IPC;
- ad.u.ipc_id = shp->shm_perm.key;
+ ad.u.ipc_id = shp->key;
rc = avc_has_perm(sid, isec->sid, SECCLASS_SHM,
SHM__CREATE, &ad);
if (rc) {
- ipc_free_security(&shp->shm_perm);
+ ipc_free_security(shp);
return rc;
}
return 0;
}
-static void selinux_shm_free_security(struct shmid_kernel *shp)
+static void selinux_shm_free_security(struct kern_ipc_perm *shp)
{
- ipc_free_security(&shp->shm_perm);
+ ipc_free_security(shp);
}
-static int selinux_shm_associate(struct shmid_kernel *shp, int shmflg)
+static int selinux_shm_associate(struct kern_ipc_perm *shp, int shmflg)
{
struct ipc_security_struct *isec;
struct common_audit_data ad;
u32 sid = current_sid();
- isec = shp->shm_perm.security;
+ isec = shp->security;
ad.type = LSM_AUDIT_DATA_IPC;
- ad.u.ipc_id = shp->shm_perm.key;
+ ad.u.ipc_id = shp->key;
return avc_has_perm(sid, isec->sid, SECCLASS_SHM,
SHM__ASSOCIATE, &ad);
}
/* Note, at this point, shp is locked down */
-static int selinux_shm_shmctl(struct shmid_kernel *shp, int cmd)
+static int selinux_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
{
int perms;
int err;
@@ -5749,11 +5749,11 @@ static int selinux_shm_shmctl(struct shmid_kernel *shp, int cmd)
return 0;
}
- err = ipc_has_perm(&shp->shm_perm, perms);
+ err = ipc_has_perm(shp, perms);
return err;
}
-static int selinux_shm_shmat(struct shmid_kernel *shp,
+static int selinux_shm_shmat(struct kern_ipc_perm *shp,
char __user *shmaddr, int shmflg)
{
u32 perms;
@@ -5763,7 +5763,7 @@ static int selinux_shm_shmat(struct shmid_kernel *shp,
else
perms = SHM__READ | SHM__WRITE;
- return ipc_has_perm(&shp->shm_perm, perms);
+ return ipc_has_perm(shp, perms);
}
/* Semaphore security operations */
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 0402b8c1aec1..a3398c7f32c9 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -2950,9 +2950,9 @@ static void smack_msg_msg_free_security(struct msg_msg *msg)
*
* Returns a pointer to the smack value
*/
-static struct smack_known *smack_of_shm(struct shmid_kernel *shp)
+static struct smack_known *smack_of_shm(struct kern_ipc_perm *shp)
{
- return (struct smack_known *)shp->shm_perm.security;
+ return (struct smack_known *)shp->security;
}
/**
@@ -2961,9 +2961,9 @@ static struct smack_known *smack_of_shm(struct shmid_kernel *shp)
*
* Returns 0
*/
-static int smack_shm_alloc_security(struct shmid_kernel *shp)
+static int smack_shm_alloc_security(struct kern_ipc_perm *shp)
{
- struct kern_ipc_perm *isp = &shp->shm_perm;
+ struct kern_ipc_perm *isp = shp;
struct smack_known *skp = smk_of_current();
isp->security = skp;
@@ -2976,9 +2976,9 @@ static int smack_shm_alloc_security(struct shmid_kernel *shp)
*
* Clears the blob pointer
*/
-static void smack_shm_free_security(struct shmid_kernel *shp)
+static void smack_shm_free_security(struct kern_ipc_perm *shp)
{
- struct kern_ipc_perm *isp = &shp->shm_perm;
+ struct kern_ipc_perm *isp = shp;
isp->security = NULL;
}
@@ -2990,7 +2990,7 @@ static void smack_shm_free_security(struct shmid_kernel *shp)
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smk_curacc_shm(struct shmid_kernel *shp, int access)
+static int smk_curacc_shm(struct kern_ipc_perm *shp, int access)
{
struct smack_known *ssp = smack_of_shm(shp);
struct smk_audit_info ad;
@@ -2998,7 +2998,7 @@ static int smk_curacc_shm(struct shmid_kernel *shp, int access)
#ifdef CONFIG_AUDIT
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
- ad.a.u.ipc_id = shp->shm_perm.id;
+ ad.a.u.ipc_id = shp->id;
#endif
rc = smk_curacc(ssp, access, &ad);
rc = smk_bu_current("shm", ssp, access, rc);
@@ -3012,7 +3012,7 @@ static int smk_curacc_shm(struct shmid_kernel *shp, int access)
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_shm_associate(struct shmid_kernel *shp, int shmflg)
+static int smack_shm_associate(struct kern_ipc_perm *shp, int shmflg)
{
int may;
@@ -3027,7 +3027,7 @@ static int smack_shm_associate(struct shmid_kernel *shp, int shmflg)
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_shm_shmctl(struct shmid_kernel *shp, int cmd)
+static int smack_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
{
int may;
@@ -3062,7 +3062,7 @@ static int smack_shm_shmctl(struct shmid_kernel *shp, int cmd)
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr,
+static int smack_shm_shmat(struct kern_ipc_perm *shp, char __user *shmaddr,
int shmflg)
{
int may;
--
2.14.1
All of the implementations of security hooks that take sem_array only
access sem_perm the struct kern_ipc_perm member. This means the
dependencies of the sem security hooks can be simplified by passing
the kern_ipc_perm member of sem_array.
Making this change will allow struct sem and struct sem_array
to become private to ipc/sem.c.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/lsm_hooks.h | 10 +++++-----
include/linux/security.h | 21 ++++++++++-----------
ipc/sem.c | 19 ++++++++-----------
security/security.c | 10 +++++-----
security/selinux/hooks.c | 28 ++++++++++++++--------------
security/smack/smack_lsm.c | 22 +++++++++++-----------
6 files changed, 53 insertions(+), 57 deletions(-)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 7161d8e7ee79..e4a94863a88c 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1592,11 +1592,11 @@ union security_list_options {
int (*shm_shmat)(struct shmid_kernel *shp, char __user *shmaddr,
int shmflg);
- int (*sem_alloc_security)(struct sem_array *sma);
- void (*sem_free_security)(struct sem_array *sma);
- int (*sem_associate)(struct sem_array *sma, int semflg);
- int (*sem_semctl)(struct sem_array *sma, int cmd);
- int (*sem_semop)(struct sem_array *sma, struct sembuf *sops,
+ int (*sem_alloc_security)(struct kern_ipc_perm *sma);
+ void (*sem_free_security)(struct kern_ipc_perm *sma);
+ int (*sem_associate)(struct kern_ipc_perm *sma, int semflg);
+ int (*sem_semctl)(struct kern_ipc_perm *sma, int cmd);
+ int (*sem_semop)(struct kern_ipc_perm *sma, struct sembuf *sops,
unsigned nsops, int alter);
int (*netlink_send)(struct sock *sk, struct sk_buff *skb);
diff --git a/include/linux/security.h b/include/linux/security.h
index 73f1ef625d40..fa7adac4b99a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -36,7 +36,6 @@ struct linux_binprm;
struct cred;
struct rlimit;
struct siginfo;
-struct sem_array;
struct sembuf;
struct kern_ipc_perm;
struct audit_context;
@@ -368,11 +367,11 @@ void security_shm_free(struct shmid_kernel *shp);
int security_shm_associate(struct shmid_kernel *shp, int shmflg);
int security_shm_shmctl(struct shmid_kernel *shp, int cmd);
int security_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr, int shmflg);
-int security_sem_alloc(struct sem_array *sma);
-void security_sem_free(struct sem_array *sma);
-int security_sem_associate(struct sem_array *sma, int semflg);
-int security_sem_semctl(struct sem_array *sma, int cmd);
-int security_sem_semop(struct sem_array *sma, struct sembuf *sops,
+int security_sem_alloc(struct kern_ipc_perm *sma);
+void security_sem_free(struct kern_ipc_perm *sma);
+int security_sem_associate(struct kern_ipc_perm *sma, int semflg);
+int security_sem_semctl(struct kern_ipc_perm *sma, int cmd);
+int security_sem_semop(struct kern_ipc_perm *sma, struct sembuf *sops,
unsigned nsops, int alter);
void security_d_instantiate(struct dentry *dentry, struct inode *inode);
int security_getprocattr(struct task_struct *p, char *name, char **value);
@@ -1103,25 +1102,25 @@ static inline int security_shm_shmat(struct shmid_kernel *shp,
return 0;
}
-static inline int security_sem_alloc(struct sem_array *sma)
+static inline int security_sem_alloc(struct kern_ipc_perm *sma)
{
return 0;
}
-static inline void security_sem_free(struct sem_array *sma)
+static inline void security_sem_free(struct kern_ipc_perm *sma)
{ }
-static inline int security_sem_associate(struct sem_array *sma, int semflg)
+static inline int security_sem_associate(struct kern_ipc_perm *sma, int semflg)
{
return 0;
}
-static inline int security_sem_semctl(struct sem_array *sma, int cmd)
+static inline int security_sem_semctl(struct kern_ipc_perm *sma, int cmd)
{
return 0;
}
-static inline int security_sem_semop(struct sem_array *sma,
+static inline int security_sem_semop(struct kern_ipc_perm *sma,
struct sembuf *sops, unsigned nsops,
int alter)
{
diff --git a/ipc/sem.c b/ipc/sem.c
index a4af04979fd2..01f5c63670ae 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -265,7 +265,7 @@ static void sem_rcu_free(struct rcu_head *head)
struct kern_ipc_perm *p = container_of(head, struct kern_ipc_perm, rcu);
struct sem_array *sma = container_of(p, struct sem_array, sem_perm);
- security_sem_free(sma);
+ security_sem_free(&sma->sem_perm);
kvfree(sma);
}
@@ -495,7 +495,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
sma->sem_perm.key = key;
sma->sem_perm.security = NULL;
- retval = security_sem_alloc(sma);
+ retval = security_sem_alloc(&sma->sem_perm);
if (retval) {
kvfree(sma);
return retval;
@@ -535,10 +535,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
*/
static inline int sem_security(struct kern_ipc_perm *ipcp, int semflg)
{
- struct sem_array *sma;
-
- sma = container_of(ipcp, struct sem_array, sem_perm);
- return security_sem_associate(sma, semflg);
+ return security_sem_associate(ipcp, semflg);
}
/*
@@ -1209,7 +1206,7 @@ static int semctl_stat(struct ipc_namespace *ns, int semid,
if (ipcperms(ns, &sma->sem_perm, S_IRUGO))
goto out_unlock;
- err = security_sem_semctl(sma, cmd);
+ err = security_sem_semctl(&sma->sem_perm, cmd);
if (err)
goto out_unlock;
@@ -1300,7 +1297,7 @@ static int semctl_setval(struct ipc_namespace *ns, int semid, int semnum,
return -EACCES;
}
- err = security_sem_semctl(sma, SETVAL);
+ err = security_sem_semctl(&sma->sem_perm, SETVAL);
if (err) {
rcu_read_unlock();
return -EACCES;
@@ -1354,7 +1351,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
if (ipcperms(ns, &sma->sem_perm, cmd == SETALL ? S_IWUGO : S_IRUGO))
goto out_rcu_wakeup;
- err = security_sem_semctl(sma, cmd);
+ err = security_sem_semctl(&sma->sem_perm, cmd);
if (err)
goto out_rcu_wakeup;
@@ -1545,7 +1542,7 @@ static int semctl_down(struct ipc_namespace *ns, int semid,
sma = container_of(ipcp, struct sem_array, sem_perm);
- err = security_sem_semctl(sma, cmd);
+ err = security_sem_semctl(&sma->sem_perm, cmd);
if (err)
goto out_unlock1;
@@ -1962,7 +1959,7 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
goto out_free;
}
- error = security_sem_semop(sma, sops, nsops, alter);
+ error = security_sem_semop(&sma->sem_perm, sops, nsops, alter);
if (error) {
rcu_read_unlock();
goto out_free;
diff --git a/security/security.c b/security/security.c
index 1cd8526cb0b7..d3b9aeb6b73b 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1220,27 +1220,27 @@ int security_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr, int shmfl
return call_int_hook(shm_shmat, 0, shp, shmaddr, shmflg);
}
-int security_sem_alloc(struct sem_array *sma)
+int security_sem_alloc(struct kern_ipc_perm *sma)
{
return call_int_hook(sem_alloc_security, 0, sma);
}
-void security_sem_free(struct sem_array *sma)
+void security_sem_free(struct kern_ipc_perm *sma)
{
call_void_hook(sem_free_security, sma);
}
-int security_sem_associate(struct sem_array *sma, int semflg)
+int security_sem_associate(struct kern_ipc_perm *sma, int semflg)
{
return call_int_hook(sem_associate, 0, sma, semflg);
}
-int security_sem_semctl(struct sem_array *sma, int cmd)
+int security_sem_semctl(struct kern_ipc_perm *sma, int cmd)
{
return call_int_hook(sem_semctl, 0, sma, cmd);
}
-int security_sem_semop(struct sem_array *sma, struct sembuf *sops,
+int security_sem_semop(struct kern_ipc_perm *sma, struct sembuf *sops,
unsigned nsops, int alter)
{
return call_int_hook(sem_semop, 0, sma, sops, nsops, alter);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 8644d864e3c1..cce994e9fc0a 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -5767,53 +5767,53 @@ static int selinux_shm_shmat(struct shmid_kernel *shp,
}
/* Semaphore security operations */
-static int selinux_sem_alloc_security(struct sem_array *sma)
+static int selinux_sem_alloc_security(struct kern_ipc_perm *sma)
{
struct ipc_security_struct *isec;
struct common_audit_data ad;
u32 sid = current_sid();
int rc;
- rc = ipc_alloc_security(&sma->sem_perm, SECCLASS_SEM);
+ rc = ipc_alloc_security(sma, SECCLASS_SEM);
if (rc)
return rc;
- isec = sma->sem_perm.security;
+ isec = sma->security;
ad.type = LSM_AUDIT_DATA_IPC;
- ad.u.ipc_id = sma->sem_perm.key;
+ ad.u.ipc_id = sma->key;
rc = avc_has_perm(sid, isec->sid, SECCLASS_SEM,
SEM__CREATE, &ad);
if (rc) {
- ipc_free_security(&sma->sem_perm);
+ ipc_free_security(sma);
return rc;
}
return 0;
}
-static void selinux_sem_free_security(struct sem_array *sma)
+static void selinux_sem_free_security(struct kern_ipc_perm *sma)
{
- ipc_free_security(&sma->sem_perm);
+ ipc_free_security(sma);
}
-static int selinux_sem_associate(struct sem_array *sma, int semflg)
+static int selinux_sem_associate(struct kern_ipc_perm *sma, int semflg)
{
struct ipc_security_struct *isec;
struct common_audit_data ad;
u32 sid = current_sid();
- isec = sma->sem_perm.security;
+ isec = sma->security;
ad.type = LSM_AUDIT_DATA_IPC;
- ad.u.ipc_id = sma->sem_perm.key;
+ ad.u.ipc_id = sma->key;
return avc_has_perm(sid, isec->sid, SECCLASS_SEM,
SEM__ASSOCIATE, &ad);
}
/* Note, at this point, sma is locked down */
-static int selinux_sem_semctl(struct sem_array *sma, int cmd)
+static int selinux_sem_semctl(struct kern_ipc_perm *sma, int cmd)
{
int err;
u32 perms;
@@ -5851,11 +5851,11 @@ static int selinux_sem_semctl(struct sem_array *sma, int cmd)
return 0;
}
- err = ipc_has_perm(&sma->sem_perm, perms);
+ err = ipc_has_perm(sma, perms);
return err;
}
-static int selinux_sem_semop(struct sem_array *sma,
+static int selinux_sem_semop(struct kern_ipc_perm *sma,
struct sembuf *sops, unsigned nsops, int alter)
{
u32 perms;
@@ -5865,7 +5865,7 @@ static int selinux_sem_semop(struct sem_array *sma,
else
perms = SEM__READ;
- return ipc_has_perm(&sma->sem_perm, perms);
+ return ipc_has_perm(sma, perms);
}
static int selinux_ipc_permission(struct kern_ipc_perm *ipcp, short flag)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 03fdecba93bb..0402b8c1aec1 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -3077,9 +3077,9 @@ static int smack_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr,
*
* Returns a pointer to the smack value
*/
-static struct smack_known *smack_of_sem(struct sem_array *sma)
+static struct smack_known *smack_of_sem(struct kern_ipc_perm *sma)
{
- return (struct smack_known *)sma->sem_perm.security;
+ return (struct smack_known *)sma->security;
}
/**
@@ -3088,9 +3088,9 @@ static struct smack_known *smack_of_sem(struct sem_array *sma)
*
* Returns 0
*/
-static int smack_sem_alloc_security(struct sem_array *sma)
+static int smack_sem_alloc_security(struct kern_ipc_perm *sma)
{
- struct kern_ipc_perm *isp = &sma->sem_perm;
+ struct kern_ipc_perm *isp = sma;
struct smack_known *skp = smk_of_current();
isp->security = skp;
@@ -3103,9 +3103,9 @@ static int smack_sem_alloc_security(struct sem_array *sma)
*
* Clears the blob pointer
*/
-static void smack_sem_free_security(struct sem_array *sma)
+static void smack_sem_free_security(struct kern_ipc_perm *sma)
{
- struct kern_ipc_perm *isp = &sma->sem_perm;
+ struct kern_ipc_perm *isp = sma;
isp->security = NULL;
}
@@ -3117,7 +3117,7 @@ static void smack_sem_free_security(struct sem_array *sma)
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smk_curacc_sem(struct sem_array *sma, int access)
+static int smk_curacc_sem(struct kern_ipc_perm *sma, int access)
{
struct smack_known *ssp = smack_of_sem(sma);
struct smk_audit_info ad;
@@ -3125,7 +3125,7 @@ static int smk_curacc_sem(struct sem_array *sma, int access)
#ifdef CONFIG_AUDIT
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
- ad.a.u.ipc_id = sma->sem_perm.id;
+ ad.a.u.ipc_id = sma->id;
#endif
rc = smk_curacc(ssp, access, &ad);
rc = smk_bu_current("sem", ssp, access, rc);
@@ -3139,7 +3139,7 @@ static int smk_curacc_sem(struct sem_array *sma, int access)
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_sem_associate(struct sem_array *sma, int semflg)
+static int smack_sem_associate(struct kern_ipc_perm *sma, int semflg)
{
int may;
@@ -3154,7 +3154,7 @@ static int smack_sem_associate(struct sem_array *sma, int semflg)
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_sem_semctl(struct sem_array *sma, int cmd)
+static int smack_sem_semctl(struct kern_ipc_perm *sma, int cmd)
{
int may;
@@ -3198,7 +3198,7 @@ static int smack_sem_semctl(struct sem_array *sma, int cmd)
*
* Returns 0 if access is allowed, error code otherwise
*/
-static int smack_sem_semop(struct sem_array *sma, struct sembuf *sops,
+static int smack_sem_semop(struct kern_ipc_perm *sma, struct sembuf *sops,
unsigned nsops, int alter)
{
return smk_curacc_sem(sma, MAY_READWRITE);
--
2.14.1
Today the last process to update a semaphore is remembered and
reported in the pid namespace of that process. If there are processes
in any other pid namespace querying that process id with GETPID the
result will be unusable nonsense as it does not make any
sense in your own pid namespace.
Due to ipc_update_pid I don't think you will be able to get System V
ipc semaphores into a troublesome cache line ping-pong. Using struct
pids from separate process are not a problem because they do not share
a cache line. Using struct pid from different threads of the same
process are unlikely to be a problem as the reference count update
can be avoided.
Further linux futexes are a much better tool for the job of mutual
exclusion between processes than System V semaphores. So I expect
programs that are performance limited by their interprocess mutual
exclusion primitive will be using futexes.
So while it is possible that enhancing the storage of the last
rocess of a System V semaphore from an integer to a struct pid
will cause a performance regression because of the effect
of frequently updating the pid reference count. I don't expect
that to happen in practice.
This change updates semctl(..., GETPID, ...) to return the
process id of the last process to update a semphore inthe
pid namespace of the calling process.
Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
ipc/sem.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/ipc/sem.c b/ipc/sem.c
index d661c491b0a5..47b263960524 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -98,7 +98,7 @@ struct sem {
* - semctl, via SETVAL and SETALL.
* - at task exit when performing undo adjustments (see exit_sem).
*/
- int sempid;
+ struct pid *sempid;
spinlock_t lock; /* spinlock for fine-grained semtimedop */
struct list_head pending_alter; /* pending single-sop operations */
/* that alter the semaphore */
@@ -128,7 +128,7 @@ struct sem_queue {
struct list_head list; /* queue of pending operations */
struct task_struct *sleeper; /* this process */
struct sem_undo *undo; /* undo structure */
- int pid; /* process id of requesting process */
+ struct pid *pid; /* process id of requesting process */
int status; /* completion status of operation */
struct sembuf *sops; /* array of pending operations */
struct sembuf *blocking; /* the operation that blocked */
@@ -628,7 +628,8 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
*/
static int perform_atomic_semop_slow(struct sem_array *sma, struct sem_queue *q)
{
- int result, sem_op, nsops, pid;
+ int result, sem_op, nsops;
+ struct pid *pid;
struct sembuf *sop;
struct sem *curr;
struct sembuf *sops;
@@ -666,7 +667,7 @@ static int perform_atomic_semop_slow(struct sem_array *sma, struct sem_queue *q)
sop--;
pid = q->pid;
while (sop >= sops) {
- sma->sems[sop->sem_num].sempid = pid;
+ ipc_update_pid(&sma->sems[sop->sem_num].sempid, pid);
sop--;
}
@@ -753,7 +754,7 @@ static int perform_atomic_semop(struct sem_array *sma, struct sem_queue *q)
un->semadj[sop->sem_num] = undo;
}
curr->semval += sem_op;
- curr->sempid = q->pid;
+ ipc_update_pid(&curr->sempid, q->pid);
}
return 0;
@@ -1160,6 +1161,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
unlink_queue(sma, q);
wake_up_sem_queue_prepare(q, -EIDRM, &wake_q);
}
+ ipc_update_pid(&sem->sempid, NULL);
}
/* Remove the semaphore set from the IDR */
@@ -1352,7 +1354,7 @@ static int semctl_setval(struct ipc_namespace *ns, int semid, int semnum,
un->semadj[semnum] = 0;
curr->semval = val;
- curr->sempid = task_tgid_vnr(current);
+ ipc_update_pid(&curr->sempid, task_tgid(current));
sma->sem_ctime = ktime_get_real_seconds();
/* maybe some queued-up processes were waiting for this */
do_smart_update(sma, NULL, 0, 0, &wake_q);
@@ -1473,7 +1475,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
for (i = 0; i < nsems; i++) {
sma->sems[i].semval = sem_io[i];
- sma->sems[i].sempid = task_tgid_vnr(current);
+ ipc_update_pid(&sma->sems[i].sempid, task_tgid(current));
}
ipc_assert_locked_object(&sma->sem_perm);
@@ -1505,7 +1507,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
err = curr->semval;
goto out_unlock;
case GETPID:
- err = curr->sempid;
+ err = pid_vnr(curr->sempid);
goto out_unlock;
case GETNCNT:
err = count_semcnt(sma, semnum, 0);
@@ -2024,7 +2026,7 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
queue.sops = sops;
queue.nsops = nsops;
queue.undo = un;
- queue.pid = task_tgid_vnr(current);
+ queue.pid = task_tgid(current);
queue.alter = alter;
queue.dupsop = dupsop;
@@ -2318,7 +2320,7 @@ void exit_sem(struct task_struct *tsk)
semaphore->semval = 0;
if (semaphore->semval > SEMVMX)
semaphore->semval = SEMVMX;
- semaphore->sempid = task_tgid_vnr(current);
+ ipc_update_pid(&semaphore->sempid, task_tgid(current));
}
}
/* maybe some queued-up processes were waiting for this */
--
2.14.1
On 3/23/2018 12:16 PM, Eric W. Biederman wrote:
> Today shm_cpid and shm_lpid are remembered in the pid namespace of the
> creator and the processes that last touched a sysvipc shared memory
> segment. If you have processes in multiple pid namespaces that
> is just wrong, and I don't know how this has been over-looked for
> so long.
>
> As only creation and shared memory attach and shared memory detach
> update the pids I do not expect there to be a repeat of the issues
> when struct pid was attached to each af_unix skb, which in some
> notable cases cut the performance in half. The problem was threads of
> the same process updating same struct pid from different cpus causing
> the cache line to be highly contended and bounce between cpus.
>
> As creation, attach, and detach are expected to be rare operations for
> sysvipc shared memory segments I do not expect that kind of cache line
> ping pong to cause probems. In addition because the pid is at a fixed
> location in the structure instead of being dynamic on a skb, the
> reference count of the pid does not need to be updated on each
> operation if the pid is the same. This ability to simply skip the pid
> reference count changes if the pid is unchanging further reduces the
> likelihood of the a cache line holding a pid reference count
> ping-ponging between cpus.
>
> Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Thanks!
Reviewed-by: Nagarathnam Muthusamy <[email protected]>
> ---
> ipc/shm.c | 25 +++++++++++++++----------
> 1 file changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 0565669ebe5c..932b7e411c6c 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -57,8 +57,8 @@ struct shmid_kernel /* private to the kernel */
> time64_t shm_atim;
> time64_t shm_dtim;
> time64_t shm_ctim;
> - pid_t shm_cprid;
> - pid_t shm_lprid;
> + struct pid *shm_cprid;
> + struct pid *shm_lprid;
> struct user_struct *mlock_user;
>
> /* The task created the shm object. NULL if the task is dead. */
> @@ -226,7 +226,7 @@ static int __shm_open(struct vm_area_struct *vma)
> return PTR_ERR(shp);
>
> shp->shm_atim = ktime_get_real_seconds();
> - shp->shm_lprid = task_tgid_vnr(current);
> + ipc_update_pid(&shp->shm_lprid, task_tgid(current));
> shp->shm_nattch++;
> shm_unlock(shp);
> return 0;
> @@ -267,6 +267,8 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
> user_shm_unlock(i_size_read(file_inode(shm_file)),
> shp->mlock_user);
> fput(shm_file);
> + ipc_update_pid(&shp->shm_cprid, NULL);
> + ipc_update_pid(&shp->shm_lprid, NULL);
> ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
> }
>
> @@ -311,7 +313,7 @@ static void shm_close(struct vm_area_struct *vma)
> if (WARN_ON_ONCE(IS_ERR(shp)))
> goto done; /* no-op */
>
> - shp->shm_lprid = task_tgid_vnr(current);
> + ipc_update_pid(&shp->shm_lprid, task_tgid(current));
> shp->shm_dtim = ktime_get_real_seconds();
> shp->shm_nattch--;
> if (shm_may_destroy(ns, shp))
> @@ -614,8 +616,8 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> if (IS_ERR(file))
> goto no_file;
>
> - shp->shm_cprid = task_tgid_vnr(current);
> - shp->shm_lprid = 0;
> + shp->shm_cprid = get_pid(task_tgid(current));
> + shp->shm_lprid = NULL;
> shp->shm_atim = shp->shm_dtim = 0;
> shp->shm_ctim = ktime_get_real_seconds();
> shp->shm_segsz = size;
> @@ -648,6 +650,8 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> user_shm_unlock(size, shp->mlock_user);
> fput(file);
> no_file:
> + ipc_update_pid(&shp->shm_cprid, NULL);
> + ipc_update_pid(&shp->shm_lprid, NULL);
> call_rcu(&shp->shm_perm.rcu, shm_rcu_free);
> return error;
> }
> @@ -970,8 +974,8 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
> tbuf->shm_atime = shp->shm_atim;
> tbuf->shm_dtime = shp->shm_dtim;
> tbuf->shm_ctime = shp->shm_ctim;
> - tbuf->shm_cpid = shp->shm_cprid;
> - tbuf->shm_lpid = shp->shm_lprid;
> + tbuf->shm_cpid = pid_vnr(shp->shm_cprid);
> + tbuf->shm_lpid = pid_vnr(shp->shm_lprid);
> tbuf->shm_nattch = shp->shm_nattch;
>
> ipc_unlock_object(&shp->shm_perm);
> @@ -1605,6 +1609,7 @@ SYSCALL_DEFINE1(shmdt, char __user *, shmaddr)
> #ifdef CONFIG_PROC_FS
> static int sysvipc_shm_proc_show(struct seq_file *s, void *it)
> {
> + struct pid_namespace *pid_ns = ipc_seq_pid_ns(s);
> struct user_namespace *user_ns = seq_user_ns(s);
> struct kern_ipc_perm *ipcp = it;
> struct shmid_kernel *shp;
> @@ -1627,8 +1632,8 @@ static int sysvipc_shm_proc_show(struct seq_file *s, void *it)
> shp->shm_perm.id,
> shp->shm_perm.mode,
> shp->shm_segsz,
> - shp->shm_cprid,
> - shp->shm_lprid,
> + pid_nr_ns(shp->shm_cprid, pid_ns),
> + pid_nr_ns(shp->shm_lprid, pid_ns),
> shp->shm_nattch,
> from_kuid_munged(user_ns, shp->shm_perm.uid),
> from_kgid_munged(user_ns, shp->shm_perm.gid),
On 3/23/2018 12:16 PM, Eric W. Biederman wrote:
> Today msg_lspid and msg_lrpid are remembered in the pid namespace of
> the creator and the processes that last send or received a sysvipc
> message. If you have processes in multiple pid namespaces that is
> just wrong. The process ids reported will not make the least bit of
> sense.
>
> This fix is slightly more susceptible to a performance problem than
> the related fix for System V shared memory. By definition the pids
> are updated by msgsnd and msgrcv, the fast path of System V message
> queues. The only concern over the previous implementation is the
> incrementing and decrementing of the pid reference count. As that is
> the only difference and multiple updates by of the task_tgid by
> threads in the same process have been shown in af_unix sockets to
> create a cache line ping-pong between cpus of the same processor.
>
> In this case I don't expect cache lines holding pid reference counts
> to ping pong between cpus. As senders and receivers update different
> pids there is a natural separation there. Further if multiple threads
> of the same process either send or receive messages the pid will be
> updated to the same value and ipc_update_pid will avoid the reference
> count update.
>
> Which means in the common case I expect msg_lspid and msg_lrpid to
> remain constant, and reference counts not to be updated when messages
> are sent.
>
> In rare cases it may be possible to trigger the issue which was
> observed for af_unix sockets, but it will require multiple processes
> with multiple threads to be either sending or receiving messages. It
> just does not feel likely that anyone would do that in practice.
>
> This change updates msgctl(..., IPC_STAT, ...) to return msg_lspid and
> msg_lrpid in the pid namespace of the process calling stat.
>
> This change also updates cat /proc/sysvipc/msg to return print msg_lspid
> and msg_lrpid in the pid namespace of the process that opened the proc
> file.
>
> Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Thanks!
Reviewed-by: Nagarathnam Muthusamy <[email protected]>
> ---
> ipc/msg.c | 23 +++++++++++++----------
> 1 file changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/ipc/msg.c b/ipc/msg.c
> index af5a963306c4..825ad585a6ff 100644
> --- a/ipc/msg.c
> +++ b/ipc/msg.c
> @@ -52,8 +52,8 @@ struct msg_queue {
> unsigned long q_cbytes; /* current number of bytes on queue */
> unsigned long q_qnum; /* number of messages in queue */
> unsigned long q_qbytes; /* max number of bytes on queue */
> - pid_t q_lspid; /* pid of last msgsnd */
> - pid_t q_lrpid; /* last receive pid */
> + struct pid *q_lspid; /* pid of last msgsnd */
> + struct pid *q_lrpid; /* last receive pid */
>
> struct list_head q_messages;
> struct list_head q_receivers;
> @@ -154,7 +154,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
> msq->q_ctime = ktime_get_real_seconds();
> msq->q_cbytes = msq->q_qnum = 0;
> msq->q_qbytes = ns->msg_ctlmnb;
> - msq->q_lspid = msq->q_lrpid = 0;
> + msq->q_lspid = msq->q_lrpid = NULL;
> INIT_LIST_HEAD(&msq->q_messages);
> INIT_LIST_HEAD(&msq->q_receivers);
> INIT_LIST_HEAD(&msq->q_senders);
> @@ -267,6 +267,8 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
> free_msg(msg);
> }
> atomic_sub(msq->q_cbytes, &ns->msg_bytes);
> + ipc_update_pid(&msq->q_lspid, NULL);
> + ipc_update_pid(&msq->q_lrpid, NULL);
> ipc_rcu_putref(&msq->q_perm, msg_rcu_free);
> }
>
> @@ -536,8 +538,8 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
> p->msg_cbytes = msq->q_cbytes;
> p->msg_qnum = msq->q_qnum;
> p->msg_qbytes = msq->q_qbytes;
> - p->msg_lspid = msq->q_lspid;
> - p->msg_lrpid = msq->q_lrpid;
> + p->msg_lspid = pid_vnr(msq->q_lspid);
> + p->msg_lrpid = pid_vnr(msq->q_lrpid);
>
> ipc_unlock_object(&msq->q_perm);
> rcu_read_unlock();
> @@ -741,7 +743,7 @@ static inline int pipelined_send(struct msg_queue *msq, struct msg_msg *msg,
> wake_q_add(wake_q, msr->r_tsk);
> WRITE_ONCE(msr->r_msg, ERR_PTR(-E2BIG));
> } else {
> - msq->q_lrpid = task_pid_vnr(msr->r_tsk);
> + ipc_update_pid(&msq->q_lrpid, task_pid(msr->r_tsk));
> msq->q_rtime = get_seconds();
>
> wake_q_add(wake_q, msr->r_tsk);
> @@ -842,7 +844,7 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
>
> }
>
> - msq->q_lspid = task_tgid_vnr(current);
> + ipc_update_pid(&msq->q_lspid, task_tgid(current));
> msq->q_stime = get_seconds();
>
> if (!pipelined_send(msq, msg, &wake_q)) {
> @@ -1060,7 +1062,7 @@ static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, in
> list_del(&msg->m_list);
> msq->q_qnum--;
> msq->q_rtime = get_seconds();
> - msq->q_lrpid = task_tgid_vnr(current);
> + ipc_update_pid(&msq->q_lrpid, task_tgid(current));
> msq->q_cbytes -= msg->m_ts;
> atomic_sub(msg->m_ts, &ns->msg_bytes);
> atomic_dec(&ns->msg_hdrs);
> @@ -1202,6 +1204,7 @@ void msg_exit_ns(struct ipc_namespace *ns)
> #ifdef CONFIG_PROC_FS
> static int sysvipc_msg_proc_show(struct seq_file *s, void *it)
> {
> + struct pid_namespace *pid_ns = ipc_seq_pid_ns(s);
> struct user_namespace *user_ns = seq_user_ns(s);
> struct kern_ipc_perm *ipcp = it;
> struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
> @@ -1213,8 +1216,8 @@ static int sysvipc_msg_proc_show(struct seq_file *s, void *it)
> msq->q_perm.mode,
> msq->q_cbytes,
> msq->q_qnum,
> - msq->q_lspid,
> - msq->q_lrpid,
> + pid_nr_ns(msq->q_lspid, pid_ns),
> + pid_nr_ns(msq->q_lrpid, pid_ns),
> from_kuid_munged(user_ns, msq->q_perm.uid),
> from_kgid_munged(user_ns, msq->q_perm.gid),
> from_kuid_munged(user_ns, msq->q_perm.cuid),
NAGARATHNAM MUTHUSAMY <[email protected]> writes:
> Thanks!
>
> Reviewed-by: Nagarathnam Muthusamy <[email protected]>
Does this look like it will address the issue you have been fighting
with pids?
Eric
On 3/23/2018 2:33 PM, [email protected] wrote:
> NAGARATHNAM MUTHUSAMY <[email protected]> writes:
>
>> Thanks!
>>
>> Reviewed-by: Nagarathnam Muthusamy <[email protected]>
> Does this look like it will address the issue you have been fighting
> with pids?
We do use IPC shared memory but it is a single large one, shared by multiple
levels. We are currently looking into using a similar solution based on
file locks.
When a new level is created, a file representing that level could be
created in
a common path which could be locked by the init process of that level.
Parent levels could query the locking pid of that file to get the pid
translation
of the init process of the required level. Then it could open a file
descriptor
and use the translate_pid API for further translations.
Thanks,
Nagarathnam.
>
> Eric
On 3/23/2018 12:16 PM, Eric W. Biederman wrote:
> All of the implementations of security hooks that take sem_array only
> access sem_perm the struct kern_ipc_perm member. This means the
> dependencies of the sem security hooks can be simplified by passing
> the kern_ipc_perm member of sem_array.
>
> Making this change will allow struct sem and struct sem_array
> to become private to ipc/sem.c.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> include/linux/lsm_hooks.h | 10 +++++-----
> include/linux/security.h | 21 ++++++++++-----------
> ipc/sem.c | 19 ++++++++-----------
> security/security.c | 10 +++++-----
> security/selinux/hooks.c | 28 ++++++++++++++--------------
> security/smack/smack_lsm.c | 22 +++++++++++-----------
> 6 files changed, 53 insertions(+), 57 deletions(-)
>
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index 7161d8e7ee79..e4a94863a88c 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1592,11 +1592,11 @@ union security_list_options {
> int (*shm_shmat)(struct shmid_kernel *shp, char __user *shmaddr,
> int shmflg);
>
> - int (*sem_alloc_security)(struct sem_array *sma);
> - void (*sem_free_security)(struct sem_array *sma);
> - int (*sem_associate)(struct sem_array *sma, int semflg);
> - int (*sem_semctl)(struct sem_array *sma, int cmd);
> - int (*sem_semop)(struct sem_array *sma, struct sembuf *sops,
> + int (*sem_alloc_security)(struct kern_ipc_perm *sma);
> + void (*sem_free_security)(struct kern_ipc_perm *sma);
> + int (*sem_associate)(struct kern_ipc_perm *sma, int semflg);
> + int (*sem_semctl)(struct kern_ipc_perm *sma, int cmd);
> + int (*sem_semop)(struct kern_ipc_perm *sma, struct sembuf *sops,
> unsigned nsops, int alter);
>
> int (*netlink_send)(struct sock *sk, struct sk_buff *skb);
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 73f1ef625d40..fa7adac4b99a 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -36,7 +36,6 @@ struct linux_binprm;
> struct cred;
> struct rlimit;
> struct siginfo;
> -struct sem_array;
> struct sembuf;
> struct kern_ipc_perm;
> struct audit_context;
> @@ -368,11 +367,11 @@ void security_shm_free(struct shmid_kernel *shp);
> int security_shm_associate(struct shmid_kernel *shp, int shmflg);
> int security_shm_shmctl(struct shmid_kernel *shp, int cmd);
> int security_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr, int shmflg);
> -int security_sem_alloc(struct sem_array *sma);
> -void security_sem_free(struct sem_array *sma);
> -int security_sem_associate(struct sem_array *sma, int semflg);
> -int security_sem_semctl(struct sem_array *sma, int cmd);
> -int security_sem_semop(struct sem_array *sma, struct sembuf *sops,
> +int security_sem_alloc(struct kern_ipc_perm *sma);
> +void security_sem_free(struct kern_ipc_perm *sma);
> +int security_sem_associate(struct kern_ipc_perm *sma, int semflg);
> +int security_sem_semctl(struct kern_ipc_perm *sma, int cmd);
> +int security_sem_semop(struct kern_ipc_perm *sma, struct sembuf *sops,
> unsigned nsops, int alter);
> void security_d_instantiate(struct dentry *dentry, struct inode *inode);
> int security_getprocattr(struct task_struct *p, char *name, char **value);
> @@ -1103,25 +1102,25 @@ static inline int security_shm_shmat(struct shmid_kernel *shp,
> return 0;
> }
>
> -static inline int security_sem_alloc(struct sem_array *sma)
> +static inline int security_sem_alloc(struct kern_ipc_perm *sma)
> {
> return 0;
> }
>
> -static inline void security_sem_free(struct sem_array *sma)
> +static inline void security_sem_free(struct kern_ipc_perm *sma)
> { }
>
> -static inline int security_sem_associate(struct sem_array *sma, int semflg)
> +static inline int security_sem_associate(struct kern_ipc_perm *sma, int semflg)
> {
> return 0;
> }
>
> -static inline int security_sem_semctl(struct sem_array *sma, int cmd)
> +static inline int security_sem_semctl(struct kern_ipc_perm *sma, int cmd)
> {
> return 0;
> }
>
> -static inline int security_sem_semop(struct sem_array *sma,
> +static inline int security_sem_semop(struct kern_ipc_perm *sma,
> struct sembuf *sops, unsigned nsops,
> int alter)
> {
> diff --git a/ipc/sem.c b/ipc/sem.c
> index a4af04979fd2..01f5c63670ae 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -265,7 +265,7 @@ static void sem_rcu_free(struct rcu_head *head)
> struct kern_ipc_perm *p = container_of(head, struct kern_ipc_perm, rcu);
> struct sem_array *sma = container_of(p, struct sem_array, sem_perm);
>
> - security_sem_free(sma);
> + security_sem_free(&sma->sem_perm);
> kvfree(sma);
> }
>
> @@ -495,7 +495,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
> sma->sem_perm.key = key;
>
> sma->sem_perm.security = NULL;
> - retval = security_sem_alloc(sma);
> + retval = security_sem_alloc(&sma->sem_perm);
> if (retval) {
> kvfree(sma);
> return retval;
> @@ -535,10 +535,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
> */
> static inline int sem_security(struct kern_ipc_perm *ipcp, int semflg)
> {
> - struct sem_array *sma;
> -
> - sma = container_of(ipcp, struct sem_array, sem_perm);
> - return security_sem_associate(sma, semflg);
> + return security_sem_associate(ipcp, semflg);
> }
>
> /*
> @@ -1209,7 +1206,7 @@ static int semctl_stat(struct ipc_namespace *ns, int semid,
> if (ipcperms(ns, &sma->sem_perm, S_IRUGO))
> goto out_unlock;
>
> - err = security_sem_semctl(sma, cmd);
> + err = security_sem_semctl(&sma->sem_perm, cmd);
> if (err)
> goto out_unlock;
>
> @@ -1300,7 +1297,7 @@ static int semctl_setval(struct ipc_namespace *ns, int semid, int semnum,
> return -EACCES;
> }
>
> - err = security_sem_semctl(sma, SETVAL);
> + err = security_sem_semctl(&sma->sem_perm, SETVAL);
> if (err) {
> rcu_read_unlock();
> return -EACCES;
> @@ -1354,7 +1351,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
> if (ipcperms(ns, &sma->sem_perm, cmd == SETALL ? S_IWUGO : S_IRUGO))
> goto out_rcu_wakeup;
>
> - err = security_sem_semctl(sma, cmd);
> + err = security_sem_semctl(&sma->sem_perm, cmd);
> if (err)
> goto out_rcu_wakeup;
>
> @@ -1545,7 +1542,7 @@ static int semctl_down(struct ipc_namespace *ns, int semid,
>
> sma = container_of(ipcp, struct sem_array, sem_perm);
>
> - err = security_sem_semctl(sma, cmd);
> + err = security_sem_semctl(&sma->sem_perm, cmd);
> if (err)
> goto out_unlock1;
>
> @@ -1962,7 +1959,7 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
> goto out_free;
> }
>
> - error = security_sem_semop(sma, sops, nsops, alter);
> + error = security_sem_semop(&sma->sem_perm, sops, nsops, alter);
> if (error) {
> rcu_read_unlock();
> goto out_free;
> diff --git a/security/security.c b/security/security.c
> index 1cd8526cb0b7..d3b9aeb6b73b 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1220,27 +1220,27 @@ int security_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr, int shmfl
> return call_int_hook(shm_shmat, 0, shp, shmaddr, shmflg);
> }
>
> -int security_sem_alloc(struct sem_array *sma)
> +int security_sem_alloc(struct kern_ipc_perm *sma)
> {
> return call_int_hook(sem_alloc_security, 0, sma);
> }
>
> -void security_sem_free(struct sem_array *sma)
> +void security_sem_free(struct kern_ipc_perm *sma)
> {
> call_void_hook(sem_free_security, sma);
> }
>
> -int security_sem_associate(struct sem_array *sma, int semflg)
> +int security_sem_associate(struct kern_ipc_perm *sma, int semflg)
> {
> return call_int_hook(sem_associate, 0, sma, semflg);
> }
>
> -int security_sem_semctl(struct sem_array *sma, int cmd)
> +int security_sem_semctl(struct kern_ipc_perm *sma, int cmd)
> {
> return call_int_hook(sem_semctl, 0, sma, cmd);
> }
>
> -int security_sem_semop(struct sem_array *sma, struct sembuf *sops,
> +int security_sem_semop(struct kern_ipc_perm *sma, struct sembuf *sops,
> unsigned nsops, int alter)
> {
> return call_int_hook(sem_semop, 0, sma, sops, nsops, alter);
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 8644d864e3c1..cce994e9fc0a 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -5767,53 +5767,53 @@ static int selinux_shm_shmat(struct shmid_kernel *shp,
> }
>
> /* Semaphore security operations */
> -static int selinux_sem_alloc_security(struct sem_array *sma)
> +static int selinux_sem_alloc_security(struct kern_ipc_perm *sma)
> {
> struct ipc_security_struct *isec;
> struct common_audit_data ad;
> u32 sid = current_sid();
> int rc;
>
> - rc = ipc_alloc_security(&sma->sem_perm, SECCLASS_SEM);
> + rc = ipc_alloc_security(sma, SECCLASS_SEM);
> if (rc)
> return rc;
>
> - isec = sma->sem_perm.security;
> + isec = sma->security;
>
> ad.type = LSM_AUDIT_DATA_IPC;
> - ad.u.ipc_id = sma->sem_perm.key;
> + ad.u.ipc_id = sma->key;
>
> rc = avc_has_perm(sid, isec->sid, SECCLASS_SEM,
> SEM__CREATE, &ad);
> if (rc) {
> - ipc_free_security(&sma->sem_perm);
> + ipc_free_security(sma);
> return rc;
> }
> return 0;
> }
>
> -static void selinux_sem_free_security(struct sem_array *sma)
> +static void selinux_sem_free_security(struct kern_ipc_perm *sma)
> {
> - ipc_free_security(&sma->sem_perm);
> + ipc_free_security(sma);
> }
>
> -static int selinux_sem_associate(struct sem_array *sma, int semflg)
> +static int selinux_sem_associate(struct kern_ipc_perm *sma, int semflg)
> {
> struct ipc_security_struct *isec;
> struct common_audit_data ad;
> u32 sid = current_sid();
>
> - isec = sma->sem_perm.security;
> + isec = sma->security;
>
> ad.type = LSM_AUDIT_DATA_IPC;
> - ad.u.ipc_id = sma->sem_perm.key;
> + ad.u.ipc_id = sma->key;
>
> return avc_has_perm(sid, isec->sid, SECCLASS_SEM,
> SEM__ASSOCIATE, &ad);
> }
>
> /* Note, at this point, sma is locked down */
> -static int selinux_sem_semctl(struct sem_array *sma, int cmd)
> +static int selinux_sem_semctl(struct kern_ipc_perm *sma, int cmd)
> {
> int err;
> u32 perms;
> @@ -5851,11 +5851,11 @@ static int selinux_sem_semctl(struct sem_array *sma, int cmd)
> return 0;
> }
>
> - err = ipc_has_perm(&sma->sem_perm, perms);
> + err = ipc_has_perm(sma, perms);
> return err;
> }
>
> -static int selinux_sem_semop(struct sem_array *sma,
> +static int selinux_sem_semop(struct kern_ipc_perm *sma,
> struct sembuf *sops, unsigned nsops, int alter)
> {
> u32 perms;
> @@ -5865,7 +5865,7 @@ static int selinux_sem_semop(struct sem_array *sma,
> else
> perms = SEM__READ;
>
> - return ipc_has_perm(&sma->sem_perm, perms);
> + return ipc_has_perm(sma, perms);
> }
>
> static int selinux_ipc_permission(struct kern_ipc_perm *ipcp, short flag)
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index 03fdecba93bb..0402b8c1aec1 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -3077,9 +3077,9 @@ static int smack_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr,
> *
> * Returns a pointer to the smack value
> */
> -static struct smack_known *smack_of_sem(struct sem_array *sma)
> +static struct smack_known *smack_of_sem(struct kern_ipc_perm *sma)
> {
> - return (struct smack_known *)sma->sem_perm.security;
> + return (struct smack_known *)sma->security;
> }
This function no longer makes sense and should be removed.
>
> /**
> @@ -3088,9 +3088,9 @@ static struct smack_known *smack_of_sem(struct sem_array *sma)
> *
> * Returns 0
> */
> -static int smack_sem_alloc_security(struct sem_array *sma)
> +static int smack_sem_alloc_security(struct kern_ipc_perm *sma)
> {
> - struct kern_ipc_perm *isp = &sma->sem_perm;
> + struct kern_ipc_perm *isp = sma;
> struct smack_known *skp = smk_of_current();
>
> isp->security = skp;
A kern_ipc_perm pointer is conventionally named isp in this code.
How about instead:
-static int smack_sem_alloc_security(struct sem_array *sma)
+static int smack_sem_alloc_security(struct kern_ipc_perm *isp)
{
- struct kern_ipc_perm *isp = &sma->sem_perm;
> @@ -3103,9 +3103,9 @@ static int smack_sem_alloc_security(struct sem_array *sma)
> *
> * Clears the blob pointer
> */
> -static void smack_sem_free_security(struct sem_array *sma)
> +static void smack_sem_free_security(struct kern_ipc_perm *sma)
> {
> - struct kern_ipc_perm *isp = &sma->sem_perm;
> + struct kern_ipc_perm *isp = sma;
-static void smack_sem_free_security(struct sem_array *sma)
+static void smack_sem_free_security(struct kern_ipc_perm *isp)
{
- struct kern_ipc_perm *isp = &sma->sem_perm;
>
> isp->security = NULL;
> }
> @@ -3117,7 +3117,7 @@ static void smack_sem_free_security(struct sem_array *sma)
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smk_curacc_sem(struct sem_array *sma, int access)
> +static int smk_curacc_sem(struct kern_ipc_perm *sma, int access)
-static int smk_curacc_sem(struct sem_array *sma, int access)
+static int smk_curacc_sem(struct kern_ipc_perm *isp, int access)
> {
> struct smack_known *ssp = smack_of_sem(sma);
struct smack_known *ssp = smack_of_ipc(isp);
> struct smk_audit_info ad;
> @@ -3125,7 +3125,7 @@ static int smk_curacc_sem(struct sem_array *sma, int access)
>
> #ifdef CONFIG_AUDIT
> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
> - ad.a.u.ipc_id = sma->sem_perm.id;
> + ad.a.u.ipc_id = sma->id;
- ad.a.u.ipc_id = isp->sem_perm.id;
+ ad.a.u.ipc_id = isp->id;
> #endif
> rc = smk_curacc(ssp, access, &ad);
> rc = smk_bu_current("sem", ssp, access, rc);
> @@ -3139,7 +3139,7 @@ static int smk_curacc_sem(struct sem_array *sma, int access)
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_sem_associate(struct sem_array *sma, int semflg)
> +static int smack_sem_associate(struct kern_ipc_perm *sma, int semflg)
+static int smack_sem_associate(struct kern_ipc_perm *isp, int semflg)
and related adjustments following.
> {
> int may;
>
> @@ -3154,7 +3154,7 @@ static int smack_sem_associate(struct sem_array *sma, int semflg)
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_sem_semctl(struct sem_array *sma, int cmd)
> +static int smack_sem_semctl(struct kern_ipc_perm *sma, int cmd)
+static int smack_sem_semctl(struct kern_ipc_perm *isp, int cmd)
and related adjustments following.
> {
> int may;
>
> @@ -3198,7 +3198,7 @@ static int smack_sem_semctl(struct sem_array *sma, int cmd)
> *
> * Returns 0 if access is allowed, error code otherwise
> */
> -static int smack_sem_semop(struct sem_array *sma, struct sembuf *sops,
> +static int smack_sem_semop(struct kern_ipc_perm *sma, struct sembuf *sops,
> unsigned nsops, int alter)
> {
> return smk_curacc_sem(sma, MAY_READWRITE);
+static int smack_sem_semop(struct kern_ipc_perm *isp, struct sembuf *sops,
unsigned nsops, int alter)
{
- return smk_curacc_sem(sma, MAY_READWRITE);
+ return smk_curacc_sem(isp, MAY_READWRITE);
On 3/23/2018 12:16 PM, Eric W. Biederman wrote:
> All of the implementations of security hooks that take shmid_kernel only
> access shm_perm the struct kern_ipc_perm member. This means the
> dependencies of the shm security hooks can be simplified by passing
> the kern_ipc_perm member of shmid_kernel..
>
> Making this change will allow struct shmid_kernel to become private to ipc/shm.c.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> include/linux/lsm_hooks.h | 10 +++++-----
> include/linux/security.h | 21 ++++++++++-----------
> ipc/shm.c | 17 +++++++----------
> security/security.c | 10 +++++-----
> security/selinux/hooks.c | 28 ++++++++++++++--------------
> security/smack/smack_lsm.c | 22 +++++++++++-----------
Can I reference the comments I made in PATCH 01 of this set
regarding the Smack changes? The problem in all of your changes
is the same. You aren't preserving the naming conventions, and
you've left in some code that is just silly.
> 6 files changed, 52 insertions(+), 56 deletions(-)
>
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index e4a94863a88c..cac7a8082c43 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1585,11 +1585,11 @@ union security_list_options {
> struct task_struct *target, long type,
> int mode);
>
> - int (*shm_alloc_security)(struct shmid_kernel *shp);
> - void (*shm_free_security)(struct shmid_kernel *shp);
> - int (*shm_associate)(struct shmid_kernel *shp, int shmflg);
> - int (*shm_shmctl)(struct shmid_kernel *shp, int cmd);
> - int (*shm_shmat)(struct shmid_kernel *shp, char __user *shmaddr,
> + int (*shm_alloc_security)(struct kern_ipc_perm *shp);
> + void (*shm_free_security)(struct kern_ipc_perm *shp);
> + int (*shm_associate)(struct kern_ipc_perm *shp, int shmflg);
> + int (*shm_shmctl)(struct kern_ipc_perm *shp, int cmd);
> + int (*shm_shmat)(struct kern_ipc_perm *shp, char __user *shmaddr,
> int shmflg);
>
> int (*sem_alloc_security)(struct kern_ipc_perm *sma);
> diff --git a/include/linux/security.h b/include/linux/security.h
> index fa7adac4b99a..f390755808ea 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -49,7 +49,6 @@ struct qstr;
> struct iattr;
> struct fown_struct;
> struct file_operations;
> -struct shmid_kernel;
> struct msg_msg;
> struct msg_queue;
> struct xattr;
> @@ -362,11 +361,11 @@ int security_msg_queue_msgsnd(struct msg_queue *msq,
> struct msg_msg *msg, int msqflg);
> int security_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
> struct task_struct *target, long type, int mode);
> -int security_shm_alloc(struct shmid_kernel *shp);
> -void security_shm_free(struct shmid_kernel *shp);
> -int security_shm_associate(struct shmid_kernel *shp, int shmflg);
> -int security_shm_shmctl(struct shmid_kernel *shp, int cmd);
> -int security_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr, int shmflg);
> +int security_shm_alloc(struct kern_ipc_perm *shp);
> +void security_shm_free(struct kern_ipc_perm *shp);
> +int security_shm_associate(struct kern_ipc_perm *shp, int shmflg);
> +int security_shm_shmctl(struct kern_ipc_perm *shp, int cmd);
> +int security_shm_shmat(struct kern_ipc_perm *shp, char __user *shmaddr, int shmflg);
> int security_sem_alloc(struct kern_ipc_perm *sma);
> void security_sem_free(struct kern_ipc_perm *sma);
> int security_sem_associate(struct kern_ipc_perm *sma, int semflg);
> @@ -1077,26 +1076,26 @@ static inline int security_msg_queue_msgrcv(struct msg_queue *msq,
> return 0;
> }
>
> -static inline int security_shm_alloc(struct shmid_kernel *shp)
> +static inline int security_shm_alloc(struct kern_ipc_perm *shp)
> {
> return 0;
> }
>
> -static inline void security_shm_free(struct shmid_kernel *shp)
> +static inline void security_shm_free(struct kern_ipc_perm *shp)
> { }
>
> -static inline int security_shm_associate(struct shmid_kernel *shp,
> +static inline int security_shm_associate(struct kern_ipc_perm *shp,
> int shmflg)
> {
> return 0;
> }
>
> -static inline int security_shm_shmctl(struct shmid_kernel *shp, int cmd)
> +static inline int security_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
> {
> return 0;
> }
>
> -static inline int security_shm_shmat(struct shmid_kernel *shp,
> +static inline int security_shm_shmat(struct kern_ipc_perm *shp,
> char __user *shmaddr, int shmflg)
> {
> return 0;
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 4643865e9171..387a786e7be1 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -181,7 +181,7 @@ static void shm_rcu_free(struct rcu_head *head)
> rcu);
> struct shmid_kernel *shp = container_of(ptr, struct shmid_kernel,
> shm_perm);
> - security_shm_free(shp);
> + security_shm_free(&shp->shm_perm);
> kvfree(shp);
> }
>
> @@ -554,7 +554,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> shp->mlock_user = NULL;
>
> shp->shm_perm.security = NULL;
> - error = security_shm_alloc(shp);
> + error = security_shm_alloc(&shp->shm_perm);
> if (error) {
> kvfree(shp);
> return error;
> @@ -635,10 +635,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> */
> static inline int shm_security(struct kern_ipc_perm *ipcp, int shmflg)
> {
> - struct shmid_kernel *shp;
> -
> - shp = container_of(ipcp, struct shmid_kernel, shm_perm);
> - return security_shm_associate(shp, shmflg);
> + return security_shm_associate(ipcp, shmflg);
> }
>
> /*
> @@ -835,7 +832,7 @@ static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
>
> shp = container_of(ipcp, struct shmid_kernel, shm_perm);
>
> - err = security_shm_shmctl(shp, cmd);
> + err = security_shm_shmctl(&shp->shm_perm, cmd);
> if (err)
> goto out_unlock1;
>
> @@ -934,7 +931,7 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
> if (ipcperms(ns, &shp->shm_perm, S_IRUGO))
> goto out_unlock;
>
> - err = security_shm_shmctl(shp, cmd);
> + err = security_shm_shmctl(&shp->shm_perm, cmd);
> if (err)
> goto out_unlock;
>
> @@ -978,7 +975,7 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
> }
>
> audit_ipc_obj(&(shp->shm_perm));
> - err = security_shm_shmctl(shp, cmd);
> + err = security_shm_shmctl(&shp->shm_perm, cmd);
> if (err)
> goto out_unlock1;
>
> @@ -1348,7 +1345,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
> if (ipcperms(ns, &shp->shm_perm, acc_mode))
> goto out_unlock;
>
> - err = security_shm_shmat(shp, shmaddr, shmflg);
> + err = security_shm_shmat(&shp->shm_perm, shmaddr, shmflg);
> if (err)
> goto out_unlock;
>
> diff --git a/security/security.c b/security/security.c
> index d3b9aeb6b73b..77b69bd6f234 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1195,27 +1195,27 @@ int security_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
> return call_int_hook(msg_queue_msgrcv, 0, msq, msg, target, type, mode);
> }
>
> -int security_shm_alloc(struct shmid_kernel *shp)
> +int security_shm_alloc(struct kern_ipc_perm *shp)
> {
> return call_int_hook(shm_alloc_security, 0, shp);
> }
>
> -void security_shm_free(struct shmid_kernel *shp)
> +void security_shm_free(struct kern_ipc_perm *shp)
> {
> call_void_hook(shm_free_security, shp);
> }
>
> -int security_shm_associate(struct shmid_kernel *shp, int shmflg)
> +int security_shm_associate(struct kern_ipc_perm *shp, int shmflg)
> {
> return call_int_hook(shm_associate, 0, shp, shmflg);
> }
>
> -int security_shm_shmctl(struct shmid_kernel *shp, int cmd)
> +int security_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
> {
> return call_int_hook(shm_shmctl, 0, shp, cmd);
> }
>
> -int security_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr, int shmflg)
> +int security_shm_shmat(struct kern_ipc_perm *shp, char __user *shmaddr, int shmflg)
> {
> return call_int_hook(shm_shmat, 0, shp, shmaddr, shmflg);
> }
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index cce994e9fc0a..14f9e6c08273 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -5674,53 +5674,53 @@ static int selinux_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
> }
>
> /* Shared Memory security operations */
> -static int selinux_shm_alloc_security(struct shmid_kernel *shp)
> +static int selinux_shm_alloc_security(struct kern_ipc_perm *shp)
> {
> struct ipc_security_struct *isec;
> struct common_audit_data ad;
> u32 sid = current_sid();
> int rc;
>
> - rc = ipc_alloc_security(&shp->shm_perm, SECCLASS_SHM);
> + rc = ipc_alloc_security(shp, SECCLASS_SHM);
> if (rc)
> return rc;
>
> - isec = shp->shm_perm.security;
> + isec = shp->security;
>
> ad.type = LSM_AUDIT_DATA_IPC;
> - ad.u.ipc_id = shp->shm_perm.key;
> + ad.u.ipc_id = shp->key;
>
> rc = avc_has_perm(sid, isec->sid, SECCLASS_SHM,
> SHM__CREATE, &ad);
> if (rc) {
> - ipc_free_security(&shp->shm_perm);
> + ipc_free_security(shp);
> return rc;
> }
> return 0;
> }
>
> -static void selinux_shm_free_security(struct shmid_kernel *shp)
> +static void selinux_shm_free_security(struct kern_ipc_perm *shp)
> {
> - ipc_free_security(&shp->shm_perm);
> + ipc_free_security(shp);
> }
>
> -static int selinux_shm_associate(struct shmid_kernel *shp, int shmflg)
> +static int selinux_shm_associate(struct kern_ipc_perm *shp, int shmflg)
> {
> struct ipc_security_struct *isec;
> struct common_audit_data ad;
> u32 sid = current_sid();
>
> - isec = shp->shm_perm.security;
> + isec = shp->security;
>
> ad.type = LSM_AUDIT_DATA_IPC;
> - ad.u.ipc_id = shp->shm_perm.key;
> + ad.u.ipc_id = shp->key;
>
> return avc_has_perm(sid, isec->sid, SECCLASS_SHM,
> SHM__ASSOCIATE, &ad);
> }
>
> /* Note, at this point, shp is locked down */
> -static int selinux_shm_shmctl(struct shmid_kernel *shp, int cmd)
> +static int selinux_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
> {
> int perms;
> int err;
> @@ -5749,11 +5749,11 @@ static int selinux_shm_shmctl(struct shmid_kernel *shp, int cmd)
> return 0;
> }
>
> - err = ipc_has_perm(&shp->shm_perm, perms);
> + err = ipc_has_perm(shp, perms);
> return err;
> }
>
> -static int selinux_shm_shmat(struct shmid_kernel *shp,
> +static int selinux_shm_shmat(struct kern_ipc_perm *shp,
> char __user *shmaddr, int shmflg)
> {
> u32 perms;
> @@ -5763,7 +5763,7 @@ static int selinux_shm_shmat(struct shmid_kernel *shp,
> else
> perms = SHM__READ | SHM__WRITE;
>
> - return ipc_has_perm(&shp->shm_perm, perms);
> + return ipc_has_perm(shp, perms);
> }
>
> /* Semaphore security operations */
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index 0402b8c1aec1..a3398c7f32c9 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -2950,9 +2950,9 @@ static void smack_msg_msg_free_security(struct msg_msg *msg)
> *
> * Returns a pointer to the smack value
> */
> -static struct smack_known *smack_of_shm(struct shmid_kernel *shp)
> +static struct smack_known *smack_of_shm(struct kern_ipc_perm *shp)
> {
> - return (struct smack_known *)shp->shm_perm.security;
> + return (struct smack_known *)shp->security;
> }
>
> /**
> @@ -2961,9 +2961,9 @@ static struct smack_known *smack_of_shm(struct shmid_kernel *shp)
> *
> * Returns 0
> */
> -static int smack_shm_alloc_security(struct shmid_kernel *shp)
> +static int smack_shm_alloc_security(struct kern_ipc_perm *shp)
> {
> - struct kern_ipc_perm *isp = &shp->shm_perm;
> + struct kern_ipc_perm *isp = shp;
> struct smack_known *skp = smk_of_current();
>
> isp->security = skp;
> @@ -2976,9 +2976,9 @@ static int smack_shm_alloc_security(struct shmid_kernel *shp)
> *
> * Clears the blob pointer
> */
> -static void smack_shm_free_security(struct shmid_kernel *shp)
> +static void smack_shm_free_security(struct kern_ipc_perm *shp)
> {
> - struct kern_ipc_perm *isp = &shp->shm_perm;
> + struct kern_ipc_perm *isp = shp;
>
> isp->security = NULL;
> }
> @@ -2990,7 +2990,7 @@ static void smack_shm_free_security(struct shmid_kernel *shp)
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smk_curacc_shm(struct shmid_kernel *shp, int access)
> +static int smk_curacc_shm(struct kern_ipc_perm *shp, int access)
> {
> struct smack_known *ssp = smack_of_shm(shp);
> struct smk_audit_info ad;
> @@ -2998,7 +2998,7 @@ static int smk_curacc_shm(struct shmid_kernel *shp, int access)
>
> #ifdef CONFIG_AUDIT
> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
> - ad.a.u.ipc_id = shp->shm_perm.id;
> + ad.a.u.ipc_id = shp->id;
> #endif
> rc = smk_curacc(ssp, access, &ad);
> rc = smk_bu_current("shm", ssp, access, rc);
> @@ -3012,7 +3012,7 @@ static int smk_curacc_shm(struct shmid_kernel *shp, int access)
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_shm_associate(struct shmid_kernel *shp, int shmflg)
> +static int smack_shm_associate(struct kern_ipc_perm *shp, int shmflg)
> {
> int may;
>
> @@ -3027,7 +3027,7 @@ static int smack_shm_associate(struct shmid_kernel *shp, int shmflg)
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_shm_shmctl(struct shmid_kernel *shp, int cmd)
> +static int smack_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
> {
> int may;
>
> @@ -3062,7 +3062,7 @@ static int smack_shm_shmctl(struct shmid_kernel *shp, int cmd)
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr,
> +static int smack_shm_shmat(struct kern_ipc_perm *shp, char __user *shmaddr,
> int shmflg)
> {
> int may;
On 3/23/2018 12:16 PM, Eric W. Biederman wrote:
> All of the implementations of security hooks that take msg_queue only
> access q_perm the struct kern_ipc_perm member. This means the
> dependencies of the msg_queue security hooks can be simplified by
> passing the kern_ipc_perm member of msg_queue.
>
> Making this change will allow struct msg_queue to become private to
> ipc/msg.c.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> include/linux/lsm_hooks.h | 12 ++++++------
> include/linux/security.h | 25 ++++++++++++-------------
> ipc/msg.c | 18 ++++++++----------
> security/security.c | 12 ++++++------
> security/selinux/hooks.c | 36 ++++++++++++++++++------------------
> security/smack/smack_lsm.c | 24 ++++++++++++------------
Can I reference the comments I made in PATCH 01 of this set
regarding the Smack changes? The problem in all of your changes
is the same. You aren't preserving the naming conventions, and
you've left in some code that is just silly.
> 6 files changed, 62 insertions(+), 65 deletions(-)
>
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index cac7a8082c43..bde167fa2c51 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1575,13 +1575,13 @@ union security_list_options {
> int (*msg_msg_alloc_security)(struct msg_msg *msg);
> void (*msg_msg_free_security)(struct msg_msg *msg);
>
> - int (*msg_queue_alloc_security)(struct msg_queue *msq);
> - void (*msg_queue_free_security)(struct msg_queue *msq);
> - int (*msg_queue_associate)(struct msg_queue *msq, int msqflg);
> - int (*msg_queue_msgctl)(struct msg_queue *msq, int cmd);
> - int (*msg_queue_msgsnd)(struct msg_queue *msq, struct msg_msg *msg,
> + int (*msg_queue_alloc_security)(struct kern_ipc_perm *msq);
> + void (*msg_queue_free_security)(struct kern_ipc_perm *msq);
> + int (*msg_queue_associate)(struct kern_ipc_perm *msq, int msqflg);
> + int (*msg_queue_msgctl)(struct kern_ipc_perm *msq, int cmd);
> + int (*msg_queue_msgsnd)(struct kern_ipc_perm *msq, struct msg_msg *msg,
> int msqflg);
> - int (*msg_queue_msgrcv)(struct msg_queue *msq, struct msg_msg *msg,
> + int (*msg_queue_msgrcv)(struct kern_ipc_perm *msq, struct msg_msg *msg,
> struct task_struct *target, long type,
> int mode);
>
> diff --git a/include/linux/security.h b/include/linux/security.h
> index f390755808ea..128e1e4a5346 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -50,7 +50,6 @@ struct iattr;
> struct fown_struct;
> struct file_operations;
> struct msg_msg;
> -struct msg_queue;
> struct xattr;
> struct xfrm_sec_ctx;
> struct mm_struct;
> @@ -353,13 +352,13 @@ int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag);
> void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid);
> int security_msg_msg_alloc(struct msg_msg *msg);
> void security_msg_msg_free(struct msg_msg *msg);
> -int security_msg_queue_alloc(struct msg_queue *msq);
> -void security_msg_queue_free(struct msg_queue *msq);
> -int security_msg_queue_associate(struct msg_queue *msq, int msqflg);
> -int security_msg_queue_msgctl(struct msg_queue *msq, int cmd);
> -int security_msg_queue_msgsnd(struct msg_queue *msq,
> +int security_msg_queue_alloc(struct kern_ipc_perm *msq);
> +void security_msg_queue_free(struct kern_ipc_perm *msq);
> +int security_msg_queue_associate(struct kern_ipc_perm *msq, int msqflg);
> +int security_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd);
> +int security_msg_queue_msgsnd(struct kern_ipc_perm *msq,
> struct msg_msg *msg, int msqflg);
> -int security_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
> +int security_msg_queue_msgrcv(struct kern_ipc_perm *msq, struct msg_msg *msg,
> struct task_struct *target, long type, int mode);
> int security_shm_alloc(struct kern_ipc_perm *shp);
> void security_shm_free(struct kern_ipc_perm *shp);
> @@ -1043,32 +1042,32 @@ static inline int security_msg_msg_alloc(struct msg_msg *msg)
> static inline void security_msg_msg_free(struct msg_msg *msg)
> { }
>
> -static inline int security_msg_queue_alloc(struct msg_queue *msq)
> +static inline int security_msg_queue_alloc(struct kern_ipc_perm *msq)
> {
> return 0;
> }
>
> -static inline void security_msg_queue_free(struct msg_queue *msq)
> +static inline void security_msg_queue_free(struct kern_ipc_perm *msq)
> { }
>
> -static inline int security_msg_queue_associate(struct msg_queue *msq,
> +static inline int security_msg_queue_associate(struct kern_ipc_perm *msq,
> int msqflg)
> {
> return 0;
> }
>
> -static inline int security_msg_queue_msgctl(struct msg_queue *msq, int cmd)
> +static inline int security_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
> {
> return 0;
> }
>
> -static inline int security_msg_queue_msgsnd(struct msg_queue *msq,
> +static inline int security_msg_queue_msgsnd(struct kern_ipc_perm *msq,
> struct msg_msg *msg, int msqflg)
> {
> return 0;
> }
>
> -static inline int security_msg_queue_msgrcv(struct msg_queue *msq,
> +static inline int security_msg_queue_msgrcv(struct kern_ipc_perm *msq,
> struct msg_msg *msg,
> struct task_struct *target,
> long type, int mode)
> diff --git a/ipc/msg.c b/ipc/msg.c
> index 0dcc6699dc53..cdfab0825fce 100644
> --- a/ipc/msg.c
> +++ b/ipc/msg.c
> @@ -101,7 +101,7 @@ static void msg_rcu_free(struct rcu_head *head)
> struct kern_ipc_perm *p = container_of(head, struct kern_ipc_perm, rcu);
> struct msg_queue *msq = container_of(p, struct msg_queue, q_perm);
>
> - security_msg_queue_free(msq);
> + security_msg_queue_free(&msq->q_perm);
> kvfree(msq);
> }
>
> @@ -127,7 +127,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
> msq->q_perm.key = key;
>
> msq->q_perm.security = NULL;
> - retval = security_msg_queue_alloc(msq);
> + retval = security_msg_queue_alloc(&msq->q_perm);
> if (retval) {
> kvfree(msq);
> return retval;
> @@ -258,9 +258,7 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
> */
> static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
> {
> - struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
> -
> - return security_msg_queue_associate(msq, msgflg);
> + return security_msg_queue_associate(ipcp, msgflg);
> }
>
> SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
> @@ -380,7 +378,7 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
>
> msq = container_of(ipcp, struct msg_queue, q_perm);
>
> - err = security_msg_queue_msgctl(msq, cmd);
> + err = security_msg_queue_msgctl(&msq->q_perm, cmd);
> if (err)
> goto out_unlock1;
>
> @@ -502,7 +500,7 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
> if (ipcperms(ns, &msq->q_perm, S_IRUGO))
> goto out_unlock;
>
> - err = security_msg_queue_msgctl(msq, cmd);
> + err = security_msg_queue_msgctl(&msq->q_perm, cmd);
> if (err)
> goto out_unlock;
>
> @@ -718,7 +716,7 @@ static inline int pipelined_send(struct msg_queue *msq, struct msg_msg *msg,
>
> list_for_each_entry_safe(msr, t, &msq->q_receivers, r_list) {
> if (testmsg(msg, msr->r_msgtype, msr->r_mode) &&
> - !security_msg_queue_msgrcv(msq, msg, msr->r_tsk,
> + !security_msg_queue_msgrcv(&msq->q_perm, msg, msr->r_tsk,
> msr->r_msgtype, msr->r_mode)) {
>
> list_del(&msr->r_list);
> @@ -784,7 +782,7 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
> goto out_unlock0;
> }
>
> - err = security_msg_queue_msgsnd(msq, msg, msgflg);
> + err = security_msg_queue_msgsnd(&msq->q_perm, msg, msgflg);
> if (err)
> goto out_unlock0;
>
> @@ -960,7 +958,7 @@ static struct msg_msg *find_msg(struct msg_queue *msq, long *msgtyp, int mode)
>
> list_for_each_entry(msg, &msq->q_messages, m_list) {
> if (testmsg(msg, *msgtyp, mode) &&
> - !security_msg_queue_msgrcv(msq, msg, current,
> + !security_msg_queue_msgrcv(&msq->q_perm, msg, current,
> *msgtyp, mode)) {
> if (mode == SEARCH_LESSEQUAL && msg->m_type != 1) {
> *msgtyp = msg->m_type - 1;
> diff --git a/security/security.c b/security/security.c
> index 77b69bd6f234..02d734e69955 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1163,33 +1163,33 @@ void security_msg_msg_free(struct msg_msg *msg)
> call_void_hook(msg_msg_free_security, msg);
> }
>
> -int security_msg_queue_alloc(struct msg_queue *msq)
> +int security_msg_queue_alloc(struct kern_ipc_perm *msq)
> {
> return call_int_hook(msg_queue_alloc_security, 0, msq);
> }
>
> -void security_msg_queue_free(struct msg_queue *msq)
> +void security_msg_queue_free(struct kern_ipc_perm *msq)
> {
> call_void_hook(msg_queue_free_security, msq);
> }
>
> -int security_msg_queue_associate(struct msg_queue *msq, int msqflg)
> +int security_msg_queue_associate(struct kern_ipc_perm *msq, int msqflg)
> {
> return call_int_hook(msg_queue_associate, 0, msq, msqflg);
> }
>
> -int security_msg_queue_msgctl(struct msg_queue *msq, int cmd)
> +int security_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
> {
> return call_int_hook(msg_queue_msgctl, 0, msq, cmd);
> }
>
> -int security_msg_queue_msgsnd(struct msg_queue *msq,
> +int security_msg_queue_msgsnd(struct kern_ipc_perm *msq,
> struct msg_msg *msg, int msqflg)
> {
> return call_int_hook(msg_queue_msgsnd, 0, msq, msg, msqflg);
> }
>
> -int security_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
> +int security_msg_queue_msgrcv(struct kern_ipc_perm *msq, struct msg_msg *msg,
> struct task_struct *target, long type, int mode)
> {
> return call_int_hook(msg_queue_msgrcv, 0, msq, msg, target, type, mode);
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 14f9e6c08273..925e546b5a87 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -5532,52 +5532,52 @@ static void selinux_msg_msg_free_security(struct msg_msg *msg)
> }
>
> /* message queue security operations */
> -static int selinux_msg_queue_alloc_security(struct msg_queue *msq)
> +static int selinux_msg_queue_alloc_security(struct kern_ipc_perm *msq)
> {
> struct ipc_security_struct *isec;
> struct common_audit_data ad;
> u32 sid = current_sid();
> int rc;
>
> - rc = ipc_alloc_security(&msq->q_perm, SECCLASS_MSGQ);
> + rc = ipc_alloc_security(msq, SECCLASS_MSGQ);
> if (rc)
> return rc;
>
> - isec = msq->q_perm.security;
> + isec = msq->security;
>
> ad.type = LSM_AUDIT_DATA_IPC;
> - ad.u.ipc_id = msq->q_perm.key;
> + ad.u.ipc_id = msq->key;
>
> rc = avc_has_perm(sid, isec->sid, SECCLASS_MSGQ,
> MSGQ__CREATE, &ad);
> if (rc) {
> - ipc_free_security(&msq->q_perm);
> + ipc_free_security(msq);
> return rc;
> }
> return 0;
> }
>
> -static void selinux_msg_queue_free_security(struct msg_queue *msq)
> +static void selinux_msg_queue_free_security(struct kern_ipc_perm *msq)
> {
> - ipc_free_security(&msq->q_perm);
> + ipc_free_security(msq);
> }
>
> -static int selinux_msg_queue_associate(struct msg_queue *msq, int msqflg)
> +static int selinux_msg_queue_associate(struct kern_ipc_perm *msq, int msqflg)
> {
> struct ipc_security_struct *isec;
> struct common_audit_data ad;
> u32 sid = current_sid();
>
> - isec = msq->q_perm.security;
> + isec = msq->security;
>
> ad.type = LSM_AUDIT_DATA_IPC;
> - ad.u.ipc_id = msq->q_perm.key;
> + ad.u.ipc_id = msq->key;
>
> return avc_has_perm(sid, isec->sid, SECCLASS_MSGQ,
> MSGQ__ASSOCIATE, &ad);
> }
>
> -static int selinux_msg_queue_msgctl(struct msg_queue *msq, int cmd)
> +static int selinux_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
> {
> int err;
> int perms;
> @@ -5602,11 +5602,11 @@ static int selinux_msg_queue_msgctl(struct msg_queue *msq, int cmd)
> return 0;
> }
>
> - err = ipc_has_perm(&msq->q_perm, perms);
> + err = ipc_has_perm(msq, perms);
> return err;
> }
>
> -static int selinux_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg, int msqflg)
> +static int selinux_msg_queue_msgsnd(struct kern_ipc_perm *msq, struct msg_msg *msg, int msqflg)
> {
> struct ipc_security_struct *isec;
> struct msg_security_struct *msec;
> @@ -5614,7 +5614,7 @@ static int selinux_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
> u32 sid = current_sid();
> int rc;
>
> - isec = msq->q_perm.security;
> + isec = msq->security;
> msec = msg->security;
>
> /*
> @@ -5632,7 +5632,7 @@ static int selinux_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
> }
>
> ad.type = LSM_AUDIT_DATA_IPC;
> - ad.u.ipc_id = msq->q_perm.key;
> + ad.u.ipc_id = msq->key;
>
> /* Can this process write to the queue? */
> rc = avc_has_perm(sid, isec->sid, SECCLASS_MSGQ,
> @@ -5649,7 +5649,7 @@ static int selinux_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
> return rc;
> }
>
> -static int selinux_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
> +static int selinux_msg_queue_msgrcv(struct kern_ipc_perm *msq, struct msg_msg *msg,
> struct task_struct *target,
> long type, int mode)
> {
> @@ -5659,11 +5659,11 @@ static int selinux_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
> u32 sid = task_sid(target);
> int rc;
>
> - isec = msq->q_perm.security;
> + isec = msq->security;
> msec = msg->security;
>
> ad.type = LSM_AUDIT_DATA_IPC;
> - ad.u.ipc_id = msq->q_perm.key;
> + ad.u.ipc_id = msq->key;
>
> rc = avc_has_perm(sid, isec->sid,
> SECCLASS_MSGQ, MSGQ__READ, &ad);
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index a3398c7f32c9..d960c2ea8d79 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -3210,9 +3210,9 @@ static int smack_sem_semop(struct kern_ipc_perm *sma, struct sembuf *sops,
> *
> * Returns 0
> */
> -static int smack_msg_queue_alloc_security(struct msg_queue *msq)
> +static int smack_msg_queue_alloc_security(struct kern_ipc_perm *msq)
> {
> - struct kern_ipc_perm *kisp = &msq->q_perm;
> + struct kern_ipc_perm *kisp = msq;
> struct smack_known *skp = smk_of_current();
>
> kisp->security = skp;
> @@ -3225,9 +3225,9 @@ static int smack_msg_queue_alloc_security(struct msg_queue *msq)
> *
> * Clears the blob pointer
> */
> -static void smack_msg_queue_free_security(struct msg_queue *msq)
> +static void smack_msg_queue_free_security(struct kern_ipc_perm *msq)
> {
> - struct kern_ipc_perm *kisp = &msq->q_perm;
> + struct kern_ipc_perm *kisp = msq;
>
> kisp->security = NULL;
> }
> @@ -3238,9 +3238,9 @@ static void smack_msg_queue_free_security(struct msg_queue *msq)
> *
> * Returns a pointer to the smack label entry
> */
> -static struct smack_known *smack_of_msq(struct msg_queue *msq)
> +static struct smack_known *smack_of_msq(struct kern_ipc_perm *msq)
> {
> - return (struct smack_known *)msq->q_perm.security;
> + return (struct smack_known *)msq->security;
> }
>
> /**
> @@ -3250,7 +3250,7 @@ static struct smack_known *smack_of_msq(struct msg_queue *msq)
> *
> * return 0 if current has access, error otherwise
> */
> -static int smk_curacc_msq(struct msg_queue *msq, int access)
> +static int smk_curacc_msq(struct kern_ipc_perm *msq, int access)
> {
> struct smack_known *msp = smack_of_msq(msq);
> struct smk_audit_info ad;
> @@ -3258,7 +3258,7 @@ static int smk_curacc_msq(struct msg_queue *msq, int access)
>
> #ifdef CONFIG_AUDIT
> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
> - ad.a.u.ipc_id = msq->q_perm.id;
> + ad.a.u.ipc_id = msq->id;
> #endif
> rc = smk_curacc(msp, access, &ad);
> rc = smk_bu_current("msq", msp, access, rc);
> @@ -3272,7 +3272,7 @@ static int smk_curacc_msq(struct msg_queue *msq, int access)
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_msg_queue_associate(struct msg_queue *msq, int msqflg)
> +static int smack_msg_queue_associate(struct kern_ipc_perm *msq, int msqflg)
> {
> int may;
>
> @@ -3287,7 +3287,7 @@ static int smack_msg_queue_associate(struct msg_queue *msq, int msqflg)
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_msg_queue_msgctl(struct msg_queue *msq, int cmd)
> +static int smack_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
> {
> int may;
>
> @@ -3321,7 +3321,7 @@ static int smack_msg_queue_msgctl(struct msg_queue *msq, int cmd)
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
> +static int smack_msg_queue_msgsnd(struct kern_ipc_perm *msq, struct msg_msg *msg,
> int msqflg)
> {
> int may;
> @@ -3340,7 +3340,7 @@ static int smack_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
> *
> * Returns 0 if current has read and write access, error code otherwise
> */
> -static int smack_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
> +static int smack_msg_queue_msgrcv(struct kern_ipc_perm *msq, struct msg_msg *msg,
> struct task_struct *target, long type, int mode)
> {
> return smk_curacc_msq(msq, MAY_READWRITE);
Casey Schaufler <[email protected]> writes:
> On 3/23/2018 12:16 PM, Eric W. Biederman wrote:
>> All of the implementations of security hooks that take msg_queue only
>> access q_perm the struct kern_ipc_perm member. This means the
>> dependencies of the msg_queue security hooks can be simplified by
>> passing the kern_ipc_perm member of msg_queue.
>>
>> Making this change will allow struct msg_queue to become private to
>> ipc/msg.c.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> include/linux/lsm_hooks.h | 12 ++++++------
>> include/linux/security.h | 25 ++++++++++++-------------
>> ipc/msg.c | 18 ++++++++----------
>> security/security.c | 12 ++++++------
>> security/selinux/hooks.c | 36 ++++++++++++++++++------------------
>> security/smack/smack_lsm.c | 24 ++++++++++++------------
>
> Can I reference the comments I made in PATCH 01 of this set
> regarding the Smack changes? The problem in all of your changes
> is the same. You aren't preserving the naming conventions, and
> you've left in some code that is just silly.
Being silly like that is actually important to make a sweeping patch
like that boring and trivial to show that it is correct. Anything
that is not a rule based transformation is much more likely to hide
a bug. So for the push down of the type change I think it was the right
way to go.
That said I am happy to add a clean up patch that makes the obvious
cleanups and simplifications to smack_lsm.c.
Eric
After the last round of cleanups the shm, sem, and msg associate
operations just became trivial wrappers around the appropriate security
method. Simplify things further by just calling the security method
directly.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
ipc/msg.c | 10 +---------
ipc/sem.c | 10 +---------
ipc/shm.c | 10 +---------
3 files changed, 3 insertions(+), 27 deletions(-)
diff --git a/ipc/msg.c b/ipc/msg.c
index 825ad585a6ff..d667dd8e97ab 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -272,20 +272,12 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
ipc_rcu_putref(&msq->q_perm, msg_rcu_free);
}
-/*
- * Called with msg_ids.rwsem and ipcp locked.
- */
-static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
-{
- return security_msg_queue_associate(ipcp, msgflg);
-}
-
SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
{
struct ipc_namespace *ns;
static const struct ipc_ops msg_ops = {
.getnew = newque,
- .associate = msg_security,
+ .associate = security_msg_queue_associate,
};
struct ipc_params msg_params;
diff --git a/ipc/sem.c b/ipc/sem.c
index 47b263960524..09d54af076a4 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -564,14 +564,6 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
}
-/*
- * Called with sem_ids.rwsem and ipcp locked.
- */
-static inline int sem_security(struct kern_ipc_perm *ipcp, int semflg)
-{
- return security_sem_associate(ipcp, semflg);
-}
-
/*
* Called with sem_ids.rwsem and ipcp locked.
*/
@@ -592,7 +584,7 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
struct ipc_namespace *ns;
static const struct ipc_ops sem_ops = {
.getnew = newary,
- .associate = sem_security,
+ .associate = security_sem_associate,
.more_checks = sem_more_checks,
};
struct ipc_params sem_params;
diff --git a/ipc/shm.c b/ipc/shm.c
index 932b7e411c6c..018db3d0e70e 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -656,14 +656,6 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
return error;
}
-/*
- * Called with shm_ids.rwsem and ipcp locked.
- */
-static inline int shm_security(struct kern_ipc_perm *ipcp, int shmflg)
-{
- return security_shm_associate(ipcp, shmflg);
-}
-
/*
* Called with shm_ids.rwsem and ipcp locked.
*/
@@ -684,7 +676,7 @@ SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
struct ipc_namespace *ns;
static const struct ipc_ops shm_ops = {
.getnew = newseg,
- .associate = shm_security,
+ .associate = security_shm_associate,
.more_checks = shm_more_checks,
};
struct ipc_params shm_params;
--
2.14.1
Rename the variables shp, sma, msq to isp. As that is how the code already
refers to those variables.
Collapse smack_of_shm, smack_of_sem, and smack_of_msq into smack_of_ipc,
as the three functions had become completely identical.
Collapse smack_shm_alloc_security, smack_sem_alloc_security and
smack_msg_queue_alloc_security into smack_ipc_alloc_security as the
three functions had become identical.
Collapse smack_shm_free_security, smack_sem_free_security and
smack_msg_queue_free_security into smack_ipc_free_security as the three
functions had become identical.
Requested-by: Casey Schaufler <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
security/smack/smack_lsm.c | 197 +++++++++++++--------------------------------
1 file changed, 58 insertions(+), 139 deletions(-)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index d960c2ea8d79..0735b8db158b 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -2945,25 +2945,24 @@ static void smack_msg_msg_free_security(struct msg_msg *msg)
}
/**
- * smack_of_shm - the smack pointer for the shm
- * @shp: the object
+ * smack_of_ipc - the smack pointer for the ipc
+ * @isp: the object
*
* Returns a pointer to the smack value
*/
-static struct smack_known *smack_of_shm(struct kern_ipc_perm *shp)
+static struct smack_known *smack_of_ipc(struct kern_ipc_perm *isp)
{
- return (struct smack_known *)shp->security;
+ return (struct smack_known *)isp->security;
}
/**
- * smack_shm_alloc_security - Set the security blob for shm
- * @shp: the object
+ * smack_ipc_alloc_security - Set the security blob for ipc
+ * @isp: the object
*
* Returns 0
*/
-static int smack_shm_alloc_security(struct kern_ipc_perm *shp)
+static int smack_ipc_alloc_security(struct kern_ipc_perm *isp)
{
- struct kern_ipc_perm *isp = shp;
struct smack_known *skp = smk_of_current();
isp->security = skp;
@@ -2971,34 +2970,32 @@ static int smack_shm_alloc_security(struct kern_ipc_perm *shp)
}
/**
- * smack_shm_free_security - Clear the security blob for shm
- * @shp: the object
+ * smack_ipc_free_security - Clear the security blob for ipc
+ * @isp: the object
*
* Clears the blob pointer
*/
-static void smack_shm_free_security(struct kern_ipc_perm *shp)
+static void smack_ipc_free_security(struct kern_ipc_perm *isp)
{
- struct kern_ipc_perm *isp = shp;
-
isp->security = NULL;
}
/**
* smk_curacc_shm : check if current has access on shm
- * @shp : the object
+ * @isp : the object
* @access : access requested
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smk_curacc_shm(struct kern_ipc_perm *shp, int access)
+static int smk_curacc_shm(struct kern_ipc_perm *isp, int access)
{
- struct smack_known *ssp = smack_of_shm(shp);
+ struct smack_known *ssp = smack_of_ipc(isp);
struct smk_audit_info ad;
int rc;
#ifdef CONFIG_AUDIT
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
- ad.a.u.ipc_id = shp->id;
+ ad.a.u.ipc_id = isp->id;
#endif
rc = smk_curacc(ssp, access, &ad);
rc = smk_bu_current("shm", ssp, access, rc);
@@ -3007,27 +3004,27 @@ static int smk_curacc_shm(struct kern_ipc_perm *shp, int access)
/**
* smack_shm_associate - Smack access check for shm
- * @shp: the object
+ * @isp: the object
* @shmflg: access requested
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_shm_associate(struct kern_ipc_perm *shp, int shmflg)
+static int smack_shm_associate(struct kern_ipc_perm *isp, int shmflg)
{
int may;
may = smack_flags_to_may(shmflg);
- return smk_curacc_shm(shp, may);
+ return smk_curacc_shm(isp, may);
}
/**
* smack_shm_shmctl - Smack access check for shm
- * @shp: the object
+ * @isp: the object
* @cmd: what it wants to do
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
+static int smack_shm_shmctl(struct kern_ipc_perm *isp, int cmd)
{
int may;
@@ -3051,81 +3048,42 @@ static int smack_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
default:
return -EINVAL;
}
- return smk_curacc_shm(shp, may);
+ return smk_curacc_shm(isp, may);
}
/**
* smack_shm_shmat - Smack access for shmat
- * @shp: the object
+ * @isp: the object
* @shmaddr: unused
* @shmflg: access requested
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_shm_shmat(struct kern_ipc_perm *shp, char __user *shmaddr,
+static int smack_shm_shmat(struct kern_ipc_perm *ipc, char __user *shmaddr,
int shmflg)
{
int may;
may = smack_flags_to_may(shmflg);
- return smk_curacc_shm(shp, may);
-}
-
-/**
- * smack_of_sem - the smack pointer for the sem
- * @sma: the object
- *
- * Returns a pointer to the smack value
- */
-static struct smack_known *smack_of_sem(struct kern_ipc_perm *sma)
-{
- return (struct smack_known *)sma->security;
-}
-
-/**
- * smack_sem_alloc_security - Set the security blob for sem
- * @sma: the object
- *
- * Returns 0
- */
-static int smack_sem_alloc_security(struct kern_ipc_perm *sma)
-{
- struct kern_ipc_perm *isp = sma;
- struct smack_known *skp = smk_of_current();
-
- isp->security = skp;
- return 0;
-}
-
-/**
- * smack_sem_free_security - Clear the security blob for sem
- * @sma: the object
- *
- * Clears the blob pointer
- */
-static void smack_sem_free_security(struct kern_ipc_perm *sma)
-{
- struct kern_ipc_perm *isp = sma;
-
- isp->security = NULL;
+ return smk_curacc_shm(ipc, may);
}
/**
* smk_curacc_sem : check if current has access on sem
- * @sma : the object
+ * @isp : the object
* @access : access requested
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smk_curacc_sem(struct kern_ipc_perm *sma, int access)
+static int smk_curacc_sem(struct kern_ipc_perm *isp, int access)
{
- struct smack_known *ssp = smack_of_sem(sma);
+ struct smack_known *ssp = smack_of_ipc(isp);
struct smk_audit_info ad;
int rc;
#ifdef CONFIG_AUDIT
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
- ad.a.u.ipc_id = sma->id;
+ ad.a.u.ipc_id = isp->id;
#endif
rc = smk_curacc(ssp, access, &ad);
rc = smk_bu_current("sem", ssp, access, rc);
@@ -3134,27 +3092,27 @@ static int smk_curacc_sem(struct kern_ipc_perm *sma, int access)
/**
* smack_sem_associate - Smack access check for sem
- * @sma: the object
+ * @isp: the object
* @semflg: access requested
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_sem_associate(struct kern_ipc_perm *sma, int semflg)
+static int smack_sem_associate(struct kern_ipc_perm *isp, int semflg)
{
int may;
may = smack_flags_to_may(semflg);
- return smk_curacc_sem(sma, may);
+ return smk_curacc_sem(isp, may);
}
/**
* smack_sem_shmctl - Smack access check for sem
- * @sma: the object
+ * @isp: the object
* @cmd: what it wants to do
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_sem_semctl(struct kern_ipc_perm *sma, int cmd)
+static int smack_sem_semctl(struct kern_ipc_perm *isp, int cmd)
{
int may;
@@ -3184,12 +3142,12 @@ static int smack_sem_semctl(struct kern_ipc_perm *sma, int cmd)
return -EINVAL;
}
- return smk_curacc_sem(sma, may);
+ return smk_curacc_sem(isp, may);
}
/**
* smack_sem_semop - Smack checks of semaphore operations
- * @sma: the object
+ * @isp: the object
* @sops: unused
* @nsops: unused
* @alter: unused
@@ -3198,67 +3156,28 @@ static int smack_sem_semctl(struct kern_ipc_perm *sma, int cmd)
*
* Returns 0 if access is allowed, error code otherwise
*/
-static int smack_sem_semop(struct kern_ipc_perm *sma, struct sembuf *sops,
+static int smack_sem_semop(struct kern_ipc_perm *isp, struct sembuf *sops,
unsigned nsops, int alter)
{
- return smk_curacc_sem(sma, MAY_READWRITE);
-}
-
-/**
- * smack_msg_alloc_security - Set the security blob for msg
- * @msq: the object
- *
- * Returns 0
- */
-static int smack_msg_queue_alloc_security(struct kern_ipc_perm *msq)
-{
- struct kern_ipc_perm *kisp = msq;
- struct smack_known *skp = smk_of_current();
-
- kisp->security = skp;
- return 0;
-}
-
-/**
- * smack_msg_free_security - Clear the security blob for msg
- * @msq: the object
- *
- * Clears the blob pointer
- */
-static void smack_msg_queue_free_security(struct kern_ipc_perm *msq)
-{
- struct kern_ipc_perm *kisp = msq;
-
- kisp->security = NULL;
-}
-
-/**
- * smack_of_msq - the smack pointer for the msq
- * @msq: the object
- *
- * Returns a pointer to the smack label entry
- */
-static struct smack_known *smack_of_msq(struct kern_ipc_perm *msq)
-{
- return (struct smack_known *)msq->security;
+ return smk_curacc_sem(isp, MAY_READWRITE);
}
/**
* smk_curacc_msq : helper to check if current has access on msq
- * @msq : the msq
+ * @isp : the msq
* @access : access requested
*
* return 0 if current has access, error otherwise
*/
-static int smk_curacc_msq(struct kern_ipc_perm *msq, int access)
+static int smk_curacc_msq(struct kern_ipc_perm *isp, int access)
{
- struct smack_known *msp = smack_of_msq(msq);
+ struct smack_known *msp = smack_of_ipc(isp);
struct smk_audit_info ad;
int rc;
#ifdef CONFIG_AUDIT
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
- ad.a.u.ipc_id = msq->id;
+ ad.a.u.ipc_id = isp->id;
#endif
rc = smk_curacc(msp, access, &ad);
rc = smk_bu_current("msq", msp, access, rc);
@@ -3267,27 +3186,27 @@ static int smk_curacc_msq(struct kern_ipc_perm *msq, int access)
/**
* smack_msg_queue_associate - Smack access check for msg_queue
- * @msq: the object
+ * @isp: the object
* @msqflg: access requested
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_msg_queue_associate(struct kern_ipc_perm *msq, int msqflg)
+static int smack_msg_queue_associate(struct kern_ipc_perm *isp, int msqflg)
{
int may;
may = smack_flags_to_may(msqflg);
- return smk_curacc_msq(msq, may);
+ return smk_curacc_msq(isp, may);
}
/**
* smack_msg_queue_msgctl - Smack access check for msg_queue
- * @msq: the object
+ * @isp: the object
* @cmd: what it wants to do
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
+static int smack_msg_queue_msgctl(struct kern_ipc_perm *isp, int cmd)
{
int may;
@@ -3310,29 +3229,29 @@ static int smack_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
return -EINVAL;
}
- return smk_curacc_msq(msq, may);
+ return smk_curacc_msq(isp, may);
}
/**
* smack_msg_queue_msgsnd - Smack access check for msg_queue
- * @msq: the object
+ * @isp: the object
* @msg: unused
* @msqflg: access requested
*
* Returns 0 if current has the requested access, error code otherwise
*/
-static int smack_msg_queue_msgsnd(struct kern_ipc_perm *msq, struct msg_msg *msg,
+static int smack_msg_queue_msgsnd(struct kern_ipc_perm *isp, struct msg_msg *msg,
int msqflg)
{
int may;
may = smack_flags_to_may(msqflg);
- return smk_curacc_msq(msq, may);
+ return smk_curacc_msq(isp, may);
}
/**
* smack_msg_queue_msgsnd - Smack access check for msg_queue
- * @msq: the object
+ * @isp: the object
* @msg: unused
* @target: unused
* @type: unused
@@ -3340,10 +3259,10 @@ static int smack_msg_queue_msgsnd(struct kern_ipc_perm *msq, struct msg_msg *msg
*
* Returns 0 if current has read and write access, error code otherwise
*/
-static int smack_msg_queue_msgrcv(struct kern_ipc_perm *msq, struct msg_msg *msg,
+static int smack_msg_queue_msgrcv(struct kern_ipc_perm *isp, struct msg_msg *msg,
struct task_struct *target, long type, int mode)
{
- return smk_curacc_msq(msq, MAY_READWRITE);
+ return smk_curacc_msq(isp, MAY_READWRITE);
}
/**
@@ -4756,21 +4675,21 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(msg_msg_alloc_security, smack_msg_msg_alloc_security),
LSM_HOOK_INIT(msg_msg_free_security, smack_msg_msg_free_security),
- LSM_HOOK_INIT(msg_queue_alloc_security, smack_msg_queue_alloc_security),
- LSM_HOOK_INIT(msg_queue_free_security, smack_msg_queue_free_security),
+ LSM_HOOK_INIT(msg_queue_alloc_security, smack_ipc_alloc_security),
+ LSM_HOOK_INIT(msg_queue_free_security, smack_ipc_free_security),
LSM_HOOK_INIT(msg_queue_associate, smack_msg_queue_associate),
LSM_HOOK_INIT(msg_queue_msgctl, smack_msg_queue_msgctl),
LSM_HOOK_INIT(msg_queue_msgsnd, smack_msg_queue_msgsnd),
LSM_HOOK_INIT(msg_queue_msgrcv, smack_msg_queue_msgrcv),
- LSM_HOOK_INIT(shm_alloc_security, smack_shm_alloc_security),
- LSM_HOOK_INIT(shm_free_security, smack_shm_free_security),
+ LSM_HOOK_INIT(shm_alloc_security, smack_ipc_alloc_security),
+ LSM_HOOK_INIT(shm_free_security, smack_ipc_free_security),
LSM_HOOK_INIT(shm_associate, smack_shm_associate),
LSM_HOOK_INIT(shm_shmctl, smack_shm_shmctl),
LSM_HOOK_INIT(shm_shmat, smack_shm_shmat),
- LSM_HOOK_INIT(sem_alloc_security, smack_sem_alloc_security),
- LSM_HOOK_INIT(sem_free_security, smack_sem_free_security),
+ LSM_HOOK_INIT(sem_alloc_security, smack_ipc_alloc_security),
+ LSM_HOOK_INIT(sem_free_security, smack_ipc_free_security),
LSM_HOOK_INIT(sem_associate, smack_sem_associate),
LSM_HOOK_INIT(sem_semctl, smack_sem_semctl),
LSM_HOOK_INIT(sem_semop, smack_sem_semop),
--
2.14.1
On 3/23/2018 10:42 PM, Eric W. Biederman wrote:
> Rename the variables shp, sma, msq to isp. As that is how the code already
> refers to those variables.
Thanks. It's important to keep the code readable.
> Collapse smack_of_shm, smack_of_sem, and smack_of_msq into smack_of_ipc,
> as the three functions had become completely identical.
Thanks. Completely reasonable and correct.
> Collapse smack_shm_alloc_security, smack_sem_alloc_security and
> smack_msg_queue_alloc_security into smack_ipc_alloc_security as the
> three functions had become identical.
>
> Collapse smack_shm_free_security, smack_sem_free_security and
> smack_msg_queue_free_security into smack_ipc_free_security as the three
> functions had become identical.
This is reasonable but unprecedented. Nowhere else is the
same function used to supply multiple LSM hooks. Does anyone
out there see a reason not to do this?
> Requested-by: Casey Schaufler <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> security/smack/smack_lsm.c | 197 +++++++++++++--------------------------------
> 1 file changed, 58 insertions(+), 139 deletions(-)
>
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index d960c2ea8d79..0735b8db158b 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -2945,25 +2945,24 @@ static void smack_msg_msg_free_security(struct msg_msg *msg)
> }
>
> /**
> - * smack_of_shm - the smack pointer for the shm
> - * @shp: the object
> + * smack_of_ipc - the smack pointer for the ipc
> + * @isp: the object
> *
> * Returns a pointer to the smack value
> */
> -static struct smack_known *smack_of_shm(struct kern_ipc_perm *shp)
> +static struct smack_known *smack_of_ipc(struct kern_ipc_perm *isp)
> {
> - return (struct smack_known *)shp->security;
> + return (struct smack_known *)isp->security;
> }
>
> /**
> - * smack_shm_alloc_security - Set the security blob for shm
> - * @shp: the object
> + * smack_ipc_alloc_security - Set the security blob for ipc
> + * @isp: the object
> *
> * Returns 0
> */
> -static int smack_shm_alloc_security(struct kern_ipc_perm *shp)
> +static int smack_ipc_alloc_security(struct kern_ipc_perm *isp)
> {
> - struct kern_ipc_perm *isp = shp;
> struct smack_known *skp = smk_of_current();
>
> isp->security = skp;
> @@ -2971,34 +2970,32 @@ static int smack_shm_alloc_security(struct kern_ipc_perm *shp)
> }
>
> /**
> - * smack_shm_free_security - Clear the security blob for shm
> - * @shp: the object
> + * smack_ipc_free_security - Clear the security blob for ipc
> + * @isp: the object
> *
> * Clears the blob pointer
> */
> -static void smack_shm_free_security(struct kern_ipc_perm *shp)
> +static void smack_ipc_free_security(struct kern_ipc_perm *isp)
> {
> - struct kern_ipc_perm *isp = shp;
> -
> isp->security = NULL;
> }
>
> /**
> * smk_curacc_shm : check if current has access on shm
> - * @shp : the object
> + * @isp : the object
> * @access : access requested
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smk_curacc_shm(struct kern_ipc_perm *shp, int access)
> +static int smk_curacc_shm(struct kern_ipc_perm *isp, int access)
> {
> - struct smack_known *ssp = smack_of_shm(shp);
> + struct smack_known *ssp = smack_of_ipc(isp);
> struct smk_audit_info ad;
> int rc;
>
> #ifdef CONFIG_AUDIT
> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
> - ad.a.u.ipc_id = shp->id;
> + ad.a.u.ipc_id = isp->id;
> #endif
> rc = smk_curacc(ssp, access, &ad);
> rc = smk_bu_current("shm", ssp, access, rc);
> @@ -3007,27 +3004,27 @@ static int smk_curacc_shm(struct kern_ipc_perm *shp, int access)
>
> /**
> * smack_shm_associate - Smack access check for shm
> - * @shp: the object
> + * @isp: the object
> * @shmflg: access requested
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_shm_associate(struct kern_ipc_perm *shp, int shmflg)
> +static int smack_shm_associate(struct kern_ipc_perm *isp, int shmflg)
> {
> int may;
>
> may = smack_flags_to_may(shmflg);
> - return smk_curacc_shm(shp, may);
> + return smk_curacc_shm(isp, may);
> }
>
> /**
> * smack_shm_shmctl - Smack access check for shm
> - * @shp: the object
> + * @isp: the object
> * @cmd: what it wants to do
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
> +static int smack_shm_shmctl(struct kern_ipc_perm *isp, int cmd)
> {
> int may;
>
> @@ -3051,81 +3048,42 @@ static int smack_shm_shmctl(struct kern_ipc_perm *shp, int cmd)
> default:
> return -EINVAL;
> }
> - return smk_curacc_shm(shp, may);
> + return smk_curacc_shm(isp, may);
> }
>
> /**
> * smack_shm_shmat - Smack access for shmat
> - * @shp: the object
> + * @isp: the object
> * @shmaddr: unused
> * @shmflg: access requested
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_shm_shmat(struct kern_ipc_perm *shp, char __user *shmaddr,
> +static int smack_shm_shmat(struct kern_ipc_perm *ipc, char __user *shmaddr,
> int shmflg)
> {
> int may;
>
> may = smack_flags_to_may(shmflg);
> - return smk_curacc_shm(shp, may);
> -}
> -
> -/**
> - * smack_of_sem - the smack pointer for the sem
> - * @sma: the object
> - *
> - * Returns a pointer to the smack value
> - */
> -static struct smack_known *smack_of_sem(struct kern_ipc_perm *sma)
> -{
> - return (struct smack_known *)sma->security;
> -}
> -
> -/**
> - * smack_sem_alloc_security - Set the security blob for sem
> - * @sma: the object
> - *
> - * Returns 0
> - */
> -static int smack_sem_alloc_security(struct kern_ipc_perm *sma)
> -{
> - struct kern_ipc_perm *isp = sma;
> - struct smack_known *skp = smk_of_current();
> -
> - isp->security = skp;
> - return 0;
> -}
> -
> -/**
> - * smack_sem_free_security - Clear the security blob for sem
> - * @sma: the object
> - *
> - * Clears the blob pointer
> - */
> -static void smack_sem_free_security(struct kern_ipc_perm *sma)
> -{
> - struct kern_ipc_perm *isp = sma;
> -
> - isp->security = NULL;
> + return smk_curacc_shm(ipc, may);
> }
>
> /**
> * smk_curacc_sem : check if current has access on sem
> - * @sma : the object
> + * @isp : the object
> * @access : access requested
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smk_curacc_sem(struct kern_ipc_perm *sma, int access)
> +static int smk_curacc_sem(struct kern_ipc_perm *isp, int access)
> {
> - struct smack_known *ssp = smack_of_sem(sma);
> + struct smack_known *ssp = smack_of_ipc(isp);
> struct smk_audit_info ad;
> int rc;
>
> #ifdef CONFIG_AUDIT
> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
> - ad.a.u.ipc_id = sma->id;
> + ad.a.u.ipc_id = isp->id;
> #endif
> rc = smk_curacc(ssp, access, &ad);
> rc = smk_bu_current("sem", ssp, access, rc);
> @@ -3134,27 +3092,27 @@ static int smk_curacc_sem(struct kern_ipc_perm *sma, int access)
>
> /**
> * smack_sem_associate - Smack access check for sem
> - * @sma: the object
> + * @isp: the object
> * @semflg: access requested
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_sem_associate(struct kern_ipc_perm *sma, int semflg)
> +static int smack_sem_associate(struct kern_ipc_perm *isp, int semflg)
> {
> int may;
>
> may = smack_flags_to_may(semflg);
> - return smk_curacc_sem(sma, may);
> + return smk_curacc_sem(isp, may);
> }
>
> /**
> * smack_sem_shmctl - Smack access check for sem
> - * @sma: the object
> + * @isp: the object
> * @cmd: what it wants to do
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_sem_semctl(struct kern_ipc_perm *sma, int cmd)
> +static int smack_sem_semctl(struct kern_ipc_perm *isp, int cmd)
> {
> int may;
>
> @@ -3184,12 +3142,12 @@ static int smack_sem_semctl(struct kern_ipc_perm *sma, int cmd)
> return -EINVAL;
> }
>
> - return smk_curacc_sem(sma, may);
> + return smk_curacc_sem(isp, may);
> }
>
> /**
> * smack_sem_semop - Smack checks of semaphore operations
> - * @sma: the object
> + * @isp: the object
> * @sops: unused
> * @nsops: unused
> * @alter: unused
> @@ -3198,67 +3156,28 @@ static int smack_sem_semctl(struct kern_ipc_perm *sma, int cmd)
> *
> * Returns 0 if access is allowed, error code otherwise
> */
> -static int smack_sem_semop(struct kern_ipc_perm *sma, struct sembuf *sops,
> +static int smack_sem_semop(struct kern_ipc_perm *isp, struct sembuf *sops,
> unsigned nsops, int alter)
> {
> - return smk_curacc_sem(sma, MAY_READWRITE);
> -}
> -
> -/**
> - * smack_msg_alloc_security - Set the security blob for msg
> - * @msq: the object
> - *
> - * Returns 0
> - */
> -static int smack_msg_queue_alloc_security(struct kern_ipc_perm *msq)
> -{
> - struct kern_ipc_perm *kisp = msq;
> - struct smack_known *skp = smk_of_current();
> -
> - kisp->security = skp;
> - return 0;
> -}
> -
> -/**
> - * smack_msg_free_security - Clear the security blob for msg
> - * @msq: the object
> - *
> - * Clears the blob pointer
> - */
> -static void smack_msg_queue_free_security(struct kern_ipc_perm *msq)
> -{
> - struct kern_ipc_perm *kisp = msq;
> -
> - kisp->security = NULL;
> -}
> -
> -/**
> - * smack_of_msq - the smack pointer for the msq
> - * @msq: the object
> - *
> - * Returns a pointer to the smack label entry
> - */
> -static struct smack_known *smack_of_msq(struct kern_ipc_perm *msq)
> -{
> - return (struct smack_known *)msq->security;
> + return smk_curacc_sem(isp, MAY_READWRITE);
> }
>
> /**
> * smk_curacc_msq : helper to check if current has access on msq
> - * @msq : the msq
> + * @isp : the msq
> * @access : access requested
> *
> * return 0 if current has access, error otherwise
> */
> -static int smk_curacc_msq(struct kern_ipc_perm *msq, int access)
> +static int smk_curacc_msq(struct kern_ipc_perm *isp, int access)
> {
> - struct smack_known *msp = smack_of_msq(msq);
> + struct smack_known *msp = smack_of_ipc(isp);
> struct smk_audit_info ad;
> int rc;
>
> #ifdef CONFIG_AUDIT
> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC);
> - ad.a.u.ipc_id = msq->id;
> + ad.a.u.ipc_id = isp->id;
> #endif
> rc = smk_curacc(msp, access, &ad);
> rc = smk_bu_current("msq", msp, access, rc);
> @@ -3267,27 +3186,27 @@ static int smk_curacc_msq(struct kern_ipc_perm *msq, int access)
>
> /**
> * smack_msg_queue_associate - Smack access check for msg_queue
> - * @msq: the object
> + * @isp: the object
> * @msqflg: access requested
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_msg_queue_associate(struct kern_ipc_perm *msq, int msqflg)
> +static int smack_msg_queue_associate(struct kern_ipc_perm *isp, int msqflg)
> {
> int may;
>
> may = smack_flags_to_may(msqflg);
> - return smk_curacc_msq(msq, may);
> + return smk_curacc_msq(isp, may);
> }
>
> /**
> * smack_msg_queue_msgctl - Smack access check for msg_queue
> - * @msq: the object
> + * @isp: the object
> * @cmd: what it wants to do
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
> +static int smack_msg_queue_msgctl(struct kern_ipc_perm *isp, int cmd)
> {
> int may;
>
> @@ -3310,29 +3229,29 @@ static int smack_msg_queue_msgctl(struct kern_ipc_perm *msq, int cmd)
> return -EINVAL;
> }
>
> - return smk_curacc_msq(msq, may);
> + return smk_curacc_msq(isp, may);
> }
>
> /**
> * smack_msg_queue_msgsnd - Smack access check for msg_queue
> - * @msq: the object
> + * @isp: the object
> * @msg: unused
> * @msqflg: access requested
> *
> * Returns 0 if current has the requested access, error code otherwise
> */
> -static int smack_msg_queue_msgsnd(struct kern_ipc_perm *msq, struct msg_msg *msg,
> +static int smack_msg_queue_msgsnd(struct kern_ipc_perm *isp, struct msg_msg *msg,
> int msqflg)
> {
> int may;
>
> may = smack_flags_to_may(msqflg);
> - return smk_curacc_msq(msq, may);
> + return smk_curacc_msq(isp, may);
> }
>
> /**
> * smack_msg_queue_msgsnd - Smack access check for msg_queue
> - * @msq: the object
> + * @isp: the object
> * @msg: unused
> * @target: unused
> * @type: unused
> @@ -3340,10 +3259,10 @@ static int smack_msg_queue_msgsnd(struct kern_ipc_perm *msq, struct msg_msg *msg
> *
> * Returns 0 if current has read and write access, error code otherwise
> */
> -static int smack_msg_queue_msgrcv(struct kern_ipc_perm *msq, struct msg_msg *msg,
> +static int smack_msg_queue_msgrcv(struct kern_ipc_perm *isp, struct msg_msg *msg,
> struct task_struct *target, long type, int mode)
> {
> - return smk_curacc_msq(msq, MAY_READWRITE);
> + return smk_curacc_msq(isp, MAY_READWRITE);
> }
>
> /**
> @@ -4756,21 +4675,21 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
> LSM_HOOK_INIT(msg_msg_alloc_security, smack_msg_msg_alloc_security),
> LSM_HOOK_INIT(msg_msg_free_security, smack_msg_msg_free_security),
>
> - LSM_HOOK_INIT(msg_queue_alloc_security, smack_msg_queue_alloc_security),
> - LSM_HOOK_INIT(msg_queue_free_security, smack_msg_queue_free_security),
> + LSM_HOOK_INIT(msg_queue_alloc_security, smack_ipc_alloc_security),
> + LSM_HOOK_INIT(msg_queue_free_security, smack_ipc_free_security),
> LSM_HOOK_INIT(msg_queue_associate, smack_msg_queue_associate),
> LSM_HOOK_INIT(msg_queue_msgctl, smack_msg_queue_msgctl),
> LSM_HOOK_INIT(msg_queue_msgsnd, smack_msg_queue_msgsnd),
> LSM_HOOK_INIT(msg_queue_msgrcv, smack_msg_queue_msgrcv),
>
> - LSM_HOOK_INIT(shm_alloc_security, smack_shm_alloc_security),
> - LSM_HOOK_INIT(shm_free_security, smack_shm_free_security),
> + LSM_HOOK_INIT(shm_alloc_security, smack_ipc_alloc_security),
> + LSM_HOOK_INIT(shm_free_security, smack_ipc_free_security),
> LSM_HOOK_INIT(shm_associate, smack_shm_associate),
> LSM_HOOK_INIT(shm_shmctl, smack_shm_shmctl),
> LSM_HOOK_INIT(shm_shmat, smack_shm_shmat),
>
> - LSM_HOOK_INIT(sem_alloc_security, smack_sem_alloc_security),
> - LSM_HOOK_INIT(sem_free_security, smack_sem_free_security),
> + LSM_HOOK_INIT(sem_alloc_security, smack_ipc_alloc_security),
> + LSM_HOOK_INIT(sem_free_security, smack_ipc_free_security),
> LSM_HOOK_INIT(sem_associate, smack_sem_associate),
> LSM_HOOK_INIT(sem_semctl, smack_sem_semctl),
> LSM_HOOK_INIT(sem_semop, smack_sem_semop),
NAGARATHNAM MUTHUSAMY <[email protected]> writes:
> On 3/23/2018 2:33 PM, [email protected] wrote:
>> NAGARATHNAM MUTHUSAMY <[email protected]> writes:
>>
>>> Thanks!
>>>
>>> Reviewed-by: Nagarathnam Muthusamy <[email protected]>
>> Does this look like it will address the issue you have been fighting
>> with pids?
>
> We do use IPC shared memory but it is a single large one, shared by multiple
> levels. We are currently looking into using a similar solution based on file
> locks.
> When a new level is created, a file representing that level could be created in
> a common path which could be locked by the init process of that level.
> Parent levels could query the locking pid of that file to get the pid
> translation
> of the init process of the required level. Then it could open a file descriptor
> and use the translate_pid API for further translations.
Do you want to resend the translate_pid API with file descriptors as it
was in the lwn article? That I will apply.
Eric
On 03/28/2018 04:04 PM, [email protected] wrote:
> NAGARATHNAM MUTHUSAMY <[email protected]> writes:
>
>> On 3/23/2018 2:33 PM, [email protected] wrote:
>>> NAGARATHNAM MUTHUSAMY <[email protected]> writes:
>>>
>>>> Thanks!
>>>>
>>>> Reviewed-by: Nagarathnam Muthusamy <[email protected]>
>>> Does this look like it will address the issue you have been fighting
>>> with pids?
>> We do use IPC shared memory but it is a single large one, shared by multiple
>> levels. We are currently looking into using a similar solution based on file
>> locks.
>> When a new level is created, a file representing that level could be created in
>> a common path which could be locked by the init process of that level.
>> Parent levels could query the locking pid of that file to get the pid
>> translation
>> of the init process of the required level. Then it could open a file descriptor
>> and use the translate_pid API for further translations.
> Do you want to resend the translate_pid API with file descriptors as it
> was in the lwn article? That I will apply.
Sure Eric! We are currently implementing and testing the file locks + FD
based
approach, just to make sure it covers all the requirements. Will resend the
patch with FD based translate_pid API in few days.
Thanks,
Nagarathnam.
>
> Eric
On Fri, 23 Mar 2018, Casey Schaufler wrote:
>A kern_ipc_perm pointer is conventionally named isp in this code.
So the ideal name would be ipcp, used in core ipc, but I have no strong
preference over isp, ipp or whatever other name is used in LSMs. The
important thing is that kern_ipc_perm should not be called sma or any
ipc specific name.
>How about instead:
Agreed.
>
>-static int smack_sem_alloc_security(struct sem_array *sma)
>+static int smack_sem_alloc_security(struct kern_ipc_perm *isp)
> {
>- struct kern_ipc_perm *isp = &sma->sem_perm;
>
Thanks,
Davidlohr
On Sat, 24 Mar 2018, Casey Schaufler wrote:
>On 3/23/2018 10:42 PM, Eric W. Biederman wrote:
>> Rename the variables shp, sma, msq to isp. As that is how the code already
>> refers to those variables.
>
>Thanks. It's important to keep the code readable.
Ah great, ignore my last email then.
On Sat, 24 Mar 2018, Eric W. Biederman wrote:
>
>After the last round of cleanups the shm, sem, and msg associate
>operations just became trivial wrappers around the appropriate security
>method. Simplify things further by just calling the security method
>directly.
>
>Signed-off-by: "Eric W. Biederman" <[email protected]>
>---
> ipc/msg.c | 10 +---------
> ipc/sem.c | 10 +---------
> ipc/shm.c | 10 +---------
> 3 files changed, 3 insertions(+), 27 deletions(-)
This is nice to see.
On Sat, 24 Mar 2018, Eric W. Biederman wrote:
> /**
>- * smack_of_shm - the smack pointer for the shm
>- * @shp: the object
>+ * smack_of_ipc - the smack pointer for the ipc
>+ * @isp: the object
Nit, but while at it the @isp description does need some love:
"@isp: the pointer for the ipc perm structure"
On Fri, 23 Mar 2018, Eric W. Biederman wrote:
>Today the last process to update a semaphore is remembered and
>reported in the pid namespace of that process. If there are processes
>in any other pid namespace querying that process id with GETPID the
>result will be unusable nonsense as it does not make any
>sense in your own pid namespace.
Yeah that sounds pretty wrong.
>
>Due to ipc_update_pid I don't think you will be able to get System V
>ipc semaphores into a troublesome cache line ping-pong. Using struct
>pids from separate process are not a problem because they do not share
>a cache line. Using struct pid from different threads of the same
>process are unlikely to be a problem as the reference count update
>can be avoided.
>
>Further linux futexes are a much better tool for the job of mutual
>exclusion between processes than System V semaphores. So I expect
>programs that are performance limited by their interprocess mutual
>exclusion primitive will be using futexes.
You would be wrong. There are plenty of real workloads out there
that do not use futexes and are care about performance; in the end
futexes are only good for the uncontended cases, it can also
destroy numa boxes if you consider the global hash table. Experience
as shown me that sysvipc sems are quite still used.
>
>So while it is possible that enhancing the storage of the last
>rocess of a System V semaphore from an integer to a struct pid
>will cause a performance regression because of the effect
>of frequently updating the pid reference count. I don't expect
>that to happen in practice.
How's that? Now thanks to ipc_update_pid() for each semop the user
passes, perform_atomic_semop() will do two atomic updates for the
cases where there are multiple processes updating the sem. This is
not uncommon.
Could you please provide some numbers.
Thanks,
Davidlohr
On Fri, 23 Mar 2018, Eric W. Biederman wrote:
>Still I would like to see this fixed and I plan on merging this code.
Yes, it needs fixed, but 1) there are pending issues (such as the extra atomics)
and 2) its late in the -rc cycle. Plus this issue has existed for 11 years without
the world ending, so I'm sure we can hold on until at least one more release.
Thanks,
Davidlohr
Davidlohr Bueso <[email protected]> writes:
> On Fri, 23 Mar 2018, Eric W. Biederman wrote:
>
>>Still I would like to see this fixed and I plan on merging this code.
The code is merged into my for-next tree now.
> Yes, it needs fixed, but 1) there are pending issues (such as the
> extra atomics)
Concerns not issues. I documented them but I don't see any serious
reason to be concerned. The data structures are sufficiently different
from AF_UNIX as well as the usage patterns that I have no reasonable
expectation that there will be problems.
There is no reasonable alternate implementation for correcting this bug.
Because of my concerns I looked at several other possibilities and they
all showed incorrect behavior, in different circumstances.
The implementations are simple enough there are no deep subtle issues.
I have tested the code.
If a regression happens the code is carefully split up so things can be
bisected easily and reverted if necessary.
> and 2) its late in the -rc cycle. Plus this issue has existed for 11 years without
> the world ending, so I'm sure we can hold on until at least one more
> release.
People really are starting to seriously look at accessing a single ipc
namespace from multiple pid namespaces. The work arounds I saw posted
for the current brokenness were too nasty to live.
Better to fix things before there is code that actually starts depending
on the current brokenness.
I am the namespace maintianer and this is my area of responsibility.
The code is ready and I see no reason or benefit in delay.
Eric
On Wed, 28 Mar 2018, Davidlohr Bueso wrote:
>On Fri, 23 Mar 2018, Eric W. Biederman wrote:
>
>>Today the last process to update a semaphore is remembered and
>>reported in the pid namespace of that process. If there are processes
>>in any other pid namespace querying that process id with GETPID the
>>result will be unusable nonsense as it does not make any
>>sense in your own pid namespace.
>
>Yeah that sounds pretty wrong.
>
>>
>>Due to ipc_update_pid I don't think you will be able to get System V
>>ipc semaphores into a troublesome cache line ping-pong. Using struct
>>pids from separate process are not a problem because they do not share
>>a cache line. Using struct pid from different threads of the same
>>process are unlikely to be a problem as the reference count update
>>can be avoided.
>>
>>Further linux futexes are a much better tool for the job of mutual
>>exclusion between processes than System V semaphores. So I expect
>>programs that are performance limited by their interprocess mutual
>>exclusion primitive will be using futexes.
>
>You would be wrong. There are plenty of real workloads out there
>that do not use futexes and are care about performance; in the end
>futexes are only good for the uncontended cases, it can also
>destroy numa boxes if you consider the global hash table. Experience
>as shown me that sysvipc sems are quite still used.
>
>>
>>So while it is possible that enhancing the storage of the last
>>rocess of a System V semaphore from an integer to a struct pid
>>will cause a performance regression because of the effect
>>of frequently updating the pid reference count. I don't expect
>>that to happen in practice.
>
>How's that? Now thanks to ipc_update_pid() for each semop the user
>passes, perform_atomic_semop() will do two atomic updates for the
>cases where there are multiple processes updating the sem. This is
>not uncommon.
>
>Could you please provide some numbers.
I ran this on a 40-core (no ht) Westmere with two benchmarks. The first
is Manfred's sysvsem lockunlock[1] program which uses _processes_ to,
well, lock and unlock the semaphore. The options are a little
unconventional, to keep the "critical region small" and the lock+unlock
frequency high I added busy_in=busy_out=10. Similarly, to get the
worst case scenario and have everyone update the same semaphore, a single
one is used. Here are the results (pretty low stddev from run to run)
for doing 100,000 lock+unlock.
- 1 proc:
* vanilla
total execution time: 0.110638 seconds for 100000 loops
* dirty
total execution time: 0.120144 seconds for 100000 loops
- 2 proc:
* vanilla
total execution time: 0.379756 seconds for 100000 loops
* dirty
total execution time: 0.477778 seconds for 100000 loops
- 4 proc:
* vanilla
total execution time: 6.749710 seconds for 100000 loops
* dirty
total execution time: 4.651872 seconds for 100000 loops
- 8 proc:
* vanilla
total execution time: 5.558404 seconds for 100000 loops
* dirty
total execution time: 7.143329 seconds for 100000 loops
- 16 proc:
* vanilla
total execution time: 9.016398 seconds for 100000 loops
* dirty
total execution time: 9.412055 seconds for 100000 loops
- 32 proc:
* vanilla
total execution time: 9.694451 seconds for 100000 loops
* dirty
total execution time: 9.990451 seconds for 100000 loops
- 64 proc:
* vanilla
total execution time: 9.844984 seconds for 100032 loops
* dirty
total execution time: 10.016464 seconds for 100032 loops
Lower task counts show pretty massive performance hits of ~9%, ~25%
and ~30% for single, two and four/eight processes. As more are added
I guess the overhead tends to disappear as for one you have a lot
more locking contention going on.
The second workload I ran this patch on was Chris Mason's sem-scalebench[2]
program which uses _threads_ for the sysvsem option (this benchmark is more
about semaphores as a concept rather than sysvsem specific). Dealing with
a single semaphore and increasing thread counts we get:
sembench-sem
vanill dirt
vanilla dirty
Hmean sembench-sem-2 286272.00 ( 0.00%) 288232.00 ( 0.68%)
Hmean sembench-sem-8 510966.00 ( 0.00%) 494375.00 ( -3.25%)
Hmean sembench-sem-12 435753.00 ( 0.00%) 465328.00 ( 6.79%)
Hmean sembench-sem-21 448144.00 ( 0.00%) 462091.00 ( 3.11%)
Hmean sembench-sem-30 479519.00 ( 0.00%) 471295.00 ( -1.72%)
Hmean sembench-sem-48 533270.00 ( 0.00%) 542525.00 ( 1.74%)
Hmean sembench-sem-79 510218.00 ( 0.00%) 528392.00 ( 3.56%)
Unsurprisingly, the thread case shows no overhead -- and yes, even better at
times but still noise). Similarly, when completely abusing the systems and doing
64*NCPUS there is pretty much no difference:
vanill dirt
vanilla dirty
User 1865.99 1819.75
System 35080.97 35396.34
Elapsed 3602.03 3560.50
So at least for a large box this patch hurts the cases where there is low
to medium cpu usage (no more than ~8 processes on a 40 core box) in a non
trivial way. For more processes it doesn't matter. We can confirm that the
case for threads is irrelevant. While I'm not happy about the 30% regression
I guess we can live with this.
Manfred, any thoughts?
Thanks
Davidlohr
[1] https://github.com/manfred-colorfu/ipcscale/blob/master/sem-lockunlock.c
[2] https://github.com/davidlohr/sembench-ng/blob/master/sembench.c
Davidlohr Bueso <[email protected]> writes:
> I ran this on a 40-core (no ht) Westmere with two benchmarks. The first
> is Manfred's sysvsem lockunlock[1] program which uses _processes_ to,
> well, lock and unlock the semaphore. The options are a little
> unconventional, to keep the "critical region small" and the lock+unlock
> frequency high I added busy_in=busy_out=10. Similarly, to get the
> worst case scenario and have everyone update the same semaphore, a single
> one is used. Here are the results (pretty low stddev from run to run)
> for doing 100,000 lock+unlock.
>
> - 1 proc:
> * vanilla
> total execution time: 0.110638 seconds for 100000 loops
> * dirty
> total execution time: 0.120144 seconds for 100000 loops
>
> - 2 proc:
> * vanilla
> total execution time: 0.379756 seconds for 100000 loops
> * dirty
> total execution time: 0.477778 seconds for 100000 loops
>
> - 4 proc:
> * vanilla
> total execution time: 6.749710 seconds for 100000 loops
> * dirty
> total execution time: 4.651872 seconds for 100000 loops
>
> - 8 proc:
> * vanilla
> total execution time: 5.558404 seconds for 100000 loops
> * dirty
> total execution time: 7.143329 seconds for 100000 loops
>
> - 16 proc:
> * vanilla
> total execution time: 9.016398 seconds for 100000 loops
> * dirty
> total execution time: 9.412055 seconds for 100000 loops
>
> - 32 proc:
> * vanilla
> total execution time: 9.694451 seconds for 100000 loops
> * dirty
> total execution time: 9.990451 seconds for 100000 loops
>
> - 64 proc:
> * vanilla
> total execution time: 9.844984 seconds for 100032 loops
> * dirty
> total execution time: 10.016464 seconds for 100032 loops
>
> Lower task counts show pretty massive performance hits of ~9%, ~25%
> and ~30% for single, two and four/eight processes. As more are added
> I guess the overhead tends to disappear as for one you have a lot
> more locking contention going on.
Can you check your notes on the 4 process case? As I read the 4 process
case above it is ~30% improvement. Either that is a typo or there is the
potential for quite a bit of noise in the test case.
Eric
On Fri, 30 Mar 2018, Eric W. Biederman wrote:
>Davidlohr Bueso <[email protected]> writes:
>
>> I ran this on a 40-core (no ht) Westmere with two benchmarks. The first
>> is Manfred's sysvsem lockunlock[1] program which uses _processes_ to,
>> well, lock and unlock the semaphore. The options are a little
>> unconventional, to keep the "critical region small" and the lock+unlock
>> frequency high I added busy_in=busy_out=10. Similarly, to get the
>> worst case scenario and have everyone update the same semaphore, a single
>> one is used. Here are the results (pretty low stddev from run to run)
>> for doing 100,000 lock+unlock.
>>
>> - 1 proc:
>> * vanilla
>> total execution time: 0.110638 seconds for 100000 loops
>> * dirty
>> total execution time: 0.120144 seconds for 100000 loops
>>
>> - 2 proc:
>> * vanilla
>> total execution time: 0.379756 seconds for 100000 loops
>> * dirty
>> total execution time: 0.477778 seconds for 100000 loops
>>
>> - 4 proc:
>> * vanilla
>> total execution time: 6.749710 seconds for 100000 loops
>> * dirty
>> total execution time: 4.651872 seconds for 100000 loops
>>
>> - 8 proc:
>> * vanilla
>> total execution time: 5.558404 seconds for 100000 loops
>> * dirty
>> total execution time: 7.143329 seconds for 100000 loops
>>
>> - 16 proc:
>> * vanilla
>> total execution time: 9.016398 seconds for 100000 loops
>> * dirty
>> total execution time: 9.412055 seconds for 100000 loops
>>
>> - 32 proc:
>> * vanilla
>> total execution time: 9.694451 seconds for 100000 loops
>> * dirty
>> total execution time: 9.990451 seconds for 100000 loops
>>
>> - 64 proc:
>> * vanilla
>> total execution time: 9.844984 seconds for 100032 loops
>> * dirty
>> total execution time: 10.016464 seconds for 100032 loops
>>
>> Lower task counts show pretty massive performance hits of ~9%, ~25%
>> and ~30% for single, two and four/eight processes. As more are added
>> I guess the overhead tends to disappear as for one you have a lot
>> more locking contention going on.
>
>Can you check your notes on the 4 process case? As I read the 4 process
>case above it is ~30% improvement. Either that is a typo or there is the
>potential for quite a bit of noise in the test case.
Yeah, sorry that was a typo. Unlike the second benchmark I didn't have
this one automated but it's always the vanilla kernel that outperforms
the dirty.
Thanks,
Davidlohr
On Sat, 24 Mar 2018, Eric W. Biederman wrote:
>
> After the last round of cleanups the shm, sem, and msg associate
> operations just became trivial wrappers around the appropriate security
> method. Simplify things further by just calling the security method
> directly.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Reviewed-by: James Morris <[email protected]>
--
James Morris
<[email protected]>
Hi,
On 03/30/2018 09:09 PM, Davidlohr Bueso wrote:
> On Wed, 28 Mar 2018, Davidlohr Bueso wrote:
>
>> On Fri, 23 Mar 2018, Eric W. Biederman wrote:
>>
>>> Today the last process to update a semaphore is remembered and
>>> reported in the pid namespace of that process. If there are processes
>>> in any other pid namespace querying that process id with GETPID the
>>> result will be unusable nonsense as it does not make any
>>> sense in your own pid namespace.
>>
>> Yeah that sounds pretty wrong.
>>
>>>
>>> Due to ipc_update_pid I don't think you will be able to get System V
>>> ipc semaphores into a troublesome cache line ping-pong. Using struct
>>> pids from separate process are not a problem because they do not share
>>> a cache line. Using struct pid from different threads of the same
>>> process are unlikely to be a problem as the reference count update
>>> can be avoided.
>>>
>>> Further linux futexes are a much better tool for the job of mutual
>>> exclusion between processes than System V semaphores. So I expect
>>> programs that are performance limited by their interprocess mutual
>>> exclusion primitive will be using futexes.
>>
The performance of sysv sem and futexes for the contended case is more
or less identical, it depends on the CONFIG_ options what is faster.
And this is obvious, both primitives must do the same tasks:
sleep:
- lookup a kernel pointer from a user space reference
- acquire a lock, do some housekeeping, unlock and sleep
wakeup:
- lookup a kernel pointer from a user space reference
- acquire a lock, do some housekeeping, especially unlink the to be
woken up task, unlock and wakeup
The woken up task has nothing to do, it returns immediately to user space.
IIRC for the uncontended case, sysvsem was at ~300 cpu cycles, but that
number is a few years old, and I don't know what is the impact of spectre.
The futex code is obviously faster.
But I don't know which real-world applications do their own
optimizations for the uncontended case before using sysvsem.
Thus the only "real" challenge is to minimize cache line trashing.
>> You would be wrong. There are plenty of real workloads out there
>> that do not use futexes and are care about performance; in the end
>> futexes are only good for the uncontended cases, it can also
>> destroy numa boxes if you consider the global hash table. Experience
>> as shown me that sysvipc sems are quite still used.
>>
>>>
>>> So while it is possible that enhancing the storage of the last
>>> rocess of a System V semaphore from an integer to a struct pid
>>> will cause a performance regression because of the effect
>>> of frequently updating the pid reference count. I don't expect
>>> that to happen in practice.
>>
>> How's that? Now thanks to ipc_update_pid() for each semop the user
>> passes, perform_atomic_semop() will do two atomic updates for the
>> cases where there are multiple processes updating the sem. This is
>> not uncommon.
>>
>> Could you please provide some numbers.
>
[...]
> So at least for a large box this patch hurts the cases where there is low
> to medium cpu usage (no more than ~8 processes on a 40 core box) in a non
> trivial way. For more processes it doesn't matter. We can confirm that
> the
> case for threads is irrelevant. While I'm not happy about the 30%
> regression
> I guess we can live with this.
>
> Manfred, any thoughts?
>
Bugfixing has always first priority, and a 30% regression in one
microbenchmark doesn't seem to be that bad.
Thus I would propose that we fix SEMPID first, and _if_ someone notices
a noticeable regression, then we must improve the code.
--
Manfred