2015-07-06 08:50:53

by Andrei Vagin

Subject: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

Currently we use the proc file system, where all information is
presented in text files, which is convenient for humans. But if we need
to get information about processes from code (e.g. in C), procfs
doesn't look so cool.

From code we would prefer to get information in a binary format and to
be able to specify which information is required and for which tasks.
Here is a new interface with all these features, which is called
task_diag. In addition, it's much faster than procfs.

task_diag is based on netlink sockets and looks like socket-diag, which
is used to get information about sockets.

A request is described by the task_diag_pid structure:

struct task_diag_pid {
	__u64 show_flags;    /* specify which information is required */
	__u64 dump_strategy; /* specify a group of processes */

	__u32 pid;
};

dump_strategy specifies a group of processes:
/* per-process strategies */
TASK_DIAG_DUMP_CHILDREN - all children
TASK_DIAG_DUMP_THREAD - all threads
TASK_DIAG_DUMP_ONE - one process
/* system-wide strategies (the pid field is ignored) */
TASK_DIAG_DUMP_ALL - all processes
TASK_DIAG_DUMP_ALL_THREAD - all threads

show_flags specifies which information is required.
If we set the TASK_DIAG_SHOW_BASE flag, the response message will
contain the TASK_DIAG_BASE attribute which is described by the
task_diag_base structure.

struct task_diag_base {
	__u32 tgid;
	__u32 pid;
	__u32 ppid;
	__u32 tpid;
	__u32 sid;
	__u32 pgid;
	__u8  state;
	char  comm[TASK_DIAG_COMM_LEN];
};

In the future, it can be extended with optional attributes. The request
describes which task properties are required and for which processes.
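For illustration, a request for a single task could be assembled in
userspace along these lines. This is only a sketch: the attribute layout
follows the structures above, but the strategy and command numbers are
hypothetical placeholder values, and a real client has to resolve the
taskstats genetlink family id via CTRL_CMD_GETFAMILY at run time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/* Constants mirroring this series; numeric values are placeholders. */
#define TASK_DIAG_CMD_ATTR_GET	1
#define TASK_DIAG_SHOW_BASE	(1ULL << 0)
#define TASK_DIAG_DUMP_ONE	2	/* hypothetical numeric value */
#define TASK_DIAG_CMD_GET	4	/* hypothetical numeric value */

struct task_diag_pid {
	uint64_t show_flags;	/* which attribute groups to include */
	uint64_t dump_strategy;	/* which group of processes to dump */
	uint32_t pid;
};

/* Serialize a task_diag GET request into buf; returns the total length. */
static int build_request(char *buf, uint16_t family_id, uint32_t pid)
{
	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
	struct genlmsghdr *genl = NLMSG_DATA(nlh);
	struct nlattr *attr = (struct nlattr *)((char *)genl + GENL_HDRLEN);
	struct task_diag_pid req = {
		.show_flags	= TASK_DIAG_SHOW_BASE,
		.dump_strategy	= TASK_DIAG_DUMP_ONE,
		.pid		= pid,
	};

	nlh->nlmsg_type  = family_id;	/* resolved via CTRL_CMD_GETFAMILY */
	nlh->nlmsg_flags = NLM_F_REQUEST;
	nlh->nlmsg_seq   = 1;
	nlh->nlmsg_pid   = 0;

	genl->cmd     = TASK_DIAG_CMD_GET;
	genl->version = 1;

	/* the whole request travels as one TASK_DIAG_CMD_ATTR_GET attribute */
	attr->nla_type = TASK_DIAG_CMD_ATTR_GET;
	attr->nla_len  = NLA_HDRLEN + sizeof(req);
	memcpy((char *)attr + NLA_HDRLEN, &req, sizeof(req));

	nlh->nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN + NLA_ALIGN(attr->nla_len));
	return nlh->nlmsg_len;
}
```

The resulting buffer would then be handed to sendto() on a
NETLINK_GENERIC socket.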

A response can be divided into a few netlink packets if NLM_F_DUMP
has been set in a request. Each task is described by a message. Each
message contains the TASK_DIAG_PID attribute and the optional attributes
which have been requested (show_flags). A message can be divided into a
few parts if it doesn't fit into the current netlink packet. In this case,
the first message in the next packet contains the same PID and the
attributes which didn't fit into the previous message.
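On the receiving side, each packet can be walked with the standard
NLMSG_* iteration macros. A sketch of that loop follows; it only counts
messages and omits genetlink header and attribute parsing, and
count_task_msgs is a hypothetical helper name.

```c
#include <assert.h>
#include <linux/netlink.h>

/* Count task messages in one received packet; returns -1 once
 * NLMSG_DONE (end of dump) or NLMSG_ERROR is seen. */
static int count_task_msgs(void *buf, int len)
{
	struct nlmsghdr *nlh = buf;
	int count = 0;

	for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		if (nlh->nlmsg_type == NLMSG_DONE ||
		    nlh->nlmsg_type == NLMSG_ERROR)
			return -1;
		/* Each message describes one task.  If a task's data was
		 * split across packets, the TASK_DIAG_PID attribute in the
		 * continuation lets userspace glue the parts together. */
		count++;
	}
	return count;
}
```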

task_diag is much faster than the proc file system. We don't need to
create a new file descriptor for each task; we only need to send a
request and get a response. It also allows getting information about
several tasks in one request-response iteration.

As for security, task_diag always works as procfs does with hidepid = 2
(the highest level of security).

I have compared the performance of procfs and task_diag for the
"ps ax -o pid,ppid" command.

The test machine runs 30108 processes.
$ ps ax -o pid,ppid | wc -l
30108

$ time ps ax -o pid,ppid > /dev/null

real 0m0.836s
user 0m0.238s
sys 0m0.583s

Read /proc/PID/stat for each task
$ time ./task_proc_all > /dev/null

real 0m0.258s
user 0m0.019s
sys 0m0.232s

$ time ./task_diag_all > /dev/null

real 0m0.052s
user 0m0.013s
sys 0m0.036s

And here are statistics on syscalls which were called by each
command.

$ perf trace -s -o log -- ./task_proc_all > /dev/null

Summary of events:

task_proc_all (30781), 180785 events, 100.0%, 0.000 msec

syscall calls min avg max stddev
(msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- ------
read 30111 0.000 0.013 0.107 0.21%
write 1 0.008 0.008 0.008 0.00%
open 30111 0.007 0.012 0.145 0.24%
close 30112 0.004 0.011 0.110 0.20%
fstat 3 0.009 0.013 0.016 16.15%
mmap 8 0.011 0.020 0.027 11.24%
mprotect 4 0.019 0.023 0.028 8.33%
munmap 1 0.026 0.026 0.026 0.00%
brk 8 0.007 0.015 0.024 11.94%
ioctl 1 0.007 0.007 0.007 0.00%
access 1 0.019 0.019 0.019 0.00%
execve 1 0.000 0.000 0.000 0.00%
getdents 29 0.008 1.010 2.215 8.88%
arch_prctl 1 0.016 0.016 0.016 0.00%
openat 1 0.021 0.021 0.021 0.00%


$ perf trace -s -o log -- ./task_diag_all > /dev/null
Summary of events:

task_diag_all (30762), 717 events, 98.9%, 0.000 msec

syscall calls min avg max stddev
(msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- ------
read 2 0.000 0.008 0.016 100.00%
write 197 0.008 0.019 0.041 3.00%
open 2 0.023 0.029 0.036 22.45%
close 3 0.010 0.012 0.014 11.34%
fstat 3 0.012 0.044 0.106 70.52%
mmap 8 0.014 0.031 0.054 18.88%
mprotect 4 0.016 0.023 0.027 10.93%
munmap 1 0.022 0.022 0.022 0.00%
brk 1 0.040 0.040 0.040 0.00%
ioctl 1 0.011 0.011 0.011 0.00%
access 1 0.032 0.032 0.032 0.00%
getpid 1 0.012 0.012 0.012 0.00%
socket 1 0.032 0.032 0.032 0.00%
sendto 2 0.032 0.095 0.157 65.77%
recvfrom 129 0.009 0.235 0.418 2.45%
bind 1 0.018 0.018 0.018 0.00%
execve 1 0.000 0.000 0.000 0.00%
arch_prctl 1 0.012 0.012 0.012 0.00%

You can find the test programs from this experiment in tools/test/selftest/taskdiag.

The idea of this functionality was suggested by Pavel Emelyanov
(xemul@), when he found that operations with /proc form a significant
part of checkpointing time.

Ten years ago there was an attempt to add a netlink interface to access
/proc information:
http://lwn.net/Articles/99600/

git repo: https://github.com/avagin/linux-task-diag

Changes from the first version:

David Ahern implemented all required functionality to use task_diag in
perf.

Below you can find his results showing how it affects performance.
> Using the fork test command:
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> reading /proc: 11.3 sec
> task_diag: 2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of specjbb, 80,000+ tasks:
> reading /proc: 32.1 sec
> task_diag: 3.9 sec
>
> So overall much snappier startup times.

Many thanks to David Ahern for the help with improving task_diag.

Cc: Oleg Nesterov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Roger Luethi <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Pavel Odintsov <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
--
2.1.0


2015-07-06 08:56:58

by Andrei Vagin

Subject: [PATCH 01/24] kernel: define taskstats commands in one place

Currently, if we add a new TASKSTATS_ constant, we will change all
CGROUPSTATS_ constants and break backward compatibility.
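The hazard can be illustrated with a toy version of the two enums:
because the CGROUPSTATS_ values are anchored at __TASKSTATS_CMD_MAX,
appending a taskstats command (the hypothetical TASKSTATS_CMD_SHINY_NEW
below) silently renumbers every cgroupstats command.

```c
#include <assert.h>

/* Toy model of the problem this patch fixes: CGROUPSTATS_ commands
 * start at __TASKSTATS_CMD_MAX, so appending a new TASKSTATS_ command
 * shifts every CGROUPSTATS_ value and breaks the binary interface. */
enum {
	TASKSTATS_CMD_UNSPEC = 0,
	TASKSTATS_CMD_GET,
	TASKSTATS_CMD_NEW,
	TASKSTATS_CMD_SHINY_NEW,	/* hypothetical new command */
	__TASKSTATS_CMD_MAX,
};

enum {
	CGROUPSTATS_CMD_UNSPEC = __TASKSTATS_CMD_MAX,	/* was 3, now 4 */
	CGROUPSTATS_CMD_GET,				/* was 4, now 5 */
	CGROUPSTATS_CMD_NEW,				/* was 5, now 6 */
};
```

Old binaries still send the old numeric values, so the kernel would
misinterpret their commands; hence this patch moves the command numbers
into one shared enum.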

Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/cgroupstats.h | 15 ---------------
include/uapi/linux/taskstats.h | 7 +++++++
2 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/include/uapi/linux/cgroupstats.h b/include/uapi/linux/cgroupstats.h
index 3753c33..8095931 100644
--- a/include/uapi/linux/cgroupstats.h
+++ b/include/uapi/linux/cgroupstats.h
@@ -37,21 +37,6 @@ struct cgroupstats {
__u64 nr_io_wait; /* Number of tasks waiting on IO */
};

-/*
- * Commands sent from userspace
- * Not versioned. New commands should only be inserted at the enum's end
- * prior to __CGROUPSTATS_CMD_MAX
- */
-
-enum {
- CGROUPSTATS_CMD_UNSPEC = __TASKSTATS_CMD_MAX, /* Reserved */
- CGROUPSTATS_CMD_GET, /* user->kernel request/get-response */
- CGROUPSTATS_CMD_NEW, /* kernel->user event */
- __CGROUPSTATS_CMD_MAX,
-};
-
-#define CGROUPSTATS_CMD_MAX (__CGROUPSTATS_CMD_MAX - 1)
-
enum {
CGROUPSTATS_TYPE_UNSPEC = 0, /* Reserved */
CGROUPSTATS_TYPE_CGROUP_STATS, /* contains name + stats */
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index 2466e55..a1cc91b 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -176,9 +176,16 @@ enum {
TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
TASKSTATS_CMD_GET, /* user->kernel request/get-response */
TASKSTATS_CMD_NEW, /* kernel->user event */
+ __TASKSTATS_CMD_RESERVED,
+
+ CGROUPSTATS_CMD_GET, /* user->kernel request/get-response */
+ CGROUPSTATS_CMD_NEW, /* kernel->user event */
+
__TASKSTATS_CMD_MAX,
};

+#define __CGROUPSTATS_CMD_MAX __TASKSTATS_CMD_MAX
+#define CGROUPSTATS_CMD_MAX (__CGROUPSTATS_CMD_MAX - 1)
#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)

enum {
--
2.1.0

2015-07-06 08:50:41

by Andrei Vagin

Subject: [PATCH 02/24] kernel: add a netlink interface to get information about tasks (v2)

task_diag is based on netlink sockets and looks like socket-diag, which
is used to get information about sockets.

task_diag is a new interface which is going to replace the proc file
system in cases when we need to get information in a binary format.

A request message is described by the task_diag_pid structure:
struct task_diag_pid {
	__u64 show_flags;
	__u64 dump_strategy;

	__u32 pid;
};

A response is a set of netlink messages. Each message describes one task.
All task properties are divided into groups. A message contains the
TASK_DIAG_PID group, and other groups if they have been requested in
show_flags. For example, if show_flags contains TASK_DIAG_SHOW_BASE, a
response will contain the TASK_DIAG_BASE group, which is described by the
task_diag_base structure.

struct task_diag_base {
	__u32 tgid;
	__u32 pid;
	__u32 ppid;
	__u32 tpid;
	__u32 sid;
	__u32 pgid;
	__u8  state;
	char  comm[TASK_DIAG_COMM_LEN];
};

The dump_strategy field will be used in the following patches to request
information for a group of processes.

v2: A few changes from David Ahern
Use a consistent name
Add max attr enum
task diag: Send pid as u32
Change _MSG/msg references to base
Fix 8-byte alignment

Cc: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
include/linux/taskstats_kern.h | 7 ++
include/uapi/linux/task_diag.h | 60 +++++++++++++++
include/uapi/linux/taskstats.h | 2 +
init/Kconfig | 12 +++
kernel/Makefile | 1 +
kernel/taskdiag.c | 168 +++++++++++++++++++++++++++++++++++++++++
kernel/taskstats.c | 25 +++++-
7 files changed, 271 insertions(+), 4 deletions(-)
create mode 100644 include/uapi/linux/task_diag.h
create mode 100644 kernel/taskdiag.c

diff --git a/include/linux/taskstats_kern.h b/include/linux/taskstats_kern.h
index 58de6ed..a1fd4f8 100644
--- a/include/linux/taskstats_kern.h
+++ b/include/linux/taskstats_kern.h
@@ -15,6 +15,8 @@
extern struct kmem_cache *taskstats_cache;
extern struct mutex taskstats_exit_mutex;

+extern struct genl_family taskstats_family;
+
static inline void taskstats_tgid_free(struct signal_struct *sig)
{
if (sig->stats)
@@ -23,6 +25,11 @@ static inline void taskstats_tgid_free(struct signal_struct *sig)

extern void taskstats_exit(struct task_struct *, int group_dead);
extern void taskstats_init_early(void);
+
+struct genl_info;
+struct sk_buff;
+int taskdiag_doit(struct sk_buff *skb, struct genl_info *info);
+
#else
static inline void taskstats_exit(struct task_struct *tsk, int group_dead)
{}
diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
new file mode 100644
index 0000000..3a1e6c4
--- /dev/null
+++ b/include/uapi/linux/task_diag.h
@@ -0,0 +1,60 @@
+#ifndef _LINUX_TASK_DIAG_H
+#define _LINUX_TASK_DIAG_H
+
+#include <linux/types.h>
+#include <linux/capability.h>
+
+enum {
+ /* optional attributes which can be specified in show_flags */
+ TASK_DIAG_BASE = 0,
+
+ /* other attributes */
+ TASK_DIAG_PID = 64, /* u32 */
+
+ __TASK_DIAG_ATTR_MAX
+#define TASK_DIAG_ATTR_MAX (__TASK_DIAG_ATTR_MAX - 1)
+};
+
+#define TASK_DIAG_SHOW_BASE (1ULL << TASK_DIAG_BASE)
+
+enum {
+ TASK_DIAG_RUNNING,
+ TASK_DIAG_INTERRUPTIBLE,
+ TASK_DIAG_UNINTERRUPTIBLE,
+ TASK_DIAG_STOPPED,
+ TASK_DIAG_TRACE_STOP,
+ TASK_DIAG_DEAD,
+ TASK_DIAG_ZOMBIE,
+};
+
+#define TASK_DIAG_COMM_LEN 16
+
+struct task_diag_base {
+ __u32 tgid;
+ __u32 pid;
+ __u32 ppid;
+ __u32 tpid;
+ __u32 sid;
+ __u32 pgid;
+ __u8 state;
+ char comm[TASK_DIAG_COMM_LEN];
+};
+
+#define TASK_DIAG_DUMP_ALL 0
+
+struct task_diag_pid {
+ __u64 show_flags;
+ __u64 dump_strategy;
+
+ __u32 pid;
+};
+
+enum {
+ TASK_DIAG_CMD_ATTR_UNSPEC = 0,
+ TASK_DIAG_CMD_ATTR_GET,
+ __TASK_DIAG_CMD_ATTR_MAX,
+};
+
+#define TASK_DIAG_CMD_ATTR_MAX (__TASK_DIAG_CMD_ATTR_MAX - 1)
+
+#endif /* _LINUX_TASK_DIAG_H */
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index a1cc91b..04b974a 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -181,6 +181,8 @@ enum {
CGROUPSTATS_CMD_GET, /* user->kernel request/get-response */
CGROUPSTATS_CMD_NEW, /* kernel->user event */

+ TASK_DIAG_CMD_GET,
+
__TASKSTATS_CMD_MAX,
};

diff --git a/init/Kconfig b/init/Kconfig
index 7d1ffd2..4d0483c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -432,6 +432,18 @@ config TASKSTATS

Say N if unsure.

+config TASK_DIAG
+ bool "Export task/process properties through netlink"
+ depends on NET && TASKSTATS
+ default n
+ help
+ Export selected properties for tasks/processes through the
+ generic netlink interface. Unlike the proc file system, task_diag
+ returns information in a binary format, allows to specify which
+ information are required.
+
+ Say N if unsure.
+
config TASK_DELAY_ACCT
bool "Enable per-task delay accounting"
depends on TASKSTATS
diff --git a/kernel/Makefile b/kernel/Makefile
index 60c302c..ed6fed5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_TASK_DIAG) += taskdiag.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
new file mode 100644
index 0000000..7327e08
--- /dev/null
+++ b/kernel/taskdiag.c
@@ -0,0 +1,168 @@
+#include <linux/kernel.h>
+#include <linux/taskstats_kern.h>
+#include <linux/task_diag.h>
+#include <net/genetlink.h>
+#include <linux/pid_namespace.h>
+#include <linux/ptrace.h>
+#include <linux/proc_fs.h>
+#include <linux/sched.h>
+
+static size_t taskdiag_packet_size(u64 show_flags)
+{
+ size_t size;
+
+ size = nla_total_size(sizeof(u32)); /* PID */
+
+ if (show_flags & TASK_DIAG_SHOW_BASE)
+ size += nla_total_size(sizeof(struct task_diag_base));
+
+ return size;
+}
+
+/*
+ * The task state array is a strange "bitmap" of
+ * reasons to sleep. Thus "running" is zero, and
+ * you can test for combinations of others with
+ * simple bit tests.
+ */
+static const __u8 task_state_array[] = {
+ TASK_DIAG_RUNNING,
+ TASK_DIAG_INTERRUPTIBLE,
+ TASK_DIAG_UNINTERRUPTIBLE,
+ TASK_DIAG_STOPPED,
+ TASK_DIAG_TRACE_STOP,
+ TASK_DIAG_DEAD,
+ TASK_DIAG_ZOMBIE,
+};
+
+static inline const __u8 get_task_state(struct task_struct *tsk)
+{
+ unsigned int state = (tsk->state | tsk->exit_state) & TASK_REPORT;
+
+ BUILD_BUG_ON(1 + ilog2(TASK_REPORT) != ARRAY_SIZE(task_state_array)-1);
+
+ return task_state_array[fls(state)];
+}
+
+static int fill_task_base(struct task_struct *p, struct sk_buff *skb)
+{
+ struct pid_namespace *ns = task_active_pid_ns(current);
+ struct task_diag_base *base;
+ struct nlattr *attr;
+ char tcomm[sizeof(p->comm)];
+ struct task_struct *tracer;
+
+ attr = nla_reserve(skb, TASK_DIAG_BASE, sizeof(struct task_diag_base));
+ if (!attr)
+ return -EMSGSIZE;
+
+ base = nla_data(attr);
+
+ rcu_read_lock();
+ base->ppid = pid_alive(p) ?
+ task_tgid_nr_ns(rcu_dereference(p->real_parent), ns) : 0;
+
+ base->tpid = 0;
+ tracer = ptrace_parent(p);
+ if (tracer)
+ base->tpid = task_pid_nr_ns(tracer, ns);
+
+ base->tgid = task_tgid_nr_ns(p, ns);
+ base->pid = task_pid_nr_ns(p, ns);
+ base->sid = task_session_nr_ns(p, ns);
+ base->pgid = task_pgrp_nr_ns(p, ns);
+
+ rcu_read_unlock();
+
+ get_task_comm(tcomm, p);
+ memset(base->comm, 0, TASK_DIAG_COMM_LEN);
+ strncpy(base->comm, tcomm, TASK_DIAG_COMM_LEN);
+
+ base->state = get_task_state(p);
+
+ return 0;
+}
+
+static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
+ u64 show_flags, u32 portid, u32 seq)
+{
+ void *reply;
+ int err;
+ u32 pid;
+
+ reply = genlmsg_put(skb, portid, seq, &taskstats_family, 0, TASK_DIAG_CMD_GET);
+ if (reply == NULL)
+ return -EMSGSIZE;
+
+ pid = task_pid_vnr(tsk);
+ err = nla_put_u32(skb, TASK_DIAG_PID, pid);
+ if (err)
+ goto err;
+
+ if (show_flags & TASK_DIAG_SHOW_BASE) {
+ err = fill_task_base(tsk, skb);
+ if (err)
+ goto err;
+ }
+
+ genlmsg_end(skb, reply);
+ return 0;
+err:
+ genlmsg_cancel(skb, reply);
+ return err;
+}
+
+int taskdiag_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ struct nlattr *nla = info->attrs[TASK_DIAG_CMD_ATTR_GET];
+ struct task_struct *tsk = NULL;
+ struct task_diag_pid req;
+ struct sk_buff *msg;
+ size_t size;
+ int rc;
+
+ if (!nla_data(nla))
+ return -EINVAL;
+
+ if (nla_len(nla) < sizeof(req))
+ return -EINVAL;
+
+ /*
+ * use a req variable to deal with alignment issues. task_diag_pid
+ * contains u64 elements which means extended load operations can be
+ * used and those can require 8-byte alignment (e.g., sparc)
+ */
+ memcpy(&req, nla_data(nla), sizeof(req));
+
+ size = taskdiag_packet_size(req.show_flags);
+ msg = genlmsg_new(size, GFP_KERNEL);
+ if (!msg)
+ return -ENOMEM;
+
+ rcu_read_lock();
+ tsk = find_task_by_vpid(req.pid);
+ if (tsk)
+ get_task_struct(tsk);
+ rcu_read_unlock();
+ if (!tsk) {
+ rc = -ESRCH;
+ goto err;
+ };
+
+ if (!ptrace_may_access(tsk, PTRACE_MODE_READ)) {
+ put_task_struct(tsk);
+ rc = -EPERM;
+ goto err;
+ }
+
+ rc = task_diag_fill(tsk, msg, req.show_flags,
+ info->snd_portid, info->snd_seq);
+ put_task_struct(tsk);
+ if (rc < 0)
+ goto err;
+
+ return genlmsg_reply(msg, info);
+err:
+ nlmsg_free(msg);
+ return rc;
+}
diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index 21f82c2..d70f1e5 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -18,6 +18,7 @@

#include <linux/kernel.h>
#include <linux/taskstats_kern.h>
+#include <linux/task_diag.h>
#include <linux/tsacct_kern.h>
#include <linux/delayacct.h>
#include <linux/cpumask.h>
@@ -41,7 +42,7 @@ static DEFINE_PER_CPU(__u32, taskstats_seqnum);
static int family_registered;
struct kmem_cache *taskstats_cache;

-static struct genl_family family = {
+struct genl_family taskstats_family = {
.id = GENL_ID_GENERATE,
.name = TASKSTATS_GENL_NAME,
.version = TASKSTATS_GENL_VERSION,
@@ -92,9 +93,9 @@ static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
if (!info) {
int seq = this_cpu_inc_return(taskstats_seqnum) - 1;

- reply = genlmsg_put(skb, 0, seq, &family, 0, cmd);
+ reply = genlmsg_put(skb, 0, seq, &taskstats_family, 0, cmd);
} else
- reply = genlmsg_put_reply(skb, info, &family, 0, cmd);
+ reply = genlmsg_put_reply(skb, info, &taskstats_family, 0, cmd);
if (reply == NULL) {
nlmsg_free(skb);
return -EINVAL;
@@ -664,6 +665,15 @@ err:
nlmsg_free(rep_skb);
}

+#ifdef CONFIG_TASK_DIAG
+static const struct nla_policy
+ taskdiag_cmd_get_policy[TASK_DIAG_CMD_ATTR_MAX+1] = {
+ [TASK_DIAG_CMD_ATTR_GET] = { .type = NLA_UNSPEC,
+ .len = sizeof(struct task_diag_pid)
+ },
+};
+#endif
+
static const struct genl_ops taskstats_ops[] = {
{
.cmd = TASKSTATS_CMD_GET,
@@ -676,6 +686,13 @@ static const struct genl_ops taskstats_ops[] = {
.doit = cgroupstats_user_cmd,
.policy = cgroupstats_cmd_get_policy,
},
+#ifdef CONFIG_TASK_DIAG
+ {
+ .cmd = TASK_DIAG_CMD_GET,
+ .doit = taskdiag_doit,
+ .policy = taskdiag_cmd_get_policy,
+ },
+#endif
};

/* Needed early in initialization */
@@ -694,7 +711,7 @@ static int __init taskstats_init(void)
{
int rc;

- rc = genl_register_family_with_ops(&family, taskstats_ops);
+ rc = genl_register_family_with_ops(&taskstats_family, taskstats_ops);
if (rc)
return rc;

--
2.1.0

2015-07-06 08:50:48

by Andrei Vagin

Subject: [PATCH 03/24] kernel: make taskstats available from all net namespaces

Signed-off-by: Andrey Vagin <[email protected]>
---
kernel/taskstats.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index d70f1e5..0c90d11 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -47,6 +47,7 @@ struct genl_family taskstats_family = {
.name = TASKSTATS_GENL_NAME,
.version = TASKSTATS_GENL_VERSION,
.maxattr = TASKSTATS_CMD_ATTR_MAX,
+ .netnsok = true,
};

static const struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] = {
--
2.1.0

2015-07-06 08:50:46

by Andrei Vagin

Subject: [PATCH 04/24] kernel: move next_tgid from fs/proc

This function will be used in task_diag.

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/base.c | 43 -------------------------------------------
include/linux/proc_fs.h | 7 +++++++
kernel/pid.c | 39 +++++++++++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1d540b3..88cf6ae 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2994,49 +2994,6 @@ out:
return ERR_PTR(result);
}

-/*
- * Find the first task with tgid >= tgid
- *
- */
-struct tgid_iter {
- unsigned int tgid;
- struct task_struct *task;
-};
-static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
-{
- struct pid *pid;
-
- if (iter.task)
- put_task_struct(iter.task);
- rcu_read_lock();
-retry:
- iter.task = NULL;
- pid = find_ge_pid(iter.tgid, ns);
- if (pid) {
- iter.tgid = pid_nr_ns(pid, ns);
- iter.task = pid_task(pid, PIDTYPE_PID);
- /* What we to know is if the pid we have find is the
- * pid of a thread_group_leader. Testing for task
- * being a thread_group_leader is the obvious thing
- * todo but there is a window when it fails, due to
- * the pid transfer logic in de_thread.
- *
- * So we perform the straight forward test of seeing
- * if the pid we have found is the pid of a thread
- * group leader, and don't worry if the task we have
- * found doesn't happen to be a thread group leader.
- * As we don't care in the case of readdir.
- */
- if (!iter.task || !has_group_leader_pid(iter.task)) {
- iter.tgid += 1;
- goto retry;
- }
- get_task_struct(iter.task);
- }
- rcu_read_unlock();
- return iter;
-}
-
#define TGID_OFFSET (FIRST_PROCESS_ENTRY + 2)

/* for the /proc/ directory itself, after non-process stuff has been done */
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index b97bf2e..136b6ed 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -82,4 +82,11 @@ static inline struct proc_dir_entry *proc_net_mkdir(
return proc_mkdir_data(name, 0, parent, net);
}

+struct tgid_iter {
+ unsigned int tgid;
+ struct task_struct *task;
+};
+
+struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter);
+
#endif /* _LINUX_PROC_FS_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 4fd07d5..2acb0a9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -569,6 +569,45 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
}

/*
+ * Find the first task with tgid >= tgid
+ *
+ */
+struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
+{
+ struct pid *pid;
+
+ if (iter.task)
+ put_task_struct(iter.task);
+ rcu_read_lock();
+retry:
+ iter.task = NULL;
+ pid = find_ge_pid(iter.tgid, ns);
+ if (pid) {
+ iter.tgid = pid_nr_ns(pid, ns);
+ iter.task = pid_task(pid, PIDTYPE_PID);
+ /* What we to know is if the pid we have find is the
+ * pid of a thread_group_leader. Testing for task
+ * being a thread_group_leader is the obvious thing
+ * todo but there is a window when it fails, due to
+ * the pid transfer logic in de_thread.
+ *
+ * So we perform the straight forward test of seeing
+ * if the pid we have found is the pid of a thread
+ * group leader, and don't worry if the task we have
+ * found doesn't happen to be a thread group leader.
+ * As we don't care in the case of readdir.
+ */
+ if (!iter.task || !has_group_leader_pid(iter.task)) {
+ iter.tgid += 1;
+ goto retry;
+ }
+ get_task_struct(iter.task);
+ }
+ rcu_read_unlock();
+ return iter;
+}
+
+/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
* more.
--
2.1.0

2015-07-06 08:50:51

by Andrei Vagin

Subject: [PATCH 05/24] task_diag: add ability to get information about all tasks

For that we need to set NLM_F_DUMP. A response can consist of a few packets.
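In userspace, the dump variant differs from the single-task request only
in the netlink header flags; a minimal sketch (mark_dump is a
hypothetical helper name):

```c
#include <assert.h>
#include <linux/netlink.h>

/* To dump all tasks, the request carries NLM_F_DUMP in addition to
 * NLM_F_REQUEST; the kernel then invokes the dumpit callback and
 * streams the response over as many packets as needed. */
static void mark_dump(struct nlmsghdr *nlh)
{
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
}
```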

Signed-off-by: Andrey Vagin <[email protected]>
---
include/linux/taskstats_kern.h | 2 ++
kernel/taskdiag.c | 40 ++++++++++++++++++++++++++++++++++++++++
kernel/taskstats.c | 1 +
3 files changed, 43 insertions(+)

diff --git a/include/linux/taskstats_kern.h b/include/linux/taskstats_kern.h
index a1fd4f8..716835f 100644
--- a/include/linux/taskstats_kern.h
+++ b/include/linux/taskstats_kern.h
@@ -29,6 +29,8 @@ extern void taskstats_init_early(void);
struct genl_info;
struct sk_buff;
int taskdiag_doit(struct sk_buff *skb, struct genl_info *info);
+struct netlink_callback;
+int taskdiag_dumpit(struct sk_buff *skb, struct netlink_callback *cb);

#else
static inline void taskstats_exit(struct task_struct *tsk, int group_dead)
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 7327e08..3497304 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -112,6 +112,46 @@ err:
return err;
}

+int taskdiag_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
+{
+ struct pid_namespace *ns = task_active_pid_ns(current);
+ struct tgid_iter iter;
+ struct nlattr *na;
+ struct task_diag_pid req;
+ int rc;
+
+ if (nlmsg_len(cb->nlh) < GENL_HDRLEN + sizeof(req))
+ return -EINVAL;
+
+ na = nlmsg_data(cb->nlh) + GENL_HDRLEN;
+ if (na->nla_type < 0)
+ return -EINVAL;
+
+ memcpy(&req, nla_data(na), sizeof(req));
+
+ iter.tgid = cb->args[0];
+ iter.task = NULL;
+ for (iter = next_tgid(ns, iter);
+ iter.task;
+ iter.tgid += 1, iter = next_tgid(ns, iter)) {
+ if (!ptrace_may_access(iter.task, PTRACE_MODE_READ))
+ continue;
+
+ rc = task_diag_fill(iter.task, skb, req.show_flags,
+ NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq);
+ if (rc < 0) {
+ put_task_struct(iter.task);
+ if (rc != -EMSGSIZE)
+ return rc;
+ break;
+ }
+ }
+
+ cb->args[0] = iter.tgid;
+
+ return skb->len;
+}
+
int taskdiag_doit(struct sk_buff *skb, struct genl_info *info)
{
struct nlattr *nla = info->attrs[TASK_DIAG_CMD_ATTR_GET];
diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index 0c90d11..40f2cdf2 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -691,6 +691,7 @@ static const struct genl_ops taskstats_ops[] = {
{
.cmd = TASK_DIAG_CMD_GET,
.doit = taskdiag_doit,
+ .dumpit = taskdiag_dumpit,
.policy = taskdiag_cmd_get_policy,
},
#endif
--
2.1.0

2015-07-06 08:50:57

by Andrei Vagin

Subject: [PATCH 06/24] task_diag: add ability to split per-task data on a few netlink messages

The size of the requested per-task data can be bigger than the size of
one netlink skb. With this patch, we are able to split the data into a
few netlink messages.

v2: set NLM_F_MULTI for multipart packets

Signed-off-by: Andrey Vagin <[email protected]>
---
kernel/taskdiag.c | 76 +++++++++++++++++++++++++++++++++++++------------------
1 file changed, 51 insertions(+), 25 deletions(-)

diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 3497304..5771baf 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -84,13 +84,21 @@ static int fill_task_base(struct task_struct *p, struct sk_buff *skb)
}

static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
- u64 show_flags, u32 portid, u32 seq)
+ u64 show_flags, u32 portid, u32 seq,
+ struct netlink_callback *cb)
{
void *reply;
- int err;
+ int err = 0, i = 0, n = 0;
+ int flags = 0;
u32 pid;

- reply = genlmsg_put(skb, portid, seq, &taskstats_family, 0, TASK_DIAG_CMD_GET);
+ if (cb) {
+ n = cb->args[2];
+ flags |= NLM_F_MULTI;
+ }
+
+ reply = genlmsg_put(skb, portid, seq, &taskstats_family,
+ flags, TASK_DIAG_CMD_GET);
if (reply == NULL)
return -EMSGSIZE;

@@ -100,15 +108,26 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
goto err;

if (show_flags & TASK_DIAG_SHOW_BASE) {
- err = fill_task_base(tsk, skb);
+ if (i >= n)
+ err = fill_task_base(tsk, skb);
if (err)
goto err;
+ i++;
}

genlmsg_end(skb, reply);
+ if (cb)
+ cb->args[2] = 0;
+
return 0;
err:
- genlmsg_cancel(skb, reply);
+ if (err == -EMSGSIZE && i != 0) {
+ if (cb)
+ cb->args[2] = i;
+ genlmsg_end(skb, reply);
+ } else
+ genlmsg_cancel(skb, reply);
+
return err;
}

@@ -138,7 +157,7 @@ int taskdiag_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
continue;

rc = task_diag_fill(iter.task, skb, req.show_flags,
- NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq);
+ NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, cb);
if (rc < 0) {
put_task_struct(iter.task);
if (rc != -EMSGSIZE)
@@ -157,7 +176,7 @@ int taskdiag_doit(struct sk_buff *skb, struct genl_info *info)
struct nlattr *nla = info->attrs[TASK_DIAG_CMD_ATTR_GET];
struct task_struct *tsk = NULL;
struct task_diag_pid req;
- struct sk_buff *msg;
+ struct sk_buff *msg = NULL;
size_t size;
int rc;

@@ -174,35 +193,42 @@ int taskdiag_doit(struct sk_buff *skb, struct genl_info *info)
*/
memcpy(&req, nla_data(nla), sizeof(req));

- size = taskdiag_packet_size(req.show_flags);
- msg = genlmsg_new(size, GFP_KERNEL);
- if (!msg)
- return -ENOMEM;
-
rcu_read_lock();
tsk = find_task_by_vpid(req.pid);
if (tsk)
get_task_struct(tsk);
rcu_read_unlock();
- if (!tsk) {
- rc = -ESRCH;
- goto err;
- };
+ if (!tsk)
+ return -ESRCH;

if (!ptrace_may_access(tsk, PTRACE_MODE_READ)) {
put_task_struct(tsk);
- rc = -EPERM;
- goto err;
+ return -EPERM;
+ }
+
+ size = taskdiag_packet_size(req.show_flags);
+
+ while (1) {
+ msg = genlmsg_new(size, GFP_KERNEL);
+ if (!msg) {
+ put_task_struct(tsk);
+ return -EMSGSIZE;
+ }
+
+ rc = task_diag_fill(tsk, msg, req.show_flags,
+ info->snd_portid, info->snd_seq, NULL);
+ if (rc != -EMSGSIZE)
+ break;
+
+ nlmsg_free(msg);
+ size += 128;
}

- rc = task_diag_fill(tsk, msg, req.show_flags,
- info->snd_portid, info->snd_seq);
put_task_struct(tsk);
- if (rc < 0)
- goto err;
+ if (rc < 0) {
+ nlmsg_free(msg);
+ return rc;
+ }

return genlmsg_reply(msg, info);
-err:
- nlmsg_free(msg);
- return rc;
}
--
2.1.0

2015-07-06 08:56:44

by Andrei Vagin

Subject: [PATCH 07/24] task_diag: add a new group to get process credentials

A response is represented by the task_diag_creds structure:

struct task_diag_creds {
	struct task_diag_caps cap_inheritable;
	struct task_diag_caps cap_permitted;
	struct task_diag_caps cap_effective;
	struct task_diag_caps cap_bset;

	__u32 uid;
	__u32 euid;
	__u32 suid;
	__u32 fsuid;
	__u32 gid;
	__u32 egid;
	__u32 sgid;
	__u32 fsgid;
};

This group is optional and it's filled only if show_flags contains
TASK_DIAG_SHOW_CRED.
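On the userspace side, this group can be copied out of its attribute
like any fixed-size netlink payload; a sketch (the structure is mirrored
from this patch, and read_creds is a hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/capability.h>	/* _LINUX_CAPABILITY_U32S_3 */

struct task_diag_caps {
	uint32_t cap[_LINUX_CAPABILITY_U32S_3];
};

struct task_diag_creds {
	struct task_diag_caps cap_inheritable;
	struct task_diag_caps cap_permitted;
	struct task_diag_caps cap_effective;
	struct task_diag_caps cap_bset;

	uint32_t uid, euid, suid, fsuid;
	uint32_t gid, egid, sgid, fsgid;
};

/* Copy the TASK_DIAG_CRED payload out of a received attribute.
 * Copying, rather than aliasing the receive buffer, keeps the code
 * safe no matter how the buffer was allocated. */
static struct task_diag_creds read_creds(const void *nla_payload)
{
	struct task_diag_creds c;

	memcpy(&c, nla_payload, sizeof(c));
	return c;
}
```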

Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/task_diag.h | 22 ++++++++++++++++++
kernel/taskdiag.c | 53 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 75 insertions(+)

diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index 3a1e6c4..0e659d6 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -7,6 +7,7 @@
enum {
/* optional attributes which can be specified in show_flags */
TASK_DIAG_BASE = 0,
+ TASK_DIAG_CRED,

/* other attributes */
TASK_DIAG_PID = 64, /* u32 */
@@ -16,6 +17,7 @@ enum {
};

#define TASK_DIAG_SHOW_BASE (1ULL << TASK_DIAG_BASE)
+#define TASK_DIAG_SHOW_CRED (1ULL << TASK_DIAG_CRED)

enum {
TASK_DIAG_RUNNING,
@@ -40,6 +42,26 @@ struct task_diag_base {
char comm[TASK_DIAG_COMM_LEN];
};

+struct task_diag_caps {
+ __u32 cap[_LINUX_CAPABILITY_U32S_3];
+};
+
+struct task_diag_creds {
+ struct task_diag_caps cap_inheritable;
+ struct task_diag_caps cap_permitted;
+ struct task_diag_caps cap_effective;
+ struct task_diag_caps cap_bset;
+
+ __u32 uid;
+ __u32 euid;
+ __u32 suid;
+ __u32 fsuid;
+ __u32 gid;
+ __u32 egid;
+ __u32 sgid;
+ __u32 fsgid;
+};
+
#define TASK_DIAG_DUMP_ALL 0

struct task_diag_pid {
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 5771baf..5123e12 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -16,6 +16,9 @@ static size_t taskdiag_packet_size(u64 show_flags)
if (show_flags & TASK_DIAG_SHOW_BASE)
size += nla_total_size(sizeof(struct task_diag_base));

+ if (show_flags & TASK_DIAG_SHOW_CRED)
+ size += nla_total_size(sizeof(struct task_diag_creds));
+
return size;
}

@@ -83,6 +86,48 @@ static int fill_task_base(struct task_struct *p, struct sk_buff *skb)
return 0;
}

+static inline void caps2diag(struct task_diag_caps *diag, const kernel_cap_t *cap)
+{
+ int i;
+
+ for (i = 0; i < _LINUX_CAPABILITY_U32S_3; i++)
+ diag->cap[i] = cap->cap[i];
+}
+
+static int fill_creds(struct task_struct *p, struct sk_buff *skb)
+{
+ struct user_namespace *user_ns = current_user_ns();
+ struct task_diag_creds *diag_cred;
+ const struct cred *cred;
+ struct nlattr *attr;
+
+ attr = nla_reserve(skb, TASK_DIAG_CRED, sizeof(struct task_diag_creds));
+ if (!attr)
+ return -EMSGSIZE;
+
+ diag_cred = nla_data(attr);
+
+ cred = get_task_cred(p);
+
+ caps2diag(&diag_cred->cap_inheritable, &cred->cap_inheritable);
+ caps2diag(&diag_cred->cap_permitted, &cred->cap_permitted);
+ caps2diag(&diag_cred->cap_effective, &cred->cap_effective);
+ caps2diag(&diag_cred->cap_bset, &cred->cap_bset);
+
+ diag_cred->uid = from_kuid_munged(user_ns, cred->uid);
+ diag_cred->euid = from_kuid_munged(user_ns, cred->euid);
+ diag_cred->suid = from_kuid_munged(user_ns, cred->suid);
+ diag_cred->fsuid = from_kuid_munged(user_ns, cred->fsuid);
+ diag_cred->gid = from_kgid_munged(user_ns, cred->gid);
+ diag_cred->egid = from_kgid_munged(user_ns, cred->egid);
+ diag_cred->sgid = from_kgid_munged(user_ns, cred->sgid);
+ diag_cred->fsgid = from_kgid_munged(user_ns, cred->fsgid);
+
+ put_cred(cred);
+
+ return 0;
+}
+
static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
u64 show_flags, u32 portid, u32 seq,
struct netlink_callback *cb)
@@ -115,6 +160,14 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
i++;
}

+ if (show_flags & TASK_DIAG_SHOW_CRED) {
+ if (i >= n)
+ err = fill_creds(tsk, skb);
+ if (err)
+ goto err;
+ i++;
+ }
+
genlmsg_end(skb, reply);
if (cb)
cb->args[2] = 0;
--
2.1.0

2015-07-06 08:56:01

by Andrei Vagin

Subject: [PATCH 08/24] proc: pick out a function to iterate task children

This function will be used in task_diag.

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/array.c | 67 ++++++++++++++++++++++++++++++++++-----------------------
1 file changed, 40 insertions(+), 27 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ce065cf..52c4a7b 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -578,31 +578,26 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
}

#ifdef CONFIG_PROC_CHILDREN
-static struct pid *
-get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
+static struct task_struct *
+task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos)
{
- struct task_struct *start, *task;
- struct pid *pid = NULL;
+ struct task_struct *task;

read_lock(&tasklist_lock);
-
- start = pid_task(proc_pid(inode), PIDTYPE_PID);
- if (!start)
- goto out;
-
/*
* Lets try to continue searching first, this gives
* us significant speedup on children-rich processes.
*/
- if (pid_prev) {
- task = pid_task(pid_prev, PIDTYPE_PID);
- if (task && task->real_parent == start &&
+ if (prev) {
+ task = prev;
+ if (task && task->real_parent == parent &&
!(list_empty(&task->sibling))) {
- if (list_is_last(&task->sibling, &start->children))
+ if (list_is_last(&task->sibling, &parent->children)) {
+ task = NULL;
goto out;
+ }
task = list_first_entry(&task->sibling,
struct task_struct, sibling);
- pid = get_pid(task_pid(task));
goto out;
}
}
@@ -622,16 +617,34 @@ get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
* So one need to stop or freeze the leader and all
* its children to get a precise result.
*/
- list_for_each_entry(task, &start->children, sibling) {
- if (pos-- == 0) {
- pid = get_pid(task_pid(task));
- break;
- }
+ list_for_each_entry(task, &parent->children, sibling) {
+ if (pos-- == 0)
+ goto out;
}
-
+ task = NULL;
out:
+ if (prev)
+ put_task_struct(prev);
+ if (task)
+ get_task_struct(task);
read_unlock(&tasklist_lock);
- return pid;
+ return task;
+}
+
+static struct task_struct *
+get_children_pid(struct inode *inode, struct task_struct *prev, loff_t pos)
+{
+ struct task_struct *start, *task = NULL;
+
+ start = get_proc_task(inode);
+ if (!start)
+ goto out;
+
+ task = task_next_child(start, prev, pos);
+
+ put_task_struct(start);
+out:
+ return task;
}

static int children_seq_show(struct seq_file *seq, void *v)
@@ -639,7 +652,7 @@ static int children_seq_show(struct seq_file *seq, void *v)
struct inode *inode = seq->private;
pid_t pid;

- pid = pid_nr_ns(v, inode->i_sb->s_fs_info);
+ pid = task_pid_nr_ns(v, inode->i_sb->s_fs_info);
seq_printf(seq, "%d ", pid);

return 0;
@@ -652,18 +665,18 @@ static void *children_seq_start(struct seq_file *seq, loff_t *pos)

static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
- struct pid *pid;
+ struct task_struct *task;

- pid = get_children_pid(seq->private, v, *pos + 1);
- put_pid(v);
+ task = get_children_pid(seq->private, v, *pos + 1);

++*pos;
- return pid;
+ return task;
}

static void children_seq_stop(struct seq_file *seq, void *v)
{
- put_pid(v);
+ if (v)
+ put_task_struct(v);
}

static const struct seq_operations children_seq_ops = {
--
2.1.0

2015-07-06 08:51:02

by Andrei Vagin

Subject: [PATCH 09/24] proc: move task_next_child() from fs/proc

It's going to be used in task_diag.

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/array.c | 53 -------------------------------------------------
include/linux/proc_fs.h | 3 +++
kernel/pid.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 56 insertions(+), 53 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 52c4a7b..e22266d 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -579,59 +579,6 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,

#ifdef CONFIG_PROC_CHILDREN
static struct task_struct *
-task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos)
-{
- struct task_struct *task;
-
- read_lock(&tasklist_lock);
- /*
- * Lets try to continue searching first, this gives
- * us significant speedup on children-rich processes.
- */
- if (prev) {
- task = prev;
- if (task && task->real_parent == parent &&
- !(list_empty(&task->sibling))) {
- if (list_is_last(&task->sibling, &parent->children)) {
- task = NULL;
- goto out;
- }
- task = list_first_entry(&task->sibling,
- struct task_struct, sibling);
- goto out;
- }
- }
-
- /*
- * Slow search case.
- *
- * We might miss some children here if children
- * are exited while we were not holding the lock,
- * but it was never promised to be accurate that
- * much.
- *
- * "Just suppose that the parent sleeps, but N children
- * exit after we printed their tids. Now the slow paths
- * skips N extra children, we miss N tasks." (c)
- *
- * So one need to stop or freeze the leader and all
- * its children to get a precise result.
- */
- list_for_each_entry(task, &parent->children, sibling) {
- if (pos-- == 0)
- goto out;
- }
- task = NULL;
-out:
- if (prev)
- put_task_struct(prev);
- if (task)
- get_task_struct(task);
- read_unlock(&tasklist_lock);
- return task;
-}
-
-static struct task_struct *
get_children_pid(struct inode *inode, struct task_struct *prev, loff_t pos)
{
struct task_struct *start, *task = NULL;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 136b6ed..497e58c 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -89,4 +89,7 @@ struct tgid_iter {

struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter);

+struct task_struct *
+task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos);
+
#endif /* _LINUX_PROC_FS_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 2acb0a9..42a6b40 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -607,6 +607,59 @@ retry:
return iter;
}

+struct task_struct *
+task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos)
+{
+ struct task_struct *task;
+
+ read_lock(&tasklist_lock);
+ /*
+ * Lets try to continue searching first, this gives
+ * us significant speedup on children-rich processes.
+ */
+ if (prev) {
+ task = prev;
+ if (task && task->real_parent == parent &&
+ !(list_empty(&task->sibling))) {
+ if (list_is_last(&task->sibling, &parent->children)) {
+ task = NULL;
+ goto out;
+ }
+ task = list_first_entry(&task->sibling,
+ struct task_struct, sibling);
+ goto out;
+ }
+ }
+
+ /*
+ * Slow search case.
+ *
+ * We might miss some children here if children
+ * are exited while we were not holding the lock,
+ * but it was never promised to be accurate that
+ * much.
+ *
+ * "Just suppose that the parent sleeps, but N children
+ * exit after we printed their tids. Now the slow paths
+ * skips N extra children, we miss N tasks." (c)
+ *
+ * So one need to stop or freeze the leader and all
+ * its children to get a precise result.
+ */
+ list_for_each_entry(task, &parent->children, sibling) {
+ if (pos-- == 0)
+ goto out;
+ }
+ task = NULL;
+out:
+ if (prev)
+ put_task_struct(prev);
+ if (task)
+ get_task_struct(task);
+ read_unlock(&tasklist_lock);
+ return task;
+}
+
/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
--
2.1.0

2015-07-06 08:55:31

by Andrei Vagin

Subject: [PATCH 10/24] task_diag: add ability to dump children (v2)

Now we can dump all tasks or the children of a specified task.
It's an example of how this interface can be extended for different
use-cases.

v2: Fixes from David Ahern
Add missing break in iter_stop
Fix 8-byte alignment issues

Cc: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/task_diag.h | 1 +
kernel/taskdiag.c | 117 +++++++++++++++++++++++++++++++++++------
2 files changed, 102 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index 0e659d6..af192db 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -63,6 +63,7 @@ struct task_diag_creds {
};

#define TASK_DIAG_DUMP_ALL 0
+#define TASK_DIAG_DUMP_CHILDREN 1

struct task_diag_pid {
__u64 show_flags;
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 5123e12..6dd3361 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -184,42 +184,127 @@ err:
return err;
}

+struct task_iter {
+ struct task_diag_pid req;
+ struct pid_namespace *ns;
+ struct netlink_callback *cb;
+ struct task_struct *parent;
+
+ union {
+ struct tgid_iter tgid;
+ struct {
+ unsigned int pos;
+ struct task_struct *task;
+ };
+ };
+};
+
+static void iter_stop(struct task_iter *iter)
+{
+ struct task_struct *task;
+
+ if (iter->parent)
+ put_task_struct(iter->parent);
+
+ switch (iter->req.dump_strategy) {
+ case TASK_DIAG_DUMP_ALL:
+ task = iter->tgid.task;
+ break;
+ default:
+ task = iter->task;
+ }
+ if (task)
+ put_task_struct(task);
+}
+
+static struct task_struct *iter_start(struct task_iter *iter)
+{
+ if (iter->req.pid > 0) {
+ rcu_read_lock();
+ iter->parent = find_task_by_pid_ns(iter->req.pid, iter->ns);
+ if (iter->parent)
+ get_task_struct(iter->parent);
+ rcu_read_unlock();
+ }
+
+ switch (iter->req.dump_strategy) {
+ case TASK_DIAG_DUMP_CHILDREN:
+
+ if (iter->parent == NULL)
+ return ERR_PTR(-ESRCH);
+
+ iter->pos = iter->cb->args[0];
+ iter->task = task_next_child(iter->parent, NULL, iter->pos);
+ return iter->task;
+
+ case TASK_DIAG_DUMP_ALL:
+ iter->tgid.tgid = iter->cb->args[0];
+ iter->tgid.task = NULL;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ return iter->tgid.task;
+ }
+
+ return ERR_PTR(-EINVAL);
+}
+
+static struct task_struct *iter_next(struct task_iter *iter)
+{
+ switch (iter->req.dump_strategy) {
+ case TASK_DIAG_DUMP_CHILDREN:
+ iter->pos++;
+ iter->task = task_next_child(iter->parent, iter->task, iter->pos);
+ iter->cb->args[0] = iter->pos;
+ return iter->task;
+
+ case TASK_DIAG_DUMP_ALL:
+ iter->tgid.tgid += 1;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ iter->cb->args[0] = iter->tgid.tgid;
+ return iter->tgid.task;
+ }
+
+ return NULL;
+}
+
int taskdiag_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
{
struct pid_namespace *ns = task_active_pid_ns(current);
- struct tgid_iter iter;
+ struct task_iter iter;
struct nlattr *na;
- struct task_diag_pid req;
+ struct task_struct *task;
int rc;

- if (nlmsg_len(cb->nlh) < GENL_HDRLEN + sizeof(req))
+ if (nlmsg_len(cb->nlh) < GENL_HDRLEN + sizeof(iter.req))
return -EINVAL;

na = nlmsg_data(cb->nlh) + GENL_HDRLEN;
if (na->nla_type < 0)
return -EINVAL;

- memcpy(&req, nla_data(na), sizeof(req));
+ memcpy(&iter.req, nla_data(na), sizeof(iter.req));

- iter.tgid = cb->args[0];
- iter.task = NULL;
- for (iter = next_tgid(ns, iter);
- iter.task;
- iter.tgid += 1, iter = next_tgid(ns, iter)) {
- if (!ptrace_may_access(iter.task, PTRACE_MODE_READ))
- continue;
+ iter.ns = ns;
+ iter.cb = cb;
+ iter.parent = NULL;

- rc = task_diag_fill(iter.task, skb, req.show_flags,
+ task = iter_start(&iter);
+ if (IS_ERR(task))
+ return PTR_ERR(task);
+
+ for (; task; task = iter_next(&iter)) {
+ if (!ptrace_may_access(task, PTRACE_MODE_READ))
+ continue;
+ rc = task_diag_fill(task, skb, iter.req.show_flags,
NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, cb);
if (rc < 0) {
- put_task_struct(iter.task);
- if (rc != -EMSGSIZE)
+ if (rc != -EMSGSIZE) {
+ iter_stop(&iter);
return rc;
+ }
break;
}
}
-
- cb->args[0] = iter.tgid;
+ iter_stop(&iter);

return skb->len;
}
--
2.1.0

2015-07-06 08:51:07

by Andrei Vagin

Subject: [PATCH 11/24] task_diag: add a new group to get task statistics

Signed-off-by: Andrey Vagin <[email protected]>
---
include/linux/taskstats_kern.h | 2 ++
include/uapi/linux/task_diag.h | 2 ++
kernel/taskdiag.c | 30 ++++++++++++++++++++++++++++++
kernel/taskstats.c | 2 +-
4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/taskstats_kern.h b/include/linux/taskstats_kern.h
index 716835f..8faec32 100644
--- a/include/linux/taskstats_kern.h
+++ b/include/linux/taskstats_kern.h
@@ -32,6 +32,8 @@ int taskdiag_doit(struct sk_buff *skb, struct genl_info *info);
struct netlink_callback;
int taskdiag_dumpit(struct sk_buff *skb, struct netlink_callback *cb);

+extern int fill_stats_for_pid(pid_t pid, struct taskstats *stats);
+
#else
static inline void taskstats_exit(struct task_struct *tsk, int group_dead)
{}
diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index af192db..c51380a 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -8,6 +8,7 @@ enum {
/* optional attributes which can be specified in show_flags */
TASK_DIAG_BASE = 0,
TASK_DIAG_CRED,
+ TASK_DIAG_STAT,

/* other attributes */
TASK_DIAG_PID = 64, /* u32 */
@@ -18,6 +19,7 @@ enum {

#define TASK_DIAG_SHOW_BASE (1ULL << TASK_DIAG_BASE)
#define TASK_DIAG_SHOW_CRED (1ULL << TASK_DIAG_CRED)
+#define TASK_DIAG_SHOW_STAT (1ULL << TASK_DIAG_STAT)

enum {
TASK_DIAG_RUNNING,
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 6dd3361..a49ccab 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -6,6 +6,7 @@
#include <linux/ptrace.h>
#include <linux/proc_fs.h>
#include <linux/sched.h>
+#include <linux/taskstats.h>

static size_t taskdiag_packet_size(u64 show_flags)
{
@@ -19,6 +20,9 @@ static size_t taskdiag_packet_size(u64 show_flags)
if (show_flags & TASK_DIAG_SHOW_CRED)
size += nla_total_size(sizeof(struct task_diag_creds));

+ if (show_flags & TASK_DIAG_SHOW_STAT)
+ size += nla_total_size(sizeof(struct taskstats));
+
return size;
}

@@ -94,6 +98,24 @@ static inline void caps2diag(struct task_diag_caps *diag, const kernel_cap_t *ca
diag->cap[i] = cap->cap[i];
}

+static int fill_stats(struct task_struct *tsk, struct sk_buff *skb)
+{
+ struct taskstats *diag_stats;
+ struct nlattr *attr;
+ int ret;
+
+ attr = nla_reserve(skb, TASK_DIAG_STAT, sizeof(struct taskstats));
+ if (!attr)
+ return -EMSGSIZE;
+
+ diag_stats = nla_data(attr);
+
+ ret = fill_stats_for_pid(task_pid_vnr(tsk), diag_stats);
+ if (ret)
+ return ret;
+ return 0;
+}
+
static int fill_creds(struct task_struct *p, struct sk_buff *skb)
{
struct user_namespace *user_ns = current_user_ns();
@@ -168,6 +190,14 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
i++;
}

+ if (show_flags & TASK_DIAG_SHOW_STAT) {
+ if (i >= n)
+ err = fill_stats(tsk, skb);
+ if (err)
+ goto err;
+ i++;
+ }
+
genlmsg_end(skb, reply);
if (cb)
cb->args[2] = 0;
diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index 40f2cdf2..3b3d660 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -192,7 +192,7 @@ static void fill_stats(struct user_namespace *user_ns,
xacct_add_tsk(stats, tsk);
}

-static int fill_stats_for_pid(pid_t pid, struct taskstats *stats)
+int fill_stats_for_pid(pid_t pid, struct taskstats *stats)
{
struct task_struct *tsk;

--
2.1.0

2015-07-06 08:53:42

by Andrei Vagin

Subject: [PATCH 12/24] task_diag: add a new group to get tasks memory mappings (v2)

v2: Fixes from David Ahern
* Fix 8-byte alignment
* Change implementation of DIAG_VMA attribute:

This patch puts the filename into the task_diag_vma struct and
converts the TASK_DIAG_VMA attribute into a series of task_diag_vma
records. Now there is a single TASK_DIAG_VMA attribute that is parsed
as:

| struct task_diag_vma | filename | ...

Cc: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/task_diag.h | 54 +++++++++
kernel/taskdiag.c | 255 ++++++++++++++++++++++++++++++++++++++++-
2 files changed, 306 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index c51380a..943d8d1 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -2,6 +2,7 @@
#define _LINUX_TASK_DIAG_H

#include <linux/types.h>
+#include <linux/netlink.h>
#include <linux/capability.h>

enum {
@@ -9,6 +10,7 @@ enum {
TASK_DIAG_BASE = 0,
TASK_DIAG_CRED,
TASK_DIAG_STAT,
+ TASK_DIAG_VMA,

/* other attributes */
TASK_DIAG_PID = 64, /* u32 */
@@ -20,6 +22,7 @@ enum {
#define TASK_DIAG_SHOW_BASE (1ULL << TASK_DIAG_BASE)
#define TASK_DIAG_SHOW_CRED (1ULL << TASK_DIAG_CRED)
#define TASK_DIAG_SHOW_STAT (1ULL << TASK_DIAG_STAT)
+#define TASK_DIAG_SHOW_VMA (1ULL << TASK_DIAG_VMA)

enum {
TASK_DIAG_RUNNING,
@@ -64,6 +67,57 @@ struct task_diag_creds {
__u32 fsgid;
};

+#define TASK_DIAG_VMA_F_READ (1ULL << 0)
+#define TASK_DIAG_VMA_F_WRITE (1ULL << 1)
+#define TASK_DIAG_VMA_F_EXEC (1ULL << 2)
+#define TASK_DIAG_VMA_F_SHARED (1ULL << 3)
+#define TASK_DIAG_VMA_F_MAYREAD (1ULL << 4)
+#define TASK_DIAG_VMA_F_MAYWRITE (1ULL << 5)
+#define TASK_DIAG_VMA_F_MAYEXEC (1ULL << 6)
+#define TASK_DIAG_VMA_F_MAYSHARE (1ULL << 7)
+#define TASK_DIAG_VMA_F_GROWSDOWN (1ULL << 8)
+#define TASK_DIAG_VMA_F_PFNMAP (1ULL << 9)
+#define TASK_DIAG_VMA_F_DENYWRITE (1ULL << 10)
+#define TASK_DIAG_VMA_F_MPX (1ULL << 11)
+#define TASK_DIAG_VMA_F_LOCKED (1ULL << 12)
+#define TASK_DIAG_VMA_F_IO (1ULL << 13)
+#define TASK_DIAG_VMA_F_SEQ_READ (1ULL << 14)
+#define TASK_DIAG_VMA_F_RAND_READ (1ULL << 15)
+#define TASK_DIAG_VMA_F_DONTCOPY (1ULL << 16)
+#define TASK_DIAG_VMA_F_DONTEXPAND (1ULL << 17)
+#define TASK_DIAG_VMA_F_ACCOUNT (1ULL << 18)
+#define TASK_DIAG_VMA_F_NORESERVE (1ULL << 19)
+#define TASK_DIAG_VMA_F_HUGETLB (1ULL << 20)
+#define TASK_DIAG_VMA_F_ARCH_1 (1ULL << 21)
+#define TASK_DIAG_VMA_F_DONTDUMP (1ULL << 22)
+#define TASK_DIAG_VMA_F_SOFTDIRTY (1ULL << 23)
+#define TASK_DIAG_VMA_F_MIXEDMAP (1ULL << 24)
+#define TASK_DIAG_VMA_F_HUGEPAGE (1ULL << 25)
+#define TASK_DIAG_VMA_F_NOHUGEPAGE (1ULL << 26)
+#define TASK_DIAG_VMA_F_MERGEABLE (1ULL << 27)
+
+/* task_diag_vma must be NLA_ALIGN'ed */
+struct task_diag_vma {
+ __u64 start, end;
+ __u64 vm_flags;
+ __u64 pgoff;
+ __u32 major;
+ __u32 minor;
+ __u64 inode;
+ __u32 generation;
+ __u16 vma_len;
+ __u16 name_off;
+ __u16 name_len;
+} __attribute__((__aligned__(NLA_ALIGNTO)));
+
+static inline char *task_diag_vma_name(struct task_diag_vma *vma)
+{
+ if (!vma->name_len)
+ return NULL;
+
+ return ((char *)vma) + vma->name_off;
+}
+
#define TASK_DIAG_DUMP_ALL 0
#define TASK_DIAG_DUMP_CHILDREN 1

diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index a49ccab..c488c1b 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -8,7 +8,7 @@
#include <linux/sched.h>
#include <linux/taskstats.h>

-static size_t taskdiag_packet_size(u64 show_flags)
+static size_t taskdiag_packet_size(u64 show_flags, int n_vma)
{
size_t size;

@@ -23,6 +23,14 @@ static size_t taskdiag_packet_size(u64 show_flags)
if (show_flags & TASK_DIAG_SHOW_STAT)
size += nla_total_size(sizeof(struct taskstats));

+ if (show_flags & TASK_DIAG_SHOW_VMA && n_vma > 0) {
+ /*
+ * 128 is a schwag on average path length for maps; used to
+ * ballpark initial memory allocation for genl msg
+ */
+ size += nla_total_size(sizeof(struct task_diag_vma) * n_vma + 128);
+ }
+
return size;
}

@@ -150,12 +158,245 @@ static int fill_creds(struct task_struct *p, struct sk_buff *skb)
return 0;
}

+static u64 get_vma_flags(struct vm_area_struct *vma)
+{
+ u64 flags = 0;
+
+ static const u64 mnemonics[BITS_PER_LONG] = {
+ /*
+ * In case if we meet a flag we don't know about.
+ */
+ [0 ... (BITS_PER_LONG-1)] = 0,
+
+ [ilog2(VM_READ)] = TASK_DIAG_VMA_F_READ,
+ [ilog2(VM_WRITE)] = TASK_DIAG_VMA_F_WRITE,
+ [ilog2(VM_EXEC)] = TASK_DIAG_VMA_F_EXEC,
+ [ilog2(VM_SHARED)] = TASK_DIAG_VMA_F_SHARED,
+ [ilog2(VM_MAYREAD)] = TASK_DIAG_VMA_F_MAYREAD,
+ [ilog2(VM_MAYWRITE)] = TASK_DIAG_VMA_F_MAYWRITE,
+ [ilog2(VM_MAYEXEC)] = TASK_DIAG_VMA_F_MAYEXEC,
+ [ilog2(VM_MAYSHARE)] = TASK_DIAG_VMA_F_MAYSHARE,
+ [ilog2(VM_GROWSDOWN)] = TASK_DIAG_VMA_F_GROWSDOWN,
+ [ilog2(VM_PFNMAP)] = TASK_DIAG_VMA_F_PFNMAP,
+ [ilog2(VM_DENYWRITE)] = TASK_DIAG_VMA_F_DENYWRITE,
+#ifdef CONFIG_X86_INTEL_MPX
+ [ilog2(VM_MPX)] = TASK_DIAG_VMA_F_MPX,
+#endif
+ [ilog2(VM_LOCKED)] = TASK_DIAG_VMA_F_LOCKED,
+ [ilog2(VM_IO)] = TASK_DIAG_VMA_F_IO,
+ [ilog2(VM_SEQ_READ)] = TASK_DIAG_VMA_F_SEQ_READ,
+ [ilog2(VM_RAND_READ)] = TASK_DIAG_VMA_F_RAND_READ,
+ [ilog2(VM_DONTCOPY)] = TASK_DIAG_VMA_F_DONTCOPY,
+ [ilog2(VM_DONTEXPAND)] = TASK_DIAG_VMA_F_DONTEXPAND,
+ [ilog2(VM_ACCOUNT)] = TASK_DIAG_VMA_F_ACCOUNT,
+ [ilog2(VM_NORESERVE)] = TASK_DIAG_VMA_F_NORESERVE,
+ [ilog2(VM_HUGETLB)] = TASK_DIAG_VMA_F_HUGETLB,
+ [ilog2(VM_ARCH_1)] = TASK_DIAG_VMA_F_ARCH_1,
+ [ilog2(VM_DONTDUMP)] = TASK_DIAG_VMA_F_DONTDUMP,
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ [ilog2(VM_SOFTDIRTY)] = TASK_DIAG_VMA_F_SOFTDIRTY,
+#endif
+ [ilog2(VM_MIXEDMAP)] = TASK_DIAG_VMA_F_MIXEDMAP,
+ [ilog2(VM_HUGEPAGE)] = TASK_DIAG_VMA_F_HUGEPAGE,
+ [ilog2(VM_NOHUGEPAGE)] = TASK_DIAG_VMA_F_NOHUGEPAGE,
+ [ilog2(VM_MERGEABLE)] = TASK_DIAG_VMA_F_MERGEABLE,
+ };
+ size_t i;
+
+ for (i = 0; i < BITS_PER_LONG; i++) {
+ if (vma->vm_flags & (1UL << i))
+ flags |= mnemonics[i];
+ }
+
+ return flags;
+}
+
+static int task_vma_num(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+ int n_vma = 0;
+
+ if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+ return 0;
+
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next, n_vma++)
+ ;
+
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+
+ return n_vma;
+}
+
+/*
+ * use a tmp variable and copy to input arg to deal with
+ * alignment issues. diag_vma contains u64 elements which
+ * means extended load operations can be used and those can
+ * require 8-byte alignment (e.g., sparc)
+ */
+static void fill_diag_vma(struct vm_area_struct *vma,
+ struct task_diag_vma *diag_vma)
+{
+ struct task_diag_vma tmp;
+
+ /* We don't show the stack guard page in /proc/maps */
+ tmp.start = vma->vm_start;
+ if (stack_guard_page_start(vma, tmp.start))
+ tmp.start += PAGE_SIZE;
+
+ tmp.end = vma->vm_end;
+ if (stack_guard_page_end(vma, tmp.end))
+ tmp.end -= PAGE_SIZE;
+ tmp.vm_flags = get_vma_flags(vma);
+
+ if (vma->vm_file) {
+ struct inode *inode = file_inode(vma->vm_file);
+ dev_t dev;
+
+ dev = inode->i_sb->s_dev;
+ tmp.major = MAJOR(dev);
+ tmp.minor = MINOR(dev);
+ tmp.inode = inode->i_ino;
+ tmp.generation = inode->i_generation;
+ tmp.pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
+ } else {
+ tmp.major = 0;
+ tmp.minor = 0;
+ tmp.inode = 0;
+ tmp.generation = 0;
+ tmp.pgoff = 0;
+ }
+
+ memcpy(diag_vma, &tmp, sizeof(*diag_vma));
+}
+
+static const char *get_vma_name(struct vm_area_struct *vma, char *page)
+{
+ const char *name = NULL;
+
+ if (vma->vm_file) {
+ name = d_path(&vma->vm_file->f_path, page, PAGE_SIZE);
+ goto out;
+ }
+
+ if (vma->vm_ops && vma->vm_ops->name) {
+ name = vma->vm_ops->name(vma);
+ if (name)
+ goto out;
+ }
+
+ name = arch_vma_name(vma);
+
+out:
+ return name;
+}
+
+static int fill_vma(struct task_struct *p, struct sk_buff *skb,
+ struct netlink_callback *cb, bool *progress)
+{
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ struct nlattr *attr = NULL;
+ struct task_diag_vma *diag_vma;
+ unsigned long mark = 0;
+ char *page;
+ int i, rc = -EMSGSIZE;
+
+ if (cb)
+ mark = cb->args[3];
+
+ mm = p->mm;
+ if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+ return 0;
+
+ page = (char *)__get_free_page(GFP_TEMPORARY);
+ if (!page) {
+ mmput(mm);
+ return -ENOMEM;
+ }
+
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next, i++) {
+ unsigned char *b = skb_tail_pointer(skb);
+ const char *name;
+ void *pfile;
+
+
+ if (mark >= vma->vm_start)
+ continue;
+
+ /* setup pointer for next map */
+ if (attr == NULL) {
+ attr = nla_reserve(skb, TASK_DIAG_VMA, sizeof(*diag_vma));
+ if (!attr)
+ goto err;
+
+ diag_vma = nla_data(attr);
+ } else {
+ diag_vma = nla_reserve_nohdr(skb, sizeof(*diag_vma));
+
+ if (diag_vma == NULL) {
+ nlmsg_trim(skb, b);
+ goto out;
+ }
+ }
+
+ fill_diag_vma(vma, diag_vma);
+
+ name = get_vma_name(vma, page);
+ if (IS_ERR(name)) {
+ nlmsg_trim(skb, b);
+ rc = PTR_ERR(name);
+ goto out;
+ }
+
+ if (name) {
+ diag_vma->name_len = strlen(name) + 1;
+
+ /* reserves NLA_ALIGN(len) */
+ pfile = nla_reserve_nohdr(skb, diag_vma->name_len);
+ if (pfile == NULL) {
+ nlmsg_trim(skb, b);
+ goto out;
+ }
+ diag_vma->name_off = pfile - (void *) diag_vma;
+ memcpy(pfile, name, diag_vma->name_len);
+ } else {
+ diag_vma->name_len = 0;
+ diag_vma->name_off = 0;
+ }
+
+ mark = vma->vm_start;
+
+ diag_vma->vma_len = skb_tail_pointer(skb) - (unsigned char *) diag_vma;
+
+ *progress = true;
+ }
+
+ rc = 0;
+ mark = 0;
+out:
+ if (*progress)
+ attr->nla_len = skb_tail_pointer(skb) - (unsigned char *) attr;
+
+err:
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ free_page((unsigned long) page);
+ if (cb)
+ cb->args[3] = mark;
+
+ return rc;
+}
+
static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
u64 show_flags, u32 portid, u32 seq,
struct netlink_callback *cb)
{
void *reply;
int err = 0, i = 0, n = 0;
+ bool progress = false;
int flags = 0;
u32 pid;

@@ -198,13 +439,21 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
i++;
}

+ if (show_flags & TASK_DIAG_SHOW_VMA) {
+ if (i >= n)
+ err = fill_vma(tsk, skb, cb, &progress);
+ if (err)
+ goto err;
+ i++;
+ }
+
genlmsg_end(skb, reply);
if (cb)
cb->args[2] = 0;

return 0;
err:
- if (err == -EMSGSIZE && i != 0) {
+ if (err == -EMSGSIZE && (i > n || progress)) {
if (cb)
cb->args[2] = i;
genlmsg_end(skb, reply);
@@ -374,7 +623,7 @@ int taskdiag_doit(struct sk_buff *skb, struct genl_info *info)
return -EPERM;
}

- size = taskdiag_packet_size(req.show_flags);
+ size = taskdiag_packet_size(req.show_flags, task_vma_num(tsk->mm));

while (1) {
msg = genlmsg_new(size, GFP_KERNEL);
--
2.1.0

2015-07-06 08:51:21

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 13/24] task_diag: shows memory consumption for memory mappings (v2)

The same information is shown in /proc/self/smaps.

v2: account the task_diag_vma_stat struct in taskdiag_packet_size()
Fix 8-byte alignment for vma and vma_stats // David Ahern

Cc: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/task_mmu.c | 14 +--------
include/linux/proc_fs.h | 17 +++++++++++
include/uapi/linux/task_diag.h | 25 +++++++++++++++++
kernel/taskdiag.c | 64 ++++++++++++++++++++++++++++++++++++++----
4 files changed, 101 insertions(+), 19 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d..c89f68c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -435,18 +435,6 @@ const struct file_operations proc_tid_maps_operations = {
#define PSS_SHIFT 12

#ifdef CONFIG_PROC_PAGE_MONITOR
-struct mem_size_stats {
- unsigned long resident;
- unsigned long shared_clean;
- unsigned long shared_dirty;
- unsigned long private_clean;
- unsigned long private_dirty;
- unsigned long referenced;
- unsigned long anonymous;
- unsigned long anonymous_thp;
- unsigned long swap;
- u64 pss;
-};

static void smaps_account(struct mem_size_stats *mss, struct page *page,
unsigned long size, bool young, bool dirty)
@@ -526,7 +514,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
}
#endif

-static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->vma;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 497e58c..62d0079 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -92,4 +92,21 @@ struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter);
struct task_struct *
task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos);

+struct mem_size_stats {
+ unsigned long resident;
+ unsigned long shared_clean;
+ unsigned long shared_dirty;
+ unsigned long private_clean;
+ unsigned long private_dirty;
+ unsigned long referenced;
+ unsigned long anonymous;
+ unsigned long anonymous_thp;
+ unsigned long swap;
+ u64 pss;
+};
+
+struct mm_walk;
+int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk);
+
#endif /* _LINUX_PROC_FS_H */
diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index 943d8d1..73d33c8 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -11,6 +11,7 @@ enum {
TASK_DIAG_CRED,
TASK_DIAG_STAT,
TASK_DIAG_VMA,
+ TASK_DIAG_VMA_STAT,

/* other attributes */
TASK_DIAG_PID = 64, /* u32 */
@@ -23,6 +24,7 @@ enum {
#define TASK_DIAG_SHOW_CRED (1ULL << TASK_DIAG_CRED)
#define TASK_DIAG_SHOW_STAT (1ULL << TASK_DIAG_STAT)
#define TASK_DIAG_SHOW_VMA (1ULL << TASK_DIAG_VMA)
+#define TASK_DIAG_SHOW_VMA_STAT (1ULL << TASK_DIAG_VMA_STAT)

enum {
TASK_DIAG_RUNNING,
@@ -96,6 +98,19 @@ struct task_diag_creds {
#define TASK_DIAG_VMA_F_NOHUGEPAGE (1ULL << 26)
#define TASK_DIAG_VMA_F_MERGEABLE (1ULL << 27)

+struct task_diag_vma_stat {
+ __u64 resident;
+ __u64 shared_clean;
+ __u64 shared_dirty;
+ __u64 private_clean;
+ __u64 private_dirty;
+ __u64 referenced;
+ __u64 anonymous;
+ __u64 anonymous_thp;
+ __u64 swap;
+ __u64 pss;
+} __attribute__((__aligned__(NLA_ALIGNTO)));
+
/* task_diag_vma must be NLA_ALIGN'ed */
struct task_diag_vma {
__u64 start, end;
@@ -108,6 +123,8 @@ struct task_diag_vma {
__u16 vma_len;
__u16 name_off;
__u16 name_len;
+ __u16 stat_off;
+ __u16 stat_len;
} __attribute__((__aligned__(NLA_ALIGNTO)));

static inline char *task_diag_vma_name(struct task_diag_vma *vma)
@@ -118,6 +135,14 @@ static inline char *task_diag_vma_name(struct task_diag_vma *vma)
return ((char *)vma) + vma->name_off;
}

+static inline struct task_diag_vma_stat *task_diag_vma_stat(struct task_diag_vma *vma)
+{
+ if (!vma->stat_len)
+ return NULL;
+
+ return ((void *)vma) + vma->stat_off;
+}
+
#define TASK_DIAG_DUMP_ALL 0
#define TASK_DIAG_DUMP_CHILDREN 1

diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index c488c1b..8e00c3e 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -24,11 +24,17 @@ static size_t taskdiag_packet_size(u64 show_flags, int n_vma)
size += nla_total_size(sizeof(struct taskstats));

if (show_flags & TASK_DIAG_SHOW_VMA && n_vma > 0) {
+ size_t entry_size;
+
/*
* 128 is a schwag on average path length for maps; used to
* ballpark initial memory allocation for genl msg
*/
- size += nla_total_size(sizeof(struct task_diag_vma) * n_vma + 128);
+ entry_size = sizeof(struct task_diag_vma) + 128;
+
+ if (show_flags & TASK_DIAG_SHOW_VMA_STAT)
+ entry_size += sizeof(struct task_diag_vma_stat);
+ size += nla_total_size(entry_size * n_vma);
}

return size;
@@ -292,8 +298,37 @@ out:
return name;
}

+static void fill_diag_vma_stat(struct vm_area_struct *vma, struct task_diag_vma_stat *stat)
+{
+ struct task_diag_vma_stat tmp;
+ struct mem_size_stats mss;
+ struct mm_walk smaps_walk = {
+ .pmd_entry = smaps_pte_range,
+ .mm = vma->vm_mm,
+ .private = &mss,
+ };
+
+ memset(&mss, 0, sizeof mss);
+ memset(&tmp, 0, sizeof(tmp));
+
+ /* mmap_sem is held in m_start */
+ walk_page_vma(vma, &smaps_walk);
+
+ tmp.resident = mss.resident;
+ tmp.pss = mss.pss;
+ tmp.shared_clean = mss.shared_clean;
+ tmp.private_clean = mss.private_clean;
+ tmp.private_dirty = mss.private_dirty;
+ tmp.referenced = mss.referenced;
+ tmp.anonymous = mss.anonymous;
+ tmp.anonymous_thp = mss.anonymous_thp;
+ tmp.swap = mss.swap;
+
+ memcpy(stat, &tmp, sizeof(*stat));
+}
+
static int fill_vma(struct task_struct *p, struct sk_buff *skb,
- struct netlink_callback *cb, bool *progress)
+ struct netlink_callback *cb, bool *progress, u64 show_flags)
{
struct vm_area_struct *vma;
struct mm_struct *mm;
@@ -301,7 +336,7 @@ static int fill_vma(struct task_struct *p, struct sk_buff *skb,
struct task_diag_vma *diag_vma;
unsigned long mark = 0;
char *page;
- int i, rc = -EMSGSIZE;
+ int i, rc = -EMSGSIZE, size;

if (cb)
mark = cb->args[3];
@@ -316,6 +351,10 @@ static int fill_vma(struct task_struct *p, struct sk_buff *skb,
return -ENOMEM;
}

+ size = NLA_ALIGN(sizeof(struct task_diag_vma));
+ if (show_flags & TASK_DIAG_SHOW_VMA_STAT)
+ size += NLA_ALIGN(sizeof(struct task_diag_vma_stat));
+
down_read(&mm->mmap_sem);
for (vma = mm->mmap; vma; vma = vma->vm_next, i++) {
unsigned char *b = skb_tail_pointer(skb);
@@ -328,13 +367,13 @@ static int fill_vma(struct task_struct *p, struct sk_buff *skb,

/* setup pointer for next map */
if (attr == NULL) {
- attr = nla_reserve(skb, TASK_DIAG_VMA, sizeof(*diag_vma));
+ attr = nla_reserve(skb, TASK_DIAG_VMA, size);
if (!attr)
goto err;

diag_vma = nla_data(attr);
} else {
- diag_vma = nla_reserve_nohdr(skb, sizeof(*diag_vma));
+ diag_vma = nla_reserve_nohdr(skb, size);

if (diag_vma == NULL) {
nlmsg_trim(skb, b);
@@ -344,6 +383,19 @@ static int fill_vma(struct task_struct *p, struct sk_buff *skb,

fill_diag_vma(vma, diag_vma);

+ if (show_flags & TASK_DIAG_SHOW_VMA_STAT) {
+ struct task_diag_vma_stat *stat;
+
+ stat = (void *) diag_vma + NLA_ALIGN(sizeof(struct task_diag_vma));
+
+ fill_diag_vma_stat(vma, stat);
+ diag_vma->stat_len = sizeof(struct task_diag_vma_stat);
+ diag_vma->stat_off = (void *) stat - (void *)diag_vma;
+ } else {
+ diag_vma->stat_len = 0;
+ diag_vma->stat_off = 0;
+ }
+
name = get_vma_name(vma, page);
if (IS_ERR(name)) {
nlmsg_trim(skb, b);
@@ -441,7 +493,7 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,

if (show_flags & TASK_DIAG_SHOW_VMA) {
if (i >= n)
- err = fill_vma(tsk, skb, cb, &progress);
+ err = fill_vma(tsk, skb, cb, &progress, show_flags);
if (err)
goto err;
i++;
--
2.1.0

2015-07-06 08:51:13

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 14/24] task_diag: add a macro to enumerate memory mappings

This macro will help users consume this interface.

v2: Fix task_diag_for_each_vma to work with libnl // David Ahern

Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/task_diag.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index 73d33c8..9e4c3c4 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -143,6 +143,11 @@ static inline struct task_diag_vma_stat *task_diag_vma_stat(struct task_diag_vma
return ((void *)vma) + vma->stat_off;
}

+#define task_diag_for_each_vma(vma, attr) \
+ for (vma = nla_data(attr); \
+ (void *) vma < nla_data(attr) + nla_len(attr); \
+ vma = (void *) vma + vma->vma_len)
+
#define TASK_DIAG_DUMP_ALL 0
#define TASK_DIAG_DUMP_CHILDREN 1

--
2.1.0

2015-07-06 08:54:09

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 15/24] proc: give task_struct instead of pid into first_tid

This will be more convenient when this function is used in
task_diag.

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/base.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 88cf6ae..af797d9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3225,19 +3225,17 @@ out_no_task:
* In the case of a seek we start with the leader and walk nr
* threads past it.
*/
-static struct task_struct *first_tid(struct pid *pid, int tid, loff_t f_pos,
+static struct task_struct *
+first_tid(struct task_struct *task, int tid, loff_t f_pos,
struct pid_namespace *ns)
{
- struct task_struct *pos, *task;
+ struct task_struct *pos;
unsigned long nr = f_pos;

if (nr != f_pos) /* 32bit overflow? */
return NULL;

rcu_read_lock();
- task = pid_task(pid, PIDTYPE_PID);
- if (!task)
- goto fail;

/* Attempt to start with the tid of a thread */
if (tid && nr) {
@@ -3294,7 +3292,7 @@ static struct task_struct *next_tid(struct task_struct *start)
static int proc_task_readdir(struct file *file, struct dir_context *ctx)
{
struct inode *inode = file_inode(file);
- struct task_struct *task;
+ struct task_struct *start, *task;
struct pid_namespace *ns;
int tid;

@@ -3304,13 +3302,17 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
if (!dir_emit_dots(file, ctx))
return 0;

+ start = get_proc_task(inode);
+ if (!start)
+ return 0;
+
/* f_version caches the tgid value that the last readdir call couldn't
* return. lseek aka telldir automagically resets f_version to 0.
*/
ns = inode->i_sb->s_fs_info;
tid = (int)file->f_version;
file->f_version = 0;
- for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
+ for (task = first_tid(start, tid, ctx->pos - 2, ns);
task;
task = next_tid(task), ctx->pos++) {
char name[PROC_NUMBUF];
@@ -3327,6 +3329,8 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
}
}

+ put_task_struct(start);
+
return 0;
}

--
2.1.0

2015-07-06 08:54:07

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 16/24] proc: move first_tid and next_tid out of proc

They will be used in task_diag.

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/base.c | 79 ++-----------------------------------------------
include/linux/proc_fs.h | 4 +++
kernel/pid.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 82 insertions(+), 77 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index af797d9..4b72053 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3213,81 +3213,6 @@ out_no_task:
return ERR_PTR(result);
}

-/*
- * Find the first tid of a thread group to return to user space.
- *
- * Usually this is just the thread group leader, but if the users
- * buffer was too small or there was a seek into the middle of the
- * directory we have more work todo.
- *
- * In the case of a short read we start with find_task_by_pid.
- *
- * In the case of a seek we start with the leader and walk nr
- * threads past it.
- */
-static struct task_struct *
-first_tid(struct task_struct *task, int tid, loff_t f_pos,
- struct pid_namespace *ns)
-{
- struct task_struct *pos;
- unsigned long nr = f_pos;
-
- if (nr != f_pos) /* 32bit overflow? */
- return NULL;
-
- rcu_read_lock();
-
- /* Attempt to start with the tid of a thread */
- if (tid && nr) {
- pos = find_task_by_pid_ns(tid, ns);
- if (pos && same_thread_group(pos, task))
- goto found;
- }
-
- /* If nr exceeds the number of threads there is nothing todo */
- if (nr >= get_nr_threads(task))
- goto fail;
-
- /* If we haven't found our starting place yet start
- * with the leader and walk nr threads forward.
- */
- pos = task = task->group_leader;
- do {
- if (!nr--)
- goto found;
- } while_each_thread(task, pos);
-fail:
- pos = NULL;
- goto out;
-found:
- get_task_struct(pos);
-out:
- rcu_read_unlock();
- return pos;
-}
-
-/*
- * Find the next thread in the thread list.
- * Return NULL if there is an error or no next thread.
- *
- * The reference to the input task_struct is released.
- */
-static struct task_struct *next_tid(struct task_struct *start)
-{
- struct task_struct *pos = NULL;
- rcu_read_lock();
- if (pid_alive(start)) {
- pos = next_thread(start);
- if (thread_group_leader(pos))
- pos = NULL;
- else
- get_task_struct(pos);
- }
- rcu_read_unlock();
- put_task_struct(start);
- return pos;
-}
-
/* for the /proc/TGID/task/ directories */
static int proc_task_readdir(struct file *file, struct dir_context *ctx)
{
@@ -3312,9 +3237,9 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
ns = inode->i_sb->s_fs_info;
tid = (int)file->f_version;
file->f_version = 0;
- for (task = first_tid(start, tid, ctx->pos - 2, ns);
+ for (task = task_first_tid(start, tid, ctx->pos - 2, ns);
task;
- task = next_tid(task), ctx->pos++) {
+ task = task_next_tid(task), ctx->pos++) {
char name[PROC_NUMBUF];
int len;
tid = task_pid_nr_ns(task, ns);
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 62d0079..178f526 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -92,6 +92,10 @@ struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter);
struct task_struct *
task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos);

+struct task_struct *task_first_tid(struct task_struct *task, int tid, loff_t f_pos,
+ struct pid_namespace *ns);
+struct task_struct *task_next_tid(struct task_struct *start);
+
struct mem_size_stats {
unsigned long resident;
unsigned long shared_clean;
diff --git a/kernel/pid.c b/kernel/pid.c
index 42a6b40..b66f370 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -661,6 +661,82 @@ out:
}

/*
+ * Find the first tid of a thread group to return to user space.
+ *
+ * Usually this is just the thread group leader, but if the users
+ * buffer was too small or there was a seek into the middle of the
+ * directory we have more work todo.
+ *
+ * In the case of a short read we start with find_task_by_pid.
+ *
+ * In the case of a seek we start with the leader and walk nr
+ * threads past it.
+ */
+struct task_struct *
+task_first_tid(struct task_struct *task, int tid, loff_t f_pos,
+ struct pid_namespace *ns)
+{
+ struct task_struct *pos;
+ unsigned long nr = f_pos;
+
+ if (nr != f_pos) /* 32bit overflow? */
+ return NULL;
+
+ rcu_read_lock();
+
+ /* Attempt to start with the tid of a thread */
+ if (tid && nr) {
+ pos = find_task_by_pid_ns(tid, ns);
+ if (pos && same_thread_group(pos, task))
+ goto found;
+ }
+
+ /* If nr exceeds the number of threads there is nothing todo */
+ if (nr >= get_nr_threads(task))
+ goto fail;
+
+ /* If we haven't found our starting place yet start
+ * with the leader and walk nr threads forward.
+ */
+ pos = task = task->group_leader;
+ do {
+ if (!nr--)
+ goto found;
+ } while_each_thread(task, pos);
+fail:
+ pos = NULL;
+ goto out;
+found:
+ get_task_struct(pos);
+out:
+ rcu_read_unlock();
+ return pos;
+}
+
+/*
+ * Find the next thread in the thread list.
+ * Return NULL if there is an error or no next thread.
+ *
+ * The reference to the input task_struct is released.
+ */
+struct task_struct *task_next_tid(struct task_struct *start)
+{
+ struct task_struct *pos = NULL;
+ rcu_read_lock();
+ if (pid_alive(start)) {
+ pos = next_thread(start);
+ if (thread_group_leader(pos))
+ pos = NULL;
+ else
+ get_task_struct(pos);
+ }
+ rcu_read_unlock();
+ put_task_struct(start);
+ return pos;
+}
+
+
+/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
* more.
--
2.1.0

2015-07-06 08:51:23

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 17/24] task_diag: add the ability to dump threads

Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/task_diag.h | 1 +
kernel/taskdiag.c | 15 ++++++++++++++-
2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index 9e4c3c4..e1d09e6 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -150,6 +150,7 @@ static inline struct task_diag_vma_stat *task_diag_vma_stat(struct task_diag_vma

#define TASK_DIAG_DUMP_ALL 0
#define TASK_DIAG_DUMP_CHILDREN 1
+#define TASK_DIAG_DUMP_THREAD 2

struct task_diag_pid {
__u64 show_flags;
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 8e00c3e..dc319bf 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -559,8 +559,15 @@ static struct task_struct *iter_start(struct task_iter *iter)
}

switch (iter->req.dump_strategy) {
- case TASK_DIAG_DUMP_CHILDREN:
+ case TASK_DIAG_DUMP_THREAD:
+ if (iter->parent == NULL)
+ return ERR_PTR(-ESRCH);
+
+ iter->pos = iter->cb->args[0];
+ iter->task = task_first_tid(iter->parent, 0, iter->pos, iter->ns);
+ return iter->task;

+ case TASK_DIAG_DUMP_CHILDREN:
if (iter->parent == NULL)
return ERR_PTR(-ESRCH);

@@ -581,6 +588,12 @@ static struct task_struct *iter_start(struct task_iter *iter)
static struct task_struct *iter_next(struct task_iter *iter)
{
switch (iter->req.dump_strategy) {
+ case TASK_DIAG_DUMP_THREAD:
+ iter->pos++;
+ iter->task = task_next_tid(iter->task);
+ iter->cb->args[0] = iter->pos;
+ iter->cb->args[1] = task_pid_nr_ns(iter->task, iter->ns);
+ return iter->task;
case TASK_DIAG_DUMP_CHILDREN:
iter->pos++;
iter->task = task_next_child(iter->parent, iter->task, iter->pos);
--
2.1.0

2015-07-06 08:53:22

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 18/24] task_diag: add the ability to handle one task in continuous mode

If a task has many vma-s, the size of information about them can be too
big to be placed into one skb.

Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/task_diag.h | 1 +
kernel/taskdiag.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)

diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index e1d09e6..4a1cecb 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -151,6 +151,7 @@ static inline struct task_diag_vma_stat *task_diag_vma_stat(struct task_diag_vma
#define TASK_DIAG_DUMP_ALL 0
#define TASK_DIAG_DUMP_CHILDREN 1
#define TASK_DIAG_DUMP_THREAD 2
+#define TASK_DIAG_DUMP_ONE 3

struct task_diag_pid {
__u64 show_flags;
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index dc319bf..20494ee 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -559,6 +559,16 @@ static struct task_struct *iter_start(struct task_iter *iter)
}

switch (iter->req.dump_strategy) {
+ case TASK_DIAG_DUMP_ONE:
+ if (iter->parent == NULL)
+ return ERR_PTR(-ESRCH);
+ iter->pos = iter->cb->args[0];
+ if (iter->pos == 0) {
+ iter->task = iter->parent;
+ get_task_struct(iter->task);
+ } else
+ iter->task = NULL;
+ return iter->task;
case TASK_DIAG_DUMP_THREAD:
if (iter->parent == NULL)
return ERR_PTR(-ESRCH);
@@ -588,6 +598,13 @@ static struct task_struct *iter_start(struct task_iter *iter)
static struct task_struct *iter_next(struct task_iter *iter)
{
switch (iter->req.dump_strategy) {
+ case TASK_DIAG_DUMP_ONE:
+ iter->pos++;
+ iter->cb->args[0] = iter->pos;
+ if (iter->task)
+ put_task_struct(iter->task);
+ iter->task = NULL;
+ return NULL;
case TASK_DIAG_DUMP_THREAD:
iter->pos++;
iter->task = task_next_tid(iter->task);
--
2.1.0

2015-07-06 08:51:30

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 19/24] task_diag: Add option to dump all threads for all tasks

From: David Ahern <[email protected]>

New dump strategy to walk all threads of all processes.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/task_diag.h | 7 +++---
kernel/taskdiag.c | 55 ++++++++++++++++++++++++++++++++++++------
2 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index 4a1cecb..19c9579 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -149,9 +149,10 @@ static inline struct task_diag_vma_stat *task_diag_vma_stat(struct task_diag_vma
vma = (void *) vma + vma->vma_len)

#define TASK_DIAG_DUMP_ALL 0
-#define TASK_DIAG_DUMP_CHILDREN 1
-#define TASK_DIAG_DUMP_THREAD 2
-#define TASK_DIAG_DUMP_ONE 3
+#define TASK_DIAG_DUMP_ALL_THREAD 1
+#define TASK_DIAG_DUMP_CHILDREN 2
+#define TASK_DIAG_DUMP_THREAD 3
+#define TASK_DIAG_DUMP_ONE 4

struct task_diag_pid {
__u64 show_flags;
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 20494ee..e5a5eed 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -521,13 +521,9 @@ struct task_iter {
struct netlink_callback *cb;
struct task_struct *parent;

- union {
- struct tgid_iter tgid;
- struct {
- unsigned int pos;
- struct task_struct *task;
- };
- };
+ struct tgid_iter tgid;
+ unsigned int pos;
+ struct task_struct *task;
};

static void iter_stop(struct task_iter *iter)
@@ -541,6 +537,12 @@ static void iter_stop(struct task_iter *iter)
case TASK_DIAG_DUMP_ALL:
task = iter->tgid.task;
break;
+ case TASK_DIAG_DUMP_ALL_THREAD:
+ /* release both tgid task and thread task */
+ if (iter->task)
+ put_task_struct(iter->task);
+ task = iter->tgid.task;
+ break;
default:
task = iter->task;
}
@@ -590,6 +592,25 @@ static struct task_struct *iter_start(struct task_iter *iter)
iter->tgid.task = NULL;
iter->tgid = next_tgid(iter->ns, iter->tgid);
return iter->tgid.task;
+
+ case TASK_DIAG_DUMP_ALL_THREAD:
+ iter->pos = iter->cb->args[1];
+ iter->tgid.tgid = iter->cb->args[0];
+ iter->tgid.task = NULL;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ if (!iter->tgid.task)
+ return NULL;
+
+ iter->task = task_first_tid(iter->tgid.task, 0, iter->pos, iter->ns);
+ if (!iter->task) {
+ iter->pos = 0;
+ iter->tgid.tgid += 1;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ iter->task = iter->tgid.task;
+ if (iter->task)
+ get_task_struct(iter->task);
+ }
+ return iter->task;
}

return ERR_PTR(-EINVAL);
@@ -622,6 +643,24 @@ static struct task_struct *iter_next(struct task_iter *iter)
iter->tgid = next_tgid(iter->ns, iter->tgid);
iter->cb->args[0] = iter->tgid.tgid;
return iter->tgid.task;
+
+ case TASK_DIAG_DUMP_ALL_THREAD:
+ iter->pos++;
+ iter->task = task_next_tid(iter->task);
+ if (!iter->task) {
+ iter->pos = 0;
+ iter->tgid.tgid += 1;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ iter->task = iter->tgid.task;
+ if (iter->task)
+ get_task_struct(iter->task);
+ }
+
+ /* save current position */
+ iter->cb->args[0] = iter->tgid.tgid;
+ iter->cb->args[1] = iter->pos;
+
+ return iter->task;
}

return NULL;
@@ -647,6 +686,8 @@ int taskdiag_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
iter.ns = ns;
iter.cb = cb;
iter.parent = NULL;
+ iter.pos = 0;
+ iter.task = NULL;

task = iter_start(&iter);
if (IS_ERR(task))
--
2.1.0

2015-07-06 08:52:38

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 20/24] task_diag: Only add VMAs for thread_group leader

From: David Ahern <[email protected]>

Threads of a process share the same VMAs, so when dumping all threads
for all processes, only push VMA data for the group leader.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
kernel/taskdiag.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index e5a5eed..6549df3 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -443,9 +443,10 @@ err:
}

static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
- u64 show_flags, u32 portid, u32 seq,
- struct netlink_callback *cb)
+ struct task_diag_pid *req, u32 portid, u32 seq,
+ struct netlink_callback *cb)
{
+ u64 show_flags = req->show_flags;
void *reply;
int err = 0, i = 0, n = 0;
bool progress = false;
@@ -492,6 +493,13 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
}

if (show_flags & TASK_DIAG_SHOW_VMA) {
+ /* if the request is to dump all threads of all processes
+ * only show VMAs for group leader.
+ */
+ if (req->dump_strategy == TASK_DIAG_DUMP_ALL_THREAD &&
+ !thread_group_leader(tsk))
+ goto done;
+
if (i >= n)
err = fill_vma(tsk, skb, cb, &progress, show_flags);
if (err)
@@ -499,6 +507,7 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
i++;
}

+done:
genlmsg_end(skb, reply);
if (cb)
cb->args[2] = 0;
@@ -696,7 +705,7 @@ int taskdiag_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
for (; task; task = iter_next(&iter)) {
if (!ptrace_may_access(task, PTRACE_MODE_READ))
continue;
- rc = task_diag_fill(task, skb, iter.req.show_flags,
+ rc = task_diag_fill(task, skb, &iter.req,
NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, cb);
if (rc < 0) {
if (rc != -EMSGSIZE) {
@@ -755,7 +764,7 @@ int taskdiag_doit(struct sk_buff *skb, struct genl_info *info)
return -EMSGSIZE;
}

- rc = task_diag_fill(tsk, msg, req.show_flags,
+ rc = task_diag_fill(tsk, msg, &req,
info->snd_portid, info->snd_seq, NULL);
if (rc != -EMSGSIZE)
break;
--
2.1.0

2015-07-06 08:51:35

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 21/24] task diag: Add support for TGID attribute

From: David Ahern <[email protected]>

Add a TGID attribute and put it in every message. This is needed, for
example, by perf when processing messages with maps.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/task_diag.h | 1 +
kernel/taskdiag.c | 8 +++++++-
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index 19c9579..935502b 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -15,6 +15,7 @@ enum {

/* other attributes */
TASK_DIAG_PID = 64, /* u32 */
+ TASK_DIAG_TGID, /* u32 */

__TASK_DIAG_ATTR_MAX
#define TASK_DIAG_ATTR_MAX (__TASK_DIAG_ATTR_MAX - 1)
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 6549df3..ccba538 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -13,6 +13,7 @@ static size_t taskdiag_packet_size(u64 show_flags, int n_vma)
size_t size;

size = nla_total_size(sizeof(u32)); /* PID */
+ + nla_total_size(sizeof(u32)); /* TGID */

if (show_flags & TASK_DIAG_SHOW_BASE)
size += nla_total_size(sizeof(struct task_diag_base));
@@ -451,7 +452,7 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
int err = 0, i = 0, n = 0;
bool progress = false;
int flags = 0;
- u32 pid;
+ u32 pid, tgid;

if (cb) {
n = cb->args[2];
@@ -468,6 +469,11 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
if (err)
goto err;

+ tgid = task_tgid_vnr(tsk);
+ err = nla_put_u32(skb, TASK_DIAG_TGID, tgid);
+ if (err)
+ goto err;
+
if (show_flags & TASK_DIAG_SHOW_BASE) {
if (i >= n)
err = fill_task_base(tsk, skb);
--
2.1.0

2015-07-06 08:52:42

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 22/24] Documentation: add documentation for task_diag

Signed-off-by: Andrey Vagin <[email protected]>
---
Documentation/accounting/task_diag.txt | 53 ++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)
create mode 100644 Documentation/accounting/task_diag.txt

diff --git a/Documentation/accounting/task_diag.txt b/Documentation/accounting/task_diag.txt
new file mode 100644
index 0000000..635b322
--- /dev/null
+++ b/Documentation/accounting/task_diag.txt
@@ -0,0 +1,53 @@
+Taskdiag is a netlink interface for getting task properties. Compared to procfs,
+it is faster, more flexible and provides data in a binary format. Taskdiag was
+created using the basic idea of socket_diag.
+
+Interface
+---------
+
+The user-kernel interface is encapsulated in include/uapi/linux/task_diag.h
+
+Request
+-------
+
+A request is described by the task_diag_pid structure.
+
+struct task_diag_pid {
+ __u64 show_flags; /* TASK_DIAG_SHOW_* */
+ __u64 dump_strategy; /* TASK_DIAG_DUMP_* */
+
+ __u32 pid;
+};
+
+dump_strategy specifies a group of processes:
+/* per-process strategies */
+TASK_DIAG_DUMP_CHILDREN - all children
+TASK_DIAG_DUMP_THREAD - all threads
+TASK_DIAG_DUMP_ONE - one process
+/* system wide strategies (the pid field is ignored) */
+TASK_DIAG_DUMP_ALL - all processes
+TASK_DIAG_DUMP_ALL_THREAD - all threads
+
+show_flags specifies which information is required. If we set the
+TASK_DIAG_SHOW_BASE flag, the response message will contain the TASK_DIAG_BASE
+attribute which is described by the task_diag_base structure.
+
+In the future, it can be extended with optional attributes. The request
+describes which task properties are required and for which processes they
+are required.
+
+Response
+--------
+
+A response can be divided into a few netlink packets if the NETLINK_DUMP has
+been set in a request. Each task is described by a message. Each message
+contains the TASK_DIAG_PID attribute and optional attributes which have been
+requested. A message can be divided into a few parts if it doesn't fit into
+the current netlink packet. In this case, the first message in the next
+packet contains the same PID and the attributes which didn't fit into the
+previous message.
+
+Examples
+--------
+
+A few examples can be found in tools/testing/selftests/task_diag/
--
2.1.0

2015-07-06 08:51:50

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 23/24] selftest: check the task_diag functionality

Here are two test (example) programs.

task_diag - requests information about two processes.
task_diag_all - requests information about all processes.

v2: Fixes from David Ahern:
* task_diag: Fix 8-byte alignment for vma and vma_stats

Signed-off-by: Andrey Vagin <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/task_diag/Makefile | 18 +++
tools/testing/selftests/task_diag/fork.c | 30 ++++
tools/testing/selftests/task_diag/run.sh | 6 +
tools/testing/selftests/task_diag/task_diag.c | 115 +++++++++++++++
tools/testing/selftests/task_diag/task_diag.h | 1 +
tools/testing/selftests/task_diag/task_diag_all.c | 145 +++++++++++++++++++
tools/testing/selftests/task_diag/task_diag_comm.c | 157 +++++++++++++++++++++
tools/testing/selftests/task_diag/task_diag_comm.h | 33 +++++
tools/testing/selftests/task_diag/task_proc_all.c | 35 +++++
tools/testing/selftests/task_diag/taskstats.h | 1 +
11 files changed, 542 insertions(+)
create mode 100644 tools/testing/selftests/task_diag/Makefile
create mode 100644 tools/testing/selftests/task_diag/fork.c
create mode 100755 tools/testing/selftests/task_diag/run.sh
create mode 100644 tools/testing/selftests/task_diag/task_diag.c
create mode 120000 tools/testing/selftests/task_diag/task_diag.h
create mode 100644 tools/testing/selftests/task_diag/task_diag_all.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.h
create mode 100644 tools/testing/selftests/task_diag/task_proc_all.c
create mode 120000 tools/testing/selftests/task_diag/taskstats.h

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 95abddc..25a42de 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -18,6 +18,7 @@ TARGETS += timers
TARGETS += user
TARGETS += vm
TARGETS += x86
+TARGETS += task_diag
#Please keep the TARGETS list alphabetically sorted

TARGETS_HOTPLUG = cpu-hotplug
diff --git a/tools/testing/selftests/task_diag/Makefile b/tools/testing/selftests/task_diag/Makefile
new file mode 100644
index 0000000..7104573
--- /dev/null
+++ b/tools/testing/selftests/task_diag/Makefile
@@ -0,0 +1,18 @@
+all: task_diag task_diag_all fork task_proc_all fork
+
+CFLAGS += -Wall -O2 -I/usr/include/libnl3
+LDFLAGS += -lnl-3 -lnl-genl-3
+TEST_PROGS := run.sh
+include ../lib.mk
+
+task_diag.o: task_diag.c task_diag_comm.h
+task_diag_all.o: task_diag_all.c task_diag_comm.h
+task_diag_comm.o: task_diag_comm.c task_diag_comm.h
+
+task_diag_all: task_diag_all.o task_diag_comm.o
+task_diag: task_diag.o task_diag_comm.o
+fork: fork.c
+task_proc_all: task_proc_all.c
+
+clean:
+ rm -rf task_diag task_diag_all task_diag_comm.o task_diag_all.o task_diag.o fork task_proc_all
diff --git a/tools/testing/selftests/task_diag/fork.c b/tools/testing/selftests/task_diag/fork.c
new file mode 100644
index 0000000..c6e17d1
--- /dev/null
+++ b/tools/testing/selftests/task_diag/fork.c
@@ -0,0 +1,30 @@
+#include <unistd.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+int main(int argc, char **argv)
+{
+ int i, n;
+
+ if (argc < 2)
+ return 1;
+
+ n = atoi(argv[1]);
+ for (i = 0; i < n; i++) {
+ pid_t pid;
+
+ pid = fork();
+ if (pid < 0) {
+ printf("Unable to fork: %m\n");
+ return 1;
+ }
+ if (pid == 0) {
+ while (1)
+ sleep(1000);
+ return 0;
+ }
+ }
+
+ return 0;
+}
diff --git a/tools/testing/selftests/task_diag/run.sh b/tools/testing/selftests/task_diag/run.sh
new file mode 100755
index 0000000..62250a5
--- /dev/null
+++ b/tools/testing/selftests/task_diag/run.sh
@@ -0,0 +1,6 @@
+#!/bin/sh
+./fork 1000
+nproc=`./task_diag_all A | grep 'pid.*tgid.*ppid.*comm fork$' | wc -l`
+killall -9 fork
+[ "$nproc" -eq 1000 ] && exit 0
+echo "Unexpected number of tasks '$nproc'" 1>&2
diff --git a/tools/testing/selftests/task_diag/task_diag.c b/tools/testing/selftests/task_diag/task_diag.c
new file mode 100644
index 0000000..ff232a1
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag.c
@@ -0,0 +1,115 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <poll.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/wait.h>
+#include <signal.h>
+#include <dirent.h>
+
+#include <linux/netlink.h>
+#include <netlink/socket.h>
+#include <linux/genetlink.h>
+#include <netlink/genl/ctrl.h>
+#include <netlink/genl/genl.h>
+#include <netlink/genl/mngt.h>
+
+#include "task_diag.h"
+#include "taskstats.h"
+#include "task_diag_comm.h"
+
+int main(int argc, char *argv[])
+{
+ struct nl_sock *sock;
+ int exit_status = 1;
+ int id;
+ struct task_diag_pid req;
+ struct nl_msg *msg;
+ void *hdr;
+ int err;
+
+
+ req.pid = 0;
+ if (argc >= 2)
+ req.pid = atoi(argv[1]);
+ if (req.pid == 0) {
+ pr_err("Usage: %s PID\n", argv[0]);
+ return 1;
+ }
+ req.show_flags = TASK_DIAG_SHOW_BASE | TASK_DIAG_SHOW_CRED |
+ TASK_DIAG_SHOW_VMA | TASK_DIAG_SHOW_VMA_STAT;
+
+ sock = nl_socket_alloc();
+ if (sock == NULL)
+ return -1;
+ nl_connect(sock, NETLINK_GENERIC);
+
+ err = genl_register_family(&ops);
+ if (err < 0) {
+ pr_err("Unable to register Generic Netlink family");
+ return -1;
+ }
+
+ err = genl_ops_resolve(sock, &ops);
+ if (err < 0) {
+ pr_err("Unable to resolve family name");
+ return -1;
+ }
+
+ id = genl_ctrl_resolve(sock, TASKSTATS_GENL_NAME);
+ if (id == GENL_ID_GENERATE)
+ return -1;
+
+ msg = nlmsg_alloc();
+ if (msg == NULL) {
+ pr_err("Unable to allocate netlink message");
+ return -1;
+ }
+
+ hdr = genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, id,
+ 0, 0, TASK_DIAG_CMD_GET, 0);
+ if (hdr == NULL) {
+ pr_err("Unable to write genl header");
+ return -1;
+ }
+
+ err = nla_put(msg, TASKSTATS_CMD_GET,
+ sizeof(struct task_diag_pid), &req);
+ if (err < 0) {
+ pr_err("Unable to add attribute: %s", nl_geterror(err));
+ return -1;
+ }
+
+ err = nl_send_auto_complete(sock, msg);
+ if (err < 0) {
+ pr_err("Unable to send message: %s", nl_geterror(err));
+ return -1;
+ }
+
+ nlmsg_free(msg);
+
+ err = nl_socket_modify_cb(sock, NL_CB_VALID,
+ NL_CB_CUSTOM, parse_cb, NULL);
+ if (err < 0) {
+ pr_err("Unable to modify valid message callback");
+ goto err;
+ }
+
+
+ err = nl_recvmsgs_default(sock);
+ if (err < 0) {
+ pr_err("Unable to receive message: %s", nl_geterror(err));
+ goto err;
+ }
+
+ exit_status = 0;
+err:
+ nl_close(sock);
+ nl_socket_free(sock);
+ return exit_status;
+}
diff --git a/tools/testing/selftests/task_diag/task_diag.h b/tools/testing/selftests/task_diag/task_diag.h
new file mode 120000
index 0000000..d20a38c
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag.h
@@ -0,0 +1 @@
+../../../../include/uapi/linux/task_diag.h
\ No newline at end of file
diff --git a/tools/testing/selftests/task_diag/task_diag_all.c b/tools/testing/selftests/task_diag/task_diag_all.c
new file mode 100644
index 0000000..53eb713
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_all.c
@@ -0,0 +1,145 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <poll.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/wait.h>
+#include <signal.h>
+#include <dirent.h>
+
+#include <linux/netlink.h>
+#include <netlink/socket.h>
+#include <linux/genetlink.h>
+#include <netlink/genl/ctrl.h>
+#include <netlink/genl/genl.h>
+#include <netlink/genl/mngt.h>
+
+#include "task_diag.h"
+#include "taskstats.h"
+#include "task_diag_comm.h"
+
+int main(int argc, char *argv[])
+{
+ struct nl_sock *sock;
+ int exit_status = 1;
+ int id;
+ struct task_diag_pid req;
+ struct nl_msg *msg;
+ __u32 last_pid = 0;
+ void *hdr;
+ int err;
+
+ req.show_flags = TASK_DIAG_SHOW_BASE | TASK_DIAG_SHOW_CRED |
+ TASK_DIAG_SHOW_VMA | TASK_DIAG_SHOW_VMA_STAT;
+
+ if (argc < 2) {
+ pr_err("Usage: %s type pid", argv[0]);
+ return 1;
+ }
+
+ req.pid = 0; /* dump all tasks by default */
+ if (argc > 2)
+ req.pid = atoi(argv[2]);
+
+ switch (argv[1][0]) {
+ case 'c':
+ req.dump_strategy = TASK_DIAG_DUMP_CHILDREN;
+ break;
+ case 't':
+ req.dump_strategy = TASK_DIAG_DUMP_THREAD;
+ break;
+ case 'o':
+ req.dump_strategy = TASK_DIAG_DUMP_ONE;
+ break;
+ case 'a':
+ req.dump_strategy = TASK_DIAG_DUMP_ALL;
+ req.pid = 0;
+ break;
+ case 'A':
+ req.dump_strategy = TASK_DIAG_DUMP_ALL_THREAD;
+ req.pid = 0;
+ break;
+ default:
+ pr_err("Usage: %s type pid", argv[0]);
+ return 1;
+ }
+
+ sock = nl_socket_alloc();
+ if (sock == NULL)
+ return -1;
+ nl_connect(sock, NETLINK_GENERIC);
+
+ err = genl_register_family(&ops);
+ if (err < 0) {
+ pr_err("Unable to register Generic Netlink family");
+ return -1;
+ }
+
+ err = genl_ops_resolve(sock, &ops);
+ if (err < 0) {
+ pr_err("Unable to resolve family name");
+ return -1;
+ }
+
+ id = genl_ctrl_resolve(sock, TASKSTATS_GENL_NAME);
+	if (id < 0)
+ return -1;
+
+ msg = nlmsg_alloc();
+ if (msg == NULL) {
+ pr_err("Unable to allocate netlink message");
+ return -1;
+ }
+
+ hdr = genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, id,
+ 0, NLM_F_DUMP, TASK_DIAG_CMD_GET, 0);
+ if (hdr == NULL) {
+ pr_err("Unable to write genl header");
+ return -1;
+ }
+
+ err = nla_put(msg, TASKSTATS_CMD_GET, sizeof(req), &req);
+ if (err < 0) {
+ pr_err("Unable to add attribute: %s", nl_geterror(err));
+ return -1;
+ }
+
+ err = nl_send_auto_complete(sock, msg);
+ if (err < 0) {
+ pr_err("Unable to send message: %s", nl_geterror(err));
+ return -1;
+ }
+
+ nlmsg_free(msg);
+
+ err = nl_socket_modify_cb(sock, NL_CB_VALID, NL_CB_CUSTOM,
+ parse_cb, &last_pid);
+ if (err < 0) {
+ pr_err("Unable to modify valid message callback");
+ goto err;
+ }
+ err = nl_socket_modify_cb(sock, NL_CB_FINISH, NL_CB_CUSTOM,
+ parse_cb, &last_pid);
+ if (err < 0) {
+		pr_err("Unable to modify finish message callback");
+ goto err;
+ }
+
+
+ err = nl_recvmsgs_default(sock);
+ if (err < 0) {
+ pr_err("Unable to receive message: %s", nl_geterror(err));
+ goto err;
+ }
+
+ exit_status = 0;
+err:
+ nl_close(sock);
+ nl_socket_free(sock);
+ return exit_status;
+}
diff --git a/tools/testing/selftests/task_diag/task_diag_comm.c b/tools/testing/selftests/task_diag/task_diag_comm.c
new file mode 100644
index 0000000..480c7cf
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_comm.c
@@ -0,0 +1,157 @@
+#include <errno.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <linux/netlink.h>
+#include <linux/genetlink.h>
+#include <netlink/cli/utils.h>
+
+#include "task_diag.h"
+#include "taskstats.h"
+#include "task_diag_comm.h"
+
+int quiet;
+
+static struct nla_policy attr_policy[TASK_DIAG_ATTR_MAX + 1] = {
+ [TASK_DIAG_PID] = { .type = NLA_U32},
+ [TASK_DIAG_BASE] = { .minlen = sizeof(struct task_diag_base) },
+ [TASK_DIAG_CRED] = { .minlen = sizeof(struct task_diag_creds) },
+};
+
+#define PSS_SHIFT 12
+static int parse_cmd_new(struct nl_cache_ops *unused, struct genl_cmd *cmd,
+ struct genl_info *info, void *arg)
+{
+	struct nlattr **attrs = info->attrs;
+	__u32 *last_pid = (__u32 *)arg;
+	__u32 pid;
+
+ if (arg) {
+ pid = *((__u32 *)nla_data(attrs[TASK_DIAG_PID]));
+
+ if (pid != *last_pid)
+ pr_info("Start getting information about %d\n", pid);
+ else
+ pr_info("Continue getting information about %d\n", pid);
+
+ *last_pid = pid;
+ }
+
+ if (attrs[TASK_DIAG_BASE]) {
+ struct task_diag_base *msg;
+
+ /* For nested attributes, na follows */
+ msg = nla_data(attrs[TASK_DIAG_BASE]);
+ pr_info("pid %d tgid %d ppid %d comm %s\n",
+ msg->pid, msg->tgid, msg->ppid, msg->comm);
+ }
+
+ if (attrs[TASK_DIAG_CRED]) {
+ struct task_diag_creds *creds;
+
+ creds = nla_data(attrs[TASK_DIAG_CRED]);
+ pr_info("uid: %d %d %d %d\n", creds->uid,
+ creds->euid, creds->suid, creds->fsuid);
+		pr_info("gid: %d %d %d %d\n", creds->gid,
+			creds->egid, creds->sgid, creds->fsgid);
+ pr_info("CapInh: %08x%08x\n",
+ creds->cap_inheritable.cap[1],
+ creds->cap_inheritable.cap[0]);
+ pr_info("CapPrm: %08x%08x\n",
+ creds->cap_permitted.cap[1],
+ creds->cap_permitted.cap[0]);
+ pr_info("CapEff: %08x%08x\n",
+ creds->cap_effective.cap[1],
+ creds->cap_effective.cap[0]);
+ pr_info("CapBnd: %08x%08x\n", creds->cap_bset.cap[1],
+ creds->cap_bset.cap[0]);
+ }
+
+ if (attrs[TASK_DIAG_VMA]) {
+ struct task_diag_vma *vma_tmp, vma;
+
+ task_diag_for_each_vma(vma_tmp, attrs[TASK_DIAG_VMA]) {
+ char *name;
+ struct task_diag_vma_stat *stat_tmp, stat;
+
+ name = task_diag_vma_name(vma_tmp);
+ if (name == NULL)
+ name = "";
+
+ memcpy(&vma, vma_tmp, sizeof(vma));
+ pr_info("%016llx-%016llx %016llx %s\n",
+ vma.start, vma.end, vma.vm_flags, name);
+
+ stat_tmp = task_diag_vma_stat(vma_tmp);
+ if (stat_tmp)
+ memcpy(&stat, stat_tmp, sizeof(stat));
+ else
+ memset(&stat, 0, sizeof(stat));
+
+ pr_info(
+ "Size: %8llu kB\n"
+ "Rss: %8llu kB\n"
+ "Pss: %8llu kB\n"
+ "Shared_Clean: %8llu kB\n"
+ "Shared_Dirty: %8llu kB\n"
+ "Private_Clean: %8llu kB\n"
+ "Private_Dirty: %8llu kB\n"
+ "Referenced: %8llu kB\n"
+ "Anonymous: %8llu kB\n"
+ "AnonHugePages: %8llu kB\n"
+ "Swap: %8llu kB\n",
+ (vma.end - vma.start) >> 10,
+ stat.resident >> 10,
+ (stat.pss >> (10 + PSS_SHIFT)),
+ stat.shared_clean >> 10,
+ stat.shared_dirty >> 10,
+ stat.private_clean >> 10,
+ stat.private_dirty >> 10,
+ stat.referenced >> 10,
+ stat.anonymous >> 10,
+ stat.anonymous_thp >> 10,
+ stat.swap >> 10);
+ }
+ }
+
+ return 0;
+}
+
+static struct genl_cmd cmds[] = {
+ {
+ .c_id = TASK_DIAG_CMD_GET,
+ .c_name = "taskstats_new()",
+ .c_maxattr = TASK_DIAG_ATTR_MAX,
+ .c_attr_policy = attr_policy,
+ .c_msg_parser = &parse_cmd_new,
+ },
+};
+
+#define ARRAY_SIZE(X) (sizeof(X) / sizeof((X)[0]))
+
+struct genl_ops ops = {
+ .o_name = TASKSTATS_GENL_NAME,
+ .o_cmds = cmds,
+ .o_ncmds = ARRAY_SIZE(cmds),
+};
+
+int parse_cb(struct nl_msg *msg, void *arg)
+{
+ struct nlmsghdr *hdr = nlmsg_hdr(msg);
+
+ if (hdr->nlmsg_type == NLMSG_DONE) {
+ int *ret = nlmsg_data(hdr);
+
+ if (*ret < 0) {
+ pr_err("An error message is received: %s\n",
+ strerror(-*ret));
+ return *ret;
+ }
+ return 0;
+ }
+
+ return genl_handle_msg(msg, arg);
+}
diff --git a/tools/testing/selftests/task_diag/task_diag_comm.h b/tools/testing/selftests/task_diag/task_diag_comm.h
new file mode 100644
index 0000000..5f6ba07
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_comm.h
@@ -0,0 +1,33 @@
+#ifndef __TASK_DIAG_COMM__
+#define __TASK_DIAG_COMM__
+
+#include <stdio.h>
+
+#include <linux/genetlink.h>
+#include "task_diag.h"
+
+/*
+ * Generic macros for dealing with netlink sockets. Might be duplicated
+ * elsewhere. It is recommended that commercial grade applications use
+ * libnl or libnetlink and use the interfaces provided by the library
+ */
+#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
+#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
+
+#define pr_err(fmt, ...) \
+ fprintf(stderr, fmt"\n", ##__VA_ARGS__)
+
+#define pr_perror(fmt, ...) \
+ fprintf(stderr, fmt " : %m\n", ##__VA_ARGS__)
+
+extern int quiet;
+#define pr_info(fmt, arg...) \
+ do { \
+ if (!quiet) \
+ printf(fmt, ##arg); \
+ } while (0) \
+
+extern struct genl_ops ops;
+int parse_cb(struct nl_msg *msg, void *arg);
+
+#endif /* __TASK_DIAG_COMM__ */
diff --git a/tools/testing/selftests/task_diag/task_proc_all.c b/tools/testing/selftests/task_diag/task_proc_all.c
new file mode 100644
index 0000000..07ee80c
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_proc_all.c
@@ -0,0 +1,35 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+
+int main(int argc, char **argv)
+{
+ DIR *d;
+ int fd, tasks = 0;
+ struct dirent *de;
+ char buf[4096];
+
+ d = opendir("/proc");
+ if (d == NULL)
+ return 1;
+
+ while ((de = readdir(d))) {
+ if (de->d_name[0] < '0' || de->d_name[0] > '9')
+ continue;
+ snprintf(buf, sizeof(buf), "/proc/%s/stat", de->d_name);
+ fd = open(buf, O_RDONLY);
+ read(fd, buf, sizeof(buf));
+ close(fd);
+ tasks++;
+ }
+
+ closedir(d);
+
+ printf("tasks: %d\n", tasks);
+
+ return 0;
+}
diff --git a/tools/testing/selftests/task_diag/taskstats.h b/tools/testing/selftests/task_diag/taskstats.h
new file mode 120000
index 0000000..fa9523b
--- /dev/null
+++ b/tools/testing/selftests/task_diag/taskstats.h
@@ -0,0 +1 @@
+../../../../include/uapi/linux/taskstats.h
\ No newline at end of file
--
2.1.0

2015-07-06 08:51:41

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 24/24] task_diag: Enhance fork tool to spawn threads

From: David Ahern <[email protected]>

Add option to fork threads as well as processes.
Make the sleep time configurable too so that spawned
tasks exit on their own.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
tools/testing/selftests/task_diag/Makefile | 2 ++
tools/testing/selftests/task_diag/fork.c | 36 ++++++++++++++++++++++++++----
tools/testing/selftests/task_diag/run.sh | 4 ++--
3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/task_diag/Makefile b/tools/testing/selftests/task_diag/Makefile
index 7104573..0c53cf6 100644
--- a/tools/testing/selftests/task_diag/Makefile
+++ b/tools/testing/selftests/task_diag/Makefile
@@ -12,6 +12,8 @@ task_diag_comm.o: task_diag_comm.c task_diag_comm.h
task_diag_all: task_diag_all.o task_diag_comm.o
task_diag: task_diag.o task_diag_comm.o
fork: fork.c
+ $(CC) $(CFLAGS) $(LDFLAGS) -o $@ $^ -lpthread
+
task_proc_all: task_proc_all.c

clean:
diff --git a/tools/testing/selftests/task_diag/fork.c b/tools/testing/selftests/task_diag/fork.c
index c6e17d1..ebddedd2 100644
--- a/tools/testing/selftests/task_diag/fork.c
+++ b/tools/testing/selftests/task_diag/fork.c
@@ -2,15 +2,39 @@
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
+#include <pthread.h>

+void *f(void *arg)
+{
+ unsigned long t = (unsigned long) arg;
+
+ sleep(t);
+ return NULL;
+}
+
+/* usage: fork nproc [mthreads [sleep]] */
int main(int argc, char **argv)
{
- int i, n;
+ int i, j, n, m = 0;
+ unsigned long t_sleep = 1000;
+ pthread_attr_t attr;
+ pthread_t id;

- if (argc < 2)
+ if (argc < 2) {
+ fprintf(stderr, "usage: fork nproc [mthreads [sleep]]\n");
return 1;
+ }

n = atoi(argv[1]);
+
+ if (argc > 2)
+ m = atoi(argv[2]);
+
+ if (argc > 3)
+ t_sleep = atoi(argv[3]);
+
+ pthread_attr_init(&attr);
+
for (i = 0; i < n; i++) {
pid_t pid;

@@ -20,8 +44,12 @@ int main(int argc, char **argv)
return 1;
}
if (pid == 0) {
- while (1)
- sleep(1000);
+ if (m) {
+ for (j = 0; j < m-1; ++j)
+ pthread_create(&id, &attr, f, (void *)t_sleep);
+ }
+
+ sleep(t_sleep);
return 0;
}
}
diff --git a/tools/testing/selftests/task_diag/run.sh b/tools/testing/selftests/task_diag/run.sh
index 62250a5..06e182d 100755
--- a/tools/testing/selftests/task_diag/run.sh
+++ b/tools/testing/selftests/task_diag/run.sh
@@ -1,6 +1,6 @@
#!/bin/sh
-./fork 1000
+./fork 1000 10
nproc=`./task_diag_all A | grep 'pid.*tgid.*ppid.*comm fork$' | wc -l`
killall -9 fork
-[ "$nproc" -eq 1000 ] && exit 0
+[ "$nproc" -eq 10000 ] && exit 0
echo "Unexpected number of tasks '$nproc'" 1>&2
--
2.1.0

2015-07-06 17:10:57

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Mon, Jul 6, 2015 at 1:47 AM, Andrey Vagin <[email protected]> wrote:
> Currently we use the proc file system, where all information are
> presented in text files, what is convenient for humans. But if we need
> to get information about processes from code (e.g. in C), the procfs
> doesn't look so cool.
>
> From code we would prefer to get information in binary format and to be
> able to specify which information and for which tasks are required. Here
> is a new interface with all these features, which is called task_diag.
> In addition it's much faster than procfs.
>
> task_diag is based on netlink sockets and looks like socket-diag, which
> is used to get information about sockets.

I think I like this in principle, but I can see a few potential
problems with using netlink for this:

1. Netlink very naturally handles net namespaces, but it doesn't
naturally handle any other kind of namespace. In fact, the taskstats
code that you're building on has highly broken user and pid namespace
support. (Look for some obviously useless init_user_ns and
init_pid_ns references. But that's only the obvious problem. That
code calls current_user_ns() and task_active_pid_ns(current) from
.doit, which is, in turn, called from sys_write, and looking at
current's security state from sys_write is a big no-no.)

You could partially fix it by looking at f_cred's namespaces, but that
would be a change of what it means to create a netlink socket, and I'm
not sure that's a good idea.

2. These look like generally useful interfaces, which means that
people might want to use them in common non-system software, which
means that some of that software might get run inside of sandboxes
(Sandstorm, xdg-app, etc.) Sandboxes like that might block netlink
outright, since it can't be usefully filtered by seccomp. (This isn't
really the case now, since netlink route queries are too common, but
still.)

3. Netlink is a bit tedious to use from userspace. Especially for
things like task_diag, which are really just queries that generate
single replies.

Would it make more sense to have a new syscall instead? You could
even still use nlattr formatting for the syscall results.

--Andy

2015-07-07 15:44:07

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Mon, Jul 06, 2015 at 10:10:32AM -0700, Andy Lutomirski wrote:
> On Mon, Jul 6, 2015 at 1:47 AM, Andrey Vagin <[email protected]> wrote:
> > Currently we use the proc file system, where all information are
> > presented in text files, what is convenient for humans. But if we need
> > to get information about processes from code (e.g. in C), the procfs
> > doesn't look so cool.
> >
> > From code we would prefer to get information in binary format and to be
> > able to specify which information and for which tasks are required. Here
> > is a new interface with all these features, which is called task_diag.
> > In addition it's much faster than procfs.
> >
> > task_diag is based on netlink sockets and looks like socket-diag, which
> > is used to get information about sockets.
>
> I think I like this in principle, but I can see a few potential
> problems with using netlink for this:
>
> 1. Netlink very naturally handles net namespaces, but it doesn't
> naturally handle any other kind of namespace. In fact, the taskstats
> code that you're building on has highly broken user and pid namespace
> support. (Look for some obviously useless init_user_ns and
> init_pid_ns references. But that's only the obvious problem. That
> code calls current_user_ns() and task_active_pid_ns(current) from
> .doit, which is, in turn, called from sys_write, and looking at
> current's security state from sys_write is a big no-no.)
>
> You could partially fix it by looking at f_cred's namespaces, but that
> would be a change of what it means to create a netlink socket, and I'm
> not sure that's a good idea.

Unless I'm missing something, all the problems around pidns and userns
are related to the multicast functionality. task_diag uses a
request/response scheme and doesn't send multicast packets.

>
> 2. These look like generally useful interfaces, which means that
> people might want to use them in common non-system software, which
> means that some of that software might get run inside of sandboxes
> (Sandstorm, xdg-app, etc.) Sandboxes like that might block netlink
> outright, since it can't be usefully filtered by seccomp. (This isn't
> really the case now, since netlink route queries are too common, but
> still.)
>
> 3. Netlink is a bit tedious to use from userspace. Especially for
> things like task_diag, which are really just queries that generate
> single replies.

I don't understand this point. Could you elaborate? I thought netlink
was designed for such purposes (not only for them, but for them
too).

Two features of netlink are used here.

The netlink interface allows a response to be split into a few packets
if it's too big to be transferred in one iteration.

And I want to mention the "Memory mapped netlink I/O" functionality,
which can be used to speed up task_diag.

>
> Would it make more sense to have a new syscall instead? You could
> even still use nlattr formatting for the syscall results.

Andy, thank you for the feedback. I got your points. I need time to
think about them. I suppose that a new syscall can be more suitable in
this case, and I need time to form a vision of it. If you have any ideas
or thoughts, I would be glad to know about them.

Thanks,
Andrew

>
> --Andy

2015-07-07 15:57:06

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Tue, Jul 7, 2015 at 8:43 AM, Andrew Vagin <[email protected]> wrote:
> On Mon, Jul 06, 2015 at 10:10:32AM -0700, Andy Lutomirski wrote:
>> On Mon, Jul 6, 2015 at 1:47 AM, Andrey Vagin <[email protected]> wrote:
>> > Currently we use the proc file system, where all information are
>> > presented in text files, what is convenient for humans. But if we need
>> > to get information about processes from code (e.g. in C), the procfs
>> > doesn't look so cool.
>> >
>> > From code we would prefer to get information in binary format and to be
>> > able to specify which information and for which tasks are required. Here
>> > is a new interface with all these features, which is called task_diag.
>> > In addition it's much faster than procfs.
>> >
>> > task_diag is based on netlink sockets and looks like socket-diag, which
>> > is used to get information about sockets.
>>
>> I think I like this in principle, but I can see a few potential
>> problems with using netlink for this:
>>
>> 1. Netlink very naturally handles net namespaces, but it doesn't
>> naturally handle any other kind of namespace. In fact, the taskstats
>> code that you're building on has highly broken user and pid namespace
>> support. (Look for some obviously useless init_user_ns and
>> init_pid_ns references. But that's only the obvious problem. That
>> code calls current_user_ns() and task_active_pid_ns(current) from
>> .doit, which is, in turn, called from sys_write, and looking at
>> current's security state from sys_write is a big no-no.)
>>
>> You could partially fix it by looking at f_cred's namespaces, but that
>> would be a change of what it means to create a netlink socket, and I'm
>> not sure that's a good idea.
>
> Unless I'm missing something, all the problems around pidns and userns
> are related to the multicast functionality. task_diag uses a
> request/response scheme and doesn't send multicast packets.

It has nothing to do with multicast. task_diag needs to know what
pidns and userns to use for a request, but netlink isn't set up to
give you any reasonably way to do that. A netlink socket is
fundamentally tied to a *net* ns (it's a socket, after all). But you
can send it requests using write(2), and calling current_user_ns()
from write(2) is bad. There's a long history of bugs and
vulnerabilities related to thinking that current_cred() and similar
are acceptable things to use in write(2) implementations.

>
>>
>> 2. These look like generally useful interfaces, which means that
>> people might want to use them in common non-system software, which
>> means that some of that software might get run inside of sandboxes
>> (Sandstorm, xdg-app, etc.) Sandboxes like that might block netlink
>> outright, since it can't be usefully filtered by seccomp. (This isn't
>> really the case now, since netlink route queries are too common, but
>> still.)
>>
>> 3. Netlink is a bit tedious to use from userspace. Especially for
>> things like task_diag, which are really just queries that generate
>> single replies.
>
> I don't understand this point. Could you elaborate? I thought the
> netlink was designed for such purposes. (not only for them, but for them
> too)
>
> There are two features of netlink which are used.
>
> The netlink interface allows a response to be split into a few packets
> if it's too big to be transferred in one iteration.
>

Netlink is fine for these use cases (if they were related to the
netns, not the pid ns or user ns), and it works. It's still tedious
-- I bet that if you used a syscall, the user code would be
considerably shorter, though. :)

How would this be a problem if you used plain syscalls? The user
would make a request, and the syscall would tell the user that their
result buffer was too small if it was, in fact, too small.

> And I want to mention the "Memory mapped netlink I/O" functionality,
> which can be used to speed up task_diag.
>

IIRC memory-mapped netlink writes are terminally broken and therefore
neutered in current kernels (and hence no faster, and possibly slower,
than plain send(2)). Memory-mapped reads are probably okay, but I
can't imagine that feature actually saving time in any real workload.
Almost all of the cpu time spent in task_diag will be in locking,
following pointers, formatting things, etc, and adding a memcpy will
almost certainly be lost in the noise.

--Andy

2015-07-07 16:17:48

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On 7/7/15 9:56 AM, Andy Lutomirski wrote:
> Netlink is fine for these use cases (if they were related to the
> netns, not the pid ns or user ns), and it works. It's still tedious
> -- I bet that if you used a syscall, the user code would be
> considerably shorter, though. :)
>
> How would this be a problem if you used plain syscalls? The user
> would make a request, and the syscall would tell the user that their
> result buffer was too small if it was, in fact, too small.

It will be impossible to tell the user what size buffer is needed. The
size is largely a function of the number of tasks and the number of maps
per thread group, and both of those will be changing. With the growing size
of systems (I have seen sparc systems with 1024 cpus) the workload can be
tens of thousands of tasks, each with a lot of maps (e.g., java workloads).
That amounts to a non-trivial amount of data that has to be pushed to
userspace.

One of the benefits of the netlink approach is breaking the data across
multiple messages and picking up where you left off. That infrastructure
is already in place.

David

2015-07-07 16:24:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Tue, Jul 7, 2015 at 9:17 AM, David Ahern <[email protected]> wrote:
> On 7/7/15 9:56 AM, Andy Lutomirski wrote:
>>
>> Netlink is fine for these use cases (if they were related to the
>> netns, not the pid ns or user ns), and it works. It's still tedious
>> -- I bet that if you used a syscall, the user code would be
>> considerably shorter, though. :)
>>
>> How would this be a problem if you used plain syscalls? The user
>> would make a request, and the syscall would tell the user that their
>> result buffer was too small if it was, in fact, too small.
>
>
> It will be impossible to tell a user what sized buffer is needed. The size
> is largely a function of the number of tasks and number of maps per thread
> group and both of those will be changing. With the growing size of systems
> (I have seen sparc systems with 1024 cpus) the workload can be tens of thousands
> of tasks each with a lot of maps (e.g., java workloads). That amounts to a
> non-trivial amount of data that has to be pushed to userspace.
>
> One of the benefits of the netlink approach is breaking the data across
> multiple messages and picking up where you left off. That infrastructure is
> already in place.
>

How does picking up where you left off work? I assumed the interface
was something along the lines of "give me information starting from
pid N", but maybe I missed something.

--Andy

2015-07-07 16:26:06

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

Em Tue, Jul 07, 2015 at 06:43:46PM +0300, Andrew Vagin escreveu:
> On Mon, Jul 06, 2015 at 10:10:32AM -0700, Andy Lutomirski wrote:
> > On Mon, Jul 6, 2015 at 1:47 AM, Andrey Vagin <[email protected]> wrote:
> > Would it make more sense to have a new syscall instead? You could
> > even still use nlattr formatting for the syscall results.
>
> Andy, thank you for the feedback. I got your points. I need time to
> think about them. I suppose that a new syscall can be more suitable in
> this case, and I need time to form a vision of it. If you have any ideas
> or thoughts, I would be glad to know about them.

If a new syscall would indeed be better for this, then using
sys_perf_event_open and on one of the perf_event_attr flip a bit to ask
for those PERF_RECORD_{COMM,FORK,PERF_RECORD_MMAP2, etc} to be generated
in the perf buffer could make it reuse all the userspace tooling, with
really minimal change: flip the bit, don't synthesize it from /proc.

- Arnaldo

2015-07-07 16:28:00

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Tue, Jul 7, 2015 at 9:25 AM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
> Em Tue, Jul 07, 2015 at 06:43:46PM +0300, Andrew Vagin escreveu:
>> On Mon, Jul 06, 2015 at 10:10:32AM -0700, Andy Lutomirski wrote:
>> > On Mon, Jul 6, 2015 at 1:47 AM, Andrey Vagin <[email protected]> wrote:
>> > Would it make more sense to have a new syscall instead? You could
>> > even still use nlattr formatting for the syscall results.
>>
>> Andy, thank you for the feedback. I got your points. I need time to
>> think about them. I suppose that a new syscall can be more suitable in
>> this case, and I need time to form a vision of it. If you have any ideas
>> or thoughts, I would be glad to know about them.
>
> If a new syscall would indeed be better for this, then using
> sys_perf_event_open and on one of the perf_event_attr flip a bit to ask
> for those PERF_RECORD_{COMM,FORK,PERF_RECORD_MMAP2, etc} to be generated
> in the perf buffer could make it reuse all the userspace tooling, with
> really minimal change: flip the bit, don't synthesize it from /proc.
>

Hmm, that's an interesting thought.

Andrew, would that work for you?

--Andy

2015-07-07 16:41:26

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On 7/7/15 10:24 AM, Andy Lutomirski wrote:
> On Tue, Jul 7, 2015 at 9:17 AM, David Ahern <[email protected]> wrote:
>> On 7/7/15 9:56 AM, Andy Lutomirski wrote:
>>>
>>> Netlink is fine for these use cases (if they were related to the
>>> netns, not the pid ns or user ns), and it works. It's still tedious
>>> -- I bet that if you used a syscall, the user code would be
>>> considerable shorter, though. :)
>>>
>>> How would this be a problem if you used plain syscalls? The user
>>> would make a request, and the syscall would tell the user that their
>>> result buffer was too small if it was, in fact, too small.
>>
>>
>> It will be impossible to tell a user what sized buffer is needed. The size
>> is largely a function of the number of tasks and number of maps per thread
>> group and both of those will be changing. With the growing size of systems
>> (I have seen sparc systems with 1024 cpus) the workload can be tens of thousands
>> of tasks each with a lot of maps (e.g., java workloads). That amounts to a
>> non-trivial amount of data that has to be pushed to userspace.
>>
>> One of the benefits of the netlink approach is breaking the data across
>> multiple messages and picking up where you left off. That infrastructure is
>> already in place.
>>
>
> How does picking up where you left off work? I assumed the interface
> was something along the lines of "give me information starting from
> pid N", but maybe I missed something.

There are different use cases:
1. specific pid

2. specific thread group id

3. all child processes

4. all tasks in the system (perf record/top -a mode)

5. all tasks for a uid is another but not coded yet

The big hitter is 4 in terms of data volume.

Essentially the kernel-side changes save where you are when an skb
fills -- which task and which map; e.g., see the cb usage in patch 5. On
return, if the task is still there, continue dumping its data. If it
disappeared, move on to the next.

2015-07-07 16:56:44

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On 7/7/15 10:27 AM, Andy Lutomirski wrote:
> On Tue, Jul 7, 2015 at 9:25 AM, Arnaldo Carvalho de Melo
> <[email protected]> wrote:
>> Em Tue, Jul 07, 2015 at 06:43:46PM +0300, Andrew Vagin escreveu:
>>> On Mon, Jul 06, 2015 at 10:10:32AM -0700, Andy Lutomirski wrote:
>>>> On Mon, Jul 6, 2015 at 1:47 AM, Andrey Vagin <[email protected]> wrote:
>>>> Would it make more sense to have a new syscall instead? You could
>>>> even still use nlattr formatting for the syscall results.
>>>
>>> Andy, thank you for the feedback. I got your points. I need time to
>>> think about them. I suppose that a new syscall can be more suitable in
>>> this case, and I need time to form a vision of it. If you have any ideas
>>> or thoughts, I would be glad to know about them.
>>
>> If a new syscall would indeed be better for this, then using
>> sys_perf_event_open and on one of the perf_event_attr flip a bit to ask
>> for those PERF_RECORD_{COMM,FORK,PERF_RECORD_MMAP2, etc} to be generated
>> in the perf buffer could make it reuse all the userspace tooling, with
>> really minimal change: flip the bit, don't synthesize it from /proc.
>>
>
> Hmm, that's an interesting thought.
>
> Andrew, would that work for you?

It's an interesting option to look at. It provides a fixed sized ring
buffer. The dummy event can be used as the event to trigger the
generation of task data. The LOST event can tell you if the buffer is
too small.

Of course the devil is in the details. The buffer for task events will
need to be fairly large. That size is only needed for the initial task
data, meaning the global mmap size is not right for it, and this
particular buffer can be reduced after startup.

The initial task detection can generate a flood of data in a very short
amount of time, since the kernel side is in a tight loop walking tasks
and maps, and there is nothing to throttle it beyond the buffer filling --
which just means lost data -- and an occasional need_resched check.
Perhaps the kernel loop will need to pause when the buffer is full to
give userspace a moment to collect data rather than just dropping it.

2015-07-08 16:10:44

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Tue, Jul 07, 2015 at 08:56:37AM -0700, Andy Lutomirski wrote:
> On Tue, Jul 7, 2015 at 8:43 AM, Andrew Vagin <[email protected]> wrote:
> > On Mon, Jul 06, 2015 at 10:10:32AM -0700, Andy Lutomirski wrote:
> >> On Mon, Jul 6, 2015 at 1:47 AM, Andrey Vagin <[email protected]> wrote:
> >> > Currently we use the proc file system, where all information are
> >> > presented in text files, what is convenient for humans. But if we need
> >> > to get information about processes from code (e.g. in C), the procfs
> >> > doesn't look so cool.
> >> >
> >> > From code we would prefer to get information in binary format and to be
> >> > able to specify which information and for which tasks are required. Here
> >> > is a new interface with all these features, which is called task_diag.
> >> > In addition it's much faster than procfs.
> >> >
> >> > task_diag is based on netlink sockets and looks like socket-diag, which
> >> > is used to get information about sockets.
> >>
> >> I think I like this in principle, but I can see a few potential
> >> problems with using netlink for this:
> >>
> >> 1. Netlink very naturally handles net namespaces, but it doesn't
> >> naturally handle any other kind of namespace. In fact, the taskstats
> >> code that you're building on has highly broken user and pid namespace
> >> support. (Look for some obviously useless init_user_ns and
> >> init_pid_ns references. But that's only the obvious problem. That
> >> code calls current_user_ns() and task_active_pid_ns(current) from
> >> .doit, which is, in turn, called from sys_write, and looking at
> >> current's security state from sys_write is a big no-no.)
> >>
> >> You could partially fix it by looking at f_cred's namespaces, but that
> >> would be a change of what it means to create a netlink socket, and I'm
> >> not sure that's a good idea.
> >
> > If I don't miss something, all problems around pidns and userns are
> > related with multicast functionality. task_diag is using
> > request/response scheme and doesn't send multicast packets.
>
> It has nothing to do with multicast. task_diag needs to know what
> pidns and userns to use for a request, but netlink isn't set up to
> give you any reasonable way to do that. A netlink socket is
> fundamentally tied to a *net* ns (it's a socket, after all). But you
> can send it requests using write(2), and calling current_user_ns()
> from write(2) is bad. There's a long history of bugs and
> vulnerabilities related to thinking that current_cred() and similar
> are acceptable things to use in write(2) implementations.
>

As far as I understand, socket_diag doesn't have this problem, because
each socket holds a reference to the namespace where it was created.

What if we pin the current pidns and credentials to a task_diag
socket at the moment it's created?

Thanks,
Andrew

2015-07-08 17:40:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Wed, Jul 8, 2015 at 9:10 AM, Andrew Vagin <[email protected]> wrote:
> On Tue, Jul 07, 2015 at 08:56:37AM -0700, Andy Lutomirski wrote:
>> On Tue, Jul 7, 2015 at 8:43 AM, Andrew Vagin <[email protected]> wrote:
>> > On Mon, Jul 06, 2015 at 10:10:32AM -0700, Andy Lutomirski wrote:
>> >> On Mon, Jul 6, 2015 at 1:47 AM, Andrey Vagin <[email protected]> wrote:
>> >> > Currently we use the proc file system, where all information are
>> >> > presented in text files, what is convenient for humans. But if we need
>> >> > to get information about processes from code (e.g. in C), the procfs
>> >> > doesn't look so cool.
>> >> >
>> >> > From code we would prefer to get information in binary format and to be
>> >> > able to specify which information and for which tasks are required. Here
>> >> > is a new interface with all these features, which is called task_diag.
>> >> > In addition it's much faster than procfs.
>> >> >
>> >> > task_diag is based on netlink sockets and looks like socket-diag, which
>> >> > is used to get information about sockets.
>> >>
>> >> I think I like this in principle, but I can see a few potential
>> >> problems with using netlink for this:
>> >>
>> >> 1. Netlink very naturally handles net namespaces, but it doesn't
>> >> naturally handle any other kind of namespace. In fact, the taskstats
>> >> code that you're building on has highly broken user and pid namespace
>> >> support. (Look for some obviously useless init_user_ns and
>> >> init_pid_ns references. But that's only the obvious problem. That
>> >> code calls current_user_ns() and task_active_pid_ns(current) from
>> >> .doit, which is, in turn, called from sys_write, and looking at
>> >> current's security state from sys_write is a big no-no.)
>> >>
>> >> You could partially fix it by looking at f_cred's namespaces, but that
>> >> would be a change of what it means to create a netlink socket, and I'm
>> >> not sure that's a good idea.
>> >
>> > If I don't miss something, all problems around pidns and userns are
>> > related with multicast functionality. task_diag is using
>> > request/response scheme and doesn't send multicast packets.
>>
>> It has nothing to do with multicast. task_diag needs to know what
>> pidns and userns to use for a request, but netlink isn't set up to
>> give you any reasonable way to do that. A netlink socket is
>> fundamentally tied to a *net* ns (it's a socket, after all). But you
>> can send it requests using write(2), and calling current_user_ns()
>> from write(2) is bad. There's a long history of bugs and
>> vulnerabilities related to thinking that current_cred() and similar
>> are acceptable things to use in write(2) implementations.
>>
>
> As far as I understand, socket_diag doesn't have this problem, because
> each socket holds a reference to the namespace where it was created.
>
> What if we pin the current pidns and credentials to a task_diag
> socket at the moment it's created?

That's certainly doable. OTOH, if anything does:

socket(AF_NETLINK, ...);
unshare(CLONE_NEWPID);
fork();

then they now have a (minor) security problem.

--Andy

2015-07-08 22:49:14

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

2015-07-08 20:39 GMT+03:00 Andy Lutomirski <[email protected]>:
> On Wed, Jul 8, 2015 at 9:10 AM, Andrew Vagin <[email protected]> wrote:
>> On Tue, Jul 07, 2015 at 08:56:37AM -0700, Andy Lutomirski wrote:
>>> On Tue, Jul 7, 2015 at 8:43 AM, Andrew Vagin <[email protected]> wrote:
>>> > On Mon, Jul 06, 2015 at 10:10:32AM -0700, Andy Lutomirski wrote:
>>> >> On Mon, Jul 6, 2015 at 1:47 AM, Andrey Vagin <[email protected]> wrote:
>>> >> > Currently we use the proc file system, where all information are
>>> >> > presented in text files, what is convenient for humans. But if we need
>>> >> > to get information about processes from code (e.g. in C), the procfs
>>> >> > doesn't look so cool.
>>> >> >
>>> >> > From code we would prefer to get information in binary format and to be
>>> >> > able to specify which information and for which tasks are required. Here
>>> >> > is a new interface with all these features, which is called task_diag.
>>> >> > In addition it's much faster than procfs.
>>> >> >
>>> >> > task_diag is based on netlink sockets and looks like socket-diag, which
>>> >> > is used to get information about sockets.
>>> >>
>>> >> I think I like this in principle, but I can see a few potential
>>> >> problems with using netlink for this:
>>> >>
>>> >> 1. Netlink very naturally handles net namespaces, but it doesn't
>>> >> naturally handle any other kind of namespace. In fact, the taskstats
>>> >> code that you're building on has highly broken user and pid namespace
>>> >> support. (Look for some obviously useless init_user_ns and
>>> >> init_pid_ns references. But that's only the obvious problem. That
>>> >> code calls current_user_ns() and task_active_pid_ns(current) from
>>> >> .doit, which is, in turn, called from sys_write, and looking at
>>> >> current's security state from sys_write is a big no-no.)
>>> >>
>>> >> You could partially fix it by looking at f_cred's namespaces, but that
>>> >> would be a change of what it means to create a netlink socket, and I'm
>>> >> not sure that's a good idea.
>>> >
>>> > If I don't miss something, all problems around pidns and userns are
>>> > related with multicast functionality. task_diag is using
>>> > request/response scheme and doesn't send multicast packets.
>>>
>>> It has nothing to do with multicast. task_diag needs to know what
>>> pidns and userns to use for a request, but netlink isn't set up to
>>> give you any reasonable way to do that. A netlink socket is
>>> fundamentally tied to a *net* ns (it's a socket, after all). But you
>>> can send it requests using write(2), and calling current_user_ns()
>>> from write(2) is bad. There's a long history of bugs and
>>> vulnerabilities related to thinking that current_cred() and similar
>>> are acceptable things to use in write(2) implementations.
>>>
>>
>> As far as I understand, socket_diag doesn't have this problem, because
>> each socket holds a reference to the namespace where it was created.
>>
>> What if we pin the current pidns and credentials to a task_diag
>> socket at the moment it's created?
>
> That's certainly doable. OTOH, if anything does:
>
> socket(AF_NETLINK, ...);
> unshare(CLONE_NEWPID);
> fork();
>
> then they now have a (minor) security problem.

What do you mean? Isn't it the same as when we open a file and then
change the uid and gid? Permissions are checked only in the open() syscall.

[root@avagin-fc19-cr ~]# ls -l xxx
-rw-r--r-- 1 root root 5 Jul 9 01:42 xxx

open("xxx", O_WRONLY|O_APPEND) = 3
setgid(1000) = 0
setuid(1000) = 0
write(3, "a", 1) = 1
close(1) = 0

>
> --Andy

2015-07-08 23:48:52

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Wed, Jul 8, 2015 at 3:49 PM, Andrey Vagin <[email protected]> wrote:
> 2015-07-08 20:39 GMT+03:00 Andy Lutomirski <[email protected]>:
>> On Wed, Jul 8, 2015 at 9:10 AM, Andrew Vagin <[email protected]> wrote:
>>>
>>> As far as I understand, socket_diag doesn't have this problem, because
>>> each socket holds a reference to the namespace where it was created.
>>>
>>> What if we pin the current pidns and credentials to a task_diag
>>> socket at the moment it's created?
>>
>> That's certainly doable. OTOH, if anything does:
>>
>> socket(AF_NETLINK, ...);
>> unshare(CLONE_NEWPID);
>> fork();
>>
>> then they now have a (minor) security problem.
>
> What do you mean? Is it not the same when we open a file and change
> uid and gid? Permissions are checked only in the "open" syscall.
>
> [root@avagin-fc19-cr ~]# ls -l xxx
> -rw-r--r-- 1 root root 5 Jul 9 01:42 xxx
>
> open("xxx", O_WRONLY|O_APPEND) = 3
> setgid(1000) = 0
> setuid(1000) = 0
> write(3, "a", 1) = 1
> close(1) = 0

Yes and no.

open(2) is supposed to return an fd that retains the access to the
file that existed when open(2) was called. socket(2) is supposed* to
capture the access to the netns that existed at the time it was
called, but capturing access to a userns and/or pidns is new.

If you added socket(AF_NETLINK, SOCK_DGRAM, NETLINK_PIDNS), then maybe
that would work, but the userns interaction is a bit odd. OTOH every
pidns has an associated userns, so you could just use that. I don't
know whether that would annoy someone.

* There's some question as to whether socket(2) or connect(2) should
do this, but connect handling in netlink is quite broken and iproute2
relies on the broken handling. The historical behavior was different,
too, but the old behavior was exploitable. I have a cute little
program that does 'ip set dev lo down' but doesn't need to be run as
root :)

--Andy

2015-07-14 17:48:49

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 20/24] task_diag: Only add VMAs for thread_group leader

On 07/06, Andrey Vagin wrote:
>
> From: David Ahern <[email protected]>
>
> threads of a process share the same VMAs, so when dumping all threads
> for all processes only push vma data for group leader.

...

> @@ -492,6 +493,13 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
> }
>
> if (show_flags & TASK_DIAG_SHOW_VMA) {
> + /* if the request is to dump all threads of all processes
> + * only show VMAs for group leader.
> + */
> + if (req->dump_strategy == TASK_DIAG_DUMP_ALL_THREAD &&
> + !thread_group_leader(tsk))
> + goto done;
> +

This doesn't look right, group leader can be a zombie with ->mm == NULL,

> if (i >= n)
> err = fill_vma(tsk, skb, cb, &progress, show_flags);

so this probably needs something like find_lock_task_mm().

Oleg.

2015-07-14 18:04:20

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 08/24] proc: pick out a function to iterate task children

On 07/06, Andrey Vagin wrote:
>
> -static struct pid *
> -get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
> +static struct task_struct *
> +task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos)
> {

I won't really argue, just a question...

So this patch changes it to accept/return a task_struct rather than a pid.
Why? It is better to get/put "struct pid" only, not the whole task_struct.

If another caller wants a task_struct, the necessary conversion is simple.
But again, I won't argue if you think this will complicate the non-proc
users of this helper.

Oleg.

2015-07-14 18:10:42

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 12/24] task_diag: add a new group to get tasks memory mappings (v2)

On 07/06, Andrey Vagin wrote:
>
> +static int task_vma_num(struct mm_struct *mm)
> +{
> + struct vm_area_struct *vma;
> + int n_vma = 0;
> +
> + if (!mm || !atomic_inc_not_zero(&mm->mm_users))
> + return 0;
> +
> + down_read(&mm->mmap_sem);
> + for (vma = mm->mmap; vma; vma = vma->vm_next, n_vma++)
> + ;
> +
> + up_read(&mm->mmap_sem);
> + mmput(mm);
> +
> + return n_vma;
> +}

Hmm. How about

int task_vma_num(struct mm_struct *mm)
{
return mm->map_count;
}

?

Oleg.

2015-07-14 18:13:11

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 15/24] proc: give task_struct instead of pid into first_tid

On 07/06, Andrey Vagin wrote:
>
> It will be more convenient when this function is used in
> task_diag.

Again, I won't argue, but perhaps the changelog(s) could say a bit
more about this change (see my reply to 8/24).

Oleg.

2015-07-15 02:02:03

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 20/24] task_diag: Only add VMAs for thread_group leader

On 7/14/15 1:47 PM, Oleg Nesterov wrote:
> On 07/06, Andrey Vagin wrote:
>>
>> From: David Ahern <[email protected]>
>>
>> threads of a process share the same VMAs, so when dumping all threads
>> for all processes only push vma data for group leader.
>
> ...
>
>> @@ -492,6 +493,13 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
>> }
>>
>> if (show_flags & TASK_DIAG_SHOW_VMA) {
>> + /* if the request is to dump all threads of all processes
>> + * only show VMAs for group leader.
>> + */
>> + if (req->dump_strategy == TASK_DIAG_DUMP_ALL_THREAD &&
>> + !thread_group_leader(tsk))
>> + goto done;
>> +
>
> This doesn't look right, group leader can be a zombie with ->mm == NULL,

Seriously? I'll find some way to track whether VMAs have been dumped for
a pid.

>
>> if (i >= n)
>> err = fill_vma(tsk, skb, cb, &progress, show_flags);
>
> so this probably needs something like find_lock_task_mm().

ok.

Thanks for the review,
David

2015-07-15 02:02:35

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 12/24] task_diag: add a new group to get tasks memory mappings (v2)

On 7/14/15 2:08 PM, Oleg Nesterov wrote:
> On 07/06, Andrey Vagin wrote:
>>
>> +static int task_vma_num(struct mm_struct *mm)
>> +{
>> + struct vm_area_struct *vma;
>> + int n_vma = 0;
>> +
>> + if (!mm || !atomic_inc_not_zero(&mm->mm_users))
>> + return 0;
>> +
>> + down_read(&mm->mmap_sem);
>> + for (vma = mm->mmap; vma; vma = vma->vm_next, n_vma++)
>> + ;
>> +
>> + up_read(&mm->mmap_sem);
>> + mmput(mm);
>> +
>> + return n_vma;
>> +}
>
> Hmm. How about
>
> int task_vma_num(struct mm_struct *mm)
> {
> return mm->map_count;
> }
>
> ?

makes sense. Thanks for the pointer.

2015-07-15 13:32:54

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 20/24] task_diag: Only add VMAs for thread_group leader

On 07/14, David Ahern wrote:
>
> On 7/14/15 1:47 PM, Oleg Nesterov wrote:
>>>
>>> if (show_flags & TASK_DIAG_SHOW_VMA) {
>>> + /* if the request is to dump all threads of all processes
>>> + * only show VMAs for group leader.
>>> + */
>>> + if (req->dump_strategy == TASK_DIAG_DUMP_ALL_THREAD &&
>>> + !thread_group_leader(tsk))
>>> + goto done;
>>> +
>>
>> This doesn't look right, group leader can be a zombie with ->mm == NULL,
>
> Seriously?

Yes, the main thread can do sys_exit() / pthread_exit(), although this
is not that common.

> I'll find some way to track whether VMAs have been dumped for
> a pid.

In case I confused you, the thread_group_leader() check and "goto done"
above are fine, just you can't trust tsk->mm, so

>>> if (i >= n)
>>> err = fill_vma(tsk, skb, cb, &progress, show_flags);
>>
>> so this probably needs something like find_lock_task_mm().
>
> ok.

Yes, you just need to (try to) find a subthread with ->mm != NULL if !tsk->mm;
find_lock_task_mm() can help.

Oleg.

2015-07-17 15:57:36

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 08/24] proc: pick out a function to iterate task children

On Tue, Jul 14, 2015 at 08:02:35PM +0200, Oleg Nesterov wrote:
> On 07/06, Andrey Vagin wrote:
> >
> > -static struct pid *
> > -get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
> > +static struct task_struct *
> > +task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos)
> > {
>
> I won't really argue, just a question...
>
> So this patch changes it to accept/return a task_struct rather than a pid.
> Why? It is better to get/put "struct pid" only, not the whole task_struct.
>
> If another caller wants a task_struct, the necessary conversion is simple.

Another caller wants a task_struct.

Currently this function receives a pid and converts it into a task_struct,
then gets the next child and returns its pid. So I'm trying to avoid the
extra conversion in the task_diag code.

> But again, I won't argue if you think this will complicate the non-proc
> users of this helper.
>
> Oleg.
>

2015-07-18 21:24:15

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 08/24] proc: pick out a function to iterate task children

On 07/17, Andrew Vagin wrote:
>
> On Tue, Jul 14, 2015 at 08:02:35PM +0200, Oleg Nesterov wrote:
> > On 07/06, Andrey Vagin wrote:
> > >
> > > -static struct pid *
> > > -get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
> > > +static struct task_struct *
> > > +task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos)
> > > {
> >
> > I won't really argue, just a question...
> >
> > So this patch changes it to accept/return a task_struct rather than a pid.
> > Why? It is better to get/put "struct pid" only, not the whole task_struct.
> >
> > If another caller wants a task_struct, the necessary conversion is simple.
>
> Another caller wants task_struct.
>
> Currently this function receives pid and converts it into task_struct, then
> gets the next child and returns its pid.

Exactly because we try to avoid get_task_struct() if possible.

> So I try to avoid extra
> conversion in task_diag code.

Which is simple. And perhaps even iter->task can actually be iter->pid.

But as I said I won't really argue. And just in case, I personally think
this series makes sense, although I can't review the netlink interface.

Oleg.

2015-11-24 15:18:39

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

Hello Everybody,

Sorry for the long delay. I wanted to resurrect this thread.

Andy suggested creating a new syscall instead of using the netlink
interface.
> Would it make more sense to have a new syscall instead?  You could
> even still use nlattr formatting for the syscall results.

I tried to implement it to understand what it looks like. Here is my
version:
https://github.com/avagin/linux-task-diag/blob/task_diag_syscall/kernel/task_diag.c#L665
I could not invent a better interface for it than using netlink
messages as arguments. I know it looks weird.

I can't say that I understood why a new system call is better
than using a netlink socket, so I tried to solve the problems which
were mentioned for the netlink interface.

The major question was how to support pid and user namespaces in task_diag.
I think I found a good and logical solution.

As for pidns, we can use SCM credentials, which are attached to each
socket message. They contain the requestor's pid, and we can get a pid
namespace from it. As a bonus, this lets us specify a pid namespace
without entering it: a user only needs to name any process from that
pidns in an scm message.
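A sketch of that mechanism with a plain AF_UNIX socketpair (made-up names, not the task_diag code): the kernel attaches the sender's verified pid itself, so the receiver could translate it into a pid namespace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one byte across a socketpair and read back the kernel-filled
 * SCM_CREDENTIALS. Returns the pid the kernel attached, or -1. */
static pid_t demo_scm_pid(void)
{
	union {
		char ctl[CMSG_SPACE(sizeof(struct ucred))];
		struct cmsghdr align;
	} u;
	char buf[1];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.ctl, .msg_controllen = sizeof(u.ctl),
	};
	struct cmsghdr *c;
	int sv[2], on = 1;
	pid_t pid = -1;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;
	/* Receiver asks for credentials; must be set before the send. */
	setsockopt(sv[1], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on));

	if (write(sv[0], "x", 1) == 1 && recvmsg(sv[1], &msg, 0) == 1)
		for (c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
			if (c->cmsg_level == SOL_SOCKET &&
			    c->cmsg_type == SCM_CREDENTIALS) {
				struct ucred cr;

				memcpy(&cr, CMSG_DATA(c), sizeof(cr));
				pid = cr.pid;
			}
	close(sv[0]);
	close(sv[1]);
	return pid;
}
```

The point is only that the pid arrives via kernel-verified ancillary data rather than being taken from current at write(2) time; mapping it to a pidns would be a kernel-side step.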

As for credentials, we can get them from file->f_cred. In this case we
can create a socket and then drop permissions of the current process,
and the socket will keep working as before. That's the common behaviour
for file descriptors.

As before, I'm inclined to use the netlink interface for task_diag:

* Netlink is designed for this type of workload. It allows expanding
the interface while preserving backward compatibility, and it allows
generating packets with different sets of parameters.
* If we use a file descriptor, we can create it and then drop
capabilities of the current process. It's a good feature which would be
unavailable if we decided to create a system call.
* taskstats is a bad example, because a few problems were never solved
in it.

I’m going to send the next version of the task_diag patches in a few
days. Any comments are welcome.

Here is the git repo with the current version:
https://github.com/avagin/linux-task-diag/commits/master

Thanks,
Andrew

On Mon, Jul 06, 2015 at 11:47:01AM +0300, Andrey Vagin wrote:
> Currently we use the proc file system, where all information is
> presented in text files, which is convenient for humans. But if we need
> to get information about processes from code (e.g. in C), the procfs
> doesn't look so cool.
>
> From code we would prefer to get information in binary format and to be
> able to specify which information and for which tasks are required. Here
> is a new interface with all these features, which is called task_diag.
> In addition it's much faster than procfs.
>
> task_diag is based on netlink sockets and looks like socket-diag, which
> is used to get information about sockets.
>
> A request is described by the task_diag_pid structure:
>
> struct task_diag_pid {
> __u64 show_flags; /* specify which information are required */
> __u64 dump_strategy; /* specify a group of processes */
>
> __u32 pid;
> };
>
> dump_strategy specifies a group of processes:
> /* per-process strategies */
> TASK_DIAG_DUMP_CHILDREN - all children
> TASK_DIAG_DUMP_THREAD - all threads
> TASK_DIAG_DUMP_ONE - one process
> /* system wide strategies (the pid field is ignored) */
> TASK_DIAG_DUMP_ALL - all processes
> TASK_DIAG_DUMP_ALL_THREAD - all threads
>
> show_flags specifies which information is required.
> If we set the TASK_DIAG_SHOW_BASE flag, the response message will
> contain the TASK_DIAG_BASE attribute which is described by the
> task_diag_base structure.
>
> struct task_diag_base {
> __u32 tgid;
> __u32 pid;
> __u32 ppid;
> __u32 tpid;
> __u32 sid;
> __u32 pgid;
> __u8 state;
> char comm[TASK_DIAG_COMM_LEN];
> };
>
> In the future, it can be extended with optional attributes. The request
> describes which task properties are required and for which processes
> they are required.
>
> A response can be divided into a few netlink packets if the NETLINK_DUMP
> has been set in a request. Each task is described by a message. Each
> message contains the TASK_DIAG_PID attribute and optional attributes
> which have been requested (show_flags). A message can be divided into a
> few parts if it doesn't fit into the current netlink packet. In this case,
> the first message in the next packet contains the same PID and the
> attributes which didn't fit into the previous message.
>
> task_diag is much faster than the proc file system: we don't need to
> create a new file descriptor for each task, we just send a request
> and get a response. That allows getting information about several
> tasks in one request-response iteration.
>
> As for security, task_diag always works as procfs with hidepid = 2 (highest
> level of security).
>
> I have compared performance of procfs and task-diag for the
> "ps ax -o pid,ppid" command.
>
> A test stand contains 30108 processes.
> $ ps ax -o pid,ppid | wc -l
> 30108
>
> $ time ps ax -o pid,ppid > /dev/null
>
> real 0m0.836s
> user 0m0.238s
> sys 0m0.583s
>
> Read /proc/PID/stat for each task
> $ time ./task_proc_all > /dev/null
>
> real 0m0.258s
> user 0m0.019s
> sys 0m0.232s
>
> $ time ./task_diag_all > /dev/null
>
> real 0m0.052s
> user 0m0.013s
> sys 0m0.036s
>
> And here are statistics on syscalls which were called by each
> command.
>
> $ perf trace -s -o log -- ./task_proc_all > /dev/null
>
> Summary of events:
>
> task_proc_all (30781), 180785 events, 100.0%, 0.000 msec
>
> syscall calls min avg max stddev
> (msec) (msec) (msec) (%)
> --------------- -------- --------- --------- --------- ------
> read 30111 0.000 0.013 0.107 0.21%
> write 1 0.008 0.008 0.008 0.00%
> open 30111 0.007 0.012 0.145 0.24%
> close 30112 0.004 0.011 0.110 0.20%
> fstat 3 0.009 0.013 0.016 16.15%
> mmap 8 0.011 0.020 0.027 11.24%
> mprotect 4 0.019 0.023 0.028 8.33%
> munmap 1 0.026 0.026 0.026 0.00%
> brk 8 0.007 0.015 0.024 11.94%
> ioctl 1 0.007 0.007 0.007 0.00%
> access 1 0.019 0.019 0.019 0.00%
> execve 1 0.000 0.000 0.000 0.00%
> getdents 29 0.008 1.010 2.215 8.88%
> arch_prctl 1 0.016 0.016 0.016 0.00%
> openat 1 0.021 0.021 0.021 0.00%
>
>
> $ perf trace -s -o log -- ./task_diag_all > /dev/null
> Summary of events:
>
> task_diag_all (30762), 717 events, 98.9%, 0.000 msec
>
> syscall calls min avg max stddev
> (msec) (msec) (msec) (%)
> --------------- -------- --------- --------- --------- ------
> read 2 0.000 0.008 0.016 100.00%
> write 197 0.008 0.019 0.041 3.00%
> open 2 0.023 0.029 0.036 22.45%
> close 3 0.010 0.012 0.014 11.34%
> fstat 3 0.012 0.044 0.106 70.52%
> mmap 8 0.014 0.031 0.054 18.88%
> mprotect 4 0.016 0.023 0.027 10.93%
> munmap 1 0.022 0.022 0.022 0.00%
> brk 1 0.040 0.040 0.040 0.00%
> ioctl 1 0.011 0.011 0.011 0.00%
> access 1 0.032 0.032 0.032 0.00%
> getpid 1 0.012 0.012 0.012 0.00%
> socket 1 0.032 0.032 0.032 0.00%
> sendto 2 0.032 0.095 0.157 65.77%
> recvfrom 129 0.009 0.235 0.418 2.45%
> bind 1 0.018 0.018 0.018 0.00%
> execve 1 0.000 0.000 0.000 0.00%
> arch_prctl 1 0.012 0.012 0.012 0.00%
>
> You can find the test program from this experiment in tools/test/selftest/taskdiag.
>
> The idea of this functionality was suggested by Pavel Emelyanov
> (xemul@), when he found that operations with /proc forms a significant
> part of a checkpointing time.
>
> Ten years ago there was attempt to add a netlink interface to access to /proc
> information:
> http://lwn.net/Articles/99600/
>
> git repo: https://github.com/avagin/linux-task-diag
>
> Changes from the first version:
>
> David Ahern implemented all required functionality to use task_diag in
> perf.
>
> Below you can find his results showing how it affects performance.
> > Using the fork test command:
> > 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> > reading /proc: 11.3 sec
> > task_diag: 2.2 sec
> >
> > @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
> >
> > 128 instances of SPECjbb, 80,000+ tasks:
> > reading /proc: 32.1 sec
> > task_diag: 3.9 sec
> >
> > So overall much snappier startup times.
>
> Many thanks to David Ahern for the help with improving task_diag.
>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Cyrill Gorcunov <[email protected]>
> Cc: Pavel Emelyanov <[email protected]>
> Cc: Roger Luethi <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: David Ahern <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Pavel Odintsov <[email protected]>
> Signed-off-by: Andrey Vagin <[email protected]>
> --
> 2.1.0
>

2015-12-03 23:20:53

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Tue, Nov 24, 2015 at 7:18 AM, Andrew Vagin <[email protected]> wrote:
> Hello Everybody,
>
> Sorry for the long delay. I wanted to resurrect this thread.
>
> Andy suggested to create a new syscall instead of using netlink
> interface.
>> Would it make more sense to have a new syscall instead? You could
>> even still use nlattr formatting for the syscall results.
>
> I tried to implement it to understand how it looks like. Here is my
> version:
> https://github.com/avagin/linux-task-diag/blob/task_diag_syscall/kernel/task_diag.c#L665
> I could not invent a better interfaces for it than using netlink
> messages as arguments. I know it looks weird.
>
> I could not say that I understood why a new system call is better
> than using a netlink socket, so I tried to solve the problem which
> were mentioned for the netlink interface.
>
> The major question was how to support pid and user namespaces in task_diag.
> I think I found a good and logical solution.
>
> As for pidns, we can use SCM credentials, which are attached to each
> socket message. They contain the requestor's pid, and we can get a pid
> namespace from it. As a bonus, this lets us specify a pid namespace
> without entering it: a user only needs to name any process from that
> pidns in an scm message.

That seems a little messy. A process can't currently move into
another pidns, but how do you make sure you have any pid at all that
belongs to the reference pidns? You can, of course, always use your
own pid, but that still seems odd to me.

>
> As for credentials, we can get them from file->f_cred. In this case we
> are able to create a socket and decrease permissions of the current
> process, but the socket will work as before. It’s the common behaviour for
> file descriptors.

Slightly off-topic, but this netlink is really rather bad as an
example of how fds can be used as capabilities (in the real capability
sense, not the Linux capabilities sense). You call socket and get a
socket. That socket captures f_cred. Then you drop privs, and you
assume that the socket you're holding on to retains the right to do
certain things.

This breaks pretty badly when, through things such as this patch set,
existing code that creates netlink sockets suddenly starts capturing
brand-new rights that didn't exist as part of a netlink socket before.

From my perspective, netlink is a lot like ioctl, only without the
meaningful fd that you're calling it on. So why is it better than
syscalls? I'll grant that it comes with nice (?) buffering machinery.

> * Netlink is designed for this type of workload. It allows us to expand
> the interface while preserving backward compatibility. It allows us to
> generate packets with different sets of parameters.
> * If we use a file descriptor, we can create it and then decrease the
> capabilities of the current process. It's a good feature which would be
> unavailable if we decide to create a system call.
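The backward-compatibility claim in the first bullet rests on netlink's TLV attribute encoding: receivers walk length-prefixed attributes and simply skip types they do not understand. A self-contained sketch of that property (not the actual task_diag wire format; attribute numbers are invented):

```c
#include <linux/netlink.h>
#include <stdint.h>
#include <string.h>

/* Append one nlattr TLV at offset off, return the new (aligned) offset. */
static int put_attr(char *buf, int off, uint16_t type,
                    const void *data, uint16_t len)
{
    struct nlattr *nla = (struct nlattr *)(buf + off);

    nla->nla_type = type;
    nla->nla_len = NLA_HDRLEN + len;
    memcpy((char *)nla + NLA_HDRLEN, data, len);
    return off + NLA_ALIGN(nla->nla_len);
}

/* Build a message with a known attribute (type 1) and a "future"
 * attribute (type 99), then parse it as an old consumer would:
 * unknown types are skipped, not treated as errors. */
uint32_t parse_demo(void)
{
    _Alignas(4) char buf[128];
    int len = 0;
    uint32_t pid = 42;
    uint64_t future = 0xdeadbeef;   /* pretend newly-added attribute */

    len = put_attr(buf, len, 1, &pid, sizeof(pid));
    len = put_attr(buf, len, 99, &future, sizeof(future));

    uint32_t found = 0;
    int off = 0;
    while (off + NLA_HDRLEN <= len) {
        struct nlattr *nla = (struct nlattr *)(buf + off);

        if (nla->nla_type == 1)
            memcpy(&found, (char *)nla + NLA_HDRLEN, sizeof(found));
        /* type 99 falls through here: silently skipped */
        off += NLA_ALIGN(nla->nla_len);
    }
    return found;
}
```

New kernels can append attributes without breaking old userspace, which is the extensibility argument being made for task_diag.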

If this is actually a real goal and it matters, then I'd suggest doing
it right. Make a way to create an fd that represents a pidns and,
specifically, the right to query non-secret properties of the
processes in the pidns.

--Andy

2015-12-03 23:43:56

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Thursday 03 December 2015 15:20:30 Andy Lutomirski wrote:
> > * Netlink is designed for this type of workload. It allows us to expand
> > the interface while preserving backward compatibility. It allows us to
> > generate packets with different sets of parameters.
> > * If we use a file descriptor, we can create it and then decrease the
> > capabilities of the current process. It's a good feature which would be
> > unavailable if we decide to create a system call.
>
> If this is actually a real goal and it matters, then I'd suggest doing
> it right. Make a way to create an fd that represents a pidns and,
> specifically, the right to query non-secret properties of the
> processes in the pidns.

My first thought about doing an interface here was to create a virtual
file system that can be queried rather than using netlink, but then I
realized that the idea was to avoid procfs ;-)

More seriously, maybe the answer is to have a transaction file in
procfs itself. Procfs already knows about namespaces, so adding
a /proc/task-diag file as the entry point into the kernel could
get that out of the way.

The simple_transaction infrastructure that we have is limited to
a little under a page for the total data size, but something similar
could be used.

Arnd

2015-12-14 07:52:50

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Thu, Dec 03, 2015 at 03:20:30PM -0800, Andy Lutomirski wrote:
> On Tue, Nov 24, 2015 at 7:18 AM, Andrew Vagin <[email protected]> wrote:
> > Hello Everybody,
> >
> > Sorry for the long delay. I wanted to resurrect this thread.
> >
> > Andy suggested to create a new syscall instead of using netlink
> > interface.
> >> Would it make more sense to have a new syscall instead? You could
> >> even still use nlattr formatting for the syscall results.
> >
> > I tried to implement it to understand what it looks like. Here is my
> > version:
> > https://github.com/avagin/linux-task-diag/blob/task_diag_syscall/kernel/task_diag.c#L665
> > I could not invent a better interface for it than using netlink
> > messages as arguments. I know it looks weird.
> >
> > I could not say that I understood why a new system call is better
> > than using a netlink socket, so I tried to solve the problems which
> > were mentioned for the netlink interface.
> >
> > The major question was how to support pid and user namespaces in task_diag.
> > I think I found a good and logical solution.
> >
> > As for pidns, we can use scm credentials, which are attached to each
> > socket message. They contain the requestor's pid, and we can get a pid
> > namespace from it. This gives us a nice feature: specifying a pid
> > namespace without entering it. For that, a user needs to specify
> > any process from that pidns in an scm message.
>
> That seems a little messy. A process can't currently move into
> another pidns, but how do you make sure you have any pid at all that
> belongs to the reference pidns? You can, of course, always use your
> own pid, but that still seems odd to me.

There is your pid by default; you need to do nothing for that.
If we look at containers or sandboxes, we usually know the PID of
the init process.


>
> >
> > As for credentials, we can get them from file->f_cred. In this case we
> > are able to create a socket and decrease permissions of the current
> > process, but the socket will work as before. It’s the common behaviour for
> > file descriptors.
>
> Slightly off-topic, but this netlink is really rather bad as an
> example of how fds can be used as capabilities (in the real capability
> sense, not the Linux capabilities sense). You call socket and get a
> socket. That socket captures f_cred. Then you drop privs, and you
> assume that the socket you're holding on to retains the right to do
> certain things.
>
> This breaks pretty badly when, through things such as this patch set,
> existing code that creates netlink sockets suddenly starts capturing
> brand-new rights that didn't exist as part of a netlink socket before.

Sorry, I don't understand this part. Could you elaborate? Maybe give an
example.

I always thought that it's a feature that we can create a descriptor and
drop capabilities of the process, or send this descriptor to an
unprivileged process.

I think this part may be critical for understanding your point of view.
Thanks.

>
> From my perspective, netlink is a lot like ioctl, only without the
> meaningful fd that you're calling it on. So why is it better than
> syscalls? I'll grant that it comes with nice (?) buffering machinery.

The nice buffering machinery is one of the reasons. Another reason is
that netlink is a ready-made interface for this sort of task, so we can
avoid code duplication.

>
> > * Netlink is designed for this type of workload. It allows us to expand
> > the interface while preserving backward compatibility. It allows us to
> > generate packets with different sets of parameters.
> > * If we use a file descriptor, we can create it and then decrease the
> > capabilities of the current process. It's a good feature which would be
> > unavailable if we decide to create a system call.
>
> If this is actually a real goal and it matters, then I'd suggest doing
> it right. Make a way to create an fd that represents a pidns and,
> specifically, the right to query non-secret properties of the
> processes in the pidns.

I will think. Thank you for comments.

>
> --Andy

2015-12-14 08:05:57

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Fri, Dec 04, 2015 at 12:43:29AM +0100, Arnd Bergmann wrote:
> On Thursday 03 December 2015 15:20:30 Andy Lutomirski wrote:
> > > * Netlink is designed for this type of workload. It allows us to expand
> > > the interface while preserving backward compatibility. It allows us to
> > > generate packets with different sets of parameters.
> > > * If we use a file descriptor, we can create it and then decrease the
> > > capabilities of the current process. It's a good feature which would be
> > > unavailable if we decide to create a system call.
> >
> > If this is actually a real goal and it matters, then I'd suggest doing
> > it right. Make a way to create an fd that represents a pidns and,
> > specifically, the right to query non-secret properties of the
> > processes in the pidns.
>
> My first thought about doing an interface here was to create a virtual
> file system that can be queried rather than using netlink, but then I
> realized that the idea was to avoid procfs ;-)

No, the idea is not to avoid procfs. The idea is to
create a new interface for getting information about tasks, which will
work faster and be more convenient to use from applications.

>
> More seriously, maybe the answer is to have a transaction file in
> procfs itself. Procfs already knows about namespaces, so adding
> a /proc/task-diag file as the entry point into the kernel could
> get that out of the way.
>
> The simple_transaction infrastructure that we have is limited to
> a little under a page for the total data size, but something similar
> could be used.

Thank you for the idea.
>
> Arnd

2015-12-14 22:38:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Dec 13, 2015 11:52 PM, "Andrew Vagin" <[email protected]> wrote:
> > Slightly off-topic, but this netlink is really rather bad as an
> > example of how fds can be used as capabilities (in the real capability
> > sense, not the Linux capabilities sense). You call socket and get a
> > socket. That socket captures f_cred. Then you drop privs, and you
> > assume that the socket you're holding on to retains the right to do
> > certain things.
> >
> > This breaks pretty badly when, through things such as this patch set,
> > existing code that creates netlink sockets suddenly starts capturing
> > brand-new rights that didn't exist as part of a netlink socket before.
>
> Sorry, I don't understand this part. Could you elaborate? Maybe give an
> example.
>
> I always thought that it's a feature that we can create a descriptor and
> drop capabilities of the process, or send this descriptor to an
> unprivileged process.

Suppose there's an existing program that likes this feature. It
creates a netlink socket, optionally calls connect(2), and then drops
privileges. It expects to retain some subset of its privileges.

The problem is that by increasing the power of a netlink socket
created with higher-than-current privilege, you've just increased the
privilege retained by the old app. In this particular case, it's
especially odd because it retains privilege over the old pidns,
whereas the old program (in theory -- probably no one does this) could
have created a netlink socket, unshared pidns, and forked, and it
would have expected to retain no privilege over the old pidns.

--Andy

2015-12-15 15:54:17

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Mon, Dec 14, 2015 at 02:38:06PM -0800, Andy Lutomirski wrote:
> Suppose there's an existing program that likes this feature. It
> creates a netlink socket, optionally calls connect(2), and then drops
> privileges. It expects to retain some subset of its privileges.
>
> The problem is that by increasing the power of a netlink socket
> created with higher-than-current privilege, you've just increased the
> privilege retained by the old app. In this particular case, it's
> especially odd because it retains privilege over the old pidns,
> whereas the old program (in theory -- probably no one does this) could
> have created a netlink socket, unshared pidns, and forked, and it
> would have expected to retain no privilege over the old pidns.

Thank you for the explanation. If I understand you correctly, the
problem is that an arbitrary netlink socket can be used to access task_diag.

That may be a reason not to use the netlink interface for task_diag.

What do you think about the idea of adding a transaction file to
procfs? We would open it, send a request, and get the required information.

I want to have a file descriptor to transfer data between kernel and
userspace, because the size of a response can be too big to receive in
one call. With a file descriptor, we can split a response into
parts.
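The splitting that a descriptor makes possible is just repeated read(2) calls over the same fd. A generic sketch with made-up sizes, where a regular file stands in for the proposed transaction file:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Produce a 10000-byte fake "response" behind an fd, then consume it
 * in bounded parts the way a transaction-file reader would.
 * Returns the total number of bytes read, or -1 on failure. */
long read_in_parts(void)
{
    char path[] = "/tmp/resp-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    /* Pretend the kernel wrote a response larger than one buffer. */
    char chunk[1000];
    memset(chunk, 'x', sizeof(chunk));
    for (int i = 0; i < 10; i++)
        if (write(fd, chunk, sizeof(chunk)) != sizeof(chunk))
            return -1;
    lseek(fd, 0, SEEK_SET);

    /* Consume it piecewise; each read() carries the next part. */
    char buf[4096];
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;
    close(fd);
    return total;
}
```

The per-call buffer bounds memory use while the fd preserves the position between calls, which is the property a single one-shot syscall cannot offer.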

Thanks,
Andrew

> --Andy

2015-12-15 16:44:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)

On Tue, Dec 15, 2015 at 7:53 AM, Andrew Vagin <[email protected]> wrote:
>
> Thank you for the explanation. If I understand you correctly, the
> problem is that an arbitrary netlink socket can be used to access task_diag.
>
> That may be a reason not to use the netlink interface for task_diag.

Agreed. FWIW, it's not the end of the world -- netlink is probably
already so leaky in this respect that there's no real security loss,
although the extra namespace capture (pid instead of net) makes me a
bit nervous.

I suppose we could add an ioctl to netlink that says "enable pidns
access" and that has to be called while still privileged. (/me
ducks).

>
> What do you think about the idea of adding a transaction file to
> procfs? We would open it, send a request, and get the required information.
>
> I want to have a file descriptor to transfer data between kernel and
> userspace, because the size of a response can be too big to receive in
> one call. With a file descriptor, we can split a response into
> parts.

I think I'm okay with that.

--Andy