2015-02-17 08:39:42

by Andrei Vagin

Subject: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

Here is a preview version. It provides a restricted set of functionality.
I would like to collect feedback on this idea.

Currently we use the proc file system, where all information is
presented in text files, which is convenient for humans. But if we need
to get information about processes from code (e.g. in C), procfs is
not so convenient.

From code we would prefer to get information in a binary format and to be
able to specify which information is required and for which tasks. Here
is a new interface with all these features, which is called task_diag.
In addition, it's much faster than procfs.

task_diag is based on netlink sockets and looks like socket-diag, which
is used to get information about sockets.

A request is described by the task_diag_pid structure:

struct task_diag_pid {
__u64 show_flags; /* specify which information is required */
__u64 dump_stratagy; /* specify a group of processes */

__u32 pid;
};

A response is a set of netlink messages, each describing one task.
All task properties are divided into groups. A message contains the
TASK_DIAG_MSG group, plus any other groups that have been requested in
show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, the
response will contain the TASK_DIAG_CRED group, which is described by the
task_diag_creds structure.

struct task_diag_msg {
__u32 tgid;
__u32 pid;
__u32 ppid;
__u32 tpid;
__u32 sid;
__u32 pgid;
__u8 state;
char comm[TASK_DIAG_COMM_LEN];
};

Another good feature of task_diag is the ability to request information
for a group of processes. Currently there are two strategies:
TASK_DIAG_DUMP_ALL - get information for all tasks
TASK_DIAG_DUMP_CHILDREN - get information for the children of a specified
task

task_diag is much faster than the proc file system. We don't need to
create a new file descriptor for each task; we only send a request and
receive a response, which allows information about several tasks to be
retrieved in one request-response iteration.
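
For illustration, here is a minimal sketch (not part of the series) of
building such a request by hand; it assumes the generic netlink family
id has already been resolved, as get_family_id() does in the selftest
from the last patch:

#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>
#include "taskdiag.h"

/* buf must hold NLMSG_LENGTH(GENL_HDRLEN) + NLA_HDRLEN +
 * sizeof(struct task_diag_pid) bytes; returns the message length. */
static int build_dump_request(char *buf, __u16 family_id)
{
	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
	struct genlmsghdr *genl = NLMSG_DATA(nlh);
	struct nlattr *na;
	struct task_diag_pid req = {
		.show_flags    = TASK_DIAG_SHOW_CRED,	/* optional groups */
		.dump_stratagy = TASK_DIAG_DUMP_ALL,	/* all tasks */
		.pid           = 0,			/* unused for DUMP_ALL */
	};

	nlh->nlmsg_len   = NLMSG_LENGTH(GENL_HDRLEN);
	nlh->nlmsg_type  = family_id;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	genl->cmd        = TASKDIAG_CMD_GET;
	genl->version    = TASKDIAG_GENL_VERSION;

	/* one attribute carrying the request structure */
	na = (struct nlattr *)((char *)genl + GENL_HDRLEN);
	na->nla_type = TASKDIAG_CMD_ATTR_GET;
	na->nla_len  = NLA_HDRLEN + sizeof(req);
	memcpy((char *)na + NLA_HDRLEN, &req, sizeof(req));

	nlh->nlmsg_len += NLMSG_ALIGN(na->nla_len);
	return nlh->nlmsg_len;
}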

I have compared the performance of procfs and task_diag for the
"ps ax -o pid,ppid" command.

The test machine runs 10348 processes.
$ ps ax -o pid,ppid | wc -l
10348

$ time ps ax -o pid,ppid > /dev/null

real 0m1.073s
user 0m0.086s
sys 0m0.903s

$ time ./task_diag_all > /dev/null

real 0m0.037s
user 0m0.004s
sys 0m0.020s

And here are statistics on the syscalls called by each
command.
$ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
20,713 syscalls:sys_exit_open
20,710 syscalls:sys_exit_close
20,708 syscalls:sys_exit_read
10,348 syscalls:sys_exit_newstat
31 syscalls:sys_exit_write

$ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
114 syscalls:sys_exit_recvfrom
49 syscalls:sys_exit_write
8 syscalls:sys_exit_mmap
4 syscalls:sys_exit_mprotect
3 syscalls:sys_exit_newfstat

You can find the test program used in this experiment in the last patch.

The idea of this functionality was suggested by Pavel Emelyanov
(xemul@), when he found that operations on /proc form a significant
part of checkpointing time.

Ten years ago there was an attempt to add a netlink interface to access
/proc information:
http://lwn.net/Articles/99600/

Signed-off-by: Andrey Vagin <[email protected]>

git repo: https://github.com/avagin/linux-task-diag

Andrey Vagin (7):
[RFC] kernel: add a netlink interface to get information about tasks
kernel: move next_tgid from fs/proc
task-diag: add ability to get information about all tasks
task-diag: add a new group to get process credentials
kernel: add ability to iterate children of a specified task
task_diag: add ability to dump children
selftest: check the task_diag functionality

fs/proc/array.c | 58 +---
fs/proc/base.c | 43 ---
include/linux/proc_fs.h | 13 +
include/uapi/linux/taskdiag.h | 89 ++++++
init/Kconfig | 12 +
kernel/Makefile | 1 +
kernel/pid.c | 94 ++++++
kernel/taskdiag.c | 343 +++++++++++++++++++++
tools/testing/selftests/task_diag/Makefile | 16 +
tools/testing/selftests/task_diag/task_diag.c | 59 ++++
tools/testing/selftests/task_diag/task_diag_all.c | 82 +++++
tools/testing/selftests/task_diag/task_diag_comm.c | 195 ++++++++++++
tools/testing/selftests/task_diag/task_diag_comm.h | 47 +++
tools/testing/selftests/task_diag/taskdiag.h | 1 +
14 files changed, 967 insertions(+), 86 deletions(-)
create mode 100644 include/uapi/linux/taskdiag.h
create mode 100644 kernel/taskdiag.c
create mode 100644 tools/testing/selftests/task_diag/Makefile
create mode 100644 tools/testing/selftests/task_diag/task_diag.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_all.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.h
create mode 120000 tools/testing/selftests/task_diag/taskdiag.h

Cc: Oleg Nesterov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Roger Luethi <[email protected]>
--
2.1.0


2015-02-17 08:39:47

by Andrei Vagin

Subject: [PATCH 1/7] kernel: add a netlink interface to get information about tasks

task_diag is based on netlink sockets and looks like socket-diag, which
is used to get information about sockets.

task_diag is a new interface which is going to replace the proc file
system in cases where we need to get information in a binary format.

A request message is described by the task_diag_pid structure:
struct task_diag_pid {
__u64 show_flags;
__u64 dump_stratagy;

__u32 pid;
};

A response is a set of netlink messages, each describing one task.
All task properties are divided into groups. A message contains the
TASK_DIAG_MSG group, plus any other groups that have been requested in
show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, the
response will contain the TASK_DIAG_CRED group, which is described by the
task_diag_creds structure.

struct task_diag_msg {
__u32 tgid;
__u32 pid;
__u32 ppid;
__u32 tpid;
__u32 sid;
__u32 pgid;
__u8 state;
char comm[TASK_DIAG_COMM_LEN];
};

The dump_stratagy field will be used in the following patches to request
information about a group of processes.

Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/taskdiag.h | 64 +++++++++++++++
init/Kconfig | 12 +++
kernel/Makefile | 1 +
kernel/taskdiag.c | 179 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 256 insertions(+)
create mode 100644 include/uapi/linux/taskdiag.h
create mode 100644 kernel/taskdiag.c

diff --git a/include/uapi/linux/taskdiag.h b/include/uapi/linux/taskdiag.h
new file mode 100644
index 0000000..e1feb35
--- /dev/null
+++ b/include/uapi/linux/taskdiag.h
@@ -0,0 +1,64 @@
+#ifndef _LINUX_TASKDIAG_H
+#define _LINUX_TASKDIAG_H
+
+#include <linux/types.h>
+#include <linux/capability.h>
+
+#define TASKDIAG_GENL_NAME "TASKDIAG"
+#define TASKDIAG_GENL_VERSION 0x1
+
+enum {
+ /* optional attributes which can be specified in show_flags */
+
+ /* other attributes */
+ TASK_DIAG_MSG = 64,
+};
+
+enum {
+ TASK_DIAG_RUNNING,
+ TASK_DIAG_INTERRUPTIBLE,
+ TASK_DIAG_UNINTERRUPTIBLE,
+ TASK_DIAG_STOPPED,
+ TASK_DIAG_TRACE_STOP,
+ TASK_DIAG_DEAD,
+ TASK_DIAG_ZOMBIE,
+};
+
+#define TASK_DIAG_COMM_LEN 16
+
+struct task_diag_msg {
+ __u32 tgid;
+ __u32 pid;
+ __u32 ppid;
+ __u32 tpid;
+ __u32 sid;
+ __u32 pgid;
+ __u8 state;
+ char comm[TASK_DIAG_COMM_LEN];
+};
+
+enum {
+ TASKDIAG_CMD_UNSPEC = 0, /* Reserved */
+ TASKDIAG_CMD_GET,
+ __TASKDIAG_CMD_MAX,
+};
+#define TASKDIAG_CMD_MAX (__TASKDIAG_CMD_MAX - 1)
+
+#define TASK_DIAG_DUMP_ALL 0
+
+struct task_diag_pid {
+ __u64 show_flags;
+ __u64 dump_stratagy;
+
+ __u32 pid;
+};
+
+enum {
+ TASKDIAG_CMD_ATTR_UNSPEC = 0,
+ TASKDIAG_CMD_ATTR_GET,
+ __TASKDIAG_CMD_ATTR_MAX,
+};
+
+#define TASKDIAG_CMD_ATTR_MAX (__TASKDIAG_CMD_ATTR_MAX - 1)
+
+#endif /* _LINUX_TASKDIAG_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9afb971..e959ae3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -430,6 +430,18 @@ config TASKSTATS

Say N if unsure.

+config TASK_DIAG
+ bool "Export task/process properties through netlink"
+ depends on NET
+ default n
+ help
+ Export selected properties for tasks/processes through the
+ generic netlink interface. Unlike the proc file system, task_diag
+ returns information in a binary format and allows the caller to
+ specify which information is required.
+
+ Say N if unsure.
+
config TASK_DELAY_ACCT
bool "Enable per-task delay accounting"
depends on TASKSTATS
diff --git a/kernel/Makefile b/kernel/Makefile
index a59481a..2d4fc71 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -95,6 +95,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_TASK_DIAG) += taskdiag.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
new file mode 100644
index 0000000..5faf3f0
--- /dev/null
+++ b/kernel/taskdiag.c
@@ -0,0 +1,179 @@
+#include <uapi/linux/taskdiag.h>
+#include <net/genetlink.h>
+#include <linux/pid_namespace.h>
+#include <linux/ptrace.h>
+#include <linux/proc_fs.h>
+#include <linux/sched.h>
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKDIAG_GENL_NAME,
+ .version = TASKDIAG_GENL_VERSION,
+ .maxattr = TASKDIAG_CMD_ATTR_MAX,
+ .netnsok = true,
+};
+
+static size_t taskdiag_packet_size(u64 show_flags)
+{
+ return nla_total_size(sizeof(struct task_diag_msg));
+}
+
+/*
+ * The task state array is a strange "bitmap" of
+ * reasons to sleep. Thus "running" is zero, and
+ * you can test for combinations of others with
+ * simple bit tests.
+ */
+static const __u8 task_state_array[] = {
+ TASK_DIAG_RUNNING,
+ TASK_DIAG_INTERRUPTIBLE,
+ TASK_DIAG_UNINTERRUPTIBLE,
+ TASK_DIAG_STOPPED,
+ TASK_DIAG_TRACE_STOP,
+ TASK_DIAG_DEAD,
+ TASK_DIAG_ZOMBIE,
+};
+
+static inline __u8 get_task_state(struct task_struct *tsk)
+{
+ unsigned int state = (tsk->state | tsk->exit_state) & TASK_REPORT;
+
+ BUILD_BUG_ON(1 + ilog2(TASK_REPORT) != ARRAY_SIZE(task_state_array)-1);
+
+ return task_state_array[fls(state)];
+}
+
+static int fill_task_msg(struct task_struct *p, struct sk_buff *skb)
+{
+ struct pid_namespace *ns = task_active_pid_ns(current);
+ struct task_diag_msg *msg;
+ struct nlattr *attr;
+ char tcomm[sizeof(p->comm)];
+ struct task_struct *tracer;
+
+ attr = nla_reserve(skb, TASK_DIAG_MSG, sizeof(struct task_diag_msg));
+ if (!attr)
+ return -EMSGSIZE;
+
+ msg = nla_data(attr);
+
+ rcu_read_lock();
+ msg->ppid = pid_alive(p) ?
+ task_tgid_nr_ns(rcu_dereference(p->real_parent), ns) : 0;
+
+ msg->tpid = 0;
+ tracer = ptrace_parent(p);
+ if (tracer)
+ msg->tpid = task_pid_nr_ns(tracer, ns);
+
+ msg->tgid = task_tgid_nr_ns(p, ns);
+ msg->pid = task_pid_nr_ns(p, ns);
+ msg->sid = task_session_nr_ns(p, ns);
+ msg->pgid = task_pgrp_nr_ns(p, ns);
+
+ rcu_read_unlock();
+
+ get_task_comm(tcomm, p);
+ memset(msg->comm, 0, TASK_DIAG_COMM_LEN);
+ strncpy(msg->comm, tcomm, TASK_DIAG_COMM_LEN);
+
+ msg->state = get_task_state(p);
+
+ return 0;
+}
+
+static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
+ u64 show_flags, u32 portid, u32 seq)
+{
+ void *reply;
+ int err;
+
+ reply = genlmsg_put(skb, portid, seq, &family, 0, TASKDIAG_CMD_GET);
+ if (reply == NULL)
+ return -EMSGSIZE;
+
+ err = fill_task_msg(tsk, skb);
+ if (err)
+ goto err;
+
+ return genlmsg_end(skb, reply);
+err:
+ genlmsg_cancel(skb, reply);
+ return err;
+}
+
+static int taskdiag_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ struct task_struct *tsk = NULL;
+ struct task_diag_pid *req;
+ struct sk_buff *msg;
+ size_t size;
+ int rc;
+
+ req = nla_data(info->attrs[TASKDIAG_CMD_ATTR_GET]);
+ if (req == NULL)
+ return -EINVAL;
+
+ if (nla_len(info->attrs[TASKDIAG_CMD_ATTR_GET]) < sizeof(*req))
+ return -EINVAL;
+
+ size = taskdiag_packet_size(req->show_flags);
+ msg = genlmsg_new(size, GFP_KERNEL);
+ if (!msg)
+ return -ENOMEM;
+
+ rcu_read_lock();
+ tsk = find_task_by_vpid(req->pid);
+ if (tsk)
+ get_task_struct(tsk);
+ rcu_read_unlock();
+ if (!tsk) {
+ rc = -ESRCH;
+ goto err;
+ }
+
+ if (!ptrace_may_access(tsk, PTRACE_MODE_READ)) {
+ put_task_struct(tsk);
+ rc = -EPERM;
+ goto err;
+ }
+
+ rc = task_diag_fill(tsk, msg, req->show_flags,
+ info->snd_portid, info->snd_seq);
+ put_task_struct(tsk);
+ if (rc < 0)
+ goto err;
+
+ return genlmsg_reply(msg, info);
+err:
+ nlmsg_free(msg);
+ return rc;
+}
+
+static const struct nla_policy
+ taskstats_cmd_get_policy[TASKDIAG_CMD_ATTR_MAX+1] = {
+ [TASKDIAG_CMD_ATTR_GET] = { .type = NLA_UNSPEC,
+ .len = sizeof(struct task_diag_pid)
+ },
+};
+
+static const struct genl_ops taskdiag_ops[] = {
+ {
+ .cmd = TASKDIAG_CMD_GET,
+ .doit = taskdiag_doit,
+ .policy = taskstats_cmd_get_policy,
+ },
+};
+
+static int __init taskdiag_init(void)
+{
+ int rc;
+
+ rc = genl_register_family_with_ops(&family, taskdiag_ops);
+ if (rc)
+ return rc;
+
+ return 0;
+}
+
+late_initcall(taskdiag_init);
--
2.1.0

2015-02-17 08:41:07

by Andrei Vagin

Subject: [PATCH 2/7] kernel: move next_tgid from fs/proc

This function will be used in task_diag.

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/base.c | 43 -------------------------------------------
include/linux/proc_fs.h | 7 +++++++
kernel/pid.c | 39 +++++++++++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7ae..24ed43d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2795,49 +2795,6 @@ out:
return ERR_PTR(result);
}

-/*
- * Find the first task with tgid >= tgid
- *
- */
-struct tgid_iter {
- unsigned int tgid;
- struct task_struct *task;
-};
-static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
-{
- struct pid *pid;
-
- if (iter.task)
- put_task_struct(iter.task);
- rcu_read_lock();
-retry:
- iter.task = NULL;
- pid = find_ge_pid(iter.tgid, ns);
- if (pid) {
- iter.tgid = pid_nr_ns(pid, ns);
- iter.task = pid_task(pid, PIDTYPE_PID);
- /* What we to know is if the pid we have find is the
- * pid of a thread_group_leader. Testing for task
- * being a thread_group_leader is the obvious thing
- * todo but there is a window when it fails, due to
- * the pid transfer logic in de_thread.
- *
- * So we perform the straight forward test of seeing
- * if the pid we have found is the pid of a thread
- * group leader, and don't worry if the task we have
- * found doesn't happen to be a thread group leader.
- * As we don't care in the case of readdir.
- */
- if (!iter.task || !has_group_leader_pid(iter.task)) {
- iter.tgid += 1;
- goto retry;
- }
- get_task_struct(iter.task);
- }
- rcu_read_unlock();
- return iter;
-}
-
#define TGID_OFFSET (FIRST_PROCESS_ENTRY + 2)

/* for the /proc/ directory itself, after non-process stuff has been done */
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index b97bf2e..136b6ed 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -82,4 +82,11 @@ static inline struct proc_dir_entry *proc_net_mkdir(
return proc_mkdir_data(name, 0, parent, net);
}

+struct tgid_iter {
+ unsigned int tgid;
+ struct task_struct *task;
+};
+
+struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter);
+
#endif /* _LINUX_PROC_FS_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index cd36a5e..082307a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -568,6 +568,45 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
}

/*
+ * Find the first task with tgid >= tgid
+ *
+ */
+struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
+{
+ struct pid *pid;
+
+ if (iter.task)
+ put_task_struct(iter.task);
+ rcu_read_lock();
+retry:
+ iter.task = NULL;
+ pid = find_ge_pid(iter.tgid, ns);
+ if (pid) {
+ iter.tgid = pid_nr_ns(pid, ns);
+ iter.task = pid_task(pid, PIDTYPE_PID);
+ /* What we to know is if the pid we have find is the
+ * pid of a thread_group_leader. Testing for task
+ * being a thread_group_leader is the obvious thing
+ * todo but there is a window when it fails, due to
+ * the pid transfer logic in de_thread.
+ *
+ * So we perform the straight forward test of seeing
+ * if the pid we have found is the pid of a thread
+ * group leader, and don't worry if the task we have
+ * found doesn't happen to be a thread group leader.
+ * As we don't care in the case of readdir.
+ */
+ if (!iter.task || !has_group_leader_pid(iter.task)) {
+ iter.tgid += 1;
+ goto retry;
+ }
+ get_task_struct(iter.task);
+ }
+ rcu_read_unlock();
+ return iter;
+}
+
+/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
* more.
--
2.1.0

2015-02-17 08:39:30

by Andrei Vagin

Subject: [PATCH 3/7] task-diag: add ability to get information about all tasks

For that, NLM_F_DUMP needs to be set. Currently there are no
filters. Any suggestions are welcome.

I think we can add requests for children, threads, and session or group
members.

Signed-off-by: Andrey Vagin <[email protected]>
---
kernel/taskdiag.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 5faf3f0..da4a51b 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -102,6 +102,46 @@ err:
return err;
}

+static int taskdiag_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
+{
+ struct pid_namespace *ns = task_active_pid_ns(current);
+ struct tgid_iter iter;
+ struct nlattr *na;
+ struct task_diag_pid *req;
+ int rc;
+
+ if (nlmsg_len(cb->nlh) < GENL_HDRLEN + sizeof(*req))
+ return -EINVAL;
+
+ na = nlmsg_data(cb->nlh) + GENL_HDRLEN;
+ if (na->nla_type < 0)
+ return -EINVAL;
+
+ req = (struct task_diag_pid *) nla_data(na);
+
+ iter.tgid = cb->args[0];
+ iter.task = NULL;
+ for (iter = next_tgid(ns, iter);
+ iter.task;
+ iter.tgid += 1, iter = next_tgid(ns, iter)) {
+ if (!ptrace_may_access(iter.task, PTRACE_MODE_READ))
+ continue;
+
+ rc = task_diag_fill(iter.task, skb, req->show_flags,
+ NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq);
+ if (rc < 0) {
+ put_task_struct(iter.task);
+ if (rc != -EMSGSIZE)
+ return rc;
+ break;
+ }
+ }
+
+ cb->args[0] = iter.tgid;
+
+ return skb->len;
+}
+
static int taskdiag_doit(struct sk_buff *skb, struct genl_info *info)
{
struct task_struct *tsk = NULL;
@@ -161,6 +201,7 @@ static const struct genl_ops taskdiag_ops[] = {
{
.cmd = TASKDIAG_CMD_GET,
.doit = taskdiag_doit,
+ .dumpit = taskdiag_dumpid,
.policy = taskstats_cmd_get_policy,
},
};
--
2.1.0

2015-02-17 08:39:38

by Andrei Vagin

Subject: [PATCH 4/7] task-diag: add a new group to get process credentials

A response is represented by the task_diag_creds structure:

struct task_diag_creds {
struct task_diag_caps cap_inheritable;
struct task_diag_caps cap_permitted;
struct task_diag_caps cap_effective;
struct task_diag_caps cap_bset;

__u32 uid;
__u32 euid;
__u32 suid;
__u32 fsuid;
__u32 gid;
__u32 egid;
__u32 sgid;
__u32 fsgid;
};

This group is optional and is filled only if show_flags contains
TASK_DIAG_SHOW_CRED.
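
For illustration, here is a minimal sketch (not part of the patch) of
pulling this group out of one response message; the GENLMSG_*/NLA_*
helper macros are the ones defined in task_diag_comm.h in the last
patch:

static struct task_diag_creds *find_creds(struct nlmsghdr *hdr)
{
	int msg_len = GENLMSG_PAYLOAD(hdr);
	struct nlattr *na = GENLMSG_DATA(hdr);
	int len = 0;

	while (len < msg_len) {
		if (na->nla_type == TASK_DIAG_CRED)
			return NLA_DATA(na);
		len += NLA_ALIGN(na->nla_len);
		na = (struct nlattr *)(GENLMSG_DATA(hdr) + len);
	}
	return NULL;	/* the group was not requested in show_flags */
}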

Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/taskdiag.h | 23 ++++++++++++++++++
kernel/taskdiag.c | 55 ++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/taskdiag.h b/include/uapi/linux/taskdiag.h
index e1feb35..db12f6d 100644
--- a/include/uapi/linux/taskdiag.h
+++ b/include/uapi/linux/taskdiag.h
@@ -9,11 +9,14 @@

enum {
/* optional attributes which can be specified in show_flags */
+ TASK_DIAG_CRED,

/* other attributes */
TASK_DIAG_MSG = 64,
};

+#define TASK_DIAG_SHOW_CRED (1ULL << TASK_DIAG_CRED)
+
enum {
TASK_DIAG_RUNNING,
TASK_DIAG_INTERRUPTIBLE,
@@ -37,6 +40,26 @@ struct task_diag_msg {
char comm[TASK_DIAG_COMM_LEN];
};

+struct task_diag_caps {
+ __u32 cap[_LINUX_CAPABILITY_U32S_3];
+};
+
+struct task_diag_creds {
+ struct task_diag_caps cap_inheritable;
+ struct task_diag_caps cap_permitted;
+ struct task_diag_caps cap_effective;
+ struct task_diag_caps cap_bset;
+
+ __u32 uid;
+ __u32 euid;
+ __u32 suid;
+ __u32 fsuid;
+ __u32 gid;
+ __u32 egid;
+ __u32 sgid;
+ __u32 fsgid;
+};
+
enum {
TASKDIAG_CMD_UNSPEC = 0, /* Reserved */
TASKDIAG_CMD_GET,
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index da4a51b..6ccbcaf 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -15,7 +15,14 @@ static struct genl_family family = {

static size_t taskdiag_packet_size(u64 show_flags)
{
- return nla_total_size(sizeof(struct task_diag_msg));
+ size_t size;
+
+ size = nla_total_size(sizeof(struct task_diag_msg));
+
+ if (show_flags & TASK_DIAG_SHOW_CRED)
+ size += nla_total_size(sizeof(struct task_diag_creds));
+
+ return size;
}

/*
@@ -82,6 +89,46 @@ static int fill_task_msg(struct task_struct *p, struct sk_buff *skb)
return 0;
}

+static inline void caps2diag(struct task_diag_caps *diag, const kernel_cap_t *cap)
+{
+ int i;
+
+ for (i = 0; i < _LINUX_CAPABILITY_U32S_3; i++)
+ diag->cap[i] = cap->cap[i];
+}
+
+static int fill_creds(struct task_struct *p, struct sk_buff *skb)
+{
+ struct user_namespace *user_ns = current_user_ns();
+ struct task_diag_creds *diag_cred;
+ const struct cred *cred;
+ struct nlattr *attr;
+
+ attr = nla_reserve(skb, TASK_DIAG_CRED, sizeof(struct task_diag_creds));
+ if (!attr)
+ return -EMSGSIZE;
+
+ diag_cred = nla_data(attr);
+
+ cred = get_task_cred(p);
+
+ caps2diag(&diag_cred->cap_inheritable, &cred->cap_inheritable);
+ caps2diag(&diag_cred->cap_permitted, &cred->cap_permitted);
+ caps2diag(&diag_cred->cap_effective, &cred->cap_effective);
+ caps2diag(&diag_cred->cap_bset, &cred->cap_bset);
+
+ diag_cred->uid = from_kuid_munged(user_ns, cred->uid);
+ diag_cred->euid = from_kuid_munged(user_ns, cred->euid);
+ diag_cred->suid = from_kuid_munged(user_ns, cred->suid);
+ diag_cred->fsuid = from_kuid_munged(user_ns, cred->fsuid);
+ diag_cred->gid = from_kgid_munged(user_ns, cred->gid);
+ diag_cred->egid = from_kgid_munged(user_ns, cred->egid);
+ diag_cred->sgid = from_kgid_munged(user_ns, cred->sgid);
+ diag_cred->fsgid = from_kgid_munged(user_ns, cred->fsgid);
+
+ return 0;
+}
+
static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
u64 show_flags, u32 portid, u32 seq)
{
@@ -96,6 +143,12 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
if (err)
goto err;

+ if (show_flags & TASK_DIAG_SHOW_CRED) {
+ err = fill_creds(tsk, skb);
+ if (err)
+ goto err;
+ }
+
return genlmsg_end(skb, reply);
err:
genlmsg_cancel(skb, reply);
--
2.1.0

2015-02-17 08:40:45

by Andrei Vagin

Subject: [PATCH 5/7] kernel: add ability to iterate children of a specified task

The interface is similar to the tgid iterator. It is used in
procfs and will be used in task_diag.
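
For reference, a minimal sketch (not part of the patch) of how the
iterator is driven, mirroring how the next patch uses it from
taskdiag.c; visit_task() is a hypothetical callback:

	struct child_iter iter = {
		.parent = parent,	/* already pinned by the caller */
		.task   = NULL,
		.pos    = 0,
	};

	for (iter = next_child(iter); iter.task;
	     iter.pos += 1, iter = next_child(iter))
		visit_task(iter.task);
	/* next_child() drops the reference on the previous task; drop
	 * iter.task yourself if you break out of the loop early. */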

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/array.c | 58 +++++++++++++------------------------------------
include/linux/proc_fs.h | 6 +++++
kernel/pid.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 76 insertions(+), 43 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index bd117d0..7197c6a 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -579,54 +579,26 @@ get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
{
struct task_struct *start, *task;
struct pid *pid = NULL;
+ struct child_iter iter;

- read_lock(&tasklist_lock);
-
- start = pid_task(proc_pid(inode), PIDTYPE_PID);
+ start = get_proc_task(inode);
if (!start)
- goto out;
+ return NULL;

- /*
- * Lets try to continue searching first, this gives
- * us significant speedup on children-rich processes.
- */
- if (pid_prev) {
- task = pid_task(pid_prev, PIDTYPE_PID);
- if (task && task->real_parent == start &&
- !(list_empty(&task->sibling))) {
- if (list_is_last(&task->sibling, &start->children))
- goto out;
- task = list_first_entry(&task->sibling,
- struct task_struct, sibling);
- pid = get_pid(task_pid(task));
- goto out;
- }
- }
+ if (pid_prev)
+ task = get_pid_task(pid_prev, PIDTYPE_PID);
+ else
+ task = NULL;

- /*
- * Slow search case.
- *
- * We might miss some children here if children
- * are exited while we were not holding the lock,
- * but it was never promised to be accurate that
- * much.
- *
- * "Just suppose that the parent sleeps, but N children
- * exit after we printed their tids. Now the slow paths
- * skips N extra children, we miss N tasks." (c)
- *
- * So one need to stop or freeze the leader and all
- * its children to get a precise result.
- */
- list_for_each_entry(task, &start->children, sibling) {
- if (pos-- == 0) {
- pid = get_pid(task_pid(task));
- break;
- }
- }
+ iter.parent = start;
+ iter.task = task;
+ iter.pos = pos;
+
+ iter = next_child(iter);

-out:
- read_unlock(&tasklist_lock);
+ put_task_struct(start);
+ if (iter.task)
+ pid = get_pid(task_pid(iter.task));
return pid;
}

diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 136b6ed..eba98bc 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -89,4 +89,10 @@ struct tgid_iter {

struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter);

+struct child_iter {
+ struct task_struct *task, *parent;
+ unsigned int pos;
+};
+
+struct child_iter next_child(struct child_iter iter);
#endif /* _LINUX_PROC_FS_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 082307a..6e3e42a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -606,6 +606,61 @@ retry:
return iter;
}

+struct child_iter next_child(struct child_iter iter)
+{
+ struct task_struct *task;
+ loff_t pos = iter.pos;
+
+ read_lock(&tasklist_lock);
+
+ /*
+ * Lets try to continue searching first, this gives
+ * us significant speedup on children-rich processes.
+ */
+ if (iter.task) {
+ task = iter.task;
+ if (task && task->real_parent == iter.parent &&
+ !(list_empty(&task->sibling))) {
+ if (list_is_last(&task->sibling, &iter.parent->children)) {
+ task = NULL;
+ goto out;
+ }
+ task = list_first_entry(&task->sibling,
+ struct task_struct, sibling);
+ goto out;
+ }
+ }
+
+ /*
+ * Slow search case.
+ *
+ * We might miss some children here if children
+ * are exited while we were not holding the lock,
+ * but it was never promised to be accurate that
+ * much.
+ *
+ * "Just suppose that the parent sleeps, but N children
+ * exit after we printed their tids. Now the slow paths
+ * skips N extra children, we miss N tasks." (c)
+ *
+ * So one need to stop or freeze the leader and all
+ * its children to get a precise result.
+ */
+ list_for_each_entry(task, &iter.parent->children, sibling) {
+ if (pos-- == 0)
+ goto out;
+ }
+ task = NULL;
+out:
+ if (iter.task)
+ put_task_struct(iter.task);
+ if (task)
+ get_task_struct(task);
+ iter.task = task;
+ read_unlock(&tasklist_lock);
+ return iter;
+}
+
/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
--
2.1.0

2015-02-17 08:39:33

by Andrei Vagin

Subject: [PATCH 6/7] task_diag: add ability to dump children

Now we can dump all tasks or the children of a specified task.
It's an example of how this interface can be extended for different
use-cases.
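
A minimal sketch of the corresponding userspace request (illustrative
values; as with TASK_DIAG_DUMP_ALL, the message is sent with NLM_F_DUMP
set, see the selftest in the last patch):

	struct task_diag_pid req = {
		.show_flags    = 0,	/* only the TASK_DIAG_MSG group */
		.dump_stratagy = TASK_DIAG_DUMP_CHILDREN,
		.pid           = 1,	/* dump the children of pid 1 */
	};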

Signed-off-by: Andrey Vagin <[email protected]>
---
include/uapi/linux/taskdiag.h | 1 +
kernel/taskdiag.c | 83 +++++++++++++++++++++++++++++++++++++------
2 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/taskdiag.h b/include/uapi/linux/taskdiag.h
index db12f6d..d8a9e92 100644
--- a/include/uapi/linux/taskdiag.h
+++ b/include/uapi/linux/taskdiag.h
@@ -68,6 +68,7 @@ enum {
#define TASKDIAG_CMD_MAX (__TASKDIAG_CMD_MAX - 1)

#define TASK_DIAG_DUMP_ALL 0
+#define TASK_DIAG_DUMP_CHILDREN 1

struct task_diag_pid {
__u64 show_flags;
diff --git a/kernel/taskdiag.c b/kernel/taskdiag.c
index 6ccbcaf..951ecbd 100644
--- a/kernel/taskdiag.c
+++ b/kernel/taskdiag.c
@@ -155,12 +155,71 @@ err:
return err;
}

+struct task_iter {
+ struct task_diag_pid *req;
+ struct pid_namespace *ns;
+ struct netlink_callback *cb;
+
+ union {
+ struct tgid_iter tgid;
+ struct child_iter child;
+ };
+};
+
+static struct task_struct *iter_start(struct task_iter *iter)
+{
+ switch (iter->req->dump_stratagy) {
+ case TASK_DIAG_DUMP_CHILDREN:
+ rcu_read_lock();
+ iter->child.parent = find_task_by_pid_ns(iter->req->pid, iter->ns);
+ if (iter->child.parent)
+ get_task_struct(iter->child.parent);
+ rcu_read_unlock();
+
+ if (iter->child.parent == NULL)
+ return ERR_PTR(-ESRCH);
+
+ iter->child.pos = iter->cb->args[0];
+ iter->child.task = NULL;
+ iter->child = next_child(iter->child);
+ return iter->child.task;
+
+ case TASK_DIAG_DUMP_ALL:
+ iter->tgid.tgid = iter->cb->args[0];
+ iter->tgid.task = NULL;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ return iter->tgid.task;
+ }
+
+ return ERR_PTR(-EINVAL);
+}
+
+static struct task_struct *iter_next(struct task_iter *iter)
+{
+ switch (iter->req->dump_stratagy) {
+ case TASK_DIAG_DUMP_CHILDREN:
+ iter->child.pos += 1;
+ iter->child = next_child(iter->child);
+ iter->cb->args[0] = iter->child.pos;
+ return iter->child.task;
+
+ case TASK_DIAG_DUMP_ALL:
+ iter->tgid.tgid += 1;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ iter->cb->args[0] = iter->tgid.tgid;
+ return iter->tgid.task;
+ }
+
+ return NULL;
+}
+
static int taskdiag_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
{
struct pid_namespace *ns = task_active_pid_ns(current);
- struct tgid_iter iter;
+ struct task_iter iter;
struct nlattr *na;
struct task_diag_pid *req;
+ struct task_struct *task;
int rc;

if (nlmsg_len(cb->nlh) < GENL_HDRLEN + sizeof(*req))
@@ -172,26 +231,28 @@ static int taskdiag_dumpid(struct sk_buff *skb, struct netlink_callback *cb)

req = (struct task_diag_pid *) nla_data(na);

- iter.tgid = cb->args[0];
- iter.task = NULL;
- for (iter = next_tgid(ns, iter);
- iter.task;
- iter.tgid += 1, iter = next_tgid(ns, iter)) {
- if (!ptrace_may_access(iter.task, PTRACE_MODE_READ))
+ iter.req = req;
+ iter.ns = ns;
+ iter.cb = cb;
+
+ task = iter_start(&iter);
+ if (IS_ERR(task))
+ return PTR_ERR(task);
+
+ for (; task; task = iter_next(&iter)) {
+ if (!ptrace_may_access(task, PTRACE_MODE_READ))
continue;

- rc = task_diag_fill(iter.task, skb, req->show_flags,
+ rc = task_diag_fill(task, skb, req->show_flags,
NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq);
if (rc < 0) {
- put_task_struct(iter.task);
+ put_task_struct(task);
if (rc != -EMSGSIZE)
return rc;
break;
}
}

- cb->args[0] = iter.tgid;
-
return skb->len;
}

--
2.1.0

2015-02-17 08:40:03

by Andrei Vagin

Subject: [PATCH 7/7] selftest: check the task_diag functionality

Here are two test (example) programs:

task_diag - request information about a single process
task_diag_all - request information about all processes

Signed-off-by: Andrey Vagin <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/task_diag/Makefile | 16 ++
tools/testing/selftests/task_diag/task_diag.c | 56 ++++++
tools/testing/selftests/task_diag/task_diag_all.c | 82 ++++++++
tools/testing/selftests/task_diag/task_diag_comm.c | 206 +++++++++++++++++++++
tools/testing/selftests/task_diag/task_diag_comm.h | 47 +++++
tools/testing/selftests/task_diag/taskdiag.h | 1 +
7 files changed, 409 insertions(+)
create mode 100644 tools/testing/selftests/task_diag/Makefile
create mode 100644 tools/testing/selftests/task_diag/task_diag.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_all.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.h
create mode 120000 tools/testing/selftests/task_diag/taskdiag.h

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 4e51122..c73d888 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -17,6 +17,7 @@ TARGETS += sysctl
TARGETS += timers
TARGETS += user
TARGETS += vm
+TARGETS += task_diag
#Please keep the TARGETS list alphabetically sorted

TARGETS_HOTPLUG = cpu-hotplug
diff --git a/tools/testing/selftests/task_diag/Makefile b/tools/testing/selftests/task_diag/Makefile
new file mode 100644
index 0000000..d6583c4
--- /dev/null
+++ b/tools/testing/selftests/task_diag/Makefile
@@ -0,0 +1,16 @@
+all: task_diag task_diag_all
+
+run_tests: all
+ @./task_diag && ./task_diag_all && echo "task_diag: [PASS]" || echo "task_diag: [FAIL]"
+
+CFLAGS += -Wall -O2
+
+task_diag.o: task_diag.c task_diag_comm.h
+task_diag_all.o: task_diag_all.c task_diag_comm.h
+task_diag_comm.o: task_diag_comm.c task_diag_comm.h
+
+task_diag_all: task_diag_all.o task_diag_comm.o
+task_diag: task_diag.o task_diag_comm.o
+
+clean:
+ rm -rf task_diag task_diag_all task_diag_comm.o task_diag_all.o task_diag.o
diff --git a/tools/testing/selftests/task_diag/task_diag.c b/tools/testing/selftests/task_diag/task_diag.c
new file mode 100644
index 0000000..fafeeac
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag.c
@@ -0,0 +1,56 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <poll.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/wait.h>
+#include <signal.h>
+
+#include <linux/genetlink.h>
+#include "taskdiag.h"
+#include "task_diag_comm.h"
+
+int main(int argc, char *argv[])
+{
+ int exit_status = 1;
+ int rc, rep_len, id;
+ int nl_sd = -1;
+ struct task_diag_pid req;
+ char buf[4096];
+
+ req.show_flags = TASK_DIAG_SHOW_CRED;
+ req.pid = getpid();
+
+ nl_sd = create_nl_socket(NETLINK_GENERIC);
+ if (nl_sd < 0)
+ return -1;
+
+ id = get_family_id(nl_sd);
+ if (!id)
+ goto err;
+
+ rc = send_cmd(nl_sd, id, getpid(), TASKDIAG_CMD_GET,
+ TASKDIAG_CMD_ATTR_GET, &req, sizeof(req), 0);
+ pr_info("Sent pid/tgid, retval %d\n", rc);
+ if (rc < 0)
+ goto err;
+
+ rep_len = recv(nl_sd, buf, sizeof(buf), 0);
+ if (rep_len < 0) {
+ pr_perror("Unable to receive a response\n");
+ goto err;
+ }
+ pr_info("received %d bytes\n", rep_len);
+
+ nlmsg_receive(buf, rep_len, &show_task);
+
+ exit_status = 0;
+err:
+ close(nl_sd);
+ return exit_status;
+}
diff --git a/tools/testing/selftests/task_diag/task_diag_all.c b/tools/testing/selftests/task_diag/task_diag_all.c
new file mode 100644
index 0000000..85e1a0a
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_all.c
@@ -0,0 +1,82 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <poll.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/wait.h>
+#include <signal.h>
+
+#include "task_diag_comm.h"
+#include "taskdiag.h"
+
+int tasks;
+
+
+extern int _show_task(struct nlmsghdr *hdr)
+{
+ tasks++;
+ return show_task(hdr);
+}
+
+int main(int argc, char *argv[])
+{
+ int exit_status = 1;
+ int rc, rep_len, id;
+ int nl_sd = -1;
+ struct {
+ struct task_diag_pid req;
+ } pid_req;
+ char buf[4096];
+
+ quiet = 0;
+
+ pid_req.req.show_flags = 0;
+ pid_req.req.dump_stratagy = TASK_DIAG_DUMP_ALL;
+ pid_req.req.pid = 1;
+
+ nl_sd = create_nl_socket(NETLINK_GENERIC);
+ if (nl_sd < 0)
+ return -1;
+
+ id = get_family_id(nl_sd);
+ if (!id)
+ goto err;
+
+ rc = send_cmd(nl_sd, id, getpid(), TASKDIAG_CMD_GET,
+ TASKDIAG_CMD_ATTR_GET, &pid_req, sizeof(pid_req), 1);
+ pr_info("Sent pid/tgid, retval %d\n", rc);
+ if (rc < 0)
+ goto err;
+
+ while (1) {
+ int err;
+
+ rep_len = recv(nl_sd, buf, sizeof(buf), 0);
+ pr_info("received %d bytes\n", rep_len);
+
+ if (rep_len < 0) {
+ pr_perror("Unable to receive a response\n");
+ goto err;
+ }
+
+ if (rep_len == 0)
+ break;
+
+ err = nlmsg_receive(buf, rep_len, &_show_task);
+ if (err < 0)
+ goto err;
+ if (err == 0)
+ break;
+ }
+ printf("tasks: %d\n", tasks);
+
+ exit_status = 0;
+err:
+ close(nl_sd);
+ return exit_status;
+}
diff --git a/tools/testing/selftests/task_diag/task_diag_comm.c b/tools/testing/selftests/task_diag/task_diag_comm.c
new file mode 100644
index 0000000..df7780d
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_comm.c
@@ -0,0 +1,206 @@
+#include <errno.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <linux/genetlink.h>
+
+#include "taskdiag.h"
+#include "task_diag_comm.h"
+
+int quiet = 0;
+
+/*
+ * Create a raw netlink socket and bind
+ */
+int create_nl_socket(int protocol)
+{
+ int fd;
+ struct sockaddr_nl local;
+
+ fd = socket(AF_NETLINK, SOCK_RAW, protocol);
+ if (fd < 0)
+ return -1;
+
+ memset(&local, 0, sizeof(local));
+ local.nl_family = AF_NETLINK;
+
+ if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0)
+ goto error;
+
+ return fd;
+error:
+ close(fd);
+ return -1;
+}
+
+
+int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
+ __u8 genl_cmd, __u16 nla_type,
+ void *nla_data, int nla_len, int dump)
+{
+ struct nlattr *na;
+ struct sockaddr_nl nladdr;
+ int r, buflen;
+ char *buf;
+
+ struct msgtemplate msg;
+
+ msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
+ msg.n.nlmsg_type = nlmsg_type;
+ msg.n.nlmsg_flags = NLM_F_REQUEST;
+ if (dump)
+ msg.n.nlmsg_flags |= NLM_F_DUMP;
+ msg.n.nlmsg_seq = 0;
+ msg.n.nlmsg_pid = nlmsg_pid;
+ msg.g.cmd = genl_cmd;
+ msg.g.version = 0x1;
+ na = (struct nlattr *) GENLMSG_DATA(&msg);
+ na->nla_type = nla_type;
+ na->nla_len = nla_len + 1 + NLA_HDRLEN;
+ memcpy(NLA_DATA(na), nla_data, nla_len);
+ msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
+
+ buf = (char *) &msg;
+ buflen = msg.n.nlmsg_len;
+ memset(&nladdr, 0, sizeof(nladdr));
+ nladdr.nl_family = AF_NETLINK;
+ r = sendto(sd, buf, buflen, 0, (struct sockaddr *) &nladdr,
+ sizeof(nladdr));
+ if (r != buflen) {
+ pr_perror("Unable to send %d (%d)", r, buflen);
+ return -1;
+ }
+ return 0;
+}
+
+
+/*
+ * Probe the controller in genetlink to find the family id
+ * for the TASKDIAG family
+ */
+int get_family_id(int sd)
+{
+ char name[100];
+ struct msgtemplate ans;
+
+ int id = 0, rc;
+ struct nlattr *na;
+ int rep_len;
+
+ strcpy(name, TASKDIAG_GENL_NAME);
+ rc = send_cmd(sd, GENL_ID_CTRL, getpid(), CTRL_CMD_GETFAMILY,
+ CTRL_ATTR_FAMILY_NAME, (void *)name,
+ strlen(TASKDIAG_GENL_NAME) + 1, 0);
+ if (rc < 0)
+ return -1;
+
+ rep_len = recv(sd, &ans, sizeof(ans), 0);
+ if (ans.n.nlmsg_type == NLMSG_ERROR ||
+ (rep_len < 0) || !NLMSG_OK((&ans.n), rep_len))
+ return 0;
+
+ na = (struct nlattr *) GENLMSG_DATA(&ans);
+ na = (struct nlattr *) ((char *) na + NLA_ALIGN(na->nla_len));
+ if (na->nla_type == CTRL_ATTR_FAMILY_ID)
+ id = *(__u16 *) NLA_DATA(na);
+
+ return id;
+}
+
+int nlmsg_receive(void *buf, int len, int (*cb)(struct nlmsghdr *))
+{
+ struct nlmsghdr *hdr;
+
+ for (hdr = (struct nlmsghdr *)buf;
+ NLMSG_OK(hdr, len); hdr = NLMSG_NEXT(hdr, len)) {
+
+ if (hdr->nlmsg_type == NLMSG_DONE) {
+ int *len = (int *)NLMSG_DATA(hdr);
+
+ if (*len < 0) {
+ pr_err("ERROR %d reported by netlink (%s)\n",
+ *len, strerror(-*len));
+ return *len;
+ }
+
+ return 0;
+ }
+
+ if (hdr->nlmsg_type == NLMSG_ERROR) {
+ struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(hdr);
+
+ if (hdr->nlmsg_len - sizeof(*hdr) < sizeof(struct nlmsgerr)) {
+ pr_err("ERROR truncated\n");
+ return -1;
+ }
+
+ if (err->error == 0)
+ return 0;
+
+ return -1;
+ }
+ if (cb && cb(hdr))
+ return -1;
+ }
+
+ return 1;
+}
+
+int show_task(struct nlmsghdr *hdr)
+{
+ int msg_len;
+ struct msgtemplate *msg;
+ struct nlattr *na;
+ int len;
+
+ msg_len = GENLMSG_PAYLOAD(hdr);
+
+ msg = (struct msgtemplate *)hdr;
+ na = (struct nlattr *) GENLMSG_DATA(msg);
+ len = 0;
+ while (len < msg_len) {
+ len += NLA_ALIGN(na->nla_len);
+ switch (na->nla_type) {
+ case TASK_DIAG_MSG:
+ {
+ struct task_diag_msg *msg;
+
+ /* For nested attributes, na follows */
+ msg = (struct task_diag_msg *) NLA_DATA(na);
+ pr_info("pid %d ppid %d comm %s\n", msg->pid, msg->ppid, msg->comm);
+ break;
+ }
+ case TASK_DIAG_CRED:
+ {
+ struct task_diag_creds *creds;
+
+ creds = (struct task_diag_creds *) NLA_DATA(na);
+ pr_info("uid: %d %d %d %d\n", creds->uid,
+ creds->euid, creds->suid, creds->fsuid);
+ pr_info("gid: %d %d %d %d\n", creds->uid,
+ creds->euid, creds->suid, creds->fsuid);
+ pr_info("CapInh: %08x%08x\n",
+ creds->cap_inheritable.cap[1],
+ creds->cap_inheritable.cap[0]);
+ pr_info("CapPrm: %08x%08x\n",
+ creds->cap_permitted.cap[1],
+ creds->cap_permitted.cap[0]);
+ pr_info("CapEff: %08x%08x\n",
+ creds->cap_effective.cap[1],
+ creds->cap_effective.cap[0]);
+ pr_info("CapBnd: %08x%08x\n", creds->cap_bset.cap[1],
+ creds->cap_bset.cap[0]);
+ break;
+ }
+ default:
+ pr_err("Unknown nla_type %d\n",
+ na->nla_type);
+ return -1;
+ }
+ na = (struct nlattr *) (GENLMSG_DATA(msg) + len);
+ }
+
+ return 0;
+}
diff --git a/tools/testing/selftests/task_diag/task_diag_comm.h b/tools/testing/selftests/task_diag/task_diag_comm.h
new file mode 100644
index 0000000..42f2088
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_comm.h
@@ -0,0 +1,47 @@
+#ifndef __TASK_DIAG_COMM__
+#define __TASK_DIAG_COMM__
+
+#include <stdio.h>
+
+#include <linux/genetlink.h>
+#include "taskdiag.h"
+
+/*
+ * Generic macros for dealing with netlink sockets. Might be duplicated
+ * elsewhere. It is recommended that commercial grade applications use
+ * libnl or libnetlink and use the interfaces provided by the library
+ */
+#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
+#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
+#define NLA_DATA(na) ((void *)((char *)(na) + NLA_HDRLEN))
+#define NLA_PAYLOAD(len) (len - NLA_HDRLEN)
+
+#define pr_err(fmt, ...) \
+ fprintf(stderr, fmt, ##__VA_ARGS__)
+
+#define pr_perror(fmt, ...) \
+ fprintf(stderr, fmt " : %m\n", ##__VA_ARGS__)
+
+extern int quiet;
+#define pr_info(fmt, arg...) \
+ do { \
+ if (!quiet) \
+ printf(fmt, ##arg); \
+ } while (0) \
+
+struct msgtemplate {
+ struct nlmsghdr n;
+ struct genlmsghdr g;
+ char body[4096];
+};
+
+extern int create_nl_socket(int protocol);
+extern int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
+ __u8 genl_cmd, __u16 nla_type,
+ void *nla_data, int nla_len, int dump);
+
+extern int get_family_id(int sd);
+extern int nlmsg_receive(void *buf, int len, int (*cb)(struct nlmsghdr *));
+extern int show_task(struct nlmsghdr *hdr);
+
+#endif /* __TASK_DIAG_COMM__ */
diff --git a/tools/testing/selftests/task_diag/taskdiag.h b/tools/testing/selftests/task_diag/taskdiag.h
new file mode 120000
index 0000000..83e857e
--- /dev/null
+++ b/tools/testing/selftests/task_diag/taskdiag.h
@@ -0,0 +1 @@
+../../../../include/uapi/linux/taskdiag.h
\ No newline at end of file
--
2.1.0

2015-02-17 08:53:19

by Arnd Bergmann

Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> task_diag is based on netlink sockets and looks like socket-diag, which
> is used to get information about sockets.
>
> A request is described by the task_diag_pid structure:
>
> struct task_diag_pid {
> __u64 show_flags; /* specify which information are required */
> __u64 dump_stratagy; /* specify a group of processes */
>
> __u32 pid;
> };

Can you explain how the interface relates to the 'taskstats' genetlink
API? Did you consider extending that interface to provide the
information you need instead of basing on the socket-diag?

Arnd

2015-02-17 16:09:51

by David Ahern

Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On 2/17/15 1:20 AM, Andrey Vagin wrote:
> And here are statistics about syscalls which were called by each
> command.
> $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> 20,713 syscalls:sys_exit_open
> 20,710 syscalls:sys_exit_close
> 20,708 syscalls:sys_exit_read
> 10,348 syscalls:sys_exit_newstat
> 31 syscalls:sys_exit_write
>
> $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> 114 syscalls:sys_exit_recvfrom
> 49 syscalls:sys_exit_write
> 8 syscalls:sys_exit_mmap
> 4 syscalls:sys_exit_mprotect
> 3 syscalls:sys_exit_newfstat

'perf trace -s' gives the summary with stats.
e.g., perf trace -s -- ps ax -o pid,ppid

ps (23850), 3117 events, 99.3%, 0.000 msec

syscall calls min avg max stddev
(msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- ------
read 353 0.000 0.010 0.035 3.14%
write 166 0.006 0.012 0.045 3.03%
open 365 0.002 0.005 0.178 11.29%
close 354 0.001 0.002 0.024 3.57%
stat 170 0.002 0.007 0.662 52.99%
fstat 19 0.002 0.003 0.003 2.31%
lseek 2 0.003 0.003 0.003 6.49%
mmap 50 0.004 0.006 0.013 3.40%
...

2015-02-17 19:05:54

by Andy Lutomirski

Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Feb 17, 2015 12:40 AM, "Andrey Vagin" <[email protected]> wrote:
>
> Here is a preview version. It provides restricted set of functionality.
> I would like to collect feedback about this idea.
>
> Currently we use the proc file system, where all information are
> presented in text files, what is convenient for humans. But if we need
> to get information about processes from code (e.g. in C), the procfs
> doesn't look so cool.
>
> From code we would prefer to get information in binary format and to be
> able to specify which information and for which tasks are required. Here
> is a new interface with all these features, which is called task_diag.
> In addition it's much faster than procfs.
>
> task_diag is based on netlink sockets and looks like socket-diag, which
> is used to get information about sockets.
>
> A request is described by the task_diag_pid structure:
>
> struct task_diag_pid {
> __u64 show_flags; /* specify which information are required */
> __u64 dump_stratagy; /* specify a group of processes */
>
> __u32 pid;
> };
>
> A respone is a set of netlink messages. Each message describes one task.
> All task properties are divided on groups. A message contains the
> TASK_DIAG_MSG group and other groups if they have been requested in
> show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> response will contain the TASK_DIAG_CRED group which is described by the
> task_diag_creds structure.
>
> struct task_diag_msg {
> __u32 tgid;
> __u32 pid;
> __u32 ppid;
> __u32 tpid;
> __u32 sid;
> __u32 pgid;
> __u8 state;
> char comm[TASK_DIAG_COMM_LEN];
> };
>
> Another good feature of task_diag is an ability to request information
> for a few processes. Currently here are two stratgies
> TASK_DIAG_DUMP_ALL - get information for all tasks
> TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> tasks
>
> The task diag is much faster than the proc file system. We don't need to
> create a new file descriptor for each task. We need to send a request
> and get a response. It allows to get information for a few task in one
> request-response iteration.
>
> I have compared performance of procfs and task-diag for the
> "ps ax -o pid,ppid" command.
>
> A test stand contains 10348 processes.
> $ ps ax -o pid,ppid | wc -l
> 10348
>
> $ time ps ax -o pid,ppid > /dev/null
>
> real 0m1.073s
> user 0m0.086s
> sys 0m0.903s
>
> $ time ./task_diag_all > /dev/null
>
> real 0m0.037s
> user 0m0.004s
> sys 0m0.020s
>
> And here are statistics about syscalls which were called by each
> command.
> $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> 20,713 syscalls:sys_exit_open
> 20,710 syscalls:sys_exit_close
> 20,708 syscalls:sys_exit_read
> 10,348 syscalls:sys_exit_newstat
> 31 syscalls:sys_exit_write
>
> $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> 114 syscalls:sys_exit_recvfrom
> 49 syscalls:sys_exit_write
> 8 syscalls:sys_exit_mmap
> 4 syscalls:sys_exit_mprotect
> 3 syscalls:sys_exit_newfstat
>
> You can find the test program from this experiment in the last patch.
>
> The idea of this functionality was suggested by Pavel Emelyanov
> (xemul@), when he found that operations with /proc forms a significant
> part of a checkpointing time.
>
> Ten years ago here was attempt to add a netlink interface to access to /proc
> information:
> http://lwn.net/Articles/99600/

I don't suppose this could use real syscalls instead of netlink. If
nothing else, netlink seems to conflate pid and net namespaces.

Also, using an asynchronous interface (send, poll?, recv) for
something that's inherently synchronous (asking the kernel a local
question) seems awkward to me.

--Andy

2015-02-17 20:32:20

by Andrew Vagin

Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Tue, Feb 17, 2015 at 09:09:47AM -0700, David Ahern wrote:
> On 2/17/15 1:20 AM, Andrey Vagin wrote:
> >And here are statistics about syscalls which were called by each
> >command.
> >$ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> > 20,713 syscalls:sys_exit_open
> > 20,710 syscalls:sys_exit_close
> > 20,708 syscalls:sys_exit_read
> > 10,348 syscalls:sys_exit_newstat
> > 31 syscalls:sys_exit_write
> >
> >$ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> > 114 syscalls:sys_exit_recvfrom
> > 49 syscalls:sys_exit_write
> > 8 syscalls:sys_exit_mmap
> > 4 syscalls:sys_exit_mprotect
> > 3 syscalls:sys_exit_newfstat
>
> 'perf trace -s' gives the summary with stats.
> e.g., perf trace -s -- ps ax -o pid,ppid

Thank you for this command, I haven't used it before.

ps (21301), 145271 events, 100.0%, 0.000 msec

syscall calls min avg max stddev
(msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- ------
read 20717 0.000 0.020 1.631 0.64%
write 1 0.019 0.019 0.019 0.00%
open 20722 0.025 0.035 3.624 0.93%
close 20719 0.006 0.009 1.059 0.95%
stat 10352 0.015 0.025 1.748 0.95%
fstat 12 0.010 0.012 0.020 6.17%
lseek 2 0.011 0.012 0.012 3.08%
mmap 30 0.012 0.034 0.094 9.35%
mprotect 17 0.034 0.045 0.067 4.86%
munmap 3 0.028 0.058 0.108 44.12%
brk 4 0.011 0.015 0.019 11.24%
rt_sigaction 25 0.011 0.011 0.014 1.27%
rt_sigprocmask 1 0.012 0.012 0.012 0.00%
ioctl 4 0.010 0.012 0.014 6.94%
access 1 0.034 0.034 0.034 0.00%
execve 6 0.000 0.496 2.794 92.58%
uname 1 0.015 0.015 0.015 0.00%
getdents 12 0.019 0.691 1.158 13.04%
getrlimit 1 0.012 0.012 0.012 0.00%
geteuid 1 0.012 0.012 0.012 0.00%
arch_prctl 1 0.013 0.013 0.013 0.00%
futex 1 0.020 0.020 0.020 0.00%
set_tid_address 1 0.012 0.012 0.012 0.00%
openat 1 0.030 0.030 0.030 0.00%
set_robust_list 1 0.011 0.011 0.011 0.00%


task_diag_all (21304), 569 events, 98.6%, 0.000 msec

syscall calls min avg max stddev
(msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- ------
read 2 0.000 0.045 0.090 100.00%
write 77 0.010 0.013 0.083 7.93%
open 2 0.031 0.038 0.045 19.64%
close 3 0.010 0.014 0.017 13.43%
fstat 3 0.011 0.011 0.012 3.79%
mmap 8 0.013 0.027 0.049 16.72%
mprotect 4 0.034 0.043 0.052 8.86%
munmap 1 0.031 0.031 0.031 0.00%
brk 1 0.014 0.014 0.014 0.00%
ioctl 1 0.010 0.010 0.010 0.00%
access 1 0.030 0.030 0.030 0.00%
getpid 1 0.011 0.011 0.011 0.00%
socket 1 0.045 0.045 0.045 0.00%
sendto 2 0.091 0.104 0.117 12.63%
recvfrom 175 0.026 0.093 0.141 1.10%
bind 1 0.014 0.014 0.014 0.00%
execve 1 0.000 0.000 0.000 0.00%
arch_prctl 1 0.011 0.011 0.011 0.00%

>
> ps (23850), 3117 events, 99.3%, 0.000 msec
>
> syscall calls min avg max stddev
> (msec) (msec) (msec) (%)
> --------------- -------- --------- --------- --------- ------
> read 353 0.000 0.010 0.035 3.14%
> write 166 0.006 0.012 0.045 3.03%
> open 365 0.002 0.005 0.178 11.29%
> close 354 0.001 0.002 0.024 3.57%
> stat 170 0.002 0.007 0.662 52.99%
> fstat 19 0.002 0.003 0.003 2.31%
> lseek 2 0.003 0.003 0.003 6.49%
> mmap 50 0.004 0.006 0.013 3.40%
> ...

2015-02-17 21:33:26

by Andrew Vagin

Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > task_diag is based on netlink sockets and looks like socket-diag, which
> > is used to get information about sockets.
> >
> > A request is described by the task_diag_pid structure:
> >
> > struct task_diag_pid {
> > __u64 show_flags; /* specify which information are required */
> > __u64 dump_stratagy; /* specify a group of processes */
> >
> > __u32 pid;
> > };
>
> Can you explain how the interface relates to the 'taskstats' genetlink
> API? Did you consider extending that interface to provide the
> information you need instead of basing on the socket-diag?

It isn't based on socket-diag; it looks like socket-diag.

Currently task_diag registers a new genl family, but we could use the taskstats
family and add task_diag commands to it.

Thanks,
Andrew

>
> Arnd

2015-02-18 11:06:56

by Arnd Bergmann

Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > is used to get information about sockets.
> > >
> > > A request is described by the task_diag_pid structure:
> > >
> > > struct task_diag_pid {
> > > __u64 show_flags; /* specify which information are required */
> > > __u64 dump_stratagy; /* specify a group of processes */
> > >
> > > __u32 pid;
> > > };
> >
> > Can you explain how the interface relates to the 'taskstats' genetlink
> > API? Did you consider extending that interface to provide the
> > information you need instead of basing on the socket-diag?
>
> It isn't based on the socket-diag, it looks like socket-diag.
>
> Current task_diag registers a new genl family, but we can use the taskstats
> family and add task_diag commands to it.

What I meant was more along the lines of making it look like taskstats
by adding new fields to 'struct taskstats' for what you want to return.
I don't know if that is possible or a good idea for the information
you want to get out of the kernel, but it seems like a more natural
interface, as it already has some of the same data (comm, gid, pid,
ppid, ...).

Arnd

2015-02-18 12:42:42

by Andrew Vagin

Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > is used to get information about sockets.
> > > >
> > > > A request is described by the task_diag_pid structure:
> > > >
> > > > struct task_diag_pid {
> > > > __u64 show_flags; /* specify which information are required */
> > > > __u64 dump_stratagy; /* specify a group of processes */
> > > >
> > > > __u32 pid;
> > > > };
> > >
> > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > API? Did you consider extending that interface to provide the
> > > information you need instead of basing on the socket-diag?
> >
> > It isn't based on the socket-diag, it looks like socket-diag.
> >
> > Current task_diag registers a new genl family, but we can use the taskstats
> > family and add task_diag commands to it.
>
> What I meant was more along the lines of making it look like taskstats
> by adding new fields to 'struct taskstat' for what you want return.
> I don't know if that is possible or a good idea for the information
> you want to get out of the kernel, but it seems like a more natural
> interface, as it already has some of the same data (comm, gid, pid,
> ppid, ...).

Now I see what you mean. task_diag has a more flexible and universal
interface than taskstats. A taskstats response only contains a
taskstats structure, while a task_diag response can contain several
types of properties. Each type is described by its own structure.

Currently there are only two groups of parameters: task_diag_msg and
task_diag_creds.

task_diag_msg contains a few basic parameters.
task_diag_creds contains credentials.

I'm going to add other groups to describe all kinds of task properties
which are currently presented in procfs (e.g. /proc/pid/maps,
/proc/pid/fdinfo/*, /proc/pid/status, etc).

One of the features of task_diag is the ability to choose which
information is required. This minimizes the response size and the time
required to fill the response (see the small example after the
structures below).

struct task_diag_msg {
__u32 tgid;
__u32 pid;
__u32 ppid;
__u32 tpid;
__u32 sid;
__u32 pgid;
__u8 state;
char comm[TASK_DIAG_COMM_LEN];
};

struct task_diag_creds {
struct task_diag_caps cap_inheritable;
struct task_diag_caps cap_permitted;
struct task_diag_caps cap_effective;
struct task_diag_caps cap_bset;

__u32 uid;
__u32 euid;
__u32 suid;
__u32 fsuid;
__u32 gid;
__u32 egid;
__u32 sgid;
__u32 fsgid;
};
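
To make this concrete, here is a minimal userspace sketch of composing
a request that asks for the credentials group in addition to the basic
one. This is not code from the patch set: the numeric values of
TASK_DIAG_SHOW_CRED and of the TASK_DIAG_CMD_GET command are
assumptions, and the genetlink family id is assumed to have been
resolved already (via CTRL_CMD_GETFAMILY).

#include <string.h>
#include <sys/socket.h>
#include <linux/types.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

#define TASK_DIAG_SHOW_CRED	1	/* assumed flag value */
#define TASK_DIAG_CMD_GET	1	/* assumed command value */

struct task_diag_pid {
	__u64 show_flags;
	__u64 dump_stratagy;
	__u32 pid;
};

/* fd is a NETLINK_GENERIC socket, family is the resolved genl id */
static int task_diag_send(int fd, __u16 family, __u32 pid)
{
	char buf[NLMSG_SPACE(GENL_HDRLEN + sizeof(struct task_diag_pid))];
	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
	struct sockaddr_nl dst = { .nl_family = AF_NETLINK };
	struct genlmsghdr *genl;
	struct task_diag_pid *req;

	memset(buf, 0, sizeof(buf));
	nlh->nlmsg_len = sizeof(buf);
	nlh->nlmsg_type = family;
	nlh->nlmsg_flags = NLM_F_REQUEST;

	genl = NLMSG_DATA(nlh);
	genl->cmd = TASK_DIAG_CMD_GET;
	genl->version = 1;

	req = (struct task_diag_pid *)((char *)genl + GENL_HDRLEN);
	req->show_flags = TASK_DIAG_SHOW_CRED;	/* basic group + creds */
	req->pid = pid;

	return sendto(fd, buf, sizeof(buf), 0,
		      (struct sockaddr *)&dst, sizeof(dst));
}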

Thanks,
Andrew
>
> Arnd

2015-02-18 14:27:27

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> On Feb 17, 2015 12:40 AM, "Andrey Vagin" <[email protected]> wrote:
> >
> > Here is a preview version. It provides restricted set of functionality.
> > I would like to collect feedback about this idea.
> >
> > Currently we use the proc file system, where all information are
> > presented in text files, what is convenient for humans. But if we need
> > to get information about processes from code (e.g. in C), the procfs
> > doesn't look so cool.
> >
> > From code we would prefer to get information in binary format and to be
> > able to specify which information and for which tasks are required. Here
> > is a new interface with all these features, which is called task_diag.
> > In addition it's much faster than procfs.
> >
> > task_diag is based on netlink sockets and looks like socket-diag, which
> > is used to get information about sockets.
> >
> > A request is described by the task_diag_pid structure:
> >
> > struct task_diag_pid {
> > __u64 show_flags; /* specify which information are required */
> > __u64 dump_stratagy; /* specify a group of processes */
> >
> > __u32 pid;
> > };
> >
> > A respone is a set of netlink messages. Each message describes one task.
> > All task properties are divided on groups. A message contains the
> > TASK_DIAG_MSG group and other groups if they have been requested in
> > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > response will contain the TASK_DIAG_CRED group which is described by the
> > task_diag_creds structure.
> >
> > struct task_diag_msg {
> > __u32 tgid;
> > __u32 pid;
> > __u32 ppid;
> > __u32 tpid;
> > __u32 sid;
> > __u32 pgid;
> > __u8 state;
> > char comm[TASK_DIAG_COMM_LEN];
> > };
> >
> > Another good feature of task_diag is an ability to request information
> > for a few processes. Currently here are two stratgies
> > TASK_DIAG_DUMP_ALL - get information for all tasks
> > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> > tasks
> >
> > The task diag is much faster than the proc file system. We don't need to
> > create a new file descriptor for each task. We need to send a request
> > and get a response. It allows to get information for a few task in one
> > request-response iteration.
> >
> > I have compared performance of procfs and task-diag for the
> > "ps ax -o pid,ppid" command.
> >
> > A test stand contains 10348 processes.
> > $ ps ax -o pid,ppid | wc -l
> > 10348
> >
> > $ time ps ax -o pid,ppid > /dev/null
> >
> > real 0m1.073s
> > user 0m0.086s
> > sys 0m0.903s
> >
> > $ time ./task_diag_all > /dev/null
> >
> > real 0m0.037s
> > user 0m0.004s
> > sys 0m0.020s
> >
> > And here are statistics about syscalls which were called by each
> > command.
> > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> > 20,713 syscalls:sys_exit_open
> > 20,710 syscalls:sys_exit_close
> > 20,708 syscalls:sys_exit_read
> > 10,348 syscalls:sys_exit_newstat
> > 31 syscalls:sys_exit_write
> >
> > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> > 114 syscalls:sys_exit_recvfrom
> > 49 syscalls:sys_exit_write
> > 8 syscalls:sys_exit_mmap
> > 4 syscalls:sys_exit_mprotect
> > 3 syscalls:sys_exit_newfstat
> >
> > You can find the test program from this experiment in the last patch.
> >
> > The idea of this functionality was suggested by Pavel Emelyanov
> > (xemul@), when he found that operations with /proc forms a significant
> > part of a checkpointing time.
> >
> > Ten years ago here was attempt to add a netlink interface to access to /proc
> > information:
> > http://lwn.net/Articles/99600/
>
> I don't suppose this could use real syscalls instead of netlink. If
> nothing else, netlink seems to conflate pid and net namespaces.

What do you mean by "conflate pid and net namespaces"?

>
> Also, using an asynchronous interface (send, poll?, recv) for
> > something that's inherently synchronous (asking the kernel a local
> > question) seems awkward to me.

Actually all requests are handled synchronously. We call sendmsg to send
a request and it is handled in this syscall.
2) | netlink_sendmsg() {
2) | netlink_unicast() {
2) | taskdiag_doit() {
2) 2.153 us | task_diag_fill();
2) | netlink_unicast() {
2) 0.185 us | netlink_attachskb();
2) 0.291 us | __netlink_sendskb();
2) 2.452 us | }
2) + 33.625 us | }
2) + 54.611 us | }
2) + 76.370 us | }
2) | netlink_recvmsg() {
2) 1.178 us | skb_recv_datagram();
2) + 46.953 us | }

If we request information for a group of tasks (NLM_F_DUMP), the first
portion of data is filled during the sendmsg syscall. Then, when we read
it, the kernel fills the next portion.

3) | netlink_sendmsg() {
3) | __netlink_dump_start() {
3) | netlink_dump() {
3) | taskdiag_dumpid() {
3) 0.685 us | task_diag_fill();
...
3) 0.224 us | task_diag_fill();
3) + 74.028 us | }
3) + 88.757 us | }
3) + 89.296 us | }
3) + 98.705 us | }
3) | netlink_recvmsg() {
3) | netlink_dump() {
3) | taskdiag_dumpid() {
3) 0.594 us | task_diag_fill();
...
3) 0.242 us | task_diag_fill();
3) + 60.634 us | }
3) + 72.803 us | }
3) + 88.005 us | }
3) | netlink_recvmsg() {
3) | netlink_dump() {
3) 2.403 us | taskdiag_dumpid();
3) + 26.236 us | }
3) + 40.522 us | }
0) + 20.407 us | netlink_recvmsg();
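
To make the dump flow above concrete, here is a minimal sketch of the
userspace side: one send of a request with NLM_F_DUMP set, then recv()
in a loop until the kernel terminates the dump with NLMSG_DONE. Message
construction and payload parsing are elided; this is an illustration,
not the test program from the patches.

#include <sys/types.h>
#include <sys/socket.h>
#include <linux/netlink.h>

static int task_diag_recv_all(int fd)
{
	char buf[16384];
	struct nlmsghdr *nlh;
	ssize_t len;

	for (;;) {
		len = recv(fd, buf, sizeof(buf), 0);	/* next portion */
		if (len < 0)
			return -1;

		for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, len);
		     nlh = NLMSG_NEXT(nlh, len)) {
			if (nlh->nlmsg_type == NLMSG_ERROR)
				return -1;
			if (nlh->nlmsg_type == NLMSG_DONE)
				return 0;	/* end of dump */
			/* each remaining message describes one task */
		}
	}
}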


netlink is really good for this type of task. It allows us to create an
extensible interface which can be easily customized for different needs.

I don't think that we would want to create another similar interface
just to be independent of the network subsystem.

Thanks,
Andrew

>
> --Andy

2015-02-18 14:46:41

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Wednesday 18 February 2015 15:42:11 Andrew Vagin wrote:
> On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> > On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > > is used to get information about sockets.
> > > > >
> > > > > A request is described by the task_diag_pid structure:
> > > > >
> > > > > struct task_diag_pid {
> > > > > __u64 show_flags; /* specify which information are required */
> > > > > __u64 dump_stratagy; /* specify a group of processes */
> > > > >
> > > > > __u32 pid;
> > > > > };
> > > >
> > > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > > API? Did you consider extending that interface to provide the
> > > > information you need instead of basing it on the socket-diag?
> > >
> > > It isn't based on the socket-diag; it looks like socket-diag.
> > >
> > > Currently task_diag registers a new genl family, but we can use the taskstats
> > > family and add task_diag commands to it.
> >
> > What I meant was more along the lines of making it look like taskstats
> > by adding new fields to 'struct taskstats' for what you want to return.
> > I don't know if that is possible or a good idea for the information
> > you want to get out of the kernel, but it seems like a more natural
> > interface, as it already has some of the same data (comm, gid, pid,
> > ppid, ...).
>
> Now I see what you mean. task_diag has a more flexible and universal
> interface than taskstats. A taskstats response contains only the
> taskstats structure, while a task_diag response can contain a few types
> of properties, each described by its own structure.

Right, so the question is whether that flexibility is actually required
here. Independent of which design you personally prefer, what are the
downsides of extending the existing but less flexible interface?

If it's good enough, that would seem to provide a more consistent
API, which in turn helps users understand the interface and use it
correctly.

> Currently there are only two groups of parameters: task_diag_msg and
> task_diag_creds.
>
> task_diag_msg contains a few basic parameters.
> task_diag_creds contains credentials.
>
> I'm going to add other groups to describe all kinds of task properties
> which are currently presented in procfs (e.g. /proc/pid/maps,
> /proc/pid/fdinfo/*, /proc/pid/status, etc).
>
> One of the features of task_diag is the ability to choose which
> information is required. This allows us to minimize the size of a
> response and the time required to fill it.

I realize that you are trying to optimize for performance, but it
would be nice to quantify this if you want to argue for requiring
a split interface.

> struct task_diag_msg {
> __u32 tgid;
> __u32 pid;
> __u32 ppid;
> __u32 tpid;
> __u32 sid;
> __u32 pgid;
> __u8 state;
> char comm[TASK_DIAG_COMM_LEN];
> };

I guess this part would be a very natural extension to the
existing taskstats structure, and we should only add a new
one here if there are extremely good reasons for it.

> struct task_diag_creds {
> struct task_diag_caps cap_inheritable;
> struct task_diag_caps cap_permitted;
> struct task_diag_caps cap_effective;
> struct task_diag_caps cap_bset;
>
> __u32 uid;
> __u32 euid;
> __u32 suid;
> __u32 fsuid;
> __u32 gid;
> __u32 egid;
> __u32 sgid;
> __u32 fsgid;
> };

while this part could well be kept separate so you can query
it individually from the rest of taskstats, but through a
related interface.

Arnd

2015-02-19 01:19:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Feb 18, 2015 6:27 AM, "Andrew Vagin" <[email protected]> wrote:
>
> On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> > On Feb 17, 2015 12:40 AM, "Andrey Vagin" <[email protected]> wrote:
> > >
> > > Here is a preview version. It provides restricted set of functionality.
> > > I would like to collect feedback about this idea.
> > >
> > > Currently we use the proc file system, where all information are
> > > presented in text files, what is convenient for humans. But if we need
> > > to get information about processes from code (e.g. in C), the procfs
> > > doesn't look so cool.
> > >
> > > From code we would prefer to get information in binary format and to be
> > > able to specify which information and for which tasks are required. Here
> > > is a new interface with all these features, which is called task_diag.
> > > In addition it's much faster than procfs.
> > >
> > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > is used to get information about sockets.
> > >
> > > A request is described by the task_diag_pid structure:
> > >
> > > struct task_diag_pid {
> > > __u64 show_flags; /* specify which information are required */
> > > __u64 dump_stratagy; /* specify a group of processes */
> > >
> > > __u32 pid;
> > > };
> > >
> > > A respone is a set of netlink messages. Each message describes one task.
> > > All task properties are divided on groups. A message contains the
> > > TASK_DIAG_MSG group and other groups if they have been requested in
> > > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > > response will contain the TASK_DIAG_CRED group which is described by the
> > > task_diag_creds structure.
> > >
> > > struct task_diag_msg {
> > > __u32 tgid;
> > > __u32 pid;
> > > __u32 ppid;
> > > __u32 tpid;
> > > __u32 sid;
> > > __u32 pgid;
> > > __u8 state;
> > > char comm[TASK_DIAG_COMM_LEN];
> > > };
> > >
> > > Another good feature of task_diag is an ability to request information
> > > for a few processes. Currently here are two stratgies
> > > TASK_DIAG_DUMP_ALL - get information for all tasks
> > > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> > > tasks
> > >
> > > The task diag is much faster than the proc file system. We don't need to
> > > create a new file descriptor for each task. We need to send a request
> > > and get a response. It allows to get information for a few task in one
> > > request-response iteration.
> > >
> > > I have compared performance of procfs and task-diag for the
> > > "ps ax -o pid,ppid" command.
> > >
> > > A test stand contains 10348 processes.
> > > $ ps ax -o pid,ppid | wc -l
> > > 10348
> > >
> > > $ time ps ax -o pid,ppid > /dev/null
> > >
> > > real 0m1.073s
> > > user 0m0.086s
> > > sys 0m0.903s
> > >
> > > $ time ./task_diag_all > /dev/null
> > >
> > > real 0m0.037s
> > > user 0m0.004s
> > > sys 0m0.020s
> > >
> > > And here are statistics about syscalls which were called by each
> > > command.
> > > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> > > 20,713 syscalls:sys_exit_open
> > > 20,710 syscalls:sys_exit_close
> > > 20,708 syscalls:sys_exit_read
> > > 10,348 syscalls:sys_exit_newstat
> > > 31 syscalls:sys_exit_write
> > >
> > > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> > > 114 syscalls:sys_exit_recvfrom
> > > 49 syscalls:sys_exit_write
> > > 8 syscalls:sys_exit_mmap
> > > 4 syscalls:sys_exit_mprotect
> > > 3 syscalls:sys_exit_newfstat
> > >
> > > You can find the test program from this experiment in the last patch.
> > >
> > > The idea of this functionality was suggested by Pavel Emelyanov
> > > (xemul@), when he found that operations with /proc forms a significant
> > > part of a checkpointing time.
> > >
> > > Ten years ago here was attempt to add a netlink interface to access to /proc
> > > information:
> > > http://lwn.net/Articles/99600/
> >
> > I don't suppose this could use real syscalls instead of netlink. If
> > nothing else, netlink seems to conflate pid and net namespaces.
>
> What do you mean by "conflate pid and net namespaces"?

A netlink socket is bound to a network namespace, but you should be
returning data specific to a pid namespace.

On a related note, how does this interact with hidepid? More
generally, what privileges are you requiring to obtain what data?

>
> >
> > Also, using an asynchronous interface (send, poll?, recv) for
> > something that's inherently synchronous (asking the kernel a local
> > question) seems awkward to me.
>
> Actually all requests are handled synchronously. We call sendmsg to send
> a request and it is handled in this syscall.
> 2) | netlink_sendmsg() {
> 2) | netlink_unicast() {
> 2) | taskdiag_doit() {
> 2) 2.153 us | task_diag_fill();
> 2) | netlink_unicast() {
> 2) 0.185 us | netlink_attachskb();
> 2) 0.291 us | __netlink_sendskb();
> 2) 2.452 us | }
> 2) + 33.625 us | }
> 2) + 54.611 us | }
> 2) + 76.370 us | }
> 2) | netlink_recvmsg() {
> 2) 1.178 us | skb_recv_datagram();
> 2) + 46.953 us | }
>
> If we request information for a group of tasks (NLM_F_DUMP), the first
> portion of data is filled during the sendmsg syscall. Then, when we read
> it, the kernel fills the next portion.
>
> 3) | netlink_sendmsg() {
> 3) | __netlink_dump_start() {
> 3) | netlink_dump() {
> 3) | taskdiag_dumpid() {
> 3) 0.685 us | task_diag_fill();
> ...
> 3) 0.224 us | task_diag_fill();
> 3) + 74.028 us | }
> 3) + 88.757 us | }
> 3) + 89.296 us | }
> 3) + 98.705 us | }
> 3) | netlink_recvmsg() {
> 3) | netlink_dump() {
> 3) | taskdiag_dumpid() {
> 3) 0.594 us | task_diag_fill();
> ...
> 3) 0.242 us | task_diag_fill();
> 3) + 60.634 us | }
> 3) + 72.803 us | }
> 3) + 88.005 us | }
> 3) | netlink_recvmsg() {
> 3) | netlink_dump() {
> 3) 2.403 us | taskdiag_dumpid();
> 3) + 26.236 us | }
> 3) + 40.522 us | }
> 0) + 20.407 us | netlink_recvmsg();
>
>
> netlink is really good for this type of task. It allows us to create an
> extensible interface which can be easily customized for different needs.
>
> I don't think that we would want to create another similar interface
> just to be independent of the network subsystem.

I guess this is a bit streamy in that you ask one question and get
multiple answers.

>
> Thanks,
> Andrew
>
> >
> > --Andy

2015-02-19 14:05:03

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Wed, Feb 18, 2015 at 03:46:31PM +0100, Arnd Bergmann wrote:
> On Wednesday 18 February 2015 15:42:11 Andrew Vagin wrote:
> > On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> > > On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > > > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > > > is used to get information about sockets.
> > > > > >
> > > > > > A request is described by the task_diag_pid structure:
> > > > > >
> > > > > > struct task_diag_pid {
> > > > > > __u64 show_flags; /* specify which information are required */
> > > > > > __u64 dump_stratagy; /* specify a group of processes */
> > > > > >
> > > > > > __u32 pid;
> > > > > > };
> > > > >
> > > > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > > > API? Did you consider extending that interface to provide the
> > > > > information you need instead of basing it on the socket-diag?
> > > >
> > > > It isn't based on the socket-diag; it looks like socket-diag.
> > > >
> > > > Currently task_diag registers a new genl family, but we can use the taskstats
> > > > family and add task_diag commands to it.
> > >
> > > What I meant was more along the lines of making it look like taskstats
> > > by adding new fields to 'struct taskstats' for what you want to return.
> > > I don't know if that is possible or a good idea for the information
> > > you want to get out of the kernel, but it seems like a more natural
> > > interface, as it already has some of the same data (comm, gid, pid,
> > > ppid, ...).
> >
> > Now I see what you mean. task_diag has a more flexible and universal
> > interface than taskstats. A taskstats response contains only the
> > taskstats structure, while a task_diag response can contain a few types
> > of properties, each described by its own structure.
>
> Right, so the question is whether that flexibility is actually required
> here. Independent of which design you personally prefer, what are the
> downsides of extending the existing but less flexible interface?

I have looked at taskstats once again.

The format of the response messages for taskstats and task_diag is the
same: a netlink message with a set of nested attributes. New attributes
can be added without breaking backward compatibility.
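
For illustration, a reader of such a message could walk the attribute
stream as sketched below. The attribute type values are assumptions;
the point is that unknown types are simply skipped, which is what keeps
the format backward compatible:

#include <linux/netlink.h>
#include <linux/genetlink.h>

enum {	/* assumed attribute type values */
	TASK_DIAG_MSG = 1,
	TASK_DIAG_CRED = 2,
};

static void task_diag_parse(struct nlmsghdr *nlh)
{
	int len = nlh->nlmsg_len - NLMSG_HDRLEN - GENL_HDRLEN;
	struct nlattr *attr =
		(struct nlattr *)((char *)NLMSG_DATA(nlh) + GENL_HDRLEN);

	while (len >= (int)sizeof(*attr) &&
	       attr->nla_len >= sizeof(*attr) && attr->nla_len <= len) {
		switch (attr->nla_type) {
		case TASK_DIAG_MSG:	/* payload: struct task_diag_msg */
			break;
		case TASK_DIAG_CRED:	/* payload: struct task_diag_creds */
			break;
		default:		/* unknown group: skip it */
			break;
		}
		len -= NLA_ALIGN(attr->nla_len);
		attr = (struct nlattr *)((char *)attr +
					 NLA_ALIGN(attr->nla_len));
	}
}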

The request can be expanded to specify which information is required
and for which tasks.

These two features allow us to significantly improve performance,
because we don't need to do a system call for each task.

I have done a few experiments to confirm this.

task_proc_all reads /proc/pid/stat for each task
$ time ./task_proc_all > /dev/null

real 0m1.528s
user 0m0.016s
sys 0m1.341s

task_diag uses the task_diag interface and requests information for
each task separately.
$ time ./task_diag > /dev/null

real 0m1.166s
user 0m0.024s
sys 0m1.127s

task_diag_all uses the task_diag interface and requests information for
all tasks in one request.
$ time ./task_diag_all > /dev/null

real 0m0.077s
user 0m0.018s
sys 0m0.053s

So you can see that the ability to request information for a group of
tasks is much more efficient.
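
For reference, the per-task procfs walk that task_proc_all performs
looks roughly like the sketch below (the actual test program is in the
last patch). Note the open/read/close triple for every pid, which is
where the large syscall counts in the earlier perf stat output come
from.

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	char path[64], buf[1024];
	int fd;

	if (!proc)
		return 1;

	while ((de = readdir(proc)) != NULL) {
		if (de->d_name[0] < '0' || de->d_name[0] > '9')
			continue;	/* not a pid directory */

		snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
		fd = open(path, O_RDONLY);	/* one fd per task */
		if (fd < 0)
			continue;
		read(fd, buf, sizeof(buf));	/* one read per task */
		close(fd);
	}
	closedir(proc);
	return 0;
}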

The summary of this message is that we can use the taskstats interface
with some extensions.

Arnd, thank you for your opinion and suggestions.

>
> If it's good enough, that would seem to provide a more consistent
> API, which in turn helps users understand the interface and use it
> correctly.
>
> > Currently there are only two groups of parameters: task_diag_msg and
> > task_diag_creds.
> >
> > task_diag_msg contains a few basic parameters.
> > task_diag_creds contains credentials.
> >
> > I'm going to add other groups to describe all kinds of task properties
> > which are currently presented in procfs (e.g. /proc/pid/maps,
> > /proc/pid/fdinfo/*, /proc/pid/status, etc).
> >
> > One of the features of task_diag is the ability to choose which
> > information is required. This allows us to minimize the size of a
> > response and the time required to fill it.
>
> I realize that you are trying to optimize for performance, but it
> would be nice to quantify this if you want to argue for requiring
> a split interface.
>
> > struct task_diag_msg {
> > __u32 tgid;
> > __u32 pid;
> > __u32 ppid;
> > __u32 tpid;
> > __u32 sid;
> > __u32 pgid;
> > __u8 state;
> > char comm[TASK_DIAG_COMM_LEN];
> > };
>
> I guess this part would be a very natural extension to the
> existing taskstats structure, and we should only add a new
> one here if there are extremely good reasons for it.

The task_diag_msg structure contains properties which are used more
frequently than the statistics in the taskstats structure.

The size of the task_diag_msg structure is 44 bytes; the size of the
taskstats structure is 328 bytes. If we have more data, we need to do
more system calls. So I have done one more experiment to see how this
affects performance:

If we use the task_diag_msg structure:
$ time ./task_diag_all > /dev/null

real 0m0.077s
user 0m0.018s
sys 0m0.053s

If we use the taskstats structure:
$ time ./task_diag_all > /dev/null

real 0m0.117s
user 0m0.029s
sys 0m0.085s

Thanks,
Andrew

2015-02-19 21:39:47

by Andrew Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Wed, Feb 18, 2015 at 05:18:38PM -0800, Andy Lutomirski wrote:
> On Feb 18, 2015 6:27 AM, "Andrew Vagin" <[email protected]> wrote:
> >
> > On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> > > On Feb 17, 2015 12:40 AM, "Andrey Vagin" <[email protected]> wrote:
> > > >
> > > > Here is a preview version. It provides restricted set of functionality.
> > > > I would like to collect feedback about this idea.
> > > >
> > > > Currently we use the proc file system, where all information are
> > > > presented in text files, what is convenient for humans. But if we need
> > > > to get information about processes from code (e.g. in C), the procfs
> > > > doesn't look so cool.
> > > >
> > > > From code we would prefer to get information in binary format and to be
> > > > able to specify which information and for which tasks are required. Here
> > > > is a new interface with all these features, which is called task_diag.
> > > > In addition it's much faster than procfs.
> > > >
> > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > is used to get information about sockets.
> > > >
> > > > A request is described by the task_diag_pid structure:
> > > >
> > > > struct task_diag_pid {
> > > > __u64 show_flags; /* specify which information are required */
> > > > __u64 dump_stratagy; /* specify a group of processes */
> > > >
> > > > __u32 pid;
> > > > };
> > > >
> > > > A respone is a set of netlink messages. Each message describes one task.
> > > > All task properties are divided on groups. A message contains the
> > > > TASK_DIAG_MSG group and other groups if they have been requested in
> > > > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > > > response will contain the TASK_DIAG_CRED group which is described by the
> > > > task_diag_creds structure.
> > > >
> > > > struct task_diag_msg {
> > > > __u32 tgid;
> > > > __u32 pid;
> > > > __u32 ppid;
> > > > __u32 tpid;
> > > > __u32 sid;
> > > > __u32 pgid;
> > > > __u8 state;
> > > > char comm[TASK_DIAG_COMM_LEN];
> > > > };
> > > >
> > > > Another good feature of task_diag is an ability to request information
> > > > for a few processes. Currently here are two stratgies
> > > > TASK_DIAG_DUMP_ALL - get information for all tasks
> > > > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> > > > tasks
> > > >
> > > > The task diag is much faster than the proc file system. We don't need to
> > > > create a new file descriptor for each task. We need to send a request
> > > > and get a response. It allows to get information for a few task in one
> > > > request-response iteration.
> > > >
> > > > I have compared performance of procfs and task-diag for the
> > > > "ps ax -o pid,ppid" command.
> > > >
> > > > A test stand contains 10348 processes.
> > > > $ ps ax -o pid,ppid | wc -l
> > > > 10348
> > > >
> > > > $ time ps ax -o pid,ppid > /dev/null
> > > >
> > > > real 0m1.073s
> > > > user 0m0.086s
> > > > sys 0m0.903s
> > > >
> > > > $ time ./task_diag_all > /dev/null
> > > >
> > > > real 0m0.037s
> > > > user 0m0.004s
> > > > sys 0m0.020s
> > > >
> > > > And here are statistics about syscalls which were called by each
> > > > command.
> > > > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> > > > 20,713 syscalls:sys_exit_open
> > > > 20,710 syscalls:sys_exit_close
> > > > 20,708 syscalls:sys_exit_read
> > > > 10,348 syscalls:sys_exit_newstat
> > > > 31 syscalls:sys_exit_write
> > > >
> > > > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> > > > 114 syscalls:sys_exit_recvfrom
> > > > 49 syscalls:sys_exit_write
> > > > 8 syscalls:sys_exit_mmap
> > > > 4 syscalls:sys_exit_mprotect
> > > > 3 syscalls:sys_exit_newfstat
> > > >
> > > > You can find the test program from this experiment in the last patch.
> > > >
> > > > The idea of this functionality was suggested by Pavel Emelyanov
> > > > (xemul@), when he found that operations with /proc forms a significant
> > > > part of a checkpointing time.
> > > >
> > > > Ten years ago here was attempt to add a netlink interface to access to /proc
> > > > information:
> > > > http://lwn.net/Articles/99600/
> > >
> > > I don't suppose this could use real syscalls instead of netlink. If
> > > nothing else, netlink seems to conflate pid and net namespaces.
> >
> > What do you mean by "conflate pid and net namespaces"?
>
> A netlink socket is bound to a network namespace, but you should be
> returning data specific to a pid namespace.

This is a good question. When we mount a procfs instance, the current
pidns is saved in its superblock. Then if we read data from this procfs
instance from another pidns, we will see pids from the pidns where it
has been mounted.

$ unshare -p -- bash -c '(bash)'
$ cat /proc/self/status | grep ^Pid:
Pid: 15770
$ echo $$
1

The situation with socket_diag is similar: a socket_diag socket is
bound to a network namespace. If we open a socket_diag socket and then
change the network namespace, it will return information about the
initial netns.

In this version I always use the current pid namespace.
But to be consistent with other kernel logic, a task_diag socket has to
be linked with the pidns where it has been created.

>
> On a related note, how does this interact with hidepid? More

Currently it always works like procfs with hidepid = 2 (the highest
level of security).

> generally, what privileges are you requiring to obtain what data?

It dumps information about a task only if ptrace_may_access(tsk, PTRACE_MODE_READ) returns true.
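
In kernel terms, the gate sits in front of every task that is about to
be dumped; a rough sketch, where the function name and placement are
illustrative rather than the actual patch code:

#include <linux/sched.h>
#include <linux/ptrace.h>

/* skip any task the requesting process may not inspect */
static bool task_diag_may_dump(struct task_struct *tsk)
{
	return ptrace_may_access(tsk, PTRACE_MODE_READ);
}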

>
> >
> > >
> > > Also, using an asynchronous interface (send, poll?, recv) for
> > > something that's inherently synchronous (asking the kernel a local
> > > question) seems awkward to me.
> >
> > Actually all requests are handled synchronously. We call sendmsg to send
> > a request and it is handled in this syscall.
> > 2) | netlink_sendmsg() {
> > 2) | netlink_unicast() {
> > 2) | taskdiag_doit() {
> > 2) 2.153 us | task_diag_fill();
> > 2) | netlink_unicast() {
> > 2) 0.185 us | netlink_attachskb();
> > 2) 0.291 us | __netlink_sendskb();
> > 2) 2.452 us | }
> > 2) + 33.625 us | }
> > 2) + 54.611 us | }
> > 2) + 76.370 us | }
> > 2) | netlink_recvmsg() {
> > 2) 1.178 us | skb_recv_datagram();
> > 2) + 46.953 us | }
> >
> > If we request information for a group of tasks (NLM_F_DUMP), the first
> > portion of data is filled during the sendmsg syscall. Then, when we read
> > it, the kernel fills the next portion.
> >
> > 3) | netlink_sendmsg() {
> > 3) | __netlink_dump_start() {
> > 3) | netlink_dump() {
> > 3) | taskdiag_dumpid() {
> > 3) 0.685 us | task_diag_fill();
> > ...
> > 3) 0.224 us | task_diag_fill();
> > 3) + 74.028 us | }
> > 3) + 88.757 us | }
> > 3) + 89.296 us | }
> > 3) + 98.705 us | }
> > 3) | netlink_recvmsg() {
> > 3) | netlink_dump() {
> > 3) | taskdiag_dumpid() {
> > 3) 0.594 us | task_diag_fill();
> > ...
> > 3) 0.242 us | task_diag_fill();
> > 3) + 60.634 us | }
> > 3) + 72.803 us | }
> > 3) + 88.005 us | }
> > 3) | netlink_recvmsg() {
> > 3) | netlink_dump() {
> > 3) 2.403 us | taskdiag_dumpid();
> > 3) + 26.236 us | }
> > 3) + 40.522 us | }
> > 0) + 20.407 us | netlink_recvmsg();
> >
> >
> > netlink is really good for this type of task. It allows us to create an
> > extensible interface which can be easily customized for different needs.
> >
> > I don't think that we would want to create another similar interface
> > just to be independent of the network subsystem.
>
> I guess this is a bit streamy in that you ask one question and get
> multiple answers.

It's like seq_file in procfs: the kernel allocates a buffer, fills it,
copies it into userspace, fills it again, and so on. And we can read
the data from the file in portions.

Actually, here is one more analogy. When we open a file in procfs,
we send a request to the kernel, and the file path is the request body
in this case. But in the case of procfs, we can't construct requests;
we only have a set of predefined ones.

>
> >
> > Thanks,
> > Andrew
> >
> > >
> > > --Andy

2015-02-20 20:33:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

On Thu, Feb 19, 2015 at 1:39 PM, Andrew Vagin <[email protected]> wrote:
> On Wed, Feb 18, 2015 at 05:18:38PM -0800, Andy Lutomirski wrote:
>> > > I don't suppose this could use real syscalls instead of netlink. If
>> > > nothing else, netlink seems to conflate pid and net namespaces.
>> >
>> > What do you mean by "conflate pid and net namespaces"?
>>
>> A netlink socket is bound to a network namespace, but you should be
>> returning data specific to a pid namespace.
>
> This is a good question. When we mount a procfs instance, the current
> pidns is saved in its superblock. Then if we read data from this procfs
> instance from another pidns, we will see pids from the pidns where it
> has been mounted.
>
> $ unshare -p -- bash -c '(bash)'
> $ cat /proc/self/status | grep ^Pid:
> Pid: 15770
> $ echo $$
> 1
>
> The situation with socket_diag is similar: a socket_diag socket is
> bound to a network namespace. If we open a socket_diag socket and then
> change the network namespace, it will return information about the
> initial netns.
>
> In this version I always use the current pid namespace.
> But to be consistent with other kernel logic, a task_diag socket has to
> be linked with the pidns where it has been created.
>

Attaching a pidns to every freshly created netlink socket seems odd,
but I don't see a better solution that still uses netlink.

>>
>> On a related note, how does this interact with hidepid? More
>
> Currently it always works like procfs with hidepid = 2 (the highest
> level of security).
>
>> generally, what privileges are you requiring to obtain what data?
>
> It dumps information about a task only if ptrace_may_access(tsk, PTRACE_MODE_READ) returns true.

Sounds good to me.

>
>>
>> >
>> > >
>> > > Also, using an asynchronous interface (send, poll?, recv) for
>> > > something that's inherently synchronous (asking the kernel a local
>> > > question) seems awkward to me.
>> >
>> > Actually all requests are handled synchronously. We call sendmsg to send
>> > a request and it is handled in this syscall.
>> > 2) | netlink_sendmsg() {
>> > 2) | netlink_unicast() {
>> > 2) | taskdiag_doit() {
>> > 2) 2.153 us | task_diag_fill();
>> > 2) | netlink_unicast() {
>> > 2) 0.185 us | netlink_attachskb();
>> > 2) 0.291 us | __netlink_sendskb();
>> > 2) 2.452 us | }
>> > 2) + 33.625 us | }
>> > 2) + 54.611 us | }
>> > 2) + 76.370 us | }
>> > 2) | netlink_recvmsg() {
>> > 2) 1.178 us | skb_recv_datagram();
>> > 2) + 46.953 us | }
>> >
>> > If we request information for a group of tasks (NLM_F_DUMP), the first
>> > portion of data is filled during the sendmsg syscall. Then, when we read
>> > it, the kernel fills the next portion.
>> >
>> > 3) | netlink_sendmsg() {
>> > 3) | __netlink_dump_start() {
>> > 3) | netlink_dump() {
>> > 3) | taskdiag_dumpid() {
>> > 3) 0.685 us | task_diag_fill();
>> > ...
>> > 3) 0.224 us | task_diag_fill();
>> > 3) + 74.028 us | }
>> > 3) + 88.757 us | }
>> > 3) + 89.296 us | }
>> > 3) + 98.705 us | }
>> > 3) | netlink_recvmsg() {
>> > 3) | netlink_dump() {
>> > 3) | taskdiag_dumpid() {
>> > 3) 0.594 us | task_diag_fill();
>> > ...
>> > 3) 0.242 us | task_diag_fill();
>> > 3) + 60.634 us | }
>> > 3) + 72.803 us | }
>> > 3) + 88.005 us | }
>> > 3) | netlink_recvmsg() {
>> > 3) | netlink_dump() {
>> > 3) 2.403 us | taskdiag_dumpid();
>> > 3) + 26.236 us | }
>> > 3) + 40.522 us | }
>> > 0) + 20.407 us | netlink_recvmsg();
>> >
>> >
>> > netlink is really good for this type of task. It allows us to create an
>> > extensible interface which can be easily customized for different needs.
>> >
>> > I don't think that we would want to create another similar interface
>> > just to be independent of the network subsystem.
>>
>> I guess this is a bit streamy in that you ask one question and get
>> multiple answers.
>
> It's like seq_file in procfs: the kernel allocates a buffer, fills it,
> copies it into userspace, fills it again, and so on. And we can read
> the data from the file in portions.
>
> Actually, here is one more analogy. When we open a file in procfs,
> we send a request to the kernel, and the file path is the request body
> in this case. But in the case of procfs, we can't construct requests;
> we only have a set of predefined ones.

Fair enough. Procfs is also a bit absurd and only makes sense because
it's compatible with lots of tools. In a totally sane world, I would
argue that you should issue one syscall asking questions about a pid
and you should get answers immediately.

--Andy