From: Andrey Vagin
To: linux-kernel@vger.kernel.org
Cc: linux-api@vger.kernel.org, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi, Andrey Vagin
Subject: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
Date: Tue, 17 Feb 2015 11:20:19 +0300
Message-Id: <1424161226-15176-1-git-send-email-avagin@openvz.org>

This is a preview version. It provides a restricted set of functionality.
I would like to collect feedback on the idea.

Currently we use the proc file system, where all information is presented
in text files, which is convenient for humans. But if we need to get
information about processes from code (e.g. in C), procfs is much less
convenient. From code we would prefer to get information in a binary
format and to be able to specify which information is required and for
which tasks.

Here is a new interface with all these features, called task_diag.
In addition, it is much faster than procfs.

task_diag is based on netlink sockets and looks like socket-diag, which is
used to get information about sockets.

A request is described by the task_diag_pid structure:

struct task_diag_pid {
	__u64	show_flags;	/* specify which information is required */
	__u64	dump_stratagy;	/* specify a group of processes */

	__u32	pid;
};

A response is a set of netlink messages. Each message describes one task.
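
As a rough illustration of the request layout above, the following sketch
packs a task_diag_pid into one netlink message using the standard
nlmsghdr framing from <linux/netlink.h>. The message type, the flag value
and the strategy value are hypothetical placeholders (EXAMPLE_*); the real
constants and the exact framing are defined by the patches and are not
reproduced here. build_request() is likewise a name made up for this
example.

```c
#include <assert.h>
#include <string.h>
#include <linux/netlink.h>	/* struct nlmsghdr, NLMSG_* macros */
#include <linux/types.h>	/* __u32, __u64 */

/* Request layout as given in this cover letter. */
struct task_diag_pid {
	__u64	show_flags;	/* which information is required */
	__u64	dump_stratagy;	/* which group of processes */
	__u32	pid;
};

/* Hypothetical placeholders: the real values come from the patch series. */
#define EXAMPLE_MSG_TYPE	0x100
#define EXAMPLE_SHOW_FLAG	(1ULL << 0)
#define EXAMPLE_STRATEGY	0

/*
 * Fill buf with one netlink message that carries a task_diag_pid request.
 * buf must be suitably aligned for struct nlmsghdr.
 * Returns the message length, or 0 if buf is too small.
 */
size_t build_request(void *buf, size_t buflen, __u32 pid, __u32 seq)
{
	size_t len = NLMSG_LENGTH(sizeof(struct task_diag_pid));
	struct nlmsghdr *nlh = buf;
	struct task_diag_pid req;

	assert(buf != NULL);
	if (buflen < NLMSG_ALIGN(len))
		return 0;

	/* Zero the header, the payload and the trailing padding. */
	memset(buf, 0, NLMSG_ALIGN(len));
	nlh->nlmsg_len = len;
	nlh->nlmsg_type = EXAMPLE_MSG_TYPE;
	nlh->nlmsg_flags = NLM_F_REQUEST;
	nlh->nlmsg_seq = seq;

	memset(&req, 0, sizeof(req));
	req.show_flags = EXAMPLE_SHOW_FLAG;
	req.dump_stratagy = EXAMPLE_STRATEGY;
	req.pid = pid;
	memcpy(NLMSG_DATA(nlh), &req, sizeof(req));

	return len;
}
```

A caller would pass the filled buffer to send() on a netlink socket and
then read the response messages with recv().
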
All task properties are divided into groups. A message contains the
TASK_DIAG_MSG group, plus other groups if they have been requested in
show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, the
response will contain the TASK_DIAG_CRED group, which is described by the
task_diag_creds structure.

struct task_diag_msg {
	__u32	tgid;
	__u32	pid;
	__u32	ppid;
	__u32	tpid;
	__u32	sid;
	__u32	pgid;
	__u8	state;
	char	comm[TASK_DIAG_COMM_LEN];
};

Another good feature of task_diag is the ability to request information
for several processes at once. Currently there are two strategies:
TASK_DIAG_DUMP_ALL	- get information for all tasks
TASK_DIAG_DUMP_CHILDREN	- get information for children of a specified task

task_diag is much faster than the proc file system. We don't need to
create a new file descriptor for each task. We only need to send a request
and read the response. This allows getting information for several tasks
in one request-response iteration.

I have compared the performance of procfs and task_diag for the
"ps ax -o pid,ppid" command.

The test system has 10348 processes.
$ ps ax -o pid,ppid | wc -l
10348

$ time ps ax -o pid,ppid > /dev/null

real	0m1.073s
user	0m0.086s
sys	0m0.903s

$ time ./task_diag_all > /dev/null

real	0m0.037s
user	0m0.004s
sys	0m0.020s

And here are statistics about the syscalls made by each command.
$ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
            20,713      syscalls:sys_exit_open
            20,710      syscalls:sys_exit_close
            20,708      syscalls:sys_exit_read
            10,348      syscalls:sys_exit_newstat
                31      syscalls:sys_exit_write

$ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
               114      syscalls:sys_exit_recvfrom
                49      syscalls:sys_exit_write
                 8      syscalls:sys_exit_mmap
                 4      syscalls:sys_exit_mprotect
                 3      syscalls:sys_exit_newfstat

You can find the test program from this experiment in the last patch.
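
For comparison, the per-task cost of the procfs path can be sketched as
follows. This is not how ps is implemented. It is a minimal example that
reads the parent PID of every task from /proc/<pid>/stat, so each task
costs at least one open, one read and one close. The ps numbers above show
roughly two opens per process, which fits this pattern. The helper names
read_ppid() and count_tasks() are made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the parent PID of one task from /proc/<pid>/stat.
 * Returns 0 on success, -1 on failure (e.g. the task has already exited).
 */
int read_ppid(pid_t pid, pid_t *ppid)
{
	char path[64], buf[512], *p;
	FILE *f;
	int val;

	assert(ppid != NULL);
	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	f = fopen(path, "r");		/* one open per task */
	if (!f)
		return -1;
	p = fgets(buf, sizeof(buf), f);	/* one read per task */
	fclose(f);			/* one close per task */
	if (!p)
		return -1;

	/* comm may contain spaces or ')', so parse after the last ')'. */
	p = strrchr(buf, ')');
	if (!p || sscanf(p + 1, " %*c %d", &val) != 1)
		return -1;
	*ppid = val;
	return 0;
}

/*
 * Walk /proc and read the parent PID of every numeric entry.
 * Returns the number of tasks read, or -1 if /proc cannot be opened.
 */
int count_tasks(void)
{
	DIR *d = opendir("/proc");
	struct dirent *de;
	int n = 0;

	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL) {
		pid_t ppid;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;
		if (read_ppid((pid_t)atoi(de->d_name), &ppid) == 0)
			n++;
	}
	closedir(d);
	return n;
}
```

With task_diag the same pid/ppid pairs arrive in a series of netlink
messages, so the number of syscalls no longer grows by several per task.
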

The idea of this functionality was suggested by Pavel Emelyanov (xemul@),
when he found that operations on /proc form a significant part of
checkpointing time.

Ten years ago there was an attempt to add a netlink interface for
accessing /proc information:
http://lwn.net/Articles/99600/

Signed-off-by: Andrey Vagin

git repo: https://github.com/avagin/linux-task-diag

Andrey Vagin (7):
  [RFC] kernel: add a netlink interface to get information about tasks
  kernel: move next_tgid from fs/proc
  task-diag: add ability to get information about all tasks
  task-diag: add a new group to get process credentials
  kernel: add ability to iterate children of a specified task
  task_diag: add ability to dump children
  selftest: check the task_diag functinonality

 fs/proc/array.c                                    |  58 +---
 fs/proc/base.c                                     |  43 ---
 include/linux/proc_fs.h                            |  13 +
 include/uapi/linux/taskdiag.h                      |  89 ++++++
 init/Kconfig                                       |  12 +
 kernel/Makefile                                    |   1 +
 kernel/pid.c                                       |  94 ++++++
 kernel/taskdiag.c                                  | 343 +++++++++++++++++++++
 tools/testing/selftests/task_diag/Makefile         |  16 +
 tools/testing/selftests/task_diag/task_diag.c      |  59 ++++
 tools/testing/selftests/task_diag/task_diag_all.c  |  82 +++++
 tools/testing/selftests/task_diag/task_diag_comm.c | 195 ++++++++++++
 tools/testing/selftests/task_diag/task_diag_comm.h |  47 +++
 tools/testing/selftests/task_diag/taskdiag.h       |   1 +
 14 files changed, 967 insertions(+), 86 deletions(-)
 create mode 100644 include/uapi/linux/taskdiag.h
 create mode 100644 kernel/taskdiag.c
 create mode 100644 tools/testing/selftests/task_diag/Makefile
 create mode 100644 tools/testing/selftests/task_diag/task_diag.c
 create mode 100644 tools/testing/selftests/task_diag/task_diag_all.c
 create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.c
 create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.h
 create mode 120000 tools/testing/selftests/task_diag/taskdiag.h

Cc: Oleg Nesterov
Cc: Andrew Morton
Cc: Cyrill Gorcunov
Cc: Pavel Emelyanov
Cc: Roger Luethi
--
2.1.0