2017-12-14 22:00:44

by Krzysztof Opasiak

Subject: [RFC PATCH v2 0/4] Add rlimit-resources change notification mechanism

This is a Proof of Concept and RFC of rlimit-events, a generic,
low-overhead notification mechanism for monitoring process-level
resources. This series is not ready for submission. Its main purpose is
to share the overall idea and collect feedback from the community.
All comments are very welcome.

Problem statement
=================

Linux runs tons of userspace software, and a big part of it
is imperfect. This is nothing new. People discovered this a long time ago
on their servers and that's why they introduced monitoring tools like
Nagios. Those tools are used not only to monitor system-wide resources
but also process-specific ones for critical services.

The basic idea of such a monitoring tool is to run a periodic check
(e.g. every 10 min) of a given resource's usage, collect the results and,
most importantly, perform some action (like sending an email to the admin)
when resource usage reaches a particular level.

This idea has a couple of disadvantages. First of all, one has no
information about what happens between the checks. The resource may go
above and/or below the notification level many times, but no action is
going to be performed. Secondly, this solution comes from the server
world. There, power usage is not a major problem, as power is not
as scarce a resource as in battery-powered devices.

The IoT world and servers seem to be very different, but they have at
least one common feature: they need to run imperfect software for
a long period of time with the best possible availability. That's why
we would like to monitor resource usage on IoT devices as well, while the
server world would still be able to benefit from the new solution.

In contrast to the server world, IoT devices are very concerned about
power usage. Their usage profile is also very different. Servers are very
busy machines (working under high load) while IoT devices usually just
do nothing, waiting for user stimulus. That's why, if we simply try to
reuse the server world solution, it turns out that devices often wake
up only to check the resource usage even though it has remained
unchanged for a number of hours. Not to mention that when the user comes
back home and starts playing with lights, music etc., resource usage may
change far more often than the polling period. To solve those issues
we need to replace polling with asynchronous notification about
resource usage changes.

Try #1 - use existing tools
===========================

First of all we tried to use existing tools, audit in particular.
We managed to implement a very simple version, but then we
discovered a couple of problems:

1) It's not possible to monitor resources which are not bound to
syscalls e.g. CPU

2) It's not possible to monitor number of open FDs if they are
allocated and returned from ioctl() or socket

3) Audit has a significant performance overhead. With only a single
audit rule in the system, which is not being triggered, the time
overhead for open() call on Odroid U3+ is 44.6% for a hot file (in
cache or virtual) and 33.34% for a cold file (on eMMC).

4) Audit slows down the entire system, not only the process that's
being monitored

Solution - rlimit-events
========================

To resolve audit-related problems we developed a kernel infrastructure
to notify userspace about reaching a particular level of resource
usage by a given process.

The main idea is to provide a userspace process with a file descriptor
which can be used to subscribe for a notification when the chosen
process reaches a given resource usage level.
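
To make "reaches a given resource usage level" concrete: in patch 2/4 an
event is generated when the watched level lies between the old and the new
usage value, in either direction. The following is only a minimal standalone
sketch of that test; level_crossed() and level_crossed_selfcheck() are
hypothetical names used for illustration and are not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical standalone copy of the comparison used in
 * rlimit_noti_res_changed(): fire when usage moves across "level".
 */
static bool level_crossed(uint64_t level, uint64_t old, uint64_t new_val)
{
	return (level > old && level <= new_val) ||	/* rising to/over level */
	       (level > new_val && level <= old);	/* falling below level */
}

/* Small self-check of the cases described above */
static void level_crossed_selfcheck(void)
{
	assert(level_crossed(10, 9, 10));	/* 9 -> 10 reaches the level */
	assert(level_crossed(10, 10, 9));	/* 10 -> 9 drops below it */
	assert(!level_crossed(10, 10, 11));	/* already at the level */
	assert(!level_crossed(10, 3, 5));	/* never gets near it */
}
```

Note that with this comparison a change from exactly the level to a higher
value does not generate an event, since the level was already reached.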

To provide a fully flexible solution we decided that a single process
may monitor multiple other processes and a single process can be
monitored by many other processes. One of the important design goals
was to minimize the performance overhead. That's why watchers are
not only installed on a per-process basis but every resource also has
a separate list of them. This limits the overhead not only to
the process that's being monitored but also to syscalls related to the
monitored resource (if you monitor only FD usage there is no performance
impact on memory-related syscalls). Using the same test as for audit,
our PoC achieved 1.58% overhead for a cold file and 5.63% for a hot file
(plus 4% overhead for each of them for a very simple counting of the
number of open file descriptors, which could be replaced with a counter).

Typical scenario:
1) Obtain a notification FD from the kernel via Netlink
(if someone has a better idea I will be happy to change this)

2) Issue ioctl() to add new watchers. Each ioctl() contains a triplet:
PID, Resource, Level

3) read() or poll() the notification FD. When the monitored resource's
usage of a process specified in 2) crosses the level set there, a
suitable event can be read from this FD
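
Steps 2) and 3) can be sketched from the userspace side as below. This is
only an illustration under stated assumptions: the structures are local
mirrors of the ones in include/uapi/linux/rlimit_noti.h from patch 2/4 (with
pid assumed to be a 32-bit int), and make_watch() and parse_res_changed()
are hypothetical helper names. Obtaining the FD and the actual ioctl(),
poll() and read() calls need a kernel with this series applied and are not
shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the UAPI structures from patch 2/4 (assumption) */
#define RLIMIT_EVENT_TYPE_RES_CHANGED 0

struct rlimit_noti_subject {
	int32_t pid;		/* assumed width of __kernel_pid_t */
	uint32_t resource;	/* RLIMIT_* index, e.g. RLIMIT_NOFILE */
} __attribute__((packed));

struct rlimit_noti_level {
	struct rlimit_noti_subject subj;
	uint64_t value;
	uint32_t flags;
} __attribute__((packed));

struct rlimit_event {
	uint32_t ev_type;
	uint32_t size;		/* header plus payload, in bytes */
} __attribute__((packed));

struct rlimit_event_res_changed {
	struct rlimit_noti_subject subj;
	uint64_t new_value;
} __attribute__((packed));

/* Step 2: build the (PID, Resource, Level) triplet passed to ioctl() */
static struct rlimit_noti_level make_watch(int32_t pid, uint32_t resource,
					   uint64_t level)
{
	struct rlimit_noti_level lvl;

	memset(&lvl, 0, sizeof(lvl));
	lvl.subj.pid = pid;
	lvl.subj.resource = resource;
	lvl.value = level;
	return lvl;
}

/*
 * Step 3: decode one event from a buffer filled by read().
 * Returns 0 on success, -1 if the buffer is short or the event unknown.
 */
static int parse_res_changed(const void *buf, size_t len,
			     struct rlimit_event_res_changed *out)
{
	struct rlimit_event hdr;

	assert(buf != NULL && out != NULL);

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));

	if (hdr.ev_type != RLIMIT_EVENT_TYPE_RES_CHANGED)
		return -1;
	if (hdr.size != sizeof(hdr) + sizeof(*out) || hdr.size > len)
		return -1;

	/* The payload directly follows the header */
	memcpy(out, (const uint8_t *)buf + sizeof(hdr), sizeof(*out));
	return 0;
}
```

In a real monitor the buffer passed to parse_res_changed() would come from
read() on the notification FD once poll() reports it readable, and the
result of make_watch() would be handed to the RLIMIT_ADD_NOTI_LVL ioctl.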

A sample test application can be found on my github:

https://github.com/kopasiak/rlimit-events-tests


Please share your opinion about the idea and the current design.
--
Best regards
Krzysztof Opasiak

Changes since v1:
- Reuse binder's file structure stored in proc
- Fix license issues
- Fix indentation
- Fix IOCTL definition
- Add attribute packed to ioctl structures

---
Krzysztof Opasiak (4):
sched: Allow to get() and put() signal struct
Add rlimit-events framework
Connect rlimit-events with process life cycle
Allow to trace fd usage with rlimit-events

Documentation/ioctl/ioctl-number.txt | 2 +
drivers/android/binder.c | 4 +-
fs/exec.c | 2 +-
fs/file.c | 82 +++-
fs/open.c | 2 +-
include/asm-generic/resource.h | 37 +-
include/linux/fdtable.h | 8 +-
include/linux/init_task.h | 1 +
include/linux/rlimit_noti_kern.h | 47 +++
include/linux/sched/signal.h | 19 +
include/uapi/linux/netlink.h | 1 +
include/uapi/linux/rlimit_noti.h | 81 ++++
init/Kconfig | 6 +
kernel/Makefile | 1 +
kernel/exit.c | 3 +
kernel/fork.c | 20 +-
kernel/rlimit_noti.c | 777 +++++++++++++++++++++++++++++++++++
17 files changed, 1062 insertions(+), 31 deletions(-)
create mode 100644 include/linux/rlimit_noti_kern.h
create mode 100644 include/uapi/linux/rlimit_noti.h
create mode 100644 kernel/rlimit_noti.c

--
2.9.3


2017-12-14 22:00:53

by Krzysztof Opasiak

Subject: [RFC PATCH v2 2/4] Add rlimit-events framework

Add a framework which allows userspace programs to be notified
about changes in resource usage (the same resources as in rlimits).

To monitor a process, a monitor FD has to be obtained from the
kernel using the rlimit-events netlink interface.
The monitor can then issue ioctl()s to subscribe to a particular
usage level of a given resource.
When the monitored process crosses the given usage level, the monitoring
fd becomes readable and a resource change event can be read from it.

It's possible to monitor multiple processes, and a single
process can be monitored by multiple other processes.

Signed-off-by: Krzysztof Opasiak <[email protected]>
---
Documentation/ioctl/ioctl-number.txt | 2 +
include/asm-generic/resource.h | 37 +-
include/linux/init_task.h | 1 +
include/linux/rlimit_noti_kern.h | 47 +++
include/linux/sched/signal.h | 6 +
include/uapi/linux/netlink.h | 1 +
include/uapi/linux/rlimit_noti.h | 81 ++++
init/Kconfig | 6 +
kernel/Makefile | 1 +
kernel/rlimit_noti.c | 777 +++++++++++++++++++++++++++++++++++
10 files changed, 958 insertions(+), 1 deletion(-)
create mode 100644 include/linux/rlimit_noti_kern.h
create mode 100644 include/uapi/linux/rlimit_noti.h
create mode 100644 kernel/rlimit_noti.c

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 3e3fdae5f3ed..f1dbaac0223f 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -315,6 +315,8 @@ Code Seq#(hex) Include File Comments
0xAC 00-1F linux/raw.h
0xAD 00 Netfilter device in development:
<mailto:[email protected]>
+0xAD 01-1F rlimit-events in development:
+ <mailto:[email protected]>
0xAE all linux/kvm.h Kernel-based Virtual Machine
<mailto:[email protected]>
0xAF 00-1F linux/fsl_hypervisor.h Freescale hypervisor
diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 8874f681b056..aabbe0347fa2 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -3,7 +3,7 @@
#define _ASM_GENERIC_RESOURCE_H

#include <uapi/asm-generic/resource.h>
-
+#include <linux/spinlock.h>

/*
* boot-time rlimit defaults for the init task:
@@ -28,4 +28,39 @@
[RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
}

+#ifdef CONFIG_RLIMIT_NOTIFICATION
+
+#define INIT_RLIMIT_WATCHER(watchers, limit) \
+ [limit] = LIST_HEAD_INIT(watchers[limit])
+
+#define INIT_RLIMIT_WATCHERS(watchers) \
+{ \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_CPU), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_FSIZE), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_DATA), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_STACK), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_CORE), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_RSS), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_NPROC), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_NOFILE), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_MEMLOCK), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_AS), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_LOCKS), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_SIGPENDING), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_MSGQUEUE), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_NICE), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_RTPRIO), \
+ INIT_RLIMIT_WATCHER(watchers, RLIMIT_RTTIME), \
+}
+
+#define INIT_RLIMIT_EVENTS_CTX(sig) \
+.rlimit_events_ctx = { \
+ .lock = __SPIN_LOCK_UNLOCKED(sig.rlimit_events_ctx.lock), \
+ .watchers = INIT_RLIMIT_WATCHERS(sig.rlimit_events_ctx.watchers),\
+ .process_dead = 0, \
+},
+#else
+#define INIT_RLIMIT_EVENTS_CTX(sig)
+#endif /* CONFIG_RLIMIT_NOTIFICATION */
+
#endif
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 8062e6cc607c..a569d6a07cfd 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -77,6 +77,7 @@ extern struct fs_struct init_fs;
INIT_POSIX_TIMERS(sig) \
INIT_CPU_TIMERS(sig) \
.rlim = INIT_RLIMITS, \
+ INIT_RLIMIT_EVENTS_CTX(sig) \
INIT_CPUTIMER(sig) \
INIT_PREV_CPUTIME(sig) \
.cred_guard_mutex = \
diff --git a/include/linux/rlimit_noti_kern.h b/include/linux/rlimit_noti_kern.h
new file mode 100644
index 000000000000..c67e3a93ad34
--- /dev/null
+++ b/include/linux/rlimit_noti_kern.h
@@ -0,0 +1,47 @@
+/*
+ * Released under the GPLv2 only.
+ * SPDX-License-Identifier: GPL-2.0
+ */
+
+#ifndef _LINUX_RLIMIT_NOTI_H_
+#define _LINUX_RLIMIT_NOTI_H_
+
+#include <uapi/linux/rlimit_noti.h>
+
+struct rlimit_noti_ctx {
+ /* for modification protection */
+ spinlock_t lock;
+ /* protected by RCU */
+ struct list_head watchers[RLIM_NLIMITS];
+
+ unsigned process_dead:1;
+};
+
+#ifdef CONFIG_RLIMIT_NOTIFICATION
+
+int rlimit_noti_task_fork(struct task_struct *parent,
+ struct task_struct *child);
+
+void rlimit_noti_task_exit(struct task_struct *tsk);
+
+int rlimit_noti_watch_active(struct task_struct *tsk, unsigned int res);
+
+void rlimit_noti_res_changed(struct task_struct *tsk, unsigned int res,
+ uint64_t old, uint64_t new);
+
+#else
+
+static inline int rlimit_noti_watch_active(struct task_struct *tsk,
+ unsigned int res)
+{
+ return 0;
+}
+
+static inline void rlimit_noti_res_changed(struct task_struct *tsk,
+ unsigned int res,
+ uint64_t old, uint64_t new)
+{
+}
+
+#endif /* CONFIG_RLIMIT_NOTIFICATION */
+#endif /* _LINUX_RLIMIT_NOTI_H_ */
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 78c0c337c130..be4d847ce7b5 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -8,7 +8,9 @@
#include <linux/sched/jobctl.h>
#include <linux/sched/task.h>
#include <linux/cred.h>
+#include <linux/list.h>

+#include <linux/rlimit_noti_kern.h>
/*
* Types defining task->signal and task->sighand and APIs using them:
*/
@@ -198,6 +200,10 @@ struct signal_struct {
*/
struct rlimit rlim[RLIM_NLIMITS];

+#ifdef CONFIG_RLIMIT_NOTIFICATION
+ struct rlimit_noti_ctx rlimit_events_ctx;
+#endif
+
#ifdef CONFIG_BSD_PROCESS_ACCT
struct pacct_struct pacct; /* per-process accounting information */
#endif
diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index 776bc92e9118..84773accf1f8 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -29,6 +29,7 @@
#define NETLINK_RDMA 20
#define NETLINK_CRYPTO 21 /* Crypto layer */
#define NETLINK_SMC 22 /* SMC monitoring */
+#define NETLINK_RLIMIT_EVENTS 23 /* rlimit notification */

#define NETLINK_INET_DIAG NETLINK_SOCK_DIAG

diff --git a/include/uapi/linux/rlimit_noti.h b/include/uapi/linux/rlimit_noti.h
new file mode 100644
index 000000000000..cc225db9f607
--- /dev/null
+++ b/include/uapi/linux/rlimit_noti.h
@@ -0,0 +1,81 @@
+/*
+ * Released under the GPLv2 only.
+ * SPDX-License-Identifier: GPL-2.0
+ */
+
+#ifndef _UAPI_LINUX_RLIMIT_NOTI_H_
+#define _UAPI_LINUX_RLIMIT_NOTI_H_
+
+#ifdef __KERNEL__
+#include <linux/resource.h>
+#else
+#include <stdint.h>
+#endif
+
+#include <linux/types.h>
+#include <asm/ioctl.h>
+
+#define RLIMIT_GET_NOTI_FD 1000
+
+/* Event types */
+enum {
+ RLIMIT_EVENT_TYPE_RES_CHANGED,
+ RLIMIT_EVENT_TYPE_MAX
+};
+
+struct rlimit_noti_subject {
+ __kernel_pid_t pid;
+ __u32 resource;
+} __attribute__((packed));
+
+struct rlimit_noti_level {
+ struct rlimit_noti_subject subj;
+ __u64 value;
+ __u32 flags;
+} __attribute__((packed));
+
+struct rlimit_event {
+ __u32 ev_type;
+ __u32 size;
+} __attribute__((packed));
+
+struct rlimit_event_res_changed {
+ struct rlimit_noti_subject subj;
+ __u64 new_value;
+} __attribute__((packed));
+
+#define RLIMIT_NOTI_IOCTL_CODE 0xAD
+#define RLIMIT_NOTI_IOCTL_SEQ_BASE 0x01
+
+/* ioctl's */
+#define RLIMIT_ADD_NOTI_LVL _IOW(RLIMIT_NOTI_IOCTL_CODE, \
+ RLIMIT_NOTI_IOCTL_SEQ_BASE + 0, \
+ struct rlimit_noti_level)
+
+#define RLIMIT_RM_NOTI_LVL _IOW(RLIMIT_NOTI_IOCTL_CODE, \
+ RLIMIT_NOTI_IOCTL_SEQ_BASE + 1, \
+ struct rlimit_noti_level)
+
+#define RLIMIT_SET_NOTI_ALL _IOW(RLIMIT_NOTI_IOCTL_CODE, \
+ RLIMIT_NOTI_IOCTL_SEQ_BASE + 2, \
+ struct rlimit_noti_subject)
+
+#define RLIMIT_CLEAR_NOTI_ALL _IOW(RLIMIT_NOTI_IOCTL_CODE, \
+ RLIMIT_NOTI_IOCTL_SEQ_BASE + 3, \
+ struct rlimit_noti_subject)
+
+/*
+ * For future:
+ *
+ * list registered noti levels:
+ * #define RLIMIT_GET_NOTI_LVLS
+ *
+ * notify every 5, 10 units change:
+ * #define RLIMIT_SET_NOTI_STEP
+ */
+
+/* Flags for ioctl's */
+#define RLIMIT_FLAG_NO_INHERIT (1u << 0)
+
+
+#endif /* _UAPI_LINUX_RLIMIT_NOTI_H_ */
diff --git a/init/Kconfig b/init/Kconfig
index 3c1faaa2af4a..95cebc570795 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -332,6 +332,12 @@ config AUDIT_TREE
depends on AUDITSYSCALL
select FSNOTIFY

+config RLIMIT_NOTIFICATION
+ bool "Support fd notifications on given resource usage"
+ depends on NET
+ help
+ Enable this to monitor changes in process resource usage via an fd.
+
source "kernel/irq/Kconfig"
source "kernel/time/Kconfig"

diff --git a/kernel/Makefile b/kernel/Makefile
index 172d151d429c..8f4be71aaf78 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -78,6 +78,7 @@ obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
obj-$(CONFIG_AUDITSYSCALL) += auditsc.o
obj-$(CONFIG_AUDIT_WATCH) += audit_watch.o audit_fsnotify.o
obj-$(CONFIG_AUDIT_TREE) += audit_tree.o
+obj-$(CONFIG_RLIMIT_NOTIFICATION) += rlimit_noti.o
obj-$(CONFIG_GCOV_KERNEL) += gcov/
obj-$(CONFIG_KCOV) += kcov.o
obj-$(CONFIG_KPROBES) += kprobes.o
diff --git a/kernel/rlimit_noti.c b/kernel/rlimit_noti.c
new file mode 100644
index 000000000000..b57ca9d98d2f
--- /dev/null
+++ b/kernel/rlimit_noti.c
@@ -0,0 +1,777 @@
+/*
+ * Released under the GPLv2 only.
+ * SPDX-License-Identifier: GPL-2.0
+ *
+ * Netlink communication strongly based on audit.c.
+ */
+
+#include <linux/rlimit_noti.h>
+
+#include <net/sock.h>
+#include <net/netlink.h>
+#include <linux/skbuff.h>
+#include <net/netns/generic.h>
+
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/wait.h>
+#include <linux/list.h>
+#include <linux/sched.h>
+#include <linux/pid.h>
+#include <linux/pid_namespace.h>
+#include <linux/anon_inodes.h>
+#include <linux/sched/signal.h>
+#include <linux/spinlock.h>
+
+#define sig_watchers(sig) ((sig)->rlimit_events_ctx.watchers)
+
+#define sig_for_each_res(wlist, sig) \
+ for (wlist = &sig_watchers(sig)[0]; \
+ wlist - &sig_watchers(sig)[0] < ARRAY_SIZE(sig_watchers(sig)); \
+ ++wlist)
+
+struct rlimit_event_list {
+ struct rlimit_event ev;
+ union {
+ struct rlimit_event_res_changed rchanged;
+ } event_data;
+ struct list_head node;
+};
+
+#define MAX_RLIMIT_EVENT_SIZE ({ \
+ struct rlimit_event_list *_rl = NULL; \
+ sizeof(_rl->ev) + sizeof(_rl->event_data); \
+})
+
+struct rlimit_watch_fd_ctx {
+ struct kref kref;
+
+ spinlock_t noti_ctx_lock;
+ struct list_head watchers;
+ unsigned fd_invalid:1;
+
+ spinlock_t events_lock;
+ wait_queue_head_t events_queue;
+ struct list_head events;
+};
+
+struct rlimit_watcher {
+ struct rcu_head rcu;
+ struct rlimit_watch_fd_ctx *ctx;
+ struct signal_struct *signal;
+
+ struct list_head tsk_node;
+ struct list_head ctx_node;
+
+ uint64_t value;
+ unsigned noti_all_changes:1;
+};
+
+/******************************************************************************
+ * Public API
+ ******************************************************************************/
+
+static void release_ctx(struct kref *kref)
+{
+ struct rlimit_watch_fd_ctx *ctx =
+ container_of(kref, struct rlimit_watch_fd_ctx, kref);
+
+ kfree(ctx);
+}
+
+static struct rlimit_watcher *alloc_rlimit_watcher(
+ struct rlimit_watch_fd_ctx *ctx, struct signal_struct *signal,
+ uint64_t value, bool noti_all)
+{
+ struct rlimit_watcher *w;
+
+ w = kzalloc(sizeof(*w), GFP_ATOMIC);
+ if (!w)
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&w->tsk_node);
+ INIT_LIST_HEAD(&w->ctx_node);
+
+ w->ctx = ctx;
+ kref_get(&ctx->kref);
+ w->signal = signal;
+ get_signal_struct(signal);
+ w->value = value;
+ w->noti_all_changes = noti_all;
+
+ return w;
+}
+
+static void free_rlimit_watcher(struct rlimit_watcher *w)
+{
+ if (!w)
+ return;
+
+ kref_put(&w->ctx->kref, release_ctx);
+ put_signal_struct(w->signal);
+ kfree(w);
+}
+
+static void free_rlimit_watcher_rcu(struct rcu_head *head)
+{
+ free_rlimit_watcher(container_of(head, struct rlimit_watcher, rcu));
+}
+
+static inline struct rlimit_watcher *rlimit_watcher_dup(
+ struct rlimit_watcher *org, struct task_struct *new_owner)
+{
+ return alloc_rlimit_watcher(org->ctx, new_owner->signal, org->value,
+ org->noti_all_changes);
+}
+
+/* This is not called for threads */
+int rlimit_noti_task_fork(struct task_struct *parent, struct task_struct *child)
+{
+ struct rlimit_watcher *w, *nw;
+ struct signal_struct *sig = child->signal;
+ unsigned long flags;
+ struct list_head *iter;
+ int ret;
+
+ /*
+ * init all list to avoid leaving uninitialized lists
+ * in case of error
+ */
+ sig_for_each_res(iter, sig)
+ INIT_LIST_HEAD(iter);
+
+ spin_lock_init(&sig->rlimit_events_ctx.lock);
+ sig->rlimit_events_ctx.process_dead = 0;
+
+ /* Lock the list to be safe against modification */
+ spin_lock_irqsave(&parent->signal->rlimit_events_ctx.lock, flags);
+
+ sig_for_each_res(iter, sig)
+ list_for_each_entry(w, iter, tsk_node) {
+ nw = rlimit_watcher_dup(w, child);
+ if (IS_ERR(nw)) {
+ spin_unlock_irqrestore(
+ &parent->signal->rlimit_events_ctx.lock,
+ flags);
+ ret = -ENOMEM;
+ goto cleanup;
+ }
+
+ /*
+ * For now we put this only on task side list
+ * to avoid deadlock (ABBA)
+ *
+ * We assume that no one can access this new task
+ * for now so we don't use any locking here
+ */
+ list_add_tail_rcu(&nw->tsk_node, iter);
+ }
+
+ /*
+ * now we got all watchers on our brand new list so we can release
+ * parent lock and allow modification of its list
+ */
+ spin_unlock_irqrestore(&parent->signal->rlimit_events_ctx.lock, flags);
+
+ sig_for_each_res(iter, sig) {
+start_again:
+ rcu_read_lock();
+ list_for_each_entry_rcu(w, iter, tsk_node) {
+ spin_lock_irqsave(&w->ctx->noti_ctx_lock, flags);
+ if (list_empty(&w->ctx_node)) {
+ if (!w->ctx->fd_invalid) {
+ list_add_tail(&w->ctx_node,
+ &w->ctx->watchers);
+ } else {
+ spin_lock(&sig->rlimit_events_ctx.lock);
+ list_del_rcu(&w->tsk_node);
+ call_rcu(&w->rcu,
+ free_rlimit_watcher_rcu);
+ spin_unlock(
+ &sig->rlimit_events_ctx.lock);
+ rcu_read_unlock();
+ goto start_again;
+ }
+ }
+ spin_unlock_irqrestore(&w->ctx->noti_ctx_lock, flags);
+ }
+ rcu_read_unlock();
+ }
+
+ return 0;
+cleanup:
+ sig_for_each_res(iter, sig) {
+ while (!list_empty(iter)) {
+ w = list_first_entry(iter,
+ struct rlimit_watcher, tsk_node);
+ list_del_init(&w->tsk_node);
+ call_rcu(&w->rcu, free_rlimit_watcher_rcu);
+ }
+ }
+ return ret;
+}
+
+void rlimit_noti_task_exit(struct task_struct *tsk)
+{
+ struct rlimit_watcher *w;
+ struct rlimit_noti_ctx *n_ctx = &tsk->signal->rlimit_events_ctx;
+ unsigned long flags;
+ struct list_head *head;
+
+ if (tsk != tsk->group_leader)
+ return;
+
+ /*
+ * Let's mark that we are in the middle of cleaning up
+ * to prevent new watchers from being added to the list
+ */
+ spin_lock_irqsave(&n_ctx->lock, flags);
+ WARN_ON(n_ctx->process_dead);
+ n_ctx->process_dead = true;
+ spin_unlock_irqrestore(&n_ctx->lock, flags);
+
+ sig_for_each_res(head, tsk->signal) {
+ /*
+ * Let's go through the list and remove watchers from respective
+ * fd contexts.
+ */
+ rcu_read_lock();
+ list_for_each_entry_rcu(w, head, tsk_node) {
+ spin_lock_irqsave(&w->ctx->noti_ctx_lock, flags);
+ /*
+ * List empty means that between iteration and acquiring
+ * lock this watcher has been already removed and
+ * it's just hanging due to grace period
+ */
+ if (!list_empty(&w->ctx_node)
+ && !list_empty(&w->tsk_node))
+ list_del_init(&w->ctx_node);
+
+ spin_unlock_irqrestore(&w->ctx->noti_ctx_lock, flags);
+ }
+ rcu_read_unlock();
+
+ /* Now let's cleanup our list */
+ spin_lock_irqsave(&n_ctx->lock, flags);
+ while (!list_empty(head)) {
+ w = list_first_entry(head,
+ struct rlimit_watcher, tsk_node);
+ list_del_rcu(&w->tsk_node);
+ call_rcu(&w->rcu, free_rlimit_watcher_rcu);
+ }
+ spin_unlock_irqrestore(&n_ctx->lock, flags);
+ }
+}
+
+static int rlimit_generate_res_changed_event(struct rlimit_watch_fd_ctx *ctx,
+ struct task_struct *tsk,
+ unsigned int resource,
+ uint64_t new, int mflags)
+{
+ struct rlimit_event_list *ev_list;
+ unsigned long flags;
+
+ ev_list = kzalloc(sizeof(*ev_list), mflags);
+ if (!ev_list)
+ return -ENOMEM;
+
+ ev_list->ev.ev_type = RLIMIT_EVENT_TYPE_RES_CHANGED;
+ ev_list->ev.size = sizeof(struct rlimit_event)
+ + sizeof(struct rlimit_event_res_changed);
+
+ /* TODO add here support for PID namespace */
+ ev_list->event_data.rchanged.subj.pid = tsk->pid;
+ ev_list->event_data.rchanged.subj.resource = resource;
+
+ ev_list->event_data.rchanged.new_value = new;
+
+ INIT_LIST_HEAD(&ev_list->node);
+
+ spin_lock_irqsave(&ctx->events_lock, flags);
+ list_add_tail(&ev_list->node, &ctx->events);
+ wake_up_interruptible(&ctx->events_queue);
+ spin_unlock_irqrestore(&ctx->events_lock, flags);
+
+ return 0;
+}
+
+int rlimit_noti_watch_active(struct task_struct *tsk, unsigned int res)
+{
+ return !list_empty(&tsk->signal->rlimit_events_ctx.watchers[res]);
+}
+
+void rlimit_noti_res_changed(struct task_struct *tsk, unsigned int res,
+ uint64_t old, uint64_t new)
+{
+ struct rlimit_watcher *w;
+ struct signal_struct *signal = tsk->signal;
+
+ rcu_read_lock();
+ /* TODO this should be replaced with sth faster */
+ list_for_each_entry_rcu(w, &signal->rlimit_events_ctx.watchers[res],
+ tsk_node)
+ if (w->noti_all_changes ||
+ (w->value > old && w->value <= new) ||
+ (w->value > new && w->value <= old)) {
+ /* ignore error as there is nothing we can do */
+ rlimit_generate_res_changed_event(w->ctx, tsk,
+ res, new, GFP_ATOMIC);
+ }
+ rcu_read_unlock();
+}
+
+/******************************************************************************
+ * FD part
+ ******************************************************************************/
+
+static int add_new_watcher(struct rlimit_watch_fd_ctx *ctx,
+ struct task_struct *tsk,
+ int resource, uint64_t value, bool noti_all)
+{
+ struct rlimit_watcher *w;
+ struct signal_struct *signal;
+ unsigned long flags;
+ int ret = 0;
+
+ if (resource < 0 || resource >= RLIM_NLIMITS)
+ return -EINVAL;
+
+ read_lock(&tasklist_lock);
+ if (!tsk->sighand) {
+ ret = -ESRCH;
+ goto unlock_read;
+ }
+
+ task_lock(tsk->group_leader);
+ signal = tsk->signal;
+
+ w = alloc_rlimit_watcher(ctx, signal, value, noti_all);
+ if (IS_ERR(w)) {
+ ret = PTR_ERR(w);
+ goto unlock_group_leader;
+ }
+
+ spin_lock_irqsave(&ctx->noti_ctx_lock, flags);
+ /*
+ * First add it to ctx list as we are holding its lock
+ * and no one is going to modify or iterate it
+ */
+ list_add_tail(&w->ctx_node, &ctx->watchers);
+ /* Now let's lock process side lock and add this to rcu protected list */
+ spin_lock(&signal->rlimit_events_ctx.lock);
+
+ /* If process is in the middle of cleanup let's rollback everything */
+ if (!signal->rlimit_events_ctx.process_dead) {
+ list_add_tail_rcu(&w->tsk_node,
+ &signal->rlimit_events_ctx.watchers[resource]);
+ ret = 0;
+ } else {
+ list_del(&w->ctx_node);
+ free_rlimit_watcher(w);
+ ret = -ENOENT;
+ }
+
+ spin_unlock(&signal->rlimit_events_ctx.lock);
+ spin_unlock_irqrestore(&ctx->noti_ctx_lock, flags);
+unlock_group_leader:
+ task_unlock(tsk->group_leader);
+unlock_read:
+ read_unlock(&tasklist_lock);
+
+ return ret;
+}
+
+ssize_t rlimit_noti_read_event(struct file *file, char __user *buf,
+ size_t size, loff_t *ptr)
+{
+ struct rlimit_watch_fd_ctx *ctx = file->private_data;
+ struct rlimit_event_list *ev_list;
+ unsigned long flags;
+ size_t ret;
+
+ /* TODO allow to read only part of event */
+ if (size < MAX_RLIMIT_EVENT_SIZE)
+ return -EINVAL;
+
+ spin_lock_irqsave(&ctx->events_lock, flags);
+#define READ_COND (!list_empty(&ctx->events))
+ while (!READ_COND) {
+ spin_unlock_irqrestore(&ctx->events_lock, flags);
+
+ if (wait_event_interruptible(ctx->events_queue, READ_COND))
+ return -ERESTARTSYS;
+ spin_lock_irqsave(&ctx->events_lock, flags);
+ }
+#undef READ_COND
+
+ ev_list = list_first_entry(&ctx->events,
+ struct rlimit_event_list, node);
+ list_del(&ev_list->node);
+ spin_unlock_irqrestore(&ctx->events_lock, flags);
+
+ /* a faulting user buffer is reported as -EFAULT */
+ ret = ev_list->ev.size;
+ if (copy_to_user(buf, &ev_list->ev, ev_list->ev.size))
+ ret = -EFAULT;
+
+ kfree(ev_list);
+
+ return ret;
+}
+
+
+unsigned int rlimit_noti_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct rlimit_watch_fd_ctx *ctx = file->private_data;
+ unsigned int mask = POLLWRNORM;
+ unsigned long flags;
+
+ poll_wait(file, &ctx->events_queue, wait);
+
+ spin_lock_irqsave(&ctx->events_lock, flags);
+ if (!list_empty(&ctx->events))
+ mask |= POLLIN;
+
+ /* TODO add notification when last process exited */
+ spin_unlock_irqrestore(&ctx->events_lock, flags);
+
+ return mask;
+}
+
+
+static long rlimit_noti_ioctl(struct file *file,
+ unsigned int cmd, unsigned long arg)
+{
+ struct rlimit_watch_fd_ctx *ctx = file->private_data;
+ struct task_struct *tsk;
+ struct rlimit_noti_level nlvl;
+ bool noti_all = false;
+ int ret;
+
+ switch (cmd) {
+ case RLIMIT_SET_NOTI_ALL:
+ if (copy_from_user(&nlvl.subj,
+ (void __user *)arg, sizeof(nlvl.subj)))
+ return -EFAULT;
+
+ nlvl.value = 0;
+ noti_all = true;
+ goto set_watch;
+
+ case RLIMIT_ADD_NOTI_LVL:
+ if (copy_from_user(&nlvl, (void __user *)arg, sizeof(nlvl)))
+ return -EFAULT;
+set_watch:
+ rcu_read_lock();
+ tsk = find_task_by_vpid(nlvl.subj.pid);
+ if (!tsk) {
+ rcu_read_unlock();
+ printk(KERN_DEBUG "No PID in current NS\n");
+ return -EINVAL;
+ }
+
+ get_task_struct(tsk);
+ rcu_read_unlock();
+
+ /* TODO check for duplicates before adding */
+ ret = add_new_watcher(ctx, tsk, nlvl.subj.resource,
+ nlvl.value, noti_all);
+ put_task_struct(tsk);
+ break;
+
+ case RLIMIT_CLEAR_NOTI_ALL:
+ case RLIMIT_RM_NOTI_LVL:
+
+ /* TODO: Implement me */
+ ret = -EOPNOTSUPP;
+ break;
+ default:
+ ret = -EINVAL;
+ }
+
+ return ret;
+}
+
+static int rlimit_noti_release(struct inode *inode, struct file *file)
+{
+ struct rlimit_watch_fd_ctx *ctx = file->private_data;
+ struct rlimit_watcher *w;
+ struct rlimit_event_list *ev_list;
+ unsigned long flags;
+
+ /* Clean up watchers */
+ spin_lock_irqsave(&ctx->noti_ctx_lock, flags);
+ ctx->fd_invalid = 1;
+ list_for_each_entry(w, &ctx->watchers, ctx_node) {
+ spin_lock(&w->signal->rlimit_events_ctx.lock);
+ list_del_rcu(&w->tsk_node);
+ spin_unlock(&w->signal->rlimit_events_ctx.lock);
+ }
+
+ while (!list_empty(&ctx->watchers)) {
+ w = list_first_entry(&ctx->watchers,
+ struct rlimit_watcher, ctx_node);
+ list_del_init(&w->ctx_node);
+ call_rcu(&w->rcu, free_rlimit_watcher_rcu);
+ }
+
+ spin_unlock_irqrestore(&ctx->noti_ctx_lock, flags);
+
+ /* to ensure that no more events will be generated */
+ synchronize_rcu();
+
+ spin_lock_irqsave(&ctx->events_lock, flags);
+ while (!list_empty(&ctx->events)) {
+ ev_list = list_first_entry(&ctx->events,
+ struct rlimit_event_list, node);
+ list_del(&ev_list->node);
+ kfree(ev_list);
+ }
+ spin_unlock_irqrestore(&ctx->events_lock, flags);
+
+ kref_put(&ctx->kref, release_ctx);
+
+ return 0;
+}
+
+static const struct file_operations rlimit_noti_fops = {
+ .read = rlimit_noti_read_event,
+ .release = rlimit_noti_release,
+ .poll = rlimit_noti_poll,
+ .unlocked_ioctl = rlimit_noti_ioctl,
+};
+
+static int rlimit_noti_create_fd(void)
+{
+ struct rlimit_watch_fd_ctx *ctx;
+ int ret;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ kref_init(&ctx->kref);
+ spin_lock_init(&ctx->noti_ctx_lock);
+ INIT_LIST_HEAD(&ctx->watchers);
+ spin_lock_init(&ctx->events_lock);
+ INIT_LIST_HEAD(&ctx->events);
+ init_waitqueue_head(&ctx->events_queue);
+
+ ret = anon_inode_getfd("rlimit_noti", &rlimit_noti_fops, ctx, 0);
+ if (ret < 0)
+ goto put_ctx;
+
+ return ret;
+put_ctx:
+ kref_put(&ctx->kref, release_ctx);
+ return ret;
+}
+
+
+
+/******************************************************************************
+ * netlink part
+ ******************************************************************************/
+
+
+/* private rlimit_noti network namespace index */
+static unsigned int rlimit_noti_net_id;
+
+struct rlimit_noti_net {
+ struct sock *sk;
+};
+
+struct rlimit_noti_reply {
+ __u32 portid;
+ struct net *net;
+ struct sk_buff *skb;
+};
+
+static struct sock *rlimit_noti_get_socket(const struct net *net)
+{
+ struct rlimit_noti_net *rn_net;
+
+ if (!net)
+ return NULL;
+
+ rn_net = net_generic(net, rlimit_noti_net_id);
+ return rn_net->sk;
+}
+
+static struct sk_buff *rlimit_noti_make_reply(int seq, int type,
+ void *payload, int size)
+{
+ struct sk_buff *skb;
+ struct nlmsghdr *nl_header;
+ int flags = 0;
+
+ skb = nlmsg_new(size, GFP_KERNEL);
+ if (!skb)
+ return NULL;
+
+ nl_header = nlmsg_put(skb, 0, seq, type, size, flags);
+ if (!nl_header)
+ goto free_skb;
+
+ memcpy(nlmsg_data(nl_header), payload, size);
+
+ return skb;
+
+free_skb:
+ kfree_skb(skb);
+ return NULL;
+}
+
+static int rlimit_noti_send_reply_thread(void *arg)
+{
+ struct rlimit_noti_reply *reply = arg;
+ struct sock *sk = rlimit_noti_get_socket(reply->net);
+
+ /*
+ * Ignore failure. It'll only happen if the sender goes away,
+ * because our timeout is set to infinite.
+ */
+ netlink_unicast(sk, reply->skb, reply->portid, 0);
+ put_net(reply->net);
+ kfree(reply);
+ return 0;
+}
+
+static void rlimit_noti_send_reply(struct sk_buff *request_skb, int seq,
+ int type, void *payload, int size)
+{
+ struct net *net = sock_net(NETLINK_CB(request_skb).sk);
+ struct sk_buff *skb;
+ struct task_struct *tsk;
+ struct rlimit_noti_reply *reply;
+
+ reply = kmalloc(sizeof(*reply), GFP_KERNEL);
+ if (!reply)
+ return;
+
+ skb = rlimit_noti_make_reply(seq, type, payload, size);
+ if (!skb)
+ goto out;
+
+ reply->net = get_net(net);
+ reply->portid = NETLINK_CB(request_skb).portid;
+ reply->skb = skb;
+
+ tsk = kthread_run(rlimit_noti_send_reply_thread, reply,
+ "rlimit_noti_send_reply");
+ if (!IS_ERR(tsk))
+ return;
+ kfree_skb(skb);
+out:
+ kfree(reply);
+}
+
+static int rlimit_noti_netlink_ok(struct sk_buff *skb, u16 msg_type)
+{
+ /* TODO: put here some security and namespace checks */
+ return 0;
+}
+
+static int rlimit_noti_receive_msg(struct sk_buff *skb,
+ struct nlmsghdr *nl_header)
+{
+ u32 seq_nb = nl_header->nlmsg_seq;
+ u16 msg_type = nl_header->nlmsg_type;
+ int ret;
+
+ ret = rlimit_noti_netlink_ok(skb, msg_type);
+ if (ret)
+ return ret;
+
+ switch (msg_type) {
+ case RLIMIT_GET_NOTI_FD: {
+ int fd = 10;
+
+ fd = rlimit_noti_create_fd();
+ if (fd < 0) {
+ ret = fd;
+ goto out;
+ }
+ rlimit_noti_send_reply(skb, seq_nb, RLIMIT_GET_NOTI_FD,
+ &fd, sizeof(fd));
+ ret = 0;
+ break;
+ }
+ default:
+ ret = -EINVAL;
+ break;
+ }
+out:
+ return ret;
+}
+
+static void rlimit_noti_netlink_receive(struct sk_buff *skb)
+{
+ struct nlmsghdr *nl_header;
+ int len, ret;
+
+ nl_header = nlmsg_hdr(skb);
+ len = skb->len;
+
+ while (nlmsg_ok(nl_header, len)) {
+ ret = rlimit_noti_receive_msg(skb, nl_header);
+ /* if err or if this message says it wants a response */
+ if (ret || (nl_header->nlmsg_flags & NLM_F_ACK))
+ netlink_ack(skb, nl_header, ret, NULL);
+
+ nl_header = nlmsg_next(nl_header, &len);
+ }
+}
+
+static int rlimit_noti_netlink_bind(struct net *net, int group)
+{
+ /* For now we allow everyone but maybe this should be limited? */
+ return 0;
+}
+
+static int __net_init rlimit_noti_net_init(struct net *net)
+{
+ struct netlink_kernel_cfg cfg = {
+ .input = rlimit_noti_netlink_receive,
+ .bind = rlimit_noti_netlink_bind,
+ .flags = NL_CFG_F_NONROOT_RECV,
+ .groups = 1, /* Just one, the default */
+ };
+ struct rlimit_noti_net *rn_net = net_generic(net, rlimit_noti_net_id);
+
+ rn_net->sk = netlink_kernel_create(net, NETLINK_RLIMIT_EVENTS, &cfg);
+ if (rn_net->sk == NULL) {
+ printk(KERN_ERR
+ "cannot initialize netlink socket in namespace\n");
+ return -ENOMEM;
+ }
+ rn_net->sk->sk_sndtimeo = MAX_SCHEDULE_TIMEOUT;
+
+ return 0;
+
+}
+
+static void __net_exit rlimit_noti_net_exit(struct net *net)
+{
+ struct rlimit_noti_net *rn_net = net_generic(net, rlimit_noti_net_id);
+
+ netlink_kernel_release(rn_net->sk);
+}
+
+static struct pernet_operations rlimit_noti_net_ops __net_initdata = {
+ .init = rlimit_noti_net_init,
+ .exit = rlimit_noti_net_exit,
+ .id = &rlimit_noti_net_id,
+ .size = sizeof(struct rlimit_noti_net),
+};
+
+static int __init rlimit_noti_init(void)
+{
+ return register_pernet_subsys(&rlimit_noti_net_ops);
+}
+late_initcall(rlimit_noti_init);
+
+static void __exit rlimit_noti_exit(void)
+{
+ unregister_pernet_subsys(&rlimit_noti_net_ops);
+}
--
2.9.3

2017-12-14 22:01:12

by Krzysztof Opasiak

Subject: [RFC PATCH v2 4/4] Allow to trace fd usage with rlimit-events

Add rlimit-events calls to the file descriptor management
code to allow tracing of FD usage.

This allows a userspace process (the monitor) to get a notification when
another process (the subject) uses a given number of file descriptors.

This can be used, for example, to asynchronously monitor the number
of open FDs in system services instead of polling at a
predefined interval.

Signed-off-by: Krzysztof Opasiak <[email protected]>
---
drivers/android/binder.c | 4 +--
fs/exec.c | 2 +-
fs/file.c | 82 +++++++++++++++++++++++++++++++++++++++---------
fs/open.c | 2 +-
include/linux/fdtable.h | 8 ++---
5 files changed, 76 insertions(+), 22 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index fddf76ef5bd6..06bb13e75260 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -890,7 +890,7 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags)
rlim_cur = task_rlimit(proc->tsk, RLIMIT_NOFILE);
unlock_task_sighand(proc->tsk, &irqs);

- return __alloc_fd(files, 0, rlim_cur, flags);
+ return __alloc_fd(proc->tsk, files, 0, rlim_cur, flags);
}

/*
@@ -913,7 +913,7 @@ static long task_close_fd(struct binder_proc *proc, unsigned int fd)
if (proc->files == NULL)
return -ESRCH;

- retval = __close_fd(proc->files, fd);
+ retval = __close_fd(proc->tsk, proc->files, fd);
/* can't restart close syscall because file table entry was cleared */
if (unlikely(retval == -ERESTARTSYS ||
retval == -ERESTARTNOINTR ||
diff --git a/fs/exec.c b/fs/exec.c
index 3e14ba25f678..bfc63506876d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1293,7 +1293,7 @@ int flush_old_exec(struct linux_binprm * bprm)
* trying to access the should-be-closed file descriptors of a process
* undergoing exec(2).
*/
- do_close_on_exec(current->files);
+ do_close_on_exec(current);
return 0;

out:
diff --git a/fs/file.c b/fs/file.c
index 4eecbf4244a5..2f2e14a18e19 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -23,6 +23,7 @@
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/workqueue.h>
+#include <linux/rlimit_noti_kern.h>

unsigned int sysctl_nr_open __read_mostly = 1024*1024;
unsigned int sysctl_nr_open_min = BITS_PER_LONG;
@@ -255,7 +256,7 @@ static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
}

-static unsigned int count_open_files(struct fdtable *fdt)
+static unsigned int get_last_open_file(struct fdtable *fdt)
{
unsigned int size = fdt->max_fds;
unsigned int i;
@@ -301,7 +302,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)

spin_lock(&oldf->file_lock);
old_fdt = files_fdtable(oldf);
- open_files = count_open_files(old_fdt);
+ open_files = get_last_open_file(old_fdt);

/*
* Check whether we need to allocate a larger fd array and fd set.
@@ -332,7 +333,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
*/
spin_lock(&oldf->file_lock);
old_fdt = files_fdtable(oldf);
- open_files = count_open_files(old_fdt);
+ open_files = get_last_open_file(old_fdt);
}

copy_fd_bitmaps(new_fdt, old_fdt, open_files);
@@ -464,6 +465,31 @@ struct files_struct init_files = {
.file_lock = __SPIN_LOCK_UNLOCKED(init_files.file_lock),
};

+static unsigned int count_open_fds(struct fdtable *fdt)
+{
+ unsigned int maxfd = fdt->max_fds;
+ unsigned int maxbit = maxfd / BITS_PER_LONG;
+ unsigned int count = 0;
+ int i;
+
+ i = find_next_zero_bit(fdt->full_fds_bits, maxbit, 0);
+ /* find_next_zero_bit() returns maxbit when there are no free fds */
+ if (i >= maxbit)
+ return maxfd;
+#if BITS_PER_LONG == 32
+#define HWEIGHT_LONG hweight32
+#else
+#define HWEIGHT_LONG hweight64
+#endif
+
+ count += i * BITS_PER_LONG;
+ for (; i < maxbit; ++i)
+ count += HWEIGHT_LONG(fdt->open_fds[i]);
+
+#undef HWEIGHT_LONG
+ return count;
+}
+
static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
unsigned int maxfd = fdt->max_fds;
@@ -481,8 +507,8 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
/*
* allocate a file descriptor, mark it busy.
*/
-int __alloc_fd(struct files_struct *files,
- unsigned start, unsigned end, unsigned flags)
+int __alloc_fd(struct task_struct *owner, struct files_struct *files,
+ unsigned int start, unsigned int end, unsigned int flags)
{
unsigned int fd;
int error;
@@ -526,6 +552,13 @@ int __alloc_fd(struct files_struct *files,
else
__clear_close_on_exec(fd, fdt);
error = fd;
+
+ if (rlimit_noti_watch_active(owner, RLIMIT_NOFILE)) {
+ unsigned int count;
+
+ count = count_open_fds(fdt);
+ rlimit_noti_res_changed(owner, RLIMIT_NOFILE, count - 1, count);
+ }
#if 1
/* Sanity check */
if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
@@ -541,28 +574,37 @@ int __alloc_fd(struct files_struct *files,

static int alloc_fd(unsigned start, unsigned flags)
{
- return __alloc_fd(current->files, start, rlimit(RLIMIT_NOFILE), flags);
+ return __alloc_fd(current, current->files,
+ start, rlimit(RLIMIT_NOFILE), flags);
}

int get_unused_fd_flags(unsigned flags)
{
- return __alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
+ return alloc_fd(0, flags);
}
EXPORT_SYMBOL(get_unused_fd_flags);

-static void __put_unused_fd(struct files_struct *files, unsigned int fd)
+static void __put_unused_fd(struct task_struct *owner, unsigned int fd)
{
+ struct files_struct *files = owner->files;
struct fdtable *fdt = files_fdtable(files);
__clear_open_fd(fd, fdt);
if (fd < files->next_fd)
files->next_fd = fd;
+
+ if (rlimit_noti_watch_active(owner, RLIMIT_NOFILE)) {
+ unsigned int count;
+
+ count = count_open_fds(fdt);
+ rlimit_noti_res_changed(owner, RLIMIT_NOFILE, count + 1, count);
+ }
}

void put_unused_fd(unsigned int fd)
{
struct files_struct *files = current->files;
spin_lock(&files->file_lock);
- __put_unused_fd(files, fd);
+ __put_unused_fd(current, fd);
spin_unlock(&files->file_lock);
}

@@ -619,7 +661,8 @@ EXPORT_SYMBOL(fd_install);
/*
* The same warnings as for __alloc_fd()/__fd_install() apply here...
*/
-int __close_fd(struct files_struct *files, unsigned fd)
+int __close_fd(struct task_struct *owner, struct files_struct *files,
+ unsigned int fd)
{
struct file *file;
struct fdtable *fdt;
@@ -633,7 +676,7 @@ int __close_fd(struct files_struct *files, unsigned fd)
goto out_unlock;
rcu_assign_pointer(fdt->fd[fd], NULL);
__clear_close_on_exec(fd, fdt);
- __put_unused_fd(files, fd);
+ __put_unused_fd(owner, fd);
spin_unlock(&files->file_lock);
return filp_close(file, files);

@@ -642,10 +685,11 @@ int __close_fd(struct files_struct *files, unsigned fd)
return -EBADF;
}

-void do_close_on_exec(struct files_struct *files)
+void do_close_on_exec(struct task_struct *tsk)
{
unsigned i;
struct fdtable *fdt;
+ struct files_struct *files = tsk->files;

/* exec unshares first */
spin_lock(&files->file_lock);
@@ -667,7 +711,7 @@ void do_close_on_exec(struct files_struct *files)
if (!file)
continue;
rcu_assign_pointer(fdt->fd[fd], NULL);
- __put_unused_fd(files, fd);
+ __put_unused_fd(tsk, fd);
spin_unlock(&files->file_lock);
filp_close(file, files);
cond_resched();
@@ -839,6 +883,16 @@ __releases(&files->file_lock)
__set_close_on_exec(fd, fdt);
else
__clear_close_on_exec(fd, fdt);
+
+ /* If fd was previously open then number of opened fd stays untouched */
+ if (!tofree && rlimit_noti_watch_active(current, RLIMIT_NOFILE)) {
+ unsigned int count;
+
+ count = count_open_fds(fdt);
+ rlimit_noti_res_changed(current, RLIMIT_NOFILE,
+ count - 1, count);
+ }
+
spin_unlock(&files->file_lock);

if (tofree)
@@ -857,7 +911,7 @@ int replace_fd(unsigned fd, struct file *file, unsigned flags)
struct files_struct *files = current->files;

if (!file)
- return __close_fd(files, fd);
+ return __close_fd(current, files, fd);

if (fd >= rlimit(RLIMIT_NOFILE))
return -EBADF;
diff --git a/fs/open.c b/fs/open.c
index 7ea118471dce..dc0d19d35df0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1152,7 +1152,7 @@ EXPORT_SYMBOL(filp_close);
*/
SYSCALL_DEFINE1(close, unsigned int, fd)
{
- int retval = __close_fd(current->files, fd);
+ int retval = __close_fd(current, current->files, fd);

/* can't restart close syscall because file table entry was cleared */
if (unlikely(retval == -ERESTARTSYS ||
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 1c65817673db..b254796e46b7 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -107,16 +107,16 @@ void put_files_struct(struct files_struct *fs);
void reset_files_struct(struct files_struct *);
int unshare_files(struct files_struct **);
struct files_struct *dup_fd(struct files_struct *, int *) __latent_entropy;
-void do_close_on_exec(struct files_struct *);
+void do_close_on_exec(struct task_struct *tsk);
int iterate_fd(struct files_struct *, unsigned,
int (*)(const void *, struct file *, unsigned),
const void *);

-extern int __alloc_fd(struct files_struct *files,
- unsigned start, unsigned end, unsigned flags);
+extern int __alloc_fd(struct task_struct *owner, struct files_struct *files,
+ unsigned int start, unsigned int end, unsigned int flags);
extern void __fd_install(struct files_struct *files,
unsigned int fd, struct file *file);
-extern int __close_fd(struct files_struct *files,
+extern int __close_fd(struct task_struct *owner, struct files_struct *files,
unsigned int fd);

extern struct kmem_cache *files_cachep;
--
2.9.3

2017-12-14 22:01:21

by Krzysztof Opasiak

Subject: [RFC PATCH v2 3/4] Connect rlimit-events with process life cycle

Add rlimit-events calls to the process life cycle to ensure that
we get notified whenever a process dies (to clean up our watch
levels) or forks (to implement watch level inheritance).

Signed-off-by: Krzysztof Opasiak <[email protected]>
---
kernel/exit.c | 3 +++
kernel/fork.c | 11 ++++++++++-
2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index f6cad39f35df..d3b9e1b854a3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -62,6 +62,7 @@
#include <linux/random.h>
#include <linux/rcuwait.h>
#include <linux/compat.h>
+#include <linux/rlimit_noti.h>

#include <linux/uaccess.h>
#include <asm/unistd.h>
@@ -846,6 +847,8 @@ void __noreturn do_exit(long code)
tty_audit_exit();
audit_free(tsk);

+ rlimit_noti_task_exit(tsk);
+
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);

diff --git a/kernel/fork.c b/kernel/fork.c
index e54684b63c08..77bbac0bdfe9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -100,6 +100,8 @@

#include <trace/events/sched.h>

+#include <linux/rlimit_noti.h>
+
#define CREATE_TRACE_POINTS
#include <trace/events/task.h>

@@ -1389,6 +1391,7 @@ static inline void posix_cpu_timers_init_group(struct signal_struct *sig) { }
static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
struct signal_struct *sig;
+ int ret;

if (clone_flags & CLONE_THREAD)
return 0;
@@ -1419,9 +1422,15 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
#endif

task_lock(current->group_leader);
- memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
+ memcpy(sig->rlim, current->signal->rlim, sizeof(sig->rlim));
task_unlock(current->group_leader);

+ ret = rlimit_noti_task_fork(current->group_leader, tsk);
+ if (ret) {
+ kmem_cache_free(signal_cachep, sig);
+ return ret;
+ }
+
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
--
2.9.3

2017-12-14 22:01:27

by Krzysztof Opasiak

Subject: [RFC PATCH v2 1/4] sched: Allow to get() and put() signal struct

Allow get() and put() of the signal struct from different
parts of the kernel core, not only from fork.c.

This is useful when the signal struct holds additional
resources that have to be accessed after the process dies.

Signed-off-by: Krzysztof Opasiak <[email protected]>
---
include/linux/sched/signal.h | 13 +++++++++++++
kernel/fork.c | 9 ++-------
2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0aa4548fb492..78c0c337c130 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -610,4 +610,17 @@ static inline unsigned long rlimit_max(unsigned int limit)
return task_rlimit_max(current, limit);
}

+void free_signal_struct(struct signal_struct *sig);
+
+static inline void put_signal_struct(struct signal_struct *sig)
+{
+ if (atomic_dec_and_test(&sig->sigcnt))
+ free_signal_struct(sig);
+}
+
+static inline void get_signal_struct(struct signal_struct *sig)
+{
+ atomic_inc(&sig->sigcnt);
+}
+
#endif /* _LINUX_SCHED_SIGNAL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cc743698d3..e54684b63c08 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -390,7 +390,7 @@ void free_task(struct task_struct *tsk)
}
EXPORT_SYMBOL(free_task);

-static inline void free_signal_struct(struct signal_struct *sig)
+void free_signal_struct(struct signal_struct *sig)
{
taskstats_tgid_free(sig);
sched_autogroup_exit(sig);
@@ -402,12 +402,7 @@ static inline void free_signal_struct(struct signal_struct *sig)
mmdrop_async(sig->oom_mm);
kmem_cache_free(signal_cachep, sig);
}
-
-static inline void put_signal_struct(struct signal_struct *sig)
-{
- if (atomic_dec_and_test(&sig->sigcnt))
- free_signal_struct(sig);
-}
+EXPORT_SYMBOL_GPL(free_signal_struct);

void __put_task_struct(struct task_struct *tsk)
{
--
2.9.3

2017-12-14 23:59:37

by Andreas Dilger

Subject: Re: [RFC PATCH v2 4/4] Allow to trace fd usage with rlimit-events

On Dec 14, 2017, at 3:00 PM, Krzysztof Opasiak <[email protected]> wrote:
>
> Add rlimit-events calls to file descriptors management
> code to allow tracing of FD usage.
>
> This allows userspace process (monitor) to get notification when
> other process (subject) uses given amount of file descriptors.
>
> This can be used to for example asynchronously monitor number
> of open FD's in system services instead of polling with
> predefined interval.

I'm not involved in this area of code, but one optimization question:

> +static unsigned int count_open_fds(struct fdtable *fdt)
> +{
> + unsigned int maxfd = fdt->max_fds;
> + unsigned int maxbit = maxfd / BITS_PER_LONG;
> + unsigned int count = 0;
> + int i;
> +
> + i = find_next_zero_bit(fdt->full_fds_bits, maxbit, 0);
> + /* If there is no free fds */
> + if (i > maxbit)
> + return maxfd;
> +#if BITS_PER_LONG == 32
> +#define HWEIGHT_LONG hweight32
> +#else
> +#define HWEIGHT_LONG hweight64
> +#endif
> +
> + count += i * BITS_PER_LONG;
> + for (; i < maxbit; ++i)
> + count += HWEIGHT_LONG(fdt->open_fds[i]);
> +
> +#undef HWEIGHT_LONG
> + return count;
> +}

Since find_next_zero_bit() needs to process all of the words anyway,
and also performs lots of extra operations that add overhead, it looks
more efficient to just compute HWEIGHT_LONG(open_fds[]) for the whole array.

Cheers, Andreas
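
To illustrate the whole-array approach suggested above, here is a minimal
userspace sketch, not kernel code. It assumes a GCC or Clang toolchain for
__builtin_popcountl(); the name count_bits_flat() is hypothetical. In the
kernel, hweight_long() would take the place of the builtin.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: count the set bits in the first nwords words
 * of a bitmap by summing the population count of every word.
 */
static unsigned int count_bits_flat(const unsigned long *bitmap, size_t nwords)
{
	unsigned int count = 0;
	size_t i;

	for (i = 0; i < nwords; i++)
		count += (unsigned int)__builtin_popcountl(bitmap[i]);

	return count;
}
```

For a three-word bitmap holding { ~0UL, 0x5, 0 } this returns
BITS_PER_LONG + 2. The cost is one population count per word regardless of
how the set bits are distributed.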







2017-12-15 08:05:58

by Krzysztof Opasiak

Subject: Re: [RFC PATCH v2 4/4] Allow to trace fd usage with rlimit-events



On 12/15/2017 12:59 AM, Andreas Dilger wrote:
> On Dec 14, 2017, at 3:00 PM, Krzysztof Opasiak <[email protected]> wrote:
>>
>> Add rlimit-events calls to file descriptors management
>> code to allow tracing of FD usage.
>>
>> This allows userspace process (monitor) to get notification when
>> other process (subject) uses given amount of file descriptors.
>>
>> This can be used to for example asynchronously monitor number
>> of open FD's in system services instead of polling with
>> predefined interval.
>
> I'm not involved in this area of code, but one optimization question:
>
>> +static unsigned int count_open_fds(struct fdtable *fdt)
>> +{
>> + unsigned int maxfd = fdt->max_fds;
>> + unsigned int maxbit = maxfd / BITS_PER_LONG;
>> + unsigned int count = 0;
>> + int i;
>> +
>> + i = find_next_zero_bit(fdt->full_fds_bits, maxbit, 0);
>> + /* If there is no free fds */
>> + if (i > maxbit)
>> + return maxfd;
>> +#if BITS_PER_LONG == 32
>> +#define HWEIGHT_LONG hweight32
>> +#else
>> +#define HWEIGHT_LONG hweight64
>> +#endif
>> +
>> + count += i * BITS_PER_LONG;
>> + for (; i < maxbit; ++i)
>> + count += HWEIGHT_LONG(fdt->open_fds[i]);
>> +
>> +#undef HWEIGHT_LONG
>> + return count;
>> +}
>
> Since find_next_zero_bit() needs to process all of the words anyway
> as well as lots of extra operations that add overhead, it looks more
> efficient to just compute HWEIGHT_LONG(open_fds[]) for the whole array.
>

I may try to measure this, but I'm not sure it will help, because there
are two levels of bitmaps. The first one (open_fds) is the real bitmap of
open file descriptors (which can be quite large), while full_fds_bits is a
second-level bitmap (BITS_PER_LONG times smaller): a bit set there means
that the whole corresponding word in the lower bitmap (open_fds) is set.
Because we always open the lowest available fd number, we should have
quite a large contiguous region at the beginning, so calling
find_next_zero_bit() first should allow us to skip this region.

But anyway, I can try your suggestion and check whether it speeds things up.
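
The two-level scheme described above can be sketched in userspace as
follows. This is only an illustration under stated assumptions, not the
patch's code: struct fd_bitmaps and both function names are hypothetical
stand-ins, each bit of full_fds_bits is assumed to summarise one unsigned
long word of open_fds, unused summary bits are assumed to be zero, and
__builtin_ctzl()/__builtin_popcountl() require GCC or Clang.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the relevant fields of struct fdtable. */
struct fd_bitmaps {
	size_t nwords;                      /* words in open_fds */
	const unsigned long *open_fds;      /* one bit per descriptor */
	const unsigned long *full_fds_bits; /* one bit per open_fds word,
					     * set when that word is all ones */
};

/* Index of the first open_fds word that is not full, or nwords if none. */
static size_t first_not_full_word(const struct fd_bitmaps *b)
{
	size_t base;

	for (base = 0; base < b->nwords; base += BITS_PER_LONG) {
		unsigned long inv = ~b->full_fds_bits[base / BITS_PER_LONG];

		if (inv) {
			/* Lowest clear summary bit marks the first non-full word. */
			size_t idx = base + (size_t)__builtin_ctzl(inv);

			return idx < b->nwords ? idx : b->nwords;
		}
	}
	return b->nwords;
}

/* Skip the leading run of full words, then popcount only the remainder. */
static unsigned int count_bits_two_level(const struct fd_bitmaps *b)
{
	size_t i = first_not_full_word(b);
	unsigned int count = (unsigned int)(i * BITS_PER_LONG);

	for (; i < b->nwords; i++)
		count += (unsigned int)__builtin_popcountl(b->open_fds[i]);

	return count;
}
```

The skip only pays off when the table starts with a long run of completely
full words; otherwise the loop degenerates into the flat per-word count
plus the cost of the summary lookup, which is the trade-off worth measuring.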

Best regards,
--
Krzysztof Opasiak
Samsung R&D Institute Poland
Samsung Electronics