2009-12-11 22:59:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH RFC v2 0/4] cgroup notifications API and memory thresholds

This patchset introduces an eventfd-based API for notifications in cgroups and
implements memory notifications on top of it.

It uses statistics in the memory controller to track memory usage.

Before changes:

Root cgroup
Performance counter stats for './multi-fault 2' (5 runs):

117596.249864 task-clock-msecs # 1.960 CPUs ( +- 0.043% )
80114 context-switches # 0.001 M/sec ( +- 0.234% )
80 CPU-migrations # 0.000 M/sec ( +- 24.934% )
39120862 page-faults # 0.333 M/sec ( +- 0.138% )
294682530295 cycles # 2505.884 M/sec ( +- 0.076% ) (scaled from 70.00%)
191303772329 instructions # 0.649 IPC ( +- 0.041% ) (scaled from 80.01%)
39400843259 branches # 335.052 M/sec ( +- 0.062% ) (scaled from 80.02%)
497810459 branch-misses # 1.263 % ( +- 1.584% ) (scaled from 80.02%)
3352408601 cache-references # 28.508 M/sec ( +- 0.251% ) (scaled from 19.98%)
128744 cache-misses # 0.001 M/sec ( +- 4.542% ) (scaled from 19.98%)

60.001025199 seconds time elapsed ( +- 0.000% )

Non-root cgroup
Performance counter stats for './multi-fault 2' (5 runs):

116907.543887 task-clock-msecs # 1.948 CPUs ( +- 0.087% )
70497 context-switches # 0.001 M/sec ( +- 0.204% )
94 CPU-migrations # 0.000 M/sec ( +- 11.854% )
33894593 page-faults # 0.290 M/sec ( +- 0.123% )
291912994149 cycles # 2496.956 M/sec ( +- 0.102% ) (scaled from 70.03%)
194998499007 instructions # 0.668 IPC ( +- 0.109% ) (scaled from 80.01%)
41752189092 branches # 357.139 M/sec ( +- 0.118% ) (scaled from 79.96%)
487437901 branch-misses # 1.167 % ( +- 0.378% ) (scaled from 79.95%)
3076284269 cache-references # 26.314 M/sec ( +- 0.471% ) (scaled from 20.04%)
170468 cache-misses # 0.001 M/sec ( +- 1.481% ) (scaled from 20.05%)

60.001211398 seconds time elapsed ( +- 0.000% )

After changes:

Root cgroup
Performance counter stats for './multi-fault 2' (5 runs):

117396.738764 task-clock-msecs # 1.957 CPUs ( +- 0.047% )
78763 context-switches # 0.001 M/sec ( +- 0.132% )
109 CPU-migrations # 0.000 M/sec ( +- 25.646% )
38141062 page-faults # 0.325 M/sec ( +- 0.107% )
294257674123 cycles # 2506.523 M/sec ( +- 0.045% ) (scaled from 70.01%)
194937378540 instructions # 0.662 IPC ( +- 0.120% ) (scaled from 79.98%)
40694602714 branches # 346.642 M/sec ( +- 0.127% ) (scaled from 79.95%)
529968529 branch-misses # 1.302 % ( +- 1.668% ) (scaled from 79.94%)
3196763471 cache-references # 27.230 M/sec ( +- 0.262% ) (scaled from 20.05%)
201095 cache-misses # 0.002 M/sec ( +- 3.315% ) (scaled from 20.06%)

60.001025546 seconds time elapsed ( +- 0.000% )

Non-root cgroup:
Performance counter stats for './multi-fault 2' (5 runs):

116471.855099 task-clock-msecs # 1.941 CPUs ( +- 0.067% )
69393 context-switches # 0.001 M/sec ( +- 0.099% )
117 CPU-migrations # 0.000 M/sec ( +- 14.049% )
33043048 page-faults # 0.284 M/sec ( +- 0.086% )
290751403642 cycles # 2496.323 M/sec ( +- 0.073% ) (scaled from 69.97%)
196594115294 instructions # 0.676 IPC ( +- 0.065% ) (scaled from 79.97%)
42507307304 branches # 364.958 M/sec ( +- 0.054% ) (scaled from 79.96%)
500670691 branch-misses # 1.178 % ( +- 0.729% ) (scaled from 79.98%)
2935664654 cache-references # 25.205 M/sec ( +- 0.153% ) (scaled from 20.04%)
224967 cache-misses # 0.002 M/sec ( +- 2.462% ) (scaled from 20.02%)

60.001218531 seconds time elapsed ( +- 0.000% )

Any comments?

TODO:
- documentation.

v1 -> v2:
- use statistics instead of res_counter to track resource usage;
- fix bugs with locking;

v0 -> v1:
- memsw support implemented.

Kirill A. Shutemov (4):
cgroup: implement eventfd-based generic API for notifications
memcg: extract mem_group_usage() from mem_cgroup_read()
memcg: rework usage of stats by soft limit
memcg: implement memory thresholds

include/linux/cgroup.h | 20 +++
kernel/cgroup.c | 215 ++++++++++++++++++++++++++++++-
mm/memcontrol.c | 335 ++++++++++++++++++++++++++++++++++++++++++++----
3 files changed, 543 insertions(+), 27 deletions(-)


2009-12-11 22:59:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH RFC v2 1/4] cgroup: implement eventfd-based generic API for notifications

This patch introduces a write-only file, "cgroup.event_control", in every
cgroup.

To register a new notification handler you need to:
- create an eventfd;
- open a control file to be monitored. The callbacks register_event() and
unregister_event() must be defined for the control file;
- write "<event_fd> <control_fd> <args>" to cgroup.event_control.
The interpretation of args is defined by the control file implementation.

The eventfd will be woken up by the control file implementation or when the
cgroup is removed.

To unregister a notification handler, just close the eventfd.

If you need notification functionality for a control file, you have to
implement the register_event() and unregister_event() callbacks in its
struct cftype.
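
For illustration, a minimal sketch of the kernel side (not from this series;
the foo_* names are hypothetical, and <linux/cgroup.h> plus <linux/eventfd.h>
are assumed):

static int foo_register_event(struct cgroup *cgrp, struct cftype *cft,
		struct eventfd_ctx *eventfd, const char *args)
{
	/* parse args and remember eventfd so it can be signalled later */
	return 0;
}

static int foo_unregister_event(struct cgroup *cgrp, struct cftype *cft,
		struct eventfd_ctx *eventfd)
{
	/* forget the eventfd; called when userspace closes it */
	return 0;
}

static struct cftype foo_files[] = {
	{
		.name = "foo_state",
		.register_event = foo_register_event,
		.unregister_event = foo_unregister_event,
	},
};

/* later, when the watched condition triggers: eventfd_signal(eventfd, 1); */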

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/cgroup.h | 20 +++++
kernel/cgroup.c | 215 +++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 234 insertions(+), 1 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0008dee..7ad3078 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -220,6 +220,10 @@ struct cgroup {

/* For RCU-protected deletion */
struct rcu_head rcu_head;
+
+ /* List of events which userspace wants to receive */
+ struct list_head event_list;
+ struct mutex event_list_mutex;
};

/*
@@ -362,6 +366,22 @@ struct cftype {
int (*trigger)(struct cgroup *cgrp, unsigned int event);

int (*release)(struct inode *inode, struct file *file);
+
+ /*
+ * register_event() callback will be used to add new userspace
+ * waiter for changes related to the cftype. Implement it if
+ * you want to provide this functionality. Use eventfd_signal()
+ * on eventfd to send notification to userspace.
+ */
+ int (*register_event)(struct cgroup *cgrp, struct cftype *cft,
+ struct eventfd_ctx *eventfd, const char *args);
+ /*
+ * unregister_event() callback will be called when userspace
+ * close the eventfd. This callback must be implemented, if you
+ * provide register_event().
+ */
+ int (*unregister_event)(struct cgroup *cgrp, struct cftype *cft,
+ struct eventfd_ctx *eventfd);
};

struct cgroup_scanner {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0249f4b..f7ec3ca 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4,6 +4,10 @@
* Based originally on the cpuset system, extracted by Paul Menage
* Copyright (C) 2006 Google, Inc
*
+ * Notifications support
+ * Copyright (C) 2009 Nokia Corporation
+ * Author: Kirill A. Shutemov
+ *
* Copyright notices from the original cpuset code:
* --------------------------------------------------
* Copyright (C) 2003 BULL SA.
@@ -51,6 +55,8 @@
#include <linux/pid_namespace.h>
#include <linux/idr.h>
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
+#include <linux/eventfd.h>
+#include <linux/poll.h>

#include <asm/atomic.h>

@@ -146,6 +152,36 @@ struct css_id {
unsigned short stack[0]; /* Array of Length (depth+1) */
};

+/*
+ * cgroup_event represents an event which userspace wants to receive.
+ */
+struct cgroup_event {
+ /*
+ * Cgroup which the event belongs to.
+ */
+ struct cgroup *cgrp;
+ /*
+ * Control file which the event is associated with.
+ */
+ struct cftype *cft;
+ /*
+ * eventfd to signal userspace about the event.
+ */
+ struct eventfd_ctx *eventfd;
+ /*
+ * Each of these is stored in a list by the cgroup.
+ */
+ struct list_head list;
+ /*
+ * All fields below are needed to unregister the event when
+ * userspace closes the eventfd.
+ */
+ poll_table pt;
+ wait_queue_head_t *wqh;
+ wait_queue_t wait;
+ struct work_struct remove;
+};
+static void cgroup_event_remove(struct cgroup_event *event);

/* The list of hierarchy roots */

@@ -734,14 +770,29 @@ static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
static int cgroup_call_pre_destroy(struct cgroup *cgrp)
{
struct cgroup_subsys *ss;
+ struct cgroup_event *event, *tmp;
int ret = 0;

for_each_subsys(cgrp->root, ss)
if (ss->pre_destroy) {
ret = ss->pre_destroy(ss, cgrp);
if (ret)
- break;
+ goto out;
}
+
+ /*
+ * Unregister events and notify userspace.
+ * FIXME: How to avoid race with cgroup_event_remove_work()
+ * which runs from workqueue?
+ */
+ mutex_lock(&cgrp->event_list_mutex);
+ list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
+ cgroup_event_remove(event);
+ eventfd_signal(event->eventfd, 1);
+ }
+ mutex_unlock(&cgrp->event_list_mutex);
+
+out:
return ret;
}

@@ -1136,6 +1187,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->release_list);
INIT_LIST_HEAD(&cgrp->pidlists);
mutex_init(&cgrp->pidlist_mutex);
+ INIT_LIST_HEAD(&cgrp->event_list);
+ mutex_init(&cgrp->event_list_mutex);
}

static void init_cgroup_root(struct cgroupfs_root *root)
@@ -1935,6 +1988,16 @@ static const struct inode_operations cgroup_dir_inode_operations = {
.rename = cgroup_rename,
};

+/*
+ * Check if a file is a control file
+ */
+static inline struct cftype *__file_cft(struct file *file)
+{
+ if (file->f_dentry->d_inode->i_fop != &cgroup_file_operations)
+ return ERR_PTR(-EINVAL);
+ return __d_cft(file->f_dentry);
+}
+
static int cgroup_create_file(struct dentry *dentry, mode_t mode,
struct super_block *sb)
{
@@ -2789,6 +2852,151 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
return 0;
}

+static inline void cgroup_event_remove(struct cgroup_event *event)
+{
+ struct cgroup *cgrp = event->cgrp;
+
+ BUG_ON(event->cft->unregister_event(cgrp, event->cft, event->eventfd));
+ eventfd_ctx_put(event->eventfd);
+ remove_wait_queue(event->wqh, &event->wait);
+ list_del(&event->list);
+ kfree(event);
+}
+
+static void cgroup_event_remove_work(struct work_struct *work)
+{
+ struct cgroup_event *event = container_of(work, struct cgroup_event,
+ remove);
+ struct cgroup *cgrp = event->cgrp;
+
+ mutex_lock(&cgrp->event_list_mutex);
+ cgroup_event_remove(event);
+ mutex_unlock(&cgrp->event_list_mutex);
+}
+
+static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
+ int sync, void *key)
+{
+ struct cgroup_event *event = container_of(wait,
+ struct cgroup_event, wait);
+ unsigned long flags = (unsigned long)key;
+
+ if (flags & POLLHUP)
+ /*
+ * This function is called with a spinlock taken, but
+ * cgroup_event_remove() may sleep, so we have
+ * to run it in a workqueue.
+ */
+ schedule_work(&event->remove);
+
+ return 0;
+}
+
+static void cgroup_event_ptable_queue_proc(struct file *file,
+ wait_queue_head_t *wqh, poll_table *pt)
+{
+ struct cgroup_event *event = container_of(pt,
+ struct cgroup_event, pt);
+
+ event->wqh = wqh;
+ add_wait_queue(wqh, &event->wait);
+}
+
+static int cgroup_write_event_control(struct cgroup *cont, struct cftype *cft,
+ const char *buffer)
+{
+ struct cgroup_event *event = NULL;
+ unsigned int efd, cfd;
+ struct file *efile = NULL;
+ struct file *cfile = NULL;
+ char *endp;
+ int ret;
+
+ efd = simple_strtoul(buffer, &endp, 10);
+ if (*endp != ' ')
+ return -EINVAL;
+ buffer = endp + 1;
+
+ cfd = simple_strtoul(buffer, &endp, 10);
+ if ((*endp != ' ') && (*endp != '\0'))
+ return -EINVAL;
+ buffer = endp + 1;
+
+ event = kzalloc(sizeof(*event), GFP_KERNEL);
+ if (!event)
+ return -ENOMEM;
+ event->cgrp = cont;
+ INIT_LIST_HEAD(&event->list);
+ init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
+ init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
+ INIT_WORK(&event->remove, cgroup_event_remove_work);
+
+ efile = eventfd_fget(efd);
+ if (IS_ERR(efile)) {
+ ret = PTR_ERR(efile);
+ goto fail;
+ }
+
+ event->eventfd = eventfd_ctx_fileget(efile);
+ if (IS_ERR(event->eventfd)) {
+ ret = PTR_ERR(event->eventfd);
+ goto fail;
+ }
+
+ cfile = fget(cfd);
+ if (!cfile) {
+ ret = -EBADF;
+ goto fail;
+ }
+
+ /* the process needs read permission on the control file */
+ ret = file_permission(cfile, MAY_READ);
+ if (ret < 0)
+ goto fail;
+
+ event->cft = __file_cft(cfile);
+ if (IS_ERR(event->cft)) {
+ ret = PTR_ERR(event->cft);
+ goto fail;
+ }
+
+ if (!event->cft->register_event || !event->cft->unregister_event) {
+ ret = -EINVAL;
+ goto fail;
+ }
+
+ ret = event->cft->register_event(cont, event->cft,
+ event->eventfd, buffer);
+ if (ret)
+ goto fail;
+
+ efile->f_op->poll(efile, &event->pt);
+
+ mutex_lock(&cont->event_list_mutex);
+ list_add(&event->list, &cont->event_list);
+ mutex_unlock(&cont->event_list_mutex);
+
+ fput(cfile);
+ fput(efile);
+
+ return 0;
+
+fail:
+ if (!IS_ERR(cfile))
+ fput(cfile);
+
+ if (event && event->eventfd && !IS_ERR(event->eventfd))
+ eventfd_ctx_put(event->eventfd);
+
+ if (!IS_ERR(efile))
+ fput(efile);
+
+ if (event)
+ kfree(event);
+
+ return ret;
+}
+
/*
* for the common functions, 'private' gives the type of file
*/
@@ -2814,6 +3022,11 @@ static struct cftype files[] = {
.read_u64 = cgroup_read_notify_on_release,
.write_u64 = cgroup_write_notify_on_release,
},
+ {
+ .name = CGROUP_FILE_GENERIC_PREFIX "event_control",
+ .write_string = cgroup_write_event_control,
+ .mode = S_IWUGO,
+ },
};

static struct cftype cft_release_agent = {
--
1.6.5.3

2009-12-11 22:59:59

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH RFC v2 2/4] memcg: extract mem_group_usage() from mem_cgroup_read()

Helper to get memory or mem+swap usage of the cgroup.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memcontrol.c | 53 ++++++++++++++++++++++++++++++++---------------------
1 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c31a310..0ff65ed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2510,39 +2510,50 @@ mem_cgroup_get_recursive_idx_stat(struct mem_cgroup *mem,
*val = d.val;
}

+static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
+{
+ u64 idx_val, val;
+
+ if (!mem_cgroup_is_root(mem)) {
+ if (!swap)
+ return res_counter_read_u64(&mem->res, RES_USAGE);
+ else
+ return res_counter_read_u64(&mem->memsw, RES_USAGE);
+ }
+
+ mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_CACHE, &idx_val);
+ val = idx_val;
+ mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_RSS, &idx_val);
+ val += idx_val;
+
+ if (swap) {
+ mem_cgroup_get_recursive_idx_stat(mem,
+ MEM_CGROUP_STAT_SWAPOUT, &idx_val);
+ val += idx_val;
+ }
+
+ return val << PAGE_SHIFT;
+}
+
static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- u64 idx_val, val;
+ u64 val;
int type, name;

type = MEMFILE_TYPE(cft->private);
name = MEMFILE_ATTR(cft->private);
switch (type) {
case _MEM:
- if (name == RES_USAGE && mem_cgroup_is_root(mem)) {
- mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_CACHE, &idx_val);
- val = idx_val;
- mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_RSS, &idx_val);
- val += idx_val;
- val <<= PAGE_SHIFT;
- } else
+ if (name == RES_USAGE)
+ val = mem_cgroup_usage(mem, false);
+ else
val = res_counter_read_u64(&mem->res, name);
break;
case _MEMSWAP:
- if (name == RES_USAGE && mem_cgroup_is_root(mem)) {
- mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_CACHE, &idx_val);
- val = idx_val;
- mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_RSS, &idx_val);
- val += idx_val;
- mem_cgroup_get_recursive_idx_stat(mem,
- MEM_CGROUP_STAT_SWAPOUT, &idx_val);
- val <<= PAGE_SHIFT;
- } else
+ if (name == RES_USAGE)
+ val = mem_cgroup_usage(mem, true);
+ else
val = res_counter_read_u64(&mem->memsw, name);
break;
default:
--
1.6.5.3

2009-12-11 22:59:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

Instead of incrementing the counter on each page in/out and comparing it
with a constant, we set the counter to the constant, decrement it on each
page in/out and compare it with zero. We want to make the comparison as fast
as possible. On many RISC systems (and probably not only RISC) comparing
with zero is more efficient than comparing with a constant, since not
every constant can be an immediate operand of a compare instruction.

Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
since it is really not a generic counter.
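
For illustration, the difference between the two counting schemes (a plain C
sketch, not kernel code):

#define EVENTS_THRESH 1000

/* old scheme: count up, compare with a constant on the hot path */
static long counter_up;
static int old_event(void)
{
	if (++counter_up > EVENTS_THRESH) {
		counter_up = 0;
		return 1;
	}
	return 0;
}

/* new scheme: count down, compare with zero, re-arm with the constant */
static long counter_down;
static int new_event(void)
{
	if (--counter_down < 0) {
		counter_down = EVENTS_THRESH;
		return 1;
	}
	return 0;
}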

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memcontrol.c | 19 ++++++++++++++-----
1 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0ff65ed..c6081cc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -69,8 +69,9 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
- MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+ MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
+ used by soft limit implementation */

MEM_CGROUP_STAT_NSTATS,
};
@@ -90,6 +91,13 @@ __mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
stat->count[idx] = 0;
}

+static inline void
+__mem_cgroup_stat_set(struct mem_cgroup_stat_cpu *stat,
+ enum mem_cgroup_stat_index idx, s64 val)
+{
+ stat->count[idx] = val;
+}
+
static inline s64
__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
enum mem_cgroup_stat_index idx)
@@ -374,9 +382,10 @@ static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)

cpu = get_cpu();
cpustat = &mem->stat.cpustat[cpu];
- val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
- if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
- __mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
+ val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_SOFTLIMIT);
+ if (unlikely(val < 0)) {
+ __mem_cgroup_stat_set(cpustat, MEM_CGROUP_STAT_SOFTLIMIT,
+ SOFTLIMIT_EVENTS_THRESH);
ret = true;
}
put_cpu();
@@ -509,7 +518,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
else
__mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
put_cpu();
}

--
1.6.5.3

2009-12-11 23:00:08

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH RFC v2 4/4] memcg: implement memory thresholds

It allows registering multiple memory and memsw thresholds and getting
notifications when memory usage crosses any of them.

To register a threshold, an application needs to:
- create an eventfd;
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
cgroup.event_control.

The application will be notified through the eventfd when memory usage crosses
the threshold in either direction.

It's applicable to both root and non-root cgroups.

It uses stats to track memory usage, similar to soft limits. It checks
whether we need to send an event to userspace every 100 pages in/out. I guess
it's a good compromise between performance and accuracy of thresholds.
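
For illustration, a minimal userspace sketch of the steps above (not part of
the patch; the /cgroups/0 mount point and the 32M threshold are assumptions):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <sys/eventfd.h>

int main(void)
{
	char buf[32];
	uint64_t ticks;
	int efd = eventfd(0, 0);	/* 1. create an eventfd */
	int cfd = open("/cgroups/0/memory.usage_in_bytes", O_RDONLY);
	int ecfd = open("/cgroups/0/cgroup.event_control", O_WRONLY);

	if (efd < 0 || cfd < 0 || ecfd < 0)
		return 1;

	/* "<event_fd> <fd of memory.usage_in_bytes> <threshold>" */
	snprintf(buf, sizeof(buf), "%d %d 32M", efd, cfd);
	if (write(ecfd, buf, strlen(buf)) < 0)
		return 1;

	/* blocks until usage crosses 32M in either direction */
	read(efd, &ticks, sizeof(ticks));
	printf("memory usage crossed the threshold\n");

	close(efd);	/* closing the eventfd unregisters the threshold */
	return 0;
}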

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memcontrol.c | 263 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 263 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6081cc..5ba2140 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6,6 +6,10 @@
* Copyright 2007 OpenVZ SWsoft Inc
* Author: Pavel Emelianov <[email protected]>
*
+ * Memory thresholds
+ * Copyright (C) 2009 Nokia Corporation
+ * Author: Kirill A. Shutemov
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -38,6 +42,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/eventfd.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -56,6 +61,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/

static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
#define SOFTLIMIT_EVENTS_THRESH (1000)
+#define THRESHOLDS_EVENTS_THRESH (100)

/*
* Statistics for memory cgroup.
@@ -72,6 +78,8 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
used by soft limit implementation */
+ MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
+ used by threshold implementation */

MEM_CGROUP_STAT_NSTATS,
};
@@ -182,6 +190,15 @@ struct mem_cgroup_tree {

static struct mem_cgroup_tree soft_limit_tree __read_mostly;

+struct mem_cgroup_threshold {
+ struct list_head list;
+ struct eventfd_ctx *eventfd;
+ u64 threshold;
+};
+
+static bool mem_cgroup_threshold_check(struct mem_cgroup* mem);
+static void mem_cgroup_threshold(struct mem_cgroup* mem, bool swap);
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -233,6 +250,19 @@ struct mem_cgroup {
/* set when res.limit == memsw.limit */
bool memsw_is_minimum;

+ /* protect lists of thresholds */
+ spinlock_t thresholds_lock;
+
+ /* thresholds for memory usage */
+ struct list_head thresholds;
+ struct mem_cgroup_threshold *below_threshold;
+ struct mem_cgroup_threshold *above_threshold;
+
+ /* thresholds for mem+swap usage */
+ struct list_head memsw_thresholds;
+ struct mem_cgroup_threshold *memsw_below_threshold;
+ struct mem_cgroup_threshold *memsw_above_threshold;
+
/*
* statistics. This must be placed at the end of memcg.
*/
@@ -519,6 +549,8 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
__mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_THRESHOLDS, -1);
+
put_cpu();
}

@@ -1363,6 +1395,11 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
if (mem_cgroup_soft_limit_check(mem))
mem_cgroup_update_tree(mem, page);
done:
+ if (mem_cgroup_threshold_check(mem)) {
+ mem_cgroup_threshold(mem, false);
+ if (do_swap_account)
+ mem_cgroup_threshold(mem, true);
+ }
return 0;
nomem:
css_put(&mem->css);
@@ -1906,6 +1943,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)

if (mem_cgroup_soft_limit_check(mem))
mem_cgroup_update_tree(mem, page);
+ if (mem_cgroup_threshold_check(mem)) {
+ mem_cgroup_threshold(mem, false);
+ if (do_swap_account)
+ mem_cgroup_threshold(mem, true);
+ }
/* at swapout, this memcg will be accessed to record to swap */
if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
css_put(&mem->css);
@@ -2860,11 +2902,181 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
}


+static bool mem_cgroup_threshold_check(struct mem_cgroup *mem)
+{
+ bool ret = false;
+ int cpu;
+ s64 val;
+ struct mem_cgroup_stat_cpu *cpustat;
+
+ cpu = get_cpu();
+ cpustat = &mem->stat.cpustat[cpu];
+ val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_THRESHOLDS);
+ if (unlikely(val < 0)) {
+ __mem_cgroup_stat_set(cpustat, MEM_CGROUP_STAT_THRESHOLDS,
+ THRESHOLDS_EVENTS_THRESH);
+ ret = true;
+ }
+ put_cpu();
+ return ret;
+}
+
+static void mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
+{
+ struct mem_cgroup_threshold **below, **above;
+ struct list_head *thresholds;
+ u64 usage = mem_cgroup_usage(memcg, swap);
+
+ if (!swap) {
+ thresholds = &memcg->thresholds;
+ above = &memcg->above_threshold;
+ below = &memcg->below_threshold;
+ } else {
+ thresholds = &memcg->memsw_thresholds;
+ above = &memcg->memsw_above_threshold;
+ below = &memcg->memsw_below_threshold;
+ }
+
+ spin_lock(&memcg->thresholds_lock);
+ if ((*above)->threshold <= usage) {
+ *below = *above;
+ list_for_each_entry_continue((*above), thresholds, list) {
+ eventfd_signal((*below)->eventfd, 1);
+ if ((*above)->threshold > usage)
+ break;
+ *below = *above;
+ }
+ } else if ((*below)->threshold > usage) {
+ *above = *below;
+ list_for_each_entry_continue_reverse((*below), thresholds,
+ list) {
+ eventfd_signal((*above)->eventfd, 1);
+ if ((*below)->threshold <= usage)
+ break;
+ *above = *below;
+ }
+ }
+ spin_unlock(&memcg->thresholds_lock);
+}
+
+static void mem_cgroup_invalidate_thresholds(struct mem_cgroup *memcg,
+ bool swap)
+{
+ struct list_head *thresholds;
+ struct mem_cgroup_threshold **below, **above;
+ u64 usage = mem_cgroup_usage(memcg, swap);
+
+ if (!swap) {
+ thresholds = &memcg->thresholds;
+ above = &memcg->above_threshold;
+ below = &memcg->below_threshold;
+ } else {
+ thresholds = &memcg->memsw_thresholds;
+ above = &memcg->memsw_above_threshold;
+ below = &memcg->memsw_below_threshold;
+ }
+
+ *below = NULL;
+ list_for_each_entry((*above), thresholds, list) {
+ if ((*above)->threshold > usage) {
+ BUG_ON(!*below);
+ break;
+ }
+ *below = *above;
+ }
+}
+
+static inline struct mem_cgroup_threshold *mem_cgroup_threshold_alloc(void)
+{
+ struct mem_cgroup_threshold *new;
+
+ new = kmalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return NULL;
+ INIT_LIST_HEAD(&new->list);
+
+ return new;
+}
+
+static int mem_cgroup_register_event(struct cgroup *cgrp, struct cftype *cft,
+ struct eventfd_ctx *eventfd, const char *args)
+{
+ u64 threshold;
+ struct mem_cgroup_threshold *new, *tmp;
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ struct list_head *thresholds;
+ int type = MEMFILE_TYPE(cft->private);
+ int ret;
+
+ ret = res_counter_memparse_write_strategy(args, &threshold);
+ if (ret)
+ return ret;
+
+ new = mem_cgroup_threshold_alloc();
+ if (!new)
+ return -ENOMEM;
+ new->eventfd = eventfd;
+ new->threshold = threshold;
+
+ if (type == _MEM)
+ thresholds = &memcg->thresholds;
+ else if (type == _MEMSWAP)
+ thresholds = &memcg->memsw_thresholds;
+ else
+ BUG();
+
+ /* Check if a threshold was crossed before adding a new one */
+ mem_cgroup_threshold(memcg, type == _MEMSWAP);
+
+ spin_lock(&memcg->thresholds_lock);
+ list_for_each_entry(tmp, thresholds, list)
+ if (new->threshold < tmp->threshold) {
+ list_add_tail(&new->list, &tmp->list);
+ break;
+ }
+ mem_cgroup_invalidate_thresholds(memcg, type == _MEMSWAP);
+ spin_unlock(&memcg->thresholds_lock);
+
+ return 0;
+}
+
+static int mem_cgroup_unregister_event(struct cgroup *cgrp, struct cftype *cft,
+ struct eventfd_ctx *eventfd)
+{
+ struct mem_cgroup_threshold *threshold, *tmp;
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ struct list_head *thresholds;
+ int type = MEMFILE_TYPE(cft->private);
+
+ if (type == _MEM)
+ thresholds = &memcg->thresholds;
+ else if (type == _MEMSWAP)
+ thresholds = &memcg->memsw_thresholds;
+ else
+ BUG();
+
+ /* Check if a threshold was crossed before removing one */
+ mem_cgroup_threshold(memcg, type == _MEMSWAP);
+
+ spin_lock(&memcg->thresholds_lock);
+ list_for_each_entry_safe(threshold, tmp, thresholds, list)
+ if (threshold->eventfd == eventfd) {
+ list_del(&threshold->list);
+ kfree(threshold);
+ }
+ mem_cgroup_invalidate_thresholds(memcg, type == _MEMSWAP);
+ spin_unlock(&memcg->thresholds_lock);
+
+ return 0;
+}
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
.read_u64 = mem_cgroup_read,
+ .register_event = mem_cgroup_register_event,
+ .unregister_event = mem_cgroup_unregister_event,
},
{
.name = "max_usage_in_bytes",
@@ -2916,6 +3128,8 @@ static struct cftype memsw_cgroup_files[] = {
.name = "memsw.usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
.read_u64 = mem_cgroup_read,
+ .register_event = mem_cgroup_register_event,
+ .unregister_event = mem_cgroup_unregister_event,
},
{
.name = "memsw.max_usage_in_bytes",
@@ -2990,6 +3204,48 @@ static void free_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
kfree(mem->info.nodeinfo[node]);
}

+static int mem_cgroup_thresholds_init(struct mem_cgroup *mem)
+{
+ INIT_LIST_HEAD(&mem->thresholds);
+ INIT_LIST_HEAD(&mem->memsw_thresholds);
+
+ mem->below_threshold = mem_cgroup_threshold_alloc();
+ list_add(&mem->below_threshold->list, &mem->thresholds);
+ mem->below_threshold->threshold = 0ULL;
+
+ mem->above_threshold = mem_cgroup_threshold_alloc();
+ list_add_tail(&mem->above_threshold->list, &mem->thresholds);
+ mem->above_threshold->threshold = RESOURCE_MAX;
+
+ mem->memsw_below_threshold = mem_cgroup_threshold_alloc();
+ list_add(&mem->memsw_below_threshold->list, &mem->memsw_thresholds);
+ mem->memsw_below_threshold->threshold = 0ULL;
+
+ mem->memsw_above_threshold = mem_cgroup_threshold_alloc();
+ list_add_tail(&mem->memsw_above_threshold->list, &mem->memsw_thresholds);
+ mem->memsw_above_threshold->threshold = RESOURCE_MAX;
+
+ return 0;
+}
+
+static void mem_cgroup_thresholds_free(struct mem_cgroup *mem)
+{
+ /* Make sure that lists have only two elements */
+ BUG_ON((mem->below_threshold) &&
+ (mem->above_threshold) &&
+ (mem->below_threshold->list.next !=
+ &mem->above_threshold->list));
+ BUG_ON((mem->memsw_below_threshold) &&
+ (mem->memsw_above_threshold) &&
+ (mem->memsw_below_threshold->list.next !=
+ &mem->memsw_above_threshold->list));
+
+ kfree(mem->below_threshold);
+ kfree(mem->above_threshold);
+ kfree(mem->memsw_below_threshold);
+ kfree(mem->memsw_above_threshold);
+}
+
static int mem_cgroup_size(void)
{
int cpustat_size = nr_cpu_ids * sizeof(struct mem_cgroup_stat_cpu);
@@ -3026,6 +3282,8 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
{
int node;

+ mem_cgroup_thresholds_free(mem);
+
mem_cgroup_remove_from_trees(mem);
free_css_id(&mem_cgroup_subsys, &mem->css);

@@ -3145,6 +3403,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mem->last_scanned_child = 0;
spin_lock_init(&mem->reclaim_param_lock);

+ spin_lock_init(&mem->thresholds_lock);
+ error = mem_cgroup_thresholds_init(mem);
+ if (error)
+ goto free_out;
+
if (parent)
mem->swappiness = get_swappiness(parent);
atomic_set(&mem->refcnt, 1);
--
1.6.5.3

2009-12-12 03:19:28

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/4] memcg: implement memory thresholds

On Sat, 12 Dec 2009 00:59:19 +0200
"Kirill A. Shutemov" <[email protected]> wrote:

> It allows to register multiple memory and memsw thresholds and gets
> notifications when it crosses.
>
> To register a threshold application need:
> - create an eventfd;
> - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
> - write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to
> cgroup.event_control.
>
> Application will be notified through eventfd when memory usage crosses
> threshold in any direction.
>
> It's applicable for root and non-root cgroup.
>
> It uses stats to track memory usage, simmilar to soft limits. It checks
> if we need to send event to userspace on every 100 page in/out. I guess
> it's good compromise between performance and accuracy of thresholds.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/memcontrol.c | 263 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 263 insertions(+), 0 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c6081cc..5ba2140 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6,6 +6,10 @@
> * Copyright 2007 OpenVZ SWsoft Inc
> * Author: Pavel Emelianov <[email protected]>
> *
> + * Memory thresholds
> + * Copyright (C) 2009 Nokia Corporation
> + * Author: Kirill A. Shutemov
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -38,6 +42,7 @@
> #include <linux/vmalloc.h>
> #include <linux/mm_inline.h>
> #include <linux/page_cgroup.h>
> +#include <linux/eventfd.h>
> #include "internal.h"
>
> #include <asm/uaccess.h>
> @@ -56,6 +61,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>
> static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
This mutex has already been removed in current mmotm.
Please write patches for memcg based on mmotm.

> #define SOFTLIMIT_EVENTS_THRESH (1000)
> +#define THRESHOLDS_EVENTS_THRESH (100)
>
> /*
> * Statistics for memory cgroup.

(snip)

> @@ -1363,6 +1395,11 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> if (mem_cgroup_soft_limit_check(mem))
> mem_cgroup_update_tree(mem, page);
> done:
> + if (mem_cgroup_threshold_check(mem)) {
> + mem_cgroup_threshold(mem, false);
> + if (do_swap_account)
> + mem_cgroup_threshold(mem, true);
> + }
> return 0;
> nomem:
> css_put(&mem->css);
> @@ -1906,6 +1943,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>
> if (mem_cgroup_soft_limit_check(mem))
> mem_cgroup_update_tree(mem, page);
> + if (mem_cgroup_threshold_check(mem)) {
> + mem_cgroup_threshold(mem, false);
> + if (do_swap_account)
> + mem_cgroup_threshold(mem, true);
> + }
> /* at swapout, this memcg will be accessed to record to swap */
> if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> css_put(&mem->css);
Can "if (do_swap_account)" check be moved into mem_cgroup_threshold ?


Thanks,
Daisuke Nishimura.

2009-12-12 03:51:21

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

Sorry, I disagree with this change.

mem_cgroup_soft_limit_check() is used for checking how much the current usage exceeds
soft_limit_in_bytes and updating the softlimit tree asynchronously, instead of
checking on every charge/uncharge. What if you change soft_limit_in_bytes,
but the numbers of charges and uncharges are very balanced afterwards?
The softlimit tree will not be updated for a long time.

And IIUC, it's the same for your threshold feature, right?
I think it would be better to:

- discard this change.
- in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
and instead of adding a new STAT counter, do something like:

	if (mem_cgroup_event_check(mem)) {
		mem_cgroup_update_tree(mem, page);
		mem_cgroup_threshold(mem);
	}

Ah, yes. The current code doesn't call mem_cgroup_soft_limit_check() for the root cgroup
in the charge path, as you said in http://marc.info/?l=linux-mm&m=126021128400687&w=2.
I think you can change that as you want; I can change my patch
(http://marc.info/?l=linux-mm&m=126023467303178&w=2, it has not been sent to
Andrew yet anyway) to check mem_cgroup_is_root() in mem_cgroup_update_tree().

Thanks,
Daisuke Nishimura.

On Sat, 12 Dec 2009 00:59:18 +0200
"Kirill A. Shutemov" <[email protected]> wrote:

> Instead of incrementing counter on each page in/out and comparing it
> with constant, we set counter to constant, decrement counter on each
> page in/out and compare it with zero. We want to make comparing as fast
> as possible. On many RISC systems (probably not only RISC) comparing
> with zero is more effective than comparing with a constant, since not
> every constant can be immediate operand for compare instruction.
>
> Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
> since really it's not a generic counter.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/memcontrol.c | 19 ++++++++++++++-----
> 1 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0ff65ed..c6081cc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -69,8 +69,9 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
> MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> - MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> + MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> + used by soft limit implementation */
>
> MEM_CGROUP_STAT_NSTATS,
> };
> @@ -90,6 +91,13 @@ __mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
> stat->count[idx] = 0;
> }
>
> +static inline void
> +__mem_cgroup_stat_set(struct mem_cgroup_stat_cpu *stat,
> + enum mem_cgroup_stat_index idx, s64 val)
> +{
> + stat->count[idx] = val;
> +}
> +
> static inline s64
> __mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
> enum mem_cgroup_stat_index idx)
> @@ -374,9 +382,10 @@ static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
>
> cpu = get_cpu();
> cpustat = &mem->stat.cpustat[cpu];
> - val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
> - if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
> - __mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
> + val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_SOFTLIMIT);
> + if (unlikely(val < 0)) {
> + __mem_cgroup_stat_set(cpustat, MEM_CGROUP_STAT_SOFTLIMIT,
> + SOFTLIMIT_EVENTS_THRESH);
> ret = true;
> }
> put_cpu();
> @@ -509,7 +518,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> else
> __mem_cgroup_stat_add_safe(cpustat,
> MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
> - __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
> put_cpu();
> }
>
> --
> 1.6.5.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-12-12 13:06:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura
<[email protected]> wrote:
> Sorry, I disagree this change.
>
> mem_cgroup_soft_limit_check() is used for checking how much current usage exceeds
> the soft_limit_in_bytes and updating softlimit tree asynchronously, instead of
> checking every charge/uncharge. What if you change the soft_limit_in_bytes,
> but the number of charges and uncharges are very balanced afterwards ?
> The softlimit tree will not be updated for a long time.

I don't see how my patch affects the logic you've described.
Statistics updates and checks happen in the same place.
It just uses a decrement instead of an increment.

>
> And IIUC, it's the same for your threshold feature, right ?
> I think it would be better:
>
> - discard this change.
> - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
>  and instead of adding a new STAT counter, do like:
>
>        if (mem_cgroup_event_check(mem)) {
>                mem_cgroup_update_tree(mem, page);
>                mem_cgroup_threshold(mem);
>        }

I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be
run at different frequencies. How can MEM_CGROUP_STAT_EVENTS be shared
between soft limits and thresholds in that case?

2009-12-12 13:11:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/4] memcg: implement memory thresholds

On Sat, Dec 12, 2009 at 5:19 AM, Daisuke Nishimura
<[email protected]> wrote:
>> @@ -56,6 +61,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>>
>>  static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
> This mutex has already removed in current mmotm.
> Please write a patch for memcg based on mmot.

Ok.

>
>>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>> +#define THRESHOLDS_EVENTS_THRESH (100)
>>
>>  /*
>>   * Statistics for memory cgroup.
>
> (snip)
>
>> @@ -1363,6 +1395,11 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>>       if (mem_cgroup_soft_limit_check(mem))
>>               mem_cgroup_update_tree(mem, page);
>>  done:
>> +     if (mem_cgroup_threshold_check(mem)) {
>> +             mem_cgroup_threshold(mem, false);
>> +             if (do_swap_account)
>> +                     mem_cgroup_threshold(mem, true);
>> +     }
>>       return 0;
>>  nomem:
>>       css_put(&mem->css);
>> @@ -1906,6 +1943,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>>
>>       if (mem_cgroup_soft_limit_check(mem))
>>               mem_cgroup_update_tree(mem, page);
>> +     if (mem_cgroup_threshold_check(mem)) {
>> +             mem_cgroup_threshold(mem, false);
>> +             if (do_swap_account)
>> +                     mem_cgroup_threshold(mem, true);
>> +     }
>>       /* at swapout, this memcg will be accessed to record to swap */
>>       if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>>               css_put(&mem->css);
> Can "if (do_swap_account)" check be moved into mem_cgroup_threshold ?

Ok, I'll move it. It will affect the performance of
mem_cgroup_invalidate_thresholds(), but I don't think that's important.

2009-12-12 13:13:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/4] memcg: implement memory thresholds

On Sat, Dec 12, 2009 at 3:11 PM, Kirill A. Shutemov
<[email protected]> wrote:
> Ok, I'll move it. It will affect performance of
> mem_cgroup_invalidate_thresholds(),
> but I don't think that it's important.

s/mem_cgroup_invalidate_thresholds()/mem_cgroup_register_event() and
mem_cgroup_unregister_event()/

2009-12-13 01:49:45

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

On Sat, 12 Dec 2009 15:06:52 +0200
"Kirill A. Shutemov" <[email protected]> wrote:

> On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura
> <[email protected]> wrote:
> > Sorry, I disagree this change.
> >
> > mem_cgroup_soft_limit_check() is used for checking how much current usage exceeds
> > the soft_limit_in_bytes and updating softlimit tree asynchronously, instead of
> > checking every charge/uncharge. What if you change the soft_limit_in_bytes,
> > but the number of charges and uncharges are very balanced afterwards ?
> > The softlimit tree will not be updated for a long time.
>
> I don't see how my patch affects the logic you've described.
> Statistics updates and
> checks in the same place. It just uses decrement instead of increment.
>
Ah... my bad. Ignore this comment, please.
I misunderstood this patch.

> >
> > And IIUC, it's the same for your threshold feature, right ?
> > I think it would be better:
> >
> > - discard this change.
> > - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
> >  and instead of adding a new STAT counter, do like:
> >
> >        if (mem_cgroup_event_check(mem)) {
> >                mem_cgroup_update_tree(mem, page);
> >                mem_cgroup_threshold(mem);
> >        }
>
> I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be
> run with different frequency. How to share MEM_CGROUP_STAT_EVENTS
> between soft limits and thresholds in this case?
>
Hmm, both the softlimit and your threshold count events in the same places (charge and uncharge).
So, I think those events can be shared.
Is there any reason they should run at different frequencies?


Thanks,
Daisuke Nishimura.

2009-12-13 03:08:45

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

On Sat, Dec 12, 2009 at 4:34 PM, Daisuke Nishimura
<[email protected]> wrote:
> On Sat, 12 Dec 2009 15:06:52 +0200
> "Kirill A. Shutemov" <[email protected]> wrote:
>
>> On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura
>> <[email protected]> wrote:
>> > And IIUC, it's the same for your threshold feature, right ?
>> > I think it would be better:
>> >
>> > - discard this change.
>> > - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
>> >  and instead of adding a new STAT counter, do like:
>> >
>> >        if (mem_cgroup_event_check(mem)) {
>> >                mem_cgroup_update_tree(mem, page);
>> >                mem_cgroup_threshold(mem);
>> >        }
>>
>> I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be
>> run with different frequency. How to share MEM_CGROUP_STAT_EVENTS
>> between soft limits and thresholds in this case?
>>
> hmm, both softlimit and your threshold count events at the same place(charge and uncharge).
> So, I think those events can be shared.
> Is there any reason they should run in different frequency ?

SOFTLIMIT_EVENTS_THRESH is 1000. If we use the same value for thresholds,
a threshold can be exceeded by up to 1000*nr_cpu_ids pages. That's too many.
I think that 100 is a reasonable value.
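
As a rough worked example (assuming 4 KiB pages and nr_cpu_ids = 8):

	1000 events/cpu * 8 cpus * 4 KiB/page = 32000 KiB (~31 MiB of slack)
	 100 events/cpu * 8 cpus * 4 KiB/page =  3200 KiB  (~3 MiB of slack)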

mem_cgroup_soft_limit_check() resets MEM_CGROUP_STAT_EVENTS when it reaches
SOFTLIMIT_EVENTS_THRESH. If I do the same thing for THRESHOLDS_EVENTS_THRESH
(which is 100), mem_cgroup_event_check() will never be 'true'. Any idea how
to share MEM_CGROUP_STAT_EVENTS in this case?

2009-12-13 01:31:43

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

On Sat, 12 Dec 2009 21:46:08 +0200
"Kirill A. Shutemov" <[email protected]> wrote:

> On Sat, Dec 12, 2009 at 4:34 PM, Daisuke Nishimura
> <[email protected]> wrote:
> > On Sat, 12 Dec 2009 15:06:52 +0200
> > "Kirill A. Shutemov" <[email protected]> wrote:
> >
> >> On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura
> >> <[email protected]> wrote:
> >> > And IIUC, it's the same for your threshold feature, right ?
> >> > I think it would be better:
> >> >
> >> > - discard this change.
> >> > - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
> >> >  and instead of adding a new STAT counter, do like:
> >> >
> >> >        if (mem_cgroup_event_check(mem)) {
> >> >                mem_cgroup_update_tree(mem, page);
> >> >                mem_cgroup_threshold(mem);
> >> >        }
> >>
> >> I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be
> >> run with different frequency. How to share MEM_CGROUP_STAT_EVENTS
> >> between soft limits and thresholds in this case?
> >>
> > hmm, both softlimit and your threshold count events at the same place(charge and uncharge).
> > So, I think those events can be shared.
> > Is there any reason they should run in different frequency ?
>
> SOFTLIMIT_EVENTS_THRESH is 1000. If use the same value for thresholds,
> a threshold can
> be exceed on 1000*nr_cpu_id pages. It's too many. I think, that 100 is
> a reasonable value.
>
O.K. I see.

> mem_cgroup_soft_limit_check() resets MEM_CGROUP_STAT_EVENTS when it reaches
> SOFTLIMIT_EVENTS_THRESH. If I will do the same thing for
> THRESHOLDS_EVENTS_THRESH
> (which is 100) , mem_cgroup_event_check() will never be 'true'. Any
> idea how to share
> MEM_CGROUP_STAT_EVENTS in this case?
>
It's impossible if they have different frequencies, as you say.

Thank you for your clarification.


Daisuke Nishimura.

2009-12-15 01:38:31

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

On Sat, 12 Dec 2009 21:46:08 +0200
"Kirill A. Shutemov" <[email protected]> wrote:

> On Sat, Dec 12, 2009 at 4:34 PM, Daisuke Nishimura
> <[email protected]> wrote:
> > On Sat, 12 Dec 2009 15:06:52 +0200
> > "Kirill A. Shutemov" <[email protected]> wrote:
> >
> >> On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura
> >> <[email protected]> wrote:
> >> > And IIUC, it's the same for your threshold feature, right ?
> >> > I think it would be better:
> >> >
> >> > - discard this change.
> >> > - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
> >> >  and instead of adding a new STAT counter, do like:
> >> >
> >> >        if (mem_cgroup_event_check(mem)) {
> >> >                mem_cgroup_update_tree(mem, page);
> >> >                mem_cgroup_threshold(mem);
> >> >        }
> >>
> >> I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be
> >> run with different frequency. How to share MEM_CGROUP_STAT_EVENTS
> >> between soft limits and thresholds in this case?
> >>
> > hmm, both softlimit and your threshold count events at the same place(charge and uncharge).
> > So, I think those events can be shared.
> > Is there any reason they should run in different frequency ?
>
> SOFTLIMIT_EVENTS_THRESH is 1000. If use the same value for thresholds,
> a threshold can
> be exceed on 1000*nr_cpu_id pages. It's too many. I think, that 100 is
> a reasonable value.
>

Hmm, then how much cost does this code add?

Do you have benchmark results?

Thanks,
-Kame

2009-12-15 02:02:06

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/4] memcg: implement memory thresholds

On Sat, 12 Dec 2009 00:59:19 +0200
"Kirill A. Shutemov" <[email protected]> wrote:

> It allows to register multiple memory and memsw thresholds and gets
> notifications when it crosses.
>
> To register a threshold application need:
> - create an eventfd;
> - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
> - write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to
> cgroup.event_control.
>
> Application will be notified through eventfd when memory usage crosses
> threshold in any direction.
>
> It's applicable for root and non-root cgroup.
>
> It uses stats to track memory usage, simmilar to soft limits. It checks
> if we need to send event to userspace on every 100 page in/out. I guess
> it's good compromise between performance and accuracy of thresholds.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/memcontrol.c | 263 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 263 insertions(+), 0 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c6081cc..5ba2140 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6,6 +6,10 @@
> * Copyright 2007 OpenVZ SWsoft Inc
> * Author: Pavel Emelianov <[email protected]>
> *
> + * Memory thresholds
> + * Copyright (C) 2009 Nokia Corporation
> + * Author: Kirill A. Shutemov
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -38,6 +42,7 @@
> #include <linux/vmalloc.h>
> #include <linux/mm_inline.h>
> #include <linux/page_cgroup.h>
> +#include <linux/eventfd.h>
> #include "internal.h"
>
> #include <asm/uaccess.h>
> @@ -56,6 +61,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>
> static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
> #define SOFTLIMIT_EVENTS_THRESH (1000)
> +#define THRESHOLDS_EVENTS_THRESH (100)
>
> /*
> * Statistics for memory cgroup.
> @@ -72,6 +78,8 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> used by soft limit implementation */
> + MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> + used by threshold implementation */
>
> MEM_CGROUP_STAT_NSTATS,
> };
> @@ -182,6 +190,15 @@ struct mem_cgroup_tree {
>
> static struct mem_cgroup_tree soft_limit_tree __read_mostly;
>
> +struct mem_cgroup_threshold {
> + struct list_head list;
> + struct eventfd_ctx *eventfd;
> + u64 threshold;
> +};
> +
> +static bool mem_cgroup_threshold_check(struct mem_cgroup* mem);
> +static void mem_cgroup_threshold(struct mem_cgroup* mem, bool swap);
> +
> /*
> * The memory controller data structure. The memory controller controls both
> * page cache and RSS per cgroup. We would eventually like to provide
> @@ -233,6 +250,19 @@ struct mem_cgroup {
> /* set when res.limit == memsw.limit */
> bool memsw_is_minimum;
>
> + /* protect lists of thresholds*/
> + spinlock_t thresholds_lock;
> +
> + /* thresholds for memory usage */
> + struct list_head thresholds;
> + struct mem_cgroup_threshold *below_threshold;
> + struct mem_cgroup_threshold *above_threshold;
> +
> + /* thresholds for mem+swap usage */
> + struct list_head memsw_thresholds;
> + struct mem_cgroup_threshold *memsw_below_threshold;
> + struct mem_cgroup_threshold *memsw_above_threshold;
> +
> /*
> * statistics. This must be placed at the end of memcg.
> */
> @@ -519,6 +549,8 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> __mem_cgroup_stat_add_safe(cpustat,
> MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
> __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
> + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_THRESHOLDS, -1);
> +
> put_cpu();
> }
>
> @@ -1363,6 +1395,11 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> if (mem_cgroup_soft_limit_check(mem))
> mem_cgroup_update_tree(mem, page);
> done:
> + if (mem_cgroup_threshold_check(mem)) {
> + mem_cgroup_threshold(mem, false);
> + if (do_swap_account)
> + mem_cgroup_threshold(mem, true);
> + }
> return 0;
> nomem:
> css_put(&mem->css);
> @@ -1906,6 +1943,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>
> if (mem_cgroup_soft_limit_check(mem))
> mem_cgroup_update_tree(mem, page);
> + if (mem_cgroup_threshold_check(mem)) {
> + mem_cgroup_threshold(mem, false);
> + if (do_swap_account)
> + mem_cgroup_threshold(mem, true);
> + }
> /* at swapout, this memcg will be accessed to record to swap */
> if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> css_put(&mem->css);
> @@ -2860,11 +2902,181 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> }
>
>
> +static bool mem_cgroup_threshold_check(struct mem_cgroup *mem)
> +{
> + bool ret = false;
> + int cpu;
> + s64 val;
> + struct mem_cgroup_stat_cpu *cpustat;
> +
> + cpu = get_cpu();
> + cpustat = &mem->stat.cpustat[cpu];
> + val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_THRESHOLDS);
> + if (unlikely(val < 0)) {
> + __mem_cgroup_stat_set(cpustat, MEM_CGROUP_STAT_THRESHOLDS,
> + THRESHOLDS_EVENTS_THRESH);
> + ret = true;
> + }
> + put_cpu();
> + return ret;
> +}
> +

Hmm. Please check

	if (likely(list_empty(&mem->thresholds) &&
		   list_empty(&mem->memsw_thresholds)))
		return;

or add a flag like mem->no_threshold_check to skip this routine quickly.

_OR_
I personally don't like to have 2 counters to catch events.

How about this ?

adds

	struct mem_cgroup {
		atomic_t event_counter;	// this is incremented per 32 page-in/out
		atomic_t last_softlimit_check;
		atomic_t last_thresh_check;
	};

	static bool mem_cgroup_threshold_check(struct mem_cgroup *mem)
	{
		decrement percpu event counter.
		if (percpu counter reaches 0) {
			if (atomic_dec_and_test(&mem->check_thresh)) {
				check threshold.
				reset counter.
			}
			if (atomic_dec_and_test(&mem->check_softlimit)) {
				update softlimit tree.
				reset counter.
			}
			reset percpu counter.
		}
	}

Then, you can have a counter that behaves like a system-wide event counter.


> +static void mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> +{
> + struct mem_cgroup_threshold **below, **above;
> + struct list_head *thresholds;
> + u64 usage = mem_cgroup_usage(memcg, swap);
> +
> + if (!swap) {
> + thresholds = &memcg->thresholds;
> + above = &memcg->above_threshold;
> + below = &memcg->below_threshold;
> + } else {
> + thresholds = &memcg->memsw_thresholds;
> + above = &memcg->memsw_above_threshold;
> + below = &memcg->memsw_below_threshold;
> + }
> +
> + spin_lock(&memcg->thresholds_lock);
> + if ((*above)->threshold <= usage) {
> + *below = *above;
> + list_for_each_entry_continue((*above), thresholds, list) {
> + eventfd_signal((*below)->eventfd, 1);
> + if ((*above)->threshold > usage)
> + break;
> + *below = *above;
> + }
> + } else if ((*below)->threshold > usage) {
> + *above = *below;
> + list_for_each_entry_continue_reverse((*below), thresholds,
> + list) {
> + eventfd_signal((*above)->eventfd, 1);
> + if ((*below)->threshold <= usage)
> + break;
> + *above = *below;
> + }
> + }
> + spin_unlock(&memcg->thresholds_lock);
> +}

Could you add a comment on the above check?

And do we need a *spin_lock* here? Can't you use an RCU list walk?

If you have to use a spinlock here, it is effectively a system-wide spinlock,
and a threshold of "100" is too small, I think.
If you can't use RCU, please at least use a mutex (with gfp_mask checked).

Thanks,
-Kame


> +
> +static void mem_cgroup_invalidate_thresholds(struct mem_cgroup *memcg,
> + bool swap)
> +{
> + struct list_head *thresholds;
> + struct mem_cgroup_threshold **below, **above;
> + u64 usage = mem_cgroup_usage(memcg, swap);
> +
> + if (!swap) {
> + thresholds = &memcg->thresholds;
> + above = &memcg->above_threshold;
> + below = &memcg->below_threshold;
> + } else {
> + thresholds = &memcg->memsw_thresholds;
> + above = &memcg->memsw_above_threshold;
> + below = &memcg->memsw_below_threshold;
> + }
> +
> + *below = NULL;
> + list_for_each_entry((*above), thresholds, list) {
> + if ((*above)->threshold > usage) {
> + BUG_ON(!*below);
> + break;
> + }
> + *below = *above;
> + }
> +}
> +
> +static inline struct mem_cgroup_threshold *mem_cgroup_threshold_alloc(void)
> +{
> + struct mem_cgroup_threshold *new;
> +
> + new = kmalloc(sizeof(*new), GFP_KERNEL);
> + if (!new)
> + return NULL;
> + INIT_LIST_HEAD(&new->list);
> +
> + return new;
> +}
> +
> +static int mem_cgroup_register_event(struct cgroup *cgrp, struct cftype *cft,
> + struct eventfd_ctx *eventfd, const char *args)
> +{
> + u64 threshold;
> + struct mem_cgroup_threshold *new, *tmp;
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + struct list_head *thresholds;
> + int type = MEMFILE_TYPE(cft->private);
> + int ret;
> +
> + ret = res_counter_memparse_write_strategy(args, &threshold);
> + if (ret)
> + return ret;
> +
> + new = mem_cgroup_threshold_alloc();
> + if (!new)
> + return -ENOMEM;
> + new->eventfd = eventfd;
> + new->threshold = threshold;
> +
> + if (type == _MEM)
> + thresholds = &memcg->thresholds;
> + else if (type == _MEMSWAP)
> + thresholds = &memcg->memsw_thresholds;
> + else
> + BUG();
> +
> + /* Check if a threshold crossed before adding a new one */
> + mem_cgroup_threshold(memcg, type == _MEMSWAP);
> +
> + spin_lock(&memcg->thresholds_lock);
> + list_for_each_entry(tmp, thresholds, list)
> + if (new->threshold < tmp->threshold) {
> + list_add_tail(&new->list, &tmp->list);
> + break;
> + }
> + mem_cgroup_invalidate_thresholds(memcg, type == _MEMSWAP);
> + spin_unlock(&memcg->thresholds_lock);
> +
> + return 0;
> +}
> +
> +static int mem_cgroup_unregister_event(struct cgroup *cgrp, struct cftype *cft,
> + struct eventfd_ctx *eventfd)
> +{
> + struct mem_cgroup_threshold *threshold, *tmp;
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + struct list_head *thresholds;
> + int type = MEMFILE_TYPE(cft->private);
> +
> + if (type == _MEM)
> + thresholds = &memcg->thresholds;
> + else if (type == _MEMSWAP)
> + thresholds = &memcg->memsw_thresholds;
> + else
> + BUG();
> +
> + /* Check if a threshold crossed before adding a new one */
> + mem_cgroup_threshold(memcg, type == _MEMSWAP);
> +
> + spin_lock(&memcg->thresholds_lock);
> + list_for_each_entry_safe(threshold, tmp, thresholds, list)
> + if (threshold->eventfd == eventfd) {
> + list_del(&threshold->list);
> + kfree(threshold);
> + }
> + mem_cgroup_invalidate_thresholds(memcg, type == _MEMSWAP);
> + spin_unlock(&memcg->thresholds_lock);
> +
> + return 0;
> +}
> +
> static struct cftype mem_cgroup_files[] = {
> {
> .name = "usage_in_bytes",
> .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> .read_u64 = mem_cgroup_read,
> + .register_event = mem_cgroup_register_event,
> + .unregister_event = mem_cgroup_unregister_event,
> },
> {
> .name = "max_usage_in_bytes",
> @@ -2916,6 +3128,8 @@ static struct cftype memsw_cgroup_files[] = {
> .name = "memsw.usage_in_bytes",
> .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> .read_u64 = mem_cgroup_read,
> + .register_event = mem_cgroup_register_event,
> + .unregister_event = mem_cgroup_unregister_event,
> },
> {
> .name = "memsw.max_usage_in_bytes",
> @@ -2990,6 +3204,48 @@ static void free_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> kfree(mem->info.nodeinfo[node]);
> }
>
> +static int mem_cgroup_thresholds_init(struct mem_cgroup *mem)
> +{
> + INIT_LIST_HEAD(&mem->thresholds);
> + INIT_LIST_HEAD(&mem->memsw_thresholds);
> +
> + mem->below_threshold = mem_cgroup_threshold_alloc();
> + list_add(&mem->below_threshold->list, &mem->thresholds);
> + mem->below_threshold->threshold = 0ULL;
> +
> + mem->above_threshold = mem_cgroup_threshold_alloc();
> + list_add_tail(&mem->above_threshold->list, &mem->thresholds);
> + mem->above_threshold->threshold = RESOURCE_MAX;
> +
> + mem->memsw_below_threshold = mem_cgroup_threshold_alloc();
> + list_add(&mem->memsw_below_threshold->list, &mem->memsw_thresholds);
> + mem->memsw_below_threshold->threshold = 0ULL;
> +
> + mem->memsw_above_threshold = mem_cgroup_threshold_alloc();
> + list_add_tail(&mem->memsw_above_threshold->list, &mem->memsw_thresholds);
> + mem->memsw_above_threshold->threshold = RESOURCE_MAX;
> +
> + return 0;
> +}
> +
> +static void mem_cgroup_thresholds_free(struct mem_cgroup *mem)
> +{
> + /* Make sure that lists have only two elements */
> + BUG_ON((mem->below_threshold) &&
> + (mem->above_threshold) &&
> + (mem->below_threshold->list.next !=
> + &mem->above_threshold->list));
> + BUG_ON((mem->memsw_below_threshold) &&
> + (mem->memsw_above_threshold) &&
> + (mem->below_threshold->list.next !=
> + &mem->above_threshold->list));
> +
> + kfree(mem->below_threshold);
> + kfree(mem->above_threshold);
> + kfree(mem->memsw_below_threshold);
> + kfree(mem->memsw_above_threshold);
> +}
> +
> static int mem_cgroup_size(void)
> {
> int cpustat_size = nr_cpu_ids * sizeof(struct mem_cgroup_stat_cpu);
> @@ -3026,6 +3282,8 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
> {
> int node;
>
> + mem_cgroup_thresholds_free(mem);
> +
> mem_cgroup_remove_from_trees(mem);
> free_css_id(&mem_cgroup_subsys, &mem->css);
>
> @@ -3145,6 +3403,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> mem->last_scanned_child = 0;
> spin_lock_init(&mem->reclaim_param_lock);
>
> + spin_lock_init(&mem->thresholds_lock);
> + error = mem_cgroup_thresholds_init(mem);
> + if (error)
> + goto free_out;
> +
> if (parent)
> mem->swappiness = get_swappiness(parent);
> atomic_set(&mem->refcnt, 1);
> --
> 1.6.5.3
>
>

2009-12-15 07:48:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

On Tue, Dec 15, 2009 at 3:35 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Sat, 12 Dec 2009 21:46:08 +0200
> "Kirill A. Shutemov" <[email protected]> wrote:
>
>> On Sat, Dec 12, 2009 at 4:34 PM, Daisuke Nishimura
>> <[email protected]> wrote:
>> > On Sat, 12 Dec 2009 15:06:52 +0200
>> > "Kirill A. Shutemov" <[email protected]> wrote:
>> >
>> >> On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura
>> >> <[email protected]> wrote:
>> >> > And IIUC, it's the same for your threshold feature, right ?
>> >> > I think it would be better:
>> >> >
>> >> > - discard this change.
>> >> > - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
>> >> >  and instead of adding a new STAT counter, do like:
>> >> >
>> >> >        if (mem_cgroup_event_check(mem)) {
>> >> >                mem_cgroup_update_tree(mem, page);
>> >> >                mem_cgroup_threshold(mem);
>> >> >        }
>> >>
>> >> I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be
>> >> run with different frequency. How to share MEM_CGROUP_STAT_EVENTS
>> >> between soft limits and thresholds in this case?
>> >>
>> > hmm, both softlimit and your threshold count events at the same place(charge and uncharge).
>> > So, I think those events can be shared.
>> > Is there any reason they should run in different frequency ?
>>
>> SOFTLIMIT_EVENTS_THRESH is 1000. If we use the same value for thresholds,
>> a threshold can
>> be exceeded by 1000*nr_cpu_id pages. That's too many. I think that 100 is
>> a reasonable value.
>>
>
> Hmm, then what amount of costs does this code add ?
>
> Do you have benchmark result ?

I've posted some numbers on how the patchset affects performance:
http://article.gmane.org/gmane.linux.kernel.mm/41880

Do you need any other results?

2009-12-15 08:10:22

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

On Tue, 15 Dec 2009 09:48:09 +0200
"Kirill A. Shutemov" <[email protected]> wrote:

> On Tue, Dec 15, 2009 at 3:35 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Sat, 12 Dec 2009 21:46:08 +0200
> > "Kirill A. Shutemov" <[email protected]> wrote:
> >
> >> On Sat, Dec 12, 2009 at 4:34 PM, Daisuke Nishimura
> >> <[email protected]> wrote:
> >> > On Sat, 12 Dec 2009 15:06:52 +0200
> >> > "Kirill A. Shutemov" <[email protected]> wrote:
> >> >
> >> >> On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura
> >> >> <[email protected]> wrote:
> >> >> > And IIUC, it's the same for your threshold feature, right ?
> >> >> > I think it would be better:
> >> >> >
> >> >> > - discard this change.
> >> >> > - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
> >> >> >  and instead of adding a new STAT counter, do like:
> >> >> >
> >> >> >        if (mem_cgroup_event_check(mem)) {
> >> >> >                mem_cgroup_update_tree(mem, page);
> >> >> >                mem_cgroup_threshold(mem);
> >> >> >        }
> >> >>
> >> >> I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be
> >> >> run with different frequency. How to share MEM_CGROUP_STAT_EVENTS
> >> >> between soft limits and thresholds in this case?
> >> >>
> >> > hmm, both softlimit and your threshold count events at the same place(charge and uncharge).
> >> > So, I think those events can be shared.
> >> > Is there any reason they should run in different frequency ?
> >>
> >> SOFTLIMIT_EVENTS_THRESH is 1000. If we use the same value for thresholds,
> >> a threshold can
> >> be exceeded by 1000*nr_cpu_id pages. That's too many. I think that 100 is
> >> a reasonable value.
> >>
> >
> > Hmm, then what amount of costs does this code add ?
> >
> > Do you have benchmark result ?
>
> I've posted some numbers on how the patchset affects performance:
> http://article.gmane.org/gmane.linux.kernel.mm/41880
>
> Do you need any other results?
>
Ah, sorry. I missed that. The numbers seem good.

(off topic)
multi-fault is too special; it's just a toy of mine ;)

The test I recommend is a kernel build on tmpfs.
This is my setup script.
==
#!/bin/sh

mount -t tmpfs none /home/kamezawa/tmpfs
cp /home/kamezawa/linux-2.6.30.tar.bz2 /home/kamezawa/tmpfs
cd /home/kamezawa/tmpfs
mkdir /home/kamezawa/tmpfs/tmp
tar xvpjf linux-2.6.30.tar.bz2
cd linux-2.6.30
make defconfig

and put gcc's temporary storage (TMPDIR) on tmpfs as well.

# make clean; make -j 8 or so

and check "stime".

But I'm not asking you to do this now.
The whole patch seems attractive to me. Please fix the things pointed out.

I'll hold my patches rewriting memcg's percpu counters until yours and
Nishimura's patches go in. You can leave your threshold event counter as it
is. I'll think about whether I can do a total rewrite of that counter or not.

Thanks,
-Kame

2009-12-15 09:11:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 1/4] cgroup: implement eventfd-based generic API for notifications

Could anybody review the patch?

Thank you.

On Sat, Dec 12, 2009 at 12:59 AM, Kirill A. Shutemov
<[email protected]> wrote:
> This patch introduces write-only file "cgroup.event_control" in every
> cgroup.
>
> To register new notification handler you need:
> - create an eventfd;
> - open a control file to be monitored. Callbacks register_event() and
>  unregister_event() must be defined for the control file;
> - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
>  Interpretation of args is defined by control file implementation;
>
> eventfd will be woken up by control file implementation or when the
> cgroup is removed.
>
> To unregister notification handler just close eventfd.
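
(For illustration, the flow described above could be driven from userspace
roughly as in the sketch below; the cgroup mount point and group name are
assumptions, and the control file plus byte threshold follow patch 4/4 of
this series.)

	#include <sys/eventfd.h>
	#include <stdio.h>
	#include <string.h>
	#include <stdint.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		int efd = eventfd(0, 0);
		int cfd = open("/cgroup/grp/memory.usage_in_bytes", O_RDONLY);
		int ecfd = open("/cgroup/grp/cgroup.event_control", O_WRONLY);
		char buf[64];
		uint64_t cnt;

		if (efd < 0 || cfd < 0 || ecfd < 0)
			return 1;

		/* "<event_fd> <control_fd> <args>"; args is a threshold in bytes */
		snprintf(buf, sizeof(buf), "%d %d %llu", efd, cfd, 32ULL << 20);
		if (write(ecfd, buf, strlen(buf)) < 0)
			return 1;

		/* blocks until usage crosses the 32M threshold in either direction */
		read(efd, &cnt, sizeof(cnt));
		printf("threshold crossed\n");
		return 0;
	}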
>
> If you need notification functionality for a control file you have to
> implement callbacks register_event() and unregister_event() in the
> struct cftype.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
>  include/linux/cgroup.h |   20 +++++
>  kernel/cgroup.c        |  215 +++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 234 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 0008dee..7ad3078 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -220,6 +220,10 @@ struct cgroup {
>
>        /* For RCU-protected deletion */
>        struct rcu_head rcu_head;
> +
> +       /* List of events which userspace want to recieve */
> +       struct list_head event_list;
> +       struct mutex event_list_mutex;
>  };
>
>  /*
> @@ -362,6 +366,22 @@ struct cftype {
>        int (*trigger)(struct cgroup *cgrp, unsigned int event);
>
>        int (*release)(struct inode *inode, struct file *file);
> +
> +       /*
> +        * register_event() callback will be used to add new userspace
> +        * waiter for changes related to the cftype. Implement it if
> +        * you want to provide this functionality. Use eventfd_signal()
> +        * on eventfd to send notification to userspace.
> +        */
> +       int (*register_event)(struct cgroup *cgrp, struct cftype *cft,
> +                       struct eventfd_ctx *eventfd, const char *args);
> +       /*
> +        * unregister_event() callback will be called when userspace
> +        * close the eventfd. This callback must be implemented, if you
> +        * provide register_event().
> +        */
> +       int (*unregister_event)(struct cgroup *cgrp, struct cftype *cft,
> +                       struct eventfd_ctx *eventfd);
>  };
>
>  struct cgroup_scanner {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 0249f4b..f7ec3ca 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4,6 +4,10 @@
>  *  Based originally on the cpuset system, extracted by Paul Menage
>  *  Copyright (C) 2006 Google, Inc
>  *
> + *  Notifiactions support
> + *  Copyright (C) 2009 Nokia Corporation
> + *  Author: Kirill A. Shutemov
> + *
>  *  Copyright notices from the original cpuset code:
>  *  --------------------------------------------------
>  *  Copyright (C) 2003 BULL SA.
> @@ -51,6 +55,8 @@
>  #include <linux/pid_namespace.h>
>  #include <linux/idr.h>
>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
> +#include <linux/eventfd.h>
> +#include <linux/poll.h>
>
>  #include <asm/atomic.h>
>
> @@ -146,6 +152,36 @@ struct css_id {
>        unsigned short stack[0]; /* Array of Length (depth+1) */
>  };
>
> +/*
> + * cgroup_event represents event which userspace want to recieve.
> + */
> +struct cgroup_event {
> +       /*
> +        * Cgroup which the event belongs to.
> +        */
> +       struct cgroup *cgrp;
> +       /*
> +        * Control file which the event associated.
> +        */
> +       struct cftype *cft;
> +       /*
> +        * eventfd to signal userspace about the event.
> +        */
> +       struct eventfd_ctx *eventfd;
> +       /*
> +        * Each of these stored in a list by the cgroup.
> +        */
> +       struct list_head list;
> +       /*
> +        * All fields below needed to unregister event when
> +        * userspace closes eventfd.
> +        */
> +       poll_table pt;
> +       wait_queue_head_t *wqh;
> +       wait_queue_t wait;
> +       struct work_struct remove;
> +};
> +static void cgroup_event_remove(struct cgroup_event *event);
>
>  /* The list of hierarchy roots */
>
> @@ -734,14 +770,29 @@ static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
>  static int cgroup_call_pre_destroy(struct cgroup *cgrp)
>  {
>        struct cgroup_subsys *ss;
> +       struct cgroup_event *event, *tmp;
>        int ret = 0;
>
>        for_each_subsys(cgrp->root, ss)
>                if (ss->pre_destroy) {
>                        ret = ss->pre_destroy(ss, cgrp);
>                        if (ret)
> -                               break;
> +                               goto out;
>                }
> +
> +       /*
> +        * Unregister events and notify userspace.
> +        * FIXME: How to avoid race with cgroup_event_remove_work()
> +        *        which runs from workqueue?
> +        */
> +       mutex_lock(&cgrp->event_list_mutex);
> +       list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
> +               cgroup_event_remove(event);
> +               eventfd_signal(event->eventfd, 1);
> +       }
> +       mutex_unlock(&cgrp->event_list_mutex);
> +
> +out:
>        return ret;
>  }
>
> @@ -1136,6 +1187,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
>        INIT_LIST_HEAD(&cgrp->release_list);
>        INIT_LIST_HEAD(&cgrp->pidlists);
>        mutex_init(&cgrp->pidlist_mutex);
> +       INIT_LIST_HEAD(&cgrp->event_list);
> +       mutex_init(&cgrp->event_list_mutex);
>  }
>
>  static void init_cgroup_root(struct cgroupfs_root *root)
> @@ -1935,6 +1988,16 @@ static const struct inode_operations cgroup_dir_inode_operations = {
>        .rename = cgroup_rename,
>  };
>
> +/*
> + * Check if a file is a control file
> + */
> +static inline struct cftype *__file_cft(struct file *file)
> +{
> +       if (file->f_dentry->d_inode->i_fop != &cgroup_file_operations)
> +               return ERR_PTR(-EINVAL);
> +       return __d_cft(file->f_dentry);
> +}
> +
>  static int cgroup_create_file(struct dentry *dentry, mode_t mode,
>                                struct super_block *sb)
>  {
> @@ -2789,6 +2852,151 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
>        return 0;
>  }
>
> +static inline void cgroup_event_remove(struct cgroup_event *event)
> +{
> +       struct cgroup *cgrp = event->cgrp;
> +
> +       BUG_ON(event->cft->unregister_event(cgrp, event->cft, event->eventfd));
> +       eventfd_ctx_put(event->eventfd);
> +       remove_wait_queue(event->wqh, &event->wait);
> +       list_del(&event->list);
> +       kfree(event);
> +}
> +
> +static void cgroup_event_remove_work(struct work_struct *work)
> +{
> +       struct cgroup_event *event = container_of(work, struct cgroup_event,
> +                       remove);
> +       struct cgroup *cgrp = event->cgrp;
> +
> +       mutex_lock(&cgrp->event_list_mutex);
> +       cgroup_event_remove(event);
> +       mutex_unlock(&cgrp->event_list_mutex);
> +}
> +
> +static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
> +               int sync, void *key)
> +{
> +       struct cgroup_event *event = container_of(wait,
> +                       struct cgroup_event, wait);
> +       unsigned long flags = (unsigned long)key;
> +
> +       if (flags & POLLHUP)
> +               /*
> +                * This function called with spinlock taken, but
> +                * cgroup_event_remove() may sleep, so we have
> +                * to run it in a workqueue.
> +                */
> +               schedule_work(&event->remove);
> +
> +       return 0;
> +}
> +
> +static void cgroup_event_ptable_queue_proc(struct file *file,
> +               wait_queue_head_t *wqh, poll_table *pt)
> +{
> +       struct cgroup_event *event = container_of(pt,
> +                       struct cgroup_event, pt);
> +
> +       event->wqh = wqh;
> +       add_wait_queue(wqh, &event->wait);
> +}
> +
> +static int cgroup_write_event_control(struct cgroup *cont, struct cftype *cft,
> +                                     const char *buffer)
> +{
> +       struct cgroup_event *event = NULL;
> +       unsigned int efd, cfd;
> +       struct file *efile = NULL;
> +       struct file *cfile = NULL;
> +       char *endp;
> +       int ret;
> +
> +       efd = simple_strtoul(buffer, &endp, 10);
> +       if (*endp != ' ')
> +               return -EINVAL;
> +       buffer = endp + 1;
> +
> +       cfd = simple_strtoul(buffer, &endp, 10);
> +       if ((*endp != ' ') && (*endp != '\0'))
> +               return -EINVAL;
> +       buffer = endp + 1;
> +
> +       event = kzalloc(sizeof(*event), GFP_KERNEL);
> +       if (!event)
> +               return -ENOMEM;
> +       event->cgrp = cont;
> +       INIT_LIST_HEAD(&event->list);
> +       init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
> +       init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
> +       INIT_WORK(&event->remove, cgroup_event_remove_work);
> +
> +       efile = eventfd_fget(efd);
> +       if (IS_ERR(efile)) {
> +               ret = PTR_ERR(efile);
> +               goto fail;
> +       }
> +
> +       event->eventfd = eventfd_ctx_fileget(efile);
> +       if (IS_ERR(event->eventfd)) {
> +               ret = PTR_ERR(event->eventfd);
> +               goto fail;
> +       }
> +
> +       cfile = fget(cfd);
> +       if (!cfile) {
> +               ret = -EBADF;
> +               goto fail;
> +       }
> +
> +       /* the process need read permission on control file */
> +       ret = file_permission(cfile, MAY_READ);
> +       if (ret < 0)
> +               goto fail;
> +
> +       event->cft = __file_cft(cfile);
> +       if (IS_ERR(event->cft)) {
> +               ret = PTR_ERR(event->cft);
> +               goto fail;
> +       }
> +
> +       if (!event->cft->register_event || !event->cft->unregister_event) {
> +               ret = -EINVAL;
> +               goto fail;
> +       }
> +
> +       ret = event->cft->register_event(cont, event->cft,
> +                       event->eventfd, buffer);
> +       if (ret)
> +               goto fail;
> +
> +       efile->f_op->poll(efile, &event->pt);
> +
> +       mutex_lock(&cont->event_list_mutex);
> +       list_add(&event->list, &cont->event_list);
> +       mutex_unlock(&cont->event_list_mutex);
> +
> +       fput(cfile);
> +       fput(efile);
> +
> +       return 0;
> +
> +fail:
> +       if (!IS_ERR(cfile))
> +               fput(cfile);
> +
> +       if (event && event->eventfd && !IS_ERR(event->eventfd))
> +               eventfd_ctx_put(event->eventfd);
> +
> +       if (!IS_ERR(efile))
> +               fput(efile);
> +
> +       if (event)
> +               kfree(event);
> +
> +       return ret;
> +}
> +
>  /*
>  * for the common functions, 'private' gives the type of file
>  */
> @@ -2814,6 +3022,11 @@ static struct cftype files[] = {
>                .read_u64 = cgroup_read_notify_on_release,
>                .write_u64 = cgroup_write_notify_on_release,
>        },
> +       {
> +               .name = CGROUP_FILE_GENERIC_PREFIX "event_control",
> +               .write_string = cgroup_write_event_control,
> +               .mode = S_IWUGO,
> +       },
>  };
>
>  static struct cftype cft_release_agent = {
> --
> 1.6.5.3
>
>

2009-12-15 09:38:45

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH RFC v2 1/4] cgroup: implement eventfd-based generic API for notifications

On Tue, 15 Dec 2009 11:11:16 +0200
"Kirill A. Shutemov" <[email protected]> wrote:

> Could anybody review the patch?
>
> Thank you.

some nitpicks.

>
> On Sat, Dec 12, 2009 at 12:59 AM, Kirill A. Shutemov
> <[email protected]> wrote:

> > +       /*
> > +        * Unregister events and notify userspace.
> > +        * FIXME: How to avoid race with cgroup_event_remove_work()
> > +        *        which runs from workqueue?
> > +        */
> > +       mutex_lock(&cgrp->event_list_mutex);
> > +       list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
> > +               cgroup_event_remove(event);
> > +               eventfd_signal(event->eventfd, 1);
> > +       }
> > +       mutex_unlock(&cgrp->event_list_mutex);
> > +
> > +out:
> >        return ret;
> >  }

How critical is this FIXME?
But hmm... can't we use RCU?

> >
> > @@ -1136,6 +1187,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
> >        INIT_LIST_HEAD(&cgrp->release_list);
> >        INIT_LIST_HEAD(&cgrp->pidlists);
> >        mutex_init(&cgrp->pidlist_mutex);
> > +       INIT_LIST_HEAD(&cgrp->event_list);
> > +       mutex_init(&cgrp->event_list_mutex);
> >  }
> >
> >  static void init_cgroup_root(struct cgroupfs_root *root)
> > @@ -1935,6 +1988,16 @@ static const struct inode_operations cgroup_dir_inode_operations = {
> >        .rename = cgroup_rename,
> >  };
> >
> > +/*
> > + * Check if a file is a control file
> > + */
> > +static inline struct cftype *__file_cft(struct file *file)
> > +{
> > +       if (file->f_dentry->d_inode->i_fop != &cgroup_file_operations)
> > +               return ERR_PTR(-EINVAL);
> > +       return __d_cft(file->f_dentry);
> > +}
> > +
> >  static int cgroup_create_file(struct dentry *dentry, mode_t mode,
> >                                struct super_block *sb)
> >  {
> > @@ -2789,6 +2852,151 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
> >        return 0;
> >  }
> >
> > +static inline void cgroup_event_remove(struct cgroup_event *event)
> > +{
> > +       struct cgroup *cgrp = event->cgrp;
> > +
> > +       BUG_ON(event->cft->unregister_event(cgrp, event->cft, event->eventfd));

Hmm? BUG? If this can really trigger, please add documentation or a comment.

> > +       eventfd_ctx_put(event->eventfd);
> > +       remove_wait_queue(event->wqh, &event->wait);
> > +       list_del(&event->list);

please add a comment such as /* event_list_mutex must be held */

> > +       kfree(event);
> > +}
> > +
> > +static void cgroup_event_remove_work(struct work_struct *work)
> > +{
> > +       struct cgroup_event *event = container_of(work, struct cgroup_event,
> > +                       remove);
> > +       struct cgroup *cgrp = event->cgrp;
> > +
> > +       mutex_lock(&cgrp->event_list_mutex);
> > +       cgroup_event_remove(event);
> > +       mutex_unlock(&cgrp->event_list_mutex);
> > +}
> > +
> > +static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
> > +               int sync, void *key)
> > +{
> > +       struct cgroup_event *event = container_of(wait,
> > +                       struct cgroup_event, wait);
> > +       unsigned long flags = (unsigned long)key;
> > +
> > +       if (flags & POLLHUP)
> > +               /*
> > +                * This function called with spinlock taken, but
> > +                * cgroup_event_remove() may sleep, so we have
> > +                * to run it in a workqueue.
> > +                */
> > +               schedule_work(&event->remove);
> > +
> > +       return 0;
> > +}

> > +
> > +static void cgroup_event_ptable_queue_proc(struct file *file,
> > +               wait_queue_head_t *wqh, poll_table *pt)
> > +{
> > +       struct cgroup_event *event = container_of(pt,
> > +                       struct cgroup_event, pt);
> > +
> > +       event->wqh = wqh;
> > +       add_wait_queue(wqh, &event->wait);
> > +}
> > +
> > +static int cgroup_write_event_control(struct cgroup *cont, struct cftype *cft,
> > +                                     const char *buffer)
> > +{
> > +       struct cgroup_event *event = NULL;
> > +       unsigned int efd, cfd;
> > +       struct file *efile = NULL;
> > +       struct file *cfile = NULL;
> > +       char *endp;
> > +       int ret;
> > +
> > +       efd = simple_strtoul(buffer, &endp, 10);
> > +       if (*endp != ' ')
> > +               return -EINVAL;
> > +       buffer = endp + 1;
> > +
> > +       cfd = simple_strtoul(buffer, &endp, 10);
> > +       if ((*endp != ' ') && (*endp != '\0'))
> > +               return -EINVAL;
> > +       buffer = endp + 1;
> > +
> > +       event = kzalloc(sizeof(*event), GFP_KERNEL);
> > +       if (!event)
> > +               return -ENOMEM;
> > +       event->cgrp = cont;
> > +       INIT_LIST_HEAD(&event->list);
> > +       init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
> > +       init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
> > +       INIT_WORK(&event->remove, cgroup_event_remove_work);
> > +
> > +       efile = eventfd_fget(efd);
> > +       if (IS_ERR(efile)) {
> > +               ret = PTR_ERR(efile);
> > +               goto fail;
> > +       }
> > +
> > +       event->eventfd = eventfd_ctx_fileget(efile);
> > +       if (IS_ERR(event->eventfd)) {
> > +               ret = PTR_ERR(event->eventfd);
> > +               goto fail;
> > +       }
> > +
> > +       cfile = fget(cfd);
> > +       if (!cfile) {
> > +               ret = -EBADF;
> > +               goto fail;
> > +       }
> > +
> > +       /* the process need read permission on control file */
> > +       ret = file_permission(cfile, MAY_READ);
> > +       if (ret < 0)
> > +               goto fail;
> > +
> > +       event->cft = __file_cft(cfile);
> > +       if (IS_ERR(event->cft)) {
> > +               ret = PTR_ERR(event->cft);
> > +               goto fail;
> > +       }
> > +
> > +       if (!event->cft->register_event || !event->cft->unregister_event) {
> > +               ret = -EINVAL;
> > +               goto fail;
> > +       }
> > +
> > +       ret = event->cft->register_event(cont, event->cft,
> > +                       event->eventfd, buffer);
> > +       if (ret)
> > +               goto fail;
> > +
> > +       efile->f_op->poll(efile, &event->pt);

Isn't it necessary to check the return value?

Thanks,
-Kame
> > +
> > +       mutex_lock(&cont->event_list_mutex);
> > +       list_add(&event->list, &cont->event_list);
> > +       mutex_unlock(&cont->event_list_mutex);
> > +
> > +       fput(cfile);
> > +       fput(efile);
> > +
> > +       return 0;
> > +
> > +fail:
> > +       if (!IS_ERR(cfile))
> > +               fput(cfile);
> > +
> > +       if (event && event->eventfd && !IS_ERR(event->eventfd))
> > +               eventfd_ctx_put(event->eventfd);
> > +
> > +       if (!IS_ERR(efile))
> > +               fput(efile);
> > +
> > +       if (event)
> > +               kfree(event);
> > +
> > +       return ret;
> > +}
> > +
> >  /*
> >  * for the common functions, 'private' gives the type of file
> >  */
> > @@ -2814,6 +3022,11 @@ static struct cftype files[] = {
> >                .read_u64 = cgroup_read_notify_on_release,
> >                .write_u64 = cgroup_write_notify_on_release,
> >        },
> > +       {
> > +               .name = CGROUP_FILE_GENERIC_PREFIX "event_control",
> > +               .write_string = cgroup_write_event_control,
> > +               .mode = S_IWUGO,
> > +       },
> >  };
> >
> >  static struct cftype cft_release_agent = {
> > --
> > 1.6.5.3
> >
> >
>
>

2009-12-15 10:30:32

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 1/4] cgroup: implement eventfd-based generic API for notifications

On Tue, Dec 15, 2009 at 11:35 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 15 Dec 2009 11:11:16 +0200
> "Kirill A. Shutemov" <[email protected]> wrote:
>
>> Could anybody review the patch?
>>
>> Thank you.
>
> some nitpicks.
>
>>
>> On Sat, Dec 12, 2009 at 12:59 AM, Kirill A. Shutemov
>> <[email protected]> wrote:
>
>> > +       /*
>> > +        * Unregister events and notify userspace.
>> > +        * FIXME: How to avoid race with cgroup_event_remove_work()
>> > +        *        which runs from workqueue?
>> > +        */
>> > +       mutex_lock(&cgrp->event_list_mutex);
>> > +       list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
>> > +               cgroup_event_remove(event);
>> > +               eventfd_signal(event->eventfd, 1);
>> > +       }
>> > +       mutex_unlock(&cgrp->event_list_mutex);
>> > +
>> > +out:
>> >        return ret;
>> >  }
>
> How critical is this FIXME?

There is a potential race. I have never seen it happen. When userspace closes
an eventfd associated with a cgroup event, cgroup_event_remove() will not be
called immediately. It will be called later from a workqueue. If somebody
removes the cgroup before the workqueue calls cgroup_event_remove(), we will
get a problem.
It's unlikely, but theoretically possible.

> But hmm... can't we use RCU?

I'll play with it.

>> >
>> > @@ -1136,6 +1187,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
>> >        INIT_LIST_HEAD(&cgrp->release_list);
>> >        INIT_LIST_HEAD(&cgrp->pidlists);
>> >        mutex_init(&cgrp->pidlist_mutex);
>> > +       INIT_LIST_HEAD(&cgrp->event_list);
>> > +       mutex_init(&cgrp->event_list_mutex);
>> >  }
>> >
>> >  static void init_cgroup_root(struct cgroupfs_root *root)
>> > @@ -1935,6 +1988,16 @@ static const struct inode_operations cgroup_dir_inode_operations = {
>> >        .rename = cgroup_rename,
>> >  };
>> >
>> > +/*
>> > + * Check if a file is a control file
>> > + */
>> > +static inline struct cftype *__file_cft(struct file *file)
>> > +{
>> > +       if (file->f_dentry->d_inode->i_fop != &cgroup_file_operations)
>> > +               return ERR_PTR(-EINVAL);
>> > +       return __d_cft(file->f_dentry);
>> > +}
>> > +
>> >  static int cgroup_create_file(struct dentry *dentry, mode_t mode,
>> >                                struct super_block *sb)
>> >  {
>> > @@ -2789,6 +2852,151 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
>> >        return 0;
>> >  }
>> >
>> > +static inline void cgroup_event_remove(struct cgroup_event *event)
>> > +{
>> > +       struct cgroup *cgrp = event->cgrp;
>> > +
>> > +       BUG_ON(event->cft->unregister_event(cgrp, event->cft, event->eventfd));
>
> Hmm? BUG? If this can really trigger, please add documentation or a comment.

I'll remove it, since we check it in cgroup_write_event_control().

>> > +       eventfd_ctx_put(event->eventfd);
>> > +       remove_wait_queue(event->wqh, &event->wait);
>> > +       list_del(&event->list);
>
> please add a comment such as /* event_list_mutex must be held */

Ok.

>> > +       kfree(event);
>> > +}
>> > +
>> > +static void cgroup_event_remove_work(struct work_struct *work)
>> > +{
>> > +       struct cgroup_event *event = container_of(work, struct cgroup_event,
>> > +                       remove);
>> > +       struct cgroup *cgrp = event->cgrp;
>> > +
>> > +       mutex_lock(&cgrp->event_list_mutex);
>> > +       cgroup_event_remove(event);
>> > +       mutex_unlock(&cgrp->event_list_mutex);
>> > +}
>> > +
>> > +static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
>> > +               int sync, void *key)
>> > +{
>> > +       struct cgroup_event *event = container_of(wait,
>> > +                       struct cgroup_event, wait);
>> > +       unsigned long flags = (unsigned long)key;
>> > +
>> > +       if (flags & POLLHUP)
>> > +               /*
>> > +                * This function called with spinlock taken, but
>> > +                * cgroup_event_remove() may sleep, so we have
>> > +                * to run it in a workqueue.
>> > +                */
>> > +               schedule_work(&event->remove);
>> > +
>> > +       return 0;
>> > +}
>
>> > +
>> > +static void cgroup_event_ptable_queue_proc(struct file *file,
>> > +               wait_queue_head_t *wqh, poll_table *pt)
>> > +{
>> > +       struct cgroup_event *event = container_of(pt,
>> > +                       struct cgroup_event, pt);
>> > +
>> > +       event->wqh = wqh;
>> > +       add_wait_queue(wqh, &event->wait);
>> > +}
>> > +
>> > +static int cgroup_write_event_control(struct cgroup *cont, struct cftype *cft,
>> > +                                     const char *buffer)
>> > +{
>> > +       struct cgroup_event *event = NULL;
>> > +       unsigned int efd, cfd;
>> > +       struct file *efile = NULL;
>> > +       struct file *cfile = NULL;
>> > +       char *endp;
>> > +       int ret;
>> > +
>> > +       efd = simple_strtoul(buffer, &endp, 10);
>> > +       if (*endp != ' ')
>> > +               return -EINVAL;
>> > +       buffer = endp + 1;
>> > +
>> > +       cfd = simple_strtoul(buffer, &endp, 10);
>> > +       if ((*endp != ' ') && (*endp != '\0'))
>> > +               return -EINVAL;
>> > +       buffer = endp + 1;
>> > +
>> > +       event = kzalloc(sizeof(*event), GFP_KERNEL);
>> > +       if (!event)
>> > +               return -ENOMEM;
>> > +       event->cgrp = cont;
>> > +       INIT_LIST_HEAD(&event->list);
>> > +       init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
>> > +       init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
>> > +       INIT_WORK(&event->remove, cgroup_event_remove_work);
>> > +
>> > +       efile = eventfd_fget(efd);
>> > +       if (IS_ERR(efile)) {
>> > +               ret = PTR_ERR(efile);
>> > +               goto fail;
>> > +       }
>> > +
>> > +       event->eventfd = eventfd_ctx_fileget(efile);
>> > +       if (IS_ERR(event->eventfd)) {
>> > +               ret = PTR_ERR(event->eventfd);
>> > +               goto fail;
>> > +       }
>> > +
>> > +       cfile = fget(cfd);
>> > +       if (!cfile) {
>> > +               ret = -EBADF;
>> > +               goto fail;
>> > +       }
>> > +
>> > +       /* the process need read permission on control file */
>> > +       ret = file_permission(cfile, MAY_READ);
>> > +       if (ret < 0)
>> > +               goto fail;
>> > +
>> > +       event->cft = __file_cft(cfile);
>> > +       if (IS_ERR(event->cft)) {
>> > +               ret = PTR_ERR(event->cft);
>> > +               goto fail;
>> > +       }
>> > +
>> > +       if (!event->cft->register_event || !event->cft->unregister_event) {
>> > +               ret = -EINVAL;
>> > +               goto fail;
>> > +       }
>> > +
>> > +       ret = event->cft->register_event(cont, event->cft,
>> > +                       event->eventfd, buffer);
>> > +       if (ret)
>> > +               goto fail;
>> > +
>> > +       efile->f_op->poll(efile, &event->pt);
>
> Isn't it necessary to check the return value?

You are right. We need to check the return value for POLLHUP.
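
For reference, one possible shape of that check in cgroup_write_event_control()
(an assumption about the eventual fix, not code from this posting):

	unsigned int events;	/* assumed new local */

	events = efile->f_op->poll(efile, &event->pt);
	if (events & POLLHUP) {
		/* eventfd already closed: undo the registration and bail out */
		event->cft->unregister_event(cont, event->cft, event->eventfd);
		ret = 0;
		goto fail;
	}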

Thanks!

2009-12-15 10:46:36

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/4] memcg: implement memory thresholds

On Tue, Dec 15, 2009 at 3:58 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Sat, 12 Dec 2009 00:59:19 +0200
> "Kirill A. Shutemov" <[email protected]> wrote:
>
>> It allows to register multiple memory and memsw thresholds and gets
>> notifications when it crosses.
>>
>> To register a threshold application need:
>> - create an eventfd;
>> - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
>> - write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to
>>   cgroup.event_control.
>>
>> Application will be notified through eventfd when memory usage crosses
>> threshold in any direction.
>>
>> It's applicable for root and non-root cgroup.
>>
>> It uses stats to track memory usage, simmilar to soft limits. It checks
>> if we need to send event to userspace on every 100 page in/out. I guess
>> it's good compromise between performance and accuracy of thresholds.
>>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> ---
>>  mm/memcontrol.c |  263 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 263 insertions(+), 0 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index c6081cc..5ba2140 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -6,6 +6,10 @@
>>   * Copyright 2007 OpenVZ SWsoft Inc
>>   * Author: Pavel Emelianov <[email protected]>
>>   *
>> + * Memory thresholds
>> + * Copyright (C) 2009 Nokia Corporation
>> + * Author: Kirill A. Shutemov
>> + *
>>   * This program is free software; you can redistribute it and/or modify
>>   * it under the terms of the GNU General Public License as published by
>>   * the Free Software Foundation; either version 2 of the License, or
>> @@ -38,6 +42,7 @@
>>  #include <linux/vmalloc.h>
>>  #include <linux/mm_inline.h>
>>  #include <linux/page_cgroup.h>
>> +#include <linux/eventfd.h>
>>  #include "internal.h"
>>
>>  #include <asm/uaccess.h>
>> @@ -56,6 +61,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>>
>>  static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
>>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>> +#define THRESHOLDS_EVENTS_THRESH (100)
>>
>>  /*
>>   * Statistics for memory cgroup.
>> @@ -72,6 +78,8 @@ enum mem_cgroup_stat_index {
>>       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>>       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
>>                                       used by soft limit implementation */
>> +     MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
>> +                                     used by threshold implementation */
>>
>>       MEM_CGROUP_STAT_NSTATS,
>>  };
>> @@ -182,6 +190,15 @@ struct mem_cgroup_tree {
>>
>>  static struct mem_cgroup_tree soft_limit_tree __read_mostly;
>>
>> +struct mem_cgroup_threshold {
>> +     struct list_head list;
>> +     struct eventfd_ctx *eventfd;
>> +     u64 threshold;
>> +};
>> +
>> +static bool mem_cgroup_threshold_check(struct mem_cgroup* mem);
>> +static void mem_cgroup_threshold(struct mem_cgroup* mem, bool swap);
>> +
>>  /*
>>   * The memory controller data structure. The memory controller controls both
>>   * page cache and RSS per cgroup. We would eventually like to provide
>> @@ -233,6 +250,19 @@ struct mem_cgroup {
>>       /* set when res.limit == memsw.limit */
>>       bool            memsw_is_minimum;
>>
>> +     /* protect lists of thresholds*/
>> +     spinlock_t thresholds_lock;
>> +
>> +     /* thresholds for memory usage */
>> +     struct list_head thresholds;
>> +     struct mem_cgroup_threshold *below_threshold;
>> +     struct mem_cgroup_threshold *above_threshold;
>> +
>> +     /* thresholds for mem+swap usage */
>> +     struct list_head memsw_thresholds;
>> +     struct mem_cgroup_threshold *memsw_below_threshold;
>> +     struct mem_cgroup_threshold *memsw_above_threshold;
>> +
>>       /*
>>        * statistics. This must be placed at the end of memcg.
>>        */
>> @@ -519,6 +549,8 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>>               __mem_cgroup_stat_add_safe(cpustat,
>>                               MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
>>       __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
>> +     __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_THRESHOLDS, -1);
>> +
>>       put_cpu();
>>  }
>>
>> @@ -1363,6 +1395,11 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>>       if (mem_cgroup_soft_limit_check(mem))
>>               mem_cgroup_update_tree(mem, page);
>>  done:
>> +     if (mem_cgroup_threshold_check(mem)) {
>> +             mem_cgroup_threshold(mem, false);
>> +             if (do_swap_account)
>> +                     mem_cgroup_threshold(mem, true);
>> +     }
>>       return 0;
>>  nomem:
>>       css_put(&mem->css);
>> @@ -1906,6 +1943,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>>
>>       if (mem_cgroup_soft_limit_check(mem))
>>               mem_cgroup_update_tree(mem, page);
>> +     if (mem_cgroup_threshold_check(mem)) {
>> +             mem_cgroup_threshold(mem, false);
>> +             if (do_swap_account)
>> +                     mem_cgroup_threshold(mem, true);
>> +     }
>>       /* at swapout, this memcg will be accessed to record to swap */
>>       if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>>               css_put(&mem->css);
>> @@ -2860,11 +2902,181 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>>  }
>>
>>
>> +static bool mem_cgroup_threshold_check(struct mem_cgroup *mem)
>> +{
>> +     bool ret = false;
>> +     int cpu;
>> +     s64 val;
>> +     struct mem_cgroup_stat_cpu *cpustat;
>> +
>> +     cpu = get_cpu();
>> +     cpustat = &mem->stat.cpustat[cpu];
>> +     val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_THRESHOLDS);
>> +     if (unlikely(val < 0)) {
>> +             __mem_cgroup_stat_set(cpustat, MEM_CGROUP_STAT_THRESHOLDS,
>> +                             THRESHOLDS_EVENTS_THRESH);
>> +             ret = true;
>> +     }
>> +     put_cpu();
>> +     return ret;
>> +}
>> +
>
> Hmm. please check
>
>        if (likely(list_empty(&mem->thresholds) &&
>                   list_empty(&mem->memsw_thresholds)))
>                return;

These lists are never empty. They always have at least two fake thresholds,
for 0 and RESOURCE_MAX.

>
> or add a flag such as mem->no_threshold_check to skip this routine quickly.
>
> _OR_
> I personally don't like to have 2 counters to catch events.
>
> How about this ?
>
>   adds
>   struct mem_cgroup {
>        atomic_t        event_counter; // this is incremented per 32
>                                           page-in/out
>        atomic_t last_softlimit_check;
>        atomic_t last_thresh_check;
>   };
>
> static bool mem_cgroup_threshold_check(struct mem_cgroup *mem)
> {
>        decrement percpu event counter.
>        if (percpu counter reaches 0) {
>                if  (atomic_dec_and_test(&mem->check_thresh)) {
>                        check threshold.
>                        reset counter.
>                }
>                if  (atomic_dec_and_test(&mem->check_softlimit)) {
>                        update softlimit tree.
>                        reset counter.
>                }
>                reset percpu counter.
>        }
> }
>
> Then, you can have a counter like system-wide event counter.

I'll leave it as is for now, as you mentioned in the other mail.

>> +static void mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
>> +{
>> +     struct mem_cgroup_threshold **below, **above;
>> +     struct list_head *thresholds;
>> +     u64 usage = mem_cgroup_usage(memcg, swap);
>> +
>> +     if (!swap) {
>> +             thresholds = &memcg->thresholds;
>> +             above = &memcg->above_threshold;
>> +             below = &memcg->below_threshold;
>> +     } else {
>> +             thresholds = &memcg->memsw_thresholds;
>> +             above = &memcg->memsw_above_threshold;
>> +             below = &memcg->memsw_below_threshold;
>> +     }
>> +
>> +     spin_lock(&memcg->thresholds_lock);
>> +     if ((*above)->threshold <= usage) {
>> +             *below = *above;
>> +             list_for_each_entry_continue((*above), thresholds, list) {
>> +                     eventfd_signal((*below)->eventfd, 1);
>> +                     if ((*above)->threshold > usage)
>> +                             break;
>> +                     *below = *above;
>> +             }
>> +     } else if ((*below)->threshold > usage) {
>> +             *above = *below;
>> +             list_for_each_entry_continue_reverse((*below), thresholds,
>> +                             list) {
>> +                     eventfd_signal((*above)->eventfd, 1);
>> +                     if ((*below)->threshold <= usage)
>> +                             break;
>> +                     *above = *below;
>> +             }
>> +     }
>> +     spin_unlock(&memcg->thresholds_lock);
>> +}
>
> Could you add a comment on the above check?

I'll add comments in the next version of the patchset.
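
For reference, the invariant behind that check can be stated roughly as follows
(a sketch of such a comment based only on the code already posted, not a quote
from any later version):

	/*
	 * Registered thresholds are kept sorted.  below_threshold always
	 * points at the largest threshold <= current usage and
	 * above_threshold at the smallest threshold > usage; the fake 0 and
	 * RESOURCE_MAX entries guarantee both always exist.  On a check we
	 * therefore only walk forward (usage grew past *above) or backward
	 * (usage dropped below *below) from this window, signalling the
	 * eventfd of every threshold we step over.
	 */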

> And do we need *spin_lock* here? Can't you use an RCU list walk?

I'll play with it.

> If you have to use a spinlock here, this is a system-wide spinlock, and a
> threshold of "100" is too small, I think.

What is reasonable value for THRESHOLDS_EVENTS_THRESH for you?

In most cases spinlock taken only for two checks. Is it significant time?

Unfortunately, I can't test it on a big box. I have only dual-core system.
It's not enough to test scalability.

> Even if you can't use a spinlock, please use a mutex (with a check of gfp_mask).
>
> Thanks,
> -Kame

2009-12-15 11:12:42

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/4] memcg: implement memory thresholds

On Tue, 15 Dec 2009 12:46:32 +0200
"Kirill A. Shutemov" <[email protected]> wrote:

> On Tue, Dec 15, 2009 at 3:58 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Sat, 12 Dec 2009 00:59:19 +0200
> > "Kirill A. Shutemov" <[email protected]> wrote:

> > If you have to use a spinlock here, this is a system-wide spinlock, and a
> > threshold of "100" is too small, I think.
>
> What is reasonable value for THRESHOLDS_EVENTS_THRESH for you?
>
> In most cases spinlock taken only for two checks. Is it significant time?
>
I tend to think about the "bad case" when I see a spinlock.

And... I'm not sure, but recently there are many VM users.
A spinlock can be a big pitfall in some environments if not para-virtualized.
(I'm sorry if I misunderstand something and VMs handle this well...)

> Unfortunately, I can't test it on a big box. I have only dual-core system.
> It's not enough to test scalability.
>

Please leave it as 100 for now. But there is a chance to do a simple
optimization to reduce the number of checks.

example)
static void mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
{
	/* To handle memory allocation in a rush, check jiffies */
	smp_rmb();
	if (memcg->last_checkpoint_jiffies == jiffies)
		return; /* reset event to half value .. */
	memcg->last_checkpoint_jiffies = jiffies;
	smp_wmb();
	.....

I think this kind of check is necessary to handle "rushing" memory allocation
in a scalable way. The above is just an example; 1 tick may be too long.

Another simple plan is

	/* Allow only one thread to scan the list at the same time. */
	if (atomic_inc_not_zero(&memcg->threshold_scan_count)) {
		atomic_dec(&memcg->threshold_scan_count);
		return;
	}
	...
	atomic_dec(&memcg->threshold_scan_count);

Some easy logic (as above) for taking care of scalability, with commentary for
it, is enough at the 1st stage. Then, if there seems to be a trouble/concern,
someone (me?) will do some work later.
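
A corrected version of that guard might look like the sketch below
(threshold_scan_count is an assumed field initialised to 1, and
__mem_cgroup_threshold() stands for the existing locked list walk; neither
name is in the posted patch):

	static void mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
	{
		/* whoever drops the count to 0 owns this round's scan */
		if (!atomic_dec_and_test(&memcg->threshold_scan_count)) {
			atomic_inc(&memcg->threshold_scan_count);
			return;		/* someone else is already scanning */
		}

		__mem_cgroup_threshold(memcg, swap);

		atomic_inc(&memcg->threshold_scan_count);	/* release */
	}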




Thanks,
-Kame

2009-12-15 15:03:45

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 1/4] cgroup: implement eventfd-based generic API for notifications

On Tue, Dec 15, 2009 at 11:35 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 15 Dec 2009 11:11:16 +0200
> "Kirill A. Shutemov" <[email protected]> wrote:
>
>> Could anybody review the patch?
>>
>> Thank you.
>
> some nitpicks.
>
>>
>> On Sat, Dec 12, 2009 at 12:59 AM, Kirill A. Shutemov
>> <[email protected]> wrote:
>
>> > +       /*
>> > +        * Unregister events and notify userspace.
>> > +        * FIXME: How to avoid race with cgroup_event_remove_work()
>> > +        *        which runs from workqueue?
>> > +        */
>> > +       mutex_lock(&cgrp->event_list_mutex);
>> > +       list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
>> > +               cgroup_event_remove(event);
>> > +               eventfd_signal(event->eventfd, 1);
>> > +       }
>> > +       mutex_unlock(&cgrp->event_list_mutex);
>> > +
>> > +out:
>> >        return ret;
>> >  }
>
> How critical is this FIXME?
> But hmm... can't we use RCU?

It's not reasonable to have RCU here, since event_list isn't mostly-read.

2009-12-15 23:59:17

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH RFC v2 1/4] cgroup: implement eventfd-based generic API for notifications

On Tue, 15 Dec 2009 17:03:37 +0200
"Kirill A. Shutemov" <[email protected]> wrote:
> >> > +       /*
> >> > +        * Unregister events and notify userspace.
> >> > +        * FIXME: How to avoid race with cgroup_event_remove_work()
> >> > +        *        which runs from workqueue?
> >> > +        */
> >> > +       mutex_lock(&cgrp->event_list_mutex);
> >> > +       list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
> >> > +               cgroup_event_remove(event);
> >> > +               eventfd_signal(event->eventfd, 1);
> >> > +       }
> >> > +       mutex_unlock(&cgrp->event_list_mutex);
> >> > +
> >> > +out:
> >> >        return ret;
> >> >  }
> >
> > How critical is this FIXME?
> > But hmm... can't we use RCU?
>
> It's not reasonable to have RCU here, since event_list isn't mostly-read.
>
ok.

Thanks,
-Kame

2009-12-16 01:44:51

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH RFC v2 1/4] cgroup: implement eventfd-based generic API for notifications

Kirill A. Shutemov wrote:
> This patch introduces write-only file "cgroup.event_control" in every
> cgroup.
>
> To register new notification handler you need:
> - create an eventfd;
> - open a control file to be monitored. Callbacks register_event() and
> unregister_event() must be defined for the control file;
> - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
> Interpretation of args is defined by control file implementation;
>
> eventfd will be woken up by control file implementation or when the
> cgroup is removed.
>
> To unregister notification handler just close eventfd.
>
> If you need notification functionality for a control file you have to
> implement callbacks register_event() and unregister_event() in the
> struct cftype.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/cgroup.h | 20 +++++
> kernel/cgroup.c | 215 +++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 234 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 0008dee..7ad3078 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -220,6 +220,10 @@ struct cgroup {
>
> /* For RCU-protected deletion */
> struct rcu_head rcu_head;
> +
> + /* List of events which userspace want to recieve */
> + struct list_head event_list;
> + struct mutex event_list_mutex;
> };
>
> /*
> @@ -362,6 +366,22 @@ struct cftype {
> int (*trigger)(struct cgroup *cgrp, unsigned int event);
>
> int (*release)(struct inode *inode, struct file *file);
> +
> + /*
> + * register_event() callback will be used to add new userspace
> + * waiter for changes related to the cftype. Implement it if
> + * you want to provide this functionality. Use eventfd_signal()
> + * on eventfd to send notification to userspace.
> + */
> + int (*register_event)(struct cgroup *cgrp, struct cftype *cft,
> + struct eventfd_ctx *eventfd, const char *args);
> + /*
> + * unregister_event() callback will be called when userspace
> + * close the eventfd. This callback must be implemented, if you
> + * provide register_event().
> + */
> + int (*unregister_event)(struct cgroup *cgrp, struct cftype *cft,
> + struct eventfd_ctx *eventfd);
> };
>
> struct cgroup_scanner {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 0249f4b..f7ec3ca 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4,6 +4,10 @@
> * Based originally on the cpuset system, extracted by Paul Menage
> * Copyright (C) 2006 Google, Inc
> *
> + * Notifiactions support

s/Notifiactions/Notifications

> + * Copyright (C) 2009 Nokia Corporation
> + * Author: Kirill A. Shutemov
> + *
> * Copyright notices from the original cpuset code:
> * --------------------------------------------------
> * Copyright (C) 2003 BULL SA.
> @@ -51,6 +55,8 @@
> #include <linux/pid_namespace.h>
> #include <linux/idr.h>
> #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
> +#include <linux/eventfd.h>
> +#include <linux/poll.h>
>
> #include <asm/atomic.h>
>
> @@ -146,6 +152,36 @@ struct css_id {
> unsigned short stack[0]; /* Array of Length (depth+1) */
> };
>
> +/*
> + * cgroup_event represents event which userspace want to recieve.

s/event/events ?

> + */
> +struct cgroup_event {
> + /*
> + * Cgroup which the event belongs to.
> + */
> + struct cgroup *cgrp;
> + /*
> + * Control file which the event associated.
> + */
> + struct cftype *cft;
> + /*
> + * eventfd to signal userspace about the event.
> + */
> + struct eventfd_ctx *eventfd;
> + /*
> + * Each of these stored in a list by the cgroup.
> + */
> + struct list_head list;
> + /*
> + * All fields below needed to unregister event when
> + * userspace closes eventfd.
> + */
> + poll_table pt;
> + wait_queue_head_t *wqh;
> + wait_queue_t wait;
> + struct work_struct remove;
> +};

Please add a blank line here.

> +static void cgroup_event_remove(struct cgroup_event *event);
>
> /* The list of hierarchy roots */
>
> @@ -734,14 +770,29 @@ static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
> static int cgroup_call_pre_destroy(struct cgroup *cgrp)
> {
> struct cgroup_subsys *ss;
> + struct cgroup_event *event, *tmp;
> int ret = 0;
>
> for_each_subsys(cgrp->root, ss)
> if (ss->pre_destroy) {
> ret = ss->pre_destroy(ss, cgrp);
> if (ret)
> - break;
> + goto out;
> }
> +
> + /*
> + * Unregister events and notify userspace.
> + * FIXME: How to avoid race with cgroup_event_remove_work()
> + * which runs from workqueue?
> + */
> + mutex_lock(&cgrp->event_list_mutex);
> + list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
> + cgroup_event_remove(event);
> + eventfd_signal(event->eventfd, 1);

How can you access event after you kfree()ed it in cgroup_event_remove()?

> + }
> + mutex_unlock(&cgrp->event_list_mutex);
> +
> +out:
> return ret;
> }
>
> @@ -1136,6 +1187,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
> INIT_LIST_HEAD(&cgrp->release_list);
> INIT_LIST_HEAD(&cgrp->pidlists);
> mutex_init(&cgrp->pidlist_mutex);
> + INIT_LIST_HEAD(&cgrp->event_list);
> + mutex_init(&cgrp->event_list_mutex);
> }
>
> static void init_cgroup_root(struct cgroupfs_root *root)
> @@ -1935,6 +1988,16 @@ static const struct inode_operations cgroup_dir_inode_operations = {
> .rename = cgroup_rename,
> };
>
> +/*
> + * Check if a file is a control file
> + */
> +static inline struct cftype *__file_cft(struct file *file)
> +{
> + if (file->f_dentry->d_inode->i_fop != &cgroup_file_operations)
> + return ERR_PTR(-EINVAL);

I don't think this check is needed.

> + return __d_cft(file->f_dentry);
> +}
> +
> static int cgroup_create_file(struct dentry *dentry, mode_t mode,
> struct super_block *sb)
> {
> @@ -2789,6 +2852,151 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
> return 0;
> }
>
> +static inline void cgroup_event_remove(struct cgroup_event *event)
> +{
> + struct cgroup *cgrp = event->cgrp;
> +
> + BUG_ON(event->cft->unregister_event(cgrp, event->cft, event->eventfd));
> + eventfd_ctx_put(event->eventfd);
> + remove_wait_queue(event->wqh, &event->wait);
> + list_del(&event->list);
> + kfree(event);
> +}
> +
> +static void cgroup_event_remove_work(struct work_struct *work)
> +{
> + struct cgroup_event *event = container_of(work, struct cgroup_event,
> + remove);
> + struct cgroup *cgrp = event->cgrp;
> +
> + mutex_lock(&cgrp->event_list_mutex);
> + cgroup_event_remove(event);
> + mutex_unlock(&cgrp->event_list_mutex);
> +}
> +
> +static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
> + int sync, void *key)
> +{
> + struct cgroup_event *event = container_of(wait,
> + struct cgroup_event, wait);
> + unsigned long flags = (unsigned long)key;
> +
> + if (flags & POLLHUP)
> + /*
> + * This function called with spinlock taken, but

s/called/is called/ ?

> + * cgroup_event_remove() may sleep, so we have
> + * to run it in a workqueue.
> + */
> + schedule_work(&event->remove);

Please use:

if (...) {
...
}

> +
> + return 0;
> +}
> +
> +static void cgroup_event_ptable_queue_proc(struct file *file,
> + wait_queue_head_t *wqh, poll_table *pt)
> +{
> + struct cgroup_event *event = container_of(pt,
> + struct cgroup_event, pt);
> +
> + event->wqh = wqh;
> + add_wait_queue(wqh, &event->wait);
> +}
> +
> +static int cgroup_write_event_control(struct cgroup *cont, struct cftype *cft,

Please consistently use "cgrp"

> + const char *buffer)
> +{
> + struct cgroup_event *event = NULL;
> + unsigned int efd, cfd;
> + struct file *efile = NULL;
> + struct file *cfile = NULL;
> + char *endp;
> + int ret;
> +
> + efd = simple_strtoul(buffer, &endp, 10);
> + if (*endp != ' ')
> + return -EINVAL;
> + buffer = endp + 1;
> +
> + cfd = simple_strtoul(buffer, &endp, 10);
> + if ((*endp != ' ') && (*endp != '\0'))
> + return -EINVAL;
> + buffer = endp + 1;
> +
> + event = kzalloc(sizeof(*event), GFP_KERNEL);
> + if (!event)
> + return -ENOMEM;
> + event->cgrp = cont;
> + INIT_LIST_HEAD(&event->list);
> + init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
> + init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
> + INIT_WORK(&event->remove, cgroup_event_remove_work);
> +
> + efile = eventfd_fget(efd);
> + if (IS_ERR(efile)) {
> + ret = PTR_ERR(efile);
> + goto fail;
> + }
> +
> + event->eventfd = eventfd_ctx_fileget(efile);
> + if (IS_ERR(event->eventfd)) {
> + ret = PTR_ERR(event->eventfd);
> + goto fail;
> + }
> +
> + cfile = fget(cfd);
> + if (!cfile) {
> + ret = -EBADF;
> + goto fail;
> + }
> +
> + /* the process need read permission on control file */
> + ret = file_permission(cfile, MAY_READ);
> + if (ret < 0)
> + goto fail;
> +
> + event->cft = __file_cft(cfile);
> + if (IS_ERR(event->cft)) {
> + ret = PTR_ERR(event->cft);
> + goto fail;
> + }
> +
> + if (!event->cft->register_event || !event->cft->unregister_event) {
> + ret = -EINVAL;
> + goto fail;
> + }
> +
> + ret = event->cft->register_event(cont, event->cft,
> + event->eventfd, buffer);
> + if (ret)
> + goto fail;
> +
> + efile->f_op->poll(efile, &event->pt);
> +
> + mutex_lock(&cont->event_list_mutex);
> + list_add(&event->list, &cont->event_list);
> + mutex_unlock(&cont->event_list_mutex);
> +
> + fput(cfile);
> + fput(efile);
> +
> + return 0;
> +
> +fail:
> + if (!IS_ERR(cfile))
> + fput(cfile);
> +
> + if (event && event->eventfd && !IS_ERR(event->eventfd))
> + eventfd_ctx_put(event->eventfd);
> +
> + if (!IS_ERR(efile))
> + fput(efile);
> +
> + if (event)
> + kfree(event);

kfree(NULL) is ok

> +
> + return ret;
> +}
> +
> /*
> * for the common functions, 'private' gives the type of file
> */
> @@ -2814,6 +3022,11 @@ static struct cftype files[] = {
> .read_u64 = cgroup_read_notify_on_release,
> .write_u64 = cgroup_write_notify_on_release,
> },
> + {
> + .name = CGROUP_FILE_GENERIC_PREFIX "event_control",
> + .write_string = cgroup_write_event_control,
> + .mode = S_IWUGO,

We want this file to be writable to everyone ?

> + },
> };
>
> static struct cftype cft_release_agent = {

2009-12-16 02:00:52

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH RFC v2 1/4] cgroup: implement eventfd-based generic API for notifications

>> +/*
>> + * Check if a file is a control file
>> + */
>> +static inline struct cftype *__file_cft(struct file *file)
>> +{
>> + if (file->f_dentry->d_inode->i_fop != &cgroup_file_operations)
>> + return ERR_PTR(-EINVAL);
>
> I don't think this check is needed.
>

Sorry, please ignore this comment.

2009-12-16 05:46:49

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 1/4] cgroup: implement eventfd-based generic API for notifications

On Wed, Dec 16, 2009 at 3:44 AM, Li Zefan <[email protected]> wrote:
> Kirill A. Shutemov wrote:
>> This patch introduces write-only file "cgroup.event_control" in every
>> cgroup.
>>
>> To register new notification handler you need:
>> - create an eventfd;
>> - open a control file to be monitored. Callbacks register_event() and
>>   unregister_event() must be defined for the control file;
>> - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
>>   Interpretation of args is defined by control file implementation;
>>
>> eventfd will be woken up by control file implementation or when the
>> cgroup is removed.
>>
>> To unregister notification handler just close eventfd.
>>
>> If you need notification functionality for a control file you have to
>> implement callbacks register_event() and unregister_event() in the
>> struct cftype.
>>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
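
A minimal userspace sketch of the registration sequence described above. Only the
"<event_fd> <control_fd> <args>" format comes from the patch; the cgroup mount point,
the choice of memory.usage_in_bytes as the monitored control file, and the byte
threshold passed as <args> are assumptions for illustration.

#include <sys/eventfd.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	uint64_t counter;
	char buf[64];

	int efd = eventfd(0, 0);                                       /* event_fd */
	int cfd = open("/cgroups/0/memory.usage_in_bytes", O_RDONLY);  /* control_fd (assumed path) */
	int ecfd = open("/cgroups/0/cgroup.event_control", O_WRONLY);

	if (efd < 0 || cfd < 0 || ecfd < 0)
		return 1;

	/* "<event_fd> <control_fd> <args>"; <args> here is an assumed byte threshold */
	snprintf(buf, sizeof(buf), "%d %d %llu", efd, cfd, 100ULL << 20);
	if (write(ecfd, buf, strlen(buf)) < 0)
		return 1;

	/* Blocks until the control file implementation signals the eventfd */
	if (read(efd, &counter, sizeof(counter)) == sizeof(counter))
		printf("got notification\n");

	/* Closing the eventfd unregisters the notification handler */
	close(efd);
	close(cfd);
	close(ecfd);
	return 0;
}
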
>> ---
>>  include/linux/cgroup.h |   20 +++++
>>  kernel/cgroup.c        |  215 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 234 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 0008dee..7ad3078 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -220,6 +220,10 @@ struct cgroup {
>>
>>       /* For RCU-protected deletion */
>>       struct rcu_head rcu_head;
>> +
>> +     /* List of events which userspace want to recieve */
>> +     struct list_head event_list;
>> +     struct mutex event_list_mutex;
>>  };
>>
>>  /*
>> @@ -362,6 +366,22 @@ struct cftype {
>>       int (*trigger)(struct cgroup *cgrp, unsigned int event);
>>
>>       int (*release)(struct inode *inode, struct file *file);
>> +
>> +     /*
>> +      * register_event() callback will be used to add new userspace
>> +      * waiter for changes related to the cftype. Implement it if
>> +      * you want to provide this functionality. Use eventfd_signal()
>> +      * on eventfd to send notification to userspace.
>> +      */
>> +     int (*register_event)(struct cgroup *cgrp, struct cftype *cft,
>> +                     struct eventfd_ctx *eventfd, const char *args);
>> +     /*
>> +      * unregister_event() callback will be called when userspace
>> +      * close the eventfd. This callback must be implemented, if you
>> +      * provide register_event().
>> +      */
>> +     int (*unregister_event)(struct cgroup *cgrp, struct cftype *cft,
>> +                     struct eventfd_ctx *eventfd);
>>  };
>>
>>  struct cgroup_scanner {
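
As a hedged illustration of how a controller could wire up these two callbacks
(a sketch only: the "example" controller, its file name, the global list and the
threshold semantics are invented for brevity; the real memcg implementation in
patches 3/4 and 4/4 keeps per-cgroup threshold state): register_event() parses
<args> and remembers the eventfd, the controller later calls eventfd_signal()
when the tracked condition fires, and unregister_event() drops the entry again.

/*
 * Hypothetical "example" controller wiring for the callbacks above.
 */
struct example_event {
	struct list_head node;
	struct eventfd_ctx *eventfd;
	unsigned long long threshold;
};

static LIST_HEAD(example_events);
static DEFINE_MUTEX(example_events_mutex);

static int example_register_event(struct cgroup *cgrp, struct cftype *cft,
		struct eventfd_ctx *eventfd, const char *args)
{
	struct example_event *ev;
	char *end;

	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
	if (!ev)
		return -ENOMEM;
	ev->eventfd = eventfd;
	ev->threshold = simple_strtoull(args, &end, 10);

	mutex_lock(&example_events_mutex);
	list_add(&ev->node, &example_events);
	mutex_unlock(&example_events_mutex);
	return 0;
}

static int example_unregister_event(struct cgroup *cgrp, struct cftype *cft,
		struct eventfd_ctx *eventfd)
{
	struct example_event *ev, *tmp;

	mutex_lock(&example_events_mutex);
	list_for_each_entry_safe(ev, tmp, &example_events, node) {
		if (ev->eventfd == eventfd) {
			list_del(&ev->node);
			kfree(ev);
		}
	}
	mutex_unlock(&example_events_mutex);
	return 0;
}

/* Called by the controller when usage changes; wakes up registered waiters. */
static void example_check_thresholds(unsigned long long usage)
{
	struct example_event *ev;

	mutex_lock(&example_events_mutex);
	list_for_each_entry(ev, &example_events, node)
		if (usage >= ev->threshold)
			eventfd_signal(ev->eventfd, 1);
	mutex_unlock(&example_events_mutex);
}

static struct cftype example_files[] = {
	{
		.name = "example.threshold",
		.register_event = example_register_event,
		.unregister_event = example_unregister_event,
	},
};

Userspace would then monitor the (hypothetical) example.threshold file exactly as in
the sketch earlier in this message, by writing "<event_fd> <control_fd> <args>" to
cgroup.event_control.
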
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 0249f4b..f7ec3ca 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -4,6 +4,10 @@
>>   *  Based originally on the cpuset system, extracted by Paul Menage
>>   *  Copyright (C) 2006 Google, Inc
>>   *
>> + *  Notifiactions support
>
> s/Notifiactions/Notifications

Thanks.

>> + *  Copyright (C) 2009 Nokia Corporation
>> + *  Author: Kirill A. Shutemov
>> + *
>>   *  Copyright notices from the original cpuset code:
>>   *  --------------------------------------------------
>>   *  Copyright (C) 2003 BULL SA.
>> @@ -51,6 +55,8 @@
>>  #include <linux/pid_namespace.h>
>>  #include <linux/idr.h>
>>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>> +#include <linux/eventfd.h>
>> +#include <linux/poll.h>
>>
>>  #include <asm/atomic.h>
>>
>> @@ -146,6 +152,36 @@ struct css_id {
>>       unsigned short stack[0]; /* Array of Length (depth+1) */
>>  };
>>
>> +/*
>> + * cgroup_event represents event which userspace want to recieve.
>
> s/event/events ?

Thanks.

>
>> + */
>> +struct cgroup_event {
>> +     /*
>> +      * Cgroup which the event belongs to.
>> +      */
>> +     struct cgroup *cgrp;
>> +     /*
>> +      * Control file which the event associated.
>> +      */
>> +     struct cftype *cft;
>> +     /*
>> +      * eventfd to signal userspace about the event.
>> +      */
>> +     struct eventfd_ctx *eventfd;
>> +     /*
>> +      * Each of these stored in a list by the cgroup.
>> +      */
>> +     struct list_head list;
>> +     /*
>> +      * All fields below needed to unregister event when
>> +      * userspace closes eventfd.
>> +      */
>> +     poll_table pt;
>> +     wait_queue_head_t *wqh;
>> +     wait_queue_t wait;
>> +     struct work_struct remove;
>> +};
>
> Please add a blank line here.

Ok.

>> +static void cgroup_event_remove(struct cgroup_event *event);
>>
>>  /* The list of hierarchy roots */
>>
>> @@ -734,14 +770,29 @@ static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
>>  static int cgroup_call_pre_destroy(struct cgroup *cgrp)
>>  {
>>       struct cgroup_subsys *ss;
>> +     struct cgroup_event *event, *tmp;
>>       int ret = 0;
>>
>>       for_each_subsys(cgrp->root, ss)
>>               if (ss->pre_destroy) {
>>                       ret = ss->pre_destroy(ss, cgrp);
>>                       if (ret)
>> -                             break;
>> +                             goto out;
>>               }
>> +
>> +     /*
>> +      * Unregister events and notify userspace.
>> +      * FIXME: How to avoid race with cgroup_event_remove_work()
>> +      *        which runs from workqueue?
>> +      */
>> +     mutex_lock(&cgrp->event_list_mutex);
>> +     list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
>> +             cgroup_event_remove(event);
>> +             eventfd_signal(event->eventfd, 1);
>
> How can you access event after you kfree()ed it in cgroup_event_remove()?

Nice catch. Thank you.
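
One possible reordering that avoids the use-after-free (a sketch only, not
necessarily what was settled on): signal the eventfd before cgroup_event_remove()
frees the event.

	mutex_lock(&cgrp->event_list_mutex);
	list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
		/* signal first; cgroup_event_remove() kfree()s the event */
		eventfd_signal(event->eventfd, 1);
		cgroup_event_remove(event);
	}
	mutex_unlock(&cgrp->event_list_mutex);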

>> +     }
>> +     mutex_unlock(&cgrp->event_list_mutex);
>> +
>> +out:
>>       return ret;
>>  }
>>
>> @@ -1136,6 +1187,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
>>       INIT_LIST_HEAD(&cgrp->release_list);
>>       INIT_LIST_HEAD(&cgrp->pidlists);
>>       mutex_init(&cgrp->pidlist_mutex);
>> +     INIT_LIST_HEAD(&cgrp->event_list);
>> +     mutex_init(&cgrp->event_list_mutex);
>>  }
>>
>>  static void init_cgroup_root(struct cgroupfs_root *root)
>> @@ -1935,6 +1988,16 @@ static const struct inode_operations cgroup_dir_inode_operations = {
>>       .rename = cgroup_rename,
>>  };
>>
>> +/*
>> + * Check if a file is a control file
>> + */
>> +static inline struct cftype *__file_cft(struct file *file)
>> +{
>> +     if (file->f_dentry->d_inode->i_fop != &cgroup_file_operations)
>> +             return ERR_PTR(-EINVAL);
>
> I don't think this check is needed.
>
>> +     return __d_cft(file->f_dentry);
>> +}
>> +
>>  static int cgroup_create_file(struct dentry *dentry, mode_t mode,
>>                               struct super_block *sb)
>>  {
>> @@ -2789,6 +2852,151 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
>>       return 0;
>>  }
>>
>> +static inline void cgroup_event_remove(struct cgroup_event *event)
>> +{
>> +     struct cgroup *cgrp = event->cgrp;
>> +
>> +     BUG_ON(event->cft->unregister_event(cgrp, event->cft, event->eventfd));
>> +     eventfd_ctx_put(event->eventfd);
>> +     remove_wait_queue(event->wqh, &event->wait);
>> +     list_del(&event->list);
>> +     kfree(event);
>> +}
>> +
>> +static void cgroup_event_remove_work(struct work_struct *work)
>> +{
>> +     struct cgroup_event *event = container_of(work, struct cgroup_event,
>> +                     remove);
>> +     struct cgroup *cgrp = event->cgrp;
>> +
>> +     mutex_lock(&cgrp->event_list_mutex);
>> +     cgroup_event_remove(event);
>> +     mutex_unlock(&cgrp->event_list_mutex);
>> +}
>> +
>> +static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
>> +             int sync, void *key)
>> +{
>> +     struct cgroup_event *event = container_of(wait,
>> +                     struct cgroup_event, wait);
>> +     unsigned long flags = (unsigned long)key;
>> +
>> +     if (flags & POLLHUP)
>> +             /*
>> +              * This function called with spinlock taken, but
>
> s/called/is called/ ?

Ok.

>> +              * cgroup_event_remove() may sleep, so we have
>> +              * to run it in a workqueue.
>> +              */
>> +             schedule_work(&event->remove);
>
> Please use:
>
>        if (...) {
>                ...
>        }

Ok.

>> +
>> +     return 0;
>> +}
>> +
>> +static void cgroup_event_ptable_queue_proc(struct file *file,
>> +             wait_queue_head_t *wqh, poll_table *pt)
>> +{
>> +     struct cgroup_event *event = container_of(pt,
>> +                     struct cgroup_event, pt);
>> +
>> +     event->wqh = wqh;
>> +     add_wait_queue(wqh, &event->wait);
>> +}
>> +
>> +static int cgroup_write_event_control(struct cgroup *cont, struct cftype *cft,
>
> Please consistently use "cgrp"

Ok.

>> +                                   const char *buffer)
>> +{
>> +     struct cgroup_event *event = NULL;
>> +     unsigned int efd, cfd;
>> +     struct file *efile = NULL;
>> +     struct file *cfile = NULL;
>> +     char *endp;
>> +     int ret;
>> +
>> +     efd = simple_strtoul(buffer, &endp, 10);
>> +     if (*endp != ' ')
>> +             return -EINVAL;
>> +     buffer = endp + 1;
>> +
>> +     cfd = simple_strtoul(buffer, &endp, 10);
>> +     if ((*endp != ' ') && (*endp != '\0'))
>> +             return -EINVAL;
>> +     buffer = endp + 1;
>> +
>> +     event = kzalloc(sizeof(*event), GFP_KERNEL);
>> +     if (!event)
>> +             return -ENOMEM;
>> +     event->cgrp = cont;
>> +     INIT_LIST_HEAD(&event->list);
>> +     init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
>> +     init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
>> +     INIT_WORK(&event->remove, cgroup_event_remove_work);
>> +
>> +     efile = eventfd_fget(efd);
>> +     if (IS_ERR(efile)) {
>> +             ret = PTR_ERR(efile);
>> +             goto fail;
>> +     }
>> +
>> +     event->eventfd = eventfd_ctx_fileget(efile);
>> +     if (IS_ERR(event->eventfd)) {
>> +             ret = PTR_ERR(event->eventfd);
>> +             goto fail;
>> +     }
>> +
>> +     cfile = fget(cfd);
>> +     if (!cfile) {
>> +             ret = -EBADF;
>> +             goto fail;
>> +     }
>> +
>> +     /* the process need read permission on control file */
>> +     ret = file_permission(cfile, MAY_READ);
>> +     if (ret < 0)
>> +             goto fail;
>> +
>> +     event->cft = __file_cft(cfile);
>> +     if (IS_ERR(event->cft)) {
>> +             ret = PTR_ERR(event->cft);
>> +             goto fail;
>> +     }
>> +
>> +     if (!event->cft->register_event || !event->cft->unregister_event) {
>> +             ret = -EINVAL;
>> +             goto fail;
>> +     }
>> +
>> +     ret = event->cft->register_event(cont, event->cft,
>> +                     event->eventfd, buffer);
>> +     if (ret)
>> +             goto fail;
>> +
>> +     efile->f_op->poll(efile, &event->pt);
>> +
>> +     mutex_lock(&cont->event_list_mutex);
>> +     list_add(&event->list, &cont->event_list);
>> +     mutex_unlock(&cont->event_list_mutex);
>> +
>> +     fput(cfile);
>> +     fput(efile);
>> +
>> +     return 0;
>> +
>> +fail:
>> +     if (!IS_ERR(cfile))
>> +             fput(cfile);
>> +
>> +     if (event && event->eventfd && !IS_ERR(event->eventfd))
>> +             eventfd_ctx_put(event->eventfd);
>> +
>> +     if (!IS_ERR(efile))
>> +             fput(efile);
>> +
>> +     if (event)
>> +             kfree(event);
>
> kfree(NULL) is ok

Ok.

>> +
>> +     return ret;
>> +}
>> +
>>  /*
>>   * for the common functions, 'private' gives the type of file
>>   */
>> @@ -2814,6 +3022,11 @@ static struct cftype files[] = {
>>               .read_u64 = cgroup_read_notify_on_release,
>>               .write_u64 = cgroup_write_notify_on_release,
>>       },
>> +     {
>> +             .name = CGROUP_FILE_GENERIC_PREFIX "event_control",
>> +             .write_string = cgroup_write_event_control,
>> +             .mode = S_IWUGO,
>
> We want this file to be writable to everyone ?

Yes. We check the caller's permission on the control file which we want to track.

>> +     },
>>  };
>>
>>  static struct cftype cft_release_agent = {
>

Thank you.

2009-12-16 08:40:48

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH RFC v2 2/4] memcg: extract mem_group_usage() from mem_cgroup_read()

* Kirill A. Shutemov <[email protected]> [2009-12-12 00:59:17]:

> Helper to get memory or mem+swap usage of the cgroup.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Looks like a good cleanup to me!

Acked-by: Balbir Singh <[email protected]>

--
Balbir