2009-04-13 22:15:40

by Dan Malek

Subject: [PATCH] Memory usage limit notification addition to memcg

This patch updates the Memory Controller cgroup to add
a configurable memory usage limit notification. The feature
was presented at the April 2009 Embedded Linux Conference.

Signed-off-by: Dan Malek <[email protected]>
---
Documentation/cgroups/mem_notify.txt | 129 +++++++++++++++++++++
include/linux/memcontrol.h | 7 +
init/Kconfig | 9 ++
mm/memcontrol.c | 207 ++++++++++++++++++++++++++++++++++
4 files changed, 352 insertions(+), 0 deletions(-)
create mode 100644 Documentation/cgroups/mem_notify.txt

diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
new file mode 100644
index 0000000..72d5c26
--- /dev/null
+++ b/Documentation/cgroups/mem_notify.txt
@@ -0,0 +1,129 @@
+
+Memory Limit Notification
+
+Attempts have been made in the past to provide a mechanism for
+the notification to processes (task, an address space) when memory
+usage is approaching a high limit. The intention is that it gives
+the application an opportunity to release some memory and continue
+operation rather than be OOM killed. The CE Linux Forum requested
+a more contemporary implementation, and this is the result.
+
+The memory limit notification is a configurable extension to the
+existing Memory Resource Controller. Please read memory.txt in this
+directory to understand its operation before continuing here.
+
+1. Operation
+
+When a kernel is configured with CGROUP_MEM_NOTIFY, three additional
+files will appear in the memory resource controller:
+
+ memory.notify_limit_percent
+ memory.notify_limit_usage
+ memory.notify_limit_lowait
+
+The notification is based upon reaching a percentage of the memory
+resource controller limit (memory.limit_in_bytes). When the controller
+group is created, the percentage is set to 100. Any integer percentage
+may be set by writing to memory.notify_limit_percent, such as:
+
+ echo 80 > memory.notify_limit_percent
+
+The current integer usage percentage may be read at any time from
+the memory.notify_limit_usage file.
+
+The memory.notify_limit_lowait is a blocking read file. The read will
+block until one of four conditions occurs:
+
+ - The usage reaches or exceeds the memory.notify_limit_percent
+ - The memory.notify_limit_lowait file is written with any value (debug)
+ - A thread is moved to another controller group
+ - The cgroup is destroyed or forced empty (memory.force_empty)
+
+
+1.1 Example Usage
+
+An application must be designed to properly take advantage of this
+memory limit notification feature. It is a powerful management component
+of some operating systems and embedded devices that must provide
+highly available and reliable computing services. The application works
+in conjunction with information provided by the operating system to
+control limited resource usage. Since many programmers still think
+memory is infinite and never check the return value from malloc(), it
+may come as a surprise that such mechanisms have been utilized long ago.
+
+A typical application will be multithreaded, with one thread either
+polling or waiting for the notification event. When the event occurs,
+the thread will take whatever action is appropriate within the application
+design. This could be actually running a garbage collection algorithm
+or simply signaling other processing threads that they must do something
+to reduce their memory usage. The notification thread will then be required
+to poll the actual usage until the low limit of its choosing is met,
+at which time the reclaim of memory can stop and the notification thread
+will wait for the next event.
+
+Internally, the application only needs to fopen("memory.notify_limit_usage" ..)
+and fopen("memory.notify_limit_lowait" ...), then either poll the former
+file or block read on the latter file using fread() or fscanf() as desired.
+
+2. Configuration
+
+Follow the instructions in memory.txt for the configuration and usage of
+the Memory Resource Controller cgroup. Once this is created and tasks
+assigned, use the memory limit notification as described here.
+
+The only action that is needed outside of the application waiting or polling
+is to set the memory.notify_limit_percent. Setting a notification to occur
+when memory usage of the cgroup reaches or exceeds 80 percent can be
+simply done:
+
+ echo 80 > memory.notify_limit_percent
+
+This value may be read or changed at any time. Writing a lower value once
+the Memory Resource Controller is in operation may trigger immediate
+notification if the usage is above the new limit.
+
+3. Debug and Testing
+
+The design of cgroups makes it easier to perform some debugging or
+monitoring tasks without modification to the application. For example,
+a write of any value to memory.notify_limit_lowait will wake up all
+threads waiting for notifications regardless of current memory usage.
+
+Collecting performance data about the cgroup is also simplified, as
+no application modifications are necessary. A separate task can be
+created that will open and monitor any necessary files of the cgroup
+(such as current limits, usage and usage percentages and even when
+notification occurs). This task can also operate outside of the cgroup,
+so its memory usage is not charged to the cgroup.
+
+4. Design
+
+The memory limit notification is a configurable extension to the
+existing Memory Resource Controller, which operates as described to
+track and manage the memory of the Control Group. The Memory Resource
+Controller will still continue to reclaim memory under pressure
+of the limits, and may OOM kill tasks within the cgroup according to
+the OOM Killer configuration.
+
+The memory notification limit was chosen as a percentage of the
+memory in use so the cgroup parameters may continue to be dynamically
+modified without the need to modify the notification parameters.
+Otherwise, the notification limit would have to also be computed
+and modified on any Memory Resource Controller operating parameter change.
+
+The cgroup file semantics are not well suited for this type of notification
+mechanism. While applications may choose to simply poll the current
+usage at their convenience, it was also desired to have a notification
+event that would trigger when the usage attained the limit. The
+blocking read() was chosen, as it is the only current useful method.
+This presented the problems of "out of band" notification, when you want
+to return some exceptional status other than reaching the notification
+limit. In the cases listed above, the read() on the memory.notify_limit_lowait
+file will not block and return "0" for the percentage. When this occurs,
+the thread must determine if the task has moved to a new cgroup or if
+the cgroup has been destroyed. Due to the usage model of this cgroup,
+neither is likely to happen during normal operation of a product.
+
+Dan Malek <[email protected]>
+Embedded Alley Solutions, Inc.
+10 March 2009
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..031e5d1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -117,6 +117,13 @@ static inline bool mem_cgroup_disabled(void)

extern bool mem_cgroup_oom_called(struct task_struct *task);

+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+extern void test_and_wakeup_notify(struct mem_cgroup *mcg,
+ unsigned long long newlimit);
+extern unsigned long compute_usage_percent(unsigned long long usage,
+ unsigned long long limit);
+#endif
+
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

diff --git a/init/Kconfig b/init/Kconfig
index f2f9b53..97138da 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -588,6 +588,15 @@ config CGROUP_MEM_RES_CTLR
This config option also selects MM_OWNER config option, which
could in turn add some fork/exit overhead.

+config CGROUP_MEM_NOTIFY
+ bool "Memory Usage Limit Notification"
+ depends on CGROUP_MEM_RES_CTLR
+ help
+ Provides a memory notification when usage reaches a preset limit.
+ It is an extension to the memory resource controller, since it
+ uses the memory usage accounting of the cgroup to test against
+ the notification limit. (See Documentation/cgroups/mem_notify.txt)
+
config CGROUP_MEM_RES_CTLR_SWAP
bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2fc6d6c..d6367ed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6,6 +6,10 @@
* Copyright 2007 OpenVZ SWsoft Inc
* Author: Pavel Emelianov <[email protected]>
*
+ * Memory Limit Notification update
+ * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
+ * Author: Dan Malek <[email protected]>
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -180,6 +184,11 @@ struct mem_cgroup {
* statistics. This must be placed at the end of memcg.
*/
struct mem_cgroup_stat stat;
+
+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+ unsigned long notify_limit_percent;
+ wait_queue_head_t notify_limit_wait;
+#endif
};

enum charge_type {
@@ -934,6 +943,21 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,

VM_BUG_ON(mem_cgroup_is_obsolete(mem));

+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+ /* We check on the way in so we don't have to duplicate code
+ * in both the normal and error exit path.
+ */
+ if (likely(mem->res.limit != (unsigned long long)LLONG_MAX)) {
+ unsigned long usage_pct;
+
+ usage_pct = compute_usage_percent(mem->res.usage + PAGE_SIZE,
+ mem->res.limit);
+ if ((usage_pct >= mem->notify_limit_percent) &&
+ waitqueue_active(&mem->notify_limit_wait))
+ wake_up(&mem->notify_limit_wait);
+ }
+#endif
+
while (1) {
int ret;
bool noswap = false;
@@ -1663,6 +1687,13 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
int children = mem_cgroup_count_children(memcg);
u64 curusage, oldusage;

+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+ /* Test and notify ahead of the necessity to free pages, as
+ * applications giving up pages may help this reclaim procedure.
+ */
+ test_and_wakeup_notify(memcg, val);
+#endif
+
/*
* For keeping hierarchical_reclaim simple, how long we should retry
* is depends on callers. We set our retry-count to be function
@@ -2215,6 +2246,147 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
return 0;
}

+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+#define CGROUP_LOCAL_BUFFER_SIZE 64 /* Would be nice if this was in cgroup.h */
+
+/* The resource counters are defined as long long, but few processors
+ * handle 64-bit divisor in hardware, and the software to do it isn't
+ * present in the kernel. It would be nice if the resource counters were
+ * platform specific configurable typedefs, but for now we'll just divide
+ * down the byte counters by the page size to get 32-bit arithmetic.
+ * With a 4K page size, this will work up to about 16384G resource limit.
+ */
+unsigned long compute_usage_percent(unsigned long long usage,
+ unsigned long long limit)
+{
+ unsigned long lim;
+ unsigned long long usage_pct;
+
+ usage_pct = (usage / PAGE_SIZE) * 100;
+ lim = (unsigned long)(limit / PAGE_SIZE);
+
+ do_div(usage_pct, lim);
+
+ return (unsigned long)usage_pct;
+}
+
+void test_and_wakeup_notify(struct mem_cgroup *mcg, unsigned long long newlimit)
+{
+ unsigned long usage_pct;
+
+ /* Check to see if the new limit should cause notification.
+ */
+ usage_pct = compute_usage_percent(mcg->res.usage, newlimit);
+
+ if ((usage_pct >= mcg->notify_limit_percent) &&
+ waitqueue_active(&mcg->notify_limit_wait))
+ wake_up(&mcg->notify_limit_wait);
+}
+
+static ssize_t notify_limit_read(struct cgroup *cgrp, struct cftype *cft,
+ struct file *file,
+ char __user *buf, size_t nbytes,
+ loff_t *ppos)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ char tmp[CGROUP_LOCAL_BUFFER_SIZE];
+ int len;
+
+ len = sprintf(tmp, "%lu\n", memcg->notify_limit_percent);
+
+ return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
+}
+
+static int notify_limit_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ unsigned long val;
+ char *endptr;
+
+ val = simple_strtoul(buffer, &endptr, 0);
+ if (val > 100)
+ return -EINVAL;
+
+ memcg->notify_limit_percent = val;
+
+ /* Check to see if the new percentage limit should cause notification.
+ */
+ test_and_wakeup_notify(memcg, memcg->res.limit);
+
+ return 0;
+}
+
+static ssize_t notify_limit_usage_read(struct cgroup *cgrp, struct cftype *cft,
+ struct file *file,
+ char __user *buf, size_t nbytes,
+ loff_t *ppos)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ char tmp[CGROUP_LOCAL_BUFFER_SIZE];
+ unsigned long usage_pct;
+ int len;
+
+ usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit);
+
+ len = sprintf(tmp, "%lu\n", usage_pct);
+
+ return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
+}
+
+static ssize_t notify_limit_lowait(struct cgroup *cgrp, struct cftype *cft,
+ struct file *file,
+ char __user *buf, size_t nbytes,
+ loff_t *ppos)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ char tmp[CGROUP_LOCAL_BUFFER_SIZE];
+ unsigned long usage_pct;
+ int len;
+ DEFINE_WAIT(notify_lowait);
+
+ /* A memory resource usage of zero is a special case that
+ * causes us not to sleep. It normally happens when the
+ * cgroup is about to be destroyed, and we don't want someone
+ * trying to sleep on a queue that is about to go away. This
+ * condition can also be forced as part of testing.
+ */
+ usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit);
+ if (likely(mem->res.usage != 0)) {
+
+ prepare_to_wait(&mem->notify_limit_wait, &notify_lowait,
+ TASK_INTERRUPTIBLE);
+
+ if (usage_pct < mem->notify_limit_percent) {
+ schedule();
+
+ /* Compute percentage we have now and return it.
+ */
+ usage_pct = compute_usage_percent(mem->res.usage,
+ mem->res.limit);
+ }
+ finish_wait(&mem->notify_limit_wait, &notify_lowait);
+ }
+
+ len = sprintf(tmp, "%lu\n", usage_pct);
+
+ return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
+}
+
+/* This is used to wake up all threads that may be hanging
+ * out waiting for a low memory condition prior to that happening.
+ * Useful for triggering the event to assist with debug of applications.
+ */
+static int notify_limit_wake_em_up(struct cgroup *cgrp, unsigned int event)
+{
+ struct mem_cgroup *mem;
+
+ mem = mem_cgroup_from_cont(cgrp);
+ wake_up(&mem->notify_limit_wait);
+ return 0;
+}
+#endif /* CONFIG_CGROUP_MEM_NOTIFY */
+

static struct cftype mem_cgroup_files[] = {
{
@@ -2258,6 +2430,22 @@ static struct cftype mem_cgroup_files[] = {
.read_u64 = mem_cgroup_swappiness_read,
.write_u64 = mem_cgroup_swappiness_write,
},
+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+ {
+ .name = "notify_limit_percent",
+ .write_string = notify_limit_write,
+ .read = notify_limit_read,
+ },
+ {
+ .name = "notify_limit_usage",
+ .read = notify_limit_usage_read,
+ },
+ {
+ .name = "notify_limit_lowait",
+ .trigger = notify_limit_wake_em_up,
+ .read = notify_limit_lowait,
+ },
+#endif
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -2461,6 +2649,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mem->last_scanned_child = 0;
spin_lock_init(&mem->reclaim_param_lock);

+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+ init_waitqueue_head(&mem->notify_limit_wait);
+ mem->notify_limit_percent = 100;
+#endif
+
if (parent)
mem->swappiness = get_swappiness(parent);
atomic_set(&mem->refcnt, 1);
@@ -2504,6 +2697,20 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
struct cgroup *old_cont,
struct task_struct *p)
{
+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+ /* We wake up all notification threads any time a migration takes
+ * place. They will have to check to see if a move is needed to
+ * a new cgroup file to wait for notification.
+ * This isn't so much a task move as it is an attach. A thread not
+ * a child of an existing task won't have a valid parent, which
+ * is necessary to test because it won't have a valid mem_cgroup
+ * either. Which further means it won't have a proper wait queue
+ * and we can't do a wakeup.
+ */
+ if (old_cont->parent != NULL)
+ notify_limit_wake_em_up(old_cont, 0);
+#endif
+
mutex_lock(&memcg_tasklist);
/*
* FIXME: It's better to move charges of this process from old
--
1.6.2.GIT
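
The usage pattern described in section 1.1 of mem_notify.txt can be sketched from userspace roughly as follows. This is a minimal sketch and not part of the patch: the helper names read_percent() and reclaim_until() are hypothetical, and the caller is assumed to supply the full path to each cgroup file, which depends on where the memory controller is mounted. Against memory.notify_limit_lowait the fscanf() would block until the kernel wakes the reader; against memory.notify_limit_usage it returns immediately.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: open a cgroup control file, parse the single
 * integer percentage it returns, and close it again.  Reopening on
 * every call avoids having to rewind the stream between reads.
 * Returns the percentage, or -1 if the file cannot be opened or parsed.
 */
int read_percent(const char *path)
{
	FILE *fp;
	int pct;

	assert(path != NULL);

	fp = fopen(path, "r");
	if (fp == NULL)
		return -1;
	if (fscanf(fp, "%d", &pct) != 1)
		pct = -1;
	fclose(fp);
	return pct;
}

/*
 * Hypothetical reclaim loop for the notification thread: after a wakeup,
 * keep releasing memory until usage drops to the application's own low
 * mark.  release_some() is an application-supplied callback.
 * Returns the last percentage read (or -1 on a read error).
 */
int reclaim_until(const char *usage_path, int low_mark,
		  void (*release_some)(void))
{
	int pct;

	while ((pct = read_percent(usage_path)) > low_mark)
		release_some();
	return pct;
}
```

A notification thread would call read_percent() on memory.notify_limit_lowait to sleep until woken, then reclaim_until() on memory.notify_limit_usage. A value of 0 from the blocking read corresponds to the out-of-band cases the documentation describes (task moved or cgroup destroyed), so the caller should check for it before reclaiming.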


2009-04-13 23:10:58

by Andrew Morton

Subject: Re: [PATCH] Memory usage limit notification addition to memcg


Please cc [email protected] on this sort of thing
so that all the right people get to see it.

On Mon, 13 Apr 2009 18:08:32 -0400
Dan Malek <[email protected]> wrote:

> This patch updates the Memory Controller cgroup to add
> a configurable memory usage limit notification. The feature
> was presented at the April 2009 Embedded Linux Conference.
>
> Signed-off-by: Dan Malek <[email protected]>
> ---
> Documentation/cgroups/mem_notify.txt | 129 +++++++++++++++++++++
> include/linux/memcontrol.h | 7 +
> init/Kconfig | 9 ++
> mm/memcontrol.c | 207 ++++++++++++++++++++++++++++++++++
> 4 files changed, 352 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/cgroups/mem_notify.txt
>
> diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
> new file mode 100644
> index 0000000..72d5c26
> --- /dev/null
> +++ b/Documentation/cgroups/mem_notify.txt
> @@ -0,0 +1,129 @@
> +
> +Memory Limit Notification
> +
> +Attempts have been made in the past to provide a mechanism for
> +the notification to processes (task, an address space) when memory
> +usage is approaching a high limit. The intention is that it gives
> +the application an opportunity to release some memory and continue
> +operation rather than be OOM killed. The CE Linux Forum requested
> +a more contemporary implementation, and this is the result.
> +
> +The memory limit notification is a configurable extension to the
> +existing Memory Resource Controller. Please read memory.txt in this
> +directory to understand its operation before continuing here.
> +
> +1. Operation
> +
> +When a kernel is configured with CGROUP_MEM_NOTIFY, three additional
> +files will appear in the memory resource controller:
> +
> + memory.notify_limit_percent

We've run into problems in the past where a percentage number is too
coarse on large-memory systems.

Probably that won't be an issue here, but I invite you to convince us
of this ;)
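
To make the granularity concern concrete, here is a rough userspace sketch (not part of the patch; both helper names are hypothetical). It mirrors the page-based integer arithmetic of the posted compute_usage_percent() and shows how many bytes a single percentage step covers. With a 64 GiB limit, for example, one step is roughly 655 MiB, so usage can move by that much without the reported percentage changing.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes represented by one step of an integer
 * percentage for a given memory.limit_in_bytes value.
 */
uint64_t bytes_per_percent(uint64_t limit_bytes)
{
	return limit_bytes / 100;
}

/*
 * Userspace mirror of the patch's arithmetic: scale both counters down
 * to pages, then compute an integer (truncated) percentage.
 */
uint64_t usage_percent(uint64_t usage_bytes, uint64_t limit_bytes,
		       uint64_t page_size)
{
	uint64_t limit_pages;

	assert(page_size != 0);
	limit_pages = limit_bytes / page_size;
	assert(limit_pages != 0);	/* avoid dividing by zero */

	return (usage_bytes / page_size) * 100 / limit_pages;
}
```

Because the result is truncated, a cgroup sitting at 79.9 percent reports 79 and only crosses an 80 percent threshold once a full further step has been charged.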


> + memory.notify_limit_usage
> + memory.notify_limit_lowait
> +
> +The notification is based upon reaching a percentage of the memory
> +resource controller limit (memory.limit_in_bytes). When the controller
> +group is created, the percentage is set to 100. Any integer percentage
> +may be set by writing to memory.notify_limit_percent, such as:
> +
> + echo 80 > memory.notify_limit_percent
> +
> +The current integer usage percentage may be read at any time from
> +the memory.notify_limit_usage file.
> +
> +The memory.notify_limit_lowait is a blocking read file. The read will
> +block until one of four conditions occurs:
> +
> + - The usage reaches or exceeds the memory.notify_limit_percent
> + - The memory.notify_limit_lowait file is written with any value (debug)
> + - A thread is moved to another controller group
> + - The cgroup is destroyed or forced empty (memory.force_empty)

Does it support select()/poll()/eventfd()/etc?
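
For context, the patch as posted only wires up .read and .trigger handlers, so a waiter has to dedicate a thread to the blocking read(). The sketch below shows the kind of userspace pattern the question is asking about, where a descriptor is waited on with a timeout (and could be multiplexed with others). It is an illustration under assumptions: wait_and_read_int() is a hypothetical helper, and it would only be useful against the cgroup file if the kernel side implemented poll support, which this patch does not.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable, then read and parse
 * one integer from it.
 * Returns the parsed value, -1 on timeout, or -2 on error or EOF.
 */
int wait_and_read_int(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	char buf[32];
	ssize_t n;
	int ret;

	assert(fd >= 0);

	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		return -2;
	if (ret == 0)
		return -1;	/* timed out, nothing to read */

	n = read(fd, buf, sizeof(buf) - 1);
	if (n <= 0)
		return -2;
	buf[n] = '\0';
	return atoi(buf);
}
```

The same helper can be exercised with a pipe standing in for the notification file, which is how it was checked here.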

> +
> +1.1 Example Usage
> +
> +An application must be designed to properly take advantage of this
> +memory limit notification feature. It is a powerful management component
> +of some operating systems and embedded devices that must provide
> +highly available and reliable computing services. The application works
> +in conjunction with information provided by the operating system to
> +control limited resource usage. Since many programmers still think
> +memory is infinite and never check the return value from malloc(), it
> +may come as a surprise that such mechanisms have been utilized long ago.
> +
> +A typical application will be multithreaded, with one thread either
> +polling or waiting for the notification event. When the event occurs,
> +the thread will take whatever action is appropriate within the application
> +design. This could be actually running a garbage collection algorithm
> +or simply signaling other processing threads that they must do something
> +to reduce their memory usage. The notification thread will then be required
> +to poll the actual usage until the low limit of its choosing is met,
> +at which time the reclaim of memory can stop and the notification thread
> +will wait for the next event.
> +
> +Internally, the application only needs to fopen("memory.notify_limit_usage" ..)
> +and fopen("memory.notify_limit_lowait" ...), then either poll the former
> +file or block read on the latter file using fread() or fscanf() as desired.
> +
> +2. Configuration
> +
> +Follow the instructions in memory.txt for the configuration and usage of
> +the Memory Resource Controller cgroup. Once this is created and tasks
> +assigned, use the memory limit notification as described here.
> +
> +The only action that is needed outside of the application waiting or polling
> +is to set the memory.notify_limit_percent. Setting a notification to occur
> +when memory usage of the cgroup reaches or exceeds 80 percent can be
> +simply done:
> +
> + echo 80 > memory.notify_limit_percent
> +
> +This value may be read or changed at any time. Writing a lower value once
> +the Memory Resource Controller is in operation may trigger immediate
> +notification if the usage is above the new limit.
> +
> +3. Debug and Testing
> +
> +The design of cgroups makes it easier to perform some debugging or
> +monitoring tasks without modification to the application. For example,
> +a write of any value to memory.notify_limit_lowait will wake up all
> +threads waiting for notifications regardless of current memory usage.
> +
> +Collecting performance data about the cgroup is also simplified, as
> +no application modifications are necessary. A separate task can be
> +created that will open and monitor any necessary files of the cgroup
> +(such as current limits, usage and usage percentages and even when
> +notification occurs). This task can also operate outside of the cgroup,
> +so its memory usage is not charged to the cgroup.
> +
> +4. Design
> +
> +The memory limit notification is a configurable extension to the
> +existing Memory Resource Controller, which operates as described to
> +track and manage the memory of the Control Group. The Memory Resource
> +Controller will still continue to reclaim memory under pressure
> +of the limits, and may OOM kill tasks within the cgroup according to
> +the OOM Killer configuration.
> +
> +The memory notification limit was chosen as a percentage of the
> +memory in use so the cgroup parameters may continue to be dynamically
> +modified without the need to modify the notification parameters.
> +Otherwise, the notification limit would have to also be computed
> +and modified on any Memory Resource Controller operating parameter change.
> +
> +The cgroup file semantics are not well suited for this type of notification
> +mechanism. While applications may choose to simply poll the current
> +usage at their convenience, it was also desired to have a notification
> +event that would trigger when the usage attained the limit. The
> +blocking read() was chosen, as it is the only current useful method.
> +This presented the problems of "out of band" notification, when you want
> +to return some exceptional status other than reaching the notification
> +limit. In the cases listed above, the read() on the memory.notify_limit_lowait
> +file will not block and return "0" for the percentage. When this occurs,
> +the thread must determine if the task has moved to a new cgroup or if
> +the cgroup has been destroyed. Due to the usage model of this cgroup,
> +neither is likely to happen during normal operation of a product.
> +
> +Dan Malek <[email protected]>
> +Embedded Alley Solutions, Inc.
> +10 March 2009
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..031e5d1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -117,6 +117,13 @@ static inline bool mem_cgroup_disabled(void)
>
> extern bool mem_cgroup_oom_called(struct task_struct *task);
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +extern void test_and_wakeup_notify(struct mem_cgroup *mcg,
> + unsigned long long newlimit);
> +extern unsigned long compute_usage_percent(unsigned long long usage,
> + unsigned long long limit);
> +#endif

Stylistic trick: here, please add

#else
static inline void test_and_wakeup_notify(struct mem_cgroup *mcg,
unsigned long long newlimit)
{
}

static inline unsigned long compute_usage_percent(unsigned long long usage,
unsigned long long limit)
{
return 0; /* ? */
}
#endif

and then remove the ifdefs from the .c files.

> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> diff --git a/init/Kconfig b/init/Kconfig
> index f2f9b53..97138da 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -588,6 +588,15 @@ config CGROUP_MEM_RES_CTLR
> This config option also selects MM_OWNER config option, which
> could in turn add some fork/exit overhead.
>
> +config CGROUP_MEM_NOTIFY
> + bool "Memory Usage Limit Notification"
> + depends on CGROUP_MEM_RES_CTLR
> + help
> + Provides a memory notification when usage reaches a preset limit.
> + It is an extension to the memory resource controller, since it
> + uses the memory usage accounting of the cgroup to test against
> + the notification limit. (See Documentation/cgroups/mem_notify.txt)
> +
> config CGROUP_MEM_RES_CTLR_SWAP
> bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2fc6d6c..d6367ed 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6,6 +6,10 @@
> * Copyright 2007 OpenVZ SWsoft Inc
> * Author: Pavel Emelianov <[email protected]>
> *
> + * Memory Limit Notification update
> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
> + * Author: Dan Malek <[email protected]>
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -180,6 +184,11 @@ struct mem_cgroup {
> * statistics. This must be placed at the end of memcg.
> */
> struct mem_cgroup_stat stat;
> +
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + unsigned long notify_limit_percent;
> + wait_queue_head_t notify_limit_wait;
> +#endif
> };
>
> enum charge_type {
> @@ -934,6 +943,21 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>
> VM_BUG_ON(mem_cgroup_is_obsolete(mem));
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + /* We check on the way in so we don't have to duplicate code
> + * in both the normal and error exit path.
> + */
> + if (likely(mem->res.limit != (unsigned long long)LLONG_MAX)) {
> + unsigned long usage_pct;
> +
> + usage_pct = compute_usage_percent(mem->res.usage + PAGE_SIZE,
> + mem->res.limit);
> + if ((usage_pct >= mem->notify_limit_percent) &&
> + waitqueue_active(&mem->notify_limit_wait))
> + wake_up(&mem->notify_limit_wait);
> + }
> +#endif

It would be nicer to pull this out into a separate function, I expect.

> while (1) {
> int ret;
> bool noswap = false;
> @@ -1663,6 +1687,13 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> int children = mem_cgroup_count_children(memcg);
> u64 curusage, oldusage;
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + /* Test and notify ahead of the necessity to free pages, as
> + * applications giving up pages may help this reclaim procedure.
> + */
> + test_and_wakeup_notify(memcg, val);
> +#endif

ifdefs-in-c make kernel developers sad.

> /*
> * For keeping hierarchical_reclaim simple, how long we should retry
> * is depends on callers. We set our retry-count to be function
> @@ -2215,6 +2246,147 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> return 0;
> }
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +#define CGROUP_LOCAL_BUFFER_SIZE 64 /* Would be nice if this was in cgroup.h */
> +
> +/* The resource counters are defined as long long, but few processors
> + * handle 64-bit divisor in hardware, and the software to do it isn't
> + * present in the kernel. It would be nice if the resource counters were
> + * platform specific configurable typedefs, but for now we'll just divide
> + * down the byte counters by the page size to get 32-bit arithmetic.
> + * With a 4K page size, this will work up to about 16384G resource limit.
> + */
> +unsigned long compute_usage_percent(unsigned long long usage,
> + unsigned long long limit)

This is a poor choice of identifier for a global symbol. Please prefix
it with some string which identifies its subsystem.


> +{
> + unsigned long lim;
> + unsigned long long usage_pct;
> +
> + usage_pct = (usage / PAGE_SIZE) * 100;
> + lim = (unsigned long)(limit / PAGE_SIZE);
> +
> + do_div(usage_pct, lim);
> +
> + return (unsigned long)usage_pct;
> +}
> +
> +void test_and_wakeup_notify(struct mem_cgroup *mcg, unsigned long long newlimit)

Ditto.

> +{
> + unsigned long usage_pct;
> +
> + /* Check to see if the new limit should cause notification.
> + */
> + usage_pct = compute_usage_percent(mcg->res.usage, newlimit);
> +
> + if ((usage_pct >= mcg->notify_limit_percent) &&
> + waitqueue_active(&mcg->notify_limit_wait))
> + wake_up(&mcg->notify_limit_wait);
> +}

Also, identifiers such as "newlimit" are a bit unclear because they
don't communicate the units, and (less seriously) they don't
communicate what quantity they are measuring. Something like
memory_newlimit_bytes would be clearer, although a bit silly.

I _assume_ these things are all operating in units of bytes. Perhaps
it was pages? That's my point...

> +static ssize_t notify_limit_read(struct cgroup *cgrp, struct cftype *cft,
> + struct file *file,
> + char __user *buf, size_t nbytes,
> + loff_t *ppos)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> + int len;
> +
> + len = sprintf(tmp, "%lu\n", memcg->notify_limit_percent);

The reader has to run around the tree to find out if there's a buffer
overflow here. scnprintf() would set minds at ease.
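
As an illustration of the difference, here is a userspace approximation (a sketch only; format_percent() is a hypothetical name, and snprintf() stands in for the kernel helper). The point of scnprintf() is that the size is passed at the call site and the return value is the number of characters actually stored, so the length handed on to simple_read_from_buffer() can never exceed the buffer.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Userspace approximation of scnprintf() for this one format string:
 * returns the number of characters actually stored in buf (excluding
 * the terminating NUL), never more than size - 1.
 */
int format_percent(char *buf, size_t size, unsigned long pct)
{
	int len;

	assert(buf != NULL);
	if (size == 0)
		return 0;

	len = snprintf(buf, size, "%lu\n", pct);
	if (len < 0)
		return 0;
	if ((size_t)len >= size)
		len = (int)(size - 1);	/* output was truncated */
	return len;
}
```

With the 64-byte buffer used in the patch, "%lu\n" cannot overflow, so the change is about making that obvious to the reader rather than fixing a live bug.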


> + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}
> +
> +static int notify_limit_write(struct cgroup *cgrp, struct cftype *cft,
> + const char *buffer)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + unsigned long val;
> + char *endptr;
> +
> + val = simple_strtoul(buffer, &endptr, 0);

strict_strtoul() please. It's stricter.
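
The posted code never looks at endptr, so input such as "80abc" would be accepted as 80. A userspace sketch of the stricter behaviour being requested might look like the following. The function name parse_percent() is hypothetical and strtoul() stands in for the kernel helper; the rules it enforces (reject empty input and trailing garbage, tolerate one trailing newline as written by echo) are an approximation of strict parsing, not a copy of the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical strict parser for a 0..100 percentage.
 * Returns 0 and stores the value in *out, or -EINVAL on bad input.
 */
int parse_percent(const char *s, unsigned long *out)
{
	char *end;
	unsigned long val;

	assert(s != NULL && out != NULL);

	errno = 0;
	val = strtoul(s, &end, 0);
	if (end == s || errno == ERANGE)
		return -EINVAL;		/* no digits, or overflow */
	if (*end == '\n')
		end++;			/* allow one trailing newline */
	if (*end != '\0' || val > 100)
		return -EINVAL;		/* trailing garbage or out of range */

	*out = val;
	return 0;
}
```

Base 0 is kept from the original call, so hexadecimal and octal prefixes are still honoured.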

> + if (val > 100)
> + return -EINVAL;
> +
> + memcg->notify_limit_percent = val;
> +
> + /* Check to see if the new percentage limit should cause notification.
> + */
> + test_and_wakeup_notify(memcg, memcg->res.limit);
> +
> + return 0;
> +}
> +
> +static ssize_t notify_limit_usage_read(struct cgroup *cgrp, struct cftype *cft,
> + struct file *file,
> + char __user *buf, size_t nbytes,
> + loff_t *ppos)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> + unsigned long usage_pct;
> + int len;
> +
> + usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit);
> +
> + len = sprintf(tmp, "%lu\n", usage_pct);

scnprintf()

> + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}
> +
> +static ssize_t notify_limit_lowait(struct cgroup *cgrp, struct cftype *cft,

Perhaps this function would benefit from a nice comment explaining its
design. It's the core thing.

Then again, the .txt file has good coverage.

> + struct file *file,
> + char __user *buf, size_t nbytes,
> + loff_t *ppos)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> + unsigned long usage_pct;
> + int len;
> + DEFINE_WAIT(notify_lowait);
> +
> + /* A memory resource usage of zero is a special case that
> + * causes us not to sleep. It normally happens when the
> + * cgroup is about to be destroyed, and we don't want someone
> + * trying to sleep on a queue that is about to go away. This
> + * condition can also be forced as part of testing.
> + */
> + usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit);
> + if (likely(mem->res.usage != 0)) {
> +
> + prepare_to_wait(&mem->notify_limit_wait, &notify_lowait,
> + TASK_INTERRUPTIBLE);
> +
> + if (usage_pct < mem->notify_limit_percent) {
> + schedule();
> +
> + /* Compute percentage we have now and return it.
> + */
> + usage_pct = compute_usage_percent(mem->res.usage,
> + mem->res.limit);
> + }
> + finish_wait(&mem->notify_limit_wait, &notify_lowait);
> + }
> +
> + len = sprintf(tmp, "%lu\n", usage_pct);

scnprintf()

> + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}

What is the behavior of this read when someone sends the process a
signal? Seems that it will return early, giving userspace a number
which it didn't expect to see. I guess that's OK.
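
If the read does return early like that, a userspace waiter would probably want to re-check the value it gets back instead of assuming the limit was reached. The sketch below is one way to do that. The enum and function names are hypothetical, and it assumes (per the patch documentation) that a returned "0" signals an out-of-band condition such as the cgroup going away.

```c
#include <assert.h>

/* Hypothetical classification of a value read from notify_limit_lowait. */
enum wake_reason {
    WAKE_OUT_OF_BAND,    /* read returned 0: cgroup emptied, destroyed, ... */
    WAKE_SPURIOUS,       /* woken by a signal or debug write, below limit */
    WAKE_LIMIT_REACHED   /* usage is at or above the notification percent */
};

static enum wake_reason classify_wakeup(unsigned long returned_pct,
                                        unsigned long notify_pct)
{
    assert(notify_pct <= 100);

    if (returned_pct == 0)
        return WAKE_OUT_OF_BAND;
    if (returned_pct < notify_pct)
        return WAKE_SPURIOUS;
    return WAKE_LIMIT_REACHED;
}
```

A waiter thread would call this after each blocking read and simply go back to waiting on WAKE_SPURIOUS.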

> +/* This is used to wake up all threads that may be hanging
> + * out waiting for a low memory condition prior to that happening.
> + * Useful for triggering the event to assist with debug of applications.
> + */
> +static int notify_limit_wake_em_up(struct cgroup *cgrp, unsigned int event)
> +{
> + struct mem_cgroup *mem;
> +
> + mem = mem_cgroup_from_cont(cgrp);
> + wake_up(&mem->notify_limit_wait);
> + return 0;
> +}
> +#endif /* CONFIG_CGROUP_MEM_NOTIFY */
> +
>
> static struct cftype mem_cgroup_files[] = {
> {
> @@ -2258,6 +2430,22 @@ static struct cftype mem_cgroup_files[] = {
> .read_u64 = mem_cgroup_swappiness_read,
> .write_u64 = mem_cgroup_swappiness_write,
> },
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + {
> + .name = "notify_limit_percent",
> + .write_string = notify_limit_write,
> + .read = notify_limit_read,
> + },
> + {
> + .name = "notify_limit_usage",
> + .read = notify_limit_usage_read,
> + },
> + {
> + .name = "notify_limit_lowait",
> + .trigger = notify_limit_wake_em_up,
> + .read = notify_limit_lowait,
> + },
> +#endif
> };
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -2461,6 +2649,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> mem->last_scanned_child = 0;
> spin_lock_init(&mem->reclaim_param_lock);
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + init_waitqueue_head(&mem->notify_limit_wait);
> + mem->notify_limit_percent = 100;
> +#endif
> +
> if (parent)
> mem->swappiness = get_swappiness(parent);
> atomic_set(&mem->refcnt, 1);
> @@ -2504,6 +2697,20 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> struct cgroup *old_cont,
> struct task_struct *p)
> {
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + /* We wake up all notification threads any time a migration takes
> + * place. They will have to check to see if a move is needed to
> + * a new cgroup file to wait for notification.
> + * This isn't so much a task move as it is an attach. A thread not
> + * a child of an existing task won't have a valid parent, which
> + * is necessary to test because it won't have a valid mem_cgroup
> + * either. Which further means it won't have a proper wait queue
> + * and we can't do a wakeup.
> + */
> + if (old_cont->parent != NULL)
> + notify_limit_wake_em_up(old_cont, 0);
> +#endif
> +

2009-04-13 23:51:15

by Dan Malek

Subject: Re: [PATCH] Memory usage limit notification addition to memcg


OK, I'll rewrite and resubmit the patch with suggested updates.
Comments below...

On Apr 13, 2009, at 4:08 PM, Andrew Morton wrote:

> We've run into problems in the past where a percentage number is too
> coarse on large-memory systems.
>
> Probably that won't be an issue here, but I invite you to convince us
> of this ;)

The challenge here is that the absolute limit of the memcg can
be dynamically changed, so I wanted to avoid a couple of problems.
One is just a system configuration error where someone forgets
to modify both. For example, if you start with a memcg limit of 100M
and a notification limit of 80M, then come back and change the memcg
limit to 90M (or worse, < 80M), you now have a clearly incorrect
configuration. Another problem is that the operation isn't atomic. At
some point during the changes, even if you remember to do it correctly,
the two values will not represent what you really want. That could
trigger an erroneous notification, or simply an OOM kill before you
get the configuration correct.

If an integer number turns out not to be sufficient, we could change
this to some fixed point representation and adjust the arithmetic in
the tests. I believe the integer number will be fine, even in large
memory systems. This is just a notification model; if we want something
more fine grained, I believe it would need different semantics.

> Does it support select()/poll()/eventfd()/etc?

No. Unfortunately this is a cgroup implementation limitation.
My TODO list includes updating cgroups to allow this, using
this notification as an example.

> Stylistic trick: here, please add

Will do, including some others to get rid of ifdefs.

> ifdefs-in-c make kernel developers sad.

I know. I'll make them go away.

I'll fix it up and resubmit shortly.

Thanks.

-- Dan

2009-04-13 23:57:08

by Andrew Morton

Subject: Re: [PATCH] Memory usage limit notification addition to memcg

On Mon, 13 Apr 2009 16:45:17 -0700
Dan Malek <[email protected]> wrote:

> On Apr 13, 2009, at 4:08 PM, Andrew Morton wrote:
>
> > We've run into problems in the past where a percentage number is too
> > coarse on large-memory systems.
> >
> > Probably that won't be an issue here, but I invite you to convince us
> > of this ;)
>
> The challenge here is that the absolute limit of the memcg can
> be dynamically changed, so I wanted to avoid a couple of problems.
> One is just a system configuration error where someone forgets
> to modify both. For example, if you start with a memcg limit of 100M
> and a notification limit of 80M, then come back and change the memcg
> limit to 90M (or worse, < 80M), you now have a clearly incorrect
> configuration. Another problem is that the operation isn't atomic. At
> some point during the changes, even if you remember to do it correctly,
> the two values will not represent what you really want. That could
> trigger an erroneous notification, or simply an OOM kill before you
> get the configuration correct.
>
> If an integer number turns out not to be sufficient, we could change
> this to some fixed point representation and adjust the arithmetic in
> the tests. I believe the integer number will be fine, even in large
> memory systems. This is just a notification model; if we want something
> more fine grained, I believe it would need different semantics.

I agree. But it would be a mighty mess if we were to turn around in
two years' time and add a second centi-percent interface. So we should
give this careful thought now and really convince ourselves that we
will never ever ever want sub-1% resolution.
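
To put numbers on the resolution question, the sketch below computes how many bytes one threshold step covers when the limit is divided into 100 parts (whole percent) or 10000 parts (centi-percent). The function name is hypothetical and the limits are example values only. With a 64 GiB limit, a whole-percent step is roughly 655 MiB, while a centi-percent step is roughly 6.5 MiB.

```c
#include <assert.h>

/* Bytes represented by one step of a threshold that divides the limit
 * into `steps` equal parts (100 for whole percent, 10000 for 0.01%).
 * Integer division, so the result is rounded down. */
static unsigned long long bytes_per_step(unsigned long long limit_bytes,
                                         unsigned long long steps)
{
    assert(steps > 0);
    return limit_bytes / steps;
}
```

For a 100 MiB embedded cgroup the same whole-percent step is exactly 1 MiB, which is why the coarseness only becomes a concern as limits grow.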

2009-04-14 00:51:56

by Dan Malek

Subject: Re: [PATCH] Memory usage limit notification addition to memcg


On Apr 13, 2009, at 4:54 PM, Andrew Morton wrote:

> I agree. But it would be a mighty mess if we were to turn around in
> two years' time and add a second centi-percent interface.

OK, I'll do it. I'm just going to create a couple of simple functions
within this file to serve my requirements for the string conversion.

Thanks.

-- Dan

2009-04-15 00:37:40

by Kamezawa Hiroyuki

Subject: Re: [PATCH] Memory usage limit notification addition to memcg

On Mon, 13 Apr 2009 18:08:32 -0400
Dan Malek <[email protected]> wrote:

> This patch updates the Memory Controller cgroup to add
> a configurable memory usage limit notification. The feature
> was presented at the April 2009 Embedded Linux Conference.
>
> Signed-off-by: Dan Malek <[email protected]>

Welcome to memory cgroup world :)


> ---
> Documentation/cgroups/mem_notify.txt | 129 +++++++++++++++++++++
> include/linux/memcontrol.h | 7 +
> init/Kconfig | 9 ++
> mm/memcontrol.c | 207 ++++++++++++++++++++++++++++++++++
> 4 files changed, 352 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/cgroups/mem_notify.txt
>
> diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
> new file mode 100644
> index 0000000..72d5c26
> --- /dev/null
> +++ b/Documentation/cgroups/mem_notify.txt
> @@ -0,0 +1,129 @@
> +
> +Memory Limit Notificiation
> +
> +Attempts have been made in the past to provide a mechanism for
> +the notification to processes (task, an address space) when memory
> +usage is approaching a high limit. The intention is that it gives
> +the application an opportunity to release some memory and continue
> +operation rather than be OOM killed. The CE Linux Forum requested
> +a more comtemporary implementation, and this is the result.
> +
> +The memory limit notification is a configurable extension to the
> +existing Memory Resource Controller. Please read memory.txt in this
> +directory to understand its operation before continuing here.
> +
> +1. Operation
> +
> +When a kernel is configured with CGROUP_MEM_NOTIFY, three additional
> +files will appear in the memory resource controller:
> +
> + memory.notify_limit_percent
> + memory.notify_limit_usage
> + memory.notify_limit_lowait
> +
> +The notification is based upon reaching a percentage of the memory
> +resource controller limit (memory.limit_in_bytes). When the controller
> +group is created, the percentage is set to 100. Any integer percentage
> +may be set by writing to memory.notify_limit_percent, such as:
> +
> + echo 80 > memory.notify_limit_percent
> +

As Andrew pointed out, "percent" is not good.


> +The current integer usage percentage may be read at any time from
> +the memory.notify_limit_usage file.
> +
> +The memory.notify_limit_lowait is a blocking read file. The read will
> +block until one of four conditions occurs:
> +
> + - The usage reaches or exceeds the memory.notify_limit_percent
> + - The memory.notify_limit_lowait file is written with any value (debug)
> + - A thread is moved to another controller group

Why don't you check the "moved from another cgroup" case?
And why should the "moved to" case be caught?

> + - The cgroup is destroyed or forced empty (memory.force_empty)
> +
> +
> +1.1 Example Usage
> +
> +An application must be designed to properly take advantage of this
> +memory limit notification feature. It is a powerful management component
> +of some operating systems and embedded devices that must provide
> +highly available and reliable computing services. The application works
> +in conjunction with information provided by the operating system to
> +control limited resource usage. Since many programmers still think
> +memory is infinite and never check the return value from malloc(), it
> +may come as a surprise that such mechanisms have been utilized long ago.
> +
> +A typical application will be multithreaded, with one thread either
> +polling or waiting for the notification event. When the event occurs,
> +the thread will take whatever action is appropriate within the application
> +design. This could be actually running a garbage collection algorithm
> +or to simply signal other processing threads they must do something to
> +reduce their memory usage. The notification thread will then be required
> +to poll the actual usage until the low limit of its choosing is met,
> +at which time the reclaim of memory can stop and the notification thread
> +will wait for the next event.
> +
> +Internally, the application only needs to fopen("memory.notify_limit_usage" ..)
> +and fopen("memory.notify_limit_lowait" ...), then either poll the former
> +file or block read on the latter file using fread() or fscanf() as desired.
> +
> +2. Configuration
> +
> +Follow the instructions in memory.txt for the configuration and usage of
> +the Memory Resource Controller cgroup. Once this is created and tasks
> +assigned, use the memory limit notification as described here.
> +
> +The only action that is needed outside of the application waiting or polling
> +is to set the memory.notify_limit_percent. To set a notification to occur
> +when memory usage of the cgroup reaches or exceeds 80 percent can be
> +simply done:
> +
> + echo 80 > memory.notify_limit_percent
> +
> +This value may be read or changed at any time. Writing a lower value once
> +the Memory Resource Controller is in operation may trigger immediate
> +notification if the usage is above the new limit.
> +
> +3. Debug and Testing
> +
> +The design of cgroups makes it easier to perform some debugging or
> +monitoring tasks without modification to the application. For example,
> +a write of any value to memory.notify_limit_lowait will wake up all
> +threads waiting for notifications regardless of current memory usage.
> +
> +Collecting performance data about the cgroup is also simplified, as
> +no application modifications are necessary. A separate task can be
> +created that will open and monitor any necessary files of the cgroup
> +(such as current limits, usage and usage percentages and even when
> +notification occurs). This task can also operate outside of the cgroup,
> +so its memory usage is not charged to the cgroup.
> +
> +4. Design
> +
> +The memory limit notification is a configurable extension to the
> +existing Memory Resource Controller, which operates as described to
> +track and manage the memory of the Control Group. The Memory Resource
> +Controller will still continue to reclaim memory under pressure
> +of the limits, and may OOM kill tasks within the cgroup according to
> +the OOM Killer configuration.
> +
> +The memory notification limit was chosen as a percentage of the
> +memory in use so the cgroup paramaters may continue to be dynamically
> +modified without the need to modify the notificaton parameters.
> +Otherwise, the notification limit would have to also be computed
> +and modified on any Memory Resource Controller operating parameter change.
> +
> +The cgroup file semantics are not well suited for this type of notificaton
> +mechanism. While applications may choose to simply poll the current
> +usage at their convenience, it was also desired to have a notification
> +event that would trigger when the usage attained the limit. The
> +blocking read() was chosen, as it is the only current useful method.
> +This presented the problems of "out of band" notification, when you want
> +to return some exceptional status other than reaching the notification
> +limit. In the cases listed above, the read() on the memory.notify_limit_lowait
> +file will not block and return "0" for the percentage. When this occurs,
> +the thread must determine if the task has moved to a new cgroup or if
> +the cgroup has been destroyed. Due to the usage model of this cgroup,
> +neither is likely to happen during normal operation of a product.
> +
> +Dan Malek <[email protected]>
> +Embedded Alley Solutions, Inc.
> +10 March 2009
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..031e5d1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -117,6 +117,13 @@ static inline bool mem_cgroup_disabled(void)
>
> extern bool mem_cgroup_oom_called(struct task_struct *task);
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +extern void test_and_wakeup_notify(struct mem_cgroup *mcg,
> + unsigned long long newlimit);
> +extern unsigned long compute_usage_percent(unsigned long long usage,
> + unsigned long long limit);
> +#endif
> +
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> diff --git a/init/Kconfig b/init/Kconfig
> index f2f9b53..97138da 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -588,6 +588,15 @@ config CGROUP_MEM_RES_CTLR
> This config option also selects MM_OWNER config option, which
> could in turn add some fork/exit overhead.
>
> +config CGROUP_MEM_NOTIFY
> + bool "Memory Usage Limit Notification"
> + depends on CGROUP_MEM_RES_CTLR
> + help
> + Provides a memory notification when usage reaches a preset limit.
> + It is an extenstion to the memory resource controller, since it
> + uses the memory usage accounting of the cgroup to test against
> + the notification limit. (See Documentation/cgroups/mem_notify.txt)
> +

I think it's better to remove this CONFIG.

> config CGROUP_MEM_RES_CTLR_SWAP
> bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2fc6d6c..d6367ed 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6,6 +6,10 @@
> * Copyright 2007 OpenVZ SWsoft Inc
> * Author: Pavel Emelianov <[email protected]>
> *
> + * Memory Limit Notification update
> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
> + * Author: Dan Malek <[email protected]>
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -180,6 +184,11 @@ struct mem_cgroup {
> * statistics. This must be placed at the end of memcg.
> */
> struct mem_cgroup_stat stat;
> +
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + unsigned long notify_limit_percent;
> + wait_queue_head_t notify_limit_wait;
> +#endif
> };
>
> enum charge_type {
> @@ -934,6 +943,21 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>
> VM_BUG_ON(mem_cgroup_is_obsolete(mem));
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + /* We check on the way in so we don't have to duplicate code
> + * in both the normal and error exit path.
> + */
> + if (likely(mem->res.limit != (unsigned long long)LLONG_MAX)) {
> + unsigned long usage_pct;
> +
> + usage_pct = compute_usage_percent(mem->res.usage + PAGE_SIZE,
> + mem->res.limit);
> + if ((usage_pct >= mem->notify_limit_percent) &&
> + waitqueue_active(&mem->notify_limit_wait))
> + wake_up(&mem->notify_limit_wait);
> + }
> +#endif
> +
I don't think it is sane to check this limit on every charge. If this
mem_notify is not required to act as a "hard limit", please reduce the
number of checks. How about once per 1 MB?
Once notified, the application can keep observing for a while.


> while (1) {
> int ret;
> bool noswap = false;
> @@ -1663,6 +1687,13 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> int children = mem_cgroup_count_children(memcg);
> u64 curusage, oldusage;
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + /* Test and notify ahead of the necessity to free pages, as
> + * applications giving up pages may help this reclaim procedure.
> + */
> + test_and_wakeup_notify(memcg, val);
> +#endif
> +
> /*
> * For keeping hierarchical_reclaim simple, how long we should retry
> * is depends on callers. We set our retry-count to be function
> @@ -2215,6 +2246,147 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> return 0;
> }
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +#define CGROUP_LOCAL_BUFFER_SIZE 64 /* Would be nice if this was in cgroup.h */
> +
> +/* The resource counters are defined as long long, but few processors
> + * handle 64-bit divisor in hardware, and the software to do it isn't
> + * present in the kernel. It would be nice if the resource counters were
> + * platform specific configurable typedefs, but for now we'll just divide
> + * down the byte counters by the page size to get 32-bit arithmetic.
> + * With a 4K page size, this will work up to about 16384G resource limit.
> + */
> +unsigned long compute_usage_percent(unsigned long long usage,
> + unsigned long long limit)
> +{
> + unsigned long lim;
> + unsigned long long usage_pct;
> +
> + usage_pct = (usage / PAGE_SIZE) * 100;
> + lim = (unsigned long)(limit / PAGE_SIZE);
> +
> + do_div(usage_pct, lim);
> +
> + return (unsigned long)usage_pct;
> +}
> +
Hmm, I think this "lim" can be calculated when the user does "set limit" or
"set notify_percent".

And...please wake up all waiting threads at rmdir(). If not, rmdir() will return
-EBUSY always.
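
On the first point, a minimal sketch of what precomputing could look like: convert the percentage into a byte threshold once, when the limit or the percentage is written, so that the charge path only needs a comparison. Both function names are hypothetical. The arithmetic rounds up so that "usage >= threshold" agrees with "usage * 100 / limit >= percent" computed on bytes, and it splits the limit into quotient and remainder to avoid overflowing 64 bits.

```c
#include <assert.h>
#include <stdbool.h>

/* Smallest usage in bytes that counts as `percent` of `limit_bytes`,
 * i.e. ceil(limit_bytes * percent / 100) without 64-bit overflow. */
static unsigned long long notify_threshold_bytes(unsigned long long limit_bytes,
                                                 unsigned int percent)
{
    unsigned long long q = limit_bytes / 100;
    unsigned long long r = limit_bytes % 100;

    assert(percent <= 100);
    return q * percent + (r * percent + 99) / 100;
}

/* Hot-path test: a single comparison against the cached threshold. */
static bool notify_limit_reached(unsigned long long usage_bytes,
                                 unsigned long long threshold_bytes)
{
    return usage_bytes >= threshold_bytes;
}
```

The cached value would have to be recomputed in both write paths (limit and percentage), which is the price of taking the division out of the charge path.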


> +void test_and_wakeup_notify(struct mem_cgroup *mcg, unsigned long long newlimit)
> +{
> + unsigned long usage_pct;
> +
> + /* Check to see if the new limit should cause notification.
> + */
> + usage_pct = compute_usage_percent(mcg->res.usage, newlimit);
> +
> + if ((usage_pct >= mcg->notify_limit_percent) &&
> + waitqueue_active(&mcg->notify_limit_wait))
> + wake_up(&mcg->notify_limit_wait);
> +}
> +
> +static ssize_t notify_limit_read(struct cgroup *cgrp, struct cftype *cft,
> + struct file *file,
> + char __user *buf, size_t nbytes,
> + loff_t *ppos)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> + int len;
> +
> + len = sprintf(tmp, "%lu\n", memcg->notify_limit_percent);
> +
> + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}
> +
> +static int notify_limit_write(struct cgroup *cgrp, struct cftype *cft,
> + const char *buffer)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + unsigned long val;
> + char *endptr;
> +
> + val = simple_strtoul(buffer, &endptr, 0);
> + if (val > 100)
> + return -EINVAL;
> +
> + memcg->notify_limit_percent = val;
> +
> + /* Check to see if the new percentage limit should cause notification.
> + */
> + test_and_wakeup_notify(memcg, memcg->res.limit);
> +
> + return 0;
> +}
> +
> +static ssize_t notify_limit_usage_read(struct cgroup *cgrp, struct cftype *cft,
> + struct file *file,
> + char __user *buf, size_t nbytes,
> + loff_t *ppos)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> + unsigned long usage_pct;
> + int len;
> +
> + usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit);
> +
> + len = sprintf(tmp, "%lu\n", usage_pct);
> +
> + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}
> +
> +static ssize_t notify_limit_lowait(struct cgroup *cgrp, struct cftype *cft,
> + struct file *file,
> + char __user *buf, size_t nbytes,
> + loff_t *ppos)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> + unsigned long usage_pct;
> + int len;
> + DEFINE_WAIT(notify_lowait);
> +
> + /* A memory resource usage of zero is a special case that
> + * causes us not to sleep. It normally happens when the
> + * cgroup is about to be destroyed, and we don't want someone
> + * trying to sleep on a queue that is about to go away. This
> + * condition can also be forced as part of testing.
> + */
> + usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit);
> + if (likely(mem->res.usage != 0)) {
> +
> + prepare_to_wait(&mem->notify_limit_wait, &notify_lowait,
> + TASK_INTERRUPTIBLE);
> +
> + if (usage_pct < mem->notify_limit_percent) {
> + schedule();
> +
> + /* Compute percentage we have now and return it.
> + */
> + usage_pct = compute_usage_percent(mem->res.usage,
> + mem->res.limit);
> + }
> + finish_wait(&mem->notify_limit_wait, &notify_lowait);
> + }
> +
> + len = sprintf(tmp, "%lu\n", usage_pct);
> +
> + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}
> +
> +/* This is used to wake up all threads that may be hanging
> + * out waiting for a low memory condition prior to that happening.
> + * Useful for triggering the event to assist with debug of applications.
> + */
> +static int notify_limit_wake_em_up(struct cgroup *cgrp, unsigned int event)
> +{
> + struct mem_cgroup *mem;
> +
> + mem = mem_cgroup_from_cont(cgrp);
> + wake_up(&mem->notify_limit_wait);
> + return 0;
> +}
> +#endif /* CONFIG_CGROUP_MEM_NOTIFY */
> +
>
> static struct cftype mem_cgroup_files[] = {
> {
> @@ -2258,6 +2430,22 @@ static struct cftype mem_cgroup_files[] = {
> .read_u64 = mem_cgroup_swappiness_read,
> .write_u64 = mem_cgroup_swappiness_write,
> },
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + {
> + .name = "notify_limit_percent",
> + .write_string = notify_limit_write,
> + .read = notify_limit_read,
> + },
> + {
> + .name = "notify_limit_usage",
> + .read = notify_limit_usage_read,
> + },
> + {
> + .name = "notify_limit_lowait",
> + .trigger = notify_limit_wake_em_up,
> + .read = notify_limit_lowait,
> + },
> +#endif
> };
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -2461,6 +2649,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> mem->last_scanned_child = 0;
> spin_lock_init(&mem->reclaim_param_lock);
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + init_waitqueue_head(&mem->notify_limit_wait);
> + mem->notify_limit_percent = 100;
> +#endif
> +

I think this means notify is triggered at every "reach limit"...
mem->notify_limit_percent = 101 or so would be better.


> if (parent)
> mem->swappiness = get_swappiness(parent);
> atomic_set(&mem->refcnt, 1);
> @@ -2504,6 +2697,20 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> struct cgroup *old_cont,
> struct task_struct *p)
> {
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + /* We wake up all notification threads any time a migration takes
> + * place. They will have to check to see if a move is needed to
> + * a new cgroup file to wait for notification.
> + * This isn't so much a task move as it is an attach. A thread not
> + * a child of an existing task won't have a valid parent, which
> + * is necessary to test because it won't have a valid mem_cgroup
> + * either. Which further means it won't have a proper wait queue
> + * and we can't do a wakeup.
> + */
> + if (old_cont->parent != NULL)
> + notify_limit_wake_em_up(old_cont, 0);
> +#endif
> +
> mutex_lock(&memcg_tasklist);
> /*
> * FIXME: It's better to move charges of this process from old

Hmm. I'll add the following interface if you need it. (Or it's OK to add it in your set.)

- memory.shrink_usage_in_bytes
example)
#echo 1G > memory.limit_in_bytes.
use up to 999MB.
#echo 100M > memory.shrink_usage_to_bytes.
try to reduce the memory usage of this cgroup by 100M, bringing it to 899MB.


Thanks,
-Kame







2009-04-15 02:33:18

by Dan Malek

Subject: Re: [PATCH] Memory usage limit notification addition to memcg


Hi Kame.

On Apr 14, 2009, at 5:35 PM, KAMEZAWA Hiroyuki wrote:

> Welcome to memory cgroup world :)

Thanks. I think it's a great feature that will be realized
over time.

I was just about to resend the patch, so I'll incorporate
your comments. I'll reply to some below as well.

> As Andrew pointed out, "percent" is not good.

I updated this to add more granularity, to xx.yy
I can't comprehend why this is a problem. Conceptually,
it works very well with the applications I have used. If
you guys really want to use an absolute number for a
notification limit, we can change it, but I really don't
want to :-)

>> +The memory.notify_limit_lowait is a blocking read file. The read will
>> +block until one of four conditions occurs:
>> +
>> + - The usage reaches or exceeds the memory.notify_limit_percent
>> + - The memory.notify_limit_lowait file is written with any value (debug)
>> + - A thread is moved to another controller group
>
> Why don't you check the "moved from another cgroup" case?
> And why should the "moved to" case be caught?

Sorry, badly worded. The test is actually when a task moves from
a cgroup. If a task is moved from one cgroup to another, the threads
waiting for notification in the "from" group are poked to wake up.
I didn't see the need to wake up anyone in the cgroup it may move into.

> I think it's better to remove this CONFIG.

OK. Should I just add the documentation to
Documentation/cgroups/memory.txt or leave it stand alone?
BTW, all of the ifdefs are removed even with the CONFIG
option. I just thought that if someone was really counting cycles
and wanted memcg without notify, it would be easy to do that.

> I don't think it is sane to check this limit on every charge. If this
> mem_notify is not required to act as a "hard limit", please reduce the
> number of checks. How about once per 1 MB?
> Once notified, the application can keep observing for a while.

The overhead is small, and this kind of contradicts Andrew's
comment about wanting finer granularity. Also, the test would have
to be scaled to match the size of the cgroup; on some of the
embedded systems 1M could be a measurable percentage.
But let me think of some other way to do the math. I think I'll turn
it around and do the percentage computation only for the application,
not internally.

> Hmm, I think this "lim" can be calculated when the user does
> "set limit" or "set notify_percent".

Yeah, probably.

> And...please wake up all waiting threads at rmdir(). If not, rmdir()
> will return -EBUSY always.

OK, I'll check to make sure this still works. An empty cgroup causes
the notification thread to not sleep and returns zero.

>> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
>> + init_waitqueue_head(&mem->notify_limit_wait);
>> + mem->notify_limit_percent = 100;
>> +#endif
>> +
>
> I think this means notify is triggered at every "reach limit"...
> mem->notify_limit_percent = 101 or so would be better.

I just didn't want it to be zero :-) I think I'll leave it at 100
because that's a legal value. Although, maybe we should allow setting
up to 101 as a way of preventing notification even if threads are
waiting.

> Hmm. I'll add the following interface if you need it. (Or it's OK to
> add it in your set.)
>
> - memory.shrink_usage_in_bytes
> example)
> #echo 1G > memory.limit_in_bytes.
> use up to 999MB.
> #echo 100M > memory.shrink_usage_to_bytes.
> try to reduce the memory usage of this cgroup by 100M, bringing it
> to 899MB.

I understand the idea, but what happens if you can't?
Of course, the proper way is to do this automatically
when the task is moved out :-)

I'll think about all of this for a bit and then submit an
updated patch.

Thanks.

-- Dan

2009-04-15 02:59:57

by Kamezawa Hiroyuki

Subject: Re: [PATCH] Memory usage limit notification addition to memcg

On Tue, 14 Apr 2009 19:34:04 -0700
Dan Malek <[email protected]> wrote:

>
> Hi Kame.
>
> On Apr 14, 2009, at 5:35 PM, KAMEZAWA Hiroyuki wrote:
>
> > Welcome to memory cgroup world :)
>
> Thanks. I think it's a great feature that will be realized
> over time.
>
> I was just about to resend the patch, so I'll incorporate
> your comments. I'll reply to some below as well.
>
> > As Andrew pointed out, "percent" is not good.
>
> I updated this to add more granularity, to xx.yy
> I can't comprehend why this is a problem. Conceptually,
> it works very well with the applications I have used. If
> you guys really want to use an absolute number for a
> notification limit, we can change it, but I really don't
> want to :-)
>
Memory cgroup is a feature for both very small and very large systems.

Specifying the threshold as XX MB (or KB) below the limit is one idea.
# echo 100MB > memory.limit_in_bytes.
# echo 5MB > memory.notify_triger_thresh_in_bytes.

Notify will be generated at 95MB of usage.
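
The arithmetic behind this proposal can be sketched as follows. The function name is hypothetical, and the clamping behavior for a threshold at or above the limit is an assumption, since the proposal does not say what should happen in that case.

```c
#include <assert.h>

/* Usage in bytes at which a notification would fire when the threshold
 * is expressed as a distance below the limit. */
static unsigned long long notify_point(unsigned long long limit_bytes,
                                       unsigned long long thresh_bytes)
{
    assert(limit_bytes > 0);

    /* Assumed behavior: a threshold at or above the limit means
     * "notify from zero usage" rather than wrapping around. */
    if (thresh_bytes >= limit_bytes)
        return 0;
    return limit_bytes - thresh_bytes;
}
```

Because the threshold is relative to the limit, lowering the limit from 100 MB to 90 MB moves the notification point to 85 MB without a second write.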


> >> +The memory.notify_limit_lowait is a blocking read file. The read will
> >> +block until one of four conditions occurs:
> >> +
> >> + - The usage reaches or exceeds the memory.notify_limit_percent
> >> + - The memory.notify_limit_lowait file is written with any value (debug)
> >> + - A thread is moved to another controller group
> >
> > Why don't you check the "moved from another cgroup" case?
> > And why should the "moved to" case be caught?
>
> Sorry, badly worded. The test is actually when a task moves from
> a cgroup. If a task is moved from one cgroup to another, the threads
> waiting for notification in the "from" group are poked to wake up.
> I didn't see the need to wake up anyone in the cgroup it may move into.
>
> > I think it's better to remove this CONFIG.
>
> OK. Should I just add the documentation to
> Documentation/cgroups/memory.txt or leave it stand alone?

Either is OK with me. Please do as you prefer.

> BTW, all of the ifdefs are removed even with the CONFIG
> option. I just thought if someone was really counting cycles,
> wanted memcg without notify, it was easy to do that.
>
> > I don't think this it is sane manner to check this limit
> > always...If this mem_notify is
> > not required to as "hard limit", please reduce # of checks.
> > How about once per 1MBytes ?
> > One notified, the applications can keep observation for a while.
>
> The overhead is small, and this kind of contradicts Andrew's
> comment about wanting finer granularity. Also, the test would have
> to be scaled to match the size of the cgroup, on some of the
> embedded systems 1M could be a measurable percentage.

Maybe. But this kind of overhead tends to increase gradually and implicitly.
Doing our best here will help us in the future, I think.

> But, let me think of some other way to do the math. I think I'll turn
> it around, do the percentage computation only to the application,
> not internally.
>
Thanks.

> > Hmm, I think this "lim" can be calculated when the user does "set
> > limit" or
> > "set notify_percent".
>
> Yeah, probably.
>
> > And...please wake up all waiting thread at rmdir(). If not, rmdir()
> > will return
> > -EBUSY always.
>
> OK, I'll check to make sure this still works. An empty cgroup causes
> the
> notification thread to not sleep and returns zero.
>
Sure, thanks.


> >> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> >> + init_waitqueue_head(&mem->notify_limit_wait);
> >> + mem->notify_limit_percent = 100;
> >> +#endif
> >> +
> >
> > I think this means notify is triggerred at every "reach limit"...
> > mem->notify_limit_percent = 101 or some is better.
>
> I just didn't want it to be zero :-) I think I'll leave it at 100
> because
> that's a legal value. Although, maybe we should allow setting up
> to 101 as a way of a preventing notification even if threads are
> waiting.
>
> > Hmm. I'll add follwing interface if you necessary. (Or it's ok to
> > add in your set."
> >
> > - memory.shirnk_usage_in_bytes
> > example)
> > #echo 1G > memory.limit_in_bytes.
> > use up to 999MB.
> > #echo 100M > memory.shrink_usage_to_bytes.
> > try to reduce 100M of memory usage of this cgroup. and make
> > memory usage to be 899MB.
>
> I understand the idea, but what happens if you can't?
It returns -EBUSY (or times out). The following is the example I have in mind.

The VM monitor application will work like this:
==
while () {
    poll (or read) the event notification.
    check the usage.

    if (usage is small enough)
        continue;

    if (most of the usage is file cache)
        try-to-reduce-usage-only-file-cache. #need support in the kernel

    if (usage is small enough)
        continue;

    if (hierarchy is used)
        check bad children.

    ret = try-to-reduce-usage-general() #need support in the kernel.
    if (ret == -EBUSY && usage is too much) {
        show warning to users.
        kill/freeze or move tasks. or check locked shmem/tmpfs.
    }
}
==
Of course, this monitor process should run outside of the limited memcg ;)
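
The decision part of the loop above can be sketched as a pure C function. This
is a rough illustration only: every name here (monitor_decide, struct
memcg_sample, the MON_* actions) is hypothetical, "most of the usage is file
cache" is assumed to mean more than half, and the reclaim operations marked
"need support in the kernel" are represented only by their results. The
"check bad children" step for hierarchies is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum monitor_action {
    MON_WAIT,            /* usage is small enough: go back to poll/read */
    MON_DROP_FILE_CACHE, /* try-to-reduce-usage-only-file-cache */
    MON_RECLAIM_GENERAL, /* try-to-reduce-usage-general() */
    MON_ESCALATE         /* warn users, kill/freeze or move tasks */
};

/* One snapshot of the cgroup as seen by the monitor (hypothetical layout). */
struct memcg_sample {
    uint64_t usage;       /* current usage in bytes */
    uint64_t target;      /* usage the monitor considers "small enough" */
    uint64_t file_cache;  /* part of usage that is file cache */
    bool     cache_tried; /* file-cache reclaim already attempted */
    int      reclaim_ret; /* result of the last general reclaim, 0 if none */
};

/* Pick the next step, following the order of the pseudocode. */
static enum monitor_action monitor_decide(const struct memcg_sample *s)
{
    if (s->usage <= s->target)
        return MON_WAIT;
    if (!s->cache_tried && s->file_cache > s->usage / 2)
        return MON_DROP_FILE_CACHE;
    if (s->reclaim_ret == -EBUSY)
        return MON_ESCALATE;
    return MON_RECLAIM_GENERAL;
}
```

Keeping the decision separate from the I/O makes the policy easy to exercise
without a cgroup mounted.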

> Of course, the proper way is to do this automatically
> when the task is moved out :-)
>
> I'll think about all of this for a bit and then submit an
> updated patch.
>

Regards,
-Kame

> Thanks.
>
> -- Dan
>
>

2009-04-15 07:32:43

by Dan Malek

Subject: Re: [PATCH] Memory usage limit notification addition to memcg


On Apr 14, 2009, at 7:58 PM, KAMEZAWA Hiroyuki wrote:

> Memory cgroup is a feature for both very small and very large systems.
>
> Setting the threshold as an absolute size in MB (or KB) is one idea:
> # echo 100M > memory.limit_in_bytes
> # echo 5M > memory.notify_triger_thresh_in_bytes
>
> A notification will be generated at 95MB of usage.

I get it, I'll change it.

> The VM monitor application will work like

I see. Let me finish my patch first, then we
can discuss this some more?

Thanks.

-- Dan

2009-04-15 07:35:28

by Kamezawa Hiroyuki

Subject: Re: [PATCH] Memory usage limit notification addition to memcg

On Wed, 15 Apr 2009 00:32:29 -0700
Dan Malek <[email protected]> wrote:

>
> On Apr 14, 2009, at 7:58 PM, KAMEZAWA Hiroyuki wrote:
>
> > Memory cgroup is a feature for both very small and very large systems.
> >
> > Setting the threshold as an absolute size in MB (or KB) is one idea:
> > # echo 100M > memory.limit_in_bytes
> > # echo 5M > memory.notify_triger_thresh_in_bytes
> >
> > A notification will be generated at 95MB of usage.
>
> I get it, I'll change it.
>
> > The VM monitor application will work like
>
> I see. Let me finish my patch first, then we
> can discuss this some more?
>
yes. Let's go step-by-step.

Thanks,
-Kame

> Thanks.
>
> -- Dan
>

2009-04-15 08:26:01

by Balbir Singh

Subject: Re: [PATCH] Memory usage limit notification addition to memcg

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-15 09:35:55]:

> On Mon, 13 Apr 2009 18:08:32 -0400
> Dan Malek <[email protected]> wrote:
>
> > This patch updates the Memory Controller cgroup to add
> > a configurable memory usage limit notification. The feature
> > was presented at the April 2009 Embedded Linux Conference.
> >
> > Signed-off-by: Dan Malek <[email protected]>
>
> Welcome to memory cgroup world :)

Welcome! I've taken a cursory look at the patches, but I'll come back
with more review comments. Could you point me to your
presentation/paper? I could not find the PDF at
http://www.celinuxforum.org/CelfPubWiki/ELC2009Presentations
--
Balbir

2009-04-15 17:35:32

by Dan Malek

Subject: Re: [PATCH] Memory usage limit notification addition to memcg


On Apr 15, 2009, at 1:24 AM, Balbir Singh wrote:

> Welcome! I've taken a cursory look at the patches, but I'll come back
> with more review comments.

I'm going to submit a new patch shortly, so let's wait and review that
one. Too many updates since the first time I sent it.

> ... Could you point me to your
> presentation/paper? I could not find the PDF at
> http://www.celinuxforum.org/CelfPubWiki/ELC2009Presentations

Apologies. I just uploaded it now. There was an hour of
talking that goes along with that, which I hope will explain some
of the points on the slides. The video was recorded, I assume it
will show up someplace, someday :-)

Thanks.

-- Dan

2009-04-16 03:16:13

by Balbir Singh

Subject: Re: [PATCH] Memory usage limit notification addition to memcg

* Dan Malek <[email protected]> [2009-04-15 10:35:20]:

>
> On Apr 15, 2009, at 1:24 AM, Balbir Singh wrote:
>
>> Welcome! I've taken a cursory look at the patches, but I'll come back
>> with more review comments.
>
> I'm going to submit a new patch shortly, so let's wait and review that
> one. Too many updates since the first time I sent it.
>

OK, I was going to advise using cgroupstats for notification.
cgroupstats builds on top of taskstats.

Please see

1. Documentation/accounting/*
2. kernel/taskstats.c

getdelays.c (available under (1)) is a user-space application that
makes use of taskstats and cgroupstats.

Using that will allow you to

1. Provide type information with every event (extensible)
2. Reuse existing, well-tested code in the form of genetlink and
netlink


>> ... Could you point me to your
>> presentation/paper? I could not find the PDF at
>> http://www.celinuxforum.org/CelfPubWiki/ELC2009Presentations
>
> Apologies. I just uploaded it now. There was an hour of
> talking that goes along with that, which I hope will explain some
> of the points on the slides. The video was recorded, I assume it
> will show up someplace, someday :-)
>

I was able to download it. Thanks!

--
Balbir

2009-07-07 20:25:25

by Vladislav D. Buzov

Subject: [PATCH 0/1] Memory usage limit notification addition to memcg

Hello all,

The following patch introduces memory usage limit notification capability to
the Memory Controller cgroup. It is a reworked version of a patch sent by Dan
Malek in April 2009.

The major difference between this patch and the original is the method of
setting the notification threshold. The original patch implemented the
threshold as a percentage of the memory controller limit. During the discussion
that followed, it was decided to change the percentage to an absolute number
representing the minimal amount of memory that should be available below the
limit. The following patch implements this method along with various clean-ups
and style changes.

Thanks,
Vlad.

2009-07-07 20:25:38

by Vladislav D. Buzov

Subject: [PATCH 1/1] Memory usage limit notification addition to memcg

This patch updates the Memory Controller cgroup to add
a configurable memory usage limit notification. The feature
was presented at the April 2009 Embedded Linux Conference.

Signed-off-by: Dan Malek <[email protected]>
Signed-off-by: Vladislav Buzov <[email protected]>
---
Documentation/cgroups/mem_notify.txt | 140 ++++++++++++++++++++++++++
include/linux/memcontrol.h | 21 ++++
init/Kconfig | 9 ++
mm/memcontrol.c | 178 ++++++++++++++++++++++++++++++++++
4 files changed, 348 insertions(+), 0 deletions(-)
create mode 100644 Documentation/cgroups/mem_notify.txt

diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
new file mode 100644
index 0000000..b4f20d0
--- /dev/null
+++ b/Documentation/cgroups/mem_notify.txt
@@ -0,0 +1,140 @@
+
+Memory Limit Notification
+
+Attempts have been made in the past to provide a mechanism for
+the notification to processes (task, an address space) when memory
+usage is approaching a high limit. The intention is that it gives
+the application an opportunity to release some memory and continue
+operation rather than be OOM killed. The CE Linux Forum requested
+a more contemporary implementation, and this is the result.
+
+The memory threshold notification is a configurable extension to the
+existing Memory Resource Controller. Please read memory.txt in this
+directory to understand its operation before continuing here.
+
+1. Operation
+
+When a kernel is configured with CGROUP_MEM_NOTIFY, three additional
+files will appear in the memory resource controller:
+
+ memory.notify_threshold_in_bytes
+ memory.notify_available_in_bytes
+ memory.notify_threshold_lowait
+
+The notification is based upon reaching a threshold below the memory
+resource controller limit (memory.limit_in_bytes). The threshold
+represents the minimal number of bytes that should be available under
+the limit. When the controller group is created, the threshold is set
+to zero which triggers notification when the memory resource controller
+limit is reached.
+
+The threshold may be set by writing to memory.notify_threshold_in_bytes,
+such as:
+
+ echo 10M > memory.notify_threshold_in_bytes
+
+The current number of available bytes may be read at any time from
+the memory.notify_available_in_bytes
+
+The memory.notify_threshold_lowait is a blocking read file. The read will
+block until one of four conditions occurs:
+
+ - The amount of available memory is equal or less than the threshold
+ defined in memory.notify_threshold_in_bytes
+ - The memory.notify_threshold_lowait file is written with any value (debug)
+ - A thread is moved to another controller group
+ - The cgroup is destroyed or forced empty (memory.force_empty)
+
+
+1.1 Example Usage
+
+An application must be designed to properly take advantage of this
+memory threshold notification feature. It is a powerful management component
+of some operating systems and embedded devices that must provide
+highly available and reliable computing services. The application works
+in conjunction with information provided by the operating system to
+control limited resource usage. Since many programmers still think
+memory is infinite and never check the return value from malloc(), it
+may come as a surprise that such mechanisms have been utilized long ago.
+
+A typical application will be multithreaded, with one thread either
+polling or waiting for the notification event. When the event occurs,
+the thread will take whatever action is appropriate within the application
+design. This could be actually running a garbage collection algorithm
+or to simply signal other processing threads they must do something to
+reduce their memory usage. The notification thread will then be required
+to poll the actual usage until the low limit of its choosing is met,
+at which time the reclaim of memory can stop and the notification thread
+will wait for the next event.
+
+Internally, the application only needs to
+fopen("memory.notify_available_in_bytes" ..) or
+fopen("memory.notify_threshold_lowait" ...), then either poll the former
+file or block read on the latter file using fread() or fscanf() as desired.
+Comparing the value returned from either of these read function with the
+value obtained by reading memory.notify_threshold_in_bytes will be an
+indication of the amount of memory used over the threshold limit.
+
+2. Configuration
+
+Follow the instructions in memory.txt for the configuration and usage of
+the Memory Resource Controller cgroup. Once this is created and tasks
+assigned, use the memory threshold notification as described here.
+
+The only action that is needed outside of the application waiting or polling
+is to set the memory.notify_threshold_in_bytes. To set a notification to occur
+when memory usage of the cgroup reaches or exceeds 1 MByte below the limit
+can be simply done:
+
+ echo 1M > memory.notify_threshold_in_bytes
+
+This value may be read or changed at any time. Writing a higher value once
+the Memory Resource Controller is in operation may trigger immediate
+notification if the usage is above the new threshold.
+
+3. Debug and Testing
+
+The design of cgroups makes it easier to perform some debugging or
+monitoring tasks without modification to the application. For example,
+a write of any value to memory.notify_threshold_lowait will wake up all
+threads waiting for notifications regardless of current memory usage.
+
+Collecting performance data about the cgroup is also simplified, as
+no application modifications are necessary. A separate task can be
+created that will open and monitor any necessary files of the cgroup
+(such as current limits, usage and usage percentages and even when
+notification occurs). This task can also operate outside of the cgroup,
+so its memory usage is not charged to the cgroup.
+
+4. Design
+
+The memory threshold notification is a configurable extension to the
+existing Memory Resource Controller, which operates as described to
+track and manage the memory of the Control Group. The Memory Resource
+Controller will still continue to reclaim memory under pressure
+of the limits, and may OOM kill tasks within the cgroup according to
+the OOM Killer configuration.
+
+The memory notification threshold was chosen as a number of bytes of the
+memory not in use so the cgroup parameters may continue to be dynamically
+modified without the need to modify the notification parameters.
+Otherwise, the notification threshold would have to also be computed
+and modified on any Memory Resource Controller operating parameter change.
+
+The cgroup file semantics are not well suited for this type of notification
+mechanism. While applications may choose to simply poll the current
+usage at their convenience, it was also desired to have a notification
+event that would trigger when the usage attained the threshold. The
+blocking read() was chosen, as it is the only current useful method.
+This presented the problems of "out of band" notification, when you want
+to return some exceptional status other than reaching the notification
+threshold. In the cases listed above, the read() on the
+memory.notify_threshold_lowait file will not block and return "0" for
+the remaining size. When this occurs, the thread must determine if the task
+has moved to a new cgroup or if the cgroup has been destroyed. Due to
+the usage model of this cgroup, neither is likely to happen during normal
+operation of a product.
+
+Dan Malek <[email protected]>
+Embedded Alley Solutions, Inc.
+6 July 2009
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e46a073..78205a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -118,6 +118,27 @@ static inline bool mem_cgroup_disabled(void)

extern bool mem_cgroup_oom_called(struct task_struct *task);
void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
+
+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
+ unsigned long long usage, unsigned long long limit);
+void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
+ unsigned long long newlimit);
+void mem_cgroup_notify_move_task(struct cgroup *old_cont);
+#else
+static inline void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
+ unsigned long long usage, unsigned long long limit)
+{
+}
+static inline void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
+ unsigned long long newlimit)
+{
+}
+static inline void mem_cgroup_notify_move_task(struct cgroup *old_cont)
+{
+}
+#endif
+
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

diff --git a/init/Kconfig b/init/Kconfig
index 1ce05a4..fb2f7d5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -594,6 +594,15 @@ config CGROUP_MEM_RES_CTLR
This config option also selects MM_OWNER config option, which
could in turn add some fork/exit overhead.

+config CGROUP_MEM_NOTIFY
+ bool "Memory Usage Limit Notification"
+ depends on CGROUP_MEM_RES_CTLR
+ help
+ Provides a memory notification when usage reaches a preset limit.
+ It is an extension to the memory resource controller, since it
+ uses the memory usage accounting of the cgroup to test against
+ the notification limit. (See Documentation/cgroups/mem_notify.txt)
+
config CGROUP_MEM_RES_CTLR_SWAP
bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2fa20d..cf04279 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6,6 +6,10 @@
* Copyright 2007 OpenVZ SWsoft Inc
* Author: Pavel Emelianov <[email protected]>
*
+ * Memory Limit Notification update
+ * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
+ * Author: Dan Malek <[email protected]>
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -180,6 +184,11 @@ struct mem_cgroup {
/* set when res.limit == memsw.limit */
bool memsw_is_minimum;

+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+ unsigned long long notify_threshold_bytes;
+ wait_queue_head_t notify_threshold_wait;
+#endif
+
/*
* statistics. This must be placed at the end of memcg.
*/
@@ -995,6 +1004,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,

VM_BUG_ON(css_is_removed(&mem->css));

+ /*
+ * We check on the way in so we don't have to duplicate code
+ * in both the normal and error exit path.
+ */
+ mem_cgroup_notify_test_and_wakeup(mem, mem->res.usage + PAGE_SIZE,
+ mem->res.limit);
+
while (1) {
int ret;
bool noswap = false;
@@ -1744,6 +1760,12 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
u64 curusage, oldusage;

/*
+ * Test and notify ahead of the necessity to free pages, as
+ * applications giving up pages may help this reclaim procedure.
+ */
+ mem_cgroup_notify_new_limit(memcg, val);
+
+ /*
* For keeping hierarchical_reclaim simple, how long we should retry
* is depends on callers. We set our retry-count to be function
* of # of children which we should visit in this loop.
@@ -2308,6 +2330,139 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
return 0;
}

+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+/*
+ * Check if a task exceeded notification threshold set for a memory cgroup.
+ * Wake up waiting notification threads, if any.
+ */
+void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
+ unsigned long long usage,
+ unsigned long long limit)
+{
+ if (unlikely(usage == RESOURCE_MAX))
+ return;
+
+ if ((limit - usage <= mcg->notify_threshold_bytes) &&
+ waitqueue_active(&mcg->notify_threshold_wait))
+ wake_up(&mcg->notify_threshold_wait);
+}
+/*
+ * Check if current notification threshold exceeds new memory usage
+ * limit set for a memory cgroup. If so, set threshold to zero to
+ * notify tasks in the group when maximal memory usage is achieved.
+ */
+void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
+ unsigned long long newlimit)
+{
+ if (newlimit <= mcg->notify_threshold_bytes)
+ mcg->notify_threshold_bytes = 0;
+
+ mem_cgroup_notify_test_and_wakeup(mcg, mcg->res.usage, newlimit);
+}
+
+static u64 mem_cgroup_notify_threshold_read(struct cgroup *cgrp,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ return memcg->notify_threshold_bytes;
+}
+
+static int mem_cgroup_notify_threshold_write(struct cgroup *cgrp,
+ struct cftype *cft,
+ const char *buffer)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ unsigned long long val;
+ int ret;
+
+ /* This function does all necessary parse...reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ return ret;
+
+ /* Threshold must be lower than usage limit */
+ if (val >= memcg->res.limit)
+ return -EINVAL;
+
+ memcg->notify_threshold_bytes = val;
+
+ /* Check to see if the new threshold should cause notification */
+ mem_cgroup_notify_test_and_wakeup(memcg, memcg->res.usage,
+ memcg->res.limit);
+
+ return 0;
+}
+
+static u64 mem_cgroup_notify_available_read(struct cgroup *cgrp,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ return memcg->res.limit - memcg->res.usage;
+}
+
+static u64 mem_cgroup_notify_threshold_lowait(struct cgroup *cgrp,
+ struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ unsigned long long available_bytes;
+ DEFINE_WAIT(notify_lowait);
+
+ /*
+ * A memory resource usage of zero is a special case that
+ * causes us not to sleep. It normally happens when the
+ * cgroup is about to be destroyed, and we don't want someone
+ * trying to sleep on a queue that is about to go away. This
+ * condition can also be forced as part of testing.
+ */
+ available_bytes = mem->res.limit - mem->res.usage;
+ if (likely(mem->res.usage != 0)) {
+
+ prepare_to_wait(&mem->notify_threshold_wait, &notify_lowait,
+ TASK_INTERRUPTIBLE);
+
+ if (available_bytes > mem->notify_threshold_bytes)
+ schedule();
+
+ available_bytes = mem->res.limit - mem->res.usage;
+
+ finish_wait(&mem->notify_threshold_wait, &notify_lowait);
+ }
+
+ return available_bytes;
+}
+
+/*
+ * This is used to wake up all threads that may be hanging
+ * out waiting for a low memory condition prior to that happening.
+ * Useful for triggering the event to assist with debug of applications.
+ */
+static int mem_cgroup_notify_threshold_wake_em_up(struct cgroup *cgrp,
+ unsigned int event)
+{
+ struct mem_cgroup *mem;
+
+ mem = mem_cgroup_from_cont(cgrp);
+ wake_up(&mem->notify_threshold_wait);
+ return 0;
+}
+
+/*
+ * We wake up all notification threads any time a migration takes
+ * place. They will have to check to see if a move is needed to
+ * a new cgroup file to wait for notification.
+ * This isn't so much a task move as it is an attach. A thread not
+ * a child of an existing task won't have a valid parent, which
+ * is necessary to test because it won't have a valid mem_cgroup
+ * either. Which further means it won't have a proper wait queue
+ * and we can't do a wakeup.
+ */
+void mem_cgroup_notify_move_task(struct cgroup *old_cont)
+{
+ if (old_cont->parent != NULL)
+ mem_cgroup_notify_threshold_wake_em_up(old_cont, 0);
+}
+#endif /* CONFIG_CGROUP_MEM_NOTIFY */
+

static struct cftype mem_cgroup_files[] = {
{
@@ -2351,6 +2506,22 @@ static struct cftype mem_cgroup_files[] = {
.read_u64 = mem_cgroup_swappiness_read,
.write_u64 = mem_cgroup_swappiness_write,
},
+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+ {
+ .name = "notify_threshold_in_bytes",
+ .write_string = mem_cgroup_notify_threshold_write,
+ .read_u64 = mem_cgroup_notify_threshold_read,
+ },
+ {
+ .name = "notify_available_in_bytes",
+ .read_u64 = mem_cgroup_notify_available_read,
+ },
+ {
+ .name = "notify_threshold_lowait",
+ .trigger = mem_cgroup_notify_threshold_wake_em_up,
+ .read_u64 = mem_cgroup_notify_threshold_lowait,
+ },
+#endif
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -2554,6 +2725,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mem->last_scanned_child = 0;
spin_lock_init(&mem->reclaim_param_lock);

+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+ init_waitqueue_head(&mem->notify_threshold_wait);
+ mem->notify_threshold_bytes = 0;
+#endif
+
if (parent)
mem->swappiness = get_swappiness(parent);
atomic_set(&mem->refcnt, 1);
@@ -2597,6 +2773,8 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
struct cgroup *old_cont,
struct task_struct *p)
{
+ mem_cgroup_notify_move_task(old_cont);
+
mutex_lock(&memcg_tasklist);
/*
* FIXME: It's better to move charges of this process from old
--
1.5.6.3
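
The usage described in section 1.1 of mem_notify.txt above can be sketched in
user space roughly as follows. This is a minimal sketch, not code from the
patch: the helper names are hypothetical, the cgroup directory is supplied by
the caller, and the file names and blocking-read behavior are taken from the
documentation in this patch. Since the read may also return for out-of-band
reasons (debug write, task move, cgroup removal), the caller is assumed to
re-check its cgroup after every wake-up.

```c
#include <stdio.h>

/* Read one unsigned decimal value from a cgroup control file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_cgroup_u64(const char *path, unsigned long long *val)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%llu", val) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Block on memory.notify_threshold_lowait, then compare the returned
 * available bytes with memory.notify_threshold_in_bytes.
 * Returns 1 if available memory is at or below the threshold,
 * 0 if the read returned for some other reason, -1 on error. */
static int wait_for_low_memory(const char *cgroup_dir,
                               unsigned long long *available)
{
    char path[4096];
    unsigned long long threshold;

    snprintf(path, sizeof(path), "%s/memory.notify_threshold_lowait",
             cgroup_dir);
    if (read_cgroup_u64(path, available) != 0)
        return -1;

    snprintf(path, sizeof(path), "%s/memory.notify_threshold_in_bytes",
             cgroup_dir);
    if (read_cgroup_u64(path, &threshold) != 0)
        return -1;

    return *available <= threshold ? 1 : 0;
}
```

A notification thread would call wait_for_low_memory() in a loop and, on a
return of 1, ask the rest of the application to release memory.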

2009-07-08 00:58:18

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

A few comments. Adding [email protected] to the CC list would make it easier to
find this thread in the next post.

On Tue, 7 Jul 2009 13:25:10 -0700
Vladislav Buzov <[email protected]> wrote:

> This patch updates the Memory Controller cgroup to add
> a configurable memory usage limit notification. The feature
> was presented at the April 2009 Embedded Linux Conference.
>
> Signed-off-by: Dan Malek <[email protected]>
> Signed-off-by: Vladislav Buzov <[email protected]>
> ---
> Documentation/cgroups/mem_notify.txt | 140 ++++++++++++++++++++++++++
> include/linux/memcontrol.h | 21 ++++
> init/Kconfig | 9 ++
> mm/memcontrol.c | 178 ++++++++++++++++++++++++++++++++++
> 4 files changed, 348 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/cgroups/mem_notify.txt
>
> diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
> new file mode 100644
> index 0000000..b4f20d0
> --- /dev/null
> +++ b/Documentation/cgroups/mem_notify.txt
> @@ -0,0 +1,140 @@
> +
> +Memory Limit Notification
> +
> +Attempts have been made in the past to provide a mechanism for
> +the notification to processes (task, an address space) when memory
> +usage is approaching a high limit. The intention is that it gives
> +the application an opportunity to release some memory and continue
> +operation rather than be OOM killed. The CE Linux Forum requested
> +a more contemporary implementation, and this is the result.
> +
> +The memory threshold notification is a configurable extension to the
> +existing Memory Resource Controller. Please read memory.txt in this
> +directory to understand its operation before continuing here.
> +
> +1. Operation
> +
> +When a kernel is configured with CGROUP_MEM_NOTIFY, three additional
> +files will appear in the memory resource controller:
> +
> + memory.notify_threshold_in_bytes
> + memory.notify_available_in_bytes
> + memory.notify_threshold_lowait
> +
> +The notification is based upon reaching a threshold below the memory
> +resource controller limit (memory.limit_in_bytes). The threshold
> +represents the minimal number of bytes that should be available under
> +the limit. When the controller group is created, the threshold is set
> +to zero which triggers notification when the memory resource controller
> +limit is reached.
> +
> +The threshold may be set by writing to memory.notify_threshold_in_bytes,
> +such as:
> +
> + echo 10M > memory.notify_threshold_in_bytes
> +
> +The current number of available bytes may be read at any time from
> +the memory.notify_available_in_bytes
> +
> +The memory.notify_threshold_lowait is a blocking read file. The read will
> +block until one of four conditions occurs:
> +
> + - The amount of available memory is equal or less than the threshold
> + defined in memory.notify_threshold_in_bytes
> + - The memory.notify_threshold_lowait file is written with any value (debug)
> + - A thread is moved to another controller group
> + - The cgroup is destroyed or forced empty (memory.force_empty)
> +

I don't think notify_available_in_bytes is necessary.

To make this kind of threshold useful, I think some relaxing margin would be good.
For example: once triggered, "notify" will not be triggered again in the next 1ms.
Do you have any ideas?
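
One way to picture such a margin is a simple rate limiter in front of the
wake-up. The sketch below is an assumption-laden illustration, not part of the
patch: the struct and function names are hypothetical, time is passed in as
nanoseconds by the caller, and the 1ms interval comes from the example above.

```c
#include <stdbool.h>
#include <stdint.h>

/* State for suppressing repeated notifications (hypothetical). */
struct notify_ratelimit {
    uint64_t min_interval_ns; /* e.g. 1000000 for 1ms */
    uint64_t last_fire_ns;    /* time of the last delivered notification */
    bool     fired_once;      /* false until the first notification */
};

/* Returns true if a notification may be delivered at now_ns and records
 * it; returns false while still inside the quiet interval. */
static bool notify_allowed(struct notify_ratelimit *rl, uint64_t now_ns)
{
    if (rl->fired_once && now_ns - rl->last_fire_ns < rl->min_interval_ns)
        return false;

    rl->last_fire_ns = now_ns;
    rl->fired_once = true;
    return true;
}
```

The first event always passes; later events are dropped until the interval
has elapsed since the last delivered one.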

I know people like to wait on a file descriptor to get notifications these days.
Can't we have an "event" file descriptor in the cgroup layer and make it reusable for
other purposes?

> +
> +1.1 Example Usage
> +
> +An application must be designed to properly take advantage of this
> +memory threshold notification feature. It is a powerful management component
> +of some operating systems and embedded devices that must provide
> +highly available and reliable computing services. The application works
> +in conjunction with information provided by the operating system to
> +control limited resource usage. Since many programmers still think
> +memory is infinite and never check the return value from malloc(), it
> +may come as a surprise that such mechanisms have been utilized long ago.
> +
> +A typical application will be multithreaded, with one thread either
> +polling or waiting for the notification event. When the event occurs,
> +the thread will take whatever action is appropriate within the application
> +design. This could be actually running a garbage collection algorithm
> +or to simply signal other processing threads they must do something to
> +reduce their memory usage. The notification thread will then be required
> +to poll the actual usage until the low limit of its choosing is met,
> +at which time the reclaim of memory can stop and the notification thread
> +will wait for the next event.
> +
> +Internally, the application only needs to
> +fopen("memory.notify_available_in_bytes" ..) or
> +fopen("memory.notify_threshold_lowait" ...), then either poll the former
> +file or block read on the latter file using fread() or fscanf() as desired.
> +Comparing the value returned from either of these read function with the
> +value obtained by reading memory.notify_threshold_in_bytes will be an
> +indication of the amount of memory used over the threshold limit.
> +

I hope this application will not block rmdir() ;)



> +2. Configuration
> +
> +Follow the instructions in memory.txt for the configuration and usage of
> +the Memory Resource Controller cgroup. Once this is created and tasks
> +assigned, use the memory threshold notification as described here.
> +
> +The only action that is needed outside of the application waiting or polling
> +is to set the memory.notify_threshold_in_bytes. To set a notification to occur
> +when memory usage of the cgroup reaches or exceeds 1 MByte below the limit
> +can be simply done:
> +
> + echo 1M > memory.notify_threshold_in_bytes
> +
> +This value may be read or changed at any time. Writing a higher value once
> +the Memory Resource Controller is in operation may trigger immediate
> +notification if the usage is above the new threshold.
> +

One question is how this works under hierarchical accounting.

Consider the following:

/cgroup/A/                no thresh
          001/            thresh=5M
              John        thresh=1M
          002/            no thresh
              Hiroyuki    no thresh

If Hiroyuki uses too much and hits /cgroup/A's limit, memory will be reclaimed from all of
A, 001, John, 002, and Hiroyuki, and the OOM killer may kill processes in John.
But 001/John's notifier will not fire. Right?
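
The difference in question can be modeled with a small C sketch. Everything
here is hypothetical and simplified: struct cg is not the kernel's structure,
fires_local() reflects one reading of the posted patch (only the group's own
limit and usage are compared), and fires_hierarchical() is just one possible
alternative, using the tightest headroom among the group and its ancestors.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a memory cgroup (hypothetical). */
struct cg {
    const struct cg *parent; /* NULL at the top of the hierarchy */
    uint64_t limit;
    uint64_t usage;          /* includes children under hierarchical accounting */
    uint64_t threshold;      /* 0 means "notify only at the limit" */
};

/* Headroom under this group's own limit, saturating at 0. */
static uint64_t cg_available(const struct cg *g)
{
    return g->usage >= g->limit ? 0 : g->limit - g->usage;
}

/* Local check: only this group's own counters are considered. */
static bool fires_local(const struct cg *g)
{
    return cg_available(g) <= g->threshold;
}

/* Hierarchical check: an ancestor running out of room also counts. */
static bool fires_hierarchical(const struct cg *g)
{
    uint64_t min_avail = cg_available(g);
    const struct cg *p;

    for (p = g->parent; p != NULL; p = p->parent)
        if (cg_available(p) < min_avail)
            min_avail = cg_available(p);

    return min_avail <= g->threshold;
}
```

With A at its limit and John well below his own, fires_local() stays false
for John while fires_hierarchical() becomes true, which is the gap the
question points at.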


> +3. Debug and Testing
> +
> +The design of cgroups makes it easier to perform some debugging or
> +monitoring tasks without modification to the application. For example,
> +a write of any value to memory.notify_threshold_lowait will wake up all
> +threads waiting for notifications regardless of current memory usage.
> +
> +Collecting performance data about the cgroup is also simplified, as
> +no application modifications are necessary. A separate task can be
> +created that will open and monitor any necessary files of the cgroup
> +(such as current limits, usage and usage percentages and even when
> +notification occurs). This task can also operate outside of the cgroup,
> +so its memory usage is not charged to the cgroup.
> +
> +4. Design
> +
> +The memory threshold notification is a configurable extension to the
> +existing Memory Resource Controller, which operates as described to
> +track and manage the memory of the Control Group. The Memory Resource
> +Controller will still continue to reclaim memory under pressure
> +of the limits, and may OOM kill tasks within the cgroup according to
> +the OOM Killer configuration.
> +
> +The memory notification threshold was chosen as a number of bytes of the
> +memory not in use so the cgroup parameters may continue to be dynamically
> +modified without the need to modify the notification parameters.
> +Otherwise, the notification threshold would have to also be computed
> +and modified on any Memory Resource Controller operating parameter change.
> +
> +The cgroup file semantics are not well suited for this type of notification
> +mechanism. While applications may choose to simply poll the current
> +usage at their convenience, it was also desired to have a notification
> +event that would trigger when the usage attained the threshold. The
> +blocking read() was chosen, as it is the only current useful method.
> +This presented the problems of "out of band" notification, when you want
> +to return some exceptional status other than reaching the notification
> +threshold. In the cases listed above, the read() on the
> +memory.notify_threshold_lowait file will not block and return "0" for
> +the remaining size. When this occurs, the thread must determine if the task
> +has moved to a new cgroup or if the cgroup has been destroyed. Due to
> +the usage model of this cgroup, neither is likely to happen during normal
> +operation of a product.
> +
> +Dan Malek <[email protected]>
> +Embedded Alley Solutions, Inc.
> +6 July 2009
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e46a073..78205a3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -118,6 +118,27 @@ static inline bool mem_cgroup_disabled(void)
>
> extern bool mem_cgroup_oom_called(struct task_struct *task);
> void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
> +
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
> + unsigned long long usage, unsigned long long limit);
> +void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
> + unsigned long long newlimit);
> +void mem_cgroup_notify_move_task(struct cgroup *old_cont);
> +#else
> +static inline void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
> + unsigned long long usage, unsigned long long limit)
> +{
> +}
> +static inline void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
> + unsigned long long newlimit)
> +{
> +}
> +static inline void mem_cgroup_notify_move_task(struct cgroup *old_cont)
> +{
> +}
> +#endif
> +
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 1ce05a4..fb2f7d5 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -594,6 +594,15 @@ config CGROUP_MEM_RES_CTLR
> This config option also selects MM_OWNER config option, which
> could in turn add some fork/exit overhead.
>
> +config CGROUP_MEM_NOTIFY
> + bool "Memory Usage Limit Notification"
> + depends on CGROUP_MEM_RES_CTLR
> + help
> + Provides a memory notification when usage reaches a preset limit.
> + It is an extension to the memory resource controller, since it
> + uses the memory usage accounting of the cgroup to test against
> + the notification limit. (See Documentation/cgroups/mem_notify.txt)
> +

I don't think a CONFIG option is necessary. Let this always be used.


> config CGROUP_MEM_RES_CTLR_SWAP
> bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2fa20d..cf04279 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6,6 +6,10 @@
> * Copyright 2007 OpenVZ SWsoft Inc
> * Author: Pavel Emelianov <[email protected]>
> *
> + * Memory Limit Notification update
> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
> + * Author: Dan Malek <[email protected]>
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -180,6 +184,11 @@ struct mem_cgroup {
> /* set when res.limit == memsw.limit */
> bool memsw_is_minimum;
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + unsigned long long notify_threshold_bytes;
> + wait_queue_head_t notify_threshold_wait;
> +#endif
> +
> /*
> * statistics. This must be placed at the end of memcg.
> */
> @@ -995,6 +1004,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>
> VM_BUG_ON(css_is_removed(&mem->css));
>
> + /*
> + * We check on the way in so we don't have to duplicate code
> + * in both the normal and error exit path.
> + */
> + mem_cgroup_notify_test_and_wakeup(mem, mem->res.usage + PAGE_SIZE,
> + mem->res.limit);
> +

Two points.
- Do we have to check this every time we account?
- This will not catch a hierarchical accounting threshold, because it checks
  only the local cgroup, not its ancestors.

I don't want to say this, but you need to add a hook to res_counter itself.
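
To make the second point concrete, here is a minimal user-space sketch (not
kernel code) of a charge that walks up the ancestors, as hierarchical
accounting does, and tests a threshold at every level. The names toy_counter
and toy_charge are made up for illustration; a real version would live in
res_counter as suggested above and would need its locking.

```c
#include <stddef.h>

/* Toy stand-in for a res_counter that also carries a notify threshold. */
struct toy_counter {
	unsigned long long usage;
	unsigned long long limit;
	unsigned long long notify_threshold; /* bytes that should stay free */
	int notified;                        /* set once the threshold is crossed */
	struct toy_counter *parent;          /* NULL at the root */
};

/*
 * Charge 'bytes' to 'c' and to all of its ancestors. Every level whose
 * free space (limit - usage) drops to or below its threshold is flagged.
 * Returns 0 on success, or -1 if some level would exceed its limit, in
 * which case nothing is charged. Assumes usage <= limit at every level.
 */
static int toy_charge(struct toy_counter *c, unsigned long long bytes)
{
	struct toy_counter *p;

	/* First pass: make sure the charge fits at every level. */
	for (p = c; p != NULL; p = p->parent)
		if (bytes > p->limit - p->usage)
			return -1;

	/* Second pass: account and test the threshold at every level. */
	for (p = c; p != NULL; p = p->parent) {
		p->usage += bytes;
		if (p->limit - p->usage <= p->notify_threshold)
			p->notified = 1;
	}
	return 0;
}
```

With a check that looks only at the local group, the flag on the parent in
this model would never be set by a charge made in a child.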


> while (1) {
> int ret;
> bool noswap = false;
> @@ -1744,6 +1760,12 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> u64 curusage, oldusage;
>
> /*
> + * Test and notify ahead of the necessity to free pages, as
> + * applications giving up pages may help this reclaim procedure.
> + */
> + mem_cgroup_notify_new_limit(memcg, val);
> +
> + /*
> * For keeping hierarchical_reclaim simple, how long we should retry
> * is depends on callers. We set our retry-count to be function
> * of # of children which we should visit in this loop.
> @@ -2308,6 +2330,139 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> return 0;
> }
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +/*
> + * Check if a task exceeded notification threshold set for a memory cgroup.
> + * Wake up waiting notification threads, if any.
> + */
> +void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
> + unsigned long long usage,
> + unsigned long long limit)
> +{
> + if (unlikely(usage == RESOURCE_MAX))
> + return;
What does this mean? Can it happen?

> +
> + if ((limit - usage <= mcg->notify_threshold_bytes) &&
> + waitqueue_active(&mcg->notify_threshold_wait))
> + wake_up(&mcg->notify_threshold_wait);
> +}
> +/*
> + * Check if current notification threshold exceeds new memory usage
> + * limit set for a memory cgroup. If so, set threshold to zero to
> + * notify tasks in the group when maximal memory usage is achieved.
> + */
> +void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
> + unsigned long long newlimit)
> +{
> + if (newlimit <= mcg->notify_threshold_bytes)
> + mcg->notify_threshold_bytes = 0;
> +
> + mem_cgroup_notify_test_and_wakeup(mcg, mcg->res.usage, newlimit);
> +}
> +
> +static u64 mem_cgroup_notify_threshold_read(struct cgroup *cgrp,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + return memcg->notify_threshold_bytes;
> +}
> +
> +static int mem_cgroup_notify_threshold_write(struct cgroup *cgrp,
> + struct cftype *cft,
> + const char *buffer)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + unsigned long long val;
> + int ret;
> +
> + /* This function does all necessary parse...reuse it */
> + ret = res_counter_memparse_write_strategy(buffer, &val);
> + if (ret)
> + return ret;
> +
> + /* Threshold must be lower than usage limit */
> + if (val >= memcg->res.limit)
> + return -EINVAL;

If this is true, "set limit" should be checked to guarantee this.
plz allow minus this for avoiding mess.

> +
> + memcg->notify_threshold_bytes = val;
> +
> + /* Check to see if the new threshold should cause notification */
> + mem_cgroup_notify_test_and_wakeup(memcg, memcg->res.usage,
> + memcg->res.limit);
> +
> + return 0;
> +}
> +
> +static u64 mem_cgroup_notify_available_read(struct cgroup *cgrp,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + return memcg->res.limit - memcg->res.usage;
> +}
> +
> +static u64 mem_cgroup_notify_threshold_lowait(struct cgroup *cgrp,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + unsigned long long available_bytes;
> + DEFINE_WAIT(notify_lowait);
> +
> + /*
> + * A memory resource usage of zero is a special case that
> + * causes us not to sleep. It normally happens when the
> + * cgroup is about to be destroyed, and we don't want someone
> + * trying to sleep on a queue that is about to go away. This
> + * condition can also be forced as part of testing.
> + */
> + available_bytes = mem->res.limit - mem->res.usage;
> + if (likely(mem->res.usage != 0)) {
> +
> + prepare_to_wait(&mem->notify_threshold_wait, &notify_lowait,
> + TASK_INTERRUPTIBLE);
> +
> + if (available_bytes > mem->notify_threshold_bytes)
> + schedule();
> +
> + available_bytes = mem->res.limit - mem->res.usage;
> +
> + finish_wait(&mem->notify_threshold_wait, &notify_lowait);
> + }
> +
> + return available_bytes;
> +}
> +
> +/*
> + * This is used to wake up all threads that may be hanging
> + * out waiting for a low memory condition prior to that happening.
> + * Useful for triggering the event to assist with debug of applications.
> + */
> +static int mem_cgroup_notify_threshold_wake_em_up(struct cgroup *cgrp,
> + unsigned int event)
> +{
> + struct mem_cgroup *mem;
> +
> + mem = mem_cgroup_from_cont(cgrp);
> + wake_up(&mem->notify_threshold_wait);
> + return 0;
> +}
> +
> +/*
> + * We wake up all notification threads any time a migration takes
> + * place. They will have to check to see if a move is needed to
> + * a new cgroup file to wait for notification.
> + * This isn't so much a task move as it is an attach. A thread not
> + * a child of an existing task won't have a valid parent, which
> + * is necessary to test because it won't have a valid mem_cgroup
> + * either. Which further means it won't have a proper wait queue
> + * and we can't do a wakeup.
> + */
> +void mem_cgroup_notify_move_task(struct cgroup *old_cont)
> +{
> + if (old_cont->parent != NULL)
> + mem_cgroup_notify_threshold_wake_em_up(old_cont, 0);
> +}
> +#endif /* CONFIG_CGROUP_MEM_NOTIFY */
> +
>

plz call wake_em_up at pre_destroy(), too.

Thanks,
-Kame


> static struct cftype mem_cgroup_files[] = {
> {
> @@ -2351,6 +2506,22 @@ static struct cftype mem_cgroup_files[] = {
> .read_u64 = mem_cgroup_swappiness_read,
> .write_u64 = mem_cgroup_swappiness_write,
> },
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + {
> + .name = "notify_threshold_in_bytes",
> + .write_string = mem_cgroup_notify_threshold_write,
> + .read_u64 = mem_cgroup_notify_threshold_read,
> + },
> + {
> + .name = "notify_available_in_bytes",
> + .read_u64 = mem_cgroup_notify_available_read,
> + },
> + {
> + .name = "notify_threshold_lowait",
> + .trigger = mem_cgroup_notify_threshold_wake_em_up,
> + .read_u64 = mem_cgroup_notify_threshold_lowait,
> + },
> +#endif
> };
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -2554,6 +2725,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> mem->last_scanned_child = 0;
> spin_lock_init(&mem->reclaim_param_lock);
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + init_waitqueue_head(&mem->notify_threshold_wait);
> + mem->notify_threshold_bytes = 0;
> +#endif
> +
> if (parent)
> mem->swappiness = get_swappiness(parent);
> atomic_set(&mem->refcnt, 1);
> @@ -2597,6 +2773,8 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> struct cgroup *old_cont,
> struct task_struct *p)
> {
> + mem_cgroup_notify_move_task(old_cont);
> +
> mutex_lock(&memcg_tasklist);
> /*
> * FIXME: It's better to move charges of this process from old
> --

2009-07-08 03:53:19

by Balbir Singh

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

* Vladislav Buzov <[email protected]> [2009-07-07 13:25:10]:

> This patch updates the Memory Controller cgroup to add
> a configurable memory usage limit notification. The feature
> was presented at the April 2009 Embedded Linux Conference.
>
> Signed-off-by: Dan Malek <[email protected]>
> Signed-off-by: Vladislav Buzov <[email protected]>
> ---
> Documentation/cgroups/mem_notify.txt | 140 ++++++++++++++++++++++++++
> include/linux/memcontrol.h | 21 ++++
> init/Kconfig | 9 ++
> mm/memcontrol.c | 178 ++++++++++++++++++++++++++++++++++
> 4 files changed, 348 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/cgroups/mem_notify.txt
>
> diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
> new file mode 100644
> index 0000000..b4f20d0
> --- /dev/null
> +++ b/Documentation/cgroups/mem_notify.txt
> @@ -0,0 +1,140 @@
> +
> +Memory Limit Notification
> +
> +Attempts have been made in the past to provide a mechanism for
> +the notification to processes (task, an address space) when memory
> +usage is approaching a high limit. The intention is that it gives
> +the application an opportunity to release some memory and continue
> +operation rather than be OOM killed. The CE Linux Forum requested
> +a more contemporary implementation, and this is the result.
> +
> +The memory threshold notification is a configurable extension to the
> +existing Memory Resource Controller. Please read memory.txt in this
> +directory to understand its operation before continuing here.
> +
> +1. Operation
> +
> +When a kernel is configured with CGROUP_MEM_NOTIFY, three additional
> +files will appear in the memory resource controller:
> +
> + memory.notify_threshold_in_bytes
> + memory.notify_available_in_bytes
> + memory.notify_threshold_lowait
> +
> +The notification is based upon reaching a threshold below the memory
> +resource controller limit (memory.limit_in_bytes). The threshold
> +represents the minimal number of bytes that should be available under
> +the limit. When the controller group is created, the threshold is set
> +to zero which triggers notification when the memory resource controller
> +limit is reached.
> +
> +The threshold may be set by writing to memory.notify_threshold_in_bytes,
> +such as:
> +
> + echo 10M > memory.notify_threshold_in_bytes
> +
> +The current number of available bytes may be read at any time from
> +the memory.notify_available_in_bytes file.
> +
> +The memory.notify_threshold_lowait is a blocking read file. The read will
> +block until one of four conditions occurs:
> +
> + - The amount of available memory is equal or less than the threshold
> + defined in memory.notify_threshold_in_bytes
> + - The memory.notify_threshold_lowait file is written with any value (debug)
> + - A thread is moved to another controller group
> + - The cgroup is destroyed or forced empty (memory.force_empty)
> +
> +
> +1.1 Example Usage
> +
> +An application must be designed to properly take advantage of this
> +memory threshold notification feature. It is a powerful management component
> +of some operating systems and embedded devices that must provide
> +highly available and reliable computing services. The application works
> +in conjunction with information provided by the operating system to
> +control limited resource usage. Since many programmers still think
> +memory is infinite and never check the return value from malloc(), it
> +may come as a surprise that such mechanisms have been utilized long ago.
> +
> +A typical application will be multithreaded, with one thread either
> +polling or waiting for the notification event. When the event occurs,
> +the thread will take whatever action is appropriate within the application
> +design. This could be actually running a garbage collection algorithm
> +or to simply signal other processing threads they must do something to
> +reduce their memory usage. The notification thread will then be required
> +to poll the actual usage until the low limit of its choosing is met,
> +at which time the reclaim of memory can stop and the notification thread
> +will wait for the next event.
> +
> +Internally, the application only needs to
> +fopen("memory.notify_available_in_bytes" ..) or
> +fopen("memory.notify_threshold_lowait" ...), then either poll the former
> +file or block read on the latter file using fread() or fscanf() as desired.
> +Comparing the value returned from either of these read functions with the
> +value obtained by reading memory.notify_threshold_in_bytes will be an
> +indication of the amount of memory used over the threshold limit.

Polling is never good (from the power consumption and efficiency
viewpoint), unless by poll you mean select() and waiting on events.
A blocked read requires a dedicated thread; adding select() or some
other notification mechanism allows the software to wait on several
events at the same time.
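
As a rough sketch of what waiting on several events at once looks like from
user space, the snippet below uses poll() on two descriptors. The patch as
posted does not provide a pollable file, so eventfd() is used here purely as
a stand-in event source; the helper names are made up for the example.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create an event source; eventfd stands in for a future cgroup event fd. */
static int event_source_create(void)
{
	return eventfd(0, EFD_NONBLOCK);
}

/* Mark the event source readable. Returns 0 on success, -1 on error. */
static int event_source_signal(int fd)
{
	uint64_t one = 1;

	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Wait up to timeout_ms for either descriptor to become readable.
 * Returns the index (0 or 1) of the first readable descriptor,
 * or -1 on timeout or error.
 */
static int wait_any(int fd0, int fd1, int timeout_ms)
{
	struct pollfd fds[2] = {
		{ .fd = fd0, .events = POLLIN },
		{ .fd = fd1, .events = POLLIN },
	};
	int i;

	if (poll(fds, 2, timeout_ms) <= 0)
		return -1;

	for (i = 0; i < 2; i++)
		if (fds[i].revents & POLLIN)
			return i;
	return -1;
}
```

A single thread can then sleep in wait_any() on a memory event and, say, a
socket, instead of dedicating one thread to a blocking read.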

> +
> +2. Configuration
> +
> +Follow the instructions in memory.txt for the configuration and usage of
> +the Memory Resource Controller cgroup. Once this is created and tasks
> +assigned, use the memory threshold notification as described here.
> +
> +The only action that is needed outside of the application waiting or polling
> +is to set the memory.notify_threshold_in_bytes. To set a notification to occur
> +when memory usage of the cgroup reaches or exceeds 1 MByte below the limit
> +can be simply done:
> +
> + echo 1M > memory.notify_threshold_in_bytes
> +
> +This value may be read or changed at any time. Writing a higher value once
> +the Memory Resource Controller is in operation may trigger immediate
> +notification if the usage is above the new threshold.
> +
> +3. Debug and Testing
> +
> +The design of cgroups makes it easier to perform some debugging or
> +monitoring tasks without modification to the application. For example,
> +a write of any value to memory.notify_threshold_lowait will wake up all
> +threads waiting for notifications regardless of current memory usage.
> +
> +Collecting performance data about the cgroup is also simplified, as
> +no application modifications are necessary. A separate task can be
> +created that will open and monitor any necessary files of the cgroup
> +(such as current limits, usage and usage percentages and even when
> +notification occurs). This task can also operate outside of the cgroup,
> +so its memory usage is not charged to the cgroup.
> +
> +4. Design
> +
> +The memory threshold notification is a configurable extension to the
> +existing Memory Resource Controller, which operates as described to
> +track and manage the memory of the Control Group. The Memory Resource
> +Controller will still continue to reclaim memory under pressure
> +of the limits, and may OOM kill tasks within the cgroup according to
> +the OOM Killer configuration.
> +
> +The memory notification threshold was chosen as a number of bytes of the
> +memory not in use so the cgroup parameters may continue to be dynamically

Could you clarify the meaning of "not in use"?

> +modified without the need to modify the notification parameters.
> +Otherwise, the notification threshold would have to also be computed
> +and modified on any Memory Resource Controller operating parameter change.
> +
> +The cgroup file semantics are not well suited for this type of notification
> +mechanism. While applications may choose to simply poll the current
> +usage at their convenience, it was also desired to have a notification
> +event that would trigger when the usage attained the threshold. The
> +blocking read() was chosen, as it is the only current useful method.

Could you please elaborate further, why would other mechanisms not
work? Hint: please see cgroupstats.

> +This presented the problems of "out of band" notification, when you want
> +to return some exceptional status other than reaching the notification
> +threshold. In the cases listed above, the read() on the
> +memory.notify_threshold_lowait file will not block and return "0" for
> +the remaining size. When this occurs, the thread must determine if the task
> +has moved to a new cgroup or if the cgroup has been destroyed. Due to
> +the usage model of this cgroup, neither is likely to happen during normal
> +operation of a product.
> +
> +Dan Malek <[email protected]>
> +Embedded Alley Solutions, Inc.
> +6 July 2009
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e46a073..78205a3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -118,6 +118,27 @@ static inline bool mem_cgroup_disabled(void)
>
> extern bool mem_cgroup_oom_called(struct task_struct *task);
> void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
> +
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
> + unsigned long long usage, unsigned long long limit);
> +void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
> + unsigned long long newlimit);
> +void mem_cgroup_notify_move_task(struct cgroup *old_cont);
> +#else
> +static inline void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
> + unsigned long long usage, unsigned long long limit)
> +{
> +}
> +static inline void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
> + unsigned long long newlimit)
> +{
> +}
> +static inline void mem_cgroup_notify_move_task(struct cgroup *old_cont)
> +{
> +}
> +#endif
> +
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 1ce05a4..fb2f7d5 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -594,6 +594,15 @@ config CGROUP_MEM_RES_CTLR
> This config option also selects MM_OWNER config option, which
> could in turn add some fork/exit overhead.
>
> +config CGROUP_MEM_NOTIFY
> + bool "Memory Usage Limit Notification"
> + depends on CGROUP_MEM_RES_CTLR
> + help
> + Provides a memory notification when usage reaches a preset limit.
> + It is an extension to the memory resource controller, since it
> + uses the memory usage accounting of the cgroup to test against
> + the notification limit. (See Documentation/cgroups/mem_notify.txt)
> +
> config CGROUP_MEM_RES_CTLR_SWAP
> bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2fa20d..cf04279 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6,6 +6,10 @@
> * Copyright 2007 OpenVZ SWsoft Inc
> * Author: Pavel Emelianov <[email protected]>
> *
> + * Memory Limit Notification update
> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
> + * Author: Dan Malek <[email protected]>
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -180,6 +184,11 @@ struct mem_cgroup {
> /* set when res.limit == memsw.limit */
> bool memsw_is_minimum;
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + unsigned long long notify_threshold_bytes;
> + wait_queue_head_t notify_threshold_wait;
> +#endif
> +
> /*
> * statistics. This must be placed at the end of memcg.
> */
> @@ -995,6 +1004,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>
> VM_BUG_ON(css_is_removed(&mem->css));
>
> + /*
> + * We check on the way in so we don't have to duplicate code
> + * in both the normal and error exit path.
> + */
> + mem_cgroup_notify_test_and_wakeup(mem, mem->res.usage + PAGE_SIZE,
> + mem->res.limit);
> +

I don't think it is a good idea to directly read out mem->res.*
without any protection

> while (1) {
> int ret;
> bool noswap = false;
> @@ -1744,6 +1760,12 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> u64 curusage, oldusage;
>
> /*
> + * Test and notify ahead of the necessity to free pages, as
> + * applications giving up pages may help this reclaim procedure.
> + */
> + mem_cgroup_notify_new_limit(memcg, val);
> +
> + /*
> * For keeping hierarchical_reclaim simple, how long we should retry
> * is depends on callers. We set our retry-count to be function
> * of # of children which we should visit in this loop.
> @@ -2308,6 +2330,139 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> return 0;
> }
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +/*
> + * Check if a task exceeded notification threshold set for a memory cgroup.
> + * Wake up waiting notification threads, if any.
> + */
> +void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,

Could you please use mem or memcg, since we've been using that as a
standard convention in our code.

> + unsigned long long usage,
> + unsigned long long limit)
> +{
> + if (unlikely(usage == RESOURCE_MAX))

I don't think it is a good idea to use unlikely since it is always
likely for root to be at RESOURCE_MAX. Using likely/unlikely on user
parameters IMHO is not a good idea.

> + return;
> +
> + if ((limit - usage <= mcg->notify_threshold_bytes) &&
> + waitqueue_active(&mcg->notify_threshold_wait))
> + wake_up(&mcg->notify_threshold_wait);
> +}
> +/*
> + * Check if current notification threshold exceeds new memory usage
> + * limit set for a memory cgroup. If so, set threshold to zero to
> + * notify tasks in the group when maximal memory usage is achieved.
> + */
> +void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
> + unsigned long long newlimit)
> +{
> + if (newlimit <= mcg->notify_threshold_bytes)
> + mcg->notify_threshold_bytes = 0;
> +
> + mem_cgroup_notify_test_and_wakeup(mcg, mcg->res.usage, newlimit);
> +}


Again, I am confused about the mutual exclusion: what protects the new
values being added?

> +
> +static u64 mem_cgroup_notify_threshold_read(struct cgroup *cgrp,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + return memcg->notify_threshold_bytes;
> +}
> +
> +static int mem_cgroup_notify_threshold_write(struct cgroup *cgrp,
> + struct cftype *cft,
> + const char *buffer)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + unsigned long long val;
> + int ret;
> +
> + /* This function does all necessary parse...reuse it */
> + ret = res_counter_memparse_write_strategy(buffer, &val);
> + if (ret)
> + return ret;
> +
> + /* Threshold must be lower than usage limit */
> + if (val >= memcg->res.limit)
> + return -EINVAL;
> +
> + memcg->notify_threshold_bytes = val;
> +
> + /* Check to see if the new threshold should cause notification */
> + mem_cgroup_notify_test_and_wakeup(memcg, memcg->res.usage,
> + memcg->res.limit);
> +
> + return 0;
> +}
> +
> +static u64 mem_cgroup_notify_available_read(struct cgroup *cgrp,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + return memcg->res.limit - memcg->res.usage;
> +}

Please use res_counter abstractions to read mem->res values

> +
> +static u64 mem_cgroup_notify_threshold_lowait(struct cgroup *cgrp,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + unsigned long long available_bytes;
> + DEFINE_WAIT(notify_lowait);
> +
> + /*
> + * A memory resource usage of zero is a special case that
> + * causes us not to sleep. It normally happens when the
> + * cgroup is about to be destroyed, and we don't want someone
> + * trying to sleep on a queue that is about to go away. This
> + * condition can also be forced as part of testing.
> + */
> + available_bytes = mem->res.limit - mem->res.usage;
> + if (likely(mem->res.usage != 0)) {
> +
> + prepare_to_wait(&mem->notify_threshold_wait, &notify_lowait,
> + TASK_INTERRUPTIBLE);
> +
> + if (available_bytes > mem->notify_threshold_bytes)
> + schedule();
> +
> + available_bytes = mem->res.limit - mem->res.usage;
> +
> + finish_wait(&mem->notify_threshold_wait, &notify_lowait);
> + }
> +
> + return available_bytes;
> +}
> +
> +/*
> + * This is used to wake up all threads that may be hanging
> + * out waiting for a low memory condition prior to that happening.
> + * Useful for triggering the event to assist with debug of applications.
> + */
> +static int mem_cgroup_notify_threshold_wake_em_up(struct cgroup *cgrp,
> + unsigned int event)
> +{
> + struct mem_cgroup *mem;
> +
> + mem = mem_cgroup_from_cont(cgrp);
> + wake_up(&mem->notify_threshold_wait);
> + return 0;
> +}
> +
> +/*
> + * We wake up all notification threads any time a migration takes
> + * place. They will have to check to see if a move is needed to
> + * a new cgroup file to wait for notification.
> + * This isn't so much a task move as it is an attach. A thread not
> + * a child of an existing task won't have a valid parent, which
> + * is necessary to test because it won't have a valid mem_cgroup
> + * either. Which further means it won't have a proper wait queue
> + * and we can't do a wakeup.
> + */
> +void mem_cgroup_notify_move_task(struct cgroup *old_cont)
> +{
> + if (old_cont->parent != NULL)
> + mem_cgroup_notify_threshold_wake_em_up(old_cont, 0);
> +}
> +#endif /* CONFIG_CGROUP_MEM_NOTIFY */
> +
>
> static struct cftype mem_cgroup_files[] = {
> {
> @@ -2351,6 +2506,22 @@ static struct cftype mem_cgroup_files[] = {
> .read_u64 = mem_cgroup_swappiness_read,
> .write_u64 = mem_cgroup_swappiness_write,
> },
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + {
> + .name = "notify_threshold_in_bytes",
> + .write_string = mem_cgroup_notify_threshold_write,
> + .read_u64 = mem_cgroup_notify_threshold_read,
> + },
> + {
> + .name = "notify_available_in_bytes",
> + .read_u64 = mem_cgroup_notify_available_read,
> + },
> + {
> + .name = "notify_threshold_lowait",
> + .trigger = mem_cgroup_notify_threshold_wake_em_up,
> + .read_u64 = mem_cgroup_notify_threshold_lowait,
> + },
> +#endif
> };
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -2554,6 +2725,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> mem->last_scanned_child = 0;
> spin_lock_init(&mem->reclaim_param_lock);
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + init_waitqueue_head(&mem->notify_threshold_wait);
> + mem->notify_threshold_bytes = 0;
> +#endif
> +
> if (parent)
> mem->swappiness = get_swappiness(parent);
> atomic_set(&mem->refcnt, 1);
> @@ -2597,6 +2773,8 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> struct cgroup *old_cont,
> struct task_struct *p)
> {
> + mem_cgroup_notify_move_task(old_cont);
> +
> mutex_lock(&memcg_tasklist);
> /*
> * FIXME: It's better to move charges of this process from old
> --
> 1.5.6.3
>

--
Balbir

2009-07-09 01:44:10

by Vladislav D. Buzov

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

KAMEZAWA Hiroyuki wrote:
> I don't think notify_available_in_bytes is necessary.
>
I agree. This was a replacement for the old percentage calculation that
was harder for the application to resolve. I'll remove it and update the
example to use the other available memory controller information.

> For making this kind of threshold useful, I think some relaxing margin is good.
> For example: once triggered, "notify" will not be triggered again in the next 1ms.
> Do you have an idea?
>
There isn't any time attribute associated with this model. There is no
"trigger," just that you don't sleep if the threshold is exceeded.

The notification only happens if you are asking for it. One application
implementation could be that you just respond to notifications. If one
occurs, you will free some memory, then wait for another notification.
If you didn't free enough memory, the notification just keeps occurring
as you ask until the situation is resolved.
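
A minimal C sketch of that loop, for illustration only and not part of
the patch. The path passed in, the release callback, and all helper names
are assumptions; it relies only on the blocking read of the
notify_threshold_lowait file described in the patch documentation.

```c
#include <stdio.h>

/* Hypothetical application hook, called once per notification. */
typedef void (*release_fn)(unsigned long long usage);

/* Example hook used for illustration: it only records what it saw. */
int release_calls;
unsigned long long last_usage;

void count_release(unsigned long long usage)
{
	last_usage = usage;
	release_calls++;
}

/*
 * Read one unsigned value from a cgroup control file. On the lowait
 * file the read is expected to block in the kernel until the waiter
 * is woken. Returns 0 on success, -1 on failure.
 */
int read_cgroup_u64(const char *path, unsigned long long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Handle up to max_events notifications: block, release memory, then
 * ask again. Returns the number of notifications handled.
 */
int notify_loop(const char *lowait_path, release_fn release, int max_events)
{
	unsigned long long usage;
	int handled = 0;

	while (handled < max_events) {
		if (read_cgroup_u64(lowait_path, &usage) != 0)
			break;
		release(usage);
		handled++;
	}
	return handled;
}
```

If the callback does not free enough memory, the next read simply returns
again, which is the "keeps occurring as you ask" behavior described above.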

> I know people likes to wait for file descriptor to get notification in these days.
> Can't we have "event" file descriptor in cgroup layer and make it reusable for
> other purposes ?
That's next on the list to implement, and there were some comments in
previous messages. I just didn't want to complicate providing this
notification feature by having to also implement an "event" descriptor.
I'm certain that will cause much discussion as well. :-)

>
> I hope this application will not block rmdir() ;)
>
No, because there are no blocking reads (the wait continues to return)
when the cgroup is being destroyed.

> One question is how this works under hierarchical accounting.
>
> Considering following.
>
> /cgroup/A/ no thresh
> 001/ thresh=5M
> John thresh=1M
> 002/ no thresh
> Hiroyuki no thresh
>
> If Hiroyuki use too much and hit /cgroup/A's limit, memory will be reclaimed from all
> A,001,John,002,Hiroyuki and OOM Killer may kill processes in John.
> But 001/John's notifier will not fire. Right ?
>
The applications in 001/John will not be notified, since everything in
that child cgroup is OK. This is based on the accounting behavior of the
memory cgroup. If you want notification at the parent, you need to
create a thread to catch that condition at the parent level. When that
occurs, there is a mechanism to notify the children by just writing the
notify_threshold_lowait file. Your applications need to be designed to
identify this condition (or simply always free some resources when
notified) for this to work.
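
As a rough sketch of the parent-level wakeup (an illustration, not code
from the patch): the helper name and the path argument are hypothetical,
and it assumes only that a write of any value to a child's
notify_threshold_lowait file wakes its waiters.

```c
#include <stdio.h>

/*
 * Wake every thread blocked on the given notify_threshold_lowait file
 * by writing an arbitrary value to it.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int wake_cgroup_waiters(const char *lowait_path)
{
	FILE *f = fopen(lowait_path, "w");
	int ret = 0;

	if (!f)
		return -1;
	if (fputs("1\n", f) == EOF)
		ret = -1;
	if (fclose(f) == EOF)
		ret = -1;
	return ret;
}
```

A watcher thread at the parent level could call this once per child
cgroup after its own blocking read returns.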

The OOM killer is an orthogonal discussion. You can select from
available killers that may make the choices you desire, or implement
your own requirements and attach it to the cgroup.

> I don't think CONFIG is necessary. Let this always used.
>
Ok.

> 2 points.
> - Do we have to check this always we account ?
>
What are the options? Every N pages? How to select N?

> - This will not catch hierarchical accounting threshold because this check
> only local cgroup, no ancestors.
>
Right. That was the intention, so I'll need to fix it.

> I don't want to say this but you need to add hook to res_counter itself.
>
I agree, res_counter seems to be the most appropriate place to keep and
track the threshold, just as it already does for the usage and limit.
During a resource charge operation, res_counter can check the usage
against the threshold and, if it is exceeded, call the memory controller
cgroup to notify its tasks.
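
To make the idea concrete, here is a small userspace model of a charge
path with a notifier hook. It is only a sketch: the struct, the function
names, and the counting callback are hypothetical, there is no locking,
and it assumes threshold <= limit and usage <= limit always hold.

```c
#include <stddef.h>

/* Userspace stand-in for a resource counter with a threshold hook. */
struct counter_model {
	unsigned long long usage;
	unsigned long long limit;
	unsigned long long threshold;	/* bytes that should stay free */
	void (*notifier)(struct counter_model *c);
	unsigned int notifications;
};

/* Example callback: a real controller would wake its waiters here. */
void count_notification(struct counter_model *c)
{
	c->notifications++;
}

/*
 * Charge val to the counter. Fails with -1 if the limit would be
 * exceeded. After a successful charge, calls the notifier when the
 * free space (limit - usage) is at or below the threshold.
 */
int model_charge(struct counter_model *c, unsigned long long val)
{
	if (val > c->limit - c->usage)
		return -1;
	c->usage += val;
	if (c->notifier && c->usage >= c->limit - c->threshold)
		c->notifier(c);
	return 0;
}
```

With limit=10 and threshold=2, charges up to a usage of 7 stay silent and
the charge that brings usage to 8 calls the notifier.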

> What this means ?? Can happen ?
>
It means the cgroup was created but no one has yet set any limits on the
cgroup itself. There is no reason to test any conditions for notification.

> If this is true, "set limit" should be checked to guarantee this.
> plz allow minus this for avoiding mess.
Setting the memory controller cgroup limit and the notification
threshold are two separate operations. There isn't any "mess," just some
validation testing for reporting back to the source of the request. When
changing the memory controller limit, we ensure the threshold limit is
never allowed to go "negative." At most, the threshold limit will be equal
to the memory controller cgroup limit. Otherwise, the arithmetic and
conditional tests during the operational part of the software become
more complex, which we don't want.
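
One way to picture that validation is the userspace model below. It is
an assumption-laden sketch, not the patch code: the names are
hypothetical, and clamping the threshold to the new limit is just one
possible policy for keeping threshold <= limit.

```c
/* Userspace model of the limit/threshold pair. */
struct limits_model {
	unsigned long long limit;
	unsigned long long threshold;
};

/* Reject a threshold larger than the limit; report it to the caller. */
int model_set_threshold(struct limits_model *m, unsigned long long threshold)
{
	if (threshold > m->limit)
		return -1;
	m->threshold = threshold;
	return 0;
}

/* Change the limit and clamp the threshold so it never exceeds it. */
void model_set_limit(struct limits_model *m, unsigned long long limit)
{
	m->limit = limit;
	if (m->threshold > limit)
		m->threshold = limit;
}
```

Keeping the invariant at write time means the hot path can compute
limit - threshold without any extra checks.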

> plz call wake_em_up at pre_destroy(), too.
>
Ok.

Thanks,
Vlad.

2009-07-10 18:02:28

by Vladislav D. Buzov

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

Balbir Singh wrote:
> Polling is never good (from the power consumption and efficiency
> view point), unless by poll you mean select() and wait on events.
>
Currently, poll()/select() on a file descriptor is not supported by
cgroups. So for now, it's up to the user application to decide whether to
periodically check memory usage or use a blocking read to wait for
notification.
> Blocked read requires a dedicated thread, adding a select or some
> other notification mechanism allows the software to wait on several
> events at the same time.
>
That's true. This is the next step to be implemented. For now I just
don't want to complicate this notification feature. There is a parallel
discussion about proper threshold implementation, which I'd like to
finish first and then look at possible options for a better notification
mechanism.

> Could you clarify the meaning of "not in use"
>
The threshold represents the minimal number of bytes that should be
available under the memory controller limit before notification occurs.
For example:

limit=10M
threshold=1M
Notification fires when memory usage reaches 9M
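
The same arithmetic as a small C check, given purely as an illustration
(the function name and the MIB macro are not from the patch):

```c
#include <stdbool.h>

#define MIB (1024ULL * 1024ULL)

/*
 * True when the free space under the limit (limit - usage) is at or
 * below the threshold. Written as a comparison against
 * limit - threshold so that usage + threshold cannot overflow.
 */
bool notification_due(unsigned long long usage,
		      unsigned long long limit,
		      unsigned long long threshold)
{
	if (threshold > limit)
		return true;
	return usage >= limit - threshold;
}
```

With limit=10M and threshold=1M this returns true once usage reaches 9M,
and with threshold=0 it returns true only when usage reaches the limit.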

> Could you please elaborate further, why would other mechanisms not
> work? Hint: please see cgroupstats.
>
I'm not saying that other mechanisms (other than the cgroup files) are
not going to work. The cgroup files were chosen to communicate
notifications, and the blocking read is the only useful method there.

>> + /*
>> + * We check on the way in so we don't have to duplicate code
>> + * in both the normal and error exit path.
>> + */
>> + mem_cgroup_notify_test_and_wakeup(mem, mem->res.usage + PAGE_SIZE,
>> + mem->res.limit);
>> +
>>
>
> I don't think it is a good idea to directly read out mem->res.*
> without any protection
>
Why do I need to protect it here? I'm just reading values, not modifying
them. I don't have to worry if the values change after I read them; the
awakened thread will figure it out. Also, if I use the res_counter
interface to read the fields (res_counter_read_u64()), it still doesn't
provide any protection.

However, I understand your concerns about data protection here and
below. In my previous email there was a discussion about moving
threshold support to the res_counter rather than keeping it in the
memory controller cgroup. I like that idea and am currently working on
another patch that adds threshold support to the res_counter. It will
address all concerns about mutually exclusive access.

> Could you please use mem or memcg, since we've been using that as a
> standard convention in our code.
>
Ok.
>
>> + unsigned long long usage,
>> + unsigned long long limit)
>> +{
>> + if (unlikely(usage == RESOURCE_MAX))
>>
>
> I don't think it is a good idea to use unlikely since it is always
> likely for root to be at RESOURCE_MAX. Using likely/unlikely on user
> parameters IMHO is not a good idea.
>
I agree.

> Again, I am confused about the mutual exclusion, what protects the new
> values being added.
>
Ditto.

> Please use res_counter abstractions to read mem->res values
>
Ok.

2009-07-13 00:54:17

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

On Wed, 08 Jul 2009 18:43:48 -0700
"Vladislav D. Buzov" <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:

> > 2 points.
> > - Do we have to check this always we account ?
> >
> What are the options? Every N pages? How to select N?
>
I think you can reuse Balbir's softlimit event counter. (see v9.)


> > If this is true, "set limit" should be checked to guarantee this.
> > plz allow minus this for avoiding mess.
> Setting the memory controller cgroup limit and the notification
> threshold are two separate operations. There isn't any "mess," just some
> validation testing for reporting back to the source of the request. When
> changing the memory controller limit, we ensure the threshold limit is
> never allowed "negative." At most, the threshold limit will be equal the
> memory controller cgroup limit. Otherwise, the arithmetic and
> conditional tests during the operational part of the software becomes
> more complex, which we don't want.
>
Hmm, then, plz this interface put under "set_limit_mutex".

Thanks,
-Kame

2009-07-13 21:21:27

by Vladislav D. Buzov

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

KAMEZAWA Hiroyuki wrote:
> On Wed, 08 Jul 2009 18:43:48 -0700
> "Vladislav D. Buzov" <[email protected]> wrote:
>
>
>> KAMEZAWA Hiroyuki wrote:
>>
>>> 2 points.
>>> - Do we have to check this always we account ?
>>>
>>>
>> What are the options? Every N pages? How to select N?
>>
>>
> I think you can reuse Balbir's softlimit event counter. (see v9.)
>
It still does not answer the question of how to select the number of
events before/between sending the notification.

The idea behind the notification feature is to let user applications
know immediately when a low memory condition occurs (the threshold is
exceeded), so that they can take action to free unused memory before the
OS gets involved (OOM-kill, reclaiming pages).

As far as I understand, the reason you would like to add a delay
between notifications is to give user applications some time to
free memory. This is not required by the design of the notification
feature, because the notification is sent only if someone is listening
for it. A typical application will subscribe to the low-memory
notification, receive it, handle it, and then subscribe again. So, even
if low memory conditions keep occurring in the meantime, the notification
will not be fired. If it happens again after the user application has
freed some memory, the application will be immediately notified.

>
>>> If this is true, "set limit" should be checked to guarantee this.
>>> plz allow minus this for avoiding mess.
>>>
>> Setting the memory controller cgroup limit and the notification
>> threshold are two separate operations. There isn't any "mess," just some
>> validation testing for reporting back to the source of the request. When
>> changing the memory controller limit, we ensure the threshold limit is
>> never allowed "negative." At most, the threshold limit will be equal the
>> memory controller cgroup limit. Otherwise, the arithmetic and
>> conditional tests during the operational part of the software becomes
>> more complex, which we don't want.
>>
>>
> Hmm, then, plz this interface put under "set_limit_mutex".
>
I'm going to send another patch soon that adds the threshold feature to
the Resource Counter. It's going to address all concerns about data
protection.

Thanks,
Vlad.
> Thanks,
> -Kame
>
>

2009-07-13 22:15:54

by Paul Menage

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

On Tue, Jul 7, 2009 at 5:56 PM, KAMEZAWA
Hiroyuki<[email protected]> wrote:
>
> I know people likes to wait for file descriptor to get notification in these days.
> Can't we have "event" file descriptor in cgroup layer and make it reusable for
> other purposes ?

I agree - rather than having to add a separate "wait for value to
cross X threshold" file for each numeric usage value that people might
be concerned about, it would be better to have a generic way to do it
for any file. Given that this is a userspace API, it would be better
to work out at least the generic API first, even if the initial
implementation isn't generic.

Properties that it should support include:

- notification when a value crosses above or below a given threshold
(which would include binary cases such as OOM notification where the
value crosses from "not-OOM" to "OOM")

- independent thresholds for different waiters

- epoll support (by using eventfd?)

- automatic wakeup when a cgroup is removed

- maybe optional wakeup when a thread attach occurs?

- not require more than read permissions on the file containing the
value being monitored

I guess there are a few possible ways this could be exposed to userspace:

1) new ioctl on cgroups files. simple but probably not popular

2) new system call. maybe the cleanest, but involves changing every
arch and is hard to script

3) new per-cgroup file to control these e.g:
- create an eventfd
- open the control file to be monitored
- write the "<event_fd>, <control_fd> <threshold>" to
cgroup.event_control to link them together
flexible and scriptable but maybe a clumsy interface in general
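
A rough sketch of what option 3 might look like from userspace, offered
only as an illustration of the idea. The function names, both path
arguments, and the exact space-separated request format are assumptions;
eventfd(2) itself is an existing Linux call.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Build the request line "<event_fd> <control_fd> <threshold>".
 * Returns the line length, or -1 if buf is too small.
 */
int format_event_request(char *buf, size_t len, int event_fd,
			 int control_fd, unsigned long long threshold)
{
	int n = snprintf(buf, len, "%d %d %llu", event_fd, control_fd,
			 threshold);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Create an eventfd, open the control file to be monitored, and link
 * the two by writing a request to the event control file.
 * Returns the eventfd (to be read or polled by the caller), or -1.
 * Assumption: the kernel keeps its own reference to the control file,
 * so control_fd may be closed after the request is written.
 */
int register_threshold_event(const char *control_path,
			     const char *event_control_path,
			     unsigned long long threshold)
{
	char line[64];
	int efd, cfd, ecfd, n, ret = -1;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;
	cfd = open(control_path, O_RDONLY);
	if (cfd < 0)
		goto out_efd;
	ecfd = open(event_control_path, O_WRONLY);
	if (ecfd < 0)
		goto out_cfd;
	n = format_event_request(line, sizeof(line), efd, cfd, threshold);
	if (n > 0 && write(ecfd, line, (size_t)n) == n)
		ret = efd;
	close(ecfd);
out_cfd:
	close(cfd);
out_efd:
	if (ret < 0)
		close(efd);
	return ret;
}
```

The returned descriptor could then be added to an epoll set alongside
the application's other event sources, and only read access is needed on
the monitored file.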

Paul

2009-07-14 00:16:24

by Vladislav D. Buzov

Subject: [PATCH 0/2] Memory usage limit notification feature (v3)


The following sequence of patches introduces a memory usage limit
notification capability to the Memory Controller cgroup.

This is v3 of the implementation. The major difference from the previous
version is that it is based on a Resource Counter extension that notifies
the Resource Controller when the resource usage reaches or exceeds a
configurable threshold.

TODOs:

1. A more generic notification mechanism supporting different events
would be preferable to creating a dedicated file in the Memory
Controller cgroup.


Thanks,
Vlad.

2009-07-14 00:16:38

by Vladislav D. Buzov

Subject: [PATCH 1/2] Resource usage threshold notification addition to res_counter (v3)

This patch updates the Resource Counter to add a configurable resource usage
threshold notification mechanism.

Signed-off-by: Vladislav Buzov <[email protected]>
Signed-off-by: Dan Malek <[email protected]>
---
Documentation/cgroups/resource_counter.txt | 21 ++++++++-
include/linux/res_counter.h | 69 ++++++++++++++++++++++++++++
kernel/res_counter.c | 7 +++
3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..1369dff 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -39,7 +39,20 @@ to work with it.
The failcnt stands for "failures counter". This is the number of
resource allocation attempts that failed.

- c. spinlock_t lock
+ e. unsigned long long threshold
+
+ The resource usage threshold to notify the resource controller. This is
+ the minimal difference between the resource limit and current usage
+ to fire a notification.
+
+ f. void (*threshold_notifier)(struct res_counter *counter)
+
+ The threshold notification callback installed by the resource
+ controller. Called when the usage reaches or exceeds the threshold.
+ Should be fast and must not sleep, because it is called with interrupts
+ disabled.
+
+ g. spinlock_t lock

Protects changes of the above values.

@@ -140,6 +153,7 @@ counter fields. They are recommended to adhere to the following rules:
usage usage_in_<unit_of_measurement>
max_usage max_usage_in_<unit_of_measurement>
limit limit_in_<unit_of_measurement>
+ threshold notify_threshold_in_<unit_of_measurement>
failcnt failcnt
lock no file :)

@@ -153,9 +167,12 @@ counter fields. They are recommended to adhere to the following rules:
usage prohibited
max_usage reset to usage
limit set the limit
+ threshold set the threshold
failcnt reset to zero

-
+ d. Notification is enabled by installing the threshold notifier callback. It
+ is up to the resource controller to communicate the notification to user
+ space tasks.

5. Usage example

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 511f42f..5ec98d7 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -9,6 +9,11 @@
*
* Author: Pavel Emelianov <[email protected]>
*
+ * Resource usage threshold notification update
+ * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
+ * Author: Dan Malek <[email protected]>
+ * Author: Vladislav Buzov <[email protected]>
+ *
* See Documentation/cgroups/resource_counter.txt for more
* info about what this counter is.
*/
@@ -35,6 +40,19 @@ struct res_counter {
*/
unsigned long long limit;
/*
+ * the resource usage threshold to notify the resource controller. This
+ * is the minimal difference between the resource limit and current
+ * usage to fire a notification.
+ */
+ unsigned long long threshold;
+ /*
+ * the threshold notification callback installed by the resource
+ * controller. Called when the usage reaches or exceeds the threshold.
+ * Should be fast and must not sleep, because it is called with interrupts
+ * disabled.
+ */
+ void (*threshold_notifier)(struct res_counter *counter);
+ /*
* the number of unsuccessful attempts to consume the resource
*/
unsigned long long failcnt;
@@ -87,6 +105,7 @@ enum {
RES_MAX_USAGE,
RES_LIMIT,
RES_FAILCNT,
+ RES_THRESHOLD,
};

/*
@@ -132,6 +151,21 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
return false;
}

+static inline bool res_counter_threshold_check_locked(struct res_counter *cnt)
+{
+ if (cnt->usage + cnt->threshold < cnt->limit)
+ return true;
+
+ return false;
+}
+
+static inline void res_counter_threshold_notify_locked(struct res_counter *cnt)
+{
+ if (!res_counter_threshold_check_locked(cnt) &&
+ cnt->threshold_notifier)
+ cnt->threshold_notifier(cnt);
+}
+
/*
* Helper function to detect if the cgroup is within it's limit or
* not. It's currently called from cgroup_rss_prepare()
@@ -147,6 +181,21 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
return ret;
}

+/*
+ * Helper function to detect if the cgroup usage is under its threshold or
+ * not.
+ */
+static inline bool res_counter_check_under_threshold(struct res_counter *cnt)
+{
+ bool ret;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ ret = res_counter_threshold_check_locked(cnt);
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return ret;
+}
+
static inline void res_counter_reset_max(struct res_counter *cnt)
{
unsigned long flags;
@@ -174,6 +223,26 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
spin_lock_irqsave(&cnt->lock, flags);
if (cnt->usage <= limit) {
cnt->limit = limit;
+ if (limit <= cnt->threshold)
+ cnt->threshold = 0;
+ else
+ res_counter_threshold_notify_locked(cnt);
+ ret = 0;
+ }
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return ret;
+}
+
+static inline int res_counter_set_threshold(struct res_counter *cnt,
+ unsigned long long threshold)
+{
+ unsigned long flags;
+ int ret = -EINVAL;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (cnt->limit > threshold) {
+ cnt->threshold = threshold;
+ res_counter_threshold_notify_locked(cnt);
ret = 0;
}
spin_unlock_irqrestore(&cnt->lock, flags);
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index e1338f0..9b36748 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -5,6 +5,10 @@
*
* Author: Pavel Emelianov <[email protected]>
*
+ * Resource usage threshold notification update
+ * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
+ * Author: Dan Malek <[email protected]>
+ * Author: Vladislav Buzov <[email protected]>
*/

#include <linux/types.h>
@@ -32,6 +36,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
counter->usage += val;
if (counter->usage > counter->max_usage)
counter->max_usage = counter->usage;
+ res_counter_threshold_notify_locked(counter);
return 0;
}

@@ -101,6 +106,8 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->limit;
case RES_FAILCNT:
return &counter->failcnt;
+ case RES_THRESHOLD:
+ return &counter->threshold;
};

BUG();
--
1.5.6.3

2009-07-14 00:16:39

by Vladislav D. Buzov

Subject: [PATCH 2/2] Memory usage limit notification addition to memcg (v3)

This patch updates the Memory Controller Control Group to add a
configurable memory usage limit notification. The feature was
presented at the April 2009 Embedded Linux Conference.

Signed-off-by: Vladislav Buzov <[email protected]>
Signed-off-by: Dan Malek <[email protected]>
---
Documentation/cgroups/mem_notify.txt | 140 ++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 100 ++++++++++++++++++++++++-
2 files changed, 239 insertions(+), 1 deletions(-)
create mode 100644 Documentation/cgroups/mem_notify.txt

diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
new file mode 100644
index 0000000..94be3f3
--- /dev/null
+++ b/Documentation/cgroups/mem_notify.txt
@@ -0,0 +1,140 @@
+
+Memory Limit Notification
+
+Attempts have been made in the past to provide a mechanism for
+the notification to processes (task, an address space) when memory
+usage is approaching a high limit. The intention is that it gives
+the application an opportunity to release some memory and continue
+operation rather than be OOM killed. The CE Linux Forum requested
+a more contemporary implementation, and this is the result.
+
+The memory limit notification is an extension to the existing Memory
+Resource Controller. Please read memory.txt in this directory to
+understand its operation before continuing here.
+
+1. Operation
+
+When the Memory Controller cgroup file system is mounted, the following
+files will appear:
+
+ memory.notify_threshold_in_bytes
+ memory.notify_threshold_lowait
+
+The notification is based upon reaching a threshold below the Memory
+Resource Controller limit (memory.limit_in_bytes). The threshold
+represents the minimal number of bytes that should be available under
+the limit. When the controller group is created, the threshold is set
+to zero which triggers notification when the Memory Resource Controller
+limit is reached.
+
+The threshold may be set by writing to memory.notify_threshold_in_bytes,
+such as:
+
+ echo 10M > memory.notify_threshold_in_bytes
+
+The current number of available bytes may be computed at any time as a
+difference between the memory.limit_in_bytes and memory.usage_in_bytes.
+
+The memory.notify_threshold_lowait is a blocking read file. The read will
+block until one of four conditions occurs:
+
+ - The amount of available memory is equal to or less than the threshold
+ defined in memory.notify_threshold_in_bytes
+ - The memory.notify_threshold_lowait file is written with any value (debug)
+ - A thread is moved to another controller group
+ - The cgroup is destroyed or forced empty (memory.force_empty)
+
+
+1.1 Example Usage
+
+An application must be designed to properly take advantage of this
+memory threshold notification feature. It is a powerful management component
+of some operating systems and embedded devices that must provide
+highly available and reliable computing services. The application works
+in conjunction with information provided by the operating system to
+control limited resource usage. Since many programmers still think
+memory is infinite and never check the return value from malloc(), it
+may come as a surprise that such mechanisms have been utilized long ago.
+
+A typical application will be multithreaded, with one thread either
+polling or waiting for the notification event. When the event occurs,
+the thread will take whatever action is appropriate within the application
+design. This could be actually running a garbage collection algorithm
+or to simply signal other processing threads they must do something to
+reduce their memory usage. The notification thread will then be required
+to poll the actual usage until the low limit of its choosing is met,
+at which time the reclaim of memory can stop and the notification thread
+will wait for the next event.
+
+Internally, the application only needs to
+fopen("memory.usage_in_bytes" ...) or
+fopen("memory.notify_threshold_lowait" ...), then either poll the former
+file or block read on the latter file using fread() or fscanf() as desired.
+Subtracting the value returned from either of these read functions from the
+value obtained by reading memory.limit_in_bytes and further comparing it with
+the threshold obtained by reading memory.notify_threshold_in_bytes will be an
+indication of the amount of memory used over the threshold limit.
+
+2. Configuration
+
+Follow the instructions in memory.txt for the configuration and usage of
+the Memory Resource Controller cgroup. Once this is created and tasks
+assigned, use the memory threshold notification as described here.
+
+The only action that is needed outside of the application waiting or polling
+is to set the memory.notify_threshold_in_bytes. To set a notification to occur
+when memory usage of the cgroup reaches or exceeds 1 MByte below the limit
+can be simply done:
+
+ echo 1M > memory.notify_threshold_in_bytes
+
+This value may be read or changed at any time. Writing a higher value once
+the Memory Resource Controller is in operation may trigger immediate
+notification if the usage is above the new threshold. Writing a value that is
+not lower than the Memory Controller limit will cause an error, while setting
+the limit at or below the threshold will reset the threshold to zero.
+
+3. Debug and Testing
+
+The design of cgroups makes it easier to perform some debugging or
+monitoring tasks without modification to the application. For example,
+a write of any value to memory.notify_threshold_lowait will wake up all
+threads waiting for notifications regardless of current memory usage.
+
+Collecting performance data about the cgroup is also simplified, as
+no application modifications are necessary. A separate task can be
+created that will open and monitor any necessary files of the cgroup
+(such as current limits, usage and usage percentages and even when
+notification occurs). This task can also operate outside of the cgroup,
+so its memory usage is not charged to the cgroup.
+
+4. Design
+
+The Memory Resource Controller utilizes the Resource Counter to track and manage
+the memory of the Control Group. The Resource Counter was extended to support
+the resource usage threshold, which is the minimal difference between the
+resource limit and usage causing the notification. For the Memory Controller
+cgroup it means a number of bytes of the memory not in use so the cgroup
+parameters may continue to be dynamically modified without the need to modify
+the notification parameters. Otherwise, the notification threshold would have
+to also be computed and modified on any Memory Resource Controller operating
+parameter change.
+
+The cgroup file semantics are not well suited for this type of notification
+mechanism. While applications may choose to simply poll the current
+usage at their convenience, it was also desired to have a notification
+event that would trigger when the usage attained the threshold. The
+blocking read() was chosen, as it is the only current useful method.
+This presented the problems of "out of band" notification, when you want
+to return some exceptional status other than reaching the notification
+threshold. In the cases listed above, the read() on the
+memory.notify_threshold_lowait file will not block and return "0" for
+the remaining size. When this occurs, the thread must determine if the task
+has moved to a new cgroup or if the cgroup has been destroyed. Due to
+the usage model of this cgroup, neither is likely to happen during normal
+operation of a product.
+
+Dan Malek <[email protected]>
+Vladislav Buzov <[email protected]>
+Embedded Alley Solutions, Inc.
+10 July 2009
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2fa20d..3b49fd4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6,6 +6,11 @@
* Copyright 2007 OpenVZ SWsoft Inc
* Author: Pavel Emelianov <[email protected]>
*
+ * Memory Limit Notification update
+ * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
+ * Author: Dan Malek <[email protected]>
+ * Author: Vladislav Buzov <[email protected]>
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -180,6 +185,9 @@ struct mem_cgroup {
/* set when res.limit == memsw.limit */
bool memsw_is_minimum;

+ /* tasks waiting for memory usage threshold notification */
+ wait_queue_head_t notify_threshold_wait;
+
/*
* statistics. This must be placed at the end of memcg.
*/
@@ -2052,7 +2060,7 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
}
/*
* The user of this function is...
- * RES_LIMIT.
+ * RES_LIMIT, RES_THRESHOLD
*/
static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
const char *buffer)
@@ -2075,6 +2083,17 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
else
ret = mem_cgroup_resize_memsw_limit(memcg, val);
break;
+ case RES_THRESHOLD:
+ /* This function does all necessary parse...reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ /* For memsw threshold is not implemented */
+ if (type == _MEM)
+ ret = res_counter_set_threshold(&memcg->res, val);
+ else
+ ret = -EINVAL;
+ break;
default:
ret = -EINVAL; /* should be BUG() ? */
break;
@@ -2308,6 +2327,68 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
return 0;
}

+/*
+ * This is a blocking read operation forcing a reader to sleep unless
+ * a low memory condition occurs, someone intentionally writes to
+ * "memory.notify_threshold_lowait" or cgroup state is changed. E.g.
+ * the cgroup is destroyed or task is moved to another cgroup.
+ */
+static u64 mem_cgroup_notify_threshold_lowait(struct cgroup *cgrp,
+ struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ DEFINE_WAIT(notify_lowait);
+
+ /*
+ * A memory resource usage of zero is a special case that
+ * causes us not to sleep. It normally happens when the
+ * cgroup is about to be destroyed, and we don't want someone
+ * trying to sleep on a queue that is about to go away. This
+ * condition can also be forced as part of testing.
+ */
+ if (likely(mem->res.usage != 0)) {
+ prepare_to_wait(&mem->notify_threshold_wait, &notify_lowait,
+ TASK_INTERRUPTIBLE);
+
+ if (res_counter_check_under_threshold(&mem->res))
+ schedule();
+
+ finish_wait(&mem->notify_threshold_wait, &notify_lowait);
+ }
+
+ return res_counter_read_u64(&mem->res, RES_USAGE);
+}
+
+/*
+ * Memory usage threshold notification callback. Called under disabled
+ * interrupts by the memory resource counter when low memory condition
+ * occurs.
+ */
+static void mem_cgroup_res_threshold_notifier(struct res_counter *cnt)
+{
+ struct mem_cgroup *memcg;
+
+ memcg = mem_cgroup_from_res_counter(cnt, res);
+ if (waitqueue_active(&memcg->notify_threshold_wait))
+ wake_up_locked(&memcg->notify_threshold_wait);
+}
+
+/*
+ * This is used to wake up all threads that may be hanging
+ * out waiting for a low memory condition prior to that happening.
+ * Useful for triggering the event to assist with debug of applications.
+ */
+static int mem_cgroup_notify_threshold_wake_em_up(struct cgroup *cgrp,
+ unsigned int event)
+{
+ struct mem_cgroup *memcg;
+
+ memcg = mem_cgroup_from_cont(cgrp);
+ if (waitqueue_active(&memcg->notify_threshold_wait))
+ wake_up(&memcg->notify_threshold_wait);
+ return 0;
+}
+

static struct cftype mem_cgroup_files[] = {
{
@@ -2351,6 +2432,17 @@ static struct cftype mem_cgroup_files[] = {
.read_u64 = mem_cgroup_swappiness_read,
.write_u64 = mem_cgroup_swappiness_write,
},
+ {
+ .name = "notify_threshold_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEM, RES_THRESHOLD),
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "notify_threshold_lowait",
+ .trigger = mem_cgroup_notify_threshold_wake_em_up,
+ .read_u64 = mem_cgroup_notify_threshold_lowait,
+ },
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -2554,6 +2646,9 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mem->last_scanned_child = 0;
spin_lock_init(&mem->reclaim_param_lock);

+ init_waitqueue_head(&mem->notify_threshold_wait);
+ mem->res.threshold_notifier = mem_cgroup_res_threshold_notifier;
+
if (parent)
mem->swappiness = get_swappiness(parent);
atomic_set(&mem->refcnt, 1);
@@ -2568,6 +2663,7 @@ static int mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);

+ mem_cgroup_notify_threshold_wake_em_up(cont, 0);
return mem_cgroup_force_empty(mem, false);
}

@@ -2597,6 +2693,8 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
struct cgroup *old_cont,
struct task_struct *p)
{
+ mem_cgroup_notify_threshold_wake_em_up(old_cont, 0);
+
mutex_lock(&memcg_tasklist);
/*
* FIXME: It's better to move charges of this process from old
--
1.5.6.3

2009-07-14 00:20:15

by Paul Menage

Subject: Re: [PATCH 0/2] Memory usage limit notification feature (v3)

On Mon, Jul 13, 2009 at 5:16 PM, Vladislav
Buzov<[email protected]> wrote:
>
> The following sequence of patches introduces memory usage limit notification
> capability to the Memory Controller cgroup.
>
> This is v3 of the implementation. The major difference from the previous
> version is that it is based on the Resource Counter extension, which notifies
> the Resource Controller when the resource usage reaches or exceeds a
> configurable threshold.
>
> TODOs:
>
> 1. Another, more generic notification mechanism supporting different events
>    is preferred to use, rather than creating a dedicated file in the Memory
>    Controller cgroup.

I think that defining the more generic userspace-API portion of
this TODO should come *prior* to the new feature in this patch, even
if the kernel implementation isn't initially generic.

Paul

2009-07-14 00:31:37

by KOSAKI Motohiro

Subject: Re: [PATCH 0/2] Memory usage limit notification feature (v3)

> On Mon, Jul 13, 2009 at 5:16 PM, Vladislav
> Buzov<[email protected]> wrote:
> >
> > The following sequence of patches introduces memory usage limit notification
> > capability to the Memory Controller cgroup.
> >
> > This is v3 of the implementation. The major difference from the previous
> > version is that it is based on the Resource Counter extension, which notifies
> > the Resource Controller when the resource usage reaches or exceeds a
> > configurable threshold.
> >
> > TODOs:
> >
> > 1. Another, more generic notification mechanism supporting different events
> >    is preferred to use, rather than creating a dedicated file in the Memory
> >    Controller cgroup.
>
> I think that defining the more generic userspace-API portion of
> this TODO should come *prior* to the new feature in this patch, even
> if the kernel implementation isn't initially generic.

I fully agree with this ;-)


2009-07-14 00:32:14

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/2] Resource usage threshold notification addition to res_counter (v3)

On Mon, 13 Jul 2009 17:16:20 -0700
Vladislav Buzov <[email protected]> wrote:

> This patch updates the Resource Counter to add a configurable resource usage
> threshold notification mechanism.
>
> Signed-off-by: Vladislav Buzov <[email protected]>
> Signed-off-by: Dan Malek <[email protected]>
> ---
> Documentation/cgroups/resource_counter.txt | 21 ++++++++-
> include/linux/res_counter.h | 69 ++++++++++++++++++++++++++++
> kernel/res_counter.c | 7 +++
> 3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
> index 95b24d7..1369dff 100644
> --- a/Documentation/cgroups/resource_counter.txt
> +++ b/Documentation/cgroups/resource_counter.txt
> @@ -39,7 +39,20 @@ to work with it.
> The failcnt stands for "failures counter". This is the number of
> resource allocation attempts that failed.
>
> - c. spinlock_t lock
> + e. unsigned long long threshold
> +
> + The resource usage threshold to notify the resource controller. This is
> + the minimal difference between the resource limit and current usage
> + to fire a notification.
> +
> + f. void (*threshold_notifier)(struct res_counter *counter)
> +
> + The threshold notification callback installed by the resource
> + controller. Called when the usage reaches or exceeds the threshold.
> + Should be fast and must not sleep because it is called with interrupts
> + disabled.
> +

This interface isn't very useful and is hard to use. Can't you just return the
result ("exceeds threshold") to the callers?

If I were you, I'd add the following state to res_counter:

enum res_state {
	RES_BELOW_THRESH,
	RES_OVER_THRESH,
};

struct res_counter {
	.....
	enum res_state state;
};

Then the caller does, for example:

	prev_state = res->state;
	res_counter_charge(res....)
	if (prev_state != res->state)
		do_xxxxx..

Calling a notifier under a spinlock is not a usual interface. And if this is a
generic "notifier", notifier_call_chain should be used rather than a custom
one, IIUC.

So avoiding the callback is the way to go, I think.

Thanks,
-Kame
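
[Editorial sketch, not part of the original message.] The state-comparison
idea above can be modelled in plain userspace C. This is only a minimal
sketch under stated assumptions: struct res_counter_model and the model_*
helpers are hypothetical names, locking is omitted, and the threshold is
interpreted as in the patch (a distance below the limit).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a res_counter; not the kernel structure. */
enum res_state { RES_BELOW_THRESH, RES_OVER_THRESH };

struct res_counter_model {
	unsigned long long usage;
	unsigned long long limit;
	unsigned long long threshold;	/* distance below limit that triggers */
	enum res_state state;
};

/* Recompute the state from usage, limit and threshold. */
static void model_update_state(struct res_counter_model *c)
{
	c->state = (c->usage + c->threshold < c->limit) ?
		RES_BELOW_THRESH : RES_OVER_THRESH;
}

/* Charge val units; returns 0 on success, -1 if the limit would be exceeded. */
static int model_charge(struct res_counter_model *c, unsigned long long val)
{
	assert(c != NULL);
	if (c->usage + val > c->limit)
		return -1;
	c->usage += val;
	model_update_state(c);
	return 0;
}

/*
 * Caller-side pattern: remember the previous state, charge, and report
 * whether the state changed. No callback runs inside the counter itself.
 */
static bool model_charge_crossed(struct res_counter_model *c,
				 unsigned long long val)
{
	enum res_state prev = c->state;

	if (model_charge(c, val) != 0)
		return false;
	return prev != c->state;
}
```

With this shape the caller decides what to do on a state change (for
example, wake waiters) after the counter's own critical section has ended.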




> + g. spinlock_t lock
>
> Protects changes of the above values.
>
> @@ -140,6 +153,7 @@ counter fields. They are recommended to adhere to the following rules:
> usage usage_in_<unit_of_measurement>
> max_usage max_usage_in_<unit_of_measurement>
> limit limit_in_<unit_of_measurement>
> + threshold notify_threshold_in_<unit_of_measurement>
> failcnt failcnt
> lock no file :)
>
> @@ -153,9 +167,12 @@ counter fields. They are recommended to adhere to the following rules:
> usage prohibited
> max_usage reset to usage
> limit set the limit
> + threshold set the threshold
> failcnt reset to zero
>
> -
> + d. Notification is enabled by installing the threshold notifier callback. It
> + is up to the resource controller to communicate the notification to user
> + space tasks.
>
> 5. Usage example
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 511f42f..5ec98d7 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -9,6 +9,11 @@
> *
> * Author: Pavel Emelianov <[email protected]>
> *
> + * Resource usage threshold notification update
> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
> + * Author: Dan Malek <[email protected]>
> + * Author: Vladislav Buzov <[email protected]>
> + *
> * See Documentation/cgroups/resource_counter.txt for more
> * info about what this counter is.
> */
> @@ -35,6 +40,19 @@ struct res_counter {
> */
> unsigned long long limit;
> /*
> + * the resource usage threshold to notify the resource controller. This
> + * is the minimal difference between the resource limit and current
> + * usage to fire a notification.
> + */
> + unsigned long long threshold;
> + /*
> + * the threshold notification callback installed by the resource
> + * controller. Called when the usage reaches or exceeds the threshold.
> + * Should be fast and not sleep because called when interrupts are
> + * disabled.
> + */
> + void (*threshold_notifier)(struct res_counter *counter);
> + /*
> * the number of unsuccessful attempts to consume the resource
> */
> unsigned long long failcnt;
> @@ -87,6 +105,7 @@ enum {
> RES_MAX_USAGE,
> RES_LIMIT,
> RES_FAILCNT,
> + RES_THRESHOLD,
> };
>
> /*
> @@ -132,6 +151,21 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> return false;
> }
>
> +static inline bool res_counter_threshold_check_locked(struct res_counter *cnt)
> +{
> + if (cnt->usage + cnt->threshold < cnt->limit)
> + return true;
> +
> + return false;
> +}
> +
> +static inline void res_counter_threshold_notify_locked(struct res_counter *cnt)
> +{
> + if (!res_counter_threshold_check_locked(cnt) &&
> + cnt->threshold_notifier)
> + cnt->threshold_notifier(cnt);
> +}
> +
> /*
> * Helper function to detect if the cgroup is within it's limit or
> * not. It's currently called from cgroup_rss_prepare()
> @@ -147,6 +181,21 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
> return ret;
> }
>
> +/*
> + * Helper function to detect if the cgroup usage is under its threshold or
> + * not.
> + */
> +static inline bool res_counter_check_under_threshold(struct res_counter *cnt)
> +{
> + bool ret;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&cnt->lock, flags);
> + ret = res_counter_threshold_check_locked(cnt);
> + spin_unlock_irqrestore(&cnt->lock, flags);
> + return ret;
> +}
> +
> static inline void res_counter_reset_max(struct res_counter *cnt)
> {
> unsigned long flags;
> @@ -174,6 +223,26 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
> spin_lock_irqsave(&cnt->lock, flags);
> if (cnt->usage <= limit) {
> cnt->limit = limit;
> + if (limit <= cnt->threshold)
> + cnt->threshold = 0;
> + else
> + res_counter_threshold_notify_locked(cnt);
> + ret = 0;
> + }
> + spin_unlock_irqrestore(&cnt->lock, flags);
> + return ret;
> +}
> +
> +static inline int res_counter_set_threshold(struct res_counter *cnt,
> + unsigned long long threshold)
> +{
> + unsigned long flags;
> + int ret = -EINVAL;
> +
> + spin_lock_irqsave(&cnt->lock, flags);
> + if (cnt->limit > threshold) {
> + cnt->threshold = threshold;
> + res_counter_threshold_notify_locked(cnt);
> ret = 0;
> }
> spin_unlock_irqrestore(&cnt->lock, flags);
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index e1338f0..9b36748 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -5,6 +5,10 @@
> *
> * Author: Pavel Emelianov <[email protected]>
> *
> + * Resource usage threshold notification update
> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
> + * Author: Dan Malek <[email protected]>
> + * Author: Vladislav Buzov <[email protected]>
> */
>
> #include <linux/types.h>
> @@ -32,6 +36,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> counter->usage += val;
> if (counter->usage > counter->max_usage)
> counter->max_usage = counter->usage;
> + res_counter_threshold_notify_locked(counter);
> return 0;
> }
>
> @@ -101,6 +106,8 @@ res_counter_member(struct res_counter *counter, int member)
> return &counter->limit;
> case RES_FAILCNT:
> return &counter->failcnt;
> + case RES_THRESHOLD:
> + return &counter->threshold;
> };
>
> BUG();
> --
> 1.5.6.3
>
>

2009-07-14 00:36:47

by Paul Menage

Subject: Re: [PATCH 1/2] Resource usage threshold notification addition to res_counter (v3)

As I mentioned in another thread, I think that associating the
threshold with the res_counter rather than with each individual waiter
is a mistake, since it creates global state and makes it hard to have
multiple waiters on the same cgroup.

Paul
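
[Editorial sketch, not part of the original message.] One way to picture
per-waiter thresholds is a list of waiter records, each carrying its own
trigger level, scanned whenever usage changes. This is a hedged userspace
illustration only: struct waiter_model and wake_waiters() are hypothetical
names, the threshold here is an absolute usage value (an assumption, unlike
the patch's distance-from-limit), and no kernel wait queue is involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-waiter record: each waiter picks its own usage threshold. */
struct waiter_model {
	unsigned long long threshold;	/* absolute usage at which to wake */
	int woken;			/* set once the threshold is reached */
};

/*
 * Mark every not-yet-woken waiter whose threshold is at or below the
 * current usage. Returns the number of waiters newly woken by this call.
 */
static size_t wake_waiters(struct waiter_model *w, size_t n,
			   unsigned long long usage)
{
	size_t i, woken = 0;

	assert(w != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!w[i].woken && usage >= w[i].threshold) {
			w[i].woken = 1;
			woken++;
		}
	}
	return woken;
}
```

Because the threshold lives in the waiter rather than in the counter, two
tasks watching the same cgroup can ask for different trigger points without
overwriting each other's setting.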


2009-07-14 00:49:27

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/2] Resource usage threshold notification addition to res_counter (v3)

On Mon, 13 Jul 2009 17:36:40 -0700
Paul Menage <[email protected]> wrote:

> As I mentioned in another thread, I think that associating the
> threshold with the res_counter rather than with each individual waiter
> is a mistake, since it creates global state and makes it hard to have
> multiple waiters on the same cgroup.
>
Ah, hmm... maybe yes.

But the problem is "hierarchy" (even if this usage notifier doesn't handle it).

When we charge through a res_counter hierarchy as follows:

res_counter_A + PAGE_SIZE
res_counter_B + PAGE_SIZE
res_counter_C + PAGE_SIZE

checking where we exceeded in a smart way is not very easy. Balbir's soft limit
does a similar check, but it's not very smart either, I think.

If there are plural thresholds (notifier, soft limit, etc.), this is worth
trying. Hmm... if not, the size of res_counter exceeds 128 bytes and we'll see
a terrible counter.
Any idea?

Thanks,
-Kame



2009-07-14 01:01:07

by Vladislav D. Buzov

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

Paul Menage wrote:
> On Tue, Jul 7, 2009 at 5:56 PM, KAMEZAWA
> Hiroyuki<[email protected]> wrote:
>
>> I know people likes to wait for file descriptor to get notification in these days.
>> Can't we have "event" file descriptor in cgroup layer and make it reusable for
>> other purposes ?
>>
>
> I agree - rather than having to add a separate "wait for value to
> cross X threshold" file for each numeric usage value that people might
> be concerned about, it would be better to have a generic way to do it
> for any file. Given that this is a userspace API, it would be better
> to work out at least the generic API first, even if the initial
> implementation isn't generic.
> "OOM"
>
We all agree on that. Our original intention was to create the memory
limit notification capability, not a generic mechanism for the cgroup
events. So, we started out with something that fits the current model.

> - independent thresholds for different waiters
>
The threshold is one of the Memory Controller cgroup attributes, just like
the memory usage limit. It is not a user task attribute and should not be
controlled by user tasks. It is the job of an administrator or some control
task to set memory usage parameters within the cgroup. Once the threshold is
reached, all tasks within the cgroup should be notified and do their
best to free memory.

2009-07-14 01:05:13

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

On Mon, 13 Jul 2009 18:00:53 -0700
"Vladislav D. Buzov" <[email protected]> wrote:

> Paul Menage wrote:
> > On Tue, Jul 7, 2009 at 5:56 PM, KAMEZAWA
> > Hiroyuki<[email protected]> wrote:
> >
> >> I know people likes to wait for file descriptor to get notification in these days.
> >> Can't we have "event" file descriptor in cgroup layer and make it reusable for
> >> other purposes ?
> >>
> >
> > I agree - rather than having to add a separate "wait for value to
> > cross X threshold" file for each numeric usage value that people might
> > be concerned about, it would be better to have a generic way to do it
> > for any file. Given that this is a userspace API, it would be better
> > to work out at least the generic API first, even if the initial
> > implementation isn't generic.
> > "OOM"
> >
> We all agree on that. Our original intention was to create the memory
> limit notification capability, not a generic mechanism for the cgroup
> events. So, we started out with something that fits the current model.
>
> > - independent thresholds for different waiters
> >
> The threshold is one of the Memory Controller cgroup attributes, just like
> the memory usage limit. It is not a user task attribute and should not be
> controlled by user tasks. It is the job of an administrator or some control
> task to set memory usage parameters within the cgroup. Once the threshold is
> reached, all tasks within the cgroup should be notified and do their
> best to free memory.
>
About this, I'll add the following file

memory.reduce_usage --- try to reduce memory usage to be smaller than...?

if your notifier goes in. It may be good for dropping file caches etc. in your
notifier use case.

Thanks,
-Kame




2009-07-14 01:29:15

by Vladislav D. Buzov

Subject: Re: [PATCH 1/2] Resource usage threshold notification addition to res_counter (v3)

KAMEZAWA Hiroyuki wrote:
> On Mon, 13 Jul 2009 17:16:20 -0700
> Vladislav Buzov <[email protected]> wrote:
>
>
>> This patch updates the Resource Counter to add a configurable resource usage
>> threshold notification mechanism.
>>
>> Signed-off-by: Vladislav Buzov <[email protected]>
>> Signed-off-by: Dan Malek <[email protected]>
>> ---
>> Documentation/cgroups/resource_counter.txt | 21 ++++++++-
>> include/linux/res_counter.h | 69 ++++++++++++++++++++++++++++
>> kernel/res_counter.c | 7 +++
>> 3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
>> index 95b24d7..1369dff 100644
>> --- a/Documentation/cgroups/resource_counter.txt
>> +++ b/Documentation/cgroups/resource_counter.txt
>> @@ -39,7 +39,20 @@ to work with it.
>> The failcnt stands for "failures counter". This is the number of
>> resource allocation attempts that failed.
>>
>> - c. spinlock_t lock
>> + e. unsigned long long threshold
>> +
>> + The resource usage threshold to notify the resource controller. This is
>> + the minimal difference between the resource limit and current usage
>> + to fire a notification.
>> +
>> + f. void (*threshold_notifier)(struct res_counter *counter)
>> +
>> + The threshold notification callback installed by the resource
>> + controller. Called when the usage reaches or exceeds the threshold.
>> + Should be fast and must not sleep because it is called with interrupts
>> + disabled.
>> +
>>
>
> This interface isn't very useful..hard to use..can't you just return the result as
> "exceeds threshold" to the callers ?
>
> If I was you, I'll add following state to res_counter
>
> enum {
> RES_BELOW_THRESH,
> RES_OVER_THRESH,
> } res_state;
>
> struct res_counter {
> .....
> enum res_state state;
> }
>
> Then, caller does
> example)
> prev_state = res->state;
> res_counter_charge(res....)
> if (prev_state != res->state)
> do_xxxxx..
>
> notifier under spinlock is not usual interface. And if this is "notifier",
> something generic, notifier_call_chain should be used rather than original
> one, IIUC.
>
> So, avoiding to use "callback" is a way to go, I think.
>
>
The reason for having this callback is to support the hierarchy, which
was the problem you pointed out in the previous implementation.

When a new page is charged, we want to walk up the hierarchy, find all
the ancestors exceeding their thresholds, and notify them. To avoid
walking up the hierarchy twice, I've extended res_counter with a "notifier
callback" that res_counter_charge() calls for each res_counter in the
tree that exceeds its threshold.

In the example above, the hierarchy is not supported. We know only the state
of the res_counter/memcg that the current thread belongs to.
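
To illustrate the single-walk idea described above, here is a minimal
userspace sketch in C. It is not the patch itself: the struct is a cut-down
model of res_counter, the "notified" field and count_notification() are
hypothetical helpers for demonstration, locking is omitted, and a notifier
may already have fired for a level that is later rolled back.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the patched struct res_counter; not kernel code. */
struct res_counter {
	unsigned long long usage;
	unsigned long long limit;
	/* minimal distance between limit and usage that fires a notification */
	unsigned long long threshold;
	void (*threshold_notifier)(struct res_counter *counter);
	struct res_counter *parent;
	unsigned int notified;	/* hypothetical: counts callbacks for the demo */
};

/* Same comparison as res_counter_threshold_check_locked() in the patch. */
static int under_threshold(const struct res_counter *cnt)
{
	return cnt->usage + cnt->threshold < cnt->limit;
}

/* Hypothetical notifier: only records that it was called. */
void count_notification(struct res_counter *cnt)
{
	cnt->notified++;
}

/*
 * Charge val to cnt and to every ancestor in one upward walk, calling the
 * notifier of each level that is no longer under its threshold.
 * Returns 0 on success, or -1 (with all usage rolled back) if some level
 * would exceed its limit.
 */
int charge_hierarchy(struct res_counter *cnt, unsigned long long val)
{
	struct res_counter *c, *failed = NULL;

	assert(cnt != NULL);
	for (c = cnt; c; c = c->parent) {
		if (c->usage + val > c->limit) {
			failed = c;
			goto undo;
		}
		c->usage += val;
		if (!under_threshold(c) && c->threshold_notifier)
			c->threshold_notifier(c);
	}
	return 0;
undo:
	/* Uncharge the levels that were charged before the failing one. */
	for (c = cnt; c != failed; c = c->parent)
		c->usage -= val;
	return -1;
}
```

With a child whose threshold is 50 and a parent whose threshold is 20 (both
with limit 100), charging up to a usage of 60 notifies only the child, and
charging up to 90 notifies both, all within the same walk.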

Thanks,
Vlad.

> Thanks,
> -Kame
>
>
>
>
>
>> + g. spinlock_t lock
>>
>> Protects changes of the above values.
>>
>> @@ -140,6 +153,7 @@ counter fields. They are recommended to adhere to the following rules:
>> usage usage_in_<unit_of_measurement>
>> max_usage max_usage_in_<unit_of_measurement>
>> limit limit_in_<unit_of_measurement>
>> + threshold notify_threshold_in_<unit_of_measurement>
>> failcnt failcnt
>> lock no file :)
>>
>> @@ -153,9 +167,12 @@ counter fields. They are recommended to adhere to the following rules:
>> usage prohibited
>> max_usage reset to usage
>> limit set the limit
>> + threshold set the threshold
>> failcnt reset to zero
>>
>> -
>> + d. Notification is enabled by installing the threshold notifier callback. It
>> + is up to the resouce controller to communicate the notification to user
>> + space tasks.
>>
>> 5. Usage example
>>
>> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
>> index 511f42f..5ec98d7 100644
>> --- a/include/linux/res_counter.h
>> +++ b/include/linux/res_counter.h
>> @@ -9,6 +9,11 @@
>> *
>> * Author: Pavel Emelianov <[email protected]>
>> *
>> + * Resouce usage threshold notification update
>> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
>> + * Author: Dan Malek <[email protected]>
>> + * Author: Vladislav Buzov <[email protected]>
>> + *
>> * See Documentation/cgroups/resource_counter.txt for more
>> * info about what this counter is.
>> */
>> @@ -35,6 +40,19 @@ struct res_counter {
>> */
>> unsigned long long limit;
>> /*
>> + * the resource usage threshold to notify the resouce controller. This
>> + * is the minimal difference between the resource limit and current
>> + * usage to fire a notification.
>> + */
>> + unsigned long long threshold;
>> + /*
>> + * the threshold notification callback installed by the resource
>> + * controller. Called when the usage reaches or exceeds the threshold.
>> + * Should be fast and not sleep because called when interrupts are
>> + * disabled.
>> + */
>> + void (*threshold_notifier)(struct res_counter *counter);
>> + /*
>> * the number of unsuccessful attempts to consume the resource
>> */
>> unsigned long long failcnt;
>> @@ -87,6 +105,7 @@ enum {
>> RES_MAX_USAGE,
>> RES_LIMIT,
>> RES_FAILCNT,
>> + RES_THRESHOLD,
>> };
>>
>> /*
>> @@ -132,6 +151,21 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
>> return false;
>> }
>>
>> +static inline bool res_counter_threshold_check_locked(struct res_counter *cnt)
>> +{
>> + if (cnt->usage + cnt->threshold < cnt->limit)
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> +static inline void res_counter_threshold_notify_locked(struct res_counter *cnt)
>> +{
>> + if (!res_counter_threshold_check_locked(cnt) &&
>> + cnt->threshold_notifier)
>> + cnt->threshold_notifier(cnt);
>> +}
>> +
>> /*
>> * Helper function to detect if the cgroup is within it's limit or
>> * not. It's currently called from cgroup_rss_prepare()
>> @@ -147,6 +181,21 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
>> return ret;
>> }
>>
>> +/*
>> + * Helper function to detect if the cgroup usage is under it's threshold or
>> + * not.
>> + */
>> +static inline bool res_counter_check_under_threshold(struct res_counter *cnt)
>> +{
>> + bool ret;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&cnt->lock, flags);
>> + ret = res_counter_threshold_check_locked(cnt);
>> + spin_unlock_irqrestore(&cnt->lock, flags);
>> + return ret;
>> +}
>> +
>> static inline void res_counter_reset_max(struct res_counter *cnt)
>> {
>> unsigned long flags;
>> @@ -174,6 +223,26 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
>> spin_lock_irqsave(&cnt->lock, flags);
>> if (cnt->usage <= limit) {
>> cnt->limit = limit;
>> + if (limit <= cnt->threshold)
>> + cnt->threshold = 0;
>> + else
>> + res_counter_threshold_notify_locked(cnt);
>> + ret = 0;
>> + }
>> + spin_unlock_irqrestore(&cnt->lock, flags);
>> + return ret;
>> +}
>> +
>> +static inline int res_counter_set_threshold(struct res_counter *cnt,
>> + unsigned long long threshold)
>> +{
>> + unsigned long flags;
>> + int ret = -EINVAL;
>> +
>> + spin_lock_irqsave(&cnt->lock, flags);
>> + if (cnt->limit > threshold) {
>> + cnt->threshold = threshold;
>> + res_counter_threshold_notify_locked(cnt);
>> ret = 0;
>> }
>> spin_unlock_irqrestore(&cnt->lock, flags);
>> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
>> index e1338f0..9b36748 100644
>> --- a/kernel/res_counter.c
>> +++ b/kernel/res_counter.c
>> @@ -5,6 +5,10 @@
>> *
>> * Author: Pavel Emelianov <[email protected]>
>> *
>> + * Resouce usage threshold notification update
>> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
>> + * Author: Dan Malek <[email protected]>
>> + * Author: Vladislav Buzov <[email protected]>
>> */
>>
>> #include <linux/types.h>
>> @@ -32,6 +36,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
>> counter->usage += val;
>> if (counter->usage > counter->max_usage)
>> counter->max_usage = counter->usage;
>> + res_counter_threshold_notify_locked(counter);
>> return 0;
>> }
>>
>> @@ -101,6 +106,8 @@ res_counter_member(struct res_counter *counter, int member)
>> return &counter->limit;
>> case RES_FAILCNT:
>> return &counter->failcnt;
>> + case RES_THRESHOLD:
>> + return &counter->threshold;
>> };
>>
>> BUG();
>> --
>> 1.5.6.3
>>
>>
>>
>
>

2009-07-14 01:44:04

by KOSAKI Motohiro

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

> On Tue, Jul 7, 2009 at 5:56 PM, KAMEZAWA
> Hiroyuki<[email protected]> wrote:
> >
> > I know people like to wait on a file descriptor to get notifications these days.
> > Can't we have an "event" file descriptor in the cgroup layer and make it reusable for
> > other purposes?
>
> I agree - rather than having to add a separate "wait for value to
> cross X threshold" file for each numeric usage value that people might
> be concerned about, it would be better to have a generic way to do it
> for any file. Given that this is a userspace API, it would be better
> to work out at least the generic API first, even if the initial
> implementation isn't generic.
>
> Properties that it should support include:
>
> - notification when a value crosses above or below a given threshold
> (which would include binary cases such as OOM notification where the
> value crosses from "not-OOM" to "OOM")
>
> - independent thresholds for different waiters
>
> - epoll support (by using eventfd?)

signalfd?


> - automatic wakeup when a cgroup is removed
>
> - maybe optional wakeup when a thread attach occurs?
>
> - not require more than read permissions on the file containing the
> value being monitored
>
> I guess there are a few possible ways this could be exposed to userspace:
>
> 1) new ioctl on cgroups files. simple but probably not popular
>
> 2) new system call. maybe the cleanest, but involves changing every
> arch and is hard to script
>
> 3) new per-cgroup file to control these e.g:
> - create an eventfd
> - open the control file to be monitored
> - write the "<event_fd>, <control_fd> <threshold>" to
> cgroup.event_control to link them together
> flexible and scriptable but maybe a clumsy interface in general
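
For reference, option 3 could look roughly like this from userspace. This is
only a sketch of the proposal: the space-separated line format and both
helper names (register_threshold, wait_for_event) are assumptions, and no
cgroup.event_control file exists in the patch under discussion. eventfd(2),
read(2) and write(2) are used as documented on Linux.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create an eventfd and link it to an already opened control file by
 * writing "<event_fd> <control_fd> <threshold>" to event_control_fd.
 * The line format is an assumption based on the proposal above.
 * Returns the new eventfd, or -1 on error.
 */
int register_threshold(int event_control_fd, int control_fd,
		       unsigned long long threshold)
{
	char line[64];
	int event_fd, len;

	assert(event_control_fd >= 0 && control_fd >= 0);
	event_fd = eventfd(0, 0);
	if (event_fd < 0)
		return -1;

	len = snprintf(line, sizeof(line), "%d %d %llu",
		       event_fd, control_fd, threshold);
	if (len < 0 || (size_t)len >= sizeof(line) ||
	    write(event_control_fd, line, (size_t)len) != (ssize_t)len) {
		close(event_fd);
		return -1;
	}
	return event_fd;
}

/*
 * Block until the eventfd is signalled. Returns the number of events
 * accumulated since the last read, or 0 on error.
 */
uint64_t wait_for_event(int event_fd)
{
	uint64_t count = 0;

	if (read(event_fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

Because the waiter only holds an eventfd, it can also be handed to poll() or
epoll instead of blocking in read(), which is the epoll property listed
above.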

I like multiple thresholds and a per-threshold file descriptor.
That solves the multiple-waiters issue.

But how about this?

/cgroup
  /group1
    /notifications
      /threshold-A
      /threshold-B





2009-07-14 01:47:38

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/2] Resource usage threshold notification addition to res_counter (v3)

On Mon, 13 Jul 2009 18:29:01 -0700
"Vladislav D. Buzov" <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Mon, 13 Jul 2009 17:16:20 -0700
> > Vladislav Buzov <[email protected]> wrote:
> >
> >
> >> This patch updates the Resource Counter to add a configurable resource usage
> >> threshold notification mechanism.
> >>
> >> Signed-off-by: Vladislav Buzov <[email protected]>
> >> Signed-off-by: Dan Malek <[email protected]>
> >> ---
> >> Documentation/cgroups/resource_counter.txt | 21 ++++++++-
> >> include/linux/res_counter.h | 69 ++++++++++++++++++++++++++++
> >> kernel/res_counter.c | 7 +++
> >> 3 files changed, 95 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
> >> index 95b24d7..1369dff 100644
> >> --- a/Documentation/cgroups/resource_counter.txt
> >> +++ b/Documentation/cgroups/resource_counter.txt
> >> @@ -39,7 +39,20 @@ to work with it.
> >> The failcnt stands for "failures counter". This is the number of
> >> resource allocation attempts that failed.
> >>
> >> - c. spinlock_t lock
> >> + e. unsigned long long threshold
> >> +
> >> + The resource usage threshold to notify the resouce controller. This is
> >> + the minimal difference between the resource limit and current usage
> >> + to fire a notification.
> >> +
> >> + f. void (*threshold_notifier)(struct res_counter *counter)
> >> +
> >> + The threshold notification callback installed by the resource
> >> + controller. Called when the usage reaches or exceeds the threshold.
> >> + Should be fast and not sleep because called when interrupts are
> >> + disabled.
> >> +
> >>
> >
> > This interface isn't very useful and is hard to use. Can't you just return the
> > result as "exceeds threshold" to the callers?
> >
> > If I were you, I'd add the following state to res_counter:
> >
> > enum res_state {
> > 	RES_BELOW_THRESH,
> > 	RES_OVER_THRESH,
> > };
> >
> > struct res_counter {
> > 	.....
> > 	enum res_state state;
> > };
> >
> > Then the caller does, for example:
> > 	prev_state = res->state;
> > 	res_counter_charge(res....)
> > 	if (prev_state != res->state)
> > 		do_xxxxx..
> >
> > A notifier called under a spinlock is not a usual interface. And if this is a
> > "notifier", something generic, notifier_call_chain should be used rather than
> > a custom one, IIUC.
> >
> > So avoiding a "callback" is the way to go, I think.
> >
> >
> The reason for having this callback is to support the hierarchy, which
> was the problem you pointed out in the previous implementation.
>
> When a new page is charged, we want to walk up the hierarchy, find all
> the ancestors exceeding their thresholds, and notify them. To avoid
> walking up the hierarchy twice, I've extended res_counter with a "notifier
> callback" that res_counter_charge() calls for each res_counter in the
> tree that exceeds its threshold.
>
> In the example above, the hierarchy is not supported. We know only the state
> of the res_counter/memcg that the current thread belongs to.
>
How heavy can res_counter become? ;) Please don't check at "every charge"; use some
filter.

Please discuss this with Balbir. His softlimit adds something similar, and I don't think
either is elegant.

I'll think about it more (of course, I may not find anything better) and rewrite the
whole thing if I have a chance.

At first glance, an interface like the following would not be too bad.

==
/*
 * This function checks the state of all ancestors. Each ancestor is
 * passed to check_function() one by one until check_function() returns
 * nonzero or res->parent is NULL.
 */
void res_counter_callback(struct res_counter *res,
			  int (*check_function)(struct res_counter *))
{
	do {
		if ((*check_function)(res))
			break;
		res = res->parent;
	} while (res);
}
==
Calling this once per 1000 charges or once per second would not be too
expensive, and we can keep res_counter simple. If you want some trigger,
you can add one as you like.
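
As a rough illustration of this filtering idea, the userspace sketch below
walks the ancestors only once every CHECK_INTERVAL charges. Everything
except res_counter_callback() is an assumption made for the example: the
cut-down struct, the over_threshold flag, mark_if_over(), charge_filtered()
and the interval of 1000. Locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define CHECK_INTERVAL 1000	/* hypothetical: charges between two checks */

/* Userspace model of res_counter; not kernel code. */
struct res_counter {
	unsigned long long usage;
	unsigned long long limit;
	unsigned long long threshold;
	struct res_counter *parent;
	int over_threshold;	/* hypothetical: result of the last check */
};

/* Pass res and each ancestor to check_function() until it returns nonzero. */
void res_counter_callback(struct res_counter *res,
			  int (*check_function)(struct res_counter *))
{
	do {
		if ((*check_function)(res))
			break;
		res = res->parent;
	} while (res);
}

/* Record whether this level has reached its threshold; never stop early. */
int mark_if_over(struct res_counter *res)
{
	res->over_threshold = (res->usage + res->threshold >= res->limit);
	return 0;
}

/*
 * Charge val up the hierarchy, but run the threshold check only once per
 * CHECK_INTERVAL calls. *pending counts the charges since the last check.
 */
void charge_filtered(struct res_counter *res, unsigned long long val,
		     unsigned int *pending)
{
	struct res_counter *c;

	assert(res != NULL && pending != NULL);
	for (c = res; c; c = c->parent)
		c->usage += val;

	if (++*pending >= CHECK_INTERVAL) {
		*pending = 0;
		res_counter_callback(res, mark_if_over);
	}
}
```

The trade-off is latency: a counter that crosses its threshold between two
checks is only flagged at the next check, up to CHECK_INTERVAL - 1 charges
later.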

Thanks,
-Kame

2009-07-14 19:10:12

by Dan Malek

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg


On Jul 13, 2009, at 6:43 PM, KOSAKI Motohiro wrote:

> I like multiple thresholds and a per-threshold file descriptor.
> That solves the multiple-waiters issue.
>
> But how about this?
>
> /cgroup
>   /group1
>     /notifications
>       /threshold-A
>       /threshold-B

Why are you making this so complicated?
As a group, there is a limit and a notification
threshold. Use the power of the cgroup hierarchy
if you want something with related limits and
different thresholds. I don't understand the system
design that would desire group cooperation
but yet different threshold notifications, except in
the case of upper/lower limits. I'm not arguing that
you can't do it, only questioning the value in doing so.

The complexity of the event delivery will cause
extensive discussion as well. You have great
ideas but scratch just a little below the surface
and there are complex problems to solve. For
example, if you have a "below limit" notification
I can think of many challenges to solve. Do you
constantly deliver the event? Only once on crossing?
At some time interval? How often? On a new attach
to the event? Hysteresis? How do you control these?
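
One possible answer to the "only once on crossing?" and hysteresis questions
can be sketched in a few lines of C. This is an illustration only, not part
of the patch: the struct, its field names and notify_update() are
hypothetical, and the choice of a separate re-arm level is just one of the
policies the questions above leave open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-waiter state for an edge-triggered notification. */
struct notify_state {
	unsigned long long high;	/* fire when usage rises to this level */
	unsigned long long low;		/* re-arm when usage falls to this level */
	bool armed;			/* true while a notification may fire */
};

/*
 * Feed the current usage into the state machine.
 * Returns true exactly once per upward crossing of "high"; the next
 * notification is possible only after usage has dropped to "low" or less.
 */
bool notify_update(struct notify_state *s, unsigned long long usage)
{
	assert(s != NULL);
	if (s->armed && usage >= s->high) {
		s->armed = false;
		return true;
	}
	if (!s->armed && usage <= s->low)
		s->armed = true;
	return false;
}
```

With high = 80 and low = 60, usage moving 70, 85, 90, 75, 82 delivers a
single event at 85; only after usage drops to 60 or below does the next rise
past 80 deliver another one.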

The purpose of this memory notification was to improve
upon a previous attempt at such a feature. It's a useful
feature that is today being used in some applications
to successfully manage the constrained resource.

If you look at my presentation from the last ELC, you
will see this patch is one small step of many to improve
resource management. This event notification discussion
is important, but still just a tiny implementation detail in
a bigger resource management scheme. We need to
make the small steps to make people aware of new features,
write applications that utilize these features, and to
perhaps discover something even better we aren't
even considering.

Thanks.

-- Dan

2009-07-15 06:03:04

by Balbir Singh

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

* [email protected] <[email protected]> [2009-07-13 15:15:45]:

> On Tue, Jul 7, 2009 at 5:56 PM, KAMEZAWA
> Hiroyuki<[email protected]> wrote:
> >
> > I know people like to wait on a file descriptor to get notifications these days.
> > Can't we have an "event" file descriptor in the cgroup layer and make it reusable for
> > other purposes?
>
> I agree - rather than having to add a separate "wait for value to
> cross X threshold" file for each numeric usage value that people might
> be concerned about, it would be better to have a generic way to do it
> for any file. Given that this is a userspace API, it would be better
> to work out at least the generic API first, even if the initial
> implementation isn't generic.
>
> Properties that it should support include:
>
> - notification when a value crosses above or below a given threshold
> (which would include binary cases such as OOM notification where the
> value crosses from "not-OOM" to "OOM")
>
> - independent thresholds for different waiters
>
> - epoll support (by using eventfd?)
>
> - automatic wakeup when a cgroup is removed
>
> - maybe optional wakeup when a thread attach occurs?
>

Please don't forget netlink; we've already got a genetlink socket for
cgroups in the form of cgroupstats.

> - not require more than read permissions on the file containing the
> value being monitored
>
> I guess there are a few possible ways this could be exposed to userspace:
>
> 1) new ioctl on cgroups files. simple but probably not popular
>
> 2) new system call. maybe the cleanest, but involves changing every
> arch and is hard to script
>
> 3) new per-cgroup file to control these e.g:
> - create an eventfd
> - open the control file to be monitored
> - write the "<event_fd>, <control_fd> <threshold>" to
> cgroup.event_control to link them together
> flexible and scriptable but maybe a clumsy interface in general
>
> Paul

--
Balbir

2009-07-16 17:15:40

by Balbir Singh

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

* Dan Malek <[email protected]> [2009-07-14 12:13:32]:

> If you look at my presentation from the last ELC, you
> will see this patch is one small step of many to improve
> resource management. This event notification discussion
> is important, but still just a tiny implementation detail in
> a bigger resource management scheme. We need to
> make the small steps to make people aware of new features,
> write applications that utilize these features, and to
> perhaps discover something even better we aren't
> even considering.

Dan, if you are suggesting that we incrementally add features, I
completely agree with you, that way the code is reviewable and
maintainable. As we add features we need to

1. Look at reuse
2. Make sure the design is sane and will not prohibit further
development.

--
Balbir

2009-07-16 18:13:09

by Dan Malek

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg


On Jul 16, 2009, at 10:15 AM, Balbir Singh wrote:

> Dan, if you are suggesting that we incrementally add features, I
> completely agree with you, that way the code is reviewable and
> maintainable. As we add features we need to

Right, this is all goodness. My specific comment is that this patch
adds a useful new feature and has been through a couple of iterations
to make it more acceptable. Let's post it, as that makes people aware
of such a feature since it's currently in use and useful, and then
continue the discussion about how to make it (and all of the cgroup
features) better. Otherwise, this is going to degenerate into a "do
everything but nothing gets done" ongoing discussion and I'll
quickly lose interest and move on to something else :-)

There are currently two discussions in progress. One is about
notification limits, which this feature patch adds. We need to
close this discussion with a more feature rich implementation
that addresses both upper and lower notification, the semantics
of this feature in a cgroup hierarchy, and in particular the
behavior outside of the memory controller group.

The second discussion is about event delivery in cgroups.
Linux already has many mechanisms, and some product
implementations patch even more of their own into the kernel.
Outside of these implementation details, we have to determine
what is useful for a cgroup. Are events just arbitrary (anything
can send any kind of event)? How do we pass information?
Is there some standard header? How do we control this so
the event target is identified and we prevent event floods?
And many more.....

> 1. Look at reuse
> 2. Make sure the design is sane and will not prohibit further
> development.

3. Contain the scope of work so I can do it without affecting
the work that pays my salary :-)

Thanks.

-- Dan

2009-07-17 02:33:41

by Balbir Singh

Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg

* Dan Malek <[email protected]> [2009-07-16 11:16:29]:

>
> On Jul 16, 2009, at 10:15 AM, Balbir Singh wrote:
>
> > Dan, if you are suggesting that we incrementally add features, I
> > completely agree with you, that way the code is reviewable and
> > maintainable. As we add features we need to
>
> Right, this is all goodness. My specific comments are this patch
> adds a new useful feature and it's been through a couple of iterations
> to make it more acceptable. Let's post it, as it makes people aware
> of such a feature since it's currently in use and useful, and then
> continue the discussion about how to make it (and all of the cgroup
> features) better. Otherwise, this is going to degenerate into a "do
> everything but nothing gets done" ongoing discussion and I'll
> quickly lose interest and move on to something else :-)
>
> There are currently two discussions in progress. One is about
> notification limits, which this feature patch adds. We need to
> close this discussion with a more feature rich implementation
> that addresses both upper and lower notification, the semantics
> of this feature in a cgroup hierarchy, and in particular the
> behavior outside of the memory controller group.
>
> The second discussion is about event delivery in cgroups.
> Linux already has many mechanisms, and some product
> implementations patch even more of their own into the kernel.
> Outside of these implementation details, we have to determine
> what is useful for a cgroup. Are events just arbitrary (anything
> can send any kind of event)? How do we pass information?
> Is there some standard header? How do we control this so
> the event target is identified and we prevent event floods?
> And many more.....
>

I think you keep missing my pointers to cgroupstats - a genetlink
based mechanism for event delivery and request/response applications.


> > 1. Look at reuse
> > 2. Make sure the design is sane and will not prohibit further
> > development.
>
> 3. Contain the scope of work so I can do it without affecting
> the work that pays my salary :-)
>

Not at the cost of (1) and (2) and a patient discussion around what is
being proposed.

--
Balbir