Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753989AbZDMXK6 (ORCPT ); Mon, 13 Apr 2009 19:10:58 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751130AbZDMXKt (ORCPT ); Mon, 13 Apr 2009 19:10:49 -0400 Received: from smtp1.linux-foundation.org ([140.211.169.13]:53220 "EHLO smtp1.linux-foundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751084AbZDMXKs (ORCPT ); Mon, 13 Apr 2009 19:10:48 -0400 Date: Mon, 13 Apr 2009 16:08:14 -0700 From: Andrew Morton To: Dan Malek Cc: linux-kernel@vger.kernel.org, menage@google.com, dan@embeddedalley.com, containers@lists.linux-foundation.org Subject: Re: [PATCH] Memory usage limit notification addition to memcg Message-Id: <20090413160814.1c335d1d.akpm@linux-foundation.org> In-Reply-To: <1239660512-25468-1-git-send-email-dan@embeddedalley.com> References: <1239660512-25468-1-git-send-email-dan@embeddedalley.com> X-Mailer: Sylpheed version 2.2.4 (GTK+ 2.8.20; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 19665 Lines: 531 Please cc containers@lists.linux-foundation.org on this sort of thing so that all the right people get to see it. On Mon, 13 Apr 2009 18:08:32 -0400 Dan Malek wrote: > This patch updates the Memory Controller cgroup to add > a configurable memory usage limit notification. The feature > was presented at the April 2009 Embedded Linux Conference. > > Signed-off-by: Dan Malek > --- > Documentation/cgroups/mem_notify.txt | 129 +++++++++++++++++++++ > include/linux/memcontrol.h | 7 + > init/Kconfig | 9 ++ > mm/memcontrol.c | 207 ++++++++++++++++++++++++++++++++++ > 4 files changed, 352 insertions(+), 0 deletions(-) > create mode 100644 Documentation/cgroups/mem_notify.txt > > diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt > new file mode 100644 > index 0000000..72d5c26 > --- /dev/null > +++ b/Documentation/cgroups/mem_notify.txt > @@ -0,0 +1,129 @@ > + > +Memory Limit Notificiation > + > +Attempts have been made in the past to provide a mechanism for > +the notification to processes (task, an address space) when memory > +usage is approaching a high limit. The intention is that it gives > +the application an opportunity to release some memory and continue > +operation rather than be OOM killed. The CE Linux Forum requested > +a more comtemporary implementation, and this is the result. > + > +The memory limit notification is a configurable extension to the > +existing Memory Resource Controller. Please read memory.txt in this > +directory to understand its operation before continuing here. > + > +1. Operation > + > +When a kernel is configured with CGROUP_MEM_NOTIFY, three additional > +files will appear in the memory resource controller: > + > + memory.notify_limit_percent We've run into problems in the past where a percentage number is too coarse on large-memory systems. Proabably that won't be an issue here, but I invite you to convince us of this ;) > + memory.notify_limit_usage > + memory.notify_limit_lowait > + > +The notification is based upon reaching a percentage of the memory > +resource controller limit (memory.limit_in_bytes). When the controller > +group is created, the percentage is set to 100. Any integer percentage > +may be set by writing to memory.notify_limit_percent, such as: > + > + echo 80 > memory.notify_limit_percent > + > +The current integer usage percentage may be read at any time from > +the memory.notify_limit_usage file. > + > +The memory.notify_limit_lowait is a blocking read file. The read will > +block until one of four conditions occurs: > + > + - The usage reaches or exceeds the memory.notify_limit_percent > + - The memory.notify_limit_lowait file is written with any value (debug) > + - A thread is moved to another controller group > + - The cgroup is destroyed or forced empty (memory.force_empty) Does it support select()/poll()/eventfd()/etc? > + > +1.1 Example Usage > + > +An application must be designed to properly take advantage of this > +memory limit notification feature. It is a powerful management component > +of some operating systems and embedded devices that must provide > +highly available and reliable computing services. The application works > +in conjunction with information provided by the operating system to > +control limited resource usage. Since many programmers still think > +memory is infinite and never check the return value from malloc(), it > +may come as a surprise that such mechanisms have been utilized long ago. > + > +A typical application will be multithreaded, with one thread either > +polling or waiting for the notification event. When the event occurs, > +the thread will take whatever action is appropriate within the application > +design. This could be actually running a garbage collection algorithm > +or to simply signal other processing threads they must do something to > +reduce their memory usage. The notification thread will then be required > +to poll the actual usage until the low limit of its choosing is met, > +at which time the reclaim of memory can stop and the notification thread > +will wait for the next event. > + > +Internally, the application only needs to fopen("memory.notify_limit_usage" ..) > +and fopen("memory.notify_limit_lowait" ...), then either poll the former > +file or block read on the latter file using fread() or fscanf() as desired. > + > +2. Configuration > + > +Follow the instructions in memory.txt for the configuration and usage of > +the Memory Resource Controller cgroup. Once this is created and tasks > +assigned, use the memory limit notification as described here. > + > +The only action that is needed outside of the application waiting or polling > +is to set the memory.notify_limit_percent. To set a notification to occur > +when memory usage of the cgroup reaches or exceeds 80 percent can be > +simply done: > + > + echo 80 > memory.notify_limit_percent > + > +This value may be read or changed at any time. Writing a lower value once > +the Memory Resource Controller is in operation may trigger immediate > +notification if the usage is above the new limit. > + > +3. Debug and Testing > + > +The design of cgroups makes it easier to perform some debugging or > +monitoring tasks without modification to the application. For example, > +a write of any value to memory.notify_limit_lowait will wake up all > +threads waiting for notifications regardless of current memory usage. > + > +Collecting performance data about the cgroup is also simplified, as > +no application modifications are necessary. A separate task can be > +created that will open and monitor any necessary files of the cgroup > +(such as current limits, usage and usage percentages and even when > +notification occurs). This task can also operate outside of the cgroup, > +so its memory usage is not charged to the cgroup. > + > +4. Design > + > +The memory limit notification is a configurable extension to the > +existing Memory Resource Controller, which operates as described to > +track and manage the memory of the Control Group. The Memory Resource > +Controller will still continue to reclaim memory under pressure > +of the limits, and may OOM kill tasks within the cgroup according to > +the OOM Killer configuration. > + > +The memory notification limit was chosen as a percentage of the > +memory in use so the cgroup paramaters may continue to be dynamically > +modified without the need to modify the notificaton parameters. > +Otherwise, the notification limit would have to also be computed > +and modified on any Memory Resource Controller operating parameter change. > + > +The cgroup file semantics are not well suited for this type of notificaton > +mechanism. While applications may choose to simply poll the current > +usage at their convenience, it was also desired to have a notification > +event that would trigger when the usage attained the limit. The > +blocking read() was chosen, as it is the only current useful method. > +This presented the problems of "out of band" notification, when you want > +to return some exceptional status other than reaching the notification > +limit. In the cases listed above, the read() on the memory.notify_limit_lowait > +file will not block and return "0" for the percentage. When this occurs, > +the thread must determine if the task has moved to a new cgroup or if > +the cgroup has been destroyed. Due to the usage model of this cgroup, > +neither is likely to happen during normal operation of a product. > + > +Dan Malek > +Embedded Alley Solutions, Inc. > +10 March 2009 > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 18146c9..031e5d1 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -117,6 +117,13 @@ static inline bool mem_cgroup_disabled(void) > > extern bool mem_cgroup_oom_called(struct task_struct *task); > > +#ifdef CONFIG_CGROUP_MEM_NOTIFY > +extern void test_and_wakeup_notify(struct mem_cgroup *mcg, > + unsigned long long newlimit); > +extern unsigned long compute_usage_percent(unsigned long long usage, > + unsigned long long limit); > +#endif Stylistic trick: here, please add #else static inline void test_and_wakeup_notify(struct mem_cgroup *mcg, unsigned long long newlimit) { } static inline unsigned long compute_usage_percent(unsigned long long usage, unsigned long long limit) { return 0; /* ? */ } #endif and then remove the ifdefs from the .c files. > #else /* CONFIG_CGROUP_MEM_RES_CTLR */ > struct mem_cgroup; > > diff --git a/init/Kconfig b/init/Kconfig > index f2f9b53..97138da 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -588,6 +588,15 @@ config CGROUP_MEM_RES_CTLR > This config option also selects MM_OWNER config option, which > could in turn add some fork/exit overhead. > > +config CGROUP_MEM_NOTIFY > + bool "Memory Usage Limit Notification" > + depends on CGROUP_MEM_RES_CTLR > + help > + Provides a memory notification when usage reaches a preset limit. > + It is an extenstion to the memory resource controller, since it > + uses the memory usage accounting of the cgroup to test against > + the notification limit. (See Documentation/cgroups/mem_notify.txt) > + > config CGROUP_MEM_RES_CTLR_SWAP > bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)" > depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 2fc6d6c..d6367ed 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6,6 +6,10 @@ > * Copyright 2007 OpenVZ SWsoft Inc > * Author: Pavel Emelianov > * > + * Memory Limit Notification update > + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc. > + * Author: Dan Malek > + * > * This program is free software; you can redistribute it and/or modify > * it under the terms of the GNU General Public License as published by > * the Free Software Foundation; either version 2 of the License, or > @@ -180,6 +184,11 @@ struct mem_cgroup { > * statistics. This must be placed at the end of memcg. > */ > struct mem_cgroup_stat stat; > + > +#ifdef CONFIG_CGROUP_MEM_NOTIFY > + unsigned long notify_limit_percent; > + wait_queue_head_t notify_limit_wait; > +#endif > }; > > enum charge_type { > @@ -934,6 +943,21 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, > > VM_BUG_ON(mem_cgroup_is_obsolete(mem)); > > +#ifdef CONFIG_CGROUP_MEM_NOTIFY > + /* We check on the way in so we don't have to duplicate code > + * in both the normal and error exit path. > + */ > + if (likely(mem->res.limit != (unsigned long long)LLONG_MAX)) { > + unsigned long usage_pct; > + > + usage_pct = compute_usage_percent(mem->res.usage + PAGE_SIZE, > + mem->res.limit); > + if ((usage_pct >= mem->notify_limit_percent) && > + waitqueue_active(&mem->notify_limit_wait)) > + wake_up(&mem->notify_limit_wait); > + } > +#endif It would be nicer to pull this out into a separate function, I expect. > while (1) { > int ret; > bool noswap = false; > @@ -1663,6 +1687,13 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg, > int children = mem_cgroup_count_children(memcg); > u64 curusage, oldusage; > > +#ifdef CONFIG_CGROUP_MEM_NOTIFY > + /* Test and notify ahead of the necessity to free pages, as > + * applications giving up pages may help this reclaim procedure. > + */ > + test_and_wakeup_notify(memcg, val); > +#endif ifdefs-in-c make kernel developers sad. > /* > * For keeping hierarchical_reclaim simple, how long we should retry > * is depends on callers. We set our retry-count to be function > @@ -2215,6 +2246,147 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft, > return 0; > } > > +#ifdef CONFIG_CGROUP_MEM_NOTIFY > +#define CGROUP_LOCAL_BUFFER_SIZE 64 /* Would be nice if this was in cgroup.h */ > + > +/* The resource counters are defined as long long, but few processors > + * handle 64-bit divisor in hardware, and the software to do it isn't > + * present in the kernel. It would be nice if the resource counters were > + * platform specific configurable typedefs, but for now we'll just divide > + * down the byte counters by the page size to get 32-bit arithmetic. > + * With a 4K page size, this will work up to about 16384G resource limit. > + */ > +unsigned long compute_usage_percent(unsigned long long usage, > + unsigned long long limit) This is a poor choice of identifier for a global symbol. Please prefix it with some string which identifies its subsystem. > +{ > + unsigned long lim; > + unsigned long long usage_pct; > + > + usage_pct = (usage / PAGE_SIZE) * 100; > + lim = (unsigned long)(limit / PAGE_SIZE); > + > + do_div(usage_pct, lim); > + > + return (unsigned long)usage_pct; > +} > + > +void test_and_wakeup_notify(struct mem_cgroup *mcg, unsigned long long newlimit) Ditto. > +{ > + unsigned long usage_pct; > + > + /* Check to see if the new limit should cause notification. > + */ > + usage_pct = compute_usage_percent(mcg->res.usage, newlimit); > + > + if ((usage_pct >= mcg->notify_limit_percent) && > + waitqueue_active(&mcg->notify_limit_wait)) > + wake_up(&mcg->notify_limit_wait); > +} Also, identifiers such as "newlimit" are a bit unclear because they don't communicate the units, and (less seriously) they don't communicate what quantity they are measuring. Something like memory_newlimit_bytes would be clearer, although a bit silly. I _assume_ these things are all operating in units of bytes. Perhaps it was pages? That's my point... > +static ssize_t notify_limit_read(struct cgroup *cgrp, struct cftype *cft, > + struct file *file, > + char __user *buf, size_t nbytes, > + loff_t *ppos) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + char tmp[CGROUP_LOCAL_BUFFER_SIZE]; > + int len; > + > + len = sprintf(tmp, "%lu\n", memcg->notify_limit_percent); The reader has to run around the tree to find out if there's a buffer overflow here. scnprintf() would set minds at ease. > + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len); > +} > + > +static int notify_limit_write(struct cgroup *cgrp, struct cftype *cft, > + const char *buffer) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + unsigned long val; > + char *endptr; > + > + val = simple_strtoul(buffer, &endptr, 0); strict_strtoul() please. It's stricter. > + if (val > 100) > + return -EINVAL; > + > + memcg->notify_limit_percent = val; > + > + /* Check to see if the new percentage limit should cause notification. > + */ > + test_and_wakeup_notify(memcg, memcg->res.limit); > + > + return 0; > +} > + > +static ssize_t notify_limit_usage_read(struct cgroup *cgrp, struct cftype *cft, > + struct file *file, > + char __user *buf, size_t nbytes, > + loff_t *ppos) > +{ > + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); > + char tmp[CGROUP_LOCAL_BUFFER_SIZE]; > + unsigned long usage_pct; > + int len; > + > + usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit); > + > + len = sprintf(tmp, "%lu\n", usage_pct); scnprintf() > + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len); > +} > + > +static ssize_t notify_limit_lowait(struct cgroup *cgrp, struct cftype *cft, Perhaps this function would benefit from a nice comment explaining its design. It's the core thing. Than again, the .txt file has good coverage. > + struct file *file, > + char __user *buf, size_t nbytes, > + loff_t *ppos) > +{ > + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); > + char tmp[CGROUP_LOCAL_BUFFER_SIZE]; > + unsigned long usage_pct; > + int len; > + DEFINE_WAIT(notify_lowait); > + > + /* A memory resource usage of zero is a special case that > + * causes us not to sleep. It normally happens when the > + * cgroup is about to be destroyed, and we don't want someone > + * trying to sleep on a queue that is about to go away. This > + * condition can also be forced as part of testing. > + */ > + usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit); > + if (likely(mem->res.usage != 0)) { > + > + prepare_to_wait(&mem->notify_limit_wait, ¬ify_lowait, > + TASK_INTERRUPTIBLE); > + > + if (usage_pct < mem->notify_limit_percent) { > + schedule(); > + > + /* Compute percentage we have now and return it. > + */ > + usage_pct = compute_usage_percent(mem->res.usage, > + mem->res.limit); > + } > + finish_wait(&mem->notify_limit_wait, ¬ify_lowait); > + } > + > + len = sprintf(tmp, "%lu\n", usage_pct); scnprintf() > + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len); > +} What is the behavior of this read when someone sends the process a signal? Seems that it will return early, giving userspace a number which it didn't expect to see. I guess that's OK. > +/* This is used to wake up all threads that may be hanging > + * out waiting for a low memory condition prior to that happening. > + * Useful for triggering the event to assist with debug of applications. > + */ > +static int notify_limit_wake_em_up(struct cgroup *cgrp, unsigned int event) > +{ > + struct mem_cgroup *mem; > + > + mem = mem_cgroup_from_cont(cgrp); > + wake_up(&mem->notify_limit_wait); > + return 0; > +} > +#endif /* CONFIG_CGROUP_MEM_NOTIFY */ > + > > static struct cftype mem_cgroup_files[] = { > { > @@ -2258,6 +2430,22 @@ static struct cftype mem_cgroup_files[] = { > .read_u64 = mem_cgroup_swappiness_read, > .write_u64 = mem_cgroup_swappiness_write, > }, > +#ifdef CONFIG_CGROUP_MEM_NOTIFY > + { > + .name = "notify_limit_percent", > + .write_string = notify_limit_write, > + .read = notify_limit_read, > + }, > + { > + .name = "notify_limit_usage", > + .read = notify_limit_usage_read, > + }, > + { > + .name = "notify_limit_lowait", > + .trigger = notify_limit_wake_em_up, > + .read = notify_limit_lowait, > + }, > +#endif > }; > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP > @@ -2461,6 +2649,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > mem->last_scanned_child = 0; > spin_lock_init(&mem->reclaim_param_lock); > > +#ifdef CONFIG_CGROUP_MEM_NOTIFY > + init_waitqueue_head(&mem->notify_limit_wait); > + mem->notify_limit_percent = 100; > +#endif > + > if (parent) > mem->swappiness = get_swappiness(parent); > atomic_set(&mem->refcnt, 1); > @@ -2504,6 +2697,20 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss, > struct cgroup *old_cont, > struct task_struct *p) > { > +#ifdef CONFIG_CGROUP_MEM_NOTIFY > + /* We wake up all notification threads any time a migration takes > + * place. They will have to check to see if a move is needed to > + * a new cgroup file to wait for notification. > + * This isn't so much a task move as it is an attach. A thread not > + * a child of an existing task won't have a valid parent, which > + * is necessary to test because it won't have a valid mem_cgroup > + * either. Which further means it won't have a proper wait queue > + * and we can't do a wakeup. > + */ > + if (old_cont->parent != NULL) > + notify_limit_wake_em_up(old_cont, 0); > +#endif > + -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/