Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753103AbZDMWPk (ORCPT ); Mon, 13 Apr 2009 18:15:40 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1750836AbZDMWPc (ORCPT ); Mon, 13 Apr 2009 18:15:32 -0400 Received: from postoffice.embeddedalley.com ([65.114.89.228]:33105 "HELO postoffice.embeddedalley.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1750713AbZDMWPb (ORCPT ); Mon, 13 Apr 2009 18:15:31 -0400 X-Greylist: delayed 400 seconds by postgrey-1.27 at vger.kernel.org; Mon, 13 Apr 2009 18:15:30 EDT From: Dan Malek To: Linux Kernel Mailing List Cc: Paul Menage , Dan Malek Subject: [PATCH] Memory usage limit notification addition to memcg Date: Mon, 13 Apr 2009 18:08:32 -0400 Message-Id: <1239660512-25468-1-git-send-email-dan@embeddedalley.com> X-Mailer: git-send-email 1.6.2.GIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 17011 Lines: 461 This patch updates the Memory Controller cgroup to add a configurable memory usage limit notification. The feature was presented at the April 2009 Embedded Linux Conference. Signed-off-by: Dan Malek --- Documentation/cgroups/mem_notify.txt | 129 +++++++++++++++++++++ include/linux/memcontrol.h | 7 + init/Kconfig | 9 ++ mm/memcontrol.c | 207 ++++++++++++++++++++++++++++++++++ 4 files changed, 352 insertions(+), 0 deletions(-) create mode 100644 Documentation/cgroups/mem_notify.txt diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt new file mode 100644 index 0000000..72d5c26 --- /dev/null +++ b/Documentation/cgroups/mem_notify.txt @@ -0,0 +1,129 @@ + +Memory Limit Notification + +Attempts have been made in the past to provide a mechanism for +the notification to processes (task, an address space) when memory +usage is approaching a high limit. 
The intention is that it gives +the application an opportunity to release some memory and continue +operation rather than be OOM killed. The CE Linux Forum requested +a more contemporary implementation, and this is the result. + +The memory limit notification is a configurable extension to the +existing Memory Resource Controller. Please read memory.txt in this +directory to understand its operation before continuing here. + +1. Operation + +When a kernel is configured with CGROUP_MEM_NOTIFY, three additional +files will appear in the memory resource controller: + + memory.notify_limit_percent + memory.notify_limit_usage + memory.notify_limit_lowait + +The notification is based upon reaching a percentage of the memory +resource controller limit (memory.limit_in_bytes). When the controller +group is created, the percentage is set to 100. Any integer percentage +from 0 to 100 may be set by writing to memory.notify_limit_percent, such as: + + echo 80 > memory.notify_limit_percent + +The current integer usage percentage may be read at any time from +the memory.notify_limit_usage file. + +The memory.notify_limit_lowait is a blocking read file. The read will +block until one of four conditions occurs: + + - The usage reaches or exceeds the memory.notify_limit_percent + - The memory.notify_limit_lowait file is written with any value (debug) + - A thread is moved to another controller group + - The cgroup is destroyed or forced empty (memory.force_empty) + + +1.1 Example Usage + +An application must be designed to properly take advantage of this +memory limit notification feature. It is a powerful management component +of some operating systems and embedded devices that must provide +highly available and reliable computing services. The application works +in conjunction with information provided by the operating system to +control limited resource usage. 
Since many programmers still think +memory is infinite and never check the return value from malloc(), it +may come as a surprise that such mechanisms have been in use for a long time. + +A typical application will be multithreaded, with one thread either +polling or waiting for the notification event. When the event occurs, +the thread will take whatever action is appropriate within the application +design. This could be actually running a garbage collection algorithm +or simply signaling other processing threads that they must do something to +reduce their memory usage. The notification thread will then be required +to poll the actual usage until the low limit of its choosing is met, +at which time the reclaim of memory can stop and the notification thread +will wait for the next event. + +Internally, the application only needs to fopen("memory.notify_limit_usage" ...) +and fopen("memory.notify_limit_lowait" ...), then either poll the former +file or block read on the latter file using fread() or fscanf() as desired. + +2. Configuration + +Follow the instructions in memory.txt for the configuration and usage of +the Memory Resource Controller cgroup. Once this is created and tasks +assigned, use the memory limit notification as described here. + +The only action that is needed outside of the application waiting or polling +is to set the memory.notify_limit_percent. Setting a notification to occur +when memory usage of the cgroup reaches or exceeds 80 percent can be +done simply: + + echo 80 > memory.notify_limit_percent + +This value may be read or changed at any time. Writing a lower value once +the Memory Resource Controller is in operation may trigger immediate +notification if the usage is above the new limit. + +3. Debug and Testing + +The design of cgroups makes it easier to perform some debugging or +monitoring tasks without modification to the application. 
For example, +a write of any value to memory.notify_limit_lowait will wake up all +threads waiting for notifications regardless of current memory usage. + +Collecting performance data about the cgroup is also simplified, as +no application modifications are necessary. A separate task can be +created that will open and monitor any necessary files of the cgroup +(such as current limits, usage and usage percentages and even when +notification occurs). This task can also operate outside of the cgroup, +so its memory usage is not charged to the cgroup. + +4. Design + +The memory limit notification is a configurable extension to the +existing Memory Resource Controller, which operates as described to +track and manage the memory of the Control Group. The Memory Resource +Controller will still continue to reclaim memory under pressure +of the limits, and may OOM kill tasks within the cgroup according to +the OOM Killer configuration. + +The memory notification limit was chosen as a percentage of the +memory in use so the cgroup parameters may continue to be dynamically +modified without the need to modify the notification parameters. +Otherwise, the notification limit would have to also be computed +and modified on any Memory Resource Controller operating parameter change. + +The cgroup file semantics are not well suited for this type of notification +mechanism. While applications may choose to simply poll the current +usage at their convenience, it was also desired to have a notification +event that would trigger when the usage attained the limit. The +blocking read() was chosen, as it is the only current useful method. +This presented the problem of "out of band" notification, when you want +to return some exceptional status other than reaching the notification +limit. In the cases listed above, the read() on the memory.notify_limit_lowait +file will not block and will return "0" for the percentage. 
When this occurs, +the thread must determine if the task has moved to a new cgroup or if +the cgroup has been destroyed. Due to the usage model of this cgroup, +neither is likely to happen during normal operation of a product. + +Dan Malek +Embedded Alley Solutions, Inc. +10 March 2009 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 18146c9..031e5d1 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -117,6 +117,13 @@ static inline bool mem_cgroup_disabled(void) extern bool mem_cgroup_oom_called(struct task_struct *task); +#ifdef CONFIG_CGROUP_MEM_NOTIFY +extern void test_and_wakeup_notify(struct mem_cgroup *mcg, + unsigned long long newlimit); +extern unsigned long compute_usage_percent(unsigned long long usage, + unsigned long long limit); +#endif + #else /* CONFIG_CGROUP_MEM_RES_CTLR */ struct mem_cgroup; diff --git a/init/Kconfig b/init/Kconfig index f2f9b53..97138da 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -588,6 +588,15 @@ config CGROUP_MEM_RES_CTLR This config option also selects MM_OWNER config option, which could in turn add some fork/exit overhead. +config CGROUP_MEM_NOTIFY + bool "Memory Usage Limit Notification" + depends on CGROUP_MEM_RES_CTLR + help + Provides a memory notification when usage reaches a preset limit. + It is an extension to the memory resource controller, since it + uses the memory usage accounting of the cgroup to test against + the notification limit. (See Documentation/cgroups/mem_notify.txt) + config CGROUP_MEM_RES_CTLR_SWAP bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)" depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2fc6d6c..d6367ed 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6,6 +6,10 @@ * Copyright 2007 OpenVZ SWsoft Inc * Author: Pavel Emelianov * + * Memory Limit Notification update + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc. 
+ * Author: Dan Malek + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -180,6 +184,11 @@ struct mem_cgroup { * statistics. This must be placed at the end of memcg. */ struct mem_cgroup_stat stat; + +#ifdef CONFIG_CGROUP_MEM_NOTIFY + unsigned long notify_limit_percent; + wait_queue_head_t notify_limit_wait; +#endif }; enum charge_type { @@ -934,6 +943,21 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, VM_BUG_ON(mem_cgroup_is_obsolete(mem)); +#ifdef CONFIG_CGROUP_MEM_NOTIFY + /* We check on the way in so we don't have to duplicate code + * in both the normal and error exit path. + */ + if (likely(mem->res.limit != (unsigned long long)LLONG_MAX)) { + unsigned long usage_pct; + + usage_pct = compute_usage_percent(mem->res.usage + PAGE_SIZE, + mem->res.limit); + if ((usage_pct >= mem->notify_limit_percent) && + waitqueue_active(&mem->notify_limit_wait)) + wake_up(&mem->notify_limit_wait); + } +#endif + while (1) { int ret; bool noswap = false; @@ -1663,6 +1687,13 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg, int children = mem_cgroup_count_children(memcg); u64 curusage, oldusage; +#ifdef CONFIG_CGROUP_MEM_NOTIFY + /* Test and notify ahead of the necessity to free pages, as + * applications giving up pages may help this reclaim procedure. + */ + test_and_wakeup_notify(memcg, val); +#endif + /* * For keeping hierarchical_reclaim simple, how long we should retry * is depends on callers. 
We set our retry-count to be function @@ -2215,6 +2246,147 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft, return 0; } +#ifdef CONFIG_CGROUP_MEM_NOTIFY +#define CGROUP_LOCAL_BUFFER_SIZE 64 /* Would be nice if this was in cgroup.h */ + +/* The resource counters are defined as long long, but few processors + * handle 64-bit divisor in hardware, and the software to do it isn't + * present in the kernel. It would be nice if the resource counters were + * platform specific configurable typedefs, but for now we'll just divide + * down the byte counters by the page size to get 32-bit arithmetic. + * With a 4K page size, this will work up to about 16384G resource limit. + */ +unsigned long compute_usage_percent(unsigned long long usage, + unsigned long long limit) +{ + unsigned long lim; + unsigned long long usage_pct; + + usage_pct = (usage / PAGE_SIZE) * 100; + lim = (unsigned long)(limit / PAGE_SIZE); + + do_div(usage_pct, lim); + + return (unsigned long)usage_pct; +} + +void test_and_wakeup_notify(struct mem_cgroup *mcg, unsigned long long newlimit) +{ + unsigned long usage_pct; + + /* Check to see if the new limit should cause notification. 
+ */ + usage_pct = compute_usage_percent(mcg->res.usage, newlimit); + + if ((usage_pct >= mcg->notify_limit_percent) && + waitqueue_active(&mcg->notify_limit_wait)) + wake_up(&mcg->notify_limit_wait); +} + +static ssize_t notify_limit_read(struct cgroup *cgrp, struct cftype *cft, + struct file *file, + char __user *buf, size_t nbytes, + loff_t *ppos) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + char tmp[CGROUP_LOCAL_BUFFER_SIZE]; + int len; + + len = sprintf(tmp, "%lu\n", memcg->notify_limit_percent); + + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len); +} + +static int notify_limit_write(struct cgroup *cgrp, struct cftype *cft, + const char *buffer) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + unsigned long val; + char *endptr; + + val = simple_strtoul(buffer, &endptr, 0); + if (val > 100) + return -EINVAL; + + memcg->notify_limit_percent = val; + + /* Check to see if the new percentage limit should cause notification. + */ + test_and_wakeup_notify(memcg, memcg->res.limit); + + return 0; +} + +static ssize_t notify_limit_usage_read(struct cgroup *cgrp, struct cftype *cft, + struct file *file, + char __user *buf, size_t nbytes, + loff_t *ppos) +{ + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); + char tmp[CGROUP_LOCAL_BUFFER_SIZE]; + unsigned long usage_pct; + int len; + + usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit); + + len = sprintf(tmp, "%lu\n", usage_pct); + + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len); +} + +static ssize_t notify_limit_lowait(struct cgroup *cgrp, struct cftype *cft, + struct file *file, + char __user *buf, size_t nbytes, + loff_t *ppos) +{ + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); + char tmp[CGROUP_LOCAL_BUFFER_SIZE]; + unsigned long usage_pct; + int len; + DEFINE_WAIT(notify_lowait); + + /* A memory resource usage of zero is a special case that + * causes us not to sleep. 
It normally happens when the + * cgroup is about to be destroyed, and we don't want someone + * trying to sleep on a queue that is about to go away. This + * condition can also be forced as part of testing. + */ + usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit); + if (likely(mem->res.usage != 0)) { + + prepare_to_wait(&mem->notify_limit_wait, &notify_lowait, + TASK_INTERRUPTIBLE); + + if (usage_pct < mem->notify_limit_percent) { + schedule(); + + /* Compute percentage we have now and return it. + */ + usage_pct = compute_usage_percent(mem->res.usage, + mem->res.limit); + } + finish_wait(&mem->notify_limit_wait, &notify_lowait); + } + + len = sprintf(tmp, "%lu\n", usage_pct); + + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len); +} + +/* This is used to wake up all threads that may be hanging + * out waiting for a low memory condition prior to that happening. + * Useful for triggering the event to assist with debug of applications. + */ +static int notify_limit_wake_em_up(struct cgroup *cgrp, unsigned int event) +{ + struct mem_cgroup *mem; + + mem = mem_cgroup_from_cont(cgrp); + wake_up(&mem->notify_limit_wait); + return 0; +} +#endif /* CONFIG_CGROUP_MEM_NOTIFY */ + static struct cftype mem_cgroup_files[] = { { @@ -2258,6 +2430,22 @@ static struct cftype mem_cgroup_files[] = { .read_u64 = mem_cgroup_swappiness_read, .write_u64 = mem_cgroup_swappiness_write, }, +#ifdef CONFIG_CGROUP_MEM_NOTIFY + { + .name = "notify_limit_percent", + .write_string = notify_limit_write, + .read = notify_limit_read, + }, + { + .name = "notify_limit_usage", + .read = notify_limit_usage_read, + }, + { + .name = "notify_limit_lowait", + .trigger = notify_limit_wake_em_up, + .read = notify_limit_lowait, + }, +#endif }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -2461,6 +2649,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) mem->last_scanned_child = 0; spin_lock_init(&mem->reclaim_param_lock); +#ifdef CONFIG_CGROUP_MEM_NOTIFY + 
init_waitqueue_head(&mem->notify_limit_wait); + mem->notify_limit_percent = 100; +#endif + if (parent) mem->swappiness = get_swappiness(parent); atomic_set(&mem->refcnt, 1); @@ -2504,6 +2697,20 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss, struct cgroup *old_cont, struct task_struct *p) { +#ifdef CONFIG_CGROUP_MEM_NOTIFY + /* We wake up all notification threads any time a migration takes + * place. They will have to check to see if a move is needed to + * a new cgroup file to wait for notification. + * This isn't so much a task move as it is an attach. A thread not + * a child of an existing task won't have a valid parent, which + * is necessary to test because it won't have a valid mem_cgroup + * either. Which further means it won't have a proper wait queue + * and we can't do a wakeup. + */ + if (old_cont->parent != NULL) + notify_limit_wake_em_up(old_cont, 0); +#endif + mutex_lock(&memcg_tasklist); /* * FIXME: It's better to move charges of this process from old -- 1.6.2.GIT
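
For readers who want to check the percentage arithmetic outside the kernel, here is a minimal user-space sketch of what compute_usage_percent() and the wakeup test in the patch compute. It is not part of the patch. The names sketch_usage_percent, sketch_should_notify and SKETCH_PAGE_SIZE are hypothetical, a 4K page size is assumed, and plain 64-bit division stands in for the kernel's do_div().

```c
#include <assert.h>

/* Hypothetical user-space stand-in for the kernel's PAGE_SIZE; 4K is assumed. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Same arithmetic as the patch's compute_usage_percent(): scale both byte
 * counts down to whole pages first, then take the truncated integer
 * percentage. Plain 64-bit division replaces do_div() in this sketch.
 */
unsigned long sketch_usage_percent(unsigned long long usage,
				   unsigned long long limit)
{
	unsigned long long pages_used = usage / SKETCH_PAGE_SIZE;
	unsigned long long pages_limit = limit / SKETCH_PAGE_SIZE;

	/* A limit below one page would divide by zero; the sketch rejects it. */
	assert(pages_limit != 0);

	return (unsigned long)((pages_used * 100) / pages_limit);
}

/* Nonzero when the usage percentage has reached the notification percentage. */
int sketch_should_notify(unsigned long long usage, unsigned long long limit,
			 unsigned long notify_limit_percent)
{
	return sketch_usage_percent(usage, limit) >= notify_limit_percent;
}
```

Because both operands are truncated to whole pages and the result is truncated again, a cgroup with a 100 MiB limit reports 79 until a full 80 MiB of pages is in use. Note that the charge path in the patch adds PAGE_SIZE to the usage before making this comparison, which the sketch leaves to the caller.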