From: Vladislav Buzov
To: Linux Kernel Mailing List
Cc: Linux Containers Mailing List, Dan Malek, Andrew Morton, Paul Menage, KAMEZAWA Hiroyuki, Balbir Singh, Vladislav Buzov
Subject: [PATCH 1/1] Memory usage limit notification addition to memcg
Date: Tue, 7 Jul 2009 13:25:10 -0700
Message-Id: <1246998310-16764-2-git-send-email-vbuzov@embeddedalley.com>
In-Reply-To: <1246998310-16764-1-git-send-email-vbuzov@embeddedalley.com>
References: <1239660512-25468-1-git-send-email-dan@embeddedalley.com> <1246998310-16764-1-git-send-email-vbuzov@embeddedalley.com>

This patch updates the Memory Controller cgroup to add a configurable
memory usage limit notification. The feature was presented at the
April 2009 Embedded Linux Conference.

Signed-off-by: Dan Malek
Signed-off-by: Vladislav Buzov
---
 Documentation/cgroups/mem_notify.txt |  140 ++++++++++++++++++++++++++
 include/linux/memcontrol.h           |   21 ++++
 init/Kconfig                         |    9 ++
 mm/memcontrol.c                      |  178 ++++++++++++++++++++++++++++++++++
 4 files changed, 348 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/mem_notify.txt

diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
new file mode 100644
index 0000000..b4f20d0
--- /dev/null
+++ b/Documentation/cgroups/mem_notify.txt
@@ -0,0 +1,140 @@
+
+Memory Limit Notification
+
+Attempts have been made in the past to provide a mechanism for
+the notification to processes (task, an address space) when memory
+usage is approaching a high limit. The intention is that it gives
+the application an opportunity to release some memory and continue
+operation rather than be OOM killed. The CE Linux Forum requested
+a more contemporary implementation, and this is the result.
+
+The memory threshold notification is a configurable extension to the
+existing Memory Resource Controller. Please read memory.txt in this
+directory to understand its operation before continuing here.
+
+1. Operation
+
+When a kernel is configured with CGROUP_MEM_NOTIFY, three additional
+files will appear in the memory resource controller:
+
+	memory.notify_threshold_in_bytes
+	memory.notify_available_in_bytes
+	memory.notify_threshold_lowait
+
+The notification is based upon reaching a threshold below the memory
+resource controller limit (memory.limit_in_bytes). The threshold
+represents the minimal number of bytes that should be available under
+the limit. When the controller group is created, the threshold is set
+to zero, which triggers notification when the memory resource controller
+limit is reached.
+
+The threshold may be set by writing to memory.notify_threshold_in_bytes,
+such as:
+
+	echo 10M > memory.notify_threshold_in_bytes
+
+The current number of available bytes may be read at any time from
+the memory.notify_available_in_bytes file.
+
+The memory.notify_threshold_lowait is a blocking read file. The read will
+block until one of four conditions occurs:
+
+  - The amount of available memory is equal to or less than the threshold
+    defined in memory.notify_threshold_in_bytes
+  - The memory.notify_threshold_lowait file is written with any value (debug)
+  - A thread is moved to another controller group
+  - The cgroup is destroyed or forced empty (memory.force_empty)
+
+
+1.1 Example Usage
+
+An application must be designed to properly take advantage of this
+memory threshold notification feature. It is a powerful management component
+of some operating systems and embedded devices that must provide
+highly available and reliable computing services. The application works
+in conjunction with information provided by the operating system to
+control limited resource usage. Since many programmers still think
+memory is infinite and never check the return value from malloc(), it
+may come as a surprise that such mechanisms have long been in use.
+
+A typical application will be multithreaded, with one thread either
+polling or waiting for the notification event. When the event occurs,
+the thread will take whatever action is appropriate within the application
+design. This could mean actually running a garbage collection algorithm
+or simply signaling other processing threads that they must do something to
+reduce their memory usage. The notification thread will then be required
+to poll the actual usage until the low limit of its choosing is met,
+at which time the reclaim of memory can stop and the notification thread
+will wait for the next event.
+
+Internally, the application only needs to
+fopen("memory.notify_available_in_bytes" ...) or
+fopen("memory.notify_threshold_lowait" ...), then either poll the former
+file or block read on the latter file using fread() or fscanf() as desired.
+Comparing the value returned from either of these read functions with the
+value obtained by reading memory.notify_threshold_in_bytes will be an
+indication of the amount of memory used over the threshold limit.
+
+2. Configuration
+
+Follow the instructions in memory.txt for the configuration and usage of
+the Memory Resource Controller cgroup. Once this is created and tasks
+assigned, use the memory threshold notification as described here.
+
+The only action that is needed outside of the application waiting or polling
+is to set the memory.notify_threshold_in_bytes. Setting a notification to occur
+when memory usage of the cgroup reaches or exceeds 1 MByte below the limit
+can be done simply with:
+
+	echo 1M > memory.notify_threshold_in_bytes
+
+This value may be read or changed at any time. Writing a higher value once
+the Memory Resource Controller is in operation may trigger immediate
+notification if the usage is above the new threshold.
+
+3. Debug and Testing
+
+The design of cgroups makes it easier to perform some debugging or
+monitoring tasks without modification to the application. For example,
+a write of any value to memory.notify_threshold_lowait will wake up all
+threads waiting for notifications regardless of current memory usage.
+
+Collecting performance data about the cgroup is also simplified, as
+no application modifications are necessary. A separate task can be
+created that will open and monitor any necessary files of the cgroup
+(such as current limits, usage and usage percentages and even when
+notification occurs). This task can also operate outside of the cgroup,
+so its memory usage is not charged to the cgroup.
+
+4. Design
+
+The memory threshold notification is a configurable extension to the
+existing Memory Resource Controller, which operates as described to
+track and manage the memory of the Control Group. The Memory Resource
+Controller will still continue to reclaim memory under pressure
+of the limits, and may OOM kill tasks within the cgroup according to
+the OOM Killer configuration.
+
+The memory notification threshold was chosen as a number of bytes of the
+memory not in use so the cgroup parameters may continue to be dynamically
+modified without the need to modify the notification parameters.
+Otherwise, the notification threshold would have to also be computed
+and modified on any Memory Resource Controller operating parameter change.
+
+The cgroup file semantics are not well suited for this type of notification
+mechanism. While applications may choose to simply poll the current
+usage at their convenience, it was also desired to have a notification
+event that would trigger when the usage attained the threshold. The
+blocking read() was chosen, as it is the only current useful method.
+This presented the problem of "out of band" notification, when you want
+to return some exceptional status other than reaching the notification
+threshold. In the cases listed above, the read() on the
+memory.notify_threshold_lowait file will not block and returns "0" for
+the remaining size. When this occurs, the thread must determine if the task
+has moved to a new cgroup or if the cgroup has been destroyed. Due to
+the usage model of this cgroup, neither is likely to happen during normal
+operation of a product.
+
+Dan Malek
+Embedded Alley Solutions, Inc.
+6 July 2009
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e46a073..78205a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -118,6 +118,27 @@ static inline bool mem_cgroup_disabled(void)
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
 void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
+
+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
+			unsigned long long usage, unsigned long long limit);
+void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
+			unsigned long long newlimit);
+void mem_cgroup_notify_move_task(struct cgroup *old_cont);
+#else
+static inline void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
+			unsigned long long usage, unsigned long long limit)
+{
+}
+static inline void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
+			unsigned long long newlimit)
+{
+}
+static inline void mem_cgroup_notify_move_task(struct cgroup *old_cont)
+{
+}
+#endif
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
diff --git a/init/Kconfig b/init/Kconfig
index 1ce05a4..fb2f7d5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -594,6 +594,15 @@ config CGROUP_MEM_RES_CTLR
 	  This config option also selects MM_OWNER config option, which
 	  could in turn add some fork/exit overhead.
 
+config CGROUP_MEM_NOTIFY
+	bool "Memory Usage Limit Notification"
+	depends on CGROUP_MEM_RES_CTLR
+	help
+	  Provides a memory notification when usage reaches a preset limit.
+	  It is an extension to the memory resource controller, since it
+	  uses the memory usage accounting of the cgroup to test against
+	  the notification limit. (See Documentation/cgroups/mem_notify.txt)
+
 config CGROUP_MEM_RES_CTLR_SWAP
 	bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
 	depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2fa20d..cf04279 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6,6 +6,10 @@
  * Copyright 2007 OpenVZ SWsoft Inc
  * Author: Pavel Emelianov
  *
+ * Memory Limit Notification update
+ * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
+ * Author: Dan Malek
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -180,6 +184,11 @@ struct mem_cgroup {
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
 
+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+	unsigned long long notify_threshold_bytes;
+	wait_queue_head_t notify_threshold_wait;
+#endif
+
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
@@ -995,6 +1004,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 
 	VM_BUG_ON(css_is_removed(&mem->css));
 
+	/*
+	 * We check on the way in so we don't have to duplicate code
+	 * in both the normal and error exit path.
+	 */
+	mem_cgroup_notify_test_and_wakeup(mem, mem->res.usage + PAGE_SIZE,
+						mem->res.limit);
+
 	while (1) {
 		int ret;
 		bool noswap = false;
@@ -1744,6 +1760,12 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 	u64 curusage, oldusage;
 
 	/*
+	 * Test and notify ahead of the necessity to free pages, as
+	 * applications giving up pages may help this reclaim procedure.
+	 */
+	mem_cgroup_notify_new_limit(memcg, val);
+
+	/*
 	 * For keeping hierarchical_reclaim simple, how long we should retry
 	 * is depends on callers. We set our retry-count to be function
 	 * of # of children which we should visit in this loop.
@@ -2308,6 +2330,139 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+/*
+ * Check if a task exceeded notification threshold set for a memory cgroup.
+ * Wake up waiting notification threads, if any.
+ */
+void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
+					unsigned long long usage,
+					unsigned long long limit)
+{
+	if (unlikely(usage == RESOURCE_MAX))
+		return;
+
+	if ((limit - usage <= mcg->notify_threshold_bytes) &&
+	    waitqueue_active(&mcg->notify_threshold_wait))
+		wake_up(&mcg->notify_threshold_wait);
+}
+/*
+ * Check if current notification threshold exceeds new memory usage
+ * limit set for a memory cgroup. If so, set threshold to zero to
+ * notify tasks in the group when maximal memory usage is achieved.
+ */
+void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
+					unsigned long long newlimit)
+{
+	if (newlimit <= mcg->notify_threshold_bytes)
+		mcg->notify_threshold_bytes = 0;
+
+	mem_cgroup_notify_test_and_wakeup(mcg, mcg->res.usage, newlimit);
+}
+
+static u64 mem_cgroup_notify_threshold_read(struct cgroup *cgrp,
+						struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	return memcg->notify_threshold_bytes;
+}
+
+static int mem_cgroup_notify_threshold_write(struct cgroup *cgrp,
+						struct cftype *cft,
+						const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	unsigned long long val;
+	int ret;
+
+	/* This function does all necessary parse...reuse it */
+	ret = res_counter_memparse_write_strategy(buffer, &val);
+	if (ret)
+		return ret;
+
+	/* Threshold must be lower than usage limit */
+	if (val >= memcg->res.limit)
+		return -EINVAL;
+
+	memcg->notify_threshold_bytes = val;
+
+	/* Check to see if the new threshold should cause notification */
+	mem_cgroup_notify_test_and_wakeup(memcg, memcg->res.usage,
+						memcg->res.limit);
+
+	return 0;
+}
+
+static u64 mem_cgroup_notify_available_read(struct cgroup *cgrp,
+						struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	return memcg->res.limit - memcg->res.usage;
+}
+
+static u64 mem_cgroup_notify_threshold_lowait(struct cgroup *cgrp,
+						struct cftype *cft)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	unsigned long long available_bytes;
+	DEFINE_WAIT(notify_lowait);
+
+	/*
+	 * A memory resource usage of zero is a special case that
+	 * causes us not to sleep. It normally happens when the
+	 * cgroup is about to be destroyed, and we don't want someone
+	 * trying to sleep on a queue that is about to go away. This
+	 * condition can also be forced as part of testing.
+	 */
+	available_bytes = mem->res.limit - mem->res.usage;
+	if (likely(mem->res.usage != 0)) {
+
+		prepare_to_wait(&mem->notify_threshold_wait, &notify_lowait,
+							TASK_INTERRUPTIBLE);
+
+		if (available_bytes > mem->notify_threshold_bytes)
+			schedule();
+
+		available_bytes = mem->res.limit - mem->res.usage;
+
+		finish_wait(&mem->notify_threshold_wait, &notify_lowait);
+	}
+
+	return available_bytes;
+}
+
+/*
+ * This is used to wake up all threads that may be hanging
+ * out waiting for a low memory condition prior to that happening.
+ * Useful for triggering the event to assist with debug of applications.
+ */
+static int mem_cgroup_notify_threshold_wake_em_up(struct cgroup *cgrp,
+						unsigned int event)
+{
+	struct mem_cgroup *mem;
+
+	mem = mem_cgroup_from_cont(cgrp);
+	wake_up(&mem->notify_threshold_wait);
+	return 0;
+}
+
+/*
+ * We wake up all notification threads any time a migration takes
+ * place. They will have to check to see if a move is needed to
+ * a new cgroup file to wait for notification.
+ * This isn't so much a task move as it is an attach. A thread not
+ * a child of an existing task won't have a valid parent, which
+ * is necessary to test because it won't have a valid mem_cgroup
+ * either. Which further means it won't have a proper wait queue
+ * and we can't do a wakeup.
+ */
+void mem_cgroup_notify_move_task(struct cgroup *old_cont)
+{
+	if (old_cont->parent != NULL)
+		mem_cgroup_notify_threshold_wake_em_up(old_cont, 0);
+}
+#endif /* CONFIG_CGROUP_MEM_NOTIFY */
+
 
 static struct cftype mem_cgroup_files[] = {
 	{
@@ -2351,6 +2506,22 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_swappiness_read,
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+	{
+		.name = "notify_threshold_in_bytes",
+		.write_string = mem_cgroup_notify_threshold_write,
+		.read_u64 = mem_cgroup_notify_threshold_read,
+	},
+	{
+		.name = "notify_available_in_bytes",
+		.read_u64 = mem_cgroup_notify_available_read,
+	},
+	{
+		.name = "notify_threshold_lowait",
+		.trigger = mem_cgroup_notify_threshold_wake_em_up,
+		.read_u64 = mem_cgroup_notify_threshold_lowait,
+	},
+#endif
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -2554,6 +2725,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);
 
+#ifdef CONFIG_CGROUP_MEM_NOTIFY
+	init_waitqueue_head(&mem->notify_threshold_wait);
+	mem->notify_threshold_bytes = 0;
+#endif
+
 	if (parent)
 		mem->swappiness = get_swappiness(parent);
 	atomic_set(&mem->refcnt, 1);
@@ -2597,6 +2773,8 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *old_cont,
 				struct task_struct *p)
 {
+	mem_cgroup_notify_move_task(old_cont);
+
 	mutex_lock(&memcg_tasklist);
 	/*
 	 * FIXME: It's better to move charges of this process from old
-- 
1.5.6.3
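
A user-space sketch of the comparison described in section 1.1 of
mem_notify.txt may help when trying the interface. It is not part of the
patch. The helper names read_cgroup_u64() and bytes_over_threshold() are
hypothetical, and the control files exist only on a kernel built with
CGROUP_MEM_NOTIFY from this patch. Called on memory.notify_threshold_lowait,
the fscanf() in the first helper would be expected to block until one of
the wakeup conditions listed in the documentation occurs.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one unsigned decimal value from a cgroup control file.
 * Returns 0 and stores the value in *out on success, -1 if the file
 * cannot be opened or does not start with a number.
 * Assumption: the file prints a single u64 in decimal, as the
 * read_u64 handlers in the patch suggest.
 */
int read_cgroup_u64(const char *path, unsigned long long *out)
{
	FILE *f;
	int parsed;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	parsed = (fscanf(f, "%llu", out) == 1);
	fclose(f);

	return parsed ? 0 : -1;
}

/*
 * Number of bytes by which usage has passed the notification point:
 * threshold - available when available < threshold, otherwise 0.
 */
unsigned long long bytes_over_threshold(unsigned long long available,
					unsigned long long threshold)
{
	return available < threshold ? threshold - available : 0;
}
```

A notification thread could call read_cgroup_u64() on
memory.notify_threshold_lowait and on memory.notify_threshold_in_bytes
under the cgroup mount point, then pass both values to
bytes_over_threshold() to decide how much memory to release.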