With this patch, userland applications that want to maintain
interactivity or control memory allocation cost can use the pressure
level notifications. The levels are defined as follows:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (e.g.
shut down unimportant services early).

The "medium" level means that the system is experiencing medium memory
pressure: it might be swapping, paging out active file caches,
etc. Upon this event, applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from disk.

The "critical" level means that the system is actively thrashing: it is
about to run out of memory (OOM), or the in-kernel OOM killer is even
about to trigger. Applications should do whatever they can to help the
system. It might be too late to consult vmstat or any other
statistics, so it's advisable to take immediate action.

The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: suppose you have
three cgroups: A->B->C. You set up an event listener on each of cgroups
A, B and C, and group C experiences some pressure. In this situation,
only group C will receive the notification; groups A and B will not
receive it. This is done to avoid excessive "broadcasting" of messages,
which disturbs the system and is especially bad if we are low on
memory or thrashing. So, organize the cgroups wisely, or propagate the
events manually (or ask us to implement pass-through events,
explaining why you would need them).
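
To make the delivery rule concrete, here is a small userspace model in C.
It is only a sketch, not kernel code: struct group, NO_LISTENER and
deliver() are names made up for this illustration, and each group is
assumed to have at most one listener. A listener registered for a given
level is treated as interested in that level and anything more severe,
which mirrors the "level >= ev->level" check in the patch.

```c
#include <stddef.h>

enum level { NO_LISTENER = -1, LOW = 0, MEDIUM, CRITICAL };

/* Hypothetical userspace model of one cgroup in the hierarchy. */
struct group {
	const char *name;
	const struct group *parent;	/* NULL at the top of the hierarchy */
	int listen_level;		/* NO_LISTENER, or the registered level */
};

/*
 * Walk from the group that experienced pressure towards the root and
 * return the first group whose listener accepts the event. Delivery
 * stops there, so the ancestors of that group are not notified.
 * Returns NULL if nobody handled the event.
 */
static const struct group *deliver(const struct group *origin,
				   enum level event)
{
	const struct group *g;

	for (g = origin; g != NULL; g = g->parent) {
		if (g->listen_level != NO_LISTENER &&
		    (int)event >= g->listen_level)
			return g;
	}
	return NULL;
}
```

With "low" listeners on A, B and C, an event that originates in C stops
at C. If C listens only for "critical", a "medium" event in C is not
handled there and is delivered to B instead.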

Performance-wise, the memory pressure notifications feature itself is
lightweight and does not require much bookkeeping, in contrast to the
rest of the memcg features. Unfortunately, in the current memcg
implementation, page accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for the CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.

[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
Signed-off-by: Anton Vorontsov <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
Hi all,

Here is a shiny new v3!

In v3:

- No changes in the code, just updated the commit message to incorporate
  the answer to Minchan Kim's comment regarding applicability to embedded
  use cases in light of memcg performance overhead, plus gave some
  references to Glauber Costa's memcg work.
- Rebased onto 3.9.0-rc3-next-20130321.

In v2:

- Addressed Glauber Costa's comments:
  o Use parent_mem_cgroup() instead of own parent function (also suggested
    by Kamezawa). This change also affected the events distribution logic,
    so it became more like memory thresholds notifications, i.e. we
    deliver the event to the cgroup where the event originated, not to the
    parent cgroup (this also addresses Kamezawa's remark regarding which
    cgroup receives which event);
  o Register vmpressure cgroup file directly in memcontrol.c.
- Addressed Greg Thelen's comments:
  o Fixed bool/int inconsistency in the code;
  o Fixed nr_scanned accounting;
  o Don't use cryptic 's', 'r' abbreviations; get rid of confusing
    'window' argument.
- Addressed Kamezawa Hiroyuki's comments:
  o Moved declarations from mm/internal.h into linux/vmpressure.h;
  o Removed Kconfig symbol. Vmpressure is pretty lightweight (especially
    compared to the memcg accounting). If it ever causes any measurable
    performance effect, we want to fix it, not paper it over with a
    Kconfig option. :-)
  o Removed read operation on pressure_level cgroup file. In apps, we only
    use notifications, we don't need the content of the file, so let's
    keep things simple for now. Plus this resolves questions like what
    we should return there when the system is not reclaiming;
  o Reworded documentation;
  o Improved comments for vmpressure_prio().

Old changelogs/submissions:

v2: http://lkml.org/lkml/2013/2/18/577
v1: http://lkml.org/lkml/2013/2/10/140
mempressure cgroup: http://lkml.org/lkml/2013/1/4/55

Documentation/cgroups/memory.txt | 61 +++++++++-
include/linux/vmpressure.h | 47 ++++++++
mm/Makefile | 2 +-
mm/memcontrol.c | 28 +++++
mm/vmpressure.c | 252 +++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 8 ++
6 files changed, 396 insertions(+), 2 deletions(-)
create mode 100644 include/linux/vmpressure.h
create mode 100644 mm/vmpressure.c
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index addb1f1..0c004de 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -40,6 +40,7 @@ Features:
- soft limit
- moving (recharging) account at moving a task is selectable.
- usage threshold notifier
+ - memory pressure notifier
- oom-killer disable knob and oom-notifier
- Root cgroup has no limit controls.
@@ -65,6 +66,7 @@ Brief summary of control files.
memory.stat # show various statistics
memory.use_hierarchy # set/show hierarchical account enabled
memory.force_empty # trigger forced move charge to parent
+ memory.pressure_level # set memory pressure notifications
memory.swappiness # set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -778,7 +780,64 @@ At reading, current status of OOM is shown.
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)
-11. TODO
+11. Memory Pressure
+
+The pressure level notifications can be used to monitor the memory
+allocation cost; based on the pressure, applications can implement
+different strategies of managing their memory resources. The pressure
+levels are defined as follows:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring this reclaiming activity might be useful for
+maintaining cache level. Upon notification, the program (typically
+"Activity Manager") might analyze vmstat and act in advance (i.e.
+prematurely shutdown unimportant services).
+
+The "medium" level means that the system is experiencing medium memory
+pressure, the system might be making swap, paging out active file caches,
+etc. Upon this event applications may decide to further analyze
+vmstat/zoneinfo/memcg or internal memory usage statistics and free any
+resources that can be easily reconstructed or re-read from a disk.
+
+The "critical" level means that the system is actively thrashing: it is
+about to run out of memory (OOM), or the in-kernel OOM killer is even
+about to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult with vmstat or any other
+statistics, so it's advisable to take an immediate action.
+
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. Here is what this means: for example you have
+three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
+and C, and suppose group C experiences some pressure. In this situation,
+only group C will receive the notification, i.e. groups A and B will not
+receive it. This is done to avoid excessive "broadcasting" of messages,
+which disturbs the system and which is especially bad if we are low on
+memory or thrashing. So, organize the cgroups wisely, or propagate the
+events manually (or, ask us to implement the pass-through events,
+explaining why would you need them.)
+
+The file memory.pressure_level is only used to set up an eventfd;
+read/write operations are not implemented.
+
+Test:
+
+ Here is a small script example that makes a new cgroup, sets up a
+ memory limit, sets up a notification in the cgroup and then makes child
+ cgroup experience a critical pressure:
+
+ # cd /sys/fs/cgroup/memory/
+ # mkdir foo
+ # cd foo
+ # cgroup_event_listener memory.pressure_level low &
+ # echo 8000000 > memory.limit_in_bytes
+ # echo 8000000 > memory.memsw.limit_in_bytes
+ # echo $$ > tasks
+ # dd if=/dev/zero | read x
+
+ (Expect a bunch of notifications, and eventually, the oom-killer will
+ trigger.)
+
+12. TODO
1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
new file mode 100644
index 0000000..fa84783
--- /dev/null
+++ b/include/linux/vmpressure.h
@@ -0,0 +1,47 @@
+#ifndef __LINUX_VMPRESSURE_H
+#define __LINUX_VMPRESSURE_H
+
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/cgroup.h>
+
+struct vmpressure {
+ unsigned int scanned;
+ unsigned int reclaimed;
+ /* The lock is used to keep the scanned/reclaimed above in sync. */
+ struct mutex sr_lock;
+
+ struct list_head events;
+ /* Have to grab the lock on events traversal or modifications. */
+ struct mutex events_lock;
+
+ struct work_struct work;
+};
+
+struct mem_cgroup;
+
+#ifdef CONFIG_MEMCG
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed);
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
+#else
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed) {}
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+ int prio) {}
+#endif /* CONFIG_MEMCG */
+
+extern void vmpressure_init(struct vmpressure *vmpr);
+extern struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr);
+extern struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css);
+extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd,
+ const char *args);
+extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd);
+
+#endif /* __LINUX_VMPRESSURE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..72c5acb 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
-obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f608546..2482f2c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
+#include <linux/vmpressure.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
@@ -376,6 +377,9 @@ struct mem_cgroup {
atomic_t numainfo_events;
atomic_t numainfo_updating;
#endif
+
+ struct vmpressure vmpr;
+
/*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
@@ -576,6 +580,24 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
return container_of(s, struct mem_cgroup, css);
}
+/* Some nice accessors for the vmpressure. */
+struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg)
+{
+ if (!memcg)
+ memcg = root_mem_cgroup;
+ return &memcg->vmpr;
+}
+
+struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr)
+{
+ return &container_of(vmpr, struct mem_cgroup, vmpr)->css;
+}
+
+struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css)
+{
+ return &mem_cgroup_from_css(css)->vmpr;
+}
+
static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
{
return (memcg == root_mem_cgroup);
@@ -6074,6 +6096,11 @@ static struct cftype mem_cgroup_files[] = {
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
+ {
+ .name = "pressure_level",
+ .register_event = vmpressure_register_event,
+ .unregister_event = vmpressure_unregister_event,
+ },
#ifdef CONFIG_NUMA
{
.name = "numa_stat",
@@ -6365,6 +6392,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
memcg->move_charge_at_immigrate = 0;
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
+ vmpressure_init(&memcg->vmpr);
return &memcg->css;
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
new file mode 100644
index 0000000..ae0ff8e
--- /dev/null
+++ b/mm/vmpressure.c
@@ -0,0 +1,252 @@
+/*
+ * Linux VM pressure
+ *
+ * Copyright 2012 Linaro Ltd.
+ * Anton Vorontsov <[email protected]>
+ *
+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+#include <linux/vmpressure.h>
+
+/*
+ * The window size is the number of scanned pages before we try to analyze
+ * the scanned/reclaimed ratio (or difference).
+ *
+ * It is used as a rate-limit tunable for the "low" level notification,
+ * and for averaging medium/critical levels. Using small window sizes can
+ * cause lot of false positives, but too big window size will delay the
+ * notifications.
+ *
+ * TODO: Make the window size depend on machine size, as we do for vmstat
+ * thresholds.
+ */
+static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const unsigned int vmpressure_level_med = 60;
+static const unsigned int vmpressure_level_critical = 95;
+static const unsigned int vmpressure_level_critical_prio = 3;
+
+enum vmpressure_levels {
+ VMPRESSURE_LOW = 0,
+ VMPRESSURE_MEDIUM,
+ VMPRESSURE_CRITICAL,
+ VMPRESSURE_NUM_LEVELS,
+};
+
+static const char *vmpressure_str_levels[] = {
+ [VMPRESSURE_LOW] = "low",
+ [VMPRESSURE_MEDIUM] = "medium",
+ [VMPRESSURE_CRITICAL] = "critical",
+};
+
+static enum vmpressure_levels vmpressure_level(unsigned int pressure)
+{
+ if (pressure >= vmpressure_level_critical)
+ return VMPRESSURE_CRITICAL;
+ else if (pressure >= vmpressure_level_med)
+ return VMPRESSURE_MEDIUM;
+ return VMPRESSURE_LOW;
+}
+
+static enum vmpressure_levels vmpressure_calc_level(unsigned int scanned,
+ unsigned int reclaimed)
+{
+ unsigned long scale = scanned + reclaimed;
+ unsigned long pressure;
+
+ if (!scanned)
+ return VMPRESSURE_LOW;
+
+ /*
+ * We calculate the ratio (in percents) of how many pages were
+ * scanned vs. reclaimed in a given time frame (window). Note that
+ * time is in VM reclaimer's "ticks", i.e. number of pages
+ * scanned. This makes it possible to set desired reaction time
+ * and serves as a ratelimit.
+ */
+ pressure = scale - (reclaimed * scale / scanned);
+ pressure = pressure * 100 / scale;
+
+ pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, pressure,
+ scanned, reclaimed);
+
+ return vmpressure_level(pressure);
+}
+
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ struct vmpressure *vmpr = memcg_to_vmpr(memcg);
+
+ /*
+ * So far we are only interested application memory, or, in case
+ * of low pressure, in FS/IO memory reclaim. We are also
+ * interested indirect reclaim (kswapd sets sc->gfp_mask to
+ * GFP_KERNEL).
+ */
+ if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
+ return;
+
+ if (!scanned)
+ return;
+
+ mutex_lock(&vmpr->sr_lock);
+ vmpr->scanned += scanned;
+ vmpr->reclaimed += reclaimed;
+ mutex_unlock(&vmpr->sr_lock);
+
+ if (scanned < vmpressure_win || work_pending(&vmpr->work))
+ return;
+ schedule_work(&vmpr->work);
+}
+
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
+{
+ if (prio > vmpressure_level_critical_prio)
+ return;
+
+ /*
+ * OK, the prio is below the threshold, updating vmpressure
+ * information before diving into long shrinking of long range
+ * vmscan.
+ */
+ vmpressure(gfp, memcg, vmpressure_win, 0);
+}
+
+static struct vmpressure *wk_to_vmpr(struct work_struct *wk)
+{
+ return container_of(wk, struct vmpressure, work);
+}
+
+static struct vmpressure *cg_to_vmpr(struct cgroup *cg)
+{
+ return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
+}
+
+struct vmpressure_event {
+ struct eventfd_ctx *efd;
+ enum vmpressure_levels level;
+ struct list_head node;
+};
+
+static bool vmpressure_event(struct vmpressure *vmpr,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ struct vmpressure_event *ev;
+ int level = vmpressure_calc_level(scanned, reclaimed);
+ bool signalled = false;
+
+ mutex_lock(&vmpr->events_lock);
+
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (level >= ev->level) {
+ eventfd_signal(ev->efd, 1);
+ signalled = true;
+ }
+ }
+
+ mutex_unlock(&vmpr->events_lock);
+
+ return signalled;
+}
+
+static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
+{
+ struct cgroup *cg = vmpr_to_css(vmpr)->cgroup;
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cg);
+
+ memcg = parent_mem_cgroup(memcg);
+ if (!memcg)
+ return NULL;
+ return memcg_to_vmpr(memcg);
+}
+
+static void vmpressure_wk_fn(struct work_struct *wk)
+{
+ struct vmpressure *vmpr = wk_to_vmpr(wk);
+ unsigned long s;
+ unsigned long r;
+
+ mutex_lock(&vmpr->sr_lock);
+ s = vmpr->scanned;
+ r = vmpr->reclaimed;
+ vmpr->scanned = 0;
+ vmpr->reclaimed = 0;
+ mutex_unlock(&vmpr->sr_lock);
+
+ do {
+ if (vmpressure_event(vmpr, s, r))
+ break;
+ /*
+ * If not handled, propagate the event upward into the
+ * hierarchy.
+ */
+ } while ((vmpr = vmpressure_parent(vmpr)));
+}
+
+int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd, const char *args)
+{
+ struct vmpressure *vmpr = cg_to_vmpr(cg);
+ struct vmpressure_event *ev;
+ int lvl;
+
+ for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
+ if (!strcmp(vmpressure_str_levels[lvl], args))
+ break;
+ }
+
+ if (lvl >= VMPRESSURE_NUM_LEVELS)
+ return -EINVAL;
+
+ ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+ if (!ev)
+ return -ENOMEM;
+
+ ev->efd = eventfd;
+ ev->level = lvl;
+
+ mutex_lock(&vmpr->events_lock);
+ list_add(&ev->node, &vmpr->events);
+ mutex_unlock(&vmpr->events_lock);
+
+ return 0;
+}
+
+void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd)
+{
+ struct vmpressure *vmpr = cg_to_vmpr(cg);
+ struct vmpressure_event *ev;
+
+ mutex_lock(&vmpr->events_lock);
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (ev->efd != eventfd)
+ continue;
+ list_del(&ev->node);
+ kfree(ev);
+ break;
+ }
+ mutex_unlock(&vmpr->events_lock);
+}
+
+void vmpressure_init(struct vmpressure *vmpr)
+{
+ mutex_init(&vmpr->sr_lock);
+ mutex_init(&vmpr->events_lock);
+ INIT_LIST_HEAD(&vmpr->events);
+ INIT_WORK(&vmpr->work, vmpressure_wk_fn);
+}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df78d17..616e2bb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -19,6 +19,7 @@
#include <linux/pagemap.h>
#include <linux/init.h>
#include <linux/highmem.h>
+#include <linux/vmpressure.h>
#include <linux/vmstat.h>
#include <linux/file.h>
#include <linux/writeback.h>
@@ -1982,6 +1983,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
}
memcg = mem_cgroup_iter(root, memcg, &reclaim);
} while (memcg);
+
+ vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
+ sc->nr_scanned - nr_scanned,
+ sc->nr_reclaimed - nr_reclaimed);
+
} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
}
@@ -2167,6 +2173,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
count_vm_event(ALLOCSTALL);
do {
+ vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
+ sc->priority);
sc->nr_scanned = 0;
aborted_reclaim = shrink_zones(zonelist, sc);
--
1.8.1.4
On Fri, 22 Mar 2013 00:13:51 -0700 Anton Vorontsov <[email protected]> wrote:
> With this patch userland applications that want to maintain the
> interactivity/memory allocation cost can use the pressure level
> notifications. The levels are defined like this:
>
> The "low" level means that the system is reclaiming memory for new
> allocations. Monitoring this reclaiming activity might be useful for
> maintaining cache level. Upon notification, the program (typically
> "Activity Manager") might analyze vmstat and act in advance (i.e.
> prematurely shutdown unimportant services).
>
> The "medium" level means that the system is experiencing medium memory
> pressure, the system might be making swap, paging out active file caches,
> etc. Upon this event applications may decide to further analyze
> vmstat/zoneinfo/memcg or internal memory usage statistics and free any
> resources that can be easily reconstructed or re-read from a disk.
>
> The "critical" level means that the system is actively thrashing, it is
> about to out of memory (OOM) or even the in-kernel OOM killer is on its
> way to trigger. Applications should do whatever they can to help the
> system. It might be too late to consult with vmstat or any other
> statistics, so it's advisable to take an immediate action.
>
> The events are propagated upward until the event is handled, i.e. the
> events are not pass-through. Here is what this means: for example you have
> three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> and C, and suppose group C experiences some pressure. In this situation,
> only group C will receive the notification, i.e. groups A and B will not
> receive it. This is done to avoid excessive "broadcasting" of messages,
> which disturbs the system and which is especially bad if we are low on
> memory or thrashing. So, organize the cgroups wisely, or propagate the
> events manually (or, ask us to implement the pass-through events,
> explaining why would you need them.)
>
> Performance wise, the memory pressure notifications feature itself is
> lightweight and does not require much of bookkeeping, in contrast to the
> rest of memcg features. Unfortunately, as of current memcg implementation,
> pages accounting is an inseparable part and cannot be turned off. The good
> news is that there are some efforts[1] to improve the situation; plus,
> implementing the same, fully API-compatible[2] interface for
> CONFIG_MEMCG=n case (e.g. embedded) is also a viable option, so it will
> not require any changes on the userland side.
Nicely presented patch, thanks.
The interface still seems a bit of a toy, but is a conservative
approach: anything less toy-like would expose (and hence be dependent
upon) more VM internals.
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -40,6 +40,7 @@ Features:
> - soft limit
> - moving (recharging) account at moving a task is selectable.
> - usage threshold notifier
> + - memory pressure notifier
> - oom-killer disable knob and oom-notifier
> - Root cgroup has no limit controls.
>
> @@ -65,6 +66,7 @@ Brief summary of control files.
> memory.stat # show various statistics
> memory.use_hierarchy # set/show hierarchical account enabled
> memory.force_empty # trigger forced move charge to parent
> + memory.pressure_level # set memory pressure notifications
> memory.swappiness # set/show swappiness parameter of vmscan
> (See sysctl's vm.swappiness)
> memory.move_charge_at_immigrate # set/show controls of moving charges
> @@ -778,7 +780,64 @@ At reading, current status of OOM is shown.
> under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
> be stopped.)
>
> -11. TODO
> +11. Memory Pressure
> +
> +The pressure level notifications can be used to monitor the memory
> +allocation cost; based on the pressure, applications can implement
> +different strategies of managing their memory resources. The pressure
> +levels are defined as following:
> +
> +The "low" level means that the system is reclaiming memory for new
> +allocations. Monitoring this reclaiming activity might be useful for
> +maintaining cache level. Upon notification, the program (typically
> +"Activity Manager") might analyze vmstat and act in advance (i.e.
> +prematurely shutdown unimportant services).
> +
> +The "medium" level means that the system is experiencing medium memory
> +pressure, the system might be making swap, paging out active file caches,
> +etc. Upon this event applications may decide to further analyze
> +vmstat/zoneinfo/memcg or internal memory usage statistics and free any
> +resources that can be easily reconstructed or re-read from a disk.
> +
> +The "critical" level means that the system is actively thrashing, it is
> +about to out of memory (OOM) or even the in-kernel OOM killer is on its
> +way to trigger. Applications should do whatever they can to help the
> +system. It might be too late to consult with vmstat or any other
> +statistics, so it's advisable to take an immediate action.
> +
> +The events are propagated upward until the event is handled, i.e. the
> +events are not pass-through. Here is what this means: for example you have
> +three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> +and C, and suppose group C experiences some pressure. In this situation,
> +only group C will receive the notification, i.e. groups A and B will not
> +receive it. This is done to avoid excessive "broadcasting" of messages,
> +which disturbs the system and which is especially bad if we are low on
> +memory or thrashing. So, organize the cgroups wisely, or propagate the
> +events manually (or, ask us to implement the pass-through events,
> +explaining why would you need them.)
> +
> +The file memory.pressure_level is only used to setup an eventfd,
> +read/write operations are no implemented.
> +
> +Test:
> +
> + Here is a small script example that makes a new cgroup, sets up a
> + memory limit, sets up a notification in the cgroup and then makes child
> + cgroup experience a critical pressure:
> +
> + # cd /sys/fs/cgroup/memory/
> + # mkdir foo
> + # cd foo
> + # cgroup_event_listener memory.pressure_level low &
> + # echo 8000000 > memory.limit_in_bytes
> + # echo 8000000 > memory.memsw.limit_in_bytes
> + # echo $$ > tasks
> + # dd if=/dev/zero | read x
> +
> + (Expect a bunch of notifications, and eventually, the oom-killer will
> + trigger.)
> +
> +12. TODO
Did we tell people how to use the eventfd interface anywhere?
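
For reference, a minimal userspace sketch of what such usage might look
like. It assumes the existing cgroup.event_control mechanism, where a line
of the form "<event_fd> <control_fd> <args>" is written to register an
eventfd, and it assumes the caller has enough privileges to open both
files. format_event_control(), open_in_dir() and pressure_listen() are
hypothetical helper names, not part of any existing tool.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Build the "<event_fd> <control_fd> <args>" line; return its length or -1. */
static int format_event_control(char *buf, size_t len, int efd, int cfd,
				const char *level)
{
	int n = snprintf(buf, len, "%d %d %s", efd, cfd, level);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Open <dir>/<file>; return the fd or -1. */
static int open_in_dir(const char *dir, const char *file, int flags)
{
	char path[512];
	int n = snprintf(path, sizeof(path), "%s/%s", dir, file);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return open(path, flags);
}

/*
 * Register for `level` ("low", "medium" or "critical") on the memory
 * cgroup directory cgroup_dir. On success, returns the eventfd and stores
 * the still-open memory.pressure_level fd in *control_fd. Returns -1 on
 * failure.
 */
static int pressure_listen(const char *cgroup_dir, const char *level,
			   int *control_fd)
{
	char line[64];
	int efd, cfd, ecfd, n;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;
	cfd = open_in_dir(cgroup_dir, "memory.pressure_level", O_RDONLY);
	ecfd = open_in_dir(cgroup_dir, "cgroup.event_control", O_WRONLY);
	if (cfd < 0 || ecfd < 0)
		goto fail;
	n = format_event_control(line, sizeof(line), efd, cfd, level);
	if (n < 0 || write(ecfd, line, (size_t)n) != n)
		goto fail;
	close(ecfd);
	*control_fd = cfd;
	return efd;
fail:
	if (ecfd >= 0)
		close(ecfd);
	if (cfd >= 0)
		close(cfd);
	close(efd);
	return -1;
}
```

A caller would then block in read() on the returned eventfd, reading an
8-byte counter; each successful read means at least one notification
arrived since the previous read.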
> 1. Add support for accounting huge pages (as a separate controller)
> 2. Make per-cgroup scanner reclaim not-shared pages first
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> --- /dev/null
> +++ b/include/linux/vmpressure.h
> @@ -0,0 +1,47 @@
> +#ifndef __LINUX_VMPRESSURE_H
> +#define __LINUX_VMPRESSURE_H
> +
> +#include <linux/mutex.h>
> +#include <linux/list.h>
> +#include <linux/workqueue.h>
> +#include <linux/gfp.h>
> +#include <linux/types.h>
> +#include <linux/cgroup.h>
> +
> +struct vmpressure {
> + unsigned int scanned;
> + unsigned int reclaimed;
> + /* The lock is used to keep the scanned/reclaimed above in sync. */
> + struct mutex sr_lock;
> +
> + struct list_head events;
A comment describing what goes at `events' would be nice. Reference
"struct vmpressure_event".
> + /* Have to grab the lock on events traversal or modifications. */
> + struct mutex events_lock;
> +
> + struct work_struct work;
> +};
>
> ...
>
> +/*
> + * The window size is the number of scanned pages before we try to analyze
> + * the scanned/reclaimed ratio (or difference).
> + *
> + * It is used as a rate-limit tunable for the "low" level notification,
> + * and for averaging medium/critical levels. Using small window sizes can
> + * cause lot of false positives, but too big window size will delay the
> + * notifications.
> + *
> + * TODO: Make the window size depend on machine size, as we do for vmstat
> + * thresholds.
Here "the window size" refers to vmpressure_win, yes?
> + */
> +static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static const unsigned int vmpressure_level_med = 60;
> +static const unsigned int vmpressure_level_critical = 95;
> +static const unsigned int vmpressure_level_critical_prio = 3;
vmpressure_level_critical_prio is a bit mysterious and undocumented.
Please document it here and/or at vmpressure_prio().
> +
> +enum vmpressure_levels {
> + VMPRESSURE_LOW = 0,
> + VMPRESSURE_MEDIUM,
> + VMPRESSURE_CRITICAL,
> + VMPRESSURE_NUM_LEVELS,
> +};
>
> ...
>
> +static enum vmpressure_levels vmpressure_calc_level(unsigned int scanned,
> + unsigned int reclaimed)
> +{
> + unsigned long scale = scanned + reclaimed;
> + unsigned long pressure;
> +
> + if (!scanned)
> + return VMPRESSURE_LOW;
> +
> + /*
> + * We calculate the ratio (in percents) of how many pages were
> + * scanned vs. reclaimed in a given time frame (window). Note that
> + * time is in VM reclaimer's "ticks", i.e. number of pages
> + * scanned. This makes it possible to set desired reaction time
> + * and serves as a ratelimit.
> + */
> + pressure = scale - (reclaimed * scale / scanned);
> + pressure = pressure * 100 / scale;
> +
> + pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, pressure,
> + scanned, reclaimed);
> +
> + return vmpressure_level(pressure);
> +}
The types worry me. The patch switches the vm's unsigned longs into
unsigneds. We have a long and sorry history of overflowing 32-bit
counters in VM corner cases. I suggest that this patch should
carefully use `unsigned longs' in all the appropriate places and do
away with its present truncation and overflow risks.
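
As a rough userspace sketch of the same arithmetic with `unsigned long'
throughout (calc_pressure() and calc_level() are illustrative names; the
60 and 95 thresholds are taken from the patch). Note that this only widens
the counters on 64-bit machines. On 32-bit, `unsigned long' is still 32
bits and the reclaimed * scale product can still wrap for very large
counts, so treat this as a starting point rather than a complete fix.

```c
enum vmpressure_levels {
	VMPRESSURE_LOW = 0,
	VMPRESSURE_MEDIUM,
	VMPRESSURE_CRITICAL,
};

static const unsigned long level_med = 60;
static const unsigned long level_critical = 95;

/* Pressure in percent: 0 when everything scanned was reclaimed. */
static unsigned long calc_pressure(unsigned long scanned,
				   unsigned long reclaimed)
{
	unsigned long scale = scanned + reclaimed;
	unsigned long pressure;

	if (!scanned)
		return 0;

	/* Same formula as the patch, without narrowing to unsigned int. */
	pressure = scale - (reclaimed * scale / scanned);
	return pressure * 100 / scale;
}

/* Map the percentage onto the three notification levels. */
static enum vmpressure_levels calc_level(unsigned long scanned,
					 unsigned long reclaimed)
{
	unsigned long pressure = calc_pressure(scanned, reclaimed);

	if (pressure >= level_critical)
		return VMPRESSURE_CRITICAL;
	if (pressure >= level_med)
		return VMPRESSURE_MEDIUM;
	return VMPRESSURE_LOW;
}
```

For example, 1000 pages scanned with 300 reclaimed gives 70 (medium),
and 1000 scanned with 40 reclaimed gives 96 (critical).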
> +void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> + unsigned long scanned, unsigned long reclaimed)
Exported function and a primary interface. Needs nice documentation, please ;)
> +{
> + struct vmpressure *vmpr = memcg_to_vmpr(memcg);
> +
> + /*
> + * So far we are only interested application memory, or, in case
"interested in"
> + * of low pressure, in FS/IO memory reclaim. We are also
> + * interested indirect reclaim (kswapd sets sc->gfp_mask to
> + * GFP_KERNEL).
This is all pretty obvious from reading the code. Good comments
explain "why", not "what". *why* did we make these decisions?
> + */
> + if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
I'm surprised at __GFP_HIGHMEM's inclusion. On some machines the great
majority of user memory is in highmem. What's up?
> + return;
> +
> + if (!scanned)
> + return;
> +
> + mutex_lock(&vmpr->sr_lock);
> + vmpr->scanned += scanned;
> + vmpr->reclaimed += reclaimed;
See, here we're accumulating into a 32-bit variable quantities which used
to be held in 64-bit variables. The overflow risk gets higher...
> + mutex_unlock(&vmpr->sr_lock);
> +
> + if (scanned < vmpressure_win || work_pending(&vmpr->work))
> + return;
> + schedule_work(&vmpr->work);
> +}
> +
> +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
Documentation please.
> +{
> + if (prio > vmpressure_level_critical_prio)
> + return;
> +
> + /*
> + * OK, the prio is below the threshold, updating vmpressure
But you never told me what that threshold is for! And I have no means
of working out why you chose "3", nor the effects of altering it, etc.
> + * information before diving into long shrinking of long range
> + * vmscan.
> + */
> + vmpressure(gfp, memcg, vmpressure_win, 0);
> +}
> +
> +static struct vmpressure *wk_to_vmpr(struct work_struct *wk)
> +{
> + return container_of(wk, struct vmpressure, work);
> +}
> +
> +static struct vmpressure *cg_to_vmpr(struct cgroup *cg)
> +{
> + return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
> +}
> +
> +struct vmpressure_event {
> + struct eventfd_ctx *efd;
> + enum vmpressure_levels level;
> + struct list_head node;
> +};
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> + unsigned long scanned, unsigned long reclaimed)
> +{
> + struct vmpressure_event *ev;
> + int level = vmpressure_calc_level(scanned, reclaimed);
Here's where we go from 64-bit to 32-bit.
> + bool signalled = false;
> +
> + mutex_lock(&vmpr->events_lock);
> +
> + list_for_each_entry(ev, &vmpr->events, node) {
> + if (level >= ev->level) {
> + eventfd_signal(ev->efd, 1);
> + signalled = true;
> + }
> + }
> +
> + mutex_unlock(&vmpr->events_lock);
> +
> + return signalled;
> +}
>
> ...
>
> +int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
> + struct eventfd_ctx *eventfd, const char *args)
Document the interface, please.
> +{
> + struct vmpressure *vmpr = cg_to_vmpr(cg);
> + struct vmpressure_event *ev;
> + int lvl;
These abbreviations are rather unlinuxy. wk->work, vmpr->vmpressure,
lvl->level, etc.
> + for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
> + if (!strcmp(vmpressure_str_levels[lvl], args))
> + break;
> + }
> +
> + if (lvl >= VMPRESSURE_NUM_LEVELS)
> + return -EINVAL;
> +
> + ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> + if (!ev)
> + return -ENOMEM;
> +
> + ev->efd = eventfd;
> + ev->level = lvl;
> +
> + mutex_lock(&vmpr->events_lock);
> + list_add(&ev->node, &vmpr->events);
What's the upper bound on the length of this list?
> + mutex_unlock(&vmpr->events_lock);
> +
> + return 0;
> +}
> +
>
> ...
>
(2013/03/22 16:13), Anton Vorontsov wrote:
> With this patch userland applications that want to maintain the
> interactivity/memory allocation cost can use the pressure level
> notifications. The levels are defined like this:
>
> The "low" level means that the system is reclaiming memory for new
> allocations. Monitoring this reclaiming activity might be useful for
> maintaining cache level. Upon notification, the program (typically
> "Activity Manager") might analyze vmstat and act in advance (i.e.
> prematurely shutdown unimportant services).
>
> The "medium" level means that the system is experiencing medium memory
> pressure, the system might be making swap, paging out active file caches,
> etc. Upon this event applications may decide to further analyze
> vmstat/zoneinfo/memcg or internal memory usage statistics and free any
> resources that can be easily reconstructed or re-read from a disk.
>
> The "critical" level means that the system is actively thrashing, it is
> about to out of memory (OOM) or even the in-kernel OOM killer is on its
> way to trigger. Applications should do whatever they can to help the
> system. It might be too late to consult with vmstat or any other
> statistics, so it's advisable to take an immediate action.
>
> The events are propagated upward until the event is handled, i.e. the
> events are not pass-through. Here is what this means: for example you have
> three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> and C, and suppose group C experiences some pressure. In this situation,
> only group C will receive the notification, i.e. groups A and B will not
> receive it. This is done to avoid excessive "broadcasting" of messages,
> which disturbs the system and which is especially bad if we are low on
> memory or thrashing. So, organize the cgroups wisely, or propagate the
> events manually (or, ask us to implement the pass-through events,
> explaining why would you need them.)
>
> Performance wise, the memory pressure notifications feature itself is
> lightweight and does not require much of bookkeeping, in contrast to the
> rest of memcg features. Unfortunately, as of current memcg implementation,
> pages accounting is an inseparable part and cannot be turned off. The good
> news is that there are some efforts[1] to improve the situation; plus,
> implementing the same, fully API-compatible[2] interface for
> CONFIG_MEMCG=n case (e.g. embedded) is also a viable option, so it will
> not require any changes on the userland side.
>
> [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
> [2] http://lkml.org/lkml/2013/2/21/454
>
> Signed-off-by: Anton Vorontsov <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> ---
>
> Hi all,
>
> Here is a shiny new v3!
>
> In v3:
>
> - No changes in the code, just updated commit message to incorporate the
> answer to Minchan Kim's comment regarding applicability to embedded use
> cases in the light of memcg performance overhead, plus gave some
> references to Glauber Costa's memcg work.
>
> - Rebased onto 3.9.0-rc3-next-20130321.
>
> In v2:
>
> - Addressed Glauber Costa's comments:
> o Use parent_mem_cgroup() instead of own parent function (also suggested
> by Kamezawa). This change also affected events distribution logic, so
> it became more like memory thresholds notifications, i.e. we deliver
> the event to the cgroup where the event originated, not to the parent
> cgroup; (This also addresses Kamezawa's remark regarding which cgroup
> receives which event.)
> o Register vmpressure cgroup file directly in memcontrol.c.
>
> - Addressed Greg Thelen's comments:
> o Fixed bool/int inconsistency in the code;
> o Fixed nr_scanned accounting;
> o Don't use cryptic 's', 'r' abbreviations; get rid of confusing
> 'window' argument.
>
> - Addressed Kamezawa Hiroyuki's comments:
> o Moved declarations from mm/internal.h into linux/vmpressure.h;
> o Removed Kconfig symbol. Vmpressure is pretty lightweight (especially
> comparing to the memcg accounting). If it ever causes any measurable
> performance effect, we want to fix it, not paper it over with a
> Kconfig option. :-)
> o Removed read operation on pressure_level cgroup file. In apps, we only
> use notifications, we don't need the content of the file, so let's
> keep things simple for now. Plus this resolves questions like what
> should we return there when the system is not reclaiming;
> o Reworded documentation;
> o Improved comments for vmpressure_prio().
>
> Old changelogs/submissions:
> v2: http://lkml.org/lkml/2013/2/18/577
> v1: http://lkml.org/lkml/2013/2/10/140
> mempressure cgroup: http://lkml.org/lkml/2013/1/4/55
>
> Documentation/cgroups/memory.txt | 61 +++++++++-
> include/linux/vmpressure.h | 47 ++++++++
> mm/Makefile | 2 +-
> mm/memcontrol.c | 28 +++++
> mm/vmpressure.c | 252 +++++++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 8 ++
> 6 files changed, 396 insertions(+), 2 deletions(-)
> create mode 100644 include/linux/vmpressure.h
> create mode 100644 mm/vmpressure.c
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index addb1f1..0c004de 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -40,6 +40,7 @@ Features:
> - soft limit
> - moving (recharging) account at moving a task is selectable.
> - usage threshold notifier
> + - memory pressure notifier
> - oom-killer disable knob and oom-notifier
> - Root cgroup has no limit controls.
>
> @@ -65,6 +66,7 @@ Brief summary of control files.
> memory.stat # show various statistics
> memory.use_hierarchy # set/show hierarchical account enabled
> memory.force_empty # trigger forced move charge to parent
> + memory.pressure_level # set memory pressure notifications
> memory.swappiness # set/show swappiness parameter of vmscan
> (See sysctl's vm.swappiness)
> memory.move_charge_at_immigrate # set/show controls of moving charges
> @@ -778,7 +780,64 @@ At reading, current status of OOM is shown.
> under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
> be stopped.)
>
> -11. TODO
> +11. Memory Pressure
> +
> +The pressure level notifications can be used to monitor the memory
> +allocation cost; based on the pressure, applications can implement
> +different strategies of managing their memory resources. The pressure
> +levels are defined as following:
> +
> +The "low" level means that the system is reclaiming memory for new
> +allocations. Monitoring this reclaiming activity might be useful for
> +maintaining cache level. Upon notification, the program (typically
> +"Activity Manager") might analyze vmstat and act in advance (i.e.
> +prematurely shutdown unimportant services).
> +
> +The "medium" level means that the system is experiencing medium memory
> +pressure, the system might be making swap, paging out active file caches,
> +etc. Upon this event applications may decide to further analyze
> +vmstat/zoneinfo/memcg or internal memory usage statistics and free any
> +resources that can be easily reconstructed or re-read from a disk.
> +
> +The "critical" level means that the system is actively thrashing, it is
> +about to out of memory (OOM) or even the in-kernel OOM killer is on its
> +way to trigger. Applications should do whatever they can to help the
> +system. It might be too late to consult with vmstat or any other
> +statistics, so it's advisable to take an immediate action.
> +
> +The events are propagated upward until the event is handled, i.e. the
> +events are not pass-through. Here is what this means: for example you have
> +three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> +and C, and suppose group C experiences some pressure. In this situation,
> +only group C will receive the notification, i.e. groups A and B will not
> +receive it. This is done to avoid excessive "broadcasting" of messages,
> +which disturbs the system and which is especially bad if we are low on
> +memory or thrashing. So, organize the cgroups wisely, or propagate the
> +events manually (or, ask us to implement the pass-through events,
> +explaining why would you need them.)
> +
> +The file memory.pressure_level is only used to setup an eventfd,
> +read/write operations are no implemented.
> +
I'll ack this spec. Some nitpicks below.
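
To make the "not pass-through" rule above concrete, here is a minimal
userspace sketch of the delivery walk. It is only a model under stated
assumptions: struct group, group_event() and deliver() are hypothetical
names, and each group holds at most one listener, whereas the patch keeps
a list of them per cgroup.

```c
#include <stdbool.h>
#include <stddef.h>

enum level { LOW = 0, MEDIUM, CRITICAL };

/* Hypothetical stand-in for a cgroup with at most one listener. */
struct group {
	struct group *parent;
	bool has_listener;
	enum level listener_level;
	int notified;
};

/* Signal the listener if the event is at least as severe as requested. */
static bool group_event(struct group *g, enum level level)
{
	if (!g->has_listener || level < g->listener_level)
		return false;
	g->notified++;
	return true;
}

/* Walk upward from the group under pressure; stop at the first handler. */
static struct group *deliver(struct group *g, enum level level)
{
	for (; g; g = g->parent)
		if (group_event(g, level))
			return g;
	return NULL;
}
```

With listeners on A, B and C, pressure in C stops at C. If C has no
listener and B only asked for "critical", a "medium" event from C ends up
at A.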
> +Test:
> +
> + Here is a small script example that makes a new cgroup, sets up a
> + memory limit, sets up a notification in the cgroup and then makes child
> + cgroup experience a critical pressure:
> +
> + # cd /sys/fs/cgroup/memory/
> + # mkdir foo
> + # cd foo
> + # cgroup_event_listener memory.pressure_level low &
> + # echo 8000000 > memory.limit_in_bytes
> + # echo 8000000 > memory.memsw.limit_in_bytes
> + # echo $$ > tasks
> + # dd if=/dev/zero | read x
> +
> + (Expect a bunch of notifications, and eventually, the oom-killer will
> + trigger.)
> +
> +12. TODO
>
> 1. Add support for accounting huge pages (as a separate controller)
> 2. Make per-cgroup scanner reclaim not-shared pages first
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> new file mode 100644
> index 0000000..fa84783
> --- /dev/null
> +++ b/include/linux/vmpressure.h
> @@ -0,0 +1,47 @@
> +#ifndef __LINUX_VMPRESSURE_H
> +#define __LINUX_VMPRESSURE_H
> +
> +#include <linux/mutex.h>
> +#include <linux/list.h>
> +#include <linux/workqueue.h>
> +#include <linux/gfp.h>
> +#include <linux/types.h>
> +#include <linux/cgroup.h>
> +
> +struct vmpressure {
> + unsigned int scanned;
> + unsigned int reclaimed;
> + /* The lock is used to keep the scanned/reclaimed above in sync. */
> + struct mutex sr_lock;
> +
> + struct list_head events;
> + /* Have to grab the lock on events traversal or modifications. */
> + struct mutex events_lock;
> +
> + struct work_struct work;
> +};
> +
> +struct mem_cgroup;
> +
> +#ifdef CONFIG_MEMCG
> +extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> + unsigned long scanned, unsigned long reclaimed);
> +extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
> +#else
> +static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> + unsigned long scanned, unsigned long reclaimed) {}
> +static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
> + int prio) {}
> +#endif /* CONFIG_MEMCG */
> +
> +extern void vmpressure_init(struct vmpressure *vmpr);
> +extern struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg);
> +extern struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr);
> +extern struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css);
> +extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
> + struct eventfd_ctx *eventfd,
> + const char *args);
> +extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
> + struct eventfd_ctx *eventfd);
> +
> +#endif /* __LINUX_VMPRESSURE_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 3a46287..72c5acb 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -50,7 +50,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
> obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> -obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f608546..2482f2c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -49,6 +49,7 @@
> #include <linux/fs.h>
> #include <linux/seq_file.h>
> #include <linux/vmalloc.h>
> +#include <linux/vmpressure.h>
> #include <linux/mm_inline.h>
> #include <linux/page_cgroup.h>
> #include <linux/cpu.h>
> @@ -376,6 +377,9 @@ struct mem_cgroup {
> atomic_t numainfo_events;
> atomic_t numainfo_updating;
> #endif
> +
> + struct vmpressure vmpr;
> +
How about placing this just below "memsw_threshold"?
The memory objects around there are not performance critical.
> /*
> * Per cgroup active and inactive list, similar to the
> * per zone LRU lists.
> @@ -576,6 +580,24 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
> return container_of(s, struct mem_cgroup, css);
> }
>
> +/* Some nice accessors for the vmpressure. */
> +struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg)
> +{
> + if (!memcg)
> + memcg = root_mem_cgroup;
> + return &memcg->vmpr;
> +}
> +
> +struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr)
> +{
> + return &container_of(vmpr, struct mem_cgroup, vmpr)->css;
> +}
> +
> +struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css)
> +{
> + return &mem_cgroup_from_css(css)->vmpr;
> +}
> +
> static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> {
> return (memcg == root_mem_cgroup);
> @@ -6074,6 +6096,11 @@ static struct cftype mem_cgroup_files[] = {
> .unregister_event = mem_cgroup_oom_unregister_event,
> .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
> },
> + {
> + .name = "pressure_level",
> + .register_event = vmpressure_register_event,
> + .unregister_event = vmpressure_unregister_event,
> + },
> #ifdef CONFIG_NUMA
> {
> .name = "numa_stat",
> @@ -6365,6 +6392,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
> memcg->move_charge_at_immigrate = 0;
> mutex_init(&memcg->thresholds_lock);
> spin_lock_init(&memcg->move_lock);
> + vmpressure_init(&memcg->vmpr);
>
> return &memcg->css;
>
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> new file mode 100644
> index 0000000..ae0ff8e
> --- /dev/null
> +++ b/mm/vmpressure.c
> @@ -0,0 +1,252 @@
> +/*
> + * Linux VM pressure
> + *
> + * Copyright 2012 Linaro Ltd.
> + * Anton Vorontsov <[email protected]>
> + *
> + * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> + * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/fs.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <linux/eventfd.h>
> +#include <linux/swap.h>
> +#include <linux/printk.h>
> +#include <linux/vmpressure.h>
> +
> +/*
> + * The window size is the number of scanned pages before we try to analyze
> + * the scanned/reclaimed ratio (or difference).
> + *
> + * It is used as a rate-limit tunable for the "low" level notification,
> + * and for averaging medium/critical levels. Using small window sizes can
> + * cause lot of false positives, but too big window size will delay the
> + * notifications.
> + *
> + * TODO: Make the window size depend on machine size, as we do for vmstat
> + * thresholds.
> + */
> +static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static const unsigned int vmpressure_level_med = 60;
> +static const unsigned int vmpressure_level_critical = 95;
> +static const unsigned int vmpressure_level_critical_prio = 3;
> +
More comments would be welcome...
I'm not against the numbers themselves, but I'm not sure how they were
selected. I'd be glad if you gave some reasons in the changelog or somewhere.
> +enum vmpressure_levels {
> + VMPRESSURE_LOW = 0,
> + VMPRESSURE_MEDIUM,
> + VMPRESSURE_CRITICAL,
> + VMPRESSURE_NUM_LEVELS,
> +};
> +
> +static const char *vmpressure_str_levels[] = {
> + [VMPRESSURE_LOW] = "low",
> + [VMPRESSURE_MEDIUM] = "medium",
> + [VMPRESSURE_CRITICAL] = "critical",
> +};
> +
> +static enum vmpressure_levels vmpressure_level(unsigned int pressure)
> +{
> + if (pressure >= vmpressure_level_critical)
> + return VMPRESSURE_CRITICAL;
> + else if (pressure >= vmpressure_level_med)
> + return VMPRESSURE_MEDIUM;
> + return VMPRESSURE_LOW;
> +}
> +
> +static enum vmpressure_levels vmpressure_calc_level(unsigned int scanned,
> + unsigned int reclaimed)
> +{
> + unsigned long scale = scanned + reclaimed;
> + unsigned long pressure;
> +
> + if (!scanned)
> + return VMPRESSURE_LOW;
Can you add a comment here? When does !scanned happen?
> +
> + /*
> + * We calculate the ratio (in percents) of how many pages were
> + * scanned vs. reclaimed in a given time frame (window). Note that
> + * time is in VM reclaimer's "ticks", i.e. number of pages
> + * scanned. This makes it possible to set desired reaction time
> + * and serves as a ratelimit.
> + */
> + pressure = scale - (reclaimed * scale / scanned);
> + pressure = pressure * 100 / scale;
> +
> + pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, pressure,
> + scanned, reclaimed);
> +
> + return vmpressure_level(pressure);
> +}
> +
> +void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> + unsigned long scanned, unsigned long reclaimed)
> +{
> + struct vmpressure *vmpr = memcg_to_vmpr(memcg);
> +
> + /*
> + * So far we are only interested application memory, or, in case
> + * of low pressure, in FS/IO memory reclaim. We are also
> + * interested indirect reclaim (kswapd sets sc->gfp_mask to
> + * GFP_KERNEL).
> + */
> + if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
> + return;
> +
> + if (!scanned)
> + return;
> +
> + mutex_lock(&vmpr->sr_lock);
> + vmpr->scanned += scanned;
> + vmpr->reclaimed += reclaimed;
> + mutex_unlock(&vmpr->sr_lock);
> +
> + if (scanned < vmpressure_win || work_pending(&vmpr->work))
> + return;
> + schedule_work(&vmpr->work);
> +}
I'm not sure what others think, but could you place the definition
of the work function above the code that schedules it? You call
vmpressure_wk_fn(), right?
> +
> +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> +{
> + if (prio > vmpressure_level_critical_prio)
> + return;
> +
> + /*
> + * OK, the prio is below the threshold, updating vmpressure
> + * information before diving into long shrinking of long range
> + * vmscan.
> + */
> + vmpressure(gfp, memcg, vmpressure_win, 0);
> +}
> +
> +static struct vmpressure *wk_to_vmpr(struct work_struct *wk)
> +{
> + return container_of(wk, struct vmpressure, work);
> +}
> +
> +static struct vmpressure *cg_to_vmpr(struct cgroup *cg)
> +{
> + return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
> +}
> +
> +struct vmpressure_event {
> + struct eventfd_ctx *efd;
> + enum vmpressure_levels level;
> + struct list_head node;
> +};
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> + unsigned long scanned, unsigned long reclaimed)
> +{
> + struct vmpressure_event *ev;
> + int level = vmpressure_calc_level(scanned, reclaimed);
> + bool signalled = false;
> +
> + mutex_lock(&vmpr->events_lock);
> +
> + list_for_each_entry(ev, &vmpr->events, node) {
> + if (level >= ev->level) {
> + eventfd_signal(ev->efd, 1);
> + signalled = true;
> + }
> + }
> +
> + mutex_unlock(&vmpr->events_lock);
> +
> + return signalled;
> +}
> +
> +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> +{
> + struct cgroup *cg = vmpr_to_css(vmpr)->cgroup;
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cg);
> +
> + memcg = parent_mem_cgroup(memcg);
> + if (!memcg)
> + return NULL;
> + return memcg_to_vmpr(memcg);
> +}
> +
> +static void vmpressure_wk_fn(struct work_struct *wk)
> +{
> + struct vmpressure *vmpr = wk_to_vmpr(wk);
> + unsigned long s;
> + unsigned long r;
> +
> + mutex_lock(&vmpr->sr_lock);
> + s = vmpr->scanned;
> + r = vmpr->reclaimed;
> + vmpr->scanned = 0;
> + vmpr->reclaimed = 0;
> + mutex_unlock(&vmpr->sr_lock);
> +
> + do {
> + if (vmpressure_event(vmpr, s, r))
> + break;
> + /*
> + * If not handled, propagate the event upward into the
> + * hierarchy.
> + */
> + } while ((vmpr = vmpressure_parent(vmpr)));
> +}
> +
> +int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
> + struct eventfd_ctx *eventfd, const char *args)
> +{
> + struct vmpressure *vmpr = cg_to_vmpr(cg);
> + struct vmpressure_event *ev;
> + int lvl;
> +
> + for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
> + if (!strcmp(vmpressure_str_levels[lvl], args))
> + break;
> + }
> +
> + if (lvl >= VMPRESSURE_NUM_LEVELS)
> + return -EINVAL;
> +
> + ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> + if (!ev)
> + return -ENOMEM;
> +
> + ev->efd = eventfd;
> + ev->level = lvl;
> +
> + mutex_lock(&vmpr->events_lock);
> + list_add(&ev->node, &vmpr->events);
> + mutex_unlock(&vmpr->events_lock);
> +
> + return 0;
> +}
> +
> +void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
> + struct eventfd_ctx *eventfd)
> +{
> + struct vmpressure *vmpr = cg_to_vmpr(cg);
> + struct vmpressure_event *ev;
> +
> + mutex_lock(&vmpr->events_lock);
> + list_for_each_entry(ev, &vmpr->events, node) {
> + if (ev->efd != eventfd)
> + continue;
> + list_del(&ev->node);
> + kfree(ev);
> + break;
> + }
> + mutex_unlock(&vmpr->events_lock);
> +}
> +
> +void vmpressure_init(struct vmpressure *vmpr)
> +{
> + mutex_init(&vmpr->sr_lock);
> + mutex_init(&vmpr->events_lock);
> + INIT_LIST_HEAD(&vmpr->events);
> + INIT_WORK(&vmpr->work, vmpressure_wk_fn);
> +}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index df78d17..616e2bb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -19,6 +19,7 @@
> #include <linux/pagemap.h>
> #include <linux/init.h>
> #include <linux/highmem.h>
> +#include <linux/vmpressure.h>
> #include <linux/vmstat.h>
> #include <linux/file.h>
> #include <linux/writeback.h>
> @@ -1982,6 +1983,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> }
> memcg = mem_cgroup_iter(root, memcg, &reclaim);
> } while (memcg);
> +
> + vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> + sc->nr_scanned - nr_scanned,
> + sc->nr_reclaimed - nr_reclaimed);
> +
> } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
> sc->nr_scanned - nr_scanned, sc));
> }
> @@ -2167,6 +2173,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> count_vm_event(ALLOCSTALL);
>
> do {
> + vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
> + sc->priority);
> sc->nr_scanned = 0;
> aborted_reclaim = shrink_zones(zonelist, sc);
>
>
When you answer Andrew's comments and fix the problems, feel free to add
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
On Wed, Mar 27, 2013 at 10:53:30AM +0900, Kamezawa Hiroyuki wrote:
[...]
> >+++ b/mm/memcontrol.c
> >@@ -49,6 +49,7 @@
> > #include <linux/fs.h>
> > #include <linux/seq_file.h>
> > #include <linux/vmalloc.h>
> >+#include <linux/vmpressure.h>
> > #include <linux/mm_inline.h>
> > #include <linux/page_cgroup.h>
> > #include <linux/cpu.h>
> >@@ -376,6 +377,9 @@ struct mem_cgroup {
> > atomic_t numainfo_events;
> > atomic_t numainfo_updating;
> > #endif
> >+
> >+ struct vmpressure vmpr;
> >+
>
> How about placing this just below "memsw_threshold"?
> The memory objects around there are not performance critical.
Yup, done.
[...]
> >+static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
> >+static const unsigned int vmpressure_level_med = 60;
> >+static const unsigned int vmpressure_level_critical = 95;
> >+static const unsigned int vmpressure_level_critical_prio = 3;
> >+
> More comments would be welcome...
>
> I'm not against the numbers themselves, but I'm not sure how they were
> selected. I'd be glad if you gave some reasons in the changelog or somewhere.
Sure, in v4 the numbers are described in the comments.
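
For reference, a small userspace sketch of how the 60/95 thresholds
interact with the scanned/reclaimed arithmetic in
vmpressure_calc_level(). The formula and thresholds are copied from the
v3 patch; pressure_percent() and pressure_level() are hypothetical names
used only here, and the 512-page window in the note below assumes
SWAP_CLUSTER_MAX is 32.

```c
enum vmpressure_levels {
	VMPRESSURE_LOW = 0,
	VMPRESSURE_MEDIUM,
	VMPRESSURE_CRITICAL,
};

/* Same arithmetic as vmpressure_calc_level() in the patch. */
static unsigned long pressure_percent(unsigned long scanned,
				      unsigned long reclaimed)
{
	unsigned long scale = scanned + reclaimed;
	unsigned long pressure;

	if (!scanned)
		return 0;

	pressure = scale - (reclaimed * scale / scanned);
	return pressure * 100 / scale;
}

/* Thresholds as in the patch: 60 for "medium", 95 for "critical". */
static enum vmpressure_levels pressure_level(unsigned long pressure)
{
	if (pressure >= 95)
		return VMPRESSURE_CRITICAL;
	if (pressure >= 60)
		return VMPRESSURE_MEDIUM;
	return VMPRESSURE_LOW;
}
```

With a 512-page window: reclaiming 256 pages gives 50 ("low"), reclaiming
128 gives 75 ("medium"), and reclaiming nothing gives 100 ("critical").
The last case is what vmpressure_prio() feeds in when it passes
vmpressure_win scanned pages and 0 reclaimed.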
[...]
> >+static enum vmpressure_levels vmpressure_calc_level(unsigned int scanned,
> >+ unsigned int reclaimed)
> >+{
> >+ unsigned long scale = scanned + reclaimed;
> >+ unsigned long pressure;
> >+
> >+ if (!scanned)
> >+ return VMPRESSURE_LOW;
>
> Can you add a comment here? When does !scanned happen?
Yeah, the comment is needed. In v4 I added an explanation for this case.
[...]
> >+ mutex_lock(&vmpr->sr_lock);
> >+ vmpr->scanned += scanned;
> >+ vmpr->reclaimed += reclaimed;
> >+ mutex_unlock(&vmpr->sr_lock);
> >+
> >+ if (scanned < vmpressure_win || work_pending(&vmpr->work))
> >+ return;
> >+ schedule_work(&vmpr->work);
> >+}
>
> I'm not sure what others think, but could you place the definition
> of the work function above the code that schedules it? You call
> vmpressure_wk_fn(), right?
Yup. OK, I rearranged the code a bit.
[...]
> > do {
> >+ vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
> >+ sc->priority);
> > sc->nr_scanned = 0;
> > aborted_reclaim = shrink_zones(zonelist, sc);
> >
> >
>
> When you answer Andrew's comments and fix the problems, feel free to add
>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Thanks a lot for the reviews, Kamezawa!
Anton
On Tue, Mar 26, 2013 at 01:46:56PM -0700, Andrew Morton wrote:
[...]
> > +The file memory.pressure_level is only used to setup an eventfd,
> > +read/write operations are no implemented.
[...]
> Did we tell people how to use the eventfd interface anywhere?
Good point. In v4 I added detailed instructions on how to set up the file
descriptors.
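
For readers following along, here is a minimal sketch of the usual cgroup
eventfd registration sequence: create an eventfd, open
memory.pressure_level, and write "<event_fd> <target_fd> <level>" to
cgroup.event_control in the same cgroup directory. The helper names
format_event_control() and register_pressure_listener() are hypothetical,
and the sketch hands the pressure_level descriptor back to the caller
rather than assuming it is safe to close it early.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Build the "<event_fd> <target_fd> <args>" line for cgroup.event_control. */
static int format_event_control(char *buf, size_t len, int efd, int pfd,
				const char *level)
{
	return snprintf(buf, len, "%d %d %s", efd, pfd, level);
}

/*
 * Register for pressure events on the cgroup at cgroup_dir.
 * Returns the eventfd on success (and stores the open pressure_level
 * descriptor in *pfd_out), or -1 on any failure.
 */
static int register_pressure_listener(const char *cgroup_dir,
				      const char *level, int *pfd_out)
{
	char path[512], line[64];
	int efd, pfd = -1, cfd = -1, n;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/memory.pressure_level", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		goto fail;
	pfd = open(path, O_RDONLY);
	if (pfd < 0)
		goto fail;

	n = snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		goto fail;
	cfd = open(path, O_WRONLY);
	if (cfd < 0)
		goto fail;

	n = format_event_control(line, sizeof(line), efd, pfd, level);
	if (n < 0 || (size_t)n >= sizeof(line) ||
	    write(cfd, line, (size_t)n) != (ssize_t)n)
		goto fail;

	close(cfd);
	*pfd_out = pfd;
	return efd;

fail:
	if (cfd >= 0)
		close(cfd);
	if (pfd >= 0)
		close(pfd);
	close(efd);
	return -1;
}
```

After a successful call, a blocking read() of 8 bytes from the returned
eventfd waits until the kernel signals a notification at or above the
requested level.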
> > 1. Add support for accounting huge pages (as a separate controller)
> > 2. Make per-cgroup scanner reclaim not-shared pages first
> > diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> > --- /dev/null
> > +++ b/include/linux/vmpressure.h
> > @@ -0,0 +1,47 @@
> > +#ifndef __LINUX_VMPRESSURE_H
> > +#define __LINUX_VMPRESSURE_H
> > +
> > +#include <linux/mutex.h>
> > +#include <linux/list.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/gfp.h>
> > +#include <linux/types.h>
> > +#include <linux/cgroup.h>
> > +
> > +struct vmpressure {
> > + unsigned int scanned;
> > + unsigned int reclaimed;
> > + /* The lock is used to keep the scanned/reclaimed above in sync. */
> > + struct mutex sr_lock;
> > +
> > + struct list_head events;
>
> A comment describing what goes at `events' would be nice. Reference
> "struct vmpressure_event".
Done.
> > + /* Have to grab the lock on events traversal or modifications. */
> > + struct mutex events_lock;
> > +
> > + struct work_struct work;
> > +};
> >
> > ...
> >
> > +/*
> > + * The window size is the number of scanned pages before we try to analyze
> > + * the scanned/reclaimed ratio (or difference).
> > + *
> > + * It is used as a rate-limit tunable for the "low" level notification,
> > + * and for averaging medium/critical levels. Using small window sizes can
> > + * cause lot of false positives, but too big window size will delay the
> > + * notifications.
> > + *
> > + * TODO: Make the window size depend on machine size, as we do for vmstat
> > + * thresholds.
>
> Here "the window size" refers to vmpressure_win, yes?
Yup.
(To make it clear, in the new version I added a direct reference to the
vmpressure_win.)
> > + */
> > +static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
> > +static const unsigned int vmpressure_level_med = 60;
> > +static const unsigned int vmpressure_level_critical = 95;
> > +static const unsigned int vmpressure_level_critical_prio = 3;
>
> vmpressure_level_critical_prio is a bit mysterious and undocumented.
> Please document it here and/or at vmpressure_prio().
I added documentation in v4.
> > +void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> > + unsigned long scanned, unsigned long reclaimed)
>
> Exported function and a primary interface. Needs nice documentation, please ;)
Sure thing, all exported functions now come with kernel-doc comments.
[...]
> > + */
> > + if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
>
> I'm surprised at __GFP_HIGHMEM's inclusion. On some machines the great
> majority of user memory is in highmem. What's up?
In the new revision I included this comment:
/*
* Here we only want to account pressure that userland is able to
* help us with. For example, suppose that DMA zone is under
* pressure; if we notify userland about that kind of pressure,
* then it will be mostly a waste as it will trigger unnecessary
* freeing of memory by userland (since userland is more likely to
* have HIGHMEM/MOVABLE pages instead of the DMA fallback). That
* is why we include only movable, highmem and FS/IO pages.
* Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so
* we account it too.
*/
if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
return;
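
To illustrate what the mask test lets through, here is a userspace sketch
with made-up flag values. The DEMO_GFP_* bits are purely illustrative and
do not match the kernel's real __GFP_* values; DEMO_GFP_KERNEL mirrors
the WAIT|IO|FS composition of GFP_KERNEL, and pressure_is_accounted() is
a hypothetical helper.

```c
/* Illustrative bit values only; not the kernel's real __GFP_* values. */
#define DEMO_GFP_DMA     0x01u
#define DEMO_GFP_HIGHMEM 0x02u
#define DEMO_GFP_MOVABLE 0x08u
#define DEMO_GFP_WAIT    0x10u
#define DEMO_GFP_IO      0x40u
#define DEMO_GFP_FS      0x80u
#define DEMO_GFP_KERNEL  (DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS)

/* Nonzero if reclaim done on behalf of this mask feeds vmpressure. */
static int pressure_is_accounted(unsigned int gfp)
{
	return (gfp & (DEMO_GFP_HIGHMEM | DEMO_GFP_MOVABLE |
		       DEMO_GFP_IO | DEMO_GFP_FS)) != 0;
}
```

A GFP_KERNEL-style mask (which is what kswapd uses) passes because of the
IO and FS bits, while a DMA request that allows neither IO nor FS is
skipped.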
> > + if (!scanned)
> > + return;
> > +
> > + mutex_lock(&vmpr->sr_lock);
> > + vmpr->scanned += scanned;
> > + vmpr->reclaimed += reclaimed;
>
> See, here we're accumulating into a 32-bit variable quantities which used
> to be held in 64-bit variables. The overflow risk gets higher...
I see. I fixed this.
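
For the record, a tiny userspace sketch of the wraparound in question. It
assumes a 64-bit build, where the unsigned long deltas passed to
vmpressure() are wider than the unsigned int accumulators in v3. Fixed
width types are used so the behaviour is the same everywhere, and
accumulate32()/accumulate64() are hypothetical names.

```c
#include <stdint.h>

/* Add the same delta n times into a 32-bit counter (wraps modulo 2^32). */
static uint32_t accumulate32(uint64_t delta, unsigned int n)
{
	uint32_t total = 0;

	while (n--)
		total += delta;	/* the sum is truncated to 32 bits on store */
	return total;
}

/* Same accumulation into a 64-bit counter. */
static uint64_t accumulate64(uint64_t delta, unsigned int n)
{
	uint64_t total = 0;

	while (n--)
		total += delta;
	return total;
}
```

Two deltas of 2^31 pages wrap the 32-bit counter back to 0, while the
64-bit counter holds 2^32 as expected.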
> > + mutex_unlock(&vmpr->sr_lock);
> > +
> > + if (scanned < vmpressure_win || work_pending(&vmpr->work))
> > + return;
> > + schedule_work(&vmpr->work);
> > +}
> > +
> > +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
>
> Documentation please.
Yup, done.
> > +{
> > + if (prio > vmpressure_level_critical_prio)
> > + return;
> > +
> > + /*
> > + * OK, the prio is below the threshold, updating vmpressure
>
> But you never told me what that threshold is for! And I have no means
> of working out why you chose "3", nor the effects of altering it, etc.
True. This is explained in a comment now.
[...]
> > +int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
> > + struct eventfd_ctx *eventfd, const char *args)
>
> Document the interface, please.
Done.
> > +{
> > + struct vmpressure *vmpr = cg_to_vmpr(cg);
> > + struct vmpressure_event *ev;
> > + int lvl;
>
> These abbreviations are rather unlinuxy. wk->work, vmpr->vmpressure,
> lvl->level, etc.
Yeah, I agree. However, 'vmpressure' as a function-scope variable name is
kind of too long and makes the code really hard to read. But in the memcg
struct and the global namespace I now use the full 'vmpressure' name.
> > + for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
> > + if (!strcmp(vmpressure_str_levels[lvl], args))
> > + break;
> > + }
> > +
> > + if (lvl >= VMPRESSURE_NUM_LEVELS)
> > + return -EINVAL;
> > +
> > + ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> > + if (!ev)
> > + return -ENOMEM;
> > +
> > + ev->efd = eventfd;
> > + ev->level = lvl;
> > +
> > + mutex_lock(&vmpr->events_lock);
> > + list_add(&ev->node, &vmpr->events);
>
> What's the upper bound on the length of this list?
As of now, it is controlled by the cgroup core, so I would say it is
bounded by the number of open FDs; if that is a problem, it should be fixed
for everyone. The good thing is that the list is per-cgroup, not global.
Thanks for the review, Andrew!
Anton