2012-11-28 10:32:31

by Anton Vorontsov

Subject: [RFC] Add mempressure cgroup

This is an attempt to implement David Rientjes' idea of mempressure
cgroup.

The main characteristics are the same as what I've tried to add to the vmevent
API:

Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
pressure index calculation. But we don't expose the index to the
userland. Instead, there are three levels of the pressure:

o low (just reclaiming, e.g. caches are draining);
o medium (allocation cost becomes high, e.g. swapping);
o oom (about to oom very soon).

The rationale behind exposing levels rather than the raw pressure index is
described here: http://lkml.org/lkml/2012/11/16/675

The API uses standard cgroups eventfd notifications:

$ gcc Documentation/cgroups/cgroup_event_listener.c -o \
cgroup_event_listener
$ cd /sys/fs/cgroup/
$ mkdir mempressure
$ mount -t cgroup cgroup ./mempressure -o mempressure
$ cd mempressure
$ cgroup_event_listener ./mempressure.level low
("low", "medium", "oom" are permitted values.)

Upon hitting the threshold, you should see "/sys/fs/cgroup/mempressure
low: crossed" messages.

To test that it actually works on a per-cgroup basis, I did a small trick: I
moved all kswapd threads into a separate cgroup, and hooked the listener onto
another (non-root) cgroup. The listener no longer received global reclaim
pressure, which is expected.

For a task it is possible to be in cpusets, memcg and mempressure
cgroups simultaneously, so by rearranging the tasks it should be possible
to watch a specific pressure.

Note that while this adds cgroup support, the code is well separated
and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
But this is another story.

Signed-off-by: Anton Vorontsov <[email protected]>
---
include/linux/cgroup_subsys.h |   6 +
include/linux/vmstat.h        |   8 ++
init/Kconfig                  |   5 +
mm/Makefile                   |   1 +
mm/mempressure.c              | 287 ++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c                   |   3 +
6 files changed, 310 insertions(+)
create mode 100644 mm/mempressure.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index f204a7a..b9802e2 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)

/* */

+#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
+SUBSYS(mpc_cgroup)
+#endif
+
+/* */
+
#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_DEVICE)
SUBSYS(devices)
#endif
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 92a86b2..7698341 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -10,6 +10,14 @@

extern int sysctl_stat_interval;

+#ifdef CONFIG_CGROUP_MEMPRESSURE
+extern void vmpressure(ulong scanned, ulong reclaimed);
+extern void vmpressure_prio(int prio);
+#else
+static inline void vmpressure(ulong scanned, ulong reclaimed) {}
+static inline void vmpressure_prio(int prio) {}
+#endif
+
#ifdef CONFIG_VM_EVENT_COUNTERS
/*
* Light weight per cpu counter implementation.
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..7065e44 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -826,6 +826,11 @@ config MEMCG_KMEM
the kmem extension can use it to guarantee that no group of processes
will ever exhaust kernel resources alone.

+config CGROUP_MEMPRESSURE
+ bool "Memory pressure monitor for Control Groups"
+ help
+ TODO
+
config CGROUP_HUGETLB
bool "HugeTLB Resource Controller for Control Groups"
depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..40cee19 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEMPRESSURE) += mempressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/mempressure.c b/mm/mempressure.c
new file mode 100644
index 0000000..5c85bbe
--- /dev/null
+++ b/mm/mempressure.c
@@ -0,0 +1,287 @@
+/*
+ * Linux VM pressure notifications
+ *
+ * Copyright 2012 Linaro Ltd.
+ * Anton Vorontsov <[email protected]>
+ *
+ * Based on ideas from David Rientjes, KOSAKI Motohiro, Leonid Moiseichuk,
+ * Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/atomic.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+
+static void mpc_vmpressure(ulong scanned, ulong reclaimed);
+
+/*
+ * Generic VM Pressure routines (no cgroups or any other API details)
+ */
+
+/* These are defaults. Might make them configurable one day. */
+static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const uint vmpressure_level_med = 60;
+static const uint vmpressure_level_oom = 99;
+static const uint vmpressure_level_oom_prio = 4;
+
+enum vmpressure_levels {
+ VMPRESSURE_LOW = 0,
+ VMPRESSURE_MEDIUM,
+ VMPRESSURE_OOM,
+ VMPRESSURE_NUM_LEVELS,
+};
+
+static const char * const vmpressure_str_levels[] = {
+ [VMPRESSURE_LOW] = "low",
+ [VMPRESSURE_MEDIUM] = "medium",
+ [VMPRESSURE_OOM] = "oom",
+};
+
+static enum vmpressure_levels vmpressure_level(uint pressure)
+{
+ if (pressure >= vmpressure_level_oom)
+ return VMPRESSURE_OOM;
+ else if (pressure >= vmpressure_level_med)
+ return VMPRESSURE_MEDIUM;
+ return VMPRESSURE_LOW;
+}
+
+static ulong vmpressure_calc_level(uint win, uint s, uint r)
+{
+ ulong p;
+
+ if (!s)
+ return 0;
+
+ /*
+ * We calculate the ratio (in percents) of how many pages were
+ * scanned vs. reclaimed in a given time frame (window). Note that
+ * time is in VM reclaimer's "ticks", i.e. number of pages
+ * scanned. This makes it possible to set desired reaction time
+ * and serves as a ratelimit.
+ */
+ p = win - (r * win / s);
+ p = p * 100 / win;
+
+ pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
+
+ return vmpressure_level(p);
+}
+
+void vmpressure(ulong scanned, ulong reclaimed)
+{
+ if (!scanned)
+ return;
+ mpc_vmpressure(scanned, reclaimed);
+}
+
+void vmpressure_prio(int prio)
+{
+ if (prio > vmpressure_level_oom_prio)
+ return;
+
+ /* OK, the prio is below the threshold, send the pre-OOM event. */
+ vmpressure(vmpressure_win, 0);
+}
+
+/*
+ * Memory pressure cgroup code
+ */
+
+struct mpc_state {
+ struct cgroup_subsys_state css;
+ uint scanned;
+ uint reclaimed;
+ struct mutex lock;
+ struct eventfd_ctx *eventfd;
+ enum vmpressure_levels thres;
+};
+
+static struct mpc_state *css2mpc(struct cgroup_subsys_state *css)
+{
+ return container_of(css, struct mpc_state, css);
+}
+
+static struct mpc_state *tsk2mpc(struct task_struct *tsk)
+{
+ return css2mpc(task_subsys_state(tsk, mpc_cgroup_subsys_id));
+}
+
+static struct mpc_state *cg2mpc(struct cgroup *cg)
+{
+ return css2mpc(cgroup_subsys_state(cg, mpc_cgroup_subsys_id));
+}
+
+static void __mpc_vmpressure(ulong scanned, ulong reclaimed)
+{
+ struct mpc_state *mpc = tsk2mpc(current);
+ int level;
+
+ mpc->scanned += scanned;
+ mpc->reclaimed += reclaimed;
+
+ if (mpc->scanned < vmpressure_win)
+ return;
+
+ level = vmpressure_calc_level(vmpressure_win,
+ mpc->scanned, mpc->reclaimed);
+ if (level >= mpc->thres) {
+ mutex_lock(&mpc->lock);
+ if (mpc->eventfd)
+ eventfd_signal(mpc->eventfd, 1);
+ mutex_unlock(&mpc->lock);
+ }
+}
+
+static void mpc_vmpressure(ulong scanned, ulong reclaimed)
+{
+ /*
+ * There are two options for implementing cgroup pressure
+ * notifications:
+ *
+ * - Store pressure counter atomically in the task struct. Upon
+ * hitting 'window' wake up a workqueue that will walk every
+ * task and sum per-thread pressure into cgroup pressure (to
+ * which the task belongs). The cons are obvious: it bloats the task
+ * struct, we have to walk all processes, and it makes pressure less
+ * accurate (the window becomes per-thread);
+ *
+ * - Store pressure counters in per-cgroup state. This is easy and
+ * straightforward, and that's how we do things here. But this
+ * requires us to not put the vmpressure hooks into the hot path,
+ * since we have to grab some locks.
+ */
+ task_lock(current);
+ __mpc_vmpressure(scanned, reclaimed);
+ task_unlock(current);
+}
+
+static struct cgroup_subsys_state *mpc_create(struct cgroup *cg)
+{
+ struct mpc_state *mpc;
+
+ mpc = kzalloc(sizeof(*mpc), GFP_KERNEL);
+ if (!mpc)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&mpc->lock);
+
+ return &mpc->css;
+}
+
+static int mpc_pre_destroy(struct cgroup *cg)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+ int ret = 0;
+
+ mutex_lock(&mpc->lock);
+
+ if (mpc->eventfd)
+ ret = -EBUSY;
+
+ mutex_unlock(&mpc->lock);
+
+ return ret;
+}
+
+static void mpc_destroy(struct cgroup *cg)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+
+ kfree(mpc);
+}
+
+static ssize_t mpc_read_level(struct cgroup *cg, struct cftype *cft,
+ struct file *file, char __user *buf,
+ size_t sz, loff_t *ppos)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+ uint level;
+ const char *str;
+
+ mutex_lock(&mpc->lock);
+
+ level = vmpressure_calc_level(vmpressure_win,
+ mpc->scanned, mpc->reclaimed);
+ mpc->scanned = 0;
+ mpc->reclaimed = 0;
+
+ mutex_unlock(&mpc->lock);
+
+ str = vmpressure_str_levels[level];
+ return simple_read_from_buffer(buf, sz, ppos, str, strlen(str));
+}
+
+static int mpc_register_level_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd,
+ const char *args)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+ int i;
+ int ret;
+
+ mutex_lock(&mpc->lock);
+
+ /*
+ * It's easy to implement multiple thresholds, but so far we don't
+ * need it.
+ */
+ if (mpc->eventfd) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ ret = -EINVAL;
+ for (i = 0; i < VMPRESSURE_NUM_LEVELS; i++) {
+ if (strcmp(vmpressure_str_levels[i], args))
+ continue;
+ mpc->eventfd = eventfd;
+ mpc->thres = i;
+ ret = 0;
+ break;
+ }
+out_unlock:
+ mutex_unlock(&mpc->lock);
+
+ return ret;
+}
+
+static void mpc_unregister_level_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+
+ mutex_lock(&mpc->lock);
+ BUG_ON(mpc->eventfd != eventfd);
+ mpc->eventfd = NULL;
+ mutex_unlock(&mpc->lock);
+}
+
+static struct cftype mpc_files[] = {
+ {
+ .name = "level",
+ .read = mpc_read_level,
+ .register_event = mpc_register_level_event,
+ .unregister_event = mpc_unregister_level_event,
+ },
+ {},
+};
+
+struct cgroup_subsys mpc_cgroup_subsys = {
+ .name = "mempressure",
+ .subsys_id = mpc_cgroup_subsys_id,
+ .create = mpc_create,
+ .pre_destroy = mpc_pre_destroy,
+ .destroy = mpc_destroy,
+ .base_cftypes = mpc_files,
+};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..430d8a5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1877,6 +1877,8 @@ restart:
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);

+ vmpressure(sc->nr_scanned - nr_scanned, nr_reclaimed);
+
/* reclaim/compaction might need reclaim to continue */
if (should_continue_reclaim(lruvec, nr_reclaimed,
sc->nr_scanned - nr_scanned, sc))
@@ -2099,6 +2101,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
count_vm_event(ALLOCSTALL);

do {
+ vmpressure_prio(sc->priority);
sc->nr_scanned = 0;
aborted_reclaim = shrink_zones(zonelist, sc);

--
1.8.0


2012-11-28 16:29:30

by Michal Hocko

Subject: Re: [RFC] Add mempressure cgroup

On Wed 28-11-12 02:29:08, Anton Vorontsov wrote:
> This is an attempt to implement David Rientjes' idea of mempressure
> cgroup.
>
> The main characteristics are the same as what I've tried to add to the vmevent
> API:
>
> Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the
> userland. Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
> The rationale behind exposing levels rather than the raw pressure index is
> described here: http://lkml.org/lkml/2012/11/16/675
>
> The API uses standard cgroups eventfd notifications:
>
> $ gcc Documentation/cgroups/cgroup_event_listener.c -o \
> cgroup_event_listener
> $ cd /sys/fs/cgroup/
> $ mkdir mempressure
> $ mount -t cgroup cgroup ./mempressure -o mempressure
> $ cd mempressure
> $ cgroup_event_listener ./mempressure.level low
> ("low", "medium", "oom" are permitted values.)
>
> Upon hitting the threshold, you should see "/sys/fs/cgroup/mempressure
> low: crossed" messages.
>
> To test that it actually works on a per-cgroup basis, I did a small trick: I
> moved all kswapd threads into a separate cgroup, and hooked the listener onto
> another (non-root) cgroup. The listener no longer received global reclaim
> pressure, which is expected.

Is this really expected? So you want to be notified only about the
direct reclaim?
I am not sure how useful that is. If you co-mount with e.g. memcg then
the picture is different because even global memory pressure is spread
among groups so it would be just a matter of the proper accounting
(which can be handled similar to lruvec when your code doesn't have to
care about memcg internally).
Co-mounting with cpusets makes sense as well because then you get a
pressure notification based on the placement policy.

So does it make much sense to mount mempressure on its own without
co-mounting with other controllers?

> For a task it is possible to be in cpusets, memcg and mempressure
> cgroups simultaneously, so by rearranging the tasks it should be possible
> to watch a specific pressure.

Could you be more specific about what you mean by rearranging? Creating the
same hierarchy? Co-mounting?

> Note that while this adds cgroup support, the code is well separated
> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> But this is another story.

I think it would be nice to follow freezer and split this into two files:
generic and cgroup-specific.

> Signed-off-by: Anton Vorontsov <[email protected]>
> ---
[...]
> +/* These are defaults. Might make them configurable one day. */
> +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;

I realize this is just an RFC, but could you be more specific about the
meaning of vmpressure_win?

> +static const uint vmpressure_level_med = 60;
> +static const uint vmpressure_level_oom = 99;
> +static const uint vmpressure_level_oom_prio = 4;
> +
> +enum vmpressure_levels {
> + VMPRESSURE_LOW = 0,
> + VMPRESSURE_MEDIUM,
> + VMPRESSURE_OOM,
> + VMPRESSURE_NUM_LEVELS,
> +};
> +
> > +static const char * const vmpressure_str_levels[] = {
> + [VMPRESSURE_LOW] = "low",
> + [VMPRESSURE_MEDIUM] = "medium",
> + [VMPRESSURE_OOM] = "oom",
> +};
> +
> +static enum vmpressure_levels vmpressure_level(uint pressure)
> +{
> + if (pressure >= vmpressure_level_oom)
> + return VMPRESSURE_OOM;
> + else if (pressure >= vmpressure_level_med)
> + return VMPRESSURE_MEDIUM;
> + return VMPRESSURE_LOW;
> +}
> +
> +static ulong vmpressure_calc_level(uint win, uint s, uint r)
> +{
> + ulong p;
> +
> + if (!s)
> + return 0;
> +
> + /*
> + * We calculate the ratio (in percents) of how many pages were
> + * scanned vs. reclaimed in a given time frame (window). Note that
> + * time is in VM reclaimer's "ticks", i.e. number of pages
> + * scanned. This makes it possible to set desired reaction time
> + * and serves as a ratelimit.
> + */
> + p = win - (r * win / s);
> + p = p * 100 / win;

Do we need the win at all?
p = 100 - (100 * r / s);
> +
> + pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
> +
> + return vmpressure_level(p);
> +}
> +
[...]
> +static int mpc_pre_destroy(struct cgroup *cg)
> +{
> + struct mpc_state *mpc = cg2mpc(cg);
> + int ret = 0;
> +
> + mutex_lock(&mpc->lock);
> +
> + if (mpc->eventfd)
> + ret = -EBUSY;

The current cgroup core doesn't allow pre_destroy to fail anymore; the
change is marked for 3.8.

[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 48550c6..430d8a5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1877,6 +1877,8 @@ restart:
> shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> sc, LRU_ACTIVE_ANON);
>
> + vmpressure(sc->nr_scanned - nr_scanned, nr_reclaimed);
> +

I think this should already report to the proper group, otherwise all the
global reclaim would go to the group where kswapd sits rather than to the
target group, as I mentioned above (so it at least wouldn't work in the
co-mounted cases).

> /* reclaim/compaction might need reclaim to continue */
> if (should_continue_reclaim(lruvec, nr_reclaimed,
> sc->nr_scanned - nr_scanned, sc))
> @@ -2099,6 +2101,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> count_vm_event(ALLOCSTALL);
>
> do {
> + vmpressure_prio(sc->priority);

Shouldn't this go into shrink_lruvec or somewhere at that level to also
catch kswapd's low priorities? If you insist on direct reclaim only, then
you should hook into __zone_reclaim as well.

> sc->nr_scanned = 0;
> aborted_reclaim = shrink_zones(zonelist, sc);

--
Michal Hocko
SUSE Labs

2012-11-28 23:14:35

by Andrew Morton

Subject: Re: [RFC] Add mempressure cgroup

On Wed, 28 Nov 2012 02:29:08 -0800
Anton Vorontsov <[email protected]> wrote:

> The main characteristics are the same as what I've tried to add to the vmevent
> API:
>
> Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the
> userland. Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
> The rationale behind exposing levels rather than the raw pressure index is
> described here: http://lkml.org/lkml/2012/11/16/675

This rationale is central to the overall design (and is hence central
to the review). It would be better to include it in the changelogs
where it can be maintained, understood and discussed.


I see a problem with it:


It blurs the question of "who is in control". We tell userspace "hey,
we're getting a bit tight here, please do something". And userspace
makes the decision about what "something" is. So userspace is in
control of part of the reclaim function and the kernel is in control of
another part. Strange interactions are likely.

Also, the system as a whole is untestable by kernel developers - it
puts the onus onto each and every userspace developer to develop, test
and tune his application against a particular kernel version.

And the more carefully the userspace developer tunes his application,
the more vulnerable he becomes to regressions which were caused by
subtle changes in the kernel's behaviour.


Compare this with the shrink_slab() shrinkers. With these, the VM can
query and then control the clients. If something goes wrong or is out
of balance, it's the VM's problem to solve.

So I'm thinking that a better design would be one which puts the kernel
VM in control of userspace scanning and freeing. Presumably with a
query-and-control interface similar to the slab shrinkers.
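
(For context, the in-kernel interface being referred to looks roughly like
this in 3.7-era kernels, from include/linux/shrinker.h. The VM calls
->shrink() with nr_to_scan == 0 to query the size of the client's cache,
then calls it again with a non-zero nr_to_scan to make the client free that
many objects; a userspace analogue would presumably mirror these query and
control phases:)

struct shrink_control {
	gfp_t gfp_mask;
	/* How many slab objects shrinker() should scan and try to reclaim */
	unsigned long nr_to_scan;
};

struct shrinker {
	int (*shrink)(struct shrinker *, struct shrink_control *sc);
	int seeks;	/* seeks to recreate an obj */
	long batch;	/* reclaim batch size, 0 = default */

	/* These are for internal use */
	struct list_head list;
	atomic_long_t nr_in_batch; /* objs pending delete */
};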

IOW, we make the kernel smarter and make userspace dumber. Userspace
just sits there and does what the kernel tells it to do.

This gives the kernel developers the ability to tune and tweak (ie:
alter) userspace's behaviour *years* after that userspace code was
written.

Probably most significantly, this approach has a really big advantage:
we can test it. Once we have defined that userspace query/control
interface we can write a compliant userspace test application then fire
it up and observe the overall system behaviour. We can fix bugs and we
can tune it. This cannot be done with your proposed interface because
we just don't know what userspace will do in response to changes in the
exposed metric.

2012-11-29 01:31:13

by Anton Vorontsov

Subject: Re: [RFC] Add mempressure cgroup

On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
[...]
> Compare this with the shrink_slab() shrinkers. With these, the VM can
> query and then control the clients. If something goes wrong or is out
> of balance, it's the VM's problem to solve.
>
> So I'm thinking that a better design would be one which puts the kernel
> VM in control of userspace scanning and freeing. Presumably with a
> query-and-control interface similar to the slab shrinkers.

Thanks for the ideas, Andrew.

The query-and-control scheme looks very attractive, and it actually
resembles my "balance" level idea, where userland tells the kernel how much
reclaimable memory it has. Except that your scheme works in the reverse
direction, i.e. the kernel takes charge.

But there is one rather major issue: we're crossing the kernel-userspace
boundary. And with this scheme we'll have to cross the boundary four times:
query / reply-available / control / reply-shrunk / (and repeat if
necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
synchronously (all four stages), and/or we have to make a "userspace
shrinker" thread work in parallel with the normal shrinker, and here,
I'm afraid, we'll see more strange interactions. :)

But there is good news: for this kind of fine-grained control we have a
better interface, one where we don't have to communicate [very often] w/ the
kernel. These are "volatile ranges", where userland itself marks chunks of
data as "I might need it, but I won't cry if you recycle it; but when I
access it next time, let me know if you actually recycled it". Yes,
userland is no longer able to decide which exact pages it permits to recycle,
but we don't have use-cases where we actually care that much. And if we do,
we'd rather introduce volatile LRUs with different priorities, or
something similar.
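
(A hypothetical sketch of that usage pattern; volatile ranges were still an
out-of-tree proposal at this point, so the MADV_VOLATILE/MADV_NOVOLATILE
names and the regenerate_cache() helper below are purely illustrative, not
a settled API:)

	/* "You may recycle these pages if memory gets tight." */
	madvise(cache, len, MADV_VOLATILE);

	/* ... later, before touching the cached data again ... */
	if (madvise(cache, len, MADV_NOVOLATILE) > 0) {
		/* The kernel purged (part of) the range; rebuild it. */
		regenerate_cache(cache, len);
	}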

So, we really don't need a full-fledged userland shrinker, since we can
just let the in-kernel shrinker do its job. If we work at bytes/pages
granularity, it is just easier (and more efficient in terms of
communication) to do volatile ranges.

For the pressure-notification use-cases, we don't even have bytes/pages
information: "activity managers" are separate processes looking after
overall system performance.

So, we're not trying to make userland too smart, quite the contrary: we
realized that for this interface we don't want to mess with the bytes and
pages, and that's why we cut this stuff down to only three levels. Before
this, we were actually trying to count bytes, we did not like it and we
ran away screaming.

OTOH, your scheme makes volatile ranges unneeded, since a thread might
register a shrinker hook and free stuff by itself. But again, I believe
this involves more communication with the kernel.

Thanks,
Anton.

2012-11-29 03:36:54

by Anton Vorontsov

Subject: Re: [RFC] Add mempressure cgroup

On Wed, Nov 28, 2012 at 05:27:51PM -0800, Anton Vorontsov wrote:
> On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
> [...]
> > Compare this with the shrink_slab() shrinkers. With these, the VM can
> > query and then control the clients. If something goes wrong or is out
> > of balance, it's the VM's problem to solve.
> >
> > So I'm thinking that a better design would be one which puts the kernel
> > VM in control of userspace scanning and freeing. Presumably with a
> > query-and-control interface similar to the slab shrinkers.
>
> Thanks for the ideas, Andrew.
>
> The query-and-control scheme looks very attractive, and it actually
> resembles my "balance" level idea, where userland tells the kernel how much
> reclaimable memory it has. Except that your scheme works in the reverse
> direction, i.e. the kernel takes charge.
>
> But there is one rather major issue: we're crossing the kernel-userspace
> boundary. And with this scheme we'll have to cross the boundary four times:
> query / reply-available / control / reply-shrunk / (and repeat if
> necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
> synchronously (all four stages), and/or we have to make a "userspace
> shrinker" thread work in parallel with the normal shrinker, and here,
> I'm afraid, we'll see more strange interactions. :)
>
> But there is good news: for this kind of fine-grained control we have a
> better interface, one where we don't have to communicate [very often] w/ the
> kernel. These are "volatile ranges", where userland itself marks chunks of
> data as "I might need it, but I won't cry if you recycle it; but when I
> access it next time, let me know if you actually recycled it". Yes,
> userland is no longer able to decide which exact pages it permits to recycle,
> but we don't have use-cases where we actually care that much. And if we do,
> we'd rather introduce volatile LRUs with different priorities, or
> something similar.
>
> So, we really don't need a full-fledged userland shrinker, since we can
> just let the in-kernel shrinker do its job. If we work at bytes/pages
> granularity, it is just easier (and more efficient in terms of
> communication) to do volatile ranges.
>
> For the pressure-notification use-cases, we don't even have bytes/pages
> information: "activity managers" are separate processes looking after
> overall system performance.
>
> So, we're not trying to make userland too smart, quite the contrary: we
> realized that for this interface we don't want to mess with the bytes and
> pages, and that's why we cut this stuff down to only three levels. Before
> this, we were actually trying to count bytes, we did not like it and we
> ran away screaming.
>
> OTOH, your scheme makes volatile ranges unneeded, since a thread might
> register a shrinker hook and free stuff by itself. But again, I believe
> this involves more communication with the kernel.

Btw, I believe your idea is something completely new, and I surely cannot
fully evaluate it on my own -- I might be wrong here. So I invite folks to
express their opinions too.

Guys, it's about Andrew's idea of exposing shrinker-like logic to
userland (and I framed it as 'vs. volatile ranges'):

http://lkml.org/lkml/2012/11/28/607

Thanks,
Anton.

2012-11-29 04:20:54

by Anton Vorontsov

Subject: Re: [RFC] Add mempressure cgroup

Hello Michal,

Thanks a lot for taking a look into this!

On Wed, Nov 28, 2012 at 05:29:24PM +0100, Michal Hocko wrote:
> On Wed 28-11-12 02:29:08, Anton Vorontsov wrote:
> > This is an attempt to implement David Rientjes' idea of mempressure
> > cgroup.
> >
> > The main characteristics are the same as what I've tried to add to the vmevent
> > API:
> >
> > Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> > pressure index calculation. But we don't expose the index to the
> > userland. Instead, there are three levels of the pressure:
> >
> > o low (just reclaiming, e.g. caches are draining);
> > o medium (allocation cost becomes high, e.g. swapping);
> > o oom (about to oom very soon).
> >
> > The rationale behind exposing levels rather than the raw pressure index is
> > described here: http://lkml.org/lkml/2012/11/16/675
> >
> > The API uses standard cgroups eventfd notifications:
> >
> > $ gcc Documentation/cgroups/cgroup_event_listener.c -o \
> > cgroup_event_listener
> > $ cd /sys/fs/cgroup/
> > $ mkdir mempressure
> > $ mount -t cgroup cgroup ./mempressure -o mempressure
> > $ cd mempressure
> > $ cgroup_event_listener ./mempressure.level low
> > ("low", "medium", "oom" are permitted values.)
> >
> > Upon hitting the threshold, you should see "/sys/fs/cgroup/mempressure
> > low: crossed" messages.
> >
> > To test that it actually works on a per-cgroup basis, I did a small trick: I
> > moved all kswapd threads into a separate cgroup, and hooked the listener onto
> > another (non-root) cgroup. The listener no longer received global reclaim
> > pressure, which is expected.
>
> Is this really expected? So you want to be notified only about the
> direct reclaim?

I didn't try to put much meaning into assigning a task to non-global
reclaim watchers; I just mentioned this as the easiest way to test that we
can actually account things on a per-thread basis. :)

> I am not sure how useful that is. If you co-mount with e.g. memcg then
> the picture is different because even global memory pressure is spread
> among groups so it would be just a matter of the proper accounting
> (which can be handled similar to lruvec when your code doesn't have to
> care about memcg internally).
> Co-mounting with cpusets makes sense as well because then you get a
> pressure notification based on the placement policy.
>
> So does it make much sense to mount mempressure on its own without
> co-mounting with other controllers?

Android does not actually need any of these (memcg or cpusets), but we
still want to get notifications (the root cgroup would be enough for us
-- but I'm trying to make things generic, of course).

> > For a task it is possible to be in cpusets, memcg and mempressure
> > cgroups simultaneously, so by rearranging the tasks it should be possible
> > to watch a specific pressure.
>
> Could you be more specific about what you mean by rearranging? Creating the
> same hierarchy? Co-mounting?
>
> > Note that while this adds cgroup support, the code is well separated
> > and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> > But this is another story.
>
> I think it would be nice to follow freezer and split this into two files:
> generic and cgroup-specific.

Yeah, this is surely an option, but so far it's only a few hundred lines
of code, plus we don't have any other users for the "internals". So, for
the time being, I'd rather keep it in one file.

> > Signed-off-by: Anton Vorontsov <[email protected]>
> > ---
> [...]
> > +/* These are defaults. Might make them configurable one day. */
> > +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
>
> I realize this is just an RFC, but could you be more specific about the
> meaning of vmpressure_win?

Sure, let me just copy the text from the previous RFC, to which you were
not Cc'ed:

When the system is short on idle pages, new memory is allocated by
reclaiming the least recently used resources: the kernel scans pages to be
reclaimed (e.g. from file caches, mmap(2) volatile ranges, etc.), potentially
swapping some pages out. The index shows the relative time spent by the
kernel uselessly scanning pages, or, in other words, the percentage of
scanned pages (within vmpressure_window) that were not reclaimed.
...
The window size is used as a rate-limit tunable for VMPRESSURE_LOW
notifications and as an averaging period for the VMPRESSURE_{MEDIUM,OOM}
levels. So, using small window sizes can cause a lot of false positives
for the _MEDIUM and _OOM levels, but too big a window size may delay
notifications. By default the window size equals 256 pages (1MB).
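
For illustration, with the default in this patch (vmpressure_win =
SWAP_CLUSTER_MAX * 16 = 512 pages), a level is computed once 512 pages have
been scanned; if, say, only 128 of them were reclaimed, the index is
(512 - 128 * 512 / 512) * 100 / 512 = 75, which maps to "medium".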

You can find more about the tunables in the previous RFC:

http://lkml.org/lkml/2012/11/7/169

> > +static const uint vmpressure_level_med = 60;
> > +static const uint vmpressure_level_oom = 99;
> > +static const uint vmpressure_level_oom_prio = 4;
> > +
> > +enum vmpressure_levels {
> > + VMPRESSURE_LOW = 0,
> > + VMPRESSURE_MEDIUM,
> > + VMPRESSURE_OOM,
> > + VMPRESSURE_NUM_LEVELS,
> > +};
> > +
> > > +static const char * const vmpressure_str_levels[] = {
> > + [VMPRESSURE_LOW] = "low",
> > + [VMPRESSURE_MEDIUM] = "medium",
> > + [VMPRESSURE_OOM] = "oom",
> > +};
> > +
> > +static enum vmpressure_levels vmpressure_level(uint pressure)
> > +{
> > + if (pressure >= vmpressure_level_oom)
> > + return VMPRESSURE_OOM;
> > + else if (pressure >= vmpressure_level_med)
> > + return VMPRESSURE_MEDIUM;
> > + return VMPRESSURE_LOW;
> > +}
> > +
> > +static ulong vmpressure_calc_level(uint win, uint s, uint r)
> > +{
> > + ulong p;
> > +
> > + if (!s)
> > + return 0;
> > +
> > + /*
> > + * We calculate the ratio (in percents) of how many pages were
> > + * scanned vs. reclaimed in a given time frame (window). Note that
> > + * time is in VM reclaimer's "ticks", i.e. number of pages
> > + * scanned. This makes it possible to set desired reaction time
> > + * and serves as a ratelimit.
> > + */
> > + p = win - (r * win / s);
> > + p = p * 100 / win;
>
> Do we need the win at all?
> p = 100 - (100 * r / s);

Other than me being a pedant, pretty much no. :) Yours is just less
"precise" (try s=1000, r=9). (But in return, my version is prone to
misbehaving when the window is too large.)
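
(Spelling that out: with s = 1000 and r = 9, the simplified form gives
100 - (100 * 9 / 1000) = 100 - 0 = 100 in integer arithmetic, since
100 * 9 / 1000 truncates to zero, while the windowed form with win = 512
gives (512 - 9 * 512 / 1000) * 100 / 512 = (512 - 4) * 100 / 512 = 99.)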

> > +
> > + pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
> > +
> > + return vmpressure_level(p);
> > +}
> > +
> [...]
> > +static int mpc_pre_destroy(struct cgroup *cg)
> > +{
> > + struct mpc_state *mpc = cg2mpc(cg);
> > + int ret = 0;
> > +
> > + mutex_lock(&mpc->lock);
> > +
> > + if (mpc->eventfd)
> > + ret = -EBUSY;
>
> The current cgroup core doesn't allow pre_destroy to fail anymore; the
> change is marked for 3.8.

Sure, I can rebase. (Currently, the code is based on v3.7-rc6, which
isn't even released yet but already seems way too old, heh. :)

> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 48550c6..430d8a5 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1877,6 +1877,8 @@ restart:
> > shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> > sc, LRU_ACTIVE_ANON);
> >
> > + vmpressure(sc->nr_scanned - nr_scanned, nr_reclaimed);
> > +
>
> I think this should already report to the proper group, otherwise all the
> global reclaim would go to the group where kswapd sits rather than to the
> target group, as I mentioned above (so it at least wouldn't work in the
> co-mounted cases).

Um. Yeah, I guess I was too optimistic here, relying on things to
"just work". I guess I still need to pass a memcg pointer to
vmpressure() and check whether the process is also part of
sc->target_mem_cgroup.

> > /* reclaim/compaction might need reclaim to continue */
> > if (should_continue_reclaim(lruvec, nr_reclaimed,
> > sc->nr_scanned - nr_scanned, sc))
> > @@ -2099,6 +2101,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> > count_vm_event(ALLOCSTALL);
> >
> > do {
> > + vmpressure_prio(sc->priority);
>
> Shouldn't this go into shrink_lruvec or somewhere at that level to also
> catch kswapd's low priorities? If you insist on direct reclaim only, then
> you should hook into __zone_reclaim as well.

Probably... Thanks for pointing that out; I'll take a closer look once we
resolve the global/design issues.

Thanks!
Anton.

2012-11-29 06:13:54

by Kirill A. Shutemov

Subject: Re: [RFC] Add mempressure cgroup

On Wed, Nov 28, 2012 at 02:29:08AM -0800, Anton Vorontsov wrote:
> +static int mpc_pre_destroy(struct cgroup *cg)
> +{
> + struct mpc_state *mpc = cg2mpc(cg);
> + int ret = 0;
> +
> + mutex_lock(&mpc->lock);
> +
> + if (mpc->eventfd)
> + ret = -EBUSY;

cgroup_rmdir() will unregister all events for you. No need to handle it
here.

> +
> + mutex_unlock(&mpc->lock);
> +
> + return ret;
> +}

> +static int mpc_register_level_event(struct cgroup *cg, struct cftype *cft,
> + struct eventfd_ctx *eventfd,
> + const char *args)
> +{
> + struct mpc_state *mpc = cg2mpc(cg);
> + int i;
> + int ret;
> +
> + mutex_lock(&mpc->lock);
> +
> + /*
> + * It's easy to implement multiple thresholds, but so far we don't
> + * need it.
> + */
> + if (mpc->eventfd) {
> + ret = -EBUSY;
> + goto out_unlock;
> + }

One user listening for one threshold per cgroup?
I think that's wrong. It's essential for the API to serve multiple users.

> +
> + ret = -EINVAL;
> + for (i = 0; i < VMPRESSURE_NUM_LEVELS; i++) {
> + if (strcmp(vmpressure_str_levels[i], args))
> + continue;
> + mpc->eventfd = eventfd;
> + mpc->thres = i;
> + ret = 0;
> + break;
> + }
> +out_unlock:
> + mutex_unlock(&mpc->lock);
> +
> + return ret;
> +}

--
Kirill A. Shutemov

2012-11-29 06:24:21

by Anton Vorontsov

Subject: Re: [RFC] Add mempressure cgroup

On Thu, Nov 29, 2012 at 08:14:13AM +0200, Kirill A. Shutemov wrote:
> On Wed, Nov 28, 2012 at 02:29:08AM -0800, Anton Vorontsov wrote:
> > +static int mpc_pre_destroy(struct cgroup *cg)
> > +{
> > + struct mpc_state *mpc = cg2mpc(cg);
> > + int ret = 0;
> > +
> > + mutex_lock(&mpc->lock);
> > +
> > + if (mpc->eventfd)
> > + ret = -EBUSY;
>
> cgroup_rmdir() will unregister all events for you. No need to handle it
> here.

Okie, thanks!

[...]
> > +static int mpc_register_level_event(struct cgroup *cg, struct cftype *cft,
> > + struct eventfd_ctx *eventfd,
> > + const char *args)
> > +{
> > + struct mpc_state *mpc = cg2mpc(cg);
> > + int i;
> > + int ret;
> > +
> > + mutex_lock(&mpc->lock);
> > +
> > + /*
> > + * It's easy to implement multiple thresholds, but so far we don't
> > + * need it.
> > + */
> > + if (mpc->eventfd) {
> > + ret = -EBUSY;
> > + goto out_unlock;
> > + }
>
> One user listening for one threshold per cgroup?
> I think that's wrong. It's essential for the API to serve multiple users.

Yeah, if we consider merging this, I'll definitely fix it. I just didn't
want to bring the complexity into the code.

Thanks,
Anton.

2012-11-30 17:49:16

by Luiz Capitulino

Subject: Re: [RFC] Add mempressure cgroup

On Wed, 28 Nov 2012 17:27:51 -0800
Anton Vorontsov <[email protected]> wrote:

> On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
> [...]
> > Compare this with the shrink_slab() shrinkers. With these, the VM can
> > query and then control the clients. If something goes wrong or is out
> > of balance, it's the VM's problem to solve.
> >
> > So I'm thinking that a better design would be one which puts the kernel
> > VM in control of userspace scanning and freeing. Presumably with a
> > query-and-control interface similar to the slab shrinkers.
>
> Thanks for the ideas, Andrew.
>
> The query-and-control scheme looks very attractive, and it actually
> resembles my "balance" level idea, where userland tells the kernel how much
> reclaimable memory it has. Except that your scheme works in the reverse
> direction, i.e. the kernel takes charge.
>
> But there is one rather major issue: we're crossing the kernel-userspace
> boundary. And with this scheme we'll have to cross the boundary four times:
> query / reply-available / control / reply-shrunk / (and repeat if
> necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
> synchronously (all four stages), and/or we have to make a "userspace
> shrinker" thread work in parallel with the normal shrinker, and here,
> I'm afraid, we'll see more strange interactions. :)

Wouldn't this be just like kswapd?

> But there is good news: for this kind of fine-grained control we have a
> better interface, one where we don't have to communicate [very often] w/ the
> kernel. These are "volatile ranges", where userland itself marks chunks of
> data as "I might need it, but I won't cry if you recycle it; but when I
> access it next time, let me know if you actually recycled it". Yes,
> userland is no longer able to decide which exact pages it permits to recycle,
> but we don't have use-cases where we actually care that much. And if we do,
> we'd rather introduce volatile LRUs with different priorities, or
> something similar.

I'm new to this stuff so please take this with a grain of salt, but I'm
not sure volatile ranges would be a good fit for our use case: we want to
make (kvm) guests reduce their memory usage when the host comes under memory
pressure.

Having a notification seems just fine for this purpose, but I'm not sure
how this would work with volatile ranges, as we'd have to mark pages volatile
in advance.

Andrew's idea seems to give a lot more freedom to apps, IMHO.