Hi all,
This is the third RFC. As suggested by Minchan Kim, the API is much
simplified now (compared to vmevent_fd):
- Like Minchan, KOSAKI Motohiro didn't like the timers, so the timers
are gone now;
- Pekka Enberg didn't like the complex attributes matching code, and so it
is no longer there;
- Nobody liked the raw vmstat attributes, and so they were eliminated too.
But, conceptually, it is exactly the same approach as in v2: three
discrete levels of pressure -- low, medium and oom. The levels are
based on the reclaimer inefficiency index as proposed by Mel Gorman, but
userland does not see the raw index values. The description of why I moved
away from reporting the raw 'reclaimer inefficiency index' can be found in
v2: http://lkml.org/lkml/2012/10/22/177
While the new API is very simple, it is still extensible (i.e. versioned).
As there are a lot of drastic changes in the API itself, I decided to just
add new files alongside vmevent; it is much easier to review it this
way (I can prepare a separate patch that removes the vmevent files, if we
care to preserve the history through the vmevent tree).
Thanks,
Anton.
--
Documentation/sysctl/vm.txt | 47 +++++
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 +
include/linux/vmpressure.h | 128 ++++++++++++
kernel/sys_ni.c | 1 +
kernel/sysctl.c | 31 +++
mm/Kconfig | 13 ++
mm/Makefile | 1 +
mm/vmpressure.c | 231 +++++++++++++++++++++
mm/vmscan.c | 5 +
tools/testing/vmpressure/.gitignore | 1 +
tools/testing/vmpressure/Makefile | 30 +++
tools/testing/vmpressure/vmpressure-test.c | 93 +++++++++
13 files changed, 584 insertions(+)
This patch introduces the vmpressure_fd() system call. The system call
creates a new file descriptor that can be used to monitor the pressure on
Linux' virtual memory management. There are three discrete levels of
pressure:
VMPRESSURE_LOW: Notifies that the system is reclaiming memory for new
allocations. Monitoring reclaiming activity might be useful for
maintaining the system's overall cache level.
VMPRESSURE_MEDIUM: The system is experiencing medium memory pressure;
there might be some mild swapping activity. Upon this event, applications
may decide to free any resources that can be easily reconstructed or
re-read from disk.
VMPRESSURE_OOM: The system is actively thrashing, it is about to go out of
memory (OOM), or the in-kernel OOM killer is about to trigger. Applications
should do whatever they can to help the system.
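To illustrate how an application might react (the handler names below are
purely hypothetical, not part of this patch or any library), the levels map
naturally onto progressively more aggressive actions:

    switch (event.pressure) {
    case VMPRESSURE_LOW:
        trim_caches_gently();           /* e.g. shrink app-level caches */
        break;
    case VMPRESSURE_MEDIUM:
        drop_rebuildable_state();       /* free anything re-creatable */
        break;
    case VMPRESSURE_OOM:
        release_everything_possible();  /* help the system survive */
        break;
    }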
There are four sysctls to tune the behaviour of the levels:
vmpressure_window
vmpressure_level_medium
vmpressure_level_oom
vmpressure_level_oom_priority
Currently the vmpressure levels are based on the reclaimer inefficiency
index (ranging from 0 to 100). The index shows the relative time spent by
the kernel uselessly scanning pages or, in other words, the percentage of
pages scanned (per vmpressure_window) that were not reclaimed. The higher
the index, the more evident it is that the cost of new allocations is
growing.
The sysctls vmpressure_level_medium and vmpressure_level_oom accept the
index values (by default set to 60 and 99, respectively). A non-existent
vmpressure_level_low tunable is implicitly set to 0.
An index of 0 means that the kernel is reclaiming, but every scanned page
has been successfully reclaimed (so the pressure is low); 100 means that
the kernel is trying to reclaim, but nothing can be reclaimed (OOM).
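As a worked example (the numbers are illustrative only): with the default
window of 256 pages, if the reclaimer scanned 256 pages but managed to
reclaim only 102 of them, the index is (256 - 102) * 100 / 256 = 60, which
is exactly the default VMPRESSURE_MEDIUM threshold.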
The window size is used as a rate-limit tunable for VMPRESSURE_LOW
notifications and as an averaging period for the VMPRESSURE_{MEDIUM,OOM}
levels. Small window sizes can cause a lot of false positives for the
_MEDIUM and _OOM levels, while too big a window may delay notifications.
By default the window size is 256 pages (1MB with 4KB pages).
The _OOM level is also attached to the reclaimer's priority. When the
system is almost OOM, it might be getting the last reclaimable pages
slowly, scanning all the queues, and so we never catch the OOM case via
window-size averaging. For this case the reclaimer's priority can be used
to detect the pre-OOM condition; the pre-OOM priority level can be set via
the vmpressure_level_oom_priority sysctl.
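For reference, a minimal consumer might look roughly like the sketch below
(illustrative only: error handling is omitted, header paths are simplified
-- the test utility added by a later patch includes the in-tree headers
directly -- and __NR_vmpressure_fd is assumed to come from the arch's
generated unistd header):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <asm/unistd.h>
    #include <stdio.h>
    #include <linux/vmpressure.h>

    int main(void)
    {
        struct vmpressure_config config = {
            .size = sizeof(config),
            .threshold = VMPRESSURE_MEDIUM,
        };
        struct vmpressure_event event;
        int fd = syscall(__NR_vmpressure_fd, &config);

        if (fd < 0)
            return 1;

        /* read(2) blocks until the pressure reaches the threshold. */
        while (read(fd, &event, sizeof(event)) == sizeof(event))
            printf("pressure: 0x%x\n", event.pressure);

        return 0;
    }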
Signed-off-by: Anton Vorontsov <[email protected]>
---
Documentation/sysctl/vm.txt | 48 ++++++++
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 +
include/linux/vmpressure.h | 128 ++++++++++++++++++++++
kernel/sys_ni.c | 1 +
kernel/sysctl.c | 31 ++++++
mm/Kconfig | 13 +++
mm/Makefile | 1 +
mm/vmpressure.c | 231 +++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 5 +
10 files changed, 461 insertions(+)
create mode 100644 include/linux/vmpressure.h
create mode 100644 mm/vmpressure.c
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 078701f..9837fe2 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -44,6 +44,10 @@ Currently, these files are in /proc/sys/vm:
- nr_overcommit_hugepages
- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order
+- vmpressure_window
+- vmpressure_level_medium
+- vmpressure_level_oom
+- vmpressure_level_oom_priority
- oom_dump_tasks
- oom_kill_allocating_task
- overcommit_memory
@@ -487,6 +491,50 @@ this is causing problems for your system/application.
==============================================================
+vmpressure_window
+vmpressure_level_medium
+vmpressure_level_oom
+vmpressure_level_oom_priority
+
+These sysctls are used to tune vmpressure_fd(2) behaviour.
+
+Currently the vmpressure levels are based on the reclaimer
+inefficiency index (ranging from 0 to 100). The files vmpressure_level_medium
+and vmpressure_level_oom accept the index values (by default set to 60 and
+99 respectively). A non-existent vmpressure_level_low tunable is always
+set to 0.
+
+When the system is short on idle pages, new memory is allocated by
+reclaiming least recently used resources: the kernel scans pages to be
+reclaimed (e.g. from file caches, mmap(2) volatile ranges, etc., and
+potentially swaps some pages out). The index shows the relative time
+spent by the kernel uselessly scanning pages or, in other words, the
+percentage of pages scanned (per vmpressure_window) that were not
+reclaimed. The higher the index, the more evident it is that the cost
+of new allocations is growing.
+
+An index of 0 means that the kernel is reclaiming, but every scanned
+page has been successfully reclaimed (so the pressure is low); 100 means
+that the kernel is trying to reclaim, but nothing can be reclaimed
+(close to OOM).
+
+The window size is used as a rate-limit tunable for VMPRESSURE_LOW
+notifications and as an averaging period for the VMPRESSURE_{MEDIUM,OOM}
+levels. Small window sizes can cause a lot of false positives for the
+_MEDIUM and _OOM levels, while too big a window may delay notifications.
+By default the window size is 256 pages (1MB with 4KB pages).
+
+When the system is almost OOM it might be getting the last reclaimable
+pages slowly, scanning all the queues, and so we never catch the OOM case
+via window-size averaging. For this case there is another mechanism of
+detecting the pre-OOM condition: the kernel's reclaimer has a scanning
+priority, the highest priority being 0 (the reclaimer will scan all the
+available pages). The kernel starts scanning with the priority set to 12
+(queue_length >> 12). So, vmpressure_level_oom_priority should be between
+0 and 12 (by default it is set to 4).
+
+==============================================================
+
oom_dump_tasks
Enables a system-wide task dump (excluding kernel threads) to be
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 316449a..6e4fa6a 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
311 64 process_vm_writev sys_process_vm_writev
312 common kcmp sys_kcmp
313 64 vmevent_fd sys_vmevent_fd
+314 64 vmpressure_fd sys_vmpressure_fd
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 19439c7..3d2587d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -63,6 +63,7 @@ struct getcpu_cache;
struct old_linux_dirent;
struct perf_event_attr;
struct file_handle;
+struct vmpressure_config;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -860,4 +861,5 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
+asmlinkage long sys_vmpressure_fd(struct vmpressure_config __user *config);
#endif
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
new file mode 100644
index 0000000..b808b04
--- /dev/null
+++ b/include/linux/vmpressure.h
@@ -0,0 +1,128 @@
+/*
+ * Linux VM pressure notifications
+ *
+ * Copyright 2011-2012 Pekka Enberg <[email protected]>
+ * Copyright 2011-2012 Linaro Ltd.
+ * Anton Vorontsov <[email protected]>
+ *
+ * Based on ideas from KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman,
+ * Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_VMPRESSURE_H
+#define _LINUX_VMPRESSURE_H
+
+#include <linux/types.h>
+
+/**
+ * enum vmpressure_level - Memory pressure levels
+ * @VMPRESSURE_LOW: The system is short on idle pages, losing caches
+ * @VMPRESSURE_MEDIUM: New allocations' cost becomes high
+ * @VMPRESSURE_OOM: The system is about to go out-of-memory
+ */
+enum vmpressure_level {
+ /* We spread the values, reserving room for new levels. */
+ VMPRESSURE_LOW = 1 << 10,
+ VMPRESSURE_MEDIUM = 1 << 20,
+ VMPRESSURE_OOM = 1 << 30,
+};
+
+/**
+ * struct vmpressure_config - Configuration structure for vmpressure_fd()
+ * @size: Size of the struct for ABI extensibility
+ * @threshold: Minimum pressure level of notifications
+ *
+ * This structure is used to configure the file descriptor that
+ * vmpressure_fd() returns.
+ *
+ * @size is used to "version" the ABI, it must be initialized to
+ * 'sizeof(struct vmpressure_config)'.
+ *
+ * @threshold should be one of the @vmpressure_level values; it specifies
+ * the minimal level of notifications that will be delivered.
+ */
+struct vmpressure_config {
+ __u32 size;
+ __u32 threshold;
+};
+
+/**
+ * struct vmpressure_event - An event that is returned via vmpressure fd
+ * @pressure: Most recent system's pressure level
+ *
+ * Upon notification, this structure must be read from the vmpressure file
+ * descriptor.
+ */
+struct vmpressure_event {
+ __u32 pressure;
+};
+
+#ifdef __KERNEL__
+
+struct mem_cgroup;
+
+#ifdef CONFIG_VMPRESSURE
+
+extern uint vmpressure_win;
+extern uint vmpressure_level_med;
+extern uint vmpressure_level_oom;
+extern uint vmpressure_level_oom_prio;
+
+extern void __vmpressure(struct mem_cgroup *memcg,
+ ulong scanned, ulong reclaimed);
+static void vmpressure(struct mem_cgroup *memcg,
+ ulong scanned, ulong reclaimed);
+
+/*
+ * OK, we're cheating. The thing is, we have to average the s/r ratio by
+ * gathering a lot of scans (otherwise we might get spurious local
+ * index readings of '100').
+ *
+ * But... when we're almost OOM we might be getting the last reclaimable
+ * pages slowly, scanning all the queues, and so we never catch the OOM
+ * case via averaging. Although the priority will show it for sure. The
+ * pre-OOM priority value is mostly an empirically taken priority: we
+ * never observe it under any load, except for last few allocations before
+ * the OOM (but the exact value is still configurable via sysctl).
+ */
+static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio)
+{
+ if (prio > vmpressure_level_oom_prio)
+ return;
+
+ /* OK, the prio is below the threshold, send the pre-OOM event. */
+ vmpressure(memcg, vmpressure_win, 0);
+}
+
+#else
+static inline void __vmpressure(struct mem_cgroup *memcg,
+ ulong scanned, ulong reclaimed) {}
+static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {}
+#endif /* CONFIG_VMPRESSURE */
+
+static inline void vmpressure(struct mem_cgroup *memcg,
+ ulong scanned, ulong reclaimed)
+{
+ if (!scanned)
+ return;
+
+ if (IS_BUILTIN(CONFIG_MEMCG) && memcg) {
+ /*
+ * The vmpressure API reports system pressure, for per-cgroup
+ * pressure, we'll chain cgroups notifications, this is to
+ * be implemented.
+ *
+ * memcg_vm_pressure(target_mem_cgroup, scanned, reclaimed);
+ */
+ return;
+ }
+ __vmpressure(memcg, scanned, reclaimed);
+}
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_VMPRESSURE_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 3ccdbf4..9573a5a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -192,6 +192,7 @@ cond_syscall(compat_sys_timerfd_gettime);
cond_syscall(sys_eventfd);
cond_syscall(sys_eventfd2);
cond_syscall(sys_vmevent_fd);
+cond_syscall(sys_vmpressure_fd);
/* performance counters: */
cond_syscall(sys_perf_event_open);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87174ef..7c9a3be 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -50,6 +50,7 @@
#include <linux/dnotify.h>
#include <linux/syscalls.h>
#include <linux/vmstat.h>
+#include <linux/vmpressure.h>
#include <linux/nfs_fs.h>
#include <linux/acpi.h>
#include <linux/reboot.h>
@@ -1317,6 +1318,36 @@ static struct ctl_table vm_table[] = {
.proc_handler = numa_zonelist_order_handler,
},
#endif
+#ifdef CONFIG_VMPRESSURE
+ {
+ .procname = "vmpressure_window",
+ .data = &vmpressure_win,
+ .maxlen = sizeof(vmpressure_win),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "vmpressure_level_medium",
+ .data = &vmpressure_level_med,
+ .maxlen = sizeof(vmpressure_level_med),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "vmpressure_level_oom",
+ .data = &vmpressure_level_oom,
+ .maxlen = sizeof(vmpressure_level_oom),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "vmpressure_level_oom_priority",
+ .data = &vmpressure_level_oom_prio,
+ .maxlen = sizeof(vmpressure_level_oom_prio),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif
#if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \
(defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
{
diff --git a/mm/Kconfig b/mm/Kconfig
index cd0ea24e..8a47a5f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -401,6 +401,19 @@ config VMEVENT
help
If unsure, say N to disable vmevent
+config VMPRESSURE
+ bool "Enable vmpressure_fd() notifications"
+ help
+ This option enables vmpressure_fd() system call, it is used to
+ notify userland applications about system's virtual memory
+ pressure state.
+
+ Upon these notifications, userland programs can cooperate with
+ the kernel (e.g. free easily reclaimable resources), and so
+ achieve better system-wide memory management.
+
+ If unsure, say N.
+
config FRONTSWAP
bool "Enable frontswap to cache swap pages if tmem is present"
depends on SWAP
diff --git a/mm/Makefile b/mm/Makefile
index 80debc7..2f08d14 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -57,4 +57,5 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
obj-$(CONFIG_CLEANCACHE) += cleancache.o
obj-$(CONFIG_VMEVENT) += vmevent.o
+obj-$(CONFIG_VMPRESSURE) += vmpressure.o
obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
new file mode 100644
index 0000000..54f35a3
--- /dev/null
+++ b/mm/vmpressure.c
@@ -0,0 +1,231 @@
+/*
+ * Linux VM pressure notifications
+ *
+ * Copyright 2011-2012 Pekka Enberg <[email protected]>
+ * Copyright 2011-2012 Linaro Ltd.
+ * Anton Vorontsov <[email protected]>
+ *
+ * Based on ideas from KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman,
+ * Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/anon_inodes.h>
+#include <linux/atomic.h>
+#include <linux/compiler.h>
+#include <linux/vmpressure.h>
+#include <linux/syscalls.h>
+#include <linux/workqueue.h>
+#include <linux/mutex.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+
+struct vmpressure_watch {
+ struct vmpressure_config config;
+ atomic_t pending;
+ wait_queue_head_t waitq;
+ struct list_head node;
+};
+
+static atomic64_t vmpressure_sr;
+static uint vmpressure_val;
+
+static LIST_HEAD(vmpressure_watchers);
+static DEFINE_MUTEX(vmpressure_watchers_lock);
+
+/* Our sysctl tunables, see Documentation/sysctl/vm.txt */
+uint __read_mostly vmpressure_win = SWAP_CLUSTER_MAX * 16;
+uint vmpressure_level_med = 60;
+uint vmpressure_level_oom = 99;
+uint vmpressure_level_oom_prio = 4;
+
+/*
+ * This function is called from a workqueue, which can have only one
+ * execution thread, so we don't need to worry about racing w/ ourselves.
+ * And so it is possible to implement the lock-free logic, using just the
+ * atomic watch->pending variable.
+ */
+static void vmpressure_sample(struct vmpressure_watch *watch)
+{
+ if (atomic_read(&watch->pending))
+ return;
+ if (vmpressure_val < watch->config.threshold)
+ return;
+
+ atomic_set(&watch->pending, 1);
+ wake_up(&watch->waitq);
+}
+
+static u64 vmpressure_level(uint pressure)
+{
+ if (pressure >= vmpressure_level_oom)
+ return VMPRESSURE_OOM;
+ else if (pressure >= vmpressure_level_med)
+ return VMPRESSURE_MEDIUM;
+ return VMPRESSURE_LOW;
+}
+
+static uint vmpressure_calc_pressure(uint win, uint s, uint r)
+{
+ ulong p;
+
+ /*
+ * We calculate the ratio (in percents) of how many pages were
+ * scanned vs. reclaimed in a given time frame (window). Note that
+ * time is in VM reclaimer's "ticks", i.e. number of pages
+ * scanned. This makes it possible to set the desired reaction time and
+ * serves as a ratelimit.
+ */
+ p = win - (r * win / s);
+ p = p * 100 / win;
+
+ pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
+
+ return vmpressure_level(p);
+}
+
+#define VMPRESSURE_SCANNED_SHIFT (sizeof(u64) * 8 / 2)
+
+static void vmpressure_wk_fn(struct work_struct *wk)
+{
+ struct vmpressure_watch *watch;
+ u64 sr = atomic64_xchg(&vmpressure_sr, 0);
+ u32 s = sr >> VMPRESSURE_SCANNED_SHIFT;
+ u32 r = sr & (((u64)1 << VMPRESSURE_SCANNED_SHIFT) - 1);
+
+ vmpressure_val = vmpressure_calc_pressure(vmpressure_win, s, r);
+
+ mutex_lock(&vmpressure_watchers_lock);
+ list_for_each_entry(watch, &vmpressure_watchers, node)
+ vmpressure_sample(watch);
+ mutex_unlock(&vmpressure_watchers_lock);
+}
+static DECLARE_WORK(vmpressure_wk, vmpressure_wk_fn);
+
+void __vmpressure(struct mem_cgroup *memcg, ulong scanned, ulong reclaimed)
+{
+ /*
+ * Store s/r combined, so we don't have to worry to synchronize
+ * them. On modern machines it will be truly atomic; on arches w/o
+ * 64 bit atomics it will turn into a spinlock (for a small amount
+ * of CPUs it's not a problem).
+ *
+ * Using int-sized atomics is a bad idea as it would only allow us to
+ * count (1 << 16) - 1 pages (256MB), which we can scan pretty
+ * fast.
+ *
+ * We can't have per-CPU counters as this will not catch a case
+ * when many CPUs scan small amounts (so none of them hit the
+ * window size limit, and thus we won't send a notification in
+ * time).
+ *
+ * So we shouldn't place vmpressure() into a very hot path.
+ */
+ atomic64_add(scanned << VMPRESSURE_SCANNED_SHIFT | reclaimed,
+ &vmpressure_sr);
+
+ scanned = atomic64_read(&vmpressure_sr) >> VMPRESSURE_SCANNED_SHIFT;
+ if (scanned >= vmpressure_win && !work_pending(&vmpressure_wk))
+ schedule_work(&vmpressure_wk);
+}
+
+static uint vmpressure_poll(struct file *file, poll_table *wait)
+{
+ struct vmpressure_watch *watch = file->private_data;
+
+ poll_wait(file, &watch->waitq, wait);
+
+ return atomic_read(&watch->pending) ? POLLIN : 0;
+}
+
+static ssize_t vmpressure_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vmpressure_watch *watch = file->private_data;
+ struct vmpressure_event event;
+ int ret;
+
+ if (count < sizeof(event))
+ return -EINVAL;
+
+ ret = wait_event_interruptible(watch->waitq,
+ atomic_read(&watch->pending));
+ if (ret)
+ return ret;
+
+ event.pressure = vmpressure_val;
+ if (copy_to_user(buf, &event, sizeof(event)))
+ return -EFAULT;
+
+ atomic_set(&watch->pending, 0);
+
+ return count;
+}
+
+static int vmpressure_release(struct inode *inode, struct file *file)
+{
+ struct vmpressure_watch *watch = file->private_data;
+
+ mutex_lock(&vmpressure_watchers_lock);
+ list_del(&watch->node);
+ mutex_unlock(&vmpressure_watchers_lock);
+
+ kfree(watch);
+ return 0;
+}
+
+static const struct file_operations vmpressure_fops = {
+ .poll = vmpressure_poll,
+ .read = vmpressure_read,
+ .release = vmpressure_release,
+};
+
+SYSCALL_DEFINE1(vmpressure_fd, struct vmpressure_config __user *, config)
+{
+ struct vmpressure_watch *watch;
+ struct file *file;
+ int ret;
+ int fd;
+
+ watch = kzalloc(sizeof(*watch), GFP_KERNEL);
+ if (!watch)
+ return -ENOMEM;
+
+ ret = -EFAULT;
+ if (copy_from_user(&watch->config, config, sizeof(*config)))
+ goto err_free;
+
+ fd = get_unused_fd_flags(O_RDONLY);
+ if (fd < 0) {
+ ret = fd;
+ goto err_free;
+ }
+
+ file = anon_inode_getfile("[vmpressure]", &vmpressure_fops, watch,
+ O_RDONLY);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto err_fd;
+ }
+
+ fd_install(fd, file);
+
+ init_waitqueue_head(&watch->waitq);
+
+ mutex_lock(&vmpressure_watchers_lock);
+ list_add(&watch->node, &vmpressure_watchers);
+ mutex_unlock(&vmpressure_watchers_lock);
+
+ return fd;
+err_fd:
+ put_unused_fd(fd);
+err_free:
+ kfree(watch);
+ return ret;
+}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99b434b..5439117 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -20,6 +20,7 @@
#include <linux/init.h>
#include <linux/highmem.h>
#include <linux/vmstat.h>
+#include <linux/vmpressure.h>
#include <linux/file.h>
#include <linux/writeback.h>
#include <linux/blkdev.h>
@@ -1846,6 +1847,9 @@ restart:
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
+ vmpressure(sc->target_mem_cgroup,
+ sc->nr_scanned - nr_scanned, nr_reclaimed);
+
/* reclaim/compaction might need reclaim to continue */
if (should_continue_reclaim(lruvec, nr_reclaimed,
sc->nr_scanned - nr_scanned, sc))
@@ -2068,6 +2072,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
count_vm_event(ALLOCSTALL);
do {
+ vmpressure_prio(sc->target_mem_cgroup, sc->priority);
sc->nr_scanned = 0;
aborted_reclaim = shrink_zones(zonelist, sc);
--
1.8.0
Just a simple test/example utility for the vmpressure_fd(2) system call.
Signed-off-by: Anton Vorontsov <[email protected]>
---
tools/testing/vmpressure/.gitignore | 1 +
tools/testing/vmpressure/Makefile | 30 ++++++++++
tools/testing/vmpressure/vmpressure-test.c | 93 ++++++++++++++++++++++++++++++
3 files changed, 124 insertions(+)
create mode 100644 tools/testing/vmpressure/.gitignore
create mode 100644 tools/testing/vmpressure/Makefile
create mode 100644 tools/testing/vmpressure/vmpressure-test.c
diff --git a/tools/testing/vmpressure/.gitignore b/tools/testing/vmpressure/.gitignore
new file mode 100644
index 0000000..fe5e38c
--- /dev/null
+++ b/tools/testing/vmpressure/.gitignore
@@ -0,0 +1 @@
+vmpressure-test
diff --git a/tools/testing/vmpressure/Makefile b/tools/testing/vmpressure/Makefile
new file mode 100644
index 0000000..7545f3e
--- /dev/null
+++ b/tools/testing/vmpressure/Makefile
@@ -0,0 +1,30 @@
+WARNINGS := -Wcast-align
+WARNINGS += -Wformat
+WARNINGS += -Wformat-security
+WARNINGS += -Wformat-y2k
+WARNINGS += -Wshadow
+WARNINGS += -Winit-self
+WARNINGS += -Wpacked
+WARNINGS += -Wredundant-decls
+WARNINGS += -Wstrict-aliasing=3
+WARNINGS += -Wswitch-default
+WARNINGS += -Wno-system-headers
+WARNINGS += -Wundef
+WARNINGS += -Wwrite-strings
+WARNINGS += -Wbad-function-cast
+WARNINGS += -Wmissing-declarations
+WARNINGS += -Wmissing-prototypes
+WARNINGS += -Wnested-externs
+WARNINGS += -Wold-style-definition
+WARNINGS += -Wstrict-prototypes
+WARNINGS += -Wdeclaration-after-statement
+
+CFLAGS = -O3 -g -std=gnu99 $(WARNINGS)
+
+PROGRAMS = vmpressure-test
+
+all: $(PROGRAMS)
+
+clean:
+ rm -f $(PROGRAMS) *.o
+.PHONY: clean
diff --git a/tools/testing/vmpressure/vmpressure-test.c b/tools/testing/vmpressure/vmpressure-test.c
new file mode 100644
index 0000000..1e448be
--- /dev/null
+++ b/tools/testing/vmpressure/vmpressure-test.c
@@ -0,0 +1,93 @@
+/*
+ * vmpressure_fd(2) test utility
+ *
+ * Copyright 2011-2012 Pekka Enberg <[email protected]>
+ * Copyright 2011-2012 Linaro Ltd.
+ * Anton Vorontsov <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+/* TODO: glibc wrappers */
+#include "../../../include/linux/vmpressure.h"
+
+#if defined(__x86_64__)
+#include "../../../arch/x86/include/generated/asm/unistd_64.h"
+#endif
+#if defined(__arm__)
+#include "../../../arch/arm/include/asm/unistd.h"
+#endif
+
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+#include <poll.h>
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+static void pexit(const char *str)
+{
+ perror(str);
+ exit(1);
+}
+
+static int vmpressure_fd(struct vmpressure_config *config)
+{
+ config->size = sizeof(*config);
+
+ return syscall(__NR_vmpressure_fd, config);
+}
+
+int main(int argc, char *argv[])
+{
+ struct vmpressure_config config[] = {
+ /*
+ * We could just set the lowest priority, but we want to
+ * actually test if the thresholds work.
+ */
+ { .threshold = VMPRESSURE_LOW },
+ { .threshold = VMPRESSURE_MEDIUM },
+ { .threshold = VMPRESSURE_OOM },
+ };
+ const size_t num = ARRAY_SIZE(config);
+ struct pollfd pfds[num];
+ int i;
+
+ for (i = 0; i < num; i++) {
+ pfds[i].fd = vmpressure_fd(&config[i]);
+ if (pfds[i].fd < 0)
+ pexit("vmpressure_fd failed");
+
+ pfds[i].events = POLLIN;
+ }
+
+ while (poll(pfds, num, -1) > 0) {
+ for (i = 0; i < num; i++) {
+ struct vmpressure_event event;
+
+ if (!pfds[i].revents)
+ continue;
+
+ if (read(pfds[i].fd, &event, sizeof(event)) < 0)
+ pexit("read failed");
+
+ printf("VM pressure: 0x%.8x (threshold 0x%.8x)\n",
+ event.pressure, config[i].threshold);
+ }
+ }
+
+ perror("poll failed\n");
+
+ for (i = 0; i < num; i++) {
+ if (close(pfds[i].fd) < 0)
+ pexit("close failed");
+ }
+
+ exit(1);
+ return 0;
+}
--
1.8.0
VMPRESSURE_FD(2) Linux Programmer's Manual VMPRESSURE_FD(2)
NAME
vmpressure_fd - Linux virtual memory pressure notifications
SYNOPSIS
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/unistd.h>
#include <linux/types.h>
#include <linux/vmpressure.h>
int vmpressure_fd(struct vmpressure_config *config)
{
config->size = sizeof(*config);
return syscall(__NR_vmpressure_fd, config);
}
DESCRIPTION
This system call creates a new file descriptor that can be used
with blocking (e.g. read(2)) and/or polling (e.g. poll(2)) rou-
tines to get notified about system's memory pressure.
Upon these notifications, userland programs can cooperate with
the kernel, achieving better system's memory management.
Memory pressure levels
There are currently three memory pressure levels, each level is
defined via vmpressure_level enumeration, and correspond to these
constants:
VMPRESSURE_LOW
The system is reclaiming memory for new allocations. Moni-
toring reclaiming activity might be useful for maintaining
overall system's cache level.
VMPRESSURE_MEDIUM
The system is experiencing medium memory pressure, there
might be some mild swapping activity. Upon this event,
applications may decide to free any resources that can be
easily reconstructed or re-read from a disk.
VMPRESSURE_OOM
The system is actively thrashing, it is about to go out of
memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can
to help the system. See proc(5) for more information about
OOM killer and its configuration options.
Note that the behaviour of some levels can be tuned through the
sysctl(5) mechanism. See /usr/src/linux/Documenta-
tion/sysctl/vm.txt for various vmpressure_* tunables and their
meanings.
Configuration
vmpressure_fd(2) accepts vmpressure_config structure to configure
the notifications:
struct vmpressure_config {
__u32 size;
__u32 threshold;
};
size is a part of ABI versioning and must be initialized to
sizeof(struct vmpressure_config).
threshold is used to set up a minimal value of the pressure upon
which the events will be delivered by the kernel (for algebraic
comparisons, it is defined that VMPRESSURE_LOW < VMPRES-
SURE_MEDIUM < VMPRESSURE_OOM, but applications should not put any
meaning into the absolute values.)
Events
Upon a notification, the application must read out events using
read(2) system call. The events are delivered using the follow-
ing structure:
struct vmpressure_event {
__u32 pressure;
};
The pressure shows the most recent system's pressure level.
RETURN VALUE
On success, vmpressure_fd() returns a new file descriptor. On
error, a negative value is returned and errno is set to indicate
the error.
ERRORS
vmpressure_fd() can fail with errors similar to open(2).
In addition, the following errors are possible:
EINVAL The failure means that an improperly initialized config
structure has been passed to the call.
EFAULT The failure means that the kernel was unable to read the
configuration structure, that is, config parameter points
to inaccessible memory.
VERSIONS
The system call is available on Linux since kernel 3.8. Library
support is not yet provided by any glibc version.
CONFORMING TO
The system call is Linux-specific.
EXAMPLE
Examples can be found in /usr/src/linux/tools/testing/vmpressure/
directory.
SEE ALSO
poll(2), read(2), proc(5), sysctl(5), vmstat(8)
Linux 2012-10-16 VMPRESSURE_FD(2)
Signed-off-by: Anton Vorontsov <[email protected]>
---
man2/vmpressure_fd.2 | 163 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 163 insertions(+)
create mode 100644 man2/vmpressure_fd.2
diff --git a/man2/vmpressure_fd.2 b/man2/vmpressure_fd.2
new file mode 100644
index 0000000..eaf07d4
--- /dev/null
+++ b/man2/vmpressure_fd.2
@@ -0,0 +1,163 @@
+.\" Copyright (C) 2008 Michael Kerrisk <[email protected]>
+.\" Copyright (C) 2012 Linaro Ltd.
+.\" Anton Vorontsov <[email protected]>
+.\"
+.\" Based on ideas from:
+.\" KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka
+.\" Enberg.
+.\"
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or
+.\" (at your option) any later version.
+.\"
+.\" This program is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public License
+.\" along with this program; if not, write to the Free Software
+.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston,
+.\" MA 02111-1307 USA
+.\"
+.TH VMPRESSURE_FD 2 2012-10-16 Linux "Linux Programmer's Manual"
+.SH NAME
+vmpressure_fd \- Linux virtual memory pressure notifications
+.SH SYNOPSIS
+.nf
+.B #define _GNU_SOURCE
+.B #include <unistd.h>
+.B #include <sys/syscall.h>
+.B #include <asm/unistd.h>
+.B #include <linux/types.h>
+.B #include <linux/vmpressure.h>
+.\" TODO: libc wrapper
+
+.BI "int vmpressure_fd(struct vmpressure_config *"config )
+.B
+{
+.B
+ config->size = sizeof(*config);
+.B
+ return syscall(__NR_vmpressure_fd, config);
+.B
+}
+.fi
+.SH DESCRIPTION
+This system call creates a new file descriptor that can be used with
+blocking (e.g.
+.BR read (2))
+and/or polling (e.g.
+.BR poll (2))
+routines to get notified about system's memory pressure.
+
+Upon these notifications, userland programs can cooperate with the kernel,
+achieving better system's memory management.
+.SS Memory pressure levels
+There are currently three memory pressure levels, each level is defined
+via
+.IR vmpressure_level " enumeration,"
+and correspond to these constants:
+.TP
+.B VMPRESSURE_LOW
+The system is reclaiming memory for new allocations. Monitoring reclaiming
+activity might be useful for maintaining overall system's cache level.
+.TP
+.B VMPRESSURE_MEDIUM
+The system is experiencing medium memory pressure, there might be some
+mild swapping activity. Upon this event, applications may decide to free
+any resources that can be easily reconstructed or re-read from a disk.
+.TP
+.B VMPRESSURE_OOM
+The system is actively thrashing, it is about to go out of memory (OOM) or
+even the in-kernel OOM killer is on its way to trigger. Applications
+should do whatever they can to help the system. See
+.BR proc (5)
+for more information about OOM killer and its configuration options.
+.TP 0
+Note that the behaviour of some levels can be tuned through the
+.BR sysctl (5)
+mechanism. See
+.I /usr/src/linux/Documentation/sysctl/vm.txt
+for various
+.I vmpressure_*
+tunables and their meanings.
+.SS Configuration
+.BR vmpressure_fd (2)
+accepts
+.I vmpressure_config
+structure to configure the notifications:
+
+.nf
+struct vmpressure_config {
+ __u32 size;
+ __u32 threshold;
+};
+.fi
+
+.I size
+is a part of ABI versioning and must be initialized to
+.IR "sizeof(struct vmpressure_config)" .
+
+.I threshold
+is used to set up a minimal value of the pressure upon which the events
+will be delivered by the kernel (for algebraic comparisons, it is defined
+that
+.BR VMPRESSURE_LOW " <"
+.BR VMPRESSURE_MEDIUM " <"
+.BR VMPRESSURE_OOM ,
+but applications should not put any meaning into the absolute values.)
+.SS Events
+Upon a notification, the application must read out events using
+.BR read (2)
+system call.
+The events are delivered using the following structure:
+
+.nf
+struct vmpressure_event {
+ __u32 pressure;
+};
+.fi
+
+The
+.I pressure
+shows the most recent system's pressure level.
+.SH "RETURN VALUE"
+On success,
+.BR vmpressure_fd ()
+returns a new file descriptor. On error, a negative value is returned and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.BR vmpressure_fd ()
+can fail with errors similar to
+.BR open (2).
+
+In addition, the following errors are possible:
+.TP
+.B EINVAL
+The failure means that an improperly initialized
+.I config
+structure has been passed to the call.
+.TP
+.B EFAULT
+The failure means that the kernel was unable to read the configuration
+structure, that is,
+.I config
+parameter points to inaccessible memory.
+.SH VERSIONS
+The system call is available on Linux since kernel 3.8. Library support is
+not yet provided by any glibc version.
+.SH CONFORMING TO
+The system call is Linux-specific.
+.SH EXAMPLE
+Examples can be found in
+.I /usr/src/linux/tools/testing/vmpressure/
+directory.
+.SH "SEE ALSO"
+.BR poll (2),
+.BR read (2),
+.BR proc (5),
+.BR sysctl (5),
+.BR vmstat (8)
--
1.8.0
On Wed, Nov 07, 2012 at 02:53:49AM -0800, Anton Vorontsov wrote:
> Hi all,
>
> This is the third RFC. As suggested by Minchan Kim, the API is much
> simplified now (comparing to vmevent_fd):
>
> - As well as Minchan, KOSAKI Motohiro didn't like the timers, so the
> timers are gone now;
> - Pekka Enberg didn't like the complex attributes matching code, and so it
> is no longer there;
> - Nobody liked the raw vmstat attributes, and so they were eliminated too.
>
> But, conceptually, it is the exactly the same approach as in v2: three
> discrete levels of the pressure -- low, medium and oom. The levels are
> based on the reclaimer inefficiency index as proposed by Mel Gorman, but
> userland does not see the raw index values. The description why I moved
> away from reporting the raw 'reclaimer inefficiency index' can be found in
> v2: http://lkml.org/lkml/2012/10/22/177
>
> While the new API is very simple, it is still extensible (i.e. versioned).
Sorry, I didn't follow previous discussion on this, but could you
explain what's wrong with memory notifications from memcg?
As far as I can see, you can get pretty similar functionality using memory
thresholds on the root cgroup. What's the point?
--
Kirill A. Shutemov
On Wed, Nov 7, 2012 at 1:21 PM, Kirill A. Shutemov <[email protected]> wrote:
>> While the new API is very simple, it is still extensible (i.e. versioned).
>
> Sorry, I didn't follow previous discussion on this, but could you
> explain what's wrong with memory notifications from memcg?
> As I can see you can get pretty similar functionality using memory
> thresholds on the root cgroup. What's the point?
Why should you be required to use cgroups to get VM pressure events to
userspace?
Hi Anton,
On Wed, Nov 7, 2012 at 12:53 PM, Anton Vorontsov
<[email protected]> wrote:
> This is the third RFC. As suggested by Minchan Kim, the API is much
> simplified now (comparing to vmevent_fd):
>
> - As well as Minchan, KOSAKI Motohiro didn't like the timers, so the
> timers are gone now;
> - Pekka Enberg didn't like the complex attributes matching code, and
> so it is no longer there;
> - Nobody liked the raw vmstat attributes, and so they were eliminated
> too.
I love the API and implementation simplifications but I hate the new
ABI. It's a specialized, single-purpose syscall and a bunch of procfs
tunables and I don't see how it's 'extensible' to anything but VM.
If people object to the vmevent_fd() system call, we should consider using
something more generic like perf_event_open() instead of inventing our
own special purpose ABI.
Pekka
On Wed, Nov 7, 2012 at 1:30 PM, Pekka Enberg <[email protected]> wrote:
> I love the API and implementation simplifications but I hate the new
> ABI. It's a specialized, single-purpose syscall and bunch of procfs
> tunables and I don't see how it's 'extensible' to anything but VM
s/anything but VM/anything but VM pressure notification/
On Wed, Nov 07, 2012 at 01:28:12PM +0200, Pekka Enberg wrote:
> On Wed, Nov 7, 2012 at 1:21 PM, Kirill A. Shutemov <[email protected]> wrote:
> >> While the new API is very simple, it is still extensible (i.e. versioned).
> >
> > Sorry, I didn't follow previous discussion on this, but could you
> > explain what's wrong with memory notifications from memcg?
> > As I can see you can get pretty similar functionality using memory
> > thresholds on the root cgroup. What's the point?
>
> Why should you be required to use cgroups to get VM pressure events to
> userspace?
Valid point. But in fact you have it on most systems anyway.
I personally don't like to have a syscall per small feature.
Isn't it better to have a file-based interface which can be used with
normal file syscalls: open()/read()/poll()?
--
Kirill A. Shutemov
On Wed, Nov 07, 2012 at 01:21:36PM +0200, Kirill A. Shutemov wrote:
[...]
> Sorry, I didn't follow previous discussion on this, but could you
> explain what's wrong with memory notifications from memcg?
> As I can see you can get pretty similar functionality using memory
> thresholds on the root cgroup. What's the point?
There are a few reasons we don't use cgroup notifications:
1. We're not interested in the absolute number of pages/KB of available
memory, as provided by the cgroup memory controller. What we're
interested in is the amount of easily reclaimable memory and the cost
of new memory allocations.
We can have plenty of "free" memory, of which say 90% will be caches
and say 10% idle. But we do want to differentiate these types of memory
(without going into details about it), i.e. we want to get notified
when the kernel is reclaiming. And we also want to know when new memory
comes from swapping others' pages out (well, actually we don't call it
swap, it's "the cost of new allocations becomes high" -- it might be the
result of many factors (swapping, fragmentation, etc.) -- and userland
might analyze the situation when this happens).
Exposing all the VM details to userland is not an option -- it is not
possible to build a stable ABI on this. Plus, it makes it really hard
for userland to deal with all the low level details of Linux VM
internals.
So, no, raw numbers of "free/used KBs" are not interesting at all.
1.5. But it is important to understand that vmpressure_fd() is not
orthogonal to cgroups (like it was with vmevent_fd()). We want it to
be "cgroup'able" too. :) But optionally.
2. The last time I checked, the cgroups memory controller did not (and I
guess still does not) account for kernel-owned slabs. I asked several
times why, but nobody answered.
But no, this is not the main issue -- per "1.", we're not interested in
kilobytes.
3. Some folks don't like cgroups: they carry a penalty in kernel size,
performance and memory usage. But again, that's not the main issue with
memcg.
Thanks,
Anton.
On Wed, Nov 07, 2012 at 01:30:16PM +0200, Pekka Enberg wrote: [...]
> I love the API and implementation simplifications but I hate the new
> ABI. It's a specialized, single-purpose syscall and bunch of procfs
> tunables and I don't see how it's 'extensible' to anything but VM
It is extensible to VM pressure notifications, yeah. We're probably not
going to add the raw vmstat values to it (and that's why we changed the
name). But having three levels is not the best thing we can do -- we can
do better. As I described here:
http://lkml.org/lkml/2012/10/25/115
That is, later we might want to tell the kernel how much reclaimable
memory userland has. So this can be two-way communication, which to me
sounds pretty cool. :) And who knows what we'll do after that.
But these are just plans. We might end up not having this, but we always
have an option to have it one day.
> If people object to vmevent_fd() system call, we should consider using
> something more generic like perf_event_open() instead of inventing our
> own special purpose ABI.
Ugh. While I *love* perf, IIUC it was designed for other things: handling
tons of events, so it has a lot of machinery that is completely
unnecessary here: we don't need ring buffers, formats, 7+k LOC, etc. Folks
will complain that we need the whole perf stack for such a simple thing
(just like cgroups).
Also note that for pre-OOM we have to be really fast, i.e. use the
shortest possible path (and, btw, that's why in this version read() can
now be blocking -- so we no longer have to do two poll()+read() syscalls;
a single read() is enough).
So I really don't see the need for perf here: it doesn't result in any
code reuse, it just complicates our task. And from the ABI maintenance
point of view, it is just the same as a dedicated syscall.
Thanks,
Anton.
On Wed, Nov 07, 2012 at 03:43:46AM -0800, Anton Vorontsov wrote:
> On Wed, Nov 07, 2012 at 01:21:36PM +0200, Kirill A. Shutemov wrote:
> [...]
> > Sorry, I didn't follow previous discussion on this, but could you
> > explain what's wrong with memory notifications from memcg?
> > As I can see you can get pretty similar functionality using memory
> > thresholds on the root cgroup. What's the point?
>
> There are a few reasons we don't use cgroup notifications:
>
> 1. We're not interested in the absolute number of pages/KB of available
> memory, as provided by cgroup memory controller. What we're interested
> in is the amount of easily reclaimable memory and new memory
> allocations' cost.
>
> We can have plenty of "free" memory, of which say 90% will be caches,
> and say 10% idle. But we do want to differentiate these types of memory
> (although not going into details about it), i.e. we want to get
> notified when kernel is reclaiming. And we also want to know when the
> memory comes from swapping others' pages out (well, actually we don't
> call it swap, it's "new allocations cost becomes high" -- it might be a
> result of many factors (swapping, fragmentation, etc.) -- and userland
> might analyze the situation when this happens).
>
> Exposing all the VM details to userland is not an option
IIUC, you want MemFree + Buffers + Cached + SwapCached, right?
It's already exposed to userspace.
> -- it is not
> possible to build a stable ABI on this. Plus, it makes it really hard
> for userland to deal with all the low level details of Linux VM
> internals.
>
> So, no, raw numbers of "free/used KBs" are not interesting at all.
>
> 1.5. But it is important to understand that vmpressure_fd() is not
> orthogonal to cgroups (like it was with vmevent_fd()). We want it to
> be "cgroup'able" too. :) But optionally.
>
> 2. The last time I checked, cgroups memory controller did not (and I guess
> still does not) not account kernel-owned slabs. I asked several times
> why so, but nobody answered.
Almost there. Glauber works on it.
--
Kirill A. Shutemov
On Wed, Nov 07, 2012 at 02:11:10PM +0200, Kirill A. Shutemov wrote:
[...]
> > We can have plenty of "free" memory, of which say 90% will be caches,
> > and say 10% idle. But we do want to differentiate these types of memory
> > (although not going into details about it), i.e. we want to get
> > notified when kernel is reclaiming. And we also want to know when the
> > memory comes from swapping others' pages out (well, actually we don't
> > call it swap, it's "new allocations cost becomes high" -- it might be a
> > result of many factors (swapping, fragmentation, etc.) -- and userland
> > might analyze the situation when this happens).
> >
> > Exposing all the VM details to userland is not an option
>
> IIUC, you want MemFree + Buffers + Cached + SwapCached, right?
> It's already exposed to userspace.
How? If you mean vmstat, then no, that interface is not efficient at all:
we have to poll it from userland, which is a no-go for embedded (although,
as a workaround, it can be done via deferrable timers in userland, which I
posted a few months ago).
But even with polling vmstat via deferrable timers, it leaves us with the
ugly timers-based approach (and no way to catch the pre-OOM conditions).
With vmpressure_fd() we have the synchronous notifications right from the
core (upon which, you can, if you want to, analyze the vmstat).
>> 2. The last time I checked, cgroups memory controller did not (and I guess
>> still does not) not account kernel-owned slabs. I asked several times
>> why so, but nobody answered.
>
> Almost there. Glauber works on it.
It's good to hear, but still, the number of "used KBs" is a bad (or
irrelevant) metric for the pressure. We'd still need to analyze the memory
in more detail, and "'limit - used' KBs" doesn't tell us anything about
the cost of the available memory.
Thanks,
Anton.
On 11/07/2012 06:01 AM, Anton Vorontsov wrote:
> Configuration
> vmpressure_fd(2) accepts vmpressure_config structure to configure
> the notifications:
>
> struct vmpressure_config {
> __u32 size;
> __u32 threshold;
> };
>
> size is a part of ABI versioning and must be initialized to
> sizeof(struct vmpressure_config).
If you want to use a versioned ABI, why not pass in an
actual version number?
On Wed, Nov 07 2012, Kirill A. Shutemov wrote:
> On Wed, Nov 07, 2012 at 02:53:49AM -0800, Anton Vorontsov wrote:
>> Hi all,
>>
>> This is the third RFC. As suggested by Minchan Kim, the API is much
>> simplified now (comparing to vmevent_fd):
>>
>> - As well as Minchan, KOSAKI Motohiro didn't like the timers, so the
>> timers are gone now;
>> - Pekka Enberg didn't like the complex attributes matching code, and so it
>> is no longer there;
>> - Nobody liked the raw vmstat attributes, and so they were eliminated too.
>>
>> But, conceptually, it is the exactly the same approach as in v2: three
>> discrete levels of the pressure -- low, medium and oom. The levels are
>> based on the reclaimer inefficiency index as proposed by Mel Gorman, but
>> userland does not see the raw index values. The description why I moved
>> away from reporting the raw 'reclaimer inefficiency index' can be found in
>> v2: http://lkml.org/lkml/2012/10/22/177
>>
>> While the new API is very simple, it is still extensible (i.e. versioned).
>
> Sorry, I didn't follow previous discussion on this, but could you
> explain what's wrong with memory notifications from memcg?
> As I can see you can get pretty similar functionality using memory
> thresholds on the root cgroup. What's the point?
Related question: are there plans to extend this system call to provide
per-cgroup vm pressure notification?
Hi Greg,
On 11/7/12 7:20 PM, Greg Thelen wrote:
> Related question: are there plans to extend this system call to
> provide per-cgroup vm pressure notification?
Yes, that's something that needs to be addressed before we can ever
consider merging something like this to mainline. We probably need help
with that, though. Preferably from someone who knows cgroups. :-)
Pekka
(Sorry about being very late reviewing this)
On Wed, Nov 07, 2012 at 03:01:28AM -0800, Anton Vorontsov wrote:
> This patch introduces vmpressure_fd() system call. The system call creates
> a new file descriptor that can be used to monitor Linux' virtual memory
> management pressure. There are three discrete levels of the pressure:
>
Why was eventfd unsuitable? It's a bit trickier to use but there are
examples in the kernel where an application is required to do something like
1. open eventfd
2. open a control file, say /proc/sys/vm/vmpressure or if cgroups
/sys/fs/cgroup/something/vmpressure
3. write fd_event fd_control [low|medium|oom]. Can be a binary structure
you write
and then poll the eventfd. The trickiness is awkward but a library
implementation of vmpressure_fd() that mapped onto eventfd properly should
be trivial.
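To make the steps concrete, a userspace registration along those lines
might look roughly like the sketch below (the control file path and the
text format written to it are purely illustrative -- no such interface
exists today -- and error handling is omitted):

    #include <sys/eventfd.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <poll.h>

    int main(void)
    {
        int efd = eventfd(0, 0);                    /* 1. open eventfd */
        int cfd = open("/proc/sys/vm/vmpressure",   /* 2. open control file */
                       O_WRONLY);
        struct pollfd pfd = { .fd = efd, .events = POLLIN };
        char buf[32];
        uint64_t count;

        /* 3. register: "<eventfd> <level>" (format is hypothetical) */
        snprintf(buf, sizeof(buf), "%d medium", efd);
        write(cfd, buf, strlen(buf));

        while (poll(&pfd, 1, -1) > 0) {
            read(efd, &count, sizeof(count));       /* consume event(s) */
            printf("%llu pressure event(s)\n",
                   (unsigned long long)count);
        }
        return 0;
    }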
I confess I'm not super familiar with eventfd and if this can actually
work in practice but I found the introduction of a dedicated syscall
surprising. Apologies if this has been discussed already. If it was,
it should be in the changelog to prevent stupid questions from drive-by
reviewers.
> VMPRESSURE_LOW: Notifies that the system is reclaiming memory for new
> allocations. Monitoring reclaiming activity might be useful for
> maintaining overall system's cache level.
>
If you do another revision, add a caveat that a streaming reader might
be enough to trigger this level. It's not necessarily a problem of
course.
> VMPRESSURE_MEDIUM: The system is experiencing medium memory pressure,
> there might be some mild swapping activity. Upon this event applications
> may decide to free any resources that can be easily reconstructed or
> re-read from a disk.
>
Good.
> VMPRESSURE_OOM: The system is actively thrashing, it is about to go out of
> memory (OOM) or even the in-kernel OOM killer is on its way to trigger.
> Applications should do whatever they can to help the system.
>
Good.
> There are four sysctls to tune the behaviour of the levels:
>
> vmevent_window
> vmevent_level_medium
> vmevent_level_oom
> vmevent_level_oom_priority
>
Superficially these feel like they might expose implementation details of
the pressure implementation and thereby indirectly expose the internals
of the VM. Should these be debugfs instead of sysctls that spit out a
warning if used so it generates a bug report? That won't stop someone
depending on them anyway but if these values are changed we should
immediately hear why it was necessary.
> Currently vmevent pressure levels are based on the reclaimer inefficiency
> index (range from 0 to 100). The index shows the relative time spent by
> the kernel uselessly scanning pages, or, in other words, the percentage of
> scans of pages (vmevent_window) that were not reclaimed. The higher the
> index, the more it should be evident that new allocations' cost becomes
> higher.
>
Good.
> The files vmevent_level_medium and vmevent_level_oom accept the index
> values (by default set to 60 and 99 respectively). A non-existent
> vmevent_level_low tunable is always set to 0
>
> When index equals to 0, this means that the kernel is reclaiming, but
> every scanned page has been successfully reclaimed (so the pressure is
> low). 100 means that the kernel is trying to reclaim, but nothing can be
> reclaimed (OOM).
>
> Window size is used as a rate-limit tunable for VMPRESSURE_LOW
> notifications and for averaging for VMPRESSURE_{MEDIUM,OOM} levels. So,
> using small window sizes can cause lot of false positives for _MEDIUM and
> _OOM levels, but too big window size may delay notifications. By default
> the window size equals to 256 pages (1MB).
>
I think it would be reasonable to leave the window as a sysctl but rename
it vmpressure_sensitivity. Tuning it to be very "sensitive" would initially
be implemented as the window shrinking.
> The _OOM level is also attached to the reclaimer's priority. When the
> system is almost OOM, it might be getting the last reclaimable pages
> slowly, scanning all the queues, and so we never catch the OOM case via
> window-size averaging. For this case the priority can be used to determine
> the pre-OOM condition, the pre-OOM priority level can be set via
> vmpressure_level_oom_prio sysctl.
>
> Signed-off-by: Anton Vorontsov <[email protected]>
> ---
> Documentation/sysctl/vm.txt | 48 ++++++++
> arch/x86/syscalls/syscall_64.tbl | 1 +
> include/linux/syscalls.h | 2 +
> include/linux/vmpressure.h | 128 ++++++++++++++++++++++
> kernel/sys_ni.c | 1 +
> kernel/sysctl.c | 31 ++++++
> mm/Kconfig | 13 +++
> mm/Makefile | 1 +
> mm/vmpressure.c | 231 +++++++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 5 +
> 10 files changed, 461 insertions(+)
> create mode 100644 include/linux/vmpressure.h
> create mode 100644 mm/vmpressure.c
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 078701f..9837fe2 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -44,6 +44,10 @@ Currently, these files are in /proc/sys/vm:
> - nr_overcommit_hugepages
> - nr_trim_pages (only if CONFIG_MMU=n)
> - numa_zonelist_order
> +- vmpressure_window
> +- vmpressure_level_medium
> +- vmpressure_level_oom
> +- vmpressure_level_oom_priority
> - oom_dump_tasks
> - oom_kill_allocating_task
> - overcommit_memory
> @@ -487,6 +491,50 @@ this is causing problems for your system/application.
>
> ==============================================================
>
> +vmpressure_window
> +vmpressure_level_med
> +vmpressure_level_oom
> +vmpressure_level_oom_priority
> +
> +These sysctls are used to tune vmpressure_fd(2) behaviour.
> +
Ok, I'm ok with FD being the interface. I think it makes sense and means
it can be used with select or poll.
> +Currently vmpressure pressure levels are based on the reclaimer
> +inefficiency index (range from 0 to 100). The files vmpressure_level_med
> +and vmpressure_level_oom accept the index values (by default set to 60 and
> +99 respectively). A non-existent vmpressure_level_low tunable is always
> +set to 0
> +
> +When the system is short on idle pages, the new memory is allocated by
> +reclaiming least recently used resources: kernel scans pages to be
> +reclaimed (e.g. from file caches, mmap(2) volatile ranges, etc.; and
> +potentially swapping some pages out). The index shows the relative time
> +spent by the kernel uselessly scanning pages, or, in other words, the
> +percentage of scans of pages (vmpressure_window) that were not reclaimed.
> +The higher the index, the more it should be evident that new allocations'
> +cost becomes higher.
> +
> +When index equals to 0, this means that the kernel is reclaiming, but
> +every scanned page has been successfully reclaimed (so the pressure is
> +low). 100 means that the kernel is trying to reclaim, but nothing can be
> +reclaimed (close to OOM).
> +
> +Window size is used as a rate-limit tunable for VMPRESSURE_LOW
> +notifications and for averaging for VMPRESSURE_{MEDIUM,OOM} levels. So,
> +using small window sizes can cause lot of false positives for _MEDIUM and
> +_OOM levels, but too big window size may delay notifications. By default
> +the window size equals to 256 pages (1MB).
> +
> +When the system is almost OOM it might be getting the last reclaimable
> +pages slowly, scanning all the queues, and so we never catch the OOM case
> +via window-size averaging. For this case there is another mechanism of
> +detecting the pre-OOM conditions: kernel's reclaimer has a scanning
> +priority, the higest priority is 0 (reclaimer will scan all the available
> +pages). Kernel starts scanning with priority set to 12 (queue_length >>
> +12). So, vmpressure_level_oom_prio should be between 0 and 12 (by default
> +it is set to 4).
> +
Sounds good. Again, be careful on how much implementation detail you expose
to the interface. I think the actual user-visible interface should be low,
medium, high with a sensitivity tunable but the ranges and window sizes
hidden away (or at least in debugfs).
> +==============================================================
> +
> oom_dump_tasks
>
> Enables a system-wide task dump (excluding kernel threads) to be
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 316449a..6e4fa6a 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -320,6 +320,7 @@
> 311 64 process_vm_writev sys_process_vm_writev
> 312 common kcmp sys_kcmp
> 313 64 vmevent_fd sys_vmevent_fd
> +314 64 vmpressure_fd sys_vmpressure_fd
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 19439c7..3d2587d 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -63,6 +63,7 @@ struct getcpu_cache;
> struct old_linux_dirent;
> struct perf_event_attr;
> struct file_handle;
> +struct vmpressure_config;
>
> #include <linux/types.h>
> #include <linux/aio_abi.h>
> @@ -860,4 +861,5 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>
> asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
> unsigned long idx1, unsigned long idx2);
> +asmlinkage long sys_vmpressure_fd(struct vmpressure_config __user *config);
> #endif
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> new file mode 100644
> index 0000000..b808b04
> --- /dev/null
> +++ b/include/linux/vmpressure.h
> @@ -0,0 +1,128 @@
> +/*
> + * Linux VM pressure notifications
> + *
> + * Copyright 2011-2012 Pekka Enberg <[email protected]>
> + * Copyright 2011-2012 Linaro Ltd.
> + * Anton Vorontsov <[email protected]>
> + *
> + * Based on ideas from KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman,
> + * Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_VMPRESSURE_H
> +#define _LINUX_VMPRESSURE_H
> +
> +#include <linux/types.h>
> +
> +/**
> + * enum vmpressure_level - Memory pressure levels
> + * @VMPRESSURE_LOW: The system is short on idle pages, losing caches
> + * @VMPRESSURE_MEDIUM: New allocations' cost becomes high
> + * @VMPRESSURE_OOM: The system is about to go out-of-memory
> + */
> +enum vmpressure_level {
> + /* We spread the values, reserving room for new levels. */
> + VMPRESSURE_LOW = 1 << 10,
> + VMPRESSURE_MEDIUM = 1 << 20,
> + VMPRESSURE_OOM = 1 << 30,
> +};
> +
Once again, be careful about what you expose to userspace. Bear in mind that
these values get compiled into applications, so to maintain binary
compatibility the user-visible structure should use plain enums.
enum vmpressure_level {
VM_PRESSURE_LOW,
VM_PRESSURE_MEDIUM,
VM_PRESSURE_OOM
};
These should then be mapped to a kernel internal ranges
enum __vmpressure_level_range_internal {
__VM_PRESSURE_LOW = 1 << 10,
__VM_PRESSURE_MEDIUM = 1 << 20,
};
That allows the kernel-internal ranges to change without worrying about
userspace compatibility.
This comment would apply even if you used eventfd.
I don't mean to bitch about exposing implementation details but a stated
goal of this interface was to avoid having applications aware of VM
implementation details.
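To make that concrete, here is a fuller sketch of the split; the helper name
and the exact internal values are hypothetical, only the idea of mapping a
plain userspace enum onto private kernel ranges matters:

#include <linux/types.h>

/* Userspace-visible ABI: plain, densely numbered values */
enum vmpressure_level {
	VMPRESSURE_LOW,
	VMPRESSURE_MEDIUM,
	VMPRESSURE_OOM,
};

/* Kernel-internal representation, free to change at any time */
enum __vmpressure_level_internal {
	__VMPRESSURE_LOW	= 1 << 10,
	__VMPRESSURE_MEDIUM	= 1 << 20,
	__VMPRESSURE_OOM	= 1 << 30,
};

/* Translate the user's threshold once, at vmpressure_fd() time */
static u32 vmpressure_level_to_internal(u32 level)
{
	switch (level) {
	case VMPRESSURE_LOW:
		return __VMPRESSURE_LOW;
	case VMPRESSURE_MEDIUM:
		return __VMPRESSURE_MEDIUM;
	case VMPRESSURE_OOM:
		return __VMPRESSURE_OOM;
	}
	return __VMPRESSURE_LOW;	/* or reject unknown values with -EINVAL */
}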
> +/**
> + * struct vmpressure_config - Configuration structure for vmpressure_fd()
> + * @size: Size of the struct for ABI extensibility
> + * @threshold: Minimum pressure level of notifications
> + *
> + * This structure is used to configure the file descriptor that
> + * vmpressure_fd() returns.
> + *
> + * @size is used to "version" the ABI, it must be initialized to
> + * 'sizeof(struct vmpressure_config)'.
> + *
> + * @threshold should be one of @vmpressure_level values, and specifies
> + * minimal level of notification that will be delivered.
> + */
> +struct vmpressure_config {
> + __u32 size;
> + __u32 threshold;
> +};
> +
Again, I suspect this might be compatible with eventfd. The write to the
eventfd control file just needs to handle a binary structure instead of
strings, without having to introduce a dedicated system call.
The versioning of the structure is not a bad idea, but don't use "size".
Use a magic value in the high bits and a version number in the low bits,
and #define it VMPRESSURE_NOTIFY_MAGIC1.
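Something like the following sketch, say; the magic constant's value is made
up here, only the shape of the check is the point:

#include <linux/types.h>
#include <linux/errno.h>

#define VMPRESSURE_MAGIC		0x564d50u	/* arbitrary, "VMP" */
#define VMPRESSURE_NOTIFY_MAGIC1	((VMPRESSURE_MAGIC << 8) | 1)

struct vmpressure_config {
	__u32 magic;		/* must be VMPRESSURE_NOTIFY_MAGIC1 */
	__u32 threshold;
};

/* Kernel-side validation then becomes a single compare */
static int vmpressure_check_config(const struct vmpressure_config *config)
{
	if (config->magic != VMPRESSURE_NOTIFY_MAGIC1)
		return -EINVAL;
	return 0;
}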
> +/**
> + * struct vmpressure_event - An event that is returned via vmpressure fd
> + * @pressure: Most recent system's pressure level
> + *
> + * Upon notification, this structure must be read from the vmpressure file
> + * descriptor.
> + */
> +struct vmpressure_event {
> + __u32 pressure;
> +};
> +
What is the meaning of "pressure" as returned to userspace?
Would it be better if userspace just received an event when the requested
threshold was reached but when it reads it just gets a single 0 byte that
should not be interpreted?
I say this because the application can only request low, medium or OOM
but gets a number back. How should it intepret that number? The value of
the number depends on sysctl files and I fear that applications will end
up making decisions on the implementation again.
I think it would be a lot safer for Pressure ABI v1 to return only 0 here
and see how far that gets. If Android has already gone through this process
and *know* they need this number then it should be documented.
If this has already been discussed, it should also be documented :P
I see from a debugging perspective why it might be handy to monitor
pressure over time. If so, then maybe a debugfs file would help with a
CLEAR warning that no application should depend on its existence (make it
depend on CONFIG_DEBUG_VMPRESSURE && CONFIG_DEBUG_VM or something).
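For the debugging angle, the raw value from this patch could be exposed with a
couple of lines of debugfs code inside mm/vmpressure.c, for example as in the
sketch below (CONFIG_DEBUG_VMPRESSURE is the hypothetical option named above,
vmpressure_val is the variable from the patch):

#include <linux/debugfs.h>

#if defined(CONFIG_DEBUG_VM) && defined(CONFIG_DEBUG_VMPRESSURE)
/* Read-only, debugfs-only view of the raw index; no ABI guarantees. */
static int __init vmpressure_debugfs_init(void)
{
	debugfs_create_u32("vmpressure_raw", 0444, NULL, &vmpressure_val);
	return 0;
}
late_initcall(vmpressure_debugfs_init);
#endif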
> +#ifdef __KERNEL__
> +
> +struct mem_cgroup;
> +
> +#ifdef CONFIG_VMPRESSURE
> +
> +extern uint vmpressure_win;
> +extern uint vmpressure_level_med;
> +extern uint vmpressure_level_oom;
> +extern uint vmpressure_level_oom_prio;
> +
> +extern void __vmpressure(struct mem_cgroup *memcg,
> + ulong scanned, ulong reclaimed);
> +static void vmpressure(struct mem_cgroup *memcg,
> + ulong scanned, ulong reclaimed);
> +
> +/*
> + * OK, we're cheating. The thing is, we have to average s/r ratio by
> + * gathering a lot of scans (otherwise we might get some local
> + * false-positives index of '100').
> + *
> + * But... when we're almost OOM we might be getting the last reclaimable
> + * pages slowly, scanning all the queues, and so we never catch the OOM
> + * case via averaging. Although the priority will show it for sure. The
> + * pre-OOM priority value is mostly an empirically taken priority: we
> + * never observe it under any load, except for last few allocations before
> + * the OOM (but the exact value is still configurable via sysctl).
> + */
> +static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio)
> +{
> + if (prio > vmpressure_level_oom_prio)
> + return;
> +
> + /* OK, the prio is below the threshold, send the pre-OOM event. */
> + vmpressure(memcg, vmpressure_win, 0);
> +}
> +
> +#else
> +static inline void __vmpressure(struct mem_cgroup *memcg,
> + ulong scanned, ulong reclaimed) {}
> +static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {}
> +#endif /* CONFIG_VMPRESSURE */
> +
> +static inline void vmpressure(struct mem_cgroup *memcg,
> + ulong scanned, ulong reclaimed)
> +{
> + if (!scanned)
> + return;
> +
> + if (IS_BUILTIN(CONFIG_MEMCG) && memcg) {
> + /*
> + * The vmpressure API reports system pressure, for per-cgroup
> + * pressure, we'll chain cgroups notifications, this is to
> + * be implemented.
> + *
> + * memcg_vm_pressure(target_mem_cgroup, scanned, reclaimed);
> + */
> + return;
> + }
> + __vmpressure(memcg, scanned, reclaimed);
> +}
> +
Ok. Personally I'm ok with memcg support not existing initially. If we
can't get the global case right, then the memcg case is impossible.
> +#endif /* __KERNEL__ */
> +
> +#endif /* _LINUX_VMPRESSURE_H */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 3ccdbf4..9573a5a 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -192,6 +192,7 @@ cond_syscall(compat_sys_timerfd_gettime);
> cond_syscall(sys_eventfd);
> cond_syscall(sys_eventfd2);
> cond_syscall(sys_vmevent_fd);
> +cond_syscall(sys_vmpressure_fd);
>
> /* performance counters: */
> cond_syscall(sys_perf_event_open);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 87174ef..7c9a3be 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -50,6 +50,7 @@
> #include <linux/dnotify.h>
> #include <linux/syscalls.h>
> #include <linux/vmstat.h>
> +#include <linux/vmpressure.h>
> #include <linux/nfs_fs.h>
> #include <linux/acpi.h>
> #include <linux/reboot.h>
> @@ -1317,6 +1318,36 @@ static struct ctl_table vm_table[] = {
> .proc_handler = numa_zonelist_order_handler,
> },
> #endif
> +#ifdef CONFIG_VMPRESSURE
> + {
> + .procname = "vmpressure_window",
> + .data = &vmpressure_win,
> + .maxlen = sizeof(vmpressure_win),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
> + {
> + .procname = "vmpressure_level_medium",
> + .data = &vmpressure_level_med,
> + .maxlen = sizeof(vmpressure_level_med),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
> + {
> + .procname = "vmpressure_level_oom",
> + .data = &vmpressure_level_oom,
> + .maxlen = sizeof(vmpressure_level_oom),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
> + {
> + .procname = "vmpressure_level_oom_priority",
> + .data = &vmpressure_level_oom_prio,
> + .maxlen = sizeof(vmpressure_level_oom_prio),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
> +#endif
Talked about this and why I think they should be debugfs already.
> #if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \
> (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
> {
> diff --git a/mm/Kconfig b/mm/Kconfig
> index cd0ea24e..8a47a5f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -401,6 +401,19 @@ config VMEVENT
> help
> If unsure, say N to disable vmevent
>
> +config VMPRESSURE
> + bool "Enable vmpressure_fd() notifications"
> + help
> + This option enables vmpressure_fd() system call, it is used to
> + notify userland applications about system's virtual memory
> + pressure state.
> +
> + Upon these notifications, userland programs can cooperate with
> + the kernel (e.g. free easily reclaimable resources), and so
> + achieving better system's memory management.
> +
> + If unsure, say N.
> +
If anything I think this should be default Y. If Android benefits from
it, it's plausible that normal desktops might and, failing that,
monitoring applications on server workloads will. With default N, it's
going to be missed by distributions.
I think making it configurable at all is overkill -- maybe the debugfs
parts, but otherwise just build it in.
> config FRONTSWAP
> bool "Enable frontswap to cache swap pages if tmem is present"
> depends on SWAP
> diff --git a/mm/Makefile b/mm/Makefile
> index 80debc7..2f08d14 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -57,4 +57,5 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> obj-$(CONFIG_CLEANCACHE) += cleancache.o
> obj-$(CONFIG_VMEVENT) += vmevent.o
> +obj-$(CONFIG_VMPRESSURE) += vmpressure.o
> obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> new file mode 100644
> index 0000000..54f35a3
> --- /dev/null
> +++ b/mm/vmpressure.c
> @@ -0,0 +1,231 @@
> +/*
> + * Linux VM pressure notifications
> + *
> + * Copyright 2011-2012 Pekka Enberg <[email protected]>
> + * Copyright 2011-2012 Linaro Ltd.
> + * Anton Vorontsov <[email protected]>
> + *
> + * Based on ideas from KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman,
> + * Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/anon_inodes.h>
> +#include <linux/atomic.h>
> +#include <linux/compiler.h>
> +#include <linux/vmpressure.h>
> +#include <linux/syscalls.h>
> +#include <linux/workqueue.h>
> +#include <linux/mutex.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/swap.h>
> +
> +struct vmpressure_watch {
> + struct vmpressure_config config;
> + atomic_t pending;
> + wait_queue_head_t waitq;
> + struct list_head node;
> +};
> +
> +static atomic64_t vmpressure_sr;
> +static uint vmpressure_val;
> +
> +static LIST_HEAD(vmpressure_watchers);
> +static DEFINE_MUTEX(vmpressure_watchers_lock);
> +
Superficially, this looks like a custom implementation of a chain notifier
(include/linux/notifier.h). It's not something the VM makes much use of
other than the OOM killer but it's there.
> +/* Our sysctl tunables, see Documentation/sysctl/vm.txt */
> +uint __read_mostly vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +uint vmpressure_level_med = 60;
> +uint vmpressure_level_oom = 99;
> +uint vmpressure_level_oom_prio = 4;
> +
> +/*
> + * This function is called from a workqueue, which can have only one
> + * execution thread, so we don't need to worry about racing w/ ourselves.
> + * And so it possible to implement the lock-free logic, using just the
> + * atomic watch->pending variable.
> + */
> +static void vmpressure_sample(struct vmpressure_watch *watch)
> +{
> + if (atomic_read(&watch->pending))
> + return;
> + if (vmpressure_val < watch->config.threshold)
> + return;
> +
> + atomic_set(&watch->pending, 1);
> + wake_up(&watch->waitq);
> +}
> +
> +static u64 vmpressure_level(uint pressure)
> +{
> + if (pressure >= vmpressure_level_oom)
> + return VMPRESSURE_OOM;
> + else if (pressure >= vmpressure_level_med)
> + return VMPRESSURE_MEDIUM;
> + return VMPRESSURE_LOW;
> +}
> +
> +static uint vmpressure_calc_pressure(uint win, uint s, uint r)
> +{
> + ulong p;
> +
> + /*
> + * We calculate the ratio (in percents) of how many pages were
> + * scanned vs. reclaimed in a given time frame (window). Note that
> + * time is in VM reclaimer's "ticks", i.e. number of pages
> + * scanned. This makes it possible set desired reaction time and
> + * serves as a ratelimit.
> + */
> + p = win - (r * win / s);
> + p = p * 100 / win;
> +
> + pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
> +
> + return vmpressure_level(p);
> +}
> +
Ok!
> +#define VMPRESSURE_SCANNED_SHIFT (sizeof(u64) * 8 / 2)
> +
> +static void vmpressure_wk_fn(struct work_struct *wk)
> +{
> + struct vmpressure_watch *watch;
> + u64 sr = atomic64_xchg(&vmpressure_sr, 0);
> + u32 s = sr >> VMPRESSURE_SCANNED_SHIFT;
> + u32 r = sr & (((u64)1 << VMPRESSURE_SCANNED_SHIFT) - 1);
> +
> + vmpressure_val = vmpressure_calc_pressure(vmpressure_win, s, r);
> +
> + mutex_lock(&vmpressure_watchers_lock);
> + list_for_each_entry(watch, &vmpressure_watchers, node)
> + vmpressure_sample(watch);
> + mutex_unlock(&vmpressure_watchers_lock);
> +}
So, if you used notifiers I think this would turn into a
blocking_notifier_call_chain() probably. Maybe
atomic_notifier_call_chain() depending.
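Roughly, with a standard blocking notifier chain the registration and the
wakeup path inside mm/vmpressure.c could be reduced to something like the
sketch below (the register/unregister helper names are made up):

#include <linux/notifier.h>

static BLOCKING_NOTIFIER_HEAD(vmpressure_notifier);

int vmpressure_register_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&vmpressure_notifier, nb);
}

int vmpressure_unregister_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_unregister(&vmpressure_notifier, nb);
}

static void vmpressure_wk_fn(struct work_struct *wk)
{
	/* compute vmpressure_val as in the patch, then fire the chain */
	blocking_notifier_call_chain(&vmpressure_notifier, vmpressure_val,
				     NULL);
}

The anon-inode fd code would then register one notifier_block per watch
instead of keeping its own list and mutex.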
> +static DECLARE_WORK(vmpressure_wk, vmpressure_wk_fn);
> +
> +void __vmpressure(struct mem_cgroup *memcg, ulong scanned, ulong reclaimed)
> +{
> + /*
> + * Store s/r combined, so we don't have to worry to synchronize
> + * them. On modern machines it will be truly atomic; on arches w/o
> + * 64 bit atomics it will turn into a spinlock (for a small amount
> + * of CPUs it's not a problem).
> + *
> + * Using int-sized atomics is a bad idea as it would only allow to
> + * count (1 << 16) - 1 pages (256MB), which we can scan pretty
> + * fast.
> + *
> + * We can't have per-CPU counters as this will not catch a case
> + * when many CPUs scan small amounts (so none of them hit the
> + * window size limit, and thus we won't send a notification in
> + * time).
> + *
> + * So we shouldn't place vmpressure() into a very hot path.
> + */
> + atomic64_add(scanned << VMPRESSURE_SCANNED_SHIFT | reclaimed,
> + &vmpressure_sr);
> +
> + scanned = atomic64_read(&vmpressure_sr) >> VMPRESSURE_SCANNED_SHIFT;
> + if (scanned >= vmpressure_win && !work_pending(&vmpressure_wk))
> + schedule_work(&vmpressure_wk);
> +}
So after all this, I'm ok with the actual calculation of pressure part
and when userspace gets woken up. I'm *WAY* happier with this than I was
with notifiers based on free memory so for *just* that part
Acked-by: Mel Gorman <[email protected]>
I'm less keen on the actual interface and have explained why but it's up
to other people to say whether they feel the same way. If Pekka and the
Android people are ok with the interface then I won't object. However,
if eventfd cannot be used and a system call really is required then it
should be explained *very* carefully in the changelog or it'll just get
snagged by another reviewer.
> +
> +static uint vmpressure_poll(struct file *file, poll_table *wait)
> +{
> + struct vmpressure_watch *watch = file->private_data;
> +
> + poll_wait(file, &watch->waitq, wait);
> +
> + return atomic_read(&watch->pending) ? POLLIN : 0;
> +}
> +
> +static ssize_t vmpressure_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct vmpressure_watch *watch = file->private_data;
> + struct vmpressure_event event;
> + int ret;
> +
> + if (count < sizeof(event))
> + return -EINVAL;
> +
> + ret = wait_event_interruptible(watch->waitq,
> + atomic_read(&watch->pending));
> + if (ret)
> + return ret;
> +
> + event.pressure = vmpressure_val;
> + if (copy_to_user(buf, &event, sizeof(event)))
> + return -EFAULT;
> +
> + atomic_set(&watch->pending, 0);
> +
> + return count;
> +}
> +
> +static int vmpressure_release(struct inode *inode, struct file *file)
> +{
> + struct vmpressure_watch *watch = file->private_data;
> +
> + mutex_lock(&vmpressure_watchers_lock);
> + list_del(&watch->node);
> + mutex_unlock(&vmpressure_watchers_lock);
> +
> + kfree(watch);
> + return 0;
> +}
> +
> +static const struct file_operations vmpressure_fops = {
> + .poll = vmpressure_poll,
> + .read = vmpressure_read,
> + .release = vmpressure_release,
> +};
> +
> +SYSCALL_DEFINE1(vmpressure_fd, struct vmpressure_config __user *, config)
> +{
> + struct vmpressure_watch *watch;
> + struct file *file;
> + int ret;
> + int fd;
> +
> + watch = kzalloc(sizeof(*watch), GFP_KERNEL);
> + if (!watch)
> + return -ENOMEM;
> +
> + ret = copy_from_user(&watch->config, config, sizeof(*config));
> + if (ret)
> + goto err_free;
> +
> + fd = get_unused_fd_flags(O_RDONLY);
> + if (fd < 0) {
> + ret = fd;
> + goto err_free;
> + }
> +
> + file = anon_inode_getfile("[vmpressure]", &vmpressure_fops, watch,
> + O_RDONLY);
> + if (IS_ERR(file)) {
> + ret = PTR_ERR(file);
> + goto err_fd;
> + }
> +
> + fd_install(fd, file);
> +
> + init_waitqueue_head(&watch->waitq);
> +
> + mutex_lock(&vmpressure_watchers_lock);
> + list_add(&watch->node, &vmpressure_watchers);
> + mutex_unlock(&vmpressure_watchers_lock);
> +
> + return fd;
> +err_fd:
> + put_unused_fd(fd);
> +err_free:
> + kfree(watch);
> + return ret;
> +}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 99b434b..5439117 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -20,6 +20,7 @@
> #include <linux/init.h>
> #include <linux/highmem.h>
> #include <linux/vmstat.h>
> +#include <linux/vmpressure.h>
> #include <linux/file.h>
> #include <linux/writeback.h>
> #include <linux/blkdev.h>
> @@ -1846,6 +1847,9 @@ restart:
> shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> sc, LRU_ACTIVE_ANON);
>
> + vmpressure(sc->target_mem_cgroup,
> + sc->nr_scanned - nr_scanned, nr_reclaimed);
> +
> /* reclaim/compaction might need reclaim to continue */
> if (should_continue_reclaim(lruvec, nr_reclaimed,
> sc->nr_scanned - nr_scanned, sc))
> @@ -2068,6 +2072,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> count_vm_event(ALLOCSTALL);
>
> do {
> + vmpressure_prio(sc->target_mem_cgroup, sc->priority);
> sc->nr_scanned = 0;
> aborted_reclaim = shrink_zones(zonelist, sc);
>
> --
> 1.8.0
>
--
Mel Gorman
SUSE Labs
On Thu, Nov 08, 2012 at 05:01:24PM +0000, Mel Gorman wrote:
> (Sorry about being very late reviewing this)
>
> On Wed, Nov 07, 2012 at 03:01:28AM -0800, Anton Vorontsov wrote:
> > This patch introduces vmpressure_fd() system call. The system call creates
> > a new file descriptor that can be used to monitor Linux' virtual memory
> > management pressure. There are three discrete levels of the pressure:
> >
>
> Why was eventfd unsuitable? It's a bit trickier to use but there are
> examples in the kernel where an application is required to do something like
>
> 1. open eventfd
> 2. open a control file, say /proc/sys/vm/vmpressure or if cgroups
> /sys/fs/cgroup/something/vmpressure
> 3. write fd_event fd_control [low|medium|oom]. Can be a binary structure
> you write
>
> and then poll the eventfd. The trickiness is awkward but a library
> implementation of vmpressure_fd() that mapped onto eventfd properly should
> be trivial.
>
> I confess I'm not super familiar with eventfd and if this can actually
> work in practice
You've described how it works for memory thresholds and oom notifications
in memcg. So it works. I also prefer this kind of interface.
See Documentation/cgroups/cgroups.txt section 2.4 and
Documentation/cgroups/memory.txt sections 9 and 10.
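For comparison, the existing memcg threshold notifications described there work
roughly like this from userspace (the cgroup mount point is just an example,
and error handling is omitted):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/eventfd.h>

int main(void)
{
	const char *base = "/sys/fs/cgroup/memory";	/* example mount */
	char path[256], buf[128];
	uint64_t count;
	int efd, usage_fd, ctrl_fd;

	efd = eventfd(0, 0);

	snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", base);
	usage_fd = open(path, O_RDONLY);

	snprintf(path, sizeof(path), "%s/cgroup.event_control", base);
	ctrl_fd = open(path, O_WRONLY);

	/* "<event_fd> <fd of memory.usage_in_bytes> <threshold in bytes>" */
	snprintf(buf, sizeof(buf), "%d %d %llu", efd, usage_fd,
		 (unsigned long long)(64 << 20));
	write(ctrl_fd, buf, strlen(buf));

	/* Blocks until memory usage crosses the 64M threshold */
	read(efd, &count, sizeof(count));
	printf("memcg threshold crossed\n");

	return 0;
}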
--
Kirill A. Shutemov
Hi Anton,
On Wed, 7 Nov 2012 02:53:49 -0800
Anton Vorontsov <[email protected]> wrote:
> Hi all,
>
> This is the third RFC. As suggested by Minchan Kim, the API is much
> simplified now (comparing to vmevent_fd):
Which tree is this against? I'd like to try this series, but it doesn't
apply to Linus tree.
On Fri, Nov 09, 2012 at 09:32:03AM +0100, Luiz Capitulino wrote:
> Anton Vorontsov <[email protected]> wrote:
> > This is the third RFC. As suggested by Minchan Kim, the API is much
> > simplified now (comparing to vmevent_fd):
>
> Which tree is this against? I'd like to try this series, but it doesn't
> apply to Linus tree.
Thanks for trying!
The tree is a mix of Pekka's linux-vmevent tree and Linus' tree. You can
just clone my tree to get the whole thing:
git://git.infradead.org/users/cbou/linux-vmevent.git
Note that the tree is rebasable. Also be sure to select CONFIG_VMPRESSURE,
not CONFIG_VMEVENT.
Thanks!
Anton.
On Wed, 7 Nov 2012 03:01:28 -0800
Anton Vorontsov <[email protected]> wrote:
> This patch introduces vmpressure_fd() system call. The system call creates
> a new file descriptor that can be used to monitor Linux' virtual memory
> management pressure.
I noticed a couple of quick things as I was looking this over...
> +static ssize_t vmpressure_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct vmpressure_watch *watch = file->private_data;
> + struct vmpressure_event event;
> + int ret;
> +
> + if (count < sizeof(event))
> + return -EINVAL;
> +
> + ret = wait_event_interruptible(watch->waitq,
> + atomic_read(&watch->pending));
Would it make sense to support non-blocking reads? Perhaps a process would
like to simply know the current pressure level?
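A sketch of what that could look like in vmpressure_read(); a monitor would
then set O_NONBLOCK with fcntl(F_SETFL), since the fd is created O_RDONLY in
this patch:

static ssize_t vmpressure_read(struct file *file, char __user *buf,
			       size_t count, loff_t *ppos)
{
	struct vmpressure_watch *watch = file->private_data;
	struct vmpressure_event event;
	int ret;

	if (count < sizeof(event))
		return -EINVAL;

	if (!atomic_read(&watch->pending)) {
		/* Sketch: bail out instead of sleeping for O_NONBLOCK */
		if (file->f_flags & O_NONBLOCK)
			return -EAGAIN;
		ret = wait_event_interruptible(watch->waitq,
					       atomic_read(&watch->pending));
		if (ret)
			return ret;
	}

	event.pressure = vmpressure_val;
	if (copy_to_user(buf, &event, sizeof(event)))
		return -EFAULT;

	atomic_set(&watch->pending, 0);

	return count;
}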
> +SYSCALL_DEFINE1(vmpressure_fd, struct vmpressure_config __user *, config)
> +{
> + struct vmpressure_watch *watch;
> + struct file *file;
> + int ret;
> + int fd;
> +
> + watch = kzalloc(sizeof(*watch), GFP_KERNEL);
> + if (!watch)
> + return -ENOMEM;
> +
> + ret = copy_from_user(&watch->config, config, sizeof(*config));
> + if (ret)
> + goto err_free;
This is wrong - you'll return the number of uncopied bytes to user space.
You'll need a "ret = -EFAULT;" in there somewhere.
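For concreteness, a sketch of that fix in sys_vmpressure_fd():

	if (copy_from_user(&watch->config, config, sizeof(*config))) {
		/* copy_from_user() returns bytes left, not an errno */
		ret = -EFAULT;
		goto err_free;
	}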
jon
On Wed, 7 Nov 2012, Kirill A. Shutemov wrote:
> > > Sorry, I didn't follow previous discussion on this, but could you
> > > explain what's wrong with memory notifications from memcg?
> > > As I can see you can get pretty similar functionality using memory
> > > thresholds on the root cgroup. What's the point?
> >
> > Why should you be required to use cgroups to get VM pressure events to
> > userspace?
>
> Valid point. But in fact you have it on most systems anyway.
>
> I personally don't like to have a syscall per small feature.
> Isn't it better to have a file-based interface which can be used with
> normal file syscalls: open()/read()/poll()?
>
I agree that eventfd is the way to go, but I'll also add that this feature
seems to be implemented at a far too coarse of level. Memory, and hence
memory pressure, is constrained by several factors other than just the
amount of physical RAM which vmpressure_fd is addressing. What about
memory pressure caused by cpusets or mempolicies? (Memcg has its own
reclaim logic and its own memory thresholds implemented on top of eventfd
that people already use.) These both cause high levels of reclaim within
the page allocator whereas there may be an abundance of free memory
available on the system.
I don't think we want several implementations of memory pressure
notifications, so a more generic and flexible interface is going to be
needed and I think it can't be done in an extendable way through this
vmpressure_fd syscall. Unfortunately, I think that means polling on a
per-thread notifier.
Hi David,
Thanks for your comments!
On Wed, Nov 14, 2012 at 07:21:14PM -0800, David Rientjes wrote:
> > > Why should you be required to use cgroups to get VM pressure events to
> > > userspace?
> >
> > Valid point. But in fact you have it on most systems anyway.
> >
> > I personally don't like to have a syscall per small feature.
> > Isn't it better to have a file-based interface which can be used with
> > normal file syscalls: open()/read()/poll()?
> >
>
> I agree that eventfd is the way to go, but I'll also add that this feature
> seems to be implemented at a far too coarse of level. Memory, and hence
> memory pressure, is constrained by several factors other than just the
> amount of physical RAM which vmpressure_fd is addressing. What about
> memory pressure caused by cpusets or mempolicies? (Memcg has its own
> reclaim logic
Yes, sure, and my plan for per-cgroups vmpressure was to just add the same
hooks into cgroups reclaim logic (as far as I understand, we can use the
same scanned/reclaimed ratio + reclaimer priority to determine the
pressure).
> and its own memory thresholds implemented on top of eventfd
> that people already use.) These both cause high levels of reclaim within
> the page allocator whereas there may be an abundance of free memory
> available on the system.
Yes, surely global-level vmpressure should be separate for the per-cgroup
memory pressure.
But we still want the "global vmpressure" thing, so that we could use it
without cgroups too. How to do it -- syscall or sysfs+eventfd doesn't
matter much (in the sense that I can do eventfd thing if you folks like it
:).
> I don't think we want several implementations of memory pressure
> notifications,
Even with a dedicated syscall, why would we need several implementations
of memory pressure? Suppose an app in the root cgroup gets an FD via the
vmpressure_fd() syscall and then polls it... Do you see any reason why we
can't make the underlying FD switch from global to per-cgroup vmpressure
notifications completely transparently for the app? Actually, it must be
done transparently.
Oh, or do you mean that we want to monitor cgroups' vmpressure outside of
the cgroup? I.e. a parent cgroup might want to watch a child's pressure? Well,
for this, the API will have to have a hard dependency on the cgroup sysfs
hierarchy -- so how would we use it without cgroups then? :) I see no
other option but to have two "APIs" then. (Well, in the eventfd case it would
indeed be simpler -- we would only have different sysfs paths for the cgroups
and non-cgroups cases... do you see this as acceptable?)
Thanks,
Anton.
On Wed, 14 Nov 2012, Anton Vorontsov wrote:
> > I agree that eventfd is the way to go, but I'll also add that this feature
> > seems to be implemented at a far too coarse of level. Memory, and hence
> > memory pressure, is constrained by several factors other than just the
> > amount of physical RAM which vmpressure_fd is addressing. What about
> > memory pressure caused by cpusets or mempolicies? (Memcg has its own
> > reclaim logic
>
> Yes, sure, and my plan for per-cgroups vmpressure was to just add the same
> hooks into cgroups reclaim logic (as far as I understand, we can use the
> same scanned/reclaimed ratio + reclaimer priority to determine the
> pressure).
>
I don't understand, how would this work with cpusets, for example, with
vmpressure_fd as defined? The cpuset policy is embedded in the page
allocator and skips over zones that are not allowed when trying to find a
page of the specified order. Imagine a cpuset bound to a single node that
is under severe memory pressure. The reclaim logic will get triggered and
cause a notification on your fd when the rest of the system's nodes may
have tons of memory available. So now an application that actually is
using this interface and is trying to be a good kernel citizen decides to
free caches back to the kernel, start ratelimiting, etc, when it actually
doesn't have any memory allocated on the nearly-oom cpuset so its memory
freeing doesn't actually achieve anything.
Rather, I think it's much better to be notified when an individual process
invokes various levels of reclaim up to and including the oom killer so
that we know the context that memory freeing needs to happen (or,
optionally, the set of processes that could be sacrificed so that this
higher priority process may allocate memory).
> > and its own memory thresholds implemented on top of eventfd
> > that people already use.) These both cause high levels of reclaim within
> > the page allocator whereas there may be an abundance of free memory
> > available on the system.
>
> Yes, surely global-level vmpressure should be separate for the per-cgroup
> memory pressure.
>
I disagree; I think that if you have a per-thread memory pressure notification
if and when a thread starts down the page allocator slowpath, through the
various stages of reclaim (perhaps on a scale of 0-100 as described), up to and
including the oom killer, then you can target eventual memory freeing that
actually is useful.
> But we still want the "global vmpressure" thing, so that we could use it
> without cgroups too. How to do it -- syscall or sysfs+eventfd doesn't
> matter much (in the sense that I can do eventfd thing if you folks like it
> :).
>
Most processes aren't going to care if they are running into memory
pressure and have no implementation to free memory back to the kernel or
start ratelimiting themselves. They will just continue happily along
until they get the memory they want or they get oom killed. The ones that
do, however, or a job scheduler or monitor that is watching over the
memory usage of a set of tasks, will be able to do something when
notified.
In the hopes of a single API that can do all this and not a
reimplementation for various types of memory limitations (it seems like
what you're suggesting is at least three different APIs: system-wide via
vmpressure_fd, memcg via memcg thresholds, and cpusets through an eventual
cpuset threshold), I'm hoping that we can have a single interface that can
be polled on to determine when individual processes are encountering
memory pressure. And if I'm not running in your oom cpuset, I don't care
about your memory pressure.
Hi David,
Thanks again for your inspirational comments!
On Wed, Nov 14, 2012 at 07:59:52PM -0800, David Rientjes wrote:
> > > I agree that eventfd is the way to go, but I'll also add that this feature
> > > seems to be implemented at a far too coarse of level. Memory, and hence
> > > memory pressure, is constrained by several factors other than just the
> > > amount of physical RAM which vmpressure_fd is addressing. What about
> > > memory pressure caused by cpusets or mempolicies? (Memcg has its own
> > > reclaim logic
> >
> > Yes, sure, and my plan for per-cgroups vmpressure was to just add the same
> > hooks into cgroups reclaim logic (as far as I understand, we can use the
> > same scanned/reclaimed ratio + reclaimer priority to determine the
> > pressure).
[Answers reordered]
> Rather, I think it's much better to be notified when an individual process
> invokes various levels of reclaim up to and including the oom killer so
> that we know the context that memory freeing needs to happen (or,
> optionally, the set of processes that could be sacrificed so that this
> higher priority process may allocate memory).
I think I understand what you're saying, and surely it makes sense, but I
don't know how you see this implemented on the API level.
Getting struct {pid, pressure} pairs that cause the pressure at the
moment? And the monitor only gets <pids> that are in the same cpuset? How
about memcg limits?..
[...]
> > But we still want the "global vmpressure" thing, so that we could use it
> > without cgroups too. How to do it -- syscall or sysfs+eventfd doesn't
> > matter much (in the sense that I can do eventfd thing if you folks like it
> > :).
> >
>
> Most processes aren't going to care if they are running into memory
> pressure and have no implementation to free memory back to the kernel or
> start ratelimiting themselves. They will just continue happily along
> until they get the memory they want or they get oom killed. The ones that
> do, however, or a job scheduler or monitor that is watching over the
> memory usage of a set of tasks, will be able to do something when
> notified.
Yup, this is exactly how we want to use this. In Android we have "Activity
Manager" thing, which acts exactly how you describe: it's a tasks monitor.
> In the hopes of a single API that can do all this and not a
> reimplementation for various types of memory limitations (it seems like
> what you're suggesting is at least three different APIs: system-wide via
> vmpressure_fd, memcg via memcg thresholds, and cpusets through an eventual
> cpuset threshold), I'm hoping that we can have a single interface that can
> be polled on to determine when individual processes are encountering
> memory pressure. And if I'm not running in your oom cpuset, I don't care
> about your memory pressure.
I'm not sure to what exactly you are opposing. :) You don't want to have
three "kinds" pressures, or you don't what to have three different
interfaces to each of them, or both?
> I don't understand, how would this work with cpusets, for example, with
> vmpressure_fd as defined? The cpuset policy is embedded in the page
> allocator and skips over zones that are not allowed when trying to find a
> page of the specified order. Imagine a cpuset bound to a single node that
> is under severe memory pressure. The reclaim logic will get triggered and
> cause a notification on your fd when the rest of the system's nodes may
> have tons of memory available.
Yes, I see your point: we have many ways to limit resources, so it makes
it hard to identify the cause of the "pressure" and thus how to deal with
it, since the pressure might be caused by different kinds of limits, and
freeing memory from one bucket doesn't mean that the memory will be
available to the process that is requesting the memory.
So we do want to know whether a specific cpuset is under pressure, whether
a specific memcg is under pressure, or whether the system (and kernel
itself) lacks memory.
And we want to have a single API for this? Heh. :)
The other idea might be this (I'm describing it in detail so that you
could actually comment on what exactly you don't like about it):
1. Obtain the fd via eventfd();
2. The fd can be passed to these files:
I) Say /sys/kernel/mm/memory_pressure
If we don't use cpusets/memcg or even have CGROUPS=n, this will be
system's/global memory pressure. Pass the fd to this file and start
polling.
If we do use cpusets or memcg, the API will still work, but we have
two options for its behaviour:
a) This will only report the pressure when we're reclaiming with
say (global_reclaim() && node_isset(zone_to_nid(zone),
current->mems_allowed)) == 1. (Basically, we want to see pressure
from kernel slab allocations or any non-soft limits).
or
b) If 'filtering' cpusets/memcg seems too hard, we can say that
these notifications are the "sum" of global+memcg+cpuset. It
doesn't make sense to actually monitor these, though, so if the
monitor is aware of cgroups, just 'goto II) and/or III)'.
II) /sys/fs/cgroup/cpuset/.../cpuset.memory_pressure (yeah, we have
it already)
Pass the fd to this file to monitor per-cpuset pressure. So, if you
get the pressure from here, it makes sense to free resources from
this cpuset.
III) /sys/fs/cgroup/memory/.../memory.pressure
Pass the fd to this file to monitor per-memcg pressure. If you get
the pressure from here, it only makes sense to free resources from
this memcg.
3. The pressure level values (and their meaning) and the format of the
files are the same, and this what defines the "API".
So, if "memory monitor/supervisor app" is aware of cpusets, it manages
memory at this level. If both cpuset and memcg is used, then it has to
monitor both files, and act accordingly. And if we don't use
cpusets/memcg (or even have cgroups=n), we can just watch the global
reclaimer's pressure.
Do I understand correctly that you don't like this? Just to make sure. :)
Thanks,
Anton.
On Wed, 14 Nov 2012, Anton Vorontsov wrote:
> Thanks again for your inspirational comments!
>
Heh, not sure I've been too inspirational (probably more annoying than
anything else). I really do want generic memory pressure notifications in
the kernel and already have some ideas on how I can tie it into our malloc
arenas, so please do keep working on it.
> I think I understand what you're saying, and surely it makes sense, but I
> don't know how you see this implemented on the API level.
>
> Getting struct {pid, pressure} pairs that cause the pressure at the
> moment? And the monitor only gets <pids> that are in the same cpuset? How
> about memcg limits?..
>
Depends on whether you want to support mempolicies or not and the argument
could go either way:
- FOR supporting mempolicies: memory that you mbind() to can become
depleted, and since there is no fallback you have no way to prevent
lots of reclaim and/or invoking the oom killer; it would be
disappointing not to be able to get notifications of such a condition.
- AGAINST supporting mempolicies: you only need to support memory
isolation for cgroups (memcg and cpusets) and thus can implement your
own memory pressure cgroup that you can use to aggregate tasks and
then replace memcg memory thresholds with co-mounting this new cgroup
that would notify on an eventfd anytime one of the attached processes
experiences memory pressure.
> > Most processes aren't going to care if they are running into memory
> > pressure and have no implementation to free memory back to the kernel or
> > start ratelimiting themselves. They will just continue happily along
> > until they get the memory they want or they get oom killed. The ones that
> > do, however, or a job scheduler or monitor that is watching over the
> > memory usage of a set of tasks, will be able to do something when
> > notified.
>
> Yup, this is exactly how we want to use this. In Android we have "Activity
> Manager" thing, which acts exactly how you describe: it's a tasks monitor.
>
In addition to that, I think I can hook into our implementation of malloc,
which frees memory back to the kernel with MADV_DONTNEED and zaps
individual ptes to poke holes in the memory it allocates. It actually caches
the memory that we free() and re-uses it under normal circumstances to
return cache-hot memory on the next allocation, but under memory pressure,
as triggered by your interface (but for threads attached to a memcg facing
memcg limits), it would drain the memory back to the kernel immediately.
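A sketch of the kind of hook described above, with a purely hypothetical
free-chunk cache; the only real API used is madvise(MADV_DONTNEED), which
keeps the mapping but lets the kernel drop the backing pages:

#include <stddef.h>
#include <sys/mman.h>

struct cached_chunk {
	void *addr;
	size_t len;
	struct cached_chunk *next;
};

/* Hypothetical list of free()d-but-still-mapped chunks kept by malloc */
static struct cached_chunk *free_cache;

/* Called when the pressure fd (or memcg eventfd) signals an event */
static void drain_free_cache(void)
{
	struct cached_chunk *c;

	for (c = free_cache; c; c = c->next) {
		/*
		 * The mapping stays valid; the pages are handed back and
		 * come back zero-filled if the chunk is reused later.
		 */
		madvise(c->addr, c->len, MADV_DONTNEED);
	}
}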
> > In the hopes of a single API that can do all this and not a
> > reimplementation for various types of memory limitations (it seems like
> > what you're suggesting is at least three different APIs: system-wide via
> > vmpressure_fd, memcg via memcg thresholds, and cpusets through an eventual
> > cpuset threshold), I'm hoping that we can have a single interface that can
> > be polled on to determine when individual processes are encountering
> > memory pressure. And if I'm not running in your oom cpuset, I don't care
> > about your memory pressure.
>
> I'm not sure to what exactly you are opposing. :) You don't want to have
> three "kinds" pressures, or you don't what to have three different
> interfaces to each of them, or both?
>
The three pressures are a separate topic (I think it would be better to
have some measure of memory pressure similar to your reclaim scale and
allow users to get notifications at levels they define). I really dislike
having multiple interfaces that are all different from one another
depending on the context.
Given what we have right now with memory thresholds in memcg, if we were
to merge vmpressure_fd, then we're significantly limiting the usecase
since applications need not know if they are attached to a memcg or not:
it's a type of virtualization that the admin may set up but another admin
may be running unconstrained on a system with much more memory. So for
your usecase of a job monitor, that would work fine for global oom
conditions but the application no longer has an API to use if it wants to
know when it itself is feeling memory pressure.
I think others have voiced their opinion on trying to create a single API
for memory pressure notifications as well, it's just a hard problem and
takes a lot of work to determine how we can make it easy to use and
understand and extendable at the same time.
> > I don't understand, how would this work with cpusets, for example, with
> > vmpressure_fd as defined? The cpuset policy is embedded in the page
> > allocator and skips over zones that are not allowed when trying to find a
> > page of the specified order. Imagine a cpuset bound to a single node that
> > is under severe memory pressure. The reclaim logic will get triggered and
> > cause a notification on your fd when the rest of the system's nodes may
> > have tons of memory available.
>
> Yes, I see your point: we have many ways to limit resources, so it makes
> it hard to identify the cause of the "pressure" and thus how to deal with
> it, since the pressure might be caused by different kinds of limits, and
> freeing memory from one bucket doesn't mean that the memory will be
> available to the process that is requesting the memory.
>
> So we do want to know whether a specific cpuset is under pressure, whether
> a specific memcg is under pressure, or whether the system (and kernel
> itself) lacks memory.
>
> And we want to have a single API for this? Heh. :)
>
Might not be too difficult if you implement your own cgroup to aggregate
these tasks for which you want to know memory pressure events; it would
have to be triggered for the task trying to allocate memory at any given
time and how hard it was to allocate that memory in the slowpath, tie it
back to that tasks' memory pressure cgroup, and then report the trigger if
it's over a user-defined threshold normalized to the 0-100 scale. Then
you could co-mount this cgroup with memcg, cpusets, or just do it for the
root cgroup for users who want to monitor the entire system
(CONFIG_CGROUPS is enabled by default).
On Thu, Nov 15, 2012 at 12:11:47AM -0800, David Rientjes wrote:
[...]
> Might not be too difficult if you implement your own cgroup to aggregate
> these tasks for which you want to know memory pressure events; it would
> have to be triggered for the task trying to allocate memory at any given
> time and how hard it was to allocate that memory in the slowpath, tie it
> back to that tasks' memory pressure cgroup, and then report the trigger if
> it's over a user-defined threshold normalized to the 0-100 scale. Then
> you could co-mount this cgroup with memcg, cpusets, or just do it for the
> root cgroup for users who want to monitor the entire system
This seems doable. But
> (CONFIG_CGROUPS is enabled by default).
Hehe, you're saying that we have to have cgroups=y. :) But some folks were
deliberately asking us to make the cgroups optional.
OK, here is what I can try to do:
- Implement memory pressure cgroup as you described, by doing so we'd make
the thing play well with cpusets and memcg;
- This will be eventfd()-based;
- Once done, we will have a solution for pretty much every major use-case
(i.e. servers, desktops and Android, they all have cgroups enabled);
(- Optionally, if there is demand, for CGROUPS=n we can implement a
separate sysfs file with exactly the same eventfd interface; it will only
report global pressure. This will be for folks that don't want cgroups
for some reason. The interface can be discussed separately.)
Thanks,
Anton.
On Thu, 15 Nov 2012, Anton Vorontsov wrote:
> Hehe, you're saying that we have to have cgroups=y. :) But some folks were
> deliberately asking us to make the cgroups optional.
>
Enabling just CONFIG_CGROUPS (which is enabled by default) and no other
current cgroups increases the size of the kernel text by less than 0.3%
with x86_64 defconfig:
text data bss dec hex filename
10330039 1038912 1118208 12487159 be89f7 vmlinux.disabled
10360993 1041624 1122304 12524921 bf1d79 vmlinux.enabled
I understand that users with minimally-enabled configs for an optimized
memory footprint will have a higher percentage because their kernel is
already smaller (~1.8% increase for allnoconfig), but I think the cost of
enabling the cgroups code to be able to mount a vmpressure cgroup (which
I'd rename to be "mempressure" to be consistent with "memcg" but it's only
an opinion) is relatively small and allows for a much more maintainable
and extendable feature to be included: it already provides the
cgroup.event_control interface that supports eventfd that makes
implementation much easier. It also makes writing a library on top of the
cgroup to be much easier because of the standardization.
I'm more concerned about what to do with the memcg memory thresholds and
whether they can be replaced with this new cgroup. If so, then we'll have
to figure out how to map those triggers to use the new cgroup's interface
in a way that doesn't break current users that open and pass the fd of
memory.usage_in_bytes to cgroup.event_control for memcg.
> OK, here is what I can try to do:
>
> - Implement memory pressure cgroup as you described, by doing so we'd make
> the thing play well with cpusets and memcg;
>
> - This will be eventfd()-based;
>
Should be based on cgroup.event_control, see how memcg interfaces its
memory thresholds with this in Documentation/cgroups/memory.txt.
> - Once done, we will have a solution for pretty much every major use-case
> (i.e. servers, desktops and Android, they all have cgroups enabled);
>
Excellent! I'd be interested in hearing anybody else's opinions,
especially those from the memcg world, so we make sure that everybody is
happy with the API that you've described.
On 11/16/2012 01:25 AM, David Rientjes wrote:
> On Thu, 15 Nov 2012, Anton Vorontsov wrote:
>
>> Hehe, you're saying that we have to have cgroups=y. :) But some folks were
>> deliberately asking us to make the cgroups optional.
>>
>
> Enabling just CONFIG_CGROUPS (which is enabled by default) and no other
> current cgroups increases the size of the kernel text by less than 0.3%
> with x86_64 defconfig:
>
> text data bss dec hex filename
> 10330039 1038912 1118208 12487159 be89f7 vmlinux.disabled
> 10360993 1041624 1122304 12524921 bf1d79 vmlinux.enabled
>
> I understand that users with minimally-enabled configs for an optimized
> memory footprint will have a higher percentage because their kernel is
> already smaller (~1.8% increase for allnoconfig), but I think the cost of
> enabling the cgroups code to be able to mount a vmpressure cgroup (which
> I'd rename to be "mempressure" to be consistent with "memcg" but it's only
> an opinion) is relatively small and allows for a much more maintainable
> and extendable feature to be included: it already provides the
> cgroup.event_control interface that supports eventfd that makes
> implementation much easier. It also makes writing a library on top of the
> cgroup to be much easier because of the standardization.
>
> I'm more concerned about what to do with the memcg memory thresholds and
> whether they can be replaced with this new cgroup. If so, then we'll have
> to figure out how to map those triggers to use the new cgroup's interface
> in a way that doesn't break current users that open and pass the fd of
> memory.usage_in_bytes to cgroup.event_control for memcg.
>
>> OK, here is what I can try to do:
>>
>> - Implement memory pressure cgroup as you described, by doing so we'd make
>> the thing play well with cpusets and memcg;
>>
>> - This will be eventfd()-based;
>>
>
> Should be based on cgroup.event_control, see how memcg interfaces its
> memory thresholds with this in Documentation/cgroups/memory.txt.
>
>> - Once done, we will have a solution for pretty much every major use-case
>> (i.e. servers, desktops and Android, they all have cgroups enabled);
>>
>
> Excellent! I'd be interested in hearing anybody else's opinions,
> especially those from the memcg world, so we make sure that everybody is
> happy with the API that you've described.
>
Just CC'd them all.
My personal take:
Most people hate memcg due to the cost it imposes. I've already
demonstrated that with some effort, it doesn't necessarily have to be
so. (http://lwn.net/Articles/517634/)
The one thing I missed on that work, was precisely notifications. If you
can come up with a good notifications scheme that *lives* in memcg, but
does not *depend* in the memcg infrastructure, I personally think it
could be a big win.
Doing this in memcg has the advantage that the "per-group" vs "global"
is automatically solved, since the root memcg is just another name for
"global".
I honestly like your low/high/oom scheme better than memcg's
"threshold-in-bytes". I would also point out that those thresholds are
*far* from exact, due to the stock charging mechanism, and can be wrong
by as much as O(#cpus). So far, nobody complained. So in theory it
should be possible to convert memcg to low/high/oom, while still
accepting writes in bytes, that would be thrown in the closest bucket.
Another thing from one of your e-mails, that may shift you in the memcg
direction:
"2. The last time I checked, cgroups memory controller did not (and I
guess still does not) not account kernel-owned slabs. I asked several
times why so, but nobody answered."
It should, now, in the latest -mm, although it won't do per-group
reclaim (yet).
I am also failing to see how cpusets would be involved in here. I
understand that you may have free memory in terms of size, but still be
further restricted by cpuset. But I also think that having multiple
entry points for this buy us nothing at all. So the choices I see are:
1) If cpuset + memcg are comounted, take this into account when deciding
low / high / oom. This is yet another advantage over the "threshold in
bytes" interface, in which you can transparently take
other issues into account while keeping the interface.
2) If they are not, just ignore this effect.
The fallback in 2) sounds harsh, but I honestly think this is the price
to pay for the insanity of mounting those things in different
hierarchies, and we do have a plan to have all those things eventually
together anyway. If you have two cgroups dealing with memory, and set
them up in orthogonal ways, I really can't see how we can bring sanity
to that. So just admitting and unleashing the insanity may be better, if
it brings up our urge to fix it. It worked for Batman, why wouldn't it
work for us?
On Fri, 16 Nov 2012, Glauber Costa wrote:
> My personal take:
>
> Most people hate memcg due to the cost it imposes. I've already
> demonstrated that with some effort, it doesn't necessarily have to be
> so. (http://lwn.net/Articles/517634/)
>
> The one thing I missed on that work, was precisely notifications. If you
> can come up with a good notifications scheme that *lives* in memcg, but
> does not *depend* in the memcg infrastructure, I personally think it
> could be a big win.
>
This doesn't allow users of cpusets without memcg to have an API for
memory pressure, that's why I thought it should be a new cgroup that can
be mounted alongside any existing cgroup, any cgroup in the future, or
just by itself.
> Doing this in memcg has the advantage that the "per-group" vs "global"
> is automatically solved, since the root memcg is just another name for
> "global".
>
That's true of any cgroup.
> I honestly like your low/high/oom scheme better than memcg's
> "threshold-in-bytes". I would also point out that those thresholds are
> *far* from exact, due to the stock charging mechanism, and can be wrong
> by as much as O(#cpus). So far, nobody complained. So in theory it
> should be possible to convert memcg to low/high/oom, while still
> accepting writes in bytes, that would be thrown in the closest bucket.
>
I'm wondering if we should have more than three different levels.
> Another thing from one of your e-mails, that may shift you in the memcg
> direction:
>
> "2. The last time I checked, cgroups memory controller did not (and I
> guess still does not) not account kernel-owned slabs. I asked several
> times why so, but nobody answered."
>
> It should, now, in the latest -mm, although it won't do per-group
> reclaim (yet).
>
Not sure where that was written, but I certainly didn't write it and it's
not really relevant in this discussion: memory pressure notifications
would be triggered by reclaim when trying to allocate memory; why we need
to reclaim or how we got into that state is tangential. It certainly may
be because a lot of slab was allocated, but that's not the only case.
> I am also failing to see how cpusets would be involved in here. I
> understand that you may have free memory in terms of size, but still be
> further restricted by cpuset. But I also think that having multiple
> entry points for this buy us nothing at all. So the choices I see are:
>
Umm, why do users of cpusets not want to be able to trigger memory
pressure notifications?
Hey,
On 11/17/2012 12:04 AM, David Rientjes wrote:
> On Fri, 16 Nov 2012, Glauber Costa wrote:
>
>> My personal take:
>>
>> Most people hate memcg due to the cost it imposes. I've already
>> demonstrated that with some effort, it doesn't necessarily have to be
>> so. (http://lwn.net/Articles/517634/)
>>
>> The one thing I missed on that work, was precisely notifications. If you
>> can come up with a good notifications scheme that *lives* in memcg, but
>> does not *depend* in the memcg infrastructure, I personally think it
>> could be a big win.
>>
>
> This doesn't allow users of cpusets without memcg to have an API for
> memory pressure, that's why I thought it should be a new cgroup that can
> be mounted alongside any existing cgroup, any cgroup in the future, or
> just by itself.
>
>> Doing this in memcg has the advantage that the "per-group" vs "global"
>> is automatically solved, since the root memcg is just another name for
>> "global".
>>
>
> That's true of any cgroup.
Yes. But memcg happens to also deal with memory usage, and already has
a notification mechanism =)
>
>> I honestly like your low/high/oom scheme better than memcg's
>> "threshold-in-bytes". I would also point out that those thresholds are
>> *far* from exact, due to the stock charging mechanism, and can be wrong
>> by as much as O(#cpus). So far, nobody complained. So in theory it
>> should be possible to convert memcg to low/high/oom, while still
>> accepting writes in bytes, that would be thrown in the closest bucket.
>>
>
> I'm wondering if we should have more than three different levels.
>
In the case I outlined below, for backwards compatibility. What I
actually mean is that memcg *currently* allows arbitrary notifications.
One way to merge those, while moving to a saner 3-point notification, is
to still allow the old writes and fit them in the closest bucket.
>> Another thing from one of your e-mails, that may shift you in the memcg
>> direction:
>>
>> "2. The last time I checked, cgroups memory controller did not (and I
>> guess still does not) not account kernel-owned slabs. I asked several
>> times why so, but nobody answered."
>>
>> It should, now, in the latest -mm, although it won't do per-group
>> reclaim (yet).
>>
>
> Not sure where that was written, but I certainly didn't write it
Indeed you didn't, Anton did. It's his proposal, so I actually meant him
every time I said "you". The fact that you were the last responder made
it confusing - sorry.
> and it's
> not really relevant in this discussion: memory pressure notifications
> would be triggered by reclaim when trying to allocate memory; why we need
> to reclaim or how we got into that state is tangential.
My understanding is that one of the advantages he was pointing out for his
mechanism over memcg is that it would allow one to count slab memory as
well, which memcg won't do (it will, now).
>> I am also failing to see how cpusets would be involved in here. I
>> understand that you may have free memory in terms of size, but still be
>> further restricted by cpuset. But I also think that having multiple
>> entry points for this buy us nothing at all. So the choices I see are:
>>
>
> Umm, why do users of cpusets not want to be able to trigger memory
> pressure notifications?
>
Because cpusets only deal with memory placement, not memory usage.
And it is not that moving a task to cpuset disallows you to do any of
this: you could, as long as the same set of tasks are mounted in a
corresponding memcg.
Of course there are a couple use cases that could benefit from the
orthogonality, but I doubt it would justify the complexity in this case.
On Sat, 17 Nov 2012, Glauber Costa wrote:
> > I'm wondering if we should have more than three different levels.
> >
>
> In the case I outlined below, for backwards compatibility. What I
> actually mean is that memcg *currently* allows arbitrary notifications.
> One way to merge those, while moving to a saner 3-point notification, is
> to still allow the old writes and fit them in the closest bucket.
>
Yeah, but I'm wondering why three is the right answer.
> > Umm, why do users of cpusets not want to be able to trigger memory
> > pressure notifications?
> >
> Because cpusets only deal with memory placement, not memory usage.
The set of nodes that a thread is allowed to allocate from may face memory
pressure up to and including oom while the rest of the system may have a
ton of free memory. Your solution is to compile and mount memcg if you
want notifications of memory pressure on those nodes. Others in this
thread have already said they don't want to rely on memcg for any of this
and, as Anton showed, this can be tied directly into the VM without any
help from memcg as it sits today. So why implement a simple and clean
mempressure cgroup that can be used alone or co-existing with either memcg
or cpusets?
> And it is not that moving a task to cpuset disallows you to do any of
> this: you could, as long as the same set of tasks are mounted in a
> corresponding memcg.
>
Same thing with a separate mempressure cgroup. The point is that there
will be users of this cgroup that do not want the overhead imposed by
memcg (which is why it's disabled in defconfig) and there's no direct
dependency that causes it to be a part of memcg.
On Fri, Nov 16, 2012 at 01:57:09PM -0800, David Rientjes wrote:
> > > I'm wondering if we should have more than three different levels.
> > >
> >
> > In the case I outlined below, for backwards compatibility. What I
> > actually mean is that memcg *currently* allows arbitrary notifications.
> > One way to merge those, while moving to a saner 3-point notification, is
> > to still allow the old writes and fit them in the closest bucket.
>
> Yeah, but I'm wondering why three is the right answer.
You were not Cc'ed, so let me repeat why I ended up w/ the levels (not
necessarily three levels), instead of relying on the 0..100 scale:
The main change is that I decided to go with discrete levels of the
pressure.
When I started writing the man page, I had to describe the 'reclaimer
inefficiency index', and while doing this I realized that I'm describing
how the kernel is doing the memory management, which we try to avoid in
the vmevent. And applications don't really care about these details:
reclaimers, its inefficiency indexes, scanning window sizes, priority
levels, etc. -- it's all "not interesting", and purely kernel's stuff. So
I guess Mel Gorman was right, we need some sort of levels.
What applications (well, activity managers) are really interested in is
this:
1. Do we sacrifice resources for new memory allocations (e.g. file
cache)?
2. Does the new memory allocations' cost become too high, and the system
hurts because of this?
3. Are we about to OOM soon?
And here are the answers:
1. VMEVENT_PRESSURE_LOW
2. VMEVENT_PRESSURE_MED
3. VMEVENT_PRESSURE_OOM
There is no "high" pressure, since I really don't see any definition of
it, but it's possible to introduce new levels without breaking ABI.
Later I came up with the fourth level:
Maybe it makes sense to implement something like PRESSURE_MILD/BALANCE
with an additional nr_pages threshold, which basically hints the kernel
about how many easily reclaimable pages userland has (that would be a
part of our definition for the mild/balance pressure level).
I.e. the fourth level can serve as a two-way communication w/ the kernel.
But again, this would be just an extension, I don't want to introduce this
now.
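To make those three answers a bit more concrete, below is a rough sketch of
what a userland listener could look like. This is purely illustrative: the
event layout, the numeric level values and the way the descriptor is obtained
are assumptions made for this example, not the patchset's actual ABI.

/*
 * Illustrative only: the event layout, the level values and the syscall
 * number below are assumptions for this example, not the patchset's ABI.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_vmpressure_fd
#define __NR_vmpressure_fd 313	/* placeholder; arch-specific in reality */
#endif

struct vmpressure_event {	/* assumed layout */
	uint32_t level;		/* 0 = low, 1 = medium, 2 = oom */
};

int main(void)
{
	struct vmpressure_event ev;
	/* The real syscall may well take a configuration argument. */
	int fd = syscall(__NR_vmpressure_fd);

	if (fd < 0) {
		perror("vmpressure_fd");
		return 1;
	}

	while (read(fd, &ev, sizeof(ev)) == sizeof(ev)) {
		switch (ev.level) {
		case 0:	/* low: kernel is reclaiming; trim caches lazily */
			break;
		case 1:	/* medium: drop data that can be rebuilt/re-read */
			break;
		case 2:	/* oom: free everything we possibly can */
			break;
		}
	}
	return 0;
}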
> > > Umm, why do users of cpusets not want to be able to trigger memory
> > > pressure notifications?
> > >
> > Because cpusets only deal with memory placement, not memory usage.
>
> The set of nodes that a thread is allowed to allocate from may face memory
> pressure up to and including oom while the rest of the system may have a
> ton of free memory. Your solution is to compile and mount memcg if you
> want notifications of memory pressure on those nodes. Others in this
> thread have already said they don't want to rely on memcg for any of this
> and, as Anton showed, this can be tied directly into the VM without any
> help from memcg as it sits today. So why implement a simple and clean
You meant 'why not'?
> mempressure cgroup that can be used alone or co-existing with either memcg
> or cpusets?
>
> > And it is not that moving a task to cpuset disallows you to do any of
> > this: you could, as long as the same set of tasks are mounted in a
> > corresponding memcg.
> >
>
> Same thing with a separate mempressure cgroup. The point is that there
> will be users of this cgroup that do not want the overhead imposed by
> memcg (which is why it's disabled in defconfig) and there's no direct
> dependency that causes it to be a part of memcg.
There's also an API "inconvenience issue" with memcg's usage_in_bytes
stuff: applications have a hard time resetting the threshold to 'emulate'
the pressure notifications, and they also have to count bytes (like 'total
- used = free') to set the threshold. While a separate 'pressure'
notification shows exactly what apps actually want to know: the pressure.
Thanks,
Anton.
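For reference, the byte-counting dance Anton refers to looks roughly like
this with today's memcg eventfd interface. The cgroup path, the limit and the
headroom below are made-up example values; the point is simply that the
application has to pick and maintain a byte threshold itself.

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	/* Example values: the application has to come up with these itself. */
	const char *dir = "/sys/fs/cgroup/memory/mygroup";
	uint64_t limit = 64ull << 20;			/* 64M limit   */
	uint64_t threshold = limit - (8ull << 20);	/* 8M headroom */
	char path[256], line[128];
	uint64_t ticks;

	int efd = eventfd(0, 0);

	snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", dir);
	int ufd = open(path, O_RDONLY);

	snprintf(path, sizeof(path), "%s/cgroup.event_control", dir);
	int cfd = open(path, O_WRONLY);

	if (efd < 0 || ufd < 0 || cfd < 0) {
		perror("setup");
		return 1;
	}

	/* "<event_fd> <fd of memory.usage_in_bytes> <threshold in bytes>" */
	snprintf(line, sizeof(line), "%d %d %llu", efd, ufd,
		 (unsigned long long)threshold);
	if (write(cfd, line, strlen(line)) < 0) {
		perror("cgroup.event_control");
		return 1;
	}

	/* Blocks until usage crosses the threshold; the app then has to
	 * guess whether that actually means "pressure", and re-arm. */
	read(efd, &ticks, sizeof(ticks));
	printf("usage crossed %llu bytes\n", (unsigned long long)threshold);
	return 0;
}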
On Fri, 16 Nov 2012, Anton Vorontsov wrote:
> The main change is that I decided to go with discrete levels of the
> pressure.
>
> When I started writing the man page, I had to describe the 'reclaimer
> inefficiency index', and while doing this I realized that I'm describing
> how the kernel is doing the memory management, which we try to avoid in
> the vmevent. And applications don't really care about these details:
> reclaimers, its inefficiency indexes, scanning window sizes, priority
> levels, etc. -- it's all "not interesting", and purely kernel's stuff. So
> I guess Mel Gorman was right, we need some sort of levels.
>
> What applications (well, activity managers) are really interested in is
> this:
>
> 1. Do we sacrifice resources for new memory allocations (e.g. file
> cache)?
> 2. Does the new memory allocations' cost become too high, and the system
> hurts because of this?
> 3. Are we about to OOM soon?
>
> And here are the answers:
>
> 1. VMEVENT_PRESSURE_LOW
> 2. VMEVENT_PRESSURE_MED
> 3. VMEVENT_PRESSURE_OOM
>
> There is no "high" pressure, since I really don't see any definition of
> it, but it's possible to introduce new levels without breaking ABI.
>
> Later I came up with the fourth level:
>
> Maybe it makes sense to implement something like PRESSURE_MILD/BALANCE
> with an additional nr_pages threshold, which basically hints the kernel
> about how many easily reclaimable pages userland has (that would be a
> part of our definition for the mild/balance pressure level).
>
> I.e. the fourth level can serve as a two-way communication w/ the kernel.
> But again, this would be just an extension, I don't want to introduce this
> now.
>
That certainly makes sense; it would be too much of a usage and
maintenance burden to assume that the implementation of the VM is to
remain the same.
> > The set of nodes that a thread is allowed to allocate from may face memory
> > pressure up to and including oom while the rest of the system may have a
> > ton of free memory. Your solution is to compile and mount memcg if you
> > want notifications of memory pressure on those nodes. Others in this
> > thread have already said they don't want to rely on memcg for any of this
> > and, as Anton showed, this can be tied directly into the VM without any
> > help from memcg as it sits today. So why implement a simple and clean
>
> You meant 'why not'?
>
Yes, sorry.
> > mempressure cgroup that can be used alone or co-existing with either memcg
> > or cpusets?
> >
> > Same thing with a separate mempressure cgroup. The point is that there
> > will be users of this cgroup that do not want the overhead imposed by
> > memcg (which is why it's disabled in defconfig) and there's no direct
> > dependency that causes it to be a part of memcg.
>
> There's also an API "inconvenience issue" with memcg's usage_in_bytes
> stuff: applications have a hard time resetting the threshold to 'emulate'
> the pressure notifications, and they also have to count bytes (like 'total
> - used = free') to set the threshold. While a separate 'pressure'
> notification shows exactly what apps actually want to know: the pressure.
>
Agreed.
On 11/17/2012 01:57 AM, David Rientjes wrote:
> On Sat, 17 Nov 2012, Glauber Costa wrote:
>
>>> I'm wondering if we should have more than three different levels.
>>>
>>
>> In the case I outlined below, for backwards compatibility. What I
>> actually mean is that memcg *currently* allows arbitrary notifications.
>> One way to merge those, while moving to a saner 3-point notification, is
>> to still allow the old writes and fit them in the closest bucket.
>>
>
> Yeah, but I'm wondering why three is the right answer.
>
This is unrelated to what I am talking about.
I am talking about pre-defined values with a specific event meaning (in
his patchset, 3) vs arbitrary numbers valued in bytes.
>>> Umm, why do users of cpusets not want to be able to trigger memory
>>> pressure notifications?
>>>
>> Because cpusets only deal with memory placement, not memory usage.
>
> The set of nodes that a thread is allowed to allocate from may face memory
> pressure up to and including oom while the rest of the system may have a
> ton of free memory. Your solution is to compile and mount memcg if you
> want notifications of memory pressure on those nodes. Others in this
> thread have already said they don't want to rely on memcg for any of this
> and, as Anton showed, this can be tied directly into the VM without any
> help from memcg as it sits today. So why implement a simple and clean
> mempressure cgroup that can be used alone or co-existing with either memcg
> or cpusets?
>
>> And it is not that moving a task to cpuset disallows you to do any of
>> this: you could, as long as the same set of tasks are mounted in a
>> corresponding memcg.
>>
>
> Same thing with a separate mempressure cgroup. The point is that there
> will be users of this cgroup that do not want the overhead imposed by
> memcg (which is why it's disabled in defconfig) and there's no direct
> dependency that causes it to be a part of memcg.
>
I think we should shoot the duck where it is going, not where it is. A
good interface is more important than overhead, since this overhead is
by no means fundamental - memcg is fixable, and we would all benefit
from it.
Now, whether or not memcg is the right interface is a different
discussion - let's have it!
On 11/17/2012 05:21 AM, Anton Vorontsov wrote:
> On Fri, Nov 16, 2012 at 01:57:09PM -0800, David Rientjes wrote:
>>>> I'm wondering if we should have more than three different levels.
>>>>
>>>
>>> In the case I outlined below, for backwards compatibility. What I
>>> actually mean is that memcg *currently* allows arbitrary notifications.
>>> One way to merge those, while moving to a saner 3-point notification, is
>>> to still allow the old writes and fit them in the closest bucket.
>>
>> Yeah, but I'm wondering why three is the right answer.
>
> You were not Cc'ed, so let me repeat why I ended up w/ the levels (not
> necessarily three levels), instead of relying on the 0..100 scale:
>
> The main change is that I decided to go with discrete levels of the
> pressure.
>
> When I started writing the man page, I had to describe the 'reclaimer
> inefficiency index', and while doing this I realized that I'm describing
> how the kernel is doing the memory management, which we try to avoid in
> the vmevent. And applications don't really care about these details:
> reclaimers, its inefficiency indexes, scanning window sizes, priority
> levels, etc. -- it's all "not interesting", and purely kernel's stuff. So
> I guess Mel Gorman was right, we need some sort of levels.
>
> What applications (well, activity managers) are really interested in is
> this:
>
> 1. Do we sacrifice resources for new memory allocations (e.g. file
> cache)?
> 2. Does the new memory allocations' cost become too high, and the system
> hurts because of this?
> 3. Are we about to OOM soon?
>
> And here are the answers:
>
> 1. VMEVENT_PRESSURE_LOW
> 2. VMEVENT_PRESSURE_MED
> 3. VMEVENT_PRESSURE_OOM
>
> There is no "high" pressure, since I really don't see any definition of
> it, but it's possible to introduce new levels without breaking ABI.
>
> Later I came up with the fourth level:
>
> Maybe it makes sense to implement something like PRESSURE_MILD/BALANCE
> with an additional nr_pages threshold, which basically hints the kernel
> about how many easily reclaimable pages userland has (that would be a
> part of our definition for the mild/balance pressure level).
>
> I.e. the fourth level can serve as a two-way communication w/ the kernel.
> But again, this would be just an extension, I don't want to introduce this
> now.
>
>>>> Umm, why do users of cpusets not want to be able to trigger memory
>>>> pressure notifications?
>>>>
>>> Because cpusets only deal with memory placement, not memory usage.
>>
>> The set of nodes that a thread is allowed to allocate from may face memory
>> pressure up to and including oom while the rest of the system may have a
>> ton of free memory. Your solution is to compile and mount memcg if you
>> want notifications of memory pressure on those nodes. Others in this
>> thread have already said they don't want to rely on memcg for any of this
>> and, as Anton showed, this can be tied directly into the VM without any
>> help from memcg as it sits today. So why implement a simple and clean
>
> You meant 'why not'?
>
>> mempressure cgroup that can be used alone or co-existing with either memcg
>> or cpusets?
>>
>>> And it is not that moving a task to cpuset disallows you to do any of
>>> this: you could, as long as the same set of tasks are mounted in a
>>> corresponding memcg.
>>>
>>
>> Same thing with a separate mempressure cgroup. The point is that there
>> will be users of this cgroup that do not want the overhead imposed by
>> memcg (which is why it's disabled in defconfig) and there's no direct
>> dependency that causes it to be a part of memcg.
>
> There's also an API "inconvenience issue" with memcg's usage_in_bytes
> stuff: applications have a hard time resetting the threshold to 'emulate'
> the pressure notifications, and they also have to count bytes (like 'total
> - used = free') to set the threshold. While a separate 'pressure'
> notification shows exactly what apps actually want to know: the pressure.
>
Anton,
The API you propose is way superior to memcg's current interface IMHO.
That is why my proposal is to move memcg to yours, and deprecate the old
interface.
We can do this easily by allowing writes to happen, and then moving them
to the closest pressure bucket. More or less what was done for timers to
reduce wakeups.
What I noted in a previous e-mail, is that memcg triggers notifications
based on "usage" *before* the stock is drained. This means it can be
wrong by as much as 32 * NR_CPUS * PAGE_SIZE, and so far, nobody seemed
to care.
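(To put that bound in numbers: 32 pages is 128 KiB per CPU with 4 KiB pages,
so on, say, a 64-core machine the notification can be off by roughly 8 MiB.)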
>>> Umm, why do users of cpusets not want to be able to trigger memory
>>> pressure notifications?
>>>
>> Because cpusets only deal with memory placement, not memory usage.
>
> The set of nodes that a thread is allowed to allocate from may face memory
> pressure up to and including oom while the rest of the system may have a
> ton of free memory. Your solution is to compile and mount memcg if you
> want notifications of memory pressure on those nodes. Others in this
> thread have already said they don't want to rely on memcg for any of this
> and, as Anton showed, this can be tied directly into the VM without any
> help from memcg as it sits today. So why implement a simple and clean
> mempressure cgroup that can be used alone or co-existing with either memcg
> or cpusets?
>
Forgot this one:
Because there is huge ongoing work by Tejun aiming at
reducing the effects of orthogonal hierarchies. There are many controllers
today that are "close enough" to each other (cpu, cpuacct; net_prio,
net_cls), and in practice, it brought more problems than it solved.
So yes, *maybe* mempressure is the answer, but it needs to be justified
with care. Long term, I think a saner notification API for memcg will
lead us to a better and brighter future.
There is also another aspect: this scheme works well for global
notifications. If we always wanted this to be global, it would
work neatly. But as already mentioned in this thread, at some point
we'll want this to work for a group of processes as well. At that point,
you'll have to count how much memory is being used, so you can determine
whether or not pressure is going on. You will, then, have to redo all
the work memcg already does.
On Wed, 7 Nov 2012 03:01:52 -0800 Anton Vorontsov <[email protected]> wrote:
> Upon these notifications, userland programs can cooperate with
> the kernel, achieving better system's memory management.
Well I read through the whole thread and afaict the above is the only
attempt to describe why this patchset exists!
How about we step away from implementation details for a while and
discuss observed problems, use-cases, requirements and such? What are
we actually trying to achieve here?
On Mon, Nov 19, 2012 at 09:52:11PM -0800, Andrew Morton wrote:
> On Wed, 7 Nov 2012 03:01:52 -0800 Anton Vorontsov <[email protected]> wrote:
> > Upon these notifications, userland programs can cooperate with
> > the kernel, achieving better system's memory management.
>
> Well I read through the whole thread and afaict the above is the only
> attempt to describe why this patchset exists!
Thanks for taking a look. :)
> How about we step away from implementation details for a while and
> discuss observed problems, use-cases, requirements and such? What are
> we actually trying to achieve here?
We try to make userland free resources when the system becomes low on
memory. Once we're short on memory, sometimes it's better to discard
(free) data, rather than let the kernel drain file caches or even start
swapping.
In the Android case, the data includes all idling applications' state, some of
which might be saved on the disk anyway -- so we don't need to swap apps,
we just kill them. Another Android use-case is to kill low-priority tasks
(e.g. currently unimportant services -- background/sync daemons, etc.).
There are other use cases: VPS/containers balancing, freeing a browser's old
page renders on desktops, etc. But I'll let folks speak for their use
cases, as I truly know about Android/embedded only.
But in general, it's the same stuff as the in-kernel shrinker, except that
we try to make it available for the userland: the userland knows better
about its memory, so we want to let it help with the memory management.
Thanks,
Anton.
On Mon, 19 Nov 2012, Glauber Costa wrote:
> >> In the case I outlined below, for backwards compatibility. What I
> >> actually mean is that memcg *currently* allows arbitrary notifications.
> >> One way to merge those, while moving to a saner 3-point notification, is
> >> to still allow the old writes and fit them in the closest bucket.
> >>
> >
> > Yeah, but I'm wondering why three is the right answer.
> >
>
> This is unrelated to what I am talking about.
> I am talking about pre-defined values with a specific event meaning (in
> his patchset, 3) vs arbitrary numbers valued in bytes.
>
Right, and I don't see how you can map the memcg thresholds onto Anton's
scheme that heavily relies upon reclaim activity; what bucket does a
threshold of 48MB in a memcg with a limit of 64MB fit into? Perhaps you
have some formula in mind that would do this, but I don't see how it works
correctly without factoring in configuration options (memory compaction),
type of allocation (GFP_ATOMIC won't trigger Anton's reclaim scheme like
GFP_KERNEL), altered min_free_kbytes, etc.
This begs the question of whether the new cgroup should be considered as a
replacement for memory thresholds within memcg in the first place;
certainly both can coexist just fine.
> > Same thing with a separate mempressure cgroup. The point is that there
> > will be users of this cgroup that do not want the overhead imposed by
> > memcg (which is why it's disabled in defconfig) and there's no direct
> > dependency that causes it to be a part of memcg.
> >
> I think we should shoot the duck where it is going, not where it is. A
> good interface is more important than overhead, since this overhead is
> by no means fundamental - memcg is fixable, and we would all benefit
> from it.
>
> Now, whether or not memcg is the right interface is a different
> discussion - let's have it!
>
I don't see memcg as being a prerequisite for any of this, I think Anton's
cgroup can coexist with memcg thresholds, it allows for notifications in
cpusets as well when they face memory pressure, and users need not enable
memcg for this functionality (and memcg is pretty darn large in its memory
footprint, I'd rather not see it fragmented either for something that can
standalone with increased functionality).
But let's try the question in reverse: are there any specific reasons why
this can't be implemented separately? I sure know the cpusets + no-memcg
configuration would benefit from it.
On Mon, 19 Nov 2012, Anton Vorontsov wrote:
> We try to make userland free resources when the system becomes low on
> memory. Once we're short on memory, sometimes it's better to discard
> (free) data, rather than let the kernel drain file caches or even start
> swapping.
>
To add another use case: it's possible to modify our version of malloc (or
any malloc) so that memory that is free()'d can be released back to the
kernel only when necessary, i.e. when keeping the extra memory around
starts to have a detrimental effect on the system, memcg, or cpuset. When
there is an abundance of memory available such that allocations need not
defragment or reclaim memory to be allocated, it can improve performance
to keep a memory arena from which to allocate immediately without
calling the kernel.
Our version of malloc frees memory back to the kernel with
madvise(MADV_DONTNEED) which ends up zapping the mapped ptes. With
pressure events, we only need to do this when faced with memory pressure;
to keep our rss low, we require that thp's max_ptes_none tunable be set to
0; we don't want our applications to use any additional memory. This
requires splitting a hugepage anytime memory is free()'d back to the
kernel.
I'd like to use this as a hook into malloc() for applications that do not
have strict memory footprint requirements to be able to increase
performance by keeping around a memory arena from which to allocate.
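A minimal, self-contained sketch of that idea is below. The hook names and
the trivial free-list bookkeeping are hypothetical, and the source of the
pressure notification is left abstract; only madvise(MADV_DONTNEED) itself is
the real mechanism described above.

/*
 * Minimal sketch of the malloc idea above: freed chunks are parked in a
 * free list and only returned to the kernel (madvise(MADV_DONTNEED))
 * once a pressure notification arrives.  The hook names and the trivial
 * bookkeeping are hypothetical; a real allocator would be far smarter.
 */
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

struct parked_chunk {
	void *addr;
	size_t len;
	struct parked_chunk *next;
};

static struct parked_chunk *parked;

/* free() path: keep the pages mapped in the hope of reusing them. */
static void arena_park(void *addr, size_t len)
{
	struct parked_chunk *c = malloc(sizeof(*c));

	if (!c)
		return;
	c->addr = addr;
	c->len = len;
	c->next = parked;
	parked = c;
}

/* Called when a pressure notification is read: give the pages back. */
static void arena_release(void)
{
	while (parked) {
		struct parked_chunk *next = parked->next;

		/* Zaps the ptes; the physical pages become reclaimable. */
		madvise(parked->addr, parked->len, MADV_DONTNEED);
		free(parked);
		parked = next;
	}
}

int main(void)
{
	void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	arena_park(p, 1 << 20);	/* the application "freed" this buffer */
	arena_release();	/* pretend a pressure event just arrived */
	return 0;
}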
On Mon, 19 Nov 2012, Glauber Costa wrote:
> >> Because cpusets only deal with memory placement, not memory usage.
> >
> > The set of nodes that a thread is allowed to allocate from may face memory
> > pressure up to and including oom while the rest of the system may have a
> > ton of free memory. Your solution is to compile and mount memcg if you
> > want notifications of memory pressure on those nodes. Others in this
> > thread have already said they don't want to rely on memcg for any of this
> > and, as Anton showed, this can be tied directly into the VM without any
> > help from memcg as it sits today. So why implement a simple and clean
> > mempressure cgroup that can be used alone or co-existing with either memcg
> > or cpusets?
> >
>
> Forgot this one:
>
> Because there is huge ongoing work by Tejun aiming at
> reducing the effects of orthogonal hierarchies. There are many controllers
> today that are "close enough" to each other (cpu, cpuacct; net_prio,
> net_cls), and in practice, it brought more problems than it solved.
>
I'm very happy that Tejun is working on that, but I don't see how it's
relevant here: I'm referring to users who are not using memcg
specifically. This is what others brought up earlier in the thread: they
do not want to be required to use memcg for this functionality.
There are users of cpusets today that do not enable nor comount memcg. I
argue that a mempressure cgroup allows them this functionality without the
memory footprint of memcg (not only in text, but requiring page_cgroup).
Additionally, there are probably users who do not want either cpusets or
memcg and want notifications from mempressure at a global level. Users
who care so much about the memory pressure of their systems probably have
strict footprint requirements, it would be a complete shame to require a
semi-tractor trailer when all I want is a compact car.
> So yes, *maybe* mempressure is the answer, but it needs to be justified
> with care. Long term, I think a saner notification API for memcg will
> lead us to a better and brighter future.
>
You can easily comount mempressure with your memcg, this is not anything
new.
> There is also another aspect: this scheme works well for global
> notifications. If we always wanted this to be global, it would
> work neatly. But as already mentioned in this thread, at some point
> we'll want this to work for a group of processes as well. At that point,
> you'll have to count how much memory is being used, so you can determine
> whether or not pressure is going on. You will, then, have to redo all
> the work memcg already does.
>
Anton can correct me if I'm wrong, but I certainly don't think this is
where mempressure is headed: I don't think any accounting needs to be done
and, if it is, it's a design issue that should be addressed now rather
than later. I believe notifications should occur on current's mempressure
cgroup depending on its level of reclaim: nobody cares if your memcg has a
limit of 64GB when you only have 32GB of RAM, we'll want the notification.
On 11/20/2012 10:23 PM, David Rientjes wrote:
> Anton can correct me if I'm wrong, but I certainly don't think this is
> where mempressure is headed: I don't think any accounting needs to be done
> and, if it is, it's a design issue that should be addressed now rather
> than later. I believe notifications should occur on current's mempressure
> cgroup depending on its level of reclaim: nobody cares if your memcg has a
> limit of 64GB when you only have 32GB of RAM, we'll want the notification.
My main concern is that to trigger those notifications, one would have
to first determine whether or not the particular group of tasks is under
pressure. And to do that, we need to somehow know how much memory we are
using, and how much we are reclaiming, etc. On a system-wide level, we
have this information. On a group level, this is already accounted by memcg.
In fact, the current code already seems to rely on memcg:
+ vmpressure(sc->target_mem_cgroup,
+ sc->nr_scanned - nr_scanned, nr_reclaimed);
Now, let's start simple: Assume we will have a different cgroup.
We want per-group pressure notifications for that group. How would you
determine that the specific group is under pressure?
On Wed, Nov 21, 2012 at 12:27:28PM +0400, Glauber Costa wrote:
> On 11/20/2012 10:23 PM, David Rientjes wrote:
> > Anton can correct me if I'm wrong, but I certainly don't think this is
> > where mempressure is headed: I don't think any accounting needs to be done
Yup, I'd rather not do any accounting, at least not in bytes.
> > and, if it is, it's a design issue that should be addressed now rather
> > than later. I believe notifications should occur on current's mempressure
> > cgroup depending on its level of reclaim: nobody cares if your memcg has a
> > limit of 64GB when you only have 32GB of RAM, we'll want the notification.
>
> My main concern is that to trigger those notifications, one would have
> to first determine whether or not the particular group of tasks is under
> pressure.
As far as I understand, the notifications will be triggered by a process
that tries to allocate memory. So, effectively that would be a per-process
pressure.
So, if one process in a group is suffering, we notify that "a process in a
group is under pressure", and the notification goes to a cgroup listener
> And to do that, we need to somehow know how much memory we are
> using, and how much we are reclaiming, etc. On a system-wide level, we
> have this information. On a group level, this is already accounted by memcg.
>
> In fact, the current code already seems to rely on memcg:
>
> + vmpressure(sc->target_mem_cgroup,
> + sc->nr_scanned - nr_scanned, nr_reclaimed);
Well, I'm still unsure about the details, but I guess in the "mempressure"
cgroup approach this will be derived from current, i.e. the task.
But note that we won't report pressure to a memcg cgroup; we will notify
only the mempressure cgroup. But a process can be in both of them
simultaneously. In the code, mempressure and memcg will not depend on
each other.
> Now, let's start simple: Assume we will have a different cgroup.
> We want per-group pressure notifications for that group. How would you
> determine that the specific group is under pressure?
If a process that tries to allocate memory & causes reclaim is a part of
the cgroup, then the cgroup is under pressure.
At least that's my very brief understanding of the idea, details to be
investigated... But I welcome David to comment whether I got everything
correctly. :)
Thanks,
Anton.
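To make that a bit more concrete, here is a hypothetical, deliberately
simplified sketch of the accounting Anton describes, written as plain
userspace C so it stands alone; none of the names or numbers come from the
actual patch.

/*
 * Hypothetical sketch: reclaim activity caused by a task is charged to
 * that task's mempressure group, and an event fires once the scan
 * window fills up.  All names here are made up for illustration.
 */
#include <stdio.h>

struct mempressure_group {
	unsigned long window;		/* pages to scan per report */
	unsigned long scanned;
	unsigned long reclaimed;
};

/* In the kernel this would be derived from current's cgroup. */
static struct mempressure_group *group_of_current_task(void)
{
	static struct mempressure_group grp = { .window = 512 };
	return &grp;
}

static void fire_event(struct mempressure_group *grp)
{
	/* 0 = every scanned page reclaimed, 100 = nothing reclaimable. */
	unsigned long index =
		100 * (grp->scanned - grp->reclaimed) / grp->scanned;

	printf("pressure index for group: %lu\n", index);
}

/* Hook called from the reclaim path on behalf of the allocating task. */
static void mempressure_report(unsigned long scanned, unsigned long reclaimed)
{
	struct mempressure_group *grp = group_of_current_task();

	grp->scanned += scanned;
	grp->reclaimed += reclaimed;

	if (grp->scanned >= grp->window) {
		fire_event(grp);
		grp->scanned = grp->reclaimed = 0;
	}
}

int main(void)
{
	mempressure_report(256, 200);
	mempressure_report(256, 10);	/* window full -> event fires */
	return 0;
}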
On 11/21/2012 12:46 PM, Anton Vorontsov wrote:
> On Wed, Nov 21, 2012 at 12:27:28PM +0400, Glauber Costa wrote:
>> On 11/20/2012 10:23 PM, David Rientjes wrote:
>>> Anton can correct me if I'm wrong, but I certainly don't think this is
>>> where mempressure is headed: I don't think any accounting needs to be done
>
> Yup, I'd rather not do any accounting, at least not in bytes.
It doesn't matter here, but memcg doesn't do any accounting in bytes
either. It only displays it in bytes, but internally, it's all pages. The
bytes representation is convenient, because then you can be agnostic of
page sizes.
>
>>> and, if it is, it's a design issue that should be addressed now rather
>>> than later. I believe notifications should occur on current's mempressure
>>> cgroup depending on its level of reclaim: nobody cares if your memcg has a
>>> limit of 64GB when you only have 32GB of RAM, we'll want the notification.
>>
>> My main concern is that to trigger those notifications, one would have
>> to first determine whether or not the particular group of tasks is under
>> pressure.
>
> As far as I understand, the notifications will be triggered by a process
> that tries to allocate memory. So, effectively that would be a per-process
> pressure.
>
> So, if one process in a group is suffering, we notify that "a process in a
> group is under pressure", and the notification goes to a cgroup listener.
If you effectively have a per-process mechanism, why do you need an
extra cgroup at all?
It seems to me that this is simply something that should be inherited
over fork, and then you register the notifier in your first process, and
it will be valid for everybody in the process tree.
If you need tasks in different processes to respond to the same
notifier, then you just register the same notifier in two different
processes.
On Tue, Nov 20, 2012 at 10:02:45AM -0800, David Rientjes wrote:
> On Mon, 19 Nov 2012, Glauber Costa wrote:
>
> > >> In the case I outlined below, for backwards compatibility. What I
> > >> actually mean is that memcg *currently* allows arbitrary notifications.
> > >> One way to merge those, while moving to a saner 3-point notification, is
> > >> to still allow the old writes and fit them in the closest bucket.
> > >>
> > >
> > > Yeah, but I'm wondering why three is the right answer.
> > >
> >
> > This is unrelated to what I am talking about.
> > I am talking about pre-defined values with a specific event meaning (in
> > his patchset, 3) vs arbitrary numbers valued in bytes.
> >
>
> Right, and I don't see how you can map the memcg thresholds onto Anton's
> scheme
BTW, there's an interface for OOM notification in memcg. See oom_control.
I guess other pressure levels can also fit into the interface.
--
Kirill A. Shutemov
Hi,
>
> Memory notifications are quite irrelevant to partitioning and cgroups. The use case is about user space handling low memory, meaning the functionality should be accurate to a specific granularity (e.g. 1 MB) and time (0.25s is OK), but it is better to keep it simple and battery-friendly. I prefer a pseudo-device-based text API because it is easy to debug and investigate. It would be nice if it were possible to use simple scripting to specify what kind of memory needs to be tracked at which levels, but private/shared dirty is #1 and memcg cannot handle it.
>
If that is the case, then fine.
The reason I jumped in talking about memcg, is that it was mentioned
that at some point we'd like to have those notifications on a per-group
basis.
So I'll say it again: if this is always global, there is no reason any
cgroup needs to be involved. If this turns out to be per-process, as
Anton suggested in a recent e-mail, I don't see any reason to have
cgroups involved as well.
But if this needs to be extended to be per-cgroup, then past experience
shows that we need to be really careful not to start duplicating
infrastructure, and creating inter-dependencies like it happened to
other groups in the past.
-----Original Message-----
From: ext Glauber Costa [mailto:[email protected]]
Sent: 21 November, 2012 13:55
....
So I'll say it again: if this is always global, there is no reason any
cgroup needs to be involved. If this turns out to be per-process, as
Anton suggested in a recent e-mail, I don't see any reason to have
cgroups involved as well.
-----
Per-process memory tracking does not make much sense: a process should consume all available memory but work fast. This approach also requires knowledge about process dependencies, e.g. on dbus or Xorg. If you need to know how much memory a process consumed at a particular moment, you can use /proc/self/smaps, which is easier.
Best Wishes,
Leonid
-----Original Message-----
From: ext Kirill A. Shutemov [mailto:[email protected]]
Sent: 21 November, 2012 11:31
...
BTW, there's an interface for OOM notification in memcg. See oom_control.
I guess other pressure levels can also fit into the interface.
---
Hi,
I have been following this conversation only a little, but as a person somewhat related to this round of development, and the requester of the memcg notification mechanism in the past (Kirill implemented it), I have to point out that there are reasons not to use memcg. The situation in the latest kernels could be different, but in practice the following troubles were observed with memcg in the past:
1. by default memcg is turned off on Android (at least on 4.1 I see it)
2. you need to produce a memory partitioning, and that may be a non-trivial task in the general case when apps/use cases are not so limited
3. memcg takes into account cached memory. Yes, you can play with MADV_DONTNEED as was mentioned, but in the general case that is insane
4. memcg needs to be extended if you need to track some other kinds of memory
5. if the situation in some partition changes fast (e.g. a process is moved to another partition), it may cause page thrashing and device lockup. The in-kernel lock was fixed in May 2012, but even page thrashing knocks the device out for a number of seconds (even minutes).
Thus, I would prefer to avoid memcg even though it is a powerful feature.
Memory notifications are quite irrelevant to partitioning and cgroups. The use case is about user space handling low memory, meaning the functionality should be accurate to a specific granularity (e.g. 1 MB) and time (0.25s is OK), but it is better to keep it simple and battery-friendly. I prefer a pseudo-device-based text API because it is easy to debug and investigate. It would be nice if it were possible to use simple scripting to specify what kind of memory needs to be tracked at which levels, but private/shared dirty is #1 and memcg cannot handle it.
There are two use cases related to this notification feature:
1. Direct usage -> react to a coming low-memory situation and do something ahead of time. E.g. the system is calibrated to an 80% dirty-memory border, and if we cross it we can compensate for device slowness by flushing application caches, closing background images, or even notifying the user, but without killing apps with any OOM killer and corrupting unsaved data.
2. Permission to do some heavy actions. If the memory level is low enough for some application use case (e.g. 50 MB available), the application can start the heavy use case; otherwise it does something to prevent potential problems.
So it seems to me that the levels depend on application memory usage, e.g. a calculator does not need memory information but a browser and an image gallery do. Thus, tracking daemons in user space look like overhead, and the construction we used on the n900 (ke-recv -> dbus -> apps) is quite fragile and slow.
These bits [1] were initially developed for the n9 to replace the memcg notifications, with great support from the kernel community, about a year ago. Unfortunately for the n9 I was a bit late and the code was integrated into another product's kernel (say, M), but last summer project M was forced to die due to the product line moving to W. In practice the ARM device produced ON/OFF signals which fit well within the space/time requirements, so I have what I need. It is quite primitive code, but I prefer not to over-engineer complexity without necessity.
Best Wishes,
Leonid
PS: but it seems the code related to vmpressure_fd solves some other problem, so you can ignore my speech.
[1] http://maemo.gitorious.org/maemo-tools/libmemnotify/blobs/master/src/kernel/memnotify.c
On Tue, Nov 20, 2012 at 10:12:28AM -0800, David Rientjes wrote:
> On Mon, 19 Nov 2012, Anton Vorontsov wrote:
>
> > We try to make userland free resources when the system becomes low on
> > memory. Once we're short on memory, sometimes it's better to discard
> > (free) data, rather than let the kernel drain file caches or even start
> > swapping.
> >
>
> To add another use case: it's possible to modify our version of malloc (or
> any malloc) so that memory that is free()'d can be released back to the
> kernel only when necessary, i.e. when keeping the extra memory around
> starts to have a detrimental effect on the system, memcg, or cpuset. When
> there is an abundance of memory available such that allocations need not
> defragment or reclaim memory to be allocated, it can improve performance
> to keep a memory arena from which to allocate immediately without
> calling the kernel.
>
A potential third use case is a variation of the first for batch systems. If
it's running low priority tasks and a high priority task starts that
results in memory pressure then the job scheduler may decide to move the
low priority jobs elsewhere (or cancel them entirely).
A similar use case is monitoring systems running high priority workloads
that should never swap. It can be easily detected if the system starts
swapping but a pressure notification might act as an early warning system
that something is happening on the system that might cause the primary
workload to start swapping.
--
Mel Gorman
SUSE Labs
On Wed, 21 Nov 2012 15:01:50 +0000
Mel Gorman <[email protected]> wrote:
> On Tue, Nov 20, 2012 at 10:12:28AM -0800, David Rientjes wrote:
> > On Mon, 19 Nov 2012, Anton Vorontsov wrote:
> >
> > > We try to make userland free resources when the system becomes low on
> > > memory. Once we're short on memory, sometimes it's better to discard
> > > (free) data, rather than let the kernel drain file caches or even start
> > > swapping.
> > >
> >
> > To add another use case: it's possible to modify our version of malloc (or
> > any malloc) so that memory that is free()'d can be released back to the
> > kernel only when necessary, i.e. when keeping the extra memory around
> > starts to have a detrimental effect on the system, memcg, or cpuset. When
> > there is an abundance of memory available such that allocations need not
> > defragment or reclaim memory to be allocated, it can improve performance
> > to keep a memory arena from which to allocate immediately without
> > calling the kernel.
> >
>
> A potential third use case is a variation of the first for batch systems. If
> it's running low priority tasks and a high priority task starts that
> results in memory pressure then the job scheduler may decide to move the
> low priority jobs elsewhere (or cancel them entirely).
>
> A similar use case is monitoring systems running high priority workloads
> that should never swap. It can be easily detected if the system starts
> swapping but a pressure notification might act as an early warning system
> that something is happening on the system that might cause the primary
> workload to start swapping.
I hope Anton's writing all of this down ;)
The proposed API bugs me a bit. It seems simplistic. I need to have a
quality think about this. Maybe the result of that think will be to
suggest an interface which can be extended in a back-compatible fashion
later on, if/when the simplistic nature becomes a problem.
On Wed, 21 Nov 2012, Andrew Morton wrote:
> The proposed API bugs me a bit. It seems simplistic. I need to have a
> quality think about this. Maybe the result of that think will be to
> suggest an interface which can be extended in a back-compatible fashion
> later on, if/when the simplistic nature becomes a problem.
That's exactly why I made a generic vmevent_fd() syscall, not a 'vm
pressure' specific ABI.
Pekka
[Sorry to jump in that late]
On Tue 20-11-12 10:02:45, David Rientjes wrote:
> On Mon, 19 Nov 2012, Glauber Costa wrote:
>
> > >> In the case I outlined below, for backwards compatibility. What I
> > >> actually mean is that memcg *currently* allows arbitrary notifications.
> > >> One way to merge those, while moving to a saner 3-point notification, is
> > >> to still allow the old writes and fit them in the closest bucket.
> > >>
> > >
> > > Yeah, but I'm wondering why three is the right answer.
> > >
> >
> > This is unrelated to what I am talking about.
> > I am talking about pre-defined values with a specific event meaning (in
> > his patchset, 3) vs arbitrary numbers valued in bytes.
> >
>
> Right, and I don't see how you can map the memcg thresholds onto Anton's
> scheme that heavily relies upon reclaim activity; what bucket does a
> threshold of 48MB in a memcg with a limit of 64MB fit into?
> Perhaps you have some formula in mind that would do this, but I don't
> see how it works correctly without factoring in configuration options
> (memory compaction), type of allocation (GFP_ATOMIC won't trigger
> Anton's reclaim scheme like GFP_KERNEL), altered min_free_kbytes, etc.
>
> This begs the question of whether the new cgroup should be considered as a
> replacement for memory thresholds within memcg in the first place;
> certainly both can coexist just fine.
Absolutely agreed. Yes those two things are inherently different.
Information that "you have passed half of your limit" is something
totally different from "you should slow down". Although I am not
entirely sure what the first one is good for (to be honest), I
believe there are users out there.
I do not think that mixing those two makes much sense. They have
different use cases and, as long as we have users for the thresholds one, we
should keep it.
[...]
Thanks
--
Michal Hocko
SUSE Labs