Hi all,
So this is the second RFC. The main change is that I decided to go with
discrete pressure levels.
When I started writing the man page, I had to describe the 'reclaimer
inefficiency index', and while doing this I realized that I was describing
how the kernel does its memory management, which is exactly what we try to
avoid exposing in vmevent. And applications don't really care about these
details: reclaimers, their inefficiency indexes, scanning window sizes,
priority levels, etc. -- it's all "not interesting", purely kernel-internal
stuff. So I guess Mel Gorman was right: we need some sort of levels.
What applications (well, activity managers) are really interested in is
this:
1. Do we sacrifice resources for new memory allocations (e.g. file
caches)?
2. Does the cost of new memory allocations become too high, so that the
system hurts because of this?
3. Are we about to OOM soon?
And here are the answers:
1. VMEVENT_PRESSURE_LOW
2. VMEVENT_PRESSURE_MED
3. VMEVENT_PRESSURE_OOM
There is no "high" pressure, since I really don't see any definition of
it, but it's possible to introduce new levels without breaking ABI. The
levels described in more details in the patches, and the stuff is still
tunable, but now via sysctls, not the vmevent_fd() call itself (i.e. we
don't need to rebuild applications to adjust window size or other mm
"details").
What I couldn't fix in this RFC is making the vmevent_{scanned,reclaimed}
accounting per-CPU (there's a comment describing the problem with this).
But I made it lockless and tried to make it very lightweight (plus I moved
the vmevent_pressure() call to a "colder" path).
Thanks,
Anton.
This patch introduces VMEVENT_ATTR_PRESSURE, an attribute that reports
Linux virtual memory management pressure. There are three discrete levels:
VMEVENT_PRESSURE_LOW: Notifies that the system is reclaiming memory for
new allocations. Monitoring reclaiming activity might be useful for
maintaining the system's overall cache level.
VMEVENT_PRESSURE_MED: The system is experiencing medium memory pressure;
there is some mild swapping activity. Upon this event applications may
decide to free any resources that can be easily reconstructed or re-read
from disk.
VMEVENT_PRESSURE_OOM: The system is actively thrashing, it is about to run
out of memory (OOM), or the in-kernel OOM killer is about to trigger.
Applications should do whatever they can to help the system.
There are three sysctls to tune the behaviour of the levels:
vmevent_window
vmevent_level_med
vmevent_level_oom
Currently the vmevent pressure levels are based on the reclaimer
inefficiency index (which ranges from 0 to 100). The index shows the
relative time spent by the kernel uselessly scanning pages, or, in other
words, the percentage of pages scanned (per vmevent_window) that were not
reclaimed. The higher the index, the more evident it is that the cost of
new allocations is growing.
The files vmevent_level_med and vmevent_level_oom accept the index values
(by default set to 60 and 99 respectively). A non-existent
vmevent_level_low tunable is always 0.
An index of 0 means that the kernel is reclaiming, but every scanned page
has been successfully reclaimed (so the pressure is low). 100 means that
the kernel is trying to reclaim, but nothing can be reclaimed (close to
OOM).
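For instance (taking the simplest case, where exactly one window's worth
of pages was scanned): with the default 256-page window, if the kernel
scanned 256 pages and managed to reclaim 192 of them, the index is
100 * (256 - 192) / 256 = 25; if only 64 pages could be reclaimed, the
index is 75, which is above the default vmevent_level_med (60) but still
below vmevent_level_oom (99).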
The window size is used as a rate-limit tunable for VMEVENT_PRESSURE_LOW
notifications and as an averaging interval for the VMEVENT_PRESSURE_{MED,OOM}
levels. So, using a small window size can cause a lot of false positives
for the _MED and _OOM levels, but too large a window may delay
notifications.
By default the window size is 256 pages (1MB).
Signed-off-by: Anton Vorontsov <[email protected]>
---
Documentation/sysctl/vm.txt | 37 +++++++++++++++
include/linux/vmevent.h | 42 +++++++++++++++++
kernel/sysctl.c | 24 ++++++++++
mm/vmevent.c | 107 ++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 19 ++++++++
5 files changed, 229 insertions(+)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 078701f..ff0023b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -44,6 +44,9 @@ Currently, these files are in /proc/sys/vm:
- nr_overcommit_hugepages
- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order
+- vmevent_window
+- vmevent_level_med
+- vmevent_level_oom
- oom_dump_tasks
- oom_kill_allocating_task
- overcommit_memory
@@ -487,6 +490,40 @@ this is causing problems for your system/application.
==============================================================
+vmevent_window
+vmevent_level_med
+vmevent_level_oom
+
+These sysctls are used to tune vmevent_fd(2) behaviour.
+
+Currently the vmevent pressure levels are based on the reclaimer
+inefficiency index (which ranges from 0 to 100). The files vmevent_level_med
+and vmevent_level_oom accept the index values (by default set to 60 and 99
+respectively). A non-existent vmevent_level_low tunable is always 0.
+
+When the system is short on idle pages, new memory is allocated by
+reclaiming the least recently used resources: the kernel scans pages to
+be reclaimed (e.g. from file caches, mmap(2) volatile ranges, etc.,
+potentially swapping some pages out), and the index shows the relative
+time spent by the kernel uselessly scanning pages, or, in other words,
+the percentage of pages scanned (per vmevent_window) that were not
+reclaimed. The higher the index, the more evident it is that the cost
+of new allocations is growing.
+
+An index of 0 means that the kernel is reclaiming, but every scanned
+page has been successfully reclaimed (so the pressure is low). 100 means
+that the kernel is trying to reclaim, but nothing can be reclaimed
+(close to OOM).
+
+The window size is used as a rate-limit tunable for VMEVENT_PRESSURE_LOW
+notifications and as an averaging interval for VMEVENT_PRESSURE_{MED,OOM}
+levels. So, using a small window size can cause a lot of false positives
+for _MED and _OOM, but too large a window may delay notifications.
+
+By default the window size is 256 pages (1MB).
+
+==============================================================
+
oom_dump_tasks
Enables a system-wide task dump (excluding kernel threads) to be
diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
index b1c4016..a0e6641 100644
--- a/include/linux/vmevent.h
+++ b/include/linux/vmevent.h
@@ -10,10 +10,18 @@ enum {
VMEVENT_ATTR_NR_AVAIL_PAGES = 1UL,
VMEVENT_ATTR_NR_FREE_PAGES = 2UL,
VMEVENT_ATTR_NR_SWAP_PAGES = 3UL,
+ VMEVENT_ATTR_PRESSURE = 4UL,
VMEVENT_ATTR_MAX /* non-ABI */
};
+/* We spread the values, reserving room for new levels, if ever needed. */
+enum {
+ VMEVENT_PRESSURE_LOW = 1 << 10,
+ VMEVENT_PRESSURE_MED = 1 << 11,
+ VMEVENT_PRESSURE_OOM = 1 << 12,
+};
+
/*
* Attribute state bits for threshold
*/
@@ -97,4 +105,38 @@ struct vmevent_event {
struct vmevent_attr attrs[];
};
+#ifdef __KERNEL__
+
+struct mem_cgroup;
+
+extern void __vmevent_pressure(struct mem_cgroup *memcg,
+ ulong scanned,
+ ulong reclaimed);
+
+static inline void vmevent_pressure(struct mem_cgroup *memcg,
+ ulong scanned,
+ ulong reclaimed)
+{
+ if (!scanned)
+ return;
+
+ if (IS_BUILTIN(CONFIG_MEMCG) && memcg) {
+ /*
+ * The vmevent API reports system pressure, for per-cgroup
+ * pressure, we'll chain cgroups notifications, this is to
+ * be implemented.
+ *
+ * memcg_vm_pressure(target_mem_cgroup, scanned, reclaimed);
+ */
+ return;
+ }
+ __vmevent_pressure(memcg, scanned, reclaimed);
+}
+
+extern uint vmevent_window;
+extern uint vmevent_level_med;
+extern uint vmevent_level_oom;
+
+#endif
+
#endif /* _LINUX_VMEVENT_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87174ef..e00d3fb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -50,6 +50,7 @@
#include <linux/dnotify.h>
#include <linux/syscalls.h>
#include <linux/vmstat.h>
+#include <linux/vmevent.h>
#include <linux/nfs_fs.h>
#include <linux/acpi.h>
#include <linux/reboot.h>
@@ -1317,6 +1318,29 @@ static struct ctl_table vm_table[] = {
.proc_handler = numa_zonelist_order_handler,
},
#endif
+#ifdef CONFIG_VMEVENT
+ {
+ .procname = "vmevent_window",
+ .data = &vmevent_window,
+ .maxlen = sizeof(vmevent_window),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "vmevent_level_med",
+ .data = &vmevent_level_med,
+ .maxlen = sizeof(vmevent_level_med),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "vmevent_level_oom",
+ .data = &vmevent_level_oom,
+ .maxlen = sizeof(vmevent_level_oom),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif
#if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \
(defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
{
diff --git a/mm/vmevent.c b/mm/vmevent.c
index 8195897..11ce5ef 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -4,6 +4,7 @@
#include <linux/vmevent.h>
#include <linux/syscalls.h>
#include <linux/workqueue.h>
+#include <linux/mutex.h>
#include <linux/file.h>
#include <linux/list.h>
#include <linux/poll.h>
@@ -28,8 +29,22 @@ struct vmevent_watch {
/* poll */
wait_queue_head_t waitq;
+
+ /* Our node in the pressure watchers list. */
+ struct list_head pwatcher;
};
+static atomic64_t vmevent_pressure_sr;
+static uint vmevent_pressure_val;
+
+static LIST_HEAD(vmevent_pwatchers);
+static DEFINE_MUTEX(vmevent_pwatchers_lock);
+
+/* Our sysctl tunables, see Documentation/sysctl/vm.txt */
+uint __read_mostly vmevent_window = SWAP_CLUSTER_MAX * 16;
+uint vmevent_level_med = 60;
+uint vmevent_level_oom = 99;
+
typedef u64 (*vmevent_attr_sample_fn)(struct vmevent_watch *watch,
struct vmevent_attr *attr);
@@ -97,10 +112,21 @@ static u64 vmevent_attr_avail_pages(struct vmevent_watch *watch,
return totalram_pages;
}
+static u64 vmevent_attr_pressure(struct vmevent_watch *watch,
+ struct vmevent_attr *attr)
+{
+ if (vmevent_pressure_val >= vmevent_level_oom)
+ return VMEVENT_PRESSURE_OOM;
+ else if (vmevent_pressure_val >= vmevent_level_med)
+ return VMEVENT_PRESSURE_MED;
+ return VMEVENT_PRESSURE_LOW;
+}
+
static vmevent_attr_sample_fn attr_samplers[] = {
[VMEVENT_ATTR_NR_AVAIL_PAGES] = vmevent_attr_avail_pages,
[VMEVENT_ATTR_NR_FREE_PAGES] = vmevent_attr_free_pages,
[VMEVENT_ATTR_NR_SWAP_PAGES] = vmevent_attr_swap_pages,
+ [VMEVENT_ATTR_PRESSURE] = vmevent_attr_pressure,
};
static u64 vmevent_sample_attr(struct vmevent_watch *watch, struct vmevent_attr *attr)
@@ -239,6 +265,73 @@ static void vmevent_start_timer(struct vmevent_watch *watch)
vmevent_schedule_watch(watch);
}
+static uint vmevent_calc_pressure(uint win, uint s, uint r)
+{
+ ulong p;
+
+ /*
+ * We calculate the ratio (in percents) of how many pages were
+ * scanned vs. reclaimed in a given time frame (window). Note that
+ * time is in VM reclaimer's "ticks", i.e. number of pages
+ * scanned. This makes it possible to set a desired reaction time and
+ * serves as a ratelimit.
+ */
+ p = win - (r * win / s);
+ p = p * 100 / win;
+
+ pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
+
+ return p;
+}
+
+#define VMEVENT_SCANNED_SHIFT (sizeof(u64) * 8 / 2)
+
+static void vmevent_pressure_wk_fn(struct work_struct *wk)
+{
+ struct vmevent_watch *watch;
+ u64 sr = atomic64_xchg(&vmevent_pressure_sr, 0);
+ u32 s = sr >> VMEVENT_SCANNED_SHIFT;
+ u32 r = sr & (((u64)1 << VMEVENT_SCANNED_SHIFT) - 1);
+
+ vmevent_pressure_val = vmevent_calc_pressure(vmevent_window, s, r);
+
+ mutex_lock(&vmevent_pwatchers_lock);
+ list_for_each_entry(watch, &vmevent_pwatchers, pwatcher)
+ vmevent_sample(watch);
+ mutex_unlock(&vmevent_pwatchers_lock);
+}
+static DECLARE_WORK(vmevent_pressure_wk, vmevent_pressure_wk_fn);
+
+void __vmevent_pressure(struct mem_cgroup *memcg,
+ ulong scanned,
+ ulong reclaimed)
+{
+ /*
+ * Store s/r combined, so we don't have to worry about synchronizing
+ * them. On modern machines it will be truly atomic; on arches w/o
+ * 64 bit atomics it will turn into a spinlock (for a small number
+ * of CPUs it's not a problem).
+ *
+ * Using int-sized atomics is a bad idea as it would only allow us to
+ * count (1 << 16) - 1 pages (256MB), which we can scan pretty
+ * fast.
+ *
+ * We can't have per-CPU counters as this would not catch the case
+ * where many CPUs scan small amounts (so none of them hits the
+ * window size limit, and thus we won't send a notification in
+ * time).
+ *
+ * So we shouldn't place vmevent_pressure() into a very hot path.
+ */
+ atomic64_add(scanned << VMEVENT_SCANNED_SHIFT | reclaimed,
+ &vmevent_pressure_sr);
+
+ scanned = atomic64_read(&vmevent_pressure_sr) >> VMEVENT_SCANNED_SHIFT;
+ if (scanned >= vmevent_window &&
+ !work_pending(&vmevent_pressure_wk))
+ schedule_work(&vmevent_pressure_wk);
+}
+
static unsigned int vmevent_poll(struct file *file, poll_table *wait)
{
struct vmevent_watch *watch = file->private_data;
@@ -300,6 +393,11 @@ static int vmevent_release(struct inode *inode, struct file *file)
cancel_delayed_work_sync(&watch->work);
+ if (watch->pwatcher.next) {
+ mutex_lock(&vmevent_pwatchers_lock);
+ list_del(&watch->pwatcher);
+ mutex_unlock(&vmevent_pwatchers_lock);
+ }
kfree(watch);
return 0;
@@ -328,6 +426,7 @@ static int vmevent_setup_watch(struct vmevent_watch *watch)
{
struct vmevent_config *config = &watch->config;
struct vmevent_attr *attrs = NULL;
+ bool pwatcher = 0;
unsigned long nr;
int i;
@@ -340,6 +439,8 @@ static int vmevent_setup_watch(struct vmevent_watch *watch)
if (attr->type >= VMEVENT_ATTR_MAX)
continue;
+ else if (attr->type == VMEVENT_ATTR_PRESSURE)
+ pwatcher = 1;
size = sizeof(struct vmevent_attr) * (nr + 1);
@@ -363,6 +464,12 @@ static int vmevent_setup_watch(struct vmevent_watch *watch)
watch->sample_attrs = attrs;
watch->nr_attrs = nr;
+ if (pwatcher) {
+ mutex_lock(&vmevent_pwatchers_lock);
+ list_add(&watch->pwatcher, &vmevent_pwatchers);
+ mutex_unlock(&vmevent_pwatchers_lock);
+ }
+
return 0;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99b434b..cd3bd19 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -20,6 +20,7 @@
#include <linux/init.h>
#include <linux/highmem.h>
#include <linux/vmstat.h>
+#include <linux/vmevent.h>
#include <linux/file.h>
#include <linux/writeback.h>
#include <linux/blkdev.h>
@@ -1846,6 +1847,9 @@ restart:
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
+ vmevent_pressure(sc->target_mem_cgroup,
+ sc->nr_scanned - nr_scanned, nr_reclaimed);
+
/* reclaim/compaction might need reclaim to continue */
if (should_continue_reclaim(lruvec, nr_reclaimed,
sc->nr_scanned - nr_scanned, sc))
@@ -2068,6 +2072,21 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
count_vm_event(ALLOCSTALL);
do {
+ /*
+ * OK, we're cheating. The thing is, we have to average
+ * s/r ratio by gathering a lot of scans (otherwise we
+ * might get local false-positive indexes of '100').
+ *
+ * But... when we're almost OOM we might be getting the
+ * last reclaimable pages slowly, scanning all the queues,
+ * and so we never catch the OOM case via averaging.
+ * Although the priority will show it for sure. 3 is an
+ * empirically chosen priority: we never observe it under
+ * any load, except for the last few allocations before OOM.
+ */
+ if (sc->priority <= 3)
+ vmevent_pressure(sc->target_mem_cgroup,
+ vmevent_window, 0);
sc->nr_scanned = 0;
aborted_reclaim = shrink_zones(zonelist, sc);
--
1.7.12.3
VMEVENT_FD(2) Linux Programmer's Manual VMEVENT_FD(2)
NAME
vmevent_fd - Linux virtual memory management events
SYNOPSIS
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/unistd.h>
#include <linux/types.h>
#include <linux/vmevent.h>
syscall(__NR_vmevent_fd, config);
DESCRIPTION
This system call creates a new file descriptor that can be used
with polling routines (e.g. poll(2)) to get notified about vari-
ous in-kernel virtual memory management events that might be of
interest to userspace. The interface can also be used to effi-
ciently monitor memory usage (e.g. the number of idle and swap
pages).
Applications can make the overall system's memory management
more nimble by adjusting their resource usage upon the notifi-
cations.
Attributes
Attributes are the basic concept; they are described by the fol-
lowing structure:
struct vmevent_attr {
__u64 value;
__u32 type;
__u32 state;
};
type may correspond to these values:
VMEVENT_ATTR_NR_AVAIL_PAGES
The attribute reports the total number of available pages
in the system, not including swap space (i.e. just total
RAM). value is used to set up a threshold (in number of
pages) upon which the event will be delivered by the ker-
nel.
Upon notification the kernel updates all configured
attributes, so this attribute is mostly used without any
threshold, just for getting the value together with other
attributes without reading and parsing /proc/vmstat.
VMEVENT_ATTR_NR_FREE_PAGES
The attribute reports the total number of unused (idle)
pages in the system.
value is used to set up a threshold (in number of pages)
upon which the event will be delivered by the kernel.
VMEVENT_ATTR_NR_SWAP_PAGES
The attribute reports the total number of swapped pages.
value is used to set up a threshold (in number of pages)
upon which the event will be delivered by the kernel.
VMEVENT_ATTR_PRESSURE
The attribute reports Linux virtual memory management
pressure. There are three discrete levels:
VMEVENT_PRESSURE_LOW: By setting the threshold to this
value it's possible to watch whether the system is re-
claiming memory for new allocations. Monitoring reclaim-
ing activity might be useful for maintaining the system's
overall cache level.
VMEVENT_PRESSURE_MED: The system is experiencing medium
memory pressure; there is some mild swapping activity.
Upon this event applications may decide to free any
resources that can be easily reconstructed or re-read from
disk.
VMEVENT_PRESSURE_OOM: The system is actively thrashing, it
is about to run out of memory (OOM), or the in-kernel OOM
killer is about to trigger. Applications should do what-
ever they can to help the system. See proc(5) for more
information about the OOM killer and its configuration
options.
value is used to set up a threshold upon which the event
will be delivered by the kernel (for algebraic compar-
isons, it is defined that VMEVENT_PRESSURE_LOW <
VMEVENT_PRESSURE_MED < VMEVENT_PRESSURE_OOM, but applica-
tions should not put any meaning into the absolute val-
ues.)
state is used to set up the threshold behaviour; the fol-
lowing flags can be bitwise OR'ed:
VMEVENT_ATTR_STATE_VALUE_LT
Notification will be delivered when an attribute is less
than a user-specified value.
VMEVENT_ATTR_STATE_VALUE_GT
Notifications will be delivered when an attribute is
greater than a user-specified value.
VMEVENT_ATTR_STATE_VALUE_EQ
Notifications will be delivered when an attribute is equal
to a user-specified value.
VMEVENT_ATTR_STATE_EDGE_TRIGGER
Events will only be delivered when an attribute crosses
the value threshold.
Events
Upon a notification, the application must read out events using
the read(2) system call. The events are delivered using the fol-
lowing structure:
struct vmevent_event {
__u32 counter;
__u32 padding;
struct vmevent_attr attrs[];
};
The counter specifies the number of reported attributes, and the
attrs array contains a copy of the configured attributes, with
each vmevent_attr's value overwritten with the attribute's cur-
rent value.
Config
vmevent_fd(2) accepts vmevent_config structure to configure the
notifications:
struct vmevent_config {
__u32 size;
__u32 counter;
__u64 sample_period_ns;
struct vmevent_attr attrs[VMEVENT_CONFIG_MAX_ATTRS];
};
size must be initialized to sizeof(struct vmevent_config).
counter specifies the number of initialized attrs elements.
sample_period_ns specifies the sampling period in nanoseconds.
Applications are recommended to set this value to the highest
suitable period. (Note that for some attributes the delivery tim-
ing is not based on the sampling period, e.g. VMEVENT_ATTR_PRES-
SURE.)
RETURN VALUE
On success, vmevent_fd() returns a new file descriptor. On error,
a negative value is returned and errno is set to indicate the
error.
ERRORS
vmevent_fd() can fail with errors similar to open(2).
In addition, the following errors are possible:
EINVAL An improperly initialized config structure has been passed
to the call (this also includes improperly initialized
attrs arrays).
EFAULT The kernel was unable to read the configuration structure,
that is, the config parameter points to inaccessible mem-
ory.
VERSIONS
The system call is available on Linux since kernel 3.8. Library
support is not yet provided by any glibc version.
CONFORMING TO
The system call is Linux-specific.
EXAMPLE
Examples can be found in the /usr/src/linux/tools/testing/vme-
vent/ directory.
SEE ALSO
poll(2), read(2), proc(5), vmstat(8)
Linux 2012-10-16 VMEVENT_FD(2)
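For illustration, here is a rough, untested userspace sketch of a pressure
watcher built from the structures described above. It is not part of the
patch set: it assumes the patched <linux/vmevent.h>, a __NR_vmevent_fd
syscall number (e.g. as provided by the test tools in
tools/testing/vmevent/), and that the config argument is passed as a
pointer to the structure.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <poll.h>
#include <sys/syscall.h>
#include <asm/unistd.h>
#include <linux/types.h>
#include <linux/vmevent.h>

int main(void)
{
	struct vmevent_config config;
	/* Room for the event header plus the single configured attribute. */
	union {
		struct vmevent_event ev;
		char bytes[sizeof(struct vmevent_event) +
			   sizeof(struct vmevent_attr)];
	} buf;
	struct pollfd pfd;
	int fd;

	memset(&config, 0, sizeof(config));
	config.size = sizeof(config);
	config.counter = 1;			 /* one attribute configured */
	config.sample_period_ns = 1000000000ULL; /* pressure delivery does not
						    depend on this */
	config.attrs[0].type = VMEVENT_ATTR_PRESSURE;
	/* Notify when pressure rises above _LOW, i.e. reaches _MED or _OOM. */
	config.attrs[0].value = VMEVENT_PRESSURE_LOW;
	config.attrs[0].state = VMEVENT_ATTR_STATE_VALUE_GT |
				VMEVENT_ATTR_STATE_EDGE_TRIGGER;

	fd = syscall(__NR_vmevent_fd, &config);
	if (fd < 0) {
		perror("vmevent_fd");
		return 1;
	}

	pfd.fd = fd;
	pfd.events = POLLIN;

	while (poll(&pfd, 1, -1) > 0) {
		if (read(fd, &buf, sizeof(buf)) < 0)
			break;
		printf("pressure level: %llu\n",
		       (unsigned long long)buf.ev.attrs[0].value);
		/* Here the application would drop caches, free pools, etc. */
	}

	close(fd);
	return 0;
}

A real activity manager would presumably map the reported level to
concrete actions, e.g. dropping its own caches at _MED and killing
expendable background services when _OOM is reported.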
Signed-off-by: Anton Vorontsov <[email protected]>
---
man2/vmevent_fd.2 | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 235 insertions(+)
create mode 100644 man2/vmevent_fd.2
diff --git a/man2/vmevent_fd.2 b/man2/vmevent_fd.2
new file mode 100644
index 0000000..b631455
--- /dev/null
+++ b/man2/vmevent_fd.2
@@ -0,0 +1,235 @@
+.\" Copyright (C) 2008 Michael Kerrisk <[email protected]>
+.\" Copyright (C) 2012 Linaro Ltd.
+.\" Anton Vorontsov <[email protected]>
+.\" Based on ideas from:
+.\" KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka
+.\" Enberg.
+.\"
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or
+.\" (at your option) any later version.
+.\"
+.\" This program is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public License
+.\" along with this program; if not, write to the Free Software
+.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston,
+.\" MA 02111-1307 USA
+.\"
+.TH VMEVENT_FD 2 2012-10-16 Linux "Linux Programmer's Manual"
+.SH NAME
+vmevent_fd \- Linux virtual memory management events
+.SH SYNOPSIS
+.nf
+.B #define _GNU_SOURCE
+.B #include <unistd.h>
+.B #include <sys/syscall.h>
+.B #include <asm/unistd.h>
+.B #include <linux/types.h>
+.B #include <linux/vmevent.h>
+
+.\" TODO: libc wrapper
+.BI "syscall(__NR_vmevent_fd, "config );
+.fi
+.SH DESCRIPTION
+This system call creates a new file descriptor that can be used with polling
+routines (e.g.
+.BR poll (2))
+to get notified about various in-kernel virtual memory management events
+that might be of interest to userspace. The interface can
+also be used to efficiently monitor memory usage (e.g. the number of idle and
+swap pages).
+
+Applications can make the overall system's memory management more nimble by
+adjusting their resource usage upon the notifications.
+.SS Attributes
+Attributes are the basic concept; they are described by the following
+structure:
+
+.nf
+struct vmevent_attr {
+ __u64 value;
+ __u32 type;
+ __u32 state;
+};
+.fi
+
+.I type
+may correspond to these values:
+.TP
+.B VMEVENT_ATTR_NR_AVAIL_PAGES
+The attribute reports the total number of available pages in the system, not
+including swap space (i.e. just total RAM).
+.I value
+is used to set up a threshold (in number of pages) upon which the event
+will be delivered by the kernel.
+
+Upon notification the kernel updates all configured attributes, so this
+attribute is mostly used without any threshold, just for getting the
+value together with other attributes without reading and parsing
+.IR /proc/vmstat .
+.TP
+.B VMEVENT_ATTR_NR_FREE_PAGES
+The attribute reports the total number of unused (idle) pages in the system.
+
+.I value
+is used to set up a threshold (in number of pages) upon which the event
+will be delivered by the kernel.
+.TP
+.B VMEVENT_ATTR_NR_SWAP_PAGES
+The attribute reports the total number of swapped pages.
+
+.I value
+is used to set up a threshold (in number of pages) upon which the event
+will be delivered by the kernel.
+.TP
+.B VMEVENT_ATTR_PRESSURE
+The attribute reports Linux virtual memory management pressure. There are
+three discrete levels:
+
+.BR VMEVENT_PRESSURE_LOW :
+By setting the threshold to this value it's possible to watch whether the
+system is reclaiming memory for new allocations. Monitoring reclaiming
+activity might be useful for maintaining the system's overall cache level.
+
+.BR VMEVENT_PRESSURE_MED :
+The system is experiencing medium memory pressure; there is some mild
+swapping activity. Upon this event applications may decide to free any
+resources that can be easily reconstructed or re-read from disk.
+
+.BR VMEVENT_PRESSURE_OOM :
+The system is actively thrashing, it is about to run out of memory (OOM),
+or the in-kernel OOM killer is about to trigger. Applications
+should do whatever they can to help the system. See
+.BR proc (5)
+for more information about the OOM killer and its configuration options.
+
+.I value
+is used to set up a threshold upon which the event will be delivered by
+the kernel (for algebraic comparisons, it is defined that
+.BR VMEVENT_PRESSURE_LOW " <"
+.BR VMEVENT_PRESSURE_MED " <"
+.BR VMEVENT_PRESSURE_OOM ,
+but applications should not put any meaning into the absolute values.)
+
+.TP
+.I state
+is used to set up the threshold behaviour; the following flags can be bitwise
+OR'ed:
+....
+.TP
+.B VMEVENT_ATTR_STATE_VALUE_LT
+Notification will be delivered when an attribute is less than a
+user-specified
+.IR "value" .
+.TP
+.B VMEVENT_ATTR_STATE_VALUE_GT
+Notifications will be delivered when an attribute is greater than a
+user-specified
+.IR "value" .
+.TP
+.B VMEVENT_ATTR_STATE_VALUE_EQ
+Notifications will be delivered when an attribute is equal to a
+user-specified
+.IR "value" .
+.TP
+.B VMEVENT_ATTR_STATE_EDGE_TRIGGER
+Events will only be delivered when an attribute crosses the
+.I value
+threshold.
+.SS Events
+Upon a notification, the application must read out events using the
+.BR read (2)
+system call.
+The events are delivered using the following structure:
+
+.nf
+struct vmevent_event {
+ __u32 counter;
+ __u32 padding;
+ struct vmevent_attr attrs[];
+};
+.fi
+
+The
+.I counter
+specifies the number of reported attributes, and the
+.I attrs
+array contains a copy of the configured attributes, with each
+.IR "vmevent_attr" 's
+.I value
+overwritten with the attribute's current value.
+.SS Config
+.BR vmevent_fd (2)
+accepts
+.I vmevent_config
+structure to configure the notifications:
+
+.nf
+struct vmevent_config {
+ __u32 size;
+ __u32 counter;
+ __u64 sample_period_ns;
+ struct vmevent_attr attrs[VMEVENT_CONFIG_MAX_ATTRS];
+};
+.fi
+
+.I size
+must be initialized to
+.IR "sizeof(struct vmevent_config)" .
+
+.I counter
+specifies the number of initialized
+.I attrs
+elements.
+
+.I sample_period_ns
+specifies the sampling period in nanoseconds. Applications are
+recommended to set this value to the highest suitable period. (Note that for
+some attributes the delivery timing is not based on the sampling period,
+e.g.
+.IR VMEVENT_ATTR_PRESSURE .)
+.SH "RETURN VALUE"
+On success,
+.BR vmevent_fd ()
+returns a new file descriptor. On error, a negative value is returned and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.BR vmevent_fd ()
+can fail with errors similar to
+.BR open (2).
+
+In addition, the following errors are possible:
+.TP
+.B EINVAL
+An improperly initialized
+.I config
+structure has been passed to the call (this also includes improperly
+initialized
+.I attrs
+arrays).
+.TP
+.B EFAULT
+The kernel was unable to read the configuration
+structure, that is, the
+.I config
+parameter points to inaccessible memory.
+.SH VERSIONS
+The system call is available on Linux since kernel 3.8. Library support is
+not yet provided by any glibc version.
+.SH CONFORMING TO
+The system call is Linux-specific.
+.SH EXAMPLE
+Examples can be found in the
+.I /usr/src/linux/tools/testing/vmevent/
+directory.
+.SH "SEE ALSO"
+.BR poll (2),
+.BR read (2),
+.BR proc (5),
+.BR vmstat (8)
--
1.7.12.3
On Mon, 22 Oct 2012, Anton Vorontsov wrote:
> This patch introduces VMEVENT_ATTR_PRESSURE, the attribute reports Linux
> virtual memory management pressure. There are three discrete levels:
>
> VMEVENT_PRESSURE_LOW: Notifies that the system is reclaiming memory for
> new allocations. Monitoring reclaiming activity might be useful for
> maintaining overall system's cache level.
>
> VMEVENT_PRESSURE_MED: The system is experiencing medium memory pressure,
> there is some mild swapping activity. Upon this event applications may
> decide to free any resources that can be easily reconstructed or re-read
> from a disk.
Nit:
s/VMEVENT_PRESSURE_MED/VMEVENT_PRESSURE_MEDIUM/
Other than that, I'm OK with this. Mel and others, what are your thoughts
on this?
Anton, have you tested this with real world scenarios? How does it stack
up against Android's low memory killer, for example?
Pekka
Hello Pekka,
Thanks for taking a look into this!
On Wed, Oct 24, 2012 at 12:03:10PM +0300, Pekka Enberg wrote:
> On Mon, 22 Oct 2012, Anton Vorontsov wrote:
> > This patch introduces VMEVENT_ATTR_PRESSURE, the attribute reports Linux
> > virtual memory management pressure. There are three discrete levels:
> >
> > VMEVENT_PRESSURE_LOW: Notifies that the system is reclaiming memory for
> > new allocations. Monitoring reclaiming activity might be useful for
> > maintaining overall system's cache level.
> >
> > VMEVENT_PRESSURE_MED: The system is experiencing medium memory pressure,
> > there is some mild swapping activity. Upon this event applications may
> > decide to free any resources that can be easily reconstructed or re-read
> > from a disk.
>
> Nit:
>
> s/VMEVENT_PRESSURE_MED/VMEVENT_PRESSURE_MEDIUM/
Sure thing, will change.
> Other than that, I'm OK with this. Mel and others, what are your thoughts
> on this?
>
> Anton, have you tested this with real world scenarios?
Yup, I was mostly testing it on a desktop. I.e. in a KVM instance I was
running a full fedora17 desktop w/ a lot of apps opened. The pressure
index was pretty good in the sense that it was indeed reflecting the
sluggishness in the system during swap activity. It's not ideal, i.e. the
index might drop slightly for some time, but we are usually interested in
an "above some value" threshold, so it should be fine.
The _LOW level is defined very strictly, and cannot be tuned at all. So
it's very solid, and that's what we mostly use for Android.
The _OOM level is also defined quite strictly, so from the API point of
view, it's also solid, and should not be a problem.
Although the problem with _OOM is delivering the event in time (i.e. we
must be quick in predicting it, before the OOM killer triggers). Today the
patch has a shortcut for the _OOM level: we send the _OOM notification when
the reclaimer's priority is below the empirically found value '3' (we might
make it tunable via sysctl too, but that would expose another mm detail --
although a sysctl doesn't sound as bad as exposing something in the C API;
we have plenty of mm knobs in /proc/sys/vm/ already).
The real tunable is the _MED level, and it should be tuned based on the
desired system behaviour, which I described in more detail in this long
post: http://lkml.org/lkml/2012/10/7/29.
Based on my observations, I wouldn't say that we have plenty of room to
tune the value, though. Usual swapping activity causes the index to rise
to, say, 30%, and when the system can't keep up, it rises to 50..90 (but we
still have plenty of swap space, so the system is far from OOM, although it
is thrashing). Ideally I'd prefer not to have any sysctl, but I believe the
_MED level is really based on the user's definition of "medium".
> How does it stack up against Android's low memory killer, for example?
The LMK driver is effectively using what we call _LOW pressure
notifications here, so by definition it is enough to build a full
replacement for the in-kernel LMK using just the _LOW level. But in the
future, we might want to use _MED as well, e.g. killing unneeded services
based not on the cache level, but on the pressure.
Thanks,
Anton.
Hi Anton,
On Mon, Oct 22, 2012 at 04:19:28AM -0700, Anton Vorontsov wrote:
> Hi all,
>
> So this is the second RFC. The main change is that I decided to go with
> discrete levels of the pressure.
I am very happy with that because I have already yelled about it several times.
>
> When I started writing the man page, I had to describe the 'reclaimer
> inefficiency index', and while doing this I realized that I'm describing
> how the kernel is doing the memory management, which we try to avoid in
> the vmevent. And applications don't really care about these details:
> reclaimers, its inefficiency indexes, scanning window sizes, priority
> levels, etc. -- it's all "not interesting", and purely kernel's stuff. So
> I guess Mel Gorman was right, we need some sort of levels.
>
> What applications (well, activity managers) are really interested in is
> this:
>
> > 1. Do we sacrifice resources for new memory allocations (e.g. files
> cache)?
> 2. Does the new memory allocations' cost becomes too high, and the system
> hurts because of this?
> 3. Are we about to OOM soon?
Good, but I think 3 is never easy.
Still, an early notification would be better than a late notification,
which can end up killing someone.
>
> And here are the answers:
>
> 1. VMEVENT_PRESSURE_LOW
> 2. VMEVENT_PRESSURE_MED
> 3. VMEVENT_PRESSURE_OOM
>
> There is no "high" pressure, since I really don't see any definition of
> it, but it's possible to introduce new levels without breaking ABI. The
> levels described in more details in the patches, and the stuff is still
> tunable, but now via sysctls, not the vmevent_fd() call itself (i.e. we
> don't need to rebuild applications to adjust window size or other mm
> "details").
>
> What I couldn't fix in this RFC is making vmevent_{scanned,reclaimed}
> stuff per-CPU (there's a comment describing the problem with this). But I
> made it lockless and tried to make it very lightweight (plus I moved the
> vmevent_pressure() call to a more "cold" path).
Your description doesn't include why we need the new vmevent_fd(2).
Of course, it's very flexible and has the potential to add new VM knobs
easily, but the only thing we are about to use now is VMEVENT_ATTR_PRESSURE.
Are there any other use cases for swap or free? Or potential users?
Adding vmevent_fd without them is rather overkill.
And I want to avoid timer-based polling of vmevent if possible.
KOSAKI's mem_notify doesn't use such a timer.
I don't object, but we need a rationale for adding a new system call, which
will have to be maintained forever once we add it.
>
> Thanks,
> Anton.
>
--
Kind regards,
Minchan Kim
On Thu, Oct 25, 2012 at 9:40 AM, Minchan Kim <[email protected]> wrote:
> Your description doesn't include why we need new vmevent_fd(2).
> Of course, it's very flexible and potential to add new VM knob easily but
> the thing we is about to use now is only VMEVENT_ATTR_PRESSURE.
> Is there any other use cases for swap or free? or potential user?
> Adding vmevent_fd without them is rather overkill.
What ABI would you use instead?
On Thu, Oct 25, 2012 at 9:40 AM, Minchan Kim <[email protected]> wrote:
> I don't object but we need rationale for adding new system call which should
> be maintained forever once we add it.
Agreed.
On Wed, Oct 24, 2012 at 07:23:21PM -0700, Anton Vorontsov wrote:
> Hello Pekka,
>
> Thanks for taking a look into this!
>
> On Wed, Oct 24, 2012 at 12:03:10PM +0300, Pekka Enberg wrote:
> > On Mon, 22 Oct 2012, Anton Vorontsov wrote:
> > > This patch introduces VMEVENT_ATTR_PRESSURE, the attribute reports Linux
> > > virtual memory management pressure. There are three discrete levels:
> > >
> > > VMEVENT_PRESSURE_LOW: Notifies that the system is reclaiming memory for
> > > new allocations. Monitoring reclaiming activity might be useful for
> > > maintaining overall system's cache level.
> > >
> > > VMEVENT_PRESSURE_MED: The system is experiencing medium memory pressure,
> > > there is some mild swapping activity. Upon this event applications may
> > > decide to free any resources that can be easily reconstructed or re-read
> > > from a disk.
> >
> > Nit:
> >
> > s/VMEVENT_PRESSURE_MED/VMEVENT_PRESSURE_MEDIUM/
>
> Sure thing, will change.
>
> > Other than that, I'm OK with this. Mel and others, what are your thoughts
> > on this?
> >
> > Anton, have you tested this with real world scenarios?
>
> Yup, I was mostly testing it on a desktop. I.e. in a KVM instance I was
> running a full fedora17 desktop w/ a lot of apps opened. The pressure
> index was pretty good in the sense that it was indeed reflecting the
> sluggishness in the system during swap activity. It's not ideal, i.e. the
> index might drop slightly for some time, but we usually interested in
> "above some value" threshold, so it should be fine.
>
> The _LOW level is defined very strictly, and cannot be tuned anyhow. So
> it's very solid, and that's what we mostly use for Android.
>
> The _OOM level is also defined quite strict, so from the API point of
> view, it's also solid, and should not be a problem.
One of my concerns when looking at the code is whether we should consider
high-order page allocations. Currently the OOM killer doesn't kill anyone
when the VM suffers from higher-order allocations, because killing doesn't
help to get physically contiguous memory in the normal case. The same rule
could be applied here.
>
> Although the problem with _OOM is delivering the event in time (i.e. we
> must be quick in predicting it, before OOMK triggers). Today the patch has
Absolutely. It was the biggest challenge.
> a shortcut for _OOM level: we send _OOM notification when reclaimer's
> priority is below empirically found value '3' (we might make it tunable
> via sysctl too, but that would expose another mm detail -- although sysctl
> sounds not that bad as exposing something in the C API; we have plenty of
> mm knobs in /proc/sys/vm/ already).
Hmm, I'm not sure depending on such a magic value is a good idea, but I
have no better idea, so I will shut up :(
>
> The real tunable is _MED level, and this should be tuned based on the
> desired system's behaviour that I described in more detail in this long
> post: http://lkml.org/lkml/2012/10/7/29.
>
> Based on my observations, I wouldn't say that we have plenty of room to
> tune the value, though. Usual swapping activity causes index to rise to
> say to 30%, and when the system can't keep up, it raises to 50..90 (but we
> still have plenty of swap space, so the system is far away from OOM,
> although it is thrashing. Ideally I'd prefer to not have any sysctl, but I
> believe _MED level is really based on user's definition of "medium".
>
> > How does it stack up against Android's low memory killer, for example?
>
> The LMK driver is effectively using what we call _LOW pressure
> notifications here, so by definition it is enough to build a full
> replacement for the in-kernel LMK using just the _LOW level. But in the
> future, we might want to use _MED as well, e.g. kill unneeded services
> based not on the cache level, but based on the pressure.
Good idea.
Thanks for keeping at this, Anton!
>
> Thanks,
> Anton.
>
--
Kind regards,
Minchan Kim
Hi Pekka,
On Thu, Oct 25, 2012 at 09:44:52AM +0300, Pekka Enberg wrote:
> On Thu, Oct 25, 2012 at 9:40 AM, Minchan Kim <[email protected]> wrote:
> > Your description doesn't include why we need new vmevent_fd(2).
> > Of course, it's very flexible and potential to add new VM knob easily but
> > the thing we is about to use now is only VMEVENT_ATTR_PRESSURE.
> > Is there any other use cases for swap or free? or potential user?
> > Adding vmevent_fd without them is rather overkill.
>
> What ABI would you use instead?
I thought a /dev/some_knob like mem_notify plus epoll would be enough, but
please keep in mind that I'm not strongly against vmevent_fd. My point is
that the description should explain why the other candidates are not good,
or why vmevent_fd is better.
(But at least, I still don't like the vmevent timer polling, and I hope we
keep it only as a last resort until we find something better.)
>
> On Thu, Oct 25, 2012 at 9:40 AM, Minchan Kim <[email protected]> wrote:
> > I don't object but we need rationale for adding new system call which should
> > be maintained forever once we add it.
>
> Agreed.
>
--
Kind regards,
Minchan Kim
Hello Minchan,
Thanks a lot for the email!
On Thu, Oct 25, 2012 at 03:40:09PM +0900, Minchan Kim wrote:
[...]
> > What applications (well, activity managers) are really interested in is
> > this:
> >
> > > 1. Do we sacrifice resources for new memory allocations (e.g. files
> > cache)?
> > 2. Does the new memory allocations' cost becomes too high, and the system
> > hurts because of this?
> > 3. Are we about to OOM soon?
>
> Good but I think 3 is never easy.
> But early notification would be better than late notification which can kill
> someone.
Well, basically these are two fixed (strictly defined) levels (low and
oom) + one flexible level (med), whose meaning can be slightly tuned (but
we still have a meaningful definition for it).
So, I guess it's a good start. :)
> > And here are the answers:
> >
> > 1. VMEVENT_PRESSURE_LOW
> > 2. VMEVENT_PRESSURE_MED
> > 3. VMEVENT_PRESSURE_OOM
> >
> > There is no "high" pressure, since I really don't see any definition of
> > it, but it's possible to introduce new levels without breaking ABI. The
> > levels described in more details in the patches, and the stuff is still
> > tunable, but now via sysctls, not the vmevent_fd() call itself (i.e. we
> > don't need to rebuild applications to adjust window size or other mm
> > "details").
> >
> > What I couldn't fix in this RFC is making vmevent_{scanned,reclaimed}
> > stuff per-CPU (there's a comment describing the problem with this). But I
> > made it lockless and tried to make it very lightweight (plus I moved the
> > vmevent_pressure() call to a more "cold" path).
>
> Your description doesn't include why we need new vmevent_fd(2).
> Of course, it's very flexible and potential to add new VM knob easily but
> the thing we is about to use now is only VMEVENT_ATTR_PRESSURE.
> Is there any other use cases for swap or free? or potential user?
The number of idle pages by itself might not be that interesting, but the
cache+idle level is quite interesting.
By definition, _MED happens when performance has already degraded, if only
slightly -- we can be swapping.
But _LOW notifications come when the kernel is just reclaiming, so by using
_LOW notifications + watching the cache level we can very easily predict
the swapping activity long before we even have _MED pressure.
E.g. if idle+cache drops below the amount of memory that userland can free,
we'd indeed like to start freeing stuff (this somewhat resembles the current
logic that we have in the in-kernel LMK).
Sure, we can read and parse /proc/vmstat upon _LOW events (and that was my
backup plan), but reporting the stuff together would make things much nicer.
Although, I somewhat doubt that it is OK to report raw numbers, so this
needs some thinking to develop a more elegant solution.
Maybe it makes sense to implement something like PRESSURE_MILD with an
additional nr_pages threshold, which basically hints the kernel about how
many easily reclaimable pages userland has (that would be part of our
definition of the mild pressure level). So, essentially it would be
	if (pressure_index >= oom_level)
		return PRESSURE_OOM;
	else if (pressure_index >= med_level)
		return PRESSURE_MEDIUM;
	else if (userland_reclaimable_pages >= nr_reclaimable_pages)
		return PRESSURE_MILD;
	return PRESSURE_LOW;
I must admit I like the idea more than exposing NR_FREE and the like, but
the scheme reminds me of the blended attributes, which we abandoned. Although,
the definition sounds better now, and we seem to be doing it in the right
place.
And if we go this way, then sure, we won't need any other attributes, and
so we could make the API much simpler.
> Adding vmevent_fd without them is rather overkill.
>
> And I want to avoid timer-base polling of vmevent if possbile.
> mem_notify of KOSAKI doesn't use such timer.
For pressure notifications we don't use the timers. We also read the
vmstat counters together with the pressure, so "pressure + counters"
effectively turns it into non-timer based polling. :)
But yes, hopefully we can get rid of the raw counters and timers; I don't
like them either.
> I don't object but we need rationale for adding new system call which should
> be maintained forever once we add it.
We can do it via eventfd, or a /dev/chardev (which has been discussed and
people didn't like it, IIRC), or signals (which have also been discussed,
and there are problems with that approach as well).
I'm not sure why having a syscall is a big issue. If we're making an
eventfd interface, then we'd need to maintain the /sys/.../ ABI the same
way as we maintain the syscall. What's the difference? A dedicated syscall
is just a simpler interface; we don't need to mess with opening and passing
things through /sys/.../.
Personally I don't have any preference (except that I have a distaste for
chardevs and ioctls :), I just want to see the pros and cons of all the
solutions, and so far the syscall seems like the easiest way? Anyway, I'm
totally open to changing it into whatever fits best.
Thanks,
Anton.
On Thu, Oct 25, 2012 at 02:08:14AM -0700, Anton Vorontsov wrote:
[...]
> Maybe it makes sense to implement something like PRESSURE_MILD with an
> additional nr_pages threshold, which basically hits the kernel about how
> many easily reclaimable pages userland has (that would be a part of our
> definition for the mild pressure level). So, essentially it will be
>
> if (pressure_index >= oom_level)
> return PRESSURE_OOM;
> else if (pressure_index >= med_level)
> return PRESSURE_MEDIUM;
> else if (userland_reclaimable_pages >= nr_reclaimable_pages)
> return PRESSURE_MILD;
...or we can call it PRESSURE_BALANCE, just to be precise and clear.
On Thu, Oct 25, 2012 at 02:08:14AM -0700, Anton Vorontsov wrote:
> Hello Minchan,
>
> Thanks a lot for the email!
>
> On Thu, Oct 25, 2012 at 03:40:09PM +0900, Minchan Kim wrote:
> [...]
> > > What applications (well, activity managers) are really interested in is
> > > this:
> > >
> > > 1. Do we sacrifice resources for new memory allocations (e.g. files
> > > cache)?
> > > 2. Does the new memory allocations' cost becomes too high, and the system
> > > hurts because of this?
> > > 3. Are we about to OOM soon?
> >
> > Good but I think 3 is never easy.
> > But early notification would be better than late notification which can kill
> > someone.
>
> Well, basically these are two fixed (strictly defined) levels (low and
> oom) + one flexible level (med), which meaning can be slightly tuned (but
> we still have a meaningful definition for it).
>
I mean detection of "3) Are we about to OOM soon" isn't easy.
> So, I guess it's a good start. :)
Absolutely!
>
> > > And here are the answers:
> > >
> > > 1. VMEVENT_PRESSURE_LOW
> > > 2. VMEVENT_PRESSURE_MED
> > > 3. VMEVENT_PRESSURE_OOM
> > >
> > > There is no "high" pressure, since I really don't see any definition of
> > > it, but it's possible to introduce new levels without breaking ABI. The
> > > levels described in more details in the patches, and the stuff is still
> > > tunable, but now via sysctls, not the vmevent_fd() call itself (i.e. we
> > > don't need to rebuild applications to adjust window size or other mm
> > > "details").
> > >
> > > What I couldn't fix in this RFC is making vmevent_{scanned,reclaimed}
> > > stuff per-CPU (there's a comment describing the problem with this). But I
> > > made it lockless and tried to make it very lightweight (plus I moved the
> > > vmevent_pressure() call to a more "cold" path).
> >
> > Your description doesn't include why we need new vmevent_fd(2).
> > Of course, it's very flexible and potential to add new VM knob easily but
> > the thing we is about to use now is only VMEVENT_ATTR_PRESSURE.
> > Is there any other use cases for swap or free? or potential user?
>
> Number of idle pages by itself might be not that interesting, but
> cache+idle level is quite interesting.
>
> By definition, _MED happens when performance already degraded, slightly,
> but still -- we can be swapping.
>
> But _LOW notifications are coming when kernel is just reclaiming, so by
> using _LOW notifications + watching for cache level we can very easily
> predict the swapping activity long before we have even _MED pressure.
So, for seeing the cache level, we need a new vmevent_attr?
>
> E.g. if idle+cache drops below amount of memory that userland can free,
> we'd indeed like to start freeing stuff (this somewhat resembles current
> logic that we have in the in-kernel LMK).
>
> Sure, we can read and parse /proc/vmstat upon _LOW events (and that was my
> backup plan), but reporting stuff together would make things much nicer.
My concern is that users can imagine various scenarios with vmstat and
might start to require new vmevent_attrs in the future; vmevent_fd would
get bloated, and mm guys would have to care about vmevent_fd whenever they
add a new vmstat counter. I don't like it. Users can do it by just reading
/proc/vmstat. So I support your backup plan.
>
> Although, I somewhat doubt that it is OK to report raw numbers, so this
> needs some thinking to develop more elegant solution.
Indeed.
>
> Maybe it makes sense to implement something like PRESSURE_MILD with an
> additional nr_pages threshold, which basically hits the kernel about how
> many easily reclaimable pages userland has (that would be a part of our
> definition for the mild pressure level). So, essentially it will be
>
> if (pressure_index >= oom_level)
> return PRESSURE_OOM;
> else if (pressure_index >= med_level)
> return PRESSURE_MEDIUM;
> else if (userland_reclaimable_pages >= nr_reclaimable_pages)
> return PRESSURE_MILD;
> return PRESSURE_LOW;
>
> I must admit I like the idea more than exposing NR_FREE and stuff, but the
> scheme reminds me the blended attributes, which we abandoned. Although,
> the definition sounds better now, and we seem to be doing it in the right
> place.
>
> And if we go this way, then sure, we won't need any other attributes, and
> so we could make the API much simpler.
That's what I want! If there isn't any user who is really willing to use
it, let's drop it. Do not persuade with imaginary scenarios, because we
should be careful about introducing new ABI.
>
> > Adding vmevent_fd without them is rather overkill.
> >
> > And I want to avoid timer-base polling of vmevent if possbile.
> > mem_notify of KOSAKI doesn't use such timer.
>
> For pressure notifications we don't use the timers. We also read the
Hmm, when I look at the code, the timer still works and can notify the user. No?
> vmstat counters together with the pressure, so "pressure + counters"
> effectively turns it into non-timer based polling. :)
>
> But yes, hopefully we can get rid of the raw counters and timers, I don't
> them it too.
You and I are reaching a conclusion, at least.
>
> > I don't object but we need rationale for adding new system call which should
> > be maintained forever once we add it.
>
> We can do it via eventfd, or /dev/chardev (which has been discussed and
> people didn't like it, IIRC), or signals (which also has been discussed
> and there are problems with this approach as well).
>
> I'm not sure why having a syscall is a big issue. If we're making eventfd
> interface, then we'd need to maintain /sys/.../ ABI the same way as we
> maintain the syscall. What's the difference? A dedicated syscall is just a
No difference. What I want is just to remove unnecessary stuff from
vmevent_fd and keep it simple. If we go via a /dev/chardev, I expect we can
do the necessary things for VM pressure. But if we can put vmevent_fd on a
diet, it would be better. If so, maybe we have to rename vmevent_fd to
lowmem_fd or vmpressure_fd.
> simpler interface, we don't need to mess with opening and passing things
> through /sys/.../.
>
> Personally I don't have any preference (except that I distaste chardev and
> ioctls :), I just want to see pros and cons of all the solutions, and so
> far the syscall seems like an easiest way? Anyway, I'm totally open to
> changing it into whatever fits best.
Yep. The interface stuff isn't a big concern for low memory notification,
so I'm not strongly against it, either.
Thanks, Anton.
--
Kind regards,
Minchan Kim
On Fri, Oct 26, 2012 at 11:37:20AM +0900, Minchan Kim wrote:
[...]
> > > Of course, it's very flexible and potential to add new VM knob easily but
> > > the thing we is about to use now is only VMEVENT_ATTR_PRESSURE.
> > > Is there any other use cases for swap or free? or potential user?
> >
> > Number of idle pages by itself might be not that interesting, but
> > cache+idle level is quite interesting.
> >
> > By definition, _MED happens when performance already degraded, slightly,
> > but still -- we can be swapping.
> >
> > But _LOW notifications are coming when kernel is just reclaiming, so by
> > using _LOW notifications + watching for cache level we can very easily
> > predict the swapping activity long before we have even _MED pressure.
>
> So, for seeing cache level, we need new vmevent_attr?
Hopefully not. We're not interested in the raw values of the cache level;
what we want is to tell the kernel how many "easily reclaimable pages"
userland has, and get notified when the kernel believes that it's a good
time for userland to help. I.e. this new _MILD level:
> > Maybe it makes sense to implement something like PRESSURE_MILD with an
> > additional nr_pages threshold, which basically hits the kernel about how
> > many easily reclaimable pages userland has (that would be a part of our
> > definition for the mild pressure level). So, essentially it will be
> >
> > if (pressure_index >= oom_level)
> > return PRESSURE_OOM;
> > else if (pressure_index >= med_level)
> > return PRESSURE_MEDIUM;
> > else if (userland_reclaimable_pages >= nr_reclaimable_pages)
> > return PRESSURE_MILD;
> > return PRESSURE_LOW;
> >
> > I must admit I like the idea more than exposing NR_FREE and stuff, but the
> > scheme reminds me the blended attributes, which we abandoned. Although,
> > the definition sounds better now, and we seem to be doing it in the right
> > place.
> >
> > And if we go this way, then sure, we won't need any other attributes, and
> > so we could make the API much simpler.
>
> That's what I want! If there isn't any user who really are willing to use it,
> let's drop it. Do not persuade with imaginary scenario because we should be
> careful to introduce new ABI.
Yeah, I think you're right. Let's make vmevent_fd slim first. I won't even
focus on the _MILD/_BALANCE level for now; we can do it later, and we will
always have /proc/vmstat even if _MILD turns out to be a bad idea. Reading
/proc/vmstat is a bit more overhead, but not that much at all (especially
when we don't have to timer-poll vmstat).
> > > Adding vmevent_fd without them is rather overkill.
> > >
> > > And I want to avoid timer-base polling of vmevent if possbile.
> > > mem_notify of KOSAKI doesn't use such timer.
> >
> > For pressure notifications we don't use the timers. We also read the
>
> Hmm, when I see the code, timer still works and can notify to user. No?
Yes, I was mostly saying that it is technically not required anymore, but
you're right, the code still fires the timer (it just runs needlessly for
the pressure attr).
Bad wording on my side.
[..]
> > We can do it via eventfd, or /dev/chardev (which has been discussed and
> > people didn't like it, IIRC), or signals (which also has been discussed
> > and there are problems with this approach as well).
> >
> > I'm not sure why having a syscall is a big issue. If we're making eventfd
> > interface, then we'd need to maintain /sys/.../ ABI the same way as we
> > maintain the syscall. What's the difference? A dedicated syscall is just a
>
> No difference. What I want is just to remove unnecessary stuff in vmevent_fd
> and keep it as simple. If we do via /dev/chardev, I expect we can do necessary
> things for VM pressure. But if we can diet with vmevent_fd, It would be better.
> If so, maybe we have to change vmevent_fd to lowmem_fd or
> vmpressure_fd.
Sure, then I'll start the work to slim the API down, and we'll see how
things look after that.
Thanks a lot!
Anton.