2012-05-01 13:25:33

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH 0/3] vmevent: Implement 'low memory' attribute

Hi all,

Accounting only free pages is very inaccurate for low memory handling,
so we have to be smarter here.

The patch set implements a new attribute, which is blended from various
memory statistics. Vmevent can't expose all the kernel internals to
userland, as that would tie the internal Linux MM representation to the
ABI. So the ABI itself was made very simple: just the number of pages
below which we consider ourselves low on memory, and the kernel takes
care of the rest.

Thanks,

--
Anton Vorontsov
Email: [email protected]


2012-05-01 13:26:56

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH 1/3] vmevent: Implement equal-to attribute state

This complements GT and LT, making it possible to express the GE and LE
operators by combining bits. We'll use it for blended attributes: the
special attributes will return either 0 or <threshold>, so to get
two-way notifications we will pass the LT | EQ bits.

Signed-off-by: Anton Vorontsov <[email protected]>
---
include/linux/vmevent.h | 6 +++++-
mm/vmevent.c | 22 +++++++++++++++-------
2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
index ca97cf0..aae0d24 100644
--- a/include/linux/vmevent.h
+++ b/include/linux/vmevent.h
@@ -27,9 +27,13 @@ enum {
*/
VMEVENT_ATTR_STATE_VALUE_GT = (1UL << 1),
/*
+ * Sample value is equal to user-specified value
+ */
+ VMEVENT_ATTR_STATE_VALUE_EQ = (1UL << 2),
+ /*
* One-shot mode.
*/
- VMEVENT_ATTR_STATE_ONE_SHOT = (1UL << 2),
+ VMEVENT_ATTR_STATE_ONE_SHOT = (1UL << 3),

/* Saved state, used internally by the kernel for one-shot mode. */
__VMEVENT_ATTR_STATE_VALUE_WAS_LT = (1UL << 30),
diff --git a/mm/vmevent.c b/mm/vmevent.c
index 47ed448..9f1520b 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -87,28 +87,39 @@ static bool vmevent_match(struct vmevent_watch *watch)
u32 state = attr->state;
bool attr_lt = state & VMEVENT_ATTR_STATE_VALUE_LT;
bool attr_gt = state & VMEVENT_ATTR_STATE_VALUE_GT;
+ bool attr_eq = state & VMEVENT_ATTR_STATE_VALUE_EQ;

if (!state)
continue;

- if (attr_lt || attr_gt) {
+ if (attr_lt || attr_gt || attr_eq) {
bool one_shot = state & VMEVENT_ATTR_STATE_ONE_SHOT;
u32 was_lt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_LT;
u32 was_gt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_GT;
u64 value = vmevent_sample_attr(watch, attr);
bool lt = value < attr->value;
bool gt = value > attr->value;
+ bool eq = value == attr->value;
bool was_lt = state & was_lt_mask;
bool was_gt = state & was_gt_mask;
+ bool was_eq = was_lt && was_gt;
bool ret = false;

- if (((attr_lt && lt) || (attr_gt && gt)) && !one_shot)
+ if (((attr_lt && lt) || (attr_gt && gt) ||
+ (attr_eq && eq)) && !one_shot)
return true;

- if (attr_lt && lt && was_lt) {
+ if (attr_eq && eq && was_eq) {
return false;
- } else if (attr_gt && gt && was_gt) {
+ } else if (attr_lt && lt && was_lt && !was_eq) {
return false;
+ } else if (attr_gt && gt && was_gt && !was_eq) {
+ return false;
+ } else if (eq) {
+ state |= was_lt_mask;
+ state |= was_gt_mask;
+ if (attr_eq)
+ ret = true;
} else if (lt) {
state |= was_lt_mask;
state &= ~was_gt_mask;
@@ -119,9 +130,6 @@ static bool vmevent_match(struct vmevent_watch *watch)
state &= ~was_lt_mask;
if (attr_gt)
ret = true;
- } else {
- state &= ~was_lt_mask;
- state &= ~was_gt_mask;
}

attr->state = state;
--
1.7.9.2

2012-05-01 13:27:20

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH 2/3] vmevent: Pass attr argument to sampling functions

We'll need the argument for blended attributes: these attributes return
either 0 or the threshold value (which is stored in attr).

Signed-off-by: Anton Vorontsov <[email protected]>
---
mm/vmevent.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/mm/vmevent.c b/mm/vmevent.c
index 9f1520b..b312236 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -39,9 +39,11 @@ struct vmevent_watch {
wait_queue_head_t waitq;
};

-typedef u64 (*vmevent_attr_sample_fn)(struct vmevent_watch *watch);
+typedef u64 (*vmevent_attr_sample_fn)(struct vmevent_watch *watch,
+ struct vmevent_attr *attr);

-static u64 vmevent_attr_swap_pages(struct vmevent_watch *watch)
+static u64 vmevent_attr_swap_pages(struct vmevent_watch *watch,
+ struct vmevent_attr *attr)
{
#ifdef CONFIG_SWAP
struct sysinfo si;
@@ -54,12 +56,14 @@ static u64 vmevent_attr_swap_pages(struct vmevent_watch *watch)
#endif
}

-static u64 vmevent_attr_free_pages(struct vmevent_watch *watch)
+static u64 vmevent_attr_free_pages(struct vmevent_watch *watch,
+ struct vmevent_attr *attr)
{
return global_page_state(NR_FREE_PAGES);
}

-static u64 vmevent_attr_avail_pages(struct vmevent_watch *watch)
+static u64 vmevent_attr_avail_pages(struct vmevent_watch *watch,
+ struct vmevent_attr *attr)
{
return totalram_pages;
}
@@ -74,7 +78,7 @@ static u64 vmevent_sample_attr(struct vmevent_watch *watch, struct vmevent_attr
{
vmevent_attr_sample_fn fn = attr_samplers[attr->type];

- return fn(watch);
+ return fn(watch, attr);
}

static bool vmevent_match(struct vmevent_watch *watch)
--
1.7.9.2

2012-05-01 13:27:43

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH 3/3] vmevent: Implement special low-memory attribute

This is a specially "blended" attribute: the event triggers when the
kernel decides that we're close to the low memory threshold. Userspace
should not expect a very precise meaning of the low memory situation;
mostly, it's just a guess on the kernel's side.

This follows from the idea that userland should not know or care how
exactly the kernel manages memory, or assume that memory management
behaviour is part of the "ABI". So all the 'low memory' handling is
just guessing, but we're trying to do our best. We might end up with
two or three variations of 'low memory' thresholds, and all of them
would be useful for different use cases.

For this implementation, we assume that there's a low memory situation
for the N pages threshold when we have neither N completely free pages
nor N reclaimable pages in the cache. This effectively means that if
userland expects to allocate N pages, it would consume all the free
pages, and any further allocations (above N) would start draining
caches.

In the worst case, prior to hitting the threshold, we might have only
N pages in cache, and nearly no memory as free pages.

The same 'low memory' meaning is used in the current Android Low
Memory Killer driver.

Signed-off-by: Anton Vorontsov <[email protected]>
---
include/linux/vmevent.h | 7 ++++++
mm/vmevent.c | 40 ++++++++++++++++++++++++++++++++++
tools/testing/vmevent/vmevent-test.c | 12 +++++++++-
3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
index aae0d24..9bfa244 100644
--- a/include/linux/vmevent.h
+++ b/include/linux/vmevent.h
@@ -10,6 +10,13 @@ enum {
VMEVENT_ATTR_NR_AVAIL_PAGES = 1UL,
VMEVENT_ATTR_NR_FREE_PAGES = 2UL,
VMEVENT_ATTR_NR_SWAP_PAGES = 3UL,
+ /*
+ * This is specially blended attribute, the event triggers
+ * when kernel decides that we're close to the low memory threshold.
+ * Don't expect very precise meaning of low memory situation, mostly,
+ * it's just a guess on the kernel's side.
+ */
+ VMEVENT_ATTR_LOWMEM_PAGES = 4UL,

VMEVENT_ATTR_MAX /* non-ABI */
};
diff --git a/mm/vmevent.c b/mm/vmevent.c
index b312236..d278a25 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -68,10 +68,50 @@ static u64 vmevent_attr_avail_pages(struct vmevent_watch *watch,
return totalram_pages;
}

+/*
+ * Here's some implementation details for the "low memory" meaning.
+ *
+ * (The explanation is not in the header file as userland should not
+ * know these details, nor it should assume that the meaning will
+ * always be the same. As well as it should not know how exactly kernel
+ * manages the memory, or assume that memory management behaviour is a
+ * part of the "ABI". So, all the 'low memory' is just guessing, but
+ * we're trying to do our best.)
+ *
+ * For this implementation, we assume that there's a low memory situation
+ * for the N pages threshold when we have neither N pages of completely
+ * free pages, nor we have N reclaimable pages in the cache. This
+ * effectively means, that if userland expects to allocate N pages, it
+ * would consume all the free pages, and any further allocations (above
+ * N) would start draining caches.
+ *
+ * In the worst case, prior hitting the threshold, we might have only
+ * N pages in cache, and nearly no memory as free pages.
+ */
+static u64 vmevent_attr_lowmem_pages(struct vmevent_watch *watch,
+ struct vmevent_attr *attr)
+{
+ int free = global_page_state(NR_FREE_PAGES);
+ int file = global_page_state(NR_FILE_PAGES) -
+ global_page_state(NR_SHMEM); /* TODO: account locked pages */
+ int val = attr->value;
+
+ /*
+ * For convenience we return 0 or attr value (instead of 0/1), it
+ * makes it easier for vmevent_match() to cope with blended
+ * attributes, plus userland might use the value to find out which
+ * threshold triggered.
+ */
+ if (free < val && file < val)
+ return val;
+ return 0;
+}
+
static vmevent_attr_sample_fn attr_samplers[] = {
[VMEVENT_ATTR_NR_AVAIL_PAGES] = vmevent_attr_avail_pages,
[VMEVENT_ATTR_NR_FREE_PAGES] = vmevent_attr_free_pages,
[VMEVENT_ATTR_NR_SWAP_PAGES] = vmevent_attr_swap_pages,
+ [VMEVENT_ATTR_LOWMEM_PAGES] = vmevent_attr_lowmem_pages,
};

static u64 vmevent_sample_attr(struct vmevent_watch *watch, struct vmevent_attr *attr)
diff --git a/tools/testing/vmevent/vmevent-test.c b/tools/testing/vmevent/vmevent-test.c
index fd9a174..c61aed7 100644
--- a/tools/testing/vmevent/vmevent-test.c
+++ b/tools/testing/vmevent/vmevent-test.c
@@ -33,7 +33,7 @@ int main(int argc, char *argv[])

config = (struct vmevent_config) {
.sample_period_ns = 1000000000L,
- .counter = 6,
+ .counter = 7,
.attrs = {
{
.type = VMEVENT_ATTR_NR_FREE_PAGES,
@@ -59,6 +59,13 @@ int main(int argc, char *argv[])
.type = VMEVENT_ATTR_NR_SWAP_PAGES,
},
{
+ .type = VMEVENT_ATTR_LOWMEM_PAGES,
+ .state = VMEVENT_ATTR_STATE_VALUE_LT |
+ VMEVENT_ATTR_STATE_VALUE_EQ |
+ VMEVENT_ATTR_STATE_ONE_SHOT,
+ .value = phys_pages / 2,
+ },
+ {
.type = 0xffff, /* invalid */
},
},
@@ -108,6 +115,9 @@ int main(int argc, char *argv[])
case VMEVENT_ATTR_NR_SWAP_PAGES:
printf(" VMEVENT_ATTR_NR_SWAP_PAGES: %Lu\n", attr->value);
break;
+ case VMEVENT_ATTR_LOWMEM_PAGES:
+ printf(" VMEVENT_ATTR_LOWMEM_PAGES: %Lu\n", attr->value);
+ break;
default:
printf(" Unknown attribute: %Lu\n", attr->value);
}
--
1.7.9.2

2012-05-03 08:10:16

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/3] vmevent: Implement 'low memory' attribute

On Tue, May 1, 2012 at 4:24 PM, Anton Vorontsov
<[email protected]> wrote:
> Accounting only free pages is very inaccurate for low memory handling,
> so we have to be smarter here.

Can you elaborate on what kind of problems there are with tracking free pages?

2012-05-03 09:46:04

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 0/3] vmevent: Implement 'low memory' attribute

On Thu, May 03, 2012 at 11:10:12AM +0300, Pekka Enberg wrote:
> On Tue, May 1, 2012 at 4:24 PM, Anton Vorontsov
> <[email protected]> wrote:
> > Accounting only free pages is very inaccurate for low memory handling,
> > so we have to be smarter here.
>
> Can you elaborate on what kind of problems there are with tracking free pages?

Well, there's no problem with the tracking itself; the word 'inaccurate'
was probably misleading. Tracking just free pages is inaccurate for
our "low memory" notification needs, but NR_FREE_PAGES tracking
itself is fine.

The thing is that NR_FREE_PAGES accounts only completely unused
(wasted) pages. Most of the time we have very low NR_FREE_PAGES,
and lots of page cache and block buffers (i.e. NR_FILE_PAGES).

The file pages are easily reclaimable (except shmem/tmpfs and
locked pages), so file pages may be considered as "somewhat
free" pages.

The cache might contain very stale data (or not), so we have to
maneuver between the two strategies: sacrifice caches, or start
freeing memory (which prevents caches draining).

The strategy is described in the third patch in the series.
It might not be ideal, but the logic itself is not part of
the ABI (very similar "not ABI" rules apply to the OOM scoring
logic), and is subject to change.

Thanks,

--
Anton Vorontsov
Email: [email protected]

2012-05-03 10:33:52

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 1, 2012 at 4:26 PM, Anton Vorontsov
<[email protected]> wrote:
> This is specially "blended" attribute, the event triggers when kernel
> decides that we're close to the low memory threshold. Userspace should
> not expect very precise meaning of low memory situation, mostly, it's
> just a guess on the kernel's side.
>
> Well, this is the same as userland should not know or care how exactly
> kernel manages the memory, or assume that memory management behaviour
> is a part of the "ABI". So, all the 'low memory' is just guessing, but
> we're trying to do our best. It might be that we will end up with two
> or three variations of 'low memory' thresholds, and all of them would
> be useful for different use cases.
>
> For this implementation, we assume that there's a low memory situation
> for the N pages threshold when we have neither N pages of completely
> free pages, nor we have N reclaimable pages in the cache. This
> effectively means, that if userland expects to allocate N pages, it
> would consume all the free pages, and any further allocations (above
> N) would start draining caches.
>
> In the worst case, prior to hitting the threshold, we might have only
> N pages in cache, and nearly no memory as free pages.
>
> The same 'low memory' meaning is used in the current Android Low
> Memory Killer driver.
>
> Signed-off-by: Anton Vorontsov <[email protected]>

I don't see why we couldn't add this. Minchan, thoughts?


2012-05-03 10:54:11

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/3] vmevent: Implement 'low memory' attribute

Patches 1-2 applied, thanks!

2012-05-04 04:26:52

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On 05/01/2012 10:26 PM, Anton Vorontsov wrote:

> This is specially "blended" attribute, the event triggers when kernel
> decides that we're close to the low memory threshold. Userspace should
> not expect very precise meaning of low memory situation, mostly, it's
> just a guess on the kernel's side.
>
> Well, this is the same as userland should not know or care how exactly
> kernel manages the memory, or assume that memory management behaviour
> is a part of the "ABI". So, all the 'low memory' is just guessing, but
> we're trying to do our best. It might be that we will end up with two
> or three variations of 'low memory' thresholds, and all of them would


First of all, your calculation of available pages is very simple and
very specific to recent mobile phones. But recent systems have various
devices. For example:

SSD: a very fast server SSD with lots of internal RAM, so that writes are very fast.
Thumb USB: very slow, with little RAM.

1) We can consider anon pages and dirty pages as available pages.

SSD thumb usb
rootfs 0
swap 0

2) We can consider anon pages as available pages, but not dirty pages.

SSD thumb usb
rootfs O
swap 0

3) We can consider dirty pages as available pages, but not anon pages.

SSD thumb usb
rootfs O
swap O

4) We can't consider either dirty pages or anon pages as available pages.

SSD thumb usb
rootfs 0
swap 0

5) What if we use zram as swap?
6) Another idea: if we use both zram and a swap device (eMMC), then when zram is full,
we write zram pages to the swap device, aligned to the cluster size?

I mean, we can choose among various options to define the low memory state.

> be useful for different use cases.


Why should we do it on the kernel side?
If vmevent had VMEVENT_ATTR_[FILE|MLOCK|DIRTY|WRITEBACK|SHMEM|ANON|SWAP]_PAGES
and so on, whatever is needed for the calculation, we could compute it in userspace
without polling /proc/vmstat to see it. So I think there is no problem doing it in userspace.

And even if we solve the above problem, another new "blended" attribute may show up
in the future and will suffer the same problem again. So IMHO, let's leave vmevent as it is,
with very raw attributes, and do blended attributes in user space.


--
Kind regards,
Minchan Kim

2012-05-04 07:45:28

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Fri, May 04, 2012 at 01:26:45PM +0900, Minchan Kim wrote:
[...]
> > be useful for different use cases.
>
> Why should we do it in kernel side?

Because currently you can't do this in userland, see below. Today
this would be effectively the same as constantly reading /proc/vmstat,
which is surely not friendly performance-, context-switch- or
battery-wise.

> If vmevent will have VMEVENT_ATTR_[FILE|MOCK|DIRTY|WRITEBACK|SHMEM|ANON|SWAP]_PAGES
> and so on which is needed by calculation, we can calculate it in userspace without
> forking /proc/vmstat to see it. So I think there is no problem to do it in userspace.

There are two problems.

1. Originally, the idea behind vmevent was that we should not expose all
these mm details in vmevent, because it ties the ABI to the internal
Linux memory representation;

2. If you have, say, a boolean '(A + B + C + ...) > X' attribute (which is
exactly what blended attributes are), you can't just set up independent
thresholds on A, B, C, ... and have the same effect.

(What we can do, though, is... introduce arithmetic operators in
vmevent. :-D But then, at the end, we'll probably implement in-kernel
forth-like stack machine, with vmevent_config array serving as a
sequence of op-codes. ;-)

If we'll give up on "1." (Pekka, ping), then we need to solve "2."
in a sane way: we'll have to add a 'NR_FILE_PAGES - NR_SHMEM -
<todo-locked-file-pages>' attribute, and give it a name.

RECLAIMABLE_CACHE_PAGES maybe?

Thanks!

--
Anton Vorontsov
Email: [email protected]

2012-05-07 07:14:35

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Fri, May 4, 2012 at 10:38 AM, Anton Vorontsov
<[email protected]> wrote:
> There are two problems.
>
> 1. Originally, the idea behind vmevent was that we should not expose all
> ? these mm details in vmevent, because it ties ABI with Linux internal
> ? memory representation;
>
> 2. If you have say a boolean '(A + B + C + ...) > X' attribute (which is
> ? exactly what blended attributes are), you can't just set up independent
> ? thresholds on A, B, C, ... and have the same effect.
>
> ? (What we can do, though, is... introduce arithmetic operators in
> ? vmevent. :-D But then, at the end, we'll probably implement in-kernel
> ? forth-like stack machine, with vmevent_config array serving as a
> ? sequence of op-codes. ;-)
>
> If we'll give up on "1." (Pekka, ping), then we need to solve "2."
> in a sane way: we'll have to add a 'NR_FILE_PAGES - NR_SHMEM -
> <todo-locked-file-pages>' attribute, and give it a name.

Well, no, we can't give up on (1) completely. That'd mean that
eventually we'd need to change the ABI and break userspace. The
difference between exposing internal details and reasonable
abstractions is by no means black and white.

AFAICT, RECLAIMABLE_CACHE_PAGES is a reasonable thing to support. Can
anyone come up with a reason why we couldn't do that in the future?

Pekka

2012-05-07 08:34:35

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

>> If we'll give up on "1." (Pekka, ping), then we need to solve "2."
>> in a sane way: we'll have to add a 'NR_FILE_PAGES - NR_SHMEM -
>> <todo-locked-file-pages>' attribute, and give it a name.
>
> Well, no, we can't give up on (1) completely. That'd mean that
> eventually we'd need to change the ABI and break userspace. The
> difference between exposing internal details and reasonable
> abstractions is by no means black and white.
>
> AFAICT, RECLAIMABLE_CACHE_PAGES is a reasonable thing to support. Can
> anyone come up with a reason why we couldn't do that in the future?

It can, but the problem is that it is completely useless, because 1)
writing out dirty pages is sometimes very slow, and 2) libc's pages and
some other important libraries' pages are critically important for
running the system even though they are clean and reclaimable. In other
words, the kernel doesn't have that info, so it can't expose it.

2012-05-07 12:16:54

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Mon, May 07, 2012 at 04:26:00AM -0400, KOSAKI Motohiro wrote:
> >> If we'll give up on "1." (Pekka, ping), then we need to solve "2."
> >> in a sane way: we'll have to add a 'NR_FILE_PAGES - NR_SHMEM -
> >> <todo-locked-file-pages>' attribute, and give it a name.
> >
> > Well, no, we can't give up on (1) completely. That'd mean that
> > eventually we'd need to change the ABI and break userspace. The
> > difference between exposing internal details and reasonable
> > abstractions is by no means black and white.
> >
> > AFAICT, RECLAIMABLE_CACHE_PAGES is a reasonable thing to support. Can
> > anyone come up with a reason why we couldn't do that in the future?
>
> It can. but the problem is, that is completely useless.

Surely it is useful. It may not be ideal, but you can't say that
it is completely useless.

> Because of, 1) dirty pages writing-out is sometimes very slow and

I don't see this as an unsolvable problem: we can exclude dirty pages;
that's a nice idea, actually.

Easily reclaimable cache pages = file_pages - shmem - locked pages
- dirty pages.

The amount of dirty pages is configurable, which is also great.

Even more, we may introduce two attributes:

RECLAIMABLE_CACHE_PAGES and
RECLAIMABLE_CACHE_PAGES_NOIO (which excludes dirty pages).

This makes ABI detached from the mm internals and still keeps a
defined meaning of the attributes.

> 2) libc and some important library's pages are critical important
> for running a system even though it is clean and reclaimable. In other
> word, kernel don't have an info then can't expose it.

First off, I guess LRU would try to keep important/most used pages in
the cache, as we try to never fully drain page cache to the zero mark.

Secondly, if we're really low on memory (which low memory notifications
help to prevent) and kernel decided to throw libc's pages out of the
cache, you'll get cache miss and kernel will have to read it back. Well,
sometimes cache misses do happen, that's life. And if somebody really
doesn't want this for the essential parts of the system, they have to
mlock them (which eliminates your "kernel doesn't have the info" argument).


Btw, if you have any better strategy on helping userspace to define
'low memory' conditions, I'll readily try to implement it.

Thanks!

--
Anton Vorontsov
Email: [email protected]

2012-05-07 19:19:55

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

(5/7/12 8:15 AM), Anton Vorontsov wrote:
> On Mon, May 07, 2012 at 04:26:00AM -0400, KOSAKI Motohiro wrote:
>>>> If we'll give up on "1." (Pekka, ping), then we need to solve "2."
>>>> in a sane way: we'll have to add a 'NR_FILE_PAGES - NR_SHMEM -
>>>> <todo-locked-file-pages>' attribute, and give it a name.
>>>
>>> Well, no, we can't give up on (1) completely. That'd mean that
>>> eventually we'd need to change the ABI and break userspace. The
>>> difference between exposing internal details and reasonable
>>> abstractions is by no means black and white.
>>>
>>> AFAICT, RECLAIMABLE_CACHE_PAGES is a reasonable thing to support. Can
>>> anyone come up with a reason why we couldn't do that in the future?
>>
>> It can. but the problem is, that is completely useless.
>
> Surely it is useful. Could be not ideal, but you can't say that
> it is completely useless.

Why? It doesn't work.



>> Because of, 1) dirty pages writing-out is sometimes very slow and
>
> I don't see it as a unresolvable problem: we can exclude dirty pages,
> that's a nice idea actually.
>
> Easily reclaimable cache pages = file_pages - shmem - locked pages
> - dirty pages.
>
> The amount of dirty pages is configurable, which is also great.

You don't understand the issue. The point is NOT the formula. The problem
is that dirty and non-dirty pages aren't isolated in our kernel, so the
kernel starts to get stuck far before the non-dirty pages run out. A lie
notification is always useless.


> Even more, we may introduce two attributes:
>
> RECLAIMABLE_CACHE_PAGES and
> RECLAIMABLE_CACHE_PAGES_NOIO (which excludes dirty pages).
>
> This makes ABI detached from the mm internals and still keeps a
> defined meaning of the attributes.

A collection of crap is still crap. If you want to improve userland
notification, you should join the VM improvement activity. You shouldn't
assume that nobody but you has thought about a userland notification
feature.

The problem is that none of the current kernel VM statistics were created
for this purpose, and they don't fit it.

Even if some inaccurate and incorrect statistics fit _your_ use case, they
definitely don't fit others, and those people will consider that a bug.


>> 2) libc and some important library's pages are critical important
>> for running a system even though it is clean and reclaimable. In other
>> word, kernel don't have an info then can't expose it.
>
> First off, I guess LRU would try to keep important/most used pages in
> the cache, as we try to never fully drain page cache to the zero mark.

Yes, so what do you want to say?


> Secondly, if we're really low on memory (which low memory notifications
> help to prevent) and kernel decided to throw libc's pages out of the
> cache, you'll get cache miss and kernel will have to read it back. Well,
> sometimes cache misses do happen, that's life. And if somebody really
> don't want this for the essential parts of the system, one have to
> mlock it (which eliminates your "kernel don't have an info" argument).

First off, "low memory" is a very poorly defined term and we must not use
it; it has multiple meanings. 1) System free memory is low: some embedded
systems have a userland OOM killer and want to know the _system_ status.
2) Available memory is low: this is different from (1) when using NUMA,
memcg or cpusets, and nowadays almost all x86 boxes are NUMA. (2) is useful
for swap avoidance activity if we can implement it correctly.

Secondly, we can't assume that someone mlocks libc, because Linux is a
general-purpose kernel. As long as you keep talking only about your own use
case, we can't agree with you. "Users may have a workaround" is no excuse
for accepting a broken patch.




> Btw, if you have any better strategy on helping userspace to define
> 'low memory' conditions, I'll readily try to implement it.


2012-05-08 00:33:18

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Mon, May 07, 2012 at 03:19:50PM -0400, KOSAKI Motohiro wrote:
[...]
> You don't understand the issue.

Apparently.

> The point is NOT a formula. The problem
> is, dirty and non-dirty pages aren't isolated in our kernel. Then, kernel
> start to get stuck far before non-dirty pages become empty. Lie notification
> always useless.

Ugh. I don't get it (yeah, see above :-), in what sense they're not
isolated? In sense of isolate_lru_page and friends? Yes, they're not
isolated, but how that makes the notifications untrustworthy?

I'm confused. Can you elaborate a bit?

> >Even more, we may introduce two attributes:
> >
> >RECLAIMABLE_CACHE_PAGES and
> >RECLAIMABLE_CACHE_PAGES_NOIO (which excludes dirty pages).
> >
> >This makes ABI detached from the mm internals and still keeps a
> >defined meaning of the attributes.
>
> Collection of craps are also crap. If you want to improve userland
> notification, you should join VM improvement activity.

I'm all for improving VM, but please, be specific. I'm assuming
there is currently some efforts on VM improvements, which I'm
not aware of. Or there are some plans or thoughts on improvements --
please tell what are they.

> You shouldn't
> think nobody except you haven't think userland notification feature.

That was never my assumption; surely many people have worked on userland
notifications, and still, today we have none that would fully suit
Android's or Nokia's (or "embedded people's") needs, right? ;-)

So, let's try solve things.

Memcg is currently not usable for us, and I explained why (the slab
accounting for root cgroup thing: http://lkml.org/lkml/2012/4/30/115 ),
any comments?

> The problem is, Any current kernel vm statistics were not created for
> such purpose and don't fit.

OK, presuming current statistics don't fit, which ones should we
implement? How do you see it?

> Even though, some inaccurate and incorrect statistics fit _your_ usecase,
> they definitely don't fit other. And their people think it is bug.

I'm all for making a single solution for your and ours use cases, but
you don't say anything specific.

(Btw, what are your use cases?)

> >>2) libc and some important library's pages are critical important
> >>for running a system even though it is clean and reclaimable. In other
> >>word, kernel don't have an info then can't expose it.
> >
> >First off, I guess LRU would try to keep important/most used pages in
> >the cache, as we try to never fully drain page cache to the zero mark.

*1

> Yes, what do you want say?

> >Secondly, if we're really low on memory (which low memory notifications
> >help to prevent) and kernel decided to throw libc's pages out of the
> >cache, you'll get cache miss and kernel will have to read it back. Well,
> >sometimes cache misses do happen, that's life. And if somebody really
> >don't want this for the essential parts of the system, one have to
> >mlock it (which eliminates your "kernel don't have an info" argument).
>
> First off, "low memory" is very poor definition and we must not use it.

OK.

> It is multiple meanings.
> 1) System free memory is low. Some embedded have userland

The 'free' has multiple meanings as well. For us, it is
'not-used-at-all-pages + the-pages-that-we-can-get-in-a-
jiffy-and-not-disturb-things-much'.

The 'not-disturb-things-much' has a moot meaning as well, so all this
should probably be tunable. Cool, so let's give the userspace all the
needed statistics to decide on these meanings.

> oom killer and they want to know _system_ status. 2) available memory is low.
> This is different from (1) when using NUMA, memcg or cpusets. And in nowadays,
> almost all x86 box have numa. This is userful for swap avoidance activity if
> we can implement correctly.

I don't get it: you don't see '1)' as a use case? You're saying
that the meanings are different when using NUMA/memcg. If we don't
use memcg, what statistics should we use?

OK, if you are hinting that memcg should be mandatory for proper
statistics accounting, then please comment on the current memcg
issues, which don't let us do '1)' via '2)'.

> Secondly, we can't assume someone mlock to libc. Because of, Linux is generic
> purpose kernel.

You said that libc pages are important, implying that ideally they should
never leave the page cache (i.e. we should not count the pages as 'easily
reclaimable').

I answered that if we are OK with a "not guaranteed, but we'll do our
best" strategy, then we just don't let the caches drain fully, and the LRU
will try to keep the "most important" pages (apparently libc) in the
cache. *1 It is surely userland's task to maintain the needed amount of
memory, and to do this efficiently we need..... notifications, that's right.

But if you want a guarantee, I guess mlock() is the only option -- it is
the only way to tell the kernel that the pages are really not to be
reclaimed.

So, in the light of 'easily reclaimable pages' statistics, what for was
your libc point again? How would you solve "the libc problem"?

Thanks!

--
Anton Vorontsov
Email: [email protected]

2012-05-08 05:20:26

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Mon, May 7, 2012 at 10:19 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> Even more, we may introduce two attributes:
>>
>> RECLAIMABLE_CACHE_PAGES and
>> RECLAIMABLE_CACHE_PAGES_NOIO (which excludes dirty pages).
>>
>> This makes ABI detached from the mm internals and still keeps a
>> defined meaning of the attributes.
>
> Collection of craps are also crap. If you want to improve userland
> notification, you should join VM improvement activity. You shouldn't
> think nobody except you haven't think userland notification feature.
>
> The problem is, Any current kernel vm statistics were not created for
> such purpose and don't fit.
>
> Even though, some inaccurate and incorrect statistics fit _your_ usecase,
> they definitely don't fit other. And their people think it is bug.

Well, yeah, if we are to report _number of pages_, the numbers better
be meaningful.

That said, I think you are being unfair to Anton who's one of the few
that's actually taking the time to implement this properly instead of
settling for an out-of-tree hack.

2012-05-08 05:42:27

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 8, 2012 at 1:20 AM, Pekka Enberg <[email protected]> wrote:
> On Mon, May 7, 2012 at 10:19 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> Even more, we may introduce two attributes:
>>>
>>> RECLAIMABLE_CACHE_PAGES and
>>> RECLAIMABLE_CACHE_PAGES_NOIO (which excludes dirty pages).
>>>
>>> This makes ABI detached from the mm internals and still keeps a
>>> defined meaning of the attributes.
>>
>> Collection of craps are also crap. If you want to improve userland
>> notification, you should join VM improvement activity. You shouldn't
>> think nobody except you haven't think userland notification feature.
>>
>> The problem is, Any current kernel vm statistics were not created for
>> such purpose and don't fit.
>>
>> Even though, some inaccurate and incorrect statistics fit _your_ usecase,
>> they definitely don't fit other. And their people think it is bug.
>
> Well, yeah, if we are to report _number of pages_, the numbers better
> be meaningful.
>
> That said, I think you are being unfair to Anton who's one of the few
> that's actually taking the time to implement this properly instead of
> settling for an out-of-tree hack.

Unfair? But I am only making technical comments. To be honest, I really
dislike having to give the same explanation again and again. A lot of
people don't read past discussions, and as long as the patches repeat the
same mistake, I must say the same thing. It is just a PITA.

I don't disagree with vmevent notification itself, but I must disagree
with lie notifications. Also, making up statistics from a mere idea doesn't
make sense at all. How does an application choose the right events? If that
depends on the hardware configuration, userland developers can't write
proper applications.

That's the reason why I often disagree with random new features.

2012-05-08 05:53:58

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro
<[email protected]> wrote:
>> That said, I think you are being unfair to Anton who's one of the few
>> that's actually taking the time to implement this properly instead of
>> settling for an out-of-tree hack.
>
> Unfair? But only I can talk about technical comment. To be honest, I
> really dislike
> I need say the same explanation again and again. A lot of people don't read
> past discussion. And as far as the patches take the same mistake, I must say
> the same thing. It is just PITA.

Unfair because you are trying to make it look as if Anton is only
concerned with his specific use case. That's simply not true.

On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro
<[email protected]> wrote:
> I don't disagree vmevent notification itself, but I must disagree lie
> notification.
> And also, To make just idea statistics doesn't make sense at all. How do an
> application choose the right events? If that depend on hardware configuration,
> userland developers can't write proper applications.

That's exactly the problem we're trying to tackle here! We _want_ the
ABI to provide sane, well-defined events that solve real world
problems.

Pekka

2012-05-08 07:00:17

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 08, 2012 at 01:42:05AM -0400, KOSAKI Motohiro wrote:
[...]
> > Well, yeah, if we are to report _number of pages_, the numbers better
> > be meaningful.
> >
> > That said, I think you are being unfair to Anton who's one of the few
> > that's actually taking the time to implement this properly instead of
> > settling for an out-of-tree hack.
>
> Unfair? But only I can talk about technical comment. To be honest, I
> really dislike
> I need say the same explanation again and again. A lot of people don't read
> past discussion. And as far as the patches take the same mistake, I must say
> the same thing. It is just PITA.

Note that just telling people that something is a PITA doesn't help solve
things (so people will come back to you with stupid questions over and
over again). You can call people morons, idiots and dumbasses (that's
all fine), but still find a way to be productive. :-)

You could just give a link to a previous discussion, in which you think
you explained all your concerns regarding cache handling issues, or
memory notifications/statistics in general.

So, feel free to call me an idiot, but please expand your points a
little bit or give a link to the discussion you're referring to?

Thanks,

--
Anton Vorontsov
Email: [email protected]

2012-05-08 07:12:12

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 8, 2012 at 1:53 AM, Pekka Enberg <[email protected]> wrote:
> On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> That said, I think you are being unfair to Anton who's one of the few
>>> that's actually taking the time to implement this properly instead of
>>> settling for an out-of-tree hack.
>>
>> Unfair? But only I can talk about technical comment. To be honest, I
>> really dislike
>> I need say the same explanation again and again. A lot of people don't read
>> past discussion. And as far as the patches take the same mistake, I must say
>> the same thing. It is just PITA.
>
> Unfair because you are trying to make it look as if Anton is only
> concerned with his specific use case. That's simply not true.

However, the current proposal certainly doesn't take the past discussion
into account and doesn't work in many environments.


> On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro
> <[email protected]> wrote:
>> I don't disagree vmevent notification itself, but I must disagree lie
>> notification.
>> And also, To make just idea statistics doesn't make sense at all. How do an
>> application choose the right events? If that depend on hardware configuration,
>> userland developers can't write proper applications.
>
> That's exactly the problem we're trying to tackle here! We _want_ the
> ABI to provide sane, well-defined events that solve real world
> problems.

Ok, sane. Then I took a little time and briefly reviewed the current
vmevent code. (I read the vmevent/core branch in Pekka's tree; please let
me know if there is a newer repository.)

I think the following things should be fixed.

1) sample_period is a brain-damaged idea. If people ONLY need to sample
statistics, they only need to read /proc/vmstat periodically. Just remove
it and implement push notification. _IF_ someone needs an infrequent level
trigger, they can just use "usleep(timeout); read(vmevent_fd)" in the
userland code.
2) VMEVENT_ATTR_STATE_ONE_SHOT is a misleading name. It acts as an edge
trigger, not just once.
3) vmevent_fd() seems a sane interface, but it is namespace-unaware. Maybe
we should discuss how to harmonize it with the namespace features. No
hurry, but we have to think about that issue from the beginning.
4) Currently vmstat has a per-cpu batch, and vmstat updating can be delayed
by up to 3 seconds. This is fine for the usual case, because most userland
watchers only read /proc/vmstat once per second. But for the vmevent_fd()
case, 3 seconds may be an unacceptable delay. At worst, there is
128 (batch) x 4096 (CPUs) x 4k page size = 2G bytes of inaccuracy.
5) __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from the headers
exported to userland. Whenever kernel internals are exported, some silly
guys always use them and end up unhappy.
6) vmevent_event must also be hidden from userland.
7) vmevent_config::size must be removed. In the 20th century, M$ APIs
preferred this technique, but they dropped it because a lot of applications
didn't initialize the size member, so it couldn't be used to keep upward
compatibility.
8) memcg-unaware.
9) NUMA-unaware.
10) zone-unaware.

And we may need VM-internal changes if we really want low-memory
notification; the current kernel doesn't have such info. _And_ there is one
more big problem: the kernel currently maintains memory per zone, but
almost no userland application is aware of zones or nodes, so raw per-zone
notifications aren't useful for userland. On the other hand, are total
memory and total free memory useful? Definitely not! Even when total free
memory is plentiful, the system may start swapping out and invoking the
OOM killer. If we can't prevent OOM invocation, this feature has a serious
raison d'etre problem. (I.e. (4), (8), (9) and (10) are not ignorable
issues, I think.)

2012-05-08 07:17:04

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

(5/8/12 2:58 AM), Anton Vorontsov wrote:
> On Tue, May 08, 2012 at 01:42:05AM -0400, KOSAKI Motohiro wrote:
> [...]
>>> Well, yeah, if we are to report _number of pages_, the numbers better
>>> be meaningful.
>>>
>>> That said, I think you are being unfair to Anton who's one of the few
>>> that's actually taking the time to implement this properly instead of
>>> settling for an out-of-tree hack.
>>
>> Unfair? But only I can talk about technical comment. To be honest, I
>> really dislike
>> I need say the same explanation again and again. A lot of people don't read
>> past discussion. And as far as the patches take the same mistake, I must say
>> the same thing. It is just PITA.
>
> Note that just telling people that something is PITA doesn't help solve
> things (so people will come back to you with stupid questions over and
> over again). You can call people morons, idiots and dumbasses (that's
> all fine) but still finding a way to be productive. :-)
>
> You could just give a link to a previous discussion, in which you think
> you explained all your concerns regarding cache handling issues, or
> memory notifications/statistics in general.
>
> So, feel free to call me an idiot, but please expand your points a
> little bit or give a link to the discussion you're referring to?

I don't think you are an idiot. But I hope you test your patches before
submitting. This just doesn't work, especially on x86, because all x86
boxes have multiple zones, so the summarized statistics (i.e. the
global_page_state() thing) don't work and can't prevent OOM or swapping.

And please see my previous mail.

2012-05-08 07:36:33

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 8, 2012 at 10:11 AM, KOSAKI Motohiro
<[email protected]> wrote:
> Ok, sane. Then I take my time a little and review current vmevent code briefly.
> (I read vmevent/core branch in pekka's tree. please let me know if
> there is newer repositry)

It's the latest one.

On Tue, May 8, 2012 at 10:11 AM, KOSAKI Motohiro
<[email protected]> wrote:
> 1) sample_period is brain damaged idea. If people ONLY need to
> sampling stastics, they
> only need to read /proc/vmstat periodically. just remove it and
> implement push notification.
> _IF_ someone need unfrequent level trigger, just use
> "usleep(timeout); read(vmevent_fd)"
> on userland code.

That comes from a real-world requirement. See Leonid's email on the topic:

https://lkml.org/lkml/2012/5/2/42

> 2) VMEVENT_ATTR_STATE_ONE_SHOT is misleading name. That is effect as
> edge trigger shot. not only once.

Would VMEVENT_ATTR_STATE_EDGE_TRIGGER be a better name?

> 3) vmevent_fd() seems sane interface. but it has name space unaware.
> maybe we discuss how to harmonize name space feature. No hurry. but we have
> to think that issue since at beginning.

You mean VFS namespaces? Yeah, we need to take care of that.

> 4) Currently, vmstat have per-cpu batch and vmstat updating makes 3
> second delay at maximum.
> This is fine for usual case because almost userland watcher only
> read /proc/vmstat per second.
> But, for vmevent_fd() case, 3 seconds may be unacceptable delay. At
> worst, 128 batch x 4096
> x 4k pagesize = 2G bytes inaccurate is there.

That's pretty awful. Anton, Leonid, comments?

> 5) __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from userland
> exporting files.
> When exporing kenrel internal, always silly gus used them and made unhappy.

Agreed. Anton, care to cook up a patch to do that?

> 6) Also vmevent_event must hide from userland.

Why? That's part of the ABI.

> 7) vmevent_config::size must be removed. In 20th century, M$ API
> prefer to use this technique. But
> They dropped the way because a lot of application don't initialize
> size member and they can't use it for keeping upper compitibility.

It's there to support forward/backward ABI compatibility like perf
does. I'm going to keep it for now but I'm open to dropping it when
the ABI is more mature.

> 8) memcg unaware
> 9) numa unaware
> 10) zone unaware

Yup.

> And, we may need vm internal change if we really need lowmem
> notification. current kernel don't have such info. _And_ there is one more
> big problem. Currently the kernel maintain memory per
> zone. But almost all userland application aren't aware zone nor node.
> Thus raw notification aren't useful for userland. In the other hands, total
> memory and total free memory is useful? Definitely No!
> Even though total free memory are lots, system may start swap out and
> oom invokation. If we can't oom invocation, this feature has serious raison
> d'etre issue. (i.e. (4), (8), (9) and (19) are not ignorable issue. I think)

I'm guessing most of the existing solutions get away with
approximations and soft limits because they're mostly used on UMA
embedded machines.

But yes, we need to do better here.

2012-05-08 07:51:34

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

(5/8/12 3:36 AM), Pekka Enberg wrote:
> On Tue, May 8, 2012 at 10:11 AM, KOSAKI Motohiro
> <[email protected]> wrote:
>> Ok, sane. Then I take my time a little and review current vmevent code briefly.
>> (I read vmevent/core branch in pekka's tree. please let me know if
>> there is newer repositry)
>
> It's the latest one.
>
> On Tue, May 8, 2012 at 10:11 AM, KOSAKI Motohiro
> <[email protected]> wrote:
>> 1) sample_period is brain damaged idea. If people ONLY need to
>> sampling stastics, they
>> only need to read /proc/vmstat periodically. just remove it and
>> implement push notification.
>> _IF_ someone need unfrequent level trigger, just use
>> "usleep(timeout); read(vmevent_fd)"
>> on userland code.
>
> That comes from a real-world requirement. See Leonid's email on the topic:
>
> https://lkml.org/lkml/2012/5/2/42

I know, many embedded guys prefer such a timer interval. I also had
experience with similar logic when I was a TV box developer, but I must
disagree. Some people want to push the timer housekeeping complexity into
the kernel, but I haven't seen any justification.


>> 2) VMEVENT_ATTR_STATE_ONE_SHOT is misleading name. That is effect as
>> edge trigger shot. not only once.
>
> Would VMEVENT_ATTR_STATE_EDGE_TRIGGER be a better name?

maybe.


>> 3) vmevent_fd() seems sane interface. but it has name space unaware.
>> maybe we discuss how to harmonize name space feature. No hurry. but we have
>> to think that issue since at beginning.
>
> You mean VFS namespaces? Yeah, we need to take care of that.

If we keep the current vmevent_fd() design, we may need to create a new
namespace concept like the IPC namespace; the current vmevent_fd() is not
VFS-based.


>> 4) Currently, vmstat have per-cpu batch and vmstat updating makes 3
>> second delay at maximum.
>> This is fine for usual case because almost userland watcher only
>> read /proc/vmstat per second.
>> But, for vmevent_fd() case, 3 seconds may be unacceptable delay. At
>> worst, 128 batch x 4096
>> x 4k pagesize = 2G bytes inaccurate is there.
>
> That's pretty awful. Anton, Leonid, comments?
>
>> 5) __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from userland
>> exporting files.
>> When exporing kenrel internal, always silly gus used them and made unhappy.
>
> Agreed. Anton, care to cook up a patch to do that?
>
>> 6) Also vmevent_event must hide from userland.
>
> Why? That's part of the ABI.

Ahhh, if so, I missed something. As far as I can see, vmevent_fd() only
depends on vmevent_config. Which syscall depends on vmevent_event?



>> 7) vmevent_config::size must be removed. In 20th century, M$ API
>> prefer to use this technique. But
>> They dropped the way because a lot of application don't initialize
>> size member and they can't use it for keeping upper compitibility.
>
> It's there to support forward/backward ABI compatibility like perf
> does. I'm going to keep it for now but I'm open to dropping it when
> the ABI is more mature.

The perf API is not intended to be used by generic applications, so I
don't think it will create an ABI issue; tools/perf is sane, isn't it? But
vmevent_fd() is a generic API, and unfortunately we can't trust every
userland guy to be sane.

>> 8) memcg unaware
>> 9) numa unaware
>> 10) zone unaware
>
> Yup.
>
>> And, we may need vm internal change if we really need lowmem
>> notification. current kernel don't have such info. _And_ there is one more
>> big problem. Currently the kernel maintain memory per
>> zone. But almost all userland application aren't aware zone nor node.
>> Thus raw notification aren't useful for userland. In the other hands, total
>> memory and total free memory is useful? Definitely No!
>> Even though total free memory are lots, system may start swap out and
>> oom invokation. If we can't oom invocation, this feature has serious raison
> d'etre issue. (i.e. (4), (8), (9) and (10) are not ignorable issue. I think)
>
> I'm guessing most of the existing solutions get away with
> approximations and soft limits because they're mostly used on UMA
> embedded machines.
>
> But yes, we need to do better here.

Hm. If you want to make vmevent depend on CONFIG_EMBEDDED, I have no
reason to complain about this feature. In that world, almost all
applications _know_ their system configuration, so I don't think API misuse
is a big issue.




2012-05-08 08:04:01

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 8, 2012 at 10:50 AM, KOSAKI Motohiro
<[email protected]> wrote:
>>> 1) sample_period is brain damaged idea. If people ONLY need to
>>> sampling stastics, they
>>> only need to read /proc/vmstat periodically. just remove it and
>>> implement push notification.
>>> _IF_ someone need unfrequent level trigger, just use
>>> "usleep(timeout); read(vmevent_fd)"
>>> on userland code.
>>
>> That comes from a real-world requirement. See Leonid's email on the topic:
>>
>> https://lkml.org/lkml/2012/5/2/42
>
> I know, many embedded guys prefer such timer interval. I also have an
> experience
> similar logic when I was TV box developer. but I must disagree. Someone hope
> timer housekeeping complexity into kernel. but I haven't seen any
> justification.

Leonid?

>>> 6) Also vmevent_event must hide from userland.
>>
>> Why? That's part of the ABI.
>
> Ahhh, if so, I missed something. as far as I look, vmevent_fd() only depend
> on vmevent_config. which syscall depend on vmevent_evennt?

read(2). See tools/testing/vmevent/vmevent-test.c for an example how
the ABI is used.

>>> 7) vmevent_config::size must be removed. In 20th century, M$ API
>>> prefer to use this technique. But
>>> They dropped the way because a lot of application don't initialize
>>> size member and they can't use it for keeping upper compitibility.
>>
>> It's there to support forward/backward ABI compatibility like perf
>> does. I'm going to keep it for now but I'm open to dropping it when
>> the ABI is more mature.
>
> perf api is not intended to use from generic applications. then, I don't
> think it will make abi issue. tool/perf is sane, isn't it? but vmevent_fd()
> is generic api and we can't trust all userland guy have sane, unfortunately.

The perf ABI is being used by other projects as well. We don't
_depend_ on ->size being sane as much as use it to detect new features
in the future.

But anyway, we can consider dropping once the ABI is more stable.

> Hm. If you want vmevent makes depend on CONFIG_EMBEDDED, I have no reason to
> complain this feature. At that world, almost all applications _know_ their
> system configuration. then I don't think api misuse issue is big matter.

No, I don't want to do that. I was just commenting on why existing
solutions are the way they are.

2012-05-08 08:14:34

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 08, 2012 at 03:16:59AM -0400, KOSAKI Motohiro wrote:
[...]
> >So, feel free to call me an idiot, but please expand your points a
> >little bit or give a link to the discussion you're referring to?
>
> I don't think you are an idiot. But I hope you test your patches before submitting.
> This just doesn't work, especially on x86, because all x86 boxes have multiple zones,
> so summarized statistics (i.e. the global_page_state() thing) don't work and can't
> prevent OOM or swapping.

Now I think I understand you: we don't take into account that e.g. DMA
zone is not usable by the normal allocations, and so if we're basing our
calculations on summarized stats, it is indeed possible to get an OOM
in such a case. On Android kernels nobody noticed the issue as it is
used mainly on ARM targets, which might have CONFIG_ZONE_DMA=n.

So, this particular issue makes sense now.

Thanks!

--
Anton Vorontsov
Email: [email protected]

2012-05-08 08:23:17

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 08, 2012 at 01:13:05AM -0700, Anton Vorontsov wrote:
> On Tue, May 08, 2012 at 03:16:59AM -0400, KOSAKI Motohiro wrote:
> [...]
> > >So, feel free to call me an idiot, but please expand your points a
> > >little bit or give a link to the discussion you're referring to?
> >
> > I don't think you are an idiot. But I hope you test your patches before submitting.
> > This just doesn't work, especially on x86, because all x86 boxes have multiple zones,
> > so summarized statistics (i.e. the global_page_state() thing) don't work and can't
> > prevent OOM or swapping.
>
> Now I think I understand you: we don't take into account that e.g. DMA
> zone is not usable by the normal allocations, and so if we're basing our
> calculations on summarized stats, it is indeed possible to get an OOM
> in such a case.

Oops. Looking into it more, I think I was wrong here: the kernel will surely
use pages from the DMA zone when there are no pages left in the normal zones.

So, I don't see how we can get OOM in that case.

Hm.

--
Anton Vorontsov
Email: [email protected]

2012-05-08 08:32:42

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On 05/08/2012 04:11 PM, KOSAKI Motohiro wrote:

> On Tue, May 8, 2012 at 1:53 AM, Pekka Enberg <[email protected]> wrote:
>> On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro
>> <[email protected]> wrote:
>>>> That said, I think you are being unfair to Anton who's one of the few
>>>> that's actually taking the time to implement this properly instead of
>>>> settling for an out-of-tree hack.
>>>
>>> Unfair? I can only comment on technical matters. To be honest, I
>>> really dislike
>>> having to give the same explanation again and again. A lot of people don't read
>>> past discussions, and as long as the patches repeat the same mistakes, I must say
>>> the same thing. It is just a PITA.
>>
>> Unfair because you are trying to make it look as if Anton is only
>> concerned with his specific use case. That's simply not true.
>
> However, the current proposal certainly doesn't refer to past discussions and
> doesn't work in many environments.
>
>
>> On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro
>> <[email protected]> wrote:
>>> I don't disagree with vmevent notification itself, but I must disagree with
>>> lying notifications.
>>> Also, making up idealized statistics doesn't make sense at all. How does an
>>> application choose the right events? If that depends on the hardware configuration,
>>> userland developers can't write proper applications.
>>
>> That's exactly the problem we're trying to tackle here! We _want_ the
>> ABI to provide sane, well-defined events that solve real world
>> problems.
>
> OK, sane. Then let me take a little time and briefly review the current
> vmevent code. (I read the vmevent/core branch in Pekka's tree; please let
> me know if there is a newer repository.)
>
> I think the following things should be fixed.
>
> 1) sample_period is a brain-damaged idea. If people ONLY need to sample
>    statistics, they only need to read /proc/vmstat periodically. Just
>    remove it and implement push notification.
>    _IF_ someone needs an infrequent level trigger, just use
>    "usleep(timeout); read(vmevent_fd)" in userland code.
> 2) VMEVENT_ATTR_STATE_ONE_SHOT is a misleading name. It acts as an edge
>    trigger, not only once.
> 3) vmevent_fd() seems like a sane interface, but it is namespace-unaware.
>    Maybe we should discuss how to harmonize it with the namespace feature.
>    No hurry, but we have to think about that issue from the beginning.
> 4) Currently, vmstat has a per-cpu batch, and a vmstat update can be
>    delayed by up to 3 seconds. This is fine for the usual case, because
>    almost every userland watcher only reads /proc/vmstat once per second.
>    But for the vmevent_fd() case, a 3-second delay may be unacceptable.
>    At worst, a 128 batch x 4096 x 4k page size = 2G bytes inaccuracy is
>    there.
> 5) __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from the
>    userland-exported files. When kernel internals are exported, silly
>    guys always end up using them and becoming unhappy.
> 6) Also, vmevent_event must be hidden from userland.
> 7) vmevent_config::size must be removed. In the 20th century, the M$ API
>    preferred this technique, but they dropped it because a lot of
>    applications didn't initialize the size member, so it couldn't be used
>    to keep forward compatibility.
> 8) memcg-unaware.
> 9) NUMA-unaware.
> 10) zone-unaware.


I would like to add a concern.

11) understand storage speed.

As I mentioned, a system can have various storage types (SSD, disk, eMMC, ramfs).
In some systems, the user can tolerate ramfs and SSD writes, or swapout.
We should consider that to make this really useful.

The problem is that the user can't know this in advance, so it should be detected by the kernel.
Unfortunately, that's not easy right now.

The idea is that we can define some levels in advance and explain them to the user:

Level 1: the most immediate response, sent when the kernel decides there are no fast-reclaimable pages any more.
Level 2: a rather slower response than level 1, because the kernel will also consider slower-to-reclaim pages as reclaimable targets.
Level 3: the slowest response, because the kernel will consider even pages that need a long time to reclaim as reclaimable targets.

It doesn't expose any kernel internals, and we can implement it entirely internally.
As a simple example:

Level 1: non-mapped clean pages
Level 2: Level 1 + mapped clean pages
Level 3: Level 2 + dirty pages

So users of vmevent_fd can select their level.
Of course, a latency-sensitive application with slow storage would select Level 1.
Some application might use Level 4 (Level 3 + half of swap) because it has very fast storage.

And an application that receives an event can build a strategy as follows:

When it receives a level 1 notification, it could ask other processes to release their own buffers.
When it receives a level 2 notification, it could ask non-critical applications to exit.
When it receives a level 3 notification, it could kill others.

It's just an example; my point is that we should take storage speed into account to make this general.

--
Kind regards,
Minchan Kim

2012-05-08 09:16:09

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 3/3] vmevent: Implement special low-memory attribute

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of ext
> Pekka Enberg
> Sent: 08 May, 2012 11:03
> To: KOSAKI Motohiro
> Cc: Anton Vorontsov; Minchan Kim; Moiseichuk Leonid (Nokia-MP/Espoo); John
...
> >> That comes from a real-world requirement. See Leonid's email on the topic:
> >>
> >> https://lkml.org/lkml/2012/5/2/42
> >
> > I know, many embedded guys prefer such a timer interval. I also have
> > experience with similar logic from when I was a TV box developer, but
> > I must disagree. Some hope to move the timer-housekeeping complexity
> > into the kernel, but I haven't seen any justification.
>
> Leonid?

The "usleep(timeout); read(vmevent_fd)" approach would eliminate the opportunity to use the vmevent API on mobile devices.
Developers already have to use heartbeat primitives to align/sync timers, and updating that code is not always simple to do.
But the idea is to wake up user space only if something has changed in the memory numbers, so aligned timers will not help much in the vmevent case, because the memory situation can change a lot in a short time.
How short depends on the software stack, but it is usually below 1s. To keep use-time and wakeups at a good level (below 50 Hz as reported by e.g. powertop) and to let the CPU switch off timers, periods as short as 1s are not allowed.

Leonid
PS: Sorry, meetings prevent me from doing interesting things :( I am tracking the conversation with quite a low understanding of how it will be useful for practical needs, because in 80% of cases user-space developers simply need to track dirty-memory changes, i.e. modified pages which cannot be dropped.

2012-05-08 09:20:12

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 8, 2012 at 12:15 PM, <[email protected]> wrote:
> I am tracking the conversation with quite a low understanding of how it will be
> useful for practical needs, because in 80% of cases user-space developers simply
> need to track dirty-memory changes, i.e. modified pages which cannot be dropped.

The point is to support those cases in such a way that works sanely
across different architectures and configurations.

2012-05-08 09:27:42

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, May 8, 2012 at 11:32 AM, Minchan Kim <[email protected]> wrote:
> The idea is that we can define some levels in advance and explain them to the user:
>
> Level 1: the most immediate response, sent when the kernel decides there are no fast-reclaimable pages any more.
> Level 2: a rather slower response than level 1, because the kernel will also consider slower-to-reclaim pages as reclaimable targets.
> Level 3: the slowest response, because the kernel will consider even pages that need a long time to reclaim as reclaimable targets.
>
> It doesn't expose any kernel internals, and we can implement it entirely internally.
> As a simple example:
>
> Level 1: non-mapped clean pages
> Level 2: Level 1 + mapped clean pages
> Level 3: Level 2 + dirty pages
>
> So users of vmevent_fd can select their level.

I'm totally OK with pursuing something like this if people like Leonid
and Anton think it's useful for their use-cases.

2012-05-08 10:41:33

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 3/3] vmevent: Implement special low-memory attribute

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of ext
> Pekka Enberg
> Sent: 08 May, 2012 12:20
...
> On Tue, May 8, 2012 at 12:15 PM, <[email protected]> wrote:
> > I am tracking the conversation with quite a low understanding of how it will be
> > useful for practical needs, because in 80% of cases user-space developers
> > simply need to track dirty-memory changes, i.e. modified pages which cannot
> > be dropped.
>
> The point is to support those cases in such a way that works sanely across
> different architectures and configurations.

Which usually means you need to raise the level of abstraction from hugepages, swapped, kernel-reclaimable, slab-allocated, and active memory to used, cache, and must-have memory, which are common to all platforms. Do you have visibility into what you need to show and how you will calculate it? Is it possible to do it for the whole system, for a group of processes, or for a particular process?

I implemented the system-wide variant because it was the simplest one and the one I needed most urgently. But to say, e.g., how much memory a particular process consumes on Linux, you need to use smaps. Probably vmevent needs some sketches (aka a roadmap) in this direction as well.

2012-06-01 12:22:57

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH 0/5] Some vmevent fixes...

On Tue, May 08, 2012 at 10:36:31AM +0300, Pekka Enberg wrote:
[...]
> > 2) VMEVENT_ATTR_STATE_ONE_SHOT is a misleading name. It acts as an edge
> > trigger, not only once.
>
> Would VMEVENT_ATTR_STATE_EDGE_TRIGGER be a better name?
[...]
> > 4) Currently, vmstat has a per-cpu batch, and a vmstat update can be
> > delayed by up to 3 seconds. This is fine for the usual case, because
> > almost every userland watcher only reads /proc/vmstat once per second.
> > But for the vmevent_fd() case, a 3-second delay may be unacceptable. At
> > worst, a 128 batch x 4096 x 4k page size = 2G bytes inaccuracy is there.
>
> That's pretty awful. Anton, Leonid, comments?
[...]
> > 5) __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from the
> > userland-exported files. When kernel internals are exported, silly guys
> > always end up using them and becoming unhappy.
>
> Agreed. Anton, care to cook up a patch to do that?

KOSAKI-San, Pekka,

Much thanks for your reviews!

These three issues should be fixed by the following patches. One mm/
change is needed outside of vmevent...

And I'm looking into other issues you pointed out...

Thanks!

---
include/linux/vmevent.h | 10 +++----
include/linux/vmstat.h | 2 ++
mm/vmevent.c | 66 +++++++++++++++++++++++++++++------------------
mm/vmstat.c | 22 +++++++++++++++-
4 files changed, 68 insertions(+), 32 deletions(-)

--
Anton Vorontsov
Email: [email protected]

2012-06-01 12:26:33

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

We'll need to use smp_call_function() in the sampling routines, and it is
not supposed to be called from the bottom halves. So, let's
convert vmevent to deferred workqueues.

As a side effect, we also fix the swap reporting (we cannot call
si_swapinfo from the interrupt context), i.e. the following oops should
be fixed now:

=================================
[ INFO: inconsistent lock state ]
3.4.0-rc1+ #37 Not tainted
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(swap_lock){+.?...}, at: [<ffffffff8110449d>] si_swapinfo+0x1d/0x90
{SOFTIRQ-ON-W} state was registered at:
[<ffffffff8107ca7f>] mark_irqflags+0x15f/0x1b0
[<ffffffff8107e5e3>] __lock_acquire+0x493/0x9d0
[<ffffffff8107f20e>] lock_acquire+0x9e/0x200
[<ffffffff813e9071>] _raw_spin_lock+0x41/0x50
[<ffffffff8110449d>] si_swapinfo+0x1d/0x90
[<ffffffff8117e7c8>] meminfo_proc_show+0x38/0x3f0
[<ffffffff81141209>] seq_read+0x139/0x3f0
[<ffffffff81174cc6>] proc_reg_read+0x86/0xc0
[<ffffffff8111c19c>] vfs_read+0xac/0x160
[<ffffffff8111c29a>] sys_read+0x4a/0x90
[<ffffffff813ea652>] system_call_fastpath+0x16/0x1b

Signed-off-by: Anton Vorontsov <[email protected]>
---
mm/vmevent.c | 49 ++++++++++++++++++++++++++++---------------------
1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/mm/vmevent.c b/mm/vmevent.c
index 381e9d1..4ca2a04 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -3,7 +3,7 @@
#include <linux/compiler.h>
#include <linux/vmevent.h>
#include <linux/syscalls.h>
-#include <linux/timer.h>
+#include <linux/workqueue.h>
#include <linux/file.h>
#include <linux/list.h>
#include <linux/poll.h>
@@ -34,7 +34,7 @@ struct vmevent_watch {
struct vmevent_attr *config_attrs[VMEVENT_CONFIG_MAX_ATTRS];

/* sampling */
- struct timer_list timer;
+ struct delayed_work work;

/* poll */
wait_queue_head_t waitq;
@@ -146,15 +146,13 @@ static bool vmevent_match(struct vmevent_watch *watch)
}

/*
- * This function is called from the timer context, which has the same
- * guaranties as an interrupt handler: it can have only one execution
- * thread (unlike bare softirq handler), so we don't need to worry
- * about racing w/ ourselves.
+ * This function is called from a workqueue, which can have only one
+ * execution thread, so we don't need to worry about racing w/ ourselves.
*
- * We also don't need to worry about several instances of timers
- * accessing the same vmevent_watch, as we allocate vmevent_watch
- * together w/ the timer instance in vmevent_fd(), so there is always
- * one timer per vmevent_watch.
+ * We also don't need to worry about several instances of us accessing
+ * the same vmevent_watch, as we allocate vmevent_watch together w/ the
+ * work instance in vmevent_fd(), so there is always one work per
+ * vmevent_watch.
*
* All the above makes it possible to implement the lock-free logic,
* using just the atomic watch->pending variable.
@@ -178,26 +176,35 @@ static void vmevent_sample(struct vmevent_watch *watch)
atomic_set(&watch->pending, 1);
}

-static void vmevent_timer_fn(unsigned long data)
+static void vmevent_schedule_watch(struct vmevent_watch *watch)
{
- struct vmevent_watch *watch = (struct vmevent_watch *)data;
+ schedule_delayed_work(&watch->work,
+ nsecs_to_jiffies64(watch->config.sample_period_ns));
+}
+
+static struct vmevent_watch *work_to_vmevent_watch(struct work_struct *work)
+{
+ struct delayed_work *wk = to_delayed_work(work);
+
+ return container_of(wk, struct vmevent_watch, work);
+}
+
+static void vmevent_timer_fn(struct work_struct *work)
+{
+ struct vmevent_watch *watch = work_to_vmevent_watch(work);

vmevent_sample(watch);

if (atomic_read(&watch->pending))
wake_up(&watch->waitq);
- mod_timer(&watch->timer, jiffies +
- nsecs_to_jiffies64(watch->config.sample_period_ns));
+
+ vmevent_schedule_watch(watch);
}

static void vmevent_start_timer(struct vmevent_watch *watch)
{
- init_timer_deferrable(&watch->timer);
- watch->timer.data = (unsigned long)watch;
- watch->timer.function = vmevent_timer_fn;
- watch->timer.expires = jiffies +
- nsecs_to_jiffies64(watch->config.sample_period_ns);
- add_timer(&watch->timer);
+ INIT_DELAYED_WORK_DEFERRABLE(&watch->work, vmevent_timer_fn);
+ vmevent_schedule_watch(watch);
}

static unsigned int vmevent_poll(struct file *file, poll_table *wait)
@@ -259,7 +266,7 @@ static int vmevent_release(struct inode *inode, struct file *file)
{
struct vmevent_watch *watch = file->private_data;

- del_timer_sync(&watch->timer);
+ cancel_delayed_work_sync(&watch->work);

kfree(watch);

--
1.7.9.2

2012-06-01 12:26:40

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH 4/5] vmevent: Hide meaningful names from the user-visible header

...so that nobody would try to use the internally used bits.

Suggested-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Anton Vorontsov <[email protected]>
---
include/linux/vmevent.h | 6 ++----
mm/vmevent.c | 9 +++++++--
2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
index aae0d24..b8ec0ac 100644
--- a/include/linux/vmevent.h
+++ b/include/linux/vmevent.h
@@ -35,10 +35,8 @@ enum {
*/
VMEVENT_ATTR_STATE_ONE_SHOT = (1UL << 3),

- /* Saved state, used internally by the kernel for one-shot mode. */
- __VMEVENT_ATTR_STATE_VALUE_WAS_LT = (1UL << 30),
- /* Saved state, used internally by the kernel for one-shot mode. */
- __VMEVENT_ATTR_STATE_VALUE_WAS_GT = (1UL << 31),
+ __VMEVENT_ATTR_STATE_INTERNAL = (1UL << 30) |
+ (1UL << 31),
};

struct vmevent_attr {
diff --git a/mm/vmevent.c b/mm/vmevent.c
index 35fd0d5..e64a92d 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -83,6 +83,11 @@ static u64 vmevent_sample_attr(struct vmevent_watch *watch, struct vmevent_attr
return fn(watch, attr);
}

+enum {
+ VMEVENT_ATTR_STATE_VALUE_WAS_LT = (1UL << 30),
+ VMEVENT_ATTR_STATE_VALUE_WAS_GT = (1UL << 31),
+};
+
static bool vmevent_match(struct vmevent_watch *watch)
{
struct vmevent_config *config = &watch->config;
@@ -100,8 +105,8 @@ static bool vmevent_match(struct vmevent_watch *watch)

if (attr_lt || attr_gt || attr_eq) {
bool one_shot = state & VMEVENT_ATTR_STATE_ONE_SHOT;
- u32 was_lt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_LT;
- u32 was_gt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_GT;
+ u32 was_lt_mask = VMEVENT_ATTR_STATE_VALUE_WAS_LT;
+ u32 was_gt_mask = VMEVENT_ATTR_STATE_VALUE_WAS_GT;
u64 value = vmevent_sample_attr(watch, attr);
bool lt = value < attr->value;
bool gt = value > attr->value;
--
1.7.9.2

2012-06-01 12:26:39

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH 5/5] vmevent: Rename one-shot mode to edge trigger mode

VMEVENT_ATTR_STATE_ONE_SHOT is a misleading name. It acts as an edge
trigger, not only once.

Suggested-by: KOSAKI Motohiro <[email protected]>
Suggested-by: Pekka Enberg <[email protected]>
Signed-off-by: Anton Vorontsov <[email protected]>
---
include/linux/vmevent.h | 4 ++--
mm/vmevent.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
index b8ec0ac..b1c4016 100644
--- a/include/linux/vmevent.h
+++ b/include/linux/vmevent.h
@@ -31,9 +31,9 @@ enum {
*/
VMEVENT_ATTR_STATE_VALUE_EQ = (1UL << 2),
/*
- * One-shot mode.
+ * Edge trigger mode.
*/
- VMEVENT_ATTR_STATE_ONE_SHOT = (1UL << 3),
+ VMEVENT_ATTR_STATE_EDGE_TRIGGER = (1UL << 3),

__VMEVENT_ATTR_STATE_INTERNAL = (1UL << 30) |
(1UL << 31),
diff --git a/mm/vmevent.c b/mm/vmevent.c
index e64a92d..46c1d18 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -104,7 +104,7 @@ static bool vmevent_match(struct vmevent_watch *watch)
continue;

if (attr_lt || attr_gt || attr_eq) {
- bool one_shot = state & VMEVENT_ATTR_STATE_ONE_SHOT;
+ bool edge = state & VMEVENT_ATTR_STATE_EDGE_TRIGGER;
u32 was_lt_mask = VMEVENT_ATTR_STATE_VALUE_WAS_LT;
u32 was_gt_mask = VMEVENT_ATTR_STATE_VALUE_WAS_GT;
u64 value = vmevent_sample_attr(watch, attr);
@@ -117,7 +117,7 @@ static bool vmevent_match(struct vmevent_watch *watch)
bool ret = false;

if (((attr_lt && lt) || (attr_gt && gt) ||
- (attr_eq && eq)) && !one_shot)
+ (attr_eq && eq)) && !edge)
return true;

if (attr_eq && eq && was_eq) {
--
1.7.9.2

2012-06-01 12:27:29

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH 3/5] vmevent: Refresh vmstats before sampling

On SMP, the kernel updates vmstats only once per second, which makes vmevent
unusable. Let's fix it by updating vmstats before sampling.

Reported-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Anton Vorontsov <[email protected]>
---
mm/vmevent.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/vmevent.c b/mm/vmevent.c
index 4ca2a04..35fd0d5 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -2,6 +2,7 @@
#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/vmevent.h>
+#include <linux/mm.h>
#include <linux/syscalls.h>
#include <linux/workqueue.h>
#include <linux/file.h>
@@ -163,6 +164,9 @@ static void vmevent_sample(struct vmevent_watch *watch)

if (atomic_read(&watch->pending))
return;
+
+ refresh_vm_stats();
+
if (!vmevent_match(watch))
return;

--
1.7.9.2

2012-06-01 12:26:31

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH 1/5] vmstat: Implement refresh_vm_stats()

This function forcibly flushes per-cpu vmstat diff counters to the
global counters.

Note that we don't try to flush the per-cpu pagesets; the pcp will still
be flushed once per 3 seconds.

Signed-off-by: Anton Vorontsov <[email protected]>
---
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 22 +++++++++++++++++++++-
2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 65efb92..2a53896 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -200,6 +200,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
extern void dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void refresh_vm_stats(void);
void refresh_cpu_vm_stats(int);
void refresh_zone_stat_thresholds(void);

@@ -253,6 +254,7 @@ static inline void __dec_zone_page_state(struct page *page,

#define set_pgdat_percpu_threshold(pgdat, callback) { }

+static inline void refresh_vm_stats(void) { }
static inline void refresh_cpu_vm_stats(int cpu) { }
static inline void refresh_zone_stat_thresholds(void) { }

diff --git a/mm/vmstat.c b/mm/vmstat.c
index f600557..4a9d432 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -13,6 +13,7 @@
#include <linux/err.h>
#include <linux/module.h>
#include <linux/slab.h>
+#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/vmstat.h>
#include <linux/sched.h>
@@ -434,7 +435,7 @@ EXPORT_SYMBOL(dec_zone_page_state);
* with the global counters. These could cause remote node cache line
* bouncing and will have to be only done when necessary.
*/
-void refresh_cpu_vm_stats(int cpu)
+static void __refresh_cpu_vm_stats(int cpu, bool drain_pcp)
{
struct zone *zone;
int i;
@@ -456,11 +457,15 @@ void refresh_cpu_vm_stats(int cpu)
local_irq_restore(flags);
atomic_long_add(v, &zone->vm_stat[i]);
global_diff[i] += v;
+ if (!drain_pcp)
+ continue;
#ifdef CONFIG_NUMA
/* 3 seconds idle till flush */
p->expire = 3;
#endif
}
+ if (!drain_pcp)
+ continue;
cond_resched();
#ifdef CONFIG_NUMA
/*
@@ -495,6 +500,21 @@ void refresh_cpu_vm_stats(int cpu)
atomic_long_add(global_diff[i], &vm_stat[i]);
}

+void refresh_cpu_vm_stats(int cpu)
+{
+ __refresh_cpu_vm_stats(cpu, 1);
+}
+
+static void smp_call_refresh_cpu_vm_stats(void *info)
+{
+ __refresh_cpu_vm_stats(smp_processor_id(), 0);
+}
+
+void refresh_vm_stats(void)
+{
+ smp_call_function(smp_call_refresh_cpu_vm_stats, NULL, 1);
+}
+
#endif

#ifdef CONFIG_NUMA
--
1.7.9.2

2012-06-03 18:27:06

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Fri, 1 Jun 2012, Anton Vorontsov wrote:
> > That's pretty awful. Anton, Leonid, comments?
> [...]
> > > 5) __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from the
> > > userland-exported files. When kernel internals are exported, silly guys
> > > always end up using them and becoming unhappy.
> >
> > Agreed. Anton, care to cook up a patch to do that?
>
> KOSAKI-San, Pekka,
>
> Much thanks for your reviews!
>
> These three issues should be fixed by the following patches. One mm/
> change is needed outside of vmevent...
>
> And I'm looking into other issues you pointed out...

I applied patches 2, 4, and 5. The vmstat patch needs ACKs from VM folks
to enter the tree.

Pekka

2012-06-04 08:44:54

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On 06/04/2012 03:26 AM, Pekka Enberg wrote:

> On Fri, 1 Jun 2012, Anton Vorontsov wrote:
>>> That's pretty awful. Anton, Leonid, comments?
>> [...]
>>>> 5) __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from the
>>>> userland-exported files. When kernel internals are exported, silly guys
>>>> always end up using them and becoming unhappy.
>>>
>>> Agreed. Anton, care to cook up a patch to do that?
>>
>> KOSAKI-San, Pekka,
>>
>> Much thanks for your reviews!
>>
>> These three issues should be fixed by the following patches. One mm/
>> change is needed outside of vmevent...
>>
>> And I'm looking into other issues you pointed out...
>
> I applied patches 2, 4, and 5. The vmstat patch needs ACKs from VM folks
> to enter the tree.
>
> Pekka


It's time to wrap this up.
It seems some people, me included, have tried to improve vmevent,
but I hope we can first agree on why we need vmevent before further review/patches.

Recently I tried a reclaim-latency based notifier that takes the backing device speed into account, which I mentioned in another thread.
The working model is to measure the reclaim time and, if reclaim doesn't finish within the required time, which is configurable
by the admin, notify the user or kill some thread. But I haven't published it yet, because it's not easy for the admin to control,
and there are several other issues.

AFAIK, the low memory notifier was started to replace the Android low memory killer.
At the same time, other folks want to use it more generally.
As I look through the Android low memory killer, it's not too bad except for some points:

1. It should not depend on shrink_slab. If needed, we can add a hook in vmscan.c directly instead of shrink_slab.
2. We can use out_of_memory instead of a custom victim selection/killing function. If needed,
we can change the out_of_memory interface a little bit to pass the information needed to select a victim.
3. The calculation of available pages.

1) and 2) would make the Android low memory killer very general, and 3) can meet each folk's requirements, I believe.

Anton, I expect you have already investigated the Android low memory killer, so maybe you know the pros and cons of each solution.
Could you convince us "why we need vmevent" and "why can't the Android LMK do it?"

KOSAKI, AFAIRC, you are a person who hates the Android low memory killer.
Why do you hate it? If it solves the problems I mentioned, do you still have concerns?
If so, please list them.

The Android low memory killer is a proven solution of long standing, at least in the embedded area (many Android phones already use it), so improving it makes more sense to me than inventing a new wheel.
Frankly speaking, I don't know of any use cases for vmevent other than low memory notification, and I didn't see any agreement about that with the other guys.

I hope we get an agreement about vmevent before enhancing it further.

Thanks, all.

--
Kind regards,
Minchan Kim

2012-06-04 09:20:20

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Mon, Jun 4, 2012 at 11:45 AM, Minchan Kim <[email protected]> wrote:
> KOSAKI, AFAIRC, you are a person who hates the Android low memory killer.
> Why do you hate it? If it solves the problems I mentioned, do you still have concerns?
> If so, please list them.
>
> The Android low memory killer is a proven solution of long standing, at least in the embedded
> area (many Android phones already use it), so improving it makes more
> sense to me than inventing a new wheel.

VM events started out as *ABI cleanup* of Nokia's N9 Linux lowmem
notifier. That's not reinventing the wheel.

On Mon, Jun 4, 2012 at 11:45 AM, Minchan Kim <[email protected]> wrote:
> Frankly speaking, I don't know of any use cases for vmevent other than low memory
> notification, and I didn't see any agreement about that with the other guys.

I think you are missing the point. "vmevent" is an ABI for delivering
VM events to userspace. I started it because different userspaces do
not agree what "low memory" means - for obvious reasons.

As for use cases, it'd be useful for VMs to be notified of "about to
swap your pages soon" so that they can aggressively GC before entering
GC-swapstorm hell. I also hear that something similar would be useful
for KVM/QEMU but I don't know the details.

I really don't see how Android's "low memory killer" will be useful as
a generic solution.

Pekka

2012-06-04 11:39:53

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Mon, Jun 04, 2012 at 05:45:06PM +0900, Minchan Kim wrote:
[...]
> AFAIK, the low memory notifier was started to replace the Android low memory killer.
> At the same time, other folks want to use it more generally.
> As I look through the Android low memory killer, it's not too bad except for some points:
>
> 1. It should not depend on shrink_slab. If needed, we can add a hook in vmscan.c directly instead of shrink_slab.
> 2. We can use out_of_memory instead of a custom victim selection/killing function. If needed,
> we can change the out_of_memory interface a little bit to pass the information needed to select a victim.
> 3. The calculation of available pages.
>
> 1) and 2) would make the Android low memory killer very general, and 3) can meet each folk's requirements, I believe.
>
> Anton, I expect you have already investigated the Android low memory killer, so maybe you know the pros and cons of each solution.
> Could you convince us "why we need vmevent" and "why can't the Android LMK do it?"

Note that 1) and 2) are not problems per se; they're just implementation
details, easy stuff. Vmevent is basically an ABI/API, and I haven't
heard anybody object to the vmevent ABI idea itself. More than
this, nothing stops us from implementing an in-kernel vmevent API and
making the Android low memory killer use it, if we want to.

The real problem is not with vmevent. Today there are two real problems:

a) Gathering proper statistics from the kernel. Both cgroups and vmstat
have issues. The Android low memory killer has the same problems w/ the
statistics as vmevent (it uses vmstat), so by no means is the Android
low memory killer better or easier in this regard.
(And cgroups have issues w/ slab accounting, plus some folks don't
want memcg at all, since it has runtime and memory-wise costs.)

b) Interpreting these statistics. We can't provide one universal
"low memory" definition that would work for everybody.
(Btw, your "levels"-based low memory grading actually sounds
the same as my RECLAIMABLE_CACHE_PAGES and
RECLAIMABLE_CACHE_PAGES_NOIO idea, i.e.
http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/02751.html
so personally I like the idea of a level-based approach, based
on available memory *cost*.)

So, you see, all these issues are valid for vmevent, cgroups and
android low memory killer.

> KOSAKI, AFAIRC, you are a person who hates android low memory killer.
> Why do you hate it? If it solve problems I mentioned, do you have a concern, still?
> If so, please, list up.
>
> Android low memory killer is proved solution for a long time, at least embedded area(So many android phone already have used it) so I think improving it makes sense to me rather than inventing new wheel.

Yes, nobody is throwing the Android low memory killer away. And recently
I fixed a bunch of issues in its task-traversing and killing code. Now
it's just time to "fix" the statistics gathering and interpretation
issues, and I see vmevent as a good way to do just that. Then we can
either switch the Android low memory killer driver to the in-kernel
vmevent API (so it becomes just "glue" between notifications and the
killing functions), or use a userland daemon.

Note that memcg has notifications as well, so it's another proof that
there is a demand for this stuff outside of embedded world, and going
with an ad-hoc, custom "low memory killer" is a simple and tempting approach,
but it doesn't solve any real problems.

> Frankly speaking, I don't know vmevent's other use cases except low memory notification

I won't speak for realistic use-cases, but that is what comes to
mind:

- DB can grow its caches/in-memory indexes infinitely, and start dropping
them on demand (based on internal LRU list, for example). No more
guessed/static configuration for DB daemons?
- Assuming VPS hosting w/ dynamic resources management, notifications
would be useful to readjust resources?
- On desktops, apps can drop their caches on demand if they want to
and can avoid swap activity?

Thanks,

--
Anton Vorontsov
Email: [email protected]

2012-06-04 12:17:33

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Mon, Jun 04, 2012 at 04:38:12AM -0700, Anton Vorontsov wrote:
> On Mon, Jun 04, 2012 at 05:45:06PM +0900, Minchan Kim wrote:
> [...]
> > AFAIK, low memory notifier is started for replacing android lowmemory killer.
> > At the same time, other folks want use it generally.
> > As I look through android low memory killer, it's not too bad except some point.
> >
> > 1. It should not depend on shrink_slab. If we need, we can add some hook in vmscan.c directly instead of shrink_slab.
> > 2. We can use out_of_memory instead of custom victim selection/killing function. If we need,
> > we can change out_of_memory interface little bit for passing needed information to select victim.
> > 3. calculation for available pages
> >
> > 1) and 2) would make android low memory killer very general and 3) can meet each folk's requirement, I believe.
> >
> > Anton, I expect you already investigated android low memory killer so maybe you know pros and cons of each solution.
> > Could you convince us "why we need vmevent" and "why can't android LMK do it?"
>
> Note that 1) and 2) are not problems per se, it's just implementation
> details, easy stuff. Vmevent is basically an ABI/API, and I didn't

My opinion is different.
Things that depend only on vmstat have limitations, IMHO.
From that point of view, 1) is very important. If we don't do 1), the low memory notifier must depend on vmstat only.
But if we do 1), we can add some hooks in vmscan.c so we can control the notifier in a more exact, fine-grained way.

As a crude example, if get_scan_count() sees anon pages and is about to reclaim the inactive anon list,
that is a kind of signal that anon pages will be swapped out, so we can notify qemu/kvm, which will balloon.

As another example, if scanning reaches a high priority (e.g. DEF_PRIORITY - 2) and the scanned value is
greater than the LRU size, that is a sort of high memory pressure.

My point is that if we want a level-based notifier approach, we need something more tightly coupled to the VM.

> hear anybody who would object to vmevent ABI idea itself. More than
> this, nobody stop us from implementing in-kernel vmevent API, and
> make Android Lowmemory killer use it, if we want to.

I guess the other guys weren't interested or were very busy.
I hope we can all reach a consensus this time.
At least, we need Andrew's, Rik's, David's and KOSAKI's opinions.

>
> The real problem is not with vmevent. Today there are two real problems:
>
> a) Gathering proper statistics from the kernel. Both cgroups and vmstat
> have issues. Android lowmemory killer has the same problems w/ the
> statistics as vmevent, it uses vmstat, so by no means Android
> low memory killer is better or easier in this regard.

Right.

> (And cgroups has issues w/ slab accounting, plus some folks don't
> want memcg at all, since it has runtime and memory-wise costs.)

As I mentioned earlier, we need a policy more tightly coupled to the VM, NOT vmstat.

>
> b) Interpreting this statistics. We can't provide one, universal
> "low memory" definition that would work for everybody.
> (Btw, your "levels" based low memory grading actually sounds
> the same as mine RECLAIMABLE_CACHE_PAGES and
> RECLAIMABLE_CACHE_PAGES_NOIO idea, i.e.
> http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/02751.html
> so personally I like the idea of level-based approach, based
> on available memory *cost*.)

Yes, I like that kind of abstraction, as in my suggestion.
For newcomers' information, a quote from my previous mail:
"
The idea is that we can define some levels in advance and explain them to users.

Level 1: an immediate response, sent when the kernel decides there are no fast-reclaimable pages any more.
Level 2: a somewhat slower response than level 1, because the kernel will also consider slower-to-reclaim pages as reclaim targets.
Level 3: the slowest response, because the kernel will consider even pages that take a long time to reclaim as reclaim targets.

It doesn't expose any kernel internals, and we can implement it entirely inside the kernel.
As a simple example:

Level 1: non-mapped clean pages
Level 2: Level 1 + mapped clean pages
Level 3: Level 2 + dirty pages

So users of vmevent_fd can select their level.
Of course, a latency-sensitive application with slow storage would select Level 1.
Some applications might use Level 4 (Level 3 + half of swap) because they have very fast storage.

And an application receiving an event can adopt a strategy like the following:

When it receives a level 1 notification, it could ask others to release their own buffers.
When it receives a level 2 notification, it could exit voluntarily if it's not a critical application.
When it receives a level 3 notification, it could kill others.

It's just an example; my point is that we should take storage speed into account to make it general.
"
>
> So, you see, all these issues are valid for vmevent, cgroups and
> android low memory killer.

Agree.

>
> > KOSAKI, AFAIRC, you are a person who hates android low memory killer.
> > Why do you hate it? If it solve problems I mentioned, do you have a concern, still?
> > If so, please, list up.
> >
> > Android low memory killer is proved solution for a long time, at least embedded area(So many android phone already have used it) so I think improving it makes sense to me rather than inventing new wheel.
>
> Yes, nobody throws Android lowmemory killer away. And recently I fixed
> a bunch of issues in its tasks traversing and killing code. Now it's
> just time to "fix" statistics gathering and interpretation issues,
> and I see vmevent as a good way to do just that, and then we
> can either turn Android lowmemory killer driver to use the vmevent
> in-kernel API (so it will become just a "glue" between notifications
> and killing functions), or use userland daemon.

If you can define such levels with vmstat only, I am OKAY and want to see
how to glue vmevent and the Android LMK together.
But keep killing in mind: the killer should be in the kernel, not userspace.
https://lkml.org/lkml/2011/12/19/330

>
> Note that memcg has notifications as well, so it's another proof that
> there is a demand for this stuff outside of embedded world, and going
> with ad-hoc, custom "low memory killer" is simple and tempting approach,
> but it doesn't solve any real problems.
>
> > Frankly speaking, I don't know vmevent's other use cases except low memory notification
>
> I won't speak for realistic use-cases, but that is what comes to
> mind:
>
> - DB can grow its caches/in-memory indexes infinitely, and start dropping
> them on demand (based on internal LRU list, for example). No more
> guessed/static configuration for DB daemons?
> - Assuming VPS hosting w/ dynamic resources management, notifications
> would be useful to readjust resources?
> - On desktops, apps can drop their caches on demand if they want to
> and can avoid swap activity?

I don't mean VMEVENT_ATTR_LOWMEM_PAGES but the following:

VMEVENT_ATTR_NR_FREE_PAGES
VMEVENT_ATTR_NR_SWAP_PAGES
VMEVENT_ATTR_NR_AVAIL_PAGES

I'm not sure how they are useful.
Even with VMEVENT_ATTR_LOWMEM_PAGES, I'm not sure it would be useful with only vmstat.

>
> Thanks,
>
> --
> Anton Vorontsov
> Email: [email protected]

2012-06-04 12:23:16

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Mon, Jun 04, 2012 at 12:20:18PM +0300, Pekka Enberg wrote:
> On Mon, Jun 4, 2012 at 11:45 AM, Minchan Kim <[email protected]> wrote:
> > KOSAKI, AFAIRC, you are a person who hates android low memory killer.
> > Why do you hate it? If it solve problems I mentioned, do you have a concern, still?
> > If so, please, list up.
> >
> > Android low memory killer is proved solution for a long time, at least embedded
> > area(So many android phone already have used it) so I think improving it makes
> > sense to me rather than inventing new wheel.
>
> VM events started out as *ABI cleanup* of Nokia's N9 Linux lowmem
> notifier. That's not reinventing the wheel.
>
> On Mon, Jun 4, 2012 at 11:45 AM, Minchan Kim <[email protected]> wrote:
> > Frankly speaking, I don't know vmevent's other use cases except low memory
> > notification and didn't see any agreement about that with other guys.
>
> I think you are missing the point. "vmevent" is an ABI for delivering
> VM events to userspace. I started it because different userspaces do
> not agree what "low memory" means - for obvious reasons.

The part of vmevent I dislike is that it exposes raw vmstat itself:

VMEVENT_ATTR_NR_SWAP_PAGES
VMEVENT_ATTR_NR_FREE_PAGES
VMEVENT_ATTR_NR_AVAIL_PAGES

And some calculations based on raw vmstat.
We need more abstraction based on vmscan heuristics.

>
> As for use cases, it'd be useful for VMs to be notified of "about to
> swap your pages soon" so that they can aggressively GC before entering

How do you detect it? That detection is the policy, which is the most important part of vmevent.

> GC-swapstorm hell. I also hear that something similar would be useful
> for KVM/QEMU but I don't know the details.
>
> I really don't see how Android's "low memory killer" will be useful as
> a generic solution.

Yes, we can't do it with the current Android LMK.
I'm not a big fan of the Android LMK, but I believe we can improve it a lot
with my suggestions and your smart ideas.

>
> Pekka
>

2012-06-04 13:37:16

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Mon, Jun 04, 2012 at 09:17:22PM +0900, Minchan Kim wrote:
[...]
> But keep in mind about killing. killer should be a kernel, not user.
> https://lkml.org/lkml/2011/12/19/330

There you're discussing an out-of-memory condition vs. a low memory
situation, and I don't see the LMK as a substitute for the OOMK,
or vice versa.

LMK triggers when we have plenty of free memory (and thus time);
and OOMK is a last resort thing: it is super-fast, as the kernel
may have literally nothing to do but start killing tasks.

Sure, if the LMK doesn't react fast enough and low memory turns into "out
of memory", then the OOMK will "help", which is totally fine. And
this is no different from the current in-kernel Android low memory
killer (which gives no guarantee that the OOMK will not trigger;
it is still possible to create an OOM condition, even with the LMK
enabled).

Anyway, personally I don't mind if the LMK lives in the kernel,
but I don't see that as mandatory (unlike the OOMK, which is a
kernel-only thing, of course).

> > > Frankly speaking, I don't know vmevent's other use cases except low memory notification
> >
> > I won't speak for realistic use-cases, but that is what comes to
> > mind:
> >
> > - DB can grow its caches/in-memory indexes infinitely, and start dropping
> > them on demand (based on internal LRU list, for example). No more
> > guessed/static configuration for DB daemons?
> > - Assuming VPS hosting w/ dynamic resources management, notifications
> > would be useful to readjust resources?
> > - On desktops, apps can drop their caches on demand if they want to
> > and can avoid swap activity?
>
> I don't mean VMEVENT_ATTR_LOWMEM_PAGES but following as,
>
> VMEVENT_ATTR_NR_FREE_PAGES
> VMEVENT_ATTR_NR_SWAP_PAGES
> VMEVENT_ATTR_NR_AVAIL_PAGES
>
> I'm not sure how it is useful.

Yep, these raw values are mostly useless, and personally I don't use
these plain attributes. I use heuristics, i.e. "blended" attributes.
If we can come up with a levels-based approach w/o using vmstat, well...
OK then.

But note that we actually want to know how many "free", "cheap to
reclaim" and "can reclaim, but at a cost" pages there are. It should
not be just an "ouch, we're out of cheap pages" signal, since the whole
point of the Android low memory killer is to act in advance: its idea
is to try to free memory before the condition happens, and thus make
the phone UI smoother. And the threshold is actually load/HW
specific, so I guess folks will want to tune it (e.g. "trigger when
there are XX MB of 'cheap to reclaim' pages left").

Thanks,

--
Anton Vorontsov
Email: [email protected]

2012-06-04 19:50:39

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

> KOSAKI, AFAIRC, you are a person who hates android low memory killer.
> Why do you hate it? If it solve problems I mentioned, do you have a concern, still?
> If so, please, list up.

1) It took tasklist_lock. (Solved already.)
2) Hacky lowmem_deathpending_timeout.
3) No zone awareness. The global_page_state(), lowmem_minfree[] and shrink_slab interfaces
don't reflect the real memory footprint.
4) No memcg or cpuset awareness.
5) Strange score calculation: the score is calculated from anon_rss+file_rss,
but the OOM killer only frees anon_rss.
6) Strange oom_score_adj overload.
7) Much duplicated code w/ the oom killer; we could factor out common helper functions, I think.
8) John's fallocate(VOLATILE) is cleaner.

2012-06-04 20:05:11

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

>> Anton, I expect you already investigated android low memory killer so maybe you know pros and cons of each solution.
>> Could you convince us "why we need vmevent" and "why can't android LMK do it?"
>
> Note that 1) and 2) are not problems per se, it's just implementation
> details, easy stuff. Vmevent is basically an ABI/API, and I didn't
> hear anybody who would object to vmevent ABI idea itself. More than
> this, nobody stop us from implementing in-kernel vmevent API, and
> make Android Lowmemory killer use it, if we want to.

I never agreed with the "it's a mere ABI" argument. As long as the
implementation is ugly, I won't agree to the ABI, even if the syscall
interface is very clean.


> The real problem is not with vmevent. Today there are two real problems:
>
> a) Gathering proper statistics from the kernel. Both cgroups and vmstat
> have issues. Android lowmemory killer has the same problems w/ the
> statistics as vmevent, it uses vmstat, so by no means Android
> low memory killer is better or easier in this regard.
> (And cgroups has issues w/ slab accounting, plus some folks don't
> want memcg at all, since it has runtime and memory-wise costs.)

Right. android lowmemory killer is also buggy.


> b) Interpreting this statistics. We can't provide one, universal
> "low memory" definition that would work for everybody.
> (Btw, your "levels" based low memory grading actually sounds
> the same as mine RECLAIMABLE_CACHE_PAGES and
> RECLAIMABLE_CACHE_PAGES_NOIO idea, i.e.
> http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/02751.html
> so personally I like the idea of level-based approach, based
> on available memory *cost*.)
>
> So, you see, all these issues are valid for vmevent, cgroups and
> android low memory killer.
>
>> KOSAKI, AFAIRC, you are a person who hates android low memory killer.
>> Why do you hate it? If it solve problems I mentioned, do you have a concern, still?
>> If so, please, list up.
>>
>> Android low memory killer is proved solution for a long time, at least embedded area(So many android phone already have used it) so I think improving it makes sense to me rather than inventing new wheel.
>
> Yes, nobody throws Android lowmemory killer away. And recently I fixed
> a bunch of issues in its tasks traversing and killing code. Now it's
> just time to "fix" statistics gathering and interpretation issues,
> and I see vmevent as a good way to do just that, and then we
> can either turn Android lowmemory killer driver to use the vmevent
> in-kernel API (so it will become just a "glue" between notifications
> and killing functions), or use userland daemon.

Huh? No? The Android lowmem killer is a "killer": it doesn't make any
notifications, it only kills memory-hogging processes. I don't think we
can merge them.



> Note that memcg has notifications as well, so it's another proof that
> there is a demand for this stuff outside of embedded world, and going
> with ad-hoc, custom "low memory killer" is simple and tempting approach,
> but it doesn't solve any real problems.

Wrong.
A memcg notification delivers the event to _another_ mem cgroup's process. That way, it
avoids the situation where the notified process itself can't work because of swap thrashing.
This feature only shares a weakness with the vmevent API.



>> Frankly speaking, I don't know vmevent's other use cases except low memory notification
>
> I won't speak for realistic use-cases, but that is what comes to
> mind:
>
> - DB can grow its caches/in-memory indexes infinitely, and start dropping
> them on demand (based on internal LRU list, for example). No more
> guessed/static configuration for DB daemons?

They use direct I/O for fine-grained cache control.


> - Assuming VPS hosting w/ dynamic resources management, notifications
> would be useful to readjust resources?

How do they readjust? Today kvm/xen use the balloon driver for dynamic
resource adjustment. AND it works better than vmevent because it doesn't
go through userspace.


> - On desktops, apps can drop their caches on demand if they want to
> and can avoid swap activity?

In this case, fallocate(VOLATILE) works better.

2012-06-04 22:41:31

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Mon, Jun 04, 2012 at 04:05:05PM -0400, KOSAKI Motohiro wrote:
[...]
> >Yes, nobody throws Android lowmemory killer away. And recently I fixed
> >a bunch of issues in its tasks traversing and killing code. Now it's
> >just time to "fix" statistics gathering and interpretation issues,
> >and I see vmevent as a good way to do just that, and then we
> >can either turn Android lowmemory killer driver to use the vmevent
> >in-kernel API (so it will become just a "glue" between notifications
> >and killing functions), or use userland daemon.
>
> Huh? No? android lowmem killer is a "killer". it doesn't make any notification,
> it only kill memory hogging process. I don't think we can merge them.

KOSAKI, you're not reading what I write. I never said that the low memory
killer makes any notifications; that's not what I was saying. I said
that once we have a good "low memory" notification mechanism (e.g.
vmevent), the Android low memory killer would just use that mechanism.
Whether it's userland notifications or in-kernel doesn't matter much.

--
Anton Vorontsov
Email: [email protected]

2012-06-05 07:47:21

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> Note that 1) and 2) are not problems per se, it's just implementation
>> details, easy stuff. Vmevent is basically an ABI/API, and I didn't
>> hear anybody who would object to vmevent ABI idea itself. More than
>> this, nobody stop us from implementing in-kernel vmevent API, and
>> make Android Lowmemory killer use it, if we want to.
>
> I never agree "it's mere ABI" discussion. Until the implementation is ugly,
> I never agree the ABI even if syscall interface is very clean.

I don't know what discussion you are talking about.

I also don't agree that something should be merged just because the
ABI is clean. The implementation must also make sense. I don't see how
we disagree here at all.

2012-06-05 07:52:09

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> - On desktops, apps can drop their caches on demand if they want to
>>   and can avoid swap activity?
>
> In this case, fallocate(VOLATILE) is work more better.

For some cases, yes, but probably not for all.

For example, if userspace doesn't know about the "about to swap real soon"
condition, it can continue to grow its caches, making
fallocate(VOLATILE) pretty much useless.

2012-06-05 07:53:44

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Mon, Jun 4, 2012 at 4:35 PM, Anton Vorontsov <[email protected]> wrote:
>> I don't mean VMEVENT_ATTR_LOWMEM_PAGES but following as,
>>
>> VMEVENT_ATTR_NR_FREE_PAGES
>> VMEVENT_ATTR_NR_SWAP_PAGES
>> VMEVENT_ATTR_NR_AVAIL_PAGES
>>
>> I'm not sure how it is useful.
>
> Yep, these raw values are mostly useless, and personally I don't use
> these plain attributes. I use heuristics, i.e. "blended" attributes.
> If we can come up with levels-based approach w/o using vmstat, well..
> OK then.

That's what Nokia's lowmem notifier uses. We can probably drop them
once we have something else they could use.

2012-06-05 07:59:56

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

Hi Pekka,

On 06/05/2012 04:53 PM, Pekka Enberg wrote:

> On Mon, Jun 4, 2012 at 4:35 PM, Anton Vorontsov <[email protected]> wrote:
>>> I don't mean VMEVENT_ATTR_LOWMEM_PAGES but following as,
>>>
>>> VMEVENT_ATTR_NR_FREE_PAGES
>>> VMEVENT_ATTR_NR_SWAP_PAGES
>>> VMEVENT_ATTR_NR_AVAIL_PAGES
>>>
>>> I'm not sure how it is useful.
>>
>> Yep, these raw values are mostly useless, and personally I don't use
>> these plain attributes. I use heuristics, i.e. "blended" attributes.
>> If we can come up with levels-based approach w/o using vmstat, well..
>> OK then.
>
> That's what Nokia's lowmem notifier uses. We can probably drop them
> once we have something else they could use.


My next concern is the implementation's periodic timer.
I think it would be better to add a direct hook in vmscan.c rather than peeking at raw vmstat
periodically with a timer, so we can exercise finer-grained control without unnecessary overhead.

--
Kind regards,
Minchan Kim

2012-06-05 08:01:47

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Tue, Jun 5, 2012 at 11:00 AM, Minchan Kim <[email protected]> wrote:
>> That's what Nokia's lowmem notifier uses. We can probably drop them
>> once we have something else they could use.
>
> Next concern is that periodic timer of implementation.
> I think it would add direct hook in vmscan.c rather than peeking raw vmstat periodically by timer
> so we can control more fine-grained way without unnecessary overhead.

If the hooks are clean and it doesn't hurt the !CONFIG_VMEVENT case,
I'm completely OK with that.

2012-06-05 08:19:08

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 0/5] Some vmevent fixes...

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of ext
> Pekka Enberg
> Sent: 05 June, 2012 11:02
> To: Minchan Kim
...
> > Next concern is that periodic timer of implementation.
> > I think it would add direct hook in vmscan.c rather than peeking raw
> > vmstat periodically by timer so we can control more fine-grained way
> without unnecessary overhead.
>
> If the hooks are clean and it doesn't hurt the !CONFIG_VMEVENT case, I'm
> completely OK with that.

In the previous iteration, hooking the VM was called out as a very bad idea, so in my version I installed a shrinker to handle the cases where we have memory pressure.
Using a deferred timer with an adequate timeout (0.250 ms or larger) is fully suitable for userspace and produces acceptable overhead.
By its nature such an API should not be 100% accurate; applications cannot handle the situation as well as the kernel can anyway. 0.5 MB space accuracy and 100 ms latency are the most userspace requires for 64-1024 MB devices.

2012-06-05 08:26:53

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On 06/05/2012 05:16 PM, [email protected] wrote:

>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of ext
>> Pekka Enberg
>> Sent: 05 June, 2012 11:02
>> To: Minchan Kim
> ...
>>> Next concern is that periodic timer of implementation.
>>> I think it would add direct hook in vmscan.c rather than peeking raw
>>> vmstat periodically by timer so we can control more fine-grained way
>> without unnecessary overhead.
>>
>> If the hooks are clean and it doesn't hurt the !CONFIG_VMEVENT case, I'm
>> completely OK with that.
>
> On the previous iteration hooking vm was pointed as very bad idea, so in my version I installed shrinker to handle cases when we have memory pressure.
> Using deferred timer with adequate timeout (0.250 ms or larger) fully suitable for userspace and produce adequate overhead
> -> by nature such API should not be 100% accurate, anyhow applications cannot handle situation as good as kernel can provide, 0.5MB space accuracy, 100ms is maximum user-space require for 64-1024MB devices.
>


I didn't follow the previous iteration you mentioned, so I don't know the history.
If the LMN (low memory notifier) were needed only by the embedded world, it would not be a good idea.
In that case we might control things well enough with vmstat events alone, but now we know many folks
want it, so we are trying to make it general.
IMHO, to meet the various requirements, raw vmstat events aren't enough, so we need a direct hook in vmscan
and should abstract it into some levels.
Of course, the VM guys should maintain it so it keeps working well as the VM algorithms change.

--
Kind regards,
Minchan Kim

2012-06-05 08:41:05

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Tue, Jun 05, 2012 at 10:47:18AM +0300, Pekka Enberg wrote:
> On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> Note that 1) and 2) are not problems per se, it's just implementation
> >> details, easy stuff. Vmevent is basically an ABI/API, and I didn't
> >> hear anybody who would object to vmevent ABI idea itself. More than
> >> this, nobody stop us from implementing in-kernel vmevent API, and
> >> make Android Lowmemory killer use it, if we want to.
> >
> > I never agree "it's mere ABI" discussion. Until the implementation is ugly,
> > I never agree the ABI even if syscall interface is very clean.
>
> I don't know what discussion you are talking about.
>
> I also don't agree that something should be merged just because the
> ABI is clean. The implementation must also make sense. I don't see how
> we disagree here at all.

BTW, I wasn't implying that vmevent should be merged just because
it is a clean ABI, and I wasn't implying that it is clean, and I
didn't propose to merge it at all. :-)

I just don't see any point in trying to scrap vmevent in favour of
Android low memory killer. This makes no sense at all, since today
vmevent is more useful than Android's solution. For vmevent we have
contributors from Nokia, Samsung, and of course Linaro, plus we
have a userland killer daemon* for Android (which can work with
both cgroups and vmevent backends). So vmevent is more generic
already.

To me it would make more sense if mm guys would tell us "scrap
this all, just use cgroups and its notifications; fix cgroups'
slab accounting and be happy". Well, I'd understand that.

Anyway, we all know that vmevent is 'work in progress', so nobody
tries to push it, nobody asks to merge it. So far we're just
discussing any possible solutions, and vmevent is a good
playground.


So, question to Minchan. Do you have anything particular in mind
regarding how the vmstat hooks should look? And how all this
would connect with cgroups, since KOSAKI wants to see it cgroups-
aware...

p.s. http://git.infradead.org/users/cbou/ulmkd.git
I haven't updated it for new vmevent changes, but still,
its idea should be clear enough.

--
Anton Vorontsov
Email: [email protected]

Subject: Re: [PATCH 3/5] vmevent: Refresh vmstats before sampling

On Fri, 1 Jun 2012, Anton Vorontsov wrote:

> On SMP, kernel updates vmstats only once per second, which makes vmevent
> unusable. Let's fix it by updating vmstats before sampling.

Well this may increase your accuracy but there is no guarantee that an
update to vm counters will not happen immediately after you have refreshed
the counters for one processor or the other.

Also please consider the impact that an IPI broadcast will have on the latency
of other processors and on the function that is currently executing.

We just went through a round of getting rid of IPI broadcasts because they
create OS noise on processors.

Subject: Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

On Tue, 8 May 2012, KOSAKI Motohiro wrote:

> 4) Currently, vmstat have per-cpu batch and vmstat updating makes 3
> second delay at maximum.

Nope. The delay is only one second, and it is limited to a certain amount
per cpu. There is a bound on the inaccuracy of the counter. If you want
them to be more accurate, then the right approach is to lower the
threshold. The more accurate they are, the higher the impact of cache-line
bouncing.

Subject: Re: [PATCH 1/5] vmstat: Implement refresh_vm_stats()

On Fri, 1 Jun 2012, Anton Vorontsov wrote:

> This function forcibly flushes per-cpu vmstat diff counters to the
> global counters.

Why is it necessary to have a function that does not expire the pcps? Is
that side effect important? We already use refresh_vm_cpu_stats(cpu) in
page_alloc.c to flush the vmstat counters. Is flushing the
pcps after 2 seconds instead of 3 really that important?

Also, if we do this,

can we therefore name the function in a different way, like

flush_vmstats()


> @@ -456,11 +457,15 @@ void refresh_cpu_vm_stats(int cpu)
> local_irq_restore(flags);
> atomic_long_add(v, &zone->vm_stat[i]);
> global_diff[i] += v;
> + if (!drain_pcp)
> + continue;
> #ifdef CONFIG_NUMA
> /* 3 seconds idle till flush */
> p->expire = 3;
> #endif

Erm. This should be

#ifdef CONFIG_NUMA
if (drain_pcp)
p->expire = 3;
#endif

The construct using "continue" is weird.


> }
> + if (!drain_pcp)
> + continue;
> cond_resched();
> #ifdef CONFIG_NUMA
> /*
> @@ -495,6 +500,21 @@ void refresh_cpu_vm_stats(int cpu)
> atomic_long_add(global_diff[i], &vm_stat[i]);
> }
>
> +void refresh_cpu_vm_stats(int cpu)
> +{
> + __refresh_cpu_vm_stats(cpu, 1);
> +}

Fold __refresh_cpu_vm_stats into this function and modify the caller
of refresh_cpu_vm_stats instead.
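The refactoring Christoph asks for can be sketched with a toy model. All names and the pcp representation here are assumptions for illustration, not the final kernel code; the point is one function taking the flag instead of a thin wrapper:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one refresh_cpu_vm_stats(cpu, drain_pcp) instead of a
 * wrapper around __refresh_cpu_vm_stats().  Callers that previously
 * used the old single-argument form now pass the flag explicitly. */
static long vm_stat;       /* global counter */
static long cpu_diff[2];   /* per-cpu stat diffs */
static int  pcp_count[2];  /* per-cpu pageset fill levels */

static void refresh_cpu_vm_stats(int cpu, bool drain_pcp)
{
    vm_stat += cpu_diff[cpu];  /* always fold the stat diffs */
    cpu_diff[cpu] = 0;
    if (drain_pcp)
        pcp_count[cpu] = 0;    /* optionally drain the pagesets too */
}
```

The existing caller in page_alloc.c would then read `refresh_cpu_vm_stats(cpu, true)`, making the "continue" tricks in the quoted hunk unnecessary.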

2012-06-07 02:41:00

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On 06/05/2012 05:39 PM, Anton Vorontsov wrote:

> On Tue, Jun 05, 2012 at 10:47:18AM +0300, Pekka Enberg wrote:
>> On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro
>> <[email protected]> wrote:
>>>> Note that 1) and 2) are not problems per se, it's just implementation
>>>> details, easy stuff. Vmevent is basically an ABI/API, and I didn't
>>>> hear anybody who would object to vmevent ABI idea itself. More than
>>>> this, nobody stop us from implementing in-kernel vmevent API, and
>>>> make Android Lowmemory killer use it, if we want to.
>>>
>>> I never agree "it's mere ABI" discussion. Until the implementation is ugly,
>>> I never agree the ABI even if syscall interface is very clean.
>>
>> I don't know what discussion you are talking about.
>>
>> I also don't agree that something should be merged just because the
>> ABI is clean. The implementation must also make sense. I don't see how
>> we disagree here at all.
>
> BTW, I wasn't implying that vmevent should be merged just because
> it is a clean ABI, and I wasn't implying that it is clean, and I
> didn't propose to merge it at all. :-)
>
> I just don't see any point in trying to scrap vmevent in favour of
> Android low memory killer. This makes no sense at all, since today
> vmevent is more useful than Android's solution. For vmevent we have
> contributors from Nokia, Samsung, and of course Linaro, plus we
> have an userland killer daemon* for Android (which can work with
> both cgroups and vmevent backends). So vmevent is more generic
> already.
>
> To me it would make more sense if mm guys would tell us "scrap
> this all, just use cgroups and its notifications; fix cgroups'
> slab accounting and be happy". Well, I'd understand that.
>
> Anyway, we all know that vmevent is 'work in progress', so nobody
> tries to push it, nobody asks to merge it. So far we're just
> discussing any possible solutions, and vmevent is a good
> playground.
>
>
> So, question to Minchan. Do you have anything particular in mind
> regarding how the vmstat hooks should look like? And how all this
> would connect with cgroups, since KOSAKI wants to see it cgroups-
> aware...


How about this?

This is just pseudocode; I only want to show my intention, and it's not
even real math. We need some more fine-grained expression to standardize
memory pressure. For that, we can use several VM parameters: nr_scanned,
nr_reclaimed, order, the dirty page scanning ratio and so on. We are also
aware of the zone and node, so we can pass lots of information to user
space if they want it. For making the lowmem notifier general, these are
a must, I think. We have plenty of tools for it.

And later, as a further step, we could make it memcg-aware once memcg
reclaim is totally unified with global page reclaim. Many memcg people
have tried that, so I expect it will work sooner or later, but I'm not
sure memcg really needs it, because memcg's goal is to limit memory
resources among several process groups. If some process suffers latency
due to a shortage of free memory and that is critical, I think it would
be better to create a new memcg group with a tighter limit for latency
and put the process into that group.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeb3bc9..eae3d2e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2323,6 +2323,32 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
}

/*
+ * higher dirty pages, higher pressure
+ * higher nr_scanned, higher pressure
+ * higher nr_reclaimed, lower pressure
+ * higher unmapped pages, lower pressure
+ *
+ * index toward 0 implies memory pressure is heavy.
+ */
+int lowmem_index(struct zone *zone, struct scan_control *sc)
+{
+ int pressure = (1000 * (sc->nr_scanned * (zone_page_state(zone, NR_FILE_DIRTY)
+ * dirty_weight + 1) - sc->nr_reclaimed -
+ zone_unmapped_file_pages(zone))) /
+ zone_reclaimable_page(zone);
+
+ return 1000 - pressure;
+}
+
+void lowmem_notifier(struct zone *zone, int index)
+{
+ if (lowmem_has_interested_zone(zone)) {
+ if (index < sysctl_lowmem_threshold)
+ notify(numa_node_id(), zone, index);
+ }
+}
+
+/*
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at high_wmark_pages(zone).
*
@@ -2494,6 +2520,7 @@ loop_again:
!zone_watermark_ok_safe(zone, testorder,
high_wmark_pages(zone) + balance_gap,
end_zone, 0)) {
+ int index;
shrink_zone(zone, &sc);

reclaim_state->reclaimed_slab = 0;
@@ -2503,6 +2530,9 @@ loop_again:

if (nr_slab == 0 && !zone_reclaimable(zone))
zone->all_unreclaimable = 1;
+
+ index = lowmem_index(zone, &sc);
+ lowmem_notifier(zone, index);
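For readers who want to play with the index, the pseudo-formula above can be rendered as a standalone function. The parameter names follow the pseudocode; `dirty_weight` and the page counts are whatever the VM would supply, so the concrete values below are purely illustrative:

```c
#include <assert.h>

/* Standalone rendering of the lowmem_index() pseudo-formula:
 * higher dirty pages / nr_scanned raise the pressure term,
 * higher nr_reclaimed / unmapped pages lower it, and the result
 * is scaled so that an index near 0 means heavy pressure. */
static int lowmem_index(long nr_scanned, long nr_reclaimed,
                        long nr_dirty, long nr_unmapped,
                        long nr_reclaimable, long dirty_weight)
{
    long pressure = 1000 * (nr_scanned * (nr_dirty * dirty_weight + 1)
                            - nr_reclaimed - nr_unmapped)
                    / nr_reclaimable;
    return (int)(1000 - pressure);  /* toward 0 => heavier pressure */
}
```

For example, scanning 100 pages and reclaiming 90 of them (no dirty or unmapped pages, 1000 reclaimable) yields an index near 1000, i.e. low pressure; scanning 500 dirty-weighted pages while reclaiming only 50 drops the index close to 0.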

>

> p.s. http://git.infradead.org/users/cbou/ulmkd.git
> I haven't updated it for new vmevent changes, but still,
> its idea should be clear enough.
>


--
Kind regards,
Minchan Kim

2012-06-08 03:17:34

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/5] vmstat: Implement refresh_vm_stats()

(6/1/12 8:24 AM), Anton Vorontsov wrote:
> This function forcibly flushes per-cpu vmstat diff counters to the
> global counters.
>
> Note that we don't try to flush percpu pagesets, the pcp will be
> still flushed once per 3 seconds.
>
> Signed-off-by: Anton Vorontsov<[email protected]>

No.

This is insane. Your patch improves vmevent accuracy a _bit_, but in
exchange it decreases performance on large systems. That's not a good
deal; 99% of users never use vmevent.

MOREOVER, this patch doesn't solve the vmevent accuracy issue AT ALL,
because a second is big enough to allow large inaccuracy. Modern CPUs are
fast. Guys, the fact is, a userland monitor can't use an implicit batch
like vmstat's. That's a reason why perf doesn't use vmstat. I suggest you
look at the perf APIs; they may bring you good inspiration.

2012-06-08 03:25:35

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

(6/1/12 8:24 AM), Anton Vorontsov wrote:
> We'll need to use smp_call_function() in the sampling routines, and that
> call is not supposed to be made from the bottom halves. So, let's
> convert vmevent to deferred workqueues.
>
> As a side effect, we also fix the swap reporting (we cannot call
> si_swapinfo from the interrupt context), i.e. the following oops should
> be fixed now:
>
> =================================
> [ INFO: inconsistent lock state ]
> 3.4.0-rc1+ #37 Not tainted
> ---------------------------------
> inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
> (swap_lock){+.?...}, at: [<ffffffff8110449d>] si_swapinfo+0x1d/0x90
> {SOFTIRQ-ON-W} state was registered at:
> [<ffffffff8107ca7f>] mark_irqflags+0x15f/0x1b0
> [<ffffffff8107e5e3>] __lock_acquire+0x493/0x9d0
> [<ffffffff8107f20e>] lock_acquire+0x9e/0x200
> [<ffffffff813e9071>] _raw_spin_lock+0x41/0x50
> [<ffffffff8110449d>] si_swapinfo+0x1d/0x90
> [<ffffffff8117e7c8>] meminfo_proc_show+0x38/0x3f0
> [<ffffffff81141209>] seq_read+0x139/0x3f0
> [<ffffffff81174cc6>] proc_reg_read+0x86/0xc0
> [<ffffffff8111c19c>] vfs_read+0xac/0x160
> [<ffffffff8111c29a>] sys_read+0x4a/0x90
> [<ffffffff813ea652>] system_call_fastpath+0x16/0x1b
>
> Signed-off-by: Anton Vorontsov<[email protected]>

As I already told you, vmevent shouldn't deal with timers at all. That is
NOT suited to the embedded world, because the time subsystem is one of the
most complex ones in Linux. Our 'time' is not a simple concept; time.h
says there are at least 5 clocks a user may want.

include/linux/time.h
------------------------------------------
#define CLOCK_REALTIME 0
#define CLOCK_MONOTONIC 1
#define CLOCK_MONOTONIC_RAW 4
#define CLOCK_REALTIME_COARSE 5
#define CLOCK_MONOTONIC_COARSE 6

And some people want to change the timer slack to optimize power
consumption.

So don't reinvent the wheel. Just use the POSIX timer APIs.






2012-06-08 03:35:25

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

(6/5/12 4:16 AM), [email protected] wrote:
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of ext
>> Pekka Enberg
>> Sent: 05 June, 2012 11:02
>> To: Minchan Kim
> ...
>>> Next concern is that periodic timer of implementation.
>>> I think it would add direct hook in vmscan.c rather than peeking raw
>>> vmstat periodically by timer so we can control more fine-grained way
>> without unnecessary overhead.
>>
>> If the hooks are clean and it doesn't hurt the !CONFIG_VMEVENT case, I'm
>> completely OK with that.
>
> On the previous iteration, hooking the VM was pointed out as a very bad idea, so in my version I installed a shrinker to handle the cases when we have memory pressure.
> Using a deferred timer with an adequate timeout (0.250 ms or larger) is fully suitable for userspace and produces adequate overhead.
> By nature such an API should not be 100% accurate; in any case applications cannot handle the situation as well as the kernel can. 0.5 MB space accuracy and 100 ms latency are the maximum user space requires for 64-1024 MB devices.

I believe that's a bad idea. In fact, an "adequate" timeout depends on the hardware, not on application performance tendencies. Thus, applications can't know the "adequate" value.

2012-06-08 03:45:55

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

(6/4/12 6:39 PM), Anton Vorontsov wrote:
> On Mon, Jun 04, 2012 at 04:05:05PM -0400, KOSAKI Motohiro wrote:
> [...]
>>> Yes, nobody throws Android lowmemory killer away. And recently I fixed
>>> a bunch of issues in its tasks traversing and killing code. Now it's
>>> just time to "fix" statistics gathering and interpretation issues,
>>> and I see vmevent as a good way to do just that, and then we
>>> can either turn Android lowmemory killer driver to use the vmevent
>>> in-kernel API (so it will become just a "glue" between notifications
>>> and killing functions), or use userland daemon.
>>
>> Huh? No? The Android lowmem killer is a "killer": it doesn't make any
>> notification, it only kills memory-hogging processes. I don't think we
>> can merge them.
>
> KOSAKI, you don't read what I write. I didn't ever say that low memory
> killer makes any notifications, that's not what I was saying. I said
> that once we'll have a good "low memory" notification mechanism (e.g.
> vmevent), Android low memory killer would just use this mechanism. Be
> it userland notifications or in-kernel, doesn't matter much.

I don't disagree with this. But it was not my point. I have to note why
the current Android killer is NOT a notification.

In fact, a notification is a mere notification. There is no guarantee the
kill will succeed. There are various reasons a kill can fail, e.g. 1) any
shrinking activity needs more memory (that's the reason why we currently
only have the memcg OOM notification); 2) userland memory-returning
activity is neither atomic nor fast, so the kernel might find another
memory shortage before the memory returning finishes; 3) system thrashing
may leave userland processes stuck; 4) ... and userland bugs.

So, there is a big design choice here: 1) vmevent is just a notification,
and it can't guarantee OOM prevention; or 2) implement some trick (e.g.
reserved memory for vmevent processes, blocking kernel activity until the
memory returning finishes, etc.).

2012-06-08 03:55:45

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

(6/5/12 3:52 AM), Pekka Enberg wrote:
> On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> - On desktops, apps can drop their caches on demand if they want to
>>> and can avoid swap activity?
>>
>> In this case, fallocate(VOLATILE) is work more better.
>
> For some cases, yes, but probably not for all.
>
> For example, if userspace doesn't know about "about to swap real soon"
> condition, it can continue to grow its caches making
> fallocate(VOLATILE) pretty much useless.

Fair enough. But please consider that your scenario doesn't need
vmevent(2); cat /proc/meminfo works well enough. Only the dropping
activity needs quick work and just-in-time notification; limiting cache
growth doesn't.

2012-06-08 06:54:18

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Fri, Jun 8, 2012 at 6:55 AM, KOSAKI Motohiro
<[email protected]> wrote:
>> For some cases, yes, but probably not for all.
>>
>> For example, if userspace doesn't know about "about to swap real soon"
>> condition, it can continue to grow its caches making
>> fallocate(VOLATILE) pretty much useless.
>
> Fair enough. But, Please consider your scenario don't need vmevent(2),
> cat /proc/meminfo works enough. Only dropping activity need a quick work
> and just in time notification. cache growing limitation don't need.

Why do you keep bringing this up? Yes, you might get away with polling
/proc/meminfo or /proc/vmstat for some specific cases. It does not
make it a proper generic solution to the problem "vmevent" ABI tries
to address.

2012-06-08 06:57:13

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Fri, Jun 8, 2012 at 6:45 AM, KOSAKI Motohiro
<[email protected]> wrote:
> So, big design choice here. 1) vmevent is a just notification. it can't
> guarantee to prevent oom. or 2) to implement some trick (e.g. reserved
> memory for vmevent processes, kernel activity blocking until finish
> memory returing, etc)

Yes, vmevent is only for notification and will not *guarantee* OOM
prevention. It simply tries to provide hints early enough to the
userspace so that OOM can be avoided if possible.

2012-06-08 06:57:52

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

(6/8/12 2:54 AM), Pekka Enberg wrote:
> On Fri, Jun 8, 2012 at 6:55 AM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> For some cases, yes, but probably not for all.
>>>
>>> For example, if userspace doesn't know about "about to swap real soon"
>>> condition, it can continue to grow its caches making
>>> fallocate(VOLATILE) pretty much useless.
>>
>> Fair enough. But, Please consider your scenario don't need vmevent(2),
>> cat /proc/meminfo works enough. Only dropping activity need a quick work
>> and just in time notification. cache growing limitation don't need.
>
> Why do you keep bringing this up? Yes, you might get away with polling
> /proc/meminfo or /proc/vmstat for some specific cases. It does not
> make it a proper generic solution to the problem "vmevent" ABI tries
> to address.

Guys, the current vmevent _is_ a polling interface. It is only a wrapped
kernel timer.

2012-06-08 06:59:16

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Fri, Jun 8, 2012 at 9:57 AM, KOSAKI Motohiro
<[email protected]> wrote:
> Guys, current vmevent _is_ a polling interface. It only is wrapped kernel
> timer.

Current implementation is like that but we don't intend to keep it
that way. That's why having a separate ABI is so important.

2012-06-08 07:00:12

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

On Thu, Jun 07, 2012 at 11:25:30PM -0400, KOSAKI Motohiro wrote:
[...]
> As I already told you, vmevent shouldn't deal a timer at all. It is
> NOT familiar to embedded world. Because of, time subsystem is one of
> most complex one on linux. Our 'time' is not simple concept. time.h
> says we have 5 possibilities user want, at least.
>
> include/linux/time.h
> ------------------------------------------
> #define CLOCK_REALTIME 0
> #define CLOCK_MONOTONIC 1
> #define CLOCK_MONOTONIC_RAW 4
> #define CLOCK_REALTIME_COARSE 5
> #define CLOCK_MONOTONIC_COARSE 6
>
> And, some people want to change timer slack for optimize power
> consumption.
>
> So, Don't reinventing the wheel. Just use posix tiemr apis.

I'm puzzled why you mention POSIX timers in the context of an
in-kernel user. And none of the POSIX timers are deferrable.

The whole point of vmevent is to be lightweight and save power. Vmevent
does all the work in the kernel, using deferrable timers/workqueues, and
it is a proper in-kernel API for doing so.

If you're saying that we should set up a timer in the userland and
constantly read /proc/vmstat, then we will cause CPU wake up
every 100ms, which is not acceptable. Well, we can try to introduce
deferrable timers for the userspace. But then it would still add
a lot more overhead for our task, as this solution adds other two
context switches to read and parse /proc/vmstat. I guess this is
not a show-stopper though, so we can discuss this.
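The read-and-parse overhead mentioned above can be sketched as follows. This is a simple scanner for one named counter in the `/proc/vmstat` "name value" format; the helper name is made up for the example:

```c
#include <stdio.h>
#include <string.h>

/* Scan a vmstat-format file ("name value" per line) for one counter.
 * Returns the counter's value, or -1 if it is absent or the file
 * cannot be opened.  Each call pays an open/parse cost, which is the
 * overhead counted against the userland-polling approach. */
long read_vmstat_counter(const char *path, const char *name)
{
    char key[64];
    long value;
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fscanf(f, "%63s %ld", key, &value) == 2) {
        if (strcmp(key, name) == 0) {
            fclose(f);
            return value;
        }
    }
    fclose(f);
    return -1;
}
```

On a real system this would be called with `"/proc/vmstat"` and a counter name such as `"nr_free_pages"`; doing it every wakeup is exactly the serialization/deserialization cost being debated.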

Leonid, Pekka, what do you think about the idea?

Thanks,

--
Anton Vorontsov
Email: [email protected]

2012-06-08 07:03:27

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

On Fri, Jun 8, 2012 at 9:58 AM, Anton Vorontsov
<[email protected]> wrote:
> If you're saying that we should set up a timer in the userland and
> constantly read /proc/vmstat, then we will cause CPU wake up
> every 100ms, which is not acceptable. Well, we can try to introduce
> deferrable timers for the userspace. But then it would still add
> a lot more overhead for our task, as this solution adds other two
> context switches to read and parse /proc/vmstat. I guess this is
> not a show-stopper though, so we can discuss this.
>
> Leonid, Pekka, what do you think about the idea?

That's exactly the kind of half-assed ABI that led to people
inventing out-of-tree lowmem notifiers in the first place.

I'd be more interested to know what people think of Minchan's approach,
which gets rid of vmstat sampling.

2012-06-08 07:08:45

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

> -----Original Message-----
> From: ext Anton Vorontsov [mailto:[email protected]]
> Sent: 08 June, 2012 09:58
...
> If you're saying that we should set up a timer in the userland and constantly
> read /proc/vmstat, then we will cause CPU wake up every 100ms, which is
> not acceptable. Well, we can try to introduce deferrable timers for the
> userspace. But then it would still add a lot more overhead for our task, as this
> solution adds other two context switches to read and parse /proc/vmstat. I
> guess this is not a show-stopper though, so we can discuss this.
>
> Leonid, Pekka, what do you think about the idea?

That seems to me not a nice solution. Generating/parsing vmstat every 100 ms, plus the wakeups, is exactly what should be avoided for the API to make sense.
It will also cause page thrashing, because user-space code could be pushed out of the cache if the VM so decides.

>
> Thanks,
>
> --
> Anton Vorontsov
> Email: [email protected]

2012-06-08 07:10:57

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

On Fri, Jun 8, 2012 at 3:05 AM, <[email protected]> wrote:
>> -----Original Message-----
>> From: ext Anton Vorontsov [mailto:[email protected]]
>> Sent: 08 June, 2012 09:58
> ...
>> If you're saying that we should set up a timer in the userland and constantly
>> read /proc/vmstat, then we will cause CPU wake up every 100ms, which is
>> not acceptable. Well, we can try to introduce deferrable timers for the
>> userspace. But then it would still add a lot more overhead for our task, as this
>> solution adds other two context switches to read and parse /proc/vmstat. I
>> guess this is not a show-stopper though, so we can discuss this.
>>
>> Leonid, Pekka, what do you think about the idea?
>
> Seems to me not nice solution. Generating/parsing vmstat every 100ms plus wakeups it is what exactly should be avoid to have sense to API.

No, I don't suggest waking up every 100 ms. I suggest integrating with
existing subsystems. If you need any enhancement, just do it.


> It also will cause page trashing because user-space code could be pushed out from cache if VM decide.

This is a completely unrelated issue. Even if the notification code is not
swapped out, the userland notification-handling code may still be swapped.
So, if you must avoid swap, you must use mlock.
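For reference, "use mlock" here would amount to something like the following sketch (not any particular daemon's actual code; the helper name is made up):

```c
#include <errno.h>
#include <sys/mman.h>

/* Pin the process's memory so the notification handler itself cannot
 * be pushed out to swap while it is trying to free memory. */
int pin_process_memory(void)
{
    /* MCL_CURRENT locks everything mapped now; MCL_FUTURE extends the
     * lock to later mappings (stack growth, future allocations).
     * Needs CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK, so the call
     * can legitimately fail with EPERM or ENOMEM. */
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

This is also the trade-off Leonid objects to below: locking every page of every interested application is expensive, which is why the thread later converges on mlocking only a small dedicated watcher daemon.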

2012-06-08 07:19:21

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

> -----Original Message-----
> From: ext KOSAKI Motohiro [mailto:[email protected]]
> Sent: 08 June, 2012 10:11
..
> No. I don't suggest to wake up every 100ms. I suggest to integrate existing
> subsystems. If you need any enhancement, just do it.
It will be non-trivial to hook all vmstat updates. It is simpler to use a deferred timer.

> > It also will cause page trashing because user-space code could be pushed
> out from cache if VM decide.
>
> This is completely unrelated issue. Even if notification code is not swapped,
> userland notify handling code still may be swapped. So, if you must avoid
> swap, you must use mlock.

If you wake up only on a signal when the memory situation changes, you don't need to be mlocked.
Mlocking uses memory in a very inefficient way and usually cannot be applied to apps that want to be notified, due to resource restrictions.

2012-06-08 07:23:23

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

>> > It also will cause page trashing because user-space code could be pushed
>> out from cache if VM decide.
>>
>> This is completely unrelated issue. Even if notification code is not swapped,
>> userland notify handling code still may be swapped. So, if you must avoid
>> swap, you must use mlock.
>
> If you wakeup only by signal when memory situation changed you can be not mlocked.
> Mlocking uses memory very inefficient way and usually cannot be applied for apps which wants to be notified due to resources restrictions.

That's your choice. If you don't need to care about cache dropping, we
don't enforce it. I only pointed out that your explanation was technically
incorrect.

2012-06-08 07:28:42

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

> -----Original Message-----
> From: ext KOSAKI Motohiro [mailto:[email protected]]
> Sent: 08 June, 2012 10:23
...
> > If you wakeup only by signal when memory situation changed you can be
> not mlocked.
> > Mlocking uses memory very inefficient way and usually cannot be applied
> for apps which wants to be notified due to resources restrictions.
>
> That's your choice. If you don't need to care cache dropping, We don't
> enforce it. I only pointed out your explanation was technically incorrect.

My explanation is correct. That is an overhead you have to pay if you start to use an API based on polling from user space, and this overhead narrows the API's applicability.
Moving all timers/tracking into the kernel avoids useless wakeups in user space.

2012-06-08 07:33:21

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

(6/8/12 3:28 AM), [email protected] wrote:
>> -----Original Message-----
>> From: ext KOSAKI Motohiro [mailto:[email protected]]
>> Sent: 08 June, 2012 10:23
> ...
>>> If you wakeup only by signal when memory situation changed you can be
>> not mlocked.
>>> Mlocking uses memory very inefficient way and usually cannot be applied
>> for apps which wants to be notified due to resources restrictions.
>>
>> That's your choice. If you don't need to care cache dropping, We don't
>> enforce it. I only pointed out your explanation was technically incorrect.
>
> My explanation is correct. That is an overhead you have to pay if start to
>use API based on polling from user-space and this overhead narrows API
>applicability.
> Moving all times/tracking to kernel avoid useless wakeups in user-space.

Wrong. The CPU doesn't care whether the running code belongs to userspace
or the kernel; all code just consumes power. That's why a polling timer is
wrong from the power-consumption point of view.

2012-06-08 07:49:33

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

> -----Original Message-----
> From: ext KOSAKI Motohiro [mailto:[email protected]]
> Sent: 08 June, 2012 10:33
> To: Moiseichuk Leonid (Nokia-MP/Espoo)
..
> Wrong. CPU don't realized the running code belong to userspace or kernel.
> Every code just consume a power. That's why polling timer is wrong from
> point of power consumption view.
???
We are talking about different things.
User-space code can be dropped from memory, is distributed between several applications, and has no deferred-timer support.
For a polling API the user-space code has to be executed quite often.
Localizing this code in the kernel additionally avoids the vmstat/meminfo generation and parsing overhead as well.

2012-06-08 07:50:50

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Thu, Jun 07, 2012 at 11:41:27AM +0900, Minchan Kim wrote:
[...]
> How about this?

So, basically this is just another shrinker callback (well, it's
very close to one), but only triggered after some magic index crosses
a threshold.

Some information beforehand: the current ALMK is broken in the sense
that it does not do what it is supposed to do according to its
documentation. It uses shrinker notifiers (similar to your code), but
kernel calls shrinker when there is already pressure on the memory,
and ALMK's original idea was to start killing processes when there
is [yet] no pressure at all, so ALMK was supposed to act in advance,
e.g. "kill unneeded apps when there's say 64 MB free memory left,
irrespective of the current pressure". ALMK doesn't do this
currently, it only reacts to the shrinker.

So, the solution would be then two-fold:

1. Use your memory pressure notifications. They must be quite fast when
we're starting to feel the high pressure. (I see that you use
zone_page_state() and friends, which is vm_stat, and it is updated
very infrequently; to get accurate notifications we would have to
update it much more frequently, but that is very expensive. So
KOSAKI and Christoph will complain. :-)
2. Plus use deferred timers to monitor /proc/vmstat, we don't have to
be fast here. But I see Pekka and Leonid don't like it already,
so we're stuck.

Thanks,

> It's totally pseudo code just I want to show my intention and even it's not a math.
> Totally we need more fine-grained some expression to standardize memory pressure.
> For it, we can use VM's several parameter, nr_scanned, nr_reclaimed, order, dirty page scanning ratio
> and so on. Also, we can aware of zone, node so we can pass lots of information to user space if they
> want it. For making lowmem notifier general, they are must, I think.
> We can have a plenty of tools for it.
>
> And later as further step, we could replace it with memcg-aware after memcg reclaim work is
> totally unified with global page reclaim. Many memcg guys have tried it so I expect it works
> sooner or later but I'm not sure memcg really need it because memcg's goal is limit memory resource
> among several process groups. If some process feel bad about latency due to short of free memory
> and it's critical, I think it would be better to create new memcg group has tighter limit for
> latency and put the process into the group.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..eae3d2e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2323,6 +2323,32 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> }
>
> /*
> + * higher dirty pages, higher pressure
> + * higher nr_scanned, higher pressure
> + * higher nr_reclaimed, lower pressure
> + * higher unmapped pages, lower pressure
> + *
> + * index toward 0 implies memory pressure is heavy.
> + */
> +int lowmem_index(struct zone *zone, struct scan_control *sc)
> +{
> + int pressure = (1000 * (sc->nr_scanned * (zone_page_state(zone, NR_FILE_DIRTY)
> + * dirty_weight + 1) - sc->nr_reclaimed -
> + zone_unmapped_file_pages(zone))) /
> + zone_reclaimable_page(zone);
> +
> + return 1000 - pressure;
> +}
> +
> +void lowmem_notifier(struct zone *zone, int index)
> +{
> + if (lowmem_has_interested_zone(zone)) {
> + if (index < sysctl_lowmem_threshold)
> + notify(numa_node_id(), zone, index);
> + }
> +}
> +
> +/*
> * For kswapd, balance_pgdat() will work across all this node's zones until
> * they are all at high_wmark_pages(zone).
> *
> @@ -2494,6 +2520,7 @@ loop_again:
> !zone_watermark_ok_safe(zone, testorder,
> high_wmark_pages(zone) + balance_gap,
> end_zone, 0)) {
> + int index;
> shrink_zone(zone, &sc);
>
> reclaim_state->reclaimed_slab = 0;
> @@ -2503,6 +2530,9 @@ loop_again:
>
> if (nr_slab == 0 && !zone_reclaimable(zone))
> zone->all_unreclaimable = 1;
> +
> + index = lowmem_index(zone, &sc);
> + lowmem_notifier(zone, index);
>
> >
>
> > p.s. http://git.infradead.org/users/cbou/ulmkd.git
> > I haven't updated it for new vmevent changes, but still,
> > its idea should be clear enough.
> >
>
>
> --
> Kind regards,
> Minchan Kim

--
Anton Vorontsov
Email: [email protected]

2012-06-08 08:00:27

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

On Fri, Jun 08, 2012 at 07:05:46AM +0000, [email protected] wrote:
> > -----Original Message-----
> > From: ext Anton Vorontsov [mailto:[email protected]]
> > Sent: 08 June, 2012 09:58
> ...
> > If you're saying that we should set up a timer in the userland and constantly
> > read /proc/vmstat, then we will cause CPU wake up every 100ms, which is
> > not acceptable. Well, we can try to introduce deferrable timers for the
> > userspace. But then it would still add a lot more overhead for our task, as this
> > solution adds other two context switches to read and parse /proc/vmstat. I
> > guess this is not a show-stopper though, so we can discuss this.
> >
> > Leonid, Pekka, what do you think about the idea?
>
> Seems to me not nice solution. Generating/parsing vmstat every 100ms plus wakeups it is what exactly should be avoid to have sense to API.

No, if we implement deferred timers for userland, we would not wake
up every 100 ms. The only additional overhead compared to vmevent
would be:

a) Two more context switches;
b) Serialization/deserialization of /proc/vmstat.

> It also will cause page trashing because user-space code could be pushed out from cache if VM decide.

This can be solved by moving the "watcher" to a separate (daemon) process
and mlocking it. We do this in ulmkd.

Thanks,

--
Anton Vorontsov
Email: [email protected]

2012-06-08 08:08:52

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

On Fri, Jun 08, 2012 at 10:03:24AM +0300, Pekka Enberg wrote:
> On Fri, Jun 8, 2012 at 9:58 AM, Anton Vorontsov
> <[email protected]> wrote:
> > If you're saying that we should set up a timer in the userland and
> > constantly read /proc/vmstat, then we will cause CPU wake up
> > every 100ms, which is not acceptable. Well, we can try to introduce
> > deferrable timers for the userspace. But then it would still add
> > a lot more overhead for our task, as this solution adds other two
> > context switches to read and parse /proc/vmstat. I guess this is
> > not a show-stopper though, so we can discuss this.
> >
> > Leonid, Pekka, what do you think about the idea?
>
> That's exactly the kind of half-assed ABI that lead to people
> inventing out-of-tree lowmem notifiers in the first place.

:-)

Well, at least powersaving-wise, the solution w/ userland deferred
timers would be much better then just looping over /proc/vmstat each
100ms, and it is comparable to vmevent. Not pretty, but still would
work.

> I'd be more interested to know what people think of Minchan's that
> gets rid of vmstat sampling.

I answered Minchan's post. The thing is that Minchan's idea
is not a substitute for vmevent. To me it seems like a shrinker with
some pre-filter.

Thanks,

--
Anton Vorontsov
Email: [email protected]

2012-06-08 08:16:39

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

> -----Original Message-----
> From: ext Anton Vorontsov [mailto:[email protected]]
> Sent: 08 June, 2012 10:59
...
> a) Two more context switches;
> b) Serialization/deserialization of /proc/vmstat.
>
> > It will also cause page thrashing, because user-space code could be pushed
> out of the cache if the VM so decides.
>
> This can be solved by moving the "watcher" to a separate (daemon) process, and
> mlocking it. We do this in ulmkd.

Right. But it has drawbacks as well, e.g. ensuring that the daemon is scheduled properly and propagating the reaction decision outside of ulmkd.
Also, I understand your statement about the "watcher" to mean that you probably use one timer for the daemon.
Btw, in my variant (memnotify.c) I used only one timer; it was enough.

2012-06-08 08:42:50

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

On Fri, Jun 08, 2012 at 08:16:04AM +0000, [email protected] wrote:
> > -----Original Message-----
> > From: ext Anton Vorontsov [mailto:[email protected]]
> > Sent: 08 June, 2012 10:59
> ...
> > a) Two more context switches;
> > b) Serialization/deserialization of /proc/vmstat.
> >
> > > It will also cause page thrashing, because user-space code could be pushed
> > out of the cache if the VM so decides.
> >
> > This can be solved by moving the "watcher" to a separate (daemon) process, and
> > mlocking it. We do this in ulmkd.
>
> Right. But it has drawbacks as well, e.g. ensuring that the daemon is scheduled properly and propagating the reaction decision outside of ulmkd.

No, ulmkd itself propagates the decision (i.e. it kills processes).

Here is how it works:

1. The Android activity manager (tons of Java code, running inside a
JVM) maintains a list of applications and their "importance" index.
This huge pile of code (and the whole JVM) of course can't be
mlocked, so it only maintains the list.

2. Once ulmkd (a separate low memory killer daemon, written in C)
receives a notification that the system is low on memory, it looks
at the already prepared lists and, based on the processes'
importance (and the current free memory level), kills the
appropriate apps.

Note that in-kernel LMK does absolutely the same as ulmkd, just
in the kernel (and the "importance index" is passed to LMK as
per-process oom_score_adj).

> Also, I understand your statement about the "watcher" to mean that you probably use one timer for the daemon.
> Btw, in my variant (memnotify.c) I used only one timer; it was enough.

Not quite following.

In ulmkd I don't use timers at all, and by "watcher" I mean some
userspace daemon that receives low-memory/pressure events
(in our case it is ulmkd).

If we start "polling" /proc/vmstat via userland deferred timers,
that would be a single timer, just like in the vmevent case. So I'm
not sure what the difference is?..

Thanks,

--
Anton Vorontsov
Email: [email protected]

2012-06-08 08:43:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On 06/08/2012 04:49 PM, Anton Vorontsov wrote:

> On Thu, Jun 07, 2012 at 11:41:27AM +0900, Minchan Kim wrote:
> [...]
>> How about this?
>
> So, basically this is just another shrinker callback (well, it's
> very close to), but only triggered after some magic index crosses
> a threshold.
>
> Some information beforehand: current ALMK is broken in the regard
> that it does not do what it is supposed to do according to its
> documentation. It uses shrinker notifiers (alike to your code), but
> kernel calls shrinker when there is already pressure on the memory,
> and ALMK's original idea was to start killing processes when there
> is [yet] no pressure at all, so ALMK was supposed to act in advance,
> e.g. "kill unneeded apps when there's say 64 MB free memory left,
> irrespective of the current pressure". ALMK doesn't do this
> currently, it only reacts to the shrinker.


Hearing this, I feel the problem is in the VM.
The VM's goal is to make full use of the available memory while still
being able to reclaim used pages as soon as possible, so that user
programs don't see big latencies as long as there are enough easily
reclaimable pages in the VM.

So, when reclaim first starts, there are probably lots of reclaimable
pages in the VM, so they can be reclaimed easily. Nonetheless, if you
feel it's very slow, in principle, that's a VM problem. But I haven't
heard such latency complaints from desktop people as long as there are
lots of reclaimable pages.

I admit there is big latency if we have lots of dirty pages while clean
pages are almost exhausted and the backing devices are very slow, which
is a known problem that several mm guys are still trying to solve.

I admit you can argue about which pages count as easily reclaimable.
Normally, we can order them as follows:

1. non-mapped clean cache page
2. mapped-clean cache page
3. non-mapped dirty cache page
4. mapped dirty cache page
5. anon pages, tmpfs/shmem pages.

So I want to build the math from those plus the VM's additional information, and let
the user configure the weights, because on some crazy system, swapout backed by an SSD
can be faster than flushing dirty file pages backed by a very slow rotational device.

And we can improve it by adding a new LRU list - CFLRU [1] - which would also be good for swap on embedded devices.
If the clean LRU is about to run short, that's a good indication of latency, so we can trigger a notification or start the vmevent timer.

[1] http://staff.ustc.edu.cn/~jpq/paper/flash/2006-CASES-CFLRU-%20A%20Replacement%20Algorithm%20for%20Flash%20Memory.pdf

>
> So, the solution would be then two-fold:
>
> 1. Use your memory pressure notifications. They must be quite fast when
> we start to feel the high pressure. (I see that you use
> zone_page_state() and friends, which is vm_stat, and it is updated


The VM has other information like nr_reclaimed, nr_scanned, nr_congested, recent_scanned
and recent_rotated, too. I hope we can build the math from them and improve it as we
improve the VM reclaimer.

> very infrequently, but to get accurate notifications we have to
> update it much more frequently, which is very expensive. So
> KOSAKI and Christoph will complain. :-)


The reclaimer already uses that, and when we need accuracy we handle it
like zone_watermark_ok_safe does. If it's very inaccurate, the VM should be fixed, too.

> 2. Plus use deferred timers to monitor /proc/vmstat, we don't have to
> be fast here. But I see Pekka and Leonid don't like it already,
> so we're stuck.
>
> Thanks,


>>
>> --
>> Kind regards,
>> Minchan Kim
>



--
Kind regards,
Minchan Kim

2012-06-08 08:48:12

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Fri, Jun 8, 2012 at 11:43 AM, Minchan Kim <[email protected]> wrote:
>> So, the solution would be then two-fold:
>>
>> 1. Use your memory pressure notifications. They must be quite fast when
>> we start to feel the high pressure. (I see that you use
>> zone_page_state() and friends, which is vm_stat, and it is updated
>
> VM has other information like nr_reclaimed, nr_scanned, nr_congested, recent_scanned,
> recent_rotated, too. I hope we can make math by them and improve as we improve
> VM reclaimer.
>
>> very infrequently, but to get accurate notifications we have to
>> update it much more frequently, which is very expensive. So
>> KOSAKI and Christoph will complain. :-)
>
>
> The reclaimer already uses that, and when we need accuracy we handle it
> like zone_watermark_ok_safe does. If it's very inaccurate, the VM should be fixed, too.

Exactly. I don't know why people think pushing vmevents to userspace
is going to fix any of the hard problems.

Anton, Leonid, do you see any fundamental issues, from a userspace point
of view, with going forward with what Minchan is proposing?

2012-06-08 08:57:46

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

> -----Original Message-----
> From: ext Anton Vorontsov [mailto:[email protected]]
> Sent: 08 June, 2012 11:41
...
> > Right. It but it has drawbacks as well e.g. ensure that daemon scheduled
> properly and propagate reaction decision outside ulmkd.
>
> No, ulmkd itself propagates the decision (i.e. it kills processes).

That is a "select & kill" decision :)
Propagating this decision requires time. Not all processes can be killed. You may get stuck killing in some cases.
...
> In ulmkd I don't use timers at all, and by "watcher" I mean the some
> userspace daemon that receives lowmemory/pressure events (in our case it
> is ulmkd).

Thanks for info.

> If we start "polling" on /proc/vmstat via userland deferred timers, that would
> be a single timer, just like in vmevent case. So, I'm not sure what is the
> difference?..

Context switches, parsing, activity in userspace even when the memory situation hasn't changed. In kernel space you can use a sliding timer (increasing interval) + shrinker.

2012-06-08 09:12:55

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 0/5] Some vmevent fixes...

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of ext
> Pekka Enberg
> Sent: 08 June, 2012 11:48
> To: Minchan Kim
...
>
> On Fri, Jun 8, 2012 at 11:43 AM, Minchan Kim <[email protected]> wrote:
> >> So, the solution would be then two-fold:
> >>
> >> 1. Use your memory pressure notifications. They must be quite fast
> >> when
> >> we start to feel the high pressure. (I see that you use
> >> zone_page_state() and friends, which is vm_stat, and it is updated
> >
> > VM has other information like nr_reclaimed, nr_scanned, nr_congested,
> > recent_scanned, recent_rotated, too. I hope we can make math by them
> > and improve as we improve VM reclaimer.
> >
> >> very infrequently, but to get accurate notifications we have to
> >> update it much more frequently, which is very expensive. So
> >> KOSAKI and Christoph will complain. :-)
> >
> >
> > The reclaimer already uses that, and when we need accuracy, we handle
> > it like zone_watermark_ok_safe does. If it's very inaccurate, the VM should be fixed,
> too.
>
> Exactly. I don't know why people think pushing vmevents to userspace is
> going to fix any of the hard problems.
>
> Anton, Leonid, do you see any fundamental issues, from a userspace point of
> view, with going forward with what Minchan is proposing?

That's a good proposal, but I have to underline that userspace could be interested not only in memory-stressed cases (pressure, vm watermarks ON, etc.)
but in quite relaxed ones as well, e.g. "60% of dirty pages are consumed - let's not restart some daemons". In very stressed conditions user-space might already be dead.

Another interesting question is which combinations of VM page types are worth tracking; as Minchan correctly stated, it depends on the area.
To me it seems the weights will most likely be -1, 0 or +1 for calculating the resulting values and thresholds, e.g. Active = {+1 * Active_Anon; +1 * Active_File}.
It would extend flexibility a lot.

2012-06-08 09:46:50

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Fri, Jun 08, 2012 at 09:12:36AM +0000, [email protected] wrote:
[...]
> > Exactly. I don't know why people think pushing vmevents to userspace is
> > going to fix any of the hard problems.
> >
> > Anton, Leonid, do you see any fundamental issues, from a userspace point of
> > view, with going forward with what Minchan is proposing?
>
> That's a good proposal, but I have to underline that userspace could be interested not only in memory-stressed cases (pressure, vm watermarks ON, etc.)
> but in quite relaxed ones as well, e.g. "60% of dirty pages are consumed - let's not restart some daemons". In very stressed conditions user-space might already be dead.

Indeed. Minchan's proposal is good to get notified that VM is under
stress.

But suppose some app allocates memory slowly (e.g. I scroll a large
page on my phone, and the page is rendered piece by piece). So, in
the end we slowly but surely allocate a lot of memory. In that
case Minchan's method won't notice that it's actually time to close
some apps.

Then suppose someone calls me, the "phone" application is now
starting, but since we're out of 'easy to reclaim' pages, it takes
forever for the app to load, VM is now under huge stress, and surely
we're *now* getting notified, but alas, it is too late. Call missed.


So, it's like measuring distance, velocity and acceleration. In the
Android case, among other things, we're interested in the distance too!
I.e. "exactly how many 'easy to reclaim' pages are left", not only
"how fast we're running out of 'easy to reclaim' pages".

> Another interesting question which combination of VM page types could be recognized as interesting for tracking as Minchan correctly stated it depends from area.
> For me seems weights most likely will be -1, 0 or +1 to calculate resulting values and thesholds e.g. Active = {+1 * Active_Anon; +1 * Active_File}
> It will extend flexibility a lot.

Exposing VM details to the userland? No good. :-)

--
Anton Vorontsov
Email: [email protected]

2012-06-08 10:36:48

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

On Fri, Jun 08, 2012 at 08:57:13AM +0000, [email protected] wrote:
> > -----Original Message-----
> > From: ext Anton Vorontsov [mailto:[email protected]]
> > Sent: 08 June, 2012 11:41
> ...
> > > Right. It but it has drawbacks as well e.g. ensure that daemon scheduled
> > properly and propagate reaction decision outside ulmkd.
> >
> > No, ulmkd itself propagates the decision (i.e. it kills processes).
>
> That is a "select & kill" decision :)
> Propagating this decision requires time. Not all processes can be killed. You may get stuck killing in some cases.

Yeah. But since we have plenty of free memory (i.e. we're getting
notified in advance), it's OK not to be super-fast.

And if we lose control, the OOM killer will help us. (Btw, we could
introduce a "thrash killer" in-kernel driver. This would also help
the desktop use case: when the system is thrashing so hard that it
becomes unresponsive, we'd better do something about it. When the
browser goes crazy on my laptop, I wish I had such a driver. :-)
It takes forever to reach the OOM condition w/ 2GB of swap space, a
slow hard drive and the CPU busy moving pages back and forth.)

> > If we start "polling" on /proc/vmstat via userland deferred timers, that would
> > be a single timer, just like in vmevent case. So, I'm not sure what is the
> > difference?..
>
> Context switches, parsing, activity in userspace even when the memory situation hasn't changed.

Sure, there is some additional overhead. I'm just saying that it is
not drastic. It would be like 100 sprintfs + 100 sscanfs + 2 context
switches? Well, it is unfortunate... but come on, today's phones are
running X11 and Java. :-)

> In kernel space you can use a sliding timer (increasing interval) + shrinker.

Well, w/ Minchan's idea, we can get shrinker notifications into the
userland, so the sliding timer thing would still be possible.

Thanks,

--
Anton Vorontsov
Email: [email protected]

2012-06-08 10:42:16

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Fri, Jun 08, 2012 at 02:45:07AM -0700, Anton Vorontsov wrote:
> On Fri, Jun 08, 2012 at 09:12:36AM +0000, [email protected] wrote:
> [...]
> > > Exactly. I don't know why people think pushing vmevents to userspace is
> > > going to fix any of the hard problems.
> > >
> > > Anton, Lenoid, do you see any fundamental issues from userspace point of
> > > view with going forward what Minchan is proposing?
> >
> > That's a good proposal, but I have to underline that userspace could be interested not only in memory-stressed cases (pressure, vm watermarks ON, etc.)
> > but in quite relaxed ones as well, e.g. "60% of dirty pages are consumed - let's not restart some daemons". In very stressed conditions user-space might already be dead.
>
> Indeed. Minchan's proposal is good to get notified that VM is under
> stress.
>
> But suppose some app allocates memory slowly (e.g. I scroll a large
> page on my phone, and the page is rendered piece by piece). So, in
> the end we slowly but surely allocate a lot of memory. In that
> case Minchan's method won't notice that it's actually time to close
> some apps.

I don't understand. Why can't the approach catch that situation?
Let's think about it.

There is 40M in the CleanCache LRU, which holds easily reclaimable pages,
there are 10M free pages, and the high watermark is 5M.

Your application starts to consume free pages very slowly.
So when your application has consumed 5M, the VM starts to reclaim. So far,
it's okay because we have 40M of easily reclaimable pages. And the low memory
notifier can start notifying, so your daemon can take some action to get
free pages.

I think it's not so late.

sidenote:
It seems I live on the complete opposite side of the world, because
you guys always start discussions when I am about to leave the office.
Please understand my late responses.
Maybe I will come back after weekend. :)

Thanks.

>
> Then suppose someone calls me, the "phone" application is now
> starting, but since we're out of 'easy to reclaim' pages, it takes
> forever for the app to load, VM is now under huge stress, and surely
> we're *now* getting notified, but alas, it is too late. Call missed.
>
>
> So, it's like measuring distance, velocity and acceleration. In the
> Android case, among other things, we're interested in the distance too!
> I.e. "exactly how many 'easy to reclaim' pages are left", not only
> "how fast we're running out of 'easy to reclaim' pages".
>
> > Another interesting question is which combinations of VM page types are worth tracking; as Minchan correctly stated, it depends on the area.
> > To me it seems the weights will most likely be -1, 0 or +1 for calculating the resulting values and thresholds, e.g. Active = {+1 * Active_Anon; +1 * Active_File}.
> > It would extend flexibility a lot.
>
> Exposing VM details to the userland? No good. :-)
>
> --
> Anton Vorontsov
> Email: [email protected]

2012-06-08 11:03:56

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

> -----Original Message-----
> From: ext Anton Vorontsov [mailto:[email protected]]
> Sent: 08 June, 2012 13:35
...
> > Context switches, parsing, activity in userspace even memory situation is
> not changed.
>
> Sure, there is some additional overhead. I'm just saying that it is not drastic. It
> would be like 100 sprintfs + 100 sscanfs + 2 context switches? Well, it is
> unfortunate... but come on, today's phones are running X11 and Java. :-)

Vmstat generation is not so trivial, and meminfo has even higher overhead. I just checked the generation times on an idling device with an open/read test:
- vmstat: min 30, avg 94, max 2746 microseconds
- meminfo: min 30, avg 65, max 15961 microseconds

For comparison, /proc/version under the same conditions: min 30, avg 41, max 1505 microseconds.

> > In kernel space you can use sliding timer (increasing interval) + shinker.
>
> Well, w/ Minchan's idea, we can get shrinker notifications into the userland,
> so the sliding timer thing would be still possible.

Only as post-shrinker actions. Under memory stress or close-to-stress conditions, shrinkers are called very often; I saw up to 50 times per second.


2012-06-08 11:16:11

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On Fri, Jun 08, 2012 at 07:42:04PM +0900, Minchan Kim wrote:
[...]
> I don't understand. Why can't the approach catch that situation?
> Let's think about it.
>
> There is 40M in the CleanCache LRU, which holds easily reclaimable pages,
> there are 10M free pages, and the high watermark is 5M.
>
> Your application starts to consume free pages very slowly.
> So when your application has consumed 5M, the VM starts to reclaim. So far,
> it's okay because we have 40M of easily reclaimable pages. And the low memory
> notifier can start notifying, so your daemon can take some action to get free pages.

Maybe I'm missing how you would use the shrinker. But the last time
I tried on my (swap-less, FWIW) qemu test setup, I was not receiving
any notifications from the shrinker until the system was almost
(but not exactly) out of memory.

My test app was allocating all memory MB by MB, filling the memory
with zeroes. So, what I was observing is that shrinker callback was
called just a few MB before OOM, not every 'X' consumed MBs.

> I think it's not so late.
>
> sidenote:
> It seems I live in the complete opposite place because
> you guys always start discussion when I am about going out of office.
> Please understand my late response.
> Maybe I will come back after weekend. :)

Well, it's 4AM here. :-) Have a great weekend!

--
Anton Vorontsov
Email: [email protected]

2012-06-08 12:15:22

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

On Fri, Jun 08, 2012 at 11:03:29AM +0000, [email protected] wrote:
> > -----Original Message-----
> > From: ext Anton Vorontsov [mailto:[email protected]]
> > Sent: 08 June, 2012 13:35
> ...
> > > Context switches, parsing, activity in userspace even when the memory
> > situation hasn't changed.
> >
> > Sure, there is some additional overhead. I'm just saying that it is not drastic. It
> > would be like 100 sprintfs + 100 sscanfs + 2 context switches? Well, it is
> > unfortunate... but come on, today's phones are running X11 and Java. :-)
>
> Vmstat generation is not so trivial, and meminfo has even higher overhead. I just checked the generation times on an idling device with an open/read test:
> - vmstat: min 30, avg 94, max 2746 microseconds
> - meminfo: min 30, avg 65, max 15961 microseconds
>
> For comparison, /proc/version under the same conditions: min 30, avg 41, max 1505 microseconds.

Hm. I would have expected the avg value for meminfo to be much worse
than vmstat's (meminfo grabs some locks).

OK, if we consider a 100 ms interval, then this would be about 0.1%
overhead (~94 us per read every 100 ms). Not great, but still better than memcg:

http://lkml.org/lkml/2011/12/21/487

:-)

Personally? I'm all for saving these 0.1%; I'm all for vmevent.
But, for example, it's still broken for SMP, as it is costly to
update vm_stat. And I see no way to fix this.

So, I guess the right approach would be to find ways to not depend on
frequent vm_stat updates (and thus reads).

Userland deferred timers (and infrequent reads of vmstat) +
"userland vm pressure notifications" look promising as the userland
solution.

For the in-kernel solution it is all the same: a deferred timer that
reads vm_stat occasionally (the no-pressure case) + in-kernel shrinker
notifications for fast reaction under pressure.

> > > In kernel space you can use a sliding timer (increasing interval) + shrinker.
> >
> > Well, w/ Minchan's idea, we can get shrinker notifications into the userland,
> > so the sliding timer thing would be still possible.
>
> Only as post-shrinker actions. Under memory stress or
> close-to-stress conditions, shrinkers are called very often; I saw up to
> 50 times per second.

Well, yes. But in userland you would just poll/select on the shrinker
notification fd, so you won't get more than you can (or want to) process.

--
Anton Vorontsov
Email: [email protected]

2012-06-08 12:25:34

by leonid.moiseichuk

[permalink] [raw]
Subject: RE: [PATCH 2/5] vmevent: Convert from deferred timer to deferred work

> -----Original Message-----
> From: ext Anton Vorontsov [mailto:[email protected]]
> Sent: 08 June, 2012 15:14
> To: Moiseichuk Leonid (Nokia-MP/Espoo)
...
> Hm. I would have expected the avg value for meminfo to be much worse than
> vmstat's (meminfo grabs some locks).
>
> OK, if we consider 100ms interval, then this would be like 0.1% overhead?
> Not great, but still better than memcg:
>
> http://lkml.org/lkml/2011/12/21/487

It is difficult to win against memcg :)
But compared to one syscall like read() of a small structure from a dedicated device, generating meminfo is about 1000x more expensive.

> So, I guess the right approach would be to find ways to not depend on
> frequent vm_stat updates (and thus reads).

Agree.

> userland deferred timers (and infrequent reads from vmstat) + "userland vm
> pressure notifications" looks promising for the userland solution.

2012-06-11 04:50:29

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Some vmevent fixes...

On 06/08/2012 08:14 PM, Anton Vorontsov wrote:

> On Fri, Jun 08, 2012 at 07:42:04PM +0900, Minchan Kim wrote:
> [...]
>> I don't understand. Why can't the approach catch that situation?
>> Let's think about it.
>>
>> There is 40M in the CleanCache LRU, which holds easily reclaimable pages,
>> there are 10M free pages, and the high watermark is 5M.
>>
>> Your application starts to consume free pages very slowly.
>> So when your application has consumed 5M, the VM starts to reclaim. So far,
>> it's okay because we have 40M of easily reclaimable pages. And the low memory
>> notifier can start notifying, so your daemon can take some action to get free pages.
>
> Maybe I'm missing how you would use the shrinker. But the last time
> I tried on my (swap-less, FWIW) qemu test setup, I was not receiving
> any notifications from the shrinker until the system was almost
> (but not exactly) out of memory.
>
> My test app was allocating all memory MB by MB, filling the memory
> with zeroes. So, what I was observing is that shrinker callback was
> called just a few MB before OOM, not every 'X' consumed MBs.


Yes, the page reclaimer doesn't guarantee calling the shrinker per MB.
So if you want to be notified per your threshold, the shrinker isn't good.

I didn't say I would use the shrinker.
I want to add hooks directly into vmscan.c's normal reclaim path, not the shrinker.

Still, I want to implement it with level triggering, as I mentioned.
I will show you my concept if nobody interrupts me within a few weeks. :)

Thanks.

>
>> I think it's not so late.
>>
>> sidenote:
>> It seems I live on the complete opposite side of the world, because
>> you guys always start discussions when I am about to leave the office.
>> Please understand my late responses.
>> Maybe I will come back after weekend. :)
>
> Well, it's 4AM here. :-) Have a great weekend!
>


You win! :)

--
Kind regards,
Minchan Kim