This patch implements a new event type that triggers whenever a
value becomes greater than a user-specified threshold. It complements
the 'less-than' trigger type.
Also, implement a one-shot mode for the events: when set,
userspace receives only one notification per crossing of the
boundary.
When both LT and GT are set on the same level, the event type
works as a cross event type: it triggers whenever a value crosses
the threshold from the lesser side to the greater side,
and vice versa.
We use the event types in a userspace low-memory killer: we get a
notification when memory becomes low, so we start freeing memory by
killing unneeded processes, and we get a notification when memory hits
the threshold from the other side, so we know that we have freed enough
memory.
Signed-off-by: Anton Vorontsov <[email protected]>
---
On Thu, Apr 19, 2012 at 09:29:23AM -0700, Anton Vorontsov wrote:
> > > Yep, with CONFIG_SWAP=n, and I had to modify the test
> > > since I saw the same thing, I believe. I'll try w/ the swap enabled,
> > > and see how it goes. I think the vmevent-test.c needs some improvements
> > > in general, but meanwhile...
> > >
> > > > Physical pages: 109858
> > > > read failed: Invalid argument
> > >
> > > Can you send me the .config file that you used? Might be that
> > > you have CONFIG_SWAP=n too?
> >
> > No, I have CONFIG_SWAP=y. Here's the config I use.
>
> Thanks! It appears that there was just another hard-coded value for
> the counter in vmevent-test.c, I changed it to config.counter and
> now everything should be fine.
[...]
> + if ((lt || gt) && !one_shot)
> return true;
Noticed one small issue: the code doesn't check whether we actually
asked to track the lt or gt states, so it should be (attr_lt && lt) ||
(attr_gt && gt).
This is fixed now.
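For readers who want to try the matching rules outside the kernel, here is a
minimal userspace sketch of the per-attribute logic. It is a model under
stated assumptions, not the kernel code: the bit positions mirror the patch,
but the ATTR_* names, struct attr_model and attr_match() are made up for
illustration, and the non-one-shot path returns early without touching the
saved state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions mirror the patch; every name here is illustrative only. */
#define ATTR_LT       (UINT32_C(1) << 0)   /* notify when sample < threshold */
#define ATTR_GT       (UINT32_C(1) << 1)   /* notify when sample > threshold */
#define ATTR_ONE_SHOT (UINT32_C(1) << 2)   /* one notification per crossing */
#define ATTR_WAS_LT   (UINT32_C(1) << 30)  /* saved state: last side was "less" */
#define ATTR_WAS_GT   (UINT32_C(1) << 31)  /* saved state: last side was "greater" */

struct attr_model {
	uint32_t state;      /* requested flags plus the saved WAS_* bits */
	uint64_t threshold;  /* user-specified value */
};

/* Returns true when this sample should produce a notification. */
bool attr_match(struct attr_model *a, uint64_t sample)
{
	assert(a != NULL);

	bool want_lt = a->state & ATTR_LT;
	bool want_gt = a->state & ATTR_GT;
	bool lt = sample < a->threshold;
	bool gt = sample > a->threshold;
	bool ret = false;

	if (!want_lt && !want_gt)
		return false;

	/* Without one-shot, report every sample on a side we were asked to track. */
	if (!(a->state & ATTR_ONE_SHOT))
		return (want_lt && lt) || (want_gt && gt);

	/* One-shot: stay quiet while we remain on a side that was already seen. */
	if (want_lt && lt && (a->state & ATTR_WAS_LT))
		return false;
	if (want_gt && gt && (a->state & ATTR_WAS_GT))
		return false;

	if (lt) {
		a->state = (a->state | ATTR_WAS_LT) & ~ATTR_WAS_GT;
		ret = want_lt;
	} else if (gt) {
		a->state = (a->state | ATTR_WAS_GT) & ~ATTR_WAS_LT;
		ret = want_gt;
	} else {
		/* Exactly on the threshold: forget the previous side. */
		a->state &= ~(ATTR_WAS_LT | ATTR_WAS_GT);
	}
	return ret;
}
```

With LT | GT | ONE_SHOT and a threshold of 100, feeding the samples 50, 40,
150, 160 in that order reports only 50 and 150, i.e. one notification per
crossing.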
include/linux/vmevent.h | 13 ++++++++++
mm/vmevent.c | 45 +++++++++++++++++++++++++++++-----
tools/testing/vmevent/vmevent-test.c | 24 +++++++++++++-----
3 files changed, 70 insertions(+), 12 deletions(-)
diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h
index 64357e4..ca97cf0 100644
--- a/include/linux/vmevent.h
+++ b/include/linux/vmevent.h
@@ -22,6 +22,19 @@ enum {
* Sample value is less than user-specified value
*/
VMEVENT_ATTR_STATE_VALUE_LT = (1UL << 0),
+ /*
+ * Sample value is greater than user-specified value
+ */
+ VMEVENT_ATTR_STATE_VALUE_GT = (1UL << 1),
+ /*
+ * One-shot mode.
+ */
+ VMEVENT_ATTR_STATE_ONE_SHOT = (1UL << 2),
+
+ /* Saved state, used internally by the kernel for one-shot mode. */
+ __VMEVENT_ATTR_STATE_VALUE_WAS_LT = (1UL << 30),
+ /* Saved state, used internally by the kernel for one-shot mode. */
+ __VMEVENT_ATTR_STATE_VALUE_WAS_GT = (1UL << 31),
};
struct vmevent_attr {
diff --git a/mm/vmevent.c b/mm/vmevent.c
index 9ed6aca..47ed448 100644
--- a/mm/vmevent.c
+++ b/mm/vmevent.c
@@ -1,5 +1,6 @@
#include <linux/anon_inodes.h>
#include <linux/atomic.h>
+#include <linux/compiler.h>
#include <linux/vmevent.h>
#include <linux/syscalls.h>
#include <linux/timer.h>
@@ -83,16 +84,48 @@ static bool vmevent_match(struct vmevent_watch *watch)
for (i = 0; i < config->counter; i++) {
struct vmevent_attr *attr = &config->attrs[i];
- u64 value;
+ u32 state = attr->state;
+ bool attr_lt = state & VMEVENT_ATTR_STATE_VALUE_LT;
+ bool attr_gt = state & VMEVENT_ATTR_STATE_VALUE_GT;
- if (!attr->state)
+ if (!state)
continue;
- value = vmevent_sample_attr(watch, attr);
-
- if (attr->state & VMEVENT_ATTR_STATE_VALUE_LT) {
- if (value < attr->value)
+ if (attr_lt || attr_gt) {
+ bool one_shot = state & VMEVENT_ATTR_STATE_ONE_SHOT;
+ u32 was_lt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_LT;
+ u32 was_gt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_GT;
+ u64 value = vmevent_sample_attr(watch, attr);
+ bool lt = value < attr->value;
+ bool gt = value > attr->value;
+ bool was_lt = state & was_lt_mask;
+ bool was_gt = state & was_gt_mask;
+ bool ret = false;
+
+ if (((attr_lt && lt) || (attr_gt && gt)) && !one_shot)
return true;
+
+ if (attr_lt && lt && was_lt) {
+ return false;
+ } else if (attr_gt && gt && was_gt) {
+ return false;
+ } else if (lt) {
+ state |= was_lt_mask;
+ state &= ~was_gt_mask;
+ if (attr_lt)
+ ret = true;
+ } else if (gt) {
+ state |= was_gt_mask;
+ state &= ~was_lt_mask;
+ if (attr_gt)
+ ret = true;
+ } else {
+ state &= ~was_lt_mask;
+ state &= ~was_gt_mask;
+ }
+
+ attr->state = state;
+ return ret;
}
}
diff --git a/tools/testing/vmevent/vmevent-test.c b/tools/testing/vmevent/vmevent-test.c
index 534f827..fd9a174 100644
--- a/tools/testing/vmevent/vmevent-test.c
+++ b/tools/testing/vmevent/vmevent-test.c
@@ -33,20 +33,32 @@ int main(int argc, char *argv[])
config = (struct vmevent_config) {
.sample_period_ns = 1000000000L,
- .counter = 4,
+ .counter = 6,
.attrs = {
- [0] = {
+ {
.type = VMEVENT_ATTR_NR_FREE_PAGES,
.state = VMEVENT_ATTR_STATE_VALUE_LT,
.value = phys_pages,
},
- [1] = {
+ {
+ .type = VMEVENT_ATTR_NR_FREE_PAGES,
+ .state = VMEVENT_ATTR_STATE_VALUE_GT,
+ .value = phys_pages,
+ },
+ {
+ .type = VMEVENT_ATTR_NR_FREE_PAGES,
+ .state = VMEVENT_ATTR_STATE_VALUE_LT |
+ VMEVENT_ATTR_STATE_VALUE_GT |
+ VMEVENT_ATTR_STATE_ONE_SHOT,
+ .value = phys_pages / 2,
+ },
+ {
.type = VMEVENT_ATTR_NR_AVAIL_PAGES,
},
- [2] = {
+ {
.type = VMEVENT_ATTR_NR_SWAP_PAGES,
},
- [3] = {
+ {
.type = 0xffff, /* invalid */
},
},
@@ -59,7 +71,7 @@ int main(int argc, char *argv[])
}
for (i = 0; i < 10; i++) {
- char buffer[sizeof(struct vmevent_event) + 4 * sizeof(struct vmevent_attr)];
+ char buffer[sizeof(struct vmevent_event) + config.counter * sizeof(struct vmevent_attr)];
struct vmevent_event *event;
int n = 0;
int idx;
--
1.7.9.2
On 05/01/2012 09:18 AM, Anton Vorontsov wrote:
> This patch implements a new event type, it will trigger whenever a
> value becomes greater than user-specified threshold, it complements
> the 'less-then' trigger type.
>
> Also, let's implement the one-shot mode for the events, when set,
> userspace will only receive one notification per crossing the
> boundaries.
>
> Now when both LT and GT are set on the same level, the event type
> works as a cross event type: it triggers whenever a value crosses
> the threshold from a lesser values side to a greater values side,
> and vice versa.
>
> We use the event types in an userspace low-memory killer: we get a
> notification when memory becomes low, so we start freeing memory by
> killing unneeded processes, and we get notification when memory hits
> the threshold from another side, so we know that we freed enough of
> memory.
How are these vmevents supposed to work with cgroups?
What do we do when a cgroup nears its limit, and there
is no more swap space available?
What do we do when a cgroup nears its limit, and there
is swap space available?
It would be nice to be able to share the same code for
embedded, desktop and server workloads...
--
All rights reversed
Hello Rik,
Thanks for looking into this!
On Tue, May 01, 2012 at 05:04:21PM -0400, Rik van Riel wrote:
> On 05/01/2012 09:18 AM, Anton Vorontsov wrote:
> >This patch implements a new event type, it will trigger whenever a
> >value becomes greater than user-specified threshold, it complements
> >the 'less-then' trigger type.
> >
> >Also, let's implement the one-shot mode for the events, when set,
> >userspace will only receive one notification per crossing the
> >boundaries.
> >
> >Now when both LT and GT are set on the same level, the event type
> >works as a cross event type: it triggers whenever a value crosses
> >the threshold from a lesser values side to a greater values side,
> >and vice versa.
> >
> >We use the event types in an userspace low-memory killer: we get a
> >notification when memory becomes low, so we start freeing memory by
> >killing unneeded processes, and we get notification when memory hits
> >the threshold from another side, so we know that we freed enough of
> >memory.
>
> How are these vmevents supposed to work with cgroups?
Currently these are independent subsystems. If you have memcg enabled,
you can do almost anything* with the memory, as memcg has all the needed
hooks in the mm/ subsystem (it is more like a "memory management tracer"
nowadays :-).
But cgroups have their cost: both a performance penalty and memory wastage.
For example, in the best case, memcg constantly consumes 0.5% of RAM to
track memory usage; this is 5 MB on a 1 GB "embedded" machine. To some
people it feels just wrong to waste that memory for mere notifications.
Of course, this alone can be considered a lame argument for making
another subsystem (instead of "fixing" the current one). But see below,
vmevent is just a convenient ABI.
> What do we do when a cgroup nears its limit, and there
> is no more swap space available?
>
> What do we do when a cgroup nears its limit, and there
> is swap space available?
As of now, this is all orthogonal to vmevent. Vmevent doesn't know
about cgroups. If the kernel has memcg enabled, one should probably*
go with it (or better, with its ABI). At least for now.
> It would be nice to be able to share the same code for
> embedded, desktop and server workloads...
It would be great indeed, but so far I don't see much that
vmevent could share. Plus, sharing the code at this point is not
that interesting; it's a mere 500 lines of code (compared to
more than 10K lines for cgroups, and that's not including the memcg_
hooks and logic that are spread all over mm/).
Today the vmevent code is mostly an ABI implementation; there is
very little memory management logic (in contrast to memcg).
Personally, I would rather consider sharing the ABI at some point,
i.e. making a memcg backend for vmevent. That would be pretty
cool. And once done, vmevent would be cgroups-aware (if memcg is
enabled, of course; and if not, vmevent would still work, with
no memcg-related expenses).
* For low memory notifications, there are still some unresolved
issues with memcg. Mainly, slab accounting for the root cgroup:
the slab accounting currently being developed doesn't account for the
kernel's internal memory consumption, plus it doesn't account slab memory
for the root cgroup at all.
A few days ago I asked[1] why memcg doesn't do all this, and
whether it is a design decision or just an implementation detail
(so that we have a chance to fix it).
But so far there has been no feedback. We'll see how things turn out.
[1] http://lkml.org/lkml/2012/4/30/115
Thanks!
--
Anton Vorontsov
Email: [email protected]
(5/1/12 8:20 PM), Anton Vorontsov wrote:
> Hello Rik,
>
> Thanks for looking into this!
>
> On Tue, May 01, 2012 at 05:04:21PM -0400, Rik van Riel wrote:
>> On 05/01/2012 09:18 AM, Anton Vorontsov wrote:
>>> This patch implements a new event type, it will trigger whenever a
>>> value becomes greater than user-specified threshold, it complements
>>> the 'less-then' trigger type.
>>>
>>> Also, let's implement the one-shot mode for the events, when set,
>>> userspace will only receive one notification per crossing the
>>> boundaries.
>>>
>>> Now when both LT and GT are set on the same level, the event type
>>> works as a cross event type: it triggers whenever a value crosses
>>> the threshold from a lesser values side to a greater values side,
>>> and vice versa.
>>>
>>> We use the event types in an userspace low-memory killer: we get a
>>> notification when memory becomes low, so we start freeing memory by
>>> killing unneeded processes, and we get notification when memory hits
>>> the threshold from another side, so we know that we freed enough of
>>> memory.
>>
>> How are these vmevents supposed to work with cgroups?
>
> Currently these are independent subsystems, if you have memcg enabled,
> you can do almost anything* with the memory, as memg has all the needed
> hooks in the mm/ subsystem (it is more like "memory management tracer"
> nowadays :-).
>
> But cgroups have its cost, both performance penalty and memory wastage.
> For example, in the best case, memcg constantly consumes 0.5% of RAM to
> track memory usage, this is 5 MB on a 1 GB "embedded" machine. To some
> people it feels just wrong to waste that memory for mere notifications.
>
> Of course, this alone can be considered as a lame argument for making
> another subsystem (instead of "fixing" the current one). But see below,
> vmevent is just a convenient ABI.
>
>> What do we do when a cgroup nears its limit, and there
>> is no more swap space available?
>>
>> What do we do when a cgroup nears its limit, and there
>> is swap space available?
>
> As of now, this is all orthogonal to vmevent. Vmevent doesn't know
> about cgroups. If kernel has the memcg enabled, one should probably*
> go with it (or better, with its ABI). At least for now.
>
>> It would be nice to be able to share the same code for
>> embedded, desktop and server workloads...
>
> It would be great indeed, but so far I don't see much that
> vmevent could share. Plus, sharing the code at this point is not
> that interesting; it's mere 500 lines of code (comparing to
> more than 10K lines for cgroups, and it's not including memcg_
> hooks and logic that is spread all over mm/).
>
> Today vmevent code is mostly an ABI implementation, there is
> very little memory management logic (in contrast to the memcg).
But if it doesn't work in the desktop/server area, it shouldn't be merged.
We have to consider the best design before kernel inclusion. The two can't
be discussed separately.
Hello KOSAKI,
On Tue, May 01, 2012 at 09:20:27PM -0400, KOSAKI Motohiro wrote:
[...]
> >It would be great indeed, but so far I don't see much that
> >vmevent could share. Plus, sharing the code at this point is not
> >that interesting; it's mere 500 lines of code (comparing to
> >more than 10K lines for cgroups, and it's not including memcg_
> >hooks and logic that is spread all over mm/).
> >
> >Today vmevent code is mostly an ABI implementation, there is
> >very little memory management logic (in contrast to the memcg).
>
> But, if it doesn't work desktop/server area, it shouldn't be merged.
What makes you think that vmevent won't work for desktop or servers?
:-)
E.g. for some servers you don't always want memcg, really. Suppose
a kvm farm or a database server. Sometimes there's really no need for
memcg, but there's still a demand for low memory notifications.
Current Linux desktops don't use any notifications at all, I think.
So there is nothing to say about them, neither on cgroup's nor on
vmevent's behalf. I can hardly imagine why a desktop would use the whole
memcg thing, but it may still have a use case for memory notifications.
> We have to consider the best design before kernel inclusion. They cann't
> be separeted to discuss.
Of course, no objections here. But I somewhat disagree with the
"best design" term. Which design is better, reading a file via
read() or mmap()? It depends. Same here.
So far, I see that memcg has its own cons: some are "by design"
and some are due to incomplete features (e.g. slab accounting,
which, if accepted as is, seems to have its own design flaws).
memcg has many pros as well. Its main strength (for the
memory notifications case) is rate-limited events, which is a very
cool feature, and memcg has the feature because it's so tightly
tied to the mm subsystem.
But, as I said in my previous email, making a memcg backend for
vmevents seems doable. We'd only need to place a vmevents hook
into mm/memcontrol.c:memcg_check_events() and export the
mem_cgroup_usage() call.
So vmevent makes it possible for things to work with cgroups and
without cgroups; everybody's happy.
Thanks,
p.s. I'm not the vmevents author, plus I use both memcg and
vmevents. That makes me think that I'm pretty unbiased here. ;-)
--
Anton Vorontsov
Email: [email protected]
On Tue, May 01, 2012 at 08:31:36PM -0700, Anton Vorontsov wrote:
[...]
> p.s. I'm not the vmevents author, plus I use both memcg and
> vmevents. That makes me think that I'm pretty unbiased here. ;-)
...though, that doesn't mean I'm right, of course. :-)
--
Anton Vorontsov
Email: [email protected]
On 05/02/2012 12:31 PM, Anton Vorontsov wrote:
> Hello KOSAKI,
>
> On Tue, May 01, 2012 at 09:20:27PM -0400, KOSAKI Motohiro wrote:
> [...]
>>> It would be great indeed, but so far I don't see much that
>>> vmevent could share. Plus, sharing the code at this point is not
>>> that interesting; it's mere 500 lines of code (comparing to
>>> more than 10K lines for cgroups, and it's not including memcg_
>>> hooks and logic that is spread all over mm/).
>>>
>>> Today vmevent code is mostly an ABI implementation, there is
>>> very little memory management logic (in contrast to the memcg).
>>
>> But, if it doesn't work desktop/server area, it shouldn't be merged.
>
> What makes you think that vmevent won't work for desktop or servers?
> :-)
>
> E.g. for some servers you don't always want memcg, really. Suppose,
> a kvm farm or a database server. Sometimes there's really no need for
> the memcg, but there's still a demand for low memory notifications.
>
> Current Linux desktops don't use any notifications at all, I think.
> So nothing to say about, neither on cgroup's nor on vmevent's behalf.
> I hardly imagine why desktop would use the whole memcg thing, but
> still have a use case for memory notifications.
>
>> We have to consider the best design before kernel inclusion. They cann't
>> be separeted to discuss.
>
> Of course, no objections here. But I somewhat disagree with the
> "best design" term. Which design is better, reading a file via
> read() or mmap()? It depends. Same here.
I think the hardest problem in low memory notification is how to define the _lowmem situation_.
All of us (server, desktop and embedded) should reach a conclusion on how to define the lowmem situation
before progressing with further implementation, because each area can require different limits.
I hope we can do that.
What is the best situation we can call "low memory"?
As a matter of fact, if we could define it well, I think we wouldn't even need the vmevent ABI.
In my opinion, it's not easy to generalize every use case, so we can pass the decision to user space and
just export low attributes of vmstat from the kernel via vmevent.
A userspace program can then determine the low memory situation for its own environment, using other vmstats,
when a notification arrives. Of course, this has the drawback that userspace becomes coupled to the kernel's vmstat,
but at least I think that's why we need vmevent: to trigger an event that tells us when to start watching carefully.
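As a rough illustration of the "look at other vmstats once notified" idea,
here is a small sketch that reads /proc/vmstat and looks up individual
counters. The helper names vmstat_read() and vmstat_lookup() are made up for
this example; the only thing assumed about the kernel is the usual
"name value" line format of /proc/vmstat.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find one "name value" line in /proc/vmstat-style text.
 * Returns 0 and stores the value in *out, or -1 if the key is
 * missing or has no numeric value.
 */
int vmstat_lookup(const char *text, const char *key, unsigned long long *out)
{
	assert(text != NULL && key != NULL && out != NULL);

	size_t keylen = strlen(key);
	const char *line = text;

	while (line != NULL && *line != '\0') {
		/* Require a space after the key so "nr_free" does not match "nr_free_pages". */
		if (strncmp(line, key, keylen) == 0 && line[keylen] == ' ') {
			const char *num = line + keylen + 1;
			char *end;
			unsigned long long v = strtoull(num, &end, 10);

			if (end == num)
				return -1;
			*out = v;
			return 0;
		}
		line = strchr(line, '\n');
		if (line != NULL)
			line++;	/* step past the newline to the next line */
	}
	return -1;
}

/* Read /proc/vmstat into buf (NUL-terminated). Returns 0 on success, -1 on error. */
int vmstat_read(char *buf, size_t size)
{
	assert(buf != NULL && size > 0);

	FILE *f = fopen("/proc/vmstat", "r");
	size_t n;

	if (f == NULL)
		return -1;
	n = fread(buf, 1, size - 1, f);
	fclose(f);
	buf[n] = '\0';
	return n > 0 ? 0 : -1;
}
```

A notification handler could call vmstat_read() once and then use
vmstat_lookup() for nr_free_pages, nr_file_pages and whatever else its own
policy cares about, instead of polling the file continuously.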
--
Kind regards,
Minchan Kim
> -----Original Message-----
> From: ext Minchan Kim [mailto:[email protected]]
> Sent: 02 May, 2012 08:04
> To: Anton Vorontsov
> Cc: KOSAKI Motohiro; Rik van Riel; Pekka Enberg; Moiseichuk Leonid (Nokia-
...
> I think hardest problem in low mem notification is how to define _lowmem
> situation_.
> We all guys (server, desktop and embedded) should reach a conclusion on
> define lowmem situation before progressing further implementation
> because each part can require different limits.
> Hopefully, I want it.
>
> What is the best situation we can call it as "low memory"?
That depends on what user-space can do. In the N9 case [1] we can handle some OOM/slowness-prevention actions, e.g. close background applications, stop prestarted apps,
flush browser/graphics caches in applications, and do all the things the kernel doesn't even know about. This set of activities usually comes as part of the memory management design.
On the other hand, polling by re-scanning vmstat data through procfs might be heavy on performance and is for sure a use-time disaster.
Leonid
[1] http://maemo.gitorious.org/maemo-tools/libmemnotify - yes, not ideal, but it works and the code is quite well isolated.
On Wed, May 2, 2012 at 4:20 AM, KOSAKI Motohiro
<[email protected]> wrote:
> But, if it doesn't work desktop/server area, it shouldn't be merged.
> We have to consider the best design before kernel inclusion. They cann't
> be separeted to discuss.
Yes, completely agreed.
On Wed, May 2, 2012 at 8:04 AM, Minchan Kim <[email protected]> wrote:
> I think hardest problem in low mem notification is how to define _lowmem situation_.
> We all guys (server, desktop and embedded) should reach a conclusion on define lowmem situation
> before progressing further implementation because each part can require different limits.
> Hopefully, I want it.
>
> What is the best situation we can call it as "low memory"?
Looking at real-world scenarios, it seems to be totally dependent on
userspace policy.
On Wed, May 2, 2012 at 8:04 AM, Minchan Kim <[email protected]> wrote:
> As a matter of fact, if we can define it well, I think even we don't need vmevent ABI.
> In my opinion, it's not easy to generalize each use-cases so we can pass it to user space and
> just export low attributes of vmstat in kernel by vmevent.
> Userspace program can determine low mem situation well on his environment with other vmstats
> when notification happens. Of course, it has a drawback that userspace couples kernel's vmstat
> but at least I think that's why we need vmevent for triggering event when we start watching carefully.
Please keep in mind that VM events are not only about "low memory"
notification. The ABI might be useful for other kinds of VM events as
well.
On 05/02/2012 03:57 PM, Pekka Enberg wrote:
> On Wed, May 2, 2012 at 8:04 AM, Minchan Kim <[email protected]> wrote:
>> I think hardest problem in low mem notification is how to define _lowmem situation_.
>> We all guys (server, desktop and embedded) should reach a conclusion on define lowmem situation
>> before progressing further implementation because each part can require different limits.
>> Hopefully, I want it.
>>
>> What is the best situation we can call it as "low memory"?
>
> Looking at real-world scenarios, it seems to be totally dependent on
> userspace policy.
That's why I insist on defining the low memory state in user space, not in the kernel.
>
> On Wed, May 2, 2012 at 8:04 AM, Minchan Kim <[email protected]> wrote:
>> As a matter of fact, if we can define it well, I think even we don't need vmevent ABI.
>> In my opinion, it's not easy to generalize each use-cases so we can pass it to user space and
>> just export low attributes of vmstat in kernel by vmevent.
>> Userspace program can determine low mem situation well on his environment with other vmstats
>> when notification happens. Of course, it has a drawback that userspace couples kernel's vmstat
>> but at least I think that's why we need vmevent for triggering event when we start watching carefully.
>
> Please keep in mind that VM events is not only about "low memory"
> notification. The ABI might be useful for other kinds of VM events as
> well.
Fully agreed, but we should prove why such an event is useful in a real scenario before adding more features.
--
Kind regards,
Minchan Kim
On Tue, 1 May 2012, Anton Vorontsov wrote:
> This patch implements a new event type, it will trigger whenever a
> value becomes greater than user-specified threshold, it complements
> the 'less-then' trigger type.
>
> Also, let's implement the one-shot mode for the events, when set,
> userspace will only receive one notification per crossing the
> boundaries.
>
> Now when both LT and GT are set on the same level, the event type
> works as a cross event type: it triggers whenever a value crosses
> the threshold from a lesser values side to a greater values side,
> and vice versa.
>
> We use the event types in an userspace low-memory killer: we get a
> notification when memory becomes low, so we start freeing memory by
> killing unneeded processes, and we get notification when memory hits
> the threshold from another side, so we know that we freed enough of
> memory.
>
> Signed-off-by: Anton Vorontsov <[email protected]>
Applied, thanks!