LinuxLists.cc - Andrea VM changes

2003-08-30 15:18:19

by Marcelo Tosatti

Subject: Andrea VM changes

> You need to integrate with -aa on the VM. It has been hard enough for
> Andrea to get his stuff in, I doubt you will fare any better.

That's because I never received separate patches that make sense one by
one. Most of Andrea's changes are grouped into a few big patches whose
mess only he understands. That is not the way to merge things.

I want to work this out with him after I merge the other stuff.

2003-08-30 15:42:16

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Sat, Aug 30, 2003 at 12:13:57PM -0300, Marcelo Tosatti wrote:
>
> > You need to integrate with -aa on the VM. It has been hard enough for
> > Andrea to get his stuff in, I doubt you will fare any better.
>
> Thats because I never received separate patches which make sense one by
> one. Most of Andreas changes are all grouped into few big patches that
> only he knows the mess. That is not the way to merge things.
>
> I want to work out with him after I merge other stuff to address that.

That's true for only one patch; the others are pretty orthogonal after
Andrew helped split them:

05_vm_03_vm_tunables-4
05_vm_05_zone_accounting-2
05_vm_06_swap_out-3
05_vm_07_local_pages-4
05_vm_08_try_to_free_pages_nozone-4
05_vm_09_misc_junk-3
05_vm_10_read_write_tweaks-3
05_vm_13_activate_page_cleanup-1
05_vm_15_active_page_swapout-1
05_vm_16_active_free_zone_bhs-1
05_vm_17_rest-10
05_vm_18_buffer-page-uptodate-1
05_vm_20_cleanups-3
05_vm_21_rt-alloc-1
05_vm_22_vm-anon-lru-1
05_vm_23_per-cpu-pages-3
05_vm_24_accessed-ipi-only-smp-1
05_vm_25_try_to_free_buffers-invariant-1

The "mess" one is only 05_vm_17_rest-10 as far as I can tell.

Andrea

2003-08-30 15:54:52

by Marcelo Tosatti

Subject: Re: Andrea VM changes

> that's true for only one patch, the others are pretty orthogonal after
> Andrew helped splitting them:
> 05_vm_03_vm_tunables-4
> 05_vm_05_zone_accounting-2
> 05_vm_06_swap_out-3
> 05_vm_07_local_pages-4
> 05_vm_08_try_to_free_pages_nozone-4
> 05_vm_09_misc_junk-3
> 05_vm_10_read_write_tweaks-3
> 05_vm_13_activate_page_cleanup-1
> 05_vm_15_active_page_swapout-1
> 05_vm_16_active_free_zone_bhs-1
> 05_vm_17_rest-10
> 05_vm_18_buffer-page-uptodate-1
> 05_vm_20_cleanups-3
> 05_vm_21_rt-alloc-1
> 05_vm_22_vm-anon-lru-1
> 05_vm_23_per-cpu-pages-3
> 05_vm_24_accessed-ipi-only-smp-1
> 05_vm_25_try_to_free_buffers-invariant-1

Indeed, you are right.

I'll start looking at them on Monday and keep you posted. Thanks.

2003-08-30 19:16:15

by Marcelo Tosatti

Subject: Re: Andrea VM changes

On Sat, 30 Aug 2003, Marcelo Tosatti wrote:

>
> > that's true for only one patch, the others are pretty orthogonal after
> > Andrew helped splitting them:
> > 05_vm_03_vm_tunables-4
> > 05_vm_05_zone_accounting-2
> > 05_vm_06_swap_out-3
> > 05_vm_07_local_pages-4
> > 05_vm_08_try_to_free_pages_nozone-4
> > 05_vm_09_misc_junk-3
> > 05_vm_10_read_write_tweaks-3
> > 05_vm_13_activate_page_cleanup-1
> > 05_vm_15_active_page_swapout-1
> > 05_vm_16_active_free_zone_bhs-1
> > 05_vm_17_rest-10
> > 05_vm_18_buffer-page-uptodate-1
> > 05_vm_20_cleanups-3
> > 05_vm_21_rt-alloc-1
> > 05_vm_22_vm-anon-lru-1
> > 05_vm_23_per-cpu-pages-3
> > 05_vm_24_accessed-ipi-only-smp-1
> > 05_vm_25_try_to_free_buffers-invariant-1
>
> Indeed, you are right.
>
> I'll start looking at them Monday. I'll keep you in touch. Thanks.

Andrea,

Would you mind explaining 05_vm_06_swap_out-3 to me?

I see you change shrink_cache, try_to_free_pages_zone, etc.

Can you please give me a detailed explanation of the changes there?

I would appreciate it very much.

I'll keep looking at other patches for now.

Thanks

2003-08-30 19:25:26

by Marcelo Tosatti

Subject: Re: Andrea VM changes

y

On Sat, 30 Aug 2003, Marcelo Tosatti wrote:

> >
> > Indeed, you are right.
> >
> > I'll start looking at them Monday. I'll keep you in touch. Thanks.
>
> Andrea,
>
> Would you mind to explain me 05_vm_06_swap_out-3 ?
>
> I see you change shrink_cache, try_to_free_pages_zone, etc.
>
> Can you please give me a detailed explanation of the changes there?
>
> I appreciate very much.
>
> I'll keep looking at other patches for now.

05_vm_09_misc_junk-3 removes the PF_MEMDIE and you also seem to remove the
OOM killer. Is that right? Why?

2003-08-30 23:18:53

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Sat, Aug 30, 2003 at 04:21:02PM -0300, Marcelo Tosatti wrote:
> y
>
> On Sat, 30 Aug 2003, Marcelo Tosatti wrote:
>
> > >
> > > Indeed, you are right.
> > >
> > > I'll start looking at them Monday. I'll keep you in touch. Thanks.
> >
> > Andrea,
> >
> > Would you mind to explain me 05_vm_06_swap_out-3 ?
> >
> > I see you change shrink_cache, try_to_free_pages_zone, etc.
> >
> > Can you please give me a detailed explanation of the changes there?
> >
> > I appreciate very much.
> >
> > I'll keep looking at other patches for now.
>
> 05_vm_09_misc_junk-3 removes the PF_MEMDIE and you also seem to remove the
> OOM killer. Is that right? Why?

Because the oom killer is a DoS on servers. On a database setup with 2G
free and, say, all tasks 2.7G large, it will start killing the
thousand database tasks instead of the 2G netscape task that hit a
userspace bug and started allocating RAM in a loop, and that will
make no progress since no physical RAM will be released. With my VM
there is no need for an oom killer to keep the system stable, and the
current probabilistic oom killer in the page fault handler kills the
right task most of the time (unlike the stock oom killer, which works
well only on desktops and developer machines). So it does a much better
job and it doesn't risk DoSing the box on OOM.

Another DoS generated by the oom killer is that it will try forever to
kill an UNINTERRUPTIBLE task hanging on an NFS server that is down, so it
hangs the whole box for an unlimited time.

I have an algorithm that will work, and that will provide very good
guarantees of killing the "best" task to make the machine usable again,
with the needed protection against the security DoSes, but it's in
no way similar to the current oom killer.

Andrea

2003-08-30 23:28:07

by Marcelo Tosatti

Subject: Re: Andrea VM changes

On Sun, 31 Aug 2003, Andrea Arcangeli wrote:

> On Sat, Aug 30, 2003 at 04:21:02PM -0300, Marcelo Tosatti wrote:
> > y
> >
> > On Sat, 30 Aug 2003, Marcelo Tosatti wrote:
> >
> > > >
> > > > Indeed, you are right.
> > > >
> > > > I'll start looking at them Monday. I'll keep you in touch. Thanks.
> > >
> > > Andrea,
> > >
> > > Would you mind to explain me 05_vm_06_swap_out-3 ?
> > >
> > > I see you change shrink_cache, try_to_free_pages_zone, etc.
> > >
> > > Can you please give me a detailed explanation of the changes there?
> > >
> > > I appreciate very much.
> > >
> > > I'll keep looking at other patches for now.
> >
> > 05_vm_09_misc_junk-3 removes the PF_MEMDIE and you also seem to remove the
> > OOM killer. Is that right? Why?
>
> because the oom killer is a DoS on servers, on a database setup, with 2G
> free, with say all tasks 2.7G large, it'll start killing all the
> thousand database tasks instead of the 2g netscape task that hit an
> userspace bug and it started allocating ram in a loop, and that will
> make no progress since no physical ram will be released. There's no need
> of oom killer to keep the system stable, with my vm, and the current
> probabilistic oom killer in the page fault hander

So tasks get killed in case of page allocation failure?

> kills the right task most of the time (unlike the stock oom killers that
> works well only for the desktops or developer machines). So it does a
> much better job and it doesn't risk to DoS the box due oom.

Would you mind explaining the OOM killing mechanism in more detail?

> Another DoS generated by the oom killer is that it'll try forever to
> kill a UNINTERRUPTIBLE task hanging in a nfs server that is down, so it
> hangs the whole box for an unlimited time.
>
> I've an algorithm that will work, and that will provide very good
> guarantees to kill the "best" task to make the machine usable again,
> with the needed protection against the security DoSes, but it's in
> no-way similar to the current oom killer.

My concern is about how this oom killer works.

PS: Thanks for answering, hope we can agree on things and make progress
on this merge.

2003-08-30 23:57:07

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Sat, Aug 30, 2003 at 08:30:36PM -0300, Marcelo Tosatti wrote:
>
>
> On Sun, 31 Aug 2003, Andrea Arcangeli wrote:
>
> > On Sat, Aug 30, 2003 at 04:21:02PM -0300, Marcelo Tosatti wrote:
> > > y
> > >
> > > On Sat, 30 Aug 2003, Marcelo Tosatti wrote:
> > >
> > > > >
> > > > > Indeed, you are right.
> > > > >
> > > > > I'll start looking at them Monday. I'll keep you in touch. Thanks.
> > > >
> > > > Andrea,
> > > >
> > > > Would you mind to explain me 05_vm_06_swap_out-3 ?
> > > >
> > > > I see you change shrink_cache, try_to_free_pages_zone, etc.
> > > >
> > > > Can you please give me a detailed explanation of the changes there?
> > > >
> > > > I appreciate very much.
> > > >
> > > > I'll keep looking at other patches for now.
> > >
> > > 05_vm_09_misc_junk-3 removes the PF_MEMDIE and you also seem to remove the
> > > OOM killer. Is that right? Why?
> >
> > because the oom killer is a DoS on servers, on a database setup, with 2G
> > free, with say all tasks 2.7G large, it'll start killing all the
> > thousand database tasks instead of the 2g netscape task that hit an
> > userspace bug and it started allocating ram in a loop, and that will
> > make no progress since no physical ram will be released. There's no need
> > of oom killer to keep the system stable, with my vm, and the current
> > probabilistic oom killer in the page fault hander
>
> So tasks get killed in case of page allocation failure?

yes.

When alloc_pages returns NULL during page fault handling we just
call do_exit. With 2.2-aa we were even smarter: we also checked whether
the task had iopl privileges (something that at the moment we can do only
in the page fault handler, btw), so we could trust the task and just send
it a SIGTERM a few times instead of immediately doing a do_exit(SIGKILL).
That way we wouldn't screw up the graphics card, for example (killing an
iopl task isn't always safe). But I never forward-ported this very nice
feature to 2.4.

If alloc_pages returns NULL in any other case, it's up to the caller to
return -ENOMEM to userspace as the return value of the syscall.
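
A minimal userspace sketch of the policy described above, not the -aa code itself: a failed allocation in the page fault path kills the current task, while a failed allocation anywhere else is reported to the caller as -ENOMEM. The enum names, struct oom_decision and on_alloc_result() are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Where the allocation was attempted (hypothetical names for this sketch). */
enum alloc_context { CTX_PAGE_FAULT, CTX_SYSCALL };

/* What happens next (hypothetical names for this sketch). */
enum oom_action { ACT_NONE, ACT_KILL_CURRENT, ACT_RETURN_ENOMEM };

struct oom_decision {
    enum oom_action action;
    int retval;               /* meaningful only for ACT_RETURN_ENOMEM */
};

/*
 * Decide what to do with the result of an allocation.
 * `page` stands in for the pointer returned by alloc_pages().
 */
struct oom_decision on_alloc_result(const void *page, enum alloc_context ctx)
{
    struct oom_decision d = { ACT_NONE, 0 };

    if (page != NULL)
        return d;                     /* allocation succeeded, nothing to do */

    if (ctx == CTX_PAGE_FAULT) {
        /* No syscall return path exists here, so the task is killed. */
        d.action = ACT_KILL_CURRENT;
        return d;
    }

    /* Everywhere else the caller propagates the failure to userspace. */
    d.action = ACT_RETURN_ENOMEM;
    d.retval = -ENOMEM;
    return d;
}
```

The point of the split is that only the page fault path lacks a way to report failure, so only there does the sketch fall back to killing.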

> > kills the right task most of the time (unlike the stock oom killers that
> > works well only for the desktops or developer machines). So it does a
> > much better job and it doesn't risk to DoS the box due oom.
>
> Mind to explain me in more detail the OOM killing mechanism?

The current logic depends on alloc_pages returning NULL.

And alloc_pages returns NULL depending on the outcome of
swapping/cache-shrinking.

The current code in mainline, by contrast, is even prone to OOM deadlocks
in the VM: not only can the oom killer make a DoSable wrong selection
of the task on servers, it can even fail to detect an oom condition.
Another thing that can easily fool the current oom killer is
mlocked RAM: the current oom killer will be fooled by the fact that
there's still some swap free, so it will never kick in and the box will
deadlock. This can't happen with my tree since I don't trust the unreliable
statistical information we have from the kernel: we simply have no way
to (efficiently) calculate the number of freeable pages at any given
time, and as such the only reasonable thing we can do is to try to
swap/shrink a number of times and eventually give up (which is like
inefficiently counting the number of freeable pages a few times).
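
The "try a number of times, then give up" idea can be sketched as a toy model in plain C. This is only an illustration under assumptions: struct toy_vm, toy_reclaim_pass() and toy_alloc_page() are hypothetical names, and the real allocator's retry count and reclaim logic are not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy memory state: free pages, plus pages a reclaim pass can still free. */
struct toy_vm {
    unsigned long free_pages;
    unsigned long reclaimable_pages;
};

/* One swap/shrink pass: frees up to `batch` pages and returns how many. */
unsigned long toy_reclaim_pass(struct toy_vm *vm, unsigned long batch)
{
    unsigned long n = vm->reclaimable_pages < batch ? vm->reclaimable_pages
                                                    : batch;
    vm->reclaimable_pages -= n;
    vm->free_pages += n;
    return n;
}

/*
 * Try to take one page. If none is free, run at most `max_tries` reclaim
 * passes, rechecking after each one. No "free swap" or "freeable pages"
 * counter is consulted: failure is detected only by the passes not
 * producing a free page, which mirrors the reasoning in the message above.
 */
bool toy_alloc_page(struct toy_vm *vm, int max_tries, unsigned long batch)
{
    for (int attempt = 0; attempt <= max_tries; attempt++) {
        if (vm->free_pages > 0) {
            vm->free_pages--;
            return true;
        }
        if (attempt < max_tries)
            toy_reclaim_pass(vm, batch);
    }
    return false;   /* the caller sees the equivalent of alloc_pages() == NULL */
}
```

Because the loop is bounded, the sketch cannot spin forever when nothing is reclaimable, for instance when all remaining memory is mlocked.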

> > Another DoS generated by the oom killer is that it'll try forever to
> > kill a UNINTERRUPTIBLE task hanging in a nfs server that is down, so it
> > hangs the whole box for an unlimited time.
> >
> > I've an algorithm that will work, and that will provide very good
> > guarantees to kill the "best" task to make the machine usable again,
> > with the needed protection against the security DoSes, but it's in
> > no-way similar to the current oom killer.
>
> My concern is about how this oom killer works.

On desktops this oom killer may make a worse selection of the task to
kill (the usual ssh now has a chance of being killed), but it fixes the
oom deadlocks and it won't do stupid things on servers should netscape or
some other app hit a userspace bug. So I have to prefer it until I
write a reliable oom-killing algorithm that won't fall into DoSable
corner cases so easily. mlock, nfs and databases are the three most
common examples of where current mainline can fail; btw, lowmem
shortage is another very common DoS that the oom killer will never
notice. My tree doesn't deadlock during lowmem shortage on the 64G boxes
(or at least not technically: in practice it may look like a kernel
deadlock even though syscalls return -ENOMEM ;) ).

Andrea

2003-08-31 11:50:54

by Matthias Andree

Subject: Re: Andrea VM changes

On Sat, 30 Aug 2003, Marcelo Tosatti wrote:

> 05_vm_09_misc_junk-3 removes the PF_MEMDIE and you also seem to remove the
> OOM killer. Is that right? Why?

Nuking the OOM killer is IMHO a sane thing to do. Unless you start
everything out of PID #1 which is unkillable, usually init(8), you don't
want the OOM killer. Imagine it nukes your portmap. Since Linux portmap
doesn't support warm starts (unlike Solaris 8), this means: reboot.

--
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95

2003-08-31 14:12:15

by Alan Cox

Subject: Re: Andrea VM changes

On Sul, 2003-08-31 at 00:19, Andrea Arcangeli wrote:
> I've an algorithm that will work, and that will provide very good
> guarantees to kill the "best" task to make the machine usable again,
> with the needed protection against the security DoSes, but it's in
> no-way similar to the current oom killer.

And -ac has trivial code so you can avoid OOM killing ever happening,
which is pretty much essential for big servers. Perhaps merging that
as well would be a good idea.

2003-08-31 14:59:11

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Sun, Aug 31, 2003 at 03:10:04PM +0100, Alan Cox wrote:
> On Sul, 2003-08-31 at 00:19, Andrea Arcangeli wrote:
> > I've an algorithm that will work, and that will provide very good
> > guarantees to kill the "best" task to make the machine usable again,
> > with the needed protection against the security DoSes, but it's in
> > no-way similar to the current oom killer.
>
> And -ac has trivial code so you can avoid OOM killing every happening,
> which is pretty much essential for big servers. Perhaps merging that
> as well would be a good idea.

The reservation that you have to do can lead to less optimal
utilization of RAM (some buggy apps can also fail with it), but I agree
it's a nice feature to be able to return -ENOMEM from malloc (on
desktops too) instead of killing the task.
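
The reservation being discussed can be sketched as simple commit accounting: every allocation is charged against a fixed limit up front, and a request that would exceed the limit fails with -ENOMEM instead of being granted and later triggering a kill. The limit formula here (swap plus a percentage of RAM) follows the commit limit of Linux's strict overcommit mode; whether the -ac code computed it exactly this way is an assumption, and struct toy_commit and the toy_* functions are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

/* Toy accounting state, all sizes in pages. */
struct toy_commit {
    unsigned long ram_pages;
    unsigned long swap_pages;
    unsigned int ratio_percent;     /* share of RAM that may be promised */
    unsigned long committed_pages;  /* pages already promised to tasks */
};

/* Total number of pages that may ever be promised. */
unsigned long toy_commit_limit(const struct toy_commit *c)
{
    return c->swap_pages + c->ram_pages * c->ratio_percent / 100;
}

/* Reserve `pages` up front; fail synchronously if the limit would be passed. */
int toy_reserve(struct toy_commit *c, unsigned long pages)
{
    unsigned long limit = toy_commit_limit(c);

    if (c->committed_pages > limit || pages > limit - c->committed_pages)
        return -ENOMEM;             /* what malloc() would report as failure */

    c->committed_pages += pages;
    return 0;
}

/* Give back a reservation, for example on munmap() or exit. */
void toy_release(struct toy_commit *c, unsigned long pages)
{
    assert(pages <= c->committed_pages);
    c->committed_pages -= pages;
}
```

The cost mentioned in the message is visible in the model: pages are charged when reserved, not when touched, so memory that is reserved but never used still counts against the limit.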

However, you have the exact same oom deadlock problem with all
non-userspace allocations. A select, for example, will deadlock the box
in -ac if you're out of low memory, regardless of the non-overcommit
behaviour; the same goes for mlock.

And I don't see how you can keep oom killing from ever happening if the
apps recurse on the stack and grow it down by some hundred megs. In that
case you have to oom kill, since there's no synchronous failure path
during the stack growsdown walk.

This of course doesn't change the fact that providing (optional)
non-overcommit behaviour sounds like a very good idea; I'm all for it.

I just don't think it solves or hides the other issues. It seems
completely orthogonal to me, because you can still run oom during stack
growsdown.

Andrea

2003-08-31 15:32:20

by Dan Kegel

Subject: Re: Andrea VM changes

I spent way too long tweaking the OOM killer before I
realized it was hopeless.
The fact that incoming network traffic can be what causes the
OOM condition makes it Really Hard to decide which app deserves
the axe.

In the test-and-measurement system I'm developing,
it turned out the sanest thing to do with OOM conditions
was to consider them user errors, and to handle them
by dumping memory usage info about processes and slab caches,
then halt. It's been very helpful because it turns flaky
conditions into rock-solid failures. Too bad this drastic
approach isn't appropriate for general use.
- Dan

--
Dan Kegel
http://www.kegel.com
http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=78045

2003-08-31 15:30:48

by Alan Cox

Subject: Re: Andrea VM changes

On Sul, 2003-08-31 at 15:59, Andrea Arcangeli wrote:
> And I don't see how you can avoid oom killing to ever happen if the apps
> recurse on the stack and growsdown some hundred megs. In such case
> you've to oom kill, since there's no synchronous failure path during the
> stack growsdown walk.

The stack grow fails and you get a signal. It's up to you to have a
language that handles this or, in C, enjoy the delights of sigaltstack. In
practice the settings are such that this case basically "doesn't happen"
for all normal use.

> I just don't think it solves or hides the other issues, it seems
> completely orthogonal to me, because you can still run oom during stack
> growsdown.

Agreed - and there will always be corner cases, people who don't want
strict overcommit etc. That's why I said "as well". It's not a replacement
for OOM handling of some form.

2003-08-31 15:48:36

by Jörn Engel

Subject: Re: Andrea VM changes

On Sun, 31 August 2003 08:51:55 -0700, Dan Kegel wrote:
>
> In the test-and-measurement system I'm developing,
> it turned out the sanest thing to do with OOM conditions
> was to consider them user errors, and to handle them
> by dumping memory usage info about processes and slab caches,
> then halt. It's been very helpful because it turns flaky
> conditions into rock-solid failures. Too bad this drastic
> approach isn't appropriate for general use.

Sounds interesting. Can you send a patch for the interested and
unafraid?

Jörn

--
A quarrel is quickly settled when deserted by one party; there is
no battle unless there be two.
-- Seneca

2003-08-31 15:59:19

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Sun, Aug 31, 2003 at 04:29:49PM +0100, Alan Cox wrote:
> On Sul, 2003-08-31 at 15:59, Andrea Arcangeli wrote:
> > And I don't see how you can avoid oom killing to ever happen if the apps
> > recurse on the stack and growsdown some hundred megs. In such case
> > you've to oom kill, since there's no synchronous failure path during the
> > stack growsdown walk.
>
> The stack grow fails and you get a signal. Its up to you to have a
> language that handles this or in C enjoy the delights of sigaltstack. In

The synchronous signal sending looks fine.

The hard part is what happens after sending the signal: how do you
eventually fall back from a graceful signal (one that can be handled in
userspace) to a "hard" one like SIGKILL that guarantees the stability of
the system? You could, for example, set a timer and kill the task with
SIGKILL if it's still there a few seconds after you sent the graceful
signal. There may be different solutions to this.

But we need a fallback like the above because we can't trust userspace:
if the task doesn't go away, we have to SIGKILL it eventually.

Even sending SIGKILL immediately would be acceptable (although it would
prevent userspace from exiting gracefully).
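
The fallback described above can be shown with a userspace analogue: send a catchable signal, wait for a grace period, then send SIGKILL if the task is still there. This is only a sketch of the idea, not kernel code; terminate_with_fallback() and spawn_child() are hypothetical helpers, and SIGTERM stands in for whatever graceful signal would actually be chosen.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Signal that ended the child, or 0 if it exited on its own. */
static int how_it_ended(int status)
{
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/*
 * Ask `pid` (a child of the caller) to terminate with SIGTERM. If it is
 * still alive after roughly `grace_ms` milliseconds, fall back to SIGKILL.
 * Returns the terminating signal, 0 for a normal exit, or -1 on error.
 */
int terminate_with_fallback(pid_t pid, int grace_ms)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    int status;

    if (kill(pid, SIGTERM) != 0)
        return -1;

    for (int waited = 0; waited < grace_ms; waited += 10) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return how_it_ended(status);   /* went away gracefully */
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Grace period expired: userspace cannot be trusted, so force it. */
    if (kill(pid, SIGKILL) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return how_it_ended(status);
}

/* Fork a child that sleeps forever; if `stubborn`, it ignores SIGTERM. */
pid_t spawn_child(int stubborn)
{
    struct sigaction ign, old;
    pid_t pid;

    if (stubborn) {
        /* Set the disposition before fork() so the child inherits it. */
        ign.sa_handler = SIG_IGN;
        ign.sa_flags = 0;
        sigemptyset(&ign.sa_mask);
        sigaction(SIGTERM, &ign, &old);
    }
    pid = fork();
    if (pid == 0) {
        for (;;)
            pause();
    }
    if (stubborn)
        sigaction(SIGTERM, &old, NULL);    /* restore the parent's handler */
    return pid;
}
```

A well-behaved child is reported as ended by SIGTERM, while a child that ignores SIGTERM is reported as ended by SIGKILL once the grace period runs out.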

> practice the settings are such that this case basically "doesnt happen"
> for all normal use.

yes, stack usage is normally very limited.

>
> > I just don't think it solves or hides the other issues, it seems
> > completely orthogonal to me, because you can still run oom during stack
> > growsdown.
>
> Agreed - and there will always be corner cases, people who don't want
> strict overcommit etc. Thats why I said "as well". Its not a replacement
> for OOM handling of some form.

agreed.

Andrea

2003-08-31 16:00:52

by Dan Kegel

Subject: Re: Andrea VM changes

Jörn Engel wrote:
> On Sun, 31 August 2003 08:51:55 -0700, Dan Kegel wrote:
>
>>In the test-and-measurement system I'm developing,
>>it turned out the sanest thing to do with OOM conditions
>>was to consider them user errors, and to handle them
>>by dumping memory usage info about processes and slab caches,
>>then halt. It's been very helpful because it turns flaky
>>conditions into rock-solid failures. Too bad this drastic
>>approach isn't appropriate for general use.
>
>
> Sound interesting. Can you send a patch for the interested and
> unafraid?

This is against 2.4.21 or so.

--- mm.old/oom_kill.c Mon Apr 28 17:23:19 2003
+++ mm/oom_kill.c Mon Apr 28 20:22:23 2003
@@ -20,9 +20,13 @@
#include <linux/swap.h>
#include <linux/swapctl.h>
#include <linux/timex.h>
+#include <asm/uaccess.h>

/* #define DEBUG */

+#define CONFIG_OOM_HALT
+#ifndef CONFIG_OOM_HALT
+
/**
* int_sqrt - oom_kill.c internal function, rough approximation to sqrt
* @x: integer of which to calculate the sqrt
@@ -193,6 +197,62 @@
return;
}

+#else
+
+/**
+ * oom_halt - log out of memory condition, then halt system.
+ *
+ * For embedded systems which can't tolerate the chance that
+ * the oom killer will kill the wrong process, and would rather
+ * simply log the event in detail and halt.
+ */
+static void
+oom_halt(void)
+{
+ struct task_struct *p;
+ struct file *file;
+ int ret;
+
+ printk(KERN_EMERG "oom: Out of memory!\n");
+
+ printk(KERN_EMERG "oom: VM and RSS in KB, pid, and mm ptr for each task:\n");
+ read_lock(&tasklist_lock);
+ for_each_task(p) {
+ if (p->mm)
+ printk(KERN_EMERG "oom> vm %5d rss %5d pid %5d mm %p (%s)\n",
+ p->mm->total_vm * (PAGE_SIZE / 1024),
+ p->mm->rss * (PAGE_SIZE / 1024), p->pid, p->mm, p->comm);
+ }
+ read_unlock(&tasklist_lock);
+
+ file = filp_open("/proc/slabinfo", O_RDONLY, 0);
+ if (IS_ERR(file) || !file->f_op || !file->f_op->read)
+ goto out;
+ printk(KERN_EMERG "oom: Contents of /proc/slabinfo:\n");
+ do {
+ char buf[128];
+ int pos;
+ mm_segment_t fs = get_fs();
+ /* read one line */
+ for (pos = 0; pos < sizeof (buf) - 1; pos++) {
+ set_fs(KERNEL_DS);
+ ret = file->f_op->read(file, buf + pos, 1, &file->f_pos);
+ set_fs(fs);
+ if (ret != 1 || buf[pos] == '\n')
+ break;
+ }
+ buf[pos] = 0;
+ printk(KERN_EMERG "oom> %s\n", buf);
+ } while (ret == 1);
+ /* filp_close(file, NULL); */
+out:
+ printk(KERN_EMERG "oom: Halting.\n");
+ cli();
+ machine_halt();
+}
+
+#endif
+
/**
* out_of_memory - is the system out of memory?
*/
@@ -237,7 +297,11 @@
/*
* Ok, really out of memory. Kill something.
*/
lastkill = now;
+#ifdef CONFIG_OOM_HALT
+ oom_halt();
+#else
oom_kill();
+#endif

reset:
first = now;

--
Dan Kegel
http://www.kegel.com
http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=78045

2003-08-31 17:31:58

by Marcelo Tosatti

Subject: Re: Andrea VM changes

---------- Forwarded message ----------
Date: Sun, 31 Aug 2003 12:43:27 -0300 (BRT)
From: Marcelo Tosatti <[email protected]>
To: Andrea Arcangeli <[email protected]>
Cc: Marcelo Tosatti <[email protected]>,
Mike Fedyk <[email protected]>, Antonio Vargas <[email protected]>,
lkml <[email protected]>,
Marc-Christian Petersen <[email protected]>
Subject: Re: Andrea VM changes

On Sun, 31 Aug 2003, Andrea Arcangeli wrote:

> This oom killer on desktops may do a worse selections of the task to
> kill (the usual ssh now has a chance to be killed), but it fixes the oom
> deadlocks and it won't do stupid things on servers shall a netscape or
> whatever else app hit an userspace bug. So I've to prefer it, until I
> will write a reliable algorithm for the oom killing that won't fall into
> dosable corner cases so easily (mlock/nfs/database as the three most
> common examples of where current mainline can fail, btw the lowmem
> shortage is another very common DoS that the oom killer will never
> notice, my tree doesn't deadlock [or at least not technically, in
> practice it may look like a kernel deadlock despite syscalls returns
> -ENOMEM ;) ] during lowmem shortage on the 64G boxes).

Suppose you have a big fat hog leaking (let's say, netscape), allocating
pages at a slow pace. Now you have a decent, well-behaved app which is
allocating at a fast pace, and it gets killed.

The chance that the well-behaved app gets killed is big, right?

2003-08-31 17:33:57

by Marcelo Tosatti

Subject: Re: Andrea VM changes

---------- Forwarded message ----------
Date: Sun, 31 Aug 2003 14:14:06 -0300 (BRT)
From: Marcelo Tosatti <[email protected]>
To: Andrea Arcangeli <[email protected]>
Cc: Marcelo Tosatti <[email protected]>,
Mike Fedyk <[email protected]>, Antonio Vargas <[email protected]>,
lkml <[email protected]>,
Marc-Christian Petersen <[email protected]>
Subject: Re: Andrea VM changes

On Sat, 30 Aug 2003, Andrea Arcangeli wrote:

> On Sat, Aug 30, 2003 at 12:13:57PM -0300, Marcelo Tosatti wrote:
> >
> > > You need to integrate with -aa on the VM. It has been hard enough for
> > > Andrea to get his stuff in, I doubt you will fare any better.
> >
> > Thats because I never received separate patches which make sense one by
> > one. Most of Andreas changes are all grouped into few big patches that
> > only he knows the mess. That is not the way to merge things.
> >
> > I want to work out with him after I merge other stuff to address that.
>
> that's true for only one patch, the others are pretty orthogonal after
> Andrew helped splitting them:
>
>
> 05_vm_03_vm_tunables-4
> 05_vm_05_zone_accounting-2
> 05_vm_06_swap_out-3

Help me understand something about this patch. In try_to_free_pages(), you
set failed_swapout to zero in case we are under __GFP_IO. And
failed_swapout decides whether we swap_out() or not.

So basically with -aa swap_out() is only called by __GFP_IO tasks (which
are throttled by the page laundering code in shrink_cache) and in mainline
non __GFP_IO tasks do swap_out() (and those are not throttled by anything).

Did I understand this right?

Part 2:

Now in try_to_free_pages_zone() and shrink_cache you have:

if (!*failed_swapout)
*failed_swapout = !swap_out(classzone);

Which means: keep trying to swap_out() only as long as swap_out()
successfully deactivates nr_pages ptes. Right? Do you do that to avoid
terribly expensive swap_out() loops that don't successfully free pages?
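
The reading in this message can be checked with a toy model of the quoted two-line snippet: once swap_out() fails, the flag latches and later passes skip it. This models the interpretation asked about here, which the thread does not confirm; toy_swap_out(), toy_reclaim(), TOY_GFP_IO and struct toy_zone are hypothetical stand-ins, not the -aa code.

```c
#include <assert.h>

#define TOY_GFP_IO 0x1u   /* stand-in for __GFP_IO in this sketch */

struct toy_zone {
    int swappable_batches;  /* how many more swap_out() calls can succeed */
    int swap_out_calls;     /* how many times swap_out() was invoked */
};

/* Stub for swap_out(): returns 1 on success, 0 on failure. */
int toy_swap_out(struct toy_zone *z)
{
    z->swap_out_calls++;
    if (z->swappable_batches == 0)
        return 0;
    z->swappable_batches--;
    return 1;
}

/* Per the reading above: only allocations allowed to do I/O start at zero. */
int toy_initial_failed_swapout(unsigned int gfp_mask)
{
    return !(gfp_mask & TOY_GFP_IO);
}

/* Run `passes` reclaim passes using the latch from the quoted snippet. */
void toy_reclaim(struct toy_zone *z, int passes, int *failed_swapout)
{
    for (int i = 0; i < passes; i++) {
        if (!*failed_swapout)
            *failed_swapout = !toy_swap_out(z);
    }
}
```

With two batches available and ten passes, swap_out is called three times (two successes and the first failure) and never again, which is the "avoid expensive loops" effect the question is about.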

Thanks

2003-08-31 19:13:21

by Jonathan Lundell

Subject: Re: Andrea VM changes

At 8:51am -0700 8/31/03, Dan Kegel wrote:
>In the test-and-measurement system I'm developing,
>it turned out the sanest thing to do with OOM conditions
>was to consider them user errors, and to handle them
>by dumping memory usage info about processes and slab caches,
>then halt. It's been very helpful because it turns flaky
>conditions into rock-solid failures. Too bad this drastic
>approach isn't appropriate for general use.

Likewise in an HA environment: if you've got a standby node
available, we prefer to fail over on an oom condition (or an oops,
for that matter) rather than try to continue running in some randomly
crippled way. The node in question can then reboot and return to
service as a standby.

Ideally, we'd have a notifier that would be triggered for every
unanticipated process kill (oom, oops, whatever).
--
/Jonathan Lundell.

2003-08-31 19:22:49

by Chris Frey

Subject: Re: Andrea VM changes

On Sun, Aug 31, 2003 at 08:51:55AM -0700, Dan Kegel wrote:
> I spent way too long tweaking the OOM killer before I
> realized it was hopeless.
> The fact that incoming network traffic can be what causes the
> OOM condition makes it Really Hard to decide which app deserves
> the axe.

This may be a little off topic, but is there a way to manually select
this? I can see having a mode where everything stops thrashing
for a while, in order to let the admin calmly kill off the offending
process, as a useful feature.

It would be useless in an environment where OOM is actually needed
(can't wait for a human admin to show up), but cool for those that
like to bring their machines back from the edge.

- Chris

2003-08-31 22:45:47

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Sun, Aug 31, 2003 at 02:34:01PM -0300, Marcelo Tosatti wrote:
>
>
> ---------- Forwarded message ----------
> Date: Sun, 31 Aug 2003 12:43:27 -0300 (BRT)
> From: Marcelo Tosatti <[email protected]>
> To: Andrea Arcangeli <[email protected]>
> Cc: Marcelo Tosatti <[email protected]>,
> Mike Fedyk <[email protected]>, Antonio Vargas <[email protected]>,
> lkml <[email protected]>,
> Marc-Christian Petersen <[email protected]>
> Subject: Re: Andrea VM changes
>
>
>
> On Sun, 31 Aug 2003, Andrea Arcangeli wrote:
>
> > This oom killer on desktops may do a worse selections of the task to
> > kill (the usual ssh now has a chance to be killed), but it fixes the oom
> > deadlocks and it won't do stupid things on servers shall a netscape or
> > whatever else app hit an userspace bug. So I've to prefer it, until I
> > will write a reliable algorithm for the oom killing that won't fall into
> > dosable corner cases so easily (mlock/nfs/database as the three most
> > common examples of where current mainline can fail, btw the lowmem
> > shortage is another very common DoS that the oom killer will never
> > notice, my tree doesn't deadlock [or at least not technically, in
> > practice it may look like a kernel deadlock despite syscalls returns
> > -ENOMEM ;) ] during lowmem shortage on the 64G boxes).
>
> Suppose you have a big fat hog leaking (lets say, netscape) allocating
> pages at a slow pace. Now you have a decent well behaved app who is
> allocating at a fast pace, and gets killed.
>
> The chance the well behaved app gets killed is big, right?

Correct. But it's not a bad thing. How can you know it's better to kill
the hog instead of the well-behaved app? If the hog is allocating at a
slow pace, the admin will simply have to kill it if it has grown too big.
In terms of oom killing, a hog allocating at a slow pace is no different
from a malloc(1G);bzero(1G);pause(); that leaves only 1k free.
Eventually the hog will be killed too if needed.

Andrea

2003-08-31 23:42:31

by Jamie Lokier

Subject: Re: Andrea VM changes

Chris Frey wrote:
> > The fact that incoming network traffic can be what causes the
> > OOM condition makes it Really Hard to decide which app deserves
> > the axe.
>
> This may be a little off topic, but is there a way to manually select
> this? I can see having a mode where everything stops thrashing
> for a while, in order to let the admin calmly kill off the offending
> process, as a useful feature.

I'd love to be able to select which app _doesn't_ deserve the axe.
I.e. not sshd, and then not httpd.

I once ran GCC on a box out there in netland, on a short bit of code,
and it was a surprise memory hog due to the usual GCC surprises.

It totally crippled the machine, for 18 hours until I was able to get
someone to reboot it. No ssh, no http, no nothing except TCP initial
handshakes, and ping. Not good.

When that happens I'd like the VM to notice that my most important
tasks (sshd and its subshells) aren't making progress and start
killing off other tasks.

The obvious answer is to turn off swap, but I like to have some swap
to hold static data that isn't much used, to free up some RAM.

-- Jamie

2003-09-01 00:44:09

by Dan Kegel

Subject: Re: Andrea VM changes

Jamie Lokier <jamie () shareable ! org> wrote:
> I'd love to be able to select which app _doesn't_ deserve the axe.
> I.e. not sshd, and then not httpd.

I tried adding a hinting system that let the user
tweak the badness calculated by the OOM killer.
Didn't help. No matter how I tried to protect
important processes, there was always a case where
the OOM killer ended up killing them anyway.

That was probably just a weakness in how I did the
hinting. You might be able to do it with some sort of
'for god's sake never ever kill this process' tweak,
but before I tried that, I realized that making OOM
conditions halt the system was what I really wanted
for my users.
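
A minimal sketch of the two kinds of tweak mentioned above, assuming a
hint that only scales the badness score versus a hard "never kill" flag.
The struct, field names and selection loop are hypothetical and are not
taken from the kernel's OOM killer or from the patch described here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task record; none of these fields exist in the kernel. */
struct task_hint {
    const char *name;
    unsigned long badness;    /* raw score, higher means more killable */
    unsigned int hint_shift;  /* user hint: divide badness by 2^hint_shift */
    bool never_kill;          /* hard exemption instead of a soft hint */
};

/* Apply the soft hint. It lowers the score but never removes the task. */
static unsigned long adjusted_badness(const struct task_hint *t)
{
    assert(t->hint_shift < 8 * sizeof(unsigned long));
    return t->badness >> t->hint_shift;
}

/* Return the index of the task with the highest adjusted score,
 * or -1 if every task is exempt. */
static int select_victim(const struct task_hint *tasks, size_t n)
{
    int victim = -1;
    unsigned long worst = 0;

    for (size_t i = 0; i < n; i++) {
        if (tasks[i].never_kill)
            continue;
        unsigned long score = adjusted_badness(&tasks[i]);
        if (victim < 0 || score > worst) {
            worst = score;
            victim = (int)i;
        }
    }
    return victim;
}
```

With only a soft hint, a protected task whose raw score is large enough
still wins the comparison and gets killed; with the hard flag it is never
a candidate, and if every task is exempt nothing can be killed at all.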

- Dan

--
Dan Kegel
http://www.kegel.com
http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=78045

2003-09-01 06:01:46

by Rik van Riel

Subject: Re: Andrea VM changes

On Sun, 31 Aug 2003, Marcelo Tosatti wrote:

> Suppose you have a big fat hog leaking (lets say, netscape) allocating
> pages at a slow pace. Now you have a decent well behaved app who is
> allocating at a fast pace, and gets killed.
>
> The chance the well behaved app gets killed is big, right?

Usually syslogd, which receives an error message from the
network driver the moment memory fills up.

The near-certain death of syslogd in OOM situations is why
I wrote the OOM killer in the first place.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2003-09-01 06:03:59

by Rik van Riel

Subject: Re: Andrea VM changes

On Sun, 31 Aug 2003, Dan Kegel wrote:
> Jamie Lokier <jamie () shareable ! org> wrote:
> > I'd love to be able to select which app _doesn't_ deserve the axe.
> > I.e. not sshd, and then not httpd.
>
> I tried adding a hinting system that let the user
> tweak the badness calculated by the OOM killer.
> Didn't help. No matter how I tried to protect
> important processes, there was always a case where
> the OOM killer ended up killing them anyway.

Indeed. You can't have completely fool-proof heuristics.

Then again, a heuristic is often better than killing
syslogd at the first hint of trouble.

2003-09-01 09:14:32

by Ihar 'Philips' Filipau

Subject: Re: Andrea VM changes

Rik van Riel wrote:
> On Sun, 31 Aug 2003, Dan Kegel wrote:
>
>>Jamie Lokier <jamie () shareable ! org> wrote:
>>
>>>I'd love to be able to select which app _doesn't_ deserve the axe.
>>>I.e. not sshd, and then not httpd.
>>
>>I tried adding a hinting system that let the user
>>tweak the badness calculated by the OOM killer.
>>Didn't help. No matter how I tried to protect
>>important processes, there was always a case where
>>the OOM killer ended up killing them anyway.
>
>
> Indeed. You can't have completely fool-proof heuristics.
>
> Then again, a heuristic is often better than killing
> syslogd at the first hint of trouble.
>

Best heuristics:
# echo '/usr/sbin/sshd' >/proc/sys/vm/oom_exclude_list
# echo '/usr/sbin/httpd' >/proc/sys/vm/oom_exclude_list

Works 100% ;-)))

2003-09-01 11:48:24

Subject: Re: Andrea VM changes

On Llu, 2003-09-01 at 00:42, Jamie Lokier wrote:
> I once ran GCC on a box out there in netland, on a short bit of code,
> and it was a surprise memory hog due to the usual GCC surprises.

Run -ac on remote boxes and turn on no overcommit. Paranoid people also
run watchdog drivers set to NOWAYOUT and monitor a list of apps to be
sure they are there - if the app is gone, or the watchdog app dies the
box will reboot.
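
A rough sketch of such a monitor, assuming the standard Linux watchdog
character device (any write to it resets the timer) and assuming the
apps are tracked by pid. The function names and the pid list are
illustrative only:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* True if every pid in the list still exists. kill(pid, 0) delivers no
 * signal; EPERM means the process exists but belongs to someone else. */
static bool all_alive(const pid_t *pids, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(pids[i] > 0);
        if (kill(pids[i], 0) == -1 && errno != EPERM)
            return false;
    }
    return true;
}

/* Reset the watchdog timer once. Returns 0 on success, -1 on error.
 * A NUL byte is used so the "magic close" character is never sent. */
static int pet_once(int wd_fd)
{
    return write(wd_fd, "\0", 1) == 1 ? 0 : -1;
}

/* Keep petting while all monitored apps are present. Once this returns
 * (or the monitor itself dies) nothing resets the timer any more; with
 * NOWAYOUT the timer cannot be disarmed, so the box reboots. */
void watch(int wd_fd, const pid_t *pids, size_t n, unsigned int interval)
{
    while (all_alive(pids, n) && pet_once(wd_fd) == 0)
        sleep(interval);
}
```

The caller would open the watchdog device (typically /dev/watchdog) and
pass the descriptor to watch() with an interval shorter than the
hardware timeout.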

2003-09-01 15:53:33

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Mon, Sep 01, 2003 at 02:01:35AM -0400, Rik van Riel wrote:
> On Sun, 31 Aug 2003, Marcelo Tosatti wrote:
>
> > Suppose you have a big fat hog leaking (lets say, netscape) allocating
> > pages at a slow pace. Now you have a decent well behaved app who is
> > allocating at a fast pace, and gets killed.
> >
> > The chance the well behaved app gets killed is big, right?
>
> Usually syslogd, which receives an error message from the
> network driver the moment memory fills up.
>
> The near-certain death of syslogd in OOM situations is why
> I wrote the OOM killer in the first place.

that used to happen with the old vm; now the fairness in the
allocator is better and normally the first task that runs into the oom
condition is the one that's killed, plus after one task-killing no other
tasks are normally killed (in the past the vm wasn't capable of using
the freed ram promptly and it was killing 3/4 tasks in a row, so syslogd
was killed despite the hog having already exited). Still, you're right
that syslogd may very well still be killed in theory, but that's ok with me.

Andrea

2003-09-01 17:25:07

by Marcelo Tosatti

Subject: Re: Andrea VM changes

Would you mind answering this message, Andrea?

On Sun, 31 Aug 2003, Marcelo Tosatti wrote:

>
>
> On Sat, 30 Aug 2003, Andrea Arcangeli wrote:
>
> > On Sat, Aug 30, 2003 at 12:13:57PM -0300, Marcelo Tosatti wrote:
> > >
> > > > You need to integrate with -aa on the VM. It has been hard enough for
> > > > Andrea to get his stuff in, I doubt you will fair any better.
> > >
> > > Thats because I never received separate patches which make sense one by
> > > one. Most of Andreas changes are all grouped into few big patches that
> > > only he knows the mess. That is not the way to merge things.
> > >
> > > I want to work out with him after I merge other stuff to address that.
> >
> > that's true for only one patch, the others are pretty orthogonal after
> > Andrew helped splitting them:
> >
> >
> > 05_vm_03_vm_tunables-4
> > 05_vm_05_zone_accounting-2
> > 05_vm_06_swap_out-3
>
> Help me understand something about this patch. In try_to_free_pages(), you
> set failed_swapout to zero in case we are under __GFP_IO. And
> failed_swapout decides whether we swap_out() or not.
>
> So basically with -aa swap_out() is only called by __GFP_IO tasks (which
> are throttled by the page laundering code in shrink_cache) and in mainline
> non __GFP_IO tasks do swap_out() (and those are not throttled by anything).
>
> Did I understood this right?
>
> Part 2:
>
> Now in try_to_free_pages_zone() and shrink_cache you have:
>
> if (!*failed_swapout)
> *failed_swapout = !swap_out(classzone);
>
> Which means: Keep trying to swap_out() only in case swap_out()
> successfully desactivates nr_pages pte's. Right? Do you do that to avoid
> terrible expensive swap_out() loops which dont successfully free pages?
>
> Thanks
>
>
>
>
>

2003-09-01 17:49:36

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Mon, Sep 01, 2003 at 02:27:07PM -0300, Marcelo Tosatti wrote:
>
> Mind to answer this message Andrea?

part 1 is correct. non __GFP_IO only shrinks the cache. This was mostly
a feature for some places that call GFP_ATOMIC by mistake when they
should call GFP_NOIO; they don't want to slow down too much, since
usually they're in realtime context. However it should be safe to always
set failed_swapout to zero too.

part 2 is the thing that keeps the kernel from deadlocking during oom:
the first time I scan the whole vm and nothing is freeable I stop trying,
otherwise it takes way too long to handle oom gracefully.
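
A minimal user-space sketch of the latch described in parts 1 and 2,
assuming swap_out() returns non-zero when it made progress. fake_vm,
fake_swap_out() and run_reclaim() are hypothetical stand-ins, not the
code from 05_vm_06_swap_out-3:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the VM state a pagetable walk looks at. */
struct fake_vm {
    int swappable_pages;  /* pages a walk could still unmap */
    int walks;            /* number of pagetable walks performed */
};

/* Stand-in for swap_out(): returns 1 on progress, 0 when the whole VM
 * was scanned and nothing was freeable. */
static int fake_swap_out(struct fake_vm *vm)
{
    vm->walks++;
    if (vm->swappable_pages == 0)
        return 0;
    vm->swappable_pages--;
    return 1;
}

/* One reclaim pass: once a walk has failed, never walk again. */
static void reclaim_pass(struct fake_vm *vm, bool *failed_swapout)
{
    assert(vm != NULL && failed_swapout != NULL);
    if (!*failed_swapout)
        *failed_swapout = !fake_swap_out(vm);
}

/* Run several passes. gfp_io mirrors the __GFP_IO test from part 1:
 * without it the latch starts out set, so only the cache is shrunk. */
static int run_reclaim(struct fake_vm *vm, bool gfp_io, int passes)
{
    bool failed_swapout = !gfp_io;

    for (int i = 0; i < passes; i++)
        reclaim_pass(vm, &failed_swapout);
    return vm->walks;
}
```

In this model, ten passes over a VM with two swappable pages perform
three walks (two successful, one failed) and then stop walking, and a
caller without __GFP_IO never walks the pagetables at all.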

I've another message from you in queue.

Feel free to ask for more details.

> On Sun, 31 Aug 2003, Marcelo Tosatti wrote:
>
> >
> >
> > On Sat, 30 Aug 2003, Andrea Arcangeli wrote:
> >
> > > On Sat, Aug 30, 2003 at 12:13:57PM -0300, Marcelo Tosatti wrote:
> > > >
> > > > > You need to integrate with -aa on the VM. It has been hard enough for
> > > > > Andrea to get his stuff in, I doubt you will fair any better.
> > > >
> > > > Thats because I never received separate patches which make sense one by
> > > > one. Most of Andreas changes are all grouped into few big patches that
> > > > only he knows the mess. That is not the way to merge things.
> > > >
> > > > I want to work out with him after I merge other stuff to address that.
> > >
> > > that's true for only one patch, the others are pretty orthogonal after
> > > Andrew helped splitting them:
> > >
> > >
> > > 05_vm_03_vm_tunables-4
> > > 05_vm_05_zone_accounting-2
> > > 05_vm_06_swap_out-3
> >
> > Help me understand something about this patch. In try_to_free_pages(), you
> > set failed_swapout to zero in case we are under __GFP_IO. And
> > failed_swapout decides whether we swap_out() or not.
> >
> > So basically with -aa swap_out() is only called by __GFP_IO tasks (which
> > are throttled by the page laundering code in shrink_cache) and in mainline
> > non __GFP_IO tasks do swap_out() (and those are not throttled by anything).
> >
> > Did I understood this right?
> >
> > Part 2:
> >
> > Now in try_to_free_pages_zone() and shrink_cache you have:
> >
> > if (!*failed_swapout)
> > *failed_swapout = !swap_out(classzone);
> >
> > Which means: Keep trying to swap_out() only in case swap_out()
> > successfully desactivates nr_pages pte's. Right? Do you do that to avoid
> > terrible expensive swap_out() loops which dont successfully free pages?
> >
> > Thanks
> >
> >
> >
> >
> >
>

Andrea

2003-09-01 17:59:11

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Sat, Aug 30, 2003 at 04:11:49PM -0300, Marcelo Tosatti wrote:
>
>
> On Sat, 30 Aug 2003, Marcelo Tosatti wrote:
>
> >
> > > that's true for only one patch, the others are pretty orthogonal after
> > > Andrew helped splitting them:
> > > 05_vm_03_vm_tunables-4
> > > 05_vm_05_zone_accounting-2
> > > 05_vm_06_swap_out-3
> > > 05_vm_07_local_pages-4
> > > 05_vm_08_try_to_free_pages_nozone-4
> > > 05_vm_09_misc_junk-3
> > > 05_vm_10_read_write_tweaks-3
> > > 05_vm_13_activate_page_cleanup-1
> > > 05_vm_15_active_page_swapout-1
> > > 05_vm_16_active_free_zone_bhs-1
> > > 05_vm_17_rest-10
> > > 05_vm_18_buffer-page-uptodate-1
> > > 05_vm_20_cleanups-3
> > > 05_vm_21_rt-alloc-1
> > > 05_vm_22_vm-anon-lru-1
> > > 05_vm_23_per-cpu-pages-3
> > > 05_vm_24_accessed-ipi-only-smp-1
> > > 05_vm_25_try_to_free_buffers-invariant-1
> >
> > Indeed, you are right.
> >
> > I'll start looking at them Monday. I'll keep you in touch. Thanks.
>
> Andrea,
>
> Would you mind to explain me 05_vm_06_swap_out-3 ?
>
> I see you change shrink_cache, try_to_free_pages_zone, etc.

that achieves multiple things. It avoids oom deadlocks by not wasting
time in the pagetable walking anymore after we failed once, it protects
init from being killed, and most importantly it avoids failed oom kills
if a task has been killed under us (or if plenty of ram has been freed
under us for whatever other reason). See the check_classzone_need_balance
checks.

Then it gives classzone awareness to refill_inactive so we make sure to
make progress for non highmem allocs too and to shrink stuff properly,
since the lists are global. Plus it checkpoints the point in the active
list where it stopped the last time.

It also changes the shrink_cache function to shrink the vfs lists
internally if needed.

The max_scan etc. in shrink_cache are classzone aware as well, since
the lists are global but we skip over the non interesting pages (like in
refill_inactive).

Andrea

2003-09-01 18:23:29

by Marcelo Tosatti

Subject: Re: Andrea VM changes

On Sat, 30 Aug 2003, Andrea Arcangeli wrote:

> On Sat, Aug 30, 2003 at 12:13:57PM -0300, Marcelo Tosatti wrote:
> >
> > > You need to integrate with -aa on the VM. It has been hard enough for
> > > Andrea to get his stuff in, I doubt you will fair any better.
> >
> > Thats because I never received separate patches which make sense one by
> > one. Most of Andreas changes are all grouped into few big patches that
> > only he knows the mess. That is not the way to merge things.
> >
> > I want to work out with him after I merge other stuff to address that.
>
> that's true for only one patch, the others are pretty orthogonal after
> Andrew helped splitting them:
>
>
> 05_vm_03_vm_tunables-4
> 05_vm_05_zone_accounting-2
> 05_vm_06_swap_out-3
> 05_vm_07_local_pages-4

Two things: I will leave this local pages change to be applied later. I
want to see what it does by itself (apply swap_out() changes & friends now
and on another -pre local pages).

> 05_vm_08_try_to_free_pages_nozone-4

@@ -737,7 +737,6 @@ static void free_more_memory(void)
         balance_dirty();
         wakeup_bdflush();
         try_to_free_pages(GFP_NOIO);
-        run_task_queue(&tq_disk);
         yield();
 }

What's the reason behind this?

2003-09-01 18:36:50

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Mon, Sep 01, 2003 at 03:26:02PM -0300, Marcelo Tosatti wrote:
>
>
> On Sat, 30 Aug 2003, Andrea Arcangeli wrote:
>
> > On Sat, Aug 30, 2003 at 12:13:57PM -0300, Marcelo Tosatti wrote:
> > >
> > > > You need to integrate with -aa on the VM. It has been hard enough for
> > > > Andrea to get his stuff in, I doubt you will fair any better.
> > >
> > > Thats because I never received separate patches which make sense one by
> > > one. Most of Andreas changes are all grouped into few big patches that
> > > only he knows the mess. That is not the way to merge things.
> > >
> > > I want to work out with him after I merge other stuff to address that.
> >
> > that's true for only one patch, the others are pretty orthogonal after
> > Andrew helped splitting them:
> >
> >
> > 05_vm_03_vm_tunables-4
> > 05_vm_05_zone_accounting-2
> > 05_vm_06_swap_out-3
> > 05_vm_07_local_pages-4
>
> Two things: I will leave this local pages change to be applied later. I
> want to see what it does by itself (apply swap_out() changes & friends now
> and on another -pre local pages).

fine thanks!

>
> > 05_vm_08_try_to_free_pages_nozone-4
>
> @@ -737,7 +737,6 @@ static void free_more_memory(void)
> balance_dirty();
> wakeup_bdflush();
> try_to_free_pages(GFP_NOIO);
> - run_task_queue(&tq_disk);
> yield();
> }
>
>
> Whats the reason behind this?

the reason is that adding or removing it won't make any significant
difference. Sure, there may be a few dirty buffers queued, but we
already did balance_dirty() and wakeup_bdflush, so if there was a
significant amount of dirty data to write, bdflush would trigger the
unplug by itself. And if there wasn't, we can wait for more data to
become dirty.

In general, I don't like sparse tq_disk unplugs; I like to have them only
where strictly needed. That looks cleaner, and it doesn't risk
generating short commands.

Andrea

/*
* If you also refuse to depend on closed software for a critical
* part of your business, these links may be useful:
*
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
* http://www.cobite.com/cvsps/
*
* svn://svn.kernel.org/linux-2.6/trunk
* svn://svn.kernel.org/linux-2.4/trunk
*/

2003-09-01 18:58:14

by Marcelo Tosatti

Subject: Re: Andrea VM changes

On Sat, 30 Aug 2003, Andrea Arcangeli wrote:

> On Sat, Aug 30, 2003 at 12:13:57PM -0300, Marcelo Tosatti wrote:
> >
> > > You need to integrate with -aa on the VM. It has been hard enough for
> > > Andrea to get his stuff in, I doubt you will fair any better.
> >
> > Thats because I never received separate patches which make sense one by
> > one. Most of Andreas changes are all grouped into few big patches that
> > only he knows the mess. That is not the way to merge things.
> >
> > I want to work out with him after I merge other stuff to address that.
>
> that's true for only one patch, the others are pretty orthogonal after
> Andrew helped splitting them:
>
>
> 05_vm_03_vm_tunables-4
> 05_vm_05_zone_accounting-2
> 05_vm_06_swap_out-3
> 05_vm_07_local_pages-4
> 05_vm_08_try_to_free_pages_nozone-4
> 05_vm_09_misc_junk-3
> 05_vm_10_read_write_tweaks-3
> 05_vm_13_activate_page_cleanup-1
> 05_vm_15_active_page_swapout-1
> 05_vm_16_active_free_zone_bhs-1
> 05_vm_17_rest-10

Can you please split the watermark changes out of 05_vm_17_rest-10 and
send me that? (no waitqueue changes, no page wakeup logic changes)

As I said previously, let's start with the page reclaiming logic changes
first, which include:

05_vm_03_vm_tunables-4
05_vm_05_zone_accounting-2
05_vm_06_swap_out-3

And the necessary parts (ONLY the watermark stuff AFAICS) from 05_vm_17_rest-10.

Right?

Thanks

> 05_vm_18_buffer-page-uptodate-1
> 05_vm_20_cleanups-3
> 05_vm_21_rt-alloc-1
> 05_vm_22_vm-anon-lru-1
> 05_vm_23_per-cpu-pages-3
> 05_vm_24_accessed-ipi-only-smp-1
> 05_vm_25_try_to_free_buffers-invariant-1
>
> The "mess" one is only 05_vm_17_rest-10 as far as I can tell.
>
> Andrea
>

2003-09-01 19:04:55

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Mon, Sep 01, 2003 at 04:00:49PM -0300, Marcelo Tosatti wrote:
>
>
> On Sat, 30 Aug 2003, Andrea Arcangeli wrote:
>
> > On Sat, Aug 30, 2003 at 12:13:57PM -0300, Marcelo Tosatti wrote:
> > >
> > > > You need to integrate with -aa on the VM. It has been hard enough for
> > > > Andrea to get his stuff in, I doubt you will fair any better.
> > >
> > > Thats because I never received separate patches which make sense one by
> > > one. Most of Andreas changes are all grouped into few big patches that
> > > only he knows the mess. That is not the way to merge things.
> > >
> > > I want to work out with him after I merge other stuff to address that.
> >
> > that's true for only one patch, the others are pretty orthogonal after
> > Andrew helped splitting them:
> >
> >
> > 05_vm_03_vm_tunables-4
> > 05_vm_05_zone_accounting-2
> > 05_vm_06_swap_out-3
> > 05_vm_07_local_pages-4
> > 05_vm_08_try_to_free_pages_nozone-4
> > 05_vm_09_misc_junk-3
> > 05_vm_10_read_write_tweaks-3
> > 05_vm_13_activate_page_cleanup-1
> > 05_vm_15_active_page_swapout-1
> > 05_vm_16_active_free_zone_bhs-1
> > 05_vm_17_rest-10
>
> Can you please split the watermark changes from 05_vm_rest-10 and send me
> that ? (no waitqueue changes, no page wakeup logic changes)

yes sure. (I already have it split out here, but I'm unsure if it's
up to date and/or if it still applies:

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6/zone-watermarks-1

so don't use it, I'll send a new one).

> As I said previously, lets start with the page reclaiming logic changes
> first, which include:
>
> 05_vm_03_vm_tunables-4
> 05_vm_05_zone_accounting-2
> 05_vm_06_swap_out-3
>
> And the necessary (ONLY watermark stuff AFAICS) from 05_vm_rest-10.
>
> Right?

Looks fine to me. Many thanks!

Andrea

2003-09-01 19:52:15

Subject: Re: Andrea VM changes

On Sun, Aug 31, 2003 at 01:50:50PM +0200, Matthias Andree wrote:
> On Sat, 30 Aug 2003, Marcelo Tosatti wrote:
>
> > 05_vm_09_misc_junk-3 removes the PF_MEMDIE and you also seem to remove the
> > OOM killer. Is that right? Why?
>
> Nuking OOM killer is IMHO a sane thing to do. Unless you start
> everything out of PID #1 which is unkillable, usually init(8), you don't
> want the OOM killer. Imagine it nukes your portmap. With Linux portmap
> that doesn't support warm starts (unlike Solaris 8), this means: reboot.

Can't you just restart the other rpc services after restarting portmap?
(IIRC, I have done exactly this without trouble)

2003-09-02 20:49:08

by Marcelo Tosatti

Subject: Re: Andrea VM changes

On Mon, 1 Sep 2003, Andrea Arcangeli wrote:

> On Mon, Sep 01, 2003 at 04:00:49PM -0300, Marcelo Tosatti wrote:
> >
> >
> > On Sat, 30 Aug 2003, Andrea Arcangeli wrote:
> >
> > > On Sat, Aug 30, 2003 at 12:13:57PM -0300, Marcelo Tosatti wrote:
> > > >
> > > > > You need to integrate with -aa on the VM. It has been hard enough for
> > > > > Andrea to get his stuff in, I doubt you will fair any better.
> > > >
> > > > Thats because I never received separate patches which make sense one by
> > > > one. Most of Andreas changes are all grouped into few big patches that
> > > > only he knows the mess. That is not the way to merge things.
> > > >
> > > > I want to work out with him after I merge other stuff to address that.
> > >
> > > that's true for only one patch, the others are pretty orthogonal after
> > > Andrew helped splitting them:
> > >
> > >
> > > 05_vm_03_vm_tunables-4
> > > 05_vm_05_zone_accounting-2
> > > 05_vm_06_swap_out-3
> > > 05_vm_07_local_pages-4
> > > 05_vm_08_try_to_free_pages_nozone-4
> > > 05_vm_09_misc_junk-3
> > > 05_vm_10_read_write_tweaks-3
> > > 05_vm_13_activate_page_cleanup-1
> > > 05_vm_15_active_page_swapout-1
> > > 05_vm_16_active_free_zone_bhs-1
> > > 05_vm_17_rest-10
> >
> > Can you please split the watermark changes from 05_vm_rest-10 and send me
> > that ? (no waitqueue changes, no page wakeup logic changes)
>
> yes sure. (I have it already splitted here but I'm unsure if it's
> uptodate or/and if still applies:
>
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6/zone-watermarks-1
>
> so don't use it, I'll send a new one).

Any progress? 8)

2003-09-15 05:16:09

Subject: Re: Andrea VM changes

Alan Cox writes:

> On Sul, 2003-08-31 at 00:19, Andrea Arcangeli wrote:
> > I've an algorithm that will work, and that will provide very good
> > guarantees to kill the "best" task to make the machine usable again,
> > with the needed protection against the security DoSes, but it's in
> > no-way similar to the current oom killer.
>
> And -ac has trivial code so you can avoid OOM killing every happening,
> which is pretty much essential for big servers. Perhaps merging that
> as well would be a good idea.

Indeed there has been an enormous amount of discussion on the postgres mailing
list about how to deal with the OOM killer. The wide consensus there is that
the only sane setting for a production database would be one that guarantees
never to overcommit at all.

Frankly, they're a bit in shock that this wasn't an option a long time ago.

Consider e.g.:

http://groups.google.com/groups?threadm=3F510688.1050709%40colorfullife.com

--
greg

2003-09-15 10:47:36

by Andrea Arcangeli

Subject: Re: Andrea VM changes

On Mon, Sep 15, 2003 at 01:16:03AM -0400, Greg Stark wrote:
> http://groups.google.com/groups?threadm=3F510688.1050709%40colorfullife.com

btw, side note about the "swap space should be 2*physical memory" rule:
that hasn't been true for a long time. Personally I normally install
swap = ram.

Andrea
