Hi,
Using a 2.2.17 kernel I often experience problems where I get messages like
"VM: do_try_to_free_pages failed for <some process>", and the machine hangs
until the VM can recover, which sometimes takes too long for me to wait. I
suppose the occasional freezes I get under X are the same problem, but I
can't see the kernel messages then.
Yesterday I could reproduce this at will with a "make -j50" on the 2.2.17
sources (as an unprivileged user). In less than half an hour syslogd stopped
logging anything (at 00:38), and this morning I could only see those messages
about trying to free pages for (or from?) wwwoffled. The last load seen by "top" on
another VC was ~74.
Has some work been done for 2.2.18 that could help me? Are there any
2.2-based VM patches that could help (I found the VM-global patch from
Andrea but have no info about what it is, and could not find 2.2-based
patches on Rik's pages)?
Also, I can't yet rule out a hardware problem...
I'm willing to investigate, but clearly lack experience with the VM...
Regards,
--
Yann Dirson <[email protected]> | Why make M$-Bill richer & richer ?
debian-email: <[email protected]> | Support Debian GNU/Linux:
| Cheaper, more Powerful, more Stable !
http://ydirson.free.fr/ | Check <http://www.debian.org/>
In article <[email protected]>,
Yann Dirson <[email protected]> wrote:
>Using a 2.2.17 kernel I often experience problems where I get messages like
>"VM: do_try_to_free_pages failed for <some process>", and the machine hangs
>until the VM can recover, which sometimes takes too long for me to wait. I
>suppose that the problem is similar sometimes when I get a frozen system
>under X, but can't see the kernel messages then.
I'm seeing the same on some machines. Running several instances of
bonnie on a dual SMP Intel with a DAC 1164 RAID controller would
kill the machine in a few hours. However, it has now been running several
bonnies without a hitch for 2 days on 2.2.18pre18.
Mike.
--
People get the operating system they deserve.
On Wed, 1 Nov 2000, Yann Dirson wrote:
> Hi,
>
> Using a 2.2.17 kernel I often experience problems where I get messages like
> "VM: do_try_to_free_pages failed for <some process>", and the machine hangs
> until the VM can recover, which sometimes takes too long for me to wait. I
> suppose that the problem is similar sometimes when I get a frozen system
> under X, but can't see the kernel messages then.
>
> Yesterday I could reproduce this at will with a "make -j50" on the 2.2.17
> sources (as an unprivileged user). In less than half an hour syslogd stopped
> logging anything (at 00:38), and this morning I could only see those messages
> about trying to free pages for (or from?) wwwoffled. The last load seen by "top" on
> another VC was ~74.
>
> Has some work been done for 2.2.18 that could help me? Are there any
> 2.2-based VM patches that could help (I found the VM-global patch from
> Andrea but have no info about what it is
VM-global. It should fix your problem.
On Wed, Nov 01, 2000 at 09:41:04AM -0200, Marcelo Tosatti wrote:
> VM-global. It should fix your problem.
Thanks for the hint. It works great indeed: I couldn't freeze the machine
any more, and at least the make process responded to Ctrl-C.
However, the OOM killer behaves in strange ways, it seems. In the 2 "make
-j50" runs I had (one with the working fs mounted 'sync', the other
'async'), it mostly killed non-root processes, but once it killed 'cron',
which was run as root:
[sync run]
Nov 1 15:51:05 bylbo kernel: VM: killing process cpp
Nov 1 15:56:38 bylbo kernel: VM: killing process apache
Nov 1 16:02:54 bylbo kernel: VM: killing process cc1
Nov 1 16:08:51 bylbo kernel: VM: killing process wwwoffled
[async run]
Nov 1 17:13:08 bylbo kernel: VM: killing process apache
Nov 1 17:14:01 bylbo kernel: VM: killing process cron
apache was running as "www-data" (uid 33), wwwoffled as "proxy" (uid 13).
An idea that came to me: would it be possible to add to the
OOMK some sort of preference for processes owned by "system users", to be
defined by a "uid limit"? For example, on Debian systems, where "real user"
uids start at 1000, it would be great if the OOMK would leave system-user
processes as far as possible off its kill list. Opinions ?
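
The "uid limit" idea can be illustrated in user space. What follows is only a sketch under stated assumptions: the struct, the function names, the limit macro and the divide-by-8 weight are all hypothetical, and none of this is code from the kernel or from any of the patches mentioned in this thread. It scores each task by memory footprint and shrinks the score for system uids, so such tasks sink down the kill list without being fully exempt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a task; not the kernel's task_struct. */
struct sketch_task {
    const char *name;
    unsigned int uid;      /* owner of the process */
    unsigned long pages;   /* memory footprint, in pages */
};

/* Assumed tunable: uids below this are "system users" (1000 on Debian). */
#define SKETCH_UID_LIMIT 1000u

/*
 * Badness score: a bigger score means "kill first". Tasks owned by system
 * users get their score divided, so they are spared as long as possible.
 */
unsigned long sketch_badness(const struct sketch_task *t)
{
    unsigned long score;

    assert(t != NULL);
    score = t->pages;
    if (t->uid < SKETCH_UID_LIMIT)
        score /= 8;        /* arbitrary illustrative weight */
    return score;
}

/* Return the task with the highest badness, or NULL if the list is empty. */
const struct sketch_task *sketch_select_victim(const struct sketch_task *tasks,
                                               size_t n)
{
    const struct sketch_task *victim = NULL;
    unsigned long worst = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long b = sketch_badness(&tasks[i]);
        if (victim == NULL || b > worst) {
            victim = &tasks[i];
            worst = b;
        }
    }
    return victim;
}
```

With a weight instead of a hard exemption, a runaway system daemon can still be selected once it dwarfs every user process, which seems safer than never killing it at all.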
Best regards,
Yann Dirson
On Wed, Nov 01, 2000 at 05:43:39PM +0100, Yann Dirson wrote:
> However, the OOM killer behaves in strange ways, it seems. In the 2 "make
Fair enough, as there isn't an OOM killer in the kernel you're running :).
So it can kill unlucky tasks as well.
Since nobody cares to implement it, there's an alternative OOM killer based
on the task fault rate on my TODO list for 2.4.x.
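
The mail gives no details of the fault-rate approach, so the following is purely a guess at its general shape, not the planned implementation. The struct, its fields and both function names are hypothetical. It computes faults per second over a sampling window and picks the task with the highest rate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task sample; field names are illustrative only. */
struct fault_sample {
    int pid;
    unsigned long faults;     /* page faults seen during the window */
    unsigned long window_ms;  /* length of the sampling window, must be > 0 */
};

/* Faults per second over the window, in integer arithmetic. */
unsigned long fault_rate(const struct fault_sample *s)
{
    assert(s != NULL && s->window_ms > 0);
    return s->faults * 1000UL / s->window_ms;
}

/* Pick the pid with the highest fault rate; -1 if the list is empty. */
int pick_by_fault_rate(const struct fault_sample *samples, size_t n)
{
    int victim = -1;
    unsigned long worst = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long r = fault_rate(&samples[i]);
        if (victim == -1 || r > worst) {
            victim = samples[i].pid;
            worst = r;
        }
    }
    return victim;
}
```

A rate, unlike a raw size, points at the task that is actively demanding memory right now, which may or may not be the largest one.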
(btw, make sure you're using the -7 revision of the VM-global patch, as it
includes the same MM corruption bugfix that has been included in 18pre18)
Andrea
On Wed, 1 Nov 2000, Andrea Arcangeli wrote:
> On Wed, Nov 01, 2000 at 05:43:39PM +0100, Yann Dirson wrote:
> > However, the OOM killer behaves in strange ways, it seems. In the 2 "make
>
> Fair enough as there isn't an oom killer in the kernel you're
> running :). So it can kill unlucky tasks as well.
There's a (slightly outdated?) patch available on my home
page, though...
> Since nobody cares to implement it, for 2.4.x on my TODO list
> there's an alternative oom killer based on the task fault rate.
Cool. It will be interesting to see how this compares to my
OOM killer (and to the other approaches that will undoubtedly
surface over the next few months).
I'm definitely looking forward to an "OOM killer showdown"
where we can compare how the different OOM tactics work.
Not because I think it matters all that much on most systems
(good admins put in enough memory&swap), but simply because
it appears there has been amazingly little research on this
subject and it's completely unknown which approach will work
"best" ... or even, what kind of behaviour is considered to
be best by the users...
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
On Wed, Nov 01, 2000 at 02:59:01PM -0200, Rik van Riel wrote:
Andrea wrote:
> (btw, make sure you're using the -7 revision of the VM-global patch, as it
> includes the same MM corruption bugfix that has been included in 18pre18)
Damn, I was using -6. http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.18pre9/ does not have -7.
Neither does your e-mind repository linked from linux-mm.
I'm currently running -6 :(
On Wed, Nov 01, 2000 at 02:59:01PM -0200, Rik van Riel wrote:
> it appears there has been amazingly little research on this
> subject and it's completely unknown which approach will work
> "best" ... or even, what kind of behaviour is considered to
> be best by the users...
Sounds to me like a good point to favour a config-time selection of
OOM killers.
> On Wed, 1 Nov 2000, Andrea Arcangeli wrote:
> > Fair enough as there isn't an oom killer in the kernel you're
> > running :). So it can kill unlucky tasks as well.
>
> There's a (slightly outdated?) patch available on my home
> page, though...
Found http://www.surriel.com/patches/2.2.17pre8-oomkill. Will take a
look, thanks.
> Not because I think it matters all that much on most systems
> (good admins put in enough memory&swap), but simply because
Ah, I'll have to reconsider how much I rate my skills :)
Best regards,
Yann Dirson
On Wed, Nov 01, 2000 at 10:03:27PM +0100, Yann Dirson wrote:
> On Wed, Nov 01, 2000 at 02:59:01PM -0200, Rik van Riel wrote:
>
> Andrea wrote:
> > (btw, make sure you're using the -7 revision of the VM-global patch, as it
> > includes the same MM corruption bugfix that has been included in 18pre18)
>
> Damn, I was using -6. http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.18pre9/ does not have -7.
> Neither does your e-mind repository linked from linux-mm.
>
> I'm currently running -6 :(
A -7 to apply to 2.2.18pre17 and _previous_ releases is in the directory
patches/v2.2/2.2.18pre17/. An equivalent one against 2.2.18pre18 (and probably
future releases) is in patches/v2.2/2.2.18pre18/.
After applying the patch you should make sure there are no rejects with a `find
-name \*.rej`. If there are no rejects, everything went right.
Andrea
On Wed, Nov 01, 2000 at 10:00:02PM +0100, Andrea Arcangeli wrote:
> A -7 to apply to 2.2.18pre17 and _previous_ releases is in the directory
Argh !! Damned sort order - I thought pre9 was the latest :(
Thanks,
Yann Dirson
On Wed, 01 Nov 2000 17:48:16 Andrea Arcangeli wrote:
>
> (btw, make sure you're using the -7 revision of the VM-global patch, as it
> includes the same MM corruption bugfix that has been included in 18pre18)
>
> Andrea
Does "includes" mean that the full patch is not included in pre18?
So, will the VM-pre17 work with pre18?
--
Juan Antonio Magallon Lacarta mailto:[email protected]
On Thu, Nov 02, 2000 at 03:15:17AM +0100, J . A . Magallon wrote:
> Does "includes" mean that the full patch is not included in pre18?
Only the strict bugfix broadcast to l-k has been included in pre18.
> So, will the VM-pre17 work with pre18?
It will generate a trivial reject, but I just uploaded a new VM-global against
pre18 that generates exactly the same source code as the previous one against
pre17.
Andrea
On Wed, Nov 01, 2000 at 10:03:27PM +0100, Yann Dirson wrote:
> On Wed, Nov 01, 2000 at 02:59:01PM -0200, Rik van Riel wrote:
> > it appears there has been amazingly little research on this
> > subject and it's completely unknown which approach will work
> > "best" ... or even, what kind of behaviour is considered to
> > be best by the users...
>
> Sounds to me like a good point to favour a config-time selection of
> OOM killers.
Better yet: Apply my OOM-Killer-API-Patch[1] and build your own
OOM-Killer!
Just lock your module into memory, supply a function to
install_oom_killer(), save the old one (you get it as the return value if
installing it went OK) and be happy.
And now have fun bringing your machine into OOM situations.
Want to change it back? No problem. Just get signaled somehow[2],
reinstall the old one, unlock your module and wait to be cleaned
up.
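
The install, save and restore cycle described above can be mocked in user space. This is a sketch only: the handler type and every `mock_` name are assumptions, and the real install_oom_killer() in the patch may well have a different signature. Module locking and signalling are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed handler type; the real patch may use a different signature. */
typedef int (*oom_killer_fn)(void);

/* Stand-in for the built-in policy; the return value just identifies it. */
static int default_oom_killer(void) { return 0; }

static oom_killer_fn current_oom_killer = default_oom_killer;

/*
 * User-space stand-in for the install_oom_killer() described in the mail:
 * installs new_fn and returns the handler it replaced, or NULL on failure.
 */
oom_killer_fn mock_install_oom_killer(oom_killer_fn new_fn)
{
    oom_killer_fn old;

    if (new_fn == NULL)
        return NULL;
    old = current_oom_killer;
    current_oom_killer = new_fn;
    return old;
}

/* What an out-of-memory path would do: call whichever handler is installed. */
int mock_out_of_memory(void)
{
    assert(current_oom_killer != NULL);
    return current_oom_killer();
}

/* Example replacement policy; again the return value only identifies it. */
int my_oom_killer(void) { return 42; }
```

A module following this pattern would keep the pointer returned by the install call and pass it back to the same call before unloading, so the previous policy is restored.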
I never tried it on top of Rik's 2.2.x-OOM-Killer-Patch, but it should
work there, because oom_kill.c isn't all that different.
Regards
Ingo Oeser
[1] http://www.tu-chemnitz.de/~ioe/oom_kill_api.patch
[2] if you don't know that much about the kernel, you shouldn't
play with oom-handlers anyway ;-)
--
Feel the power of the penguin - run [email protected]
<esc>:x
On Fri, 3 Nov 2000, Ingo Oeser wrote:
> On Wed, Nov 01, 2000 at 10:03:27PM +0100, Yann Dirson wrote:
> > On Wed, Nov 01, 2000 at 02:59:01PM -0200, Rik van Riel wrote:
> > > it appears there has been amazingly little research on this
> > > subject and it's completely unknown which approach will work
> > > "best" ... or even, what kind of behaviour is considered to
> > > be best by the users...
> >
> > Sounds to me like a good point to favour a config-time selection of
> > OOM killers.
>
> Better yet: Apply my OOM-Killer-API-Patch[1] and build your own
> OOM-Killer!
>
> Just lock your module into memory, supply a function to
> install_oom_killer(), save the old one (you get it as the return value if
> installing it went OK) and be happy.
>
> And now have fun bringing your machine into OOM situations.
>
> Want to change it back? No problem. Just get signaled somehow[2],
> reinstall the old one, unlock your module and wait to be cleaned
> up.
>
> I never tried it on top of Rik's 2.2.x-OOM-Killer-Patch, but it should
> work there, because oom_kill.c isn't all that different.
Ingo,
Maybe you could change Rik's oom killer to be a "module" of your OOM
killer API?
With this in place, it's easier for people to compare Rik's OOM killer
with other possible new algorithms.
On Wed, Nov 01, 2000 at 10:00:02PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 01, 2000 at 10:03:27PM +0100, Yann Dirson wrote:
> > On Wed, Nov 01, 2000 at 02:59:01PM -0200, Rik van Riel wrote:
> >
> > Andrea wrote:
> > > (btw, make sure you're using the -7 revision of the VM-global patch, as it
> > > includes the same MM corruption bugfix that has been included in 18pre18)
> >
> > Damn, I was using -6. http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.18pre9/ does not have -7.
> > Neither does your e-mind repository linked from linux-mm.
> >
> > I'm currently running -6 :(
>
> A -7 to apply to 2.2.18pre17 and _previous_ releases is in the directory
> patches/v2.2/2.2.18pre17/. An equivalent one against 2.2.18pre18 (and probably
> future releases) is in patches/v2.2/2.2.18pre18/.
I have now been running 2.2.18pre17 + VM-global -7 for a few days.
The problems I had appear to be gone. However, with this kernel I get
bursts of error messages from nscd (from glibc 2.1.95) like this one, taken
from daemon.log:
Nov 5 22:36:17 bylbo nscd: 925: while accepting connection: Cannot allocate memory
They usually appear at cron.daily time, although it looks like I can more or
less reproduce them. I'm still investigating and narrowing it down, but
unfortunately they seem to avoid me :( I will launch a tracking job for the
night; hopefully I'll narrow it down to a single cron job this time.
Anyone seen that ?
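
For reference, "Cannot allocate memory" is glibc's text for errno ENOMEM, here reported after a failed accept(). The helper below is only a sketch of how a daemon might decide whether such a failure is worth retrying. The function name and the choice of which errors count as transient are assumptions, and this is not what nscd actually does.

```c
#include <assert.h>
#include <errno.h>

/*
 * Returns 1 if an accept() failure with this errno is plausibly temporary
 * and worth retrying after a short pause, 0 otherwise.
 */
int accept_error_is_transient(int err)
{
    assert(err > 0);       /* errno values are positive */
    switch (err) {
    case ENOMEM:           /* "Cannot allocate memory" in glibc's wording */
    case ENOBUFS:          /* no buffer space available */
    case ENFILE:           /* system-wide file table full */
    case EMFILE:           /* per-process descriptor limit reached */
    case EINTR:            /* interrupted by a signal */
    case ECONNABORTED:     /* peer gave up before accept() completed */
        return 1;
    default:
        return 0;
    }
}
```

Logging the failure and continuing to serve, as nscd appears to do here, is consistent with treating ENOMEM as temporary.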
> After applying the patch you should make sure there are no rejects with a `find
> -name \*.rej`. If there are no rejects, everything went right.
I prefer running "patch --dry-run", I hate it when I have to wait for bzip2
and tar :}
Best regards,
Yann Dirson
Yann Dirson wrote:
>
> Nov 5 22:36:17 bylbo nscd: 925: while accepting connection: Cannot allocate memory
>
> They usually appear at cron.daily time, although it looks like I kinda can
> reproduce them. I'm still investigating and narrowing - they seem to avoid
> me unfortunately :( Will launch a tracking job for the night, hopefully
> I'll narrow to the single cron job this time.
>
> Anyone seen that ?
I see it with sendmail all the time when the fs gets really busy and
memory gets low in the box.
Thanks
Jeff
On Mon, Nov 06, 2000 at 02:48:27PM -0700, Jeff V. Merkey wrote:
> Yann Dirson wrote:
> > Nov 5 22:36:17 bylbo nscd: 925: while accepting connection: Cannot allocate memory
> >
> > They usually appear at cron.daily time, although it looks like I kinda can
> > reproduce them. I'm still investigating and narrowing - they seem to avoid
> > me unfortunately :( Will launch a tracking job for the night, hopefully
> > I'll narrow to the single cron job this time.
Hm... 12h non-stop looping on the cron jobs and nothing in the logs.
Heisenbug :}
> > Anyone seen that ?
>
> I see it with sendmail all the time when the fs gets really busy, and
> memory gets low in
> box.
But if your memory gets low, it seems at least less abnormal... In my case it
occurred at night when only the cron job was running.
Yann Dirson