I have stumbled upon a very interesting and, at the same time, very hard
problem. I am trying to integrate code which protects a given process
from the OOM killer when there is not enough RAM in the system. The code
works fine on architectures like ppc32, ppc64 and ARM XScale. On Intel
Xeon and Pentium M it also works fine as long as SMP and HighMem support
are disabled in the kernel; the problem begins as soon as I turn on SMP
and HighMem. My test case works as follows: a child is forked, then both
parent and child malloc() amounts of memory which, summed together,
exceed the memory available on the system; the parent is protected. On
the Intel targets with SMP and HighMem enabled, the OOM killer kills
many processes in addition to the child, such as sshd, portmap, bash,
xinetd, etc. The number of killed processes varies from one test run to
the next. As I pointed out, the code works fine only when both SMP and
HighMem are off; any other combination of these options still triggers
the problem. I am using a heavily patched Linux 2.6.10 kernel with
kernel preemption turned off (the problem also occurs with a vanilla
2.6.14-rc2 kernel). The targets I have tested so far have no swap device
and have at least 1 GB of RAM.
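For concreteness, here is a minimal sketch of the test case; the
protection call is a placeholder, since the real interface depends on
the protection patch, and the chunk size is arbitrary:

/* Minimal sketch of the test case described above. The call that
 * protects the parent is a placeholder -- the real interface depends
 * on the protection patch. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHUNK (64UL * 1024 * 1024)	/* allocate in 64 MB steps */

static void eat_memory(void)
{
	for (;;) {
		char *p = malloc(CHUNK);
		if (!p)
			break;
		/* touch the pages so they are really instantiated,
		 * not just reserved by overcommit */
		memset(p, 0xaa, CHUNK);
	}
}

int main(void)
{
	pid_t pid = fork();

	if (pid < 0)
		return 1;
	if (pid == 0) {			/* child: unprotected allocator */
		eat_memory();
		_exit(0);
	}

	/* protect_from_oom(getpid()); -- placeholder for the patch's
	 * protection interface */
	eat_memory();			/* parent allocates too; together
					 * the two exceed available RAM */
	wait(NULL);
	return 0;
}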
What I have discovered so far is that when the OOM killer sends SIGKILL
to a process, the memory owned by that process is not released as
quickly as the system needs it. Processes running on other processors
then enter the OOM path themselves, and eventually more processes get
killed. I modified __alloc_pages() in mm/page_alloc.c to prevent more
than one processor from entering the OOM code at a time, and also
rate-limited OOM invocations to one per second on the processor that is
allowed in. The idea is to give the process marked for killing enough
time to release its pages. This change works fine on one of the targets
but not on the other Intel boards.
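To illustrate, this is roughly the kind of change I mean in the
__alloc_pages() slow path. The lock and timestamp names are mine, and
the out_of_memory() signature changed between 2.6.10 and 2.6.14-rc, so
treat this as a sketch rather than the actual patch:

/* Sketch of the serialization in mm/page_alloc.c. One CPU at a time
 * may enter the OOM path, and no more often than once per second;
 * other CPUs skip the kill and retry the allocation instead, giving
 * the already-killed task time to release its pages. */
static spinlock_t oom_lock = SPIN_LOCK_UNLOCKED;
static unsigned long last_oom_jiffies;

	/* ... in the __alloc_pages() slow path, replacing the direct
	 * call to out_of_memory(): */
	if (spin_trylock(&oom_lock)) {
		if (time_after(jiffies, last_oom_jiffies + HZ)) {
			last_oom_jiffies = jiffies;
			out_of_memory(gfp_mask);   /* 2.6.10 signature */
		}
		spin_unlock(&oom_lock);
	}
	/* whether or not this CPU invoked the killer, go back and
	 * retry the allocation */
	goto restart;

The spin_trylock() keeps other CPUs from piling into the OOM path while
one kill is already in flight, and the jiffies check throttles repeat
kills on the CPU that does get in.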
I need more insight into what might cause the slow release of pages,
when that release actually happens, and which part of the kernel is
responsible for it.
Please CC me on your replies.
Thanks,
Boban
So basically you discovered:
- OOM is triggered at, say, 512K of free RAM (IIRC OOM actually fires
  before that point, but eh)
- OOM kills a process, which should give back 127M of free RAM
- Another CPU wants 4M of RAM, but so far only 3M from the killed
  process have actually been freed
- ANOTHER OOM kill happens to free up another 35M of RAM
Looks like a race condition?
Petrovic, Boban wrote:
> [snip]