2009-11-17 20:52:48

by Eyal Lotem

[permalink] [raw]
Subject: [RFC] Using "page credits" as a solution for common thrashing scenarios

I apologize for not sending a patch, but I am not yet skilled enough with
Linux kernel hacking to stop with the cheap talk and show you the code.

I have encountered numerous times the following situation:

* Small processes A, B, C happily functioning on the system (e.g: bash,
xterm, etc).

* Large misbehaving process D (e.g firefox) is launched, and due to
a bug or other issue, begins to allocate and write to an extremely
large memory working set.

* The entire operating system and all processes become highly
unresponsive. Sometimes to the point where I cannot even kill the
problematic process. SysRq keys still work, but those are usually too
coarse. Sometimes the OOM will start firing at random.

My analysis of this:

* I believe that process D in this scenario is basically causing the
kernel MM to evict all of the pages of the well-behaving A, B, C
processes.

* I think it is wrong for the kernel to evict the 15 pages of the bash,
xterm, X server's working set, as an example, in order for a
misbehaving process to have 1000015 instead of 1000000 pages in its
working set. EVEN if that misbehaving process is accessing its working
set far more aggressively.

Suggested solution:

If my analysis isn't off, then I believe the following solution might
mitigate the problem elegantly.

1. Maintain a per-process Most-Recently-Used (MRU) page listing. This
listing is allowed to be approximate (e.g: by use of the page table's
Accessed and Dirty bits).

2. Assign a number of "page credits" to each process (sort of a memory
"niceness" level) which the kernel automatically disperses on the MRU
pages of that process. When the MRU set changes, the "page credits"
are *moved* from LRU entries to new MRU entries.

3. Each physical page accumulates all of the "page credits" of the various
processes that use it in their MRU. This allows shared pages to
accumulate more credit when they are useful for more processes.

4. Page eviction should still be global (not per-process) but should evict
pages whose accumulated "page credits" count is lowest.

5. per-process "page credit" levels can be maintained by user-space (using
setrlimit or such). It probably makes sense for fork() to split the
"page credit" levels and not duplicate them (and spawn new page credits
out of nowhere). While having a good way to specify per-process "page
credits" could improve things, I don't believe it is critical to do
this accurately in order for this to effectively prevent thrashing.

Rationale: The usefulness of a page also stems from how big of a percentage
it is of it's user process working set. If a single page is the entire
working set of a process, evicting that single page is most costly than
even evicting 100 pages of another process if that other process's
working set is extremely large.

Highly simplified example of suggested solution (numbers are made up):

NOTE: The example does not include shared library memory use and various
other details (I do believe this solution handles those elegantly), to
avoid clutter.

1. 3 bash processes are assigned 1 million credits each, and each of them
use a shared 10 pages (e.g mmap'd code) and another 40 unique pages in
their MRU working set.

The kernel automatically assigns 20,000 credits (1M / 50) to each of
the pages in each bash process. The shared 10 physical pages will
accumulate 60,000 credits each. Each of the 120 (40 * 3) unique
physical pages accumulates 20,000 credits.

2. xterm, X processes are assigned 1 million credits each, and each of
them use 200 unique pages.

The kernel automatically assigns 5,000 credits to each of the unique
pages, so the physical pages accumulate 5,000 credits each.

3. a firefox process is too assigned 1 million credits. It aggressively
allocates and writes to as many pages as the kernel allows. Instead of
starting to evict the above processes (bash's, xterm's, X's) which
hinder responsiveness to the user, the kernel will find the physical
pages with the least amount of "credits" to evict.

4. Assuming firefox has already allocated 1 million pages before eviction
is required, the eviction decision is faced with the following data:

* 10 physical pages shared by the bash processes, each have 60,000
credits.

* 120 physical pages (40 x 3 unique pages of the bash processes), each
with 20,000 credits.

* 400 physical pages (of xterm and X) with 5,000 credits each.

* 1 million physical pages (of firefox) with 1 credit each.

5. The kernel has a "no-brainer" choice here. Instead of effectively
pausing all of the behaving processes in order to allocate a few more
pages for firefox, it can and should make firefox pay for its
wastefulness. It will evict firefox's old pages to make room for
firefox's new pages.

In effect, firefox's DOS attack here (on the physical page resource)
will attack only firefox itself, and not the rest of the system.
Firefox's DOS attack on disk I/O and other resources is still underway
(Perhaps requiring a different solution).


2009-11-17 21:09:11

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC] Using "page credits" as a solution for common thrashing scenarios

On 11/17/2009 03:52 PM, Eyal Lotem wrote:
> Highly simplified example of suggested solution (numbers are made up):
>
> NOTE: The example does not include shared library memory use and various
> other details (I do believe this solution handles those elegantly), to
> avoid clutter.
>
>
The real problematic details are in the data structures.

How to organize the pages in-kernel?

How does the kernel find the pages with the lowest
score? In the memory zone where memory needs
to be freed...

How can the kernel efficiently change the score of
tens of thousands (or even millions) of pages at once,
moving them to the right LRU list cheaply?

Making page replacement scale to systems with
hundreds of gigabytes of memory is a necessity, as
is evicting the right kind of page (filesystem cache
vs process page, etc).

Within those constraints, it becomes very hard to
implement a solution along your ideas.

2009-11-17 21:19:28

by Eyal Lotem

[permalink] [raw]
Subject: Re: [RFC] Using "page credits" as a solution for common thrashing scenarios

> The real problematic details are in the data structures.
>
> How to organize the pages in-kernel?
>
> How does the kernel find the pages with the lowest
> score? ?In the memory zone where memory needs
> to be freed...
>
> How can the kernel efficiently change the score of
> tens of thousands (or even millions) of pages at once,
> moving them to the right LRU list cheaply?
>
> Making page replacement scale to systems with
> hundreds of gigabytes of memory is a necessity, as
> is evicting the right kind of page (filesystem cache
> vs process page, etc).
>
> Within those constraints, it becomes very hard to
> implement a solution along your ideas.

I agree an implementation that reasonably addresses these concerns is far
from trivial. I wouldn't give up on finding a way to implement it (or a
good approximation of it) so soon.

Before diving into the exact implementation details, I was wondering if I'm
getting some basic fact wrong -- would this solution (if possible to
implement with reasonable performance) truly mitigate this kind of
thrashing?

Eyal

2009-11-24 21:15:33

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] Using "page credits" as a solution for common thrashing scenarios

Eyal Lotem <[email protected]> writes:

Replying to an old email.

> * I think it is wrong for the kernel to evict the 15 pages of the bash,
> xterm, X server's working set, as an example, in order for a
> misbehaving process to have 1000015 instead of 1000000 pages in its
> working set. EVEN if that misbehaving process is accessing its working
> set far more aggressively.

One problem in practice tends to be that it's hard to realiably detect
that a process is misbehaving. The 1000000 page process might be your
critical database, while the 15 page process is something very
unimportant.

-Andi

--
[email protected] -- Speaking for myself only.

2009-11-24 22:41:42

by Chris Friesen

[permalink] [raw]
Subject: Re: [RFC] Using "page credits" as a solution for common thrashing scenarios

On 11/24/2009 03:15 PM, Andi Kleen wrote:
> Eyal Lotem <[email protected]> writes:
>
> Replying to an old email.
>
>> * I think it is wrong for the kernel to evict the 15 pages of the bash,
>> xterm, X server's working set, as an example, in order for a
>> misbehaving process to have 1000015 instead of 1000000 pages in its
>> working set. EVEN if that misbehaving process is accessing its working
>> set far more aggressively.
>
> One problem in practice tends to be that it's hard to realiably detect
> that a process is misbehaving. The 1000000 page process might be your
> critical database, while the 15 page process is something very
> unimportant.

Quite a while ago now I proposed the ability for an app (with suitable
privileges) to register with the system the amount of memory it expected
to use. As long as it was under that amount it would be immune to the
oom-killer.

We've been using this in production for some time now as we have several
very large memory footprint apps that otherwise become quite attractive
to the oom killer.

The oom_adj proc entry made this less of an issue so I never bothered
pushing it to mainline.

Chris

2010-06-08 09:45:31

by Eyal Lotem

[permalink] [raw]
Subject: Re: [RFC] Using "page credits" as a solution for common thrashing scenarios

Replying to a very old email :-)

On Wed, Nov 25, 2009 at 12:15 AM, Andi Kleen <[email protected]> wrote:
> Eyal Lotem <[email protected]> writes:
>
> Replying to an old email.
>
>> ? * I think it is wrong for the kernel to evict the 15 pages of the bash,
>> ? ? xterm, X server's working set, as an example, in order for a
>> ? ? misbehaving process to have 1000015 instead of 1000000 pages in its
>> ? ? working set. EVEN if that misbehaving process is accessing its working
>> ? ? set far more aggressively.
>
> One problem in practice tends to be that it's hard to realiably detect
> that a process is misbehaving. The 1000000 page process might be your
> critical database, while the 15 page process is something very
> unimportant.

Well, this solution doesn't really depend on any detection of
"misbehaving", it just goes about a more accurate way of defining page
importance. A simple solution to the problem you suggest is assigning
far more "credits" to the database than to the 15-page process.

Eyal

>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.
>