2018-08-22 09:30:17

by Marcus Linsner

Subject: How to prevent kernel from evicting code pages ever? (to avoid disk thrashing when about to run out of RAM)

Hi. How can I make the kernel keep (lock?) all code pages in RAM so that
kswapd0 won't evict them when the system is under low-memory
conditions?

The purpose of this is to prevent the kernel from causing lots of disk
reads (effectively freezing the whole system) when it is about to run
out of RAM, even when no swap is enabled, well before (minutes of real
time) the OOM-killer triggers to kill the offending process (e.g. ld)!
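
(I'm aware that a single process can pin its own pages with mlockall() -
a minimal sketch below - but that has to be done per process, needs
CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK, and doesn't cover every
executable's code pages system-wide, which is what I'm after.)

  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          /* lock all current and future mappings (code, data, heap, stack)
           * of this process into RAM; needs CAP_IPC_LOCK or enough
           * RLIMIT_MEMLOCK */
          if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                  perror("mlockall");
                  return 1;
          }
          /* ... from here on, this process's pages cannot be evicted ... */
          return 0;
  }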

I can replicate this consistently with 4G (and 12G) max RAM inside a
Qubes OS R4.0 AppVM running Fedora 28 while trying to compile Firefox.
The disk thrashing (continuous 192+ MiB/sec reads) occurs well before
the OOM-killer triggers to kill the 'ld' (or 'rustc') process, and
everything is frozen for minutes of real time. I've also encountered
this on bare metal myself, if it matters at all.

I tried to ask this question on SO here:
https://stackoverflow.com/q/51927528/10239615
but maybe I'll have better luck on this mailing list, where the kernel experts are.

Just think of all the frozen systems that would be saved (see the
related question in the above link, for one) if you figure out the
answer to this, whether it be a kernel patch, some .config options that
need changing, or whatever. Whoever you are, reader, please consider it
:) I want this for myself, but I'm more than willing to share the
answer with everyone once I have it: a way to let the OOM-killer kill
the offending process ASAP, without the system first passing through
disk-thrashing hell that freezes the OS ;-)


2018-09-10 09:19:44

by Marcus Linsner

Subject: Re: How to prevent kernel from evicting code pages ever? (to avoid disk thrashing when about to run out of RAM)

On Wed, Aug 22, 2018 at 11:25 AM Marcus Linsner
<[email protected]> wrote:
>
> Hi. How can I make the kernel keep (lock?) all code pages in RAM so that
> kswapd0 won't evict them when the system is under low-memory
> conditions?
>
> The purpose of this is to prevent the kernel from causing lots of disk
> reads (effectively freezing the whole system) when it is about to run
> out of RAM, even when no swap is enabled, well before (minutes of real
> time) the OOM-killer triggers to kill the offending process (e.g. ld)!
>
> I can replicate this consistently with 4G (and 12G) max RAM inside a
> Qubes OS R4.0 AppVM running Fedora 28 while trying to compile Firefox.
> The disk thrashing (continuous 192+ MiB/sec reads) occurs well before
> the OOM-killer triggers to kill the 'ld' (or 'rustc') process, and
> everything is frozen for minutes of real time. I've also encountered
> this on bare metal myself, if it matters at all.
>
> I tried to ask this question on SO here:
> https://stackoverflow.com/q/51927528/10239615
> but maybe I'll have better luck on this mailing list, where the kernel experts are.
>

This is what I got working so far to prevent the disk thrashing
(constant re-reading of active executable pages from disk) that would
otherwise freeze the OS before it runs out of memory:

The following patch can also be seen here:
https://github.com/constantoverride/qubes-linux-kernel/blob/devel-4.18/patches.addon/le9d.patch

revision 3
preliminary patch to avoid disk thrashing (constant reading) under
memory pressure before OOM-killer triggers
more info: https://gist.github.com/constantoverride/84eba764f487049ed642eb2111a20830

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..7636498 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -208,7 +208,7 @@ enum lru_list {

#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)

-#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
+#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_INACTIVE_FILE; lru++)

static inline int is_file_lru(enum lru_list lru)
{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03822f8..1f3ffb5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2086,9 +2086,9 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc)
{
if (is_active_lru(lru)) {
- if (inactive_list_is_low(lruvec, is_file_lru(lru),
- memcg, sc, true))
- shrink_active_list(nr_to_scan, lruvec, sc, lru);
+ //if (inactive_list_is_low(lruvec, is_file_lru(lru),
+ // memcg, sc, true))
+ // shrink_active_list(nr_to_scan, lruvec, sc, lru);
return 0;
}

@@ -2234,7 +2234,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,

anon = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, MAX_NR_ZONES) +
lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, MAX_NR_ZONES);
- file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
+ file = //lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);

spin_lock_irq(&pgdat->lru_lock);
@@ -2345,7 +2345,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
sc->priority == DEF_PRIORITY);

blk_start_plug(&plug);
- while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
+ while (nr[LRU_INACTIVE_ANON] || //nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
unsigned long nr_anon, nr_file, percentage;
unsigned long nr_scanned;
@@ -2372,7 +2372,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
* stop reclaiming one LRU and reduce the amount scanning
* proportional to the original scan target.
*/
- nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
+ nr_file = nr[LRU_INACTIVE_FILE] //+ nr[LRU_ACTIVE_FILE]
+ ;
nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];

/*
@@ -2391,7 +2392,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
percentage = nr_anon * 100 / scan_target;
} else {
unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
- targets[LRU_ACTIVE_FILE] + 1;
+ //targets[LRU_ACTIVE_FILE] +
+ 1;
lru = LRU_FILE;
percentage = nr_file * 100 / scan_target;
}
@@ -2409,10 +2411,12 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
nr[lru] = targets[lru] * (100 - percentage) / 100;
nr[lru] -= min(nr[lru], nr_scanned);

+ if (LRU_FILE != lru) { //avoid this block for LRU_ACTIVE_FILE
lru += LRU_ACTIVE;
nr_scanned = targets[lru] - nr[lru];
nr[lru] = targets[lru] * (100 - percentage) / 100;
nr[lru] -= min(nr[lru], nr_scanned);
+ }

scan_adjusted = true;
}


Tested on kernel 4.18.5 under Qubes OS, in both dom0 and VMs. It gets
rid of the disk thrashing that would otherwise seemingly permanently
freeze a qube (VM) with continuous disk reading (as seen from dom0 via
sudo iotop). With the above, it only freezes for at most 1 second
before the OOM-killer triggers and recovers RAM by killing some
process.

If anyone has a better idea, please let me know. I am hoping someone
knowledgeable can step in :)

I tried to find a way to also keep Inactive file pages in RAM, just
for tests(!), but couldn't figure out how (I'm not a programmer).
So keeping just the Active file pages seems good enough for now, even
though I can clearly see (via vm.block_dump=1) that there are still
some pages being re-read under high memory pressure, but for some
reason they don't cause any (or much) disk thrashing.
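
For reference, this is roughly how I watch the re-reads (vm.block_dump
still exists on 4.18; it is very verbose and logs to the kernel ring
buffer):

  echo 1 > /proc/sys/vm/block_dump   # log every block read/write and inode dirtying
  dmesg -w                           # watch which process re-reads which blocks
  echo 0 > /proc/sys/vm/block_dump   # turn the logging back off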

Cheers!

2018-09-28 15:42:52

by Alan Cox

Subject: Re: How to prevent kernel from evicting code pages ever? (to avoid disk thrashing when about to run out of RAM)

On Wed, 22 Aug 2018 11:25:35 +0200
Marcus Linsner <[email protected]> wrote:

> Hi. How can I make the kernel keep (lock?) all code pages in RAM so that
> kswapd0 won't evict them when the system is under low-memory
> conditions?
>
> The purpose of this is to prevent the kernel from causing lots of disk
> reads (effectively freezing the whole system) when it is about to run
> out of RAM, even when no swap is enabled, well before (minutes of real
> time) the OOM-killer triggers to kill the offending process (e.g. ld)!

Having no swap is not helping you at all.

In Linux you can do several things. Firstly, add some swap - even a
swap file. If you have no swap, you fill up memory with pages that are
not backed by disk, and the kernel has to pick less and less optimal
things to evict, so it begins to thrash. Even slowish swap is better
than no swap, as it can dump out little-used data pages.
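
For example (paths and sizes here are only illustrative; on some
filesystems, e.g. btrfs, use dd rather than fallocate):

  fallocate -l 4G /swapfile     # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile
  # add "/swapfile none swap sw 0 0" to /etc/fstab to keep it across reboots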

You can tune the OOM killer to taste and you can even guide it on what to
shoot first.
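
For instance (the pids and values below are just placeholders; the
range is -1000 to +1000, higher means "kill me first" and -1000 means
never kill):

  echo 500 > /proc/<pid of ld>/oom_score_adj       # prefer killing ld
  echo -1000 > /proc/<pid of sshd>/oom_score_adj   # never kill sshd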

You can use cgroups to constrain the resources some group of things are
allowed to use.
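
A rough sketch with cgroup v2 (assuming the unified hierarchy is
mounted at /sys/fs/cgroup; the group name and limits are only
examples):

  mkdir /sys/fs/cgroup/build
  echo 3G > /sys/fs/cgroup/build/memory.max    # hard cap: OOM kills happen inside the group
  echo 2G > /sys/fs/cgroup/build/memory.high   # soft cap: the group is reclaimed/throttled first
  echo $$ > /sys/fs/cgroup/build/cgroup.procs  # move this shell (and its children) into the group

or equivalently with systemd: systemd-run --scope -p MemoryMax=3G make -j8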

You can play with no-overcommit mode, although that is usually much
more about 'cannot fail' embedded applications. In that mode the
kernel tightly constrains how much overcommit is permissible. It's
very conservative and you end up needing a lot of 'just in case'
wasted resource, although you can tune how far the real resources are
leveraged.
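
Concretely, something like (a sketch; the ratio is just an example
value):

  sysctl -w vm.overcommit_memory=2   # strict accounting: allocations fail instead of overcommitting
  sysctl -w vm.overcommit_ratio=80   # commit limit = swap + 80% of RAM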

To be fair, Linux *is* really bad at handling this case. What other
systems did (being older and from the days when RAM wasn't reasonably
assumed infinite) was twofold. The first was, under high swap load, to
switch to swapping out entire processes, which with all the shared
resources and fast I/O today isn't quite so relevant. The second was
to ensure a process got a certain amount of real CPU time before its
pages could be booted out again (and it would then boot out lots of
them). That turns the thrashing into forward progress, but it still
feels unpleasant. However, you still need swap, or you don't have
anywhere to boot out all the dirty non-code pages in order to make
progress.

There is a reason swap exists. If you don't have enough RAM to run
smoothly without swap, add swap (or RAM). Even then some things usually
need swap - I've got things that make the compiler consume over 16GB
building one file. With swap it's fine even on a 4GB machine.

Alan