2001-11-08 15:59:12

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: out_of_memory() heuristic broken for different mem configurations (fwd)


Linus,

I guess you forgot to apply the following patch on 2.4.15-pre1, right ?


---------- Forwarded message ----------
Date: Tue, 6 Nov 2001 15:23:29 -0200 (BRST)
From: Marcelo Tosatti <[email protected]>
To: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>, lkml <[email protected]>
Subject: Re: out_of_memory() heuristic broken for different mem
configurations



On Tue, 6 Nov 2001, Marcelo Tosatti wrote:

>
> > With the more aggressive max_mapped, the oom failure count could be
> > dropped to something smaller, as a false positive from shrink_caches
> > should be fairly rare. I don't think it needs to be tunable on memory
> > size, I just didn't even try any other values on my machines (I noticed
> > that the old values were too high once max_mapped was upped and the swap
> > cache reclaiming was re-done, but I didn't try if five seconds and ten
> > failures was any better than 10 seconds and five failures, for example)
>
> Ok, I'll take a careful look at shrink_cache()/try_to_free_pages() path
> later and find out "saner" magic numbers for big/small memory workloads.

Ok, I found it. The problem is that swap_out() tries to scan the _whole_
address space looking for pte's to deactivate, and try_to_swap_out() does
not return a value indicating the lack of swap space, so at each
swap_out() call we simply loop around the whole VM when there is no swap
space available.

Here goes the tested fix.


--- linux.orig/mm/vmscan.c Sun Nov 4 22:54:44 2001
+++ linux/mm/vmscan.c Tue Nov 6 16:06:08 2001
@@ -36,7 +36,8 @@
/*
* The swap-out function returns 1 if it successfully
* scanned all the pages it was asked to (`count').
- * It returns zero if it couldn't do anything,
+ * It returns zero if it couldn't free the given pte or -1
+ * if there was no swap space left.
*
* rss may decrease because pages are shared, but this
* doesn't count as having freed a page.
@@ -142,7 +143,7 @@
/* No swap space left */
set_pte(page_table, pte);
UnlockPage(page);
- return 0;
+ return -1;
}

/* mm->page_table_lock is held. mmap_sem is not held */
@@ -170,7 +171,12 @@
struct page *page = pte_page(*pte);

if (VALID_PAGE(page) && !PageReserved(page)) {
- count -= try_to_swap_out(mm, vma, address, pte, page, classzone);
+ int ret = try_to_swap_out(mm, vma, address, pte, page, classzone);
+ if (ret < 0)
+ return ret;
+
+ count -= ret;
+
if (!count) {
address += PAGE_SIZE;
break;
@@ -205,7 +211,11 @@
end = pgd_end;

do {
- count = swap_out_pmd(mm, vma, pmd, address, end, count, classzone);
+ int ret = swap_out_pmd(mm, vma, pmd, address, end, count, classzone);
+
+ if (ret < 0)
+ return ret;
+ count = ret;
if (!count)
break;
address = (address + PMD_SIZE) & PMD_MASK;
@@ -230,7 +240,10 @@
if (address >= end)
BUG();
do {
- count = swap_out_pgd(mm, vma, pgdir, address, end, count, classzone);
+ int ret = swap_out_pgd(mm, vma, pgdir, address, end, count, classzone);
+ if (ret < 0)
+ return ret;
+ count = ret;
if (!count)
break;
address = (address + PGDIR_SIZE) & PGDIR_MASK;
@@ -287,7 +300,7 @@
static int FASTCALL(swap_out(unsigned int priority, unsigned int gfp_mask, zone_t * classzone));
static int swap_out(unsigned int priority, unsigned int gfp_mask, zone_t * classzone)
{
- int counter, nr_pages = SWAP_CLUSTER_MAX;
+ int counter, nr_pages = SWAP_CLUSTER_MAX, ret;
struct mm_struct *mm;

counter = mmlist_nr;
@@ -311,9 +324,15 @@
atomic_inc(&mm->mm_users);
spin_unlock(&mmlist_lock);

- nr_pages = swap_out_mm(mm, nr_pages, &counter, classzone);
+ ret = swap_out_mm(mm, nr_pages, &counter, classzone);

mmput(mm);
+
+ /* No more swap space ? */
+ if (ret < 0)
+ return nr_pages;
+
+ nr_pages = ret;

if (!nr_pages)
return 1;



2001-11-08 16:12:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: out_of_memory() heuristic broken for different mem configurations (fwd)


On Thu, 8 Nov 2001, Marcelo Tosatti wrote:
>
> I guess you forgot to apply the following patch on 2.4.15-pre1, right ?

The thing is, I _really_ think it is broken.

The way to make it fail is to have many large SHARED mappings - in which
case we have backing space for 99% of all memory, and returning -1 just
because a few pages need swap-space and can't be thrown out is wrong.

Try it with no swap, and having some processes that MAP_SHARED much more
than available memory and many (small) processes that do not, and need
swap-space. It should work fine - we're never even _close_ to being out of
memory, but your change makes "swap_out()" fail all the time.

Linus

2001-11-08 16:34:53

by Rik van Riel

[permalink] [raw]
Subject: Re: out_of_memory() heuristic broken for different mem configurations (fwd)

On Thu, 8 Nov 2001, Linus Torvalds wrote:
> On Thu, 8 Nov 2001, Marcelo Tosatti wrote:
> >
> > I guess you forgot to apply the following patch on 2.4.15-pre1, right ?
>
> The thing is, I _really_ think it is broken.
>
> The way to make it fail is to have many large SHARED mappings

ISTR that you wanted swap_out() changed into something which
only scans a portion of the ptes and doesn't have any return
value for a related reason in early 2.4 ... ;)

regards,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2001-11-08 16:47:43

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: out_of_memory() heuristic broken for different mem configurations (fwd)



On Thu, 8 Nov 2001, Rik van Riel wrote:

> On Thu, 8 Nov 2001, Linus Torvalds wrote:
> > On Thu, 8 Nov 2001, Marcelo Tosatti wrote:
> > >
> > > I guess you forgot to apply the following patch on 2.4.15-pre1, right ?
> >
> > The thing is, I _really_ think it is broken.
> >
> > The way to make it fail is to have many large SHARED mappings
>
> ISTR that you wanted swap_out() changed into something which
> only scans a portion of the ptes and doesn't have any return
> value for a related reason in early 2.4 ... ;)

Rik,

I remember Linus had a reasoning for the "scan _ALL_ ptes until success"
behaviour.

Linus, was that due to zone-specific (eg DMA shortage on bigmem machine)
shortages or ?

That _is_ one good argument (I'm still seeing the network driver not being
able to allocate memory from the DMA zone on the 16GB boxen), I think.

However, the current code breaks badly as I've tested on the 16GB boxen.

2001-11-08 16:54:43

by Rik van Riel

[permalink] [raw]
Subject: Re: out_of_memory() heuristic broken for different mem configurations (fwd)

On Thu, 8 Nov 2001, Marcelo Tosatti wrote:

> I remember Linus had a reasoning for the "scan _ALL_ ptes until
> success" behaviour.

Oh, I agree with that.

The problem is the "if no success, scan ALL ptes again
in an infinite loop" ;)

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2001-11-08 16:58:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: out_of_memory() heuristic broken for different mem configurations (fwd)


On Thu, 8 Nov 2001, Rik van Riel wrote:
>
> ISTR that you wanted swap_out() changed into something which
> only scans a portion of the ptes and doesn't have any return
> value for a related reason in early 2.4 ... ;)

The fundamental thing is that we can _never_ return early out of
swap_out(). There are no local decisions we can make that make any sense.
The fact that we're out of swap-space is only meaningful _after_ we've
verified that we have no shared mappings.

Now, there _is_ another approach we can take, which is to notice _before_
we call swap_out() that it's not going to help us. With all anonymous
pages on the LRU list, we do have enough information in shrink_caches() to
notice that the pages we want to get rid of require backing store: that's
as easy as seeing "page->mapping == NULL".

But we can not make that decision in swap_out(), and any code that _tries_
to make that decision is clearly bogus.

Which is, indeed, why I feel strongly that swap_out() needs to be "void".
(Well, right now it technically isn't, but we ignore the value it returns,
so it doesn't much matter ;).

But in shrink_cache(), we could add something like

- increment a count every time we see a page with a mapping, and in the
right memory zone: all such pages _can_ be dropped and do not need
swap-space.

- add logic somewhat like

if (nr_swap_pages || page_mapping_count > 5%)
we're not oom

(note that "page_mapping_count" in _theory_ is the same as
"page_cache_size", except that we need to count only pages in the right
zones. It doesn't help us if we have tons of highmem pages free, if we're
unable to get rid of normal pages. There are simply too many things that
can not use the high pages, and we have to consider that to be oom).

Linus

2001-11-08 17:04:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: out_of_memory() heuristic broken for different mem configurations (fwd)


On Thu, 8 Nov 2001, Marcelo Tosatti wrote:
>
> I remember Linus had a reasoning for the "scan _ALL_ ptes until success"
> behaviour.
>
> Linus, was that due to zone-specific (eg DMA shortage on bigmem machine)
> shortages or ?

No, the major reason simply _is_ because it's fairly easy to come up with
test-cases that are mostly MAP_SHARED, and are not out-of-memory.
Returning early from swap_out() is simply fundamentally wrong - whether we
have free swap-space or not has very little to do with anything.

(Well, whether we have free swap space or not _is_ meaningful once we have
scanned all VM's, but not before that).

But see my suggestion about potentially noticing the oom _before_ calling
swap_out(). Thanks to anonymous-in-LRU we have a _lot_ of powerful
information that we simply traditionally haven't had. It doesn't just tell
us when we should start swapping out, it can also tell us if swap-out is
going to need swap-space or not. It can even tell how _much_ swap-space
we'll need to satisfy the "free N pages from the VM".

> However, the current code breaks badly as I've tested on the 16GB boxen.

I understand. I just think the fix needs to be different.

Linus