Subject: [patch 00/19] VM pageout scalability improvements

On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.

Against 2.6.24-rc6-mm1

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem backed, swap backed and non-reclaimable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory

3) switching to SEQ replacement for the anonymous LRUs, so the
number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number

The noreclaim patches come verbatim from Lee Schermerhorn and
Nick Piggin. I have made a few small fixes to them and left out
the bits that are no longer needed with split file/anon lists.

The exception is "Scan noreclaim list for reclaimable pages",
which should not be needed but could be a useful debugging tool.
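The split in point 2 can be sketched in plain userspace C; the struct, flags, and names below are illustrative only (the real kernel uses page flags and mapping information, not this struct):

```c
#include <stdbool.h>

/* Illustrative page state; stand-in for the real page flags. */
struct page {
    bool reclaimable;   /* can the VM evict this page at all? */
    bool swap_backed;   /* anon/shmem/tmpfs: needs swap to evict */
};

enum lru_list {
    LRU_FILE,       /* filesystem backed: evict by writeback/drop */
    LRU_ANON,       /* swap backed: evict via swap */
    LRU_NORECLAIM,  /* mlocked, SHM_LOCKed, etc.: never scanned */
};

/* Pick the LRU a page belongs on, so the scanner only ever walks
 * lists whose pages it can actually evict. */
enum lru_list page_lru(const struct page *page)
{
    if (!page->reclaimable)
        return LRU_NORECLAIM;
    return page->swap_backed ? LRU_ANON : LRU_FILE;
}
```

The point of the split is that the noreclaim list is simply never scanned, so its size no longer costs CPU time under memory pressure.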

--
All Rights Reversed


2008-01-03 16:51:32

by Lee Schermerhorn

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Wed, 2008-01-02 at 17:41 -0500, [email protected] wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory pressure in a catatonic state.
>
> Against 2.6.24-rc6-mm1
>
> This patch series improves VM scalability by:
>
> 1) making the locking a little more scalable
>
> 2) putting filesystem backed, swap backed and non-reclaimable pages
> onto their own LRUs, so the system only scans the pages that it
> can/should evict from memory
>
> 3) switching to SEQ replacement for the anonymous LRUs, so the
> number of pages that need to be scanned when the system
> starts swapping is bound to a reasonable number
>
> The noreclaim patches come verbatim from Lee Schermerhorn and
> Nick Piggin. I have made a few small fixes to them and left out
> the bits that are no longer needed with split file/anon lists.
>
> The exception is "Scan noreclaim list for reclaimable pages",
> which should not be needed but could be a useful debugging tool.

Note that patch 14/19 [SHM_LOCK/UNLOCK handling] depends on the
infrastructure introduced by the "Scan noreclaim list for reclaimable
pages" patch. When SHM_UNLOCKing a shm segment, we call a new
scan_mapping_noreclaim_page() function to check all of the pages in the
segment for reclaimability. There might be other reasons for the pages
to be non-reclaimable...

So, we can't merge 14/19 as is w/o some of patch 12. We can probably
eliminate the sysctl and per node sysfs attributes to force a scan.
But, as Rik says, this has been useful for debugging--e.g., periodically
forcing a full rescan while running a stress load.

Also, I should point out that the full noreclaim series includes a
couple of other patches NOT posted here by Rik:

1) treat swap backed pages as nonreclaimable when no swap space is
available. This addresses a problem we've seen in real life, with
vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
pages only to find that there is no swap space--add_to_swap() fails.
Maybe not a problem with Rik's new anon page handling. We'll see. If
we did want to add this filter, we'll need a way to bring back pages
from the noreclaim list that are there only for lack of swap space when
space is added or becomes available.

2) treat anon pages with "excessively long" anon_vma lists as
nonreclaimable. "excessively long" here is a sysctl tunable parameter.
This also addresses problems we've seen with benchmarks and stress
tests--all cpus spinning on some anon_vma lock. In "real life", we've
seen this behavior with file backed pages--spinning on the
i_mmap_lock--running Oracle workloads with user counts in the few
thousands. Again, something we may not need with Rik's vmscan rework.
If we did want to do this, we'd probably want to address file backed
pages and add support to bring the pages back from the noreclaim list
when the number of "mappers" drops below the threshold. My current
patch leaves anon pages as non-reclaimable until they're freed, or
manually scanned via the mechanism introduced by patch 12.
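Lee's filter can be sketched as below; the sysctl name, its default, and the helper are all hypothetical, since the actual patch is not posted in this thread:

```c
/* Hypothetical sketch of the "excessively long anon_vma list"
 * filter: a page whose reverse-mapping list has more mappers than
 * a sysctl tunable is treated as non-reclaimable, because walking
 * that list in page_referenced()/try_to_unmap() is exactly where
 * all the CPUs end up spinning. */
static unsigned long sysctl_max_anon_mappers = 4096; /* hypothetical tunable */

/* Nonzero: move the page to the noreclaim list instead of scanning it. */
int anon_page_nonreclaimable(unsigned long nr_mappers)
{
    return nr_mappers > sysctl_max_anon_mappers;
}
```

Note the tuning hazard: set too high, the filter never fires; set too low, pages pile up on the noreclaim list until they are freed or manually rescanned.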

Lee

2008-01-03 17:00:40

by Rik van Riel

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Thu, 03 Jan 2008 11:52:08 -0500
Lee Schermerhorn <[email protected]> wrote:

> Also, I should point out that the full noreclaim series includes a
> couple of other patches NOT posted here by Rik:
>
> 1) treat swap backed pages as nonreclaimable when no swap space is
> available. This addresses a problem we've seen in real life, with
> vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> pages only to find that there is no swap space--add_to_swap() fails.
> Maybe not a problem with Rik's new anon page handling.

If there is no swap space, my VM code will not bother scanning
any anon pages. This has the same effect as moving the pages
to the no-reclaim list, with the extra benefit of being able to
resume scanning the anon lists once swap space is freed.
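The check described here can be sketched as follows; the function name and the fallback percentages are illustrative (the real logic belongs in get_scan_ratio() in mm/vmscan.c):

```c
/* Sketch of "don't scan anon when there is no swap": if no swap
 * space remains, put 100% of reclaim effort on the file LRUs.
 * Outputs are percentages of scan effort. */
void get_scan_ratio_sketch(long nr_swap_pages,
                           int *percent_anon, int *percent_file)
{
    if (nr_swap_pages <= 0) {
        /* anon pages cannot be evicted anyway; don't waste
         * CPU walking the anon lists */
        *percent_anon = 0;
        *percent_file = 100;
        return;
    }
    /* placeholder even split; the real ratio weighs recent
     * referenced/scanned statistics for each list */
    *percent_anon = 50;
    *percent_file = 50;
}
```

Because the decision is re-evaluated on every reclaim pass, anon scanning resumes by itself as soon as swap space is freed or added, with no noreclaim-list rescue pass needed.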

> 2) treat anon pages with "excessively long" anon_vma lists as
> nonreclaimable. "excessively long" here is a sysctl tunable parameter.
> This also addresses problems we've seen with benchmarks and stress
> tests--all cpus spinning on some anon_vma lock. In "real life", we've
> seen this behavior with file backed pages--spinning on the
> i_mmap_lock--running Oracle workloads with user counts in the few
> thousands. Again, something we may not need with Rik's vmscan rework.
> If we did want to do this, we'd probably want to address file backed
> pages and add support to bring the pages back from the noreclaim list
> when the number of "mappers" drops below the threshold. My current
> patch leaves anon pages as non-reclaimable until they're freed, or
> manually scanned via the mechanism introduced by patch 12.

I can see some issues with that patch. Specifically, if the threshold
is set too high no pages will be affected, and if the threshold is too
low all pages will become non-reclaimable, leading to a false OOM kill.

Not only is it a very big hammer, it's also a rather awkward one...

--
All Rights Reversed

2008-01-03 17:12:55

by Lee Schermerhorn

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Thu, 2008-01-03 at 12:00 -0500, Rik van Riel wrote:
> On Thu, 03 Jan 2008 11:52:08 -0500
> Lee Schermerhorn <[email protected]> wrote:
>
> > Also, I should point out that the full noreclaim series includes a
> > couple of other patches NOT posted here by Rik:
> >
> > 1) treat swap backed pages as nonreclaimable when no swap space is
> > available. This addresses a problem we've seen in real life, with
> > vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> > pages only to find that there is no swap space--add_to_swap() fails.
> > Maybe not a problem with Rik's new anon page handling.
>
> If there is no swap space, my VM code will not bother scanning
> any anon pages. This has the same effect as moving the pages
> to the no-reclaim list, with the extra benefit of being able to
> resume scanning the anon lists once swap space is freed.
>
> > 2) treat anon pages with "excessively long" anon_vma lists as
> > nonreclaimable. "excessively long" here is a sysctl tunable parameter.
> > This also addresses problems we've seen with benchmarks and stress
> > tests--all cpus spinning on some anon_vma lock. In "real life", we've
> > seen this behavior with file backed pages--spinning on the
> > i_mmap_lock--running Oracle workloads with user counts in the few
> > thousands. Again, something we may not need with Rik's vmscan rework.
> > If we did want to do this, we'd probably want to address file backed
> > pages and add support to bring the pages back from the noreclaim list
> > when the number of "mappers" drops below the threshold. My current
> > patch leaves anon pages as non-reclaimable until they're freed, or
> > manually scanned via the mechanism introduced by patch 12.
>
> I can see some issues with that patch. Specifically, if the threshold
> is set too high no pages will be affected, and if the threshold is too
> low all pages will become non-reclaimable, leading to a false OOM kill.
>
> Not only is it a very big hammer, it's also a rather awkward one...

Yes, but the problem, when it occurs, is very awkward. The system just
hangs for hours/days spinning on the reverse mapping locks--in both
page_referenced() and try_to_unmap(). No pages get reclaimed and NO OOM
kill occurs because we never get that far. So, I'm not sure I'd call
any OOM kills resulting from this patch "false". The memory is
effectively nonreclaimable. Now, I think that your anon pages SEQ
patch will eliminate the contention in page_referenced[_anon](), but we
could still hang in try_to_unmap(). And we have the issue with file
backed pages and the i_mmap_lock. I'll see if this issue comes up in
testing with the current series. If not, cool! If so, we just have
more work to do.

Later,
Lee

2008-01-03 22:01:18

by Rik van Riel

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Thu, 03 Jan 2008 12:13:32 -0500
Lee Schermerhorn <[email protected]> wrote:

> Yes, but the problem, when it occurs, is very awkward. The system just
> hangs for hours/days spinning on the reverse mapping locks--in both
> page_referenced() and try_to_unmap(). No pages get reclaimed and NO OOM
> kill occurs because we never get that far. So, I'm not sure I'd call
> any OOM kills resulting from this patch as "false". The memory is
> effectively nonreclaimable. Now, I think that your anon pages SEQ
> patch will eliminate the contention in page_referenced[_anon](), but we
> could still hang in try_to_unmap().

I am hoping that Nick's ticket spinlocks will fix this problem.

Would you happen to have any test cases for the above problem that
I could use to reproduce the problem and look for an automatic fix?

Any fix that requires the sysadmin to tune things _just_ right seems
too dangerous to me - especially if a change in the workload can
result in the system doing exactly the wrong thing...

The idea is valid, but it just has to work automagically.

Btw, if page_referenced() is called less, the locks that try_to_unmap()
also takes should get less contention.
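For reference, the ticket-lock idea mentioned above can be sketched with C11 atomics; this is a minimal illustration, not Nick's actual implementation:

```c
#include <stdatomic.h>

/* Minimal ticket lock sketch: each waiter takes a ticket and spins
 * until its number is served, so the lock is granted in FIFO order.
 * Under heavy contention every CPU makes forward progress instead
 * of a few lucky ones starving the rest -- which is why a queued
 * lock turns a "hang" into a mere slowdown. */
struct ticket_lock {
    atomic_uint next;    /* next ticket to hand out */
    atomic_uint serving; /* ticket currently allowed in */
};

void ticket_lock(struct ticket_lock *l)
{
    unsigned int my = atomic_fetch_add(&l->next, 1);
    while (atomic_load(&l->serving) != my)
        ; /* spin; a real implementation would cpu_relax() here */
}

void ticket_unlock(struct ticket_lock *l)
{
    atomic_fetch_add(&l->serving, 1);
}
```

Contrast this with a plain test-and-set spinlock, where whichever CPU wins the cache line gets the lock, and a waiter can in principle lose forever.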

--
All Rights Reversed

2008-01-04 16:24:56

by Lee Schermerhorn

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Thu, 2008-01-03 at 17:00 -0500, Rik van Riel wrote:
> On Thu, 03 Jan 2008 12:13:32 -0500
> Lee Schermerhorn <[email protected]> wrote:
>
> > Yes, but the problem, when it occurs, is very awkward. The system just
> > hangs for hours/days spinning on the reverse mapping locks--in both
> > page_referenced() and try_to_unmap(). No pages get reclaimed and NO OOM
> > kill occurs because we never get that far. So, I'm not sure I'd call
> > any OOM kills resulting from this patch as "false". The memory is
> > effectively nonreclaimable. Now, I think that your anon pages SEQ
> > patch will eliminate the contention in page_referenced[_anon](), but we
> > could still hang in try_to_unmap().
>
> I am hoping that Nick's ticket spinlocks will fix this problem.
>
> Would you happen to have any test cases for the above problem that
> I could use to reproduce the problem and look for an automatic fix?

We can easily [he says, glibly] reproduce the hang on the anon_vma lock
with AIM7 loads on our test platforms. Perhaps we can come up with an
AIM workload to reproduce the phenomenon on one of your test platforms.
I've seen the hang with 15K-20K tasks on a 4 socket x86_64 with 16-32G
of memory and quite a bit of storage.

I've also seen related hangs on both anon_vma and i_mmap_lock during a
heavy usex stress load on the splitlru+noreclaim patches. [This, by the
way, without and WITH my rw_lock patches for both anon_vma and
i_mmap_lock.] I can try to package up the workload to run on your
system.

>
> Any fix that requires the sysadmin to tune things _just_ right seems
> too dangerous to me - especially if a change in the workload can
> result in the system doing exactly the wrong thing...
>
> The idea is valid, but it just has to work automagically.
>
> Btw, if page_referenced() is called less, the locks that try_to_unmap()
> also takes should get less contention.

Makes sense. We'll have to see.

Lee

2008-01-04 16:34:16

by Andi Kleen

Subject: Re: [patch 00/19] VM pageout scalability improvements

Lee Schermerhorn <[email protected]> writes:

> We can easily [he says, glibly] reproduce the hang on the anon_vma lock

Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

-Andi

2008-01-04 16:55:43

by Rik van Riel

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Fri, 04 Jan 2008 17:34:00 +0100
Andi Kleen <[email protected]> wrote:
> Lee Schermerhorn <[email protected]> writes:
>
> > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
>
> Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

I really think that the anon_vma and i_mmap_lock spinlock hangs are
due to the lack of queued spinlocks. Not because I have seen your
system hang, but because I've seen one of Larry's test systems here
hang in scary/amusing ways :)

With queued spinlocks the system should just slow down, not hang.

--
All rights reversed.

2008-01-04 17:05:36

by Lee Schermerhorn

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Fri, 2008-01-04 at 17:34 +0100, Andi Kleen wrote:
> Lee Schermerhorn <[email protected]> writes:
>
> > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
>
> Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

We see this on both NUMA and non-NUMA. x86_64 and ia64. The basic
criteria to reproduce is to be able to run thousands [or low 10s of
thousands] of tasks, continually increasing the number until the system
just goes into reclaim. Instead of swapping, the system seems to
hang--unresponsive from the console, but with "soft lockup" messages
spitting out every few seconds...


Lee



2008-01-04 18:09:22

by Larry Woodman

Subject: Re: [patch 00/19] VM pageout scalability improvements

Rik van Riel wrote:

> On Fri, 04 Jan 2008 17:34:00 +0100
> Andi Kleen <[email protected]> wrote:
>
> > Lee Schermerhorn <[email protected]> writes:
> >
> > > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
> >
> > Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?
>
> I really think that the anon_vma and i_mmap_lock spinlock hangs are
> due to the lack of queued spinlocks. Not because I have seen your
> system hang, but because I've seen one of Larry's test systems here
> hang in scary/amusing ways :)
>
Changing the anon_vma->lock into a rwlock_t helps because
page_lock_anon_vma() can take it for read, and that's where the
contention is. However, it's the fact that under some tests most of
the pages are in vmas queued to one anon_vma that causes so much
lock contention.


> With queued spinlocks the system should just slow down, not hang.

2008-01-07 10:03:31

by Kamezawa Hiroyuki

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Thu, 3 Jan 2008 12:00:00 -0500
Rik van Riel <[email protected]> wrote:

> On Thu, 03 Jan 2008 11:52:08 -0500
> Lee Schermerhorn <[email protected]> wrote:
>
> > Also, I should point out that the full noreclaim series includes a
> > couple of other patches NOT posted here by Rik:
> >
> > 1) treat swap backed pages as nonreclaimable when no swap space is
> > available. This addresses a problem we've seen in real life, with
> > vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> > pages only to find that there is no swap space--add_to_swap() fails.
> > Maybe not a problem with Rik's new anon page handling.
>
> If there is no swap space, my VM code will not bother scanning
> any anon pages. This has the same effect as moving the pages
> to the no-reclaim list, with the extra benefit of being able to
> resume scanning the anon lists once swap space is freed.
>
Is this 'avoiding scanning anon if no swap' feature in this set ?

Thanks
-Kame

2008-01-07 15:18:56

by Rik van Riel

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Mon, 7 Jan 2008 19:06:10 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Thu, 3 Jan 2008 12:00:00 -0500
> Rik van Riel <[email protected]> wrote:

> > If there is no swap space, my VM code will not bother scanning
> > any anon pages. This has the same effect as moving the pages
> > to the no-reclaim list, with the extra benefit of being able to
> > resume scanning the anon lists once swap space is freed.
> >
> Is this 'avoiding scanning anon if no swap' feature in this set ?

I seem to have lost that code in a forward merge :(

Dunno if I started the forward merge from an older series that
Lee had or if I lost the code myself...

I'll put it back in ASAP.

--
All rights reversed.

2008-01-07 19:08:07

by Christoph Lameter

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Fri, 4 Jan 2008, Lee Schermerhorn wrote:

> We see this on both NUMA and non-NUMA. x86_64 and ia64. The basic
> criteria to reproduce is to be able to run thousands [or low 10s of
> thousands] of tasks, continually increasing the number until the system
> just goes into reclaim. Instead of swapping, the system seems to
> hang--unresponsive from the console, but with "soft lockup" messages
> spitting out every few seconds...

Ditto here.

2008-01-07 19:32:41

by Rik van Riel

Subject: Re: [patch 00/19] VM pageout scalability improvements

On Mon, 7 Jan 2008 11:07:54 -0800 (PST)
Christoph Lameter <[email protected]> wrote:
> On Fri, 4 Jan 2008, Lee Schermerhorn wrote:
>
> > We see this on both NUMA and non-NUMA. x86_64 and ia64. The basic
> > criteria to reproduce is to be able to run thousands [or low 10s of
> > thousands] of tasks, continually increasing the number until the system
> > just goes into reclaim. Instead of swapping, the system seems to
> > hang--unresponsive from the console, but with "soft lockup" messages
> > spitting out every few seconds...
>
> Ditto here.

I have some suspicions on what could be causing this.

The most obvious suspect is get_scan_ratio() continuing to return
100% file reclaim, 0% anon reclaim when the file LRUs have already
been reduced to something very small, because reclaiming up to that
point was easy.

I plan to add some code to automatically set the anon reclaim to
100% if (free + file_active + file_inactive <= zone->pages_high),
meaning that reclaiming just file pages will not be able to free
enough pages.
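The planned safety valve can be sketched as below; the helper name is illustrative, and the actual get_scan_ratio() change may differ:

```c
/* Sketch of the proposed check: if free memory plus all file
 * backed pages together cannot even satisfy the zone's pages_high
 * watermark, file-only reclaim cannot possibly free enough memory,
 * so the anon lists must be scanned regardless of the usual ratio. */
int force_anon_reclaim(unsigned long nr_free,
                       unsigned long file_active,
                       unsigned long file_inactive,
                       unsigned long pages_high)
{
    return nr_free + file_active + file_inactive <= pages_high;
}
```

This bounds the failure mode: once the file LRUs are nearly empty, the ratio flips to 100% anon instead of the scanner spinning on an unreclaimable remainder.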

--
All rights reversed.