2009-12-17 12:29:19

by Larry Woodman

Subject: FWD: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

KOSAKI Motohiro wrote:
> (offlist)
>
> Larry, May I ask current status of following your issue?
> I don't reproduce it. and I don't hope to keep lots patch are up in the air.
>

Yes, sorry for the delay, but I don't have direct or exclusive access to
these large systems and workloads. As far as I can tell this patch series
does help prevent total system hangs running AIM7. I did have trouble with
the early postings, mostly due to their use of sleep_on() and wakeup(), but
those problems appear to be fixed.

However, I did add more debug code and saw ~10000 processes blocked in
shrink_zone_begin(). This is expected but bothersome: practically all of
the processes remain runnable for the entire duration of these AIM runs,
and collectively all of these runnable processes overwhelm the VM system.
There are many more runnable processes now than were ever seen before,
~10000 now versus ~100 on RHEL5 (2.6.18 based). So we have also been
experimenting with some of the CFS scheduler tunables to see whether this
is responsible...
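
For readers without the patch handy, the mechanism under discussion looks
roughly like the sketch below. It is not the posted patch; zone->reclaim_wait
is the name used later in this thread, while concurrent_reclaimers and
max_zone_concurrent_reclaimers are placeholder names.

/*
 * Sketch only: allow a limited number of tasks into shrink_zone() per
 * zone at a time; the rest sleep on a per-zone waitqueue until one of
 * the tasks ahead of them finishes reclaiming.
 */
static void shrink_zone_begin(struct zone *zone)
{
	wait_event(zone->reclaim_wait,
		   atomic_add_unless(&zone->concurrent_reclaimers, 1,
				     max_zone_concurrent_reclaimers));
}

static void shrink_zone_end(struct zone *zone)
{
	atomic_dec(&zone->concurrent_reclaimers);
	wake_up(&zone->reclaim_wait);	/* let the next waiter try */
}
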
> plus, I integrated page_referenced() improvement patch series and
> limit concurrent reclaimers patch series privately. I plan to post it
> to lkml at this week end. comments are welcome.
>

The only problem I noticed with the page_referenced patch was an increase
in try_to_unmap() failures, which causes more re-activations. This is very
obvious with the tracepoints I have posted over the past few months, but
those were never included. I didn't get a chance to figure out the exact
cause because of limited access to the hardware and workload. This patch
series also seems to help with the overall stalls in the VM system.
>
> changelog from last post:
> - remade the limit-concurrent-reclaimers series and sorted out its patch order
> - changed the default max concurrent reclaimers from 8 to num_online_cpu().
>   In the last post Andi's only negative comment was that he dislikes a
>   constant default value; also, going above num_online_cpu() would be
>   really silly. In other words, this is a low-risk change.
>   (We might still change the default value. As far as I have measured, a
>   small value gives better benchmark results, but I am not sure a small
>   value doesn't introduce a regression.)
> - Improved OOM and SIGKILL behavior.
>   (RHEL5's vmscan has TIF_MEMDIE recovery logic, but current mainline
>   doesn't. I don't want RHEL6 to have a regression.)
>
>
>
>
>> On Fri, 2009-12-11 at 16:46 -0500, Rik van Riel wrote:
>>
>> Rik, the latest patch appears to have a problem although I dont know
>> what the problem is yet. When the system ran out of memory we see
>> thousands of runnable processes and 100% system time:
>>
>>
>> 9420 2 29824 79856 62676 19564 0 0 0 0 8054 379 0 100 0 0 0
>> 9420 2 29824 79368 62292 19564 0 0 0 0 8691 413 0 100 0 0 0
>> 9421 1 29824 79780 61780 19820 0 0 0 0 8928 408 0 100 0 0 0
>>
>> The system would not respond so I dont know whats going on yet. I'll
>> add debug code to figure out why its in that state as soon as I get
>> access to the hardware.
>>

This was in response to Rik's first patch and seems to be fixed by the
latest patch set.

Finally, having said all that, the system still struggles to reclaim memory
with ~10000 processes trying at the same time; you fix one bottleneck and it
moves somewhere else. The latest run showed all but one running process
spinning in page_lock_anon_vma() trying for the anon_vma lock. I noticed
that there are ~5000 vma's linked to one anon_vma, which seems excessive!!!

I changed the anon_vma->lock to a rwlock_t and page_lock_anon_vma() to use
read_lock() so multiple callers could execute the page_referenced_anon code.
This seems to help quite a bit.
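
For illustration, that experiment amounts to roughly the following sketch
against the 2.6.32-era structures (not an exact diff):

/*
 * Sketch only: make anon_vma->lock an rwlock_t so the many concurrent
 * page_referenced()/try_to_unmap() callers take it for read, while the
 * mmap-side list manipulations keep taking it for write.
 */
struct anon_vma {
	rwlock_t lock;			/* was: spinlock_t lock */
	struct list_head head;		/* vmas sharing this anon_vma */
};

struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
	anon_mapping = (unsigned long)page->mapping;
	if (!(anon_mapping & PAGE_MAPPING_ANON) || !page_mapped(page)) {
		rcu_read_unlock();
		return NULL;
	}
	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
	read_lock(&anon_vma->lock);	/* readers no longer serialize each other */
	return anon_vma;
}
/* page_unlock_anon_vma() does read_unlock() + rcu_read_unlock(). */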


>> Larry


Attachments:
aim.patch (4.69 kB)

2009-12-17 14:44:05

by Rik van Riel

Subject: Re: FWD: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

On 12/17/2009 07:23 AM, Larry Woodman wrote:

>>> The system would not respond so I dont know whats going on yet. I'll
>>> add debug code to figure out why its in that state as soon as I get
>>> access to the hardware.
>
> This was in response to Rik's first patch and seems to be fixed by the
> latest path set.
>
> Finally, having said all that, the system still struggles reclaiming
> memory with
> ~10000 processes trying at the same time, you fix one bottleneck and it
> moves
> somewhere else. The latest run showed all but one running process
> spinning in
> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there are
> ~5000 vma's linked to one anon_vma, this seems excessive!!!

I have some ideas on how to better handle the processes waiting
on the per-zone reclaim_wait waitqueue.

For one, we should probably only do the lots-free wakeup
if we have more than zone->pages_high free pages in the
zone - having each of the waiters free some memory one
after another should not be a problem as long as we do
not have too much free memory in the zone.

Currently it is a hair trigger: the threshold at which processes go
into the page reclaim path and the "plenty free" threshold at which
they exit it are exactly the same.

Some hysteresis there could help.
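
Something along these lines, for example; this is a sketch rather than a
patch, reclaim_wait_wakeup() is a made-up helper name, and newer trees
spell zone->pages_high as high_wmark_pages(zone):

/*
 * Hysteresis sketch: tasks still enter the reclaim path at the same
 * low-water trigger, but the "plenty free" wake-everybody exit only
 * fires once the zone is above pages_high, so the entry and exit
 * thresholds are no longer identical.
 */
static void reclaim_wait_wakeup(struct zone *zone)
{
	if (zone_page_state(zone, NR_FREE_PAGES) > zone->pages_high)
		wake_up_all(&zone->reclaim_wait);	/* drain the queue */
	else
		wake_up(&zone->reclaim_wait);		/* let one waiter retry */
}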

--
All rights reversed.

2009-12-17 19:55:46

by Rik van Riel

Subject: Re: FWD: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

After removing some more immediate bottlenecks with
the patches by Kosaki and me, Larry ran into a really
big one:

Larry Woodman wrote:

> Finally, having said all that, the system still struggles reclaiming
> memory with
> ~10000 processes trying at the same time, you fix one bottleneck and it
> moves
> somewhere else. The latest run showed all but one running process
> spinning in
> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there
> are
> ~5000 vma's linked to one anon_vma, this seems excessive!!!
>
> I changed the anon_vma->lock to a rwlock_t and page_lock_anon_vma() to use
> read_lock() so multiple callers could execute the page_reference_anon code.
> This seems to help quite a bit.

The system has 10000 processes, all of which are child
processes of the same parent.

Pretty much all memory is anonymous memory.

This means that pretty much every anonymous page in the
system:
1) belongs to just one process, but
2) belongs to an anon_vma which is attached to 10,000 VMAs!

This results in page_referenced scanning 10,000 VMAs for
every page, despite the fact that each page is typically
only mapped into one process.
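
To see why, here is the anon rmap walk from this era's mm/rmap.c, simplified
(cgroup filtering and the address computation are omitted, so treat it as a
sketch rather than the exact code):

static int page_referenced_anon(struct page *page, unsigned long *vm_flags)
{
	unsigned int mapcount;
	struct anon_vma *anon_vma;
	struct vm_area_struct *vma;
	int referenced = 0;

	anon_vma = page_lock_anon_vma(page);	/* takes anon_vma->lock */
	if (!anon_vma)
		return referenced;

	mapcount = page_mapcount(page);
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		/*
		 * With 10,000 forked children this list has 10,000
		 * entries, even though the page is usually mapped by
		 * only one of them.
		 */
		referenced += page_referenced_one(page, vma, &mapcount, vm_flags);
		if (!mapcount)		/* every mapping accounted for */
			break;
	}
	page_unlock_anon_vma(anon_vma);
	return referenced;
}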

This seems to be our real scalability issue.

The only way out I can think of is to have a new anon_vma
when we start a child process and to have COW place new
pages in that new anon_vma.

However, this is a bit of a paradigm shift in our object
rmap system and I am wondering if somebody else has a
better idea :)

2009-12-17 21:16:18

by Hugh Dickins

Subject: Re: FWD: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

On Thu, 17 Dec 2009, Rik van Riel wrote:

> After removing some more immediate bottlenecks with
> the patches by Kosaki and me, Larry ran into a really
> big one:
>
> Larry Woodman wrote:
>
> > Finally, having said all that, the system still struggles reclaiming memory
> > with
> > ~10000 processes trying at the same time, you fix one bottleneck and it
> > moves
> > somewhere else. The latest run showed all but one running process spinning
> > in
> > page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there are
> > ~5000 vma's linked to one anon_vma, this seems excessive!!!
> >
> > I changed the anon_vma->lock to a rwlock_t and page_lock_anon_vma() to use
> > read_lock() so multiple callers could execute the page_reference_anon code.
> > This seems to help quite a bit.
>
> The system has 10000 processes, all of which are child
> processes of the same parent.
>
> Pretty much all memory is anonymous memory.
>
> This means that pretty much every anonymous page in the
> system:
> 1) belongs to just one process, but
> 2) belongs to an anon_vma which is attached to 10,000 VMAs!
>
> This results in page_referenced scanning 10,000 VMAs for
> every page, despite the fact that each page is typically
> only mapped into one process.
>
> This seems to be our real scalability issue.
>
> The only way out I can think is to have a new anon_vma
> when we start a child process and to have COW place new
> pages in the new anon_vma.
>
> However, this is a bit of a paradigm shift in our object
> rmap system and I am wondering if somebody else has a
> better idea :)

Please first clarify whether what Larry is running is actually
a workload that people need to behave well in real life.

From time to time such cases have been constructed, but we've
usually found better things to do than solve them, because
they've been no more than academic problems.

I'm not asserting that this one is purely academic, but I do
think we need more than an artificial case to worry much about it.

An rwlock there has been proposed on several occasions, but
we resist because that change benefits this case but performs
worse on more common cases (I believe: no numbers to back that up).

Substitute a MAP_SHARED file underneath those 10000 vmas,
and don't you have an equal problem with the prio_tree,
which would be harder to solve than the anon_vma case?

Hugh

2009-12-17 22:52:54

by Rik van Riel

Subject: Re: FWD: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

Hugh Dickins wrote:
> On Thu, 17 Dec 2009, Rik van Riel wrote:
>
>> After removing some more immediate bottlenecks with
>> the patches by Kosaki and me, Larry ran into a really
>> big one:
>>
>> Larry Woodman wrote:
>>
>>> Finally, having said all that, the system still struggles reclaiming memory
>>> with
>>> ~10000 processes trying at the same time, you fix one bottleneck and it
>>> moves
>>> somewhere else. The latest run showed all but one running process spinning
>>> in
>>> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there are
>>> ~5000 vma's linked to one anon_vma, this seems excessive!!!
>>>
>>> I changed the anon_vma->lock to a rwlock_t and page_lock_anon_vma() to use
>>> read_lock() so multiple callers could execute the page_reference_anon code.
>>> This seems to help quite a bit.
>> The system has 10000 processes, all of which are child
>> processes of the same parent.
>>
>> Pretty much all memory is anonymous memory.
>>
>> This means that pretty much every anonymous page in the
>> system:
>> 1) belongs to just one process, but
>> 2) belongs to an anon_vma which is attached to 10,000 VMAs!
>>
>> This results in page_referenced scanning 10,000 VMAs for
>> every page, despite the fact that each page is typically
>> only mapped into one process.
>>
>> This seems to be our real scalability issue.
>>
>> The only way out I can think is to have a new anon_vma
>> when we start a child process and to have COW place new
>> pages in the new anon_vma.
>>
>> However, this is a bit of a paradigm shift in our object
>> rmap system and I am wondering if somebody else has a
>> better idea :)
>
> Please first clarify whether what Larry is running is actually
> a workload that people need to behave well in real life.

AIM7 is fairly artificial, but real life workloads
like Oracle, PostgreSQL and Apache can also fork off
large numbers of child processes, which also cause
the system to end up with lots of VMAs attached to
the anon_vmas which all the anonymous pages belong
to.

10,000 is fairly extreme, but very large Oracle
workloads can get up to 1,000 or 2,000 today.
This number is bound to grow in the future.

2009-12-18 10:28:10

by KOSAKI Motohiro

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

> KOSAKI Motohiro wrote:
> > (offlist)
> >
> > Larry, May I ask current status of following your issue?
> > I don't reproduce it. and I don't hope to keep lots patch are up in the air.
> >
>
> Yes, sorry for the delay but I dont have direct or exclusive access to
> these large systems
> and workloads. As far as I can tell this patch series does help prevent
> total system
> hangs running AIM7. I did have trouble with the early postings mostly
> due to using sleep_on()
> and wakeup() but those appear to be fixed.
>
> However, I did add more debug code and see ~10000 processes blocked in
> shrink_zone_begin().
> This is expected but bothersome, practically all of the processes remain
> runnable for the entire
> duration of these AIM runs. Collectively all these runnable processes
> overwhelm the VM system.
> There are many more runnable processes now than were ever seen before,
> ~10000 now versus
> ~100 on RHEL5(2.6.18 based). So, we have also been experimenting around
> with some of the
> CFS scheduler tunables to see of this is responsible...

Which point bothers you: throughput, latency, or something else? Actually,
unfairness itself is the right thing from the VM's point of view, because a
perfectly fair VM easily livelocks (e.g. process-A swaps out process-B's
pages while process-B swaps out process-A's pages). The swap token solves
that simplest case, but running many processes easily creates a similar
circular dependency. Recovering from heavy memory pressure needs a lot of
unfairness.

Of course, if the unfairness causes a performance regression, that is a bug
and it should be fixed.


> The only problem I noticed with the page_referenced patch was an
> increase in the
> try_to_unmap() failures which causes more re-activations. This is very
> obvious with
> the using tracepoints I have posted over the past few months but they
> were never
> included. I didnt get a chance to figure out the exact cause due to
> access to the hardware
> and workload. This patch series also seems to help the overall stalls
> in the VM system.

I (and many VM developers) have not forgotten your tracepoint effort; we
only want to solve the regression first.


> >> Rik, the latest patch appears to have a problem although I dont know
> >> what the problem is yet. When the system ran out of memory we see
> >> thousands of runnable processes and 100% system time:
> >>
> >>
> >> 9420 2 29824 79856 62676 19564 0 0 0 0 8054 379 0
> >> 100 0 0 0
> >> 9420 2 29824 79368 62292 19564 0 0 0 0 8691 413 0
> >> 100 0 0 0
> >> 9421 1 29824 79780 61780 19820 0 0 0 0 8928 408 0
> >> 100 0 0 0
> >>
> >> The system would not respond so I dont know whats going on yet. I'll
> >> add debug code to figure out why its in that state as soon as I get
> >> access to the hardware.
> >>
>
> This was in response to Rik's first patch and seems to be fixed by the
> latest path set.
>
> Finally, having said all that, the system still struggles reclaiming
> memory with
> ~10000 processes trying at the same time, you fix one bottleneck and it
> moves
> somewhere else. The latest run showed all but one running process
> spinning in
> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there
> are
> ~5000 vma's linked to one anon_vma, this seems excessive!!!
>
> I changed the anon_vma->lock to a rwlock_t and page_lock_anon_vma() to use
> read_lock() so multiple callers could execute the page_reference_anon code.
> This seems to help quite a bit.

Ugh, no. An rw-spinlock is evil; please don't use it. rw-spinlocks have bad
performance characteristics: a steady stream of read_lock() holders can
block a write_lock() for a very long time.

And I would like to confirm one thing: the anon_vma design hasn't changed
in many years. Is this really a performance regression? Are we striking at
the right regression point?


2009-12-18 14:09:58

by Rik van Riel

Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

On 12/18/2009 05:27 AM, KOSAKI Motohiro wrote:
>> KOSAKI Motohiro wrote:

>> Finally, having said all that, the system still struggles reclaiming
>> memory with
>> ~10000 processes trying at the same time, you fix one bottleneck and it
>> moves
>> somewhere else. The latest run showed all but one running process
>> spinning in
>> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there
>> are
>> ~5000 vma's linked to one anon_vma, this seems excessive!!!
>>
>> I changed the anon_vma->lock to a rwlock_t and page_lock_anon_vma() to use
>> read_lock() so multiple callers could execute the page_reference_anon code.
>> This seems to help quite a bit.
>
> Ug. no. rw-spinlock is evil. please don't use it. rw-spinlock has bad
> performance characteristics, plenty read_lock block write_lock for very
> long time.
>
> and I would like to confirm one thing. anon_vma design didn't change
> for long year. Is this really performance regression? Do we strike
> right regression point?

In 2.6.9 and 2.6.18 the system would hit different contention
points before getting to the anon_vma lock. Now that we've
gotten the other contention points out of the way, this one
has finally been exposed.

--
All rights reversed.

2009-12-18 16:23:46

by Andrea Arcangeli

Subject: Re: FWD: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

On Thu, Dec 17, 2009 at 09:05:23PM +0000, Hugh Dickins wrote:
> Please first clarify whether what Larry is running is actually
> a workload that people need to behave well in real life.

Anything with 10000 connections using a connection-per-thread/process
model should use threads if good performance is expected, not processes.
Most things using a multi-process design will never use a
one-connection-per-process design (yes, there are exceptions, and no, we
can't expect to fix those as they're proprietary ;). So I'm not
particularly worried.

Also make sure this happens on older kernels too: newer kernels walk the
rmap chains and mangle ptes even when there is no VM pressure, for no good
reason. Older kernels would only hit the anon_vma chain of an anon page
after that page had been converted to swapcache and swap was hit, so it
makes a whole lot of difference. Anon_vma chains should only be touched
once we are I/O bound if anybody is to expect decent performance out of
the kernel.

> I'm not asserting that this one is purely academic, but I do
> think we need more than an artificial case to worry much about it.

Tend to agree.

> An rwlock there has been proposed on several occasions, but
> we resist because that change benefits this case but performs
> worse on more common cases (I believe: no numbers to back that up).

I think an rwlock for the anon_vma is a must. Whatever extra overhead it
adds to the uncontended fast path is practically zero, and on large SMP it
allows rmap walks over long chains to run in parallel, so it is very much
worth it: the downside is practically zero and the upside may be measurable
in certain corner cases. I don't think it'll be enough, but I definitely
like it.

> Substitute a MAP_SHARED file underneath those 10000 vmas,
> and don't you have an equal problem with the prio_tree,
> which would be harder to solve than the anon_vma case?

That is a very good point.

Rik suggested to me that a newly allocated COW page should use its own
anon_vma. Conceptually Rik's idea is a fine one, but the complication is
how to chain the same vma into multiple anon_vmas (in practice insertion
and removal will be slower, and more metadata will be needed for the
additional anon_vmas and for vmas queued into more than one anon_vma). But
this will only help if the mapcount of the page is 1; if the mapcount is
10000, no change to the anon_vma or prio_tree will solve it, and we would
instead have to start breaking out of the rmap loop after 64
test_and_clear_young() calls to mitigate the inefficiency on pages that
are in active use and will never go to swap, where burning 10000
cachelines just because the page eventually reaches the tail of the LRU is
entirely wasted effort.
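
Purely as an illustration of that last point (the constant and its placement
are hypothetical), the idea is for the rmap walk to bail out early on pages
that are obviously in heavy use:

	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		referenced += page_referenced_one(page, vma, &mapcount, vm_flags);
		/*
		 * A page with this many young references is staying on
		 * the active list no matter what, so stop pulling in
		 * the remaining vmas' cachelines.
		 */
		if (referenced >= 64 || !mapcount)
			break;
	}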

2009-12-18 17:44:18

by Rik van Riel

[permalink] [raw]
Subject: Re: FWD: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone

On 12/18/2009 11:23 AM, Andrea Arcangeli wrote:
> On Thu, Dec 17, 2009 at 09:05:23PM +0000, Hugh Dickins wrote:

>> An rwlock there has been proposed on several occasions, but
>> we resist because that change benefits this case but performs
>> worse on more common cases (I believe: no numbers to back that up).
>
> I think rwlock for anon_vma is a must. Whatever higher overhead of the
> fast path with no contention is practically zero, and in large smp it
> allows rmap on long chains to run in parallel, so very much worth it
> because downside is practically zero and upside may be measurable
> instead in certain corner cases. I don't think it'll be enough, but I
> definitely like it.

I agree, changing the anon_vma lock to an rwlock should
work a lot better than what we have today. The tradeoff
is a tiny slowdown in medium contention cases, at the
benefit of avoiding catastrophic slowdown in some cases.

With Nick Piggin's fair rwlocks, there should be no issue
at all.

> Rik suggested to me to have a cowed newly allocated page to use its
> own anon_vma. Conceptually Rik's idea is fine one, but the only
> complication then is how to chain the same vma into multiple anon_vma
> (in practice insert/removal will be slower and more metadata will be
> needed for additional anon_vmas and vams queued in more than
> anon_vma). But this only will help if the mapcount of the page is 1,
> if the mapcount is 10000 no change to anon_vma or prio_tree will solve
> this,

It's even more complex than this for anonymous pages.

Anonymous pages get COW copied in child (and parent)
processes, potentially resulting in one page, at each
offset into the anon_vma, for every process attached
to the anon_vma.

As a result, with 10000 child processes, page_referenced
can end up searching through 10000 VMAs even for pages
with a mapcount of 1!

--
All rights reversed.