2009-11-25 18:38:34

by Rik van Riel

Subject: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan

In AIM7 runs, recent kernels start swapping out anonymous pages
well before they should. This is due to shrink_list falling
through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
when all we really wanted to do is pre-age some anonymous pages to
give them extra time to be referenced while on the inactive list.

The obvious fix is to make sure that shrink_list does not fall
through to scanning/reclaiming inactive pages when we called it
to scan one of the active lists.

This change should be safe because the loop in shrink_zone ensures
that we will still shrink the anon and file inactive lists whenever
we should.


Reported-by: Larry Woodman <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 777af57..ec4dfda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1469,13 +1469,15 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
{
int file = is_file_lru(lru);

- if (lru == LRU_ACTIVE_FILE && inactive_file_is_low(zone, sc)) {
- shrink_active_list(nr_to_scan, zone, sc, priority, file);
+ if (lru == LRU_ACTIVE_FILE) {
+ if (inactive_file_is_low(zone, sc))
+ shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}

- if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
- shrink_active_list(nr_to_scan, zone, sc, priority, file);
+ if (lru == LRU_ACTIVE_ANON) {
+ if (inactive_file_is_low(zone, sc))
+ shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}
return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);


2009-11-25 21:05:11

by Johannes Weiner

Subject: Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan

Hello all,

On Wed, Nov 25, 2009 at 01:37:52PM -0500, Rik van Riel wrote:
> In AIM7 runs, recent kernels start swapping out anonymous pages
> well before they should. This is due to shrink_list falling
> through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
> when all we really wanted to do is pre-age some anonymous pages to
> give them extra time to be referenced while on the inactive list.

I do not quite understand what changed 'recently'.

That fall-through logic to keep eating inactives when the ratio is off
came in a year ago with the second-chance-for-anon-pages patch..?

> The obvious fix is to make sure that shrink_list does not fall
> through to scanning/reclaiming inactive pages when we called it
> to scan one of the active lists.
>
> This change should be safe because the loop in shrink_zone ensures
> that we will still shrink the anon and file inactive lists whenever
> we should.

It was not so obvious to me ;)

At first, I thought it would make sense to actively rebalance between
the lists if the inactive one grows too large (the fall-through case).

But shrink_zone() does not know about this and although we scan
inactive pages, we do not account for them and decrease the 'nr[lru]'
for active pages instead, effectively shifting the 'active todo' over
to the 'inactive todo'. I can imagine this going wrong!
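For reference, here is roughly what that loop looks like (a paraphrased
sketch of the 2.6.32-era shrink_zone() scan loop, not verbatim):

	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
					nr[LRU_INACTIVE_FILE]) {
		for_each_evictable_lru(l) {
			if (nr[l]) {
				nr_to_scan = min(nr[l],
					(unsigned long)SWAP_CLUSTER_MAX);
				nr[l] -= nr_to_scan;
				/*
				 * Whatever shrink_list() ends up doing is
				 * charged against nr[l] for the list we
				 * asked it to scan -- even if it falls
				 * through to the inactive list.
				 */
				nr_reclaimed += shrink_list(l, nr_to_scan,
							zone, sc, priority);
			}
		}
	}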

So I agree, we should use the inactive_*_is_low() predicate only
passively.

> Reported-by: Larry Woodman <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 777af57..ec4dfda 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1469,13 +1469,15 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> {
> int file = is_file_lru(lru);
>
> - if (lru == LRU_ACTIVE_FILE && inactive_file_is_low(zone, sc)) {
> - shrink_active_list(nr_to_scan, zone, sc, priority, file);
> + if (lru == LRU_ACTIVE_FILE) {
> + if (inactive_file_is_low(zone, sc))
> + shrink_active_list(nr_to_scan, zone, sc, priority, file);
> return 0;
> }
>
> - if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
> - shrink_active_list(nr_to_scan, zone, sc, priority, file);
> + if (lru == LRU_ACTIVE_ANON) {
> + if (inactive_file_is_low(zone, sc))
> + shrink_active_list(nr_to_scan, zone, sc, priority, file);
> return 0;
> }
> return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2009-11-25 20:48:39

by Rik van Riel

Subject: Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan

On 11/25/2009 03:35 PM, Johannes Weiner wrote:
> Hello all,
>
> On Wed, Nov 25, 2009 at 01:37:52PM -0500, Rik van Riel wrote:
>
>> In AIM7 runs, recent kernels start swapping out anonymous pages
>> well before they should. This is due to shrink_list falling
>> through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
>> when all we really wanted to do is pre-age some anonymous pages to
>> give them extra time to be referenced while on the inactive list.
>>
> I do not quite understand what changed 'recently'.
>
> That fall-through logic to keep eating inactives when the ratio is off
> came in a year ago with the second-chance-for-anon-pages patch..?
>
>
The confusion comes from my use of the word
"recently" here. Larry started testing with
RHEL 5 as the baseline.

And yeah - I believe the code may well have
been wrong ever since it was merged. The
indentation just looked so pretty that no one
spotted the bug.

> Acked-by: Johannes Weiner<[email protected]>
>
>
Thank you.

2009-11-26 02:50:17

by KOSAKI Motohiro

Subject: Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan

> In AIM7 runs, recent kernels start swapping out anonymous pages
> well before they should. This is due to shrink_list falling
> through to shrink_inactive_list if !inactive_anon_is_low(zone, sc),
> when all we really wanted to do is pre-age some anonymous pages to
> give them extra time to be referenced while on the inactive list.
>
> The obvious fix is to make sure that shrink_list does not fall
> through to scanning/reclaiming inactive pages when we called it
> to scan one of the active lists.
>
> This change should be safe because the loop in shrink_zone ensures
> that we will still shrink the anon and file inactive lists whenever
> we should.

Good catch!


>
> Reported-by: Larry Woodman <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 777af57..ec4dfda 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1469,13 +1469,15 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> {
> int file = is_file_lru(lru);
>
> - if (lru == LRU_ACTIVE_FILE && inactive_file_is_low(zone, sc)) {
> - shrink_active_list(nr_to_scan, zone, sc, priority, file);
> + if (lru == LRU_ACTIVE_FILE) {
> + if (inactive_file_is_low(zone, sc))
> + shrink_active_list(nr_to_scan, zone, sc, priority, file);
> return 0;
> }
>
> - if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
> - shrink_active_list(nr_to_scan, zone, sc, priority, file);
> + if (lru == LRU_ACTIVE_ANON) {
> + if (inactive_file_is_low(zone, sc))

This inactive_file_is_low() should be inactive_anon_is_low().
Cut-and-paste programming often leads to mistakes like this. Probably
we need to clean up this function some more.

How about this? (This is an incremental patch on top of yours.)


Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 17 ++++++++++-------
1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a8f61c0..80e94a2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1467,22 +1467,25 @@ static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
return low;
}

+static int inactive_list_is_low(struct zone *zone, struct scan_control *sc, int file)
+{
+ if (file)
+ return inactive_file_is_low(zone, sc);
+ else
+ return inactive_anon_is_low(zone, sc);
+}
+
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc, int priority)
{
int file = is_file_lru(lru);

- if (lru == LRU_ACTIVE_FILE) {
- if (inactive_file_is_low(zone, sc))
+ if (is_active_lru(lru)) {
+ if (inactive_list_is_low(zone, sc, file))
shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}

- if (lru == LRU_ACTIVE_ANON) {
- if (inactive_file_is_low(zone, sc))
- shrink_active_list(nr_to_scan, zone, sc, priority, file);
- return 0;
- }
return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
}

--
1.6.5.2





2009-11-26 02:58:31

by Rik van Riel

Subject: Re: [PATCH] vmscan: do not evict inactive pages when skipping an active list scan

On 11/25/2009 09:50 PM, KOSAKI Motohiro wrote:
>
>> - if (lru == LRU_ACTIVE_ANON&& inactive_anon_is_low(zone, sc)) {
>> - shrink_active_list(nr_to_scan, zone, sc, priority, file);
>> + if (lru == LRU_ACTIVE_ANON) {
>> + if (inactive_file_is_low(zone, sc))
>>
> This inactive_file_is_low() should be inactive_anon_is_low().
> cut-n-paste programming often makes similar mistake. probaby we need make
> more cleanup to this function.
>
> How about this? (this is incremental patch from you)
>
>
>
Doh! Nice catch...
> Signed-off-by: KOSAKI Motohiro<[email protected]>
>
>
Reviewed-by: Rik van Riel <[email protected]>

2009-11-30 21:57:27

by Larry Woodman

Subject: [RFC] high system time & lock contention running large mixed workload


While running workloads that do lots of forking processes, exiting
processes and page reclamation(AIM 7) on large systems very high system
time(100%) and lots of lock contention was observed.



CPU5:
[<ffffffff814afb48>] ? _spin_lock+0x27/0x48
[<ffffffff81101deb>] ? anon_vma_link+0x2a/0x5a
[<ffffffff8105d3d8>] ? dup_mm+0x242/0x40c
[<ffffffff8105e0a9>] ? copy_process+0xab1/0x12be
[<ffffffff8105ea07>] ? do_fork+0x151/0x330
[<ffffffff81058407>] ? default_wake_function+0x0/0x36
[<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
[<ffffffff810121d3>] ? stub_clone+0x13/0x20
[<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b

CPU4:
[<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
[<ffffffff81103062>] ? anon_vma_unlink+0x2a/0x84
[<ffffffff810fbab7>] ? free_pgtables+0x3c/0xe1
[<ffffffff810fd8b1>] ? exit_mmap+0xc5/0x110
[<ffffffff8105ce4c>] ? mmput+0x55/0xd9
[<ffffffff81061afd>] ? exit_mm+0x109/0x129
[<ffffffff81063846>] ? do_exit+0x1d7/0x712
[<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
[<ffffffff81063e07>] ? do_group_exit+0x86/0xb2
[<ffffffff81063e55>] ? sys_exit_group+0x22/0x3e
[<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b

CPU0:
[<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
[<ffffffff81101ad1>] ? page_check_address+0x9e/0x16f
[<ffffffff81101cb8>] ? page_referenced_one+0x53/0x10b
[<ffffffff81102f5a>] ? page_referenced+0xcd/0x167
[<ffffffff810eb32d>] ? shrink_active_list+0x1ed/0x2a3
[<ffffffff810ebde9>] ? shrink_zone+0xa06/0xa38
[<ffffffff8108440a>] ? getnstimeofday+0x64/0xce
[<ffffffff810ecaf9>] ? do_try_to_free_pages+0x1e5/0x362
[<ffffffff810ecd9f>] ? try_to_free_pages+0x7a/0x94
[<ffffffff810ea66f>] ? isolate_pages_global+0x0/0x242
[<ffffffff810e57b9>] ? __alloc_pages_nodemask+0x397/0x572
[<ffffffff810e3c1e>] ? __get_free_pages+0x19/0x6e
[<ffffffff8105d6c9>] ? copy_process+0xd1/0x12be
[<ffffffff81204eb2>] ? avc_has_perm+0x5c/0x84
[<ffffffff81130db8>] ? user_path_at+0x65/0xa3
[<ffffffff8105ea07>] ? do_fork+0x151/0x330
[<ffffffff810b7935>] ? check_for_new_grace_period+0x78/0xab
[<ffffffff810121d3>] ? stub_clone+0x13/0x20
[<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b


------------------------------------------------------------------------------
PerfTop: 864 irqs/sec kernel:99.7% [100000 cycles], (all, 8 CPUs)
------------------------------------------------------------------------------

samples pcnt RIP kernel function
______ _______ _____ ________________ _______________

3235.00 - 75.1% - ffffffff814afb21 : _spin_lock
670.00 - 15.6% - ffffffff81101a33 : page_check_address
165.00 - 3.8% - ffffffffa01cbc39 : rpc_sleep_on [sunrpc]
40.00 - 0.9% - ffffffff81102113 : try_to_unmap_one
29.00 - 0.7% - ffffffff81101c65 : page_referenced_one
27.00 - 0.6% - ffffffff81101964 : vma_address
8.00 - 0.2% - ffffffff8125a5a0 : clear_page_c
6.00 - 0.1% - ffffffff8125a5f0 : copy_page_c
6.00 - 0.1% - ffffffff811023ca : try_to_unmap_anon
5.00 - 0.1% - ffffffff810fb014 : copy_page_range
5.00 - 0.1% - ffffffff810e4d18 : get_page_from_freelist



The cause was determined to be the unconditional call to
page_referenced() for every mapped page encountered in
shrink_active_list(). page_referenced() takes the anon_vma->lock and
calls page_referenced_one() for each vma. page_referenced_one() then
calls page_check_address() which takes the pte_lockptr spinlock. If
several CPUs are doing this at the same time there is a lot of
pte_lockptr spinlock contention with the anon_vma->lock held. This
causes contention on the anon_vma->lock, stalls, and very high system
time.

Before the splitLRU patch shrink_active_list() would only call
page_referenced() when reclaim_mapped got set. reclaim_mapped only got
set when the priority worked its way from 12 all the way to 7. This
prevented page_referenced() from being called from shrink_active_list()
until the system was really struggling to reclaim memory.

One way to prevent this is to change page_check_address() to do a
spin_trylock(ptl) when called from shrink_active_list() and simply
fail if it cannot get the pte_lockptr spinlock. This will make
shrink_active_list() consider the page not referenced and allow the
anon_vma->lock to be dropped much more quickly.

The attached patch does just that, thoughts???
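
Since the attachment isn't inlined here, a rough sketch of the locking
change (the pte_lock_or_bail() helper name and the "noblock" flag are
illustrative, not taken from the actual patch):

	/*
	 * Sketch only: on the reclaim path, try the pte lock instead of
	 * spinning on it while the anon_vma->lock is held.
	 */
	static spinlock_t *pte_lock_or_bail(struct mm_struct *mm, pmd_t *pmd,
					    int noblock)
	{
		spinlock_t *ptl = pte_lockptr(mm, pmd);

		if (!noblock) {
			spin_lock(ptl);
			return ptl;
		}
		if (!spin_trylock(ptl))
			return NULL;	/* caller treats page as not referenced */
		return ptl;
	}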




Attachments:
pageout.diff (3.34 kB)

2009-12-01 10:04:45

by Andrea Arcangeli

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On Mon, Nov 30, 2009 at 05:00:29PM -0500, Larry Woodman wrote:
> Before the splitLRU patch shrink_active_list() would only call
> page_referenced() when reclaim_mapped got set. reclaim_mapped only got
> set when the priority worked its way from 12 all the way to 7. This
> prevented page_referenced() from being called from shrink_active_list()
> until the system was really struggling to reclaim memory.

page_referenced should never be called and nobody should touch ptes
until priority went down to 7. This is a regression in splitLRU that
should be fixed. With light VM pressure we should never touch ptes ever.

> On way to prevent this is to change page_check_address() to execute a
> spin_trylock(ptl) when it was called by shrink_active_list() and simply
> fail if it could not get the pte_lockptr spinlock. This will make
> shrink_active_list() consider the page not referenced and allow the
> anon_vma->lock to be dropped much quicker.
>
> The attached patch does just that, thoughts???

Just stop calling page_referenced there...

Even if we ignore the above, one problem with skipping over the PT
lock is that we assume the page is not referenced when it actually
is, so it won't be activated again when page_referenced is called
again to move the page back to the active list... Not the end of the
world to lose a young bit sometimes, though.

There may be all reasons in the world why we have to mess with ptes
when there's light VM pressure, for whatever terabyte machine or
whatever workload that performs better that way, but I know in 100% of
my systems I don't ever want the VM to touch ptes when there's light
VM pressure, no matter what. So if you want the default to be messing
with ptes, just give me a sysctl knob to let me run faster.

2009-12-01 12:23:21

by KOSAKI Motohiro

Subject: Re: [RFC] high system time & lock contention running large mixed workload

(cc to some related person)

> The cause was determined to be the unconditional call to
> page_referenced() for every mapped page encountered in
> shrink_active_list(). page_referenced() takes the anon_vma->lock and
> calls page_referenced_one() for each vma. page_referenced_one() then
> calls page_check_address() which takes the pte_lockptr spinlock. If
> several CPUs are doing this at the same time there is a lot of
> pte_lockptr spinlock contention with the anon_vma->lock held. This
> causes contention on the anon_vma->lock, stalling in the fo and very
> high system time.
>
> Before the splitLRU patch shrink_active_list() would only call
> page_referenced() when reclaim_mapped got set. reclaim_mapped only got
> set when the priority worked its way from 12 all the way to 7. This
> prevented page_referenced() from being called from shrink_active_list()
> until the system was really struggling to reclaim memory.
>
> On way to prevent this is to change page_check_address() to execute a
> spin_trylock(ptl) when it was called by shrink_active_list() and simply
> fail if it could not get the pte_lockptr spinlock. This will make
> shrink_active_list() consider the page not referenced and allow the
> anon_vma->lock to be dropped much quicker.
>
> The attached patch does just that, thoughts???

At first look,

- We certainly have to fix this issue.
- But your patch is a bit risky.

Your patch treats trylock(pte-lock) failure as not accessed. But
generally lock contention implies a contention peer; IOW, the page
typically does have its reference bit set. Then the next
shrink_inactive_list() moves it back to the active list. That's a
suboptimal result.

However, we can't simply treat lock contention as page-is-referenced
either. If we did, the system could easily go into OOM.

So, is

if (priority < DEF_PRIORITY - 2)
	page_referenced()
else
	page_referenced_trylock()

better?
On a typical workload, vmscan almost always runs at DEF_PRIORITY. So
if the priority==DEF_PRIORITY case doesn't cause heavy lock
contention, the system doesn't need to worry about the contention;
anyway, we can't avoid contention when the system is under heavy
memory pressure.
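
For illustration, the gate could be wired through the
shrink_active_list() call site roughly like this (a sketch; the extra
"noblock" argument to page_referenced() is hypothetical):

	/* Sketch only: the "noblock" argument does not exist upstream. */
	int noblock = (priority >= DEF_PRIORITY - 2);

	if (page_mapping_inuse(page) &&
	    page_referenced(page, 0, sc->mem_cgroup, &vm_flags, noblock)) {
		nr_rotated++;
		/* existing referenced-page handling follows */
	}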

BTW, the current shrink_active_list() has an unnecessary
page_mapping_inuse() call. It prevents dropping the page reference
bit from unmapped cache pages, which means we protect unmapped cache
pages more than mapped pages. That is strange.

Unfortunately, I don't have enough development time today. I'll work
on it tomorrow.



2009-12-01 12:31:06

by KOSAKI Motohiro

Subject: Re: [RFC] high system time & lock contention running large mixed workload

> On Mon, Nov 30, 2009 at 05:00:29PM -0500, Larry Woodman wrote:
> > Before the splitLRU patch shrink_active_list() would only call
> > page_referenced() when reclaim_mapped got set. reclaim_mapped only got
> > set when the priority worked its way from 12 all the way to 7. This
> > prevented page_referenced() from being called from shrink_active_list()
> > until the system was really struggling to reclaim memory.
>
> page_referenced should never be called and nobody should touch ptes
> until priority went down to 7. This is a regression in splitLRU that
> should be fixed. With light VM pressure we should never touch ptes ever.

Ummm. I can't agree with this. 7 is too low a priority; if a large
system gets down to prio==7, it already has unacceptably big latency
trouble. If it were only prio==DEF_PRIORITY or so, I could probably
agree with you.


> > On way to prevent this is to change page_check_address() to execute a
> > spin_trylock(ptl) when it was called by shrink_active_list() and simply
> > fail if it could not get the pte_lockptr spinlock. This will make
> > shrink_active_list() consider the page not referenced and allow the
> > anon_vma->lock to be dropped much quicker.
> >
> > The attached patch does just that, thoughts???
>
> Just stop calling page_referenced there...
>
> Even if we ignore the above, one problem later in skipping over the PT
> lock, is also to assume the page is not referenced when it actually
> is, so it won't be activated again when page_referenced is called
> again to move the page back in the active list... Not the end of the
> world to lose a young bit sometime though.
>
> There may be all reasons in the world why we have to mess with ptes
> when there's light VM pressure, for whatever terabyte machine or
> whatever workload that performs better that way, but I know in 100% of
> my systems I don't ever want the VM to touch ptes when there's light
> VM pressure, no matter what. So if you want the default to be messing
> with ptes, just give me a sysctl knob to let me run faster.

Um.
Avoiding lock contention under light VM pressure is more important
than strict LRU order. I guess we don't need a knob.

2009-12-01 12:46:28

by Andrea Arcangeli

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On Tue, Dec 01, 2009 at 09:31:09PM +0900, KOSAKI Motohiro wrote:
> Ummm. I can't agree this. 7 is too small priority. if large system have prio==7,
> the system have unacceptable big latency trouble.
> if only prio==DEF_PRIOTIRY or something, I can agree you probably.

I took the number 7 purely from Larry's mention of the old code, but
I don't mind what the actual breakpoint level is where we start to
send the IPI flood to destroy all userland TLBs mapping the page so
the young bit can be set by the CPU on the old pte. If you agree with
me that at the lowest priority we shouldn't flood IPIs and destroy
TLBs when there's plenty of clean unmapped cache, we already
agree ;). Whether that's 7 or DEF_PRIORITY-1 is ok. All I care about
is that it escalates gradually: first reclaim clean cache and
re-activate mapped pages, then when we're low on clean cache start to
check ptes and unmap whatever is not found referenced.

> Avoiding lock contention on light VM pressure is important than
> strict lru order. I guess we don't need knob.

Hope so indeed. It's not just the lock contention, which is
exacerbated by certain workloads; even in the total absence of any
lock contention I generally dislike the CPU waste of the pte loop to
clear the young bit, and the interruption of userland when it
receives a TLB flush for no good reason, because 99% of the time
plenty of unmapped clean cache is available. I know this performs
best, even if there will always be someone who wants mapped and
unmapped cache to be treated totally equally in LRU terms (which then
makes me wonder why there are still & VM_EXEC magics in vmscan.c if
all pages shall be treated equally in the LRU... especially given
VM_EXEC is often meaningless [because potentially randomly set]
unlike page_mapcount [which is never randomly set]), which is the
reason I mentioned the knob.

2009-12-01 16:37:50

by Larry Woodman

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On Tue, 2009-12-01 at 21:23 +0900, KOSAKI Motohiro wrote:
> (cc to some related person)

>
> At first look,
>
> - We have to fix this issue certenally.
> - But your patch is a bit risky.
>
> Your patch treat trylock(pte-lock) failure as no accessced. but
> generally lock contention imply to have contention peer. iow, the page
> have reference bit typically. then, next shrink_inactive_list() move it
> active list again. that's suboptimal result.
>
> However, we can't treat lock-contention as page-is-referenced simply. if it does,
> the system easily go into OOM.
>
> So,
> if (priority < DEF_PRIORITY - 2)
> page_referenced()
> else
> page_refenced_trylock()
>
> is better?
> On typical workload, almost vmscan only use DEF_PRIORITY. then,
> if priority==DEF_PRIORITY situation don't cause heavy lock contention,
> the system don't need to mind the contention. anyway we can't avoid
> contention if the system have heavy memory pressure.
>


Agreed. The attached updated patch only does a trylock in the
page_referenced() call from shrink_inactive_list(), and only for
anonymous pages when the priority is 10, 11 or 12 (i.e. >=
DEF_PRIORITY-2). I have never seen a problem like this with active
pagecache pages, and it does not alter the existing shrink_page_list
behavior. What do you think about this???





Attachments:
pageout.diff (6.64 kB)

2009-12-02 01:56:00

by Rik van Riel

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On 11/30/2009 05:00 PM, Larry Woodman wrote:
> While running workloads that do lots of forking processes, exiting
> processes and page reclamation(AIM 7) on large systems very high system
> time(100%) and lots of lock contention was observed.
>
>
>
> CPU5:
> [<ffffffff814afb48>] ? _spin_lock+0x27/0x48
> [<ffffffff81101deb>] ? anon_vma_link+0x2a/0x5a
> [<ffffffff8105d3d8>] ? dup_mm+0x242/0x40c
> [<ffffffff8105e0a9>] ? copy_process+0xab1/0x12be
> [<ffffffff8105ea07>] ? do_fork+0x151/0x330
> [<ffffffff81058407>] ? default_wake_function+0x0/0x36
> [<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
> [<ffffffff810121d3>] ? stub_clone+0x13/0x20
> [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
>
> CPU4:
> [<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
> [<ffffffff81103062>] ? anon_vma_unlink+0x2a/0x84
> [<ffffffff810fbab7>] ? free_pgtables+0x3c/0xe1
> [<ffffffff810fd8b1>] ? exit_mmap+0xc5/0x110
> [<ffffffff8105ce4c>] ? mmput+0x55/0xd9
> [<ffffffff81061afd>] ? exit_mm+0x109/0x129
> [<ffffffff81063846>] ? do_exit+0x1d7/0x712
> [<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
> [<ffffffff81063e07>] ? do_group_exit+0x86/0xb2
> [<ffffffff81063e55>] ? sys_exit_group+0x22/0x3e
> [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
>
> CPU0:
> [<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
> [<ffffffff81101ad1>] ? page_check_address+0x9e/0x16f
> [<ffffffff81101cb8>] ? page_referenced_one+0x53/0x10b
> [<ffffffff81102f5a>] ? page_referenced+0xcd/0x167
> [<ffffffff810eb32d>] ? shrink_active_list+0x1ed/0x2a3
> [<ffffffff810ebde9>] ? shrink_zone+0xa06/0xa38
> [<ffffffff8108440a>] ? getnstimeofday+0x64/0xce
> [<ffffffff810ecaf9>] ? do_try_to_free_pages+0x1e5/0x362
> [<ffffffff810ecd9f>] ? try_to_free_pages+0x7a/0x94
> [<ffffffff810ea66f>] ? isolate_pages_global+0x0/0x242
> [<ffffffff810e57b9>] ? __alloc_pages_nodemask+0x397/0x572
> [<ffffffff810e3c1e>] ? __get_free_pages+0x19/0x6e
> [<ffffffff8105d6c9>] ? copy_process+0xd1/0x12be
> [<ffffffff81204eb2>] ? avc_has_perm+0x5c/0x84
> [<ffffffff81130db8>] ? user_path_at+0x65/0xa3
> [<ffffffff8105ea07>] ? do_fork+0x151/0x330
> [<ffffffff810b7935>] ? check_for_new_grace_period+0x78/0xab
> [<ffffffff810121d3>] ? stub_clone+0x13/0x20
> [<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
>
>
> ------------------------------------------------------------------------------
> PerfTop: 864 irqs/sec kernel:99.7% [100000 cycles], (all, 8
> CPUs)
> ------------------------------------------------------------------------------
>
> samples pcnt RIP kernel function
> ______ _______ _____ ________________ _______________
>
> 3235.00 - 75.1% - ffffffff814afb21 : _spin_lock
> 670.00 - 15.6% - ffffffff81101a33 : page_check_address
> 165.00 - 3.8% - ffffffffa01cbc39 : rpc_sleep_on [sunrpc]
> 40.00 - 0.9% - ffffffff81102113 : try_to_unmap_one
> 29.00 - 0.7% - ffffffff81101c65 : page_referenced_one
> 27.00 - 0.6% - ffffffff81101964 : vma_address
> 8.00 - 0.2% - ffffffff8125a5a0 : clear_page_c
> 6.00 - 0.1% - ffffffff8125a5f0 : copy_page_c
> 6.00 - 0.1% - ffffffff811023ca : try_to_unmap_anon
> 5.00 - 0.1% - ffffffff810fb014 : copy_page_range
> 5.00 - 0.1% - ffffffff810e4d18 : get_page_from_freelist
>
>
>
> The cause was determined to be the unconditional call to
> page_referenced() for every mapped page encountered in
> shrink_active_list(). page_referenced() takes the anon_vma->lock and
> calls page_referenced_one() for each vma. page_referenced_one() then
> calls page_check_address() which takes the pte_lockptr spinlock. If
> several CPUs are doing this at the same time there is a lot of
> pte_lockptr spinlock contention with the anon_vma->lock held. This
> causes contention on the anon_vma->lock, stalling in the fo and very
> high system time.
>
> Before the splitLRU patch shrink_active_list() would only call
> page_referenced() when reclaim_mapped got set. reclaim_mapped only got
> set when the priority worked its way from 12 all the way to 7. This
> prevented page_referenced() from being called from shrink_active_list()
> until the system was really struggling to reclaim memory.
>
> On way to prevent this is to change page_check_address() to execute a
> spin_trylock(ptl) when it was called by shrink_active_list() and simply
> fail if it could not get the pte_lockptr spinlock. This will make
> shrink_active_list() consider the page not referenced and allow the
> anon_vma->lock to be dropped much quicker.
>
> The attached patch does just that, thoughts???
>
My first thought is that you haven't read the code
you are trying to patch.

The purpose of calling page_referenced on anonymous
pages from shrink_active_list is to CLEAR the referenced
bit, not to test it!

Not clearing the referenced bit will break page replacement,
because pages that were not recently referenced will appear
to be, causing them to get another round on the active list,
which in turn could increase the list movement...

2009-12-02 02:00:53

by Rik van Riel

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On 12/01/2009 05:04 AM, Andrea Arcangeli wrote:
> On Mon, Nov 30, 2009 at 05:00:29PM -0500, Larry Woodman wrote:
>
>> Before the splitLRU patch shrink_active_list() would only call
>> page_referenced() when reclaim_mapped got set. reclaim_mapped only got
>> set when the priority worked its way from 12 all the way to 7. This
>> prevented page_referenced() from being called from shrink_active_list()
>> until the system was really struggling to reclaim memory.
>>
> page_referenced should never be called and nobody should touch ptes
> until priority went down to 7. This is a regression in splitLRU that
> should be fixed. With light VM pressure we should never touch ptes ever.
>
You appear to have not read the code, either.

The VM should not look at the active anon list much,
unless it has a good reason to start evicting anonymous
pages. Yes, there was a bug in shrink_list(), but Kosaki
and I just posted patches to fix that.

As for page_referenced not being called until priority
goes down to 7 - that is one of the root causes the old
VM did not scale. The number of pages that need to
be scanned to get down to that point is staggeringly
huge on systems with 1TB of RAM - a much larger
number than we should EVER scan in the pageout code.

There is no way we could go back to that heuristic.
It fell apart before and it would continue to fall apart
if we reintroduced it.

2009-12-02 02:02:37

by KOSAKI Motohiro

Subject: Re: [RFC] high system time & lock contention running large mixed workload

Hi

> > Avoiding lock contention on light VM pressure is important than
> > strict lru order. I guess we don't need knob.
>
> Hope so indeed. It's not just lock contention, that is exacerbated by
> certain workloads, but even in total absence of any lock contention I
> generally dislike the cpu waste itself of the pte loop to clear the
> young bit, and the interruption of userland as well when it receives a
> tlb flush for no good reason because 99% of the time plenty of
> unmapped clean cache is available. I know this performs best, even if
> there will be always someone that will want mapped and unmapped cache
> to be threat totally equal in lru terms (which then make me wonder why
> there are still & VM_EXEC magics in vmscan.c if all pages shall be
> threaded equal in the lru... especially given VM_EXEC is often
> meaningless [because potentially randomly set] unlike page_mapcount
> [which is never randomly set]), which is the reason I mentioned the
> knob.

Umm?? I'm puzzled. If almost all pages on the LRU are unmapped file
cache, the pte walk is not costly. Conversely, if almost all pages on
the LRU are mapped, we have to do the pte walk, otherwise no pages
get deactivated and the system runs into big latency trouble.

I don't want vmscan to focus on peak speed and ignore the worst case;
that isn't proper behavior in a memory shortage situation. So I'd
rather focus on solving the lock contention issue.

Of course, if this causes any trouble for KVM or other real-world
usage, I'll fix it. If you have any trouble with the current VM,
please tell us.

[I (and Hugh at least) dislike the VM_EXEC logic too, but that seems
off topic.]


2009-12-02 02:05:45

by Rik van Riel

Subject: Re: [RFC] high system time & lock contention running large mixed workload

Andrea Arcangeli wrote:
> I taken number 7 purely as mentioned by Larry about old code, but I
> don't mind what is the actual breakpoint level where we start to send
> the ipi flood to destroy all userland tlbs mapping the page so the
> young bit can be set by the cpu on the old pte. If you agree with me
> at the lowest priority we shouldn't flood ipi and destroy tlb when
> there's plenty of clean unmapped clean cache, we already agree ;). If
> that's 7 or DEV_PRIORITY-1, that's ok. All I care is that it escalates
> gradually, first clean cache and re-activate mapped pages, then when
> we're low on clean cache we start to check ptes and unmap whatever is
> not found referenced.
>
>
The code already does what you propose.

It takes a heavy AIM7 run for Larry to run into the
lock contention issue. I suspect that the page cache
was already very small by the time the lock contention
issue was triggered.

Larry, do you have any more info on the state of the
VM when you see the lock contention?

Also, do you have the latest patches to shrink_list()
by Kosaki and me applied?

2009-12-02 02:21:07

by Rik van Riel

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On 12/01/2009 11:41 AM, Larry Woodman wrote:
>
> Agreed. The attached updated patch only does a trylock in the
> page_referenced() call from shrink_inactive_list() and only for
> anonymous pages when the priority is either 10, 11 or
> 12(DEF_PRIORITY-2). I have never seen a problem like this with active
> pagecache pages and it does not alter the existing shrink_page_list
> behavior. What do you think about this???
This is reasonable, except for the fact that pages that are moved
to the inactive list without having the referenced bit cleared are
guaranteed to be moved back to the active list.

You'll be better off without that excess list movement, by simply
moving pages directly back onto the active list if the trylock
fails.

Yes, this means that page_referenced can now return 3 different
return values (not accessed, accessed, lock contended), which
should probably be an enum so we can test for the values
symbolically in the calling functions.

That way only pages where we did manage to clear the referenced bit
will be moved onto the inactive list. This not only reduces the
amount of excess list movement, it also makes sure that the pages
which do get onto the inactive list get a fair chance at being
referenced again, instead of potentially being flooded out by pages
where the trylock failed.

A minor nitpick: maybe it would be good to rename the "try" parameter
to "noblock". That more closely matches the requested behaviour.

2009-12-02 02:41:53

by KOSAKI Motohiro

Subject: Re: [RFC] high system time & lock contention running large mixed workload

> On 12/01/2009 11:41 AM, Larry Woodman wrote:
> >
> > Agreed. The attached updated patch only does a trylock in the
> > page_referenced() call from shrink_inactive_list() and only for
> > anonymous pages when the priority is either 10, 11 or
> > 12(DEF_PRIORITY-2). I have never seen a problem like this with active
> > pagecache pages and it does not alter the existing shrink_page_list
> > behavior. What do you think about this???
> This is reasonable, except for the fact that pages that are moved
> to the inactive list without having the referenced bit cleared are
> guaranteed to be moved back to the active list.
>
> You'll be better off without that excess list movement, by simply
> moving pages directly back onto the active list if the trylock
> fails.
>
> Yes, this means that page_referenced can now return 3 different
> return values (not accessed, accessed, lock contended), which
> should probably be an enum so we can test for the values
> symbolically in the calling functions.
>
> That way only pages where we did manage to clear the referenced bit
> will be moved onto the inactive list. This not only reduces the
> amount of excess list movement, it also makes sure that the pages
> which do get onto the inactive list get a fair chance at being
> referenced again, instead of potentially being flooded out by pages
> where the trylock failed.

Agreed.


> A minor nitpick: maybe it would be good to rename the "try" parameter
> to "noblock". That more closely matches the requested behaviour.

Another minor nit: probably we should rename page_referenced(). The
name implies testing the reference bit, but we use it to clear the
reference bit in shrink_active_list().

2009-12-02 02:55:25

by KOSAKI Motohiro

Subject: [PATCH] Clear reference bit although page isn't mapped.

> btw, current shrink_active_list() have unnecessary page_mapping_inuse() call.
> it prevent to drop page reference bit from unmapped cache page. it mean
> we protect unmapped cache page than mapped page. it is strange.

How about this?

---------------------------------
The SplitLRU VM replacement algorithm assumes shrink_active_list()
clears the page's reference bit, but an unnecessary
page_mapping_inuse() test prevents that.

This patch removes it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 00156f2..3e942b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1347,8 +1347,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
}

/* page_referenced clears PageReferenced */
- if (page_mapping_inuse(page) &&
- page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
+ if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
nr_rotated++;
/*
* Identify referenced, file-backed active pages and
--
1.6.5.2


2009-12-02 03:09:04

by Rik van Riel

Subject: Re: [PATCH] Clear reference bit although page isn't mapped.

On 12/01/2009 09:55 PM, KOSAKI Motohiro wrote:
>> btw, current shrink_active_list() have unnecessary page_mapping_inuse() call.
>> it prevent to drop page reference bit from unmapped cache page. it mean
>> we protect unmapped cache page than mapped page. it is strange.
>>
> How about this?
>
> ---------------------------------
> SplitLRU VM replacement algorithm assume shrink_active_list() clear
> the page's reference bit. but unnecessary page_mapping_inuse() test
> prevent it.
>
> This patch remove it.
>
Shrink_page_list ignores the referenced bit on pages
that are !page_mapping_inuse().

		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
		    referenced && page_mapping_inuse(page) &&
		    !(vm_flags & VM_LOCKED))
			goto activate_locked;

The reason we leave the referenced bit on unmapped
pages is that we want the next reference to a deactivated
page cache page to move that page back to the active
list. We do not want to require that such a page gets
accessed twice before being reactivated while on the
inactive list, because (1) we know it was a frequently
accessed page already and (2) ongoing streaming IO
might evict it from the inactive list before it gets accessed
twice.

Arguably, we should just replace the page_mapping_inuse()
in both places with page_mapped() to simplify things.

2009-12-02 03:28:24

by KOSAKI Motohiro

Subject: [PATCH] Replace page_mapping_inuse() with page_mapped()

> On 12/01/2009 09:55 PM, KOSAKI Motohiro wrote:
> >> btw, current shrink_active_list() have unnecessary page_mapping_inuse() call.
> >> it prevent to drop page reference bit from unmapped cache page. it mean
> >> we protect unmapped cache page than mapped page. it is strange.
> >>
> > How about this?
> >
> > ---------------------------------
> > SplitLRU VM replacement algorithm assume shrink_active_list() clear
> > the page's reference bit. but unnecessary page_mapping_inuse() test
> > prevent it.
> >
> > This patch remove it.
> >
> Shrink_page_list ignores the referenced bit on pages
> that are !page_mapping_inuse().
>
> if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> referenced &&
> page_mapping_inuse(page)
> && !(vm_flags & VM_LOCKED))
> goto activate_locked;
>
> The reason we leave the referenced bit on unmapped
> pages is that we want the next reference to a deactivated
> page cache page to move that page back to the active
> list. We do not want to require that such a page gets
> accessed twice before being reactivated while on the
> inactive list, because (1) we know it was a frequently
> accessed page already and (2) ongoing streaming IO
> might evict it from the inactive list before it gets accessed
> twice.
>
> Arguably, we should just replace the page_mapping_inuse()
> in both places with page_mapped() to simplify things.

Ah, yes. /me was asleep. Thanks for correcting me.


>From 61340720e6e66b645db8d5410e89fd3b67eda907 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Wed, 2 Dec 2009 12:05:26 +0900
Subject: [PATCH] Replace page_mapping_inuse() with page_mapped()

Page reclaim logic needs to distinguish mapped and unmapped pages.
However, page_mapping_inuse() doesn't provide a proper test: it tests
whether the address space (i.e. the file) is mmap()ed. Why should
`page' reclaim care about the mapped state of unrelated pages? It's
unrelated.

Thus, this patch replaces page_mapping_inuse() with page_mapped().

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 25 ++-----------------------
1 files changed, 2 insertions(+), 23 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 00156f2..350b9cc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -277,27 +277,6 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
return ret;
}

-/* Called without lock on whether page is mapped, so answer is unstable */
-static inline int page_mapping_inuse(struct page *page)
-{
- struct address_space *mapping;
-
- /* Page is in somebody's page tables. */
- if (page_mapped(page))
- return 1;
-
- /* Be more reluctant to reclaim swapcache than pagecache */
- if (PageSwapCache(page))
- return 1;
-
- mapping = page_mapping(page);
- if (!mapping)
- return 0;
-
- /* File is mmap'd by somebody? */
- return mapping_mapped(mapping);
-}
-
static inline int is_page_cache_freeable(struct page *page)
{
/*
@@ -663,7 +642,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* try_to_unmap moves it to unevictable list
*/
if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
- referenced && page_mapping_inuse(page)
+ referenced && page_mapped(page)
&& !(vm_flags & VM_LOCKED))
goto activate_locked;

@@ -1347,7 +1326,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
}

/* page_referenced clears PageReferenced */
- if (page_mapping_inuse(page) &&
+ if (page_mapped(page) &&
page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
nr_rotated++;
/*
--
1.6.5.2




2009-12-02 04:58:19

by Rik van Riel

Subject: Re: [PATCH] Replace page_mapping_inuse() with page_mapped()

On 12/01/2009 10:28 PM, KOSAKI Motohiro wrote:
>> On 12/01/2009 09:55 PM, KOSAKI Motohiro wrote:
>>
>>>> btw, current shrink_active_list() have unnecessary page_mapping_inuse() call.
>>>> it prevent to drop page reference bit from unmapped cache page. it mean
>>>> we protect unmapped cache page than mapped page. it is strange.
>>>>
>>>>
>>> How about this?
>>>
>>> ---------------------------------
>>> SplitLRU VM replacement algorithm assume shrink_active_list() clear
>>> the page's reference bit. but unnecessary page_mapping_inuse() test
>>> prevent it.
>>>
>>> This patch remove it.
>>>
>>>
>> Shrink_page_list ignores the referenced bit on pages
>> that are !page_mapping_inuse().
>>
>> if (sc->order<= PAGE_ALLOC_COSTLY_ORDER&&
>> referenced&&
>> page_mapping_inuse(page)
>> && !(vm_flags& VM_LOCKED))
>> goto activate_locked;
>>
>> The reason we leave the referenced bit on unmapped
>> pages is that we want the next reference to a deactivated
>> page cache page to move that page back to the active
>> list. We do not want to require that such a page gets
>> accessed twice before being reactivated while on the
>> inactive list, because (1) we know it was a frequently
>> accessed page already and (2) ongoing streaming IO
>> might evict it from the inactive list before it gets accessed
>> twice.
>>
>> Arguably, we should just replace the page_mapping_inuse()
>> in both places with page_mapped() to simplify things.
>>
> Ah, yes. /me was slept. thanks correct me.
>
>
> From 61340720e6e66b645db8d5410e89fd3b67eda907 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro<[email protected]>
> Date: Wed, 2 Dec 2009 12:05:26 +0900
> Subject: [PATCH] Replace page_mapping_inuse() with page_mapped()
>
> page reclaim logic need to distingish mapped and unmapped pages.
> However page_mapping_inuse() don't provide proper test way. it test
> the address space (i.e. file) is mmpad(). Why `page' reclaim need
> care unrelated page's mapped state? it's unrelated.
>
> Thus, This patch replace page_mapping_inuse() with page_mapped()
>
> Signed-off-by: KOSAKI Motohiro<[email protected]>
>
Reviewed-by: Rik van Riel <[email protected]>

2009-12-02 11:12:44

by Johannes Weiner

Subject: Re: [PATCH] Replace page_mapping_inuse() with page_mapped()

On Wed, Dec 02, 2009 at 12:28:26PM +0900, KOSAKI Motohiro wrote:

> From 61340720e6e66b645db8d5410e89fd3b67eda907 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Wed, 2 Dec 2009 12:05:26 +0900
> Subject: [PATCH] Replace page_mapping_inuse() with page_mapped()
>
> page reclaim logic need to distingish mapped and unmapped pages.
> However page_mapping_inuse() don't provide proper test way. it test
> the address space (i.e. file) is mmpad(). Why `page' reclaim need
> care unrelated page's mapped state? it's unrelated.
>
> Thus, This patch replace page_mapping_inuse() with page_mapped()
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Reviewed-by: Johannes Weiner <[email protected]>

2009-12-03 22:11:42

by Larry Woodman

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On Tue, 2009-12-01 at 21:20 -0500, Rik van Riel wrote:

> This is reasonable, except for the fact that pages that are moved
> to the inactive list without having the referenced bit cleared are
> guaranteed to be moved back to the active list.
>
> You'll be better off without that excess list movement, by simply
> moving pages directly back onto the active list if the trylock
> fails.
>


The attached patch addresses this issue by changing page_check_address()
to return -1 if the spin_trylock() fails and page_referenced_one() to
return 1 in that path so the page gets moved back to the active list.

Also, BTW, check this out: an 8-CPU/16GB system running AIM 7 Compute
has 196491 isolated_anon pages. This means that ~6140 processes are
somewhere down in try_to_free_pages(), since we only isolate 32 pages
at a time; this is out of 9000 processes...


---------------------------------------------------------------------
active_anon:2140361 inactive_anon:453356 isolated_anon:196491
active_file:3438 inactive_file:1100 isolated_file:0
unevictable:2802 dirty:153 writeback:0 unstable:0
free:578920 slab_reclaimable:49214 slab_unreclaimable:93268
mapped:1105 shmem:0 pagetables:139100 bounce:0

Node 0 Normal free:1647892kB min:12500kB low:15624kB high:18748kB
active_anon:7835452kB inactive_anon:785764kB active_file:13672kB
inactive_file:4352kB unevictable:11208kB isolated(anon):785964kB
isolated(file):0kB present:12410880kB mlocked:11208kB dirty:604kB
writeback:0kB mapped:4344kB shmem:0kB slab_reclaimable:177792kB
slab_unreclaimable:368676kB kernel_stack:73256kB pagetables:489972kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0

202895 total pagecache pages
197629 pages in swap cache
Swap cache stats: add 6954838, delete 6757209, find 1251447/2095005
Free swap = 65881196kB
Total swap = 67354616kB
3997696 pages RAM
207046 pages reserved
1688629 pages shared
3016248 pages non-shared


Attachments:
page_referenced.patch (7.23 kB)

2009-12-04 00:29:38

by Rik van Riel

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On 12/03/2009 05:14 PM, Larry Woodman wrote:

> The attached patch addresses this issue by changing page_check_address()
> to return -1 if the spin_trylock() fails and page_referenced_one() to
> return 1 in that path so the page gets moved back to the active list.

Your patch forgot to add the code to vmscan.c to actually move
the page back to the active list.

Also, please use an enum for the page_referenced return
values, so the code in vmscan.c can use symbolic names.

enum page_reference {
	NOT_REFERENCED,
	REFERENCED,
	LOCK_CONTENDED,
};
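
For illustration, the shrink_active_list() caller could then branch
on the symbolic values along these lines (a sketch assuming the usual
l_hold/l_active loop context, not an actual patch):

	switch (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
	case LOCK_CONTENDED:
		/* Young bit not cleared; keep the page on the active list. */
		list_add(&page->lru, &l_active);
		continue;
	case REFERENCED:
		nr_rotated++;
		/* existing VM_EXEC / deactivation logic decides placement */
		break;
	case NOT_REFERENCED:
		break;
	}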

--
All rights reversed.

2009-12-04 00:36:21

by KOSAKI Motohiro

Subject: Re: [RFC] high system time & lock contention running large mixed workload

> On Tue, 2009-12-01 at 21:20 -0500, Rik van Riel wrote:
>
> > This is reasonable, except for the fact that pages that are moved
> > to the inactive list without having the referenced bit cleared are
> > guaranteed to be moved back to the active list.
> >
> > You'll be better off without that excess list movement, by simply
> > moving pages directly back onto the active list if the trylock
> > fails.
> >
>
>
> The attached patch addresses this issue by changing page_check_address()
> to return -1 if the spin_trylock() fails and page_referenced_one() to
> return 1 in that path so the page gets moved back to the active list.
>
> Also, BTW, check this out: an 8-CPU/16GB system running AIM 7 Compute
> has 196491 isolated_anon pages. This means that ~6140 processes are
> somewhere down in try_to_free_pages() since we only isolate 32 pages at
> a time, this is out of 9000 processes...
>
>
> ---------------------------------------------------------------------
> active_anon:2140361 inactive_anon:453356 isolated_anon:196491
> active_file:3438 inactive_file:1100 isolated_file:0
> unevictable:2802 dirty:153 writeback:0 unstable:0
> free:578920 slab_reclaimable:49214 slab_unreclaimable:93268
> mapped:1105 shmem:0 pagetables:139100 bounce:0
>
> Node 0 Normal free:1647892kB min:12500kB low:15624kB high:18748kB
> active_anon:7835452kB inactive_anon:785764kB active_file:13672kB
> inactive_file:4352kB unevictable:11208kB isolated(anon):785964kB
> isolated(file):0kB present:12410880kB mlocked:11208kB dirty:604kB
> writeback:0kB mapped:4344kB shmem:0kB slab_reclaimable:177792kB
> slab_unreclaimable:368676kB kernel_stack:73256kB pagetables:489972kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
>
> 202895 total pagecache pages
> 197629 pages in swap cache
> Swap cache stats: add 6954838, delete 6757209, find 1251447/2095005
> Free swap = 65881196kB
> Total swap = 67354616kB
> 3997696 pages RAM
> 207046 pages reserved
> 1688629 pages shared
> 3016248 pages non-shared

This suggests we have to improve the reclaim bail-out logic. The
system already has 1.5GB of free pages; IOW, the system doesn't need
to swap out anymore.
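
For context, the kind of bail-out being discussed would sit in the
shrink_zone() scan loop; a rough sketch of the assumed form (not an
actual patch):

	/*
	 * Assumed form of a reclaim bail-out: once direct reclaim has
	 * freed a batch worth of pages at a non-default priority, stop
	 * scanning instead of piling on while pages are already free.
	 */
	if (nr_reclaimed >= sc->swap_cluster_max &&
	    priority < DEF_PRIORITY && !(current->flags & PF_KSWAPD))
		break;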



> @@ -352,9 +359,11 @@ static int page_referenced_one(struct page *page,
> if (address == -EFAULT)
> goto out;
>
> - pte = page_check_address(page, mm, address, &ptl, 0);
> + pte = page_check_address(page, mm, address, &ptl, 0, trylock);
> if (!pte)
> goto out;
> + else if (pte == (pte_t *)-1)
> + return 1;
>
> /*
> * Don't want to elevate referenced for mlocked page that gets this far,

Sorry, NAK.
I have to say the same thing Rik mentioned previously:
shrink_active_list() ignores the return value of page_referenced(),
so the above 'return 1' is meaningless.

Umm, ok, I'll make the patch myself.

2009-12-04 19:29:24

by Larry Woodman

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On Fri, 2009-12-04 at 09:36 +0900, KOSAKI Motohiro wrote:

> >
> > Also, BTW, check this out: an 8-CPU/16GB system running AIM 7 Compute
> > has 196491 isolated_anon pages. This means that ~6140 processes are
> > somewhere down in try_to_free_pages() since we only isolate 32 pages at
> > a time, this is out of 9000 processes...
> >
> >
> > ---------------------------------------------------------------------
> > active_anon:2140361 inactive_anon:453356 isolated_anon:196491
> > active_file:3438 inactive_file:1100 isolated_file:0
> > unevictable:2802 dirty:153 writeback:0 unstable:0
> > free:578920 slab_reclaimable:49214 slab_unreclaimable:93268
> > mapped:1105 shmem:0 pagetables:139100 bounce:0
> >
> > Node 0 Normal free:1647892kB min:12500kB low:15624kB high:18748kB
> > active_anon:7835452kB inactive_anon:785764kB active_file:13672kB
> > inactive_file:4352kB unevictable:11208kB isolated(anon):785964kB
> > isolated(file):0kB present:12410880kB mlocked:11208kB dirty:604kB
> > writeback:0kB mapped:4344kB shmem:0kB slab_reclaimable:177792kB
> > slab_unreclaimable:368676kB kernel_stack:73256kB pagetables:489972kB
> > unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
> >
> > 202895 total pagecache pages
> > 197629 pages in swap cache
> > Swap cache stats: add 6954838, delete 6757209, find 1251447/2095005
> > Free swap = 65881196kB
> > Total swap = 67354616kB
> > 3997696 pages RAM
> > 207046 pages reserved
> > 1688629 pages shared
> > 3016248 pages non-shared
>
> This seems we have to improve reclaim bale out logic. the system already
> have 1.5GB free pages. IOW, the system don't need swap-out anymore.
>
>

What's going on here is that there are about 7500 runnable processes
and memory is already low. A process runs, requests memory and
eventually ends up in try_to_free_pages(). The page reclaim code
calls cond_resched() in several places, so the scheduler eventually
puts that process on the run queue and runs another process, which
does the same thing. Eventually you end up with thousands of runnable
processes in the page reclaim code continuing to reclaim even though
there is plenty of free memory.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
   r   b    swpd    free   buff  cache   si   so   bi   bo    in    cs us sy id wa st
7480 421 1474396 2240060   7640  12988   23   27   24   40    14    40 12 45 43  1  0
7524 405 1474772 2224504   7644  13460  759  689  764  689  8932 11076  8 92  0  0  0
7239 401 1474164 2210572   7656  13524  596  592  596  596  8809 10494  9 91  0  0  0

BTW, this is easy to reproduce. Fork thousands of processes that
collectively overcommit memory and keep them allocating...






2009-12-04 21:24:51

by Larry Woodman

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On Thu, 2009-12-03 at 19:29 -0500, Rik van Riel wrote:
> On 12/03/2009 05:14 PM, Larry Woodman wrote:
>
> > The attached patch addresses this issue by changing page_check_address()
> > to return -1 if the spin_trylock() fails and page_referenced_one() to
> > return 1 in that path so the page gets moved back to the active list.
>
> Your patch forgot to add the code to vmscan.c to actually move
> the page back to the active list.

Right

>
> Also, please use an enum for the page_referenced return
> values, so the code in vmscan.c can use symbolic names.
>
> enum page_reference {
> NOT_REFERENCED,
> REFERENCED,
> LOCK_CONTENDED,
> };
>

Here it is:




Attachments:
page_referenced.patch (11.00 kB)

2009-12-06 21:04:52

by Rik van Riel

Subject: Re: [RFC] high system time & lock contention running large mixed workload

On 12/04/2009 04:26 PM, Larry Woodman wrote:

>
> Here it is:

Kosaki's patch series looks a lot more complete.

Does Kosaki's patch series resolve your issue?

--
All rights reversed.