2004-11-21 15:51:50

by Nikita Danilov

Subject: [PATCH]: 1/4 batch mark_page_accessed()

Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
page to be marked accessed is placed into per-cpu pagevec
(page_accessed_pvec). When pagevec is filled up, all pages are processed in a
batch.

This is supposed to decrease contention on zone->lru_lock.
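
For comparison, this mirrors the batching already done by lru_cache_add() in
mm/swap.c, which (roughly, from memory of the 2.6 sources, so treat this as a
sketch rather than the exact 2.6.10-rc2 code) looks like:

	static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };

	void fastcall lru_cache_add(struct page *page)
	{
		struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);

		/* pin the page while its pointer sits in the per-cpu pagevec */
		page_cache_get(page);
		/* pagevec full: push the whole batch onto the LRU,
		 * amortizing zone->lru_lock over the batch */
		if (!pagevec_add(pvec, page))
			__pagevec_lru_add(pvec);
		put_cpu_var(lru_add_pvecs);
	}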

(Patch is for 2.6.10-rc2)

Signed-off-by: Nikita Danilov <[email protected]>

mm/swap.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 40 insertions(+), 7 deletions(-)

diff -puN mm/swap.c~batch-mark_page_accessed mm/swap.c
--- bk-linux/mm/swap.c~batch-mark_page_accessed 2004-11-21 17:01:02.061618792 +0300
+++ bk-linux-nikita/mm/swap.c 2004-11-21 17:01:02.063618488 +0300
@@ -113,6 +113,39 @@ void fastcall activate_page(struct page
spin_unlock_irq(&zone->lru_lock);
}

+static void __pagevec_mark_accessed(struct pagevec *pvec)
+{
+ int i;
+ struct zone *zone = NULL;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+
+ if (pagezone != zone) {
+ if (zone)
+ local_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ local_lock_irq(&zone->lru_lock);
+ }
+ if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+ del_page_from_inactive_list(zone, page);
+ SetPageActive(page);
+ add_page_to_active_list(zone, page);
+ inc_page_state(pgactivate);
+ ClearPageReferenced(page);
+ } else if (!PageReferenced(page)) {
+ SetPageReferenced(page);
+ }
+ }
+ if (zone)
+ local_unlock_irq(&zone->lru_lock);
+ release_pages(pvec->pages, pvec->nr, pvec->cold);
+ pagevec_reinit(pvec);
+}
+
+static DEFINE_PER_CPU(struct pagevec, page_accessed_pvec) = { 0, };
+
/*
* Mark a page as having seen activity.
*
@@ -122,14 +155,14 @@ void fastcall activate_page(struct page
*/
void fastcall mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
- activate_page(page);
- ClearPageReferenced(page);
- } else if (!PageReferenced(page)) {
- SetPageReferenced(page);
- }
-}
+ struct pagevec *pvec;

+ pvec = &get_cpu_var(page_accessed_pvec);
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ __pagevec_mark_accessed(pvec);
+ put_cpu_var(page_accessed_pvec);
+}
EXPORT_SYMBOL(mark_page_accessed);

/**

_


2004-11-21 21:13:30

by Andrew Morton

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()

Nikita Danilov <[email protected]> wrote:
>
> Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
> page to be marked accessed is placed into per-cpu pagevec
> (page_accessed_pvec). When pagevec is filled up, all pages are processed in a
> batch.
>
> This is supposed to decrease contention on zone->lru_lock.

Looks sane, although it does add more atomic ops (the extra
get_page/put_page). Some benchmarks would be nice to have.

2004-11-24 15:34:58

by Marcelo Tosatti

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()

On Sun, Nov 21, 2004 at 01:12:50PM -0800, Andrew Morton wrote:
> Nikita Danilov <[email protected]> wrote:
> >
> > Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
> > page to be marked accessed is placed into per-cpu pagevec
> > (page_accessed_pvec). When pagevec is filled up, all pages are processed in a
> > batch.
> >
> > This is supposed to decrease contention on zone->lru_lock.
>
> Looks sane, although it does add more atomic ops (the extra
> get_page/put_page). Some benchmarks would be nice to have.

I'll run STP benchmarks as soon as STP is working again.

2004-11-24 22:46:17

by Nikita Danilov

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()

Marcelo Tosatti writes:
>
> Hi Nikita,

Hello, Marcelo,

>

[...]

> > + if (pagezone != zone) {
> > + if (zone)
> > + local_unlock_irq(&zone->lru_lock);
>
You surely meant spin_{un}lock_irq and not local_{un}lock_irq.

Oh, you are right. The local_lock_* functions manipulate the "local wait"
spin-lock variety introduced by another patch. The batch-mark_page_accessed
patch only worked because all references to the local_lock_* functions were
removed by the pvec-cleanup patch.

Another proof of the obvious fact that manually coded pagevec iteration
is evil. :)
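
For the record, the zone-switching part of the loop with the intended spinlock
calls would read (a minimal sketch, assuming the rest of
__pagevec_mark_accessed() stays as posted):

	if (pagezone != zone) {
		/* crossed into a different zone: drop the previous
		 * zone's LRU lock (if we hold one) and take the new one */
		if (zone)
			spin_unlock_irq(&zone->lru_lock);
		zone = pagezone;
		spin_lock_irq(&zone->lru_lock);
	}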

>
> Started the STP tests on 4way/8way boxes.

Great.

Nikita.

2004-11-24 22:47:18

by Marcelo Tosatti

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()


Hi Nikita,

On Sun, Nov 21, 2004 at 06:44:04PM +0300, Nikita Danilov wrote:
> Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
> page to be marked accessed is placed into per-cpu pagevec
> (page_accessed_pvec). When pagevec is filled up, all pages are processed in a
> batch.
>
> This is supposed to decrease contention on zone->lru_lock.
>
> (Patch is for 2.6.10-rc2)
>
> Signed-off-by: Nikita Danilov <[email protected]>
>
> mm/swap.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
> 1 files changed, 40 insertions(+), 7 deletions(-)
>
> diff -puN mm/swap.c~batch-mark_page_accessed mm/swap.c
> --- bk-linux/mm/swap.c~batch-mark_page_accessed 2004-11-21 17:01:02.061618792 +0300
> +++ bk-linux-nikita/mm/swap.c 2004-11-21 17:01:02.063618488 +0300
> @@ -113,6 +113,39 @@ void fastcall activate_page(struct page
> spin_unlock_irq(&zone->lru_lock);
> }
>
> +static void __pagevec_mark_accessed(struct pagevec *pvec)
> +{
> + int i;
> + struct zone *zone = NULL;
> +
> + for (i = 0; i < pagevec_count(pvec); i++) {
> + struct page *page = pvec->pages[i];
> + struct zone *pagezone = page_zone(page);
> +
> + if (pagezone != zone) {
> + if (zone)
> + local_unlock_irq(&zone->lru_lock);

You surely meant spin_{un}lock_irq and not local_{un}lock_irq.

Started the STP tests on 4way/8way boxes.

> + zone = pagezone;
> + local_lock_irq(&zone->lru_lock);
> + }
> + if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
> + del_page_from_inactive_list(zone, page);
> + SetPageActive(page);
> + add_page_to_active_list(zone, page);
> + inc_page_state(pgactivate);
> + ClearPageReferenced(page);
> + } else if (!PageReferenced(page)) {
> + SetPageReferenced(page);
> + }
> + }
> + if (zone)
> + local_unlock_irq(&zone->lru_lock);
> + release_pages(pvec->pages, pvec->nr, pvec->cold);
> + pagevec_reinit(pvec);
> +}
> +
> +static DEFINE_PER_CPU(struct pagevec, page_accessed_pvec) = { 0, };
> +


2004-11-27 00:15:46

by Marcelo Tosatti

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()

On Sun, Nov 21, 2004 at 06:44:04PM +0300, Nikita Danilov wrote:
> Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
> page to be marked accessed is placed into per-cpu pagevec
> (page_accessed_pvec). When pagevec is filled up, all pages are processed in a
> batch.
>
> This is supposed to decrease contention on zone->lru_lock.

Here are the STP 8way results:

8way:

reaim default (database IO intensive load), increases performance _significantly_:
------------------------------------------
kernel: patch-2.6.10-rc2
Peak load Test: Maximum Jobs per Minute 8491.96 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 8326.23 (average of 3 runs)

kernel: nikita-b2
Peak load Test: Maximum Jobs per Minute 9039.56 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 8325.09 (average of 3 runs)


reaim -w compute (compute intensive load), decreases performance:
-----------------------------------------
kernel: patch-2.6.10-rc2
Peak load Test: Maximum Jobs per Minute 9591.82 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 9359.76 (average of 3 runs)

kernel: nikita-b2
Peak load Test: Maximum Jobs per Minute 9533.34 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 9324.25 (average of 3 runs)

kernbench

Decreases performance significantly (more notably at -j4), probably due to
the additional atomic operations noted by Andrew:

kernel: nikita-b2 kernel: patch-2.6.10-rc2
Host: stp8-002 Host: stp8-003

Average Optimal -j 32 Load Run: Average Optimal -j 32 Load Run:
Elapsed Time 130 Elapsed Time 129.562
User Time 872.816 User Time 871.898
System Time 88.978 System Time 87.346
Percent CPU 739.2 Percent CPU 739.8
Context Switches 35111.4 Context Switches 34973.2
Sleeps 28182.6 Sleeps 28465.2

Average Maximal -j Load Run: Average Maximal -j Load Run:
Elapsed Time 128.862 Elapsed Time 128.334
User Time 868.234 User Time 867.702
System Time 86.888 System Time 85.318
Percent CPU 740.6 Percent CPU 742.2
Context Switches 27278.2 Context Switches 27210
Sleeps 19889 Sleeps 19898.4

Average Half Load -j 4 Run: Average Half Load -j 4 Run:
Elapsed Time 274.916 Elapsed Time 245.026
User Time 833.63 User Time 832.34
System Time 73.704 System Time 73.41
Percent CPU 335.8 Percent CPU 373.6
Context Switches 12984.8 Context Switches 13427.4
Sleeps 21459.2 Sleeps 21642


2004-11-27 00:42:27

by Nick Piggin

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()

Marcelo Tosatti wrote:
> On Sun, Nov 21, 2004 at 06:44:04PM +0300, Nikita Danilov wrote:
>
>>Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
>>page to be marked accessed is placed into per-cpu pagevec
>>(page_accessed_pvec). When pagevec is filled up, all pages are processed in a
>>batch.
>>
>>This is supposed to decrease contention on zone->lru_lock.
>
>
> Here are the STP 8way results:
>
> 8way:
>

...

> kernbench
>
> Decreases performance significantly (on -j4 more notably), probably due to
> the additional atomic operations as noted by Andrew:
>
> kernel: nikita-b2 kernel: patch-2.6.10-rc2
> Host: stp8-002 Host: stp8-003
>

...

>
> Average Half Load -j 4 Run: Average Half Load -j 4 Run:
> Elapsed Time 274.916 Elapsed Time 245.026
> User Time 833.63 User Time 832.34
> System Time 73.704 System Time 73.41
> Percent CPU 335.8 Percent CPU 373.6
> Context Switches 12984.8 Context Switches 13427.4
> Sleeps 21459.2 Sleeps 21642

Do you think this looks like it may be a CPU scheduling or disk/fs artifact?
Neither user nor system time is significantly worse, while the vanilla
kernel is using a lot more of the CPUs' power (i.e. waiting for IO less,
or going idle less often due to CPU scheduler balancing).

Aside: under-load conditions like this are actually something the CPU
scheduler doesn't currently handle brilliantly. I attribute that to most
"performance tests" probably loading it up as much as possible. I am
(on and off) looking at improving performance under these conditions,
and am making some inroads.

2004-11-27 10:42:35

by Nikita Danilov

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()

Marcelo Tosatti writes:

[...]

>
> Average Half Load -j 4 Run: Average Half Load -j 4 Run:
> Elapsed Time 274.916 Elapsed Time 245.026
> User Time 833.63 User Time 832.34
> System Time 73.704 System Time 73.41
> Percent CPU 335.8 Percent CPU 373.6
> Context Switches 12984.8 Context Switches 13427.4
> Sleeps 21459.2 Sleeps 21642

What is "System Time" here? Does in include CPU consumption by, say,
kswapd? If not, then one may conjecture that mark_page_accessed batching
modifies ordering of pages on zone "LRU" lists in a way that measurably
changes efficiency of VM scanning, and mostly affects process that scans
most---kswapd.

Nikita.

2004-11-27 13:29:10

by Marcelo Tosatti

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()

On Sat, Nov 27, 2004 at 01:41:58PM +0300, Nikita Danilov wrote:
> Marcelo Tosatti writes:
>
> [...]
>
> >
> > Average Half Load -j 4 Run: Average Half Load -j 4 Run:
> > Elapsed Time 274.916 Elapsed Time 245.026
> > User Time 833.63 User Time 832.34
> > System Time 73.704 System Time 73.41
> > Percent CPU 335.8 Percent CPU 373.6
> > Context Switches 12984.8 Context Switches 13427.4
> > Sleeps 21459.2 Sleeps 21642
>
> What is "System Time" here? Does it include CPU consumption by, say,
> kswapd?

I don't think "System Time" includes kswapd - only the time the benchmark
processes spend inside the kernel, including page reclamation efforts.

> If not, then one may conjecture that mark_page_accessed batching
> modifies the ordering of pages on the zone "LRU" lists in a way that
> measurably changes the efficiency of VM scanning, and mostly affects the
> process that scans most---kswapd.

I agree.

2004-12-01 00:39:45

by Marcelo Tosatti

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()

On Sat, Nov 27, 2004 at 11:37:17AM +1100, Nick Piggin wrote:
> Marcelo Tosatti wrote:
> >On Sun, Nov 21, 2004 at 06:44:04PM +0300, Nikita Danilov wrote:
> >
> >>Batch mark_page_accessed() (a la lru_cache_add() and
> >>lru_cache_add_active()):
> >>page to be marked accessed is placed into per-cpu pagevec
> >>(page_accessed_pvec). When pagevec is filled up, all pages are processed
> >>in a
> >>batch.
> >>
> >>This is supposed to decrease contention on zone->lru_lock.
> >
> >
> >Here are the STP 8way results:
> >
> >8way:
> >
>
> ...
>
> >kernbench
> >
> >Decreases performance significantly (more notably at -j4), probably due to
> >the additional atomic operations noted by Andrew:
> >
> >kernel: nikita-b2 kernel: patch-2.6.10-rc2
> >Host: stp8-002 Host: stp8-003
> >
>
> ...
>
> >
> >Average Half Load -j 4 Run: Average Half Load -j 4 Run:
> >Elapsed Time 274.916 Elapsed Time 245.026
> >User Time 833.63 User Time 832.34
> >System Time 73.704 System Time 73.41
> >Percent CPU 335.8 Percent CPU 373.6
> >Context Switches 12984.8 Context Switches 13427.4
> >Sleeps 21459.2 Sleeps 21642
>
> Do you think this looks like it may be a CPU scheduling or disk/fs artifact?
> Neither user nor system time is significantly worse, while the vanilla
> kernel is using a lot more of the CPUs' power (i.e. waiting for IO less,
> or going idle less often due to CPU scheduler balancing).

Nick,

I do not think it is a disk/fs artifact.

Because the ordering of LRU pages should be improved with respect to locality:
with the mark_page_accessed batching you group together a task's accessed pages
and move them to the active list at once.

You maintain better locality ordering, while decreasing the precision of
aging/temporal locality.

That should enhance disk writeout performance.

On the other hand, without batching you mix the locality up in the LRU - the LRU
becomes more precise in terms of "LRU aging", but less ordered in terms of
sequential access patterns.

The disk IO intensive reaim sees a very significant gain from the batching,
probably due to the enhanced LRU ordering (what Nikita says).

The slowdown is probably due to the additional atomic_inc by page_cache_get().

Is there no way to avoid that page_cache_get() there (and in lru_cache_add() too)?

> Aside: under-load conditions like this are actually something the CPU
> scheduler doesn't currently handle brilliantly. I attribute that to most
> "performance tests" probably loading it up as much as possible. I am
> (on and off) looking at improving performance under these conditions,
> and am making some inroads.


2004-12-01 01:35:09

by Andrew Morton

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()

Marcelo Tosatti <[email protected]> wrote:
>
> Because the ordering of LRU pages should be improved with respect to locality:
> with the mark_page_accessed batching you group together a task's accessed pages
> and move them to the active list at once.
>
> You maintain better locality ordering, while decreasing the precision of
> aging/temporal locality.
>
> That should enhance disk writeout performance.

I'll buy that explanation, although I'm a bit sceptical that it is
measurable.

Was that particular workload actually performing significant amounts of
writeout in vmscan.c? (We should have direct+kswapd counters for that, but
we don't. /proc/vmstat:pgrotated will give us an idea).


> On the other hand, without batching you mix the locality up in the LRU - the LRU
> becomes more precise in terms of "LRU aging", but less ordered in terms of
> sequential access patterns.
>
> The disk IO intensive reaim sees a very significant gain from the batching,
> probably due to the enhanced LRU ordering (what Nikita says).
>
> The slowdown is probably due to the additional atomic_inc by page_cache_get().
>
> Is there no way to avoid that page_cache_get() there (and in lru_cache_add() too)?

Not really. The page is only in the pagevec at that time - if someone does
a put_page() on it the page will be freed for real, and will then be
spilled onto the LRU. Messy.
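
In other words, a sketch of the scenario the extra reference guards against
(assuming the caller drops its own reference right after mark_page_accessed()
returns):

	/*
	 * mark_page_accessed(page)    - pvec now holds a bare pointer
	 * caller's final put_page()   - refcount hits zero, the page is
	 *                               pulled off the LRU and freed for real
	 * __pagevec_mark_accessed()   - would later touch a freed (and
	 *                               possibly reallocated) page
	 *
	 * The page_cache_get() in mark_page_accessed() keeps the page pinned
	 * until release_pages() drops that reference at the end of the batch.
	 */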

2004-12-01 04:58:30

by Marcelo Tosatti

Subject: Re: [PATCH]: 1/4 batch mark_page_accessed()

On Tue, Nov 30, 2004 at 05:33:23PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <[email protected]> wrote:
> >
> > Because the ordering of LRU pages should be improved with respect to locality:
> > with the mark_page_accessed batching you group together a task's accessed pages
> > and move them to the active list at once.
> >
> > You maintain better locality ordering, while decreasing the precision of
> > aging/temporal locality.
> >
> > That should enhance disk writeout performance.
>
> I'll buy that explanation. Although I'm a bit sceptical that it is
> measurable.

It's just a theory that makes sense to me, but yes, we need to measure it.

> Was that particular workload actually performing significant amounts of
> writeout in vmscan.c? (We should have direct+kswapd counters for that, but
> we don't. /proc/vmstat:pgrotated will give us an idea).

I strongly believe so, it's a memory-hungry benchmark - I'll collect stats
later this week or next.

> > On the other hand, without batching you mix the locality up in the LRU - the LRU
> > becomes more precise in terms of "LRU aging", but less ordered in terms of
> > sequential access patterns.
> >
> > The disk IO intensive reaim sees a very significant gain from the batching,
> > probably due to the enhanced LRU ordering (what Nikita says).
> >
> > The slowdown is probably due to the additional atomic_inc by page_cache_get().
> >
> > Is there no way to avoid that page_cache_get() there (and in lru_cache_add() too)?
>
> Not really. The page is only in the pagevec at that time - if someone does
> a put_page() on it the page will be freed for real, and will then be
> spilled onto the LRU. Messy.

We could handle such a situation at allocation time and during vmscan - maybe
it's doable. Just pondering; maybe it's indeed too messy to even consider.