2008-07-25 02:25:58

by Rik van Riel

Subject: PERF: performance tests with the split LRU VM in -mm

In order to improve the performance of the split LRU VM (in -mm),
I have run several performance tests with the following kernels:
- 2.6.26 "2.6.26"
- 2.6.26-rc8-mm1 "-mm"
- 2.6.26-rc8-mm1 w/ "evict streaming IO cache first" patch "stream"
Patch at: http://lkml.org/lkml/2008/7/15/465
- 2.6.26-rc8-mm1 w/ "fix swapout on sequential IO" patch "noforce"
Patch at: http://marc.info/?l=linux-mm&m=121683855132630&w=2

I have run the performance tests on a Dell pe1950 system
with 2 quad-core CPUs, 16GB of RAM and a hardware RAID 1
array of 146GB disks.

The tests are fairly simple, but took a fair amount of time to
run due to the size of the data set involved (full disk for dd,
55GB innodb file for the database tests).


TEST 1: dd if=/dev/sda of=/dev/null bs=1M

kernel     speed      swap used

2.6.26     111MB/s    500kB
-mm        110MB/s    59MB    (ouch, system noticeably slower)
noforce    111MB/s    128kB
stream     108MB/s    0       (slight regression, not sure why yet)

This test shows that the split LRU VM in -mm has a problem
with large streaming IOs: the working set gets pushed out of
memory, which makes doing anything else during the big streaming
IO kind of painful.

However, either of the two patches posted fixes that problem,
though at a slight performance penalty for the "stream" patch.


TEST 2: sysbench & linear query

In this test, I run sysbench in parallel with "SELECT COUNT(*) FROM sbtest;"
on a 240,000,000 row sysbench database. In the first set of results,
MySQL was started with its default memory allocation; in the second
set, innodb_buffer_pool_size=12G allocates 75% of system memory
as innodb buffer. The sysbench performance number is the number of
transactions per second (tps), while the linear query simply has its
time measured.

                 default memory           12GB innodb buffer
kernel     tps   SELECT COUNT       tps   SELECT COUNT      swapped out

2.6.26     100   42 min 6 sec       142   1 hour 20 min     5GB (constant swap IO!)
-mm        109   33 min 25 sec      210   22 min 26 sec     <70MB
noforce    101   34 min 48 sec      207   22 min 16 sec     <70MB
stream     111   32 min 5 sec       209   22 min 22 sec     <70MB

These results show that increasing the database buffer helps
sysbench performance, even in 2.6.26 which is constantly swapping
the database buffer in and out. However, the large linear query
really suffers in the upstream VM.

The upstream VM constantly swaps the mysql innodb buffer in and
out, with the amount of swap space in use hovering at about half
full (5GB). This probably indicates that the kernel keeps
cycling mysql in and out of swap, freeing up swap space at
swapin time.

Neither of the patches I proposed for -mm seems to make much of
a performance difference in this test, but both solve the
interactivity problem during large streaming IO.

The split LRU VM in the -mm kernel really improves the performance
of databases in the presence of streaming IO, which is a real
performance issue for Linux users at the moment.


2008-07-28 14:58:29

by Rik van Riel

Subject: Re: PERF: performance tests with the split LRU VM in -mm

On Thu, 24 Jul 2008 22:25:10 -0400
Rik van Riel <[email protected]> wrote:

> TEST 1: dd if=/dev/sda of=/dev/null bs=1M
>
> kernel speed swap used
>
> 2.6.26 111MB/s 500kB
> -mm 110MB/s 59MB (ouch, system noticeably slower)
> noforce 111MB/s 128kB
> stream 108MB/s 0 (slight regression, not sure why yet)
>
> This test shows that the split LRU VM in -mm has a problem
> with large streaming IOs: the working set gets pushed out of
> memory, which makes doing anything else during the big streaming
> IO kind of painful.
>
> However, either of the two patches posted fixes that problem,
> though at a slight performance penalty for the "stream" patch.

OK, the throughput number with this test turns out not to mean
nearly as much as I thought.

Switching off CPU frequency scaling and pinning the CPUs at the
highest speed resulted in a throughput of only 102MB/s.

My suspicion is that faster running code on the CPU results
in IOs being sent down to the device faster, resulting in
smaller IOs and lower throughput.

This would be promising for the "stream" patch, which makes
choosing between the two patches harder :)

Andrew, what is your preference between:
http://lkml.org/lkml/2008/7/15/465
and
http://marc.info/?l=linux-mm&m=121683855132630&w=2

--
All Rights Reversed

2008-07-28 15:30:56

by Ray Lee

Subject: Re: PERF: performance tests with the split LRU VM in -mm

On Mon, Jul 28, 2008 at 7:57 AM, Rik van Riel <[email protected]> wrote:
> On Thu, 24 Jul 2008 22:25:10 -0400
> Rik van Riel <[email protected]> wrote:
>
>> TEST 1: dd if=/dev/sda of=/dev/null bs=1M
>>
>> kernel speed swap used
>>
>> 2.6.26 111MB/s 500kB
>> -mm 110MB/s 59MB (ouch, system noticeably slower)
>> noforce 111MB/s 128kB
>> stream 108MB/s 0 (slight regression, not sure why yet)
>>
>> This test shows that the split LRU VM in -mm has a problem
>> with large streaming IOs: the working set gets pushed out of
>> memory, which makes doing anything else during the big streaming
>> IO kind of painful.
>>
>> However, either of the two patches posted fixes that problem,
>> though at a slight performance penalty for the "stream" patch.
>
> OK, the throughput number with this test turns out not to mean
> nearly as much as I thought.
>
> Switching off CPU frequency scaling, pinning the CPUs at the
> highest speed, resulted in a throughput of only 102MB/s.
>
> My suspicion is that faster running code on the CPU results
> in IOs being sent down to the device faster, resulting in
> smaller IOs and lower throughput.

Or the IOs are getting sent in a different order, and so
coalescing/merging isn't occurring as often. Getting some
instrumentation (something as simple as a histogram) on the IO sizes
could be useful.

2008-07-28 23:42:57

by Andrew Morton

Subject: Re: PERF: performance tests with the split LRU VM in -mm

On Mon, 28 Jul 2008 10:57:42 -0400
Rik van Riel <[email protected]> wrote:

> On Thu, 24 Jul 2008 22:25:10 -0400
> Rik van Riel <[email protected]> wrote:
>
> > TEST 1: dd if=/dev/sda of=/dev/null bs=1M
> >
> > kernel speed swap used
> >
> > 2.6.26 111MB/s 500kB
> > -mm 110MB/s 59MB (ouch, system noticeably slower)
> > noforce 111MB/s 128kB
> > stream 108MB/s 0 (slight regression, not sure why yet)
> >
> > This test shows that the split LRU VM in -mm has a problem
> > with large streaming IOs: the working set gets pushed out of
> > memory, which makes doing anything else during the big streaming
> > IO kind of painful.
> >
> > However, either of the two patches posted fixes that problem,
> > though at a slight performance penalty for the "stream" patch.
>
> OK, the throughput number with this test turns out not to mean
> nearly as much as I thought.
>
> Switching off CPU frequency scaling, pinning the CPUs at the
> highest speed, resulted in a throughput of only 102MB/s.
>
> My suspicion is that faster running code on the CPU results
> in IOs being sent down to the device faster, resulting in
> smaller IOs and lower throughput.
>
> This would be promising for the "stream" patch, which makes
> choosing between the two patches harder :)
>
> Andrew, what is your preference between:
> http://lkml.org/lkml/2008/7/15/465
> and
> http://marc.info/?l=linux-mm&m=121683855132630&w=2
>

Boy. They both seem rather hacky special-cases. But that doesn't mean
that they're undesirable hacky special-cases. I guess the second one
looks a bit more "algorithmic" and a bit less hacky-special-case. But
it all depends on testing..

On a different topic, these:

vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
vm-dont-run-touch_buffer-during-buffercache-lookups.patch

have been floating about in -mm for ages, awaiting demonstration that
they're a net benefit. But all of this new page-reclaim rework was
built on top of those two patches and incorporates and retains them.

I could toss them out, but that would require some rework and would
partially invalidate previous testing and who knows, they _might_ be
good patches. Or they might not be.

What are your thoughts?

2008-07-28 23:57:31

by Rik van Riel

Subject: Re: PERF: performance tests with the split LRU VM in -mm

On Mon, 28 Jul 2008 16:41:24 -0700
Andrew Morton <[email protected]> wrote:

> > Andrew, what is your preference between:
> > http://lkml.org/lkml/2008/7/15/465
> > and
> > http://marc.info/?l=linux-mm&m=121683855132630&w=2
> >
>
> Boy. They both seem rather hacky special-cases. But that doesn't mean
> that they're undesirable hacky special-cases. I guess the second one
> looks a bit more "algorithmic" and a bit less hacky-special-case. But
> it all depends on testing..

I prefer the second one, since it removes the + 1 magic (at least,
for the higher priorities), instead of adding new magic like the
other patch does.

> On a different topic, these:
>
> vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
> vm-dont-run-touch_buffer-during-buffercache-lookups.patch
>
> have been floating about in -mm for ages, awaiting demonstration that
> they're a net benefit. But all of this new page-reclaim rework was
> built on top of those two patches and incorporates and retains them.
>
> I could toss them out, but that would require some rework and would
> partially invalidate previous testing and who knows, they _might_ be
> good patches. Or they might not be.
>
> What are your thoughts?

I believe you should definitely keep those. Being able to better
preserve actively accessed file pages could be a real benefit, and
we have yet to discover a downside to those patches.

--
All Rights Reversed

2008-07-29 00:03:29

by Rik van Riel

Subject: Re: PERF: performance tests with the split LRU VM in -mm

On Mon, 28 Jul 2008 19:57:13 -0400
Rik van Riel <[email protected]> wrote:
> On Mon, 28 Jul 2008 16:41:24 -0700
> Andrew Morton <[email protected]> wrote:
>
> > > Andrew, what is your preference between:
> > > http://lkml.org/lkml/2008/7/15/465
> > > and
> > > http://marc.info/?l=linux-mm&m=121683855132630&w=2
> > >
> >
> > Boy. They both seem rather hacky special-cases. But that doesn't mean
> > that they're undesirable hacky special-cases. I guess the second one
> > looks a bit more "algorithmic" and a bit less hacky-special-case. But
> > it all depends on testing..
>
> I prefer the second one, since it removes the + 1 magic (at least,
> for the higher priorities), instead of adding new magic like the
> other patch does.

Btw, didn't you add that "+ 1" originally early on in the 2.6 VM?

Do you remember its purpose?

Does it still make sense to have that "+ 1" in the split LRU VM?

Could we get away with just removing it unconditionally?

--
All Rights Reversed

2008-07-29 00:18:41

by Andrew Morton

Subject: Re: PERF: performance tests with the split LRU VM in -mm

On Mon, 28 Jul 2008 20:03:11 -0400
Rik van Riel <[email protected]> wrote:

> On Mon, 28 Jul 2008 19:57:13 -0400
> Rik van Riel <[email protected]> wrote:
> > On Mon, 28 Jul 2008 16:41:24 -0700
> > Andrew Morton <[email protected]> wrote:
> >
> > > > Andrew, what is your preference between:
> > > > http://lkml.org/lkml/2008/7/15/465
> > > > and
> > > > http://marc.info/?l=linux-mm&m=121683855132630&w=2
> > > >
> > >
> > > Boy. They both seem rather hacky special-cases. But that doesn't mean
> > > that they're undesirable hacky special-cases. I guess the second one
> > > looks a bit more "algorithmic" and a bit less hacky-special-case. But
> > > it all depends on testing..
> >
> > I prefer the second one, since it removes the + 1 magic (at least,
> > for the higher priorities), instead of adding new magic like the
> > other patch does.
>
> Btw, didn't you add that "+ 1" originally early on in the 2.6 VM?

You mean this?

	/*
	 * Add one to nr_to_scan just to make sure that the kernel
	 * will slowly sift through the active list.
	 */
	zone->nr_scan_active +=
		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;


> Do you remember its purpose?

erm, not specifically, but I tended to lavishly describe changes like
this in the changelogging.

> Does it still make sense to have that "+ 1" in the split LRU VM?
>
> Could we get away with just removing it unconditionally?

We should do the necessary git dumpster-diving before tossing out
hard-won changes. Otherwise we might need to spend a year
re-discovering and re-fixing already-discovered-and-fixed things.

That code has been there in one way or another for some time.

In June 2004, 385c0449 did this:

 	/*
-	 * Try to keep the active list 2/3 of the size of the cache. And
-	 * make sure that refill_inactive is given a decent number of pages.
-	 *
-	 * The "scan_active + 1" here is important. With pagecache-intensive
-	 * workloads the inactive list is huge, and `ratio' evaluates to zero
-	 * all the time. Which pins the active list memory. So we add one to
-	 * `scan_active' just to make sure that the kernel will slowly sift
-	 * through the active list.
+	 * Add one to `nr_to_scan' just to make sure that the kernel will
+	 * slowly sift through the active list.
 	 */
-	if (zone->nr_active >= 4*(zone->nr_inactive*2 + 1)) {
-		/* Don't scan more than 4 times the inactive list scan size */
-		scan_active = 4*scan_inactive;

(there was some regrettable information loss there).

Is the scenario which that fix addresses no longer possible?


On a different topic, I am staring in frustration at
introduce-__get_user_pages.patch, which says:

New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
because current get_user_pages() can't grab PROT_NONE pages theresore
it cause PROT_NONE pages can't munlock.

could someone please work out for me which of these patches:

vmscan-move-isolate_lru_page-to-vmscanc.patch
vmscan-use-an-indexed-array-for-lru-variables.patch
swap-use-an-array-for-the-lru-pagevecs.patch
vmscan-free-swap-space-on-swap-in-activation.patch
define-page_file_cache-function.patch
vmscan-split-lru-lists-into-anon-file-sets.patch
vmscan-second-chance-replacement-for-anonymous-pages.patch
vmscan-fix-pagecache-reclaim-referenced-bit-check.patch
vmscan-add-newly-swapped-in-pages-to-the-inactive-list.patch
more-aggressively-use-lumpy-reclaim.patch
pageflag-helpers-for-configed-out-flags.patch
unevictable-lru-infrastructure.patch
unevictable-lru-page-statistics.patch
ramfs-and-ram-disk-pages-are-unevictable.patch
shm_locked-pages-are-unevictable.patch
mlock-mlocked-pages-are-unevictable.patch
mlock-downgrade-mmap-sem-while-populating-mlocked-regions.patch
mmap-handle-mlocked-pages-during-map-remap-unmap.patch

that patch fixes?

2008-07-29 00:31:43

by Rik van Riel

Subject: Re: PERF: performance tests with the split LRU VM in -mm

On Mon, 28 Jul 2008 17:17:28 -0700
Andrew Morton <[email protected]> wrote:

> 	/*
> -	 * Try to keep the active list 2/3 of the size of the cache. And
> -	 * make sure that refill_inactive is given a decent number of pages.
> -	 *
> -	 * The "scan_active + 1" here is important. With pagecache-intensive
> -	 * workloads the inactive list is huge, and `ratio' evaluates to zero
> -	 * all the time. Which pins the active list memory. So we add one to

If the active list is so small that nr_active_file >> priority always
evaluates to 0, I suspect it won't hurt at all to keep those pages
around unscanned.

After all, we now only scan once the (incrementing) scan number reaches
swap_cluster_max.

> -	 * `scan_active' just to make sure that the kernel will slowly sift
> -	 * through the active list.
> +	 * Add one to `nr_to_scan' just to make sure that the kernel will
> +	 * slowly sift through the active list.
> 	 */
> -	if (zone->nr_active >= 4*(zone->nr_inactive*2 + 1)) {
> -		/* Don't scan more than 4 times the inactive list scan size */
> -		scan_active = 4*scan_inactive;
>
> (there was some regrettable information loss there).
>
> Is the scenario which that fix addresses no longer possible?

I believe it is possible, but harmless. Maybe even desired.

> On a different topic, I am staring in frustration at
> introduce-__get_user_pages.patch, which says:
>
> New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
> because current get_user_pages() can't grab PROT_NONE pages theresore
> it cause PROT_NONE pages can't munlock.
>
> could someone please work out for me which of these patches:
>
> vmscan-move-isolate_lru_page-to-vmscanc.patch
> vmscan-use-an-indexed-array-for-lru-variables.patch
> swap-use-an-array-for-the-lru-pagevecs.patch
> vmscan-free-swap-space-on-swap-in-activation.patch
> define-page_file_cache-function.patch
> vmscan-split-lru-lists-into-anon-file-sets.patch
> vmscan-second-chance-replacement-for-anonymous-pages.patch
> vmscan-fix-pagecache-reclaim-referenced-bit-check.patch
> vmscan-add-newly-swapped-in-pages-to-the-inactive-list.patch
> more-aggressively-use-lumpy-reclaim.patch
> pageflag-helpers-for-configed-out-flags.patch
> unevictable-lru-infrastructure.patch
> unevictable-lru-page-statistics.patch
> ramfs-and-ram-disk-pages-are-unevictable.patch
> shm_locked-pages-are-unevictable.patch
> mlock-mlocked-pages-are-unevictable.patch
> mlock-downgrade-mmap-sem-while-populating-mlocked-regions.patch
> mmap-handle-mlocked-pages-during-map-remap-unmap.patch
>
> that patch fixes?

I'll take a look later. Time to drive home and eat dinner :)

--
All Rights Reversed

2008-07-29 00:47:19

by Lee Schermerhorn

Subject: Re: PERF: performance tests with the split LRU VM in -mm

On Mon, 2008-07-28 at 17:17 -0700, Andrew Morton wrote:
> On Mon, 28 Jul 2008 20:03:11 -0400
> Rik van Riel <[email protected]> wrote:
>
> > On Mon, 28 Jul 2008 19:57:13 -0400
> > Rik van Riel <[email protected]> wrote:
> > > On Mon, 28 Jul 2008 16:41:24 -0700
> > > Andrew Morton <[email protected]> wrote:
> > >
> > > > > Andrew, what is your preference between:
> > > > > http://lkml.org/lkml/2008/7/15/465
> > > > > and
> > > > > http://marc.info/?l=linux-mm&m=121683855132630&w=2
> > > > >
> > > >
> > > > Boy. They both seem rather hacky special-cases. But that doesn't mean
> > > > that they're undesirable hacky special-cases. I guess the second one
> > > > looks a bit more "algorithmic" and a bit less hacky-special-case. But
> > > > it all depends on testing..
> > >
> > > I prefer the second one, since it removes the + 1 magic (at least,
> > > for the higher priorities), instead of adding new magic like the
> > > other patch does.
> >
> > Btw, didn't you add that "+ 1" originally early on in the 2.6 VM?
>
> You mean this?
>
> 	/*
> 	 * Add one to nr_to_scan just to make sure that the kernel
> 	 * will slowly sift through the active list.
> 	 */
> 	zone->nr_scan_active +=
> 		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
>
>
> > Do you remember its purpose?
>
> erm, not specifically, but I tended to lavishly describe changes like
> this in the changelogging.
>
> > Does it still make sense to have that "+ 1" in the split LRU VM?
> >
> > Could we get away with just removing it unconditionally?
>
> We should do the necessary git dumpster-diving before tossing out
> hard-won changes. Otherwise we might need to spend a year
> re-discovering and re-fixing already-discovered-and-fixed things.
>
> That code has been there in one way or another for some time.
>
> In June 2004, 385c0449 did this:
>
> 	/*
> -	 * Try to keep the active list 2/3 of the size of the cache. And
> -	 * make sure that refill_inactive is given a decent number of pages.
> -	 *
> -	 * The "scan_active + 1" here is important. With pagecache-intensive
> -	 * workloads the inactive list is huge, and `ratio' evaluates to zero
> -	 * all the time. Which pins the active list memory. So we add one to
> -	 * `scan_active' just to make sure that the kernel will slowly sift
> -	 * through the active list.
> +	 * Add one to `nr_to_scan' just to make sure that the kernel will
> +	 * slowly sift through the active list.
> 	 */
> -	if (zone->nr_active >= 4*(zone->nr_inactive*2 + 1)) {
> -		/* Don't scan more than 4 times the inactive list scan size */
> -		scan_active = 4*scan_inactive;
>
> (there was some regrettable information loss there).
>
> Is the scenario which that fix addresses no longer possible?
>
>
> On a different topic, I am staring in frustration at
> introduce-__get_user_pages.patch, which says:
>
> New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
> because current get_user_pages() can't grab PROT_NONE pages theresore
> it cause PROT_NONE pages can't munlock.
>
> could someone please work out for me which of these patches:
>
> vmscan-move-isolate_lru_page-to-vmscanc.patch
> vmscan-use-an-indexed-array-for-lru-variables.patch
> swap-use-an-array-for-the-lru-pagevecs.patch
> vmscan-free-swap-space-on-swap-in-activation.patch
> define-page_file_cache-function.patch
> vmscan-split-lru-lists-into-anon-file-sets.patch
> vmscan-second-chance-replacement-for-anonymous-pages.patch
> vmscan-fix-pagecache-reclaim-referenced-bit-check.patch
> vmscan-add-newly-swapped-in-pages-to-the-inactive-list.patch
> more-aggressively-use-lumpy-reclaim.patch
> pageflag-helpers-for-configed-out-flags.patch
> unevictable-lru-infrastructure.patch
> unevictable-lru-page-statistics.patch
> ramfs-and-ram-disk-pages-are-unevictable.patch
> shm_locked-pages-are-unevictable.patch
> mlock-mlocked-pages-are-unevictable.patch

Andrew:

Kosaki-san's patch to introduce __get_user_pages() is a patch against
the unevictable/mlocked-pages series above. He enhanced
get_user_pages() so that we could fault in PROT_NONE pages for
munlocking, replacing the page table walker [subsequent patches in
that series]. He replaced the page table walker to avoid the
"sleeping while atomic" problem on 32-bit/HIGHPTE configs.

Lee

> mlock-downgrade-mmap-sem-while-populating-mlocked-regions.patch
> mmap-handle-mlocked-pages-during-map-remap-unmap.patch
>
> that patch fixes?
>

2008-07-29 13:05:05

by KOSAKI Motohiro

Subject: Re: PERF: performance tests with the split LRU VM in -mm

> TEST 1: dd if=/dev/sda of=/dev/null bs=1M
>
> kernel speed swap used
>
> 2.6.26 111MB/s 500kB
> -mm 110MB/s 59MB (ouch, system noticeably slower)
> noforce 111MB/s 128kB
> stream 108MB/s 0 (slight regression, not sure why yet)

I tried to reproduce it; my ia64 result was:

kernel                       speed      swap used
2.6.26-rc8                   49.8MB/s   1M
2.6.26-rc8-mm1               47.6MB/s   168M
-mm with above two patches   50.2MB/s   0


So, I think it isn't a regression.


2008-07-29 13:17:41

by Rik van Riel

Subject: Re: PERF: performance tests with the split LRU VM in -mm

On Tue, 29 Jul 2008 22:04:16 +0900
KOSAKI Motohiro <[email protected]> wrote:

> > TEST 1: dd if=/dev/sda of=/dev/null bs=1M
> >
> > kernel speed swap used
> >
> > 2.6.26 111MB/s 500kB
> > -mm 110MB/s 59MB (ouch, system noticeably slower)
> > noforce 111MB/s 128kB
> > stream 108MB/s 0 (slight regression, not sure why yet)
>
> I tried to reproduce it, my ia64 result was
>
> kernel speed swap used
> 2.6.26-rc8 49.8MB/s 1M
> 2.6.26-rc8-mm1 47.6MB/s 168M
> -mm with above two patches 50.2MB/s 0
>
>
> So, I think it isn't a regression.

Agreed. It looked like it, but once I changed the cpuspeed
governor from ondemand to performance, I saw that it had to
be an artifact of something else.

Getting rid of the swap use from a linear IO is the important
part.

--
All rights reversed.

2008-07-29 13:22:21

by Johannes Weiner

Subject: Re: PERF: performance tests with the split LRU VM in -mm

Hi,

Rik van Riel <[email protected]> writes:

> On Mon, 28 Jul 2008 19:57:13 -0400
> Rik van Riel <[email protected]> wrote:
>> On Mon, 28 Jul 2008 16:41:24 -0700
>> Andrew Morton <[email protected]> wrote:
>>
>> > > Andrew, what is your preference between:
>> > > http://lkml.org/lkml/2008/7/15/465
>> > > and
>> > > http://marc.info/?l=linux-mm&m=121683855132630&w=2
>> > >
>> >
>> > Boy. They both seem rather hacky special-cases. But that doesn't mean
>> > that they're undesirable hacky special-cases. I guess the second one
>> > looks a bit more "algorithmic" and a bit less hacky-special-case. But
>> > it all depends on testing..
>>
>> I prefer the second one, since it removes the + 1 magic (at least,
>> for the higher priorities), instead of adding new magic like the
>> other patch does.
>
> Btw, didn't you add that "+ 1" originally early on in the 2.6 VM?
>
> Do you remember its purpose?
>
> Does it still make sense to have that "+ 1" in the split LRU VM?
>
> Could we get away with just removing it unconditionally?

Here is my original patch that just gets rid of it. It did not cause
any problems for me under high memory pressure. Rik, you said on IRC
that you now also think the patch is safe?

Hannes

---
From: Johannes Weiner <[email protected]>
Subject: mm: don't accumulate scan pressure on unrelated lists

During each reclaim scan we accumulate scan pressure on unrelated
lists, which will eventually result in bogus scans and unwanted
reclaims.

Scanning lists with few reclaim candidates results in a lot of
rotation and therefore also disturbs the list balancing, putting even
more pressure on the wrong lists.

In a test case with a lot of streaming IO, and therefore a crowded
inactive file page list, swapping started because

a) anon pages were reclaimed after swap_cluster_max reclaim
invocations -- the nr_scan of this list had simply accumulated

b) active file pages were scanned because *their* nr_scan had also
accumulated through the same logic. And this in turn created a
lot of rotation for file pages and resulted in a decrease of file
list priority, again increasing the pressure on anon pages.

The result was an evicted working set of anon pages while there were
tons of inactive file pages that should have been taken instead.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1458,16 +1458,13 @@ static unsigned long shrink_zone(int pri
 		if (scan_global_lru(sc)) {
 			int file = is_file_lru(l);
 			int scan;
-			/*
-			 * Add one to nr_to_scan just to make sure that the
-			 * kernel will slowly sift through each list.
-			 */
+
 			scan = zone_page_state(zone, NR_LRU_BASE + l);
 			if (priority) {
 				scan >>= priority;
 				scan = (scan * percent[file]) / 100;
 			}
-			zone->lru[l].nr_scan += scan + 1;
+			zone->lru[l].nr_scan += scan;
 			nr[l] = zone->lru[l].nr_scan;
 			if (nr[l] >= sc->swap_cluster_max)
 				zone->lru[l].nr_scan = 0;

2008-07-29 13:29:17

by Rik van Riel

Subject: Re: PERF: performance tests with the split LRU VM in -mm

On Tue, 29 Jul 2008 15:21:47 +0200
Johannes Weiner <[email protected]> wrote:

> Here is my original patch that just gets rid of it. It did not cause
> any problems to me on high pressure. Rik, you said on IRC that you now
> also think the patch is safe..?

Yes. Removing the "+ 1" is safe because we do not scan until
zone->lru[l].nr_scan reaches swap_cluster_max, which means that
the scan counter for small lists will also slowly increase and
no list will be left behind.

> From: Johannes Weiner <[email protected]>
> Subject: mm: don't accumulate scan pressure on unrelated lists
>
> During each reclaim scan we accumulate scan pressure on unrelated
> lists which will result in bogus scans and unwanted reclaims
> eventually.

This patch fixes the balancing issues that we have been seeing
with the split LRU VM currently in -mm.

It is my preferred patch because it removes magic from the VM,
instead of adding some.

> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed.

2008-07-29 13:51:51

by Johannes Weiner

Subject: Re: PERF: performance tests with the split LRU VM in -mm

Hi,

Rik van Riel <[email protected]> writes:

> In order to get the performance of the split LRU VM (in -mm) better,
> I have performed several performance tests with the following kernels:
> - 2.6.26 "2.6.26"
> - 2.6.26-rc8-mm1 "-mm"
> - 2.6.26-rc8-mm1 w/ "evict streaming IO cache first" patch "stream"
> Patch at: http://lkml.org/lkml/2008/7/15/465
> - 2.6.26-rc8-mm1 w/ "fix swapout on sequential IO" patch "noforce"
> Patch at: http://marc.info/?l=linux-mm&m=121683855132630&w=2
>
> I have run the performance tests on a Dell pe1950 system
> with 2 quad-core CPUs, 16GB of RAM and a hardware RAID 1
> array of 146GB disks.
>
> The tests are fairly simple, but took a fair amount of time to
> run due to the size of the data set involved (full disk for dd,
> 55GB innodb file for the database tests).
>
>
> TEST 1: dd if=/dev/sda of=/dev/null bs=1M
>
> kernel speed swap used
>
> 2.6.26 111MB/s 500kB
> -mm 110MB/s 59MB (ouch, system noticeably slower)
> noforce 111MB/s 128kB
> stream 108MB/s 0 (slight regression, not sure why yet)
>
> This test shows that the split LRU VM in -mm has a problem
> with large streaming IOs: the working set gets pushed out of
> memory, which makes doing anything else during the big streaming
> IO kind of painful.
>
> However, either of the two patches posted fixes that problem,
> though at a slight performance penalty for the "stream" patch.

Btw, my desktop machine has been running -mm (+ the patch I posted
later in this thread) for over a week now and I have not yet
encountered any notable regressions in normal usage patterns.

I have not collected hard numbers but just tried to work normally with
it.

I also employed a massive memory eater (besides emacs and firefox)
that spawns children which each eat, one after another, ~120% of RAM.

Continuing normal work on both kernels was a bit harder, sure, but not
impossible.

The box never died on me, nor did it thrash perceptibly harder or
longer near OOM than .26 does. The OOM killer was never invoked.

Hannes