2008-07-19 17:32:29

by Johannes Weiner

Subject: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

File pages accessed only once through sequential-read mappings between
fault and scan time are perfect candidates for reclaim.

This patch makes page_referenced() ignore these singular references and
the pages stay on the inactive list where they likely fall victim to the
next reclaim phase.

Already activated pages are still treated normally. If they were
accessed multiple times and therefore promoted to the active list, we
probably want to keep them.

Benchmarks show that big (relative to the system's memory)
MADV_SEQUENTIAL mappings read sequentially cause much less kernel
activity. Especially less LRU moving-around because we never activate
read-once pages in the first place just to demote them again.

And leaving these perfect reclaim candidates on the inactive list makes
it more likely for the real working set to survive the next reclaim
scan.

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
mm/rmap.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)

Benchmark graphs and the test-application can be found here:

http://hannes.saeurebad.de/madvseq/

The patch is against -mm, although it has only been tested on good ol'
linus-tree, as -mmotm wouldn't compile at the moment.

--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -333,8 +333,18 @@ static int page_referenced_one(struct pa
 		goto out_unmap;
 	}

-	if (ptep_clear_flush_young_notify(vma, address, pte))
-		referenced++;
+	if (ptep_clear_flush_young_notify(vma, address, pte)) {
+		/*
+		 * If there was just one sequential access to the
+		 * page, ignore it. Otherwise, mark_page_accessed()
+		 * will have promoted the page to the active list and
+		 * it should be kept.
+		 */
+		if (VM_SequentialReadHint(vma) && !PageActive(page))
+			ClearPageReferenced(page);
+		else
+			referenced++;
+	}

 	/* Pretend the page is referenced if the task has the
 	   swap token and is in the middle of a page fault. */
@@ -455,9 +465,6 @@ int page_referenced(struct page *page, i
 {
 	int referenced = 0;

-	if (TestClearPageReferenced(page))
-		referenced++;
-
 	if (page_mapped(page) && page->mapping) {
 		if (PageAnon(page))
 			referenced += page_referenced_anon(page, mem_cont);
@@ -473,6 +480,9 @@ int page_referenced(struct page *page, i
 		}
 	}

+	if (TestClearPageReferenced(page))
+		referenced++;
+
 	if (page_test_and_clear_young(page))
 		referenced++;

2008-07-19 17:59:56

by Rik van Riel

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

On Sat, 19 Jul 2008 19:31:49 +0200
Johannes Weiner <[email protected]> wrote:

> File pages accessed only once through sequential-read mappings between
> fault and scan time are perfect candidates for reclaim.
>
> This patch makes page_referenced() ignore these singular references and
> the pages stay on the inactive list where they likely fall victim to the
> next reclaim phase.

Which is exactly what the madvise man page says about pages in
MADV_SEQUENTIAL ranges. Yay.

MADV_SEQUENTIAL
Expect page references in sequential order. (Hence, pages in
the given range can be aggressively read ahead, and may be freed
soon after they are accessed.)
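For illustration, here is a minimal sketch of the usage the man page
describes (the input path and the lack of error handling are purely
illustrative): map the file, set the hint, stream through it once, unmap.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/var/tmp/bigfile", O_RDONLY);	/* hypothetical input */
	struct stat st;
	unsigned char *p;
	unsigned long sum = 0;
	size_t i;

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Declare the read-once, front-to-back access pattern. */
	madvise(p, st.st_size, MADV_SEQUENTIAL);

	for (i = 0; i < (size_t)st.st_size; i++)	/* single sequential pass */
		sum += p[i];

	munmap(p, st.st_size);
	close(fd);
	return sum == 0;	/* keep the read from being optimized away */
}

With Johannes' patch, the pages touched by that single pass stay on the
inactive list instead of being activated first and demoted later.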

--
All rights reversed.

2008-07-21 00:11:03

by KOSAKI Motohiro

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

Hi Johannes,

> File pages accessed only once through sequential-read mappings between
> fault and scan time are perfect candidates for reclaim.
>
> This patch makes page_referenced() ignore these singular references and
> the pages stay on the inactive list where they likely fall victim to the
> next reclaim phase.
>
> Already activated pages are still treated normally. If they were
> accessed multiple times and therefore promoted to the active list, we
> probably want to keep them.
>
> Benchmarks show that big (relative to the system's memory)
> MADV_SEQUENTIAL mappings read sequentially cause much less kernel
> activity. Especially less LRU moving-around because we never activate
> read-once pages in the first place just to demote them again.
>
> And leaving these perfect reclaim candidates on the inactive list makes
> it more likely for the real working set to survive the next reclaim
> scan.

Looks good to me.
Actually, I made a similar patch half a year ago.

In my experience,
- page_referenced_one is a performance-critical point;
you should run some benchmarks.
- my patch improved mmaped-copy performance by about 5%.
(Of course, you should test against current -mm; the MM code has changed widely.)

So, I'm looking forward to your test results :)

2008-07-21 01:49:19

by Andrew Morton

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

On Mon, 21 Jul 2008 09:09:26 +0900 "KOSAKI Motohiro" <[email protected]> wrote:

> Hi Johannes,
>
> > File pages accessed only once through sequential-read mappings between
> > fault and scan time are perfect candidates for reclaim.
> >
> > This patch makes page_referenced() ignore these singular references and
> > the pages stay on the inactive list where they likely fall victim to the
> > next reclaim phase.
> >
> > Already activated pages are still treated normally. If they were
> > accessed multiple times and therefore promoted to the active list, we
> > probably want to keep them.
> >
> > Benchmarks show that big (relative to the system's memory)
> > MADV_SEQUENTIAL mappings read sequentially cause much less kernel
> > activity. Especially less LRU moving-around because we never activate
> > read-once pages in the first place just to demote them again.
> >
> > And leaving these perfect reclaim candidates on the inactive list makes
> > it more likely for the real working set to survive the next reclaim
> > scan.
>
> Looks good to me.
> Actually, I made a similar patch half a year ago.
>
> In my experience,
> - page_referenced_one is a performance-critical point;
> you should run some benchmarks.
> - my patch improved mmaped-copy performance by about 5%.
> (Of course, you should test against current -mm; the MM code has changed widely.)
>
> So, I'm looking forward to your test results :)

The change seems logical and I queued it for 2.6.28.

But yes, testing for what-does-this-improve is good and useful, but so
is testing for what-does-this-worsen. How do we do that in this case?

2008-07-21 03:53:33

by KOSAKI Motohiro

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

Hi Andrew,

>> In my experience,
>> - page_referenced_one is a performance-critical point;
>> you should run some benchmarks.
>> - my patch improved mmaped-copy performance by about 5%.
>> (Of course, you should test against current -mm; the MM code has changed widely.)
>>
>> So, I'm looking forward to your test results :)
>
> The change seems logical and I queued it for 2.6.28.

Great.

> But yes, testing for what-does-this-improve is good and useful, but so
> is testing for what-does-this-worsen. How do we do that in this case?

In general, page_referenced_one is important for reclaim throughput.
If a bad change to page_referenced_one happens,
system reclaim throughput slows down.

Of course, I don't think this patch causes a performance regression :-)

So, any benchmark with a memcgroup memory restriction is a good choice.

BTW:
Maybe I will be able to post mmaped-copy improvement measurements for
Johannes's patch after OLS.

2008-07-21 05:49:19

by Nick Piggin

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

On Monday 21 July 2008 11:48, Andrew Morton wrote:
> On Mon, 21 Jul 2008 09:09:26 +0900 "KOSAKI Motohiro" <[email protected]> wrote:
> > Hi Johannes,
> >
> > > File pages accessed only once through sequential-read mappings between
> > > fault and scan time are perfect candidates for reclaim.
> > >
> > > This patch makes page_referenced() ignore these singular references and
> > > the pages stay on the inactive list where they likely fall victim to
> > > the next reclaim phase.
> > >
> > > Already activated pages are still treated normally. If they were
> > > accessed multiple times and therefore promoted to the active list, we
> > > probably want to keep them.
> > >
> > > Benchmarks show that big (relative to the system's memory)
> > > MADV_SEQUENTIAL mappings read sequentially cause much less kernel
> > > activity. Especially less LRU moving-around because we never activate
> > > read-once pages in the first place just to demote them again.
> > >
> > > And leaving these perfect reclaim candidates on the inactive list makes
> > > it more likely for the real working set to survive the next reclaim
> > > scan.
> >
> > Looks good to me.
> > Actually, I made a similar patch half a year ago.
> >
> > In my experience,
> > - page_referenced_one is a performance-critical point;
> > you should run some benchmarks.
> > - my patch improved mmaped-copy performance by about 5%.
> > (Of course, you should test against current -mm; the MM code has
> > changed widely.)
> >
> > So, I'm looking forward to your test results :)
>
> The change seems logical and I queued it for 2.6.28.
>
> But yes, testing for what-does-this-improve is good and useful, but so
> is testing for what-does-this-worsen. How do we do that in this case?

It's OK, but as always I worry about adding "cool new bells and
whistles" to make already-bad code work a bit faster. It slows
things down. A mispredicted branch btw is about as costly as an
atomic operation.

It is already bad because: if you are doing a big streaming copy
which you know is going to blow the cache and not be used again,
then you should be unmapping behind you as you go. If you do not
do this, then page reclaim has to do the rmap walk, page table
walk, and then the (unbatched, likely IPI delivered) TLB shootdown
for every page. Not to mention churning through the LRU and
chucking other things out just to find these pages.

So what you actually should do is use direct IO, or do page
unmappings and fadvise thingies to throw out the cache.

Adding code and branches to speed up an already terribly suboptimal
microbenchmark by 5% is not very good.

2008-07-21 15:14:51

by Rik van Riel

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

On Mon, 21 Jul 2008 15:49:00 +1000
Nick Piggin <[email protected]> wrote:

> It is already bad because: if you are doing a big streaming copy
> which you know is going to blow the cache and not be used again,
> then you should be unmapping behind you as you go.

MADV_SEQUENTIAL exists for a reason.

If you think that doing an automatic unmap-behind will be
a better way to go, we can certainly whip up a patch for
that...

--
All rights reversed.

2008-07-22 02:02:42

by Nick Piggin

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

On Tuesday 22 July 2008 01:14, Rik van Riel wrote:
> On Mon, 21 Jul 2008 15:49:00 +1000
>
> Nick Piggin <[email protected]> wrote:
> > It is already bad because: if you are doing a big streaming copy
> > which you know is going to blow the cache and not be used again,
> > then you should be unmapping behind you as you go.
>
> MADV_SEQUENTIAL exists for a reason.

AFAIKS it is to open up readahead mainly. Because it is quite reasonable
to otherwise be much more conservative about readahead than with regular
reads (and of course you can't do big chunks per kernel entry)...

I don't actually care what the man page or posix says if it is obviously
silly behaviour. If you want to dispute the technical points of my post,
that would be helpful.


> If you think that doing an automatic unmap-behind will be
> a better way to go, we can certainly whip up a patch for
> that...

I don't. Don't let me stop you trying of course :)

Consider this: if the app already has dedicated knowledge and
syscalls to know about this big sequential copy, then it should
go about doing it the *right* way and really get performance
improvement. Automatic unmap-behind even if it was perfect still
needs to scan LRU lists to reclaim.

2008-07-22 02:36:39

by Rik van Riel

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

On Tue, 22 Jul 2008 12:02:26 +1000
Nick Piggin <[email protected]> wrote:

> I don't actually care what the man page or posix says if it is obviously
> silly behaviour. If you want to dispute the technical points of my post,
> that would be helpful.

Application writers read the man page and expect MADV_SEQUENTIAL
to do roughly what the name and description imply.

If you think that the kernel should not bother implementing
what the application writers expect, and the application writers
should implement special drop-behind magic for Linux, your
expectations may not be entirely realistic.

> Consider this: if the app already has dedicated knowledge and
> syscalls to know about this big sequential copy, then it should
> go about doing it the *right* way and really get performance
> improvement. Automatic unmap-behind even if it was perfect still
> needs to scan LRU lists to reclaim.

Doing nothing _also_ ends up with the kernel scanning the
LRU lists, once memory fills up.

Scanning the LRU lists is a given.

All that the patch by Johannes does is make sure the kernel
does the right thing when it runs into an MADV_SEQUENTIAL
page on the inactive_file list: evict the page immediately,
instead of having it pass through the active list and the
inactive list again.

This reduces the number of times that MADV_SEQUENTIAL pages
get scanned from 3 to 1, while protecting the working set
from MADV_SEQUENTIAL pages.

--
All rights reversed.

2008-07-22 02:54:51

by Nick Piggin

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

On Tuesday 22 July 2008 12:36, Rik van Riel wrote:
> On Tue, 22 Jul 2008 12:02:26 +1000
>
> Nick Piggin <[email protected]> wrote:
> > I don't actually care what the man page or posix says if it is obviously
> > silly behaviour. If you want to dispute the technical points of my post,
> > that would be helpful.
>
> Application writers read the man page and expect MADV_SEQUENTIAL
> to do roughly what the name and description imply.
>
> If you think that the kernel should not bother implementing
> what the application writers expect, and the application writers
> should implement special drop-behind magic for Linux, your
> expectations may not be entirely realistic.

The simple fact is that if you already have the knowledge and custom
code for sequentially accessed mappings, then if you know the pages
are not going to be used, there is a *far* better way to do it by
unmapping them than the kernel will ever be able to do itself.

Also, it would be perfectly valid to want a sequentially accessed
mapping but not want to drop the pages early.

What we should do is update the man page now rather than try adding
things to support it.


> > Consider this: if the app already has dedicated knowledge and
> > syscalls to know about this big sequential copy, then it should
> > go about doing it the *right* way and really get performance
> > improvement. Automatic unmap-behind even if it was perfect still
> > needs to scan LRU lists to reclaim.
>
> Doing nothing _also_ ends up with the kernel scanning the
> LRU lists, once memory fills up.

But we are not doing nothing because we already know and have coded
for the fact that the mapping will be accessed once, sequentially.
Now that we have gone this far, we should actually do it properly and
1. unmap after use, 2. POSIX_FADV_DONTNEED after use. This will give
you much better performance and cache behaviour than any automatic
detection scheme, and it doesn't introduce any regressions for existing
code.


> Scanning the LRU lists is a given.

It is not.


> All that the patch by Johannes does is make sure the kernel
> does the right thing when it runs into an MADV_SEQUENTIAL
> page on the inactive_file list: evict the page immediately,
> instead of having it pass through the active list and the
> inactive list again.
>
> This reduces the number of times that MADV_SEQUENTIAL pages
> get scanned from 3 to 1, while protecting the working set
> from MADV_SEQUENTIAL pages.

We should update the man page. And seeing as Linux had never preferred
to drop behind *before* now, it is crazy to add such a feature that
we will then have a much harder time to remove, given that it is
clearly suboptimal. Update the man page to sketch the *correct* way to
optimise this type of access.

2008-07-22 03:04:51

by Rik van Riel

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

On Tue, 22 Jul 2008 12:54:28 +1000
Nick Piggin <[email protected]> wrote:
> On Tuesday 22 July 2008 12:36, Rik van Riel wrote:
> > On Tue, 22 Jul 2008 12:02:26 +1000
> >
> > Nick Piggin <[email protected]> wrote:
> > > I don't actually care what the man page or posix says if it is obviously
> > > silly behaviour. If you want to dispute the technical points of my post,
> > > that would be helpful.
> >
> > Application writers read the man page and expect MADV_SEQUENTIAL
> > to do roughly what the name and description imply.
> >
> > If you think that the kernel should not bother implementing
> > what the application writers expect, and the application writers
> > should implement special drop-behind magic for Linux, your
> > expectations may not be entirely realistic.
>
> The simple fact is that if you already have the knowledge and custom
> code for sequentially accessed mappings, then if you know the pages
> are not going to be used, there is a *far* better way to do it by
> unmapping them than the kernel will ever be able to do itself.

Applications are not developed just for Linux.

Application writers expect MADV_SEQUENTIAL to behave
a certain way and this 5 line patch implements that.

> Also, it would be perfectly valid to want a sequentially accessed
> mapping but not want to drop the pages early.

That is exactly what Johannes' patch does. All it does is
change the behaviour of pages that have already reached the
end of the LRU lists.

It does not do any kind of early drop-behind or other strange
magic.

> > > Consider this: if the app already has dedicated knowledge and
> > > syscalls to know about this big sequential copy, then it should
> > > go about doing it the *right* way and really get performance
> > > improvement. Automatic unmap-behind even if it was perfect still
> > > needs to scan LRU lists to reclaim.
> >
> > Doing nothing _also_ ends up with the kernel scanning the
> > LRU lists, once memory fills up.
>
> But we are not doing nothing because we already know and have coded
> for the fact that the mapping will be accessed once, sequentially.
> Now that we have gone this far, we should actually do it properly and
> 1. unmap after use, 2. POSIX_FADV_DONTNEED after use. This will give
> you much better performance and cache behaviour than any automatic
> detection scheme, and it doesn't introduce any regressions for existing
> code.

If you run just one instance of the application!

Think about something like an ftp server or a media server,
where you want to cache the data that is served up many
times, while evicting the data that got served just once.

The kernel has much better knowledge of what the aggregate
of all processes in the system are doing than any individual
process has.

> > Scanning the LRU lists is a given.
>
> It is not.

Regardless of whether or not the application unmaps the pages
by itself, the pages will still be on the LRU lists.

--
All rights reversed.

2008-07-22 03:43:56

by Nick Piggin

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

On Tuesday 22 July 2008 13:04, Rik van Riel wrote:
> On Tue, 22 Jul 2008 12:54:28 +1000
>
> Nick Piggin <[email protected]> wrote:
> > On Tuesday 22 July 2008 12:36, Rik van Riel wrote:
> > > On Tue, 22 Jul 2008 12:02:26 +1000
> > >
> > > Nick Piggin <[email protected]> wrote:
> > > > I don't actually care what the man page or posix says if it is
> > > > obviously silly behaviour. If you want to dispute the technical
> > > > points of my post, that would be helpful.
> > >
> > > Application writers read the man page and expect MADV_SEQUENTIAL
> > > to do roughly what the name and description imply.
> > >
> > > If you think that the kernel should not bother implementing
> > > what the application writers expect, and the application writers
> > > should implement special drop-behind magic for Linux, your
> > > expectations may not be entirely realistic.
> >
> > The simple fact is that if you already have the knowledge and custom
> > code for sequentially accessed mappings, then if you know the pages
> > are not going to be used, there is a *far* better way to do it by
> > unmapping them than the kernel will ever be able to do itself.
>
> Applications are not developed just for Linux.
>
> Application writers expect MADV_SEQUENTIAL to behave
> a certain way and this 5 line patch implements that.

OK well applications will still work fine without it, and the ones that
really want high performance and low cache pollution on any other platform
will not be doing this anyway.

I don't care if it is 5 or 500 lines, actually; I can see what it does, but
I am arguing that it is not a good idea anyway. I have certainly not seen
anything to justify adding overhead here.


> > Also, it would be perfectly valid to want a sequentially accessed
> > mapping but not want to drop the pages early.
>
> That is exactly what Johannes' patch does. All it does is
> change the behaviour of pages that have already reached the
> end of the LRU lists.
>
> It does not do any kind of early drop-behind or other strange
> magic.

I don't understand what you are getting at here. It exactly does put a
lower priority on the page access, but what I am saying is that you may
not want that.



> > > > Consider this: if the app already has dedicated knowledge and
> > > > syscalls to know about this big sequential copy, then it should
> > > > go about doing it the *right* way and really get performance
> > > > improvement. Automatic unmap-behind even if it was perfect still
> > > > needs to scan LRU lists to reclaim.
> > >
> > > Doing nothing _also_ ends up with the kernel scanning the
> > > LRU lists, once memory fills up.
> >
> > But we are not doing nothing because we already know and have coded
> > for the fact that the mapping will be accessed once, sequentially.
> > Now that we have gone this far, we should actually do it properly and
> > 1. unmap after use, 2. POSIX_FADV_DONTNEED after use. This will give
> > you much better performance and cache behaviour than any automatic
> > detection scheme, and it doesn't introduce any regressions for existing
> > code.
>
> If you run just one instance of the application!
>
> Think about something like an ftp server or a media server,
> where you want to cache the data that is served up many
> times, while evicting the data that got served just once.
>
> The kernel has much better knowledge of what the aggregate
> of all processes in the system are doing than any individual
> process has.

That's true, but this case isn't really very good anyway. The information
goes away after you drop the mapping anyway. Or did you hope that the
backup program or indexer keeps all those mappings open until all the pages
have filtered through? Or maybe we can add yet more branches into the unmap
path to test for this flag as well?

I don't think it is a good idea to add random things just because they seem
at first glance like a good idea.


> > > Scanning the LRU lists is a given.
> >
> > It is not.
>
> Regardless of whether or not the application unmaps the pages
> by itself, the pages will still be on the LRU lists.

I clearly was talking about FADV_DONTNEED. I know the pages will be on LRU
lists after unmapping.

2008-07-22 03:49:52

by Nick Piggin

Subject: Re: [PATCH -mm] mm: more likely reclaim MADV_SEQUENTIAL mappings

On Tuesday 22 July 2008 13:43, Nick Piggin wrote:
> On Tuesday 22 July 2008 13:04, Rik van Riel wrote:
> > On Tue, 22 Jul 2008 12:54:28 +1000

> > > But we are not doing nothing because we already know and have coded
> > > for the fact that the mapping will be accessed once, sequentially.
> > > Now that we have gone this far, we should actually do it properly and
> > > 1. unmap after use, 2. POSIX_FADV_DONTNEED after use. This will give
> > > you much better performance and cache behaviour than any automatic
> > > detection scheme, and it doesn't introduce any regressions for existing
> > > code.
> >
> > If you run just one instance of the application!
> >
> > Think about something like an ftp server or a media server,
> > where you want to cache the data that is served up many
> > times, while evicting the data that got served just once.
> >
> > The kernel has much better knowledge of what the aggregate
> > of all processes in the system are doing than any individual
> > process has.
>
> That's true, but this case isn't really very good anyway. The information
> goes away after you drop the mapping anyway. Or did you hope that the
> backup program or indexer keeps all those mappings open until all the pages
> have filtered through? Or maybe we can add yet more branches into the unmap
> path to test for this flag as well?
>
> I don't think it is a good idea to add random things just because they seem
> at first glance like a good idea.

BTW, in the backup of a busy fileserver or some case like that, I'd
bet that even using FADV_DONTNEED would be much faster than leaving
these mappings around for reclaim to drop, due to the decreased churn
on the LRUs overall.