LinuxLists.cc - fadvise DONTNEED implementation (or lack thereof)

2010-11-04 05:59:06

Subject: fadvise DONTNEED implementation (or lack thereof)

I've recently been trying to track down the root cause of my server's
persistent issue of thrashing horribly after being left inactive. It
seems that the issue is likely my nightly backup schedule (using rsync)
which traverses my entire 50GB home directory. I was surprised to find
that rsync does not use fadvise to notify the kernel of its use-once
data usage pattern.

It looks like a patch[1] was written (although never merged, it seems)
incorporating fadvise support, but I found its implementation rather
odd, using mincore() and FADV_DONTNEED to kick out only regions brought
in by rsync. It seemed to me the simpler and more appropriate solution
would be to simply flag every touched file with FADV_NOREUSE and let the
kernel manage automatically expelling used pages.

After looking deeper into the kernel implementation[2] of fadvise() the
reason for using DONTNEED became more apparent. It seems that the kernel
implements NOREUSE as a no-op. A little googling revealed[3] that I am not
the first person to encounter this limitation. It looks like a few
folks[4] have discussed addressing the issue in the past, but nothing
has happened as of 2.6.36. Are there plans to implement this
functionality in the near future? It seems like the utility of fadvise
is severely limited by lacking support for NOREUSE.

Cheers,

- Ben

[1] http://insights.oetiker.ch/linux/fadvise.html
[2] http://lxr.free-electrons.com/source/mm/fadvise.c?a=avr32
[3] https://issues.apache.org/jira/browse/CASSANDRA-1470
http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html
[4] http://www.mail-archive.com/[email protected]/msg179576.html
http://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0442.html

2010-11-09 07:28:11

by KOSAKI Motohiro

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

> I've recently been trying to track down the root cause of my server's
> persistent issue of thrashing horribly after being left inactive. It
> seems that the issue is likely my nightly backup schedule (using rsync)
> which traverses my entire 50GB home directory. I was surprised to find
> that rsync does not use fadvise to notify the kernel of its use-once
> data usage pattern.
>
> It looks like a patch[1] was written (although never merged, it seems)
> incorporating fadvise support, but I found its implementation rather
> odd, using mincore() and FADV_DONTNEED to kick out only regions brought
> in by rsync. It seemed to me the simpler and more appropriate solution
> would be to simply flag every touched file with FADV_NOREUSE and let the
> kernel manage automatically expelling used pages.
>
> After looking deeper into the kernel implementation[2] of fadvise() the
> reason for using DONTNEED became more apparant. It seems that the kernel
> implements NOREUSE as a noop. A little googling revealed[3] that I not
> the first person to encounter this limitation. It looks like a few
> folks[4] have discussed addressing the issue in the past, but nothing
> has happened as of 2.6.36. Are there plans to implement this
> functionality in the near future? It seems like the utility of fadvise
> is severely limited by lacking support for NOREUSE.

By the way, other OSes don't seem to implement it either.
For example:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/posix_fadvise.c

/*
 * SUSv3 - file advisory information
 *
 * This function does nothing, but that's OK because the
 * Posix specification doesn't require it to do anything
 * other than return appropriate error numbers.
 *
 * In the future, a file system dependent fadvise() or fcntl()
 * interface, similar to madvise(), should be developed to enable
 * the kernel to optimize I/O operations based on the given advice.
 */

/* ARGSUSED1 */
int
posix_fadvise(int fd, off_t offset, off_t len, int advice)
{
    struct stat64 statb;

    switch (advice) {
    case POSIX_FADV_NORMAL:
    case POSIX_FADV_RANDOM:
    case POSIX_FADV_SEQUENTIAL:
    case POSIX_FADV_WILLNEED:
    case POSIX_FADV_DONTNEED:
    case POSIX_FADV_NOREUSE:
        break;
    default:
        return (EINVAL);
    }
    if (len < 0)
        return (EINVAL);
    if (fstat64(fd, &statb) != 0)
        return (EBADF);
    if (S_ISFIFO(statb.st_mode))
        return (ESPIPE);
    return (0);
}

So, I don't think application developers will use fadvise() aggressively,
because we don't have cross-platform agreement on fadvise() behavior.

2010-11-09 08:03:15

by KOSAKI Motohiro

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

> > I've recently been trying to track down the root cause of my server's
> > persistent issue of thrashing horribly after being left inactive. It
> > seems that the issue is likely my nightly backup schedule (using rsync)
> > which traverses my entire 50GB home directory. I was surprised to find
> > that rsync does not use fadvise to notify the kernel of its use-once
> > data usage pattern.
> >
> > It looks like a patch[1] was written (although never merged, it seems)
> > incorporating fadvise support, but I found its implementation rather
> > odd, using mincore() and FADV_DONTNEED to kick out only regions brought
> > in by rsync. It seemed to me the simpler and more appropriate solution
> > would be to simply flag every touched file with FADV_NOREUSE and let the
> > kernel manage automatically expelling used pages.
> >
> > After looking deeper into the kernel implementation[2] of fadvise() the
> > reason for using DONTNEED became more apparant. It seems that the kernel
> > implements NOREUSE as a noop. A little googling revealed[3] that I not
> > the first person to encounter this limitation. It looks like a few
> > folks[4] have discussed addressing the issue in the past, but nothing
> > has happened as of 2.6.36. Are there plans to implement this
> > functionality in the near future? It seems like the utility of fadvise
> > is severely limited by lacking support for NOREUSE.
>
> btw, Other OSs seems to also don't implement it.
> example,

I've heard about the status of fadvise() on other OSes via private mail.

NetBSD: no-op (as on Linux)
FreeBSD/DragonflyBSD/OpenBSD: posix_fadvise(2) does not exist

2010-11-09 12:54:14

by Ben Gamari

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Tue, 9 Nov 2010 16:28:02 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
> So, I don't think application developers will use fadvise() aggressively
> because we don't have a cross platform agreement of a fadvice behavior.
>
I strongly disagree. For a long time I have been trying to resolve
interactivity issues caused by my rsync-based backup script. Many kernel
developers have said that there is nothing the kernel can do without
more information from user-space (e.g. cgroups, madvise). While cgroups
help, the fix is round-about at best and requires configuration where
really none should be necessary. The easiest solution for everyone
involved would be for rsync to use FADV_DONTNEED. The behavior doesn't
need to be perfectly consistent between platforms for the flag to be
useful so long as each implementation does something sane to help
use-once access patterns.

People seem to mention frequently that there are no users of
FADV_DONTNEED and therefore we don't need to implement it. It seems like
this is ignoring an obvious catch-22. Currently rsync has no fadvise
support at all, since using[1] the implemented hints to get the desired
effect is far too complicated^M^M^M^Mhacky to be considered
merge-worthy. Considering the number of Google hits returned for
fadvise, I wouldn't be surprised if there were countless other projects
with this same difficulty. We want to be able to tell the kernel about
our usage patterns, but the kernel won't listen.

Cheers,

- Ben

[1] http://insights.oetiker.ch/linux/fadvise.html

2010-11-14 05:09:34

by KOSAKI Motohiro

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

> On Tue, 9 Nov 2010 16:28:02 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
> > So, I don't think application developers will use fadvise() aggressively
> > because we don't have a cross platform agreement of a fadvice behavior.
> >
> I strongly disagree. For a long time I have been trying to resolve
> interactivity issues caused by my rsync-based backup script. Many kernel
> developers have said that there is nothing the kernel can do without
> more information from user-space (e.g. cgroups, madvise). While cgroups
> help, the fix is round-about at best and requires configuration where
> really none should be necessary. The easiest solution for everyone
> involved would be for rsync to use FADV_DONTNEED. The behavior doesn't
> need to be perfectly consistent between platforms for the flag to be
> useful so long as each implementation does something sane to help
> use-once access patterns.
>
> People seem to mention frequently that there are no users of
> FADV_DONTNEED and therefore we don't need to implement it. It seems like
> this is ignoring an obvious catch-22. Currently rsync has no fadvise
> support at all, since using[1] the implemented hints to get the desired
> effect is far too complicated^M^M^M^Mhacky to be considered
> merge-worthy. Considering the number of Google hits returned for
> fadvise, I wouldn't be surprised if there were countless other projects
> with this same difficulty. We want to be able to tell the kernel about
> our useage patterns, but the kernel won't listen.

Because we have an alternative solution already. Please try memcgroup :)

2010-11-14 05:21:08

by Ben Gamari

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Sun, 14 Nov 2010 14:09:29 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
> Because we have an alternative solution already. please try memcgroup :)
>
Alright, fair enough. It still seems like there are many cases where
fadvise seems more appropriate, but memcg should at least satisfy my
personal needs so I'll shut up now. Thanks!

- Ben

2010-11-14 21:41:45

by Brian K. White

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On 11/14/2010 12:20 AM, Ben Gamari wrote:
> On Sun, 14 Nov 2010 14:09:29 +0900 (JST), KOSAKI Motohiro<[email protected]> wrote:
>> Because we have an alternative solution already. please try memcgroup :)
>>
> Alright, fair enough. It still seems like there are many cases where
> fadvise seems more appropriate, but memcg should at least satisfy my
> personal needs so I'll shut up now. Thanks!
>
> - Ben

Could someone expand on this a little?

The "there are no users of this feature" argument is indeed a silly one.
I've only wanted the ability to perform i/o without poisoning the cache
since oh, 10 or more years ago at least. It really hurts my users since
they are all direct login interactive db app users. No load balancing
web interface can hide the fact when a box goes to a crawl.

How would one use memcgroup to prevent a backup or other large file
operation from wiping out the cache with used-once garbage?

(note for rsync in particular, how does this help rsync on other platforms?)

--
bkw
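
As a partial answer to the "how" question, here is a hedged sketch of what using the memory cgroup controller could look like, written in C for consistency with the rest of the thread. It assumes the cgroup v1 memory controller is mounted, that a group directory for the backup job already exists (its path is whatever the administrator chose and is passed in), and that the caller has permission to write to it. The file names memory.limit_in_bytes and tasks are the cgroup v1 control files; the function names are hypothetical. Page cache filled by a task in the group is charged to the group, so reclaim pressure stays inside the limit.

```c
#include <stdio.h>
#include <sys/types.h>

/* Write a single decimal value to <dir>/<file>. Returns 0 on success. */
static int write_value(const char *dir, const char *file, long value)
{
    char path[4096];
    FILE *f;
    int n, ok;

    n = snprintf(path, sizeof path, "%s/%s", dir, file);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%ld\n", value) > 0;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}

/*
 * Cap the memory (including page cache) charged to an existing cgroup v1
 * memory group and move process pid into it. Returns 0 on success.
 */
int memcg_confine(const char *cgroup_dir, long limit_bytes, pid_t pid)
{
    if (write_value(cgroup_dir, "memory.limit_in_bytes", limit_bytes) != 0)
        return -1;
    return write_value(cgroup_dir, "tasks", (long)pid);
}
```

A backup wrapper would call memcg_confine() with its own pid before exec'ing rsync. This is Linux-specific configuration, so it does nothing for rsync on other platforms.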

2010-11-15 06:07:59

by Minchan Kim

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Sun, Nov 14, 2010 at 2:09 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> On Tue, 9 Nov 2010 16:28:02 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
>> > So, I don't think application developers will use fadvise() aggressively
>> > because we don't have a cross platform agreement of a fadvice behavior.
>> >
>> I strongly disagree. For a long time I have been trying to resolve
>> interactivity issues caused by my rsync-based backup script. Many kernel
>> developers have said that there is nothing the kernel can do without
>> more information from user-space (e.g. cgroups, madvise). While cgroups
>> help, the fix is round-about at best and requires configuration where
>> really none should be necessary. The easiest solution for everyone
>> involved would be for rsync to use FADV_DONTNEED. The behavior doesn't
>> need to be perfectly consistent between platforms for the flag to be
>> useful so long as each implementation does something sane to help
>> use-once access patterns.
>>
>> People seem to mention frequently that there are no users of
>> FADV_DONTNEED and therefore we don't need to implement it. It seems like
>> this is ignoring an obvious catch-22. Currently rsync has no fadvise
>> support at all, since using[1] the implemented hints to get the desired
>> effect is far too complicated^M^M^M^Mhacky to be considered
>> merge-worthy. Considering the number of Google hits returned for
>> fadvise, I wouldn't be surprised if there were countless other projects
>> with this same difficulty. We want to be able to tell the kernel about
>> our useage patterns, but the kernel won't listen.
>
> Because we have an alternative solution already. please try memcgroup :)

I think memcg could be a solution for them, but the fundamental solution is
to cure it in the VM itself.
I feel it's absolutely absurd to have to enable and use memcg to work around it.

I wonder what the problem was with Peter's 'drop behind' patch.
http://www.mail-archive.com/[email protected]/msg179576.html

Could anyone tell me why it wasn't accepted upstream?

--
Kind regards,
Minchan Kim

2010-11-15 07:09:45

by KOSAKI Motohiro

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

> > Because we have an alternative solution already. please try memcgroup :)
>
> I think memcg could be a solution of them but fundamental solution is
> that we have to cure it in VM itself.
> I feel it's absolutely absurd to enable and use memcg for amending it.
>
> I wonder what's the problem in Peter's patch 'drop behind'.
> http://www.mail-archive.com/[email protected]/msg179576.html
>
> Could anyone tell me why it can't accept upstream?

I don't know the reason, and this one looks reasonable to me. I'm curious whether
the above patch solves the rsync issue.
Minchan, have you tested it yourself?

2010-11-15 07:19:51

by Minchan Kim

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Mon, Nov 15, 2010 at 4:09 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> > Because we have an alternative solution already. please try memcgroup :)
>>
>> I think memcg could be a solution of them but fundamental solution is
>> that we have to cure it in VM itself.
>> I feel it's absolutely absurd to enable and use memcg for amending it.
>>
>> I wonder what's the problem in Peter's patch 'drop behind'.
>> http://www.mail-archive.com/[email protected]/msg179576.html
>>
>> Could anyone tell me why it can't accept upstream?
>
> I don't know the reason. And this one looks reasonable to me. I'm curious the above
> patch solve rsync issue or not.
> Minchan, have you tested it yourself?

Not yet. :)
If we all think it's reasonable, it would be worthwhile to rebase it
onto the current mmotm and see the effect.

--
Kind regards,
Minchan Kim

2010-11-15 07:28:43

by KOSAKI Motohiro

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

> On Mon, Nov 15, 2010 at 4:09 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> > Because we have an alternative solution already. please try memcgroup :)
> >>
> >> I think memcg could be a solution of them but fundamental solution is
> >> that we have to cure it in VM itself.
> >> I feel it's absolutely absurd to enable and use memcg for amending it.
> >>
> >> I wonder what's the problem in Peter's patch 'drop behind'.
> >> http://www.mail-archive.com/[email protected]/msg179576.html
> >>
> >> Could anyone tell me why it can't accept upstream?
> >
> > I don't know the reason. And this one looks reasonable to me. I'm curious the above
> > patch solve rsync issue or not.
> > Minchan, have you tested it yourself?
>
> Still yet. :)
> If we all think it's reasonable, it would be valuable to adjust it
> with current mmotm and see the effect.

Who can write a test suite with an rsync-like I/O pattern? A code change is easy,
but confirming that it is justified is much harder work.

2010-11-15 07:46:21

by Minchan Kim

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Mon, Nov 15, 2010 at 4:28 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> On Mon, Nov 15, 2010 at 4:09 PM, KOSAKI Motohiro
>> <[email protected]> wrote:
>> >> > Because we have an alternative solution already. please try memcgroup :)
>> >>
>> >> I think memcg could be a solution of them but fundamental solution is
>> >> that we have to cure it in VM itself.
>> >> I feel it's absolutely absurd to enable and use memcg for amending it.
>> >>
>> >> I wonder what's the problem in Peter's patch 'drop behind'.
>> >> http://www.mail-archive.com/[email protected]/msg179576.html
>> >>
>> >> Could anyone tell me why it can't accept upstream?
>> >
>> > I don't know the reason. And this one looks reasonable to me. I'm curious the above
>> > patch solve rsync issue or not.
>> > Minchan, have you tested it yourself?
>>
>> Still yet. :)
>> If we all think it's reasonable, it would be valuable to adjust it
>> with current mmotm and see the effect.
>
> Who can make rsync like io pattern test suite? a code change is easy. but
> to comfirm justification is more harder work.

Maybe Ben or Brian, who reported the problem. :)

--
Kind regards,
Minchan Kim

2010-11-15 08:46:56

by Peter Zijlstra

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Mon, 2010-11-15 at 15:07 +0900, Minchan Kim wrote:
> On Sun, Nov 14, 2010 at 2:09 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> On Tue, 9 Nov 2010 16:28:02 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
> >> > So, I don't think application developers will use fadvise() aggressively
> >> > because we don't have a cross platform agreement of a fadvice behavior.
> >> >
> >> I strongly disagree. For a long time I have been trying to resolve
> >> interactivity issues caused by my rsync-based backup script. Many kernel
> >> developers have said that there is nothing the kernel can do without
> >> more information from user-space (e.g. cgroups, madvise). While cgroups
> >> help, the fix is round-about at best and requires configuration where
> >> really none should be necessary. The easiest solution for everyone
> >> involved would be for rsync to use FADV_DONTNEED. The behavior doesn't
> >> need to be perfectly consistent between platforms for the flag to be
> >> useful so long as each implementation does something sane to help
> >> use-once access patterns.
> >>
> >> People seem to mention frequently that there are no users of
> >> FADV_DONTNEED and therefore we don't need to implement it. It seems like
> >> this is ignoring an obvious catch-22. Currently rsync has no fadvise
> >> support at all, since using[1] the implemented hints to get the desired
> >> effect is far too complicated^M^M^M^Mhacky to be considered
> >> merge-worthy. Considering the number of Google hits returned for
> >> fadvise, I wouldn't be surprised if there were countless other projects
> >> with this same difficulty. We want to be able to tell the kernel about
> >> our useage patterns, but the kernel won't listen.
> >
> > Because we have an alternative solution already. please try memcgroup :)

Using memcgroup for this is utter crap, it just contains the trainwreck,
it doesn't solve it in any way.

> I think memcg could be a solution of them but fundamental solution is
> that we have to cure it in VM itself.
> I feel it's absolutely absurd to enable and use memcg for amending it.

Agreed..

> I wonder what's the problem in Peter's patch 'drop behind'.
> http://www.mail-archive.com/[email protected]/msg179576.html
>
> Could anyone tell me why it can't accept upstream?

Read the thread; it's quite clear nobody was convinced it was a good idea,
and people wanted to fix the use-once policy instead. Then Rik rewrote all of
page-reclaim.

2010-11-15 09:05:04

by Minchan Kim

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Mon, Nov 15, 2010 at 5:47 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2010-11-15 at 15:07 +0900, Minchan Kim wrote:
>> On Sun, Nov 14, 2010 at 2:09 PM, KOSAKI Motohiro
>> <[email protected]> wrote:
>> >> On Tue, 9 Nov 2010 16:28:02 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
>> >> > So, I don't think application developers will use fadvise() aggressively
>> >> > because we don't have a cross platform agreement of a fadvice behavior.
>> >> >
>> >> I strongly disagree. For a long time I have been trying to resolve
>> >> interactivity issues caused by my rsync-based backup script. Many kernel
>> >> developers have said that there is nothing the kernel can do without
>> >> more information from user-space (e.g. cgroups, madvise). While cgroups
>> >> help, the fix is round-about at best and requires configuration where
>> >> really none should be necessary. The easiest solution for everyone
>> >> involved would be for rsync to use FADV_DONTNEED. The behavior doesn't
>> >> need to be perfectly consistent between platforms for the flag to be
>> >> useful so long as each implementation does something sane to help
>> >> use-once access patterns.
>> >>
>> >> People seem to mention frequently that there are no users of
>> >> FADV_DONTNEED and therefore we don't need to implement it. It seems like
>> >> this is ignoring an obvious catch-22. Currently rsync has no fadvise
>> >> support at all, since using[1] the implemented hints to get the desired
>> >> effect is far too complicated^M^M^M^Mhacky to be considered
>> >> merge-worthy. Considering the number of Google hits returned for
>> >> fadvise, I wouldn't be surprised if there were countless other projects
>> >> with this same difficulty. We want to be able to tell the kernel about
>> >> our useage patterns, but the kernel won't listen.
>> >
>> > Because we have an alternative solution already. please try memcgroup :)
>
> Using memcgroup for this is utter crap, it just contains the trainwreck,
> it doesn't solve it in any way.
>
>> I think memcg could be a solution of them but fundamental solution is
>> that we have to cure it in VM itself.
>> I feel it's absolutely absurd to enable and use memcg for amending it.
>
> Agreed..
>
>> I wonder what's the problem in Peter's patch 'drop behind'.
>> http://www.mail-archive.com/[email protected]/msg179576.html
>>
>> Could anyone tell me why it can't accept upstream?
>
> Read the thread, its quite clear nobody got convinced it was a good idea
> and wanted to fix the use-once policy, then Rik rewrote all of
> page-reclaim.
>

Thanks for the information.
I hope this is a chance to rethink it.
Rik, could you give us any comments on this idea?

--
Kind regards,
Minchan Kim

2010-11-15 09:10:55

by KOSAKI Motohiro

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

> > I wonder what's the problem in Peter's patch 'drop behind'.
> > http://www.mail-archive.com/[email protected]/msg179576.html
> >
> > Could anyone tell me why it can't accept upstream?
>
> Read the thread, its quite clear nobody got convinced it was a good idea
> and wanted to fix the use-once policy, then Rik rewrote all of
> page-reclaim.

If my understanding is correct, rsync touches data twice (once for a hash calculation
and once for a copy). If so, the current used-once heuristics still don't seem to work.
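
A workload that reproduces this "two touch" pattern could be sketched as below. This is not rsync's actual algorithm, only the access pattern described above: one pass over the file to compute a checksum, then a second pass to copy it. All names are hypothetical and the additive checksum is a stand-in for a real hash.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* First touch: read the file and compute a trivial additive checksum. */
static int checksum_pass(int fd, uint32_t *sum)
{
    unsigned char buf[65536];
    ssize_t n, i;

    *sum = 0;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (i = 0; i < n; i++)
            *sum += buf[i];
    return n < 0 ? -1 : 0;
}

/* Second touch: read the same data again and write it to out. */
static int copy_pass(int in, int out)
{
    char buf[65536];
    ssize_t n;

    while ((n = read(in, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0)
                return -1;
            off += w;
        }
    }
    return n < 0 ? -1 : 0;
}

/* Touch src twice (checksum, then copy to dst). Returns 0 on success. */
int two_touch_copy(const char *src, const char *dst, uint32_t *sum)
{
    int rc = -1, out, in = open(src, O_RDONLY);

    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out >= 0) {
        if (checksum_pass(in, sum) == 0 &&
            lseek(in, 0, SEEK_SET) == 0 &&
            copy_pass(in, out) == 0)
            rc = 0;
        if (close(out) != 0)
            rc = -1;
    }
    close(in);
    return rc;
}
```

Running two_touch_copy() over a large directory tree while an interactive workload is active would give a rough reproduction of the reported problem.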

2010-11-15 12:46:34

by Ben Gamari

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Mon, 15 Nov 2010 16:28:32 +0900 (JST), KOSAKI Motohiro <[email protected]> wrote:
> Who can make rsync like io pattern test suite? a code change is easy. but
> to comfirm justification is more harder work.
>
I'm afraid I don't have time to work up any code. I would be happy to
try the patch with my backup use-case though. I'll just have to think
of an objective way of measuring the result.

- Ben

2010-11-15 14:48:58

by Rik van Riel

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On 11/15/2010 04:05 AM, Minchan Kim wrote:
> On Mon, Nov 15, 2010 at 5:47 PM, Peter Zijlstra<[email protected]> wrote:
>> On Mon, 2010-11-15 at 15:07 +0900, Minchan Kim wrote:

>>> I wonder what's the problem in Peter's patch 'drop behind'.
>>> http://www.mail-archive.com/[email protected]/msg179576.html
>>>
>>> Could anyone tell me why it can't accept upstream?
>>
>> Read the thread, its quite clear nobody got convinced it was a good idea
>> and wanted to fix the use-once policy, then Rik rewrote all of
>> page-reclaim.
>>
>
> Thanks for the information.
> I hope this is a chance to rethink about it.
> Rik, Could you give us to any comment about this idea?

At the time, there were all kinds of general problems
in page reclaim that all needed to be fixed. Peter's
patch was mostly a band-aid for streaming IO.

However, now that most of the other page reclaim problems
seem to have been resolved, it would be worthwhile to test
whether Peter's drop-behind approach gives an additional
improvement.

I could see it help by getting rid of already-read pages
earlier, leaving more space for read-ahead data.

I suspect it would do fairly little to protect the working
set, because we do not scan the active file list at all
unless it grows to be larger than the inactive file list.

--
All rights reversed

2010-11-17 10:16:22

by Minchan Kim

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Mon, Nov 15, 2010 at 11:48 PM, Rik van Riel <[email protected]> wrote:
> On 11/15/2010 04:05 AM, Minchan Kim wrote:
>>
>> On Mon, Nov 15, 2010 at 5:47 PM, Peter Zijlstra<[email protected]> wrote:
>>>
>>> On Mon, 2010-11-15 at 15:07 +0900, Minchan Kim wrote:
>
>>>> I wonder what's the problem in Peter's patch 'drop behind'.
>>>> http://www.mail-archive.com/[email protected]/msg179576.html
>>>>
>>>> Could anyone tell me why it can't accept upstream?
>>>
>>> Read the thread, its quite clear nobody got convinced it was a good idea
>>> and wanted to fix the use-once policy, then Rik rewrote all of
>>> page-reclaim.
>>>
>>
>> Thanks for the information.
>> I hope this is a chance to rethink about it.
>> Rik, Could you give us to any comment about this idea?

Sorry for the late reply, Rik.

> At the time, there were all kinds of general problems
> in page reclaim that all needed to be fixed. Peter's
> patch was mostly a band-aid for streaming IO.
>
> However, now that most of the other page reclaim problems
> seem to have been resolved, it would be worthwhile to test
> whether Peter's drop-behind approach gives an additional
> improvement.

Okay. I will take some time to create a workload for testing.

>
> I could see it help by getting rid of already-read pages
> earlier, leaving more space for read-ahead data.

Yes. Peter's logic skips demotion if the page is on the active list.
But I think if the page is only active because of something like rsync's
two touches, we have to move it to the tail of the inactive list even
though it's on the active list.
I will look into this, too.

>
> I suspect it would do fairly little to protect the working
> set, because we do not scan the active file list at all
> unless it grows to be larger than the inactive file list.

Absolutely. But what about rsync's two touches?
They can evict the working set.

I need some time to investigate.
Thanks for the comment.

>
> --
> All rights reversed
>

--
Kind regards,
Minchan Kim

2010-11-17 11:16:00

by Minchan Kim

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Wed, Nov 17, 2010 at 7:16 PM, Minchan Kim <[email protected]> wrote:
> On Mon, Nov 15, 2010 at 11:48 PM, Rik van Riel <[email protected]> wrote:
>> On 11/15/2010 04:05 AM, Minchan Kim wrote:
>>>
>>> On Mon, Nov 15, 2010 at 5:47 PM, Peter Zijlstra<[email protected]> wrote:
>>>>
>>>> On Mon, 2010-11-15 at 15:07 +0900, Minchan Kim wrote:
>>
>>>>> I wonder what's the problem in Peter's patch 'drop behind'.
>>>>> http://www.mail-archive.com/[email protected]/msg179576.html
>>>>>
>>>>> Could anyone tell me why it can't accept upstream?
>>>>
>>>> Read the thread, its quite clear nobody got convinced it was a good idea
>>>> and wanted to fix the use-once policy, then Rik rewrote all of
>>>> page-reclaim.
>>>>
>>>
>>> Thanks for the information.
>>> I hope this is a chance to rethink about it.
>>> Rik, Could you give us to any comment about this idea?
>
>
> Sorry for late reply, Rik.
>
>> At the time, there were all kinds of general problems
>> in page reclaim that all needed to be fixed. Peter's
>> patch was mostly a band-aid for streaming IO.
>>
>> However, now that most of the other page reclaim problems
>> seem to have been resolved, it would be worthwhile to test
>> whether Peter's drop-behind approach gives an additional
>> improvement.
>
> Okay. I will have a time to make the workload for testing.
>
>>
>> I could see it help by getting rid of already-read pages
>> earlier, leaving more space for read-ahead data.
>
> Yes. Peter's logic breaks demotion if the page is in active list.
> But I think if it's just active page like rsync's two touch, we have
> to move tail of inactive although it's in active list.
> I will look into this, too.

The most important thing is how to tell whether it's the real working set
or just an artifact of the two touches.
If that's very hard, Mandeep's recent patch could be another solution.
http://thread.gmane.org/gmane.linux.kernel.mm/54572
I will try it, too.

--
Kind regards,
Minchan Kim

2010-11-17 16:22:17

by Rik van Riel

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On 11/17/2010 05:16 AM, Minchan Kim wrote:

> Absolutely. But how about rsync's two touch?
> It can evict working set.
>
> I need the time for investigation.
> Thanks for the comment.

Maybe we could exempt MADV_SEQUENTIAL and FADV_SEQUENTIAL
touches from promoting the page to the active list?

Then we just need to make sure rsync uses fadvise properly
to keep the working set protected from rsync.

--
All rights reversed

2010-11-18 02:47:22

by Minchan Kim

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Thu, Nov 18, 2010 at 1:22 AM, Rik van Riel <[email protected]> wrote:
> On 11/17/2010 05:16 AM, Minchan Kim wrote:
>
>> Absolutely. But how about rsync's two touch?
>> It can evict working set.
>>
>> I need the time for investigation.
>> Thanks for the comment.
>
> Maybe we could exempt MADV_SEQUENTIAL and FADV_SEQUENTIAL
> touches from promoting the page to the active list?
>

The problem is non-mapped file pages.
Promotion of a non-mapped file page happens only through mark_page_accessed(),
but that function doesn't have enough information (e.g. the vma or file) to
prevent the promotion.
Hmm.. does anyone else have an idea?

Here is another idea.
The current problem is as follows.
A user can call fadvise with FADV_DONTNEED,
but it has no effect on dirty pages.
So the user has to sync dirty pages before calling fadvise with FADV_DONTNEED,
which costs performance.

Let's add some semantics to FADV_DONTNEED.
It invalidates only pages which are not dirty.
If it meets a dirty page, let's move the page to the inactive list's tail or head.
If we move the page to the tail, the shrinker can move it to the head again
for deferred write if it hasn't been written to the backing device yet.

> Then we just need to make sure rsync uses fadvise properly
> to keep the working set protected from rsync.
>
> --
> All rights reversed
>

--
Kind regards,
Minchan Kim

2010-11-18 03:25:16

by Rik van Riel

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On 11/17/2010 09:47 PM, Minchan Kim wrote:
> On Thu, Nov 18, 2010 at 1:22 AM, Rik van Riel<[email protected]> wrote:
>> On 11/17/2010 05:16 AM, Minchan Kim wrote:
>>
>>> Absolutely. But how about rsync's two touch?
>>> It can evict working set.
>>>
>>> I need the time for investigation.
>>> Thanks for the comment.
>>
>> Maybe we could exempt MADV_SEQUENTIAL and FADV_SEQUENTIAL
>> touches from promoting the page to the active list?
>>
>
> The problem is non-mapped file page.
> non-mapped file page promotion happens by only mark_page_accessed.
> But it doesn't enough information to prevent promotion(ex, vma or file)

I believe we have enough information in filemap.c and can just
pass that as a parameter to mark_page_accessed.

> Here is another idea.
> Current problem is following as.
> User can use fadivse with FADV_DONTNEED.
> But problem is that it can't affect when it meet dirty pages.
> So user have to sync dirty page before calling fadvise with FADV_DONTNEED.
> It would lose performance.
>
> Let's add some semantic of FADV_DONTNEED.
> It invalidates only pages which are not dirty.
> If it meets dirty page, let's move the page into inactive's tail or head.
> If we move the page into tail, shrinker can move it into head again
> for deferred write if it isn't written the backed device.

That sounds like a good idea.

--
All rights reversed

2010-11-18 03:46:47

by Minchan Kim

Subject: Re: fadvise DONTNEED implementation (or lack thereof)

On Thu, Nov 18, 2010 at 12:24 PM, Rik van Riel <[email protected]> wrote:
> On 11/17/2010 09:47 PM, Minchan Kim wrote:
>>
>> On Thu, Nov 18, 2010 at 1:22 AM, Rik van Riel<[email protected]> wrote:
>>>
>>> On 11/17/2010 05:16 AM, Minchan Kim wrote:
>>>
>>>> Absolutely. But how about rsync's two touch?
>>>> It can evict working set.
>>>>
>>>> I need the time for investigation.
>>>> Thanks for the comment.
>>>
>>> Maybe we could exempt MADV_SEQUENTIAL and FADV_SEQUENTIAL
>>> touches from promoting the page to the active list?
>>>
>>
>> The problem is non-mapped file page.
>> non-mapped file page promotion happens by only mark_page_accessed.
>> But it doesn't enough information to prevent promotion(ex, vma or file)
>
> I believe we have enough information in filemap.c and can just
> pass that as a parameter to mark_page_accessed.

FADV_SEQUENTIAL is a per-file/vma semantic and it is used in many places.
I think changing all those places isn't simple, and I don't want to add a
new structure to propagate the information to mark_page_accessed.

>
>> Here is another idea.
>> Current problem is following as.
>> User can use fadivse with FADV_DONTNEED.
>> But problem is that it can't affect when it meet dirty pages.
>> So user have to sync dirty page before calling fadvise with FADV_DONTNEED.
>> It would lose performance.
>>
>> Let's add some semantic of FADV_DONTNEED.
>> It invalidates only pages which are not dirty.
>> If it meets dirty page, let's move the page into inactive's tail or head.
>> If we move the page into tail, shrinker can move it into head again
>> for deferred write if it isn't written the backed device.
>
> That sounds like a good idea.

I will implement it.
Thanks, Rik.

>
> --
> All rights reversed
>

--
Kind regards,
Minchan Kim