2013-03-12 07:38:53

by Minchan Kim

Subject: [RFC v7 00/11] Support vrange for anonymous page

First of all, let's define the term.
From now on, I'd like to call it vrange (a.k.a. volatile range)
for anonymous pages. If you have a better name in mind, please suggest it.

This version is still *RFC* because it's just a quick prototype:
it doesn't support THP/HugeTLB/KSM and doesn't even build on !x86.
Before sorting out further issues, I'd like to post the current
direction and discuss it. Of course, I'd like to continue this
discussion at the coming LSF/MM.

In this version, I changed lots of things; especially, I removed the
VMA-based approach because it needs the write-side lock of mmap_sem,
which would hurt performance on multi-threaded big SMP systems, as
KOSAKI pointed out. The VMA-based approach also can't easily meet the
requirements of the new system call semantics John Stultz suggested
for consistent purged handling.
(http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)

I tested this patchset with a modified jemalloc allocator; the port
was led by Jason Evans (the jemalloc author), who was interested in
this feature and was happy to port his allocator to the new system call.
Super thanks, Jason!

The benchmark used for testing is ebizzy. It has been used for testing
allocator performance, so it suits my purpose. Again, thanks for
recommending the benchmark, Jason.
(http://people.freebsd.org/~kris/scaling/ebizzy.html)

The results are good on my machine (12 CPUs, 1.2GHz, 2G DRAM):

ebizzy -S 20

jemalloc-vanilla: 52389 records/sec
jemalloc-vrange: 203414 records/sec

ebizzy -S 20 with background memory pressure

jemalloc-vanilla: 40746 records/sec
jemalloc-vrange: 174910 records/sec

And the improvement is even bigger on a KVM virtual machine.

This patchset is based on v3.9-rc2

- What's the sys_vrange(addr, length, mode, behavior)?

It's a hint the user delivers to the kernel so the kernel can *discard*
pages in the range at any time. mode is one of VRANGE_VOLATILE and
VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is a memory pin operation, so the
kernel can no longer discard any pages, while VRANGE_VOLATILE is a
memory unpin operation, so the kernel can discard pages in the vrange
at any time. At the moment, behavior is one of VRANGE_FULL and
VRANGE_PARTIAL. VRANGE_FULL tells the kernel that once it decides to
discard pages in a vrange, it should discard all of the pages in the
selected victim vrange. VRANGE_PARTIAL tells the kernel that it may
discard only some of the pages in a vrange. VRANGE_PARTIAL handling
isn't implemented yet.
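
To make the semantics concrete, here is a minimal user-space sketch.
Note this is illustration only: the syscall number and flag values
below are hypothetical placeholders (the real ones come from the
syscall table and uapi headers in this series), and the convention
that the VRANGE_NOVOLATILE call returns non-zero when pages were
purged is my reading of remove_vrange in patch 02.

#include <unistd.h>
#include <sys/syscall.h>
#include <sys/mman.h>

#define __NR_vrange       314	/* hypothetical; take from the patched table */
#define VRANGE_VOLATILE     0	/* hypothetical */
#define VRANGE_NOVOLATILE   1	/* hypothetical */
#define VRANGE_FULL         0	/* hypothetical */

int main(void)
{
	size_t len = 16 * 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	buf[0] = 'x';	/* populate the pages */

	/* Unpin: from here the kernel may discard these pages anytime. */
	syscall(__NR_vrange, buf, len, VRANGE_VOLATILE, VRANGE_FULL);

	/* ... later, pin again before touching the data ... */
	if (syscall(__NR_vrange, buf, len, VRANGE_NOVOLATILE, VRANGE_FULL))
		buf[0] = 'x';	/* purged: regenerate the contents */

	return 0;
}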

- What happens if the user accesses a page (ie, virtual address)
discarded by the kernel?

The user can encounter SIGBUS.

- What should the user do to avoid SIGBUS?
He should call vrange(addr, length, VRANGE_NOVOLATILE, behavior)
before accessing a range on which vrange(addr, length,
VRANGE_VOLATILE, behavior) was called.

- What happens if the user accesses a page (ie, virtual address)
that wasn't discarded by the kernel?

The user sees the valid data that was there before the
vrange(., VRANGE_VOLATILE) call, without a page fault.

- What's different with madvise(DONTNEED)?

System call semantic

DONTNEED guarantees the user always sees zero-filled pages after he
calls madvise, while after vrange he may see the old data or encounter
SIGBUS.

Internal implementation

madvise(DONTNEED) has to zap all mapped pages in the range, so its
overhead increases linearly with the number of mapped pages. Worse,
if the user then writes to the zapped pages, each access costs a page
fault + page allocation + memset.

vrange just registers an address range instead of zapping all of the
ptes in the vma, so it doesn't touch the ptes at all.

- What's the benefit compared to DONTNEED?

1. The system call overhead is smaller because vrange just registers
a range in an interval tree instead of zapping all the pages in the
range, so it should be really cheap.

2. It has a chance to eliminate overhead (e.g., zapping ptes + page fault
+ page allocation + memset(PAGE_SIZE)) if memory pressure isn't
severe.

3. It has the potential to zap all ptes and free the pages if memory
pressure is severe, so discard-scanning overhead could be smaller - TODO

- What is this targeting?

Firstly, user-space allocators like ptmalloc and jemalloc, or the heap
management of virtual machines like Dalvik. It also comes in handy for
embedded systems which don't have a swap device and so can't reclaim
anonymous pages. By discarding instead of swapping out, it can be used
on non-swap systems.
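
As a rough illustration, an allocator could hook it into its chunk
management like the sketch below (same hypothetical syscall number and
flag values as in the earlier sketch; the chunk granularity and
regeneration policy are of course up to the allocator):

#include <unistd.h>
#include <sys/syscall.h>

#define __NR_vrange       314	/* hypothetical */
#define VRANGE_VOLATILE     0	/* hypothetical */
#define VRANGE_NOVOLATILE   1	/* hypothetical */
#define VRANGE_FULL         0	/* hypothetical */

/* Chunk goes on the allocator's free list: instead of
 * madvise(MADV_DONTNEED), let the kernel reclaim it lazily. */
static void chunk_retire(void *chunk, size_t size)
{
	syscall(__NR_vrange, chunk, size, VRANGE_VOLATILE, VRANGE_FULL);
}

/* Chunk is about to be handed out again: pin it first. Freed memory
 * carries no caller-visible contents, so nothing needs regenerating
 * even if some pages were purged meanwhile. */
static void chunk_reuse(void *chunk, size_t size)
{
	syscall(__NR_vrange, chunk, size, VRANGE_NOVOLATILE, VRANGE_FULL);
}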

Changelog from v6 - There are many changes.
* Remove vma-based approach
* Change system call semantic
* Add more meaningful experiment

Changelog from v5 - There are many changes.

* Support CONFIG_VOLATILE_PAGE
* Working with THP/KSM
* Remove vma hacking logic in m[no]volatile system call
* Discard page without swap cache
* Kswapd discard volatile page so we can discard volatile pages
although we don't have swap.

Changelog from v4

* Add new system call mvolatile/mnovolatile
* Add sigbus when user try to access volatile range
* Rebased on v3.7
* Applied bug fix from John Stultz, Thanks!

Changelog from v3

* Removing madvise(addr, length, MADV_NOVOLATILE).
* add vmstat about the number of discarded volatile pages
* discard volatile pages without promotion in reclaim path

Minchan Kim (11):
vrange: enable generic interval tree
add vrange basic data structure and functions
add new system call vrange(2)
add proc/pid/vrange information
Add purge operation
send SIGBUS when user tries to access purged page
keep mm_struct in vrange in system call context
add LRU handling for victim vrange
Get rid of dependency that all pages are from a zone in shrink_page_list
Purging vrange pages without swap
add purged page information in vmstat

arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/proc/base.c | 1 +
fs/proc/internal.h | 6 +
fs/proc/task_mmu.c | 129 ++++++
include/asm-generic/pgtable.h | 11 +
include/linux/mm_types.h | 5 +
include/linux/rmap.h | 15 +-
include/linux/swap.h | 1 +
include/linux/vm_event_item.h | 4 +
include/linux/vrange.h | 59 +++
include/uapi/asm-generic/mman-common.h | 5 +
init/main.c | 2 +
kernel/fork.c | 3 +
lib/Makefile | 2 +-
mm/Makefile | 2 +-
mm/ksm.c | 2 +-
mm/memory.c | 24 +-
mm/rmap.c | 23 +-
mm/swapfile.c | 36 ++
mm/vmscan.c | 74 +++-
mm/vmstat.c | 4 +
mm/vrange.c | 754 +++++++++++++++++++++++++++++++++
23 files changed, 1143 insertions(+), 22 deletions(-)
create mode 100644 include/linux/vrange.h
create mode 100644 mm/vrange.c

--
1.8.1.1


2013-03-12 07:38:58

by Minchan Kim

Subject: [RFC v7 01/11] vrange: enable generic interval tree

The anon vrange patches will use the generic interval tree, so let's enable it.
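
For context, this gives clients the generic interval tree API that the
later vrange patches rely on. A minimal sketch of intended usage (not
part of this patch, but the functions shown are the ones
lib/interval_tree.c exports):

#include <linux/kernel.h>
#include <linux/rbtree.h>
#include <linux/interval_tree.h>

/* A client embeds interval_tree_node and keeps its own payload;
 * node.start/node.last describe an inclusive range. */
struct my_range {
	struct interval_tree_node node;
	int payload;
};

static struct rb_root my_root = RB_ROOT;

static void my_insert(struct my_range *r, unsigned long start,
		      unsigned long last)
{
	r->node.start = start;
	r->node.last = last;
	interval_tree_insert(&r->node, &my_root);
}

/* Visit every stored range overlapping [start, last]. */
static void my_for_each_overlap(unsigned long start, unsigned long last)
{
	struct interval_tree_node *n;

	for (n = interval_tree_iter_first(&my_root, start, last);
	     n; n = interval_tree_iter_next(n, start, last)) {
		struct my_range *r = container_of(n, struct my_range, node);
		(void)r;	/* ... use r->payload ... */
	}
}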

Cc: Michel Lespinasse <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
lib/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/Makefile b/lib/Makefile
index d7946ff..17986fb 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
- earlycpio.o
+ earlycpio.o interval_tree.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
--
1.8.1.1

2013-03-12 07:39:06

by Minchan Kim

Subject: [RFC v7 11/11] add purged page information in vmstat

This patch adds vmstat information about pages discarded from
vranges, so the admin can see how many volatile pages are discarded
by the VM, and how efficiently. It could be an indicator of whether
vrange is working well.

PG_VRANGE_SCAN: the number of pages scanned for discarding
PG_VRANGE_DISCARD: the number of pages discarded in kswapd's vrange LRU order
PGDISCARD_DIRECT: the number of pages discarded in process context
PGDISCARD_KSWAPD: the number of pages discarded in kswapd's page LRU order
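
From user space, checking the new counters is then as simple as
filtering /proc/vmstat, e.g. with a quick sketch like this (the
counter names match the strings added below, and of course only show
up on a kernel with this series applied):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "pgvrange_", 9) ||
		    !strncmp(line, "pgdiscard_", 10))
			fputs(line, stdout);	/* name + count */
	fclose(f);
	return 0;
}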

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/vm_event_item.h | 4 ++++
mm/vmstat.c | 4 ++++
mm/vrange.c | 9 +++++++++
3 files changed, 17 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index bd6cf61..3d8ad18 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ PG_VRANGE_SCAN,
+ PG_VRANGE_DISCARD,
+ PGDISCARD_DIRECT,
+ PGDISCARD_KSWAPD,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e1d8ed1..55806d2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -754,6 +754,10 @@ const char * const vmstat_text[] = {

"pgfault",
"pgmajfault",
+ "pgvrange_scan",
+ "pgvrange_discard",
+ "pgdiscard_direct",
+ "pgdiscard_kswapd",

TEXTS_FOR_ZONES("pgrefill")
TEXTS_FOR_ZONES("pgsteal_kswapd")
diff --git a/mm/vrange.c b/mm/vrange.c
index 2f56d36..c0c5d50 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -518,6 +518,10 @@ int discard_vpage(struct page *page)
if (page_freeze_refs(page, 1)) {
unlock_page(page);
dec_zone_page_state(page, NR_ISOLATED_ANON);
+ if (current_is_kswapd())
+ count_vm_event(PGDISCARD_KSWAPD);
+ else
+ count_vm_event(PGDISCARD_DIRECT);
return 1;
}
}
@@ -584,11 +588,15 @@ static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
{
pte_t *pte;
spinlock_t *ptl;
+ unsigned long start = addr;

pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE)
vrange_pte_entry(*pte, addr, PAGE_SIZE, walk);
pte_unmap_unlock(pte - 1, ptl);
+
+ count_vm_events(PG_VRANGE_SCAN, (end - start) / PAGE_SIZE);
+
cond_resched();
return 0;

@@ -741,5 +749,6 @@ unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard)
if (start_vrange)
put_victim_range(start_vrange);

+ count_vm_events(PG_VRANGE_DISCARD, nr_discarded);
return nr_discarded;
}
--
1.8.1.1

2013-03-12 07:39:04

by Minchan Kim

Subject: [RFC v7 07/11] keep mm_struct in vrange in system call context

We need the mm_struct when discarding vrange pages in kswapd context.
This is a preparation for that.

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/vrange.h | 1 +
mm/vrange.c | 20 +++++++++++---------
2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 24ed4c1..5238a67 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -11,6 +11,7 @@ static DECLARE_RWSEM(vrange_fork_lock);
struct vrange {
struct interval_tree_node node;
bool purged;
+ struct mm_struct *mm;
};

#define vrange_entry(ptr) \
diff --git a/mm/vrange.c b/mm/vrange.c
index 89fcae4..f4c1d04 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -29,8 +29,9 @@ static inline void __set_vrange(struct vrange *range,
}

static void __add_range(struct vrange *range,
- struct rb_root *root)
+ struct rb_root *root, struct mm_struct *mm)
{
+ range->mm = mm;
interval_tree_insert(&range->node, root);
}

@@ -52,11 +53,12 @@ static void free_vrange(struct vrange *range)

static inline void range_resize(struct rb_root *root,
struct vrange *range,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ struct mm_struct *mm)
{
__remove_range(range, root);
__set_vrange(range, start, end);
- __add_range(range, root);
+ __add_range(range, root, mm);
}

int add_vrange(struct mm_struct *mm,
@@ -95,8 +97,7 @@ int add_vrange(struct mm_struct *mm,

__set_vrange(new_range, start, end);
new_range->purged = purged;
-
- __add_range(new_range, root);
+ __add_range(new_range, root, mm);
out:
vrange_unlock(mm);
return 0;
@@ -129,15 +130,16 @@ int remove_vrange(struct mm_struct *mm,
__remove_range(range, root);
free_vrange(range);
} else if (node->start >= start) {
- range_resize(root, range, end, node->last);
+ range_resize(root, range, end, node->last, mm);
} else if (node->last <= end) {
- range_resize(root, range, node->start, start);
+ range_resize(root, range, node->start, start, mm);
} else {
used_new = true;
__set_vrange(new_range, end, node->last);
new_range->purged = range->purged;
- range_resize(root, range, node->start, start);
- __add_range(new_range, root);
+ new_range->mm = mm;
+ range_resize(root, range, node->start, start, mm);
+ __add_range(new_range, root, mm);
break;
}

--
1.8.1.1

2013-03-12 07:39:01

by Minchan Kim

Subject: [RFC v7 08/11] add LRU handling for victim vrange

This patch adds an LRU data structure for selecting the victim vrange
when memory pressure happens.

Basically, the VM selects old vranges, but if the user accessed a
purged page recently, the vrange that includes the page is activated,
because a page fault means one of two things: either the user process
will be killed, or it recovers from the SIGBUS and continues its work.
For the latter case, we have to keep the vrange out of victim
selection.

I admit LRU might not be the best policy, but I can't think of a
better idea, so I wanted to keep it simple. I think user space could
handle this better with enough information, so I hope it can be
handled via the mempressure notifier. Otherwise, if you have a better
idea, welcome!

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/vrange.h | 4 ++++
mm/memory.c | 1 +
mm/vrange.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 5238a67..26db168 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -12,6 +12,7 @@ struct vrange {
struct interval_tree_node node;
bool purged;
struct mm_struct *mm;
+ struct list_head lru; /* protected by lru_lock */
};

#define vrange_entry(ptr) \
@@ -44,6 +45,9 @@ bool vrange_address(struct mm_struct *mm, unsigned long start,

extern bool is_purged_vrange(struct mm_struct *mm, unsigned long address);

+unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard);
+void lru_move_vrange_to_head(struct mm_struct *mm, unsigned long address);
+
#else

static inline void vrange_init(void) {};
diff --git a/mm/memory.c b/mm/memory.c
index cc369ab..3cb0633 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3671,6 +3671,7 @@ anon:

if (unlikely(pte_vrange(entry))) {
if (!is_purged_vrange(mm, address)) {
+ lru_move_vrange_to_head(mm, address);
/* zap pte */
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
diff --git a/mm/vrange.c b/mm/vrange.c
index f4c1d04..b9b1ffa 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -14,6 +14,9 @@
#include <linux/swapops.h>
#include <linux/mmu_notifier.h>

+static LIST_HEAD(lru_vrange);
+static DEFINE_SPINLOCK(lru_lock);
+
static struct kmem_cache *vrange_cachep;

void __init vrange_init(void)
@@ -28,10 +31,50 @@ static inline void __set_vrange(struct vrange *range,
range->node.last = end_idx;
}

+void lru_add_vrange(struct vrange *vrange)
+{
+ spin_lock(&lru_lock);
+ WARN_ON(!list_empty(&vrange->lru));
+ list_add(&vrange->lru, &lru_vrange);
+ spin_unlock(&lru_lock);
+}
+
+void lru_remove_vrange(struct vrange *vrange)
+{
+ spin_lock(&lru_lock);
+ if (!list_empty(&vrange->lru))
+ list_del_init(&vrange->lru);
+ spin_unlock(&lru_lock);
+}
+
+void lru_move_vrange_to_head(struct mm_struct *mm, unsigned long address)
+{
+ struct rb_root *root = &mm->v_rb;
+ struct interval_tree_node *node;
+ struct vrange *vrange;
+
+ vrange_lock(mm);
+ node = interval_tree_iter_first(root, address, address + PAGE_SIZE - 1);
+ if (node) {
+ vrange = container_of(node, struct vrange, node);
+ spin_lock(&lru_lock);
+ /*
+ * Race happens with get_victim_vrange so in such case,
+ * we can't move but it can put the vrange into head
+ * after finishing purging work so no problem.
+ */
+ if (!list_empty(&vrange->lru))
+ list_move(&vrange->lru, &lru_vrange);
+ spin_unlock(&lru_lock);
+ }
+ vrange_unlock(mm);
+}
+
static void __add_range(struct vrange *range,
struct rb_root *root, struct mm_struct *mm)
{
range->mm = mm;
+ lru_add_vrange(range);
interval_tree_insert(&range->node, root);
}

@@ -43,11 +86,14 @@ static void __remove_range(struct vrange *range,

static struct vrange *alloc_vrange(void)
{
- return kmem_cache_alloc(vrange_cachep, GFP_KERNEL);
+ struct vrange *vrange = kmem_cache_alloc(vrange_cachep, GFP_KERNEL);
+ INIT_LIST_HEAD(&vrange->lru);
+ return vrange;
}

static void free_vrange(struct vrange *range)
{
+ lru_remove_vrange(range);
kmem_cache_free(vrange_cachep, range);
}

--
1.8.1.1

2013-03-12 07:39:44

by Minchan Kim

Subject: [RFC v7 10/11] Purging vrange pages without swap

One remaining problem in vrange is that the VM reclaims anonymous
pages only if there is a swap system. This patch adds a new hook in
kswapd, above the scanning of normal LRU pages.

This patch discards all pages of the vmas in a vrange without
considering VRANGE_[FULL|PARTIAL] mode; that will be considered
in future work.

I should confess that I didn't spend enough time investigating where
a good place for the hook is. It might even be better to add a new
kvranged thread, because there have been a few bugs in kswapd these
days and it is very sensitive to small changes, so adding new hooks
may introduce another subtle problem.

It could be better to move the vrange code into kswapd after it
settles down in kvranged. Otherwise, we could leave it as it is in
kvranged.

The other issue is the scanning cost of virtual addresses. We don't
have any rss information per VMA, so kswapd can scan all addresses
without any gain, which can burn CPU. I have a plan to account rss
per VMA, at least for anonymous vmas.

Any comments are welcome!

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/rmap.h | 3 +
include/linux/vrange.h | 4 +-
mm/vmscan.c | 45 +++++++++-
mm/vrange.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 279 insertions(+), 12 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6432dfb..e822a30 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -83,6 +83,9 @@ enum ttu_flags {
};

#ifdef CONFIG_MMU
+unsigned long discard_vrange_page_list(struct zone *zone,
+ struct list_head *page_list);
+
unsigned long vma_address(struct page *page, struct vm_area_struct *vma);

static inline void get_anon_vma(struct anon_vma *anon_vma)
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 26db168..4bcec40 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -5,14 +5,12 @@
#include <linux/interval_tree.h>
#include <linux/mm.h>

-/* To protect race with forker */
-static DECLARE_RWSEM(vrange_fork_lock);
-
struct vrange {
struct interval_tree_node node;
bool purged;
struct mm_struct *mm;
struct list_head lru; /* protected by lru_lock */
+ atomic_t refcount;
};

#define vrange_entry(ptr) \
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e36ee51..2220ce7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -683,7 +683,7 @@ static enum page_references page_check_references(struct page *page,
/*
* shrink_page_list() returns the number of reclaimed pages
*/
-static unsigned long shrink_page_list(struct list_head *page_list,
+unsigned long shrink_page_list(struct list_head *page_list,
struct zone *zone,
struct scan_control *sc,
enum ttu_flags ttu_flags,
@@ -985,6 +985,35 @@ keep:
return nr_reclaimed;
}

+
+unsigned long discard_vrange_page_list(struct zone *zone,
+ struct list_head *page_list)
+{
+ unsigned long ret;
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .priority = DEF_PRIORITY,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .may_discard = 1
+ };
+
+ unsigned long dummy1, dummy2;
+ struct page *page;
+
+ list_for_each_entry(page, page_list, lru) {
+ VM_BUG_ON(!PageAnon(page));
+ ClearPageActive(page);
+ }
+
+ /* page_list have pages from multiple zones */
+ ret = shrink_page_list(page_list, NULL, &sc,
+ TTU_UNMAP|TTU_IGNORE_ACCESS,
+ &dummy1, &dummy2, false);
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, -ret);
+ return ret;
+}
+
unsigned long reclaim_clean_pages_from_list(struct zone *zone,
struct list_head *page_list)
{
@@ -2781,6 +2810,16 @@ loop_again:
if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
!zone_balanced(zone, testorder,
balance_gap, end_zone)) {
+
+ unsigned int nr_discard;
+ if (testorder == 0) {
+ nr_discard = discard_vrange_pages(zone,
+ SWAP_CLUSTER_MAX);
+ sc.nr_reclaimed += nr_discard;
+ if (zone_balanced(zone, testorder, 0,
+ end_zone))
+ goto zone_balanced;
+ }
shrink_zone(zone, &sc);

reclaim_state->reclaimed_slab = 0;
@@ -2805,7 +2844,8 @@ loop_again:
continue;
}

- if (zone_balanced(zone, testorder, 0, end_zone))
+ if (zone_balanced(zone, testorder, 0, end_zone)) {
+zone_balanced:
/*
* If a zone reaches its high watermark,
* consider it to be no longer congested. It's
@@ -2814,6 +2854,7 @@ loop_again:
* speculatively avoid congestion waits
*/
zone_clear_flag(zone, ZONE_CONGESTED);
+ }
}

/*
diff --git a/mm/vrange.c b/mm/vrange.c
index b9b1ffa..2f56d36 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -13,15 +13,29 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mmu_notifier.h>
+#include <linux/migrate.h>
+
+struct vrange_walker_private {
+ struct zone *zone;
+ struct vm_area_struct *vma;
+ struct list_head *pagelist;
+};

static LIST_HEAD(lru_vrange);
static DEFINE_SPINLOCK(lru_lock);

static struct kmem_cache *vrange_cachep;

+static void vrange_ctor(void *data)
+{
+ struct vrange *vrange = data;
+ INIT_LIST_HEAD(&vrange->lru);
+}
+
void __init vrange_init(void)
{
- vrange_cachep = KMEM_CACHE(vrange, SLAB_PANIC);
+ vrange_cachep = kmem_cache_create("vrange", sizeof(struct vrange),
+ 0, SLAB_PANIC, vrange_ctor);
}

static inline void __set_vrange(struct vrange *range,
@@ -78,6 +92,7 @@ static void __add_range(struct vrange *range,
interval_tree_insert(&range->node, root);
}

+/* remove range from interval tree */
static void __remove_range(struct vrange *range,
struct rb_root *root)
{
@@ -87,7 +102,8 @@ static void __remove_range(struct vrange *range,
static struct vrange *alloc_vrange(void)
{
struct vrange *vrange = kmem_cache_alloc(vrange_cachep, GFP_KERNEL);
- INIT_LIST_HEAD(&vrange->lru);
+ if (vrange)
+ atomic_set(&vrange->refcount, 1);
return vrange;
}

@@ -97,6 +113,13 @@ static void free_vrange(struct vrange *range)
kmem_cache_free(vrange_cachep, range);
}

+static void put_vrange(struct vrange *range)
+{
+ WARN_ON(atomic_read(&range->refcount) < 0);
+ if (atomic_dec_and_test(&range->refcount))
+ free_vrange(range);
+}
+
static inline void range_resize(struct rb_root *root,
struct vrange *range,
unsigned long start, unsigned long end,
@@ -127,7 +150,7 @@ int add_vrange(struct mm_struct *mm,

range = container_of(node, struct vrange, node);
if (node->start < start && node->last > end) {
- free_vrange(new_range);
+ put_vrange(new_range);
goto out;
}

@@ -136,7 +159,7 @@ int add_vrange(struct mm_struct *mm,

purged |= range->purged;
__remove_range(range, root);
- free_vrange(range);
+ put_vrange(range);

node = next;
}
@@ -174,7 +197,7 @@ int remove_vrange(struct mm_struct *mm,

if (start <= node->start && end >= node->last) {
__remove_range(range, root);
- free_vrange(range);
+ put_vrange(range);
} else if (node->start >= start) {
range_resize(root, range, end, node->last, mm);
} else if (node->last <= end) {
@@ -194,7 +217,7 @@ int remove_vrange(struct mm_struct *mm,

vrange_unlock(mm);
if (!used_new)
- free_vrange(new_range);
+ put_vrange(new_range);

return ret;
}
@@ -209,7 +232,7 @@ void exit_vrange(struct mm_struct *mm)
range = vrange_entry(next);
next = rb_next(next);
__remove_range(range, &mm->v_rb);
- free_vrange(range);
+ put_vrange(range);
}
}

@@ -494,6 +517,7 @@ int discard_vpage(struct page *page)

if (page_freeze_refs(page, 1)) {
unlock_page(page);
+ dec_zone_page_state(page, NR_ISOLATED_ANON);
return 1;
}
}
@@ -518,3 +542,204 @@ bool is_purged_vrange(struct mm_struct *mm, unsigned long address)
vrange_unlock(mm);
return ret;
}
+
+static void vrange_pte_entry(pte_t pteval, unsigned long address,
+ unsigned ptent_size, struct mm_walk *walk)
+{
+ struct page *page;
+ struct vrange_walker_private *vwp = walk->private;
+ struct vm_area_struct *vma = vwp->vma;
+ struct list_head *pagelist = vwp->pagelist;
+ struct zone *zone = vwp->zone;
+
+ if (pte_none(pteval))
+ return;
+
+ if (!pte_present(pteval))
+ return;
+
+ page = vm_normal_page(vma, address, pteval);
+ if (unlikely(!page))
+ return;
+
+ if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
+ return;
+
+ /* TODO : Support THP and HugeTLB */
+ if (unlikely(PageCompound(page)))
+ return;
+
+ if (zone_idx(page_zone(page)) > zone_idx(zone))
+ return;
+
+ if (isolate_lru_page(page))
+ return;
+
+ list_add(&page->lru, pagelist);
+ inc_zone_page_state(page, NR_ISOLATED_ANON);
+}
+
+static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ for (; addr != end; pte++, addr += PAGE_SIZE)
+ vrange_pte_entry(*pte, addr, PAGE_SIZE, walk);
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+ return 0;
+
+}
+
+unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, unsigned int nr_to_discard)
+{
+ LIST_HEAD(pagelist);
+ int ret = 0;
+ struct vrange_walker_private vwp;
+ struct mm_walk vrange_walk = {
+ .pmd_entry = vrange_pte_range,
+ .mm = vma->vm_mm,
+ .private = &vwp,
+ };
+
+ vwp.pagelist = &pagelist;
+ vwp.vma = vma;
+ vwp.zone = zone;
+
+ walk_page_range(start, end, &vrange_walk);
+
+ if (!list_empty(&pagelist))
+ ret = discard_vrange_page_list(zone, &pagelist);
+
+ putback_lru_pages(&pagelist);
+ return ret;
+}
+
+unsigned int discard_vrange(struct zone *zone, struct vrange *vrange,
+ int nr_to_discard)
+{
+ struct mm_struct *mm = vrange->mm;
+ unsigned long start = vrange->node.start;
+ unsigned long end = vrange->node.last;
+ struct vm_area_struct *vma;
+ unsigned int nr_discarded = 0;
+
+ if (!down_read_trylock(&mm->mmap_sem))
+ goto out;
+
+ vma = find_vma(mm, start);
+ if (!vma || (vma->vm_start > end))
+ goto out_unlock;
+
+ for (; vma; vma = vma->vm_next) {
+ if (vma->vm_start > end)
+ break;
+
+ if (vma->vm_file ||
+ (vma->vm_flags & (VM_SPECIAL | VM_LOCKED)))
+ continue;
+
+ cond_resched();
+ nr_discarded +=
+ discard_vma_pages(zone, mm, vma,
+ max_t(unsigned long, start, vma->vm_start),
+ min_t(unsigned long, end + 1, vma->vm_end),
+ nr_to_discard);
+ }
+out_unlock:
+ up_read(&mm->mmap_sem);
+out:
+ return nr_discarded;
+}
+
+/*
+ * Get next victim vrange from LRU and hold a vrange refcount
+ * and vrange->mm's refcount.
+ */
+struct vrange *get_victim_vrange(void)
+{
+ struct mm_struct *mm;
+ struct vrange *vrange = NULL;
+ struct list_head *cur, *tmp;
+
+ spin_lock(&lru_lock);
+ list_for_each_prev_safe(cur, tmp, &lru_vrange) {
+ vrange = list_entry(cur, struct vrange, lru);
+ mm = vrange->mm;
+ /* the process is exiting so pass it */
+ if (atomic_read(&mm->mm_users) == 0) {
+ list_del_init(&vrange->lru);
+ vrange = NULL;
+ continue;
+ }
+
+ /* vrange is freeing so continue to loop */
+ if (!atomic_inc_not_zero(&vrange->refcount)) {
+ list_del_init(&vrange->lru);
+ vrange = NULL;
+ continue;
+ }
+
+ /*
+ * we need to access mmap_sem further routine so
+ * need to get a refcount of mm.
+ * NOTE: We guarantee mm_count isn't zero in here because
+ * if we found vrange from LRU list, it means we are
+ * before exit_vrange or remove_vrange.
+ */
+ atomic_inc(&mm->mm_count);
+
+ /* Isolate vrange */
+ list_del_init(&vrange->lru);
+ break;
+ }
+
+ spin_unlock(&lru_lock);
+ return vrange;
+}
+
+void put_victim_range(struct vrange *vrange)
+{
+ put_vrange(vrange);
+ mmdrop(vrange->mm);
+}
+
+unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard)
+{
+ struct vrange *vrange, *start_vrange;
+ unsigned int nr_discarded = 0;
+
+ start_vrange = vrange = get_victim_vrange();
+ if (start_vrange) {
+ struct mm_struct *mm = start_vrange->mm;
+ atomic_inc(&start_vrange->refcount);
+ atomic_inc(&mm->mm_count);
+ }
+
+ while (vrange) {
+ nr_discarded += discard_vrange(zone, vrange, nr_to_discard);
+ lru_add_vrange(vrange);
+ put_victim_range(vrange);
+
+ if (nr_discarded >= nr_to_discard)
+ break;
+
+ vrange = get_victim_vrange();
+ /* break if we go round the loop */
+ if (vrange == start_vrange) {
+ lru_add_vrange(vrange);
+ put_victim_range(vrange);
+ break;
+ }
+ }
+
+ if (start_vrange)
+ put_victim_range(start_vrange);
+
+ return nr_discarded;
+}
--
1.8.1.1

2013-03-12 07:40:04

by Minchan Kim

Subject: [RFC v7 09/11] Get rid of dependency that all pages are from a zone in shrink_page_list

Currently shrink_page_list expects all pages to come from the same
zone, but that's too limiting.

This patch removes the dependency and adds may_discard to scan_control
so the next patch can use shrink_page_list with pages from multiple
zones.

Signed-off-by: Minchan Kim <[email protected]>
---
mm/vmscan.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6ba4e8ea..e36ee51 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,6 +77,9 @@ struct scan_control {
/* Can pages be swapped as part of reclaim? */
int may_swap;

+ /* Discard pages in vrange */
+ int may_discard;
+
int order;

/* Scan (total_size >> priority) pages at once */
@@ -714,7 +717,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep;

VM_BUG_ON(PageActive(page));
- VM_BUG_ON(page_zone(page) != zone);
+ if (zone)
+ VM_BUG_ON(page_zone(page) != zone);

sc->nr_scanned++;

@@ -785,6 +789,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
; /* try to reclaim the page below */
}

+ /* Fail to discard a page and returns a page to caller */
+ if (sc->may_discard)
+ goto keep_locked;
+
/*
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
@@ -963,7 +971,8 @@ keep:
* back off and wait for congestion to clear because further reclaim
* will encounter the same problem
*/
- if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
+ if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc) &&
+ zone)
zone_set_flag(zone, ZONE_CONGESTED);

free_hot_cold_page_list(&free_pages, 1);
--
1.8.1.1

2013-03-12 07:40:38

by Minchan Kim

Subject: [RFC v7 06/11] send SIGBUS when user tries to access purged page

By vrange(2) semantics, the user should see SIGBUS if he tries to
access a purged page without calling vrange(...VRANGE_NOVOLATILE...)
first.

This patch implements it.

I reused the PSE bit for this quick prototype without much
consideration, so I need time to see which bit is really free, and I
am surely missing many places that should handle the vrange pte bit.
I should investigate all of the pte-handling places, especially the
pte_none cases. TODO
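
From user space, the observable behavior should be something like the
sketch below (hypothetical syscall number and flag values again, and
it's my assumption that the faulting address is delivered in si_addr):

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_vrange       314	/* hypothetical */
#define VRANGE_VOLATILE     0	/* hypothetical */
#define VRANGE_FULL         0	/* hypothetical */

static void on_sigbus(int sig, siginfo_t *si, void *uc)
{
	/* A real user would mark the range non-volatile and regenerate
	 * the data; this sketch just reports and exits. */
	fprintf(stderr, "SIGBUS at %p: page was purged\n", si->si_addr);
	_exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = on_sigbus,
		.sa_flags = SA_SIGINFO,
	};
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	sigaction(SIGBUS, &sa, NULL);
	p[0] = 'x';
	syscall(__NR_vrange, p, len, VRANGE_VOLATILE, VRANGE_FULL);
	/* If memory pressure purged the page before this read, we take
	 * the SIGBUS path above; otherwise we see the old data. */
	printf("%c\n", p[0]);
	return 0;
}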

Signed-off-by: Minchan Kim <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 2 ++
include/asm-generic/pgtable.h | 11 +++++++++++
include/linux/vrange.h | 2 ++
mm/memory.c | 23 +++++++++++++++++++++--
mm/vrange.c | 26 ++++++++++++++++++++++++--
5 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 567b5d0..8c5163f 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,8 @@
#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

+#define _PAGE_VRANGE _PAGE_BIT_PSE
+
/*
* _PAGE_NUMA indicates that this page will trigger a numa hinting
* minor page fault to gather numa placement statistics (see
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index bfd8768..1486d42 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -469,6 +469,17 @@ static inline unsigned long my_zero_pfn(unsigned long addr)

#ifdef CONFIG_MMU

+static inline pte_t pte_mkvrange(pte_t pte)
+{
+ pte = pte_set_flags(pte, _PAGE_VRANGE);
+ return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline int pte_vrange(pte_t pte)
+{
+ return ((pte_flags(pte) | _PAGE_PRESENT) == _PAGE_VRANGE);
+}
+
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
static inline int pmd_trans_huge(pmd_t pmd)
{
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index eb3f941..24ed4c1 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -41,6 +41,8 @@ int discard_vpage(struct page *page);
bool vrange_address(struct mm_struct *mm, unsigned long start,
unsigned long end);

+extern bool is_purged_vrange(struct mm_struct *mm, unsigned long address);
+
#else

static inline void vrange_init(void) {};
diff --git a/mm/memory.c b/mm/memory.c
index 494526a..cc369ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -59,6 +59,7 @@
#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/string.h>
+#include <linux/vrange.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -840,7 +841,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,

/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
- if (!pte_file(pte)) {
+ if (!pte_file(pte) && !pte_vrange(pte)) {
swp_entry_t entry = pte_to_swp_entry(pte);

if (swap_duplicate(entry) < 0)
@@ -1180,7 +1181,7 @@ again:
if (pte_file(ptent)) {
if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
print_bad_pte(vma, addr, ptent, NULL);
- } else {
+ } else if (!pte_vrange(ptent)) {
swp_entry_t entry = pte_to_swp_entry(ptent);

if (!non_swap_entry(entry))
@@ -3663,9 +3664,27 @@ int handle_pte_fault(struct mm_struct *mm,
return do_linear_fault(mm, vma, address,
pte, pmd, flags, entry);
}
+anon:
return do_anonymous_page(mm, vma, address,
pte, pmd, flags);
}
+
+ if (unlikely(pte_vrange(entry))) {
+ if (!is_purged_vrange(mm, address)) {
+ /* zap pte */
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ if (unlikely(!pte_same(*pte, entry)))
+ goto unlock;
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ ptep_clear_flush(vma, address, pte);
+ pte_unmap_unlock(pte, ptl);
+ goto anon;
+ }
+
+ return VM_FAULT_SIGBUS;
+ }
+
if (pte_file(entry))
return do_nonlinear_fault(mm, vma, address,
pte, pmd, flags, entry);
diff --git a/mm/vrange.c b/mm/vrange.c
index 78aa252..89fcae4 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -343,7 +343,9 @@ int try_to_discard_one(struct page *page, struct vm_area_struct *vma,

present = pte_present(*pte);
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
+
+ ptep_clear_flush(vma, address, pte);
+ pteval = pte_mkvrange(*pte);

update_hiwater_rss(mm);
dec_mm_counter(mm, MM_ANONPAGES);
@@ -357,10 +359,12 @@ int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
BUG_ON(1);
}

+ set_pte_at(mm, address, pte, pteval);
+ __vrange_purge(mm, address, address + PAGE_SIZE -1);
pte_unmap_unlock(pte, ptl);
mmu_notifier_invalidate_page(mm, address);
+ vrange_unlock(mm);
ret = 1;
- __vrange_purge(mm, address, address + PAGE_SIZE -1);
out:
return ret;
}
@@ -448,3 +452,21 @@ int discard_vpage(struct page *page)

return 0;
}
+
+bool is_purged_vrange(struct mm_struct *mm, unsigned long address)
+{
+ struct rb_root *root = &mm->v_rb;
+ struct interval_tree_node *node;
+ struct vrange *range;
+ bool ret = false;
+
+ vrange_lock(mm);
+ node = interval_tree_iter_first(root, address, address + PAGE_SIZE - 1);
+ if (node) {
+ range = container_of(node, struct vrange, node);
+ if (range->purged)
+ ret = true;
+ }
+ vrange_unlock(mm);
+ return ret;
+}
--
1.8.1.1

2013-03-12 07:40:52

by Minchan Kim

Subject: [RFC v7 05/11] Add purge operation

This patch adds the discarding part. The logic is as follows.

1. Memory pressure happens.
2. The VM starts to reclaim anonymous pages if the system has a swap device.
3. Check whether the page is in a volatile range.
4. If so, zap the page from the process's page table.
(By vrange(2) semantics, we should mark the pte with a special
entry so that a page fault happens when the address is accessed.
That will be introduced in a later patch.)
5. If the page is unmapped from all processes, discard it instead of
swapping it out.

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/rmap.h | 12 ++-
include/linux/swap.h | 1 +
include/linux/vrange.h | 9 ++
mm/ksm.c | 2 +-
mm/rmap.c | 23 +++--
mm/swapfile.c | 36 ++++++++
mm/vmscan.c | 16 +++-
mm/vrange.c | 235 +++++++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 320 insertions(+), 14 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6dacb93..6432dfb 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -83,6 +83,8 @@ enum ttu_flags {
};

#ifdef CONFIG_MMU
+unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+
static inline void get_anon_vma(struct anon_vma *anon_vma)
{
atomic_inc(&anon_vma->refcount);
@@ -182,9 +184,11 @@ static inline void page_dup_rmap(struct page *page)
* Called from mm/vmscan.c to handle paging out
*/
int page_referenced(struct page *, int is_locked,
- struct mem_cgroup *memcg, unsigned long *vm_flags);
+ struct mem_cgroup *memcg, unsigned long *vm_flags,
+ int *is_vrange);
int page_referenced_one(struct page *, struct vm_area_struct *,
- unsigned long address, unsigned int *mapcount, unsigned long *vm_flags);
+ unsigned long address, unsigned int *mapcount, unsigned long *vm_flags,
+ int *is_vrange);

#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)

@@ -249,9 +253,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,

static inline int page_referenced(struct page *page, int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_vrange)
{
*vm_flags = 0;
+ *is_vrange = 0;
return 0;
}

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2818a12..bf56fb4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -379,6 +379,7 @@ extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
extern void swapcache_free(swp_entry_t, struct page *page);
+extern int __free_swap_and_cache(swp_entry_t);
extern int free_swap_and_cache(swp_entry_t);
extern int swap_type_of(dev_t, sector_t, struct block_device **);
extern unsigned int count_swap_pages(int, int);
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 74b5e37..eb3f941 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -5,6 +5,9 @@
#include <linux/interval_tree.h>
#include <linux/mm.h>

+/* To protect race with forker */
+static DECLARE_RWSEM(vrange_fork_lock);
+
struct vrange {
struct interval_tree_node node;
bool purged;
@@ -34,6 +37,9 @@ static inline void vrange_unlock(struct mm_struct *mm)

extern void exit_vrange(struct mm_struct *mm);
void vrange_init(void);
+int discard_vpage(struct page *page);
+bool vrange_address(struct mm_struct *mm, unsigned long start,
+ unsigned long end);

#else

@@ -41,5 +47,8 @@ static inline void vrange_init(void) {};
static inline void mm_init_vrange(struct mm_struct *mm) {};
static inline void exit_vrange(struct mm_struct *mm);

+static inline bool vrange_address(struct mm_struct *mm, unsigned long start,
+ unsigned long end) { return false; };
+static inline int discard_vpage(struct page *page) { return 0 };
#endif
#endif /* _LINIUX_VRANGE_H */
diff --git a/mm/ksm.c b/mm/ksm.c
index b6afe0c..debc20c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1932,7 +1932,7 @@ again:
continue;

referenced += page_referenced_one(page, vma,
- rmap_item->address, &mapcount, vm_flags);
+ rmap_item->address, &mapcount, vm_flags, NULL);
if (!search_new_forks || !mapcount)
break;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 807c96b..90cf51c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -57,6 +57,8 @@
#include <linux/migrate.h>
#include <linux/hugetlb.h>
#include <linux/backing-dev.h>
+#include <linux/vrange.h>
+#include <linux/rmap.h>

#include <asm/tlbflush.h>

@@ -523,8 +525,7 @@ __vma_address(struct page *page, struct vm_area_struct *vma)
return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}

-inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
+unsigned long vma_address(struct page *page, struct vm_area_struct *vma)
{
unsigned long address = __vma_address(page, vma);

@@ -662,7 +663,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
*/
int page_referenced_one(struct page *page, struct vm_area_struct *vma,
unsigned long address, unsigned int *mapcount,
- unsigned long *vm_flags)
+ unsigned long *vm_flags, int *is_vrange)
{
struct mm_struct *mm = vma->vm_mm;
int referenced = 0;
@@ -724,6 +725,9 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
referenced++;
}
pte_unmap_unlock(pte, ptl);
+ if (is_vrange &&
+ vrange_address(mm, address, address + PAGE_SIZE -1))
+ *is_vrange = 1;
}

(*mapcount)--;
@@ -736,7 +740,8 @@ out:

static int page_referenced_anon(struct page *page,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_vrange)
{
unsigned int mapcount;
struct anon_vma *anon_vma;
@@ -761,7 +766,7 @@ static int page_referenced_anon(struct page *page,
if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
continue;
referenced += page_referenced_one(page, vma, address,
- &mapcount, vm_flags);
+ &mapcount, vm_flags, is_vrange);
if (!mapcount)
break;
}
@@ -826,7 +831,7 @@ static int page_referenced_file(struct page *page,
if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
continue;
referenced += page_referenced_one(page, vma, address,
- &mapcount, vm_flags);
+ &mapcount, vm_flags, NULL);
if (!mapcount)
break;
}
@@ -841,6 +846,7 @@ static int page_referenced_file(struct page *page,
* @is_locked: caller holds lock on the page
* @memcg: target memory cgroup
* @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @is_vrange: the page in vrange of some process
*
* Quick test_and_clear_referenced for all mappings to a page,
* returns the number of ptes which referenced the page.
@@ -848,7 +854,8 @@ static int page_referenced_file(struct page *page,
int page_referenced(struct page *page,
int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_vrange)
{
int referenced = 0;
int we_locked = 0;
@@ -867,7 +874,7 @@ int page_referenced(struct page *page,
vm_flags);
else if (PageAnon(page))
referenced += page_referenced_anon(page, memcg,
- vm_flags);
+ vm_flags, is_vrange);
else if (page->mapping)
referenced += page_referenced_file(page, memcg,
vm_flags);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a1f7772..962024c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -734,6 +734,42 @@ int try_to_free_swap(struct page *page)
}

/*
+ * It's almost same with free_swap_and_cache except page is already
+ * locked.
+ */
+int __free_swap_and_cache(swp_entry_t entry)
+{
+ struct swap_info_struct *p;
+ struct page *page = NULL;
+
+ if (non_swap_entry(entry))
+ return 1;
+
+ p = swap_info_get(entry);
+ if (p) {
+ if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
+ page = find_get_page(swap_address_space(entry),
+ entry.val);
+ }
+ spin_unlock(&swap_lock);
+ }
+
+ if (page) {
+ /*
+ * Not mapped elsewhere, or swap space full? Free it!
+ * Also recheck PageSwapCache now page is locked (above).
+ */
+ if (PageSwapCache(page) && !PageWriteback(page) &&
+ (!page_mapped(page) || vm_swap_full())) {
+ delete_from_swap_cache(page);
+ SetPageDirty(page);
+ }
+ page_cache_release(page);
+ }
+ return p != NULL;
+}
+
+/*
* Free the swap entry like above, but also try to
* free the page cache entry if it is the last user.
*/
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 88c5fed..6ba4e8ea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -42,6 +42,7 @@
#include <linux/sysctl.h>
#include <linux/oom.h>
#include <linux/prefetch.h>
+#include <linux/vrange.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -610,6 +611,7 @@ enum page_references {
PAGEREF_RECLAIM,
PAGEREF_RECLAIM_CLEAN,
PAGEREF_KEEP,
+ PAGEREF_DISCARD,
PAGEREF_ACTIVATE,
};

@@ -618,9 +620,10 @@ static enum page_references page_check_references(struct page *page,
{
int referenced_ptes, referenced_page;
unsigned long vm_flags;
+ int is_vrange = 0;

referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
- &vm_flags);
+ &vm_flags, &is_vrange);
referenced_page = TestClearPageReferenced(page);

/*
@@ -630,6 +633,12 @@ static enum page_references page_check_references(struct page *page,
if (vm_flags & VM_LOCKED)
return PAGEREF_RECLAIM;

+ /*
+ * Bail out if the page is in vrange and try to discard.
+ */
+ if (is_vrange)
+ return PAGEREF_DISCARD;
+
if (referenced_ptes) {
if (PageSwapBacked(page))
return PAGEREF_ACTIVATE;
@@ -768,6 +777,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto activate_locked;
case PAGEREF_KEEP:
goto keep_locked;
+ case PAGEREF_DISCARD:
+ if (discard_vpage(page))
+ goto free_it;
case PAGEREF_RECLAIM:
case PAGEREF_RECLAIM_CLEAN:
; /* try to reclaim the page below */
@@ -1496,7 +1508,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
}

if (page_referenced(page, 0, sc->target_mem_cgroup,
- &vm_flags)) {
+ &vm_flags, NULL)) {
nr_rotated += hpage_nr_pages(page);
/*
* Identify referenced, file-backed active pages and
diff --git a/mm/vrange.c b/mm/vrange.c
index f8b6f0e..78aa252 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -6,6 +6,13 @@
#include <linux/slab.h>
#include <linux/syscalls.h>
#include <linux/mman.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include "internal.h"
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>

static struct kmem_cache *vrange_cachep;

@@ -213,3 +220,231 @@ SYSCALL_DEFINE4(vrange, unsigned long, start,
out:
return ret;
}
+
+bool __vrange_address(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct rb_root *root = &mm->v_rb;
+ struct interval_tree_node *node;
+
+ node = interval_tree_iter_first(root, start, end);
+ return node ? true : false;
+}
+
+bool vrange_address(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ bool ret;
+
+ vrange_lock(mm);
+ ret = __vrange_address(mm, start, end);
+ vrange_unlock(mm);
+ return ret;
+}
+
+static pte_t *__vpage_check_address(struct page *page,
+ struct mm_struct *mm, unsigned long address, spinlock_t **ptlp)
+{
+ pmd_t *pmd;
+ pte_t *pte;
+ spinlock_t *ptl;
+ bool present;
+
+ /* TODO : look into tlbfs */
+ if (unlikely(PageHuge(page)))
+ return NULL;
+
+ pmd = mm_find_pmd(mm, address);
+ if (!pmd)
+ return NULL;
+ /*
+ * TODO : Support THP
+ */
+ if (pmd_trans_huge(*pmd))
+ return NULL;
+
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (pte_none(*pte))
+ goto out;
+
+ present = pte_present(*pte);
+ if (present && page_to_pfn(page) != pte_pfn(*pte))
+ goto out;
+ else if (present) {
+ *ptlp = ptl;
+ return pte;
+ } else {
+ swp_entry_t entry = { .val = page_private(page) };
+
+ VM_BUG_ON(non_swap_entry(entry));
+ if (entry.val != pte_to_swp_entry(*pte).val)
+ goto out;
+ *ptlp = ptl;
+ return pte;
+ }
+out:
+ pte_unmap_unlock(pte, ptl);
+ return NULL;
+}
+
+/*
+ * This functions checks @page is matched with pte's encoded one
+ * which could be a page or swap slot.
+ */
+static inline pte_t *vpage_check_address(struct page *page,
+ struct mm_struct *mm, unsigned long address,
+ spinlock_t **ptlp)
+{
+ pte_t *ptep;
+ __cond_lock(*ptlp, ptep = __vpage_check_address(page,
+ mm, address, ptlp));
+ return ptep;
+}
+
+static void __vrange_purge(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct rb_root *root = &mm->v_rb;
+ struct vrange *range;
+ struct interval_tree_node *node;
+
+ node = interval_tree_iter_first(root, start, end);
+ while (node) {
+ range = container_of(node, struct vrange, node);
+ range->purged = true;
+ node = interval_tree_iter_next(node, start, end);
+ }
+}
+
+int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t *pte;
+ pte_t pteval;
+ spinlock_t *ptl;
+ int ret = 0;
+ bool present;
+
+ VM_BUG_ON(!PageLocked(page));
+
+ vrange_lock(mm);
+ pte = vpage_check_address(page, mm, address, &ptl);
+ if (!pte) {
+ vrange_unlock(mm);
+ goto out;
+ }
+
+ if (vma->vm_flags & VM_LOCKED) {
+ pte_unmap_unlock(pte, ptl);
+ vrange_unlock(mm);
+ return 0;
+ }
+
+ present = pte_present(*pte);
+ flush_cache_page(vma, address, page_to_pfn(page));
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ update_hiwater_rss(mm);
+ dec_mm_counter(mm, MM_ANONPAGES);
+
+ page_remove_rmap(page);
+ page_cache_release(page);
+ if (!present) {
+ swp_entry_t entry = pte_to_swp_entry(*pte);
+ dec_mm_counter(mm, MM_SWAPENTS);
+ if (unlikely(!__free_swap_and_cache(entry)))
+ BUG_ON(1);
+ }
+
+ pte_unmap_unlock(pte, ptl);
+ mmu_notifier_invalidate_page(mm, address);
+ ret = 1;
+ __vrange_purge(mm, address, address + PAGE_SIZE -1);
+out:
+ return ret;
+}
+
+static int try_to_discard_vpage(struct page *page)
+{
+ struct anon_vma *anon_vma;
+ struct anon_vma_chain *avc;
+ pgoff_t pgoff;
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ unsigned long address;
+ bool ret = 0;
+
+ anon_vma = page_lock_anon_vma_read(page);
+ if (!anon_vma)
+ return ret;
+
+ pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ vma = avc->vma;
+ mm = vma->vm_mm;
+ address = vma_address(page, vma);
+
+ vrange_lock(mm);
+ /*
+ * We can't use page_check_address because it doesn't check
+ * swap entry of the page table. We need the check because
+ * we have to make sure atomicity of shared vrange.
+ * It means all vranges which are shared a page should be
+ * purged if a page in a process is purged.
+ */
+ pte = vpage_check_address(page, mm, address, &ptl);
+ if (!pte) {
+ vrange_unlock(mm);
+ continue;
+ }
+
+ if (vma->vm_flags & VM_LOCKED) {
+ pte_unmap_unlock(pte, ptl);
+ vrange_unlock(mm);
+ goto out;
+ }
+
+ pte_unmap_unlock(pte, ptl);
+ if (!__vrange_address(mm, address,
+ address + PAGE_SIZE - 1)) {
+ vrange_unlock(mm);
+ goto out;
+ }
+
+ vrange_unlock(mm);
+ }
+
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+ vma = avc->vma;
+ address = vma_address(page, vma);
+ if (!try_to_discard_one(page, vma, address))
+ goto out;
+ }
+
+ ret = 1;
+out:
+ page_unlock_anon_vma_read(anon_vma);
+ return ret;
+}
+
+int discard_vpage(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageLRU(page));
+
+ if (try_to_discard_vpage(page)) {
+ if (PageSwapCache(page))
+ try_to_free_swap(page);
+
+ if (page_freeze_refs(page, 1)) {
+ unlock_page(page);
+ return 1;
+ }
+ }
+
+ return 0;
+}
--
1.8.1.1

2013-03-12 07:38:55

by Minchan Kim

Subject: [RFC v7 02/11] add vrange basic data structure and functions

This patch adds the vrange data structure (an interval tree) and
related functions.

vrange uses the generic interval tree as its main data structure:
since it handles address ranges, the generic interval tree fits the
purpose well.

add_vrange/remove_vrange are the core functions for the system call
that will be introduced in the next patch.

1. add_vrange inserts a new address range into the interval tree.
If the new address range overlaps an existing volatile range,
the existing volatile range is expanded to cover the new range.
Then, if the existing volatile range had purged state, the new
range inherits the purged state.
That's not ideal, and we need more fine-grained purged-state
handling within a vrange (TODO).

If the new address range is inside an existing range, we ignore it.

2. remove_vrange removes an address range and then returns the
purged state of the removed ranges (see the sketch below).

This patch copied some parts from John Stultz's work, but with different semantics.
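
To illustrate the resulting semantics with a sketch, using a
hypothetical wrapper around the syscall (same placeholder syscall
number and flag values as earlier in the thread; page indexes refer
to one anonymous mapping):

#include <unistd.h>
#include <sys/syscall.h>

#define __NR_vrange       314	/* hypothetical */
#define VRANGE_VOLATILE     0	/* hypothetical */
#define VRANGE_NOVOLATILE   1	/* hypothetical */
#define VRANGE_FULL         0	/* hypothetical */

static long vrange(void *addr, size_t len, int mode, int behavior)
{
	return syscall(__NR_vrange, addr, len, mode, behavior);
}

static void demo(char *base, size_t page)
{
	/* Overlapping volatile ranges are merged into one covering
	 * range, which inherits the purged state of either part: */
	vrange(base,            3 * page, VRANGE_VOLATILE, VRANGE_FULL);
	vrange(base + 2 * page, 3 * page, VRANGE_VOLATILE, VRANGE_FULL);
	/* the tree now holds a single range covering pages 0..4 */

	/* A range fully inside an existing one is ignored: */
	vrange(base + page, page, VRANGE_VOLATILE, VRANGE_FULL);

	/* Removing the middle splits the range in two and returns the
	 * purged state of what was removed: */
	vrange(base + 2 * page, page, VRANGE_NOVOLATILE, VRANGE_FULL);
	/* the tree now holds pages 0..1 and pages 3..4 */
}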

Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/mm_types.h | 5 ++
include/linux/vrange.h | 45 ++++++++++++++
init/main.c | 2 +
kernel/fork.c | 3 +
mm/Makefile | 2 +-
mm/vrange.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 213 insertions(+), 1 deletion(-)
create mode 100644 include/linux/vrange.h
create mode 100644 mm/vrange.c

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..080bf74 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
#include <linux/page-flags-layout.h>
+#include <linux/mutex.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -351,6 +352,10 @@ struct mm_struct {
*/


+#ifdef CONFIG_MMU
+ struct rb_root v_rb; /* vrange rb tree */
+ struct mutex v_lock; /* Protect v_rb */
+#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
new file mode 100644
index 0000000..74b5e37
--- /dev/null
+++ b/include/linux/vrange.h
@@ -0,0 +1,45 @@
+#ifndef _LINUX_VRANGE_H
+#define _LINUX_VRANGE_H
+
+#include <linux/mutex.h>
+#include <linux/interval_tree.h>
+#include <linux/mm.h>
+
+struct vrange {
+ struct interval_tree_node node;
+ bool purged;
+};
+
+#define vrange_entry(ptr) \
+ container_of(ptr, struct vrange, node.rb)
+
+#ifdef CONFIG_MMU
+struct mm_struct;
+
+static inline void mm_init_vrange(struct mm_struct *mm)
+{
+ mm->v_rb = RB_ROOT;
+ mutex_init(&mm->v_lock);
+}
+
+static inline void vrange_lock(struct mm_struct *mm)
+{
+ mutex_lock(&mm->v_lock);
+}
+
+static inline void vrange_unlock(struct mm_struct *mm)
+{
+ mutex_unlock(&mm->v_lock);
+}
+
+extern void exit_vrange(struct mm_struct *mm);
+void vrange_init(void);
+
+#else
+
+static inline void vrange_init(void) {};
+static inline void mm_init_vrange(struct mm_struct *mm) {};
+static inline void exit_vrange(struct mm_struct *mm);
+
+#endif
+#endif /* _LINIUX_VRANGE_H */
diff --git a/init/main.c b/init/main.c
index 63534a1..0b9e0b5 100644
--- a/init/main.c
+++ b/init/main.c
@@ -72,6 +72,7 @@
#include <linux/ptrace.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
+#include <linux/vrange.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -605,6 +606,7 @@ asmlinkage void __init start_kernel(void)
calibrate_delay();
pidmap_init();
anon_vma_init();
+ vrange_init();
#ifdef CONFIG_X86
if (efi_enabled(EFI_RUNTIME_SERVICES))
efi_enter_virtual_mode();
diff --git a/kernel/fork.c b/kernel/fork.c
index 8d932b1..e3aa120 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
#include <linux/khugepaged.h>
#include <linux/signalfd.h>
#include <linux/uprobes.h>
+#include <linux/vrange.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -541,6 +542,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
spin_lock_init(&mm->page_table_lock);
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
+ mm_init_vrange(mm);
mm_init_aio(mm);
mm_init_owner(mm, p);

@@ -612,6 +614,7 @@ void mmput(struct mm_struct *mm)

if (atomic_dec_and_test(&mm->mm_users)) {
uprobe_clear_state(mm);
+ exit_vrange(mm);
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..a31235e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o pagewalk.o pgtable-generic.o
+ vmalloc.o pagewalk.o pgtable-generic.o vrange.o

ifdef CONFIG_CROSS_MEMORY_ATTACH
mmu-$(CONFIG_MMU) += process_vm_access.o
diff --git a/mm/vrange.c b/mm/vrange.c
new file mode 100644
index 0000000..e265c82
--- /dev/null
+++ b/mm/vrange.c
@@ -0,0 +1,157 @@
+/*
+ * mm/vrange.c
+ */
+
+#include <linux/vrange.h>
+#include <linux/slab.h>
+
+static struct kmem_cache *vrange_cachep;
+
+void __init vrange_init(void)
+{
+ vrange_cachep = KMEM_CACHE(vrange, SLAB_PANIC);
+}
+
+static inline void __set_vrange(struct vrange *range,
+ unsigned long start_idx, unsigned long end_idx)
+{
+ range->node.start = start_idx;
+ range->node.last = end_idx;
+}
+
+static void __add_range(struct vrange *range,
+ struct rb_root *root)
+{
+ interval_tree_insert(&range->node, root);
+}
+
+static void __remove_range(struct vrange *range,
+ struct rb_root *root)
+{
+ interval_tree_remove(&range->node, root);
+}
+
+static struct vrange *alloc_vrange(void)
+{
+ return kmem_cache_alloc(vrange_cachep, GFP_KERNEL);
+}
+
+static void free_vrange(struct vrange *range)
+{
+ kmem_cache_free(vrange_cachep, range);
+}
+
+static inline void range_resize(struct rb_root *root,
+ struct vrange *range,
+ unsigned long start, unsigned long end)
+{
+ __remove_range(range, root);
+ __set_vrange(range, start, end);
+ __add_range(range, root);
+}
+
+int add_vrange(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct rb_root *root;
+ struct vrange *new_range, *range;
+ struct interval_tree_node *node, *next;
+ int purged = 0;
+
+ new_range = alloc_vrange();
+ if (!new_range)
+ return -ENOMEM;
+
+ root = &mm->v_rb;
+ vrange_lock(mm);
+ node = interval_tree_iter_first(root, start, end);
+ while (node) {
+ next = interval_tree_iter_next(node, start, end);
+
+ range = container_of(node, struct vrange, node);
+ if (node->start < start && node->last > end) {
+ free_vrange(new_range);
+ goto out;
+ }
+
+ start = min_t(unsigned long, start, node->start);
+ end = max_t(unsigned long, end, node->last);
+
+ purged |= range->purged;
+ __remove_range(range, root);
+ free_vrange(range);
+
+ node = next;
+ }
+
+ __set_vrange(new_range, start, end);
+ new_range->purged = purged;
+
+ __add_range(new_range, root);
+out:
+ vrange_unlock(mm);
+ return 0;
+}
+
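+/*
+ * Clear the volatile state of [start, end] (end inclusive). Ranges
+ * fully inside the span are removed, ranges overlapping one side are
+ * trimmed, and a range covering the whole span is split in two using
+ * the preallocated new_range. Returns 1 if any overlapping range had
+ * been purged, 0 if none were, or -ENOMEM on allocation failure.
+ */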
+int remove_vrange(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct rb_root *root;
+ struct vrange *new_range, *range;
+ struct interval_tree_node *node, *next;
+ int ret = 0;
+ bool used_new = false;
+
+ new_range = alloc_vrange();
+ if (!new_range)
+ return -ENOMEM;
+
+ root = &mm->v_rb;
+ vrange_lock(mm);
+
+ node = interval_tree_iter_first(root, start, end);
+ while (node) {
+ next = interval_tree_iter_next(node, start, end);
+
+ range = container_of(node, struct vrange, node);
+ ret |= range->purged;
+
+ if (start <= node->start && end >= node->last) {
+ __remove_range(range, root);
+ free_vrange(range);
+ } else if (node->start >= start) {
+ range_resize(root, range, end, node->last);
+ } else if (node->last <= end) {
+ range_resize(root, range, node->start, start);
+ } else {
+ used_new = true;
+ __set_vrange(new_range, end, node->last);
+ new_range->purged = range->purged;
+ range_resize(root, range, node->start, start);
+ __add_range(new_range, root);
+ break;
+ }
+
+ node = next;
+ }
+
+ vrange_unlock(mm);
+ if (!used_new)
+ free_vrange(new_range);
+
+ return ret;
+}
+
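+/*
+ * Tear down every remaining vrange of an address space. Called from
+ * mmput() once the last user of the mm is gone, so the tree is no
+ * longer reachable by anyone else.
+ */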
+void exit_vrange(struct mm_struct *mm)
+{
+ struct vrange *range;
+ struct rb_node *next;
+
+ next = rb_first(&mm->v_rb);
+ while (next) {
+ range = vrange_entry(next);
+ next = rb_next(next);
+ __remove_range(range, &mm->v_rb);
+ free_vrange(range);
+ }
+}
--
1.8.1.1

2013-03-12 07:41:16

by Minchan Kim

[permalink] [raw]
Subject: [RFC v7 04/11] add proc/pid/vrange information

Add per-process vrange information.
It will help with debugging.
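
For illustration, with the "%08lx-%08lx %c" format used below, reading
the new file might look like this (addresses are hypothetical):

  $ cat /proc/1234/vrange
  00400000-004fffff v
  00a00000-00afffff p

where 'v' marks a range that is still volatile and intact, and 'p' one
whose pages have been purged.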

Signed-off-by: Minchan Kim <[email protected]>
---
fs/proc/base.c | 1 +
fs/proc/internal.h | 6 +++
fs/proc/task_mmu.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/vrange.c | 2 +-
4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 69078c7..c1a8506 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2523,6 +2523,7 @@ static const struct pid_entry tgid_base_stuff[] = {
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
REG("maps", S_IRUGO, proc_pid_maps_operations),
+ REG("vrange", S_IRUGO, proc_pid_vrange_operations),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
#endif
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 85ff3a4..0584035 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -60,6 +60,7 @@ extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);

extern const struct file_operations proc_tid_children_operations;
extern const struct file_operations proc_pid_maps_operations;
+extern const struct file_operations proc_pid_vrange_operations;
extern const struct file_operations proc_tid_maps_operations;
extern const struct file_operations proc_pid_numa_maps_operations;
extern const struct file_operations proc_tid_numa_maps_operations;
@@ -82,6 +83,11 @@ struct proc_maps_private {
#endif
};

+struct proc_vrange_private {
+ struct pid *pid;
+ struct task_struct *task;
+};
+
void proc_init_inodecache(void);

static inline struct pid *proc_pid(struct inode *inode)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3e636d8..df009f0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -11,6 +11,7 @@
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/vrange.h>

#include <asm/elf.h>
#include <asm/uaccess.h>
@@ -370,6 +371,134 @@ static int show_tid_map(struct seq_file *m, void *v)
return show_map(m, v, 0);
}

+static void *v_start(struct seq_file *m, loff_t *pos)
+{
+ struct vrange *range;
+ struct mm_struct *mm;
+ struct rb_root *root;
+ struct rb_node *next;
+ struct proc_vrange_private *priv = m->private;
+ loff_t n = *pos;
+
+ /* Clear the per syscall fields in priv */
+ priv->task = NULL;
+
+ priv->task = get_pid_task(priv->pid, PIDTYPE_PID);
+ if (!priv->task)
+ return ERR_PTR(-ESRCH);
+
+ mm = mm_access(priv->task, PTRACE_MODE_READ);
+ if (!mm || IS_ERR(mm))
+ return mm;
+
+ vrange_lock(mm);
+ root = &mm->v_rb;
+
+ if (RB_EMPTY_ROOT(&mm->v_rb))
+ goto out;
+
+ next = rb_first(&mm->v_rb);
+ range = vrange_entry(next);
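+ /* Skip the first *pos ranges so a later read resumes where it left off. */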
+ while (n > 0 && range) {
+ n--;
+ next = rb_next(next);
+ if (next)
+ range = vrange_entry(next);
+ else
+ range = NULL;
+ }
+ if (!n)
+ return range;
+out:
+ return NULL;
+}
+
+static void *v_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct vrange *range = v;
+ struct rb_node *next;
+
+ (*pos)++;
+ next = rb_next(&range->node.rb);
+ if (next) {
+ range = vrange_entry(next);
+ return range;
+ }
+ return NULL;
+}
+
+static void v_stop(struct seq_file *m, void *v)
+{
+ struct proc_vrange_private *priv = m->private;
+ if (priv->task) {
+ struct mm_struct *mm = priv->task->mm;
+ vrange_unlock(mm);
+ mmput(mm);
+ put_task_struct(priv->task);
+ }
+}
+
+static int show_vrange(struct seq_file *m, void *v, int is_pid)
+{
+
+ unsigned long start, end;
+ bool purged;
+ struct vrange *range = v;
+
+ start = range->node.start;
+ end = range->node.last;
+ purged = range->purged;
+
+ seq_printf(m, "%08lx-%08lx %c\n",
+ start,
+ end,
+ purged ? 'p' : 'v');
+ return 0;
+}
+
+static int show_vrange_map(struct seq_file *m, void *v)
+{
+ return show_vrange(m, v, 1);
+}
+
+static const struct seq_operations proc_pid_vrange_op = {
+ .start = v_start,
+ .next = v_next,
+ .stop = v_stop,
+ .show = show_vrange_map
+};
+
+static int do_vrange_open(struct inode *inode, struct file *file,
+ const struct seq_operations *ops)
+{
+ struct proc_vrange_private *priv;
+ int ret = -ENOMEM;
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (priv) {
+ priv->pid = proc_pid(inode);
+ ret = seq_open(file, ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = priv;
+ } else {
+ kfree(priv);
+ }
+ }
+ return ret;
+}
+
+static int pid_vrange_open(struct inode *inode, struct file *file)
+{
+ return do_vrange_open(inode, file, &proc_pid_vrange_op);
+}
+
+const struct file_operations proc_pid_vrange_operations = {
+ .open = pid_vrange_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
static const struct seq_operations proc_pid_maps_op = {
.start = m_start,
.next = m_next,
diff --git a/mm/vrange.c b/mm/vrange.c
index 2f77d89..f8b6f0e 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -199,7 +199,7 @@ SYSCALL_DEFINE4(vrange, unsigned long, start,
if (!len)
goto out;

- end = start len;
+ end = start + len;
if (end < start)
goto out;

--
1.8.1.1

2013-03-12 07:41:36

by Minchan Kim

[permalink] [raw]
Subject: [RFC v7 03/11] add new system call vrange(2)

This patch adds new system call sys_vrange.

NAME
vrange - give a pin/unpin hint to the kernel to help reclaim.

SYNOPSIS
int vrange(unsigned long start, size_t length, int mode, int behavior);

DESCRIPTION
Applications can use vrange(2) to advise the kernel how it should
handle paging I/O in this VM area. The idea is to help the kernel
discard pages of a vrange instead of reclaiming them when memory
pressure happens. It means the kernel doesn't discard any pages of a
vrange if there is no memory pressure.

mode:

VRANGE_VOLATILE
hint to the kernel that the VM may discard pages in the
vrange when memory pressure happens.
VRANGE_NOVOLATILE
hint to the kernel that the VM must not discard pages in
the vrange any more.

behavior:

VRANGE_FULL_MODE
Once the VM starts to discard pages, it discards all pages
in a vrange.
VRANGE_PARTIAL_MODE
The VM discards some pages of all vranges in round-robin
order.

If the user tries to access purged memory without a prior
VRANGE_NOVOLATILE call, they can encounter SIGBUS if the page was
discarded by the kernel.

RETURN VALUE
On success vrange returns zero or 1. Zero means the kernel did not
discard any pages in [start, start + length). 1 means the kernel
discarded at least one page in the range.

ERRORS
EINVAL This error can occur for the following reasons:

* The value of length is negative.
* start is not page-aligned.
* mode or behavior is not a valid value.

ENOMEM Not enough memory.
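
EXAMPLE
A minimal sketch of a caller, assuming the x86-64 syscall number 314
from the table below and the VRANGE_* values this patch defines; a raw
syscall(2) invocation is used since no libc wrapper exists yet:

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

#define __NR_vrange 314 /* from syscall_64.tbl below */
#define VRANGE_VOLATILE 0
#define VRANGE_NOVOLATILE 1
#define VRANGE_FULL_MODE 0

int main(void)
{
	size_t len = 16 * 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 'x', len);	/* populate the pages */

	/* Unpin: the kernel may now discard these pages under pressure. */
	syscall(__NR_vrange, (unsigned long)buf, len,
		VRANGE_VOLATILE, VRANGE_FULL_MODE);

	/* Pin again before reuse; a return of 1 means pages were purged. */
	long purged = syscall(__NR_vrange, (unsigned long)buf, len,
			      VRANGE_NOVOLATILE, VRANGE_FULL_MODE);
	if (purged == 1)
		memset(buf, 'x', len);	/* regenerate the discarded data */

	printf("purged: %ld\n", purged);
	return 0;
}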

Signed-off-by: Minchan Kim <[email protected]>
---
arch/x86/syscalls/syscall_64.tbl | 1 +
include/uapi/asm-generic/mman-common.h | 5 +++
mm/vrange.c | 58 ++++++++++++++++++++++++++++++++++
3 files changed, 64 insertions(+)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65d..dc332bd 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
311 64 process_vm_writev sys_process_vm_writev
312 common kcmp sys_kcmp
313 common finit_module sys_finit_module
+314 common vrange sys_vrange

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 4164529..736696e 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -66,4 +66,9 @@
#define MAP_HUGE_SHIFT 26
#define MAP_HUGE_MASK 0x3f

+#define VRANGE_VOLATILE 0 /* unpin all pages so VM can discard them */
+#define VRANGE_NOVOLATILE 1 /* pin all pages so VM can't discard them */
+
+#define VRANGE_FULL_MODE 0 /* discard all pages of the range */
+#define VRANGE_PARTIAL_MODE 1 /* discard a few pages of the range */
#endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/vrange.c b/mm/vrange.c
index e265c82..2f77d89 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -4,6 +4,8 @@

#include <linux/vrange.h>
#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/mman.h>

static struct kmem_cache *vrange_cachep;

@@ -155,3 +157,59 @@ void exit_vrange(struct mm_struct *mm)
free_vrange(range);
}
}
+
+/*
+ * The vrange(2) system call.
+ *
+ * Applications can use vrange() to advise the kernel how it should
+ * handle paging I/O in this VM area. The idea is to help the kernel
+ * discard pages of a vrange instead of swapping them out when memory
+ * pressure happens. The information provided is advisory only, and
+ * can be safely disregarded by the kernel if the system has enough
+ * free memory.
+ *
+ * mode values:
+ * VRANGE_VOLATILE - hint to the kernel that the VM may discard vrange
+ * pages when memory pressure happens.
+ * VRANGE_NOVOLATILE - hint to the kernel that the VM must not discard
+ * vrange pages any more.
+ *
+ * behavior values:
+ * VRANGE_FULL_MODE - once the VM starts to discard pages, it discards
+ * all pages in a vrange.
+ * VRANGE_PARTIAL_MODE - the VM discards some pages of all vranges in
+ * round-robin order.
+ *
+ * return values:
+ * 0 - success and NOT purged.
+ * 1 - at least one of the pages in [start, start + len) was discarded
+ * by the VM.
+ * -EINVAL - len is zero after page alignment, start is not page-aligned,
+ * start + len wraps, start is greater than TASK_SIZE or "mode" is not
+ * a valid value.
+ * -ENOMEM - not enough free memory to complete the system call.
+ */
+SYSCALL_DEFINE4(vrange, unsigned long, start,
+ size_t, len, int, mode, int, behavior)
+{
+ unsigned long end;
+ struct mm_struct *mm = current->mm;
+ int ret = -EINVAL;
+
+ if (start & ~PAGE_MASK)
+ goto out;
+
+ len &= PAGE_MASK;
+ if (!len)
+ goto out;
+
+ end = start len;
+ if (end < start)
+ goto out;
+
+ if (start >= TASK_SIZE)
+ goto out;
+
+ if (mode == VRANGE_VOLATILE)
+ ret = add_vrange(mm, start, end - 1);
+ else if (mode == VRANGE_NOVOLATILE)
+ ret = remove_vrange(mm, start, end - 1);
+out:
+ return ret;
+}
--
1.8.1.1

2013-03-12 23:17:35

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Tue, Mar 12, 2013 at 12:38 AM, Minchan Kim <[email protected]> wrote:
> [...]
>
> - What's the benefit compared to DONTNEED?
>
> 1. The system call overhead is smaller because vrange just registers
> a range using an interval tree instead of zapping all the pages in a
> range, so the overhead should be really cheap.
>
> 2. It has a chance to eliminate overheads (ex, zapping ptes + page fault
> + page allocation + memset(PAGE_SIZE)) if memory pressure isn't
> severe.
>
> 3. It has the potential to zap all ptes and free the pages if memory
> pressure is severe, so the discard scanning overhead could be smaller - TODO
>
> - What's it targeting?
>
> Firstly, user-space allocators like ptmalloc and jemalloc, or the heap
> management of virtual machines like Dalvik. Also, it comes in handy for
> embedded systems which don't have a swap device, so they can't reclaim
> anonymous pages. By discarding instead of swapping out, it could be
> used on non-swap systems.

I think that another potentially useful use-case would be using this
-- or a similar API -- to opportunistically return deep user stack
frames.

This is another place where we strongly care about the time-to-free as
well as the time-to-reallocate in the case of relatively immediate
re-use.

>
> Changelog from v6 - There are many changes.
> * Remove vma-based approach
> * Change system call semantic
> * Add more meaningful experiment
>
> Changelog from v5 - There are many changes.
>
> * Support CONFIG_VOLATILE_PAGE
> * Working with THP/KSM
> * Remove vma hacking logic in m[no]volatile system call
> * Discard page without swap cache
> * Kswapd discard volatile page so we can discard volatile pages
> although we don't have swap.
>
> Changelog from v4
>
> * Add new system call mvolatile/mnovolatile
> * Add sigbus when user try to access volatile range
> * Rebased on v3.7
> * Applied bug fix from John Stultz, Thanks!
>
> Changelog from v3
>
> * Removing madvise(addr, length, MADV_NOVOLATILE).
> * add vmstat about the number of discarded volatile pages
> * discard volatile pages without promotion in reclaim path
>
> Minchan Kim (11):
> vrange: enable generic interval tree
> add vrange basic data structure and functions
> add new system call vrange(2)
> add proc/pid/vrange information
> Add purge operation
> send SIGBUS when user try to access purged page
> keep mm_struct to vrange when system call context
> add LRU handling for victim vrange
> Get rid of depenceny that all pages is from a zone in shrink_page_list
> Purging vrange pages without swap
> add purged page information in vmstat
>
> arch/x86/include/asm/pgtable_types.h | 2 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/proc/base.c | 1 +
> fs/proc/internal.h | 6 +
> fs/proc/task_mmu.c | 129 ++++++
> include/asm-generic/pgtable.h | 11 +
> include/linux/mm_types.h | 5 +
> include/linux/rmap.h | 15 +-
> include/linux/swap.h | 1 +
> include/linux/vm_event_item.h | 4 +
> include/linux/vrange.h | 59 +++
> include/uapi/asm-generic/mman-common.h | 5 +
> init/main.c | 2 +
> kernel/fork.c | 3 +
> lib/Makefile | 2 +-
> mm/Makefile | 2 +-
> mm/ksm.c | 2 +-
> mm/memory.c | 24 +-
> mm/rmap.c | 23 +-
> mm/swapfile.c | 36 ++
> mm/vmscan.c | 74 +++-
> mm/vmstat.c | 4 +
> mm/vrange.c | 754 +++++++++++++++++++++++++++++++++
> 23 files changed, 1143 insertions(+), 22 deletions(-)
> create mode 100644 include/linux/vrange.h
> create mode 100644 mm/vrange.c
>
> --
> 1.8.1.1
>

2013-03-13 06:45:07

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Tue, Mar 12, 2013 at 04:16:57PM -0700, Paul Turner wrote:
> On Tue, Mar 12, 2013 at 12:38 AM, Minchan Kim <[email protected]> wrote:
> > [...]
>
> I think that another potentially useful use-case would be using this
> -- or a similar API -- to opportunistically return deep user stack
> frames.
>
> This is another place where we strongly care about the time-to-free as
> well as the time-to-reallocate in the case of relatively immediate
> re-use.

Indeed. Great idea!
Thanks, Paul.

--
Kind regards,
Minchan Kim

2013-03-21 01:29:46

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On 03/12/2013 12:38 AM, Minchan Kim wrote:
> [...]
>
> - What's the sys_vrange(addr, length, mode, behavior)?
>
> It's a hint that user deliver to kernel so kernel can *discard*
> pages in a range anytime. mode is one of VRANGE_VOLATILE and
> VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is memory pin operation so
> kernel coudn't discard any pages any more while VRANGE_VOLATILE
> is memory unpin opeartion so kernel can discard pages in vrange
> anytime. At a moment, behavior is one of VRANGE_FULL and VRANGE
> PARTIAL. VRANGE_FULL tell kernel that once kernel decide to
> discard page in a vrange, please, discard all of pages in a
> vrange selected by victim vrange. VRANGE_PARTIAL tell kernel
> that please discard of some pages in a vrange. But now I didn't
> implemented VRANGE_PARTIAL handling yet.


So I'm very excited to see this new revision! Moving away from the VMA
based approach I think is really necessary, since managing the volatile
ranges on a per-mm basis really isn't going to work when we want shared
volatile ranges between processes (such as the shmem/tmpfs case Android
uses).

Just a few questions and observations from my initial playing around
with the patch:

1) So, I'm not sure I understand the benefit of VRANGE_PARTIAL. Why
would VRANGE_PARTIAL be useful?

2) I've got a trivial test program that I've used previously with ashmem
& my earlier file based efforts that allocates 26megs of page aligned
memory, and marks every other meg as volatile. Then it forks and the
child generates a ton of memory pressure, causing pages to be purged
(and the child killed by the OOM killer). Initially I didn't see my test
purging any pages with your patches. The problem of course was the
child's COW pages were not also marked volatile, so they could not be
purged. Once I over-wrote the data in the child, breaking the COW links,
the data in the parent was purged under pressure. This is good, because
it makes sure we don't purge cow pages if the volatility state isn't
consistent, but it also brings up a few questions:

- Should volatility be inherited on fork? If volatility is not
inherited on fork(), that could cause some strange behavior if the data
was purged prior to the fork, and also it's not clear what the behavior
of the child should be with regard to data that was volatile at fork
time. However, we also don't want strange behavior on exec if
overwritten volatile pages were unexpectedly purged.

- At this moment, maybe not having thought it through enough, I'm
wondering if it makes sense to have volatility inherited on fork, but
cleared on exec? What are your thoughts here? It's been a while, so I'm
not sure if that's consistent with my earlier comments on the topic.


3) Oddly, in my test case, once I changed the child to over-write the
volatile range and break the COW pages, the OOM killer more frequently
seems to favor killing the parent process instead of the memory-hogging
child process. I need to spend some more time looking at this, and I
know the OOM killer may go for the parent process sometimes, but it
definitely happens more frequently than when the COW pages are not
broken and no data is purged. Again, I need to dig in more here.


4) One of the harder aspects I'm trying to get my head around is how
your patches seem to use both the page list shrinkers (discard_vpage) to
purge ranges when particular pages are selected, and a zone shrinker
(discard_vrange_pages) which manages its own lru of vranges. I get that
this is one way to handle purging anonymous pages when we are on a
swapless system, but the dual purging systems definitely make the code
harder to follow. Would something like my earlier attempts at changing
vmscan to shrink anonymous pages be simpler? Or is that just not going
to fly w/ the mm folks?


I'll continue working with the patches and try to get tmpfs support
added here soon.

Also, attached is a simple cleanup patch that you might want to fold in.

thanks
-john


Attachments:
0001-vrange-Make-various-vrange.c-local-functions-static.patch (2.84 kB)

2013-03-22 06:01:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Wed, Mar 20, 2013 at 06:29:38PM -0700, John Stultz wrote:
> On 03/12/2013 12:38 AM, Minchan Kim wrote:
> > [...]
>
>
> So I'm very excited to see this new revision! Moving away from the
> VMA based approach I think is really necessary, since managing the
> volatile ranges on a per-mm basis really isn't going to work when we
> want shared volatile ranges between processes (such as the
> shmem/tmpfs case Android uses).
>
> Just a few questions and observations from my initial playing around
> with the patch:
>
> 1) So, I'm not sure I understand the benefit of VRANGE_PARTIAL. Why
> would VRANGE_PARTIAL be useful?

For example, some process makes a 64M vrange and now the kernel needs
8M of pages to get out of memory pressure. In this case, we don't need
to discard all 64M at once, because if we discard only 8M of pages, the
allocator's cost is (8M/4K) * (page fault + allocation + zero-clearing)
rather than (64M/4K) * (page fault + allocation + zero-clearing)
otherwise.
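Concretely, with 4K pages that is 8M/4K = 2048 (fault + allocation +
zero-clearing) cycles to repopulate, versus 64M/4K = 16384 for the
whole range: an 8x difference.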

If it were a temporary image extracted from some compressed format,
it's not easy to regenerate the punched-hole data from the original
source, so it would be better to discard all of the pages in the
vrange, which will be far away from the memory reclaimer.

>
> 2) I've got a trivial test program that I've used previously with
> ashmem & my earlier file based efforts that allocates 26megs of page
> aligned memory, and marks every other meg as volatile. Then it forks
> and the child generates a ton of memory pressure, causing pages to
> be purged (and the child killed by the OOM killer). Initially I
> didn't see my test purging any pages with your patches. The problem
> of course was the child's COW pages were not also marked volatile,
> so they could not be purged. Once I over-wrote the data in the
> child, breaking the COW links, the data in the parent was purged
> under pressure. This is good, because it makes sure we don't purge
> cow pages if the volatility state isn't consistent, but it also
> brings up a few questions:
>
> - Should volatility be inherited on fork? If volatility is not
> inherited on fork(), that could cause some strange behavior if the
> data was purged prior to the fork, and also its not clear what the
> behavior of the child should be with regards to data that was
> volatile at fork time. However, we also don't want strange behavior
> on exec if overwritten volatile pages were unexpectedly purged.

I don't know why we should inherit volatility to the child, at least
for anon vranges, because it's not a proper way to share the data.
For sharing anonymous pages, we should use shmem, so that work could
be done when we do the tmpfs work, I guess.

Currently, I implemented it to protect only COW pages.
If the data was purged prior to fork, the page should never be mapped
logically, so the child should see a newly zero-cleared page if it
tries to access the address. But as you pointed out, there is a bug:
I should have handled it in copy_one_pte. I guess the bug might cause
an OOM kill of the parent due to a wrong rss count. I will fix it.

I'm not sure this is a good answer to your question because I couldn't
understand it fully. If my answer isn't enough, could you elaborate
some more?

>
> - At this moment, maybe not having thought it through enough,
> I'm wondering if it makes sense to have volatility inherited on
> fork, but cleared on exec? What are your thoughts here? Its been
> awhile, so I'm not sure if that's consistent with my earlier
> comments on the topic.

I already gave my opinion above.

>
>
> 3) Oddly, in my test case, once I changed the child to over-write
> the volatile range and break the COW pages, the OOM killer more
> frequently seems to favor killing the parent process, instead of the
> memory hogging child process. I need to spend some more time looking
> at this, and I know the OOM killer may go for the parent process
> sometimes, but it definitely happens more frequently then when the
> COW pages are not broken and no data is purged. Again, I need to dig
> in more here.

It should be a problem with the wrong RSS count.
Could you send the test program? I will fix it if you don't have enough time.

>
>
> 4) One of the harder aspects I'm trying to get my head around is how
> your patches seem to use both the page list shrinkers
> (discard_vpage) to purge ranges when particular pages selected, and
> a zone shrinker (discard_vrange_pages) which manages its own lru of
> vranges. I get that this is one way to handle purging anonymous
> pages when we are on a swapless system, but the dual purging systems
> definitely make the code harder to follow. Would something like my

discard_vpage is for avoiding swap-out in the direct reclaim path
when kswapd misses the page.

discard_vrange_pages is for handling volatile pages with top priority,
prior to reclaiming non-volatile pages.

I think it's very clear, NOT hard to understand. :)
And discard_vpage is the basic core function to discard a volatile
page, so it could be used in many places.

> earlier attempts at changing vmscan to shrink anonymous pages be
> simpler? Or is that just not going to fly w/ the mm folks?

There were many attempts in the past. Could you point one out?
>
>
> I'll continue working with the patches and try to get tmpfs support
> added here soon.
>
> Also, attached is a simple cleanup patch that you might want to fold in.

Thanks, John!

>
> thanks
> -john
>

> >From 10f50e53ae706d61591b3247bc494b47a79f2b69 Mon Sep 17 00:00:00 2001
> From: John Stultz <[email protected]>
> Date: Wed, 20 Mar 2013 18:24:56 -0700
> Subject: [PATCH] vrange: Make various vrange.c local functions static
>
> Make a number of local functions in vrange.c static.
>
> Signed-off-by: John Stultz <[email protected]>
> ---
> mm/vrange.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/mm/vrange.c b/mm/vrange.c
> index c0c5d50..d07884d 100644
> --- a/mm/vrange.c
> +++ b/mm/vrange.c
> @@ -45,7 +45,7 @@ static inline void __set_vrange(struct vrange *range,
> range->node.last = end_idx;
> }
>
> -void lru_add_vrange(struct vrange *vrange)
> +static void lru_add_vrange(struct vrange *vrange)
> {
> spin_lock(&lru_lock);
> WARN_ON(!list_empty(&vrange->lru));
> @@ -53,7 +53,7 @@ void lru_add_vrange(struct vrange *vrange)
> spin_unlock(&lru_lock);
> }
>
> -void lru_remove_vrange(struct vrange *vrange)
> +static void lru_remove_vrange(struct vrange *vrange)
> {
> spin_lock(&lru_lock);
> if (!list_empty(&vrange->lru))
> @@ -130,7 +130,7 @@ static inline void range_resize(struct rb_root *root,
> __add_range(range, root, mm);
> }
>
> -int add_vrange(struct mm_struct *mm,
> +static int add_vrange(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> struct rb_root *root;
> @@ -172,7 +172,7 @@ out:
> return 0;
> }
>
> -int remove_vrange(struct mm_struct *mm,
> +static int remove_vrange(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> struct rb_root *root;
> @@ -292,7 +292,7 @@ out:
> return ret;
> }
>
> -bool __vrange_address(struct mm_struct *mm,
> +static bool __vrange_address(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> struct rb_root *root = &mm->v_rb;
> @@ -387,7 +387,7 @@ static void __vrange_purge(struct mm_struct *mm,
> }
> }
>
> -int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
> +static int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
> unsigned long address)
> {
> struct mm_struct *mm = vma->vm_mm;
> @@ -602,7 +602,7 @@ static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>
> }
>
> -unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
> +static unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start,
> unsigned long end, unsigned int nr_to_discard)
> {
> @@ -669,7 +669,7 @@ out:
> * Get next victim vrange from LRU and hold a vrange refcount
> * and vrange->mm's refcount.
> */
> -struct vrange *get_victim_vrange(void)
> +static struct vrange *get_victim_vrange(void)
> {
> struct mm_struct *mm;
> struct vrange *vrange = NULL;
> @@ -711,7 +711,7 @@ struct vrange *get_victim_vrange(void)
> return vrange;
> }
>
> -void put_victim_range(struct vrange *vrange)
> +static void put_victim_range(struct vrange *vrange)
> {
> put_vrange(vrange);
> mmdrop(vrange->mm);
> --
> 1.7.10.4
>


--
Kind regards,
Minchan Kim

2013-03-22 17:07:05

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On 03/21/2013 11:01 PM, Minchan Kim wrote:
> On Wed, Mar 20, 2013 at 06:29:38PM -0700, John Stultz wrote:
>> On 03/12/2013 12:38 AM, Minchan Kim wrote:
>>> [...]
>>
>> 1) So, I'm not sure I understand the benefit of VRANGE_PARTIAL. Why
>> would VRANGE_PARTIAL be useful?
> For exmaple, some process makes 64M vranges and now kernel needs 8M
> pages to flee from memory pressure state. In this case, we don't need
> to discard 64M all at once because if we discard only 8M page, the cost
> of allocator is (8M/4K) * page(falut + allocation + zero-clearing)
> while (64M/4K) * page(falut + allocation + zero-clearing), otherwise.
>
> If it were temporal image extracted on some compressed format, it's not
> easy to regenerate punched hole data from original source so it would
> be better to discard all pages in the vrange, which will be very far
> from memory reclaimer.

So, if I understand you properly, it's more an issue of the added
cost of making the purged range non-volatile, and re-faulting in the
pages if we purge them all, when we didn't actually have the memory
pressure to warrant purging the entire range?

Hrm. Ok, I can sort of see that.

So if we do partial purging, all the data in the range is invalid -
since we don't know which pages in particular were purged - but the
costs of marking the range non-volatile and of over-writing the pages
with the re-created data will be slightly cheaper.

I guess the other benefit is if you're using the SIGBUS semantics, you
might luck out and not actually touch a purged page. Whereas if the
entire range is purged, the process will definitely hit the SIGBUS if
it's accessing the volatile data.


So yea, its starting to make sense.

Much of my earlier confusion comes from the comment in the vrange
syscall implementation that suggests VRANGE_PARTIAL will purge from
ranges intentionally in round-robin order, which I think is probably
not advantageous, as it will invalidate more ranges, causing more
overhead. Instead, using the normal page eviction order with _PARTIAL
would probably be best.


>> 2) I've got a trivial test program that I've used previously with
>> ashmem & my earlier file based efforts that allocates 26megs of page
>> aligned memory, and marks every other meg as volatile. Then it forks
>> and the child generates a ton of memory pressure, causing pages to
>> be purged (and the child killed by the OOM killer). Initially I
>> didn't see my test purging any pages with your patches. The problem
>> of course was the child's COW pages were not also marked volatile,
>> so they could not be purged. Once I over-wrote the data in the
>> child, breaking the COW links, the data in the parent was purged
>> under pressure. This is good, because it makes sure we don't purge
>> cow pages if the volatility state isn't consistent, but it also
>> brings up a few questions:
>>
>> - Should volatility be inherited on fork? If volatility is not
>> inherited on fork(), that could cause some strange behavior if the
>> data was purged prior to the fork, and also its not clear what the
>> behavior of the child should be with regards to data that was
>> volatile at fork time. However, we also don't want strange behavior
>> on exec if overwritten volatile pages were unexpectedly purged.
> I don't know why we should inherit volatility to child at least, for
> anon vrange. Because it's not proper way to share the data.
> For data sharing for anonymous page, we should use shmem so the work
> could be done when we work tmpfs work, I guess.

I'm not suggesting the volatile *pages* on fork would be shared (other
than that they are COW); instead, my point is that the volatile *state*
of the pages should probably be preserved over a fork.

Given the following example:

buf = malloc(BIGBUF);
memset(buf, 'P', BIGBUF);
vrange(buf, BIGBUF, VRANGE_VOLATILE, VRANGE_FULL);
pid = fork();

if (!pid) /* break COW sharing*/
memset(buf, 'C', BIGBUF);

generate_memory_pressure();
purged = vrange(buf, BIGBUF, VRANGE_NOVOLATILE, VRANGE_FULL);


Currently, because vrange is set before the fork, in this example, only
the parent's volatile range will be purged. However, if we were to move
the fork() one line up, then both parent and child would see their
separate ranges purged. This behavior is not quite intuitive, as I
usually expect the child's state to be identical to the parent's at fork time.

In my mind, what would be less surprising is if, in the above code,
the volatility state of buf were inherited by the child as well
(basically copying the vrange tree at fork).

And the COW breaking in the above is just for clarification; even if
the COW links weren't broken and the pages were still shared between
the child and parent after fork, since they both would consider the
buffer state volatile, it would still be ok to purge the pages.

Now, the other side of the coin is that if we have volatile data at
fork time, but the child then goes on to call exec, we don't want the
new process to randomly hit SIGBUS faults when the earlier-set volatile
range is purged. So if we inherit volatile state to children, my thought
is we probably want to clear all volatile state on exec.
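
Roughly, a sketch of what the fork-time copy could look like, reusing
the helpers from the patchset with a hypothetical dup_vrange() called
from dup_mmap() (error handling kept minimal):

static int dup_vrange(struct mm_struct *oldmm, struct mm_struct *mm)
{
	struct rb_node *next;

	vrange_lock(oldmm);
	for (next = rb_first(&oldmm->v_rb); next; next = rb_next(next)) {
		struct vrange *range = vrange_entry(next);
		struct vrange *new = alloc_vrange();

		if (!new) {
			/* caller would unwind via exit_vrange(mm) */
			vrange_unlock(oldmm);
			return -ENOMEM;
		}
		/* copy interval and purged state into the child's tree */
		__set_vrange(new, range->node.start, range->node.last);
		new->purged = range->purged;
		__add_range(new, &mm->v_rb);
	}
	vrange_unlock(oldmm);
	return 0;
}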




>>
>> 4) One of the harder aspects I'm trying to get my head around is how
>> your patches seem to use both the page list shrinkers
>> (discard_vpage) to purge ranges when particular pages selected, and
>> a zone shrinker (discard_vrange_pages) which manages its own lru of
>> vranges. I get that this is one way to handle purging anonymous
>> pages when we are on a swapless system, but the dual purging systems
>> definitely make the code harder to follow. Would something like my
> discard_vpage is for avoiding swapping out in direct reclaim path
> when kswapd miss the page.
>
> discard_vrange_pages is for handling volatile pages as top prioirty
> prio to reclaim non-volatile pages.

So one note: while I've pushed for freeing volatile pages first in the
past, I know Mel has had some objections to this. For instance, he
thought there are cases where freeing the volatile data first wasn't the
right thing to do, such as the case of streaming data, and that we
probably want to leave it to the page eviction LRUs to pick the pages
for us.


>
> I think it's very clear, NOT to understand. :)
> And discard_vpage is basic core function to discard volatile page
> so it could be used many places.

Ok, I suspect it will make more sense as I get more familiar with it. :)

>
>> earlier attempts at changing vmscan to shrink anonymous pages be
>> simpler? Or is that just not going to fly w/ the mm folks?
> There were many attempt at old. Could you point out?

https://lkml.org/lkml/2012/6/12/587
Although I know you had objections to my specific implementation, since
it kept non-volatile anonymous pages on the active list.



thanks
-john

2013-03-25 08:42:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Fri, Mar 22, 2013 at 10:06:56AM -0700, John Stultz wrote:
> On 03/21/2013 11:01 PM, Minchan Kim wrote:
> >On Wed, Mar 20, 2013 at 06:29:38PM -0700, John Stultz wrote:
> >>On 03/12/2013 12:38 AM, Minchan Kim wrote:
> >>> [...]
> >>
> >>1) So, I'm not sure I understand the benefit of VRANGE_PARTIAL. Why
> >>would VRANGE_PARTIAL be useful?
> >For exmaple, some process makes 64M vranges and now kernel needs 8M
> >pages to flee from memory pressure state. In this case, we don't need
> >to discard 64M all at once because if we discard only 8M page, the cost
> >of allocator is (8M/4K) * page(falut + allocation + zero-clearing)
> >while (64M/4K) * page(falut + allocation + zero-clearing), otherwise.
> >
> >If it were temporal image extracted on some compressed format, it's not
> >easy to regenerate punched hole data from original source so it would
> >be better to discard all pages in the vrange, which will be very far
> >from memory reclaimer.
>
> So, if I understand you properly, its more an issue of the the added
> cost of making the purged range non-volatile, and re-faulting in the
> pages if we purge them all, when we didn't actually have the memory
> pressure to warrant purging the entire range?
>
> Hrm. Ok, I can sort of see that.
>
> So if we do partial-purging, all the data in the range is invalid -
> since we don't know which pages in particular were purged, but the
> costs when marking the range non-volatile and the costs of
> over-writing the pages with the re-created data will be slightly
> cheaper.

It could be much cheaper: in my experiment with this patchset, the
allocator reduced minor faults from 105799867 to 9401.

>
> I guess the other benefit is if you're using the SIGBUS semantics,
> you might luck out and not actually touch a purged page. Where as if
> the entire range is purged, the process will definitely hit the
> SIGBUS if its accessing the volatile data.

Yes. I guess that's why Taras liked it.
Quoting from an old version:
"
4) Having a new system call makes it easier for userspace apps to
detect kernels without this functionality.

I really like the proposed interface. I like the suggestion of having
explicit FULL|PARTIAL_VOLATILE. Why not include PARTIAL_VOLATILE as a
required 3rd param in first version with expectation that
FULL_VOLATILE will be added later(and returning some not-supported error
in meantime)?
"

>
>
> So yea, its starting to make sense.
>
> Much of my earlier confusion comes from comment in the vrange
> syscall implementation that suggests VRANGE_PARTIAL will purge from
> ranges intentionally in round-robin order, which I think is probably
> not advantageous, as it will invalidate more ranges causing more
> overhead. Instead using the normal page eviction order with
> _PARTIAL would probably be best.

As you know, I insisted several times that "volatile pages" are nothing
special, so we should reclaim them in normal page order. But I changed
my mind: from the allocator's POV, if we reclaim in normal page order,
the VM can swap out other working-set pages instead of volatile pages.
What would happen without this feature in the kernel? Allocators would
call madvise(DONTNEED) or munmap, so those pages would never be swapped
out (whereas a swapped-out working-set page means a major fault later).

Major faults would erode this feature's benefit, so I'd like to sweep
volatile pages out first.

If we really want to reclaim some vrange pages in normal page order,
we can add a new argument to the vrange system call and handle it later.
But I'm not sure we really need it.

>
>
> >>2) I've got a trivial test program that I've used previously with
> >>ashmem & my earlier file based efforts that allocates 26megs of page
> >>aligned memory, and marks every other meg as volatile. Then it forks
> >>and the child generates a ton of memory pressure, causing pages to
> >>be purged (and the child killed by the OOM killer). Initially I
> >>didn't see my test purging any pages with your patches. The problem
> >>of course was the child's COW pages were not also marked volatile,
> >>so they could not be purged. Once I over-wrote the data in the
> >>child, breaking the COW links, the data in the parent was purged
> >>under pressure. This is good, because it makes sure we don't purge
> >>cow pages if the volatility state isn't consistent, but it also
> >>brings up a few questions:
> >>
> >> - Should volatility be inherited on fork? If volatility is not
> >>inherited on fork(), that could cause some strange behavior if the
> >>data was purged prior to the fork, and also its not clear what the
> >>behavior of the child should be with regards to data that was
> >>volatile at fork time. However, we also don't want strange behavior
> >>on exec if overwritten volatile pages were unexpectedly purged.
> >I don't know why we should inherit volatility to child at least, for
> >anon vrange. Because it's not proper way to share the data.
> >For data sharing for anonymous page, we should use shmem so the work
> >could be done when we work tmpfs work, I guess.
>
> I'm not suggesting the volatile *pages* on fork would be shared
> (other then they are COW), instead my point is the volatile *state*
> of the pages should probably be preserved over a fork.
>
> Given the following example:
>
> buf = malloc(BIGBUF);
> memset(buf, 'P', BIGBUF);
> vrange(buf, BIGBUF, VRANGE_VOLATILE, VRANGE_FULL);
> pid = fork();
>
> if (!pid) /* break COW sharing*/
> memset(buf, 'C', BIGBUF);
>
> generate_memory_pressure();
> purged = vrange(buf, BIGBUF, VRANGE_NOVOLATILE, VRANGE_FULL);
>
>
> Currently, because vrange is set before the fork, in this example,
> only the parent's volatile range will be purged. However, if we were
> to move the fork() one line up, then both parent and child would see
> their separate ranges purged. This behavior is not quite intuitive,
> as I usually expect the childs state to be identical to the parents
> at fork time.
>
> In my mind, what would be less surprising is if in the above code,
> the volatility state of buf would be inherited to the child as well
> (basically copying the vrange tree at fork).
>
> And the cow breaking in the above is just for clarification, even if
> the COW links weren't broken and the pages were still shared between
> the child and parent after fork, since they both would consider the
> buffer state as volatile, it would still be ok to purge the pages.
>
> Now, the other side of the coin, is that if we have volatile data at
> fork time, but the child then goes on to call exec, we don't want
> the new process to randomly hit sigfaults when the earlier set
> volatile range is purged. So if we inherit volatile state to
> children, my thought is we probably want to clear all volatile state
> on exec.

Indeed, I get your point. Frankly speaking, I implemented it with an
odd locking scheme in this version, then decided to drop it because I
wasn't sure anyone wanted such a usecase between parent and child, so
I was tempted to drop it. ;-)

Okay, I have an idea. I will support it in the next spin, and I don't
think it's odd at all that a child has volatile data at fork time and
that exec clears it.
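
John's example above, filled out as a complete program, would look
roughly like the sketch below. Treat it as a sketch only: the wrapper
and the constant values are placeholder assumptions, with just the
syscall number 314 taken from the prototype's x86-64 wiring.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#define BIGBUF (16 * 1024 * 1024)

/* Placeholder values: the real constants come from the patchset. */
#define VRANGE_VOLATILE 0
#define VRANGE_NOVOLATILE 1
#define VRANGE_FULL 0

static long vrange(void *addr, size_t len, int mode, int behavior)
{
	return syscall(314, addr, len, mode, behavior);
}

int main(void)
{
	char *buf;
	long purged;
	pid_t pid;

	/* vrange presumably works on whole pages, so align the buffer. */
	if (posix_memalign((void **)&buf, 4096, BIGBUF))
		return 1;

	memset(buf, 'P', BIGBUF);
	vrange(buf, BIGBUF, VRANGE_VOLATILE, VRANGE_FULL);
	pid = fork();

	if (!pid) /* break COW sharing */
		memset(buf, 'C', BIGBUF);

	/* ... generate memory pressure here ... */

	purged = vrange(buf, BIGBUF, VRANGE_NOVOLATILE, VRANGE_FULL);
	printf("pid %d: purged=%ld\n", getpid(), purged);
	return 0;
}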

>
>
>
>
> >>
> >>4) One of the harder aspects I'm trying to get my head around is how
> >>your patches seem to use both the page list shrinkers
> >>(discard_vpage) to purge ranges when particular pages selected, and
> >>a zone shrinker (discard_vrange_pages) which manages its own lru of
> >>vranges. I get that this is one way to handle purging anonymous
> >>pages when we are on a swapless system, but the dual purging systems
> >>definitely make the code harder to follow. Would something like my
> >discard_vpage is for avoiding swapping out in direct reclaim path
> >when kswapd miss the page.
> >
> >discard_vrange_pages is for handling volatile pages as top prioirty
> >prio to reclaim non-volatile pages.
>
> So one note: while I've pushed for freeing volatile pages first in
> the past, I know Mel has had some objections to this, for instance,

Me too, at that time. But I changed my mind, as I mentioned earlier.

> he thought there are cases where freeing the volatile data first
> wasn't the right thing to do, such as the case with streaming data,
> and that we probably want to leave it to the page eviction LRUs to
> pick the pages for us.

I agree on streaming data. It would be great to reclaim those rather
than vrange pages, but the current VM isn't smart enough to detect
streaming data and reclaim it as top priority without the user's help.

If the user helps the kernel with a hint (e.g. fadvise), the kernel
frees the pages instantly so nothing remains in the LRU, and if the
kernel can't free pages for some reason (dirty or locked), it can move
them to the tail of the inactive LRU so that on the next pass it can
reclaim them as top priority once they meet the free condition.

It means what we have to care about is streaming data remaining in the
LRU after the user has given such advice. If that turns out to be a
really severe problem, I'd like to introduce a new LRU list, call it
easy-reclaimable, so the kernel can move such pages onto it and reclaim
them as top priority before discarding vrange pages.
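
From userspace, the existing hint is just the fadvise call below (a
real, current API, shown for contrast with vrange's lazy semantics):

#include <fcntl.h>

/* Hint that the page cache backing [off, off + len) of fd won't be
 * reused soon; the kernel may drop clean cached pages right away. */
static int drop_cached(int fd, off_t off, off_t len)
{
	return posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}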

>
>
> >
> >I think it's very clear, NOT to understand. :)
> >And discard_vpage is basic core function to discard volatile page
> >so it could be used many places.
>
> Ok, I suspect it will make more sense as I get more familiar with it. :)
>
> >
> >>earlier attempts at changing vmscan to shrink anonymous pages be
> >>simpler? Or is that just not going to fly w/ the mm folks?
> >There were many attempt at old. Could you point out?
>
> https://lkml.org/lkml/2012/6/12/587
> Although I know you had objections to my specific implementation,
> since it kept non-volatile anonymous pages on the active list.

It breaks my goal that "hint system calls should be cheap", because we
would need to move pages from the active list to the inactive list in
the volatile system call context, and that would never be cheap.

>
>
>
> thanks
> -john

--
Kind regards,
Minchan Kim

2013-03-25 18:16:16

by Bartlomiej Zolnierkiewicz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page


Hi,

On Tuesday 12 March 2013 08:38:24 Minchan Kim wrote:
> First of all, let's define the term.
> From now on, I'd like to call it as vrange(a.k.a volatile range)
> for anonymous page. If you have a better name in mind, please suggest.
>
> This version is still *RFC* because it's just quick prototype so
> it doesn't support THP/HugeTLB/KSM and even couldn't build on !x86.
> Before further sorting out issues, I'd like to post current direction
> and discuss it. Of course, I'd like to extend this discussion in
> comming LSF/MM.
>
> In this version, I changed lots of thing, expecially removed vma-based
> approach because it needs write-side lock for mmap_sem, which will drop
> performance in mutli-threaded big SMP system, KOSAKI pointed out.
> And vma-based approach is hard to meet requirement of new system call by
> John Stultz's suggested semantic for consistent purged handling.
> (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
>
> I tested this patchset with modified jemalloc allocator which was
> leaded by Jason Evans(jemalloc author) who was interest in this feature
> and was happy to port his allocator to use new system call.
> Super Thanks Jason!
>
> The benchmark for test is ebizzy. It have been used for testing the
> allocator performance so it's good for me. Again, thanks for recommending
> the benchmark, Jason.
> (http://people.freebsd.org/~kris/scaling/ebizzy.html)
>
> The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
>
> ebizzy -S 20
>
> jemalloc-vanilla: 52389 records/sec
> jemalloc-vrange: 203414 records/sec
>
> ebizzy -S 20 with background memory pressure
>
> jemalloc-vanilla: 40746 records/sec
> jemalloc-vrange: 174910 records/sec

Could you please make the modified jemalloc/ebizzy available somewhere so
there is an easy way to test your patchset?

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung Poland R&D Center

2013-03-27 00:32:24

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On 03/25/2013 01:42 AM, Minchan Kim wrote:
> On Fri, Mar 22, 2013 at 10:06:56AM -0700, John Stultz wrote:
>> So, if I understand you properly, its more an issue of the the added
>> cost of making the purged range non-volatile, and re-faulting in the
>> pages if we purge them all, when we didn't actually have the memory
>> pressure to warrant purging the entire range? Hrm. Ok, I can sort of
>> see that. So if we do partial-purging, all the data in the range is
>> invalid - since we don't know which pages in particular were purged,
>> but the costs when marking the range non-volatile and the costs of
>> over-writing the pages with the re-created data will be slightly
>> cheaper.
> It could be heavily cheaper with my experiment in this patchset.
> Allocator could avoid minor fault from 105799867 to 9401.
>
>> I guess the other benefit is if you're using the SIGBUS semantics,
>> you might luck out and not actually touch a purged page. Where as if
>> the entire range is purged, the process will definitely hit the
>> SIGBUS if its accessing the volatile data.
> Yes. I guess that's why Taras liked it.
> Quote from old version
> "
> 4) Having a new system call makes it easier for userspace apps to
> detect kernels without this functionality.
>
> I really like the proposed interface. I like the suggestion of having
> explicit FULL|PARTIAL_VOLATILE. Why not include PARTIAL_VOLATILE as a
> required 3rd param in first version with expectation that
> FULL_VOLATILE will be added later(and returning some not-supported error
> in meantime)?
> "

Thanks again for the clarifications on your thought process here!

I'm currently trying to rework your patches so we can reuse this for
file data as well as pure anonymous memory. The idea being that we add
one level of indirection: a vrange_root structure, which manages the
root of the rb interval tree as well as the lock. This vrange_root can
then be included in the mm_struct as well as address_space structures
depending on which type of memory we're dealing with. That way most of
the same infrastructure can be used to manage per-mm volatile ranges as
well as per-inode volatile ranges.
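
In outline, that indirection might look like the sketch below; the
field and type names are illustrative guesses, not lifted from the
actual branch:

/* One tree of volatile ranges plus the lock guarding it,
 * embeddable wherever ranges need tracking. */
struct vrange_root {
	struct rb_root v_rb;	/* interval tree of vranges */
	struct mutex v_lock;	/* protects v_rb */
};

/* mm_struct (anonymous memory) and address_space (file memory)
 * would then each embed one: */
struct mm_struct {
	/* ... existing fields ... */
	struct vrange_root vroot;	/* per-mm volatile ranges */
};

struct address_space {
	/* ... existing fields ... */
	struct vrange_root vroot;	/* per-inode volatile ranges */
};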

Sorting out how to handle vrange() calls that cross both anonymous and
file vmas will be interesting, and may have some of the drawbacks of the
vma based approach, but I think it will still be simpler. To start we
may just be able to require that any vrange() calls don't cross vma
types (possibly using separate syscalls for file and anonymous vranges).

Anyway, that's my current thinking. You can preview my current attempt here:
http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/vrange-minchan

Thanks so much again for moving this work forward!
-john

2013-03-27 07:18:25

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

Hi Bart,

On Mon, Mar 25, 2013 at 06:16:16PM +0100, Bartlomiej Zolnierkiewicz wrote:
>
> Hi,
>
> On Tuesday 12 March 2013 08:38:24 Minchan Kim wrote:
> > First of all, let's define the term.
> > From now on, I'd like to call it as vrange(a.k.a volatile range)
> > for anonymous page. If you have a better name in mind, please suggest.
> >
> > This version is still *RFC* because it's just quick prototype so
> > it doesn't support THP/HugeTLB/KSM and even couldn't build on !x86.
> > Before further sorting out issues, I'd like to post current direction
> > and discuss it. Of course, I'd like to extend this discussion in
> > comming LSF/MM.
> >
> > In this version, I changed lots of thing, expecially removed vma-based
> > approach because it needs write-side lock for mmap_sem, which will drop
> > performance in mutli-threaded big SMP system, KOSAKI pointed out.
> > And vma-based approach is hard to meet requirement of new system call by
> > John Stultz's suggested semantic for consistent purged handling.
> > (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
> >
> > I tested this patchset with modified jemalloc allocator which was
> > leaded by Jason Evans(jemalloc author) who was interest in this feature
> > and was happy to port his allocator to use new system call.
> > Super Thanks Jason!
> >
> > The benchmark for test is ebizzy. It have been used for testing the
> > allocator performance so it's good for me. Again, thanks for recommending
> > the benchmark, Jason.
> > (http://people.freebsd.org/~kris/scaling/ebizzy.html)
> >
> > The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
> >
> > ebizzy -S 20
> >
> > jemalloc-vanilla: 52389 records/sec
> > jemalloc-vrange: 203414 records/sec
> >
> > ebizzy -S 20 with background memory pressure
> >
> > jemalloc-vanilla: 40746 records/sec
> > jemalloc-vrange: 174910 records/sec
>
> Could you please make the modified jemalloc/ebizzy available somewhere so
> there is a easy way to test your patchset?

I will try to do that in the next spin.
Thanks for your interest!

--
Kind regards,
Minchan Kim

2013-03-27 08:03:38

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Tue, Mar 26, 2013 at 05:26:04PM -0700, John Stultz wrote:
> On 03/25/2013 01:42 AM, Minchan Kim wrote:
> >On Fri, Mar 22, 2013 at 10:06:56AM -0700, John Stultz wrote:
> >>So, if I understand you properly, its more an issue of the the
> >>added cost of making the purged range non-volatile, and
> >>re-faulting in the pages if we purge them all, when we didn't
> >>actually have the memory pressure to warrant purging the entire
> >>range? Hrm. Ok, I can sort of see that. So if we do
> >>partial-purging, all the data in the range is invalid - since we
> >>don't know which pages in particular were purged, but the costs
> >>when marking the range non-volatile and the costs of
> >>over-writing the pages with the re-created data will be slightly
> >>cheaper.
> >It could be heavily cheaper with my experiment in this patchset.
> >Allocator could avoid minor fault from 105799867 to 9401.
> >
> >>I guess the other benefit is if you're using the SIGBUS semantics,
> >>you might luck out and not actually touch a purged page. Where as if
> >>the entire range is purged, the process will definitely hit the
> >>SIGBUS if its accessing the volatile data.
> >Yes. I guess that's why Taras liked it.
> >Quote from old version
> >"
> >4) Having a new system call makes it easier for userspace apps to
> > detect kernels without this functionality.
> >
> >I really like the proposed interface. I like the suggestion of having
> >explicit FULL|PARTIAL_VOLATILE. Why not include PARTIAL_VOLATILE as a
> >required 3rd param in first version with expectation that
> >FULL_VOLATILE will be added later(and returning some not-supported error
> >in meantime)?
> >"
>
> Thanks again for the clarifications on your though process here!
>
> I'm currently trying to rework your patches so we can reuse this for
> file data as well as pure anonymous memory. The idea being that we
> add one level of indirection: a vrange_root structure, which manages
> the root of the rb interval tree as well as the lock. This
> vrange_root can then be included in the mm_struct as well as
> address_space structures depending on which type of memory we're
> dealing with. That way most of the same infrastructure can be used
> to manage per-mm volatile ranges as well as per-inode volatile
> ranges.

Yeb.

>
> Sorting out how to handle vrange() calls that cross both anonymous
> and file vmas will be interesting, and may have some of the
> drawbacks of the vma based approach, but I think it will still be

Do you have any specific drawback examples?
I'd like to solve it if it is critical, and I believe we shouldn't
accept it just for the sake of a simpler implementation.

> simpler. To start we may just be able to require that any vrange()
> calls don't cross vma types (possibly using separate syscalls for
> file and anonymous vranges).

I can't quite parse what problem concerns you here.
Why should we have a separate syscall?

>
> Anyway, that's my current thinkig. You can preview my current attempt here:
> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/vrange-minchan
>

I looked it over roughly and it seems good to me.
I will review it in detail when you send a formal patch. :)

Off-topic:

Let's think about another vrange usecase, for file pages.
I'm now thinking it might be useful as a hint interface for the kernel.
As you know, we already have hint interfaces, madvise and fadvise.
But they are always heavy, because the kernel has to spend time handling
every page of the range, so the cost grows linearly with the range's
size. Another problem is that they don't take a system-wide view; for
one example, look at
http://permalink.gmane.org/gmane.linux.kernel.mm/95424
There were several similar attempts long ago, but they were rejected
because they could change current behavior if the system call moved
pages onto the inactive list without freeing them instantly.

Andrew also suggested creating a new advice value rather than replacing
an old one, for compatibility.

So what I want is a new interface as a totally different system call,
vrange, because I believe a hint system call should be very cheap so
that many users can call it frequently. If users don't use it often,
the kernel doesn't get any benefit, either.

The vrange system call can be cheap because we move the hot-path
overhead to the slow path (the reclaim path). And since we define new
behavior for vrange, we can implement new ideas freely.

I think it's good for system memory handling.
As you know well, there have been several attempts to handle memory
management in userspace. One example is the low-memory notifier: the
kernel just sends a signal and the user can free pages. Frankly
speaking, I don't like that idea, because several factors limit how
quickly a userspace daemon can react in bounded time, and it can raise
false-positive alarms if the system has streaming data, mlocked pages,
many dirty pages, and so on.

Anyway, my point is that I'd like page reclaiming to be controlled by
the kernel alone. For that, userspace can register its volatile or
reclaimable memory ranges with the kernel and define a threshold.
If the kernel finds memory is below the user-defined threshold, it can
reclaim every page in the registered ranges freely.

It means the kernel keeps ownership of page freeing, which makes the
system more deterministic and less out of control.

So the vrange system call's semantics would be as follows.

1. vrange for anonymous pages -> discard without swapout
2. vrange for file-backed pages except shmem/tmpfs -> discard without sync
3. vrange for shmem/tmpfs -> hole punching

It's just my two cents. ;-)
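
For comparison, case 3 is semantically close to the hole punch
userspace can already request on tmpfs today, except that vrange would
defer the punch until memory pressure actually arrives:

#define _GNU_SOURCE
#include <fcntl.h>

/* Immediately free the backing pages of [off, off + len); later
 * reads of the hole return zeroes. vrange would do this lazily. */
static int punch_hole(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, len);
}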

> Thanks so much again for your moving this work forward!

Thanks for your collaboration!

--
Kind regards,
Minchan Kim

2013-03-30 00:05:23

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On 03/27/2013 01:03 AM, Minchan Kim wrote:
> On Tue, Mar 26, 2013 at 05:26:04PM -0700, John Stultz wrote:
>> Sorting out how to handle vrange() calls that cross both anonymous
>> and file vmas will be interesting, and may have some of the
>> drawbacks of the vma based approach, but I think it will still be
> Do you have any specific drawback examples?
> I'd like to solve it if it is critical and I believe we shouldn't
> do that for simpler implementation.

My current thought is that we manage volatile memory on both a per-mm
(for anonymous memory) and per-address_space (for file memory) basis.

The downside: if we manage both file and anonymous volatile ranges with
the same interface, we may have problems similar to the per-vma approach
you were trying before. Specifically, if a single range covers both
anonymous and file memory, we'll have to iterate over the different
types of ranges, as we did with your earlier vma approach.

This adds some complexity, since with the single-interval-tree method in
your current patch we know we only have to allocate one additional
range per insert/remove. So we can do that right off the bat, and return
any ENOMEM error without having made any state changes. This is a nice
quality to have.
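
That is, allocate the one node you might need before touching the
tree, so -ENOMEM can be returned with nothing changed. A sketch with
made-up names:

/* Illustrative only: none of these names come from the patch. */
int vrange_mark(struct vrange_root *root, unsigned long start,
		unsigned long end)
{
	struct vrange *node = vrange_alloc();	/* the one extra range */

	if (!node)
		return -ENOMEM;	/* fail before any state change */

	mutex_lock(&root->v_lock);
	/* Merge/split against the interval tree; at most one new node
	 * is ever needed, and it is already in hand. */
	__vrange_add(root, node, start, end);
	mutex_unlock(&root->v_lock);
	return 0;
}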

Whereas if we're iterating over different types of ranges, with
possibly multiple trees (ie: different mmapped files), we don't know how
many new ranges we may have to allocate, so we could fail halfway
through, which gives ambiguous results when marking ranges non-volatile
(returning the error leaves the range possibly half-unmarked).


I'm still thinking it through, but that's my concern.

Some ways we can avoid this:
1) Require that any vrange() call not cross different types of memory.
2) Provide a different vrange call (fvrange?) to be used with file-backed
memory.

Any other thoughts?


>> Anyway, that's my current thinkig. You can preview my current attempt here:
>> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/vrange-minchan
>>
> I saw it roughly and it seems good to me.
> I will review it in detail if you send formal patch. :)
Ok. I'm still working on some changes (been slow this week), but hope to
have more to send your way next week.

> As you know well, there are several trial to handle memory management
> in userspace. One of example is lowmemory notifier. Kernel just send
> signal and user can free pages. Frankly speaking, I don't like that idea.
> Because there are several factors to limit userspace daemon's bounded
> reaction and could have false-positive alarm if system has streaming data,
> mlocked pages or many dirty pages and so on.

True. However, I think there are valid use cases for low-memory
notification (Android's low-memory killer is one, where we're not just
freeing pages but killing processes), and I think both approaches have
valid uses.

> Anyway, my point is that I'd like to control page reclaiming in only
> kernel itself. For it, userspace can register their volatile or
> reclaimable memory ranges to kernel and define to the threshold.
> If kernel find memory is below threshold user defined, kernel can
> reclaim every pages in registered range freely.
>
> It means kernel has a ownership of page freeing. It makes system more
> deterministic and not out-of-control.
>
> So vrange system call's semantic is following as.
>
> 1. vrange for anonymous page -> Discard wthout swapout
> 2. vrange for file-backed page except shmem/tmpfs -> Discard without sync
> 3. vrange for shmem/tmpfs -> hole punching
I think that for non-shmem file-backed pages (case #2) hole punching
will be required as well. Though I'm not totally convinced volatile
ranges on non-tmpfs files actually make sense (I have yet to understand
a use case).


Thanks again for your thoughts here.

thanks
-john

2013-04-01 07:57:57

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Fri, Mar 29, 2013 at 05:05:17PM -0700, John Stultz wrote:
> On 03/27/2013 01:03 AM, Minchan Kim wrote:
> >On Tue, Mar 26, 2013 at 05:26:04PM -0700, John Stultz wrote:
> >>Sorting out how to handle vrange() calls that cross both anonymous
> >>and file vmas will be interesting, and may have some of the
> >>drawbacks of the vma based approach, but I think it will still be
> >Do you have any specific drawback examples?
> >I'd like to solve it if it is critical and I believe we shouldn't
> >do that for simpler implementation.
>
> My current thought is that we manage volatile memory on both a
> per-mm (for anonymous memory) and per-address_space (for file
> memory) basis.

First of all, I have a dumb question. I haven't thought about the tmpfs
usecase as deeply as you have, so I hope this stupid question opens my
eyes.

I thought of it like this:

1. The vrange system call doesn't care whether the range is anonymous or not.
2. The discarder (at the moment a kswapd hook, or kvranged in the future, plus
direct page reclaim) can work out whether a vma is anonymous or file-backed.
3. If a vma in a vrange is anonymous, it can discard a page rather than
swapping it out.
4. If a vma in a vrange is file-backed (ie, tmpfs), it can discard a page
rather than swapping it out => the same effect as a hole punch.

Both 3 and 4 would be handled via rmap, so a page couldn't be discarded if
anyone still maps it as non-volatile.

In this scenario, I can't see what role the per-address_space tree
plays, so my question is why we need per-address_space vrange
management.

If I read your mind correctly, are you considering an fd-based system
call, rather than an mmaped-address-space approach, to replace ashmem?

>
> The down side, if we manage both file and anonymous volatile ranges
> with the same interface, we may have similar problems to the per-vma
> approach you were trying before. Specifically, if a single range
> covers both anonymous and file memory, we'll have to do a similar
> iterating over the different types of ranges, as we did with your
> earlier vma approach.

As I said earlier, I don't want to care whether a new range is anonymous
or file-backed in the vrange system call context. It's just a vrange,
which can then be handled properly later when memory pressure happens.

>
> This adds some complexity since with the single interval tree method
> in your current patch, we know we only have to allocate one
> additional range per insert/remove. So we can do that right off the
> bat, and return any enomem errors without having made any state
> changes. This is a nice quality to have.
>
> Where as if we're iterating over different types of ranges, with
> possibly multiple trees (ie: different mmapped files), we don't know
> how many new ranges we may have to allocate, so we could fail half
> way which causes ambiguous results on the marking ranges
> non-volatile (since returning the error leaves the range possibly
> half-unmarked).

Maybe I can understand your point after seeing your concern with a more
concrete example. :)

>
>
> I'm still thinking it through, but that's my concern.
>
> Some ways we can avoid this:
> 1) Require that any vrange() call not cross different types of memory.
> 2) Provide a different vrange call (fvrange?)to be used with file
> backed memory.
>
> Any other thoughts?
>
>
> >>Anyway, that's my current thinkig. You can preview my current attempt here:
> >>http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/vrange-minchan
> >>
> >I saw it roughly and it seems good to me.
> >I will review it in detail if you send formal patch. :)
> Ok. I'm still working on some changes (been slow this week), but
> hope to have more to send your way next week.
>
> >As you know well, there are several trial to handle memory management
> >in userspace. One of example is lowmemory notifier. Kernel just send
> >signal and user can free pages. Frankly speaking, I don't like that idea.
> >Because there are several factors to limit userspace daemon's bounded
> >reaction and could have false-positive alarm if system has streaming data,
> >mlocked pages or many dirty pages and so on.
>
> True. However, I think that there are valid use cases lowmemory
> notification (Android's low-memory killer is one, where we're not
> just freeing pages, but killing processes), and I think both
> approaches have valid use.

Yeb. I didn't say a userspace memory notifier is useless.
My point is that a userspace memory notifier can be fragile for several
reasons: it works well while the system still has some free memory, but
it doesn't work well under heavy memory pressure. In that case it would
be better for the kernel to reclaim pages itself rather than depending
on the user's response.

Yeb, of course, there is a trade-off there.
The kernel doesn't have as much knowledge as userspace, so it may
reclaim sub-optimal pages. So my suggestion is that the platform can
use the low-memory notifier while memory pressure is mild, but when
memory pressure approaches OOM, the kernel should reclaim instantly;
beforehand, the user can give the kernel a hint about which ranges of
the address space are recoverable. That could be another usecase for
vrange.

>
> >Anyway, my point is that I'd like to control page reclaiming in only
> >kernel itself. For it, userspace can register their volatile or
> >reclaimable memory ranges to kernel and define to the threshold.
> >If kernel find memory is below threshold user defined, kernel can
> >reclaim every pages in registered range freely.
> >
> >It means kernel has a ownership of page freeing. It makes system more
> >deterministic and not out-of-control.
> >
> >So vrange system call's semantic is following as.
> >
> >1. vrange for anonymous page -> Discard wthout swapout
> >2. vrange for file-backed page except shmem/tmpfs -> Discard without sync
> >3. vrange for shmem/tmpfs -> hole punching
> I think on non-shmem file backed pages (case #2) hole punching will
> be required as well. Though I'm not totally convinced volatile

What I mean is: let's use the vrange system call instead of
madvise/fadvise, so file-backed pages except shmem/tmpfs wouldn't be
discarded, just reclaimed.

> ranges on non-tmpfs files actually makes sense (I still have yet to
> understand a use case).

1. fadvise/madvise reclaim pages instantly even if there is no memory pressure.
2. They take a per-process view, not a system-wide one.
3. The system call cost grows with the size of the range.

vrange could fix the above problems.

I'm not serious about this usecase; I just raised it out of curiosity
about how others think about it.

--
Kind regards,
Minchan Kim

2013-04-10 20:23:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

(3/12/13 3:38 AM), Minchan Kim wrote:
> First of all, let's define the term.
> From now on, I'd like to call it as vrange(a.k.a volatile range)
> for anonymous page. If you have a better name in mind, please suggest.
>
> This version is still *RFC* because it's just quick prototype so
> it doesn't support THP/HugeTLB/KSM and even couldn't build on !x86.
> Before further sorting out issues, I'd like to post current direction
> and discuss it. Of course, I'd like to extend this discussion in
> comming LSF/MM.
>
> In this version, I changed lots of thing, expecially removed vma-based
> approach because it needs write-side lock for mmap_sem, which will drop
> performance in mutli-threaded big SMP system, KOSAKI pointed out.
> And vma-based approach is hard to meet requirement of new system call by
> John Stultz's suggested semantic for consistent purged handling.
> (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
>
> I tested this patchset with modified jemalloc allocator which was
> leaded by Jason Evans(jemalloc author) who was interest in this feature
> and was happy to port his allocator to use new system call.
> Super Thanks Jason!
>
> The benchmark for test is ebizzy. It have been used for testing the
> allocator performance so it's good for me. Again, thanks for recommending
> the benchmark, Jason.
> (http://people.freebsd.org/~kris/scaling/ebizzy.html)
>
> The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
>
> ebizzy -S 20
>
> jemalloc-vanilla: 52389 records/sec
> jemalloc-vrange: 203414 records/sec
>
> ebizzy -S 20 with background memory pressure
>
> jemalloc-vanilla: 40746 records/sec
> jemalloc-vrange: 174910 records/sec
>
> And it's much improved on KVM virtual machine.
>
> This patchset is based on v3.9-rc2
>
> - What's the sys_vrange(addr, length, mode, behavior)?
>
> It's a hint that user deliver to kernel so kernel can *discard*
> pages in a range anytime. mode is one of VRANGE_VOLATILE and
> VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is memory pin operation so
> kernel coudn't discard any pages any more while VRANGE_VOLATILE
> is memory unpin opeartion so kernel can discard pages in vrange
> anytime. At a moment, behavior is one of VRANGE_FULL and VRANGE
> PARTIAL. VRANGE_FULL tell kernel that once kernel decide to
> discard page in a vrange, please, discard all of pages in a
> vrange selected by victim vrange. VRANGE_PARTIAL tell kernel
> that please discard of some pages in a vrange. But now I didn't
> implemented VRANGE_PARTIAL handling yet.
>
> - What happens if user access page(ie, virtual address) discarded
> by kernel?
>
> The user can encounter SIGBUS.
>
> - What should user do for avoding SIGBUS?
> He should call vrange(addr, length, VRANGE_NOVOLATILE, mode) before
> accessing the range which was called
> vrange(addr, length, VRANGE_VOLATILE, mode)
>
> - What happens if user access page(ie, virtual address) doesn't
> discarded by kernel?
>
> The user can see vaild data which was there before calling
> vrange(., VRANGE_VOLATILE) without page fault.
>
> - What's different with madvise(DONTNEED)?
>
> System call semantic
>
> DONTNEED makes sure user always can see zero-fill pages after
> he calls madvise while vrange can see data or encounter SIGBUS.

For replacing DONTNEED, users would want zero-filled pages like DONTNEED
instead of SIGBUS. So a new flag option would be nice.

I played with this patch a bit. The result looks really promising
(i.e. 20x faster).

My machine is a KVM guest with 24 cpus and 8GB of RAM. I guess the
current DONTNEED implementation doesn't fit KVM at all.


# of      # of      # of iter
threads   iter      (patched glibc)
----------------------------------
 1         438       10740
 2         842       20916
 4         987       32534
 8         717       15155
12         714       14109
16         708       13457
20         720       13742
24         727       13642
28         715       13328
32         709       13096
36         705       13661
40         708       13634
44         707       13367
48         714       13377


---------libc patch (just dirty hack) ----------------------

diff --git a/malloc/arena.c b/malloc/arena.c
index 12a48ad..da04f67 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -365,6 +365,8 @@ extern struct dl_open_hook *_dl_open_hook;
libc_hidden_proto (_dl_open_hook);
#endif

+int vrange_enabled = 0;
+
static void
ptmalloc_init (void)
{
@@ -457,6 +459,18 @@ ptmalloc_init (void)
if (check_action != 0)
__malloc_check_init();
}
+
+ {
+ char *vrange = getenv("MALLOC_VRANGE");
+ if (vrange) {
+ int val = atoi(vrange);
+ if (val) {
+ printf("glibc: vrange enabled\n");
+ vrange_enabled = !!val;
+ }
+ }
+ }
+
void (*hook) (void) = force_reg (__malloc_initialize_hook);
if (hook != NULL)
(*hook)();
@@ -628,9 +642,14 @@ shrink_heap(heap_info *h, long diff)
return -2;
h->mprotect_size = new_size;
}
- else
- __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
+ else {
+ if (vrange_enabled) {
+ syscall(314, (char *)h + new_size, diff, 0, 1);
+ } else {
+ __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
+ }
/*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
+ }

h->size = new_size;
return 0;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 70b9329..3782244 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -4403,6 +4403,7 @@ _int_pvalloc(mstate av, size_t bytes)
/*
------------------------------ malloc_trim ------------------------------
*/
+extern int vrange_enabled;

static int mtrim(mstate av, size_t pad)
{
@@ -4443,7 +4444,12 @@ static int mtrim(mstate av, size_t pad)
content. */
memset (paligned_mem, 0x89, size & ~psm1);
#endif
- __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
+
+ if (vrange_enabled) {
+ syscall(314, paligned_mem, size & ~psm1, 0, 1);
+ } else {
+ __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
+ }

result = 1;
}





2013-04-11 06:55:53

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Wed, Apr 10, 2013 at 04:22:58PM -0400, KOSAKI Motohiro wrote:
> (3/12/13 3:38 AM), Minchan Kim wrote:
> > First of all, let's define the term.
> > From now on, I'd like to call it as vrange(a.k.a volatile range)
> > for anonymous page. If you have a better name in mind, please suggest.
> >
> > This version is still *RFC* because it's just quick prototype so
> > it doesn't support THP/HugeTLB/KSM and even couldn't build on !x86.
> > Before further sorting out issues, I'd like to post current direction
> > and discuss it. Of course, I'd like to extend this discussion in
> > comming LSF/MM.
> >
> > In this version, I changed lots of thing, expecially removed vma-based
> > approach because it needs write-side lock for mmap_sem, which will drop
> > performance in mutli-threaded big SMP system, KOSAKI pointed out.
> > And vma-based approach is hard to meet requirement of new system call by
> > John Stultz's suggested semantic for consistent purged handling.
> > (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
> >
> > I tested this patchset with modified jemalloc allocator which was
> > leaded by Jason Evans(jemalloc author) who was interest in this feature
> > and was happy to port his allocator to use new system call.
> > Super Thanks Jason!
> >
> > The benchmark for test is ebizzy. It have been used for testing the
> > allocator performance so it's good for me. Again, thanks for recommending
> > the benchmark, Jason.
> > (http://people.freebsd.org/~kris/scaling/ebizzy.html)
> >
> > The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
> >
> > ebizzy -S 20
> >
> > jemalloc-vanilla: 52389 records/sec
> > jemalloc-vrange: 203414 records/sec
> >
> > ebizzy -S 20 with background memory pressure
> >
> > jemalloc-vanilla: 40746 records/sec
> > jemalloc-vrange: 174910 records/sec
> >
> > And it's much improved on KVM virtual machine.
> >
> > This patchset is based on v3.9-rc2
> >
> > - What's the sys_vrange(addr, length, mode, behavior)?
> >
> > It's a hint that user deliver to kernel so kernel can *discard*
> > pages in a range anytime. mode is one of VRANGE_VOLATILE and
> > VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is memory pin operation so
> > kernel coudn't discard any pages any more while VRANGE_VOLATILE
> > is memory unpin opeartion so kernel can discard pages in vrange
> > anytime. At a moment, behavior is one of VRANGE_FULL and VRANGE
> > PARTIAL. VRANGE_FULL tell kernel that once kernel decide to
> > discard page in a vrange, please, discard all of pages in a
> > vrange selected by victim vrange. VRANGE_PARTIAL tell kernel
> > that please discard of some pages in a vrange. But now I didn't
> > implemented VRANGE_PARTIAL handling yet.
> >
> > - What happens if user access page(ie, virtual address) discarded
> > by kernel?
> >
> > The user can encounter SIGBUS.
> >
> > - What should user do for avoding SIGBUS?
> > He should call vrange(addr, length, VRANGE_NOVOLATILE, mode) before
> > accessing the range which was called
> > vrange(addr, length, VRANGE_VOLATILE, mode)
> >
> > - What happens if user access page(ie, virtual address) doesn't
> > discarded by kernel?
> >
> > The user can see vaild data which was there before calling
> > vrange(., VRANGE_VOLATILE) without page fault.
> >
> > - What's different with madvise(DONTNEED)?
> >
> > System call semantic
> >
> > DONTNEED makes sure user always can see zero-fill pages after
> > he calls madvise while vrange can see data or encounter SIGBUS.
>
> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
> instead of SIGBUS. So, new flag option would be nice.

If userspace people want it, I can do it. But I'm not sure they want it
at the moment, because vrange is a rather different concept from
madvise(DONTNEED) in terms of usage.

As you know well, in the case of DONTNEED the user calls madvise _once_
and the VM releases the memory as soon as the system call returns.
But vrange amounts to a delayed free that happens under system memory
pressure, so the user can't know when the OS will free the pages.
It means the user should call the pair of system calls, both
VRANGE_VOLATILE and VRANGE_NOVOLATILE, to use a volatile range
correctly (for simplicity, I'm leaving the SIGBUS fault-recovery method
out of this). If he makes a mistake (ie, fails to call
VRANGE_NOVOLATILE) on a range the current process is using, pages used
by the process could disappear suddenly.

In summary, I don't think vrange is a replacement for madvise(DONTNEED),
but it could be a useful companion to it. For example, we could make
vrange(VRANGE_VOLATILE) return 1 if memory pressure is already severe,
so the user can detect memory pressure from the return value and call
madvise(DONTNEED) in that case. Of course, we could handle that in the
vrange system call itself (ex, turn the vrange call into a
madvise(DONTNEED)), but I don't want that, because I want to keep the
vrange hinting system call very light at all times so users can count
on its latency.

>
> I played a bit this patch. The result looks really promissing.
> (i.e. 20x faster)

Thanks for testing with glibc!
Yes, although I didn't post my KVM test results with jemalloc, they
looked very good, too.

>
> My machine have 24cpus, 8GB ram, kvm guest. I guess current DONTNEED
> implementation doesn't fit kvm at all.

Yes. I expect MMU/cache/TLB handling on a virtual machine to be more
expensive than on a bare-metal box.

>
>
> # of # of # of
> thread iter iter (patched glibc)

What's the workload?

> ----------------------------------
> 1 438 10740
> 2 842 20916
> 4 987 32534
> 8 717 15155
> 12 714 14109
> 16 708 13457
> 20 720 13742
> 24 727 13642
> 28 715 13328
> 32 709 13096
> 36 705 13661
> 40 708 13634
> 44 707 13367
> 48 714 13377
>
>
> ---------libc patch (just dirty hack) ----------------------
>
> diff --git a/malloc/arena.c b/malloc/arena.c
> index 12a48ad..da04f67 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -365,6 +365,8 @@ extern struct dl_open_hook *_dl_open_hook;
> libc_hidden_proto (_dl_open_hook);
> #endif
>
> +int vrange_enabled = 0;
> +
> static void
> ptmalloc_init (void)
> {
> @@ -457,6 +459,18 @@ ptmalloc_init (void)
> if (check_action != 0)
> __malloc_check_init();
> }
> +
> + {
> + char *vrange = getenv("MALLOC_VRANGE");
> + if (vrange) {
> + int val = atoi(vrange);
> + if (val) {
> + printf("glibc: vrange enabled\n");
> + vrange_enabled = !!val;
> + }
> + }
> + }
> +
> void (*hook) (void) = force_reg (__malloc_initialize_hook);
> if (hook != NULL)
> (*hook)();
> @@ -628,9 +642,14 @@ shrink_heap(heap_info *h, long diff)
> return -2;
> h->mprotect_size = new_size;
> }
> - else
> - __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
> + else {
> + if (vrange_enabled) {
> + syscall(314, (char *)h + new_size, diff, 0, 1);
> + } else {
> + __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
> + }
> /*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
> + }
>
> h->size = new_size;
> return 0;
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 70b9329..3782244 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -4403,6 +4403,7 @@ _int_pvalloc(mstate av, size_t bytes)
> /*
> ------------------------------ malloc_trim ------------------------------
> */
> +extern int vrange_enabled;
>
> static int mtrim(mstate av, size_t pad)
> {
> @@ -4443,7 +4444,12 @@ static int mtrim(mstate av, size_t pad)
> content. */
> memset (paligned_mem, 0x89, size & ~psm1);
> #endif
> - __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
> +
> + if (vrange_enabled) {
> + syscall(314, paligned_mem, size & ~psm1, 0, 1);
> + } else {
> + __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
> + }
>
> result = 1;
> }

I can't find a VRANGE_NOVOLATILE call in your patch, so I think you just
tested with enough memory. I expect you wanted a fast prototype to see
the performance gain. Yes, it looks good, although the code above isn't
complete since it doesn't handle purged pages.

Our next step would be to optimize the reclaim path so that when memory
pressure is severe, vrange gets a better result than vanilla by
avoiding swap-out. Any suggestions are welcome.

--
Kind regards,
Minchan Kim

2013-04-11 07:20:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

>>> DONTNEED makes sure user always can see zero-fill pages after
>>> he calls madvise while vrange can see data or encounter SIGBUS.
>>
>> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
>> instead of SIGBUS. So, new flag option would be nice.
>
> If userspace people want it, I can do it.
> But not sure they want it at the moment becaue vrange is rather
> different concept of madvise(DONTNEED) POV usage.
>
> As you know well, in case of DONTNEED, user calls madvise _once_ and
> VM releases memory as soon as he called system call.
> But vrange is same with delayed free when the system memory pressure
> happens so user can't know OS frees the pages anytime.
> It means user should call pair of system call both VRANGE_VOLATILE
> and VRANGE_NOVOLATILE for right usage of volatile range
> (for simple, I don't want to tell SIGBUS fault recovery method).
> If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
> which is used by current process, pages used by some process could be
> disappeared suddenly.
>
> In summary, I don't think vrange is a replacement of madvise(DONTNEED)
> but could be useful with madvise(DONTNEED) friend. For example, we can
> make return 1 in vrange(VRANGE_VOLATILE) if memory pressure was already

Do you mean vrange(VRANGE_NOVOLATILE)?
Btw, assigning a new error number in asm-generic/errno.h would be better
than a strange '1'.


> severe so user can catch up memory pressure by return value and calls
> madvise(DONTNEED) if memory pressure was already severe. Of course, we
> can handle it vrange system call itself(ex, change vrange system call to
> madvise(DONTNEED) but don't want it because I want to keep vrange hinting
> sytem call very light at all times so user can expect latency.

For allocator usage, vrange(NOVOLATILE) is annoying and not needed at all.
When the data has already been purged, just return a new zero-filled page.
So maybe adding a new flag is worthwhile, because malloc is definitely a
fast path and a new syscall invocation is unwelcome.


>> # of # of # of
>> thread iter iter (patched glibc)
>
> What's the workload?

Ahh, sorry, I forgot to write that. I used ebizzy, your favorite workload.


2013-04-11 08:02:49

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Thu, Apr 11, 2013 at 03:20:30AM -0400, KOSAKI Motohiro wrote:
> >>> DONTNEED makes sure user always can see zero-fill pages after
> >>> he calls madvise while vrange can see data or encounter SIGBUS.
> >>
> >> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
> >> instead of SIGBUS. So, new flag option would be nice.
> >
> > If userspace people want it, I can do it.
> > But not sure they want it at the moment becaue vrange is rather
> > different concept of madvise(DONTNEED) POV usage.
> >
> > As you know well, in case of DONTNEED, user calls madvise _once_ and
> > VM releases memory as soon as he called system call.
> > But vrange is same with delayed free when the system memory pressure
> > happens so user can't know OS frees the pages anytime.
> > It means user should call pair of system call both VRANGE_VOLATILE
> > and VRANGE_NOVOLATILE for right usage of volatile range
> > (for simple, I don't want to tell SIGBUS fault recovery method).
> > If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
> > which is used by current process, pages used by some process could be
> > disappeared suddenly.
> >
> > In summary, I don't think vrange is a replacement of madvise(DONTNEED)
> > but could be useful with madvise(DONTNEED) friend. For example, we can
> > make return 1 in vrange(VRANGE_VOLATILE) if memory pressure was already
>
> Do you mean vrange(VRANGE_UNVOLATILE)?

I meant VRANGE_VOLATILE. It seems my explanation was poor; here it goes
again. Currently vrange just returns 0 if the system call is successful
and an error otherwise. But we could change it as follows:

1. return 0 if the system call is successful and memory pressure isn't severe
2. return 1 if the system call is successful and memory pressure is severe
3. return -ERRXXX if the system call fails for some reason

So the process can learn about system-wide memory pressure without
peeking at vmstat, and then call madvise(DONTNEED) right after the
vrange call. The benefit is that the system can zap all the pages
instantly.
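
From the caller's side the convention would look like this (a sketch
only; vrange() stands for the raw syscall wrapper, and the constants
are the prototype's placeholders):

long ret = vrange(buf, len, VRANGE_VOLATILE, VRANGE_FULL);

if (ret < 0)
	perror("vrange");	/* case 3: real failure */
else if (ret == 1)
	/* case 2: pressure already severe -- zap instantly rather
	 * than waiting for reclaim. */
	madvise(buf, len, MADV_DONTNEED);
/* case 1 (ret == 0): pages stay until reclaim wants them. */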

> btw, assign new error number to asm-generic/errno.h is better than strange '1'.

I can, and I admit "1" is rather weird.
But it's not an error, either.

>
>
> > severe so user can catch up memory pressure by return value and calls
> > madvise(DONTNEED) if memory pressure was already severe. Of course, we
> > can handle it vrange system call itself(ex, change vrange system call to
> > madvise(DONTNEED) but don't want it because I want to keep vrange hinting
> > sytem call very light at all times so user can expect latency.
>
> For allocator usage, vrange(UNVOLATILE) is annoying and don't need at all.
> When data has already been purged, just return new zero filled page. so,
> maybe adding new flag is worthwhile. Because malloc is definitely fast path

I really want it, and it's exactly the same as madvise(MADV_FREE).
But to implement it, we need to do something at page granularity over
the address range in the system call context, like zap_pte_range
(ex, clear page table bits and mark something in the page flags for the
reclaimer to detect). It means the vrange system call would still be
heavyweight even though we could remove the lazy page fault.

Do you have any idea how to remove that? If so, I'm very open to
implementing it.


> and adding new syscall invokation is unwelcome.

Sure. But one more system call could be cheaper than a page-granularity
operation over the purged range.

>
>
> >> # of # of # of
> >> thread iter iter (patched glibc)
> >
> > What's the workload?
>
> Ahh, sorry. I forgot to write. I use ebizzy, your favolite workload.

--
Kind regards,
Minchan Kim

2013-04-11 08:15:48

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

(4/11/13 4:02 AM), Minchan Kim wrote:
> On Thu, Apr 11, 2013 at 03:20:30AM -0400, KOSAKI Motohiro wrote:
>>>>> DONTNEED makes sure user always can see zero-fill pages after
>>>>> he calls madvise while vrange can see data or encounter SIGBUS.
>>>>
>>>> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
>>>> instead of SIGBUS. So, new flag option would be nice.
>>>
>>> If userspace people want it, I can do it.
>>> But not sure they want it at the moment becaue vrange is rather
>>> different concept of madvise(DONTNEED) POV usage.
>>>
>>> As you know well, in case of DONTNEED, user calls madvise _once_ and
>>> VM releases memory as soon as he called system call.
>>> But vrange is same with delayed free when the system memory pressure
>>> happens so user can't know OS frees the pages anytime.
>>> It means user should call pair of system call both VRANGE_VOLATILE
>>> and VRANGE_NOVOLATILE for right usage of volatile range
>>> (for simple, I don't want to tell SIGBUS fault recovery method).
>>> If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
>>> which is used by current process, pages used by some process could be
>>> disappeared suddenly.
>>>
>>> In summary, I don't think vrange is a replacement of madvise(DONTNEED)
>>> but could be useful with madvise(DONTNEED) friend. For example, we can
>>> make return 1 in vrange(VRANGE_VOLATILE) if memory pressure was already
>>
>> Do you mean vrange(VRANGE_UNVOLATILE)?
>
> I meant VRANGE_VOLATILE. It seems my explanation was poor. Here it goes, again.
> Now vrange's semantic return just 0 if the system call is successful, otherwise,
> return error. But we can change it as folows
>
> 1. return 0 if the system call is successful and memory pressure isn't severe
> 2. return 1 if the system call is successful and memory pressure is severe
> 3. return -ERRXXX if the system call is failed by some reason
>
> So the process can know system-wide memory pressure without peeking the vmstat
> and then call madvise(DONTNEED) right after vrange call. The benefit is system
> can zap all pages instantly.

Do you mean your patchset is not the latest? And when would you use this
feature? What happens if VRANGE_VOLATILE returns 0 and the range is
purged just after the syscall returns?


>> btw, assign new error number to asm-generic/errno.h is better than strange '1'.
>
> I can and admit "1" is rather weired.
> But it's not error, either.

If this is really necessary, I don't oppose it. However, I am still not convinced.



>>> severe so user can catch up memory pressure by return value and calls
>>> madvise(DONTNEED) if memory pressure was already severe. Of course, we
>>> can handle it vrange system call itself(ex, change vrange system call to
>>> madvise(DONTNEED) but don't want it because I want to keep vrange hinting
>>> sytem call very light at all times so user can expect latency.
>>
>> For allocator usage, vrange(UNVOLATILE) is annoying and don't need at all.
>> When data has already been purged, just return new zero filled page. so,
>> maybe adding new flag is worthwhile. Because malloc is definitely fast path
>
> I really want it and it's exactly same with madvise(MADV_FREE).
> But for implementation, we need page granularity someting in address range
> in system call context like zap_pte_range(ex, clear page table bits and
> mark something to page flags for reclaimer to detect it).
> It means vrange system call is still bigger although we are able to remove
> lazy page fault.
>
> Do you have any idea to remove it? If so, I'm very open to implement it.

Hm. Maybe I am missing something. I'll look at the code closely after LSF/MM.


>> and adding new syscall invokation is unwelcome.
>
> Sure. But one more system call could be cheaper than page-granuarity
> operation on purged range.

I don't think the cost of vrange(VOLATILE) is what this discussion is
about. Whether we send SIGBUS or just nuke the pte, the purge should
happen in vmscan, not in the vrange() syscall.








2013-04-11 08:31:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Thu, Apr 11, 2013 at 04:15:40AM -0400, KOSAKI Motohiro wrote:
> (4/11/13 4:02 AM), Minchan Kim wrote:
> > On Thu, Apr 11, 2013 at 03:20:30AM -0400, KOSAKI Motohiro wrote:
> >>>>> DONTNEED makes sure user always can see zero-fill pages after
> >>>>> he calls madvise while vrange can see data or encounter SIGBUS.
> >>>>
> >>>> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
> >>>> instead of SIGBUS. So, new flag option would be nice.
> >>>
> >>> If userspace people want it, I can do it.
> >>> But not sure they want it at the moment becaue vrange is rather
> >>> different concept of madvise(DONTNEED) POV usage.
> >>>
> >>> As you know well, in case of DONTNEED, user calls madvise _once_ and
> >>> VM releases memory as soon as he called system call.
> >>> But vrange is same with delayed free when the system memory pressure
> >>> happens so user can't know OS frees the pages anytime.
> >>> It means user should call pair of system call both VRANGE_VOLATILE
> >>> and VRANGE_NOVOLATILE for right usage of volatile range
> >>> (for simple, I don't want to tell SIGBUS fault recovery method).
> >>> If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
> >>> which is used by current process, pages used by some process could be
> >>> disappeared suddenly.
> >>>
> >>> In summary, I don't think vrange is a replacement for madvise(DONTNEED),
> >>> but it could be a useful companion to madvise(DONTNEED). For example, we could
> >>> make vrange(VRANGE_VOLATILE) return 1 if memory pressure was already
> >>
> >> Do you mean vrange(VRANGE_UNVOLATILE)?
> >
> > I meant VRANGE_VOLATILE. It seems my explanation was poor. Here it goes again.
> > Currently vrange just returns 0 if the system call is successful; otherwise it
> > returns an error. But we could change it as follows:
> >
> > 1. return 0 if the system call is successful and memory pressure isn't severe
> > 2. return 1 if the system call is successful and memory pressure is severe
> > 3. return -ERRXXX if the system call fails for some reason
> >
> > So the process can know the system-wide memory pressure without peeking at vmstat
> > and then call madvise(DONTNEED) right after the vrange call. The benefit is that
> > the system can zap all pages instantly.
>
> Do you mean your patchset is not the latest? And when would you use this feature? What

Yes. I meant I could do it in the next spin, to hear opinions.

> happens if VRANGE_VOLATILE returns 0 and the range is purged just after the syscall returns?

It could be an idea, but I will think it over.
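
As an illustration of the proposed return-value protocol, here is a minimal
userspace sketch. The vrange() wrapper, the syscall number, and the flag
values below are assumptions made for the sketch, not definitions from this
patchset:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical numbers; the RFC does not fix any of these. */
    #define __NR_vrange       313
    #define VRANGE_VOLATILE   0
    #define VRANGE_NOVOLATILE 1
    #define VRANGE_FULL       0

    static long vrange(void *addr, size_t len, int mode, int behavior)
    {
            return syscall(__NR_vrange, addr, len, mode, behavior);
    }

    /* Proposed semantics: 0 = marked volatile, 1 = marked volatile but
     * memory pressure is already severe, <0 = error. */
    static void release_chunk(void *addr, size_t len)
    {
            if (vrange(addr, len, VRANGE_VOLATILE, VRANGE_FULL) == 1)
                    madvise(addr, len, MADV_DONTNEED); /* zap right now */
    }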

>
>
> >> btw, assigning a new error number in asm-generic/errno.h would be better than a strange '1'.
> >
> > I can, and I admit "1" is rather weird.
> > But it's not an error, either.
>
> If this is really necessary, I don't oppose it. However, I am still not convinced.
>
>
>
> >>> severe, so the user can catch memory pressure from the return value and call
> >>> madvise(DONTNEED) if memory pressure was already severe. Of course, we
> >>> could handle it in the vrange system call itself (ex, turn the vrange system
> >>> call into madvise(DONTNEED)), but I don't want that because I want to keep the
> >>> vrange hinting system call very light at all times so the user can predict its latency.
> >>
> >> For allocator usage, vrange(UNVOLATILE) is annoying and not needed at all.
> >> When data has already been purged, just return a new zero-filled page. So
> >> maybe adding a new flag is worthwhile, because malloc is definitely a fast path
> >
> > I really want it, and it's exactly the same as madvise(MADV_FREE).
> > But for the implementation, we need some page-granularity work on the address
> > range in system call context, like zap_pte_range (ex, clear page table bits and
> > mark something in the page flags for the reclaimer to detect it).
> > It means the vrange system call is still expensive, although we are able to
> > remove the lazy page fault.
> >
> > Do you have any idea how to remove it? If so, I'm very open to implementing it.
>
> Hm. Maybe I am missing something. I'll look at the code closely after LSF.

Please see Rik's old work on MADV_FREE.

>
>
> >> and adding a new syscall invocation is unwelcome.
> >
> > Sure. But one more system call could be cheaper than a page-granularity
> > operation on the purged range.
>
> I don't think the vrange(VOLATILE) cost is relevant to this discussion.
> Whether we send SIGBUS or just nuke the pte, the purge should be done in vmscan,
> not in the vrange() syscall.

Again, please see MADV_FREE: http://lwn.net/Articles/230799/
It changes the pte and page flags on all pages of the range through
zap_pte_range. So it would make vrange(VOLATILE) expensive, and
the bigger the range is, the bigger the cost is.


--
Kind regards,
Minchan Kim

2013-04-11 15:01:15

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

>>>> and adding a new syscall invocation is unwelcome.
>>>
>>> Sure. But one more system call could be cheaper than a page-granularity
>>> operation on the purged range.
>>
>> I don't think the vrange(VOLATILE) cost is relevant to this discussion.
>> Whether we send SIGBUS or just nuke the pte, the purge should be done in vmscan,
>> not in the vrange() syscall.
>
> Again, please see MADV_FREE: http://lwn.net/Articles/230799/
> It changes the pte and page flags on all pages of the range through
> zap_pte_range. So it would make vrange(VOLATILE) expensive, and
> the bigger the range is, the bigger the cost is.

This hadn't crossed my mind. Currently try_to_discard_one() inserts a vrange
marker for raising SIGBUS; then we could insert pte_none() at the same cost, too. Am
I missing something?

I can't imagine why the pte should be zapped in vrange(VOLATILE).

2013-04-14 07:42:14

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

Hi KOSAKI,

On Thu, Apr 11, 2013 at 11:01:11AM -0400, KOSAKI Motohiro wrote:
> >>>> and adding a new syscall invocation is unwelcome.
> >>>
> >>> Sure. But one more system call could be cheaper than a page-granularity
> >>> operation on the purged range.
> >>
> >> I don't think the vrange(VOLATILE) cost is relevant to this discussion.
> >> Whether we send SIGBUS or just nuke the pte, the purge should be done in vmscan,
> >> not in the vrange() syscall.
> >
> > Again, please see MADV_FREE: http://lwn.net/Articles/230799/
> > It changes the pte and page flags on all pages of the range through
> > zap_pte_range. So it would make vrange(VOLATILE) expensive, and
> > the bigger the range is, the bigger the cost is.
>
> This hadn't crossed my mind. Currently try_to_discard_one() inserts a vrange
> marker for raising SIGBUS; then we could insert pte_none() at the same cost, too. Am
> I missing something?

For your requirement, we need some tracking model to detect that a page is
currently being used by the process before the VM discards it, *if* we don't
provide the vrange(NOVOLATILE) pair system call (see below). So the tracking
model would have to be set up in the vrange(VOLATILE) system call context.

>
> I can't imagine why the pte should be zapped in vrange(VOLATILE).

Sorry, my explanation was poor.
I will try again.

First of all, the thing you want is almost like MADV_FREE,
so let's look at that first.

If you call madvise(range, MADV_FREE), the VM has to investigate all of the
pages mapped in the page table for the range (start, start + len), so we need
a page table lookup over the range, and we mark a flag in every page descriptor
(ex, PG_lazyfree) as a hint to the kernel to discard the page instead of
swapping it out when reclaim happens. Another thing we need is to clear out
the dirty bit from the PTE, to detect whether the pages have been dirtied
since we called madvise(range, MADV_FREE), because we can't discard pages
which are still in use by some process after he called madvise. So if the VM
finds that a page has PG_lazyfree but the page was dirtied recently, by
peeking at the PTE, the VM can't discard the page.
So the system call's overhead for madvise(MADV_FREE) is the following:

1. look up all pages in the page table for the range
2. mark some bit (PG_lazyfree) in the page descriptors of the pages mapped in the range
3. clear the dirty bit and flush the TLB

So, madvise(MADV_FREE) would be better than madvise(DONTNEED) because it can
avoid the page fault if memory pressure doesn't happen, but the system call
overhead can still be huge, and notably the overhead increases proportionally
with the range size (see the sketch just below).
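
To make the linear cost concrete, here is a schematic sketch of the per-PTE
work implied by steps 1-3. This is not code from any patchset: the walk down
to the PTE, the locking, and huge page handling are elided, and
SetPageLazyfree()/PG_lazyfree is only the hypothetical flag named above.

    #include <linux/mm.h>
    #include <asm/pgtable.h>
    #include <asm/tlbflush.h>

    /* Called once per pte in [start, end), so the syscall cost is
     * O(number of mapped pages) in the range. */
    static void lazyfree_one_pte(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep)
    {
            pte_t pte = *ptep;
            struct page *page;

            if (!pte_present(pte))
                    return;

            page = vm_normal_page(vma, addr, pte);
            if (!page)
                    return;

            /* step 2: hint reclaim to discard instead of swapping out */
            SetPageLazyfree(page);          /* hypothetical page flag */

            /* step 3: clear dirty (and young) so a later write shows up */
            pte = pte_mkclean(pte_mkold(pte));
            set_pte_at(vma->vm_mm, addr, ptep, pte);
    }
    /* ...followed by a single flush_tlb_range(vma, start, end). */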

Let's talk about vrange(range, VOLATILE).
Its overhead is very small: it just marks a flag in a structure which
represents the range (ie, struct vrange). When the VM wants to reclaim some
pages and finds that a page is mapped in a VOLATILE area, it can discard it
instead of swapping it out. This moves the overhead from the system call
itself to the VM reclaim path, which is a very slow path in the system, and I
think that's a desirable design (and that's why we have rmap).
But one problem remains. The VM can't detect that a page is in use by the
process after he calls vrange(range, VOLATILE), because we didn't do anything
in vrange(VOLATILE), so the VM might discard the page under the process. That
didn't happen with madvise(MADV_FREE), because it cleared the dirty bit of
the PTE to detect whether the page had been used since madvise was called.

The solution in vrange is the new vrange(range, NOVOLATILE) system call,
which gives the kernel the hint to stop discarding pages in the range.
The cost of vrange(range, NOVOLATILE) is very small, too:
it just clears the flags in the struct vrange which represents the range.

So I think calling the pair of volatile system calls would be cheaper than a
single madvise(MADV_FREE) (see the sketch below).
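
By contrast, here is a rough sketch of the bookkeeping the vrange pair has to
do. The field and function names are illustrative only; a real implementation
also needs range lookup and locking, which are elided:

    #include <stdbool.h>

    struct vrange {
            unsigned long start, end;
            bool marked;    /* range is currently volatile */
            bool purged;    /* set by reclaim when it discards pages */
    };

    /* vrange(range, VOLATILE): O(1), touches no page table entry. */
    static void vrange_mark_volatile(struct vrange *v)
    {
            v->marked = true;
            v->purged = false;      /* reclaim may set this later */
    }

    /* vrange(range, NOVOLATILE): also O(1); reports whether reclaim
     * discarded anything while the range was volatile. */
    static bool vrange_mark_novolatile(struct vrange *v)
    {
            v->marked = false;
            return v->purged;
    }

Neither call walks a single pte, which is why the cost of the pair stays flat
no matter how big the range is.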

I hope this helps your understanding, but I'm not sure it does, because I am
writing this in an airport, where it is very hard to focus on my work. :(


--
Kind regards,
Minchan Kim

2013-04-16 03:33:15

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On 04/14/2013 12:42 AM, Minchan Kim wrote:
> Hi KOSAKI,
>
> On Thu, Apr 11, 2013 at 11:01:11AM -0400, KOSAKI Motohiro wrote:
>>>>>> and adding a new syscall invocation is unwelcome.
>>>>> Sure. But one more system call could be cheaper than a page-granularity
>>>>> operation on the purged range.
>>>> I don't think the vrange(VOLATILE) cost is relevant to this discussion.
>>>> Whether we send SIGBUS or just nuke the pte, the purge should be done in vmscan,
>>>> not in the vrange() syscall.
>>> Again, please see MADV_FREE: http://lwn.net/Articles/230799/
>>> It changes the pte and page flags on all pages of the range through
>>> zap_pte_range. So it would make vrange(VOLATILE) expensive, and
>>> the bigger the range is, the bigger the cost is.
>> This hadn't crossed my mind. Currently try_to_discard_one() inserts a vrange
>> marker for raising SIGBUS; then we could insert pte_none() at the same cost, too. Am
>> I missing something?
> For your requirement, we need some tracking model to detect that a page is
> currently being used by the process before the VM discards it, *if* we don't
> provide the vrange(NOVOLATILE) pair system call (see below). So the tracking
> model would have to be set up in the vrange(VOLATILE) system call context.

To further clarify Minchan's note here, the reason it's important for the
application to use vrange(NOVOLATILE) is really to help define _when
the range stops being volatile_.

In your libc hack to use vrange(), you see the benefit of not immediately
purging the memory as you do with MADV_DONTNEED. However, if the heap
grows again, and those addresses are re-used, nothing has stopped those
pages from continuing to be volatile. Thus the kernel could then decide
to purge those pages after they start to be used again, and you'd lose
data. I suspect that's not what you want. :)

Rik's MADV_FREE implementation is very similar to vrange(VOLATILE), but
has an implicit vrange(NOVOLATILE) on any page write. So by dirtying a
page, it stops the kernel from later purging it.

This MADV_FREE semantic works very well if you always want zero-fill (as
in the case of malloc/free). But for other data, it's important to know
something was lost (as a zero page could be valid data), and that's why
we provide the SIGBUS, as well as the purged notification on
vrange(NOVOLATILE).

In other words, as long as you do a vrange(NOVOLATILE) when you grow the
heap again (before it's used), it should be very similar to the MADV_FREE
behavior, but is more flexible for other use cases (a hypothetical sketch
of this reuse pattern follows below).
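
As a hypothetical sketch of that reuse pattern, building on the vrange()
wrapper sketched earlier in the thread, and assuming for illustration that
vrange(NOVOLATILE) returns 1 when the range was purged (the exact reporting
mechanism is an assumption, not something this RFC nails down):

    #include <string.h>

    /* Call this BEFORE touching a chunk whose pages were marked volatile. */
    static void *reuse_chunk(void *chunk, size_t size)
    {
            long ret = vrange(chunk, size, VRANGE_NOVOLATILE, VRANGE_FULL);

            if (ret < 0)
                    return NULL;            /* pinning the range failed */
            if (ret == 1)                   /* contents were purged */
                    memset(chunk, 0, size); /* fine for zero-fill data;
                                             * real data would need real
                                             * recovery here instead */
            return chunk;                   /* safe now: no SIGBUS, no
                                             * later purge under us */
    }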

thanks
-john