Two callbacks to remove individual pages as done in rmap code
invalidate_page()
	Called from the inner loop of rmap walks to invalidate pages.

age_page()
	Called for the determination of the page referenced status.
If we do not care about page referenced status then an age_page callback
may be omitted. The page lock and the pte lock are held when either of the
functions is called.
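For illustration, a secondary-MMU driver would hook the two callbacks
roughly as in the sketch below. This is only a sketch: the callback
signatures are assumed from the core mmu notifier patch in this series,
and every example_* name is invented here.

#include <linux/mmu_notifier.h>

static int example_age_page(struct mmu_notifier *mn,
			    struct mm_struct *mm, unsigned long address)
{
	/* Test and clear the accessed bit of the secondary pte mapping
	 * @address; return non-zero if it was set.  Called under the
	 * page lock and the pte lock, so it must not sleep. */
	return 0;
}

static void example_invalidate_page(struct mmu_notifier *mn,
				    struct mm_struct *mm, unsigned long address)
{
	/* Drop the secondary pte mapping @address and flush the
	 * secondary TLB, the counterpart of ptep_clear_flush().
	 * Also must not sleep. */
}

static struct mmu_notifier_ops example_mmu_notifier_ops = {
	.age_page		= example_age_page,
	.invalidate_page	= example_invalidate_page,
};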
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
---
mm/rmap.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c 2008-02-07 16:49:32.000000000 -0800
+++ linux-2.6/mm/rmap.c 2008-02-07 17:25:25.000000000 -0800
@@ -49,6 +49,7 @@
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
#include <asm/tlbflush.h>
@@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
if (vma->vm_flags & VM_LOCKED) {
referenced++;
*mapcount = 1; /* break early from loop */
- } else if (ptep_clear_flush_young(vma, address, pte))
+ } else if (ptep_clear_flush_young(vma, address, pte) |
+ mmu_notifier_age_page(mm, address))
referenced++;
/* Pretend the page is referenced if the task has the
@@ -455,6 +457,7 @@ static int page_mkclean_one(struct page
flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
+ mmu_notifier(invalidate_page, mm, address);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
@@ -712,7 +715,8 @@ static int try_to_unmap_one(struct page
* skipped over this mm) then we should reactivate it.
*/
if (!migration && ((vma->vm_flags & VM_LOCKED) ||
- (ptep_clear_flush_young(vma, address, pte)))) {
+ (ptep_clear_flush_young(vma, address, pte) |
+ mmu_notifier_age_page(mm, address)))) {
ret = SWAP_FAIL;
goto out_unmap;
}
@@ -720,6 +724,7 @@ static int try_to_unmap_one(struct page
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
+ mmu_notifier(invalidate_page, mm, address);
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
@@ -844,12 +849,14 @@ static void try_to_unmap_cluster(unsigne
page = vm_normal_page(vma, address, *pte);
BUG_ON(!page || PageAnon(page));
- if (ptep_clear_flush_young(vma, address, pte))
+ if (ptep_clear_flush_young(vma, address, pte) |
+ mmu_notifier_age_page(mm, address))
continue;
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
+ mmu_notifier(invalidate_page, mm, address);
/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
--
On Thu, 14 Feb 2008 22:49:02 -0800 Christoph Lameter <[email protected]> wrote:
> Two callbacks to remove individual pages as done in rmap code
>
> invalidate_page()
>
> Called from the inner loop of rmap walks to invalidate pages.
>
> age_page()
>
> Called for the determination of the page referenced status.
>
> If we do not care about page referenced status then an age_page callback
> may be omitted. The page lock and the pte lock are held when either of the
> functions is called.
The age_page mystery shallows.
It would be useful to have some rationale somewhere in the patchset for the
existence of this callback.
> #include <asm/tlbflush.h>
>
> @@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
> if (vma->vm_flags & VM_LOCKED) {
> referenced++;
> *mapcount = 1; /* break early from loop */
> - } else if (ptep_clear_flush_young(vma, address, pte))
> + } else if (ptep_clear_flush_young(vma, address, pte) |
> + mmu_notifier_age_page(mm, address))
> referenced++;
The "|" is obviously deliberate. But no explanation is provided telling us
why we still call the callback if ptep_clear_flush_young() said the page
was recently referenced. People who read your code will want to understand
this.
> /* Pretend the page is referenced if the task has the
> @@ -455,6 +457,7 @@ static int page_mkclean_one(struct page
>
> flush_cache_page(vma, address, pte_pfn(*pte));
> entry = ptep_clear_flush(vma, address, pte);
> + mmu_notifier(invalidate_page, mm, address);
I just don't see how this can be done if the callee has another thread in
the middle of establishing IO against this region of memory.
->invalidate_page() _has_ to be able to block. Confused.
On Fri, Feb 15, 2008 at 07:37:36PM -0800, Andrew Morton wrote:
> The "|" is obviously deliberate. But no explanation is provided telling us
> why we still call the callback if ptep_clear_flush_young() said the page
> was recently referenced. People who read your code will want to understand
> this.
This is to clear the young bit in every pte and spte pointing to the
physical page before backing off because some young bit was set. That
way, if any young bit is set on the next scan, we're guaranteed the
page has been touched recently and not ages ago (otherwise it could
take a worst case of N rounds of the lru before the page can be freed,
where N is the number of ptes or sptes pointing to the page).
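Both checks are test-and-clear operations with side effects, which is
why the non-short-circuiting "|" is used. A minimal user-space toy (the
two functions are stand-ins for ptep_clear_flush_young and
mmu_notifier_age_page, not the kernel APIs) makes the difference
concrete:

#include <stdio.h>

static int young_pte = 1, young_spte = 1;

/* stand-in for ptep_clear_flush_young(): test and clear the pte's young bit */
static int clear_pte_young(void)  { int y = young_pte;  young_pte = 0;  return y; }

/* stand-in for mmu_notifier_age_page(): test and clear the spte's young bit */
static int clear_spte_young(void) { int y = young_spte; young_spte = 0; return y; }

int main(void)
{
	/* "||" short-circuits: clear_spte_young() never runs when the pte
	 * was young, so the spte's young bit stays set and the next scan
	 * reports the page referenced even if nothing touched it again. */
	int referenced = clear_pte_young() || clear_spte_young();
	printf("||: referenced=%d, spte young bit still %d\n", referenced, young_spte);

	/* "|" evaluates both sides: both young bits get cleared in this
	 * pass, and the page is still reported referenced if either was set. */
	young_pte = young_spte = 1;
	referenced = clear_pte_young() | clear_spte_young();
	printf("|:  referenced=%d, spte young bit now %d\n", referenced, young_spte);
	return 0;
}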
> I just don't see how this can be done if the callee has another thread in
> the middle of establishing IO against this region of memory.
> ->invalidate_page() _has_ to be able to block. Confused.
invalidate_page marking the spte invalid and flushing the asid/tlb
doesn't need to block, the same way ptep_clear_flush doesn't need to
block for the main linux pte. In fact, before invalidate_page and
ptep_clear_flush can touch anything at all, they have to take their own
spinlocks (mmu_lock for the former, and the PT lock for the latter).
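For concreteness, a driver-side invalidate_page along those lines could
look roughly like the sketch below; the struct, lock and helper names
are all invented for this illustration, and nothing in it needs to
sleep:

#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

struct example_smmu {
	spinlock_t lock;		/* the analogue of kvm's mmu_lock */
	struct mmu_notifier mn;
	/* ... secondary page tables ... */
};

/* driver internals, stubbed for the sketch */
static void example_smmu_clear_spte(struct example_smmu *s, unsigned long address) { }
static void example_smmu_flush_tlb(struct example_smmu *s) { }

static void example_smmu_invalidate_page(struct mmu_notifier *mn,
					 struct mm_struct *mm,
					 unsigned long address)
{
	struct example_smmu *s = container_of(mn, struct example_smmu, mn);

	spin_lock(&s->lock);
	example_smmu_clear_spte(s, address);	/* mark the spte not present */
	example_smmu_flush_tlb(s);		/* flush the secondary TLB/asid */
	spin_unlock(&s->lock);
}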
The only sleeping trouble is for network-driven message passing, where
the driver wants to schedule while it waits for the message to arrive,
since spinning for that long would hang the whole cpu.
sptes are cpu-clocked entities like ptes, so scheduling there is by far
not necessary because there's zero delay in invalidating them and
flushing their tlbs. GRU is similar. Because we boost the reference
count of the pages for every spte mapping, implementing only
invalidate_range_end is enough, but I need to figure out the
get_user_pages->rmap_add window too. get_user_pages can schedule, and
if I want to add a critical section around it to avoid calling
get_user_pages twice during the kvm page fault, a mutex would be the
only way (it sure can't be a spinlock). But a mutex can't be taken by
invalidate_page to stop it. So that leaves me with the idea of adding
a get_user_pages variant that returns the page locked: instead of
calling get_user_pages a second time after rmap_add returns, I will
only need to call unlock_page, which should be faster than a
follow_page. And setting PG_locked before dropping the PT lock in
follow_page should be fast enough too.
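Sketched out, the fault path being proposed might look like the
following; the locked gup variant and all example_* names are
hypothetical, and this only illustrates the ordering (lock the page via
gup, install the spte, unlock instead of re-looking-up the page):

/* hypothetical helpers, declarations only */
static struct page *example_get_user_page_locked(struct mm_struct *mm,
						 unsigned long hva);
static void example_rmap_add(struct kvm *kvm, struct page *page, gfn_t gfn);

static int example_map_guest_page(struct kvm *kvm, struct mm_struct *mm,
				  unsigned long hva, gfn_t gfn)
{
	struct page *page;

	/* like get_user_pages() for one page, but returns with PG_locked
	 * held, which holds off invalidate_page (it runs under the page
	 * lock) while the spte is installed */
	page = example_get_user_page_locked(mm, hva);
	if (!page)
		return -EFAULT;

	example_rmap_add(kvm, page, gfn);	/* install the spte */

	/* cheaper than a second get_user_pages()/follow_page() pass */
	unlock_page(page);
	return 0;
}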
On Fri, 15 Feb 2008, Andrew Morton wrote:
> > @@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
> > if (vma->vm_flags & VM_LOCKED) {
> > referenced++;
> > *mapcount = 1; /* break early from loop */
> > - } else if (ptep_clear_flush_young(vma, address, pte))
> > + } else if (ptep_clear_flush_young(vma, address, pte) |
> > + mmu_notifier_age_page(mm, address))
> > referenced++;
>
> The "|" is obviously deliberate. But no explanation is provided telling us
> why we still call the callback if ptep_clear_flush_young() said the page
> was recently referenced. People who read your code will want to understand
> this.
Andrea?
> > flush_cache_page(vma, address, pte_pfn(*pte));
> > entry = ptep_clear_flush(vma, address, pte);
> > + mmu_notifier(invalidate_page, mm, address);
>
> I just don't see how this can be done if the callee has another thread in
> the middle of establishing IO against this region of memory.
> ->invalidate_page() _has_ to be able to block. Confused.
The page lock is held and that holds off I/O?
Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:
>
>
>>> @@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
>>> if (vma->vm_flags & VM_LOCKED) {
>>> referenced++;
>>> *mapcount = 1; /* break early from loop */
>>> - } else if (ptep_clear_flush_young(vma, address, pte))
>>> + } else if (ptep_clear_flush_young(vma, address, pte) |
>>> + mmu_notifier_age_page(mm, address))
>>> referenced++;
>>>
>> The "|" is obviously deliberate. But no explanation is provided telling us
>> why we still call the callback if ptep_clear_flush_young() said the page
>> was recently referenced. People who read your code will want to understand
>> this.
>>
>
> Andrea?
>
>
I'm not Andrea, but the way I read it, ptep_clear_flush_young() and
->age_page() each have two effects: check whether the page has been
referenced and clear the referenced bit. || would retain the semantics
of the check but lose the clearing. | does the right thing.
--
Any sufficiently difficult bug is indistinguishable from a feature.
On Saturday 16 February 2008 14:37, Andrew Morton wrote:
> On Thu, 14 Feb 2008 22:49:02 -0800 Christoph Lameter <[email protected]> wrote:
> > Two callbacks to remove individual pages as done in rmap code
> >
> > invalidate_page()
> >
> > Called from the inner loop of rmap walks to invalidate pages.
> >
> > age_page()
> >
> > Called for the determination of the page referenced status.
> >
> > If we do not care about page referenced status then an age_page callback
> > may be omitted. The page lock and the pte lock are held when either of the
> > functions is called.
>
> The age_page mystery shallows.
BTW, can this callback be called mmu_notifier_clear_flush_young, to
match the core VM?
On Sunday 17 February 2008 06:22, Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:
> > > flush_cache_page(vma, address, pte_pfn(*pte));
> > > entry = ptep_clear_flush(vma, address, pte);
> > > + mmu_notifier(invalidate_page, mm, address);
> >
> > I just don't see how this can be done if the callee has another thread in
> > the middle of establishing IO against this region of memory.
> > ->invalidate_page() _has_ to be able to block. Confused.
>
> The page lock is held and that holds off I/O?
I think the actual answer is that "it doesn't matter".
ptes are not exactly the entity via which IO gets established, so
all we really care about here is that after the callback finishes,
we will not get any more reads or writes to the page via the
external mapping.
As far as holding off local IO goes, that is the job of the core
VM. (And no, page lock does not necessarily hold it off FYI -- it
can be writeback IO or even IO directly via buffers).
Holding off IO via the external references I guess is a job for
the notifier driver.
On Tue, Feb 19, 2008 at 07:46:10PM +1100, Nick Piggin wrote:
> On Sunday 17 February 2008 06:22, Christoph Lameter wrote:
> > On Fri, 15 Feb 2008, Andrew Morton wrote:
>
> > > > flush_cache_page(vma, address, pte_pfn(*pte));
> > > > entry = ptep_clear_flush(vma, address, pte);
> > > > + mmu_notifier(invalidate_page, mm, address);
> > >
> > > I just don't see how this can be done if the callee has another thread in
> > > the middle of establishing IO against this region of memory.
> > > ->invalidate_page() _has_ to be able to block. Confused.
> >
> > The page lock is held and that holds off I/O?
>
> I think the actual answer is that "it doesn't matter".
Agreed. The page lock itself, taken when invalidate_page is called, is
used to serialize the VM against the VM, not the VM against I/O.