2008-01-28 20:30:36

by Christoph Lameter

[permalink] [raw]
Subject: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc. change.
Most of the VM address space changes can use the range invalidate
callback.

invalidate_range() is generally called with mmap_sem held but
no spinlocks are active. If invalidate_range() is called with
locks held, then we pass a flag into invalidate_range().
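
For reference, a rough sketch of what the mmu_notifier() invocation used
throughout this patch might expand to. This is only illustrative: the real
definition is presumably in patch 1/6 of this series, and the field names
used here (mm->mmu_notifier.head, the ops pointer, the hlist member) are
assumptions, not the posted API.

#define mmu_notifier(function, mm, args...)				\
	do {								\
		struct mmu_notifier *__mn;				\
		struct hlist_node *__n;					\
									\
		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) {	\
			rcu_read_lock();				\
			hlist_for_each_entry_rcu(__mn, __n,		\
					&(mm)->mmu_notifier.head,	\
					hlist)				\
				if (__mn->ops->function)		\
					__mn->ops->function(__mn, mm,	\
							    args);	\
			rcu_read_unlock();				\
		}							\
	} while (0)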

Comments state that mmap_sem must be held for
remap_pfn_range() but various drivers do not seem to do this.

Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>

---
mm/fremap.c | 2 ++
mm/hugetlb.c | 2 ++
mm/memory.c | 11 +++++++++--
mm/mmap.c | 1 +
4 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c 2008-01-25 19:31:05.000000000 -0800
+++ linux-2.6/mm/fremap.c 2008-01-25 19:32:49.000000000 -0800
@@ -15,6 +15,7 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>

#include <asm/mmu_context.h>
#include <asm/cacheflush.h>
@@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}

+ mmu_notifier(invalidate_range, mm, start, start + size, 0);
err = populate_range(mm, vma, start, size, pgoff);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2008-01-25 19:31:05.000000000 -0800
+++ linux-2.6/mm/memory.c 2008-01-25 19:32:49.000000000 -0800
@@ -50,6 +50,7 @@
#include <linux/delayacct.h>
#include <linux/init.h>
#include <linux/writeback.h>
+#include <linux/mmu_notifier.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -891,6 +892,8 @@ unsigned long zap_page_range(struct vm_a
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+ mmu_notifier(invalidate_range, mm, address, end,
+ (details ? (details->i_mmap_lock != NULL) : 0));
return end;
}

@@ -1319,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + PAGE_ALIGN(size);
+ unsigned long start = addr, end = addr + PAGE_ALIGN(size);
struct mm_struct *mm = vma->vm_mm;
int err;

@@ -1360,6 +1363,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range, mm, start, end, 0);
return err;
}
EXPORT_SYMBOL(remap_pfn_range);
@@ -1443,7 +1447,7 @@ int apply_to_page_range(struct mm_struct
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + size;
+ unsigned long start = addr, end = addr + size;
int err;

BUG_ON(addr >= end);
@@ -1454,6 +1458,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range, mm, start, end, 0);
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1634,6 +1639,8 @@ gotten:
/*
* Re-check the pte - we dropped the lock
*/
+ mmu_notifier(invalidate_range, mm, address,
+ address + PAGE_SIZE - 1, 0);
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-01-25 19:31:05.000000000 -0800
+++ linux-2.6/mm/mmap.c 2008-01-25 19:32:49.000000000 -0800
@@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+ mmu_notifier(invalidate_range, mm, start, end, 0);
}

/*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c 2008-01-25 19:33:58.000000000 -0800
+++ linux-2.6/mm/hugetlb.c 2008-01-25 19:34:13.000000000 -0800
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/mmu_notifier.h>

#include <asm/page.h>
#include <asm/pgtable.h>
@@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+ mmu_notifier(invalidate_range, mm, start, end, 1);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);

--


2008-01-29 16:21:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Mon, Jan 28, 2008 at 12:28:42PM -0800, Christoph Lameter wrote:
> Index: linux-2.6/mm/fremap.c
> ===================================================================
> --- linux-2.6.orig/mm/fremap.c 2008-01-25 19:31:05.000000000 -0800
> +++ linux-2.6/mm/fremap.c 2008-01-25 19:32:49.000000000 -0800
> @@ -15,6 +15,7 @@
> #include <linux/rmap.h>
> #include <linux/module.h>
> #include <linux/syscalls.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/mmu_context.h>
> #include <asm/cacheflush.h>
> @@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
> spin_unlock(&mapping->i_mmap_lock);
> }
>
> + mmu_notifier(invalidate_range, mm, start, start + size, 0);
> err = populate_range(mm, vma, start, size, pgoff);

How can it be right to invalidate_range _before_ ptep_clear_flush?

> @@ -1634,6 +1639,8 @@ gotten:
> /*
> * Re-check the pte - we dropped the lock
> */
> + mmu_notifier(invalidate_range, mm, address,
> + address + PAGE_SIZE - 1, 0);
> page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> if (likely(pte_same(*page_table, orig_pte))) {
> if (old_page) {

What's the point of invalidate_range when the size is PAGE_SIZE? And
how can it be right to invalidate_range _before_ ptep_clear_flush?

2008-01-29 18:29:17

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

Christoph, the below patch should fix the current leak of the pinned
pages. I hope the page-pin that should be dropped by the
invalidate_range op, is enough to prevent the "physical page" mapped
on that "mm+address" to change before invalidate_range returns. If
that would ever happen, there would be a coherency loss between the
guest VM writes and the writes coming from userland on the same
mm+address from a different thread (qemu, whatever). invalidate_page
before PT lock was obviously safe. Now we entirely relay on the pin to
prevent the page to change before invalidate_range returns. If the pte
is unmapped and the page is mapped back in with a minor fault that's
ok, as long as the physical page remains the same for that mm+address,
until all sptes are gone.

Signed-off-by: Andrea Arcangeli <[email protected]>

diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}

+ err = populate_range(mm, vma, start, size, pgoff);
mmu_notifier(invalidate_range, mm, start, start + size, 0);
- err = populate_range(mm, vma, start, size, pgoff);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
downgrade_write(&mm->mmap_sem);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1639,8 +1639,6 @@ gotten:
/*
* Re-check the pte - we dropped the lock
*/
- mmu_notifier(invalidate_range, mm, address,
- address + PAGE_SIZE - 1, 0);
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
@@ -1676,6 +1674,8 @@ gotten:
page_cache_release(old_page);
unlock:
pte_unmap_unlock(page_table, ptl);
+ mmu_notifier(invalidate_range, mm, address,
+ address + PAGE_SIZE - 1, 0);
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);

2008-01-29 19:55:27

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> > + mmu_notifier(invalidate_range, mm, address,
> > + address + PAGE_SIZE - 1, 0);
> > page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > if (likely(pte_same(*page_table, orig_pte))) {
> > if (old_page) {
>
> What's the point of invalidate_range when the size is PAGE_SIZE? And
> how can it be right to invalidate_range _before_ ptep_clear_flush?

I am not sure. AFAICT you wrote that code.

It seems to be okay to invalidate range if you hold mmap_sem writably. In
that case no additional faults can happen that would create new ptes.


2008-01-29 20:30:22

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> diff --git a/mm/fremap.c b/mm/fremap.c
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns
> spin_unlock(&mapping->i_mmap_lock);
> }
>
> + err = populate_range(mm, vma, start, size, pgoff);
> mmu_notifier(invalidate_range, mm, start, start + size, 0);
> - err = populate_range(mm, vma, start, size, pgoff);
> if (!err && !(flags & MAP_NONBLOCK)) {
> if (unlikely(has_write_lock)) {
> downgrade_write(&mm->mmap_sem);

We invalidate the range *after* populating it? Isn't it okay to establish
references while populate_range() runs?

> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1639,8 +1639,6 @@ gotten:
> /*
> * Re-check the pte - we dropped the lock
> */
> - mmu_notifier(invalidate_range, mm, address,
> - address + PAGE_SIZE - 1, 0);
> page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> if (likely(pte_same(*page_table, orig_pte))) {
> if (old_page) {

What we did is to invalidate the page (?!) before taking the pte lock. In
the lock we replace the pte to point to another page. This means that we
need to clear stale information. So we zap it before. If another reference
is established after taking the spinlock then the pte contents have
changed and the critical section fails.

Before the critical section starts we have gotten an extra refcount on the
original page so the page cannot vanish from under us.

> @@ -1676,6 +1674,8 @@ gotten:
> page_cache_release(old_page);
> unlock:
> pte_unmap_unlock(page_table, ptl);
> + mmu_notifier(invalidate_range, mm, address,
> + address + PAGE_SIZE - 1, 0);
> if (dirty_page) {
> if (vma->vm_file)
> file_update_time(vma->vm_file);

Now we invalidate the page after the transaction is complete. This means
an external pte can persist while we change the pte? Possibly even dirty the
page?


2008-01-29 21:18:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, Jan 29, 2008 at 11:55:10AM -0800, Christoph Lameter wrote:
> I am not sure. AFAICT you wrote that code.

Actually I didn't need to change a single line in do_wp_page because
ptep_clear_flush was already doing everything transparently for
me. This was the memory.c part of my last patch I posted, it only
touches zap_page_range, remap_pfn_range and apply_to_page_range.

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -889,6 +889,7 @@ unsigned long zap_page_range(struct vm_a
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+ mmu_notifier(invalidate_range, mm, address, end);
return end;
}

@@ -1317,7 +1318,7 @@ int remap_pfn_range(struct vm_area_struc
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + PAGE_ALIGN(size);
+ unsigned long start = addr, end = addr + PAGE_ALIGN(size);
struct mm_struct *mm = vma->vm_mm;
int err;

@@ -1358,6 +1359,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range, mm, start, end);
return err;
}
EXPORT_SYMBOL(remap_pfn_range);
@@ -1441,7 +1443,7 @@ int apply_to_page_range(struct mm_struct
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + size;
+ unsigned long start = addr, end = addr + size;
int err;

BUG_ON(addr >= end);
@@ -1452,6 +1454,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range, mm, start, end);
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);

> It seems to be okay to invalidate range if you hold mmap_sem writably. In
> that case no additional faults can happen that would create new ptes.

In that place the mmap_sem is taken but in readonly mode. I never rely
on the mmap_sem in the mmu notifier methods. Not invoking the notifier
before releasing the PT lock adds quite some uncertainty on the SMP
safety of the spte invalidates, because the pte may be unmapped and
remapped by a minor fault before invalidate_range is invoked, but I
haven't figured out a kernel-crashing race yet thanks to the pin we take
through get_user_pages (and only thanks to it). The requirement is
that invalidate_range is invoked after the last ptep_clear_flush or it
leaks pins; that's why I had to move it to the end.
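
To make the pin leak concrete, a registered notifier of the KVM kind would
do something along these lines in its invalidate_range method. This is a
purely illustrative sketch: struct example_kvm, spte_lock and
example_rmap_remove_sptes() are made-up names, and the real KVM rmap walk
and locking differ.

static void example_invalidate_range(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start, unsigned long end,
				     int lock)
{
	struct example_kvm *kvm = container_of(mn, struct example_kvm,
					       mmu_notifier);
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		struct page *page;

		spin_lock(&kvm->spte_lock);
		/* drop every spte mapping this host virtual address ... */
		page = example_rmap_remove_sptes(kvm, addr);
		spin_unlock(&kvm->spte_lock);

		/* ... and release the get_user_pages() pin taken when the
		 * spte was created.  If the Linux pte is still live because
		 * the notifier ran before ptep_clear_flush, a concurrent
		 * fault can re-pin and re-map the old page, and that later
		 * pin is never dropped: the leak described above. */
		if (page) {
			set_page_dirty_lock(page);
			put_page(page);
		}
	}
}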

2008-01-29 21:36:19

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> > It seems to be okay to invalidate range if you hold mmap_sem writably. In
> > that case no additional faults can happen that would create new ptes.
>
> In that place the mmap_sem is taken but in readonly mode. I never rely
> on the mmap_sem in the mmu notifier methods. Not invoking the notifier

Well it seems that we have to rely on mmap_sem otherwise concurrent faults
can occur. The mmap_sem seems to be acquired for write there.

if (!has_write_lock) {
up_read(&mm->mmap_sem);
down_write(&mm->mmap_sem);
has_write_lock = 1;
goto retry;
}


> before releasing the PT lock adds quite some uncertainty on the smp
> safety of the spte invalidates, because the pte may be unmapped and
> remapped by a minor fault before invalidate_range is invoked, but I
> didn't figure out a kernel crashing race yet thanks to the pin we take
> through get_user_pages (and only thanks to it). The requirement is
> that invalidate_range is invoked after the last ptep_clear_flush or it
> leaks pins that's why I had to move it at the end.

So "pins" means a reference count right? I still do not get why you
have refcount problems. You take a refcount when you export the page
through KVM and then drop the refcount in invalidate page right?

So you walk through the KVM ptes and drop the refcount for each spte you
encounter?

2008-01-29 21:36:34

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, Jan 29, 2008 at 12:30:06PM -0800, Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
>
> > diff --git a/mm/fremap.c b/mm/fremap.c
> > --- a/mm/fremap.c
> > +++ b/mm/fremap.c
> > @@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns
> > spin_unlock(&mapping->i_mmap_lock);
> > }
> >
> > + err = populate_range(mm, vma, start, size, pgoff);
> > mmu_notifier(invalidate_range, mm, start, start + size, 0);
> > - err = populate_range(mm, vma, start, size, pgoff);
> > if (!err && !(flags & MAP_NONBLOCK)) {
> > if (unlikely(has_write_lock)) {
> > downgrade_write(&mm->mmap_sem);
>
> We invalidate the range *after* populating it? Isnt it okay to establish
> references while populate_range() runs?

It's not ok because that function can very well overwrite existing and
present ptes (it's actually the nonlinear common case fast path for
db). With your code the sptes created between invalidate_range and
populate_range will keep pointing forever to the old physical page
instead of the newly populated one.

I'm also asking myself if it's an SMP race not to call
mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at
in install_file_pte. Probably not because the guest VM running in a
different thread would need to serialize outside the install_file_pte
code with the task running install_file_pte, if it wants to be sure to
write either all its data to the old or the new page. Certainly doing
the invalidate_page inside the PT lock was obviously safe but I hope
this is safe and this can accommodate your needs too.

> > diff --git a/mm/memory.c b/mm/memory.c
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1639,8 +1639,6 @@ gotten:
> > /*
> > * Re-check the pte - we dropped the lock
> > */
> > - mmu_notifier(invalidate_range, mm, address,
> > - address + PAGE_SIZE - 1, 0);
> > page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > if (likely(pte_same(*page_table, orig_pte))) {
> > if (old_page) {
>
> What we did is to invalidate the page (?!) before taking the pte lock. In
> the lock we replace the pte to point to another page. This means that we
> need to clear stale information. So we zap it before. If another reference
> is established after taking the spinlock then the pte contents have
> changed at the cirtical section fails.
>
> Before the critical section starts we have gotten an extra refcount on the
> original page so the page cannot vanish from under us.

The problem is the missing invalidate_page/range _after_
ptep_clear_flush. If a spte is built between invalidate_range and
pte_offset_map_lock, it will remain pointing to the old page
forever. Nothing will be called to invalidate that stale spte built
between invalidate_page/range and ptep_clear_flush. This is why for
the last few days I kept saying the mmu notifiers have to be invoked
_after_ ptep_clear_flush and never before (remember the export
notifier?). No idea how you can deal with this in your code; certainly
for KVM sptes that's a backwards and unworkable ordering of operations
(exactly as backwards as doing the tlb flush before the pte_clear in
ptep_clear_flush: think of the spte as a tlb, you can't flush the tlb
before clearing/updating the pte or it's SMP unsafe).

> > @@ -1676,6 +1674,8 @@ gotten:
> > page_cache_release(old_page);
> > unlock:
> > pte_unmap_unlock(page_table, ptl);
> > + mmu_notifier(invalidate_range, mm, address,
> > + address + PAGE_SIZE - 1, 0);
> > if (dirty_page) {
> > if (vma->vm_file)
> > file_update_time(vma->vm_file);
>
> Now we invalidate the page after the transaction is complete. This means
> external pte can persist while we change the pte? Possibly even dirty the
> page?

Yes, and the only reason this can be safe is the one explained
at the top of the email: if the other cpu wants to serialize to be
sure to write in the "new" page, it has to serialize with the
page fault, but to serialize it has to wait for the page fault to return
(example: we're not going to call futex code until the page fault
returns).

2008-01-29 21:53:20

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> > We invalidate the range *after* populating it? Isnt it okay to establish
> > references while populate_range() runs?
>
> It's not ok because that function can very well overwrite existing and
> present ptes (it's actually the nonlinear common case fast path for
> db). With your code the sptes created between invalidate_range and
> populate_range, will keep pointing forever to the old physical page
> instead of the newly populated one.

Seems though that the mmap_sem is taken for regular vmas writably and will
hold off new mappings.

> I'm also asking myself if it's a smp race not to call
> mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at
> in install_file_pte. Probably not because the guest VM running in a
> different thread would need to serialize outside the install_file_pte
> code with the task running install_file_pte, if it wants to be sure to
> write either all its data to the old or the new page. Certainly doing
> the invalidate_page inside the PT lock was obviously safe but I hope
> this is safe and this can accommodate your needs too.

But that would be doing two invalidates on one pte. One range and one page
invalidate.

> > > diff --git a/mm/memory.c b/mm/memory.c
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1639,8 +1639,6 @@ gotten:
> > > /*
> > > * Re-check the pte - we dropped the lock
> > > */
> > > - mmu_notifier(invalidate_range, mm, address,
> > > - address + PAGE_SIZE - 1, 0);
> > > page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > > if (likely(pte_same(*page_table, orig_pte))) {
> > > if (old_page) {
> >
> > What we did is to invalidate the page (?!) before taking the pte lock. In
> > the lock we replace the pte to point to another page. This means that we
> > need to clear stale information. So we zap it before. If another reference
> > is established after taking the spinlock then the pte contents have
> > changed at the cirtical section fails.
> >
> > Before the critical section starts we have gotten an extra refcount on the
> > original page so the page cannot vanish from under us.
>
> The problem is the missing invalidate_page/range _after_
> ptep_clear_flush. If a spte is built between invalidate_range and
> pte_offset_map_lock, it will remain pointing to the old page
> forever. Nothing will be called to invalidate that stale spte built
> between invalidate_page/range and ptep_clear_flush. This is why for
> the last few days I kept saying the mmu notifiers have to be invoked
> _after_ ptep_clear_flush and never before (remember the export
> notifier?). No idea how you can deal with this in your code, certainly
> for KVM sptes that's backwards and unworkable ordering of operation
> (exactly as backwards are doing the tlb flush before pte_clear in
> ptep_clear_flush, think spte as a tlb, you can't flush the tlb before
> clearing/updating the pte or it's smp unsafe).

Hmmm... So we could only do an invalidate_page here? Drop the strange
invalidate_range()?

>
> > > @@ -1676,6 +1674,8 @@ gotten:
> > > page_cache_release(old_page);
> > > unlock:
> > > pte_unmap_unlock(page_table, ptl);
> > > + mmu_notifier(invalidate_range, mm, address,
> > > + address + PAGE_SIZE - 1, 0);
> > > if (dirty_page) {
> > > if (vma->vm_file)
> > > file_update_time(vma->vm_file);
> >
> > Now we invalidate the page after the transaction is complete. This means
> > external pte can persist while we change the pte? Possibly even dirty the
> > page?
>
> Yes, and the only reason this can be safe is for the reason explained
> at the top of the email, if the other cpu wants to serialize to be
> sure to write in the "new" page, it has to serialize with the
> page-fault but to serialize it has to wait the page fault to return
> (example: we're not going to call futex code until the page fault
> returns).

Serialize how? mmap_sem?

2008-01-29 22:02:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, Jan 29, 2008 at 01:35:58PM -0800, Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
>
> > > It seems to be okay to invalidate range if you hold mmap_sem writably. In
> > > that case no additional faults can happen that would create new ptes.
> >
> > In that place the mmap_sem is taken but in readonly mode. I never rely
> > on the mmap_sem in the mmu notifier methods. Not invoking the notifier
>
> Well it seems that we have to rely on mmap_sem otherwise concurrent faults
> can occur. The mmap_sem seems to be acquired for write there.
^^^^^
>
> if (!has_write_lock) {
> up_read(&mm->mmap_sem);
> down_write(&mm->mmap_sem);
> has_write_lock = 1;
> goto retry;
> }


hmm, "there" where? When I said it was taken in readonly mode I meant
for the quoted code (it would be at the top if it wasn't cut), so I
quote below again:

> > + mmu_notifier(invalidate_range, mm, address,
> > + address + PAGE_SIZE - 1, 0);
> > page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > if (likely(pte_same(*page_table, orig_pte))) {
> > if (old_page) {

The "there" for me was do_wp_page.

Even for the code you quoted in fremap.c, has_write_lock is set
to 1 _only_ the very first time you call sys_remap_file_pages on a
VMA. Only the transition of the VMA from linear to nonlinear
requires the mmap_sem in write mode. So you can be sure all fremap code
99% of the time is populating (overwriting) already present ptes with
only the mmap_sem in readonly mode, like do_wp_page. It would be
unnecessary to populate the nonlinear range with the mmap_sem in write
mode. Only the "vma" mangling requires the mmap_sem in write mode; the
pte modifications only require the PT lock + mmap_sem in read mode.

Effectively the first invocation of populate_range runs with the
mmap_sem in write mode; I wonder why, there seems to be no good reason
for that. I guess it's a bit that should be optimized, by calling
downgrade_write before calling populate_range even for the first time
the vma switches from linear to nonlinear (after the vma has been
fully updated to the new status). But for sure all later invocations
run populate_range with the semaphore readonly, like the rest of the
VM does when instantiating ptes in the page faults.

> > before releasing the PT lock adds quite some uncertainty on the smp
> > safety of the spte invalidates, because the pte may be unmapped and
> > remapped by a minor fault before invalidate_range is invoked, but I
> > didn't figure out a kernel crashing race yet thanks to the pin we take
> > through get_user_pages (and only thanks to it). The requirement is
> > that invalidate_range is invoked after the last ptep_clear_flush or it
> > leaks pins that's why I had to move it at the end.
>
> So "pins" means a reference count right? I still do not get why you

Yes.

> have refcount problems. You take a refcount when you export the page
> through KVM and then drop the refcount in invalidate page right?

Yes.

> So you walk through the KVM ptes and drop the refcount for each spte you
> encounter?

Yes.

All pins are gone by the time invalidate_page/range returns. But there
is no critical section between invalidate_page and the _later_
ptep_clear_flush. So get_user_pages is free to run and take the PT
lock before the ptep_clear_flush, find the linux pte still
instantiated, and create a new spte before ptep_clear_flush runs.

Think of why the tlb flushes are being called at the end of
ptep_clear_flush. The mmu notifier invalidate has to be called after
for the exact same reason.
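
A minimal sketch of the ordering being argued for here, written as a wrapper
around the existing primitives. The _notify name is illustrative and not
something in the posted patches; it just assumes an invalidate_page hook as
used elsewhere in this thread.

static inline pte_t ptep_clear_flush_notify(struct vm_area_struct *vma,
					    unsigned long address,
					    pte_t *ptep)
{
	pte_t pte;

	/* 1. clear the Linux pte ... */
	pte = ptep_get_and_clear(vma->vm_mm, address, ptep);
	/* 2. ... then flush the hardware TLB ... */
	flush_tlb_page(vma, address);
	/* 3. ... and only then tear down secondary mappings (sptes), so
	 *    neither a CPU TLB nor an external MMU can keep using the old
	 *    translation.  Calling the notifier before step 1 would let
	 *    get_user_pages() rebuild an spte from the still-present pte. */
	mmu_notifier(invalidate_page, vma->vm_mm, address);
	return pte;
}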

Perhaps somebody else should explain this; I started exposing this
SMP race the moment I saw the backwards ordering being
proposed in export-notifier-v1. Sorry if I'm not clear enough.

2008-01-29 22:35:22

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, Jan 29, 2008 at 01:53:05PM -0800, Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
>
> > > We invalidate the range *after* populating it? Isnt it okay to establish
> > > references while populate_range() runs?
> >
> > It's not ok because that function can very well overwrite existing and
> > present ptes (it's actually the nonlinear common case fast path for
> > db). With your code the sptes created between invalidate_range and
> > populate_range, will keep pointing forever to the old physical page
> > instead of the newly populated one.
>
> Seems though that the mmap_sem is taken for regular vmas writably and will
> hold off new mappings.

It's taken writable due to the code being inefficient the first time;
all later times populate_range overwrites ptes with the mmap_sem
in readonly mode (finally rightfully so). The first remap_file_pages is, I
guess, irrelevant to optimize; the whole point of nonlinear is to
call remap_file_pages a zillion times on the same vma, overwriting
present ptes the whole time, so if the first time the semaphore is not
readonly it probably doesn't make a difference.

get_user_pages, invoked by the kvm spte fault, can happen between
invalidate_range and populate_range. If it can't happen, nobody has
pointed out a good reason why it can't. The kvm page
fault as well rightfully only takes the mmap_sem in readonly mode, so
get_user_pages is only called internally to gfn_to_page with the
readonly semaphore.

With my approach ptep_clear_flush was not only invalidating sptes
after ptep_clear_flush, but it was also invalidating them inside the
PT lock, so it was totally obvious there could be no race vs
get_user_pages.

> > I'm also asking myself if it's a smp race not to call
> > mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at
> > in install_file_pte. Probably not because the guest VM running in a
> > different thread would need to serialize outside the install_file_pte
> > code with the task running install_file_pte, if it wants to be sure to
> > write either all its data to the old or the new page. Certainly doing
> > the invalidate_page inside the PT lock was obviously safe but I hope
> > this is safe and this can accommodate your needs too.
>
> But that would be doing two invalidates on one pte. One range and one page
> invalidate.

Yes, but it would have been micro-optimized later if you really cared,
by simply changing ptep_clear_flush to __ptep_clear_flush, no big
deal. Definitely all methods must be robust against being called
multiple times, even if the rmap finds no spte mapping such a host
virtual address.

> Hmmm... So we could only do an invalidate_page here? Drop the strange
> invalidate_range()?

That's a question you should answer.

> > > > @@ -1676,6 +1674,8 @@ gotten:
> > > > page_cache_release(old_page);
> > > > unlock:
> > > > pte_unmap_unlock(page_table, ptl);
> > > > + mmu_notifier(invalidate_range, mm, address,
> > > > + address + PAGE_SIZE - 1, 0);
> > > > if (dirty_page) {
> > > > if (vma->vm_file)
> > > > file_update_time(vma->vm_file);
> > >
> > > Now we invalidate the page after the transaction is complete. This means
> > > external pte can persist while we change the pte? Possibly even dirty the
> > > page?
> >
> > Yes, and the only reason this can be safe is for the reason explained
> > at the top of the email, if the other cpu wants to serialize to be
> > sure to write in the "new" page, it has to serialize with the
> > page-fault but to serialize it has to wait the page fault to return
> > (example: we're not going to call futex code until the page fault
> > returns).
>
> Serialize how? mmap_sem?

No, that's a different angle.

But now I think there may be an issue with a third thread that may
show the removal of invalidate_page from ptep_clear_flush to be unsafe.

A third thread writing to a page through the linux pte and the guest
VM writing to the same page through the sptes will be writing on the
same physical page concurrently, using a userspace spinlock w/o
ever entering the kernel. With your patch, which does the invalidate_range
after dropping the PT lock, the third thread may start writing on the new
page while the guest is still writing to the old page through the
sptes. This couldn't happen with my patch.

So really, in the light of the third thread, it seems your approach is
SMP racy and ptep_clear_flush should call invalidate_page as the last thing
before returning. My patch was enforcing that ptep_clear_flush would
stop the third thread in a linux page fault, and drop the spte,
before the new mapping could be instantiated in both the linux pte and
in the sptes. The PT lock provided the needed serialization. This
ensured the third thread and the guest VM would always write on the
same physical page even if the first thread runs a flood of
remap_file_pages on that same page moving it around the pagecache. So
it seems I found an unfixable SMP race in expecting to invalidate from a
sleeping context.

Perhaps you want to change the PT lock to a mutex instead of a
spinlock; that may be your only chance to sleep while maintaining 100%
memory coherency with threads.

2008-01-29 22:39:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> hmm, "there" where? When I said it was taken in readonly mode I meant
> for the quoted code (it would be at the top if it wasn't cut), so I
> quote below again:
>
> > > + mmu_notifier(invalidate_range, mm, address,
> > > + address + PAGE_SIZE - 1, 0);
> > > page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > > if (likely(pte_same(*page_table, orig_pte))) {
> > > if (old_page) {
>
> The "there" for me was do_wp_page.

Maybe we better focus on one call at a time?

> Even for the code you quoted in freemap.c, the has_write_lock is set
> to 1 _only_ for the very first time you call sys_remap_file_pages on a
> VMA. Only the transition of the VMA between linear to nonlinear
> requires the mmap in write mode. So you can be sure all freemap code
> 99% of the time is populating (overwriting) already present ptes with
> only the mmap_sem in readonly mode like do_wp_page. It would be
> unnecessary to populate the nonlinear range with the mmap in write
> mode. Only the "vma" mangling requires the mmap_sem in write mode, the
> pte modifications only requires the PT_lock + mmap_sem in read mode.
>
> Effectively the first invocation of populate_range runs with the
> mmap_sem in write mode, I wonder why, there seem to be no good reason
> for that. I guess it's a bit that should be optimized, by calling
> downgrade_write before calling populate_range even for the first time
> the vma switches from linear to nonlinear (after the vma has been
> fully updated to the new status). But for sure all later invocations
> runs populate_range with the semaphore readonly like the rest of the
> VM does when instantiating ptes in the page faults.

If it does not run in write mode then concurrent faults are permissible
while we remap pages. Weird. Maybe we better handle this like individual
page operations? Put the invalidate_page back into zap_pte. But then there
would be no callback w/o lock as required by Robin. Doing the
invalidate_range after populate allows access to memory for which ptes
were zapped and the refcount was released.

> All pins are gone by the time invalidate_page/range returns. But there
> is no critical section between invalidate_page and the _later_
> ptep_clear_flush. So get_user_pages is free to run and take the PT
> lock before the ptep_clear_flush, find the linux pte still
> instantiated, and to create a new spte, before ptep_clear_flush runs.

Hmmm... Right. Did not consider get_user_pages. A write to the page that
is not marked dirty would typically require a fault that will serialize.

2008-01-29 22:56:17

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> But now I think there may be an issue with a third thread that may
> show unsafe the removal of invalidate_page from ptep_clear_flush.
>
> A third thread writing to a page through the linux-pte and the guest
> VM writing to the same page through the sptes, will be writing on the
> same physical page concurrently and using an userspace spinlock w/o
> ever entering the kernel. With your patch that invalidate_range after
> dropping the PT lock, the third thread may start writing on the new
> page, when the guest is still writing to the old page through the
> sptes. While this couldn't happen with my patch.

A user space spinlock plays into this??? That is irrelevant to the kernel.
And we are discussing "your" placement of the invalidate_range, not mine.

This is the scenario that I described before. You just need two threads.
One thread is in do_wp_page and the other is writing through the spte.
We are in do_wp_page, meaning the page is not writable. The writer will
have to take a fault, which will properly serialize access. It's a bug if the
spte would allow a write.

2008-01-29 23:44:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, Jan 29, 2008 at 02:55:56PM -0800, Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
>
> > But now I think there may be an issue with a third thread that may
> > show unsafe the removal of invalidate_page from ptep_clear_flush.
> >
> > A third thread writing to a page through the linux-pte and the guest
> > VM writing to the same page through the sptes, will be writing on the
> > same physical page concurrently and using an userspace spinlock w/o
> > ever entering the kernel. With your patch that invalidate_range after
> > dropping the PT lock, the third thread may start writing on the new
> > page, when the guest is still writing to the old page through the
> > sptes. While this couldn't happen with my patch.
>
> A user space spinlock plays into this??? That is irrelevant to the kernel.
> And we are discussing "your" placement of the invalidate_range not mine.

With "my" code, invalidate_range wasn't placed there at all, my
modification to ptep_clear_flush already covered it in a automatic
way, grep from the word fremap in my latest patch you won't find it,
like you won't find any change to do_wp_page. Not sure why you keep
thinking I added those invalidate_range when infact you did.

The user space spinlock also plays into declaring rdtscp unworkable for
providing a monotonic vgettimeofday w/o kernel locking.

My patch, by calling invalidate_page inside ptep_clear_flush, guaranteed
that the thread writing through sptes and the thread writing
through linux ptes couldn't possibly simultaneously write to two
different physical pages.

Your patch allows the thread writing through the linux pte to write to a
newly populated page while the old thread writing through sptes still
writes to the old page. Is that safe? I don't know for sure. The fact that
the physical page backing the virtual address could change back and
forth perhaps invalidates the theory that somebody could possibly do
some useful locking out of it, relying on all threads seeing the same
physical page at the same time.

Anyway as long as invalidate_page/range happens after ptep_clear_flush
things are mostly ok.

> This is the scenario that I described before. You just need two threads.
> One thread is in do_wp_page and the other is writing through the spte.
> We are in do_wp_page. Meaning the page is not writable. The writer will

Actually above I was describing remap_file_pages not do_wp_page.

> have to take fault which will properly serialize access. It a bug if the
> spte would allow write.

In that scenario, because write is forbidden (unlike remap_file_pages),
things should be ok like you said. The spte reader will eventually see
the updates happening in the new page, as long as the spte invalidate
happens after ptep_clear_flush (i.e. with my incremental fix applied
to your code, or with my latest patch).

2008-01-30 00:01:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, Jan 29, 2008 at 02:39:00PM -0800, Christoph Lameter wrote:
> If it does not run in write mode then concurrent faults are permissible
> while we remap pages. Weird. Maybe we better handle this like individual
> page operations? Put the invalidate_page back into zap_pte. But then there
> would be no callback w/o lock as required by Robin. Doing the

Robin's requirements and the need to schedule are indeed the source of
the complications.

I posted all the KVM patches using mmu notifiers; today I reposted the
ones to work with your V2 (which crashes my host, unlike my last
simpler mmu notifier patch, but I also changed a few other variables
besides your mmu notifier changes, so I can't yet be sure it's a bug
in your V2, and the SMP regressions I fixed so far sure can't explain
the crashes because my KVM setup could never run into do_wp_page nor
remap_file_pages, so it's something else I need to find ASAP).

Robin, if you don't mind, could you please post or upload somewhere
your GPLv2 code that registers itself in Christoph's V2 notifiers? Or
is it top secret? I wouldn't mind having a look so I can better
understand the exact reason you're sleeping besides attempting
GFP_KERNEL allocations. Thanks!

> invalidate_range after populate allows access to memory for which ptes
> were zapped and the refcount was released.

The last refcount is released by the invalidate_range itself.

> > All pins are gone by the time invalidate_page/range returns. But there
> > is no critical section between invalidate_page and the _later_
> > ptep_clear_flush. So get_user_pages is free to run and take the PT
> > lock before the ptep_clear_flush, find the linux pte still
> > instantiated, and to create a new spte, before ptep_clear_flush runs.
>
> Hmmm... Right. Did not consider get_user_pages. A write to the page that
> is not marked dirty would typically require a fault that will serialize.

The pte is already marked dirty (and this is the case only for
get_user_pages; regular linux writes don't fault unless the pte is
explicitly write-protected, which is mandatory on a few archs, but not x86).

2008-01-30 00:06:23

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 01:00:39AM +0100, Andrea Arcangeli wrote:
> get_user_pages, regular linux writes don't fault unless it's
> explicitly writeprotect, which is mandatory in a few archs, x86 not).

Actually get_user_pages doesn't fault either, but it calls into
set_page_dirty; however get_user_pages (unlike a userland write) at
least requires the mmap_sem in read mode and the PT lock as serialization.
Userland writes don't; they just go ahead and mark the pte in hardware
w/o faults. Anyway anonymous memory these days is always mapped with the
dirty bit set regardless, even for read faults, after Nick finally
rightfully cleaned up the zero-page trick.

2008-01-30 00:21:05

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

> > invalidate_range after populate allows access to memory for which ptes
> > were zapped and the refcount was released.
>
> The last refcount is released by the invalidate_range itself.

That is true for your implementation and to address Robin's issues. Jack:
Is that true for the GRU?

2008-01-30 00:22:56

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 30, 2008 at 01:00:39AM +0100, Andrea Arcangeli wrote:
> > get_user_pages, regular linux writes don't fault unless it's
> > explicitly writeprotect, which is mandatory in a few archs, x86 not).
>
> actually get_user_pages doesn't fault either but it calls into
> set_page_dirty, however get_user_pages (unlike a userland-write) at
> least requires mmap_sem in read mode and the PT lock as serialization,
> userland writes don't, they just go ahead and mark the pte in hardware
> w/o faults. Anyway anonymous memory these days always mapped with
> dirty bit set regardless, even for read-faults, after Nick finally
> rightfully cleaned up the zero-page trick.

That is only partially true. ptes are created write-protected in order to track
dirty state these days. The first write will lead to a fault that switches
the pte to writable. When the page undergoes writeback the page again
becomes write-protected. Hence our need to deal effectively with
page_mkclean.
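
For context on the dirty-tracking cycle described here, the per-pte core of
what page_mkclean() does looks roughly like this (condensed from 2.6.24-era
mm/rmap.c; the rmap walk, error paths and return value handling are omitted):

	pte = page_check_address(page, mm, address, &ptl);
	if (pte_dirty(*pte) || pte_write(*pte)) {
		pte_t entry;

		flush_cache_page(vma, address, pte_pfn(*pte));
		/* clear and flush, then write-protect and clean the pte,
		 * so the next write faults again and can be tracked */
		entry = ptep_clear_flush(vma, address, pte);
		entry = pte_wrprotect(entry);
		entry = pte_mkclean(entry);
		set_pte_at(mm, address, pte, entry);
	}
	pte_unmap_unlock(pte, ptl);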

2008-01-30 00:30:22

by Jack Steiner

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
>
> > > invalidate_range after populate allows access to memory for which ptes
> > > were zapped and the refcount was released.
> >
> > The last refcount is released by the invalidate_range itself.
>
> That is true for your implementation and to address Robin's issues. Jack:
> Is that true for the GRU?

I'm not sure I understand the question. The GRU never (currently) takes
a reference on a page. It has no mechanism for tracking pages that
were exported to the external TLBs.

--- jack

2008-01-30 00:34:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

> > A user space spinlock plays into this??? That is irrelevant to the kernel.
> > And we are discussing "your" placement of the invalidate_range not mine.
>
> With "my" code, invalidate_range wasn't placed there at all, my
> modification to ptep_clear_flush already covered it in a automatic
> way, grep from the word fremap in my latest patch you won't find it,
> like you won't find any change to do_wp_page. Not sure why you keep
> thinking I added those invalidate_range when infact you did.

Well, you moved the code at a minimum. Hmmm... according to
http://marc.info/?l=linux-kernel&m=120114755620891&w=2 it was Robin.

> The user space spinlock plays also in declaring rdtscp unworkable to
> provide a monotone vgettimeofday w/o kernel locking.

No idea what you are talking about.

> My patch by calling invalidate_page inside ptep_clear_flush guaranteed
> that both the thread writing through sptes and the thread writing
> through linux ptes, couldn't possibly simultaneously write to two
> different physical pages.

But then the ptep_clear_flush will issue invalidate_page() for ranges
that were already covered by invalidate_range(). There are multiple calls
to clear the same spte.
>
> Your patch allows the thread writing through linux-pte to write to a
> new populated page while the old thread writing through sptes still
> writes to the old page. Is that safe? I don't know for sure. The fact
> the physical page backing the virtual address could change back and
> forth, perhaps invalidates the theory that somebody could possibly do
> some useful locking out of it relaying on all threads seeing the same
> physical page at the same time.

This is referring to the remap issue, not do_wp_page, right?

> Actually above I was describing remap_file_pages not do_wp_page.

Ok.

The serialization of remap_file_pages does not seem that critical since we
only take a read lock on mmap_sem here. There may already be concurrent
access to pages from other processors while the ptes are remapped. So
there is already some overlap.

We could take the mmap_sem there for write and keep it for write in the case
that we have an mmu notifier in the mm.

2008-01-30 00:35:54

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, 29 Jan 2008, Jack Steiner wrote:

> > That is true for your implementation and to address Robin's issues. Jack:
> > Is that true for the GRU?
>
> I'm not sure I understand the question. The GRU never (currently) takes
> a reference on a page. It has no mechanism for tracking pages that
> were exported to the external TLBs.

That's what I was looking for. Thanks. KVM takes a refcount and so does
XPMEM.

2008-01-30 00:59:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, Jan 29, 2008 at 04:22:46PM -0800, Christoph Lameter wrote:
> That is only partially true. pte are created wronly in order to track
> dirty state these days. The first write will lead to a fault that switches
> the pte to writable. When the page undergoes writeback the page again
> becomes write protected. Thus our need to effectively deal with
> page_mkclean.

Well I was talking about anonymous memory.

2008-01-30 08:27:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges


On Wed, 2008-01-30 at 01:59 +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 04:22:46PM -0800, Christoph Lameter wrote:
> > That is only partially true. pte are created wronly in order to track
> > dirty state these days. The first write will lead to a fault that switches
> > the pte to writable. When the page undergoes writeback the page again
> > becomes write protected. Thus our need to effectively deal with
> > page_mkclean.
>
> Well I was talking about anonymous memory.

Just to be absolutely clear on this (I lost track of what exactly we are
talking about here), nonlinear mappings do not do the dirty accounting,
and are not allowed on a backing store that would require dirty
accounting.


2008-01-30 13:37:57

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Tue, Jan 29, 2008 at 06:28:05PM -0600, Jack Steiner wrote:
> On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
> > On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
> >
> > > > invalidate_range after populate allows access to memory for which ptes
> > > > were zapped and the refcount was released.
> > >
> > > The last refcount is released by the invalidate_range itself.
> >
> > That is true for your implementation and to address Robin's issues. Jack:
> > Is that true for the GRU?
>
> I'm not sure I understand the question. The GRU never (currently) takes
> a reference on a page. It has no mechanism for tracking pages that
> were exported to the external TLBs.

If you don't have a pin, then things like invalidate_range in
remap_file_pages can't be safe, as writes through the external TLBs can
keep going to pages on the freelist. For you to be safe w/o a
page-pin, you need to go back in the direction of invalidate_page
inside ptep_clear_flush (or anyway before
page_cache_release/__free_page/put_page...). You're generally not safe
with any invalidate_range that may run after the page pointed to by the
pte has been freed (or can be freed by the VM at any time because it is
unpinned cache).

2008-01-30 14:43:33

by Jack Steiner

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 02:37:20PM +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 06:28:05PM -0600, Jack Steiner wrote:
> > On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
> > > On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
> > >
> > > > > invalidate_range after populate allows access to memory for which ptes
> > > > > were zapped and the refcount was released.
> > > >
> > > > The last refcount is released by the invalidate_range itself.
> > >
> > > That is true for your implementation and to address Robin's issues. Jack:
> > > Is that true for the GRU?
> >
> > I'm not sure I understand the question. The GRU never (currently) takes
> > a reference on a page. It has no mechanism for tracking pages that
> > were exported to the external TLBs.
>
> If you don't have a pin, then things like invalidate_range in
> remap_file_pages can't be safe as writes through the external TLBs can
> keep going on pages in the freelist. For you to be safe w/o a
> page-pin, you need to return in the direction of invalidate_page
> inside ptep_clear_flush (or anyway before
> page_cache_release/__free_page/put_page...). You're generally not safe
> with any invalidate_range that may run after the page pointed by the
> pte has been freed (or can be freed by the VM anytime because of being
> unpinned cache).

Yuck....

I see what you mean. I need to review the mail to see why this changed,
but in the original discussions with Christoph, the invalidate_range
callouts were supposed to be made BEFORE the pages were put on the freelist.


--- jack

2008-01-30 16:11:34

by Robin Holt

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

> Robin, if you don't mind, could you please post or upload somewhere
> your GPLv2 code that registers itself in Christoph's V2 notifiers? Or
> is it top secret? I wouldn't mind to have a look so I can better
> understand what's the exact reason you're sleeping besides attempting
> GFP_KERNEL allocations. Thanks!

Dean is still actively working on updating the xpmem patch posted
here a few months ago, reworked for the mmu_notifiers. I am sure
we can give you an early look, but it is in a really rough state.

http://marc.info/?l=linux-mm&w=2&r=1&s=xpmem&q=t

The need to sleep comes from the fact that these PFNs are sent to other
hosts on the same NUMA fabric, which have direct access to the pages;
they are then placed into remote processes' page tables and filled into
their TLBs. Our only means of communicating the recall is async.

I think I need to straighten this discussion out in my head a little bit.
Am I correct in assuming Andrea's original patch set did not have any SMP
race conditions for KVM? If so, then we need to start looking at how to
implement Christoph's and my changes in a safe fashion. Andrea, I agree
completely that our introduction of the range callouts has introduced
SMP races.

The three issues we need to simultaneously solve are revoking the remote
page table/tlb information while still in a sleepable context and not
having the remote faulters become out of sync with the granting process.
Currently, I don't see a way to do that cleanly with a single callout.

Could we consider doing a range-based recall-and-lock callout before
clearing the process's page tables/TLBs, then using the _page or _range
callouts from Andrea's patch to clear the mappings, and finally making a
range-based unlock callout? The mmu_notifier user would usually use ops
for either the recall+lock/unlock family of callouts or the _page/_range
family of callouts.
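
A sketch of what such a split ops table could look like, purely to make the
proposal concrete. Every name below is illustrative of the recall+lock /
unlock idea; none of it is in Christoph's posted series.

struct mmu_notifier_ops {
	/* sleepable: recall remote copies of [start, end) and block
	 * remote faulters from re-establishing them */
	void (*invalidate_range_begin)(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end);

	/* non-sleeping: called as each pte goes away, i.e. the
	 * _page/_range callouts from Andrea's patch */
	void (*invalidate_page)(struct mmu_notifier *mn,
				struct mm_struct *mm,
				unsigned long address);

	/* sleepable: let remote faulters make progress again */
	void (*invalidate_range_end)(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start,
				     unsigned long end);
};

An XPMEM-style user would implement the begin/end pair; a KVM- or GRU-style
user could implement only invalidate_page.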

Thanks,
Robin

2008-01-30 17:05:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 10:11:24AM -0600, Robin Holt wrote:
> > Robin, if you don't mind, could you please post or upload somewhere
> > your GPLv2 code that registers itself in Christoph's V2 notifiers? Or
> > is it top secret? I wouldn't mind to have a look so I can better
> > understand what's the exact reason you're sleeping besides attempting
> > GFP_KERNEL allocations. Thanks!
>
> Dean is still actively working on updating the xpmem patch posted
> here a few months ago reworked for the mmu_notifiers. I am sure
> we can give you a early look, but it is in a really rough state.
>
> http://marc.info/?l=linux-mm&w=2&r=1&s=xpmem&q=t
>
> The need to sleep comes from the fact that these PFNs are sent to other
> hosts on the same NUMA fabric which have direct access to the pages
> and then placed into remote process's page tables and then filled into
> their TLBs. Our only means of communicating the recall is async.
>
> I think I need to straighten this discussion out in my head a little bit.
> Am I correct in assuming Andrea's original patch set did not have any SMP
> race conditions for KVM? If so, then we need to start looking at how to

Yes, my last patch was SMP safe, stable and feature complete for KVM. I
tested it for a week on my SMP workstation with a real desktop load and
everything loaded, with a 3G non-Linux guest running on 2G of ram.

Now for whatever reason I adapted the KVM side to Christoph's V2/V3
and it hangs the moment it hits swap. However in the meantime I
changed test hardware, upgraded the host to 2.6.24-hg, and upgraded the kvm
kernel and userland. All patches applied cleanly (with a minor nit in
a .h include in V2 on top of current git). Swapping of regular tasks
on the test system is 100% solid or I wouldn't even be wasting time
mentioning this. By code inspection I didn't expect a stability
regression or I wouldn't have changed all variables at the same time
(taking the opportunity to move everything to bleeding edge while
moving to V2 turned out to be a bad idea). I already audited the mmu
notifiers a few times; in fact I already went back to calling
invalidate_page and age_page inside ptep_clear_flush/young in case the
page-pin wasn't enough to prevent the page from changing under the sptes,
as I thought yesterday.

Christoph's V3 notably still misses the needed range flushes in mremap
for example, but that's not my problem. (Jack instead will certainly see
kernel crashes due to the missing invalidate_page after ptep_clear_flush
in mremap; such an invalidate_page wasn't missing from my last patch.)

I'm now going to run the same binaries that still are stable on my
workstation on the test system too, to rule out timings and hardware
differences.

> implement Christoph's and my changes in a safe fashion. Andrea, I agree
> complete that our introduction of the range callouts have introduced
> SMP races.

I think for KVM basic swapping both V2 and V3 should be safe. V2 had
race conditions that would later break KSM, yes; I fixed those, V3
should already be ok, and I'm not testing KSM. This is all thanks to
the pin on the page that KVM takes with get_user_pages for every page
mapped in any spte.

> The three issues we need to simultaneously solve is revoking the remote
> page table/tlb information while still in a sleepable context and not
> having the remote faulters become out of sync with the granting process.
> Currently, I don't see a way to do that cleanly with a single callout.

Agreed.

> Could we consider doing a range-based recall and lock callout before
> clearing the process's page tables/TLBs, then using the _page or _range
> callouts from Andrea's patch to clear the mappings, and finally making a
> range-based unlock callout? The mmu_notifier user would usually use ops
> for either the recall+lock/unlock family of callouts or the _page/_range
> family of callouts.

invalidate_page/age_page can go back inside ptep_clear_flush/young, and
Jack will need that too. In fact Jack will also need an invalidate_page
inside ptep_get_and_clear. And the range callout will always be done
in a sleeping context and it'll rely on the page-pin to be safe (when
details->i_mmap_lock != NULL, invalidate_range shouldn't be called
inside zap_page_range but just before returning from
unmap_mapping_range_vma, before the cond_resched). This will make
everything a bit simpler and less prone to breakage IMHO, plus it'll
have a chance to work for Jack w/o the page-pin and without additional
cluttering of mm/*.c.
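
Roughly, in C, the suggested placement would look like the sketch below.
This is a deliberately simplified stand-in for the truncate path, not the
real unmap_mapping_range_vma code; the mmu_notifier() invocation just
follows the macro style used by the patches:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Illustrative only: zap under i_mmap_lock with no callout inside, then
 * issue the sleepable range callout once the spinlock has been dropped.
 */
static void truncate_zap_sketch(struct vm_area_struct *vma,
				struct address_space *mapping,
				unsigned long start, unsigned long end,
				struct zap_details *details)
{
	spin_lock(&mapping->i_mmap_lock);
	zap_page_range(vma, start, end - start, details);
	spin_unlock(&mapping->i_mmap_lock);

	/* Sleeping is allowed here, so the range callout can go last. */
	mmu_notifier(invalidate_range, vma->vm_mm, start, end, 0);
}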

2008-01-30 17:30:26

by Robin Holt

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 06:04:52PM +0100, Andrea Arcangeli wrote:
> On Wed, Jan 30, 2008 at 10:11:24AM -0600, Robin Holt wrote:
...
> > The three issues we need to simultaneously solve is revoking the remote
> > page table/tlb information while still in a sleepable context and not
> > having the remote faulters become out of sync with the granting process.
...
> > Could we consider doing a range-based recall and lock callout before
> > clearing the process's page tables/TLBs, then using the _page or _range
> > callouts from Andrea's patch to clear the mappings, and finally making a
> > range-based unlock callout? The mmu_notifier user would usually use ops
> > for either the recall+lock/unlock family of callouts or the _page/_range
> > family of callouts.
>
> invalidate_page/age_page can go back inside ptep_clear_flush/young, and
> Jack will need that too. In fact Jack will also need an invalidate_page
> inside ptep_get_and_clear. And the range callout will always be done
> in a sleeping context and it'll rely on the page-pin to be safe (when
> details->i_mmap_lock != NULL, invalidate_range shouldn't be called
> inside zap_page_range but just before returning from
> unmap_mapping_range_vma, before the cond_resched). This will make
> everything a bit simpler and less prone to breakage IMHO, plus it'll
> have a chance to work for Jack w/o the page-pin and without additional
> cluttering of mm/*.c.

I don't think I saw the answer to my original question. I assume your
original patch, extended in a way similar to what Christoph has done,
can be made to work to cover both the KVM and GRU (Jack's) cases.

XPMEM, however, does not look to be solvable due to the three simultaneous
issues above. To address that, I think I am coming to the conclusion
that we need an accompanying but separate pair of callouts. The first
will ensure the remote page tables and TLBs are cleared and all page
information is returned to the process that is granting access to
its address space. That will include an implicit block on the address
range so no further faults will be satisfied by the remote accessor
(forgot the KVM name for this, sorry). Any faults will be held off
and only the process's page tables/TLBs are in play. Once the normal
processing of the kernel is complete, an unlock callout would be made
for the range and then faulting may occur on behalf of the process again.
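
In rough C, the pair of callouts being proposed might look like the sketch
below; the struct and member names are only illustrative, nothing has been
agreed:

struct mmu_notifier;
struct mm_struct;

/* Illustrative sketch of the accompanying pair of callouts. */
struct xpmem_range_callouts_sketch {
	/*
	 * Recall all remote copies of the range, clear the remote page
	 * tables/TLBs, and block further remote faults on the range.
	 */
	void (*invalidate_and_lock_range)(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end);
	/* Allow remote faulting on the range again. */
	void (*unlock_range)(struct mmu_notifier *mn,
			     struct mm_struct *mm,
			     unsigned long start,
			     unsigned long end);
};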

Currently, this is the only direct solution that I can see as a
possibility. My question is twofold: does this seem like a reasonable
means to solve the three simultaneous issues above, and if so, does it
seem like the most reasonable means?

Thanks,
Robin

2008-01-30 18:25:40

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 11:30:09AM -0600, Robin Holt wrote:
> I don't think I saw the answer to my original question. I assume your
> original patch, extended in a way similar to what Christoph has done,
> can be made to work to cover both the KVM and GRU (Jack's) case.

Yes, I think so.

> XPMEM, however, does not look to be solvable due to the three simultaneous
> issues above. To address that, I think I am coming to the conclusion
> that we need an accompanying but separate pair of callouts. The first

The mmu_rmap_notifiers are already one separate pair of callouts and
we can add more of them of course.

> will ensure the remote page tables and TLBs are cleared and all page
> information is returned to the process that is granting access to
> its address space. That will include an implicit block on the address
> range so no further faults will be satisfied by the remote accessor
> (forgot the KVM name for this, sorry). Any faults will be held off
> and only the process's page tables/TLBs are in play. Once the normal

Good, this "block" is how you close the race condition, and you need
the second callout to "unblock" (this is why it could hardly work well
before with a single invalidate_range).

> processing of the kernel is complete, an unlock callout would be made
> for the range and then faulting may occur on behalf of the process again.

This sounds good.

> Currently, this is the only direct solution that I can see as a
> possibility. My question is twofold. Does this seem like a reasonable
> means to solve the three simultaneous issues above and if so, does it
> seem like the most reasonable means?

Yes.

KVM can deal with both invalidate_page (atomic) and invalidate_range (sleepy)

GRU can only deal with invalidate_page (atomic)

XPMEM requires invalidate_range (sleepy) +
before_invalidate_range (sleepy). invalidate_all should also be called
before_release (both sleepy).

It sounds like we need full overlap of the information provided by
invalidate_page and invalidate_range to fit all three models (the
opposite of the zero-overlap objective that current V3 is taking). And
swap will be handled only by invalidate_page, either through the linux
rmap or the external rmap (the latter can sleep, so it's ok for you;
the former can't). GRU can safely use either the linux rmap notifier
or the external rmap notifier equally well, because when try_to_unmap
is called the page is locked and obviously pinned by the VM itself.

2008-01-30 19:35:22

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, 30 Jan 2008, Robin Holt wrote:

> I think I need to straighten this discussion out in my head a little bit.
> Am I correct in assuming Andrea's original patch set did not have any SMP
> race conditions for KVM? If so, then we need to start looking at how to
> implement Christoph's and my changes in a safe fashion. Andrea, I agree
> completely that our introduction of the range callouts has introduced
> SMP races.

The original patch pulled the clearing of the sptes into ptep_clear_flush().
So invalidate_page was called for each page regardless of whether we had
been doing an invalidate range before or not. It seems that the
invalidate_range() was just there as an optimization.

> The three issues we need to simultaneously solve is revoking the remote
> page table/tlb information while still in a sleepable context and not
> having the remote faulters become out of sync with the granting process.
> Currently, I don't see a way to do that cleanly with a single callout.

You could use the invalidate_page callouts to set a flag that no
additional rmap entries may be added until the invalidate_range has
occurred? We could add back all the original invalidate_page() callouts
and pass a flag that specifies that an invalidate range will follow. The
notifier can then decide what to do with that information. If it's okay
to defer, then do nothing and wait for the range_invalidate. XPmem could
stop allowing external references to be established until the
invalidate_range was successful.
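
As a sketch, that decision could look like the snippet below; the extra
"range_follows" argument and the driver state are hypothetical, not part
of any posted patch:

struct mmu_notifier;
struct mm_struct;

static int xpmem_refs_blocked;	/* illustrative driver state */

/*
 * Hypothetical invalidate_page callout with a flag telling the driver
 * that an invalidate_range for the same area will follow.
 */
static void xpmem_invalidate_page_sketch(struct mmu_notifier *mn,
					 struct mm_struct *mm,
					 unsigned long address,
					 int range_follows)
{
	if (range_follows) {
		/*
		 * Just stop new external references from being established;
		 * the upcoming invalidate_range does the real teardown and
		 * clears the flag again.
		 */
		xpmem_refs_blocked = 1;
		return;
	}
	/* No range callout will follow: tear down this page's external
	 * references right away. */
}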

Jack had a concern that multiple callouts for the same pte could cause
problems.

2008-01-30 19:41:44

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, 30 Jan 2008, Jack Steiner wrote:

> I see what you mean. I need to review the mail to see why this changed,
> but in the original discussions with Christoph, the invalidate_range
> callouts were supposed to be made BEFORE the pages were put on the freelist.

Seems that we cannot rely on the invalidate_ranges for correctness at all?
We need to have invalidate_page() always. invalidate_range() is only an
optimization.

2008-01-30 19:50:37

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

> XPMEM requires invalidate_range (sleepy) +
> before_invalidate_range (sleepy). invalidate_all should also be called
> before_release (both sleepy).
>
> It sounds like we need full overlap of the information provided by
> invalidate_page and invalidate_range to fit all three models (the
> opposite of the zero-overlap objective that current V3 is taking). And
> swap will be handled only by invalidate_page, either through the linux
> rmap or the external rmap (the latter can sleep, so it's ok for you;
> the former can't). GRU can safely use either the linux rmap notifier
> or the external rmap notifier equally well, because when try_to_unmap
> is called the page is locked and obviously pinned by the VM itself.

So put the invalidate_page() callbacks in everywhere.

Then we have

invalidate_range_start(mm)

and

invalidate_range_finish(mm, start, end)

in addition to the invalidate rmap_notifier?

---
include/linux/mmu_notifier.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-30 11:49:02.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h 2008-01-30 11:49:57.000000000 -0800
@@ -69,10 +69,13 @@ struct mmu_notifier_ops {
/*
* lock indicates that the function is called under spinlock.
*/
- void (*invalidate_range)(struct mmu_notifier *mn,
+ void (*invalidate_range_begin)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end,
int lock);
+
+ void (*invalidate_range_end)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
};

struct mmu_rmap_notifier_ops;

2008-01-30 20:29:31

by Jack Steiner

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 11:41:29AM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Jack Steiner wrote:
>
> > I see what you mean. I need to review the mail to see why this changed,
> > but in the original discussions with Christoph, the invalidate_range
> > callouts were supposed to be made BEFORE the pages were put on the freelist.
>
> Seems that we cannot rely on the invalidate_ranges for correctness at all?
> We need to have invalidate_page() always. invalidate_range() is only an
> optimization.
>

I don't understand your point "an optimization". How would invalidate_range
as currently defined be correctly used?

It _looks_ like it would work only if xpmem/gru/etc takes a refcnt on
the page & drops it when invalidate_range is called. That may work (not sure)
for xpmem but not for the GRU.

--- jack

2008-01-30 20:55:47

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, 30 Jan 2008, Jack Steiner wrote:

> > Seems that we cannot rely on the invalidate_ranges for correctness at all?
> > We need to have invalidate_page() always. invalidate_range() is only an
> > optimization.
> >
>
> I don't understand your point "an optimization". How would invalidate_range
> as currently defined be correctly used?

We are changing definitions. The original patch by Andrea calls
invalidate_page for each pte that is cleared. So strictly speaking you
would not need an invalidate_range.

> It _looks_ like it would work only if xpmem/gru/etc takes a refcnt on
> the page & drops it when invalidate_range is called. That may work (not sure)
> for xpmem but not for the GRU.

The refcount is not necessary if we adopt Andrea's approach of a callback
on the clearing of each pte. At that point the page is still guaranteed to
exist. If we do the range_invalidate later (as in V3) then the page may
have been released (see sys_remap_file_pages() for example) before we zap
the GRU ptes. So there will be a time when the GRU may write to a page
that has been freed and used for another purpose.

Taking a refcount on the page defers the free until the range_invalidate
runs.
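
In kernel C, that deferral amounts to something like the sketch below
(illustrative driver-side code, not from the posted patches; the helper
names are made up):

#include <linux/mm.h>

/* Pin the page when the external pte is established. */
static void external_map_page_sketch(struct page *page)
{
	get_page(page);		/* the VM may unmap it, but not reuse it */
	/* ... install the external pte / remote TLB entry ... */
}

/* Called from the driver's range_invalidate handler. */
static void external_unmap_page_sketch(struct page *page)
{
	/* ... remove the external pte / remote TLB entry ... */
	put_page(page);		/* only now may the page really be freed */
}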

I would prefer a solution that does not require taking refcounts (pins)
for establishing an external pte and for release (like what the GRU does).

If we could effectively determine that there are no external ptes in a
range then the invalidate_page() call may return immediately. Maybe it is
then feasible to do these gazillions of invalidate_page() calls when a
process terminates or a remap is performed.

2008-01-30 22:18:21

by Robin Holt

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 11:50:26AM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
>
> > XPMEM requires invalidate_range (sleepy) +
> > before_invalidate_range (sleepy). invalidate_all should also be called
> > before_release (both sleepy).
> >
> > It sounds like we need full overlap of the information provided by
> > invalidate_page and invalidate_range to fit all three models (the
> > opposite of the zero-overlap objective that current V3 is taking). And
> > swap will be handled only by invalidate_page, either through the linux
> > rmap or the external rmap (the latter can sleep, so it's ok for you;
> > the former can't). GRU can safely use either the linux rmap notifier
> > or the external rmap notifier equally well, because when try_to_unmap
> > is called the page is locked and obviously pinned by the VM itself.
>
> So put the invalidate_page() callbacks in everywhere.

The way I am envisioning it, we essentially drop back to Andrea's original
patch. We then introduce an invalidate_range_begin (I was really thinking
of it as invalidate_and_lock_range()) and an invalidate_range_end (again
I was thinking of unlock_range).

Thanks,
Robin

2008-01-30 23:52:28

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 11:50:26AM -0800, Christoph Lameter wrote:
> Then we have
>
> invalidate_range_start(mm)
>
> and
>
> invalidate_range_finish(mm, start, end)
>
> in addition to the invalidate rmap_notifier?
>
> ---
> include/linux/mmu_notifier.h | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/include/linux/mmu_notifier.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-30 11:49:02.000000000 -0800
> +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-30 11:49:57.000000000 -0800
> @@ -69,10 +69,13 @@ struct mmu_notifier_ops {
> /*
> * lock indicates that the function is called under spinlock.
> */
> - void (*invalidate_range)(struct mmu_notifier *mn,
> + void (*invalidate_range_begin)(struct mmu_notifier *mn,
> struct mm_struct *mm,
> - unsigned long start, unsigned long end,
> int lock);
> +
> + void (*invalidate_range_end)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end);
> };

start/finish/begin/end/before/after? ;)

I'd drop the 'int lock': you should skip the before/after if
i_mmap_lock isn't null and offload it to the caller before it takes the
lock. At least for the "after" call that looks like a few-line change;
I haven't figured out the "before" yet.

Given the amount of design changes going on to cover both XPMEM and GRU,
can we split out the minimal invalidate_page that provides obviously safe
and feature complete mmu notifier code for KVM, and merge that first
patch? It will cover KVM 100% and GRU 90%, and then we can add
invalidate_range_before/after in a separate patch and close the remaining
10% for GRU, covering ptep_get_and_clear or whatever other ptep_*
helpers. The mmu notifiers are designed to be extendible in a backwards
compatible way. I think invalidate_page inside ptep_clear_flush is the
first fundamental block of the mmu notifiers. Then once that fundamental
piece is in and obviously safe and feature complete for KVM, the rest can
be added very easily with incremental patches as far as I can tell. That
would be my preferred route ;)

2008-01-31 00:01:43

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> > - void (*invalidate_range)(struct mmu_notifier *mn,
> > + void (*invalidate_range_begin)(struct mmu_notifier *mn,
> > struct mm_struct *mm,
> > - unsigned long start, unsigned long end,
> > int lock);
> > +
> > + void (*invalidate_range_end)(struct mmu_notifier *mn,
> > + struct mm_struct *mm,
> > + unsigned long start, unsigned long end);
> > };
>
> start/finish/begin/end/before/after? ;)

Well, let's pick one and then stick to it.

> I'd drop the 'int lock', you should skip the before/after if
> i_mmap_lock isn't null and offload it to the caller before taking the
> lock. At least for the "after" call that looks a few liner change,
> didn't figure out the "before" yet.

How would we offload that? Before the scan of the rmaps we do not have the
mm_struct. So we'd need another notifier_rmap_callback.

> Given the amount of changes that are going on in design terms to cover
> both XPMEM and GRU, can we split the minimal invalidate_page that
> provides an obviously safe and feature complete mmu notifier code for
> KVM, and merge that first patch that will cover KVM 100%, it will

The obvious solution does not scale. You will have a callback for every
page and there may be a million of those if you have a 4GB process.

> made so that they are extendible in a backwards compatible way. I think
> invalidate_page inside ptep_clear_flush is the first fundamental block
> of the mmu notifiers. Then once the fundamental is in and obviously
> safe and feature complete for KVM, the rest can be added very easily
> with incremental patches as far as I can tell. That would be my
> preferred route ;)

We need to have a coherent notifier solution that works for multiple
scenarios. I think a working invalidate_range would also be required for
KVM. KVM and GRU are very similar, so they should be able to use the same
mechanisms, and we need to properly document how that mechanism is safe.
Either both take a page refcount or neither does.


2008-01-31 00:35:06

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 04:01:31PM -0800, Christoph Lameter wrote:
> How we offload that? Before the scan of the rmaps we do not have the
> mmstruct. So we'd need another notifier_rmap_callback.

My assumption is that that "int lock" exists just because
unmap_mapping_range_vma exists. If I'm right then my suggestion was to
move the invalidate_range after dropping the i_mmap_lock and not to
invoke it inside zap_page_range.

> The obvious solution does not scale. You will have a callback for every

Scale is the wrong word. The PT lock will prevent any other cpu from
thrashing on the mmu_lock, so it's a fixed cost for each pte_clear with
no scalability risk, nor any complexity issue. Certainly we could
amortize certain fixed costs over more than one pte_clear to boost
performance, and that's a good idea. Not really a short term concern;
we need to swap reliably first ;).

> page and there may be a million of those if you have a 4GB process.

That can be optimized by adding a __ptep_clear_flush and an
invalidate_pages (let's call it "pages" to better show it's a
'clustered' version of invalidate_page, and to avoid confusion with
_range_before/after, which does an entirely different thing). Also for
_range I tend to like before/after, as a way to say before the
pte_clear and after the pte_clear, but any other meaning is ok with me.

We add invalidate_page and invalidate_pages immediately.
invalidate_pages may initially never be called by the linux VM; we can
start calling it later as we replace ptep_clear_flush with
__ptep_clear_flush (or local_ptep_clear_flush).
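
A rough sketch of what the clustered callout could look like (the
signature is purely illustrative, nothing has been agreed):

struct mmu_notifier;
struct mm_struct;

/* Illustrative: one callout covering a batch of ptes cleared together. */
struct clustered_callout_sketch {
	/* One page at a time, called from ptep_clear_flush. */
	void (*invalidate_page)(struct mmu_notifier *mn,
				struct mm_struct *mm,
				unsigned long address);
	/* A batch of pages, called once after a run of __ptep_clear_flush. */
	void (*invalidate_pages)(struct mmu_notifier *mn,
				 struct mm_struct *mm,
				 unsigned long *addresses, int nr);
};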

I don't see any problem with this approach; it looks quite clean to me,
and it leaves you full room for experimenting in practice with
range_before/after while knowing those range_before/after won't
require many changes.

And for things like age_page it will never happen that
ptep_clear_flush_young is called without the mmu notifier age_page
after it, so you won't ever risk having to add an age_pages or a
__ptep_clear_flush_young.

> We need to have a coherent notifier solution that works for multiple
> scenarios. I think a working invalidate_range would also be required for
> KVM. KVM and GRU are very similar so they should be able to use the same
> mechanisms and we need to properly document how that mechanism is safe.
> Either both take a page refcount or none.

There's no reason why KVM should take any risk of corrupting memory
due to a single missing mmu notifier by not taking the refcount.
get_user_pages will take it for us, so we have to pay the atomic op
anyway. It's surely worth doing the atomic_dec inside the mmu
notifier, and not immediately like this:

get_user_pages(pages)		/* takes a reference on the page */
__free_page(pages[0])		/* dropping it right away loses the protection */

The idea is that what works for GRU works for KVM too. So we do a
single invalidate_page and a clustered invalidate_pages, we add those,
and then we make sure all places are covered so GRU will not crash the
kernel, and KVM won't risk running oom or generating _userland_
corruption.

2008-01-31 01:46:32

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 30, 2008 at 04:01:31PM -0800, Christoph Lameter wrote:
> > How we offload that? Before the scan of the rmaps we do not have the
> > mmstruct. So we'd need another notifier_rmap_callback.
>
> My assumption is that that "int lock" exists just because
> unmap_mapping_range_vma exists. If I'm right then my suggestion was to
> move the invalidate_range after dropping the i_mmap_lock and not to
> invoke it inside zap_page_range.

There is still no pointer to the mm_struct available there because pages
of a mapping may belong to multiple processes. So we need to add another
rmap method?

The same issue also occurs for unmap_hugepages().

> There's no reason why KVM should take any risk of corrupting memory
> due to a single missing mmu notifier by not taking the refcount.
> get_user_pages will take it for us, so we have to pay the atomic op
> anyway. It's surely worth doing the atomic_dec inside the mmu
> notifier, and not immediately like this:

Well the GRU uses follow_page() instead of get_user_pages. Performance is
a major issue for the GRU.


> get_user_pages(pages)
> __free_page(pages[0])
>
> The idea is that what works for GRU, works for KVM too. So we do a
> single invalidate_page and clustered invalidate_pages, we add that,
> and then we make sure all places are covered so GRU will not
> kernel-crash, and KVM won't risk to run oom or to generate _userland_
> corruption.

Hmmmm.. Could we go to a scheme where we do not have to increase the page
count? Modifications of the page struct require dirtying a cache line, and
it seems that we do not need an increased page count if we have an
invalidate_range_start() that clears all the external references
and stops the establishment of new ones, and an invalidate_range_end() that
reenables new external references.

Then we do not need the frequent invalidate_page() calls.

The typical case anyway would be that invalidate_all() is called
before anything else on exit. invalidate_all() would remove all pages
and disable the creation of new references to the memory in the mm_struct.


2008-01-31 02:08:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges


Patch to


1. Remove sync on notifier_release. Must be called when only a
single process remains.

2. Add invalidate_range_start/end. This should allow safe removal
of ranges of external ptes without having to resort to a callback
for every individual page.

This must be able to nest so the driver needs to keep a refcount of range
invalidates and wait if the refcount != 0.
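
Driver-side, that nesting could be handled roughly as in the sketch below
(illustrative only, not from any posted driver; a real implementation also
has to guard against a new range starting right after the wait):

#include <linux/wait.h>
#include <asm/atomic.h>

static atomic_t range_inval_count = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(range_inval_wq);

/* invalidate_range_start: one count per outstanding (possibly nested) range */
static void driver_range_start_sketch(void)
{
	atomic_inc(&range_inval_count);
	/* ... tear down the external references in the range ... */
}

/* invalidate_range_end: drop the count and wake up blocked faulters */
static void driver_range_end_sketch(void)
{
	if (atomic_dec_and_test(&range_inval_count))
		wake_up(&range_inval_wq);
}

/* external fault path: wait until no range invalidate is in flight */
static void driver_fault_wait_sketch(void)
{
	wait_event(range_inval_wq, atomic_read(&range_inval_count) == 0);
	/* ... now establish the new external reference ... */
}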


---
include/linux/mmu_notifier.h | 11 +++++++++--
mm/fremap.c | 3 ++-
mm/hugetlb.c | 3 ++-
mm/memory.c | 16 ++++++++++------
mm/mmu_notifier.c | 9 ++++-----
5 files changed, 27 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c 2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c 2008-01-30 18:00:26.000000000 -0800
@@ -13,23 +13,22 @@
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
void mmu_notifier_release(struct mm_struct *mm)
{
struct mmu_notifier *mn;
struct hlist_node *n, *t;

if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
- down_write(&mm->mmap_sem);
- rcu_read_lock();
hlist_for_each_entry_safe_rcu(mn, n, t,
&mm->mmu_notifier.head, hlist) {
hlist_del_rcu(&mn->hlist);
if (mn->ops->release)
mn->ops->release(mn, mm);
}
- rcu_read_unlock();
- up_write(&mm->mmap_sem);
- synchronize_rcu();
}
}

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h 2008-01-30 18:00:26.000000000 -0800
@@ -67,15 +67,22 @@ struct mmu_notifier_ops {
int dummy);

/*
+ * invalidate_range_begin() and invalidate_range_end() are paired.
+ *
+ * invalidate_range_begin must clear all references in the range
+ * and stop the establishment of new references.
+ *
+ * invalidate_range_end() reenables the establishment of references.
+ *
* lock indicates that the function is called under spinlock.
*/
void (*invalidate_range_begin)(struct mmu_notifier *mn,
struct mm_struct *mm,
+ unsigned long start, unsigned long end,
int lock);

void (*invalidate_range_end)(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ struct mm_struct *mm);
};

struct mmu_rmap_notifier_ops;
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c 2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/fremap.c 2008-01-30 18:00:26.000000000 -0800
@@ -212,8 +212,9 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}

+ mmu_notifier(invalidate_range_start, mm, start, start + size, 0);
err = populate_range(mm, vma, start, size, pgoff);
- mmu_notifier(invalidate_range, mm, start, start + size, 0);
+ mmu_notifier(invalidate_range_end, mm);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c 2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/hugetlb.c 2008-01-30 18:00:26.000000000 -0800
@@ -744,6 +744,7 @@ void __unmap_hugepage_range(struct vm_ar
BUG_ON(start & ~HPAGE_MASK);
BUG_ON(end & ~HPAGE_MASK);

+ mmu_notifier(invalidate_range_start, mm, start, end, 1);
spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
@@ -764,7 +765,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
- mmu_notifier(invalidate_range, mm, start, end, 1);
+ mmu_notifier(invalidate_range_end, mm);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/memory.c 2008-01-30 18:00:51.000000000 -0800
@@ -888,11 +888,12 @@ unsigned long zap_page_range(struct vm_a
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
+ mmu_notifier(invalidate_range_start, mm, address, end,
+ (details ? (details->i_mmap_lock != NULL) : 0));
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
- mmu_notifier(invalidate_range, mm, address, end,
- (details ? (details->i_mmap_lock != NULL) : 0));
+ mmu_notifier(invalidate_range_end, mm);
return end;
}

@@ -1355,6 +1356,7 @@ int remap_pfn_range(struct vm_area_struc
pfn -= addr >> PAGE_SHIFT;
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
+ mmu_notifier(invalidate_range_start, mm, start, end, 0);
do {
next = pgd_addr_end(addr, end);
err = remap_pud_range(mm, pgd, addr, next,
@@ -1362,7 +1364,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
- mmu_notifier(invalidate_range, mm, start, end, 0);
+ mmu_notifier(invalidate_range_end, mm);
return err;
}
EXPORT_SYMBOL(remap_pfn_range);
@@ -1450,6 +1452,7 @@ int apply_to_page_range(struct mm_struct
int err;

BUG_ON(addr >= end);
+ mmu_notifier(invalidate_range_start, mm, start, end, 0);
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
@@ -1457,7 +1460,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
- mmu_notifier(invalidate_range, mm, start, end, 0);
+ mmu_notifier(invalidate_range_end, mm);
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1635,6 +1638,8 @@ gotten:
goto oom;
cow_user_page(new_page, old_page, address, vma);

+ mmu_notifier(invalidate_range_start, mm, address,
+ address + PAGE_SIZE - 1, 0);
/*
* Re-check the pte - we dropped the lock
*/
@@ -1673,8 +1678,7 @@ gotten:
page_cache_release(old_page);
unlock:
pte_unmap_unlock(page_table, ptl);
- mmu_notifier(invalidate_range, mm, address,
- address + PAGE_SIZE - 1, 0);
+ mmu_notifier(invalidate_range_end, mm);
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);

2008-01-31 02:35:34

by Robin Holt

[permalink] [raw]
Subject: Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

> Well the GRU uses follow_page() instead of get_user_pages. Performance is
> a major issue for the GRU.

Worse, the GRU takes its TLB faults from within an interrupt, so we
use follow_page to prevent going to sleep. That said, I think we
could probably use follow_page() with FOLL_GET set to accomplish the
requirements of the mmu_notifier invalidate_range call. It doesn't
look too promising for hugetlb pages, though.

Thanks,
Robin

2008-01-31 02:37:37

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, 30 Jan 2008, Robin Holt wrote:

> > Well the GRU uses follow_page() instead of get_user_pages. Performance is
> > a major issue for the GRU.
>
> Worse, the GRU takes its TLB faults from within an interrupt so we
> use follow_page to prevent going to sleep. That said, I think we
> could probably use follow_page() with FOLL_GET set to accomplish the
> requirements of mmu_notifier invalidate_range call. Doesn't look too
> promising for hugetlb pages.

There may be no need to with the range_start/end scheme. The driver can
have its own lock to make follow_page safe. The lock needs to serialize
the follow_page handler and the range_start/end calls as well as the
invalidate_page callouts. I think that avoids the need for
get_user_pages().
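
As a sketch (purely illustrative, not GRU code), a single driver spinlock
could provide that serialization; a spinlock rather than a mutex is
assumed because the GRU handles its faults in interrupt context:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(external_tlb_lock);
static int range_invalidate_active;	/* nonzero while start..end is in flight */

static void gru_range_start_sketch(void)
{
	spin_lock(&external_tlb_lock);
	range_invalidate_active++;
	/* ... flush the external TLB entries for the range ... */
	spin_unlock(&external_tlb_lock);
}

static void gru_range_end_sketch(void)
{
	spin_lock(&external_tlb_lock);
	range_invalidate_active--;
	spin_unlock(&external_tlb_lock);
}

/* fault path, serialized against the callouts by the same lock */
static int gru_fault_sketch(void)
{
	int ok;

	spin_lock(&external_tlb_lock);
	ok = !range_invalidate_active;
	if (ok) {
		/* ... look up the pfn (e.g. via follow_page()) and install
		 * the external TLB entry while the callouts are kept out ... */
	}
	spin_unlock(&external_tlb_lock);
	return ok;	/* caller retries the fault later if not ok */
}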

2008-01-31 02:43:12

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 06:08:14PM -0800, Christoph Lameter wrote:
> hlist_for_each_entry_safe_rcu(mn, n, t,
^^^^

> &mm->mmu_notifier.head, hlist) {
> hlist_del_rcu(&mn->hlist);
^^^^

_rcu can go away from both, if hlist_del_rcu can be called w/o locks.

2008-01-31 02:51:38

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 30, 2008 at 06:08:14PM -0800, Christoph Lameter wrote:
> > hlist_for_each_entry_safe_rcu(mn, n, t,
> ^^^^
>
> > &mm->mmu_notifier.head, hlist) {
> > hlist_del_rcu(&mn->hlist);
> ^^^^
>
> _rcu can go away from both, if hlist_del_rcu can be called w/o locks.

True. hlist_del_init ok? That would allow the driver to check whether the
mmu_notifier is already linked in using !hlist_unhashed(). The driver then
needs to properly initialize the mmu_notifier's hlist node with
INIT_HLIST_NODE().
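
In other words, something along these lines (a sketch; it assumes the
struct mmu_notifier from the patch with its hlist_node member named
"hlist"):

#include <linux/list.h>
#include <linux/mmu_notifier.h>

static void driver_init_notifier_sketch(struct mmu_notifier *mn)
{
	INIT_HLIST_NODE(&mn->hlist);	/* so hlist_unhashed() is meaningful */
}

static int driver_notifier_registered_sketch(struct mmu_notifier *mn)
{
	/* true while the notifier is still on the mm's list */
	return !hlist_unhashed(&mn->hlist);
}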

2008-01-31 02:56:47

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] mmu_notifier: invalidate_range_start with lock=1

One possible way that XPmem could deal with a call of
invalidate_range_start with the lock flag set:

Scan through the rmaps you have for ptes. If you find one then elevate the
refcount of the corresponding page and mark in the maps that you have done
so. Also make them readonly. The increased refcount will prevent the
freeing of the page. The page will be unmapped from the process and XPmem
will retain the only reference.

Then some shepherding process that you have anyway with XPmem can
sometime later zap the remote ptes and free the pages. That would leave
stale data visible on the remote side for a while. Would that be okay?

This would only be used for truncate, which uses the unmap_mapping_range
call. So we are not in reclaim or other distress.
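
A rough driver-side sketch of that scheme (illustrative only; the helpers
and the per-page bookkeeping are invented for the example):

#include <linux/mm.h>

/*
 * Called for each pte found in the driver's rmap when
 * invalidate_range_start arrives with the lock flag set.
 */
static void xpmem_pin_for_deferred_zap_sketch(struct page *page)
{
	get_page(page);		/* keeps the page from being freed/reused */
	/* ... mark the remote mapping read-only and flag it "stale" ... */
}

/*
 * Run later by the shepherding thread, outside the locked section:
 * tear down the remote ptes and drop the pins taken above.
 */
static void xpmem_shepherd_zap_sketch(struct page *page)
{
	/* ... zap the remote pte / remote TLB entry ... */
	put_page(page);
}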


2008-01-31 10:52:53

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 05:46:21PM -0800, Christoph Lameter wrote:
> Well the GRU uses follow_page() instead of get_user_pages. Performance is
> a major issue for the GRU.

GRU is an external TLB; we have to allocate RAM instead, but we do it
through the regular userland paging mechanism. Performance is a major
issue for kvm too, but the result of get_user_pages is used to fill a
spte, and then the cpu will use the spte in hardware to fill its
tlb. We won't have to keep calling follow_page in software to fill the
tlb like GRU has to do, so you can imagine the difference in cpu
utilization spent in those paths (plus our requirement to allocate
memory).

> Hmmmm.. Could we go to a scheme where we do not have to increase the page
> count? Modifications of the page struct require dirtying a cache line and

I doubt the atomic_inc is measurable given the rest of the overhead, like
building the rmap for each new spte.

There's no technical reason for not wanting proper reference counting
other than microoptimization. What will work for GRU will work for KVM
too, regardless of the reference counting. Each mmu-notifier user
should be free to do what it thinks is better/safer or more
convenient (and for anybody calling get_user_pages, having the
refcounting on external references is natural and comes at zero
additional cost).

> it seems that we do not need an increased page count if we have an
> invalidate_range_start() that clears all the external references
> and stops the establishment of new ones and invalidate_range_end() that
> reenables new external references?
>
> Then we do not need the frequent invalidate_page() calls.

The increased page count is _mandatory_ to safely use range_start/end
called outside the locks, with _end called after releasing the old
page. sptes will keep being built the whole time until the pte_clear is
called on the main linux pte. We don't want to clutter the VM fast
paths with additional locks to stop the kvm pagefault while the VM is
in the _range_start/end critical section, like xpmem has to do to be
safe. So you're contradicting yourself by suggesting not to use
invalidate_page and not to use an increased page count at the same
time. And I need invalidate_page anyway for rmap.c, which can't be
provided as an invalidate_range and can't sleep either.

2008-01-31 13:39:29

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

On Wed, Jan 30, 2008 at 06:51:26PM -0800, Christoph Lameter wrote:
> True. hlist_del_init ok? That would allow the driver to check whether the
> mmu_notifier is already linked in using !hlist_unhashed(). The driver then
> needs to properly initialize the mmu_notifier's hlist node with INIT_HLIST_NODE().

A driver couldn't possibly care about the mmu notifier anymore at that
point. We just agreed a moment ago that the list can't change under
mmu_notifier_release, so in turn no driver could possibly call
mmu_notifier_unregister/register at that point anymore, regardless of
the outcome of hlist_unhashed, and external serialization must let the
driver know it's done with the notifiers.