2008-01-13 16:24:33

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH] mmu notifiers #v2

Hello,

This patch is the latest version of a basic implementation of the mmu
notifiers.

In short, when the linux VM decides to free a page, it will unmap it
from the linux pagetables. However, when a page is mapped not just by
the regular linux ptes but also by shadow pagetables, it's currently
unfreeable by the linux VM.

This patch allows the shadow pagetables to be dropped and the page to
be freed after that, if the linux VM decides to unmap the page from
the main ptes because it wants to swap out the page.

In my basic initial patch I only track the tlb flushes which should be
the minimum required to have a nice linux-VM controlled swapping
behavior of the KVM gphysical memory. The shadow-ptes work much like
a TLB, so the same way we flush the tlb after clearing the ptes, we
should also issue the mmu_notifier invalidate_page/range/release
methods. Quadrics needs much more than that to optimize things to
preload secondary-ptes after main-pte instantiation (to prevent page
faults) and other similar optimizations, but it's easy to add more
methods to the below code to fit their needs if the basic
infrastructure is ok. Furthermore, the best approach would be to use
self-modifying code to implement the notifiers as a nop instruction when
disarmed, but such an optimization is a low priority for now.

This follows the model of Avi's original patch; however, I guess it
would also be possible to track when the VM shrink_cache method wants
to free a certain host-page_t instead of tracking when the tlb is
flushed. Not sure if other ways are feasible or better, but the below
is the most important bit we need for KVM to swap nicely with minimal
overhead to the host kernel even if KVM is unused.

About the locking, perhaps I'm underestimating it, but following the
TLB flushing analogy, simply clearing the shadow ptes (with the kvm
mmu_lock spinlock held to avoid racing with other vcpu spte accesses,
of course) and flushing the shadow-ptes after clearing the main linux
pte should be enough to serialize against shadow-pte page faults that
would call into get_user_pages. Flushing the host TLB before or after
the shadow-ptes shouldn't matter.

Comments welcome... especially from SGI/IBM/Quadrics and all other
potential users of this functionality.

The patch is running on my desktop and swapping out a 3G non-linux VM
(smp guest and smp host) as I write this; it seems to work fine (only
tested with SVM so far).

If you're interested in the KVM usage of this feature, in the link
below you can see how it enables full KVM swapping through the core
linux kernel virtual memory algorithms, using the KVM rmap (like
pte-chains) to unmap shadow-ptes in addition to mm/rmap.c for the
regular ptes.

http://marc.info/?l=kvm-devel&m=120023119710198&w=2

This is a zero-risk patch when unused. So I hope we can quickly
concentrate on finalizing the interface and merge it ASAP. It's mostly a
matter of deciding if this is the right way of getting notification of
VM mapping invalidations, mapping instantiations for secondary
TLBs/MMUs/RDMA/Shadow etc...

There are also certain details I'm uncertain about, like passing 'mm'
to the lowlevel methods; my KVM usage of the invalidate_page()
notifier, for example, only uses 'mm' for a BUG_ON:

void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
[..]
BUG_ON(mm != kvm->mm);
[..]

BTW, if anybody needs 'mm', the mm could also be stored in the
container of the 'mn', incidentally like in the KVM case. container_of
is an efficient and standard way to do things. OTOH I kept passing down
the 'mm' like below just in case others don't have 'mm' saved in the
container their mmu_notifier is embedded into, the way we do in struct
kvm, and prefer to get it through registers/stack. In practice it won't
make any measurable difference, it's mostly a style issue.
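
As a concrete illustration of the container_of alternative, here is a
minimal sketch (not the actual KVM patch; kvm_unmap_spte() is a
hypothetical helper, and it assumes struct kvm embeds the notifier as
->mmu_notifier, as in the KVM-side patch linked above):

static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long address)
{
        /* recover the enclosing struct kvm from the embedded notifier,
           instead of relying on the 'mm' argument */
        struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

        BUG_ON(mm != kvm->mm);
        kvm_unmap_spte(kvm, address);   /* hypothetical: drop the shadow pte */
}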

Signed-off-by: Andrea Arcangeli <[email protected]>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -86,6 +86,7 @@ do { \
pte_t __pte; \
__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \
flush_tlb_page(__vma, __address); \
+ mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \
__pte; \
})
#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -13,6 +13,7 @@
#include <linux/debug_locks.h>
#include <linux/mm_types.h>
#include <linux/security.h>
+#include <linux/mmu_notifier.h>

struct mempolicy;
struct anon_vma;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -219,6 +219,10 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+#ifdef CONFIG_MMU_NOTIFIER
+ struct hlist_head mmu_notifier; /* MMU notifier list */
+#endif
};

#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/mm_types.h>
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier;
+
+struct mmu_notifier_ops {
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+ void (*invalidate_range)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+};
+
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn);
+extern void mmu_notifier_release(struct mm_struct *mm);
+
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ struct mmu_notifier *__mn; \
+ struct hlist_node *__n; \
+ \
+ hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \
+ if (__mn->ops->function) \
+ __mn->ops->function(__mn, mm, args); \
+ } while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+#define mmu_notifier_register(mn, mm) do {} while(0)
+#define mmu_notifier_unregister(mn) do {} while (0)
+#define mmu_notifier_release(mm) do {} while (0)
+
+#define mmu_notifier(function, mm, args...) \
+ do { } while (0)
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@ config VIRT_TO_BUS
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -753,6 +753,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+ mmu_notifier(invalidate_range, mm, start, end);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -889,6 +889,7 @@ unsigned long zap_page_range(struct vm_a
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+ mmu_notifier(invalidate_range, mm, address, end);
return end;
}

@@ -1358,6 +1359,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range, mm, end-PAGE_ALIGN(size), end);
return err;
}
EXPORT_SYMBOL(remap_pfn_range);
@@ -1452,6 +1454,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range, mm, end-size, end);
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1747,6 +1747,7 @@ static void unmap_region(struct mm_struc
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+ mmu_notifier(invalidate_range, mm, start, end);
}

/*
@@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ mmu_notifier_release(mm);

/*
* Walk the list again, actually closing and freeing it,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,35 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+
+void mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *tmp;
+
+ hlist_for_each_entry_safe(mn, n, tmp, &mm->mmu_notifier, hlist) {
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ hlist_del(&mn->hlist);
+ }
+}
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ hlist_add_head(&mn->hlist, &mm->mmu_notifier);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn)
+{
+ hlist_del(&mn->hlist);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);


2008-01-13 21:13:20

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2


On Sun, 2008-01-13 at 17:24 +0100, Andrea Arcangeli wrote:
> Hello,
>
> This patch is last version of a basic implementation of the mmu
> notifiers.
>
> In short when the linux VM decides to free a page, it will unmap it
> from the linux pagetables. However when a page is mapped not just by
> the regular linux ptes, but also from the shadow pagetables, it's
> currently unfreeable by the linux VM.
>
> This patch allows the shadow pagetables to be dropped and the page to
> be freed after that, if the linux VM decides to unmap the page from
> the main ptes because it wants to swap out the page.

Another potential user of this that I can see is the DRM. Nowadays,
graphics cards essentially have an MMU on chip and can do paging. It
would be nice to be able to map user objects in them without having to
lock them down, using your callback to properly mark them cast out on
the card.
Ben.

2008-01-14 20:03:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2

On Sun, 13 Jan 2008, Andrea Arcangeli wrote:

> About the locking perhaps I'm underestimating it, but by following the
> TLB flushing analogy, by simply clearing the shadow ptes (with kvm
> mmu_lock spinlock to avoid racing with other vcpu spte accesses of
> course) and flushing the shadow-pte after clearing the main linux pte,
> it should be enough to serialize against shadow-pte page faults that
> would call into get_user_pages. Flushing the host TLB before or after
> the shadow-ptes shouldn't matter.

Hmmm... In most of the callsites we hold a writelock on mmap_sem right?

> Comments welcome... especially from SGI/IBM/Quadrics and all other
> potential users of this functionality.

> There are also certain details I'm uncertain about, like passing 'mm'
> to the lowlevel methods, my KVM usage of the invalidate_page()
> notifier for example only uses 'mm' for a BUG_ON for example:

Passing mm is fine as long as mmap_sem is held.

> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -86,6 +86,7 @@ do { \
> pte_t __pte; \
> __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \
> flush_tlb_page(__vma, __address); \
> + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \
> __pte; \
> })
> #endif

Hmmm... this is ptep_clear_flush? What about the other uses of
flush_tlb_page in asm-generic/pgtable.h and related uses in arch code?
(would help if your patches would mention the function name in the diff
headers)

> +#define mmu_notifier(function, mm, args...) \
> + do { \
> + struct mmu_notifier *__mn; \
> + struct hlist_node *__n; \
> + \
> + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \
> + if (__mn->ops->function) \
> + __mn->ops->function(__mn, mm, args); \
> + } while (0)

Does this have to be inline? ptep_clear_flush will become quite big

2008-01-15 04:29:50

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2


On Mon, 2008-01-14 at 12:02 -0800, Christoph Lameter wrote:
> On Sun, 13 Jan 2008, Andrea Arcangeli wrote:
>
> > About the locking perhaps I'm underestimating it, but by following the
> > TLB flushing analogy, by simply clearing the shadow ptes (with kvm
> > mmu_lock spinlock to avoid racing with other vcpu spte accesses of
> > course) and flushing the shadow-pte after clearing the main linux pte,
> > it should be enough to serialize against shadow-pte page faults that
> > would call into get_user_pages. Flushing the host TLB before or after
> > the shadow-ptes shouldn't matter.
>
> Hmmm... In most of the callsites we hold a writelock on mmap_sem right?

Not in unmap_mapping_range() afaik.

> > Comments welcome... especially from SGI/IBM/Quadrics and all other
> > potential users of this functionality.
>
> > There are also certain details I'm uncertain about, like passing 'mm'
> > to the lowlevel methods, my KVM usage of the invalidate_page()
> > notifier for example only uses 'mm' for a BUG_ON for example:
>
> Passing mm is fine as long as mmap_sem is held.

Passing mm is always a good idea, regardless of the mmap_sem, it can be
useful for lots of other things :-)

> > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > --- a/include/asm-generic/pgtable.h
> > +++ b/include/asm-generic/pgtable.h
> > @@ -86,6 +86,7 @@ do { \
> > pte_t __pte; \
> > __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \
> > flush_tlb_page(__vma, __address); \
> > + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \
> > __pte; \
> > })
> > #endif
>
> Hmmm... this is ptep_clear_flush? What about the other uses of
> flush_tlb_page in asm-generic/pgtable.h and related uses in arch code?
> (would help if your patches would mention the function name in the diff
> headers)

Note that last I looked, a lot of these were stale. Might be time to
resume my spring/summer cleaning of page table accessors...

> > +#define mmu_notifier(function, mm, args...) \
> > + do { \
> > + struct mmu_notifier *__mn; \
> > + struct hlist_node *__n; \
> > + \
> > + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \
> > + if (__mn->ops->function) \
> > + __mn->ops->function(__mn, mm, args); \
> > + } while (0)
>
> Does this have to be inline? ptep_clear_flush will become quite big

2008-01-15 12:45:11

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2

On Mon, Jan 14, 2008 at 12:02:42PM -0800, Christoph Lameter wrote:
> Hmmm... In most of the callsites we hold a writelock on mmap_sem right?

Not in all of them, like Marcelo pointed out on kvm-devel, so the
lowlevel locking can't rely on the VM locks.

About your request to be able to schedule inside the mmu notifier
methods: this is not feasible right now, as the notifier is often
called with the pte spinlocks held. I wonder if you can simply
post/queue an event instead, like a softirq/pdflush.

> Passing mm is fine as long as mmap_sem is held.

mmap_sem is not held, but don't worry, "mm" can't go away under the mmu
notifier, so it's ok. It's just that the KVM methods never use "mm"
at all (container_of translates the struct mmu_notifier to a struct
kvm, and there is the mm in kvm->mm too). Perhaps others don't save
the "mm" in the container their mmu_notifier is embedded into, so I
left mm as a parameter to the methods.

> Hmmm... this is ptep_clear_flush? What about the other uses of
> flush_tlb_page in asm-generic/pgtable.h and related uses in arch code?

This is not necessarily a 1:1 relationship with the tlb
flushes. Otherwise they'd be the tlb-notifiers not the mmu-notifiers.

The other methods in pgtable.h are not dropping a user page from
the "mm". That's the invalidate case right now. Other methods will not
call into invalidate_page, but you're welcome to add other methods and
call them from other ptep_* functions if you're interested in being
notified about more than just the invalidates of the "mm".

Is invalidate_page/range a clear enough method name to convey that the
ptes and tlb entries have been dropped for the page/range mapped in
userland at that address/range?

> (would help if your patches would mention the function name in the diff
> headers)

my patches use the git diff defaults I guess, and they mention the
function name in all other places; it's just that git isn't smart
enough to catch the function name in that single place. It's ok.

> > +#define mmu_notifier(function, mm, args...) \
> > + do { \
> > + struct mmu_notifier *__mn; \
> > + struct hlist_node *__n; \
> > + \
> > + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \
> > + if (__mn->ops->function) \
> > + __mn->ops->function(__mn, mm, args); \
> > + } while (0)
>
> Does this have to be inline? ptep_clear_flush will become quite big

Inline makes the patch smaller and it avoids a call in the common case
that the mmu_notifier list is empty. Perhaps I could add a:

if (unlikely(!hlist_empty(&(mm)->mmu_notifier))) {
	...
}

so gcc could offload the internal block into a cold-icache region of .text.

I think at least an unlikely(!hlist_empty(&(mm)->mmu_notifier)) check
has to be inline. Currently there isn't such a check because I'm unsure
if it really makes sense. The idea is that if you really care to
optimize this you'll use self-modifying code to turn a nop into a call
when a certain method is armed. That's an extreme optimization though;
the current code should already have unmeasurable overhead when disarmed.

2008-01-15 20:20:15

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2


On Tue, 2008-01-15 at 13:44 +0100, Andrea Arcangeli wrote:
> On Mon, Jan 14, 2008 at 12:02:42PM -0800, Christoph Lameter wrote:
> > Hmmm... In most of the callsites we hold a writelock on mmap_sem right?
>
> Not in all, like Marcelo pointed out in kvm-devel, so the lowlevel
> locking can't relay on the VM locks.
>
> About your request to schedule in the mmu notifier methods this is not
> feasible right now, the notifier is often called with the pte
> spinlocks held. I wonder if you can simply post/queue an event like a
> softirq/pdflush.

Do you have cases where it's -not- called with the PTE lock held ?

Ben.

2008-01-16 01:06:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2

On Wed, Jan 16, 2008 at 07:18:53AM +1100, Benjamin Herrenschmidt wrote:
> Do you have cases where it's -not- called with the PTE lock held ?

For invalidate_page, no, because currently it's only called next to the
ptep_get_and_clear that modifies the pte and requires the pte
lock. invalidate_range/release are called without the pte lock held.

2008-01-16 09:01:44

by Brice Goglin

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2

Andrea Arcangeli wrote:
> This patch is last version of a basic implementation of the mmu
> notifiers.
>
> In short when the linux VM decides to free a page, it will unmap it
> from the linux pagetables. However when a page is mapped not just by
> the regular linux ptes, but also from the shadow pagetables, it's
> currently unfreeable by the linux VM.
>
> This patch allows the shadow pagetables to be dropped and the page to
> be freed after that, if the linux VM decides to unmap the page from
> the main ptes because it wants to swap out the page.
>
> [...]
>
> Comments welcome... especially from SGI/IBM/Quadrics and all other
> potential users of this functionality.
>

For HPC, this should be very interesting. Managing the registration
cache of high-speed networks from user-space is a huge mess. This
approach should help a lot. In fact, back in 2004, I implemented
something similar called vmaspy to update the regcache of Myrinet
drivers. I never submitted any patch because Infiniband would have been
the only user in the mainline kernel and they were resistant to these
ideas [1]. In the meantime, some of them apparently changed their minds
since they implemented some vmops-overriding hack to do something
similar [2]. This patch should simplify all this.

One of the differences from my patch is that you attach the notifier
list to the mm_struct while my code attached it to vmas. But I now
don't think that was such a good idea, since it probably didn't reduce
the number of notifier calls much.

Also, one thing that I looked at in vmaspy was notifying fork. I am not
sure what happens on copy-on-write with your code, but for sure CoW is
problematic for shadow page tables. I thought shadow pages should just
be invalidated when a fork happens and the caller would refill them
after forcing CoW or so. So adding a notifier call there too might be
nice.

Brice

[1] http://lkml.org/lkml/2005/4/29/175
[2] http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf

2008-01-16 10:19:52

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2

On Wed, Jan 16, 2008 at 10:01:32AM +0100, Brice Goglin wrote:
> One of the difference with my patch is that you attach the notifier list to
> the mm_struct while my code attached it to vmas. But I now don't think it
> was such a good idea since it probably didn't reduce the number of notifier
> calls a lot.

Thanks for raising this topic.

Notably, KVM would also be a bit more optimal with the notifier in the
vma, and that was the original implementation too. It's not a sure
thing that it has to be in the mm.

The quadrics patch does a mixture: it attaches it to the mm but then
tries to pass the vma down to the method, and it's broken doing
so, like during munmap where it passes the first vma being unmapped
but not all the later ones in the munmap range.

If we want to attach it to the vma, I think the vma should be passed
as the parameter instead of the mm. In some places like
apply_to_page_range the vma isn't even available, and I found it a
little dirty to run a find_vma inside an #ifdef CONFIG_MMU_NOTIFIER.

The only thing the vma could be interesting for is the protection
bits, for things like update_range in the quadrics patch where they
prefetch their secondary tlb. But again, if we want to do that, we need
to hook inside unmap_vmas and pass all the different vmas, not
just the first one touched by unmap_vmas. unmap_vmas is _plural_ not
singular ;).

In the end, attaching to the mm avoided having to solve all the above
troubles and provided a straightforward implementation where I only
need a single call to mmu_notifier_register, plus other minor
advantages like that, and not much downside.

But certainly the mm vs vma decision wasn't trivial (I switched back
and forth a few times from vma to mm and back) and if people think
this should be in the vma I can try again, but it won't be as
straightforward a patch as the mm one.

One benefit, for example, is that it could go in the memslot, and
effectively the notifier->memslot conversion would be just a
container_of instead of a "search" over the memslots. Locking aside.

> Also, one thing that I looked at in vmaspy was notifying fork. I am not
> sure what happens on Copy-on-write with your code, but for sure C-o-w is
> problematic for shadow page tables. I thought shadow pages should just be
> invalidated when a fork happens and the caller would refill them after
> forcing C-o-w or so. So adding a notifier call there too might be nice.

There can't be any CoWs right now in the KVM VM backing store; that's
why this is enough to get full swapping working fine. For example, I
think we'll need to add more notifiers to handle swapping of
MAP_PRIVATE non-linear tmpfs shared pages properly (and it won't be an
issue with fork() but with after-the-fact sharing).

Right now I'm more interested in the interface for the invalidates:
things like mm vs vma, the places where we hook under the pte spinlock,
and so on. Then the patch can hopefully be merged and extended
with more methods like ->change_protection_page/range and hooked into
cow etc...
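
To make the kind of extension meant here concrete, a possible shape for
one such method is sketched below. This is only an illustration, not
part of the posted patch; the method name, its arguments and the idea
of calling it next to the flush in mm/mprotect.c are assumptions:

struct mmu_notifier_ops {
        void (*release)(struct mmu_notifier *mn,
                        struct mm_struct *mm);
        void (*invalidate_page)(struct mmu_notifier *mn,
                                struct mm_struct *mm,
                                unsigned long address);
        void (*invalidate_range)(struct mmu_notifier *mn,
                                 struct mm_struct *mm,
                                 unsigned long start, unsigned long end);
        /*
         * Hypothetical addition (not in this patch): notify secondary MMUs
         * that the protection of a userland range changed, e.g. called next
         * to the flush_tlb_range() in mm/mprotect.c, so they can update or
         * drop their mappings for that range.
         */
        void (*change_protection_range)(struct mmu_notifier *mn,
                                        struct mm_struct *mm,
                                        unsigned long start,
                                        unsigned long end,
                                        pgprot_t newprot);
};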

2008-01-16 17:44:27

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2

On Sun, 13 Jan 2008 17:24:18 +0100
Andrea Arcangeli <[email protected]> wrote:

> In my basic initial patch I only track the tlb flushes which should be
> the minimum required to have a nice linux-VM controlled swapping
> behavior of the KVM gphysical memory.

I have a vaguely related question on KVM swapping.

Do page accesses inside KVM guests get propagated to the host
OS, so Linux can choose a reasonable page for eviction, or is
the pageout of KVM guest pages essentially random?

--
All rights reversed.

2008-01-16 18:12:55

by izik eidus

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2

Rik van Riel wrote:
> On Sun, 13 Jan 2008 17:24:18 +0100
> Andrea Arcangeli <[email protected]> wrote:
>
>
>> In my basic initial patch I only track the tlb flushes which should be
>> the minimum required to have a nice linux-VM controlled swapping
>> behavior of the KVM gphysical memory.
>>
>
> I have a vaguely related question on KVM swapping.
>
> Do page accesses inside KVM guests get propagated to the host
> OS, so Linux can choose a reasonable page for eviction, or is
> the pageout of KVM guest pages essentially random?
>
>
Right now when kvm removes a pte from the shadow cache, it marks as
accessed the page that this pte pointed to.
It was a good solution until the mmu notifiers, because the pages were
pinned and couldn't be swapped to disk.
So now it will have to do something more sophisticated, or at least
mark as accessed every page pointed to by a pte that gets inserted into
the shadow cache....

2008-01-17 16:34:46

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2

On Wed, Jan 16, 2008 at 07:48:06PM +0200, Izik Eidus wrote:
> Rik van Riel wrote:
>> On Sun, 13 Jan 2008 17:24:18 +0100
>> Andrea Arcangeli <[email protected]> wrote:
>>
>>
>>> In my basic initial patch I only track the tlb flushes which should be
>>> the minimum required to have a nice linux-VM controlled swapping
>>> behavior of the KVM gphysical memory.
>>
>> I have a vaguely related question on KVM swapping.
>>
>> Do page accesses inside KVM guests get propagated to the host
>> OS, so Linux can choose a reasonable page for eviction, or is
>> the pageout of KVM guest pages essentially random?

Right, selection of the guest OS pages to swap is partly random but
wait: _only_ for the long-cached and hot spte entries. It's certainly
not entirely random.

As the shadow-cache is a bit dynamic, every newly instantiated spte
will refresh the PG_referenced bit in follow_page already (through
minor faults). A not-present fault on swapped-out, non-present sptes
can trigger minor faults from swapcache too, and those will refresh
young regular ptes.

> right now when kvm remove pte from the shadow cache, it mark as access the
> page that this pte pointed to.

Yes: the referenced bit in the mmu-notifier invalidate case isn't
useful because it's set right before freeing the page.

> it was a good solution untill the mmut notifiers beacuse the pages were
> pinned and couldnt be swapped to disk

It probably still makes sense for sptes removed because of other
reasons (not mmu notifier invalidates).

> so now it will have to do something more sophisticated or at least mark as
> access every page pointed by pte
> that get insrted to the shadow cache....

I think that should already be the case, see the mark_page_accessed in
follow_page: FOLL_TOUCH is set, isn't it?

The only thing we clearly miss is logic that refreshes the
PG_referenced bitflag for "hot" sptes that remain instantiated and
cached for a long time. For regular linux ptes this is done by the cpu
through the young bitflag. But note that not all architectures have
young bitflag support in hardware! So I suppose swapping the
KVM task is like swapping any other task, but on an alpha
CPU. It works well enough in practice even if we clearly have room for
further optimizations in this area (as there would be on archs w/o
the young bit updated in hardware too).

To refresh the PG_referenced bit for long-lived hot sptes, I think the
easiest solution is to chain the sptes in an lru, and to start dropping
them when memory pressure starts. We could drop one spte every X pages
collected by the VM. So the "age" time factor depends on the VM
velocity and we totally avoid useless shadow page faults when there's
no VM pressure. When VM pressure increases, the kvm non-present fault
will then take care of refreshing the PG_referenced bit. This should
solve the aging issue for long-lived and hot sptes. This should
improve the responsiveness of the guest OS during "initial" swap
pressure (after the initial swap pressure, the working set finds
itself in ram again). So it should avoid some unnecessary
swapout/swapin jitter during the initial swap. I see this mostly as a
kvm-internal optimization, not strictly related to the mmu notifiers
though.

2008-01-17 18:31:15

by izik eidus

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2

Andrea Arcangeli wrote:
> On Wed, Jan 16, 2008 at 07:48:06PM +0200, Izik Eidus wrote:
>
>> Rik van Riel wrote:
>>
>>> On Sun, 13 Jan 2008 17:24:18 +0100
>>> Andrea Arcangeli <[email protected]> wrote:
>>>
>>>
>>>
>>>> In my basic initial patch I only track the tlb flushes which should be
>>>> the minimum required to have a nice linux-VM controlled swapping
>>>> behavior of the KVM gphysical memory.
>>>>
>>> I have a vaguely related question on KVM swapping.
>>>
>>> Do page accesses inside KVM guests get propagated to the host
>>> OS, so Linux can choose a reasonable page for eviction, or is
>>> the pageout of KVM guest pages essentially random?
>>>
>
> Right, selection of the guest OS pages to swap is partly random but
> wait: _only_ for the long-cached and hot spte entries. It's certainly
> not entirely random.
>
> As the shadow-cache is a bit dynamic, every new instantiated spte will
> refresh the PG_referenced bit in follow_page already (through minor
> faults). not-present fault of swapped non-present sptes, can trigger
> minor faults from swapcache too and they'll refresh young regular
> ptes.
>
>
>> right now when kvm remove pte from the shadow cache, it mark as access the
>> page that this pte pointed to.
>>
>
> Yes: the referenced bit in the mmu-notifier invalidate case isn't
> useful because it's set right before freeing the page.
>
>
>> it was a good solution untill the mmut notifiers beacuse the pages were
>> pinned and couldnt be swapped to disk
>>
>
> It probably still makes sense for sptes removed because of other
> reasons (not mmu notifier invalidates).
>
agree
>
>> so now it will have to do something more sophisticated or at least mark as
>> access every page pointed by pte
>> that get insrted to the shadow cache....
>>
>
> I think that should already be the case, see the mark_page_accessed in
> follow_page, isn't FOLL_TOUCH set, isn't it?
>
Yes, you are right, FOLL_TOUCH is set.
> The only thing we clearly miss is a logic that refreshes the
> PG_referenced bitflag for "hot" sptes that remains instantiated and
> cached for a long time. For regular linux ptes this is done by the cpu
> through the young bitflag. But note that not all architectures have
> the young bitflag support in hardware! So I suppose the swapping of
> the KVM task, is like the swapping any other task but on an alpha
> CPU. It works good enough in practice even if we clearly have room for
> further optimizations in this area (like there would be on archs w/o
> young bit updated in hardware too).
>
> To refresh the PG_referenced bit for long lived hot sptes, I think the
> easiest solution is to chain the sptes in a lru, and to start dropping
> them when memory pressure start. We could drop one spte every X pages
> collected by the VM. So the "age" time factor depends on the VM
> velocity and we totally avoid useless shadow page faults when there's
> no VM pressure. When VM pressure increases, the kvm non-present fault
> will then take care to refresh the PG_referenced bit. This should
> solve the aging-issue for long lived and hot sptes. This should
> improve the responsiveness of the guest OS during "initial" swap
> pressure (after the initial swap pressure, the working set finds
> itself in ram again). So it should avoid some swapout/swapin not
> required jitter during the initial swap. I see this mostly as a kvm
> internal optimization, not strictly related to the mmu notifiers
> though.
>
Ooh, I like it, this is a clever solution, and I guess the cost of the
vmexits won't be too high if it's not too aggressive....

2008-01-17 19:44:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v2

On Thu, Jan 17, 2008 at 08:21:16PM +0200, Izik Eidus wrote:
> ohh i like it, this is cleaver solution, and i guess the cost of the
> vmexits wont be too high if it will
> be not too much aggressive....

Yes, and especially during swapping, the system isn't usually CPU
bound. The idea is to pay with some vmexit minor faults when the CPU
tends to be idle, to reduce the amount of swapouts. I say swapouts and
not swapins because it will mostly help avoid writing out swapcache
to disk for no good reason. Swapins already have a chance not to
generate any read-I/O if the removed spte is really hot.

To make this work we still need notification from the VM about memory
pressure, and perhaps the slab shrinker method is enough even if it has
a coarse granularity. Freeing sptes during memory pressure also
converges with the objective of releasing pinned slab memory so that
the spte cache can grow more freely (the 4k-PAGE_SIZE, 0-order page
defrag philosophy will also appreciate that). There are lots of details
to figure out in a good implementation, but the basic idea converges on
two fairly important fronts.
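
To make the memory-pressure notification idea a bit more concrete, a
very rough sketch of hooking the slab shrinker is below, written
against the 2.6.24-era shrinker interface. All names except the
shrinker API itself (struct shrinker, register_shrinker, DEFAULT_SEEKS)
are hypothetical, and nothing here is part of a posted patch:

#include <linux/mm.h>

/*
 * Hedged sketch: a shrinker that drops a few of the oldest cached sptes
 * when the VM applies memory pressure. kvm_spte_cache_count() and
 * kvm_drop_oldest_sptes() are hypothetical helpers that would walk an
 * lru of cached sptes.
 */
static int kvm_spte_shrink(int nr_to_scan, gfp_t gfp_mask)
{
        if (nr_to_scan)
                kvm_drop_oldest_sptes(nr_to_scan);      /* zap cold sptes */
        /* tell the VM how many cached sptes could still be freed */
        return kvm_spte_cache_count();
}

static struct shrinker kvm_spte_shrinker = {
        .shrink = kvm_spte_shrink,
        .seeks  = DEFAULT_SEEKS,
};

/* at module init: register_shrinker(&kvm_spte_shrinker); */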

2008-01-21 12:52:19

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH] mmu notifiers #v3

On Thu, Jan 17, 2008 at 08:32:52PM +0100, Andrea Arcangeli wrote:
> To make this work we still need notification from the VM about memory
> pressure [..]

Ok, I thought some more about the aging issue of the hot kvm pages (to
prevent the guest-OS very-hot working set from being swapped out). So I
have now hooked an age_page mmu notifier into the page_referenced mkold
path. This way when the linux pte is marked old, we can also drop the
spte, and we give the guest OS a whole round of the inactive list scan
in which to generate a vmexit minor fault by touching the hot page. The
very-lightweight vmexit will call into follow_page again, which I
accordingly changed to mark the pte young (which is nicer because it
truly simulates what a regular access through the virtual address
would do). For direct-io it makes no difference, and this way the next
time page_referenced runs it'll find the pte young again, it'll mark
the pte old again and in turn it'll call ->age_page again, which will
drop the spte again, etc... So the working set will be sticky in ram
and it won't generate spurious swapouts (this is the theory at least).
It works well in practice so far but I don't have hard numbers myself
(I just implemented what I think is a quite effective aging strategy to
do non-random page replacement on the very hot guest-OS working set).

In the absence of memory pressure (or with little pressure) there will
be no age_page calls at all and the spte cache can grow freely without
any vmexits. This provides peak performance when there is no memory
pressure.

This keeps the VM aging decision in the VM instead of having an lru of
sptes to collect. The lru of sptes to collect would still be
interesting for the shrinker method though (similar to dcache/inode
lru etc..).

This update also adds some locking so multiple subsystems can
register/unregister for the notifiers at any time (something that
previously had to be handled by design with external serialization,
which was effectively a bit fragile).

BTW, when MMU_NOTIFIER=n the kernel compile spawns a warning in
memory.c about two unused variables; I'm not sure if it's worth hiding
given I suppose most people will have MMU_NOTIFIER=y. One easy way to
avoid the warning is to move the mmu_notifier call out of line and to
have one function per notifier (which was already suggested by
Christoph as an icache optimization). But this implementation keeps the
patch smaller and quicker to improve for now...
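
For comparison, the out-of-line variant could look roughly like the
sketch below, reusing the mmu_notifier_head fields from the patch
further down; only invalidate_range is shown and each other method
would get a similar thin wrapper. This is not what is posted here:

/*
 * Sketch of one out-of-line hook per notifier method, as suggested by
 * Christoph: callers like ptep_clear_flush() would only pay a function
 * call, and the list walk would live in mm/mmu_notifier.c.
 */
void mmu_notifier_invalidate_range(struct mm_struct *mm,
                                   unsigned long start, unsigned long end)
{
        struct mmu_notifier *mn;
        struct hlist_node *n;

        if (likely(hlist_empty(&mm->mmu_notifier.head)))
                return;

        read_lock(&mm->mmu_notifier.lock);
        hlist_for_each_entry(mn, n, &mm->mmu_notifier.head, hlist)
                if (mn->ops->invalidate_range)
                        mn->ops->invalidate_range(mn, mm, start, end);
        read_unlock(&mm->mmu_notifier.lock);
}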

I'd like to know if this could possibly be merged soon and what I need
to change to make that happen. Thanks!

The kvm side of this can be found here:

http://marc.info/?l=kvm-devel&m=120091930324366&w=2
http://marc.info/?l=kvm-devel&m=120091906724000&w=2
http://marc.info/?l=kvm-devel&m=120091939024572&w=2

Signed-off-by: Andrea Arcangeli <[email protected]>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -44,8 +44,10 @@
({ \
int __young; \
__young = ptep_test_and_clear_young(__vma, __address, __ptep); \
- if (__young) \
+ if (__young) { \
flush_tlb_page(__vma, __address); \
+ mmu_notifier(age_page, (__vma)->vm_mm, __address); \
+ } \
__young; \
})
#endif
@@ -86,6 +88,7 @@ do { \
pte_t __pte; \
__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \
flush_tlb_page(__vma, __address); \
+ mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \
__pte; \
})
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/completion.h>
+#include <linux/mmu_notifier.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -219,6 +220,10 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+#ifdef CONFIG_MMU_NOTIFIER
+ struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
+#endif
};

#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,79 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier;
+
+struct mmu_notifier_ops {
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+ void (*age_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+ void (*invalidate_range)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+};
+
+struct mmu_notifier_head {
+ struct hlist_head head;
+ rwlock_t lock;
+};
+
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+#include <linux/mm_types.h>
+
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void mmu_notifier_release(struct mm_struct *mm);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+ INIT_HLIST_HEAD(&mnh->head);
+ rwlock_init(&mnh->lock);
+}
+
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ struct mmu_notifier *__mn; \
+ struct hlist_node *__n; \
+ \
+ if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+ read_lock(&(mm)->mmu_notifier.lock); \
+ hlist_for_each_entry(__mn, __n, \
+ &(mm)->mmu_notifier.head, \
+ hlist) \
+ if (__mn->ops->function) \
+ __mn->ops->function(__mn, \
+ mm, \
+ args); \
+ read_unlock(&(mm)->mmu_notifier.lock); \
+ } \
+ } while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+#define mmu_notifier_register(mn, mm) do {} while(0)
+#define mmu_notifier_unregister(mn, mm) do {} while (0)
+#define mmu_notifier_release(mm) do {} while (0)
+#define mmu_notifier_head_init(mmh) do {} while (0)
+
+#define mmu_notifier(function, mm, args...) \
+ do { } while (0)
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -359,6 +359,7 @@ static struct mm_struct * mm_init(struct

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_head_init(&mm->mmu_notifier);
return mm;
}
free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@ config VIRT_TO_BUS
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -753,6 +753,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+ mmu_notifier(invalidate_range, mm, start, end);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -889,6 +889,7 @@ unsigned long zap_page_range(struct vm_a
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+ mmu_notifier(invalidate_range, mm, address, end);
return end;
}

@@ -950,7 +951,8 @@ struct page *follow_page(struct vm_area_
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
set_page_dirty(page);
- mark_page_accessed(page);
+ if (!pte_young(pte))
+ set_pte_at(mm, address, ptep, pte_mkyoung(pte));
}
unlock:
pte_unmap_unlock(ptep, ptl);
@@ -1317,7 +1319,7 @@ int remap_pfn_range(struct vm_area_struc
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + PAGE_ALIGN(size);
+ unsigned long start = addr, end = addr + PAGE_ALIGN(size);
struct mm_struct *mm = vma->vm_mm;
int err;

@@ -1358,6 +1360,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range, mm, start, end);
return err;
}
EXPORT_SYMBOL(remap_pfn_range);
@@ -1441,7 +1444,7 @@ int apply_to_page_range(struct mm_struct
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + size;
+ unsigned long start = addr, end = addr + size;
int err;

BUG_ON(addr >= end);
@@ -1452,6 +1455,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range, mm, start, end);
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1747,6 +1747,7 @@ static void unmap_region(struct mm_struc
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+ mmu_notifier(invalidate_range, mm, start, end);
}

/*
@@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ mmu_notifier_release(mm);

/*
* Walk the list again, actually closing and freeing it,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,44 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+
+void mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *tmp;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ read_lock(&mm->mmu_notifier.lock);
+ hlist_for_each_entry_safe(mn, n, tmp,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ hlist_del(&mn->hlist);
+ }
+ read_unlock(&mm->mmu_notifier.lock);
+ }
+}
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ write_lock(&mm->mmu_notifier.lock);
+ hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
+ write_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ write_lock(&mm->mmu_notifier.lock);
+ hlist_del(&mn->hlist);
+ write_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);

2008-01-22 02:22:55

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v3

On Mon, 21 Jan 2008 13:52:04 +0100
Andrea Arcangeli <[email protected]> wrote:

> Signed-off-by: Andrea Arcangeli <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed.

2008-01-22 14:33:20

by Avi Kivity

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] mmu notifiers #v3

Andrea Arcangeli wrote:
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -44,8 +44,10 @@
> ({ \
> int __young; \
> __young = ptep_test_and_clear_young(__vma, __address, __ptep); \
> - if (__young) \
> + if (__young) { \
> flush_tlb_page(__vma, __address); \
> + mmu_notifier(age_page, (__vma)->vm_mm, __address); \
> + } \
> __young; \
> })
>

I think that unconditionally doing

__young |= mmu_notifier(test_and_clear_young, ...);

allows hardware with accessed bits more control over what is going on.

--
error compiling committee.c: too many arguments to function

2008-01-22 14:43:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] mmu notifiers #v3

On Tue, Jan 22, 2008 at 04:12:34PM +0200, Avi Kivity wrote:
> Andrea Arcangeli wrote:
>> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>> --- a/include/asm-generic/pgtable.h
>> +++ b/include/asm-generic/pgtable.h
>> @@ -44,8 +44,10 @@
>> ({ \
>> int __young; \
>> __young = ptep_test_and_clear_young(__vma, __address, __ptep); \
>> - if (__young) \
>> + if (__young) { \
>> flush_tlb_page(__vma, __address); \
>> + mmu_notifier(age_page, (__vma)->vm_mm, __address); \
>> + } \
>> __young; \
>> })
>>
>
> I think that unconditionally doing
>
> __young |= mmu_notifier(test_and_clear_young, ...);
>
> allows hardware with accessed bits more control over what is going on.

Agreed, likely it'll have to be mmu_notifier_age_page().

2008-01-22 19:29:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v3


On Mon, 2008-01-21 at 13:52 +0100, Andrea Arcangeli wrote:

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> new file mode 100644
> --- /dev/null
> +++ b/include/linux/mmu_notifier.h
> @@ -0,0 +1,79 @@
> +#ifndef _LINUX_MMU_NOTIFIER_H
> +#define _LINUX_MMU_NOTIFIER_H
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +struct mmu_notifier;
> +
> +struct mmu_notifier_ops {
> + void (*release)(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> + void (*age_page)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long address);
> + void (*invalidate_page)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long address);
> + void (*invalidate_range)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end);
> +};
> +
> +struct mmu_notifier_head {
> + struct hlist_head head;
> + rwlock_t lock;

spinlock_t lock;

I think we can get rid of this rwlock, as it will seriously
hurt larger machines.

> +};
> +
> +struct mmu_notifier {
> + struct hlist_node hlist;
> + const struct mmu_notifier_ops *ops;
> +};
> +
> +#include <linux/mm_types.h>
> +
> +extern void mmu_notifier_register(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +extern void mmu_notifier_unregister(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +extern void mmu_notifier_release(struct mm_struct *mm);
> +
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
> +{
> + INIT_HLIST_HEAD(&mnh->head);
> + rwlock_init(&mnh->lock);
> +}
> +
> +#define mmu_notifier(function, mm, args...) \
> + do { \
> + struct mmu_notifier *__mn; \
> + struct hlist_node *__n; \
> + \
> + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
> + read_lock(&(mm)->mmu_notifier.lock); \
rcu_read_lock();
> + hlist_for_each_entry(__mn, __n, \
hlist_for_each_entry_rcu
> + &(mm)->mmu_notifier.head, \
> + hlist) \
> + if (__mn->ops->function) \
> + __mn->ops->function(__mn, \
> + mm, \
> + args); \
> + read_unlock(&(mm)->mmu_notifier.lock); \
rcu_read_unlock();
> + } \
> + } while (0)
> +
> +#else /* CONFIG_MMU_NOTIFIER */
> +
> +#define mmu_notifier_register(mn, mm) do {} while(0)
> +#define mmu_notifier_unregister(mn, mm) do {} while (0)
> +#define mmu_notifier_release(mm) do {} while (0)
> +#define mmu_notifier_head_init(mmh) do {} while (0)
> +
> +#define mmu_notifier(function, mm, args...) \
> + do { } while (0)
> +
> +#endif /* CONFIG_MMU_NOTIFIER */
> +
> +#endif /* _LINUX_MMU_NOTIFIER_H */


> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/mmu_notifier.c
> @@ -0,0 +1,44 @@
> +/*
> + * linux/mm/mmu_notifier.c
> + *
> + * Copyright (C) 2008 Qumranet, Inc.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mmu_notifier.h>
> +#include <linux/module.h>
> +
> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> + struct mmu_notifier *mn;
> + struct hlist_node *n, *tmp;
> +
> + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> + read_lock(&mm->mmu_notifier.lock);
rcu_read_lock();
> + hlist_for_each_entry_safe(mn, n, tmp,
hlist_for_each_entry_safe_rcu
> + &mm->mmu_notifier.head, hlist) {
> + if (mn->ops->release)
> + mn->ops->release(mn, mm);
> + hlist_del(&mn->hlist);
hlist_del_rcu
> + }
> + read_unlock(&mm->mmu_notifier.lock);
rcu_read_unlock();
> + }
> +}
> +
> +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + write_lock(&mm->mmu_notifier.lock);
spin_lock
> + hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
hlist_add_head_rcu
> + write_unlock(&mm->mmu_notifier.lock);
spin_unlock
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_register);
> +
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + write_lock(&mm->mmu_notifier.lock);
spin_lock
> + hlist_del(&mn->hlist);
hlist_del_rcu
> + write_unlock(&mm->mmu_notifier.lock);
spin_unlock
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_unregister);


#define hlist_for_each_entry_safe(tpos, pos, n, head, member) \
for (pos = (head)->first; \
rcu_dereference(pos) && ({ n = pos->next; 1; }) && \
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
pos = n)
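
Put together, the writer side of the suggested RCU conversion might
look roughly like this. It is a sketch only, assuming the rwlock is
replaced by a plain spinlock as proposed above; the synchronize_rcu()
before the caller may free its notifier is an assumption, not something
taken from this thread:

void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
        spin_lock(&mm->mmu_notifier.lock);
        hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
        spin_unlock(&mm->mmu_notifier.lock);
}

void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
{
        spin_lock(&mm->mmu_notifier.lock);
        hlist_del_rcu(&mn->hlist);
        spin_unlock(&mm->mmu_notifier.lock);
        /* readers walk the list under rcu_read_lock(); wait for them to
           finish before the caller is allowed to free 'mn' */
        synchronize_rcu();
}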


2008-01-22 20:09:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] mmu notifiers #v4

This last update avoids the need to refresh the young bit in the linux
pte through follow_page and it allows tracking the accessed bits set
by the hardware in the sptes without requiring vmexits in certain
implementations.

KVM side is here:

http://marc.info/?l=kvm-devel&m=120103225508669&w=2

Signed-off-by: Andrea Arcangeli <[email protected]>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -46,6 +46,7 @@
__young = ptep_test_and_clear_young(__vma, __address, __ptep); \
if (__young) \
flush_tlb_page(__vma, __address); \
+ __young |= mmu_notifier_age_page((__vma)->vm_mm, __address); \
__young; \
})
#endif
@@ -86,6 +87,7 @@ do { \
pte_t __pte; \
__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \
flush_tlb_page(__vma, __address); \
+ mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \
__pte; \
})
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/completion.h>
+#include <linux/mmu_notifier.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -219,6 +220,10 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+#ifdef CONFIG_MMU_NOTIFIER
+ struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
+#endif
};

#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,82 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier;
+
+struct mmu_notifier_ops {
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+ int (*age_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+ void (*invalidate_range)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+};
+
+struct mmu_notifier_head {
+ struct hlist_head head;
+ rwlock_t lock;
+};
+
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+#include <linux/mm_types.h>
+
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+ INIT_HLIST_HEAD(&mnh->head);
+ rwlock_init(&mnh->lock);
+}
+
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ struct mmu_notifier *__mn; \
+ struct hlist_node *__n; \
+ \
+ if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+ read_lock(&(mm)->mmu_notifier.lock); \
+ hlist_for_each_entry(__mn, __n, \
+ &(mm)->mmu_notifier.head, \
+ hlist) \
+ if (__mn->ops->function) \
+ __mn->ops->function(__mn, \
+ mm, \
+ args); \
+ read_unlock(&(mm)->mmu_notifier.lock); \
+ } \
+ } while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+#define mmu_notifier_register(mn, mm) do {} while(0)
+#define mmu_notifier_unregister(mn, mm) do {} while (0)
+#define mmu_notifier_release(mm) do {} while (0)
+#define mmu_notifier_age_page(mm, address) ({ 0; })
+#define mmu_notifier_head_init(mmh) do {} while (0)
+
+#define mmu_notifier(function, mm, args...) \
+ do { } while (0)
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -359,6 +359,7 @@ static struct mm_struct * mm_init(struct

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_head_init(&mm->mmu_notifier);
return mm;
}
free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@ config VIRT_TO_BUS
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -753,6 +753,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+ mmu_notifier(invalidate_range, mm, start, end);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -889,6 +889,7 @@ unsigned long zap_page_range(struct vm_a
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+ mmu_notifier(invalidate_range, mm, address, end);
return end;
}

@@ -1317,7 +1318,7 @@ int remap_pfn_range(struct vm_area_struc
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + PAGE_ALIGN(size);
+ unsigned long start = addr, end = addr + PAGE_ALIGN(size);
struct mm_struct *mm = vma->vm_mm;
int err;

@@ -1358,6 +1359,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range, mm, start, end);
return err;
}
EXPORT_SYMBOL(remap_pfn_range);
@@ -1441,7 +1443,7 @@ int apply_to_page_range(struct mm_struct
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + size;
+ unsigned long start = addr, end = addr + size;
int err;

BUG_ON(addr >= end);
@@ -1452,6 +1454,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range, mm, start, end);
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1747,6 +1747,7 @@ static void unmap_region(struct mm_struc
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+ mmu_notifier(invalidate_range, mm, start, end);
}

/*
@@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ mmu_notifier_release(mm);

/*
* Walk the list again, actually closing and freeing it,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,68 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+
+void mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *tmp;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ read_lock(&mm->mmu_notifier.lock);
+ hlist_for_each_entry_safe(mn, n, tmp,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ hlist_del(&mn->hlist);
+ }
+ read_unlock(&mm->mmu_notifier.lock);
+ }
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *tmp;
+ int young = 0;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ read_lock(&mm->mmu_notifier.lock);
+ hlist_for_each_entry_safe(mn, n, tmp,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->age_page)
+ young |= mn->ops->age_page(mn, mm, address);
+ }
+ read_unlock(&mm->mmu_notifier.lock);
+ }
+
+ return young;
+}
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ write_lock(&mm->mmu_notifier.lock);
+ hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
+ write_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ write_lock(&mm->mmu_notifier.lock);
+ hlist_del(&mn->hlist);
+ write_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
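
For illustration, a minimal sketch of how a secondary-MMU driver could
plug into the interface above; the kvm_mn_* names are hypothetical
placeholders, not the real KVM code:

#include <linux/mmu_notifier.h>

static void kvm_mn_invalidate_page(struct mmu_notifier *mn,
                                   struct mm_struct *mm,
                                   unsigned long address)
{
        /* zap the shadow pte(s) mapping 'address' for this mm */
}

static void kvm_mn_invalidate_range(struct mmu_notifier *mn,
                                    struct mm_struct *mm,
                                    unsigned long start, unsigned long end)
{
        /* zap all shadow ptes in [start, end) */
}

static const struct mmu_notifier_ops kvm_mn_ops = {
        .invalidate_page        = kvm_mn_invalidate_page,
        .invalidate_range       = kvm_mn_invalidate_range,
};

static struct mmu_notifier kvm_mn = {
        .ops = &kvm_mn_ops,
};

/* call once for the mm whose pages the secondary mmu shadows */
static void kvm_mn_attach(struct mm_struct *mm)
{
        mmu_notifier_register(&kvm_mn, mm);
}

Once registered, the invalidate_range/invalidate_page calls added above in
zap_page_range, unmap_region and friends reach the driver whenever the VM
tears down mappings for that mm.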

2008-01-22 20:31:20

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v3

On Tue, 22 Jan 2008, Peter Zijlstra wrote:

> I think we can get rid of this rwlock as I think this will seriously
> hurt larger machines.

Correct.

2008-01-22 20:31:47

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v3

On Tue, Jan 22, 2008 at 08:28:47PM +0100, Peter Zijlstra wrote:
> I think we can get rid of this rwlock as I think this will seriously
> hurt larger machines.

Yep, I initially considered it; nevertheless, given you solved part of
the complication, I can add it now ;). The only technical reason for
not using RCU is if certain users of the notifiers register and
unregister at high frequency through objects that may need to be
freed quickly.

I can tell the KVM usage of the mmu notifiers is certainly fine with RCU.
Then I will have to update KVM so that it will free the kvm structure
after waiting a quiescent point to avoid kernel crashing memory
corruption after applying your changes to the mmu notifier.
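
(In sketch form, that quiescent-point wait would look roughly like this once
the notifier list is walked under RCU; apart from the mmu_notifier calls the
names here are made up:)

static void secondary_mmu_teardown(struct mmu_notifier *mn,
                                   struct mm_struct *mm, void *container)
{
        mmu_notifier_unregister(mn, mm);
        synchronize_rcu();      /* wait for readers still walking the list */
        kfree(container);       /* only now free the object embedding 'mn' */
}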

Thanks!

2008-01-22 20:34:56

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Tue, 22 Jan 2008, Andrea Arcangeli wrote:

> This last update avoids the need to refresh the young bit in the linux
> pte through follow_page and it allows tracking the accessed bits set
> by the hardware in the sptes without requiring vmexits in certain
> implementations.

The problem that I have with this is still that there is no way to sleep
while running the notifier. We need to invalidate mappings on a remote
instance of linux. This means sending out a message and waiting for reply
before the local page is unmapped. So I reworked Andrea's early patch and
came up with this one:


Export notifiers: Callbacks for external references to VM pages

This patch provides callbacks for external subsystems (such as DMA engines,
RDMA, XPMEM) that allow these subsystems to know when the VM is unmapping
pages. It is loosely based on Andrea's mmu_ops patch #2.

A subsystem can export pages by

1. Marking a series of pages using SetPageExported()

This means callbacks will occur if these pages are individually unmapped.

2. Marking a mm_struct with MMF_EXPORTED

This results in callbacks when the process exits or when ranges in the
mm structs are removed. Pages marked with PageExported() must have at
least one mm with MMF_EXPORTED set. Through the release callbacks the
subsystem can determine that no pages are exported anymore and then
terminate operations.

Callbacks in general are made early before any spinlocks are taken. This
means it is possible to communicate with another instance of Linux that
may have cross mapped the page using f.e. PFN_MAP and only proceed to
unmap the page after confirmation has been received that the remote
side has removed the pte.

This also allows the user of the export notifier to utilize its
own reverse maps to find external ptes that need to be cleaned.
(There is nothing that prohibits the use of the anon/inode
rmaps either. The exporter may drop the locks and rescan
the list in its own scanning of the lists if needed).

Issues with mmu_ops #2

- Notifiers are called *after* we tore down ptes. At that point pages
may already have been freed and reused. This means that there can
still be uses of the page by the user of mmu_ops after the OS has
dropped its mapping. IMHO the foreign entity needs to drop its
mappings first. That also ensures that the entities operated
upon continue to exist.

- anon_vma/inode and pte locks are held during callbacks.

Signed-off-by: Christoph Lameter <[email protected]>

---
include/linux/export_notifier.h | 80 ++++++++++++++++++++++++++++++++++++++++
include/linux/page-flags.h | 9 ++++
include/linux/sched.h | 1
mm/Kconfig | 6 +++
mm/Makefile | 1
mm/export_notifier.c | 15 +++++++
mm/hugetlb.c | 3 +
mm/memory.c | 10 +++++
mm/mmap.c | 7 +++
mm/mprotect.c | 4 ++
mm/rmap.c | 10 +++++
11 files changed, 146 insertions(+)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2008-01-18 12:13:10.195895516 -0800
+++ linux-2.6/mm/memory.c 2008-01-18 13:41:11.650547393 -0800
@@ -59,6 +59,7 @@

#include <linux/swapops.h>
#include <linux/elf.h>
+#include <linux/export_notifier.h>

#ifndef CONFIG_NEED_MULTIPLE_NODES
/* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1349,6 +1350,9 @@ int remap_pfn_range(struct vm_area_struc

vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;

+ if (mm->flags & MMF_EXPORTED)
+ export_notifier(invalidate_process_range, mm, addr, end);
+
BUG_ON(addr >= end);
pfn -= addr >> PAGE_SHIFT;
pgd = pgd_offset(mm, addr);
@@ -1447,6 +1451,10 @@ int apply_to_page_range(struct mm_struct
int err;

BUG_ON(addr >= end);
+
+ if (mm->flags & MMF_EXPORTED)
+ export_notifier(invalidate_process_range, mm, addr, end);
+
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
@@ -1878,6 +1886,8 @@ void unmap_mapping_range(struct address_
details.last_index = ULONG_MAX;
details.i_mmap_lock = &mapping->i_mmap_lock;

+ export_notifier(invalidate_mapping_range, mapping, holebegin, holelen);
+
spin_lock(&mapping->i_mmap_lock);

/* Protect against endless unmapping loops */
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c 2007-11-09 14:48:48.394445680 -0800
+++ linux-2.6/mm/mprotect.c 2008-01-18 14:34:39.241295334 -0800
@@ -21,6 +21,7 @@
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/export_notifier.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
@@ -269,6 +270,9 @@ sys_mprotect(unsigned long start, size_t
if (start > vma->vm_start)
prev = vma;

+ if (current->mm->flags & MMF_EXPORTED)
+ export_notifier(invalidate_process_range, current->mm, start, end);
+
for (nstart = start ; ; ) {
unsigned long newflags;

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c 2008-01-07 14:21:02.072452263 -0800
+++ linux-2.6/mm/rmap.c 2008-01-18 13:41:11.670547733 -0800
@@ -49,6 +49,7 @@
#include <linux/rcupdate.h>
#include <linux/module.h>
#include <linux/kallsyms.h>
+#include <linux/export_notifier.h>

#include <asm/tlbflush.h>

@@ -356,6 +357,9 @@ static int page_referenced_file(struct p
*/
BUG_ON(!PageLocked(page));

+ if (unlikely(PageExported(page)))
+ export_notifier(determine_references, page, &referenced);
+
spin_lock(&mapping->i_mmap_lock);

/*
@@ -469,6 +473,9 @@ int page_mkclean(struct page *page)

BUG_ON(!PageLocked(page));

+ if (unlikely(PageExported(page)))
+ export_notifier(invalidate_page, page);
+
if (page_mapped(page)) {
struct address_space *mapping = page_mapping(page);
if (mapping) {
@@ -966,6 +973,9 @@ int try_to_unmap(struct page *page, int

BUG_ON(!PageLocked(page));

+ if (unlikely(PageExported(page)))
+ export_notifier(invalidate_page, page);
+
if (PageAnon(page))
ret = try_to_unmap_anon(page, migration);
else
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2008-01-07 14:21:02.044451642 -0800
+++ linux-2.6/mm/Kconfig 2008-01-18 13:41:11.682547939 -0800
@@ -193,3 +193,9 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config EXPORT_NOTIFIER
+ def_bool y
+ depends on 64BIT
+ bool "Export Notifier for notifying subsystems about changes to page mappings"
+
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2007-11-27 17:29:40.152962721 -0800
+++ linux-2.6/mm/Makefile 2008-01-18 13:41:11.686548010 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_EXPORT_NOTIFIER) += export_notifier.o

Index: linux-2.6/mm/export_notifier.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/export_notifier.c 2008-01-18 14:05:28.551481181 -0800
@@ -0,0 +1,15 @@
+#include <linux/export_notifier.h>
+
+LIST_HEAD(export_notifier_list);
+
+void export_notifier_register(struct export_notifier *em)
+{
+ list_add(&em->list, &export_notifier_list);
+}
+
+void export_notifier_unregister(struct export_notifier *em)
+{
+ list_del(&em->list);
+}
+
+
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c 2008-01-14 12:55:46.667413772 -0800
+++ linux-2.6/mm/hugetlb.c 2008-01-18 13:41:11.722548622 -0800
@@ -776,6 +776,9 @@ void unmap_hugepage_range(struct vm_area
* do nothing in this case.
*/
if (vma->vm_file) {
+ if (vma->vm_mm->flags & MMF_EXPORTED)
+ export_notifier(invalidate_process_range, vma->vm_mm, start, end);
+
spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
__unmap_hugepage_range(vma, start, end);
spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-01-07 14:21:02.060451804 -0800
+++ linux-2.6/mm/mmap.c 2008-01-18 13:41:11.730548760 -0800
@@ -26,6 +26,7 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/export_notifier.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -1739,6 +1740,9 @@ static void unmap_region(struct mm_struc
struct mmu_gather *tlb;
unsigned long nr_accounted = 0;

+ if (mm->flags & MMF_EXPORTED)
+ export_notifier(invalidate_process_range, mm, start, end);
+
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
@@ -2044,6 +2048,9 @@ void exit_mmap(struct mm_struct *mm)
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);

+ if (mm->flags & MMF_EXPORTED)
+ export_notifier(release, mm);
+
/*
* Walk the list again, actually closing and freeing it,
* with preemption enabled, without holding any MM locks.
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2007-11-08 13:50:46.427607787 -0800
+++ linux-2.6/include/linux/page-flags.h 2008-01-18 13:41:11.742548966 -0800
@@ -105,6 +105,7 @@
* 64 bit | FIELDS | ?????? FLAGS |
* 63 32 0
*/
+#define PG_exported 30 /* Page is referenced by something not in the rmaps */
#define PG_uncached 31 /* Page has been mapped as uncached */
#endif

@@ -260,6 +261,14 @@ static inline void __ClearPageTail(struc
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)

+#ifdef CONFIG_EXPORT_NOTIFIER
+#define PageExported(page) test_bit(PG_exported, &(page)->flags)
+#define SetPageExported(page) set_bit(PG_exported, &(page)->flags)
+#define ClearPageExported(page) clear_bit(PG_exported, &(page)->flags)
+#else
+#define PageExported(page) 0
+#endif
+
struct page; /* forward declaration */

extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2008-01-14 12:55:46.591413423 -0800
+++ linux-2.6/include/linux/sched.h 2008-01-18 13:41:11.754549168 -0800
@@ -372,6 +372,7 @@ extern int get_dumpable(struct mm_struct
(((1 << MMF_DUMP_FILTER_BITS) - 1) << MMF_DUMP_FILTER_SHIFT)
#define MMF_DUMP_FILTER_DEFAULT \
((1 << MMF_DUMP_ANON_PRIVATE) | (1 << MMF_DUMP_ANON_SHARED))
+#define MMF_EXPORTED 7 /* mm struct has externally referenced pages */

struct sighand_struct {
atomic_t count;
Index: linux-2.6/include/linux/export_notifier.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/export_notifier.h 2008-01-18 14:31:59.654580722 -0800
@@ -0,0 +1,80 @@
+#ifndef _LINUX_EXPORT_NOTIFIER_H
+#define _LINUX_EXPORT_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/mm_types.h>
+
+struct export_notifier {
+ struct list_head list;
+ const struct export_notifier_ops *ops;
+};
+
+struct address_space;
+
+struct export_notifier_ops {
+ /*
+ * Called when exiting a process to release all resources
+ */
+ void (*release)(struct export_notifier *em, struct mm_struct *mm);
+
+ /*
+ * Called with the page lock held before ptes are modified or removed.
+ *
+ * Must clear PageExported()
+ */
+ void (*invalidate_page)(struct export_notifier *em, struct page *page);
+
+ /*
+ * Called with mmap_sem held before a range of ptes is modified or removed.
+ *
+ * Must clear all PageExported for all pages in the range
+ */
+ void (*invalidate_process_range)(struct export_notifier *em,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+
+ /*
+ * Called before the pages in mapping are removed.
+ *
+ * Must clear all PageExported for all pages in the range that
+ * are owned by the subsystem.
+ */
+ void (*invalidate_mapping_range)(struct export_notifier *em,
+ struct address_space *mapping,
+ unsigned long start, unsigned long end);
+
+ /*
+ * Determine the external references held to a page and
+ * increment the int pointed to by that amount.
+ */
+ void (*determine_references)(struct export_notifier *em,
+ struct page *page, int *references);
+};
+
+#ifdef CONFIG_EXPORT_NOTIFIER
+
+extern void export_notifier_register(struct export_notifier *em);
+extern void export_notifier_unregister(struct export_notifier *em);
+extern void export_notifier_release(struct export_notifier *em);
+
+extern struct list_head export_notifier_list;
+
+#define export_notifier(function, args...) \
+ do { \
+ struct export_notifier *__em; \
+ \
+ list_for_each_entry(__em, &export_notifier_list, list) \
+ if (__em->ops->function) \
+ __em->ops->function(__em, args); \
+ } while (0);
+
+#else
+
+#define export_notifier(function, args...)
+
+static inline void export_notifier_register(struct export_notifier *em) {}
+static inline void export_notifier_unregister(struct export_notifier *em) {}
+
+#endif
+
+#endif /* _LINUX_EXPORT_NOTIFIER_H */
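
For illustration, a rough sketch (not part of the patch) of how a subsystem
might use these hooks; the my_* names are invented, and MMF_EXPORTED is set
the same way the MMF_DUMP* bits are:

#include <linux/export_notifier.h>
#include <linux/mm.h>
#include <linux/sched.h>

static void my_invalidate_page(struct export_notifier *em, struct page *page)
{
        /* tear down all external references to 'page', then ... */
        ClearPageExported(page);
        put_page(page);         /* drop the reference taken when exporting */
}

static const struct export_notifier_ops my_ops = {
        .invalidate_page = my_invalidate_page,
};

static struct export_notifier my_notifier = {
        .ops = &my_ops,
};

static int __init my_init(void)
{
        export_notifier_register(&my_notifier);
        return 0;
}

/* export one page belonging to 'mm' to a remote entity */
static void my_export_page(struct mm_struct *mm, struct page *page)
{
        set_bit(MMF_EXPORTED, &mm->flags);      /* enable the per-mm callouts */
        get_page(page);                         /* represent the external reference */
        SetPageExported(page);                  /* enable the per-page callouts */
}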

2008-01-22 22:13:27

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mmu notifiers #v3

On Tue, 22 Jan 2008, Andrea Arcangeli wrote:
>
> Then I will have to update KVM so that it will free the kvm structure
> after waiting a quiescent point to avoid kernel crashing memory
> corruption after applying your changes to the mmu notifier.

It may not be suitable (I've not looked into your needs), but consider
SLAB_DESTROY_BY_RCU: it might give you the easiest way to do that.

Hugh

2008-01-22 22:31:56

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Hi Christoph,

Just a few early comments.

First it makes me optimistic this can be merged sooner than later to
see a second brand new implementation of this ;).

On Tue, Jan 22, 2008 at 12:34:46PM -0800, Christoph Lameter wrote:
> On Tue, 22 Jan 2008, Andrea Arcangeli wrote:
>
> > This last update avoids the need to refresh the young bit in the linux
> > pte through follow_page and it allows tracking the accessed bits set
> > by the hardware in the sptes without requiring vmexits in certain
> > implementations.
>
> The problem that I have with this is still that there is no way to sleep
> while running the notifier. We need to invalidate mappings on a remote
> instance of linux. This means sending out a message and waiting for reply
> before the local page is unmapped. So I reworked Andrea's early patch and
> came up with this one:

I guess you missed a problem in unmapping the secondary mmu before the
core linux pte is cleared with a zero-locking window in between the
two operations. The spte may be instantiated again by a
vmexit/secondary-pagefault in another cpu during the zero-locking
window (zero locking is zero locking, anything can run in the other
cpus, so not exactly sure how you plan to fix that nasty subtle spte
leak if you insist calling the mmu_notifier invalidates _before_
instead of _after_ ;). All spte invalidates should happen _after_
dropping the main linux pte not before, or you never know what else is
left mapped in the secondary mmu by the time the linux pte is finally
cleared.

With a non-present linux pte, the VM won't call into try_to_unmap
anymore and the page will remain pinned in ram forever without any
chance to free it anymore until the spte is freed for other reasons
(VM pressure not included in the other reasons :( ).

> Issues with mmu_ops #2
>
> - Notifiers are called *after* we tore down ptes. At that point pages
> may already have been freed and reused. [..]

Wait, you should always represent the external reference in the page
count just like we do every time we map the page in a linux pte! If
you worry about that, that's your fault I'm afraid.

> [..] This means that there can
> still be uses of the page by the user of mmu_ops after the OS has
> dropped its mapping. IMHO the foreign entity needs to drop its
> mappings first. That also ensures that the entities operated
> upon continue to exist.
>
> - anon_vma/inode and pte locks are held during callbacks.

In a previous email I asked what's wrong in offloading the event, and
instead of answering you did your own thing that apparently would leak
memory-pins in a hardly fixable way. Chances are your latency in sending
the event won't be too low, so if you cluster the invalidates in a
single packet perhaps you're a bit faster anyway. You just have to fix
your reference counting so you stop risking corrupting ram at the
first missing notifier (and you're missing some already, I know the
invalidate_page in do_wp_page for example is already used by the KVM
sharing code, and for you missing a single notifier means memory
corruption because you don't bump the page count to represent the
external reference).

> @@ -966,6 +973,9 @@ int try_to_unmap(struct page *page, int
>
> BUG_ON(!PageLocked(page));
>
> + if (unlikely(PageExported(page)))
> + export_notifier(invalidate_page, page);
> +

Passing the page here will complicate things especially for shared
pages across different VM that are already working in KVM. For non
shared pages we could cache the userland mapping address in
page->private but it's a kludge only working for non-shared
pages. Walking twice the anon_vma lists when only a single walk is
needed sounds very backwards for KVM purposes, at least as long
as we keep a hva->multiple_gfn design, which is quite elegant so far given
qemu has to access the ram in the memslots too.

> if (PageAnon(page))
> ret = try_to_unmap_anon(page, migration);
> else

Besides the pinned pages ram leak by having the zero locking window
above I'm curious how you are going to take care of the finegrined
aging that I'm doing with the accessed bit set by hardware in the spte
with your coarse export_notifier(invalidate_page) called
unconditionally before checking any young bit at all.

Look how clean it is to hook asm-generic/pgtable.h in my last patch
compared to the above leaking code expanded all over the place in the
mm/*.c, unnecessary mangling of atomic bitflags in the page struct,
etc...

> +config EXPORT_NOTIFIER
> + def_bool y
> + depends on 64BIT

?

> + bool "Export Notifier for notifying subsystems about changes to page mappings"

The word "export notifier" isn't very insightful to me, it doesn't
even give an hint we're in the memory management area. If you don't
like mmu notifier name I don't mind changing it, but I doubt export
notifier is a vast naming improvement. Infact it looks one of those
names like RCU that don't tell much of what is really going on
(there's no copy 99% of time in RCU).

> +LIST_HEAD(export_notifier_list);

A global list is not ok IMHO, it's really bad to have a O(N) (N number
of mm in the system) complexity here when it's so trivial to go O(1)
like in my code. We want to swap 100% of the VM exactly so we can have
zillon of idle (or sigstopped) VM on the same system.

In fact, initially I wondered for quite a long while whether it was better
to register in the mm or the vma; now in kvm registering in the mm is a
lot simpler, even if perhaps it might be possible to save a few cycles
per page-invalidate with the mm. But it's definitely not a complexity
issue to have it in the mm at least for KVM (the number of memslots is
very limited and not in function of the VM size, furthermore it can be
made O(log(N)) quite easily if really interesting and it avoids
creating a 1:1 identity between post-vma-merges and memslots).

Thanks a lot!

2008-01-22 22:53:25

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Tue, 22 Jan 2008, Andrea Arcangeli wrote:

> First it makes me optimistic this can be merged sooner than later to
> see a second brand new implementation of this ;).

Brand new? Well this is borrowing as much as possible from you....

> > The problem that I have with this is still that there is no way to sleep
> > while running the notifier. We need to invalidate mappings on a remote
> > instance of linux. This means sending out a message and waiting for reply
> > before the local page is unmapped. So I reworked Andrea's early patch and
> > came up with this one:
>
> I guess you missed a problem in unmapping the secondary mmu before the
> core linux pte is cleared with a zero-locking window in between the
> two operations. The spte may be instantiated again by a
> vmexit/secondary-pagefault in another cpu during the zero-locking
> window (zero locking is zero locking, anything can run in the other
> cpus, so not exactly sure how you plan to fix that nasty subtle spte
> leak if you insist calling the mmu_notifier invalidates _before_
> instead of _after_ ;). All spte invalidates should happen _after_
> dropping the main linux pte not before, or you never know what else is
> left mapped in the secondary mmu by the time the linux pte is finally
> cleared.

spte is the remote pte in my scheme right? The linux instance with the
secondary mmu must call back to the exporting machine in order to
reinstantiate the page. PageExported is cleared in invalidate_page() so
the other linux instance will be told that the page is not available.

> > - Notifiers are called *after* we tore down ptes. At that point pages
> > may already have been freed and reused. [..]
>
> Wait, you should always represent the external reference in the page
> count just like we do every time we map the page in a linux pte! If
> you worry about that, that's your fault I'm afraid.

Ahhh. Good to hear. But we will still end in a situation where only
the remote ptes point to the page. Maybe the remote instance will dirty
the page at that point?

> > - anon_vma/inode and pte locks are held during callbacks.
>
> In a previous email I asked what's wrong in offloading the event, and

We have internally discussed the possibility of offloading the event but
that wont work with the existing callback since we would have to
perform atomic allocation and there may be thousands of external
references to a page.

> sharing code, and for you missing a single notifier means memory
> corruption because you don't bump the page count to represent the
> external reference).

The approach with the export notifier is page based not based on the
mm_struct. We only need a single page count for a page that is exported to
a number of remote instances of linux. The page count is dropped when all
the remote instances have unmapped the page.


> > @@ -966,6 +973,9 @@ int try_to_unmap(struct page *page, int
> >
> > BUG_ON(!PageLocked(page));
> >
> > + if (unlikely(PageExported(page)))
> > + export_notifier(invalidate_page, page);
> > +
>
> Passing the page here will complicate things especially for shared
> pages across different VM that are already working in KVM. For non

How?

> shared pages we could cache the userland mapping address in
> page->private but it's a kludge only working for non-shared
> pages. Walking twice the anon_vma lists when only a single walk is

There is only the need to walk twice for pages that are marked Exported.
And the double walk is only necessary if the exporter does not have its
own rmap. The cross partition thing that we are doing has such an rmap and
its a matter of walking the exporters rmap to clear out the external
references and then we walk the local rmaps. All once.

> Besides the pinned pages ram leak by having the zero locking window
> above I'm curious how you are going to take care of the finegrined
> aging that I'm doing with the accessed bit set by hardware in the spte

I think I explained that above. Remote users effectively are forbidden to
establish new references to the page by the clearing of the exported bit.

> with your coarse export_notifier(invalidate_page) called
> unconditionally before checking any young bit at all.

The export notifier is called only if the mm_struct or page bit for
exporting is set. Maybe I missed to add a check somewhere?

> Look how clean it is to hook asm-generic/pgtable.h in my last patch
> compared to the above leaking code expanded all over the place in the
> mm/*.c, unnecessary mangling of atomic bitflags in the page struct,
> etc...

I think that hunk is particularly bad in your patch. A notification side
event in a macro? You would want that explicitly in the code.

> > + bool "Export Notifier for notifying subsystems about changes to page mappings"
>
> The word "export notifier" isn't very insightful to me, it doesn't
> even give an hint we're in the memory management area. If you don't
> like mmu notifier name I don't mind changing it, but I doubt export
> notifier is a vast naming improvement. Infact it looks one of those
> names like RCU that don't tell much of what is really going on
> (there's no copy 99% of time in RCU).

What we are doing is effectively allowing external references to pages.
This is outside of the regular VM operations. So export came up but we
could call it something else. External? Its not really tied to the mmu
now.

>
> > +LIST_HEAD(export_notifier_list);
>
> A global list is not ok IMHO, it's really bad to have a O(N) (N number
> of mm in the system) complexity here when it's so trivial to go O(1)
> like in my code. We want to swap 100% of the VM exactly so we can have
> zillon of idle (or sigstopped) VM on the same system.

There will only be one or two of those notifiers. There is no need to
build long lists of mm_structs like in your patch.

The mm_struct is not available at the point of my callbacks. There is no
way to do a callback that is mm_struct based if you are not scanning the
reverse list. And scanning the reverse list requires taking locks.

2008-01-22 23:38:08

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1


On Tue, 2008-01-22 at 12:34 -0800, Christoph Lameter wrote:
>
> - Notifiers are called *after* we tore down ptes. At that point pages
> may already have been freed and reused. This means that there can
> still be uses of the page by the user of mmu_ops after the OS has
> dropped its mapping. IMHO the foreign entity needs to drop its
> mappings first. That also ensures that the entities operated
> upon continue to exist.

That's definitely an issue. Maybe having the foreign entity get a
reference to the page and drop it when it unmaps would help ?

> - anon_vma/inode and pte locks are held during callbacks.

So how does that fix the problem of sleeping then ?

Ben.

2008-01-23 00:40:59

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, 23 Jan 2008, Benjamin Herrenschmidt wrote:

> > - anon_vma/inode and pte locks are held during callbacks.
>
> So how does that fix the problem of sleeping then ?

The locks are taken in the mmu_ops patch. This patch does not hold them
while performing the callbacks.

2008-01-23 01:22:00

by Robin Holt

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Tue, Jan 22, 2008 at 04:40:50PM -0800, Christoph Lameter wrote:
> On Wed, 23 Jan 2008, Benjamin Herrenschmidt wrote:
>
> > > - anon_vma/inode and pte locks are held during callbacks.
> >
> > So how does that fix the problem of sleeping then ?
>
> The locks are taken in the mmu_ops patch. This patch does not hold them
> while performing the callbacks.

Let me start by clarifying: the page is referenced prior to exporting,
and that reference is not removed until after recall is complete and
memory protections are back to normal.

As Christoph pointed out, the mmu_ops callouts do not allow sleeping.
This is a problem for us as our recall path includes a message to one or
more other hosts and a wait until we receive a response. That message
sequence can take seconds or more to complete. It includes an operation
to ensure the memory is in a cross-partition clean state and then changes
memory protection. When that is complete we remove our page reference
and return.

Christoph's patch allows that long, slow activity to happen prior to the
mmu_ops callout. By the time the mmu_ops callout is made, we are no longer
exporting the page, so the cleanup is equivalent to the cleanup of
a page we have never used.
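
(A rough sketch of that recall sequence, shaped as the sleeping
invalidate_page callout of the export-notifier patch; xpmem_send_recall is
an invented helper name:)

static void xpmem_invalidate_page(struct export_notifier *em,
                                  struct page *page)
{
        DECLARE_COMPLETION_ONSTACK(done);

        xpmem_send_recall(page, &done); /* message the remote partition(s) */
        wait_for_completion(&done);     /* may sleep for seconds */
        ClearPageExported(page);
        put_page(page);                 /* drop the reference taken at export */
}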

Thanks,
Robin Holt

2008-01-23 10:28:16

by Avi Kivity

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Christoph Lameter wrote:
> Ahhh. Good to hear. But we will still end in a situation where only
> the remote ptes point to the page. Maybe the remote instance will dirty
> the page at that point?
>
>

When the spte is dropped, its dirty bit is transferred to the page.
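
(In sketch form, at spte teardown time, something along these lines; the
names are illustrative, not the literal KVM code:)

        if (old_spte & PT_DIRTY_MASK)   /* hardware dirtied it via the spte */
                set_page_dirty(page);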
>
>> sharing code, and for you missing a single notifier means memory
>> corruption because you don't bump the page count to represent the
>> external reference).
>>
>
> The approach with the export notifier is page based not based on the
> mm_struct. We only need a single page count for a page that is exported to
> a number of remote instances of linux. The page count is dropped when all
> the remote instances have unmapped the page.
>

That won't work for kvm. If we have a hundred virtual machines, that
means 99 no-op notifications.

Also, our rmap key for finding the spte is keyed on (mm, va). I imagine
most RDMA cards are similar.
>
>
>>> @@ -966,6 +973,9 @@ int try_to_unmap(struct page *page, int
>>>
>>> BUG_ON(!PageLocked(page));
>>>
>>> + if (unlikely(PageExported(page)))
>>> + export_notifier(invalidate_page, page);
>>> +
>>>
>> Passing the page here will complicate things especially for shared
>> pages across different VM that are already working in KVM. For non
>>
>
> How?
>
>
>> shared pages we could cache the userland mapping address in
>> page->private but it's a kludge only working for non-shared
>> pages. Walking twice the anon_vma lists when only a single walk is
>>
>
> There is only the need to walk twice for pages that are marked Exported.
> And the double walk is only necessary if the exporter does not have its
> own rmap. The cross partition thing that we are doing has such an rmap and
> its a matter of walking the exporters rmap to clear out the external
> references and then we walk the local rmaps. All once.
>

The problem is that external mmus need a reverse mapping structure to
locate their ptes. We can't expand struct page so we need to base it on
mm + va.

>
>> Besides the pinned pages ram leak by having the zero locking window
>> above I'm curious how you are going to take care of the finegrined
>> aging that I'm doing with the accessed bit set by hardware in the spte
>>
>
> I think I explained that above. Remote users effectively are forbidden to
> establish new references to the page by the clearing of the exported bit.
>
>

Can they wait on that bit?


--
error compiling committee.c: too many arguments to function

2008-01-23 10:52:57

by Robin Holt

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, Jan 23, 2008 at 12:27:57PM +0200, Avi Kivity wrote:
>> The approach with the export notifier is page based not based on the
>> mm_struct. We only need a single page count for a page that is exported to
>> a number of remote instances of linux. The page count is dropped when all
>> the remote instances have unmapped the page.
>
> That won't work for kvm. If we have a hundred virtual machines, that means
> 99 no-op notifications.

But 100 callouts holding spinlocks will not work for our implementation
and even if the callouts are made with spinlocks released, we would very
strongly prefer a single callout which messages the range to the other
side.

> Also, our rmap key for finding the spte is keyed on (mm, va). I imagine
> most RDMA cards are similar.

For our RDMA rmap, it is based upon physical address.

>> There is only the need to walk twice for pages that are marked Exported.
>> And the double walk is only necessary if the exporter does not have its
>> own rmap. The cross partition thing that we are doing has such an rmap and
>> its a matter of walking the exporters rmap to clear out the external
>> references and then we walk the local rmaps. All once.
>>
>
> The problem is that external mmus need a reverse mapping structure to
> locate their ptes. We can't expand struct page so we need to base it on mm
> + va.

Our rmap takes a physical address and turns it into mm+va.

> Can they wait on that bit?

PageLocked(page) should work, right? We already have a backoff
mechanism so we expect to be able to adapt it to include a
PageLocked(page) check.


Thanks,
Robin

2008-01-23 11:42:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Hi Christoph,

On Tue, Jan 22, 2008 at 02:53:12PM -0800, Christoph Lameter wrote:
> On Tue, 22 Jan 2008, Andrea Arcangeli wrote:
>
> > First it makes me optimistic this can be merged sooner than later to
> > see a second brand new implementation of this ;).
>
> Brand new? Well this is borrowing as much as possible from you....

I said "new", not "different" ;).

> > > The problem that I have with this is still that there is no way to sleep
> > > while running the notifier. We need to invalidate mappings on a remote
> > > instance of linux. This means sending out a message and waiting for reply
> > > before the local page is unmapped. So I reworked Andrea's early patch and
> > > came up with this one:
> >
> > I guess you missed a problem in unmapping the secondary mmu before the
> > core linux pte is cleared with a zero-locking window in between the
> > two operations. The spte may be instantiated again by a
> > vmexit/secondary-pagefault in another cpu during the zero-locking
> > window (zero locking is zero locking, anything can run in the other
> > cpus, so not exactly sure how you plan to fix that nasty subtle spte
> > leak if you insist calling the mmu_notifier invalidates _before_
> > instead of _after_ ;). All spte invalidates should happen _after_
> > dropping the main linux pte not before, or you never know what else is
> > left mapped in the secondary mmu by the time the linux pte is finally
> > cleared.
>
> spte is the remote pte in my scheme right? [..]

yes.

> [..] The linux instance with the
> secondary mmu must call back to the exporting machine in order to
> reinstantiate the page. PageExported is cleared in invalidate_page() so
> the other linux instance will be told that the page is not available.

"Page is not available" is not a valid answer. At least with KVM there
are two possible ways:

1) the remote instance will have to wait for the linux pte to go away
before calling follow_page (then the page is gone as a whole so
there won't be any more page flags to check)
2) it will kill the VM

Nothing runs SetPageExported in your VM code, so I can't see how the
remote instance can know when it can safely call follow_page again on
the master node.

The remote instance is like a secondary TLB; what you're doing in your
code is as backwards as flushing the TLB _before_ clearing the PTE! If
you want to call the secondary tlb flush outside locks we can argue
about that, but I think you should do that _after_ clearing the linux
pte IMHO. Otherwise you may as well move the tlb_flush_page before
clearing the pte and you'll run into the same amount of smp races for
the master MMU too.

> Ahhh. Good to hear. But we will still end in a situation where only
> the remote ptes point to the page. Maybe the remote instance will dirty
> the page at that point?

If you flush the remote instance _after_ clearing the main linux PTE,
like I suggest and like I'm doing in my patch, I can't see how you
could risk ending up in a situation with only the remote ptes pointing
to the page; that's the whole point of doing the remote-TLB flush _after_
clearing the main linux pte, instead of before like in your patch.

This is the same as the tlb flush; there's a reason we normally do:

pte_clear()
flush_tlb_page()
__free_page()

instead of:

flush_tlb_page()
pte_clear()
__free_page()

The ordering you're implementing is backwards and unnatural, you can
try to serialize it with explicit locking to block the "remote-tlb
refills" through page bitflags (something not doable with the core
master tlb because the refills are done by the hardware with the
master tlb), but it'll remain unnatural and backwards IMHO.
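
Concretely, the ordering being argued for would look like this sketch (not
literal kernel code, using the notifier from the patch at the top of the
thread):

        pte = ptep_clear_flush(vma, address, ptep);     /* clear linux pte + flush master TLB */
        mmu_notifier(invalidate_page, mm, address);     /* only now zap the secondary/shadow pte */
        page_remove_rmap(page, vma);
        put_page(page);                                 /* nothing maps the page anymore */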

> > > - anon_vma/inode and pte locks are held during callbacks.
> >
> > In a previous email I asked what's wrong in offloading the event, and
>
> We have internally discussed the possibility of offloading the event but
> that wont work with the existing callback since we would have to
> perform atomic allocation and there may be thousands of external
> references to a page.

You should however also consider a rearming tasklet/softirq invoked by
ksoftirqd: if memory allocation fails you retry later. Furthermore you
should not require thousands of simultaneous allocations anyway;
you're running in the PF_MEMALLOC path and your memory will come from
the precious PF_MEMALLOC pool in the objrmap paths! If you ever
attempt what you suggest (simultaneous allocation of thousands of
packets to simultaneously notify the thousands of external references),
then depending on the size of the allocation you can instantly deadlock
regardless of whether you can sleep or not. In fact if you can't sleep
and you rearm the tasklet when GFP_ATOMIC fails, you won't instantly
deadlock... I think you should have a maximum amount of simultaneous
allocations, and you should notify the external references partially
_serially_. Also you can't exceed the maximum amount of simultaneous
allocations even if GFP_ATOMIC/KERNEL fails, or you'll squeeze the whole
PF_MEMALLOC pool, leading to deadlocks elsewhere. You must stop while
there's still quite some room left in the PF_MEMALLOC pool.
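
(A rough sketch of the rearming pattern described above; pending_invalidates,
send_one_invalidate and struct invalidate_msg are made-up names:)

#include <linux/interrupt.h>
#include <linux/slab.h>

static void invalidate_tasklet_fn(unsigned long data);
static DECLARE_TASKLET(invalidate_tasklet, invalidate_tasklet_fn, 0);

static void invalidate_tasklet_fn(unsigned long data)
{
        int budget = 16;        /* cap the simultaneous allocations */

        while (pending_invalidates() && budget--) {
                struct invalidate_msg *m = kmalloc(sizeof(*m), GFP_ATOMIC);

                if (!m)
                        break;  /* allocation failed, back off and retry */
                send_one_invalidate(m); /* notify one external reference */
        }
        if (pending_invalidates())
                tasklet_schedule(&invalidate_tasklet);  /* rearm and retry later */
}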

> The approach with the export notifier is page based not based on the
> mm_struct. We only need a single page count for a page that is exported to
> a number of remote instances of linux. The page count is dropped when all
> the remote instances have unmapped the page.

With KVM it doesn't work that way. Anyway you must be keeping a
"secondary" count if you know when it's time to call
__free_page/put_page, so why don't you use the main page_count instead?

> > > @@ -966,6 +973,9 @@ int try_to_unmap(struct page *page, int
> > >
> > > BUG_ON(!PageLocked(page));
> > >
> > > + if (unlikely(PageExported(page)))
> > > + export_notifier(invalidate_page, page);
> > > +
> >
> > Passing the page here will complicate things especially for shared
> > pages across different VM that are already working in KVM. For non
>
> How?

Because especially for shared pages we'll have to walk
objrmap/anon_vma->vmas->ptes by hand in our lowlevel methods twice to
get to the address where the page is mapped in the address space of
the task etc...

> There is only the need to walk twice for pages that are marked Exported.

All kvm guest physical pages would need to be marked exported of
course.

> And the double walk is only necessary if the exporter does not have its
> own rmap. The cross partition thing that we are doing has such an rmap and

We have one rmap per VM, so the same physical pages will have multiple
rmap structures; each VM is independent and the refcounting happens on
the core page_count.

If multiple KVM images are using the page, then multiple "mm" will
have a mmu notifier registered, and they'll all be called when we walk
the objrmap/anon_vma with the address where the page is mapped in each
"mm". That's what we need to find the rmap structure for the "mm".

> its a matter of walking the exporters rmap to clear out the external
> references and then we walk the local rmaps. All once.

The problem with having a single rmap for a physical page is that it'd
need to be a data structure shared by all KVM instances; it doesn't
work that way right now, and certainly there would be complications.

> > Besides the pinned pages ram leak by having the zero locking window
> > above I'm curious how you are going to take care of the finegrined
> > aging that I'm doing with the accessed bit set by hardware in the spte
>
> I think I explained that above. Remote users effectively are forbidden to
> establish new references to the page by the clearing of the exported bit.

And how do they know when they can restart adding references if infact
the VM _never_ calls into SetPageExported? (perhaps you forgot
something in your patch to set PageExported again to notify the
external reference that it can "de-freeze" and to restart adding
references ;)

> > with your coarse export_notifier(invalidate_page) called
> > unconditionally before checking any young bit at all.
>
> The export notifier is called only if the mm_struct or page bit for
> exporting is set. Maybe I missed to add a check somewhere?

try_to_unmap internally is doing:

        if any accessed bit is set:
                clear the accessed bit, flush tlb and do nothing else at all
        else
                ptep_clear_flush

Not sure how you're going to "do nothing else at all" if you have a
broad invalidate_page call before even checking the linux pte.

But given that with your design we'll have to replicate mm/rmap.c inside
KVM to find the virtual address where the page is mapped in each "mm"
that might have a KVM instance running, I guess we can duplicate it
all and check the young bit too. In fact we could duplicate the whole
thing and offload mm/rmap.c inside KVM, and perhaps it'll work ok. But
I don't think having to duplicate the whole mm/rmap.c in an external
module is a good design when my model requires zero duplication and zero
"freezing" of the SVM/VMX pagefaults while the PageExported bitflag
is clear (and nobody knows who's supposed to SetPageExported again to
"unfreeze" the KVM page fault; certainly KVM can't know when the VM
has finished clearing the pte and flushing the core linux tlb).

Also note KVM will try to generate a core-linux page fault in order to
find the "page struct", so I can't see how we could ever check the
PageExported bit to know if we can trigger the core-linux page fault
or not. Your whole design is backwards as far as I can tell.

> > Look how clean it is to hook asm-generic/pgtable.h in my last patch
> > compared to the above leaking code expanded all over the place in the
> > mm/*.c, unnecessary mangling of atomic bitflags in the page struct,
> > etc...
>
> I think that hunk is particularly bad in your patch. A notification side
> event in a macro? You would want that explicitly in the code.

Sorry, this is the exact opposite. I'm placing all the required
invalidate_page calls with a 1-line change to the kernel source. How can
that be bad? This is exactly the right place to hook into, so it
will remain as close as possible to the main linux TLB/MMU without
cluttering mm/*.c with notifiers all over the place. Look how clean
the accessed-bit test_and_clear is:

#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
#define ptep_clear_flush_young(__vma, __address, __ptep)                \
({                                                                      \
        int __young;                                                    \
        __young = ptep_test_and_clear_young(__vma, __address, __ptep); \
        if (__young)                                                    \
                flush_tlb_page(__vma, __address);                       \
        __young |= mmu_notifier_age_page((__vma)->vm_mm, __address);   \
        __young;                                                        \
})
#endif

This is totally straightforward, clean and 100% optimal too! I fail to
see how this can be considered _inferior_ to your cluttering of
mm/*.c (plus the fact that you place your hook _before_ clearing the main
linux pte, which is backwards).

> > > + bool "Export Notifier for notifying subsystems about changes to page mappings"
> >
> > The word "export notifier" isn't very insightful to me, it doesn't
> > even give an hint we're in the memory management area. If you don't
> > like mmu notifier name I don't mind changing it, but I doubt export
> > notifier is a vast naming improvement. Infact it looks one of those
> > names like RCU that don't tell much of what is really going on
> > (there's no copy 99% of time in RCU).
>
> What we are doing is effectively allowing external references to pages.
> This is outside of the regular VM operations. So export came up but we
> could call it something else. External? Its not really tied to the mmu
> now.

It's tied to the core linux mmu. Even for KVM it's a secondary tlb not
really a secondary mmu. But we want notifications of the operations on
the master linux mmu so we can export the same data in secondary
subsystems too through the notifiers.

It wasn't a "pte" notifier, not always we notify about pte updates,
sometime we notify about "range of virtual addresses being zapped
without even knowing if a pte existed in the range". It wasn't a "tlb"
notifier because we don't always notify whenever there is a tlb flush
in the main linux VM. we simply notify about certain events that
affect the master MMU that allows to represent the userland memory not
only in the master MMU.

> > > +LIST_HEAD(export_notifier_list);
> >
> > A global list is not ok IMHO, it's really bad to have a O(N) (N number
> > of mm in the system) complexity here when it's so trivial to go O(1)
> > like in my code. We want to swap 100% of the VM exactly so we can have
> > zillon of idle (or sigstopped) VM on the same system.
>
> There will only be one or two of those notifiers. There is no need to
> build long lists of mm_structs like in your patch.

Each KVM will register into this list (they're mostly independent);
plus, when each notifier is called it will have to check the rmap to
see if the page belongs to its "mm" before doing anything with it.
That's really bad for KVM.

> The mm_struct is not available at the point of my callbacks. There is no
> way to do a callback that is mm_struct based if you are not scanning the
> reverse list. And scanning the reverse list requires taking locks.

The thing is, we can add notifiers to my patch to fit your needs, but
those will be _different_ notifiers and they really should be after
the linux pte updates... I think fixing your code so it'll work with
the sleeping-notifiers getting the "page" instead of a virtual address
called _after_ clearing the main linux pte, is the way to go. Then
hopefully won't have to find a way to enable the PageExported bitflag
anymore in the linux VM and it may remain always-on for the exported
pages etc.... it makes life a whole lot easier for you too IMHO.

2008-01-23 12:05:12

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, Jan 23, 2008 at 04:52:47AM -0600, Robin Holt wrote:
> But 100 callouts holding spinlocks will not work for our implementation
> and even if the callouts are made with spinlocks released, we would very
> strongly prefer a single callout which messages the range to the other
> side.

But you take the physical address and turn it into mm+va with your rmap...

> > Also, our rmap key for finding the spte is keyed on (mm, va). I imagine
> > most RDMA cards are similar.
>
> For our RDMA rmap, it is based upon physical address.

so why do you turn it into mm+va?

> >> There is only the need to walk twice for pages that are marked Exported.
> >> And the double walk is only necessary if the exporter does not have its
> >> own rmap. The cross partition thing that we are doing has such an rmap and
> >> its a matter of walking the exporters rmap to clear out the external
> >> references and then we walk the local rmaps. All once.
> >>
> >
> > The problem is that external mmus need a reverse mapping structure to
> > locate their ptes. We can't expand struct page so we need to base it on mm
> > + va.
>
> Our rmap takes a physical address and turns it into mm+va.

Why don't you stick to mm+va and use get_user_pages and let the VM do
the swapins etc...?

> > Can they wait on that bit?
>
> PageLocked(page) should work, right? We already have a backoff
> mechanism so we expect to be able to adapt it to include a
> PageLocked(page) check.

It's not PageLocked but wait_on_page___not___exported() called on the
master node. Plus nothing in the VM of the master node calls
SetPageExported... good luck making it work (KVM swapping OTOH already
works like a charm w/o the backwards secondary-TLB-flushing order).

2008-01-23 12:32:40

by Robin Holt

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Christoph, maybe you can clear one thing up. Was this proposal an
addition to, or a replacement of, Andrea's? I assumed an addition. I am
going to try to restrict my responses to ones appropriate for that
assumption.

> The remote instance is like a secondary TLB what you're doing in your
> code is as backwards as flushing the TLB _before_ clearing the PTE! If
> you want to call the secondary tlb flush outside locks we can argue
> about that, but I think you should do that _after_ clearing the linux
> pte IMHO. Otherwise you can as well move the tlb_flush_page before
> clearing the pte and you'll run in the same amount of smp races for
> the master MMU too.

I can agree to doing the flush after as long as I get a flag from the
mmu notifier saying this flush will be repeated later without locks
held. That would be fine with me. If we don't have that then the
export_notifiers would need to be expanded to cover all cases of pte
clearing. Barring one of the two, I would argue we have an unworkable
solution for XPMEM. This is because of allocations, which is touched
upon later.

> The ordering you're implementing is backwards and unnatural, you can
> try to serialize it with explicit locking to block the "remote-tlb
> refills" through page bitflags (something not doable with the core
> master tlb because the refills are done by the hardware with the
> master tlb), but it'll remain unnatural and backwards IMHO.

Given one of the two compromises above, I believe this will then work
for XPMEM. The callouts were placed as they are now to prevent the
mmu_notifier callouts from having to do work.

> > > > - anon_vma/inode and pte locks are held during callbacks.
> > >
> > > In a previous email I asked what's wrong in offloading the event, and
> >
> > We have internally discussed the possibility of offloading the event but
> > that wont work with the existing callback since we would have to
> > perform atomic allocation and there may be thousands of external
> > references to a page.

As an example of thousands, we currently have one customer job that
has 16880 processors all with the same physical page faulted into their
address space. The way XPMEM is currently structured, there is fan-out of
that PFN information so we would not need to queue up that many messages,
but it would still be considerable. Our upcoming version of the hardware
will potentially make this fanout worse because we are going to allow
even more fine-grained divisions of the machine to help with memory
error containment.

> With KVM it doesn't work that way. Anyway you must be keeping a
> "secondary" count if you know when it's time to call
> __free_page/put_page, so why don't you use the main page_count instead?

We have a counter associated with a pfn that indicates when the pfn is no
longer referenced by other partitions. This counter triggers changing of
memory protections so any subsequent access to this page will result in
a memory error on the remote partition (this should be an illegal case).

> And how do they know when they can restart adding references if infact
> the VM _never_ calls into SetPageExported? (perhaps you forgot
> something in your patch to set PageExported again to notify the
> external reference that it can "de-freeze" and to restart adding
> references ;)

I assumed Christoph intended this to be part of our memory protection
changing phase. Once we have raised memory protections for the page,
clear the bit. When we lower memory protections, set the bit.

> The thing is, we can add notifiers to my patch to fit your needs, but
> those will be _different_ notifiers and they really should be after
> the linux pte updates... I think fixing your code so it'll work with
> the sleeping-notifiers getting the "page" instead of a virtual address
> called _after_ clearing the main linux pte, is the way to go. Then
> hopefully won't have to find a way to enable the PageExported bitflag
> anymore in the linux VM and it may remain always-on for the exported
> pages etc.... it makes life a whole lot easier for you too IMHO.

As I said before, I was under the assumption that Christoph was proposing
this as an addition to your notifiers, not a replacement.

Thanks,
Robin

2008-01-23 12:34:57

by Robin Holt

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

> Why don't you stick to mm+va and use get_user_pages and let the VM do
> the swapins etc...?

Our concern is also with memory protections for the physical page.
Additionally, we need to support exporting of memory in VM_PFNMAP
address ranges (specifically, memory mapped using sgi_fetchop, uncached
or cached mspec devices).

Thanks,
Robin

2008-01-23 12:52:31

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Hi,

Jumping in here, looks like this could develop into a direction useful
for Xen.

Background: Xen has a mechanism called "grant tables" for page sharing.
Guest #1 can issue a "grant" for another guest #2, which in turn then
can use that grant to map the page owned by guest #1 into its address
space. This is used by the virtual network/disk drivers, i.e. typically
Domain-0 (which has access to the real hardware) maps pages of other
guests to fill in disk/network data.

Establishing and tearing down mappings for those grants must happen
through a special grant table hypercall, and especially for the tear-down
part of the problem the mmu/export/whatever-we-call-them-in-the-end
notifiers could help.

> Issues with mmu_ops #2
>
> - Notifiers are called *after* we tore down ptes.

That would render the notifiers useless for Xen too. Xen needs to
intercept the actual pte clear and, instead of just zapping it, use the
hypercall to do the unmap and release the grant.

The current implementation uses a new vm_ops operation which, if
present, is called instead of doing a ptep_get_and_clear_full(). It is in
the XenSource tree only; mainline doesn't have this yet because only the
DomU bits have been implemented so far. When adding Dom0 support to
mainline we'll need some way to handle it, and I'd like to see the
notifiers designed in a way that Xen can simply use them.
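
A rough sketch of what that hook looks like (the zap_pte name and exact
signature here are from memory and may differ in detail from the actual
XenSource code):

/* Sketch from memory; the actual XenSource hook may differ in detail. */
struct vm_operations_struct {
	/* ... existing methods ... */
	/* called instead of ptep_get_and_clear_full() when set */
	pte_t (*zap_pte)(struct vm_area_struct *vma, unsigned long addr,
			 pte_t *ptep, int is_fullmm);
};

/* and at the pte teardown site, roughly: */
if (vma->vm_ops && vma->vm_ops->zap_pte)
	ptent = vma->vm_ops->zap_pte(vma, addr, pte, tlb->fullmm);
else
	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);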

cheers,
Gerd

2008-01-23 13:19:52

by Robin Holt

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, Jan 23, 2008 at 01:51:23PM +0100, Gerd Hoffmann wrote:
> Jumping in here, looks like this could develop into a direction useful
> for Xen.
>
> Background: Xen has a mechanism called "grant tables" for page sharing.
> Guest #1 can issue a "grant" for another guest #2, which in turn then
> can use that grant to map the page owned by guest #1 into its address
> space. This is used by the virtual network/disk drivers, i.e. typically
> Domain-0 (which has access to the real hardware) maps pages of other
> guests to fill in disk/network data.

This is extremely similar to what XPMEM is providing.

> That would render the notifies useless for Xen too. Xen needs to
> intercept the actual pte clear and instead of just zapping it use the
> hypercall to do the unmap and release the grant.

We are tackling that by having our own page table hanging off the
structure representing our seg (the thing created when we do the
equivalent of your grant call).

> Current implementation uses a new vm_ops operation which is called if
> present instead of doing a ptep_get_and_clear_full(). It is in the
> XenSource tree only, mainline hasn't this yet due to implementing only
> the DomU bits so far. When adding Dom0 support to mainline we'll need
> some way to handle it, and I'd like to see the notifies be designed in a
> way that Xen can simply use them.

Would the callouts Christoph proposed work for you if you maintained
your own page table and moved them after the callouts the mmu_notifiers
are using?

Thanks,
Robin

2008-01-23 14:13:58

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Hi,

>> That would render the notifies useless for Xen too. Xen needs to
>> intercept the actual pte clear and instead of just zapping it use the
>> hypercall to do the unmap and release the grant.
>
> We are tackling that by having our own page table hanging off the
> structure representing our seg (thing created when we do the equiv of
> your grant call).

--verbose please. I don't understand that "own page table" trick. Is
that page table actually used by the processor or is it just used to
maintain some sort of page list?

>> Current implementation uses a new vm_ops operation which is called if
>> present instead of doing a ptep_get_and_clear_full(). It is in the
>> XenSource tree only, mainline hasn't this yet due to implementing only
>> the DomU bits so far. When adding Dom0 support to mainline we'll need
>> some way to handle it, and I'd like to see the notifies be designed in a
>> way that Xen can simply use them.
>
> Would the callouts Christoph proposed work for you if you maintained
> your own page table and moved them after the callouts the mmu_notifiers
> are using.

I *think* it would. I'm not deep enough in the VM details to be sure,
though. One possible problem I see is that the hypercall also tears
down the mapping, so this isn't just a notification but also changes the
page tables, which could confuse the VM later on when it comes to the
actual pte clearing.

cheers,
Gerd

2008-01-23 14:17:41

by Avi Kivity

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Robin Holt wrote:
> On Wed, Jan 23, 2008 at 01:51:23PM +0100, Gerd Hoffmann wrote:
>
>> Jumping in here, looks like this could develop into a direction useful
>> for Xen.
>>
>> Background: Xen has a mechanism called "grant tables" for page sharing.
>> Guest #1 can issue a "grant" for another guest #2, which in turn then
>> can use that grant to map the page owned by guest #1 into its address
>> space. This is used by the virtual network/disk drivers, i.e. typically
>> Domain-0 (which has access to the real hardware) maps pages of other
>> guests to fill in disk/network data.
>>
>
> This is extremely similar to what XPMEM is providing.
>
>

I think that in Xen's case the page tables are the normal cpu page
tables, not an external mmu (like RDMA, kvm, and XPMEM).

--
error compiling committee.c: too many arguments to function

2008-01-23 14:18:26

by Robin Holt

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, Jan 23, 2008 at 03:12:36PM +0100, Gerd Hoffmann wrote:
> Hi,
>
> >> That would render the notifies useless for Xen too. Xen needs to
> >> intercept the actual pte clear and instead of just zapping it use the
> >> hypercall to do the unmap and release the grant.
> >
> > We are tackling that by having our own page table hanging off the
> > structure representing our seg (thing created when we do the equiv of
> > your grant call).
>
> --verbose please. I don't understand that "own page table" trick. Is
> that page table actually used by the processor or is it just used to
> maintain some sort of page list?

We have a seg structure which is similar to some structure you probably
have which describes the grant. One of the things hanging off that
seg structure is essentially a page table containing PFNs with their
respective flags (XPMEM specific and not the same as the pfn flags in
the processor page tables).
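
Very loosely, and with names invented here just to illustrate the shape
of it, the structure is something like:

/* Invented names for illustration; not the actual XPMEM structures. */
#include <linux/mm_types.h>

struct xpmem_segment {
	struct mm_struct	*mm;		/* exporting task's mm */
	unsigned long		vaddr;		/* start of the exported window */
	size_t			size;		/* length of the window */
	unsigned long		*pfn_table;	/* software-only PFN + flag entries */
};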

Thanks,
Robin

2008-01-23 14:36:53

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Robin Holt wrote:
> We have a seg structure which is similar to some structure you probably
> have which describes the grant. One of the things hanging off that
> seg structure is essentially a page table containing PFNs with their
> respective flags (XPMEM specific and not the same as the pfn flags in
> the processor page tables).

i.e. page tables used by hardware != cpu, right?

In the Xen guest case the normal processor page tables are modified, but
in a special way to make the Xen hypervisor also release the grant.

cheers,
Gerd

2008-01-23 15:41:47

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Hi Kraxel,

On Wed, Jan 23, 2008 at 01:51:23PM +0100, Gerd Hoffmann wrote:
> That would render the notifies useless for Xen too. Xen needs to
> intercept the actual pte clear and instead of just zapping it use the
> hypercall to do the unmap and release the grant.

I think it has yet to be demonstrated that doing the invalidate
_before_ clearing the linux pte is workable at all for
shadow-pte/RDMA. In fact even doing it _after_ still requires some form
of serialization, but it's less obviously broken and perhaps more
fixable. Doing it before seems hardly fixable, given that the refill
event running in the remote node is supposed to wait for a bitflag of a
page in the master node to return ON. What Christoph didn't specify,
after hinting that you have to wait for the PageExported bitflag to
return on, is that such a page may be back in the freelist by the time
the secondary-tlb page fault starts checking that bit. And nobody is
setting that bit in the VM anyway, so good luck waiting for that bit to
return on in a page that is in the freelist (nothing keeps that page
pinned anymore by the time ->invalidate_page returns: the whole point
of the invalidate is so the VM code can finally free the page and put
it in the freelist).

Until there's some more reasonable theory of how invalidating the
remote tlbs/ptes _before_ the main linux pte can remotely work, I'm
"quite" skeptical it's the way to go for the invalidate_page callback.

Like Avi said, Xen is dealing with the linux pte only, so there's no
racy smp page fault to serialize against. Perhaps we can add another
notifier for Xen though.

But I think it's still not enough for Xen to have a method called
before the ptep_clear_flush: rmap.c would get confused in
page_mkclean_one for example. It might be possible that vm_ops is the
right way for you even if it further clutters the VM. As Avi once pointed
out to me, with our current mmu_notifiers we can export the KVM
address space with remote dma and keep swapping the whole KVM asset
just fine despite the triple MMU running the system (qemu using the linux
pte, KVM using the spte, quadrics using the pcimmu). And the core Linux VM
code (not some obscure hypervisor) will deal with all aging and VM
issues like a normal task (especially with my last patch that reflects
the accessed bitflag in the spte the same way the accessed bitflag is
reflected for the regular ptes).

Nevertheless, if you have any idea how to use the notifiers for Xen
I'd be glad to help. Perhaps one workable way to change my patch to
work for you could be to pass the retval of ptep_clear_flush to the
notifiers themselves. Something like:

#define ptep_clear_flush(__vma, __address, __ptep)			\
({									\
	pte_t __pte;							\
	__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);	\
	flush_tlb_page(__vma, __address);				\
	__pte = mmu_notifier(invalidate_page, (__vma)->vm_mm, __address, __pte, __ptep); \
	__pte;								\
})

But this would kind of require exclusive registration, or this loop
wouldn't work well if every notifier expects to overwrite the memory
pointed to by __ptep with its own value computed from __pte.

	for_each_notifier(mn, mm)
		mn->invalidate_page(mm, __address, __pte, __ptep);

You could get a pte_none page fault in between the old value and the
new value though. (You wouldn't need to flush the tlb inside
invalidate_pte; only we need to flush the secondary tlb for the spte
inside invalidate_page, obviously.)
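
To make the calling convention above concrete, the per-notifier method
such a variant would imply might look about like this (just a sketch,
not something from a posted patch):

/* Sketch of the implied op signature; not from a posted patch. */
struct mmu_notifier_ops {
	/*
	 * Receives the pte value just cleared plus a pointer to the pte
	 * slot, and may return a (possibly modified) value to be passed
	 * to the next registered notifier.
	 */
	pte_t (*invalidate_page)(struct mmu_notifier *mn,
				 struct mm_struct *mm,
				 unsigned long address,
				 pte_t pte, pte_t *ptep);
};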

Let me know if you're interested in the above.

Thanks!

2008-01-23 15:48:53

by Robin Holt

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, Jan 23, 2008 at 03:35:54PM +0100, Gerd Hoffmann wrote:
> Robin Holt wrote:
> > We have a seg structure which is similar to some structure you probably
> > have which describes the grant. One of the things hanging off that
> > seg structure is essentially a page table containing PFNs with their
> > respective flags (XPMEM specific and not the same as the pfn flags in
> > the processor page tables).
>
> i.e. page tables used by hardware != cpu, right?

Actually, they are page tables used exclusively by software during the
cross-partition coordination. Those entries are inserted on the remote
side by normal faults on VM_PFNMAP vmas created by the importing side.

> In the Xen guest case the normal processor page tables are modified, but
> in a special way to make the Xen hypervisor also release the grant.

In our guest case, we cannot access the kernel struct page area on the
remote host. We therefore handle all the ref/deref of the page as part
of messaging the PFN across the partition boundaries.

Thanks,
Robin

2008-01-23 19:47:38

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, 23 Jan 2008, Robin Holt wrote:

> > That won't work for kvm. If we have a hundred virtual machines, that means
> > 99 no-op notifications.
>
> But 100 callouts holding spinlocks will not work for our implementation
> and even if the callouts are made with spinlocks released, we would very
> strongly prefer a single callout which messages the range to the other
> side.


Andrea won't have 99 no-op notifications. You will have one notification to
the kvm subsystem (since there needs to be only one register operation
for a subsystem that wants to get notifications). What you do there is
up to kvm. If you want to call some function 99 times then you are free to
do that.

2008-01-23 19:49:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, 23 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 23, 2008 at 04:52:47AM -0600, Robin Holt wrote:
> > But 100 callouts holding spinlocks will not work for our implementation
> > and even if the callouts are made with spinlocks released, we would very
> > strongly prefer a single callout which messages the range to the other
> > side.
>
> But you take the physical address and turn it into mm+va with your rmap...

The remote mm+va or a local mm+va?

2008-01-23 19:58:50

by Robin Holt

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, Jan 23, 2008 at 11:48:43AM -0800, Christoph Lameter wrote:
> On Wed, 23 Jan 2008, Andrea Arcangeli wrote:
>
> > On Wed, Jan 23, 2008 at 04:52:47AM -0600, Robin Holt wrote:
> > > But 100 callouts holding spinlocks will not work for our implementation
> > > and even if the callouts are made with spinlocks released, we would very
> > > strongly prefer a single callout which messages the range to the other
> > > side.
> >
> > But you take the physical address and turn it into mm+va with your rmap...
>
> The remote mm+va or a local mm+va?

To be more complete, the phys points to an xpmem_segment+va and the
xpmem_segment points to the mm. The seg describes a window into the
source process's virtual address space. It seems somewhat analogous to the
Xen grant, but I do not know.

Thanks,
Robin

2008-01-23 17:49:28

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Andrea Arcangeli wrote:
> Like Avi said, Xen is dealing with the linux pte only, so there's no
> racy smp page fault to serialize against. Perhaps we can add another
> notifier for Xen though.
>
> But I think it's still not enough for Xen to have a method called
> before the ptep_clear_flush: rmap.c would get confused in
> page_mkclean_one for example.

The current code sets a bunch of vma flags (VM_RESERVED, VM_DONTCOPY,
VM_FOREIGN) so the VM doesn't try to handle those special mappings. IIRC
one of them was needed to keep rmap from getting unhappy.
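
In code that is nothing more than flag setup when the foreign mapping
vma is created, roughly like the following (note VM_FOREIGN exists only
in the Xen trees, not in mainline):

	/* Xen-tree style setup of a foreign/grant mapping vma (sketch). */
	vma->vm_flags |= VM_RESERVED | VM_DONTCOPY | VM_FOREIGN;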

> Nevertheless if you've any idea on how to use the notifiers for Xen
> I'd be glad to help. Perhaps one workable way to change my patch to
> work for you could be to pass the retval of ptep_clear_flush to the
> notifiers themself. something like:
>
> #define ptep_clear_flush(__vma, __address, __ptep)			\
> ({									\
>	pte_t __pte;							\
>	__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);	\
>	flush_tlb_page(__vma, __address);				\
>	__pte = mmu_notifier(invalidate_page, (__vma)->vm_mm, __address, __pte, __ptep); \
>	__pte;								\
> })

That would not work. We need to pass a pointer to the pte so the xen
hypervisor can do the unmap (aka pte_clear) and the grant release as one
atomic operation. Thus passing the value of the pte entry isn't good enough.

Another maybe workable approach for Xen is to go through pv_ops
(although pte_clear doesn't go through pv_ops right now, so this would
be an additional hook too ...).

cheers,
Gerd

2008-01-23 17:35:57

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, Jan 23, 2008 at 06:32:30AM -0600, Robin Holt wrote:
> Christoph, Maybe you can clear one thing up. Was this proposal an
> addition to or replacement of Andrea's? I assumed an addition. I am
> going to try to restrict my responses to ones appropriate for that
> assumption.

It wasn't immediately obvious that this was an addition, but really I
don't mind either way as long as a feature workable for KVM is
included ;).

> > The remote instance is like a secondary TLB what you're doing in your
> > code is as backwards as flushing the TLB _before_ clearing the PTE! If
> > you want to call the secondary tlb flush outside locks we can argue
> > about that, but I think you should do that _after_ clearing the linux
> > pte IMHO. Otherwise you can as well move the tlb_flush_page before
> > clearing the pte and you'll run in the same amount of smp races for
> > the master MMU too.
>
> I can agree to doing the flush after as long as I get a flag from the
> mmu notifier saying this flush will be repeated later without locks
> held. That would be fine with me. If we don't have that then the

You want to be able to tell the mmu_notifier that you want the flush
repeated without locks later? Sorry, but if you're always going to
set the bitflag unconditionally anyway, why don't you simply implement a
second notifier in addition to my current ->invalidate_page (like
->after_invalidate_page)?

We can then implement a method in rmap.c for you to call to do the
final freeing of the page (pagecache/swapcache won't be collected
unless it's a truncate as long as you keep it pinned, and you
certainly don't want to wait for a second round of lru scanning before
freeing the page after you release the external reference, so you may
need to call this method before returning from
->after_invalidate_page). In fact I can call that method for you in the
notifier implementation itself after all the ->after_invalidate_page
methods have been called (of course only if at least one of them was
implemented and non-null).
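
In other words the addition would be something along these lines (only a
sketch of the idea, not code from my patch):

/* Sketch only: the extra sleep-capable hook discussed above. */
struct mmu_notifier_ops {
	/* existing hook, called with the rmap/pte locks still held */
	void (*invalidate_page)(struct mmu_notifier *mn,
				struct mm_struct *mm,
				unsigned long address);
	/*
	 * New hook: called after the linux pte has been cleared and all
	 * locks have been dropped, so it may sleep, message the remote
	 * side and drop the external reference to the page.
	 */
	void (*after_invalidate_page)(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long address);
};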

> export_notifiers would need to be expanded to cover all cases of pte
> clearing. Baring one of the two, I would argue we have an unworkable
> solution for XPMEM. This is because of allocations which is touched
> upon later.

mmu notifiers should already cover all pte clearing cases (the export
notifiers definitely don't!). It's not as important for us as it is for
quadrics; we're mostly interested in rmap.c swapping and do_wp_page
for KVM page sharing. But in the longer term, protection changes and
other things happening in the main MMU can also be tracked through mmu
notifiers (something quadrics will likely need). Hooking
asm-generic/pgtable.h is all about trivially tracking all those places
without cluttering mm/*.c, so the mm/*.c code remains more hackable and
more readable.

> As an example of thousands, we currently have one customer job that
> has 16880 processors all with the same physical page faulted into their
> address space. The way XPMEM is currently structured, there is fan-out of
> that PFN information so we would not need to queue up that many messages,
> but it would still be considerable. Our upcoming version of the hardware
> will potentially make this fanout worse because we are going to allow
> even more fine-grained divisions of the machine to help with memory
> error containment.

Well, as long as you send these messages somewhat serially and you
don't expect to allocate all packets at once, it should be ok. Perhaps
you should preallocate all packets statically and serialize access
to the pool with a lock.
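
Something as simple as this would do (only a sketch, with invented
names, of the preallocated pool I mean):

/* Sketch with invented names: a static message pool serialized by a lock. */
#include <linux/spinlock.h>
#include <linux/bitops.h>

#define XPMEM_MSG_POOL_SIZE	128

struct xpmem_msg {
	unsigned long pfn;	/* plus whatever describes one invalidation */
};

static struct xpmem_msg msg_pool[XPMEM_MSG_POOL_SIZE];
static unsigned long msg_used[BITS_TO_LONGS(XPMEM_MSG_POOL_SIZE)];
static DEFINE_SPINLOCK(msg_pool_lock);

static struct xpmem_msg *xpmem_msg_alloc(void)
{
	struct xpmem_msg *msg = NULL;
	int bit;

	spin_lock(&msg_pool_lock);
	bit = find_first_zero_bit(msg_used, XPMEM_MSG_POOL_SIZE);
	if (bit < XPMEM_MSG_POOL_SIZE) {
		__set_bit(bit, msg_used);
		msg = &msg_pool[bit];
	}
	spin_unlock(&msg_pool_lock);
	return msg;	/* NULL means back off and notify serially, retrying later */
}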

What I'd like to stress, to be sure it's crystal clear, is that in the
mm/rmap.c path GFP_KERNEL==GFP_ATOMIC; in fact both amount to PF_MEMALLOC =
TIF_MEMDIE = if the mempool is empty it will crash. The argument that you
need to sleep to allocate memory with GFP_KERNEL is totally bogus. If
that's the only reason, you don't need to sleep at all. alloc_pages
will not invoke the VM when called inside the VM; it will grab ram
from the PF_MEMALLOC reserve instead. At most it will schedule, so the only
benefit would be lower -rt latency in the end.

> We have a counter associated with a pfn that indicates when the pfn is no
> longer referenced by other partitions. This counter triggers changing of
> memory protections so any subsequent access to this page will result in
> a memory error on the remote partition (this should be an illegal case).

As long as you keep a reference on the page too, you don't risk
any corruption by flushing after.

> > And how do they know when they can restart adding references if infact
> > the VM _never_ calls into SetPageExported? (perhaps you forgot
> > something in your patch to set PageExported again to notify the
> > external reference that it can "de-freeze" and to restart adding
> > references ;)
>
> I assumed Christoph intended this to be part of our memory protection
> changing phase. Once we have raised memory protections for the page,
> clear the bit. When we lower memory protections, set the bit.

The window that you must close with that bitflag is the request coming
from the remote node to map the page after the linux pte has been
cleared. If you map the page in a remote node after the linux pte has
been cleared ->invalidate_page won't be called again because the page
will look unmapped in the linux VM. Now invalidate_page will clear the
bitflag, so the map requests will block. But where exactly do you learn
that the linux pte has been cleared so you can "unblock" the map
requests? If a page is not mapped by some linux pte, mm/rmap.c will
never be called and this is why any notification in mm/rmap.c should
track the "address space" and not the "physical page".

In effect you couldn't care less about the address space of the task in
the master node, so IMHO you're hooking your ->invalidate_page(page)
(instead of my ->invalidate_page(mm, address)) in the very wrong
place. You should hook it in the mm/vmscan.c shrink list so it will be
invoked regardless of whether the pte is mapped or not. Then your model
that passes the "page" instead of an "mm+address" will start to work
without any need for clearing/setting the PG_exported bitflag, and you
will automatically close the above race window without ever having to
clear the PG_exported bitflag etc.... So again the current design
of Christoph's patch really looks flawed to me.

If you work with "pages" you should stick to pages, stay away
from mm/rmap.c, and ignore whatever is mapped in the master address
space of the task. mm/rmap.c only deals with ptes/sptes and other
_virtually-tracked_ mappings.

If you work with get_user_pages for page allocation (instead of
alloc_pages) and the userland "virtual" addresses are your RAM backing
store, like with KVM, then you should build an rmap structure based on
_virtual_ addresses because the virtual addresses of the task will be
all you care about, and then you will register the notifier in the
"mm" and you will need ->invalidate_page to take an "address" as
parameter and not a "page". All reference counting will be automatic,
the userland task will run with all virtual memory visible
automatically. You should make sure this model, tied to the ram mapped
in the userland task, really can't fit your needs before you insist on
building things on top of physical pages instead of letting the
VM decide which physical page or swap entry is backing your memory.
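
To give the idea in code, a secondary MMU driver following this
virtual-address-tracked model would implement something vaguely like the
below (names invented, this is not KVM code):

/* Invented names; a sketch of the mm+address tracked model, not KVM code. */
struct my_secondary_mmu {
	struct mmu_notifier	notifier;	/* registered against the task's mm */
	spinlock_t		lock;
	/* ... per-VM secondary rmap keyed by virtual address ... */
};

/* placeholder for the driver's own virtual-address rmap teardown */
void my_secondary_rmap_unmap(struct my_secondary_mmu *smmu, unsigned long address);

static void my_invalidate_page(struct mmu_notifier *mn,
			       struct mm_struct *mm,
			       unsigned long address)
{
	struct my_secondary_mmu *smmu;

	smmu = container_of(mn, struct my_secondary_mmu, notifier);

	spin_lock(&smmu->lock);
	/* look up the secondary pte(s) by virtual address and zap them */
	my_secondary_rmap_unmap(smmu, address);
	spin_unlock(&smmu->lock);
}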

> As I said before, I was under the assumption that Christoph was proposing
> this as an addition to your notifiers, not a replacement.

Ok. This wasn't my impression but again I'm fine either ways! ;)

2008-01-23 20:18:57

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, 23 Jan 2008, Andrea Arcangeli wrote:

> > [..] The linux instance with the
> > secondary mmu must call back to the exporting machine in order to
> > reinstantiate the page. PageExported is cleared in invalidate_page() so
> > the other linux instance will be told that the page is not available.
>
> Page is not available is not a valid answer. At least with KVM there
> are three possible ways:
>
> 1) the remote instance will have to wait for the linux pte to go away
> before calling follow_page (then the page is gone as a whole so
> there won't be any more page flags to check)
> 2) it will kill the VM
>
> Nothing runs SetPageExported in your VM code, I can't see how the
> remote instance can know when it can call follow_page again safely on
> the master node.

SetPageExported is set when a remote instance of linux establishes a
reference to the page (a kind of remote page fault). In the KVM scenario
that would occur when memory is made available.



> The remote instance is like a secondary TLB what you're doing in your
> code is as backwards as flushing the TLB _before_ clearing the PTE! If
> you want to call the secondary tlb flush outside locks we can argue
> about that, but I think you should do that _after_ clearing the linux
> pte IMHO. Otherwise you can as well move the tlb_flush_page before
> clearing the pte and you'll run in the same amount of smp races for
> the master MMU too.
>
> > Ahhh. Good to hear. But we will still end in a situation where only
> > the remote ptes point to the page. Maybe the remote instance will dirty
> > the page at that point?
>
> If you flush the remote instance _after_ clearing the main linux PTE
> like I suggest and like I'm doing in my patch, I can't see how you
> could risk to end up in a situation with only the remote ptes pointing
> the page, that's the whole point of doing the remote-TLB flush _after_
> clearing the main linux pte, instead of before like in your patch.

You are saying that clearing the main linux ptes and leaving the remote
ptes in place will not allow access to the page via the remote ptes?

> This is the same as the tlb flush, there's a reason we normally do:
>
> pte_clear()
> flush_tlb_page()
> __free_page()
>
> instead of:
>
> flush_tlb_page()
> pte_clear()
> __free_page()
>
> The ordering you're implementing is backwards and unnatural, you can
> try to serialize it with explicit locking to block the "remote-tlb
> refills" through page bitflags (something not doable with the core
> master tlb because the refills are done by the hardware with the
> master tlb), but it'll remain unnatural and backwards IMHO.

I do not understand where you actually clear the remote pte or spte. You
must do it sometime before the notification to make it work.

> > > > - anon_vma/inode and pte locks are held during callbacks.
> > >
> > > In a previous email I asked what's wrong in offloading the event, and
> >
> > We have internally discussed the possibility of offloading the event but
> > that wont work with the existing callback since we would have to
> > perform atomic allocation and there may be thousands of external
> > references to a page.
>
> You should however also consider a rearming tasklet/softirq invoked by
> ksoftirqd, if memory allocation fails you retry later. Furthermore you
> should not require thousands of simultaneous allocations anyway,
> you're running in the PF_MEMALLOC path and your memory will come from
> the precious PF_MEMALLOC pool in the objrmap paths! If you ever

Right. That is why the mmu_ops approach does not work and that is why we
need to sleep.

> attempt what you suggest (simultaneous allocation of thousands of
> packets to simultaneously notify the thousands of external references)
> depending on the size of the allocation you can instantly deadlock
> regardless of whether you can sleep or not. In fact if you can't sleep and you
> rearm the tasklet when GFP_ATOMIC fails you won't instantly
> deadlock.... I think you should have a max amount of simultaneous
> allocations, and you should notify the external references partially
> _serially_. Also you can't exceed the max amount of simultaneous
> allocations even if GFP_ATOMIC/KERNEL fails or you'll squeeze the whole
> PF_MEMALLOC pool leading to deadlocks elsewhere. You must stop when
> there's still quite some room in the PF_MEMALLOC pool.

Good. So we cannot use your mmops approach.

> > The approach with the export notifier is page based not based on the
> > mm_struct. We only need a single page count for a page that is exported to
> > a number of remote instances of linux. The page count is dropped when all
> > the remote instances have unmapped the page.
>
> With KVM it doesn't work that way. Anyway you must be keeping a
> "secondary" count if you know when it's time to call
> __free_page/put_page, so why don't you use the main page_count instead?

The reverse maps contain the count and the remote count being zero is
important for the subsystem. It can then decide to no longer export the
page or the range of pages.

> > > > @@ -966,6 +973,9 @@ int try_to_unmap(struct page *page, int
> > > >
> > > > BUG_ON(!PageLocked(page));
> > > >
> > > > + if (unlikely(PageExported(page)))
> > > > + export_notifier(invalidate_page, page);
> > > > +
> > >
> > > Passing the page here will complicate things especially for shared
> > > pages across different VM that are already working in KVM. For non
> >
> > How?
>
> Because especially for shared pages we'll have to walk
> objrmap/anon_vma->vmas->ptes by hand in our lowlevel methods twice to
> get to the address where the page is mapped in the address space of
> the task etc...

Well that could be avoided by keeping an rmap for that purpose?

Maybe we need two different types of callbacks? It seems that the callback
before we begin scanning the rmaps is also necessary for mmops because we
need to disable the ptes in some fashion before we shut down the local
ptes.

> > There is only the need to walk twice for pages that are marked Exported.
>
> All kvm guest physical pages would need to be marked exported of
> course.

Why export all pages? Then you are planning to have mm_struct
notifiers for all processes in the system?

> > And the double walk is only necessary if the exporter does not have its
> > own rmap. The cross partition thing that we are doing has such an rmap and
>
> We've one rmap per-VM, so the same physical pages will have multiple
> rmap structures, each VM is independent and the refcounting happens on
> the core page_count.

Ahh. So you are basically okay?

> > I think I explained that above. Remote users effectively are forbidden to
> > establish new references to the page by the clearing of the exported bit.
>
> And how do they know when they can restart adding references if infact
> the VM _never_ calls into SetPageExported? (perhaps you forgot
> something in your patch to set PageExported again to notify the
> external reference that it can "de-freeze" and to restart adding
> references ;)

Well, the subsystem that provides that piece is still missing.

> > > with your coarse export_notifier(invalidate_page) called
> > > unconditionally before checking any young bit at all.
> >
> > The export notifier is called only if the mm_struct or page bit for
> > exporting is set. Maybe I missed to add a check somewhere?
>
> try_to_unmap internally is doing:
>
>	if the accessed bit is set:
>		clear the accessed bit, flush the tlb and do nothing else at all
>	else
>		ptep_clear_flush
>
> Not sure how you're going to "do nothing else at all" if you've a
> broad invalidate_page call before even checking the linux pte.

Look at the code: It checks PageExported before doing any calls. And the
list of callbacks is very small. One item typically.

> But given with your design we'll have to replicate mm/rmap.c inside
> KVM to find the virtual address where the page is mapped in each "mm"

The rmap that you are using is likely very much simplified. You just need
to track how it was mapped in order to invalidate the kvm ptes.

> Also note KVM will try to generate a core-linux page fault in order to
> find the "page struct", so I can't see how we could ever check the
> PageExported bit to know if we can trigger the core-linux page fault
> or not. Your whole design is backwards as far as I can tell.

Check it before calling into the vm to generate the core-linux fault?
Surely you run some KVM code there.

> > > Look how clean it is to hook asm-generic/pgtable.h in my last patch
> > > compared to the above leaking code expanded all over the place in the
> > > mm/*.c, unnecessary mangling of atomic bitflags in the page struct,
> > > etc...
> >
> > I think that hunk is particularly bad in your patch. A notification side
> > event in a macro? You would want that explicitly in the code.
>
> Sorry this is the exact opposite. I'm placing all required
> invalidate_page with a 1 line change to the kernel source. How can
> that be bad? This is exactly the right place to hook into so it
> will remain as close as possible to the main linux TLB/MMU without
> cluttering mm/*.c with notifiers all over the place. Check how clean
> the accessed bit test_and_clear is:
>
> #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> #define ptep_clear_flush_young(__vma, __address, __ptep)		\
> ({									\
>	int __young;							\
>	__young = ptep_test_and_clear_young(__vma, __address, __ptep);	\
>	if (__young)							\
>		flush_tlb_page(__vma, __address);			\
>	__young |= mmu_notifier_age_page((__vma)->vm_mm, __address);	\
>	__young;							\
> })
> #endif
>
> This is totally straightforward, clean and 100% optimal too! I fail to
> see how this can be considered _inferior_ to your cluttering of
> mm/*.c (plus the fact you place your hook _before_ clearing the main
> linux pte which is backwards).

This means that if you do a ptep_clear_flush_young then the mmu notifiers
run, etc., which may do a lot of things. You do not want that hidden in a
macro. The flush_tlb_page there is bad enough.

> > What we are doing is effectively allowing external references to pages.
> > This is outside of the regular VM operations. So export came up but we
> > could call it something else. External? It's not really tied to the mmu
> > now.
>
> It's tied to the core linux mmu. Even for KVM it's a secondary tlb not
> really a secondary mmu. But we want notifications of the operations on
> the master linux mmu so we can export the same data in secondary
> subsystems too through the notifiers.

Hmmm.. tlb notifier? I was wondering at some point if we could not tie
this into the tlb subsystem.

> > > > +LIST_HEAD(export_notifier_list);
> > >
> > > A global list is not ok IMHO, it's really bad to have a O(N) (N number
> > > of mm in the system) complexity here when it's so trivial to go O(1)
> > > like in my code. We want to swap 100% of the VM exactly so we can have
> > > zillon of idle (or sigstopped) VM on the same system.
> >
> > There will only be one or two of those notifiers. There is no need to
> > build long lists of mm_structs like in your patch.
>
> Each KVM will register into this list (they're mostly independent),
> plus when each notifier is called it will have to check the rmap to
> see if the page belongs to its "mm" before doing anything with it,
> that's really bad for KVM.

KVM could maintain its own lists and deal with its series of KVMs in a
more effective way when it gets its callback.

> > The mm_struct is not available at the point of my callbacks. There is no
> > way to do a callback that is mm_struct based if you are not scanning the
> > reverse list. And scanning the reverse list requires taking locks.
>
> The thing is, we can add notifiers to my patch to fit your needs, but
> those will be _different_ notifiers and they really should be after
> the linux pte updates... I think fixing your code so it'll work with
> the sleeping-notifiers getting the "page" instead of a virtual address
> called _after_ clearing the main linux pte, is the way to go. Then
> hopefully won't have to find a way to enable the PageExported bitflag
> anymore in the linux VM and it may remain always-on for the exported
> pages etc.... it makes life a whole lot easier for you too IMHO.

If you call it after, then the remote pte will still exist and allow
access to the page after the VM has unmapped the page from the local processes.

Just talked to Robin and I think we could work with having a callback
after the local ptes have been removed and the locks have been dropped. At
that point we do not have an mm_struct anymore, so the callback would have
to pass a NULL mm_struct and the page. Also the unmapping of the remote
ptes may transfer a dirty bit, because writing through the remote pte is
possible after the local ptes have been removed. So the callback notifier
afterwards needs to be able to return a dirty state?
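
In terms of the ops structure that would be something like the following
(a sketch of the idea only, not a proposed patch):

/* Sketch only: page-based callback after local ptes and locks are gone. */
struct export_notifier_ops {
	/*
	 * mm may be NULL since the local ptes are already gone by the time
	 * this runs; the return value says whether any remote pte was found
	 * dirty, so the caller can propagate the dirty state to the page.
	 */
	int (*after_invalidate_page)(struct export_notifier *en,
				     struct mm_struct *mm,
				     struct page *page);
};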

2008-01-23 20:28:00

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, 23 Jan 2008, Andrea Arcangeli wrote:

> You want to be able to tell the mmu_notifier that you want the flush
> repeated without locks later? Sorry but then if you're always going to
> set the bitflag unconditionally, why don't you simply implement a
> second notifier in addition of my current ->invalidate_page (like
> ->after_invalidate_page).

Because there is no mm_struct available at that point. So we cannot do a
callback based on the mmu_ops in that fashion. We would have to build a
list of notifiers while scanning the reverse maps.

> We can then implement a method in rmap.c for you to call to do the
> final freeing of the page (pagecache/swapcache won't be collected
> unless it's a truncate, as long as you keep it pinned and you
> certainly don't want to wait a second round of lru scan before freeing
> the page after you release the external reference, so you may need to
> call this method before returning from the

The page count is elevated because of the remote pte so the page is
effectively pinned.

> ->after_invalidate_page). In fact I can call that method for you in the
> notifier implementation itself after all ->after_invalidate_pages have
> been called. (of course only if at least one of them was implemented
> and not-null)

Ok.

> > As an example of thousands, we currently have one customer job that
> > has 16880 processors all with the same physical page faulted into their
> > address space. The way XPMEM is currently structured, there is fan-out of
> > that PFN information so we would not need to queue up that many messages,
> > but it would still be considerable. Our upcoming version of the hardware
> > will potentially make this fanout worse because we are going to allow
> > even more fine-grained divisions of the machine to help with memory
> > error containment.
>
> Well as long as you send these messages somewhat serially and you
> don't pretend to allocate all packets at once it should be ok. Perhaps
> you should preallocate all packets statically and serialize the access
> to the pool with a lock.
>
> What I'd like to stress to be sure it's crystal clear, is that in the
> mm/rmap.c path GFP_KERNEL==GFP_ATOMIC, in fact both are = PF_MEMALLOC =
> TIF_MEMDIE = if mempool is empty it will crash. The argument that you
> need to sleep to allocate memory with GFP_KERNEL is totally bogus. If
> that's the only reason, you don't need to sleep at all. alloc_pages
> will not invoke the VM when called inside the VM, it will grab ram
> from PF_MEMALLOC instead. At most it will schedule so the only benefit
> would be lower -rt latency in the end.

If you are holding a lock then you have to use GFP_ATOMIC and the number
of GFP_ATOMIC allocs is limited. PF_MEMALLOC does not do reclaim so we are
in trouble if too many allocs occur.


> > We have a counter associated with a pfn that indicates when the pfn is no
> > longer referenced by other partitions. This counter triggers changing of
> > memory protections so any subsequent access to this page will result in
> > a memory error on the remote partition (this should be an illegal case).
>
> As long as you keep a reference on the page too, you don't risk
> any corruption by flushing after.

There are still dirty bit issues.

> The window that you must close with that bitflag is the request coming
> from the remote node to map the page after the linux pte has been
> cleared. If you map the page in a remote node after the linux pte has
> been cleared ->invalidate_page won't be called again because the page
> will look unmapped in the linux VM. Now invalidate_page will clear the
> bitflag, so the map requests will block. But where exactly you know
> that the linux pte has been cleared so you can "unblock" the map
> requests? If a page is not mapped by some linux pte, mm/rmap.c will
> never be called and this is why any notification in mm/rmap.c should
> track the "address space" and not the "physical page".

The subsystem needs to establish proper locking for that case.

> In effect you couldn't care less about the address space of the task in
> the master node, so IMHO you're hooking your ->invalidate_page(page)
> (instead of my ->invalidate_page(mm, address)) in the very wrong
> place. You should hook it in mm/vmscan.c shrink-list so it will be
> invoked regardless if the pte is mapped or not. Then your model that

Then page migration and other uses of try_to_unmap won't get there. Also
the page lock is something that helps with serialization of new faults.

> If you work the "pages" you should stick to pages and to stay away
> from mm/rmap.c and ignore whatever is mapped in the master address
> space of the task. mm/rmap.c only deals with ptes/sptes and other
> _virtual-tracked_ mappings.

It also deals, for example, with page dirty status.

2008-01-23 20:40:24

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, 23 Jan 2008, Andrea Arcangeli wrote:

> I think it has yet to be demonstrated that doing the invalidate
> _before_ clearing the linux pte is workable at all for
> shadow-pte/RDMA. Infact even doing it _after_ still requires some form
> of serialization but it's less obviously broken and perhaps more
> fixable unlike doing it before that seems hardly fixable given the
> refill event running in the remote node is supposed to wait on a
> bitflag of a page in the master node to return ON. What Christoph
> didn't specify after hinting you have to wait for the PageExported
> bitflag to return on, is that such page may be back in the freelist by

Why would you wait for the PageExported flag to return on? You remove the
remote mappings, the page lock is being held so no new mappings can occur.
Then you remove the local mappings. A concurrent remote fault would be
prevented by the subsystem which could involve waiting on the page lock.

> Until there's some more reasonable theory of how invalidating the
> remote tlbs/ptes _before_ the main linux pte can remotely work, I'm
> "quite" skeptical it's the way to go for the invalidate_page callback.

Not sure that I see the problem if the subsystem prevents new references
from being established.

> Like Avi said, Xen is dealing with the linux pte only, so there's no
> racy smp page fault to serialize against. Perhaps we can add another
> notifier for Xen though.

Well I think we need to come up with a set of notifiers that can cover all
cases. And so far the export notifiers seem to be the most general. But
they also require more intelligence in the notifiers to do proper
serialization and reverse mapping.

> But I think it's still not enough for Xen to have a method called
> before the ptep_clear_flush: rmap.c would get confused in
> page_mkclean_one for example. It might be possible that vm_ops is the

Why would it get confused? When we get to the operations in rmap.c we
just need to be sure that remote references no longer exist. If the
operations in rmap.c fail then we can reestablish references on demand.

> right way for you even if it further clutters the VM. Like Avi pointed
> me out once, with our current mmu_notifiers we can export the KVM
> address space with remote dma and keep swapping the whole KVM asset
> just fine despite the triple MMU running the system (qemu using linux
> pte, KVM using spte, quadrics using pcimmu). And the core Linux VM
> code (not some obscure hypervisor) will deal with all aging and VM
> issues like a normal task (especially with my last patch that reflects
> the accessed bitflag in the spte the same way the accessed bitflag is
> reflected for the regular ptes).

The problem for us there is that multiple references may exist remotely.
So the actual remote reference count needs to be calculated?

2008-01-24 02:00:27

by Robin Holt

[permalink] [raw]
Subject: Enhance mmu notifiers to accomplish a lockless implementation (incomplete).

Expand the mmu_notifiers to allow for lockless callers. To accomplish
this, the function receiving notifications needs to implement an rmap
equivalent. The notification function is also responsible for tracking
page dirty state.

With this patch, I am getting fairly close to not needing the
invalidate_page mmu_notifier. The combination of invalidate_range
and this export_notifier covers all the paths I can see so far except
__xip_unmap and do_wp_page. __xip_unmap is not so much of a concern,
but I would like to handle it as well.

The one that really concerns me is do_wp_page. I am having difficulty
figuring out a way to handle this without holding locks.

For either of these callers of ptep_clear_flush, I welcome suggestions
on methods to call a notifier without holding any non-sleepable locks.

I am traveling tomorrow but should be able to get back to this tomorrow
evening or early Friday. This has not even been compiled yet. Just
marking it up for now.

Thank you for your attention,
Robin Holt



Index: mmu_notifiers/include/linux/export_notifier.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ mmu_notifiers/include/linux/export_notifier.h 2008-01-23 19:46:05.000000000 -0600
@@ -0,0 +1,48 @@
+#ifndef _LINUX_EXPORT_NOTIFIER_H
+#define _LINUX_EXPORT_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/mm_types.h>
+
+struct export_notifier {
+	struct hlist_node list;
+	const struct export_notifier_ops *ops;
+};
+
+struct export_notifier_ops {
+	/*
+	 * Called with the page lock held after ptes are modified or removed.
+	 *
+	 * Must clear PageExported()
+	 */
+	void (*invalidate_page)(struct export_notifier *em, struct page *page);
+};
+
+#ifdef CONFIG_EXPORT_NOTIFIER
+
+extern void export_notifier_register(struct export_notifier *em);
+extern void export_notifier_unregister(struct export_notifier *em);
+
+extern struct hlist_head export_notifier_list;
+
+#define export_notifier(function, args...)				\
+	do {								\
+		struct export_notifier *__em;				\
+									\
+		rcu_read_lock();					\
+		hlist_for_each_entry_rcu(__em, &export_notifier_list, list) \
+			if (__em->ops->function)			\
+				__em->ops->function(__em, args);	\
+		rcu_read_unlock();					\
+	} while (0);
+
+#else
+
+#define export_notifier(function, args...)
+
+static inline void export_notifier_register(struct export_notifier *em) {}
+static inline void export_notifier_unregister(struct export_notifier *em) {}
+
+#endif
+
+#endif /* _LINUX_EXPORT_NOTIFIER_H */
Index: mmu_notifiers/include/linux/page-flags.h
===================================================================
--- mmu_notifiers.orig/include/linux/page-flags.h 2008-01-23 19:44:40.000000000 -0600
+++ mmu_notifiers/include/linux/page-flags.h 2008-01-23 19:46:05.000000000 -0600
@@ -105,6 +105,7 @@
* 64 bit | FIELDS | ?????? FLAGS |
* 63 32 0
*/
+#define PG_exported 30 /* Page is referenced by something not in the rmaps */
#define PG_uncached 31 /* Page has been mapped as uncached */
#endif

@@ -260,6 +261,14 @@ static inline void __ClearPageTail(struc
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)

+#ifdef CONFIG_EXPORT_NOTIFIER
+#define PageExported(page) test_bit(PG_exported, &(page)->flags)
+#define SetPageExported(page) set_bit(PG_exported, &(page)->flags)
+#define ClearPageExported(page) clear_bit(PG_exported, &(page)->flags)
+#else
+#define PageExported(page) 0
+#endif
+
struct page; /* forward declaration */

extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: mmu_notifiers/mm/Kconfig
===================================================================
--- mmu_notifiers.orig/mm/Kconfig 2008-01-23 19:44:39.000000000 -0600
+++ mmu_notifiers/mm/Kconfig 2008-01-23 19:46:06.000000000 -0600
@@ -197,3 +197,8 @@ config VIRT_TO_BUS
config MMU_NOTIFIER
def_bool y
bool "MMU notifier, for paging KVM/RDMA"
+
+config EXPORT_NOTIFIER
+ def_bool y
+ depends on 64BIT
+ bool "Export Notifier for notifying subsystems about changes to page mappings"
Index: mmu_notifiers/mm/Makefile
===================================================================
--- mmu_notifiers.orig/mm/Makefile 2008-01-23 19:44:39.000000000 -0600
+++ mmu_notifiers/mm/Makefile 2008-01-23 19:46:06.000000000 -0600
@@ -31,4 +31,4 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
-
+obj-$(CONFIG_EXPORT_NOTIFIER) += export_notifier.o
Index: mmu_notifiers/mm/export_notifier.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ mmu_notifiers/mm/export_notifier.c 2008-01-23 19:46:06.000000000 -0600
@@ -0,0 +1,20 @@
+#include <linux/export_notifier.h>
+
+HLIST_HEAD(export_notifier_list);
+DEFINE_SPINLOCK(export_notifier_list_lock);
+
+void export_notifier_register(struct export_notifier *en)
+{
+	spin_lock(&export_notifier_list_lock);
+	hlist_add_head_rcu(&en->list, &export_notifier_list);
+	spin_unlock(&export_notifier_list_lock);
+}
+
+void export_notifier_unregister(struct export_notifier *en)
+{
+	spin_lock(&export_notifier_list_lock);
+	hlist_del_rcu(&en->list);
+	spin_unlock(&export_notifier_list_lock);
+}
+
+
Index: mmu_notifiers/mm/rmap.c
===================================================================
--- mmu_notifiers.orig/mm/rmap.c 2008-01-23 19:44:39.000000000 -0600
+++ mmu_notifiers/mm/rmap.c 2008-01-23 19:46:06.000000000 -0600
@@ -49,6 +49,7 @@
#include <linux/rcupdate.h>
#include <linux/module.h>
#include <linux/kallsyms.h>
+#include <linux/export_notifier.h>

#include <asm/tlbflush.h>

@@ -473,6 +474,8 @@ int page_mkclean(struct page *page)
struct address_space *mapping = page_mapping(page);
if (mapping) {
ret = page_mkclean_file(mapping, page);
+ if (unlikely(PageExported(page)))
+ export_notifier(invalidate_page, page);
if (page_test_dirty(page)) {
page_clear_dirty(page);
ret = 1;
@@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int
else
ret = try_to_unmap_file(page, migration);

+ if (unlikely(PageExported(page)))
+ export_notifier(invalidate_page, page);
+
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
Index: mmu_notifiers/mm/fremap.c
===================================================================
--- mmu_notifiers.orig/mm/fremap.c 2008-01-23 19:44:39.000000000 -0600
+++ mmu_notifiers/mm/fremap.c 2008-01-23 19:46:06.000000000 -0600
@@ -211,6 +211,7 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}

+ mmu_notifier(invalidate_range, mm, start, start + size);
err = populate_range(mm, vma, start, size, pgoff);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {

2008-01-24 04:04:53

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1


On Wed, 2008-01-23 at 16:17 +0200, Avi Kivity wrote:
> Robin Holt wrote:
> > On Wed, Jan 23, 2008 at 01:51:23PM +0100, Gerd Hoffmann wrote:
> >
> >> Jumping in here, looks like this could develop into a direction useful
> >> for Xen.
> >>
> >> Background: Xen has a mechanism called "grant tables" for page sharing.
> >> Guest #1 can issue a "grant" for another guest #2, which in turn then
> >> can use that grant to map the page owned by guest #1 into its address
> >> space. This is used by the virtual network/disk drivers, i.e. typically
> >> Domain-0 (which has access to the real hardware) maps pages of other
> >> guests to fill in disk/network data.
> >>
> >
> > This is extremely similar to what XPMEM is providing.
> >
> >
>
> I think that in Xen's case the page tables are the normal cpu page
> tables, not an external mmu (like RDMA, kvm, and XPMEM).

However, that will be useful to the DRI folks, as modern video chips are
growing MMUs with even page fault capabilities.

Ben.

2008-01-24 04:05:38

by Robin Holt

[permalink] [raw]
Subject: Re: Enhance mmu notifiers to accomplish a lockless implementation (incomplete).

Expand the mmu_notifiers to allow for lockless callers. To accomplish
this, the function receiving notifications needs to implement an rmap
equivalent. The notification function is also responsible for tracking
page dirty state.

Version 2 brings with it __xip_unmap and do_wp_page so this is getting
to the point where we can start testing. It does compile now.

I am traveling tomorrow but should be able to get back to this tomorrow
evening or early Friday.

Thank you for your attention,
Robin Holt

Index: mmu_notifiers/include/linux/export_notifier.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ mmu_notifiers/include/linux/export_notifier.h 2008-01-23 21:24:33.000000000 -0600
@@ -0,0 +1,50 @@
+#ifndef _LINUX_EXPORT_NOTIFIER_H
+#define _LINUX_EXPORT_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/mm_types.h>
+
+struct export_notifier {
+	struct hlist_node hlist;
+	const struct export_notifier_ops *ops;
+};
+
+struct export_notifier_ops {
+	/*
+	 * Called with the page lock held after ptes are modified or removed.
+	 *
+	 * Must clear PageExported()
+	 */
+	void (*invalidate_page)(struct export_notifier *em, struct page *page);
+};
+
+#ifdef CONFIG_EXPORT_NOTIFIER
+
+extern void export_notifier_register(struct export_notifier *em);
+extern void export_notifier_unregister(struct export_notifier *em);
+
+extern struct hlist_head export_notifier_list;
+
+#define export_notifier(function, args...)				\
+	do {								\
+		struct export_notifier *__em;				\
+		struct hlist_node *__n;					\
+									\
+		rcu_read_lock();					\
+		hlist_for_each_entry_rcu(__em, __n, &export_notifier_list, \
+					 hlist)				\
+			if (__em->ops->function)			\
+				__em->ops->function(__em, args);	\
+		rcu_read_unlock();					\
+	} while (0);
+
+#else
+
+#define export_notifier(function, args...)
+
+static inline void export_notifier_register(struct export_notifier *em) {}
+static inline void export_notifier_unregister(struct export_notifier *em) {}
+
+#endif
+
+#endif /* _LINUX_EXPORT_NOTIFIER_H */
Index: mmu_notifiers/include/linux/page-flags.h
===================================================================
--- mmu_notifiers.orig/include/linux/page-flags.h 2008-01-23 21:24:27.000000000 -0600
+++ mmu_notifiers/include/linux/page-flags.h 2008-01-23 21:57:58.000000000 -0600
@@ -105,6 +105,7 @@
* 64 bit | FIELDS | ?????? FLAGS |
* 63 32 0
*/
+#define PG_exported 30 /* Page is referenced by something not in the rmaps */
#define PG_uncached 31 /* Page has been mapped as uncached */
#endif

@@ -260,6 +261,14 @@ static inline void __ClearPageTail(struc
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)

+#ifdef CONFIG_EXPORT_NOTIFIER
+#define PageExported(page) test_bit(PG_exported, &(page)->flags)
+#define SetPageExported(page) set_bit(PG_exported, &(page)->flags)
+#define ClearPageExported(page) clear_bit(PG_exported, &(page)->flags)
+#else
+#define PageExported(page) 0
+#endif
+
struct page; /* forward declaration */

extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: mmu_notifiers/mm/Kconfig
===================================================================
--- mmu_notifiers.orig/mm/Kconfig 2008-01-23 21:24:27.000000000 -0600
+++ mmu_notifiers/mm/Kconfig 2008-01-23 21:57:58.000000000 -0600
@@ -197,3 +197,8 @@ config VIRT_TO_BUS
config MMU_NOTIFIER
def_bool y
bool "MMU notifier, for paging KVM/RDMA"
+
+config EXPORT_NOTIFIER
+ def_bool y
+ depends on 64BIT
+ bool "Export Notifier for notifying subsystems about changes to page mappings"
Index: mmu_notifiers/mm/Makefile
===================================================================
--- mmu_notifiers.orig/mm/Makefile 2008-01-23 21:24:27.000000000 -0600
+++ mmu_notifiers/mm/Makefile 2008-01-23 21:24:33.000000000 -0600
@@ -31,4 +31,4 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
-
+obj-$(CONFIG_EXPORT_NOTIFIER) += export_notifier.o
Index: mmu_notifiers/mm/export_notifier.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ mmu_notifiers/mm/export_notifier.c 2008-01-23 21:57:27.000000000 -0600
@@ -0,0 +1,20 @@
+#include <linux/export_notifier.h>
+
+HLIST_HEAD(export_notifier_list);
+DEFINE_SPINLOCK(export_notifier_list_lock);
+
+void export_notifier_register(struct export_notifier *en)
+{
+ spin_lock(&export_notifier_list_lock);
+ hlist_add_head_rcu(&en->hlist, &export_notifier_list);
+ spin_unlock(&export_notifier_list_lock);
+}
+
+void export_notifier_unregister(struct export_notifier *en)
+{
+ spin_lock(&export_notifier_list_lock);
+ hlist_del_rcu(&en->hlist);
+ spin_unlock(&export_notifier_list_lock);
+}
+
+
Index: mmu_notifiers/mm/rmap.c
===================================================================
--- mmu_notifiers.orig/mm/rmap.c 2008-01-23 21:24:27.000000000 -0600
+++ mmu_notifiers/mm/rmap.c 2008-01-23 21:57:27.000000000 -0600
@@ -49,6 +49,7 @@
#include <linux/rcupdate.h>
#include <linux/module.h>
#include <linux/kallsyms.h>
+#include <linux/export_notifier.h>

#include <asm/tlbflush.h>

@@ -473,6 +474,8 @@ int page_mkclean(struct page *page)
struct address_space *mapping = page_mapping(page);
if (mapping) {
ret = page_mkclean_file(mapping, page);
+ if (unlikely(PageExported(page)))
+ export_notifier(invalidate_page, page);
if (page_test_dirty(page)) {
page_clear_dirty(page);
ret = 1;
@@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int
else
ret = try_to_unmap_file(page, migration);

+ if (unlikely(PageExported(page)))
+ export_notifier(invalidate_page, page);
+
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
Index: mmu_notifiers/mm/fremap.c
===================================================================
--- mmu_notifiers.orig/mm/fremap.c 2008-01-23 21:24:27.000000000 -0600
+++ mmu_notifiers/mm/fremap.c 2008-01-23 21:57:27.000000000 -0600
@@ -211,6 +211,7 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}

+ mmu_notifier(invalidate_range, mm, start, start + size);
err = populate_range(mm, vma, start, size, pgoff);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
Index: mmu_notifiers/mm/memory.c
===================================================================
--- mmu_notifiers.orig/mm/memory.c 2008-01-23 21:24:27.000000000 -0600
+++ mmu_notifiers/mm/memory.c 2008-01-23 21:57:27.000000000 -0600
@@ -1637,6 +1637,7 @@ gotten:
/*
* Re-check the pte - we dropped the lock
*/
+ mmu_notifier(invalidate_range, mm, address, address + PAGE_SIZE - 1);
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
Index: mmu_notifiers/mm/filemap_xip.c
===================================================================
--- mmu_notifiers.orig/mm/filemap_xip.c 2008-01-23 21:33:21.000000000 -0600
+++ mmu_notifiers/mm/filemap_xip.c 2008-01-23 21:57:27.000000000 -0600
@@ -13,6 +13,7 @@
#include <linux/module.h>
#include <linux/uio.h>
#include <linux/rmap.h>
+#include <linux/export_notifier.h>
#include <linux/sched.h>
#include <asm/tlbflush.h>

@@ -183,6 +184,7 @@ __xip_unmap (struct address_space * mapp
if (!page)
return;

+ export_notifier(invalidate_page, page);
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
mm = vma->vm_mm;

2008-01-24 06:03:44

by Avi Kivity

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Christoph Lameter wrote:
> On Wed, 23 Jan 2008, Robin Holt wrote:
>
>
>>> That won't work for kvm. If we have a hundred virtual machines, that means
>>> 99 no-op notifications.
>>>
>> But 100 callouts holding spinlocks will not work for our implementation
>> and even if the callouts are made with spinlocks released, we would very
>> strongly prefer a single callout which messages the range to the other
>> side.
>>
>
>
> Andrea wont have 99 no op notifications. You will have one notification to
> the kvm subsystem (since there needs to be only one register operation
> for a subsystem that wants to get notifications). What do you do there is
> up to kvm. If you want to call some function 99 times then you are free to
> do that.
>

What I need is a list of (mm, va) pairs that map the page. kvm doesn't
have access to that; export notifiers do. It seems reasonable that
export notifiers do that rmap walk since they are part of the core mm,
not kvm.

Alternatively, kvm can change its internal rmap structure to be page
based instead of (mm, hva) based. The problem here is to size this
thing, as we don't know in advance (when the kvm module is loaded)
whether 0% or 100% (or some value in between) of system memory will be
used for kvm.
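
To make the sizing problem concrete: a page-based rmap has to reserve a
slot for every host page frame up front, something like this rough
sketch (not kvm's actual data structures, names are made up):

	/* hypothetical page-based kvm rmap: one chain head per host pfn */
	struct kvm_page_rmap {
		unsigned long spte_chain;	/* shadow ptes mapping this pfn */
	};
	/* needs max_pfn entries when the module loads, even if the
	 * guests end up touching almost none of system memory */
	static struct kvm_page_rmap *kvm_page_rmap;

whereas the (mm, hva) based rmap only grows with the memory the guests
actually register.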

--
Any sufficiently difficult bug is indistinguishable from a feature.

2008-01-24 06:08:18

by Avi Kivity

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Gerd Hoffmann wrote:
> Another maybe workable approach for Xen is to go through pv_ops
> (although pte_clear doesn't go through pv_ops right now, so this would
> be an additional hook too ...).
>

I think that's the way. Xen is not a secondary mmu but rather a primary
mmu with some magic characteristics.

--
Any sufficiently difficult bug is indistinguishable from a feature.

2008-01-24 06:45:29

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Gerd Hoffmann wrote:
> Another maybe workable approach for Xen is to go through pv_ops
> (although pte_clear doesn't go through pv_ops right now, so this would
> be an additional hook too ...).
>

It does for 32-bit PAE. Making pte_clear uniform across all pagetable
modes would be a nice cleanup.

J

2008-01-24 12:26:34

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Thu, Jan 24, 2008 at 07:56:43AM +0200, Avi Kivity wrote:
> What I need is a list of (mm, va) that map the page. kvm doesn't have
> access to that, export notifiers do. It seems reasonable that export
> notifier do that rmap walk since they are part of core mm, not kvm.

Yes. As said in an earlier email, we could ignore the slowdown and
duplicate the mm/rmap.c code inside kvm, but that looks like a bad
layering violation and it's unnecessary, dirty and suboptimal IMHO.

> Alternatively, kvm can change its internal rmap structure to be page based
> instead of (mm, hva) based. The problem here is to size this thing, as we
> don't know in advance (when the kvm module is loaded) whether 0% or 100%
> (or some value in between) of system memory will be used for kvm.

Another issue is that for things like the page sharing driver, it's
more convenient to be able to know exactly which "sptes" belong to a
certain userland mapping, and only that userland mapping (not all
other mappings of the physical page). So if the rmap becomes page
based, it'd be nice to still be able to find the "mm" associated with
that certain spte pointer, to skip all the sptes in the other "mm"s
during the invalidate.

2008-01-24 12:34:50

by Avi Kivity

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Andrea Arcangeli wrote:
> On Thu, Jan 24, 2008 at 07:56:43AM +0200, Avi Kivity wrote:
>
>> What I need is a list of (mm, va) that map the page. kvm doesn't have
>> access to that, export notifiers do. It seems reasonable that export
>> notifier do that rmap walk since they are part of core mm, not kvm.
>>
>
> Yes. Like said in earlier email we could ignore the slowdown and
> duplicate the mm/rmap.c code inside kvm, but that looks a bad layering
> violation and it's unnecessary, dirty and suboptimal IMHO.
>
>

Historical note: old kvm versions (like what will eventually ship in
2.6.24) have a page-based rmap (hooking the rmap list off
page->private). We changed that to an mm-based rmap since page->private
is not available when kvm maps general userspace memory.


>> Alternatively, kvm can change its internal rmap structure to be page based
>> instead of (mm, hva) based. The problem here is to size this thing, as we
>> don't know in advance (when the kvm module is loaded) whether 0% or 100%
>> (or some value in between) of system memory will be used for kvm.
>>
>
> Another issue is that for things like the page sharing driver, it's
> more convenient to be able to know exactly which "sptes" belongs to a
> certain userland mapping, and only that userland mapping (not all
> others mappings of the physical page). So if the rmap becomes page
> based, it'd be nice to still be able to find the "mm" associated with
> that certain spte pointer to skip all sptes in the other "mm" during
> the invalidate.
>

You also need the mm (or rather, the kvm structure, but they have a 1:1
relationship) to be able to lock and maintain the shadow structures
properly.

--
Any sufficiently difficult bug is indistinguishable from a feature.

2008-01-24 14:35:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, Jan 23, 2008 at 12:18:45PM -0800, Christoph Lameter wrote:
> On Wed, 23 Jan 2008, Andrea Arcangeli wrote:
>
> > > [..] The linux instance with the
> > > secondary mmu must call back to the exporting machine in order to
> > > reinstantiate the page. PageExported is cleared in invalidate_page() so
> > > the other linux instance will be told that the page is not available.
> >
> > Page is not available is not a valid answer. At least with KVM there
> > are three possible ways:
> >
> > 1) the remote instance will have to wait for the linux pte to go away
> > before calling follow_page (then the page is gone as a whole so
> > there won't be any more page flags to check)
> > 2) it will kill the VM
> >
> > Nothing runs SetPageExported in your VM code, I can't see how the
> > remote instance can know when it can call follow_page again safely on
> > the master node.
>
> SetPageExported is set when a remote instance of linux establishes a
> reference to the page (a kind of remote page fault). In the KVM scenario
> that would occur when memory is made available.

The remote page fault is exactly the thing that has to wait for the
PageExported bit to come back on! So how can it be the thing that sets
PageExported?

The idea is:

NODE0 NODE1
->invalidate_page()
ClearPageExported
GFP_KERNEL (== GFP_ATOMIC in mm/rmap.c, won't ever do any I/O)

->invalidate_page() arrives and drop
references

__free_page -> unpin so it can be freed
go ahead after invalidate_page

zero locking so previous invalidate_page could schedule (not wait for I/O,
there won't be any I/O out of GFP_KERNEL inside PF_MEMALLOC, i.e. mm/rmap.c!!!)

remote page fault
tries to instantiate more references
remote page fault arrives
instantiate more references
get_page() -> pin
SetPageExported
remote page fault succeeded

zero locking so invalidate_page can schedule (not wait for I/O,
there won't be any I/O out of GFP_KERNEL!)

ptep_clear_flush

After the above your remote references will keep the page pinned in
RAM and it'll be unswappable; mm/rmap.c will never be called on that
page again! That's the pinned-guest-memory leak in KVM terms.

I immediately told you about the above SMP race when I saw your
backwards idea of invalidating the page _before_ clearing the linux
pte.

I thought your solution was to have the remote page fault wait for
PG_exported to come back ON!! But now you tell me the remote page fault
is the thing that has to SetPageExported, not the linux VM. So make up
your mind about this PG_exported mess...

> > The remote instance is like a secondary TLB what you're doing in your
> > code is as backwards as flushing the TLB _before_ clearing the PTE! If
> > you want to call the secondary tlb flush outside locks we can argue
> > about that, but I think you should do that _after_ clearing the linux
> > pte IMHO. Otherwise you can as well move the tlb_flush_page before
> > clearing the pte and you'll run in the same amount of smp races for
> > the master MMU too.
> >
> > > Ahhh. Good to hear. But we will still end in a situation where only
> > > the remote ptes point to the page. Maybe the remote instance will dirty
> > > the page at that point?
> >
> > If you flush the remote instance _after_ clearing the main linux PTE
> > like I suggest and like I'm doing in my patch, I can't see how you
> > could risk to end up in a situation with only the remote ptes pointing
> > the page, that's the whole point of doing the remote-TLB flush _after_
> > clearing the main linux pte, instead of before like in your patch.
>
> You are saying that clearing the main linux ptes and leaving the remote
> ptes in place will not allow access to the page via the remote ptes?

No, I'm saying that if you clear the main linux pte while there are
still remote ptes in place (and in turn the page_count has been boosted
by 1 with your current code), and you rely on mm/rmap.c for the
->invalidate_page, you will generate an unswappable-pin leak.

The linux pte must be present and the page must be mapped in userland
as long as there are remote references to the page and in turn as long
as the page_count has been boosted by 1. Otherwise mm/rmap.c won't be
called.

At the very least you should move your invalidate_page into
mm/vmscan.c and have it called regardless of whether the page is
mapped in userland or not.

> > This is the same as the tlb flush, there's a reason we normally do:
> >
> > pte_clear()
> > flush_tlb_page()
> > __free_page()
> >
> > instead of:
> >
> > flush_tlb_page()
> > pte_clear()
> > __free_page()
> >
> > The ordering you're implementing is backwards and unnatural, you can
> > try to serialize it with explicit locking to block the "remote-tlb
> > refills" through page bitflags (something not doable with the core
> > master tlb because the refills are done by the hardware with the
> > master tlb), but it'll remain unnatural and backwards IMHO.
>
> I do not understand where you actually clear the remote pte or spte. You
> must do it sometime before the notification to make it work.

Definitely not. The spte is just like a second tlb. You never flush the
tlb before clearing the main linux pte! Try the backwards order:

flush_tlb_page()
pte_clear()
__free_page

and you'll see crashes very very soon.

With KVM the backwards order perhaps wouldn't crash, because when the
spte maps the page the page is pinned, but there would still be an
unswappable-pinned-memory leak.
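
To spell out the ordering I'm arguing for, schematically (in my patch
the notifier call is really hidden inside ptep_clear_flush via
asm-generic/pgtable.h, so this is only a sketch of the idea, not the
literal mm/rmap.c code):

	pteval = ptep_clear_flush(vma, address, pte);	/* clear linux pte + flush TLB */
	mmu_notifier(invalidate_page, mm, address);	/* only now drop the sptes */
	if (pte_dirty(pteval))
		set_page_dirty(page);
	page_remove_rmap(page, vma);
	page_cache_release(page);	/* page really freeable once unpinned */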

> > > > > - anon_vma/inode and pte locks are held during callbacks.
> > > >
> > > > In a previous email I asked what's wrong in offloading the event, and
> > >
> > > We have internally discussed the possibility of offloading the event but
> > > that wont work with the existing callback since we would have to
> > > perform atomic allocation and there may be thousands of external
> > > references to a page.
> >
> > You should however also consider a rearming tasklet/softirq invoked by
> > ksoftirqd, if memory allocation fails you retry later. Furthermore you
> > should not require thousands of simultaneous allocations anyway,
> > you're running in the PF_MEMALLOC path and your memory will come from
> > the precious PF_MEMALLOC pool in the objrmap paths! If you ever
>
> Right. That is why the mmu_ops approach does not work and that is why we
> need to sleep.

You told me you were worried about atomic allocations. Now you tell me
you need to sleep, after I just explained to you how utterly useless it
is to sleep inside GFP_KERNEL allocations when invoked by try_to_unmap
in the mm/rmap.c paths. You will never sleep in any memory allocation
other than to call schedule() because need_resched is set. You will do
zero I/O. All your allocations will come from the PF_MEMALLOC pool
like I said above, not from swapping, not from the VM. The VM will
obviously refuse to be invoked recursively.

> > attempt what you suggest (simultanous allocation of thousands of
> > packets to simultaneously notify the thousand of external reference)
> > depending on the size of the allocation you can instant deadlock
> > regardless if you can sleep or not. Infact if you can't sleep and you
> > rearm the tasklet when GFP_ATOMIC fails you won't instant
> > deadlock.... I think you should have a max-amount of simultanous
> > allocations, and you should notify the external references partially
> > _serially_. Also you can't exceed the max-amount of simultanous
> > allocation even if GFP_ATOMIC/KERNEL fails or you'll squeeze the whole
> > PF_MEMALLOC pool leading to deadlocks elsewhere. You must stop when
> > there's still quite some room in the PF_MEMALLOC pool.
>
> Good. So we cannot use your mmops approach.

If PF_MEMALLOC from my mmu notifiers isn't enough, the place where you
put your invalidate_page will also only get memory from PF_MEMALLOC,
so it won't be enough either.

Also, I'm not sure why you call my patch mmops, when it's mmu_notifier instead.

> Well that could be avoided by keeping an rmap for that purpose?

This was answered in a separate email.

> Maybe we need two different types of callbacks? It seems that the callback
> before we begin scanning the rmaps is also necessary for mmops because we
> need to disable the ptes in some fashion before we shut down the local
> ptes.

"disable the ptes" "before" "we shut down the local ptes". Not very
clear.

Anyway no, I don't need any call before scanning the rmaps. Doing
anything at all on the sptes before the main linux pte is gone is
backwards and flawed, just like it would be flawed to flush the tlb
before clearing the linux pte:

flush_tlb_page()
pte_clear()
__free_page

I don't need to do anything at all, as long as the main linux pte is
still there.

> > > There is only the need to walk twice for pages that are marked Exported.
> >
> > All kvm guest physical pages would need to be marked exported of
> > course.
>
> Why export all pages? Then you are planning to have mm_struct
> notifiers for all processes in the system?

KVM is 1 process; I'm not sure how you got to imagining I need to track
every process in the system, when in fact I only need to track pages
belonging to the KVM process.

But you're right about one thing: I could also try to mark PG_exported
only on the guest pages currently mapped in the sptes. That's a minor
optimization that would save a bit of cpu during swapping, but it would
make the non-swap fast path a bit slower by requiring one more atomic
bitop for every spte instantiation and spte unmapping.

> > > And the double walk is only necessary if the exporter does not have its
> > > own rmap. The cross partition thing that we are doing has such an rmap and
> >
> > We've one rmap per-VM, so the same physical pages will have multiple
> > rmap structures, each VM is indipendent and the refcounting happens on
> > the core page_count.
>
> Ahh. So you are basically okay?

If "cat mm/rmap.c >> arch/x86/kvm/mmu.c" means "basically ok" to you,
then yes, I'm basically ok. More details on this in Avi's email.

The other way would be to change the kvm internals to share a single
rmap structure for _all_ VMs. I find it so elegant to connect the main
linux pte with only the sptes associated with it that it's not very
appealing to go back and instead bind the page_t with all the sptes of
all running VMs, when in fact the sptes aren't all equal but each spte
is still associated with a linux pte at runtime. Also, I'd still need
to know which "mm/kvm" struct is associated with each spte reachable
through the page_t in order to do anything with the spte.

> > > I think I explained that above. Remote users effectively are forbidden to
> > > establish new references to the page by the clearing of the exported bit.
> >
> > And how do they know when they can restart adding references if infact
> > the VM _never_ calls into SetPageExported? (perhaps you forgot
> > something in your patch to set PageExported again to notify the
> > external reference that it can "de-freeze" and to restart adding
> > references ;)
>
> Well there is the subsystem missing that provides that piece.

See the top of this email for how such a subsystem was supposed to
freeze as long as PG_exported was unset.

> > > > with your coarse export_notifier(invalidate_page) called
> > > > unconditionally before checking any young bit at all.
> > >
> > > The export notifier is called only if the mm_struct or page bit for
> > > exporting is set. Maybe I missed to add a check somewhere?
> >
> > try_to_unmap internally is doing:
> >
> > if any accessed bit is clear:
> > clear the accessed bit, flush tlb and do nothing else at all
> > else
> > ptep_clear_flush
> >
> > Not sure how you're going to "do nothing else at all" if you've a
> > broad invalidate_page call before even checking the linux pte.
>
> Look at the code: It checks PageExported before doing any calls. And the
> list of callbacks is very small. One item typically.

What's the relation between PG_exported and the young bit in the linux
pte? How do you connect the two?

It's utterly useless to call ->invalidate_page(page) on a page that is
still mapped by some linux pte with the young bit set. You must defer
the ->invalidate_page until after all young bits are gone. This is what
I do; in fact I do a lot more than that by also honouring the accessed
bits in all sptes. There's zero chance you can be even remotely as
efficient as my mmu-notifiers are, unless you also do "cat rmap.c >>
/sgi/yoursubsystem/something.c" and check the young bit in the linux
ptes yourself _before_ deciding whether you have to start dropping
remote references or not.

> > But given with your design we'll have to replicate mm/rmap.c inside
> > KVM to find the virtual address where the page is mapped in each "mm"
>
> The rmap that you are using is likely very much simplified. You just need
> to track how it was mapped in order to invalidate the kvm ptes.

The rmap I'm using is very lightweight; that's a feature, not a bug.
Why duplicate heavyweight info to track sptes from the "page" when I
can keep that heavyweight info just in the kvm structure?

> > Also note KVM will try to generate a core-linux page fault in order to
> > find the "page struct", so I can't see how we could ever check the
> > PageExported bit to know if we can trigger the core-linux page fault
> > or not. Your whole design is backwards as far as I can tell.
>
> Check it before calling into the vm to generate the core-linux fault?
> Surely you run some KVM code there.

-ENOTUNDERSTOOD but I doubt it's worth answering until the major flaws
in your #v1 are clearly understood.

>
> > > > Look how clean it is to hook asm-generic/pgtable.h in my last patch
> > > > compared to the above leaking code expanded all over the place in the
> > > > mm/*.c, unnecessary mangling of atomic bitflags in the page struct,
> > > > etc...
> > >
> > > I think that hunk is particularly bad in your patch. A notification side
> > > event in a macro? You would want that explicitly in the code.
> >
> > Sorry this is the exact opposite. I'm placing all required
> > invalidate_page with a 1 line change to the kernel source. How can
> > that be bad? This is exactly the right place to hook into so it
> > will remain as close as possible to the main linux TLB/MMU without
> > cluttering mm/*.c with notifiers all over the place. Check how clean
> > it is the access bit test_and_clear:
> >
> > #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> > #define ptep_clear_flush_young(__vma, __address, __ptep) \
> > ({ \
> > int __young; \
> > __young = ptep_test_and_clear_young(__vma, __address, __ptep); \
> > if (__young) \
> > flush_tlb_page(__vma, __address); \
> > __young |= mmu_notifier_age_page((__vma)->vm_mm, __address); \
> > __young; \
> > })
> > #endif
> >
> > This is totally strightforward, clean and 100% optimal too! I fail to
> > see how this can be considered _inferior_ to your cluttering of
> > mm/*.c (plus the fact you place your hook _before_ clearing the main
> > linux pte which is backwards).
>
> This means that if you do a ptep_clear_flush_young then mmu notifiers run
> etc etc which may do a lot of things. You want that not hidden in a macro.
> The flush_tlb_page there is bad enough.

Hiding the tlb flushes and being able to run ptep_clear_flush and
ptep_clear_flush_young without worrying about the arch internals is
totally clean. It's hard to believe you're seriously suggesting that
it would be better to expand all those flush_tlb_page in
mm/*.c. Furthermore this leaves each arch free to implement the
ptep_clear_flush ops like they prefer, which is a very arch lowlevel
thing.

Anyway, if you think it'd be cleaner to expand flush_tlb_page in
mm/*.c and remove it from pgtable.h, then you're free to send a patch
to achieve such a ""cleanup"" and I'll leave the comments to others. I
couldn't care less about coding style issues frankly, I'm not the kind
of person who cares about those things, and especially seeing how much
people's opinions could diverge from your claim that cluttering mm/*.c
with tlb flushes would be a "good thing", I'm not too interested in
arguing about it either.

From my POV as long as you keep calling __young |=
mmu_notifier_age_page((__vma)->vm_mm, __address) after your cpp
expansion I'm ok.
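
For completeness, the ptep_clear_flush side looks analogous in
asm-generic/pgtable.h (a sketch of the idea, not a verbatim copy of
the patch):

#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
#define ptep_clear_flush(__vma, __address, __ptep)			\
({									\
	pte_t __pte;							\
	__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);	\
	flush_tlb_page(__vma, __address);				\
	mmu_notifier(invalidate_page, (__vma)->vm_mm, __address);	\
	__pte;								\
})
#endif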

> > > What we are doing is effectively allowing external references to pages.
> > > This is outside of the regular VM operations. So export came up but we
> > > could call it something else. External? Its not really tied to the mmu
> > > now.
> >
> > It's tied to the core linux mmu. Even for KVM it's a secondary tlb not
> > really a secondary mmu. But we want notifications of the operations on
> > the master linux mmu so we can export the same data in secondary
> > subsystems too through the notifiers.
>
> Hmmm.. tlb notifier? I was wondering at some point if we could not tie
> this into the tlb subsystem.

I guess I didn't explain myself clearly, sorry: I was using "tlb
notifier" as an example of a wrong name for this.

> > > > > +LIST_HEAD(export_notifier_list);
> > > >
> > > > A global list is not ok IMHO, it's really bad to have a O(N) (N number
> > > > of mm in the system) complexity here when it's so trivial to go O(1)
> > > > like in my code. We want to swap 100% of the VM exactly so we can have
> > > > zillon of idle (or sigstopped) VM on the same system.
> > >
> > > There will only be one or two of those notifiers. There is no need to
> > > build long lists of mm_structs like in your patch.
> >
> > Each KVM will register into this list (they're mostly indipendent),
> > plus when each notifier is called it will have to check the rmap to
> > see if the page belongs to its "mm" before doing anything with it,
> > that's really bad for KVM.
>
> KVM could maintain its own lists and deal with its series of KVMs in a
> more effective way when it gets its callback.

Not more effective, but sure, in some way it can be made to work.

> > > The mm_struct is not available at the point of my callbacks. There is no
> > > way to do a callback that is mm_struct based if you are not scanning the
> > > reverse list. And scanning the reverse list requires taking locks.
> >
> > The thing is, we can add notifiers to my patch to fit your needs, but
> > those will be _different_ notifiers and they really should be after
> > the linux pte updates... I think fixing your code so it'll work with
> > the sleeping-notifiers getting the "page" instead of a virtual address
> > called _after_ clearing the main linux pte, is the way to go. Then
> > hopefully won't have to find a way to enable the PageExported bitflag
> > anymore in the linux VM and it may remain always-on for the exported
> > pages etc.... it makes life a whole lot easier for you too IMHO.
>
> If you call it after then the pte will still exist remotely and allow
> access to the page after the VM has removed processes from the page.
>
> Just talked to Robin and I think we could work with having a callback
> after the local ptes have been removed and the locks have been dropped. At

I know Robin is converging on calling the secondary-MMU invalidate
after clearing the main linux pte. But I answered at the top of this
email anyway, to be sure it's clear _why_ that's the right ordering.

> that point we do not have an mm_struct anymore so the callback would have

The mm struct wasn't available in the place where you put
invalidate_page either.

> to passs a NULL mm_struct and the page. Also the unmapping of the remote

I think it's a lot better for it to be an invalidate_page_after(page)
without the NULL parameter. Plus I still want an (address, mm) pair,
not a "page", out of my atomic non-sleeping call.

> ptes may transfer a dirty bit because writing through the remote pte is
> possible after the local ptes have been removed. So the callback notifier
> after needs to be able to return a dirty state?

set_page_dirty can be called inside ->invalidate_page if needed. But
I'm not against artificially setting the dirty bit of the pteval
returned by set_page_dirty, perhaps that's more efficient as it will
require a single locked op on the page_t.

2008-01-24 14:42:03

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Thu, Jan 24, 2008 at 03:34:54PM +0100, Andrea Arcangeli wrote:
> set_page_dirty can be called inside ->invalidate_page if needed. But
> I'm not against artificially setting the dirty bit of the pteval
> returned by set_page_dirty, perhaps that's more efficient as it will
^^^^^^^^^^^^^^ ptep_clear_flush of course

> require a single locked op on the page_t.

2008-01-24 15:15:53

by Avi Kivity

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Andrea Arcangeli wrote:
> The remote page fault
>

As we have two names for this ('shadow' and 'remote/export') I'd like to
suggest a neutral nomenclature. How about 'secondary mmu' (and
secondary ptes (sptes), etc.)? I think this fits xpmem, kvm, rdma, and dri.

--
Any sufficiently difficult bug is indistinguishable from a feature.

2008-01-24 15:19:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Avi Kivity wrote:
> Andrea Arcangeli wrote:
>
>> The remote page fault
>>
>>
>
> As we have two names for this ('shadow' and 'remote/export') I'd like to
> suggest a neutral nomenclature. How about 'secondary mmu' (and
> secondary ptes (sptes), etc.)? I think this fits xpmem, kvm, rdma, and dri.
>
>

Er, it was Robin who came up with this first, I see. Let's just say I
second the motion.

--
Any sufficiently difficult bug is indistinguishable from a feature.

2008-01-24 15:43:20

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Wed, Jan 23, 2008 at 12:27:47PM -0800, Christoph Lameter wrote:
> There are still dirty bit issues.

Yes, but there are no big issues, given ->invalidate_page is fully
capable of running set_page_dirty if needed.
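
Roughly, something like this on the secondary-MMU side (a sketch only;
the my_* helpers and MY_SPTE_DIRTY are made-up names, not kvm code):

	static void my_invalidate_page(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long address)
	{
		/* clear the secondary pte and transfer its dirty bit */
		u64 spte = my_zap_spte(mm, address);
		if (spte & MY_SPTE_DIRTY)
			set_page_dirty(my_spte_to_page(spte));
	}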

> > The window that you must close with that bitflag is the request coming
> > from the remote node to map the page after the linux pte has been
> > cleared. If you map the page in a remote node after the linux pte has
> > been cleared ->invalidate_page won't be called again because the page
> > will look unmapped in the linux VM. Now invalidate_page will clear the
> > bitflag, so the map requests will block. But where exactly you know
> > that the linux pte has been cleared so you can "unblock" the map
> > requests? If a page is not mapped by some linux pte, mm/rmap.c will
> > never be called and this is why any notification in mm/rmap.c should
> > track the "address space" and not the "physical page".
>
> The subsystem needs to establish proper locking for that case.

How? Your answer was to have the subsystem fault wait for PG_exported
to come back ON... when later you told me the subsystem fault is the
thing that is supposed to set PG_exported ON again... Perhaps you
really could invent proper locking to make your #v1 workable somehow,
but I haven't seen a sign of it yet.

In fact I'm not so sure that everything will be race-free with
invalidate_page_after (given you intend to call it outside the PT
lock, so concurrent linux minor faults can happen in parallel with
your invalidate_page_after), but at least it has a better chance of
working without having to invent much new complex locking.

> It also deals f.e. with page dirty status.

I think you should consider whether you can also build a per-mm rmap
like KVM does, indexed by the virtual address.
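
Roughly what I mean (a hypothetical sketch, not xpmem or kvm code; all
names are made up):

	/* per-mm reverse map for a secondary MMU, indexed by userland va */
	struct secondary_rmap_entry {
		unsigned long		address;	/* userland virtual address */
		struct secondary_pte	*spte;		/* secondary pte(s) for it */
		struct hlist_node	hash;
	};

	struct secondary_mm {
		struct mm_struct	*mm;
		struct hlist_head	rmap_hash[SECONDARY_RMAP_HASH_SIZE];
		spinlock_t		lock;
	};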

2008-01-24 20:02:16

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Thu, 24 Jan 2008, Andrea Arcangeli wrote:

> > SetPageExported is set when a remote instance of linux establishes a
> > reference to the page (a kind of remote page fault). In the KVM scenario
> > that would occur when memory is made available.
>
> The remote page fault is exactly the thing that has to wait on the
> PageExported bit to return on! So how can it be the thing that sets
> SetPageExported?

I do not remember us saying that the remote page fault has to wait on PageExported.

> The idea is:
>
> NODE0 NODE1

SetPageLocked

> ->invalidate_page()
> ClearPageExported
> GFP_KERNEL (== GFP_ATOMIC in mm/rmap.c, won't ever do any I/O)
>
> ->invalidate_page() arrives and drop
> references
>
ClearPageLocked

> __free_page -> unpin so it can be freed
> go ahead after invalidate_page
>
> zero locking so previous invalidate_page could schedule (not wait for I/O,
> there' won't be any I/O out of GFP_KERNEL inside PF_MEMALLOC i.e. mm/rmap.c!!!)

PageLocked is set and there could be synchronization among the
callbacks. F.e. the mm_struct invalidate_page could set a flag to
prevent new references from being established. The callback after
removal of the OS ptes could re-enable establishing new references.

>
> remote page fault
> tries to instantiate more references
> remote page fault arrives
> instantiate more references
> get_page() -> pin

lock_page: waits until the rmap walk is complete, then rechecks whether
the page is still part of the mapping.

> SetPageExported
> remote page fault succeeded
>
> zero locking so invalidate_page can schedule (not wait for I/O,
> there' won't be any I/O out of GFP_KERNEL!)

Ok this is often a PF_MEMALLOC context. We already do disk I/O in that
context?


> I thought your solution was to have the remote page fault wait on
> PG_exported to return ON!! But now you tell me the remote page fault
> is the thing that has to SetPageExported, not the linux VM. So make up
> your mind about this PG_exported mess...

SetPageExported is mainly a switch to turn the callbacks on/off for a
page, not necessarily used for synchronization. PageExported should be
modified under the page lock.


> > You are saying that clearing the main linux ptes and leaving the remote
> > ptes in place will not allow access to the page via the remote ptes?
>
> No, I'm saying if you clear the main linux pte while there are still
> remote ptes in place (in turn the page_count has been boosted by 1
> with your current code), and you relay on mm/rmap.c for the
> ->invalidate_page, you will generate a unswappable-pin-leak.

The invalidate_page presumably would reduce the page count to zero after
clearing the remote ptes?

> The linux pte must be present and the page must be mapped in userland
> as long as there are remote references to the page and in turn as long
> as the page_count has been boosted by 1. Otherwise mm/rmap.c won't be
> called.

page_mapped() must be true. So we would need to increase mapcount instead
of page_count?

> At the very least you should move your invalidate_page in
> mm/vmscan.c and have it called regardless if the page is mapped in
> userland or not.

That would not cover page migration and other uses. We also need the
invalidate_page for page_mkclean etc. Needed for dirty page tracking.


> > Right. That is why the mmu_ops approach does not work and that is why we
> > need to sleep.
>
> You told me you worried about atomic allocations. Now you tell me you
> need to sleep after I just explained you how utterly useless is to
> sleep inside GFP_KERNEL allocations when invoked by try_to_unmap in
> the mm/rmap.c paths. You will never sleep in any memory allocation
> other than to call schedule() because need_resched is set. You will do
> zero I/O. all your allocations will come from the PF_MEMALLOC pool
> like I said above, not from swapping, not from the VM. The VM will
> obviously refuse to be invoked recursively.

That may be okay if we do not need to generate listheads to track all the
mm_structs in the rmap loops. If we loop on our own then we do not need to
construct this list and can directly communicate with the other partition.

> Also not sure why you call my patch mmops, when it's mmu_notifier instead.

Oh. Sorry. Will use the correct name in the future. I think I keyed off
the mm_ops structure.

> > > All kvm guest physical pages would need to be marked exported of
> > > course.
> >
> > Why export all pages? Then you are planning to have mm_struct
> > notifiers for all processes in the system?
>
> KVM is 1 process, not sure how you get to imagine I need to track
> process in the system, when infact I only need to track pages
> belonging to the KVM process.

Ahh. A KVM is one process to the host but may have multiple processes
running in it and you want the notifier for the one process in the host.

> It's utterly useless to call ->invalidate_page(page) on a page that is
> still mapped by some linux pte with the young bit set. You must defer
> the ->invalidate_page after all young bits are gone. This is what I
> do, infact I do tons more than that by also honouring the accessed
> bits in all sptes. There's zero chance you can do as remotely as
> efficient as my mmu-notifiers are, unless you also do "cat rmap.c >>
> /sgi/yoursubsystem/something.c" and you check the young bit in the
> linux ptes yourself _before_ deciding if you've to start dropping
> remote references or not.

I think we agreed on doing the callback after the OS rmaps have been
walked, right?

> > that point we do not have an mm_struct anymore so the callback would have
>
> The mm struct wasn't available in the place where you put
> invalidate_page either.

Right.

2008-01-24 20:08:02

by Christoph Lameter

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

On Thu, 24 Jan 2008, Andrea Arcangeli wrote:

> I think you should consider if you can also build a rmap per-MM like
> KVM does and index it by the virtual address like KVM does.

Yes we have that.

If we have that then we do not need the mmu_notifier.
We could call it with a page parameter and then walk the KVM or XPmem
reverse map to directly find all the ptes we need to clear. There is no
need then to add a new field to the mm_struct.

2008-01-25 06:36:18

by Avi Kivity

[permalink] [raw]
Subject: Re: [kvm-devel] [PATCH] export notifier #1

Christoph Lameter wrote:
> On Thu, 24 Jan 2008, Andrea Arcangeli wrote:
>
>
>> I think you should consider if you can also build a rmap per-MM like
>> KVM does and index it by the virtual address like KVM does.
>>
>
> Yes we have that.
>
> If we have that then we do not need the mmu_notifier.
> We could call it with a page parameter and then walk the KVM or XPmem
> reverse map to directly find all the ptes we need to clear. There is no
> need then to add a new field to the mm_struct.
>

The reason the new field is needed is that the Linux mm does not
understand the secondary pte format and zapping protocol. Locating the
secondary ptes is just a part of the problem.


--
Any sufficiently difficult bug is indistinguishable from a feature.