I hope this is finally a release that covers all the requirements. Locking
description is at the top of the core patch.
This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced by something
other than what is tracked by the kernel's rmaps. The known immediate users are:
KVM (establishes a refcount on the page; the external references are called
sptes. The refcount may turn out to be unnecessary.)
GRU (simple TLB shootdown without a refcount; has its own pagetable/TLB)
XPmem (uses its own reverse mappings and remote ptes; needs to sleep when
sending messages. XPmem could defer freeing pages if a callback with
atomic=1 occurs.)
Pending:
- Feedback from users of the callbacks for KVM, RDMA, XPmem and GRU
(Early tests with the GRU were successful).
Known issues:
- RCU quiescent periods are required on registering
notifiers to guarantee visibility to other processors.
Andrea's mmu_notifier #4 -> RFC V1
- Merge the subsystem-rmap-based approach with the Linux-rmap-based approach
- Move the Linux-rmap-based notifiers out of the macro
- Try to account for which locks are held while the notifiers are
called.
- Develop a patch sequence that separates out the different types of
hooks so that we can review their use.
- Avoid adding an include to linux/mm_types.h
- Integrate the RCU logic suggested by Peter.
V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range to indicate if a spinlock
is held.
- Add invalidate_all()
V2->V3:
- Further RCU fixes
- Fixes from Andrea to fix up aging and to move invalidate_range() in do_wp_page
and sys_remap_file_pages() after the pte clearing.
V3->V4:
- Drop locking and synchronize_rcu() on ->release since we know on release that
we are the only executing thread. This is also true for invalidate_all() so
we could drop off the mmu_notifier there early. Use hlist_del_init instead
of hlist_del_rcu.
- Do the invalidation as begin/end pairs with the requirement that the driver
holds off new references in between.
- Fixup filemap_xip.c
- Figure out a potential way in which XPmem can deal with locks that are held.
- Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit.
- Strip cc list down a bit.
- Drop Peter's new RCU list macro
- Add description to the core patch
--
GRU should implement at least invalidate_page and invalidate_pages,
and it should be mostly covered with that.
My suggestion is to add the invalidate_range_start/end incrementally
with this, and to keep all the XPMEM mmu notifiers in a separate
incremental patch (those are going to require many more changes to
perfect). They're very different things. GRU is simpler, will require
fewer changes and it should be taken care of sooner than XPMEM. KVM
requirements are a subset of GRU's thanks to the page pin, so I can
ignore KVM requirements as a whole and only focus on GRU for the
time being.
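(Illustration only, not part of the posted patch: a minimal sketch of what a
GRU-style user of the API below could look like; all gru_* names are
hypothetical.)

struct gru_mm_state {				/* hypothetical driver state */
	struct mmu_notifier mn;			/* embedded so container_of() gets us back */
};

static void gru_invalidate_page(struct mmu_notifier *mn,
				struct mm_struct *mm, unsigned long address)
{
	/* flush the external TLB entry for this single address */
}

static void gru_invalidate_pages(struct mmu_notifier *mn,
				struct mm_struct *mm,
				unsigned long start, unsigned long end)
{
	/* flush the external TLB entries for the whole [start, end) range */
}

static const struct mmu_notifier_ops gru_mmu_ops = {
	.invalidate_page	= gru_invalidate_page,
	.invalidate_pages	= gru_invalidate_pages,
};

static void gru_register_mm(struct gru_mm_state *gms, struct mm_struct *mm)
{
	gms->mn.ops = &gru_mmu_ops;
	/* becomes visible to other CPUs after an RCU quiescent period */
	mmu_notifier_register(&gms->mn, mm);
}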
There's room for optimization by moving some invalidate_page calls to
invalidate_pages, but that would complicate all the tlb flushing code a
lot and require tons of changes; we should concentrate on getting
the functionality included in the simplest, obviously safe, least
intrusive form, in a way that can be optimized later if that turns out
to be an issue. And I think for KVM this is already quite optimal. And
even with the notifiers armed there's zero slowdown for all tasks but
the one tracked by the notifiers.
Invalidates inside the PT lock prevent the page faults from happening in
parallel with my invalidates, with no dependency on the page pin; mremap is
covered, mprotect is covered, etc. I verified with the below patch that KVM
swapping is rock solid on top of 2.6.24, as with my #v4 on an older host
kernel. I think it'd be much less efficient to add yet another
non-per-pte-scalar lock to serialize the GRU interrupt or KVM
pagefault against the main linux page fault, given we already have all
the needed serialization out of the PT lock. XPMEM is forced to do that
because it has to sleep when sending the invalidate. So it has to take
such a subsystem mutex, send the invalidate, ptep_clear_flush, and
finally unlock the mutex (hence the need for an _end callback, to unblock
the page faults). KVM could also add a kvm-internal lock to do
something like that but there's no point given the PT lock is more
scalar. GRU is totally forbidden to use such a model because GRU lacks
the page pin (KVM could use the XPMEM invalidate_range too, instead of
the GRU invalidate_page[s] only, thanks to the page pin taken in
get_user_pages).
invalidate_pages and invalidate_range_start/end provide entirely
different semantics, even if they're sharing the same mmu_notifier.head
registration list. And _end had better take a start,end range too.
XPMEM -> invalidate_range_start/end/invalidate_external_rmap
GRU/KVM -> invalidate_page[s], in the future an mprotect_pages optimization etc.
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -46,6 +46,7 @@
__young = ptep_test_and_clear_young(__vma, __address, __ptep); \
if (__young) \
flush_tlb_page(__vma, __address); \
+ __young |= mmu_notifier_age_page((__vma)->vm_mm, __address); \
__young; \
})
#endif
@@ -86,6 +87,7 @@ do { \
pte_t __pte; \
__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \
flush_tlb_page(__vma, __address); \
+ mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \
__pte; \
})
#endif
diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h
--- a/include/asm-s390/pgtable.h
+++ b/include/asm-s390/pgtable.h
@@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str
{
pte_t pte = *ptep;
ptep_invalidate(address, ptep);
+ mmu_notifier(invalidate_page, vma->vm_mm, address);
return pte;
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/completion.h>
+#include <linux/mmu_notifier.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -219,6 +220,8 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+ struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
};
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,132 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct mmu_notifier;
+
+struct mmu_notifier_ops {
+ /*
+ * Called when nobody can register any more notifier in the mm
+ * and after the "mn" notifier has been disarmed already.
+ */
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+ /*
+ * invalidate_page[s] is called in atomic context
+ * after any pte has been updated and before
+ * dropping the PT lock required to update any Linux pte.
+ * Once the PT lock will be released the pte will have its
+ * final value to export through the secondary MMU.
+ * Before this is invoked any secondary MMU is still ok
+ * to read/write to the page previously pointed by the
+ * Linux pte because the old page hasn't been freed yet.
+ * If required set_page_dirty has to be called internally
+ * to this method.
+ */
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+ void (*invalidate_pages)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+
+ /*
+ * Age page is called in atomic context inside the PT lock
+ * right after the VM is test-and-clearing the young/accessed
+ * bitflag in the pte. This way the VM will provide proper aging
+ * to the accesses to the page through the secondary MMUs
+ * and not only to the ones through the Linux pte.
+ */
+ int (*age_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+};
+
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_head {
+ struct hlist_head head;
+ spinlock_t lock;
+};
+
+#include <linux/mm_types.h>
+
+/*
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads.
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+/*
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the "struct mmu_notifier" can be freed. Alternatively it
+ * can be synchronously freed inside ->release when the list can't
+ * change anymore and nobody could possibly walk it.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+ INIT_HLIST_HEAD(&mnh->head);
+ spin_lock_init(&mnh->lock);
+}
+
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ struct mmu_notifier *__mn; \
+ struct hlist_node *__n; \
+ \
+ if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mn, __n, \
+ &(mm)->mmu_notifier.head, \
+ hlist) \
+ if (__mn->ops->function) \
+ __mn->ops->function(__mn, \
+ mm, \
+ args); \
+ rcu_read_unlock(); \
+ } \
+ } while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+struct mmu_notifier_head {};
+
+#define mmu_notifier_register(mn, mm) do {} while(0)
+#define mmu_notifier_unregister(mn, mm) do {} while (0)
+#define mmu_notifier_release(mm) do {} while (0)
+#define mmu_notifier_age_page(mm, address) ({ 0; })
+#define mmu_notifier_head_init(mmh) do {} while (0)
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ if (0) { \
+ struct mmu_notifier *__mn; \
+ \
+ __mn = (struct mmu_notifier *)(0x00ff); \
+ __mn->ops->function(__mn, mm, args); \
+ }; \
+ } while (0)
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -360,6 +360,7 @@ static struct mm_struct * mm_init(struct
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_head_init(&mm->mmu_notifier);
return mm;
}
free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@ config VIRT_TO_BUS
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -756,6 +756,7 @@ void __unmap_hugepage_range(struct vm_ar
if (pte_none(pte))
continue;
+ mmu_notifier(invalidate_page, mm, address);
page = pte_page(pte);
if (pte_dirty(pte))
set_page_dirty(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -661,6 +661,7 @@ static unsigned long zap_pte_range(struc
}
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
+ mmu_notifier(invalidate_page, mm, addr);
tlb_remove_tlb_entry(tlb, pte, addr);
if (unlikely(!page))
continue;
@@ -1249,6 +1250,7 @@ static int remap_pte_range(struct mm_str
{
pte_t *pte;
spinlock_t *ptl;
+ unsigned long start = addr;
pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
@@ -1260,6 +1262,7 @@ static int remap_pte_range(struct mm_str
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
+ mmu_notifier(invalidate_pages, mm, start, addr);
pte_unmap_unlock(pte - 1, ptl);
return 0;
}
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2043,6 +2043,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ mmu_notifier_release(mm);
/*
* Walk the list again, actually closing and freeing it,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,73 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/rcupdate.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *tmp;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ hlist_for_each_entry_safe(mn, n, tmp,
+ &mm->mmu_notifier.head, hlist) {
+ hlist_del(&mn->hlist);
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ }
+ }
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int young = 0;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(mn, n,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->age_page)
+ young |= mn->ops->age_page(mn, mm, address);
+ }
+ rcu_read_unlock();
+ }
+
+ return young;
+}
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ spin_lock(&mm->mmu_notifier.lock);
+ hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+ spin_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ spin_lock(&mm->mmu_notifier.lock);
+ hlist_del_rcu(&mn->hlist);
+ spin_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,6 +32,7 @@ static void change_pte_range(struct mm_s
{
pte_t *pte, oldpte;
spinlock_t *ptl;
+ unsigned long start = addr;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -71,6 +72,7 @@ static void change_pte_range(struct mm_s
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
+ mmu_notifier(invalidate_pages, mm, start, addr);
pte_unmap_unlock(pte - 1, ptl);
}
On Thu, 31 Jan 2008, Andrea Arcangeli wrote:
> My suggestion is to add the invalidate_range_start/end incrementally
> with this, and to keep all the xpmem mmu notifiers in a separate
> incremental patch (those are going to require many more changes to
> perfect). They've very different things. GRU is simpler, will require
> less changes and it should be taken care of sooner than XPMEM. KVM
> requirements are a subset of GRU thanks to the page pin so I can
> ignore KVM requirements as a whole and I only focus on GRU for the
> time being.
KVM requires get_user_pages. This makes them currently different.
> Invalidates inside PT lock will avoid the page faults to happen in
> parallel of my invalidates, no dependency on the page pin, mremap
You are aware that the pt lock is split for systems with >4 CPUs? You can
use the pte_lock only to serialize access to individual ptes.
> pagefault against the main linux page fault, given we already have all
> needed serialization out of the PT lock. XPMEM is forced to do that
pt lock cannot serialize with invalidate_range since it is split. A range
requires locking for a series of ptes not only individual ones.
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -46,6 +46,7 @@
> __young = ptep_test_and_clear_young(__vma, __address, __ptep); \
> if (__young) \
> flush_tlb_page(__vma, __address); \
> + __young |= mmu_notifier_age_page((__vma)->vm_mm, __address); \
> __young; \
> })
> #endif
That may be okay. Have you checked all the arches that can provide their
own implementation of this macro? This is only going to work on arches
that use the generic implementation.
> @@ -86,6 +87,7 @@ do { \
> pte_t __pte; \
> __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \
> flush_tlb_page(__vma, __address); \
> + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \
> __pte; \
> })
> #endif
This will require a callback on every(!) removal of a pte. A range
invalidate does not do any good since the callbacks are performed anyways.
Probably needlessly.
In addition you have the same issues with arches providing their own macro
here.
> diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h
> --- a/include/asm-s390/pgtable.h
> +++ b/include/asm-s390/pgtable.h
> @@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str
> {
> pte_t pte = *ptep;
> ptep_invalidate(address, ptep);
> + mmu_notifier(invalidate_page, vma->vm_mm, address);
> return pte;
> }
>
Ahh you found an additional arch. How about x86 code? There is one
override of these functions there as well.
> + /*
> + * invalidate_page[s] is called in atomic context
> + * after any pte has been updated and before
> + * dropping the PT lock required to update any Linux pte.
> + * Once the PT lock will be released the pte will have its
> + * final value to export through the secondary MMU.
> + * Before this is invoked any secondary MMU is still ok
> + * to read/write to the page previously pointed by the
> + * Linux pte because the old page hasn't been freed yet.
> + * If required set_page_dirty has to be called internally
> + * to this method.
> + */
> + void (*invalidate_page)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long address);
> + void (*invalidate_pages)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end);
What is the point of invalidate_pages? It cannot be serialized properly
and you do the invalidate_page() calls regardless. Is it some sort of
optimization?
> +struct mmu_notifier_head {};
> +
> +#define mmu_notifier_register(mn, mm) do {} while(0)
> +#define mmu_notifier_unregister(mn, mm) do {} while (0)
> +#define mmu_notifier_release(mm) do {} while (0)
> +#define mmu_notifier_age_page(mm, address) ({ 0; })
> +#define mmu_notifier_head_init(mmh) do {} while (0)
Macros. We want functions there to be able to validate the parameters even
if !CONFIG_MMU_NOTIFIER.
> +
> +/*
> + * Notifiers that use the parameters that they were passed so that the
> + * compiler does not complain about unused variables but does proper
> + * parameter checks even if !CONFIG_MMU_NOTIFIER.
> + * Macros generate no code.
> + */
> +#define mmu_notifier(function, mm, args...) \
> + do { \
> + if (0) { \
> + struct mmu_notifier *__mn; \
> + \
> + __mn = (struct mmu_notifier *)(0x00ff); \
> + __mn->ops->function(__mn, mm, args); \
> + }; \
> + } while (0)
> +
> +#endif /* CONFIG_MMU_NOTIFIER */
Ok here you took the variant that checks parameters.
> @@ -1249,6 +1250,7 @@ static int remap_pte_range(struct mm_str
> {
> pte_t *pte;
> spinlock_t *ptl;
> + unsigned long start = addr;
>
> pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> if (!pte)
> @@ -1260,6 +1262,7 @@ static int remap_pte_range(struct mm_str
> pfn++;
> } while (pte++, addr += PAGE_SIZE, addr != end);
> arch_leave_lazy_mmu_mode();
> + mmu_notifier(invalidate_pages, mm, start, addr);
> pte_unmap_unlock(pte - 1, ptl);
> return 0;
You are under the wrong impression that you can use the pte lock to
serialize general access to ptes! Nope. The pte lock only serializes access
to individual ptes. This is broken.
> + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> + hlist_for_each_entry_safe(mn, n, tmp,
> + &mm->mmu_notifier.head, hlist) {
> + hlist_del(&mn->hlist);
hlist_del_init?
> @@ -71,6 +72,7 @@ static void change_pte_range(struct mm_s
>
> } while (pte++, addr += PAGE_SIZE, addr != end);
> arch_leave_lazy_mmu_mode();
> + mmu_notifier(invalidate_pages, mm, start, addr);
> pte_unmap_unlock(pte - 1, ptl);
> }
Again broken serialization.
On Thu, 31 Jan 2008, Christoph Lameter wrote:
> > pagefault against the main linux page fault, given we already have all
> > needed serialization out of the PT lock. XPMEM is forced to do that
>
> pt lock cannot serialize with invalidate_range since it is split. A range
> requires locking for a series of ptes not only individual ones.
Hmmm.. May be okay after all. I see that you are only doing it on the pte
level. This means the range callbacks are taking down a max of 512
entries. So you have a callback for each pmd. A callback for 2M of memory?
On Thu, Jan 31, 2008 at 12:18:54PM -0800, Christoph Lameter wrote:
> pt lock cannot serialize with invalidate_range since it is split. A range
> requires locking for a series of ptes not only individual ones.
The lock I take already protects up to 512 ptes, yes. I call
invalidate_pages only across a _single_ 4k pagetable filled at most
with 512 pte_t entries mapping virtual addresses in a 2M naturally aligned
virtual range.
There's no smp race even for >4 CPUs. Check it again! I never call
invalidate_pages _outside_ the PT lock specific for that 4k pagetable
(a single PT lock protects 512 pte_t entries, not only a single one!).
Perhaps if you called it the PMD lock there would be no misunderstanding:
spinlock_t *__ptl = pte_lockptr(mm, pmd); \
(the PT lock is a function of the pmd, not of the pte_t!)
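(For reference, this is the shape of the pattern the patch uses in
change_pte_range()/remap_pte_range(); a schematic, not a literal excerpt,
and example_pte_range is a made-up name.)

static void example_pte_range(struct mm_struct *mm, pmd_t *pmd,
				unsigned long addr, unsigned long end)
{
	pte_t *pte;
	spinlock_t *ptl;
	unsigned long start = addr;

	/* one PT lock covers the whole 4k pagetable (up to 512 ptes) */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	do {
		/* ... clear or modify *pte ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);
	/* a single callout for the whole pmd range, still under that PT lock */
	mmu_notifier(invalidate_pages, mm, start, addr);
	pte_unmap_unlock(pte - 1, ptl);
}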
> That may be okay. Have you checked all the arches that can provide their
> own implementation of this macro? This is only going to work on arches
> that use the generic implementation.
Obviously I checked them all yes, and this was much faster than hand
editing the .c files like you did indeed.
> This will require a callback on every(!) removal of a pte. A range
> invalidate does not do any good since the callbacks are performed anyways.
> Probably needlessly.
>
> In addition you have the same issues with arches providing their own macro
> here.
Yes, s390 is the only one.
>
> > diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h
> > --- a/include/asm-s390/pgtable.h
> > +++ b/include/asm-s390/pgtable.h
> > @@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str
> > {
> > pte_t pte = *ptep;
> > ptep_invalidate(address, ptep);
> > + mmu_notifier(invalidate_page, vma->vm_mm, address);
> > return pte;
> > }
> >
>
> Ahh you found an additional arch. How about x86 code? There is one
> override of these functions there as well.
There's no ptep_clear_flush override on x86 or I would have patched it
like I did for s390.
I had to change 2 lines instead of a single one, not such a big deal.
> > + /*
> > + * invalidate_page[s] is called in atomic context
> > + * after any pte has been updated and before
> > + * dropping the PT lock required to update any Linux pte.
> > + * Once the PT lock will be released the pte will have its
> > + * final value to export through the secondary MMU.
> > + * Before this is invoked any secondary MMU is still ok
> > + * to read/write to the page previously pointed by the
> > + * Linux pte because the old page hasn't been freed yet.
> > + * If required set_page_dirty has to be called internally
> > + * to this method.
> > + */
> > + void (*invalidate_page)(struct mmu_notifier *mn,
> > + struct mm_struct *mm,
> > + unsigned long address);
>
>
> > + void (*invalidate_pages)(struct mmu_notifier *mn,
> > + struct mm_struct *mm,
> > + unsigned long start, unsigned long end);
>
> What is the point of invalidate_pages? It cannot be serialized properly
> and you do the invalidate_page() calles regardless. Is is some sort of
> optimization?
It is already serialized 100% properly sorry.
> > +struct mmu_notifier_head {};
> > +
> > +#define mmu_notifier_register(mn, mm) do {} while(0)
> > +#define mmu_notifier_unregister(mn, mm) do {} while (0)
> > +#define mmu_notifier_release(mm) do {} while (0)
> > +#define mmu_notifier_age_page(mm, address) ({ 0; })
> > +#define mmu_notifier_head_init(mmh) do {} while (0)
>
> Macros. We want functions there to be able to validate the parameters even
> if !CONFIG_MMU_NOTIFIER.
If you want I can turn this into a static inline, it would already
work fine. Certainly this isn't a blocker for merging given most
people will have MMU_NOTIFIER=y, and this speeds up compilation a tiny
bit for embedded.
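(For illustration, a static-inline variant of the !CONFIG_MMU_NOTIFIER stubs
could look like the sketch below; it still type-checks the parameters while
generating no code.)

struct mmu_notifier_head {};

static inline void mmu_notifier_register(struct mmu_notifier *mn,
					 struct mm_struct *mm) {}
static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
					   struct mm_struct *mm) {}
static inline void mmu_notifier_release(struct mm_struct *mm) {}
static inline int mmu_notifier_age_page(struct mm_struct *mm,
					unsigned long address)
{
	return 0;
}
static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) {}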
> > +
> > +/*
> > + * Notifiers that use the parameters that they were passed so that the
> > + * compiler does not complain about unused variables but does proper
> > + * parameter checks even if !CONFIG_MMU_NOTIFIER.
> > + * Macros generate no code.
> > + */
> > +#define mmu_notifier(function, mm, args...) \
> > + do { \
> > + if (0) { \
> > + struct mmu_notifier *__mn; \
> > + \
> > + __mn = (struct mmu_notifier *)(0x00ff); \
> > + __mn->ops->function(__mn, mm, args); \
> > + }; \
> > + } while (0)
> > +
> > +#endif /* CONFIG_MMU_NOTIFIER */
>
> Ok here you took the variant that checks parameters.
This is primarily to turn off the compiler warning, not to check the
parameters, but yes, it also checks the parameter types as a bonus.
> > @@ -1249,6 +1250,7 @@ static int remap_pte_range(struct mm_str
> > {
> > pte_t *pte;
> > spinlock_t *ptl;
> > + unsigned long start = addr;
> >
> > pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> > if (!pte)
> > @@ -1260,6 +1262,7 @@ static int remap_pte_range(struct mm_str
> > pfn++;
> > } while (pte++, addr += PAGE_SIZE, addr != end);
> > arch_leave_lazy_mmu_mode();
> > + mmu_notifier(invalidate_pages, mm, start, addr);
> > pte_unmap_unlock(pte - 1, ptl);
> > return 0;
>
> You are under the wrong impression that you can use the pte lock to
> serialize general access to ptes! Nope. ptelock only serialize access to
> individual ptes. This is broken.
You are under the wrong impression that invalidate_pages could be
called on a range that spans a region that isn't 2M naturally
aligned and <2M in size.
> > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > + hlist_for_each_entry_safe(mn, n, tmp,
> > + &mm->mmu_notifier.head, hlist) {
> > + hlist_del(&mn->hlist);
>
> hlist_del_init?
This is a mess, I have to say, and I'm almost willing to return to a
_safe_rcu walk and to removing the autodisarming feature. KVM itself isn't
using mm_users but mm_count. Then if somebody needs to release the mn
inside ->release they should call mmu_notifier_unregister _inside_
->release and then use call_rcu to free the mn (synchronize_rcu isn't
allowed inside ->release because of the rcu_read_lock() that can't
schedule, or it would reach a quiescent point itself, allowing the other
structures to be released).
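(A sketch of the scheme just described, with hypothetical my_* driver names;
illustration only.)

struct my_notifier {
	struct mmu_notifier mn;
	struct rcu_head rcu;
};

static void my_notifier_free(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct my_notifier, rcu));
}

static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_notifier *m = container_of(mn, struct my_notifier, mn);

	/* tear down the secondary MMU state here ... */
	mmu_notifier_unregister(mn, mm);
	/* synchronize_rcu() is not allowed here, so defer the free */
	call_rcu(&m->rcu, my_notifier_free);
}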
> > @@ -71,6 +72,7 @@ static void change_pte_range(struct mm_s
> >
> > } while (pte++, addr += PAGE_SIZE, addr != end);
> > arch_leave_lazy_mmu_mode();
> > + mmu_notifier(invalidate_pages, mm, start, addr);
> > pte_unmap_unlock(pte - 1, ptl);
> > }
>
> Again broken serialization.
The above is perfectly fine too. If you have doubts and it isn't
immediately clear where your misunderstanding of my 100% SMP-safe locking
lies, think about this: how could it be safe to modify 512 ptes with a
single lock, regardless of the mmu notifiers?
I appreciate the review! I hope my entirely bug-free and
straightforward #v5 will strongly increase the probability of getting
this in sooner rather than later. If nothing else it shows the approach I
prefer to cover GRU/KVM 100%, leaving the overkill mutex locking
requirements only to the mmu notifier users that can't deal with the
scalar and fine-grained and already-taken/trashed PT lock.
On Thu, Jan 31, 2008 at 03:09:55PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Christoph Lameter wrote:
>
> > > pagefault against the main linux page fault, given we already have all
> > > needed serialization out of the PT lock. XPMEM is forced to do that
> >
> > pt lock cannot serialize with invalidate_range since it is split. A range
> > requires locking for a series of ptes not only individual ones.
>
> Hmmm.. May be okay after all. I see that you are only doing it on the pte
> level. This means the range callbacks are taking down a max of 512
> entries. So you have a callback for each pmd. A callback for 2M of memory?
Exactly. The point of _pages is to reduce by an order of magnitude
(512, or 1024, times) the number of needed invalidate_page calls in a
few places where it's a straightforward optimization for both KVM and
GRU. Thanks to the PT lock this remains a totally obviously safe
design and it requires zero additional locking anywhere (not in the linux
VM, nor in the mmu notifier methods, nor in the KVM/GRU page fault).
Sure, you can do invalidate_range_start/end for virtual ranges larger than
2M(/4M on 32bit). But my approach, which already averages the fixed
mmu_lock cost over 512(/1024) ptes, will make any larger
"range" improvement not strongly measurable anymore, given that to do that
you have to add locking as well and _surely_ decrease the GRU
scalability with tons of threads and tons of cpus, potentially making
GRU a lot slower _especially_ on your numa systems.
On Fri, 1 Feb 2008, Andrea Arcangeli wrote:
> I appreciate the review! I hope my entirely bug free and
> strightforward #v5 will strongly increase the probability of getting
> this in sooner than later. If something else it shows the approach I
> prefer to cover GRU/KVM 100%, leaving the overkill mutex locking
> requirements only to the mmu notifier users that can't deal with the
> scalar and finegrined and already-taken/trashed PT lock.
Mutex locking? Could you be more specific?
I hope you will continue to do reviews of the other mmu notifier patchset?
On Fri, 1 Feb 2008, Andrea Arcangeli wrote:
> GRU. Thanks to the PT lock this remains a totally obviously safe
> design and it requires zero additional locking anywhere (nor linux VM,
> nor in the mmu notifier methods, nor in the KVM/GRU page fault).
Na. I would not be so sure about having caught all the issues yet...
> Sure you can do invalidate_range_start/end for more than 2M(/4M on
> 32bit) max virtual ranges. But my approach that averages the fixed
> mmu_lock cost already over 512(/1024) ptes will make any larger
> "range" improvement not strongly measurable anymore given to do that
> you have to add locking as well and _surely_ decrease the GRU
> scalability with tons of threads and tons of cpus potentially making
> GRU a lot slower _especially_ on your numa systems.
The trouble is that the invalidates are much more expensive if you have to
send these to remote partitions (XPmem). And it's really great if you can
simply tear down everything. Certainly this is a significant improvement
over the earlier approach but you still have the invalidate_page calls in
ptep_clear_flush. So they fire needlessly?
Serializing access in the device driver makes sense and comes with the
additional possibility of not having to increment page counts all the time.
So you trade one cacheline dirtying for the many that are necessary if you
always increment the page count.
How does KVM ensure the consistency of the shadow page tables? Atomic ops?
The GRU has no page table of its own. It populates TLB entries on demand
using the Linux page table. There is no way it can figure out when to
drop page counts again. The invalidate calls are turned directly into TLB
flushes.
On Thu, Jan 31, 2008 at 05:37:21PM -0800, Christoph Lameter wrote:
> On Fri, 1 Feb 2008, Andrea Arcangeli wrote:
>
> > I appreciate the review! I hope my entirely bug free and
> > strightforward #v5 will strongly increase the probability of getting
> > this in sooner than later. If something else it shows the approach I
> > prefer to cover GRU/KVM 100%, leaving the overkill mutex locking
> > requirements only to the mmu notifier users that can't deal with the
> > scalar and finegrined and already-taken/trashed PT lock.
>
> Mutex locking? Could you be more specific?
I think he is talking about the external locking that xpmem will need
to do to ensure we are not able to refault pages inside of regions that
are undergoing recall/page table clearing. At least that has been my
understanding to this point.
Thanks,
Robin
On Thu, 31 Jan 2008, Robin Holt wrote:
> > Mutex locking? Could you be more specific?
>
> I think he is talking about the external locking that xpmem will need
> to do to ensure we are not able to refault pages inside of regions that
> are undergoing recall/page table clearing. At least that has been my
> understanding to this point.
Right, this has to be something like an rw spinlock. It's needed for both
GRU/XPmem. Not sure about KVM.
Take the read lock for invalidate operations. These can occur
concurrently. (Or a simpler implementation for the GRU may just use a
spinlock.)
The write lock must be held for populate operations.
The lock can be refined as needed by the notifier driver. E.g. locking could
be restricted to certain ranges.
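(A minimal sketch of that scheme, extending the earlier hypothetical
gru_invalidate_pages illustration with the locking described here; an rwlock
is used since the invalidate callouts run in atomic context.)

static DEFINE_RWLOCK(gru_range_lock);	/* hypothetical per-driver lock */

static void gru_invalidate_pages(struct mmu_notifier *mn,
				struct mm_struct *mm,
				unsigned long start, unsigned long end)
{
	read_lock(&gru_range_lock);	/* invalidates may run concurrently */
	/* ... flush the external TLB entries for [start, end) ... */
	read_unlock(&gru_range_lock);
}

static int gru_populate_tlb(struct mm_struct *mm, unsigned long address)
{
	write_lock(&gru_range_lock);	/* exclude invalidates while populating */
	/* ... look up the Linux pte and load the external TLB entry ... */
	write_unlock(&gru_range_lock);
	return 0;
}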
On Thu, Jan 31, 2008 at 05:37:21PM -0800, Christoph Lameter wrote:
> On Fri, 1 Feb 2008, Andrea Arcangeli wrote:
>
> > I appreciate the review! I hope my entirely bug free and
> > strightforward #v5 will strongly increase the probability of getting
> > this in sooner than later. If something else it shows the approach I
> > prefer to cover GRU/KVM 100%, leaving the overkill mutex locking
> > requirements only to the mmu notifier users that can't deal with the
> > scalar and finegrined and already-taken/trashed PT lock.
>
> Mutex locking? Could you be more specific?
Robin's understanding was correct on this point. Also I think if you add
start,end to range_end (as suggested a few times by me and once
by Robin) I may get away without a lock thanks to the page pin. That's
one strong reason why start,end are needed in range_end.
The only one that will definitely be forced to add a new lock around
follow_page, in addition to the very _scalable_ PT lock, is GRU.
> I hope you will continue to do reviews of the other mmu notifier patchset?
Sure. I will also create a new KVM patch to plug on top of your
versions (I already did for V2/V3 even if it never worked, but that
might have a build problem, see kvm-devel for details). To be clear,
as long as anything is merged that lets me avoid using kprobes to make
KVM swap work, I'm satisfied. I thought my approach had several
advantages in simplicity, and in being more obviously safe (in not
relying entirely on the page pin and creating a window of time where
the linux pte writes go to one page and the spte writes go to another one,
remember populate_range), and it avoids introducing external locks in
the GRU follow_page path. My approach was supposed to be in addition
to the range_start/end needed by xpmem, which can't possibly take
advantage of the scalar PT lock and definitely requires an external
lock to serialize the xpmem fault against the linux pagefault (not the case
for GRU/KVM).
On Thu, Jan 31, 2008 at 05:44:24PM -0800, Christoph Lameter wrote:
> The trouble is that the invalidates are much more expensive if you have to
> send theses to remote partitions (XPmem). And its really great if you can
> simple tear down everything. Certainly this is a significant improvement
> over the earlier approach but you still have the invalidate_page calls in
> ptep_clear_flush. So they fire needlessly?
Dunno, they certainly fire more frequently than yours; even _pages
fires more frequently than range_start,end, but don't forget why!
That's because I have a different spinlock for every 512
ptes/4k-gru-tlbs that are being invalidated... So it pays off in
scalability. I'm unsure if the GRU could play tricks with your patch to
still allow faults to happen in parallel if they're on virtual
addresses not in the same 2M naturally aligned chunk.
> Serializing access in the device driver makes sense and comes with
> additional possiblity of not having to increment page counts all the time.
> So you trade one cacheline dirtying for many that are necessary if you
> always increment the page count.
Note that my #v5 doesn't require increasing the page count all the
time, so GRU will work fine with #v5.
See this comment in my patch:
/*
* invalidate_page[s] is called in atomic context
* after any pte has been updated and before
* dropping the PT lock required to update any Linux pte.
* Once the PT lock will be released the pte will have its
* final value to export through the secondary MMU.
* Before this is invoked any secondary MMU is still ok
* to read/write to the page previously pointed by the
* Linux pte because the old page hasn't been freed yet.
* If required set_page_dirty has to be called internally
* to this method.
*/
invalidate_page[s] is always called before the page is freed. This
will require modifications to the tlb flushing code logic to take
advantage of _pages in certain places. For now it's just safe.
> How does KVM insure the consistency of the shadow page tables? Atomic ops?
A per-VM mmu_lock spinlock is taken to serialize the access, plus
atomic ops for the cpu.
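(An illustration of roughly what that means, with a hypothetical kvm_like
structure; this is not KVM source.)

struct kvm_like {
	spinlock_t		mmu_lock;	/* per-VM lock for spte updates */
	struct mmu_notifier	mmu_notifier;
};

static void kvm_like_invalidate_page(struct mmu_notifier *mn,
				struct mm_struct *mm, unsigned long address)
{
	struct kvm_like *kvm = container_of(mn, struct kvm_like, mmu_notifier);

	spin_lock(&kvm->mmu_lock);
	/* walk the shadow rmap for this address and drop the sptes; the
	 * underlying page is still pinned by the get_user_pages reference */
	spin_unlock(&kvm->mmu_lock);
}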
> The GRU has no page table on its own. It populates TLB entries on demand
> using the linux page table. There is no way it can figure out when to
> drop page counts again. The invalidate calls are turned directly into tlb
> flushes.
Yes, this is why it can't serialize follow_page with only the PT lock
with your patch. KVM may do it once you add start,end to range_end
only thanks to the additional pin on the page.
On Fri, 1 Feb 2008, Andrea Arcangeli wrote:
> Note that my #v5 doesn't require to increase the page count all the
> time, so GRU will work fine with #v5.
But that comes with the cost of firing invalidate_page for every page
being evicted. In order to make your single invalidate_range work without
it you need to hold a refcount on the page.
> invalidate_page[s] is always called before the page is freed. This
> will require modifications to the tlb flushing code logic to take
> advantage of _pages in certain places. For now it's just safe.
Yes so your invalidate_range is still some sort of dysfunctional
optimization? Gazillions of invalidate_page's will have to be executed
when tearing down large memory areas.
> > How does KVM insure the consistency of the shadow page tables? Atomic ops?
>
> A per-VM mmu_lock spinlock is taken to serialize the access, plus
> atomic ops for the cpu.
And that would not be enough to hold off new references? With small tweaks
this should work with a common scheme. We could also redefine the role
of _start and _end slightly to just require that the refs are removed when
_end completes. That would allow the KVM page count ref to work as is now
and would avoid the individual invalidate_page() callouts.
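(A rough sketch of those redefined _end semantics, reusing the hypothetical
kvm_like structure from the earlier illustration; by the time _end returns,
the sptes and the page references taken at fault time are gone.)

static void kvm_like_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
				unsigned long start, unsigned long end)
{
	struct kvm_like *kvm = container_of(mn, struct kvm_like, mmu_notifier);
	unsigned long addr;

	spin_lock(&kvm->mmu_lock);
	for (addr = start; addr < end; addr += PAGE_SIZE) {
		/* drop the spte for addr and put_page() the page that
		 * get_user_pages pinned when the spte was installed */
	}
	spin_unlock(&kvm->mmu_lock);
}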
> > The GRU has no page table on its own. It populates TLB entries on demand
> > using the linux page table. There is no way it can figure out when to
> > drop page counts again. The invalidate calls are turned directly into tlb
> > flushes.
>
> Yes, this is why it can't serialize follow_page with only the PT lock
> with your patch. KVM may do it once you add start,end to range_end
> only thanks to the additional pin on the page.
Right but that pin requires taking a refcount which we cannot do.
Frankly this looks as if this is a solution that would work only for KVM.
On Fri, Feb 01, 2008 at 11:23:57AM -0800, Christoph Lameter wrote:
> Yes so your invalidate_range is still some sort of dysfunctional
> optimization? Gazillions of invalidate_page's will have to be executed
> when tearing down large memory areas.
I don't know if the GRU can flush the external TLB references with a
single instruction like the cpu can do by overwriting %cr3. As far as
vmx/svm are concerned, you've got to do some fixed amount of work on the
rmap structures and each spte, for each 4k invalidated, regardless of
page/pages/range/ranges. You can only share the cost of the lock and
of the memslot lookup, so you look up and take the lock once and you
drop 512 sptes instead of just 1. Similar to what we do in the main
linux VM by taking the PT lock for every 512 sptes.
So you worry about gazillions of invalidate_page without taking into
account that even if you call invalidate_range_end(0, -1ULL) KVM will
have to mangle over 4503599627370496 rmap entries anyway. Yes, calling
1 invalidate_range_end instead of 4503599627370496 invalidate_page
will be faster, but not much faster. For KVM it remains an O(N)
operation, where N is the number of pages. I'm not so sure we need to
worry about my invalidate_pages being limited to 512 entries.
Perhaps GRU is different, I'm just describing the KVM situation here.
As far as KVM is concerned it's more sensible to be able to scale when
there are 1024 kvm threads on 1024 cpus and each kvm-vcpu is accessing
a different guest physical page (especially when new chunks of memory
are allocated for the first time), so that they won't collide the whole
time on naturally aligned 2M chunks of virtual addresses.
> And that would not be enough to hold of new references? With small tweaks
> this should work with a common scheme. We could also redefine the role
> of _start and _end slightly to just require that the refs are removed when
> _end completes. That would allow the KVM page count ref to work as is now
> and would avoid the individual invalidate_page() callouts.
I can already only use _end and ignore _start; the only remaining problem
is that this won't be 100% coherent (my patch is 100% coherent thanks to the
PT lock implicitly serializing the follow_page/get_user_pages of the KVM/GRU
secondary MMU faults). My approach gives better peace of mind with
full scalability and no lock recursion when it's the KVM page fault
that invokes get_user_pages that invokes the linux page fault routines
that will try to execute _start, taking the lock to block the page
fault that is already running...
> > > The GRU has no page table on its own. It populates TLB entries on demand
> > > using the linux page table. There is no way it can figure out when to
> > > drop page counts again. The invalidate calls are turned directly into tlb
> > > flushes.
> >
> > Yes, this is why it can't serialize follow_page with only the PT lock
> > with your patch. KVM may do it once you add start,end to range_end
> > only thanks to the additional pin on the page.
>
> Right but that pin requires taking a refcount which we cannot do.
GRU can use my patch without the pin. XPMEM obviously can't use my
patch as my invalidate_page[s] are under the PT lock (a feature to fit
GRU/KVM in the simplest way), this is why an incremental patch adding
invalidate_range_start/end would be required to support XPMEM too.
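(For concreteness, the two additional hooks discussed in this thread would
presumably look something like the sketch below when added to struct
mmu_notifier_ops; they would be called outside the PT lock and be allowed to
sleep, which is what XPMEM needs. This is a sketch of the proposal, not code
from a posted patch.)

	void (*invalidate_range_start)(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end);
	void (*invalidate_range_end)(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start, unsigned long end);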
On Sun, Feb 03, 2008 at 03:17:04AM +0100, Andrea Arcangeli wrote:
> On Fri, Feb 01, 2008 at 11:23:57AM -0800, Christoph Lameter wrote:
> > Yes so your invalidate_range is still some sort of dysfunctional
> > optimization? Gazillions of invalidate_page's will have to be executed
> > when tearing down large memory areas.
>
> I don't know if gru can flush the external TLB references with a
> single instruction like the cpu can do by overwriting %cr3. As far as
The mechanism obviously is a little different, but the GRU can flush an
external TLB in a single read/write/flush of a cacheline (should take a few
100 nsec). Typically, an application uses a single GRU. Threaded
applications, however, could use one GRU per thread, so it is possible that
multiple TLBs must be flushed. In some cases, the external flush can be
avoided if the GRU is not currently in use by the thread.
Also, most (but not all) applications that use the GRU do not usually do
anything that requires frequent flushing (fortunately). The GRU is intended
for HPC-like applications. These don't usually do frequent map/unmap
operations or anything else that requires a lot of flushes.
I expect that KVM is a lot different.
I have most of the GRU code working with the latest mmuops patch. I still
have a list of loose ends that I'll get to next week. The most important is
the exact handling of the range invalidates. The code that I currently have
works (mostly) but has a few end cases that will cause problems. Once I
finish, I'll be glad to send you snippets of the code (or all of it) if you
would like to take a look.
> vmx/svm are concerned, you got to do some fixed amount of work on the
> rmap structures and each spte, for each 4k invalidated regardless of
> page/pages/range/ranges. You can only share the cost of the lock and
> of the memslot lookup, so you lookup and take the lock once and you
> drop 512 sptes instead of just 1. Similar to what we do in the main
> linux VM by taking the PT lock for every 512 sptes.
>
> So you worry about gazillions of invalidate_page without taking into
> account that even if you call invalidate_range_end(0, -1ULL) KVM will
> have to mangle over 4503599627370496 rmap entries anyway. Yes calling
> 1 invalidate_range_end instead of 4503599627370496 invalidate_page,
> will be faster, but not much faster. For KVM it remains an O(N)
> operation, where N is the number of pages. I'm not so sure we need to
> worry about my invalidate_pages being limited to 512 entries.
>
> Perhaps GRU is different, I'm just describing the KVM situation here.
>
> As far as KVM is concerned it's more sensible to be able to scale when
> there are 1024 kvm threads on 1024 cpus and each kvm-vcpu is accessing
> a different guest physical page (especially when new chunks of memory
> are allocated for the first time) that won't collide the whole time on
> naturally aligned 2M chunks of virtual addresses.
>
> > And that would not be enough to hold of new references? With small tweaks
> > this should work with a common scheme. We could also redefine the role
> > of _start and _end slightly to just require that the refs are removed when
> > _end completes. That would allow the KVM page count ref to work as is now
> > and would avoid the individual invalidate_page() callouts.
>
> I can already only use _end and ignore _start, only remaining problem
> is this won't be 100% coherent (my patch is 100% coherent thanks to PT
> lock implicitly serializing follow_page/get_user_pages of the KVM/GRU
> secondary MMU faults). My approach give a better peace of mind with
> full scalability and no lock recursion when it's the KVM page fault
> that invokes get_user_pages that invokes the linux page fault routines
> that will try to execute _start taking the lock to block the page
> fault that is already running...
>
> > > > The GRU has no page table on its own. It populates TLB entries on demand
> > > > using the linux page table. There is no way it can figure out when to
> > > > drop page counts again. The invalidate calls are turned directly into tlb
> > > > flushes.
> > >
> > > Yes, this is why it can't serialize follow_page with only the PT lock
> > > with your patch. KVM may do it once you add start,end to range_end
> > > only thanks to the additional pin on the page.
> >
> > Right but that pin requires taking a refcount which we cannot do.
>
> GRU can use my patch without the pin. XPMEM obviously can't use my
> patch as my invalidate_page[s] are under the PT lock (a feature to fit
> GRU/KVM in the simplest way), this is why an incremental patch adding
> invalidate_range_start/end would be required to support XPMEM too.
--- jack
On Sat, Feb 02, 2008 at 09:14:57PM -0600, Jack Steiner wrote:
> Also, most (but not all) applications that use the GRU do not usually do
> anything that requires frequent flushing (fortunately). The GRU is intended
> for HPC-like applications. These don't usually do frequent map/unmap
> operations or anything else that requires a lot of flushes.
>
> I expect that KVM is a lot different.
I don't think so. invalidate_page/pages/range_start,end is a slow and
infrequent path for KVM (or alternatively the ranges are very small, in
which case _range_start/end won't pay off compared to _pages). Whenever
invalidate_page[s] becomes a fast path, we're generally I/O
bound. get_user_pages is always the fast path instead. I thought it
was much more important that get_user_pages scale as well as it does
now and that the KVM page fault isn't serialized with a mutex, than
any invalidate-side optimization. get_user_pages may run
frequently from all vcpus even if there are no invalidates and no
memory pressure, and I don't mean only during startup.
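(A sketch of that fast path, using the hypothetical kvm_like structure again
and roughly 2.6.24-era calling conventions; illustration only.)

static int kvm_like_map_page(struct kvm_like *kvm, struct mm_struct *mm,
				unsigned long hva)
{
	struct page *page;
	int ret;

	/* fast path: pin the page backing the guest address, no extra mutex */
	down_read(&mm->mmap_sem);
	ret = get_user_pages(current, mm, hva, 1, 1, 0, &page, NULL);
	up_read(&mm->mmap_sem);
	if (ret != 1)
		return -EFAULT;

	spin_lock(&kvm->mmu_lock);
	/* ... install the spte pointing at page; the pin keeps the page
	 * alive until an invalidate callout drops the spte again ... */
	spin_unlock(&kvm->mmu_lock);
	return 0;
}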
> I have most of the GRU code working with the latest mmuops patch. I still
> have a list of loose ends that I'll get to next week. The most important is
> the exact handling of the range invalidates. The code that I currently have
> works (mostly) but has a few endcases that will cause problems. Once I
> finish, I'll be glad to send you snippets of the code (or all of it) if you
> would like to take a look.
Sure.
On Sun, 3 Feb 2008, Andrea Arcangeli wrote:
> > Right but that pin requires taking a refcount which we cannot do.
>
> GRU can use my patch without the pin. XPMEM obviously can't use my
> patch as my invalidate_page[s] are under the PT lock (a feature to fit
> GRU/KVM in the simplest way), this is why an incremental patch adding
> invalidate_range_start/end would be required to support XPMEM too.
Doesn't the kernel in some situations release the page before releasing the
pte lock? Then there will be an external pte pointing to a page that may
now have a different use. It's really bad if that pte does allow writes.
On Mon, Feb 04, 2008 at 11:09:01AM -0800, Christoph Lameter wrote:
> On Sun, 3 Feb 2008, Andrea Arcangeli wrote:
>
> > > Right but that pin requires taking a refcount which we cannot do.
> >
> > GRU can use my patch without the pin. XPMEM obviously can't use my
> > patch as my invalidate_page[s] are under the PT lock (a feature to fit
> > GRU/KVM in the simplest way), this is why an incremental patch adding
> > invalidate_range_start/end would be required to support XPMEM too.
>
> Doesnt the kernel in some situations release the page before releasing the
> pte lock? Then there will be an external pte pointing to a page that may
> now have a different use. Its really bad if that pte does allow writes.
Sure the kernel does that most of the time, which is for example why I
had to use invalidate_page instead of invalidate_pages inside
zap_pte_range. Zero problems with that (this is also the exact reason
why I mentioned the tlb flushing code would need changes to convert
some page in pages).
On Tue, 5 Feb 2008, Andrea Arcangeli wrote:
> On Mon, Feb 04, 2008 at 11:09:01AM -0800, Christoph Lameter wrote:
> > On Sun, 3 Feb 2008, Andrea Arcangeli wrote:
> >
> > > > Right but that pin requires taking a refcount which we cannot do.
> > >
> > > GRU can use my patch without the pin. XPMEM obviously can't use my
> > > patch as my invalidate_page[s] are under the PT lock (a feature to fit
> > > GRU/KVM in the simplest way), this is why an incremental patch adding
> > > invalidate_range_start/end would be required to support XPMEM too.
> >
> > Doesnt the kernel in some situations release the page before releasing the
> > pte lock? Then there will be an external pte pointing to a page that may
> > now have a different use. Its really bad if that pte does allow writes.
>
> Sure the kernel does that most of the time, which is for example why I
> had to use invalidate_page instead of invalidate_pages inside
> zap_pte_range. Zero problems with that (this is also the exact reason
> why I mentioned the tlb flushing code would need changes to convert
> some page in pages).
Zero problems only if you find having a single callout for every page
acceptable. So the invalidate_range in your patch is only working
sometimes. And even if it works then it has to be used on a 2M range. Seems
to be a bit fragile and needlessly complex.
"conversion of some page in pages"? A proposal to defer the freeing of the
pages until after the pte_unlock?
On Mon, Feb 04, 2008 at 10:11:24PM -0800, Christoph Lameter wrote:
> Zero problems only if you find having a single callout for every page
> acceptable. So the invalidate_range in your patch is only working
invalidate_pages is only a further optimization that was
straightforward in some places where the page isn't freed but only the
pte is modified.
> sometimes. And even if it works then it has to be used on 2M range. Seems
> to be a bit fragile and needlessly complex.
The patch as a whole isn't fragile nor complex. Trying to use
invalidate_pages everywhere would be complex (and in turn more fragile
than my current patch; complexity brings fragility).
Overall you can only argue about performance issues (my patch is
simpler for GRU/KVM, and it sure isn't fragile, quite the opposite,
given I never allow a coherency loss between two threads that would
read/write to two different physical pages for the same virtual
address in remap_file_pages).
In performance terms, with your patch, before the GRU can run follow_page it
has to take an mm-wide global mutex that each thread on all cpus will
have to take. That will thrash on >4-way when the tlb misses start
to occur. There's nothing like that in my patch. Your approach will
micro-optimize certain large pte-mangling calls, or do_exit, but those
aren't interesting paths, neither for GRU nor for KVM. You're optimizing for
the slow path, and making the fast path slower.
> "conversion of some page in pages"? A proposal to defer the freeing of the
> pages until after the pte_unlock?
There can be many tricks to optimize page in pages, but again munmap
and do_exit aren't the interesting path to optimzie, nor for GRU nor
for KVM so it doesn't matter right now.
On Tue, 5 Feb 2008, Andrea Arcangeli wrote:
> given I never allow a coherency-loss between two threads that will
> read/write to two different physical pages for the same virtual
> adddress in remap_file_pages).
The other approach will not have any remote ptes at that point. Why would
there be a coherency issue?
> In performance terms with your patch before GRU can run follow_page it
> has to take a mm-wide global mutex where each thread in all cpus will
> have to take it. That will trash on >4-way when the tlb misses start
No. It only has to lock the affected range. Remote page faults can occur
while another part of the address space is being invalidated. The
complexity of locking is up to the user of the mmu notifier. A simple
implementation is satisfactory for the GRU right now. Should it become a
problem then the lock granularity can be refined without changing the API.
> > "conversion of some page in pages"? A proposal to defer the freeing of the
> > pages until after the pte_unlock?
>
> There can be many tricks to optimize page in pages, but again munmap
> and do_exit aren't the interesting path to optimzie, nor for GRU nor
> for KVM so it doesn't matter right now.
Still not sure what we are talking about here.
On Tue, Feb 05, 2008 at 10:17:41AM -0800, Christoph Lameter wrote:
> The other approach will not have any remote ptes at that point. Why would
> there be a coherency issue?
It never happens that two threads write to two different physical
pages by working on the same process virtual address. This is an issue
only for KVM, which is probably ok with it, but certainly you can't
consider the dependency on the page pin less fragile or less complex
than my PT lock approach.
> No. It only has to lock the affected range. Remote page faults can occur
> while another part of the address space is being invalidated. The
> complexity of locking is up to the user of the mmu notifier. A simple
> implementation is satisfactory for the GRU right now. Should it become a
> problem then the lock granularity can be refined without changing the API.
That will make the follow_page fast path even slower if it has to
look up an rbtree or a list of locked ranges. Still not comparable to
the PT lock, which 1) is zero cost and 2) provides an even more
granular scalability.
> Still not sure what we are talking about here.
The apps using GRU/KVM never trigger large
munmap/mremap/do_exit. You're optimizing for the irrelevant workload,
by requiring unnecessary new locking in the GRU fast path.
On Tue, 5 Feb 2008, Andrea Arcangeli wrote:
> On Tue, Feb 05, 2008 at 10:17:41AM -0800, Christoph Lameter wrote:
> > The other approach will not have any remote ptes at that point. Why would
> > there be a coherency issue?
>
> It never happens that two threads write to two different physical
> pages by working on the same process virtual address. This is an issue
> only for KVM, which is probably ok with it, but certainly you can't
> consider the dependency on the page-pin less fragile or less complex
> than my PT lock approach.
You can avoid the page-pin and the pt lock completely by zapping the
mappings at _start and then holding off new references until _end.
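One way a driver could implement that hold-off, sketched with invented names
(ext_mmu and friends) and assuming the paired _start/_end callbacks under
discussion:

#include <linux/spinlock.h>

struct ext_mmu {
        spinlock_t lock;
        int invalidate_count;           /* _start calls not yet paired with _end */
        unsigned long invalidate_seq;   /* bumped by every _end */
};

static void ext_range_start(struct ext_mmu *em,
                            unsigned long start, unsigned long end)
{
        spin_lock(&em->lock);
        em->invalidate_count++;
        spin_unlock(&em->lock);
        /* ... zap all external references to [start, end) ... */
}

static void ext_range_end(struct ext_mmu *em)
{
        spin_lock(&em->lock);
        em->invalidate_count--;
        em->invalidate_seq++;
        spin_unlock(&em->lock);
}

/*
 * Secondary fault path: sample invalidate_seq under the lock, resolve the
 * page, then re-check and install while still holding the lock, so that a
 * new _start cannot sneak in between the check and the install.
 */
static bool ext_begin_install(struct ext_mmu *em, unsigned long seq_seen)
{
        spin_lock(&em->lock);
        if (!em->invalidate_count && em->invalidate_seq == seq_seen)
                return true;    /* caller installs, then ext_end_install() */
        spin_unlock(&em->lock);
        return false;           /* an invalidation raced: retry the fault */
}

static void ext_end_install(struct ext_mmu *em)
{
        spin_unlock(&em->lock);
}

The fault path samples invalidate_seq under the lock, resolves the page, then
calls ext_begin_install(); a false return means an invalidation raced with the
fault and it must be retried. Whether this ends up cheaper than the page pin or
the PT lock is exactly what the rest of this thread argues about.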
> > No. It only has to lock the affected range. Remote page faults can occur
> > while another part of the address space is being invalidated. The
> > complexity of locking is up to the user of the mmu notifier. A simple
> > implementation is satisfactory for the GRU right now. Should it become a
> > problem then the lock granularity can be refined without changing the API.
>
> That will make the follow_page fast path even slower if it has to
> look up an rbtree or a list of locked ranges. Still not comparable to
> the PT lock, which 1) is zero cost and 2) provides even more
> granular scalability.
As I said, the implementation is up to the caller. Not sure what
XPmem is using there, but then XPmem is not using follow_page. The GRU
would be using a lightweight way of locking, not rbtrees.
> > Still not sure what we are talking about here.
>
> The apps using GRU/KVM never trigger large
> munmap/mremap/do_exit. You're optimizing for the irrelevant workload,
> by requiring unnecessary new locking in the GRU fast path.
Maybe that is true for KVM but certainly not true for the GRU. The GRU is
designed to manage several petabytes of memory that may be mapped by a
series of Linux instances. If a process only maps a small chunk of 4
Gigabytes then we already have to deal with 1 million callbacks.
On Tue, Feb 05, 2008 at 02:06:23PM -0800, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Andrea Arcangeli wrote:
>
> > On Tue, Feb 05, 2008 at 10:17:41AM -0800, Christoph Lameter wrote:
> > > The other approach will not have any remote ptes at that point. Why would
> > > there be a coherency issue?
> >
> > It never happens that two threads write to two different physical
> > pages by working on the same process virtual address. This is an issue
> > only for KVM, which is probably ok with it, but certainly you can't
> > consider the dependency on the page-pin less fragile or less complex
> > than my PT lock approach.
>
> You can avoid the page-pin and the pt lock completely by zapping the
> mappings at _start and then holding off new references until _end.
XPMEM does this by putting our equivalent structure of the mm into a
recalling state, which causes all future faulters to back off. It then
marks any currently active faults in the range as invalid (we have a very
small number of possible concurrent faulters for a different reason),
proceeds to start remote shoot-downs, waits for those shoot-downs to
complete, and then returns from the _begin callout with the mm-equivalent
still in the recalling state. Additional recalls may occur, but no new
faults can. The _end callout reduces the number of active recalls until
there are none left, at which point the faulters are allowed to proceed
again.
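Condensed into code, the protocol described above would look roughly like this;
the names (xpmem_seg, recall_count, ...) are invented stand-ins for XPMEM's
mm-equivalent structure:

#include <linux/spinlock.h>
#include <linux/wait.h>

struct xpmem_seg {
        spinlock_t lock;
        int recall_count;               /* active _begin ... _end sections */
        wait_queue_head_t wq;           /* faulters back off here */
};

/* _begin callout: enter the recalling state, then shoot down remote ptes */
static void seg_recall_begin(struct xpmem_seg *seg)
{
        spin_lock(&seg->lock);
        seg->recall_count++;            /* future faulters now back off */
        spin_unlock(&seg->lock);
        /* mark in-flight faults in the range invalid, start the remote
         * shoot-downs and wait (possibly sleeping) for them to complete */
}

/* _end callout: leave the recalling state; last one out wakes the faulters */
static void seg_recall_end(struct xpmem_seg *seg)
{
        bool idle;

        spin_lock(&seg->lock);
        idle = (--seg->recall_count == 0);
        spin_unlock(&seg->lock);
        if (idle)
                wake_up_all(&seg->wq);
}

/* remote fault path: wait until no recall is in progress, then re-check
 * under seg->lock before actually establishing the remote pte */
static void seg_fault_wait(struct xpmem_seg *seg)
{
        wait_event(seg->wq, seg->recall_count == 0);
}

Additional recalls only bump recall_count, so further _begin callouts can nest
while faulters stay parked until the count drops back to zero.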
Thanks,
Robin
On Tue, Feb 05, 2008 at 02:06:23PM -0800, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Andrea Arcangeli wrote:
>
> > On Tue, Feb 05, 2008 at 10:17:41AM -0800, Christoph Lameter wrote:
> > > The other approach will not have any remote ptes at that point. Why would
> > > there be a coherency issue?
> >
> > It never happens that two threads write to two different physical
> > pages by working on the same process virtual address. This is an issue
> > only for KVM, which is probably ok with it, but certainly you can't
> > consider the dependency on the page-pin less fragile or less complex
> > than my PT lock approach.
>
> You can avoid the page-pin and the pt lock completely by zapping the
> mappings at _start and then holding off new references until _end.
Avoid the PT lock? The PT lock has to be taken anyway by the Linux
VM.
"holding off new references until _end" = a per-range mutex, less scalable
and more expensive than the PT lock that has to be taken anyway.
> As I said, the implementation is up to the caller. Not sure what
> XPmem is using there, but then XPmem is not using follow_page. The GRU
> would be using a lightweight way of locking, not rbtrees.
"lightweight way of locking" = mm-wide mutex (not necessary at all if we
take advantage of the per-pte-scalable PT lock that has to be taken
anyway, as in my patch)
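For contrast, a sketch of the fault path this argument assumes, where the
external TLB entry is loaded under the pte lock the core VM takes anyway;
gru_load_tlb() and gru_flush_tlb_page() are hypothetical hardware accessors,
not functions from the real driver:

#include <linux/mm.h>
#include <linux/errno.h>

/* hypothetical hardware accessors, not taken from the real GRU driver */
void gru_load_tlb(unsigned long address, unsigned long pfn);
void gru_flush_tlb_page(unsigned long address);

/* fault side: resolve the pte and load the external TLB under the PT lock */
static int gru_fault_one(struct mm_struct *mm, pmd_t *pmd,
                         unsigned long address)
{
        spinlock_t *ptl;
        pte_t *ptep, pte;

        ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
        pte = *ptep;
        if (pte_present(pte))
                gru_load_tlb(address, pte_pfn(pte));
        pte_unmap_unlock(ptep, ptl);
        return pte_present(pte) ? 0 : -EFAULT;
}

/* invalidation side: if the notifier runs with the same pte lock held,
 * it cannot race with a concurrent gru_fault_one() on that pte, so no
 * extra mutex shows up in the tlb-miss path */
static void gru_invalidate_page_notify(struct mm_struct *mm,
                                       unsigned long address)
{
        gru_flush_tlb_page(address);
}

No driver mutex appears in the tlb-miss path under this assumption; the
serialization comes from the pte lock alone.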
> Maybe that is true for KVM but certainly not true for the GRU. The GRU is
> designed to manage several petabytes of memory that may be mapped by a
> series of Linux instances. If a process only maps a small chunk of 4
> Gigabytes then we already have to deal with 1 million callbacks.
KVM is also going to map a lot of stuff, but mapping involves mmap,
not munmap/mremap/mprotect. The size of the mmap is irrelevant in both
approaches. Optimizing do_exit by making the tlb-miss runtime slower
doesn't sound great to me, and that's what your patch does if you force
the GRU to use it.
On Tue, 5 Feb 2008, Andrea Arcangeli wrote:
> > You can avoid the page-pin and the pt lock completely by zapping the
> > mappings at _start and then holding off new references until _end.
>
> "holding off new references until _end" = per-range mutex less scalar
> and more expensive than the PT lock that has to be taken anyway.
You can of course set up a 2M granularity lock to get the same granularity
as the pte lock. That would even work for the cases where you have to page
pin now.
> > Maybe that is true for KVM but certainly not true for the GRU. The GRU is
> > designed to manage several petabytes of memory that may be mapped by a
> > series of Linux instances. If a process only maps a small chunk of 4
> Gigabytes then we already have to deal with 1 million callbacks.
>
> KVM is also going to map a lot of stuff, but mapping involves mmap,
> not munmap/mremap/mprotect. The size of the mmap is irrelevant in both
> approaches. Optimizing do_exit by making the tlb-miss runtime slower
> doesn't sound great to me, and that's what your patch does if you force
> the GRU to use it.
The size of the mmap is relevant if you have to perform callbacks on
every mapped page that involve taking mmu-specific locks. That seems to be
the case with this approach.
Optimizing do_exit by taking a single lock to zap all external references
instead of 1 million callbacks somehow leads to a slowdown?
On Tue, Feb 05, 2008 at 03:10:52PM -0800, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Andrea Arcangeli wrote:
>
> > > You can avoid the page-pin and the pt lock completely by zapping the
> > > mappings at _start and then holding off new references until _end.
> >
> > "holding off new references until _end" = per-range mutex less scalar
> > and more expensive than the PT lock that has to be taken anyway.
>
> You can of course set up a 2M granularity lock to get the same granularity
> as the pte lock. That would even work for the cases where you have to page
> pin now.
If you set a 2M granularity lock, the _start callback would need to
do:
for_each_2m_lock()
mutex_lock()
so you'd run zillions of mutex_lock calls in a row; you're the one with
the millions-of-operations argument.
> The size of the mmap is relevant if you have to perform callbacks on
> every mapped page that involve taking mmu-specific locks. That seems to be
> the case with this approach.
mmap should never trigger any range_start/_end callback unless it's
overwriting an older mapping, which is definitely not the interesting
workload for those apps, including kvm.
> Optimizing do_exit by taking a single lock to zap all external references
> instead of 1 million callbacks somehow leads to a slowdown?
It can if the application runs for more than a couple of seconds,
i.e. not a fork flood in which you care about do_exit speed. Keep in
mind that if you had 1 million invalidate_pages callbacks it means you
previously called follow_page 1 million times too...
On Wed, 6 Feb 2008, Andrea Arcangeli wrote:
> > You can of course set up a 2M granularity lock to get the same granularity
> > as the pte lock. That would even work for the cases where you have to page
> > pin now.
>
> If you set a 2M granularity lock, the _start callback would need to
> do:
>
> for_each_2m_lock()
> mutex_lock()
>
> so you'd run zillions of mutex_lock calls in a row; you're the one with
> the millions-of-operations argument.
There is no requirement to do a linear search. No one in his right mind
would implement a performance-critical operation that way.
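For instance, with a fixed table of per-2M-chunk locks the right lock can be
computed directly from the faulting address; this is only an illustration of
the no-linear-search point, not something any driver in this thread is known
to implement:

#include <linux/mutex.h>

#define CHUNK_SHIFT     21              /* 2M chunks */
#define NR_CHUNK_LOCKS  256             /* power of two */

static struct mutex chunk_locks[NR_CHUNK_LOCKS];

static void chunk_locks_init(void)
{
        int i;

        for (i = 0; i < NR_CHUNK_LOCKS; i++)
                mutex_init(&chunk_locks[i]);
}

/* O(1) lookup: hash the 2M chunk number into the lock table */
static struct mutex *chunk_lock(unsigned long address)
{
        return &chunk_locks[(address >> CHUNK_SHIFT) & (NR_CHUNK_LOCKS - 1)];
}

/* fault path: one mutex, no list or rbtree walk */
static void chunk_fault_lock(unsigned long address)
{
        mutex_lock(chunk_lock(address));
}

static void chunk_fault_unlock(unsigned long address)
{
        mutex_unlock(chunk_lock(address));
}

The _start side would still have to take every chunk lock covering the
invalidated range, which is the cost objected to above; the point here is only
that the fault-side lookup is constant time.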
> > The size of the mmap is relevant if you have to perform callbacks on
> > every mapped page that involve taking mmu-specific locks. That seems to be
> > the case with this approach.
>
> mmap should never trigger any range_start/_end callback unless it's
> overwriting an older mapping, which is definitely not the interesting
> workload for those apps, including kvm.
There is still at least the need for teardown on exit. And you need to
consider the boundless creativity of userland programmers. You would not
believe what I have seen....
> > Optimizing do_exit by taking a single lock to zap all external references
> > instead of 1 million callbacks somehow leads to a slowdown?
>
> It can if the application runs for more than a couple of seconds,
> i.e. not a fork flood in which you care about do_exit speed. Keep in
> mind that if you had 1 million invalidate_pages callbacks it means you
> previously called follow_page 1 million times too...
That is another problem where we are also in need of solutions. I believe
we have discussed that elsewhere.