2008-01-31 04:59:37

by Christoph Lameter

Subject: [patch 1/3] mmu_notifier: Core code

Notifier functions for hardware and software that establish external
references to pages of a Linux system. The notifier calls ensure that
external mappings are removed when the Linux VM removes memory ranges
or individual pages from a process.

These fall into two classes:

1. mmu_notifier

These are callbacks registered with an mm_struct. If pages are
removed from an address space then callbacks are performed.

Spinlocks must be held in order to walk reverse maps. The
invalidate_page() callbacks are performed while spinlocks are held.

The invalidate_range_start/end callbacks can be performed in contexts
where sleeping is allowed or in atomic contexts. A flag is passed
to indicate an atomic context.


2. mmu_rmap_notifier

Callbacks for subsystems that provide their own rmaps. These
need to walk their own rmaps for a page. The invalidate_page
callback is invoked outside of locks so that we are not in a strictly
atomic context (but we may be in a PF_MEMALLOC context if the
notifier is called from reclaim code) and are able to sleep.

Rmap notifiers need an extra page bit and are only available
on 64 bit platforms.

Pages must be marked dirty if dirty bits are found to be set in
the external ptes.
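
To illustrate, an exporting subsystem might hook in like this (a hedged
sketch against this patch; my_rmap_clear_ptes() is a hypothetical driver
internal that clears the external ptes for a page and reports whether
any of them was dirty):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static int my_rmap_clear_ptes(struct page *page)
{
	/* Hypothetical: walk the driver rmap, clear the external ptes
	 * mapping this page and return nonzero if any was dirty. */
	return 0;
}

static void my_rmap_invalidate_page(struct mmu_rmap_notifier *mrn,
				    struct page *page)
{
	/* Transfer dirty state from the external ptes to the page. */
	if (my_rmap_clear_ptes(page))
		set_page_dirty(page);
}

static const struct mmu_rmap_notifier_ops my_rmap_ops = {
	.invalidate_page	= my_rmap_invalidate_page,
};

static struct mmu_rmap_notifier my_rmap_notifier = {
	.ops = &my_rmap_ops,
};

The notifier would be registered once via
mmu_rmap_notifier_register(&my_rmap_notifier), and each page handed out
to the external rmap flagged with mmu_rmap_export_page() while holding
the page lock.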

Requirements on synchronization within the driver:

Multiple invalidate_range_begin/ends may be nested or called
concurrently. That is legit. However, no new external references
may be established as long as any invalidate_xxx is running or
any invalidate_range_begin() has been performed that has not yet been
completed through a corresponding call to invalidate_range_end().

Locking within the notifier needs to serialize events correspondingly.

If all invalidate_xxx notifier calls take a driver lock then it is possible
to run follow_page() under the same lock. The lock can then guarantee
that no page is removed and provides an additional existence
guarantee for the page.

invalidate_range_begin() must clear all references in the range
and stop the establishment of new references.

invalidate_range_end() reenables the establishment of references.
atomic indicates that the function is called in an atomic context.
We can sleep if atomic == 0.
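
For illustration, a driver could wire this up roughly as follows. This is
only a sketch against the API in this patch; the my_driver_* names, the
lock and the flush helper are hypothetical, and a single global notifier
(one mm) is assumed for brevity:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_driver_lock);	/* serializes faults vs. invalidates */
static int my_driver_blocked;	/* nonzero while any range invalidate runs */

static void my_driver_flush_external(struct mm_struct *mm,
				     unsigned long start, unsigned long end)
{
	/* Hypothetical: drop the driver's external ptes for [start, end). */
}

static void my_invalidate_range_begin(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end,
				      int atomic)
{
	spin_lock(&my_driver_lock);
	my_driver_blocked++;		/* stop new external references */
	my_driver_flush_external(mm, start, end);
	spin_unlock(&my_driver_lock);
}

static void my_invalidate_range_end(struct mmu_notifier *mn,
				    struct mm_struct *mm, int atomic)
{
	spin_lock(&my_driver_lock);
	my_driver_blocked--;	/* reenable establishment of references */
	spin_unlock(&my_driver_lock);
}

static const struct mmu_notifier_ops my_notifier_ops = {
	.invalidate_range_begin	= my_invalidate_range_begin,
	.invalidate_range_end	= my_invalidate_range_end,
};

static struct mmu_notifier my_notifier = {
	.ops = &my_notifier_ops,
};

void my_driver_attach(struct mm_struct *mm)
{
	mmu_notifier_register(&my_notifier, mm);	/* takes mmap_sem */
}

The driver's fault path would take my_driver_lock, refuse to establish a
reference while my_driver_blocked is nonzero, and could call follow_page()
under the same lock per the note above.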

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>

---
include/linux/mm_types.h | 6 +
include/linux/mmu_notifier.h | 248 +++++++++++++++++++++++++++++++++++++++++++
include/linux/page-flags.h | 11 +
kernel/fork.c | 2
mm/Kconfig | 4
mm/Makefile | 1
mm/mmap.c | 3
mm/mmu_notifier.c | 118 ++++++++++++++++++++
8 files changed, 393 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h 2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h 2008-01-30 19:49:34.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
#endif
};

+struct mmu_notifier_head {
+ struct hlist_head head;
+};
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -219,6 +223,8 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+ struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
};

#endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h 2008-01-30 20:25:43.000000000 -0800
@@ -0,0 +1,248 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * These are callbacks registered with an mm_struct. If mappings are
+ * removed from an address space then callbacks are performed.
+ *
+ * Spinlocks must be held in order to walk reverse maps. The
+ * invalidate_page notifications are performed while spinlocks are held.
+ *
+ * The invalidate_range_start/end callbacks can be performed in contexts
+ * where sleeping is allowed or in atomic contexts. A flag is passed
+ * to indicate an atomic context.
+ *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ * Callbacks for subsystems that provide their own rmaps. These
+ * need to walk their own rmaps for a page. The invalidate_page
+ * callback is invoked outside of locks so that we are not in a strictly
+ * atomic context (but we may be in a PF_MEMALLOC context if the
+ * notifier is called from reclaim code) and are able to sleep.
+ *
+ * Rmap notifiers need an extra page bit and are only available
+ * on 64 bit platforms.
+ *
+ * Pages must be marked dirty if dirty bits are found to be set in
+ * the external ptes.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+ /*
+ * The release notifier is called when no other execution threads
+ * are left. Synchronization is not necessary.
+ */
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+ /* Dummy needed because the mmu_notifier() macro requires it */
+ void (*invalidate_all)(struct mmu_notifier *mn, struct mm_struct *mm,
+ int dummy);
+
+ /*
+ * age_page is called from contexts where the pte_lock is held
+ */
+ int (*age_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /* invalidate_page is called from contexts where the pte_lock is held */
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * invalidate_range_begin() and invalidate_range_end() must be paired.
+ * Multiple invalidate_range_begin/ends may be nested or called
+ * concurrently. That is legit. However, no new external references
+ * may be established as long as any invalidate_xxx is running or
+ * any invalidate_range_begin() has been performed that has not yet been
+ * completed through a corresponding call to invalidate_range_end().
+ *
+ * Locking within the notifier needs to serialize events correspondingly.
+ *
+ * If all invalidate_xxx notifier calls take a driver lock then it is possible
+ * to run follow_page() under the same lock. The lock can then guarantee
+ * that no page is removed. That way we can avoid increasing the refcount
+ * of the pages.
+ *
+ * invalidate_range_begin() must clear all references in the range
+ * and stop the establishment of new references.
+ *
+ * invalidate_range_end() reenables the establishment of references.
+ *
+ * atomic indicates that the function is called in an atomic context.
+ * We can sleep if atomic == 0.
+ */
+ void (*invalidate_range_begin)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ int atomic);
+
+ void (*invalidate_range_end)(struct mmu_notifier *mn,
+ struct mm_struct *mm, int atomic);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+ struct hlist_node hlist;
+ const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+ /*
+ * Called with the page lock held after ptes are modified or removed
+ * so that a subsystem with its own rmaps can remove remote ptes
+ * mapping a page.
+ */
+ void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+ struct page *page);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void __mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+/* Will acquire mmap_sem for write */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+/*
+ * Will acquire mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+ INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ struct mmu_notifier *__mn; \
+ struct hlist_node *__n; \
+ \
+ if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mn, __n, \
+ &(mm)->mmu_notifier.head, \
+ hlist) \
+ if (__mn->ops->function) \
+ __mn->ops->function(__mn, \
+ mm, \
+ args); \
+ rcu_read_unlock(); \
+ } \
+ } while (0)
+
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+/* Must hold PageLock */
+extern void mmu_rmap_export_page(struct page *page);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...) \
+ do { \
+ struct mmu_rmap_notifier *__mrn; \
+ struct hlist_node *__n; \
+ \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mrn, __n, \
+ &mmu_rmap_notifier_list, hlist) \
+ if (__mrn->ops->function) \
+ __mrn->ops->function(__mrn, args); \
+ rcu_read_unlock(); \
+ } while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Stub versions that use the parameters they were passed so that the
+ * compiler does not complain about unused variables but still does
+ * proper parameter type checks even if !CONFIG_MMU_NOTIFIER.
+ * The macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ if (0) { \
+ struct mmu_notifier *__mn; \
+ \
+ __mn = (struct mmu_notifier *)(0x00ff); \
+ __mn->ops->function(__mn, mm, args); \
+ } \
+ } while (0)
+
+#define mmu_rmap_notifier(function, args...) \
+ do { \
+ if (0) { \
+ struct mmu_rmap_notifier *__mrn; \
+ \
+ __mrn = (struct mmu_rmap_notifier *)(0x00ff); \
+ __mrn->ops->function(__mrn, args); \
+ } \
+ } while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ return 0;
+}
+
static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) {}
+
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+ {}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+ {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h 2008-01-30 20:30:50.000000000 -0800
@@ -105,6 +105,7 @@
* 64 bit | FIELDS | ?????? FLAGS |
* 63 32 0
*/
+#define PG_external_rmap 30 /* Page has external rmap */
#define PG_uncached 31 /* Page has been mapped as uncached */
#endif

@@ -260,6 +261,16 @@ static inline void __ClearPageTail(struc
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)

+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page) test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+ &(page)->flags)
+#else
+#define ClearPageExternalRmap(page) do {} while (0)
+#define PageExternalRmap(page) 0
+#endif
+
struct page; /* forward declaration */

extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/mm/Kconfig 2008-01-30 19:49:34.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/mm/Makefile 2008-01-30 19:49:34.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c 2008-01-30 20:31:12.000000000 -0800
@@ -0,0 +1,118 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *t;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ hlist_for_each_entry_safe(mn, n, t,
+ &mm->mmu_notifier.head, hlist) {
+ hlist_del_init(&mn->hlist);
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ }
+ }
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending on whether the mapping
+ * previously existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int young = 0;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(mn, n,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->age_page)
+ young |= mn->ops->age_page(mn, mm, address);
+ }
+ rcu_read_unlock();
+ }
+
+ return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after an RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ down_write(&mm->mmap_sem);
+ __mmu_notifier_register(mn, mm);
+ up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ down_write(&mm->mmap_sem);
+ hlist_del_rcu(&mn->hlist);
+ up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
+#ifdef CONFIG_64BIT
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+ spin_lock(&mmu_notifier_list_lock);
+ hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+ spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+ spin_lock(&mmu_notifier_list_lock);
+ hlist_del_rcu(&mrn->hlist);
+ spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
+/*
+ * Export a page.
+ *
+ * Pagelock must be held.
+ * Must be called before a page is put on an external rmap.
+ */
+void mmu_rmap_export_page(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);
+
+#endif
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/kernel/fork.c 2008-01-30 19:49:34.000000000 -0800
@@ -52,6 +52,7 @@
#include <linux/tty.h>
#include <linux/proc_fs.h>
#include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -360,6 +361,7 @@ static struct mm_struct * mm_init(struct

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_head_init(&mm->mmu_notifier);
return mm;
}
free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/mm/mmap.c 2008-01-30 20:29:31.000000000 -0800
@@ -26,6 +26,7 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
unsigned long end;

/* mm's last user has gone, and its about to be pulled down */
+ mmu_notifier(invalidate_all, mm, 0);
arch_exit_mmap(mm);

lru_add_drain();
@@ -2044,6 +2046,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ mmu_notifier_release(mm);

/*
* Walk the list again, actually closing and freeing it,

--


2008-02-01 01:56:23

by Jack Steiner

Subject: Re: [patch 1/3] mmu_notifier: Core code

> @@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
> unsigned long end;
>
> /* mm's last user has gone, and its about to be pulled down */
> + mmu_notifier(invalidate_all, mm, 0);
> arch_exit_mmap(mm);
>

The name of the "invalidate_all" callout is not very descriptive.
Why not use "exit_mmap". That matches the function name, the arch callout
and better describes what is happening.


--- jack

2008-02-01 02:24:53

by Robin Holt

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, Jan 31, 2008 at 07:56:12PM -0600, Jack Steiner wrote:
> > @@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
> > unsigned long end;
> >
> > /* mm's last user has gone, and its about to be pulled down */
> > + mmu_notifier(invalidate_all, mm, 0);
> > arch_exit_mmap(mm);
> >
>
> The name of the "invalidate_all" callout is not very descriptive.
> Why not use "exit_mmap". That matches the function name, the arch callout
> and better describes what is happening.

This got removed in a later patch. We now only do the release.

Thanks,
Robin

2008-02-01 02:31:24

by Robin Holt

Subject: Re: [patch 1/3] mmu_notifier: Core code

Christoph,

Jack has repeatedly pointed out needing an unregister outside the
mmap_sem. I still don't see the benefit to not having the lock in the mm.

Thanks,
Robin

Index: mmu_notifiers-cl-v4/include/linux/mm_types.h
===================================================================
--- mmu_notifiers-cl-v4.orig/include/linux/mm_types.h 2008-01-31 19:56:08.000000000 -0600
+++ mmu_notifiers-cl-v4/include/linux/mm_types.h 2008-01-31 19:56:58.000000000 -0600
@@ -155,6 +155,7 @@ struct vm_area_struct {

struct mmu_notifier_head {
#ifdef CONFIG_MMU_NOTIFIER
+ spinlock_t list_lock;
struct hlist_head head;
#endif
};
Index: mmu_notifiers-cl-v4/mm/mmu_notifier.c
===================================================================
--- mmu_notifiers-cl-v4.orig/mm/mmu_notifier.c 2008-01-31 19:56:08.000000000 -0600
+++ mmu_notifiers-cl-v4/mm/mmu_notifier.c 2008-01-31 20:26:31.000000000 -0600
@@ -60,25 +60,19 @@ int mmu_notifier_age_page(struct mm_stru
* Note that all notifiers use RCU. The updates are only guaranteed to be
* visible to other processes after a RCU quiescent period!
*/
-void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
-{
- hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
-}
-EXPORT_SYMBOL_GPL(__mmu_notifier_register);
-
void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
- down_write(&mm->mmap_sem);
- __mmu_notifier_register(mn, mm);
- up_write(&mm->mmap_sem);
+ spin_lock(&mm->mmu_notifier.list_lock);
+ hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+ spin_unlock(&mm->mmu_notifier.list_lock);
}
EXPORT_SYMBOL_GPL(mmu_notifier_register);

void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
{
- down_write(&mm->mmap_sem);
+ spin_lock(&mm->mmu_notifier.list_lock);
hlist_del_rcu(&mn->hlist);
- up_write(&mm->mmap_sem);
+ spin_unlock(&mm->mmu_notifier.list_lock);
}
EXPORT_SYMBOL_GPL(mmu_notifier_unregister);

2008-02-01 02:37:20

by Jack Steiner

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, Jan 31, 2008 at 08:24:44PM -0600, Robin Holt wrote:
> On Thu, Jan 31, 2008 at 07:56:12PM -0600, Jack Steiner wrote:
> > > @@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
> > > unsigned long end;
> > >
> > > /* mm's last user has gone, and its about to be pulled down */
> > > + mmu_notifier(invalidate_all, mm, 0);
> > > arch_exit_mmap(mm);
> > >
> >
> > The name of the "invalidate_all" callout is not very descriptive.
> > Why not use "exit_mmap". That matches the function name, the arch callout
> > and better describes what is happening.
>
> This got removed in a later patch. We now only do the release.

Found it, thanks.

Christoph, is it time to post a new series of patches? I've got
as many fixup patches as I have patches in the original posting.



-- jack

2008-02-01 02:39:28

by Christoph Lameter

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, 31 Jan 2008, Robin Holt wrote:

> Jack has repeatedly pointed out needing an unregister outside the
> mmap_sem. I still don't see the benefit to not having the lock in the mm.

I never understood why this would be needed. ->release removes the
mmu_notifier right now.

2008-02-01 02:39:58

by Christoph Lameter

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, 31 Jan 2008, Jack Steiner wrote:

> Christoph, is it time to post a new series of patches? I've got
> as many fixup patches as I have patches in the original posting.

Maybe wait another day? This is getting a bit too frequent and so far we
have only minor changes.

2008-02-01 02:47:54

by Robin Holt

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, Jan 31, 2008 at 06:39:19PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Robin Holt wrote:
>
> > Jack has repeatedly pointed out needing an unregister outside the
> > mmap_sem. I still don't see the benefit to not having the lock in the mm.
>
> I never understood why this would be needed. ->release removes the
> mmu_notifier right now.

Both xpmem and GRU have means of removing their context separate from
process termination. XPMEM's is by closing the fd; I believe GRU's is
the same. In the case of XPMEM, we are able to acquire the mmap_sem.
For GRU, I don't think it is possible, but I do not remember the exact
reason.

Thanks,
Robin

2008-02-01 03:01:30

by Jack Steiner

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, Jan 31, 2008 at 06:39:19PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Robin Holt wrote:
>
> > Jack has repeatedly pointed out needing an unregister outside the
> > mmap_sem. I still don't see the benefit to not having the lock in the mm.
>
> I never understood why this would be needed. ->release removes the
> mmu_notifier right now.

Christoph -

We discussed this earlier this week. Here is part of the mail:

------------

> > There currently is no __mmu_notifier_unregister(). Oversight???
>
> No need. mmu_notifier_release implies an unregister and I think that is
> the most favored way to release resources since it deals with the RCU
> quiescent period.


I currently unlink the mmu_notifier when the last GRU mapping is closed. For
example, if a user does a:

gru_create_context();
...
gru_destroy_context();

the mmu_notifier is unlinked and all task tables allocated
by the driver are freed. Are you suggesting that I leave tables
allocated until the task terminates??

Why is that better? What problem do I cause by trying
to free tables as soon as they are not needed?


-----------------------------------------------

> Christoph responded:
> > the mmu_notifier is unlinked and all task tables allocated
> > by the driver are freed. Are you suggesting that I leave tables
> > allocated until the task terminates??
>
> You need to leave the mmu_notifier structure allocated until the next
> quiescent rcu period unless you use the release notifier.

I assumed that I would need to use call_rcu() or synchronize_rcu()
before the table is actually freed. That's still on my TODO list.



--- jack

2008-02-01 03:01:56

by Christoph Lameter

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, 31 Jan 2008, Robin Holt wrote:

> Both xpmem and GRU have means of removing their context separate from
> process termination. XPMEM's is by closing the fd; I believe GRU's is
> the same. In the case of XPMEM, we are able to acquire the mmap_sem.
> For GRU, I don't think it is possible, but I do not remember the exact
> reason.

For any action initiated from user space you will not hold mmap_sem, so
you can call the unregister function. Then you need to do a
synchronize_rcu before freeing the structures.
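
In sketch form (my_notifier and the table-free helper being the
hypothetical driver state from the earlier sketch):

	/* Called from user space, e.g. on fd close; mmap_sem not held. */
	mmu_notifier_unregister(&my_notifier, mm);	/* takes mmap_sem */

	/* The notifier list is walked under rcu_read_lock(), so wait
	 * out a grace period before freeing anything a concurrent
	 * walker might still touch. */
	synchronize_rcu();

	my_driver_free_tables(mm);			/* hypothetical */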

It is also possible to shut this down externally via e.g. a control thread.
The control thread can acquire mmap_sem and then unregister the notifier.

Am I missing something?

2008-02-01 03:03:52

by Christoph Lameter

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, 31 Jan 2008, Jack Steiner wrote:

> I currently unlink the mmu_notifier when the last GRU mapping is closed. For
> example, if a user does a:
>
> gru_create_context();
> ...
> gru_destroy_context();
>
> the mmu_notifier is unlinked and all task tables allocated
> by the driver are freed. Are you suggesting that I leave tables
> allocated until the task terminates??

You are in user space and calling into the kernel somehow. The
mmap_sem is not held at that point so it's no trouble to use the unregister
function. After that, wait for RCU and then free your tables.

> I assumed that I would need to use call_rcu() or synchronize_rcu()
> before the table is actually freed. That's still on my TODO list.

Right.

2008-02-01 03:53:01

by Robin Holt

Subject: Re: [patch 1/3] mmu_notifier: Core code

> Index: linux-2.6/include/linux/mmu_notifier.h
...
> +struct mmu_notifier_ops {
...
> + /*
> + * invalidate_range_begin() and invalidate_range_end() must be paired.
> + * Multiple invalidate_range_begin/ends may be nested or called
> + * concurrently. That is legit. However, no new external references
> + * may be established as long as any invalidate_xxx is running or
> + * any invalidate_range_begin() has been performed that has not yet been
> + * completed through a corresponding call to invalidate_range_end().
> + *
> + * Locking within the notifier needs to serialize events correspondingly.
> + *
> + * If all invalidate_xxx notifier calls take a driver lock then it is possible
> + * to run follow_page() under the same lock. The lock can then guarantee
> + * that no page is removed. That way we can avoid increasing the refcount
> + * of the pages.
> + *
> + * invalidate_range_begin() must clear all references in the range
> + * and stop the establishment of new references.
> + *
> + * invalidate_range_end() reenables the establishment of references.
> + *
> + * atomic indicates that the function is called in an atomic context.
> + * We can sleep if atomic == 0.
> + */
> + void (*invalidate_range_begin)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + int atomic);
> +
> + void (*invalidate_range_end)(struct mmu_notifier *mn,
> + struct mm_struct *mm, int atomic);

I think we need to pass in the same start-end here as well. Without it,
the first invalidate_range would have to block faulting for all addresses
and would need to remain blocked until the last invalidate_range has
completed. While this would work (and will probably be how we implement
it for the short term), it is far from ideal.

Thanks,
Robin

2008-02-01 03:58:49

by Christoph Lameter

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, 31 Jan 2008, Robin Holt wrote:

> > + void (*invalidate_range_end)(struct mmu_notifier *mn,
> > + struct mm_struct *mm, int atomic);
>
> I think we need to pass in the same start-end here as well. Without it,
> the first invalidate_range would have to block faulting for all addresses
> and would need to remain blocked until the last invalidate_range has
> completed. While this would work, (and will probably be how we implement
> it for the short term), it is far from ideal.

Ok. Andrea wanted the same because then he can avoid the begin callouts.

The problem is that you would have to track the start-end addresses, right?

2008-02-01 04:15:28

by Robin Holt

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, Jan 31, 2008 at 07:58:40PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Robin Holt wrote:
>
> > > + void (*invalidate_range_end)(struct mmu_notifier *mn,
> > > + struct mm_struct *mm, int atomic);
> >
> > I think we need to pass in the same start-end here as well. Without it,
> > the first invalidate_range would have to block faulting for all addresses
> > and would need to remain blocked until the last invalidate_range has
> > completed. While this would work, (and will probably be how we implement
> > it for the short term), it is far from ideal.
>
> Ok. Andrea wanted the same because then he can avoid the begin callouts.
>
> The problem is that you would have to track the start-end addresses, right?

Yep. We will probably not do that in the next week, but I would expect
we will have that working before we submit xpmem again. We will probably
just chain them up in a regular linked list.
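
Roughly like this (a sketch; all the names are made up): begin() would
add an entry, a range-carrying end() would remove the matching one, and
the fault path would only block for overlapping addresses:

#include <linux/list.h>
#include <linux/spinlock.h>

/* One entry per invalidate_range_begin() without a matching end() yet. */
struct my_active_range {
	struct list_head list;
	unsigned long start, end;
};

static LIST_HEAD(my_active_ranges);
static DEFINE_SPINLOCK(my_range_lock);

/* Fault path: may we establish an external reference covering addr? */
static int my_range_blocked(unsigned long addr)
{
	struct my_active_range *r;
	int blocked = 0;

	spin_lock(&my_range_lock);
	list_for_each_entry(r, &my_active_ranges, list)
		if (addr >= r->start && addr < r->end) {
			blocked = 1;
			break;
		}
	spin_unlock(&my_range_lock);
	return blocked;
}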

Thanks,
Robin

2008-02-03 01:33:37

by Andrea Arcangeli

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Thu, Jan 31, 2008 at 07:58:40PM -0800, Christoph Lameter wrote:
> Ok. Andrea wanted the same because then he can avoid the begin callouts.

Exactly. I hope the page-pin will avoid me having to serialize the KVM
page fault against the start/end critical section.

BTW, I wonder if the start/end critical section API is intended to
forbid scheduling inside it. In short, I wonder if GRU is allowed
to take a spinlock in _range_start as the last thing before returning,
and to release that same spinlock in _range_end as the first thing,
rather than being forced to use a mutex.

2008-02-04 19:14:09

by Christoph Lameter

Subject: Re: [patch 1/3] mmu_notifier: Core code

On Sun, 3 Feb 2008, Andrea Arcangeli wrote:

> On Thu, Jan 31, 2008 at 07:58:40PM -0800, Christoph Lameter wrote:
> > Ok. Andrea wanted the same because then he can avoid the begin callouts.
>
> Exactly. I hope the page-pin will avoid me having to serialize the KVM
> page fault against the start/end critical section.
>
> BTW, I wonder if the start/end critical section API is intended to
> forbid scheduling inside it. In short, I wonder if GRU is allowed
> to take a spinlock in _range_start as last thing before returning, and
> to release that same spinlock in _range_end as first thing, and not to
> be forced to use a mutex.

_begin/_end enclose code that may sleep, and _begin/_end itself may sleep.
So a semaphore would work, but not a spinlock.
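
In sketch form (ignoring the atomic == 1 case for brevity; my_range_sem
and the callbacks here are hypothetical, not GRU code):

#include <linux/rwsem.h>

static DECLARE_RWSEM(my_range_sem);

static void my_invalidate_range_begin(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end,
				      int atomic)
{
	/* A sleeping lock is legal here since _begin itself may sleep,
	 * and it can span the whole begin/end critical section, which
	 * a spinlock could not. */
	down_write(&my_range_sem);
	/* ... drop external references for [start, end) ... */
}

static void my_invalidate_range_end(struct mmu_notifier *mn,
				    struct mm_struct *mm, int atomic)
{
	up_write(&my_range_sem);
}

The driver fault path would then take down_read(&my_range_sem) around
establishing a new external reference.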