2014-11-03 20:46:41

by Jerome Glisse

Subject: HMM (heterogeneous memory management) v5

Andrew, I have received no feedback since the last time I sent this patchset, so I
would really like to have it merged for the next kernel. While right now there
is no kernel driver that leverages this code, the hardware is coming and we
still have a long way to go before we have all the features needed. Right now
further work is blocked on getting this core code merged.

(Note that patch 5, the dummy driver, is included as a reference and should not
be merged unless you want me to grow it into some testing infrastructure. I
only include it here so people can have a look at how HMM is supposed to be
used.)


What is it ?

In a nutshell, HMM is a subsystem that provides an easy-to-use API to mirror a
process address space on a device with minimal hardware requirements (mainly
device page faults and read-only page mapping). It does not rely on the ATS and
PASID PCIe extensions. It intends to supersede those extensions by allowing
system memory to be moved to device memory in a fashion that is transparent to
core kernel mm code (ie a CPU page fault on a page residing in device memory
triggers migration back to system memory).
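
To make the usage pattern concrete, here is a minimal sketch of the driver side
(the structure names and the fault callback signature are the ones used by the
dummy driver in patch 5; everything else, including my_mirror_fault itself, is
illustrative only):

static int my_mirror_fault(struct hmm_mirror *mirror,
                           struct hmm_event *event,
                           const struct hmm_range *range)
{
        /* Called by HMM when the device page faults inside the mirrored
         * address range:
         *   1. ask HMM to fault/populate the CPU page table for the range,
         *   2. copy the resulting pfns into the device page table,
         *   3. map read-only unless the event asks for write access.
         */
        return 0;
}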


Why do this ?

We want to be able to mirror a process address space so that compute APIs such
as OpenCL or other similar APIs can use the exact same address space on the GPU
as on the CPU. This will greatly simplify the use of those APIs. Moreover, we
believe we will see more and more specialized functional units that will want
to mirror a process address space using their own MMU.

The migration side exists simply because GPU memory bandwidth is far beyond
system memory bandwidth, and there is no sign that this gap is closing (quite
the opposite).


Current status and future features :

None of this core code changes core kernel mm code in any major way. This
is simple groundwork with no impact on existing code paths. Features that
will be implemented on top of this are :
1 - Transparently handle page mapping on behalf of device drivers (DMA).
2 - Improve the DMA API to better match the new usage pattern of HMM.
3 - Migration of anonymous memory to device memory.
4 - Locking memory to remote memory (CPU access triggers SIGBUS).
5 - Access exclusion between CPU and device for atomic operations.
6 - Migration of file-backed memory to device memory.


How future features will be implemented :
1 - Simply use the existing DMA API to map pages on behalf of a device.
2 - Introduce a new DMA API to match the new semantics of HMM. It is no longer
    pages we map but address ranges, and which page effectively backs an
    address should be easy to update. I gave a presentation about this during
    this year's LPC.
3 - Requires changes to the CPU page fault code path to handle migration back
    to system memory on CPU access. An implementation of this was already sent
    as part of v1. This will be low impact and only adds handling of a new
    special swap type to the existing fault code (see the sketch after this
    list).
4 - Requires a new syscall, as I can not see which current syscall would be
    appropriate for this. My first thought was to use mbind, as it has the
    right semantics (binding a range of addresses to a device), but mbind is
    too NUMA-centric.

    The second was madvise, but the semantics do not match: madvise allows the
    kernel to ignore the hint, while we do want to block CPU access for as long
    as the range is bound to a device.

    So I do not think any existing syscall can be extended with new flags,
    but maybe I am wrong.
5 - Allow mapping a page as read-only on the CPU while a device performs
    some atomic operation on it (this is mainly to work around system buses
    that do not support atomic memory access; sadly there is a large base
    of hardware without that feature).

    The easiest implementation would use a page flag, but there are none
    left. So it must be a flag in the vma to know whether there is a need to
    query HMM for write protection.

6 - This is the trickiest one to implement, and while I showed a proof of
    concept with v1, I still have a lot of conflicting feelings about how
    to achieve it.
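
To illustrate point 3 above, the fault path change boils down to recognizing
one more special swap type. The snippet below is my sketch, not the v1 code,
and is_hmm_entry()/hmm_mm_fault() are hypothetical helper names:

/* Sketch only: inside the pte fault path, a non-present pte may carry a
 * special swap entry marking memory that currently resides in device
 * memory; faulting on it migrates the page back before retrying. */
if (!pte_present(entry)) {
        swp_entry_t swp = pte_to_swp_entry(entry);

        if (is_hmm_entry(swp))          /* hypothetical helper */
                return hmm_mm_fault(mm, vma, address, entry);
}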


As usual, comments are more than welcome. Thanks in advance to anyone who
takes a look at this code.

Previous patchset posting :
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559 (cover letter did not make it to ml)
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423

Cheers,
Jérôme

To: "Andrew Morton" <[email protected]>,
Cc: <[email protected]>,
Cc: linux-mm <[email protected]>,
Cc: <[email protected]>,
Cc: "Linus Torvalds" <[email protected]>,
Cc: "Mel Gorman" <[email protected]>,
Cc: "H. Peter Anvin" <[email protected]>,
Cc: "Peter Zijlstra" <[email protected]>,
Cc: "Linda Wang" <[email protected]>,
Cc: "Kevin E Martin" <[email protected]>,
Cc: "Jerome Glisse" <[email protected]>,
Cc: "Andrea Arcangeli" <[email protected]>,
Cc: "Johannes Weiner" <[email protected]>,
Cc: "Larry Woodman" <[email protected]>,
Cc: "Rik van Riel" <[email protected]>,
Cc: "Dave Airlie" <[email protected]>,
Cc: "Jeff Law" <[email protected]>,
Cc: "Brendan Conoboy" <[email protected]>,
Cc: "Joe Donohue" <[email protected]>,
Cc: "Duncan Poole" <[email protected]>,
Cc: "Sherry Cheung" <[email protected]>,
Cc: "Subhash Gutti" <[email protected]>,
Cc: "John Hubbard" <[email protected]>,
Cc: "Mark Hairgrove" <[email protected]>,
Cc: "Lucien Dunning" <[email protected]>,
Cc: "Cameron Buschardt" <[email protected]>,
Cc: "Arvind Gopalakrishnan" <[email protected]>,
Cc: "Haggai Eran" <[email protected]>,
Cc: "Or Gerlitz" <[email protected]>,
Cc: "Sagi Grimberg" <[email protected]>
Cc: "Shachar Raindel" <[email protected]>,
Cc: "Liran Liss" <[email protected]>,
Cc: "Roland Dreier" <[email protected]>,
Cc: "Sander, Ben" <[email protected]>,
Cc: "Stoner, Greg" <[email protected]>,
Cc: "Bridgman, John" <[email protected]>,
Cc: "Mantor, Michael" <[email protected]>,
Cc: "Blinzer, Paul" <[email protected]>,
Cc: "Morichetti, Laurent" <[email protected]>,
Cc: "Deucher, Alexander" <[email protected]>,
Cc: "Gabbay, Oded" <[email protected]>,


2014-11-03 20:46:46

by Jerome Glisse

Subject: [PATCH 1/5] mmu_notifier: add event information to address invalidation v5

From: Jérôme Glisse <[email protected]>

The event information will be useful for new users of the mmu_notifier API.
The event argument differentiates between a vma disappearing, a page
being write protected, or simply a page being unmapped. This allows new
users to take different paths for different events: for instance, on unmap
the resources used to track a vma are still valid and should stay around,
while if the event says a vma is being destroyed, any resources used to
track that vma can be freed.
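
For illustration, a listener could use the new argument along these lines
(my_invalidate_range_start is a made-up callback name; the signature and
event names are the ones this patch introduces):

static void my_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start,
                                      unsigned long end,
                                      enum mmu_event event)
{
        switch (event) {
        case MMU_MUNMAP:
                /* The range goes away for good: trim the secondary page
                 * table and free whatever was tracking this range. */
                break;
        case MMU_WRITE_BACK:
        case MMU_WRITE_PROTECT:
                /* Only writes must stop: downgrade the secondary mappings
                 * to read-only but keep the tracking structures around. */
                break;
        default:
                /* MMU_MIGRATE and anything else: full invalidation. */
                break;
        }
}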

Changed since v1:
- renamed action into event (updated commit message too).
- simplified the event names and clarified their intended usage,
also documenting what expectations the listener can have with
respect to each event.

Changed since v2:
- Avoid crazy name.
- Do not move code that does not need to move.

Changed since v3:
- Separate huge page split from mlock/munlock and softdirty.

Changed since v4:
- Rebase (no other changes).

Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/gpu/drm/i915/i915_gem_userptr.c | 3 +-
drivers/iommu/amd_iommu_v2.c | 11 ++-
drivers/misc/sgi-gru/grutlbpurge.c | 9 ++-
drivers/xen/gntdev.c | 9 ++-
fs/proc/task_mmu.c | 6 +-
include/linux/mmu_notifier.h | 131 ++++++++++++++++++++++++++------
kernel/events/uprobes.c | 10 ++-
mm/filemap_xip.c | 2 +-
mm/huge_memory.c | 39 ++++++----
mm/hugetlb.c | 23 +++---
mm/ksm.c | 18 +++--
mm/memory.c | 27 ++++---
mm/migrate.c | 9 ++-
mm/mmu_notifier.c | 28 ++++---
mm/mprotect.c | 5 +-
mm/mremap.c | 6 +-
mm/rmap.c | 24 ++++--
virt/kvm/kvm_main.c | 12 ++-
18 files changed, 269 insertions(+), 103 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index d182058..20dbd26 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -129,7 +129,8 @@ restart:
static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
struct interval_tree_node *it = NULL;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 90d734b..57d2acf 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -413,14 +413,17 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,

static void mn_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
__mn_flush_page(mn, address);
}

static void mn_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct pasid_state *pasid_state;
struct device_state *dev_state;
@@ -441,7 +444,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,

static void mn_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct pasid_state *pasid_state;
struct device_state *dev_state;
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
*/
static void gru_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,

static void gru_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm, unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
}

static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 073b4a1..fe9da94 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,

static void mn_invl_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
struct grant_map *map;
@@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,

static void mn_invl_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
- mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+ mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
}

static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 735b389..c884143 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -842,7 +842,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
downgrade_write(&mm->mmap_sem);
break;
}
- mmu_notifier_invalidate_range_start(mm, 0, -1);
+ mmu_notifier_invalidate_range_start(mm, 0,
+ -1, MMU_ISDIRTY);
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
cp.vma = vma;
@@ -867,7 +868,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
&clear_refs_walk);
}
if (type == CLEAR_REFS_SOFT_DIRTY)
- mmu_notifier_invalidate_range_end(mm, 0, -1);
+ mmu_notifier_invalidate_range_end(mm, 0,
+ -1, MMU_ISDIRTY);
flush_tlb_mm(mm);
up_read(&mm->mmap_sem);
mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 94d19f6..d36de82 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,66 @@
struct mmu_notifier;
struct mmu_notifier_ops;

+/* MMU Events report fine-grained information to the callback routine, allowing
+ * the event listener to make a more informed decision as to what action to
+ * take. The event types are:
+ *
+ * - MMU_HSPLIT huge page split, the memory is the same only the page table
+ * structure is updated (level added or removed).
+ *
+ * - MMU_ISDIRTY need to update the dirty bit of the page table so proper
+ * dirty accounting can happen.
+ *
+ * - MMU_MIGRATE: memory is migrating from one page to another, thus all write
+ * access must stop after invalidate_range_start callback returns.
+ * Furthermore, no read access should be allowed either, as a new page can
+ * be remapped with write access before the invalidate_range_end callback
+ * happens and thus any read access to old page might read stale data. There
+ * are several sources for this event, including:
+ *
+ * - A page moving to swap (various reasons, including page reclaim),
+ * - An mremap syscall,
+ * - migration for NUMA reasons,
+ * - balancing the memory pool,
+ * - write fault on COW page,
+ * - and more that are not listed here.
+ *
+ * - MMU_MPROT: memory access protection is changing. Refer to the vma to get
+ * the new access protection. All memory access are still valid until the
+ * invalidate_range_end callback.
+ *
+ * - MMU_MUNLOCK: unlock memory. Content of page table stays the same but
+ * page are unlocked.
+ *
+ * - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
+ * process destruction). However, access is still allowed, up until the
+ * invalidate_range_free_pages callback. This also implies that secondary
+ * page table can be trimmed, because the address range is no longer valid.
+ *
+ * - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
+ * must stop after invalidate_range_start callback returns. Read access are
+ * still allowed.
+ *
+ * - MMU_WRITE_PROTECT: memory is being write protected (ie should be mapped
+ * read only no matter what the vma memory protection allows). All write
+ * accesses must stop after invalidate_range_start callback returns. Read
+ * access are still allowed.
+ *
+ * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
+ * because it will always lead to reasonable behavior, but will not allow the
+ * listener a chance to optimize its events.
+ */
+enum mmu_event {
+ MMU_HSPLIT = 0,
+ MMU_ISDIRTY,
+ MMU_MIGRATE,
+ MMU_MPROT,
+ MMU_MUNLOCK,
+ MMU_MUNMAP,
+ MMU_WRITE_BACK,
+ MMU_WRITE_PROTECT,
+};
+
#ifdef CONFIG_MMU_NOTIFIER

/*
@@ -82,7 +142,8 @@ struct mmu_notifier_ops {
void (*change_pte)(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
- pte_t pte);
+ pte_t pte,
+ enum mmu_event event);

/*
* Before this is invoked any secondary MMU is still ok to
@@ -93,7 +154,8 @@ struct mmu_notifier_ops {
*/
void (*invalidate_page)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address);
+ unsigned long address,
+ enum mmu_event event);

/*
* invalidate_range_start() and invalidate_range_end() must be
@@ -140,10 +202,14 @@ struct mmu_notifier_ops {
*/
void (*invalidate_range_start)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
void (*invalidate_range_end)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);

/*
* invalidate_range() is either called between
@@ -206,13 +272,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte);
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address);
+ unsigned long address,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end);

@@ -240,31 +313,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
}

static inline void mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte)
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_change_pte(mm, address, pte);
+ __mmu_notifier_change_pte(mm, address, pte, event);
}

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_page(mm, address);
+ __mmu_notifier_invalidate_page(mm, address, event);
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, start, end);
+ __mmu_notifier_invalidate_range_start(mm, start, end, event);
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_end(mm, start, end);
+ __mmu_notifier_invalidate_range_end(mm, start, end, event);
}

static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -359,13 +439,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
* old page would remain mapped readonly in the secondary MMUs after the new
* page is already writable by some CPU through the primary MMU.
*/
-#define set_pte_at_notify(__mm, __address, __ptep, __pte) \
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event) \
({ \
struct mm_struct *___mm = __mm; \
unsigned long ___address = __address; \
pte_t ___pte = __pte; \
\
- mmu_notifier_change_pte(___mm, ___address, ___pte); \
+ mmu_notifier_change_pte(___mm, ___address, ___pte, __event); \
set_pte_at(___mm, ___address, __ptep, ___pte); \
})

@@ -393,22 +473,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
}

static inline void mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte)
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
}

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
}

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bc143cf..eacdf1b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -176,7 +176,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
/* For try_to_free_swap() and munlock_vma_page() below */
lock_page(page);

- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
err = -EAGAIN;
ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -194,7 +195,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush_notify(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+ set_pte_at_notify(mm, addr, ptep,
+ mk_pte(kpage, vma->vm_page_prot),
+ MMU_MIGRATE);

page_remove_rmap(page);
if (!page_mapped(page))
@@ -208,7 +211,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
err = 0;
unlock:
mem_cgroup_cancel_charge(kpage, memcg);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
unlock_page(page);
return err;
}
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..a2b3f09 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,7 @@ retry:
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
/* must invalidate_page _before_ freeing the page */
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
page_cache_release(page);
}
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c44c8cc..f61b4ac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1033,7 +1033,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,

mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);

ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1067,7 +1068,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
page_remove_rmap(page);
spin_unlock(ptl);

- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1077,7 +1079,8 @@ out:

out_free_pages:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1169,7 +1172,8 @@ alloc:

mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);

spin_lock(ptl);
if (page)
@@ -1201,7 +1205,8 @@ alloc:
}
spin_unlock(ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
out:
return ret;
out_unlock:
@@ -1637,7 +1642,8 @@ static int __split_huge_page_splitting(struct page *page,
const unsigned long mmun_start = address;
const unsigned long mmun_end = address + HPAGE_PMD_SIZE;

- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_HSPLIT);
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
if (pmd) {
@@ -1653,7 +1659,8 @@ static int __split_huge_page_splitting(struct page *page,
ret = 1;
spin_unlock(ptl);
}
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_HSPLIT);

return ret;
}
@@ -2474,7 +2481,8 @@ static void collapse_huge_page(struct mm_struct *mm,

mmun_start = address;
mmun_end = address + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* After this gup_fast can't run anymore. This also removes
@@ -2484,7 +2492,8 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
_pmd = pmdp_clear_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2875,24 +2884,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
again:
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
return;
}
if (is_huge_zero_pmd(*pmd)) {
__split_huge_zero_page_pmd(vma, haddr, pmd);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
get_page(page);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

split_huge_page(page);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f0cca62..a9418d6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2560,7 +2560,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
mmun_start = vma->vm_start;
mmun_end = vma->vm_end;
if (cow)
- mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(src, mmun_start,
+ mmun_end, MMU_MIGRATE);

for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
@@ -2614,7 +2615,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}

if (cow)
- mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(src, mmun_start,
+ mmun_end, MMU_MIGRATE);

return ret;
}
@@ -2640,7 +2642,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
BUG_ON(end & ~huge_page_mask(h));

tlb_start_vma(tlb, vma);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
again:
for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
@@ -2711,7 +2714,8 @@ unlock:
if (address < end && !ref_page)
goto again;
}
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
tlb_end_vma(tlb, vma);
}

@@ -2889,8 +2893,8 @@ retry_avoidcopy:

mmun_start = address & huge_page_mask(h);
mmun_end = mmun_start + huge_page_size(h);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
/*
* Retake the page table lock to check for racing updates
* before the page tables are altered
@@ -2911,7 +2915,8 @@ retry_avoidcopy:
new_page = old_page;
}
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
out_release_all:
page_cache_release(new_page);
out_release_old:
@@ -3349,7 +3354,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
BUG_ON(address >= end);
flush_cache_range(vma, address, end);

- mmu_notifier_invalidate_range_start(mm, start, end);
+ mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
for (; address < end; address += huge_page_size(h)) {
spinlock_t *ptl;
@@ -3380,7 +3385,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_tlb_range(vma, start, end);
mmu_notifier_invalidate_range(mm, start, end);
mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);

return pages << h->order;
}
diff --git a/mm/ksm.c b/mm/ksm.c
index d247efa..8c3a892 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,

mmun_start = addr;
mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_WRITE_PROTECT);

ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -904,7 +905,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
if (pte_dirty(entry))
set_page_dirty(page);
entry = pte_mkclean(pte_wrprotect(entry));
- set_pte_at_notify(mm, addr, ptep, entry);
+ set_pte_at_notify(mm, addr, ptep, entry, MMU_WRITE_PROTECT);
}
*orig_pte = *ptep;
err = 0;
@@ -912,7 +913,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
out_unlock:
pte_unmap_unlock(ptep, ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_WRITE_PROTECT);
out:
return err;
}
@@ -948,7 +950,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,

mmun_start = addr;
mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);

ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte_same(*ptep, orig_pte)) {
@@ -961,7 +964,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,

flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush_notify(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+ set_pte_at_notify(mm, addr, ptep,
+ mk_pte(kpage, vma->vm_page_prot),
+ MMU_MIGRATE);

page_remove_rmap(page);
if (!page_mapped(page))
@@ -971,7 +976,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
pte_unmap_unlock(ptep, ptl);
err = 0;
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
out:
return err;
}
diff --git a/mm/memory.c b/mm/memory.c
index f61d341..64c3cde 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1056,7 +1056,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
mmun_end = end;
if (is_cow)
mmu_notifier_invalidate_range_start(src_mm, mmun_start,
- mmun_end);
+ mmun_end, MMU_MIGRATE);

ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
@@ -1073,7 +1073,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
} while (dst_pgd++, src_pgd++, addr = next, addr != end);

if (is_cow)
- mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
return ret;
}

@@ -1378,10 +1379,12 @@ void unmap_vmas(struct mmu_gather *tlb,
{
struct mm_struct *mm = vma->vm_mm;

- mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+ mmu_notifier_invalidate_range_start(mm, start_addr,
+ end_addr, MMU_MUNMAP);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
- mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+ mmu_notifier_invalidate_range_end(mm, start_addr,
+ end_addr, MMU_MUNMAP);
}

/**
@@ -1403,10 +1406,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start, end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, start, end);
+ mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
unmap_single_vma(&tlb, vma, start, end, details);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
tlb_finish_mmu(&tlb, start, end);
}

@@ -1429,9 +1432,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
lru_add_drain();
tlb_gather_mmu(&tlb, mm, address, end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, address, end);
+ mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
unmap_single_vma(&tlb, vma, address, end, details);
- mmu_notifier_invalidate_range_end(mm, address, end);
+ mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
tlb_finish_mmu(&tlb, address, end);
}

@@ -2216,7 +2219,8 @@ gotten:

mmun_start = address & PAGE_MASK;
mmun_end = mmun_start + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

/*
* Re-check the pte - we dropped the lock
@@ -2248,7 +2252,7 @@ gotten:
* mmu page tables (such as kvm shadow page tables), we want the
* new page to be mapped directly into the secondary page table.
*/
- set_pte_at_notify(mm, address, page_table, entry);
+ set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
update_mmu_cache(vma, address, page_table);
if (old_page) {
/*
@@ -2287,7 +2291,8 @@ gotten:
unlock:
pte_unmap_unlock(page_table, ptl);
if (mmun_end > mmun_start)
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
if (old_page) {
/*
* Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 41945cb..b5279b8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1814,12 +1814,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
WARN_ON(PageLRU(new_page));

/* Recheck the target PMD */
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
fail_putback:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

/* Reverse changes made by migrate_page_copy() */
if (TestClearPageActive(new_page))
@@ -1873,7 +1875,8 @@ fail_putback:
page_remove_rmap(page);

spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

/* Take an "isolate" reference and put new page on the LRU. */
get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 3b9b3d0..e51ea02 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -142,8 +142,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
return young;
}

-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
- pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -151,13 +153,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->change_pte)
- mn->ops->change_pte(mn, mm, address, pte);
+ mn->ops->change_pte(mn, mm, address, pte, event);
}
srcu_read_unlock(&srcu, id);
}

void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -165,13 +168,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_page)
- mn->ops->invalidate_page(mn, mm, address);
+ mn->ops->invalidate_page(mn, mm, address, event);
}
srcu_read_unlock(&srcu, id);
}

void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
+
{
struct mmu_notifier *mn;
int id;
@@ -179,14 +185,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_range_start)
- mn->ops->invalidate_range_start(mn, mm, start, end);
+ mn->ops->invalidate_range_start(mn, mm, start,
+ end, event);
}
srcu_read_unlock(&srcu, id);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -204,7 +213,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
if (mn->ops->invalidate_range)
mn->ops->invalidate_range(mn, mm, start, end);
if (mn->ops->invalidate_range_end)
- mn->ops->invalidate_range_end(mn, mm, start, end);
+ mn->ops->invalidate_range_end(mn, mm, start,
+ end, event);
}
srcu_read_unlock(&srcu, id);
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ace9345..2302721 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -152,7 +152,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
/* invoke the mmu notifier if the pmd is populated */
if (!mni_start) {
mni_start = addr;
- mmu_notifier_invalidate_range_start(mm, mni_start, end);
+ mmu_notifier_invalidate_range_start(mm, mni_start,
+ end, MMU_MPROT);
}

if (pmd_trans_huge(*pmd)) {
@@ -180,7 +181,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
} while (pmd++, addr = next, addr != end);

if (mni_start)
- mmu_notifier_invalidate_range_end(mm, mni_start, end);
+ mmu_notifier_invalidate_range_end(mm, mni_start, end, MMU_MPROT);

if (nr_huge_updates)
count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 1e35ba66..a39f2aa 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,

mmun_start = old_addr;
mmun_end = old_end;
- mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
@@ -229,7 +230,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (likely(need_flush))
flush_tlb_range(vma, old_end-len, old_addr);

- mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

return len + old_addr - old_end; /* how much done */
}
diff --git a/mm/rmap.c b/mm/rmap.c
index d3eb1e0..5fd9ece 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);

if (ret) {
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
(*cleaned)++;
}
out:
@@ -1142,6 +1142,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
spinlock_t *ptl;
int ret = SWAP_AGAIN;
enum ttu_flags flags = (enum ttu_flags)arg;
+ enum mmu_event event = MMU_MIGRATE;
+
+ if (flags & TTU_MUNLOCK)
+ event = MMU_MUNLOCK;

pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
@@ -1247,7 +1251,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
out_unmap:
pte_unmap_unlock(pte, ptl);
if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, event);
out:
return ret;

@@ -1301,7 +1305,9 @@ out_mlock:
#define CLUSTER_MASK (~(CLUSTER_SIZE - 1))

static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
- struct vm_area_struct *vma, struct page *check_page)
+ struct vm_area_struct *vma,
+ struct page *check_page,
+ enum ttu_flags flags)
{
struct mm_struct *mm = vma->vm_mm;
pmd_t *pmd;
@@ -1315,6 +1321,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
unsigned long end;
int ret = SWAP_AGAIN;
int locked_vma = 0;
+ enum mmu_event event = MMU_MIGRATE;
+
+ if (flags & TTU_MUNLOCK)
+ event = MMU_MUNLOCK;

address = (vma->vm_start + cursor) & CLUSTER_MASK;
end = address + CLUSTER_SIZE;
@@ -1329,7 +1339,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,

mmun_start = address;
mmun_end = end;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);

/*
* If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1398,7 +1408,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
if (locked_vma)
up_read(&vma->vm_mm->mmap_sem);
return ret;
@@ -1454,7 +1464,9 @@ static int try_to_unmap_nonlinear(struct page *page,
while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
if (try_to_unmap_cluster(cursor, &mapcount,
- vma, page) == SWAP_MLOCK)
+ vma, page,
+ (enum ttu_flags)arg)
+ == SWAP_MLOCK)
ret = SWAP_MLOCK;
cursor += CLUSTER_SIZE;
vma->vm_private_data = (void *) cursor;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25ffac9..8afea97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -263,7 +263,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)

static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush, idx;
@@ -305,7 +306,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
- pte_t pte)
+ pte_t pte,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int idx;
@@ -321,7 +323,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
@@ -347,7 +350,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);

--
1.9.3

2014-11-03 20:46:59

by Jerome Glisse

Subject: [PATCH 5/5] hmm/dummy: dummy driver to showcase the hmm api v3

From: Jérôme Glisse <[email protected]>

This is a dummy driver which fulfills two purposes :
- showcase the hmm api and give a reference on how to use it.
- provide an extensive user space api to stress test hmm.

This is a particularly dangerous module as it allows access to a mirror of a
process address space through its device file. Hence it should not be enabled
by default, and only people actively developing for hmm should use it.
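
As a rough idea of the kind of user space check this enables (the device node
name, the registration step, and the convention that the file offset is the
mirrored address are all assumptions on my part, not part of this patch):

/* Hypothetical test: write a pattern into our own memory, then read it
 * back through the dummy driver's mirror of this process address space
 * and verify that both views match. */
int fd = open("/dev/hmm_dummy_device0", O_RDWR);      /* node name assumed */
char pattern[4096], check[4096];

memset(pattern, 0xab, sizeof(pattern));
/* ... register this process with the dummy mirror (ioctl not shown) ... */
pread(fd, check, sizeof(check), (off_t)(uintptr_t)pattern);
assert(memcmp(pattern, check, sizeof(check)) == 0);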

Changed since v1:
- Fixed all checkpatch.pl issues (ignoring some over-80-character lines).

Changed since v2:
- Rebased and adapted to latest changes.

Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/char/Kconfig | 9 +
drivers/char/Makefile | 1 +
drivers/char/hmm_dummy.c | 1151 ++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/hmm_dummy.h | 30 ++
4 files changed, 1191 insertions(+)
create mode 100644 drivers/char/hmm_dummy.c
create mode 100644 include/uapi/linux/hmm_dummy.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index efefd12..7574e92 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -600,6 +600,15 @@ config TILE_SROM
device appear much like a simple EEPROM, and knows
how to partition a single ROM for multiple purposes.

+config HMM_DUMMY
+ tristate "hmm dummy driver to test hmm."
+ depends on HMM
+ default n
+ help
+ Say Y here if you want to build the hmm dummy driver that allow you
+ to test the hmm infrastructure by mapping a process address space
+ in hmm dummy driver device file. When in doubt, say "N".
+
source "drivers/char/xillybus/Kconfig"

endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index d06cde26..eff0543 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -62,3 +62,4 @@ js-rtc-y = rtc.o

obj-$(CONFIG_TILE_SROM) += tile-srom.o
obj-$(CONFIG_XILLYBUS) += xillybus/
+obj-$(CONFIG_HMM_DUMMY) += hmm_dummy.o
diff --git a/drivers/char/hmm_dummy.c b/drivers/char/hmm_dummy.c
new file mode 100644
index 0000000..89a9112
--- /dev/null
+++ b/drivers/char/hmm_dummy.c
@@ -0,0 +1,1151 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* This is a dummy driver made to exercise the HMM (heterogeneous memory
+ * management) API of the kernel. It allows a userspace program to map its
+ * whole address space through the hmm dummy driver file.
+ *
+ * In here, mirror addresses are addresses in the process address space that
+ * is being mirrored, while virtual addresses are the addresses in the current
+ * process that has the hmm dummy dev file mapped (address of the file
+ * mapping).
+ *
+ * You must be careful not to mix one with the other.
+ */
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/delay.h>
+#include <linux/hmm.h>
+
+#include <uapi/linux/hmm_dummy.h>
+
+#define HMM_DUMMY_DEVICE_NAME "hmm_dummy_device"
+#define HMM_DUMMY_MAX_DEVICES 4
+
+struct hmm_dummy_device;
+
+struct hmm_dummy_mirror {
+ struct kref kref;
+ struct file *filp;
+ struct hmm_dummy_device *ddevice;
+ struct hmm_mirror mirror;
+ unsigned minor;
+ pid_t pid;
+ struct mm_struct *mm;
+ unsigned long *pgdp;
+ struct mutex mutex;
+ bool stop;
+};
+
+struct hmm_dummy_device {
+ struct cdev cdev;
+ struct hmm_device device;
+ dev_t dev;
+ int major;
+ struct mutex mutex;
+ char name[32];
+ /* device file mapping tracking (keep track of all vma) */
+ struct hmm_dummy_mirror *dmirrors[HMM_DUMMY_MAX_DEVICES];
+ struct address_space *fmapping[HMM_DUMMY_MAX_DEVICES];
+};
+
+/* We only create 2 device to show the inter device rmem sharing/migration
+ * capabilities.
+ */
+static struct hmm_dummy_device ddevices[2];
+
+
+/* hmm_dummy_pt - dummy page table, the dummy device fake its own page table.
+ *
+ * Helper function to manage the dummy device page table.
+ */
+#define HMM_DUMMY_PTE_VALID (1UL << 0UL)
+#define HMM_DUMMY_PTE_READ (1UL << 1UL)
+#define HMM_DUMMY_PTE_WRITE (1UL << 2UL)
+#define HMM_DUMMY_PTE_DIRTY (1UL << 3UL)
+#define HMM_DUMMY_PFN_SHIFT (PAGE_SHIFT)
+
+#define ARCH_PAGE_SIZE ((unsigned long)PAGE_SIZE)
+#define ARCH_PAGE_SHIFT ((unsigned long)PAGE_SHIFT)
+
+#define HMM_DUMMY_PTRS_PER_LEVEL (ARCH_PAGE_SIZE / sizeof(long))
+#ifdef CONFIG_64BIT
+#define HMM_DUMMY_BITS_PER_LEVEL (ARCH_PAGE_SHIFT - 3UL)
+#else
+#define HMM_DUMMY_BITS_PER_LEVEL (ARCH_PAGE_SHIFT - 2UL)
+#endif
+#define HMM_DUMMY_PLD_SHIFT (ARCH_PAGE_SHIFT)
+#define HMM_DUMMY_PMD_SHIFT (HMM_DUMMY_PLD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PUD_SHIFT (HMM_DUMMY_PMD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PGD_SHIFT (HMM_DUMMY_PUD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PGD_NPTRS (1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PMD_NPTRS (1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PUD_NPTRS (1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PLD_NPTRS (1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PLD_SIZE (1UL << (HMM_DUMMY_PLD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PMD_SIZE (1UL << (HMM_DUMMY_PMD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PUD_SIZE (1UL << (HMM_DUMMY_PUD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PGD_SIZE (1UL << (HMM_DUMMY_PGD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PLD_MASK (~(HMM_DUMMY_PLD_SIZE - 1UL))
+#define HMM_DUMMY_PMD_MASK (~(HMM_DUMMY_PMD_SIZE - 1UL))
+#define HMM_DUMMY_PUD_MASK (~(HMM_DUMMY_PUD_SIZE - 1UL))
+#define HMM_DUMMY_PGD_MASK (~(HMM_DUMMY_PGD_SIZE - 1UL))
+#define HMM_DUMMY_MAX_ADDR (1UL << (HMM_DUMMY_PGD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+
+static inline unsigned long hmm_dummy_pld_index(unsigned long addr)
+{
+ return (addr >> HMM_DUMMY_PLD_SHIFT) & (HMM_DUMMY_PLD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pmd_index(unsigned long addr)
+{
+ return (addr >> HMM_DUMMY_PMD_SHIFT) & (HMM_DUMMY_PMD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pud_index(unsigned long addr)
+{
+ return (addr >> HMM_DUMMY_PUD_SHIFT) & (HMM_DUMMY_PUD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pgd_index(unsigned long addr)
+{
+ return (addr >> HMM_DUMMY_PGD_SHIFT) & (HMM_DUMMY_PGD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pld_base(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PLD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pmd_base(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PMD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pud_base(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PUD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pgd_base(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PGD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pld_next(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PLD_MASK) + HMM_DUMMY_PLD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pmd_next(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PMD_MASK) + HMM_DUMMY_PMD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pud_next(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PUD_MASK) + HMM_DUMMY_PUD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pgd_next(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PGD_MASK) + HMM_DUMMY_PGD_SIZE;
+}
+
+static inline struct page *hmm_dummy_pte_to_page(unsigned long pte)
+{
+ if (!(pte & HMM_DUMMY_PTE_VALID))
+ return NULL;
+ return pfn_to_page((pte >> HMM_DUMMY_PFN_SHIFT));
+}
+
+struct hmm_dummy_pt_map {
+ struct hmm_dummy_mirror *dmirror;
+ struct page *pud_page;
+ struct page *pmd_page;
+ struct page *pld_page;
+ unsigned long pgd_idx;
+ unsigned long pud_idx;
+ unsigned long pmd_idx;
+ unsigned long *pudp;
+ unsigned long *pmdp;
+ unsigned long *pldp;
+};
+
+static inline unsigned long *hmm_dummy_pt_pud_map(struct hmm_dummy_pt_map *pt_map,
+ unsigned long addr)
+{
+ struct hmm_dummy_mirror *dmirror = pt_map->dmirror;
+ unsigned long *pdep;
+
+ if (!dmirror->pgdp)
+ return NULL;
+
+ if (!pt_map->pud_page || pt_map->pgd_idx != hmm_dummy_pgd_index(addr)) {
+ if (pt_map->pud_page) {
+ kunmap(pt_map->pud_page);
+ pt_map->pud_page = NULL;
+ pt_map->pudp = NULL;
+ }
+ pt_map->pgd_idx = hmm_dummy_pgd_index(addr);
+ pdep = &dmirror->pgdp[pt_map->pgd_idx];
+ if (!((*pdep) & HMM_DUMMY_PTE_VALID))
+ return NULL;
+ pt_map->pud_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+ pt_map->pudp = kmap(pt_map->pud_page);
+ }
+ return pt_map->pudp;
+}
+
+static inline unsigned long *hmm_dummy_pt_pmd_map(struct hmm_dummy_pt_map *pt_map,
+ unsigned long addr)
+{
+ unsigned long *pdep;
+
+ if (!hmm_dummy_pt_pud_map(pt_map, addr))
+ return NULL;
+
+ if (!pt_map->pmd_page || pt_map->pud_idx != hmm_dummy_pud_index(addr)) {
+ if (pt_map->pmd_page) {
+ kunmap(pt_map->pmd_page);
+ pt_map->pmd_page = NULL;
+ pt_map->pmdp = NULL;
+ }
+ pt_map->pud_idx = hmm_dummy_pud_index(addr);
+ pdep = &pt_map->pudp[pt_map->pud_idx];
+ if (!((*pdep) & HMM_DUMMY_PTE_VALID))
+ return NULL;
+ pt_map->pmd_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+ pt_map->pmdp = kmap(pt_map->pmd_page);
+ }
+ return pt_map->pmdp;
+}
+
+static inline unsigned long *hmm_dummy_pt_pld_map(struct hmm_dummy_pt_map *pt_map,
+ unsigned long addr)
+{
+ unsigned long *pdep;
+
+ if (!hmm_dummy_pt_pmd_map(pt_map, addr))
+ return NULL;
+
+ if (!pt_map->pld_page || pt_map->pmd_idx != hmm_dummy_pmd_index(addr)) {
+ if (pt_map->pld_page) {
+ kunmap(pt_map->pld_page);
+ pt_map->pld_page = NULL;
+ pt_map->pldp = NULL;
+ }
+ pt_map->pmd_idx = hmm_dummy_pmd_index(addr);
+ pdep = &pt_map->pmdp[pt_map->pmd_idx];
+ if (!((*pdep) & HMM_DUMMY_PTE_VALID))
+ return NULL;
+ pt_map->pld_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+ pt_map->pldp = kmap(pt_map->pld_page);
+ }
+ return pt_map->pldp;
+}
+
+static inline void hmm_dummy_pt_pld_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+ if (pt_map->pld_page) {
+ kunmap(pt_map->pld_page);
+ pt_map->pld_page = NULL;
+ pt_map->pldp = NULL;
+ }
+}
+
+static inline void hmm_dummy_pt_pmd_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+ hmm_dummy_pt_pld_unmap(pt_map);
+ if (pt_map->pmd_page) {
+ kunmap(pt_map->pmd_page);
+ pt_map->pmd_page = NULL;
+ pt_map->pmdp = NULL;
+ }
+}
+
+static inline void hmm_dummy_pt_pud_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+ hmm_dummy_pt_pmd_unmap(pt_map);
+ if (pt_map->pud_page) {
+ kunmap(pt_map->pud_page);
+ pt_map->pud_page = NULL;
+ pt_map->pudp = NULL;
+ }
+}
+
+static inline void hmm_dummy_pt_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+ hmm_dummy_pt_pud_unmap(pt_map);
+}
+
+static int hmm_dummy_pt_alloc(struct hmm_dummy_mirror *dmirror,
+ unsigned long start,
+ unsigned long end)
+{
+ unsigned long *pgdp, *pudp, *pmdp;
+
+ if (dmirror->stop)
+ return -EINVAL;
+
+ if (dmirror->pgdp == NULL) {
+ dmirror->pgdp = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ if (dmirror->pgdp == NULL)
+ return -ENOMEM;
+ }
+
+ for (; start < end; start = hmm_dummy_pld_next(start)) {
+ struct page *pud_page, *pmd_page;
+
+ pgdp = &dmirror->pgdp[hmm_dummy_pgd_index(start)];
+ if (!((*pgdp) & HMM_DUMMY_PTE_VALID)) {
+ pud_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!pud_page)
+ return -ENOMEM;
+ *pgdp = (page_to_pfn(pud_page)<<HMM_DUMMY_PFN_SHIFT);
+ *pgdp |= HMM_DUMMY_PTE_VALID;
+ }
+
+ pud_page = pfn_to_page((*pgdp) >> HMM_DUMMY_PFN_SHIFT);
+ pudp = kmap(pud_page);
+ pudp = &pudp[hmm_dummy_pud_index(start)];
+ if (!((*pudp) & HMM_DUMMY_PTE_VALID)) {
+ pmd_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!pmd_page) {
+ kunmap(pud_page);
+ return -ENOMEM;
+ }
+ *pudp = (page_to_pfn(pmd_page)<<HMM_DUMMY_PFN_SHIFT);
+ *pudp |= HMM_DUMMY_PTE_VALID;
+ }
+
+ pmd_page = pfn_to_page((*pudp) >> HMM_DUMMY_PFN_SHIFT);
+ pmdp = kmap(pmd_page);
+ pmdp = &pmdp[hmm_dummy_pmd_index(start)];
+ if (!((*pmdp) & HMM_DUMMY_PTE_VALID)) {
+ struct page *page;
+
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page) {
+ kunmap(pmd_page);
+ kunmap(pud_page);
+ return -ENOMEM;
+ }
+ *pmdp = (page_to_pfn(page) << HMM_DUMMY_PFN_SHIFT);
+ *pmdp |= HMM_DUMMY_PTE_VALID;
+ }
+
+ kunmap(pmd_page);
+ kunmap(pud_page);
+ }
+
+ return 0;
+}
+
+static void hmm_dummy_pt_free_pmd(struct hmm_dummy_pt_map *pt_map,
+ unsigned long start,
+ unsigned long end)
+{
+ for (; start < end; start = hmm_dummy_pld_next(start)) {
+ unsigned long pfn, *pmdp, next;
+ struct page *page;
+
+ next = min(hmm_dummy_pld_next(start), end);
+ if (start > hmm_dummy_pld_base(start) || end < next)
+ continue;
+ pmdp = hmm_dummy_pt_pmd_map(pt_map, start);
+ if (!pmdp)
+ continue;
+ if (!(pmdp[hmm_dummy_pmd_index(start)] & HMM_DUMMY_PTE_VALID))
+ continue;
+ pfn = pmdp[hmm_dummy_pmd_index(start)] >> HMM_DUMMY_PFN_SHIFT;
+ page = pfn_to_page(pfn);
+ pmdp[hmm_dummy_pmd_index(start)] = 0;
+ __free_page(page);
+ }
+}
+
+static void hmm_dummy_pt_free_pud(struct hmm_dummy_pt_map *pt_map,
+ unsigned long start,
+ unsigned long end)
+{
+ for (; start < end; start = hmm_dummy_pmd_next(start)) {
+ unsigned long pfn, *pudp, next;
+ struct page *page;
+
+ next = min(hmm_dummy_pmd_next(start), end);
+ hmm_dummy_pt_free_pmd(pt_map, start, next);
+ hmm_dummy_pt_pmd_unmap(pt_map);
+ if (start > hmm_dummy_pmd_base(start) || end < next)
+ continue;
+ pudp = hmm_dummy_pt_pud_map(pt_map, start);
+ if (!pudp)
+ continue;
+ if (!(pudp[hmm_dummy_pud_index(start)] & HMM_DUMMY_PTE_VALID))
+ continue;
+ pfn = pudp[hmm_dummy_pud_index(start)] >> HMM_DUMMY_PFN_SHIFT;
+ page = pfn_to_page(pfn);
+ pudp[hmm_dummy_pud_index(start)] = 0;
+ __free_page(page);
+ }
+}
+
+static void hmm_dummy_pt_free(struct hmm_dummy_mirror *dmirror,
+ unsigned long start,
+ unsigned long end)
+{
+ struct hmm_dummy_pt_map pt_map = {0};
+
+ if (!dmirror->pgdp || (end - start) < HMM_DUMMY_PLD_SIZE)
+ return;
+
+ pt_map.dmirror = dmirror;
+
+ for (; start < end; start = hmm_dummy_pud_next(start)) {
+ unsigned long pfn, *pgdp, next;
+ struct page *page;
+
+ next = min(hmm_dummy_pud_next(start), end);
+ pgdp = dmirror->pgdp;
+ hmm_dummy_pt_free_pud(&pt_map, start, next);
+ hmm_dummy_pt_pud_unmap(&pt_map);
+ if (start > hmm_dummy_pud_base(start) || end < next)
+ continue;
+ if (!(pgdp[hmm_dummy_pgd_index(start)] & HMM_DUMMY_PTE_VALID))
+ continue;
+ pfn = pgdp[hmm_dummy_pgd_index(start)] >> HMM_DUMMY_PFN_SHIFT;
+ page = pfn_to_page(pfn);
+ pgdp[hmm_dummy_pgd_index(start)] = 0;
+ __free_page(page);
+ }
+ hmm_dummy_pt_unmap(&pt_map);
+}
+
+
+
+
+/* hmm_ops - hmm callback for the hmm dummy driver.
+ *
+ * Below are the various callbacks that the hmm api requires for a device. The
+ * implementation of the dummy device driver is necessarily simpler than what
+ * a real device driver would do. We do not have interrupts nor any kind of
+ * command buffer onto which to schedule memory invalidations and updates.
+ */
+static struct hmm_mirror *hmm_dummy_mirror_ref(struct hmm_mirror *mirror)
+{
+ struct hmm_dummy_mirror *dmirror;
+
+ if (!mirror)
+ return NULL;
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ if (!kref_get_unless_zero(&dmirror->kref))
+ return NULL;
+ return mirror;
+}
+
+static void hmm_dummy_mirror_destroy(struct kref *kref)
+{
+ struct hmm_dummy_mirror *dmirror;
+
+ dmirror = container_of(kref, struct hmm_dummy_mirror, kref);
+ mutex_lock(&dmirror->ddevice->mutex);
+ dmirror->ddevice->dmirrors[dmirror->minor] = NULL;
+ mutex_unlock(&dmirror->ddevice->mutex);
+
+ hmm_mirror_unregister(&dmirror->mirror);
+
+ kfree(dmirror);
+}
+
+static struct hmm_mirror *hmm_dummy_mirror_unref(struct hmm_mirror *mirror)
+{
+ struct hmm_dummy_mirror *dmirror;
+
+ if (!mirror)
+ return NULL;
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ kref_put(&dmirror->kref, hmm_dummy_mirror_destroy);
+ return NULL;
+}
+
+static void hmm_dummy_mirror_release(struct hmm_mirror *mirror)
+{
+ struct hmm_dummy_mirror *dmirror;
+
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ dmirror->stop = true;
+ mutex_lock(&dmirror->mutex);
+ hmm_dummy_pt_free(dmirror, 0, HMM_DUMMY_MAX_ADDR);
+ kfree(dmirror->pgdp);
+ dmirror->pgdp = NULL;
+ mutex_unlock(&dmirror->mutex);
+}
+
+static int hmm_dummy_fence_wait(struct hmm_fence *fence)
+{
+ /* FIXME add fake fence to showcase api */
+ return 0;
+}
+
+static void hmm_dummy_fence_ref(struct hmm_fence *fence)
+{
+ /* We never allocate fences so how could we end up here? */
+ BUG();
+}
+
+static void hmm_dummy_fence_unref(struct hmm_fence *fence)
+{
+ /* We never allocate fences so how could we end up here? */
+ BUG();
+}
+
+static int hmm_dummy_fault(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ const struct hmm_range *range)
+{
+ struct hmm_dummy_mirror *dmirror;
+ struct hmm_dummy_pt_map pt_map = {0};
+ unsigned long addr, i;
+ int ret = 0;
+
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ pt_map.dmirror = dmirror;
+
+ mutex_lock(&dmirror->mutex);
+ for (i = 0, addr = range->start; addr < range->end; ++i, addr += PAGE_SIZE) {
+ unsigned long *pldp, pld_idx;
+ struct page *page;
+ bool write;
+
+ pldp = hmm_dummy_pt_pld_map(&pt_map, addr);
+ if (!pldp) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ if (!hmm_pte_is_valid_smem(&range->pte[i])) {
+ ret = -ENOENT;
+ break;
+ }
+ write = hmm_pte_is_write(&range->pte[i]);
+ page = pfn_to_page(hmm_pte_pfn(range->pte[i]));
+ if (event->etype == HMM_WFAULT && !write) {
+ ret = -EACCES;
+ break;
+ }
+
+ pr_info("%16s %4d [0x%016lx] pfn 0x%016lx write %d\n",
+ __func__, __LINE__, addr, page_to_pfn(page), write);
+ pld_idx = hmm_dummy_pld_index(addr);
+ pldp[pld_idx] = (page_to_pfn(page) << HMM_DUMMY_PFN_SHIFT);
+ pldp[pld_idx] |= write ? HMM_DUMMY_PTE_WRITE : 0;
+ pldp[pld_idx] |= HMM_DUMMY_PTE_VALID | HMM_DUMMY_PTE_READ;
+ }
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+ return ret;
+}
+
+static struct hmm_fence *hmm_dummy_update(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ const struct hmm_range *range)
+{
+ struct hmm_dummy_mirror *dmirror;
+ struct hmm_dummy_pt_map pt_map = {0};
+ unsigned long addr, i, mask;
+ int ret;
+
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ pt_map.dmirror = dmirror;
+
+ pr_info("%16s %4d [0x%016lx 0x%016lx] type %d\n",
+ __func__, __LINE__, range->start, range->end, event->etype);
+ /* Debugging aid; a real device driver does not have to do that. */
+ switch (event->etype) {
+ case HMM_MIGRATE:
+ case HMM_MUNMAP:
+ mask = 0;
+ break;
+ case HMM_ISDIRTY:
+ mask = -1UL;
+ break;
+ case HMM_WRITE_PROTECT:
+ mask = ~HMM_DUMMY_PTE_WRITE;
+ break;
+ case HMM_RFAULT:
+ case HMM_WFAULT:
+ ret = hmm_dummy_fault(mirror, event, range);
+ if (ret)
+ return ERR_PTR(ret);
+ return NULL;
+ default:
+ return ERR_PTR(-EIO);
+ }
+
+ mutex_lock(&dmirror->mutex);
+ for (i = 0, addr = range->start; addr < range->end; ++i, addr += PAGE_SIZE) {
+ unsigned long *pldp;
+
+ pldp = hmm_dummy_pt_pld_map(&pt_map, addr);
+ if (!pldp)
+ continue;
+ if (((*pldp) & HMM_DUMMY_PTE_DIRTY)) {
+ hmm_pte_mk_dirty(&range->pte[i]);
+ }
+ *pldp &= ~HMM_DUMMY_PTE_DIRTY;
+ *pldp &= mask;
+ }
+ hmm_dummy_pt_unmap(&pt_map);
+
+ if (event->etype == HMM_MUNMAP)
+ hmm_dummy_pt_free(dmirror, range->start, range->end);
+ mutex_unlock(&dmirror->mutex);
+ return NULL;
+}
+
+static const struct hmm_device_ops hmm_dummy_ops = {
+ .mirror_ref = &hmm_dummy_mirror_ref,
+ .mirror_unref = &hmm_dummy_mirror_unref,
+ .mirror_release = &hmm_dummy_mirror_release,
+ .fence_wait = &hmm_dummy_fence_wait,
+ .fence_ref = &hmm_dummy_fence_ref,
+ .fence_unref = &hmm_dummy_fence_unref,
+ .update = &hmm_dummy_update,
+};
+
+
+/* hmm_dummy_mmap - hmm dummy device file mmap operations.
+ *
+ * The hmm dummy driver does not allow mmap of its device file. The main reason
+ * is that the kernel lacks the ability to insert pages with specific custom
+ * protections inside a vma.
+ */
+static int hmm_dummy_mmap_fault(struct vm_area_struct *vma,
+ struct vm_fault *vmf)
+{
+ return VM_FAULT_SIGBUS;
+}
+
+static void hmm_dummy_mmap_open(struct vm_area_struct *vma)
+{
+ /* nop */
+}
+
+static void hmm_dummy_mmap_close(struct vm_area_struct *vma)
+{
+ /* nop */
+}
+
+static const struct vm_operations_struct mmap_mem_ops = {
+ .fault = hmm_dummy_mmap_fault,
+ .open = hmm_dummy_mmap_open,
+ .close = hmm_dummy_mmap_close,
+};
+
+
+/* hmm_dummy_fops - hmm dummy device file operations.
+ *
+ * The hmm dummy driver allows reading from and writing to the mirrored process
+ * through the device file. Below are the read, write and other device file
+ * callbacks that implement access to the mirrored address space.
+ */
+#define DUMMY_WINDOW 4
+
+static int hmm_dummy_mirror_fault(struct hmm_dummy_mirror *dmirror,
+ unsigned long addr,
+ bool write)
+{
+ struct hmm_mirror *mirror = &dmirror->mirror;
+ struct hmm_event event;
+ unsigned long start, end;
+ int ret;
+
+ event.start = start = addr > ((DUMMY_WINDOW >> 1) << PAGE_SHIFT) ?
+ addr - ((DUMMY_WINDOW >> 1) << PAGE_SHIFT) : 0;
+ event.end = end = start + (DUMMY_WINDOW << PAGE_SHIFT);
+ event.etype = write ? HMM_WFAULT : HMM_RFAULT;
+
+ /* Pre-allocate device page table. */
+ mutex_lock(&dmirror->mutex);
+ ret = hmm_dummy_pt_alloc(dmirror, start, end);
+ mutex_unlock(&dmirror->mutex);
+ if (ret)
+ return ret;
+
+ while (1) {
+ ret = hmm_mirror_fault(mirror, &event);
+ /* Ignore any error that does not concern the fault address. */
+ if (addr >= event.end) {
+ event.start = event.end;
+ event.end = end;
+ continue;
+ }
+ break;
+ }
+
+ return ret;
+}
+
+static ssize_t hmm_dummy_fops_read(struct file *filp,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct hmm_dummy_device *ddevice;
+ struct hmm_dummy_mirror *dmirror;
+ struct hmm_dummy_pt_map pt_map = {0};
+ struct hmm_mirror *mirror;
+ unsigned long start, end, offset;
+ unsigned minor;
+ ssize_t retval = 0;
+ void *tmp;
+ long r;
+
+ tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!tmp)
+ return -ENOMEM;
+
+ /* Check if we are mirroring anything */
+ minor = iminor(file_inode(filp));
+ ddevice = filp->private_data;
+ mutex_lock(&ddevice->mutex);
+ if (ddevice->dmirrors[minor] == NULL) {
+ mutex_unlock(&ddevice->mutex);
+ kfree(tmp);
+ return 0;
+ }
+ mirror = hmm_mirror_ref(&ddevice->dmirrors[minor]->mirror);
+ mutex_unlock(&ddevice->mutex);
+
+ if (!mirror) {
+ kfree(tmp);
+ return 0;
+ }
+
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ if (dmirror->stop) {
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return 0;
+ }
+
+ /* The range of address to lookup. */
+ start = (*ppos) & PAGE_MASK;
+ offset = (*ppos) - start;
+ end = PAGE_ALIGN(start + count);
+ BUG_ON(start == end);
+ pt_map.dmirror = dmirror;
+
+ for (; count; start += PAGE_SIZE, offset = 0) {
+ unsigned long *pldp, pld_idx;
+ unsigned long size = min(PAGE_SIZE - offset, count);
+ struct page *page;
+ char *ptr;
+
+ mutex_lock(&dmirror->mutex);
+ pldp = hmm_dummy_pt_pld_map(&pt_map, start);
+ pld_idx = hmm_dummy_pld_index(start);
+ if (!pldp || !(pldp[pld_idx] & HMM_DUMMY_PTE_VALID)) {
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+ goto fault;
+ }
+ page = hmm_dummy_pte_to_page(pldp[pld_idx]);
+ if (!page) {
+ mutex_unlock(&dmirror->mutex);
+ BUG();
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return -EFAULT;
+ }
+ ptr = kmap(page);
+ memcpy(tmp, ptr + offset, size);
+ kunmap(page);
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+
+ r = copy_to_user(buf, tmp, size);
+ if (r) {
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return -EFAULT;
+ }
+ retval += size;
+ *ppos += size;
+ count -= size;
+ buf += size;
+ }
+
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return retval;
+
+fault:
+ kfree(tmp);
+ r = hmm_dummy_mirror_fault(dmirror, start, false);
+ hmm_mirror_unref(mirror);
+ if (r)
+ return r;
+
+ /* Force userspace to retry read if nothing was read. */
+ return retval ? retval : -EINTR;
+}
+
+static ssize_t hmm_dummy_fops_write(struct file *filp,
+ const char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct hmm_dummy_device *ddevice;
+ struct hmm_dummy_mirror *dmirror;
+ struct hmm_dummy_pt_map pt_map = {0};
+ struct hmm_mirror *mirror;
+ unsigned long start, end, offset;
+ unsigned minor;
+ ssize_t retval = 0;
+ void *tmp;
+ long r;
+
+ tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!tmp)
+ return -ENOMEM;
+
+ /* Check if we are mirroring anything */
+ minor = iminor(file_inode(filp));
+ ddevice = filp->private_data;
+ mutex_lock(&ddevice->mutex);
+ if (ddevice->dmirrors[minor] == NULL) {
+ mutex_unlock(&ddevice->mutex);
+ kfree(tmp);
+ return 0;
+ }
+ mirror = hmm_mirror_ref(&ddevice->dmirrors[minor]->mirror);
+ mutex_unlock(&ddevice->mutex);
+
+ if (!mirror) {
+ kfree(tmp);
+ return 0;
+ }
+
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ if (dmirror->stop) {
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return 0;
+ }
+
+ /* The range of address to lookup. */
+ start = (*ppos) & PAGE_MASK;
+ offset = (*ppos) - start;
+ end = PAGE_ALIGN(start + count);
+ BUG_ON(start == end);
+ pt_map.dmirror = dmirror;
+
+ for (; count; start += PAGE_SIZE, offset = 0) {
+ unsigned long *pldp, pld_idx;
+ unsigned long size = min(PAGE_SIZE - offset, count);
+ struct page *page;
+ char *ptr;
+
+ r = copy_from_user(tmp, buf, size);
+ if (r) {
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return -EFAULT;
+ }
+
+ mutex_lock(&dmirror->mutex);
+
+ pldp = hmm_dummy_pt_pld_map(&pt_map, start);
+ pld_idx = hmm_dummy_pld_index(start);
+ if (!pldp || !(pldp[pld_idx] & HMM_DUMMY_PTE_VALID)) {
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+ goto fault;
+ }
+ if (!(pldp[pld_idx] & HMM_DUMMY_PTE_WRITE)) {
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+ goto fault;
+ }
+ pldp[pld_idx] |= HMM_DUMMY_PTE_DIRTY;
+ page = hmm_dummy_pte_to_page(pldp[pld_idx]);
+ if (!page) {
+ mutex_unlock(&dmirror->mutex);
+ BUG();
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return -EFAULT;
+ }
+ ptr = kmap(page);
+ memcpy(ptr + offset, tmp, size);
+ kunmap(page);
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+
+ retval += size;
+ *ppos += size;
+ count -= size;
+ buf += size;
+ }
+
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return retval;
+
+fault:
+ kfree(tmp);
+ r = hmm_dummy_mirror_fault(dmirror, start, true);
+ hmm_mirror_unref(mirror);
+ if (r)
+ return r;
+
+ /* Force userspace to retry write if nothing was written. */
+ return retval ? retval : -EINTR;
+}
+
+static int hmm_dummy_fops_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+ return -EINVAL;
+}
+
+static int hmm_dummy_fops_open(struct inode *inode, struct file *filp)
+{
+ struct hmm_dummy_device *ddevice;
+ struct cdev *cdev = inode->i_cdev;
+ const int minor = iminor(inode);
+
+ /* No exclusive opens */
+ if (filp->f_flags & O_EXCL)
+ return -EINVAL;
+
+ ddevice = container_of(cdev, struct hmm_dummy_device, cdev);
+ filp->private_data = ddevice;
+ ddevice->fmapping[minor] = &inode->i_data;
+
+ return 0;
+}
+
+static int hmm_dummy_fops_release(struct inode *inode,
+ struct file *filp)
+{
+#if 0
+ struct hmm_dummy_device *ddevice;
+ struct hmm_dummy_mirror *dmirror;
+ struct cdev *cdev = inode->i_cdev;
+ const int minor = iminor(inode);
+
+ ddevice = container_of(cdev, struct hmm_dummy_device, cdev);
+ mutex_lock(&ddevice->mutex);
+ dmirror = ddevice->dmirrors[minor];
+ if (dmirror && dmirror->filp == filp) {
+ struct hmm_mirror *mirror = hmm_mirror_ref(&dmirror->mirror);
+ ddevice->dmirrors[minor] = NULL;
+ mutex_unlock(&ddevice->mutex);
+
+ if (mirror) {
+ hmm_mirror_release(mirror);
+ hmm_mirror_unref(mirror);
+ }
+ } else
+ mutex_unlock(&ddevice->mutex);
+#endif
+
+ return 0;
+}
+
+static long hmm_dummy_fops_unlocked_ioctl(struct file *filp,
+ unsigned int command,
+ unsigned long arg)
+{
+ struct hmm_dummy_device *ddevice;
+ struct hmm_dummy_mirror *dmirror;
+ unsigned minor;
+ int ret;
+
+ minor = iminor(file_inode(filp));
+ ddevice = filp->private_data;
+ switch (command) {
+ case HMM_DUMMY_EXPOSE_MM:
+ mutex_lock(&ddevice->mutex);
+ dmirror = ddevice->dmirrors[minor];
+ if (dmirror) {
+ mutex_unlock(&ddevice->mutex);
+ return -EBUSY;
+ }
+ /* Mirror this process address space */
+ dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
+ if (dmirror == NULL) {
+ mutex_unlock(&ddevice->mutex);
+ return -ENOMEM;
+ }
+ kref_init(&dmirror->kref);
+ dmirror->mm = NULL;
+ dmirror->stop = false;
+ dmirror->pid = task_pid_nr(current);
+ dmirror->ddevice = ddevice;
+ dmirror->minor = minor;
+ dmirror->filp = filp;
+ dmirror->pgdp = NULL;
+ mutex_init(&dmirror->mutex);
+ ddevice->dmirrors[minor] = dmirror;
+ mutex_unlock(&ddevice->mutex);
+
+ ret = hmm_mirror_register(&dmirror->mirror,
+ &ddevice->device,
+ current->mm);
+ if (ret) {
+ mutex_lock(&ddevice->mutex);
+ ddevice->dmirrors[minor] = NULL;
+ mutex_unlock(&ddevice->mutex);
+ kfree(dmirror);
+ return ret;
+ }
+ /* Success. */
+ pr_info("mirroring address space of %d\n", dmirror->pid);
+ hmm_mirror_unref(&dmirror->mirror);
+ return 0;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static const struct file_operations hmm_dummy_fops = {
+ .read = hmm_dummy_fops_read,
+ .write = hmm_dummy_fops_write,
+ .mmap = hmm_dummy_fops_mmap,
+ .open = hmm_dummy_fops_open,
+ .release = hmm_dummy_fops_release,
+ .unlocked_ioctl = hmm_dummy_fops_unlocked_ioctl,
+ .llseek = default_llseek,
+ .owner = THIS_MODULE,
+};
+
+
+/*
+ * char device driver
+ */
+static int hmm_dummy_device_init(struct hmm_dummy_device *ddevice)
+{
+ int ret, i;
+
+ ret = alloc_chrdev_region(&ddevice->dev, 0,
+ HMM_DUMMY_MAX_DEVICES,
+ ddevice->name);
+ if (ret < 0)
+ goto error;
+ ddevice->major = MAJOR(ddevice->dev);
+
+ cdev_init(&ddevice->cdev, &hmm_dummy_fops);
+ ret = cdev_add(&ddevice->cdev, ddevice->dev, HMM_DUMMY_MAX_DEVICES);
+ if (ret) {
+ unregister_chrdev_region(ddevice->dev, HMM_DUMMY_MAX_DEVICES);
+ goto error;
+ }
+
+ /* Register the hmm device. */
+ for (i = 0; i < HMM_DUMMY_MAX_DEVICES; i++)
+ ddevice->dmirrors[i] = NULL;
+ mutex_init(&ddevice->mutex);
+ ddevice->device.ops = &hmm_dummy_ops;
+ ddevice->device.name = ddevice->name;
+
+ ret = hmm_device_register(&ddevice->device);
+ if (ret) {
+ cdev_del(&ddevice->cdev);
+ unregister_chrdev_region(ddevice->dev, HMM_DUMMY_MAX_DEVICES);
+ goto error;
+ }
+
+ return 0;
+
+error:
+ return ret;
+}
+
+static void hmm_dummy_device_fini(struct hmm_dummy_device *ddevice)
+{
+ unsigned i;
+
+ /* First finish hmm. */
+ mutex_lock(&ddevice->mutex);
+ for (i = 0; i < HMM_DUMMY_MAX_DEVICES; i++) {
+ struct hmm_mirror *mirror = NULL;
+
+ if (ddevice->dmirrors[i]) {
+ mirror = hmm_mirror_ref(&ddevice->dmirrors[i]->mirror);
+ ddevice->dmirrors[i] = NULL;
+ }
+ if (!mirror)
+ continue;
+
+ mutex_unlock(&ddevice->mutex);
+ hmm_mirror_release(mirror);
+ hmm_mirror_unref(mirror);
+ mutex_lock(&ddevice->mutex);
+ }
+ mutex_unlock(&ddevice->mutex);
+
+ if (hmm_device_unregister(&ddevice->device))
+ BUG();
+
+ cdev_del(&ddevice->cdev);
+ unregister_chrdev_region(ddevice->dev,
+ HMM_DUMMY_MAX_DEVICES);
+}
+
+static int __init hmm_dummy_init(void)
+{
+ int ret;
+
+ snprintf(ddevices[0].name, sizeof(ddevices[0].name),
+ "%s%d", HMM_DUMMY_DEVICE_NAME, 0);
+ ret = hmm_dummy_device_init(&ddevices[0]);
+ if (ret)
+ return ret;
+
+ snprintf(ddevices[1].name, sizeof(ddevices[1].name),
+ "%s%d", HMM_DUMMY_DEVICE_NAME, 1);
+ ret = hmm_dummy_device_init(&ddevices[1]);
+ if (ret) {
+ hmm_dummy_device_fini(&ddevices[0]);
+ return ret;
+ }
+
+ pr_info("hmm_dummy loaded THIS IS A DANGEROUS MODULE !!!\n");
+ return 0;
+}
+
+static void __exit hmm_dummy_exit(void)
+{
+ hmm_dummy_device_fini(&ddevices[1]);
+ hmm_dummy_device_fini(&ddevices[0]);
+}
+
+module_init(hmm_dummy_init);
+module_exit(hmm_dummy_exit);
+MODULE_LICENSE("GPL");
diff --git a/include/uapi/linux/hmm_dummy.h b/include/uapi/linux/hmm_dummy.h
new file mode 100644
index 0000000..20eb98f
--- /dev/null
+++ b/include/uapi/linux/hmm_dummy.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* This is a dummy driver made to exercise the HMM (heterogeneous memory management)
+ * API of the kernel. It allows a userspace program to map its whole address
+ * space through the hmm dummy driver file.
+ */
+#ifndef _UAPI_LINUX_HMM_DUMMY_H
+#define _UAPI_LINUX_HMM_DUMMY_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+#include <linux/irqnr.h>
+
+/* Expose the address space of the calling process through hmm dummy dev file */
+#define HMM_DUMMY_EXPOSE_MM _IO('R', 0x00)
+
+#endif /* _UAPI_LINUX_HMM_DUMMY_H */
--
1.9.3
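
As a rough illustration of how the dummy driver above is meant to be driven from
userspace (this is only a sketch, not part of the patch): the /dev/hmm_dummy0
node name is an assumption for the registered char device, the HMM_DUMMY_EXPOSE_MM
ioctl comes from the uapi header above, and the read path interprets the file
offset as a virtual address of the mirrored process, returning -EINTR after
faulting pages in so the caller has to retry.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/hmm_dummy.h>

int main(void)
{
	char secret[64] = "hello through the mirror", buf[64] = {0};
	ssize_t r;
	int fd;

	fd = open("/dev/hmm_dummy0", O_RDWR);	/* assumed device node */
	if (fd < 0)
		return 1;

	/* Ask the driver to mirror this process address space. */
	if (ioctl(fd, HMM_DUMMY_EXPOSE_MM) < 0)
		return 1;

	/* File offset == virtual address; retry while pages are faulted in. */
	do {
		r = pread(fd, buf, sizeof(buf), (off_t)(unsigned long)secret);
	} while (r < 0 && errno == EINTR);

	printf("mirrored read: %s\n", buf);
	close(fd);
	return 0;
}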

2014-11-03 20:46:50

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH 2/5] mmu_notifier: keep track of active invalidation ranges

From: Jérôme Glisse <[email protected]>

The mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
calls can be considered as forming an "atomic" section from the cpu page table
point of view. Between these two functions the cpu page table content is
unreliable for the address range being invalidated.

Current users such as kvm need to know when they can trust the content of the cpu
page table. This becomes even more important for new users of the mmu_notifier
api (such as HMM or ODP).

This patch uses a structure, defined at each call site of invalidate_range_start(),
that is added to a list for the duration of the invalidation. It adds two new
helpers to allow querying whether a range is being invalidated and to wait for a
range to become valid.

For proper synchronization, users must block new range invalidations from inside
their invalidate_range_start() callback before calling the helper functions.
Otherwise there is no guarantee that a new range invalidation will not be added
after the call to the helper function that queries for existing ranges.
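
As an illustration (not part of this patch), the pattern a driver is expected to
follow might look like the sketch below; the my_* names are placeholders, and
update_lock stands for whatever lock the driver's invalidate_range_start()
callback takes to block new invalidations:

static void my_mirror_update_range(struct my_mirror *mirror,
				   unsigned long start, unsigned long end)
{
	struct mm_struct *mm = mirror->mm;

again:
	/* Sleep until no active invalidation overlaps [start, end). */
	mmu_notifier_range_wait_valid(mm, start, end);

	/* Same lock is taken by our invalidate_range_start() callback. */
	mutex_lock(&mirror->update_lock);
	if (!mmu_notifier_range_is_valid(mm, start, end)) {
		/* A new invalidation was added before we took the lock. */
		mutex_unlock(&mirror->update_lock);
		goto again;
	}
	/* CPU page table content for [start, end) can now be trusted. */
	my_device_update_page_table(mirror, start, end);
	mutex_unlock(&mirror->update_lock);
}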

Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/gpu/drm/i915/i915_gem_userptr.c | 13 +++---
drivers/iommu/amd_iommu_v2.c | 8 +---
drivers/misc/sgi-gru/grutlbpurge.c | 15 +++----
drivers/xen/gntdev.c | 15 ++++---
fs/proc/task_mmu.c | 12 +++--
include/linux/mmu_notifier.h | 60 ++++++++++++++-----------
kernel/events/uprobes.c | 13 +++---
mm/huge_memory.c | 78 ++++++++++++++-------------------
mm/hugetlb.c | 55 +++++++++++------------
mm/ksm.c | 28 +++++-------
mm/memory.c | 78 +++++++++++++++++++--------------
mm/migrate.c | 36 +++++++--------
mm/mmu_notifier.c | 76 +++++++++++++++++++++++++++-----
mm/mprotect.c | 17 ++++---
mm/mremap.c | 14 +++---
mm/rmap.c | 15 +++----
virt/kvm/kvm_main.c | 10 ++---
17 files changed, 298 insertions(+), 245 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 20dbd26..10b0044 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -128,26 +128,25 @@ restart:

static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
struct interval_tree_node *it = NULL;
- unsigned long next = start;
+ unsigned long next = range->start;
unsigned long serial = 0;
+ /* interval ranges are inclusive, but invalidate range is exclusive */
+ unsigned long end = range->end - 1;

- end--; /* interval ranges are inclusive, but invalidate range is exclusive */
while (next < end) {
struct drm_i915_gem_object *obj = NULL;

spin_lock(&mn->lock);
if (mn->has_linear)
- it = invalidate_range__linear(mn, mm, start, end);
+ it = invalidate_range__linear(mn, mm, range->start, end);
else if (serial == mn->serial)
it = interval_tree_iter_next(it, next, end);
else
- it = interval_tree_iter_first(&mn->objects, start, end);
+ it = interval_tree_iter_first(&mn->objects, range->start, end);
if (it != NULL) {
obj = container_of(it, struct i915_mmu_object, it)->obj;
drm_gem_object_reference(&obj->base);
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 57d2acf..9b7f32d 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -421,9 +421,7 @@ static void mn_invalidate_page(struct mmu_notifier *mn,

static void mn_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct pasid_state *pasid_state;
struct device_state *dev_state;
@@ -444,9 +442,7 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,

static void mn_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct pasid_state *pasid_state;
struct device_state *dev_state;
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..44b41b7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,8 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
*/
static void gru_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -230,14 +229,13 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
STAT(mmu_invalidate_range);
atomic_inc(&gms->ms_range_active);
gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
- start, end, atomic_read(&gms->ms_range_active));
- gru_flush_tlb_range(gms, start, end - start);
+ range->start, range->end, atomic_read(&gms->ms_range_active));
+ gru_flush_tlb_range(gms, range->start, range->end - range->start);
}

static void gru_invalidate_range_end(struct mmu_notifier *mn,
- struct mm_struct *mm, unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -246,7 +244,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
(void)atomic_dec_and_test(&gms->ms_range_active);

wake_up_all(&gms->ms_wait_queue);
- gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
+ gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms,
+ range->start, range->end);
}

static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index fe9da94..db5c2cad 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,19 +428,17 @@ static void unmap_if_in_range(struct grant_map *map,

static void mn_invl_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
struct grant_map *map;

spin_lock(&priv->lock);
list_for_each_entry(map, &priv->maps, next) {
- unmap_if_in_range(map, start, end);
+ unmap_if_in_range(map, range->start, range->end);
}
list_for_each_entry(map, &priv->freeable_maps, next) {
- unmap_if_in_range(map, start, end);
+ unmap_if_in_range(map, range->start, range->end);
}
spin_unlock(&priv->lock);
}
@@ -450,7 +448,12 @@ static void mn_invl_page(struct mmu_notifier *mn,
unsigned long address,
enum mmu_event event)
{
- mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+ struct mmu_notifier_range range;
+
+ range.start = address;
+ range.end = address + PAGE_SIZE;
+ range.event = event;
+ mn_invl_range_start(mn, mm, &range);
}

static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c884143..19dc948 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -828,6 +828,12 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
.mm = mm,
.private = &cp,
};
+ struct mmu_notifier_range range = {
+ .start = 0,
+ .end = -1UL,
+ .event = MMU_ISDIRTY,
+ };
+
down_read(&mm->mmap_sem);
if (type == CLEAR_REFS_SOFT_DIRTY) {
for (vma = mm->mmap; vma; vma = vma->vm_next) {
@@ -842,8 +848,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
downgrade_write(&mm->mmap_sem);
break;
}
- mmu_notifier_invalidate_range_start(mm, 0,
- -1, MMU_ISDIRTY);
+ mmu_notifier_invalidate_range_start(mm, &range);
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
cp.vma = vma;
@@ -868,8 +873,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
&clear_refs_walk);
}
if (type == CLEAR_REFS_SOFT_DIRTY)
- mmu_notifier_invalidate_range_end(mm, 0,
- -1, MMU_ISDIRTY);
+ mmu_notifier_invalidate_range_end(mm, &range);
flush_tlb_mm(mm);
up_read(&mm->mmap_sem);
mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d36de82..8acb7c9 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -69,6 +69,13 @@ enum mmu_event {
MMU_WRITE_PROTECT,
};

+struct mmu_notifier_range {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+ enum mmu_event event;
+};
+
#ifdef CONFIG_MMU_NOTIFIER

/*
@@ -82,6 +89,12 @@ struct mmu_notifier_mm {
struct hlist_head list;
/* to serialize the list modifications and hlist_unhashed */
spinlock_t lock;
+ /* List of all active range invalidations. */
+ struct list_head ranges;
+ /* Number of active range invalidations. */
+ int nranges;
+ /* For threads waiting on range invalidations. */
+ wait_queue_head_t wait_queue;
};

struct mmu_notifier_ops {
@@ -202,14 +215,10 @@ struct mmu_notifier_ops {
*/
void (*invalidate_range_start)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ const struct mmu_notifier_range *range);
void (*invalidate_range_end)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ const struct mmu_notifier_range *range);

/*
* invalidate_range() is either called between
@@ -279,15 +288,17 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address,
enum mmu_event event);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ struct mmu_notifier_range *range);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ struct mmu_notifier_range *range);
extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end);
+extern bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
+extern void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);

static inline void mmu_notifier_release(struct mm_struct *mm)
{
@@ -330,21 +341,22 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
+ /*
+ * Initialize list no matter what in case a mmu_notifier registers after
+ * a range_start but before the matching range_end.
+ */
+ INIT_LIST_HEAD(&range->list);
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, start, end, event);
+ __mmu_notifier_invalidate_range_start(mm, range);
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_end(mm, start, end, event);
+ __mmu_notifier_invalidate_range_end(mm, range);
}

static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -486,16 +498,12 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
}

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index eacdf1b..5470f61 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -164,9 +164,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
spinlock_t *ptl;
pte_t *ptep;
int err;
- /* For mmu_notifiers */
- const unsigned long mmun_start = addr;
- const unsigned long mmun_end = addr + PAGE_SIZE;
+ struct mmu_notifier_range range;
struct mem_cgroup *memcg;

err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
@@ -176,8 +174,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
/* For try_to_free_swap() and munlock_vma_page() below */
lock_page(page);

- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = addr;
+ range.end = addr + PAGE_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);
err = -EAGAIN;
ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -211,8 +211,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
err = 0;
unlock:
mem_cgroup_cancel_charge(kpage, memcg);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
unlock_page(page);
return err;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f61b4ac..e1ea4f5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -992,8 +992,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
pmd_t _pmd;
int ret = 0, i;
struct page **pages;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
GFP_KERNEL);
@@ -1031,10 +1030,10 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
cond_resched();
}

- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = haddr;
+ range.end = haddr + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1068,8 +1067,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
page_remove_rmap(page);
spin_unlock(ptl);

- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1079,8 +1077,7 @@ out:

out_free_pages:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1099,8 +1096,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page = NULL, *new_page;
struct mem_cgroup *memcg;
unsigned long haddr;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

ptl = pmd_lockptr(mm, pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1170,10 +1166,10 @@ alloc:
copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);

- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = haddr;
+ range.end = haddr + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

spin_lock(ptl);
if (page)
@@ -1205,8 +1201,7 @@ alloc:
}
spin_unlock(ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
out:
return ret;
out_unlock:
@@ -1638,12 +1633,12 @@ static int __split_huge_page_splitting(struct page *page,
spinlock_t *ptl;
pmd_t *pmd;
int ret = 0;
- /* For mmu_notifiers */
- const unsigned long mmun_start = address;
- const unsigned long mmun_end = address + HPAGE_PMD_SIZE;
+ struct mmu_notifier_range range;

- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_HSPLIT);
+ range.start = address;
+ range.end = address + HPAGE_PMD_SIZE;
+ range.event = MMU_HSPLIT;
+ mmu_notifier_invalidate_range_start(mm, &range);
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
if (pmd) {
@@ -1659,8 +1654,7 @@ static int __split_huge_page_splitting(struct page *page,
ret = 1;
spin_unlock(ptl);
}
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_HSPLIT);
+ mmu_notifier_invalidate_range_end(mm, &range);

return ret;
}
@@ -2438,8 +2432,7 @@ static void collapse_huge_page(struct mm_struct *mm,
int isolated;
unsigned long hstart, hend;
struct mem_cgroup *memcg;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

VM_BUG_ON(address & ~HPAGE_PMD_MASK);

@@ -2479,10 +2472,10 @@ static void collapse_huge_page(struct mm_struct *mm,
pte = pte_offset_map(pmd, address);
pte_ptl = pte_lockptr(mm, pmd);

- mmun_start = address;
- mmun_end = address + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = address;
+ range.end = address + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* After this gup_fast can't run anymore. This also removes
@@ -2492,8 +2485,7 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
_pmd = pmdp_clear_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2876,36 +2868,32 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
struct page *page;
struct mm_struct *mm = vma->vm_mm;
unsigned long haddr = address & HPAGE_PMD_MASK;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);

- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
+ range.start = haddr;
+ range.end = haddr + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
again:
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(mm, &range);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
return;
}
if (is_huge_zero_pmd(*pmd)) {
__split_huge_zero_page_pmd(vma, haddr, pmd);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
get_page(page);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

split_huge_page(page);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a9418d6..57c7425 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2551,17 +2551,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
int cow;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
int ret = 0;

cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

- mmun_start = vma->vm_start;
- mmun_end = vma->vm_end;
+ range.start = vma->vm_start;
+ range.end = vma->vm_end;
+ range.event = MMU_MIGRATE;
if (cow)
- mmu_notifier_invalidate_range_start(src, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(src, &range);

for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
@@ -2601,8 +2600,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
} else {
if (cow) {
huge_ptep_set_wrprotect(src, addr, src_pte);
- mmu_notifier_invalidate_range(src, mmun_start,
- mmun_end);
+ mmu_notifier_invalidate_range(src, range.start,
+ range.end);
}
entry = huge_ptep_get(src_pte);
ptepage = pte_page(entry);
@@ -2615,8 +2614,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}

if (cow)
- mmu_notifier_invalidate_range_end(src, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(src, &range);

return ret;
}
@@ -2634,16 +2632,17 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- const unsigned long mmun_start = start; /* For mmu_notifiers */
- const unsigned long mmun_end = end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON(start & ~huge_page_mask(h));
BUG_ON(end & ~huge_page_mask(h));

+ range.start = start;
+ range.end = end;
+ range.event = MMU_MIGRATE;
tlb_start_vma(tlb, vma);
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(mm, &range);
again:
for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
@@ -2714,8 +2713,7 @@ unlock:
if (address < end && !ref_page)
goto again;
}
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
tlb_end_vma(tlb, vma);
}

@@ -2812,8 +2810,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
struct hstate *h = hstate_vma(vma);
struct page *old_page, *new_page;
int ret = 0, outside_reserve = 0;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

old_page = pte_page(pte);

@@ -2891,10 +2888,11 @@ retry_avoidcopy:
pages_per_huge_page(h));
__SetPageUptodate(new_page);

- mmun_start = address & huge_page_mask(h);
- mmun_end = mmun_start + huge_page_size(h);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = address & huge_page_mask(h);
+ range.end = range.start + huge_page_size(h);
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);
+
/*
* Retake the page table lock to check for racing updates
* before the page tables are altered
@@ -2906,7 +2904,7 @@ retry_avoidcopy:

/* Break COW */
huge_ptep_clear_flush(vma, address, ptep);
- mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range(mm, range.start, range.end);
set_huge_pte_at(mm, address, ptep,
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page);
@@ -2915,8 +2913,7 @@ retry_avoidcopy:
new_page = old_page;
}
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
out_release_all:
page_cache_release(new_page);
out_release_old:
@@ -3350,11 +3347,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte_t pte;
struct hstate *h = hstate_vma(vma);
unsigned long pages = 0;
+ struct mmu_notifier_range range;

BUG_ON(address >= end);
flush_cache_range(vma, address, end);

- mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
+ range.start = start;
+ range.end = end;
+ range.event = MMU_MPROT;
+ mmu_notifier_invalidate_range_start(mm, &range);
mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
for (; address < end; address += huge_page_size(h)) {
spinlock_t *ptl;
@@ -3385,7 +3386,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_tlb_range(vma, start, end);
mmu_notifier_invalidate_range(mm, start, end);
mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
- mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
+ mmu_notifier_invalidate_range_end(mm, &range);

return pages << h->order;
}
diff --git a/mm/ksm.c b/mm/ksm.c
index 8c3a892..3667d98 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -855,14 +855,13 @@ static inline int pages_identical(struct page *page1, struct page *page2)
static int write_protect_page(struct vm_area_struct *vma, struct page *page,
pte_t *orig_pte)
{
+ struct mmu_notifier_range range;
struct mm_struct *mm = vma->vm_mm;
unsigned long addr;
pte_t *ptep;
spinlock_t *ptl;
int swapped;
int err = -EFAULT;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */

addr = page_address_in_vma(page, vma);
if (addr == -EFAULT)
@@ -870,10 +869,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,

BUG_ON(PageTransCompound(page));

- mmun_start = addr;
- mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_WRITE_PROTECT);
+ range.start = addr;
+ range.end = addr + PAGE_SIZE;
+ range.event = MMU_WRITE_PROTECT;
+ mmu_notifier_invalidate_range_start(mm, &range);

ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -913,8 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
out_unlock:
pte_unmap_unlock(ptep, ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
- MMU_WRITE_PROTECT);
+ mmu_notifier_invalidate_range_end(mm, &range);
out:
return err;
}
@@ -937,8 +935,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
spinlock_t *ptl;
unsigned long addr;
int err = -EFAULT;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

addr = page_address_in_vma(page, vma);
if (addr == -EFAULT)
@@ -948,10 +945,10 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
if (!pmd)
goto out;

- mmun_start = addr;
- mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = addr;
+ range.end = addr + PAGE_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte_same(*ptep, orig_pte)) {
@@ -976,8 +973,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
pte_unmap_unlock(ptep, ptl);
err = 0;
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
out:
return err;
}
diff --git a/mm/memory.c b/mm/memory.c
index 64c3cde..cdafc2a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1015,8 +1015,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
bool is_cow;
int ret;

@@ -1052,11 +1051,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* is_cow_mapping() returns true.
*/
is_cow = is_cow_mapping(vma->vm_flags);
- mmun_start = addr;
- mmun_end = end;
+ range.start = addr;
+ range.end = end;
+ range.event = MMU_MIGRATE;
if (is_cow)
- mmu_notifier_invalidate_range_start(src_mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(src_mm, &range);

ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
@@ -1073,8 +1072,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
} while (dst_pgd++, src_pgd++, addr = next, addr != end);

if (is_cow)
- mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(src_mm, &range);
return ret;
}

@@ -1378,13 +1376,16 @@ void unmap_vmas(struct mmu_gather *tlb,
unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
+ struct mmu_notifier_range range = {
+ .start = start_addr,
+ .end = end_addr,
+ .event = MMU_MUNMAP,
+ };

- mmu_notifier_invalidate_range_start(mm, start_addr,
- end_addr, MMU_MUNMAP);
+ mmu_notifier_invalidate_range_start(mm, &range);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
- mmu_notifier_invalidate_range_end(mm, start_addr,
- end_addr, MMU_MUNMAP);
+ mmu_notifier_invalidate_range_end(mm, &range);
}

/**
@@ -1401,16 +1402,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
- unsigned long end = start + size;
+ struct mmu_notifier_range range = {
+ .start = start,
+ .end = start + size,
+ .event = MMU_MUNMAP,
+ };

lru_add_drain();
- tlb_gather_mmu(&tlb, mm, start, end);
+ tlb_gather_mmu(&tlb, mm, start, range.end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
- for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
- unmap_single_vma(&tlb, vma, start, end, details);
- mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
- tlb_finish_mmu(&tlb, start, end);
+ mmu_notifier_invalidate_range_start(mm, &range);
+ for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
+ unmap_single_vma(&tlb, vma, start, range.end, details);
+ mmu_notifier_invalidate_range_end(mm, &range);
+ tlb_finish_mmu(&tlb, start, range.end);
}

/**
@@ -1427,15 +1432,19 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
- unsigned long end = address + size;
+ struct mmu_notifier_range range = {
+ .start = address,
+ .end = address + size,
+ .event = MMU_MUNMAP,
+ };

lru_add_drain();
- tlb_gather_mmu(&tlb, mm, address, end);
+ tlb_gather_mmu(&tlb, mm, address, range.end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
- unmap_single_vma(&tlb, vma, address, end, details);
- mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
- tlb_finish_mmu(&tlb, address, end);
+ mmu_notifier_invalidate_range_start(mm, &range);
+ unmap_single_vma(&tlb, vma, address, range.end, details);
+ mmu_notifier_invalidate_range_end(mm, &range);
+ tlb_finish_mmu(&tlb, address, range.end);
}

/**
@@ -2055,10 +2064,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
int ret = 0;
int page_mkwrite = 0;
struct page *dirty_page = NULL;
- unsigned long mmun_start = 0; /* For mmu_notifiers */
- unsigned long mmun_end = 0; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
struct mem_cgroup *memcg;

+ range.start = 0;
+ range.end = 0;
+
old_page = vm_normal_page(vma, address, orig_pte);
if (!old_page) {
/*
@@ -2217,10 +2228,10 @@ gotten:
if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
goto oom_free_new;

- mmun_start = address & PAGE_MASK;
- mmun_end = mmun_start + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = address & PAGE_MASK;
+ range.end = range.start + PAGE_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

/*
* Re-check the pte - we dropped the lock
@@ -2290,9 +2301,8 @@ gotten:
page_cache_release(new_page);
unlock:
pte_unmap_unlock(page_table, ptl);
- if (mmun_end > mmun_start)
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ if (range.end > range.start)
+ mmu_notifier_invalidate_range_end(mm, &range);
if (old_page) {
/*
* Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index b5279b8..1b5b9ab 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1776,10 +1776,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
int isolated = 0;
struct page *new_page = NULL;
int page_lru = page_is_file_cache(page);
- unsigned long mmun_start = address & HPAGE_PMD_MASK;
- unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+ struct mmu_notifier_range range;
pmd_t orig_entry;

+ range.start = address & HPAGE_PMD_MASK;
+ range.end = range.start + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+
/*
* Rate-limit the amount of data that is being migrated to a node.
* Optimal placement is no good if the memory bus is saturated and
@@ -1801,7 +1804,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
}

if (mm_tlb_flush_pending(mm))
- flush_tlb_range(vma, mmun_start, mmun_end);
+ flush_tlb_range(vma, range.start, range.end);

/* Prepare a page as a migration target */
__set_page_locked(new_page);
@@ -1814,14 +1817,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
WARN_ON(PageLRU(new_page));

/* Recheck the target PMD */
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(mm, &range);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
fail_putback:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

/* Reverse changes made by migrate_page_copy() */
if (TestClearPageActive(new_page))
@@ -1854,17 +1855,17 @@ fail_putback:
* The SetPageUptodate on the new page and page_add_new_anon_rmap
* guarantee the copy is visible before the pagetable update.
*/
- flush_cache_range(vma, mmun_start, mmun_end);
- page_add_anon_rmap(new_page, vma, mmun_start);
- pmdp_clear_flush_notify(vma, mmun_start, pmd);
- set_pmd_at(mm, mmun_start, pmd, entry);
- flush_tlb_range(vma, mmun_start, mmun_end);
+ flush_cache_range(vma, range.start, range.end);
+ page_add_anon_rmap(new_page, vma, range.start);
+ pmdp_clear_flush_notify(vma, range.start, pmd);
+ set_pmd_at(mm, range.start, pmd, entry);
+ flush_tlb_range(vma, range.start, range.end);
update_mmu_cache_pmd(vma, address, &entry);

if (page_count(page) != 2) {
- set_pmd_at(mm, mmun_start, pmd, orig_entry);
- flush_tlb_range(vma, mmun_start, mmun_end);
- mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+ set_pmd_at(mm, range.start, pmd, orig_entry);
+ flush_tlb_range(vma, range.start, range.end);
+ mmu_notifier_invalidate_range(mm, range.start, range.end);
update_mmu_cache_pmd(vma, address, &entry);
page_remove_rmap(new_page);
goto fail_putback;
@@ -1875,8 +1876,7 @@ fail_putback:
page_remove_rmap(page);

spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

/* Take an "isolate" reference and put new page on the LRU. */
get_page(new_page);
@@ -1901,7 +1901,7 @@ out_dropref:
ptl = pmd_lock(mm, pmd);
if (pmd_same(*pmd, entry)) {
entry = pmd_mknonnuma(entry);
- set_pmd_at(mm, mmun_start, pmd, entry);
+ set_pmd_at(mm, range.start, pmd, entry);
update_mmu_cache_pmd(vma, address, &entry);
}
spin_unlock(ptl);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index e51ea02..142ee8d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,9 +174,7 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
}

void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)

{
struct mmu_notifier *mn;
@@ -185,21 +183,36 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_range_start)
- mn->ops->invalidate_range_start(mn, mm, start,
- end, event);
+ mn->ops->invalidate_range_start(mn, mm, range);
}
srcu_read_unlock(&srcu, id);
+
+ /*
+ * This must happen after the callback so that subsystems can block on
+ * the new invalidation range to synchronize themselves.
+ */
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
+ mm->mmu_notifier_mm->nranges++;
+ spin_unlock(&mm->mmu_notifier_mm->lock);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
struct mmu_notifier *mn;
int id;

+ /*
+ * This must happen before the callback so that subsystems can unblock
+ * when the range invalidation ends.
+ */
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ list_del_init(&range->list);
+ mm->mmu_notifier_mm->nranges--;
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
/*
@@ -211,12 +224,18 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
* (besides the pointer check).
*/
if (mn->ops->invalidate_range)
- mn->ops->invalidate_range(mn, mm, start, end);
+ mn->ops->invalidate_range(mn, mm,
+ range->start, range->end);
if (mn->ops->invalidate_range_end)
- mn->ops->invalidate_range_end(mn, mm, start,
- end, event);
+ mn->ops->invalidate_range_end(mn, mm, range);
}
srcu_read_unlock(&srcu, id);
+
+ /*
+ * Wake up after the callbacks so they can do their job before any of the
+ * waiters resume.
+ */
+ wake_up(&mm->mmu_notifier_mm->wait_queue);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);

@@ -235,6 +254,38 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);

+bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct mmu_notifier_range *range;
+
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
+ if (!(range->end <= start || range->start >= end)) {
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+ return false;
+ }
+ }
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+ return true;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_is_valid);
+
+void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ int nranges = mm->mmu_notifier_mm->nranges;
+
+ while (!mmu_notifier_range_is_valid(mm, start, end)) {
+ wait_event(mm->mmu_notifier_mm->wait_queue,
+ nranges != mm->mmu_notifier_mm->nranges);
+ nranges = mm->mmu_notifier_mm->nranges;
+ }
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_valid);
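/*
 * Illustrative sketch (not part of this patch): a secondary-mmu user is
 * expected to pair the two helpers above the way hmm does later in this
 * series: sleep until no active invalidation overlaps the range, then
 * recheck under its own lock before exposing the range to the device.
 * "driver_lock" below is a hypothetical driver-private lock.
 *
 *	mmu_notifier_range_wait_valid(mm, start, end);
 *	spin_lock(&driver_lock);
 *	if (!mmu_notifier_range_is_valid(mm, start, end)) {
 *		spin_unlock(&driver_lock);
 *		return -EAGAIN;		// an invalidation raced in, retry
 *	}
 *	// ... track the device fault, fill the device page table ...
 *	spin_unlock(&driver_lock);
 */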
+
static int do_mmu_notifier_register(struct mmu_notifier *mn,
struct mm_struct *mm,
int take_mmap_sem)
@@ -264,6 +315,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
if (!mm_has_notifiers(mm)) {
INIT_HLIST_HEAD(&mmu_notifier_mm->list);
spin_lock_init(&mmu_notifier_mm->lock);
+ INIT_LIST_HEAD(&mmu_notifier_mm->ranges);
+ mmu_notifier_mm->nranges = 0;
+ init_waitqueue_head(&mmu_notifier_mm->wait_queue);

mm->mmu_notifier_mm = mmu_notifier_mm;
mmu_notifier_mm = NULL;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2302721..c88f770 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
unsigned long next;
unsigned long pages = 0;
unsigned long nr_huge_updates = 0;
- unsigned long mni_start = 0;
+ struct mmu_notifier_range range = {
+ .start = 0,
+ };

pmd = pmd_offset(pud, addr);
do {
@@ -150,10 +152,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
continue;

/* invoke the mmu notifier if the pmd is populated */
- if (!mni_start) {
- mni_start = addr;
- mmu_notifier_invalidate_range_start(mm, mni_start,
- end, MMU_MPROT);
+ if (!range.start) {
+ range.start = addr;
+ range.end = end;
+ range.event = MMU_MPROT;
+ mmu_notifier_invalidate_range_start(mm, &range);
}

if (pmd_trans_huge(*pmd)) {
@@ -180,8 +183,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pages += this_pages;
} while (pmd++, addr = next, addr != end);

- if (mni_start)
- mmu_notifier_invalidate_range_end(mm, mni_start, end, MMU_MPROT);
+ if (range.start)
+ mmu_notifier_invalidate_range_end(mm, &range);

if (nr_huge_updates)
count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index a39f2aa..22b712f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -167,18 +167,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
bool need_rmap_locks)
{
unsigned long extent, next, old_end;
+ struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;
bool need_flush = false;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */

old_end = old_addr + len;
flush_cache_range(vma, old_addr, old_end);

- mmun_start = old_addr;
- mmun_end = old_end;
- mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = old_addr;
+ range.end = old_end;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(vma->vm_mm, &range);

for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
@@ -230,8 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (likely(need_flush))
flush_tlb_range(vma, old_end-len, old_addr);

- mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, &range);

return len + old_addr - old_end; /* how much done */
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 5fd9ece..98fb97f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1316,15 +1316,14 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
spinlock_t *ptl;
struct page *page;
unsigned long address;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
unsigned long end;
int ret = SWAP_AGAIN;
int locked_vma = 0;
- enum mmu_event event = MMU_MIGRATE;

+ range.event = MMU_MIGRATE;
if (flags & TTU_MUNLOCK)
- event = MMU_MUNLOCK;
+ range.event = MMU_MUNLOCK;

address = (vma->vm_start + cursor) & CLUSTER_MASK;
end = address + CLUSTER_SIZE;
@@ -1337,9 +1336,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
if (!pmd)
return ret;

- mmun_start = address;
- mmun_end = end;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
+ range.start = address;
+ range.end = end;
+ mmu_notifier_invalidate_range_start(mm, &range);

/*
* If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1408,7 +1407,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
+ mmu_notifier_invalidate_range_end(mm, &range);
if (locked_vma)
up_read(&vma->vm_mm->mmap_sem);
return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8afea97..03c1357 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -322,9 +322,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,

static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
@@ -337,7 +335,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
* count is also read inside the mmu_lock critical section.
*/
kvm->mmu_notifier_count++;
- need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+ need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end);
need_tlb_flush |= kvm->tlbs_dirty;
/* we've to flush the tlb before the pages can be freed */
if (need_tlb_flush)
@@ -349,9 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,

static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);

--
1.9.3

2014-11-03 20:47:25

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH 4/5] hmm: heterogeneous memory management v6

From: Jérôme Glisse <[email protected]>

Motivation:

Heterogeneous memory management is intended to allow a device to transparently
access a process address space without having to lock pages of the process or
take references on them. In other words it mirrors a process address space
while allowing regular memory management events, such as page reclamation or
page migration, to happen seamlessly.

Recent years have seen a surge in the number of specialized devices that are
part of a computer platform (from desktop to phone). So far each of those
devices has operated on its own private address space that is neither linked
to nor exposed to the process address space that uses it. This separation often
leads to multiple memory copies between the device owned memory and the
process memory, which is a waste of both cpu cycles and memory.

Over the last few years most of those devices have gained a full mmu, allowing
them to support multiple page tables, page faults and other features that are
found inside a cpu mmu. There is now a strong incentive to start leveraging the
capabilities of such devices and to share the process address space with them,
avoiding any unnecessary memory copy and simplifying the programming model of
those devices by giving them a single, common address space with the process
that uses them.

The aim of heterogeneous memory management is to provide a common API that can
be used by any such device in order to mirror a process address space. The hmm
code provides a single entry point and interfaces itself with the core mm code
of the linux kernel, avoiding duplicate implementations and shielding device
driver code from core mm code.

Moreover, hmm also intends to provide support for migrating memory to device
private memory, allowing the device to work on its own fast local memory. The
hmm code would be responsible for intercepting cpu page faults on a migrated
range and for migrating it back to system memory, allowing the cpu to resume
its access to the memory.

Another feature hmm intends to provide is support for atomic operations from
the device even if the bus linking the device and the cpu does not have such
capabilities. On such hardware an atomic operation requires the page to be
mapped only on the device or only on the cpu, but not on both at the same time.

We expect graphics processing units and network interfaces to be among the
first users of such an api.

Hardware requirements:

Because hmm is intended to be used by device drivers there are minimum feature
requirements for the hardware mmu :
- hardware has its own page table per process (can be shared between different
devices)
- hardware mmu supports page faults and suspends execution until the page fault
is serviced by hmm code. The page fault must also trigger some form of
interrupt so that hmm code can be called by the device driver.
- hardware must support at least read only mappings (otherwise it can not
access read only ranges of the process address space).
- hardware access to system memory must be cache coherent with the cpu.

For better memory management it is highly recommended that the device also
support the following features :
- hardware mmu sets the access bit in its page table on memory access (like
the cpu).
- hardware page table can be updated from the cpu or through a fast path.
- hardware provides advanced statistics on which ranges of memory it accesses
the most.
- hardware differentiates atomic memory accesses from regular accesses,
allowing atomic operations to be supported even on platforms that do not have
atomic support on the bus linking the device with the cpu.

Implementation:

The hmm layer provides a simple API to the device driver. Each device driver
has to register an hmm device that holds pointers to all the callbacks the hmm
code will invoke to synchronize the device page table with the cpu page table
of a given process.

For each process it wants to mirror, the device driver must register an hmm
mirror structure that holds all the information specific to the process being
mirrored. Each hmm mirror uniquely links an hmm device with a process address
space (the mm struct).

This design allows several different device drivers to mirror the same process
concurrently. The hmm layer dispatches to each device driver the modifications
that happen to the process address space.

The hmm layer relies on the mmu notifier api to monitor changes to the process
address space. Because updates to a device page table can have unbounded
completion time, the hmm layer needs the capability to sleep during mmu
notifier callbacks.

This patch only implements the core of the hmm layer and does not support
features such as migration to device memory.
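
For illustration only (this is not part of the patch), below is a minimal
sketch of how a driver is expected to plug into this API. All foo_* names are
hypothetical and error handling is omitted; a real driver also has to manage
the lifetime of its hmm_mirror through the mirror_ref/mirror_unref callbacks.

/* foo_* is a hypothetical driver; only the hmm plumbing is shown. */
static struct hmm_mirror *foo_mirror_ref(struct hmm_mirror *mirror);
static struct hmm_mirror *foo_mirror_unref(struct hmm_mirror *mirror);
static void foo_mirror_release(struct hmm_mirror *mirror);
static int foo_fence_wait(struct hmm_fence *fence);
static void foo_fence_ref(struct hmm_fence *fence);
static void foo_fence_unref(struct hmm_fence *fence);
static struct hmm_fence *foo_update(struct hmm_mirror *mirror,
				    struct hmm_event *event,
				    const struct hmm_range *range);

static const struct hmm_device_ops foo_hmm_ops = {
	.mirror_ref	= foo_mirror_ref,
	.mirror_unref	= foo_mirror_unref,
	.mirror_release	= foo_mirror_release,	/* stop all hw use of the mm */
	.fence_wait	= foo_fence_wait,
	.fence_ref	= foo_fence_ref,
	.fence_unref	= foo_fence_unref,
	.update		= foo_update,		/* sync the device page table */
};

static struct hmm_device foo_hmm_device = {
	.name	= "foo",
	.ops	= &foo_hmm_ops,
};

/* Called once at driver load: one hmm_device per driver. */
static int foo_hmm_init(void)
{
	return hmm_device_register(&foo_hmm_device);
}

/* Called when the driver starts working for a process: one mirror per mm. */
static int foo_hmm_attach(struct hmm_mirror *mirror)
{
	return hmm_mirror_register(mirror, &foo_hmm_device, current->mm);
}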

Changed since v1:
- convert fence to refcounted object
- change the api to provide pte value directly avoiding useless temporary
special hmm pfn value
- cleanups & fixes ...

Changed since v2:
- fixed checkpatch.pl warnings & errors
- converted to a staging feature

Changed since v3:
- Use mmput notifier chain instead of adding hmm destroy call to mmput.
- Clear mm->hmm inside mm_init to match mmu_notifier.
- Separate cpu page table invalidation from device page table fault to
have cleaner and simpler code for synchronization between these two types
of events.
- Remove the hmm_mirror kref and rely on the user to manage the lifetime of
the hmm_mirror.

Changed since v4:
- Invalidate either in range_start() or in range_end() depending on the
kind of mmu event.
- Use the new generic page table implementation to keep an hmm mirror of
the cpu page table.
- Get rid of the range lock exclusion as it is no longer needed.
- Simplify the driver api.
- Support for huge pages.

Changed since v5:
- Take advantage of mmu_notifier tracking active invalidation ranges.
- Adapt to changes to the arch independent page table.

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
---
include/linux/hmm.h | 364 +++++++++++++++
include/linux/mm.h | 11 +
include/linux/mm_types.h | 14 +
kernel/fork.c | 2 +
mm/Kconfig | 15 +
mm/Makefile | 1 +
mm/hmm.c | 1156 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 1563 insertions(+)
create mode 100644 include/linux/hmm.h
create mode 100644 mm/hmm.c

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..3331798
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,364 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* This is heterogeneous memory management (hmm). In a nutshell it provides
+ * an API to mirror a process address space on a device which has its own mmu
+ * and uses its own page table for the process. It supports everything except
+ * special vma.
+ *
+ * Mandatory hardware features :
+ * - An mmu with pagetable.
+ * - Read only flag per cpu page.
+ * - Page fault ie hardware must stop and wait for kernel to service fault.
+ *
+ * Optional hardware features :
+ * - Dirty bit per cpu page.
+ * - Access bit per cpu page.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It supports migrating system memory to device
+ * memory and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped out from the cpu and core mm code
+ * point of view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/rwsem.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_device_ops;
+struct hmm_mirror;
+struct hmm_event;
+struct hmm;
+
+
+/* hmm_fence - device driver fence to wait for device driver operations.
+ *
+ * In order to concurrently update the mmu of several different devices, hmm
+ * relies on device driver fences to wait for the operations it schedules to
+ * complete on the devices. It is strongly recommended to implement fences and
+ * have the hmm callback do as little as possible (just scheduling the update
+ * and returning a fence). Moreover the hmm code will reschedule the current
+ * process for i/o if necessary once it has scheduled all updates on all
+ * devices.
+ *
+ * Each fence is created as a result of either an update to a range of memory
+ * or a dma of remote memory to/from local memory.
+ *
+ * An update to a range of memory corresponds to a specific event type. For
+ * instance a range of memory is unmapped for page reclamation, or unmapped
+ * from the process address space as a result of the munmap syscall
+ * (HMM_MUNMAP), or its memory protection changes. There is one hmm_etype for
+ * each of those events, allowing the device driver to take appropriate action
+ * like, for instance, freeing the device page table on HMM_MUNMAP but keeping
+ * it when it is just an access protection change or a temporary unmap.
+ */
+enum hmm_etype {
+ HMM_NONE = 0,
+ HMM_ISDIRTY,
+ HMM_MIGRATE,
+ HMM_MUNMAP,
+ HMM_RFAULT,
+ HMM_WFAULT,
+ HMM_WRITE_PROTECT,
+};
+
+struct hmm_fence {
+ struct hmm_mirror *mirror;
+ struct list_head list;
+};
+
+
+/* struct hmm_event - used to serialize change to overlapping range of address.
+ *
+ * @list: Core hmm keep track of all active events.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ * @fences: List of device fences associated with this event.
+ * @etype: Event type (munmap, migrate, truncate, ...).
+ * @backoff: Only meaningful for device page fault.
+ */
+struct hmm_event {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+ struct list_head fences;
+ enum hmm_etype etype;
+ bool backoff;
+};
+
+
+/* struct hmm_range - used to communicate range info to various callbacks.
+ *
+ * @pte: The hmm page table entries for the range.
+ * @pdp: The page directory page struct.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ */
+struct hmm_range {
+ unsigned long *pte;
+ struct page *pdp;
+ unsigned long start;
+ unsigned long end;
+};
+
+static inline unsigned long hmm_range_size(struct hmm_range *range)
+{
+ return range->end - range->start;
+}
+
+#define HMM_PTE_VALID_PDIR_BIT 0UL
+#define HMM_PTE_VALID_SMEM_BIT 1UL
+#define HMM_PTE_WRITE_BIT 2UL
+#define HMM_PTE_DIRTY_BIT 3UL
+
+static inline unsigned long hmm_pte_from_pfn(unsigned long pfn)
+{
+ return (pfn << PAGE_SHIFT) | (1UL << HMM_PTE_VALID_SMEM_BIT);
+}
+
+static inline void hmm_pte_mk_dirty(volatile unsigned long *hmm_pte)
+{
+ set_bit(HMM_PTE_DIRTY_BIT, hmm_pte);
+}
+
+static inline void hmm_pte_mk_write(volatile unsigned long *hmm_pte)
+{
+ set_bit(HMM_PTE_WRITE_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_clear_valid_smem(volatile unsigned long *hmm_pte)
+{
+ return test_and_clear_bit(HMM_PTE_VALID_SMEM_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_clear_write(volatile unsigned long *hmm_pte)
+{
+ return test_and_clear_bit(HMM_PTE_WRITE_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_is_valid_smem(const volatile unsigned long *hmm_pte)
+{
+ return test_bit(HMM_PTE_VALID_SMEM_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_is_write(const volatile unsigned long *hmm_pte)
+{
+ return test_bit(HMM_PTE_WRITE_BIT, hmm_pte);
+}
+
+static inline unsigned long hmm_pte_pfn(unsigned long hmm_pte)
+{
+ return hmm_pte >> PAGE_SHIFT;
+}
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device.
+ *
+ * The hmm_device is the link btw hmm and each device driver.
+ */
+
+/* struct hmm_device_operations - hmm device operation callback
+ */
+struct hmm_device_ops {
+ /* mirror_ref() - take reference on mirror struct.
+ *
+ * @mirror: Struct being referenced.
+ */
+ struct hmm_mirror *(*mirror_ref)(struct hmm_mirror *mirror);
+
+ /* mirror_unref() - drop reference on mirror struct.
+ *
+ * @mirror: Struct being dereferenced.
+ */
+ struct hmm_mirror *(*mirror_unref)(struct hmm_mirror *mirror);
+
+ /* mirror_release() - device must stop using the address space.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * This callback is called either on mm destruction or as a result of a
+ * call to hmm_mirror_release(). The device driver has to stop all hw
+ * threads and all usage of the address space, and it has to dirty all
+ * pages that have been dirtied by the device.
+ */
+ void (*mirror_release)(struct hmm_mirror *mirror);
+
+ /* fence_wait() - to wait on device driver fence.
+ *
+ * @fence: The device driver fence struct.
+ * Returns: 0 on success,-EIO on error, -EAGAIN to wait again.
+ *
+ * Called when hmm wants to wait for all operations associated with a
+ * fence to complete (including a device cache flush if the event mandates
+ * it).
+ *
+ * The device driver must free the fence and associated resources if it
+ * returns something other than -EAGAIN. On -EAGAIN the fence must not be
+ * freed as hmm will call back again.
+ *
+ * Return an error if the scheduled operation failed or if it needs to wait
+ * again.
+ * -EIO Some input/output error with the device.
+ * -EAGAIN The fence is not yet signaled, hmm reschedules the waiting thread.
+ *
+ * All other return values trigger a warning and are transformed to -EIO.
+ */
+ int (*fence_wait)(struct hmm_fence *fence);
+
+ /* fence_ref() - take a reference fence structure.
+ *
+ * @fence: Fence structure hmm is referencing.
+ */
+ void (*fence_ref)(struct hmm_fence *fence);
+
+ /* fence_unref() - drop a reference fence structure.
+ *
+ * @fence: Fence structure hmm is dereferencing.
+ */
+ void (*fence_unref)(struct hmm_fence *fence);
+
+ /* update() - update device mmu for a range of addresses.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ * @event: The event that triggered the update.
+ * @range: All information about the range that needs to be updated.
+ * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+ *
+ * Called to update the device page table for a range of addresses.
+ * The event type provides the nature of the update :
+ * - Range is no longer valid (munmap).
+ * - Range protection changes (mprotect, COW, ...).
+ * - Range is unmapped (swap, reclaim, page migration, ...).
+ * - Device page fault.
+ * - ...
+ *
+ * Any event that blocks further writes to the memory must also trigger a
+ * device cache flush, and everything has to be flushed to local memory by
+ * the time the wait callback returns (if this callback returned a fence,
+ * otherwise everything must be flushed by the time the callback returns).
+ *
+ * The device must properly set the dirty bit using the hmm_pte_mk_dirty
+ * helper on each hmm page table entry.
+ *
+ * The driver should return a fence pointer or NULL on success. The device
+ * driver should return a fence and defer waiting for the operation to the
+ * fence wait callback. Returning a fence allows hmm to batch updates to
+ * several devices and delay waiting on those once they all have scheduled
+ * the update.
+ *
+ * The device driver must not fail lightly, any failure results in the
+ * device process being killed.
+ *
+ * Return a fence or NULL on success, error value otherwise :
+ * -ENOMEM Not enough memory for performing the operation.
+ * -EIO Some input/output error with the device.
+ *
+ * All other return values trigger a warning and are transformed to -EIO.
+ */
+ struct hmm_fence *(*update)(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ const struct hmm_range *range);
+};
+
+
+/* struct hmm_device - per device hmm structure
+ *
+ * @name: Device name (uniquely identify the device on the system).
+ * @ops: The hmm operations callback.
+ * @mirrors: List of all active mirrors for the device.
+ * @mutex: Mutex protecting mirrors list.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs (only once).
+ */
+struct hmm_device {
+ const char *name;
+ const struct hmm_device_ops *ops;
+ struct list_head mirrors;
+ struct mutex mutex;
+};
+
+int hmm_device_register(struct hmm_device *device);
+int hmm_device_unregister(struct hmm_device *device);
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct
+ * associating the process address space with the device. The same process can
+ * be mirrored by several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm hmm structure
+ *
+ * @device: The hmm_device struct this hmm_mirror is associated to.
+ * @hmm: The hmm struct this hmm_mirror is associated to.
+ * @dlist: List of all hmm_mirror for same device.
+ * @mlist: List of all hmm_mirror for same process.
+ * @work: Work struct for delayed unreference.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs for each of the address spaces it wants to mirror. The same
+ * device can mirror several different address spaces, and the same address
+ * space can be mirrored by different devices.
+ */
+struct hmm_mirror {
+ struct hmm_device *device;
+ struct hmm *hmm;
+ struct list_head dlist;
+ struct list_head mlist;
+ struct work_struct work;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror,
+ struct hmm_device *device,
+ struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+static inline struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+ if (!mirror || !mirror->device)
+ return NULL;
+
+ return mirror->device->ops->mirror_ref(mirror);
+}
+
+static inline struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror)
+{
+ if (!mirror || !mirror->device)
+ return NULL;
+
+ return mirror->device->ops->mirror_unref(mirror);
+}
+
+void hmm_mirror_release(struct hmm_mirror *mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
+
+
+#endif /* CONFIG_HMM */
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f16c7f9..66a6418 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2179,5 +2179,16 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif

+#ifdef CONFIG_HMM
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+ mm->hmm = NULL;
+}
+#else /* !CONFIG_HMM */
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_HMM */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e0b286..7eeff71 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -16,6 +16,10 @@
#include <asm/page.h>
#include <asm/mmu.h>

+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
#ifndef AT_VECTOR_SIZE_ARCH
#define AT_VECTOR_SIZE_ARCH 0
#endif
@@ -425,6 +429,16 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+#ifdef CONFIG_HMM
+ /*
+ * hmm always registers an mmu_notifier; we rely on the mmu notifier to
+ * keep a refcount on the mm struct as well as to forbid registering hmm
+ * on a dying mm.
+ *
+ * This field is set with mmap_sem held in write mode.
+ */
+ struct hmm *hmm;
+#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 9ca8418..7f1ab4d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
#include <linux/binfmts.h>
#include <linux/mman.h>
#include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/vmacache.h>
@@ -568,6 +569,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm_init_aio(mm);
mm_init_owner(mm, p);
mmu_notifier_mm_init(mm);
+ hmm_mm_init(mm);
clear_tlb_flush_pending(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 1d1ae6b..b249db0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -618,3 +618,18 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.

A sane initial value is 80 MB.
+
+if STAGING
+config HMM
+ bool "Enable heterogeneous memory management (HMM)"
+ depends on MMU
+ select MMU_NOTIFIER
+ select GENERIC_PAGE_TABLE
+ default n
+ help
+ Heterogeneous memory management provides infrastructure for a device
+ to mirror a process address space into a hardware mmu or into anything
+ supporting pagefault-like events.
+
+ If unsure, say N to disable hmm.
+endif # STAGING
diff --git a/mm/Makefile b/mm/Makefile
index e259c5d..8109b19 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -70,3 +70,4 @@ obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..25c20ac
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,1156 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM
+ * intends to provide helpers for mirroring a process address space on a
+ * device as well as allowing migration of data between system memory and
+ * device memory, referred to as remote memory from here on out.
+ *
+ * Refer to include/linux/hmm.h for further information on the general design.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/workqueue.h>
+#include <linux/gpt.h>
+
+#include "internal.h"
+
+/* global SRCU for all HMMs */
+static struct srcu_struct srcu;
+
+
+/* struct hmm - per mm_struct hmm structure
+ *
+ * @device_faults: List of all active device page faults.
+ * @mirrors: List of all mirror for this mm (one per device).
+ * @mm: The mm struct this hmm is associated with.
+ * @ndevice_faults: Number of active device page faults.
+ * @kref: Reference counter
+ * @lock: Serialize the mirror list modifications.
+ * @wait_queue: Wait queue for event synchronization.
+ * @mmu_notifier: The mmu_notifier of this mm.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each device the changes made to
+ * the process address space.
+ */
+struct hmm {
+ struct list_head device_faults;
+ struct list_head mirrors;
+ struct mm_struct *mm;
+ unsigned long ndevice_faults;
+ struct kref kref;
+ spinlock_t lock;
+ wait_queue_head_t wait_queue;
+ struct mmu_notifier mmu_notifier;
+ struct gpt pt;
+};
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+static inline struct hmm *hmm_ref(struct hmm *hmm);
+static inline struct hmm *hmm_unref(struct hmm *hmm);
+
+static void hmm_mirror_delayed_unref(struct work_struct *work);
+static void hmm_mirror_handle_error(struct hmm_mirror *mirror);
+
+static void hmm_device_fence_wait(struct hmm_device *device,
+ struct hmm_fence *fence);
+
+
+/* hmm_event - used to track information relating to an event.
+ *
+ * Each change to the cpu page table or fault from a device is considered an
+ * event by hmm. For each event there is a common set of things that need to
+ * be tracked. The hmm_event struct centralizes those and the helper functions
+ * help deal with all of this.
+ */
+
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+ return !((a->end <= b->start) || (a->start >= b->end));
+}
+
+static inline void hmm_event_init(struct hmm_event *event,
+ unsigned long start,
+ unsigned long end)
+{
+ event->start = start & PAGE_MASK;
+ event->end = PAGE_ALIGN(end);
+ INIT_LIST_HEAD(&event->fences);
+}
+
+static inline void hmm_event_wait(struct hmm_event *event)
+{
+ struct hmm_fence *fence, *tmp;
+
+ if (list_empty(&event->fences))
+ /* Nothing to wait for. */
+ return;
+
+ io_schedule();
+
+ list_for_each_entry_safe(fence, tmp, &event->fences, list) {
+ hmm_device_fence_wait(fence->mirror->device, fence);
+ }
+}
+
+
+/* hmm_range - range helper functions.
+ *
+ * Ranges are used to communicate between the various hmm functions and the
+ * device driver.
+ */
+
+static void hmm_range_update_mirrors(struct hmm_range *range,
+ struct hmm *hmm,
+ struct hmm_event *event)
+{
+ struct hmm_mirror *mirror;
+ int id;
+
+ id = srcu_read_lock(&srcu);
+ list_for_each_entry(mirror, &hmm->mirrors, mlist) {
+ struct hmm_device *device = mirror->device;
+ struct hmm_fence *fence;
+
+ fence = device->ops->update(mirror, event, range);
+ if (fence) {
+ if (IS_ERR(fence)) {
+ hmm_mirror_handle_error(mirror);
+ } else {
+ fence->mirror = hmm_mirror_ref(mirror);
+ list_add_tail(&fence->list, &event->fences);
+ }
+ }
+ }
+ srcu_read_unlock(&srcu, id);
+}
+
+static bool hmm_range_wprot(struct hmm_range *range, struct hmm *hmm)
+{
+ unsigned long i;
+ bool update = false;
+
+ for (i = 0; i < (hmm_range_size(range) >> PAGE_SHIFT); ++i) {
+ update |= hmm_pte_clear_write(&range->pte[i]);
+ }
+ return update;
+}
+
+static void hmm_range_clear(struct hmm_range *range, struct hmm *hmm)
+{
+ unsigned long i;
+
+ for (i = 0; i < (hmm_range_size(range) >> PAGE_SHIFT); ++i)
+ if (hmm_pte_clear_valid_smem(&range->pte[i]))
+ gpt_pdp_unref(&hmm->pt, range->pdp);
+}
+
+
+/* hmm - core hmm functions.
+ *
+ * Core hmm functions that deal with all the process mm activities and use
+ * events for synchronization. Those functions are used mostly as a result of
+ * cpu mm events.
+ */
+
+static uint64_t hmm_pde_from_pdp(struct gpt *gpt, struct page *pdp)
+{
+ uint64_t pde;
+
+ pde = (page_to_pfn(pdp) << PAGE_SHIFT);
+ pde |= (1UL << HMM_PTE_VALID_PDIR_BIT);
+ return pde;
+}
+
+static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
+{
+ int ret;
+
+ hmm->mm = mm;
+ kref_init(&hmm->kref);
+ INIT_LIST_HEAD(&hmm->device_faults);
+ INIT_LIST_HEAD(&hmm->mirrors);
+ spin_lock_init(&hmm->lock);
+ init_waitqueue_head(&hmm->wait_queue);
+ hmm->ndevice_faults = 0;
+
+ /* Initialize page table. */
+ hmm->pt.last_idx = (mm->highest_vm_end - 1UL) >> PAGE_SHIFT;
+ hmm->pt.pde_mask = PAGE_MASK;
+ hmm->pt.pde_shift = PAGE_SHIFT;
+ hmm->pt.pde_valid = 1UL << HMM_PTE_VALID_PDIR_BIT;
+ hmm->pt.pde_from_pdp = &hmm_pde_from_pdp;
+ hmm->pt.gfp_flags = GFP_HIGHUSER;
+ ret = gpt_ulong_init(&hmm->pt);
+ if (ret)
+ return ret;
+
+ /* register notifier */
+ hmm->mmu_notifier.ops = &hmm_notifier_ops;
+ return __mmu_notifier_register(&hmm->mmu_notifier, mm);
+}
+
+static void hmm_del_mirror_locked(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+ list_del_rcu(&mirror->mlist);
+}
+
+static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+ struct hmm_mirror *tmp_mirror;
+
+ spin_lock(&hmm->lock);
+ list_for_each_entry_rcu (tmp_mirror, &hmm->mirrors, mlist)
+ if (tmp_mirror->device == mirror->device) {
+ /* Same device can mirror only once. */
+ spin_unlock(&hmm->lock);
+ return -EINVAL;
+ }
+ list_add_rcu(&mirror->mlist, &hmm->mirrors);
+ spin_unlock(&hmm->lock);
+
+ return 0;
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+ if (hmm) {
+ if (!kref_get_unless_zero(&hmm->kref))
+ return NULL;
+ return hmm;
+ }
+ return NULL;
+}
+
+static void hmm_destroy(struct kref *kref)
+{
+ struct hmm *hmm;
+
+ hmm = container_of(kref, struct hmm, kref);
+
+ down_write(&hmm->mm->mmap_sem);
+ /* A new hmm might have been register before we get call. */
+ if (hmm->mm->hmm == hmm)
+ hmm->mm->hmm = NULL;
+ up_write(&hmm->mm->mmap_sem);
+ mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
+
+ mmu_notifier_synchronize();
+
+ gpt_free(&hmm->pt);
+ kfree(hmm);
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+ if (hmm)
+ kref_put(&hmm->kref, hmm_destroy);
+ return NULL;
+}
+
+static int hmm_device_fault_start(struct hmm *hmm, struct hmm_event *fevent)
+{
+ int ret = 0;
+
+ mmu_notifier_range_wait_valid(hmm->mm, fevent->start, fevent->end);
+
+ spin_lock(&hmm->lock);
+ if (mmu_notifier_range_is_valid(hmm->mm, fevent->start, fevent->end)) {
+ list_add_tail(&fevent->list, &hmm->device_faults);
+ hmm->ndevice_faults++;
+ fevent->backoff = false;
+ } else
+ ret = -EAGAIN;
+ spin_unlock(&hmm->lock);
+ wake_up(&hmm->wait_queue);
+
+ return ret;
+}
+
+static void hmm_device_fault_end(struct hmm *hmm, struct hmm_event *fevent)
+{
+ spin_lock(&hmm->lock);
+ list_del_init(&fevent->list);
+ hmm->ndevice_faults--;
+ spin_unlock(&hmm->lock);
+ wake_up(&hmm->wait_queue);
+}
+
+static void hmm_wait_device_fault(struct hmm *hmm, struct hmm_event *ievent)
+{
+ struct hmm_event *fevent;
+ unsigned long wait_for = 0;
+
+again:
+ spin_lock(&hmm->lock);
+ list_for_each_entry (fevent, &hmm->device_faults, list) {
+ if (!hmm_event_overlap(fevent, ievent))
+ continue;
+ fevent->backoff = true;
+ wait_for = hmm->ndevice_faults;
+ }
+ spin_unlock(&hmm->lock);
+
+ if (wait_for > 0) {
+ wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+ wait_for = 0;
+ goto again;
+ }
+}
+
+static void hmm_update(struct hmm *hmm,
+ struct hmm_event *event)
+{
+ struct hmm_range range;
+ struct gpt_lock lock;
+ struct gpt_iter iter;
+ struct gpt *pt = &hmm->pt;
+
+ /* This hmm is already fully stop. */
+ if (hmm->mm->hmm != hmm)
+ return;
+
+ hmm_wait_device_fault(hmm, event);
+
+ lock.first = event->start >> PAGE_SHIFT;
+ lock.last = (event->end - 1UL) >> PAGE_SHIFT;
+ gpt_ulong_lock_update(&hmm->pt, &lock);
+ gpt_iter_init(&iter, &hmm->pt, &lock);
+ if (!gpt_ulong_iter_first(&iter, event->start >> PAGE_SHIFT,
+ (event->end - 1UL) >> PAGE_SHIFT)) {
+ /* Empty range nothing to invalidate. */
+ gpt_ulong_unlock_update(&hmm->pt, &lock);
+ return;
+ }
+
+ for (range.start = iter.idx << PAGE_SHIFT; iter.pdep;) {
+ bool update_mirrors = true;
+
+ range.pte = iter.pdep;
+ range.pdp = iter.pdp;
+ range.end = min((gpt_pdp_last(pt, iter.pdp) + 1UL) <<
+ PAGE_SHIFT, (uint64_t)event->end);
+ if (event->etype == HMM_WRITE_PROTECT)
+ update_mirrors = hmm_range_wprot(&range, hmm);
+ if (update_mirrors)
+ hmm_range_update_mirrors(&range, hmm, event);
+
+ range.start = range.end;
+ gpt_ulong_iter_first(&iter, range.start >> PAGE_SHIFT,
+ (event->end - 1UL) >> PAGE_SHIFT);
+ }
+
+ hmm_event_wait(event);
+
+ if (event->etype == HMM_MUNMAP || event->etype == HMM_MIGRATE) {
+ BUG_ON(!gpt_ulong_iter_first(&iter, event->start >> PAGE_SHIFT,
+ (event->end - 1UL) >> PAGE_SHIFT));
+ for (range.start = iter.idx << PAGE_SHIFT; iter.pdep;) {
+ range.pte = iter.pdep;
+ range.pdp = iter.pdp;
+ range.end = min((gpt_pdp_last(pt, iter.pdp) + 1UL) <<
+ PAGE_SHIFT, (uint64_t)event->end);
+ hmm_range_clear(&range, hmm);
+ range.start = range.end;
+ gpt_ulong_iter_first(&iter, range.start >> PAGE_SHIFT,
+ (event->end - 1UL) >> PAGE_SHIFT);
+ }
+ }
+
+ gpt_ulong_unlock_update(&hmm->pt, &lock);
+}
+
+static int hmm_do_mm_fault(struct hmm *hmm,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ int r;
+
+ for (; addr < event->end; addr += PAGE_SIZE) {
+ unsigned flags = 0;
+
+ flags |= event->etype == HMM_WFAULT ? FAULT_FLAG_WRITE : 0;
+ flags |= FAULT_FLAG_ALLOW_RETRY;
+ do {
+ r = handle_mm_fault(mm, vma, addr, flags);
+ if (!(r & VM_FAULT_RETRY) && (r & VM_FAULT_ERROR)) {
+ if (r & VM_FAULT_OOM)
+ return -ENOMEM;
+ /* Same error code for all other cases. */
+ return -EFAULT;
+ }
+ flags &= ~FAULT_FLAG_ALLOW_RETRY;
+ } while (r & VM_FAULT_RETRY);
+ }
+
+ return 0;
+}
+
+
+/* hmm_notifier - HMM callbacks for the mmu_notifier tracking changes to the
+ * process mm.
+ *
+ * HMM uses the mmu notifier to track changes made to the process address
+ * space.
+ */
+
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ struct hmm_mirror *mirror;
+ struct hmm *hmm;
+
+ /* The hmm structure can not be freed because the mmu_notifier srcu is
+ * read locked, thus any concurrent hmm_mirror_unregister that would
+ * free hmm would have to wait on the mmu_notifier.
+ */
+ hmm = container_of(mn, struct hmm, mmu_notifier);
+ spin_lock(&hmm->lock);
+ mirror = list_first_or_null_rcu(&hmm->mirrors,
+ struct hmm_mirror,
+ mlist);
+ while (mirror) {
+ hmm_del_mirror_locked(hmm, mirror);
+ spin_unlock(&hmm->lock);
+
+ mirror->device->ops->mirror_release(mirror);
+ INIT_WORK(&mirror->work, hmm_mirror_delayed_unref);
+ schedule_work(&mirror->work);
+
+ spin_lock(&hmm->lock);
+ mirror = list_first_or_null_rcu(&hmm->mirrors,
+ struct hmm_mirror,
+ mlist);
+ }
+ spin_unlock(&hmm->lock);
+
+ synchronize_srcu(&srcu);
+
+ wake_up(&hmm->wait_queue);
+}
+
+static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
+ unsigned long addr,
+ enum mmu_event mmu_event,
+ enum hmm_etype *etype)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, addr);
+ if (!vma || vma->vm_start > addr || !(vma->vm_flags & VM_READ)) {
+ *etype = HMM_MUNMAP;
+ return;
+ }
+
+ if (!(vma->vm_flags & VM_WRITE)) {
+ *etype = HMM_WRITE_PROTECT;
+ return;
+ }
+
+ *etype = HMM_NONE;
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
+{
+ struct hmm_event event;
+ unsigned long start = range->start, end = range->end;
+ struct hmm *hmm;
+
+ /* FIXME this should not happen beside when process is exiting. */
+ if (start >= mm->highest_vm_end)
+ return;
+ if (end > mm->highest_vm_end)
+ end = mm->highest_vm_end;
+
+ switch (range->event) {
+ case MMU_HSPLIT:
+ case MMU_MUNLOCK:
+ /* Still same physical ram backing same address. */
+ return;
+ case MMU_MPROT:
+ hmm_mmu_mprot_to_etype(mm, start, range->event, &event.etype);
+ if (event.etype == HMM_NONE)
+ return;
+ break;
+ case MMU_WRITE_BACK:
+ case MMU_WRITE_PROTECT:
+ event.etype = HMM_WRITE_PROTECT;
+ break;
+ case MMU_ISDIRTY:
+ event.etype = HMM_ISDIRTY;
+ break;
+ case MMU_MUNMAP:
+ event.etype = HMM_MUNMAP;
+ break;
+ case MMU_MIGRATE:
+ default:
+ event.etype = HMM_MIGRATE;
+ break;
+ }
+
+ hmm = container_of(mn, struct hmm, mmu_notifier);
+ hmm_event_init(&event, start, end);
+
+ hmm_update(hmm, &event);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long addr,
+ enum mmu_event mmu_event)
+{
+ struct mmu_notifier_range range;
+
+ range.start = addr & PAGE_MASK;
+ range.end = range.start + PAGE_SIZE;
+ range.event = mmu_event;
+ hmm_notifier_invalidate_range_start(mn, mm, &range);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+ .release = hmm_notifier_release,
+ /* .clear_flush_young FIXME we probably want to do something. */
+ /* .test_young FIXME we probably want to do something. */
+ /* WARNING .change_pte must always be bracketed by range_start/end. There
+ * were patches to remove that behavior; we must make sure that those
+ * patches are not included as there are alternative solutions to the issue
+ * they are trying to solve.
+ *
+ * Fact is hmm can not use the change_pte callback as non sleeping locks
+ * are held during the change_pte callback.
+ */
+ .change_pte = NULL,
+ .invalidate_page = hmm_notifier_invalidate_page,
+ .invalidate_range_start = hmm_notifier_invalidate_range_start,
+};
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A
+ * process can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by a device driver to
+ * mirror the process address space. Those functions either deal with updating
+ * the device page table (through the hmm callbacks) or provide helper
+ * functions used by the device driver to fault in a range of memory in the
+ * device page table.
+ */
+
+/* hmm_mirror_register() - register a device mirror against an mm struct
+ *
+ * @mirror: The mirror that link process address space with the device.
+ * @device: The device struct to associate this mirror with.
+ * @mm: The mm struct of the process.
+ * Returns: 0 success, -ENOMEM or -EINVAL if process already mirrored.
+ *
+ * Called when a device driver wants to start mirroring a process address
+ * space. The hmm shim will register an mmu_notifier and start monitoring
+ * process address space changes. Hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The mm pin must also be held (either the task is current or via
+ * get_task_mm).
+ *
+ * Only one mirror per mm and hmm_device can be created, it will return
+ * -EINVAL if the hmm_device already has an hmm_mirror for the mm.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror,
+ struct hmm_device *device,
+ struct mm_struct *mm)
+{
+ struct hmm *hmm = NULL;
+ int ret = 0;
+
+ /* Sanity checks. */
+ BUG_ON(!mirror);
+ BUG_ON(!device);
+ BUG_ON(!mm);
+
+ /*
+ * Initialize the mirror struct fields, the mlist init and del dance is
+ * necessary to make the error path easier for driver and for hmm.
+ */
+ INIT_LIST_HEAD(&mirror->mlist);
+ list_del(&mirror->mlist);
+ INIT_LIST_HEAD(&mirror->dlist);
+ mutex_lock(&device->mutex);
+ mirror->device = device;
+ list_add(&mirror->dlist, &device->mirrors);
+ mutex_unlock(&device->mutex);
+ mirror->hmm = NULL;
+ mirror = hmm_mirror_ref(mirror);
+ if (!mirror) {
+ mutex_lock(&device->mutex);
+ list_del_init(&mirror->dlist);
+ mutex_unlock(&device->mutex);
+ return -EINVAL;
+ }
+
+ down_write(&mm->mmap_sem);
+
+ hmm = mm->hmm ? hmm_ref(hmm) : NULL;
+ if (hmm == NULL) {
+ /* no hmm registered yet so register one */
+ hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+ if (hmm == NULL) {
+ up_write(&mm->mmap_sem);
+ hmm_mirror_unref(mirror);
+ return -ENOMEM;
+ }
+
+ ret = hmm_init(hmm, mm);
+ if (ret) {
+ up_write(&mm->mmap_sem);
+ hmm_mirror_unref(mirror);
+ kfree(hmm);
+ return ret;
+ }
+
+ mm->hmm = hmm;
+ }
+
+ mirror->hmm = hmm;
+ ret = hmm_add_mirror(hmm, mirror);
+ up_write(&mm->mmap_sem);
+ if (ret) {
+ mirror->hmm = NULL;
+ hmm_mirror_unref(mirror);
+ hmm_unref(hmm);
+ return ret;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+static void hmm_mirror_delayed_unref(struct work_struct *work)
+{
+ struct hmm_mirror *mirror;
+
+ mirror = container_of(work, struct hmm_mirror, work);
+ hmm_mirror_unref(mirror);
+}
+
+static void hmm_mirror_handle_error(struct hmm_mirror *mirror)
+{
+ struct hmm *hmm = mirror->hmm;
+
+ spin_lock(&hmm->lock);
+ if (mirror->mlist.prev != LIST_POISON2) {
+ hmm_del_mirror_locked(hmm, mirror);
+ spin_unlock(&hmm->lock);
+
+ mirror->device->ops->mirror_release(mirror);
+ INIT_WORK(&mirror->work, hmm_mirror_delayed_unref);
+ schedule_work(&mirror->work);
+ } else
+ spin_unlock(&hmm->lock);
+}
+
+/* hmm_mirror_unregister() - unregister an hmm_mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * Device driver must call this function when it is destroying a registered
+ * mirror structure. If destruction was initiated by the device driver then
+ * it must have called hmm_mirror_release() prior to calling this function.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+ BUG_ON(!mirror || !mirror->device);
+ BUG_ON(mirror->mlist.prev != LIST_POISON2);
+
+ mirror->hmm = hmm_unref(mirror->hmm);
+
+ mutex_lock(&mirror->device->mutex);
+ list_del_init(&mirror->dlist);
+ mutex_unlock(&mirror->device->mutex);
+ mirror->device = NULL;
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+/* hmm_mirror_release() - release an hmm_mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * Device driver must call this function when it wants to stop mirroring the
+ * process.
+ */
+void hmm_mirror_release(struct hmm_mirror *mirror)
+{
+ if (!mirror->hmm)
+ return;
+
+ spin_lock(&mirror->hmm->lock);
+ /* Check if the mirror is already removed from the mirror list in which
+ * case there is no reason to call release.
+ */
+ if (mirror->mlist.prev != LIST_POISON2) {
+ hmm_del_mirror_locked(mirror->hmm, mirror);
+ spin_unlock(&mirror->hmm->lock);
+
+ mirror->device->ops->mirror_release(mirror);
+ synchronize_srcu(&srcu);
+
+ hmm_mirror_unref(mirror);
+ } else
+ spin_unlock(&mirror->hmm->lock);
+}
+EXPORT_SYMBOL(hmm_mirror_release);
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ unsigned long *start,
+ struct gpt_iter *iter)
+{
+ unsigned long addr = *start & PAGE_MASK;
+
+ if (!gpt_ulong_iter_idx(iter, addr >> PAGE_SHIFT))
+ return -EINVAL;
+
+ do {
+ struct hmm_device *device = mirror->device;
+ unsigned long *pte = (unsigned long *)iter->pdep;
+ struct hmm_fence *fence;
+ struct hmm_range range;
+
+ if (event->backoff)
+ return -EAGAIN;
+
+ range.start = addr;
+ range.end = min((gpt_pdp_last(iter->gpt, iter->pdp) + 1UL) <<
+ PAGE_SHIFT, (uint64_t)event->end);
+ range.pte = iter->pdep;
+ for (; addr < range.end; addr += PAGE_SIZE, ++pte) {
+ if (!hmm_pte_is_valid_smem(pte)) {
+ *start = addr;
+ return 0;
+ }
+ if (event->etype == HMM_WFAULT &&
+ !hmm_pte_is_write(pte)) {
+ *start = addr;
+ return 0;
+ }
+ }
+
+ fence = device->ops->update(mirror, event, &range);
+ if (fence) {
+ if (IS_ERR(fence)) {
+ *start = range.start;
+ return -EIO;
+ }
+ fence->mirror = hmm_mirror_ref(mirror);
+ list_add_tail(&fence->list, &event->fences);
+ }
+
+ } while (addr < event->end &&
+ gpt_ulong_iter_idx(iter, addr >> PAGE_SHIFT));
+
+ *start = addr;
+ return 0;
+}
+
+struct hmm_mirror_fault {
+ struct hmm_mirror *mirror;
+ struct hmm_event *event;
+ struct vm_area_struct *vma;
+ unsigned long addr;
+ struct gpt_iter *iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ struct gpt_iter *iter,
+ pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end)
+{
+ struct page *page;
+ unsigned long *hmm_pte, i;
+ unsigned flags = FOLL_TOUCH;
+ spinlock_t *ptl;
+
+ ptl = pmd_lock(mirror->hmm->mm, pmdp);
+ if (unlikely(!pmd_trans_huge(*pmdp))) {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }
+ if (unlikely(pmd_trans_splitting(*pmdp))) {
+ spin_unlock(ptl);
+ wait_split_huge_page(vma->anon_vma, pmdp);
+ return -EAGAIN;
+ }
+ flags |= event->etype == HMM_WFAULT ? FOLL_WRITE : 0;
+ page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+ spin_unlock(ptl);
+
+ BUG_ON(!gpt_ulong_iter_idx(iter, start >> PAGE_SHIFT));
+ hmm_pte = iter->pdep;
+
+ gpt_pdp_lock(&mirror->hmm->pt, iter->pdp);
+ for (i = 0; start < end; start += PAGE_SIZE, ++i, ++page) {
+ if (!hmm_pte_is_valid_smem(&hmm_pte[i])) {
+ hmm_pte[i] = hmm_pte_from_pfn(page_to_pfn(page));
+ gpt_pdp_ref(&mirror->hmm->pt, iter->pdp);
+ }
+ BUG_ON(hmm_pte_pfn(hmm_pte[i]) != page_to_pfn(page));
+ if (pmd_write(*pmdp))
+ hmm_pte_mk_write(&hmm_pte[i]);
+ }
+ gpt_pdp_unlock(&mirror->hmm->pt, iter->pdp);
+
+ return 0;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end,
+ struct mm_walk *walk)
+{
+ struct hmm_mirror_fault *mirror_fault = walk->private;
+ struct vm_area_struct *vma = mirror_fault->vma;
+ struct hmm_mirror *mirror = mirror_fault->mirror;
+ struct hmm_event *event = mirror_fault->event;
+ struct gpt_iter *iter = mirror_fault->iter;
+ unsigned long addr = start, i, *hmm_pte;
+ struct hmm *hmm = mirror->hmm;
+ pte_t *ptep;
+ int ret = 0;
+
+ /* Make sure there was no gap. */
+ if (start != mirror_fault->addr)
+ return -ENOENT;
+
+ if (event->backoff)
+ return -EAGAIN;
+
+ if (pmd_none(*pmdp))
+ return -ENOENT;
+
+ if (pmd_trans_huge(*pmdp)) {
+ ret = hmm_mirror_fault_hpmd(mirror, event, vma, iter,
+ pmdp, start, end);
+ mirror_fault->addr = ret ? start : end;
+ return ret;
+ }
+
+ if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+ return -EFAULT;
+
+ BUG_ON(!gpt_ulong_iter_idx(iter, start >> PAGE_SHIFT));
+ hmm_pte = iter->pdep;
+
+ ptep = pte_offset_map(pmdp, start);
+ gpt_pdp_lock(&hmm->pt, iter->pdp);
+ for (i = 0; addr < end; addr += PAGE_SIZE, ++i) {
+ if (!pte_present(*ptep) ||
+ ((event->etype == HMM_WFAULT) && !pte_write(*ptep))) {
+ ptep++;
+ ret = -ENOENT;
+ break;
+ }
+
+ if (!hmm_pte_is_valid_smem(&hmm_pte[i])) {
+ hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+ gpt_pdp_ref(&hmm->pt, iter->pdp);
+ }
+ BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+ if (pte_write(*ptep))
+ hmm_pte_mk_write(&hmm_pte[i]);
+ ptep++;
+ }
+ gpt_pdp_unlock(&hmm->pt, iter->pdp);
+ pte_unmap(ptep - 1);
+ mirror_fault->addr = addr;
+
+ return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct vm_area_struct *vma)
+{
+ struct hmm_mirror_fault mirror_fault;
+ struct mm_walk walk = {0};
+ struct gpt_lock lock;
+ struct gpt_iter iter;
+ unsigned long addr;
+ int ret = 0;
+
+ if ((event->etype == HMM_WFAULT) && !(vma->vm_flags & VM_WRITE))
+ return -EACCES;
+
+ ret = hmm_device_fault_start(mirror->hmm, event);
+ if (ret)
+ return ret;
+
+ addr = event->start;
+ lock.first = event->start >> PAGE_SHIFT;
+ lock.last = (event->end - 1UL) >> PAGE_SHIFT;
+ ret = gpt_ulong_lock_fault(&mirror->hmm->pt, &lock);
+ if (ret) {
+ hmm_device_fault_end(mirror->hmm, event);
+ return ret;
+ }
+ gpt_iter_init(&iter, &mirror->hmm->pt, &lock);
+
+again:
+ ret = hmm_mirror_update(mirror, event, &addr, &iter);
+ if (ret)
+ goto out;
+
+ if (event->backoff) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ if (addr >= event->end)
+ goto out;
+
+ mirror_fault.event = event;
+ mirror_fault.mirror = mirror;
+ mirror_fault.vma = vma;
+ mirror_fault.addr = addr;
+ mirror_fault.iter = &iter;
+ walk.mm = mirror->hmm->mm;
+ walk.private = &mirror_fault;
+ walk.pmd_entry = hmm_mirror_fault_pmd;
+ ret = walk_page_range(addr, event->end, &walk);
+ hmm_event_wait(event);
+ if (!ret)
+ goto again;
+ addr = mirror_fault.addr;
+
+out:
+ gpt_ulong_unlock_fault(&mirror->hmm->pt, &lock);
+ hmm_device_fault_end(mirror->hmm, event);
+ if (ret == -ENOENT) {
+ ret = hmm_do_mm_fault(mirror->hmm, event, vma, addr);
+ ret = ret ? ret : -EAGAIN;
+ }
+ return ret;
+}
+
+/* hmm_mirror_fault() - called by the device driver on a device memory fault.
+ *
+ * @mirror: Mirror related to the fault if any.
+ * @event: Event describing the fault.
+ *
+ * The device driver calls this function either if it needs to fill its page
+ * table for a range of addresses or if it needs to migrate memory between
+ * system and remote memory.
+ *
+ * This function performs the vma lookup and access permission check on behalf
+ * of the device. If the device asks for range [A; D] but there is only a
+ * valid vma starting at B with B > A and B < D, then the call will return
+ * -EFAULT and set event->end to B so the device driver can either report an
+ * issue back or call hmm_mirror_fault again with the range updated to [B; D].
+ *
+ * This allows the device driver to optimistically fault a range of addresses
+ * without having to know the valid vma ranges. The device driver can then
+ * take proper action if a real memory access happens inside an invalid
+ * address range.
+ *
+ * Also the fault will clamp the requested range to the valid vma range
+ * (unless the vma into which event->start falls can grow). So in the previous
+ * example, if D is not covered by any vma, then hmm_mirror_fault will stop at
+ * C with C < D and C being the last address of a valid vma. Also event->end
+ * will be set to C.
+ *
+ * All errors must be handled by the device driver and most likely result in
+ * the process device tasks being killed by the device driver.
+ *
+ * Returns:
+ * > 0 Number of pages faulted.
+ * -EINVAL if invalid argument.
+ * -ENOMEM if failing to allocate memory.
+ * -EACCES if trying to write to a read only address.
+ * -EFAULT if trying to access an invalid address.
+ * -ENODEV if mirror is in the process of being destroyed.
+ * -EIO if the device driver update callback failed.
+ */
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+ struct vm_area_struct *vma;
+ int ret = 0;
+
+ if (!mirror || !event || event->start >= event->end)
+ return -EINVAL;
+
+ hmm_event_init(event, event->start, event->end);
+ if (event->end > mirror->hmm->mm->highest_vm_end)
+ return -EFAULT;
+
+retry:
+ if (!mirror->hmm->mm->hmm)
+ return -ENODEV;
+
+ /*
+ * Synchronization with the cpu page table is the most important and
+ * tedious aspect of device page faults. There must be a strong
+ * ordering between calls to device->update() for device page faults and
+ * device->update() for cpu page table invalidation/update.
+ *
+ * Pages that are exposed to the device driver must stay valid while the
+ * callback is in progress, ie any cpu page table invalidation that
+ * renders those pages obsolete must call device->update() after the
+ * device->update() call that faulted those pages.
+ *
+ * To achieve this we rely on a few things. First, the mmap_sem insures
+ * that any munmap() syscall will serialize with us. So the issues are
+ * with unmap_mapping_range() and with migrating or merging pages. For
+ * this, hmm keeps track of the affected ranges of addresses and blocks
+ * device page faults that hit an overlapping range.
+ */
+ down_read(&mirror->hmm->mm->mmap_sem);
+ vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+ if (!vma) {
+ ret = -EFAULT;
+ goto out;
+ }
+ if (vma->vm_start > event->start) {
+ event->end = vma->vm_start;
+ ret = -EFAULT;
+ goto out;
+ }
+ event->end = min(event->end, vma->vm_end);
+ if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ switch (event->etype) {
+ case HMM_RFAULT:
+ case HMM_WFAULT:
+ ret = hmm_mirror_handle_fault(mirror, event, vma);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+out:
+ /* Drop the mmap_sem so anyone waiting on it has a chance. */
+ up_read(&mirror->hmm->mm->mmap_sem);
+ if (ret == -EAGAIN)
+ goto retry;
+ return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link between hmm and each device driver.
+ */
+
+/* hmm_device_register() - register a device with hmm.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success, -EINVAL otherwise.
+ *
+ * Called when a device driver wants to register itself with hmm. A device
+ * driver can only register once. It returns with a reference held on the
+ * device, thus to release the device the driver must unreference it.
+ */
+int hmm_device_register(struct hmm_device *device)
+{
+ /* sanity check */
+ BUG_ON(!device);
+ BUG_ON(!device->ops);
+ BUG_ON(!device->ops->mirror_ref);
+ BUG_ON(!device->ops->mirror_unref);
+ BUG_ON(!device->ops->mirror_release);
+ BUG_ON(!device->ops->fence_wait);
+ BUG_ON(!device->ops->fence_ref);
+ BUG_ON(!device->ops->fence_unref);
+ BUG_ON(!device->ops->update);
+
+ mutex_init(&device->mutex);
+ INIT_LIST_HEAD(&device->mirrors);
+
+ return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+/* hmm_device_unregister() - unregister a device with hmm.
+ *
+ * @device: The hmm_device struct.
+ *
+ * Called when a device driver wants to unregister itself with hmm. This
+ * checks whether there is any active mirror and returns -EBUSY if so. It is
+ * the device driver's responsibility to clean up and stop all mirrors before
+ * calling this.
+ */
+int hmm_device_unregister(struct hmm_device *device)
+{
+ struct hmm_mirror *mirror;
+
+ mutex_lock(&device->mutex);
+ mirror = list_first_entry_or_null(&device->mirrors,
+ struct hmm_mirror,
+ dlist);
+ mutex_unlock(&device->mutex);
+ if (mirror)
+ return -EBUSY;
+ return 0;
+}
+EXPORT_SYMBOL(hmm_device_unregister);
+
+static void hmm_device_fence_wait(struct hmm_device *device,
+ struct hmm_fence *fence)
+{
+ struct hmm_mirror *mirror;
+ int r;
+
+ if (fence == NULL)
+ return;
+
+ list_del_init(&fence->list);
+ do {
+ r = device->ops->fence_wait(fence);
+ if (r == -EAGAIN)
+ io_schedule();
+ } while (r == -EAGAIN);
+
+ mirror = fence->mirror;
+ device->ops->fence_unref(fence);
+ if (r)
+ hmm_mirror_handle_error(mirror);
+ hmm_mirror_unref(mirror);
+}
+
+
+static int __init hmm_subsys_init(void)
+{
+ return init_srcu_struct(&srcu);
+}
+subsys_initcall(hmm_subsys_init);
--
1.9.3

2014-11-03 20:47:52

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

From: Jérôme Glisse <[email protected]>

A page table is a common structure format, most notably used by the cpu mmu.
The arch dependent page table code has strong ties to the architecture, which
makes it unsuitable for use by other non arch specific code.

This patch implements a generic and arch independent page table. It is generic
in the sense that the entry size can be u64 or unsigned long (or u32 too on
32bit archs).

It is lockless in the sense that at any point in time you can have concurrent
threads updating the page table (removing or changing entries) and faulting in
the page table (adding new entries). This is achieved by requiring each updater
and each faulter to take a range lock. There is no exclusion on range locks,
ie several threads can fault or update the same range concurrently, and it is
the responsibility of the user to synchronize updates to the page table entries
(pte); updates to the page table directories (pdp) are under gpt responsibility.

The API usage pattern is:
gpt_init()

gpt_lock_update(lock_range)
// User can update ptes, for instance by using atomic bit operations,
// allowing completely lockless updates.
gpt_unlock_update(lock_range)

gpt_lock_fault(lock_range)
// User can fault in ptes but is responsible for preventing threads
// from concurrently faulting the same pte and for properly accounting
// the number of ptes faulted in the pdp structure.
gpt_unlock_fault(lock_range)
// Newly faulted ptes only become visible to other updaters once all
// concurrent faulters on the address have unlocked.

Details on how the lockless concurrent updaters and faulters work are provided
in the header file. A concrete usage sketch is given below.
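
As a concrete illustration, here is a sketch of how a user of the unsigned
long flavour might fault in and then fill a range (illustration only: it
assumes the gpt struct was already set up with gpt_ulong_init() and the
pde_valid/pde_mask/pde_shift fields, the pde encoding is one possible user
choice, and the my_* names are made up):

/* Example @pde_from_pdp callback; the encoding is the user's choice. */
static uint64_t my_pde_from_pdp(struct gpt *gpt, struct page *pdp)
{
	return ((uint64_t)page_to_pfn(pdp) << gpt->pde_shift) | gpt->pde_valid;
}

static int my_fault_range(struct gpt *gpt, uint64_t first, uint64_t last)
{
	struct gpt_lock lock = { .first = first, .last = last };
	struct gpt_iter iter;
	int ret;

	/* Populate the directory levels covering [first; last]. */
	ret = gpt_ulong_lock_fault(gpt, &lock);
	if (ret)
		return ret;

	gpt_iter_init(&iter, gpt, &lock);
	if (gpt_ulong_iter_first(&iter, first, last)) {
		do {
			/*
			 * iter.pdep points to the pte for index iter.idx.
			 * Filling it, and accounting it in the directory
			 * page, is the user's responsibility.
			 */
		} while (gpt_ulong_iter_next(&iter));
	}

	gpt_ulong_unlock_fault(gpt, &lock);
	return 0;
}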

Changed since v1:
- Switch to a macro implementation instead of using arithmetic to accommodate
the various sizes for table entries (uint64_t, unsigned long, ...).
This is somewhat less flexible but right now there is no use for the extra
flexibility v1 was offering.

Signed-off-by: Jérôme Glisse <[email protected]>
---
include/linux/gpt.h | 340 +++++++++++++++++++++++++++
lib/Kconfig | 3 +
lib/Makefile | 2 +
lib/gpt.c | 202 ++++++++++++++++
lib/gpt_generic.h | 663 ++++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 1210 insertions(+)
create mode 100644 include/linux/gpt.h
create mode 100644 lib/gpt.c
create mode 100644 lib/gpt_generic.h

diff --git a/include/linux/gpt.h b/include/linux/gpt.h
new file mode 100644
index 0000000..3c28634
--- /dev/null
+++ b/include/linux/gpt.h
@@ -0,0 +1,340 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/*
+ * High level overview
+ * -------------------
+ *
+ * This is a generic, arch independent page table implementation with lockless
+ * (almost lockless) access. The content of the page table, ie the page table
+ * entries, is not protected by the gpt helper; it is up to the code using gpt
+ * to protect the page table entries from concurrent update, with no
+ * restriction on the mechanism (it can be atomic or it can sleep).
+ *
+ * The gpt code only deals with protecting the page directory tree structure,
+ * which is done in a lockless way. Concurrent threads can read and/or write
+ * overlapping ranges of the gpt. There can also be concurrent insertion and
+ * removal of page directories (insertion or removal of a page table level).
+ *
+ * While removal of a page directory is completely lockless, insertion of a new
+ * page directory still requires a lock (to avoid double insertion). If the
+ * architecture has a spinlock in its page struct then several threads can
+ * concurrently insert new directories (levels) as long as they are inserting
+ * into different page directories. Otherwise insertion serializes on a common
+ * spinlock. Note that insertion in this context only refers to inserting page
+ * directories; it does not deal with page table entry insertion, and again it
+ * is the responsibility of the gpt user to properly synchronize those.
+ *
+ *
+ * Each gpt access must be done under gpt lock protection by calling gpt_lock()
+ * with a lock structure. Once a range is "locked" with gpt_lock() all accesses
+ * can be done in a lockless fashion, using either the gpt_walk or gpt_iter
+ * helpers. Note however that only directories that are considered established
+ * are visited, ie if a thread is concurrently inserting a new directory in the
+ * locked range then this directory will be ignored by gpt_walk or gpt_iter.
+ *
+ * This restriction comes from the lockless design. A thread can hold a gpt
+ * lock for a long time, but if it holds it for long enough some of the
+ * internal gpt counters (unsigned long) might wrap around, breaking all
+ * further access (though it is self healing after a period of time). So the
+ * access pattern for gpt should be:
+ * gpt_lock(gpt, lock)
+ * gpt_walk(gpt, lock, walk)
+ * gpt_unlock(gpt, lock)
+ *
+ * Walker callbacks can sleep, but for no longer than it would take for other
+ * threads to wrap around the internal gpt counters through:
+ * gpt_lock_fault(gpt, lock)
+ * ... user faulting in new pte ...
+ * gpt_unlock_fault(gpt, lock)
+ *
+ * The lockless design refers to gpt_lock() and gpt_unlock() taking a spinlock
+ * only to add/remove the lock struct to/from the active lock list, ie no more
+ * than a few instructions in both cases, leaving little room for lock
+ * contention.
+ *
+ * Moreover there is no memory allocation during gpt_lock(), gpt_unlock() or
+ * gpt_walk(). The only constraint is that the lock struct must be the same for
+ * gpt_lock(), gpt_unlock() and gpt_walk().
+ */
+#ifndef __LINUX_GPT_H
+#define __LINUX_GPT_H
+
+#include <linux/mm.h>
+#include <asm/types.h>
+
+struct gpt_walk;
+struct gpt_iter;
+
+/* struct gpt - generic page table structure.
+ *
+ * @pde_from_pdp: Return the page directory entry that corresponds to a page
+ * directory page. This allows the user to use their own custom page directory
+ * entry format for all page directory levels.
+ * @pgd: Page global directory if multi level (tree page table).
+ * @faulters: List of all concurrent fault locks.
+ * @updaters: List of all concurrent update locks.
+ * @pdp_young: List of all young page directory pages. By analogy, directory
+ * pages on the young list are like pages inside an rcu read section and
+ * might be dereferenced by other threads that do not hold a reference on
+ * them. The logic is that an active updater might have taken its lock before
+ * this page directory was added, and once an updater has a lock on a range
+ * it can walk or iterate over the range without holding an rcu read critical
+ * section (allowing the walker or iterator to sleep). Directories are moved
+ * off the young list only once all updaters that never considered them are
+ * done (ie have called gpt_ ## SUFFIX ## _unlock_update()).
+ * @pdp_free: List of all page directory pages to free (delayed free).
+ * @last_idx: Last valid index for this page table. Page table size is derived
+ * from that value.
+ * @pd_shift: Page directory shift value; (1 << pd_shift) is the number of
+ * entries that each page directory holds.
+ * @pde_mask: Mask of the bits corresponding to the pfn value of the lower page
+ * directory in a pde.
+ * @pde_shift: Shift value used to extract the pfn value of the lower page
+ * directory from a pde.
+ * @pde_valid: If pde & pde_valid is not 0 then this is a valid pde entry that
+ * has a valid pfn value for a lower page directory level.
+ * @pgd_shift: Shift value to get the index inside the pgd from an address.
+ * @min_serial: Oldest serial number used by the oldest updater.
+ * @updater_serial: Current serial number used for updaters.
+ * @faulter_serial: Current serial number used for faulters.
+ * @lock: Lock protecting the serial numbers and the updaters/faulters lists.
+ * @pgd_lock: Lock protecting the pgd level (and all levels if the arch does
+ * not have room for a spinlock inside its page struct).
+ */
+struct gpt {
+ uint64_t (*pde_from_pdp)(struct gpt *gpt, struct page *pdp);
+ void *pgd;
+ struct list_head faulters;
+ struct list_head updaters;
+ struct list_head pdp_young;
+ struct list_head pdp_free;
+ uint64_t last_idx;
+ uint64_t pd_shift;
+ uint64_t pde_mask;
+ uint64_t pde_shift;
+ uint64_t pde_valid;
+ uint64_t pgd_shift;
+ unsigned long min_serial;
+ unsigned long faulter_serial;
+ unsigned long updater_serial;
+ spinlock_t lock;
+ spinlock_t pgd_lock;
+ unsigned gfp_flags;
+};
+
+/* struct gpt_lock - generic page table range lock structure.
+ *
+ * @list: List struct for active lock holder lists.
+ * @first: Start address of the locked range (inclusive).
+ * @last: End address of the locked range (inclusive).
+ * @serial: Serial number associated with that lock.
+ *
+ * Before any read/update access to a range of the generic page table, it must
+ * be locked to synchronize with concurrent read/update and insertion. In most
+ * cases gpt_lock will complete by taking only one spinlock to protect the
+ * insertion of the struct into the active lock holder list (either the
+ * updaters or faulters list, depending on whether gpt_lock() or
+ * gpt_fault_lock() was called).
+ */
+struct gpt_lock {
+ struct list_head list;
+ uint64_t first;
+ uint64_t last;
+ unsigned long serial;
+ bool faulter;
+};
+
+/* struct gpt_walk - generic page table range walker structure.
+ *
+ * @lock: The lock protecting this iterator.
+ * @first: First index of the walked range (inclusive).
+ * @last: Last index of the walked range (inclusive).
+ *
+ * This is similar to the cpu page table walker. It allows walking a range of
+ * the generic page table. Note that a gpt walk does not imply protection,
+ * hence you must call gpt_lock() prior to using gpt_walk() if you want to
+ * safely walk the range, as otherwise you will be exposed to all kinds of
+ * synchronization issues.
+ */
+struct gpt_walk {
+ int (*pte)(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ void *ptep,
+ uint64_t first,
+ uint64_t last);
+ int (*pde)(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ void *pdep,
+ uint64_t first,
+ uint64_t last,
+ uint64_t shift);
+ int (*pde_post)(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ void *pdep,
+ uint64_t first,
+ uint64_t last,
+ uint64_t shift);
+ struct gpt_lock *lock;
+ uint64_t first;
+ uint64_t last;
+ void *data;
+};
+
+/* struct gpt_iter - generic page table range iterator structure.
+ *
+ * @gpt: The generic page table structure.
+ * @lock: The lock protecting this iterator.
+ * @pdp: Current page directory page.
+ * @pdep: Pointer to page directory entry for corresponding pdp.
+ * @idx: Current index
+ */
+struct gpt_iter {
+ struct gpt *gpt;
+ struct gpt_lock *lock;
+ struct page *pdp;
+ void *pdep;
+ uint64_t idx;
+};
+
+
+/* Page directory page helpers */
+static inline uint64_t gpt_pdp_shift(struct gpt *gpt, struct page *pdp)
+{
+ if (!pdp)
+ return gpt->pgd_shift;
+ return pdp->flags & 0xff;
+}
+
+static inline uint64_t gpt_pdp_first(struct gpt *gpt, struct page *pdp)
+{
+ if (!pdp)
+ return 0UL;
+ return pdp->index;
+}
+
+static inline uint64_t gpt_pdp_last(struct gpt *gpt, struct page *pdp)
+{
+ if (!pdp)
+ return gpt->last_idx;
+ return min(gpt->last_idx,
+ (uint64_t)(pdp->index +
+ (1UL << (gpt_pdp_shift(gpt, pdp) + gpt->pd_shift)) - 1UL));
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void gpt_pdp_lock(struct gpt *gpt, struct page *pdp)
+{
+ if (pdp)
+ spin_lock(&pdp->ptl);
+ else
+ spin_lock(&gpt->pgd_lock);
+}
+
+static inline void gpt_pdp_unlock(struct gpt *gpt, struct page *pdp)
+{
+ if (pdp)
+ spin_unlock(&pdp->ptl);
+ else
+ spin_unlock(&gpt->pgd_lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void gpt_pdp_lock(struct gpt *gpt, struct page *pdp)
+{
+ spin_lock(&gpt->pgd_lock);
+}
+
+static inline void gpt_pdp_unlock(struct gpt *gpt, struct page *pdp)
+{
+ spin_unlock(&gpt->pgd_lock);
+}
+#endif /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+
+static inline void gpt_pdp_ref(struct gpt *gpt, struct page *pdp)
+{
+ if (pdp)
+ atomic_inc(&pdp->_mapcount);
+}
+
+static inline void gpt_pdp_unref(struct gpt *gpt, struct page *pdp)
+{
+ if (pdp && atomic_dec_and_test(&pdp->_mapcount))
+ BUG();
+}
+
+
+/* Generic page table common functions. */
+void gpt_free(struct gpt *gpt);
+
+
+/* Generic page table type specific functions. */
+int gpt_ulong_init(struct gpt *gpt);
+void gpt_ulong_lock_update(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_ulong_unlock_update(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_ulong_lock_fault(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_ulong_unlock_fault(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_ulong_walk(struct gpt_walk *walk,
+ struct gpt *gpt,
+ struct gpt_lock *lock);
+bool gpt_ulong_iter_idx(struct gpt_iter *iter, uint64_t idx);
+bool gpt_ulong_iter_first(struct gpt_iter *iter,
+ uint64_t first,
+ uint64_t last);
+bool gpt_ulong_iter_next(struct gpt_iter *iter);
+
+int gpt_u64_init(struct gpt *gpt);
+void gpt_u64_lock_update(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_u64_unlock_update(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_u64_lock_fault(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_u64_unlock_fault(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_u64_walk(struct gpt_walk *walk,
+ struct gpt *gpt,
+ struct gpt_lock *lock);
+bool gpt_u64_iter_idx(struct gpt_iter *iter, uint64_t idx);
+bool gpt_u64_iter_first(struct gpt_iter *iter,
+ uint64_t first,
+ uint64_t last);
+bool gpt_u64_iter_next(struct gpt_iter *iter);
+
+#ifndef CONFIG_64BIT
+int gpt_u32_init(struct gpt *gpt);
+void gpt_u32_lock_update(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_u32_unlock_update(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_u32_lock_fault(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_u32_unlock_fault(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_u32_walk(struct gpt_walk *walk,
+ struct gpt *gpt,
+ struct gpt_lock *lock);
+bool gpt_u32_iter_idx(struct gpt_iter *iter, uint64_t idx);
+bool gpt_u32_iter_first(struct gpt_iter *iter,
+ uint64_t first,
+ uint64_t last);
+bool gpt_u32_iter_next(struct gpt_iter *iter);
+#endif
+
+
+/* Generic page table iterator helpers. */
+static inline void gpt_iter_init(struct gpt_iter *iter,
+ struct gpt *gpt,
+ struct gpt_lock *lock)
+{
+ iter->gpt = gpt;
+ iter->lock = lock;
+ iter->pdp = NULL;
+ iter->pdep = NULL;
+}
+
+#endif /* __LINUX_GPT_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 2faf7b2..c041b3c 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -525,4 +525,7 @@ source "lib/fonts/Kconfig"
config ARCH_HAS_SG_CHAIN
def_bool n

+config GENERIC_PAGE_TABLE
+ bool
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 84000ec..e5ad435 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -197,3 +197,5 @@ quiet_cmd_build_OID_registry = GEN $@
clean-files += oid_registry_data.c

obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+
+obj-$(CONFIG_GENERIC_PAGE_TABLE) += gpt.o
diff --git a/lib/gpt.c b/lib/gpt.c
new file mode 100644
index 0000000..3a8e62c
--- /dev/null
+++ b/lib/gpt.c
@@ -0,0 +1,202 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* Generic arch independent page table implementation. See include/linux/gpt.h
+ * for further information on the design.
+ */
+#include <linux/gpt.h>
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include "gpt_generic.h"
+
+
+struct gpt_lock_walk {
+ struct list_head pdp_to_free;
+ struct gpt_lock *lock;
+ unsigned long locked[(1 << (PAGE_SHIFT - 3)) / sizeof(long)];
+};
+
+/* gpt_pdp_before_serial() - is page directory older than given serial.
+ *
+ * @pdp: Pointer to struct page of the page directory.
+ * @serial: Serial number to check against.
+ *
+ * Page table walkers and iterators use this to determine whether the current
+ * pde needs to be walked down/iterated over or not. It is used by updaters to
+ * avoid walking down/iterating over new page directories.
+ */
+static inline bool gpt_pdp_before_serial(struct page *pdp,
+ unsigned long serial)
+{
+ /*
+ * To know whether a page directory is new or old we first check whether it
+ * is on the recently added list. If it is, and its serial number is newer
+ * than or equal to our lock serial number, then it is a new page directory
+ * entry and must be ignored.
+ */
+ return list_empty(&pdp->lru) || time_after(serial, pdp->private);
+}
+
+/* gpt_lock_hold_pdp() - does given lock hold a reference on given directory.
+ *
+ * @lock: Lock to check against.
+ * @pdp: Pointer to struct page of the page directory.
+ *
+ * When walking down the page table or iterating over it, this function is
+ * called to know whether the current pde entry needs to be walked
+ * down/iterated over.
+ */
+static bool gpt_lock_hold_pdp(struct gpt_lock *lock, struct page *pdp)
+{
+ if (lock->faulter)
+ return true;
+ if (!atomic_read(&pdp->_mapcount))
+ return false;
+ if (!gpt_pdp_before_serial(pdp, lock->serial))
+ return false;
+ return true;
+}
+
+static void gpt_lock_walk_update_finish(struct gpt *gpt,
+ struct gpt_lock_walk *wlock)
+{
+ struct gpt_lock *lock = wlock->lock;
+ unsigned long min_serial;
+
+ spin_lock(&gpt->lock);
+ min_serial = gpt->min_serial;
+ list_del_init(&lock->list);
+ lock = list_first_entry_or_null(&gpt->updaters, struct gpt_lock, list);
+ gpt->min_serial = lock ? lock->serial : gpt->updater_serial;
+ spin_unlock(&gpt->lock);
+
+ /*
+ * Drain the young pdp list if the new smallest serial lock holder is
+ * different from previous one.
+ */
+ if (gpt->min_serial != min_serial) {
+ struct page *pdp, *next;
+
+ spin_lock(&gpt->pgd_lock);
+ list_for_each_entry_safe(pdp, next, &gpt->pdp_young, lru) {
+ if (!gpt_pdp_before_serial(pdp, gpt->min_serial))
+ break;
+ list_del_init(&pdp->lru);
+ }
+ list_for_each_entry_safe(pdp, next, &gpt->pdp_free, lru) {
+ if (!gpt_pdp_before_serial(pdp, gpt->min_serial))
+ break;
+ list_del(&pdp->lru);
+ list_add_tail(&pdp->lru, &wlock->pdp_to_free);
+ }
+ spin_unlock(&gpt->pgd_lock);
+ }
+}
+
+/* gpt_lock_fault_finish() - common lock fault cleanup.
+ *
+ * @gpt: The pointer to the generic page table structure.
+ * @wlock: Walk lock structure.
+ *
+ * This function first removes the lock from the faulters list, then updates
+ * the serial number that will be used by the next updater to either the
+ * serial of the oldest active faulter or the next faulter serial number. In
+ * both cases the next updater will ignore directories with a serial equal or
+ * superior to this serial number. In other words it will only consider
+ * directories that are older than the oldest active faulter.
+ *
+ * Note however that the young list is not drained here, as we only want to
+ * drain it once updaters are done, ie once no updater might dereference such
+ * a young page without holding a reference on it. Refer to the gpt struct
+ * comments on the young list.
+ */
+static void gpt_lock_fault_finish(struct gpt *gpt, struct gpt_lock_walk *wlock)
+{
+ struct gpt_lock *lock = wlock->lock;
+
+ spin_lock(&gpt->lock);
+ list_del_init(&lock->list);
+ lock = list_first_entry_or_null(&gpt->faulters, struct gpt_lock, list);
+ if (lock)
+ gpt->updater_serial = lock->serial;
+ else
+ gpt->updater_serial = gpt->faulter_serial;
+ spin_unlock(&gpt->lock);
+}
+
+static void gpt_lock_walk_free_pdp(struct gpt_lock_walk *wlock)
+{
+ struct page *pdp, *tmp;
+
+ if (list_empty(&wlock->pdp_to_free))
+ return;
+
+ synchronize_rcu();
+
+ list_for_each_entry_safe(pdp, tmp, &wlock->pdp_to_free, lru) {
+ /* Restore page struct fields to their expected values. */
+ list_del(&pdp->lru);
+ atomic_dec(&pdp->_mapcount);
+ pdp->mapping = NULL;
+ pdp->index = 0;
+ pdp->flags &= (~0xffUL);
+ __free_page(pdp);
+ }
+}
+
+
+/* Page directory page helpers */
+static inline bool gpt_pdp_cover_idx(struct gpt *gpt,
+ struct page *pdp,
+ unsigned long idx)
+{
+ return (idx >= gpt_pdp_first(gpt, pdp)) &&
+ (idx <= gpt_pdp_last(gpt, pdp));
+}
+
+static inline struct page *gpt_pdp_upper_pdp(struct page *pdp)
+{
+ if (!pdp)
+ return NULL;
+ return pdp->s_mem;
+}
+
+static inline void gpt_pdp_init(struct page *page)
+{
+ atomic_set(&page->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+ spin_lock_init(&page->ptl);
+#endif
+}
+
+
+/* Generic page table common functions. */
+void gpt_free(struct gpt *gpt)
+{
+ BUG_ON(!list_empty(&gpt->faulters));
+ BUG_ON(!list_empty(&gpt->updaters));
+ kfree(gpt->pgd);
+ gpt->pgd = NULL;
+}
+EXPORT_SYMBOL(gpt_free);
+
+
+/* Generic page table type specific functions. */
+GPT_DEFINE(u64, uint64_t, 3);
+#ifdef CONFIG_64BIT
+GPT_DEFINE(ulong, unsigned long, 3);
+#else
+GPT_DEFINE(ulong, unsigned long, 2);
+GPT_DEFINE(u32, uint32_t, 2);
+#endif
diff --git a/lib/gpt_generic.h b/lib/gpt_generic.h
new file mode 100644
index 0000000..c996314
--- /dev/null
+++ b/lib/gpt_generic.h
@@ -0,0 +1,663 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* Generic arch independent page table implementation. See include/linux/gpt.h
+ * for further information on the design.
+ */
+
+/*
+ * Template for implementing generic page table for various types.
+ *
+ * SUFFIX suffix used for naming functions.
+ * TYPE type (uint64_t, unsigned long, ...)
+ * TYPE_SHIFT shift corresponding to TYPE (3 for u64, 2 for u32).
+ *
+ * Note that an entry holds (1 << (TYPE_SHIFT + 3)) bits, which must be enough
+ * to store any pfn and the flags the user wants. For instance on a 32 bit arch
+ * with 36 bit PAE you need 24 bits to store a pfn, thus if you use u32 as the
+ * type you only have 8 bits left for flags in each entry.
+ */
+
+#define GPT_DEFINE(SUFFIX, TYPE, TYPE_SHIFT) \
+ \
+int gpt_ ## SUFFIX ## _init(struct gpt *gpt) \
+{ \
+ unsigned long pgd_size; \
+ \
+ gpt->pgd = NULL; \
+ if (!gpt->last_idx) \
+ return -EINVAL; \
+ INIT_LIST_HEAD(&gpt->faulters); \
+ INIT_LIST_HEAD(&gpt->updaters); \
+ INIT_LIST_HEAD(&gpt->pdp_young); \
+ INIT_LIST_HEAD(&gpt->pdp_free); \
+ spin_lock_init(&gpt->pgd_lock); \
+ spin_lock_init(&gpt->lock); \
+ gpt->pd_shift = (PAGE_SHIFT - TYPE_SHIFT); \
+ gpt->pgd_shift = (__fls(gpt->last_idx) / \
+ (PAGE_SHIFT - (TYPE_SHIFT))) * \
+ (PAGE_SHIFT - (TYPE_SHIFT)); \
+ pgd_size = (gpt->last_idx >> gpt->pgd_shift) << (TYPE_SHIFT); \
+ gpt->pgd = kzalloc(pgd_size, GFP_KERNEL); \
+ gpt->updater_serial = gpt->faulter_serial = gpt->min_serial = 0; \
+ return !gpt->pgd ? -ENOMEM : 0; \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _init); \
+ \
+/* gpt_ ## SUFFIX ## _pde_pdp() - get page directory page from a pde. \
+ * \
+ * @gpt: The pointer to the generic page table structure. \
+ * @pde: Page directory entry to extract the lower directory page from. \
+ */ \
+static inline struct page *gpt_ ## SUFFIX ## _pde_pdp(struct gpt *gpt, \
+ TYPE pde) \
+{ \
+ if (!(pde & gpt->pde_valid)) \
+ return NULL; \
+ return pfn_to_page((pde & gpt->pde_mask) >> gpt->pde_shift); \
+} \
+ \
+/* gpt_ ## SUFFIX ## _pte_from_idx() - pointer to a pte inside directory \
+ * \
+ * @gpt: The pointer to the generic page table structure. \
+ * @pdp: Page directory page if any. \
+ * @idx: Index of the pte that is being looked up. \
+ */ \
+static inline void *gpt_ ## SUFFIX ## _pte_from_idx(struct gpt *gpt, \
+ struct page *pdp, \
+ uint64_t idx) \
+{ \
+ TYPE *ptep = pdp ? page_address(pdp) : gpt->pgd; \
+ \
+ ptep += (idx & ((1UL << gpt->pd_shift) - 1UL)); \
+ return ptep; \
+} \
+ \
+/* gpt_ ## SUFFIX ## _pdep_from_idx() - pointer to directory entry \
+ * \
+ * @gpt: The pointer to the generic page table structure. \
+ * @pdp: Page directory page if any. \
+ * @idx: Index of the pde that is being looked up. \
+ */ \
+static inline void *gpt_ ## SUFFIX ## _pdep_from_idx(struct gpt *gpt, \
+ struct page *pdp, \
+ uint64_t idx) \
+{ \
+ TYPE *pdep = pdp ? page_address(pdp) : gpt->pgd; \
+ uint64_t shift = gpt_pdp_shift(gpt, pdp); \
+ \
+ pdep += ((idx >> shift) & ((1UL << gpt->pd_shift) - 1UL)); \
+ return pdep; \
+} \
+ \
+static int gpt_ ## SUFFIX ## _walk_pde(struct gpt *gpt, \
+ struct gpt_walk *walk, \
+ struct page *pdp, \
+ void *ptr, \
+ uint64_t first, \
+ uint64_t last, \
+ uint64_t shift) \
+{ \
+ unsigned long i, npde; \
+ TYPE *pdep = ptr; \
+ uint64_t cur, lshift, mask, next; \
+ int ret; \
+ \
+ if (walk->pde) { \
+ ret = walk->pde(gpt, walk, pdp, ptr, \
+ first, last, shift); \
+ if (ret) \
+ return ret; \
+ } \
+ \
+ lshift = shift ? shift - gpt->pd_shift : 0; \
+ mask = ~((1ULL << shift) - 1ULL); \
+ npde = ((last - first) >> shift) + 1; \
+ for (i = 0, cur = first; i < npde; ++i, cur = next) { \
+ struct page *lpdp; \
+ TYPE pde = ACCESS_ONCE(pdep[i]); \
+ \
+ next = min((cur & mask) + (1UL << shift), last); \
+ lpdp = gpt_ ## SUFFIX ## _pde_pdp(gpt, pde); \
+ if (!lpdp || !gpt_lock_hold_pdp(walk->lock, lpdp)) \
+ continue; \
+ if (lshift) { \
+ void *lpde; \
+ \
+ lpde = gpt_ ## SUFFIX ## _pdep_from_idx(gpt, \
+ lpdp, \
+ cur); \
+ ret = gpt_ ## SUFFIX ## _walk_pde(gpt, walk, \
+ lpdp, lpde, \
+ cur, next, \
+ lshift); \
+ if (ret) \
+ return ret; \
+ } else if (walk->pte) { \
+ void *lpte; \
+ \
+ lpte = gpt_ ## SUFFIX ## _pte_from_idx(gpt, \
+ lpdp, \
+ cur); \
+ ret = walk->pte(gpt, walk, lpdp, \
+ lpte, cur, next); \
+ if (ret) \
+ return ret; \
+ } \
+ } \
+ \
+ if (walk->pde_post) { \
+ ret = walk->pde_post(gpt, walk, pdp, ptr, \
+ first, last, shift); \
+ if (ret) \
+ return ret; \
+ } \
+ \
+ return 0; \
+} \
+ \
+int gpt_ ## SUFFIX ## _walk(struct gpt_walk *walk, \
+ struct gpt *gpt, \
+ struct gpt_lock *lock) \
+{ \
+ TYPE *pdep = gpt->pgd; \
+ uint64_t idx; \
+ \
+ if (walk->first > gpt->last_idx || walk->last > gpt->last_idx) \
+ return -EINVAL; \
+ \
+ idx = walk->first >> gpt->pgd_shift; \
+ return gpt_ ## SUFFIX ## _walk_pde(gpt, walk, NULL, &pdep[idx], \
+ walk->first, walk->last, \
+ gpt->pgd_shift); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _walk); \
+ \
+static void gpt_ ## SUFFIX ## _pdp_unref(struct gpt *gpt, \
+ struct page *pdp, \
+ struct gpt_lock_walk *wlock, \
+ struct page *updp, \
+ TYPE *upde) \
+{ \
+ /* \
+ * The atomic decrement and test ensures that only one thread \
+ * will clean up the pde. \
+ */ \
+ if (!atomic_dec_and_test(&pdp->_mapcount)) \
+ return; \
+ \
+ /* \
+ * Protection against the race between instancing new pdes and \
+ * clearing pdes due to unref relies on the faulter taking a \
+ * reference on all valid pdes and calling synchronize_rcu() \
+ * afterwards. After the rcu synchronize no further unreference \
+ * can clear a pde in the faulter(s) range(s). \
+ */ \
+ *upde = 0; \
+ if (!list_empty(&pdp->lru)) { \
+ /* \
+ * This means the page directory was added recently but \
+ * is about to be destroyed before it could be removed \
+ * from the young list. \
+ * \
+ * Because it is on the young list and lock holders can \
+ * access the page table without rcu protection, we can \
+ * not rely on synchronize_rcu to know when it is safe \
+ * to free the page as some thread might still be \
+ * dereferencing it. We have to wait for all locks that \
+ * are older than this page directory, at which point we \
+ * know for sure that no thread can dereference the page. \
+ */ \
+ spin_lock(&gpt->pgd_lock); \
+ list_add_tail(&pdp->lru, &gpt->pdp_free); \
+ spin_unlock(&gpt->pgd_lock); \
+ } else \
+ /* \
+ * This means this is an old page directory and thus any \
+ * lock holder that might dereference a pointer to it \
+ * would have a reference on it. Hence because refcount \
+ * reached 0 we only need to wait for an rcu grace period. \
+ */ \
+ list_add_tail(&pdp->lru, &wlock->pdp_to_free); \
+ \
+ /* Un-account this entry; the caller must hold a ref on pdp. */ \
+ if (updp && atomic_dec_and_test(&updp->_mapcount)) \
+ BUG(); \
+} \
+ \
+static int gpt_ ## SUFFIX ## _pde_lock_update(struct gpt *gpt, \
+ struct gpt_walk *walk, \
+ struct page *pdp, \
+ void *ptr, \
+ uint64_t first, \
+ uint64_t last, \
+ uint64_t shift) \
+{ \
+ unsigned long i, npde; \
+ struct gpt_lock_walk *wlock = walk->data; \
+ struct gpt_lock *lock = wlock->lock; \
+ TYPE *pdep = ptr; \
+ \
+ npde = ((last - first) >> shift) + 1; \
+ \
+ rcu_read_lock(); \
+ for (i = 0; i < npde; ++i) { \
+ struct page *page; \
+ TYPE pde = ACCESS_ONCE(pdep[i]); \
+ \
+ clear_bit(i, wlock->locked); \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pde); \
+ if (!page) \
+ continue; \
+ if (!atomic_inc_not_zero(&page->_mapcount)) \
+ continue; \
+ \
+ if (!gpt_pdp_before_serial(page, lock->serial)) { \
+ /* This is a new entry ignore it. */ \
+ gpt_ ## SUFFIX ## _pdp_unref(gpt, page, wlock, \
+ pdp, &pdep[i]); \
+ continue; \
+ } \
+ set_bit(i, wlock->locked); \
+ } \
+ rcu_read_unlock(); \
+ \
+ for (i = 0; i < npde; i++) { \
+ struct page *page; \
+ \
+ if (!test_bit(i, wlock->locked)) \
+ continue; \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pdep[i]); \
+ kmap(page); \
+ } \
+ \
+ return 0; \
+} \
+ \
+void gpt_ ## SUFFIX ## _lock_update(struct gpt *gpt, \
+ struct gpt_lock *lock) \
+{ \
+ struct gpt_lock_walk wlock; \
+ struct gpt_walk walk; \
+ \
+ spin_lock(&gpt->lock); \
+ lock->faulter = false; \
+ lock->serial = gpt->updater_serial; \
+ list_add_tail(&lock->list, &gpt->updaters); \
+ spin_unlock(&gpt->lock); \
+ \
+ INIT_LIST_HEAD(&wlock.pdp_to_free); \
+ wlock.lock = lock; \
+ walk.lock = lock; \
+ walk.data = &wlock; \
+ walk.pde = &gpt_ ## SUFFIX ## _pde_lock_update; \
+ walk.pde_post = NULL; \
+ walk.pte = NULL; \
+ walk.first = lock->first; \
+ walk.last = lock->last; \
+ \
+ gpt_ ## SUFFIX ## _walk(&walk, gpt, lock); \
+ gpt_lock_walk_free_pdp(&wlock); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _lock_update); \
+ \
+static int gpt_ ## SUFFIX ## _pde_unlock_update(struct gpt *gpt, \
+ struct gpt_walk *walk, \
+ struct page *pdp, \
+ void *ptr, \
+ uint64_t first, \
+ uint64_t last, \
+ uint64_t shift) \
+{ \
+ unsigned long i, npde; \
+ struct gpt_lock_walk *wlock = walk->data; \
+ struct gpt_lock *lock = wlock->lock; \
+ TYPE *pdep = ptr; \
+ \
+ npde = ((last - first) >> shift) + 1; \
+ \
+ rcu_read_lock(); \
+ for (i = 0; i < npde; ++i) { \
+ struct page *page; \
+ TYPE pde = ACCESS_ONCE(pdep[i]); \
+ \
+ if (!(pde & gpt->pde_valid)) \
+ continue; \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pde); \
+ if (!page || !gpt_pdp_before_serial(page, lock->serial)) \
+ continue; \
+ kunmap(page); \
+ gpt_ ## SUFFIX ## _pdp_unref(gpt, page, wlock, \
+ pdp, &pdep[i]); \
+ } \
+ rcu_read_unlock(); \
+ \
+ return 0; \
+} \
+ \
+void gpt_ ## SUFFIX ## _unlock_update(struct gpt *gpt, \
+ struct gpt_lock *lock) \
+{ \
+ struct gpt_lock_walk wlock; \
+ struct gpt_walk walk; \
+ \
+ INIT_LIST_HEAD(&wlock.pdp_to_free); \
+ wlock.lock = lock; \
+ walk.lock = lock; \
+ walk.data = &wlock; \
+ walk.pde = NULL; \
+ walk.pde_post = &gpt_ ## SUFFIX ## _pde_unlock_update; \
+ walk.pte = NULL; \
+ walk.first = lock->first; \
+ walk.last = lock->last; \
+ \
+ gpt_ ## SUFFIX ## _walk(&walk, gpt, lock); \
+ \
+ gpt_lock_walk_update_finish(gpt, &wlock); \
+ gpt_lock_walk_free_pdp(&wlock); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _unlock_update); \
+ \
+static int gpt_ ## SUFFIX ## _pde_lock_fault(struct gpt *gpt, \
+ struct gpt_walk *walk, \
+ struct page *pdp, \
+ void *ptr, \
+ uint64_t first, \
+ uint64_t last, \
+ uint64_t shift) \
+{ \
+ unsigned long cmissing, i, npde; \
+ struct gpt_lock_walk *wlock = walk->data; \
+ struct gpt_lock *lock = wlock->lock; \
+ struct list_head pdp_new, pdp_added; \
+ struct page *page, *tmp; \
+ TYPE mask, *pdep = ptr; \
+ int ret; \
+ \
+ npde = ((last - first) >> shift) + 1; \
+ mask = ~((1ULL << shift) - 1ULL); \
+ INIT_LIST_HEAD(&pdp_added); \
+ INIT_LIST_HEAD(&pdp_new); \
+ \
+ rcu_read_lock(); \
+ for (i = 0, cmissing = 0; i < npde; ++i) { \
+ TYPE pde = ACCESS_ONCE(pdep[i]); \
+ \
+ clear_bit(i, wlock->locked); \
+ if (!(pde & gpt->pde_valid)) { \
+ cmissing++; \
+ continue; \
+ } \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pde); \
+ if (!atomic_inc_not_zero(&page->_mapcount)) { \
+ cmissing++; \
+ continue; \
+ } \
+ set_bit(i, wlock->locked); \
+ } \
+ rcu_read_unlock(); \
+ \
+ /* Allocate missing page directory page. */ \
+ for (i = 0; i < cmissing; ++i) { \
+ page = alloc_page(gpt->gfp_flags | __GFP_ZERO); \
+ if (!page) { \
+ ret = -ENOMEM; \
+ goto error; \
+ } \
+ list_add_tail(&page->lru, &pdp_new); \
+ } \
+ \
+ /* \
+ * The synchronize_rcu() is for exclusion with concurrent update \
+ * threads that might try to clear pdes. Because a reference \
+ * was taken just above on all valid pdes we know for sure that \
+ * after the rcu synchronize all threads that were about to clear \
+ * pdes are done and that no new unreference will lead to a pde \
+ * being cleared. \
+ */ \
+ synchronize_rcu(); \
+ \
+ gpt_pdp_lock(gpt, pdp); \
+ for (i = 0; i < npde; ++i) { \
+ TYPE pde = ACCESS_ONCE(pdep[i]); \
+ \
+ if (test_bit(i, wlock->locked)) \
+ continue; \
+ \
+ /* Another thread might already have populated the entry. */ \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pde); \
+ if (page && atomic_inc_not_zero(&page->_mapcount)) \
+ continue; \
+ \
+ page = list_first_entry_or_null(&pdp_new, \
+ struct page, \
+ lru); \
+ BUG_ON(!page); \
+ list_del(&page->lru); \
+ \
+ /* Initialize page directory page struct. */ \
+ page->private = lock->serial; \
+ page->s_mem = pdp; \
+ page->index = (first & mask) + (i << shift); \
+ page->flags |= (shift - gpt->pd_shift) & 0xff; \
+ gpt_pdp_init(page); \
+ list_add_tail(&page->lru, &pdp_added); \
+ \
+ pdep[i] = gpt->pde_from_pdp(gpt, page); \
+ /* Account this new entry inside upper directory. */ \
+ if (pdp) \
+ atomic_inc(&pdp->_mapcount); \
+ } \
+ gpt_pdp_unlock(gpt, pdp); \
+ \
+ spin_lock(&gpt->pgd_lock); \
+ list_splice_tail(&pdp_added, &gpt->pdp_young); \
+ spin_unlock(&gpt->pgd_lock); \
+ \
+ for (i = 0; i < npde; ++i) { \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pdep[i]); \
+ kmap(page); \
+ } \
+ \
+ /* Free any left over pages. */ \
+ list_for_each_entry_safe (page, tmp, &pdp_new, lru) { \
+ list_del(&page->lru); \
+ __free_page(page); \
+ } \
+ return 0; \
+ \
+error: \
+ /* \
+ * We know that no page is kmapped and no pages were added to the \
+ * directory tree. \
+ */ \
+ list_for_each_entry_safe (page, tmp, &pdp_new, lru) { \
+ list_del(&page->lru); \
+ __free_page(page); \
+ } \
+ \
+ rcu_read_lock(); \
+ for (i = 0; i < npde; ++i) { \
+ if (test_bit(i, wlock->locked)) \
+ continue; \
+ \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pdep[i]); \
+ gpt_ ## SUFFIX ## _pdp_unref(gpt, page, wlock, \
+ pdp, &pdep[i]); \
+ } \
+ rcu_read_unlock(); \
+ \
+ walk->last = first; \
+ return ret; \
+} \
+ \
+static int gpt_ ## SUFFIX ## _pde_unlock_fault(struct gpt *gpt, \
+ struct gpt_walk *walk, \
+ struct page *pdp, \
+ void *ptr, \
+ uint64_t first, \
+ uint64_t last, \
+ uint64_t shift) \
+{ \
+ unsigned long i, npde; \
+ struct gpt_lock_walk *wlock = walk->data; \
+ struct gpt_lock *lock = wlock->lock; \
+ TYPE *pdep = ptr; \
+ \
+ npde = ((last - first) >> shift) + 1; \
+ \
+ rcu_read_lock(); \
+ for (i = 0; i < npde; ++i) { \
+ struct page *page; \
+ \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pdep[i]); \
+ if (!page || !gpt_lock_hold_pdp(lock, page)) \
+ continue; \
+ kunmap(page); \
+ gpt_ ## SUFFIX ## _pdp_unref(gpt, page, wlock, \
+ pdp, &pdep[i]); \
+ } \
+ rcu_read_unlock(); \
+ \
+ return 0; \
+} \
+ \
+int gpt_ ## SUFFIX ## _lock_fault(struct gpt *gpt, \
+ struct gpt_lock *lock) \
+{ \
+ struct gpt_lock_walk wlock; \
+ struct gpt_walk walk; \
+ int ret; \
+ \
+ lock->faulter = true; \
+ spin_lock(&gpt->lock); \
+ lock->serial = gpt->faulter_serial++; \
+ list_add_tail(&lock->list, &gpt->faulters); \
+ spin_unlock(&gpt->lock); \
+ \
+ INIT_LIST_HEAD(&wlock.pdp_to_free); \
+ wlock.lock = lock; \
+ walk.lock = lock; \
+ walk.data = &wlock; \
+ walk.pde = &gpt_ ## SUFFIX ## _pde_lock_fault; \
+ walk.pde_post = NULL; \
+ walk.pte = NULL; \
+ walk.first = lock->first; \
+ walk.last = lock->last; \
+ \
+ ret = gpt_ ## SUFFIX ## _walk(&walk, gpt, lock); \
+ if (ret) { \
+ walk.pde = NULL; \
+ walk.pde_post = &gpt_ ## SUFFIX ## _pde_unlock_fault; \
+ gpt_ ## SUFFIX ## _walk(&walk, gpt, lock); \
+ gpt_lock_fault_finish(gpt, &wlock); \
+ } \
+ gpt_lock_walk_free_pdp(&wlock); \
+ \
+ return ret; \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _lock_fault); \
+ \
+void gpt_ ## SUFFIX ## _unlock_fault(struct gpt *gpt, \
+ struct gpt_lock *lock) \
+{ \
+ struct gpt_lock_walk wlock; \
+ struct gpt_walk walk; \
+ \
+ INIT_LIST_HEAD(&wlock.pdp_to_free); \
+ wlock.lock = lock; \
+ walk.lock = lock; \
+ walk.data = &wlock; \
+ walk.pde = NULL; \
+ walk.pde_post = &gpt_ ## SUFFIX ## _pde_unlock_fault; \
+ walk.pte = NULL; \
+ walk.first = lock->first; \
+ walk.last = lock->last; \
+ \
+ gpt_ ## SUFFIX ## _walk(&walk, gpt, lock); \
+ \
+ gpt_lock_fault_finish(gpt, &wlock); \
+ gpt_lock_walk_free_pdp(&wlock); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _unlock_fault); \
+ \
+static bool gpt_ ## SUFFIX ## _iter_idx_pdp(struct gpt_iter *iter, \
+ uint64_t idx) \
+{ \
+ struct gpt *gpt = iter->gpt; \
+ TYPE pde, *pdep; \
+ \
+ if (!gpt_pdp_cover_idx(gpt, iter->pdp, idx)) { \
+ iter->pdp = gpt_pdp_upper_pdp(iter->pdp); \
+ return gpt_ ## SUFFIX ## _iter_idx_pdp(iter, idx); \
+ } \
+ pdep = gpt_ ## SUFFIX ## _pdep_from_idx(gpt, iter->pdp, idx); \
+ if (!gpt_pdp_shift(gpt, iter->pdp)) { \
+ iter->pdep = pdep; \
+ iter->idx = idx; \
+ return true; \
+ } \
+ pde = ACCESS_ONCE(*pdep); \
+ if (!(pde & iter->gpt->pde_valid)) { \
+ iter->pdep = NULL; \
+ return false; \
+ } \
+ iter->pdp = gpt_ ## SUFFIX ## _pde_pdp(iter->gpt, pde); \
+ return gpt_ ## SUFFIX ## _iter_idx_pdp(iter, idx); \
+} \
+ \
+bool gpt_ ## SUFFIX ## _iter_idx(struct gpt_iter *iter, uint64_t idx) \
+{ \
+ iter->pdep = NULL; \
+ if ((idx < iter->lock->first) || (idx > iter->lock->last)) \
+ return false; \
+ \
+ return gpt_ ## SUFFIX ## _iter_idx_pdp(iter, idx); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _iter_idx); \
+ \
+bool gpt_ ## SUFFIX ## _iter_first(struct gpt_iter *iter, \
+ uint64_t first, \
+ uint64_t last) \
+{ \
+ iter->pdep = NULL; \
+ if (first > last) \
+ return false; \
+ if ((first < iter->lock->first) || (first > iter->lock->last)) \
+ return false; \
+ if ((last < iter->lock->first) || (last > iter->lock->last)) \
+ return false; \
+ \
+ do { \
+ if (gpt_ ## SUFFIX ## _iter_idx_pdp(iter, first)) \
+ return true; \
+ if (first < last) \
+ first++; \
+ else \
+ return false; \
+ } while (1); \
+ return false; \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _iter_first); \
+ \
+bool gpt_ ## SUFFIX ## _iter_next(struct gpt_iter *iter) \
+{ \
+ if (!iter->pdep || iter->idx >= iter->lock->last) \
+ return false; \
+ return gpt_ ## SUFFIX ## _iter_first(iter, \
+ iter->idx + 1, \
+ iter->lock->last); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _iter_next)
--
1.9.3

2014-11-06 17:17:31

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 1/5] mmu_notifier: add event information to address invalidation v5


On 11/03/2014 03:42 PM, [email protected] wrote:
> From: Jérôme Glisse <[email protected]>
>
> The event information will be usefull for new user of mmu_notifier
> API. The event argument differentiate between a vma disappearing, a
> page being write protected or simply a page being unmaped. This
> allow new user to take different path for different event for
> instance on unmap the resource used to track a vma are still valid
> and should stay around. While if the event is saying that a vma is
> being destroy it means that any resources used to track this vma
> can be free.

Looks good. All I found was one spelling mistake :)

> + * - MMU_WRITE_BACK: memory is being written back to disk, all
> write accesses + * must stop after invalidate_range_start
> callback returns. Read access are + * still allowed. + * + *
> - MMU_WRITE_PROTECT: memory is being writte protected (ie should be
> mapped

"write protected"

> + * read only no matter what the vma memory protection allows).
> All write + * accesses must stop after invalidate_range_start
> callback returns. Read + * access are still allowed.

After fixing the spelling mistake, feel free to add my

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2014-11-06 21:04:16

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2/5] mmu_notifier: keep track of active invalidation ranges


On 11/03/2014 03:42 PM, [email protected] wrote:
> From: Jérôme Glisse <[email protected]>
>
> The mmu_notifier_invalidate_range_start() and
> mmu_notifier_invalidate_range_end() can be considered as forming
> an "atomic" section for the cpu page table update point of view.
> Between this two function the cpu page table content is unreliable
> for the address range being invalidated.
>
> Current user such as kvm need to know when they can trust the
> content of the cpu page table. This becomes even more important to
> new users of the mmu_notifier api (such as HMM or ODP).
>
> This patch use a structure define at all call site to
> invalidate_range_start() that is added to a list for the duration
> of the invalidation. It adds two new helpers to allow querying if
> a range is being invalidated or to wait for a range to become
> valid.
>
> For proper synchronization, user must block new range invalidation
> from inside there invalidate_range_start() callback, before
> calling the helper functions. Otherwise there is no garanty that a
> new range invalidation will not be added after the call to the
> helper function to query for existing range.
>
> Signed-off-by: Jérôme Glisse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2014-11-06 22:32:57

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.


On 11/03/2014 03:42 PM, [email protected] wrote:
> From: Jérôme Glisse <[email protected]>
>
> Page table is a common structure format most notably use by cpu
> mmu. The arch depend page table code has strong tie to the
> architecture which makes it unsuitable to be use by other non arch
> specific code.
>
> This patch implement a generic and arch independent page table. It
> is generic in the sense that entry size can be u64 or unsigned long
> (or u32 too on 32bits arch).
>
> It is lockless in the sense that at any point in time you can have
> concurrent thread updating the page table (removing or changing
> entry) and faulting in the page table (adding new entry). This is
> achieve by enforcing each updater and each faulter to take a range
> lock. There is no exclusion on range lock, ie several thread can
> fault or update the same range concurrently and it is the
> responsability of the user to synchronize update to the page table
> entry (pte), update to the page table directory (pdp) is under gpt
> responsability.
>
> API usage pattern is : gpt_init()
>
> gpt_lock_update(lock_range) // User can update pte for instance by
> using atomic bit operation // allowing complete lockless update.
> gpt_unlock_update(lock_range)
>
> gpt_lock_fault(lock_range) // User can fault in pte but he is
> responsible for avoiding thread // to concurrently fault the same
> pte and for properly accounting // the number of pte faulted in the
> pdp structure. gpt_unlock_fault(lock_range) // The new faulted pte
> will only be visible to others updaters only // once all concurrent
> faulter on the address unlock.
>
> Details on how the lockless concurrent updater and faulter works is
> provided in the header file.
>
> Changed since v1: - Switch to macro implementation instead of using
> arithmetic to accomodate the various size for table entry
> (uint64_t, unsigned long, ...). This is somewhat less flexbile but
> right now there is no use for the extra flexibility v1 was
> offering.
>
> Signed-off-by: Jérôme Glisse <[email protected]>

Never a fan of preprocessor magic, but I see why it's needed.

Acked-by: Rik van Riel <[email protected]>


--
All rights reversed

2014-11-06 22:57:05

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.


On 11/06/2014 05:40 PM, Jerome Glisse wrote:
> On Thu, Nov 06, 2014 at 05:32:00PM -0500, Rik van Riel wrote:

> Never a fan of preprocessor magic, but I see why it's needed.
>
> Acked-by: Rik van Riel <[email protected]>
>
>> v1 is not using preprocessor but has a bigger gpt struct
>> footprint and also more complex calculation for page table
>> walking due to the fact that i just rely more on runtime
>> computation than on compile time shift define through
>> preprocessor magic.
>
>> Given i am not a fan either of preprocessor magic if it makes you
>> feel any better i can resort to use v1, both have seen same kind
>> of testing and both are functionaly equivalent (API they expose
>> is obviously slightly different).
>
>> I am not convince that what the computation i save using
>> preprocessor will show up in anyway as being bottleneck for hot
>> path.

I have no strong preference either way. This code is perfectly readable.

Andrew?

--
All rights reversed

2014-11-06 23:13:37

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Thu, Nov 06, 2014 at 05:32:00PM -0500, Rik van Riel wrote:
>
> On 11/03/2014 03:42 PM, [email protected] wrote:
> > From: Jérôme Glisse <[email protected]>
> >
> > Page table is a common structure format most notably use by cpu
> > mmu. The arch depend page table code has strong tie to the
> > architecture which makes it unsuitable to be use by other non arch
> > specific code.
> >
> > This patch implement a generic and arch independent page table. It
> > is generic in the sense that entry size can be u64 or unsigned long
> > (or u32 too on 32bits arch).
> >
> > It is lockless in the sense that at any point in time you can have
> > concurrent thread updating the page table (removing or changing
> > entry) and faulting in the page table (adding new entry). This is
> > achieve by enforcing each updater and each faulter to take a range
> > lock. There is no exclusion on range lock, ie several thread can
> > fault or update the same range concurrently and it is the
> > responsability of the user to synchronize update to the page table
> > entry (pte), update to the page table directory (pdp) is under gpt
> > responsability.
> >
> > API usage pattern is : gpt_init()
> >
> > gpt_lock_update(lock_range) // User can update pte for instance by
> > using atomic bit operation // allowing complete lockless update.
> > gpt_unlock_update(lock_range)
> >
> > gpt_lock_fault(lock_range) // User can fault in pte but he is
> > responsible for avoiding thread // to concurrently fault the same
> > pte and for properly accounting // the number of pte faulted in the
> > pdp structure. gpt_unlock_fault(lock_range) // The new faulted pte
> > will only be visible to others updaters only // once all concurrent
> > faulter on the address unlock.
> >
> > Details on how the lockless concurrent updater and faulter works is
> > provided in the header file.
> >
> > Changed since v1: - Switch to macro implementation instead of using
> > arithmetic to accomodate the various size for table entry
> > (uint64_t, unsigned long, ...). This is somewhat less flexbile but
> > right now there is no use for the extra flexibility v1 was
> > offering.
> >
> > Signed-off-by: Jérôme Glisse <[email protected]>
>
> Never a fan of preprocessor magic, but I see why it's needed.
>
> Acked-by: Rik van Riel <[email protected]>

v1 does not use the preprocessor but has a bigger gpt struct footprint and also
more complex calculations for page table walking, due to the fact that it relies
more on runtime computation than on compile time shifts defined through
preprocessor magic.

Given that i am not a fan of preprocessor magic either, if it makes you feel any
better i can resort to using v1; both have seen the same kind of testing and both
are functionally equivalent (the API they expose is obviously slightly different).

I am not convinced that the computation i save using the preprocessor will show
up in any way as a bottleneck for a hot path.

Cheers,
Jérôme

>
>
> --
> All rights reversed

2014-11-07 21:36:53

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 4/5] hmm: heterogeneous memory management v6


On 11/03/2014 03:42 PM, [email protected] wrote:
> From: Jérôme Glisse <[email protected]>
>
> Motivation:
>
> Heterogeneous memory management is intended to allow a device to
> transparently access a process address space without having to lock
> pages of the process or take references on them. In other word
> mirroring a process address space while allowing the regular memory
> management event such as page reclamation or page migration, to
> happen seamlessly.
>
> Recent years have seen a surge into the number of specialized
> devices that are part of a computer platform (from desktop to
> phone). So far each of those devices have operated on there own
> private address space that is not link or expose to the process
> address space that is using them. This separation often leads to
> multiple memory copy happening between the device owned memory and
> the process memory. This of course is both a waste of cpu cycle and
> memory.
>
> Over the last few years most of those devices have gained a full
> mmu allowing them to support multiple page table, page fault and
> other features that are found inside cpu mmu. There is now a strong
> incentive to start leveraging capabilities of such devices and to
> start sharing process address to avoid any unnecessary memory copy
> as well as simplifying the programming model of those devices by
> sharing an unique and common address space with the process that
> use them.
>
> The aim of the heterogeneous memory management is to provide a
> common API that can be use by any such devices in order to mirror
> process address. The hmm code provide an unique entry point and
> interface itself with the core mm code of the linux kernel avoiding
> duplicate implementation and shielding device driver code from core
> mm code.

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2014-11-07 21:38:29

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 5/5] hmm/dummy: dummy driver to showcase the hmm api v3


On 11/03/2014 03:42 PM, [email protected] wrote:
> From: Jérôme Glisse <[email protected]>
>
> This is a dummy driver which full fill two purposes : - showcase
> the hmm api and gives references on how to use it. - provide an
> extensive user space api to stress test hmm.
>
> This is a particularly dangerous module as it allow to access a
> mirror of a process address space through its device file. Hence it
> should not be enabled by default and only people actively
> developing for hmm should use it.
>
> Changed since v1: - Fixed all checkpatch.pl issue (ignoreing some
> over 80 characters).
>
> Changed since v2: - Rebase and adapted to lastest change.
>
> Signed-off-by: Jérôme Glisse <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed