2014-11-10 18:29:10

by Jerome Glisse

Subject: HMM (heterogeneous memory management) v6

Andrew, I am resending this with reviews and acks from Rik and a couple of minor
fixes along the way. Is there anything blocking this from getting into the next
kernel? Again, hardware is coming and there is still a long list of features
waiting on this core set of patches getting in. I include part of my previous
email below.


What is it?

In a nutshell, HMM is a subsystem that provides an easy-to-use API to mirror a
process address space on a device with minimal hardware requirements (mainly
device page faults and read-only page mapping). It does not rely on the ATS and
PASID PCIe extensions. It intends to supersede those extensions by allowing
system memory to be moved to device memory in a fashion that is transparent to
core kernel mm code (i.e. a CPU page fault on a page residing in device memory
triggers migration back to system memory).


Why do this?

We want to be able to mirror a process address space so that compute APIs such
as OpenCL can use the exact same address space on the GPU as on the CPU. This
will greatly simplify use of those APIs. Moreover, we believe we will see more
and more specialized functional units that want to mirror a process address
space using their own MMU.

The migration side is simply because GPU memory bandwidth is far beyond system
memory bandwidth and there is no sign that this gap is closing (quite the
opposite).


Current status and future features:

None of this core code changes core kernel mm code in any major way. This
is simple groundwork with no impact on existing code paths. Features that
will be implemented on top of it are:
1 - Transparently handle page mapping on behalf of device drivers (DMA).
2 - Improve the DMA API to better match the new usage pattern of HMM.
3 - Migration of anonymous memory to device memory.
4 - Locking memory to remote memory (CPU access triggers SIGBUS).
5 - Access exclusion between CPU and device for atomic operations.
6 - Migration of file-backed memory to device memory.


How future features will be implemented:
1 - Simply use the existing DMA API to map pages on behalf of a device.
2 - Introduce a new DMA API to match the new semantics of HMM. It is no longer
pages we map but address ranges, and which page is effectively backing
an address should be easy to update. I gave a presentation about that
during this LPC.
3 - Requires changes to the CPU page fault code path to handle migration back
to system memory on CPU access. An implementation of this was already sent
as part of v1. It will be low impact and only adds handling of a new
special swap type to the existing fault code.
4 - Requires a new syscall, as I cannot see which current syscall would be
appropriate for this. My first thought was to use mbind as it has the
right semantics (binding a range of addresses to a device), but mbind is
too NUMA centric.

My second thought was madvise, but its semantics do not match: madvise
allows the kernel to ignore the hints, while we want to block CPU access
for as long as the range is bound to a device.

So I do not think any existing syscall can be extended with new flags,
but maybe I am wrong.
5 - Allow mapping a page as read only on the CPU while a device performs an
atomic operation on it (this is mainly to work around system buses that
do not support atomic memory access, and sadly there is a large base of
hardware without that feature).

The easiest implementation would use a page flag, but there are none
left. So it must be a flag in the vma indicating whether HMM needs to
be queried for write protection.

6 - This is the trickiest one to implement, and while I showed a proof of
concept with v1, I still have a lot of conflicting feelings about how
to achieve this.


As usual, comments are more than welcome. Thanks in advance to anyone who takes
a look at this code.

Previous patchset posting :
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559 (cover letter did not make it to ml)
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423
v5 https://lkml.org/lkml/2014/11/3/759

Cheers,
Jérôme

To: "Andrew Morton" <[email protected]>,
Cc: <[email protected]>,
Cc: linux-mm <[email protected]>,
Cc: <[email protected]>,
Cc: "Linus Torvalds" <[email protected]>,
Cc: "Mel Gorman" <[email protected]>,
Cc: "H. Peter Anvin" <[email protected]>,
Cc: "Peter Zijlstra" <[email protected]>,
Cc: "Linda Wang" <[email protected]>,
Cc: "Kevin E Martin" <[email protected]>,
Cc: "Jerome Glisse" <[email protected]>,
Cc: "Andrea Arcangeli" <[email protected]>,
Cc: "Johannes Weiner" <[email protected]>,
Cc: "Larry Woodman" <[email protected]>,
Cc: "Rik van Riel" <[email protected]>,
Cc: "Dave Airlie" <[email protected]>,
Cc: "Jeff Law" <[email protected]>,
Cc: "Brendan Conoboy" <[email protected]>,
Cc: "Joe Donohue" <[email protected]>,
Cc: "Duncan Poole" <[email protected]>,
Cc: "Sherry Cheung" <[email protected]>,
Cc: "Subhash Gutti" <[email protected]>,
Cc: "John Hubbard" <[email protected]>,
Cc: "Mark Hairgrove" <[email protected]>,
Cc: "Lucien Dunning" <[email protected]>,
Cc: "Cameron Buschardt" <[email protected]>,
Cc: "Arvind Gopalakrishnan" <[email protected]>,
Cc: "Haggai Eran" <[email protected]>,
Cc: "Or Gerlitz" <[email protected]>,
Cc: "Sagi Grimberg" <[email protected]>
Cc: "Shachar Raindel" <[email protected]>,
Cc: "Liran Liss" <[email protected]>,
Cc: "Roland Dreier" <[email protected]>,
Cc: "Sander, Ben" <[email protected]>,
Cc: "Stoner, Greg" <[email protected]>,
Cc: "Bridgman, John" <[email protected]>,
Cc: "Mantor, Michael" <[email protected]>,
Cc: "Blinzer, Paul" <[email protected]>,
Cc: "Morichetti, Laurent" <[email protected]>,
Cc: "Deucher, Alexander" <[email protected]>,
Cc: "Gabbay, Oded" <[email protected]>,


2014-11-10 18:29:21

by Jerome Glisse

Subject: [PATCH 2/5] mmu_notifier: keep track of active invalidation ranges v2

From: Jérôme Glisse <[email protected]>

The mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
functions can be considered as forming an "atomic" section from the point of
view of CPU page table updates. Between these two functions the CPU page table
content is unreliable for the address range being invalidated.

Current users such as kvm need to know when they can trust the content of the
CPU page table. This becomes even more important for new users of the
mmu_notifier API (such as HMM or ODP).

This patch uses a structure, defined at every call site of
invalidate_range_start(), that is added to a list for the duration of the
invalidation. It adds two new helpers to allow querying whether a range is
being invalidated and to wait for a range to become valid.

For proper synchronization, users must block new range invalidation from inside
their invalidate_range_start() callback before calling the helper functions.
Otherwise there is no guarantee that a new range invalidation will not be added
after the call to the helper function that queries for existing ranges.
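
To illustrate the intended usage, here is a minimal sketch of a driver-side
mirror fault path. It is not part of this patch: struct example_mirror, its
update_mutex and example_mirror_populate() are hypothetical, and the only thing
taken from this patch is mmu_notifier_range_wait_valid(). The sketch assumes
the driver's invalidate_range_start() callback takes the same update_mutex, so
no new invalidation can be added while the range is being mirrored.

struct example_mirror {
	struct mm_struct *mm;
	/* Also taken by the driver's invalidate_range_start() callback. */
	struct mutex update_mutex;
};

static int example_mirror_range(struct example_mirror *mirror,
				unsigned long start, unsigned long end)
{
	/* Block new invalidations from starting against this mirror. */
	mutex_lock(&mirror->update_mutex);

	/* Sleep until no active invalidation overlaps [start, end). */
	mmu_notifier_range_wait_valid(mirror->mm, start, end);

	/*
	 * The CPU page table content for [start, end) can now be trusted
	 * and copied into the device page table.
	 */
	example_mirror_populate(mirror, start, end);

	mutex_unlock(&mirror->update_mutex);
	return 0;
}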

Changed since v1:
- Fix a possible deadlock in mmu_notifier_range_wait_valid()

Signed-off-by: Jérôme Glisse <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
drivers/gpu/drm/i915/i915_gem_userptr.c | 13 +++--
drivers/iommu/amd_iommu_v2.c | 8 +--
drivers/misc/sgi-gru/grutlbpurge.c | 15 +++---
drivers/xen/gntdev.c | 15 +++---
fs/proc/task_mmu.c | 12 +++--
include/linux/mmu_notifier.h | 60 ++++++++++++----------
kernel/events/uprobes.c | 13 +++--
mm/huge_memory.c | 78 +++++++++++++----------------
mm/hugetlb.c | 55 +++++++++++----------
mm/ksm.c | 28 +++++------
mm/memory.c | 78 ++++++++++++++++-------------
mm/migrate.c | 36 +++++++-------
mm/mmu_notifier.c | 88 ++++++++++++++++++++++++++++-----
mm/mprotect.c | 17 ++++---
mm/mremap.c | 14 +++---
mm/rmap.c | 15 +++---
virt/kvm/kvm_main.c | 10 ++--
17 files changed, 310 insertions(+), 245 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 20dbd26..10b0044 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -128,26 +128,25 @@ restart:

static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
struct interval_tree_node *it = NULL;
- unsigned long next = start;
+ unsigned long next = range->start;
unsigned long serial = 0;
+ /* interval ranges are inclusive, but invalidate range is exclusive */
+ unsigned long end = range->end - 1;

- end--; /* interval ranges are inclusive, but invalidate range is exclusive */
while (next < end) {
struct drm_i915_gem_object *obj = NULL;

spin_lock(&mn->lock);
if (mn->has_linear)
- it = invalidate_range__linear(mn, mm, start, end);
+ it = invalidate_range__linear(mn, mm, range->start, end);
else if (serial == mn->serial)
it = interval_tree_iter_next(it, next, end);
else
- it = interval_tree_iter_first(&mn->objects, start, end);
+ it = interval_tree_iter_first(&mn->objects, range->start, end);
if (it != NULL) {
obj = container_of(it, struct i915_mmu_object, it)->obj;
drm_gem_object_reference(&obj->base);
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 57d2acf..9b7f32d 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -421,9 +421,7 @@ static void mn_invalidate_page(struct mmu_notifier *mn,

static void mn_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct pasid_state *pasid_state;
struct device_state *dev_state;
@@ -444,9 +442,7 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,

static void mn_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct pasid_state *pasid_state;
struct device_state *dev_state;
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..44b41b7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,8 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
*/
static void gru_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -230,14 +229,13 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
STAT(mmu_invalidate_range);
atomic_inc(&gms->ms_range_active);
gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
- start, end, atomic_read(&gms->ms_range_active));
- gru_flush_tlb_range(gms, start, end - start);
+ range->start, range->end, atomic_read(&gms->ms_range_active));
+ gru_flush_tlb_range(gms, range->start, range->end - range->start);
}

static void gru_invalidate_range_end(struct mmu_notifier *mn,
- struct mm_struct *mm, unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -246,7 +244,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
(void)atomic_dec_and_test(&gms->ms_range_active);

wake_up_all(&gms->ms_wait_queue);
- gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
+ gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms,
+ range->start, range->end);
}

static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index fe9da94..db5c2cad 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,19 +428,17 @@ static void unmap_if_in_range(struct grant_map *map,

static void mn_invl_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
struct grant_map *map;

spin_lock(&priv->lock);
list_for_each_entry(map, &priv->maps, next) {
- unmap_if_in_range(map, start, end);
+ unmap_if_in_range(map, range->start, range->end);
}
list_for_each_entry(map, &priv->freeable_maps, next) {
- unmap_if_in_range(map, start, end);
+ unmap_if_in_range(map, range->start, range->end);
}
spin_unlock(&priv->lock);
}
@@ -450,7 +448,12 @@ static void mn_invl_page(struct mmu_notifier *mn,
unsigned long address,
enum mmu_event event)
{
- mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+ struct mmu_notifier_range range;
+
+ range.start = address;
+ range.end = address + PAGE_SIZE;
+ range.event = event;
+ mn_invl_range_start(mn, mm, &range);
}

static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e75a848..ce57739 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -855,6 +855,12 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
.mm = mm,
.private = &cp,
};
+ struct mmu_notifier_range range = {
+ .start = 0,
+ .end = -1UL,
+ .event = MMU_ISDIRTY,
+ };
+
down_read(&mm->mmap_sem);
if (type == CLEAR_REFS_SOFT_DIRTY) {
for (vma = mm->mmap; vma; vma = vma->vm_next) {
@@ -869,8 +875,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
downgrade_write(&mm->mmap_sem);
break;
}
- mmu_notifier_invalidate_range_start(mm, 0,
- -1, MMU_ISDIRTY);
+ mmu_notifier_invalidate_range_start(mm, &range);
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
cp.vma = vma;
@@ -895,8 +900,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
&clear_refs_walk);
}
if (type == CLEAR_REFS_SOFT_DIRTY)
- mmu_notifier_invalidate_range_end(mm, 0,
- -1, MMU_ISDIRTY);
+ mmu_notifier_invalidate_range_end(mm, &range);
flush_tlb_mm(mm);
up_read(&mm->mmap_sem);
mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index ac2a121..d20eeb1 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -69,6 +69,13 @@ enum mmu_event {
MMU_WRITE_PROTECT,
};

+struct mmu_notifier_range {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+ enum mmu_event event;
+};
+
#ifdef CONFIG_MMU_NOTIFIER

/*
@@ -82,6 +89,12 @@ struct mmu_notifier_mm {
struct hlist_head list;
/* to serialize the list modifications and hlist_unhashed */
spinlock_t lock;
+ /* List of all active range invalidations. */
+ struct list_head ranges;
+ /* Number of active range invalidations. */
+ int nranges;
+ /* For threads waiting on range invalidations. */
+ wait_queue_head_t wait_queue;
};

struct mmu_notifier_ops {
@@ -202,14 +215,10 @@ struct mmu_notifier_ops {
*/
void (*invalidate_range_start)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ const struct mmu_notifier_range *range);
void (*invalidate_range_end)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ const struct mmu_notifier_range *range);

/*
* invalidate_range() is either called between
@@ -279,15 +288,17 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address,
enum mmu_event event);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ struct mmu_notifier_range *range);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event);
+ struct mmu_notifier_range *range);
extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end);
+extern bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
+extern void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);

static inline void mmu_notifier_release(struct mm_struct *mm)
{
@@ -330,21 +341,22 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
+ /*
+ * Initialize list no matter what in case a mmu_notifier register after
+ * a range_start but before matching range_end.
+ */
+ INIT_LIST_HEAD(&range->list);
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, start, end, event);
+ __mmu_notifier_invalidate_range_start(mm, range);
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_end(mm, start, end, event);
+ __mmu_notifier_invalidate_range_end(mm, range);
}

static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -486,16 +498,12 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
}

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f7d79d9..3cfe7ae 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -164,9 +164,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
spinlock_t *ptl;
pte_t *ptep;
int err;
- /* For mmu_notifiers */
- const unsigned long mmun_start = addr;
- const unsigned long mmun_end = addr + PAGE_SIZE;
+ struct mmu_notifier_range range;
struct mem_cgroup *memcg;

err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
@@ -176,8 +174,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
/* For try_to_free_swap() and munlock_vma_page() below */
lock_page(page);

- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = addr;
+ range.end = addr + PAGE_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);
err = -EAGAIN;
ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -211,8 +211,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
err = 0;
unlock:
mem_cgroup_cancel_charge(kpage, memcg);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
unlock_page(page);
return err;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 49d0cec..7f56188 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -987,8 +987,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
pmd_t _pmd;
int ret = 0, i;
struct page **pages;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
GFP_KERNEL);
@@ -1026,10 +1025,10 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
cond_resched();
}

- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = haddr;
+ range.end = haddr + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1063,8 +1062,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
page_remove_rmap(page);
spin_unlock(ptl);

- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1074,8 +1072,7 @@ out:

out_free_pages:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1094,8 +1091,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page = NULL, *new_page;
struct mem_cgroup *memcg;
unsigned long haddr;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

ptl = pmd_lockptr(mm, pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1165,10 +1161,10 @@ alloc:
copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);

- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = haddr;
+ range.end = haddr + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

spin_lock(ptl);
if (page)
@@ -1200,8 +1196,7 @@ alloc:
}
spin_unlock(ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
out:
return ret;
out_unlock:
@@ -1633,12 +1628,12 @@ static int __split_huge_page_splitting(struct page *page,
spinlock_t *ptl;
pmd_t *pmd;
int ret = 0;
- /* For mmu_notifiers */
- const unsigned long mmun_start = address;
- const unsigned long mmun_end = address + HPAGE_PMD_SIZE;
+ struct mmu_notifier_range range;

- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_HSPLIT);
+ range.start = address;
+ range.end = address + HPAGE_PMD_SIZE;
+ range.event = MMU_HSPLIT;
+ mmu_notifier_invalidate_range_start(mm, &range);
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
if (pmd) {
@@ -1654,8 +1649,7 @@ static int __split_huge_page_splitting(struct page *page,
ret = 1;
spin_unlock(ptl);
}
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_HSPLIT);
+ mmu_notifier_invalidate_range_end(mm, &range);

return ret;
}
@@ -2433,8 +2427,7 @@ static void collapse_huge_page(struct mm_struct *mm,
int isolated;
unsigned long hstart, hend;
struct mem_cgroup *memcg;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

VM_BUG_ON(address & ~HPAGE_PMD_MASK);

@@ -2474,10 +2467,10 @@ static void collapse_huge_page(struct mm_struct *mm,
pte = pte_offset_map(pmd, address);
pte_ptl = pte_lockptr(mm, pmd);

- mmun_start = address;
- mmun_end = address + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = address;
+ range.end = address + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* After this gup_fast can't run anymore. This also removes
@@ -2487,8 +2480,7 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
_pmd = pmdp_clear_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2871,36 +2863,32 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
struct page *page;
struct mm_struct *mm = vma->vm_mm;
unsigned long haddr = address & HPAGE_PMD_MASK;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);

- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
+ range.start = haddr;
+ range.end = haddr + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
again:
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(mm, &range);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
return;
}
if (is_huge_zero_pmd(*pmd)) {
__split_huge_zero_page_pmd(vma, haddr, pmd);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
get_page(page);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

split_huge_page(page);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 652feac..3486d84 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2551,17 +2551,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
int cow;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
int ret = 0;

cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

- mmun_start = vma->vm_start;
- mmun_end = vma->vm_end;
+ range.start = vma->vm_start;
+ range.end = vma->vm_end;
+ range.event = MMU_MIGRATE;
if (cow)
- mmu_notifier_invalidate_range_start(src, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(src, &range);

for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
@@ -2601,8 +2600,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
} else {
if (cow) {
huge_ptep_set_wrprotect(src, addr, src_pte);
- mmu_notifier_invalidate_range(src, mmun_start,
- mmun_end);
+ mmu_notifier_invalidate_range(src, range.start,
+ range.end);
}
entry = huge_ptep_get(src_pte);
ptepage = pte_page(entry);
@@ -2615,8 +2614,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}

if (cow)
- mmu_notifier_invalidate_range_end(src, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(src, &range);

return ret;
}
@@ -2634,16 +2632,17 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- const unsigned long mmun_start = start; /* For mmu_notifiers */
- const unsigned long mmun_end = end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON(start & ~huge_page_mask(h));
BUG_ON(end & ~huge_page_mask(h));

+ range.start = start;
+ range.end = end;
+ range.event = MMU_MIGRATE;
tlb_start_vma(tlb, vma);
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(mm, &range);
address = start;
again:
for (; address < end; address += sz) {
@@ -2716,8 +2715,7 @@ unlock:
if (address < end && !ref_page)
goto again;
}
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
tlb_end_vma(tlb, vma);
}

@@ -2814,8 +2812,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
struct hstate *h = hstate_vma(vma);
struct page *old_page, *new_page;
int ret = 0, outside_reserve = 0;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

old_page = pte_page(pte);

@@ -2893,10 +2890,11 @@ retry_avoidcopy:
pages_per_huge_page(h));
__SetPageUptodate(new_page);

- mmun_start = address & huge_page_mask(h);
- mmun_end = mmun_start + huge_page_size(h);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = address & huge_page_mask(h);
+ range.end = range.start + huge_page_size(h);
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);
+
/*
* Retake the page table lock to check for racing updates
* before the page tables are altered
@@ -2908,7 +2906,7 @@ retry_avoidcopy:

/* Break COW */
huge_ptep_clear_flush(vma, address, ptep);
- mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range(mm, range.start, range.end);
set_huge_pte_at(mm, address, ptep,
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page);
@@ -2917,8 +2915,7 @@ retry_avoidcopy:
new_page = old_page;
}
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
out_release_all:
page_cache_release(new_page);
out_release_old:
@@ -3352,11 +3349,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte_t pte;
struct hstate *h = hstate_vma(vma);
unsigned long pages = 0;
+ struct mmu_notifier_range range;

BUG_ON(address >= end);
flush_cache_range(vma, address, end);

- mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
+ range.start = start;
+ range.end = end;
+ range.event = MMU_MPROT;
+ mmu_notifier_invalidate_range_start(mm, &range);
i_mmap_lock_write(vma->vm_file->f_mapping);
for (; address < end; address += huge_page_size(h)) {
spinlock_t *ptl;
@@ -3387,7 +3388,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_tlb_range(vma, start, end);
mmu_notifier_invalidate_range(mm, start, end);
i_mmap_unlock_write(vma->vm_file->f_mapping);
- mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
+ mmu_notifier_invalidate_range_end(mm, &range);

return pages << h->order;
}
diff --git a/mm/ksm.c b/mm/ksm.c
index 8c3a892..3667d98 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -855,14 +855,13 @@ static inline int pages_identical(struct page *page1, struct page *page2)
static int write_protect_page(struct vm_area_struct *vma, struct page *page,
pte_t *orig_pte)
{
+ struct mmu_notifier_range range;
struct mm_struct *mm = vma->vm_mm;
unsigned long addr;
pte_t *ptep;
spinlock_t *ptl;
int swapped;
int err = -EFAULT;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */

addr = page_address_in_vma(page, vma);
if (addr == -EFAULT)
@@ -870,10 +869,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,

BUG_ON(PageTransCompound(page));

- mmun_start = addr;
- mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_WRITE_PROTECT);
+ range.start = addr;
+ range.end = addr + PAGE_SIZE;
+ range.event = MMU_WRITE_PROTECT;
+ mmu_notifier_invalidate_range_start(mm, &range);

ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -913,8 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
out_unlock:
pte_unmap_unlock(ptep, ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
- MMU_WRITE_PROTECT);
+ mmu_notifier_invalidate_range_end(mm, &range);
out:
return err;
}
@@ -937,8 +935,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
spinlock_t *ptl;
unsigned long addr;
int err = -EFAULT;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;

addr = page_address_in_vma(page, vma);
if (addr == -EFAULT)
@@ -948,10 +945,10 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
if (!pmd)
goto out;

- mmun_start = addr;
- mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ range.start = addr;
+ range.end = addr + PAGE_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte_same(*ptep, orig_pte)) {
@@ -976,8 +973,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
pte_unmap_unlock(ptep, ptl);
err = 0;
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);
out:
return err;
}
diff --git a/mm/memory.c b/mm/memory.c
index 187f844..6c44dd7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1011,8 +1011,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
bool is_cow;
int ret;

@@ -1048,11 +1047,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* is_cow_mapping() returns true.
*/
is_cow = is_cow_mapping(vma->vm_flags);
- mmun_start = addr;
- mmun_end = end;
+ range.start = addr;
+ range.end = end;
+ range.event = MMU_MIGRATE;
if (is_cow)
- mmu_notifier_invalidate_range_start(src_mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(src_mm, &range);

ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
@@ -1069,8 +1068,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
} while (dst_pgd++, src_pgd++, addr = next, addr != end);

if (is_cow)
- mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
- MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(src_mm, &range);
return ret;
}

@@ -1374,13 +1372,16 @@ void unmap_vmas(struct mmu_gather *tlb,
unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
+ struct mmu_notifier_range range = {
+ .start = start_addr,
+ .end = end_addr,
+ .event = MMU_MUNMAP,
+ };

- mmu_notifier_invalidate_range_start(mm, start_addr,
- end_addr, MMU_MUNMAP);
+ mmu_notifier_invalidate_range_start(mm, &range);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
- mmu_notifier_invalidate_range_end(mm, start_addr,
- end_addr, MMU_MUNMAP);
+ mmu_notifier_invalidate_range_end(mm, &range);
}

/**
@@ -1397,16 +1398,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
- unsigned long end = start + size;
+ struct mmu_notifier_range range = {
+ .start = start,
+ .end = start + size,
+ .event = MMU_MIGRATE,
+ };

lru_add_drain();
- tlb_gather_mmu(&tlb, mm, start, end);
+ tlb_gather_mmu(&tlb, mm, start, range.end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
- for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
- unmap_single_vma(&tlb, vma, start, end, details);
- mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
- tlb_finish_mmu(&tlb, start, end);
+ mmu_notifier_invalidate_range_start(mm, &range);
+ for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
+ unmap_single_vma(&tlb, vma, start, range.end, details);
+ mmu_notifier_invalidate_range_end(mm, &range);
+ tlb_finish_mmu(&tlb, start, range.end);
}

/**
@@ -1423,15 +1428,19 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
- unsigned long end = address + size;
+ struct mmu_notifier_range range = {
+ .start = address,
+ .end = address + size,
+ .event = MMU_MUNMAP,
+ };

lru_add_drain();
- tlb_gather_mmu(&tlb, mm, address, end);
+ tlb_gather_mmu(&tlb, mm, address, range.end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
- unmap_single_vma(&tlb, vma, address, end, details);
- mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
- tlb_finish_mmu(&tlb, address, end);
+ mmu_notifier_invalidate_range_start(mm, &range);
+ unmap_single_vma(&tlb, vma, address, range.end, details);
+ mmu_notifier_invalidate_range_end(mm, &range);
+ tlb_finish_mmu(&tlb, address, range.end);
}

/**
@@ -2051,10 +2060,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
int ret = 0;
int page_mkwrite = 0;
struct page *dirty_page = NULL;
- unsigned long mmun_start = 0; /* For mmu_notifiers */
- unsigned long mmun_end = 0; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
struct mem_cgroup *memcg;

+ range.start = 0;
+ range.end = 0;
+
old_page = vm_normal_page(vma, address, orig_pte);
if (!old_page) {
/*
@@ -2213,10 +2224,10 @@ gotten:
if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
goto oom_free_new;

- mmun_start = address & PAGE_MASK;
- mmun_end = mmun_start + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = address & PAGE_MASK;
+ range.end = range.start + PAGE_SIZE;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(mm, &range);

/*
* Re-check the pte - we dropped the lock
@@ -2286,9 +2297,8 @@ gotten:
page_cache_release(new_page);
unlock:
pte_unmap_unlock(page_table, ptl);
- if (mmun_end > mmun_start)
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ if (range.end > range.start)
+ mmu_notifier_invalidate_range_end(mm, &range);
if (old_page) {
/*
* Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index b5279b8..1b5b9ab 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1776,10 +1776,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
int isolated = 0;
struct page *new_page = NULL;
int page_lru = page_is_file_cache(page);
- unsigned long mmun_start = address & HPAGE_PMD_MASK;
- unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+ struct mmu_notifier_range range;
pmd_t orig_entry;

+ range.start = address & HPAGE_PMD_MASK;
+ range.end = range.start + HPAGE_PMD_SIZE;
+ range.event = MMU_MIGRATE;
+
/*
* Rate-limit the amount of data that is being migrated to a node.
* Optimal placement is no good if the memory bus is saturated and
@@ -1801,7 +1804,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
}

if (mm_tlb_flush_pending(mm))
- flush_tlb_range(vma, mmun_start, mmun_end);
+ flush_tlb_range(vma, range.start, range.end);

/* Prepare a page as a migration target */
__set_page_locked(new_page);
@@ -1814,14 +1817,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
WARN_ON(PageLRU(new_page));

/* Recheck the target PMD */
- mmu_notifier_invalidate_range_start(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_start(mm, &range);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
fail_putback:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

/* Reverse changes made by migrate_page_copy() */
if (TestClearPageActive(new_page))
@@ -1854,17 +1855,17 @@ fail_putback:
* The SetPageUptodate on the new page and page_add_new_anon_rmap
* guarantee the copy is visible before the pagetable update.
*/
- flush_cache_range(vma, mmun_start, mmun_end);
- page_add_anon_rmap(new_page, vma, mmun_start);
- pmdp_clear_flush_notify(vma, mmun_start, pmd);
- set_pmd_at(mm, mmun_start, pmd, entry);
- flush_tlb_range(vma, mmun_start, mmun_end);
+ flush_cache_range(vma, range.start, range.end);
+ page_add_anon_rmap(new_page, vma, range.start);
+ pmdp_clear_flush_notify(vma, range.start, pmd);
+ set_pmd_at(mm, range.start, pmd, entry);
+ flush_tlb_range(vma, range.start, range.end);
update_mmu_cache_pmd(vma, address, &entry);

if (page_count(page) != 2) {
- set_pmd_at(mm, mmun_start, pmd, orig_entry);
- flush_tlb_range(vma, mmun_start, mmun_end);
- mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+ set_pmd_at(mm, range.start, pmd, orig_entry);
+ flush_tlb_range(vma, range.start, range.end);
+ mmu_notifier_invalidate_range(mm, range.start, range.end);
update_mmu_cache_pmd(vma, address, &entry);
page_remove_rmap(new_page);
goto fail_putback;
@@ -1875,8 +1876,7 @@ fail_putback:
page_remove_rmap(page);

spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(mm, &range);

/* Take an "isolate" reference and put new page on the LRU. */
get_page(new_page);
@@ -1901,7 +1901,7 @@ out_dropref:
ptl = pmd_lock(mm, pmd);
if (pmd_same(*pmd, entry)) {
entry = pmd_mknonnuma(entry);
- set_pmd_at(mm, mmun_start, pmd, entry);
+ set_pmd_at(mm, range.start, pmd, entry);
update_mmu_cache_pmd(vma, address, &entry);
}
spin_unlock(ptl);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index e51ea02..8d48bc4 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,9 +174,7 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
}

void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)

{
struct mmu_notifier *mn;
@@ -185,21 +183,36 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_range_start)
- mn->ops->invalidate_range_start(mn, mm, start,
- end, event);
+ mn->ops->invalidate_range_start(mn, mm, range);
}
srcu_read_unlock(&srcu, id);
+
+ /*
+ * This must happen after the callback so that subsystem can block on
+ * new invalidation range to synchronize itself.
+ */
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
+ mm->mmu_notifier_mm->nranges++;
+ spin_unlock(&mm->mmu_notifier_mm->lock);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ struct mmu_notifier_range *range)
{
struct mmu_notifier *mn;
int id;

+ /*
+ * This must happen before the callback so that subsystem can unblock
+ * when range invalidation end.
+ */
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ list_del_init(&range->list);
+ mm->mmu_notifier_mm->nranges--;
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
/*
@@ -211,12 +224,18 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
* (besides the pointer check).
*/
if (mn->ops->invalidate_range)
- mn->ops->invalidate_range(mn, mm, start, end);
+ mn->ops->invalidate_range(mn, mm,
+ range->start, range->end);
if (mn->ops->invalidate_range_end)
- mn->ops->invalidate_range_end(mn, mm, start,
- end, event);
+ mn->ops->invalidate_range_end(mn, mm, range);
}
srcu_read_unlock(&srcu, id);
+
+ /*
+ * Wakeup after callback so they can do their job before any of the
+ * waiters resume.
+ */
+ wake_up(&mm->mmu_notifier_mm->wait_queue);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);

@@ -235,6 +254,50 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);

+static bool mmu_notifier_range_is_valid_locked(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct mmu_notifier_range *range;
+
+ list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
+ if (!(range->end <= start || range->start >= end)) {
+ return false;
+ }
+ }
+ return true;
+}
+
+bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ bool valid;
+
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ valid = mmu_notifier_range_is_valid_locked(mm, start, end);
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+ return valid;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_is_valid);
+
+void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ while (!mmu_notifier_range_is_valid_locked(mm, start, end)) {
+ int nranges = mm->mmu_notifier_mm->nranges;
+
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+ wait_event(mm->mmu_notifier_mm->wait_queue,
+ nranges != mm->mmu_notifier_mm->nranges);
+ spin_lock(&mm->mmu_notifier_mm->lock);
+ }
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_valid);
+
static int do_mmu_notifier_register(struct mmu_notifier *mn,
struct mm_struct *mm,
int take_mmap_sem)
@@ -264,6 +327,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
if (!mm_has_notifiers(mm)) {
INIT_HLIST_HEAD(&mmu_notifier_mm->list);
spin_lock_init(&mmu_notifier_mm->lock);
+ INIT_LIST_HEAD(&mmu_notifier_mm->ranges);
+ mmu_notifier_mm->nranges = 0;
+ init_waitqueue_head(&mmu_notifier_mm->wait_queue);

mm->mmu_notifier_mm = mmu_notifier_mm;
mmu_notifier_mm = NULL;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2302721..c88f770 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
unsigned long next;
unsigned long pages = 0;
unsigned long nr_huge_updates = 0;
- unsigned long mni_start = 0;
+ struct mmu_notifier_range range = {
+ .start = 0,
+ };

pmd = pmd_offset(pud, addr);
do {
@@ -150,10 +152,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
continue;

/* invoke the mmu notifier if the pmd is populated */
- if (!mni_start) {
- mni_start = addr;
- mmu_notifier_invalidate_range_start(mm, mni_start,
- end, MMU_MPROT);
+ if (!range.start) {
+ range.start = addr;
+ range.end = end;
+ range.event = MMU_MPROT;
+ mmu_notifier_invalidate_range_start(mm, &range);
}

if (pmd_trans_huge(*pmd)) {
@@ -180,8 +183,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pages += this_pages;
} while (pmd++, addr = next, addr != end);

- if (mni_start)
- mmu_notifier_invalidate_range_end(mm, mni_start, end, MMU_MPROT);
+ if (range.start)
+ mmu_notifier_invalidate_range_end(mm, &range);

if (nr_huge_updates)
count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 1ede220..5556f51 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -167,18 +167,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
bool need_rmap_locks)
{
unsigned long extent, next, old_end;
+ struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;
bool need_flush = false;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */

old_end = old_addr + len;
flush_cache_range(vma, old_addr, old_end);

- mmun_start = old_addr;
- mmun_end = old_end;
- mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ range.start = old_addr;
+ range.end = old_end;
+ range.event = MMU_MIGRATE;
+ mmu_notifier_invalidate_range_start(vma->vm_mm, &range);

for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
@@ -230,8 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (likely(need_flush))
flush_tlb_range(vma, old_end-len, old_addr);

- mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
- mmun_end, MMU_MIGRATE);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, &range);

return len + old_addr - old_end; /* how much done */
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 367f882..ff79815 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1316,15 +1316,14 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
spinlock_t *ptl;
struct page *page;
unsigned long address;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ struct mmu_notifier_range range;
unsigned long end;
int ret = SWAP_AGAIN;
int locked_vma = 0;
- enum mmu_event event = MMU_MIGRATE;

+ range.event = MMU_MIGRATE;
if (flags & TTU_MUNLOCK)
- event = MMU_MUNLOCK;
+ range.event = MMU_MUNLOCK;

address = (vma->vm_start + cursor) & CLUSTER_MASK;
end = address + CLUSTER_SIZE;
@@ -1337,9 +1336,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
if (!pmd)
return ret;

- mmun_start = address;
- mmun_end = end;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
+ range.start = address;
+ range.end = end;
+ mmu_notifier_invalidate_range_start(mm, &range);

/*
* If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1408,7 +1407,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
+ mmu_notifier_invalidate_range_end(mm, &range);
if (locked_vma)
up_read(&vma->vm_mm->mmap_sem);
return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 08bc07c..4ab31de 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -322,9 +322,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,

static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
@@ -337,7 +335,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
* count is also read inside the mmu_lock critical section.
*/
kvm->mmu_notifier_count++;
- need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+ need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end);
need_tlb_flush |= kvm->tlbs_dirty;
/* we've to flush the tlb before the pages can be freed */
if (need_tlb_flush)
@@ -349,9 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,

static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- enum mmu_event event)
+ const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);

--
1.9.3

2014-11-10 18:29:42

by Jerome Glisse

Subject: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

From: Jérôme Glisse <[email protected]>

A page table is a common structure, most notably used by CPU MMUs. The
arch-dependent page table code has strong ties to the architecture, which makes
it unsuitable for use by other, non arch specific code.

This patch implements a generic and arch independent page table. It is generic
in the sense that the entry size can be u64 or unsigned long (or u32 on 32-bit
archs).

It is lockless in the sense that at any point in time you can have concurrent
threads updating the page table (removing or changing entries) and faulting in
the page table (adding new entries). This is achieved by requiring each updater
and each faulter to take a range lock. There is no exclusion between range
locks, i.e. several threads can fault or update the same range concurrently,
and it is the responsibility of the user to synchronize updates to the page
table entries (pte); updates to the page table directory (pdp) are under gpt
responsibility.

The API usage pattern is:
gpt_init()

gpt_lock_update(lock_range)
// Users can update ptes, for instance with atomic bit operations,
// allowing completely lockless updates.
gpt_unlock_update(lock_range)

gpt_lock_fault(lock_range)
// Users can fault in ptes, but they are responsible for preventing
// threads from concurrently faulting the same pte and for properly
// accounting the number of ptes faulted in the pdp structure.
gpt_unlock_fault(lock_range)
// Newly faulted ptes become visible to other updaters only once all
// concurrent faulters on the address unlock.

Details on how the lockless concurrent updaters and faulters work are provided
in the header file.
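
To make the pattern above concrete, here is a minimal update-side sketch. It is
illustration only: the exact function names and signatures are generated per
entry type by the macro implementation, so gpt_lock_update()/gpt_unlock_update()
with a (gpt, lock) signature and example_update_range() itself are assumptions,
not code from this patch.

static void example_update_range(struct gpt *gpt, uint64_t first, uint64_t last)
{
	struct gpt_lock lock;

	lock.first = first;
	lock.last = last;

	/* Lockless fast path: only registers the lock on the updaters list. */
	gpt_lock_update(gpt, &lock);

	/*
	 * Walk [first, last] and update ptes, e.g. with atomic bit
	 * operations; gpt only guarantees that the directory structure
	 * stays stable, pte synchronization is the caller's job.
	 */

	gpt_unlock_update(gpt, &lock);
}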

Changed since v1:
- Switch to a macro implementation instead of using arithmetic to accommodate
the various sizes for table entries (uint64_t, unsigned long, ...).
This is somewhat less flexible but right now there is no use for the extra
flexibility v1 was offering.

Signed-off-by: Jérôme Glisse <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
include/linux/gpt.h | 340 +++++++++++++++++++++++++++
lib/Kconfig | 3 +
lib/Makefile | 2 +
lib/gpt.c | 202 ++++++++++++++++
lib/gpt_generic.h | 663 ++++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 1210 insertions(+)
create mode 100644 include/linux/gpt.h
create mode 100644 lib/gpt.c
create mode 100644 lib/gpt_generic.h

diff --git a/include/linux/gpt.h b/include/linux/gpt.h
new file mode 100644
index 0000000..3c28634
--- /dev/null
+++ b/include/linux/gpt.h
@@ -0,0 +1,340 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/*
+ * High level overview
+ * -------------------
+ *
+ * This is a generic arch independant page table implementation with lockless
+ * (allmost lockless) access. The content of the page table ie the page table
+ * entry, are not protected by the gpt helper, it is up to the code using gpt
+ * to protect the page table entry from concurrent update with no restriction
+ * on the mechanism (can be atomic or can sleep).
+ *
+ * The gpt code only deals with protecting the page directory tree structure.
+ * Which is done in a lockless way. Concurrent threads can read and or write
+ * overlapping range of the gpt. There can also be concurrent insertion and
+ * removal of page directory (insertion or removal of page table level).
+ *
+ * While removal of page directory is completely lockless, insertion of new
+ * page directory still require a lock (to avoid double insertion). If the
+ * architecture have a spinlock in its page struct then several threads can
+ * concurrently insert new directory (level) as long as they are inserting into
+ * different page directory. Otherwise insertion will serialize using a common
+ * spinlock. Note that insertion in this context only refer to inserting page
+ * directory, it does not deal about page table entry insertion and again this
+ * is the responsability of gpt user to properly synchronize those.
+ *
+ *
+ * Each gpt access must be done under gpt lock protection by calling gpt_lock()
+ * with a lock structure. Once a range is "locked" with gpt_lock() all access
+ * can be done in lockless fashion, using either gpt_walk or gpt_iter helpers.
+ * Note however that only directory that are considered as established will be
+ * considered ie if a thread is concurently inserting a new directory in the
+ * locked range then this directory will be ignore by gpt_walk or gpt_iter.
+ *
+ * This restriction comes from the lockless design. Some thread can hold a gpt
+ * lock for long time but if it holds it for a period long enough some of the
+ * internal gpt counter (unsigned long) might wrap around breaking all further
+ * access (thought it is self healing after a period of time). So access
+ * pattern to gpt should be :
+ * gpt_lock(gpt, lock)
+ * gpt_walk(gpt, lock, walk)
+ * gpt_unlock(gpt, lock)
+ *
+ * Walker callback can sleep but for now longer than it would take for other
+ * threads to wrap around internal gpt value through :
+ * gpt_lock_fault(gpt, lock)
+ * ... user faulting in new pte ...
+ * gpt_unlock_fault(gpt, lock)
+ *
+ * The lockless design refer to gpt_lock() and gpt_unlock() taking a spinlock
+ * only for adding/removing the lock struct to active lock list ie no more than
+ * few instructions in both case leaving little room for lock contention.
+ *
+ * Moreover there is no memory allocation during gpt_lock() or gpt_unlock() or
+ * gpt_walk(). The only constraint is that the lock struct must be the same for
+ * gpt_lock(), gpt_unlock() and gpt_walk().
+ */
+#ifndef __LINUX_GPT_H
+#define __LINUX_GPT_H
+
+#include <linux/mm.h>
+#include <asm/types.h>
+
+struct gpt_walk;
+struct gpt_iter;
+
+/* struct gpt - generic page table structure.
+ *
+ * @pde_from_pdp: Return the page directory entry that corresponds to a page
+ * directory page. This allows the user to use their own custom page directory
+ * entry format for all page directory levels.
+ * @pgd: Page global directory if multi level (tree page table).
+ * @faulters: List of all concurrent fault locks.
+ * @updaters: List of all concurrent update locks.
+ * @pdp_young: List of all young page directory pages. By analogy, directory
+ * pages on the young list are like being inside an rcu read section and
+ * might be dereferenced by other threads that do not hold a reference on
+ * them. The logic is that an active updater might have taken its lock before
+ * this page directory was added, and once an updater has a lock on a range
+ * it can start to walk or iterate over the range without holding an rcu
+ * read critical section (allowing the walker or iterator to sleep).
+ * Directories are moved off the young list only once all updaters that never
+ * considered them are done (ie have called gpt_ ## SUFFIX ## _unlock_update()).
+ * @pdp_free: List of all page directory pages to free (delayed free).
+ * @last_idx: Last valid index for this page table. Page table size is derived
+ * from that value.
+ * @pd_shift: Page directory shift value, (1 << pd_shift) is the number of
+ * entries that each page directory holds.
+ * @pde_mask: Mask of the bits corresponding to the pfn value of the lower
+ * page directory in a pde.
+ * @pde_shift: Shift value used to extract the pfn value of the lower page
+ * directory from a pde.
+ * @pde_valid: If pde & pde_valid is not 0 then this is a valid pde entry that
+ * has a valid pfn value for a lower page directory level.
+ * @pgd_shift: Shift value to get the index inside the pgd from an address.
+ * @min_serial: Oldest serial number used by the oldest updater.
+ * @updater_serial: Current serial number used for updaters.
+ * @faulter_serial: Current serial number used for faulters.
+ * @lock: Lock protecting the serial numbers and the updaters/faulters lists.
+ * @pgd_lock: Lock protecting the pgd level (and all levels if the arch does
+ * not have room for a spinlock inside its page struct).
+ */
+struct gpt {
+ uint64_t (*pde_from_pdp)(struct gpt *gpt, struct page *pdp);
+ void *pgd;
+ struct list_head faulters;
+ struct list_head updaters;
+ struct list_head pdp_young;
+ struct list_head pdp_free;
+ uint64_t last_idx;
+ uint64_t pd_shift;
+ uint64_t pde_mask;
+ uint64_t pde_shift;
+ uint64_t pde_valid;
+ uint64_t pgd_shift;
+ unsigned long min_serial;
+ unsigned long faulter_serial;
+ unsigned long updater_serial;
+ spinlock_t lock;
+ spinlock_t pgd_lock;
+ unsigned gfp_flags;
+};
+
+/* struct gpt_lock - generic page table range lock structure.
+ *
+ * @list: List struct for the active lock holder lists.
+ * @first: Start address of the locked range (inclusive).
+ * @last: End address of the locked range (inclusive).
+ * @serial: Serial number associated with that lock.
+ * @faulter: True if this lock was taken by a faulter, false for an updater.
+ *
+ * Before any read/update access to a range of the generic page table, the
+ * range must be locked to synchronize with concurrent read/update and
+ * insertion. In most cases gpt_lock will complete taking only one spinlock to
+ * protect the struct insertion in the active lock holder list (either the
+ * updaters or the faulters list depending on whether gpt_lock() or
+ * gpt_fault_lock() is called).
+ */
+struct gpt_lock {
+ struct list_head list;
+ uint64_t first;
+ uint64_t last;
+ unsigned long serial;
+ bool faulter;
+};
+
+/* struct gpt_walk - generic page table range walker structure.
+ *
+ * @pte: Callback invoked on the lowest (page table entry) level.
+ * @pde: Callback invoked on a page directory entry before walking down.
+ * @pde_post: Callback invoked on a page directory entry after walking down.
+ * @lock: The lock protecting this walker.
+ * @first: First index of the walked range (inclusive).
+ * @last: Last index of the walked range (inclusive).
+ * @data: Private data for the callbacks.
+ *
+ * This is similar to the cpu page table walker. It allows walking a range of
+ * the generic page table. Note that a gpt walk does not imply protection,
+ * hence you must call gpt_lock() prior to using gpt_walk() if you want to
+ * safely walk the range, as otherwise you are open to all kinds of
+ * synchronization issues.
+ */
+struct gpt_walk {
+ int (*pte)(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ void *ptep,
+ uint64_t first,
+ uint64_t last);
+ int (*pde)(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ void *pdep,
+ uint64_t first,
+ uint64_t last,
+ uint64_t shift);
+ int (*pde_post)(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ void *pdep,
+ uint64_t first,
+ uint64_t last,
+ uint64_t shift);
+ struct gpt_lock *lock;
+ uint64_t first;
+ uint64_t last;
+ void *data;
+};
+
+/* struct gpt_iter - generic page table range iterator structure.
+ *
+ * @gpt: The generic page table structure.
+ * @lock: The lock protecting this iterator.
+ * @pdp: Current page directory page.
+ * @pdep: Pointer to page directory entry for corresponding pdp.
+ * @idx: Current index
+ */
+struct gpt_iter {
+ struct gpt *gpt;
+ struct gpt_lock *lock;
+ struct page *pdp;
+ void *pdep;
+ uint64_t idx;
+};
+
+
+/* Page directory page helpers */
+static inline uint64_t gpt_pdp_shift(struct gpt *gpt, struct page *pdp)
+{
+ if (!pdp)
+ return gpt->pgd_shift;
+ return pdp->flags & 0xff;
+}
+
+static inline uint64_t gpt_pdp_first(struct gpt *gpt, struct page *pdp)
+{
+ if (!pdp)
+ return 0UL;
+ return pdp->index;
+}
+
+static inline uint64_t gpt_pdp_last(struct gpt *gpt, struct page *pdp)
+{
+ if (!pdp)
+ return gpt->last_idx;
+ return min(gpt->last_idx,
+ (uint64_t)(pdp->index +
+ (1UL << (gpt_pdp_shift(gpt, pdp) + gpt->pd_shift)) - 1UL));
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void gpt_pdp_lock(struct gpt *gpt, struct page *pdp)
+{
+ if (pdp)
+ spin_lock(&pdp->ptl);
+ else
+ spin_lock(&gpt->pgd_lock);
+}
+
+static inline void gpt_pdp_unlock(struct gpt *gpt, struct page *pdp)
+{
+ if (pdp)
+ spin_unlock(&pdp->ptl);
+ else
+ spin_unlock(&gpt->pgd_lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void gpt_pdp_lock(struct gpt *gpt, struct page *pdp)
+{
+ spin_lock(&gpt->pgd_lock);
+}
+
+static inline void gpt_pdp_unlock(struct gpt *gpt, struct page *pdp)
+{
+ spin_unlock(&gpt->pgd_lock);
+}
+#endif /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+
+static inline void gpt_pdp_ref(struct gpt *gpt, struct page *pdp)
+{
+ if (pdp)
+ atomic_inc(&pdp->_mapcount);
+}
+
+static inline void gpt_pdp_unref(struct gpt *gpt, struct page *pdp)
+{
+ if (pdp && atomic_dec_and_test(&pdp->_mapcount))
+ BUG();
+}
+
+
+/* Generic page table common functions. */
+void gpt_free(struct gpt *gpt);
+
+
+/* Generic page table type specific functions. */
+int gpt_ulong_init(struct gpt *gpt);
+void gpt_ulong_lock_update(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_ulong_unlock_update(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_ulong_lock_fault(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_ulong_unlock_fault(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_ulong_walk(struct gpt_walk *walk,
+ struct gpt *gpt,
+ struct gpt_lock *lock);
+bool gpt_ulong_iter_idx(struct gpt_iter *iter, uint64_t idx);
+bool gpt_ulong_iter_first(struct gpt_iter *iter,
+ uint64_t first,
+ uint64_t last);
+bool gpt_ulong_iter_next(struct gpt_iter *iter);
+
+int gpt_u64_init(struct gpt *gpt);
+void gpt_u64_lock_update(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_u64_unlock_update(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_u64_lock_fault(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_u64_unlock_fault(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_u64_walk(struct gpt_walk *walk,
+ struct gpt *gpt,
+ struct gpt_lock *lock);
+bool gpt_u64_iter_idx(struct gpt_iter *iter, uint64_t idx);
+bool gpt_u64_iter_first(struct gpt_iter *iter,
+ uint64_t first,
+ uint64_t last);
+bool gpt_u64_iter_next(struct gpt_iter *iter);
+
+#ifndef CONFIG_64BIT
+int gpt_u32_init(struct gpt *gpt);
+void gpt_u32_lock_update(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_u32_unlock_update(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_u32_lock_fault(struct gpt *gpt, struct gpt_lock *lock);
+void gpt_u32_unlock_fault(struct gpt *gpt, struct gpt_lock *lock);
+int gpt_u32_walk(struct gpt_walk *walk,
+ struct gpt *gpt,
+ struct gpt_lock *lock);
+bool gpt_u32_iter_idx(struct gpt_iter *iter, uint64_t idx);
+bool gpt_u32_iter_first(struct gpt_iter *iter,
+ uint64_t first,
+ uint64_t last);
+bool gpt_u32_iter_next(struct gpt_iter *iter);
+#endif
+
+
+/* Generic page table iterator helpers. */
+static inline void gpt_iter_init(struct gpt_iter *iter,
+ struct gpt *gpt,
+ struct gpt_lock *lock)
+{
+ iter->gpt = gpt;
+ iter->lock = lock;
+ iter->pdp = NULL;
+ iter->pdep = NULL;
+}
+
+#endif /* __LINUX_GPT_H */
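
To make the access pattern documented above concrete, here is a minimal
sketch (not part of the patch) of an updater locking a range, walking it and
unlocking it with the u64 variant declared in this header. The my_count_pte()
callback and count_entries() wrapper are made-up names for illustration only;
synchronization of the page table entries themselves remains the caller's
responsibility, as the header comment explains.

#include <linux/gpt.h>

/* Hypothetical leaf callback: count how many indices the walk visits. */
static int my_count_pte(struct gpt *gpt, struct gpt_walk *walk,
			struct page *pdp, void *ptep,
			uint64_t first, uint64_t last)
{
	*(uint64_t *)walk->data += last - first + 1;
	return 0;
}

static uint64_t count_entries(struct gpt *gpt, uint64_t first, uint64_t last)
{
	struct gpt_lock lock;
	struct gpt_walk walk;
	uint64_t count = 0;

	lock.first = first;
	lock.last = last;
	/* Updater lock: only established directories will be visited. */
	gpt_u64_lock_update(gpt, &lock);

	walk.lock = &lock;
	walk.first = first;
	walk.last = last;
	walk.pde = NULL;
	walk.pde_post = NULL;
	walk.pte = &my_count_pte;
	walk.data = &count;
	gpt_u64_walk(&walk, gpt, &lock);

	/* The same lock struct must be used for lock, walk and unlock. */
	gpt_u64_unlock_update(gpt, &lock);
	return count;
}
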
diff --git a/lib/Kconfig b/lib/Kconfig
index 2faf7b2..c041b3c 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -525,4 +525,7 @@ source "lib/fonts/Kconfig"
config ARCH_HAS_SG_CHAIN
def_bool n

+config GENERIC_PAGE_TABLE
+ bool
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 84000ec..e5ad435 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -197,3 +197,5 @@ quiet_cmd_build_OID_registry = GEN $@
clean-files += oid_registry_data.c

obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+
+obj-$(CONFIG_GENERIC_PAGE_TABLE) += gpt.o
diff --git a/lib/gpt.c b/lib/gpt.c
new file mode 100644
index 0000000..3a8e62c
--- /dev/null
+++ b/lib/gpt.c
@@ -0,0 +1,202 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* Generic arch independent page table implementation. See include/linux/gpt.h
+ * for further information on the design.
+ */
+#include <linux/gpt.h>
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include "gpt_generic.h"
+
+
+struct gpt_lock_walk {
+ struct list_head pdp_to_free;
+ struct gpt_lock *lock;
+ unsigned long locked[(1 << (PAGE_SHIFT - 3)) / sizeof(long)];
+};
+
+/* gpt_pdp_before_serial() - is the page directory older than a given serial.
+ *
+ * @pdp: Pointer to the struct page of the page directory.
+ * @serial: Serial number to check against.
+ *
+ * The page table walker and iterator use this to determine whether the current
+ * pde needs to be walked down/iterated over or not. Used by updaters to avoid
+ * walking down/iterating over new page directories.
+ */
+static inline bool gpt_pdp_before_serial(struct page *pdp,
+ unsigned long serial)
+{
+ /*
+ * To know if a page directory is new or old we first check if it's not
+ * on the recently added list. If it is and its serial number is newer
+ * or equal to our lock serial number then it is a new page directory
+	 * entry and must be ignored.
+ */
+ return list_empty(&pdp->lru) || time_after(serial, pdp->private);
+}
+
+/* gpt_lock_hold_pdp() - does the given lock hold a reference on the directory.
+ *
+ * @lock: Lock to check against.
+ * @pdp: Pointer to the struct page of the page directory.
+ *
+ * When walking down the page table or iterating over it, this function is
+ * called to know whether the current pde entry needs to be walked
+ * down/iterated over.
+ */
+static bool gpt_lock_hold_pdp(struct gpt_lock *lock, struct page *pdp)
+{
+ if (lock->faulter)
+ return true;
+ if (!atomic_read(&pdp->_mapcount))
+ return false;
+ if (!gpt_pdp_before_serial(pdp, lock->serial))
+ return false;
+ return true;
+}
+
+static void gpt_lock_walk_update_finish(struct gpt *gpt,
+ struct gpt_lock_walk *wlock)
+{
+ struct gpt_lock *lock = wlock->lock;
+ unsigned long min_serial;
+
+ spin_lock(&gpt->lock);
+ min_serial = gpt->min_serial;
+ list_del_init(&lock->list);
+ lock = list_first_entry_or_null(&gpt->updaters, struct gpt_lock, list);
+ gpt->min_serial = lock ? lock->serial : gpt->updater_serial;
+ spin_unlock(&gpt->lock);
+
+ /*
+ * Drain the young pdp list if the new smallest serial lock holder is
+ * different from previous one.
+ */
+ if (gpt->min_serial != min_serial) {
+ struct page *pdp, *next;
+
+ spin_lock(&gpt->pgd_lock);
+ list_for_each_entry_safe(pdp, next, &gpt->pdp_young, lru) {
+ if (!gpt_pdp_before_serial(pdp, gpt->min_serial))
+ break;
+ list_del_init(&pdp->lru);
+ }
+ list_for_each_entry_safe(pdp, next, &gpt->pdp_free, lru) {
+ if (!gpt_pdp_before_serial(pdp, gpt->min_serial))
+ break;
+ list_del(&pdp->lru);
+ list_add_tail(&pdp->lru, &wlock->pdp_to_free);
+ }
+ spin_unlock(&gpt->pgd_lock);
+ }
+}
+
+/* gpt_lock_fault_finish() - common lock fault cleanup.
+ *
+ * @gpt: The pointer to the generic page table structure.
+ * @wlock: Walk lock structure.
+ *
+ * This function first removes the lock from the faulters list, then updates
+ * the serial number that will be used by the next updater, to either the
+ * oldest active faulter serial or the next faulter serial number. In both
+ * cases the next updater will ignore directories with a serial equal or
+ * superior to this serial number. In other words it will only consider
+ * directories that are older than the oldest active faulter.
+ *
+ * Note however that the young list is not drained here as we only want to
+ * drain it once updaters are done, ie once no updater might dereference such
+ * a young page without holding a reference on it. Refer to the gpt struct
+ * comments on the young list.
+ */
+static void gpt_lock_fault_finish(struct gpt *gpt, struct gpt_lock_walk *wlock)
+{
+ struct gpt_lock *lock = wlock->lock;
+
+ spin_lock(&gpt->lock);
+ list_del_init(&lock->list);
+ lock = list_first_entry_or_null(&gpt->faulters, struct gpt_lock, list);
+ if (lock)
+ gpt->updater_serial = lock->serial;
+ else
+ gpt->updater_serial = gpt->faulter_serial;
+ spin_unlock(&gpt->lock);
+}
+
+static void gpt_lock_walk_free_pdp(struct gpt_lock_walk *wlock)
+{
+ struct page *pdp, *tmp;
+
+ if (list_empty(&wlock->pdp_to_free))
+ return;
+
+ synchronize_rcu();
+
+ list_for_each_entry_safe(pdp, tmp, &wlock->pdp_to_free, lru) {
+		/* Restore page struct fields to their expected value. */
+ list_del(&pdp->lru);
+ atomic_dec(&pdp->_mapcount);
+ pdp->mapping = NULL;
+ pdp->index = 0;
+ pdp->flags &= (~0xffUL);
+ __free_page(pdp);
+ }
+}
+
+
+/* Page directory page helpers */
+static inline bool gpt_pdp_cover_idx(struct gpt *gpt,
+ struct page *pdp,
+ unsigned long idx)
+{
+ return (idx >= gpt_pdp_first(gpt, pdp)) &&
+ (idx <= gpt_pdp_last(gpt, pdp));
+}
+
+static inline struct page *gpt_pdp_upper_pdp(struct page *pdp)
+{
+ if (!pdp)
+ return NULL;
+ return pdp->s_mem;
+}
+
+static inline void gpt_pdp_init(struct page *page)
+{
+ atomic_set(&page->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+ spin_lock_init(&page->ptl);
+#endif
+}
+
+
+/* Generic page table common functions. */
+void gpt_free(struct gpt *gpt)
+{
+ BUG_ON(!list_empty(&gpt->faulters));
+ BUG_ON(!list_empty(&gpt->updaters));
+ kfree(gpt->pgd);
+ gpt->pgd = NULL;
+}
+EXPORT_SYMBOL(gpt_free);
+
+
+/* Generic page table type specific functions. */
+GPT_DEFINE(u64, uint64_t, 3);
+#ifdef CONFIG_64BIT
+GPT_DEFINE(ulong, unsigned long, 3);
+#else
+GPT_DEFINE(ulong, unsigned long, 2);
+GPT_DEFINE(u32, uint32_t, 2);
+#endif
diff --git a/lib/gpt_generic.h b/lib/gpt_generic.h
new file mode 100644
index 0000000..c996314
--- /dev/null
+++ b/lib/gpt_generic.h
@@ -0,0 +1,663 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* Generic arch independent page table implementation. See include/linux/gpt.h
+ * for further information on the design.
+ */
+
+/*
+ * Template for implementing a generic page table for various entry types.
+ *
+ * SUFFIX     suffix used for naming functions.
+ * TYPE       entry type (uint64_t, unsigned long, ...)
+ * TYPE_SHIFT shift corresponding to the size of TYPE (3 for u64, 2 for u32).
+ *
+ * Note that an entry, which is (1 << (TYPE_SHIFT + 3)) bits wide, must be big
+ * enough to store any pfn plus the flags the user wants. For instance on a
+ * 32 bit arch with 36 bit PAE you need 24 bits to store a pfn, thus if you
+ * use u32 as the type you only have 8 bits left for flags in each entry.
+ */
+
+#define GPT_DEFINE(SUFFIX, TYPE, TYPE_SHIFT) \
+ \
+int gpt_ ## SUFFIX ## _init(struct gpt *gpt) \
+{ \
+ unsigned long pgd_size; \
+ \
+ gpt->pgd = NULL; \
+ if (!gpt->last_idx) \
+ return -EINVAL; \
+ INIT_LIST_HEAD(&gpt->faulters); \
+ INIT_LIST_HEAD(&gpt->updaters); \
+ INIT_LIST_HEAD(&gpt->pdp_young); \
+ INIT_LIST_HEAD(&gpt->pdp_free); \
+ spin_lock_init(&gpt->pgd_lock); \
+ spin_lock_init(&gpt->lock); \
+ gpt->pd_shift = (PAGE_SHIFT - TYPE_SHIFT); \
+ gpt->pgd_shift = (__fls(gpt->last_idx) / \
+ (PAGE_SHIFT - (TYPE_SHIFT))) * \
+ (PAGE_SHIFT - (TYPE_SHIFT)); \
+ pgd_size = (gpt->last_idx >> gpt->pgd_shift) << (TYPE_SHIFT); \
+ gpt->pgd = kzalloc(pgd_size, GFP_KERNEL); \
+ gpt->updater_serial = gpt->faulter_serial = gpt->min_serial = 0; \
+ return !gpt->pgd ? -ENOMEM : 0; \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _init); \
+ \
+/* gpt_ ## SUFFIX ## _pde_pdp() - get page directory page from a pde. \
+ * \
+ * @gpt: The pointer to the generic page table structure. \
+ * @pde: Page directory entry to extract the lower directory page from. \
+ */ \
+static inline struct page *gpt_ ## SUFFIX ## _pde_pdp(struct gpt *gpt, \
+ TYPE pde) \
+{ \
+ if (!(pde & gpt->pde_valid)) \
+ return NULL; \
+ return pfn_to_page((pde & gpt->pde_mask) >> gpt->pde_shift); \
+} \
+ \
+/* gpt_ ## SUFFIX ## _pte_from_idx() - pointer to a pte inside directory \
+ * \
+ * @gpt: The pointer to the generic page table structure. \
+ * @pdp: Page directory page if any. \
+ * @idx: Index of the pte that is being lookup. \
+ */ \
+static inline void *gpt_ ## SUFFIX ## _pte_from_idx(struct gpt *gpt, \
+ struct page *pdp, \
+ uint64_t idx) \
+{ \
+ TYPE *ptep = pdp ? page_address(pdp) : gpt->pgd; \
+ \
+ ptep += (idx & ((1UL << gpt->pd_shift) - 1UL)); \
+ return ptep; \
+} \
+ \
+/* gpt_ ## SUFFIX ## _pdep_from_idx() - pointer to directory entry \
+ * \
+ * @gpt: The pointer to the generic page table structure. \
+ * @pdp: Page directory page if any. \
+ * @idx: Index of the pde that is being lookup. \
+ */ \
+static inline void *gpt_ ## SUFFIX ## _pdep_from_idx(struct gpt *gpt, \
+ struct page *pdp, \
+ uint64_t idx) \
+{ \
+ TYPE *pdep = pdp ? page_address(pdp) : gpt->pgd; \
+ uint64_t shift = gpt_pdp_shift(gpt, pdp); \
+ \
+ pdep += ((idx >> shift) & ((1UL << gpt->pd_shift) - 1UL)); \
+ return pdep; \
+} \
+ \
+static int gpt_ ## SUFFIX ## _walk_pde(struct gpt *gpt, \
+ struct gpt_walk *walk, \
+ struct page *pdp, \
+ void *ptr, \
+ uint64_t first, \
+ uint64_t last, \
+ uint64_t shift) \
+{ \
+ unsigned long i, npde; \
+ TYPE *pdep = ptr; \
+ uint64_t cur, lshift, mask, next; \
+ int ret; \
+ \
+ if (walk->pde) { \
+ ret = walk->pde(gpt, walk, pdp, ptr, \
+ first, last, shift); \
+ if (ret) \
+ return ret; \
+ } \
+ \
+ lshift = shift ? shift - gpt->pd_shift : 0; \
+ mask = ~((1ULL << shift) - 1ULL); \
+ npde = ((last - first) >> shift) + 1; \
+ for (i = 0, cur = first; i < npde; ++i, cur = next) { \
+ struct page *lpdp; \
+ TYPE pde = ACCESS_ONCE(pdep[i]); \
+ \
+ next = min((cur & mask) + (1UL << shift), last); \
+ lpdp = gpt_ ## SUFFIX ## _pde_pdp(gpt, pde); \
+ if (!lpdp || !gpt_lock_hold_pdp(walk->lock, lpdp)) \
+ continue; \
+ if (lshift) { \
+ void *lpde; \
+ \
+ lpde = gpt_ ## SUFFIX ## _pdep_from_idx(gpt, \
+ lpdp, \
+ cur); \
+ ret = gpt_ ## SUFFIX ## _walk_pde(gpt, walk, \
+ lpdp, lpde, \
+ cur, next, \
+ lshift); \
+ if (ret) \
+ return ret; \
+ } else if (walk->pte) { \
+ void *lpte; \
+ \
+ lpte = gpt_ ## SUFFIX ## _pte_from_idx(gpt, \
+ lpdp, \
+ cur); \
+ ret = walk->pte(gpt, walk, lpdp, \
+ lpte, cur, next); \
+ if (ret) \
+ return ret; \
+ } \
+ } \
+ \
+ if (walk->pde_post) { \
+ ret = walk->pde_post(gpt, walk, pdp, ptr, \
+ first, last, shift); \
+ if (ret) \
+ return ret; \
+ } \
+ \
+ return 0; \
+} \
+ \
+int gpt_ ## SUFFIX ## _walk(struct gpt_walk *walk, \
+ struct gpt *gpt, \
+ struct gpt_lock *lock) \
+{ \
+ TYPE *pdep = gpt->pgd; \
+ uint64_t idx; \
+ \
+ if (walk->first > gpt->last_idx || walk->last > gpt->last_idx) \
+ return -EINVAL; \
+ \
+ idx = walk->first >> gpt->pgd_shift; \
+ return gpt_ ## SUFFIX ## _walk_pde(gpt, walk, NULL, &pdep[idx], \
+ walk->first, walk->last, \
+ gpt->pgd_shift); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _walk); \
+ \
+static void gpt_ ## SUFFIX ## _pdp_unref(struct gpt *gpt, \
+ struct page *pdp, \
+ struct gpt_lock_walk *wlock, \
+ struct page *updp, \
+ TYPE *upde) \
+{ \
+ /* \
+	 * The atomic decrement and test ensure that only one thread	      \
+	 * will clean up the pde.					      \
+ */ \
+ if (!atomic_dec_and_test(&pdp->_mapcount)) \
+ return; \
+ \
+ /* \
+ * Protection against race btw new pdes instancing and pdes \
+ * clearing due to unref, rely on faulter taking a reference on \
+ * all valid pdes and calling synchronize_rcu() after. After the \
+ * rcu synchronize no further unreference might clear a pde in \
+ * the faulter(s) range(s). \
+ */ \
+ *upde = 0; \
+ if (!list_empty(&pdp->lru)) { \
+ /* \
+ * It means this page directory was added recently but \
+		 * is about to be destroyed before it could be removed from  \
+ * the young list. \
+ * \
+ * Because it is in the young list and lock holder can \
+ * access the page table without rcu protection it means \
+ * that we can not rely on synchronize_rcu to know when \
+ * it is safe to free the page as some thread might be \
+		 * dereferencing it. We have to wait for all locks that	      \
+		 * are older than this page directory, at which point we     \
+		 * know for sure that no thread can dereference the page.    \
+ */ \
+ spin_lock(&gpt->pgd_lock); \
+ list_add_tail(&pdp->lru, &gpt->pdp_free); \
+ spin_unlock(&gpt->pgd_lock); \
+ } else \
+ /* \
+ * This means this is an old page directory and thus any \
+ * lock holder that might dereference a pointer to it \
+		 * would have a reference on it. Hence, because the refcount \
+		 * reached 0, we only need to wait for an rcu grace period.  \
+ */ \
+ list_add_tail(&pdp->lru, &wlock->pdp_to_free); \
+ \
+	/* Un-account this entry; the caller must hold a ref on pdp. */      \
+ if (updp && atomic_dec_and_test(&updp->_mapcount)) \
+ BUG(); \
+} \
+ \
+static int gpt_ ## SUFFIX ## _pde_lock_update(struct gpt *gpt, \
+ struct gpt_walk *walk, \
+ struct page *pdp, \
+ void *ptr, \
+ uint64_t first, \
+ uint64_t last, \
+ uint64_t shift) \
+{ \
+ unsigned long i, npde; \
+ struct gpt_lock_walk *wlock = walk->data; \
+ struct gpt_lock *lock = wlock->lock; \
+ TYPE *pdep = ptr; \
+ \
+ npde = ((last - first) >> shift) + 1; \
+ \
+ rcu_read_lock(); \
+ for (i = 0; i < npde; ++i) { \
+ struct page *page; \
+ TYPE pde = ACCESS_ONCE(pdep[i]); \
+ \
+ clear_bit(i, wlock->locked); \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pde); \
+ if (!page) \
+ continue; \
+ if (!atomic_inc_not_zero(&page->_mapcount)) \
+ continue; \
+ \
+ if (!gpt_pdp_before_serial(page, lock->serial)) { \
+ /* This is a new entry ignore it. */ \
+ gpt_ ## SUFFIX ## _pdp_unref(gpt, page, wlock, \
+ pdp, &pdep[i]); \
+ continue; \
+ } \
+ set_bit(i, wlock->locked); \
+ } \
+ rcu_read_unlock(); \
+ \
+ for (i = 0; i < npde; i++) { \
+ struct page *page; \
+ \
+ if (!test_bit(i, wlock->locked)) \
+ continue; \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pdep[i]); \
+ kmap(page); \
+ } \
+ \
+ return 0; \
+} \
+ \
+void gpt_ ## SUFFIX ## _lock_update(struct gpt *gpt, \
+ struct gpt_lock *lock) \
+{ \
+ struct gpt_lock_walk wlock; \
+ struct gpt_walk walk; \
+ \
+ spin_lock(&gpt->lock); \
+ lock->faulter = false; \
+ lock->serial = gpt->updater_serial; \
+ list_add_tail(&lock->list, &gpt->updaters); \
+ spin_unlock(&gpt->lock); \
+ \
+ INIT_LIST_HEAD(&wlock.pdp_to_free); \
+ wlock.lock = lock; \
+ walk.lock = lock; \
+ walk.data = &wlock; \
+ walk.pde = &gpt_ ## SUFFIX ## _pde_lock_update; \
+ walk.pde_post = NULL; \
+ walk.pte = NULL; \
+ walk.first = lock->first; \
+ walk.last = lock->last; \
+ \
+ gpt_ ## SUFFIX ## _walk(&walk, gpt, lock); \
+ gpt_lock_walk_free_pdp(&wlock); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _lock_update); \
+ \
+static int gpt_ ## SUFFIX ## _pde_unlock_update(struct gpt *gpt, \
+ struct gpt_walk *walk, \
+ struct page *pdp, \
+ void *ptr, \
+ uint64_t first, \
+ uint64_t last, \
+ uint64_t shift) \
+{ \
+ unsigned long i, npde; \
+ struct gpt_lock_walk *wlock = walk->data; \
+ struct gpt_lock *lock = wlock->lock; \
+ TYPE *pdep = ptr; \
+ \
+ npde = ((last - first) >> shift) + 1; \
+ \
+ rcu_read_lock(); \
+ for (i = 0; i < npde; ++i) { \
+ struct page *page; \
+ TYPE pde = ACCESS_ONCE(pdep[i]); \
+ \
+ if (!(pde & gpt->pde_valid)) \
+ continue; \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pde); \
+ if (!page || !gpt_pdp_before_serial(page, lock->serial)) \
+ continue; \
+ kunmap(page); \
+ gpt_ ## SUFFIX ## _pdp_unref(gpt, page, wlock, \
+ pdp, &pdep[i]); \
+ } \
+ rcu_read_unlock(); \
+ \
+ return 0; \
+} \
+ \
+void gpt_ ## SUFFIX ## _unlock_update(struct gpt *gpt, \
+ struct gpt_lock *lock) \
+{ \
+ struct gpt_lock_walk wlock; \
+ struct gpt_walk walk; \
+ \
+ INIT_LIST_HEAD(&wlock.pdp_to_free); \
+ wlock.lock = lock; \
+ walk.lock = lock; \
+ walk.data = &wlock; \
+ walk.pde = NULL; \
+ walk.pde_post = &gpt_ ## SUFFIX ## _pde_unlock_update; \
+ walk.pte = NULL; \
+ walk.first = lock->first; \
+ walk.last = lock->last; \
+ \
+ gpt_ ## SUFFIX ## _walk(&walk, gpt, lock); \
+ \
+ gpt_lock_walk_update_finish(gpt, &wlock); \
+ gpt_lock_walk_free_pdp(&wlock); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _unlock_update); \
+ \
+static int gpt_ ## SUFFIX ## _pde_lock_fault(struct gpt *gpt, \
+ struct gpt_walk *walk, \
+ struct page *pdp, \
+ void *ptr, \
+ uint64_t first, \
+ uint64_t last, \
+ uint64_t shift) \
+{ \
+ unsigned long cmissing, i, npde; \
+ struct gpt_lock_walk *wlock = walk->data; \
+ struct gpt_lock *lock = wlock->lock; \
+ struct list_head pdp_new, pdp_added; \
+ struct page *page, *tmp; \
+ TYPE mask, *pdep = ptr; \
+ int ret; \
+ \
+ npde = ((last - first) >> shift) + 1; \
+ mask = ~((1ULL << shift) - 1ULL); \
+ INIT_LIST_HEAD(&pdp_added); \
+ INIT_LIST_HEAD(&pdp_new); \
+ \
+ rcu_read_lock(); \
+ for (i = 0, cmissing = 0; i < npde; ++i) { \
+ TYPE pde = ACCESS_ONCE(pdep[i]); \
+ \
+ clear_bit(i, wlock->locked); \
+ if (!(pde & gpt->pde_valid)) { \
+ cmissing++; \
+ continue; \
+ } \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pde); \
+ if (!atomic_inc_not_zero(&page->_mapcount)) { \
+ cmissing++; \
+ continue; \
+ } \
+ set_bit(i, wlock->locked); \
+ } \
+ rcu_read_unlock(); \
+ \
+ /* Allocate missing page directory page. */ \
+ for (i = 0; i < cmissing; ++i) { \
+ page = alloc_page(gpt->gfp_flags | __GFP_ZERO); \
+ if (!page) { \
+ ret = -ENOMEM; \
+ goto error; \
+ } \
+ list_add_tail(&page->lru, &pdp_new); \
+ } \
+ \
+ /* \
+	 * The synchronize_rcu() is for exclusion with concurrent update     \
+	 * threads that might try to clear the pde. Because a reference      \
+	 * was taken just above on all valid pdes we know for sure that      \
+	 * after the rcu synchronize all threads that were about to clear    \
+	 * pdes are done and that no new unreference will lead to a pde      \
+	 * being cleared.						      \
+ */ \
+ synchronize_rcu(); \
+ \
+ gpt_pdp_lock(gpt, pdp); \
+ for (i = 0; i < npde; ++i) { \
+ TYPE pde = ACCESS_ONCE(pdep[i]); \
+ \
+ if (test_bit(i, wlock->locked)) \
+ continue; \
+ \
+		/* Another thread might already have populated the entry. */ \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pde); \
+ if (page && atomic_inc_not_zero(&page->_mapcount)) \
+ continue; \
+ \
+ page = list_first_entry_or_null(&pdp_new, \
+ struct page, \
+ lru); \
+ BUG_ON(!page); \
+ list_del(&page->lru); \
+ \
+ /* Initialize page directory page struct. */ \
+ page->private = lock->serial; \
+ page->s_mem = pdp; \
+ page->index = (first & mask) + (i << shift); \
+ page->flags |= (shift - gpt->pd_shift) & 0xff; \
+ gpt_pdp_init(page); \
+ list_add_tail(&page->lru, &pdp_added); \
+ \
+ pdep[i] = gpt->pde_from_pdp(gpt, page); \
+ /* Account this new entry inside upper directory. */ \
+ if (pdp) \
+ atomic_inc(&pdp->_mapcount); \
+ } \
+ gpt_pdp_unlock(gpt, pdp); \
+ \
+ spin_lock(&gpt->pgd_lock); \
+ list_splice_tail(&pdp_added, &gpt->pdp_young); \
+ spin_unlock(&gpt->pgd_lock); \
+ \
+ for (i = 0; i < npde; ++i) { \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pdep[i]); \
+ kmap(page); \
+ } \
+ \
+ /* Free any left over pages. */ \
+ list_for_each_entry_safe (page, tmp, &pdp_new, lru) { \
+ list_del(&page->lru); \
+ __free_page(page); \
+ } \
+ return 0; \
+ \
+error: \
+ /* \
+	 * We know that no page is kmapped and no pages were added to the    \
+	 * directory tree.						      \
+ */ \
+ list_for_each_entry_safe (page, tmp, &pdp_new, lru) { \
+ list_del(&page->lru); \
+ __free_page(page); \
+ } \
+ \
+ rcu_read_lock(); \
+ for (i = 0; i < npde; ++i) { \
+ if (test_bit(i, wlock->locked)) \
+ continue; \
+ \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pdep[i]); \
+ gpt_ ## SUFFIX ## _pdp_unref(gpt, page, wlock, \
+ pdp, &pdep[i]); \
+ } \
+ rcu_read_unlock(); \
+ \
+ walk->last = first; \
+ return ret; \
+} \
+ \
+static int gpt_ ## SUFFIX ## _pde_unlock_fault(struct gpt *gpt, \
+ struct gpt_walk *walk, \
+ struct page *pdp, \
+ void *ptr, \
+ uint64_t first, \
+ uint64_t last, \
+ uint64_t shift) \
+{ \
+ unsigned long i, npde; \
+ struct gpt_lock_walk *wlock = walk->data; \
+ struct gpt_lock *lock = wlock->lock; \
+ TYPE *pdep = ptr; \
+ \
+ npde = ((last - first) >> shift) + 1; \
+ \
+ rcu_read_lock(); \
+ for (i = 0; i < npde; ++i) { \
+ struct page *page; \
+ \
+ page = gpt_ ## SUFFIX ## _pde_pdp(gpt, pdep[i]); \
+ if (!page || !gpt_lock_hold_pdp(lock, page)) \
+ continue; \
+ kunmap(page); \
+ gpt_ ## SUFFIX ## _pdp_unref(gpt, page, wlock, \
+ pdp, &pdep[i]); \
+ } \
+ rcu_read_unlock(); \
+ \
+ return 0; \
+} \
+ \
+int gpt_ ## SUFFIX ## _lock_fault(struct gpt *gpt, \
+ struct gpt_lock *lock) \
+{ \
+ struct gpt_lock_walk wlock; \
+ struct gpt_walk walk; \
+ int ret; \
+ \
+ lock->faulter = true; \
+ spin_lock(&gpt->lock); \
+ lock->serial = gpt->faulter_serial++; \
+ list_add_tail(&lock->list, &gpt->faulters); \
+ spin_unlock(&gpt->lock); \
+ \
+ INIT_LIST_HEAD(&wlock.pdp_to_free); \
+ wlock.lock = lock; \
+ walk.lock = lock; \
+ walk.data = &wlock; \
+ walk.pde = &gpt_ ## SUFFIX ## _pde_lock_fault; \
+ walk.pde_post = NULL; \
+ walk.pte = NULL; \
+ walk.first = lock->first; \
+ walk.last = lock->last; \
+ \
+ ret = gpt_ ## SUFFIX ## _walk(&walk, gpt, lock); \
+ if (ret) { \
+ walk.pde = NULL; \
+ walk.pde_post = &gpt_ ## SUFFIX ## _pde_unlock_fault; \
+ gpt_ ## SUFFIX ## _walk(&walk, gpt, lock); \
+ gpt_lock_fault_finish(gpt, &wlock); \
+ } \
+ gpt_lock_walk_free_pdp(&wlock); \
+ \
+ return ret; \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _lock_fault); \
+ \
+void gpt_ ## SUFFIX ## _unlock_fault(struct gpt *gpt, \
+ struct gpt_lock *lock) \
+{ \
+ struct gpt_lock_walk wlock; \
+ struct gpt_walk walk; \
+ \
+ INIT_LIST_HEAD(&wlock.pdp_to_free); \
+ wlock.lock = lock; \
+ walk.lock = lock; \
+ walk.data = &wlock; \
+ walk.pde = NULL; \
+ walk.pde_post = &gpt_ ## SUFFIX ## _pde_unlock_fault; \
+ walk.pte = NULL; \
+ walk.first = lock->first; \
+ walk.last = lock->last; \
+ \
+ gpt_ ## SUFFIX ## _walk(&walk, gpt, lock); \
+ \
+ gpt_lock_fault_finish(gpt, &wlock); \
+ gpt_lock_walk_free_pdp(&wlock); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _unlock_fault); \
+ \
+static bool gpt_ ## SUFFIX ## _iter_idx_pdp(struct gpt_iter *iter, \
+ uint64_t idx) \
+{ \
+ struct gpt *gpt = iter->gpt; \
+ TYPE pde, *pdep; \
+ \
+ if (!gpt_pdp_cover_idx(gpt, iter->pdp, idx)) { \
+ iter->pdp = gpt_pdp_upper_pdp(iter->pdp); \
+ return gpt_ ## SUFFIX ## _iter_idx_pdp(iter, idx); \
+ } \
+ pdep = gpt_ ## SUFFIX ## _pdep_from_idx(gpt, iter->pdp, idx); \
+ if (!gpt_pdp_shift(gpt, iter->pdp)) { \
+ iter->pdep = pdep; \
+ iter->idx = idx; \
+ return true; \
+ } \
+ pde = ACCESS_ONCE(*pdep); \
+ if (!(pde & iter->gpt->pde_valid)) { \
+ iter->pdep = NULL; \
+ return false; \
+ } \
+ iter->pdp = gpt_ ## SUFFIX ## _pde_pdp(iter->gpt, pde); \
+ return gpt_ ## SUFFIX ## _iter_idx_pdp(iter, idx); \
+} \
+ \
+bool gpt_ ## SUFFIX ## _iter_idx(struct gpt_iter *iter, uint64_t idx) \
+{ \
+ iter->pdep = NULL; \
+ if ((idx < iter->lock->first) || (idx > iter->lock->last)) \
+ return false; \
+ \
+ return gpt_ ## SUFFIX ## _iter_idx_pdp(iter, idx); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _iter_idx); \
+ \
+bool gpt_ ## SUFFIX ## _iter_first(struct gpt_iter *iter, \
+ uint64_t first, \
+ uint64_t last) \
+{ \
+ iter->pdep = NULL; \
+ if (first > last) \
+ return false; \
+ if ((first < iter->lock->first) || (first > iter->lock->last)) \
+ return false; \
+ if ((last < iter->lock->first) || (last > iter->lock->last)) \
+ return false; \
+ \
+ do { \
+ if (gpt_ ## SUFFIX ## _iter_idx_pdp(iter, first)) \
+ return true; \
+ if (first < last) \
+ first++; \
+ else \
+ return false; \
+ } while (1); \
+ return false; \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _iter_first); \
+ \
+bool gpt_ ## SUFFIX ## _iter_next(struct gpt_iter *iter) \
+{ \
+ if (!iter->pdep || iter->idx >= iter->lock->last) \
+ return false; \
+ return gpt_ ## SUFFIX ## _iter_first(iter, \
+ iter->idx + 1, \
+ iter->lock->last); \
+} \
+EXPORT_SYMBOL(gpt_ ## SUFFIX ## _iter_next)
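
For completeness, here is a sketch (not part of the patch) of how a user might
instantiate a gpt and fault in a range. The pde encoding below (bit 0 as the
valid bit, pfn in the remaining bits) and all my_* names are assumptions made
only for illustration; a real user supplies whatever encoding fits its needs
through pde_from_pdp, pde_valid, pde_mask and pde_shift.

#include <linux/gpt.h>

/* Hypothetical pde format: pfn shifted by one, bit 0 set when valid. */
static uint64_t my_pde_from_pdp(struct gpt *gpt, struct page *pdp)
{
	return ((uint64_t)page_to_pfn(pdp) << 1) | 1;
}

/* gpt is assumed freshly allocated (eg kzalloc'd) by the caller. */
static int my_gpt_setup(struct gpt *gpt, uint64_t last_idx)
{
	gpt->pde_from_pdp = &my_pde_from_pdp;
	gpt->pde_valid = 1;
	gpt->pde_shift = 1;
	gpt->pde_mask = ~1ULL;
	gpt->last_idx = last_idx;
	gpt->gfp_flags = GFP_KERNEL;
	return gpt_u64_init(gpt);
}

static int my_gpt_populate(struct gpt *gpt, uint64_t first, uint64_t last)
{
	struct gpt_lock lock;
	int ret;

	lock.first = first;
	lock.last = last;
	/* Faulter lock: allocates any missing directory covering the range. */
	ret = gpt_u64_lock_fault(gpt, &lock);
	if (ret)
		return ret;
	/* ... fill entries, eg through gpt_u64_iter_first()/_next() ... */
	gpt_u64_unlock_fault(gpt, &lock);
	return 0;
}
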
--
1.9.3

2014-11-10 18:29:40

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH 1/5] mmu_notifier: add event information to address invalidation v6

From: Jérôme Glisse <[email protected]>

The event information will be useful for new users of the mmu_notifier API.
The event argument differentiates between a vma disappearing, a page
being write protected or simply a page being unmapped. This allows new
users to take different paths for different events; for instance on unmap
the resources used to track a vma are still valid and should stay around,
while if the event says that a vma is being destroyed it means that any
resources used to track this vma can be freed.

Changed since v1:
- renamed action into event (updated commit message too).
- simplified the event names and clarified their intended usage,
also documenting what expectations the listener can have with
respect to each event.

Changed since v2:
- Avoid crazy name.
- Do not move code that do not need to move.

Changed since v3:
- Separate huge page split from mlock/munlock and softdirty.

Changed since v4:
- Rebase (no other changes).

Changed since v5:
- Typo fix.
- Changed zap_page_range from MMU_MUNMAP to MMU_MIGRATE to reflect the
fact that the address range is still valid, just the pages backing it
are no longer.

Signed-off-by: Jérôme Glisse <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
drivers/gpu/drm/i915/i915_gem_userptr.c | 3 +-
drivers/iommu/amd_iommu_v2.c | 11 ++-
drivers/misc/sgi-gru/grutlbpurge.c | 9 ++-
drivers/xen/gntdev.c | 9 ++-
fs/proc/task_mmu.c | 6 +-
include/linux/mmu_notifier.h | 131 ++++++++++++++++++++++++++------
kernel/events/uprobes.c | 10 ++-
mm/filemap_xip.c | 2 +-
mm/huge_memory.c | 39 ++++++----
mm/hugetlb.c | 23 +++---
mm/ksm.c | 18 +++--
mm/memory.c | 27 ++++---
mm/migrate.c | 9 ++-
mm/mmu_notifier.c | 28 ++++---
mm/mprotect.c | 5 +-
mm/mremap.c | 6 +-
mm/rmap.c | 24 ++++--
virt/kvm/kvm_main.c | 12 ++-
18 files changed, 269 insertions(+), 103 deletions(-)
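
To illustrate the intent of the event argument, below is a sketch (not part of
the patch) of a listener branching on it in its invalidate_range_start
callback. Only the callback signature and the enum mmu_event values come from
this patch; the my_mirror structure and my_mirror_* helpers are hypothetical.

#include <linux/mmu_notifier.h>

struct my_mirror {
	struct mmu_notifier mn;
	/* ... device specific state ... */
};

static void my_mirror_trim(struct my_mirror *m, unsigned long s, unsigned long e);
static void my_mirror_invalidate(struct my_mirror *m, unsigned long s, unsigned long e);

static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end,
				      enum mmu_event event)
{
	struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

	switch (event) {
	case MMU_MUNMAP:
		/* The address range itself goes away, so the structures
		 * tracking it and the secondary page table entries can be
		 * trimmed rather than merely invalidated. */
		my_mirror_trim(mirror, start, end);
		break;
	default:
		/* MMU_MIGRATE, MMU_WRITE_PROTECT, ...: the range stays valid,
		 * only block device access until invalidate_range_end. */
		my_mirror_invalidate(mirror, start, end);
		break;
	}
}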

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index d182058..20dbd26 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -129,7 +129,8 @@ restart:
static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
struct interval_tree_node *it = NULL;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 90d734b..57d2acf 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -413,14 +413,17 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,

static void mn_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
__mn_flush_page(mn, address);
}

static void mn_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct pasid_state *pasid_state;
struct device_state *dev_state;
@@ -441,7 +444,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,

static void mn_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct pasid_state *pasid_state;
struct device_state *dev_state;
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
*/
static void gru_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,

static void gru_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm, unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
}

static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 073b4a1..fe9da94 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,

static void mn_invl_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
struct grant_map *map;
@@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,

static void mn_invl_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
- mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+ mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
}

static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2ab200d..e75a848 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -869,7 +869,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
downgrade_write(&mm->mmap_sem);
break;
}
- mmu_notifier_invalidate_range_start(mm, 0, -1);
+ mmu_notifier_invalidate_range_start(mm, 0,
+ -1, MMU_ISDIRTY);
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
cp.vma = vma;
@@ -894,7 +895,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
&clear_refs_walk);
}
if (type == CLEAR_REFS_SOFT_DIRTY)
- mmu_notifier_invalidate_range_end(mm, 0, -1);
+ mmu_notifier_invalidate_range_end(mm, 0,
+ -1, MMU_ISDIRTY);
flush_tlb_mm(mm);
up_read(&mm->mmap_sem);
mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 95243d2..ac2a121 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,66 @@
struct mmu_notifier;
struct mmu_notifier_ops;

+/* MMU Events report fine-grained information to the callback routine, allowing
+ * the event listener to make a more informed decision as to what action to
+ * take. The event types are:
+ *
+ * - MMU_HSPLIT: huge page split, the memory is the same, only the page table
+ * structure is updated (a level is added or removed).
+ *
+ * - MMU_ISDIRTY: the dirty bit of the page table needs to be updated so that
+ * proper dirty accounting can happen.
+ *
+ * - MMU_MIGRATE: memory is migrating from one page to another, thus all write
+ * access must stop after invalidate_range_start callback returns.
+ * Furthermore, no read access should be allowed either, as a new page can
+ * be remapped with write access before the invalidate_range_end callback
+ * happens and thus any read access to old page might read stale data. There
+ * are several sources for this event, including:
+ *
+ * - A page moving to swap (various reasons, including page reclaim),
+ * - An mremap syscall,
+ * - migration for NUMA reasons,
+ * - balancing the memory pool,
+ * - write fault on COW page,
+ * - and more that are not listed here.
+ *
+ * - MMU_MPROT: memory access protection is changing. Refer to the vma to get
+ * the new access protection. All memory accesses are still valid until the
+ * invalidate_range_end callback.
+ *
+ * - MMU_MUNLOCK: unlock memory. The content of the page table stays the same
+ * but the pages are unlocked.
+ *
+ * - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
+ * process destruction). However, access is still allowed, up until the
+ * invalidate_range_free_pages callback. This also implies that secondary
+ * page table can be trimmed, because the address range is no longer valid.
+ *
+ * - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
+ * must stop after the invalidate_range_start callback returns. Read accesses
+ * are still allowed.
+ *
+ * - MMU_WRITE_PROTECT: memory is being write protected (ie should be mapped
+ * read only no matter what the vma memory protection allows). All write
+ * accesses must stop after the invalidate_range_start callback returns. Read
+ * accesses are still allowed.
+ *
+ * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
+ * because it will always lead to reasonable behavior, but will not allow the
+ * listener a chance to optimize its events.
+ */
+enum mmu_event {
+ MMU_HSPLIT = 0,
+ MMU_ISDIRTY,
+ MMU_MIGRATE,
+ MMU_MPROT,
+ MMU_MUNLOCK,
+ MMU_MUNMAP,
+ MMU_WRITE_BACK,
+ MMU_WRITE_PROTECT,
+};
+
#ifdef CONFIG_MMU_NOTIFIER

/*
@@ -82,7 +142,8 @@ struct mmu_notifier_ops {
void (*change_pte)(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
- pte_t pte);
+ pte_t pte,
+ enum mmu_event event);

/*
* Before this is invoked any secondary MMU is still ok to
@@ -93,7 +154,8 @@ struct mmu_notifier_ops {
*/
void (*invalidate_page)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address);
+ unsigned long address,
+ enum mmu_event event);

/*
* invalidate_range_start() and invalidate_range_end() must be
@@ -140,10 +202,14 @@ struct mmu_notifier_ops {
*/
void (*invalidate_range_start)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
void (*invalidate_range_end)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);

/*
* invalidate_range() is either called between
@@ -206,13 +272,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte);
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address);
+ unsigned long address,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end);

@@ -240,31 +313,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
}

static inline void mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte)
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_change_pte(mm, address, pte);
+ __mmu_notifier_change_pte(mm, address, pte, event);
}

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_page(mm, address);
+ __mmu_notifier_invalidate_page(mm, address, event);
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, start, end);
+ __mmu_notifier_invalidate_range_start(mm, start, end, event);
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_end(mm, start, end);
+ __mmu_notifier_invalidate_range_end(mm, start, end, event);
}

static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -359,13 +439,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
* old page would remain mapped readonly in the secondary MMUs after the new
* page is already writable by some CPU through the primary MMU.
*/
-#define set_pte_at_notify(__mm, __address, __ptep, __pte) \
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event) \
({ \
struct mm_struct *___mm = __mm; \
unsigned long ___address = __address; \
pte_t ___pte = __pte; \
\
- mmu_notifier_change_pte(___mm, ___address, ___pte); \
+ mmu_notifier_change_pte(___mm, ___address, ___pte, __event); \
set_pte_at(___mm, ___address, __ptep, ___pte); \
})

@@ -393,22 +473,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
}

static inline void mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte)
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
}

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
}

static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
}

static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
}

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6158a64b..f7d79d9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -176,7 +176,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
/* For try_to_free_swap() and munlock_vma_page() below */
lock_page(page);

- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
err = -EAGAIN;
ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -194,7 +195,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush_notify(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+ set_pte_at_notify(mm, addr, ptep,
+ mk_pte(kpage, vma->vm_page_prot),
+ MMU_MIGRATE);

page_remove_rmap(page);
if (!page_mapped(page))
@@ -208,7 +211,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
err = 0;
unlock:
mem_cgroup_cancel_charge(kpage, memcg);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
unlock_page(page);
return err;
}
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 0d105ae..fb97c7c 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -193,7 +193,7 @@ retry:
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
/* must invalidate_page _before_ freeing the page */
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
page_cache_release(page);
}
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 817a875..49d0cec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1028,7 +1028,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,

mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);

ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1062,7 +1063,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
page_remove_rmap(page);
spin_unlock(ptl);

- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1072,7 +1074,8 @@ out:

out_free_pages:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1164,7 +1167,8 @@ alloc:

mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);

spin_lock(ptl);
if (page)
@@ -1196,7 +1200,8 @@ alloc:
}
spin_unlock(ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
out:
return ret;
out_unlock:
@@ -1632,7 +1637,8 @@ static int __split_huge_page_splitting(struct page *page,
const unsigned long mmun_start = address;
const unsigned long mmun_end = address + HPAGE_PMD_SIZE;

- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_HSPLIT);
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
if (pmd) {
@@ -1648,7 +1654,8 @@ static int __split_huge_page_splitting(struct page *page,
ret = 1;
spin_unlock(ptl);
}
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_HSPLIT);

return ret;
}
@@ -2469,7 +2476,8 @@ static void collapse_huge_page(struct mm_struct *mm,

mmun_start = address;
mmun_end = address + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* After this gup_fast can't run anymore. This also removes
@@ -2479,7 +2487,8 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
_pmd = pmdp_clear_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2870,24 +2879,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
again:
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
return;
}
if (is_huge_zero_pmd(*pmd)) {
__split_huge_zero_page_pmd(vma, haddr, pmd);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
get_page(page);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

split_huge_page(page);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8654a52..652feac 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2560,7 +2560,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
mmun_start = vma->vm_start;
mmun_end = vma->vm_end;
if (cow)
- mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(src, mmun_start,
+ mmun_end, MMU_MIGRATE);

for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
@@ -2614,7 +2615,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}

if (cow)
- mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(src, mmun_start,
+ mmun_end, MMU_MIGRATE);

return ret;
}
@@ -2640,7 +2642,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
BUG_ON(end & ~huge_page_mask(h));

tlb_start_vma(tlb, vma);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
address = start;
again:
for (; address < end; address += sz) {
@@ -2713,7 +2716,8 @@ unlock:
if (address < end && !ref_page)
goto again;
}
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
tlb_end_vma(tlb, vma);
}

@@ -2891,8 +2895,8 @@ retry_avoidcopy:

mmun_start = address & huge_page_mask(h);
mmun_end = mmun_start + huge_page_size(h);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
/*
* Retake the page table lock to check for racing updates
* before the page tables are altered
@@ -2913,7 +2917,8 @@ retry_avoidcopy:
new_page = old_page;
}
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
out_release_all:
page_cache_release(new_page);
out_release_old:
@@ -3351,7 +3356,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
BUG_ON(address >= end);
flush_cache_range(vma, address, end);

- mmu_notifier_invalidate_range_start(mm, start, end);
+ mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
i_mmap_lock_write(vma->vm_file->f_mapping);
for (; address < end; address += huge_page_size(h)) {
spinlock_t *ptl;
@@ -3382,7 +3387,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_tlb_range(vma, start, end);
mmu_notifier_invalidate_range(mm, start, end);
i_mmap_unlock_write(vma->vm_file->f_mapping);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);

return pages << h->order;
}
diff --git a/mm/ksm.c b/mm/ksm.c
index d247efa..8c3a892 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,

mmun_start = addr;
mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_WRITE_PROTECT);

ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -904,7 +905,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
if (pte_dirty(entry))
set_page_dirty(page);
entry = pte_mkclean(pte_wrprotect(entry));
- set_pte_at_notify(mm, addr, ptep, entry);
+ set_pte_at_notify(mm, addr, ptep, entry, MMU_WRITE_PROTECT);
}
*orig_pte = *ptep;
err = 0;
@@ -912,7 +913,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
out_unlock:
pte_unmap_unlock(ptep, ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_WRITE_PROTECT);
out:
return err;
}
@@ -948,7 +950,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,

mmun_start = addr;
mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);

ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte_same(*ptep, orig_pte)) {
@@ -961,7 +964,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,

flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush_notify(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+ set_pte_at_notify(mm, addr, ptep,
+ mk_pte(kpage, vma->vm_page_prot),
+ MMU_MIGRATE);

page_remove_rmap(page);
if (!page_mapped(page))
@@ -971,7 +976,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
pte_unmap_unlock(ptep, ptl);
err = 0;
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
out:
return err;
}
diff --git a/mm/memory.c b/mm/memory.c
index 3ae93ce..187f844 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1052,7 +1052,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
mmun_end = end;
if (is_cow)
mmu_notifier_invalidate_range_start(src_mm, mmun_start,
- mmun_end);
+ mmun_end, MMU_MIGRATE);

ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
@@ -1069,7 +1069,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
} while (dst_pgd++, src_pgd++, addr = next, addr != end);

if (is_cow)
- mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
return ret;
}

@@ -1374,10 +1375,12 @@ void unmap_vmas(struct mmu_gather *tlb,
{
struct mm_struct *mm = vma->vm_mm;

- mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+ mmu_notifier_invalidate_range_start(mm, start_addr,
+ end_addr, MMU_MUNMAP);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
- mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+ mmu_notifier_invalidate_range_end(mm, start_addr,
+ end_addr, MMU_MUNMAP);
}

/**
@@ -1399,10 +1402,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start, end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, start, end);
+ mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
unmap_single_vma(&tlb, vma, start, end, details);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
tlb_finish_mmu(&tlb, start, end);
}

@@ -1425,9 +1428,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
lru_add_drain();
tlb_gather_mmu(&tlb, mm, address, end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, address, end);
+ mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
unmap_single_vma(&tlb, vma, address, end, details);
- mmu_notifier_invalidate_range_end(mm, address, end);
+ mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
tlb_finish_mmu(&tlb, address, end);
}

@@ -2212,7 +2215,8 @@ gotten:

mmun_start = address & PAGE_MASK;
mmun_end = mmun_start + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

/*
* Re-check the pte - we dropped the lock
@@ -2244,7 +2248,7 @@ gotten:
* mmu page tables (such as kvm shadow page tables), we want the
* new page to be mapped directly into the secondary page table.
*/
- set_pte_at_notify(mm, address, page_table, entry);
+ set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
update_mmu_cache(vma, address, page_table);
if (old_page) {
/*
@@ -2283,7 +2287,8 @@ gotten:
unlock:
pte_unmap_unlock(page_table, ptl);
if (mmun_end > mmun_start)
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
if (old_page) {
/*
* Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 41945cb..b5279b8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1814,12 +1814,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
WARN_ON(PageLRU(new_page));

/* Recheck the target PMD */
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
fail_putback:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

/* Reverse changes made by migrate_page_copy() */
if (TestClearPageActive(new_page))
@@ -1873,7 +1875,8 @@ fail_putback:
page_remove_rmap(page);

spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

/* Take an "isolate" reference and put new page on the LRU. */
get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 3b9b3d0..e51ea02 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -142,8 +142,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
return young;
}

-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
- pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -151,13 +153,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->change_pte)
- mn->ops->change_pte(mn, mm, address, pte);
+ mn->ops->change_pte(mn, mm, address, pte, event);
}
srcu_read_unlock(&srcu, id);
}

void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -165,13 +168,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_page)
- mn->ops->invalidate_page(mn, mm, address);
+ mn->ops->invalidate_page(mn, mm, address, event);
}
srcu_read_unlock(&srcu, id);
}

void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
+
{
struct mmu_notifier *mn;
int id;
@@ -179,14 +185,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_range_start)
- mn->ops->invalidate_range_start(mn, mm, start, end);
+ mn->ops->invalidate_range_start(mn, mm, start,
+ end, event);
}
srcu_read_unlock(&srcu, id);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -204,7 +213,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
if (mn->ops->invalidate_range)
mn->ops->invalidate_range(mn, mm, start, end);
if (mn->ops->invalidate_range_end)
- mn->ops->invalidate_range_end(mn, mm, start, end);
+ mn->ops->invalidate_range_end(mn, mm, start,
+ end, event);
}
srcu_read_unlock(&srcu, id);
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ace9345..2302721 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -152,7 +152,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
/* invoke the mmu notifier if the pmd is populated */
if (!mni_start) {
mni_start = addr;
- mmu_notifier_invalidate_range_start(mm, mni_start, end);
+ mmu_notifier_invalidate_range_start(mm, mni_start,
+ end, MMU_MPROT);
}

if (pmd_trans_huge(*pmd)) {
@@ -180,7 +181,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
} while (pmd++, addr = next, addr != end);

if (mni_start)
- mmu_notifier_invalidate_range_end(mm, mni_start, end);
+ mmu_notifier_invalidate_range_end(mm, mni_start, end, MMU_MPROT);

if (nr_huge_updates)
count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 17fa018..1ede220 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,

mmun_start = old_addr;
mmun_end = old_end;
- mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
@@ -229,7 +230,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (likely(need_flush))
flush_tlb_range(vma, old_end-len, old_addr);

- mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+ mmun_end, MMU_MIGRATE);

return len + old_addr - old_end; /* how much done */
}
diff --git a/mm/rmap.c b/mm/rmap.c
index a5e9cc6..367f882 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);

if (ret) {
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
(*cleaned)++;
}
out:
@@ -1142,6 +1142,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
spinlock_t *ptl;
int ret = SWAP_AGAIN;
enum ttu_flags flags = (enum ttu_flags)arg;
+ enum mmu_event event = MMU_MIGRATE;
+
+ if (flags & TTU_MUNLOCK)
+ event = MMU_MUNLOCK;

pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
@@ -1247,7 +1251,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
out_unmap:
pte_unmap_unlock(pte, ptl);
if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, event);
out:
return ret;

@@ -1301,7 +1305,9 @@ out_mlock:
#define CLUSTER_MASK (~(CLUSTER_SIZE - 1))

static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
- struct vm_area_struct *vma, struct page *check_page)
+ struct vm_area_struct *vma,
+ struct page *check_page,
+ enum ttu_flags flags)
{
struct mm_struct *mm = vma->vm_mm;
pmd_t *pmd;
@@ -1315,6 +1321,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
unsigned long end;
int ret = SWAP_AGAIN;
int locked_vma = 0;
+ enum mmu_event event = MMU_MIGRATE;
+
+ if (flags & TTU_MUNLOCK)
+ event = MMU_MUNLOCK;

address = (vma->vm_start + cursor) & CLUSTER_MASK;
end = address + CLUSTER_SIZE;
@@ -1329,7 +1339,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,

mmun_start = address;
mmun_end = end;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);

/*
* If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1398,7 +1408,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
if (locked_vma)
up_read(&vma->vm_mm->mmap_sem);
return ret;
@@ -1454,7 +1464,9 @@ static int try_to_unmap_nonlinear(struct page *page,
while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
if (try_to_unmap_cluster(cursor, &mapcount,
- vma, page) == SWAP_MLOCK)
+ vma, page,
+ (enum ttu_flags)arg)
+ == SWAP_MLOCK)
ret = SWAP_MLOCK;
cursor += CLUSTER_SIZE;
vma->vm_private_data = (void *) cursor;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3a31ec6..08bc07c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -263,7 +263,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)

static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush, idx;
@@ -305,7 +306,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
- pte_t pte)
+ pte_t pte,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int idx;
@@ -321,7 +323,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
@@ -347,7 +350,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);

--
1.9.3
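
To illustrate the new signatures introduced by the hunks above: with the extra
enum mmu_event argument, an mmu_notifier user can pick its invalidation strategy
per event instead of treating every callback as a full teardown. The sketch below
is purely illustrative and not part of this series (the my_* names are made up);
the dispatch mirrors the kind of switch the hmm code in the following patch does
on the event type.

    #include <linux/mmu_notifier.h>
    #include <linux/mm_types.h>

    struct my_mirror {
            struct mmu_notifier mn;
            /* device specific state would live here */
    };

    /* Stubbed-out device helpers, purely illustrative. */
    static void my_unmap(struct my_mirror *m, unsigned long s, unsigned long e) {}
    static void my_write_protect(struct my_mirror *m, unsigned long s, unsigned long e) {}

    static void my_invalidate_range_start(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end,
                                          enum mmu_event event)
    {
            struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

            switch (event) {
            case MMU_MUNMAP:
                    /* Range is gone for good: tear down device mappings. */
                    my_unmap(mirror, start, end);
                    break;
            case MMU_MPROT:
            case MMU_WRITE_PROTECT:
            case MMU_WRITE_BACK:
                    /* Same pages, reduced access: downgrade to read only. */
                    my_write_protect(mirror, start, end);
                    break;
            case MMU_HSPLIT:
            case MMU_MUNLOCK:
                    /* Same physical pages keep backing the range: nothing to do. */
                    break;
            default:
                    /* Migration, reclaim, ...: unmap, the range may come back. */
                    my_unmap(mirror, start, end);
                    break;
            }
    }

    static const struct mmu_notifier_ops my_mmu_notifier_ops = {
            .invalidate_range_start = my_invalidate_range_start,
    };

    static int my_mirror_register(struct my_mirror *mirror, struct mm_struct *mm)
    {
            mirror->mn.ops = &my_mmu_notifier_ops;
            return mmu_notifier_register(&mirror->mn, mm);
    }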

2014-11-10 18:29:35

by Jerome Glisse

[permalink] [raw]
Subject: [PATCH 4/5] hmm: heterogeneous memory management v6

From: Jérôme Glisse <[email protected]>

Motivation:

Heterogeneous memory management is intended to allow a device to transparently
access a process address space without having to lock pages of the process or
take references on them. In other words, it mirrors a process address space while
allowing regular memory management events, such as page reclamation or page
migration, to happen seamlessly.

Recent years have seen a surge in the number of specialized devices that are
part of a computer platform (from desktop to phone). So far each of those
devices has operated on its own private address space that is neither linked to
nor exposed to the process address space using it. This separation often leads
to multiple memory copies between device owned memory and process memory, which
is a waste of both cpu cycles and memory.

Over the last few years most of those devices have gained a full mmu allowing
them to support multiple page tables, page faults and other features that are
found in cpu mmus. There is now a strong incentive to start leveraging the
capabilities of such devices and to start sharing the process address space, in
order to avoid any unnecessary memory copy as well as to simplify the
programming model of those devices by sharing a unique and common address space
with the process that uses them.

The aim of heterogeneous memory management is to provide a common API that can
be used by any such device in order to mirror a process address space. The hmm
code provides a unique entry point and interfaces itself with the core mm code
of the linux kernel, avoiding duplicate implementations and shielding device
driver code from core mm code.

Moreover, hmm also intends to provide support for migrating memory to device
private memory, allowing the device to work on its own fast local memory. The
hmm code would be responsible for intercepting cpu page faults on a migrated
range and migrating it back to system memory, allowing the cpu to resume its
access to the memory.

Another feature hmm intends to provide is support for atomic operations from
the device even if the bus linking the device and the cpu does not have that
capability. On such hardware an atomic operation requires the page to be mapped
only on the device or only on the cpu, but not on both at the same time.

We expect graphics processing units and network interfaces to be among the
first users of such an api.

Hardware requirement:

Because hmm is intended to be used by device drivers there are minimum feature
requirements for the hardware mmu :
- hardware has its own page table per process (can be shared between different
  devices)
- hardware mmu supports page faults and suspends execution until the page fault
  is serviced by hmm code. The page fault must also trigger some form of
  interrupt so that hmm code can be called by the device driver.
- hardware must support at least read only mappings (otherwise it can not
  access read only ranges of the process address space).
- hardware access to system memory must be cache coherent with the cpu.

For better memory management it is highly recommended that the device also
support the following features :
- hardware mmu sets the access bit in its page table on memory access (like
  the cpu does).
- hardware page table can be updated from the cpu or through a fast path.
- hardware provides advanced statistics on which ranges of memory it accesses
  the most.
- hardware differentiates atomic memory accesses from regular accesses, making
  it possible to support atomic operations even on platforms that do not have
  atomic support on the bus linking the device with the cpu.

Implementation:

The hmm layer provides a simple API to the device driver. Each device driver
has to register an hmm device that holds pointers to all the callbacks the hmm
code will make to synchronize the device page table with the cpu page table of
a given process.

For each process it wants to mirror, the device driver must register an hmm
mirror structure that holds all the information specific to the process being
mirrored. Each hmm mirror uniquely links an hmm device with a process address
space (the mm struct).

This design allows several different device drivers to mirror the same process
concurrently. The hmm layer dispatches to each device driver the modifications
that happen to the process address space.

The hmm layer relies on the mmu notifier api to monitor changes to the process
address space. Because updates to a device page table can have unbound
completion time, the hmm layer needs the ability to sleep during mmu notifier
callbacks.

This patch only implements the core of the hmm layer and does not support
features such as migration to device memory.
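
To make the flow above concrete, here is a minimal registration sketch based
only on the declarations in include/linux/hmm.h further down in this patch. The
foo_* names are hypothetical, error handling is trimmed, and the fence callbacks
are left out because this update() completes synchronously and returns NULL;
treat it as an illustration, not as reference driver code.

    #include <linux/err.h>
    #include <linux/hmm.h>
    #include <linux/kref.h>
    #include <linux/list.h>
    #include <linux/mutex.h>
    #include <linux/sched.h>
    #include <linux/slab.h>

    struct foo_mirror {
            struct hmm_mirror mirror;
            struct kref kref;
            /* device page table handle, command queues, ... */
    };

    static void foo_mirror_free(struct kref *kref)
    {
            /* A real driver makes sure the mirror is unregistered by now. */
            kfree(container_of(kref, struct foo_mirror, kref));
    }

    static struct hmm_mirror *foo_mirror_ref(struct hmm_mirror *mirror)
    {
            kref_get(&container_of(mirror, struct foo_mirror, mirror)->kref);
            return mirror;
    }

    static struct hmm_mirror *foo_mirror_unref(struct hmm_mirror *mirror)
    {
            kref_put(&container_of(mirror, struct foo_mirror, mirror)->kref,
                     foo_mirror_free);
            return NULL;
    }

    static void foo_mirror_release(struct hmm_mirror *mirror)
    {
            /* Stop all device work on this address space, dirty pages, ... */
    }

    static struct hmm_fence *foo_update(struct hmm_mirror *mirror,
                                        struct hmm_event *event,
                                        const struct hmm_range *range)
    {
            /*
             * Update the device page table for [range->start, range->end)
             * according to event->etype. Returning NULL means the update
             * completed synchronously, so no fence callbacks are needed.
             */
            return NULL;
    }

    static const struct hmm_device_ops foo_hmm_ops = {
            .mirror_ref     = foo_mirror_ref,
            .mirror_unref   = foo_mirror_unref,
            .mirror_release = foo_mirror_release,
            .update         = foo_update,
    };

    static struct hmm_device foo_hmm_device = {
            .name   = "foo",
            .ops    = &foo_hmm_ops,
    };

    /* Once at driver load time. */
    static int foo_hmm_init(void)
    {
            /* Initialized here since hmm_device_register() is not shown above. */
            INIT_LIST_HEAD(&foo_hmm_device.mirrors);
            mutex_init(&foo_hmm_device.mutex);
            return hmm_device_register(&foo_hmm_device);
    }

    /* When a process opens the device and asks for mirroring. */
    static struct foo_mirror *foo_mirror_current(void)
    {
            struct foo_mirror *fm = kzalloc(sizeof(*fm), GFP_KERNEL);
            int ret;

            if (!fm)
                    return ERR_PTR(-ENOMEM);
            kref_init(&fm->kref);
            ret = hmm_mirror_register(&fm->mirror, &foo_hmm_device, current->mm);
            if (ret) {
                    /* Drop our initial reference, which frees fm. */
                    kref_put(&fm->kref, foo_mirror_free);
                    return ERR_PTR(ret);
            }
            return fm;
    }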

Changed since v1:
- converted fences to refcounted objects
- changed the api to provide the pte value directly, avoiding a useless
  temporary special hmm pfn value
- cleanups & fixes ...

Changed since v2:
- fixed checkpatch.pl warnings & errors
- converted to a staging feature

Changed since v3:
- Use the mmput notifier chain instead of adding an hmm destroy call to mmput.
- Clear mm->hmm inside mm_init to match mmu_notifier.
- Separate cpu page table invalidation from device page table faults to
  have cleaner and simpler code for synchronization between these two types
  of events.
- Remove the hmm_mirror kref and rely on the user to manage the lifetime of
  the hmm_mirror.

Changed since v4:
- Invalidate either in range_start() or in range_end() depending on the
kind of mmu event.
- Use the new generic page table implementation to keep an hmm mirror of
the cpu page table.
- Get rid of the range lock exclusion as it is no longer needed.
- Simplify the driver api.
- Support for huge pages.

Changed since v5:
- Take advantage of the mmu_notifier tracking of active invalidation ranges.
- Adapt to changes to the arch independent page table.

Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Jatin Kumar <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
include/linux/hmm.h | 364 +++++++++++++++
include/linux/mm.h | 11 +
include/linux/mm_types.h | 14 +
kernel/fork.c | 2 +
mm/Kconfig | 15 +
mm/Makefile | 1 +
mm/hmm.c | 1156 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 1563 insertions(+)
create mode 100644 include/linux/hmm.h
create mode 100644 mm/hmm.c

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..3331798
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,364 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* This is heterogeneous memory management (hmm). In a nutshell it provides
+ * an API to mirror a process address space on a device which has its own mmu
+ * using its own page table for the process. It supports everything except
+ * special vmas.
+ *
+ * Mandatory hardware features :
+ * - An mmu with pagetable.
+ * - Read only flag per cpu page.
+ * - Page fault ie hardware must stop and wait for kernel to service fault.
+ *
+ * Optional hardware features :
+ * - Dirty bit per cpu page.
+ * - Access bit per cpu page.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It does support migrating system memory to device
+ * memory and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped out from the cpu and core mm code
+ * point of view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/rwsem.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_device_ops;
+struct hmm_mirror;
+struct hmm_event;
+struct hmm;
+
+
+/* hmm_fence - device driver fence to wait for device driver operations.
+ *
+ * In order to concurrently update several different devices' mmu, hmm relies
+ * on device driver fences to wait for the operations it schedules to complete
+ * on the devices. It is strongly recommended to implement fences and have the
+ * hmm callbacks do as little as possible (just schedule the update and return
+ * a fence). Moreover the hmm code will reschedule the current process for i/o
+ * if necessary once it has scheduled all updates on all devices.
+ *
+ * Each fence is created as a result of either an update to range of memory or
+ * for remote memory to/from local memory dma.
+ *
+ * An update to a range of memory corresponds to a specific event type. For
+ * instance a range of memory is unmapped for page reclamation, or is unmapped
+ * from the process address space as a result of the munmap syscall (HMM_MUNMAP),
+ * or the memory protection of the range changes. There is one hmm_etype for
+ * each of those events, allowing the device driver to take the appropriate
+ * action, for instance freeing the device page table on HMM_MUNMAP but keeping
+ * it when it is just an access protection change or a temporary unmap.
+ */
+enum hmm_etype {
+ HMM_NONE = 0,
+ HMM_ISDIRTY,
+ HMM_MIGRATE,
+ HMM_MUNMAP,
+ HMM_RFAULT,
+ HMM_WFAULT,
+ HMM_WRITE_PROTECT,
+};
+
+struct hmm_fence {
+ struct hmm_mirror *mirror;
+ struct list_head list;
+};
+
+
+/* struct hmm_event - used to serialize change to overlapping range of address.
+ *
+ * @list: Core hmm keep track of all active events.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ * @fences: List of device fences associated with this event.
+ * @etype: Event type (munmap, migrate, truncate, ...).
+ * @backoff: Only meaningful for device page fault.
+ */
+struct hmm_event {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+ struct list_head fences;
+ enum hmm_etype etype;
+ bool backoff;
+};
+
+
+/* struct hmm_range - used to communicate range infos to various callback.
+ *
+ * @pte: The hmm page table entry for the range.
+ * @pdp: The page directory page struct.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ */
+struct hmm_range {
+ unsigned long *pte;
+ struct page *pdp;
+ unsigned long start;
+ unsigned long end;
+};
+
+static inline unsigned long hmm_range_size(struct hmm_range *range)
+{
+ return range->end - range->start;
+}
+
+#define HMM_PTE_VALID_PDIR_BIT 0UL
+#define HMM_PTE_VALID_SMEM_BIT 1UL
+#define HMM_PTE_WRITE_BIT 2UL
+#define HMM_PTE_DIRTY_BIT 3UL
+
+static inline unsigned long hmm_pte_from_pfn(unsigned long pfn)
+{
+ return (pfn << PAGE_SHIFT) | (1UL << HMM_PTE_VALID_SMEM_BIT);
+}
+
+static inline void hmm_pte_mk_dirty(volatile unsigned long *hmm_pte)
+{
+ set_bit(HMM_PTE_DIRTY_BIT, hmm_pte);
+}
+
+static inline void hmm_pte_mk_write(volatile unsigned long *hmm_pte)
+{
+ set_bit(HMM_PTE_WRITE_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_clear_valid_smem(volatile unsigned long *hmm_pte)
+{
+ return test_and_clear_bit(HMM_PTE_VALID_SMEM_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_clear_write(volatile unsigned long *hmm_pte)
+{
+ return test_and_clear_bit(HMM_PTE_WRITE_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_is_valid_smem(const volatile unsigned long *hmm_pte)
+{
+ return test_bit(HMM_PTE_VALID_SMEM_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_is_write(const volatile unsigned long *hmm_pte)
+{
+ return test_bit(HMM_PTE_WRITE_BIT, hmm_pte);
+}
+
+static inline unsigned long hmm_pte_pfn(unsigned long hmm_pte)
+{
+ return hmm_pte >> PAGE_SHIFT;
+}
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device.
+ *
+ * The hmm_device is the link btw hmm and each device driver.
+ */
+
+/* struct hmm_device_operations - hmm device operation callback
+ */
+struct hmm_device_ops {
+ /* mirror_ref() - take reference on mirror struct.
+ *
+ * @mirror: Struct being referenced.
+ */
+ struct hmm_mirror *(*mirror_ref)(struct hmm_mirror *mirror);
+
+ /* mirror_unref() - drop reference on mirror struct.
+ *
+ * @mirror: Struct being dereferenced.
+ */
+ struct hmm_mirror *(*mirror_unref)(struct hmm_mirror *mirror);
+
+ /* mirror_release() - device must stop using the address space.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * This callback is called either on mm destruction or as a result of
+ * a call to hmm_mirror_release(). The device driver has to stop all hw
+ * threads and all usage of the address space, and it has to dirty all
+ * pages that have been dirtied by the device.
+ */
+ void (*mirror_release)(struct hmm_mirror *mirror);
+
+ /* fence_wait() - to wait on device driver fence.
+ *
+ * @fence: The device driver fence struct.
+ * Returns: 0 on success,-EIO on error, -EAGAIN to wait again.
+ *
+ * Called when hmm want to wait for all operations associated with a
+ * fence to complete (including device cache flush if the event mandate
+ * it).
+ *
+ * The device driver must free the fence and associated resources if it
+ * returns anything other than -EAGAIN. On -EAGAIN the fence must not be freed
+ * as hmm will call back again.
+ *
+ * Return error if scheduled operation failed or if need to wait again.
+ * -EIO Some input/output error with the device.
+ * -EAGAIN The fence not yet signaled, hmm reschedule waiting thread.
+ *
+ * All other return value trigger warning and are transformed to -EIO.
+ */
+ int (*fence_wait)(struct hmm_fence *fence);
+
+ /* fence_ref() - take a reference fence structure.
+ *
+ * @fence: Fence structure hmm is referencing.
+ */
+ void (*fence_ref)(struct hmm_fence *fence);
+
+ /* fence_unref() - drop a reference fence structure.
+ *
+ * @fence: Fence structure hmm is dereferencing.
+ */
+ void (*fence_unref)(struct hmm_fence *fence);
+
+ /* update() - update device mmu for a range of address.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ * @event: The event that triggered the update.
+ * @range: All information about the range that needs to be updated.
+ * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+ *
+ * Called to update device page table for a range of address.
+ * The event type provide the nature of the update :
+ * - Range is no longer valid (munmap).
+ * - Range protection changes (mprotect, COW, ...).
+ * - Range is unmapped (swap, reclaim, page migration, ...).
+ * - Device page fault.
+ * - ...
+ *
+ * Any event that blocks further writes to the memory must also trigger a
+ * device cache flush, and everything has to be flushed to local memory by
+ * the time the wait callback returns (if this callback returned a fence;
+ * otherwise everything must be flushed by the time this callback returns).
+ *
+ * Device must properly set the dirty bit using hmm_pte_mk_dirty helper
+ * on each hmm page table entry.
+ *
+ * The driver should return a fence pointer or NULL on success. Device
+ * driver should return fence and delay wait for the operation to the
+ * fence wait callback. Returning a fence allow hmm to batch update to
+ * several devices and delay wait on those once they all have scheduled
+ * the update.
+ *
+ * The device driver must not fail lightly; any failure results in the
+ * device process being killed.
+ *
+ * Return fence or NULL on success, error value otherwise :
+ * -ENOMEM Not enough memory for performing the operation.
+ * -EIO Some input/output error with the device.
+ *
+ * All other return value trigger warning and are transformed to -EIO.
+ */
+ struct hmm_fence *(*update)(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ const struct hmm_range *range);
+};
+
+
+/* struct hmm_device - per device hmm structure
+ *
+ * @name: Device name (uniquely identify the device on the system).
+ * @ops: The hmm operations callback.
+ * @mirrors: List of all active mirrors for the device.
+ * @mutex: Mutex protecting mirrors list.
+ *
+ * Each device that want to mirror an address space must register one of this
+ * struct (only once).
+ */
+struct hmm_device {
+ const char *name;
+ const struct hmm_device_ops *ops;
+ struct list_head mirrors;
+ struct mutex mutex;
+};
+
+int hmm_device_register(struct hmm_device *device);
+int hmm_device_unregister(struct hmm_device *device);
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct
+ * associating the process address space with the device. The same process can
+ * be mirrored by several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm hmm structure
+ *
+ * @device: The hmm_device struct this hmm_mirror is associated to.
+ * @hmm: The hmm struct this hmm_mirror is associated to.
+ * @dlist: List of all hmm_mirror for same device.
+ * @mlist: List of all hmm_mirror for same process.
+ * @work: Work struct for delayed unreference.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs for each address space it wants to mirror. The same device
+ * can mirror several different address spaces, and the same address space
+ * can be mirrored by different devices.
+ */
+struct hmm_mirror {
+ struct hmm_device *device;
+ struct hmm *hmm;
+ struct list_head dlist;
+ struct list_head mlist;
+ struct work_struct work;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror,
+ struct hmm_device *device,
+ struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+static inline struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+ if (!mirror || !mirror->device)
+ return NULL;
+
+ return mirror->device->ops->mirror_ref(mirror);
+}
+
+static inline struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror)
+{
+ if (!mirror || !mirror->device)
+ return NULL;
+
+ return mirror->device->ops->mirror_unref(mirror);
+}
+
+void hmm_mirror_release(struct hmm_mirror *mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
+
+
+#endif /* CONFIG_HMM */
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b922a16..1f07826 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2172,5 +2172,16 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif

+#ifdef CONFIG_HMM
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+ mm->hmm = NULL;
+}
+#else /* !CONFIG_HMM */
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_HMM */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 33a8acf..57ea037 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -16,6 +16,10 @@
#include <asm/page.h>
#include <asm/mmu.h>

+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
#ifndef AT_VECTOR_SIZE_ARCH
#define AT_VECTOR_SIZE_ARCH 0
#endif
@@ -430,6 +434,16 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+#ifdef CONFIG_HMM
+ /*
+ * hmm always registers an mmu_notifier; we rely on the mmu notifier to
+ * keep a refcount on the mm struct as well as to forbid registering hmm
+ * on a dying mm.
+ *
+ * This field is set with mmap_sem held in write mode.
+ */
+ struct hmm *hmm;
+#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..0bb9dc4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
#include <linux/binfmts.h>
#include <linux/mman.h>
#include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/vmacache.h>
@@ -568,6 +569,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm_init_aio(mm);
mm_init_owner(mm, p);
mmu_notifier_mm_init(mm);
+ hmm_mm_init(mm);
clear_tlb_flush_pending(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 1d1ae6b..b249db0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -618,3 +618,18 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.

A sane initial value is 80 MB.
+
+if STAGING
+config HMM
+ bool "Enable heterogeneous memory management (HMM)"
+ depends on MMU
+ select MMU_NOTIFIER
+ select GENERIC_PAGE_TABLE
+ default n
+ help
+ Heterogeneous memory management provides infrastructure for a device
+ to mirror a process address space into a hardware mmu or into anything
+ that supports pagefault-like events.
+
+ If unsure, say N to disable hmm.
+endif # STAGING
diff --git a/mm/Makefile b/mm/Makefile
index 9c4371d..8e78060 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -71,3 +71,4 @@ obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..25c20ac
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,1156 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <[email protected]>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intends
+ * to provide helpers for mirroring a process address space on a device as well
+ * as allowing migration of data between system memory and device memory, the
+ * latter referred to as remote memory from here on out.
+ *
+ * Refer to include/linux/hmm.h for further information on the general design.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/workqueue.h>
+#include <linux/gpt.h>
+
+#include "internal.h"
+
+/* global SRCU for all HMMs */
+static struct srcu_struct srcu;
+
+
+/* struct hmm - per mm_struct hmm structure
+ *
+ * @device_faults: List of all active device page faults.
+ * @mirrors: List of all mirror for this mm (one per device).
+ * @mm: The mm struct this hmm is associated with.
+ * @ndevice_faults: Number of active device page faults.
+ * @kref: Reference counter
+ * @lock: Serialize the mirror list modifications.
+ * @wait_queue: Wait queue for event synchronization.
+ * @mmu_notifier: The mmu_notifier of this mm.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each devices the change made to
+ * the process address space.
+ */
+struct hmm {
+ struct list_head device_faults;
+ struct list_head mirrors;
+ struct mm_struct *mm;
+ unsigned long ndevice_faults;
+ struct kref kref;
+ spinlock_t lock;
+ wait_queue_head_t wait_queue;
+ struct mmu_notifier mmu_notifier;
+ struct gpt pt;
+};
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+static inline struct hmm *hmm_ref(struct hmm *hmm);
+static inline struct hmm *hmm_unref(struct hmm *hmm);
+
+static void hmm_mirror_delayed_unref(struct work_struct *work);
+static void hmm_mirror_handle_error(struct hmm_mirror *mirror);
+
+static void hmm_device_fence_wait(struct hmm_device *device,
+ struct hmm_fence *fence);
+
+
+/* hmm_event - use to track information relating to an event.
+ *
+ * Each change to cpu page table or fault from a device is considered as an
+ * event by hmm. For each event there is a common set of things that need to
+ * be tracked. The hmm_event struct centralize those and the helper functions
+ * help dealing with all this.
+ */
+
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+ return !((a->end <= b->start) || (a->start >= b->end));
+}
+
+static inline void hmm_event_init(struct hmm_event *event,
+ unsigned long start,
+ unsigned long end)
+{
+ event->start = start & PAGE_MASK;
+ event->end = PAGE_ALIGN(end);
+ INIT_LIST_HEAD(&event->fences);
+}
+
+static inline void hmm_event_wait(struct hmm_event *event)
+{
+ struct hmm_fence *fence, *tmp;
+
+ if (list_empty(&event->fences))
+ /* Nothing to wait for. */
+ return;
+
+ io_schedule();
+
+ list_for_each_entry_safe(fence, tmp, &event->fences, list) {
+ hmm_device_fence_wait(fence->mirror->device, fence);
+ }
+}
+
+
+/* hmm_range - range helper functions.
+ *
+ * Ranges are used to communicate between hmm functions and the device driver.
+ */
+
+static void hmm_range_update_mirrors(struct hmm_range *range,
+ struct hmm *hmm,
+ struct hmm_event *event)
+{
+ struct hmm_mirror *mirror;
+ int id;
+
+ id = srcu_read_lock(&srcu);
+ list_for_each_entry(mirror, &hmm->mirrors, mlist) {
+ struct hmm_device *device = mirror->device;
+ struct hmm_fence *fence;
+
+ fence = device->ops->update(mirror, event, range);
+ if (fence) {
+ if (IS_ERR(fence)) {
+ hmm_mirror_handle_error(mirror);
+ } else {
+ fence->mirror = hmm_mirror_ref(mirror);
+ list_add_tail(&fence->list, &event->fences);
+ }
+ }
+ }
+ srcu_read_unlock(&srcu, id);
+}
+
+static bool hmm_range_wprot(struct hmm_range *range, struct hmm *hmm)
+{
+ unsigned long i;
+ bool update = false;
+
+ for (i = 0; i < (hmm_range_size(range) >> PAGE_SHIFT); ++i) {
+ update |= hmm_pte_clear_write(&range->pte[i]);
+ }
+ return update;
+}
+
+static void hmm_range_clear(struct hmm_range *range, struct hmm *hmm)
+{
+ unsigned long i;
+
+ for (i = 0; i < (hmm_range_size(range) >> PAGE_SHIFT); ++i)
+ if (hmm_pte_clear_valid_smem(&range->pte[i]))
+ gpt_pdp_unref(&hmm->pt, range->pdp);
+}
+
+
+/* hmm - core hmm functions.
+ *
+ * Core hmm functions that deal with all the process mm activities and use
+ * events for synchronization. Those functions are used mostly as a result of
+ * cpu mm events.
+ */
+
+static uint64_t hmm_pde_from_pdp(struct gpt *gpt, struct page *pdp)
+{
+ uint64_t pde;
+
+ pde = (page_to_pfn(pdp) << PAGE_SHIFT);
+ pde |= (1UL << HMM_PTE_VALID_PDIR_BIT);
+ return pde;
+}
+
+static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
+{
+ int ret;
+
+ hmm->mm = mm;
+ kref_init(&hmm->kref);
+ INIT_LIST_HEAD(&hmm->device_faults);
+ INIT_LIST_HEAD(&hmm->mirrors);
+ spin_lock_init(&hmm->lock);
+ init_waitqueue_head(&hmm->wait_queue);
+ hmm->ndevice_faults = 0;
+
+ /* Initialize page table. */
+ hmm->pt.last_idx = (mm->highest_vm_end - 1UL) >> PAGE_SHIFT;
+ hmm->pt.pde_mask = PAGE_MASK;
+ hmm->pt.pde_shift = PAGE_SHIFT;
+ hmm->pt.pde_valid = 1UL << HMM_PTE_VALID_PDIR_BIT;
+ hmm->pt.pde_from_pdp = &hmm_pde_from_pdp;
+ hmm->pt.gfp_flags = GFP_HIGHUSER;
+ ret = gpt_ulong_init(&hmm->pt);
+ if (ret)
+ return ret;
+
+ /* register notifier */
+ hmm->mmu_notifier.ops = &hmm_notifier_ops;
+ return __mmu_notifier_register(&hmm->mmu_notifier, mm);
+}
+
+static void hmm_del_mirror_locked(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+ list_del_rcu(&mirror->mlist);
+}
+
+static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+ struct hmm_mirror *tmp_mirror;
+
+ spin_lock(&hmm->lock);
+ list_for_each_entry_rcu (tmp_mirror, &hmm->mirrors, mlist)
+ if (tmp_mirror->device == mirror->device) {
+ /* Same device can mirror only once. */
+ spin_unlock(&hmm->lock);
+ return -EINVAL;
+ }
+ list_add_rcu(&mirror->mlist, &hmm->mirrors);
+ spin_unlock(&hmm->lock);
+
+ return 0;
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+ if (hmm) {
+ if (!kref_get_unless_zero(&hmm->kref))
+ return NULL;
+ return hmm;
+ }
+ return NULL;
+}
+
+static void hmm_destroy(struct kref *kref)
+{
+ struct hmm *hmm;
+
+ hmm = container_of(kref, struct hmm, kref);
+
+ down_write(&hmm->mm->mmap_sem);
+ /* A new hmm might have been register before we get call. */
+ if (hmm->mm->hmm == hmm)
+ hmm->mm->hmm = NULL;
+ up_write(&hmm->mm->mmap_sem);
+ mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
+
+ mmu_notifier_synchronize();
+
+ gpt_free(&hmm->pt);
+ kfree(hmm);
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+ if (hmm)
+ kref_put(&hmm->kref, hmm_destroy);
+ return NULL;
+}
+
+static int hmm_device_fault_start(struct hmm *hmm, struct hmm_event *fevent)
+{
+ int ret = 0;
+
+ mmu_notifier_range_wait_valid(hmm->mm, fevent->start, fevent->end);
+
+ spin_lock(&hmm->lock);
+ if (mmu_notifier_range_is_valid(hmm->mm, fevent->start, fevent->end)) {
+ list_add_tail(&fevent->list, &hmm->device_faults);
+ hmm->ndevice_faults++;
+ fevent->backoff = false;
+ } else
+ ret = -EAGAIN;
+ spin_unlock(&hmm->lock);
+ wake_up(&hmm->wait_queue);
+
+ return ret;
+}
+
+static void hmm_device_fault_end(struct hmm *hmm, struct hmm_event *fevent)
+{
+ spin_lock(&hmm->lock);
+ list_del_init(&fevent->list);
+ hmm->ndevice_faults--;
+ spin_unlock(&hmm->lock);
+ wake_up(&hmm->wait_queue);
+}
+
+static void hmm_wait_device_fault(struct hmm *hmm, struct hmm_event *ievent)
+{
+ struct hmm_event *fevent;
+ unsigned long wait_for = 0;
+
+again:
+ spin_lock(&hmm->lock);
+ list_for_each_entry (fevent, &hmm->device_faults, list) {
+ if (!hmm_event_overlap(fevent, ievent))
+ continue;
+ fevent->backoff = true;
+ wait_for = hmm->ndevice_faults;
+ }
+ spin_unlock(&hmm->lock);
+
+ if (wait_for > 0) {
+ wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+ wait_for = 0;
+ goto again;
+ }
+}
+
+static void hmm_update(struct hmm *hmm,
+ struct hmm_event *event)
+{
+ struct hmm_range range;
+ struct gpt_lock lock;
+ struct gpt_iter iter;
+ struct gpt *pt = &hmm->pt;
+
+ /* This hmm is already fully stop. */
+ if (hmm->mm->hmm != hmm)
+ return;
+
+ hmm_wait_device_fault(hmm, event);
+
+ lock.first = event->start >> PAGE_SHIFT;
+ lock.last = (event->end - 1UL) >> PAGE_SHIFT;
+ gpt_ulong_lock_update(&hmm->pt, &lock);
+ gpt_iter_init(&iter, &hmm->pt, &lock);
+ if (!gpt_ulong_iter_first(&iter, event->start >> PAGE_SHIFT,
+ (event->end - 1UL) >> PAGE_SHIFT)) {
+ /* Empty range nothing to invalidate. */
+ gpt_ulong_unlock_update(&hmm->pt, &lock);
+ return;
+ }
+
+ for (range.start = iter.idx << PAGE_SHIFT; iter.pdep;) {
+ bool update_mirrors = true;
+
+ range.pte = iter.pdep;
+ range.pdp = iter.pdp;
+ range.end = min((gpt_pdp_last(pt, iter.pdp) + 1UL) <<
+ PAGE_SHIFT, (uint64_t)event->end);
+ if (event->etype == HMM_WRITE_PROTECT)
+ update_mirrors = hmm_range_wprot(&range, hmm);
+ if (update_mirrors)
+ hmm_range_update_mirrors(&range, hmm, event);
+
+ range.start = range.end;
+ gpt_ulong_iter_first(&iter, range.start >> PAGE_SHIFT,
+ (event->end - 1UL) >> PAGE_SHIFT);
+ }
+
+ hmm_event_wait(event);
+
+ if (event->etype == HMM_MUNMAP || event->etype == HMM_MIGRATE) {
+ BUG_ON(!gpt_ulong_iter_first(&iter, event->start >> PAGE_SHIFT,
+ (event->end - 1UL) >> PAGE_SHIFT));
+ for (range.start = iter.idx << PAGE_SHIFT; iter.pdep;) {
+ range.pte = iter.pdep;
+ range.pdp = iter.pdp;
+ range.end = min((gpt_pdp_last(pt, iter.pdp) + 1UL) <<
+ PAGE_SHIFT, (uint64_t)event->end);
+ hmm_range_clear(&range, hmm);
+ range.start = range.end;
+ gpt_ulong_iter_first(&iter, range.start >> PAGE_SHIFT,
+ (event->end - 1UL) >> PAGE_SHIFT);
+ }
+ }
+
+ gpt_ulong_unlock_update(&hmm->pt, &lock);
+}
+
+static int hmm_do_mm_fault(struct hmm *hmm,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ int r;
+
+ for (; addr < event->end; addr += PAGE_SIZE) {
+ unsigned flags = 0;
+
+ flags |= event->etype == HMM_WFAULT ? FAULT_FLAG_WRITE : 0;
+ flags |= FAULT_FLAG_ALLOW_RETRY;
+ do {
+ r = handle_mm_fault(mm, vma, addr, flags);
+ if (!(r & VM_FAULT_RETRY) && (r & VM_FAULT_ERROR)) {
+ if (r & VM_FAULT_OOM)
+ return -ENOMEM;
+ /* Same error code for all other cases. */
+ return -EFAULT;
+ }
+ flags &= ~FAULT_FLAG_ALLOW_RETRY;
+ } while (r & VM_FAULT_RETRY);
+ }
+
+ return 0;
+}
+
+
+/* hmm_notifier - HMM callbacks for the mmu_notifier tracking changes to the mm.
+ *
+ * HMM uses mmu notifiers to track changes made to the process address space.
+ */
+
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ struct hmm_mirror *mirror;
+ struct hmm *hmm;
+
+ /* The hmm structure can not be free because the mmu_notifier srcu is
+ * read locked thus any concurrent hmm_mirror_unregister that would
+ * free hmm would have to wait on the mmu_notifier.
+ */
+ hmm = container_of(mn, struct hmm, mmu_notifier);
+ spin_lock(&hmm->lock);
+ mirror = list_first_or_null_rcu(&hmm->mirrors,
+ struct hmm_mirror,
+ mlist);
+ while (mirror) {
+ hmm_del_mirror_locked(hmm, mirror);
+ spin_unlock(&hmm->lock);
+
+ mirror->device->ops->mirror_release(mirror);
+ INIT_WORK(&mirror->work, hmm_mirror_delayed_unref);
+ schedule_work(&mirror->work);
+
+ spin_lock(&hmm->lock);
+ mirror = list_first_or_null_rcu(&hmm->mirrors,
+ struct hmm_mirror,
+ mlist);
+ }
+ spin_unlock(&hmm->lock);
+
+ synchronize_srcu(&srcu);
+
+ wake_up(&hmm->wait_queue);
+}
+
+static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
+ unsigned long addr,
+ enum mmu_event mmu_event,
+ enum hmm_etype *etype)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, addr);
+ if (!vma || vma->vm_start > addr || !(vma->vm_flags & VM_READ)) {
+ *etype = HMM_MUNMAP;
+ return;
+ }
+
+ if (!(vma->vm_flags & VM_WRITE)) {
+ *etype = HMM_WRITE_PROTECT;
+ return;
+ }
+
+ *etype = HMM_NONE;
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ const struct mmu_notifier_range *range)
+{
+ struct hmm_event event;
+ unsigned long start = range->start, end = range->end;
+ struct hmm *hmm;
+
+ /* FIXME this should not happen beside when process is exiting. */
+ if (start >= mm->highest_vm_end)
+ return;
+ if (end > mm->highest_vm_end)
+ end = mm->highest_vm_end;
+
+ switch (range->event) {
+ case MMU_HSPLIT:
+ case MMU_MUNLOCK:
+ /* Still same physical ram backing same address. */
+ return;
+ case MMU_MPROT:
+ hmm_mmu_mprot_to_etype(mm, start, range->event, &event.etype);
+ if (event.etype == HMM_NONE)
+ return;
+ break;
+ case MMU_WRITE_BACK:
+ case MMU_WRITE_PROTECT:
+ event.etype = HMM_WRITE_PROTECT;
+ break;
+ case MMU_ISDIRTY:
+ event.etype = HMM_ISDIRTY;
+ break;
+ case MMU_MUNMAP:
+ event.etype = HMM_MUNMAP;
+ break;
+ case MMU_MIGRATE:
+ default:
+ event.etype = HMM_MIGRATE;
+ break;
+ }
+
+ hmm = container_of(mn, struct hmm, mmu_notifier);
+ hmm_event_init(&event, start, end);
+
+ hmm_update(hmm, &event);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long addr,
+ enum mmu_event mmu_event)
+{
+ struct mmu_notifier_range range;
+
+ range.start = addr & PAGE_MASK;
+ range.end = range.start + PAGE_SIZE;
+ range.event = mmu_event;
+ hmm_notifier_invalidate_range_start(mn, mm, &range);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+ .release = hmm_notifier_release,
+ /* .clear_flush_young FIXME we probably want to do something. */
+ /* .test_young FIXME we probably want to do something. */
+ /* WARNING .change_pte must always bracketed by range_start/end there
+ * was patches to remove that behavior we must make sure that those
+ * patches are not included as there are alternative solutions to issue
+ * they are trying to solve.
+ *
+ * Fact is hmm can not use the change_pte callback as non sleeping lock
+ * are held during change_pte callback.
+ */
+ .change_pte = NULL,
+ .invalidate_page = hmm_notifier_invalidate_page,
+ .invalidate_range_start = hmm_notifier_invalidate_range_start,
+};
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A process
+ * can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by device drivers to
+ * mirror a process address space. Those functions either deal with updating
+ * the device page table (through the hmm callbacks), or provide helpers used
+ * by the device driver to fault in a range of memory in the device page table.
+ */
+
+/* hmm_mirror_register() - register a device mirror against an mm struct
+ *
+ * @mirror: The mirror that link process address space with the device.
+ * @device: The device struct to associate this mirror with.
+ * @mm: The mm struct of the process.
+ * Returns: 0 success, -ENOMEM or -EINVAL if process already mirrored.
+ *
+ * Called when a device driver wants to start mirroring a process address
+ * space. The hmm shim will register an mmu_notifier and start monitoring
+ * process address space changes, hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The mm pin must also be held (either the task is current or use get_task_mm).
+ *
+ * Only one mirror per mm and hmm_device can be created; it will return -EINVAL
+ * if the hmm_device already has an hmm_mirror for the mm.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror,
+ struct hmm_device *device,
+ struct mm_struct *mm)
+{
+ struct hmm *hmm = NULL;
+ int ret = 0;
+
+ /* Sanity checks. */
+ BUG_ON(!mirror);
+ BUG_ON(!device);
+ BUG_ON(!mm);
+
+ /*
+ * Initialize the mirror struct fields, the mlist init and del dance is
+ * necessary to make the error path easier for driver and for hmm.
+ */
+ INIT_LIST_HEAD(&mirror->mlist);
+ list_del(&mirror->mlist);
+ INIT_LIST_HEAD(&mirror->dlist);
+ mutex_lock(&device->mutex);
+ mirror->device = device;
+ list_add(&mirror->dlist, &device->mirrors);
+ mutex_unlock(&device->mutex);
+ mirror->hmm = NULL;
+ mirror = hmm_mirror_ref(mirror);
+ if (!mirror) {
+ mutex_lock(&device->mutex);
+ list_del_init(&mirror->dlist);
+ mutex_unlock(&device->mutex);
+ return -EINVAL;
+ }
+
+ down_write(&mm->mmap_sem);
+
+ hmm = mm->hmm ? hmm_ref(mm->hmm) : NULL;
+ if (hmm == NULL) {
+ /* no hmm registered yet so register one */
+ hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+ if (hmm == NULL) {
+ up_write(&mm->mmap_sem);
+ hmm_mirror_unref(mirror);
+ return -ENOMEM;
+ }
+
+ ret = hmm_init(hmm, mm);
+ if (ret) {
+ up_write(&mm->mmap_sem);
+ hmm_mirror_unref(mirror);
+ kfree(hmm);
+ return ret;
+ }
+
+ mm->hmm = hmm;
+ }
+
+ mirror->hmm = hmm;
+ ret = hmm_add_mirror(hmm, mirror);
+ up_write(&mm->mmap_sem);
+ if (ret) {
+ mirror->hmm = NULL;
+ hmm_mirror_unref(mirror);
+ hmm_unref(hmm);
+ return ret;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+static void hmm_mirror_delayed_unref(struct work_struct *work)
+{
+ struct hmm_mirror *mirror;
+
+ mirror = container_of(work, struct hmm_mirror, work);
+ hmm_mirror_unref(mirror);
+}
+
+static void hmm_mirror_handle_error(struct hmm_mirror *mirror)
+{
+ struct hmm *hmm = mirror->hmm;
+
+ spin_lock(&hmm->lock);
+ if (mirror->mlist.prev != LIST_POISON2) {
+ hmm_del_mirror_locked(hmm, mirror);
+ spin_unlock(&hmm->lock);
+
+ mirror->device->ops->mirror_release(mirror);
+ INIT_WORK(&mirror->work, hmm_mirror_delayed_unref);
+ schedule_work(&mirror->work);
+ } else
+ spin_unlock(&hmm->lock);
+}
+
+/* hmm_mirror_unregister() - unregister an hmm_mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * Device driver must call this function when it is destroying a registered
+ * mirror structure. If destruction was initiated by the device driver then
+ * it must have called hmm_mirror_release() prior to calling this function.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+ BUG_ON(!mirror || !mirror->device);
+ BUG_ON(mirror->mlist.prev != LIST_POISON2);
+
+ mirror->hmm = hmm_unref(mirror->hmm);
+
+ mutex_lock(&mirror->device->mutex);
+ list_del_init(&mirror->dlist);
+ mutex_unlock(&mirror->device->mutex);
+ mirror->device = NULL;
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+/* hmm_mirror_release() - release an hmm_mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * Device driver must call this function when it wants to stop mirroring the
+ * process.
+ */
+void hmm_mirror_release(struct hmm_mirror *mirror)
+{
+ if (!mirror->hmm)
+ return;
+
+ spin_lock(&mirror->hmm->lock);
+ /* Check if the mirror is already removed from the mirror list in which
+ * case there is no reason to call release.
+ */
+ if (mirror->mlist.prev != LIST_POISON2) {
+ hmm_del_mirror_locked(mirror->hmm, mirror);
+ spin_unlock(&mirror->hmm->lock);
+
+ mirror->device->ops->mirror_release(mirror);
+ synchronize_srcu(&srcu);
+
+ hmm_mirror_unref(mirror);
+ } else
+ spin_unlock(&mirror->hmm->lock);
+}
+EXPORT_SYMBOL(hmm_mirror_release);
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ unsigned long *start,
+ struct gpt_iter *iter)
+{
+ unsigned long addr = *start & PAGE_MASK;
+
+ if (!gpt_ulong_iter_idx(iter, addr >> PAGE_SHIFT))
+ return -EINVAL;
+
+ do {
+ struct hmm_device *device = mirror->device;
+ unsigned long *pte = (unsigned long *)iter->pdep;
+ struct hmm_fence *fence;
+ struct hmm_range range;
+
+ if (event->backoff)
+ return -EAGAIN;
+
+ range.start = addr;
+ range.end = min((gpt_pdp_last(iter->gpt, iter->pdp) + 1UL) <<
+ PAGE_SHIFT, (uint64_t)event->end);
+ range.pte = iter->pdep;
+ for (; addr < range.end; addr += PAGE_SIZE, ++pte) {
+ if (!hmm_pte_is_valid_smem(pte)) {
+ *start = addr;
+ return 0;
+ }
+ if (event->etype == HMM_WFAULT &&
+ !hmm_pte_is_write(pte)) {
+ *start = addr;
+ return 0;
+ }
+ }
+
+ fence = device->ops->update(mirror, event, &range);
+ if (fence) {
+ if (IS_ERR(fence)) {
+ *start = range.start;
+ return -EIO;
+ }
+ fence->mirror = hmm_mirror_ref(mirror);
+ list_add_tail(&fence->list, &event->fences);
+ }
+
+ } while (addr < event->end &&
+ gpt_ulong_iter_idx(iter, addr >> PAGE_SHIFT));
+
+ *start = addr;
+ return 0;
+}
+
+struct hmm_mirror_fault {
+ struct hmm_mirror *mirror;
+ struct hmm_event *event;
+ struct vm_area_struct *vma;
+ unsigned long addr;
+ struct gpt_iter *iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ struct gpt_iter *iter,
+ pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end)
+{
+ struct page *page;
+ unsigned long *hmm_pte, i;
+ unsigned flags = FOLL_TOUCH;
+ spinlock_t *ptl;
+
+ ptl = pmd_lock(mirror->hmm->mm, pmdp);
+ if (unlikely(!pmd_trans_huge(*pmdp))) {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }
+ if (unlikely(pmd_trans_splitting(*pmdp))) {
+ spin_unlock(ptl);
+ wait_split_huge_page(vma->anon_vma, pmdp);
+ return -EAGAIN;
+ }
+ flags |= event->etype == HMM_WFAULT ? FOLL_WRITE : 0;
+ page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+ spin_unlock(ptl);
+
+ BUG_ON(!gpt_ulong_iter_idx(iter, start >> PAGE_SHIFT));
+ hmm_pte = iter->pdep;
+
+ gpt_pdp_lock(&mirror->hmm->pt, iter->pdp);
+ for (i = 0; start < end; start += PAGE_SIZE, ++i, ++page) {
+ if (!hmm_pte_is_valid_smem(&hmm_pte[i])) {
+ hmm_pte[i] = hmm_pte_from_pfn(page_to_pfn(page));
+ gpt_pdp_ref(&mirror->hmm->pt, iter->pdp);
+ }
+ BUG_ON(hmm_pte_pfn(hmm_pte[i]) != page_to_pfn(page));
+ if (pmd_write(*pmdp))
+ hmm_pte_mk_write(&hmm_pte[i]);
+ }
+ gpt_pdp_unlock(&mirror->hmm->pt, iter->pdp);
+
+ return 0;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end,
+ struct mm_walk *walk)
+{
+ struct hmm_mirror_fault *mirror_fault = walk->private;
+ struct vm_area_struct *vma = mirror_fault->vma;
+ struct hmm_mirror *mirror = mirror_fault->mirror;
+ struct hmm_event *event = mirror_fault->event;
+ struct gpt_iter *iter = mirror_fault->iter;
+ unsigned long addr = start, i, *hmm_pte;
+ struct hmm *hmm = mirror->hmm;
+ pte_t *ptep;
+ int ret = 0;
+
+ /* Make sure there was no gap. */
+ if (start != mirror_fault->addr)
+ return -ENOENT;
+
+ if (event->backoff)
+ return -EAGAIN;
+
+ if (pmd_none(*pmdp))
+ return -ENOENT;
+
+ if (pmd_trans_huge(*pmdp)) {
+ ret = hmm_mirror_fault_hpmd(mirror, event, vma, iter,
+ pmdp, start, end);
+ mirror_fault->addr = ret ? start : end;
+ return ret;
+ }
+
+ if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+ return -EFAULT;
+
+ BUG_ON(!gpt_ulong_iter_idx(iter, start >> PAGE_SHIFT));
+ hmm_pte = iter->pdep;
+
+ ptep = pte_offset_map(pmdp, start);
+ gpt_pdp_lock(&hmm->pt, iter->pdp);
+ for (i = 0; addr < end; addr += PAGE_SIZE, ++i) {
+ if (!pte_present(*ptep) ||
+ ((event->etype == HMM_WFAULT) && !pte_write(*ptep))) {
+ ptep++;
+ ret = -ENOENT;
+ break;
+ }
+
+ if (!hmm_pte_is_valid_smem(&hmm_pte[i])) {
+ hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+ gpt_pdp_ref(&hmm->pt, iter->pdp);
+ }
+ BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+ if (pte_write(*ptep))
+ hmm_pte_mk_write(&hmm_pte[i]);
+ ptep++;
+ }
+ gpt_pdp_unlock(&hmm->pt, iter->pdp);
+ pte_unmap(ptep - 1);
+ mirror_fault->addr = addr;
+
+ return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct vm_area_struct *vma)
+{
+ struct hmm_mirror_fault mirror_fault;
+ struct mm_walk walk = {0};
+ struct gpt_lock lock;
+ struct gpt_iter iter;
+ unsigned long addr;
+ int ret = 0;
+
+ if ((event->etype == HMM_WFAULT) && !(vma->vm_flags & VM_WRITE))
+ return -EACCES;
+
+ ret = hmm_device_fault_start(mirror->hmm, event);
+ if (ret)
+ return ret;
+
+ addr = event->start;
+ lock.first = event->start >> PAGE_SHIFT;
+ lock.last = (event->end - 1UL) >> PAGE_SHIFT;
+ ret = gpt_ulong_lock_fault(&mirror->hmm->pt, &lock);
+ if (ret) {
+ hmm_device_fault_end(mirror->hmm, event);
+ return ret;
+ }
+ gpt_iter_init(&iter, &mirror->hmm->pt, &lock);
+
+again:
+ ret = hmm_mirror_update(mirror, event, &addr, &iter);
+ if (ret)
+ goto out;
+
+ if (event->backoff) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ if (addr >= event->end)
+ goto out;
+
+ mirror_fault.event = event;
+ mirror_fault.mirror = mirror;
+ mirror_fault.vma = vma;
+ mirror_fault.addr = addr;
+ mirror_fault.iter = &iter;
+ walk.mm = mirror->hmm->mm;
+ walk.private = &mirror_fault;
+ walk.pmd_entry = hmm_mirror_fault_pmd;
+ ret = walk_page_range(addr, event->end, &walk);
+ hmm_event_wait(event);
+ if (!ret)
+ goto again;
+ addr = mirror_fault.addr;
+
+out:
+ gpt_ulong_unlock_fault(&mirror->hmm->pt, &lock);
+ hmm_device_fault_end(mirror->hmm, event);
+ if (ret == -ENOENT) {
+ ret = hmm_do_mm_fault(mirror->hmm, event, vma, addr);
+ ret = ret ? ret : -EAGAIN;
+ }
+ return ret;
+}
+
+/* hmm_mirror_fault() - called by the device driver on a device memory fault.
+ *
+ * @mirror: Mirror related to the fault if any.
+ * @event: Event describing the fault.
+ *
+ * The device driver calls this function either when it needs to fill its page
+ * table for a range of addresses or when it needs to migrate memory between
+ * system and remote memory.
+ *
+ * This function performs the vma lookup and access permission checks on behalf
+ * of the device. If the device asks for range [A; D] but there is only a valid
+ * vma starting at B with A < B < D, then this function returns -EFAULT and
+ * sets event->end to B so the device driver can either report an issue back or
+ * call hmm_mirror_fault() again with the range updated to [B; D].
+ *
+ * This allows the device driver to optimistically fault a range of addresses
+ * without having to know the valid vma ranges. The device driver can then take
+ * proper action if a real memory access happens inside an invalid address
+ * range.
+ *
+ * Also the fault will clamp the requested range to the valid vma range (unless
+ * the vma into which event->start falls can grow). So in the previous example,
+ * if D is not covered by any vma then hmm_mirror_fault() will stop at C with
+ * C < D, C being the last address of the valid vma, and event->end will be set
+ * to C.
+ *
+ * All errors must be handled by the device driver and most likely result in
+ * the process device tasks being killed by the device driver.
+ *
+ * Returns:
+ * > 0 Number of pages faulted.
+ * -EINVAL if invalid argument.
+ * -ENOMEM if failing to allocate memory.
+ * -EACCES if trying to write to read only address.
+ * -EFAULT if trying to access an invalid address.
+ * -ENODEV if the mirror is in the process of being destroyed.
+ * -EIO if device driver update callback failed.
+ */
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+ struct vm_area_struct *vma;
+ int ret = 0;
+
+ if (!mirror || !event || event->start >= event->end)
+ return -EINVAL;
+
+ hmm_event_init(event, event->start, event->end);
+ if (event->end > mirror->hmm->mm->highest_vm_end)
+ return -EFAULT;
+
+retry:
+ if (!mirror->hmm->mm->hmm)
+ return -ENODEV;
+
+ /*
+ * Synchronization with the cpu page table is the most important and
+ * tedious aspect of device page faults. There must be a strong
+ * ordering between the device->update() call for a device page fault
+ * and the device->update() call for a cpu page table
+ * invalidation/update.
+ *
+ * Pages that are exposed to the device driver must stay valid while
+ * the callback is in progress, ie any cpu page table invalidation
+ * that renders those pages obsolete must call device->update() after
+ * the device->update() call that faulted those pages.
+ *
+ * To achieve this we rely on a few things. First, the mmap_sem insures
+ * that any munmap() syscall will serialize with us. So the issues are
+ * with unmap_mapping_range() and with page migration or merging. For
+ * those, hmm keeps track of the affected address ranges and blocks
+ * device page faults that hit an overlapping range.
+ */
+ down_read(&mirror->hmm->mm->mmap_sem);
+ vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+ if (!vma) {
+ ret = -EFAULT;
+ goto out;
+ }
+ if (vma->vm_start > event->start) {
+ event->end = vma->vm_start;
+ ret = -EFAULT;
+ goto out;
+ }
+ event->end = min(event->end, vma->vm_end);
+ if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ switch (event->etype) {
+ case HMM_RFAULT:
+ case HMM_WFAULT:
+ ret = hmm_mirror_handle_fault(mirror, event, vma);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+out:
+ /* Drop the mmap_sem so anyone waiting on it has a chance. */
+ up_read(&mirror->hmm->mm->mmap_sem);
+ if (ret == -EAGAIN)
+ goto retry;
+ return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link between hmm and each device driver.
+ */
+
+/* hmm_device_register() - register a device with hmm.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success, -EINVAL otherwise.
+ *
+ * Called when the device driver wants to register itself with hmm. A device
+ * driver can only register once. It will return a reference on the device,
+ * thus to release a device the driver must unreference it.
+ */
+int hmm_device_register(struct hmm_device *device)
+{
+ /* sanity check */
+ BUG_ON(!device);
+ BUG_ON(!device->ops);
+ BUG_ON(!device->ops->mirror_ref);
+ BUG_ON(!device->ops->mirror_unref);
+ BUG_ON(!device->ops->mirror_release);
+ BUG_ON(!device->ops->fence_wait);
+ BUG_ON(!device->ops->fence_ref);
+ BUG_ON(!device->ops->fence_unref);
+ BUG_ON(!device->ops->update);
+
+ mutex_init(&device->mutex);
+ INIT_LIST_HEAD(&device->mirrors);
+
+ return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+/* hmm_device_unregister() - unregister a device with hmm.
+ *
+ * @device: The hmm_device struct.
+ *
+ * Called when the device driver wants to unregister itself with hmm. This will
+ * check if there is any active mirror and return -EBUSY if so. It is the
+ * device driver's responsibility to clean up and stop all mirrors before
+ * calling this.
+ */
+int hmm_device_unregister(struct hmm_device *device)
+{
+ struct hmm_mirror *mirror;
+
+ mutex_lock(&device->mutex);
+ mirror = list_first_entry_or_null(&device->mirrors,
+ struct hmm_mirror,
+ dlist);
+ mutex_unlock(&device->mutex);
+ if (mirror)
+ return -EBUSY;
+ return 0;
+}
+EXPORT_SYMBOL(hmm_device_unregister);
+
+static void hmm_device_fence_wait(struct hmm_device *device,
+ struct hmm_fence *fence)
+{
+ struct hmm_mirror *mirror;
+ int r;
+
+ if (fence == NULL)
+ return;
+
+ list_del_init(&fence->list);
+ do {
+ r = device->ops->fence_wait(fence);
+ if (r == -EAGAIN)
+ io_schedule();
+ } while (r == -EAGAIN);
+
+ mirror = fence->mirror;
+ device->ops->fence_unref(fence);
+ if (r)
+ hmm_mirror_handle_error(mirror);
+ hmm_mirror_unref(mirror);
+}
+
+
+static int __init hmm_subsys_init(void)
+{
+ return init_srcu_struct(&srcu);
+}
+subsys_initcall(hmm_subsys_init);
--
1.9.3
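
For illustration, here is a rough sketch of how a device driver might drive
hmm_mirror_fault() given the semantics documented in the patch above. The
my_handle_device_fault() function and its calling context are hypothetical,
and <linux/hmm.h> is assumed to be the header added by this series; only
hmm_mirror_fault(), struct hmm_event and the HMM_RFAULT/HMM_WFAULT event
types come from the patch itself.

/*
 * Illustration only: a hypothetical driver-side fault handler built on
 * the hmm_mirror_fault() semantics documented above.
 */
#include <linux/mm.h>
#include <linux/hmm.h>	/* assumed to be added by this series */

static int my_handle_device_fault(struct hmm_mirror *mirror,
				  unsigned long addr,
				  unsigned long size,
				  bool write)
{
	unsigned long last = PAGE_ALIGN(addr + size);
	struct hmm_event event;
	int ret;

	event.start = addr & PAGE_MASK;
	event.end = last;
	event.etype = write ? HMM_WFAULT : HMM_RFAULT;

	for (;;) {
		ret = hmm_mirror_fault(mirror, &event);
		if (ret != -EFAULT)
			break;
		/*
		 * -EFAULT with event.end updated to B means [start, B) had
		 * no valid vma; either report the access as invalid or, as
		 * here, optimistically retry from the clamped boundary.
		 */
		if (event.end >= last)
			break;
		event.start = event.end;
		event.end = last;
	}
	return ret;
}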

2014-11-10 20:22:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

Ok, so things are somewhat calm, and I'm trying to take time off to
see what's going on. And I'm not happy.

On Mon, Nov 10, 2014 at 10:28 AM, <[email protected]> wrote:
>
> Page table is a common structure format most notably use by cpu mmu. The
> arch depend page table code has strong tie to the architecture which makes
> it unsuitable to be use by other non arch specific code.

Please don't call this thing a "generic page table".

It is no such thing. The *real* page tables are page tables. This is
some kind of "mapping lookup", and has nothing to do with page tables
as far as I can see. Why do you call it a page table?

Also, why isn't this just using our *existing* generic mapping
functionality, which already uses a radix tree, and has a lot of
lockless models? We already *have* something like that, and it's
called a "struct address_space".

And if you *just* want the tree, why don't you use "struct radix_tree_root".

And if it's generic, why do you have that odd insane conditional
locking going on?

In other words, looking at this, I just go "this is re-implementing
existing models, and uses naming that is actively misleading".

I think it's actively horrible, in other words. The fact that you have
one ACK on it already makes me go "Hmm". Is there some actual reason
why this would be called a page table, when even your explanation very
much clarifies that it is explicitly written to *not* be an actual
page table?

I also find it absolutely disgusting how you use USE_SPLIT_PTE_PTLOCKS
for this, which seems to make absolutely zero sense. So you're sharing
the config with the *real* page tables for no reason I can see.

I'm also looking at the "locking". It's insane. It's wrong, and
doesn't have any serialization. Using the bit operations for locking
is not correct. We've gotten over that years ago.

Rik, the fact that you acked this just makes all your other ack's be
suspect. Did you do it just because it was from Red Hat, or do you do
it because you like seeing Acked-by's with your name?

Anyway, this gets a NAK from me. Maybe I'm missing something, but I
think naming is supremely important, and I really don't see the point
of this. At a minimum, it needs a *hell* of a lot more explanations
for all it does. And quite frankly, I don't think that will be
sufficient, since the whole "bitops for locking" looks downright
buggy, and it's not at all clear why you want this in the first place
as opposed to just using gang lookups on the radix trees that we
already have, and that is well-tested and known to scale fine.

So really, it boils down to: why is this any better than radix trees
that are well-named, tested, and work?

Linus

2014-11-10 20:58:30

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 12:22:03PM -0800, Linus Torvalds wrote:
> Ok, so things are somewhat calm, and I'm trying to take time off to
> see what's going on. And I'm not happy.
>
> On Mon, Nov 10, 2014 at 10:28 AM, <[email protected]> wrote:
> >
> > Page table is a common structure format most notably use by cpu mmu. The
> > arch depend page table code has strong tie to the architecture which makes
> > it unsuitable to be use by other non arch specific code.
>
> Please don't call this thing a "generic page table".
>
> It is no such thing. The *real* page tables are page tables. This is
> some kind of "mapping lookup", and has nothing to do with page tables
> as far as I can see. Why do you call it a page table?

I did this because the intention is to use it to implement hardware page
tables for different hardware (in my case AMD, NVidia). So it would be
used for real page tables, just not for the cpu but for the gpu.

Also, during Linux Plumbers, people working on IOMMU expressed their wish
to see some generic "page table" code that can be shared among IOMMUs, as
most IOMMUs use a page table directory hierarchy for mapping and it is not
the same as the one used by the CPU.

Those are the two main reasons why i named it page table. It simply
fulfills the same role as the CPU page table but for other hardware blocks,
and it tries to do it in a generic way.

>
> Also, why isn't this just using our *existing* generic mapping
> functionality, which already uses a radix tree, and has a lot of
> lockless models? We already *have* something like that, and it's
> called a "struct address_space".
>
> And if you *just* want the tree, why don't you use "struct radix_tree_root".

struct radix_tree_root does not have the fields i need to implement a generic
"page table", as i need a callback from the user to build the page directory
entries.

>
> And if it's generic, why do you have that odd insane conditional
> locking going on?
>

I am not sure which locking you are referring to here. The design is
to allow concurrent readers and faulters to operate at the same time. For
this i need readers to ignore newly faulted|created directories. So during
the table walk there is a bit of trickery to achieve just that.

> In other words, looking at this, I just go "this is re-implementing
> existing models, and uses naming that is actively misleading".
>
> I think it's actively horrible, in other words. The fact that you have
> one ACK on it already makes me go "Hmm". Is there some actual reason
> why this would be called a page table, when even your explanation very
> much clarifies that it is explicitly written to *not* be an actual
> page table.
>
> I also find it absolutely disgusting how you use USE_SPLIT_PTE_PTLOCKS
> for this, which seems to make absolutely zero sense. So you're sharing
> the config with the *real* page tables for no reason I can see.
>

Updates to a page directory are synchronized through the spinlock of each
page backing a directory, which is why i rely on that option. As explained
above i am trying to adapt the design of the CPU page table to other hw page
tables. The only difference is that the page directory entries and the page
table entries are different from the CPU's and vary from one hw to the other.

I wanted to have generic code that can accommodate different hw at runtime
and not target one specific CPU format at build time.

> I'm also looking at the "locking". It's insane. It's wrong, and
> doesn't have any serialization. Using the bit operations for locking
> is not correct. We've gotten over that years ago.

Bit operations are not used for locking, at least not for inter-thread sync.
They are used for intra-thread bookkeeping: because the walk down of one
directory often needs to go over the entries of that directory several times,
there is a need to remember, between those loops, which entries inside the
current directory the current thread needs to care about. All the bit
operations are used only for that. Everything else uses the struct page
spinlock or the global common spinlock, and atomics to keep directory pages
alive.

All wlock structs are local to a thread and never shared.

>
> Rik, the fact that you acked this just makes all your other ack's be
> suspect. Did you do it just because it was from Red Hat, or do you do
> it because you like seeing Acked-by's with your name?
>
> Anyway, this gets a NAK from me. Maybe I'm missing something, but I
> think naming is supremely important, and I really don't see the point
> of this. At a minimum, it needs a *hell* of a lot more explanations
> for all it does. And quite frankly, I don't think that will be
> sufficient, since the whole "bitops for locking" looks downright
> buggy, and it's not at all clear why you want this in the first place
> as opposed to just using gang lookups on the radix trees that we
> already have, and that is well-tested and known to scale fine.
>
> So really, it boils down to: why is this any better than radix trees
> that are well-named, tested, and work?

I hope all the above helps clarify my intention and i apologize for the lack
of clarity in my commit message and in the code comments. I can include
the above motivation to make this clear.

If you still dislike me reusing the page table name i am open to any
suggestion for a better name. But in my mind this is really intended to
be used to implement hw specific page tables and i would like to share
the guts of it among different hw and possibly with IOMMU folks too.

Thanks for taking time to look at this, much appreciated.

Cheers,
Jérôme

>
> Linus

2014-11-10 21:35:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 12:58 PM, Jerome Glisse <[email protected]> wrote:
> On Mon, Nov 10, 2014 at 12:22:03PM -0800, Linus Torvalds wrote:
>
> Also during Linux Plumber people working on IOMMU expressed there wish to
> see some generic "page table" code that can be share among IOMMU as most
> IOMMU use a page table directory hierarchy for mapping and it is not the
> same as the one use by the CPU.

If that is the case, why can't it just use the same format as the CPU anyway?

You can create page tables that have the same format as the CPU, they
just don't get loaded by the CPU.

Because quite frankly, I think that's where we want to end up in the
end anyway. You want to be able to have a "struct mm_struct" that just
happens to run on the GPU (or other accelerator). Then, the actual
hardware tables (or whatever) end up acting like just a TLB of that
tree. And in a perfect world, you can actually *share* the page
tables, so that you can have CLONE_VM threads that simply run on the
GPU, and if the CPU process exists, the normal ref-counting of the
"struct mm_struct" will keep the page tables around.

Because if you *don't* do it that way, you'll always have to have
these magical synchronization models between the two. Which is
obviously what you're adding (the whole invalidation callbacks), but
it really would be lovely if the "heterogeneous" memory model would
aim to be a bit more homogeneous...

> I am not sure to which locking you are refering to here. The design is
> to allow concurrent readers and faulters to operate at same time. For
> this i need reader to ignore newly faulted|created directory. So during
> table walk done there is a bit of trickery to achieve just that.

There's two different locking things I really don't like:

The USE_SPLIT_PTE_PTLOCKS thing is horrible for stuff like this. I
really wouldn't want random library code digging into core data
structures and random VM config options..

We do it for the VM, because we scale up to insane loads that do crazy
things, and it matters deeply, and the VM is really really core. I
have yet to see any reason to believe that the same kind of tricks are
needed or appropriate here.

And the "test_bit(i, wlock->locked)" kind of thing is also
unacceptable, because your "locks" aren't - you don't actually do the
lock acquire/release ordering for those things at all, and you test
them without any synchronization what-so-ever that I can see.

> Update to page directory are synchronize through the spinlock of each
> page backing a directory this is why i rely on that option. As explained
> above i am trying to adapt the design of CPU page table to other hw page
> table. The only difference is that the page directory entry and the page
> table entry are different from the CPU and vary from one hw to the other.

So quite frankly, I think it's wrong.

Either use the CPU page tables (just don't *load* them on the CPU), or
don't try to claim they are page tables. I really think you shouldn't
mix things up and confuse the issue. They aren't page tables. They
can't even match any particular piece of hardware, since the different
non-CPU "page tables" in the system are just basically random - the
mapping that a GPU uses may look very different from the mappings that
an IOMMU uses. So unlike the real VM page tables that the VM uses that
*try* to actually match the hardware if at all possible, a
device-specific page table very much will be still tied to the device.

Or am I reading something wrong? Because that's what it looks like
from my reading: your code is written for *some* things to be
dynamically configurable for the sizes of the levels (as 64-bit values
for the shift count? WTF? That's just psychedelic and seems insane)
but a lot seems to be tied to the VM page size and you use the lock in
the page for the page directories, so it doesn't seem like you can
actually ever do the same kind of "match and share the physical
memory" that we do with the VM page tables.

So it still looks like just a radix tree to me. With some
configuration options for the size of the elements, but not really to
share the actual page tables with any real hardware (iommu or gpu or
whatever).

Or do you actually have a setup where actual non-CPU hardware actually
walks the page tables you create and call "page tables"?

Linus

2014-11-10 21:47:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 1:35 PM, Linus Torvalds
<[email protected]> wrote:
>
> Or do you actually have a setup where actual non-CPU hardware actually
> walks the page tables you create and call "page tables"?

So just to clarify: I haven't looked at all your follow-up patches at
all, although I've seen the overviews in earlier versions. When trying
to read through the latest version, I got stuck on this one, and felt
it was crazy.

But maybe I'm misreading it and it actually has good reasons for it.
But just from the details I look at, some of it looks too incestuous
with the system (the split PTL lock use), other parts look really
really odd (like the 64-bit shift counts), and some of it looks just
plain buggy (the bitops for synchronization). And none of it is all
that easy to actually read.

Linus

2014-11-10 22:50:44

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 01:35:24PM -0800, Linus Torvalds wrote:
> On Mon, Nov 10, 2014 at 12:58 PM, Jerome Glisse <[email protected]> wrote:
> > On Mon, Nov 10, 2014 at 12:22:03PM -0800, Linus Torvalds wrote:
> >
> > Also during Linux Plumber people working on IOMMU expressed there wish to
> > see some generic "page table" code that can be share among IOMMU as most
> > IOMMU use a page table directory hierarchy for mapping and it is not the
> > same as the one use by the CPU.
>
> If that is the case, why can't it just use the same format as the CPU anyway?

I wish i could, but GPUs and IOMMUs do have different page table and page
directory entry formats. Some fields only make sense on a GPU. Even if you
look at the Intel and AMD IOMMUs they use different formats. The intention of
my patch is to provide common infrastructure code to share page table
management for different hw, each having a different entry format.

>
> You can create page tables that have the same format as the CPU, they
> just don't get loaded by the CPU.
>
> Because quite frankly, I think that's where we want to end up in the
> end anyway. You want to be able to have a "struct mm_struct" that just
> happens to run on the GPU (or other accelerator). Then, the actual
> hardware tables (or whatever) end up acting like just a TLB of that
> tree. And in a perfect world, you can actually *share* the page
> tables, so that you can have CLONE_VM threads that simply run on the
> GPU, and if the CPU process exists, the normal ref-counting of the
> "struct mm_struct" will keep the page tables around.
>
> Because if you *don't* do it that way, you'll always have to have
> these magical synchronization models between the two. Which is
> obviously what you're adding (the whole invalidation callbacks), but
> it really would be lovely if the "heterogeneous" memory model would
> aim to be a bit more homogeneous...

Again that would be my wish, but this is sadly far from being possible.

First, unlike a CPU, a GPU is controlled through a command buffer queue.
Inside the command buffer queue you schedule programs to run (address of
program code and number of threads to spawn as well as other arguments),
but you also schedule things like GPU page table updates for a specific
process (group of threads). Thus you inherently have no idea how long a
GPU page table update will take, unlike on a CPU with TLB flush and IPI.
So the code/cmd updating the page table on the GPU runs on a distinct
engine from where the code actually using those page tables is running.

So scheduling a GPU page table update requires a fair amount of driver
work and also allocation of a slot inside the GPU command buffer queue.
Doing all this alongside the CPU TLB flush inside an atomic section
sounds insane to me. It could block the CPU for a long time, not even
mentioning the fact that a bug in a driver would have more chance to
cripple core kernel code paths more severely.

Second, as i explained, the memory bandwidth gap between CPU and GPU keeps
growing. So GPUs will keep having discrete memory and it will keep being
only accessible to the GPU.

Even Intel finally understood that GPUs are all about bandwidth, and
while their solution of adding some special insanely big and fast cache
sounds like the right way to go, it is not, unless you are ready to have
a cache that is several gigabytes in size, coupled with insane heuristics
implemented in the mmu silicon to decide what should be inside that fast
cache and what should not.

So to take advantage of GPU memory you need to migrate what is in system
memory to GPU memory, and for this you need different page tables between
the CPU and the GPU. Things that are migrated into GPU memory will have an
entry pointing to them inside the GPU page table, but the same address
will have a special entry inside the CPU page table.


These are the two main motivations for having distinct page tables that
still need to be synchronized with each other.


>
> > I am not sure to which locking you are refering to here. The design is
> > to allow concurrent readers and faulters to operate at same time. For
> > this i need reader to ignore newly faulted|created directory. So during
> > table walk done there is a bit of trickery to achieve just that.
>
> There's two different locking things I really don't like:
>
> The USE_SPLIT_PTE_PTLOCKS thing is horrible for stuff like this. I
> really wouldn't want random library code digging into core data
> structures and random VM config options..
>
> We do it for the VM, because we scale up to insane loads that do crazy
> things, and it matters deeply, and the VM is really really core. I
> have yet to see any reason to believe that the same kind of tricks are
> needed or appropriate here.

Some updates to this secondary hw page table will happen on the CPU inside
the same code path as the CPU page table update (hidden inside the mmu
notifier callback). Hence why i would like to have the same kind of
scalability, where i have a spinlock per directory allowing concurrent
updates to disjoint address ranges.

>
> And the "test_bit(i, wlock->locked)" kind of thing is also
> unacceptable, because your "locks" aren't - you don't actually do the
> lock acquire/release ordering for those things at all, and you test
> them without any synchronization what-so-ever that I can see.

As explained, the test_bit is not used for synchronization whatsoever; the
wlock name is misleading here. It is used as a flag: was this entry
modified by a previous loop? It is all inside one CPU thread and never
shared with another thread. This is not used for synchronization between
different CPU threads at all. I understand that this code might be hard to
read and that the name of the variable is somewhat misleading.

>
> > Update to page directory are synchronize through the spinlock of each
> > page backing a directory this is why i rely on that option. As explained
> > above i am trying to adapt the design of CPU page table to other hw page
> > table. The only difference is that the page directory entry and the page
> > table entry are different from the CPU and vary from one hw to the other.
>
> So quite frankly, I think it's wrong.
>
> Either use the CPU page tables (just don't *load* them on the CPU), or
> don't try to claim they are page tables. I really think you shouldn't
> mix things up and confuse the issue. They aren't page tables. They
> can't even match any particular piece of hardware, since the different
> non-CPU "page tables" in the system are just basically random - the
> mapping that a GPU uses may look very different from the mappings that
> an IOMMU uses. So unlike the real VM page tables that the VM uses that
> *try* to actually match the hardware if at all possible, a
> device-specific page table very much will be still tied to the device.

As explained above, i can not reuse the CPU page table, first because the
entry format is hw dependent, and second because i want to have different
content between the GPU and CPU page tables for memory migration.

>
> Or am I reading something wrong? Because that's what it looks like
> from my reading: your code is written for *some* things to be
> dynamically configurable for the sizes of the levels (as 64-bit values
> for the shift count? WTF? That's just psychedelic and seems insane)
> but a lot seems to be tied to the VM page size and you use the lock in
> the page for the page directories, so it doesn't seem like you can
> actually ever do the same kind of "match and share the physical
> memory" that we do with the VM page tables.

It is tied to the page size because the page size on the archs we care about
is 4k and the GPU page table for all the hw i care about is also using the
magic 4k value. This might very well be false on some future hw and it would
then need to be untied from the VM page size.

The whole magic shift thing is because a 32bit arch might be paired with
a GPU that has 64bit entries. The whole point of this patch is to provide
common code to walk and update a hw page table from the CPU while allowing
concurrent updates of that hw page table. So instead of having each single
device driver implement its own code for page table walk and management
and implement its own synchronization for updates, i try here to provide a
framework with those 2 features that can be shared no matter what the
format of the entries used by the hardware is.

>
> So it still looks like just a radix tree to me. With some
> configuration options for the size of the elements, but not really to
> share the actual page tables with any real hardware (iommu or gpu or
> whatever).

So as i said above, i want to update some of these page tables from the
CPU and thus i would like to be able to share the page table walk and
locking among different devices. And i believe IOMMU folks would like to do
so too, ie share page table walk and locking as common code and everything
else as hw specific code.

>
> Or do you actually have a setup where actual non-CPU hardware actually
> walks the page tables you create and call "page tables"?

Yes, that's my point: hw will walk those page tables but the CPU will
manipulate them. So the format of the entries is dictated by the hw, but the
way to update and walk them on the CPU can be done through common code.

Cheers,
Jérôme

>
> Linus

2014-11-10 22:58:28

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 01:47:01PM -0800, Linus Torvalds wrote:
> On Mon, Nov 10, 2014 at 1:35 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Or do you actually have a setup where actual non-CPU hardware actually
> > walks the page tables you create and call "page tables"?
>
> So just to clarify: I haven't looked at all your follow-up patches at
> all, although I've seen the overviews in earlier versions. When trying
> to read through the latest version, I got stuck on this one, and felt
> it was crazy.
>
> But maybe I'm misreading it and it actually has good reasons for it.
> But just from the details I look at, some of it looks too incestuous
> with the system (the split PTL lock use), other parts look really
> really odd (like the 64-bit shift counts), and some of it looks just
> plain buggy (the bitops for synchronization). And none of it is all
> that easy to actually read.

I hope my other emails explained the motivation for all this. The PTL is
there because updates will happen concurrently with CPU page table
updates, and as with CPU page table updates i want the same kind of
concurrency between updates to disjoint addresses.

For the 64bit shift and count, i explained it is because some hw will
have a 64bit entry format for the page table no matter what arch it is
on (a 64bit hw page table on an x86 32bit page table).

For the bitops, they are not used for synchronization but as flags
inside a single CPU thread and never shared among different threads.
These are not synchronization points.


Sadly, no matter how much we wish it, code that is clear in our mind does
not necessarily end up as clear for others, and i know the whole macro
thing does not make this any easier. As i said, v1 is a non-macro version
but it does pre-compute more things inside init and uses more of those
precomputed values as parameters for the CPU walk down.

Cheers,
Jérôme

>
> Linus

2014-11-10 23:53:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 2:50 PM, Jerome Glisse <[email protected]> wrote:
>
> I wish i could but GPU or IOMMU do have different page table and page directory
> entry format. Some fields only make sense on GPU. Even if you look at Intel or
> AMD IOMMU they use different format. The intention of my patch is to provide
> common infrastructure code share page table management for different hw each
> having different entry format.

So I am fine with that, it's the details that confuse me. The thing
doesn't seem to be generic enough to be used for arbitrary page
tables, with (for example) the shifts fixed by the VM page size and
the size of the pte entry type. Also, the levels seem to be very
inflexible, with the page table entries being the simple case, but then
you have that "pdep" thing that seems to be just _one_ level of page
directory.

The thing is, some of those fields are just *crazy*. For example, just
look at these:

+ uint64_t pd_shift;
+ uint64_t pde_shift;
+ uint64_t pgd_shift;

and making a shift value be 64-bit is *insane*. It makes no sense. The
way you use them, you take a value and shift by that, but that means
that the shift value cannot *possibly* be bigger than the size (in
bits) of the shift value.

In other words, those shifts are in the range 0..63. You can hold that
in 6 bits. Using a single "unsigned char" would already have two
extraneous bits.

Yet you don't save them in a byte. You save them in a "uint64_t" that
can hold values between 0..18446744073709551615. Doesn't that seem
strange and crazy to you?

And then you have these *insane* levels. That's what makes me think
it's not actually really generic enough to describe a real page table,
_or_ it is overkill for describing them. You have that pde_from_pdp()
function to ostensibly allow arbitrary page directory formats, but you
don't actually give it an address, so that function cannot be used to
actually walk the upper levels at all. Instead you have those three
shift values (and one mask value) that presumably describe the depths
of the different levels of the page tables.

And no, it's not some really clever "you describe different levels
separately, and they have a link to each other", because there's no
link from one level to the next in the "struct gpt" either.

So it seems to be this really odd mixture of trying to be generic, but
at the same time there are all those odd details that are very
specific to one very particular two-level page table layout.

> It is like to page size because page size on arch we care about is 4k
> and GPU page table for all hw i care about is also using the magic 4k
> value. This might very well be false on some future hw and it would then
> need to be untie from the VM page size.

Ok, so fixing the page size at PAGE_SIZE might be reasonable. I wonder
how that actually works on other architectures that don't have a 4kB
page size, but it's possible that the answer is "this only works for
hardware that has the same page size as the CPU". Which might be a
reasonable assumption.

The actual layout is still very odd. And the whole "locked" array I
can't make heads or tails of. It is fixed as "PAGE_SIZE bits" (using
some pretty odd arithmetic, but that's what it comes out to), but at
the same time it seems to not be about the last-level page size, but
about some upper-level thing. And that big array is allocated on the
stack, etc etc. Not to mention the whole "it's not even a real lock"
issue, apparently.

This just doesn't make any *sense*. Why is that array PAGE_SIZE bits
(ie 4k bits, 512 bytes on x86) in size? Where does that 4k bits come
from? THAT is not really the page-size, and the upper entries don't
even have PAGE_SIZE number of entries anyway.

> The whole magic shift things is because a 32bit arch might be pair with
> a GPU that have 64bit entry

No, those shift values are never uint64_t. Not on 32-bit, not on
64-bit. In both cases, all the shift values must very much fit in just
6 bits. Six bits is enough to cover it. Not sixtyfour.

> The whole point of this patch is to provide
> common code to walk and update a hw page table from the CPU and allowing
> concurrent update of that hw page table.

So quite frankly, I just don't understand *why* it does any of what it
does the way it does. It makes no sense. How many levels of
directories? Why do we even care? Why the three fixed shifts?

So for example, I *like* the notion of just saying "we'll not describe
the upper levels of the tree at all, we'll just let those be behind a
manual walking function that the tree description is associated with".
Before looking closer - and judging by the comments - that was what
"pde_from_pdp()" would fo. But no, that one doesn't seem to walk
anything, and cannot actually do so without having an address to walk.
So it's something else.

It should be entirely possible to create a "generic page table walker"
(and by generic I mean that it would actually work very naturally with
the CPU page tables too) by just having a "look up last level of page
tables" function, and then an iterator that walks just inside that
last level. Leave all the upper-level locking to the unspecified "walk
the upper layers" function. That would make sense, and sounds like it
should be fairly simple. But that's not what this patch is.
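
Just to illustrate what such an interface could look like (a sketch only;
every name here is made up and nothing like this exists in the patch or in
the kernel):

#include <linux/mm.h>

struct last_level_walker {
	/*
	 * Walk the upper levels (with whatever locking the implementation
	 * needs) and return the last-level table covering @addr, or NULL
	 * if it is not populated. The upper-level layout stays private.
	 */
	void *(*lookup_last_level)(struct last_level_walker *walker,
				   unsigned long addr);
	unsigned int entry_size;	/* bytes per last-level entry */
	unsigned int entries_per_table;	/* entries per last-level table,
					 * assumed to be a power of two */
};

/* Iterate only inside one last-level table returned by the callback. */
static inline void *last_level_entry(struct last_level_walker *walker,
				     void *table, unsigned long addr)
{
	unsigned long idx;

	idx = (addr >> PAGE_SHIFT) & (walker->entries_per_table - 1);
	return (char *)table + idx * walker->entry_size;
}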

So the "locked" bits are apparently not about locking, which I guess I
should be relieved about since they cannot work as locks. The *number*
of bits is odd and unexplained (from the *use*, it looks like the
number of PDE's in an upper-level directory, but that is just me
trying to figure out the code, and it seems to have nothing to do with
PAGE_SIZE), the shifts have three different levels (why?) and are too
big. The pde_from_pdp() thing doesn't get an address so it can only
work one single entry at a time, and despite claiming to be about
scalability and performance I see "synchronize_rcu()" usage which
basically guarantees that it cannot possibly be either, and updating
must be slow as hell.

It all looks very fancy, but very little of it makes *sense* to me.
Why is something that isn't a lock called "locked" and "wlock"
respectively?

Can anybody explain it to me?

Linus

2014-11-11 02:45:41

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 03:53:03PM -0800, Linus Torvalds wrote:
> On Mon, Nov 10, 2014 at 2:50 PM, Jerome Glisse <[email protected]> wrote:
> >
> > I wish i could but GPU or IOMMU do have different page table and page directory
> > entry format. Some fields only make sense on GPU. Even if you look at Intel or
> > AMD IOMMU they use different format. The intention of my patch is to provide
> > common infrastructure code share page table management for different hw each
> > having different entry format.
>
> So I am fine with that, it's the details that confuse me. The thing
> doesn't seem to be generic enough to be used for arbitrary page
> tables, with (for example) the shifts fixed by the VM page size and
> the size of the pte entry type. Also, the levels seem to be very
> infexible, with the page table entries being the simple case, but then
> you have that "pdep" thing that seems to be just _one_ level of page
> directory.
>
> The thing is, some of those fields are just *crazy*. For example, just
> look at these:
>
> + uint64_t pd_shift;
> + uint64_t pde_shift;
> + uint64_t pgd_shift;
>
> and making a shift value be 64-bit is *insane*. It makes no sense. The
> way you use them, you take a value and shift by that, but that means
> that the shift value cannot *possibly* be bigger than the size (in
> bits) of the shift value.
>
> In other words, those shifts are in the range 0..63. You can hold that
> in 6 bits. Using a single "unsigned char" would already have two
> extraneous bits.
>
> Yet you don't save them in a byte. You save them in a "uint64_t" that
> can hold values between 0..18446744073709551615. Doesn't that seem
> strange and crazy to you?
>

I was being lazy and wanted to avoid a u64 cast in most operations using
those values, but yes, you are right, a byte (6 bits) is more than enough
for all those values.

I should add that :
(1 << pd_shift) is the number of directory entries inside a page (512 for
64bit entries with a 4k page, or 1024 for 32bit entries with a 4k page).

pde_shift corresponds to PAGE_SHIFT for a directory entry.

address >> pgd_shift gives the index inside the global page directory,
ie the top level directory. This is done because on a 64bit arch running
a 32bit app we want to have only 3 levels, while a 64bit app would
require 4 levels.
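
To make that concrete, here is how i mean those values to be used (a rough
illustration only, not the actual gpt code; the example_* helper names are
made up):

/* Rough illustration of the shifts described above; not the gpt code. */
static unsigned long example_pgd_index(unsigned long addr,
				       unsigned char pgd_shift)
{
	/* index inside the top level (global) page directory */
	return addr >> pgd_shift;
}

static unsigned long example_pd_index(unsigned long addr,
				      unsigned char pd_shift)
{
	/*
	 * index of an entry inside one directory page: there are
	 * (1 << pd_shift) entries per page, so mask with that - 1.
	 */
	return (addr >> PAGE_SHIFT) & ((1UL << pd_shift) - 1);
}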


> And then you have these *insane* levels. That's what makes me think
> it's not actually really generic enough to describe a real page table,
> _or_ it is overkill for describing them. You have that pde_from_pdp()
> function to ostensibly allow arbitrary page directory formats, but you
> don't actually give it an address, so that function cannot be used to
> actually walk the upper levels at all. Instead you have those three
> shift values (and one mask value) that presumably describe the depths
> of the different levels of the page tables.
>
> And no, it's not some really clever "you describe different levels
> separately, and they have a link to each other", because there's no
> longer from one level to the next in the "struct gpt" either.
>
> So it seems to be this really odd mixture of trying to be generic, but
> at the same time there are all those odd details that are very
> specific to one very particular two-level page table layout.
>

It is intended to accommodate either a 3 or 4 level page table depending on
runtime. The whole mask, shift value and back link business is to allow easy
iteration from one address by being able to jump back to the upper level
from the lowest level.

> > It is like to page size because page size on arch we care about is 4k
> > and GPU page table for all hw i care about is also using the magic 4k
> > value. This might very well be false on some future hw and it would then
> > need to be untie from the VM page size.
>
> Ok, so fixing the page size at PAGE_SIZE might be reasonable. I wonder
> how that actually works on other architectures that don't have a 4kB
> page size, but it's possible that the answer is "this only works for
> hardware that has the same page size as the CPU". Which might be a
> reasonable assumption.
>
> The actual layout is still very odd. And the whole "locked" array I
> can't make heads or tails of. It is fixed as "PAGE_SIZE bits" (using
> some pretty odd arithmetic, but that's what it comes out to, but at
> the same time it seems to not be about the last-level page size, but
> about some upper-level thing. And that big array is allocated on the
> stack, etc etc. Not to mention the whole "it's not even a real lock"
> issue, apparently.
>
> This just doesn't make any *sense*. Why is that array PAGE_SIZE bits
> (ie 4k bits, 512 bytes on x86) in size? Where does that 4k bits come
> from? THAT is not really the page-size, and the upper entries don't
> even have PAGE_SIZE number of entries anyway.

The locked array is used to keep track of which entries in a directory
have been considered by the current thread in a previous loop. So it
accounts for the worst case of 32bit entries with the VM page size, ie
1024 entries per page when the page is 4k. Only needing 1 bit per entry,
this means it requires 1024 bits worst case.

>
> > The whole magic shift things is because a 32bit arch might be pair with
> > a GPU that have 64bit entry
>
> No, those shift values are never uint64_t. Not on 32-bit, not on
> 64-bit. In both cases, all the shift values must very much fit in just
> 6 bits. Six bits is enough to cover it. Not sixtyfour.

I was talking about pd_shift, pgd_shift, pde_shift, not about the reason
why i was using uint64.

>
> > The whole point of this patch is to provide
> > common code to walk and update a hw page table from the CPU and allowing
> > concurrent update of that hw page table.
>
> So quite frankly, I just don't understand *why* it does any of what it
> does the way it does. It makes no sense. How many levels of
> directories? Why do we even care? Why the three fixed shifts?

The number of directory levels is runtime dependent on the application;
like i said, a 32bit app only needs 3 levels while a 64bit app needs 4 levels.

>
> So for example, I *like* the notion of just saying "we'll not describe
> the upper levels of the tree at all, we'll just let those be behind a
> manual walking function that the tree description is associated with".
> Before looking closer - and judging by the comments - that was what
> "pde_from_pdp()" would fo. But no, that one doesn't seem to walk
> anything, and cannot actually do so without having an address to walk.
> So it's something else.

pde_from_pdp() only builds a page directory entry (an entry pointing to
a sub-directory level) from a page; it does not need any address. It is
not used for traversal, think of it as mk_pte() but for a directory entry.

>
> It should be entirely possible to create a "generic page table walker"
> (and by generic I mean that it would actually work very naturally with
> thge CPU page tables too) by just having a "look up last level of page
> tables" function, and then an iterator that walks just inside that
> last level. Leave all the upper-level locking to the unspecified "walk
> the upper layers" function. That would make sense, and sounds like it
> should be fairly simple. But that's not what this patch is.

All the complexity arises from two things: first, the need to keep ad-hoc
links between directory levels to facilitate iteration over a range;
second, the fact that page directories can be freed (removed) and inserted
concurrently. In order to allow concurrent directory insertion and removal
for overlapping ranges, readers need to know which directories are safe
for them to walk and which are not.

This is why the usage pattern is :

gpt_walk_update|fault_lock(range)
// the device driver can walk the page table without any lock for the
// range and not fear that any directory will be freed for the range.
// Using the helpers it will also only walk directories that were known
// at the time gpt_walk_update_lock() was called; any directory added
// after that will not be considered and simply skipped
gpt_walk_update|fault_unlock(range)

So the complexity is that at gpt_walk_update|fault_lock() time we take a
unique sequence number that is used by all the gpt walker/iterator helpers
to only consider directories that were known at "lock" time. Moreover,
gpt_walk_update|fault_lock() takes a reference on all known directories
(directories that have a sequence number older than or equal to the lock
sequence number), while _unlock() drops the references, increments the
sequence number and performs pending directory frees if necessary.
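
For reference, this is essentially the pattern the HMM code above follows; a
simplified sketch modeled on hmm_mirror_handle_fault() (the
example_walk_range() function is made up, it assumes the table is a struct
gpt like hmm->pt, and error handling plus the actual per-entry work are
omitted):

/* Simplified sketch of the lock/walk/unlock pattern described above. */
static int example_walk_range(struct gpt *pt,
			      unsigned long start,
			      unsigned long end)
{
	struct gpt_lock lock;
	struct gpt_iter iter;
	unsigned long addr;
	int ret;

	/* Take the sequence number and reference the known directories. */
	lock.first = start >> PAGE_SHIFT;
	lock.last = (end - 1UL) >> PAGE_SHIFT;
	ret = gpt_ulong_lock_fault(pt, &lock);
	if (ret)
		return ret;

	gpt_iter_init(&iter, pt, &lock);
	for (addr = start; addr < end; addr += PAGE_SIZE) {
		/* Only directories known at lock time are visited. */
		if (!gpt_ulong_iter_idx(&iter, addr >> PAGE_SHIFT))
			continue;
		/* iter.pdep now points at the entry backing addr. */
	}

	/* Drop references, bump the sequence number, free pending dirs. */
	gpt_ulong_unlock_fault(pt, &lock);
	return 0;
}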

>
> So the "locked" bits are apparently not about locking, which I guess I
> should be relieved about since they cannot work as locks. The *number*
> of bits is odd and unexplained (from the *use*, it looks like the
> number of PDE's in an upper-level directory, but that is just me
> trying to figure out the code, and it seems to have nothing to do with
> PAGE_SIZE), the shifts have three different levels (why?) and are too
> big. The pde_from_pdp() thing doesn't get an address so it can only
> work one single entry at a time, and despite claiming to be about
> scalability and performance I see "synchronize_rcu()" usage which
> basically guarantees that it cannot possibly be either, and updating
> must be slow as hell.

The synchronize_rcu is only in the fault case, for which there would
already be exclusion through the per directory spinlock. So multiple
readers are fast, and multiple faulters on different directories can
happen but they are slower than readers. The whole design obviously favors
readers over faulters, but i do not think the synchronize_rcu() will be
the bottleneck for the faulter code path.

Maybe i should not name it gpt_walk_update but rather gpt_walk_reader.
The thing is gpt_walk_update can lead to a page directory being removed
from the directory structure but never to a new directory being added.

>
> It all looks very fancy, but very little of it makes *sense* to me.
> Why is something that isn't a lock called "locked" and "wlock"
> respectively?

wlock stands for walk lock; it is a temporary structure used by both
the lock and unlock code paths to keep track of range locking. The lock
struct is the public api and must be used with the helpers to walk the page
table; it stores the unique sequence number that allows the walker to know
which directories are safe to walk and which must be ignored.

>
> Can anybody explain it to me?

Does my explanation above help clarify both the code and the design behind
it?

I should add that this is not the final product, as what is missing from
the mix is dma mapping of the page directories. The entries will not be
pfns but bus addresses of pages, so walking will be even more complex as it
would need to map back from the dma mapping to the page. That is also why i
abuse some struct page fields so that the iterator can more easily walk
down the page table without always resorting to reverse dma mapping.

Jérôme

>
> Linus

2014-11-11 03:16:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 6:45 PM, Jerome Glisse <[email protected]> wrote:
>
> I was being lazy and wanted to avoid a u64 cast in most operation using
> those value but yes you right a byte (6bit) is more than enough for all
> those values.

WHAT?

NONE OF WHAT YOU SAY MAKES ANY SENSE.

There's no reason for a "u64 cast". The value of "1 << pd_shift" is
going to be an "int" regardless of what type pd_shift is. The type of
a shift expression is the type of the left-hand side (with the C
promotion rules forcing it to at least "int"), the right-hand
expression type has absolutely no relevance.

So the fact is, those "shift" variables are of an insane size, and
your stated reason for that insane size makes no sense either.

It makes *zero* sense to ever have the shift count be a uint64_t. Not
with a cast, not *without* a cast. Seriously.
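
To illustrate (a standalone snippet; "shift64" is just a made-up name):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t shift64 = 9;

	/*
	 * The result type of a shift is the promoted type of the *left*
	 * operand, so "1 << shift64" is an int no matter how wide the
	 * shift count is.
	 */
	printf("%zu %zu\n", sizeof(1 << shift64), sizeof(int));
	return 0;	/* prints "4 4" on typical targets */
}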

> I should add that :
> (1 << pd_shift) is the number of directory entry inside a page (512 for
> 64bit entry with 4k page or 1024 for 32bit with 4k page).

So that is actually the *only* shift-value that makes any sense having
at all, since if you were to have a helper routine to look up the
upper levels, nobody should ever even care about what their
sizes/shifts are.

But pd_shift at no point makes sense as uint64_t. Really. None of them
do. None of them *can* make sense. Not from a value range standpoint,
not from a C typesystem standpoint, not from *any* standpoint.

> pde_shift correspond to PAGE_SHIFT for directory entry
>
> address >> pgd_shift gives the index inside the global page directory
> ie top level directory. This is done because on 64bit arch running a
> 32bit app we want to only have 3 level while 64bit app would require
> 4levels.

But you've gone to some trouble to make it clear that the page tables
could be some other format, and have a helper indirect routine for
doing that - if that helper routine would just have done all the
levels, none of this would be necessary at all. As it is, it adds ugly
complexity, and the shifting and masking looks objectively insane.

> It is intended to accomodate either 3 or 4 level page table depending on
> runtime. The whole mask, shift value and back link is to allow easy
> iteration from one address by being able to jump back to upper level
> from the lowest level.

.. but why don't you just generate the masks from the shifts? It's trivial.

> The locked array is use to keep track of which entry in a directory
> have been considered in current thread in previous loop. So it accounts
> for worst case 32bit entry with VM page size ie 1024 entry per page
> when page is 4k. Only needing 1bit per entry this means it require
> 1024bits worst case.

So why the hell do you allocate 4k bits then? Because that's what you do:

unsigned long locked[(1 << (PAGE_SHIFT - 3)) / sizeof(long)]

that's 512 bytes. PAGE_SIZE bits. Count it.

Admittedly, it's a particularly confusing and bad way of specifying
that, but that's what it is. The "-3" is apparently because of "bits
in bytes", and the "/ sizeof(long)" is because of the base type being
"unsigned long" rather than a byte, but it all boils down to a very
complicated and unnecessarily obtuse way of writing "4096 bits".
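
Spelling out the arithmetic with the usual x86-64 values (PAGE_SHIFT = 12,
sizeof(long) = 8, BITS_PER_LONG = 64):

(1 << (PAGE_SHIFT - 3)) / sizeof(long)
  = (1 << 9) / 8
  = 64 longs
  = 64 * 64 bits
  = 4096 bits (512 bytes)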

If you wanted PAGE_SIZE bits (which you apparently don't even want),
the *SANE* way would be to just write it like

unsigned long locked[PAGE_SIZE / BITS_PER_LONG];

or something like that, which is actually *understandable*. But that's
not what the code does. It mixes that unholy mess of PAGE_SHIFT with
arithmetic and shifting and division, instead of just using *one*
clear operation.

Ugh.

> pde_from_pdp() only build a page directory entry (an entry pointing to
> a sub-directory level) from a page it does not need any address. It is
> not used from traversal, think of it as mk_pte() but for directory entry.

Yes, I see that. And I also see that it doesn't get the level number,
so you can't do different things for different levels.

Like real page tables often do.

And because it's only done one entry at a time, the code has to handle
all these levels, even though it's not even *interested* in handling
the levels. It would be much nicer to have the helper functions walk
all but the last level, and not have to have that complexity at all.
*And* it would make the code more generic to boot, since it wouldn't
depend on the quite possibly broken assumption that all levels are the
same.

Just look at x86-32 3-level paging. The top-most level is very
decidedly magical and doesn't look anything like the two other ones.
There are other examples.

> wlock stands for walk lock, it is a temporary structure using by both
> the lock and unlock code path to keep track of range locking. The lock
> struct is public api and must be use with helper to walk the page table,
> it stores the uniq sequence number that allow the walker to know which
> directory are safe to walk and which must be ignore.

But that "locked[]" array still makes no sense. It's apparently the
wrong size, since you claim the max is just 1k bits. It's mis-named.
It's just all confusing.

> Does my explanation above help clarify both the code and the design behind
> it.

Nope. It just makes me despair more.

Linus

2014-11-11 04:19:29

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 07:16:04PM -0800, Linus Torvalds wrote:
> On Mon, Nov 10, 2014 at 6:45 PM, Jerome Glisse <[email protected]> wrote:
> >
> > I was being lazy and wanted to avoid a u64 cast in most operation using
> > those value but yes you right a byte (6bit) is more than enough for all
> > those values.
>
> WHAT?
>
> NONE OF WHAT YOU SAY MAKES ANY SENSE.
>
> There's no reason for a "u64 cast". The value of "1 << pd_shift" is
> going to be an "int" regardless of what type pd_shift is. The type of
> a shift expression is the type of the left-hand side (with the C
> promotion rules forcing it to at least "int"), the right-hand
> expression type has absolutely no relevance.
>
> So the fact is, those "shift" variables are of an insane size, and
> your stated reason for that insane size makes no sense either.
>
> It makes *zero* sense to ever have the shift count be a uint64_t. Not
> with a cast, not *without* a cast. Seriously.

Sorry, I thought the right-hand side also mattered in the type of a shift,
my bad. Anyway, like I said, it is easy to change to a byte; I was just
convinced of some weird rule about shifts and their right-hand side.

>
> > I should add that :
> > (1 << pd_shift) is the number of directory entry inside a page (512 for
> > 64bit entry with 4k page or 1024 for 32bit with 4k page).
>
> So that is actually the *only* shift-value that makes any sense having
> at all, since if you were to have a helper routine to look up the
> upper levels, nobody should ever even care about what their
> sizes/shifts are.
>
> But pd_shift at no point makes sense as uint64_t. Really. None of them
> do. None of them *can* make sense. Not from a value range standpoint,
> not from a C typesystem standpoint, not from *any* standpoint.
>

Duly noted.

> > pde_shift correspond to PAGE_SHIFT for directory entry
> >
> > address >> pgd_shift gives the index inside the global page directory
> > ie top level directory. This is done because on 64bit arch running a
> > 32bit app we want to only have 3 level while 64bit app would require
> > 4levels.
>
> But you've gone to some trouble to make it clear that the page tables
> could be some other format, and have a helper indirect routine for
> doing that - if that helper routine would just have done all the
> levels, none of this would be necessary at all. As it is, it adds ugly
> complexity, and the shifting and masking looks objectively insane.
>
> > It is intended to accomodate either 3 or 4 level page table depending on
> > runtime. The whole mask, shift value and back link is to allow easy
> > iteration from one address by being able to jump back to upper level
> > from the lowest level.
>
> .. but why don't you just generate the masks from the shifts? It's trivial.

I use pde_mask because the upper bits might not be part of the pfn; again,
the page table format is hw specific, and the pfn mask of an entry is hw
specific. Or are you talking about the address mask? If so, the address
mask is already derived from the shift value.

>
> > The locked array is use to keep track of which entry in a directory
> > have been considered in current thread in previous loop. So it accounts
> > for worst case 32bit entry with VM page size ie 1024 entry per page
> > when page is 4k. Only needing 1bit per entry this means it require
> > 1024bits worst case.
>
> So why the hell do you allocate 4k bits then? Because that's what you do:
>
> unsigned long locked[(1 << (PAGE_SHIFT - 3)) / sizeof(long)]
>
> that's 512 bytes. PAGE_SIZE bits. Count it.
>
> Admittedly, it's a particularly confusing and bad way of specifying
> that, but that's what it is. The "-3" is apparently because of "bits
> in bytes", and the "/ sizeof(long)" is because of the base type being
> "unsigned long" rather than a byte, but it all boils down to a very
> complicated and unnecessarily obtuse way of writing "4096 bits".
>
> If you wanted PAGE_SIZE bits (which you apparently don't even want),
> the *SANE* way would be to just write it like
>
> unsigned long locked[PAGE_SIZE / BITS_PER_LONG];
>
> or something like that, which is actually *understandable*. But that's
> not what the code does. It mixes that unholy mess of PAGE_SHIFT with
> arithmetic and shifting and division, instead of just using *one*
> clear operation.
>
> Ugh.

Yeah, I got the math wrong at some point, probably when I converted from
non-macro to macro. It should have been:
(PAGE_SIZE / 4) / BITS_PER_LONG
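
For what it's worth, here is a quick stand-alone check of the two sizes
(nothing from the patch, just the arithmetic; PAGE_SHIFT etc. are defined
locally for the demo):

#include <stdio.h>

#define PAGE_SHIFT    12
#define PAGE_SIZE     (1UL << PAGE_SHIFT)               /* 4096 bytes */
#define BITS_PER_LONG (8 * sizeof(unsigned long))

int main(void)
{
        /* What the posted code declared: PAGE_SIZE bits, ie 512 bytes. */
        unsigned long locked_old[(1 << (PAGE_SHIFT - 3)) / sizeof(long)];

        /* What was intended: one bit per 32bit entry of a 4k page, 1024 bits. */
        unsigned long locked_new[(PAGE_SIZE / 4) / BITS_PER_LONG];

        printf("old: %zu bytes (%zu bits)\n",
               sizeof(locked_old), sizeof(locked_old) * 8);
        printf("new: %zu bytes (%zu bits)\n",
               sizeof(locked_new), sizeof(locked_new) * 8);
        return 0;
}

On both 32bit and 64bit this prints 512 bytes (4096 bits) for the old
declaration and 128 bytes (1024 bits) for the corrected one.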

>
> > pde_from_pdp() only build a page directory entry (an entry pointing to
> > a sub-directory level) from a page it does not need any address. It is
> > not used from traversal, think of it as mk_pte() but for directory entry.
>
> Yes, I see that. And I also see that it doesn't get the level number,
> so you can't do different things for different levels.
>
> Like real page tables often do.
>
> And because it's only done one entry at a time, the code has to handle
> all these levels, even though it's not even *interested* in handling
> the levels. It would be much nicer to have the helper functions walk
> all but the last level, and not have to have that complexity at all.
> *And* it would make the code more generic to boot, since it wouldn't
> depend on the quite possibly broken assumption that all levels are the
> same.
>
> Just look at x86-32 3-level paging. The top-most level is very
> decidedly magical and doesn't look anything like the two other ones.
> There are other examples.

In that respect the hw I had in mind is more sane than x86, and all levels
behave the same. The reason I did not do the walking as a callback is that
I thought it would be cleaner with respect to the whole sequence number
thing I explain below.

>
> > wlock stands for walk lock, it is a temporary structure using by both
> > the lock and unlock code path to keep track of range locking. The lock
> > struct is public api and must be use with helper to walk the page table,
> > it stores the uniq sequence number that allow the walker to know which
> > directory are safe to walk and which must be ignore.
>
> But that "locked[]" array still makes no sense. It's apparently the
> wrong size, since you claim the max is just 1k bits. It's mis-named.
> It's just all confusing.

Yes, wrong size. For the name it's hard, as technically it's a flag that
says whether the first loop over a directory locked the entry or not. By
locked here I mean whether the first loop took a reference on the
sub-directory page the entry points to. So maybe a better name is refed[]
instead of locked[].

>
> > Does my explanation above help clarify both the code and the design behind
> > it.
>
> Nope. It just makes me despair more.

The design goals were:
(1) concurrent readers
(2) concurrent faulters
(3) readers/faulters can sleep
(4) prefer readers over faulters (hence a faulter might have to pay a higher price)
(5) free page directories once no longer needed

(5) requires that any concurrent reader/faulter protect a directory from
being freed while it is actively using it, while (1), (2) and (3) dictate
that there should be no locking while a range is in use.

This is why I turned to sequence numbers: each directory, when created, is
associated with a sequence number. Each reader uses the current oldest
sequence number as its reference, and thus all directories with a sequence
number newer than that are ignored. Each faulter increments the current
sequence number and uses it as the sequence number of each new directory it
allocates.

Sequence numbers are then used to know which directories a reader or faulter
needs to take a refcount against, to block them from being freed. Similarly,
once code is done with a range it must drop the refcount, and again the
sequence number is used to determine which directories can safely be
unreferenced, so that new directories are not unreferenced by an old reader
that never took a ref on them.
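
Roughly, the bookkeeping I have in mind looks like the sketch below. This is
only an illustration, not the posted code, and it leaves out the atomics and
locking a real implementation needs:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct gpt_dir {
        uint64_t seq;           /* sequence at which this directory was created */
        int      refcount;      /* readers/faulters currently holding it        */
};

struct gpt {
        uint64_t current_seq;   /* bumped by each faulter */
};

/* A reader samples a reference sequence number before walking. */
static uint64_t reader_begin(struct gpt *gpt)
{
        return gpt->current_seq;
}

/* A faulter tags any new directory it allocates with a fresh sequence. */
static uint64_t faulter_begin(struct gpt *gpt)
{
        return ++gpt->current_seq;
}

/* A walker only takes a reference on directories that already existed when
 * it started, ie whose sequence is not newer than its own. */
static bool walker_ref_dir(struct gpt_dir *dir, uint64_t walker_seq)
{
        if (dir->seq > walker_seq)
                return false;   /* newer directory: ignore, never take a ref */
        dir->refcount++;
        return true;
}

static void walker_unref_dir(struct gpt_dir *dir, uint64_t walker_seq)
{
        if (dir->seq <= walker_seq)
                dir->refcount--;        /* only drop refs we actually took */
}

int main(void)
{
        struct gpt gpt = { .current_seq = 0 };
        uint64_t reader_seq  = reader_begin(&gpt);      /* sees seq 0         */
        uint64_t faulter_seq = faulter_begin(&gpt);     /* creates seq 1 dirs */
        struct gpt_dir old_dir = { .seq = 0, .refcount = 0 };
        struct gpt_dir new_dir = { .seq = faulter_seq, .refcount = 0 };

        walker_ref_dir(&old_dir, reader_seq);   /* taken                    */
        walker_ref_dir(&new_dir, reader_seq);   /* ignored: too new         */
        printf("after ref:   old=%d new=%d\n",
               old_dir.refcount, new_dir.refcount);

        walker_unref_dir(&old_dir, reader_seq);
        walker_unref_dir(&new_dir, reader_seq); /* no-op, never referenced  */
        printf("after unref: old=%d new=%d\n",
               old_dir.refcount, new_dir.refcount);
        return 0;
}

The last part is the point: an old reader never drops a reference on a
directory it never took one on.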

Of course, if I get rid of requirement (5) the code is a lot simpler, but I
think even for CPU page tables we will want page directory reclaim at some
point.


Also it's important to understand that (3) means a real sleep (for as long
as a GPU update can take).


Finally, readers are more than just readers of entries: they can remove or
modify existing entries, such that after a reader runs some directories
might end up on the list of reclaimable directory pages.

>
> Linus

2014-11-11 04:29:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 7:16 PM, Linus Torvalds
<[email protected]> wrote:
>
> There's no reason for a "u64 cast". The value of "1 << pd_shift" is
> going to be an "int" regardless of what type pd_shift is. The type of
> a shift expression is the type of the left-hand side (with the C
> promotion rules forcing it to at least "int"), the right-hand
> expression type has absolutely no relevance.

Btw, for that exact reason, code like this:

+ (uint64_t)(pdp->index +
+ (1UL << (gpt_pdp_shift(gpt, pdp) + gpt->pd_shift)) - 1UL));

is likely buggy if you actually care about the uint64_t part.

On 32-bit, 1ul will be 32-bit. And so will "(1ul << .. ) -1UL",
regardless of the type of the right hand of the shift. So the fact
that gpt->pd_shift and gpt_pdp_shift() are both u64, the actual end
result is u32 (page->index is a 32-bit entity on 32-bit architectures,
since pgoff_t is an "unsigned long" too). So you're doing the shifts
in 32-bit, the addition in 32-bit, and then just casting the resulting
32-bit thing to a 64-bit entity. The high 32 bits are guaranteed to
be zero, in other words.

This just highlights how wrong it is to make those shifts be u64. That
gpt_pdp_shift() helper similarly should at no point be returning u64.
It doesn't help, it only hurts. It makes the structure bigger for no
gain, and apparently it confuses people into thinking those shifts are
done in 64 bit.

When you do "a+b" or similar operations, the end result is the biggest
type size of 'a' and 'b' respectively (with the normal promotion to at
least 'int'). But that's not true of shifts, the type of the shift
expression is the (integer-promoted) left-hand side. The right-hand
side just gives the amount that value is shifted by, it doesn't affect
the type of the result.
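
To make that concrete, here's a trivial stand-alone demo (not from the
patch) of the difference between shifts and additions:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t shift = 20;    /* oversized type for a shift count */

        /* The shift result has the type of the (promoted) left operand:
         * unsigned long, so 4 bytes on a 32-bit target, even though
         * 'shift' is 64-bit. */
        printf("sizeof(1UL << shift)      = %zu\n", sizeof(1UL << shift));

        /* Addition is different: the uint64_t operand widens the result. */
        printf("sizeof(1UL + (uint64_t)0) = %zu\n", sizeof(1UL + (uint64_t)0));
        return 0;
}

On a 32-bit build that prints 4 and 8; only the addition got widened.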

Linus

2014-11-11 09:59:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 09:45:33PM -0500, Jerome Glisse wrote:
> All the complexity arise from two things, first the need to keep ad-hoc
> link btw directory level to facilitate iteration over range.

btw means "by the way" not "between", use a dictionary some time.

2014-11-11 13:42:12

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Tue, Nov 11, 2014 at 10:59:03AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 10, 2014 at 09:45:33PM -0500, Jerome Glisse wrote:
> > All the complexity arise from two things, first the need to keep ad-hoc
> > link btw directory level to facilitate iteration over range.
>
> btw means "by the way" not "between", use a dictionary some time.

Apologies if my poor English makes it even harder to understand me.

Jérôme

by Christoph Lameter

Subject: Re: HMM (heterogeneous memory management) v6

On Mon, 10 Nov 2014, [email protected] wrote:

> In a nutshell HMM is a subsystem that provide an easy to use api to mirror a
> process address on a device with minimal hardware requirement (mainly device
> page fault and read only page mapping). This does not rely on ATS and PASID
> PCIE extensions. It intends to supersede those extensions by allowing to move
> system memory to device memory in a transparent fashion for core kernel mm
> code (ie cpu page fault on page residing in device memory will trigger
> migration back to system memory).

Could we define a new NUMA node that maps memory from the GPU and
then simply use the existing NUMA features to move a process over there.

2014-11-11 21:01:49

by David Airlie

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.


> On Mon, Nov 10, 2014 at 09:45:33PM -0500, Jerome Glisse wrote:
> > All the complexity arise from two things, first the need to keep ad-hoc
> > link btw directory level to facilitate iteration over range.
>
> btw means "by the way" not "between", use a dictionary some time.
>

Thanks for the in-depth review, Peter.

Dave.

2014-11-12 20:09:47

by Jerome Glisse

[permalink] [raw]
Subject: Re: HMM (heterogeneous memory management) v6

On Tue, Nov 11, 2014 at 01:00:56PM -0600, Christoph Lameter wrote:
> On Mon, 10 Nov 2014, [email protected] wrote:
>
> > In a nutshell HMM is a subsystem that provide an easy to use api to mirror a
> > process address on a device with minimal hardware requirement (mainly device
> > page fault and read only page mapping). This does not rely on ATS and PASID
> > PCIE extensions. It intends to supersede those extensions by allowing to move
> > system memory to device memory in a transparent fashion for core kernel mm
> > code (ie cpu page fault on page residing in device memory will trigger
> > migration back to system memory).
>
> Could we define a new NUMA node that maps memory from the GPU and
> then simply use the existing NUMA features to move a process over there.

Sorry for the late reply, I am traveling and working on an updated patchset
to change the device page table design to something simpler and easier to
grasp.

So GPU processes will never run on the CPU, nor will they have a kernel task
struct associated with them. From the core kernel's point of view they do
not exist. I hope that at some point down the line the hw will allow for
better integration with the kernel core, but it's not there yet.

So the NUMA idea was considered early on but was discarded as it's not
really appropriate. You can have several CPU threads working with several
GPU threads at the same time, and they can access either disjoint memory or
some shared memory. The usual case will be a few kbytes of shared memory for
synchronization between CPU and GPU threads.

But when a GPU job is launched we want most of the memory it will use to be
migrated to device memory. The issue is that device memory is not accessible
from the CPU (PCIE BARs are too small). So there is no way to keep the memory
mapped for the CPU. We need to mark the memory as inaccessible to the CPU
and then migrate it to GPU memory.

Now when there is a CPU page fault on some migrated memory we need to migrate
that memory back to system memory. This is why I need to tie HMM into some
core MM code, so that on this kind of fault the core kernel knows it needs
to call into HMM, which will perform housekeeping and start the migration
back to system memory.


So technically there is no task migration, only memory migration.


Is there something I am missing inside NUMA, or some NUMA work in progress
that changes NUMA sufficiently that it might somehow address the use case I
am describing above?


Cheers,
Jérôme

by Christoph Lameter

Subject: Re: HMM (heterogeneous memory management) v6

On Wed, 12 Nov 2014, Jerome Glisse wrote:

> > Could we define a new NUMA node that maps memory from the GPU and
> > then simply use the existing NUMA features to move a process over there.
>
> So GPU process will never run on CPU nor will they have a kernel task struct
> associated with them. From core kernel point of view they do not exist. I
> hope that at one point down the line the hw will allow for better integration
> with kernel core but it's not there yet.

Right. So all of this is not relevant because the GPU manages it. You only
need access from the regular processors running Linux, which has and uses
page tables.

> So the NUMA idea was considered early on but was discarded as it's not really
> appropriate. You can have several CPU thread working with several GPU thread
> at the same time and they can either access disjoint memory or some share
> memory. Usual case will be few kbytes of share memory for synchronization
> btw CPU and GPU threads.

It is possible to have several threads accessing the memory in Linux. The
GPU threads run on the GPU and therefore are not a Linux issue. Where did
you see the problem?

> But when a GPU job is launch we want most of the memory it will use to be
> migrated to device memory. Issue is that the device memory is not accessible
> from the CPU (PCIE bar are too small). So there is no way to keep the memory
> mapped for the CPU. We do need to mark the memory as unaccessible to the CPU
> and then migrate it to the GPU memory.

Ok, so this is a transfer issue? Isn't this like block I/O? A write to a device?


> Now when there is a CPU page fault on some migrated memory we need to migrate
> memory back to system memory. Hence why i need to tie HMM with some core MM
> code so that on this kind of fault core kernel knows it needs to call into
> HMM which will perform housekeeping and starts migration back to system
> memory.


Sounds like a read operation, and like a major fault, if you were to use
device semantics. You write the pages to the device and then evict them
from memory (madvise can do that for you). An access then causes a page
fault which leads to a read operation from the device.

> So technicaly there is no task migration only memory migration.
>
>
> Is there something i missing inside NUMA or some NUMA work in progress that
> change NUMA sufficiently that it might somehow address the use case i am
> describing above ?

I think you need to be looking at treating GPU memory as a block device;
then you have the semantics you need.

2014-11-13 04:28:46

by Jerome Glisse

[permalink] [raw]
Subject: Re: HMM (heterogeneous memory management) v6

On Wed, Nov 12, 2014 at 05:08:47PM -0600, Christoph Lameter wrote:
> On Wed, 12 Nov 2014, Jerome Glisse wrote:
>
> > > Could we define a new NUMA node that maps memory from the GPU and
> > > then simply use the existing NUMA features to move a process over there.
> >
> > So GPU process will never run on CPU nor will they have a kernel task struct
> > associated with them. From core kernel point of view they do not exist. I
> > hope that at one point down the line the hw will allow for better integration
> > with kernel core but it's not there yet.
>
> Right. So all of this is not relevant because the GPU manages it. You only
> need access from the regular processors from Linux which has and uses Page
> tables.
>
> > So the NUMA idea was considered early on but was discarded as it's not really
> > appropriate. You can have several CPU thread working with several GPU thread
> > at the same time and they can either access disjoint memory or some share
> > memory. Usual case will be few kbytes of share memory for synchronization
> > btw CPU and GPU threads.
>
> It is possible to ahve several threads accessing the memory in Linux. The
> GPU threads run on the gpu and therefore are not a Linux issue. Where did
> you see the problem?

When they both use system memory there is no issue, but if you want to
leverage the GPU to its full potential you need to migrate memory from
system memory to GPU memory for the duration of the GPU computation (which
might be several minutes/hours or more). At the same time you do not want
CPU access to be forbidden, so if a CPU access does happen you want to catch
the CPU fault, schedule a migration of the GPU memory back to system memory,
and resume the CPU thread that faulted.

So from the CPU's point of view this GPU memory is like swap: the memory is
swapped out to GPU memory, and this is exactly how I implemented it, using a
special swap type. Refer to v1 of my patchset, where I showcase an
implementation of most of the features.
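
To give a rough idea of what "a special swap type" means here (invented
encoding, purely for illustration; this is not the v1 code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT     0x1ULL
#define SWP_TYPE_SHIFT  2
#define SWP_TYPE_MASK   0x1fULL
#define SWP_TYPE_DEVICE 0x1eULL         /* made-up "migrated to device" type */

/* A migrated page has a non-present, swap-like pte whose type field says
 * "this lives in device memory", so the regular fault path can hand the
 * fault over to HMM for migration back. */
static bool pte_is_device_swap(uint64_t pte)
{
        if (pte & PTE_PRESENT)
                return false;           /* normal, mapped page */
        return ((pte >> SWP_TYPE_SHIFT) & SWP_TYPE_MASK) == SWP_TYPE_DEVICE;
}

int main(void)
{
        uint64_t migrated = SWP_TYPE_DEVICE << SWP_TYPE_SHIFT;  /* not present */
        uint64_t mapped   = 0x1000ULL | PTE_PRESENT;

        printf("migrated -> %d, mapped -> %d\n",
               pte_is_device_swap(migrated), pte_is_device_swap(mapped));
        return 0;
}

In the fault path the check is then roughly "if the pte is such a
device-swap entry, call into HMM to migrate the range back and retry the
fault", which is the major-fault behaviour described above.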

>
> > But when a GPU job is launch we want most of the memory it will use to be
> > migrated to device memory. Issue is that the device memory is not accessible
> > from the CPU (PCIE bar are too small). So there is no way to keep the memory
> > mapped for the CPU. We do need to mark the memory as unaccessible to the CPU
> > and then migrate it to the GPU memory.
>
> Ok so this is transfer issue? Isnt this like block I/O? Write to a device?
>

It can be as slow as block I/O, but it's unlike a block device; it's closer
to NUMA in theory, because it's just about having memory close to the
compute unit (ie GPU memory in this case), but nothing else besides that
matches NUMA.

>
> > Now when there is a CPU page fault on some migrated memory we need to migrate
> > memory back to system memory. Hence why i need to tie HMM with some core MM
> > code so that on this kind of fault core kernel knows it needs to call into
> > HMM which will perform housekeeping and starts migration back to system
> > memory.
>
>
> Sounds like a read operation and like a major fault if you would use
> device semantics. You write the pages to the device and then evict them
> from memory (madvise can do that for you). An access then causes a page
> fault which leads to a read operation from the device.

Yes, it's a major-fault case, but we do not want to require any special
syscall for this: think of an existing application that links against a
library. Now you port the library to use the GPU, but the application is
ignorant of this, and thus any CPU access it does will be through the usual
mmaped range that did not go through any special syscall.

>
> > So technicaly there is no task migration only memory migration.
> >
> >
> > Is there something i missing inside NUMA or some NUMA work in progress that
> > change NUMA sufficiently that it might somehow address the use case i am
> > describing above ?
>
> I think you need to be looking at treating GPU memory as a block device
> then you have the semantics you need.

This was explored too, but a block device does not match what we want. A
block device is nice for file-backed memory: we could have a special file
backed by GPU memory, and a process would open that special file and write
to it. But this is not how we want to use this; we really do want to mirror
the process address space, ie any kind of existing CPU mapping can be used
by the GPU (except mmaped IO), and we want to be able to migrate any of
those existing CPU mappings to GPU memory while still being able to service
CPU page faults on ranges migrated to GPU memory.

So unless there is something I am completely oblivious to in the block
device model in the Linux kernel, I fail to see how it could apply to what
we want to achieve.

Cheers,
Jérôme

2014-11-13 16:11:20

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On 11/10/2014 03:22 PM, Linus Torvalds wrote:

> Rik, the fact that you acked this just makes all your other ack's be
> suspect. Did you do it just because it was from Red Hat, or do you do
> it because you like seeing Acked-by's with your name?

I acked it because I could not come up with a better idea
on how to solve this problem.

Keeping the device page tables in sync with the CPU page
tables (and sometimes different, when a page is migrated
from system DRAM to VRAM) will require either expanded
macros and generic walker functions like this, or trusting
the device driver writers to correctly copy over example
code and implement their own...

I have seen enough copied-and-slightly-modified code
in drivers to develop a dislike for the second alternative.

2014-11-13 23:50:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Mon, Nov 10, 2014 at 3:53 PM, Linus Torvalds
<[email protected]> wrote:
>
> So I am fine with that, it's the details that confuse me. The thing
> doesn't seem to be generic enough to be used for arbitrary page
> tables, with (for example) the shifts fixed by the VM page size and
> the size of the pte entry type. Also, the levels seem to be very
> inflexible, with the page table entries being the simple case, but then
> you have that "pdep" thing that seems to be just _one_ level of page
> directory.

Ok, so let me just put my money where my mouth is, and show some
example code of a tree walker that I think is actually more generic.
Sorry for the delay, I got distracted by other things, and I wanted to
write something to show what I think might be a better approach.

NOTE NOTE NOTE! I'm not saying you have to do it this way. But before
I even show the patch, let me show you the "tree descriptor" from my
stupid test-program that uses it, and hopefully that will show what
I'm really aiming for:

struct tree_walker_definition x86_64_def = {
.total_bits = 48,
.start = 0,
.end = 0x7fffffffffff,
.levels = {
{ .level_bits = 9, .lookup = pgd_lookup },
{ .level_bits = 9, .lookup = pud_lookup },
{ .level_bits = 9, .lookup = pmd_lookup },
{ .level_bits = 9, .walker = pte_walker }
}
};

so basically, the *concept* is that you can describe a real page table
by actually *describing* it. What the above does is tell you:

- the amount of bits the tables can cover (48 is four levels of 9
bits each, leaving 12 bits - 4096 bytes - for the actual pages)

- limit the range that can be walked (this isn't really all that
important, but it does, for example, mean that the walker will
fundamentally refuse to give access to the kernel mapping)

- show how the different levels work, and what their sizes are and
how you look them up or walk them, starting from the top-most.

Anyway, I think a descriptor like the above looks *understandable*. It
kind of stands on its own, even without showing the actual code.
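
Just to spell out the arithmetic of that layout, here's a toy decomposition
of an address under that descriptor (again, nothing from the patch, just
the numbers):

#include <stdio.h>

#define LEVELS     4
#define LEVEL_BITS 9
#define PAGE_BITS  12   /* 4*9 + 12 = 48 bits total */

int main(void)
{
        unsigned long long addr = 0x7f1234567000ULL;

        for (int level = 0; level < LEVELS; level++) {
                int shift = PAGE_BITS + (LEVELS - 1 - level) * LEVEL_BITS;
                unsigned idx = (addr >> shift) & ((1U << LEVEL_BITS) - 1);
                printf("level %d index: %u\n", level, idx);
        }
        printf("page offset: %llu\n", addr & ((1ULL << PAGE_BITS) - 1));
        return 0;
}

Each level consumes 9 bits of the address, top-down, and the remaining 12
bits are the offset inside the 4096-byte page.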

Now, the code to actually *walk* the above tree looks like this:

struct tree_walker walk = {
.first = 4096,
.last = 4096*512*3,
.walk = show_walk,
.hole = show_hole,
.pre_walk = show_pre_walk,
.post_walk = show_post_walk,
};

walk_tree((struct tree_entry *)pgd, &x86_64_def, &walk);

ie you use the "walk_tree()" function to walk a particular tree (in
this case it's a fake page table directory in "pgd", see the details
in the stupid test-application), giving it the tree definition and the
"walk" parameters that show what should happen for particular details
(quite often hole/pre-walk/post-walk may be NULL, my test app just
shows them being called).

Now, in addition to that, each tree description obviously needs the
functions that show how to look up the different levels ("lookup" for
moving from one level to another, and "walker" for actually walking
the last-level page table, knowing how "present" bits etc. work).
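
Purely to illustrate the shape of those two callbacks (made-up types; the
real ones are in the attached patch and may well differ):

/* opaque handle into one level of the tree */
struct tree_entry;

/* "lookup": given an upper-level entry, return the next level down,
 * or NULL if the entry is a hole. */
typedef struct tree_entry *(*lookup_fn)(struct tree_entry *entry);

/* "walker": handed the last level together with the virtual range it
 * covers, so a super-page and a run of ptes look alike to the caller. */
typedef int (*walker_fn)(struct tree_entry *table,
                         unsigned long addr, unsigned long len);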

Now, your code had a "uint64_t mask" for the present bits, which
probably works reasonably well in practice, but I really prefer to
just have that "walker" callback instead. That way the page tables can
look like *anything*, and you can walk them, without having magic
rules that there has to be a particular bit pattern that says it's
"present".

Also, my walker actually does super-pages - ie one of the mid-level
page tables could map one big area. I didn't much test it, but the
code is actually fairly straightforward, the way it's all been set up.
So it might be buggy, but it's *close*.

Now, one place we differ is on locking. I actually think that the
person who asks to walk the tree should just do the locking
themselves. You can't really walk the tree without knowing what kind
of tree it is, and so I think the caller should just do the locking.
Obviously, the tree walker itself may have some locking in the
"pre_walk/post_walk" thing and in its lookup routines, so the
description of the tree can contain some locking of its own, but I did
*not* want to make the infrastructure itself force any particular
locking strategy.

So this does something quite different from what your patch actually
did, and does that different thing very differently. It may not really
match what you are aiming for, but I'd *really* like the first
implementation of HMM that gets merged to not over-design the locking
(which I think yours did), and I want it to make *sense* (which I
don't think your patch did).

Also, please note that this *is* just an example. It has an example
user (that is just a stupid user-level toy app to show how it all is
put together), but it's not necessarily all that featureful, and it's
definitely not very tested.

But the code is actually fairly simple. But judge for yourself.

Linus


Attachments:
patch.diff (8.59 kB)

2014-11-14 01:01:36

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Thu, Nov 13, 2014 at 03:50:02PM -0800, Linus Torvalds wrote:
> +/*
> + * The 'tree_level' data only describes one particular level
> + * of the tree. The upper levels are totally invisible to the
> + * user of the tree walker, since the tree walker will walk
> + * those using the tree definitions.
> + *
> + * NOTE! "struct tree_entry" is an opaque type, and is just a
> + * used as a pointer to the particular level. You can figure
> + * out which level you are at by looking at the "tree_level",
> + * but even better is to just use different "lookup()"
> + * functions for different levels, at which point the
> + * function is inherent to the level.

Please, don't.

We will end up with the same last-level-centric code as we have now in the
mm subsystem: all code only cares about the pte. It makes implementing
variable page size support really hard and leads to a copy-paste approach.
And to the hugetlb parallel world...

It would be nice to have a tree_level description generic enough to get rid
of pte_present()/pte_dirty()/pte_* and implement generic helpers instead.

Apart from the variable page size problem, we could one day support
different CPU page table formats at runtime: PAE/non-PAE on 32-bit x86 or
LPAE/non-LPAE on ARM in one binary kernel image.

The big topic is how to get it done without significant runtime cost :-/

--
Kirill A. Shutemov

2014-11-14 01:18:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Thu, Nov 13, 2014 at 4:58 PM, Kirill A. Shutemov
<[email protected]> wrote:
> On Thu, Nov 13, 2014 at 03:50:02PM -0800, Linus Torvalds wrote:
>> +/*
>> + * The 'tree_level' data only describes one particular level
>> + * of the tree. The upper levels are totally invisible to the
>> + * user of the tree walker, since the tree walker will walk
>> + * those using the tree definitions.
>> + *
>> + * NOTE! "struct tree_entry" is an opaque type, and is just a
>> + * used as a pointer to the particular level. You can figure
>> + * out which level you are at by looking at the "tree_level",
>> + * but even better is to just use different "lookup()"
>> + * functions for different levels, at which point the
>> + * function is inherent to the level.
>
> Please, don't.
>
> We will end up with the same last-level centric code as we have now in mm
> subsystem: all code only cares about pte.

You realize that we have a name for this. It's called "reality".

> It makes implementing variable
> page size support really hard and lead to copy-paste approach. And to
> hugetlb parallel world...

No, go back and read the thing.

You're confusing two different issues: looking up the tree, and
actually walking the end result.

The "looking up different levels of the tree" absolutely _should_ use
different actors for different levels. Because the levels are not at
all guaranteed to be the same.

Sure, they often are. When you extend a tree, it's fairly reasonable
to try to make the different levels look identical. But "often" is not
at all "always".

More importantly, nobody should ever care. Because the whole *point*
of the tree walker is that the user never sees any of this. This is
purely an implementation detail of the tree itself. Somebody who just
*walks* the tree only sees the final end result.

And *that* is the "walk()" callback. Which gets the virtual address
and the length, exactly so that for a super-page you don't even really
see the difference between walking different levels (well, you do see
it, since the length will differ).

Now, I didn't actually try to make that whole thing very transparent.
In particular, somebody who just wants to see the data (and ignore as
much of the "tree" details as possible) would really want to have not
that "tree_entry", but the whole "struct tree_level *" and in
particular a way to *map* the page. I left that out entirely, because
it wasn't really central to the whole tree walking.

But thinking that the levels should look the same is fundamentally
bogus. For one, because they don't always look the same at all. For
another, because it's completely separate from the accessing of the
level data anyway.

Linus

2014-11-14 01:51:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

On Thu, Nov 13, 2014 at 5:18 PM, Linus Torvalds
<[email protected]> wrote:
>
> More importantly, nobody should ever care. Because the whole *point*
> of the tree walker is that the user never sees any of this. This is
> purely an implementation detail of the tree itself. Somebody who just
> *walks* the tree only sees the final end result.
>
> And *that* is the "walk()" callback. Which gets the virtual address
> and the length, exactly so that for a super-page you don't even really
> see the difference between walking different levels (well, you do see
> it, since the length will differ).
>
> Now, I didn't actually try to make that whole thing very transparent.

Side note: I'm not entirely sure it *can* be made entirely transparent.

Just as an example: if what you want to do is actually "access" the
data for some copying operation, then for a real CPU page table what
you want to do is to actually map the entry. And you definitely do not
want to map the entry one single page at a time - if you have a
top-level page directory entry, you'd want to map the whole page
directory entry, not the sub-pages of it. So mapping the thing is very
much level-dependent.

Fine, "just add 'map()'/'unmap()' functions to the tree description,
the same way we have lookup/walk. Yes, that would be fairly easy, but
it only works for CPU page tables. if you want to copy from device
data, what you want is more of a physical address thing that you do
DMA on, not a "map/unmap" model.

So I suspect *some* amount of per-tree knowledge is required. Or just
knowledge of what people actually want to do when walking the tree.

So don't get me wrong - I'm making excuses for not really having a
fleshed-out interface, but I'm making them because I think the
interface will either have to be tree-specific, or because we need
higher-level interfaces for what we actually want to do while walking.
That then decides where these kinds of tree differences will be
handled: will they be handled by the caller knowing that certain trees
are used in certain ways, or will they be handled by the tree walking
abstraction being explicitly extended to do certain operations? Or
will it be a bit of both?

See what I'm trying to say? There is no way to make the tree-walking
"truly generic" in the sense that you can do anything you want with
the results, because the *meaning* of the results will inevitably
depend a bit on what the trees are actually describing. Are they
describing local memory or remote memory?

Jerome had a "convert 'struct tree_entry *' to 'struct page *'"
function, but that doesn't necessarily work in the generic case
either, and is questionable with super-pages anyway (although
generally it works fairly well by just saying that they get described
by the first page in the superpage). But for actual CPU page tables,
some of the pages in those page tables may not *have* a "struct page"
associated with them at all, because they are mappings of
memory-mapped devices in high memory. So again, in a _generic_ model
that you might want to start replacing some of the actual VM code
with, you simply cannot use 'struct page' as some kind of generic
entry. At some level, the only thing you have is the actual page table
entry pointer, and the value behind it.

And it may well be ok to just say "the walker isn't generic in _that_
sense". A walker that can walk arbitrary page-table-tree-like
structures can still be useful just for the walking part, even if the
users might then always have to be aware of the final tree details. At
least they don't need to re-implement the basic iterator, they'll just
have to implement the "what do I do with the end result" for their
particular tree layout. So a walker can be generic at _just_
walking/iterating, but not necessarily at actually using the end
result.

I hope I'm explaining that logic well enough..

Linus